Some Assembly Required

June 01, 2001



Picture this: In a tiny, cluttered garage office stuffed with children's pictures and toys, an overflowing bookcase (The Idiot's Guide to Red Hat Linux, How Brains Think) and two Linux and Windows workstations, a 41-year-old doctoral candidate at the University of California at Santa Cruz is coding the program that will assemble a first draft of the human genome. He and his colleagues are motivated partially by fear: Some predict that if the international consortium he's helping doesn't speed its progress in charting the nucleotide structure of human DNA, the very material that defines us as a species could be locked up by commercial patents. In four 80-hour weeks—from May 22 to June 22, 2000—the amiable but intense Jim Kent will write most of the 10,000 lines of GigAssembler, taking time out to put ice packs on his wrists to ward off pain from the repetitive stress they must endure.

He works at this frenetic pace because a technologically endowed private venture, headed by maverick DNA researcher J. Craig Venter, is itself closing in on sequencing the three billion nucleotides that make up our genome. Though Venter disavows the accusation that he plans to patent thousands of genes, he does have something to prove. After the National Institutes of Health rejected his proposal to quickly map the genome with the "shotgun" method that he'd used to sequence the entire flu bacterium in 1995, Venter, a former NIH lab chief, founded Celera Genomics in 1998. The money and machinery behind the start-up came from Applera Corporation, parent of Applied Biosystems, which builds state-of-the-art DNA sequencing robots. "Discovery can't wait" is the Rockville, Maryland-based Celera's motto. Nor did it wait—with Celera breathing down their necks, researchers for the public Human Genome Project beat their own deadline for drafting the genome.

The first heat is over. After the histrionics had died down and a scientific publishing flap was resolved, the Human Genome Project and Celera Genomics agreed to release their results in the Feb. 15, 2001 issue of Nature and the Feb. 16, 2001 issue of Science, respectively. The story of how Jim Kent—along with computer science professor David Haussler, UCSC's coordinator for genome research—played a part in getting a once-plodding project to the finish line is one of personal motivation in the face of extremely distributed development processes. But first, a brief history.

Slipping the Schedule
The idea of tackling the human genome surfaced in 1985, starting with a brainstorming session convened by Robert Sinsheimer, then chancellor at UCSC. The next year, Nobel laureate Renato Dulbecco suggested that whole-genome sequencing could revolutionize cancer research, while Charles DeLisi, head of the office of Health and Environmental Research at the Department of Energy (DOE), proposed a crash program for meeting that goal.

By 1990, the Human Genome Project had gotten underway and was aiming for completion in 2005 at a cost of $3 billion. There were three goals: first, assign specific genes to their relative positions on the chromosomes; second, physically map the positions, in numbers of base pairs, of known genes and landmarks; and third, sequence the entire chain, using DNA from several individuals.

Early progress was rapid. Genes were identified for debilitating diseases such as muscular dystrophy, Alzheimer's and some cancers. Still, by 1998 only 3 percent of the entire human genome had been sequenced. Celera's entry into the race that year made a big splash, both in the popular press and among scientists in the emerging discipline of bioinformatics (computational biology). "It's like a private company in 1967 announcing they're going to race NASA to the moon," says Harvard professor and AIDS researcher William Haseltine.

"There is no denying that Celera was helpful in giving us the incentive to move quickly and focus our efforts," Kent told Software Development Technical Editor Roger Smith in a recent interview at his office. "On the other hand, while the organization of the public effort is quite diffuse, it's been remarkably cooperative. In some ways, it's amazing that it could possibly work, but it does."

The Human Genome Project is a truly international effort, with much of the funding coming from the Wellcome Trust, a British medical charity. The U.S. DOE's Human Genome Program and the NIH's National Human Genome Research Institute (NHGRI) oversee research in the U.S. Over half of the worldwide effort has taken place here, in various genome centers at the National Laboratories (Los Alamos, Lawrence Livermore and Lawrence Berkeley) and universities such as Baylor, MIT and Washington University in St. Louis.

Running on a Linux cluster of up to 100 Pentium III workstations, Kent's GigAssembler program analyzes data consolidated from the genome consortium's sequencing laboratories and stored in GenBank, a public DNA sequence database. Designed by the National Center for Biotechnology Information to provide the scientific community access to the most up-to-date and comprehensive DNA sequence information, GenBank contains annotated, publicly available DNA sequences. Exchanging data on a daily basis with the DNA DataBank of Japan and the European Molecular Biology Laboratory, GenBank has grown exponentially during the past few years, from two million sequence records in 1997 to approximately 11 million in February 2001. While large sets are sometimes still submitted by tape, scientists can now use a Web tool, BankIt, to submit simple sequences.

"It's a unique situation, in my experience," Kent says. "No one really has any authority over anyone else. The only way we can proceed is by consensus. Yet, at the same time, it's been going quite quickly, largely because our interests are so aligned—we all very strongly want to do the same thing."

Too Many CDs
After earning his bachelor's and master's degrees in mathematics at UCSC in 1983, Kent began writing graphic and animation programs for Amiga and Atari personal computers. He shifted to IBM PCs with the advent of VGA cards, writing the Animator program for Autodesk Inc. The software sold well, financing more academic pursuits.

"Around 1996, I got bored. It seemed to me that my job for the last three years was doing the same thing on a new variant of the Microsoft operating system. It was a shakeout period before they settled on the DirectX [graphics] standard, and their APIs were changing so fast, especially their graphic APIs: two or three major graphic APIs per operating system. I was fed up with it. The last straw was when the developers' kit for Windows 95 came out on 12 CDs," Kent remarks. "The entire human genome fits on one CD. You can't tell me it [software] needs to be that complicated."

Back in school, Kent began analyzing the DNA of C. elegans, the one-millimeter-long roundworm much studied by biologists. In December 1999, he was tapped by Haussler—himself recruited by Eric Lander, director of the Genome Center at MIT's Whitehead Institute—to help analyze the genetic blueprint of a slightly more complicated organism: H. sapiens.

How Sequencing Works
At the core of each living cell, be it human, rabbit or worm, are developmental blueprints—a complete gene complement—written in deoxyribonucleic acid (DNA) molecules. DNA is a twisted ladder structure built of nucleotides: "base pairs" of adenine (A) bonded to thymine (T) or cytosine (C) bonded to guanine (G). A strand of these bases—half of the helix—is abbreviated like this:

ATTCGAGCTCGGTACCTTTTCCTGCCATG

The chain of nucleotides that comprises the genome of the H. influenzae bacterium is 1.8 million base pairs long; those of the fruit fly and the human run 120 million and roughly three billion base pairs, respectively. At the root of the sequencing problem is the fact that current technology can read only about 500 nucleotides at a time. The most common method of sequencing is based on Frederick Sanger's Nobel prize-winning technique, developed in 1977: enzymes synthesize DNA chains of varying lengths in four different reactions, replication is stopped at positions occupied by one of the four bases, and the resulting fragment lengths are then determined.

Unpolymerized As, Cs, Ts and Gs are incubated with single-stranded DNA and DNA polymerase. A small fraction of a given base is chemically modified so that, if incorporated, it stops the DNA polymer from growing. Since millions of chains are synthesized in this reaction, you end up with a population of DNA chains that stop at each instance of that given base. Thus, if the template is:

GCAATCAGTACCACTA

you end up with the following chains:

GCA
GCAA
GCAATCA
GCAATCAGTA
GCAATCAGTACCA
GCAATCAGTACCACTA
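
In code, the reaction is easy to mimic. The short Python sketch below is purely illustrative (the function name and inputs are invented for this example): given a template strand and a terminating base, it enumerates every prefix chain the reaction would produce.

def termination_chains(template: str, base: str) -> list[str]:
    # Replication stops at each occurrence of the terminating base,
    # so the reaction yields one prefix per occurrence.
    return [template[:i + 1] for i, b in enumerate(template) if b == base]

for chain in termination_chains("GCAATCAGTACCACTA", "A"):
    print(chain)

Run against the template above, it prints the six chains just listed, shortest first.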

Figure 1. Fluorescent Trace-Data From Ensembl Trace Server
Each base is tagged with a specific fluorescent dye, allowing the sequencing machine to read the order of bases from the sorted fragments.

Similar reactions are performed with the other three bases. A dedicated sequencing machine, such as the $300,000 Applied Biosystems PRISM 3700 DNA Analyzer, automates the initial reaction and analysis. It uses gel electrophoresis—placing the fragments in charged polymeric goo and watching them migrate depending on their length—to sort the chains. Each base is tagged with a specific fluorescent dye, allowing the DNA sequence to be read by the machine, as illustrated in Figure 1. Now comes the hardest part: putting millions of pieces back together in the right order.

Deconstructing Mary
The private and public human genome projects started out with two fundamentally different approaches to assembling small, accurately read sequences into the larger draft map of the genome. Taking advantage of one of the largest civilian computers in the world, an $80-million Compaq supercomputer with four terabytes of memory, Celera Genomics used Venter's "whole-genome shotgun sequencing," a method that blasts the DNA randomly into thousands of partially overlapping fragments, then uses the overlaps to put the genome back together, somewhat akin to a giant jigsaw puzzle. Kent offers the following "nursery rhyme" analogy to explain the differences between Venter's shotgun approach and the public consortium's hierarchical method.

In assembling the overlapping pieces, sometimes placement is uncertain, or pieces don't fit. One of the most challenging engineering problems, according to Kent, is coping with the numerous similar DNA sequences, or "repeats"—recurring sets of letters in the human genome.

The hierarchical approach breaks the repeats into big pieces, which are sequenced separately and then combined; that is,

maryhadalittlelamblittlelamblittlelamb

might be sequenced and assembled into one piece, and

maryhadalittlelambwhosefleecewaswhiteassnow

assembled into another; and only afterward would the two big pieces be combined. The "paired read" shotgun approach comes at the problem in a different fashion. Instead of completely sequencing a 4,000-base-long assembly, say, you sequence only the shorter, 500-base ends and record the distance between them:

maryhadali--------cewaswhite

Because the distance between the two ends is known, these "paired read" markers help you both perform the original assembly and check the quality of a more finished assembly.
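
A paired read, in software terms, is just two short strings plus a recorded separation. Here is a minimal, hypothetical sketch of that quality check (names and tolerances are invented): locate both ends in the assembled sequence and compare the observed gap against the recorded distance.

def check_pair(assembly: str, left: str, right: str,
               expected_gap: int, tolerance: int = 2) -> bool:
    # Find each end; a pair whose ends are missing, or whose
    # separation is far from the recorded distance, flags a misassembly.
    i = assembly.find(left)
    j = assembly.find(right)
    if i < 0 or j < 0:
        return False
    gap = j - (i + len(left))  # bases between the two ends
    return abs(gap - expected_gap) <= tolerance

assembly = "maryhadalittlelambwhosefleecewaswhiteassnow"
print(check_pair(assembly, "maryhadali", "cewaswhite", expected_gap=17))

In the nursery-rhyme assembly the two ends sit 17 bases apart, so the check passes; in real data the ends would be about 500 bases each and the gap about 3,000.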

Assembly Required
The two methods are complementary, Kent points out. "Toward the end of the project, in March of 2000, Celera used parts of the hierarchical approach in a divide-and-conquer effort to make the problem smaller, while we used paired reads, as well." With either approach, the alignment problem (locating overlaps in the smaller DNA sequences) is computationally intensive. "At one point, Celera had a pile of 50 million reads. I had, at most, 400,000 longer pieces because, basically, I was doing the second step of a hierarchical assembly.

"Computationally, the alignment problem is this," Kent explains. "You have two strings. How can we put them together so that the most bases will match? On our [Linux] cluster, we spend three days doing alignments and maybe two hours doing the assembly on the alignments."

Similarly, Celera's process of comparing every read against every other read in search of complete end-to-end overlaps of at least 40 base pairs (and with no more than 6 percent difference in the match) took 10,000 CPU hours, or about five days, running on a suite of 40 four-processor Alpha SMPs with 4 gigabytes of RAM each. Since the chemical procedures used to sequence DNA aren't perfect (the error rate is about 5 percent), it takes more than simple string comparisons to find two overlapping pieces of DNA. A procedure that works well for finding these overlaps is to build up an index that indicates where every 12-mer (12 letters, or 24 bits of data) in the DNA database is located.

Once the index is built, multiple query sequences can be quickly located within the original database. Since the average query sequence is about 500 bases long, there are typically at least 15 or 20 12-mers inside of such a sequence, provided there is no error. Therefore, a program can use the index to quickly look in both the query and the database sequence for neighboring clusters of 12-mers. In a matter of milliseconds, this clustering reduces the problem of locating a 500-base query in a 3 billion-base database to that of locating a 500-base query in perhaps a 1000-base window of the database. More leisurely algorithms can then work out the alignment details inside of the smaller window, also in a matter of milliseconds.
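
A toy version of that index fits in a few lines of Python. This is a hypothetical sketch of the seed-and-cluster idea, not GigAssembler's actual code: a dictionary stands in for the real index, and each shared 12-mer votes for the database offset it implies.

from collections import defaultdict

K = 12  # a 12-mer: 12 letters at 2 bits per base = 24 bits

def build_index(database: str) -> dict:
    # Map every 12-mer in the database to the positions where it occurs.
    index = defaultdict(list)
    for i in range(len(database) - K + 1):
        index[database[i:i + K]].append(i)
    return index

def candidate_windows(query: str, index: dict, min_votes: int = 3) -> list:
    # Each 12-mer shared by query and database implies a database
    # offset for the query's start; clusters of agreeing offsets
    # mark the small windows worth aligning in detail.
    votes = defaultdict(int)
    for i in range(len(query) - K + 1):
        for pos in index.get(query[i:i + K], []):
            votes[pos - i] += 1
    return sorted((o for o, n in votes.items() if n >= min_votes),
                  key=lambda o: -votes[o])

Each candidate offset narrows the search from the full three-billion-base database to a window around a single position, where a slower alignment algorithm can finish the job.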

Asked if the public consortium's data is of poor quality—a criticism leveled by Venter—Kent responds: "Well, a third party needs to judge. Celera may be biased because of their financial involvement. We're using the same [sequencing] machines as they are. Our reads are as accurate as theirs. It's a question of the assembly.

"We took the hierarchical approach, which has some advantages. You can give out different pieces to different people to work on. This means some pieces are finished earlier than other pieces. At this point in the project, approximately one-third of the entire genome is completely finished—all the gaps have been removed, processed by humans, run through all kinds of quality checks. What we are talking about is the unfinished areas of the draft where our typical scaffold, or ordered piece size, is 10X [10 times the size of an individual gene], while with Celera, the typical scaffold size is 100X, or an order of magnitude larger."

Patenting a Discovery?
This year, Kent's concern that private efforts would force scientists to go through the U.S. Patent and Trademark Office in order to work on the assembled human genome was alleviated by the consortium's publication.

"Until quite recently, the level of effort required to get a gene patent was quite low," he says. "You could spend two to three months in the lab generating the sequence and all of two hours on a computer analyzing it, and you might end up with the raw material for 500 to 1,000 patents." The patent office tightened up the rules last January, after complaints from leaders of the public consortium, including current NHGRI Director Francis Collins and James Watson, first director of the Human Genome Project and best-known for discovering, with Francis Crick, the double-helix structure of DNA. "But," Kent maintains, "I'm still not sure the bar is high enough."

The mapping of the genome is comparable to the construction of the periodic table of the elements in the 19th century, after which chemists spent the next 100 years filling in the holes.

"Say you had purified a new metal—found tungsten, perhaps—people [are being given the right to] patent anything that involves tungsten: making a foil, conducting electricity, conducting heat—anything that follows from any property of the metal," Kent points out. Next to the periodic table, mapping the human genome is a vast undertaking; instead of a hundred or so elements, it entails 30,000 to 50,000 genes.

Stone Soup
In addition to the GigAssembler program, Kent created another tool called the Human Genome Browser (http://genome.ucsc.edu/goldenPath/hgTracks.html), which gives Web users a quick display of various portions of the genome at different scales, along with more than two dozen tracks of information (genes, assembly gaps, chromosomal bands and so on) associated with the completed human genome sequence. As Kent explains, the browser he wrote follows an open, "stone soup" development process, with contributions to the kettle from the Ensembl group in England, Genoscope in France, Baylor College of Medicine in Houston, Washington University in St. Louis, University of Washington in Seattle, the International SNP Consortium, Affymetrix Inc., Softberry Inc. and NCBI.

"There are about 20 or 25 data tracks in the browser that can be turned on or off now. And since I was responsible for putting in four or five of the tracks, I might as well call these the carrots," Kent says, pointing at the screen, "since carrots are traditional incentives."

Celera, too, has created software for surfing the genome: a "Discovery System" that allows subscribers to use its databases, non-proprietary genome and biological datasets, computational tools and supercomputing power.

"The model is, you don't own the data; it's not secret information," said Venter in an April 2000 television interview on the Public Broadcasting Service's Newshour. "It's such a large data set that [the model is] making it useful, making it interpretable, making it so that pharmaceutical companies, scientists, universities and government can use the human genetic code and understand what it means, how to come up with new treatments for disease.

"Just printing the Drosophila genome in very tiny print covering the entire sheet of paper is a stack of paper about 5.5 feet tall. That's the genetic code. That's just the As, Cs, Gs and Ts. That's not the interpretation. That's not the linkage out to tens of thousands of scientific articles in the literature, to disease information."

In January, Venter's company signed a multi-year subscriber agreement with the University of California system allowing UC investigators access to all of Celera's database products—an agreement that Venter calls "personally very gratifying." Venter's own 1975 Ph.D. in physiology and pharmacology comes from UC San Diego.

And, in another sign that the rivalry has begun to subside, Celera recently received $21 million from the NIH—part of a two-year, $58 million grant package split between Celera and Baylor College of Medicine in Houston—to sequence the rat genome.

The Genomic Economy
It's clear now: Our genes are but a small portion of the nucleotide chains coiled within our cells. They are isolated signposts separated by miles of base pairs—transposons and other "dark matter"—that seem to lead nowhere. In a convergence of robotics, parallel computing power, algorithmic optimization and open source collaboration, a generation of software engineers will find employment in the computationally intensive field of genomics, trying to unravel the deepening mystery of why human cells contain so much apparently useless code. It should be a nice living—as long as they ice their wrists.

Field of Dreams
How a crack team of scientists built the software that sequenced the human genome

While relations between the public and private groups sequencing the human genome had been amicable for most of 1999, by the fall of 2000, discord about the quality and accessibility of each other's data threatened to make for a messy and awkward culmination as the project to put together a draft sequence of the human genome wound down. In an end-game scenario brokered by Ari Patrinos, director of the U.S. Department of Energy's Human Genome Program, the two groups agreed to a coordinated publication date for their respective working drafts of the human genome sequence. The Feb. 16, 2001 issue of Science focused on Celera Genomics' version, while the Feb. 15, 2001 edition of Nature revealed the sequence generated by the publicly sponsored Human Genome Project. And on February 12, a press conference was held in Washington, D.C. to discuss these landmark publications.



We caught up with Gene Myers, Ph.D., vice president of informatics research for Rockville, Maryland-based Celera Genomics, in February at the San Francisco stop of his company's national tour promoting the completion of Celera's first draft of the human genome. "I'm the guy who ran the team that put all the pieces together," boasted Myers.

SD: How closely did you work with those on the public project? In a nutshell, what's the difference between the two?

Myers: We worked together a lot. I mean, I know those guys. In fact, [University of California, Santa Cruz professor] David Haussler and I went to school together.
The difference is that we solved a 40 million-piece jigsaw puzzle. Their technique was based on breaking the data down into smaller units, assembling those and trying to figure out how to put those together.

SD: How important was the algorithmic approach?

Myers: For Celera's effort, it was critical. We broke the genome into pieces that were 10,000 letters long, and we sequenced the ends of them, so we determined about 500 letters at each end. So we didn't have 40 million unrelated reads; we had 40 million reads in pairs—a left read and a right read of each piece. We knew approximately how far apart they were supposed to be. So we had 20 million randomly selected pairs, where the reads in each pair were at known distances from each other.

SD: How did you put them together?

Myers: We built an assembler that used this base pair information as the driving principle for organizing the data. The assembly strategy evolved through time. The idea of creating pairs evolved in 1990. [Celera founder and CEO J. Craig] Venter's group was the first, in 1995, to do a complete genome via whole-genome "shotgun" approach, and they started to use this idea of pairs. Then Jim Weber and I proposed, in 1996, that you could do the human genome this way, using the protocol that we used at Celera.

We built an assembler that's half a million lines of code. The algorithms were novel. This technology was completely developed at Celera over the last two years.

One of the strategic advantages of using the whole-genome shotgun approach is, it's basically a classic trade-off. The data is homogenous: pairs of reads. So we could highly optimize our factory for that process. Did almost everything with robotics, which is why we could do it on the cheap and so fast. We had 300 sequencing machines and robots that do all of the chemical reactions. We used a server farm to do computes. We used about 20,000 CPU hours on that server farm, 32 gigabytes of memory. With the original prototype, we would have required a 600-gigabyte memory. So one of the things we did on the software development is compress the memory by a factor of 30.

It was a crash program. We had to figure out how we were going to do this; decide our software standards. We did that in a year, for the initial prototype, and then we worked on it for another year. We didn't change much of the logic, but we highly optimized the code. We increased speed by a factor of 5 and reduced memory by a factor of 30.

SD: Describe the average software engineer on your project.

Myers: There are no average software engineers on my project.

SD: How did you compose your team? Did you teach developers the biology or biologists the programming techniques?

Myers: I picked a mixed team: a physicist, a mathematician, two software engineers and three or four computer scientists. I needed someone who understood the statistical issues. The physicist understood the large-scale computing issues. The software engineers made sure the thing actually held together. The computer scientist algorithmists did the design, the combinatorics.

I know everyone always says their team is exceptional, but you have to realize that all these people have Ph.D.s, plus 10 years of experience. It really was a very high-powered team.

I have a group of 30 now, but there were basically 10 people who were instrumental in the beginning. I didn't do this thing alone. That team really excelled. There was a real sense of mission. We're doing something everyone says we can't do. I've never seen a group work that hard. There was real bonding. We developed all sorts of traditions; it was a fun time.

SD: What were the hours?

Myers: We worked pretty hard. I was working 12 to 14 hours every day, seven days a week for the first year and a half. Most of my people were working that hard too, but I didn't ask them to.

I didn't really view it as a competition [with the public Human Genome Project]. I wanted to prove we could do it. I was happy at the time, as a full professor of computer science at the University of Arizona in Tucson; I'd been there for 18 years. I hadn't done too much supercomputing.

SD: Having reached this milestone, is there another one that will keep your team operating at the same intensity?

Myers: We're relaxing a little bit, but not much. My group's now 30 people, and the ones who are fresh are working on the new agenda. The core assembly team is taking a break. We're working on proteomics now. It's just the beginning—we've got to figure out what it means now. And that's frankly harder.

SD: Dr. Venter has said you need to build a supercomputer for analysis.

Myers: We're going to need the cycles, ultimately. We don't just need petaFLOPS [computing equal to a million billion floating-point operations per second]; we need petabyte [1,000 terabytes, or approximately 10^15 bytes] connections, as well. We have 1.5 teraFLOPS of capacity now. We're going to need the cycles for interpreting the genome, not sequencing it. The human genome is probably about as big as we're going to go, and we did that at 20,000 CPU hours, so we're inside the envelope. We're going to need it for the massive amounts of interpretation of the information we're going to be producing.

—Alexandra Weber Morales

 

A Brave New World
A team of UCSC faculty members and graduate students is unraveling the mysteries of our genetic make-up

Known for introducing hidden Markov modeling methods to the emerging field of computational biology or bioinformatics, UCSC computer science professor David Haussler is responsible for coordinating UCSC's human genome project. He is also part of a UCSC committee designing an interdisciplinary curriculum for undergraduate, master's and doctorate degrees in bioinformatics at the university.

"There were two remarkable things about this project from a computer science standpoint," Haussler recounts to Technical Editor Roger Smith during an interview on April 2, 2001.

"One, it was an ill-defined problem up until the last moment. It wasn't like we had all these jigsaw pieces dumped out on the table. Early on, we didn't know what the data was going to look like: 90 percent of the data for the human genome project was generated in a 15-month period in an increasing ramp-up. We had very little data in January, February and March of last year.

"Second, there was a diffuse set of genome groups providing different data sets. [Once the data started coming in,] Jim was dealing with heterogeneous data sources of uneven coverage and quality. Computationally, his algorithm is a heuristic that looks for the most efficient solution—it uses the best information first. It isn't optimized for searching all possibilities."

Asked how complete the UCSC draft assembly of the genome is (compared to the 2003-targeted finished assembly), Haussler says, "You have to have some threshold for your draft, of course. Ours isn't a complete assembly; we haven't completed the entire human genome. But it's night and day compared to the way it was before. Before, you had fragmented pieces everywhere. Kent's work yielded the first, single assembly covering most of the genome. It was possible to say 'This is a genome-wide assembly.'"

Noting that a previous chancellor of UCSC, Dr. Robert Sinsheimer, held the first meeting in 1985 to explore the idea of sequencing the human genome, Haussler lavishes praise on the current chancellor, Dr. Marcia Greenwood, and the former Dean of Engineering, Dr. Pat Mantey, for providing in short order the $500,000 needed to buy the 100 Pentium III computers in UCSC's Linux cluster to run Kent's GigAssembler program. "There wasn't enough time to go through the NIH grant procedures," Haussler says. "The calculation of CPU cycles was a back-of-the-envelope calculation that turned out to be pretty close. Luckily, we're a small university with the kind of personal contacts and flexibility necessary to pull off something like this."

Haussler compares the work of sequencing the human genome to exploring a new continent or a new world. "The one active assembly creates a new coordinate system by which we can now understand the genome. We (Rogic, Furey and others on the UCSC project) are now matching this data to all the other previous coordinate systems associated with the old world—the protocols, markers, details, maps; the huge amount of clinical data gleaned from looking through a microscope—that previously dominated genetics."

—Roger Smith

