Jun 1

3:30 pm

## Analysis of a Strategy for Sequencing the Human Genome

### Richard Karp

Seminar

University of Washington - Departments of Computer Science and Engineering & Molecular Biotechnology

Large DNA molecules such as chromosomes are usually sequenced using a divide-and-conquer strategy based on the construction of a library of clones. Abstractly, the linear DNA molecule to be sequenced (the target) can be viewed as a long line segment, and the clones, as shorter line segments drawn at random from the long segment. A key parameter is the redundancy R, defined as the average number of clones covering a point on the target. The redundancy must be large enough to ensure that the target is completely, or nearly completely, covered by clones. A process known as physical mapping is used to determine how the clones are arranged along the target. Once this is done, a "tiling path" of clones that together cover the target with minimal overlap is constructed, and the problem is thereby reduced to sequencing each clone in the tiling path.

Recently Hood, Smith and Venter proposed a method for sequencing a large target without physical mapping. The first step is to sequence a short segment at each end of each clone. These segments are called Sequence Tagged Connectors (STC's). Then a seed clone from the library is sequenced completely and, by comparing every STC with this sequence, clones are identified that barely overlap the seed clone and overhang it to the right and left. These clones are then sequenced in turn. At a general step, STC's are used to find new clones that barely overlap the region that has been sequenced and extend it to the right and left.

A. Siegel et al have constructed a probabilistic model of this process, in order to determine its attractiveness for sequencing large genomes. They investigate the following questions: in the presence of errors in the sequencing of the STC's, will comparison of STC's with sequenced clones correctly identify a tiling path? How should the redundancy be chosen to optimize the trade-off between the cost of end sequencing and the cost of full sequencing of clones along the tiling path?

R. Shamir and the speaker have extended this analysis by considering the effects of variation in clone size and inhomogeneity in the distribution of clones along the target. Using the renewal-reward theorem we derive closed-form expressions for the optimal redundancy and the expected cost of sequencing the genome as a function of the cost of end sequencing and the cost of complete sequencing. We also determine an optimal policy for choosing the next clone to extend a tiling path when the sizes of the clones, as well as the locations of their associated STC's, are known.