CAP3: A DNA Sequence Assembly Program
Abstract
We describe the third generation of the CAP sequence assembly program. The CAP3 program includes a number of improvements and new features. The program has a capability to clip 5′ and 3′ low-quality regions of reads. It uses base quality values in computation of overlaps between reads, construction of multiple sequence alignments of reads, and generation of consensus sequences. The program also uses forward–reverse constraints to correct assembly errors and link contigs. Results of CAP3 on four BAC data sets are presented. The performance of CAP3 was compared with that of PHRAP on a number of BAC data sets. PHRAP often produces longer contigs than CAP3 whereas CAP3 often produces fewer errors in consensus sequences than PHRAP. It is easier to construct scaffolds with CAP3 than with PHRAP on low-pass data with forward–reverse constraints.
The shotgun sequencing strategy has been used widely in genome sequencing projects. A major phase in this strategy is to assemble short reads into long sequences. A number of DNA sequence assembly programs have been developed (Staden 1980; Peltola et al. 1984; Huang 1992; Smith et al. 1993; Gleizes and Henaut 1994; Lawrence et al. 1994; Kececioglu and Myers 1995; Sutton et al. 1995; Green 1996). The FAKII program provides a library of routines for each phase of the assembly process (Larson et al. 1996). The GAP4 program has a number of useful interactive features (Bonfield et al. 1995). The PHRAP program clips 5′ and 3′ low-quality regions of reads and uses base quality values in evaluation of overlaps and generation of contig sequences (Green 1996). TIGR Assembler has been used in a number of megabase microbial genome projects (Sutton et al. 1995). Continued development and improvement of sequence assembly programs are required to meet the challenges of the human, mouse, and maize genome projects.
We have developed the third generation of the CAP sequence assembly program (Huang 1992). The CAP3 program includes a number of improvements and new features. A capability to clip 5′ and 3′ low-quality regions of reads is included in the CAP3 program. Base quality values produced by PHRED (Ewing et al. 1998) are used in computation of overlaps between reads, construction of multiple sequence alignments of reads, and generation of consensus sequences. Efficient algorithms are employed to identify and compute overlaps between reads. Forward–reverse constraints are used to correct assembly errors and link contigs. Results of CAP3 on four BAC data sets are presented. The performance of CAP3 was compared with that of PHRAP on a number of BAC data sets. PHRAP often produces longer contigs than CAP3 whereas CAP3 often produces fewer errors in consensus sequences than PHRAP. It is easier to construct scaffolds with CAP3 than with PHRAP on low-pass data with forward–reverse constraints.
An unusual feature of CAP3 is the use of forward–reverse constraints in the construction of contigs. A forward–reverse constraint is often produced by sequencing of both ends of a subclone. A forward–reverse constraint specifies that the two reads should be on the opposite strands of the DNA molecule within a specified range of distance. By sequencing both ends of each subclone, a large number of forward–reverse constraints are produced for a cosmid or BAC data set. A difficulty with use of forward–reverse constraints in assembly is that some of the forward–reverse constraints are incorrect because of errors in lane tracking and cloning. Our strategy for dealing with this difficulty is based on the observation that a majority of the constraints are correct and wrong constraints usually occur randomly. Thus, a few unsatisfied constraints in a contig may not be sufficient to indicate an assembly error in the contig. However, if a sufficient number of constraints are all inconsistent with a join in a contig and all support an alternative join, it is likely that the current join is an error, and the alternative join should be made.
Acknowledgments
We thank Kathryn Beal for incorporating CAP3 in GAP4, Rachel Dickhoff for help with low-pass data, Lee Hood and Rodger Staden for suggesting use of forward–reverse constraints, Jun Qian for producing CAP3 output in ace format, Bruce Roe, Lee Rowen, and Granger Sutton for discussions and data sets, Shiaw-Pyng Yang for comparing CAP3 and PHRAP, and the referees for constructive suggestions. This project was supported in part by National Institutes of Health grant R01 HG01502-03 from the National Human Genome Research Institute.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.
Footnotes
E-MAIL ude.utm@gnauh; FAX (906) 487-2283.

