SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing
Abstract
The lion's share of bacteria in various environments cannot be cloned in the laboratory and thus cannot be sequenced using existing technologies. A major goal of single-cell genomics is to complement gene-centric metagenomic data with whole-genome assemblies of uncultivated organisms. Assembly of single-cell data is challenging because of highly non-uniform read coverage as well as elevated levels of sequencing errors and chimeric reads. We describe SPAdes, a new assembler for both single-cell and standard (multicell) assembly, and demonstrate that it improves on the recently released E+V−SC assembler (specialized for single-cell data) and on popular assemblers Velvet and SoapDeNovo (for multicell data). SPAdes generates single-cell assemblies, providing information about genomes of uncultivatable bacteria that vastly exceeds what may be obtained via traditional metagenomics studies. SPAdes is available online (http://bioinf.spbau.ru/spades). It is distributed as open source software.
Footnotes
Chimeric reads are formed by concatenation of distant substrings of the genome, and chimeric read-pairs are formed by reads at a distance significantly different from the insert length, as well as by read pairs with an incorrect orientation.
This is a generic term that includes mate-pairs and paired-end reads.
Without error correction, de Bruijn graphs for low-coverage assembly projects with Sanger reads were too fragmented.
While some assemblers have built-in error correction procedures (e.g., ALLPATHS [Butler et al., 2008]), others do not (e.g., Velvet [Zerbino and Birney, 2008]).
For single-cell datasets, coupling Hammer with SPAdes resulted in an improved assembly as compared to coupling EULER error correction with SPAdes. For multicell datasets, Quake and Hammer produce similar results.
Internally, SPAdes condenses h-paths into single edges at some stages; for presentation purposes, however, we use the uncondensed de Bruijn graph.
Since multiplicity of edges in graphs arising from assembly data is often unknown, assembly algorithms do not explicitly find Eulerian cycles. However, we use the Eulerian assembly framework (Idury and Waterman, 1995; Pevzner et al., 2001) for simplicity. An alternative formulation is to consider Chinese Postman cycles.
This condition is satisfied if path(α) and path(β) are linked by a pair of reads separated by distance in the range d ± Δ.
If all paths between α and β have the same length L, SPAdes transforms the entire histogram into a single h-biedge (α|β, L). Compare with Pevzner and Tang (2001).
While SPAdes does not explicitly compute the E-transformation, it is useful for presenting the idea of the paired assembly graph; otherwise, the logic of the paired assembly graph construction may appear cryptic.
h-biedges (α|β, D) and (α|β, D′) with the same α and β but different distances correspond to different rectangles. If D and D′ are very close, the corresponding h-biedges will form parallel edges in the paired assembly graph.
After Stage 2 of SPAdes, the h-biedge distance estimate errors δ are greatly reduced as compared to the biread distance estimate errors Δ. For the vast majority of h-biedges, δ = 0. Moreover, the maximal error in distance estimate for each individual h-biedge can be bounded and these bounds may vary across various h-biedges. However, for the sake of simplicity, we describe the paired assembly graph for the case when δ is the same for all h-biedges.
IDBA incorporates each h-read (contig) for smaller k-mer sizes into a final assembly without any changes. SPAdes, in contrast, does not consider such contigs as the final truth and uses all reads at each iteration of the multisized assembly graph construction. This is important since contigs for smaller k have an elevated number of local misassemblies (usually manifested as small indels) as compared to contigs for larger k. For example, reducing vertex size from 55 to 31 (default parameter) in Velvet significantly increases the number of erroneous indels. See the Results section for IDBA benchmarking.