The Sequence Alignment/Map format and SAMtools
Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant calling and alignment viewing, and thus provides universal tools for processing read alignments.
With the advent of novel sequencing technologies such as Illumina/Solexa, AB/SOLiD and Roche/454 (Mardis, 2008), a variety of new alignment tools (Langmead et al., 2009; Li et al., 2008) have been designed to realize efficient read mapping against large reference sequences, including the human genome. These tools generate alignments in different formats, however, complicating downstream processing. A common alignment format that supports all sequence types and aligners creates a well-defined interface between alignment and downstream analyses, including variant detection, genotyping and assembly.
The Sequence Alignment/Map (SAM) format is designed to achieve this goal. It supports single- and paired-end reads and combining reads of different types, including color space reads from AB/SOLiD. It is designed to scale to alignment sets of 10^11 or more base pairs, which is typical for the deep resequencing of one human individual.
In this article, we present an overview of the SAM format and briefly introduce the companion SAMtools software package. A detailed format specification and the complete documentation of SAMtools are available at the SAMtools web site.
2.1 The SAM format
2.1.1 Overview of the SAM format
The SAM format consists of one header section and one alignment section. The lines in the header section start with character ‘
In SAM, each alignment line has 11 mandatory fields and a variable number of optional fields. The mandatory fields are briefly described in Table 1. They must be present but their value can be a ‘
2.1.2 Extended CIGAR
The standard CIGAR description of pairwise alignment defines three operations: ‘
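As an illustration of how an extended CIGAR string can be consumed, the following minimal Python sketch splits a CIGAR into (length, operation) pairs and derives how many reference and query bases the alignment covers. It assumes the operation letters M, I, D, N, S, H and P from the SAM specification; the function names `parse_cigar`, `reference_span` and `query_length` are hypothetical and not part of SAMtools.

```python
import re

# One CIGAR element: a run length followed by an operation letter.
_CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP])")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    if cigar == "*":  # '*' means the CIGAR is unavailable
        return []
    pairs = _CIGAR_ELEMENT.findall(cigar)
    # Reject strings containing anything other than well-formed elements.
    if "".join(n + op for n, op in pairs) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(n), op) for n, op in pairs]


def reference_span(ops):
    """Reference bases covered: M, D and N advance along the reference."""
    return sum(n for n, op in ops if op in "MDN")


def query_length(ops):
    """Bases present in SEQ: M, I and S advance along the query."""
    return sum(n for n, op in ops if op in "MIS")


ops = parse_cigar("8M2I4M1D3M")
print(ops, reference_span(ops), query_length(ops))
```

Hard clipping (H) and padding (P) advance neither coordinate, which is why they appear in neither sum.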
2.1.3 Binary Alignment/Map format
To improve performance, we designed a companion format, Binary Alignment/Map (BAM), which is the binary representation of SAM and keeps exactly the same information as SAM. BAM is compressed by the BGZF library, a generic library developed by us to achieve fast random access in a zlib-compatible compressed file. An example alignment of 112 Gbp of Illumina GA data requires 116 GB of disk space (about 1.0 byte per input base), including sequences, base qualities and all the meta information generated by MAQ. Most of this space is used to store the base qualities.
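The idea of random access in a zlib-compatible compressed file can be sketched in Python as follows. This is not the BGZF implementation: it only illustrates the general technique of compressing data in independent gzip blocks, so that the whole file is still a valid gzip stream while a single block can be decompressed on its own. The block size, the separate in-memory index and the function names `compress_blocks` and `read_range` are assumptions made for this example.

```python
import bisect
import gzip

BLOCK_SIZE = 64 * 1024  # hypothetical uncompressed block size


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress data as concatenated, independent gzip members.

    Returns the compressed bytes and an index of
    (uncompressed_offset, compressed_offset) pairs, one per block.
    """
    out = bytearray()
    index = []
    for start in range(0, len(data), block_size):
        index.append((start, len(out)))
        out += gzip.compress(data[start:start + block_size])
    return bytes(out), index


def read_range(blob, index, offset, length):
    """Return data[offset:offset+length], decompressing only the needed blocks."""
    if not index or length <= 0:
        return b""
    starts = [u for u, _ in index]
    i = bisect.bisect_right(starts, offset) - 1  # block containing offset
    skip = offset - starts[i]
    result = bytearray()
    while i < len(index) and len(result) < skip + length:
        c_start = index[i][1]
        c_end = index[i + 1][1] if i + 1 < len(index) else len(blob)
        result += gzip.decompress(blob[c_start:c_end])
        i += 1
    return bytes(result[skip:skip + length])


data = bytes(range(256)) * 1000
blob, index = compress_blocks(data)
assert gzip.decompress(blob) == data  # still readable as one gzip stream
assert read_range(blob, index, 70000, 10) == data[70000:70010]
```

Smaller blocks make retrieval cheaper but compress less well, so the block size is a trade-off between the two.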
2.1.4 Sorting and indexing
A SAM/BAM file can be unsorted, but sorting by coordinate is used to streamline data processing and to avoid loading extra alignments into memory. A position-sorted BAM file can be indexed. We combine the UCSC binning scheme (Kent et al., 2002) and simple linear indexing to achieve fast random retrieval of alignments overlapping a specified chromosomal region. In most cases, only one seek call is needed to retrieve alignments in a region.
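The binning part of the index can be sketched as below. The bin numbering follows the scheme described in the SAM specification (a five-level hierarchy with the smallest bins spanning 16 kbp); the sketch assumes 0-based, half-open coordinates and reference sequences shorter than 2^29 bp, and it omits the linear index entirely.

```python
def reg2bin(beg, end):
    """Smallest bin that fully contains the interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)  # 16 kbp bins
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)  # 128 kbp bins
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)   # 1 Mbp bins
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)   # 8 Mbp bins
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)   # 64 Mbp bins
    return 0                                       # whole-sequence bin


def reg2bins(beg, end):
    """All bins that may hold alignments overlapping [beg, end)."""
    end -= 1
    bins = [0]
    for shift, first in ((26, 1), (23, 9), (20, 73), (17, 585), (14, 4681)):
        bins.extend(range(first + (beg >> shift), first + (end >> shift) + 1))
    return bins


print(reg2bin(0, 1), reg2bins(0, 1))
```

An alignment is stored under the bin returned by `reg2bin`; a region query then only has to inspect the bins listed by `reg2bins`.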
2.2 SAMtools software package
SAMtools is a library and software package for parsing and manipulating alignments in the SAM/BAM format. It is able to convert from other alignment formats, sort and merge alignments, remove PCR duplicates, generate per-position information in the pileup format (Fig. 1c), call SNPs and short indel variants, and show alignments in a text-based viewer. For the example alignment of 112 Gbp Illumina GA data, SAMtools took about 10 h to convert from the MAQ format and 40 min to index with <30 MB memory. Conversion is slower mainly because compression with zlib is slower than decompression. External sorting writes temporary BAM files and would typically be twice as slow as conversion.
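To make the notion of per-position information concrete, here is a toy Python sketch that collects, for each reference position, the read bases aligned to it. It is far simpler than the SAMtools pileup format and is not how SAMtools is implemented: the input tuples, the use of '*' for a deleted base and the function name `pileup` are assumptions made for this example.

```python
import re
from collections import defaultdict


def pileup(alignments):
    """Collect read bases per reference position.

    alignments: iterable of (pos, cigar, seq) with a 1-based leftmost pos.
    Returns {reference_position: [bases]}; a deleted base is shown as '*'.
    """
    columns = defaultdict(list)
    for pos, cigar, seq in alignments:
        ref, qry = pos, 0
        for n, op in re.findall(r"(\d+)([MIDNSHP])", cigar):
            n = int(n)
            if op == "M":    # aligned bases: advance reference and query
                for k in range(n):
                    columns[ref + k].append(seq[qry + k])
                ref += n
                qry += n
            elif op in "IS":  # insertion or soft clip: advance query only
                qry += n
            elif op == "D":   # deletion: reference bases with no read base
                for k in range(n):
                    columns[ref + k].append("*")
                ref += n
            elif op == "N":   # skipped reference region: advance reference only
                ref += n
            # H and P advance neither coordinate
    return dict(sorted(columns.items()))


reads = [(1, "3M", "ACG"), (2, "1M1D1M", "CT")]
for position, bases in pileup(reads).items():
    print(position, "".join(bases))
```

With coordinate-sorted input, such columns could be emitted as soon as the next alignment starts beyond them, which is what makes stream-based processing possible.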
SAMtools has two separate implementations, one in C and the other in Java, with slightly different functionality.
Table 1. Mandatory fields in the SAM format
| No. | Name | Description |
|-----|------|-------------|
| 1 | QNAME | Query NAME of the read or the read pair |
| 2 | FLAG | Bitwise FLAG (pairing, strand, mate strand, etc.) |
| 3 | RNAME | Reference sequence NAME |
| 4 | POS | 1-Based leftmost POSition of clipped alignment |
| 5 | MAPQ | MAPping Quality (Phred-scaled) |
| 6 | CIGAR | Extended CIGAR string (operations: |
| 7 | MRNM | Mate Reference NaMe (‘=’ if same as |
| 8 | MPOS | 1-Based leftmost Mate POSition |
| 9 | ISIZE | Inferred Insert SIZE |
| 10 | SEQ | Query SEQuence on the same strand as the reference |
| 11 | QUAL | Query QUALity (ASCII-33=Phred base quality) |
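The mandatory fields in Table 1 map naturally onto a small record type. The Python sketch below parses one alignment line, assuming TAB-delimited fields as in the SAM specification, and decodes QUAL using the ASCII-33 convention from the table. The record type `SamRecord`, the helper names and the example line are hypothetical; the strand bit 0x10 is taken from the SAM specification.

```python
from typing import NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    mrnm: str
    mpos: int
    isize: int
    seq: str
    qual: str
    optional: Tuple[str, ...]


def parse_sam_line(line):
    """Parse one TAB-delimited alignment line into a SamRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tuple(fields[11:]),
    )


def phred_qualities(qual):
    """Convert a QUAL string to Phred scores (ASCII code minus 33)."""
    if qual == "*":  # qualities not stored
        return []
    return [ord(c) - 33 for c in qual]


def is_reverse_strand(flag):
    """True if the 0x10 bit (query on the reverse strand) is set."""
    return bool(flag & 0x10)


# Hypothetical alignment line, for illustration only.
line = "read1\t163\tchr1\t7\t30\t4M\t=\t37\t39\tACGT\tIIII\tNM:i:0"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, phred_qualities(rec.qual), is_reverse_strand(rec.flag))
```

Optional fields are kept as raw strings here; parsing their contents is left out to keep the sketch short.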
We designed and implemented a generic alignment format, SAM, which is simple to work with and flexible enough to keep most information from various sequencing platforms and read aligners. The equivalent binary representation, BAM, is compact in size and supports fast retrieval of alignments in specified regions. Using positional sorting and indexing, applications can perform stream-based processing on specific genomic regions without loading the entire file into memory. The SAM/BAM format, together with SAMtools, separates the alignment step from downstream analyses, enabling a generic and modular approach to the analysis of genomic sequencing data.
We are grateful to James Bonfield for the comments on indexing and to SAMtools users for testing the software as it has matured.
Funding: Wellcome Trust/077192/Z/05/Z;
Conflict of Interest: none declared.
- 1. Kent et al. (2002) The human genome browser at UCSC. Genome Res., 12, 996-1006.
- 2. Langmead et al. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol., 10, R25.
- 3. Li et al. (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res., 18, 1851-1858.
- 4. Mardis (2008) Next-generation DNA sequencing methods. Annu. Rev. Genomics Hum. Genet., 9, 387-402.