Publications by Heng Li

Publication

The Sequence Alignment/Map format and SAMtools.

Journal: Bioinformatics

January/13/2010

Abstract

SUMMARY

The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments.
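As a rough illustration of the format described above, the following sketch reads one SAM alignment line. It relies on the eleven mandatory tab-separated fields of the SAM specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The names `SamRecord`, `parse_sam_line`, `parse_cigar` and `reference_span` are made up for this example and are not part of SAMtools.

```python
import re
from typing import List, NamedTuple, Tuple


class SamRecord(NamedTuple):
    """One alignment line of a SAM file (hypothetical helper type)."""
    qname: str
    flag: int
    rname: str
    pos: int              # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: Tuple[str, ...]  # optional TAG:TYPE:VALUE fields, kept as raw strings


# A CIGAR string is a run of <length><operation> pairs.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line: str) -> SamRecord:
    """Split one non-header SAM line into its mandatory and optional fields."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], tuple(f[11:]))


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Turn '5M1I4M' into [(5, 'M'), (1, 'I'), (4, 'M')]; '*' means no CIGAR."""
    if cigar == "*":
        return []
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered (operations M, D, N, = and X)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


example = ("read1\t0\tchr1\t100\t60\t5M1I4M2D3M\t*\t0\t0\t"
           "ACGTACGTACGTA\tIIIIIIIIIIIII\tNM:i:3")
record = parse_sam_line(example)
span = reference_span(record.cigar)
```

In practice a maintained library or SAMtools itself would be used; the sketch only shows how little structure is needed to read the text form of a record.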

AVAILABILITY

http://samtools.sourceforge.net.

Authors

Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; +4 authors

Publication

Fast and accurate short read alignment with Burrows-Wheeler transform.

Journal: Bioinformatics

October/21/2009

Abstract

BACKGROUND

The enormous number of short reads generated by the new DNA sequencing technologies calls for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals.

RESULTS

We implemented the Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with the Burrows-Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is approximately 10-20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignments in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package.
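To make "backward search with the Burrows-Wheeler Transform" concrete, here is a minimal sketch of exact-match backward search on a toy text. It is not BWA's implementation: it builds the suffix array by naive sorting, stores full occurrence tables, and does not handle mismatches or gaps. The function names `build_fm_index` and `backward_search` are assumptions made for this example.

```python
from typing import Dict, List, Tuple


def build_fm_index(text: str) -> Tuple[List[int], Dict[str, int], Dict[str, List[int]]]:
    """Build a suffix array, C table and occurrence table for text + '$'."""
    t = text + "$"                      # '$' sorts before A, C, G, T in ASCII
    sa = sorted(range(len(t)), key=lambda i: t[i:])   # naive; fine for toy input
    bwt = "".join(t[i - 1] for i in sa)               # t[-1] is '$' when i == 0
    alphabet = sorted(set(t))
    # C[c] = number of characters in t that are strictly smaller than c
    C: Dict[str, int] = {}
    total = 0
    for c in alphabet:
        C[c] = total
        total += t.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ: Dict[str, List[int]] = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ


def backward_search(pattern: str, sa: List[int], C: Dict[str, int],
                    occ: Dict[str, List[int]]) -> List[int]:
    """Return the sorted start positions of exact matches of pattern."""
    lo, hi = 0, len(sa)                 # half-open suffix-array interval
    for ch in reversed(pattern):        # consume the pattern right to left
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


sa, C, occ = build_fm_index("ACGTACGTTACG")
hits = backward_search("ACG", sa, C, occ)
```

Each pattern character narrows the suffix-array interval with two table lookups, so the search time depends on the pattern length rather than on the size of the reference.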

AVAILABILITY

http://maq.sourceforge.net.

Authors

Heng Li; Richard Durbin

Publication

Fast and accurate long-read alignment with Burrows-Wheeler transform.

Journal: Bioinformatics

June/16/2010

Abstract

BACKGROUND

Many programs for aligning short sequencing reads to a reference genome have been developed in the last 2 years. Most of them are very efficient for short reads but inefficient or not applicable for reads >200 bp because the algorithms are heavily and specifically tuned for short queries with low sequencing error rate. However, some sequencing platforms already produce longer reads and others are expected to become available soon. For longer reads, hashing-based software such as BLAT and SSAHA2 remains the only choice. Nonetheless, these methods are substantially slower than short-read aligners in terms of aligned bases per unit time.

RESULTS

We designed and implemented a new algorithm, Burrows-Wheeler Aligner's Smith-Waterman Alignment (BWA-SW), to align long sequences up to 1 Mb against a large sequence database (e.g. the human genome) with a few gigabytes of memory. The algorithm is as accurate as SSAHA2, more accurate than BLAT, and is several to tens of times faster than both.
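For readers unfamiliar with the Smith-Waterman part of the name, the sketch below is the classical local-alignment dynamic programme with a linear gap penalty. It is only the textbook recurrence, not the BWA-SW algorithm, which the abstract describes as a separate, much faster method for large databases. The scoring values and the function name `smith_waterman` are illustrative assumptions.

```python
from typing import Tuple


def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -2) -> Tuple[int, int, int]:
    """Return (best local score, end index in a, end index in b).

    End indices are 1-based positions of the last aligned characters.
    Only two rows of the score matrix are kept, so no traceback is possible.
    """
    prev = [0] * (len(b) + 1)
    best = (0, 0, 0)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            if cur[j] > best[0]:
                best = (cur[j], i, j)
        prev = cur
    return best


score, end_a, end_b = smith_waterman("ACGT", "TTACGTTT")
```

The full matrix costs time proportional to the product of the two sequence lengths, which is why running it directly against a whole genome is impractical.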

AVAILABILITY

http://bio-bwa.sourceforge.net

Authors

Heng Li; Richard Durbin

Publication

Accurate whole human genome sequencing using reversible terminator chemistry.

Journal: Nature

December/3/2008

Abstract

DNA sequence information underpins genetic research, enabling discoveries of important biological or medical benefit. Sequencing projects have traditionally used long (400-800 base pair) reads, but the existence of reference sequences for the human and many other genomes makes it possible to develop new, fast approaches to re-sequencing, whereby shorter reads are compared to a reference to identify intraspecies genetic variation. Here we report an approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost. Single molecules of DNA are attached to a flat surface, amplified in situ and used as templates for synthetic sequencing with fluorescent reversible terminator deoxyribonucleotides. Images of the surface are analysed to generate high-quality sequence. We demonstrate application of this approach to human genome sequencing on flow-sorted X chromosomes and then scale the approach to determine the genome sequence of a male Yoruba from Ibadan, Nigeria. We build an accurate consensus sequence from >30× average depth of paired 35-base reads. We characterize four million single-nucleotide polymorphisms and four hundred thousand structural variants, many of which were previously unknown. Our approach is effective for accurate, rapid and economical whole-genome re-sequencing and many other biomedical applications.

Authors

David R Bentley; Shankar Balasubramanian; Harold P Swerdlow; Geoffrey P Smith; John Milton; Clive G Brown; +188 authors

Publication

Mapping short DNA sequencing reads and calling variants using mapping quality scores.

Journal: Genome Research

January/6/2009

Abstract

New sequencing technologies promise a new era in the use of DNA sequence. However, some of these technologies produce very short reads, typically of a few tens of base pairs, and to use these reads effectively requires new algorithms and software. In particular, there is a major issue in efficiently aligning short reads to a reference genome and handling ambiguity or lack of accuracy in this alignment. Here we introduce the concept of mapping quality, a measure of the confidence that a read actually comes from the position it is aligned to by the mapping algorithm. We describe the software MAQ that can build assemblies by mapping shotgun short reads to a reference genome, using quality scores to derive genotype calls of the consensus sequence of a diploid genome, e.g., from a human sample. MAQ makes full use of mate-pair information and estimates the error probability of each read alignment. Error probabilities are also derived for the final genotype calls, using a Bayesian statistical model that incorporates the mapping qualities, error probabilities from the raw sequence quality scores, sampling of the two haplotypes, and an empirical model for correlated errors at a site. Both read mapping and genotype calling are evaluated on simulated data and real data. MAQ is accurate, efficient, versatile, and user-friendly. It is freely available at http://maq.sourceforge.net.
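The idea of mapping quality can be illustrated with a small calculation. The sketch below assumes the usual Phred scaling, Q = -10 log10(P), where P is the probability that the reported position is wrong, and a deliberately simplified model in which each candidate position is weighted only by the summed base qualities of its mismatches. It is not MAQ's actual model, and the names `phred` and `mapping_quality` and the cap of 60 are assumptions for this example.

```python
import math
from typing import Sequence


def phred(p_error: float, cap: int = 60) -> int:
    """Convert an error probability to a capped, rounded Phred-scaled quality."""
    if p_error <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_error)))


def mapping_quality(mismatch_quality_sums: Sequence[float], cap: int = 60) -> int:
    """Phred-scaled confidence that the best candidate position is correct.

    Each entry is the sum of base qualities at mismatching bases for one
    candidate position, so a larger sum means a less likely placement.
    """
    if not mismatch_quality_sums:
        raise ValueError("at least one candidate position is required")
    likelihoods = [10 ** (-q / 10) for q in mismatch_quality_sums]
    p_wrong = 1 - max(likelihoods) / sum(likelihoods)
    return phred(p_wrong, cap)


unique_hit = mapping_quality([0])        # a single perfect candidate
repeat_hit = mapping_quality([0, 0])     # two equally good candidates
clear_best = mapping_quality([0, 30])    # second candidate is much worse
```

A read that fits two places equally well gets a quality near 3 (a one-in-two chance of being wrong), which is how ambiguity in repeats becomes visible to downstream genotype calling.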

Authors

Heng Li; Jue Ruan; Richard Durbin

Publication

A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data.

Journal: Bioinformatics

March/12/2012

Abstract

BACKGROUND

Most existing methods for DNA sequence analysis rely on accurate sequences or genotypes. However, in applications of next-generation sequencing (NGS), accurate genotypes may not be easily obtained (e.g. multi-sample low-coverage sequencing or somatic mutation discovery). These applications call for the development of new methods for analyzing sequence data with uncertainty.

RESULTS

We present a statistical framework for calling SNPs, discovering somatic mutations, inferring population genetical parameters and performing association tests directly based on sequencing data without explicit genotyping or linkage-based imputation. On real data, we demonstrate that our method achieves comparable accuracy to alternative methods for estimating site allele count, for inferring allele frequency spectrum and for association mapping. We also highlight the necessity of using symmetric datasets for finding somatic mutations and confirm that for discovering rare events, mismapping is frequently the leading source of errors.

AVAILABILITY

http://samtools.sourceforge.net.

CONTACT

hengli@broadinstitute.org.

Authors

Heng Li

Publication

A draft sequence of the Neandertal genome.

Journal: Science

May/17/2010

Abstract

Neandertals, the closest evolutionary relatives of present-day humans, lived in large parts of Europe and western Asia before disappearing 30,000 years ago. We present a draft sequence of the Neandertal genome composed of more than 4 billion nucleotides from three individuals. Comparisons of the Neandertal genome to the genomes of five present-day humans from different parts of the world identify a number of genomic regions that may have been affected by positive selection in ancestral modern humans, including genes involved in metabolism and in cognitive and skeletal development. We show that Neandertals shared more genetic variants with present-day humans in Eurasia than with present-day humans in sub-Saharan Africa, suggesting that gene flow from Neandertals into the ancestors of non-Africans occurred before the divergence of Eurasian groups from each other.

Authors

Richard E Green; Johannes Krause; Adrian W Briggs; Tomislav Maricic; Udo Stenzel; Martin Kircher; +50 authors

Publication

Minimap2: pairwise alignment for nucleotide sequences.

Journal: Bioinformatics

November/13/2018

Abstract

BACKGROUND

Recent advances in sequencing technologies promise ultra-long reads of ∼100 kb on average, full-length mRNA or cDNA reads in high throughput and genomic contigs over 100 Mb in length. Existing alignment programs either cannot process such data at scale or do so inefficiently, which calls for the development of new alignment algorithms.

RESULTS

Minimap2 is a general-purpose alignment program to map DNA or long mRNA sequences against a large reference database. It works with accurate short reads of ≥100 bp in length, ≥1 kb genomic reads at error rate ∼15%, full-length noisy Direct RNA or cDNA reads and assembly contigs or closely related full chromosomes of hundreds of megabases in length. Minimap2 does split-read alignment, employs concave gap cost for long insertions and deletions and introduces new heuristics to reduce spurious alignments. It is 3-4 times as fast as mainstream short-read mappers at comparable accuracy, and is ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment.
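One common way to obtain a concave gap cost is to take the minimum of two affine functions, so that short gaps pay a small opening penalty with a steep extension and long gaps pay a large opening penalty with a shallow extension. The sketch below assumes that two-piece form. The parameter values and the function names `affine_gap_cost` and `two_piece_gap_cost` are illustrative assumptions, not a statement of the program's internals.

```python
def affine_gap_cost(length: int, gap_open: int, gap_extend: int) -> int:
    """Ordinary affine cost: one opening penalty plus a per-base extension."""
    return gap_open + length * gap_extend


def two_piece_gap_cost(length: int, q: int = 4, e: int = 2,
                       q2: int = 24, e2: int = 1) -> int:
    """Concave gap cost as the cheaper of two affine costs (illustrative values)."""
    if length <= 0:
        return 0
    return min(affine_gap_cost(length, q, e), affine_gap_cost(length, q2, e2))


short_gap = two_piece_gap_cost(1)      # first piece wins: 4 + 1 * 2
crossover = two_piece_gap_cost(20)     # both pieces give 44
long_gap = two_piece_gap_cost(100)     # second piece wins: 24 + 100 * 1
```

With these values a 100-base deletion costs 124 rather than the 204 a single affine piece would charge, so long insertions and deletions are less likely to break an alignment apart.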

AVAILABILITY

https://github.com/lh3/minimap2.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Authors

Heng Li

Publication

The diploid genome sequence of an Asian individual.

Journal: Nature

December/3/2008

Abstract

Here we present the first diploid genome sequence of an Asian individual. The genome was sequenced to 36-fold average coverage using massively parallel sequencing technology. We aligned the short reads onto the NCBI human reference genome to 99.97% coverage, and guided by the reference genome, we used uniquely mapped reads to assemble a high-quality consensus sequence for 92% of the Asian individual's genome. We identified approximately 3 million single-nucleotide polymorphisms (SNPs) inside this region, of which 13.6% were not in the dbSNP database. Genotyping analysis showed that SNP identification had high accuracy and consistency, indicating the high sequence quality of this assembly. We also carried out heterozygote phasing and haplotype prediction against HapMap CHB and JPT haplotypes (Chinese and Japanese, respectively), sequence comparison with the two available individual genomes (J. D. Watson and J. C. Venter), and structural variation identification. These variations were considered for their potential biological impact. Our sequence data and analyses demonstrate the potential usefulness of next-generation sequencing technologies for personal genomics.

Authors

Jun Wang; Wei Wang; Ruiqiang Li; Yingrui Li; Geng Tian; Laurie Goodman; +71 authors

Publication

Inference of human population history from individual whole-genome sequences.

Journal: Nature

August/2/2011

Abstract

The history of human population size is important for understanding human evolution. Various studies have found evidence for a founder event (bottleneck) in East Asian and European populations, associated with the human dispersal out-of-Africa event around 60 thousand years (kyr) ago. However, these studies have had to assume simplified demographic models with few parameters, and they do not provide a precise date for the start and stop times of the bottleneck. Here, with fewer assumptions on population size changes, we present a more detailed history of human population sizes between approximately ten thousand and a million years ago, using the pairwise sequentially Markovian coalescent model applied to the complete diploid genome sequences of a Chinese male (YH), a Korean male (SJK), three European individuals (J. C. Venter, NA12891 and NA12878 (ref. 9)) and two Yoruba males (NA18507 (ref. 10) and NA19239). We infer that European and Chinese populations had very similar population-size histories before 10-20 kyr ago. Both populations experienced a severe bottleneck 10-60 kyr ago, whereas African populations experienced a milder bottleneck from which they recovered earlier. All three populations have an elevated effective population size between 60 and 250 kyr ago, possibly due to population substructure. We also infer that the differentiation of genetically modern humans may have started as early as 100-120 kyr ago, but considerable genetic exchanges may still have occurred until 20-40 kyr ago.

Authors

Heng Li; Richard Durbin

Publication

The sequence and de novo assembly of the giant panda genome.

Journal: Nature

March/2/2010

Abstract

Using next-generation sequencing technology alone, we have successfully generated and assembled a draft sequence of the giant panda genome. The assembled contigs (2.25 gigabases (Gb)) cover approximately 94% of the whole genome, and the remaining gaps (0.05 Gb) seem to contain carnivore-specific repeats and tandem repeats. Comparisons with the dog and human showed that the panda genome has a lower divergence rate. The assessment of panda genes potentially underlying some of its unique traits indicated that its bamboo diet might be more dependent on its gut microbiome than its own genetic composition. We also identified more than 2.7 million heterozygous single nucleotide polymorphisms in the diploid genome. Our data and analyses provide a foundation for promoting mammalian genetic research, and demonstrate the feasibility of using next-generation sequencing technologies for accurate, cost-effective and rapid de novo assembly of large eukaryotic genomes.

Authors

Ruiqiang Li; Wei Fan; Geng Tian; Hongmei Zhu; Lin He; Jing Cai; +117 authors

Publication

A high-coverage genome sequence from an archaic Denisovan individual.

Journal: Science

October/18/2012

Abstract

We present a DNA library preparation method that has allowed us to reconstruct a high-coverage (30×) genome sequence of a Denisovan, an extinct relative of Neandertals. The quality of this genome allows a direct estimation of Denisovan heterozygosity indicating that genetic diversity in these archaic hominins was extremely low. It also allows tentative dating of the specimen on the basis of "missing evolution" in its genome, detailed measurements of Denisovan and Neandertal admixture into present-day human populations, and the generation of a near-complete catalog of genetic changes that swept to high frequency in modern humans since their divergence from Denisovans.

Authors

Matthias Meyer; Martin Kircher; Marie-Theres Gansauge; Heng Li; Fernando Racimo; Swapan Mallick; +28 authors

Publication

Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing.

Journal: Nature Genetics

June/18/2008

Abstract

Human cancers often carry many somatically acquired genomic rearrangements, some of which may be implicated in cancer development. However, conventional strategies for characterizing rearrangements are laborious and low-throughput and have low sensitivity or poor resolution. We used massively parallel sequencing to generate sequence reads from both ends of short DNA fragments derived from the genomes of two individuals with lung cancer. By investigating read pairs that did not align correctly with respect to each other on the reference human genome, we characterized 306 germline structural variants and 103 somatic rearrangements to the base-pair level of resolution. The patterns of germline and somatic rearrangement were markedly different. Many somatic rearrangements were from amplicons, although rearrangements outside these regions, notably including tandem duplications, were also observed. Some somatic rearrangements led to abnormal transcripts, including two from internal tandem duplications and two fusion transcripts created by interchromosomal rearrangements. Germline variants were predominantly mediated by retrotransposition, often involving AluY and LINE elements. The results demonstrate the feasibility of systematic, genome-wide characterization of rearrangements in complex human cancer genomes, raising the prospect of a new harvest of genes associated with cancer using this strategy.

Authors

Peter J Campbell; Philip J Stephens; Erin D Pleasance; Sarah O'Meara; Heng Li; Thomas Santarius; +18 authors

Publication

The complete genome sequence of a Neanderthal from the Altai Mountains.

Journal: Nature

January/27/2014

Abstract

We present a high-quality genome sequence of a Neanderthal woman from Siberia. We show that her parents were related at the level of half-siblings and that mating among close relatives was common among her recent ancestors. We also sequenced the genome of a Neanderthal from the Caucasus to low coverage. An analysis of the relationships and population history of available archaic genomes and 25 present-day human genomes shows that several gene flow events occurred among Neanderthals, Denisovans and early modern humans, possibly including gene flow into Denisovans from an unknown archaic group. Thus, interbreeding, albeit of low magnitude, occurred among many hominin groups in the Late Pleistocene. In addition, the high-quality Neanderthal genome allows us to establish a definitive list of substitutions that became fixed in modern humans after their separation from the ancestors of Neanderthals and Denisovans.

Authors

Kay Prüfer; Fernando Racimo; Nick Patterson; Flora Jay; Sriram Sankararaman; Susanna Sawyer; +39 authors

Publication

Genetic history of an archaic hominin group from Denisova Cave in Siberia.

Journal: Nature

January/3/2011

Abstract

Using DNA extracted from a finger bone found in Denisova Cave in southern Siberia, we have sequenced the genome of an archaic hominin to about 1.9-fold coverage. This individual is from a group that shares a common origin with Neanderthals. This population was not involved in the putative gene flow from Neanderthals into Eurasians; however, the data suggest that it contributed 4-6% of its genetic material to the genomes of present-day Melanesians. We designate this hominin population 'Denisovans' and suggest that it may have been widespread in Asia during the Late Pleistocene epoch. A tooth found in Denisova Cave carries a mitochondrial genome highly similar to that of the finger bone. This tooth shares no derived morphological features with Neanderthals or modern humans, further indicating that Denisovans have an evolutionary history distinct from Neanderthals and modern humans.

Authors

David Reich; Richard E Green; Martin Kircher; Johannes Krause; Nick Patterson; Eric Y Durand; +22 authors

Publication

The Genomes of Oryza sativa: a history of duplications.

Journal: PLoS Biology

March/6/2006

Abstract

We report improved whole-genome shotgun sequences for the genomes of indica and japonica rice, both with multimegabase contiguity, or almost 1,000-fold improvement over the drafts of 2002. Tested against a nonredundant collection of 19,079 full-length cDNAs, 97.7% of the genes are aligned, without fragmentation, to the mapped super-scaffolds of one or the other genome. We introduce a gene identification procedure for plants that does not rely on similarity to known genes to remove erroneous predictions resulting from transposable elements. Using the available EST data to adjust for residual errors in the predictions, the estimated gene count is at least 38,000-40,000. Only 2%-3% of the genes are unique to any one subspecies, comparable to the amount of sequence that might still be missing. Despite this lack of variation in gene content, there is enormous variation in the intergenic regions. At least a quarter of the two sequences could not be aligned, and where they could be aligned, single nucleotide polymorphism (SNP) rates varied from as little as 3.0 SNP/kb in the coding regions to 27.6 SNP/kb in the transposable elements. A more inclusive new approach for analyzing duplication history is introduced here. It reveals an ancient whole-genome duplication, a recent segmental duplication on Chromosomes 11 and 12, and massive ongoing individual gene duplications. We find 18 distinct pairs of duplicated segments that cover 65.7% of the genome; 17 of these pairs date back to a common time before the divergence of the grasses. More important, ongoing individual gene duplications provide a never-ending source of raw material for gene genesis and are major contributors to the differences between members of the grass family.

Authors

Jun Yu; Jun Wang; Wei Lin; Songgang Li; Heng Li; Jun Zhou; +111 authors

Publication

A draft sequence for the genome of the domesticated silkworm (Bombyx mori).

Journal: Science

January/4/2005

Abstract

We report a draft sequence for the genome of the domesticated silkworm (Bombyx mori), covering 90.9% of all known silkworm genes. Our estimated gene count is 18,510, which exceeds the 13,379 genes reported for Drosophila melanogaster. Comparative analyses to fruitfly, mosquito, spider, and butterfly reveal both similarities and differences in gene content.

Authors

Qingyou Xia; Zeyang Zhou; Cheng Lu; Daojun Cheng; Fangyin Dai; Bin Li; +88 authors

Publication

A Bayesian deconvolution strategy for immunoprecipitation-based DNA methylome analysis.

Journal: Nature Biotechnology

September/9/2008

Abstract

DNA methylation is an indispensable epigenetic modification required for regulating the expression of mammalian genomes. Immunoprecipitation-based methods for DNA methylome analysis are rapidly shifting the bottleneck in this field from data generation to data analysis, necessitating the development of better analytical tools. In particular, an inability to estimate absolute methylation levels remains a major analytical difficulty associated with immunoprecipitation-based DNA methylation profiling. To address this issue, we developed a cross-platform algorithm, Bayesian tool for methylation analysis (Batman), for analyzing methylated DNA immunoprecipitation (MeDIP) profiles generated using oligonucleotide arrays (MeDIP-chip) or next-generation sequencing (MeDIP-seq). We developed the latter approach to provide a high-resolution whole-genome DNA methylation profile (DNA methylome) of a mammalian genome. Strong correlation of our data, obtained using mature human spermatozoa, with those obtained using bisulfite sequencing suggests that combining MeDIP-seq or MeDIP-chip with Batman provides a robust, quantitative and cost-effective functional genomic strategy for elucidating the function of DNA methylation.

Authors

Thomas A Down; Vardhman K Rakyan; Daniel J Turner; Paul Flicek; Heng Li; Eugene Kulesha; +16 authors

Publication

Ancient human genomes suggest three ancestral populations for present-day Europeans.

Journal: Nature

October/13/2014

Abstract

We sequenced the genomes of a ∼7,000-year-old farmer from Germany and eight ∼8,000-year-old hunter-gatherers from Luxembourg and Sweden. We analysed these and other ancient genomes with 2,345 contemporary humans to show that most present-day Europeans derive from at least three highly differentiated populations: west European hunter-gatherers, who contributed ancestry to all Europeans but not to Near Easterners; ancient north Eurasians related to Upper Palaeolithic Siberians, who contributed to both Europeans and Near Easterners; and early European farmers, who were mainly of Near Eastern origin but also harboured west European hunter-gatherer related ancestry. We model these populations' deep relationships and show that early European farmers had ∼44% ancestry from a 'basal Eurasian' population that split before the diversification of other non-African lineages.

Authors

Iosif Lazaridis; Nick Patterson; Alissa Mittnik; Gabriel Renaud; Swapan Mallick; Karola Kirsanow; +114 authors

Publication

A survey of sequence alignment algorithms for next-generation sequencing.

Journal: Briefings in Bioinformatics

January/12/2011

Abstract

Rapidly evolving sequencing technologies produce data on an unparalleled scale. A central challenge to the analysis of this data is sequence alignment, whereby sequence reads must be compared to a reference. A wide variety of alignment algorithms and software have been subsequently developed over the past two years. In this article, we will systematically review the current development of these algorithms and introduce their practical applications on different types of experimental data. We come to the conclusion that short-read alignment is no longer the bottleneck of data analyses. We also consider future development of alignment algorithms with respect to emerging long sequence reads and the prospect of cloud computing.

Authors

Heng Li; Nils Homer

Publication

TreeFam: a curated database of phylogenetic trees of animal gene families.

Journal: Nucleic Acids Research

February/27/2006

Abstract

TreeFam is a database of phylogenetic trees of gene families found in animals. It aims to develop a curated resource that presents the accurate evolutionary history of all animal gene families, as well as reliable ortholog and paralog assignments. Curated families are being added progressively, based on seed alignments and trees in a similar fashion to Pfam. Release 1.1 of TreeFam contains curated trees for 690 families and automatically generated trees for another 11 646 families. These represent over 128 000 genes from nine fully sequenced animal genomes and over 45 000 other animal proteins from UniProt; approximately 40-85% of proteins encoded in the fully sequenced animal genomes are included in TreeFam. TreeFam is freely available at http://www.treefam.org and http://treefam.genomics.org.cn.

Authors

Heng Li; Avril Coghlan; Jue Ruan; Lachlan James Coin; Jean-Karim Hériché; Lara Osmotherly; +9 authors

Publication

Great ape genetic diversity and population history.

Journal: Nature

August/12/2013

Abstract

Most great ape genetic variation remains uncharacterized; however, its study is critical for understanding population history, recombination, selection and susceptibility to disease. Here we sequence to high coverage a total of 79 wild- and captive-born individuals representing all six great ape species and seven subspecies and report 88.8 million single nucleotide polymorphisms. Our analysis provides support for genetically distinct populations within each species, signals of gene flow, and the split of common chimpanzees into two distinct groups: Nigeria-Cameroon/western and central/eastern populations. We find extensive inbreeding in almost all wild populations, with eastern gorillas being the most extreme. Inferred effective population sizes have varied radically over time in different lineages and this appears to have a profound effect on the genetic diversity at, or close to, genes in almost all species. We discover and assign 1,982 loss-of-function variants throughout the human and great ape lineages, determining that the rate of gene loss has not been different in the human branch compared to other internal branches in the great ape phylogeny. This comprehensive catalogue of great ape genome diversity provides a framework for understanding evolution and a resource for more effective management of wild and captive great ape populations.

Authors

Javier Prado-Martinez; Peter H Sudmant; Jeffrey M Kidd; Heng Li; Joanna L Kelley; Belen Lorente-Galdos; +69 authors

Publication

Genome sequence of a 45,000-year-old modern human from western Siberia.

Journal: Nature

December/7/2014

Abstract

We present the high-quality genome sequence of a ∼45,000-year-old modern human male from Siberia. This individual derives from a population that lived before (or simultaneously with) the separation of the populations in western and eastern Eurasia and carries a similar amount of Neanderthal ancestry as present-day Eurasians. However, the genomic segments of Neanderthal ancestry are substantially longer than those observed in present-day individuals, indicating that Neanderthal gene flow into the ancestors of this individual occurred 7,000-13,000 years before he lived. We estimate an autosomal mutation rate of 0.4 × 10⁻⁹ to 0.6 × 10⁻⁹ per site per year, a Y chromosomal mutation rate of 0.7 × 10⁻⁹ to 0.9 × 10⁻⁹ per site per year based on the additional substitutions that have occurred in present-day non-Africans compared to this genome, and a mitochondrial mutation rate of 1.8 × 10⁻⁸ to 3.2 × 10⁻⁸ per site per year based on the age of the bone.

Authors

Qiaomei Fu; Heng Li; Priya Moorjani; Flora Jay; Sergey M Slepchenko; Aleksei A Bondarev; +22 authors

Publication

Toward better understanding of artifacts in variant calling from high-coverage samples.

Journal: Bioinformatics

December/2/2014

Abstract

BACKGROUND

Whole-genome high-coverage sequencing has been widely used for personal and cancer genomics as well as in various research areas. However, in the absence of an unbiased whole-genome truth set, the global error rate of variant calls and the leading causal artifacts remain unclear despite the great efforts made in evaluating variant calling methods.

RESULTS

We made 10 single nucleotide polymorphism and INDEL call sets with two read mappers and five variant callers, both on a haploid human genome and a diploid genome at a similar coverage. By investigating false heterozygous calls in the haploid genome, we identified the erroneous realignment in low-complexity regions and the incomplete reference genome with respect to the sample as the two major sources of errors, which press for continued improvements in these two areas. We estimated that the error rate of raw genotype calls is as high as 1 in 10-15 kb, but the error rate of post-filtered calls is reduced to 1 in 100-200 kb without significant compromise on the sensitivity.

AVAILABILITY

BWA-MEM alignment and raw variant calls are available at http://bit.ly/1g8XqRt; scripts and miscellaneous data are available at https://github.com/lh3/varcmp.

CONTACT

hengli@broadinstitute.org

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.