A new method for determining nucleotide sequences in DNA is described. It is similar to the "plus and minus" method [Sanger, F. & Coulson, A. R. (1975) J. Mol. Biol. 94, 441-448] but makes use of the 2',3'-dideoxy and arabinonucleoside analogues of the normal deoxynucleoside triphosphates, which act as specific chain-terminating inhibitors of DNA polymerase. The technique has been applied to the DNA of bacteriophage φX174 and is more rapid and more accurate than either the plus or the minus method.
A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.
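To make the maximal segment pair (MSP) score concrete, the sketch below computes it by brute force: for every diagonal of the comparison it finds the highest-scoring run of aligned, ungapped positions. This is not the BLAST heuristic itself, which approximates this quantity far faster; the function name and the +5/-4 match/mismatch scores are assumptions chosen for illustration.

```python
def msp_score(a, b, match=5, mismatch=-4):
    """Brute-force MSP score: best ungapped local segment pair of a and b.

    The scoring values are illustrative assumptions, not taken from the text.
    """
    best = 0
    # offset = i - j identifies one diagonal of the comparison matrix.
    for offset in range(-(len(b) - 1), len(a)):
        i = max(offset, 0)
        j = max(-offset, 0)
        running = 0
        while i < len(a) and j < len(b):
            step = match if a[i] == b[j] else mismatch
            # Restart the segment whenever its score would drop below zero.
            running = max(0, running + step)
            best = max(best, running)
            i += 1
            j += 1
    return best


if __name__ == "__main__":
    print(msp_score("TTACGTT", "GGACGGG"))
```

The exhaustive scan takes time proportional to the product of the two sequence lengths, which is the cost a heuristic search is meant to avoid.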
MicroRNAs (miRNAs) are endogenous approximately 22 nt RNAs that can play important regulatory roles in animals and plants by targeting mRNAs for cleavage or translational repression. Although they escaped notice until relatively recently, miRNAs comprise one of the more abundant classes of gene regulatory molecules in multicellular organisms and likely influence the output of many protein-coding genes.
The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant calling and alignment viewing, and thus provides universal tools for processing read alignments.
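As an illustration of how a SAM alignment line is laid out, the following sketch splits one tab-delimited record into the eleven mandatory fields of the SAM specification plus any optional tags. The class and function names are hypothetical, the example record is made up, and this is not how SAMtools itself is implemented.

```python
from typing import NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII, Phred+33)
    tags: Tuple[str, ...]  # optional TAG:TYPE:VALUE fields


def parse_sam_line(line):
    """Parse one alignment line (not a header line starting with '@')."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], tuple(f[11:]))


def is_reverse_strand(record):
    """Flag bit 0x10 marks a read aligned to the reverse strand."""
    return bool(record.flag & 0x10)


if __name__ == "__main__":
    # Made-up record for illustration only.
    example = "read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
    rec = parse_sam_line(example)
    print(rec.rname, rec.pos, is_reverse_strand(rec))
```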
A technique for conveniently radiolabeling DNA restriction endonuclease fragments to high specific activity is described. DNA fragments are purified from agarose gels directly by ethanol precipitation and are then denatured and labeled with the large fragment of DNA polymerase I, using random oligonucleotides as primers. Over 70% of the precursor triphosphate is routinely incorporated into complementary DNA, and specific activities of over 10^9 dpm/µg of DNA can be obtained using relatively small amounts of precursor. These "oligolabeled" DNA fragments serve as efficient probes in filter hybridization experiments.
Three kinds of improvements have been introduced into the M13-based cloning systems. (1) New Escherichia coli host strains have been constructed for the E. coli bacteriophage M13 and the high-copy-number pUC-plasmid cloning vectors. Mutations introduced into these strains improve cloning of unmodified DNA and of repetitive sequences. A new suppressorless strain facilitates the cloning of selected recombinants. (2) The complete nucleotide sequences of the M13mp and pUC vectors have been compiled from a number of sources, including the sequencing of selected segments. The M13mp18 sequence is revised to include the G-to-T substitution in its gene II at position 6125 bp (in M13) or 6967 bp in M13mp18. (3) M13 clones suitable for sequencing have been obtained by a new method of generating unidirectional progressive deletions from the polycloning site using exonucleases III and VII.
Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source (http://bowtie.cbcb.umd.edu).
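The idea behind Burrows-Wheeler indexing can be sketched in a few lines: build the transform of the reference, then count exact occurrences of a read by backward search. This is a toy illustration only. It handles exact matches, not Bowtie's quality-aware backtracking for mismatches, and it uses naive rotation sorting and linear-time rank counting that would be far too slow and memory-hungry for a genome. All names are hypothetical, and the text is assumed not to contain the "$" sentinel.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy-sized input only)."""
    text += "$"  # sentinel, sorts before A, C, G, T in ASCII
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count exact matches of pattern using backward search on the BWT."""
    # first_row[c]: index of the first row whose first character is c.
    first_row = {}
    for index, char in enumerate(sorted(bwt_string)):
        first_row.setdefault(char, index)

    def rank(char, k):
        # Occurrences of char in bwt_string[:k]; a real index precomputes this.
        return bwt_string[:k].count(char)

    lo, hi = 0, len(bwt_string)  # half-open range of matching rows
    for char in reversed(pattern):
        if char not in first_row:
            return 0
        lo = first_row[char] + rank(char, lo)
        hi = first_row[char] + rank(char, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = bwt("ACAACG")
    print(index, count_occurrences(index, "AC"))
```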
The University of Wisconsin Genetics Computer Group (UWGCG) has been organized to develop computational tools for the analysis and publication of biological sequence data. A group of programs that will interact with each other has been developed for the Digital Equipment Corporation VAX computer using the VMS operating system. The programs available and the conditions for transfer are described.
The abbreviated name, 'mfold web server', describes a number of closely related software applications available on the World Wide Web (WWW) for the prediction of the secondary structure of single stranded nucleic acids. The objective of this web server is to provide easy access to RNA and DNA folding and hybridization software to the scientific community at large. By making use of universally available web GUIs (Graphical User Interfaces), the server circumvents the problem of portability of this software. Detailed output, in the form of structure plots with or without reliability information, single strand frequency plots and 'energy dot plots', is available for the folding of single sequences. A variety of 'bulk' servers give less information, but in a shorter time and for up to hundreds of sequences at once. The portal for the mfold web server is http://www.bioinfo.rpi.edu/applications/mfold. This URL will be referred to as 'MFOLDROOT'.
Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS (the 1000 Genomes pilot alone includes nearly five terabases) make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
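The MapReduce style mentioned above can be illustrated with a coverage calculation: a map step computes the read depth at each locus independently, and a reduce step combines the per-locus values. The sketch below is plain Python, not the GATK API; the function names and the representation of reads as half-open (start, end) intervals are assumptions made for the example.

```python
from functools import reduce


def depth_at(locus, reads):
    """Map step: number of reads covering one locus (reads are [start, end))."""
    return sum(1 for start, end in reads if start <= locus < end)


def mean_coverage(reads, region_start, region_end):
    """Reduce step: sum per-locus depths, then average over the region."""
    depths = map(lambda locus: depth_at(locus, reads),
                 range(region_start, region_end))
    total = reduce(lambda acc, depth: acc + depth, depths, 0)
    return total / (region_end - region_start)


if __name__ == "__main__":
    # Two hypothetical reads overlapping on loci 3 and 4.
    print(mean_coverage([(0, 5), (3, 8)], 0, 10))
```

Because each locus is processed independently in the map step, the work could be split across regions and the partial sums merged afterwards, which is the property that makes this pattern amenable to parallelization.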
Some simple formulae were obtained which enable us to estimate evolutionary distances in terms of the number of nucleotide substitutions (and, also, the evolutionary rates when the divergence times are known). In comparing a pair of nucleotide sequences, we distinguish two types of differences; if homologous sites are occupied by different nucleotide bases but both are purines or both pyrimidines, the difference is called type I (or "transition" type), while, if one of the two is a purine and the other is a pyrimidine, the difference is called type II (or "transversion" type). Letting P and Q be respectively the fractions of nucleotide sites showing type I and type II differences between two sequences compared, then the evolutionary distance per site is $K = -\frac{1}{2} \ln\left[(1 - 2P - Q)\sqrt{1 - 2Q}\right]$. The evolutionary rate per year is then given by $k = K/(2T)$, where T is the time since the divergence of the two sequences. If only the third codon positions are compared, the synonymous component of the evolutionary base substitutions per site is estimated by $K'_S = -\frac{1}{2} \ln(1 - 2P - Q)$. Also, formulae for standard errors were obtained. Some examples were worked out using reported globin sequences to show that synonymous substitutions occur at much higher rates than amino acid-altering substitutions in evolution.
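The distance formula above translates directly into code. The sketch below counts type I (transition) and type II (transversion) differences between two aligned sequences of equal length, then evaluates K and the rate k = K/(2T). Function names are hypothetical, the input is assumed to contain only A, C, G and T with no gaps, and the standard-error formulae are not reproduced.

```python
import math

PURINES = {"A", "G"}


def difference_fractions(seq1, seq2):
    """Return (P, Q): fractions of sites with transition / transversion differences."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be aligned, non-empty and of equal length")
    transitions = transversions = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    n = len(seq1)
    return transitions / n, transversions / n


def evolutionary_distance(p, q):
    """K = -(1/2) ln[(1 - 2P - Q) * sqrt(1 - 2Q)]."""
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def rate_per_year(distance, divergence_time_years):
    """k = K / (2T)."""
    return distance / (2 * divergence_time_years)


if __name__ == "__main__":
    # Made-up 20-site example: two transitions and one transversion.
    p, q = difference_fractions("A" * 20, "GGC" + "A" * 17)
    print(p, q, evolutionary_distance(p, q))
```

The logarithm is undefined when 1 - 2P - Q or 1 - 2Q is not positive, so highly diverged sequence pairs raise a `ValueError` from `math.log` rather than returning a distance.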
WebLogo generates sequence logos, graphical representations of the patterns within a multiple sequence alignment. Sequence logos provide a richer and more precise description of sequence similarity than consensus sequences and can rapidly reveal significant features of the alignment otherwise difficult to perceive. Each logo consists of stacks of letters, one stack for each position in the sequence. The overall height of each stack indicates the sequence conservation at that position (measured in bits), whereas the height of symbols within the stack reflects the relative frequency of the corresponding amino or nucleic acid at that position. WebLogo has been enhanced recently with additional features and options, to provide a convenient and highly configurable sequence logo generator. A command line interface and the complete, open WebLogo source code are available for local installation and customization.
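The stack heights described above can be sketched numerically. The example assumes the usual information-content measure for sequence logos, log2(alphabet size) minus the Shannon entropy of the column, and scales each letter by its relative frequency. It omits any small-sample correction, assumes gap-free DNA columns, and is not WebLogo's own code; the function name is hypothetical.

```python
import math
from collections import Counter


def logo_heights(aligned_seqs, alphabet="ACGT"):
    """Return, per alignment column, a dict of letter -> height in bits."""
    heights = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        n = len(column)
        freqs = {s: counts[s] / n for s in alphabet if counts[s]}
        entropy = -sum(f * math.log2(f) for f in freqs.values())
        # Stack height: conservation at this position, in bits.
        information = math.log2(len(alphabet)) - entropy
        # Letter height: share of the stack proportional to frequency.
        heights.append({s: f * information for s, f in freqs.items()})
    return heights


if __name__ == "__main__":
    for position in logo_heights(["AC", "AG", "AT", "AA"]):
        print(position)
```

In this toy alignment the first column is fully conserved, so its single letter takes the full 2 bits, while the second column is uniformly distributed and carries no information.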
We constructed a series of recombinant genomes which directed expression of the enzyme chloramphenicol acetyltransferase (CAT) in mammalian cells. The prototype recombinant in this series, pSV2-cat, consisted of the beta-lactamase gene and origin of replication from pBR322 coupled to a simian virus 40 (SV40) early transcription region into which CAT coding sequences were inserted. Readily measured levels of CAT accumulated within 48 h after the introduction of pSV2-cat DNA into African green monkey kidney CV-1 cells. Because endogenous CAT activity is not present in CV-1 or other mammalian cells, and because rapid, sensitive assays for CAT activity are available, these recombinants provided a uniquely convenient system for monitoring the expression of foreign DNAs in tissue culture cells. To demonstrate the usefulness of this system, we constructed derivatives of pSV2-cat from which part or all of the SV40 promoter region was removed. Deletion of one copy of the 72-base-pair repeat sequence in the SV40 promoter caused no significant decrease in CAT synthesis in monkey kidney CV-1 cells; however, an additional deletion of 50 base pairs from the second copy of the repeats reduced CAT synthesis to 11% of its level in the wild type. We also constructed a recombinant, pSV0-cat, in which the entire SV40 promoter region was removed and a unique HindIII site was substituted for the insertion of other promoter sequences.
We have developed three computer programs for comparisons of protein and DNA sequences. They can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity. The FASTA program is a more sensitive derivative of the FASTP program, which can be used to search protein or DNA sequence data bases and can compare a protein sequence to a DNA sequence data base by translating the DNA data base as it is searched. FASTA includes an additional step in the calculation of the initial pairwise similarity score that allows multiple regions of similarity to be joined to increase the score of related sequences. The RDF2 program can be used to evaluate the significance of similarity scores using a shuffling method that preserves local sequence composition. The LFASTA program can display all the regions of local similarity between two sequences with scores greater than a threshold, using the same scoring parameters and a similar alignment algorithm; these local similarities can be displayed as a "graphic matrix" plot or as individual alignments. In addition, these programs have been generalized to allow comparison of DNA or protein sequences based on a variety of alternative scoring matrices.
lin-4 is essential for the normal temporal control of diverse postembryonic developmental events in C. elegans. lin-4 acts by negatively regulating the level of LIN-14 protein, creating a temporal decrease in LIN-14 protein starting in the first larval stage (L1). We have cloned the C. elegans lin-4 locus by chromosomal walking and transformation rescue. We used the C. elegans clone to isolate the gene from three other Caenorhabditis species; all four Caenorhabditis clones functionally rescue the lin-4 null allele of C. elegans. Comparison of the lin-4 genomic sequence from these four species and site-directed mutagenesis of potential open reading frames indicated that lin-4 does not encode a protein. Two small lin-4 transcripts of approximately 22 and 61 nt were identified in C. elegans and found to contain sequences complementary to a repeated sequence element in the 3' untranslated region (UTR) of lin-14 mRNA, suggesting that lin-4 regulates lin-14 translation via an antisense RNA-RNA interaction.
Massively parallel sequencing of cDNA has enabled deep and efficient probing of transcriptomes. Current approaches for transcript reconstruction from such data often rely on aligning reads to a reference genome, and are thus unsuitable for samples with a partial or missing reference genome. Here we present the Trinity method for de novo assembly of full-length transcripts and evaluate it on samples from fission yeast, mouse and whitefly, whose reference genome is not yet available. By efficiently constructing and analyzing sets of de Bruijn graphs, Trinity fully reconstructs a large fraction of transcripts, including alternatively spliced isoforms and transcripts from recently duplicated genes. Compared with other de novo transcriptome assemblers, Trinity recovers more full-length transcripts across a broad range of expression levels, with a sensitivity similar to methods that rely on genome alignments. Our approach provides a unified solution for transcriptome reconstruction in any sample, especially in the absence of a reference genome.
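For readers unfamiliar with de Bruijn graphs, the sketch below builds one from a set of reads: each k-mer contributes an edge from its (k-1)-base prefix to its (k-1)-base suffix, and walking an unbranched path spells out a contig. This is a minimal illustration of the data structure only, not the Trinity method, which additionally handles sequencing errors, coverage, and splice isoforms. The function names and the toy reads are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def walk_unbranched(graph, start):
    """Extend a contig from start while exactly one successor exists."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (successor,) = graph[node]
        if successor in seen:  # stop rather than loop around a cycle
            break
        contig += successor[-1]
        seen.add(successor)
        node = successor
    return contig


if __name__ == "__main__":
    graph = build_de_bruijn(["ACGTC", "GTCAG"], k=4)
    print(walk_unbranched(graph, "ACG"))
```

A node with more than one successor marks a branch point, which in transcriptome data may correspond to alternative splicing or to closely related duplicated genes.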
We have designed a system for targeted gene expression that allows the selective activation of any cloned gene in a wide variety of tissue- and cell-specific patterns. The gene encoding the yeast transcriptional activator GAL4 is inserted randomly into the Drosophila genome to drive GAL4 expression from one of a diverse array of genomic enhancers. It is then possible to introduce a gene containing GAL4 binding sites within its promoter, to activate it in those cells where GAL4 is expressed, and to observe the effect of this directed misexpression on development. We have used GAL4-directed transcription to expand the domain of embryonic expression of the homeobox protein even-skipped. We show that even-skipped represses wingless and transforms cells that would normally secrete naked cuticle into denticle secreting cells. The GAL4 system can thus be used to study regulatory interactions during embryonic development. In adults, targeted expression can be used to generate dominant phenotypes for use in genetic screens. We have directed expression of an activated form of the Dras2 protein, resulting in dominant eye and wing defects that can be used in screens to identify other members of the Dras2 signal transduction pathway.
Unique DNA sequences can be determined directly from mouse genomic DNA. A denaturing gel separates by size mixtures of unlabeled DNA fragments from complete restriction and partial chemical cleavages of the entire genome. These lanes of DNA are transferred and UV-crosslinked to nylon membranes. Hybridization with a short 32P-labeled single-stranded probe produces the image of a DNA sequence "ladder" extending from the 3' or 5' end of one restriction site in the genome. Numerous different sequences can be obtained from a single membrane by reprobing. Each band in these sequences represents 3 fg of DNA complementary to the probe. Sequence data from mouse immunoglobulin heavy chain genes from several cell types are presented. The genomic sequencing procedures are applicable to the analysis of genetic polymorphisms, DNA methylation at deoxycytidines, and nucleic acid-protein interactions at single nucleotide resolution.
RNA-Seq is a recently developed approach to transcriptome profiling that uses deep-sequencing technologies. Studies using this method have already altered our view of the extent and complexity of eukaryotic transcriptomes. RNA-Seq also provides a far more precise measurement of levels of transcripts and their isoforms than other methods. This article describes the RNA-Seq approach, the challenges associated with its application, and the advances made so far in characterizing several eukaryote transcriptomes.
DNA can be sequenced by a chemical procedure that breaks a terminally labeled DNA molecule partially at each repetition of a base. The lengths of the labeled fragments then identify the positions of that base. We describe reactions that cleave DNA preferentially at guanines, at adenines, at cytosines and thymines equally, and at cytosines alone. When the products of these four reactions are resolved by size, by electrophoresis on a polyacrylamide gel, the DNA sequence can be read from the pattern of radioactive bands. The technique will permit sequencing of at least 100 bases from the point of labeling.
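The logic of reading a sequence from the four cleavage reactions can be sketched as follows. The simulation idealizes the chemistry: the guanine and adenine reactions are treated as fully base-specific, although the text describes them only as preferential, and each band is recorded simply as the position of the cleaved base counted from the labeled end. Function names and the lane labels are hypothetical.

```python
def simulate_lanes(sequence):
    """Return the band positions expected in each of the four reaction lanes."""
    lanes = {"G": set(), "A": set(), "C+T": set(), "C": set()}
    for position, base in enumerate(sequence.upper(), start=1):
        if base == "G":
            lanes["G"].add(position)
        elif base == "A":
            lanes["A"].add(position)
        elif base == "C":
            lanes["C+T"].add(position)  # cleaved in both pyrimidine reactions
            lanes["C"].add(position)
        elif base == "T":
            lanes["C+T"].add(position)  # cleaved only in the C+T reaction
    return lanes


def read_ladder(lanes):
    """Reconstruct the sequence by walking the ladder from the shortest band up."""
    positions = set().union(*lanes.values())
    if not positions:
        return ""
    bases = []
    for position in range(1, max(positions) + 1):
        if position in lanes["C"]:
            bases.append("C")
        elif position in lanes["C+T"]:
            bases.append("T")   # band in C+T but not in C
        elif position in lanes["G"]:
            bases.append("G")
        elif position in lanes["A"]:
            bases.append("A")
        else:
            bases.append("N")   # no band observed at this position
    return "".join(bases)


if __name__ == "__main__":
    print(read_ladder(simulate_lanes("GATTACA")))
```

The key step is the comparison between the two pyrimidine lanes: a band present in both identifies a cytosine, while a band present only in the combined lane identifies a thymine.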