E W Myers
Citations
All
Search in:AllTitleAbstractAuthor name
Publications
(20)
Patents
Grants
Pathways
Clinical trials
Publication
Journal: Journal of Molecular Biology
December/4/1990
Abstract
A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.
Pulse
Views:
21
Posts:
No posts
Rating:
Not rated
Publication
Journal: Science
March/14/2001
Abstract
A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies-a whole-genome assembly and a regional chromosome assembly-were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional approximately 12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.
Publication
Journal: Science
March/29/2000
Abstract
The fly Drosophila melanogaster is one of the most intensively studied organisms in biology and serves as a model system for the investigation of many developmental and cellular processes common to higher eukaryotes, including humans. We have determined the nucleotide sequence of nearly all of the approximately 120-megabase euchromatic portion of the Drosophila genome using a whole-genome shotgun sequencing strategy supported by extensive clone-based sequence and a high-quality bacterial artificial chromosome physical map. Efforts are under way to close the remaining gaps; however, the sequence is of sufficient accuracy and contiguity to be declared substantially complete and to support an initial analysis of genome structure and preliminary gene annotation and interpretation. The genome encodes approximately 13,600 genes, somewhat fewer than the smaller Caenorhabditis elegans genome, but with comparable functional diversity.
Publication
Journal: Science
March/29/2000
Abstract
We report on the quality of a whole-genome assembly of Drosophila melanogaster and the nature of the computer algorithms that accomplished it. Three independent external data sources essentially agree with and support the assembly's sequence and ordering of contigs across the euchromatic portion of the genome. In addition, there are isolated contigs that we believe represent nonrepetitive pockets within the heterochromatin of the centromeres. Comparison with a previously sequenced 2.9- megabase region indicates that sequencing accuracy within nonrepetitive segments is greater than 99. 99% without manual curation. As such, this initial reconstruction of the Drosophila sequence should be of substantial value to the scientific community.
Publication
Journal: Computer applications in the biosciences : CABIOS
August/3/1988
Abstract
Space, not time, is often the limiting factor when computing optimal sequence alignments, and a number of recent papers in the biology literature have proposed space-saving strategies. However, a 1975 computer science paper by Hirschberg presented a method that is superior to the new proposals, both in theory and in practice. The goal of this paper is to give Hirschberg's idea the visibility it deserves by developing a linear-space version of Gotoh's algorithm, which accommodates affine gap penalties. A portable C-software package implementing this algorithm is available on the BIONET free of charge.
Publication
Journal: Journal of Computational Biology
January/17/1996
Abstract
The fragment assembly problem is that of reconstructing a DNA sequence from a collection of randomly sampled fragments. Traditionally, the objective of this problem has been to produce the shortest string that contains all the fragments as substrings, but in the case of repetitive target sequences this objective produces answers that are overcompressed. In this paper, the problem is reformulated as one of finding a maximum-likelihood reconstruction with respect to the two-sided Kolmogorov-Smirnov statistic, and it is argued that this is a better formulation of the problem. Next the fragment assembly problem is recast in graph-theoretic terms as one of finding a noncyclic subgraph with certain properties and the objectives of being shortest or maximally likely are also recast in this framework. Finally, a series of graph reduction transformations are given that dramatically reduce the size of the graph to be explored in practical instances of the problem. This reduction is very important as the underlying problems are NP-hard. In practice, the transformed problems are so small that simple branch-and-bound algorithms successfully solve them, thus permitting auxiliary experimental information to be taken into account in the form of overlap, orientation, and distance constraints.
Authors
Publication
Journal: Genome Research
August/3/1997
Publication
Journal: Bioinformatics
February/24/2002
Abstract
Two different strategies for determining the human genome are currently being pursued: one is the "clone-by-clone" approach, employed by the publicly funded project, and the other is the "whole genome shotgun assembler" approach, favored by researchers at Celera Genomics. An interim strategy employed at Celera, called compartmentalized shotgun assembly, makes use of preliminary data produced by both approaches. In this paper we describe the design, implementation and operation of the "compartmentalized shotgun assembler".
Publication
Journal: Bulletin of Mathematical Biology
February/22/1989
Publication
Journal: Bulletin of Mathematical Biology
June/1/1989
Abstract
Given a sequence A and regular expression R, the approximate regular expression matching problem is to find a sequence matching R whose optimal alignment with A is the highest scoring of all such sequences. This paper develops an algorithm to solve the problem in time O(MN), where M and N are the lengths of A and R. Thus, the time requirement is asymptotically no worse than for the simpler problem of aligning two fixed sequences. Our method is superior to an earlier algorithm by Wagner and Seiferas in several ways. First, it treats real-valued costs, in addition to integer costs, with no loss of asymptotic efficiency. Second, it requires only O(N) space to deliver just the score of the best alignment. Finally, its structure permits implementation techniques that make it extremely fast in practice. We extend the method to accommodate gap penalties, as required for typical applications in molecular biology, and further refine it to search for sub-strings of A that strongly align with a sequence in R, as required for typical data base searches. We also show how to deliver an optimal alignment between A and R in only O(N + log M) space using O(MN log M) time. Finally, an O(MN(M + N) + N2log N) time algorithm is presented for alignment scoring schemes where the cost of a gap is an arbitrary increasing function of its length.
Publication
Journal: Journal of Computational Biology
October/15/1997
Abstract
We present a round-robin realignment algorithm that improves a potentially crude initial alignment of an assembled collection of DNA sequence fragments, as might, for example, be output by a typical fragment assembly program. The algorithm uses a weighted combination of two scoring schemes to achieve superior multi-alignments, and employs a banded dynamic programming variation to achieve a running time that is linear in the amount of sequence in the data set. We demonstrate that the algorithm improves upon the alignments produced by other assembly programs in a series of empirical experiments on simulated data. Finally, we present a pair of programs embodying the algorithms that are available from the Web site ftp://ftp.cs.arizona.edu/realigner.
Publication
Journal: Journal of Computational Biology
November/16/1998
Abstract
We present in this paper an algorithm for identifying satellites in DNA sequences. Satellites (simple, micro, or mini) are repeats in number between 30 and as many as 1,000,000 whose lengths vary between 2 and hundreds of base pairs and that appear, with some mutations, in tandem along the sequence. We concentrate here on short to moderately long (up to 30-40 base pairs) approximate tandem repeats where copies may differ up to epsilon = 15-20% from a consensus model of the repeating unit (implying individual units may vary by 2 epsilon from each other). The algorithm is composed of two parts. The first one consists of a filter that basically eliminates all regions whose probability of containing a satellite is less than one in 10(4) when epsilon = 10%. The second part realizes an exhaustive exploration of the space of all possible models for the repeating units present in the sequence. It therefore has the advantage over previous work of being able to report a consensus model, say m, of the repeated unit as well as the span of the satellite. The first phase was designed for efficiency and takes only O (n) time where n is the length of the sequence. The second phase was designed for sensitivity and takes time O (n . N (e, k)) in the worst case where k is the length of the repeating unit m, e = [epsilon k] is the number of differences allowed between each repeat unit and the model m, and N (e, k) is the maximum number of words that are not more than e differences from another word of length k. That is, N (e, k) is the maximum size of an e-neighborhood of a string of length k. Experiments reveal the second phase to be considerably faster in practice than the worst-case complexity bound suggests. Finally, the present algorithm is easily adapted to finding tandem repeats in protein sequences, as well as extended to identifying mixed direct-inverse tandem repeats.
Publication
Journal: Genomics
February/16/1995
Abstract
A new software system designed for use in high-throughput DNA sequencing laboratories is described. The Genome Reconstruction Manager (GRM) was developed from requirements derived from ongoing large-scale DNA sequencing projects. Object-oriented principles were followed in designing the system, and tools supporting object-oriented system development were employed for its implementation. GRM provides several advances in software support for high-throughput DNA sequencing: support for random, directed, and mixed sequencing strategies; a novel system for fragment assembly; a commercial object data-base management system for data storage; a client/server architecture for using network computational servers; and an underlying data model that can evolve to support fully automatic sequence reconstruction. GRM is currently being deployed for production use in high-throughput DNA sequencing projects.
Publication
Journal: Journal of Computational Biology
September/4/1996
Abstract
Two algorithmic results are presented that are pertinent to the matching of patterns typically used by biologists to describe regions of macromolecular sequences that encode a given function. The first result is a threshold-sensitive algorithm for approximately matching both network and regular expressions. Network expressions are regular expressions that can be composed only from union and concatenation operators. Kleene closure (i.e., unbounded repetition) is not permitted. The algorithm is threshold-sensitive in that its performance depends on the threshold, k, of the number of differences allowed in an approximate match. This result generalizes the O(kn) expected-time algorithm of Ukkonen for approximately matching keywords. The second result concerns the problem of matching a pattern that is a network expression whose elements are approximate matches to network or regular expressions interspersed with specifiable distance ranges. For this class of patterns, it is shown how to determine a backtracking procedure whose order of evaluation is optimal in the sense that its expected time is minimal over all such procedures.
Authors
Publication
Journal: Bulletin of Mathematical Biology
June/25/1992
Abstract
We present an O (R log P) time, O (M+P2) space algorithm for searching a restriction map with M sites for the best matches to a shorter map with P sites, where R, the number of matching site pairs, is bounded by MP. As first proposed by Waterman et al. (1984, Nucl. Acids Res. 12, 237-242) the objective function used to score matches is additive in the number of unaligned sites and the discrepancies in the distances between adjacent aligned sites. Our algorithm is basically a sparse dynamic programming computation in which "candidate lists" are used to model the future contribution of all previously computed entries to those yet to be computed. A simple modification to the algorithm computes the distance between two restriction maps with M and N sites, respectively, in O (MN (log M+log N)) time.
Publication
Journal: Bioinformatics
September/7/1998
Abstract
BACKGROUND
To provide a graphical interface for the generation, display and manipulation of a sequence landscape that will run on all X-windows-based Unix workstations.
RESULTS
The sequence landscape approach enables the representation of the frequency of occurrence of all query sequence sub-words within a database. The landscape approach can detect tandem and other repeating word motifs, specific sub-words that are over-represented words in a particular database using Markov probability and the preference for sub-words belonging to either one of two databases. All these features aid in the classification of a query sequence. Given the open-text format for sequences and databases, the Xlandscape tool can be applied to a wide range of problems.
Publication
Journal: Journal of Computational Biology
December/30/1997
Abstract
Current physical mapping projects based on STS-probes involve additional clues such as the fact that some probes are anchored to a known map and that others come from the ends of clones. Because of the disparate combinatorial contributions of these varied data items, it is difficult to design a "tailored" algorithm that incorporates them all. Moreover, it is inevitable that new experiments will provide new kinds of data, making obsolete any such algorithm. We show how to convert the physical mapping problem into a 0/1 linear programming (LP) problem. We further show how one can incorporate additional clues as additional constraints in the LP formulation. We give a simple relaxation of the 0/1 LP problem, which solves problems of the same scale as previously reported tailored algorithms, to equal or greater optimization levels. We also present a theorem proving that when the data is 100% accurate, then the relaxed and integer solutions coincide. The LP algorithm suffices to solve problems on the order of 80-100 probes--the typical size of the 2- or 3-connected contigs of Arratia et al. (1991). We give a heuristic algorithm which attempts to order and link the set of LP-solved contigs. Unlike previous work, this algorithm only links and orders contigs when the join is 90% or more likely to be correct. It is our view that there is no value in computing an optimal solution with respect to some criteria over very noisy data as this optimal solution rarely corresponds to the true solution. The paper involves extensive empirical trials over real and simulated data.
Publication
Journal: Nucleic Acids Research
March/6/1986
Abstract
We describe a program which may be used to find approximate matches to a short predefined DNA sequence in a larger target DNA sequence. The program predicts the usefulness of specific DNA probes and sequencing primers and finds nearly identical sequences that might represent the same regulatory signal. The program is written in the C programming language and will run on virtually any computer system with a C compiler, such as the IBM/PC and other computers running under the MS/DOS and UNIX operating systems. The program has been integrated into an existing software package for the IBM personal computer (see article by Mount and Conrad, this volume). Some examples of its use are given.
Publication
Journal: Computer applications in the biosciences : CABIOS
August/3/1988
Abstract
We describe software for aligning protein or nucleic acid sequences based on the concept of match density. This method is especially useful for locating regions of short similarity between two longer sequences which may be largely dissimilar (e.g. locating active site regions in distantly related proteins). Our software is able to identify biologically interesting similarities between two sub-regions because it allows the user to control the matching parameters and the manner in which local alignments are selected for display. Furthermore, the collection and ranking of alignments for display uses a novel, highly efficient algorithm. We illustrate these features with several examples. In addition, we show that this tool can be used to find a new conserved sequence in several viral DNA polymerases, which, we suggest, occurs at a functionally important enzymatic site.
Publication
Journal: Journal of Computational Biology
January/17/1996
Abstract
We present an efficient algorithm for scoring clones given an ordering of probes under a schema proposed by Alizadeh et al. (1994) in the context of physical mapping with unique probes. The algorithm runs in time linear in the number of blocks of ones in the underlying sparse incidence matrix. A sparse and efficient algorithm for this task is important as it appears to be a central task in most algorithms for physical mapping.