

Ab initio Gene Finding in Drosophila Genomic DNA

A Salamov

V Solovyev

The Sanger Centre, Hinxton, Cambridge CB10 1SA, UK

Corresponding author.

Received 2000 Feb 9; Accepted 2000 Feb 29.

Abstract

Ab initio gene identification in the genomic sequence of Drosophila melanogaster was performed using the Fgenes (human gene predictor) and Fgenesh programs, which have organism-specific parameters for human, Drosophila, plants, yeast, and nematode. We did not use information about cDNA/EST in most predictions, to model a real situation for finding new genes, because information about complete cDNA is often absent or based on very small partial fragments. We investigated the accuracy of gene prediction at different levels and designed several schemes to predict an unambiguous set of genes (annotation CGG1), a set of reliable exons (annotation CGG2), and the most complete set of exons (annotation CGG3). For 49 genes whose protein products have clear homologs in protein databases, predictions were recomputed by the Fgenesh+ program. The first annotation serves as the optimal computational description of a new sequence to be presented in a database. Reliable exons from the second annotation serve as good candidates for selecting PCR primers for experimental verification of gene structure. Our results show that we can identify ∼90% of coding nucleotides with 20% false positives. At the exon level we accurately predicted 65% of exons, and 89% including overlapping exons, with 49% false positives. Optimizing the accuracy of prediction, we designed a gene identification scheme using Fgenesh, which provided sensitivity (Sn) = 98% and specificity (Sp) = 86% at the base level, Sn = 81% (97% including overlapping exons) and Sp = 58% at the exon level, and Sn = 72% and Sp = 39% at the gene level (estimating sensitivity on the std1 set and specificity on the std3 set). In general, these results showed that computational gene prediction can be a reliable tool for annotating new genomic sequences, giving accurate information on 90% of coding sequences with 14% false positives. However, exact gene prediction (especially at the gene level) needs additional improvement of gene prediction algorithms. The Fgenesh program was also tested for predicting genes of human Chromosome 22 (the latest variant of Fgenesh can analyze the whole chromosome sequence). This analysis demonstrated that 88% of manually annotated exons in Chromosome 22 were among the ab initio predicted exons. The suite of gene identification programs is available through the WWW server of the Computational Genomics Group at http://genomic.sanger.ac.uk/gf.html.


Many bacterial and several eukaryotic genomes have been sequenced completely, and sequencing of the Drosophila, mouse, and human genomes is being pursued aggressively. The first challenge in analyzing sequence data is finding the genes. Knowledge of gene sequences has led to a new way of performing biological studies called functional genomics. The second major challenge is to find out what all of these new genes do, how they interact, and how they are regulated (Wadman 1998). Comparisons among genes of different genomes can provide additional insight into the details of gene structure and function. To meet these challenges we need advanced gene-finding algorithms and computer systems that use all available information, such as similarity with known proteins or ESTs, to increase the accuracy of genome annotation. We cannot precisely predict all gene components because of limitations in our knowledge of the complex biological processes and signals that regulate gene expression. In this respect, the analysis of 2.9 Mb of Drosophila sequence by several gene-finding approaches gives us a unique opportunity to define the reliability and limitations of our predictions and provides a strategy for interpreting predicted results in the analysis of new genomic sequences.

Current gene identification approaches (Burge and Karlin 1998) use dynamic programming and pattern-based or probabilistic schemes for scoring potential gene variants. They employ the best signal and content recognizers and an optimization technique developed previously (Burge and Karlin 1977; Brunak et al. 1991; Fickett and Tung 1992; Guigó et al. 1992; Snyder and Stormo 1993; Krogh et al. 1994; Stormo and Haussler 1994; Solovyev et al. 1994). We tested two gene prediction approaches developed in our group: Fgenes (pattern-based human gene prediction) and Fgenesh (hidden Markov model (HMM)-based gene prediction with Drosophila gene parameters).

The optimal strategy to annotate long genomic sequences and predict new genes was investigated. The best results were produced by the organism-specific Fgenesh program, which can accurately predict ∼80% of verified exons. The overpredicted exons (∼10%) can be false positives or can belong to genes that do not have corresponding ESTs or proteins and were not predicted by GENSCAN. Some of them represent retroviral genes, which we included in our annotation.

The set contains 61 genes and 370 exons. (CG) Correctly predicted genes; (Sne and Spe) sensitivity and specificity at the exon level; (Snb and Spb) sensitivity and specificity at the base level; (CC) correlation coefficient.

std3 contains 222 genes and 909 exons; std1 contains 43 genes and 123 exons. The annotated exons were taken from the set presented by organizers of GASP at the time of the initial data analysis. Later corrections were not included.

(Pe) Number of predicted exons; (Ce) number of correctly predicted exons; (Pg) number of predicted genes; (Cg) number of correctly predicted genes.

(Sn) Sensitivity (%); (Sp) specificity (%). At the exon level, the second number (after the slash) is the sensitivity counting both exactly predicted and overlapping exons.
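As an illustration of the exon-level measures described in these notes, the following is a minimal sketch, not the evaluation code used in the study. It assumes the convention common in gene-prediction benchmarks, where Sn is the fraction of annotated exons that are predicted and Sp is the fraction of predicted exons that are correct (so 100 - Sp corresponds to the false-positive rate quoted in the abstract). The function names and the representation of an exon as an inclusive `(start, end)` pair are hypothetical.

```python
from typing import Iterable, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive; hypothetical representation


def overlaps(a: Exon, b: Exon) -> bool:
    """True if the two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_level_accuracy(annotated: Iterable[Exon],
                        predicted: Iterable[Exon]) -> Tuple[float, float, float]:
    """Return (Sn exact, Sn exact-or-overlapping, Sp) as percentages.

    Assumed definitions (not taken from the paper's code):
      Sn exact   = exactly predicted annotated exons / annotated exons
      Sn overlap = annotated exons overlapped by any prediction / annotated exons
      Sp         = exactly predicted exons / predicted exons
    """
    annotated_set = set(annotated)
    predicted_set = set(predicted)
    if not annotated_set or not predicted_set:
        return 0.0, 0.0, 0.0

    exact = annotated_set & predicted_set
    overlapped = sum(
        1 for a in annotated_set
        if any(overlaps(a, p) for p in predicted_set)
    )

    sn_exact = 100 * len(exact) / len(annotated_set)
    sn_overlap = 100 * overlapped / len(annotated_set)
    sp = 100 * len(exact) / len(predicted_set)
    return sn_exact, sn_overlap, sp


if __name__ == "__main__":
    annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
    predicted = [(100, 200), (310, 400), (900, 950)]
    sn, sn_ov, sp = exon_level_accuracy(annotated, predicted)
    # Printed in the "exact/overlapping" style used in the tables
    print(f"Sn = {sn:.0f}/{sn_ov:.0f}  Sp = {sp:.0f}")
```

In this toy input one of four annotated exons is predicted exactly and a second is only overlapped, which is why the overlap-inclusive sensitivity is higher than the exact one.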

(Sn and Sp) Sensitivity and specificity at the exon level (%).

The BRCA2 region contains 20 verified genes and 168 exons.

(CC) The correlation coefficient reflecting the accuracy of prediction at the nucleotide level; (Snb and Spb) sensitivity and specificity at the base level (%); (Sne and Spe) sensitivity and specificity at the exon level (%); (Snep) exon sensitivity, including partially correct predicted exons (%); (PCg) number of partially correct genes.
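The base-level measures listed above can be illustrated with a short sketch. It is an assumption-laden example rather than the authors' code: it uses the standard definitions from gene-prediction benchmarks, Snb = TP/(TP+FN), Spb = TP/(TP+FP), and CC = (TP·TN − FP·FN) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN)), where TP, FP, TN, and FN count nucleotides. The function names, the 1-based inclusive exon coordinates, and the `seq_length` parameter are hypothetical.

```python
import math
from typing import Iterable, Set, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive; hypothetical representation


def coding_positions(exons: Iterable[Exon]) -> Set[int]:
    """Expand exon intervals into the set of coding nucleotide positions."""
    positions: Set[int] = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def base_level_accuracy(annotated: Iterable[Exon],
                        predicted: Iterable[Exon],
                        seq_length: int) -> Tuple[float, float, float]:
    """Return (Snb %, Spb %, CC) under the assumed standard definitions."""
    truth = coding_positions(annotated)
    pred = coding_positions(predicted)

    tp = len(truth & pred)             # coding, predicted coding
    fp = len(pred - truth)             # noncoding, predicted coding
    fn = len(truth - pred)             # coding, predicted noncoding
    tn = seq_length - tp - fp - fn     # noncoding, predicted noncoding

    snb = 100 * tp / (tp + fn) if tp + fn else 0.0
    spb = 100 * tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return snb, spb, cc


if __name__ == "__main__":
    snb, spb, cc = base_level_accuracy([(1, 10)], [(6, 15)], seq_length=20)
    print(f"Snb = {snb:.0f}%  Spb = {spb:.0f}%  CC = {cc:.2f}")
```

Expanding intervals into position sets keeps the example short; for megabase-scale sequences an interval-based overlap computation would be more economical.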

Acknowledgments

We thank Dr. Igor Seledtsov for collaborative work with Infogen visualization. Development of gene prediction approaches was supported by a Wellcome Trust research grant (to V.S.).

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.


Footnotes

E-MAIL solovyev@sanger.ac.uk; FAX 44-1-2223-494919.

