DAVID bioinformatics resources consists of an integrated biological knowledgebase and analytic tools aimed at systematically extracting biological meaning from large gene/protein lists. This protocol explains how to use DAVID, a high-throughput and integrated data-mining environment, to analyze gene lists derived from high-throughput genomic experiments. The procedure first requires uploading a gene list containing any number of common gene identifiers followed by analysis using one or more text and pathway-mining tools such as gene functional classification, functional annotation chart or clustering and functional annotation table. By following this protocol, investigators are able to gain an in-depth understanding of the biological themes in lists of genes that are enriched in genome-scale studies.
The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.
The University of Wisconsin Genetics Computer Group (UWGCG) has been organized to develop computational tools for the analysis and publication of biological sequence data. A group of programs that will interact with each other has been developed for the Digital Equipment Corporation VAX computer using the VMS operating system. The programs available and the conditions for transfer are described.
Functional analysis of large gene lists, derived in most cases from emerging high-throughput genomic, proteomic and bioinformatics scanning approaches, is still a challenging and daunting task. The gene-annotation enrichment analysis is a promising high-throughput strategy that increases the likelihood for investigators to identify biological processes most pertinent to their study. Approximately 68 bioinformatics enrichment tools that are currently available in the community are collected in this survey. Tools are uniquely categorized into three major classes, according to their underlying enrichment algorithms. The comprehensive collections, unique tool classifications and associated questions/issues will provide a more comprehensive and up-to-date view regarding the advantages, pitfalls and recent trends in a simpler tool-class level rather than by a tool-by-tool approach. Thus, the survey will help tool designers/developers and experienced end users understand the underlying algorithms and pertinent details of particular tool categories/tools, enabling them to make the best choices for their particular research interests.
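As an illustration of the kind of statistic behind gene-annotation enrichment, the following minimal sketch computes an over-representation p-value under a hypergeometric model. This model is an assumption chosen for illustration; it is not claimed to be the algorithm of any particular tool in the survey, and the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_pvalue(background, annotated, list_size, hits):
    """Upper-tail hypergeometric probability P(X >= hits).

    background: number of genes in the reference set
    annotated:  number of background genes carrying the annotation term
    list_size:  number of genes in the user's list
    hits:       number of list genes carrying the annotation term
    """
    max_hits = min(annotated, list_size)
    total = comb(background, list_size)
    # Sum the probability of observing `hits` or more annotated genes.
    tail = sum(
        comb(annotated, i) * comb(background - annotated, list_size - i)
        for i in range(hits, max_hits + 1)
    )
    return tail / total


# Hypothetical numbers: 8 of 100 list genes carry a term that
# annotates 200 of 20,000 background genes.
p_value = enrichment_pvalue(20000, 200, 100, 8)
```

A small p-value suggests the term appears in the list more often than expected by chance; in practice a correction for testing many terms would also be needed.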
We have used the Escherichia coli beta-glucuronidase gene (GUS) as a gene fusion marker for analysis of gene expression in transformed plants. Higher plants tested lack intrinsic beta-glucuronidase activity, thus enhancing the sensitivity with which measurements can be made. We have constructed gene fusions using the cauliflower mosaic virus (CaMV) 35S promoter or the promoter from a gene encoding the small subunit of ribulose bisphosphate carboxylase (rbcS) to direct the expression of beta-glucuronidase in transformed plants. Expression of GUS can be measured accurately using fluorometric assays of very small amounts of transformed plant tissue. Plants expressing GUS are normal, healthy and fertile. GUS is very stable, and tissue extracts continue to show high levels of GUS activity after prolonged storage. Histochemical analysis has been used to demonstrate the localization of gene activity in cells and tissues of transformed plants.
Unique DNA sequences can be determined directly from mouse genomic DNA. A denaturing gel separates by size mixtures of unlabeled DNA fragments from complete restriction and partial chemical cleavages of the entire genome. These lanes of DNA are transferred and UV-crosslinked to nylon membranes. Hybridization with a short 32P-labeled single-stranded probe produces the image of a DNA sequence "ladder" extending from the 3' or 5' end of one restriction site in the genome. Numerous different sequences can be obtained from a single membrane by reprobing. Each band in these sequences represents 3 fg of DNA complementary to the probe. Sequence data from mouse immunoglobulin heavy chain genes from several cell types are presented. The genomic sequencing procedures are applicable to the analysis of genetic polymorphisms, DNA methylation at deoxycytidines, and nucleic acid-protein interactions at single nucleotide resolution.
As vertebrate genome sequences near completion and research refocuses to their analysis, the issue of effective genome annotation display becomes critical. A mature web tool for rapid and reliable display of any requested portion of the genome at any scale, together with several dozen aligned annotation tracks, is provided at http://genome.ucsc.edu. This browser displays assembly contigs and gaps, mRNA and expressed sequence tag alignments, multiple gene predictions, cross-species homologies, single nucleotide polymorphisms, sequence-tagged sites, radiation hybrid data, transposon repeats, and more as a stack of coregistered tracks. Text and sequence-based searches provide quick and precise access to any region of specific interest. Secondary links from individual features lead to sequence details and supplementary off-site databases. One-half of the annotation tracks are computed at the University of California, Santa Cruz from publicly available sequence data; collaborators worldwide provide the rest. Users can stably add their own custom tracks to the browser for educational or research purposes. The conceptual and technical framework of the browser, its underlying MySQL database, and overall use are described. The web site currently serves over 50,000 pages per day to over 3000 different users.
A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies, a whole-genome assembly and a regional chromosome assembly, were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional approximately 12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history.
Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.
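The sequencing figures quoted in this abstract can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the text; the function names are hypothetical, and the genome size the authors used to derive the 5.11-fold figure is not stated here, so the implied size is only an estimate.

```python
# Figures quoted in the abstract.
TOTAL_BASES = 14.8e9        # bp of sequence generated
READS = 27_271_853          # high-quality sequence reads
FOLD_COVERAGE = 5.11        # stated coverage of the genome
CONSENSUS_BP = 2.91e9       # bp in the euchromatic consensus
BP_PER_DIFFERENCE = 1250    # one difference per 1250 bp between haploid genomes


def mean_read_length(total_bases, reads):
    """Average number of bases contributed by each read."""
    return total_bases / reads


def implied_genome_size(total_bases, fold_coverage):
    """Genome size implied by coverage = total bases / genome size."""
    return total_bases / fold_coverage


def expected_pairwise_differences(genome_bp, bp_per_difference):
    """Expected differing positions between two haploid genomes."""
    return genome_bp / bp_per_difference


read_len = mean_read_length(TOTAL_BASES, READS)
genome_size = implied_genome_size(TOTAL_BASES, FOLD_COVERAGE)
differences = expected_pairwise_differences(CONSENSUS_BP, BP_PER_DIFFERENCE)
```

Under these figures the mean read length comes out a little over 540 bp, and the rate of 1 bp per 1250 corresponds to roughly 2.3 million differences across the 2.91-billion bp consensus.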
The sequence of the mouse genome is a key informational tool for understanding the contents of the human genome and a key experimental tool for biomedical research. Here, we report the results of an international collaboration to produce a high-quality draft sequence of the mouse genome. We also present an initial comparative analysis of the mouse and human genomes, describing some of the insights that can be gleaned from the two sequences. We discuss topics including the analysis of the evolutionary forces shaping the size, structure and sequence of the genomes; the conservation of large-scale synteny across most of the genomes; the much lower extent of sequence orthology covering less than half of the genomes; the proportions of the genomes under selection; the number of protein-coding genes; the expansion of gene families related to reproduction and immunity; the evolution of proteins; and the identification of intraspecies polymorphism.
RNA interference (RNAi) is the process of sequence-specific, post-transcriptional gene silencing in animals and plants, initiated by double-stranded RNA (dsRNA) that is homologous in sequence to the silenced gene. The mediators of sequence-specific messenger RNA degradation are 21- and 22-nucleotide small interfering RNAs (siRNAs) generated by ribonuclease III cleavage from longer dsRNAs. Here we show that 21-nucleotide siRNA duplexes specifically suppress expression of endogenous and heterologous genes in different mammalian cell lines, including human embryonic kidney (293) and HeLa cells. Therefore, 21-nucleotide siRNA duplexes provide a new tool for studying gene function in mammalian cells and may eventually be used as gene-specific therapeutics.
High-throughput sequencing platforms are generating massive amounts of genetic variation data for diverse genomes, but it remains a challenge to pinpoint a small subset of functionally important variants. To fill these unmet needs, we developed the ANNOVAR tool to annotate single nucleotide variants (SNVs) and insertions/deletions, such as examining their functional consequence on genes, inferring cytogenetic bands, reporting functional importance scores, finding variants in conserved regions, or identifying variants reported in the 1000 Genomes Project and dbSNP. ANNOVAR can utilize annotation databases from the UCSC Genome Browser or any annotation data set conforming to Generic Feature Format version 3 (GFF3). We also illustrate a 'variants reduction' protocol on 4.7 million SNVs and indels from a human genome, including two causal mutations for Miller syndrome, a rare recessive disease. Through a stepwise procedure, we excluded variants that are unlikely to be causal, and identified 20 candidate genes including the causal gene. Using a desktop computer, ANNOVAR requires ∼4 min to perform gene-based annotation and ∼15 min to perform variants reduction on 4.7 million variants, making it practical to handle hundreds of human genomes in a day. ANNOVAR is freely available at http://www.openbioinformatics.org/annovar/.
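The stepwise "variants reduction" idea can be pictured as a chain of filters, each removing variants unlikely to be causal. The sketch below is not ANNOVAR's interface or its actual filter order; the record fields, gene names, and filter sequence are all hypothetical and serve only to show the general shape of such a procedure.

```python
def reduce_variants(variants, filters):
    """Apply named filters in order, recording how many variants survive each."""
    remaining = list(variants)
    counts = [("start", len(remaining))]
    for name, keep in filters:
        remaining = [v for v in remaining if keep(v)]
        counts.append((name, len(remaining)))
    return remaining, counts


# Toy records with hypothetical fields (not an ANNOVAR output format).
toy_variants = [
    {"gene": "GENE_A", "effect": "nonsynonymous", "in_dbsnp": False, "conserved": True},
    {"gene": "GENE_B", "effect": "synonymous", "in_dbsnp": False, "conserved": True},
    {"gene": "GENE_C", "effect": "nonsynonymous", "in_dbsnp": True, "conserved": True},
    {"gene": "GENE_D", "effect": "nonsynonymous", "in_dbsnp": False, "conserved": False},
]

# Assumed filter order, for illustration only.
toy_filters = [
    ("protein-altering", lambda v: v["effect"] != "synonymous"),
    ("absent from dbSNP", lambda v: not v["in_dbsnp"]),
    ("in conserved region", lambda v: v["conserved"]),
]

candidates, counts = reduce_variants(toy_variants, toy_filters)
```

Recording the count after each step makes it easy to see which filter removes the most variants and to check that no step has eliminated everything.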
The ability of p53 to activate transcription from specific sequences suggests that genes induced by p53 may mediate its biological role as a tumor suppressor. Using a subtractive hybridization approach, we identified a gene, named WAF1, whose induction was associated with wild-type but not mutant p53 gene expression in a human brain tumor cell line. The WAF1 gene was localized to chromosome 6p21.2, and its sequence, structure, and activation by p53 were conserved in rodents. Introduction of WAF1 cDNA suppressed the growth of human brain, lung, and colon tumor cells in culture. Using a yeast enhancer trap, a p53-binding site was identified 2.4 kb upstream of WAF1 coding sequences. The WAF1 promoter, including this p53-binding site, conferred p53-dependent inducibility upon a heterologous reporter gene. These studies define a gene whose expression is directly induced by p53 and that could be an important mediator of p53-dependent tumor growth suppression.
The complete sequence of the 16,569-base pair human mitochondrial genome is presented. The genes for the 12S and 16S rRNAs, 22 tRNAs, cytochrome c oxidase subunits I, II and III, ATPase subunit 6, cytochrome b and eight other predicted protein coding genes have been located. The sequence shows extreme economy in that the genes have none or only a few noncoding bases between them, and in many cases the termination codons are not coded in the DNA but are created post-transcriptionally by polyadenylation of the mRNAs.
We have catalogued the protein kinase complement of the human genome (the "kinome") using public and proprietary genomic, complementary DNA, and expressed sequence tag (EST) sequences. This provides a starting point for comprehensive analysis of protein phosphorylation in normal and disease states, as well as a detailed view of the current state of human genome analysis through a focus on one large gene family. We identify 518 putative protein kinase genes, of which 71 have not previously been reported or described as kinases, and we extend or correct the protein sequences of 56 more kinases. New genes include members of well-studied families as well as previously unidentified families, some of which are conserved in model organisms. Classification and comparison with model organism kinomes identified orthologous groups and highlighted expansions specific to human and other lineages. We also identified 106 protein kinase pseudogenes. Chromosomal mapping revealed several small clusters of kinase genes and revealed that 244 kinases map to disease loci or cancer amplicons.
Plasmid expression vectors have been constructed that direct the synthesis of foreign polypeptides in Escherichia coli as fusions with the C terminus of Sj26, a 26-kDa glutathione S-transferase (GST; EC 2.5.1.18) encoded by the parasitic helminth Schistosoma japonicum. In the majority of cases, fusion proteins are soluble in aqueous solutions and can be purified from crude bacterial lysates under non-denaturing conditions by affinity chromatography on immobilised glutathione. Using batch wash procedures several fusion proteins can be purified in parallel in under 2 h with yields of up to 15 micrograms protein/ml of culture. The vectors have been engineered so that the GST carrier can be cleaved from fusion proteins by digestion with site-specific proteases such as thrombin or blood coagulation factor Xa, following which, the carrier and any uncleaved fusion protein can be removed by absorption on glutathione-agarose. This system has been used successfully for the expression and purification of more than 30 different eukaryotic polypeptides.
Overlapping complementary DNA clones were isolated from epithelial cell libraries with a genomic DNA segment containing a portion of the putative cystic fibrosis (CF) locus, which is on chromosome 7. Transcripts, approximately 6500 nucleotides in size, were detectable in the tissues affected in patients with CF. The predicted protein consists of two similar motifs, each with (i) a domain having properties consistent with membrane association and (ii) a domain believed to be involved in ATP (adenosine triphosphate) binding. A deletion of three base pairs that results in the omission of a phenylalanine residue at the center of the first predicted nucleotide-binding domain was detected in CF patients.
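A deletion of exactly three base pairs removes one amino acid without shifting the reading frame, which is why a single phenylalanine is lost while the rest of the protein is unchanged. The sketch below demonstrates this on a short illustrative sequence; the sequence, the deletion position, and the deliberately partial codon table are assumptions for the example and are not taken from the text above.

```python
# Partial codon table: only the codons needed for this toy example.
CODONS = {"ATC": "I", "ATT": "I", "TTT": "F", "GGT": "G", "GTT": "V"}


def translate(dna):
    """Translate complete codons from the first base; trailing bases are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, usable, 3))


def delete(dna, start, length):
    """Remove `length` bases beginning at 0-based position `start`."""
    return dna[:start] + dna[start + length:]


wild_type = "ATCATCTTTGGTGTT"        # illustrative: Ile-Ile-Phe-Gly-Val
mutant = delete(wild_type, 5, 3)     # removes the three bases "CTT"

wild_protein = translate(wild_type)
mutant_protein = translate(mutant)
```

Here the deleted bases span two codons, yet the downstream codons stay in frame: the mutant peptide is the wild-type peptide minus the phenylalanine.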
The Huntington's disease (HD) gene has been mapped in 4p16.3 but has eluded identification. We have used haplotype analysis of linkage disequilibrium to spotlight a small segment of 4p16.3 as the likely location of the defect. A new gene, IT15, isolated using cloned trapped exons from the target area contains a polymorphic trinucleotide repeat that is expanded and unstable on HD chromosomes. A (CAG)n repeat longer than the normal range was observed on HD chromosomes from all 75 disease families examined, comprising a variety of ethnic backgrounds and 4p16.3 haplotypes. The (CAG)n repeat appears to be located within the coding sequence of a predicted approximately 348 kd protein that is widely expressed but unrelated to any known gene. Thus, the HD mutation involves an unstable DNA segment, similar to those described in fragile X syndrome, spino-bulbar muscular atrophy, and myotonic dystrophy, acting in the context of a novel 4p16.3 gene to produce a dominant phenotype.
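Measuring a trinucleotide repeat amounts to finding the longest uninterrupted run of the repeat unit in a sequence. The following minimal sketch does this with a regular expression; the function name and the test sequence are hypothetical, and no threshold for the normal range is assumed because the abstract does not give one.

```python
import re


def longest_repeat(dna, unit="CAG"):
    """Return the largest number of consecutive copies of `unit` in `dna`."""
    pattern = f"(?:{re.escape(unit)})+"
    runs = re.findall(pattern, dna.upper())
    # Each run is a whole number of units, so integer division is exact.
    return max((len(run) // len(unit) for run in runs), default=0)


# Hypothetical sequence with runs of five and two CAG units.
example = "GG" + "CAG" * 5 + "TT" + "CAG" * 2
repeat_count = longest_repeat(example)
```

Comparing this count between chromosomes is the kind of measurement that would reveal an expanded allele.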
The one-step gene disruption techniques described here are versatile in that a disruption can be made simply by the appropriate cloning experiment. The resultant chromosomal insertion is nonreverting and contains a genetically linked marker. Detailed knowledge of the restriction map of a fragment is not necessary. It is even possible to "probe" a fragment that is unmapped for genetic functions by constructing a series of insertions and testing each one for its phenotype.
A method has been developed whereby a very large number of colonies of Escherichia coli carrying different hybrid plasmids can be rapidly screened to determine which hybrid plasmids contain a specified DNA sequence or genes. The colonies to be screened are formed on nitrocellulose filters, and, after a reference set of these colonies has been prepared by replica plating, are lysed and their DNA is denatured and fixed to the filter in situ. The resulting DNA-prints of the colonies are then hybridized to a radioactive RNA that defines the sequence or gene of interest, and the result of this hybridization is assayed by autoradiography. Colonies whose DNA-prints exhibit hybridization can then be picked from the reference plate. We have used this method to isolate clones of ColE1 hybrid plasmids that contain Drosophila melanogaster genes for 18S and 28S rRNAs. In principle, the method can be used to isolate any gene whose base sequence is represented in an available RNA.
To date, more than 200 microRNAs have been described in humans; however, the precise functions of these regulatory, non-coding RNAs remain largely obscure. One cluster of microRNAs, the mir-17-92 polycistron, is located in a region of DNA that is amplified in human B-cell lymphomas. Here we compared B-cell lymphoma samples and cell lines to normal tissues, and found that the levels of the primary or mature microRNAs derived from the mir-17-92 locus are often substantially increased in these cancers. Enforced expression of the mir-17-92 cluster acted with c-myc expression to accelerate tumour development in a mouse B-cell lymphoma model. Tumours derived from haematopoietic stem cells expressing a subset of the mir-17-92 cluster and c-myc could be distinguished by an absence of apoptosis that was otherwise prevalent in c-myc-induced lymphomas. Together, these studies indicate that non-coding RNAs, specifically microRNAs, can modulate tumour formation, and implicate the mir-17-92 cluster as a potential human oncogene.