Clustering of DNA sequences in human promoters.
Journal: 2004/September - Genome Research
ISSN: 1088-9051
Abstract:
We have determined the distribution of each of the 65,536 DNA sequences that are eight bases long (8-mer) in a set of 13,010 human genomic promoter sequences aligned relative to the putative transcription start site (TSS). A limited number of 8-mers have peaks in their distribution (cluster), and most cluster within 100 bp of the TSS. The 156 DNA sequences exhibiting the greatest statistically significant clustering near the TSS can be placed into nine groups of related sequences. Each group is defined by a consensus sequence, and seven of these consensus sequences are known binding sites for the transcription factors (TFs) SP1, NF-Y, ETS, CREB, TBP, USF, and NRF-1. One sequence, which we named Clus1, is not a known TF binding site. The ninth sequence group is composed of the strand-specific Kozak sequence that clusters downstream of the TSS. An examination of the co-occurrence of these TF consensus sequences indicates a positive correlation for most of them except for sequences bound by TBP (the TATA box). Human mRNA expression data from 29 tissues indicate that the ETS, NRF-1, and Clus1 sequences that cluster are predominantly found in the promoters of housekeeping genes (e.g., ribosomal genes). In contrast, TATA is more abundant in the promoters of tissue-specific genes. This analysis identified eight DNA sequences in 5082 promoters that we suggest are important for regulating gene expression.
Genome Res 14(8): 1562-1574


Genome Analysis Unit, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
Laboratory of Metabolism, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
Corresponding author.
E-MAIL Vinsonc@dc37a.nci.nih.gov; FAX (301) 496-8419.
Received 2003 Sep 9; Accepted 2004 May 18.


Vertebrate gene expression is often regulated by the basal promoter, which traditionally is defined as being between –200 bp and the transcription start site (TSS). The DNA sequence properties of basal promoters are poorly described because it is difficult to identify the TSS. Two recent results have helped to resolve this problem: (1) RefSeq (Maglott et al. 2000; Pruitt et al. 2000; Pruitt and Maglott 2001) sequences have been mapped to their location in the complete human genome sequence, and (2) TSSs have been experimentally verified for 7889 genes by using cDNA synthesis methods that identify the 5′ CAP site (Suzuki et al. 2002). We have combined these data to assemble genomic DNA sequences that are putative promoter regions for 13,010 genes aligned relative to the putative TSS. We have examined these aligned sequences for 8-mers that are preferentially localized relative to the TSS, namely, clusters.

A fundamental question in gene expression studies is to determine which DNA sequences that are bound by TFs are biologically relevant. Often, the same DNA sequence is functional in one context but not in another. We reasoned that if a DNA sequence clusters relative to the TSS, the DNA sequences that are in the cluster have a high likelihood of being biologically significant. In human promoters the CAAT box, SP1, and TATA box are recognized by the constitutive transcription factors NF-Y, SP1, and TBP, respectively, and are thought to be localized near the TSS (Breathnach and Chambon 1981). Recently, a genome-wide analysis has demonstrated that the CRE sequence clusters in human promoters (Conkright et al. 2003).

To identify additional DNA sequences that localize near the TSS and thus may be biologically important, we determined the distribution of each of the 65,536 8-mer DNA sequences in 13,010 human promoter sequences from –2500 to +500 bp relative to the TSS. A detailed analysis of the 8-mers with the most significant clustering indicates that they primarily represent variations of only nine DNA consensus sequences. Eight motifs cluster between –100 bp and the TSS. They include (1) TF binding sites that have been previously suggested to cluster within the promoter (CAAT, SP1, CREB, and TATA); (2) TF binding sites that were not known to localize in the core promoter region (ETS, NRF-1, and USF); and (3) a single DNA sequence, designated Clus1, that is not a known TF binding site. The ninth motif is the Kozak sequence, which clusters downstream of the TSS. We observe correlations between the presence of DNA sequences that cluster in promoters and the mRNA expression properties and function of genes.
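The position counting described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the 20-bp bin width, the single-strand scan, and the assumption that every promoter string begins at –2500 are all introduced here for the example.

```python
from collections import defaultdict
from itertools import product

K = 8
UPSTREAM = 2500   # windows span -2500 .. +500 relative to the TSS (from the text)
BIN_WIDTH = 20    # assumed bin size; the text above does not state one


def all_kmers(k=K):
    """Enumerate every DNA k-mer; for k = 8 there are 4**8 = 65,536."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def kmer_position_histograms(promoters, k=K, upstream=UPSTREAM, bin_width=BIN_WIDTH):
    """Map each k-mer to {bin_start: count}, with positions relative to the TSS.

    Each promoter string is assumed to start at -upstream, so string index i
    corresponds to position i - upstream. Only the given strand is scanned.
    """
    hist = defaultdict(lambda: defaultdict(int))
    for seq in promoters:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):      # skip windows with N or other symbols
                pos = i - upstream
                bin_start = (pos // bin_width) * bin_width
                hist[kmer][bin_start] += 1
    return hist
```

A k-mer whose histogram has most of its counts in a few adjacent bins would be a candidate cluster; deciding which peaks are statistically significant requires the test described in the full text and is not attempted here.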

Table 1. One hundred fifty-six DNA sequences are grouped into related sequences and arranged by their peak position relative to the TSS. From left to right, the table lists the most abundant bin, the number of times the sequence occurs in the distribution, the 8-mer sequence, and the P value (see text). The end of the table contains the consensus sequences. There, the leftmost numbers are the bins defining the peak, followed by the clustering factor (CF), the consensus sequence, and the number of occurrences of the sequence in the bins that make up the peak. An exclamation point (!) denotes sequences that are at least threefold more abundant in the maximum bin on the DNA strand presented in the table than on the opposite strand. An asterisk (*) denotes sequences used in Tables 2 and 3. The IUPAC letters used to represent degenerate bases are R (G,A), W (A,T), Y (T,C), K (G,T), V (G,C,A), D (G,A,T), and N (A,T,G,C).
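The degenerate letters in this legend map naturally onto regular-expression character classes. The sketch below shows one way to search a sequence for a consensus on both strands; the helper names are hypothetical, and any consensus string passed in is the caller's choice rather than a value taken from the table.

```python
import re

# Degenerate codes exactly as listed in the legend above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[GA]", "W": "[AT]", "Y": "[TC]", "K": "[GT]",
    "V": "[GCA]", "D": "[GAT]", "N": "[ATGC]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def consensus_to_regex(consensus):
    """Compile an IUPAC consensus; the lookahead lets overlapping hits be found."""
    body = "".join(IUPAC[base] for base in consensus.upper())
    return re.compile("(?=(" + body + "))")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_sites(consensus, seq):
    """Return sorted (start, strand) pairs for matches on either strand."""
    pattern = consensus_to_regex(consensus)
    seq = seq.upper()
    hits = [(m.start(), "+") for m in pattern.finditer(seq)]
    n = len(consensus)
    rc = reverse_complement(seq)
    # Convert opposite-strand coordinates back to forward-strand start positions.
    hits += [(len(seq) - m.start() - n, "-") for m in pattern.finditer(rc)]
    return sorted(hits)
```

Reporting the strand separately matters for the strand-specific sequences marked with an exclamation point, since a palindromic consensus matches at the same position on both strands while a strand-specific one does not.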

Table 2. To the left are the eight consensus sequences, each followed by the number of its occurrences in the peak and the percentage of promoters containing the sequence. Across the top is the same set of consensus sequences. Each intersection gives the number of promoters containing both sequences in the peak, followed by the percentage of the promoters containing the top sequence that also contain the sequence from the side, and the probability of observing an intersection more extreme than the one found. For example, 20.7% of the 13,010 promoters contain the SP1 sequence in the peak (2696), but 33.5% of promoters that contain a USF sequence (191) in the peak also contain the SP1 sequence in the peak. The probability of this positive correlation between these two DNA sequences is P = 4.7. Correlations greater than P = 5 are shown in black; a positive correlation has an asterisk in the probability column.
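One common way to score such an overlap is an upper-tail hypergeometric probability. The legend does not name the test that was used, so the choice here is an assumption, as are the function name and the overlap count of 64, which is simply 33.5% of 191 rounded to a whole number of promoters.

```python
from math import comb, log10


def hypergeom_upper_tail(total, n_a, n_b, overlap):
    """P(X >= overlap) when n_b promoters are drawn from `total`, n_a of which carry A."""
    denom = comb(total, n_b)
    hi = min(n_a, n_b)
    tail = sum(
        comb(n_a, x) * comb(total - n_a, n_b - x)
        for x in range(overlap, hi + 1)
    )
    return tail / denom


# Counts from the legend: 13,010 promoters, 2696 with SP1, 191 with USF.
# overlap = 64 is an assumed value derived from the quoted 33.5%.
p = hypergeom_upper_tail(total=13010, n_a=2696, n_b=191, overlap=64)
print(p, -log10(p))
```

Under independence, about 20.7% of the 191 USF promoters (roughly 40) would be expected to carry SP1, so an overlap near 64 falls well into the upper tail. Whether the P values in the table are reported on a negative log10 scale is not stated in the legend, so the second printed number is offered only as a point of comparison.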

Table 3. A variety of functional characteristics were examined for each gene, including whether it had a Gene Ontology (GO) annotation, whether it was involved in a related biological process (e.g., ribosomal, proteasomal, or channel), and its mRNA expression properties (housekeeping or tissue specific). For each criterion, we present the total number of genes in the group. We then present three numbers for each consensus sequence: (1) the absolute number of promoters in the group with the consensus sequence in the peak, (2) the fraction of genes in the group that have this consensus sequence, and (3) a statistical measure of the correlation between these two terms. For example, 7.6% of the 13,010 promoters contain the CCAAT sequence in the peak, but only 2% of the 147 channel genes contain the CCAAT sequence in the peak. Correlations greater than P = 3 are shown in black; a positive correlation has an asterisk in the probability column.

Table 4. The genes with promoters that contain the ETS, NRF-1, Clus1, and TATA sequences are divided into two groups: those in which the consensus sequence is in the peak, and those in which it is not. The same set of parameters as in Table 3 is presented for each functional criterion and consensus sequence. For example, 124 of the 850 housekeeping genes contain the ETS sequence in the peak. This is 2.0 times more than the expected frequency: (124/850)/(1072/13,010). Correlations greater than P = 3 are shown in black; a positive correlation has an asterisk in the probability column.

Acknowledgments

We thank Barbara Graves for conversations about ETS DNA binding, Robert Perry for conversations about ribosomal gene promoters, and David FitzGerald for comments on the manuscript. This study used the high-performance computational capabilities of the Biowulf PC/Linux cluster at the National Institutes of Health, Bethesda, Maryland (http://biowulf.nih.gov).

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.


Notes

Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1953904. Article published online before print in July 2004.
