Subcellular localization of the yeast proteome
Abstract
Protein localization data are a valuable information resource helpful in elucidating eukaryotic protein function. Here, we report the first proteome-scale analysis of protein localization within any eukaryote. Using directed topoisomerase I-mediated cloning strategies and genome-wide transposon mutagenesis, we have epitope-tagged 60% of the Saccharomyces cerevisiae proteome. By high-throughput immunolocalization of tagged gene products, we have determined the subcellular localization of 2744 yeast proteins. Extrapolating these data through a computational algorithm employing Bayesian formalism, we define the yeast localizome (the subcellular distribution of all 6100 yeast proteins). We estimate the yeast proteome to encompass ∼5100 soluble proteins and >1000 transmembrane proteins. Our results indicate that 47% of yeast proteins are cytoplasmic, 13% mitochondrial, 13% exocytic (including proteins of the endoplasmic reticulum and secretory vesicles), and 27% nuclear/nucleolar. A subset of nuclear proteins was further analyzed by immunolocalization using surface-spread preparations of meiotic chromosomes. Of these proteins, 38% were found associated with chromosomal DNA. As determined from phenotypic analyses of nuclear proteins, 34% are essential for spore viability—a percentage nearly twice as great as that observed for the proteome as a whole. In total, this study presents experimentally derived localization data for 955 proteins of previously unknown function: nearly half of all functionally uncharacterized proteins in yeast. To facilitate access to these data, we provide a searchable database featuring 2900 fluorescent micrographs at http://ygac.med.yale.edu.
A global understanding of the molecular mechanisms underpinning cell biology necessitates an understanding not only of an organism's genome but also of the protein complement encoded within this genome (the proteome). In the past, data regarding an organism's proteome have typically been accumulated piecemeal through studies of a single protein or cell pathway. Genomic methodologies have altered this paradigm: a variety of approaches are now in place by which proteins may be directly analyzed on a proteome-wide scale. Chromatography-coupled mass spectrometry (Gygi et al. 1999; Washburn et al. 2001), large-scale two-hybrid screens (Uetz et al. 2000; Ito et al. 2001; Tong et al. 2002), immunoprecipitation/mass spectrometric analysis of protein complexes (Gavin et al. 2002; Ho et al. 2002), and protein microarray technologies (MacBeath and Schreiber 2000; Zhu et al. 2000, 2001) are yielding unprecedented quantities of protein data. Recent genomic techniques combining microarray technologies with either chromatin immunoprecipitation (Ren et al. 2000; Iyer et al. 2001) or targeted DNA methylation (van Steensel et al. 2001) have been used to globally map binding sites of chromosomal proteins in vivo. Initiatives are even underway to automate and industrialize processes by which protein structures may be solved, potentially providing a library of structural data from which homologous proteins may be modeled (Burley 2000; Montelione 2001).
Although these approaches promise a wealth of information, many fundamental proteomic data sets remain uncataloged. Notably, the subcellular distribution of proteins within any single eukaryotic proteome has never been extensively examined, despite the usefulness and importance of these data. Protein localization is assumed to be a strong indicator of gene function. Localization data are also useful as a means of evaluating protein information inferred from genetic data (e.g., supporting or refuting putative protein interactions suggested from two-hybrid analysis; Ito et al. 2001). Furthermore, the subcellular localization of a protein can often reveal its mechanism of action.
To determine the subcellular localization of a protein, its corresponding gene is typically either fused to a reporter or tagged with an epitope. Reporters and epitope tags are fused routinely to either the N or C termini of target genes, a choice that can be critical in obtaining accurate localization data. Organelle-specific targeting signals (e.g., mitochondrial targeting peptides and nuclear localization signals) are often located at the N terminus (Silver 1991); N-terminal reporter fusions may disrupt these sequences, resulting in anomalous protein localizations. In other cases, C-terminal sequences may be important for proper function and regulation, as recently shown from analysis of the yeast γ-tubulin-like protein Tub4p (Vogel et al. 2001). Gene copy number can also have an impact on the accuracy with which a protein is localized; overexpressed protein products may saturate intracellular transport mechanisms, potentially producing an aberrant subcellular protein distribution. In other cases, weakly expressed single-copy genes may not yield sufficient protein to be visualized, particularly by fluorescence microscopy. The effects of copy number and reporter/tag orientation on protein localization, however, have never been studied in a large data set.
To date, few studies have characterized protein localization on a large scale, primarily because few high-throughput methods exist by which reporter fusions or epitope-tagged proteins can be generated and subsequently localized. Typically, systematic approaches have been used to construct a limited number of chimeric reporter fusions applicable to pilot localization studies. For example, >100 human cDNAs have been cloned as N- and C-terminal gene fusions to spectral variants of green fluorescent protein (GFP) as a means of examining the subcellular localization of these proteins in living cells (Simpson et al. 2000). Thus far, the majority of localization studies have been undertaken in yeast, owing primarily to the fidelity of homologous recombination in Saccharomyces cerevisiae and the concomitant ease with which integrated reporter gene fusions can be generated. As part of a pilot study in S. cerevisiae, Niedenthal et al. (1996) constructed GFP reporter fusions to three unknown open reading frames (ORFs) from yeast Chromosome XIV and subsequently localized these chimeric GFP-fusion proteins by fluorescence microscopy.
In addition to directed cloning methods, strains suitable for localization analysis may be generated through random approaches. Recently, a plasmid-based GFP-fusion library of Schizosaccharomyces pombe DNA was constructed by fusing random fragments of genomic DNA upstream of GFP-coding sequence. Fission yeast cells transformed with this library were subsequently screened for GFP fluorescence, and 250 independent gene products were localized (Ding et al. 2000). In S. cerevisiae, transposon-based methods have been used to generate random lacZ gene fusions (Burns et al. 1994) and epitope-tagged alleles (Ross-MacDonald et al. 1999) for subsequent immunolocalization. Although these transposon-based studies have resulted in the localization of ∼300 yeast proteins, the majority of the S. cerevisiae proteome has remained uncharacterized with regard to its subcellular distribution.
To address this deficiency, we have undertaken the largest analysis to date of protein localization in yeast. Employing high-throughput methods of epitope-tagging and immunofluorescence analysis, our study defines the subcellular localization of 2744 proteins. By integrating these localization data with those previously published, we identify the subcellular localization of >3300 yeast proteins, 55% of the proteome. Building on these data, we have applied a Bayesian system to estimate the intracellular distribution of all 6100 yeast proteins and have further characterized a subset of nuclear proteins both by immunolocalization on surface-spread chromosomal preparations and by phenotypic analysis. In total, our findings provide a wealth of insight into protein function, while formally corroborating an expected link between protein function and localization. Furthermore, this study provides experimentally derived localization data for nearly 1000 proteins of previously unknown function, thereby providing, at minimum, a starting point for informed analysis of this previously uncharacterized segment of the proteome.
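The general idea of a Bayesian estimate of localization can be illustrated with a minimal sketch. This is not the algorithm used in this study, whose details are not given in this section; it is a generic naive-Bayes update shown only to make the reasoning concrete. The priors are the compartment fractions reported in the abstract. The feature names (has_nls, has_mito_presequence, has_tm_segment) and all likelihood values are hypothetical and were chosen purely for illustration.

```python
# Minimal naive-Bayes sketch of compartment assignment.
# ASSUMPTION: this is a generic illustration, not the authors' method.

# Priors: compartment fractions reported in the abstract.
PRIOR = {
    "cytoplasmic": 0.47,
    "mitochondrial": 0.13,
    "exocytic": 0.13,
    "nuclear": 0.27,
}

# HYPOTHETICAL likelihoods P(feature present | compartment).
# These numbers are invented for illustration, not measured values.
LIKELIHOOD = {
    "has_nls": {
        "cytoplasmic": 0.05, "mitochondrial": 0.02,
        "exocytic": 0.02, "nuclear": 0.60,
    },
    "has_mito_presequence": {
        "cytoplasmic": 0.03, "mitochondrial": 0.70,
        "exocytic": 0.02, "nuclear": 0.02,
    },
    "has_tm_segment": {
        "cytoplasmic": 0.05, "mitochondrial": 0.30,
        "exocytic": 0.60, "nuclear": 0.05,
    },
}


def posterior(features):
    """Return P(compartment | features) under a naive-Bayes assumption.

    `features` maps a feature name to True (observed present) or
    False (observed absent). Features not listed are simply ignored.
    """
    scores = dict(PRIOR)
    for name, present in features.items():
        for compartment in scores:
            p = LIKELIHOOD[name][compartment]
            scores[compartment] *= p if present else (1.0 - p)
    total = sum(scores.values())
    return {compartment: s / total for compartment, s in scores.items()}


if __name__ == "__main__":
    # A hypothetical protein with a predicted NLS and no TM segment.
    result = posterior({"has_nls": True, "has_tm_segment": False})
    for compartment, prob in sorted(result.items(), key=lambda kv: -kv[1]):
        print(f"{compartment:14s} {prob:.3f}")
```

With no features supplied, the function returns the priors unchanged; each observed feature then shifts probability toward the compartments in which that feature is assumed to be more common. A real analysis would estimate the likelihoods from the experimentally localized proteins rather than fix them by hand.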
A subset of yeast proteins is represented within both the V5- and HAT-tagged data sets; therefore, the cumulative totals correspond to the number of distinct proteins within the union of these two data sets. The number of functionally characterized proteins (as extracted from the MIPS CYGD) showing each respective staining pattern is indicated in parentheses beside the cumulative totals (see Fig. 6). Major subcategories within the mixed and other categories are indicated. Specific protein localization data and corresponding immunofluorescence images may be accessed at http://ygac.med.yale.edu (Protein Localization in Yeast link).
Acknowledgments
We thank James R. Chambers, Shannon Hattier, and Jon Rowland of Invitrogen Corporation for strain organization and DNA preparation. This work was supported by NIH Grant R01-CA77808 to M.S. A.K. is supported by a postdoctoral fellowship from the American Cancer Society.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.
Footnotes
E-MAIL michael.snyder@yale.edu; FAX (203) 432-6161.
Article and publication are at http://www.genesdev.org/cgi/doi/10.1101/gad.970902.