Structure and architecture of the maize genome.
Journal: 2006/March - Plant Physiology
ISSN: 0032-0889
Abstract:
Maize (Zea mays or corn) plays many varied and important roles in society. It is not only an important experimental model plant, but also a major livestock feed crop and a significant source of industrial products such as sweeteners and ethanol. In this study we report the systematic analysis of contiguous sequences of the maize genome. We selected 100 random regions averaging 144 kb in size, representing about 0.6% of the genome, and generated a high-quality dataset for sequence analysis. This sampling contains 330 annotated genes, 91% of which are supported by expressed sequence tag data from maize and other cereal species. Genes averaged 4 kb in size with five exons, although the largest was over 59 kb with 31 exons. Gene density varied over a wide range from 0.5 to 10.7 genes per 100 kb and genes did not appear to cluster significantly. The total repetitive element content we observed (66%) was slightly higher than previous whole-genome estimates (58%-63%) and consisted almost exclusively of retroelements. The vast majority of genes can be aligned to at least one sequence read derived from gene-enrichment procedures, but only about 30% are fully covered. Our results indicate that much of the increase in genome size of maize relative to rice (Oryza sativa) and Arabidopsis (Arabidopsis thaliana) is attributable to an increase in number of both repetitive elements and genes.
Plant Physiol 139(4): 1612-1624

Structure and Architecture of the Maize Genome

Munich Information Center for Protein Sequences, Institute for Bioinformatics, Gesellschaft für Strahlenforschung Research Center for Environment and Health, D–85764 Neuherberg, Germany (G.H., H.G., K.F.X.M.); Broad Institute of the Massachusetts Institute of Technology and Harvard, Cambridge, Massachusetts 02141 (S.Y., C.R., S.R., B.B., C.N.); Plant Genome Initiative at Rutgers, Waksman Institute, Rutgers, The State University of New Jersey, Piscataway, New Jersey 08854 (A.K.B., G.F., J.M.); and Arizona Genomics Institute, University of Arizona, Tucson, Arizona 85721 (E.B., R.A.W.)
Corresponding author; e-mail messing@mbcl.rutgers.edu; fax 732–445–0072.
These authors contributed equally to the paper.
Received 2005 Jul 21; Revised 2005 Sep 11; Accepted 2005 Oct 5.

Maize (Zea mays or corn) has a wide variety of uses and broad economic impact. It is a significant food source for humans, a chief ingredient in livestock feed, and is the source of a wide range of manufactured products, including sweeteners, fuel, and adhesives. It also has a long and storied history as a model organism in genetic studies. The combination of its genetic and economic importance has made maize a prime organism for genomic studies (for review, see Messing, 2005). Despite its evident value, progress toward generating a whole-genome sequence of maize has been held back by the cost and complexity of such a project. Although it is a medium-sized grass genome, at 2.4 Gb the maize genome is large compared to other sequenced plants and so will require significant funding to sequence. On top of this, its high repeat content poses computational challenges for accurately assembling a genome sequence.

In the absence of a genome sequence, studies of selected regions of the maize genome and comparisons to related species have been carried out. Comparative genetic analyses (Hulbert et al., 1990; Ahn and Tanksley, 1993; Moore et al., 1995; Gale and Devos, 1998) have suggested that significant portions of grass genomes are conserved (collinear). It has been proposed that, aside from polyploidization, large genome sizes in the grasses are caused primarily by the high content of repetitive elements (SanMiguel and Bennetzen, 1998; Meyers et al., 2001; Song et al., 2002). Several studies have investigated the local structure of orthologous regions in various grass species (Chen et al., 1997; Feuillet and Keller, 1999; Tikhonov et al., 1999; Tarchini et al., 2000; Ramakrishna et al., 2002a, 2002b; Song et al., 2002; Brunner et al., 2003; Ilic et al., 2003; Langham et al., 2004). These studies paint a picture of grass genomes that have macrocollinearity, or a general conservation of genes and gene order, but because of numerous small-scale genic rearrangements, such as insertions, deletions, amplifications, inversions, and translocations, lack perfect microcollinearity. Although the results are suggestive, the regions studied represent a tiny fraction of the genome. In addition, since all the regions were selected based on the presence of mapped genes of specific interest, they are also intrinsically biased and are not likely to be representative of the general genome organization. An accurate assessment of the content and organization of the maize genome requires a more comprehensive and unbiased dataset.

Existing data suggest that plant genomes are much more dynamic than similarly related animal genomes in terms of size, gene content, organization, and repeat content (for review, see Messing, 2005). For example, grass genomes range in size from 0.4 Gb in rice (Oryza sativa) to 16 Gb in wheat (Triticum aestivum). Because the rice genome is relatively small and has a low proportion of repetitive DNA, whole-genome sequencing efforts in the grasses were initially focused on rice. Rice has about 30% more genes than Arabidopsis (Arabidopsis thaliana), which is largely attributed to gene family expansion (International Rice Genome Sequencing Project, 2005). Even within a single species, significant deviations from gene collinearity are observed (Fu and Dooner, 2002; Song and Messing, 2003; Brunner et al., 2005), which can involve illegitimate recombination mediated by helicases (Lai et al., 2005). Several species of grasses have undergone whole-genome duplication (WGD) events, creating large internally duplicated regions. For example, as recently as 4.8 million years ago (mya), maize underwent a WGD by the hybridization of two progenitors (Swigoňová et al., 2004). Comparison of duplicated regions from the maize genome with the orthologous regions of rice and sorghum (Sorghum bicolor; whose progenitor split from the progenitors of maize only 11.9 mya) indicates that the maize genome has lost many of its duplicated genes. In addition, there is increasing evidence that a significant portion of genes in all these grass species may have moved to other locations within the genome over the last 50 million years (Lai et al., 2004b).

There are a variety of strategies for sequencing whole genomes, and part of the goal of this work was to generate a reference sequence for evaluation of an appropriate sequencing strategy for the maize genome. Suitability of a sequencing strategy to a genome depends on the character of the genome, the state of the technology, and availability of funding. Published strategies include whole-genome shotgun, clone by clone, various reduced representation shotgun (RRS) methods, and various combinations of these (Lander et al., 2001; Venter et al., 2001; Waterston et al., 2002; Bedell et al., 2005). Several new RRS strategies have been developed specifically to address the challenges posed by the high repeat content of maize, with the goal of enrichment of nonrepetitive regions prior to sequencing (Rabinowicz et al., 1999; Yuan et al., 2002; Yuan et al., 2003). Two fractionation methods were used to generate about 1 million sequence reads from the genome of the maize inbred line B73 (Palmer et al., 2003; Whitelaw et al., 2003). Effective evaluation of the performance of a genome sequencing strategy will be greatly facilitated by a high-quality, randomly selected sampling of the genome in relatively large regions (containing both genic and intergenic sequences).

To this end we randomly selected 100 bacterial artificial chromosomes (BACs) from the genome of the maize inbred line B73 for sequence analysis. They were sequenced to deep coverage and manually curated to derive an accurate consensus. This provided a high-quality reference sequence representing approximately 0.6% of the genome that can serve as a basis for both an unbiased study of genome content and evaluation of potential strategies for sequencing the whole maize genome. Based on the sequence information from this large random sampling, we undertook an assessment of the organization and structure of genes, repeat sequence families, and of the coverage by RRS datasets.
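The sampling figures quoted in the text can be checked with simple arithmetic. The short Python sketch below is illustrative only and is not part of the original analysis: it assumes the 2.4 Gb genome size, the 100 regions averaging 144 kb, and the 330 annotated genes reported in this article, and the function names are hypothetical. The mean gene density it prints is derived from those numbers and is not a value stated by the authors.

```python
# Back-of-the-envelope check of the sampling figures quoted in the article.
# All constants come from the text; the helper names are illustrative only.

GENOME_SIZE_BP = 2.4e9     # maize genome size stated in the introduction (2.4 Gb)
N_REGIONS = 100            # randomly selected regions (BACs)
MEAN_REGION_BP = 144_000   # average region size (144 kb)
N_GENES = 330              # annotated genes found in the sample


def sampled_bp(n_regions, mean_region_bp):
    """Total number of base pairs covered by the sampled regions."""
    return n_regions * mean_region_bp


def genome_fraction(sample_bp, genome_bp):
    """Fraction of the genome represented by the sample."""
    return sample_bp / genome_bp


def genes_per_100kb(n_genes, sample_bp):
    """Mean gene density across the whole sample, in genes per 100 kb."""
    return n_genes / sample_bp * 100_000


if __name__ == "__main__":
    total = sampled_bp(N_REGIONS, MEAN_REGION_BP)
    print(f"Sampled sequence: {total / 1e6:.1f} Mb")
    print(f"Fraction of genome: {genome_fraction(total, GENOME_SIZE_BP):.1%}")
    print(f"Mean gene density: {genes_per_100kb(N_GENES, total):.1f} genes per 100 kb")
```

Under these assumptions the sample totals 14.4 Mb, which matches the stated figure of approximately 0.6% of the genome, and the derived mean density falls within the reported range of 0.5 to 10.7 genes per 100 kb.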

Notes

This work was supported by the National Science Foundation Plant Genome (grant no. 0211851). Work at the Munich Information Center for Protein Sequences was in part supported by the Genomanalyse im biologischen System Pflanze program of the German Ministry for Education and Research.

The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Joachim Messing (messing@mbcl.rutgers.edu).

The online version of this article contains Web-only data.

www.plantphysiol.org/cgi/doi/10.1104/pp.105.068718.
