Genome-wide association study of 107 phenotypes in a common set of Arabidopsis thaliana inbred lines
Although pioneered by human geneticists as a potential solution to the challenging problem of finding the genetic basis of common human diseases1,2, advances in genotyping and sequencing technology have made genome-wide association (GWA) studies an obvious general approach for studying the genetics of natural variation and traits of agricultural importance. They are particularly useful when inbred lines are available because once these lines have been genotyped, they can be phenotyped multiple times, making it possible (as well as extremely cost-effective) to study many different traits in many different environments, while replicating the phenotypic measurements to reduce environmental noise. Here we demonstrate the power of this approach by carrying out a GWA study of 107 phenotypes in Arabidopsis thaliana, a widely distributed, predominantly selfing model plant, known to harbor considerable genetic variation for many adaptively important traits3. Our results are dramatically different from those of human GWA studies in that we identify many common alleles with major effect, but they are also, in many cases, harder to interpret because confounding by complex genetics and population structure make it difficult to distinguish true from false associations. However, a priori candidates are significantly overrepresented among these associations as well, making many of them excellent candidates for follow-up experiments by the Arabidopsis community. Our study clearly demonstrates the feasibility of GWA studies in A. thaliana, and suggests that the approach will be appropriate for many other organisms.
The genotyped sample (Supplementary Table 1) includes a core set of 95 lines4 for which a wide variety of phenotypes were available, plus a second set of 96 for which many phenotypes related to flowering were available5. The lines were genotyped using a custom Affymetrix SNP-chip containing 250,000 SNPs6. Since the genome of A. thaliana is around 120 million bp and the extent of LD comparable to that in humans7,8, the resulting SNP density of one SNP per 500 bp is considerably higher than is commonly used in human studies6.
To evaluate the feasibility of GWA studies in this organism, a variety of phenotypes were generated or assembled. The phenotypes broadly fell into four categories: 23 were related to flowering under different environmental conditions; 23 were related to defense, ranging from recognition of specific bacterial strains to trichome density; 18 were element concentrations measured using inductively coupled plasma mass spectroscopy (“ionomics”); and 43 were loosely defined developmental traits, including dormancy and plant senescence. For details about each phenotype, see Supplementary Tables 2–5. The flowering phenotypes are generally strongly positively correlated, and are also negatively correlated with some other phenotypes, e.g., those related to size at flowering (see Supplementary Fig. 9).
We first assessed evidence of association between each SNP and phenotype using the non-parametric Wilcoxon Test (Fisher’s Exact Test was used for the small number of phenotypes that were categorical rather than quantitative). A major difference between our study and the human GWA studies published to date is that our study population is heavily structured, and there is thus every reason to expect elevated false-positive rates9. Indeed, most phenotypes gave rise to a distribution of p-values that was strongly skewed toward zero (Supplementary Section 2.1.6). Fig. 1a shows the number of distinct peaks of association identified for each phenotype using different p-value thresholds, as well as the number expected by chance alone. There is an excess of strong associations across phenotypes, as expected given the presence of confounding population structure (although we also expect some of these associations to be true). Furthermore, the degree of confounding varies greatly between phenotypes. Phenotypes related to flowering are generally more strongly affected, as would be expected given the correlation between flowering and geographic origins.
The population structure in our sample is highly complex, involving patterns of relatedness on all scales (see Supplementary Fig. 4). As in previous studies9, we found that, at least in terms of producing a p-value distribution that does not show obvious signs of confounding, statistical methods commonly used to control for population structure in human genetics10,11 fail to correct for population structure, whereas the mixed-model approach introduced by maize geneticists12 appears to perform well (see Supplementary Section 2.1.6). Fig. 1b shows the number of peaks of association identified using this approach (as implemented in the program EMMA13). The excess of nominally significant association for flowering-related phenotypes has been eliminated, as would be expected if this excess were mostly due to confounding by population structure. There is a marked reduction in the number of associations of moderate significance (e.g., around 10−4) across phenotypes, but the excess of highly significant associations clearly persists (or has even become greater). This is precisely what would be expected from an increase in statistical power by switching to a parametric method that reduces confounding. It is tempting to conclude that most of these extreme p-values must represent true associations, but there are reasons to be skeptical. First, although EMMA appears to produce a p-value distribution that conforms to the null-expectation (except for extreme values: see Supplementary Figs. 12–118), it seems almost certain that some confounding remains (see below). Second, simulation studies suggest that the p-values produced by EMMA are not always well estimated and should be interpreted with caution (see Supplementary Section 2.1.5).
It is thus not straightforward to distinguish true from spurious associations, regardless of whether we correct for population structure. There is no doubt, however, that there is real signal in the data. Indeed, regardless of method used, six phenotypes yield single, strong peaks of association that are obvious by eye. In all cases, the association results effectively identify single genes, and correspond to known functional polymorphisms. An example of this is shown in Fig. 2, where the hypersensitive response to the bacterial avirulence gene AvrRpm1 directly identifies the corresponding resistance gene RESISTANCE TO P. SYRINGAE PV MACULICOLA 1 (RPM1)14. Similar results were obtained for other disease resistance (R) responses, sodium concentration, lesioning, and FRIGIDA (FRI) expression (see Supplementary Figs. 35, 37, 60, 76, and 21, respectively).
More generally, SNPs closely linked to genes that are a priori likely to be responsible for a particular phenotype are significantly over-represented among SNPs associated with that phenotype. For many of the phenotypes analyzed it is possible to predict which genes might be important in natural variation based on existing functional knowledge. For these phenotypes we determined which of our SNPs were located within 20 kb of an a priori candidate, and tested whether these SNPs were over-represented among nominally significant associations. Fig. 3 illustrates the procedure for flowering time at 10°C (FT10). For example, SNPs with a p-value less than 10−3 from EMMA and a p-value less than 10−5 from the Wilcoxon test are 2.7 times more likely to be close to candidate genes than are randomly chosen SNPs. This simultaneously demonstrates that background functional knowledge about flowering pathways helps predict which genes are involved in natural variation and that our GWA results identify many true associations. Indeed, by assuming that all associations not involving a priori candidates are false, we see that the inverse of the enrichment ratio provides a crude upper bound for the false-positive rate among the a priori candidates (see Supplementary Section 3.2). Continuing the example in Fig. 3, no more than 40% of the candidate SNPs that are significant using both tests are false. This is an upper bound because it seems almost certain that many of the (much larger set of) strong associations that are not close to a priori candidates will also turn out to be real.
As illustrated in Fig. 3, a priori candidates are over-represented among strongly associated SNPs regardless of whether we correct for population structure or not, and the SNPs identified are not necessarily the same. Enrichment is clearly greatest among SNPs that are strongly associated using both methods, however. This is true across the flowering-related phenotypes (Supplementary Fig. 10), and it is thus clear that both methods have utility. More stringent thresholds typically yield stronger enrichment, but the variance also increases because the number of significant genes decreases, and there is thus no simple relationship between degree of enrichment and its statistical significance (Supplementary Fig. 11). Results for the other phenotypes are consistent with those for the flowering-related traits, but the candidate gene lists are too short for statistical analysis (Supplementary Section 3.1).
An additional problem in identifying true positives was the existence of complex peaks of association. While many peaks were sharply defined and clearly identified a small number of genes (illustrated in Fig. 2b), others were much more diffuse, sometimes covering several hundred kb without a clear center. Fig. 4 shows an example of such a peak, and also suggests an explanation for their existence. The figure shows the pattern of association with FLOWERING LOCUS C (FLC) expression in the chromosomal region containing the vernalization-response gene FRI. Polymorphisms in FRI are known to affect flowering time partly through their effect on expression of FLC5,15. SNPs in the FRI region should thus be associated with FLC expression. This is indeed the case, but rather than a single peak of association centered on FRI, we have a mountain range covering 500 kb and on the order of a hundred genes (Fig. 4a).
That FRI should be surrounded by a wide peak of association is, in itself, not surprising given that the two common loss-of-function alleles at FRI appear to have been the subject of recent positive selection16. Indeed, the entire range collapses if these two alleles are added as co-factors to the model (Fig. 4d). More surprising is the fact that these two causal polymorphisms do not have the strongest association within the region. If we reduce allelic heterogeneity by factoring out one or the other of the two alleles, the significance of the remaining polymorphism increases, but it is still not the most significant in the region (Fig. 4b–c). A likely explanation for this is that some SNPs in the region are positively correlated (in linkage disequilibrium) with one of the FRI alleles (because of linkage) and the genomic background (because of population structure). This dual confounding is sufficient to make some of these SNPs more strongly associated with the phenotype than the true positives.
Given the difficulties described above, deciding which associations are worth following-up must necessarily be highly subjective. The strongest associations do not always correspond to obvious candidates and are perhaps more interesting than associations in genes with known function. However, in the absence of further evidence there is little point in discussing these associations. Supplementary Table 6 lists some of the most promising associations: additional a posteriori candidates for each phenotype are given in Supplementary Figs. 12–118. The genes listed were selected based on annotation from within a 20 kb window surrounding each of the top 500 most strongly associated SNPs (with minor allele frequency ≥0.1 for EMMA), distinguishing between those that had been considered candidates a priori and those that had not (the latter category is marked with asterisks in the tables). As demonstrated in Fig. 3, we expect a high fraction of the associated a priori candidates to be real. The full data are available through the project website (http://arabidopsis.usc.edu).
For the flowering-related phenotypes, one of the most striking findings was the strong correlation between phenotypes generated under very different growth-chamber and greenhouse conditions (Supplementary Table 2 and Supplementary Fig. 9: note that phenotypes from a field experiment were much less strongly correlated). As expected given the correlation in phenotypes, there are several regions of association that are shared across the majority of the flowering phenotypes (Supplementary Fig. 119). These regions vary considerably in width, and many of them are complex in the sense of Fig. 4, perhaps as consequence of strong selection. As expected given the results presented in Fig. 3, several of these regions coincided with a priori candidates, like FRI15 and FLC17 (see Supplementary Table 6). Another interesting candidate is DELAY OF GERMINATION 1 (DOG1)18, which, though not originally thought to be involved with flowering, is highly associated with 20 different flowering phenotypes.
Among the other phenotypes, three previously identified resistance (R) genes polymorphisms were readily identified8, as were genes known to be involved with variation in sodium (Na)19 and molydenum (Mo)20 levels (Supplementary Table 6). Four genes known to be involved trichome-formation were strongly associated with both trichome phenotypes: one of these associations has recently been confirmed21. Finally, ACD6, which has been experimentally shown to be directly involved in lesioning22, is detected here as associated with several lesioning and chlorosis phenotypes. This association has also recently been experimentally confirmed23.
By the standards of human GWA studies, the sample sizes used in this study (~96 or ~192 lines) are tiny. It may thus seem surprising that we are able to map anything at all. However, power depends on the genetic architecture of the traits as well, and this works in our favor in at least two ways. First, we clearly find common alleles of major effect. Although effect sizes are hard to estimate for the same reason p-values are, we note, for example, that in 44 phenotypes at least one of the 50 most strongly associated SNPs with a minor allele frequency greater than 15% explain more than 20% of the phenotypic variance (effect-size estimates and allele frequencies for every association are on the project website). This is very different from human studies, which have generally identified only polymorphisms of very small phenotypic effect. The difference is likely due to the fact that, whereas human studies have focused on traits that are either deleterious or under strong stabilizing selection (for which a very strong trade-off between allelic effect and frequency is expected24), we are working with adaptively important traits. Indeed, human GWA studies focusing on traits like skin color seem to yield results more like those presented here25,26.
Second, our study takes full advantage of the fact that we are working with inbred lines that can be grown in replicate under controlled conditions, making is possible to study multiple phenotypes while controlling environmental noise. Partially as a result of this, heritabilities for the traits studied are generally high, ranging from 42% for aphid number to over 99% for several flowering traits (Supplementary Table 7).
Without these advantages, the amount of genotyping required would have made a study like the present one prohibitively expensive. That said, there is little doubt that power in our study is severely limited by sample size. Simulation studies indicate that our power to detect alleles similar to those actually detected in the study is often no more than 30–40% using a sample size of 96 (results not shown). Increasing the sample size to 192 typically more than doubles power. Well over 1,000 lines genotyped with our 250k SNP chip will soon be available: we look forward to seeing associations from these lines. Efforts are also underway to sequence the genomes of all these lines (http://www.1001genomes.org) — this will greatly facilitate follow-up studies, but we do not expect a massive increase in power as a result of this, because the SNP density used here seems adequate.
Power also depends on sample composition. In comparison with typical human GWA studies, our sample is characterized by extremely strong population structure. This is expected given that our global sample was collected partly to study population structure in A. thaliana4. GWA studies that utilize different samples (including regional, more homogeneous ones) are under way.
The fact that population structure can cause confounding and lead to an elevated false-positive rate is well known and the relative advantages of alternative statistical methods to correct for this have been much debated10,11,12,27. We feel that the discussion of this phenomenon has often been misleading, in that population structure is neither necessary nor sufficient for confounding to occur. At least for complex traits, the problem is better thought of as model mis-specification: when we carry out GWA analysis using a single SNP at a time (as is done in this, and most other GWA studies to date), we are in effect modeling a multi-factorial trait as if it were due to a single locus. The polygenic background of the trait is ignored, as are other unobserved variables. This kind of marginal analysis causes no problem as long as the background is adequately captured by a variance term (or similar), but if the background variables are correlated with the SNP included in the model, bias will result. Population structure will lead to correlations (i.e., linkage disequilibrium) between unlinked loci, and this will usually (but not always28) lead to confounding. Positive correlations are also expected as a result of strong selection. Both factors are likely to be important in the present study: for example, it is easy to imagine that plants from northern Sweden will tend to share cold-adaptive alleles at many causal loci as a result of selection, and marker alleles genome-wide as a result of demographic history.
This way of thinking about the problem helps us interpret many of the results presented above. First, it becomes clear why GWA works so well for traits that are monogenic, or at least are mostly due to a single major locus. Examples in our study include the R gene responses (RPM1, RPS2, and RPS5), FRI expression (FRI itself), and lesioning (ACD6). In all cases, GWA yields unambiguous results regardless of whether we correct for population structure. The reason is not that there is no confounding in these cases. The problem that has received so much attention in human genetics — inflated significance among unlinked, non-causal loci — is clearly present, but with truly genome-wide coverage this is not very important, because the true positive is expected to show the strongest association.
Second, it helps us explain the occurrence of broad, complex regions of association. As ex-emplified in Fig. 4, these can arise when SNPs in a region containing a major causative allele are positively correlated not only with that causative allele (due to linkage disequilibrium in the narrow sense), but also with the genomic background (because of population structure and/or natural selection). The paradoxical consequence is that instead of a single peak of association centered on the causative locus, we expect a complex mountain landscape where many non-causal markers will show stronger association than the causative allele itself. This makes it difficult to identify the causal variant within such regions. This type of confounding does not appear to have been recognized in the literature, and probably deserves more attention.
Third, it is clear that we should not expect statistical methods that are designed to take genome-wide patterns of relatedness into account10,11,12 to correct for confounding that is due to selection generating correlations between causal loci. These types of methods will work only to the extent that the loci responsible for the genomic background have allele-frequency distributions that are similar to those of non-causal loci, which is expected only if selection on each locus is weak. Accurate estimation of the size of the effects of the many candidate polymorphisms identified here (including distinguishing it from zero) will require either crosses or transgenic experiments. Any cross will eliminate long-range linkage disequilibrium, and short-range linkage disequilibrium can be overcome by choosing appropriate parental strains. For example, the FRI region in Fig. 4 also contains a promising association at CRYPTIC PRECOCIOUS (CRP), less than 100 kb away from FRI (see Supplementary Fig. 127). If polymorphism at FRI is taken into account, CRP is no longer significantly associated, but this does not necessarily mean that the association is spurious. The two genes are too closely linked to be separated using standard crosses, but we can select parental strains that segregate for CRP but not FRI.
Such crosses are currently being carried out, by us and by other members of the Arabidopsis community. We also anticipate that many more phenotypes will be generated and added to our public database. By combining results from GWA, linkage mapping, and perhaps also intermediate phenotypes (e.g., expression data), it will be possible to make progress on deconstructing the regulatory networks that determine natural variation.
As genotyping and sequencing costs continue to decrease, GWA studies will become a standard tool for dissecting natural variation. It is thus important to recognize their limitations. The problems raised here are not unique to A. thaliana. GWA alone will often not allow accurate estimate of allelic effects. It must also be remembered that all mapping studies are biased in the sense that they can only detect alleles that explain a sufficient fraction of the variation in the mapping population29. The present study can only detect alleles that are reasonably common in our global sample. A GWA study using a more local sample would undoubtedly uncover more variants that are locally common, and linkage mapping will identify major polymorphisms that happen to be segregating in the cross, even if one of the alleles is extremely rare in natural populations. The “genetic architecture” of a trait depends on the population studied. In order to determine how genetic architecture affects selection and evolution, we thus also need to understand the spatial and temporal scales over which selection is important.
Because slightly different sets were used in different phenotyping experiments, the total number of lines used was 199 (Supplementary Table 1). Genotyping was done using standard protocols, and a combination of SNP calling and imputation algorithms were used to analyze the results (see Supplementary Section 1). We called 216,130 SNPs, at an estimated error rate of 1.6%. GWA analysis was done with and without correction for confounding. For the former, a mixed-model12 implemented in the program EMMA13 was used. For the latter, Wilcoxon’s test was used for ordinal data, and Fisher’s Exact Test for categorical data. Enrichment for candidate genes was investigated using lists of a priori candidates identified from the literature.
Author Contributions J.R.E. and D.W. generated the SNPs used by this project. S.A., M.H, Y.L., N.W.M, X.Z, J.O.B., and J.B. were responsible for the experimental aspects of genotyping. Y.S.H., B.J.V., M.H., T.T.H., R.J., X.Z., M.A.A., P.M., J.O.B., J.B., and M.N were responsible for data management and the bioinformatics pipeline. S.A., I.B., B.B., J.C., C.D., M.D., J.d.M., N.F., J.M.K, J.D.G.J, T.M., A.N., F.R., D.E.S., C.T., M.T., M.B.T., D.W., J.B., and M.N. were responsible for phenotyping. S.A., Y.S.H., B.J.V., G.W., D.M., A.P., A.M.T., P.M., and M.N carried out the GWA analyses. Y.S.H. and D.M. developed the project website. M.N. wrote the paper with major contributions from S.A., Y.S.H. B.J.V., G.W., A.P., and J.B.. J.O.B., J.B., and M.N conceived and supervised the project.
Reprints and permissions information is available at www.nature.com/reprints.
We thank B. Carvalho for his advice on how to modify the Oligo package. This work was primarily supported by NSF DEB-0519961 (J.B., M.N.), NIH GM073822 (J.O.B.), and NSF DEB-0723935 (M.N.). Additional support was provided by the Dropkin foundation, NIH GM057994 and NSF MCB-0603515 (J.B.), the Max Planck Society (D.W., M.T.), the Austrian Academy of Sciences (M.N.), the University of Lille 1 (F.R.), NIH GM078536 and NIH P42ES007373 (D.E.S.), NIH GM62932 (J.C. and D.W.), the Howard Hughes Medical Institute (J.C.), DFG SFB 680 (J.d.M.), a Marie-Curie International Outgoing Fellowship “ANAVACO” (Project Nr. 220833; G.W.), and a Gottfried Wilhelm Leibniz Award of the DFG (D.W.). The project would not have been possible without the existence of TAIR (http://arabidopsis.org).
- 1. Genome-wide association studies for common diseases and complex traitsNat Rev Genet6951082005
- 2. Wellcome Trust Case-Control ConsortiumGenome-wide association study of 14,000 cases of seven common diseases. 3,000 shared controlsNature4476616782007
- 3. Naturally occurring genetic variation in Arabidopsis thalianaAnnu Rev Plant Biol551411722004
- 4. The pattern of polymorphism in Arabidopsis thalianaPLoS Biol3e1962005
- 5. Role of FRIGIDA and FLC in determining variation in flowering time of Arabidopsis thalianaPlant Physiol138116311732005
- 6. Recombination and linkage disequilibrium in Arabidopsis thalianaNature Genet39115111552007
- 7. The extent of linkage disequilibrium in Arabidopsis thalianaNature Genet301901932002
- 8. Genome-wide association mapping in Arabidopsis identifies previously known flowering time and pathogen resistance genesPLoS Genet1e602005
- 9. An Arabidopsis example of association mapping in structured samplesPLoS Genet3e42007
- 10. Association mapping in structured populationsAm J Hum Genet671701812000
- 11. Principal components analysis corrects for stratification in genome-wide association studiesNat Genet389049092006
- 12. A unified mixed-model method for association mapping that accounts for multiple levels of relatednessNat Genet382032082006
- 13. Efficient control of population structure in model organism association mappingGenetics178170917232008
- 14. Structure of the Arabidopsis Rpm1 gene enabling dual-specificity disease resistanceScience2698438461995
- 15. Molecular analysis of FRIGIDA, a major determinant of natural variation in Arabidopsis flowering timeScience2903443472000
- 16. A non-parametric test reveals selection for rapid flowering in the Arabidopsis genomePLoS Biol4e1372006
- 17. FLOWERING LOCUS C encodes a novel MADS domain protein that acts as a repressor of floweringPlant Cell119499561999
- 18. Cloning of DOG1, a quantitative trait locus controlling seed dormancy in ArabidopsisProc Natl Acad Sci USA10317042170472006
- 19. Natural variants of AtHKT1 enhance Na+ accumulation in two wild populations of ArabidopsisPlos Genetics2196419732006
- 20. Variation in molybdenum content across broadly distributed populations of Arabidopsis thaliana is controlled by a mitochondrial molybdenum transporter (MOT1)Plos Genetics42008
- 21. A single amino acid replacement in ETC2 acts as major modifier of trichome patterning in natural Arabidopsis populationsCurrent Biology2009To appear
- 22. ACD6, a novel ankyrin protein, is a regulator and an effector of salicylic acid signaling in the Arabidopsis defense responsePlant Cell15240824202003
- 23. A single locus causing major pleiotropic tradeoffs in Arabidopsis thaliana accessionsUnder review for Nature2009
- 24. Finding the missing heritability of complex diseasesNature4617477532009
- 25. A genome-wide association study identifies novel alleles associated with hair color and skin pigmentationPLoS Genet4e10000742008
- 26. A genomewide association study of skin pigmentation in a South Asian populationAm J Hum Genet81111911322007
- 27. Genomic control for association studiesBiometrics5599710041999
- 28. A general population-genetic model for the production by population structure of spurious genotype-phenotype associations in discrete, admixed, or spatially distributed populationsGenetics173166516782006
- 29. Next-generation genetics in plantsNature4567207232008