Genome-wide detection and characterization of positive selection in human populations.
Journal: 2007/November - Nature
ISSN: 1476-4687
PUBMED: 17943131
Abstract:
With the advent of dense maps of human genetic variation, it is now possible to detect positive natural selection across the human genome. Here we report an analysis of over 3 million polymorphisms from the International HapMap Project Phase 2 (HapMap2). We used 'long-range haplotype' methods, which were developed to identify alleles segregating in a population that have undergone recent selection, and we also developed new methods that are based on cross-population comparisons to discover alleles that have swept to near-fixation within a population. The analysis reveals more than 300 strong candidate regions. Focusing on the strongest 22 regions, we develop a heuristic for scrutinizing these regions to identify candidate targets of selection. In a complementary analysis, we identify 26 non-synonymous, coding, single nucleotide polymorphisms showing regional evidence of positive selection. Examination of these candidates highlights three cases in which two genes in a common biological process have apparently undergone positive selection in the same population: LARGE and DMD, both related to infection by the Lassa virus, in West Africa; SLC24A5 and SLC45A2, both involved in skin pigmentation, in Europe; and EDAR and EDA2R, both involved in development of hair follicles, in Asia.
Nature 449(7164): 913-918

# Genome-wide detection and characterization of positive selection in human populations

+255 authors

## METHODS SUMMARY

### Genotyping data

Phase 2 of the International Haplotype Map (HapMap2) (www.hapmap.org) contains 3.1 million SNPs genotyped in 420 chromosomes in 3 continental populations (120 European (CEU), 120 African (YRI) and 180 Asian (JPT+CHB)) (ref. 1). We further genotyped our top HapMap2 functional candidates in the HGDP-CEPH Human Genome Diversity Cell Line Panel (ref. 20).

### LRH, iHS and XP-EHH tests

The Long-Range Haplotype (LRH), integrated Haplotype Score (iHS) and Cross Population EHH (XP-EHH) tests detect alleles that have risen to high frequency rapidly enough that long-range association with nearby polymorphisms (the long-range haplotype) has not been eroded by recombination; haplotype length is measured by the EHH (refs 8, 9). The first two tests detect partial selective sweeps, whereas XP-EHH detects selected alleles that have risen to near fixation in one but not all populations. To evaluate the tests, we simulated genomic data for each HapMap population in a range of demographic scenarios, under neutral evolution and under twenty scenarios of positive selection, developing the program Sweep (www.broad.mit.edu/mpg/sweep) for analysis. For our top candidates by the three tests, we tested for two possible confounders: haplotype-specific recombination rates and copy-number polymorphisms.

### Localization

We calculated FST and derived-allele frequency for all SNPs within the top candidate regions. We developed a database for those regions to annotate all potentially functional DNA changes (B.F., unpublished), including non-synonymous variants, variants disrupting predicted functional motifs, variants within regions of conservation in mammals and variants previously associated with human phenotypic differences, as well as synonymous, intronic and untranslated region variants.

### Structural model

We generated a homology model of the EDAR death domain (DD) from available DD structures using Modeller 9v1 (ref. 22). The distribution of conserved residues, built using ConSurf (ref. 23) with an EDAR sequence alignment from 22 species, shows a bias to the protein core in helices H1, H2 and H5, supporting our model.

## METHODS

### Genotyping data

The chromosomes examined in HapMap2 were phased by the consortium using PHASE (ref. 25).

The HGDP-CEPH Human Genome Diversity Cell Line Panel (ref. 20) consists of 1,051 individuals from 51 populations across the world. We obtained DNA for the panel from the Foundation Jean Dausset (CEPH) and genotyped our top functional candidates for selection in the panel.

### LRH, iHS, and XP-EHH tests

The Long-Range Haplotype (LRH) and the integrated Haplotype Score (iHS) tests have been previously described (refs 8, 9) and our methods are given in Supplementary Methods.

EHH between two SNPs, A and B, is defined as the probability that two randomly chosen chromosomes are homozygous at all SNPs between A and B, inclusive (ref. 8); it is usually calculated using a sample of chromosomes from a single population. Explicitly, if the $N$ chromosomes in a sample form $G$ homozygous groups, with each group $i$ having $n_i$ elements, EHH is defined as

$\mathrm{EHH}=\frac{\sum_{i=1}^{G}\binom{n_i}{2}}{\binom{N}{2}}$
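
As an illustration of the definition above, the following minimal Python sketch computes EHH from a list of phased haplotypes. It assumes each haplotype has already been restricted to the SNPs from A to B, inclusive, and is encoded as a string of alleles; the function name `ehh` and the toy sample are hypothetical and are not taken from the Sweep program.

```python
from collections import Counter
from math import comb


def ehh(haplotypes):
    """EHH for haplotypes restricted to the SNPs from A to B, inclusive.

    Identical haplotypes form one homozygous group of size n_i; the result is
    sum_i C(n_i, 2) / C(N, 2), where N is the number of chromosomes.
    """
    n_total = len(haplotypes)
    if n_total < 2:
        raise ValueError("EHH needs at least two chromosomes")
    group_sizes = Counter(haplotypes).values()
    return sum(comb(n_i, 2) for n_i in group_sizes) / comb(n_total, 2)


# Toy sample (hypothetical): six chromosomes forming groups of size 3, 1 and 2.
sample = ["0101", "0101", "0101", "0111", "1101", "1101"]
print(ehh(sample))  # (3 + 0 + 1) / 15
```

In this toy sample the groups contribute 3, 0 and 1 homozygous pairs out of 15 possible pairs, so EHH is 4/15.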

The XP-EHH test detects selective sweeps in which the selected allele has risen to high frequency or fixation in one population, but remains polymorphic in the human population as a whole; for this purpose it is more powerful than either iHS or LRH (Supplementary Fig. 2 and Supplementary Tables 3-6). XP-EHH uses cross-population comparison of haplotype lengths to control for local variation in recombination rates. Such cross-population comparison is complicated by the fact that haplotype lengths also depend on population history, such as bottlenecks and expansions (ref. 26). The XP-EHH test normalizes for genome-wide differences in haplotype length between populations.

We define the XP-EHH test with respect to two populations, A and B, a given core SNP and a given direction (centromere distal or proximal). EHH is calculated for all SNPs in population A between the core SNP and X, and the value integrated with respect to genetic distance, with the result defined as $I_A$. $I_B$ is defined analogously for population B. The statistic $\ln(I_A/I_B)$ is then calculated; an unusually positive value suggests selection in population A, a negative value selection in B. For identifying outliers, the log-ratio is normalized to have zero mean and unit variance. Details are given in Supplementary Methods.
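
The calculation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the integration uses the trapezoidal rule (the exact integration scheme and the choice of the endpoint X are given in Supplementary Methods, not here), EHH values at each SNP are taken as already computed, and all function names and numbers are hypothetical.

```python
import math
from statistics import mean, pstdev


def integrate_ehh(genetic_pos, ehh_values):
    """Integrate EHH over genetic distance (trapezoidal rule, an assumption)."""
    area = 0.0
    for k in range(1, len(genetic_pos)):
        width = abs(genetic_pos[k] - genetic_pos[k - 1])
        area += 0.5 * (ehh_values[k] + ehh_values[k - 1]) * width
    return area


def log_ratio(genetic_pos, ehh_a, ehh_b):
    """Unnormalized statistic ln(I_A / I_B) for one core SNP and direction."""
    i_a = integrate_ehh(genetic_pos, ehh_a)
    i_b = integrate_ehh(genetic_pos, ehh_b)
    return math.log(i_a / i_b)


def normalize(scores):
    """Rescale log-ratios to zero mean and unit variance."""
    mu, sd = mean(scores), pstdev(scores)
    return [(s - mu) / sd for s in scores]


# Hypothetical EHH decay from a core SNP (position 0) out to the endpoint X.
positions = [0.0, 0.1, 0.2]   # genetic distance, arbitrary units
ehh_pop_a = [1.0, 0.8, 0.6]   # slow decay: long haplotypes in population A
ehh_pop_b = [1.0, 0.4, 0.2]   # fast decay in population B
score = log_ratio(positions, ehh_pop_a, ehh_pop_b)
print(score > 0)  # a positive value points towards selection in population A
```

Here $I_A = 0.16$ and $I_B = 0.10$, so the log-ratio is $\ln(1.6)$; in practice the normalization would be applied to scores from SNPs across the genome rather than to a single value.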

We developed a computer program, Sweep, to implement these tests (LRH, iHS and XP-EHH) for positive selection (Supplementary Methods; www.broad.mit.edu/mpg/sweep). In identifying the 22 strongest candidate regions, we considered regions with signals in at least two of five tests (LRH, iHS and XP-EHH in the three pairwise comparisons among the three populations), as well as those that had the strongest signal for each individual test. With this threshold we found no events in 10 Gb of simulated neutrally evolving sequence. For the top candidates by the three tests, we have taken additional steps to rule out the effects of recombination rate variation and copy number polymorphisms (Supplementary Methods).
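
The selection rule for the strongest regions can be expressed as a small Python sketch. The per-test signal thresholds are defined in Supplementary Methods and are not reproduced here, so each test result is assumed to arrive as a precomputed boolean; the test labels and function name are hypothetical.

```python
# The five tests: LRH, iHS, and XP-EHH for each pairwise population comparison.
TESTS = (
    "LRH",
    "iHS",
    "XP-EHH CEU vs YRI",
    "XP-EHH CEU vs JPT+CHB",
    "XP-EHH YRI vs JPT+CHB",
)


def is_top_candidate(signals, strongest_for_some_test=False):
    """Return True if a region qualifies as one of the strongest candidates.

    signals: dict mapping a test label to True when the region shows a signal
             in that test (thresholding is assumed to be done upstream).
    strongest_for_some_test: True when the region has the single strongest
             signal genome-wide for at least one individual test.
    """
    n_signals = sum(1 for test in TESTS if signals.get(test, False))
    return n_signals >= 2 or strongest_for_some_test


# Hypothetical region with signals in iHS and one XP-EHH comparison.
region = {"iHS": True, "XP-EHH CEU vs YRI": True}
print(is_top_candidate(region))  # True
```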

### Simulations and power calculations

We simulated the evolution of 1 Mb sections of 120 chromosomes from each of the three continental HapMap populations, using a previously validated demographic model (ref. 27), under neutrality and under twenty scenarios of positive selection. We studied the effects of demography by further simulating recent bottlenecks with a range of intensity. Details of simulations and power calculations are given in Supplementary Methods.

### Functional annotation

We developed an annotation database for our candidate regions to identify all DNA changes with potential functional consequence (B.F., unpublished). We first examined candidates most likely to be functional, including non-synonymous mutations, variants that disrupt predicted functional motifs (transcription factor motifs in conserved regions up to 10-kb 5′ of known genes and miRNA binding-site motifs in conserved 3′ untranslated regions of known genes), and variations reported to be associated with human phenotypic differences. For the last category, we identified variations associated with a clinical state (for example, malaria resistance) by a review of the published literature and those associated with changes to gene expression in lymphoblastoid cell lines from the HapMap individuals. The annotation included insertion/deletion mutations of all sizes. We also examined candidates with lower probability of being functional, including synonymous, intronic and untranslated variations and those that occur within regions of conservation in mammalian species. These methods are described in greater detail in Supplementary Methods.

### Structural model of EDAR's death domain

We generated a homology model for EDAR's death domain (DD) using six solved DD structures: p75 NGFR-DD, RAIDD-DD, Pelle-DD, FADD-DD, Fas-DD and IRAK4-DD (refs 24, 28–32). We aligned the corresponding protein sequences using SALIGN (ref. 33). We then added the amino acid sequence of EDAR's DD (residues 356-431) to this structural alignment using Modeller 9v1 (ref. 22). The resulting alignment was used as the input to Modeller 9v1 to build ten EDAR-DD structure models, and the best model was selected based on the Objective Function Score. Owing to the high DOPE scores in the H1-H2 loop, we performed a loop refinement using Modeller 9v1, significantly reducing the energy of this region. We further evaluated the model by examining the distribution of conserved residues using ConSurf (ref. 23) with an alignment of EDAR-DD sequences from 22 species. We observed a bias of conserved residues to the protein core in H1, H2 and H5, which supports our EDAR-DD model. To identify potential binding regions of EDAR-DD, we used LSQMAN (ref. 34) to superimpose the model onto the Tube-DD-Pelle-DD complex structure (ref. 24). The H1-H2 and H5-H6 loops of the EDAR-DD correspond to Tube residues interacting with Pelle, and H2-H3 and H4-H5 loops to Pelle residues interacting with Tube. We focused our analysis on the residues corresponding to the interacting region in Tube because our EDAR-DD model is most similar to Tube. Figures were generated with PyMOL (ref. 35).

### Other analysis

Descriptions of the methods for calculating FST and derived-allele frequency, for aligning the SLC24 amino acids, for species alignments and conservation graphs, and for estimating the fraction of SNPs genotyped in HapMap2 and identified in dbSNP are given in Supplementary Methods.

## Acknowledgements

P.C.S. is funded by a Burroughs Wellcome Career Award in the Biomedical Sciences and has been funded by the Damon Runyon Cancer Fellowship and the L'Oreal for Women in Science Award. We thank A. Schier, B. Voight, R. Roberts, M. Kreiger, A. Abzhanov, D. Degusta, M. Burnette, E. Lieberman, M. Daly, D. Altshuler, D. Reich, D. Lieberman and I. Woods for helpful discussions on our analysis and results. We also thank L. Ziaugra, D. Tabbaa and T. Rachupka for experimental assistance. This work was funded in part by grants from the National Human Genome Research Institute (to E.S.L.) and from the Broad Institute of MIT and Harvard.

Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02139, USA
Center for Human Genetic Research, Massachusetts General Hospital, Boston, Massachusetts 02114, USA.
Department of Molecular and Cellular Biology, Harvard University, Cambridge, Massachusetts 02138, USA.
Department of Biology, MIT, Cambridge, Massachusetts 02139, USA.
Whitehead Institute for Biomedical Research, Cambridge, Massachusetts 02142, USA.
Department of Systems Biology, Harvard Medical School, Boston, Massachusetts 02115, USA.
These authors contributed equally to this work.
Lists of participants and affiliations appear at the end of the paper.
Author Contributions P.C.S., P.V., B.F. and E.S.L. initiated the project. P.V., B.F. and P.C.S. developed key software. P.C.S., P.V., B.F., S.F.S., J.L., E.H., C.C., X.X., E.B., S.A.McC. and R.G. performed analysis. P.C.S., E.B. and E.H. performed experiments. P.C.S., E.S.L., P.V. and S.F.S. wrote the manuscript.
Correspondence and requests for materials should be addressed to P.C.S. (ude.tim.daorb@sidrap).

## Abstract

With the advent of dense maps of human genetic variation, it is now possible to detect positive natural selection across the human genome. Here we report an analysis of over 3 million polymorphisms from the International HapMap Project Phase 2 (HapMap2; ref. 1). We used ‘long-range haplotype’ methods, which were developed to identify alleles segregating in a population that have undergone recent selection (ref. 2), and we also developed new methods that are based on cross-population comparisons to discover alleles that have swept to near-fixation within a population. The analysis reveals more than 300 strong candidate regions. Focusing on the strongest 22 regions, we develop a heuristic for scrutinizing these regions to identify candidate targets of selection. In a complementary analysis, we identify 26 non-synonymous, coding, single nucleotide polymorphisms showing regional evidence of positive selection. Examination of these candidates highlights three cases in which two genes in a common biological process have apparently undergone positive selection in the same population: LARGE and DMD, both related to infection by the Lassa virus (ref. 3), in West Africa; SLC24A5 and SLC45A2, both involved in skin pigmentation (refs 4, 5), in Europe; and EDAR and EDA2R, both involved in development of hair follicles (ref. 6), in Asia.

An increasing amount of information about genetic variation, together with new analytical methods, is making it possible to explore the recent evolutionary history of the human population. The first phase of the International Haplotype Map, including ~1 million single nucleotide polymorphisms (SNPs; ref. 7), allowed preliminary examination of natural selection in humans. Now, with the publication of the Phase 2 map (HapMap2; ref. 1) in a companion paper, over 3 million SNPs have been genotyped in 420 chromosomes from three continents (120 European (CEU), 120 African (YRI) and 180 Asian from Japan and China (JPT + CHB)).

In our analysis of HapMap2, we first implemented two widely used tests that detect recent positive selection by finding common alleles carried on unusually long haplotypes (ref. 2). The two, the Long-Range Haplotype (LRH; ref. 8) and the integrated Haplotype Score (iHS; ref. 9) tests, rely on the principle that, under positive selection, an allele may rise to high frequency rapidly enough that long-range association with nearby polymorphisms (the long-range haplotype; ref. 8) will not have time to be eliminated by recombination. These tests control for local variation in recombination rates by comparing long haplotypes to other alleles at the same locus. As a result, they lose power as selected alleles approach fixation (100% frequency), because there are then few alternative alleles in the population (Supplementary Fig. 2 and Supplementary Tables 1–2).

We next developed, evaluated and applied a new test, Cross Population Extended Haplotype Homozygosity (XP-EHH), to detect selective sweeps in which the selected allele has approached or achieved fixation in one population but remains polymorphic in the human population as a whole (Methods, and Supplementary Fig. 2 and Supplementary Tables 3–6). Related methods have recently also been described (refs 10–12).

Our analysis of recent positive selection, using the three methods, reveals more than 300 candidate regions (ref. 1; Supplementary Fig. 3 and Supplementary Table 7), 22 of which are above a threshold such that no similar events were found in 10 Gb of simulated neutrally evolving sequence (Methods). We focused on these 22 strongest signals (Table 1), which include two well-established cases, SLC24A5 and LCT (refs 2, 5, 13), and 20 other regions with signals of similar strength.

### Table 1

The twenty-two strongest candidates for natural selection

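| Region | Chr:position (Mb, HG17) | Selected population | Long haplotype test | Size (Mb) | Total SNPs with long haplotype signal | Subset of SNPs that fulfil criterion 1 | Subset of SNPs that fulfil criteria 1 and 2 | Subset of SNPs that fulfil criteria 1, 2 and 3 | Genes at or near SNPs that fulfil all three criteria |
|---|---|---|---|---|---|---|---|---|---|
| 1 | chr1:166 | CHB+JPT | LRH, iHS | 0.4 | 92 | 39 | 30 | 2 | BLZF1, SLC19A2 |
| 2 | chr2:72.6 | CHB+JPT | XP-EHH | 0.8 | 732 | 250 | 0 | 0 | |
| 3 | chr2:108.7 | CHB+JPT | LRH, iHS, XP-EHH | 1.0 | 972 | 265 | 7 | 1 | EDAR |
| 4 | chr2:136.1 | CEU | LRH, iHS, XP-EHH | 2.4 | 1,213 | 282 | 24 | 3 | RAB3GAP1, R3HDM1, LCT |
| 5 | chr2:177.9 | CEU, CHB+JPT | LRH, iHS, XP-EHH | 1.2 | 1,388 | 399 | 79 | 9 | PDE11A |
| 6 | chr4:33.9 | CEU, YRI, CHB+JPT | LRH, iHS | 1.7 | 413 | 161 | 33 | 0 | |
| 7 | chr4:42 | CHB+JPT | LRH, iHS, XP-EHH | 0.3 | 249 | 94 | 65 | 6 | SLC30A9 |
| 8 | chr4:159 | CHB+JPT | LRH, iHS, XP-EHH | 0.3 | 233 | 67 | 34 | 1 | |
| 9 | chr10:3 | CEU | LRH, iHS, XP-EHH | 0.3 | 179 | 63 | 16 | 1 | |
| 10 | chr10:22.7 | CEU, CHB+JPT | XP-EHH | 0.3 | 254 | 93 | 0 | 0 | |
| 11 | chr10:55.7 | CHB+JPT | LRH, iHS, XP-EHH | 0.4 | 735 | 221 | 5 | 2 | PCDH15 |
| 12 | chr12:78.3 | YRI | LRH, iHS | 0.8 | 151 | 91 | 25 | 0 | |
| 13 | chr15:46.4 | CEU | XP-EHH | 0.6 | 867 | 233 | 5 | 1 | SLC24A5 |
| 14 | chr15:61.8 | CHB+JPT | XP-EHH | 0.2 | 252 | 73 | 40 | 6 | HERC1 |
| 15 | chr16:64.3 | CHB+JPT | XP-EHH | 0.4 | 484 | 137 | 2 | 0 | |
| 17 | chr17:53.3 | CHB+JPT | XP-EHH | 0.2 | 143 | 41 | 0 | 0 | |
| 18 | chr17:56.4 | CEU | XP-EHH | 0.4 | 290 | 98 | 26 | 3 | BCAS3 |
| 19 | chr19:43.5 | YRI | LRH, iHS, XP-EHH | 0.3 | 83 | 30 | 0 | 0 | |
| 20 | chr22:32.5 | YRI | LRH | 0.4 | 318 | 188 | 35 | 3 | LARGE |
| 21 | chr23:35.1 | YRI | LRH, iHS | 0.6 | 50 | 35 | 25 | 0 | |
| 22 | chr23:63.5 | YRI | LRH, iHS | 3.5 | 13 | 3 | 1 | 0 | |
| Total SNPs | | | | 16.74 | 9,166 | 2,898 | 480 | 41 | |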

Twenty-two regions were identified at a high threshold for significance (Methods), based on the LRH, iHS and/or XP-EHH test. Within these regions, we examined SNPs with the best evidence of being the target of selection on the basis of having a long haplotype signal, and by fulfilling three criteria: (1) being a high-frequency derived allele; (2) being differentiated between populations and common only in the selected population; and (3) being identified as functional by current annotation. Several candidate polymorphisms arise from the analysis including well-known LCT and SLC24A5 (ref. 2), as well as intriguing new candidates.

The challenge is to sift through genetic variation in the candidate regions to identify the variants that were the targets of selection. Our candidate regions are large (mean length, 815 kb; maximum length, 3.5 Mb) and often contain multiple genes (median, 4; maximum, 15). A typical region harbours ~400–4,000 common SNPs (minor allele frequency >5%), of which roughly three-quarters are represented in current SNP databases and half were genotyped as part of HapMap2 (Supplementary Table 8).

We developed three criteria to help highlight potential targets of selection (Supplementary Fig. 1): (1) selected alleles detectable by our tests are likely to be derived (newly arisen), because long-haplotype tests have little power to detect selection on standing (pre-existing) variation (ref. 14); we therefore focused on derived alleles, as identified by comparison to primate outgroups; (2) selected alleles are likely to be highly differentiated between populations, because recent selection is probably a local environmental adaptation (ref. 2); we thus looked for alleles common in only the population(s) under selection; (3) selected alleles must have biological effects. On the basis of current knowledge, we therefore focused on non-synonymous coding SNPs and SNPs in evolutionarily conserved sequences. These criteria are intended as heuristics, not absolute requirements. Some targets of selection may not satisfy them, and some will not be in current SNP databases. Nonetheless, with ~50% of common SNPs in these populations genotyped in HapMap2, a search for causal variants is timely.

We applied the criteria to the regions containing SLC24A5 and LCT, each of which already has a strong candidate gene, mutation and trait. At SLC24A5, the 600 kb region contains 914 genotyped SNPs. Applying filters progressively (Table 1 and Fig. 1a–d), we found that 867 SNPs are associated with the long-haplotype signal, of which 233 are high-frequency derived alleles, of which 12 are highly differentiated between populations, and of which only 5 are common in Europe and rare in Asia and Africa. Among these five SNPs, there is only one implicated as functional by current knowledge; it has the strongest signal of positive selection and encodes the A111T polymorphism associated with pigment differences in humans and thought to be the target of positive selection (ref. 5). Our criteria thus uniquely identify the expected allele.

Figure 1. Localizing SLC24A5 and EDAR signals of selection

a–d, SLC24A5. a, Strong evidence for positive selection in CEU samples at a chromosome 15 locus: XP-EHH between CEU and JPT + CHB (blue), CEU and YRI (red), and YRI and JPT + CHB (grey). SNPs are classified as having low probability (bordered diamonds) and high probability (filled diamonds) potential for function. SNPs were filtered to identify likely targets of selection on the basis of the frequency of derived alleles (b), differences between populations (c) and differences between populations for high-frequency derived alleles (less than 20% in non-selected populations) (d). The number of SNPs that passed each filter is given in the top left corner in red. The alanine to threonine candidate polymorphism (A111T) in SLC24A5 is the clear outlier. e–h, EDAR. e, Similar evidence for positive selection in JPT + CHB at a chromosome 2 locus: XP-EHH between CEU and JPT + CHB (blue), between YRI and JPT + CHB (red), and between CEU and YRI (grey); iHS in JPT + CHB (green). A valine to alanine polymorphism in EDAR passes all filters: the frequency of derived alleles (f), differences between populations (g) and differences between populations for high-frequency derived alleles (less than 20% in non-selected populations) (h). Three other functional changes, a D→E change in SULT1C2 and two SNPs associated with RANBP2 expression (Methods), have also become common in the selected population.

At the LCT locus, we found similar degrees of filtration. Within the 2.4 Mb selective sweep, 24 polymorphisms fulfil the first two criteria (Table 1, and Supplementary Fig. 4), with the polymorphism thought to confer adult persistence of lactase among them. However, this SNP was only identified as functional after extensive study of the LCT gene (ref. 15). Thus LCT shows both the utility and the limits of the heuristics.

Given the encouraging results for SLC24A5 and LCT, we performed a similar analysis on all 22 candidate regions (Table 1). Filtering the 9,166 SNPs associated with the long-haplotype signal, we found that 480 satisfied the first two criteria. We identified 41 out of the 480 SNPs (0.2% of all SNPs genotyped in the regions) as possibly functional on the basis of a newly compiled database of polymorphisms in known coding elements, evolutionarily conserved elements and regulatory elements (Methods; B.F., unpublished), together containing ~5.5% of all known SNPs.

Eight of the forty-one SNPs encode non-synonymous changes (Table 1 and Supplementary Table 9). Apart from the well-known case of SLC24A5, they are found in EDAR, PCDH15, ADAT1, KARS, HERC1, SLC30A9 and BLZF1. The remaining 33 potentially functional SNPs lie within conserved transcription factor motifs, introns, UTRs and other non-coding regions.

To identify additional candidates, we reversed the process by taking non-synonymous coding SNPs with highly differentiated high-frequency derived alleles; these SNPs comprise a tiny fraction of all SNPs and have a higher a priori probability of being targets of selection. Of the 15,816 non-synonymous SNPs in HapMap2, 281 (Supplementary Table 10) have both a high derived-allele frequency (frequency >50%) and clear differentiation between populations (FST is in the top 0.5 percentile). We examined these 281 SNPs to identify those embedded within long-range haplotypes (ref. 16), and identified 26 putative cases of positive selection. These include the eight non-synonymous SNPs identified in the genome-wide analysis above.
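
The two filters in this step (derived-allele frequency >50% and FST in the top 0.5 percentile) can be sketched in Python. The FST estimator used in the paper is described in Supplementary Methods; the version below is a simple Wright-style estimator with equal population weights, chosen purely for illustration. Applying the frequency cutoff to the highest per-population frequency is likewise an assumption, and all names and toy frequencies are hypothetical.

```python
def wright_fst(freqs):
    """Wright-style FST for one biallelic SNP (illustrative estimator only).

    freqs: derived-allele frequency in each population, equally weighted.
    """
    p_bar = sum(freqs) / len(freqs)
    h_total = 2 * p_bar * (1 - p_bar)          # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0                              # monomorphic everywhere
    h_within = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    return (h_total - h_within) / h_total


def top_percentile_cutoff(values, top_percent):
    """Smallest value that still lies within the top `top_percent` of values."""
    ranked = sorted(values, reverse=True)
    n_keep = max(1, int(len(ranked) * top_percent / 100))
    return ranked[n_keep - 1]


def candidate_snps(snps, top_percent=0.5, min_daf=0.5):
    """Keep SNPs with a high derived-allele frequency and extreme FST.

    snps: dict mapping a SNP identifier to per-population derived-allele
          frequencies, e.g. [CEU, YRI, JPT+CHB].
    """
    fst = {snp_id: wright_fst(freqs) for snp_id, freqs in snps.items()}
    cutoff = top_percentile_cutoff(list(fst.values()), top_percent)
    return [
        snp_id
        for snp_id, freqs in snps.items()
        if max(freqs) > min_daf and fst[snp_id] >= cutoff
    ]


# Toy data (hypothetical): one allele fixed in a single population, one shared
# at equal frequency, one rare everywhere.
toy = {
    "fixed_in_one": [1.0, 0.0, 0.0],
    "shared": [0.6, 0.6, 0.6],
    "rare": [0.1, 0.0, 0.0],
}
print(candidate_snps(toy))  # ['fixed_in_one']
```

Under this estimator, an allele that is fixed in one population and absent from the other two gives an FST of 1, the maximum possible differentiation.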

Interestingly, analysis of the top regions and the non-synonymous SNPs together revealed three cases of two genes in the same pathway both having strong evidence of selection in a single population.

In the European sample, there is strong evidence for two genes already shown to be associated with skin pigment differences among humans. The first is SLC24A5, described above. We further examined the global distribution (Fig. 2) and the predicted effect on protein activity of the SLC24A5 A111T polymorphism (Supplementary Figs 5, 6). The second, SLC45A2, has an important role in pigmentation in zebrafish, mouse and horse4. An L374F substitution in SLC45A2 is at 100% frequency in the European sample, but absent in the Asian and African samples. A recent association study has shown that the Phe-encoding allele is correlated with fair skin and non-black hair in Europeans4. Together, the data support SLC45A2 as a target of positive selection in Europe10,17.

Figure 2. Global distribution of SLC24A5 A111T and EDAR V370A

Worldwide allele-frequency distributions for candidate polymorphisms with the strongest evidence for selection20. a, SLC24A5 A111T is common in Europe, Northern Africa and Pakistan, but rare or absent elsewhere. b, EDAR V370A is common in Asia and the Americas, but absent in Europe and Africa.

In the African sample (Yoruba in Ibadan, Nigeria), there is evidence of selection for two genes with well-documented biological links to the Lassa fever virus. The strongest signal in the genome, on the basis of the LRH test, resides within a 400 kb region that lies entirely within the gene LARGE. The LARGE protein is a glycosylase that post-translationally modifies α-dystroglycan, the cellular receptor for Lassa fever virus (as well as other arenaviruses), and the modification has been shown to be critical for virus binding3. The virus name is derived from Lassa, Nigeria, where the disease is endemic, with 21% of the population showing signs of exposure18. We also noted that the DMD locus is on our larger candidate list of regions, with the signal of selection again in the Yoruba sample. DMD encodes a cytosolic adaptor protein that binds to α-dystroglycan and is critical for its function. We hypothesize that Lassa fever created selective pressure at LARGE and DMD12. This hypothesis can be tested by correlating the geographical distribution of the selected haplotype with endemicity of the Lassa virus, studying infection of genotyped cells in vitro, and searching for an association between the selected haplotype and clinical outcomes in infected patients.

In the Asian samples, we found evidence of selection for non-synonymous polymorphisms in two genes in the ectodysplasin (EDA) pathway, which is involved in development of hair, teeth and exocrine glands6. The genes are EDAR and EDA2R, which encode the key receptors for the ligands EDA A1 and EDA A2, respectively. Notably, the EDA signalling pathway has been shown to be under positive selection for loss of scales in multiple distinct populations of freshwater stickleback fish19. A mutation encoding a V370A substitution in EDAR is near fixation in Asia and absent in Europe and Africa (Fig. 1e–h). An R57K substitution in EDA2R has derived-allele frequencies of 100% in Asia, 70% in Europe and 0% in Africa.
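To make the notion of differentiation concrete, the reported derived-allele frequencies can be plugged into a simple pairwise FST. The sketch below uses Wright's definition, FST = (H_T - H_S) / H_T, with the two populations weighted equally. That estimator is an assumption chosen for illustration; this section does not say which estimator the analysis used, and the function names are hypothetical.

```python
def heterozygosity(p: float) -> float:
    """Expected heterozygosity 2p(1-p) at a biallelic site."""
    return 2.0 * p * (1.0 - p)


def pairwise_fst(p1: float, p2: float) -> float:
    """Wright's FST for two equally weighted populations (illustrative only)."""
    p_bar = (p1 + p2) / 2.0
    h_total = heterozygosity(p_bar)
    if h_total == 0.0:
        # Site is monomorphic across both populations: no differentiation.
        return 0.0
    h_within = (heterozygosity(p1) + heterozygosity(p2)) / 2.0
    return (h_total - h_within) / h_total


# Derived-allele frequencies quoted in the text.
eda2r = {"Asia": 1.0, "Europe": 0.7, "Africa": 0.0}   # EDA2R R57K
pairs = [("Asia", "Europe"), ("Asia", "Africa"), ("Europe", "Africa")]
eda2r_fst = {pair: pairwise_fst(eda2r[pair[0]], eda2r[pair[1]]) for pair in pairs}

# SLC45A2 L374F: 100% in the European sample, absent in the African sample.
slc45a2_fst = pairwise_fst(1.0, 0.0)
```

Under this definition a fixed difference between two samples, as for SLC45A2 L374F between Europe and Africa, gives the maximum value of 1, whereas the Asia versus Europe comparison for EDA2R R57K is far lower because the allele is common in both.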

The EDAR polymorphism is notable because it is highly differentiated between the Asian and other continental populations (the 3rd most differentiated among 15,816 non-synonymous SNPs), and also within Asian populations (in the top 1% of SNPs differentiated between the Japanese and Chinese HapMap samples). Genotyping of the EDAR polymorphism in the CEPH (Centre d'Etude du Polymorphisme Humain) global diversity panel20 shows that it is at high but varying frequency throughout Asia and the Americas (for example, 100% in Pima Indians and in parts of China, and 73% in Japan) (Fig. 2, and Supplementary Fig. 7). Studying populations like the Japanese, in which the allele is still segregating, may provide clues to its biological significance.

EDAR has a central role in generation of the primary hair follicle pattern, and mutations in EDAR cause hypohidrotic ectodermal dysplasia (HED) in humans and mice, characterized by defects in the development of hair, teeth and exocrine glands6. The V370A polymorphism, proposed to be the target of selection, lies within EDAR's highly conserved death domain (Supplementary Fig. 8), the location of the majority of EDAR polymorphisms causing HED21. Our structural modelling predicts that the polymorphism lies within the binding site of the domain (Fig. 3).

Figure 3. Structural model of the EDAR death domain

Ribbon representation of a homology model of the EDAR death domain (DD), based on the alignment of the EDAR DD amino acid sequence (EDAR residues 356–431) with multiple known DD structures. The helices are labelled H1 to H6. Residues in blue (the H1–H2 and H5–H6 loops, residues 370–376 and 419–425, respectively) correspond to the homologous residues in Tube that interact with Pelle in the Tube-DD–Pelle-DD structure24. These EDAR-DD residues therefore form a potential region of interaction with a DD-containing EDAR-interacting protein, such as EDARADD. The V370A polymorphic residue (red) is located prominently within this potential binding region in the H1–H2 loop. Seven of the thirteen known missense mutations in EDAR that lead to hypohidrotic ectodermal dysplasia (HED) in humans are located in the EDAR-DD: the only four mutations in EDAR that lead to the dominant transmission of HED (green) and three recessive mutations (yellow)21. Four of these mutations, R375H, L377F, R420Q and I418T, are located in the vicinity of the predicted interaction interface.

Our analysis only scratches the surface of the recent selective history of the human genome. The results indicate that individual candidates may coalesce into pathways that reveal traits under selection, analogous to the alleles of multiple genes (for example, HBB, G6PD and DARC) that arose and spread in Africa and other tropical populations as a result of the partial protection they confer against malaria2,12. Such endeavours will be enhanced by continuing development of analytical methods to localize signals in candidate regions, generation of expanded data sets, advances in comparative genomics to define coding and regulatory regions, and biological follow-up of promising candidates. True understanding of the role of adaptive evolution will require collaboration across multiple disciplines, including molecular and structural biology, medical and population genetics, and history and anthropology.

Full Methods and any associated references are available in the online version of the paper at www.nature.com/nature.

Supplementary Information is linked to the online version of the paper at www.nature.com/nature.

Reprints and permissions information is available at www.nature.com/reprints.

The Scripps Research Institute, 10550 North Torrey Pines Road MEM275, La Jolla, California 92037, USA.

Perlegen Sciences, 2021 Stierlin Court, Mountain View, California 94043, USA.

Baylor College of Medicine, Human Genome Sequencing Center, Department of Molecular and Human Genetics, 1 Baylor Plaza, Houston, Texas 77030, USA.

Affymetrix, 3420 Central Expressway, Santa Clara, California 95051, USA.

Pacific Biosciences, 1505 Adams Drive, Menlo Park, California 94025, USA.

Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, New York 11724, USA.

The Broad Institute of Harvard and Massachusetts Institute of Technology, 1 Kendall Square, Cambridge, Massachusetts 02139, USA.

Beijing Genomics Institute, Chinese Academy of Sciences, Beijing 100300, China.

Massachusetts General Hospital and Harvard Medical School, Simches Research Center, 185 Cambridge Street, Boston, Massachusetts 02114, USA.

Chinese National Human Genome Center at Beijing, 3-707 N. Yongchang Road, Beijing Economic-Technological Development Area, Beijing 100176, China.

Chinese National Human Genome Center at Shanghai, 250 Bi Bo Road, Shanghai 201203, China.

Fudan University and CAS-MPG Partner Institute for Computational Biology, School of Life Sciences, SIBS, CAS, Shanghai, 201203, China.

The Chinese University of Hong Kong, Department of Biochemistry, The Croucher Laboratory for Human Genetics, 6/F Mong Man Wai Building, Shatin, Hong Kong.

Hong Kong University of Science and Technology, Department of Biochemistry and Applied Genomics Center, Clear Water Bay, Kowloon, Hong Kong.

Illumina, 9885 Towne Centre Drive, San Diego, California 92121, USA.

Complete Genomics, 658 North Pastoria Avenue, Sunnyvale, California 94085, USA.

Prognosys Biosciences, 4215 Sorrento Valley Boulevard, Suite 105, San Diego, California 92121, USA.

McGill University and Génome Québec Innovation Centre, 740 Dr Penfield Avenue, Montréal, Québec H3A 1A4, Canada.

University of Montréal, The Public Law Research Centre (CRDP), PO Box 6128, Downtown Station, Montréal, Québec H3C 3J7, Canada.

Ontario Institute for Cancer Research, MaRS Centre, South Tower, 101 College Street, Suite 500, Toronto, Ontario M5G 1L7, Canada.

University of California, San Francisco, Cardiovascular Research Institute, 513 Parnassus Avenue, Box 0793, San Francisco, California 94143, USA.

Washington University School of Medicine, Department of Genetics, 660 S. Euclid Avenue, Box 8232, St Louis, Missouri 63110, USA.

University of Hong Kong, Genome Research Centre, 6/F, Laboratory Block, 21 Sassoon Road, Pokfulam, Hong Kong.

University of Tokyo, Institute of Medical Science, 4-6-1 Sirokanedai, Minatoku, Tokyo 108-8639, Japan.

RIKEN SNP Research Center, 1-7-22 Suehiro-cho, Tsurumi-ku Yokohama, Kanagawa 230-0045, Japan.

Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

University of Cambridge, Department of Oncology, Cambridge CB1 8RN, UK.

Solexa, Chesterford Research Park, Little Chesterford, nr Saffron Walden, Essex CB10 1XL, UK.

Columbia University, 500 West 120th Street, New York, New York 10027, USA.

University of Leicester, Department of Genetics, Leicester LE1 7RH, UK.

Johns Hopkins University School of Medicine, McKusick-Nathans Institute of Genetic Medicine, Broadway Research Building, Suite 579, 733 N. Broadway, Baltimore, Maryland 21205, USA.

University of Michigan, Center for Statistical Genetics, Department of Biostatistics, 1420 Washington Heights, Ann Arbor, Michigan 48109, USA.

International Epidemiology Institute, 1455 Research Boulevard, Suite 550, Rockville, Maryland 20850, USA.

Center for Biomolecular Science and Engineering, Engineering 2, Suite 501, Mail Stop CBSE/ITI, UC Santa Cruz, Santa Cruz, California 95064, USA.

University of Oxford, Department of Statistics, 1 South Parks Road, Oxford OX1 3TG, UK.

University of Chicago, Department of Statistics, 5734 S. University Avenue, Eckhart Hall, Room 126, Chicago, Illinois 60637, USA.

Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, Washington 98109, USA.

University of Oxford/Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford OX3 7BN, UK.

University of Washington Department of Biostatistics, Box 357232, Seattle, Washington 98195, USA.

US National Institutes of Health, National Human Genome Research Institute, 50 South Drive, Bethesda, Maryland 20892, USA.

US National Institutes of Health, National Library of Medicine, National Center for Biotechnology Information, 8600 Rockville Pike, Bethesda, Maryland 20894, USA.

University of Chicago, Department of Medicine, Section of Genetic Medicine, 5801 South Ellis, Chicago, Illinois 60637, USA.

Beijing Normal University, 19 Xinjiekouwai Street, Beijing 100875, China.

Health Sciences University of Hokkaido, Ishikari Tobetsu Machi 1757, Hokkaido 061-0293, Japan.

Shinshu University School of Medicine, Department of Medical Genetics, Matsumoto 390-8621, Japan.

United Nations Educational, Scientific and Cultural Organization (UNESCO Bangkok), 920 Sukhumwit Road, Prakanong, Bangkok 10110, Thailand.

University of Tsukuba, Eubios Ethics Institute, PO Box 125, Tsukuba Science City 305-8691, Japan.

Howard University, National Human Genome Center, 2216 6th Street, NW, Washington, District of Columbia 20059, USA.

Case Western Reserve University School of Medicine, Department of Bioethics, 10900 Euclid Avenue, Cleveland, Ohio 44106, USA.

University of Utah, Eccles Institute of Human Genetics, Department of Human Genetics, 15 North 2030 East, Salt Lake City, Utah 84112, USA.

Chinese Academy of Social Sciences, Institute of Philosophy/Center for Applied Ethics, 2121, Building 9, Caoqiao Xinyuan 3 Qu, Beijing 100067, China.

Genetic Interest Group, 4D Leroy House, 436 Essex Road, London N130P, UK.

Kyoto University, Institute for Research in Humanities and Graduate School of Biostudies, Ushinomiya-cho, Sakyo-ku, Kyoto 606-8501, Japan.

Nagasaki University Graduate School of Biomedical Sciences, Department of Human Genetics, Sakamoto 1-12-4, Nagasaki 852-8523, Japan.

University of Oklahoma, Department of Anthropology, 455 W. Lindsey Street, Norman, Oklahoma 73019, USA.

Vanderbilt University, Center for Genetics and Health Policy, 507 Light Hall, Nashville, Tennessee 37232, USA.

Wellcome Trust, 215 Euston Road, London NW1 2BE, UK.

Washington University School of Medicine, Genome Sequencing Center, Box 8501, 4444 Forest Park Avenue, St Louis, Missouri 63108, USA.

Genome Canada, 150 Metcalfe Street, Suite 2100, Ottawa, Ontario K2P 1P1, Canada.

McGill University, Office of Technology Transfer, 3550 University Street, Montréal, Québec H3A 2A7, Canada.

Génome Québec, 630, boulevard René-Lévesque Ouest, Montréal, Québec H3B 1S6, Canada.

Ministry of Education, Culture, Sports, Science, and Technology, 3-2-2 Kasumigaseki, Chiyodaku, Tokyo 100-8959, Japan.

Ministry of Science and Technology of the People's Republic of China, 15 B. Fuxing Road, Beijing 100862, China.

The Human Genetic Resource Administration of China, b7, Zaojunmiao, Haidian District, Beijing 100081, China.

US National Institutes of Health, National Human Genome Research Institute, 5635 Fishers Lane, Bethesda, Maryland 20892, USA.

US National Institutes of Health, Office of Behavioral and Social Science Research, 31 Center Drive, Bethesda, Maryland 20892, USA.

Novartis Pharmaceuticals Corporation, Biomarker Development, One Health Plaza, East Hanover, New Jersey 07936, USA.

US National Institutes of Health, Office of Technology Transfer, 6011 Executive Boulevard, Rockville, Maryland 20852, USA.

University of Maryland School of Law, 500 W. Baltimore Street, Baltimore, Maryland 21201, USA.

US National Institutes of Health, National Human Genome Research Institute, 31 Center Drive, Bethesda, Maryland 20892, USA.

## References

• 1. The International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature. doi:10.1038/nature06258 (this issue).
• 2. Sabeti PC, et al. Positive natural selection in the human lineage. Science. 2006;312:1614–1620.
• 3. Kunz S, et al. Posttranslational modification of α-dystroglycan, the cellular receptor for arenaviruses, by the glycosyltransferase LARGE is critical for virus binding. J. Virol. 2005;79:14282–14296.
• 4. Graf J, Hodgson R, van Daal A. Single nucleotide polymorphisms in the MATP gene are associated with normal human pigmentation variation. Hum. Mutat. 2005;25:278–284.
• 5. Lamason RL, et al. SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. Science. 2005;310:1782–1786.
• 6. Botchkarev VA, Fessing MY. Edar signaling in the control of hair follicle development. J. Investig. Dermatol. Symp. Proc. 2005;10:247–251.
• 7. The International Haplotype Map Consortium. A haplotype map of the human genome. Nature. 2005;437:1299–1320.
• 8. Sabeti PC, et al. Detecting recent positive selection in the human genome from haplotype structure. Nature. 2002;419:832–837.
• 9. Voight BF, Kudaravalli S, Wen X, Pritchard JK. A map of recent positive selection in the human genome. PLoS Biol. 2006;4:e72.
• 10. Kimura R, Fujimoto A, Tokunaga K, Ohashi J. A practical genome scan for population-specific strong selective sweeps that have reached fixation. PLoS ONE. 2007;2:e286.
• 11. Tang K, Thornton KR, Stoneking M. A new approach for using genome scans to detect recent positive selection in the human genome. PLoS Biol. 2007;5:e171.
• 12. Williamson SH, et al. Localizing recent adaptive evolution in the human genome. PLoS Genet. 2007;3:e90.
• 13. Bersaglieri T, et al. Genetic signatures of strong recent positive selection at the lactase gene. Am. J. Hum. Genet. 2004;74:1111–1120.
• 14. Teshima KM, Coop G, Przeworski M. How reliable are empirical genomic scans for selective sweeps? Genome Res. 2006;16:702–712.
• 15. Kuokkanen M, et al. Transcriptional regulation of the lactase–phlorizin hydrolase gene by polymorphisms associated with adult-type hypolactasia. Gut. 2003;52:647–652.
• 16. Miller RG. Simultaneous statistical inference. XVI. Springer; New York: 1981. p. 299.
• 17. Soejima M, Tachida H, Ishida T, Sano A, Koda Y. Evidence for recent positive selection at the human AIM1 locus in a European population. Mol. Biol. Evol. 2006;23:179–188.
• 18. Richmond JK, Baglole DJ. Lassa fever: epidemiology, clinical features, and social consequences. Br. Med. J. 2003;327:1271–1275.
• 19. Colosimo PF, et al. Widespread parallel evolution in sticklebacks by repeated fixation of Ectodysplasin alleles. Science. 2005;307:1928–1933.
• 20. Rosenberg NA, et al. Genetic structure of human populations. Science. 2002;298:2381–2385.
• 21. Chassaing N, Bourthoumieu S, Cossee M, Calvas P, Vincent MC. Mutations in EDAR account for one-quarter of non-ED1-related hypohidrotic ectodermal dysplasia. Hum. Mutat. 2006;27:255–259.
• 22. Marti-Renom MA, et al. Comparative protein structure modeling of genes and genomes. Annu. Rev. Biophys. Biomol. Struct. 2000;29:291–325.
• 23. Landau M, et al. ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures. Nucleic Acids Res. 2005;33:W299–W302.
• 24. Xiao T, Towb P, Wasserman SA, Sprang SR. Three-dimensional structure of a complex between the death domains of Pelle and Tube. Cell. 1999;99:545–555.
• 25. Stephens M, Smith NJ, Donnelly P. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 2001;68:978–989.
• 26. Crawford DC, et al. Evidence for substantial fine-scale variation in recombination rates across the human genome. Nature Genet. 2004;36:700–706.
• 27. Schaffner SF, et al. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 2005;15:1576–1583.
• 28. Berglund H, et al. The three-dimensional solution structure and dynamic properties of the human FADD death domain. J. Mol. Biol. 2000;302:171–188.
• 29. Huang B, Eberstadt M, Olejniczak ET, Meadows RP, Fesik SW. NMR structure and mutagenesis of the Fas (APO-1/CD95) death domain. Nature. 1996;384:638–641.
• 30. Lasker MV, Gajjar MM, Nair SK. Cutting edge: molecular structure of the IL-1R-associated kinase-4 death domain and its implications for TLR signaling. J. Immunol. 2005;175:4175–4179.
• 31. Liepinsh E, Ilag LL, Otting G, Ibanez CF. NMR structure of the death domain of the p75 neurotrophin receptor. EMBO J. 1997;16:4999–5005.
• 32. Park HH, Wu H. Crystal structure of RAIDD death domain implicates potential mechanism of PIDDosome assembly. J. Mol. Biol. 2006;357:358–364.
• 33. Marti-Renom MA, Madhusudhan MS, Sali A. Alignment of protein sequences by their profiles. Protein Sci. 2004;13:1071–1087.
• 34. Kleywegt GJ. Use of non-crystallographic symmetry in protein structure refinement. Acta Crystallogr. D. 1996;52:842–857.
• 35. DeLano WL. MacPyMOL: A PyMOL-based Molecular Graphics Application for MacOS X. DeLano Scientific LLC; Palo Alto, California, USA: 2007.