Literature DB >> 35860855

Uncovering Signals of Positive Selection in Peruvian Populations from Three Ecological Regions.

Rocio Caro-Consuegra¹, Maria A Nieves-Colón^2,3,4, Erin Rawls³, Verónica Rubin-de-Celis⁵, Beatriz Lizárraga⁶, Tatiana Vidaurre⁷, Karla Sandoval², Laura Fejerman⁸, Anne C Stone^3,9, Andrés Moreno-Estrada², Elena Bosch^1,10.

Abstract

Peru hosts extremely diverse ecosystems which can be broadly classified into the following three major ecoregions: the Pacific desert coast, the Andean highlands, and the Amazon rainforest. Since its initial peopling approximately 12,000 years ago, the populations inhabiting such ecoregions might have differentially adapted to their contrasting environmental pressures. Previous studies have described several candidate genes underlying adaptation to hypobaric hypoxia among Andean highlanders. However, the adaptive genetic diversity of coastal and rainforest populations has been less studied. Here, we gathered genome-wide single-nucleotide polymorphism-array data from 286 Peruvians living across the three ecoregions and analyzed signals of recent positive selection through population differentiation and haplotype-based selection scans. Among highland populations, we identify candidate genes related to cardiovascular function (TLL1, DUSP27, TBX5, PLXNA4, SGCD), to the Hypoxia-Inducible Factor pathway (TGFA, APIP), to skin pigmentation (MITF), as well as to glucose (GLIS3) and glycogen metabolism (PPP1R3C, GANC). In contrast, most signatures of adaptation in coastal and rainforest populations comprise candidate genes related to the immune system (including SIGLEC8, TRIM21, CD44, and ICAM1 in the coast; CBLB and PRDM1 in the rainforest; and BRD2, HLA-DOA, HLA-DPA1 regions in both), possibly as a result of strong pathogen-driven selection. This study identifies candidate genes related to human adaptation to the diverse environments of South America.

Entities: Chemical

Keywords: Peruvian populations; high-altitude adaptation; human adaptation

Mesh：

Substances：

Year: 2022 PMID： 35860855 PMCID： PMC9356722 DOI： 10.1093/molbev/msac158

Source DB: PubMed Journal: Mol Biol Evol ISSN： 0737-4038 Impact factor: 8.800

Introduction

Since the Out-of-Africa event, humans have spread across the world inhabiting regions with a wide variety of environments. Thus, human populations have been exposed to very different selective pressures, which have left traceable marks in the human genome (Nielsen ). Peru hosts extremely diverse ecosystems which can be broadly classified into three major ecological regions (ecoregions): the arid Pacific coast, the Andean highlands, and the Amazon rainforest (Ponce de León Bardalez 1994). This diversity provides a unique opportunity to identify those genomic regions and functional variants that could have favored human adaptation to the contrasting environmental pressures across the Americas. Ancestral Indigenous American populations are estimated to have diverged from east Asians around 23,000 years ago (ya) and entered the Americas after an approximately 8,000 year period of isolation in Beringia (Raghavan ). Although this was likely followed by a divergence into northern and southern branches between 17,500 and 14,600 ya in south-eastern Beringia, the specifics of the posterior population dispersal across the Americas are still under debate (Reich ; Raghavan ; Moreno-Mayar ; Posth ). Archaeological findings—such as the site of Monte Verde in southern Chile, dating to approximately 14,000 ya (Dillehay )—and genetic evidence, suggest that after the initial entry into the Americas humans dispersed rapidly, probably moving along a coastal route and reaching South America within a 1,500 year time span (Brandini ; Goldberg ; Llamas ). As for the three Peruvian ecoregions under study in this work, no consensus has been reached on how they were populated, but two main hypotheses have been suggested. The first hypothesis posits that there was a split migration through both sides of the Andes, followed by a divergence of highland and coastal populations (Skoglund and Reich 2016; Gómez-Carballa ; Harris ). The second hypothesis suggests that the three ecoregions were populated in parallel 12,000 ya by populations descending from Ancestral South Americans (Rothhammer and Dillehay 2009; Harris ). Genetic studies have identified differential genetic patterns when comparing populations from the Andes and Amazonia. These patterns suggest that Andean communities were large and connected by gene flow, whereas Amazonian groups were probably small and isolated (Tarazona-Santos ; Fuselli ; Lewis ). However, more recent archaeological modeling challenges that view by pointing to material evidence suggestive of the existence of large-scale Amazonian societies (de Souza ). Moreover, there is evidence for extensive past migrations and gene flow across the three ecoregions probably fostered by trade in products unique to each (Rodriguez-Delfin ; Tarazona-Santos ; Lewis ; Sandoval ; Barbieri ; Castro e Silva ). Many of these population movements occurred in the pre-Columbian period, when state-level civilizations such as the Inca Empire forced population movements across the region (Quilter 2013; D’Altroy 2014; Barberena ). Recent genetic research also suggests that some of these migrations may have occurred in response to climatic fluctuations (Fehren-Schmitz ). Subsequent movements also occurred during the Spanish conquest in the 16th century, which not only induced migration, but also caused important bottlenecks as a result of wars and of newly imported pathogens and epidemics (Merbs 1992; Patterson and Runge 2002; Livi-Bacci 2006; Riley 2010; O’Fallon and Fehren-Schmitz 2011; Lindo ; Llamas ). Furthermore, during the colonial period a number of admixture events occurred among Indigenous South Americans, European settlers, and populations of African ancestry who were forcibly brought to Peru by the Transatlantic Slave Trade (Gravel ; Homburger ; Chacón-Duque ; Harris ; Barbieri ). The diverse environments found across Peru have also influenced patterns of human settlement, migration, interaction, and subsistence (Escobar and Beall 1982; Cárdenas-Arroyo and Bray 1998; Murra 2002). As a global biodiversity hotspot, Peru hosts multiple ecosystems with high species endemism (de Queiroz ; Garcia-Longoria ). This biodiversity can be classified into ecological regions; defined land areas which contain distinct natural communities and differ from each other due to factors such as topography, climate, and vegetation (Olson ). Following this framework, three distinctive and sharply contrasting ecoregions have been identified in Peru: the hyper-arid desert Pacific coast, the high-altitude Andean mountain range, and the tropical rainforest lowlands of the Amazon (de Queiroz ). These three ecoregions differ in elevation, vegetation, faunal communities, and overall climate (although see Vidal 2014 and Britto 2017 for alternate classifications). We hypothesize that given the contrasting pressures posed by differences in environment, geography, and historical population sizes between coastal, highland and lowland rainforest environments, patterns of adaptation may differ in native populations from each Peruvian ecoregion. Several studies have focused on identifying human adaptation to hypobaric hypoxia at high altitudes, usually defined as altitudes >2,500 m above sea level (Moore 2001; Bigham ; Jacovas ). These high-altitude regions comprise populations living in the Andean Altiplano, as well as in the Qinghai–Tibetan plateau in China and the Semien plateau in Ethiopia (reviewed in Bigham and Lee 2014 and Moore 2017). Lowlanders experience a set of short-term physiological responses (or acclimatization) when exposed to lower atmospheric oxygen concentrations, such as an increase in ventilation rate, a reduction in plasma volume, and an increase in erythrocyte production, among others (Heath and Williams 1995; Siebenmann ). These responses aim to compensate for the arterial oxygen saturation decline, collectively contributing to an increase in hemoglobin. However, high concentrations of hemoglobin increase blood viscosity, exerting an additional stress on the cardiovascular system and hindering appropriate blood flow to the tissues (Guyton and Richardson 1961). In populations permanently inhabiting high-altitude environments, several physiological adaptations have arisen to avoid or somehow compensate for the putative negative effects of such physiological responses. Andeans, unlike other long-term highland populations, show larger hemoglobin concentrations (Beall 2007) and thus higher blood viscosity. However, to counteract its effects, they also present an additional layer of muscles in the pulmonary artery (Penaloza and Arias-Stella 2007) and higher pulmonary vasoconstriction (Rupert and Hochachka 2001; Beall 2007). These specific adaptive phenotypes suggest that local genetic adaptations to hypoxia must have arisen in Andean highlanders. Genome scans of positive selection focused on highland populations have described candidate genes for adaptation to hypoxia in the Hypoxia-Inducible transcription Factor (HIF) pathway (Bigham et al. 2009, 2010). Among these, EGLN1 plays a critical role in oxygen homeostasis in mammals and has been found to be under selection in both Andean and Tibetan populations (Bigham ; Yi ). ENDRA, PRKAA1, and NOS2A (Bigham et al. 2009, 2010) are involved in vasoregulation and have been associated with reproductive success (Moore 2010). Crawford generated low-coverage whole genome sequencing data from a sample of high-altitude resident Andeans and identified genes such as BRINP3, NOS2, and TBX5, which play an important functional role in the cardiovascular system, among the strongest signals of positive selection. Thus, a genetic adaptive strategy to permanent hypoxia related to cardiovascular phenotypes was specifically suggested for Andeans. In agreement with that, when exploring signatures of positive selection in Andeans, Borda identified HAND2-AS1, which is involved in the modulation of cardiogenesis (Anderson ; Cheng and Jiang 2019), but also DUOX2, which plays a role in the synthesis of thyroid hormones and in innate immunity (De Deken ; Maruo ; van der Vliet ). Deserts are typically characterized not only by aridity and water scarcity but also by extreme temperature changes and intense UV radiation. Studies of human adaptation to desert conditions are quite scarce (reviewed in Rocha ). Moreover, in the case of South American deserts, most selection scans have focused on populations from northern Chile (Apata ; Vicuña ) and Argentina (Eichstaedt ), which have been historically exposed to inorganic arsenic. As for the Amazon rainforest, characterized by low light incidence, tropical climate, and high pathogen diversity (Guernier ), Borda recently described the PTPRC gene, which regulates T- and B-cell antigen receptor interactions and plays a major role in the innate immune system (Anand and Ganju 2006; Dawes ; Caignard ; Windheim ), as one of the top signals for positive selection. These results are in concordance with previous genome scans for positive selection performed in other rainforest populations, where a number of immune-related candidate genes have been identified (Amorim ; reviewed in Fan ). In this study, genome-wide single-nucleotide polymorphism (SNP) data (Illumina Infinium® MEGA array) was gathered for 286 individuals distributed across Peru, and then analyzed to investigate signals of recent positive selection specific to populations living in the high-altitude environments of the Andes, the arid Pacific coast, and the Amazon rainforest. To this aim, we computed the Population Branch Statistic (PBS) and performed haplotype-based selection scans with the integrated Haplotype Score (iHS) and the cross-population Extended Haplotype Homozygosity (XP-EHH) statistics. Besides identifying the strongest signals of recent positive selection specific to each ecoregion, we used gene-set and trait-associated SNP over-representation approaches to unravel more subtle trends of adaptation in specific biological functions. Furthermore, we also explored genotype frequency changes correlated with elevation to identify putative adaptive changes directly related to environmental selective pressures across ecoregions.

Results

Dataset

We genotyped 189 samples collected from populations distributed across Peru with the Illumina Infinium® Multi-Ethnic Global Array (MEGA). The resulting dataset was merged with 119 additional Peruvian individual samples genotyped with the same array (Wojcik , Ioannidis ; see Materials and Methods), as well as to the publicly available genotype data from four populations from the 1000 Genomes Project (1 KGP): Yoruba from Ibdaban, Nigeria (YRI), Utah residents with Northern and Western European Ancestry (CEU), Han Chinese from Beijing (CHB), and Peruvians from Lima (PEL) (Auton ). After quality control (QC), the intersected dataset included 610,576 autosomal SNPs for 681 individuals, including 85 PEL and 286 newly compiled Peruvian individuals (supplementary figs. S1 and S2, Supplementary Material online). The latter were arranged into 15 departments across three differentiated ecoregions: Andean highlands (n = 94, after downsampling Puno individuals to 25 individuals to avoid over-representation), arid coast (n = 75), and Amazon rainforest (n = 27) (table 1 and fig. 1; for details about collection sites and excluded individuals see Materials and Methods and supplementary table S1, Supplementary Material online).

Table 1.

Individual Samples Grouped by Department and Ecoregion

Ecoregion	Department	Altitude range (mamsl)	N
Coast	Ica	15–585	11 (+1)[a]
	La Libertad	28	28 (+1)[a]
	Lambayeque	5–43	18
	Piura	5–60	11
	Tumbes	6–12	5
Highlands	Apurimac	2,760–3,665	20
	Cusco	3,345–3,913	24
	Ancash	2,965–3,281	10
	La Libertad	2,641–3,099	3
	Junin	3,249	2
	Lima	2,836	10
	Puno	3,827	25 (+75)[b]
Rainforest	Amazonas	1,022–1,630	2
	Loreto	100–111	9
	San Martin	207–860	9
	Junin	631	3
	Ucayali	280	4

Samples were assigned to each department and ecoregion according to the population collection site and/or additional origin information available (for details, see supplementary table S1, Supplementary Material online). La Libertad and Junin departments comprise two differentiated ecoregions.

One individual from La Libertad and one from Ica were excluded from the selection analyses due to the high proportions of European and African ancestries detected in them, respectively.

Puno samples were randomly downsampled to avoid over-representation in the genetic structure and selection analyses.

mamsl, meters above mean sea level.

Fig. 1.

(A) Map of Peru indicating 15 departments where sampled populations are located. Data points colored by ecoregion. Elevation is represented in meters above sea level using the colorbar scale. (B) PCA of Peruvian individuals colored by ecoregion, including populations from the 1000 Genomes Project (1 KGP*). (C) ADMIXTURE plot at K = 4 with the newly analyzed Peruvian samples classified per ecoregion. (D) Zoom in on the ADMIXTURE plot at K = 4 for the analyzed Peruvian samples further divided by department. *YRI, Yoruba from Ibdaban, Nigeria; CEU, Utah residents with Northern and Western European Ancestry; CHB, Han Chinese from Beijing; PEL, Peruvians from Lima. Individual Samples Grouped by Department and Ecoregion Samples were assigned to each department and ecoregion according to the population collection site and/or additional origin information available (for details, see supplementary table S1, Supplementary Material online). La Libertad and Junin departments comprise two differentiated ecoregions. One individual from La Libertad and one from Ica were excluded from the selection analyses due to the high proportions of European and African ancestries detected in them, respectively. Puno samples were randomly downsampled to avoid over-representation in the genetic structure and selection analyses. mamsl, meters above mean sea level.

Population Structure and Admixture

Global population structure and admixture patterns across the three Peruvian ecoregions were investigated by means of Principal Component Analysis (PCA) and ADMIXTURE analyses including the YRI, CEU, CHB, and PEL reference populations from the 1 KGP (fig. 1– and supplementary fig. S3, Supplementary Material online). The first principal component (PC1) clearly divides African from non-African ancestries, while PC2 separates European ancestries from the rest. Most Peruvian individuals cluster near the CHB individuals but form a rather dispersed cloud around them as a result of their diversity of African and European ancestry proportions. Individuals from the highlands have the smallest proportions of continental admixture relative to the rest of our sample. In the ADMIXTURE analysis, the lowest cross-validation error is found when four ancestral components (K = 4) are considered. These represent African, European, and East Asian ancestries, in addition to Native American ancestry (shown in blue in fig. 1and). The Peruvian individuals from Lima (PEL) included in the 1 KGP and the Peruvian populations included in our analyses show varying proportions of admixture with European and African ancestries, as expected given the PCA. Specifically, individuals from the coast carry the highest proportions of African admixture with an average African ancestry component of 13.4% in the Ica department, while the maximum average African ancestry component for highlanders is only 0.5% in the Apurimac department. For individuals from the rainforest ecoregion, the African component represents an average of 1.9% of their autosomal ancestry. European ancestry is more evenly represented throughout all Peruvian populations sampled, except for the highland population of Puno where it is practically absent (0.6% in average). For the coastal groups, proportions of European ancestry range from 11.1% in the Ica department to 24.3% in Tumbes, whereas in rainforest populations, European ancestry represents 16.8%. At K = 5, an additional ancestry component (shown in black in supplementary fig. S3, Supplementary Material online) appears in most Peruvian populations, displaying a North–South cline, as it is predominant in the coast and rainforest ecoregions, intermediate in the central Ancash and Apurimac highland departments, and marginal in Cusco and Puno (2.8% and 0.9%, respectively). At K = 7, this Peruvian component is further divided, slightly separating the coastal individuals (shown in black) from the rainforest individuals (shown in light green). Next, we repeated the PCA including only Peruvian populations, and the PCA and ADMIXTURE analyses masking all non-Native American components (see details in supplementary note, Supplementary Material online). The PCA on the unmasked Peruvian samples does not show a clear differentiation pattern across ecoregions. However, after the masking procedure, the following two patterns emerge: 1) the first component differentiates the rainforest from the coast and highland samples and 2) a gradient from the coast to the highlands is apparent in the second component (supplementary note, Supplementary Material online; fig. 1). The ADMIXTURE analysis on the masked dataset is in concordance with these findings, even if some sharing of the coast component is also detected in the Loreto and San Martin rainforest departments at K = 7 (supplementary note, Supplementary Material online; fig. 1). On the contrary, an IBD network analysis did not show any clear clustering by ecoregion (supplementary fig. S4, Supplementary Material online), in agreement with the Analysis of Molecular Variance (AMOVA), which showed that the fraction of variation among ecoregions was similar to that found among departments within each ecoregion (1.33% vs. 1.93%, both P-values = 0.001). Thus, even if ecoregions are not a clear distinctive unit of differentiation, a subtle degree of genetic substructure corresponding to the ecoregions can be recognized.

Candidate Genes for Positive Selection

Since differential environmental pressures exist between the three Peruvian ecoregions under study, we grouped all newly compiled Peruvian samples by ecoregion and performed three complementary tests of recent positive selection: the PBS, based on allele frequency differences between two closely related population groups with respect to an outgroup: the iHS based on identifying unusual extended haplotypes within a population to detect incomplete or ongoing sweeps (Voight ) and the XP-EHH, also based on haplotypes but better suited for detecting complete (or near complete) selective sweeps when comparing populations (Sabeti ). For each selection test, the top 50 signals for positive selection were identified following an empirical outlier approach (see details in Materials and Methods). We then annotated all genes present in these candidate regions with ANNOVAR (Wang ), and we used GeneCards (Stelzer ) and UniProt (The UniProt Consortium 2019) to characterize their functions. We only considered a signal of positive selection to be specific for each ecoregion when it was not present in the top 50 candidate regions of the other ecoregions. From a total of 118 candidate regions identified within the top 10 signals of each selection statistic and ecoregion, 22 are specific for highland populations, 21 for the coast, and 23 for the rainforest (tables 2–4; fig. 2). The remaining candidate regions are shared between at least two of the ecoregions (supplementary table S3, Supplementary Material online; for detailed annotations on the SNPs within each top region—including position, CADD-score, allele frequencies, overlapping genes, and local ancestry deviations—see supplementary tables S4–S15, Supplementary Material online). We note, however, that since each selection statistic captures different features of the selective sweep and since different pairwise ecoregion comparisons have been used, the candidate regions displaying outlier values for more than one statistic and consistently across comparisons should be considered the more reliable.

Fig. 2.

Manhattan plots representing the scores of each statistic for selection tests performed on highland (A), coastal (B), and rainforest (C) populations. In red, the top 10 significant candidate regions identified in each test. H, Highlands; C, Coast; R, Rainforest. (D) Number of top signals of positive selection detected with the PBS, his, and XP-EHH statistics per each ecoregion. Genomic regions are considered under selection exclusively in an ecoregion when they are not found within the top 50 signals obtained by tests performed on other ecoregions. In brackets, total number of top 10 signals detected without filtering out those also within the top 50 of the remaining ecoregions. *Indicates that the highest scoring SNPs within a peak are in an intergenic region. *1 in (B): HLA-DRB1, HLA-DQA1, HLA-DQA2, HLA-DQB2, HLA-DOB, TAP2, PSMB8, PSMB8-AS1, TAP1, PSMB9, LOC100294145, HLA-DMB, HLA-DMA, BRD2, HLA-DOA, HLA-DPA1. *2 in (B): SSSCA1-AS1, EHBP1L1, KCNK7, MAP3K11, PCNX3, RELA, KAT5, RNASEH2C, AP5B1, MIR1234, OVOL1-AS1, OVOL1, SNX32, CFL1, MUS81, EFEMP2, FIBP. Top 10 Selection Signals per Statistic and Population Comparison Exclusively Found in the Highland Ecoregion For each candidate region, the Native American local ancestry proportion (LAP) is included together with the corresponding local ancestry deviation (LAD). The allele frequency per ecoregion (H, C, and R for highlands, coast and rainforest, respectively) for the reference allele of a top outlier SNP within each candidate region is also indicated. When the region is identified by multiple tests, the top SNP is taken from the highest ranked and marked with *. Genes in intergenic candidate regions are only shown when their distance to the peak is <500 kbp. In bold, top 10 selection signals with the number representing the ranking of that signal; otherwise additional hits detected within the top 50 signals in the highlands. PBS, Population Branch Statistic; iHS, integrated Haplotype Score; XP-EHH, cross-population Extended Haplotype Homozygosity; HC, highlands to coast comparison; HR, highlands to rainforest comparison; H (HC), Highlands signal for the XP-EHH highlands to coast comparison; H (HR), Highlands signal for the XP-EHH highlands to rainforest comparison; H, highlands. Top 10 Selection Signals per Statistic and Population Comparison Exclusively Found in the Coast Ecoregion For each candidate region, the Native American local ancestry proportion (LAP) is included together with the corresponding local ancestry deviation (LAD). The allele frequency per ecoregion (H, C and R for highlands, coast and rainforest, respectively) for the reference allele of a top outlier SNP within each candidate region is also indicated. Genes in intergenic candidate regions are only shown when their distance to the peak is <500 kbp. In bold, top 10 selection signals with the number representing the ranking of that signal; otherwise additional hits detected within the top 50 signals in the coast. PBS, Population Branch Statistic; iHS, integrated Haplotype Score; XP-EHH, cross-population Extended Haplotype Homozygosity. CR, coast to rainforest comparison; CH, coast to highlands comparison; C (HC), Coast signals from the XP-EHH highlands to coast comparison; C (RC), Coast signals from the XP-EHH rainforest to coast comparison; C, coast. Top 10 Selection Signals per Statistic and Population Comparison Exclusively Found in the Rainforest Ecoregion For each candidate region, the Native American local ancestry proportion (LAP) is included together with the corresponding local ancestry deviation (LAD). The allele frequency per ecoregion (H, C and R for highlands, coast and rainforest, respectively) for the reference allele of a top outlier SNP within each candidate region is also indicated. When the region is identified by multiple tests, the top SNP is taken from the highest ranked and marked with *. Genes in intergenic candidate regions are only shown when their distance to the peak is <500 kbp. In bold, top 10 selection signals with the number representing the ranking of that signal; otherwise additional hits detected within the top 50 signals in the rainforest ecoregion. PBS, Population Branch Statistic; iHS, integrated Haplotype Score; XP-EHH, cross-population Extended Haplotype Homozygosity. RH, rainforest to highlands comparison; RC, rainforest to coast comparison; R (HR), Rainforest signal from the XP-EHH highlands to rainforest comparison; R (RC), Rainforest signal from the XP-EHH rainforest to coast comparison; R, rainforest. Within the top 10 signals of positive selection identified exclusively in Andean highlanders, we detect genes related to heart development (TLL1) and cardiac function (SGCD), glucose (GLIS3) and glycogen metabolism (PPP1R3C), as well as a gene belonging to the HIF pathway (TGFA), among others (table 2 and fig. 2). However, additional candidate genes related to the same biological functions were further identified among the top 50 strongest signals of positive selection shared between highlanders and populations from other ecoregions (fig. 2 and supplementary table S3, Supplementary Material online). Among these, we identify DUSP27 and TBX5, which are involved in heart development and detected in the rainforest ecoregion with different strengths; and GANC, which codifies a key enzyme in glycogen metabolism and is detected within the top 50 signals in the coast. Other interesting genes with strong signals of positive selection in highlands but also detected as top 50 candidates in the other ecoregions are related to thrombophilia (SERPINE1), cardiovascular development (PLXNA4), skin pigmentation (MITF), and the immune system (CD40) (supplementary table S3, Supplementary Material online).

Table 2.

Top 10 Selection Signals per Statistic and Population Comparison Exclusively Found in the Highland Ecoregion

Candidate region	Candidate genes	PBS	iHS	XP-EHH	Nat Am LAP (%)	Nat Am LAD (SDs)	RS id—Ref. allele	Allele freq. in H	Allele freq. in C	Allele freq. in R
1:111740082–111944073	CHIA, PIFO, PGCP1	—	H5*	—	88.03	0.53	rs2275254-T	0.89	0.84	0.76
1:178563987–178671750	C1orf220, MIR4424, RALGPS2	HC8	—	—	93.62	1.68	rs1122579-T	0.15	0.42	0.41
1:184029993–184283674	Intergenic—TSEN15, C1orf21	HC5 HR	—	—	94.15	1.90	rs12119930-A	0.21	0.5	0.43
2:146944278–147020927	Intergenic—PABPC1P2	HC1	—	—	90.43	0.35	rs2016340-G	0.72	0.58	0.65
2:70187173–72306479	FAM136A, TGFA, TGFA-IT1	HR5 HC	—	H (HR)	92.02	1.02	rs6714409-G	0.24	0.49	0.61
4:166597189–167143385	TLL1	—	—	H3 (HC)	92.02	1.02	rs1995126-A	0.94	0.74	0.94
4:178992130–179614998	Intergenic	—	H	H1 (HR)* H2 (HC)	88.83	0.31	rs2702432-G	0.54	0.73	0.57
4:61626653–61950476	MIR548AG1	—	H8	—	90.96	0.58	rs7699903-G	0.18	0.28	0.35
5:11641039–11742668	CTNND2	HC2	—	—	90.43	0.35	rs4702813-G	0.95	0.72	0.93
5:154745227–155568469	Intergenic—SGCD	HC9 HR1*	H	H8 (HR)	91.49–90.43	0.8–0.35	rs1432723-G	0.11	0.34	0.48
5:29886971–30137291	Intergenic—LOC105374704	HC6	—	—	94.68	2.12	rs10940848-C	0.14	0.44	0.19
5:86932232–87197535	Intergenic—CCNH, TMEM161B	—	—	H7 (HC)	90.96	0.58	rs710375-T	0.84	0.69	0.72
6:153600685–153955693	MIR7641-2	HR6 HC	—	—	90.43	0.35	rs1221930-G	0.22	0.49	0.63
9:18046709–18349993	Intergenic—SH3GL2, ADAMTSL1	HR10 HC	—	—	95.74	2.56	rs10810942-A	0.97	0.84	0.78
9:3962727–4529671	GLIS3, SLC1A1	HC4 HR	H6	H1 (HC)* H6 (HR)	94.15	1.90	rs7024944-A	0.90	0.78	0.85
10:93369096–93542186	PPP1R3C, TNKS2-AS1	HR2	—	—	91.49	0.80	rs150183914-T	0.06	0.16	0.28
11:134196849–134622517	Intergenic—LOC283177, LOC100507548	—	—	H10 (HC)	91.49	0.80	rs10750576-A	0.90	0.75	0.81
12:47312454–47985899	PCED1B, LOC105369747, MIR4494, VDR	HC HR	—	H4 (HR) H (HC)	90.96	0.58	rs855185-G	0.25	0.49	0.56
14:21924207–22160553	RAB2B, METTL3, SALL2, OR10G3, OR10G2	HC3	—	—	93.09	0.90	rs1263807-T	0.14	0.36	0.30
18:72073738–72108787	C18orf63, LINC01922, FAM69C	HR4	—	—	94.15	1.90	rs377380065-T	0.04	0.16	0.22
21:38251558–39104180	DSCR3, DYRK1A	—	—	H9 (HC) H (HR)	88.83	0.31	rs11911146-T	0.36	0.55	0.48
22:45421536–45527815	PHF21B, NUP50-AS1	—	—	H8 (HC)	92.55	1.24	rs738548-A	0.80	0.53	0.54

For each candidate region, the Native American local ancestry proportion (LAP) is included together with the corresponding local ancestry deviation (LAD). The allele frequency per ecoregion (H, C, and R for highlands, coast and rainforest, respectively) for the reference allele of a top outlier SNP within each candidate region is also indicated. When the region is identified by multiple tests, the top SNP is taken from the highest ranked and marked with *. Genes in intergenic candidate regions are only shown when their distance to the peak is <500 kbp. In bold, top 10 selection signals with the number representing the ranking of that signal; otherwise additional hits detected within the top 50 signals in the highlands.

PBS, Population Branch Statistic; iHS, integrated Haplotype Score; XP-EHH, cross-population Extended Haplotype Homozygosity; HC, highlands to coast comparison; HR, highlands to rainforest comparison; H (HC), Highlands signal for the XP-EHH highlands to coast comparison; H (HR), Highlands signal for the XP-EHH highlands to rainforest comparison; H, highlands.

Among the top 10 strongest signals of positive selection exclusively identified in coastal or rainforest populations, we mostly identified genes related to the immune system. These include different SIGLEC genes and the CD44 and ICAM1 genes in the coast, as well as the ALCAM-CBLB genomic region and the CARD8 gene in the rainforest ecoregion (tables 3 and 4 and fig. 2and). An additional strong selection signal around the PRDM1 gene, probably related to the immune response, was detected among the top 10 signals in the rainforest but also within the top 50 in highlands (supplementary table S3, Supplementary Material online). Moreover, at least three further candidate genes related to the immune response were identified as top 10 signals in both rainforest and coastal populations (FBXO40-HCLS1, BDR2, HLA-DOA, HLA-DPA1, and RNF220).

Table 3.

Top 10 Selection Signals per Statistic and Population Comparison Exclusively Found in the Coast Ecoregion

Candidate region	Candidate genes	PBS	iHS	XP-EHH	Nat Am LAP (%)	Nat Am LAD (SDs)	RS id—Ref. allele	Allele freq. in H	Allele freq. in C	Allele freq. in R
2:227205102–227428810	Intergenic—LOC646736, MIR5702	CR	—	C8 (RC)	78.77	0.81	rs9789638-A	0.80	0.84	0.65
2:231044200–231429220	SP110, SP140, SP140L, SP100	—	—	C6 (HC)	81.51	1.63	rs58941251-C	0.63	0.88	0.81
2:98629966–100380563	TSGA10, LIPT1	CH8	—	—	72.60	1.05	rs2632277-C	0.76	0.48	0.65
3:195477791–196737295	PIGZ, MELTF	CH3	—	—	76.03	0.02	rs544688-G	0.48	0.72	0.74
5:1875037–1948519	CTD-2194D22.4	CH10	—	—	82.19	1.84	rs200756822-G	0.53	0.33	0.46
5:63309892–64417806	RNF180, RGS7BP	CR3	—	—	78.77	0.81	rs16892721-A	0.51	0.53	0.85
7:101060777–101449071	COL26A1	—	C3	—	76.03	0.02	rs28759973-T	0.81	0.77	0.74
8:10390452–10485154	PRSS55, RP1L1	—	—	C4 (HC)	73.29	0.84	rs150931842-G	0.85	0.97	0.94
8:82335354–82849452	ZFAND1, CHMP4C, SNX16	CH5	—	—	73.29	0.84	rs11991098-A	0.51	0.32	0.46
8:96200507–96223793	PLEKHF2, LINC01298	—	C7	—	69.18	2.08	rs77609822-C	0.95	0.89	0.91
9:109802363–109982588	Intergenic—LOC340512, RAD23B	CR	—	C4 (RC)	75.34	0.22	rs12551497-A	0.71	0.87	0.69
10:63248358–63583070	TMEM26-AS1, C10orf107	CR10	—	—	78.77	0.81	rs1456279-A	0.22	0.12	0.30
11:35123195–35360016	CD44	CH6 CR	—	—	81.51	1.63	rs4756196-G	0.49	0.79	0.83
11:4434519–4460261	TRIM21, OR52K2	CH1	—	—	82.19	1.84	rs10633520-A	0.32	0.14	0.15
13:24656407–24789809	SPATA13	CH4	—	—	77.40	0.39	rs60187376-T	0.18	0.11	0.19
15:63290705–63372180	TLN2, TPM1, LOC100128979	—	—	C5 (HC)	84.25	2.46	rs72741190-A	0.87	0.94	0.70
18:6667606–6776759	Intergenic—LINC01387, LOC101927168	CH9	—	—	71.92	1.25	rs12456358-T	0.69	0.91	0.91
19:10285806–13466988	MRPL4, ICAM1	CR2	—	—	71.23	1.46	rs74257295-G	0.90	0.94	0.59
19:51892016–52250216	SIGLEC10, LOC100129083, SIGLEC8, CEACAM18, SIGLEC12, SIGLEC6	CH	—	C3 (HC)	80.14	1.22	rs39711-T	0.68	0.83	0.81
19:52009560–52246157	ZNF175, LINC01530, SIGLEC5, SIGLEC14, SPACA6P-AS	CH7	—	C (CH)	80.82	1.42	rs10500308-T	0.61	0.37	0.54
20:37013181–37291377	ARHGAP40	—	—	C6 (RC)	79.45	1.01	rs74983286-T	0.96	0.95	0.81
22:45116127–45367385	PRR5, PRR5-ARHGAP8, ARHGAP8	—	—	C2 (HC)	77.40	0.39	rs132410-A	0.66	0.78	0.69

For each candidate region, the Native American local ancestry proportion (LAP) is included together with the corresponding local ancestry deviation (LAD). The allele frequency per ecoregion (H, C and R for highlands, coast and rainforest, respectively) for the reference allele of a top outlier SNP within each candidate region is also indicated. Genes in intergenic candidate regions are only shown when their distance to the peak is <500 kbp. In bold, top 10 selection signals with the number representing the ranking of that signal; otherwise additional hits detected within the top 50 signals in the coast.

PBS, Population Branch Statistic; iHS, integrated Haplotype Score; XP-EHH, cross-population Extended Haplotype Homozygosity. CR, coast to rainforest comparison; CH, coast to highlands comparison; C (HC), Coast signals from the XP-EHH highlands to coast comparison; C (RC), Coast signals from the XP-EHH rainforest to coast comparison; C, coast.

Table 4.

Top 10 Selection Signals per Statistic and Population Comparison Exclusively Found in the Rainforest Ecoregion

Candidate region	Candidate genes	PBS	iHS	XP-EHH	Nat Am LAP (%)	Nat Am LAD (SDs)	RS id—Ref. allele	Allele freq. in H	Allele freq. in C	Allele freq. in R
1:198957867–199707262	Intergenic—LINC01221	—	R5*	R5 (RC) R8 (HR)	88.89	2.03	rs1325187-C	0.43	0.51	0.31
2:12821349–13089226	TRIB2, LOC100506474	RC10 RH	—	—	81.48	0.42	rs973977-T	0.61	0.64	0.22
2:164399671–164879136	FIGN	RH8 RC	—	—	81.48	0.42	rs13003002-T	0.51	0.52	0.31
2:41894321–42063475	Intergenic—LINC01913	—	R4	—	79.63	0.02	rs4952511-C	0.60	0.66	0.63
2:54506070–55183537	EML6	—	R8	R (RC)	70.37	1.99	rs17046413-A	0.54	0.65	0.54
3:105334386–105790348	ALCAM, CBLB	—	R	R2 (RC)	87.04	1.63	rs61138958-A	0.40	0.55	0.31
3:192279892–192338861	FGF12	—	—	R6 (RC)	81.48	0.42	rs781417-C	0.88	0.78	0.89
3:59667385–59821071	FHIT	RH6 RC	—	—	83.33	0.82	rs1683366-G	0.71	0.66	0.41
6:119335865–119655145	FAM184A	—	R6	—	72.22	1.59	rs612607-G	0.74	0.82	0.91
7:116617532–117003695	ST7, WNT2	RH7	—	R7 (HR)* R (RC)	83.33	0.82	rs4612282-T	0.60	0.73	0.81
8:59017474–59194917	FAM110B	—	—	R1 (RC) R (RH)	87.04	1.63	rs7843838-A	0.49	0.56	0.33
9:73915883–74044936	Intergenic—TRPM3, TMEM2	RH5* RC7	—	—	85.19	1.22	rs147846688-G	0.84	0.84	0.52
10:31860345–32054109	Intergenic—ZEB1, ARHGAP12	RC	—	R10 (RC)	85.19	1.22	rs796159-T	0.46	0.51	0.74
12:97635817–97737126	Intergenic—NEDD1, RMST	RC9	—	—	74.07	1.19	rs6538791-A	0.61	0.68	0.46
13:29222286–29411957	Intergenic—SLC46A3, MTUS2	RC4* RH4	—	—	79.63	0.02	rs17561728-A	0.69	0.80	0.44
13:97027560–97323933	HS6ST3	—	—	R10 (HR)	85.19	1.22	rs643765-T	0.57	0.60	0.76
14:32389856–32488986	LINC02313	RH10 RC	R	—	79.63	0.02	rs4640079-G	0.62	0.59	0.39
15:92467322–92717179	SLCO3A1	RH1* RC1	—	—	77.78	0.38	rs371469142-C	0.17	0.12	0.04
16:50361170–50505628	BRD7, LINC02178	RC5 RH	—	—	81.48	0.42	rs142711401-G	0.78	0.86	0.50
17:9536109–9724255	USP43, DHRS7C	—	—	R7 (RC)	83.33	0.82	rs380596-G	0.79	0.87	0.98
18:65485121–65528209	LOC643542	RC2	—	—	85.19	1.22	rs12454152-G	0.72	0.84	0.70
19:48622545–48935653	CARD8, CARD8-AS1, ZNF114	—	—	R9 (HR)	77.78	0.38	rs4801753-G	0.59	0.72	0.76
21:43444731–43454614	ZNF295-AS1	RH2* RC3	—	—	79.63	0.02	rs2839444-C	0.84	0.80	0.57

For each candidate region, the Native American local ancestry proportion (LAP) is included together with the corresponding local ancestry deviation (LAD). The allele frequency per ecoregion (H, C and R for highlands, coast and rainforest, respectively) for the reference allele of a top outlier SNP within each candidate region is also indicated. When the region is identified by multiple tests, the top SNP is taken from the highest ranked and marked with *. Genes in intergenic candidate regions are only shown when their distance to the peak is <500 kbp. In bold, top 10 selection signals with the number representing the ranking of that signal; otherwise additional hits detected within the top 50 signals in the rainforest ecoregion.

PBS, Population Branch Statistic; iHS, integrated Haplotype Score; XP-EHH, cross-population Extended Haplotype Homozygosity. RH, rainforest to highlands comparison; RC, rainforest to coast comparison; R (HR), Rainforest signal from the XP-EHH highlands to rainforest comparison; R (RC), Rainforest signal from the XP-EHH rainforest to coast comparison; R, rainforest.

Other top 10 selection signals were shared across regions (fig. 2 and supplementary table S3, Supplementary Material online). One of the strongest signals detected in the three ecoregions with the iHS statistic includes two candidate genes involved in lipid metabolism (CPT2 and LRP8). Additional strong signals shared between the coast and highlands include three candidate genes, detected mainly with the PBS statistic, which encode dual oxidases (DUOX2, DUOXA1, and DUOX1) that are involved in thyroid hormone synthesis and the immune system; as well as the TLR4 gene, detected with the iHS and XP-EHH statistics, and with an important recognized role in pathogen recognition and the innate immune response. As for those strong signatures shared between highlands and rainforest populations, we identified a candidate gene related to the negative regulation of hypoxic injury (APIP) and the aforementioned TBX5 gene related to heart development. Since previous studies have reported candidate genes for high-altitude adaptation in Andeans, we also analyzed the overlap with the top 50 signatures described here for the highland ecoregion (supplementary tables S16 and S17, Supplementary Material online). Out of 217 genes compiled from the literature, 24 were replicated within the top 50 signals found in highlands, with the TBX5, TGFA and DUOX2 genes being among the top 10 (supplementary table S17, Supplementary Material online). Other replicated previously known candidate genes for high-altitude adaptation include the BRINP3 and the HAND2-AS1 genes, associated with cardiac function and related phenotypes, and the EGLN3 gene, from the HIF pathway.

Over-representation Analysis in Candidate Regions for Positive Selection

For each test and ecoregion, we also conducted gene-set (supplementary table S18, Supplementary Material online) and trait-associated SNP (supplementary table S19, Supplementary Material online) over-representation analyses on the top 50 candidate regions for positive selection, using the DAVID and TraseR tools, respectively. In highland populations, the top 50 highest scoring signals for positive selection present a clear over-representation of genes related to sensory perception (PBS highlands vs. coast), fatty acid metabolism (PBS highlands vs. rainforest), xenobiotic metabolic process and drug metabolism (PBS highlands vs. rainforest and XP-EHH highlands vs. rainforest), as well as many immune-related terms, including immunoglobulin production involved in immunoglobulin-mediated immune response, antigen processing and presentation of peptide or polysaccharide antigen via MHC class II, innate immune response and antibacterial humoral response (iHS and XP-EHH highlands vs. rainforest). Moreover, we identify a significant over-representation of SNPs associated with traits related to body mass index and body weights and measures (mostly with PBS), to stroke and cardiovascular diseases (with PBS highlands vs. rainforest and iHS), to diabetes mellitus type 1 and metabolic syndrome X, to nutritional and metabolic diseases (with iHS), to gout (with XP-EHH highlands vs. rainforest), and also to immune system disease and bacterial infections and mycoses (with iHS). In coastal populations, the gene-set over-representation identifies groups of terms associated with fatty acid transport and triglyceride catabolism (PBS coast vs. highlands), thyroid synthesis and response to oxidative stress (PBS coast vs. rainforest), as well as to several immune-related terms (antigen processing and presentation of peptide or polysaccharide antigen via MHC class II with iHS, adaptive immune response with iHS and XP-EHH highlands vs. coast, and antibacterial humoral response and innate immune response with XP-EHH rainforest vs. coast). As for the identified over-represented traits, they include body height, body weights and measures and hypertrophy, left ventricular hypertrophy (with PBS coast vs. highlands) and diabetes mellitus type 1, digestive system disease, respiratory tract disease, immune system and cardiovascular disease (with iHS). In the rainforest ecoregion, we detect a similar over-representation of terms associated with the immune system (such as antigen processing and presentation of peptide or polysaccharide antigen via MHC class II with PBS rainforest vs. coast and with iHS), but also with xenobiotics and fatty acids (PBS rainforest vs. highlands), as well as with arrhythmogenic right ventricular cardiomyopathy and hypertrophic cardiomyopathy (with iHS), and with steroid hormone biosynthesis and metabolism of lipids (with XP-EHH rainforest vs. coast). Similarly, we identified enriched traits related to lipoproteins (PBS rainforest vs. highlands), but also to diabetes mellitus type 1 and metabolic syndrome, nutritional and metabolic diseases, cardiovascular diseases and myocardial infarction, immune system diseases, and bacterial infections and mycoses (mostly with PBS rainforest vs. coast and with iHS).

Adaptation to Environmental Pressures

Next, we used the Samβada software (Stucki ) to identify genotype frequency changes correlating with elevation as a proxy for any associated environmental pressure, while controlling for population structure. After Bonferroni correction for multiple testing on the P-values obtained from the Wald score, we obtained 1,937 SNPs, assigned to 1,118 genes as annotated using Variant Effect Predictor (VEP) (McLaren ), whose genotypes significantly correlated with elevation (supplementary tables S20 and S21, Supplementary Material online). A gene-set over-representation analysis on these genes revealed several terms related to xenobiotic metabolism in the highest scoring cluster, but also to dilated cardiomyopathy, cardiac muscle contractions, platelet homeostasis, and type II diabetes mellitus and regulation of insulin secretion, among others (supplementary table S22, Supplementary Material online). This result suggests the existence of concerted shifts of allele frequencies correlated with elevation in genes associated with xenobiotic metabolism as well as with cardiovascular and metabolic traits, and is consistent with the over-represented clusters previously identified for the top candidate regions for selection in the highlands ecoregion. Some of the genes detected in the Samβada analysis are also found among the top 10 signals of selection identified in each ecoregion, including 18 candidate genes for highlands, 12 for coast, and 14 for rainforest ecoregions (supplementary table S23, Supplementary Material online). Among them, we find some of the aforementioned candidate genes for positive selection, such as SGCD, DUOX1-DUOXA1, RNF220, CBLB, APIP, DUSP27, and GLIS3. These results suggest elevation (or any correlated environmental variable) as the primary selective pressure behind the genomic signatures of positive selection detected. For example, the TT genotype at an intronic SNP in GLIS3 (rs4567095) is highly correlated with elevation. This genotype is found at low frequencies in coastal populations and conversely, at higher frequencies in the highlands (supplementary fig. S5, Supplementary Material online).

Impact of Admixture on Positive Selection

We also investigated whether the detected signals of positive selection could have been facilitated by the recent admixture from European and African continental populations. After computing the local ancestry proportions of the Indigenous American, European, African, and East Asian ancestral components per SNP in each ecoregion, we did not identify any significant local standard deviations from the global ancestry proportions, meaning that none of the identified selection signals can be directly attributed to past admixture events with European or African populations (see details in supplementary tables S3–S15 and S24, Supplementary Material online). To ensure the robustness of our selection scan results to potential biases resulting from external (African or European) recent admixture, we also recalculated the three selection tests after masking all haplotypes with non-Native American ancestry from the dataset (see details in supplementary note, Supplementary Material online). The masking procedure significantly decreased the power to detect signals of selection because of the reduction of the number of available individuals, particularly in rainforest populations, and of marker density after QC. Despite that, an important proportion of SNPs within the top 10 signals identified for the original dataset were also found at least within the top 5% SNPs identified for the masked version, indicating a consistent overlap of selection signatures (supplementary note, Supplementary Material online; table 2).

Discussion

We have gathered and analyzed SNP genome-wide data for 286 individuals distributed across three differentiated ecoregions in Peru to investigate signals of recent positive selection specific to populations from the Andean highlands, the Pacific coast, and the Amazon rainforest. The compilation of SNP data from 1 KGP reference populations and subsequent PCA and ADMIXTURE analyses indicated varying proportions of European and African ancestral components across the three ecoregions, consistent with historical records of the Spanish colonization of Peru in the 16th century and the posterior arrival of African peoples through the Transatlantic Slave Trade (Aguirre 2005; D’Altroy 2014). Furthermore, we detect evidence for a North–South gradient of genetic differentiation between the northern coast and rainforest populations into the southern Andes (mostly including Cusco and Puno) that is consistent with previous analyses on Peruvian populations (Barbieri , Nakatsuka , Borda , Castro e Silva ). Additionally, when masking the non-native American genetic components, we also find some sharing of the coastal genetic component in northwestern rainforest individuals, which is consistent with the mixed ancestry profiles recently described by Borda and Castro e Silva . Within the top 10 signals of positive selection found in Andean highlanders, we identified multiple genes involved in heart function, including TLL1, known for its role in heart septation and positioning (Clark ; Sieron ); SGCD, highly expressed in skeletal and cardiac muscle and associated to dilated cardiomyopathy; PLXNA4 that encodes for plexin receptors which in turn interact with semaphorins that participate in heart and vascular morphogenesis (Epstein ); DUSP27, involved in muscle and heart development (Li 2011); and TBX5 which is annotated in the Gene Ontology (GO) under terms such as ‘heart development’, ‘bundle of His development’ and ‘atrial septum morphogenesis’, among others. Variants contained within the TBX5 gene signal have previously been associated with multiple heart-related traits in genome-wide association studies (GWASs) including atrial fibrillation (Christophersen ), QRS complex (Prins ) and other electrocardiographic traits (Holm ). Moreover, TBX5 is one of the three candidate genes related to cardiovascular health previously identified by Crawford in high-altitude Andeans. Similarly, PLXNA4 has been identified as a candidate gene for adaptation to hypoxia in Tibetan dogs (Witt and Huerta-Sánchez 2019). It has been suggested that the putative adaptive variants driving the selection signatures detected on genes related to heart function traits could have a protective role against an additional lifelong stress to the cardiovascular system given the higher concentration of red blood cells present in the bloodstreams of Andeans, and consequently, the higher blood viscosity (Crawford ). According to our selection analyses, TLL1 and SGCD are candidate genes exclusively identified in Andean highlander populations, whereas DUSP27, PLXNA4, and TBX5 were also detected within the top 50 iHS signals in either coastal or rainforest populations. Such a pattern could have easily resulted from gene flow from the highland ecoregion towards the coastal and rainforest populations, and would be consistent with previous genetic research (Gravel ; Chacón-Duque , Castro e Silva ). Alternatively, we could be detecting selection signatures already present in the ancestral Indigenous South American population before the divergence of the three population groups under study. In Andean highlanders, we also identified candidate genes for positive selection related to hypoxia such as APIP, which has been described to act as a negative regulator of ischemic injury (Cho ) and TGFA, already described as a candidate gene for positive selection in highland populations by Bigham and Bigham . Other top 10 signals of selection in Andeans are related to pigmentation and to the vitamin D receptor. MITF, a gene involved in melanocyte development, was identified within the top 10 signals for positive selection with the XP-EHH statistic when comparing highlands versus coast, and in the top 50 signals in highlands with the remaining tests of selection. As for the VDR gene, it was detected within a large candidate region in the top 10 signals of selection exclusively detected in highlands with the XP-EHH statistic when using the rainforest ecoregion as reference. Such a selection signal at VDR could result from a co-adaptation to reach optimal levels of vitamin D in an environment where strong UV radiation could have also favored darker skin pigmentation (Missaggia ). Incident ultraviolet (UV) radiation, which is correlated with skin pigmentation (Jablonski and Chaplin 2000), is the most likely environmental variable responsible for these selection signals in Andeans, since UV radiation increases with altitude (Blumthaler ). Other candidate genes for selection identified in highland populations are related to glucose (GLIS3) and glycogen (PPP1R3C and GANC) metabolism, which could result from differential fasting glycemia compared with populations living at sea level (reviewed in Woolcott ). In particular, GLIS3 was identified in the top 10 signals with the PBS statistic when comparing highlands to coast, with the iHS statistic in highlands, and in highlands versus coast and highlands versus rainforest with the XP-EHH statistic. Notably, an intronic SNP in GLIS3 (rs10974438), linked to the detected signatures, presents an elevated CADD-score (20.2) and has been associated with type 2 diabetes in multiple GWASs (Mahajan ; Xue ; Vujkovic ). Moreover, GLIS3 has been previously identified as a candidate for adaptation to highland environments in Tibetan populations (Wang ). Additional candidate regions for positive selection in Andeans include genes such as CD40, TNIP1, and CHIA, which play a role in the immune system and innate response (Blotta ; Mikolajczak ). Further research may reveal whether these adaptations occurred due to long-term exposure to the selective pressures posed by pathogens present in the Andes during the pre-Columbian period, or if they occurred more recently after exposure to novel diseases brought by Europeans (Merbs 1992; Patterson and Runge 2002; Riley 2010; O’Fallon and Fehren-Schmitz 2011; Lindo ). In the case of populations from the arid Pacific coast, many of the identified candidate genes for positive selection are involved in the regulation of the immune system. For instance, one of the top 10 signals includes several sialic acid-binding immunoglobulin-like lectin (SIGLEC) genes, which encode important cell surface receptors expressed on innate immune cells involved in modulating the host response (Cao and Crocker 2011). Among these, we identified SIGLEC5, whose encoded receptor has been shown to be targeted by group B Streptococcus to promote immune evasion (Carlin ). Other examples of immune-related candidate genes include ICAM1, which participates in several pathways annotated in Reactome as ‘adaptive immune system’, ‘signaling by interleukins’ and ‘cytokine signaling in immune system’, among others; and CD44, encoding for a widely-expressed adhesion receptor known to be upregulated after activation of naive T lymphocytes during their responses against invading microbes (Baaten ). To our knowledge, as of this writing there are no genome-wide positive selection studies conducted among populations from South American desert areas, except for those focused on the effects of arsenic as a selective pressure (Eichstaedt ; Apata ; Vicuña ; Rocha ). In populations from the Amazon rainforest, as at the coast, most of the identified selection signals contain candidate genes with functional roles in the immune system. Among those specifically found within the top 10 signatures in rainforest populations, we detected the ALCAM-CBLB genomic region and the CARD8 gene with the XP-EHH statistic when comparing rainforest with coastal and rainforest with highland populations, respectively. In particular, while the CBLB gene encodes a negative regulator of adaptive immune responses (Sanna ), CARD8 mediates inflammasome activation in response to pathogens and other signals (Johnson ). Despite initially focusing on candidate genes specific for each ecoregion, a significant number of them were detected in more than one ecoregion. Similarly, when extending the selection analysis to the top 50 signals, both the gene-set and the trait-associated SNP over-representation analyses displayed several common terms significantly enriched across ecoregions. In most cases, a candidate gene is clearly found within a top 10 signal of positive selection in a given ecoregion but it is also detected among the top 50 signals in others (mostly with the iHS and the PBS statistics). In other instances, two or more ecoregions share strong signatures (i.e., share the same candidate within the top 10 signals of selection detected by any selection statistic). In such cases, it is difficult to discern whether the selection signal and corresponding selective pressures are shared across ecoregions or whether the shared selection signal results from gene flow across them. Moreover, it could also be the case that the driving selective pressure was already present before the divergence of these populations from a common ancestral population group, and that we are thus detecting the remnants of past selection signatures. The same scenario applies for those gene-sets and trait-associated SNPs found enriched in the top 50 candidate regions for positive selection in more than one ecoregion. Some of the top 10 shared signatures across ecoregions include the TBX5 gene, detected with the iHS statistic in rainforest and highlands populations; the APIP gene also detected in highlands (with XP-EHH) and rainforest (with PBS); the DUOX2–DUOXA1–DUOX1 family of genes, that were found with PBS in both coast and highlands when compared with rainforest; the BRD2, HLA-DOA, and HLA-DPA1 genes, detected in rainforest and coastal populations with the iHS and XP-EHH statistics, respectively; and the CPT2 and LRP8 genes, involved in lipid metabolism, which were detected within the top 10 iHS signals in all three ecoregions. The DUOX2–DUOXA1–DUOX1 family encodes for dual oxidases which are required for the synthesis of the thyroid hormone (and hence could be related to thermoregulation), but also play a role in antimicrobial defence at mucosal surfaces (De Deken ; van der Vliet ). DUOX2 has already been reported under positive selection in Andean populations (Zhou ; Jacovas ; Borda ). Furthermore, adaptations related to lipid metabolism have been suggested to have occurred before the dispersals of the first Indigenous Americans. Amorim showed that the fatty acid desaturases (FADS) genes present signals of positive selection in Indigenous populations throughout the Americas, probably resulting from an adaptation event that took place in the common ancestral population previous to the first peopling events. Note that although we searched for deviations in local ancestry proportions, we did not find evidence of post-admixture selection in any of the signatures detected across ecoregions. Finally, we also used the Samβada software to investigate genotype frequency changes correlating with elevation while correcting for population structure. Such a strategy not only allowed us to explore whether the top signals of positive selection detected across ecoregions significantly correlated with elevation, but also to interrogate for concerted shifts of allele frequencies on genes involved in particular biological functions as expected under a model of polygenic selection (Pritchard ; Stephan 2016; Novembre and Barton 2018). Although we used elevation as an environmental variable in this analysis, we also note that the particular selective pressure driving such allele frequency correlations could be any correlated environmental variable such as temperature (annual mean, maximum in the warmest month or minimum in the coldest month), solar radiation, and other related environmental factors (such as dryness or vegetation). Several of the top candidate genes for adaptation across ecoregions were found to match with those genes displaying significant genotype frequency correlations with elevation (supplementary table S23, Supplementary Material online), meaning that elevation (or any highly correlated environmental variable) might have been acting as a selective pressure. Moreover, several of the common gene-set terms found enriched among the top 50 signals across ecoregions were also found to be enriched among those identified by Samβada. A limitation of this study is the use of genotype data. While we can reliably identify regions of the genome with the patterns of variation expected under positive selection, and thus propose putative candidate genes for adaptation, the use of SNP-array data does not allow the direct identification of the causal variant driving the selection signals. Therefore, for most of the top signals of positive selection detected we have focused on identifying plausible candidate genes, even if no functional variant in the suggested candidate region was interrogated in our array. Because of that, our results are limited by the current available knowledge on gene functions and biased towards particular selective pressures hypothesized as adaptive in each ecoregion. Furthermore, by comparing signals of selection across populations grouped into three ecoregions, our analysis may also overlook the impact of other localized environmental variation on human adaptation patterns. Future studies may overcome this limitation by comparing signals of selection between populations within each ecoregion, or by employing alternative frameworks to guide comparative analyses such as the 8, 11, or 13 ecoregion models proposed by geography and ecology scholars (Vidal 2014; Britto 2017). It should as well be noted, that the retrieved data does not have sufficient geographic coverage to be representative of all the genetic variation present in each ecoregion. This is especially the case for the rainforest ecoregion, for which we only analyzed 27 individuals, thus limiting the interpretation of our results. A further limitation is our heterogeneous sampling strategy which combined samples collected from several studies and thus contained varying levels of ethnographic and family origin information. Future efforts could include focused sampling designs that prioritize the detailed collection of family background information of the participants, including place of birth and long-term residence, at each sampling location. Despite these limitations, the study of signatures of positive natural selection in the human genome is valuable as it allows deciphering past and ongoing selective pressures which continue to impact present-day populations. Since natural selection operates on functional variants, the adaptive events resulting from local selective pressures could in turn have impacts on functional and medically relevant traits. In conclusion, our study investigated genetic adaptation among Peruvian populations from three distinct ecological regions: the desert Pacific coast, the Andean highlands and the Amazon rainforest lowlands. We found that among highlanders most top candidate genes for positive selection are involved in oxygen homeostasis, cardiovascular function, skin pigmentation, and lipid or glucose metabolism. In contrast, the top selection signals specific for populations living in the coast and rainforest ecoregions are mostly associated with the innate and adaptive immune response. These patterns suggest dissimilar selective pressures across populations from different ecoregions, and identify chronic altitude-associated hypoxia exposure and localized pathogen diversity as two of the main drivers of human adaptation in South America. Finally, we also found several signals of selection shared across populations from the three ecoregions. We infer that this pattern might result from historical gene flow between highland, coastal and rainforest communities, or from common adaptations present among ancestral Indigenous populations during the first dispersals into the Americas. Further detailed knowledge on the past and recent demographic events experienced by the populations inhabiting the three ecoregions studied here is required to be implemented in the appropriate selection and demography models, and to ultimately help to distinguish between selective pressures acting on more than a particular region or, conversely, shared signals originating from selection before divergence or posterior gene flow. By characterizing signals of selection among Peruvian populations from diverse ecological contexts, this work contributes to our understanding of the genetic changes underlying human adaptations to the diverse environments of the Americas.

Materials and Methods

Samples, Data Generation and QC

This study combines publicly available and newly generated SNP data from human populations from several coastal, highlands (defined as populations living >2,500 m above sea level) and rainforest areas of Peru (N = 308) (fig. 1, supplementary table S1, Supplementary Material online). Sample collection and DNA extraction were completed over different time periods with informed consent and ethical review approval from the participating institutions (for details, see supplementary table S1, Supplementary Material online). Protocols for this study were approved by Institutional Review Boards at Arizona State University under IRB ID 0312002159 and IRB ID 0312001490 Genetic Diversity in Present-Day Populations of the South-Central Andes (including Ancash and Nomatsiguenga); and Stanford University under eProtocol 20839 Population and Functional Genomics of the Americas. Approval was also granted by the Scientific Research Ethics Committee of Universidad Ricardo Palma under stipulation No. 1889-2004 and by the Parc de Salut Mar Clinical Research Ethics Committee (project reference No. 2019/8916/I). We generated genome-wide genotypes for over 1.2 million SNPs with the Illumina Infinium® Multi-Ethnic Global Array (MEGA) for 189 DNA samples collected in both rural and urban locations across Peru. These samples were genotyped across two batches in 2018 at the LABSERGEN Genomics Services Laboratory in the National Laboratory of Genomics for Biodiversity, Advanced Genomics Unit (UGA-LANGEBIO), CINVESTAV. Additionally, we included genotypes from 100 unrelated individuals from Puno, a highland city in south-eastern Peru, previously generated through the Population Architecture using Genomics and Epidemiology (PAGE) consortium (Wojcik ); and from 19 Peruvian individuals from Magdalena de Cao (La Libertad department) previously published by Ioannidis . All datasets used in this study are summarized in supplementary table S2, Supplementary Material online. We performed QC using Plink 1.9 (Chang ) independently on each genotype batch or previously available dataset. For each dataset, we removed structural variants, SNPs with duplicate marker names and SNPs with no physical position in the GRCh37 human genome assembly (hg19). SNPs on the reverse strand were flipped to the forward strand using snpflip (https://github.com/biocore-ntnu/snpflip). SNPs with ambiguous strandedness were removed. We restricted our analyses to autosomal SNPs, removed all variants with missing call rates over 5% (with plink command –geno 0.05) and filtered all SNPs failing a Hardy Weinberg equilibrium (HWE) exact test with P-values below 10−8 (–hwe 1e-8). We also filtered each dataset for duplicates and first-degree relatives by eliminating individuals with IBD values >0.5 as described in Anderson . Finally, individual samples with missing call frequencies over 10% (–mind 0.10) were also removed. After QC, each dataset had between 1.2 and 1.5 million autosomal SNPs (see details in supplementary table S2 and supplementary fig. S1, Supplementary Material online). Next, we merged all datasets at overlapping sites using Plink 1.9 and repeated the QC filtering on the obtained merged dataset, which contained 1,036,981 autosomal SNPs and 286 individuals. No SNPs or individuals were removed when filtering variants with missing call rates over 5% (–geno 0.05) and individuals with missing call frequencies over 10% (–mind 0.10). However, we removed 88 SNPs that failed the HWE filter (–hwe 1e-8) and excluded 110 monomorphic sites. The final merged dataset after QC included 1,036,783 SNPs and 286 individuals. We then intersected the resulting dataset with a subset of 395 individuals from the 1000 Genomes Phase 3 populations (1 KGP) including YRI, CEU, CHB, and PEL (Auton ). We restricted the intersected dataset to overlapping sites in the 1 KGP populations and again controlled for variant and sample missingness, as detailed above. Finally, we filtered SNPs with minor allele frequencies (MAFs) below 1% (–maf 0.01). The final intersected dataset after QC retained 681 individuals and 610,576 SNPs. The full QC process is detailed in supplementary fig. S2, Supplementary Material online. Except for five individuals from the Cajamarca department, six from Arequipa, and four additional samples that could not be clearly assigned, the newly compiled Peruvian samples were allocated across three different ecoregions taking into account place of birth, long-term residence or family origin information as documented during the sample collection stage (table 1 and supplementary table S1, Supplementary Material online). Moreover, to avoid over-representation of highland individuals, we randomly subsampled 25 individuals from the original 100 collected Puno samples. After this procedure, our dataset comprised 591 individuals, including 395 individuals from reference populations in the 1 KGP and 196 newly compiled Peruvians (27 from the rainforest ecoregion, 94 from the highlands, and 75 from the coast).

Global Ancestry Analyses

PCA was performed on the intersected dataset using SmartPCA from the EIGENSOFT 6.0.1 package (Patterson ). In addition, we also ran an unsupervised analysis with ADMIXTURE 1.3.0 (Alexander ) to estimate and recognize potential differential ancestry patterns in the Peruvian departments when analyzed together with the YRI, CEU, CHB and PEL reference populations from the 1 KGP. Specifically, we ran 10 random seeds for K = 2 to 10 ancestral components, and used PONG to represent the results (Behr ). For these analyses, input data were pruned for linkage disequilibrium by applying the –indep-pairwise flag in Plink 1.9 with a window size of 200 variants, a step of 25 variants and a correlation threshold of r2 equal to 0.5. The resulting dataset contained 405,493 variants and 681 individuals. Next, we identified identity-by-descent (IBD) segments using IBDseq r1206 (Browning and Browning 2013). As in Dai , we only kept IBD segments ≥ 3 cM, and removed those segments overlapping with chromosomal regions above 1 Mb with no SNPs across all individuals. Then, we computed the cumulative IBD length between each pair of individuals. We built an IBD network where each vertex represents an individual and each edge weight represents the cumulative IBD length between the connected individuals. As described in Han , edges with cumulative IBD lengths below 12 cM were removed to reduce information about less recent demography and spurious identified IBD. Edges with cumulative IBD lengths above 72 cM were removed as well, to avoid clustering of extended families. In addition, we executed an AMOVA (Excoffier ) at ecoregion and department levels in order to quantify the degree of population differentiation. For that, we used the poppr R package (Kamvar ) and a permutation test to obtain the corresponding P-values.

Analyses of Genomic Signatures of Positive Selection

Given that this study seeks to characterize adaptation amongst Peruvians with large proportions of Indigenous American ancestry, two individuals showing high proportions of African and European ancestries in the global ancestry analysis were not included in the selection analyses. Specifically, these individuals were CM_0240 from Ica (Coast) with an estimated 90% African ancestry proportion; and MDC025 from Magdalena de Cao (Coast), with an estimated 56% of European ancestry proportion (supplementary table S1, Supplementary Material online). Thus, a total of 194 Peruvian individuals including 73 samples from the coast, 94 from the highlands, and 27 from the rainforest (table 1) were analyzed to identify signals of positive selection in each ecoregion. Population Branch Statistic (PBS) values (Yi ) were calculated using a self-implemented script in R based on FST values (Weir and Cockerham 1984) computed with VCFTools 0.1.14 (Danecek ). We considered all pairwise comparisons among ecoregions (highlands, coast, and rainforest) and used CHB as an outgroup. After phasing the dataset using SHAPEIT2 (Delaneau ), the integrated Haplotype Score (iHS) (Voight ) was computed on each ecoregion separately using selscan 1.2.0 (Szpiech and Hernandez 2014). For this analysis, genotype data was polarized according to the ancestral allele information provided for the 1 KGP dataset. We also used selscan to compute the cross-population Extended Haplotype Homozygosity (XP-EHH) test (Sabeti ) on the phased data for the following ecoregion comparisons: highlands against coast, highlands against rainforest, and rainforest against coast. Normalization was computed for iHS and XP-EHH by minor allele frequency bins and genome-wide, respectively, using selscan default parameters. Candidate regions for positive selection were then identified through an empirical outlier approach. For each statistic and ecoregion comparison, we initially identified as putative peaks of selection those genomic regions whose SNP scores are above the top 5% of the empirical distribution only if they also comprised at least three SNPs with scores above the top 1% within the surrounding ±100 Kb. The SNPs within the 50 strongest peaks of signals of positive selection and all other SNPs in the top 1% of the distribution were annotated with ANNOVAR (Wang ). For the top 10 peak signals, putative candidate genes were determined considering the location of the highest scoring SNPs. Intergenic signals were assigned to surrounding genes only when these were at a distance below 50 Kb. Putative functional SNPs within the top 10 candidate regions for positive selection were further explored by searching for expression quantitative trait loci (eQTLs) and splicing QTLs (sQTLs) in the GTEx database (GTEx Consortium 2013), and for SNP-trait associations in the GWAS catalog (Welter ). For each ecoregion comparison and statistic, a gene-set over-representation analysis was performed in the top 50 highest scoring regions with the functional annotation clustering tool DAVID 2021 (Huang ), using the whole genome as background. The queried databases included OMIM (Amberger ), GO (Harris ), KEGG (Kanehisa ), and Reactome (Fabregat ). Only enrichment scores above 1.3, equivalent to a P-value of 0.05, were considered. Additionally, we performed a trait-associated SNP over-representation analysis on the candidate regions comprising the 50 top strongest signals. For this, we used the R package TraseR (Chen and Qin 2016), which includes all traits annotated in dbGaP (Mailman ) and the NHGRI GWAS catalog (Welter ). We also explored genotype frequency changes correlated with elevation, while correcting for population structure using the Samβada software (Stucki ). For this analysis, we retrieved the mean annual temperature, maximum temperature in the warmest month, minimum temperature in the coldest month and the solar radiation per month data from the WorldClim dataset (Fick and Hijmans 2017), as well as elevation data from the Shuttle Radio Topography Mission (Farr ). Because these environmental variables were greatly correlated among each other, for the analysis we only used elevation. We then included the two first principal components obtained in the analyses above as independent variables to account for population structure. After spliting all markers and corresponding genotypes for all individuals into 17 independent files of 75,000 markers, we ran Samβada in its supervision module and merged the corresponding output results. The P-value of each associated genotype in the output was computed from the Wald score (pop version), and adjusted using Bonferroni correction to account for multiple testing. Next, significant SNPs (P-value < 0.05) were annotated using VEP (McLaren ) and a gene-set over-representation analysis using DAVID 2021 was performed on the obtained candidate gene list. Finally, since most coastal and rainforest Peruvian populations display admixture with European and African ancestries (Ruiz-Linares ; Homburger ; Chacón-Duque ; Harris ), we investigated significant local ancestry proportion deviations (SD) that could potentially indicate post-admixture selection [SD > 4.42 as seen in Bhatia ]. We used RFMix v.1.5.4 (Maples ) in PopPhased mode, with a window size of 0.2 cM, and with two expectation-maximization (EM) iterations. The rest of parameters were kept as defined by default. As reference populations we used the CEU (n = 75), CHB (n = 75), and YRI (n = 75) individual samples from the 1 KGP (Auton ), and 75 Puno individuals from the PAGE consortium (Wojcik ) as the Native American proxy. Local ancestry proportions were then computed per SNP and per ecoregion from the per individual values obtained, after processing the outputs as described in Martin . In addition, to further assess the potential effects of recent admixture in the detection of positive selection, we repeated the selection analyses in each ecoregion after masking all alleles assigned to any ancestry other than Native American, and checked the concordance between the top candidates identified in the masked and unmasked procedures (see for deatils supplementary note, Supplementary Material online). Click here for additional data file.

126 in total

1. Detecting identity by descent and estimating genotype error rates in sequence data.

Authors: Brian L Browning; Sharon R Browning
Journal: Am J Hum Genet Date: 2013-10-24 Impact factor: 11.025

2. Multifaceted regulation of T cells by CD44.

Authors: Bas Jg Baaten; Cheng-Rui Li; Linda M Bradley
Journal: Commun Integr Biol Date: 2010-11-01

3. Population Histories of the United States Revealed through Fine-Scale Migration and Haplotype Analysis.

Authors: Chengzhen L Dai; Mohammad M Vazifeh; Chen-Hsiang Yeang; Remi Tachet; R Spencer Wells; Miguel G Vilar; Mark J Daly; Carlo Ratti; Alicia R Martin
Journal: Am J Hum Genet Date: 2020-03-05 Impact factor: 11.025

4. Smallpox and American Indians revisited.

Authors: James C Riley
Journal: J Hist Med Allied Sci Date: 2010-03-09 Impact factor: 2.088

5. Natural Selection on Genes Related to Cardiovascular Health in High-Altitude Adapted Andeans.

Authors: Jacob E Crawford; Ricardo Amaru; Jihyun Song; Colleen G Julian; Fernando Racimo; Jade Yu Cheng; Xiuqing Guo; Jie Yao; Bharath Ambale-Venkatesh; João A Lima; Jerome I Rotter; Josef Stehlik; Lorna G Moore; Josef T Prchal; Rasmus Nielsen
Journal: Am J Hum Genet Date: 2017-11-02 Impact factor: 11.025

6. Induced inhibition of ischemic/hypoxic injury by APIP, a novel Apaf-1-interacting protein.

Authors: Dong-Hyung Cho; Yeon-Mi Hong; Ho-June Lee; Ha-Na Woo; Jong-Ok Pyo; Tak W Mak; Yong-Keun Jung
Journal: J Biol Chem Date: 2004-07-15 Impact factor: 5.157

7. Evolutionary genomic dynamics of Peruvians before, during, and after the Inca Empire.

Authors: Daniel N Harris; Wei Song; Amol C Shetty; Kelly S Levano; Omar Cáceres; Carlos Padilla; Víctor Borda; David Tarazona; Omar Trujillo; Cesar Sanchez; Michael D Kessler; Marco Galarza; Silvia Capristano; Harrison Montejo; Pedro O Flores-Villanueva; Eduardo Tarazona-Santos; Timothy D O'Connor; Heinner Guio
Journal: Proc Natl Acad Sci U S A Date: 2018-06-26 Impact factor: 11.205

8. Genome-wide association analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes.

Authors: Angli Xue; Yang Wu; Zhihong Zhu; Futao Zhang; Kathryn E Kemper; Zhili Zheng; Loic Yengo; Luke R Lloyd-Jones; Julia Sidorenko; Yeda Wu; Allan F McRae; Peter M Visscher; Jian Zeng; Jian Yang
Journal: Nat Commun Date: 2018-07-27 Impact factor: 14.919

9. Genome-wide mouse mutagenesis reveals CD45-mediated T cell function as critical in protective immunity to HSV-1.

Authors: Grégory Caignard; Gabriel A Leiva-Torres; Michael Leney-Greene; Benoit Charbonneau; Anne Dumaine; Nassima Fodil-Cornu; Michal Pyzik; Pablo Cingolani; Jeremy Schwartzentruber; Jeremy Dupaul-Chicoine; Huaijian Guo; Maya Saleh; André Veillette; Marc Lathrop; Mathieu Blanchette; Jacek Majewski; Angela Pearson; Silvia M Vidal
Journal: PLoS Pathog Date: 2013-09-12 Impact factor: 6.823

10. Genomic Insights into the Ancestry and Demographic History of South America.

Authors: Julian R Homburger; Andrés Moreno-Estrada; Christopher R Gignoux; Dominic Nelson; Elena Sanchez; Patricia Ortiz-Tello; Bernardo A Pons-Estel; Eduardo Acevedo-Vasquez; Pedro Miranda; Carl D Langefeld; Simon Gravel; Marta E Alarcón-Riquelme; Carlos D Bustamante
Journal: PLoS Genet Date: 2015-12-04 Impact factor: 5.917