Resolving the Insertion Sites of Polymorphic Duplications Reveals a HERC2 Haplotype under Selection.

Abstract

Polymorphic duplications in humans have been shown to contribute to phenotypic diversity. However, the evolutionary forces that maintain variable duplications across the human genome are largely unexplored. We developed a linkage-disequilibrium based method to detect insertion sites of polymorphic duplications not represented in reference genomes. This method also allows resolution of haplotypes harboring the duplications. Using this approach, we conducted genome-wide analyses and identified the insertion sites of 22 common polymorphic duplications. We found that the majority of these duplications is intrachromosomal and only one of them is an interchromosomal insertion. Further characterization of these duplications revealed significant associations to blood and skin phenotypes. On the basis of population genetics analyses, we found that the duplication of a well-characterized pigmentation-related region, including the HERC2 gene, may be selected against in European populations. We further demonstrated that the haplotype harboring this duplication significantly affects the expression of the HERC2P9 gene in multiple tissues. Our study sheds light onto the evolutionary impact of understudied polymorphic duplications in human populations and presents methodological insights for future studies.

Keywords: KRT; copy number variation; natural selection; structural variants

Year: 2019 PMID: 31124564 PMCID: PMC6587411 DOI: 10.1093/gbe/evz107

Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416

Introduction

Genomic structural variation (duplications, deletions, translocations, and inversions of genomic segments) has increasingly been appreciated as a driver of human phenotypic variation, accounting for several key adaptive phenotypes, as well as disease susceptibility (Zhang et al. 2009; Weischenfeldt et al. 2013). One of the best-known examples of the evolutionary impact of structural variation is the CCR5-Δ32 deletion polymorphism, which is associated with HIV resistance (Dean et al. 1996; Sabeti et al. 2005). Another recent example that also invokes the adaptive role of structural variation in resistance to pathogens is the reassessment of the haplotypic architecture of the structural variants involving haptoglobin Glycophorin A and Glycophorin B genes. This study revealed multiple instances of recurrent evolution of structural variants in this locus that are associated with malaria resistance in African populations (Leffler et al. 2017). Even though its timing and nature are under scrutiny (Inchley et al. 2016; Fernández and Wiley 2017), another example of likely adaptation involving structural variation is the expansion of salivary amylase gene copy number among humans, likely driven by high starch consumption (Meisler and Ting 1993; Perry et al. 2007).

Despite the genomic and phenotypic impacts exemplified above, few studies have addressed the evolutionary forces that shape the trajectories of polymorphic duplications. We argue that the main reason for the paucity of evolutionary studies on polymorphic duplications is that current, short-read based discovery and genotyping approaches are unable to resolve the genomic locations of inserted duplicated gene copies; consequently, the haplotypic variation associated with a given duplication cannot be fully studied.

The two commonly used approaches to detect polymorphic duplications based on short-read sequences depend on paired-end mapping and read-depth (Mills et al. 2011; Zhao et al. 2013). The paired-end mapping approach depends on discordantly mapped read pairs, in which the distance between the two mapped reads differs from the expected distance. This method can detect some of the tandem duplications (Sudmant et al. 2015) and can be modified to detect mobile element insertions (Lee et al. 2012). However, this method is highly prone to false negatives as the short reads often fail to map to repetitive sequences (Narzisi and Schatz 2015). This problem is further aggravated by the complexity of a considerable portion of the loci harboring duplications, that is, they involve highly repetitive sequences (Sudmant et al. 2015). The more sensitive approaches to detect polymorphic duplications depend on read-depth, where deviations in the depth of coverage in a genomic region as compared with genome-wide expectations can signal copy number gain and loss of that particular sequence (Alkan et al. 2011). This method is relatively robust, especially if the duplication is large. However, read-depth methods alone cannot detect the insertion site of the duplicated sequence.

In summary, currently available methods using short reads to detect polymorphic duplications are limited in their ability to detect the insertion sites of the duplicated sequences. Thus, the haplotypes harboring polymorphic duplications, which are crucial to conduct neutrality tests and functional analyses, often remain elusive. Here, by applying a novel linkage disequilibrium based method to the 1000 Genomes Project phase 3 data set (Sudmant et al. 2015), we detected the insertion sites of 22 common human polymorphic duplications. This data set allowed us to more thoroughly investigate the potential adaptive contributions of some of these duplications to human phenotypic diversity.

Results

Detecting Putatively Adaptive Duplications and Their Insertion Sites

To detect insertion sites of polymorphic duplications, we utilized genome-wide linkage disequilibrium between the genotyped duplications (for which the insertion site is unknown) and single nucleotide variants (SNVs) across the human genome. Specifically, we assumed that when a duplicated sequence was inserted in a certain genomic region and subsequently increased in allele frequency, the flanking SNVs would show linkage disequilibrium with the duplication (fig. 1). This method can only detect the insertion sites of gene duplications with relatively high allele frequency. The signal weakens considerably if the haplotype harboring the duplicated sequence undergoes recombination or gene conversion as expected from the previous studies (Saitou et al. 2018). It is also important to note here that our method’s power depends on the accuracy of variation calls and phasing of the database that we are using.
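The scan described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: haplotypes are assumed to be phased and coded 0/1 (1 = duplication present, or alternative SNV allele), R2 is taken as the squared Pearson correlation between the two allele vectors, and the function names and toy data are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length 0/1 allele vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) * (a - mean_x) for a in x) / n
    var_y = sum((b - mean_y) * (b - mean_y) for b in y) / n
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic site carries no LD information
    return cov * cov / (var_x * var_y)


def ld_peak(dup_alleles, snvs):
    """Return (chrom, pos, r2) for the SNV with the highest r2 to the duplication.

    snvs is a list of (chrom, pos, alleles) tuples covering the whole genome.
    """
    best = None
    for chrom, pos, alleles in snvs:
        r2 = r_squared(dup_alleles, alleles)
        if best is None or r2 > best[2]:
            best = (chrom, pos, r2)
    return best


# Hypothetical toy data: 8 phased haplotypes, 3 SNVs on two chromosomes.
dup = [1, 1, 0, 0, 0, 0, 1, 0]
snvs = [
    ("chr17", 100, [1, 1, 0, 0, 0, 0, 1, 0]),  # perfectly linked
    ("chr17", 200, [1, 0, 0, 0, 0, 0, 1, 0]),  # partially linked
    ("chr2", 300, [0, 1, 0, 1, 0, 1, 0, 1]),   # essentially unlinked
]
peak = ld_peak(dup, snvs)
```

Because every SNV in the genome is scanned, the peak is not restricted to the chromosome of the source sequence, which is what allows an interchromosomal insertion to be detected.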

Fig. 1. The strategy of the linkage disequilibrium-based method to detect the insertion region of the polymorphic duplication. (A) The schematic representation of our approach to detect insertion sites and haplotypes harboring polymorphic duplications. (B) One example of the linkage disequilibrium peak between a duplication and SNVs. Each dot indicates an SNV. The X-axis shows chromosomal locations and the Y-axis shows the linkage disequilibrium between each SNV and a specific polymorphic duplication, in this case, esv3641421, in the CEU population. (C) The filtering process of the duplications. (D) The breakdown of the number of duplications based on their exonic content and allele frequency. The legend below indicates the color-coded functional categories. We observed an enrichment of genic content among the very common (>5% allele frequency) duplications when compared with the genic content of all polymorphic bi-allelic duplications (P-value = 0.03684, one-tail Pearson's Chi-squared test with Yates' continuity correction).

We chose to apply this method to the data provided by the 1000 Genomes Project phase 3 data set (Sudmant et al. 2015), which reports 6,024 polymorphic duplications. We chose this data set as it remains the most accurate population-level compilation of human variable duplications and phased SNVs essential for our analysis. Specifically, the 1000 Genomes consortium applied multiple algorithms to detect polymorphic duplications including Delly (Rausch et al. 2012) and Genome STRiP (Handsaker et al. 2015). More importantly, comprehensive external validation of the discovered structural variants was used to minimize the false positive rate. Last but not least, the whole genome sequencing from thousands of individuals allows integrative phasing of all variants, which provided haplotype context of the structural variations (The 1000 Genomes Project Consortium et al. 2015).
Therefore, we argue that the 1000 Genomes Project phase 3 data set provides one of the most accurate short-read sequence based population-level structural variant callsets available, along with the SNV information from the same individuals. To further minimize false-positive structural variant calls and to avoid complicating our data set, we conducted some preliminary filtering (fig. 1). First, we eliminated multiallelic copy number variations and focused only on bi-allelic duplications reported as 2, 3, or 4 diploid copies in humans. To increase our power for detecting linkage disequilibrium, we focused on common duplications observed at >5% allele frequency in any of Central Europeans from Utah (CEU), Yoruba from Ibadan (YRI), or Han Chinese from Beijing (CHB). After this filtering, we were left with 33 common, bi-allelic duplications for this study. We identified observable peak(s) of linkage disequilibrium for 22 out of 33 common duplications (fig. 1) with SNVs across the genome (table 1, supplementary table S1, Supplementary Material online). For the other 11 common duplications, we were not able to identify a linkage disequilibrium peak. Previous studies have shown that gene conversion and recurrence can explain this pattern. For example, Boettger (2016) reported the recurrent exonic deletions of the haptoglobin locus, for which the haplotype architecture was complex. Similarly, our own work showed the joint effect of recurrence and gene conversion in complicating the haplotypic background of structural variation in the GSTM1 locus (Saitou et al. 2018). Thus, similar characterization efforts to resolve the haplotypes harboring these 11 duplication polymorphisms would provide important avenues for future research. Nevertheless, in this study, we focused our analysis on the 22 duplications for which we were able to detect linked haplotypic variation.
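The two filtering steps can be written as a small predicate. The sketch below is illustrative only: the record layout (a `multiallelic` flag and a per-population `freq` dictionary) is an assumption made for the example, the two `dup_` records are invented, and the two `esv` records reuse frequencies from table 1.

```python
POPULATIONS = ("CEU", "YRI", "CHB")


def is_common_biallelic(record, min_freq=0.05):
    """Keep bi-allelic duplications seen at >5% in at least one of CEU, YRI, CHB."""
    if record["multiallelic"]:
        return False
    return any(record["freq"].get(pop, 0.0) > min_freq for pop in POPULATIONS)


records = [
    # Frequencies for the two esv entries are taken from table 1.
    {"id": "esv3641421", "multiallelic": False, "freq": {"CEU": 0.071, "CHB": 0.005}},
    {"id": "esv3635993", "multiallelic": False,
     "freq": {"CEU": 0.025, "CHB": 0.66, "YRI": 0.282}},
    # Hypothetical records that the filter should drop.
    {"id": "dup_rare", "multiallelic": False, "freq": {"CEU": 0.01, "YRI": 0.02}},
    {"id": "dup_multi", "multiallelic": True, "freq": {"YRI": 0.30}},
]

kept = [r["id"] for r in records if is_common_biallelic(r)]
```

Note that a duplication passes if it is common in any one of the three populations, so esv3635993 is retained even though it is rare in CEU.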

Table 1

All the 22 Duplications and One of Their Tag SNVs, R2 Value, and the Phenotypic Information by Gene ATLAS

ID	chr	gene	freq_CEU	freq_CHB	freq_YRI	overlap	tag SNP	TagSNP_hg19	R ²	GENEATLAS
esv3584976	chr1	FAM41C, FAM87B	—	0.058	—	Whole-gene	rs528265132	chr1_844674	0.406	NA
esv3585141	chr1	nongenic	0.056	—	—	Nongenic	rs74865018	chr1_8208573	0.496	Skin color, P-value = 1.4E–09
esv3589561	chr1	OR2T27	—	—	0.343	Partial-exon	rs28502564	chr1_248831191	0.514	NA
esv3590421	chr2	GALM, SRSF7	0.076	—	—	Whole-exon	rs112011213	chr2_38967847	0.857	NS
esv3590859	chr2	PNPT1	—	—	0.069	Whole-exon	rs115094228	chr2_55731753	0.626	NA
esv3592511	chr2	GPR39	—	—	0.069	Intronic	rs77354775	chr2_133345124	0.751	NA
esv3594536	chr2	TM4SF20	0.076	—	—	Whole-exon	rs80058427	chr2_228399781	0.425	NS
esv3599142	chr3	FGF12, FGF12-AS1	—	0.058	0.005	Intronic	rs6788805	chr3_192012511	0.569	NA
esv3599420	chr4	HTT-AS	—	—	0.051	Whole-exon	rs1557213	chr4_3038415	0.912	NA
esv3601317	chr4	nongenic	0.136	0.049	—	Nongenic	rs74797043	chr4_90102254	1.000	NS
esv3603011	chr4	TRIM61	—	—	0.116	Whole-gene	rs78990101	chr4_161987911	0.880	NA
esv3620370	chr9	UNC13B	—	—	0.083	Intronic	rs111637861	chr9_35230046	0.942	NA
esv3620559	chr9	APBA1	—	—	0.111	Whole-exon	rs186797639	chr9_72022475	0.511	NA
esv3631000	chr12	ZNF664, ZNF664-FAM101A	0.096	—	0.037	Whole-exon	rs73131333	chr2_3953369	1.000	NS
esv3631499	chr13	nongenic	—	—	0.065	Nongenic	rs115022408	chr13_23424799	1.000	NA
esv3632749	chr13	COMMD6	—	—	0.134	Whole-exon	rs61645976	chr13_76107661	0.718	NA
esv3635993	chr15	HERC2	0.025	0.66	0.282	Intronic	rs376191081	chr15_28549862	0.751	NA
esv3640164	chr17	TRIM16L	0.056	0.083	0.079	Whole-exon	rs199526489	chr17_15546785	0.950	NS
esv3640585	chr17	KRT34	0.025	0.029	0.13	Whole-gene	rs9914283	chr17_39541260	0.959	NS
esv3641421	chr17	TEX19	0.071	0.005	—	Whole-gene	rs74001624	chr17_80314483	1.000	Monocyte percentage, P-value = 2.8E–16
esv3643776	chr19	CYP4F12	—	—	0.069	Whole-gene	rs112344570	chr19_15831904	1.000	NS
esv3645658	chr20	TTLL9	—	—	0.056	Whole-exon	rs73903650	chr20_30391721	0.928	NS

Note.—We described one tag SNV for each polymorphic duplication, with the highest linkage disequilibrium in table 1. We provide the highest R2 values observed in CEU, CHB, or YRI populations (the frequency column is bolded). The tag SNVs, thus, can be population specific. We bolded the allele frequency column to designate the populations where we identified the tag SNPs for the particular duplications. When we found multiple SNVs with the same R2 value, we chose one SNV which reported the SNV that is physically located in the middle of the most upstream and downstream SNV with equally high R2 values. All the tag SNVs are reported in table S1, Supplementary Material online.
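The tie-breaking rule in the note can be sketched as follows. This reflects one reading of the rule (take the SNVs that share the highest R2 and report the one closest to the midpoint between the most upstream and most downstream of them); the function name and the toy SNVs are hypothetical.

```python
def pick_tag_snv(snvs):
    """Choose one tag SNV from a list of (rsid, position, r2) tuples.

    Among SNVs tied for the highest r2, return the one physically closest
    to the midpoint of the most upstream and most downstream tied SNVs.
    """
    top_r2 = max(r2 for _, _, r2 in snvs)
    tied = [s for s in snvs if s[2] == top_r2]
    midpoint = (min(s[1] for s in tied) + max(s[1] for s in tied)) / 2
    return min(tied, key=lambda s: abs(s[1] - midpoint))


# Hypothetical example: three SNVs tie at r2 = 1.0, spanning positions 100 to 300.
candidates = [("rsA", 100, 1.0), ("rsB", 180, 1.0), ("rsC", 300, 1.0), ("rsD", 400, 0.9)]
tag = pick_tag_snv(candidates)
```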

We found that 21 out of 22 (∼95%) duplication insertion sites are on the same chromosome as the duplicated sequence. Closer examination of the haplotypes harboring intrachromosomal duplications revealed that five of them overlap with the original duplicated region, six of them are located (>1 kb) upstream of the region, and eight of them are located (>1 kb) downstream of the region (table 1 and supplementary fig. S1, Supplementary Material online). Additionally, we found that one duplication (esv3631000), which contains the gene ZNF664 located on chromosome 12, is inserted into chromosome 2. This observation was supported by 17 SNVs on chromosome 2 in strong linkage disequilibrium (R2 > 0.8) with the duplication (supplementary table S1, Supplementary Material online). ZNF664 is classified as a retroduplication (Abyzov et al. 2013). Thus, the retroposon machinery may have facilitated a copy-and-paste mechanism that placed the reverse-transcribed mRNA of the original gene at a random insertion point, in this case, on chromosome 2. We then scrutinized the genic content of the filtered duplications.
Of the 22 duplicated sequences, 5 contain whole genes, 9 contain coding exonic sequences, 5 contain intronic sequences, and 3 contain only intergenic sequences (table 1). We then asked whether the proportion of duplications containing coding sequences is higher than expected, especially given that a previous study reported that only ∼20% of common duplications overlap with coding sequences (Conrad et al. 2010). This contrasts with the >50% of the duplications we outlined that overlap with an entire gene or coding exon (fig. 1). We observed an enrichment of genic duplications among the common duplications compared with the initial duplication set (P-value = 0.03684, one-tail Pearson's chi-squared test) (fig. 1). We found that duplications in strong linkage disequilibrium with SNVs do not significantly differ in their coding sequence content from duplications that are not (P-value = 0.8845, one-tail Pearson's chi-squared test) (fig. 1). We further confirmed the general consensus that allele frequency is negatively correlated with genic content among the 1000 Genomes Project phase 3 data set duplications. However, we found an increase of genic duplications among very common (>5% allele frequency) duplications in general (supplementary fig. S2, Supplementary Material online). Thus, the highly genic nature of the 22 duplications that we focus on in this study is a property of their high allele frequency, and the underlying evolutionary reasons for this overall increase remain an open question.
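For readers unfamiliar with the test, a Pearson chi-squared statistic with Yates' continuity correction on a 2x2 table can be computed as sketched below. The underlying counts for the reported P-values are not given in this section, so the table here is purely illustrative: the first row follows the 14 coding versus 8 noncoding split stated above, and the second row is invented. The one-tailed value is assumed to be half the two-sided value when the deviation is in the tested direction.

```python
from math import erfc, sqrt


def yates_chi2_2x2(a, b, c, d):
    """Chi-squared test with Yates' correction for the table [[a, b], [c, d]].

    Returns (chi2, two_sided_p) using the chi-squared distribution with 1 df.
    """
    n = a + b + c + d
    corrected = max(0.0, abs(a * d - b * c) - n / 2)
    chi2 = n * corrected * corrected / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of chi-squared with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_two_sided = erfc(sqrt(chi2 / 2))
    return chi2, p_two_sided


# Row 1: coding vs noncoding among the 22 duplications (from the text above).
# Row 2: HYPOTHETICAL background counts, invented for illustration only.
chi2, p_two_sided = yates_chi2_2x2(14, 8, 400, 600)
p_one_tail = p_two_sided / 2  # assumed one-tailed convention
```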

Partial HERC2 Duplication May Be Selected against in European Populations

Our main goal in this paper is to leverage the haplotypes of the polymorphic duplications to identify potential selective forces acting on specific polymorphic duplications. To achieve this, we first calculated the allele frequency differences between populations for the 22 polymorphic duplications that we focus on in this study. Then we compared these differences to those calculated for 3,102 randomly selected very common (>5% alternative allele frequency in CEU, YRI, or CHB to match our initial filtering) SNVs extracted from the 1000 Genomes phase 3 data set (The 1000 Genomes Project Consortium et al. 2015) (fig. 2). We found that a partial duplication of a well-characterized gene, the HECT And RLD Domain Containing E3 Ubiquitin Protein Ligase 2 gene (HERC2) (esv3635993), showed apparently higher allele frequency differentiation than the other gene duplications as well as the majority of random SNVs analyzed as a null background (table 1, fig. 2, supplementary fig. S3, Supplementary Material online). The HERC2 partial duplication was also reported as the top population-stratified structural variant among 5,887 polymorphic duplications analyzed in a previous study (Sudmant et al. 2015) based on VST statistics (Redon et al. 2006).
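The comparison against a null background amounts to asking where a duplication's frequency difference falls in the empirical distribution of SNV differences. A minimal sketch, assuming the absolute CEU versus CHB difference as the statistic: the HERC2 frequencies come from table 1, while the background list is hypothetical and far smaller than the 3,102 SNVs used in the study.

```python
def empirical_percentile(value, background):
    """Fraction of background values that are less than or equal to `value`."""
    return sum(1 for b in background if b <= value) / len(background)


# HERC2 partial duplication (esv3635993): CEU = 0.025, CHB = 0.66 (table 1).
herc2_delta = abs(0.025 - 0.66)

# HYPOTHETICAL |CEU - CHB| differences for ten background SNVs.
background = [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.02, 0.15, 0.7]

rank = empirical_percentile(herc2_delta, background)
```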

Fig. 2. The population differentiation of the partial HERC2 duplication. (A) The frequency of the target duplications which were observed in either CHB or CEU (pink dots) and randomly selected 3,000 SNVs (>5% in CEU, YRI, or CHB) (blue background cloud). The density of the blue color reflects the density of observations. The x-axis shows the frequency of the variation in CEU and the y-axis shows the frequency of the variation in CHB. (B) The geographical distribution of the HERC2 gene duplication allele. Yellow refers to the frequency of the duplication allele and red refers to the frequency of the nonduplication allele. (C) Left: the putative location of the HERC2 duplication based on the linkage disequilibrium in the European populations. Right: the magnified version of the chromosomal location of HERC2 on chromosome 15. Dots are SNVs with R2 > 0.05 with the duplication in this location. The X-axis shows the chromosomal location and the Y-axis shows the R2 between the SNV and the HERC2 duplication. The pale blue bar at upper-right indicates the haplotype block, which contains the SNVs with high linkage disequilibrium (R2 > 0.7) with the HERC2 duplication (hg19 chr15: 28894038-28927368). We assume that the insertion site of the duplication resides in this haplotype block and used this region for the subsequent analysis. The purple colored dots indicate SNVs that show significant association (P-value < 0.0001) with expression levels of neighboring genes based on GTEx portal (Lonsdale et al. 2013).

To further characterize this polymorphic duplication, we first manually confirmed this duplication by investigating the read-depth profiles of multiple samples from the 1000 Genomes phase 3 data set (supplementary fig. S3, Supplementary Material online). Then, we extended our linkage disequilibrium analysis to include the additional 1000 Genomes populations categorized across continental meta-populations (see Materials and Methods).
On the basis of this analysis, we narrowed down the insertion site of the HERC2 partial gene duplication to hg19 chr15: 28894038-28927368 (R2 > 0.75), and observed a detectable increase in linkage disequilibrium between the duplication and flanking SNVs in all three continental populations (fig. 2, supplementary fig. S4, Supplementary Material online). In addition, we attempted to resolve the breakpoints of the insertion site. To do this, we searched the recently available long-read sequence data sets including fosmid sequence data (Kidd et al. 2010) and long-read sequence data sets (Seo et al. 2016; Audano et al. 2019; Levy-Sakin et al. 2019; Nagasaki et al. forthcoming). However, none of these studies has reported this particular duplication. In addition, we were not able to locate this duplication among recently available segmental duplications in de novo genome assemblies (Vollger et al. 2019, also Vollger M, personal communication). Several caveats should be noted here. First, these long-read based sequence data sets focus on a small number of samples, and thus it is plausible that genomes carrying this particular duplication are not represented in the data sets that we investigated. Second, long-read sequences, even though substantially longer than Illumina-based sequences, may have failed to cover the large ∼14 kb HERC2 duplication that we are focusing on. Finally, it is also possible that this duplication is a false positive. However, the fact that we detected a clear read-depth difference among genomes (supplementary fig. S5, Supplementary Material online) and the haplotype-level linkage disequilibrium between the duplication and SNVs strongly support the presence of a polymorphic duplication. We found that linkage disequilibrium was strongest in European populations and weaker in East Asian and African populations.
To investigate whether the duplication is ancestral or derived, we compared the ∼100 kb region around the putative insertion site (hg19 chr15: 28894038-28927368) to the orthologous section in the chimpanzee reference assembly (determined by lift-over [Hinrichs et al. 2006], panTro6, chr15: 1879453-1910950). We reasoned that if the duplication is ancestral, we would identify an additional ∼14 kb sequence in the chimpanzee reference assembly that does not exist in the human reference genome. We failed to identify such a sequence in the chimpanzee assembly, strongly suggesting that the duplication is the derived allele in the human lineage (supplementary fig. S6, Supplementary Material online). Further, we found that chimpanzee and Denisovan genomes do not harbor the 17 alleles that are linked with the duplication (R2 > 0.75 in European populations) (supplementary table S2, Supplementary Material online). This analysis supports our initial conclusion that the haplotype harboring the duplication is likely derived in the human lineage as compared with chimpanzees. Intriguingly, we found that the Neanderthal genome is heterozygous at this locus, carrying both the haplotype associated with the duplication and a haplotype that is not (supplementary table S2, Supplementary Material online). Given that we do not observe any deviation from the expected read-depth in Neanderthals, it is possible that the haplotype that harbors the duplication in humans evolved before humans and Neanderthals diverged and that the duplication evolved after their split. However, given the repetitive nature of this locus, future work is needed to definitively resolve the ancestral haplotype. Next, we used VCFtoTree (Xu et al. 2017) to obtain an alignment file for the HERC2 duplication haplotype (hg19, chr15: 28898098-28902929), containing 2,504 samples available in the 1000 Genomes phase 3 data set (Sudmant et al. 2015), as well as the reference chimpanzee genome (The Chimpanzee Sequencing Consortium 2005).
We then constructed haplotype networks using PopART (version 1.7) (Leigh and Bryant 2015) using the Median Joining method (Bandelt et al. 1999) (fig. 3). This network reveals an apparent reduction of haplotypic diversity in European populations as compared with East Asian and African populations (fig. 3). This observation is consistent with the dramatically lower allele frequency of the duplication in European populations, which initially led us to focus on this locus (fig. 2).

Fig. 3. Haplotype networks of the HERC2 duplication insertion region. (A) Merged haplotype network of the three meta-populations (AFR, EUR, EAS) constructed from 3,336 haplotypes from hg19 chr15: 28898098-28902929 (represented in fig. 2). (B) Breakdown of individual networks to help visualization of the distribution of alleles in each meta-population. Yellow refers to the frequency of the duplication allele and red refers to the frequency of the nonduplication allele.

We then asked whether population-specific selective forces can explain the reduction in haplotypic diversity at this locus in European populations. We calculated several neutrality measures at the locus harboring the HERC2 duplication and compared these to empirical distributions constructed from 26,283 3 kb regions across chromosome 15 from the 1000 Genomes Selection Browser (Last accessed, March 21, 2019) (Pybus et al. 2014). We found Tajima’s D scores in this genomic region to be lower than 90% of the values of control regions on chromosome 15 for European populations (fig. 4). However, in East Asian and African populations, we observed the opposite trend, where Tajima’s D scores fell within the expected range, if not slightly higher, based on the empirical distribution. Tajima’s D measures deviations in the allele frequency spectrum (Tajima 1993); negative values indicate an excess of rare alleles, which may be a consequence of negative or positive selection. In this case, based on network analysis (fig. 3), we argue that this signal is primarily driven by the reduction of the frequency of haplotypes harboring the duplication in the European populations. One model that is consistent with the observed Tajima’s D values is negative selection against the duplication allele acting specifically in European populations.
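To make the statistic concrete, the sketch below computes Tajima's D from a set of phased haplotypes using the standard formula (mean pairwise differences compared with the number of segregating sites scaled by the harmonic number). The values analyzed in this study were taken from the 1000 Genomes Selection Browser rather than computed this way; the function and the two toy samples are illustrative assumptions, with sites coded 0/1 and assumed bi-allelic.

```python
from math import sqrt


def tajimas_d(haplotypes):
    """Tajima's D for a list of equal-length 0/1 haplotypes (bi-allelic sites)."""
    n = len(haplotypes)
    seg_sites = 0
    pi = 0.0  # mean number of pairwise differences
    for j in range(len(haplotypes[0])):
        k = sum(h[j] for h in haplotypes)  # count of the '1' allele at site j
        if 0 < k < n:
            seg_sites += 1
            pi += 2.0 * k * (n - k) / (n * (n - 1))
    if seg_sites == 0:
        return None  # D is undefined without segregating sites
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    return (pi - seg_sites / a1) / sqrt(e1 * seg_sites + e2 * seg_sites * (seg_sites - 1))


# Toy sample 1: 10 haplotypes, 5 sites, every variant is a singleton (rare-allele excess).
rare = [[1 if i == j else 0 for j in range(5)] for i in range(10)]
# Toy sample 2: 10 haplotypes, 5 sites, every variant at 50% frequency.
balanced = [[1] * 5 if i < 5 else [0] * 5 for i in range(10)]

d_rare = tajimas_d(rare)
d_balanced = tajimas_d(balanced)
```

The singleton-only sample gives a negative D and the intermediate-frequency sample a positive D, matching the interpretation of negative values as an excess of rare alleles.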

Fig. 4. Neutrality test on the putative insertion region of the partial HERC2 duplication. All values were obtained through the 1000 Genomes selection browser (Pybus et al. 2014). (A) Tajima’s D (11 bins of 3 kb window) and (B) XP-EHH values calculated for the HERC2 target region (hg19 chr15: 28894038-28927368, represented in fig. 2), compared with the distributions calculated for all the accessible regions on chromosome 15 on the 1000 Genomes selection browser (Pybus et al. 2014). * Represents that the mean value of the target region was within the lower 10% of the control region and there was a significant difference between control and target region (P-value = 6.18E–07, Wilcoxon rank sum test). ** Represents that the mean value of the target region was within the upper 5% of the control region and there was a significant difference between control and target region (P-value < 2.2E–16, Wilcoxon rank sum test). Yellow cross represents SNVs with R2 > 0.75 with the HERC2 duplication in the European populations in the CEU–CHB comparison.

To test this hypothesis, we calculated XP-EHH scores between the three representative populations in a pairwise fashion for the SNVs from the same region for which we calculated Tajima’s D values. Then, we compared these values to the empirical distribution of XP-EHH values constructed from the same 26,283 randomly chosen regions across chromosome 15 (fig. 4). XP-EHH compares the extent of haplotype homozygosity around a given locus between two populations. A positive XP-EHH score is indicative of positive selection in the first population, whereas a negative score indicates positive selection in the second population (Sabeti et al. 2007). On the basis of this calculation, we found the average XP-EHH scores in this genomic region to be higher than 95% of the control regions on chromosome 15 in the CEU versus CHB comparison (fig. 4).
In contrast, we found no clear population differentiation between other comparisons (fig. 4). It should be noted here that if the selective pressure is on the duplication, it is plausible that the lack of XP-EHH signal in YRI and CHB populations may be due to lack of linkage disequilibrium between SNVs and the duplication in these two populations. In fact, a more focused analysis revealed that the high XP-EHH values are driven by SNVs that are linked with the duplication allele in the European population (fig. 4). This means that there are relatively long runs of homozygosity in this region in CEU population as compared with CHB and YRI, concordant with the excess of rare variants suggested by Tajima’s D comparisons. In sum, these results are in line with a scenario that a recent selection event in Europe favors nonduplicated haplotypes over duplicated-haplotypes. Next, we investigated the potential functional impact of the duplication haplotype in Europeans. We noted that HERC2 duplication is likely inserted within the neighboring HERC2P9 gene (fig. 2). It is intriguing that a much more recent duplication of the HERC2 gene is inserted into an older paralog of the HERC2, which is expressed in multiple tissues. It is possible that recombination-based mechanisms facilitated by sequence homology between these genes led to the insertion of the duplication into HERC2P9. Eight HERC2 pseudogenes are reported in Ensembl (Zerbino et al. 2018) distributed across chromosomes 15 and 16, suggesting frequent duplication of this gene. On the basis of the GTEx portal (https://www.gtexportal.org/home/ Last accessed, March 21, 2019, Lonsdale et al. 2013), HERC2P2, HERC2P3, and HERC2P9 are expressed, as well as the intact HERC2 (supplementary fig. S7, Supplementary Material online). Furthermore, we found that the duplication haplotype (imputed by rs77868920, R2 = 0.75 in European populations) downregulates the expression of HERC2P9 in various tissues (fig. 5). 
The most significant effect observed for downregulation was in the sun-exposed skin (P-value = 3.3E–17, Normalized effect size = –0.96). It is possible that the polymorphic duplication may affect not only the expression levels but also alter the sequence in this region and change the transcribed RNA sequence of the HERC2P9. This remains an interesting area for further study. While there are multiple SNVs associated with skin color (Crawford et al. 2017) and iris color (Eiberg et al. 2008; Kayser et al. 2008; Sturm et al. 2008) in this region of the genome, the HERC2 duplication haplotype does not harbor any of them (MacArthur et al. 2017) (fig. 2). It is important to note here that the duplication polymorphism is more common outside of Western Eurasia and thus association studies in nonEuropean populations will be key to resolve the putative functional impact of this duplication and associated haplotypes.
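The sign convention of the XP-EHH comparison used earlier in this section can be illustrated with a simplified, unstandardized version of the statistic. This is a sketch under strong assumptions and not the Selection Browser's implementation: EHH is computed over all haplotypes of a population between the core SNV and each flanking SNV, the EHH curve is integrated by the trapezoid rule over the whole toy window (real implementations truncate where EHH decays and standardize genome-wide), and all data are hypothetical.

```python
from collections import Counter
from math import log


def ehh(haplotypes, core, end):
    """Chance that two distinct haplotypes are identical from index core to end."""
    lo, hi = min(core, end), max(core, end)
    counts = Counter(tuple(h[lo:hi + 1]) for h in haplotypes)
    n = len(haplotypes)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def integrated_ehh(haplotypes, positions, core):
    """Trapezoid-rule area under the EHH curve, walking outward in both directions."""
    area = 0.0
    for step in (1, -1):
        i = core
        while 0 <= i + step < len(positions):
            height = 0.5 * (ehh(haplotypes, core, i) + ehh(haplotypes, core, i + step))
            area += height * abs(positions[i + step] - positions[i])
            i += step
    return area


def raw_xpehh(pop_a, pop_b, positions, core):
    """Unstandardized log ratio; positive values mean longer haplotypes in pop_a."""
    return log(integrated_ehh(pop_a, positions, core) /
               integrated_ehh(pop_b, positions, core))


positions = [1000, 2000, 3000, 4000, 5000]  # hypothetical coordinates
core = 2                                    # index of the core SNV
# pop_a: one haplotype dominates (long shared haplotype); pop_b: diverse haplotypes.
pop_a = [[0, 1, 1, 0, 1] for _ in range(5)] + [[1, 0, 0, 1, 0]]
pop_b = [[0, 0, 1, 0, 0], [1, 0, 1, 0, 1], [0, 1, 1, 1, 0],
         [1, 1, 0, 1, 1], [0, 0, 0, 0, 1], [1, 1, 0, 0, 0]]

score_ab = raw_xpehh(pop_a, pop_b, positions, core)
score_ba = raw_xpehh(pop_b, pop_a, positions, core)
```

Swapping the two populations flips the sign, which is why the direction of each pairwise comparison (for example CEU versus CHB) has to be stated when reporting scores.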

Fig. 5. The expression change of the HERC2P9 gene in various tissues associated with the HERC2 duplication tag SNV (rs77868920). The top 20 tissues on the GTEx (Lonsdale et al. 2013) with the lowest P-value are shown. Normalized effect size is defined as the slope of the linear regression of the expression of the HERC2P9 gene for three genotypes of the tag SNV. Normalized effect size is computed as the effect of the alternative allele (tagged to duplication) relative to the reference allele (tagged to nonduplication) in the human genome reference GRCh37/hg19. The whiskers on the plot represent the 95% confidence intervals.
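The slope referred to in the caption is an ordinary least-squares slope of expression on genotype dosage. A minimal sketch, assuming genotypes are coded as the number of alternative alleles (0, 1, or 2) and using invented, already-normalized expression values; GTEx's actual normalization and covariate handling are not reproduced here.

```python
def regression_slope(genotypes, expression):
    """Least-squares slope of expression on alternative-allele dosage (0, 1, 2)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    sxx = sum((g - mean_g) * (g - mean_g) for g in genotypes)
    return sxy / sxx


# HYPOTHETICAL data: six individuals, two per genotype class.
genotypes = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 0.1, 0.3, -0.9, -0.7]

slope = regression_slope(genotypes, expression)
```

A negative slope means that each additional copy of the alternative allele, here the allele tagging the duplication, is associated with lower expression.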

The Functional Impact of Haplotypes Harboring Common Duplications

Our approach resolved the tag SNVs that are in strong linkage disequilibrium with common polymorphic gene duplications that may have important functional consequences (table 1). The ascertainment bias in most functional databases limits further examination of the functional impact of polymorphisms to some extent. Specifically, most comprehensive data sets for expression quantitative trait loci analysis and most genome-wide association studies (e.g., GeneATLAS) were constructed mostly from data gathered from western European individuals. The majority of gene duplications for which we were able to resolve the haplotypes were found in African populations only (table 1). Still, we were able to search for specific associations of eight gene duplication haplotypes with >5% allele frequency in European populations with gene expression levels documented in GTEx (Last accessed, March 21, 2019) (Lonsdale et al. 2013), as well as with 778 traits documented in GeneATLAS (http://geneatlas.roslin.ed.ac.uk/, Last accessed March 21, 2019; Canela-Xandri et al. 2018). We found two significant associations. First, we found that the exonic duplication involving the TEX19 gene (esv3641421, tag-variant: rs74001624 [R2 = 1, G-allele is associated with the duplication]) is significantly associated with lower levels of expression of the adjacent gene SECTM1 on GTEx (P-value = 4.5E–10). Further, we found that the duplication haplotype is significantly (P-value = 2.8E–16) associated with monocyte percentage (table 1). This finding is concordant with the previous findings that SECTM1 is involved in hematopoietic processes (Slentz-Kesler et al. 1998). Second, we found that the haplotype harboring the esv3585141 duplication, which involves nongenic sequences only (tag-variant: rs74865018 [R2 = 0.5, A-allele is associated with the duplication]), is associated with skin color in the GeneATLAS Phenome-Wide Association Study database (P-value = 1.3E–09).
However, we did not find a significant association with the expression levels of any neighboring genes based on our search in the GTEx database. Our previous research has shown that copy number variants, including gene duplications, may be important factors in shaping skin/hair phenotypes (Eaaswarkhanth et al. 2014, 2016; Pajic et al. 2016). Indeed, we found that one of the haplotypes in our study harbors a whole-gene duplication of KRT34 (RefSeq: NM_021013), a member of the keratin gene family, which is important for hair phenotypes and has been shown to be affected by copy number variation. This whole-gene duplication is common in African populations, but not observed in Eurasian populations (table 1). Gene expression of KRT34 in human hair follicles is higher in young individuals than in old individuals (Giesen et al. 2011). On the basis of GTEx data (Lonsdale et al. 2013), we show that the haplotype harboring the duplication is associated with increased KRT34 expression, consistent with a dosage effect (supplementary fig. S10, Supplementary Material online). Interestingly, this duplication shows high linkage disequilibrium (R2 = 0.80) with the adjacent deletion of KRT33B (esv3640584, chr17: 39506753-39525903), which may suggest that KRT34 replaced KRT33B through gene conversion. Overall, resolving the haplotypes harboring gene duplications provides a powerful framework to further scrutinize the functional impact, if any, of these variants. Our observations involving the highlighted genes provide candidates for future evolutionary and biomedical studies.

Discussion

One of the major questions in population genetics is the impact of polymorphic duplications on phenotypic variation and the evolutionary consequences of this impact. Only a few studies have investigated associations between polymorphic duplications and phenotypic variation (Stranger et al. 2007; Wellcome Trust Case Control Consortium et al. 2010; Yang et al. 2015). Similarly, standard population genetics tools are often designed for the analysis of SNVs and thus cannot directly be applied to scrutinize the evolution of duplications (Iskow et al. 2012). To resolve these problems, we first identified 22 common polymorphic duplications that show linkage disequilibrium with SNVs across the human genome (table 1). By investigating the SNVs, we were able to investigate the evolutionary trajectories of the haplotypes harboring the duplications. On the basis of such analysis, we here present multiple lines of evidence that the haplotype harboring the partial HERC2 gene duplication is selected against in European populations. We found that this haplotype significantly affects the expression level of HERC2P9, even though the exact phenotype that is under selection is not clear. This methodology enabled us to resolve the haplotypes that harbor these duplications. Similarly, using SNV information, we were able to associate two of these haplotypes with skin color and monocyte percentage. Given that these duplications are major mutation events potentially affecting thousands of base pairs, it is likely that they are the causal variants that affect these phenotypes. We argue that as more complete variation data sets and accompanying databases with expression and phenotype data become available, the haplotype-level analysis of gene duplications, in particular, and structural variation, in general, will become more commonplace. Our study represents a first step in integrating multiple data types to understand the evolutionary impact of gene duplications.
It is important to note here that we did not find high linkage disequilibrium between some polymorphic duplications and tag SNVs (table 1), which reduces the statistical power of imputing these duplications in both genome-wide association studies and evolutionary inquiries. Hong and Park (2012) demonstrated that the statistical power to detect an association largely depends on the strength of the linkage disequilibrium between the causal and tag variants. We believe that this is a major issue that leads to a general underappreciation of the biomedical and evolutionary impact of structural variation. Eventually, direct genotyping of duplications will be a more straightforward and statistically powerful way of conducting such association analyses and understanding the evolutionary trends that underlie these variations.
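The dependence of power on linkage disequilibrium can be made concrete with a small sketch. It relies on a widely used approximation that is an assumption here and is not taken from this study: when a causal variant is tested indirectly through a tag variant, the sample size needed to keep the same power grows by roughly a factor of 1/R2. The function name `sample_size_inflation` is hypothetical; the R2 values are the ones reported for tag variants in the Results.

```python
def sample_size_inflation(r2: float) -> float:
    """Approximate factor by which sample size must grow when a causal
    variant is tested through a tag variant in LD r2 with it.

    Assumes the common 1/r2 approximation; illustrative only.
    """
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must be in the interval (0, 1]")
    return 1.0 / r2


# R2 values reported in the text for tag variants or neighboring variants.
for label, r2 in [("rs74001624", 1.0), ("KRT34/KRT33B", 0.80), ("rs74865018", 0.5)]:
    print(f"{label}: R2 = {r2:.2f}, sample size factor = {sample_size_inflation(r2):.2f}")
```

Under this approximation, a tag variant with R2 = 0.5 would require about twice as many samples as direct genotyping of the duplication to reach the same power.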

Materials and Methods

The Linkage Disequilibrium Based Detection of Duplications

We modified VCFtools (0.1.16) (Danecek et al. 2011) to calculate the R2 between a target duplication and other variants in a genome-wide manner. We first made a custom genome-wide VCF file from the 1000 Genomes phase 3 data set for the CEU, YRI, and CHB populations. We conducted population-specific analyses to increase the sensitivity of the linkage disequilibrium analysis. To reduce file size, we omitted variants that were not observed in the population of interest. Then we calculated the R2 between each target duplication and all other variants genome-wide with the modified VCFtools. We visualized linkage disequilibrium by using the R qqman package (fig. 1). To ensure the accuracy of these haplotypes, we manually verified the informative variants in the Integrative Genomics Viewer (Thorvaldsdóttir et al. 2013). For example, we verified one insertion–deletion polymorphism that is in strong linkage disequilibrium (R2 = 0.75) with a duplication (esv3635993), which provides a clear example of a likely true-positive variant call in this region tagging the duplication polymorphism (supplementary fig. S8, Supplementary Material online). To identify the haplotype block that likely harbors the duplicated HERC2 sequence, we set the threshold R2 value to 0.75 in Europeans and defined the putative insertion region “hg19 chr15: 28894038-28927368” accordingly. We conducted all the downstream analyses with these coordinates. To avoid analysis complications, we did not consider two SNVs (hg19 chr15: 28553017 and hg19 chr15: 28971921), as they are relatively distant outliers from the observed haplotype block (supplementary fig. S4, Supplementary Material online).
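The statistic computed for each duplication and variant pair can be illustrated with a minimal sketch. It assumes phased, biallelic data coded as 0/1 per haplotype, with the duplication coded as 1 on haplotypes that carry it. The function name `r_squared` and the toy haplotypes are hypothetical, and this is not the modified VCFtools code used in the study.

```python
from typing import Sequence


def r_squared(dup: Sequence[int], snv: Sequence[int]) -> float:
    """Haplotype-based r^2 between two biallelic variants coded 0/1."""
    if len(dup) != len(snv) or len(dup) == 0:
        raise ValueError("need two equally long, non-empty haplotype vectors")
    n = len(dup)
    p_a = sum(dup) / n                                # duplication allele frequency
    p_b = sum(snv) / n                                # SNV alternate allele frequency
    p_ab = sum(a & b for a, b in zip(dup, snv)) / n   # both alleles on one haplotype
    d = p_ab - p_a * p_b                              # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")                           # at least one site is monomorphic
    return d * d / denom


# Toy data (hypothetical): eight haplotypes, three carrying the duplication.
dup = [1, 1, 1, 0, 0, 0, 0, 0]
perfect_tag = [1, 1, 1, 0, 0, 0, 0, 0]   # always co-occurs with the duplication
partial_tag = [1, 1, 0, 0, 0, 0, 0, 1]   # co-occurs on two of three carriers

print(r_squared(dup, perfect_tag))
print(r_squared(dup, partial_tag))
```

Scanning every variant in a population-specific VCF with such a function and plotting the values by genomic position is what produces a peak at the likely insertion site.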

The Detection of Genic/Exonic Duplications

We used the NCBI RefSeq track in the UCSC Table Browser (last accessed March 21, 2019) to get the gene and exon information. By using Bedtools (v2.27.1) intersect (Quinlan and Hall 2010), we counted the number of duplications that overlap with 1) entire genes, 2) entire exons, and 3) more than one base pair of a gene (including introns). Note that none of the 22 duplications that we scrutinized here partially overlaps with a coding exon; that is, if a duplication overlaps with a coding sequence, it contains at least one entire exon (table 1). The gene functions listed in supplementary table S1, Supplementary Material online are based on the genetic associations in GeneATLAS (Canela-Xandri et al. 2018). We found one transchromosomal duplication, esv3631000, which contains ZNF664. To verify this, we checked all 19 SNVs that are in strong linkage disequilibrium (R2 > 0.8) with this duplication. We found that 17 of them cluster in a 50 kb region on chr2 (hg19, chr2: 3918719-3970271). The other two SNVs overlap with the original duplicated copy on chromosome 12. We reasoned that these two may be false positive calls due to misalignment of reads originating from the duplicated copy onto the original gene. If this is the case, we expect reads carrying the nonreference allele to make up about ⅓ of the mapped reads in a sample that carries a heterozygous duplication. We also expect these SNVs to be called as heterozygous in all cases. Indeed, we found that 235 out of 2,504 individuals are heterozygous for both of these SNVs (rs80197353, rs78005948) and no homozygous variants were documented. Furthermore, we manually inspected these SNVs using exome data from the 1000 Genomes data set and found that the nonreference alleles were carried by ∼⅓ of the reads for both SNVs. This is not consistent with the expected 50–50 ratio for heterozygous variant calls (supplementary fig. S9, Supplementary Material online).
Collectively, our analysis suggests that the two SNVs on chromosome 12 are likely false positive variant calls due to erroneous read mapping and that the duplication insertion site is indeed on chromosome 2.
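The ⅓ expectation follows from simple copy counting, sketched below. The sketch assumes that a heterozygous duplication carrier has three copies of the sequence (two original copies plus one inserted copy), that the nonreference allele sits only on the inserted copy, and that reads from all three copies map to the original locus with equal probability. The read counts and function names are hypothetical and serve only to show how the two expectations can be compared.

```python
from math import comb, log


def expected_alt_fraction(copies_with_alt: int, total_copies: int) -> float:
    """Expected fraction of reads carrying the nonreference allele."""
    return copies_with_alt / total_copies


def binom_loglik(alt_reads: int, total_reads: int, p: float) -> float:
    """Binomial log-likelihood of the observed alt read count given p."""
    return (log(comb(total_reads, alt_reads))
            + alt_reads * log(p)
            + (total_reads - alt_reads) * log(1 - p))


true_het = expected_alt_fraction(1, 2)      # genuine heterozygous SNV: 1 of 2 copies
dup_artifact = expected_alt_fraction(1, 3)  # inserted copy: 1 of 3 copies

# Hypothetical pileup: 20 of 60 reads carry the nonreference allele.
alt_reads, total_reads = 20, 60
log_lr = (binom_loglik(alt_reads, total_reads, dup_artifact)
          - binom_loglik(alt_reads, total_reads, true_het))

print(f"expected fractions: het = {true_het:.3f}, duplication = {dup_artifact:.3f}")
print(f"log-likelihood ratio (duplication vs. het): {log_lr:.2f}")
```

A positive log-likelihood ratio means that the observed read counts are better explained by a mismapped duplicated copy than by a genuine heterozygous SNV.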

Getting Random Control Regions

To obtain random SNVs that match our initial filtering criteria for polymorphic duplications (>5% in CEU, YRI, or CHB), we first used Bedtools (v2.27.1) (Quinlan and Hall 2010) to construct random chromosomal coordinates. We then applied the random chromosomal coordinates to the 1000 Genomes Project phase 3 data set variants (Sudmant et al. 2015) and used VCFtools (0.1.16) (Danecek et al. 2011) to retrieve the allele frequency information. We finally used 3,000 SNVs for the comparison between duplicated regions and random SNVs (fig. 2). In a similar way, we used all the available coordinates on chromosome 15, on which HERC2 is located, for the neutrality tests on the selection browser (Pybus et al. 2014).
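The sampling and filtering logic described above can be sketched as follows. This is an illustration only: the study used Bedtools and VCFtools, whereas the functions `random_positions` and `passes_frequency_filter`, the toy chromosome sizes, and the toy allele frequencies below are all hypothetical.

```python
import random
from typing import Dict, List, Tuple


def random_positions(chrom_sizes: Dict[str, int], n: int,
                     seed: int = 0) -> List[Tuple[str, int]]:
    """Draw n random 1-based positions, choosing chromosomes in
    proportion to their length so that sampling is uniform genome-wide."""
    rng = random.Random(seed)
    chroms = list(chrom_sizes)
    weights = [chrom_sizes[c] for c in chroms]
    picks = rng.choices(chroms, weights=weights, k=n)
    return [(c, rng.randint(1, chrom_sizes[c])) for c in picks]


def passes_frequency_filter(freqs: Dict[str, float],
                            threshold: float = 0.05) -> bool:
    """Keep a variant if its allele frequency exceeds the threshold in
    at least one of CEU, YRI, or CHB (same rule as for the duplications)."""
    return any(freqs.get(pop, 0.0) > threshold for pop in ("CEU", "YRI", "CHB"))


# Toy genome with made-up chromosome lengths.
toy_sizes = {"chrA": 1_000_000, "chrB": 500_000}
positions = random_positions(toy_sizes, n=5)
print(positions)

# Toy allele frequencies for two variants.
print(passes_frequency_filter({"CEU": 0.01, "YRI": 0.20, "CHB": 0.00}))
print(passes_frequency_filter({"CEU": 0.02, "YRI": 0.03, "CHB": 0.04}))
```

Applying the same frequency rule to the control SNVs as to the duplications keeps the two sets comparable in their allele frequency spectrum.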

Population Genetics Analyses on the HERC2 Duplication

To increase the sensitivity and confirm the initial linkage disequilibrium calculation, we extended the linkage disequilibrium analysis between the original HERC2 gene and the putative HERC2 duplication region (hg19, chr15: 28894038-28927368) from CEU to all the available European populations (Utah residents with Northern and Western European ancestry [CEU], Toscani in Italy [TSI], Finnish in Finland [FIN], British in England and Scotland [GBR], Iberian populations in Spain [IBS]), from YRI to all the available African populations (Gambian in Western Division, The Gambia [GWD], Mende in Sierra Leone [MSL], Esan in Nigeria [ESN], Yoruba in Ibadan, Nigeria [YRI], Luhya in Webuye, Kenya [LWK]), and from CHB to all the available East Asian populations (Han Chinese in Beijing, China [CHB], Japanese in Tokyo, Japan [JPT], Southern Han Chinese, China [CHS], Chinese Dai in Xishuangbanna, China [CDX], Kinh in Ho Chi Minh City, Vietnam [KHV]). We observed similar peaks in all populations in hg19 chr15: 28894038-28927368 (fig. 2, supplementary fig. S4, Supplementary Material online). To visualize the geographic distribution of the HERC2 duplication allele before recent human migrations, we used data from 15 populations in the 1000 Genomes Project: BEB, CDX, CHB, ESN, FIN, GBR, GWD, IBS, JPT, KHV, LWK, MSL, PJL, TSI, and YRI, which have not experienced recent population admixture or migration (fig. 2). We drew the map with the “rworldmap” package (South 2011).

Neutrality Tests

Tajima’s D (Tajima 1993) and XP-EHH (Sabeti et al. 2007) values were downloaded from the 1000 Genomes selection browser (Pybus et al. 2014) for the bins containing the target region (hg19 chr15: 28894038-28927368) and for the control regions (all 26,283 available 3 kb regions across chromosome 15).
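One possible way to compare a target bin with the chromosome-wide control bins is an empirical tail fraction, sketched below. This is an assumption about how such a comparison could be made, not a description of the exact test used in the study; the function name `empirical_lower_tail` and the toy values are hypothetical.

```python
from typing import Iterable


def empirical_lower_tail(target: float, controls: Iterable[float]) -> float:
    """Fraction of control bins whose statistic is at or below the target.

    A small value means the target bin lies in the lower tail of the
    chromosome-wide distribution (for example, an unusually low Tajima's D).
    """
    values = [v for v in controls if v == v]  # drop NaN entries (NaN != NaN)
    if not values:
        raise ValueError("no usable control values")
    return sum(v <= target for v in values) / len(values)


# Toy example with made-up statistic values for five control bins.
toy_controls = [-2.5, -1.0, 0.0, 0.5, 1.0]
print(empirical_lower_tail(-2.0, toy_controls))
```

With the real data, `controls` would hold one value per 3 kb bin on chromosome 15 and `target` the value for a bin inside the putative insertion region.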

Haplotype Network Analysis

To draw the haplotype networks, we first converted the target region VCF file (chr15: 28898098-28902929) from the 1000 Genomes Project phase 3 data set, together with the hg19 reference genome, to a FASTA file with VCFtoTree (V3.0.0) (Xu et al. 2017). We also included the chimpanzee genome sequence (The Chimpanzee Sequencing Consortium 2005). We manually checked informative alleles in Neanderthal and Denisovan genomes (Reich et al. 2010; Prüfer et al. 2014). We used PopART (Version 1.7) (Leigh and Bryant 2015) for the visualization.

Association Analysis

To assess the phenotypic effect of the polymorphic duplications, we first identified the tag SNVs of the polymorphic duplications (supplementary table S1, Supplementary Material online). We then searched for these tag SNVs in the GeneATLAS PheWAS (http://geneatlas.roslin.ed.ac.uk/phewas/). However, since GeneATLAS is based specifically on the UK population, neither East Asian-specific nor African-specific variants are included in the data set. We used a nominal P-value of 1E–08 as the threshold for a significant association in the GeneATLAS PheWAS. Given that we are investigating the association between SNVs and 778 traits, this significance threshold can be considered conservative. If a tag variant was not reported in the GeneATLAS database, we reported it as NA, and if no significant phenotype association was found, we reported it as NS.
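Why this threshold can be considered conservative is easy to check with a Bonferroni-style calculation. The sketch below assumes a family-wise error rate of 0.05, which is not stated in the text, and uses the 778 traits and the eight European haplotypes mentioned in the Results; the variable names are illustrative.

```python
ALPHA = 0.05          # assumed family-wise error rate (not stated in the text)
N_TRAITS = 778        # traits documented in GeneATLAS
N_HAPLOTYPES = 8      # duplication haplotypes with >5% frequency in Europeans
THRESHOLD = 1e-8      # nominal P-value threshold used in the study

# Bonferroni thresholds for one tag SNV and for all eight tag SNVs.
per_variant = ALPHA / N_TRAITS
all_tests = ALPHA / (N_TRAITS * N_HAPLOTYPES)

print(f"Bonferroni, one tag SNV x {N_TRAITS} traits: {per_variant:.2e}")
print(f"Bonferroni, {N_HAPLOTYPES} tag SNVs x {N_TRAITS} traits: {all_tests:.2e}")
print(f"Study threshold is stricter: {THRESHOLD < all_tests}")

# The two reported associations, checked against the study threshold.
reported = {"monocyte percentage": 2.8e-16, "skin color": 1.3e-09}
for trait, p_value in reported.items():
    print(f"{trait}: P = {p_value:.1e}, significant = {p_value < THRESHOLD}")
```

Under these assumptions, the study threshold is several orders of magnitude stricter than a Bonferroni correction over all tag SNV and trait combinations.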

Supplementary Material

Supplementary data are available at Genome Biology and Evolution online.

57 in total

1. Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives.

Authors: Min Zhao; Qingguo Wang; Quan Wang; Peilin Jia; Zhongming Zhao
Journal: BMC Bioinformatics Date: 2013-09-13 Impact factor: 3.169

2. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls.

Authors: Nick Craddock; Matthew E Hurles; Niall Cardin; Richard D Pearson; Vincent Plagnol; Samuel Robson; Damjan Vukcevic; Chris Barnes; Donald F Conrad; Eleni Giannoulatou; Chris Holmes; Jonathan L Marchini; Kathy Stirrups; Martin D Tobin; Louise V Wain; Chris Yau; Jan Aerts; Tariq Ahmad; T Daniel Andrews; Hazel Arbury; Anthony Attwood; Adam Auton; Stephen G Ball; Anthony J Balmforth; Jeffrey C Barrett; Inês Barroso; Anne Barton; Amanda J Bennett; Sanjeev Bhaskar; Katarzyna Blaszczyk; John Bowes; Oliver J Brand; Peter S Braund; Francesca Bredin; Gerome Breen; Morris J Brown; Ian N Bruce; Jaswinder Bull; Oliver S Burren; John Burton; Jake Byrnes; Sian Caesar; Chris M Clee; Alison J Coffey; John M C Connell; Jason D Cooper; Anna F Dominiczak; Kate Downes; Hazel E Drummond; Darshna Dudakia; Andrew Dunham; Bernadette Ebbs; Diana Eccles; Sarah Edkins; Cathryn Edwards; Anna Elliot; Paul Emery; David M Evans; Gareth Evans; Steve Eyre; Anne Farmer; I Nicol Ferrier; Lars Feuk; Tomas Fitzgerald; Edward Flynn; Alistair Forbes; Liz Forty; Jayne A Franklyn; Rachel M Freathy; Polly Gibbs; Paul Gilbert; Omer Gokumen; Katherine Gordon-Smith; Emma Gray; Elaine Green; Chris J Groves; Detelina Grozeva; Rhian Gwilliam; Anita Hall; Naomi Hammond; Matt Hardy; Pile Harrison; Neelam Hassanali; Husam Hebaishi; Sarah Hines; Anne Hinks; Graham A Hitman; Lynne Hocking; Eleanor Howard; Philip Howard; Joanna M M Howson; Debbie Hughes; Sarah Hunt; John D Isaacs; Mahim Jain; Derek P Jewell; Toby Johnson; Jennifer D Jolley; Ian R Jones; Lisa A Jones; George Kirov; Cordelia F Langford; Hana Lango-Allen; G Mark Lathrop; James Lee; Kate L Lee; Charlie Lees; Kevin Lewis; Cecilia M Lindgren; Meeta Maisuria-Armer; Julian Maller; John Mansfield; Paul Martin; Dunecan C O Massey; Wendy L McArdle; Peter McGuffin; Kirsten E McLay; Alex Mentzer; Michael L Mimmack; Ann E Morgan; Andrew P Morris; Craig Mowat; Simon Myers; William Newman; Elaine R Nimmo; Michael C O'Donovan; Abiodun Onipinla; Ifejinelo Onyiah; 
Nigel R Ovington; Michael J Owen; Kimmo Palin; Kirstie Parnell; David Pernet; John R B Perry; Anne Phillips; Dalila Pinto; Natalie J Prescott; Inga Prokopenko; Michael A Quail; Suzanne Rafelt; Nigel W Rayner; Richard Redon; David M Reid; Susan M Ring; Neil Robertson; Ellie Russell; David St Clair; Jennifer G Sambrook; Jeremy D Sanderson; Helen Schuilenburg; Carol E Scott; Richard Scott; Sheila Seal; Sue Shaw-Hawkins; Beverley M Shields; Matthew J Simmonds; Debbie J Smyth; Elilan Somaskantharajah; Katarina Spanova; Sophia Steer; Jonathan Stephens; Helen E Stevens; Millicent A Stone; Zhan Su; Deborah P M Symmons; John R Thompson; Wendy Thomson; Mary E Travers; Clare Turnbull; Armand Valsesia; Mark Walker; Neil M Walker; Chris Wallace; Margaret Warren-Perry; Nicholas A Watkins; John Webster; Michael N Weedon; Anthony G Wilson; Matthew Woodburn; B Paul Wordsworth; Allan H Young; Eleftheria Zeggini; Nigel P Carter; Timothy M Frayling; Charles Lee; Gil McVean; Patricia B Munroe; Aarno Palotie; Stephen J Sawcer; Stephen W Scherer; David P Strachan; Chris Tyler-Smith; Matthew A Brown; Paul R Burton; Mark J Caulfield; Alastair Compston; Martin Farrall; Stephen C L Gough; Alistair S Hall; Andrew T Hattersley; Adrian V S Hill; Christopher G Mathew; Marcus Pembrey; Jack Satsangi; Michael R Stratton; Jane Worthington; Panos Deloukas; Audrey Duncanson; Dominic P Kwiatkowski; Mark I McCarthy; Willem Ouwehand; Miles Parkes; Nazneen Rahman; John A Todd; Nilesh J Samani; Peter Donnelly
Journal: Nature Date: 2010-04-01 Impact factor: 49.962

3. The Genotype-Tissue Expression (GTEx) project.

Authors:
Journal: Nat Genet Date: 2013-06 Impact factor: 38.330

4. Origins and functional impact of copy number variation in the human genome.

Authors: Donald F Conrad; Dalila Pinto; Richard Redon; Lars Feuk; Omer Gokcumen; Yujun Zhang; Jan Aerts; T Daniel Andrews; Chris Barnes; Peter Campbell; Tomas Fitzgerald; Min Hu; Chun Hwa Ihm; Kati Kristiansson; Daniel G Macarthur; Jeffrey R Macdonald; Ifejinelo Onyiah; Andy Wing Chun Pang; Sam Robson; Kathy Stirrups; Armand Valsesia; Klaudia Walter; John Wei; Chris Tyler-Smith; Nigel P Carter; Charles Lee; Stephen W Scherer; Matthew E Hurles
Journal: Nature Date: 2009-10-07 Impact factor: 49.962

5. An atlas of genetic associations in UK Biobank.

Authors: Oriol Canela-Xandri; Konrad Rawlik; Albert Tenesa
Journal: Nat Genet Date: 2018-10-22 Impact factor: 38.330

6. Recurring exon deletions in the HP (haptoglobin) gene contribute to lower blood cholesterol levels.

Authors: Linda M Boettger; Rany M Salem; Robert E Handsaker; Gina M Peloso; Sekar Kathiresan; Joel N Hirschhorn; Steven A McCarroll
Journal: Nat Genet Date: 2016-02-22 Impact factor: 38.330

7. An integrated map of structural variation in 2,504 human genomes.

Authors: Peter H Sudmant; Tobias Rausch; Eugene J Gardner; Robert E Handsaker; Alexej Abyzov; John Huddleston; Yan Zhang; Kai Ye; Goo Jun; Markus Hsi-Yang Fritz; Miriam K Konkel; Ankit Malhotra; Adrian M Stütz; Xinghua Shi; Francesco Paolo Casale; Jieming Chen; Fereydoun Hormozdiari; Gargi Dayama; Ken Chen; Maika Malig; Mark J P Chaisson; Klaudia Walter; Sascha Meiers; Seva Kashin; Erik Garrison; Adam Auton; Hugo Y K Lam; Xinmeng Jasmine Mu; Can Alkan; Danny Antaki; Taejeong Bae; Eliza Cerveira; Peter Chines; Zechen Chong; Laura Clarke; Elif Dal; Li Ding; Sarah Emery; Xian Fan; Madhusudan Gujral; Fatma Kahveci; Jeffrey M Kidd; Yu Kong; Eric-Wubbo Lameijer; Shane McCarthy; Paul Flicek; Richard A Gibbs; Gabor Marth; Christopher E Mason; Androniki Menelaou; Donna M Muzny; Bradley J Nelson; Amina Noor; Nicholas F Parrish; Matthew Pendleton; Andrew Quitadamo; Benjamin Raeder; Eric E Schadt; Mallory Romanovitch; Andreas Schlattl; Robert Sebra; Andrey A Shabalin; Andreas Untergasser; Jerilyn A Walker; Min Wang; Fuli Yu; Chengsheng Zhang; Jing Zhang; Xiangqun Zheng-Bradley; Wanding Zhou; Thomas Zichner; Jonathan Sebat; Mark A Batzer; Steven A McCarroll; Ryan E Mills; Mark B Gerstein; Ali Bashir; Oliver Stegle; Scott E Devine; Charles Lee; Evan E Eichler; Jan O Korbel
Journal: Nature Date: 2015-10-01 Impact factor: 49.962

8. VCFtoTree: a user-friendly tool to construct locus-specific alignments and phylogenies from thousands of anthropologically relevant genome sequences.

Authors: Duo Xu; Yousef Jaber; Pavlos Pavlidis; Omer Gokcumen
Journal: BMC Bioinformatics Date: 2017-09-26 Impact factor: 3.169