Literature DB >> 28580430

The evolution and population diversity of human-specific segmental duplications.

Megan Y Dennis^1,2, Lana Harshman², Bradley J Nelson², Osnat Penn², Stuart Cantsilieris², John Huddleston^2,3, Francesca Antonacci⁴, Kelsi Penewit², Laura Denman², Archana Raja^2,3, Carl Baker², Kenneth Mark², Maika Malig², Nicolette Janke², Claudia Espinoza², Holly A F Stessman², Xander Nuttle², Kendra Hoekzema², Tina A Lindsay-Graves⁵, Richard K Wilson⁵, Evan E Eichler^2,3.

Abstract

Segmental duplications contribute to human evolution, adaptation and genomic instability but are often poorly characterized. We investigate the evolution, genetic variation and coding potential of human-specific segmental duplications (HSDs). We identify 218 HSDs based on analysis of 322 deeply sequenced archaic and contemporary hominid genomes. We sequence 550 human and nonhuman primate genomic clones to reconstruct the evolution of the largest, most complex regions with protein-coding potential (n=80 genes/33 gene families). We show that HSDs are non-randomly organized, associate preferentially with ancestral ape duplications termed "core duplicons", and evolved primarily in an interspersed inverted orientation. In addition to Homo sapiens-specific gene expansions (e.g., TCAF1/2), we highlight ten gene families (e.g., ARHGAP11B and SRGAP2C) where copy number never returns to the ancestral state, there is evidence of mRNA splicing, and no common gene-disruptive mutations are observed in the general population. Such duplicates are candidates for the evolution of human-specific adaptive traits.

Entities: Chemical

Year: 2017 PMID： 28580430 PMCID： PMC5450946 DOI： 10.1038/s41559-016-0069

Source DB: PubMed Journal: Nat Ecol Evol ISSN： 2397-334X Impact factor: 15.460

Genetic mutations have shaped the unique adaptation and evolution of the human lineage, but their characterization has been a slow and difficult endeavor. Despite a few potential success stories over the years with various degrees of support[1], the genetic basis of most of the unique aspects of human adaptation await discovery. As sequencing technologies have improved, more systematic efforts have been directed to discover regulatory differences among the great apes[2-6]. One potential source of genetic variation, which has been difficult to explore due to missing or erroneous sequences within reference genomes, are genes embedded within recently (<25 million years ago (mya)) duplicated regions also called segmental duplications (SDs)[7]. Unlike the focus on regulatory mutations or gene loss, which typically modify the expression of ancestral genes mapping to unique regions, duplicated regions have long been recognized as a potential source for the rapid evolution of new genes with novel functions[8]. Recent functional studies have emphasized the potential importance of SDs with respect to unique features of synaptogenesis, neuronal migration, and neocortical expansion in the human lineage[9-12]. The genomes of apes are enriched in SDs having experienced a burst of interspersed duplications over the last 10 million years of evolution[13,14]. The mosaic and interspersed architecture of ape SDs offers tremendous potential for transcript innovation because duplicate paralogs may be truncated, combined with other transcripts to create fusion genes, or acquire alternate promoters directing the differential expression of novel transcripts[15]. Previous investigations have been limited to microarray studies[16,17] and whole-genome sequencing read-depth comparisons[14,18,19] between humans and great apes. None of these methods provide information regarding the structure and sequence of the duplicated segments, limiting gene annotation and the understanding of the functional potential of the duplicated genes. In this study, we focus on understanding the sequence structure, genetic variation, and transcriptional potential of the largest human-specific segmental duplications (HSDs). HSDs are particularly problematic because they are highly identical (~99%), among the most copy number polymorphic parts of the genome, and frequently embedded within larger blocks of shared ape duplications. Not surprisingly, the genome assembly builds of these regions are highly enriched for euchromatic gaps and misassembly errors even within the most recent versions of the human reference[20,21]. We specifically target 33 human-specific gene families contained within these HSDs for high-quality sequence assembly by selecting large-insert bacterial artificial chromosome (BAC) clones from a library (CH17) generated from a well-characterized complete hydatidiform mole cell line (CHM1tert). The mole derives from the fertilization of an enucleated human oocyte with a single spermatozoon[22,23] or from postzygotic loss of a complete parental genome[24]. The end result is a haploid as opposed to a diploid equivalent of the human genome where the absence of allelic variation allows high-identity paralogous regions of the genome to be rapidly resolved[11]. We apply the resulting high-quality sequence to more systematically investigate copy number variation, transcriptional potential, and human genetic variation in an effort to understand their evolutionary history as well as discover regions that have become fixed and potentially functional in the human species.

RESULTS

Refining regions of HSDs

With the wealth of deep-coverage Illumina sequence data from both humans and great apes, we began by first redefining the map location of HSDs. We mapped a genetically diverse panel of 236 human and 86 chimpanzee, gorilla and orangutan genomes to the human reference (GRCh37) to identify regions uniquely duplicated in humans (Figure 1, Supplementary Figure 1). Our approach identified 218 autosomal regions ranging in size from 5 kbp (our size threshold) to 362 kbp with HSDs dispersed non-randomly near each other (empirical median distance to nearest HSD 440 kbp, P < 1×10−7; Supplementary Figure 2; see Methods). Of these regions, 85 included entire or parts of RefSeq annotated genes (Supplementary Table 1). We orthogonally validated 87% (190/218) of our events as HSDs.

Figure 1

Identification of human-specific segmental duplications or HSDs

(a) The locations of large, gene-containing HSDs are highlighted (blue lines) with 80 individual gene paralogs from 33 gene families listed across 9 different human autosomes. Included in this set are paralogs of GPR89, which duplicated in other great apes but experienced human-specific expansions. Many of these HSDs overlap known disease-implicated genomic hotspots (red lines) prone to recurrent copy number variation associated with developmental delay. The genomic hotspots labeled with numbers (1 to 6) have significant associations with specific disorders including epilepsy, autism, schizophrenia and intellectual disability (ID). (b) Duplicated regions were detected based on read-depth analysis of Illumina reads mapped to the human reference genome (GRC37). The set included a diversity panel of humans (Human Genome Diversity Project or HGDP (N = 236)[30]) and nonhuman primates or NHPs (gorillas (N = 32), chimpanzees (N = 23), bonobos (N = 14), and orangutans (N = 17)[34]). Overall copy number (CN) was averaged across 500 bp sliding windows and depicted as colored heatmaps (see pictured index). Also pictured are heatmaps for Neanderthal, Denisovan, and archaic human (LBK, Loschbour, and Ust Ishim) individuals. Any genomic region >5 kbp shown to have diploid CN ≥ 3 in 90% of humans tested compared to all NHPs was considered an HSD.

The set included 38 previously unreported HSDs mapping to genic regions. Among these, we included HSDs where there was evidence of independent or distinct duplications in great apes (i.e., homoplasy; N = 21) and duplications corresponding to introns (N = 12). For example, the 3′ portion of MST1L (macrophage stimulating 1 like) on chromosome 1p36.13 is partially duplicated in chimpanzee and gorilla, but a complete duplication of the gene (>36 kbp) has risen to high copy uniquely in humans (diploid copy number (CN) > 8; Supplementary Figure 3). Similarly, we identified a 6.6 kbp duplication corresponding the third intron of CACNA1B (calcium voltage-gated channel subunit alpha1 B)—a pore-forming subunit of an N-type voltage-dependent calcium channel that controls neurotransmitter release from neurons (Supplementary Figures 3 and 4). We also identified a novel duplication of SCGB1C1 (secretoglobin family 1C member 1), a gene family whose products are secreted at large concentrations in the lung, lacrimal and salivary glands (Supplementary Figure 3). Next, we focused on the largest gene-containing HSD regions (>20 kbp; Supplementary Figure 5). These HSDs and their ancestral counterparts reside on 16 autosomal regions with many appearing to cluster with other smaller HSDs and at “genomic hotspots”—regions prone to recurrent large-scale microdeletions and microduplications associated with neurodevelopmental disorders (Figure 1, Supplementary Table 2)[25]. Using the haploid BAC library (CH17), we generated alternate sequence assemblies (Supplementary Tables 3 and 4) of which 18.2 Mbp have now been incorporated into the most recent human reference build (GRCh38), allowing us to close 24 euchromatic gaps and correct large-scale errors in the human reference genome. The new sequence allowed us to distinguish 28 HSD events ranging in size from 11 kbp to 677 kbp corresponding to 33 HSD gene families accounting for 80 paralogous genes (Supplementary Table 5). The majority of events (N = 24 events or 3.2 Mbp) were primary duplications—defined here as the initiating SD from the ancestral locus shared between human and chimpanzee (Figure 2). Compared to a random null distribution, we found that these primary HSDs map closer to each other than by chance (empirical median distance to nearest HSD 377 kbp, P = 1×10−7; Supplementary Figure 2). Consistent with previous observations[26-28], we also found our primary HSDs to be significantly enriched near core duplicons—high-copy ancestral ape duplications significantly linked with the accumulation of SDs and breakpoints of rearrangement (empirical median distance to a core 250 kbp, P < 1×10−7; Supplementary Figure 6). We identified four secondary HSDs—additional duplications derived from a human-specific duplicate paralog. These secondary events account for 35% (1.7 Mbp) of HSD base pairs because the events are larger when compared to primary duplications (minimum median sizes: 497 kbp vs. 95 kbp, P = 0.041, Wilcoxon-Mann-Whitney test). The majority of HSDs are intrachromosomal and arranged in inverted orientation with respect to their ancestral paralogs (18/26, P = 0.014, binomial test), including all secondary duplications (4/4). HSD clustering is most pronounced on chromosome 1p12 to 1q32.1, which contains the greatest number of gene-containing HSDs (at least 6 independent HSD events including ~2 Mbp (0.8%) of human chromosome 1; Supplementary Note; Supplementary Figure 7). We find that 85% of this 8 Mbp region (chr1:119,989,248–121,395,939 and chr1:143,311,826–149,876,379, GRCh38) has been duplicated in humans and great apes with only 1.15 Mbp remaining unique in humans.

Figure 2

Timing of HSDs

(a) Duplication timing estimates are plotted as a ratio of human–chimpanzee divergence (x-axis). The total estimated size of primary (black) and secondary (orange) duplications is shown for each event. Uncertainty in the size of each event is due to breakpoints mapping in high-identity flanking duplications (gray). (b) Generally accepted phylogeny indicating timing of each event assuming chimpanzee and human lineage divergence of 6 million years ago (mya)[76–78]. Recent estimates based on recalibrated substitution rates suggest an earlier divergence time of 12 mya, with this maximum scale indicated in parentheses. The analysis is based on high-quality sequencing, assembly and alignment of large-insert clones (human N = 224; NHP N = 269). Asterisks indicate adjusted timing estimates because of failed Tajima’s D relative rate (*) and genes with evidence of interlocus gene conversion (**).

Evolutionary timing of HSDs

We sequenced large-insert clones from nonhuman primate (NHP) genomic BAC libraries and applied standard phylogenetic methods under a model of gene conversion[29] to understand the evolutionary timing of each large HSD (Supplementary Tables 6–8, Supplementary Figure 8). The results reveal differences in number and size of HSDs when we compare across three equal time periods during the evolution of the human lineage (P = 0.017; Figure 2). The first was a period of relative quiescence, which occurred after the human–chimpanzee divergence. This included five smaller primary duplications corresponding to seven genes with a median minimum size of 36 kbp for a total of 285 kbp. This was followed temporally by a set of larger primary (N = 6) and secondary (N = 1) duplications containing 12 HSD genes (median minimum size of primary events 262 kbp for a total of 1.5 Mbp, P = 0.026). The final set of duplications involved more secondary (N = 3) and primary (N = 13) duplications and are estimated to be the most recent. Although primary duplication lengths were not significantly different in size compared to either of the other two time periods (median minimum size 93 kbp for a total of 1.4 Mbp), they resulted in many more HSD genes (28 gene paralogs).

Human copy number diversity

We undertook three different approaches to assess the potential functional significance of HSDs—namely, copy number constraint, transcriptional potential and protein-coding mutations. We first assessed copy number in contemporary and archaic hominin (N= 2,384)[30,3132,33] as well as 86 NHP genomes[34] in order to distinguish fixed duplications from those that are highly stratified among humans (Figure 3, Supplementary Tables 9–11). Thirteen HSD genes were among the most copy number polymorphic, including genes at chromosomes 7q35 (three units: ARHGEF5 and OR2A, TCAF1, and TCAF2), 5q13.1 (four units: SMN1 and SERF1, GTF2H2, OCLN, and NAIP), 16p11.2 (two units: BOLA2 and DUSP22), and 10q11.23 (one unit: GPRIN2 and NPY4R). Conversely, eight HSD genes were largely fixed for copy number, showing the lowest variance among contemporary human populations (six units: HYDIN, GPR89 and PDZK1, CFC1 and TISP43, CD8B, ROCK1, and ARHGAP11; Figure 3, Supplementary Figures 9 and 10, Supplementary Tables 12 and 13). As expected, higher copy number HSD genes are generally more copy number polymorphic (Supplementary Note). We identified 11/23 duplicated units with at least one normal individual identified who carried the ancestral state copy number (diploid CN of two) suggesting the HSD paralogs are missing in these individuals (e.g., DUSP22 and ROCK1; Figure 3B). Population differentiation (as measured by Vst[35]) generally correlated with copy number variance (R2 = 0.32; ρ = 0.54, Pearson correlation) but not copy number (R2 = 0.01; ρ = −0.01, Pearson correlation; Supplementary Figure 11).

Figure 3

Human copy number diversity

Overall average CN was calculated per individual from read depth produced from Illumina mappings across a set region defining each duplication (Supplementary Table 9) in human populations, including the HGDP (N = 236; GRCh38) and 1000 Genomes Project (1KG, N = 2,143; GRCh37) cohorts, NHPs, archaic humans, a Denisovan and a Neanderthal. From these results, the mean, standard deviation (s.d.), Vst, and number of individuals with CN = 2 indicating no duplicate paralogs exist were calculated for average CN of each duplicated gene family (Supplementary Table 11). For each gene family, plots are shown for the CN average vs. s.d. across all HGDP individuals with duplicate gene family names indicated next to each data point. Red data points indicate genes with no homozygous deletions in any human tested. Genes with higher s.d. are considered CN polymorphic and tend to have higher CN (R2 = 0.09; ρ = 0.30, Pearson correlation) and average Vst (R2 = 0.32; ρ = 0.54, Pearson correlation; Supplementary Figure 11).

We also identified three genes expanded uniquely in Homo sapiens when compared to two sequenced archaic hominins, a Neanderthal and a Denisovan. This included the previously reported BOLA2 on chromosome 16[32,36] (Supplementary Figure 12) and two novel genes, TRPM8-associated TCAF1 and TCAF2 (formerly FAM115A and FAM115C), on chromosome 7 (Figure 4, Supplementary Table 14, Supplementary Note)[37]. In the case of TCAF1 and TCAF2, the timing estimate (0.048 ± 0.008 human–chimpanzee distance) is consistent with its absence in archaic hominins. The fact that we observe high copy number in two archaic humans (Loschbour and Ust Ishim individuals with CN ≥ 6) suggests these HSDs spread rapidly in the population. The TCAF1/TCAF2 HSD is differentiated (HGDP mean Vst = 0.11) between human populations with the highest copy number observed for African and European populations (in particular, Gambian and Esan from Nigeria where multiple individuals with diploid CN = 7 are observed) and the lowest copy number observed for Asian and Amerindian populations.

Figure 4

Copy number polymorphism across diverse populations of TCAF1 and TCAF2 HSDs

(a) Heatmap of overall CN of TCAF1 and TCAF2 HSD region on human chromosome 7 with predicted gene models and segmental duplications (SDs; depicted as colored arrows) pictured above. Representative modern humans are shown for each genotyped CN across the locus with a single person (*HGDP00798) showing deletion of the region, likely due to non-allelic homologous recombination between directly oriented SDs B1 and B2 (Supplementary Note). (b) A scatter plot of TCAF1 and TCAF2 SDs (A1, B1, and C1), overall CN of individuals from modern human (HGDP cohort), archaic humans, a Denisova and a Neanderthal, and NHPs (chimpanzee, bonobo, gorilla, and orangutan) plotted on each axis. The one Western European individual circled in red that deviates from the rest of the individuals’ copy numbers is the deletion carrier pictured in a. (c) CN predictions across modern humans from the 1KG and HGDP (N = 2,379), archaic hominins, and NHPs were made across a representative region (chr7:143,533,137-143,571,789; GRCh37). Overall CNs in the pie charts per population are represented as colors depicted in the legend shown panel a.

Patterns of HSD mRNA expression

To understand expression differences, we specifically examined RNA-seq data from GTEx[38] and mapped the distribution of reads to HSDs in 45 different tissues across multiple individuals (Supplementary Tables 15 and 16, Supplementary Figures 13 and 14). Of the 26 comparisons that could be made between known ancestral and duplicate paralogs (Supplementary Figure 13), 65% (17/26) of duplicate paralogs showed significantly lower expression levels compared to their ancestral paralog (versus 19% (5/26) showing significantly greater expression and 15% (4/26) showing no difference in expression). In contrast, human-specific FRMPD2B and CHRFAM7A each show increased expression in specific tissues compared to their ancestral paralogs (FRMPD2A and CHRNA7). Both of the derived duplicates are incomplete, lacking the 5′ portion when compared to the ancestral gene. CHRFAM7A, for example, is the product of a gene fusion of FAM7A and CHRNA7 duplications and shows increased expression in the aorta, liver, lung, testis, and thyroid. FRMPD2B shows increased expression compared with FRMPD2A in several regions of the brain cortex as well as reproductive organs, including fallopian tubes and uterus.

Discovery of likely gene-disruptive events

Since pseudogenization is the most likely fate of duplicated genes, we tested whether paralogs had accumulated likely gene-disrupting (LGD) mutations by targeted sequencing of canonical protein-coding exons of 30 gene families using molecular inversion probes (MIPs)[39]. In 658 individuals from the 1000 Genomes Project, we identified 96 LGD variants (25 gene families) of which 33 could be definitively assigned to a copy (Supplementary Tables 10, 17–19). Ten duplicate paralogs and no ancestral paralogs harbored common loss-of-function mutations (population frequency >5%) suggesting potential functional constraint. The remaining LGD variants (N = 63) could not be unambiguously assigned but typically fell into gene families with more than one duplicate paralog. Overall, we identified no LGD variants in five gene families and discovered only rare LGD variants (<5% population frequency) in an additional 15 gene families (Supplementary Table 10). Next, we sequenced and compared the LGD frequency in 3,444 children with autism and 2,617 unaffected siblings. From this set, we identified 4,069 total coding and splice variants, of which 247 were LGD, in both cases and controls for 30 genes (Supplementary Tables 20 and 21). The majority of LGD variants (N = 231) were considered rare and collectively found in equal proportions of cases and controls (24% of individuals). Examining burden of rare LGD variants of individual genes, we identified a nominal enrichment in cases versus controls of GPR89 with seven variants exhibiting an overall frequency of 19/3,430 cases versus 5/2,605 controls (puncorr = 0.02, Supplementary Figure 15), though this result did not pass Bonferroni multiple-testing correction. Common LGD variants existed in 11/30 genes, with six genes (FCGR1, GTF2I, GTF2IRD2, HIST2H2BF, HYDIN, and ROCK1) carrying fixed LGD variants in nearly all individuals tested. Notably, five of these genes represent partial duplications where we might expect a greater likelihood of disruptive mutations if the paralogs represent pseudogenes. Combining these results with our assessment of 1000 Genomes Project individuals, 16/30 gene families showed an absence of common protein-disrupting variants in the 6,719 humans tested (Supplementary Table 10).

The complexity of HSD evolutionary history

To highlight the complex evolutionary history associated with such regions, we selected three loci for further investigation (see Supplementary Note for details of genomic hotspot chromosome 10q11.23). Large deletions ~1.8 Mbp in size of the chromosome 7q11.23 region lead to Williams-Beuren syndrome (OMIM #194050) and reciprocal duplications are associated with autism and intellectual disability[40]. The directly oriented flanking HSDs (labeled B; Figure 5A) contain three genes: GTF2I, GTF2IRD2, and NCF1. Our analysis predicts that the most common human haplotype arose through a three-step evolutionary process. The first two events occurred within the distal duplication cluster of the region (Supplementary Figure 16). They involved an inverted duplication of a ~116 kbp SD (termed A, containing paralogs of the high-copy duplicon (SPDYE)) and a possible 90 kbp inversion (0.318 ± 0.028 human–chimpanzee distance) followed by a separate ~106 kbp inverted duplication of B (0.229 ± 0.019 human–chimpanzee distance). These events created truncated paralogs GTF2IB and GTF2IRD2B and a full-length version of NCF1B. A third large-scale inverted duplication transposed an ~395 kbp region comprised of SDs A, B, and C (containing POM121L) from the distal to proximal breakpoints of the disease-associated region (0.122 ± 0.014 human–chimpanzee distance). This tertiary duplication established “granddaughter” truncated copies of GTF2IRDC, GTF2IC, as well as a full-length paralog NCF1C, likely overwriting the 3′ end of the ancestral POM121 with POM121L. This final event created directly oriented SDs A and B, providing a substrate for non-allelic homologous recombination leading to disease-associated copy number variants. The great ape sequence (Supplementary Figure 17) matched nearly perfectly the deduced genomic configuration hypothesized previously[41], with the exception of a large-scale inversion of the region proximal to BP1 in orangutan.

Figure 5

Complex models of HSD evolutionary history

BACs tiling across human chromosome (a) 7q11 and (b) 7q35 regions were sequenced and assembled (representing human and additional great apes) and supercontigs were created. Estimates of sizes and evolutionary timing (human–chimpanzee distance; Supplementary Table 8) of events are denoted between each predicted intermediate genomic structure. SD organization is depicted as colored arrows across the 7q11 (SDs annotated with subscripts representing relative positions including centromeric (c), middle (m), and telomeric (t) as previously defined[41]) and 7q35 regions. The orientations of intervening regions are shown with arrows. Models of the predicted evolutionary histories of the HSDs at all loci are depicted starting with the predicted human–chimpanzee common ancestor to the most common haplotype present in modern humans. A Miropeats comparison of the human and chimpanzee contigs shows the pairwise differences between the orthologous regions. Lines connect stretches of homologous regions based on a chosen threshold (s), defined as the number of matching bases minus the number of mismatching bases (s = 500 for a, s = 1000 for b) and match the arrow colors when they connect SD blocks. Additional annotations include whole-genome shotgun sequence detection (WSSD) in human and chimpanzee, indicating duplicated regions identified by sequence read depth[56], DupMasker[79], and genes.

We also characterized one of the youngest HSD regions unique to modern humans on chromosome 7q35 containing TCAF1 and TCAF2 and primate-duplicated CTAGE6[42] (Figure 4, Supplementary Note, Supplementary Figure 18). We note that expansion of a CTAGE-paralog also occurred in the duplication of HSD gene ARHGEF5, located less than 500 kbp distal to this locus. Pairwise comparisons between human and chimpanzee suggest the possibility of three distinct duplication events (A: 65 kbp, B: 10 kbp, and C: 56 kbp) as well as a large-scale inversion (~200 kbp) (Figure 5B). We estimate an initial 10 kbp inverted duplication of HSD B containing the 3′ end of TCAF2A (0.275 ± 0.041 human–chimpanzee distance) creating a truncated TCAF2B. The subsequent events occurred very recently during human evolution potentially during or after the split from a common ancestor of Denisova and Neanderthal. These subsequent rearrangements created a new full-length paralog of CTAGE6 (contained in A; 0.091 ± 0.008 human–chimpanzee distance) and truncated paralogs TCAF1A (the putative ancestral paralog contained in C1; 0.048 ± 0.008 human–chimpanzee distance) and TCAF2C (contained in C2). Notably, we estimate that the full-length and functional TCAF1B and TCAF2A now reside on distinct SD paralogs that are separated by 130 kbp transcribed on opposite strands—as opposed to the ancestral configuration where the genes are tandem, adjacent, and transcribed on the same strand.

DISCUSSION

In this study we generated new reference sequence for some of the most complex and gap-ridden sequence of the human genome. Several important features emerge from our targeted sequencing (48.4 Mbp) and evolutionary reconstruction of HSD regions (Supplementary Discussion). The largest HSDs are significantly clustered near core duplicons, including at chromosomes 1q21, 5q13 and 7q11.3. Most regions have been subjected to multiple large structural variation events during human evolution with inverted duplications being the predominant mode of structural change (71.4% of the total predicted 28 intrachromosomal duplication events, P = 0.006; Supplementary Table 5). Inverted SDs have been noted before in complex structural rearrangements associated with genomic disorders, such as Pelizaeus-Merzbacher disease[43,44] and Smith-Magenis syndrome[45], and may be a product of replication-based mechanisms, such as fork-stalling and template switching (FoSTeS)[43] and/or microhomology-mediated break-induced repair (MMBIR)[46]. We enriched for potential functional HSD genes by applying three criteria: (1) all humans must carry the duplicate paralog; (2) no common truncating mutations are observed in the human population and (3) duplicates show evidence of spliced mRNA expression. Ten HSD gene families met all criteria, including two genes previously implicated in cortical development and neuronal spine density, ARHGAP11B[12] and SRGAP2C[10,11], as well as the gene families BOLA2, CD8B, CFC1, FAM72, GPR89, GPRIN2, NPY4R, and TISP43. GPRIN2 (G protein regulated inducer of neurite outgrowth 2) has been shown to interact directly with G-coupled proteins (GNAO1 and GNAZ)[47] and has been implicated in the control of neurite outgrowth[48]. Our RNA-seq analysis points to localized expression in various regions of the brain, including the cerebellum and hypothalamus (Figure 5). Other genes of interest include CFC1 (cripto, FRL-1, cryptic family 1), which encodes a member of the epidermal growth factor important in patterning the left-right embryonic axis[49], and NPY4R (neuropeptide Y receptor Y4)—a gene implicated in energy homeostasis. Large copy number variants of the region are associated with obesity[50] (Supplementary Discussion; Supplementary Figure 20). Although our analysis provides a framework for the evolution of new human-specific genes, there are a number of limitations. First, additional mutation events, such as interlocus gene conversion (IGC), frequently occur between high-identity paralogs[51-53] (Supplementary Figure 19). We identified 2.9% of the sequence showing signatures of IGC, consistent with previous estimates[51] (Supplementary Table 7). Though such duplications will make HSDs appear evolutionarily “younger”, excluding these regions increases our timing estimates only by a small degree (on average an increase of 0.008 human–chimpanzee distance across 17 HSDs; Supplementary Table 8). The second caveat is that the full extent of HSDs is often difficult to assess because they frequently occur in duplication blocks where there have been multiple rounds of structural variation over the last 15 million years. Breakpoints and boundaries become challenging to delineate due to a series of overlapping complex rearrangements (e.g., SMN1 region on chromosome 5q13.3; Supplementary Discussion; Supplementary Figure 21). Third, we assume that any individual with two copies of a gene family represents the ancestral non-duplicated state. It is possible that the alternative scenario of duplication followed by subsequent deletion of the ancestral paralog may have occurred (Supplementary Discussion; Supplementary Figures 22 and 23; Supplementary Tables 22 and 23). Fourth, this study focuses on protein-encoding gene models and does not consider the possibility of functional noncoding RNA. Notably, three of the annotated genes (MIR4435, MIR4267, and OR2A) mapping to HSDs are identified as noncoding RNA (Ensemble Variant Effect Finder). Moreover, the canonical gene model being investigated in our analysis is heavily weighted by the ancestral intron-exon structure (Supplementary Discussion; Supplementary Figure 24). Thus, novel fusion genes and transcripts not previously annotated that have gained alternate promoters would not have been considered[36]. It is likely that long-read genome and transcriptome data will be required to explore such gene innovations[20]. Finally, although we focused on HSDs that had become fixed in the human population, it may be that some of the most copy number polymorphic loci (or, additionally, loci that exist at <90% frequency in the population) are candidates for more recent adaptations between populations[30]. In this regard, duplications of TCAF1/TCAF2 are particularly intriguing. The genes encode TRP channel-associated factors that bind to TRPM8—the primary detector of environmental cold[54,55] expressed in 10–15% of somatosensory neurons. The two TCAF proteins are thought to exert opposing effects in TRPM8 gating and insertion into the plasma membrane[37]. Our copy number analysis agrees with our evolutionary finding that duplications of this locus are Homo sapiens-specific—not existing in Neanderthal and Denisova but at high copy in archaic humans. In modern humans, African and European populations show the greatest copy numbers while Asians show the lowest with some humans showing no duplication of the region (Figure 4). The model suggests that a single full-length paralog of TCAF1B (predicted HSD duplicate paralog) and TCAF2A (predicted ancestral paralog) exist at the locus, respectively, while additional TCAF1/TCAF2 copies appear to be truncated or incomplete. It is interesting to note that the conserved function of full-length TCAF2 may have been co-opted by a duplicate paralog after truncation of the ancestral paralog, a mechanism we also suggest occurred for duplicate family PTPN20 (Supplementary Figures 25 and 26). Although the function of the truncated duplicates awaits further characterization, it is clear that this locus has been radically restructured in most humans resulting in the ancestral functional loci being separated by hundreds of kilobase pairs and being transcribed in opposite orientations with the potential effect of altering regulation of these genes important in cold sensation.

METHODS

Characterization of HSD regions

HSD regions >5 kbp in length were initially identified by read-depth analysis of 236 human[30] and 86 NHP[34] whole-genome Illumina sequence data sets mapped against the human reference genome (GRCh37/hg19). We defined HSDs as regions with evidence of copy number gain in >90% of all humans (>2.5 copies) but where >90% of all great apes did not harbor duplications of the locus (<2.8 copies). Previously uncharacterized HSDs were validated by WSSD[56] and whole-genome analysis comparison (WGAC)[57] methods (Supplementary Table 1) and comparison to a genome assembly of the CHM1 haploid hydatidiform mole (NCBI Assembly PacBioCHM1_r2_GenBank_08312015) using BLASR. A combination of BLAST[58], BLAT[59], and WGAC[57] methods were used to annotate HSD paralogous regions and identify duplication breakpoints. HSD clustering simulations were performed 10 million times using BEDTools shuffle (v2.23.0)[60], median midpoint distances to specific genomic features for each iteration of the simulation were calculated, and comparisons of these distribution were made to the empirical values.

BAC clone sequencing

Large-insert clones from primate BAC libraries (CH17, CH251, CH276 and CH277) were sequenced using either capillary-based methods or single-molecule, real-time (SMRT) sequencing using Pacific Biosciences (PacBio) RSII P4C2 or P6C4 chemistry. Inserts were assembled using Quiver and HGAP (Hierarchical Genome Assembly Process) as described previously[61] (Supplementary Tables 3 and 6). Tiling paths of human BAC clones (CH17) were subjected to capillary (N = 205) or PacBio (N = 76) sequence and assembly, resulting in 47.2 Mbp of high-quality sequence from 224 BACs. Of these, 85 BACs were included in the most recent human reference assembly (GRCh38). Contig sequences not included in the human reference may be found in Supplementary Dataset 1. All NHP clones were subjected to SMRT sequence and assembly (N = 269 clones).

Evolutionary analyses

Multiple sequence alignments were generated using MAFFT[62] (Supplementary Dataset 1), visualized for manual editing using Jalview[63], and phylogenetic analyses were performed using MEGA6[64] (Supplementary Dataset 2). Evolutionary timing of HSDs was estimated as a fraction of the human–chimpanzee branch length. IGC regions were identified by GENECONV[29], masked using BEDTools, and timing estimates repeated with masked alignments. Duplication mechanisms were predicted using a combined approach of defining ancestral paralogs/configurations using genomic synteny taken from chimpanzee and/or orangutan and evolutionary timing estimates to predict the order of rearrangements.

Copy number genotyping

Copy number genotyping was performed from genome sequence data from 2,379 humans from the HGDP[30] and Phase 3 of the 1000 Genomes Project[31], 86 NHP individuals from the Great Ape Genome Project [including bonobo (N = 14), chimpanzee (N = 23), gorilla (N = 32), and orangutan (N = 17)][34], a Denisovan individual[33], a Neanderthal individual[32], and three archaic hominids[65,66]. Copy number variant genotypes were determined based on mrsFAST sequence alignment[67] and paralog-specific read-depth (SUNK) mapping[19]. We used the Vst statistic[35] (custom python script available at https://github.com/EichlerLab/vst_calc) to measure copy number stratification between populations. In some cases, gene copy numbers were validated via fluorescence in situ hybridization (FISH) using fosmid clones performed on lymphoblast cell lines (Coriell Cell Repository, Camden, NJ) as described previously[68] (Supplementary Tables 12 and 13).

RNA-seq

GTEx RNA-seq data from different subtissues (dbGaP version phs000424.v3.p1) were used to analyze the expression of a set of representative transcripts from hg38 RefSeq annotation. We quantified relative levels of expression using an adjusted version of RPKM (reads per kilobase of transcript per million mapped reads) with reads intersecting unique genomic 30-mers of a canonical isoform (RefSeq) corresponding to each gene paralog. Alternatively, we also applied the Sailfish[69] method version 0.63 with the default parameters and k = 20.

MIP sequencing

Single-molecule MIPs (N = 1,105 capturing 415 exonic regions of 30 gene families) designed using MIPgen[70] were phosphorylated, captured, barcoded and sequenced as previously described[71]. Variants were identified using FreeBayes (https://github.com/ekg/freebayes) with relaxed constraints allowing for reduced allele ratios (0.07) and annotated with the Ensembl Variant Effect Predictor[72] based on the canonical transcript for each gene. We sequenced a total of 1,096 MIPs from 6,719 individuals, including population controls from the 1000 Genomes Project and cases and controls from the Simons Simplex Collection (SSC)[73], Autism Genetic Resource Exchange (AGRE)[74], and The Autism Simplex Collection (TASC)[75] cohorts.

Statistical analysis

We applied the Wilcoxon-Mann-Whitney test when comparing primary versus secondary HSD sizes and the Kruskal-Wallis rank sum test to assess size differences across three different evolutionary periods. We applied a Wilcoxon-Mann-Whitney test post hoc to identify the duplication period(s) with significant differences and adjusted for multiple comparisons using the Holm method. For paralogous gene expression comparisons median RPKM values of annotated RefSeq transcripts were compared across all tissue types using a Wilcoxon signed-rank test and a Bonferroni correction applied for multiple test comparisons. A one-tailed Fisher’s exact test was used to compare frequency of HSD-exonic mutations in autism cases versus unaffected sibling controls and Bonferroni-corrected for multiple testing comparisons.

Human subjects

The 1000 Genomes and SSC cohorts included in this study did not meet the U.S. federal definitions for human subjects research. All samples were publicly available or encoded, with no individual identifiers available to the study authors. The University of Washington institutional review board (IRB) approved the AGRE and TASC cohorts for human subjects research. All samples were collected at respective institutions after receiving informed consent and approval by the appropriate IRBs. There are no new health risks to participants.

Data availability

BAC sequencing data generated during the current study are available in GenBank with the primary accession numbers provided in Supplementary Tables 3 and 6. Targeted MIP sequencing data generated during this study are available from NCBI BioProject (1000 Genomes Project cohort, ID# PRJNA356308) and the National Database for Autism Research (autism cohorts, NDAR project number #431; doi to be assigned).

78 in total

1. A candidate target for G protein action in brain.

Authors: L T Chen; A G Gilman; T Kozasa
Journal: J Biol Chem Date: 1999-09-17 Impact factor: 5.157

2. BLAT--the BLAST-like alignment tool.

Authors: W James Kent
Journal: Genome Res Date: 2002-04 Impact factor: 9.043

3. Geology and palaeontology of the Upper Miocene Toros-Menalla hominid locality, Chad.

Authors: Patrick Vignaud; Philippe Duringer; Hassane Taïsso Mackaye; Andossa Likius; Cécile Blondel; Jean-Renaud Boisserie; Louis De Bonis; Véra Eisenmann; Marie-Esther Etienne; Denis Geraads; Franck Guy; Thomas Lehmann; Fabrice Lihoreau; Nieves Lopez-Martinez; Cécile Mourer-Chauviré; Olga Otero; Jean-Claude Rage; Mathieu Schuster; Laurent Viriot; Antoine Zazzo; Michel Brunet
Journal: Nature Date: 2002-07-11 Impact factor: 49.962

4. Single molecule molecular inversion probes for targeted, high-accuracy detection of low-frequency variation.

Authors: Joseph B Hiatt; Colin C Pritchard; Stephen J Salipante; Brian J O'Roak; Jay Shendure
Journal: Genome Res Date: 2013-02-04 Impact factor: 9.043

5. Jalview Version 2--a multiple sequence alignment editor and analysis workbench.

Authors: Andrew M Waterhouse; James B Procter; David M A Martin; Michèle Clamp; Geoffrey J Barton
Journal: Bioinformatics Date: 2009-01-16 Impact factor: 6.937

6. Multiplex targeted sequencing identifies recurrently mutated genes in autism spectrum disorders.

Authors: Brian J O'Roak; Laura Vives; Wenqing Fu; Jarrett D Egertson; Ian B Stanaway; Ian G Phelps; Gemma Carvill; Akash Kumar; Choli Lee; Katy Ankenman; Jeff Munson; Joseph B Hiatt; Emily H Turner; Roie Levy; Diana R O'Day; Niklas Krumm; Bradley P Coe; Beth K Martin; Elhanan Borenstein; Deborah A Nickerson; Heather C Mefford; Dan Doherty; Joshua M Akey; Raphael Bernier; Evan E Eichler; Jay Shendure
Journal: Science Date: 2012-11-15 Impact factor: 47.728

7. Human-specific loss of regulatory DNA and the evolution of human-specific traits.

Authors: Cory Y McLean; Philip L Reno; Alex A Pollen; Abraham I Bassan; Terence D Capellini; Catherine Guenther; Vahan B Indjeian; Xinhong Lim; Douglas B Menke; Bruce T Schaar; Aaron M Wenger; Gill Bejerano; David M Kingsley
Journal: Nature Date: 2011-03-10 Impact factor: 49.962

8. Great ape genetic diversity and population history.

Authors: Javier Prado-Martinez; Peter H Sudmant; Jeffrey M Kidd; Heng Li; Joanna L Kelley; Belen Lorente-Galdos; Krishna R Veeramah; August E Woerner; Timothy D O'Connor; Gabriel Santpere; Alexander Cagan; Christoph Theunert; Ferran Casals; Hafid Laayouni; Kasper Munch; Asger Hobolth; Anders E Halager; Maika Malig; Jessica Hernandez-Rodriguez; Irene Hernando-Herraez; Kay Prüfer; Marc Pybus; Laurel Johnstone; Michael Lachmann; Can Alkan; Dorina Twigg; Natalia Petit; Carl Baker; Fereydoun Hormozdiari; Marcos Fernandez-Callejo; Marc Dabad; Michael L Wilson; Laurie Stevison; Cristina Camprubí; Tiago Carvalho; Aurora Ruiz-Herrera; Laura Vives; Marta Mele; Teresa Abello; Ivanela Kondova; Ronald E Bontrop; Anne Pusey; Felix Lankester; John A Kiyang; Richard A Bergl; Elizabeth Lonsdorf; Simon Myers; Mario Ventura; Pascal Gagneux; David Comas; Hans Siegismund; Julie Blanc; Lidia Agueda-Calpena; Marta Gut; Lucinda Fulton; Sarah A Tishkoff; James C Mullikin; Richard K Wilson; Ivo G Gut; Mary Katherine Gonder; Oliver A Ryder; Beatrice H Hahn; Arcadi Navarro; Joshua M Akey; Jaume Bertranpetit; David Reich; Thomas Mailund; Mikkel H Schierup; Christina Hvilsom; Aida M Andrés; Jeffrey D Wall; Carlos D Bustamante; Michael F Hammer; Evan E Eichler; Tomas Marques-Bonet
Journal: Nature Date: 2013-07-03 Impact factor: 49.962

9. The complete genome sequence of a Neanderthal from the Altai Mountains.

Authors: Kay Prüfer; Fernando Racimo; Nick Patterson; Flora Jay; Sriram Sankararaman; Susanna Sawyer; Anja Heinze; Gabriel Renaud; Peter H Sudmant; Cesare de Filippo; Heng Li; Swapan Mallick; Michael Dannemann; Qiaomei Fu; Martin Kircher; Martin Kuhlwilm; Michael Lachmann; Matthias Meyer; Matthias Ongyerth; Michael Siebauer; Christoph Theunert; Arti Tandon; Priya Moorjani; Joseph Pickrell; James C Mullikin; Samuel H Vohr; Richard E Green; Ines Hellmann; Philip L F Johnson; Hélène Blanche; Howard Cann; Jacob O Kitzman; Jay Shendure; Evan E Eichler; Ed S Lein; Trygve E Bakken; Liubov V Golovanova; Vladimir B Doronichev; Michael V Shunkov; Anatoli P Derevianko; Bence Viola; Montgomery Slatkin; David Reich; Janet Kelso; Svante Pääbo
Journal: Nature Date: 2013-12-18 Impact factor: 49.962

10. Ancient human genomes suggest three ancestral populations for present-day Europeans.

Authors: Iosif Lazaridis; Nick Patterson; Alissa Mittnik; Gabriel Renaud; Swapan Mallick; Karola Kirsanow; Peter H Sudmant; Joshua G Schraiber; Sergi Castellano; Mark Lipson; Bonnie Berger; Christos Economou; Ruth Bollongino; Qiaomei Fu; Kirsten I Bos; Susanne Nordenfelt; Heng Li; Cesare de Filippo; Kay Prüfer; Susanna Sawyer; Cosimo Posth; Wolfgang Haak; Fredrik Hallgren; Elin Fornander; Nadin Rohland; Dominique Delsate; Michael Francken; Jean-Michel Guinet; Joachim Wahl; George Ayodo; Hamza A Babiker; Graciela Bailliet; Elena Balanovska; Oleg Balanovsky; Ramiro Barrantes; Gabriel Bedoya; Haim Ben-Ami; Judit Bene; Fouad Berrada; Claudio M Bravi; Francesca Brisighelli; George B J Busby; Francesco Cali; Mikhail Churnosov; David E C Cole; Daniel Corach; Larissa Damba; George van Driem; Stanislav Dryomov; Jean-Michel Dugoujon; Sardana A Fedorova; Irene Gallego Romero; Marina Gubina; Michael Hammer; Brenna M Henn; Tor Hervig; Ugur Hodoglugil; Aashish R Jha; Sena Karachanak-Yankova; Rita Khusainova; Elza Khusnutdinova; Rick Kittles; Toomas Kivisild; William Klitz; Vaidutis Kučinskas; Alena Kushniarevich; Leila Laredj; Sergey Litvinov; Theologos Loukidis; Robert W Mahley; Béla Melegh; Ene Metspalu; Julio Molina; Joanna Mountain; Klemetti Näkkäläjärvi; Desislava Nesheva; Thomas Nyambo; Ludmila Osipova; Jüri Parik; Fedor Platonov; Olga Posukh; Valentino Romano; Francisco Rothhammer; Igor Rudan; Ruslan Ruizbakiev; Hovhannes Sahakyan; Antti Sajantila; Antonio Salas; Elena B Starikovskaya; Ayele Tarekegn; Draga Toncheva; Shahlo Turdikulova; Ingrida Uktveryte; Olga Utevska; René Vasquez; Mercedes Villena; Mikhail Voevoda; Cheryl A Winkler; Levon Yepiskoposyan; Pierre Zalloua; Tatijana Zemunik; Alan Cooper; Cristian Capelli; Mark G Thomas; Andres Ruiz-Linares; Sarah A Tishkoff; Lalji Singh; Kumarasamy Thangaraj; Richard Villems; David Comas; Rem Sukernik; Mait Metspalu; Matthias Meyer; Evan E Eichler; Joachim Burger; Montgomery Slatkin; Svante Pääbo; Janet Kelso; David Reich; Johannes Krause
Journal: Nature Date: 2014-09-18 Impact factor: 49.962

42 in total

Review 1. Chromothripsis, a credible chromosomal mechanism in evolutionary process.

Authors: Franck Pellestor; Vincent Gatinois
Journal: Chromosoma Date: 2018-08-07 Impact factor: 4.316

2. Frequent nonallelic gene conversion on the human lineage and its effect on the divergence of gene duplicates.

Authors: Arbel Harpak; Xun Lan; Ziyue Gao; Jonathan K Pritchard
Journal: Proc Natl Acad Sci U S A Date: 2017-11-14 Impact factor: 11.205

3. Genomics at cellular resolution: insights into cognitive disorders and their evolution.

Authors: Stefano Berto; Yuxiang Liu; Genevieve Konopka
Journal: Hum Mol Genet Date: 2020-09-30 Impact factor: 6.150

Review 4. Substitutions Are Boring: Some Arguments about Parallel Mutations and High Mutation Rates.

Authors: Maximilian Oliver Press; Ashley N Hall; Elizabeth A Morton; Christine Queitsch
Journal: Trends Genet Date: 2019-02-20 Impact factor: 11.639

5. A complete reference genome improves analysis of human genetic variation.

Authors: Sergey Aganezov; Stephanie M Yan; Daniela C Soto; Melanie Kirsche; Samantha Zarate; Pavel Avdeyev; Dylan J Taylor; Kishwar Shafin; Alaina Shumate; Chunlin Xiao; Justin Wagner; Jennifer McDaniel; Nathan D Olson; Michael E G Sauria; Mitchell R Vollger; Arang Rhie; Melissa Meredith; Skylar Martin; Joyce Lee; Sergey Koren; Jeffrey A Rosenfeld; Benedict Paten; Ryan Layer; Chen-Shan Chin; Fritz J Sedlazeck; Nancy F Hansen; Danny E Miller; Adam M Phillippy; Karen H Miga; Rajiv C McCoy; Megan Y Dennis; Justin M Zook; Michael C Schatz
Journal: Science Date: 2022-04-01 Impact factor: 63.714

6. High-resolution comparative analysis of great ape genomes.

Authors: Zev N Kronenberg; Ian T Fiddes; David Gordon; Shwetha Murali; Stuart Cantsilieris; Olivia S Meyerson; Jason G Underwood; Bradley J Nelson; Mark J P Chaisson; Max L Dougherty; Katherine M Munson; Alex R Hastie; Mark Diekhans; Fereydoun Hormozdiari; Nicola Lorusso; Kendra Hoekzema; Ruolan Qiu; Karen Clark; Archana Raja; AnneMarie E Welch; Melanie Sorensen; Carl Baker; Robert S Fulton; Joel Armstrong; Tina A Graves-Lindsay; Ahmet M Denli; Emma R Hoppe; PingHsun Hsieh; Christopher M Hill; Andy Wing Chun Pang; Joyce Lee; Ernest T Lam; Susan K Dutcher; Fred H Gage; Wesley C Warren; Jay Shendure; David Haussler; Valerie A Schneider; Han Cao; Mario Ventura; Richard K Wilson; Benedict Paten; Alex Pollen; Evan E Eichler
Journal: Science Date: 2018-06-08 Impact factor: 47.728

Review 7. Deconstructing cortical folding: genetic, cellular and mechanical determinants.

Authors: Cristina Llinares-Benadero; Víctor Borrell
Journal: Nat Rev Neurosci Date: 2019-03 Impact factor: 34.870

8. Assembly of higher-order SMN oligomers is essential for metazoan viability and requires an exposed structural motif present in the YG zipper dimer.

Authors: Kushol Gupta; Ying Wen; Nisha S Ninan; Amanda C Raimer; Robert Sharp; Ashlyn M Spring; Kathryn L Sarachan; Meghan C Johnson; Gregory D Van Duyne; A Gregory Matera
Journal: Nucleic Acids Res Date: 2021-07-21 Impact factor: 16.971

9. Genomic regions associated with microdeletion/microduplication syndromes exhibit extreme diversity of structural variation.

Authors: Yulia Mostovoy; Feyza Yilmaz; Stephen K Chow; Catherine Chu; Chin Lin; Elizabeth A Geiger; Naomi J L Meeks; Kathryn C Chatfield; Curtis R Coughlin; Urvashi Surti; Pui-Yan Kwok; Tamim H Shaikh
Journal: Genetics Date: 2021-02-09 Impact factor: 4.562

Review 10. Human-Specific Genes, Cortical Progenitor Cells, and Microcephaly.

Authors: Michael Heide; Wieland B Huttner
Journal: Cells Date: 2021-05-15 Impact factor: 6.600