Intronic CNVs and gene expression variation in human populations.

Maria Rigau¹, David Juan², Alfonso Valencia¹,³, Daniel Rico⁴.

Abstract

Introns can be extraordinarily large and they account for the majority of the DNA sequence in human genes. However, little is known about their population patterns of structural variation and their functional implications. By combining the most extensive maps of CNVs in human populations, we have found that intronic losses are the most frequent copy number variants (CNVs) in protein-coding genes in humans, with 12,986 intronic deletions affecting 4,147 genes (including 1,154 essential genes and 1,638 disease-related genes). This intronic length variation results in dozens of genes showing extreme population variability in size, with 40 genes with 10 or more different sizes and up to 150 allelic sizes. Intronic losses are frequent in evolutionarily ancient genes that are highly conserved at the protein sequence level. This result contrasts with losses overlapping exons, which are observed less often than expected by chance and almost exclusively affect primate-specific genes. An integrated analysis of CNVs and RNA-seq data showed that intronic loss can be associated with significant differences in gene expression levels in the population (CNV-eQTLs). These intronic CNV-eQTL regions are enriched for intronic enhancers and can be associated with expression differences of other genes showing long-distance intron-promoter 3D interactions. Our data suggest that intronic structural variation of protein-coding genes makes an important contribution to the variability of gene expression and splicing in human populations.

Year: 2019. PMID: 30677042. PMCID: PMC6345438. DOI: 10.1371/journal.pgen.1007902

Source DB: PubMed. Journal: PLoS Genet. ISSN: 1553-7390. Impact factor: 5.917

Introduction

Most eukaryotic protein-coding genes contain introns that are removed from the messenger RNA during splicing. In humans, up to 35% of the sequenced genome corresponds to intronic sequence, while exons cover around 2.8% of the genome (based on the genome version and gene set used for this study). Unlike exons, human introns vary widely in length. This variation leads to substantial differences in size among human genes, so that the time taken to transcribe a gene ranges from seconds to over 24 hours [1]. Indeed, intron size is highly conserved in genes associated with developmental patterning [2], suggesting that genes that require precise temporal coordination of their transcription rely on a consistent transcript length. It has been suggested that selection could be acting to reduce the costs of transcription by keeping introns short in highly expressed genes [3], which are enriched in essential housekeeping functions [4]. Genes transcribed early in development [5-7] and genes involved in rapid biological responses [8] also conserve intron-poor structures. Interestingly, Keane and Seoighe [9] recently found that the intron lengths of some genes tend to coevolve (their relative sizes co-vary across species), possibly because precise temporal regulation of the expression of these genes is required. In fact, these genes tend to be coexpressed or to participate in the same protein complexes [9]. It is well known that introns contribute to the control of gene expression through the regulatory regions and non-coding functional RNA genes they contain, or directly through their length [10-12]. Despite the importance of introns in regulating transcription levels, transcription timing and splicing, little attention has been paid to their potential role in human population variability studies.
A recent analysis of the literature has revealed a substantial number of pathogenic variants located “deep” within introns (more than 100bp from exon-intron boundaries), which suggests that the sequence analysis of full introns may help to identify causal mutations for many undiagnosed clinical cases [13]. Given that direct associations between intronic mutations and certain diseases have been reported [13-16], we need to characterise the normal genetic variability in introns so that we can better distinguish normal from pathogenic variation.

Results

Deletions are enriched in purely intronic regions

We studied the effect of structural intronic variants on protein-coding gene loci in healthy humans using five high-resolution copy number variant (CNV) maps [17-21]. Most of these CNVs were detected using whole genome sequencing (WGS) data, which allows the exact genomic boundaries of these variants to be determined. CNVs may have neutral, advantageous or deleterious consequences [22] and can be classified as gains (regions that are found duplicated when compared with the expected copy number from the reference genome, which is 2 for autosomes), losses (homozygously or heterozygously deleted regions) and gain/loss CNVs (regions that are found duplicated in some individuals, or alleles, and deleted in others). Each of the maps in our study was derived from a different number of individuals, from different populations and using different techniques and algorithms for CNV detection. Due to these differences, each dataset provided us with a different set of CNVs, which we analysed independently, excluding sex chromosomes and private variants. CNVs affect genes in different ways depending on the degree of overlap with them. Some CNVs cover entire genes (from now on, whole gene CNVs), other CNVs overlap with part of the coding sequence but not the whole gene (exonic CNVs) and other CNVs are found within purely intronic regions (intronic CNVs, not overlapping with any exon from any annotated isoform). The latter group is the most common, with 63% of all CNVs falling within intronic regions, but remains the least studied. More than 95% of these 12,986 intronic CNVs are losses (12,334) or gain/loss CNVs (652) [23]. The prevalence of losses in introns is in stark contrast with whole gene CNVs (1,412), which tend to be exclusively gains (55% of the cases) or gain/loss CNVs (25% of the cases).
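The three overlap classes described above can be illustrated with a minimal Python sketch. The function names, the half-open coordinate convention and the label strings are assumptions made for illustration only; they are not taken from the authors' pipeline.

```python
from typing import List, Tuple

# Half-open [start, end) coordinates on a single chromosome (an assumed convention).
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def classify_cnv(cnv: Interval, gene: Interval, exons: List[Interval]) -> str:
    """Classify one CNV relative to one gene.

    `exons` should contain the exons of every annotated isoform, so that a
    CNV is labelled intronic only if it touches no exon at all.
    """
    if not overlaps(cnv, gene):
        return "no_overlap"
    if cnv[0] <= gene[0] and cnv[1] >= gene[1]:
        return "whole_gene"
    if any(overlaps(cnv, exon) for exon in exons):
        return "exonic"
    return "intronic"
```

With a hypothetical gene spanning (100, 1000) and exons at (100, 200), (500, 600) and (900, 1000), a CNV at (250, 400) would be labelled intronic, while one at (150, 300) would be labelled exonic.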

Types of CNVs in the different datasets.

(A) CNVs can overlap entire genes or fractions of genes: CNVs overlapping with exons of a gene (exonic CNVs) and CNVs found within introns (intronic CNVs). (B-D) Number of whole gene, exonic and intronic CNV events, showing the different proportions of CNV gains, losses and gain/loss CNVs.

Surprisingly, purely intronic losses are not only the most prevalent form of CNV, but they are also observed more often than expected by chance in most datasets. We compared the observed values with expected distributions calculated using permutations in local and global background models (see Methods). We find significantly more deletions (4.14–9.3%) falling in introns than expected in 3 out of the 5 maps (4.14% in Sudmant-Nature, P = 0.0002, global permutation test). For the sake of clarity, the P-values in the main text correspond to the results in Sudmant-Nature’s map [20] using the global background model (unless otherwise indicated). The results obtained with the alternative background model and with both models in additional maps are shown in the supplementary tables and figures. In contrast with intronic deletions, there are 51.2% fewer coding deletions (overlapping with exons) than would be expected by chance (P < 1e-04, Sudmant-Nature, global permutation test). These patterns are consistent between the two background models and the enrichment is not limited to any specific range of intron sizes. The enrichment of deletions in introns might seem to contradict what was originally reported by the 1000 Genomes (1KG) Project [20], which stated that introns had fewer CNVs than expected by chance. However, that study did not separate purely intronic deletions from deletions overlapping both introns and exons, whereas we refer to strictly intronic deletions (see Methods for details). Indeed, if we group all purely intronic and intron-exon overlapping deletions together, we also observe a significant depletion.
The enrichment of intronic deletions could be explained as a consequence of the negative selection of CNVs in exonic regions. To better understand the selective pressure on introns, we compared the prevalence of deletions in intronic regions and in intergenic regions of similar size. Compared to intergenic regions, introns are less enriched in deletions (FC = 0.81, P = 2.23e-308, paired Student’s t-test). In addition, deletions are, on average, larger in intergenic regions than in introns (FC = 1.14, P = 6.23e-35). In conclusion, intronic regions are less enriched in deletions than intergenic regions of similar size. These results suggest that the sequence and the distance between exons are more conserved in intronic regions than in intergenic regions.
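The logic of a permutation-based enrichment test of this kind can be sketched in Python as follows. This is a simplified illustration, not the authors' procedure: their local and global background models are defined in the Methods, whereas the sketch assumes a single linear genome, uniform random placement of each deletion, and sorted non-overlapping introns. All function and parameter names are hypothetical.

```python
import random
from bisect import bisect_right
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def count_contained(deletions: List[Interval], introns: List[Interval]) -> int:
    """Count deletions lying entirely inside one intron.

    Assumes `introns` is sorted by start and non-overlapping.
    """
    starts = [start for start, _ in introns]
    contained = 0
    for d_start, d_end in deletions:
        i = bisect_right(starts, d_start) - 1  # last intron starting at or before d_start
        if i >= 0 and d_end <= introns[i][1]:
            contained += 1
    return contained


def permutation_pvalue(deletions: List[Interval], introns: List[Interval],
                       genome_length: int, n_perm: int = 1000, seed: int = 0):
    """Upper-tail empirical P-value for the number of purely intronic deletions.

    Each permutation keeps every deletion's length and places it uniformly at
    random on the genome (an assumed, deliberately simple background model).
    """
    rng = random.Random(seed)
    observed = count_contained(deletions, introns)
    lengths = [end - start for start, end in deletions]
    at_least = 0
    for _ in range(n_perm):
        shuffled = []
        for length in lengths:
            start = rng.randrange(0, genome_length - length + 1)
            shuffled.append((start, start + length))
        if count_contained(shuffled, introns) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least + 1) / (n_perm + 1)
```

Counting permutations with a value at most as large as the observed one, instead of at least as large, would give the corresponding test for depletion, as reported for coding deletions.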

Highly variable sizes in highly conserved protein-coding genes

The percentage of each intron that can be lost in the population due to CNV losses is highly variable, from 0.03% to 98.1% (51bp to 293kb), representing a loss of 0.01% to 77.5% of the total genic size (51bp to 893.4kb). Some examples of genes with a notable change in size after a single intronic deletion in one individual are the neuronal glutamate transporter SLC1A1 (Solute Carrier Family 1 Member 1), with a loss of 37% of its genic size, and the LINGO2 (Leucine Rich Repeat And Ig Domain Containing 2, alias LERN3 or LRRN6C) gene, with a loss of 34% of its size.

Changes in intron and gene size.

(A) Proportion of the reference intron that has been observed as deleted in any of the studies. (B) Proportion of the whole intronic content of a gene that has been observed as deleted. (C) Change in gene length by intronic deletions. (D) Example of a gene with a substantial change in gene size with a single intronic deletion. (E) Number of different gene sizes observed in the population as a function of the number of intronic deletions detected. Gene names of the seven most extreme cases are indicated.

The combination of different intronic deletions within a gene can give rise to alleles of several different sizes. Continuing with the same two examples, in the dataset from the final phase of the 1KG Project [20], we found 5 different intronic deletions in SLC1A1. These deletions result in 8 different gene sizes in the population, with individual losses ranging from 1.1kb to 48kb. In LINGO2, the 20 different deletions give rise to 36 different gene lengths in the 1KG population, with losses of 51bp to 233kb. The gene with the most different allele sizes in the 1KG population [20] is CSMD1 (CUB And Sushi Multiple Domains 1), with a total of 66 common annotated intronic deletions that, combined, produce 150 alleles of different sizes. Strikingly, CSMD1 is highly conserved at the protein level and is amongst the genes most intolerant to functional variation. According to the ranking of the RVIS (Residual Variation Intolerance Score) gene scores [24], which is based on the amount of genetic variation of each gene at the exome level, only 0.169% of genes in the human genome are more intolerant to variation in their coding sequence than CSMD1. In summary, intolerance to variation in the coding sequence seems to be compatible with extreme variation in the intronic sequence. These losses might affect the regulation of these genes without affecting their protein structure. A total of 1,638 OMIM genes carry intronic deletions in the population.
Diseases associated with SLC1A1 (OMIM: 133550) include Dicarboxylic Aminoaciduria and susceptibility to Schizophrenia, while LINGO2 (OMIM: 609793) has variants associated with essential tremor and Parkinson disease and also has an intronic SNP associated with body mass [25]. CSMD1 has been associated with diseases such as Benign Adult Familial Myoclonic Epilepsy (Malacards [26], MCID: BNG079) and Smallpox (MCID: SML019). Interestingly, rare intronic deletions in this gene have recently been reported to be associated with both male and female infertility [27]. To better understand possible epistatic effects between protein-coding and intronic mutations, it will be useful to incorporate information about gene length variation in future studies of these disease genes.
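The way a handful of intronic deletions can combine into a larger number of allele sizes can be illustrated with a short Python sketch. It assumes phased haplotypes and non-overlapping deletions, so that the lengths of the deletions carried on one allele simply add up; the function name, the identifiers and the toy values are hypothetical and do not correspond to real 1KG data.

```python
from typing import Dict, FrozenSet, Iterable, Set


def allele_sizes(gene_length: int,
                 deletion_lengths: Dict[str, int],
                 haplotypes: Iterable[FrozenSet[str]]) -> Set[int]:
    """Return the distinct gene sizes observed across a set of haplotypes.

    Each haplotype is the set of deletion identifiers carried on that allele.
    Deletions are assumed not to overlap each other.
    """
    sizes = set()
    for carried in haplotypes:
        lost = sum(deletion_lengths[d] for d in carried)
        sizes.add(gene_length - lost)
    return sizes


# Toy example: two deletions observed alone and together give four sizes.
toy_sizes = allele_sizes(
    gene_length=10_000,
    deletion_lengths={"del1": 500, "del2": 2_000},
    haplotypes=[frozenset(), frozenset({"del1"}), frozenset({"del2"}),
                frozenset({"del1", "del2"}), frozenset({"del1"})],
)
```

Because only the combinations actually present in the population contribute, the number of distinct sizes depends on how the deletions are distributed across haplotypes as well as on how many deletions there are.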

Intronic deletions are frequent in evolutionary ancient and essential genes

Structural variants in the germline DNA constitute an important source of genetic variability that serves as the substrate for evolution. Therefore, dating the evolutionary age of genes allows the study of structural variants that were fixed millions of years ago. Whole gene CNVs are known to differentially affect genes depending on their evolutionary age, mainly involving evolutionarily young genes [28]. Younger genes are generally cell-type specific, while ancient genes tend to be more conserved, ubiquitously expressed and enriched in essential cellular functions. Intrigued to see many cases where intronic CNVs were affecting highly conserved protein-coding loci, we compared the distribution of coding (including exonic and whole gene) and intronic deletions across different gene ages. These and subsequent analyses were done using 3 maps: Sudmant-Nature’s [20], Zarrei’s [19] and Abyzov’s [17] maps. Handsaker’s [18] and Sudmant-Science’s [21] maps were discarded because they had very few intronic deletions (fewer than 1,000).

Evolutionary age of affected genes.

Ratios of observed versus expected number of genes from each gene evolutionary age that contain deletions overlapping with exons, including partial and whole gene CNVs (A) or intronic deletions (B). Expected values were calculated with 10,000 random permutations using a global background model. Red asterisks mark the significantly enriched groups of genes. Significance: * for P<0.05, ** for P<0.005 and *** for P<0.0005. (C) Of all the genes overlapping with deletions after aggregating the three maps, the proportion of genes that have all or part of their exons affected by deletions and the percentage of genes with intronic deletions only. The equivalent figure for each separate map is provided as a supplementary figure. (D) Percentage of highly stratified variants (HSV, maximum Vst > 0.2) in each age group and by type of overlap with the gene. The absolute number of deletions is indicated above each bar.

We observed that most ancient genes are depleted of deletions that affect their coding regions, while primate-specific genes are enriched in coding CNVs, meaning that the coding region of recent genes has a higher tendency to be lost or disrupted. This pattern was also observed when CNV gains were included. The generation of random background models revealed that ancient genes (present in the Sarcopterygii ancestor) were significantly depleted of coding region losses (both exonic and whole gene, P < 1e-04, global permutation test), while these were enriched in young genes (from Hominoidea to Homo sapiens, P < 1e-04, global permutation test). In contrast with coding deletions, the number of intronic deletions has a uniform distribution across gene ages, being slightly enriched in ancient genes in Sudmant-Nature’s map (P = 2e-04, global permutation test). A similar pattern was also observed when taking only genes with big introns (larger than 1,500 bp) and when calculating the enrichment within big introns independently from genes.
Remarkably, the larger size of introns in ancient genes leads to a higher proportion of these genes being affected by intronic deletions. Therefore, while coding deletions are less frequent in ancient genes than in younger genes, intronic deletions are more frequent in the older ones. We would expect that essential genes, which tend to be ancient [29], could be an exception to the enrichment of deletions. Essential genes have on average shorter introns than the rest of the genes [30,31] and than genes of the same evolutionary age. Up to 1,154 essential genes carry intronic deletions if we take into account all five CNV maps. In Sudmant-Nature, 907 essential genes have intronic deletions, a higher number than expected by chance (P = 0.034, global permutation test). We investigated whether intron variability in genes was associated with any biological function. Genes with more or fewer intronic deletions than expected by chance (see Methods) were not associated with any particular function using DAVID [32]. Nevertheless, genes with fewer intronic deletions than expected show more protein-protein interactions among them than expected by chance (P = 2.43e-10, calculated with STRING [33]). These results are compatible with previous evolutionary studies that showed high levels of conservation of intron length in genes associated with developmental protein complexes in mammals [2], presumably to facilitate a more precise temporal regulation of expression [9]. Population stratification of CNVs has previously been suggested to be indicative of loci under adaptive selection [20,21]. We identified 352 highly stratified variants (HSVs, maximum Vst>0.2, see Methods) from Sudmant-Nature’s map overlapping with protein-coding genes: 282 are intronic, 53 exonic and 17 whole gene. We classified deleted regions according to the age of the genes and the type of gene structure affected and calculated the percentage of each group that is highly stratified.
Interestingly, the contribution of intronic HSVs is higher for younger genes, a pattern consistent with the expected higher functional impact of HSVs in older genes. Remarkably, the percentage of intronic HSVs is similar to or higher than that of whole-gene and exonic HSVs in all age groups (and always higher than that of partial exonic deletions). These signatures of potential positive selection in purely intronic CNVs suggest that a fraction of them might contribute to human adaptation.
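For readers unfamiliar with the Vst statistic used to define HSVs, the following Python sketch shows one commonly used formulation: Vst = (Vt - Vs) / Vt for a pair of populations, where Vt is the variance of copy number across both populations and Vs is the size-weighted average of the within-population variances, with the maximum taken over all population pairs. The exact computation used in the study is described in its Methods and may differ (for example in the variance estimator), so this should be read as an assumption-laden illustration with hypothetical function names.

```python
from itertools import combinations
from statistics import pvariance
from typing import Dict, Sequence


def vst(pop_a: Sequence[float], pop_b: Sequence[float]) -> float:
    """Vst between two populations from per-individual copy numbers."""
    total = pvariance(list(pop_a) + list(pop_b))
    if total == 0:
        return 0.0  # no copy number variation at all
    within = (pvariance(pop_a) * len(pop_a) + pvariance(pop_b) * len(pop_b)) / (
        len(pop_a) + len(pop_b)
    )
    return (total - within) / total


def max_vst(copy_numbers_by_pop: Dict[str, Sequence[float]]) -> float:
    """Maximum pairwise Vst across all populations."""
    return max(vst(a, b) for a, b in combinations(copy_numbers_by_pop.values(), 2))


def is_highly_stratified(copy_numbers_by_pop: Dict[str, Sequence[float]],
                         threshold: float = 0.2) -> bool:
    """Apply the maximum Vst > 0.2 criterion quoted in the text."""
    return max_vst(copy_numbers_by_pop) > threshold
```

A deletion fixed in one population and absent from another gives a Vst of 1, while identical copy number distributions in both populations give a Vst of 0.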

Intronic deletions are associated with gene expression variability in the population

Multiallelic CNVs affecting whole genes have been shown to correlate with gene expression: generally, the higher the number of copies of a gene, the higher its expression levels [18,20]. We hypothesised that intronic size variation may also impact the expression of the affected genes (without affecting the actual number of copies of the gene). Therefore, we examined the possible effect of intronic hemizygous deletions on gene expression variation at the population level, comparing it with the effects of hemizygous deletions in coding regions (whole gene and exonic) and of intergenic non-coding deletions. We used available RNA-seq data from Geuvadis [34], derived from lymphoblastoid cell lines, for 445 individuals for whom we have matching CNV data from the 1KG Project [20].
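Two bookkeeping steps of a DEL-eQTL scan of this kind can be sketched in Python: selecting deletions with enough hemizygous and wild-type individuals to compare (the study requires at least 2 of each), and correcting the resulting P-values for multiple testing. The study reports FDR correction at 5% without the sketch knowing which procedure was used, so the Benjamini-Hochberg adjustment below is an assumption, as are the function names; the per-deletion Student's t-test itself is omitted.

```python
from typing import Dict, List


def eligible(copy_numbers: Dict[str, int], min_group: int = 2) -> bool:
    """Check that a deletion has enough carriers and non-carriers to test.

    `copy_numbers` maps an individual identifier to the copy number of the
    region: 1 for hemizygous deletion carriers, 2 for wild-type individuals.
    """
    hemizygous = sum(1 for cn in copy_numbers.values() if cn == 1)
    wild_type = sum(1 for cn in copy_numbers.values() if cn == 2)
    return hemizygous >= min_group and wild_type >= min_group


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Genes whose adjusted P-value falls below 0.05 would then be reported as eGenes under this assumed procedure.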

DEL-eQTLs.

(A-E) Five types of DEL-eQTLs analysed. Lightning-bolt symbols indicate deletion breakpoints. (F) DEL-eQTL results for the five types. Number of eGenes, eTranscripts and genes with eTranscripts when comparing expression levels of individuals with a reference allele and an allele with a specific deletion versus individuals with two reference alleles. P-values obtained after performing Student’s t-tests were FDR-corrected (FDR = 5%). The number of expected eGenes, eTranscripts or genes with eTranscripts was calculated after randomising which individuals carry the deletion, and P-values were calculated by comparing the observed value with the 10,000 random values. Significance: * for P<0.05, ** for P<0.005 and *** for P<0.0005. (G) Examples of intronic deletions in a gene associated with expression changes of another gene that interacts with it in 3D. Black boxes represent exons and the light blue box represents the PCHiC fragment in contact with the differentially expressed gene. The position of the deletion is marked with dashed red lines. Gene expression of the eGenes is represented using PEER-RPKM values.

To look for differences in gene expression, we selected variants for which we had at least 2 hemizygous individuals (individuals with copy number = 1) and at least 2 wild-type individuals (copy number = 2) and compared the expression levels between these two groups to identify deleted CNV regions associated with expression quantitative trait loci (eQTL). We will refer to the deleted regions associated with expression changes as DEL-eQTLs, and to the genes associated with them as eGenes. For comparative purposes, we first looked at the effect of hemizygous deletions in coding regions (whole gene and exonic DEL-eQTLs). We found 7 eGenes among the 50 genes with whole gene deletion CNVs, all showing significant downregulation of gene expression in lymphoblastoid cell lines (14%, a higher number of eGenes than expected by chance, P < 5e-4, permutation test).
In addition, we found 35 eGenes out of 437 genes with partial exonic deletions that were differentially expressed (8%, a number higher than expected by chance, P < 5e-4, permutation test). The majority of these eGenes (32/35) were downregulated in the individuals carrying the deletion. Although intronic deletions do not affect the coding sequence of genes, we observed significant differences in gene expression in 53 eGenes out of the 1,505 genes with intronic deletions, a number of intronic-eGenes that is also higher than expected by chance (P < 5e-4, permutation test). Given the higher abundance of intronic deletions in the population, the absolute number of intronic-eGenes (53 genes) was similar to the total of coding-eGenes (39 genes). Of the intronic-eGenes, 62% were downregulated and the other 38% upregulated, suggesting that intronic deletions might either enhance or repress gene expression (while coding losses are mostly associated with gene downregulation). Regulatory regions are known to be preferentially located in first introns [35]. Of all 56 intronic eDeletions that are associated with changes in gene expression in our study, 17 (30.4%) are found within first introns. However, this percentage is not significantly higher than in non-eDeletions (26%, P = 0.54, Fisher’s test). Finally, we identified that four of the intronic cis-eDeletions in lymphoblastoid cells are HSVs, suggesting adaptive potential of these expression differences. These intronic HSVs are located in four ancient genes (Sarcopterygii or older): EXOC2, SKAP2, PTGR1 and PHYHD1. EXOC2 is an essential gene encoding one of the proteins of the exocyst complex and is among the top 5% most conserved genes in human (RVIS = 3.34). Since intron length can impact the inclusion of alternative exons [36], we hypothesised that there might be differentially expressed transcripts (eTranscripts) in genes containing an intronic deletion.
In addition to the 53 intronic-eGenes, we found 217 intronic-eTranscripts in a total of 185 genes (more than expected by chance, P = 0.018, permutation test). These results suggest that deletions within introns may cause the inclusion or exclusion of exons, thus influencing the relative proportion of alternative transcripts in many genes. Changes in GC content resulting from intronic deletions might also contribute to these splicing differences: in genes with long introns, the recognition of introns and exons by the splicing machinery is based on their differential GC content [37,38], and the lower GC content of introns facilitates their recognition. We found that, in general, the deleted sequences have a significantly higher GC content than the introns where they are located (P = 1.8e-28, paired Student’s t-test), and the loss of these sequences causes a significant decrease in the overall GC content of the introns (P = 2.23e-16, paired Student’s t-test). This drop in GC content is more pronounced in introns with deletions that originated through transposable element insertion (TEI, P = 2.01e-9, paired Student’s t-test). 84% of TEI deletions overlap almost completely with Alu elements (S13 Fig), which are known to be GC rich. The GC drops in introns with deletions associated with non-allelic homologous repair (NAHR) are less significant (P = 0.0063), while the difference is not significant in deletions caused by non-homologous repair (NH) (P = 0.7676). The drop in intronic GC content associated with most TEI and many NAHR deletions would increase the difference in GC content between introns and their flanking exons, which could facilitate exon definition during splicing and might contribute to the observed differential expression of some transcripts.
It has recently been shown that human enhancers are associated with high GC content [39] and that Alu elements can act as enhancers [40], suggesting that deletions could not only alter splicing but also influence regulatory features located within introns.
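The GC comparison described above (GC content of the deleted segment versus the intron, and of the intron before versus after the deletion) reduces to a simple calculation, sketched here in Python. The function names and the 0-based half-open coordinates relative to the intron sequence are illustrative assumptions; ambiguous bases such as N are ignored.

```python
from typing import Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of G or C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    counted = sum(1 for base in seq if base in "ACGT")
    if counted == 0:
        return 0.0
    return sum(1 for base in seq if base in "GC") / counted


def gc_change_after_deletion(intron: str, del_start: int,
                             del_end: int) -> Tuple[float, float, float]:
    """Return GC fractions of the full intron, the deleted segment and the remainder.

    `del_start` and `del_end` are 0-based, half-open positions within `intron`.
    """
    deleted = intron[del_start:del_end]
    remaining = intron[:del_start] + intron[del_end:]
    return gc_fraction(intron), gc_fraction(deleted), gc_fraction(remaining)
```

For the toy sequence ATATGCGCGCATAT with positions 4 to 10 deleted, the deleted segment is entirely GC and the remaining intron has no GC at all, which mimics in an exaggerated way the loss of a GC-rich element from an AT-rich intron.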

Deletion of intronic regions and changes in expression in trans

Introns in humans are particularly enriched in regulatory regions and frequently interact with the promoters of other genes via chromatin looping. Therefore, deletions in introns that show long-range interactions with promoters of other genes could potentially affect their expression (trans effects). We used published promoter-capture Hi-C data for B-lymphocytes [41] to link intronic regions and gene promoters. We identified 322 deletions in intronic regions that interact with gene promoters of other genes (672 in total). Taking all combinations of genes and the trans-intronic regions with deletions, we searched for intronic trans-DEL-eQTLs: intronic regions that, when deleted, are associated with changes in expression of a different gene. Twelve of these genes were found to be significantly differentially expressed in the individuals presenting an intronic deletion in another gene (trans-intron-eGenes). For example, PRSS36 (Protease, Serine 36) is downregulated in individuals with an intronic deletion in the SETD1A (SET Domain Containing 1A) gene, while the LIAS (Lipoic Acid Synthetase) gene is upregulated in individuals with an intronic deletion in PDS5A (PDS5 Cohesin Associated Factor A). In addition, 81 transcripts from 65 genes were also differentially expressed (trans-intron-eTranscripts) in the individuals with a trans-DEL-eQTL. The loss of intergenic fragments in 3D contact with a gene was associated with a number of differentially expressed genes similar to that associated with intronic trans-DEL-eQTLs (16 trans-eGenes and 123 eTranscripts associated with intergenic deletions). We analysed the age of different types of eGenes and observed that whole-gene and exonic eGenes are enriched in young age classes. This pattern is very different in intronic and intergenic eGenes: intronic cis-eGenes are enriched in old ages, while intronic trans-eGenes and intergenic-eGenes do not seem to be associated with gene age.
If we compare the RVIS of the different types of eGenes, we find that whole gene and exonic eGenes are actually among the genes most tolerant to point mutations in their coding sequence. In contrast, we found that a significant proportion of intronic cis-eGenes have low RVIS percentiles, indicating that protein-coding genes that are intolerant to point mutations at the protein level can have intronic deletions associated with gene expression changes. Strikingly, trans-eGenes show the lowest RVIS percentiles, indicating that intronic variation might impact the gene expression of interacting genes that are quite intolerant to coding mutations.

Impact of CNVs on genes and their evolution.

(A) Percentage of genes of each group of evolutionary ages that is associated to an eCNV, for each type of eCNV. (B) RVIS percentile of the eGenes, by type of eCNV. Genes with the lowest percentile are among the most intolerant of human genes. (C) Evolutionarily ancient and young genes accumulate different kinds of structural variants. While young genes are enriched in coding deletions (which alter gene dosage or disrupt the protein, sometimes affecting gene expression), ancient genes have highly conserved coding sequence but an enrichment of deletions within their introns. As we have shown, these changes in introns can be associated with changes in gene expression, showing that although the protein is highly conserved, the expression of it can change from an individual to another due to changes in regulation.

Intronic deletions can alter enhancer sequences or their location

To further study the potential impact of intronic deletions on regulatory regions, we analysed the co-occurrence of these events with enhancers. In eGenes or eTranscripts, 15 intronic DEL-eQTLs overlap with enhancers (an overlap that is higher than expected by chance, P = 0.023, odds ratio = 2.04, Fisher’s test). These 15 deletions represent 24% of the tested intronic deletions overlapping with enhancers in this cell type. We need to consider that many intronic deletions were not investigated because they fall within genes that are expressed in other cell types. Based on our observations in lymphoblastoid cells, we estimate that there might be 105 additional eDels among the 422 that overlap with enhancers. Regarding the deletions not overlapping with enhancers, we found that the distance between a DEL-eQTL and the closest enhancer is shorter than the corresponding distance for deletions not associated with expression changes (P = 9.2e-04, Student’s t-test). These results suggest that intronic DEL-eQTLs could also be affecting interactions between promoters and intronic enhancers without directly disrupting the enhancer sequence. Motivated by these findings, we investigated whether there is a global tendency (independently of gene expression) for intronic deletions to affect or to avoid enhancers. First, we observed that enhancers are enriched in introns (P < 1e-04, global permutation test), in agreement with previous findings in plants [42,43]. Strikingly, we find that intronic deletions and intronic enhancers co-occur in the same intron more often than expected by chance (P < 2.2e-16, Fisher’s test), possibly because most intronic deletions and intronic regulatory features are found in very long introns. However, by randomly relocating each intronic deletion within the same intron, we observed that the direct overlap of the deletions with enhancers is significantly lower than expected (P = 0.0304, global permutation test).
A possible functional interpretation of these results is that there might be some degree of plasticity in the distance between intronic enhancers and promoters, but many intronic enhancers might be essential and cannot be lost. Interestingly, as we saw above, the loss of non-essential intronic enhancers can be associated with changes in gene expression.
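The relocation analysis mentioned above (randomly moving each deletion within its own intron and counting direct overlaps with enhancers) can be sketched in Python as follows. This is a hedged illustration under simple assumptions: uniform placement within the intron, half-open coordinates, and a lower-tail empirical P-value for depletion. The function names and input layout are hypothetical and are not taken from the study's code.

```python
import random
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def overlaps_any(interval: Interval, features: List[Interval]) -> bool:
    """True if the interval shares at least one base with any feature."""
    start, end = interval
    return any(start < f_end and f_start < end for f_start, f_end in features)


def relocation_test(deletions: List[Tuple[int, int, int, int]],
                    enhancers: List[Interval],
                    n_perm: int = 1000, seed: int = 0):
    """Test whether deletions overlap enhancers less often than expected.

    Each deletion is given as (intron_start, intron_end, del_start, del_end)
    and is assumed to lie within its intron. Every permutation places each
    deletion uniformly at random inside the same intron, keeping its length.
    """
    rng = random.Random(seed)
    observed = sum(overlaps_any((ds, de), enhancers) for _, _, ds, de in deletions)
    at_most = 0
    for _ in range(n_perm):
        count = 0
        for i_start, i_end, ds, de in deletions:
            length = de - ds
            new_start = rng.randint(i_start, i_end - length)
            if overlaps_any((new_start, new_start + length), enhancers):
                count += 1
        if count <= observed:
            at_most += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_most + 1) / (n_perm + 1)
```

Keeping each deletion inside its own intron controls for intron length, which is the reason given in the text for the co-occurrence of deletions and enhancers in very long introns.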

Discussion

Intronic CNVs constitute the most abundant form of CNV in protein-coding genes and might have a previously unsuspected role in human evolution and disease. This variation in intronic length in healthy human populations implies that the actual size of many genes differs among individuals and, therefore, might change in populations over time. However, little attention has been given to this variability, even though gene length has been shown to be important in many genes. We have shown that intronic deletions occur more often than expected by chance in three different CNV maps (using two different background models). Other studies have previously reported that CNVs are impoverished [20,44] or neither impoverished nor overrepresented within introns [45]. To explain this apparent controversy, we have to carefully review the different definitions of “intronic CNVs”. Here, we looked at deleted regions located completely within constitutive intronic regions (excluding intronic regions that contain alternative exons). Mu et al. [45] showed that purely intronic CNVs in general are neither enriched nor impoverished in their dataset, but they observed that the subset of events associated with NAHR is found more often than expected by chance using the Pilot Phase dataset of the 1KG Project. We obtained similar results using the CNV map by Abyzov et al. [17] (Phase 1 of the 1KG), where intronic deletions were neither significantly enriched nor impoverished but the subset of NAHR deletions was significantly enriched (FC = 1.17, P = 0.0002). These results illustrate the importance of a clear definition of intronic CNVs and the danger of generalising the results of one particular study. Each study is normally biased towards detecting different mechanisms, different sizes or types of CNVs, or events observed at different frequencies in distinct populations.
Finally, it should be noted that deletions annotated in CNV maps are not based on the ancestral human genome but on the reference genome [46]. In consequence, a fraction of the so-called deletions could in fact be inserted regions that are present in the reference. However, an additional comparative genomic analysis based on recent high-quality primate assemblies [23] shows that most of them correspond to actual deletion events in humans. Our results suggest that copy number variation is shaping gene evolution in different ways depending on the age of genes, duplicating or deleting young genes and contributing to fine-tuning the regulation in both young and old genes. Although we expect stronger functional effects for CNVs affecting the coding sequence, we have shown that intronic sequences are more conserved than intergenic regions of similar characteristics and that some purely intronic CNVs also show signatures of potential positive selection. Interestingly, the proportion of highly segregating intronic CNVs is similar to or higher than that of coding CNVs. Popadin et al. showed that primate-specific genes in humans are enriched in single nucleotide variants correlated with gene expression (cis-eQTLs), with their associated SNPs tending to be closer to the TSS than in older genes [47]. These data highlight the need to dissect the different types of genetic variation in order to understand the complex relationships between SNPs, CNVs, gene expression and gene age. While point mutations near the TSS [47] and coding CNVs seem to have a stronger effect in young genes, intronic CNVs are frequently associated with gene expression variation in genes of any age. Finally, it is important to highlight that an unknown proportion of these strong statistical associations could actually be the result of other unexplored variants linked with certain CNV alleles.
Previously published studies on the effect of genetic variants on gene expression have demonstrated the effect of CNVs on expression variability [48-50]. Chiang and co-workers identified 789 SVs associated with changes in gene expression, most of them (88.3%) not overlapping with exons from the eGene [50]. DeBoever and co-workers observed that a large proportion of common CNVs associated with gene expression levels is located in intergenic regulatory regions [49]. However, research on the subject has been mostly restricted to SVs found within 1 Mb of the gene, and previous works did not analyse intronic regions in detail. In contrast, we relied on Hi-C data to define deletions affecting regions in 3D contact with a gene. In this way, we do not require the CNV to be located within any particular distance of the TSS of a gene. We tested intergenic eCNVs located at distances ranging from 864 bp to 82 Mb from the nearest gene. The differences that we observe in gene expression could be the result of intronic CNVs affecting the rate of transcription, the splicing process, the stability of RNA or a combination of them. For example, intronic deletions interfering with splicing recognition might trigger the nonsense-mediated decay (NMD) pathway, which would degrade the transcript. Recently, it has been shown that the balance of unspliced and spliced mRNA (RNA velocity) is a cell type-specific signature that can be used to predict the future increases or decreases of gene expression in single cells [51]. As the amount of unspliced transcript detected will depend on the length of introns, which can be highly variable in some genes, we would expect the RNA velocity of intron-varying genes to vary in human populations as well. Despite the clear trends shown, our results are likely to underestimate the extent of the impact of intron losses on gene expression. On the one hand, we only investigated the effect on gene expression in lymphoblastoid cell lines.
On the other hand, the regulatory data currently available are also limited. The interaction maps change in different cell types [41,52] and many enhancers are tissue-specific [53]. Therefore, the loss of intronic sequence could affect the expression of genes in other cell types. In addition, the 3D contacts involving frequently deleted regions in the population will be underrepresented in the interaction map used in our study, as they are less likely to be present in the assayed samples. The availability of CNV, personal gene expression and genome interactome data from multiple tissues will make it possible to evaluate more accurately the impact of coding and non-coding deletions in the whole organism.

Methods

Origin and filtering of CNV maps

Whole genome CNV maps were downloaded from 5 different publications [17-21]. For our analysis we selected autosomal, non-private CNVs. Some extra filters were applied to some maps: from the map of Handsaker et al. we removed CNVs marked as low quality and all the variants from two of the individuals (NA07346 and NA11918), because they were not included in the phased map. From Zarrei’s maps we used the stringent map, which considered CNVs that appeared in at least 2 individuals and in 2 studies. The complete list of CNVs analysed is available in

Gene structures

Autosomal gene structures and sequences were retrieved from Ensembl [54] (http://www.ensembl.org; version 75) and principal isoforms were determined according to the APPRIS database [55], Ensembl version 74. In order to avoid duplicate identification of introns, intronic regions were defined as regions within introns that are not coding in any transcript of any gene. When analyzing real introns, in order to avoid duplicate identification of introns, the principal isoform with the highest exonic content was taken. The complete list of genes affected by different types of CNVs is available in .

Essential genes

The list of essential genes was obtained by aggregating lists of genes reported as essential after CRISPR-based genomic targeting [56,57], gene-trap insertional mutagenesis methodology [58], and shRNA [59-61].

Dating gene and intron ages

An age was assigned to all duplicated genes as described before [28]. In the case of singletons, gene ages were assigned from the last common ancestor to all the genes in their family according to the gene trees retrieved from Ensembl. Singleton ages can be noisy for genes that have undergone important alterations such as gene fusion/fission events or divergence shifts. As a consequence, these ages should not be interpreted as the age of the oldest region of the gene, but as a restrictive definition of gene age considering a similar gene structure and gene product. The ages (from ancient to recent) and number of genes per age are as follows: FungiMetazoa: 1119, Bilateria: 2892, Chordata: 1152, Euteleostomi: 8230, Sarcopterygii: 182, Tetrapoda: 154, Amniota: 408, Mammalia: 375, Theria: 515, Eutheria: 848, Simiiformes: 233, Catarrhini: 170, Hominoidea: 106, Hominidae: 64, HomoPanGorilla: 204, HomoSapiens: 500. For some analyses, Primates age groups (Simiiformes to HomoSapiens) were collapsed. For other analyses, we grouped the 16 ages into three groups: “ancient” (collapsing groups from FungiMetazoa to Sarcopterygii), “middle” (from Tetrapoda to Eutheria) and “young” genes (Primates). Intronic regions were assigned the evolutionary age of the gene they belonged to. In the cases when an intron could be assigned to more than one gene, the most recent age was assigned to it.
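The three-group collapse and the "most recent age" rule for shared introns can be illustrated with a minimal Python sketch. The age labels and group boundaries are taken from the text above; the function names `age_group` and `intron_age` are hypothetical and are not part of the published code.

```python
# Age labels in the order given in the text, from most ancient to most recent.
AGES = [
    "FungiMetazoa", "Bilateria", "Chordata", "Euteleostomi", "Sarcopterygii",
    "Tetrapoda", "Amniota", "Mammalia", "Theria", "Eutheria",
    "Simiiformes", "Catarrhini", "Hominoidea", "Hominidae",
    "HomoPanGorilla", "HomoSapiens",
]


def age_group(age):
    """Collapse one of the 16 ages into 'ancient', 'middle' or 'young'."""
    i = AGES.index(age)
    if i <= AGES.index("Sarcopterygii"):
        return "ancient"
    if i <= AGES.index("Eutheria"):
        return "middle"
    return "young"  # Primates: Simiiformes to HomoSapiens


def intron_age(host_gene_ages):
    """Age of an intronic region that could belong to several genes:
    the most recent age among its candidate host genes."""
    return max(host_gene_ages, key=AGES.index)
```

For instance, an intronic region shared by a Bilateria gene and a Theria gene would be labelled Theria, and therefore fall in the "middle" group.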

Statistical assessment of genome-wide distribution of CNVs

To estimate the statistical significance of our results we performed permutation tests. In order to assess the number of overlaps of CNVs with genic functional elements, we compared our observed values to a background model. A global background was obtained by relocating all the CNVs in the whole genome 10,000 times, avoiding low-mappability regions (as masked in the R package “BSgenome.Hsapiens.UCSC.hg19.masked”). Genome coordinates and low-mappability regions were downloaded using the RegioneR package [62]. A local background was obtained by segmenting the genome into 278 windows of at least 10 Mb and randomly shuffling the CNVs within their original window 10,000 times, also avoiding low-mappability regions. P-values were computed using a function derived from the permTest function from package RegioneR version 1.6.2 [62]. Code is available at https://github.com/orgs/IntronicCNVs. We compared the location of the CNVs in our datasets with their distribution in the random models in order to calculate enrichments or depletions depending on intron size, gene age and essentiality.
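The logic of the global background model can be sketched in a few lines of Python. This is a simplified illustration and not the RegioneR-based code used in the study: it assumes a single linear genome with half-open coordinates, and the function names, the rejection-sampling strategy and the (count + 1) / (permutations + 1) form of the empirical P-value are assumptions made for the example.

```python
import random
import statistics


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def count_contained(cnvs, regions):
    """Number of CNVs lying completely inside any of the target regions."""
    return sum(any(r0 <= s and e <= r1 for r0, r1 in regions) for s, e in cnvs)


def relocate(cnvs, genome_len, masked, rng):
    """Place every CNV at a random position, keeping its length and
    rejecting positions that touch a masked (low-mappability) region."""
    relocated = []
    for s, e in cnvs:
        length = e - s
        while True:
            new_s = rng.randrange(0, genome_len - length + 1)
            new_e = new_s + length
            if not any(overlaps(new_s, new_e, m0, m1) for m0, m1 in masked):
                relocated.append((new_s, new_e))
                break
    return relocated


def permutation_test(cnvs, regions, genome_len, masked, n_perm=1000, seed=0):
    """Return observed count, median expected count and empirical P-values
    for enrichment and depletion of CNVs contained in the regions."""
    rng = random.Random(seed)
    observed = count_contained(cnvs, regions)
    null = [
        count_contained(relocate(cnvs, genome_len, masked, rng), regions)
        for _ in range(n_perm)
    ]
    expected = statistics.median(null)
    p_enrichment = (sum(x >= observed for x in null) + 1) / (n_perm + 1)
    p_depletion = (sum(x <= observed for x in null) + 1) / (n_perm + 1)
    return observed, expected, p_enrichment, p_depletion
```

A local background in the spirit of the text would apply the same `relocate` step separately within each genomic window rather than across the whole genome.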

Comparison of intronic and intergenic regions

To compare the content of deletions between intronic and intergenic DNA, we randomly selected a subset of 500 intronic regions and assigned an intergenic region with the most similar size to each of the introns. We then calculated the total number of deletions in the intronic and the intergenic compartments, as well as their sizes and the percentage of region that is lost. We repeated the sampling 10,000 times and compared (with a paired Student’s t-test) the distribution of deletion number and size in intronic versus intergenic regions.
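As an illustration of the pairing and of the paired comparison, the following Python sketch matches each sampled intron to the intergenic region of closest size and computes a paired Student's t statistic. It is a sketch under stated assumptions: the text does not say whether an intergenic region can be reused for several introns (here it can), and the helper names are hypothetical. Converting the statistic into a P-value requires the t distribution with n - 1 degrees of freedom, which is not implemented here.

```python
import bisect
import math
import statistics


def closest_size_match(intron_sizes, intergenic_sizes):
    """For each intron size, return the intergenic size closest to it.
    Assumption: the same intergenic region may be picked more than once."""
    pool = sorted(intergenic_sizes)
    matched = []
    for size in intron_sizes:
        i = bisect.bisect_left(pool, size)
        # Only the neighbours on either side of the insertion point can be closest.
        candidates = pool[max(i - 1, 0): i + 1]
        matched.append(min(candidates, key=lambda c: abs(c - size)))
    return matched


def paired_t(x, y):
    """Paired Student's t statistic and its degrees of freedom."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1
```

In the procedure described above, `x` and `y` would hold the per-sampling summaries (for example, the total number of deletions) of the intronic and the intergenic subsets across the 10,000 samplings.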

Regulatory features

We downloaded a genome-wide set of regions that are likely to be involved in gene regulation from the Ensembl Regulatory Build [63], assembled from IHEC epigenomic data [64]. We checked whether introns are enriched in these regulatory features (promoters, enhancers, promoter flanking regions or insulators) by comparing to a random background model generated by relocating all regulatory features in the genome 10,000 times. P-values are the fraction of random values greater or smaller than the observed value. In order to check the significance of the overlaps between intronic deletions and regulatory features, we relocated each intronic deletion 10,000 times within its host intronic region, avoiding overlaps with exons. Then, we compared the observed and the expected overlap with regulatory features. Introns that overlapped with low-mappability regions were previously removed.

Analysis of differential GC content

Genomic sequences were obtained from the primary GRCh37/hg19 assembly and were used for calculating the GC content of introns and intronic CNVs. Differences in GC content between a CNV and the intron where it is located were assessed with paired Student’s t-tests, taking the CNV as the statistical unit. The same was done for changes in intronic GC content before and after a deletion.
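The quantities being compared can be made concrete with a short Python sketch. The handling of ambiguous bases (ignored in the denominator) and the 0-based half-open deletion coordinates relative to the intron are assumptions made for the example, and the function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    unambiguous = sum(seq.count(base) for base in "ACGT")
    if unambiguous == 0:
        return float("nan")
    return (seq.count("G") + seq.count("C")) / unambiguous


def gc_after_deletion(intron_seq, del_start, del_end):
    """GC content of an intron once the segment [del_start, del_end) is removed."""
    return gc_content(intron_seq[:del_start] + intron_seq[del_end:])
```

For each deletion, `gc_content` of the deleted segment and of the whole intron, or `gc_content` of the intron and `gc_after_deletion`, would form one pair of values in the paired tests described above.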

Alu elements

Alu element genomic coordinates were extracted from the RepeatMasker tracks from UCSC, build GRCh37.

CNV mechanisms

The analysis of intronic deletions generated through different mechanisms was done using the dataset from the study by Abyzov et al. [17].

Gene expression analysis

We used available RNA-seq data from Geuvadis [34] that were derived from lymphoblastoid cell lines for 445 individuals who were sequenced by the 1KG Project and for whom we have the intronic deletions in the largest CNV map [20]. We focused our analyses on the 763 genes that have only one intronic deletion in the population with at least two individuals affected in the Geuvadis dataset. For each of these genes we classified the PEER-normalized gene expression levels [65] into two groups: 1) gene expression of individuals homozygous for the reference genotype and 2) gene expression of individuals with one allele with the deletion and the other with the reference genotype. We then performed Student’s t-tests to compare the expression of the two different genotypes. We corrected for multiple testing with the p.adjust R function (Benjamini-Hochberg method). In addition, in order to see whether the number of significantly differentially expressed genes is higher than expected by chance, for each intronic deletion we shuffled the genotypes of the individuals 10,000 times and performed t-tests with the expression of the random groups of wild-type and heterozygous individuals. For example, if a deletion is found in heterozygosity in 50 individuals and the rest are wild-type, we test whether there is differential expression when comparing the expression of 50 randomly selected individuals versus the rest. By repeating this shuffling 10,000 times for every tested deletion we can calculate the expected percentages of significantly differentially expressed genes.
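The three ingredients of this analysis (a two-sample Student's t statistic, the Benjamini-Hochberg adjustment and the genotype shuffling) can be sketched in plain Python as follows. This is an illustrative re-implementation, not the R code used in the study; the function names are hypothetical, and the conversion of t statistics into P-values (done by the t-test in R) is left out because it requires the t distribution.

```python
import math
import random
import statistics


def student_t(group_a, group_b):
    """Two-sample Student's t statistic with pooled variance (equal variances assumed)."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * statistics.variance(group_a)
              + (nb - 1) * statistics.variance(group_b)) / (na + nb - 2)
    difference = statistics.mean(group_a) - statistics.mean(group_b)
    return difference / math.sqrt(pooled * (1 / na + 1 / nb))


def shuffled_t_statistics(expression, is_carrier, n_shuffles, seed=0):
    """t statistics obtained after shuffling the genotype labels, keeping the
    number of heterozygous carriers fixed (e.g. 50 random individuals vs the rest)."""
    rng = random.Random(seed)
    labels = list(is_carrier)
    results = []
    for _ in range(n_shuffles):
        rng.shuffle(labels)
        carriers = [x for x, c in zip(expression, labels) if c]
        wild_type = [x for x, c in zip(expression, labels) if not c]
        results.append(student_t(wild_type, carriers))
    return results


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

Both groups need at least two individuals for the variances to be defined, which is consistent with the requirement above of at least two affected individuals per deletion.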

Observed vs expected intronic deletion content score

The number and size of expected intronic deletions per gene was calculated in two different ways: 1) relocating all deletions in the whole genome 10,000 times (except for low-mappability regions) and 2) relocating all intronic deletions within the intronic regions 1,000 times. In both cases, a score was generated to determine which genes have more or fewer intronic deletions than expected. This score was calculated taking into account 1) the ranked position of the number of intronic deletions per gene divided by their median expected value, 2) the ranked position of the observed divided by the median expected size of the deletions, 3) the ranked position of the percentage of intronic content that is lost, 4) the ranked inverse of the expected intronic loss and 5) the ranked frequency of the deletion in the 1KG Project populations, if available. Because the frequency of the event depends on the reference genome, we consider that a deletion present in, for example, all except two individuals should probably be regarded as a rare gain, and the deletion should be the reference. For this reason, the values were normalized in such a way that 0.5 would be the maximum frequency and 0.9 and 0.1 would be given the same position in the ranking. Once all rankings were calculated and normalized from 0 to 1, a score was assigned to each gene by averaging its five ranks. The final set of 458 genes with fewer deletions than expected is the intersection of the top 500 genes of the two randomizations, and the set of 484 genes with more deletions than expected is the intersection of the bottom 500 genes.
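The frequency folding and the rank averaging can be illustrated with the following Python sketch. It is only a schematic reading of the description above: the direction in which each component is ranked and the treatment of ties (average rank here) are not specified in the text and are assumptions, as are the function names.

```python
def fold_frequency(freq):
    """Fold an allele frequency so that 0.5 is the maximum and, for example,
    0.9 and 0.1 receive the same value."""
    return min(freq, 1 - freq)


def normalised_ranks(values):
    """Ranks scaled from 0 (smallest value) to 1 (largest value); ties share
    their average rank (an assumption, the text does not discuss ties)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank / (n - 1) if n > 1 else 0.0
        pos = end + 1
    return ranks


def combined_score(components):
    """Average, per gene, of the normalised ranks of each component.
    `components` is a list of equally long lists, one value per gene."""
    ranked = [normalised_ranks(component) for component in components]
    n_genes = len(components[0])
    return [sum(r[g] for r in ranked) / len(ranked) for g in range(n_genes)]
```

In the scheme described above, five such component lists would be supplied per randomization, with the deletion frequencies passed through `fold_frequency` before ranking.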

Functional enrichment analysis

Functional enrichment analysis of the genes with lower and higher scores was performed with DAVID [32] and STRING [33]. Enrichment of essential genes in our datasets was assessed with a Fisher test using our list of essential genes (see the “Essential genes” section in Materials and Methods).
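For the essential-gene enrichment, a one-sided Fisher exact test reduces to a hypergeometric tail probability, which can be computed directly in Python. The sketch below assumes a one-sided (enrichment) alternative, which the text does not specify, and the function and parameter names are hypothetical.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher exact P-value for enrichment.

    N: genes in the background, K: essential genes in the background,
    n: genes in the tested set,  k: essential genes in the tested set.
    Returns P(X >= k) for X following a hypergeometric distribution.
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total
```

As a small worked case, drawing 5 genes from a background of 10 in which 5 are essential and finding all 5 essential has probability 1/252, so `fisher_enrichment_p(5, 5, 5, 10)` is about 0.004.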

Population stratification

For the study of population stratification of deletions, Vst statistics were extracted from Sudmant et al. [20]. As in that study, a cutoff of 0.2 was selected to indicate high population stratification of a locus.

Comparison of datasets.

Only variants in autosomes are considered and private events are excluded. (A) Number and type of CNVs per dataset. (B) Autosomal Mb that are CNV. The gray part of the bars corresponds to the CNV Mb that are shared among maps. Colored parts of the bars are map-specific CNV regions. (C) Width distribution of gains and losses in each map (bean lines and overall line are means). (D) Number of subjects and number of populations of origin used to build each filtered map. DGV: Database of Genomic Variants (http://dgv.tcag.ca). For more information on the 1000 Genomes Project, see http://www.internationalgenome.org. (PDF)

Enrichment analysis of purely intronic, intron-intersecting and coding deletions.

Enrichment or impoverishment of deletions within introns, deletions intersecting introns (purely intronic and intron-exon combined) and exon-overlapping deletions (purely coding and intron-exon combined) in different maps of copy number variation and using global (A) and local (B) background models. Values are given as log2 ratios observed/expected (median expected value from 10,000 randomisations). Error bars show the median absolute deviation and asterisks indicate significance: * for P<0.05, ** for P<0.005 and *** for P<0.0005. (PDF)

Enrichment analysis of deletions in introns of different sizes.

Ratios of observed versus expected number of deletions in each size bin after 10,000 random permutations using global (A) and local (B) background models. All size bins have a similar number of intronic regions (deciles, size intervals indicated between brackets). Asterisks mark the bins significantly enriched with intronic deletions: * for P<0.05, ** for P<0.005 and *** for P<0.0005. (PDF)

Comparison of intronic and intergenic regions.

Groups of 500 introns were randomly sampled 10,000 times and paired each time with 500 intergenic regions of similar size. In every permutation, their deletion content was compared. Each point in a boxplot corresponds to a single value per randomization. A) Total number of deletions in 500 intronic or intergenic regions. B) Percentage of the sequence that is lost in each subset of 500 intronic/intergenic regions. C) Total amount of genome comprised in each subset, to verify that the intronic and intergenic subsets have similar sizes (no significant differences). D) Median and E) mean size of deletions within intronic and intergenic regions. P-values were calculated with paired Student’s t-test comparisons. (PDF)

Enrichment analysis of CNVs (including gains, losses and gain/losses) in genes of different evolutionary ages.

Ratios of observed versus expected number of genes with CNVs (gains, losses and gain and loss CNVs) affecting their coding region in each gene age after 10,000 random permutations using the global background model. Abyzov’s map [17] is excluded because it is the only CNV map that does not contain any gain. Red asterisks show an enrichment when above the box, a depletion when below the box: * for P<0.05, ** for P<0.005 and *** for P<0.0005. (PDF)

Impact of deletions on genes of different evolutionary ages.

Percentage of genes from each gene evolutionary age that contain intronic deletions in (A) or deletions overlapping with exons (including partial and whole gene CNVs) in (B). The gray line represents the expected value, calculated as the median of the genes in the 10,000 random permutations. Significance is marked with asterisks: * for P<0.05, ** for P<0.005 and *** for P<0.0005, and their color represents enrichment (red) or impoverishment (black). (PDF)

Differential effect of intronic deletions in big introns.

Ratios of observed versus expected number of deletions within introns bigger than 1.5kb from different evolutionary ages. Expected values are calculated from 10,000 random permutations using a global background model. Asterisks show an enrichment when above the box, a depletion when below the box: * for P<0.05, ** for P<0.005 and *** for P<0.0005. (PDF)

Effect of the different types of deletions on all evolutionary ages.

Proportion of genes with deletions that have the whole locus deleted, only part of their exons (exonic) affected by deletions or intronic deletions only. This figure is equivalent to Fig 3C (Evolutionary age of affected genes), but here separated by CNV map. (PDF)

Essential genes.

(A) Intron sizes of non-essential and essential genes. (B) Percentage of essential genes per evolutionary age. (PDF)

Highly stratified deletions associated with expression differences.

(A) Characteristics of highly stratified variants (HSVs) that are significant cis-intronic-eDeletions. (B) Gene expression of “wild-type” (CN = 2) individuals and heterozygous carriers (CN = 1) of the eDeletion. (PDF)

GC content in introns and intronic deletions.

(A) Bean-plots showing the different GC distribution between the flanking exons of introns with or without deletions, separated by intron size bins (with equal number of introns per bin). (B) GC content distributions in introns with or without deletions, separated by intron size bins. Significance is considered for p-values < 0.05. Beans show the estimated density of each distribution; horizontal lines show the mean values of each side of the bean and the dashed horizontal line represents the average of all values. (PDF)

Examples of introns with a drop of GC content.

X-axis represents the coordinates of the intron with its flanking exons (black boxes). Y-axis shows the GC content, calculated with sliding 200bp windows. The deleted region is highlighted in grey. (PDF)

Alu element content in regions deleted by different mechanisms.

(A) Proportion of deletions of each mechanism that overlaps with Alu elements. (B) Percentage of the deleted regions covered by Alu elements. Deleted regions and mechanisms from Abyzov’s map [17]. NAHR: Non-allelic homologous recombination. NH: Non-homologous end joining. TEI: Transposable Element Insertion. (PDF)

Number of individuals in each map, project the variants belong to and methods used for CNV detection.

(PDF)

Fold changes and P-values for all maps.

(XLSX)

Top genes with more or less deletions than expected.

(XLSX)

List of differentially expressed genes (eGenes).

(TSV)

Summary of differentially expressed transcripts (eTranscripts).

(PDF)

Overlap of intronic deletions with regulatory features.

(PDF)

List of CNVs and their impact on genes.

(TSV)

Genes affected by deletions in any CNV map.

(TSV)

CNV gains and losses relative to the human reference genome.

(DOCX)

62 in total

1. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources.

Authors: Da Wei Huang; Brad T Sherman; Richard A Lempicki
Journal: Nat Protoc Date: 2009 Impact factor: 13.491

2. Gene length as a biological timer to establish temporal transcriptional regulation.

Authors: Killeen S Kirkconnell; Brian Magnuson; Michelle T Paulsen; Brian Lu; Karan Bedi; Mats Ljungman
Journal: Cell Cycle Date: 2017-01-05 Impact factor: 4.534

3. Evolution of Alu elements toward enhancers.

Authors: Ming Su; Dali Han; Jerome Boyd-Kirkup; Xiaoming Yu; Jing-Dong J Han
Journal: Cell Rep Date: 2014-04-03 Impact factor: 9.423

4. Predicting tissue-specific enhancers in the human genome.

Authors: Len A Pennacchio; Gabriela G Loots; Marcelo A Nobrega; Ivan Ovcharenko
Journal: Genome Res Date: 2007-01-08 Impact factor: 9.043

5. Integrative annotation of variants from 1092 humans: application to cancer genomics.

Authors: Ekta Khurana; Yao Fu; Vincenza Colonna; Xinmeng Jasmine Mu; Hyun Min Kang; Tuuli Lappalainen; Andrea Sboner; Lucas Lochovsky; Jieming Chen; Arif Harmanci; Jishnu Das; Alexej Abyzov; Suganthi Balasubramanian; Kathryn Beal; Dimple Chakravarty; Daniel Challis; Yuan Chen; Declan Clarke; Laura Clarke; Fiona Cunningham; Uday S Evani; Paul Flicek; Robert Fragoza; Erik Garrison; Richard Gibbs; Zeynep H Gümüş; Javier Herrero; Naoki Kitabayashi; Yong Kong; Kasper Lage; Vaja Liluashvili; Steven M Lipkin; Daniel G MacArthur; Gabor Marth; Donna Muzny; Tune H Pers; Graham R S Ritchie; Jeffrey A Rosenfeld; Cristina Sisu; Xiaomu Wei; Michael Wilson; Yali Xue; Fuli Yu; Emmanouil T Dermitzakis; Haiyuan Yu; Mark A Rubin; Chris Tyler-Smith; Mark Gerstein
Journal: Science Date: 2013-10-04 Impact factor: 47.728

6. The effect of intron length on exon creation ratios during the evolution of mammalian genomes.

Authors: Meenakshi Roy; Namshin Kim; Yi Xing; Christopher Lee
Journal: RNA Date: 2008-09-16 Impact factor: 4.942

7. High-resolution comparative analysis of great ape genomes.

Authors: Zev N Kronenberg; Ian T Fiddes; David Gordon; Shwetha Murali; Stuart Cantsilieris; Olivia S Meyerson; Jason G Underwood; Bradley J Nelson; Mark J P Chaisson; Max L Dougherty; Katherine M Munson; Alex R Hastie; Mark Diekhans; Fereydoun Hormozdiari; Nicola Lorusso; Kendra Hoekzema; Ruolan Qiu; Karen Clark; Archana Raja; AnneMarie E Welch; Melanie Sorensen; Carl Baker; Robert S Fulton; Joel Armstrong; Tina A Graves-Lindsay; Ahmet M Denli; Emma R Hoppe; PingHsun Hsieh; Christopher M Hill; Andy Wing Chun Pang; Joyce Lee; Ernest T Lam; Susan K Dutcher; Fred H Gage; Wesley C Warren; Jay Shendure; David Haussler; Valerie A Schneider; Han Cao; Mario Ventura; Richard K Wilson; Benedict Paten; Alex Pollen; Evan E Eichler
Journal: Science Date: 2018-06-08 Impact factor: 47.728

8. An integrated map of structural variation in 2,504 human genomes.

Authors: Peter H Sudmant; Tobias Rausch; Eugene J Gardner; Robert E Handsaker; Alexej Abyzov; John Huddleston; Yan Zhang; Kai Ye; Goo Jun; Markus Hsi-Yang Fritz; Miriam K Konkel; Ankit Malhotra; Adrian M Stütz; Xinghua Shi; Francesco Paolo Casale; Jieming Chen; Fereydoun Hormozdiari; Gargi Dayama; Ken Chen; Maika Malig; Mark J P Chaisson; Klaudia Walter; Sascha Meiers; Seva Kashin; Erik Garrison; Adam Auton; Hugo Y K Lam; Xinmeng Jasmine Mu; Can Alkan; Danny Antaki; Taejeong Bae; Eliza Cerveira; Peter Chines; Zechen Chong; Laura Clarke; Elif Dal; Li Ding; Sarah Emery; Xian Fan; Madhusudan Gujral; Fatma Kahveci; Jeffrey M Kidd; Yu Kong; Eric-Wubbo Lameijer; Shane McCarthy; Paul Flicek; Richard A Gibbs; Gabor Marth; Christopher E Mason; Androniki Menelaou; Donna M Muzny; Bradley J Nelson; Amina Noor; Nicholas F Parrish; Matthew Pendleton; Andrew Quitadamo; Benjamin Raeder; Eric E Schadt; Mallory Romanovitch; Andreas Schlattl; Robert Sebra; Andrey A Shabalin; Andreas Untergasser; Jerilyn A Walker; Min Wang; Fuli Yu; Chengsheng Zhang; Jing Zhang; Xiangqun Zheng-Bradley; Wanding Zhou; Thomas Zichner; Jonathan Sebat; Mark A Batzer; Steven A McCarroll; Ryan E Mills; Mark B Gerstein; Ali Bashir; Oliver Stegle; Scott E Devine; Charles Lee; Evan E Eichler; Jan O Korbel
Journal: Nature Date: 2015-10-01 Impact factor: 49.962

9. Lineage-Specific Genome Architecture Links Enhancers and Non-coding Disease Variants to Target Gene Promoters.

Authors: Biola M Javierre; Oliver S Burren; Steven P Wilder; Roman Kreuzhuber; Steven M Hill; Sven Sewitz; Jonathan Cairns; Steven W Wingett; Csilla Várnai; Michiel J Thiecke; Frances Burden; Samantha Farrow; Antony J Cutler; Karola Rehnström; Kate Downes; Luigi Grassi; Myrto Kostadima; Paula Freire-Pritchett; Fan Wang; Hendrik G Stunnenberg; John A Todd; Daniel R Zerbino; Oliver Stegle; Willem H Ouwehand; Mattia Frontini; Chris Wallace; Mikhail Spivakov; Peter Fraser
Journal: Cell Date: 2016-11-17 Impact factor: 41.582

10. The impact of structural variation on human gene expression.

Authors: Colby Chiang; Alexandra J Scott; Joe R Davis; Emily K Tsang; Xin Li; Yungil Kim; Tarik Hadzic; Farhan N Damani; Liron Ganel; Stephen B Montgomery; Alexis Battle; Donald F Conrad; Ira M Hall
Journal: Nat Genet Date: 2017-04-03 Impact factor: 38.330
