
Finding the Sources of Missing Heritability within Rare Variants Through Simulation.

Baishali Bandyopadhyay, Veda Chanda, Yupeng Wang.

Abstract

Thousands of genome-wide association studies (GWAS) have been conducted to identify the genetic variants associated with complex disorders. However, only a small proportion of phenotypic variance can be explained by the reported variants. Moreover, many GWAS have failed to identify genetic variants associated with disorders that display hereditary features. This "missing heritability" problem can be partly explained by rare variants. We simulated a causal scenario in which gestational age, a quantitative trait that distinguishes preterm (<37 weeks) from term births, was significantly correlated with the rare variant aggregation at 1000 single-nucleotide polymorphism loci. These 1000 simulated causal rare variants were embedded into randomly selected subsets of 9642 promoter regions from the 1000 Genomes Project genotypic data, with the proportion of causal rare variants within the embedded promoters varied across data sets. By analyzing the correlations between rare variant aggregations and gestational ages, we found that the embedded promoters as a whole showed weaker genetic association as the proportion of causal rare variants decreased, and that no individual embedded promoter showed genetic association when the proportion of causal rare variants was smaller than 0.4. Our analyses indicate that association signals can be greatly diluted when causal rare variants are dispersed and sparsely distributed in the genome, which accounts for an important source of missing heritability.


Keywords:  Missing heritability; causal variant; preterm birth; rare variant; simulation

Year:  2017        PMID: 29051702      PMCID: PMC5638154          DOI: 10.1177/1177932217735096

Source DB:  PubMed          Journal:  Bioinform Biol Insights        ISSN: 1177-9322


Introduction

Genome-wide association studies (GWAS) are a common approach for pinpointing the genetic variants associated with complex disorders.[1] According to the GWAS Catalog, thousands of GWAS have been conducted.[2] Each GWAS may report from a few to a few dozen genetic variants associated with the disorder under investigation. However, the identified genetic variants frequently show only modest effects on disease risk or quantitative trait variation, which is referred to as the “missing heritability” problem.[3] Moreover, GWAS for spontaneous preterm birth, a complex disorder displaying hereditary features,[4] have not reported any convincing associated variants.[5,6] Many theories have been proposed to explain the missing heritability problem in GWAS. Conventionally, GWAS limit analyses to common variants, defined by a minor allele frequency (MAF) ≥5%. It is possible that low-frequency (0.5% ≤ MAF < 5%) and/or rare (MAF <0.5%) variants account for part of the missing heritability.[3,7] In rare Mendelian disorders, causal rare variants tend to show high penetrance, whereas in complex disorders, the penetrance of rare variants is now believed to be mostly moderate to small.[8] Recent studies have reported potentially pathogenic roles of rare variants in schizophrenia.[9,10] Because each rare variant is observed in very few individuals, analysis of individual rare variants is difficult. Thus, association testing for rare variants often relies on collapsing methods, ie, examining the combined effects of rare variants in a gene or a functional unit so as to amplify association signals.[11] Specific forms of rare variant collapsing methods include the burden test[12] and the sequence kernel association test (SKAT).[13] The effectiveness of most rare variant collapsing methods relies on a large proportion of variants in some scanned genomic regions being causal.[11] However, it is not reasonable to simply assume that causal rare variants tend to be clustered within several long chromosomal regions.
Short functional elements such as transcription factor binding sites, promoters, enhancers, open chromatin, nucleosome positioning, and histone modifications are dispersed throughout the genome, and rare variants across a large number of (say >100) such functional elements may collectively modulate phenotypes. Recent studies have demonstrated that disease risk–associated variants may be enriched in particular epigenetic marks across the entire genome.[14-16] Effective rare variant analysis approaches must properly model how rare variants are associated with complex disorders. From a network view, the normal functionality of a living system is contingent on the spatiotemporal harmony of its entire gene networks; conversely, multiple small genetic disturbances can collectively rewire gene networks and thereby lead to genetic disorders.[17-20] In this sense, hundreds to thousands of rare variants that modulate disease-related pathways can be the causes of some complex disorders. Of note, a large number of causal variants does not mean that any affected individual carries most of the causal variants. The genetics of complex disorders are often heterogeneous,[21] indicating that the combinations of causal variants carried by specific affected individuals could be distinct. Effective rare variant analysis approaches should be capable of capturing large numbers of small additive effects and accommodating genetic heterogeneity. Spontaneous preterm birth (gestational age <37 weeks) appears to be a complex genetic disorder, as a woman’s preterm birth risk is higher if she was born preterm or has a history of preterm birth.[4] In this study, we designed a scenario in which preterm birth was caused by the additive effects of 1000 rare variants. One advantage of selecting preterm birth as the disease model is that preterm birth can be approximated by gestational age, a quantitative trait that provides greater statistical power.
Through simulations, we demonstrated that strong genetic associations can become undetectable simply because rare variant collapsing methods are ineffective, shedding light on an important source of the missing heritability in GWAS. Our study may help to explain why genetic associations have not been detected for preterm birth.[5,6] Moreover, our simulation procedure can serve as a framework for examining the effectiveness of rare variant association testing approaches.

Methods

Statistics

The correlation coefficient (r) in this study was always the Pearson correlation coefficient. Both the statistic and its P value were computed in R. P values were corrected for multiple testing with the Benjamini-Hochberg method,[22] which is also available in R.
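As an illustration of the multiple-testing correction named above, the following is a minimal Python sketch of the Benjamini-Hochberg step-up adjustment. It is not the authors' code (the study used R); the function name `benjamini_hochberg` is hypothetical and chosen here for illustration only.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A promoter would then be called significant when its adjusted P value falls below the chosen cutoff.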

Promoter regions

Gene positions (GRCh37) were downloaded from the Ensembl Biomart (http://grch37.ensembl.org/biomart). Promoter regions were defined as positions −800 to +199 bp (base pairs) relative to the transcription start site. Where promoters overlapped, only the leftmost promoter was retained. Then, 10 000 promoters were randomly selected for subsequent analyses.
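The promoter definition and overlap rule can be sketched as follows. This is one possible reading rather than the authors' procedure: the strand handling (upstream lies at higher coordinates on the minus strand), the single-chromosome input, and the greedy left-to-right overlap rule are all assumptions, and the function names are hypothetical.

```python
def promoter_interval(tss, strand, upstream=800, downstream=199):
    """Return the inclusive (start, end) interval from -800 to +199 bp of a TSS."""
    if strand == "+":
        return tss - upstream, tss + downstream
    # Assumption: on the minus strand, upstream lies at higher coordinates.
    return tss - downstream, tss + upstream


def drop_overlaps(intervals):
    """Keep the leftmost promoter of each overlapping group (one chromosome)."""
    kept = []
    for start, end in sorted(intervals):
        # Retain an interval only if it starts after the last retained one ends.
        if not kept or start > kept[-1][1]:
            kept.append((start, end))
    return kept


if __name__ == "__main__":
    promoters = [promoter_interval(1000, "+"), promoter_interval(1800, "+"),
                 promoter_interval(5000, "-")]
    print(drop_overlaps(promoters))
```

The random selection of 10 000 promoters could then be done with `random.sample` on the retained list.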

Whole-genome genotypic data

The whole-genome genotypic data of 2504 samples were downloaded from the 1000 Genomes Project website (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/).[23] The rare variants (biallelic single-nucleotide polymorphism loci with MAF <0.5%) within the 10 000 selected promoters were retrieved using VCFtools.[24] Promoter regions containing fewer than 10 rare variants were then dropped. The total number of promoters included in the analysis was 9642, comprising 235 842 rare variants.
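The filtering criteria can be expressed in a few lines of Python. This sketch does not reproduce the VCFtools commands used in the study; it only illustrates the MAF <0.5% and "at least 10 rare variants" rules on genotypes coded 0/1/2. Excluding monomorphic loci (MAF = 0) is an assumption, and all function names are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """MAF of one biallelic locus from per-individual allele counts (0, 1, or 2)."""
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1.0 - alt_freq)


def rare_loci(locus_genotypes, maf_cutoff=0.005):
    """Indices of loci with 0 < MAF < cutoff (monomorphic loci are skipped)."""
    return [j for j, g in enumerate(locus_genotypes)
            if 0.0 < minor_allele_frequency(g) < maf_cutoff]


def keep_promoter(locus_genotypes, min_rare=10):
    """A promoter is kept only if it holds at least min_rare rare variants."""
    return len(rare_loci(locus_genotypes)) >= min_rare


if __name__ == "__main__":
    n = 2504
    singleton = [1] + [0] * (n - 1)      # one heterozygote among 2504 samples
    common = [1] * 500 + [0] * (n - 500)
    print(rare_loci([singleton, common]))
```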

Aggregation of rare variants

Aggregation of rare variants is not the count of rare variant loci. For any region or set of regions, the rare variant at position j in individual i is coded by the number of copies of the minor allele:

$x_{ij} \in \{0, 1, 2\}$

The rare variant aggregation of the analyzed region(s) in individual i is the sum of all rare variant variables over the positions j in the region(s):

$A_i = \sum_{j} x_{ij}$
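A small Python sketch may help to show how aggregation differs from counting variant loci. The function names and the toy genotypes are illustrative assumptions, not data from the study.

```python
def aggregate(genotype_rows):
    """Sum minor-allele counts (0, 1, or 2) over loci for each individual."""
    return [sum(row) for row in genotype_rows]


def count_variant_loci(genotype_rows):
    """Number of loci at which each individual carries at least one minor allele."""
    return [sum(1 for g in row if g > 0) for row in genotype_rows]


if __name__ == "__main__":
    # Rows are individuals, columns are rare variant loci in the analyzed region(s).
    rows = [
        [2, 0, 1],   # homozygous at locus 0, heterozygous at locus 2
        [0, 0, 0],
        [1, 1, 0],
    ]
    print(aggregate(rows))           # minor-allele totals per individual
    print(count_variant_loci(rows))  # loci carrying a minor allele per individual
```

The first individual carries minor alleles at two loci but has an aggregation of 3, because the homozygous locus contributes two alleles.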

Simulation of causal rare variants

We designed a simulation study in which 1000 rare variants were strongly associated with preterm birth. We simulated 2504 samples, of which half were preterm births and the other half were term births. The phenotype was gestational age, ranging from 21 to 41 weeks. Gestational ages of both preterm (21-36 weeks) and term (37-41 weeks) births were generated according to uniform distributions. Then, a total of 1000 causal rare variants were simulated. For each rare variant locus, we generated a guiding MAF ranging from 0.02% to 0.25%, which was obtained according to an exponential distribution with rate = 2. Then, the genotype of this locus in each individual was generated by the following procedure. The genotype was coded as 0 (homozygous for the major allele), 1 (heterozygous), or 2 (homozygous for the minor allele). The probability of generating the minor allele was determined by guiding MAF × risk factor, where the risk factor ranged from 0.75 to 2 depending on the gestational age. The genotype always had 2 alleles. For each allele, a random number between 0 and 1 was generated using a uniform distribution. If the random number was smaller than the probability of generating the minor allele as described above, the minor allele (+1) was generated.
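The procedure can be sketched in Python as below. Several details are not specified in the text and are filled with clearly hypothetical choices: the risk factor is assumed to fall linearly from 2 at 21 weeks to 0.75 at 41 weeks, the guiding MAF is assumed to be an Exp(rate = 2) draw capped at 1 and rescaled to the 0.02%-0.25% range, and gestational ages are assumed to be whole weeks. The actual functions used by the authors may differ.

```python
import random


def risk_factor(age, lo_age=21, hi_age=41, max_rf=2.0, min_rf=0.75):
    """HYPOTHETICAL mapping: linear decline from 2.0 (21 weeks) to 0.75 (41 weeks)."""
    frac = (age - lo_age) / (hi_age - lo_age)
    return max_rf - frac * (max_rf - min_rf)


def guiding_maf(rng, lo=0.0002, hi=0.0025, rate=2.0):
    """HYPOTHETICAL mapping: Exp(rate) draw, capped at 1, rescaled to [lo, hi]."""
    return lo + (hi - lo) * min(rng.expovariate(rate), 1.0)


def simulate(n_samples=2504, n_loci=1000, seed=0):
    """Return (gestational ages, guiding MAFs, genotype rows coded 0/1/2)."""
    rng = random.Random(seed)
    half = n_samples // 2
    # Half preterm (21-36 weeks) and half term (37-41 weeks), uniform whole weeks.
    ages = ([rng.randint(21, 36) for _ in range(half)]
            + [rng.randint(37, 41) for _ in range(n_samples - half)])
    mafs = [guiding_maf(rng) for _ in range(n_loci)]
    genotypes = []
    for age in ages:
        rf = risk_factor(age)
        # Two independent allele draws per locus; each success adds one minor allele.
        row = [sum(rng.random() < maf * rf for _ in range(2)) for maf in mafs]
        genotypes.append(row)
    return ages, mafs, genotypes


if __name__ == "__main__":
    ages, mafs, genotypes = simulate(n_samples=200, n_loci=100, seed=1)
    print(max(sum(row) for row in genotypes))
```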

Results

Quality of the simulated causal rare variants

The actual MAFs of the 1000 simulated causal rare variants ranged from 0.02% to 0.48%, falling within the typical MAF range of rare variants. The actual MAFs were also highly correlated with the guiding MAFs (R² = 0.666, P < 2.2 × 10^−16). For each sample, the rare variant aggregation was computed. Across all samples, rare variant aggregations were significantly associated with gestational ages (R² = 0.246, P = 2.671 × 10^−155). A plot of rare variant aggregations against gestational ages (Figure 1) confirms this association. Moreover, the plot shows that no individual carries more than 12 causal rare variants, indicating that genetic heterogeneity was also achieved. Thus, the simulation of 1000 causal rare variants for preterm birth was successful.
Figure 1. Rare variant aggregations versus gestational ages for the 1000 simulated causal rare variants.

Identifying the simulated causal rare variants from the genome reveals an important source of missing heritability

Identifying causal variants from the entire genome is a core task for association studies. Whole-genome sequencing technologies may generate millions of variants for a study cohort, which makes this task very challenging. We further assumed that preterm birth was caused by the 1000 simulated causal rare variants located in the promoter regions of mothers’ whole-blood transcriptome at delivery, which consisted of a total of 9642 transcripts. We retrieved the genotypes of 2504 whole-genome sequencing samples from the 1000 Genomes Project and embedded the 1000 simulated causal rare variants into subsets of the 9642 promoter regions from the whole-genome genotypic data. Note that at the embedded locations, the original rare variants were replaced by the simulated causal rare variants. To ameliorate the effect of population stratification, the simulated samples were randomly assigned to 1000 Genomes Project samples. We then assessed the association signals of the 1000 simulated causal rare variants within the whole-genome genotypic data. We generated a series of data sets by varying the proportion of causal rare variants within the embedded promoters. Analysis of individual rare variants is considered impractical because each variant is so rare. Indeed, we tested individual rare variants with the quantitative trait association test available in the PLINK package[25] but did not find any significant rare variant after adjusting for multiple testing. Thus, we assessed genetic associations by correlating promoters’ rare variant aggregations with gestational ages. Under each proportion of causal rare variants, we first computed the number of embedded (affected) promoters and the association signal of all affected promoters as a whole. Note that the real association signal (when only the 1000 simulated causal rare variants were aggregated) is R² = 0.246, P = 2.671 × 10^−155.
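The association signal used here, the squared Pearson correlation between an aggregation vector and gestational age, can be computed as in the following Python sketch. The function names and toy data are assumptions for illustration. The P value itself is not computed: in R's `cor.test` it comes from a t distribution with n − 2 degrees of freedom, so only the t statistic is returned here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r = 0; refer to a t distribution with n - 2 df."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


if __name__ == "__main__":
    aggregation = [3, 2, 2, 1, 0, 1, 0, 0]           # toy aggregations
    gestational_age = [24, 28, 30, 35, 38, 39, 40, 41]
    r = pearson_r(aggregation, gestational_age)
    print(r * r, t_statistic(r, len(aggregation)))
```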
As shown in Table 1, when the proportion of causal rare variants decreases, the association signal of all affected promoters becomes weaker. This analysis suggests that the missing heritability is connected to inclusion of noncausal rare variants into the aggregation procedure. However, even with a proportion of 0.1, the association signal is still significant, suggesting that pinpointing the functional elements containing causal rare variants is critical for rare variant collapsing methods.
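The dilution effect can be made explicit with a short derivation under simplifying assumptions that are not stated in the paper. Write the aggregation of a set of promoters as A = C + N, where C is the aggregation of the causal rare variants and N is the aggregation of the noncausal rare variants, and assume N is uncorrelated with both the phenotype y and C. The symbols A, C, N, and y are notation introduced here for illustration.

```latex
\begin{aligned}
\operatorname{Cov}(A, y) &= \operatorname{Cov}(C, y) + \operatorname{Cov}(N, y) = \operatorname{Cov}(C, y), \\
\operatorname{Var}(A) &= \operatorname{Var}(C) + \operatorname{Var}(N), \\
R^2_{A} &= \frac{\operatorname{Cov}(C, y)^2}{\operatorname{Var}(A)\,\operatorname{Var}(y)}
         = R^2_{C} \cdot \frac{\operatorname{Var}(C)}{\operatorname{Var}(C) + \operatorname{Var}(N)}.
\end{aligned}
```

Under these assumptions, every noncausal variant added to the aggregation increases Var(N) and shrinks the observed R² relative to the causal-only signal, which is consistent with the downward trend seen as the proportion of causal rare variants decreases.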
Table 1. Association signals under different proportions of causal rare variants in embedded (affected) promoters.

Proportion of causal rare variants | All affected promoters: No. of affected promoters | All affected promoters: association signal (R², P value) | Individual promoters identified for association: No. of positives | Individual promoters identified for association: No. (%) of true positives
1 | 44 | 0.240, 7.52 × 10^−152 | 31 | 28 (63.6)
0.9 | 45 | 0.234, 7.37 × 10^−148 | 24 | 23 (55.1)
0.8 | 53 | 0.225, 6.57 × 10^−141 | 21 | 21 (39.6)
0.7 | 61 | 0.209, 1.35 × 10^−129 | 25 | 22 (36.1)
0.6 | 69 | 0.183, 2.46 × 10^−112 | 10 | 9 (13.0)
0.5 | 84 | 0.168, 4.38 × 10^−102 | 9 | 9 (10.7)
0.4 | 111 | 0.140, 3.55 × 10^−84 | 2 | 2 (1.8)
0.3 | 143 | 0.098, 6.04 × 10^−58 | 0 | 0 (0)
0.2 | 225 | 0.054, 3.69 × 10^−32 | 0 | 0 (0)
0.1 | 497 | 0.015, 6.23 × 10^−10 | 1 | 0 (0)
We then scanned individual promoters to examine whether they could be identified for genetic association, using an adjusted (for all scanned promoters) P value of 0.05 as the cutoff. The number of true positives (ie, the number of affected promoters identified) was highly dependent on the proportion of causal rare variants (Table 1). When the proportion was very high (≥0.9), more than half of the affected promoters were identified. When the proportion was smaller than 0.8, most of the affected promoters could not be identified, and the approach became totally ineffective when the proportion was 0.3 or smaller. This analysis suggests that rare variant collapsing methods are ineffective when causal rare variants are dispersed and sparsely distributed across the genome. In summary, our simulation and analyses demonstrate that rare variants could be causes of complex disorders, and that the missing heritability problem may result from the ineffectiveness of rare variant collapsing methods.

Discussion

Many theories have been proposed to explain the missing heritability problem in association studies, among which rare variants play an important role.[3,7] Causal rare variants were previously suggested to have strong effects.[26] However, from the view of gene networks, it is possible that a large number of rare variants with moderate effects collectively rewire gene networks. Thus, it is reasonable to analyze the additive effects of a large number of rare variants. In this study, using a simulation approach, we demonstrated that the missing heritability problem can result from the ineffectiveness of rare variant collapsing methods when very few chromosomal regions contain a large proportion of causal rare variants. We used actual promoters instead of simulated promoters to accommodate the simulated causal rare variants, making this a real data-based simulation procedure. Real data-based simulations incorporate genomic and population genetic contexts and thus are more realistic than purely simulated data. This strategy was also adopted in one of our previous studies.[27] Ideally, every combination of rare variants would be examined for genetic association so that the real association could eventually be identified. However, an exhaustive search is computationally intractable, as a study cohort can have millions of genetic variants. Thus, novel big data and artificial intelligence approaches are needed to search the space of rare variant combinations more efficiently. For rare variant association testing, a big challenge is deciding which variants to aggregate.[7] Functional annotation of variants, such as nonsynonymous, stop-gain/loss, and frameshift, may help to select rare variants for aggregation,[7] but this approach may exclude causal variants within noncoding regions.
We suggest an optimization procedure that uses a set of suspected rare variants as the starting point and iteratively adds the variants that maximize the association signal of the rare variant aggregation until convergence is reached. The simulation framework of this study also has implications for the genetic mechanisms of preterm birth. Childbirth is a complicated biological process involving multiple pathways such as increased uterine contractility, cervical ripening, and decidua and fetal membrane activation.[28] The occurrence of multiple deleterious regulatory rare variants increases the chance of network rewiring in these pathways, which may in turn increase the risk of preterm birth.
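The suggested optimization procedure could be sketched as a greedy forward selection, shown below in Python. This is only one possible reading of the suggestion, not an implementation from the study: the stopping rule (stop when no candidate raises R² by more than a small tolerance), the unweighted aggregation, and the names `r_squared` and `greedy_select` are all assumptions.

```python
def r_squared(x, y):
    """Squared Pearson correlation; 0.0 if either sequence is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def greedy_select(genotypes, phenotype, seed_loci, tol=1e-9):
    """genotypes[j] holds minor-allele counts at locus j for every individual."""
    selected = list(seed_loci)
    n = len(phenotype)
    agg = [sum(genotypes[j][i] for j in selected) for i in range(n)]
    best = r_squared(agg, phenotype)
    remaining = set(range(len(genotypes))) - set(selected)
    while remaining:
        # Score every candidate by the R^2 obtained after adding it.
        scores = {j: r_squared([a + g for a, g in zip(agg, genotypes[j])], phenotype)
                  for j in sorted(remaining)}
        j_best = max(sorted(remaining), key=scores.get)
        if scores[j_best] <= best + tol:
            break  # converged: no candidate improves the association signal
        best = scores[j_best]
        agg = [a + g for a, g in zip(agg, genotypes[j_best])]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best


if __name__ == "__main__":
    loci = [[0, 0, 1, 1], [0, 1, 0, 1], [1, 0, 0, 1]]
    print(greedy_select(loci, [0, 1, 2, 3], seed_loci=[0]))
```

Because the selection maximizes the in-sample association, the resulting R² is optimistically biased; its significance would need to be judged by permutation or on independent samples.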
References (10 of 26 shown)

1.  Genetic heterogeneity in human disease.

Authors:  Jon McClellan; Mary-Claire King
Journal:  Cell       Date:  2010-04-16       Impact factor: 41.582

2.  The human disease network.

Authors:  Kwang-Il Goh; Michael E Cusick; David Valle; Barton Childs; Marc Vidal; Albert-László Barabási
Journal:  Proc Natl Acad Sci U S A       Date:  2007-05-14       Impact factor: 11.205

3.  Investigation of genetic risk factors for chronic adult diseases for association with preterm birth.

Authors:  Nadia Falah; Jude McElroy; Victoria Snegovskikh; Charles J Lockwood; Errol Norwitz; Jeffey C Murray; Edward Kuczynski; Ramkumar Menon; Kari Teramo; Louis J Muglia; Thomas Morgan
Journal:  Hum Genet       Date:  2012-09-13       Impact factor: 4.132

Review 4.  Exploring the human diseasome: the human disease network.

Authors:  Kwang-Il Goh; In-Geol Choi
Journal:  Brief Funct Genomics       Date:  2012-10-12       Impact factor: 4.241

5.  Searching for missing heritability: designing rare variant association studies.

Authors:  Or Zuk; Stephen F Schaffner; Kaitlin Samocha; Ron Do; Eliana Hechter; Sekar Kathiresan; Mark J Daly; Benjamin M Neale; Shamil R Sunyaev; Eric S Lander
Journal:  Proc Natl Acad Sci U S A       Date:  2014-01-17       Impact factor: 11.205

Review 6.  Network medicine: a network-based approach to human disease.

Authors:  Albert-László Barabási; Natali Gulbahce; Joseph Loscalzo
Journal:  Nat Rev Genet       Date:  2011-01       Impact factor: 53.242

7.  A genome-wide association study of early spontaneous preterm delivery.

Authors:  Heping Zhang; Don A Baldwin; Radek K Bukowski; Samuel Parry; Yaji Xu; Chi Song; William W Andrews; George R Saade; M Sean Esplin; Yoel Sadovsky; Uma M Reddy; John Ilekis; Michael Varner; Joseph R Biggio
Journal:  Genet Epidemiol       Date:  2015-01-19       Impact factor: 2.135

8.  AntEpiSeeker: detecting epistatic interactions for case-control studies using a two-stage ant colony optimization algorithm.

Authors:  Yupeng Wang; Xinyu Liu; Kelly Robbins; Romdhane Rekaya
Journal:  BMC Res Notes       Date:  2010-04-28

9.  A global reference for human genetic variation.

Authors:  Adam Auton; Lisa D Brooks; Richard M Durbin; Erik P Garrison; Hyun Min Kang; Jan O Korbel; Jonathan L Marchini; Shane McCarthy; Gil A McVean; Gonçalo R Abecasis
Journal:  Nature       Date:  2015-10-01       Impact factor: 49.962

10.  Breast cancer risk-associated SNPs modulate the affinity of chromatin for FOXA1 and alter gene expression.

Authors:  Richard Cowper-Sal lari; Xiaoyang Zhang; Jason B Wright; Swneke D Bailey; Michael D Cole; Jerome Eeckhoute; Jason H Moore; Mathieu Lupien
Journal:  Nat Genet       Date:  2012-09-23       Impact factor: 38.330
