Ruby Fore, Jaden Boehme, Kevin Li, Jason Westra, Nathan Tintle.
Abstract
Gene-based tests of association (e.g., variance components and burden tests) are now common practice for analyses attempting to elucidate the contribution of rare genetic variants to common disease. As sequencing datasets continue to grow in size, the number of variants within each set (e.g., gene) being tested is also continuing to grow. Pathway-based methods have been used to allow for the initial aggregation of gene-based statistical evidence and then the subsequent aggregation of evidence across the pathway. This "multi-set" approach (first a gene-based test, followed by a pathway-based test) has not been thoroughly explored as a way of evaluating genotype-phenotype associations in the age of large, sequenced datasets. In particular, we ask whether there are statistical and biological characteristics that make the multi-set approach optimal compared with simply performing all gene-based tests. In this paper, we provide an intuitive framework for evaluating these questions and use simulated data to affirm this intuition. A real data application is provided demonstrating how our insights manifest themselves in practice. Ultimately, we find that when initial subsets are biologically informative (e.g., tending to aggregate causal genetic variants within one or more subsets, often genes), multi-set strategies can improve statistical power, with particular gains in cases where causal variants are aggregated in subsets with fewer variants overall (a high proportion of causal variants in the subset). However, we find that there is little advantage when the sets are non-informative (a similar proportion of causal variants in each subset). Our application to real data further demonstrates this intuition. In practice, we recommend wider use of pathway-based methods and further exploration of optimal ways of aggregating variants into subsets based on emerging biological evidence of the genetic architecture of complex disease.
Keywords: missing heritability; pathway testing; power investigation; rare variant analysis; statistical genetics
Year: 2020 PMID: 33240333 PMCID: PMC7680887 DOI: 10.3389/fgene.2020.591606
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
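The multi-set idea described in the abstract (test each subset first, then aggregate the subset-level evidence) can be illustrated with a minimal Python sketch. It assumes the per-subset p-values have already been computed by some gene-based test and are independent under the null. It shows two standard ways of combining them: Fisher's method and a Bonferroni-corrected minimum p-value. The function names and the example p-values are hypothetical and are not taken from the paper, and this is not the authors' implementation.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    Under the null, X = -2 * sum(ln p_i) follows a chi-square
    distribution with 2k degrees of freedom (k = number of subsets).
    Each p-value must lie in (0, 1].
    """
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # equals X / 2
    # Closed-form chi-square survival function for even df = 2k:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


def bonferroni_min_p(p_values):
    """Bonferroni-corrected minimum p-value across the subsets."""
    return min(1.0, len(p_values) * min(p_values))


# Hypothetical subset-level (e.g., gene-level) p-values for one pathway.
subset_p = [0.05, 0.20, 0.60, 0.03]
print("Fisher combined p:", fisher_combined_p(subset_p))
print("Bonferroni min p: ", bonferroni_min_p(subset_p))
```

Fisher's method pools evidence across all subsets, whereas the Bonferroni approach is driven by the single strongest subset, which is one way to see why the two can behave differently as the number of subsets grows.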
Summary of simulation scenarios investigated.
| Equal numbers of variants per set | ||
| Unequal numbers of variants per set |
P-values produced by multi-set testing as compared to single-set testing.
| FFAR gene family (4 genes) | 0.051 | 0.019 | 0.064 | |
| Fatty acid gene family (8 genes) | 0.013 | 0.05 |
FIGURE 1. Multi-set approaches using aggregation statistics have improved power over multi-set testing with a Bonferroni correction as the number of subsets increases.
FIGURE 2. Relative subset size has little impact on the power of multi-set approaches when the proportion of causal variants is equally distributed between two sets.
FIGURE 3. The multi-set aggregation statistics Fisher's and SUMSTAT yielded power similar to a single-set test, whereas Bonferroni showed generally worse power for large numbers of subsets, with some specific cases where Bonferroni improved power. (A) Power curves when subsets are equally sized and causal variants are distributed among only 1/8 of the total subsets. (B) When causal variants are distributed among 1/4 of the subsets. (C) When causal variants are distributed among 1/2 of the subsets. (D) When causal variants are distributed among 3/4 of the subsets.
FIGURE 4. Varied distribution of causal variants into unequal subsets demonstrates increased power for multi-set methods as the proportion of causal variants increases in the smaller set. (A) Power curves when the SNVs are divided into two equal subsets, and the proportion causal in one subset is varied from 0 to 1/4. (B) Power curves when SNVs are divided into two unequal subsets, and the proportion causal in the smaller set is varied from 0 to 1/3. (C) Power curves when SNVs are divided into two subsets, one three times the size of the other, and the proportion causal in the smaller set is varied from 0 to 1/2. (D) Power curves when SNVs are divided into two subsets, one seven times the size of the other, and the proportion causal in the smaller set is varied from 0 to 1.
Rare variant set sizes in application to real data.
| Number of rare variants | 15 | 20 | 15 | 248 | 108 | 247 | 94 | 322 |
| Single-set | 0.659 | 0.529 | 0.081 | 0.388 | 0.255 | 0.403 |