Chayanika Goswami1,2, Amrita Chattopadhyay2, Eric Y Chuang2,3,4. 1. Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, Taipei. 2. Centre of Genomic and Precision Medicine, National Taiwan University, Taipei. 3. Department of Electrical Engineering, National Taiwan University, Taipei. 4. China Medical University, Taichung.
Rare variants are defined as single nucleotide polymorphisms (SNPs) with a minor allele frequency (MAF) of less than 0.01. They often have larger phenotypic effects than low-frequency (0.01 < MAF < 0.05) or common (MAF > 0.05) disease-associated SNPs (1). Genome-wide association studies (GWASs) have been extensively used to investigate the underlying genetic etiology of complex diseases and quantitative traits. The GWAS catalog (https://www.ebi.ac.uk/gwas/) is a repository of more than 70,000 identified disease-associated variants. It provides numerous novel clues to disease biology, thus improving our knowledge from a few positionally cloned loci to several thousands of replicated associations.

Despite such findings, the genetic predisposition toward many complex disease traits is left unexplained, even for diseases where large GWAS meta-analyses have been performed. Overall, <24% of the heritability of complex diseases has been accounted for (2). Furthermore, translating such information into functional understanding or therapeutic treatment is still a decades-long journey. Among several hypotheses regarding why there is still missing heritability, one is that GWASs primarily identify common variants (3). In contrast, studies of low-frequency or rare variants can provide additional insights into disease risk and trait variability.

The success of rare variant studies depends heavily on the design scheme of the experiment. The choice of data and its preprocessing determine a viable start for the discovery of crucial rare variants. Low-depth whole genome sequencing (WGS) is a preferred option in rare variant studies, as deep WGS is often expensive (4). Moreover, variant detection and disease detection are quite achievable with low-depth sequencing, if the sample is large.
Exome sequencing is another option; although it targets only the coding region of the genome, studies show that many Mendelian disorders have been identified through it (3). Despite the fact that many GWAS loci lie in the non-coding region, concentrating a study on a high-value region of the genome still proves to be worthwhile for rare variant studies, keeping in mind the cost of WGS. Targeted-region sequencing has also proven to be effective (5). The discovery of common variants associated with known complex diseases in GWASs has paved the way for targeted-region sequencing and discovery of rare variants (5). It is also a cost-effective approach and is promising for the identification of rare variants.

Customized genotyping arrays are a cost-effective alternative to sequencing. Such chips include both common and rare variants, with common variants replicating the original GWAS signals, thereby enabling fine mapping of rare variants. Extreme-phenotype sampling is a strategy of smart sampling for rare variant studies. Preferential selection of the most likely informative samples while designing arrays can greatly reduce the sequencing budget and increase the association power. Sampling those individuals with known disease phenotypes and risk factors enriches the arrays with rare variants (2).

As rare variants are numerous and are less closely correlated with each other than common variants, they suffer from a multiple testing burden. Rare variant association tests further suffer from a decrease in statistical power due to the rarity of individuals carrying these variant alleles. Hence, rare variant association is confirmed by combining multiple variants within genetic units of association, which are defined by gene annotations, genomic coordinates, or functional characterization.
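The aggregation idea can be illustrated with a minimal sketch. It assumes a toy genotype matrix (rows are individuals, columns are variants within one gene, entries are minor allele counts) and made-up MAF values; the function names `collapse_carriers` and `fisher_exact_two_sided` are hypothetical and do not come from any published package. Each individual is collapsed to a single carrier indicator, in the spirit of collapsing-based tests, and carrier counts in cases and controls are then compared with a two-sided Fisher exact test.

```python
from math import comb

def collapse_carriers(genotypes, mafs, maf_cutoff=0.01):
    """Return 1 for each individual carrying at least one minor allele
    at any variant with MAF below the cutoff, else 0."""
    rare = [j for j, maf in enumerate(mafs) if maf < maf_cutoff]
    return [int(any(row[j] > 0 for j in rare)) for row in genotypes]

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]].
    Sums the hypergeometric probabilities of all tables that are no more
    likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Toy data (assumed for illustration only): three variants in one gene;
# the third is common (MAF 0.20) and is therefore ignored when collapsing.
mafs = [0.004, 0.008, 0.20]
case_genotypes = [[0, 1, 0], [0, 0, 0], [1, 0, 0], [0, 0, 2]]
control_genotypes = [[0, 0, 0], [0, 0, 1], [0, 0, 0], [0, 0, 0]]

case_carriers = collapse_carriers(case_genotypes, mafs)
control_carriers = collapse_carriers(control_genotypes, mafs)

a = sum(case_carriers)              # cases carrying a rare allele
b = len(case_carriers) - a          # cases without a rare allele
c = sum(control_carriers)           # controls carrying a rare allele
d = len(control_carriers) - c       # controls without a rare allele

p_value = fisher_exact_two_sided(a, b, c, d)
print(f"carrier table: [[{a}, {b}], [{c}, {d}]], P = {p_value:.4f}")
```

With only eight individuals the test has essentially no power; the sketch is meant to show the mechanics of collapsing many rare genotypes into one unit-level score, not a realistic analysis.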
Burden tests, adaptive burden tests, variance component tests, and combined tests are some of the gene-based tests used frequently for rare variant association studies (2).

Burden tests create genetic scores by collapsing the rare variant count. The key principle behind them is to aggregate the information contained in multiple genotypes of one sample into a burden score that can be easily used for association, and they assume that all rare variants that are causal and trait-associated have the same intensity of effect. The cohort allelic sums test (CAST) is one such burden test, and is available as an R package (6). As in single-SNP analysis, χ² and Fisher exact tests can also be used for burden testing (7), depending on the dataset tested. However, a limitation of burden tests is that they assume that all variants influence the phenotype in the same direction.

Adaptive burden tests are more robust than burden tests that are applied with fixed thresholds. They remove the limitations of a burden test and allow the presence of null, trait-decreasing, and trait-increasing variants. Adaptive burden tests are computationally intensive, as they require permutation for the computation of P values. They also make use of a regression coefficient for each variant as its weight.

Variance component tests allow a mixture of effects across rare variant sets, including both protective and risk variants, with varying magnitudes of effect sizes. The sequence kernel association test (SKAT) employs this method (8). It can be applied to both quantitative and binary traits. SKAT-O is based on a blend of the burden test and the variance component test, commonly known as a combination test (9). It allows for a more flexible framework in terms of score statistics, optimally combining the burden and variance component statistics while remaining computationally efficient. cSKAT (10) can be used to optimize the SKAT statistic over multiple potentially relevant SNV annotations.
cSKAT is powerful for larger sample sizes (N ≥ 5,000) and correctly specified SNV weights.

A recent study proposes the Bayes factor method for rare variant association testing in sequencing data. It uses informative priors that show sensitivity to rare variant count differences in binary studies, to allelic distribution differences, or to both. Although it can be applied to unbalanced case-control study designs, it has not been tested for different population structures (11). Adaptive hierarchically structured variable selection (HSVS-A) (12) is another newly proposed method, which is more powerful than both burden tests and variance component tests for continuous phenotypes. It can be applied to both set-based and region-based analysis. It automatically controls the type I error rates and can produce individual effect estimates for rare variants. The association test for rare variants based on algebraic statistics (ASRV) is a novel method to test association when the causal variants have effects in different directions (13). Single variant association tests such as transmission disequilibrium tests (TDTs) (14) or family-based association tests (FBATs) (15), which are robust against genetic confounding, can be applied in family-based association studies. The aggregated Cauchy association test (ACAT) (16) is a set-based association test for sequencing studies. It is computationally efficient and requires only P values for the association test between a trait and a variant group. RVFam is an R package providing tools for testing association between rare variants and continuous traits, binary traits, or survival measured in family samples (17), but it is outperformed by the generalized linear model (GLM) (18) and the Firth test (19), which do not account for sample relatedness. For rare variant association analyses of binary outcomes in family data, a hybrid strategy that uses the GLM for gene-based tests and the Firth test otherwise proves to be computationally efficient without variant filtration.
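The reason ACAT needs only P values can be seen from the Cauchy combination itself: each P value is transformed to a standard Cauchy variable, the transformed values are averaged with weights, and the average is mapped back to a P value. The following is a minimal sketch of that combination rule, not the ACAT software; the function name `acat_pvalue` and the example P values are assumptions made for illustration, and the numerical safeguards needed for extremely small P values are omitted.

```python
import math

def acat_pvalue(pvalues, weights=None):
    """Combine P values with the Cauchy combination rule.

    T = sum_i w_i * tan((0.5 - p_i) * pi), with weights summing to 1,
    and the combined P value is 0.5 - atan(T) / pi.
    """
    if not pvalues:
        raise ValueError("at least one P value is required")
    if any(not (0.0 < p < 1.0) for p in pvalues):
        raise ValueError("P values must lie strictly between 0 and 1")
    if weights is None:
        weights = [1.0] * len(pvalues)   # equal weights by default
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize to sum to 1

    statistic = sum(w * math.tan((0.5 - p) * math.pi)
                    for w, p in zip(weights, pvalues))
    return 0.5 - math.atan(statistic) / math.pi

# Hypothetical per-variant P values for one variant group.
variant_pvalues = [0.01, 0.5, 0.9]
combined = acat_pvalue(variant_pvalues)
print(f"combined P value: {combined:.4f}")
```

Because a single small P value produces a very large Cauchy term, the combined result is driven mainly by the strongest signals in the group, which suits variant sets in which only a few variants are causal.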
The Bayesian rare variant Association Test using Integrated Nested Laplace Approximation (BATI), integrated in the rare-variant Genome Wide Association Study (rvGWAS) framework (20), includes both categorical and numerical variant characteristics as covariates in the rare variant association test and is powerful for loss-of-function variants.

A pathway-based approach, or multi-set testing, for rare variant association shows an increase in statistical power when the subsets of genes, such as exons, introns, or gene windows, contain fewer variants overall, and it may also improve the elucidation of disease etiology (21). Copula-based Joint Analysis of Multiple Phenotypes (C-JAMP) is a single-marker association test, implemented as an R package, which uses a joint model of various phenotypes and variants or other covariates and is powerful for causal variants with large effect sizes (22). The Quantitative Phenotype Scan Statistic (QPSS) has the advantage of localizing genomic regions containing a group of rare variants associated with a quantitative phenotype by refining a known region of interest using variant annotation (23).

For rare variant association tests in non-coding regions such as introns, promoters, enhancers, or silencers, a sliding window approach can be used to scan the genome, especially in WGS studies, as there have been very few studies in such regions. SAIGE-GENE is a scalable generalized mixed model region-based association test, used in exome-wide and genome-wide region-based analysis for large sample data (N > 40,000). It can also work with unbalanced case-control ratios for binary traits and controls the type I error rates well (24). For region-based rare variant association studies in WGS data, GECS helps in estimating the significance thresholds, with the family-wise error rate (FWER) controlled at 5%. For single-marker analysis studies, a significance threshold of α = 5.0×10⁻⁸ had been set based on previous studies.
However, GECS shows that the threshold should be more stringent, at α = 2.95×10⁻⁸ (25).

Other issues for rare variant studies include population stratification and genotype imputation. The former often acts as a confounder in such studies and should be adjusted for before proceeding. Principal component analysis and linear mixed models are commonly used for this purpose; however, it is not clear how effective they are for rare variants. Genotype imputation, on the other hand, negatively affects rare variant studies, as the imputation accuracy decreases with lower MAFs, leading to removal of rare variants in the quality-check step (2). The introduction of a hybrid reference panel may help resolve this issue, leading to rare variants being imputed with higher accuracy (1).

For a true positive association of rare variants with a disease, it is important to replicate the association in a large number of samples, often relying on sequencing or genotyping. Follow-up studies targeting high-priority variants in multiple samples can be beneficial once the initial study has proven informative (2). Further experiments linking the rare variants to molecular and cellular functions can be carried out once clear evidence of an association with disease is established.