Abstract
This paper summarizes the contributions from the Population-Based Association group at the Genetic Analysis Workshop 19. It provides an overview of the new statistical approaches tried out by group members in order to take best advantage of population-based sequence data. Although contributions were highly heterogeneous regarding the applied quality control criteria and the number of investigated variants, several technical issues were identified, leading to practical recommendations. Preliminary analyses revealed that Hurdle-negative binomial regression is a promising approach to investigate the distribution of allele counts instead of called genotypes from sequence data. Convergence problems, however, limited the use of this approach, creating a technical challenge shared by environment-stratified models used to investigate rare variant-environment interactions, as well as by rare variant haplotype analyses using well-established public software. Estimates of relatedness and population structure strongly depended on the allele frequency of the variants selected for inference. Another practical recommendation was that dissenting probability values from standard and small-sample tests of a particular hypothesis may reflect a lack of validity of large-sample approximations. Novel statistical approaches that integrate evolutionary information showed some advantage in detecting weak genetic signals, and Bayesian adjustment for confounding was able to efficiently estimate causal genetic effects. Haplotype association methods may constitute a valuable complement to collapsing approaches for sequence data. This paper reports on the experience of members of the Population-Based Association group with several novel, promising approaches to preprocessing and analyzing sequence data, and to following up identified association signals.
Year: 2016 PMID: 26866664 PMCID: PMC4895250 DOI: 10.1186/s12863-015-0310-0
Source DB: PubMed Journal: BMC Genet ISSN: 1471-2156 Impact factor: 2.797
Fig. 1 Mind-map with the 9 accepted contributions from the Population-Based Association group
Genotypes, phenotypes, and quality control filters applied by authors of accepted papers in the Population-Based Association group
| Contribution | Genotypes | Phenotypes | Quality control |
|---|---|---|---|
| Blue et al. [ | GWSNPA data for odd-numbered autosomes from 959 subjects in 20 pedigrees | Longitudinal SBP, real and simulated phenotypes | Support vector machine filter, exclusion of variants with more than 10 % missing calls, extracted with VCFtools |
| Datta et al. [ | WES data within | Cases were defined as persons with a SBP >140 mm Hg, DBP >90 mm Hg or taking antihypertension medication. Other persons, including individuals with a missing medication field, were treated as controls | Exclusion of variants with more than 25 % missing calls or a MAF >0.001, leaving 70 |
| Fernández-Rhodes et al. [ | GWSNPA data for odd-numbered autosomes from 959 subjects in 20 pedigrees | Hypertension phenotype PHEN simulated based on 984 variants with main SBP effects, and 3 CYP3A43 variants that interacted with medication but showed no main effect | Excluded 92 individuals with missing phenotype data; monomorphic and singleton variants were filtered out. Only the last SBP measurement was considered |
| González-Silos et al. [ | WES variants in chromosome 3 from 407 samples with information on blood pressure medication out of 1943 unrelated samples | DBP | Reference and alternative allele counts (AD field in the FORMAT tag of the VCF file), genotype (GT field in the FORMAT tag), and average genotype quality (GQ field in the FORMAT tag) were extracted with VCFtools. Nonbiallelic and monomorphic variants, and variants with a MAF <0.003, were excluded, leaving 8957 variants for analysis |
| Oh [ | WES data in | Log-transformed baseline measurements of SBP and DBP | Excluded 92 individuals with missing phenotype data; monomorphic and singleton variants were filtered out |
| Schwantes-An et al. [ | WES data in odd-numbered autosomes from 1943 unrelated subjects | Four traits were simulated by the authors under a null hypothesis of no genetic association. The fifth trait was the provided trait Q1 | Alternative allele counts (NALTT field) were extracted with VCFtools and converted to 2-allele genotype calls. Nonbiallelic and monomorphic variants, and variants with more than 5 % missing calls, were excluded, leaving 313,340 variants for analysis |
| Shin et al. [ | WES data in | Real data: Cases were defined as persons with SBP >140 mm Hg, DBP >90 mm Hg or taking antihypertension medication. Other persons, including individuals with a missing medication field, were treated as controls. Simulated phenotypes: Null trait Q1 (dichotomized) and PHEN, both with disease prevalence of 17.8 % | Excluded 92 individuals with missing phenotype data. Predicted alternative allele counts (DOSAGE field) were extracted with VCFtools; monomorphic variants were filtered out, leaving 90 variants for analysis |
| Thompson and Fardo [ | Variants in | Simulated phenotypes Q1 and PHEN on 1943 unrelated subjects | Data extracted with VCFtools; monomorphic variants were filtered out |
| Wang et al. [ | WES data 5 kb within, up- and downstream of | Simulated data, including a null trait (25 variants have true SBP effects) | Excluded 81 subjects without age information; monomorphic and low-coverage (<20×) variants were filtered out, leaving 94 variants |
DBP diastolic blood pressure, GWSNPA genome-wide single nucleotide polymorphism array, MAF minor allele frequency, NALTT number of nonreference alleles for each individual thresholded, SBP systolic blood pressure, VCF variant call format, WES whole exome sequence
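The quality control filters listed in the table above (missing-call rate, monomorphic and singleton variants, MAF thresholds) can be illustrated with a minimal sketch. It is not the code used by any of the contributing authors, who worked with VCFtools. The genotype encoding, the function names `variant_summary` and `passes_qc`, and the default thresholds are assumptions chosen for illustration only.

```python
from typing import Optional, Sequence

# Hypothetical encoding: alternative-allele count (0, 1 or 2) per subject,
# None for a missing call. Variants are assumed to be biallelic.
Genotype = Optional[int]


def variant_summary(genotypes: Sequence[Genotype]) -> dict:
    """Missing-call rate, minor allele count (MAC) and MAF for one variant.

    Assumes a non-empty list of genotypes.
    """
    called = [g for g in genotypes if g is not None]
    n_alleles = 2 * len(called)
    alt = sum(called)
    mac = min(alt, n_alleles - alt)  # count of the rarer allele
    return {
        "missing_rate": 1 - len(called) / len(genotypes),
        "mac": mac,
        "maf": mac / n_alleles if n_alleles else 0.0,
    }


def passes_qc(genotypes: Sequence[Genotype],
              max_missing: float = 0.05,
              min_mac: int = 2) -> bool:
    """Keep a variant if few calls are missing and it is neither
    monomorphic (MAC = 0) nor a singleton (MAC = 1).

    The thresholds are illustrative defaults, not recommendations.
    """
    s = variant_summary(genotypes)
    return s["missing_rate"] <= max_missing and s["mac"] >= min_mac
```

For example, `passes_qc([0] * 19 + [1])` returns `False` because the variant is a singleton, while `passes_qc([0] * 18 + [1, 1])` returns `True`.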
Key concepts addressed by authors of accepted papers in the Population-Based Association group
| Theme | [Contribution reference] concept |
|---|---|
| New methods for new data types | [1] Alternative allele count: Number of reads that support a given alternative allele based on individual sequence data |
| | [1] Negative binomial regression: Type of regression model used to investigate response variables that are counts. In contrast to Poisson regression, negative binomial regression allows for overdispersion (a variance larger than the mean) |
| | [1] Hurdle and zero-inflated models: Two statistical models used to investigate count response variables with a large proportion of zeros. Hurdle models assume that a Bernoulli process determines whether counts are zero or positive. If the response is positive, its conditional distribution is governed by a truncated-at-zero count data model. Zero-inflated models assume the response variable is a mixture of a Bernoulli and a count distribution, eg, negative binomial |
| | [1] Downsampling: Selecting a subset of the reads in a high-coverage position to improve computational efficiency |
| Handling rare variants | [2] Variant ascertainment bias: Variant selection criteria, such as minor allele frequency, can influence kinship and population structure estimates |
| | [2] Kinship estimation: The estimation of relationships among samples based upon genotypes rather than known pedigrees is sensitive to the selected variants and the applied statistical methods |
| | [2] Population structure: Admixture events leave a signature in the patterns of genetic variation within a population. This can bias genome-wide association studies, and be used as a tool to identify genetic variants influencing a trait |
| | [3] Firth’s penalized likelihood: A logistic regression likelihood penalized by Jeffreys’ invariant prior. A first-order bias term is introduced into the score function to reduce the bias in the log odds ratio estimate that arises as a result of sparse data |
| | [3] Small-sample-adjusted score test: A logistic regression score test in which the null distribution of the test statistic is adjusted using estimates of small sample variance and kurtosis |
| | [3,9] Sequence kernel association test: Variant-collapsing test for a subset of variants constructed by aggregating individual variant score test statistics |
| | [4] Quantitative trait mapping: The search for positions along the genome associated with quantitative traits |
| | [4] Tree-based methods: Methods that account for uneven evolutionary relatedness among genetic variants |
| | [4] Phylogenetic tree: A bifurcating tree used to represent the evolutionary relationships among variants (illustrated in Fig. |
| | [5] Within-chain permutation: Permutation of individual phenotypes is a widely used strategy to investigate the null distribution. Under the frequentist approach, statistics based on actual data are compared with the distribution of statistics from permuted data sets. In Bayesian analyses, computing time can be reduced by permuting phenotypes within the single Markov chains used to infer posterior distributions |
| | [6] Minor allele count (MAC): The total count of minor alleles for all individuals evaluated at a particular position. For rare variants, the MAC reflects data sparsity better than the minor allele frequency |
| Rare variant behavior | [7] Gene–environment interaction term model: Statistical approach that tests for gene–environment interactions by including a gene–environment interaction term to measure the change in the outcome when both the genetic marker and environmental factor are present, as compared to when one or both factors are not present |
| Follow up of association signals | [8] Bayesian adjustment for confounding: A Bayesian approach for estimating the average causal effect of an exposure on an outcome in observational studies while accounting for the uncertainty in confounder selection. It uses Bayesian model averaging to average inference across many models according to posterior weight determined by a joint model of the exposure and the outcome |
| | [9] Logistic Bayesian LASSO (least absolute shrinkage and selection operator): Method based on a retrospective likelihood that models the probability of haplotypes given disease status. The odds of disease are expressed as a logistic regression model, whose coefficients are regularized through Bayesian LASSO |
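The description of Firth's penalized likelihood in the table above can be made concrete with a small sketch for the simplest case: an intercept plus one covariate, such as carrier status for a rare variant. This is an illustrative implementation of the modified score equations, not the pmlr package used by the contributing authors. The function name `firth_logistic`, the step cap, and the toy data are assumptions.

```python
import math


def firth_logistic(x, y, max_iter=100, tol=1e-10, max_step=5.0):
    """Firth-penalized logistic regression of y (0/1) on (1, x).

    Solves the modified score equations
        sum_i (y_i - p_i + h_i * (0.5 - p_i)) * (1, x_i) = 0,
    where h_i is the i-th diagonal element of the hat matrix.
    Returns (intercept, slope).
    """
    b0 = b1 = 0.0
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Fisher information X'WX for the 2-column design (1, x)
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01
        v00, v01, v11 = i11 / det, -i01 / det, i00 / det  # inverse information
        # Hat-matrix diagonals h_i = w_i * (1, x_i) V (1, x_i)'
        h = [wi * (v00 + 2.0 * v01 * xi + v11 * xi * xi)
             for wi, xi in zip(w, x)]
        # Modified (bias-reducing) score
        r = [yi - pi + hi * (0.5 - pi) for yi, pi, hi in zip(y, p, h)]
        u0 = sum(r)
        u1 = sum(ri * xi for ri, xi in zip(r, x))
        d0 = v00 * u0 + v01 * u1
        d1 = v01 * u0 + v11 * u1
        # Cap the step length to guard against overshooting
        scale = min(1.0, max_step / max(abs(d0), abs(d1), 1e-300))
        b0 += scale * d0
        b1 += scale * d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


# Toy data (hypothetical): all 3 carriers are cases, 5 of 20 non-carriers are cases.
x = [1, 1, 1] + [0] * 20
y = [1, 1, 1] + [1] * 5 + [0] * 15
intercept, log_odds_ratio = firth_logistic(x, y)
```

In this toy data set the ordinary maximum likelihood estimate of the log odds ratio is infinite because every carrier is a case, whereas the penalized estimate stays finite. For a single binary covariate it coincides with the estimate obtained after adding 0.5 to each cell of the 2×2 table.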
Relevant bibliography and software used by authors of accepted papers in the Population-Based Association group
| Topic | Bibliography | Software |
|---|---|---|
| New methods for new data types | Satten GA, Johnson HR, Allen AS et al. Testing association without calling genotypes allows for systematic differences in read depth between cases and controls. In: Abstracts from the 22nd Annual Meeting of the International Genetic Epidemiology Society, Chicago IL, USA. ISBN: 978-1-940377-00-1, 2012, 9. Original proposal to use the proportion of calls for the minor allele instead of called genotypes | R-packages stats and pscl to fit negative binomial/linear and zero-inflated/Hurdle-negative binomial regression models, respectively |
| Handling rare variants | Conomos MP, Miller MB, and Thornton TA. Robust inference of population structure for ancestry prediction and correction of stratification in the presence of relatedness. | PC-AiR is implemented in R and is available from |
| | | SNPRelate is an R package, available from |
| | | EMMAX for genome wide association testing is available from |
| | | RFMix for local ancestry mapping is available from |
| | | R-package pmlr to conduct penalized logistic regression likelihood ratio tests ( |
| | | SKAT to perform single-variant score tests, and 3 variant-collapsing tests: burden, nonburden sequence kernel association test, and optimal unified test ( |
| | | Blossoc to estimate phylogenetic trees |
| | Maples BK, Gravel S, Kenny EE, and Bustamante CD. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. | R packages ape and geiger to manipulate phylogenetic trees |
| | Bull SB, Mak C, and Greenwood CMT. A modified score function estimator for multinomial logistic regression in small samples. | |
| | Firth D. Bias reduction of maximum likelihood estimates. | |
| | Lee S, Emond MJ, Bamshad MJ, et al. Optimal unified approach for rare-variant association testing with application to small-sample case–control whole-exome sequencing studies. | |
| | Thompson K, Kubatko L. Using ancestral information to detect and localize quantitative trait loci in genome-wide association studies. | |
| | Mailund T, Besenbacher S, and Schierup MH. Whole genome association mapping by incompatibilities and local phylogenies. | |
| Rare variant behavior | Tabangin ME, Woo JG, and Martin LJ. The effect of minor allele frequency on the likelihood of obtaining false positives. | MMAP to fit a linear mixed model in a family-based sample, estimate either model-based or robust standard errors, and conduct a 1 df test of gene–environment interactions in an “interaction model” using the estimated gene–environment interaction term |
| | | METAL to estimate 1 df and 2 df tests of gene–environment interactions using a model with a gene–environment interaction term (“interaction model”) |
| | Goh L and Yap VB. Effects of normalization on quantitative traits in association test. | R-package EasyStrata to estimate 1 df and 2 df tests of gene–environment interactions by comparing the genetic effects across environmental strata (“med-diff” approach) |
| | Manning AK, LaValley M, Liu CT, et al. Meta-analysis of gene-environment interaction: joint estimation of SNP and SNP × environment regression coefficients. | |
| | Randall JC, Winkler TW, Kutalik Z, et al. Sex-stratified genome-wide association studies including 270,000 individuals show sexual dimorphism in genetic loci for anthropometric traits. | |
| | Aschard H, Hancock DB, London SJ, and Kraft P. Genome-wide meta-analysis of joint tests for genetic and gene-environment interaction effects. | |
| Follow up of association signals | Wang C, Parmigiani G, and Dominici F. Bayesian effect estimation accounting for adjustment uncertainty. | Codes that implement Bayesian adjustment for confounding are available at |
| | Wang C, Dominici F, Parmigiani G, Zigler CM. Accounting for uncertainty in confounder and effect modifier selection when estimating average causal effects in generalized linear models. | R-packages hapassoc, haplo.stats, LBL to implement the haplotype association methods |
| | These two papers proposed the Bayesian adjustment for confounding (BAC) method | |
| | Biswas S and Lin S. Logistic Bayesian LASSO for identifying association with rare haplotypes and application to age-related macular degeneration. | |
| | Biswas S, Xia S, and Lin S. Detecting rare haplotype-environment interaction with logistic Bayesian LASSO. | |
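The 1 df and 2 df tests that compare genetic effects across environmental strata, mentioned in the software column above, can be sketched from per-stratum effect estimates and standard errors. The following is one common formulation under the assumption of two independent strata (for example, medicated and unmedicated subjects). It is not a reproduction of EasyStrata or METAL, and the function names are hypothetical.

```python
import math


def normal_two_sided_p(z: float) -> float:
    """Two-sided P value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def stratum_difference_test(beta_e0, se_e0, beta_e1, se_e1):
    """1 df interaction test: difference between the genetic effects
    estimated in the unexposed (e0) and exposed (e1) strata.

    Assumes the two strata are independent samples.
    """
    z = (beta_e1 - beta_e0) / math.sqrt(se_e0 ** 2 + se_e1 ** 2)
    return z, normal_two_sided_p(z)


def stratum_joint_test(beta_e0, se_e0, beta_e1, se_e1):
    """2 df joint test of genetic main and interaction effects:
    sum of the squared stratum-specific z statistics.

    For 2 df the chi-square survival function is exp(-x / 2).
    """
    chi2 = (beta_e0 / se_e0) ** 2 + (beta_e1 / se_e1) ** 2
    return chi2, math.exp(-chi2 / 2.0)
```

With very rare variants, a stratum may contain no carriers at all, in which case the stratum-specific estimate and its standard error are unavailable. This is one way the convergence problems noted in the abstract can arise in stratified models.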
Fig. 2 Illustration of the evolutionary history of a particular genetic variant represented by a phylogenetic tree. In the tree, time moves from past (left) to present (right). Suppose some of the variants represented in this tree are associated with a trait. Then, a large covariance is expected among trait values from 2 variants (eg, the blue diamonds) sharing a large portion of their evolutionary history (shown by the branches in blue). In contrast, the 2 variants denoted by black circles share a smaller portion of evolutionary history, so that little covariance in the corresponding trait values is expected
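The idea in Fig. 2, that covariance grows with the portion of evolutionary history two variants share, can be sketched by summing the branch lengths common to the root-to-tip paths of two tips. The tree representation (a dictionary mapping each node to its parent and the length of the branch above it), the function names, and the toy tree are assumptions for illustration. The tree-based methods of the contributing authors are not reproduced here.

```python
def path_to_root(tree, node):
    """Return [(node, length of the branch above node), ...] from node up to the root.

    `tree` maps each non-root node to (parent, branch_length).
    """
    path = []
    while node in tree:
        parent, length = tree[node]
        path.append((node, length))
        node = parent
    return path


def shared_history(tree, a, b):
    """Total length of the branches lying on both root-to-tip paths.

    Under a Brownian-motion-type model (an assumption here), the trait
    covariance between tips a and b is proportional to this quantity.
    """
    on_path_b = {node for node, _ in path_to_root(tree, b)}
    return sum(length for node, length in path_to_root(tree, a)
               if node in on_path_b)


# Hypothetical toy tree: tips A and B split recently, tip C split early.
toy_tree = {
    "n0": ("root", 1.0),
    "n1": ("n0", 2.0),
    "A": ("n1", 1.0),
    "B": ("n1", 1.0),
    "C": ("n0", 3.0),
}
```

In the toy tree, `shared_history(toy_tree, "A", "B")` is larger than `shared_history(toy_tree, "A", "C")`, mirroring the contrast between the blue diamonds and the black circles in the figure.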
Fig. 3 Distribution of alternative allele counts. Mean alternative allele counts (AACs) per variant (a), median AACs per variant (b), exemplary AAC distribution grouped by genotype for the variant in position Chr3:16249998 with minor allele frequency equal to 0.17 (c), and exemplary comparison of the ratios (AAC/total read depth) and called genotypes for the variant in position Chr3:16249998 (d)
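Count responses with many zeros, such as the alternative allele counts summarized in Fig. 3, are the setting for the hurdle models described among the key concepts. A minimal sketch of the hurdle negative binomial probability mass function follows. It uses the mean/size parameterization of the negative binomial distribution and contains no regression component. The function names and parameter values are assumptions, and the sketch is not the pscl implementation.

```python
import math


def nb_log_pmf(k: int, mu: float, size: float) -> float:
    """Log probability of count k under a negative binomial distribution
    with mean mu and dispersion parameter size (variance mu + mu^2 / size)."""
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(size / (size + mu))
            + k * math.log(mu / (size + mu)))


def hurdle_nb_pmf(k: int, p_zero: float, mu: float, size: float) -> float:
    """Hurdle negative binomial probability of count k.

    A Bernoulli process decides whether the count is zero (probability
    p_zero) or positive. Positive counts follow a negative binomial
    distribution truncated at zero.
    """
    if k == 0:
        return p_zero
    nb_zero = math.exp(nb_log_pmf(0, mu, size))
    return (1.0 - p_zero) * math.exp(nb_log_pmf(k, mu, size)) / (1.0 - nb_zero)


# Hypothetical parameter values: 60 % zero counts, mean 8 and size 2
# for the untruncated count component.
probabilities = [hurdle_nb_pmf(k, 0.6, 8.0, 2.0) for k in range(501)]
```

Because the zero and positive parts are modelled separately, the share of zeros is governed by `p_zero` alone, independently of the count distribution. A zero-inflated model instead mixes extra zeros with the zeros that the count distribution itself produces.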