| Literature DB >> 30657943 |
Kevin Caye1, Basile Jumentier1, Johanna Lepeule2, Olivier François1.
Abstract
Gene-environment association (GEA) studies are essential to understand the past and ongoing adaptations of organisms to their environment, but those studies are complicated by confounding due to unobserved demographic factors. Although the confounding problem has recently received considerable attention, the proposed approaches do not scale with the high-dimensionality of genomic data. Here, we present a new estimation method for latent factor mixed models (LFMMs) implemented in an upgraded version of the corresponding computer program. We developed a least-squares estimation approach for confounder estimation that provides a unique framework for several categories of genomic data, not restricted to genotypes. The speed of the new algorithm is several order faster than existing GEA approaches and then our previous version of the LFMM program. In addition, the new method outperforms other fast approaches based on principal component or surrogate variable analysis. We illustrate the program use with analyses of the 1000 Genomes Project data set, leading to new findings on adaptation of humans to their environment, and with analyses of DNA methylation profiles providing insights on how tobacco consumption could affect DNA methylation in patients with rheumatoid arthritis. Software availability: Software is available in the R package lfmm at https://bcm-uga.github.io/lfmm/.Entities:
Keywords: confounding factors; ecological genomics; gene-environment association; local adaptation; statistical methods
Mesh:
Year: 2019 PMID: 30657943 PMCID: PMC6659841 DOI: 10.1093/molbev/msz008
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
. 1.Base 10 logarithm of the ratio of runtimes for LFMM 1.5 and LFMM 2.0. A value of 5 means that LFMM 2.0 runs 105 times faster than LFMM 1.5. (A) n = 100 individuals, (B) n = 400, (C) n = 1,000, p is the total number of markers in the simulation.
. 2.True discovery rate and F-score as a function of confounding intensity. Three fast methods are considered: CATE, LFMM 2.0 and PCA. All methods were applied with K = 8 factors as determined by a PCA screeplot. The F-score is the harmonic mean of the true discovery rate (precision) and power.
. 3.Human GEA study. Association study based on genomic data from the 1000 Genomes Project database and climatic data from the Worldclim database. (A) Latent factors estimated by LFMM 2.0. (B) Target genes corresponding to top hits of the GEA analysis (expected FDR level of 5%). The highlighted genes correspond to functional variants. (C) Predictions obtained from the VEP program.
. 4.EWAS of RA and smoking. Fisher’s scores for CpG sites showing significant association with RA and smoking in at least two of three approaches (PCA, CATE, LFMM 2.0).