Florent Guinot, Marie Szafranski, Julien Chiquet, Anouk Zancarini, Christine Le Signor, Christophe Mougel, Christophe Ambroise.
Abstract
MOTIVATION: Association studies have been widely used to search for associations between common genetic variants and a given phenotype. However, it is now generally accepted that genes and environment must be examined jointly when estimating phenotypic variance. In this work we consider two types of biological markers: genotypic markers, which characterize an observation in terms of inherited genetic information, and metagenomic markers, which are related to the environment. Both types of markers are available in their millions and can be used to characterize any observation uniquely.
Keywords: Dimensionality reduction; GWAS; Gene-environment interactions; Genetic and metagenomic markers; Statistical machine learning; Variable selection
Year: 2020 PMID: 32625242 PMCID: PMC7329492 DOI: 10.1186/s13015-020-00173-2
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Fig. 1 Dimension reduction strategy. a Original hierarchical tree with an example for 5 variables. b Expanded representation of the tree with all possible weighted groups derived from the original hierarchy. The group in blue gathers the variables contained in the groups in orange and green. c Compressed representation of the tree after construction of the supervariables
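The expansion and compression steps described in the Fig. 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the merge-list encoding of the tree, the function names `expand_groups` and `supervariables`, the toy data, and the choice of averaging the variables of a group to form a supervariable are all assumptions made for illustration.

```python
from statistics import mean

def expand_groups(n_leaves, merges):
    """Return every group (tuple of leaf indices) defined by a binary tree.

    Assumed encoding: merge k joins nodes (a, b) and creates node n_leaves + k.
    """
    members = {i: (i,) for i in range(n_leaves)}
    for k, (a, b) in enumerate(merges):
        members[n_leaves + k] = tuple(sorted(members[a] + members[b]))
    return list(members.values())

def supervariables(X, groups):
    """Compress each row of X to one value per group (assumption: the mean)."""
    return [[mean(row[j] for j in g) for g in groups] for row in X]

# Hypothetical tree on 5 variables: (0,1)->5, (3,4)->6, (5,2)->7, (7,6)->8
merges = [(0, 1), (3, 4), (5, 2), (7, 6)]
all_groups = expand_groups(5, merges)  # 5 leaves + 4 internal nodes

# One possible cut of the tree: groups {0,1,2} and {3,4}
cut = [(0, 1, 2), (3, 4)]
X = [[0, 1, 2, 2, 2],
     [2, 2, 2, 0, 1]]
Z = supervariables(X, cut)  # two supervariables per observation
```

Here `all_groups` plays the role of the expanded representation (panel b) and `Z` that of the compressed one (panel c); how groups are weighted and selected is not covered by this sketch.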
Fig. 2 Examples of group structures: correlations observed on a genomic data and b metagenomic data
Fig. 3 Illustration of the true block interaction matrix with , and . Each non-zero value in this matrix is considered as a true interaction between two variables
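Since every non-zero entry of the true matrix counts as a true interaction, one plausible way to score an estimated interaction matrix is entry-wise precision and recall. The sketch below assumes that convention; the function name `precision_recall` and the toy matrices are hypothetical, and the record does not state how the authors computed these scores.

```python
def precision_recall(true_mat, est_mat):
    """Entry-wise precision and recall, treating non-zero entries as interactions."""
    tp = fp = fn = 0
    for t_row, e_row in zip(true_mat, est_mat):
        for t, e in zip(t_row, e_row):
            if e != 0 and t != 0:
                tp += 1  # detected and true
            elif e != 0:
                fp += 1  # detected but not true
            elif t != 0:
                fn += 1  # true but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: one true 2x2 block, partially recovered with one false positive
true_mat = [[1, 1, 0, 0],
            [1, 1, 0, 0],
            [0, 0, 0, 0]]
est_mat = [[1, 1, 1, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0]]
precision, recall = precision_recall(true_mat, est_mat)
```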
Fig. 4 Confusion matrices of interactions for the different methods, using the following simulation parameters: , , . In this example, MLGL and -SICOMORE behave similarly, identifying very large genomic regions. SICOMORE tends to work with smaller genomic and metagenomic regions
Fig. 5 Boxplots of a precision and b recall obtained on the numerical simulations with a Bonferroni-Holm correction for blocks. The rows correspond to different numbers of observations (top: , middle: and bottom: ), and the columns correspond to levels of difficulty of the problem (left: , middle: and right: ). The boxplots are best viewed in color: from left to right, GLgap is in purple, GLtree is in blue, MLGL is in red, SICOMORE is in green, -SICOMORE is in orange
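The Bonferroni-Holm correction mentioned in the Fig. 5 caption is the standard Holm step-down procedure. A minimal sketch follows; the function name `holm_reject`, the significance level of 0.05 and the example p-values are assumptions for illustration, not values taken from the study.

```python
def holm_reject(pvalues, alpha=0.05):
    """Holm step-down procedure: return one boolean per test (True = rejected)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The (rank + 1)-th smallest p-value is compared with alpha / (m - rank)
        if pvalues[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first non-rejection
    return reject

# Hypothetical p-values, e.g. one per tested block of interactions
pvals = [0.001, 0.04, 0.03, 0.2]
decisions = holm_reject(pvals)
```

With these four p-values only the first test is rejected: 0.001 passes the threshold 0.05/4, while the next smallest, 0.03, exceeds 0.05/3 and stops the procedure.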
Average computation time (in minutes) over 5 replicates for varying dimensions of , with the dimension of being fixed ()
| Method | 50 | 100 | 500 | 1000 | 1500 | 2000 | 3000 | 4000 |
|---|---|---|---|---|---|---|---|---|
|  | 0.01 | 0.01 | 0.02 | 0.03 | 0.03 | 0.04 | 0.05 | 0.06 |
| SICOMORE | 0.21 | 0.34 | 0.82 | 0.76 | 0.75 | 0.96 | 0.93 | 1.09 |
| MLGL | 0.06 | 0.09 | 3.35 | 0.86 | 3.12 | 4.52 | 8.02 | 24.20 |
| GLtree | 0.07 | 0.28 | 0.67 | 3.83 | 11.69 | 26.31 | 88.17 | 210.64 |
Results of the search for interactions using the -SICOMORE method
| PH | #MG | CHR | GP | #SNPs |  |  |
|---|---|---|---|---|---|---|
| RTDBR | 39 genera | 3 | 129:980206 | 6705 | 0.03 | 0.18 |
| RTDBR | 39 genera | 3 | 980235:32366703 | 196705 | 0.04 | 0.18 |
| RTDBR | 39 genera | 7 | 21704918:33495621 | 68658 | 0.03 | 0.23 |
| RTDBR | 39 genera | 8 | 50:18024047 | 93142 | 0.02 | 0.14 |
| SNU | 180 genera | 2 | 38539843:45729381 | 33033 | 0.04 | 0.13 |
| SNU | 180 genera | 6 | 33985403:35275305 | 6174 | 0.04 | 0.13 |
| SNU | 180 genera | 8 | 18024755:45569421 | 156827 | 0.05 | 0.09 |
From left to right, the names of the columns are: PH for the phenotype studied; #MG for the number of genera; CHR for the chromosome; GP for the genomic position (bp) and #SNPs for the number of SNPs in the genomic region