Marie Courbariaux1, Kylliann De Santiago1, Cyril Dalmasso1, Fabrice Danjou2, Samir Bekadar2, Jean-Christophe Corvol2, Maria Martinez3, Marie Szafranski4,1, Christophe Ambroise1.
Abstract
Motivation: Identifying new genetic associations in non-Mendelian complex diseases is an increasingly difficult challenge. These diseases sometimes appear to have a significant component of heritability that remains unexplained, and this missing heritability may be due to the existence of subtypes involving different genetic factors. Taking genetic information into account in clinical trials could help guide the subtyping of a complex disease. Most methods dealing with multiple sources of information rely on data transformation, and in disease subtyping the two main strategies are 1) clustering of the clinical data followed by a posterior genetic analysis and 2) concomitant clustering of clinical and genetic variables. Both strategies have limitations that we propose to address.
Contribution: This work proposes an original method for disease subtyping based on both longitudinal clinical variables and high-dimensional genetic markers, via a sparse mixture-of-regressions model. The added value of our approach lies in its interpretability, in two respects. First, our model links clinical and genetic data in their initial form (i.e., without transformation) and does not require a post-processing step in which the original information is accessed a second time to interpret the subtypes. Second, it can address large-scale problems thanks to a variable selection step that discards genetic variables that may not be relevant for subtyping.
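To make the structure described above more concrete, the following is a generic mixture-of-experts formulation of the kind the abstract and figure captions suggest: straight-line clinical trajectories per cluster, and cluster membership driven by genetic markers through a logistic regression with cluster 1 as reference. It is a sketch under stated assumptions, not necessarily the authors' exact model. All symbols ($y_{ijm}$, $t_{ij}$, $x_i$, $\alpha$, $\beta$, $\omega$, $\lambda$) are notation introduced here for illustration, and the lasso-type penalty is one common way to obtain sparsity, assumed rather than taken from the source.

```latex
% Assumed notation: patient i, visit j at time t_{ij}, clinical score m,
% genetic marker vector x_i, K clusters.
\begin{aligned}
p(y_i \mid x_i, t_i)
  &= \sum_{k=1}^{K} \pi_k(x_i)
     \prod_{j}\prod_{m}
     \mathcal{N}\!\left(y_{ijm} \,\middle|\, \alpha_{km} + \beta_{km}\, t_{ij},\; \sigma_{km}^{2}\right), \\
\pi_k(x_i)
  &= \frac{\exp\!\left(\omega_{k0} + x_i^{\top}\omega_k\right)}
          {\sum_{l=1}^{K} \exp\!\left(\omega_{l0} + x_i^{\top}\omega_l\right)},
  \qquad \omega_{10} = 0,\ \omega_1 = 0, \\
\hat{\theta}
  &= \arg\max_{\theta}\;
     \sum_{i} \log p(y_i \mid x_i, t_i)
     \;-\; \lambda \sum_{k=2}^{K} \lVert \omega_k \rVert_1 .
\end{aligned}
```

In such a formulation, a genetic marker whose coefficients are zero in every $\omega_k$ plays no role in the cluster assignment, which is what makes variable selection directly interpretable.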
Keywords: Parkinson’s disease; clinical data; disease subtyping; genotyping; high dimension; longitudinal data; mixture of experts model; variable selection
Year: 2022 PMID: 35734430 PMCID: PMC9207464 DOI: 10.3389/fgene.2022.859462
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
FIGURE 1. Results on artificial data over 100 simulations. ARI with a K-means algorithm, the 2-step method (no use of genetic information) and the integrative method (use of genetic information) and their corresponding oracles. The y-axis represents the ARI score (the higher the better).
FIGURE 2. Results on artificial data over 100 simulations. Non-negative parameter estimates and their respective true values. The y-axis represents the values of the estimates.
FIGURE 3. Global sensitivity and specificity of the integrative method compared with the 2-step method for the selection of the genetic variables in the artificial data experiment with 100 simulations. Among 2,657 variables, 10 had to be selected. The number of times these variables were selected over the 100 simulations is also specified for both methods.
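As an illustration of how sensitivity and specificity can be computed for a variable selection task of this size (10 relevant variables among 2,657), here is a minimal Python sketch. The function name `selection_metrics` and the example indices are hypothetical and are not taken from the study; the numbers below are made up purely to show the arithmetic.

```python
def selection_metrics(selected, relevant, n_variables):
    """Return (sensitivity, specificity) of a variable selection.

    selected:    indices of the variables kept by the method
    relevant:    indices of the variables that truly matter
    n_variables: total number of candidate variables
    """
    selected, relevant = set(selected), set(relevant)
    tp = len(selected & relevant)        # relevant and selected
    fp = len(selected - relevant)        # selected but irrelevant
    fn = len(relevant - selected)        # relevant but missed
    tn = n_variables - tp - fp - fn      # irrelevant and discarded
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical run: 10 relevant variables among 2,657 candidates.
relevant = range(10)
selected = [0, 1, 2, 3, 4, 5, 6, 7, 20, 21, 22]  # 8 hits, 3 false positives
sens, spec = selection_metrics(selected, relevant, 2657)
```

With so few relevant variables, specificity stays close to 1 even with several false positives, so sensitivity and the per-variable selection counts are the more informative quantities.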
FIGURE 4. Results on artificial data over 100 simulations. Estimated number of clusters according to the BIC. The y-axis represents the number of simulations in which the number of clusters shown on the x-axis was selected.
FIGURE 5. Clustering with regard to the clinical variables. The top and bottom left-hand graphs show the fitted trajectories (straight lines) for each of the four clusters and the corresponding 68% confidence intervals obtained by adding and subtracting the fitted σ parameters. The top y-axis represents the MMSE score (evaluation of cognitive impairment) and the bottom y-axis the UPDRS III score (motor evaluation). The other graphs show in detail, for each cluster and each score, all the trajectories of the patients assigned to the cluster around the standard trajectory.
FIGURE 6. Boxplot of the age at diagnosis for each of the clusters.
FIGURE 7. Genetic association: estimated logistic regression parameters. Cluster 1 is the reference, with its parameters set to 0. The estimates are shown in three panels: top (A), middle (B), and bottom (C). The confidence intervals are computed from the Hessian matrix provided by the R function nnet::nnet (Ripley et al., 2016).