Zhaozhou Lin, Qiao Zhang, Shengyun Dai, Xiaoyan Gao.
Abstract
Temporal associations in longitudinal nontargeted metabolomics data are generally ignored by common pattern recognition methods such as partial least squares discriminant analysis (PLS-DA) and orthogonal partial least squares discriminant analysis (OPLS-DA). To discover temporal patterns in longitudinal metabolomics, a multitask learning (MTL) method employing structural regularization was proposed. The group regularization term of the proposed MTL method enables the selection of a small number of tentative biomarkers while maintaining high prediction accuracy. Meanwhile, the nuclear norm imposed on the regression coefficients accounts for the interrelationship of the metabolomics data obtained at consecutive time points. The effectiveness of the proposed method was demonstrated by a comparison study performed on a metabolomics dataset and a simulated dataset. The results showed that a compact set of tentative biomarkers characterizing the whole antipyretic process of Qingkailing injection was selected with the proposed method. In addition, the nuclear norm introduced in the new method could help the group norm to improve the method's recovery ability.
Keywords: antipyretic effects; longitudinal study; multitask learning; nontargeted metabolomics; structural regularization
Year: 2020 PMID: 31941030 PMCID: PMC7022931 DOI: 10.3390/metabo10010033
Source DB: PubMed Journal: Metabolites ISSN: 2218-1989
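The structural regularization described in the abstract can be written schematically as below. This is a sketch under assumptions: the abstract names only the two penalties, so the squared-error loss, the symbols $X_t$, $y_t$, $W$, and the weights $\lambda_1$, $\lambda_2$ are illustrative choices, not the authors' stated formulation.

$$
\min_{W \in \mathbb{R}^{p \times T}} \; \sum_{t=1}^{T} \lVert y_t - X_t w_t \rVert_2^2 \;+\; \lambda_1 \lVert W \rVert_{2,1} \;+\; \lambda_2 \lVert W \rVert_*,
\qquad
\lVert W \rVert_{2,1} = \sum_{j=1}^{p} \lVert W_{j,:} \rVert_2,
\qquad
\lVert W \rVert_* = \sum_{i} \sigma_i(W)
$$

Here $w_t$ is the $t$-th column of $W$ (the coefficients for time point $t$), $W_{j,:}$ is the row of coefficients for variable $j$ across all $T$ time points, and $\sigma_i(W)$ are the singular values of $W$. The group norm shrinks whole rows to zero, so a variable is kept or dropped for all time points together. The nuclear norm favors a low-rank $W$, which ties the coefficient vectors of the different time points to one another.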
Figures of merit of PLS-DA (Partial Least Squares Discriminant Analysis) models established by LPOCV (leave-one-pair-out cross validation) before and after variable selection.
| Time Points | Full PLS-DA CE_AUC | Full PLS-DA ACC | Full PLS-DA DQ2 | N of Vars. | Reduced PLS CE_AUC | Reduced PLS ACC | Reduced PLS DQ2 |
|---|---|---|---|---|---|---|---|
| 4 h | 0.7814 | 0.70 | 0.3769 | 49 | 1 | 1 | 0.9857 |
| 8 h | 0.7501 | 0.72 | 0.1990 | 97 | 1 | 1 | 0.9150 |
| 12 h | 0.7968 | 0.87 | 0.0553 | 54 | 1 | 1 | 0.7981 |
| 24 h | 0.8749 | 0.88 | 0.6542 | 34 | 1 | 1 | 0.9876 |
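The leave-one-pair-out cross validation (LPOCV) named in the caption can be sketched as follows. This is a minimal illustration that assumes each held-out pair consists of one sample from each class and that the score of interest is the fraction of pairs ranked correctly. The nearest-centroid scorer is a hypothetical stand-in for PLS-DA, and the sketch is not claimed to reproduce how CE_AUC was computed in the study.

```python
from itertools import product

def centroid_score(train_X, train_y, x):
    """Stand-in scorer (NOT PLS-DA): higher means closer to the class-1 centroid."""
    def centroid(label):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return sqdist(x, centroid(0)) - sqdist(x, centroid(1))

def lpocv_auc(X, y, score_fn=centroid_score):
    """Hold out every (class-1, class-0) pair, train on the rest, and
    return the fraction of pairs in which the class-1 sample scores higher.
    Ties count as half. Each class needs at least two samples."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    wins = 0.0
    for i, j in product(pos, neg):
        keep = [k for k in range(len(y)) if k not in (i, j)]
        train_X = [X[k] for k in keep]
        train_y = [y[k] for k in keep]
        s_pos = score_fn(train_X, train_y, X[i])
        s_neg = score_fn(train_X, train_y, X[j])
        if s_pos > s_neg:
            wins += 1.0
        elif s_pos == s_neg:
            wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data: one feature, two well-separated classes.
X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y = [0, 0, 0, 1, 1, 1]
auc = lpocv_auc(X, y)
```

Because both members of a pair are excluded from training, neither held-out sample influences the model that ranks it.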
A summary of the prediction performance of PLS-DA (Partial Least Squares Discriminant Analysis) models retained on adjacent time points after variable selection.
| Time Points | CE_AUC 4 h | CE_AUC 8 h | CE_AUC 12 h | CE_AUC 24 h | DQ2 4 h | DQ2 8 h | DQ2 12 h | DQ2 24 h |
|---|---|---|---|---|---|---|---|---|
| 4 h * | 1 | 0.9344 | 0.7808 | 0.8576 | 0.9857 | 0.6122 | −0.3148 | 0.4112 |
| 8 h * | 0.9344 | 1 | 0.9088 | 0.8768 | 0.6969 | 0.9150 | 0.3671 | 0.3085 |
| 12 h * | 0.7360 | 0.8 | 1 | 0.9536 | −0.1339 | −0.3079 | 0.7981 | 0.7672 |
| 24 h * | 0.6720 | 0.7967 | 0.9216 | 1 | −1.0532 | 0.1240 | 0.5987 | 0.9876 |
The * signifies that the variables were selected based on the model trained on data of these time points.
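The DQ2 columns report a discriminant Q2. A minimal sketch of how such a value can be computed is given below. It assumes the commonly used definition in which a prediction that overshoots its class label is not penalized, with classes coded 0 and 1. The coding and the function name `dq2` are assumptions for illustration, since the exact formulation used in the study is not given here.

```python
def dq2(y_true, y_pred):
    """Discriminant Q2 for 0/1-coded class labels (assumed coding).

    Residuals of predictions that lie beyond the class label
    (above 1 for class 1, below 0 for class 0) are ignored.
    """
    mean_y = sum(y_true) / len(y_true)
    tss = sum((y - mean_y) ** 2 for y in y_true)  # total sum of squares
    pressd = 0.0  # discriminant prediction error sum of squares
    for y, y_hat in zip(y_true, y_pred):
        beyond_label = (y == 1 and y_hat > 1) or (y == 0 and y_hat < 0)
        if not beyond_label:
            pressd += (y - y_hat) ** 2
    return 1.0 - pressd / tss

labels = [0, 0, 1, 1]
perfect = dq2(labels, [-0.2, 0.0, 1.0, 1.5])   # overshoots are not penalized
uninformative = dq2(labels, [0.5, 0.5, 0.5, 0.5])
inverted = dq2(labels, [1.0, 1.0, 0.0, 0.0])
```

Under this definition the value becomes negative whenever the prediction error exceeds the total sum of squares, which is how entries below zero can arise when a model is applied to a distant time point.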
Figure 1. Venn diagram showing the relations between the features reported for each time point.
Comparisons between the proposed GNNR (Group and Nuclear Norm Regularization) and LFS (Longitudinal Feature Selection) in terms of condition expected AUC (Area Under Curve) and discriminant Q2.
| Methods | Metrics | 4 h | 8 h | 12 h | 24 h |
|---|---|---|---|---|---|
| LFS | CE_AUC | 0.8 | 0.7808 | 0.8448 | 0.9088 |
| LFS | DQ2 | 0.4074 | 0.3658 | 0.5024 | 0.6301 |
| GNNR | CE_AUC | 0.8768 | 0.7168 | 0.8256 | 0.9216 |
| GNNR | DQ2 | 0.4614 | 0.2843 | 0.5112 | 0.6740 |
Figure 2. Overlay plot of regression coefficients. (a) The coefficients of LFS; (b) the coefficients of GNNR. Different colors correspond to different time points.
Figure 3. Venn diagram comparing the overlap between the features reported by each method.
Figure 4. Heat map of the recovery ratio for LFS (a) and GNNR (b). Variable group 9 denotes the falsely recovered uninformative variables; variable groups 1–4 correspond to types A’ to D’ of Figure 5b, and variable groups 5–8 correspond to types A to D of Figure 5a. The light blue cells represent the false discovery ratio, except for the second group at time point two of subplot (a), while the dark blue cells show the true discovery ratio.
Figure 5. Two groups of profiles formed by linking the mean values at each time point. (a) The variables detected over a long period or at more than one time point; (b) the variables detected at a single time point. POS in the legend means the variables were generated for the positive class, and NEG stands for the negative class.