Attila Csala1, Aeilko H Zwinderman2, Michel H Hof2.
Abstract
BACKGROUND: Recent technological developments have enabled the measurement of a plethora of biomolecular data from various omics domains, and research is ongoing on statistical methods that leverage these omics data to better model and understand biological pathways and the genetic architectures of complex phenotypes. Current reviews report that the simultaneous analysis of multiple (i.e. three or more) high-dimensional omics data sources is still challenging and that suitable statistical methods are unavailable. Frequently mentioned challenges are the failure to account for the hierarchical structure between omics domains and the difficulty of interpreting genome-wide results. This study aims to address these challenges. We propose multiset sparse Partial Least Squares path modeling (msPLS), a generalized penalized form of Partial Least Squares path modeling, for the simultaneous modeling of biological pathways across multiple omics domains. msPLS simultaneously models the effect of multiple molecular markers, from multiple omics domains, on the variation of multiple phenotypic variables, while accounting for the relationships between data sources, and provides sparse results. The sparsity in the model helps to provide interpretable results from analyses of hundreds of thousands of biomolecular variables.
Keywords: High dimensional omics data; Multivariate analysis; Partial least squares
Year: 2020 PMID: 31918677 PMCID: PMC6953292 DOI: 10.1186/s12859-019-3286-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 msPLS identified a combination of 40 epigenomic markers (denoted as 1) and 52 transcriptomic markers (denoted as 2) that explain the most variance in the proteome variables. The color scale represents the strength of w regression weights
Fig. 2 msPLS identified 40 methylation markers and 52 gene expression markers that optimised the sum of squared correlations of the explanatory LVs of the epigenome and transcriptome with the MVs from the proteome
The weights of the epigenomic, transcriptomic and proteomic variables extracted by msPLS from the Marfan data
| Methylation site | Weight | Gene code (gene expression) | Weight | Cytokine marker code | Weight |
|---|---|---|---|---|---|
| cg02394578 | 0.93 | AKAP4 | 0.65 | b NGF 46 | 0.43 |
| cg20332866 | 0.96 | ANXA2P3 | 0.63 | CTACK 72 | 0.34 |
| cg00906428 | 0.91 | ASMT_A | 0.83 | GRO a 61 | 0.31 |
| cg05024291 | 0.93 | ATHL1 | 0.73 | HGF 62 | 0.21 |
| cg05093318 | 0.95 | B9D2 | 0.66 | Hu Eotaxin 43 | -0.34 |
| cg18850112 | 0.97 | C15orf52 | 0.76 | Hu FGF basic 44 | 0.46 |
| cg07475117 | 0.95 | C17orf54 | 0.44 | Hu G CSF 57 | 0.61 |
| cg07588614 | 0.94 | C1orf170 | 0.8 | Hu GM CSF 34 | 0.43 |
| cg10372701 | 0.94 | C9orf98 | 0.4 | Hu IFN g 21 | 0.28 |
| cg01718788 | -0.9 | CALHM1 | 0.84 | Hu IL 10. 56 | 0.82 |
| cg17061156 | 0.87 | CHTF18 | 0.81 | Hu IL 12 p70. 75 | 0.51 |
| cg24000259 | 0.95 | CLEC4C | 0.39 | Hu IL 13 51 | 0.44 |
| cg25914270 | 0.92 | COL4A6 | 0.69 | Hu IL 15 73 | 0.19 |
| cg14203970 | 0.91 | CXorf50B | 0.84 | Hu IL 17 76 | 0.65 |
| cg22859054 | 0.94 | CYP1A2 | 0.64 | Hu IL 1b 39 | 0.48 |
| cg02927485 | 0.96 | DKFZp779 | 0.85 | Hu IL 1ra 25 | 0.58 |
| cg22389215 | 0.88 | FKRP | 0.85 | Hu IL 2 38 | 0.49 |
| cg09570676 | 0.96 | FUZ | 0.71 | Hu IL 4 52 | 0.26 |
| cg14751679 | 0.93 | GTF2IRD1 | 0.29 | Hu IL 5 33 | 0.29 |
| cg24477102 | 0.94 | HIGD1B | 0.53 | Hu IL 6 19 | 0.04 |
| cg24882220 | 0.93 | ICAM5 | 0.85 | Hu IL 7 74 | 0.44 |
| cg25626012 | 0.95 | JPH3 | 0.88 | Hu IL 8 54 | -0.26 |
| cg26502549 | 0.94 | LBP | 0.53 | Hu IL 9 77 | 0.42 |
| cg00015699 | 0.93 | LOC64322 | -0.44 | Hu IP 10. 48 | 0.72 |
| cg01281611 | 0.93 | LRRC50 | 0.66 | Hu MCP 1 | -0.04 |
| cg07685390 | 0.95 | MGC4294 | 0.73 | Hu MIP 1a 55 | 0.23 |
| cg26550872 | 0.97 | MT1DP | 0.43 | Hu MIP 1b 18 | 0.47 |
| cg09236780 | 0.92 | MT1L | 0.51 | Hu PDGF bb 47 | -0.29 |
| cg20618527 | 0.92 | NDUFAF2 | -0.47 | Hu RANTES 37 | 0.52 |
| cg01546046 | 0.89 | OR11A1 | -0.53 | Hu VEGF 45 | 0.59 |
| cg20999565 | 0.96 | PEX11G | 0.77 | IFN a2 20 | 0.15 |
| cg11152012 | 0.87 | PIF1 | 0.88 | Il 12 p40 28 | 0.52 |
| cg22958262 | 0.96 | PIN4 | 0.65 | Il 16 27 | 0.38 |
| cg26820811 | 0.9 | POTEF | -0.72 | Il 18 42 | 0.64 |
| cg06731730 | 0.95 | PRIC285 | 0.83 | Il 1a 63 | 0.14 |
| cg08531998 | 0.94 | PSPN | 0.63 | Il 2Ra 13 | 0.56 |
| cg19696891 | 0.92 | PTGR1 | -0.65 | Il 3 64 | 0.24 |
| cg27495444 | 0.93 | REC8 | 0.72 | LIF 29 | 0.57 |
| cg17840843 | 0.93 | RINL | 0.85 | M CSF 67 | 0.4 |
| cg05062854 | 0.95 | ROM1 | 0.69 | MCP 3 26 | 0.42 |
| | | SLC16A3 | 0.78 | MIF 35 | 0.27 |
| | | SLPI | -0.63 | MIG 14 | 0.63 |
| | | SNORA39 | 0.81 | SCF 65 | 0.23 |
| | | SPATA2L | 0.57 | SCGF b 78 | 0.42 |
| | | TMEM179 | 0.79 | SDF 1a 22 | 0.24 |
| | | TNFRSF6B | 0.79 | TNF b 30 | 0.26 |
| | | TRBV12_3 | 0.46 | TRAIL 66 | -0.2 |
| | | UBQLNL | 0.65 | | |
| | | UNQ6494 | 0.55 | | |
| | | ZNF215 | -0.76 | | |
| | | ZNF574 | 0.74 | | |
| | | ZNF688 | 0.65 | | |
Fig. 3 The resulting model from Section 3.2 extended to two LVs per dataset. The first set of LVs 1(1) and 2(1) partition out a different portion of variance in the proteome MVs than the second set of LVs 1(2) and 2(2). The colour scale represents the strength of w regression weights
The second set of weights of the epigenomic, transcriptomic and proteomic variables extracted by msPLS from the Marfan data
| Methylation site | Weight | Gene code (gene expression) | Weight | Cytokine marker code | Weight |
|---|---|---|---|---|---|
| cg23054189 | 0.93 | AGTR2 | -0.59 | b NGF 46 | 0.57 |
| cg18347642 | 0.93 | C2orf43 | 0.87 | CTACK 72 | 0.52 |
| cg16489610 | 0.85 | CCDC112 | 0.87 | GRO a 61 | 0.16 |
| cg27013696 | 0.91 | DKFZP434 | -0.58 | HGF 62 | -0.08 |
| cg20457796 | 0.93 | GMCL1 | 0.88 | Hu Eotaxin 43 | 0.12 |
| cg03181582 | 0.91 | LPO | -0.77 | Hu FGF basic 44 | 0.23 |
| cg10521851 | 0.9 | MAD2L1 | 0.74 | Hu G.CSF 57 | -0.57 |
| cg19968840 | 0.92 | MGC4473 | -0.81 | Hu GM CSF 34 | -0.03 |
| cg27648075 | 0.92 | NFAM1 | -0.7 | Hu IFN g 21 | -0.03 |
| cg22891500 | 0.92 | NMI | 0.8 | Hu Il 10 56 | 0.17 |
| cg05158197 | 0.92 | PF4 | -0.83 | Hu IL 12 p70 75 | 0.43 |
| cg20119106 | 0.93 | PRDM14 | -0.71 | Hu IL 13 51 | 0.26 |
| cg02675353 | 0.91 | PSMA8 | -0.63 | Hu IL 15 73 | 0.67 |
| cg26991025 | 0.93 | RDM1 | 0.47 | Hu IL 17 76 | -0.14 |
| cg20643012 | 0.92 | RNF8 | 0.69 | Hu IL 1b 39 | -0.03 |
| | | TTC30A | 0.8 | Hu IL 1ra 25 | -0.03 |
| | | TTC30B | 0.68 | Hu IL 2 38 | 0.36 |
| | | TTC4 | 0.81 | Hu IL 4 52 | 0.67 |
| | | UNQ6126 | -0.79 | Hu IL 5 33 | 0.23 |
| | | ZNF677 | 0.86 | Hu IL 6 19 | -0.04 |
| | | | | Hu IL 7 74 | 0.44 |
| | | | | Hu IL 8 54 | -0.69 |
| | | | | Hu IL 9 77 | 0.13 |
| | | | | Hu IP 10 48 | 0.73 |
| | | | | Hu MCP 1 MCAF 53 | -0.09 |
| | | | | Hu MIP 1a 55 | -0.24 |
| | | | | Hu MIP 1b 18. | 0.26 |
| | | | | Hu PDGF bb 47 | 0.3 |
| | | | | Hu RANTES 37 | 0.84 |
| | | | | Hu VEGF 45 | 0.69 |
| | | | | IFN a2 20 | -0.29 |
| | | | | Il 12 p40 28 | 0.59 |
| | | | | Il 16 27 | 0.35 |
| | | | | Il 18 42 | 0.4 |
| | | | | Il 1a 63 | 0.04 |
| | | | | Il 2Ra 13 | 0.57 |
| | | | | Il 3 64 | 0.18 |
| | | | | LIF 29 | -0.04 |
| | | | | M CSF 67 | 0.24 |
| | | | | MCP 3 26 | 0.19 |
| | | | | MIF 35 | -0.48 |
| | | | | MIG 14 | 0.24 |
| | | | | SCF 65 | -0.11 |
| | | | | SCGF b 78 | 0.65 |
| | | | | SDF 1a 22 | 0.39 |
| | | | | TNF b 30 | -0.19 |
| | | | | TRAIL 66 | 0.38 |
Over-representation analysis results of the msPLS analysis on Marfan data
| Pathway name | p-value | Associated with Marfan disease through pathway |
|---|---|---|
| Influenza Virus Induced Apoptosis | 3.41 ×10−5 | Not known* |
| Non-integrin membrane-ECM interactions | 2.92 ×10−4 | Collagen formation [ |
| Anchoring fibril formation | 4.73 ×10−4 | Collagen formation [ |
| ECM proteoglycans | 6.19 ×10−4 | Extracellular matrix organization [ |
| Integrin cell surface interactions | 7.90 ×10−4 | Extracellular matrix organization [ |
| Transcriptional activation of mitochondrial biogenesis | 8.17 ×10−4 | Possibly through reduced mitochondrial respiration [ |
| Crosslinking of collagen fibrils | 1.20 ×10−3 | Collagen formation [ |
| Laminin interactions | 1.98 ×10−3 | Extracellular matrix organization [ |
| Mitochondrial biogenesis | 2.40 ×10−3 | Possibly through reduced mitochondrial respiration [ |
| NCAM1 interactions | 3.92 ×10−3 | NCAM signaling for neurite out-growth [ |
| Collagen chain trimerization | 3.92 ×10−3 | Collagen biosynthesis and modifying enzymes [ |
| TGFBR2 MSI Frameshift Mutants in Cancer | 4.20 ×10−3 | Signaling by TGF-beta receptor complex [ |
| Extracellular matrix organization | 4.82 ×10−3 | Extracellular matrix organization [ |
| Host Interactions with Influenza Factors | 5.02 ×10−3 | Not known* |
| Organelle biogenesis and maintenance | 5.14 ×10−3 | Possibly through reduced mitochondrial respiration [ |
| Transfer of LPS from LBP carrier to CD14 | 6.30 ×10−3 | Possibly through toll-like receptor-4 signaling [ |
| Transport of HA trimer, NA tetramer and M2 tetramer from the endoplasmic reticulum to the Golgi Apparatus | 6.30 ×10−3 | Not known* |
| Loss of Function of TGFBR2 in Cancer | 8.39 ×10−3 | Signaling by TGF-beta receptor complex [ |
| TGFBR1 LBD Mutants in Cancer | 8.39 ×10−3 | Signaling by TGF-beta receptor complex [ |
| TGFBR2 Kinase Domain Mutants in Cancer | 8.39 ×10−3 | Signaling by TGF-beta receptor complex [ |
| Assembly of collagen fibrils and other multimeric structures | 8.81 ×10−3 | Collagen formation [ |
| Collagen degradation | 9.32 ×10−3 | Degradation of the extracellular matrix [ |
| NCAM signaling for neurite out-growth | 9.58 ×10−3 | NCAM signaling for neurite out-growth [ |
| Interleukin-4 and Interleukin-13 signaling | 9.78 ×10−3 | Vascular inflammation through interleukins [ |
| Collagen biosynthesis and modifying enzymes | 1.12 ×10−2 | Collagen formation [ |
| TGFBR1 KD Mutants in Cancer | 1.26 ×10−2 | Signaling by TGF-beta receptor complex [ |
| Loss of Function of TGFBR1 in Cancer | 1.46 ×10−2 | Signaling by TGF-beta receptor complex [ |
| SMAD2/3 Phosphorylation Motif Mutants in Cancer | 1.46 ×10−2 | Signaling by TGF-beta receptor complex [ |
| Assembly of Viral Components at the Budding Site | 1.46 ×10−2 | Not known* |
| Loss of Function of SMAD2/3 in Cancer | 1.67 ×10−2 | Signaling by TGF-beta receptor complex [ |
| RUNX3 regulates CDKN1A transcription | 1.67 ×10−2 | Signaling by TGF-beta receptor complex [ |
| Signaling by TGF-beta Receptor Complex in Cancer | 1.88 ×10−2 | Signaling by TGF-beta receptor complex [ |
| Collagen formation | 2.02 ×10−2 | Extracellular matrix organization [ |
| Transcriptional regulation of white adipocyte differentiation | 2.17 ×10−2 | Possibly by depleted or abnormal adipose tissue [ |
| Aromatic amines can be N-hydroxylated or N-dealkylated by CYP1A2 | 2.29 ×10−2 | Not known |
| Formation of annular gap junctions | 2.29 ×10−2 | Endothelial dysfunction [ |
| Gap junction degradation | 2.50 ×10−2 | Endothelial dysfunction [ |
| Proton-coupled monocarboxylate transport | 2.50 ×10−2 | Not known |
| RUNX3 regulates p14-ARF | 3.31 ×10−2 | Signaling by TGF-beta receptor complex [ |
| Fusion of the Influenza Virion to the Host Cell Endosome | 3.52 ×10−2 | Not known* |
| Packaging of Eight RNA Segments | 3.52 ×10−2 | Not known* |
| Fusion and Uncoating of the Influenza Virion | 3.72 ×10−2 | Not known* |
| Uncoating of the Influenza Virion | 3.72 ×10−2 | Not known* |
| Budding | 3.72 ×10−2 | Not known* |
| Release | 3.72 ×10−2 | Not known* |
| Biosynthesis of protectins | 3.72 ×10−2 | Possibly by proresolving lipid mediators [ |
| Degradation of the extracellular matrix | 3.87 ×10−2 | Extracellular matrix organization [ |
| RHO GTPases Activate Formins | 3.92 ×10−2 | Extracellular matrix organization [ |
| TGF-beta receptor signaling in EMT (epithelial to mesenchymal transition) | 3.92 ×10−2 | Signaling by TGF-beta receptor complex [ |
| Cell-extracellular matrix interactions | 3.92 ×10−2 | Extracellular matrix organization [ |
| Synthesis of (16-20)-hydroxyeicosatetraenoic acids (HETE) | 4.13 ×10−2 | Arachidonic acid metabolism [ |
| Entry of Influenza Virion into Host Cell via Endocytosis | 4.13 ×10−2 | Not known* |
| Virus Assembly and Release | 4.13 ×10−2 | Not known* |
| Biosynthesis of maresin-like SPMs | 4.33 ×10−2 | Possibly by proresolving lipid mediators [ |
| Biosynthesis of specialized proresolving mediators (SPMs) | 4.41 ×10−2 | Possibly by proresolving lipid mediators [ |
| Cytokine Signaling in Immune system | 4.49 ×10−2 | Cytokine signaling [ |
| Synthesis of epoxy (EET) and dihydroxyeicosatrienoic acids (DHET) | 4.73 ×10−2 | Arachidonic acid metabolism [ |
| Arachidonic acid metabolism | 4.76 ×10−2 | Arachidonic acid metabolism [ |
The pathway names and p-values were obtained from https://reactome.org. "Not known" associations marked with an asterisk (*) all belong to biomolecular pathways associated with reactions to the influenza virus
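The p-values in the table above come from an over-representation analysis. As a rough illustration of what such an analysis tests, the sketch below computes a one-sided hypergeometric tail probability for the overlap between a selected gene list and a pathway. This is a generic textbook formulation and not necessarily the exact computation performed by Reactome; the function name and all numbers are hypothetical and not taken from the study.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_pathway, n_selected, n_overlap):
    """Return P(X >= n_overlap) for a hypergeometric draw.

    n_selected genes are drawn without replacement from n_universe genes,
    of which n_pathway belong to the pathway of interest.
    """
    max_overlap = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical example: 20 genes in the universe, 5 in the pathway,
# 4 genes selected by the model, 2 of which fall in the pathway.
p_value = hypergeom_enrichment_p(
    n_universe=20, n_pathway=5, n_selected=4, n_overlap=2
)
print(f"{p_value:.4f}")
```

A small p-value indicates that the selected genes overlap the pathway more than expected by chance; with many pathways tested, a multiple-testing correction would usually be applied on top of this.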
Fig. 4 The co-expression pattern of the resulting Marfan genes queried on their biological process based functions. The figure was produced with GeneMania (available at https://genemania.org)
The percentage variation in the chronic lymphocytic leukemia (CLL) data sources explained by the subsequent LVs of msPLS and MOFA
| LV | Genomic variables (msPLS) | Genomic variables (MOFA) | Epigenomic variables (msPLS) | Epigenomic variables (MOFA) | Transcriptomic variables (msPLS) | Transcriptomic variables (MOFA) | Drug response variables (msPLS) | Drug response variables (MOFA) |
|---|---|---|---|---|---|---|---|---|
| LV 1 | 72% | 15% | 92% | 17% | 92% | 7.5% | 57% | 15% |
| LV 2 | 18% | 8.2% | 4% | 0.5% | 5% | 4.7% | 21% | 3.5% |
| LV 3 | 2% | <0.1% | 1% | <0.1% | 1% | 1.4% | 7% | 11.2% |
| LV 4 | | <0.1% | | <0.1% | | 9% | | <0.1% |
| LV 5 | | <0.1% | | <0.1% | | 2.8% | | 6.1% |
| LV 6 | | <0.1% | | <0.1% | | 4.8% | | 3.4% |
| LV 7 | | 0.9% | | 2.4% | | 1.9% | | 1% |
| LV 8 | | <0.1% | | 0.5% | | 3.8% | | 0.5% |
| LV 9 | | <0.1% | | 2.6% | | 0.9% | | 0.4% |
| LV 10 | | <0.1% | | <0.1% | | 2.2% | | <0.1% |
| Total | 92% | 24% | 97% | 24% | 98% | 38% | 85% | 41% |
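This section does not state how the percentages in the table above were computed for either method. One common convention, assumed here purely for illustration, is to report the average squared Pearson correlation between an LV's scores and each variable of a data source. The sketch below implements that convention on toy data; the function names and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def percent_variance_explained(lv_scores, variables):
    """Average squared correlation of one LV with each variable, in percent.

    Assumed convention: each variable contributes equally, which matches
    explained variance when all variables are standardised.
    """
    r_squared = [pearson(lv_scores, column) ** 2 for column in variables]
    return 100.0 * sum(r_squared) / len(r_squared)


# Toy data: one LV score per sample, two variables measured on the same samples.
lv_scores = [1.0, 2.0, 3.0, 4.0]
variables = [
    [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with the LV
    [1.0, -1.0, -1.0, 1.0],  # uncorrelated with the LV
]
explained = percent_variance_explained(lv_scores, variables)
print(f"{explained:.1f}%")
```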
The weights of the genomic, epigenomic, and transcriptomic variables extracted by msPLS from CLL data sources
| Genomic variable | Weight | Epigenomic site | Weight | Transcriptomic gene code | Weight |
|---|---|---|---|---|---|
| del11q22.3 | 0.31 | cg06369076 | 0.036 | ADAM29 | 0.046 |
| del17p13 | 0.16 | cg22449085 | 0.036 | AGPAT4 | 0.043 |
| BRAF | 0.17 | cg12208353 | 0.036 | ANK2 | 0.047 |
| TP53 | 0.21 | cg04694619 | 0.037 | CRY1 | 0.049 |
| IGHV | -0.66 | cg20782816 | 0.038 | DNAH3 | 0.046 |
| | | cg00832703 | 0.037 | ENO4 | -0.041 |
| | | cg01399475 | -0.036 | ESPNL | 0.043 |
| | | cg21398469 | 0.037 | GFI1 | 0.045 |
| | | cg11181763 | 0.036 | GLDN | 0.044 |
| | | cg01360627 | 0.036 | ITPRIPL2 | 0.040 |
| | | cg09087901 | 0.036 | KANK2 | 0.047 |
| | | cg04848693 | 0.037 | L3MBTL4 | 0.049 |
| | | cg12522599 | 0.038 | LDOC1 | 0.041 |
| | | cg11090458 | 0.037 | LPL | 0.041 |
| | | cg00148025 | 0.038 | MAPK4 | -0.040 |
| | | cg12032915 | 0.036 | MRO | 0.043 |
| | | cg07629149 | 0.039 | MSI2 | 0.046 |
| | | cg23844018 | 0.037 | NDUFA4L2 | 0.042 |
| | | cg05213414 | 0.037 | NUGGC | 0.041 |
| | | cg01928411 | 0.037 | PLD1 | 0.043 |
| | | cg07699978 | 0.036 | PON1 | 0.042 |
| | | cg03035162 | 0.036 | PRR18 | -0.044 |
| | | cg03462096 | 0.039 | SEPT10 | -0.040 |
| | | cg08171667 | 0.036 | SOWAHC | 0.041 |
| | | cg26441291 | 0.038 | TP63 | 0.043 |
| | | cg21400896 | 0.037 | USP6NL | -0.040 |
| | | cg15236196 | 0.036 | VSIG10 | 0.042 |
| | | cg21394039 | 0.038 | ZNF135 | -0.040 |
| | | cg04613057 | 0.036 | ZNF471 | -0.042 |
| | | cg08496123 | 0.036 | ZNF667 | 0.041 |
The loadings of the three subsequent LVs extracted by msPLS from the genomic variables of the CLL data set
| Name (1st set of LVs) | Loading | Name (2nd set of LVs) | Loading | Name (3rd set of LVs) | Loading |
|---|---|---|---|---|---|
| del11q22.3 | 0.31 | del11q22.3 | -0.27 | NRAS | 0.35 |
| del17p13 | 0.16 | trisomy12 | 0.65 | COL6A5 | -0.34 |
| BRAF | 0.17 | del13q14_any | -0.37 | FAM47A | -0.35 |
| TP53 | 0.21 | del14q24.3 | 0.20 | FAT4 | -0.39 |
| IGHV | -0.66 | CREBBP | 0.15 | PRPF8 | -0.52 |
Fig. 5 The samples of the CLL data cluster according to their IGHV and trisomy 12 status, as extracted by the first and second LV of the msPLS model. The figure was produced by the MOFA R package [11]
Fig. 6 The proposed relationship between three data sources. X1 and X2 have a symmetric relation (i.e. they are responses for each other) and X3 has an asymmetric relation with both X1 and X2 (i.e. X3 is a response for both X2 and X1)
True-positive rate (TPR) and true-negative rate (TNR) results of the simulation study
| | n = 50 | n = 100 | n = 250 | n = 37 |
|---|---|---|---|---|
| | 0.67 | 0.93 | 0.99 | 0.61 |
| | 0.66 | 0.94 | 0.99 | 0.72 |
| | 0.99 | 0.99 | 0.99 | 0.99 |
| | 0.99 | 0.99 | 0.99 | 0.99 |
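For readers unfamiliar with the two rates reported above, the sketch below shows how TPR and TNR can be computed for a variable-selection result. It assumes, as an interpretation not spelled out in this section, that a "positive" is a variable given a non-zero weight by the model and that the truly relevant variables are known from the simulation design. The function name and the indices are hypothetical.

```python
def tpr_tnr(selected, truly_relevant, n_variables):
    """Return (TPR, TNR) for a variable-selection result.

    selected        -- indices of variables with non-zero weights
    truly_relevant  -- indices of variables that truly carry signal
    n_variables     -- total number of candidate variables
    """
    selected = set(selected)
    relevant = set(truly_relevant)
    true_pos = len(selected & relevant)
    false_neg = len(relevant - selected)
    false_pos = len(selected - relevant)
    true_neg = n_variables - true_pos - false_neg - false_pos
    tpr = true_pos / (true_pos + false_neg)
    tnr = true_neg / (true_neg + false_pos)
    return tpr, tnr


# Hypothetical example: 10 variables, 4 truly relevant, 4 selected (3 correct).
tpr, tnr = tpr_tnr(
    selected=[0, 1, 2, 7], truly_relevant=[0, 1, 2, 3], n_variables=10
)
print(f"TPR = {tpr:.2f}, TNR = {tnr:.2f}")
```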
Fig. 7 The null distributions of the optimisation criteria (with respect to X3) for the simulated data with different sample sizes (n = 50, 100, 250), obtained after 1000 permutations. The red bars indicate the optimisation criteria obtained applying msPLS to the original data with the optimal λ1 parameters for UST. The red dots are the bootstrapped values, and the dashed red bars are the 95% confidence intervals
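The permutation procedure behind a null distribution like the one in Fig. 7 can be sketched as follows. This is a minimal illustration under stated assumptions: the squared Pearson correlation between two single variables stands in for the msPLS optimisation criterion, and only the response is permuted. How the permutations were applied across the data sources in the study is not described in this caption, and all function names and data below are hypothetical.

```python
import random
from math import sqrt


def squared_correlation(x, y):
    """Stand-in criterion: squared Pearson correlation of x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy / sqrt(sxx * syy)) ** 2


def permutation_null(x, y, criterion, n_permutations=1000, seed=0):
    """Criterion values obtained after randomly shuffling the response y."""
    rng = random.Random(seed)
    y_perm = list(y)
    null = []
    for _ in range(n_permutations):
        rng.shuffle(y_perm)  # break the link between x and y
        null.append(criterion(x, y_perm))
    return null


def permutation_p_value(observed, null):
    """Share of permuted values at least as large as the observed one."""
    return (1 + sum(v >= observed for v in null)) / (1 + len(null))


# Toy data with a strong linear relation between x and y.
x = [float(i) for i in range(20)]
y = [v + (-1) ** i for i, v in enumerate(x)]

observed = squared_correlation(x, y)
null = permutation_null(x, y, squared_correlation, n_permutations=1000, seed=0)
p_value = permutation_p_value(observed, null)
print(f"observed = {observed:.3f}, p = {p_value:.4f}")
```

The observed criterion is then compared with the permuted values, much as the red bars are compared with the histograms in the figure.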