Beatriz Galindo-Prieto, Paul Geladi, Johan Trygg.
Abstract
BACKGROUND: For multivariate data analysis involving only two input matrices (e.g., X and Y), the previously published methods for variable influence on projection (e.g., VIP_OPLS or VIP_O2PLS) are widely used for variable selection purposes, including (i) variable importance assessment, (ii) dimensionality reduction of big data and (iii) interpretation enhancement of PLS, OPLS and O2PLS models. For multiblock analysis, the OnPLS models find relationships among multiple data matrices (more than two blocks) by calculating latent variables; however, a method for improving the interpretation of these latent variables (model components) by assessing the importance of the input variables was not available up to now.
Keywords: Feature selection; Latent variable interpretation; MB-VIOP; Multiblock variable selection; OnPLS; VIP; Variable importance in multiblock regression; Variable influence on projection
Year: 2021 PMID: 33812384 PMCID: PMC8019512 DOI: 10.1186/s12859-021-04015-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Venn diagram that shows the three types of variable influences in MB-VIOP according to the type of variation (global, local or unique) that they explain. The three data blocks are represented by three big circles (yellow for D1, blue for D2, red for D3). There are three different types of zones according to how the information is shared (i.e. globally, locally or uniquely) by the variables among the blocks. Variables that belong to D1 are represented by stars, variables of D2 by squares, and variables of D3 by circles. Variables filled in white are important, whereas the ones filled in black are not. Variables labeled with an e are special cases. A further explanation is provided in section "Methods"
Fig. 2 MB-VIOP results for the synthetic data set SD16_235GLU. An overview of the 4-block (D1–D4) system and its interactions is shown at the top right of the figure. The normalized loadings directly extracted from the synthetic dataset (not from the model) are provided at the top left. For the whole figure, the color code is indicated in the legend (pink is used for unique, black and blue for global, cyan and orange for local information related to two-block interactions, and green for local information related to the three-block interaction). The MB-VIOP plots are distributed by columns according to type of interpreted variation, and by rows according to data block. The important variables are the ones with MB-VIOP values above the red line (MB-VIOP > 1). A more detailed interpretation of the results of this figure is given in section "Evidence of the reliability and the efficiency of MB-VIOP using synthetic data"
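The MB-VIOP > 1 cut-off in the caption above mirrors the convention used for classical PLS VIP, where the squared scores average to exactly 1 across variables, so a score above 1 marks a variable with above-average influence. The sketch below implements the classical single-block VIP formula only, not the MB-VIOP algorithm itself. Whether MB-VIOP uses the same normalization is assumed here, not stated in the text, and the weights and sums of squares are hypothetical values chosen for illustration.

```python
import math

def classical_vip(weights, ss_explained):
    """Classical PLS VIP (illustrative; this is NOT the MB-VIOP algorithm).

    weights[a] is the weight vector of component a (length p).
    ss_explained[a] is the sum of squares explained by component a.
    """
    p = len(weights[0])
    total_ss = sum(ss_explained)
    vip_sq = [0.0] * p
    for w, ss in zip(weights, ss_explained):
        norm_sq = sum(v * v for v in w)  # squared length of the weight vector
        for j, v in enumerate(w):
            # Contribution of variable j through this component,
            # weighted by how much variation the component explains.
            vip_sq[j] += ss * (v * v) / norm_sq
    return [math.sqrt(p * s / total_ss) for s in vip_sq]

# Hypothetical model: 4 variables, 2 components.
weights = [[0.8, 0.5, 0.3, 0.1], [0.1, 0.2, 0.6, 0.7]]
ss = [3.0, 1.0]

vip = classical_vip(weights, ss)
important = [j for j, v in enumerate(vip) if v > 1.0]
mean_sq = sum(v * v for v in vip) / len(vip)
print(important, round(mean_sq, 6))
```

Because the squared scores always average to 1 under this formula, the threshold does not depend on the number of variables or components.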
Values of explained variation per data block (D1–D4) and per component for the OnPLS model of the SD16_235GLU dataset
| Data block | ag1 | ag2 | al1 | al2 | al3 | au1 | au2 | au3 |
|---|---|---|---|---|---|---|---|---|
| D1 | 14.3 | 14.3 | 14.3 | 14.3 | 14.3 | 14.3 | ||
| D2 | 25.0 | 25.0 | 25.0 | 25.0 | ||||
| D3 | 25.0 | 25.0 | 25.0 | 25.0 | ||||
| D4 | 20.0 | 20.0 | 20.0 | 20.0 | ||||
Values are given as percentages (%); a stands for component, g for global, l for local, and u for unique
Values of explained variation per data block and per component for the OnPLS model of the Marzipan dataset
| Data block | ag1 | ag2 | au1 | au2 |
|---|---|---|---|---|
| NIRS1 | 76.3 | 11.1 | 8.8 | |
| NIRS2 | 90.5 | 3.3 | ||
| INFRAPROVER | 84.7 | 11.1 | ||
| BOMEM | 94.2 | 2.8 | ||
| INFRATECH | 99.2 | 0.7 | ||
| IR | 41.5 | 26.9 | 7.1 | |
Values are given as percentages (%); a stands for component, g for global, and u for unique
Values of explained variation per data block and per component for the OnPLS model of the Hybrid Aspen dataset
| Data block | ag1 | ag2 | ag3 | ag4 | al1 | al2 | au1 | au2 |
|---|---|---|---|---|---|---|---|---|
| Transcriptomics | 11.9 | 30.9 | 12.0 | 2.4 | 4.4 | 5.3 | 8.1 | |
| Proteomics | 17.8 | 14.4 | 10.6 | 4.0 | 8.2 | |||
| Metabolomics | 12.3 | 14.2 | 7.8 | 6.1 | 5.7 | 12.3 | ||
Values are given as percentages (%); a stands for component, g for global, l for local, and u for unique
Fig. 3 MB-VIOP results for the marzipan dataset. The normalized loadings (for all the blocks and components) obtained from the OnPLS model are provided at the top. The unique, global and total MB-VIOP plots are also provided, including the threshold line at MB-VIOP = 1. The variables determined as relevant by the MB-VIOP algorithm have been annotated in the unique MB-VIOP plot for the data block NIRS1 according to the organic compound of marzipan and/or cocoa that they help to explain
Summary of the number of variables used for the OnPLS models (the original and the two reduced models) and the percentages of explained total variation for the Hybrid Aspen data
| Data | OnPLS models | Number of variables used | Explained total variation (%) |
|---|---|---|---|
| Transcript | Original | 14,738 | 75.0 |
| Transcript | Total MB-VIOP ≥ 0.5 | 13,127 | 80.1 |
| Transcript | Total MB-VIOP ≥ 1.0 | 4,452 | 85.2 |
| Protein | Original | 3,132 | 55.0 |
| Protein | Total MB-VIOP ≥ 0.5 | 2,186 | 67.3 |
| Protein | Total MB-VIOP ≥ 1.0 | 683 | 71.6 |
| Metabolite | Original | 281 | 58.3 |
| Metabolite | Total MB-VIOP ≥ 0.5 | 232 | 65.5 |
| Metabolite | Total MB-VIOP ≥ 1.0 | 81 | 76.2 |
The information is organized into three areas according to data block (transcriptomics, proteomics and metabolomics), and each area is divided into three rows: one for the original model, one for the reduced model using the variables with total MB-VIOP ≥ 0.5, and one for the reduced model using the variables with total MB-VIOP ≥ 1
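The threshold-based selection summarized in the table above can be sketched as a simple filter on the total MB-VIOP scores. The function name `select_by_threshold` and the five example scores are hypothetical and only illustrate the idea. The variable counts are copied from the table, and the retained percentages are plain arithmetic on those counts.

```python
def select_by_threshold(scores, threshold):
    """Return indices of variables whose score is at or above the threshold."""
    return [j for j, s in enumerate(scores) if s >= threshold]

# Hypothetical total MB-VIOP scores for five variables (illustration only).
scores = [1.42, 0.48, 0.77, 1.05, 0.12]
kept_05 = select_by_threshold(scores, 0.5)
kept_10 = select_by_threshold(scores, 1.0)

# Variable counts copied from the table: (original, >= 0.5, >= 1.0).
counts = {
    "Transcript": (14738, 13127, 4452),
    "Protein": (3132, 2186, 683),
    "Metabolite": (281, 232, 81),
}

def percent_retained(original, reduced):
    """Share of the original variables that survive the selection, in %."""
    return 100.0 * reduced / original

retained = {
    block: (percent_retained(n0, n05), percent_retained(n0, n10))
    for block, (n0, n05, n10) in counts.items()
}

for block, (r05, r10) in retained.items():
    print(f"{block}: {r05:.1f}% kept at >= 0.5, {r10:.1f}% kept at >= 1.0")
```

With the counts in the table, the 1.0 cut-off keeps roughly 22 to 30% of the variables in each block, while the 0.5 cut-off keeps roughly 70 to 89%.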
Fig. 4 Three plots corresponding to each Hybrid Aspen dataset grouped by type of variation. The number of variables before variable selection is represented in red, the number of variables after MB-VIOP ≥ 0.5 selection is represented in green, and the number of variables after MB-VIOP ≥ 1 selection is represented in blue
Summary of the number of variables used for the block-sPLS models (the original and the two reduced models) and the percentages of explained total variation for the Hybrid Aspen data
| Data | Block-sPLS models | Number of variables used | Explained total variation (%) |
|---|---|---|---|
| Transcript | Original block-sPLS | 14,738 | 68.0 |
| Transcript | Block-sPLS comparable to MB-VIOPtot ≥ 0.5 model | 13,151 | 68.0 |
| Transcript | Block-sPLS comparable to MB-VIOPtot ≥ 1.0 model | 4,483 | 66.0 |
| Protein | Original block-sPLS | 3,132 | 50.0 |
| Protein | Block-sPLS comparable to MB-VIOPtot ≥ 0.5 model | 2,201 | 50.0 |
| Protein | Block-sPLS comparable to MB-VIOPtot ≥ 1.0 model | 685 | 48.0 |
| Metabolite | Original block-sPLS | 281 | 54.0 |
| Metabolite | Block-sPLS comparable to MB-VIOPtot ≥ 0.5 model | 236 | 54.0 |
| Metabolite | Block-sPLS comparable to MB-VIOPtot ≥ 1.0 model | 77 | 52.0 |
The information is organized into three areas according to data block (transcriptomics, proteomics and metabolomics), and each area is divided into three rows: one for the original model, one for the reduced model using a constraint degree similar to the total MB-VIOP ≥ 0.5 selection, and one for the reduced model using a constraint degree similar to the total MB-VIOP ≥ 1 selection
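To read the two Hybrid Aspen summary tables side by side, the short sketch below computes how the explained total variation changes relative to each original model. All figures are copied from the tables, and the helper name `change_from_original` is hypothetical. The sketch assumes the percentages of the two model families can be compared as reported, which the text does not confirm.

```python
# Explained total variation (%) copied from the two summary tables:
# (original, reduced at the 0.5 level, reduced at the 1.0 level).
onpls = {
    "Transcript": (75.0, 80.1, 85.2),
    "Protein": (55.0, 67.3, 71.6),
    "Metabolite": (58.3, 65.5, 76.2),
}
block_spls = {
    "Transcript": (68.0, 68.0, 66.0),
    "Protein": (50.0, 50.0, 48.0),
    "Metabolite": (54.0, 54.0, 52.0),
}

def change_from_original(values):
    """Change in percentage points of each reduced model versus the original."""
    original = values[0]
    return tuple(round(v - original, 1) for v in values[1:])

onpls_change = {b: change_from_original(v) for b, v in onpls.items()}
spls_change = {b: change_from_original(v) for b, v in block_spls.items()}

for block in onpls:
    print(block, "OnPLS:", onpls_change[block], "block-sPLS:", spls_change[block])
```

On these figures, the MB-VIOP-reduced OnPLS models gain between 5.1 and 17.9 percentage points of explained variation, whereas the block-sPLS models stay flat at the 0.5 level and lose 2.0 points at the 1.0 level.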