Nur Ain Ishak1, Noor Idayu Tahir1, Syafi'ah Nadiah Mohd Sa'id2, Kathiresan Gopal3, Abrizah Othman1, Umi Salamah Ramli1.
Abstract
Recent advances in phytochemical analysis have allowed crop researchers to accumulate large volumes of data, owing to the capacity of these techniques to footprint and distinguish the metabolites present within organisms, tissues or cells. Apart from genotypic traits, slight changes induced by biotic or abiotic stimuli have a significant impact on metabolite abundances and are eventually observed through physicochemical characteristics. Apposite data mining to interpret the mounds of phytochemical information from such a dynamic system is thus essential. In this investigation, four statistical tools, namely COVAIN, SIMCA-P+, MetaboAnalyst and RIKEN Excel Macro, which cover exploratory and confirmatory techniques of multivariate data analysis, were appraised using an oil palm phytochemical data set. As each software tool has its own advantages and limitations, the insights gained from this assessment were documented to clarify the functions of the tools and their suitability for adoption into the oil palm phytochemistry pipeline. This comparative analysis provides scientists with salient notes on data assessment and data mining that will later allow the overall oil palm status to be depicted in-situ and ex-situ.
Keywords: Metabolites; Multivariate data analysis; Oil palm; Phytochemical analysis; Statistical tools
Year: 2021 PMID: 33553773 PMCID: PMC7856480 DOI: 10.1016/j.heliyon.2021.e06048
Source DB: PubMed Journal: Heliyon ISSN: 2405-8440
Summary of unsupervised and supervised method strategies.
| Type of methods | Examples of analysis | Goal | Application | Input | Output |
|---|---|---|---|---|---|
| Unsupervised method | Principal Component Analysis (PCA) | Reduction of data dimensionality, visual inspection of data grouping and pattern recognition | Dimensionality reduction of the observed variation in the data and extraction of components that explain maximum variance | Data tables without class associations: each row | Data summary in scores and loadings plots for pattern recognition |
| | Hierarchical Clustering | Display of subjects' connectivity in cluster formation | Grouping of data into dendrogram and heat map | | |
| Supervised method | Partial Least Squares (PLS) | Biomarker discovery and class membership prediction | Assessment of variables contributing to discrimination of subjects | Data tables with prior class membership assignment | Selection of dependent variables (metabolites) to represent class membership |
| | Support Vector Machine (SVM) | Construction of a model that can assign new subject(s) to one category or the other | Selection of metabolites as predictors to construct prediction model | | |
Figure 1. General metabolomics workflow for oil palm.
Figure 2. Interface (A) and workflow (B) of COVAIN toolbox.
Figure 3. Interface (A) and workflow (B) of SIMCA-P+.
Figure 4. Interface (A) and workflow (B) of MetaboAnalyst.
Figure 5. Interface (A) and workflow (B) of RIKEN Macro tool in Excel.
Scaling methods.
| Scaling type | Calculation | Details |
|---|---|---|
| Unit variance | 1/σ | Data analysis based on correlations instead of covariances; inflation of measurement errors |
| Pareto | 1/√σ | Metabolites with large fold changes become less dominant, which balances data intensity; data does not become dimensionless, unlike with unit variance scaling; data remains nearer to the original measurement |
| Variance | 1/σ² | Emphasises relative responses; inflation of measurement errors |
| Vast (variable stability) | 1/(σ/m) | Concentrates on metabolites with small variations; unsuitable for data with large induced variation(s) |
∗m = mean, adapted from Van den Berg et al. (2006).
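
The scaling weights in the table can be illustrated with a minimal Python sketch. It is not taken from any of the four tools. The function name `scale` and the method labels are hypothetical, each variable is mean-centred before weighting, and the sample standard deviation (n − 1 denominator) is assumed. The vast weight follows the definition in Van den Berg et al. (2006), where unit variance scaling is multiplied by the mean-to-standard-deviation ratio; this is an assumption about how the table's 1/(σ/m) term is meant to be applied.

```python
from math import sqrt
from statistics import mean, stdev


def scale(values, method="pareto"):
    """Mean-centre one metabolite's intensities and apply a scaling weight.

    `values` holds the intensities of a single metabolite across samples.
    The method labels ("uv", "pareto", "variance", "vast") are illustrative.
    """
    m = mean(values)
    s = stdev(values)  # sample standard deviation (assumption: n - 1)
    if s == 0:
        raise ValueError("a constant variable cannot be scaled")

    if method == "uv":          # unit variance: 1/sigma
        weight = 1 / s
    elif method == "pareto":    # Pareto: 1/sqrt(sigma)
        weight = 1 / sqrt(s)
    elif method == "variance":  # variance: 1/sigma^2, as listed in the table
        weight = 1 / s ** 2
    elif method == "vast":      # vast: (1/sigma) * (m/sigma), assumed form
        weight = m / s ** 2
    else:
        raise ValueError(f"unknown scaling method: {method}")

    return [(v - m) * weight for v in values]


intensities = [2.0, 4.0, 6.0, 8.0]
for name in ("uv", "pareto", "variance", "vast"):
    print(name, [round(x, 3) for x in scale(intensities, name)])
```

After unit variance scaling every metabolite has a standard deviation of 1, whereas Pareto scaling leaves a standard deviation of √σ, which is why Pareto-scaled data stay closer to the original measurements.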
Normalisation and scaling methods applied prior to data analysis.
| Tools | Normalization | Scaling | Notes |
|---|---|---|---|
| COVAIN ( | Several options provided: normalisation by internal standard; normalisation by sample fresh weight | N/A | Other pre-treatment options: data transformation (log transformation and z-score transformation) |
| SIMCA-P+ ( | Standard Normal Variate (SNV) | Pareto scaling | Other scaling options: mean centering; auto scaling |
| MetaboAnalyst ( | Several options provided: none (no normalization applied); sample-specific normalization (i.e., weight, volume); normalization by sum; normalization by median; normalization by reference sample (probabilistic quotient normalization, PQN); normalization by a pooled sample from group; normalization by reference feature; quantile normalization | Pareto scaling | Other pre-treatment options: data transformation (log transformation and cube root transformation) |
| RIKEN Excel Macro tool ( | Normalization by internal standard | Pareto scaling | Other pre-treatment options: data transformation (log10 transformation and 1/4 root transformation) |
∗N/A: not available.
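
Three of the normalisation options listed for MetaboAnalyst (by sum, by median, and probabilistic quotient normalisation) can be sketched in plain Python. This is a simplified illustration and not the implementation of any of the tools. The function names are hypothetical, each sample is a list of intensities in a shared feature order, and the PQN reference is assumed to be the feature-wise median across all samples when none is supplied.

```python
from statistics import median


def normalise_by_sum(sample):
    """Divide every intensity by the sample's total intensity."""
    total = sum(sample)
    return [v / total for v in sample]


def normalise_by_median(sample):
    """Divide every intensity by the sample's median intensity."""
    med = median(sample)
    return [v / med for v in sample]


def pqn(samples, reference=None):
    """Probabilistic quotient normalisation (simplified sketch).

    Each sample is divided by the median of its feature-wise quotients
    against a reference sample. Assumption: when `reference` is None,
    the feature-wise median across all samples is used.
    """
    if reference is None:
        reference = [median(column) for column in zip(*samples)]

    normalised = []
    for sample in samples:
        # Skip features that are zero in the sample or the reference.
        quotients = [v / r for v, r in zip(sample, reference) if v > 0 and r > 0]
        factor = median(quotients)  # most probable dilution factor
        normalised.append([v / factor for v in sample])
    return normalised


# Three samples that differ only by a dilution factor.
samples = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]
print(pqn(samples))
```

In this toy input the three samples are dilutions of one another, so PQN maps all of them onto the same profile; real data would retain the biological differences that are not explained by a single dilution factor.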
Figure 6. PCA scores (a) and loadings plots (b) of oil palm leaf metabolome of different planting sites generated by COVAIN toolbox, SIMCA-P+, MetaboAnalyst and RIKEN Excel Macro tool.
Figure 7. Scores (a) and loadings plots (b) of oil palm leaf metabolome of two different planting sites generated by MetaboAnalyst. The image orientation was changed with the “flip image” button on the X-axis (c and d), the Y-axis (e and f) or both axes (g and h).
Distance and linkage metrics for available tools for clustering.
| Tools | Distance | Linkage | Reference | |
|---|---|---|---|---|
| 1 | COVAIN (dendrogram + heatmap) | ( | ||
| 2 | SIMCA-P+ (dendrogram) | ( | ||
| 3 | MetaboAnalyst (dendrogram + heatmap) | Euclidean | ( | |
| MetaboAnalyst (dendrogram) | Euclidean | ( |
∗items in bold are default metrics.
Figure 8. Dendrogram and heat map generated by COVAIN.
Figure 9. Heat map generated by COVAIN toolbox.
Figure 10. Dendrogram generated by SIMCA-P+.
Figure 11. Dendrogram and heat map generated by MetaboAnalyst using default metrics.
Figure 12. Dendrogram and heat map generated by MetaboAnalyst using Pearson distance metric.
Figure 13. Dendrograms from heat map generated by MetaboAnalyst using Euclidean and Pearson distance metrics.
Figure 14. Top 10 variables based on t-test/ANOVA analysis using Pearson distance and Ward clustering algorithm presented in dendrograms and heat map generated by MetaboAnalyst.
Figure 15. Dendrograms generated by MetaboAnalyst with Euclidean and Spearman distances.
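
The Euclidean, Pearson and Spearman distances named in the captions above measure different notions of similarity, which is one reason dendrograms built from the same data can differ. The sketch below is a minimal pure-Python illustration with hypothetical function names. It assumes the common convention that the correlation-based distances are defined as 1 − r, which may not match the exact definition used inside each tool.

```python
from math import sqrt


def euclidean(a, b):
    """Straight-line distance between two intensity profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson_distance(a, b):
    """Assumed convention: 1 - Pearson r."""
    return 1 - pearson_r(a, b)


def spearman_distance(a, b):
    """Assumed convention: 1 - Pearson r computed on the ranks."""
    return 1 - pearson_r(ranks(a), ranks(b))


a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]    # same shape as a, twice the intensity
c = [1.0, 4.0, 9.0, 100.0]  # same ordering as a, non-linear increase
print(euclidean(a, b), pearson_distance(a, b))
print(pearson_distance(a, c), spearman_distance(a, c))
```

Profiles `a` and `b` are far apart in Euclidean terms yet essentially identical under Pearson distance, while `a` and `c` are separated by Pearson distance but not by Spearman distance, since only their rank order matters there.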
Examples of deviating samples discovered from clustering analysis.
| Tools | Teluk Intan Cluster | Keratong Cluster |
|---|---|---|
| COVAIN (dendrogram + heatmap) | Keratong 154, Keratong 152, Keratong 151, Keratong 139, Keratong 138 | Teluk Intan 528, Teluk Intan 513 |
| SIMCA (dendrogram) | Keratong 194, Keratong 153 | Teluk Intan 765, Teluk Intan 672, Teluk Intan 513, Teluk Intan 233 |
| MetaboAnalyst (dendrogram + heatmap using Pearson distance) | Keratong 194, Keratong 154, Keratong 152 | |
| MetaboAnalyst (dendrogram using Spearman distance) | Keratong 194, Keratong 154, Keratong 152 | Teluk Intan 672, Teluk Intan 513, Teluk Intan 233 |
Figure 16. PLS-DA scores (a) and loadings plots (b) of oil palm leaf metabolome of different planting sites generated by SIMCA-P+, MetaboAnalyst and RIKEN Excel Macro tool.
Figure 17. PLS-DA loadings line plot generated by SIMCA-P+.
Figure 18. PLS-DA VIP plot generated by SIMCA-P+.
Figure 19. PLS-DA variable importance in projection (VIP) by MetaboAnalyst.
Figure 20. PLS-DA VIP scores by RIKEN Excel Macro tool.
Comparison of investigated metabolomics statistical tools.
| Tools | COVAIN toolbox Version 2017-May-16 | SIMCA-P+ version 15.2 | MetaboAnalyst version 4.0 | RIKEN Excel Macro Tool |
|---|---|---|---|---|
| Type of platform | Toolbox with window or pane with quick access to common operation functions in the program | Licensed software | Web server | Tool |
| Statistical analysis offered | PCA | PCA | PCA | PCA |
| Cost | COVAIN itself is open-source software, but the annual renewal of a MATLAB license costs at least USD 29 (student license) on top of the one-time software and tools package purchase | Perpetual software license with a one-time purchase | Free online use; offline open-source local installation using a Web Application Resource (.war) file | Requires Microsoft Excel, which costs at least USD 140 (Home & Student version) |
| Runtime (Analysis time) | 1.79 s | 1.91 s | Runtime of the online tool depends on user's internet connection unless locally installed. | 1.29 s |
| Limitations | • Requires MATLAB licensed software to run the toolbox | • The software product needs to be purchased | • Requires an internet connection | • Requires: |
| Advantages | • Freely accessible for download | • Excellent graphic capabilities | • Freely accessible | • Freely accessible |
| Experience of use | Figures generated from statistical analysis are adjustable. However, data pre-processing parameters and supervised methods are limited. Apart from statistical analysis, the COVAIN tool also offers Granger time-series analysis, pathway mapping, and correlation network topology analysis and visualization. | A well-known data mining software in use for more than 30 years. However, users need to clearly define the variables, including their identifiers, roles and data types, which can be confusing for new users. This could be because SIMCA-P+ is used in multiple fields other than metabolomics. | A complete pipeline for high-throughput metabolomics, covering data pre-processing, multivariate data analysis and data annotation. It provides useful functions such as biomarker analysis and various pathway analyses. The software is constantly updated for public use. | A user-friendly tool with a comprehensive manual. Generated figures are easy to adjust owing to the familiarity of Microsoft Excel. |