Marina Creydt, Markus Fischer.
Abstract
Experiments based on metabolomics represent powerful approaches to the experimental verification of the integrity of food. In particular, high-resolution non-targeted analyses, which are carried out by means of liquid chromatography-mass spectrometry systems (LC-MS), offer a variety of options. However, an enormous amount of data is recorded, which must be processed in a correspondingly complex manner. The evaluation of LC-MS based non-targeted data is not entirely trivial and a wide variety of strategies have been developed that can be used in this regard. In this paper, an overview of the mandatory steps regarding data acquisition is given first, followed by a presentation of the required preprocessing steps for data evaluation. Then some multivariate analysis methods are discussed, which have proven to be particularly suitable in this context in recent years. The publication closes with information on the identification of marker compounds.
Keywords: chemometrics; data preprocessing; food authenticity; food fraud; mass spectrometry; metabolite identification; metabolomics; multivariate methods; non-targeted; pathway analysis
Year: 2020 PMID: 32878155 PMCID: PMC7504784 DOI: 10.3390/molecules25173972
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Comparison of different mass analyzers [37,38,39]. The number of plus signs weights the displayed categories, where + stands for moderate and +++++ for relatively high.
| Mass Analyzer | Resolution | Mass Accuracy | Scan Rate | Linear Dynamic Range | Sensitivity | Quantitation | Handling | Cost Effort | |
|---|---|---|---|---|---|---|---|---|---|
| FT-ICR-MS | +++++ | +++++ | ++ | ++++ | +++ | ++ | ++ | + | +++++ |
| Orbitrap | ++++ | +++++ | +++ | +++ | +++ | +++ | ++ | +++ | ++++ |
| ToF/QToF | +++ | ++++ | +++++ | +++++ | ++++ | ++++ | ++++ | +++ | +++ |
| QTrap | ++ | +++ | ++++ | ++ | +++ | +++++ | +++ | +++++ | ++ |
| QqQ | ++ | + | ++++ | ++ | +++++ | +++++ | +++++ | +++++ | + |
Figure 1. Possible patterns for representative sampling of plant foods in the field. The field border in blue is not sampled. The black dots represent individual samples, which are then mixed to form a collective sample. (a) Example of sampling in an X-shaped and (b) in a W-shaped pattern.
Figure 2. Workflow of metabolomics analyses and the individual steps that are carried out during data evaluation. After data acquisition, the data must be preprocessed to prepare the data sets for further evaluation. A feature matrix is calculated that can be evaluated using various multivariate methods. This is followed by a selection of the most relevant features and, if necessary, an identification and biological interpretation using pathway analysis.
Overview of various software programs for evaluating non-targeted LC-MS data.
| Software | Provider | Access | Reference |
|---|---|---|---|
| Compound Discoverer | ThermoFisher Scientific, Waltham, MA, USA | local installation required | [ |
| DataAnalysis, ProfileAnalysis, MetaboScape | Bruker Daltonics, Bremen, Germany | local installation required | [ |
| Mass Profiler Professional and various other modules that can be combined to design different workflows | Agilent Technologies, Santa Clara, CA, USA | local installation required | [ |
| Progenesis QI | Waters Corporation, Milford, MA, USA | local installation required | [ |
| Galaxy-M | School of Biosciences, University of Birmingham, Birmingham, UK | local installation required | [ |
| KniMet | Department of Biochemistry and Cambridge Systems Biology Centre, University of Cambridge, Cambridge, UK | local installation required | [ |
| MAVEN | Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA | local installation required | [ |
| MetaboAnalyst | Xia Lab at McGill University, Montreal, QC, Canada | web-based | [ |
| MZmine 2 | Okinawa Institute of Science and Technology (OIST), Onna, Okinawa, Japan / Quantitative Biology and Bioinformatics, VTT Technical Research Centre of Finland, Espoo, Finland | local installation required | [ |
| OpenMS | Center for Integrative Bioinformatics (CIBI), University of Tübingen, Tübingen, Germany | local installation required | [ |
| Workflow4Metabolomics | National Research Institute for Agriculture, Food and Environment, Paris, France | web-based | [ |
| XCMS online | The Scripps Research Institute, La Jolla, CA, USA | web-based | [ |
Important abbreviations and definitions.
| Term | Explanation |
|---|---|
| Analysis of variance (ANOVA) | In contrast to the t-test, ANOVA can be used to test for significant differences among more than two sample groups. |
| Bias | Systematic deviations of the measured values, which are based, for example, on inaccuracies in sample preparation, injection and drift of the measuring instruments. |
| Bootstrap approach | Resampling method with replacement, which means that a sample can be drawn more than once. It can also be applied to non-normally distributed data [ |
| Correlation optimized warping (COW), dynamic time warping (DTW) and parametric time warping (PTW) | Different warping algorithms that are used for retention time alignment by shifting, stretching or compressing the retention time axis. DTW [ |
| Cross validation (CV) | CV is an internal method for the validation of supervised models to check the predictive power and rule out overfitting. In this approach, a model is first calculated with the help of a training set, which is checked with a test data set. The process is repeated several times [ |
| Feature | In LC-MS analyses, a feature is defined based on retention time and mass-to-charge ratio (m/z). |
| Mean centering | Subtraction of the average of a feature from each measured value of that feature so that the new average of that feature is zero. The interpretation of the data is made easier because the differences come to the foreground and an offset of the data is eliminated [ |
| Normalization | Ensures the comparability of the samples with each other by eliminating systematic errors, e.g., from different sample weights or dilutions [ |
| Null hypothesis (H0) | The null hypothesis assumes that there is no difference between the sample groups; the aim is usually to reject it. This indirect approach is intended to reduce the likelihood of false positive results. |
| Out-of-bag (OOB) error | The OOB error is used to describe the predictive power of random forest models. It is estimated for each tree on the samples that were not drawn into that tree's bootstrap sample. |
| Over-representation analysis (ORA), functional class scoring (FCS), pathway topology (PT), mummichog, gene set enrichment analysis (GSEA) | Different algorithms for performing pathway analyses. The identification of metabolites is not necessary for the mummichog and GSEA algorithm [ |
| Overfitting | Overinterpretation of a data set. Correlations are recognized that are based on noise signals and not on real differences between the samples. |
| Permutation test | The class labels are swapped randomly, and a new classification model is calculated on this basis. This new model should not be able to achieve a good separation of the different groups of samples [ |
| Principal component analysis (PCA) | Unsupervised method to show differences and similarities in various samples by orthogonal transformation. This approach is often used to get a first overview of the data [ |
| Partial least squares discriminant analysis (PLS-DA), orthogonal PLS-DA (OPLS-DA), sparse PLS-DA (SPLS-DA) | Fast and simple supervised method, which sometimes tends to overfit. Therefore, careful validation should take place. OPLS-DA and SPLS-DA are extensions of a classical PLS-DA [ |
| R2 and Q2 | Parameters for the assessment of supervised methods to identify possible overfitting. R2 (goodness of fit) describes the proportion of the explained variance in the total variance. R2 can have a maximum value of 1. In the ideal case, a model should achieve the largest possible R2 value. Q2 (goodness of prediction) describes the prediction accuracy of a model and is obtained from a cross validation. Q2 can have a maximum of 1 [ |
| Regions of interest (ROI) | ROI describe a relevant measuring range that contains a supposed signal [ |
| Random forests (RF) | RF are based on decision trees and can also be used for very noisy data and small sample groups. They are robust to overfitting and outliers, but equally large class sizes must be ensured. The visualization is quite complex, so VIP plots are often used to extract the most relevant features [ |
| Random oversampling (ROS) and synthetic minority over-sampling technique (SMOTE) | For some multivariate analysis methods, such as RF or SVM, class sizes must be the same. This requirement can be met either by excluding individual samples (undersampling) or by performing ROS, i.e., by taking individual samples into account several times or by calculating new samples synthetically. For the latter, the SMOTE algorithm is suitable. In brief, for a given feature, the difference between the intensities or peak areas found in two samples of the same class is calculated. The result is multiplied by a random number between 0 and 1, and the lower of the two feature values is then added. The new value lies between the feature values of the two known samples [ |
| Scaling | Ensures the comparability of the different features with each other, since signals with strong intensities, compared to signals with lower intensities, otherwise have a greater influence [ |
| Support vector machines (SVM) | SVM is a kernel method. Robust to overfitting and outliers, sensitive to imbalanced datasets. High calculation effort, can take some time with many samples and features [ |
| Transformation | Ensures that heteroscedasticity and skewness of the data are reduced to achieve an almost normal distribution of the data [ |
| Underfitting | The opposite of overfitting, which occurs when relevant features are not taken into account. |
| VIP | Variable importance in projection, reflects the influence of a feature on a model. Promising features have a VIP score >1. However, this limit should not be seen too narrowly. Features with a VIP score <0.5 are irrelevant for a model [ |
| Wavelet transformation | Transformation method developed by Morlet and Grossmann. In a way, an extension of a Fourier transform, which can also be used for signals with different lengths and frequencies, and which enables time and location to be resolved [ |
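
The SMOTE interpolation described in the table above can be sketched in a few lines. This is a minimal illustration for a single pair of samples and follows the per-feature description given there; the full algorithm also chooses the partner sample among the nearest neighbours of the same class. The function name `smote_point` and the peak areas are assumptions made for this example.

```python
import random

def smote_point(sample_a, sample_b, rng=random):
    """Create one synthetic sample between two samples of the same class.

    For each feature: take the difference between the two values,
    multiply it by a random number in [0, 1], and add the lower value.
    """
    synthetic = []
    for a, b in zip(sample_a, sample_b):
        low, high = min(a, b), max(a, b)
        synthetic.append(low + rng.random() * (high - low))
    return synthetic

# Hypothetical peak areas of three features in two minority-class samples
s1 = [1200.0, 50.0, 980.0]
s2 = [1500.0, 80.0, 900.0]

# A seeded generator keeps the illustration reproducible
new_sample = smote_point(s1, s2, random.Random(0))
```

Each value of `new_sample` lies between the corresponding values of `s1` and `s2`, so the synthetic sample stays inside the range spanned by the two measured samples.
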
Figure 3. Schematic representation for setting up a feature matrix. In this way LC-MS data are reduced and converted into a tabular notation that can be processed by many different software applications.
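
As a rough sketch of how such a feature matrix could be assembled, the following example groups peaks from several samples by rounded retention time and m/z. The simple binning, the tolerances `rt_step` and `mz_step`, the zero fill for missing peaks, and the example peak lists are all assumptions for illustration; dedicated software uses more elaborate peak picking and alignment.

```python
def build_feature_matrix(peak_lists, rt_step=0.1, mz_step=0.01):
    """Turn per-sample peak lists into a samples x features table.

    peak_lists: {sample_name: [(retention_time_min, mz, intensity), ...]}
    Peaks whose rounded (RT, m/z) bins coincide are treated as one feature.
    """
    binned = {}
    features = set()
    for sample, peaks in peak_lists.items():
        row = {}
        for rt, mz, intensity in peaks:
            key = (round(rt / rt_step), round(mz / mz_step))
            row[key] = row.get(key, 0.0) + intensity
        binned[sample] = row
        features.update(row)

    columns = sorted(features)
    header = [f"RT{k[0] * rt_step:.1f}_mz{k[1] * mz_step:.2f}" for k in columns]
    # Assumption: a feature not detected in a sample is filled with 0.0
    matrix = {s: [row.get(k, 0.0) for k in columns] for s, row in binned.items()}
    return header, matrix

# Hypothetical peak lists of two samples
peaks = {
    "sample_1": [(5.02, 301.141, 120000.0), (7.50, 455.290, 30000.0)],
    "sample_2": [(5.04, 301.142, 90000.0)],
}
header, matrix = build_feature_matrix(peaks)
```

Here the two peaks near 5.0 min and m/z 301.14 end up in the same column, while the second feature is missing in `sample_2`.
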
Figure 4. Representation of the difference between (a) homoscedasticity and (b) heteroscedasticity. (c) The same data set as in (b), after log transformation without mean centering.
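
The effect of a log transformation on heteroscedastic data can be illustrated with a small numerical sketch. The two intensity series are invented so that their spread grows in proportion to the signal level, and a plain log10 transform is assumed (positive values only, no offset for zeros).

```python
import math
import statistics

# Hypothetical replicate intensities: same relative spread, 100-fold level
low = [100.0, 110.0, 90.0, 105.0, 95.0]
high = [10000.0, 11000.0, 9000.0, 10500.0, 9500.0]

def log_transform(values):
    """Apply log10 to each value (assumes strictly positive intensities)."""
    return [math.log10(v) for v in values]

# Raw data: the standard deviation scales with the intensity
sd_raw_low = statistics.stdev(low)
sd_raw_high = statistics.stdev(high)

# Log-transformed data: both series show the same spread
sd_log_low = statistics.stdev(log_transform(low))
sd_log_high = statistics.stdev(log_transform(high))
```

Before the transformation the standard deviations differ by a factor of 100, afterwards they are practically identical, which is the behaviour sketched in panels (b) and (c).
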
Figure 5. Influence on signal intensities after a scaling procedure. (a) Representation of the signal intensities of two features using raw data. In this data set, feature 2 would have a greater impact on the result due to the higher intensity. (b) After a scaling process, all features have normalized signal intensities and are thus equally taken into account.
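
A minimal sketch of two common scaling variants, auto scaling (division by the standard deviation) and Pareto scaling (division by its square root), both after mean centering. The function name `scale_feature` and the intensities are assumptions for illustration.

```python
import math
import statistics

def scale_feature(column, method="auto"):
    """Mean-center one feature column and divide by a scaling factor."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)
    if sd == 0:
        # A constant feature carries no information; return zeros
        return [0.0 for _ in column]
    if method == "auto":
        divisor = sd
    elif method == "pareto":
        divisor = math.sqrt(sd)
    else:
        raise ValueError(f"unknown scaling method: {method}")
    return [(v - mean) / divisor for v in column]

# Hypothetical raw intensities: feature 2 is 100 times more intense
feature_1 = [10.0, 12.0, 8.0, 11.0, 9.0]
feature_2 = [1000.0, 1200.0, 800.0, 1100.0, 900.0]

auto_1 = scale_feature(feature_1, "auto")
auto_2 = scale_feature(feature_2, "auto")
pareto_2 = scale_feature(feature_2, "pareto")
```

After auto scaling both features have the same values and therefore the same weight. Pareto scaling only partly removes the intensity difference, so larger signals keep somewhat more influence.
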
Figure 6. Representation of overfitting. The data were subjected to sum normalization and Pareto scaling. (a) PCA scores plot of a 10-fold injection from the same vial of a plant extract, measured by LC-MS. (b) The same data set as in (a), this time evaluated using PLS-DA. The individual measurements were alternately assigned to two groups. At first glance, the two groups appear clearly separated. Only the result of the CV indicates that the model is overfitted.
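
The same effect can be reproduced with a toy example. The sketch below does not use PLS-DA or LC-MS data; it assumes a simple nearest-centroid classifier and pure random numbers, with class labels assigned alternately as in the figure. It compares the accuracy on the training data with the accuracy from a leave-one-out cross validation.

```python
import random

def centroid(rows):
    """Feature-wise mean of a list of samples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(train_x, train_y, sample):
    """Assign the class whose training centroid is closest to the sample."""
    cents = {
        label: centroid([x for x, y in zip(train_x, train_y) if y == label])
        for label in set(train_y)
    }
    return min(cents, key=lambda label: sq_dist(sample, cents[label]))

rng = random.Random(1)
n_samples, n_features = 10, 1000
# Pure noise: there is no real difference between the "groups"
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]  # labels assigned alternately

# Accuracy when the model is tested on its own training data
train_acc = sum(predict(X, y, x) == label for x, label in zip(X, y)) / n_samples

# Leave-one-out cross validation: each sample is predicted by a model
# that has never seen it
hits = 0
for i in range(n_samples):
    hits += predict(X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i]) == y[i]
cv_acc = hits / n_samples
```

With many features and few samples, the training accuracy suggests a perfect separation, whereas the cross-validated accuracy is expected to be no better than chance. This is the pattern that reveals overfitting.
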
Figure 7. Schematic representation for the calculation of RF analyses. (a) Relationship between two different features within different sample groups. (b) Potential decision tree for the classification of the sample groups.
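
A single node of such a decision tree can be sketched as a search for the threshold that best separates the sample groups on one feature. The use of the Gini impurity as split criterion is an assumption (it is a common choice), as are the function names and the example values. A random forest repeats this kind of search for many trees, each grown on a bootstrap sample with a random subset of features.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted Gini impurity) of the best binary split."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue  # no threshold fits between identical values
        threshold = (v1 + v2) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical intensities of one feature in two sample groups
intensities = [1.0, 1.5, 0.5, 3.0, 2.5, 3.5]
groups = ["A", "A", "A", "B", "B", "B"]
threshold, impurity = best_split(intensities, groups)
```

For these values the best threshold lies midway between the highest value of group A and the lowest value of group B, and both resulting child nodes are pure.
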
Figure 8. Examples of ROC curves created with the MetaboAnalyst software. (a) ROC curve of a classifier (feature) that cannot differentiate between the sample groups. (b) ROC curve of a classifier that enables an optimal distinction between two sample groups.
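
The two extreme cases shown in the figure can be reproduced with a small sketch that computes the ROC points and the area under the curve (AUC) by the trapezoidal rule. This is not the MetaboAnalyst implementation; the function names and the scores are assumptions for illustration.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs.

    labels: 1 for the positive group, 0 for the negative group.
    Samples with identical scores are handled as one threshold step.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [1, 1, 1, 0, 0, 0]
# Feature that separates the groups perfectly
auc_perfect = auc(roc_points([0.9, 0.8, 0.7, 0.3, 0.2, 0.1], labels))
# Feature with identical values in all samples: no discrimination
auc_uninformative = auc(roc_points([0.5] * 6, labels))
```

An AUC of 1 corresponds to the optimal classifier in panel (b), while an AUC of 0.5 corresponds to the diagonal line of a classifier that cannot distinguish the groups, as in panel (a).
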
Figure 9. Signal used to classify the two groups from Section 3.2.2. (a) Box plot of the analyzed feature, which indicates a supposedly significant difference. (b) Peak intensities of the individual injections without normalization or scaling. The y-axis shows that the difference can only be slight. (c) EICs of the analyzed signal, which also demonstrate that there is no significant difference.
Overview of the most common freely accessible databases.
| Database | Provider | Availability of LC-MS/MS Reference Spectra | Reference |
|---|---|---|---|
| ChemSpider | Royal Society of Chemistry, London, UK | Experimental LC-MS/MS spectra available for some compounds | [ |
| FooDB | Canadian Institutes of Health Research, Canada Foundation for Innovation, Ottawa, Canada/The Metabolomics Innovation Centre, Edmonton, AB, Canada | Experimental LC-MS/MS spectra are available for numerous compounds; where no experimental spectra are available, in silico spectra can be used | [ |
| HMDB (Human Metabolome Database) | Canadian Institutes of Health Research, Canada Foundation for Innovation, Ottawa, Canada/The Metabolomics Innovation Centre, Edmonton, AB, Canada | Experimental LC-MS/MS spectra are available for numerous compounds; where no experimental spectra are available, in silico spectra can be used | [ |
| KNApSAcK | Nara Institute of Science and Technology, Nara, Japan | No, but helpful links to further primary literature | [ |
| MoNA (Mass Bank of North America) | Fiehn Lab, Davis, CA, USA | Experimental LC-MS/MS spectra for numerous compounds are available | [ |
| LipidMaps | Cardiff University, Cardiff, UK/Babraham Institute, Cambridge, UK/University of California, San Diego, CA, USA | Spectra from other databases are partially embedded, and there is also the option of predicting MS/MS spectra for certain lipid classes | [ |
| MassBank | Mass Spectrometry Society of Japan, Tokyo, Japan | Experimental LC-MS/MS spectra for numerous compounds are available | [ |
| METLIN (Metabolite and Chemical Entity Database) | The Scripps Research Institute, La Jolla, CA, USA | Experimental LC-MS/MS spectra for numerous compounds are available | [ |
| PubChem | National Center for Biotechnology Information, Bethesda, MD, USA | Spectra from other databases are partially embedded | [ |