Nikolas Kessler, Anja Bonte, Stefan P Albaum, Paul Mäder, Monika Messmer, Alexander Goesmann, Karsten Niehaus, Georg Langenkämper, Tim W Nattkemper.
Abstract
We present results of a machine learning approach to classifying GC-MS data from wheat grains grown under different farming systems. The aim is to investigate whether learning algorithms can classify GC-MS data as originating from conventionally or organically grown samples while taking different cultivars into account. The motivation is twofold: increased demand for organic food in post-industrialized societies and the need to prove organic food authenticity. Our data set comprises up to 11 wheat cultivars, each cultivated in both farming systems, organic and conventional, over 3 years. More than 300 GC-MS measurements were recorded and subsequently processed and analyzed in the MeltDB 2.0 metabolomics analysis platform, which is briefly outlined in this paper. We further describe how unsupervised (t-SNE, PCA) and supervised (SVM) methods can be applied for sample visualization and classification. Our results clearly show that the year has the strongest and the wheat cultivar the second-strongest influence on the metabolic composition of a sample. We also show that, for a given year and cultivar, organic and conventional cultivation can be distinguished by machine-learning algorithms.
Keywords: computational metabolomics; food authentication; machine learning; metabolome informatics; metabolomics; organic farming; statistics
Year: 2015 PMID: 25853128 PMCID: PMC4371749 DOI: 10.3389/fbioe.2015.00035
Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN: 2296-4185
Parameters that were applied for preprocessing tools.
| Tool | Description | Parameter | Value |
|---|---|---|---|
| Warped peak detection | Mexican-wavelet based peak detection, which can be rerun locally (at certain RT). | FWHM | 7 |
| | | SN | 10 |
| RISimple | Detects and tags retention indices based on their characteristic spectra. | Ion filter | 57, 71, 85, 99 |
| Multiple profiling | Gives peaks across chromatograms a common TAG if they are similar. | Retention time window | 20–35 s |
| Reference list | Annotates peaks that match reference spectra, uses dot-product. | RT Window | 20 s |
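
The reference-list step in the table above matches peaks to reference spectra by dot product within a retention time window of 20 s. The following is a minimal sketch of that idea, not MeltDB's implementation. It assumes that spectra are stored as m/z-to-intensity mappings, that the dot product is normalized (cosine score), and that the window is a symmetric tolerance around the reference retention time. The function names, the score threshold of 0.8, and the example spectra are all hypothetical.

```python
import math


def cosine_score(spec_a, spec_b):
    """Normalized dot product of two spectra given as {m/z: intensity}."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def annotate(peak_rt, peak_spec, references, rt_window=20.0, min_score=0.8):
    """Return the best-matching reference name, or None if nothing qualifies.

    rt_window is treated as a +/- tolerance in seconds (an assumption);
    min_score is a hypothetical acceptance threshold.
    """
    best_name, best_score = None, min_score
    for name, (ref_rt, ref_spec) in references.items():
        if abs(peak_rt - ref_rt) > rt_window:
            continue  # outside the retention time window
        score = cosine_score(peak_spec, ref_spec)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical reference list: name -> (retention time in s, spectrum)
references = {
    "compound_A": (600.0, {73: 100, 147: 40, 217: 20}),
    "compound_B": (900.0, {57: 100, 71: 60, 85: 30}),
}
peak_spectrum = {73: 95, 147: 42, 217: 18}

match_near = annotate(610.0, peak_spectrum, references)  # within 20 s of compound_A
match_far = annotate(650.0, peak_spectrum, references)   # 50 s away from compound_A
```

Restricting candidates by retention time before scoring keeps spectrally similar compounds that elute at different times from being confused with each other.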
Number of samples for each combination of factors “farming system,” “year,” and “cultivar”.
| Farming system | Year | Antonius | Caphorn | CCP | DJ 9714 | Mont Calme 245 | Probus | Rouge de Bordeaux | Runal | Sandomir | Scaro | Titlis | Σ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Conventional | 2007 | 7 | 8 | 7 | 7 | 8 | 7 | 7 | 8 | 6 | 6 | 8 | 160 |
| Organic | 2007 | 8 | 7 | 7 | 7 | 7 | 8 | 8 | 7 | 8 | 7 | 7 | |
| Conventional | 2009 | 8 | | | | | | | | | | | 16 |
| Organic | 2009 | 8 | | | | | | | | | | | |
| Conventional | 2010 | 8 | 8 | 7 | 8 | 8 | 7 | 8 | 7 | 8 | | | 137 |
| Organic | 2010 | 8 | 8 | 8 | 8 | 8 | 7 | 7 | 7 | 7 | | | |
Figure 1. The principal component analysis on the entire dataset, comprising all samples across all years, cultivars, and treatments, shows that the first two components mainly separate samples by the factor year. A separation by the factor farming system is not possible.
Figure 2. A principal component analysis performed on a dataset from 1 year only will mainly cluster samples by their cultivar, regardless of the applied farming system. This PCA is based on samples from the year 2007.
Figure 3. Similar to Figure .
Figure 4. Plotting samples from one cultivar (here “Runal”) along principal components two and four shows that a separation by farming system might be possible, even though the main variance is caused by the factor year.
Figure 5. The t-SNE method applied to all samples results in clusters and subclusters formed according to the factors year and cultivar, respectively.
Figure 6. The same t-SNE result as in Figure .
Results of the support vector machines, trained and tested on different subsets of all samples.
| Trained on | Tested on | Test samples (n) | Accuracy | NIR | P-value (Acc > NIR) | Sensitivity | Specificity | PPV | NPV |
|---|---|---|---|---|---|---|---|---|---|
| 2007 | 2007 | 31 | 0.9677 | 0.52 | 3.75e-08 | 1 | 0.9375 | 0.9375 | 1 |
| 2010 | 2010 | 26 | 0.8846 | 0.5 | 4.40e-05 | 0.9231 | 0.8462 | 0.8571 | 0.9167 |
| 2007 | 2010 | 137 | 0.5547 | 0.5 | 0.1333 | 0.2754 | 0.8382 | 0.6333 | 0.5327 |
| 2010 | 2007 | 160 | 0.5562 | 0.51 | 0.1177 | 0.8101 | 0.3086 | 0.5333 | 0.625 |
| 2007, 2009, 2010 | 2007, 2009, 2010 | 62 | 0.9032 | 0.5 | 1.49e-11 | 0.9032 | 0.9032 | 0.9032 | 0.9032 |
Measures are given for the evaluation results and are based on the confusion matrix for classification as organic or conventional farming system. NIR, no-information rate; PPV, positive predictive value; NPV, negative predictive value.
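
To make the relationship between these measures and the confusion matrix concrete, the sketch below derives them from four cell counts. The counts used in the example (15 true positives, 0 false negatives, 1 false positive, 15 true negatives) are not stated in the source. They are inferred from the values reported for the model trained and tested on 2007 (31 test samples), and which farming system counts as the positive class is left open. The p-value is assumed to come from a one-sided exact binomial test of accuracy against the no-information rate, the convention used for example by `confusionMatrix` in R's caret package; the source does not say which tool produced it. The function name `binary_metrics` is hypothetical.

```python
from math import comb


def binary_metrics(tp, fn, fp, tn):
    """Derive summary measures from a 2x2 confusion matrix.

    Assumes both classes are present and at least one sample is
    predicted for each class, so that no denominator is zero.
    """
    n = tp + fn + fp + tn
    correct = tp + tn
    # No-information rate: accuracy of always predicting the larger class.
    nir = max(tp + fn, fp + tn) / n
    # One-sided exact binomial test: P(X >= correct) for X ~ Binomial(n, nir).
    p_value = sum(
        comb(n, k) * nir**k * (1 - nir) ** (n - k)
        for k in range(correct, n + 1)
    )
    return {
        "n": n,
        "accuracy": correct / n,
        "nir": nir,
        "p_value": p_value,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Counts inferred from the reported 2007/2007 row (an assumption, see above).
metrics = binary_metrics(tp=15, fn=0, fp=1, tn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.4g}")
```

With these inferred counts the sketch gives the same accuracy, NIR, sensitivity, specificity, PPV, and NPV as the table, and a p-value that agrees with the reported 3.75e-08 to about two significant figures.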