Abstract
In biological mass spectrometry, raw instrumental data need to be converted into meaningful theoretical models. Several data processing and data evaluation steps are required to arrive at the final results. These operations are often difficult to reproduce because they depend on overly specific computing platforms. This effect, known as 'workflow decay', can be diminished by using a standardized informatics infrastructure. Thus, we compiled an integrated platform, which contains ready-to-use tools and workflows for mass spectrometry data analysis. Apart from general unit operations, such as peak picking and identification of proteins and metabolites, we put a strong emphasis on the statistical validation of results and on Data Mining. MASSyPup64 includes, among others, the OpenMS/TOPPAS framework, the Trans-Proteomic-Pipeline programs, the ProteoWizard tools, X!Tandem, Comet and SpiderMass. The statistical computing language R is installed with packages for MS data analysis, such as XCMS/metaXCMS and MetabR. The R package Rattle provides user-friendly access to multiple Data Mining methods. Further, we added the non-conventional spreadsheet program teapot for editing large data sets and a command line tool for transposing large matrices. Individual programs, console commands and modules can be integrated using the Workflow Management System (WMS) taverna. We explain the useful combination of the tools with practical examples: (1) a workflow for protein identification and validation, with subsequent Association Analysis of peptides, (2) cluster analysis and Data Mining in targeted Metabolomics, and (3) raw data processing, Data Mining and identification of metabolites in untargeted Metabolomics. Association Analyses reveal relationships between variables across different sample sets. We present their application for finding co-occurring peptides, which can be used for targeted proteomics, the discovery of alternative biomarkers and protein-protein interactions.
Data Mining derived models displayed a higher robustness and accuracy for classifying sample groups in targeted Metabolomics than cluster analyses. Random Forest models provide not only predictive models, which can be deployed for new data sets, but also the variable importance. We demonstrate that the latter is especially useful for tracking down significant signals and affected pathways in untargeted Metabolomics. Thus, Random Forest modeling supports the unbiased search for relevant biological features in Metabolomics. Our results clearly demonstrate the importance of Data Mining methods for disclosing non-obvious information in biological mass spectrometry. The application of a Workflow Management System and the integration of all required programs and data in a consistent platform make the presented data analysis strategies reproducible for non-expert users. The simple remastering process and the Open Source licenses of MASSyPup64 (http://www.bioprocess.org/massypup/) enable the continuous improvement of the system.
Keywords: Association analyses; Bioinformatics; Computational mass spectrometry; Data Mining; Metabolomics; Model building; Proteomics; Random forest trees; Workflow decay; Workflow management systems
Year: 2015 PMID: 26618079 PMCID: PMC4655102 DOI: 10.7717/peerj.1401
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. Universal workflow of mass spectrometry data analyses.
Raw data need to be processed to extract informative features and to create statistically valid models.
Figure 2. Building of descriptive and predictive models.
Data can be transformed either into descriptive models, which integrate all available data, or into predictive models. For generating predictive models, one part of the data is used to train the model and another part to validate it; finally, the model is tested with the remaining data to estimate error rates.
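The three-way partition described in this caption can be sketched in a few lines of Python. This is only an illustration, not the procedure used by the authors: the function name `partition`, the 70/15/15 proportions and the fixed seed are assumptions chosen for the example.

```python
import random


def partition(samples, train=0.70, validate=0.15, seed=42):
    """Shuffle the samples and split them into training, validation and testing parts.

    The proportions are illustrative assumptions; whatever is left after the
    training and validation shares becomes the testing part.
    """
    if not 0 < train < 1 or not 0 <= validate < 1 or train + validate >= 1:
        raise ValueError("train and validate must leave a non-empty testing share")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_train = int(round(train * len(shuffled)))
    n_valid = int(round(validate * len(shuffled)))
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )


# Hypothetical data set of 20 sample indices
train_set, valid_set, test_set = partition(range(20))
```

Fixing the seed keeps the partition identical between runs, which matters when error rates from different models are to be compared on the same testing part.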
Programs installed on MASSyPup64.
| Program | Description |
|---|---|
| devx | Development tools for fatdog64-701, such as C/C++, FORTRAN compiler, header libraries |
| jdk | Java Development Kit |
| java | Version 1.7.0_65 |
| Java(TM) SE Runtime Environment | Build 1.7.0_65-b17 |
| Java HotSpot(TM) 64-Bit Server VM | Build 24.65-b04, mixed mode |
| Python | Version 2 (python) and version 3 (python3) |
| | |
| taverna | Java-based workflow management with additional functions (condition, iteration) |
| OpenMS/TOPPAS | Framework for mass spectrometry data processing and GUI-supported design of pipelines |
| | |
| ProteoWizard tools | MS data conversion and tools, see /usr/local/massypup64/proteowizard-tools.txt; without vendor libraries |
| transpose | Transposing matrix data from the command line, source code in /usr/local/bin/transpose.c. Currently compiled for a default maximum matrix size of 10,000 × 10,000 |
| teapot/fteapot | Spreadsheet program supporting three-dimensional data sets. The manual “teapot.pdf” is located in /usr/local/doc |
| | |
| MZmine | Java-based MS analysis program, focused on metabolomics |
| | |
| Comet | MS/MS protein search algorithm |
| OMSSA | MS/MS protein search algorithm |
| X!Tandem | MS/MS protein search algorithm |
| TPP 4.8 | Trans-Proteomic-Pipeline |
| | |
| XCMS/metaXCMS | Metabolomics |
| MetabR | Statistical analysis of metabolomics results |
| Rattle | Data Mining |
| | |
| massXpert | Analysis of polymer spectra |
| SpiderMass | Target DB creation/matching, online searches and formula generation |
| | |
| HelloPhidget | Test tool for detection of connected Phidgets (prototyping of MSI) |
| OpenMZxy | Control of Phidget imaging robot |
| MSI.R | MSI analysis with R scripts |
| | |
| GGobi | Interactive data visualization and exploration tool |
Figure 3. Proteomics workflow with validation of hits by PeptideProphet/ProteinProphet and final extraction of hits for proteins of potential interest.
Figure 4. Plot of estimated sensitivity vs. error for sample 1DM, as delivered by the taverna workflow.
Figure modified from original program output for improved readability.
Identified putative peroxidases, after PeptideProphet/ProteinProphet validation.
| Sample | Accession | Protein Prob. | Coverage | Unique peps | Description |
|---|---|---|---|---|---|
| 2DM | B4FBY8 | 0.6181 | 5.9 | 1 | Peroxidase |
| 1DM | B4FK72 | 1.0000 | 2.7 | 2 | Peroxidase |
| 1DM | B6T173 | 1.0000 | 12.7 | 7 | Peroxidase |
| 1DM | K7TID5 | 1.0000 | 39.5 | 24 | Peroxidase |
| 1DM | K7TID0 | 0.9937 | 9.5 | 1 | Peroxidase |
| 1DM | B4FY83 | 0.9890 | 3.7 | 2 | Peroxidase |
| 1DM | B4FNL8 | 0.0000 | 0 | | Peroxidase |
| 1DM | B6SI04 | 0.0000 | 0 | | Peroxidase |
| 1DM | K7VNV5 | 0.0000 | 0 | | Peroxidase |
| 1DR | K7TID5 | 1.0000 | 17.7 | 7 | Peroxidase |
| 1DR | B6T173 | 0.9995 | 7.1 | 2 | Peroxidase |
| 1DR | B4FSW5 | 0.9990 | 2.9 | 1 | Peroxidase |
| 1DR | B4FY83 | 0.9990 | 3.7 | 1 | Peroxidase |
| 1DR | K7TMB4 | 0.9990 | 3.3 | 1 | Peroxidase |
| 1DR | Q6JAH6 | 0.6603 | 7.1 | 1 | Glutathione peroxidase |
| 1DR | K7V8K5 | 0.5743 | 3.0 | 1 | Peroxidase |
| 1DR | B4FNI0 | 0.3475 | 5.4 | 1 | Peroxidase |
| 1DR | A0A0B4J371 | 0.0000 | 0 | | Peroxidase |
| 1DR | B4FBC8 | 0.0000 | 0 | | Peroxidase |
| 1DR | B4G0X5 | 0.0000 | 0 | | Peroxidase |
| 1DR | B6TWB1 | 0.0000 | 0 | | Peroxidase |
| 1DR | C0PKS1 | 0.0000 | 0 | | Peroxidase |
| 1DR | Q9ZTS6 | 0.0000 | 0 | | Peroxidase K (Fragment) |
Figure 5. Associated peptides across the samples.
Association Analysis of peptides across three samples.
‘x’ stands for peptides present in the sample and ‘o’ for peptides missing in the sample.
| Peptide | 2DM | 1DM | 1DR | Accession | Description |
|---|---|---|---|---|---|
| DSACSAGGLEYEVPSGRR | o | x | x | K7TID5 | Peroxidase |
| TDPSVDPAYAGHLK | o | x | x | B6T173 | Peroxidase |
| TVSCADVLAFAAR | o | x | x | B4FY83 | Peroxidase |
| VQVLTGDEGEIR | o | x | x | K7TID5 | Peroxidase |
| AFVHGDGDLFSR | x | x | x | B6SRJ2 | Senescence-inducible chloroplast stay-green protein |
| LFLNLQKEMNSVMVTRK | o | x | x | A0A096PYN5 | 30S ribosomal protein S2, chloroplastic |
| GSGGGGGGGGGQGQSR | x | x | x | A0A096RDU5 | Uncharacterized protein |
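To make the idea of co-occurring peptides concrete, the presence/absence pattern of four peptides from the table above can be turned into pairwise support and confidence values. This is a simplified, pairwise sketch and not the Association Analysis implementation used in the workflow; the function names and the `min_support` threshold are assumptions made for illustration.

```python
from itertools import combinations

SAMPLES = ("2DM", "1DM", "1DR")

# Samples in which each peptide was found ('x' in the table above)
PRESENCE = {
    "DSACSAGGLEYEVPSGRR": {"1DM", "1DR"},
    "TDPSVDPAYAGHLK": {"1DM", "1DR"},
    "TVSCADVLAFAAR": {"1DM", "1DR"},
    "AFVHGDGDLFSR": {"2DM", "1DM", "1DR"},
}


def support(a, b):
    """Fraction of all samples that contain both peptides."""
    return len(PRESENCE[a] & PRESENCE[b]) / len(SAMPLES)


def confidence(a, b):
    """Confidence of the rule a -> b: share of samples with a that also contain b."""
    return len(PRESENCE[a] & PRESENCE[b]) / len(PRESENCE[a])


def pairwise_rules(min_support=0.5):
    """List peptide pairs whose joint support reaches the (assumed) threshold."""
    rules = []
    for a, b in combinations(sorted(PRESENCE), 2):
        s = support(a, b)
        if s >= min_support:
            rules.append((a, b, s, confidence(a, b), confidence(b, a)))
    return rules


for a, b, s, c_ab, c_ba in pairwise_rules():
    print(f"{a} <-> {b}: support={s:.2f}, conf(a->b)={c_ab:.2f}, conf(b->a)={c_ba:.2f}")
```

With only three samples the values are coarse, but the asymmetry of confidence is already visible: a peptide found in every sample is implied by any other peptide, while the reverse rule is weaker.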
Statistical analysis of targeted metabolomics data with MetabR.
Bold values are significant with p-values <0.01 (Tukey HSD).
| Metabolite | Fast-control Fold | Fast-control p-value | InsNeut-control Fold | InsNeut-control p-value | InsNeut-Fast Fold | InsNeut-Fast p-value |
|---|---|---|---|---|---|---|
| ATP | 1.27 | 0.38 | 1.06 | 0.93 | 0.83 | 0.59 |
| Citraconate | 1.08 | 0.25 | 1.05 | 0.56 | 0.97 | 0.81 |
| Citrate | 1.22 | 0.08 | 1.00 | 0.96 | 0.82 | 0.13 |
| Dihexose | 0.08 | | 0.59 | 0.93 | 7.22 | |
| Inosine | 0.74 | 0.33 | 0.91 | 0.58 | 1.24 | 0.89 |
| Lactate | 0.87 | 0.14 | 0.99 | 0.97 | 1.14 | 0.20 |
| Pyruvate | 1.20 | 0.19 | 0.97 | 0.95 | 0.81 | 0.11 |
| 2-Oxoglutarate | 0.93 | 0.75 | 1.51 | | 1.63 | |
| 1-Methyladenosine | 1.20 | 0.99 | 1.13 | 0.96 | 0.95 | 0.99 |
| Glutamine | 0.68 | 0.03 | 2.51 | | 3.71 | |
| Guanosine | 0.76 | 0.22 | 0.83 | 0.26 | 1.09 | 0.99 |
| O-Acetyl-L-serine | 0.59 | 0.30 | 2.13 | 0.11 | 3.62 | |
| Glucosamine | 1.36 | 0.22 | 2.98 | | 2.20 | |
| Thiamine | 0.54 | 0.14 | 0.89 | 1.00 | 1.66 | 0.14 |
Figure 6. Hierarchical Cluster Analysis (HCA) of targeted metabolomics from chicken groups.
Figure modified from original program output for improved readability.
Figure 7. (A) K-Means clustering of the normalized chicken data set, considering three clusters, (B) SSE plot for estimating the cluster number.
Figure 8. (A) Silhouette plot and (B) Silhouette plot based clusters.
Figure 9. Estimation of the number of clusters using the Caliński-Harabasz index.
Figure 10. (A) Affinity propagation (AP) clustering and (B) AP clustering with transformed data matrix.
Figure 11. MClust analysis for testing different probability models.
Comparison of methods for estimating the number of clusters in the targeted metabolomics dataset of three chicken groups.
| Method | No. of clusters |
|---|---|
| K-Means/SSE | n.a. |
| Silhouette Plot | 2 |
| Caliński–Harabasz | 3 |
| Affinity Propagation clustering | 4 |
| MClust algorithm | n.a. |
Error Matrix for predictive models, which were developed for the classification of chicken groups, based on targeted metabolomics data.
| | TRAINING | | | VALIDATION | | | TESTING | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | | | | | | | | | | |
| | Control | Fast | InsNeut | Control | Fast | InsNeut | Control | Fast | InsNeut | Error |
| Control | | 0 | 0 | 0 | 0 | | | 0 | 0 | 0.0 |
| Fast | 0 | | 0 | 0 | 0 | 0 | | | 0 | 0.5 |
| InsNeut | 0 | 0 | | 0 | 0 | | 0 | 0 | | 0.0 |
| | | | | | | | | | | |
| | Control | Fast | InsNeut | Control | Fast | InsNeut | Control | Fast | InsNeut | Error |
| Control | | 0 | 0 | | 0 | 0 | | 0 | 0 | 0.0 |
| Fast | 0 | | 0 | 0 | 0 | 0 | 0 | | 0 | 0.0 |
| InsNeut | 0 | 0 | | 0 | 0 | | 0 | 0 | | 0.0 |
| | | | | | | | | | | |
| | Control | Fast | InsNeut | Control | Fast | InsNeut | Control | Fast | InsNeut | Error |
| Control | | 0 | 0 | | 0 | 0 | | 0 | 0 | 0.0 |
| Fast | 0 | | 0 | 0 | 0 | 0 | 0 | | 0 | 0.0 |
| InsNeut | 0 | 0 | | 0 | 0 | | 0 | 0 | | 0.0 |
| | | | | | | | | | | |
| | Control | Fast | InsNeut | Control | Fast | InsNeut | Control | Fast | InsNeut | Error |
| Control | | 0 | 0 | | 0 | 0 | 0 | 0 | | 1.0 |
| Fast | 0 | | 0 | 0 | 0 | 0 | 0 | | 0 | 0.0 |
| InsNeut | 0 | 0 | | | 0 | | 0 | 0 | | 0.0 |
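The values 0.0, 0.5 and 1.0 that close each row of the error matrix above read as per-class error rates. A minimal Python sketch of how such rates follow from a confusion matrix is given here; it assumes that rows hold the actual classes and columns the predicted classes, and the counts in `TEST_COUNTS` are hypothetical, not taken from the study.

```python
LABELS = ("Control", "Fast", "InsNeut")

# Hypothetical testing counts: rows = actual class, columns = predicted class.
# One of two Fast samples is misclassified as Control.
TEST_COUNTS = [
    [2, 0, 0],
    [1, 1, 0],
    [0, 0, 2],
]


def class_errors(matrix, labels):
    """Per-class error: share of samples of a class that were assigned to another class."""
    errors = {}
    for i, label in enumerate(labels):
        total = sum(matrix[i])
        # A class without samples in this partition has no defined error
        errors[label] = None if total == 0 else 1 - matrix[i][i] / total
    return errors


def overall_error(matrix):
    """Share of all samples that lie off the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total


errors = class_errors(TEST_COUNTS, LABELS)
print(errors)                      # per-class error rates
print(overall_error(TEST_COUNTS))  # error over all testing samples
```

Returning `None` for an empty class avoids a division by zero when a partition happens to contain no samples of one group, which can occur with small data sets.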
Figure 12. Decision tree model for the classification of chicken samples.
Figure 13. Variable importance from the Random Forest Tree modeling for the classification of chicken samples.
Figure 14. TOPPAS pipeline for MS feature detection and alignment, with output of the consensus features in a text file.
Putative identifications for important variables for the classification of Arabidopsis, based on untargeted metabolomics profiles.
| m/z | Variable importance | Ionization mode | Name | Function/pathway | Mass error [mDa] |
|---|---|---|---|---|---|
| 463.105 | 2.65 | [M+H]+ | 7-Methylthioheptyl glucosinolate | Glucosinolate biosynthesis | 4.6 |
| 249.149 | 2.45 | [M+H]+ | Abscisic acid aldehyde | Abscisic acid biosynthesis | 0.1 |
| 249.149 | 2.45 | [M+Na]+ | Methyl Dihydrojasmonate | Aroma compound | 2.5 |
| 227.070 | 2.45 | [M+Na]+ | Tryptophan | Amino acid | −9.3 |
| 202.090 | 2.00 | [M+Na]+ | L-Phenylalanine | Amino acid | 5.8 |
| 647.159 | 2.00 | [M+Na]+ | Isorhamnetin-3-O-rutinoside | Flavonoid glycoside | 0.8 |
| 245.099 | 2.00 | [M+H]+ | Biotin | Vitamin | 4.0 |
| 631.162 | 2.00 | [M+Na]+ | Diosmin | Flavonoid glycoside | −1.3 |
| 387.025 | 2.00 | [M+Na]+ | Xanthosine 5’-phosphate | Purine metabolism | −6.0 |
| 329.068 | 2.00 | [M+Na]+ | Leucocyanidin | Flavonoid | 4.8 |
| 221.031 | 2.00 | [M+H]+ | Imidazole acetol phosphate | Amino acid biosynthesis | −0.9 |
| 633.141 | 1.73 | [M+Na]+ | Rutin | Flavonoid glycoside | −2.0 |
| 223.169 | 1.73 | [M+Na]+ | Lauric acid | Fatty acid | 2.4 |
| 595.160 | 1.73 | [M+H]+ | Flavonoid glycoside (isobars) | Flavonoid glycoside | −5.4 |
| 579.163 | 1.73 | [M+H]+ | Flavonoid glycoside (isobars) | Flavonoid glycoside | −7.7 |
| 263.090 | 1.73 | [M+H]+ | 2-(6’-Methylthio)hexylmalic acid | Glucosinolate biosynthesis | −6.2 |
| 271.132 | 1.73 | [M+Na]+ | Abscisic acid aldehyde | Abscisic acid biosynthesis | 1.3 |
| 195.065 | 1.73 | [M+H]+ | Ferulic acid | Cell wall formation | −0.4 |
| 251.021 | 1.73 | [M+Na]+ | Mevalonate 5-phosphate | Terpene biosynthesis | −7.9 |
| 403.064 | 1.73 | [M+Na]+ | O-Acetylserine | Amino acid biosynthesis | −6.1 |
| 331.158 | 1.73 | [M+H]+ | Gibberellin A5 | Plant hormone | 4.0 |
| 457.044 | 1.73 | [M+Na]+ | 5-Methylthiopentylglucosinolate | Glucosinolate biosynthesis | −7.1 |
| 317.175 | 1.73 | [M+H]+ | Gibberellin A9 | Plant hormone | 0.1 |
| 333.209 | 1.73 | [M+H]+ | Gibberellin A12 | Plant hormone | 2.6 |
| 333.209 | 1.73 | [M+Na]+ | 6,9-Octadecadienedioic acid | Fatty acid | 5.0 |
| 479.099 | 1.73 | [M+H]+ | Hyryl | Coenzyme (Riboflavin, FMN, FAD) | 5.1 |
| 479.099 | 1.73 | [M+Na]+ | Flavin mononucleotide (FMN) | Coenzyme | 5.1 |
| 625.174 | 1.41 | [M+H]+ | Narcisin | Flavonoid glycoside | −1.8 |
| 245.042 | 1.41 | [M+H]+ | 1,3,7-Trihydroxyxanthone | Xanthones | −2.7 |
| 611.157 | 1.41 | [M+H]+ | Rutin | Flavonoid glycoside | −3.9 |
| 601.147 | 1.41 | [M+Na]+ | Flavonoid glycoside (isobars) | Flavonoid glycoside | −5.5 |
| 369.123 | 1.41 | [M+Na]+ | Gibberellin (isobars) | Plant hormone | −8.2 |
| 349.058 | 1.41 | [M+H]+ | Inosinic acid | Ribonucleotide biosynthesis | 3.6 |
| 328.941 | 1.41 | [M+Na]+ | D-Ribulose 1,5-bisphosphate | Photosynthesis | −4.9 |
| 365.128 | 1.41 | [M+Na]+ | Abietin | Terpene | 7.5 |
| 369.124 | 1.41 | [M+Na]+ | Gibberellin (isobars) | Plant hormone | −7.3 |
| 311.187 | 1.41 | [M+H]+ | Botrydial | Terpene | 1.3 |
| 385.014 | 1.41 | [M+Na]+ | Xanthosine 5’-monophosphate | Purine metabolism | −2.3 |
| 433.118 | 1.41 | [M+H]+ | Apigenin glucoside | Flavonoid glycoside | 4.7 |
| 349.057 | 1.41 | [M+H]+ | Inosinic acid | Ribonucleotide biosynthesis | 2.5 |
| 221.042 | 1.41 | [M+H]+ | Imidazole acetol phosphate | Amino acid biosynthesis | 9.6 |
| 221.042 | 1.41 | [M+H]+ | 2-(3’-Methylthio)propylmalic acid | Glucosinolate biosynthesis | −7.0 |
| 221.042 | 1.41 | [M+Na]+ | Syringic Acid | Aminobenzoate degradation | −0.2 |
| 625.170 | 1.41 | [M+H]+ | Narcisin | Flavonoid glycoside | −6.1 |
| 349.200 | 1.41 | [M+H]+ | Gibberellin (isobars) | Plant hormone | −0.8 |
| 363.039 | 1.41 | [M+H]+ | Xanthosine 5’-monophosphate | Purine metabolism | 4.4 |
| 211.057 | 1.41 | [M+H]+ | 5-Hydroxyferulic acid | Phenylpropanoid biosynthesis | −3.0 |
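The mass error column of the table above compares an observed m/z with the value expected for a candidate compound and adduct. The following Python sketch shows one way to compute such an error. It assumes the convention "observed minus theoretical", singly charged ions, and the adduct masses listed in the code; how the authors calculated their values is not stated here, so small deviations from the tabulated numbers are to be expected.

```python
# Reference masses in Da (rounded; assumptions for this sketch)
PROTON = 1.007276       # H+ added in [M+H]+
SODIUM_ATOM = 22.989770  # monoisotopic mass of Na
ELECTRON = 0.000549

# Mass added to the neutral monoisotopic mass for each singly charged adduct
ADDUCT_SHIFT = {
    "[M+H]+": PROTON,
    "[M+Na]+": SODIUM_ATOM - ELECTRON,
}


def theoretical_mz(mono_mass, adduct):
    """Expected m/z of a singly charged adduct of a neutral molecule."""
    return mono_mass + ADDUCT_SHIFT[adduct]


def mass_error_mda(observed_mz, mono_mass, adduct):
    """Mass error in mDa, defined here as observed minus theoretical m/z."""
    return (observed_mz - theoretical_mz(mono_mass, adduct)) * 1000.0


# Tryptophan, C11H12N2O2, monoisotopic mass 204.08988 Da, observed as [M+Na]+
TRYPTOPHAN = 204.08988
error = mass_error_mda(227.070, TRYPTOPHAN, "[M+Na]+")
print(f"{error:.1f} mDa")
```

For this entry the sketch yields about −9.1 mDa, close to but not identical with the tabulated −9.3 mDa; the observed m/z in the table is rounded to three decimals and the reference masses used by the authors may differ slightly.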