Magnus Palmblad, Sebastian Böcker, Sven Degroeve, Oliver Kohlbacher, Lukas Käll, William Stafford Noble, Mathias Wilhelm.
Abstract
Machine learning is increasingly applied in proteomics and metabolomics to predict molecular structure, function, and physicochemical properties, including behavior in chromatography, ion mobility, and tandem mass spectrometry. Such applications must be described in sufficient detail for others to apply the trained models or evaluate their performance. Here we look at and interpret the recently published, general DOME (Data, Optimization, Model, Evaluation) recommendations for conducting and reporting on machine learning in the specific context of proteomics and metabolomics.
Year: 2022 PMID: 35119864 PMCID: PMC8981311 DOI: 10.1021/acs.jproteome.1c00900
Source DB: PubMed Journal: J Proteome Res ISSN: 1535-3893 Impact factor: 5.370
Figure 1. The number of publications on machine learning in proteomics or metabolomics has increased rapidly in the last 5 years, as revealed by a Web of Science literature search. The results also suggest an early ML hype in these domains around 2009–2011 and a shallow but noticeable “trough of disillusionment” from 2012 to 2013. The DOME query was the same as in the original paper [TS = “machine learning” AND ALL = (“biolog*” OR “medicine” OR “genom*” OR “prote*” OR “cell*” OR “post translational” OR “metabolic” OR “clinical”)]. The modified proteomics or metabolomics query was TS = “machine learning” AND ALL = (“biolog*” OR “medicine” OR “genom*” OR “prote*” OR “cell*” OR “post translational” OR “metabolic” OR “clinical”) AND ALL = (“proteome” OR “proteomics” OR “metabolome” OR “metabolomics” OR “metabonome” OR “metabonomics”). The searches were done on 2022-01-22.
Specific Recommendations under the Broad Topics and BOLOs as in the DOME Recommendations[1]
| broad topic | be on the lookout for | specific recommendations |
|---|---|---|
| Data | Data size and quality | Training data sufficiently represents the complexity of the modeled molecular class (e.g., tryptic peptides, lipids, all metabolites). |
| | | Be clear if data used for training and testing are acquired on similar instruments (e.g., with the same mass analyzer) using similar settings (e.g., collision energy) or on a range of instruments or conditions. |
| | | Beware of chimeric spectra and their possibly contaminating effects. |
| | Appropriate partitioning, dependence between train and test data | Training and test data should be disjoint not only on the spectrum level but also on the molecular structure (e.g., peptide) level. Stereoisomers fragment very similarly, and hence stereoisomers of the same structure must not be present in both the training and test sets, to avoid biased statistics. Structural similarity or homology between training and test data should be kept to a minimum or should be controlled to mimic realistic test conditions. |
| | No access to data | Training and test data are available in a public repository. |
| | | If filtering or partitioning spectra in the same data sets, provide lists of Universal Spectrum Identifiers (USIs). |
| | Other | Beware of redundancy in training or test data (e.g., multiple spectra of the same or similar molecular structures). |
| | | Beware of false positives and false negatives in training data and possible bias when selecting strict thresholds for compound identification. |
| | | Beware of events affecting instrument performance over time, as those can artificially decrease or increase the apparent performance on an independent test set (e.g., instrument maintenance and calibration events). |
| Optimization | Overfitting, underfitting, and illegal parameter tuning | Compare with experimental variability. Is the claimed performance better than the expected experimental variability (e.g., in peak intensities or retention times)? |
| | | Report any hyperparameter tuning (e.g., of deep neural network architectures). |
| | Imprecise parameters and protocols given | Define the optimization target (e.g., spectrum-, peptide-, or protein-level statistics). |
| | | Provide the metric for comparing chromatograms or spectra (e.g., spectral angle, cosine score, or dot product) and a detailed description of how to apply it (e.g., whether specific peaks were discarded for cosine score calculation, tolerances used for matching peaks, or strategies to resolve ambiguities). |
| Model | Unclear if black box (opaque) or interpretable (transparent) model | If the model is interpretable, describe how the trained model can be interpreted and what can be learned from it. |
| | No access to resulting source code and trained models | Specify which model, software, and version were used. |
| | | Make documented source code publicly available. |
| | Execution time is impractical | Execution time for the training or application of a model should not be a bottleneck in its intended pipeline. As a rule of thumb, applying the model should not take longer than data acquisition. Execution time is even more critical in real-time applications such as continuous retention time alignment. |
| Evaluation | Performance measures inadequate | Motivate the use of performance measures, especially if reporting a single number. Report the Matthews correlation coefficient (MCC), not F1 scores or AUCs, for binary classifiers trained on classes of different sizes. |
| | No comparisons to baselines or other methods | Compare performance with simpler baseline methods (e.g., linear regression predicting ion mobility using only mass or retention times using only amino acid composition). |
| | | Include tests measuring performance of the algorithm in a practical user situation. For example, when predicting retention time, what increase in the number of identified or quantified compounds or peptides does the prediction imply? |
| | | Evaluate the model on independent test data acquired on a different instrument in a different lab. |
| | | Do not include peaks corresponding to the precursor ion when comparing tandem mass spectra. |
| | Highly variable performance | Compare models on the same data. Make sure that the metric used for comparison is the same and was applied in the same way (e.g., do not compare cosine scores that were calculated based on different sets of ions). If a (community-developed) benchmark data set is available, then use it. If cross-validation is employed, then report the random splits (e.g., USIs) so that others can reproduce your work. (Communities are encouraged to develop benchmarking data sets for ML.) |
| | | Explicitly state model limitations (e.g., the type and conditions of chromatography for retention time prediction or the ionization mode, fragmentation, and mass analyzer for simulating tandem mass spectra). |
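
The partitioning recommendation in the table (train and test sets disjoint at the molecular structure level, not only at the spectrum level) can be illustrated with a minimal Python sketch. It is not taken from the paper. The function name `split_by_structure`, the record layout, and all peptide sequences and spectrum labels are hypothetical. The grouping key is an assumption: a peptide sequence here, or for small molecules a stereo-agnostic key such as the first block of an InChIKey, so that stereoisomers fall into the same group.

```python
import random
from collections import defaultdict


def split_by_structure(records, test_fraction=0.2, seed=0):
    """Split (structure_key, spectrum_id) records so that no structure_key
    appears in both the training and the test set.

    Assumption: structure_key already collapses everything that should stay
    together (e.g., all spectra of one peptide, or all stereoisomers).
    """
    groups = defaultdict(list)
    for key, spectrum_id in records:
        groups[key].append(spectrum_id)

    # Shuffle the structure keys (not the spectra) with a fixed seed,
    # so the split itself can be reported and reproduced.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n_test = max(1, round(len(keys) * test_fraction))
    test_keys, train_keys = keys[:n_test], keys[n_test:]

    train = [(k, s) for k in train_keys for s in groups[k]]
    test = [(k, s) for k in test_keys for s in groups[k]]
    return train, test


# Hypothetical records: several spectra per peptide sequence.
records = [
    ("PEPTIDEK", "spec_001"), ("PEPTIDEK", "spec_002"),
    ("ELVISLIVESK", "spec_003"),
    ("SAMPLER", "spec_004"), ("SAMPLER", "spec_005"), ("SAMPLER", "spec_006"),
    ("ANTIGENK", "spec_007"),
    ("LENGTHYPEPTIDER", "spec_008"), ("LENGTHYPEPTIDER", "spec_009"),
]

train, test = split_by_structure(records, test_fraction=0.2, seed=1)
print("train structures:", sorted({k for k, _ in train}))
print("test structures:", sorted({k for k, _ in test}))
```

Splitting on keys rather than on individual spectra is the design choice that matters here: a plain random split over spectra would place replicate spectra of the same peptide on both sides and inflate the apparent test performance. Controlling structural similarity or homology between different keys would need an additional clustering step that this sketch does not attempt.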
For training and test data sets containing millions of identified spectra, such Universal Spectrum Identifier (USI) lists will be very long. However, we expect that they would be primarily generated and read by machines, and even if they are long, USI lists take considerably less space than the mass spectra themselves. Furthermore, lists of USIs with many identifiers from the same data sets can be compressed by at least a factor of 5. More than 1 billion identifiers are already available in the ProteomeXchange repositories.
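
The point about USI list size can be checked on one's own data with a few lines of Python. The sketch below is illustrative only: it builds synthetic identifiers that follow the general `mzspec:<dataset>:<run>:scan:<number>:<interpretation>` layout of a USI, using a hypothetical dataset accession (`PXD000000`), a hypothetical run name, and random peptide-like strings, and then compresses the list with gzip from the standard library. The helper names `make_usi` and `compress_usi_list` are our own, and the compression ratio obtained on synthetic data says nothing definitive about real lists.

```python
import gzip
import random


def make_usi(dataset, run, scan, interpretation):
    """Assemble a USI-style string; the field values used below are hypothetical."""
    return f"mzspec:{dataset}:{run}:scan:{scan}:{interpretation}"


def compress_usi_list(usis):
    """Write one USI per line and gzip the result.

    Returns the raw bytes and the compressed bytes.
    """
    raw = "\n".join(usis).encode("utf-8")
    return raw, gzip.compress(raw)


# Synthetic list: many identifiers from the same (hypothetical) data set and run.
rng = random.Random(0)
amino_acids = "ACDEFGHIKLMNPQRSTVWY"
usis = []
for scan in range(1, 5001):
    peptide = "".join(rng.choice(amino_acids) for _ in range(rng.randint(7, 20)))
    charge = rng.choice([2, 3])
    usis.append(make_usi("PXD000000", "hypothetical_run_01", scan, f"{peptide}/{charge}"))

raw, packed = compress_usi_list(usis)
ratio = len(raw) / len(packed)

# Round trip: the compressed file restores the exact list.
restored = gzip.decompress(packed).decode("utf-8").split("\n")

print(f"{len(usis)} USIs: {len(raw)} bytes raw, {len(packed)} bytes gzipped, ratio {ratio:.1f}")
```

Because identifiers from the same data set share the collection and run fields, a general-purpose compressor removes most of that repetition, which is consistent with the expectation stated above that such lists compress well.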