| Literature DB >> 34984411 |
Edward S Lee, Thomas J S Durant.
Abstract
As the demand for laboratory testing by mass spectrometry increases, so does the need for automated methods of data analysis. Clinical mass spectrometry (MS) data are particularly well suited to machine learning (ML) methods, which handle structured and discrete data elements well. The alignment of these two fields offers a promising synergy that can be used to optimize workflows, improve result quality, and enhance our understanding of high-dimensional datasets and their inherent relationship with disease. In recent years, a growing number of publications have examined the capabilities of ML-based software in the context of chromatography and MS. However, given the historical distance between the fields of clinical chemistry and computer science, there is an opportunity to improve the technological literacy of the clinical laboratory scientist community with respect to ML-based software. To this end, we present a basic overview of ML and a tutorial of an ML-based experiment using a previously published MS dataset. The purpose of this paper is to describe the fundamental principles of supervised ML, outline the steps classically involved in an ML-based experiment, and discuss the purpose of good ML practice in the context of a binary MS classification problem.
Keywords: Amino acid; Artificial intelligence; CART, Classification and Regression Trees; ML, Machine Learning; MS, Mass Spectrometry; Mass spectrometry; NLL, Negative Log Loss; PAA, Plasma Amino Acid; PR, Precision-Recall; PRAUC, Area Under the Precision-Recall Curve; RL, Reinforcement Learning; ROC, Receiver Operator Curve; SCF, Supplemental Code File; Supervised machine learning; XGBT, Extreme Gradient Boosted Trees; Xgboost
Year: 2021 PMID: 34984411 PMCID: PMC8692990 DOI: 10.1016/j.jmsacl.2021.12.001
Source DB: PubMed Journal: J Mass Spectrom Adv Clin Lab ISSN: 2667-145X
Fig. 1 Data preprocessing schematic. (Top) Original PAA profile dataset represented in tabular format. This is an example of a structured dataset. (Bottom) Post-preprocessing: the identifier column is removed, and categorical data are encoded as numeric values. In addition, the data used for making predictions (input) are separated from the labels (output), so that the labels are not used by the algorithm as a feature during training. (Bottom right) Lastly, input and output data are split into train, validation, and test datasets.
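The preprocessing steps in Fig. 1 can be sketched in a few lines of pandas/NumPy. This is a minimal illustration, not the authors' code: the column names, amino acid values, and split sizes below are hypothetical stand-ins for the PAA profile dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical PAA profile table; column names and values are illustrative only.
df = pd.DataFrame({
    "specimen_id": ["S001", "S002", "S003", "S004", "S005", "S006"],
    "alanine":     [310.0, 455.2, 298.7, 501.3, 352.1, 420.9],
    "glycine":     [225.4, 190.8, 260.1, 175.5, 240.0, 205.3],
    "label":       ["normal", "abnormal", "normal",
                    "abnormal", "normal", "abnormal"],
})

# Remove the identifier column so it cannot act as a feature.
df = df.drop(columns=["specimen_id"])

# Encode the categorical label as numeric: normal -> 0, abnormal -> 1.
df["label"] = df["label"].map({"normal": 0, "abnormal": 1})

# Separate inputs (X) from labels (y) so the label is never seen as a feature.
X = df.drop(columns=["label"]).to_numpy()
y = df["label"].to_numpy()

# Shuffle, then partition into train, validation, and test sets.
rng = np.random.default_rng(42)
idx = rng.permutation(len(X))
n_train, n_val = 4, 1                       # toy sizes for six rows
train, val, test = np.split(idx, [n_train, n_train + n_val])
X_train, y_train = X[train], y[train]
```

In practice the split proportions (e.g. 70/15/15) and the label encoding scheme are design choices made before any training occurs.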
Fig. 2 Machine learning workflow schematic for binary classification of PAA profiles as ‘normal’ or ‘abnormal’. (From left to right) A sample (batch) of training data is presented to the XGBoost algorithm. XGBoost analyses the data, makes a prediction for each class (‘normal’ and ‘abnormal’), and represents the predictions as a sum-to-one probability distribution, with the probability of ‘normal’ in the left column and ‘abnormal’ in the right column of the prediction table. The true labels are then compared against the predicted classes via a loss function, and the calculated loss is used to update the parameters of the XGBoost algorithm. This process repeats until the loss no longer decreases.
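The loss comparison at the heart of Fig. 2 is the negative log loss (NLL) named in the keyword list. A minimal NumPy sketch of that calculation follows; the prediction table and labels are invented toy values, and the real gradient-based parameter update inside XGBoost is not reproduced here.

```python
import numpy as np

def negative_log_loss(y_true, p_abnormal):
    """Mean negative log likelihood for binary labels (0=normal, 1=abnormal)."""
    p = np.clip(p_abnormal, 1e-15, 1 - 1e-15)   # guard against log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Toy prediction table for one batch: each row sums to one, with P(normal)
# in the left column and P(abnormal) in the right column, as in Fig. 2.
pred = np.array([[0.9, 0.1],
                 [0.2, 0.8],
                 [0.6, 0.4]])
y_true = np.array([0, 1, 0])                    # true labels for the batch

loss = negative_log_loss(y_true, pred[:, 1])    # compare labels vs predictions
```

Confident, correct predictions (first two rows) contribute little to the loss, while the less certain third row dominates it; minimizing this quantity is what drives each training update.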
Fig. 3 Training and validation loss plotted as a function of training iterations.
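The stopping rule implied by Figs. 2 and 3, halt when the monitored loss no longer decreases, can be sketched as a simple early-stopping loop. The per-iteration losses below are invented for illustration (validation loss falls, bottoms out, then rises), and the `patience` parameter is an assumed convention, not a value from the paper.

```python
# Hypothetical validation loss per training iteration.
val_losses = [0.69, 0.52, 0.41, 0.35, 0.33, 0.34, 0.36]

best, best_iter = float("inf"), 0
patience, waited = 2, 0                 # iterations to tolerate no improvement
for i, loss in enumerate(val_losses):
    if loss < best:
        best, best_iter, waited = loss, i, 0   # new minimum: keep this model
    else:
        waited += 1
        if waited >= patience:
            break                        # loss stopped decreasing: halt training
```

Monitoring the validation loss rather than the training loss is what lets a curve like Fig. 3 reveal overfitting: the training loss keeps falling while the validation loss turns upward.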