Denis V Petrovsky, Arthur T Kopylov, Vladimir R Rudnev, Alexander A Stepanov, Liudmila I Kulikova, Kristina A Malsagova, Anna L Kaysheva.
Abstract
Mass spectrometric profiling provides information on the protein and metabolic composition of biological samples. However, the low efficiency of computational algorithms in assigning tandem spectra to molecular components (proteins and metabolites) severely limits the use of "omics" profiling for the classification of nosologies. Machine learning methods that analyze raw mass spectrometric (HPLC-MS/MS) measurements directly, without the preprocessing and identification stages, therefore seem promising. In our study, we tested two types of neural networks, a 1D residual convolutional neural network (CNN) and a 3D CNN, for the classification of three cancers from metabolomic-proteomic HPLC-MS/MS data. We showed that both networks could classify the phenotypes of gender-mixed oncology (kidney cancer), gender-specific oncology (ovarian cancer), and a healthy person by analyzing "omics" data in the 'mgf' format. The resulting models recognized oncopathologies with an accuracy of 0.95. We also obtained information on the distances between the studied phenotypes. The closest pairs in the experiment were ovarian cancer/kidney cancer and prostate cancer/kidney cancer. In contrast, the healthy phenotype was the most distant from the cancer phenotypes and ovarian and prostate cancers. The neural network thus makes it possible not only to classify the studied phenotypes but also to determine their similarity (distance matrix), overcoming algorithmic barriers in identifying HPLC-MS/MS spectra. Neural networks are versatile and can be applied to standard experimental data formats obtained using different analytical platforms.
Keywords: bioinformatics; cancer; metabolomics; multiomics data; neural network; proteomics; system biology
Year: 2021 PMID: 34945760 PMCID: PMC8707435 DOI: 10.3390/jpm11121288
Source DB: PubMed Journal: J Pers Med ISSN: 2075-4426
Sizes of datasets.
| Dataset | Training Size | Testing Size | Validating Size |
|---|---|---|---|
| Original | 217 | 120 | – |
| Augmented | 1302 | 420 | 300 |
| Original | 203 | 120 | – |
| Augmented | 1302 | 420 | 300 |
Figure 1. Schematic representation of the 1D-CNN and 3D-CNN pipeline architecture for mass spectrometry-based data processing using data conversion into 'mgf' files.
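Since the pipeline starts from 'mgf' files, it may help to see what such a file holds. The following is a minimal sketch, assuming the usual Mascot Generic Format layout (BEGIN IONS / END IONS blocks with KEY=VALUE headers followed by m/z and intensity columns). The function name `parse_mgf` and the sample spectrum are hypothetical and are not taken from the paper; the authors' actual conversion and input encoding are not described in this record.

```python
from typing import Dict, List


def parse_mgf(text: str) -> List[Dict[str, object]]:
    """Parse MGF-formatted text into a list of spectra.

    Each spectrum is a dict with 'params' (header KEY=VALUE pairs)
    and 'peaks' (a list of (m/z, intensity) tuples).
    """
    spectra: List[Dict[str, object]] = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Header line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                current["params"][key.strip().upper()] = value.strip()
            else:
                # Peak line: m/z and intensity separated by whitespace
                fields = line.split()
                if len(fields) >= 2:
                    current["peaks"].append((float(fields[0]), float(fields[1])))
    return spectra


# Hypothetical two-peak spectrum, for illustration only.
SAMPLE = """BEGIN IONS
TITLE=example scan 1
PEPMASS=445.12
CHARGE=2+
100.5 200.0
150.25 50.0
END IONS
"""

spectra = parse_mgf(SAMPLE)
print(len(spectra), spectra[0]["params"]["PEPMASS"], spectra[0]["peaks"])
```

A parser of this kind only recovers the peak lists; how they are binned into channels or rendered as spectrum images for the two networks is a separate step.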
Figure 2. Schematic representation of the 1D residual CNN architecture.
1D-residual CNN model parameters.
| Layer Name | Kernel Size, Filters | Number of Blocks | Stride |
|---|---|---|---|
| Conv1 | (32, 64) | 1 | 2 |
| Conv2_x | (7, 64); (7, 64) | 4 | 2 |
| Conv3_x | (7, 128); (7, 128) | 3 | 2 |
| Conv4_x | (7, 256); (7, 256) | 3 | 2 |
| Conv5_x | (7, 512); (7, 512) | 4 | 2 |
| Avgpool | kernel size = 3 | | |
| Dense | 512 × 4 | | |
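To give a feel for the size of the network described in this table, the sketch below counts convolution and dense weights layer by layer. It rests on several assumptions that the table does not state: the input has 4 channels (taken from the "4 channels of raw data" entry in the metrics table), each Conv*_x block holds two convolutions, and bias terms, normalization layers, and any shortcut projections are ignored. It is a rough estimate, not the authors' reported parameter count.

```python
# (layer name, kernel size, filters, number of blocks) as listed in the table.
STAGES = [
    ("Conv2_x", 7, 64, 4),
    ("Conv3_x", 7, 128, 3),
    ("Conv4_x", 7, 256, 3),
    ("Conv5_x", 7, 512, 4),
]
IN_CHANNELS = 4   # assumption: the "4 channels of raw data" input
NUM_CLASSES = 4   # from the Dense (512 x 4) row


def conv1d_weights(kernel: int, c_in: int, c_out: int) -> int:
    """Number of weights in one 1D convolution (no bias)."""
    return kernel * c_in * c_out


def count_weights() -> dict:
    """Return the weight count per table row, in table order."""
    counts = {"Conv1": conv1d_weights(32, IN_CHANNELS, 64)}
    c_in = 64
    for name, kernel, filters, blocks in STAGES:
        total = 0
        for _ in range(blocks * 2):  # two convolutions per residual block
            total += conv1d_weights(kernel, c_in, filters)
            c_in = filters           # later convolutions see the new width
        counts[name] = total
    counts["Dense"] = c_in * NUM_CLASSES  # average pooling adds no weights
    return counts


counts = count_weights()
for name, n in counts.items():
    print(f"{name:8s} {n:>10,d}")
print(f"{'total':8s} {sum(counts.values()):>10,d}")
```

Under these assumptions nearly all of the weights sit in the last two stages, which is typical for residual networks whose filter count doubles at each stage.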
Figure 3. Schematic representation of the 3D residual CNN architecture.
3D-CNN model parameters.
| Layer Name | Kernel Size, Filters | Stride |
|---|---|---|
| Conv1 | ((7, 9, 9), 16) | (1, 1, 1) |
| Maxpool1 | (3, 3, 3) | (1, 1, 1) |
| Conv2 | ((5, 7, 7), 16) | (1, 1, 1) |
| Maxpool2 | (2, 2, 2) | (1, 1, 1) |
| Conv3 | ((5, 7, 7), 32) | (1, 1, 1) |
| Maxpool3 | (2, 2, 2) | (1, 1, 1) |
| Conv4 | ((2, 5, 5), 32) | (1, 1, 1) |
| Conv5 | ((2, 5, 5), 64) | (1, 1, 1) |
| Conv6 | ((1, 3, 3), 128) | (1, 1, 1) |
| Conv7 | ((1, 3, 3), 256) | (1, 1, 1) |
| Conv8 | ((1, 3, 3), 512) | (1, 1, 1) |
| Dense | (4096, 64) | |
| Dense | (64 × 4) | |
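Because every layer in this table uses a stride of (1, 1, 1), the receptive field of the convolution and pooling stack follows directly from the kernel sizes. The sketch below applies the standard recurrence (the field grows by (kernel - 1) times the accumulated stride at each layer) to the rows of the table. The helper name `receptive_field` is hypothetical, and the table does not say which axis corresponds to the spectrum sequence and which to image height and width, so the result is reported per axis in the table's own order.

```python
# (layer name, kernel size per axis, stride per axis), in table order.
# "Conv2" is the second convolution row, between Maxpool1 and Maxpool2.
LAYERS = [
    ("Conv1", (7, 9, 9), (1, 1, 1)),
    ("Maxpool1", (3, 3, 3), (1, 1, 1)),
    ("Conv2", (5, 7, 7), (1, 1, 1)),
    ("Maxpool2", (2, 2, 2), (1, 1, 1)),
    ("Conv3", (5, 7, 7), (1, 1, 1)),
    ("Maxpool3", (2, 2, 2), (1, 1, 1)),
    ("Conv4", (2, 5, 5), (1, 1, 1)),
    ("Conv5", (2, 5, 5), (1, 1, 1)),
    ("Conv6", (1, 3, 3), (1, 1, 1)),
    ("Conv7", (1, 3, 3), (1, 1, 1)),
    ("Conv8", (1, 3, 3), (1, 1, 1)),
]


def receptive_field(layers):
    """Receptive field per axis of one output unit of the stacked layers."""
    field = [1, 1, 1]
    jump = [1, 1, 1]  # accumulated stride along each axis
    for _name, kernel, stride in layers:
        for axis in range(3):
            field[axis] += (kernel[axis] - 1) * jump[axis]
            jump[axis] *= stride[axis]
    return tuple(field)


rf = receptive_field(LAYERS)
print(rf)
```

With all strides equal to 1 the recurrence reduces to 1 plus the sum of (kernel - 1) over the layers, so the same function can be reused if different strides are assumed.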
Indicators of the model learning process.
| CNN | Epochs | Total Training Time | Learning Rate |
|---|---|---|---|
| 1D residual CNN | 25 | 23 min | 1 × 10⁻³, reduced to 1.25 × 10⁻⁴ |
| 3D residual CNN | 25 | 145 min | 1 × 10⁻³, reduced to 5 × 10⁻⁵ |
Figure 4. Training curves (average loss) of the two CNN models. The y-axis shows the cross-entropy loss, which is given in arbitrary units and depends on the probabilities predicted by the model.
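The dependence of the cross-entropy loss on predicted probabilities can be illustrated in a few lines. This is a minimal sketch of the standard definition for a single-label, four-class problem; the probability vectors are hypothetical and the function names are chosen for illustration only.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], true_class: int) -> float:
    """Negative log of the probability assigned to the true class."""
    return -math.log(probs[true_class])


def average_loss(batch_probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean cross-entropy over a batch, as plotted on a loss curve."""
    losses = [cross_entropy(p, y) for p, y in zip(batch_probs, labels)]
    return sum(losses) / len(losses)


# Hypothetical predictions over four classes (control plus three cancers).
confident = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]

print(cross_entropy(confident, 0))  # small loss: high probability on the true class
print(cross_entropy(uniform, 0))    # ln(4): loss of an uninformed four-class guess
print(average_loss([confident, uniform], [0, 0]))
```

The loss has no upper bound as the probability of the true class approaches zero, which is why its absolute value is read relative to the ln(4) baseline of a uniform guess rather than as a percentage.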
Training and evaluation metrics of the CNN models.
| CNN | Input | Dataset / Class | Accuracy | Recall | F1-Score |
|---|---|---|---|---|---|
| 1D-residual CNN | 4 channels of raw data | Train | 0.953 | 0.941 | 0.956 |
| | | Test | 0.812 | 0.796 | 0.801 |
| | | Validation | 0.784 | 0.781 | 0.781 |
| | | Class: Control | 0.69 | 0.86 | 0.76 |
| | | Class: Ovarian cancer | 0.79 | 0.95 | 0.85 |
| | | Class: Prostate cancer | 0.89 | 0.77 | 0.82 |
| | | Class: Kidney cancer | 0.92 | 0.7 | 0.79 |
| 3D CNN | Sequence of spectrum images | Train | 0.974 | 0.968 | 0.972 |
| | | Test | 0.893 | 0.889 | 0.893 |
| | | Validation | 0.861 | 0.850 | 0.854 |
| | | Class: Control | 0.79 | 0.95 | 0.86 |
| | | Class: Ovarian cancer | 0.83 | 0.95 | 0.88 |
| | | Class: Prostate cancer | 0.94 | 0.78 | 0.85 |
| | | Class: Kidney cancer | 0.91 | 0.83 | 0.86 |
Figure 5. Confusion matrices of the neural network predictions obtained on the testing dataset (Table 1): 1D residual CNN (a) and 3D CNN (b). Cells at the intersection of a phenotype with itself give the number of correctly recognized files; cells at the intersection of different phenotypes give the number of misclassified files. CNT: control; OVC: ovarian cancer; RNC: kidney cancer; PRC: prostate cancer.
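Per-class metrics of the kind reported for the two models can be derived from a confusion matrix such as the ones in Figure 5. The sketch below shows the usual calculation. The counts are hypothetical and are NOT the values from Figure 5, and the orientation (rows as true classes, columns as predicted classes) is an assumption, since the figure itself is not reproduced here.

```python
LABELS = ["CNT", "OVC", "RNC", "PRC"]

# Hypothetical counts for illustration only (not the data from Figure 5).
# Assumed orientation: rows are true classes, columns are predicted classes.
MATRIX = [
    [8, 1, 1, 0],
    [0, 9, 1, 0],
    [1, 1, 7, 1],
    [0, 0, 2, 8],
]


def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall, f1)} from a square confusion matrix."""
    n = len(labels)
    metrics = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        actual = sum(matrix[i])                          # files truly in class i
        predicted = sum(matrix[r][i] for r in range(n))  # files assigned to class i
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def accuracy(matrix):
    """Fraction of files on the diagonal, i.e. correctly recognized."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


metrics = per_class_metrics(MATRIX, LABELS)
for label, (p, r, f1) in metrics.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy(MATRIX):.2f}")
```

Reading the matrix this way also shows which phenotypes are confused with each other, which is the information the off-diagonal cells carry.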