Fatema Tuz Zohora1, M Ziaur Rahman1,2, Ngoc Hieu Tran1, Lei Xin2, Baozhen Shan2, Ming Li3.
Abstract
Liquid chromatography with tandem mass spectrometry (LC-MS/MS) based quantitative proteomics measures the relative difference in protein abundance between healthy and disease-afflicted patients, providing information on molecular interactions, signaling pathways, and biomarker identification that serves drug discovery and clinical research. A typical analysis workflow begins with peptide feature detection and intensity calculation from the LC-MS map. We are the first to propose a deep learning based model, DeepIso, that combines recent advances in Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) to detect peptide features of different charge states and to estimate their intensity. Existing tools are designed with limited engineered features and domain-specific parameters, which are hardly updated despite the huge amount of newly arriving proteomic data. DeepIso, by contrast, consists of two separate deep learning based modules, learns multiple levels of representation of high-dimensional data by itself through many layers of neurons, and is adaptable to newly acquired data. The peptide feature list reported by our model matches 97.43% of the high-quality MS/MS identifications in a benchmark dataset, which is higher than the matching produced by several widely used tools. Our results demonstrate that novel deep learning tools are desirable to advance the state of the art in protein identification and quantification.
Year: 2019 PMID: 31748623 PMCID: PMC6868186 DOI: 10.1038/s41598-019-52954-4
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Workflow of our proposed method to detect peptide features from the LC-MS map of a protein sample.
Class distribution of samples in our dataset consisting of 57 LC-MS maps.
| Class (charge state) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Amount | 163,038 | 863,050 | 428,909 | 29,183 | 1,503 | 653 | 179 | 236 | 233 |
Class sensitivity and precision of IsoDetecting module and amount of samples for training and validation.
| Class (charge state) | Training dataset size | Training sensitivity (%) | Training precision (%) | Validation dataset size | Validation sensitivity (%) | Validation precision (%) |
|---|---|---|---|---|---|---|
| 0 | 257,250 | 98.83 | 99.73 | 28,992 | 97.62 | 99.21 |
| 1 | 21,345 | 98.19 | 96.47 | 3,126 | 94.53 | 96.23 |
| 2 | 131,951 | 98.94 | 96.94 | 26,480 | 98.18 | 95.91 |
| 3 | 59,045 | 99.26 | 95.95 | 10,903 | 97.95 | 94.12 |
| 4 | 6,765 | 99.38 | 96.07 | 646 | 95.72 | 92.23 |
| 5 | 4,140 | 98.28 | 97.36 | 20 | 86.59 | 82.32 |
| 6 | 8,446 | 99.91 | 96.48 | 30 | 40.36 | 18.04 |
| 7 | 3,324 | 94.28 | 97.87 | 10 | 50.00 | 16.00 |
| 8 | 4,060 | 99.66 | 97.44 | 15 | 61.79 | 61.80 |
| 9 | 4,203 | 99.69 | 96.58 | 14 | 28.21 | 76.74 |
The training set for class 5 contains some duplicated samples (oversampling). The training sets for classes 6 to 9 contain both augmented and duplicated samples. The number of class 0 samples is our choice (discussed in the Method section) and is kept higher than that of the other classes because the LC-MS map is very sparse. The validation set does not contain any duplicated data, and there is no overlap between the validation and training datasets.
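As a reminder of how per-class figures like those in the table above are conventionally computed, here is a minimal sketch. It assumes the standard definitions (sensitivity = TP / (TP + FN), precision = TP / (TP + FP)); the function name and the toy counts are illustrative only and are not taken from the paper.

```python
def sensitivity_precision(confusion, cls):
    """Per-class sensitivity and precision from a count matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    The standard definitions are assumed here; the source does not spell them out.
    """
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp                 # true class cls, predicted otherwise
    fp = sum(row[cls] for row in confusion) - tp  # predicted cls, true class otherwise
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, precision


# Toy 3-class count matrix (made-up numbers, for illustration only).
toy_counts = [
    [90, 10, 0],
    [5, 80, 15],
    [0, 20, 30],
]

for c in range(len(toy_counts)):
    sens, prec = sensitivity_precision(toy_counts, c)
    print(f"class {c}: sensitivity={sens:.2%}, precision={prec:.2%}")
```

With very small validation sets, such as the 10 to 30 samples available for classes 5 to 9, a single misclassified sample shifts these percentages by several points.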
Class sensitivity of IsoGrouping module on training set and validation set.
| Class | Sensitivity on Training Set (%) | Sensitivity on Validation Set (%) |
|---|---|---|
| A (noise) | 95.06 | 94.68 |
| B (2 isotopes) | 56.49 | 57.52 |
| C (3 isotopes) | 72.24 | 72.41 |
| D (4 isotopes) | 72.69 | 74.23 |
| E (5 isotopes or more) | 72.41 | 72.67 |
Confusion matrix produced by IsoGrouping module on validation dataset.
| Class | A | B | C | D | E |
|---|---|---|---|---|---|
| A | 94.68% | 2.77% | 1.73% | 0.57% | 0.25% |
| B | 3.4% | 57.52% | 33.86% | 4.59% | 0.62% |
| C | 0.89% | 5.59% | 72.41% | 20.19% | 0.93% |
| D | 0.31% | 0.89% | 16.18% | 74.23% | 8.39% |
| E | 0.79% | 0.37% | 2.70% | 23.46% | 72.67% |
The diagonal values, e.g. [C, C], represent the sensitivity for the corresponding class (here class C). A feature is counted as misclassified as class A when its monoisotope (first isotope) or all of its isotopes are missed, i.e., the feature is mistaken for noise. The value of [C, A] is the percentage of three-isotope features that are either misclassified as noise or have their monoisotope missed. [C, B] is the percentage of features that actually have three isotopes but whose third isotope is missed, so that only the first two are grouped together. Similarly, [C, D] is the percentage of three-isotope features for which the IsoGrouping module finds one additional isotope at the end.
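The reading of the confusion matrix described above can be sketched in a few lines. The percentages are copied from the table; the helper names `cell` and `summarize` are hypothetical, and treating every prediction with fewer (or more) isotopes as "isotopes missed at the end" (or "extra isotopes found") generalizes the [C, B] and [C, D] examples and is an assumption.

```python
CLASSES = ["A", "B", "C", "D", "E"]

# Rows are true classes, columns are predicted classes (percent, from the table).
CONFUSION = {
    "A": [94.68, 2.77, 1.73, 0.57, 0.25],
    "B": [3.40, 57.52, 33.86, 4.59, 0.62],
    "C": [0.89, 5.59, 72.41, 20.19, 0.93],
    "D": [0.31, 0.89, 16.18, 74.23, 8.39],
    "E": [0.79, 0.37, 2.70, 23.46, 72.67],
}


def cell(true_cls, pred_cls):
    """Return the entry [true_cls, pred_cls] of the confusion matrix."""
    return CONFUSION[true_cls][CLASSES.index(pred_cls)]


def summarize(true_cls):
    """Break one feature class (B to E) into the outcomes discussed in the text."""
    if true_cls == "A":
        raise ValueError("summarize() is meant for feature classes B to E")
    i = CLASSES.index(true_cls)
    row = CONFUSION[true_cls]
    return {
        "sensitivity": row[i],
        "lost_as_noise_or_monoisotope_missed": row[0],
        "too_few_isotopes": sum(row[1:i]),
        "too_many_isotopes": sum(row[i + 1:]),
    }


print(cell("C", "B"))   # three-isotope features grouped with only two isotopes
print(summarize("C"))
```

Read this way, the dominant error for class B is over-grouping (33.86% predicted as C), whereas for class E it is under-grouping (23.46% predicted as D).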
Percentage of high confidence MS/MS identifications matched by feature list produced by different algorithms.
| Algorithms | MaxQuant | OpenMS | Dinosaur | DeepIso |
|---|---|---|---|---|
| Matching | 96.83% | 97.14% | 97.23% | 97.43% |
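To make the matching percentage concrete, the sketch below shows one plausible way to compute it. The matching criteria are assumptions, not the paper's stated procedure: an identification counts as matched if some feature has the same charge, an m/z within a ppm tolerance, and a retention-time range that contains the identification's retention time. The class names, fields, and the 10 ppm default are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    mz: float        # m/z of the monoisotope
    charge: int
    rt_start: float  # retention-time range of the feature (minutes)
    rt_end: float


@dataclass
class Identification:
    mz: float        # precursor m/z of the MS/MS identification
    charge: int
    rt: float        # retention time of the MS/MS scan (minutes)


def is_match(ident, feat, ppm_tol=10.0):
    """Assumed criteria: same charge, m/z within ppm_tol, RT inside the feature range."""
    ppm_error = abs(ident.mz - feat.mz) / feat.mz * 1e6
    return (
        ident.charge == feat.charge
        and ppm_error <= ppm_tol
        and feat.rt_start <= ident.rt <= feat.rt_end
    )


def matching_percentage(idents, feats, ppm_tol=10.0):
    """Percentage of identifications matched by at least one feature."""
    if not idents:
        raise ValueError("no identifications to match")
    matched = sum(any(is_match(i, f, ppm_tol) for f in feats) for i in idents)
    return 100.0 * matched / len(idents)


# Toy data (made up): the second identification has a different charge.
features = [Feature(mz=500.0, charge=2, rt_start=10.0, rt_end=12.0)]
identifications = [
    Identification(mz=500.001, charge=2, rt=11.0),
    Identification(mz=500.0, charge=3, rt=11.0),
]
print(matching_percentage(identifications, features))
```

The nested loop is quadratic; for real feature lists one would typically sort features by m/z and search only the tolerance window.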
Figure 2. (a) A peptide feature with broken signals; (b) detection of overlapping peptide features; (c) adjacent feature case.
Pearson correlation coefficient of the peptide feature intensity between DeepIso and other tools.
| Dinosaur | MaxQuant | OpenMS |
|---|---|---|
| 87.73% | 88.99% | 91.46% |
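For reference, the Pearson correlation coefficient between two paired intensity lists can be computed as in this minimal sketch. How features were paired between tools, and whether intensities were transformed (for example, log-scaled) beforehand, is not stated here, so the inputs below are toy values and the function name is illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy intensities for the same features as reported by two tools (made up).
tool_a = [1.0e5, 2.5e5, 4.0e5, 9.0e5]
tool_b = [1.2e5, 2.4e5, 4.5e5, 8.1e5]
print(f"{pearson(tool_a, tool_b):.4f}")
```

The coefficient lies in [-1, 1]; the table reports it as a percentage, so 91.46% corresponds to r = 0.9146.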
Approximated running time of different algorithms.
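| Algorithm | Platform | Running time |
|---|---|---|
| Dinosaur | Processor: Intel Core i7, 4 cores; OS: Windows 10 (used to run the applications) | 15 minutes |
| MaxQuant | Processor: Intel Core i7, 4 cores; OS: Windows 10 (used to run the applications) | 30 minutes |
| OpenMS | Processor: Intel(R) Xeon(R) Gold 6134 CPU, NVIDIA Tesla; OS: Ubuntu 16.04.5 LTS (used to run the Python scripts) | 2 hours and 50 minutes |
| DeepIso | Processor: Intel(R) Xeon(R) Gold 6134 CPU, NVIDIA Tesla; OS: Ubuntu 16.04.5 LTS (used to run the Python scripts) | |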
The platform used for OpenMS and DeepIso did not support running the Windows applications MaxQuant and Dinosaur, so we used a different machine to run those.
The IsoDetecting module gives better validation sensitivity with the FC-RNN network than with the attention-gated RNN.
| Class (charge state) | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| FC-RNN network (better) | 96.43% | 93.80% | 96.98% | 98.74% | 97.94% | 85.86% |
| CNN with attention-gated RNN | 96.15% | 89.00% | 96.04% | 96.46% | 95.07% | 54.29% |
Performance of IsoGrouping module in different stages of the development (based on validation dataset).
| Model | Matching with MS/MS identified peptides |
|---|---|
| Initial model | 87.55% |
| Retraining using Adjacent Feature cases | 92.82% |
| Addition of max-pooling layers and one more fully connected layer, with state size raised to 8 | 94.66% |
| Network with attention-gated RNN | 95.08% |
| Same as above but trained with more data | 95.13% |
| Ensemble of multiple instances of the model | 95.46% |
Improvement of class sensitivity of IsoDetecting module for charge states 6 to 9 with increasing amount of training samples.
| Class (charge state) | Initial dataset: validation sensitivity | Initial dataset: training sensitivity | Oversampling: validation sensitivity | Oversampling: training sensitivity | Augmentation: validation sensitivity | Augmentation: training sensitivity |
|---|---|---|---|---|---|---|
| 6 | 0% | 0% | 52.65% | 98.32% | 40.36% | 99.9% |
| 7 | 0% | 0% | 0% | 96.53% | 50% | 94.1% |
| 8 | 0% | 0% | 31.67% | 99.14% | 61.80% | 99.6% |
| 9 | 0% | 0% | 38.57% | 98.12% | 28.20% | 99.7% |
Oversampling was performed by duplicating training samples; augmented samples were created from training samples.
Initial dataset: the network does not learn anything because the number of original samples is negligible.
Oversampling: the network starts learning because a higher penalty is introduced for low-abundance samples, but it still does not learn well because the samples lack variance. The network also overfits due to lack of data.
Augmentation: various samples were created using augmentation, which improves the validation sensitivity further. However, the network still overfits because the data remain limited in amount and variation.
We did not pursue further improvement (by including more data from different but similar datasets), since most peptide features generally appear in the LC-MS map with charge states below 6. The validation set does not contain duplicated data, and there is no overlap between the training set and the validation set.
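The two imbalance remedies mentioned for the rare charge states, a higher penalty for low-abundance classes and oversampling by duplication, can be sketched as follows. The counts are the whole-dataset class distribution reported earlier (not the training split), and the inverse-frequency weighting formula is an assumed, commonly used scheme rather than the one confirmed by the source; both function names are hypothetical.

```python
# Class distribution over the 57 LC-MS maps (charge state -> number of features).
CLASS_COUNTS = {
    1: 163038, 2: 863050, 3: 428909, 4: 29183, 5: 1503,
    6: 653, 7: 179, 8: 236, 9: 233,
}


def inverse_frequency_weights(counts):
    """Assumed scheme: weight_c = total / (num_classes * count_c).

    Rare classes receive larger weights, i.e. a higher penalty when misclassified.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {c: total / (num_classes * n) for c, n in counts.items()}


def oversample_by_duplication(samples, target_size):
    """Repeat the samples cyclically until the class has target_size items."""
    if not samples:
        raise ValueError("cannot oversample an empty class")
    if target_size <= len(samples):
        return list(samples)
    reps, rem = divmod(target_size, len(samples))
    return list(samples) * reps + list(samples[:rem])


weights = inverse_frequency_weights(CLASS_COUNTS)
for charge in sorted(weights):
    print(f"charge {charge}: weight {weights[charge]:.3f}")

print(oversample_by_duplication(["s1", "s2", "s3"], 8))
```

Duplication adds no new variation, which is consistent with the overfitting noted above: training sensitivity approaches 100% while validation sensitivity stays low.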
Figure 3. Intuitive image showing the effect of resolution along the m/z axis of the LC-MS map: (a) lower resolution merges closely located peptide features; (b) higher resolution separates the same features and lets the IsoDetecting module detect them correctly.
Figure 4. Network of the IsoDetecting module.
Figure 5. Intuition of scanning by the IsoGrouping module on the clusters of features.
Figure 6. Network of the IsoGrouping module.