Khawla Seddiki, Philippe Saudemont, Frédéric Precioso, Nina Ogrinc, Maxence Wisztorski, Michel Salzet, Isabelle Fournier, Arnaud Droit.
Abstract
Rapid and accurate clinical diagnosis remains challenging. A component of diagnostic tool development is the design of effective classification models for mass spectrometry (MS) data. Some machine learning approaches have been investigated, but these models require time-consuming preprocessing steps to remove artifacts, making them unsuitable for rapid analysis. Convolutional Neural Networks (CNNs) have been found to perform well under such circumstances since they can learn representations from raw data. However, their effectiveness decreases when the number of available training samples is small, which is a common situation in medicine. In this work, we investigate transfer learning on 1D-CNNs, then develop a cumulative learning method for cases where transfer learning is not powerful enough. We propose to train the same model through several classification tasks over various small datasets to accumulate knowledge in the resulting representation. Using rat brain as the initial training dataset, a cumulative learning approach can achieve a classification accuracy exceeding 98% for 1D clinical MS data. We demonstrate cumulative learning using datasets generated in different biological contexts, on different organisms, and acquired by different instruments. This is a promising strategy for improving MS data classification accuracy when only small numbers of samples are available.
Year: 2020 PMID: 33154370 PMCID: PMC7644674 DOI: 10.1038/s41467-020-19354-z
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Overall accuracies of SpiderMass spectra classification using three CNN architectures.
| Datasets | # classes | variant_Lecun | variant_LeNet | variant_VGG9 |
|---|---|---|---|---|
| Canine sarcoma | 2 | 0.96 ± 0.01 | 0.96 ± 0.01 | |
| 12 | 0.88 ± 0.03 | 0.88 ± 0.02 | ||
| Microorganisms | 3 | 0.52 ± 0.11 | 0.67 ± 0.09 | |
| 5 | 0.68 ± 0.03 | 0.61 ± 0.13 |
The best result for each task (accuracy ± standard deviation over 10 independent iterations) is indicated in boldface.
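The "accuracy ± spread over 10 independent iterations" entries can be reproduced from per-run accuracies along the following lines. This is a minimal sketch: the run values are hypothetical, and the choice of the sample (n − 1) standard deviation is an assumption, since the source does not say which convention was used.

```python
import statistics

def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-run accuracies."""
    mean = statistics.mean(accuracies)
    # statistics.stdev uses the n - 1 (sample) denominator. The source does
    # not state which convention was used, so this choice is an assumption.
    spread = statistics.stdev(accuracies)
    return mean, spread

def format_accuracy(accuracies):
    """Format the runs in the 'mean ± spread' style used in the tables."""
    mean, spread = summarize_runs(accuracies)
    return f"{mean:.2f} ± {spread:.2f}"

# Hypothetical accuracies from 10 independent runs (illustrative values only,
# not taken from the study).
runs = [0.95, 0.96, 0.97, 0.96, 0.95, 0.97, 0.96, 0.96, 0.95, 0.97]
print(format_accuracy(runs))
```

Using `statistics.pstdev` instead would give the population convention; with only 10 runs the two can differ in the second decimal.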
Overall accuracies of SpiderMass spectra classification using three CNN architectures after transfer learning.
| Datasets | # classes | variant_Lecun | variant_LeNet | variant_VGG9 |
|---|---|---|---|---|
| Canine sarcoma | 12 | 0.90 ± 0.01 (02%) | ||
| Microorganisms | 3 | 0.96 ± 0.01 (84%) | 0.95 ± 0.02 (41%) | |
| 5 | 0.96 ± 0.02 (57%) |
The best result for each task (accuracy ± standard deviation over 10 independent iterations) is indicated in boldface. The improvement over training from scratch is expressed as a percentage.
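The source does not spell out the formula behind the percentages in parentheses. The reported values appear consistent with a gain measured relative to the from-scratch accuracy, as the sketch below checks for two pairs taken from the microorganisms (3 classes) rows of the two tables. The function name is hypothetical, and because the inputs are accuracies already rounded to two decimals, the recomputed figure may not match the printed integer exactly.

```python
def relative_improvement(baseline, improved):
    """Percentage gain of `improved` over `baseline`, relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    return 100.0 * (improved - baseline) / baseline

# (from scratch, after transfer learning, percentage printed in the table)
# Both pairs come from the microorganisms, 3 classes rows.
reported = [
    (0.52, 0.96, 84),
    (0.67, 0.95, 41),
]

for scratch, transfer, printed in reported:
    gain = relative_improvement(scratch, transfer)
    print(f"{scratch:.2f} -> {transfer:.2f}: {gain:.1f}% (table: {printed}%)")
```

An absolute difference in percentage points (0.96 − 0.52 = 44 points) would not reproduce the printed 84%, which is why the relative form is assumed here.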
Overall accuracies of canine sarcoma spectra classification by the three CNN architectures.
| Protocol | variant_Lecun | variant_LeNet | variant_VGG9 |
|---|---|---|---|
| Scenario A | 0.92 ± 0.01 (04%a 02%b) | 0.94 ± 0.01 (04%a 01%b) | |
| Scenario B | 0.95 ± 0.02 (08%a 05%b 03%c) | 0.96 ± 0.00 (06%a 03%b 02%c) |
The best result for each task (accuracy ± standard deviation over 10 independent iterations) is indicated in boldface.
a The improvement is expressed as a percentage relative to learning from scratch.
b The improvement is expressed as a percentage relative to transfer learning.
c The improvement is expressed as a percentage relative to Scenario A.
Fig. 1 Workflow of CNN classification.
By transfer learning (in gray for canine sarcoma and microorganisms, and in green for human ovary 1). By cumulative learning Scenario A (in blue for canine sarcoma and in green for human ovary 2). By cumulative learning Scenario B (in orange for canine sarcoma). The final representation of Scenario B is tested on the datasets used during training (purple arrows).
Overall accuracies of variant_LeNet architecture at classifying ovarian spectra.
| Dataset | # classes | CNN from scratch | Transfer learning | Cumulative learning |
|---|---|---|---|---|
| Human ovary 1 | 2 | 0.78 ± 0.02 | | – |
| Human ovary 2 | 2 | 0.80 ± 0.00 | 0.83 ± 0.02 (03%a) | |
The best result for each task (accuracy ± standard deviation over 10 independent iterations) is indicated in boldface.
a The improvement is expressed as a percentage relative to learning from scratch.
b The improvement is expressed as a percentage relative to transfer learning.
Overall accuracies of raw and preprocessed clinical spectra classification by SVM, RF, and LDA.
| Datasets | # classes | Best CNNs (raw) | SVM (raw) | RF (raw) | LDA (raw) | SVM (preprocessed) | RF (preprocessed) | LDA (preprocessed) |
|---|---|---|---|---|---|---|---|---|
| Canine sarcoma | 2 | 0.98 ± 0.00a | 0.77 ± 0.02 | 0.71 ± 0.02 | 0.76 ± 0.16 | 0.93 ± 0.02 | ||
| 12 | 0.99 ± 0.00b | 0.61 ± 0.00 | 0.41 ± 0.01 | 0.52 ± 0.19 | 0.60 ± 0.04 | |||
| Microorganisms | 3 | 0.99 ± 0.00c | 0.45 ± 0.03 | 0.77 ± 0.03 | 0.87 ± 0.02 | 0.88 ± 0.02 | ||
| 5 | 0.99 ± 0.00c | 0.54 ± 0.35 | 0.67 ± 0.13 | 0.19 ± 0.09 | 0.85 ± 0.03 | |||
| Human ovary 1 | 2 | 0.98 ± 0.00c | 0.53 ± 0.04 | 0.65 ± 0.04 | 0.66 ± 0.24 | 0.91 ± 0.02 | ||
| Human ovary 2 | 2 | 0.99 ± 0.00b | 0.60 ± 0.06 | 0.71 ± 0.03 | 0.60 ± 0.05 | 0.88 ± 0.03 | ||
The best result for each task (accuracy ± standard deviation over 10 independent iterations) is indicated in boldface.
a The best CNNs from scratch.
b The best CNNs after cumulative learning.
c The best CNNs after transfer learning.
Description of datasets.
| MS instruments | Datasets | Classes | # spectra | # samples | Mass ranges | # features |
|---|---|---|---|---|---|---|
| Target domain data | ||||||
| Synapt G2-S Q-TOF (Waters, SpiderMass) | Canine sarcoma | Healthy | 482 | 8 | 100–1600 Da | 15,000 |
| Myxosarcoma | 60 | 1 | ||||
| Fibrosarcoma | 404 | 6 | ||||
| Hemangiopericytoma | 134 | 2 | ||||
| Malignant peripheral nerve tumor | 60 | 1 | ||||
| Osteosarcoma | 339 | 5 | ||||
| Undifferentiated pleomorphic sarcoma | 376 | 5 | ||||
| Rhabdomyosarcoma | 66 | 1 | ||||
| Splenic fibrohistiocytic nodules | 63 | 1 | ||||
| Histiocytic sarcoma | 105 | 1 | ||||
| Soft tissue sarcoma | 69 | 1 | ||||
| Gastrointestinal stromal sarcoma | 70 | 1 | ||||
| Total | 2228 | 33 biopsies | ||||
| Synapt G2-S Q-TOF (Waters, SpiderMass) | Microorganisms | 26 | 1 | 100–2000 Da | 19,000a | |
| 26 | 1 | 15,000b | ||||
| 24 | 1 | |||||
| 18 | 1 | |||||
| 23 | 1 | |||||
| Total | 117 | 5 colonies | ||||
| PBSII SELDI-TOF | Human ovary 2 | Healthy | 91 | 700–12,000 Da | 7084 | |
| Cancer | 162 | | ||||
| Total | 253 | |||||
| Source domain data | ||||||
| Rapiflex MALDI-TOF (Bruker) | Rat brain | Gray matter | 4635 | A single section | 300–1300 Da | 19,000a |
| White matter | 5465 | 15,000b | ||||
| Total | 10100 | 7084c | ||||
| Synapt G2-S Q-TOF (Waters, SpiderMass) | Beef liver | Positive mode | 1372 | 10 | 100–1600 Da | 15,000 |
| Negative mode | 1265 | 10 | ||||
| Total | 2637 | 20 samples | ||||
| Hybrid Quadrupole (QSTAR pulsar I) | Human ovary 1 | Healthy | 95 | 1–20,000 Da | 7084 | |
| Cancer | 121 | | ||||
| Total | 216 | |||||
a Number of features used for microorganisms transfer learning.
b Number of features used for canine sarcoma transfer and cumulative learning.
c Number of features used for ovarian transfer and cumulative learning.
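The footnotes indicate that the rat brain source spectra were used with 19,000, 15,000, or 7084 features depending on the target dataset, which suggests the source data were brought to the target's feature count. The source does not describe how this was done. One possible approach, shown here purely as an assumption, is to resample each spectrum onto an evenly spaced m/z grid by linear interpolation; the function name and the toy spectrum are hypothetical.

```python
def resample_spectrum(mz, intensity, mz_min, mz_max, n_features):
    """Linearly interpolate a spectrum onto n_features evenly spaced m/z values.

    `mz` must be strictly increasing. Grid points at or beyond the ends of the
    measured range take the nearest measured intensity.
    """
    if n_features < 2 or len(mz) != len(intensity) or len(mz) < 2:
        raise ValueError("need n_features >= 2 and at least two (mz, intensity) pairs")
    step = (mz_max - mz_min) / (n_features - 1)
    out = []
    j = 0  # index of the measured point just left of the current grid point
    for i in range(n_features):
        x = mz_min + i * step
        if x <= mz[0]:
            out.append(intensity[0])
            continue
        if x >= mz[-1]:
            out.append(intensity[-1])
            continue
        while mz[j + 1] < x:
            j += 1
        t = (x - mz[j]) / (mz[j + 1] - mz[j])
        out.append(intensity[j] + t * (intensity[j + 1] - intensity[j]))
    return out

# Toy spectrum with a single triangular peak (illustrative values only).
mz = [300.0, 800.0, 1300.0]
intensity = [0.0, 10.0, 0.0]
print(resample_spectrum(mz, intensity, 300.0, 1300.0, 5))

# Same spectrum on a 15,000-point grid over the 300–1300 Da range of the table.
resampled = resample_spectrum(mz, intensity, 300.0, 1300.0, 15000)
print(len(resampled))
```

Binning or cropping to a shared mass range would be equally plausible alternatives; the point is only that source and target inputs need the same length before a 1D-CNN's weights can be reused.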
Fig. 2 Protocols of classification with variant_LeNet architecture.
a Protocol of transfer learning. b Protocol of cumulative learning Scenario A. c Protocol of cumulative learning Scenario B.
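The abstract describes cumulative learning as training the same model through several classification tasks so that knowledge accumulates in the learned representation. The sketch below illustrates that schedule with a deliberately tiny pure-Python 1D convolutional model. It is not the authors' variant_LeNet: the architecture (one convolution layer, ReLU, global max pooling, softmax head), the decision to re-initialize the classification head for each task while fine-tuning all kernels, the hyperparameters, and the synthetic spectra are all assumptions made for illustration.

```python
import math
import random

def pooled_features(kernels, x):
    """Valid 1D convolution, ReLU, then global max pooling, one value per kernel."""
    values, positions = [], []
    for kernel in kernels:
        responses = [sum(w * x[i + j] for j, w in enumerate(kernel))
                     for i in range(len(x) - len(kernel) + 1)]
        pos = max(range(len(responses)), key=responses.__getitem__)
        values.append(max(0.0, responses[pos]))
        positions.append(pos)
    return values, positions

def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(kernels, head, x):
    feats, _ = pooled_features(kernels, x)
    return softmax([b + sum(w * f for w, f in zip(row, feats))
                    for row, b in zip(head["W"], head["b"])])

def train_task(kernels, n_classes, data, rng, epochs=20, lr=0.05):
    """Fit a fresh head for one task while fine-tuning the shared kernels in place."""
    head = {"W": [[rng.uniform(-0.1, 0.1) for _ in kernels] for _ in range(n_classes)],
            "b": [0.0] * n_classes}
    for _ in range(epochs):
        for x, y in data:
            feats, positions = pooled_features(kernels, x)
            probs = softmax([b + sum(w * f for w, f in zip(row, feats))
                             for row, b in zip(head["W"], head["b"])])
            # Cross-entropy gradient with respect to the logits.
            dlogits = [p - (1.0 if c == y else 0.0) for c, p in enumerate(probs)]
            # Gradient reaching the pooled features, taken before the head changes.
            dfeats = [sum(dlogits[c] * head["W"][c][f] for c in range(n_classes))
                      for f in range(len(kernels))]
            for c in range(n_classes):
                head["b"][c] -= lr * dlogits[c]
                for f in range(len(kernels)):
                    head["W"][c][f] -= lr * dlogits[c] * feats[f]
            for f, kernel in enumerate(kernels):
                if feats[f] > 0.0:  # ReLU blocks the gradient when inactive
                    for j in range(len(kernel)):
                        kernel[j] -= lr * dfeats[f] * x[positions[f] + j]
    return head

def cumulative_learning(tasks, n_kernels=4, kernel_size=5, seed=0):
    """Train one set of kernels through `tasks` in order; return kernels and last head."""
    rng = random.Random(seed)
    kernels = [[rng.uniform(-0.5, 0.5) for _ in range(kernel_size)]
               for _ in range(n_kernels)]
    head = None
    for n_classes, data in tasks:
        head = train_task(kernels, n_classes, data, rng)
    return kernels, head

def toy_spectra(n_classes, n_per_class, length, rng):
    """Synthetic spectra: class `c` has a peak that is c + 1 points wide."""
    data = []
    for label in range(n_classes):
        for _ in range(n_per_class):
            x = [0.1 * rng.random() for _ in range(length)]
            start = rng.randrange(length - n_classes)
            for d in range(label + 1):
                x[start + d] += 1.0
            data.append((x, label))
    rng.shuffle(data)
    return data

rng = random.Random(1)
tasks = [(2, toy_spectra(2, 5, 30, rng)),   # stand-in for a first source task
         (3, toy_spectra(3, 5, 30, rng))]   # stand-in for the target task
kernels, head = cumulative_learning(tasks)
print(predict_proba(kernels, head, tasks[-1][1][0][0]))
```

Under this reading, transfer learning corresponds to a task list with one source entry followed by the target, and cumulative learning to a longer list in which each intermediate dataset further updates the same kernels before the target task is trained.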