Dieter Galea, Paolo Inglese, Lidia Cammack, Nicole Strittmatter, Monica Rebec, Reza Mirnezami, Ivan Laponogov, James Kinross, Jeremy Nicholson, Zoltan Takats, Kirill A. Veselkov.
Abstract
Hierarchical classification (HC) stratifies and classifies data from broad classes into more specific classes. Unlike commonly used data classification strategies, this enables the probabilistic prediction of unknown classes at different levels, minimizing the burden of incomplete databases. Despite these advantages, its translational application in biomedical sciences has been limited. We describe and demonstrate the implementation of an HC approach for "omics-driven" classification of 15 bacterial species at various taxonomic levels, achieving 90-100% accuracy, and of 9 cancer types into morphological types and 35 subtypes, with 99% and 76% accuracy, respectively. Unknown bacterial species were probabilistically assigned with 100% accuracy to their respective genus or family using mass spectra (n = 284). Cancer types were predicted from mRNA data (n = 1960) for most subtypes with 95-100% accuracy. This has high relevance in clinical practice, where complete datasets are difficult to compile given the continuous evolution of diseases and the emergence of new strains, yet prediction of unknown classes, such as bacterial species, at upper hierarchy levels may be sufficient to initiate antimicrobial therapy. The algorithms presented here can be directly translated into clinical use with any quantitative data, and have broad application potential, from unlabeled-sample identification, to hierarchical feature selection, and the discovery of new taxonomic variants.
Year: 2017 PMID: 29101330 PMCID: PMC5670129 DOI: 10.1038/s41598-017-14092-7
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
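The abstract's top-down scheme (a local classifier at each tree node, with probabilistic early stopping so uncertain samples are assigned only at upper taxonomic levels) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tree, class names, synthetic data, posterior threshold of 0.8, and the use of scikit-learn's `LogisticRegression` as the per-node classifier are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy two-level taxonomy (illustrative names, not the paper's data):
# root -> genera -> species.
TREE = {"root": ["GenusA", "GenusB"],
        "GenusA": ["sp1", "sp2"],
        "GenusB": ["sp3", "sp4"]}
LEAVES = ["sp1", "sp2", "sp3", "sp4"]

rng = np.random.default_rng(0)

def make_data(n=30):
    # One Gaussian cluster of 5-dimensional "spectra" per species.
    X, y = [], []
    for i, leaf in enumerate(LEAVES):
        X.append(rng.normal(loc=3.0 * i, scale=1.0, size=(n, 5)))
        y += [leaf] * n
    return np.vstack(X), np.array(y)

def child_of(node, leaf):
    # The child of `node` that lies on the path from the root to `leaf`.
    for child in TREE[node]:
        if child == leaf or leaf in TREE.get(child, ()):
            return child
    return None

def train(X, y):
    # Fit one local classifier per internal tree node, using only the
    # samples whose leaf class falls under that node.
    models = {}
    for node in TREE:
        labels = np.array([child_of(node, leaf) for leaf in y], dtype=object)
        mask = np.array([lab is not None for lab in labels])
        models[node] = LogisticRegression(max_iter=1000).fit(
            X[mask], labels[mask].astype(str))
    return models

def predict_path(models, x, threshold=0.8):
    # Descend the tree; stop once the posterior drops below `threshold`,
    # so an uncertain sample is assigned only at an upper (genus) level.
    node, path = "root", []
    while node in TREE:
        proba = models[node].predict_proba(x[None, :])[0]
        k = int(np.argmax(proba))
        if proba[k] < threshold:
            break
        node = models[node].classes_[k]
        path.append(node)
    return path
```

On well-separated clusters like these, each sample descends to its leaf; lowering the cluster separation or raising the threshold produces genus-only assignments, mirroring the paper's upper-level predictions for unknown species.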
Figure 1. Hierarchical classification of bacterial mass spectral profiles. (a) Hierarchical tree structure for the bacterial species analyzed, where color-coding represents species belonging to the same genus, as indicated in the legend; grey-scaling indicates upper-level hierarchies. (b) Mean % classification accuracies over 5 predictions at the different tree levels achieved by the selective-classifier approach. (c) Semi-quantitative plot of classification performance at the lowest (species) level, showing where misclassifications occurred. The inner circle indicates the actual species and the outer circle the predicted class. Each column represents a genus and rows represent one or more species belonging to that genus. The overall color for the species in each genus corresponds to the color legend in (a).
Figure 2. Hierarchical classification of the cancer genomic dataset. (a) Hierarchical tree structure for the cancer dataset analyzed, derived from previous literature; cancer types (level 1) were classified with a mean accuracy of 99%, while subtypes (level 2) were classified with a mean accuracy of 76 ± 2%. (b) Semi-quantitative plot of classification performance at the lowest (cancer-subtype) level, showing where misclassifications occurred. The inner circle indicates the actual class and the outer circle the predicted class. Columns represent the different cancer types and rows the corresponding subtypes. Subtype colors correspond to the node colors assigned in the lowest layer of the hierarchical tree in (a).
Figure 3. Representative leave-one-species-out scores plots for the prediction of unknown bacterial spectra. Part of the bacterial classification tree with representative discrimination plots generated for the prediction of Streptococcus agalactiae at various hierarchical levels using the leave-one-species-out algorithm, in which S. agalactiae was omitted from training and then predicted. Correctly predicted samples are indicated by a green outline. S. agalactiae was predicted up to the genus level with 100% accuracy. The scores plotted are obtained from the ‘best’-chosen dimensionality reduction space.
Figure 4. Representative leave-one-subtype-out scores plots. Discrimination plots generated for the prediction of: (a) ‘squamous’ subtype to bladder urothelial carcinoma (BLCA); (b) ‘reactive-like’ subtype to breast adenocarcinoma (BRCA); (c) ‘classical’ subtype to glioblastoma multiforme (GBM); (d) ‘ccB(2)’ subtype to kidney renal clear cell carcinoma (KIRC); (e) ‘PRCC Type 2’ subtype to kidney renal papillary cell carcinoma (KIRP); (f) ‘FAB M5’ subtype to acute myeloid leukemia (LAML); (g) ‘IDHmut-codel’ to lower grade glioma (LGG); (h) ‘proximal proliferative’ to lung adenocarcinoma (LUAD); and (i) ‘ETS-fusion negative’ to prostate adenocarcinoma (PRAD). Correctly predicted samples are indicated by a green outline, a red ‘x’ denotes non-classified samples, and misclassified samples are indicated by a red outline. Axes represent discriminant components, where the second component is used only for visualization purposes. The scores plotted are obtained from the ‘best’-chosen dimensionality reduction space. Subtype assignment information is provided in the Methods and Supplementary Information Note 1. Most subtypes were correctly assigned to the respective cancer type with 95–100% accuracy (see Supplementary Table 4).
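The leave-one-class-out experiments behind Figures 3 and 4 can be mimicked on synthetic data: withhold one species (or subtype) from training entirely, then check that the upper-level model still assigns its samples to the correct parent class. Everything below is an illustrative assumption — the class names, the Gaussian stand-in "profiles", and the logistic-regression classifier are not the paper's data or pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic stand-in profiles: two genera with two species each
# (names and cluster centers are illustrative, not the paper's).
centers = {("GenusA", "sp1"): 0.0, ("GenusA", "sp2"): 2.0,
           ("GenusB", "sp3"): 8.0, ("GenusB", "sp4"): 10.0}
X, genus, species = [], [], []
for (g, s), c in centers.items():
    X.append(rng.normal(c, 1.0, size=(40, 6)))
    genus += [g] * 40
    species += [s] * 40
X, genus, species = np.vstack(X), np.array(genus), np.array(species)

# Leave one species out of training entirely, then test whether its
# samples are still recovered at the parent (genus) level.
held = species == "sp2"
genus_clf = LogisticRegression(max_iter=1000).fit(X[~held], genus[~held])
genus_acc = float(np.mean(genus_clf.predict(X[held]) == "GenusA"))
```

Because the held-out species lies much closer to its sibling than to the other genus, the genus-level model recovers it — the same intuition the paper uses to assign unknown species at upper hierarchy levels.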
Table 1. Principles of the dimensionality reduction techniques used in the classification and prediction algorithms.
| Method | Method Abbrev. | Components derivation |
|---|---|---|
| Principal Component Analysis | PCA | Maximizes overall dataset variance without considering between-class variance |
| Partial Least Squares | PLS | Maximizes between-class variance without considering within-class variance |
| Maximum Margin Criterion | MMC | Maximizes between-class variance, while minimizing within-class variance |
| Linear Discriminant Analysis | LDA | Maximizes the ratio of between- to within-class variance; requires the number of samples to exceed the number of variables |
| Support Vector Machines | SVM | Maximizes the margin of separation between the classes |
Methods, their respective abbreviations, and a description of how their components are derived to obtain a reduced-dimensionality space. PCA and MMC were each used in combination with LDA to form the combinatory methods PCA-LDA and MMC-LDA.
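The combinatory PCA-LDA method from the table can be illustrated with scikit-learn (this is a sketch, not the authors' implementation): PCA first compresses the data so that LDA's within-class scatter matrix stays well-conditioned when variables outnumber samples, and LDA then maximizes the between-/within-class variance ratio in that subspace. The Iris dataset and the choice of 3 components are stand-in assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Stand-in data: Iris (150 samples, 4 variables, 3 classes).
X, y = load_iris(return_X_y=True)

# PCA -> LDA pipeline: variance-preserving compression, then
# supervised discrimination in the reduced space.
pca_lda = make_pipeline(PCA(n_components=3), LinearDiscriminantAnalysis())
mean_acc = float(cross_val_score(pca_lda, X, y, cv=5).mean())
```

Swapping the PCA step for an MMC projection would give the table's MMC-LDA variant; scikit-learn has no built-in MMC, so that step would need a custom transformer.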