Guruduth Banavar, Oyetunji Ogundijo, Ryan Toma, Sathyapriya Rajagopal, Yen Kai Lim, Kai Tang, Francine Camacho, Pedro J Torres, Stephanie Gline, Matthew Parks, Liz Kenny, Ally Perlina, Hal Tily, Nevenka Dimitrova, Salomon Amar, Momchilo Vuyisich, Chamindie Punyadeera.
Abstract
Despite advances in cancer treatment, the 5-year mortality rate for oral cancers (OC) is 40%, mainly due to the lack of early diagnostics. To advance early diagnostics for high-risk and average-risk populations, we developed and evaluated machine-learning (ML) classifiers using metatranscriptomic data from saliva samples (n = 433) collected from patients with oral premalignant disorders (OPMD) or OC (n = 71) and from normal controls (n = 171). Our diagnostic classifiers yielded a receiver operating characteristic (ROC) area under the curve (AUC) up to 0.9, sensitivity up to 83% (92.3% for stage 1 cancer) and specificity up to 97.9%. Our metatranscriptomic signature incorporates both taxonomic and functional microbiome features, and reveals a number of taxa and functional pathways associated with OC. We demonstrate the potential clinical utility of an AI/ML model for diagnosing OC early, opening a new era of non-invasive diagnostics, enabling early intervention and improved patient outcomes.
Year: 2021 PMID: 34880265 PMCID: PMC8654845 DOI: 10.1038/s41525-021-00257-x
Source DB: PubMed Journal: NPJ Genom Med ISSN: 2056-7944 Impact factor: 8.617
Table 1. Study cohorts.
| | A: High-risk OC+OPMD discovery cohort | B: High-risk OC+OPMD cross-validation (A+27 samples) | C: Average-risk OC-only (OC subset of A + 7 average-risk) | D: Average-risk technical validation | Total unique across all cohorts |
|---|---|---|---|---|---|
| Number of participants | 117 | 144 | 99 | 91 | 242 |
| Controls | 59 | 75 | 49 | 91 | 171 |
| Cases | 58 | 69 | 50 | n/a | 71 |
| Number of samples total | 117 | 117 from Cohort A+ | 92 from Cohort A+ | 282 | n/a |
| Unique samples | 117 | 27 | 7 | 282 | 433 |
| Cases | 58 | 69 | 50 | n/a | 71 |
| Pre-malignant | 10 | 14 | n/a | n/a | 14 |
| Malignant | 48 | 55 | 50 | n/a | 57 |
| Sex (% female) | 37.6 | 37.5 | 40.4 | 38.7 | 38.8 |
| Controls | 54.2 | 50.7 | 57.1 | 38.7 | |
| Cases | 20.7 | 23.2 | 24 | n/a | |
| Age (y) mean ± std | 60.2 ± 11.3 | 61.4 ± 11.4 | 59.7 ± 12.6 | 22.6 ± 10.5 | 37.2 ± 21.7 |
| Controls | 56.3 ± 10 | 58.5 ± 11 | 56 ± 10.8 | 22.6 ± 10.5 | |
| Cases | 64.1 ± 11.4 | 64.5 ± 11.1 | 63.3 ± 13.3 | n/a | |
The 433 unique samples in this study (57 OC samples, 14 OPMD samples, and 362 cancer-free samples) are organized into four cohorts (A, B, C, and D) according to the study goals. The high-risk population comprises participants who are 50 years or older or who have a history of smoking (current or past smokers). The average-risk population is the general population, across all backgrounds and histories.
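The high-risk definition above is a simple logical OR of two criteria, which the following minimal Python sketch makes explicit. The `Participant` type and `is_high_risk` function are hypothetical names introduced here for illustration and are not part of the study's software.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical record holding only the two fields the rule needs."""
    age_years: int
    ever_smoked: bool  # True for current or past smokers


def is_high_risk(p: Participant) -> bool:
    # High-risk: 50 years or older OR any history of smoking.
    return p.age_years >= 50 or p.ever_smoked


# Sample counts quoted in the table note: OC + OPMD + cancer-free.
TOTAL_UNIQUE_SAMPLES = 57 + 14 + 362
```

Either criterion alone is sufficient, so a 30-year-old former smoker and a 62-year-old never-smoker are both classed as high-risk under this reading.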
Fig. 1 Descriptive statistics of salivary metatranscriptome of the high-risk population (Cohort A in Table 1).
a Species richness (control median 463, case median 415) and function richness (control median 2306, case median 2205). b Shannon diversity index (control mean 2.25, case mean 2.20) and inverse Simpson diversity index (control mean 3.41, case mean 3.26). c Using Mann–Whitney U tests and at least a twofold difference in means (0.69 in CLR space), 139 differentially expressed species (at p < 0.05) are up- or downregulated (red and blue, respectively) in cases relative to controls, organized by genus and phylum (median difference in CLR values); the size of each bubble is inversely proportional to its p value. d Using Mann–Whitney U tests and at least a twofold difference in means (0.69 in CLR space), 49 differentially expressed KOs (at p < 0.05) are up- or downregulated in cases relative to controls, organized by KEGG level-3 and level-2 functional groups; the size of each triangle is inversely proportional to its p value. e Clustermap using Euclidean distance of CLR-transformed sum(transcripts per million) data for active function (KO) features significant by Mann–Whitney U tests. Features are shown with corrected p values < 0.01 and median CLR differences between the cohorts of greater than 0 or less than −1. KOs are color coded by their KEGG level-3 functional group.
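The quantities named in this caption (richness, Shannon and inverse Simpson diversity, and the centered log-ratio, CLR, transform) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' pipeline: the natural-log base for the Shannon index, the pseudocount of 1 in `clr`, and all function names are assumptions made here. With natural logarithms, a twofold difference corresponds to ln 2 ≈ 0.69, which matches the threshold quoted in the caption.

```python
import math


def richness(counts):
    """Number of features (species or KOs) with a non-zero count."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon index H = -sum(p_i * ln p_i); natural log is an assumption."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def inverse_simpson(counts):
    """Inverse Simpson index 1 / sum(p_i^2)."""
    total = sum(counts)
    return 1.0 / sum((c / total) ** 2 for c in counts if c > 0)


def clr(values, pseudocount=1.0):
    """Centered log-ratio: log of each value minus the mean of the logs.

    The pseudocount (an assumption) keeps zero counts finite.
    """
    logs = [math.log(v + pseudocount) for v in values]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


# A twofold change expressed on the natural-log (CLR) scale.
TWOFOLD_IN_CLR_SPACE = math.log(2)
```

For a perfectly even community of four features, `shannon` returns ln 4 and `inverse_simpson` returns 4, which is a quick sanity check for both indices.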
Fig. 2 Predictive performance of machine-learnt classifier trained with discovery dataset (Cohort A in Table 1).
a Distribution of classifier output probabilities across the sample set. b Sensitivity and specificity tradeoff with 95% confidence intervals computed using the Clopper-Pearson method; at the default decision boundary of 0.5, sensitivity is 0.81 and specificity is 0.85. c ROC AUC of the classifier using leave-one-out cross-validation (LOOCV) is 0.87 (blue curve); using differentially expressed features only, it is 0.76 (orange curve). d Classifier probabilities separated by gender. e Classifier probabilities separated by smoking status. f PCA using the top 100 features (PC1 and PC2 capture 10.2% and 6.3% of the total variation, respectively). g Probability of cancer output from the classifier for control samples with and without interference from chewing gum, chewing tobacco, and brushing teeth.
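The Clopper-Pearson method named in panel b gives an exact binomial confidence interval for a proportion such as sensitivity or specificity. The sketch below is a stdlib-only illustration that finds the bounds by bisection on binomial tail probabilities; it is not the authors' code, and the function names are hypothetical. Statistical libraries usually compute the same interval from beta-distribution quantiles.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect_increasing(f, lo=0.0, hi=1.0, iters=100):
    """Root of an increasing function f on [lo, hi] by bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _bisect_increasing(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _bisect_increasing(
        lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper


# Example: 12 of 13 stage 1 cancers detected in Cohort A.
stage1_low, stage1_high = clopper_pearson(12, 13)
```

With small counts such as 12 of 13, the interval is wide and asymmetric around the point estimate, which is why an exact method is preferable to a normal approximation here.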
Table 2. Model performance for cohorts described in Table 1.
| | A: High-risk | B: High-risk CV | C: Average-risk OC only | D: Average-risk Technical validation |
|---|---|---|---|---|
| ROC AUC | 0.87 | 0.87 | 0.90 | n/a |
| Sensitivity | 81% | 83% | 76% | n/a |
| Specificity | 85% | 79% | 88% | 97.9% |
| True positives by stage | | | | |
| OPMD | 7/10 | 11/14 | n/a | n/a |
| OC Stage 1 | 12/13 | 11/14 | 12/14 | n/a |
| OC Stage 2 | 11/16 | 12/17 | 11/16 | n/a |
| OC Stage 3 | 1/2 | 2/2 | 2/3 | n/a |
| OC Stage 4 | 13/14 | 18/19 | 10/14 | n/a |
For sensitivity and specificity, we used the standard default clinical decision threshold of prediction probability = 0.5. Technical validation for the average-risk cohort D was performed using the model developed for Cohort A.
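The metrics reported above follow directly from the classifier's output probabilities. The Python sketch below shows one way to compute sensitivity and specificity at a decision threshold of 0.5, and ROC AUC as the fraction of case-control pairs in which the case receives the higher probability. The function names and the toy probabilities are hypothetical, and treating a probability equal to the threshold as positive is an assumption, since the source does not specify tie handling.

```python
def sensitivity_specificity(probs, labels, threshold=0.5):
    """Return (sensitivity, specificity); labels are 1 for case, 0 for control."""
    tp = fn = tn = fp = 0
    for p, y in zip(probs, labels):
        predicted_case = p >= threshold  # assumption: ties count as positive
        if y == 1:
            tp += predicted_case
            fn += not predicted_case
        else:
            fp += predicted_case
            tn += not predicted_case
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(probs, labels):
    """AUC as the share of (case, control) pairs ranked correctly; ties count 0.5."""
    cases = [p for p, y in zip(probs, labels) if y == 1]
    controls = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Toy data (made up for illustration, not from the study).
toy_probs = [0.9, 0.6, 0.4, 0.2, 0.7]
toy_labels = [1, 1, 1, 0, 0]
toy_sens, toy_spec = sensitivity_specificity(toy_probs, toy_labels)
toy_auc = roc_auc(toy_probs, toy_labels)
```

Cohort D contains controls only, so only specificity is defined for it, which is consistent with the n/a entries for ROC AUC and sensitivity in that column.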
Fig. 3 Oral metatranscriptomic signature from the ML classifier trained with Cohort A from Table 1.
Effect sizes (coefficient values within the classification model) of 101 active species (circles) and 247 active KOs (triangles), grouped into curated Viome Functional Categories (VFC; see Supplementary Note 4 in the Supplementary Material). The sizes of the circles and triangles are proportional to the CLR median difference in expression level between cases and controls.