Yoonha Choi1, Jianghan Qu1, Shuyang Wu1, Yangyang Hao1, Jiarui Zhang2, Jianchang Ning1, Xinwu Yang1, Lori Lofaro1, Daniel G Pankratz1, Joshua Babiarz1, P Sean Walsh1, Ehab Billatos2, Marc E Lenburg2, Giulia C Kennedy1, Jon McAuliffe3, Jing Huang4.
Abstract
BACKGROUND: Bronchoscopy for suspected lung cancer has low diagnostic sensitivity, yielding many inconclusive results. The Bronchial Genomic Classifier (BGC) was developed to help with patient management by identifying patients at low risk of lung cancer when bronchoscopy is inconclusive. The BGC was trained and validated on patients in the Airway Epithelial Gene Expression in the Diagnosis of Lung Cancer (AEGIS) trials. A modern patient cohort, the BGC Registry, differed from the AEGIS cohorts in key clinical factors, with less smoking history, smaller nodules and older age. Additionally, we discovered interfering factors (inhaled medication and sample collection timing) that affected gene expression and potentially disguised genomic cancer signals.
Keywords: Bronchoscopy; Gene expression; Lung cancer; Machine learning; Molecular diagnostic test; Risk stratification; Whole transcriptome RNA sequencing
Year: 2020 PMID: 33087128 PMCID: PMC7579926 DOI: 10.1186/s12920-020-00782-1
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Fig. 1 Analysis and evaluation pipeline based on a nested cross-validation schema
Fig. 2 Genomic sequencing classifier structure. (a) Overall structure of the Ensemble model. (b) Detailed structure of the hierarchical logistic regression component
Training and test set composition. (a) The OOI Malignant set includes subjects who are out of indication (OOI) only because bronchoscopy gave a positive lung cancer diagnosis. (b) The OOI Other set includes subjects who are out of indication for other reasons (never smokers, concurrent or prior cancer, or cancer metastatic to the lung)
| Set | Subset | Cohort | Pre-test risk: Low | Pre-test risk: Intermediate | Pre-test risk: High | Pre-test risk: Missing | N |
|---|---|---|---|---|---|---|---|
| Training ( | Primary (Within Indication) | AEGIS | 25 | 50 | 78 | 36 | 189 |
| | | Registry | 7 | 80 | 35 | . | 122 |
| | | Total | | | | | 311 |
| | OOI Malignant (a) | AEGIS | 1 | 24 | 477 | 77 | 579 |
| | | Total | | | | | 579 |
| | OOI Other (b) | AEGIS | 48 | 122 | 217 | 82 | 469 |
| | | Registry | 22 | 85 | 33 | 3 | 143 |
| | | Total | | | | | 612 |
| Test ( | Primary (Within Indication) | AEGIS | 58 | 82 | 106 | . | 246 |
| | | Registry | 22 | 106 | 38 | . | 166 |
| | | Total | | | | | 412 |
Demographic and clinical characteristics of training and test sets focusing on within-indication subjects
| Characteristic | Training: AEGIS | Training: Registry | Test: AEGIS | Test: Registry | P value |
|---|---|---|---|---|---|
| Sex | | | | | 0.36 |
| Female | 72 | 65 | 83 | 84 | |
| Male | 117 | 57 | 163 | 82 | |
| Age | 62 (54–70) | 64 (57–71) | 62 (54–70) | 65 (58–71) | 0.45 |
| Race | | | | | 0.59 |
| White | 141 | 106 | 192 | 132 | |
| Black | 34 | 14 | 42 | 29 | |
| Other | 11 | 2 | 12 | 4 | |
| Unknown | 3 | 0 | 0 | 1 | |
| Smoking status | | | | | 0.45 |
| Current | 79 | 48 | 107 | 73 | |
| Former | 110 | 74 | 139 | 93 | |
| Pack-years | 40 (18–57) | 35 (20–50) | 35 (20–56) | 35 (20–56) | 0.82 |
| Lesion size | | | | | |
| Infiltrate | 0 | 0 | 25 | 0 | |
| < 2 cm | 42 | 61 | 88 | 80 | |
| 2 to 3 cm | 30 | 29 | 48 | 29 | |
| > 3 cm | 41 | 26 | 75 | 44 | |
| Unknown | 60 | 6 | 10 | 13 | |
| Lesion location | | | | | 0.47 |
| Central | 50 | 9 | 72 | 10 | |
| Peripheral | 78 | 107 | 108 | 144 | |
| Central and peripheral | 46 | 0 | 53 | 0 | |
| Unknown | 15 | 6 | 13 | 12 | |
| Cancer histology | | | | | |
| Small-cell | 8 | 3 | 8 | 1 | |
| Non-small-cell | 69 | 48 | 100 | 43 | 0.18 |
| Adenocarcinoma | 30 | 25 | 58 | 25 | |
| Squamous | 28 | 12 | 26 | 10 | |
| Large-cell | 6 | 1 | 4 | 0 | |
| Non-small-cell not otherwise specified | 5 | 10 | 12 | 8 | |
| Other | 0 | 2 | 0 | 2 | |
| Unknown | 21 | 3 | 3 | 6 | |
| Benign diagnosis | | | | | |
| Fibrosis | 1 | 0 | 1 | 0 | |
| Granuloma | 15 | 6 | 26 | 10 | |
| Infection | 30 | 15 | 36 | 15 | |
| Inflammation | 4 | 2 | 1 | 2 | |
| Multiple | 6 | 0 | 8 | 0 | |
| Other | 17 | 4 | 25 | 2 | |
| Resolution or stability | 18 | 39 | 38 | 40 | |
| Clinically benign | 0 | 0 | 0 | 45 | |
Fig. 3 Gene correlation analysis (WGCNA): correlation of module eigengenes (rows) with clinical factors (columns). Heatmap color is based on absolute Pearson correlation. Legend for p-value significance: '***' 0 < p-value ≤ 0.001; '**' 0.001 < p-value ≤ 0.01; '*' 0.01 < p-value ≤ 0.05; '.' 0.05 < p-value ≤ 0.1; ' ' 0.1 < p-value ≤ 1. The number of genes in each module is shown in parentheses in the row labels
Fig. 4 Gene expression variation associated with smoking status, specimen collection timing, cohort and inhaled medication. PCA was performed using scaled normalized (VST) expression data for (a) 17,954 genes from all subjects in the smoking index training set (N = 1578); (b) and (c) 17,954 genes from within-indication subjects in the training set (N = 311); (d) 998 benign vs malignant DE genes from within-indication subjects in the training set (N = 311). (e) Distribution of basal, blood, cilia and immune cell type indexes in within-indication subjects in the training set, separated by specimen collection timing
Fig. 5 Cross-validation performance for down-classification. The performance is evaluated on within-indication training samples with low/intermediate pre-test risk (N = 162) using 10 repeats of 5-fold CV. The original Gould model was used to score training samples. GLM(m) is a generalized linear regression model only containing main effects of clinical features and genomic features. GLM(i) includes main effects and interactions between clinical features and genomic features. GLM(m) and GLM(i) used the same set of input clinical features: age, gender, nodule size, pack-year, years since quitting smoking, specimen collection timing and genomic smoking index. “Ensemble” is the final GSC classifier
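The "10 repeats of 5-fold CV" scheme in the Fig. 5 caption can be illustrated with a minimal sketch. The caption does not say whether folds were stratified or which tooling was used, so the plain shuffled split, the function name `repeated_kfold` and the seed below are assumptions made only for illustration.

```python
import random


def repeated_kfold(n_samples, n_folds=5, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV.

    Hypothetical helper: unstratified, shuffled splits. The seed is arbitrary.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        # Reshuffle all sample indices at the start of every repeat.
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Deal the shuffled indices into n_folds nearly equal folds.
        folds = [indices[k::n_folds] for k in range(n_folds)]
        for k, test_idx in enumerate(folds):
            # Train on every fold except the held-out one.
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield repeat, k, train_idx, test_idx


# N = 162 low/intermediate pre-test risk training samples, as in the caption.
splits = list(repeated_kfold(162, n_folds=5, n_repeats=10))
```

Each repeat holds out every sample exactly once, so averaging a metric over all 50 train/test splits gives ten full passes over the 162 samples.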
Lung cancer genomic sequencing classifier validation performance (Down, up classification). *Cancer prevalence calculation includes local benign subjects as Prevalence = . The local benign subjects had local label as benign but did not have an adjudicated label. NPV, PPV and % Reclassified are all functions of prevalence (estimated including local benign subjects), sensitivity and specificity (both estimated excluding local benign subjects)
| AUC | Pre-test Cancer Risk | *Cancer prevalence | Cancer risk re-stratification | Specificity | Sensitivity | Post-test NPV/PPV | §% Re-stratified |
|---|---|---|---|---|---|---|---|
| 73.4% [68.3–78.4] | Low | 5% | Low to Very Low | 57.4% [44.8–69.3] | 100% [39.8–100] | 100% NPV [91.0–100] | 54.5% |
| | Intermediate | 28.2% | Intermediate to Low | 37.3% [27.9–47.4] | 90.6% [79.3–96.9] | 91.0% NPV [80.8–96.0] | 29.4% |
| | Intermediate | 28.2% | Intermediate to High | 94.1% [87.6–97.8] | 28.3% [16.8–42.3] | 65.4% PPV [43.8–82.1] | 12.2% |
| | High | 73.6% | High to Very High | 91.2% [76.3–98.1] | 34.0% [25.0–43.8] | 91.5% PPV [77.9–97.0] | 27.3% |
§ % Reclassified (Low to Very Low, Intermediate to Low) = (1 − Prevalence) × Specificity + Prevalence × (1 − Sensitivity)
§ % Reclassified (Intermediate to High, High to Very High) = Prevalence × Sensitivity + (1 − Prevalence) × (1 − Specificity)
* There are 8, 33 and 4 local benign subjects in the low-, intermediate- and high-risk groups, respectively
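The % Reclassified footnotes can be checked numerically against the validation table. The sketch below implements the two footnote formulas as written. The NPV and PPV functions use the standard Bayes' rule definitions, which is an assumption here: the caption only says these are functions of prevalence, sensitivity and specificity. The function names are hypothetical, and the inputs are the rounded table values, so results agree with the table only to within rounding.

```python
def fraction_down(prevalence, sensitivity, specificity):
    """Fraction of a risk group with a negative call (footnote formula, down-classification)."""
    return (1 - prevalence) * specificity + prevalence * (1 - sensitivity)


def fraction_up(prevalence, sensitivity, specificity):
    """Fraction of a risk group with a positive call (footnote formula, up-classification)."""
    return prevalence * sensitivity + (1 - prevalence) * (1 - specificity)


def npv(prevalence, sensitivity, specificity):
    """Negative predictive value via Bayes' rule (assumed standard definition)."""
    true_negatives = (1 - prevalence) * specificity
    return true_negatives / fraction_down(prevalence, sensitivity, specificity)


def ppv(prevalence, sensitivity, specificity):
    """Positive predictive value via Bayes' rule (assumed standard definition)."""
    true_positives = prevalence * sensitivity
    return true_positives / fraction_up(prevalence, sensitivity, specificity)


# Rounded inputs from the validation table: (prevalence, sensitivity, specificity).
low_down = (0.05, 1.000, 0.574)
intermediate_down = (0.282, 0.906, 0.373)
intermediate_up = (0.282, 0.283, 0.941)
high_up = (0.736, 0.340, 0.912)

for label, frac, pv in [
    ("Low to Very Low", fraction_down(*low_down), npv(*low_down)),
    ("Intermediate to Low", fraction_down(*intermediate_down), npv(*intermediate_down)),
    ("Intermediate to High", fraction_up(*intermediate_up), ppv(*intermediate_up)),
    ("High to Very High", fraction_up(*high_up), ppv(*high_up)),
]:
    print(f"{label}: {100 * frac:.1f}% reclassified, predictive value {100 * pv:.1f}%")
```

With these inputs, the reclassified fractions match the table's 54.5%, 29.4%, 12.2% and 27.3% after rounding. The Intermediate to High PPV comes out slightly below the tabulated 65.4% because the inputs are themselves rounded.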