Kerry E Poppenberg1,2, Vincent M Tutino1,3,2,4, Lu Li5, Muhammad Waqas2,6, Armond June4, Lee Chaves7, Kaiyu Jiang8, James N Jarvis8,9, Yijun Sun8,10, Kenneth V Snyder1,2,11,6, Elad I Levy1,2,11, Adnan H Siddiqui1,2,11, John Kolega1,4, Hui Meng12,13,14,15.
Abstract
BACKGROUND: Intracranial aneurysms (IAs) are dangerous because of their potential to rupture. We previously found significant RNA expression differences in circulating neutrophils between patients with and without unruptured IAs and trained machine learning models to predict presence of IA using 40 neutrophil transcriptomes. Here, we aim to develop a predictive model for unruptured IA using neutrophil transcriptomes from a larger population and more robust machine learning methods.
Keywords: Inflammation; Intracranial aneurysm; Machine learning; Neutrophil; Prediction model; Transcriptomics
Year: 2020 PMID: 33059716 PMCID: PMC7565814 DOI: 10.1186/s12967-020-02550-2
Source DB: PubMed Journal: J Transl Med ISSN: 1479-5876 Impact factor: 5.531
Clinical characteristics of training and testing cohorts
| | Training Cohort: Control (n = 55) | Training Cohort: Aneurysm (n = 39) | Testing Cohort: Control (n = 24) | Testing Cohort: Aneurysm |
|---|---|---|---|---|
| Age (Mean ± SE) | 62 ± 2.0 | 61 ± 1.7 | 59 ± 2.9 | 57 ± 3.3 |
| Age [Median (Q1/Q3)] | 66 (54/72) | 60 (54/68) | 59 (54/68) | 58.5 (49.25/63.25) |
| Sex (% of patients) | ||||
| Female | 56.36% | 69.23% | 50% | 75% |
| Smoker (% of patients) | ||||
| Yes | 10.91% | 26.64% | 20.83% | 43.75% |
| Comorbidities (% of patients) | ||||
| Hypertension | 61.82% | 53.85% | 54.17% | 50% |
| Heart disease | 30.91% | 23.08% | 25% | 18.75% |
| High cholesterol | 52.73% | 48.72% | 62.50% | 50% |
| Stroke history | 12.73% | 10.26% | 25% | 0% |
| Diabetes | 29.09% | 17.95% | 8.33% | 31.25% |
| Arthritis | 16.36% | 30.77% | 16.67% | 18.75% |
Clinical characteristics of the randomly-created training and testing cohorts. With the exception of age, these factors were quantified as binary data points. The clinical factors were retrieved from the patients’ medical records via the latest “Patient Medical History” form administered prior to imaging
Fig. 1 RNAseq data from whole dataset (n = 134). a The scatter plot demonstrates the dispersion in expression between the IA and control groups. b The volcano plot produced following edgeR analysis demonstrates that there are 65 differentially expressed genes. Red points are increased in IA group and blue points are decreased in IA group. c Clustering performed on all transcriptome data demonstrates several distinct clusters of IA and control samples. Overall, 73% of samples were assigned to the correct group
Fig. 2 Networks derived from IPA of the 65 differentially expressed transcripts (q < 0.05, fold-change > 2). Transcripts with increased expression in IA are red; transcripts with lower expression in IA are green; fold-change is represented by intensity. a This network (p-score = 21) has related functions of cell-to-cell signaling and interaction, nervous system development and function, and cell morphology. b This network (p-score = 21) is associated with dermatological diseases and conditions, organismal injury and abnormalities, and connective tissue development and function. c This network (p-score = 15) has ties to cell death and survival, connective tissue disorders, and inflammatory disease
Fig. 3 Verification of RNA-Sequencing data for 9 transcripts by qPCR. A total of 49 of the sequenced samples were analyzed by RT-qPCR, as the other samples did not have enough RNA for the additional reactions. In this subset of patients, seven of the 9 transcripts had the same direction of expression difference on qPCR. There was a statistically significant difference in fold-change in expression (indicated by *) between RNAseq and qPCR for C1QL1 and ZBTB16. Only ZBTB16 had both a significant and different fold-change direction (indicated by †) than that calculated with RNA sequencing data. (Negative fold-change values calculated by negative inverse of fold-change, error bars = standard error.)
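The signed fold-change convention mentioned in the Fig. 3 legend can be illustrated with a minimal sketch. The function name `signed_fold_change` is hypothetical, and the sketch assumes the input is a positive expression ratio (IA over control); the legend does not state how the authors implemented it.

```python
def signed_fold_change(fold_change: float) -> float:
    """Map a positive expression ratio to a signed fold-change.

    Ratios of 1 or more are returned unchanged; ratios below 1
    (lower expression in IA) are reported as the negative inverse.
    """
    if fold_change <= 0:
        raise ValueError("fold-change must be a positive ratio")
    return fold_change if fold_change >= 1 else -1.0 / fold_change


print(signed_fold_change(2.0))  # ratio of 2 stays 2.0
print(signed_fold_change(0.5))  # ratio of 0.5 becomes -2.0
```

With this convention a halving and a doubling of expression have the same magnitude (2) and differ only in sign, which is what makes the bars in such a plot symmetric around zero.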
The 37 transcripts selected for classification model training
| Gene ID | Accession # | Training Acc | Training Sen | Training Spe | Testing Acc | Testing Sen | Testing Spe | |
|---|---|---|---|---|---|---|---|---|
| – | AC011380 | 0.56 | 0.64 | 0.51 | 0.58 | 0.69 | 0.50 | |
| 10882 | NM_006688.5 | 0.80 | 0.56 | 0.96 | 0.93 | 0.88 | 0.96 | |
| 387885 | NM_001144872.2 | 0.59 | 0.05 | 0.96 | 0.65 | 0.19 | 0.96 | |
| 100653515 | NM_001243541.1 | 0.71 | 0.59 | 0.80 | 0.80 | 0.56 | 0.96 | |
| 79603 | NM_024552.3 | 0.79 | 0.72 | 0.84 | 0.85 | 0.88 | 0.83 | |
| 10978 | NM_006831.3 | 0.57 | 0.00 | 0.98 | 0.58 | 0.00 | 0.96 | |
| 54165 | NM_020640.3 | 0.57 | 0.00 | 0.98 | 0.60 | 0.00 | 1.00 | |
| 8637 | NM_003732.3 | 0.47 | 0.26 | 0.62 | 0.68 | 0.69 | 0.67 | |
| 2321 | NM_002019.4 | 0.52 | 0.00 | 0.89 | 0.50 | 0.00 | 0.83 | |
| 26301 | NM_021996.6 | 0.71 | 0.74 | 0.69 | 0.70 | 0.88 | 0.58 | |
| 2838 | NM_005290 | 0.79 | 0.79 | 0.78 | 0.93 | 1.00 | 0.88 | |
| 80045 | NM_024980.5 | 0.57 | 0.00 | 0.98 | 0.60 | 0.00 | 1.00 | |
| 2959 | NM_001514.6 | 0.55 | 0.03 | 0.93 | 0.55 | 0.00 | 0.92 | |
| 3043 | NM_000518.5 | 0.79 | 1.00 | 0.64 | 0.58 | 0.94 | 0.33 | |
| 8367 | NM_003545.3 | 0.57 | 0.00 | 0.98 | 0.60 | 0.00 | 1.00 | |
| 317772 | NM_175065.2 | 0.57 | 0.00 | 0.98 | 0.60 | 0.00 | 1.00 | |
| 57461 | NM_001199469.1 | 0.60 | 0.05 | 0.98 | 0.65 | 0.13 | 1.00 | |
| 57535 | NM_020775.5 | 0.59 | 0.05 | 0.96 | 0.55 | 0.06 | 0.88 | |
| 57710 | NM_020950.2 | 0.63 | 0.15 | 0.96 | 0.63 | 0.06 | 1.00 | |
| 100129697 | NM_001290330.2 | 0.65 | 0.26 | 0.93 | 0.58 | 0.06 | 0.92 | |
| 105377284 | XR_938891.2 | 0.59 | 0.05 | 0.96 | 0.65 | 0.13 | 1.00 | |
| 54674 | NM_001099658.2 | 0.78 | 0.79 | 0.76 | 0.85 | 1.00 | 0.75 | |
| 162387 | NM_152599.3 | 0.67 | 0.59 | 0.73 | 0.65 | 0.50 | 0.75 | |
| 23515 | NM_015358.3 | 0.57 | 0.00 | 0.98 | 0.60 | 0.00 | 1.00 | |
| 100462977 | NM_001190452.1 | 0.57 | 0.03 | 0.96 | 0.60 | 0.00 | 1.00 | |
| 64168 | NM_022351.5 | 0.62 | 0.15 | 0.95 | 0.58 | 0.06 | 0.92 | |
| 55247 | NM_018248.3 | 0.57 | 0.00 | 0.98 | 0.60 | 0.00 | 1.00 | |
| 11235 | NM_007217.4 | 0.57 | 0.00 | 0.98 | 0.60 | 0.00 | 1.00 | |
| 5239 | NM_021965.4 | 0.55 | 0.00 | 0.95 | 0.50 | 0.00 | 0.83 | |
| 117584 | NR_037713.1 | 0.57 | 0.00 | 0.98 | 0.58 | 0.00 | 0.96 | |
| 27111 | NM_080489.5 | 0.71 | 0.33 | 0.98 | 0.65 | 0.31 | 0.88 | |
| 57150 | NM_001042493.3 | 0.57 | 0.03 | 0.96 | 0.50 | 0.00 | 0.83 | |
| 6855 | NM_003179.2 | 0.59 | 0.05 | 0.96 | 0.63 | 0.13 | 0.96 | |
| 96764 | NM_024831.7 | 0.80 | 0.79 | 0.80 | 0.78 | 0.94 | 0.67 | |
| 147798 | NM_001145303.2 | 0.64 | 0.49 | 0.75 | 0.63 | 0.44 | 0.75 | |
| 7391 | NM_007122.5 | 0.59 | 0.03 | 0.98 | 0.60 | 0.06 | 0.96 | |
| 7404 | NM_182660.1 | 0.57 | 0.00 | 0.98 | 0.60 | 0.00 | 1.00 | |
Acc accuracy, Sen sensitivity, Spe specificity
We show the per-transcript performance in the training and testing dataset. Transcripts with high accuracy (> 0.70) in both training and testing cohorts are denoted by †
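As a rough illustration of how the Acc, Sen, and Spe columns relate to one another, the sketch below computes all three from a confusion matrix. The function name and the counts are assumptions: the counts (22 true positives among 39 aneurysm cases, 53 true negatives among 55 controls) were chosen only because they are consistent with the training-cohort row for Gene ID 10882; the source does not report raw counts.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "acc": (tp + tn) / total,  # fraction of all samples classified correctly
        "sen": tp / (tp + fn),     # fraction of aneurysm cases detected
        "spe": tn / (tn + fp),     # fraction of controls correctly ruled out
    }


# Hypothetical counts for a training cohort of 39 IA cases and 55 controls.
metrics = classification_metrics(tp=22, fn=17, tn=53, fp=2)
print({name: round(value, 2) for name, value in metrics.items()})
```

Rows with a sensitivity of 0.00 and a specificity near 1 correspond to a classifier that labels almost every sample as control, which is why their accuracy sits close to the proportion of controls in the cohort.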
Fig. 4 Models’ performance in the training dataset. a PCA using the 37 selected transcripts demonstrated clear separation between samples from patients with IA and those from controls. b Estimation of model performance during LOO C-V in the training cohort demonstrated that models performed with an accuracy of 0.85–0.91. Considering a 5% prevalence of IA, PPV ranged from 0.33 to 1 and NPV ranged from 0.98 to 0.99. c ROC analysis showed that all models had AUCs ≥ 0.95. d PCA using the 26 previously-identified transcripts demonstrated inferior separation between IA and control cases. e Estimation of model performance during LOO C-V in the training cohort demonstrated that models performed with an accuracy of 0.71–0.80. Considering a 5% prevalence, PPV and NPV ranged from 0.13–0.41 and 0.97–0.98, respectively. f ROC analysis also showed subpar performance compared to newly identified transcripts (AUC range 0.71–0.92). (AUC = area under the ROC curve, C-V = cross validation, cSVM = cubic support vector machines, gSVM = Gaussian support vector machines, KNN = k-nearest neighbors, LOO = leave-one-out, NPV = negative predictive value, PCA = principal component analysis, PPV = positive predictive value, RF = random forests, ROC = receiver operator characteristic)
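Leave-one-out cross validation, as named in the Fig. 4 legend, holds out each sample in turn, trains on the rest, and scores the held-out prediction. The sketch below shows the idea with a plain k-nearest-neighbors classifier. It is not the authors' pipeline: the Euclidean distance, the value of `k`, the function names, and the two-feature toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest training samples."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(x, y, k=3):
    """Leave-one-out accuracy: each sample is predicted from all the others."""
    correct = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]  # every sample except the held-out one
        train_y = y[:i] + y[i + 1:]
        if knn_predict(train_x, train_y, x[i], k) == y[i]:
            correct += 1
    return correct / len(x)


# Toy data (not from the study): two well-separated groups of three samples.
x = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [1.0, 1.1], [1.2, 0.9], [0.9, 1.0]]
y = ["control", "control", "control", "IA", "IA", "IA"]
print(loo_accuracy(x, y, k=1))
```

Because every sample serves once as the test case, the estimate uses all available data, which is one reason the approach suits cohorts of this size.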
Fig. 5 Models’ performance in the testing dataset. a PCA using the 37 selected transcripts in this independent dataset also demonstrated strong separation between samples from patients with IA and from controls. b Assessment of true model performance showed that models performed with an accuracy of 0.83–0.90. In this dataset all models had a sensitivity of 1. At 5% IA prevalence, the PPV ranged from 0.15 to 0.24 and NPV was 1 for all models. c ROC analysis showed that all models again had AUCs ≥ 0.95. d PCA using the 26 previously-identified transcripts demonstrated mediocre separation between IA and control cases. e Estimation of model performance in the testing cohort demonstrated that models performed with an accuracy of 0.83–0.93. Considering a 5% prevalence, PPV and NPV ranged from 0.15–0.52 and 0.99–1, respectively. f ROC analysis also showed inferior performance compared to newly identified transcripts (AUC range 0.84–0.97). (AUC = area under the ROC curve, C-V = cross validation, cSVM = cubic support vector machines, gSVM = Gaussian support vector machines, KNN = k-nearest neighbors, LOO = leave-one-out, NPV = negative predictive value, PCA = principal component analysis, PPV = positive predictive value, RF = random forests, ROC = receiver operator characteristic)
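The prevalence-adjusted PPV and NPV quoted in the legends can be reproduced from sensitivity and specificity with Bayes' rule. The sketch below assumes that standard formula; the function name is hypothetical, and the specificity of 20/24 is an illustrative value, not a figure reported in the source.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)  # P(disease | positive test)
    npv = true_neg / (true_neg + false_neg)  # P(no disease | negative test)
    return ppv, npv


# Sensitivity of 1 as stated for the testing cohort; specificity is illustrative.
ppv, npv = predictive_values(sensitivity=1.0, specificity=20 / 24, prevalence=0.05)
print(round(ppv, 2), round(npv, 2))  # 0.24 1.0
```

At 5% prevalence even a small false-positive rate among the 95% of unaffected people outweighs the true positives, so PPV stays low while a sensitivity of 1 forces NPV to 1.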
Fig. 6 Networks derived from IPA of the 37 genes identified by LASSO. Transcripts with increased expression in IA are red; transcripts with lower expression in IA are green; fold-change is represented by intensity. a This network (p-score = 47) is affiliated with cancer, cellular movement, and connective tissue disorders. b This network (p-score = 25) has associated functions of cell cycle, cellular assembly and organization, DNA replication, recombination, and repair
Clinical characteristic differences in entire population
| | Control | Aneurysm | Chi-square test p-value |
|---|---|---|---|
| Age (Mean ± SE) | 61 ± 1.7 | 60 ± 1.5 | (age 60 cutoff) 0.243 |
| Age [Median (Q1/Q3)] | 65 (54/72) | 60 (54/67) | |
| Sex (% of patients) | |||
| Female | 54.43% | 70.91% | 0.054 |
| Smoking (% of patients) | |||
| Current | 13.92% | 30.91% | 0.017† |
| Comorbidities (% of patients) | |||
| Arthritis | 16.46% | 27.27% | 0.130 |
| Asthma | 7.59% | 18.18% | 0.063 |
| Cancer | 11.39% | 9.09% | 0.668 |
| Diabetes | 22.78% | 21.82% | 0.895 |
| Heart disease | 29.11% | 21.82% | 0.344 |
| High cholesterol | 55.70% | 49.09% | 0.451 |
| Hypertension | 59.49% | 52.73% | 0.437 |
| IA family history | 7.69% | 12.73% | 0.336 |
| Stroke history | 18.99% | 9.09% | 0.114 |
No reported covariate differed significantly between the control and aneurysm groups (chi-square test, significance at p < 0.05) except smoking (†)
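The chi-square comparison in this table can be sketched for a single 2 × 2 case. The example below uses an uncorrected Pearson chi-square test with one degree of freedom; whether the authors applied a continuity correction is not stated, so that is an assumption. The counts (11 of 79 controls and 17 of 55 aneurysm patients smoking) are also assumptions, back-calculated from the reported percentages and n = 134, and the function name is hypothetical.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Assumed counts: rows are control and aneurysm, columns are smoker and non-smoker.
chi2, p_value = chi_square_2x2(11, 68, 17, 38)
print(round(chi2, 2), round(p_value, 3))
```

Under these assumptions the p-value comes out close to the 0.017 reported for smoking, which suggests the table values are uncorrected chi-square results.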