Michael C Rendleman, John M Buatti, Terry A Braun, Brian J Smith, Chibuzo Nwakama, Reinhard R Beichel, Bart Brown, Thomas L Casavant.
Abstract
BACKGROUND: In the era of precision oncology and publicly available datasets, the amount of information available for each patient case has dramatically increased. From clinical variables and PET-CT radiomics measures to DNA-variant and RNA expression profiles, such a wide variety of data presents a multitude of challenges. Large clinical datasets are subject to sparsely and/or inconsistently populated fields. Corresponding sequencing profiles can suffer from the problem of high dimensionality, where making useful inferences can be difficult without correspondingly large numbers of instances. In this paper we report a novel deployment of machine learning techniques to handle data sparsity and high dimensionality, while evaluating potential biomarkers in the form of unsupervised transformations of RNA data. We apply preprocessing, MICE imputation, and sparse principal component analysis (SPCA) to improve the usability of more than 500 patient cases from the TCGA-HNSC dataset for enhancing future oncological decision support for Head and Neck Squamous Cell Carcinoma (HNSCC).
Keywords: Decision support; Dimensionality reduction; Gene ontology enrichment analysis; Machine learning; Unsupervised transformation; hnscc; tcga
Year: 2019 PMID: 31208324 PMCID: PMC6580485 DOI: 10.1186/s12859-019-2929-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 2 Cumulative Percent Explained Variance. Percent explained variance from SPCA as it relates to the number of components retained. The dark vertical line indicates the value used for transforming RNA expression into the SPCA feature set for these experiments
Effect of Imputation on Classifier Performance
| Classifier | Dataset | AUC |
|---|---|---|
| Naïve Bayes | Pre-imputation | 0.633 ± 0.077 |
| | Post-imputation | 0.675 ± 0.063 |
| Random Forest | Pre-imputation | 0.668 ± 0.062 |
| | Post-imputation | 0.675 ± 0.063 |
Classifier performance on the imputed and non-imputed datasets. Baseline AUC is 0.500
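The AUC values in this table, and the 0.500 baseline, can be read through the rank-based definition of AUC: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The following is a minimal Python sketch of that definition. The helper name `auc` and the toy labels and scores are illustrative assumptions, not data or code from the study.

```python
def auc(labels, scores):
    """Rank-based AUC: P(positive score > negative score), ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie contributes half a win
    return wins / (len(positives) * len(negatives))


# Toy data for illustration only (not TCGA-HNSC values)
labels = [0, 0, 1, 1]
informative = [0.10, 0.40, 0.35, 0.80]    # scores that partly separate the classes
uninformative = [0.5, 0.5, 0.5, 0.5]      # identical scores carry no information

print(auc(labels, informative))
print(auc(labels, uninformative))
```

A classifier that assigns every case the same score ties on every positive/negative pair, which is why an uninformative model sits at the 0.500 baseline.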
RNA Expression Classifier Performance
| Dataset | RF | WSRF | CIRF |
|---|---|---|---|
| AUCs | | | |
| Full RNA | (missing) | 0.596 ± 0.038 | 0.629 ± 0.105 |
| SPCA | 0.640 ± 0.128 | 0.626 ± 0.114 | (missing) |
| Nested CV Runtimes | | | |
| Full RNA | (missing) | 185 h | 85 h |
| SPCA | (missing) | 1.9 h | 30 min |
AUC and approximate runtime values for the RNA expression feature sets. The best value in each row is bolded. Here, runtimes are evaluation times for a given classifier on a given feature set via 10-fold nested cross validation with the internal cross validation procedures as described in Methods. Computations performed on the University of Iowa’s Argon High-Performance Computing cluster
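The nested cross validation mentioned in this caption can be sketched as index bookkeeping: an outer loop holds out a test fold for evaluation, and an inner loop splits only the remaining training cases for tuning. The Python sketch below uses the 10 outer folds stated in the caption; the inner fold count of 5, the sample count of 500, and the helper names `k_fold_indices` and `nested_cv_splits` are assumptions made for illustration, since the Methods are not reproduced here.

```python
import random


def k_fold_indices(indices, k, rng):
    """Shuffle the indices and split them into k near-equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=10, inner_k=5, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for each outer fold.

    inner_splits holds (inner_train, inner_validation) pairs drawn only from
    outer_train, so the outer test fold never influences model tuning.
    inner_k=5 is an assumed value, not taken from the paper.
    """
    rng = random.Random(seed)
    outer_folds = k_fold_indices(range(n_samples), outer_k, rng)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [idx for j, fold in enumerate(outer_folds)
                       if j != i for idx in fold]
        inner_folds = k_fold_indices(outer_train, inner_k, rng)
        inner_splits = []
        for m, inner_val in enumerate(inner_folds):
            inner_train = [idx for j, fold in enumerate(inner_folds)
                           if j != m for idx in fold]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits


# Illustrative sample count; the abstract reports "more than 500" cases
splits = list(nested_cv_splits(n_samples=500))
print(len(splits), "outer folds,", len(splits[0][2]), "inner splits each")
```

Because every outer fold repeats the full inner procedure, the number of model fits grows multiplicatively with the two fold counts, which is one reason the number of input features can weigh so heavily on total runtime.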
Fig. 3 Importance Change with Imputation. Pre-imputation and post-imputation CIRF conditional variable importance for predicting two-year recurrence-free survival. Importance values are relative to the most important variable. Imputed treatment features are denoted with *, and several clinical variables are shown for comparison
SPC Explained Variances
| SPC | Percent Explained Variance |
|---|---|
| X1ᵃ | 53.84% |
| X2ᵃ | 9.43% |
| X3ᵃ | 9.19% |
| X4 | 5.31% |
| X5 | 3.14% |
| X6ᵃ | 2.27% |
| X7ᵃ | 2.04% |
| X8 | 1.67% |
| X9ᵃ | 1.24% |
| X10 | 0.93% |
Explained variances for the sparse principal components. The 10 SPCs account for 89.05% of the original data’s variance. ᵃ denotes SPCs chosen for further analysis based on variable importance (see Fig. 4)
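The cumulative curve described for Fig. 2 follows from the per-component percentages in this table by a running sum. The Python sketch below reproduces that arithmetic from the tabulated values. The helper `components_needed` and the 80% threshold are hypothetical illustrations and are not the cutoff the authors used.

```python
from itertools import accumulate

# Percent explained variance for X1..X10, copied from the table
explained = [53.84, 9.43, 9.19, 5.31, 3.14, 2.27, 2.04, 1.67, 1.24, 0.93]

# Running total: cumulative percent explained variance per components retained
cumulative = list(accumulate(explained))


def components_needed(cumulative_percent, threshold):
    """Smallest number of leading components whose cumulative explained
    variance reaches the threshold (in percent), or None if never reached."""
    for count, value in enumerate(cumulative_percent, start=1):
        if value >= threshold:
            return count
    return None


for i, value in enumerate(cumulative, start=1):
    print(f"X1..X{i}: {value:.2f}%")

# Hypothetical threshold, chosen only to show how the helper is used
print(components_needed(cumulative, 80.0))
```

Summing the rounded table entries gives a total within 0.01 percentage points of the 89.05% stated in the caption; the small gap is presumably due to rounding of the individual entries.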
Fig. 4 SPC Conditional Importance Values. Relative conditional variable importance values for the 10 SPCs, labeled X1–10. In cases where a very low importance is reported for an SPC, its effect on classifier performance is negligible
RNA SPCA Enriched GO Terms
| SPC | Contributing Genes: Gene Name (Entrez GeneID) | Enriched GO Biological Processes |
|---|---|---|
| X6 | ADAM6 (8755), FABP4 (2167), FN1 (2335), GAPDH (2597), KRT13 (3860), KRT16 (3868), KRT17 (3872), LOC96610 (96610) | Cornification |
| X2 | COL1A1 (1277), COL1A2 (1278), COL3A1 (1281), FN1 (2335), KRT13 (3860), KRT14 (3861), KRT16 (3868), KRT17 (3872), KRT5 (3852), KRT6A (3853), SPARC (6678) | Cornificationᵃ, keratinocyte differentiation, wound healing, cell-substrate junction assemblyᵃ, collagen fibril organizationᵃ |
| X9 | ACTB (60), ADAM6 (8755), COL1A1 (1277), COL1A2 (1278), FN1 (2335), LAMC2 (3918), TGFBI (7045) | Skin morphogenesis, protein heterotrimerization, platelet activationᵃ, cell junction assembly, cell junction organization, extracellular matrix organization, extracellular structure organization, blood vessel developmentᵃ, cell adhesion |
| X7 | ADAM6 (8755), FABP4 (2167), KRT16 (3868), KRT17 (3872), KRT5 (3852), KRT6B (3854), LOC96610 (96610), PI3 (5266) | Cornificationᵃ, programmed cell death, cell death, keratinization, skin development |
| X1 | KRT14 (3861), KRT16 (3868), KRT17 (3872), KRT5 (3852), KRT6A (3853), KRT6B (3854), KRT6C (286887), S100A9 (6280) | Cornificationᵃ, intermediate filament cytoskeleton organizationᵃ, cell death, hair cycle |
| X3 | COL1A1 (1277), COL1A2 (1278), COL3A1 (1281), KRT13 (3860), KRT14 (3861), KRT16 (3868), KRT17 (3872), KRT5 (3852), KRT6A (3853), KRT6B (3854), KRT6C (286887), SFN (2810) | Cornificationᵃ, multicellular organism development, intermediate filament cytoskeleton organizationᵃ, collagen fibril organization |
SPC gene sets listing gene names with Entrez gene IDs in parentheses; PANTHER annotation terms found to be enriched in each of the SPC gene sets. Annotation terms are reported in increasing order of p-value, with all p < 0.001. ᵃ indicates some lower-level hierarchical terms omitted for brevity