Lin Li1,2,3, Qiushi Feng1,2,3, Xiaosheng Wang1,2,3.
Abstract
Microsatellite instability (MSI) is a genomic property of cancers with defective DNA mismatch repair and is a useful marker for cancer diagnosis and treatment in diverse cancer types. In particular, MSI has been associated with response to immune checkpoint blockade therapy in cancer. Most computational methods for predicting MSI are based on DNA sequencing data, and a few are based on mRNA expression data. Using the RNA-Seq pan-cancer datasets for three cancer cohorts (colon, gastric, and endometrial cancers) from The Cancer Genome Atlas (TCGA) program, we developed an algorithm (PreMSIm) for predicting MSI from the expression profile of a 15-gene panel in cancer. We demonstrated that PreMSIm predicted MSI with high performance in most cases using both RNA-Seq and microarray gene expression datasets. Moreover, PreMSIm displayed superior or comparable performance versus other DNA- or mRNA-based methods. We conclude that PreMSIm has the potential to provide an alternative approach for identifying MSI in cancer.
Keywords: ACC, adrenocortical carcinoma; AUC, area under the curve; Algorithm; BLCA, bladder urothelial carcinoma; BRCA, breast invasive carcinoma; CESC, cervical squamous cell carcinoma and endocervical adenocarcinoma; CHOL, cholangiocarcinoma; COAD, colon adenocarcinoma; CV, cross validation; Cancer; Classification; DLBC, lymphoid neoplasm diffuse large B-cell lymphoma; ESCA, esophageal carcinoma; GBM, glioblastoma multiforme; GEO, Gene Expression Omnibus; GO, gene ontology; Gene expression profiling; HNSC, head and neck squamous cell carcinoma; KICH, kidney chromophobe; KIRC, kidney renal clear cell carcinoma; KIRP, kidney renal papillary cell carcinoma; LAML, acute myeloid leukemia; LGG, brain lower grade glioma; LIHC, liver hepatocellular carcinoma; LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; MESO, mesothelioma; MSI, microsatellite instability; MSS, microsatellite stability; Machine learning; Microsatellite instability; OV, ovarian serous cystadenocarcinoma; PAAD, pancreatic adenocarcinoma; PCPG, pheochromocytoma and paraganglioma; PPI, protein-protein interaction; PRAD, prostate adenocarcinoma; READ, rectum adenocarcinoma; RF, random forest; ROC, receiver operating characteristic; SARC, sarcoma; SKCM, skin cutaneous melanoma; STAD, stomach adenocarcinoma; SVM, support vector machine; TCGA, The Cancer Genome Atlas; TGCT, testicular germ cell tumors; THCA, thyroid carcinoma; THYM, thymoma; UCEC, uterine corpus endometrial carcinoma; UCS, uterine carcinosarcoma; UVM, uveal melanoma; XGBoost, extreme gradient boosting; k-NN, k-nearest neighbor
Year: 2020 PMID: 32257050 PMCID: PMC7113609 DOI: 10.1016/j.csbj.2020.03.007
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
A summary of datasets.
| Platform | Cancer type | Source | Number of samples | Number of MSI-H samples | Number of MSI-L/MSS samples |
|---|---|---|---|---|---|
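| RNA-seq | Colon cancer | TCGA | 281 | 52 | 229 |
| RNA-seq | Endometrial cancer | TCGA | 367 | 123 | 244 |
| RNA-seq | Esophageal cancer | TCGA | 89 | 2 | 87 |
| RNA-seq | Gastric cancer | TCGA | 415 | 80 | 335 |
| RNA-seq | Rectum cancer | TCGA | 94 | 3 | 91 |
| RNA-seq | Uterine cancer | TCGA | 56 | 2 | 54 |
| RNA-seq | Pan-cancer | TCGA | 1383 | 328 | 1055 |
| Microarray (GPL570) | Gastric cancer | GSE13911 | 39 | 19 | 20 |
| Microarray (GPL570) | Gastric cancer | GSE62254 | 300 | 68 | 232 |
| Microarray (GPL570) | Colorectal cancer | GSE13067 | 74 | 11 | 63 |
| Microarray (GPL570) | Colorectal cancer | GSE13294 | 155 | 78 | 77 |
| Microarray (GPL570) | Colorectal cancer | GSE18088 | 53 | 19 | 34 |
| Microarray (GPL570) | Colorectal cancer | GSE26682 | 160 | 18 | 142 |
| Microarray (GPL570) | Colorectal cancer | GSE35896 | 61 | 5 | 56 |
| Microarray (GPL570) | Colorectal cancer | GSE39084 | 70 | 16 | 54 |
| Microarray (GPL570) | Colorectal cancer | GSE39582 | 536 | 77 | 459 |
| Microarray (GPL570) | Colorectal cancer | GSE75316 | 59 | 11 | 48 |
| Microarray (GPL570) | Colorectal cancer | GSE92921 | 58 | 5 | 53 |
| Microarray (GPL5175) | Colorectal cancer | GSE24550 | 65 | 14 | 51 |
| Microarray (GPL2986) | Colorectal cancer | GSE25071 | 46 | 5 | 41 |
| Microarray (GPL13158) | Colorectal cancer | GSE27544 | 22 | 8 | 14 |
| Microarray (GPL96) | Colorectal cancer | GSE26682 | 140 | 17 | 123 |
| Microarray (GPL96) | Colorectal cancer | GSE41258 | 168 | 35 | 133 |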
Note:
Poly-A.
Affymetrix Oligonucleotide Array.
Agilent Oligonucleotide Array.
Affymetrix Exon Array.
Fig 1. A summary of the PreMSIm algorithm and the 15 gene signatures selected. A, Flowchart for the algorithm. B, Heatmap for the expression levels of 15 gene signatures in PreMSIm in the MSI-H and MSI-L/MSS subtypes of the TCGA pan-cancer. MSI-H: MSI-high. MSI-L/MSS: MSI-low/microsatellite stability.
The classification performance within TCGA datasets (%).
| Cancer type | Accuracy | Sensitivity | Specificity | AUC |
|---|---|---|---|---|
| Gastric cancer | 97 | 86 | 99 | 97 |
| Colon cancer | 96 | 96 | 97 | 97 |
| Endometrial cancer | 90 | 86 | 92 | 93 |
| Pan-cancer (all samples) | 95 | 85 | 97 | 95 |
| Pan-cancer (80% of samples) | 94 | 88 | 97 | 97 |
| Pan-cancer (20% of samples) | 94 | 86 | 96 | 95 |
Note:
10-fold cross validation.
Validation in the independent test set.
The classification performance of PreMSIm (%).
| Cancer type | Accuracy | Sensitivity | Specificity | AUC |
|---|---|---|---|---|
| Esophageal cancer | 96 | 100 | 95 | 99 |
| Rectum cancer | 91 | 67 | 92 | 79 |
| Uterine cancer | 95 | 100 | 94 | 99 |
| Gastric cancer (GSE13911) | 90 | 89 | 90 | 89 |
| Gastric cancer (GSE62254) | 88 | 78 | 91 | 87 |
| Colorectal cancer (GSE13067) | 98 | 100 | 95 | 99 |
| Colorectal cancer (GSE13294) | 92 | 86 | 99 | 96 |
| Colorectal cancer (GSE18088) | 96 | 95 | 97 | 97 |
| Colorectal cancer (GSE26682-GPL570) | 98 | 83 | 99 | 93 |
| Colorectal cancer (GSE26682-GPL96) | 70 | 82 | 68 | 81 |
| Colorectal cancer (GSE35896) | 92 | 100 | 91 | 98 |
| Colorectal cancer (GSE39084) | 93 | 100 | 91 | 99 |
| Colorectal cancer (GSE39582) | 90 | 90 | 90 | 94 |
| Colorectal cancer (GSE41258) | 77 | 60 | 82 | 82 |
| Colorectal cancer (GSE75316) | 95 | 100 | 94 | 99 |
| Colorectal cancer (GSE92921) | 95 | 80 | 96 | 89 |
| Colorectal cancer (GSE27544) | 91 | 75 | 100 | 94 |
| Colorectal cancer (GSE24550) | 88 | 100 | 84 | 94 |
| Colorectal cancer (GSE25071) | 87 | 100 | 85 | 98 |
Note:
* These prediction results were obtained by the PreMSIm R package.
The Esophageal, Rectum, and Uterine cancer datasets were from TCGA and the others were from GEO.
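The accuracy, sensitivity, and specificity columns in the tables above follow from the confusion matrix of a binary classifier. The sketch below shows the arithmetic, assuming MSI-H is treated as the positive class (the source does not state this explicitly). The function name `classification_metrics` is hypothetical, and the counts are illustrative values chosen to be consistent with the rounded figures and cohort sizes reported for the TCGA rectum cohort; the actual confusion matrix is not given in the source.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) as percentages.

    tp/fn: MSI-H samples predicted correctly/incorrectly (assumed positive class).
    tn/fp: MSI-L/MSS samples predicted correctly/incorrectly.
    """
    total = tp + fn + tn + fp
    accuracy = 100.0 * (tp + tn) / total
    sensitivity = 100.0 * tp / (tp + fn)   # fraction of MSI-H samples recovered
    specificity = 100.0 * tn / (tn + fp)   # fraction of MSI-L/MSS samples recovered
    return accuracy, sensitivity, specificity


# Illustrative counts only: 3 MSI-H and 91 MSI-L/MSS samples (94 in total).
acc, sens, spec = classification_metrics(tp=2, fn=1, tn=84, fp=7)
print(f"accuracy={acc:.0f}% sensitivity={sens:.0f}% specificity={spec:.0f}%")
```

With only three MSI-H samples in a cohort, a single missed case already lowers sensitivity to 67%, so sensitivity estimates for the small cohorts should be read with that sample size in mind.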
Fig 2. Comparisons of the MSI prediction results by PreMSIm with those by other algorithms. A, B, and C, The overlapping rates of the MSI prediction results between PreMSIm and MOSAIC [16] (A), MANTIS [17] (B), and MSIsensor [8] (C) in the TCGA pan-cancer and multiple individual cancer types. The Fisher’s exact test P-values are shown. *P < 0.05, **P < 0.01, ***P < 0.001. D and E, Comparisons of the prediction performance of PreMSIm with that of two other mRNA-based methods by Danaher et al. [11] (D) and by Pacinkova et al. [12] (E), respectively. BLCA: bladder urothelial carcinoma. BRCA: breast invasive carcinoma. CESC: cervical squamous cell carcinoma and endocervical adenocarcinoma. COAD: colon adenocarcinoma. ESCA: esophageal carcinoma. HNSC: head and neck squamous cell carcinoma. LUAD: lung adenocarcinoma. READ: rectum adenocarcinoma. STAD: stomach adenocarcinoma. UCEC: uterine corpus endometrial carcinoma. UCS: uterine carcinosarcoma.
Fig 3. Comparison of k-NN with other classifiers. A, The grid search with 10-fold CV in the TCGA pan-cancer to search for the optimal k(s) for k-NN. B, Comparison of the performance between four different k-NNs (k = 5, 7, 9, and 11) in predicting MSI. C, Comparison of the performance between k-NN (k = 5) and the RF, SVM, and XGBoost classifiers. RF: random forest. SVM: support vector machine. XGBoost: extreme gradient boosting.
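The grid search over k with 10-fold CV described in this caption can be illustrated with a minimal, pure-Python sketch. It is not the PreMSIm implementation: the data are synthetic (a hypothetical 15-feature toy set standing in for the gene panel), the distance is assumed to be Euclidean, the folds are not stratified, and the helper names (`knn_predict`, `cv_accuracy`, `make_toy_data`) are invented for this example.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    dists = sorted((math.dist(x, tx), ty) for tx, ty in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(X, y, k, n_folds=10, seed=0):
    """Mean accuracy of k-NN over n_folds cross-validation folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]  # disjoint, near-equal folds
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train_X = [X[i] for i in idx if i not in held_out]
        train_y = [y[i] for i in idx if i not in held_out]
        correct += sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in fold)
    return correct / len(X)


def make_toy_data(n_per_class=60, n_features=15, seed=1):
    """Synthetic two-class data; NOT real expression values."""
    rng = random.Random(seed)
    X, y = [], []
    for label, shift in (("MSI-H", 1.0), ("MSI-L/MSS", 0.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(shift, 1.0) for _ in range(n_features)])
            y.append(label)
    return X, y


X, y = make_toy_data()
# Candidate k values taken from the caption (k = 5, 7, 9, and 11).
scores = {k: cv_accuracy(X, y, k) for k in (5, 7, 9, 11)}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

Odd values of k avoid tied votes in a two-class problem. On real expression data, features would typically be scaled before computing distances, and stratified folds would keep the MSI-H proportion similar across folds.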
Pathways and GO associated with the 15 gene features in PreMSIm.
| Gene Symbol | Pathway | GO (BP) |
|---|---|---|
| | NA | ribosome biogenesis; rRNA processing |
| | NA | positive regulation of glycogen biosynthetic process |
| | Gene Expression; Mitotic Prophase; PIWI-interacting RNA (piRNA) biogenesis | RNA methylation; methylation; gene silencing by RNA; piRNA metabolic process; production of siRNA involved in RNA interference |
| | NA | metabolic process |
| | Mismatch repair; Gene Expression; Meiosis; DNA Damage; Fanconi anemia pathway; Pathways in cancer; Cell Cycle, Mitotic; DNA Double-Strand Break Repair; Regulation of TP53 Activity; DNA damage_Role of Brca1 and Brca2 in DNA repair; Direct p53 effectors | mismatch repair; DNA repair; cellular response to DNA damage stimulus; cell cycle; double-strand break repair via nonhomologous end joining; reciprocal meiotic recombination; somatic hypermutation of immunoglobulin genes; somatic recombination of immunoglobulin gene segments; meiotic chromosome segregation; homologous chromosome segregation; negative regulation of mitotic recombination; meiotic cell cycle |
| | Meiosis; Cell Cycle, Mitotic | meiotic cell cycle; reciprocal meiotic recombination |
| | Glucose metabolism; Ubiquitin mediated proteolysis; Metabolism | protein ubiquitination; autophagy; glycogen biosynthetic process; regulation of protein phosphorylation; glycogen metabolic process; regulation of gene expression; regulation of protein ubiquitination; response to endoplasmic reticulum stress; cellular macromolecule metabolic process; regulation of protein kinase activity; regulation of protein localization to plasma membrane |
| | NA | NA |
| | NA | oxidation–reduction process |
| | Gene Expression; Metabolism; Metabolism of proteins | cytoplasmic translation |
| | NA | mitotic DNA replication termination; regulation of DNA stability; site-specific DNA replication termination at RTS1 barrier |
| | NA | multicellular organism development; actin filament organization; actin cytoskeleton organization |
| | Endocytosis | positive regulation of GTPase activity; regulation of clathrin-dependent endocytosis |
| | Organelle biogenesis and maintenance | cell projection organization |
| | NA | NA |
Fig 4. Performance of PreMSIm in predicting MSI. A, ROC curve analysis of TCGA colon cancer. B, ROC curve analysis of pan-cancer. All pan-cancer samples were separated into a training set (80% of samples) and a test set (20% of samples). For the training set, the 10-fold CV AUC is shown. C, ROC curve analysis of TCGA gastric and colon cancers using the TCGA endometrial cancers as the training set. D and E, ROC curve analysis of two gastric (D) and two colorectal (E) cancer cohorts in which the PreMSIm R package was used to predict MSI. MSI: microsatellite instability. CV: cross validation. AUC: area under the ROC curve. COAD: colon adenocarcinoma. STAD: stomach adenocarcinoma.