Sungsoo Park, Bonggun Shin, Won Sang Shim, Yoonjung Choi, Kilsoo Kang, Keunsoo Kang.
Abstract
Next-generation sequencing (NGS), which allows the simultaneous sequencing of billions of DNA fragments, has revolutionized how we study genomics and molecular biology by generating genome-wide molecular maps of molecules of interest. However, the amount of information produced by NGS has made it difficult for researchers to choose the optimal set of genes. We have sought to resolve this issue by developing a neural network-based feature (gene) selection algorithm called Wx. The Wx algorithm ranks genes based on the discriminative index (DI) score, which represents the classification power for distinguishing given groups. With a gene list ranked by DI score, researchers can intuitively select the optimal set of genes from the highest-ranking ones. We applied the Wx algorithm to a TCGA pan-cancer gene-expression cohort to identify an optimal set of gene-expression biomarker candidates that can distinguish cancer samples from normal samples for 12 different types of cancer. The 14 gene-expression biomarker candidates identified by Wx were comparable to or outperformed previously reported universal gene expression biomarkers, highlighting the usefulness of the Wx algorithm for next-generation sequencing data. Thus, we anticipate that the Wx algorithm can complement current state-of-the-art analytical applications for the identification of biomarker candidates as an alternative method. The stand-alone and web versions of the Wx algorithm are available at https://github.com/deargen/DearWXpub and https://wx.deargendev.me/, respectively.
Year: 2019 PMID: 31324856 PMCID: PMC6642261 DOI: 10.1038/s41598-019-47016-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
The number of cancer and normal samples used in this study.
| Type ID | Full name | # of cancer samples | # of normal samples | # of total samples | Ratio (cancer/total) |
|---|---|---|---|---|---|
| BLCA | Bladder urothelial carcinoma | 408 | 19 | 427 | 0.95 |
| BRCA | Breast invasive carcinoma | 1101 | 113 | 1214 | 0.90 |
| COAD | Colon adenocarcinoma | 286 | 41 | 327 | 0.87 |
| HNSC | Head and neck squamous cell carcinoma | 522 | 44 | 566 | 0.92 |
| KICH | Kidney chromophobe | 65 | 25 | 90 | 0.72 |
| KIRC | Kidney renal clear cell carcinoma | 534 | 72 | 606 | 0.88 |
| KIRP | Kidney renal papillary cell carcinoma | 291 | 32 | 323 | 0.90 |
| LIHC | Liver hepatocellular carcinoma | 374 | 50 | 424 | 0.88 |
| LUAD | Lung adenocarcinoma | 517 | 59 | 576 | 0.89 |
| LUSC | Lung squamous cell carcinoma | 502 | 51 | 553 | 0.90 |
| PRAD | Prostate adenocarcinoma | 497 | 52 | 549 | 0.90 |
| THCA | Thyroid carcinoma | 512 | 59 | 571 | 0.89 |
Figure 1. Classification accuracy according to the given number of genes. The x-axis indicates the number of top genes (sorted in descending order by DI value) used for the calculation, and the y-axis represents the average accuracy.
Gene expression biomarkers identified by different studies.
| Cancer type | Wx (this study) | Peng | Emmanual |
|---|---|---|---|
| BLCA | |||
| BRCA | |||
| COAD (READ) | |||
| HNSC | |||
| LIHC | — | ||
| LUAD | |||
| LUSC | |||
| KICH | — | — | |
| KIRC | |||
| KIRP | — | ||
| PRAD | — | ||
| THCA | — |
Classification accuracy comparison (%).
| Type | Wx (14 UGCBs) | Peng’s (14 UGCBs) | edgeR (14 UGCBs) | Wx (7 UGCBs) | Martinez-Ledesma’s (7 UGCBs) |
|---|---|---|---|---|---|
| BLCA | 95.79 | 97.20 | 94.86 | 95.79 | 96.26 |
| BRCA | 98.19 | 96.38 | 91.78 | 97.20 | 91.45 |
| COAD | 94.51 | 87.20 | 98.78 | 92.68 | — |
| HNSC | 97.17 | 92.23 | 94.35 | 95.05 | 92.57 |
| KICH | 95.65 | — | 100.00 | 97.83 | — |
| KIRC | 99.67 | — | 99.34 | 98.68 | 90.09 |
| KIRP | 99.38 | — | 99.38 | 100.00 | — |
| LIHC | 90.57 | 94.81 | 87.74 | 88.21 | — |
| LUAD | 97.92 | 97.58 | 98.96 | 97.92 | 90.27 |
| LUSC | 98.19 | 96.75 | 99.28 | 97.83 | 94.56 |
| PRAD | 93.45 | — | 92.36 | 90.55 | — |
| THCA | 95.80 | — | 90.21 | 95.45 | |
| Total | 96.72 | 94.59 | 94.81 | 95.74 | 92.20 |
Figure 2. Performance of the Wx-14-UGCB on BRCA, LUAD, and LUSC RNA-seq data. AUC values are listed. ROC, receiver operating characteristic.
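As a reminder of what an AUC value summarizes, the following is a minimal sketch that computes the area under the ROC curve through its pairwise-ranking interpretation: the fraction of (cancer, normal) sample pairs in which the cancer sample receives the higher score, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not data or code from the study.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive sample outscores a negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = cancer, 0 = normal.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(toy_labels, toy_scores))  # 0.75
```

The double loop is quadratic in the number of samples, which is fine for illustration; a sort-based rank computation would be the usual choice for large cohorts.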
Figure 3. Comparison of genes identified by Wx and edgeR. The x-axis indicates the number of top genes used for the comparison, and the y-axis represents the percentage of overlap between the gene sets.
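The overlap percentage described in this caption can be sketched as follows, assuming it is the number of genes shared by the two top-n lists divided by n. The function name `top_n_overlap` and the placeholder gene identifiers are hypothetical and do not come from the study.

```python
def top_n_overlap(ranked_a, ranked_b, n):
    """Percentage of genes shared by the top-n entries of two ranked gene lists."""
    if n <= 0 or n > min(len(ranked_a), len(ranked_b)):
        raise ValueError("n must be between 1 and the length of the shorter list")
    shared = set(ranked_a[:n]) & set(ranked_b[:n])
    return 100.0 * len(shared) / n


# Placeholder rankings (best gene first); these are not real gene names.
wx_like = ["g1", "g2", "g3", "g4"]
edger_like = ["g3", "g1", "g5", "g2"]
print(top_n_overlap(wx_like, edger_like, 2))  # 50.0
print(top_n_overlap(wx_like, edger_like, 4))  # 75.0
```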
The classification accuracy of the UGCBs identified by different methods.
| GSE id | Cancer type | Wx-14-UGCB | Peng-14-UGCB |
|---|---|---|---|
| GSE72056 | Melanoma | 90.71 | 70.22 |
| GSE40419 | Lung adenocarcinoma | 80.00 | 56.87 |
| GSE103322 | Head and neck squamous cell carcinoma (primary tumors and lymph node metastases; single-cell transcriptomes) | 81.10 | 68.28 |
Figure 4. Discriminative index (DI) vector construction for $K = 3$, where $W_k$ represents the parameter related to the $k$-th softmax output value, $\bar{X}_k$ is the averaged vector from all data samples with label $k$, and $WX_k$ is the result of a multiplication between $W_k$ and $\bar{X}_k$.
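The construction in this caption can be illustrated with a minimal sketch. It rests on assumptions that the caption does not state: the multiplication is taken to be element-wise (one value per gene), the DI score of a gene is taken to be the sum of absolute differences between the per-class products over all class pairs, and the softmax weights are supplied already trained. All function names and toy numbers are hypothetical; the authors' implementation is in the repository cited in the abstract.

```python
from itertools import combinations


def class_means(samples, labels, num_classes):
    """Average the feature vectors of all samples that share a label."""
    dim = len(samples[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for x, y in zip(samples, labels):
        counts[y] += 1
        for j, value in enumerate(x):
            sums[y][j] += value
    # Assumes every class has at least one sample.
    return [[s / counts[k] for s in sums[k]] for k in range(num_classes)]


def di_vector(weights, samples, labels):
    """Assumed DI: per-gene sum of |WX_a - WX_b| over all class pairs."""
    num_classes, dim = len(weights), len(weights[0])
    means = class_means(samples, labels, num_classes)
    # Element-wise product of class-k weights and the class-k mean (assumption).
    wx = [[w * m for w, m in zip(weights[k], means[k])]
          for k in range(num_classes)]
    di = [0.0] * dim
    for a, b in combinations(range(num_classes), 2):
        for j in range(dim):
            di[j] += abs(wx[a][j] - wx[b][j])
    return di


def rank_genes(di):
    """Gene indices sorted by DI score, highest first."""
    return sorted(range(len(di)), key=lambda j: di[j], reverse=True)


# Toy example with 3 classes and 2 genes; weights stand in for a trained model.
toy_weights = [[1.0, 0.1], [-1.0, 0.1], [0.5, 0.1]]
toy_samples = [[2.0, 1.0], [4.0, 1.0], [1.0, 1.0],
               [3.0, 1.0], [2.0, 1.0], [2.0, 1.0]]
toy_labels = [0, 0, 1, 1, 2, 2]
scores = di_vector(toy_weights, toy_samples, toy_labels)
print(scores)              # [10.0, 0.0]
print(rank_genes(scores))  # [0, 1]
```

In the toy data the second gene has the same weight and the same mean in every class, so its products are identical across classes and its score is zero, while the first gene separates the three classes and ranks first.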