Yu-Hang Zhang, Tao Huang, Lei Chen, YaoChen Xu, Yu Hu, Lan-Dian Hu, Yudong Cai, Xiangyin Kong.
Abstract
Detecting and diagnosing cancer early is essential for prevention and effective treatment. Traditional methods of cancer detection are usually time-consuming and expensive. Liquid biopsy, a recently proposed noninvasive detection approach, can improve the accuracy and reduce the cost of detection by using a personalized expression profile. However, few studies have analyzed this type of data, although such analyses could lead to more effective methods for detecting different cancer subtypes. In this study, we applied several reliable machine learning algorithms to data from patients with one of six cancer subtypes (breast cancer, colorectal cancer, glioblastoma, hepatobiliary cancer, lung cancer and pancreatic cancer) and from healthy persons. Each sample was encoded by its quantitative gene expression profile, and the profiles were analyzed with the maximum relevance minimum redundancy (mRMR) method, which produced two ranked feature lists: the MaxRel feature list and the mRMR feature list. The incremental feature selection (IFS) method was applied to the mRMR feature list to extract the optimal feature subset, which was used with the support vector machine (SVM) algorithm to obtain the best performance in distinguishing cancer subtypes and healthy controls. Ten-fold cross-validation of the resulting optimal classification model yielded an overall accuracy of 0.751. In addition, we extracted the top eighteen features (genes) of the MaxRel feature list, including TTN, RHOH, RPS20 and TRBC2, and analyzed them in detail. The results indicate that these genes could be important biomarkers for discriminating different cancer subtypes and healthy controls.
Keywords: RNA-seq data; cancer detection; liquid biopsy; maximum relevance minimum redundancy; support vector machine
Year: 2017; PMID: 29152097; PMCID: PMC5675649; DOI: 10.18632/oncotarget.20903
Source DB: PubMed; Journal: Oncotarget; ISSN: 1949-2553
Figure 1. IFS curves for the results of the first stage of the IFS method
The Y-axis represents the overall accuracy, and the X-axis represents the number of features used for classification. The high overall accuracies (no less than 0.740) all cluster between 2000 and 2200 features.
Figure 2. IFS curves for the results of the second stage of the IFS method
The Y-axis represents the overall accuracy, and the X-axis represents the number of features used for classification. The highest overall accuracy was 0.751 when 2047 features were used.
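The two-stage search described in Figures 1 and 2 can be sketched as a coarse scan over prefixes of the ranked feature list followed by a fine scan around the best coarse point. This is only an illustrative sketch: the function names, the coarse step of 100, the width of the fine window, and the toy scoring function are all assumptions and are not taken from the study. In the study, the score for each prefix would be the ten-fold cross-validated overall accuracy of an SVM trained on those features.

```python
from typing import Callable, Sequence, Tuple


def ifs_scan(ranked: Sequence[str],
             evaluate: Callable[[Sequence[str]], float],
             start: int, stop: int, step: int) -> Tuple[int, float]:
    """Score the top-k prefixes of a ranked list for k = start, start+step, ...
    up to stop (inclusive) and return (best_k, best_score)."""
    best_k, best_score = 0, float("-inf")
    for k in range(start, min(stop, len(ranked)) + 1, step):
        score = evaluate(ranked[:k])
        if score > best_score:  # strict: the smallest k wins a tie
            best_k, best_score = k, score
    return best_k, best_score


def two_stage_ifs(ranked: Sequence[str],
                  evaluate: Callable[[Sequence[str]], float],
                  coarse_step: int = 100) -> Tuple[int, float]:
    """Hypothetical two-stage IFS: coarse scan, then a step-1 scan nearby."""
    # Stage 1: coarse scan over the whole ranked list (step size is assumed).
    coarse_k, _ = ifs_scan(ranked, evaluate, coarse_step, len(ranked), coarse_step)
    # Stage 2: fine scan in a window around the coarse optimum (window is assumed).
    low = max(1, coarse_k - coarse_step)
    high = min(len(ranked), coarse_k + coarse_step)
    return ifs_scan(ranked, evaluate, low, high, 1)


# Toy stand-in for cross-validated accuracy, built to peak at 2047 features.
def toy_score(subset: Sequence[str]) -> float:
    return 0.751 - abs(len(subset) - 2047) / 10000


ranked_genes = [f"gene_{i}" for i in range(3000)]  # placeholder feature names
best_k, best_acc = two_stage_ifs(ranked_genes, toy_score)
```

Scanning coarsely first keeps the number of model evaluations small, which matters when each evaluation is a full cross-validation run.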
Figure 3. The performance of the optimal classification model evaluated by ten-fold cross-validation
The performance of the optimal classification models using different reference gene sets
| Reference gene set | TACC |
|---|---|
| This study | 0.751 |
| Genes in CancerNext | 0.407 |
| Genes in CancerNextExpanded | 0.463 |
| Genes in CloudHealth | 0.421 |
| Genes in GeneDx | 0.400 |
| Genes in Illumina | 0.656 |
| Genes in NanoString | 0.618 |
| Genes in xGen | 0.519 |
The top 18 features in the MaxRel feature list
| Order | Feature name | Gene name | Description | MI value | Rank in the mRMR feature list |
|---|---|---|---|---|---|
| 1 | ENSG00000155657 | TTN | Titin | 0.416 | 1 |
| 2 | ENSG00000008988 | RPS20 | Ribosomal Protein S20 | 0.407 | 13 |
| 3 | ENSG00000177600 | RPLP2 | Ribosomal Protein Lateral Stalk Subunit P2 | 0.405 | 6 |
| 4 | ENSG00000211772 | TRBC2 | T Cell Receptor Beta Constant 2 | 0.396 | 19 |
| 5 | ENSG00000168028 | RPSA | Ribosomal Protein SA | 0.393 | 35 |
| 6 | ENSG00000142534 | RPS11 | Ribosomal Protein S11 | 0.384 | 64 |
| 7 | ENSG00000142676 | RPL11 | Ribosomal Protein L11 | 0.381 | 48 |
| 8 | ENSG00000105193 | RPS16 | Ribosomal Protein S16 | 0.380 | 57 |
| 9 | ENSG00000160654 | CD3G | CD3g Molecule | 0.379 | 25 |
| 10 | ENSG00000168421 | RHOH | Ras Homolog Family Member H | 0.373 | 3 |
| 11 | ENSG00000139193 | CD27 | CD27 Molecule | 0.369 | 8 |
| 12 | ENSG00000131469 | RPL27 | Ribosomal Protein L27 | 0.368 | 106 |
| 13 | ENSG00000163682 | RPL9 | Ribosomal Protein L9 | 0.368 | 86 |
| 14 | ENSG00000071082 | RPL31 | Ribosomal Protein L31 | 0.367 | 78 |
| 15 | ENSG00000149311 | ATM | ATM Serine/Threonine Kinase | 0.367 | 17 |
| 16 | ENSG00000149806 | FAU | FAU, Ubiquitin Like And Ribosomal Protein S30 Fusion | 0.366 | 31 |
| 17 | ENSG00000109475 | RPL34 | Ribosomal Protein L34 | 0.366 | 122 |
| 18 | ENSG00000089009 | RPL6 | Ribosomal Protein L6 | 0.366 | 117 |
Figure 4. The heat map of all samples using the eighteen important genes
Figure 5. The eighteen important genes found in the MaxRel feature list were clustered into three groups
Breakdown of 285 RNA-seq samples
| Cancer subtype | Number of samples |
|---|---|
| Breast cancer | 39 |
| Colorectal cancer | 42 |
| Glioblastoma | 40 |
| Hepatobiliary cancer | 14 |
| Lung cancer | 60 |
| Pancreatic cancer | 35 |
| Healthy control | 55 |
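As a quick sanity check on the table above, the class counts can be summed and the majority-class share computed. That share is the accuracy a trivial classifier would reach by always predicting the largest class, which gives context for the reported overall accuracy of 0.751. The variable names are illustrative only, and the baseline is a derived figure, not one reported in the source.

```python
# Sample counts copied from the table above.
sample_counts = {
    "Breast cancer": 39,
    "Colorectal cancer": 42,
    "Glioblastoma": 40,
    "Hepatobiliary cancer": 14,
    "Lung cancer": 60,
    "Pancreatic cancer": 35,
    "Healthy control": 55,
}

total_samples = sum(sample_counts.values())

# Largest class and the accuracy of always predicting it.
majority_class, majority_count = max(sample_counts.items(), key=lambda kv: kv[1])
majority_baseline = majority_count / total_samples

print(f"total samples: {total_samples}")
print(f"majority class: {majority_class} ({majority_count})")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```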
Figure 6. The flow chart of constructing the mRMR feature list in the mRMR method
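The following is a minimal sketch of how a MaxRel list and an mRMR list can be built, assuming the common greedy "difference" form of mRMR (relevance to the class minus mean redundancy with the features already selected). The exact variant and the steps in the flow chart are not reproduced here. The function names and toy data are hypothetical, and features are assumed to be discrete already; real expression values would first need to be discretized.

```python
from collections import Counter
from math import log2
from typing import Dict, List, Sequence


def mutual_information(x: Sequence, y: Sequence) -> float:
    """Mutual information (in bits) between two discrete sequences."""
    n = len(x)
    count_x, count_y, count_xy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log2(c * n / (count_x[a] * count_y[b]))
               for (a, b), c in count_xy.items())


def maxrel_rank(features: Dict[str, Sequence], target: Sequence) -> List[str]:
    """Rank features by relevance (MI with the class label) only."""
    return sorted(features,
                  key=lambda name: mutual_information(features[name], target),
                  reverse=True)


def mrmr_rank(features: Dict[str, Sequence], target: Sequence) -> List[str]:
    """Greedy mRMR: repeatedly pick the feature with the highest
    relevance minus mean redundancy with already selected features."""
    relevance = {name: mutual_information(col, target)
                 for name, col in features.items()}
    remaining, selected = list(features), []

    def score(name: str) -> float:
        if not selected:
            return relevance[name]
        redundancy = sum(mutual_information(features[name], features[s])
                         for s in selected) / len(selected)
        return relevance[name] - redundancy

    while remaining:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): "a_copy" duplicates "a"; "b" carries new information.
toy_target = [0, 1, 2, 3, 0, 1, 2, 3]
toy_features = {
    "a":      [0, 0, 1, 1, 0, 0, 1, 1],
    "a_copy": [0, 0, 1, 1, 0, 0, 1, 1],
    "b":      [0, 1, 0, 1, 0, 1, 0, 1],
}
maxrel_order = maxrel_rank(toy_features, toy_target)
mrmr_order = mrmr_rank(toy_features, toy_target)
```

In this toy case all three features are equally relevant, so MaxRel keeps the duplicate next to the original, while mRMR moves `b` ahead of `a_copy` because the duplicate adds no new information. This is the kind of difference that can make a gene's MaxRel order and its rank in the mRMR feature list diverge.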