ShiJian Ding, Hao Li, Yu-Hang Zhang, XianChao Zhou, KaiYan Feng, ZhanDong Li, Lei Chen, Tao Huang, Yu-Dong Cai.
Abstract
There are many types of cancer. Although they share some hallmarks, such as proliferation and metastasis, they differ in many respects and arise in different organs or tissues. Does each cancer type have a unique gene expression pattern that distinguishes it from the others? Since the Cancer Genome Atlas (TCGA) project, pan-cancer studies have become increasingly common, with researchers seeking robust gene expression signatures from pan-cancer patients. However, heterogeneity produces large variance among cancer patients, so the sample size needed for robust results would be too large to recruit. In this study, we took another approach to obtaining robust pan-cancer biomarkers: using cell line data to reduce the variance. We applied several advanced computational methods to the Cancer Cell Line Encyclopedia (CCLE) gene expression profiles, which included 988 cell lines from 20 cancer types. Two feature selection methods, Boruta and max-relevance and min-redundancy (mRMR), were applied to the cell line gene expression data one after the other, generating a feature list. This list was fed into the incremental feature selection (IFS) method, incorporating one classification algorithm, to extract biomarkers and construct optimal classifiers and decision rules. The optimal classifiers performed well and can be useful tools for identifying cell lines from different cancer types, whereas the biomarkers (e.g., NCKAP1, TNFRSF12A, LAMB2, FKBP9, PFN2, TOM1L1) and rules identified in this work may provide a meaningful and precise reference for differentiating multiple types of cancer and contribute to the personalized treatment of tumors.
Keywords: biomarker; classification algorithm; decision rule; feature selection; pan-cancer study
Year: 2021 PMID: 34917619 PMCID: PMC8669964 DOI: 10.3389/fcell.2021.781285
Source DB: PubMed Journal: Front Cell Dev Biol ISSN: 2296-634X
Distribution of samples and decision rules in different cancer cell lines.
| Cancer cell line types | Number of cell lines | Number of decision rules | Number of criteria | Number of involved genes |
|---|---|---|---|---|
| Autonomic ganglia | 16 | 4 | 42 | 17 |
| Bone | 20 | 4 | 46 | 19 |
| Breast | 51 | 23 | 325 | 63 |
| Central nervous system | 65 | 18 | 219 | 57 |
| Endometrium | 28 | 16 | 191 | 52 |
| Fibroblast | 37 | 3 | 28 | 15 |
| Haematopoietic and lymphoid tissue | 173 | 8 | 96 | 34 |
| Kidney | 32 | 7 | 88 | 38 |
| Large intestine | 56 | 9 | 123 | 47 |
| Liver | 25 | 9 | 115 | 42 |
| Lung | 188 | 51 | 740 | 80 |
| Oesophagus | 27 | 8 | 145 | 47 |
| Ovary | 47 | 25 | 306 | 67 |
| Pancreas | 41 | 14 | 208 | 56 |
| Skin | 49 | 7 | 83 | 34 |
| Soft tissue | 28 | 9 | 126 | 41 |
| Stomach | 37 | 26 | 361 | 72 |
| Thyroid | 12 | 7 | 79 | 38 |
| Upper aerodigestive tract | 31 | 11 | 162 | 47 |
| Urinary tract | 25 | 16 | 216 | 59 |
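As a quick consistency check on the table above, the per-type counts can be tallied in a few lines of Python. This is only an illustrative sketch: the rows are copied from the table, while the majority-class baseline and the criteria-per-rule ratio are derived quantities that the source does not report.

```python
# Rows copied from the table above:
# (cancer type, cell lines, decision rules, criteria, involved genes)
ROWS = [
    ("Autonomic ganglia", 16, 4, 42, 17),
    ("Bone", 20, 4, 46, 19),
    ("Breast", 51, 23, 325, 63),
    ("Central nervous system", 65, 18, 219, 57),
    ("Endometrium", 28, 16, 191, 52),
    ("Fibroblast", 37, 3, 28, 15),
    ("Haematopoietic and lymphoid tissue", 173, 8, 96, 34),
    ("Kidney", 32, 7, 88, 38),
    ("Large intestine", 56, 9, 123, 47),
    ("Liver", 25, 9, 115, 42),
    ("Lung", 188, 51, 740, 80),
    ("Oesophagus", 27, 8, 145, 47),
    ("Ovary", 47, 25, 306, 67),
    ("Pancreas", 41, 14, 208, 56),
    ("Skin", 49, 7, 83, 34),
    ("Soft tissue", 28, 9, 126, 41),
    ("Stomach", 37, 26, 361, 72),
    ("Thyroid", 12, 7, 79, 38),
    ("Upper aerodigestive tract", 31, 11, 162, 47),
    ("Urinary tract", 25, 16, 216, 59),
]

total_lines = sum(r[1] for r in ROWS)   # should match the 988 cell lines in the abstract
total_rules = sum(r[2] for r in ROWS)

# Accuracy of always predicting the most common type (derived, not from the source).
largest = max(ROWS, key=lambda r: r[1])
majority_baseline = largest[1] / total_lines

# Mean number of criteria per decision rule for each type (derived, not from the source).
criteria_per_rule = {r[0]: r[3] / r[2] for r in ROWS}

print(len(ROWS), total_lines, total_rules)
print(largest[0], round(majority_baseline, 3))
```

The tally reproduces the 20 types and 988 cell lines stated in the abstract. The class sizes are uneven (12 thyroid lines versus 188 lung lines), which is one reason a chance-corrected measure such as MCC is reported alongside overall accuracy.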
FIGURE 1. Flow chart of the entire analysis procedure. The CCLE dataset, which includes 988 cell lines and 20 tumor types, is analyzed by the Boruta and mRMR methods, resulting in a feature list. The list is then fed into the incremental feature selection method to extract optimal genes, build the optimal classifiers, and construct decision rules.
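The incremental feature selection step in this workflow can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the ranked feature list is taken as given (standing in for the Boruta and mRMR output), a nearest-centroid classifier stands in for the SVM and decision tree used in the study, leave-one-out accuracy stands in for the evaluation measure, and all function names and the toy data are hypothetical.

```python
from collections import defaultdict

def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose class centroid is closest to x (squared Euclidean)."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(train_X, train_y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    best_label, best_dist = None, float("inf")
    for label, total in sums.items():
        centroid = [s / counts[label] for s in total]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loo_accuracy(X, y):
    """Leave-one-out accuracy of the stand-in classifier."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        correct += nearest_centroid_predict(train_X, train_y, X[i]) == y[i]
    return correct / len(X)

def incremental_feature_selection(X, y, ranked_features, step=1):
    """Score the top-k ranked features for k = step, 2*step, ... and keep the best k."""
    curve = []
    for k in range(step, len(ranked_features) + 1, step):
        cols = ranked_features[:k]
        X_k = [[row[c] for c in cols] for row in X]
        curve.append((k, loo_accuracy(X_k, y)))
    # max() keeps the first maximum, so ties resolve to the smallest k.
    best_k, best_score = max(curve, key=lambda ks: ks[1])
    return best_k, best_score, curve

# Hypothetical toy data: feature 0 separates the classes, features 1 and 2 are noise.
X = [
    [0.1, 5.0, 3.0], [0.2, 1.0, 9.0], [0.0, 4.0, 1.0],
    [1.0, 5.0, 2.0], [0.9, 1.0, 8.0], [1.1, 4.0, 0.0],
]
y = ["A", "A", "A", "B", "B", "B"]
ranked = [0, 1, 2]  # assumed output of the ranking step

best_k, best_score, curve = incremental_feature_selection(X, y, ranked)
print(best_k, best_score)
```

The list of `(k, score)` pairs in `curve` corresponds to what an IFS curve plots: classifier performance as a function of the number of top-ranked features used.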
FIGURE 2. IFS curve of the support vector machine (SVM) classification algorithm across different numbers of features. The SVM reaches its highest Matthews correlation coefficient (MCC) of 0.976 when the top 3,130 features are adopted. With only the top 400 features, the SVM still performs well, with an MCC of 0.951.
Performance of some key classifiers.
| Classification algorithm | Number of features | Overall accuracy | MCC |
|---|---|---|---|
| Support vector machine | 3,130 | 0.978 | 0.976 |
| Support vector machine | 400 | 0.954 | 0.951 |
| Decision tree | 390 | 0.771 | 0.754 |
| Decision tree | 100 | 0.757 | 0.739 |
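For readers unfamiliar with the two measures in this table, the sketch below shows one way to compute overall accuracy and a multi-class MCC from true and predicted labels. The source does not state which multi-class MCC formula was used; this sketch assumes the common multi-class generalization of the Matthews correlation coefficient (Gorodkin's R_K statistic). The function names and the example labels are hypothetical.

```python
from collections import Counter
from math import sqrt

def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def multiclass_mcc(y_true, y_pred):
    """Multi-class MCC computed from label counts (assumed formula, see lead-in)."""
    s = len(y_true)                                      # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))      # correctly predicted samples
    t_k = Counter(y_true)                                # times each class truly occurs
    p_k = Counter(y_pred)                                # times each class is predicted
    labels = set(t_k) | set(p_k)
    cov_tp = c * s - sum(t_k[k] * p_k[k] for k in labels)
    cov_tt = s * s - sum(t_k[k] ** 2 for k in labels)
    cov_pp = s * s - sum(p_k[k] ** 2 for k in labels)
    denom = sqrt(cov_tt * cov_pp)
    return cov_tp / denom if denom else 0.0

# Hypothetical labels: one lung cell line is misclassified as skin.
y_true = ["lung", "lung", "skin", "skin", "bone", "bone"]
y_pred = ["lung", "skin", "skin", "skin", "bone", "bone"]

acc = overall_accuracy(y_true, y_pred)
mcc = multiclass_mcc(y_true, y_pred)
print(round(acc, 3), round(mcc, 3))
```

Unlike overall accuracy, this MCC accounts for how often each class occurs and is predicted, so a classifier cannot score well simply by favoring the largest classes.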
FIGURE 3. Radar graph showing the performance of two support vector machine (SVM) classifiers and two decision tree (DT) classifiers on 20 cancer types. The two SVM classifiers perform almost equally, as do the two DT classifiers.
FIGURE 4. IFS curve of the DT classification algorithm across different numbers of features. The DT reaches its highest MCC of 0.754 when the top 390 features are adopted, and still yields an MCC of 0.739 when only 100 features are used.
FIGURE 5. Top enriched GO terms for the top 400 genes in the mRMR feature list.
FIGURE 6. Top enriched KEGG pathways for the top 400 genes in the mRMR feature list.
Information of essential genes.
| Ensembl ID | Gene symbol | Description |
|---|---|---|
| ENSG00000061676 | NCKAP1 | NCK Associated Protein 1 |
| ENSG00000006327 | TNFRSF12A | TNF Receptor Superfamily Member 12A |
| ENSG00000172037 | LAMB2 | Laminin Subunit Beta 2 |
| ENSG00000122642 | FKBP9 | FKBP Prolyl Isomerase 9 |
| ENSG00000070087 | PFN2 | Profilin 2 |
| ENSG00000141198 | TOM1L1 | Target Of Myb1 Like 1 Membrane Trafficking Protein |