Zhi Yan, Brian T Luke, Shirley X Tsang, Rui Xing, Yuanming Pan, Yixuan Liu, Jinlian Wang, Tao Geng, Jiangeng Li, Youyong Lu.
Abstract
High-throughput gene expression microarrays can be examined by machine-learning algorithms to identify gene signatures that recognize the biological characteristics of specific human diseases, including cancer, with high sensitivity and specificity. A previous study compared 20 gastric cancer (GC) samples against 20 normal tissue (NT) samples and identified 1,519 differentially expressed genes (DEGs). In this study, the Classification Information Index (CII), Information Gain Index (IGI), and RELIEF algorithms are used to mine the previously reported gene expression profiling data. In all, 29 of these genes are identified by all three algorithms and are treated as GC candidate biomarkers. Three biomarkers, COL1A2, ATP4B, and HADHSC, are selected and further examined using quantitative real-time polymerase chain reaction (qRT-PCR) and immunohistochemistry (IHC) staining in two independent sets of GC and normal adjacent tissue (NAT) samples. Our study shows that COL1A2 and HADHSC are the two best biomarkers from the microarray data, distinguishing all GC from the NT, whereas ATP4B is diagnostically significant in lab tests because of its wider range of fold-changes in expression. Herein, a data-mining model applicable to small sample sizes is presented and discussed. Our results suggest that this mining model may be useful in small sample-size studies to identify putative biomarkers and potential biological features of GC.
Keywords: gastric cancer; gene signature; machine-learning algorithm; microarray
Year: 2014 PMID: 25210421 PMCID: PMC4149392 DOI: 10.4137/BMI.S13059
Source DB: PubMed Journal: Biomark Insights ISSN: 1177-2719
Feature gene selection using the CII, IGI, and RELIEF algorithms.
| ALGORITHMS | INTERVALS | GENE NUMBERS | PERCENTS |
|---|---|---|---|
| CII | i < 0.9 | 1451 | 95.52% |
| CII | 0.9 < i < 1 | 24 | 1.58% |
| CII | 1 < i < 1.5 | 39 | 2.57% |
| CII | 1.5 < i < 2 | 4 | 0.26% |
| CII | i > 2 | 1 | 0.07% |
| IGI | g < 0.35 | 1420 | 93.48% |
| IGI | 0.35 < g < 0.45 | 64 | 4.21% |
| IGI | 0.45 < g < 0.6 | 33 | 2.17% |
| IGI | 0.6 < g | 2 | 0.13% |
| RELIEF | w < 0.3 | 1194 | 78.54% |
| RELIEF | 0.3 < w < 0.5 | 214 | 14.09% |
| RELIEF | 0.5 < w < 0.6 | 47 | 3.09% |
| RELIEF | 0.6 < w < 0.8 | 52 | 3.42% |
| RELIEF | 0.8 < w < 1 | 11 | 0.72% |
| RELIEF | w > 1 | 2 | 0.13% |
Abbreviations: i, Classification Information Index of each gene; g, Information Gain Index of each gene; w, Relief classification weight of each gene.
Figure 1. Venn diagram and cluster analysis of the genes selected by all of the filtering methods. The thresholds of the CII, IGI, and RELIEF algorithms were set to 0.9, 0.35, and 0.3, respectively. In the cluster figure, columns represent samples and rows represent genes (black, green, and red correspond to unchanged, down-regulated, and up-regulated, respectively).
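The selection rule behind the Venn diagram amounts to thresholding each score and intersecting the three resulting gene sets. A minimal Python sketch follows, assuming that a gene passes a filter when its score exceeds the threshold; the gene names and scores are hypothetical placeholders, not values from the study.

```python
# Thresholds taken from the Figure 1 caption.
THRESHOLDS = {"CII": 0.9, "IGI": 0.35, "RELIEF": 0.3}

# Hypothetical per-gene scores, for illustration only.
scores = {
    "geneA": {"CII": 1.20, "IGI": 0.50, "RELIEF": 0.70},
    "geneB": {"CII": 0.95, "IGI": 0.20, "RELIEF": 0.60},
    "geneC": {"CII": 0.40, "IGI": 0.40, "RELIEF": 0.10},
    "geneD": {"CII": 1.60, "IGI": 0.36, "RELIEF": 0.31},
}


def genes_passing(scores, method, threshold):
    """Return the genes whose score for one method exceeds its threshold."""
    return {gene for gene, s in scores.items() if s[method] > threshold}


def consensus_genes(scores, thresholds):
    """Return the genes that pass every filter (the centre of the Venn diagram)."""
    passing = [genes_passing(scores, m, t) for m, t in thresholds.items()]
    return set.intersection(*passing)


candidates = consensus_genes(scores, THRESHOLDS)
print(sorted(candidates))
```

Requiring agreement among all three filters is a conservative choice: a gene that scores highly under only one criterion is excluded from the candidate list.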
The 29 candidate feature genes selected by the CII, IGI, and RELIEF algorithms.
| ACCESSION# | GENE | CHANGE | THRESHOLD | SENSITIVITY | SPECIFICITY | RANGE[LOG(FC)] |
|---|---|---|---|---|---|---|
| NM_005327 | HADHSC | down | 0.763 | | | |
| NM_000089 | COL1A2 | up | 2.394 | | | |
| NM_001275 | CHGA | down | 0.317 | 95% | 100% | 4.491 |
| NM_019891 | ERO1LB | down | 0.626 | 100% | 94.7% | 5.648 |
| BC014245 | CTHRC1 | up | 2.112 | 95% | 95% | 4.343 |
| AK056767 | MAFK | down | 0.395 | 90% | 100% | 3.979 |
| NM_012277 | NM_012277 | down | 0.307 | 90% | 100% | 10.753 |
| AB033025 | KIAA1199 | up | 2.073 | 95% | 94.7% | 7.681 |
| NM_003247 | THBS2 | up | 2.570 | 100% | 85% | 3.819 |
| NM_002371 | MAL | down | 0.187 | 85% | 100% | 6.361 |
| NM_005672 | PSCA | down | 0.531 | 95% | 90% | 5.366 |
| NM_002909 | REG1A | down | 0.314 | 85% | 100% | 5.257 |
| NM_032744 | C6orf105 | down | 1.064 | 100% | 85% | 4.513 |
| NM_144646 | IGJ | down | 0.486 | 95% | 90% | 3.916 |
| AL117382 | C20orf142 | down | 0.369 | 85% | 100% | 4.559 |
| NM_020707 | C3orf3 | down | 0.640 | 85% | 100% | 4.774 |
| NM_003652 | CPZ | up | 3.155 | 85% | 100% | 11.824 |
| NM_000705 | ATP4B | down | 0.169 | 95% | 89.5% | |
| NM_005136 | NM_005136 | down | 0.287 | 80% | 100% | 7.089 |
| NM_005145 | NM_005145 | down | 0.514 | 80% | 100% | 6.53 |
| BC015417 | MAMDC2 | down | 0.503 | 80% | 100% | 9.76 |
| BC003517 | ATXN7L1 | down | 0.755 | 80% | 100% | 10.797 |
| NM_007193 | ANXA10 | down | 0.149 | 75% | 100% | 6.131 |
| NM_003657 | BCAS1 | down | 0.301 | 75% | 100% | 4.044 |
| NM_018658 | KCNJ16 | down | 1.765 | 100% | 75% | 9.853 |
| NM_022129 | MAWBP | down | 0.558 | 75% | 100% | 4.693 |
| NM_032471 | PKIB | down | 0.617 | 100% | 75% | 9.477 |
| AA513382 | IGJ | down | 0.379 | 80% | 94.7% | 3.755 |
| NM_001854 | COL11A1 | up | 4.397 | 70% | 100% | 7.699 |
Note: The threshold producing the sensitivity and specificity is selected to minimize the GINI index of the daughter nodes.
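The threshold rule in the note can be sketched as a one-gene decision stump. The Python sketch below assumes that "GINI index of the daughter nodes" means the size-weighted Gini impurity of the two groups created by the cut; the expression values and labels are hypothetical and only illustrate the search.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gini(values, labels, threshold):
    """Size-weighted Gini impurity of the two daughter nodes of a cut."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Scan midpoints between sorted distinct values; keep the purest cut."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(candidates, key=lambda t: split_gini(values, labels, t))


# Hypothetical fold-change values for a gene that is down-regulated in GC.
values = [0.2, 0.3, 0.4, 0.5, 0.6, 0.9, 1.1, 1.4, 1.8, 2.5]
labels = ["GC", "GC", "GC", "GC", "GC", "NT", "GC", "NT", "NT", "NT"]

t = best_threshold(values, labels)
print(t, split_gini(values, labels, t))
```

With these made-up values the purest cut falls between 0.6 and 0.9, leaving one GC sample on the NT side; sensitivity and specificity are then read off at that cut.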
Gene ontology analyses of the candidate signatures.
| GO TERMS | INPUT SYMBOL | P-VALUE |
|---|---|---|
| Collagen fibril organization | COL1A2;COL11A1 | 1.90E-06 |
| Cell adhesion | THBS2;COL1A2;COL11A1 | 0.001965 |
| Anti-apoptosis | MAL | 0.01571 |
| Ion transport | ATP4B;KCNJ16;CPZ;ERO1LB | 0.001751 |
| Protein binding | MAFK;THBS2;MAL;C3orf3;BCAS1 | 1.63E-06 |
| TGF beta receptor signaling pathway | COL1A2 | 0.009773 |
| Rho protein signal transduction | COL1A2 | 0.010276 |
| Wnt receptor signaling pathway | CPZ | 0.011365 |
| Cell differentiation | MAL | 0.106455 |
| Cell proliferation | REG1A | 0.025169 |
| Metabolism | HADHSC | 0.016543 |
| Regulation of transcription | MAFK;MAL | 0.016968 |
| Protein thiol-disulfide exchange | ERO1LB | 5.08E-04 |
| Sensory perception of sound | COL11A1 | 5.92E-04 |
| Immune response | MAL;IGJ | 0.002569 |
| Negative regulation of protein kinase activity | PKIB | 0.005825 |
| ATP biosynthesis | ATP4B | 0.007507 |
| Blood vessel development | COL1A2 | 0.018041 |
| Nervous system development | MAFK | 0.067536 |
| Proteolysis | CPZ | 0.084789 |
| Calcium ion binding | CHGA;THBS2;ANXA10 | 2.03E-05 |
Genes with the highest scores for each of the filtering methods.
| CII GENE | CII SCORE | IGI GENE | IGI SCORE | RELIEF GENE | RELIEF SCORE |
|---|---|---|---|---|---|
| ATP4B | 2.41855 | HADHSC | 0.69315 | COL1A2 | 1.08879 |
| NM_012277 | 1.77127 | COL1A2 | 0.69282 | HADHSC | 1.01081 |
| ATP4A | 1.69775 | SULF2 | 0.59264 | NM_005145 | 0.93394 |
| KCNJ16 | 1.58771 | CHGA | 0.59264 | CHGA | 0.86624 |
| COL4A6 | 1.53349 | RDH12 | 0.59264 | ERO1LB | 0.84882 |
| FAM3B | 1.39394 | CPZ | 0.58432 | ATXN7L1 | 0.84856 |
| ANXA10 | 1.39052 | SPARC | 0.52560 | KIAA1199 | 0.84567 |
| SULT1C1 | 1.37157 | COL18A1 | 0.52560 | NM_012277 | 0.84271 |
| PSCA | 1.35426 | CDC25B | 0.52560 | APBB1IP | 0.82600 |
| NM_005136 | 1.35310 | MAFK | 0.52560 | NQO3A2 | 0.81854 |
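Several IGI scores in this table (0.69315, 0.59264, and 0.52560) coincide with ordinary information gain, computed with natural-log entropy, for a cut that separates 20 GC from 20 NT samples with zero, one, or two misplaced samples. The Python sketch below illustrates that observation; it assumes the standard information-gain formula, which this section does not define, and the count layout `[n_GC, n_NT]` is a convention chosen here.

```python
import math


def entropy(counts):
    """Shannon entropy in nats for a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def information_gain(left, right):
    """Information gain of a cut; left/right are [n_GC, n_NT] on each side."""
    parent = [l + r for l, r in zip(left, right)]
    n = sum(parent)
    children = (sum(left) / n) * entropy(left) + (sum(right) / n) * entropy(right)
    return entropy(parent) - children


# 20 GC vs 20 NT with 0, 1, or 2 GC samples on the wrong side of the cut.
for misplaced in (0, 1, 2):
    gain = information_gain([20 - misplaced, 0], [misplaced, 20])
    print(misplaced, round(gain, 5))
```

A perfect separation of two equally sized classes yields ln 2, which is the largest value this measure can take for such data and matches the top IGI score for HADHSC.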
RT-PCR classification results when the 30 NAT and GC samples are treated as independent data.
| PANEL | INTERVAL | NAT | GC | |
|---|---|---|---|---|
| A | <0.044 | 3 | 25 | Sensitivity = 89.3% |
| A | >0.044 | 27 | 5 | Specificity = 84.4% |
| A | | NPV = 90.0% | PPV = 83.3% | |
| B | <1.027 | 22 | 7 | Specificity = 75.9% |
| B | >1.027 | 8 | 23 | Sensitivity = 74.2% |
| B | | NPV = 73.3% | PPV = 76.6% | |
| C | <3.052 | 17 | 29 | Sensitivity = 63.0% |
| C | >3.052 | 13 | 1 | Specificity = 92.9% |
| C | | NPV = 43.3% | PPV = 96.7% | |
Note: The threshold values are determined by maximizing the GINI index. A, ATP4B using a threshold of 0.044 for 2^(−ΔΔCt). B, COL1A2 using a threshold of 1.027. C, HADHSC using a threshold of 3.052.
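The percentages in this table can be recomputed from the counts. The Python sketch below uses the table's own labels: the figures labelled sensitivity and specificity are row-wise fractions (within one side of the threshold), and those labelled PPV and NPV are column-wise fractions (within all 30 GC or all 30 NAT samples). More commonly the row-wise fractions are called predictive values and the column-wise ones sensitivity and specificity. The `gc_side` field, which states the side of the threshold called GC, is an assumption inferred from each gene's direction of change.

```python
# Counts from the table: (NAT, GC) below and above each threshold.
PANELS = {
    "A (ATP4B, 0.044)": {"below": (3, 25), "above": (27, 5), "gc_side": "below"},
    "B (COL1A2, 1.027)": {"below": (22, 7), "above": (8, 23), "gc_side": "above"},
    "C (HADHSC, 3.052)": {"below": (17, 29), "above": (13, 1), "gc_side": "below"},
}


def table_rates(below, above, gc_side):
    """Recompute the four percentages, keyed by the labels used in the table."""
    gc_row, nat_row = (below, above) if gc_side == "below" else (above, below)
    total_nat = below[0] + above[0]
    total_gc = below[1] + above[1]
    return {
        "Sensitivity": 100 * gc_row[1] / sum(gc_row),    # GC share of the GC-side row
        "Specificity": 100 * nat_row[0] / sum(nat_row),  # NAT share of the NAT-side row
        "NPV": 100 * nat_row[0] / total_nat,             # NAT samples on the NAT side
        "PPV": 100 * gc_row[1] / total_gc,               # GC samples on the GC side
    }


for name, panel in PANELS.items():
    rates = table_rates(**panel)
    print(name, {label: round(value, 1) for label, value in rates.items()})
```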
Figure 2. Validation of the feature genes using IHC staining. (A) and (B): positive staining of COL1A2 appeared in cancer but not in NT. COL1A2 was highly expressed in 17 GC samples, a positive rate of 77.3% (17/22). (C) and (D): ATP4B staining was more often negative in cancer and positive in NT. ATP4B was highly expressed in 20 normal samples, a positive rate of 83.4% (20/24). (E) and (F): positive staining of HADHSC appeared in normal but not in cancer tissue. HADHSC was highly expressed in 24 normal samples, a positive rate of 92.3% (24/26).
IHC staining results for the three selected putative biomarkers.
| ANTIBODY | TYPES OF SAMPLES | POSITIVE | NEGATIVE | P-VALUE |
|---|---|---|---|---|
| COL1A2 | T = 22 | 17 (77.3%) | 5 (22.7%) | 0.0305 |
| | N = 22 | 9 (40.9%) | 13 (59.1%) | |
| ATP4B | T = 25 | 11 (44%) | 14 (56%) | 0.0072 |
| | N = 24 | 20 (83.4%) | 4 (16.6%) | |
| HADHSC | T = 25 | 4 (16.0%) | 21 (84.0%) | 9.15 × 10^−6 |
| | N = 26 | 24 (92.3%) | 2 (7.7%) | |
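This section does not say which statistical test produced the values in the last column. As an illustration only, the Python sketch below applies a two-sided Fisher's exact test to the COL1A2 counts; for that row it gives roughly 0.0305. The choice of test is an assumption, and agreement with the other rows is not claimed.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1 = a + b
    col1 = a + c
    n = a + b + c + d

    def weight(k):
        # Unnormalised hypergeometric probability of k in the top-left cell.
        return comb(col1, k) * comb(n - col1, row1 - k)

    observed = weight(a)
    lowest = max(0, row1 - (n - col1))
    highest = min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    # Integer weights keep the comparison exact, with no rounding issues.
    extreme = sum(
        weight(k) for k in range(lowest, highest + 1) if weight(k) <= observed
    )
    return extreme / comb(n, row1)


# COL1A2: tumour 17 positive / 5 negative, normal 9 positive / 13 negative.
p_col1a2 = fisher_exact_two_sided(17, 5, 9, 13)
print(f"{p_col1a2:.4f}")
```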