Lingjian Yang, Chrysanthi Ainali, Sophia Tsoka, Lazaros G Papageorgiou.
Abstract
BACKGROUND: Applying machine learning methods to microarray gene expression profiles for disease classification is a popular way to derive biomarkers, i.e. sets of genes that can predict disease state or outcome. Traditional approaches, in which the expression of each gene is treated independently, suffer from low prediction accuracy and are difficult to interpret biologically. Current research efforts focus on integrating information on protein interactions, through biochemical pathway datasets, with expression profiles to propose pathway-based classifiers that can enhance disease diagnosis and prognosis. As most pathway activity inference methods in the literature are either unsupervised or applied to two-class datasets, there is good scope to address these limitations with novel methodologies.
Year: 2014 PMID: 25475756 PMCID: PMC4269079 DOI: 10.1186/s12859-014-0390-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Datasets
| Dataset | Disease | Samples | Class distribution | GEO accession |
|---|---|---|---|---|
| Swindell [ | Psoriasis | 180 | Healthy control: 64; Psoriatic non-lesional skin: 58; Psoriatic lesional skin: 58 | GSE13355 |
| Yao [ | Psoriasis | 82 | Healthy control: 21; Psoriatic non-lesional skin: 28; Psoriatic lesional skin: 33 | GSE14905 |
| Farmer [ | Breast cancer | 49 | Apocrine tumour: 6; Basal tumour: 16; Luminal tumour: 27 | GSE1561 |
| Pawitan [ | Breast cancer | 139 | Normal: 37; Luminal tumour: 62; ERBB2: 15; Basal: 25 | GSE1456 |
| Singh [ | Prostate cancer | 102 | Normal: 50; Tumour: 52 | |
| Shipp [ | DLBCL | 77 | DLBCL: 58; Follicular lymphoma: 19 | |
| Popovici [ | Breast cancer | 230 | Residual invasive cancer: 182; No residual invasive cancer: 48 | GSE24061 |
| Desmedt [ | Breast cancer | 198 | Metastatic: 51; Non-metastatic: 147 | GSE7390 |
Figure 1. Schematic flow chart of the DIGS-based approach for multiclass disease classification problems. Pathway-specific gene expression profiles are created by integrating the gene expression profile with pathway information. For each pathway, pathway activity is built as a weighted linear summation of the expression of member genes, where the weights are decision variables and the objective function maximises the number of samples whose pathway activity falls inside the range of their own class. The maximum number of member genes in a pathway allowed to have non-zero weights is explicitly constrained in the model by the parameter NoG. The pathway activity profile is created by assembling all pathway activities; a classifier is then trained on this profile and predicts the class label of a new sample. Importantly, the training procedure, i.e. inferring pathway activity and training a classifier, is always blind to testing samples, so that classification performance is evaluated objectively.
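The quantities named in the Figure 1 caption can be illustrated with a minimal Python sketch. It only evaluates a given set of weights and class ranges; the optimisation that chooses them is not shown. The function names, the toy expression values, and the class ranges are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple


def pathway_activity(expr: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Weighted linear sum of member-gene expression, one value per sample.

    expr[s][g] is the expression of member gene g in sample s.
    """
    return [sum(w * x for w, x in zip(weights, row)) for row in expr]


def satisfies_nog(weights: Sequence[float], nog: int) -> bool:
    """True if at most `nog` member genes carry a non-zero weight."""
    return sum(1 for w in weights if w != 0) <= nog


def samples_in_own_range(activity: Sequence[float],
                         labels: Sequence[str],
                         ranges: Dict[str, Tuple[float, float]]) -> int:
    """Count samples whose activity lies inside the range of their own class."""
    return sum(1 for a, c in zip(activity, labels)
               if ranges[c][0] <= a <= ranges[c][1])


# Hypothetical toy data: 4 samples x 3 member genes of one pathway.
expr = [[1.0, 2.0, 0.5], [0.8, 2.2, 0.1], [3.0, 0.5, 0.9], [2.8, 0.4, 1.2]]
labels = ["control", "control", "tumour", "tumour"]
weights = [1.0, -1.0, 0.0]  # fixed here; these are decision variables in the model
ranges = {"control": (-2.0, 0.0), "tumour": (1.0, 3.0)}  # assumed class ranges

activity = pathway_activity(expr, weights)
in_range = samples_in_own_range(activity, labels, ranges)
```

With these toy values, two genes carry non-zero weights, so the weight vector is feasible for NoG = 2 but not for NoG = 1, and all four samples fall inside the range of their own class.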
Overview of Evaluated Methods
| Method (source) |
|---|
| Guo et al. [ |
| Guo et al. [ |
| Bild et al. [ |
| Lee et al. [ |
| Ainali et al. [ |
| Single Genes |
| Proposed in this work |
Figure 2. Sensitivity analysis of the parameter NoG for the DIGS model with SMO (A) and NN (B) classifiers. For each of the 8 datasets, the proposed DIGS model is applied to infer pathway activity with NoG, i.e. the maximum number of member genes in a pathway allowed to have non-zero weights, set to 5, 10, 15 and 20. In addition, the DIGS model is applied with NoG set equal to the number of available member genes in a pathway (DIGS_ALL), i.e. all member genes can take non-zero weights when constructing pathway activity. A classifier is trained on the pathway activity profiles and its prediction accuracy is tested. For both SMO (A) and NN (B) classifiers, the proposed DIGS model is robust to NoG across the tested range of 5 to 20. Furthermore, constraining the maximum number of active constituent genes appears to generally improve classification accuracy, as DIGS_ALL usually leads to a lower prediction rate than the constrained settings.
Figure 3. Classification accuracy comparison of 7 competing methods using 5-NN (A) and NN (B) classifiers. The proposed DIGS pathway activity inference method is compared against other pathway activity inference methods (Mean, Median, PCA and CORGs) and against gene-based methods (SG and per_pathway). Classification accuracy is summarised as the average prediction rate over 50 runs, each a random partition of the dataset into a 70% training set and a 30% testing set. With the 5-NN classifier (A), DIGS achieves the highest accuracy on 6 datasets (Singh, Popovici, Desmedt, Swindell, Farmer and Pawitan) and ties for first on the other 2 (Shipp and Yao). Prediction rates achieved by DIGS are generally high, over 80% on most datasets, which facilitates its application in real-world settings. With the NN classifier (B), the same trend holds: prediction accuracies achieved by DIGS at least match the state-of-the-art methods in the literature for binary disease classification problems, and consistently outperform the competing methods for multi-phenotype problems.
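The evaluation protocol in the Figure 3 caption (50 random 70%/30% partitions, accuracy averaged over runs) can be sketched as follows with a plain k-nearest-neighbour classifier. This is an illustrative sketch, not the authors' implementation: the function names, the Euclidean distance, the fixed seed, and the toy data are assumptions. The features are taken as given here, whereas in the paper's workflow pathway activity would be re-inferred on the training part of every split so that training stays blind to the test samples.

```python
import random
from collections import Counter
from typing import List, Sequence


def knn_predict(train_x: Sequence[Sequence[float]], train_y: Sequence[str],
                x: Sequence[float], k: int = 5) -> str:
    """Majority vote among the k training samples nearest to x (Euclidean)."""
    order = sorted(range(len(train_x)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def mean_accuracy(x: Sequence[Sequence[float]], y: Sequence[str],
                  runs: int = 50, train_frac: float = 0.7,
                  k: int = 5, seed: int = 0) -> float:
    """Average test accuracy over repeated random train/test partitions."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    n_train = int(round(train_frac * len(x)))
    accuracies: List[float] = []
    for _ in range(runs):
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        tx = [x[i] for i in train]
        ty = [y[i] for i in train]
        correct = sum(knn_predict(tx, ty, x[i], k) == y[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


# Hypothetical, well-separated two-class toy data (10 samples per class).
toy_x = [[0.1 * i, 0.0] for i in range(10)] + [[10.0 + 0.1 * i, 0.0] for i in range(10)]
toy_y = ["A"] * 10 + ["B"] * 10
toy_accuracy = mean_accuracy(toy_x, toy_y)
```

Because the two toy classes are far apart and every training split keeps at least four samples of each class, the 5-NN vote is always correct on this data; real expression data would of course give lower and more variable accuracies.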
Mean normalised classification rates over 4 two-phenotype datasets according to performance
| Classifier | | | | | | | |
|---|---|---|---|---|---|---|---|
| 5-NN | | 0.9071 | 0.8737 | 0.8751 | 0.8903 | 0.8389 | 0.9371 |
| NN | | 0.9584 | 0.9323 | 0.9004 | 0.9041 | 0.9480 | 0.9769 |
| SMO | | 0.9474 | 0.9435 | 0.9225 | 0.9325 | 0.9704 | 0.9645 |
| HB | | 0.9730 | 0.8819 | 0.8707 | 0.8547 | 0.8402 | 0.9595 |
| Logistic | 0.9318 | | 0.8902 | 0.8789 | 0.8632 | 0.8482 | 0.9684 |
| Mean | | 0.9535 | 0.9043 | 0.8895 | 0.8890 | 0.8891 | 0.9613 |
The highest classification rate achieved across all competing methods is highlighted in bold for each classifier.
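The "Mean" row appears to be the arithmetic mean over the five classifiers. A quick Python check on the five rightmost columns of the two-phenotype table, which have a value for every classifier, reproduces the reported means to four decimal places. The variable names are illustrative only.

```python
# Values of the five rightmost columns of the two-phenotype table, by classifier.
rates = {
    "5-NN":     [0.8737, 0.8751, 0.8903, 0.8389, 0.9371],
    "NN":       [0.9323, 0.9004, 0.9041, 0.9480, 0.9769],
    "SMO":      [0.9435, 0.9225, 0.9325, 0.9704, 0.9645],
    "HB":       [0.8819, 0.8707, 0.8547, 0.8402, 0.9595],
    "Logistic": [0.8902, 0.8789, 0.8632, 0.8482, 0.9684],
}
reported_mean = [0.9043, 0.8895, 0.8890, 0.8891, 0.9613]

# Column-wise arithmetic mean over the five classifiers.
column_means = [sum(col) / len(col) for col in zip(*rates.values())]

# Each recomputed mean agrees with the reported one after rounding to 4 decimals.
agrees = [abs(m - r) < 1e-4 for m, r in zip(column_means, reported_mean)]
```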
Mean normalised classification rates over 4 multi-phenotype datasets according to performance
| Classifier | | | | | | |
|---|---|---|---|---|---|---|
| 5-NN | | 0.9532 | 0.8126 | 0.8090 | 0.8158 | 0.8696 |
| NN | | 0.9488 | 0.9402 | 0.9334 | 0.8322 | 0.9585 |
| SMO | | 0.9335 | 0.9372 | 0.9246 | 0.8521 | 0.9452 |
| HB | | 0.9241 | 0.7518 | 0.7639 | 0.7893 | 0.8043 |
| Logistic | | 0.8290 | 0.5614 | 0.5440 | 0.5589 | 0.6450 |
| Mean | | 0.91772 | 0.80064 | 0.79498 | 0.76966 | 0.84452 |
The highest classification rate achieved across all competing methods is highlighted in bold for each classifier.
Significant pathways and constituent genes identified by the proposed DIGS model for Pawitan
| Pathway | Constituent genes |
|---|---|
| PROSTATE CANCER | EGFR, TCF7L1, GSTP1, PDGFRA, CCNE1, CHUK, PIK3R3, ERBB2, PIK3R1 |
| UBIQUITIN MEDIATED PROTEOLYSIS | UBE2E3, MID1, SKP2, BRCA1, WWP1 |
| WNT SIGNALING PATHWAY | FZD7, SOX17, TCF7L1, SKP1, SFRP1, FZD8 |
| O GLYCAN BIOSYNTHESIS | GALNT3, GALNT7, GALNT11, GALNT6, GCNT3, B4GALT5, GALNT8, C1GALT1, GALNT12, GCNT4, GALNT14, GALNT10, GALNT2, ST3GAL2, GCNT1, ST3GAL1, C1GALT1C1, GALNT1 |
| ADHERENS JUNCTION | EGFR, ERBB2, TCF7L1, TCF7L2, MET, RAC3, SMAD3, MLLT4, RHOA |
| ERBB SIGNALING PATHWAY | EGFR, NCK2, ERBB2, AKT3, PAK4, EREG, MAPK9, AKT2 |
| NITROGEN METABOLISM | CA12, CA5A, CA9, GLUL, CA3, CA14, CA8, CA7, CA5B, GLUD1, CA2, AMT, CA6, CA1, CTH, GLS2, GLUD2, HAL, CA4, ASNS, CPS1 |
| DORSO VENTRAL AXIS FORMATION | EGFR, NOTCH1, GRB2, MAPK3, NOTCH3, SOS1, CPEB1, PIWIL2, ETS2, MAPK1, NOTCH4, ETV6, PIWIL1, MAP2K1, NOTCH2, SOS2, ETS1, ETV7, KRAS |
| ENDOMETRIAL CANCER | EGFR, TCF7L1, ERBB2, TCF7L2, MLH1, ELK1, NRAS, AKT3, ARAF, CTNNA2, PIK3CB, AKT2, CCND1, FOXO3, LEF1 |
| NON SMALL CELL LUNG CANCER | EGFR, AKT3, E2F3, ERBB2, BAD, E2F1, RARB, CDKN2A, PLCG2, GRB2, HRAS, MAPK3, PIK3CD, RXRG, TGFA |
| PANCREATIC CANCER | EGFR, ERBB2, AKT3, CDKN2A, MAPK9, PLD1, RAC3, RALA, CCND1, E2F3, JAK1, PIK3R1 |
Figure 4. Pathway activity of the significant pathways in Pawitan. Pathway activities are inferred with the DIGS model using all samples. Red/green blocks indicate up-/down-regulation of pathways (rows) in samples (columns). Pathways are clustered according to the similarity of their activities.
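Clustering pathways by the similarity of their activity profiles, as in Figure 4, can be sketched in Python as below. The caption does not state which similarity measure or linkage was used, so the Pearson correlation distance and average linkage here are assumptions, and the pathway names and activity values are hypothetical toy data.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


def cluster_merges(activity: Dict[str, List[float]]) -> List[Tuple[str, ...]]:
    """Average-linkage agglomerative clustering with distance 1 - Pearson r.

    Returns the cluster formed at each merge, in merge order.
    """
    clusters: List[Tuple[str, ...]] = [(name,) for name in activity]

    def dist(c1: Tuple[str, ...], c2: Tuple[str, ...]) -> float:
        pairs = [(p, q) for p in c1 for q in c2]
        return sum(1 - pearson(activity[p], activity[q]) for p, q in pairs) / len(pairs)

    merges: List[Tuple[str, ...]] = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        merges.append(merged)
    return merges


# Hypothetical activities of three pathways across four samples.
toy_activity = {
    "pathway_A": [1.0, 0.8, -0.9, -1.1],
    "pathway_B": [0.9, 1.0, -1.0, -0.8],
    "pathway_C": [-1.0, 0.9, -0.8, 1.1],
}
merges = cluster_merges(toy_activity)
```

In this toy example pathway_A and pathway_B have nearly identical profiles and are merged first, so they would sit next to each other as rows of the heatmap.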