Konstantina Kourou, Themis P Exarchos, Konstantinos P Exarchos, Michalis V Karamouzis, Dimitrios I Fotiadis.
Abstract
Cancer has been characterized as a heterogeneous disease consisting of many different subtypes. Early diagnosis and prognosis of a cancer type have become a necessity in cancer research, as they can facilitate the subsequent clinical management of patients. The importance of classifying cancer patients into high- or low-risk groups has led many research teams, from the biomedical and bioinformatics fields, to study the application of machine learning (ML) methods. These techniques have therefore been used to model the progression and treatment of cancerous conditions. In addition, the ability of ML tools to detect key features in complex datasets underlines their importance. A variety of these techniques, including Artificial Neural Networks (ANNs), Bayesian Networks (BNs), Support Vector Machines (SVMs) and Decision Trees (DTs), have been widely applied in cancer research for the development of predictive models, resulting in effective and accurate decision making. Even though it is evident that the use of ML methods can improve our understanding of cancer progression, an appropriate level of validation is needed before these methods can be considered in everyday clinical practice. In this work, we present a review of recent ML approaches employed in the modeling of cancer progression. The predictive models discussed here are based on various supervised ML techniques as well as on different input features and data samples. Given the growing trend in the application of ML methods in cancer research, we present here the most recent publications that employ these techniques to model cancer risk or patient outcomes.
Keywords: ANN, Artificial Neural Network; AUC, Area Under Curve; BCRSVM, Breast Cancer Support Vector Machine; BN, Bayesian Network; CFS, Correlation based Feature Selection; Cancer recurrence; Cancer survival; Cancer susceptibility; DT, Decision Tree; ES, Early Stopping algorithm; GEO, Gene Expression Omnibus; HTT, High-throughput Technologies; LCS, Learning Classifying Systems; ML, Machine Learning; Machine learning; NCI caArray, National Cancer Institute Array Data Management System; NSCLC, Non-small Cell Lung Cancer; OSCC, Oral Squamous Cell Carcinoma; PPI, Protein–Protein Interaction; Predictive models; ROC, Receiver Operating Characteristic; SEER, Surveillance, Epidemiology and End results Database; SSL, Semi-supervised Learning; SVM, Support Vector Machine; TCGA, The Cancer Genome Atlas Research Network
Year: 2014 PMID: 25750696 PMCID: PMC4348437 DOI: 10.1016/j.csbj.2014.11.005
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1. Classification task in supervised learning. Tumors are represented as X and classified as benign or malignant. The circled examples depict those tumors that have been misclassified.
Fig. 2. An indicative ROC curve of two classifiers: (a) Random Guess classifier (red curve) and (b) A classifier providing more robust predictions (blue dotted curve).
Fig. 3. An illustration of the ANN structure. The arrows connect the output of one node to the input of another.
Fig. 4. An illustration of a DT showing the tree structure. Each variable (X, Y, Z) is represented by a circle and the decision outcomes by squares (Class A, Class B). T(1–3) represents the thresholds (classification rules) used to assign each variable to a class label.
Fig. 5. A simplified illustration of a linear SVM classification of the input data. The figure was reproduced from the ML lectures of [21]. Tumors are classified according to their size and the patient's age. The arrows indicate the misclassified tumors.
Fig. 6. An illustration of a BN. Nodes (A–D) represent a set of random variables along with their conditional probabilities, which are given in each table.
Fig. 7. Distribution of published studies, within the last 5 years, that employ ML techniques for cancer prediction.
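The ROC comparison in Fig. 2, and the AUC values reported in the tables that follow, can be made concrete with a small sketch. The code below is illustrative only and is not taken from any of the reviewed studies: the function names `roc_points` and `auc` and the toy labels and scores are assumptions. It sweeps a decision threshold over classifier scores to obtain (false positive rate, true positive rate) points and integrates them with the trapezoidal rule. A classifier that gives every case the same score yields the diagonal of a random guess (AUC = 0.5).

```python
from typing import List, Tuple


def roc_points(labels: List[int], scores: List[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points from sweeping a threshold over the scores.

    labels: 1 for the positive class (e.g. malignant), 0 for the negative class.
    scores: higher score means "more likely positive".
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    # Rank cases from the highest to the lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        # Cases with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    labels = [1, 1, 0, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    print("informative classifier AUC:", auc(roc_points(labels, scores)))
    print("constant-score classifier AUC:", auc(roc_points(labels, [0.5] * 6)))
```

The threshold sweep groups tied scores so that a constant-score classifier produces a single straight segment from (0, 0) to (1, 1) rather than an order-dependent staircase.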
Publications relevant to ML methods used for cancer susceptibility prediction.
| Publication | ML method | Cancer type | No. of patients | Type of data | Accuracy | Validation method | Important features |
|---|---|---|---|---|---|---|---|
| Ayer T et al. | ANN | Breast cancer | 62,219 | Mammographic, demographic | AUC = 0.965 | 10-fold cross validation | Age, mammography findings |
| Waddell M et al. | SVM | Multiple myeloma | 80 | SNPs | 71% | Leave-one-out cross validation | snp739514, snp521522, snp994532 |
| Listgarten J et al. | SVM | Breast cancer | 174 | SNPs | 69% | 20-fold cross validation | snpCY11B2 (+) 4536 T/C; snpCYP1B1 (+) 4328 C/G |
| Stojadinovic et al. | BN | Colon carcinomatosis | 53 | Clinical, pathologic | AUC = 0.71 | Cross-validation | Primary tumor histology, nodal staging, extent of peritoneal cancer |
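The "Validation method" column in these tables refers mostly to k-fold and leave-one-out cross validation. As a rough sketch of what those terms mean (not the procedure of any listed study), the code below splits sample indices into k contiguous folds and averages held-out accuracy over them. Leave-one-out is the special case where k equals the number of samples. The names `k_fold_indices` and `cross_validated_accuracy`, the majority-class baseline, and the toy labels are all assumptions made for illustration. In practice the data would usually be shuffled or stratified before splitting.

```python
from typing import Any, Callable, Iterator, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validated_accuracy(
    xs: Sequence[Any],
    ys: Sequence[Any],
    k: int,
    fit: Callable[[List[Any], List[Any]], Any],
    predict: Callable[[Any, Any], Any],
) -> float:
    """Fraction of samples predicted correctly while held out of training."""
    correct = 0
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct += sum(1 for i in test if predict(model, xs[i]) == ys[i])
    return correct / len(xs)


def fit_majority(train_x: List[Any], train_y: List[Any]) -> Any:
    """Hypothetical baseline: remember the most frequent training label."""
    return max(set(train_y), key=train_y.count)


def predict_majority(model: Any, x: Any) -> Any:
    """Always predict the remembered label, ignoring the features."""
    return model


if __name__ == "__main__":
    # Toy labels, for illustration only (1 = recurrence, 0 = no recurrence).
    ys = [1, 0, 1, 1, 0, 1, 1, 1, 0]
    xs = list(range(len(ys)))
    print("3-fold accuracy:", cross_validated_accuracy(xs, ys, 3, fit_majority, predict_majority))
    print("leave-one-out accuracy:", cross_validated_accuracy(xs, ys, len(ys), fit_majority, predict_majority))
```

A hold-out evaluation, by contrast, corresponds to using a single train/test pair instead of averaging over all folds.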
Publications relevant to ML methods used for cancer recurrence prediction.
| Publication | ML method | Cancer type | No. of patients | Type of data | Accuracy | Validation method | Important features |
|---|---|---|---|---|---|---|---|
| Exarchos K et al. | BN | Oral cancer | 86 | Clinical, imaging tissue genomic, blood genomic | 100% | 10-fold cross validation | Smoker, p53 stain, extra-tumor spreading, TCAM, SOD2 |
| Kim W et al. | SVM | Breast cancer | 679 | Clinical, pathologic, epidemiologic | 89% | Hold-out | Local invasion of tumor |
| Park C et al. | Graph-based SSL algorithm | Colon cancer, breast cancer | 437 | Gene expression, PPIs | 76.7% | 10-fold cross validation | BRCA1, CCND1, STAT1, CCNB1 |
| Tseng C-J et al. | SVM | Cervical cancer | 168 | Clinical, pathologic | 68% | Hold-out | pathologic_S, pathologic_T, cell type RT target summary |
| Eshlaghy A et al. | SVM | Breast cancer | 547 | Clinical, population | 95% | 10-fold cross validation | Age at diagnosis, age at menarche |
Publications relevant to ML methods used for cancer survival prediction.
| Publication | ML method | Cancer type | No. of patients | Type of data | Accuracy | Validation method | Important features |
|---|---|---|---|---|---|---|---|
| Chen Y-C et al. | ANN | Lung cancer | 440 | Clinical, gene expression | 83.5% | Cross validation | Sex, age, T_stage, N_stage |
| Park K et al. | Graph-based SSL algorithm | Breast cancer | 162,500 | SEER | 71% | 5-fold cross validation | Tumor size, age at diagnosis, number of nodes |
| Chang S-W et al. | SVM | Oral cancer | 31 | Clinical, genomic | 75% | Cross validation | Drink, invasion, p63 gene |
| Xu X et al. | SVM | Breast cancer | 295 | Genomic | 97% | Leave-one-out cross validation | 50-gene signature |
| Gevaert O et al. | BN | Breast cancer | 97 | Clinical, microarray | AUC = 0.851 | Hold-Out | Age, angioinvasion, grade |
| Rosado P et al. | SVM | Oral cancer | 69 | Clinical, molecular | 98% | Cross validation | TNM_stage, number of recurrences |
| Delen D et al. | DT | Breast cancer | 200,000 | SEER | 93% | Cross validation | Age at diagnosis, tumor size, number of nodes, histology |
| Kim J et al. | SSL Co-training algorithm | Breast cancer | 162,500 | SEER | 76% | 5-fold cross validation | Age at diagnosis, tumor size, number of nodes, extension of tumor |