Yang Yang1, Li Xu2, Liangdong Sun2, Peng Zhang2, Suzanne S Farid1.
Abstract
Machine learning is an important artificial intelligence technique that is widely applied in cancer diagnosis and detection. More recently, with the rise of personalised and precision medicine, there is a growing trend towards machine learning applications for prognosis prediction. However, to date, building reliable prediction models of cancer outcomes in everyday clinical practice is still a hurdle. In this work, we integrate genomic, clinical and demographic data of lung adenocarcinoma (LUAD) and squamous cell carcinoma (LUSC) patients from The Cancer Genome Atlas (TCGA) and introduce copy number variation (CNV) and mutation information of 15 selected genes to generate predictive models for recurrence and survivability. We compare the accuracy and benefits of three well-established machine learning algorithms: decision tree methods, neural networks and support vector machines. Although the accuracy of predictive models using the decision tree method has no significant advantage, the tree models reveal the most important predictors among genomic information (e.g. KRAS, EGFR, TP53), clinical status (e.g. TNM stage and radiotherapy) and demographics (e.g. age and gender) and how they influence the prediction of recurrence and survivability for both early stage LUAD and LUSC. The machine learning models have the potential to help clinicians to make personalised decisions on aspects such as follow-up timeline and to assist with personalised planning of future social care needs.
Keywords: ANNs, artificial neural networks; ANOVA, analysis of variance; AUC, the area under the ROC curve; CART, classification and regression tree; CNV, copy number variation; DTs, decision trees; Decision tree; FFNN, feedforward neural networks; LS-SVM, least-squares support vector machine; LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; Lung cancer; ML, machine learning; Machine learning; NSCLC, non-small cell lung cancer; Personalized diagnosis and prognosis; ROC, receiver operating characteristic; SVMs, support vector machines; TCGA, The Cancer Genome Atlas; TNM, a common cancer staging system in which T, N and M refer to tumour, node and metastasis
Year: 2022 PMID: 35521553 PMCID: PMC9043969 DOI: 10.1016/j.csbj.2022.03.035
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Fig. 1 Demographic, genomic and clinical profiles of the TCGA dataset for non-small cell lung cancer (418 LUAD and 382 LUSC patients).
Comparison of machine learning methods.
| Method | Advantages | Disadvantages |
|---|---|---|
| Decision trees | Easy to understand; efficient training; can be used for classification or regression; order of training instances has no effect on training; pruning can deal with the problem of overfitting | Classes must be mutually exclusive; final decision tree depends on the order of attribute selection; errors in the training set can result in overly complex decision trees; missing values for an attribute make it unclear which branch to take when that attribute is tested |
| Neural networks | Can be used for classification or regression; able to represent Boolean functions; tolerant of noisy inputs; instances can be classified by more than one output | Difficult to understand structure of algorithm; too many attributes can result in overfitting; optimal network structure can only be determined by experimentation |
| Support vector machines | Models nonlinear class boundaries; overfitting is unlikely to occur; computational complexity reduced to a quadratic optimization problem; easy to control complexity of decision rule and frequency of error | Training is slow compared to decision trees; difficult to determine optimal parameters when training data is not linearly separable; difficult to understand structure of algorithm |
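The remark that the final tree depends on the order of attribute selection can be made concrete with the impurity measure that CART-style trees commonly use to rank candidate splits. The following is a minimal sketch, assuming Gini impurity as the split criterion (the record does not state which criterion the authors used); the function names and the node counts are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def gini(counts: Sequence[int]) -> float:
    """Gini impurity of a node, given its per-class record counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_gain(parent: Sequence[int], children: Sequence[Sequence[int]]) -> float:
    """Impurity decrease obtained by splitting `parent` into `children`."""
    n = sum(parent)
    weighted = sum(sum(child) / n * gini(child) for child in children)
    return gini(parent) - weighted


# Hypothetical node: 5 high-risk and 5 low-risk records,
# split by a hypothetical binary attribute into two child nodes.
parent = [5, 5]
split = [[4, 1], [1, 4]]
print(round(gini(parent), 3), round(gini_gain(parent, split), 3))
```

A tree builder of this kind would evaluate `gini_gain` for every candidate attribute at a node and split on the one with the largest gain, which is why small changes in the training data can change the attribute chosen near the root.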
Fig. 2 The p-value matrix for all ANOVA tests for the two subtypes of NSCLC: (a) LUAD and (b) LUSC. cnv = copy number variation, mut = mutation.
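Each entry of such a matrix comes from one ANOVA test. As a rough illustration of what a single test computes, the sketch below derives the one-way ANOVA F statistic from first principles. The function name and the toy groups are hypothetical, and the exact test configuration used by the authors is not given in this record.

```python
from typing import Sequence


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> float:
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy data only: three hypothetical groups of three observations each.
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat)
```

The p-value is the upper tail of the F distribution with (k - 1, n - k) degrees of freedom evaluated at this statistic; in practice a library routine such as `scipy.stats.f_oneway` returns both values directly.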
Significant demographic, clinical and genomic factors for recurrence and overall survival of non-small cell lung cancer.
| Category | Factor | Level | LUSC recurrence: Yes | P | Sig. | LUSC overall survival: ≥3y | P | Sig. | LUAD recurrence: Yes | P | Sig. | LUAD overall survival: ≥3y | P | Sig. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Demographic | Age | <65 | 20.5 | 0.830 | | 27.6 | 0.450 | | 26.9 | 0.642 | | 12.8 | 0.640 | |
| | | ≥65 | 21.5 | | | 31.4 | | | 29.1 | | | 18.2 | | |
| | Gender | Female | 16.5 | 0.199 | | 41.0 | 0.029 | * | 29.6 | 0.486 | | 14.9 | 0.580 | |
| | | Male | 23.0 | | | 27.5 | | | 26.3 | | | 16.7 | | |
| | Race | Other | 30.4 | 0.034 | * | 53.8 | 8E-05 | *** | 32.9 | 0.331 | | 20.0 | 0.222 | |
| | | White | 18.7 | | | 26.1 | | | 27.0 | | | 14.8 | | |
| Clinical | Cancer stage | I | 15.9 | 0.001 | ** | 37.2 | 0.140 | | 13.2 | 4E-15 | *** | 5.8 | 0.137 | |
| | | II | 20.2 | | | 22.9 | | | 38.8 | | | 25.0 | | |
| | | III | 40.9 | | | 36.4 | | | 46.2 | | | 28.1 | | |
| | | IV | 33.3 | | | 0.00 | | | 78.9 | | | 72.7 | | |
| | M stage | M0 | 20.4 | 0.563 | | 35.9 | 3E-04 | *** | 27.2 | 0.523 | | 13.1 | 0.790 | |
| | | M1 | 33.3 | | | 0.00 | | | 78.9 | | | 72.7 | | |
| | | Mx | 23.3 | | | 9.10 | | | 21.7 | | | 24.3 | | |
| | T stage | T1 | 13.2 | 0.025 | * | 34.2 | 0.027 | * | 17.3 | 1E-05 | *** | 10.0 | 0.278 | |
| | | T2 | 22.9 | | | 35.7 | | | 31.0 | | | 16.0 | | |
| | | T3 | 21.2 | | | 11.9 | | | 41.4 | | | 32.0 | | |
| | | T4 | 50.0 | | | 16.7 | | | 54.5 | | | 20.0 | | |
| | N stage | N0 | 16.7 | 0.008 | ** | 29.6 | 0.117 | | 18.8 | 8E-06 | *** | 10.1 | 0.788 | |
| | | N1 | 25.9 | | | 30.3 | | | 45.3 | | | 26.9 | | |
| | | N2 | 47.8 | | | 57.1 | | | 46.7 | | | 29.6 | | |
| | | N3 | ---- | | | ---- | | | 50.0 | | | 28.1 | | |
| | | Nx | 0.00 | | | 50.0 | | | ---- | | | 72.7 | | |
| | Radiotherapy | No | 19.6 | 0.044 | * | 31.5 | 0.845 | | 23.3 | 1E-07 | *** | 13.3 | 0.485 | |
| | | Yes | 34.3 | | | 29.6 | | | 59.2 | | | 37.9 | | |
| Genomic | EGFR | A | 18.1 | 0.143 | | 40.0 | 0.012 | * | 30.0 | 0.496 | | 16.1 | 0.199 | |
| | | D | 21.6 | | | 9.10 | | | 21.4 | | | 8.3 | | |
| | | N | 25.2 | | | 26.0 | | | 26.8 | | | 16.8 | | |
| | KRAS | A | 23.8 | 0.231 | | 30.4 | 0.398 | | 30.5 | 0.657 | | 16.0 | 0.026 | * |
| | | D | 20.8 | | | 10.5 | | | 24.7 | | | 13.8 | | |
| | | N | 18.1 | | | 35.5 | | | 27.9 | | | 16.4 | | |
| | NF1 | No | 18.6 | 0.001 | ** | 29.2 | 0.024 | * | 27.2 | 0.297 | | 14.7 | 0.124 | |
| | | Yes | 41.7 | | | 50.0 | | | 34.9 | | | 23.5 | | |
| | ERBB2 | No | 20.1 | 0.004 | ** | 31.0 | 0.418 | | 28.5 | 0.321 | | 16.1 | 0.267 | |
| | | Yes | 62.5 | | | 50.0 | | | 12.5 | | | 0.0 | | |
| | STK11 | No | 21.1 | 0.851 | | 31.3 | 0.939 | | 26.2 | 0.017 | * | 14.8 | 0.080 | |
| | | Yes | 25.0 | | | 33.3 | | | 43.9 | | | 25.0 | | |
| | TP53 | No | 22.2 | 0.774 | | 19.2 | 0.007 | ** | 23.1 | 0.013 | * | 13.6 | 0.982 | |
| | | Yes | 20.8 | | | 36.0 | | | 34.8 | | | 19.1 | | |
| | KEAP1 | No | 20.6 | 0.440 | | 30.7 | 0.570 | | 26.1 | 0.043 | * | 15.5 | 0.160 | |
| | | Yes | 26.7 | | | 37.5 | | | 39.0 | | | 17.1 | | |
| | SMARCA4 | No | 20.6 | 0.313 | | 30.5 | 0.14 | | 26.7 | 0.037 | * | 15.4 | 0.594 | |
| | | Yes | 31.2 | | | 38.5 | | | 44.8 | | | 21.1 | | |
Note: P is the ANOVA p-value indicating the statistical significance of each factor. Sig. denotes the significance level of the p-value: 0.01 < p < 0.05 (*), 0.001 < p < 0.01 (**), p < 0.001 (***).
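The star notation in the note can be expressed as a small lookup. This is a sketch only: the function name is hypothetical, and the handling of values that fall exactly on a threshold (which the note's strict inequalities leave undefined) is an assumption, chosen here so that a rounded p-value of 0.001 maps to `**` as it does in the table.

```python
def significance_stars(p: float) -> str:
    """Map a p-value to the star notation used in the table note."""
    if p < 0.001:
        return "***"
    if p < 0.01:   # assumption: p == 0.001 falls in the ** band
        return "**"
    if p < 0.05:   # assumption: p == 0.01 falls in the * band
        return "*"
    return ""      # not significant at the 0.05 level


# p-values taken from the table above.
for p in (0.830, 0.029, 0.004, 8e-05):
    print(p, repr(significance_stars(p)))
```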
Summary of machine learning training datasets used for recurrence risk for LUAD and LUSC.
| Class label | Description | No. of records (LUSC) | No. of records (LUAD) |
|---|---|---|---|
| High risk | tumour recurrence after initial resection treatment | 49 | 64 |
| Low risk | no tumour recurrence after initial resection treatment | 227 | 231 |
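The counts above show that both recurrence datasets are imbalanced towards the low-risk class, which matters when judging classifier accuracy. The sketch below computes the high-risk fraction, the accuracy of a trivial majority-class predictor, and inverse-frequency class weights. The weighting formula is a common heuristic and an assumption here; the record does not say whether or how the authors handled imbalance, and the function and key names are hypothetical.

```python
from typing import Dict

# Record counts copied from the recurrence dataset table.
counts = {
    "LUSC": {"high": 49, "low": 227},
    "LUAD": {"high": 64, "low": 231},
}


def class_summary(high: int, low: int) -> Dict[str, object]:
    """Summarise class balance for a two-class dataset."""
    total = high + low
    return {
        "total": total,
        "high_risk_fraction": high / total,
        # Accuracy obtained by always predicting the larger class.
        "majority_baseline_accuracy": max(high, low) / total,
        # Inverse-frequency heuristic: n_samples / (n_classes * class_count).
        "balanced_weights": {"high": total / (2 * high), "low": total / (2 * low)},
    }


summary = {name: class_summary(**c) for name, c in counts.items()}
for name, s in summary.items():
    print(name, s["total"], round(s["high_risk_fraction"], 3),
          round(s["majority_baseline_accuracy"], 3))
```

Because a predictor that always answers "low risk" already scores above 75% accuracy on both datasets, threshold-independent measures such as the area under the ROC curve are more informative than raw accuracy for comparing models on this task.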
Fig. 3 Receiver operating characteristic (ROC) curves comparing NSCLC recurrence risk models built with different machine learning algorithms for (a) LUAD and (b) LUSC recurrence risk prediction. Decision tree (CART) models for (c) LUAD and (d) LUSC recurrence risk prediction.
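ROC curves are usually summarised by the area under the curve (AUC, listed among the keywords). As a minimal sketch, the AUC can be computed without plotting as the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the toy labels and scores are hypothetical and are not results from the paper.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Count a win as 1, a tie as 0.5, a loss as 0 over all positive/negative pairs.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = high risk, 0 = low risk; scores are hypothetical model outputs.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
```

The pairwise form is quadratic in the number of records, which is acceptable for datasets of a few hundred patients; a rank-based implementation would be preferable for larger cohorts.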
Summary of machine learning training datasets used for survivability for LUAD and LUSC.
| Class label | Description | No. of records (LUSC) | No. of records (LUAD) |
|---|---|---|---|
| Good | overall survival ≥ 3 years after initial resection treatment | 75 | 68 |
| Poor | overall survival < 3 years after initial resection treatment | 167 | 181 |
Fig. 4 Receiver operating characteristic (ROC) curves comparing NSCLC survivability models built with different machine learning algorithms for (a) LUAD and (b) LUSC survivability prediction. Decision tree (CART) models for (c) LUAD and (d) LUSC survivability prediction.