María Gabriela Valdés, Iván Galván-Femenía, Vicent Ribas Ripoll, Xavier Duran, Jun Yokota, Ricard Gavaldà, Xavier Rafael-Palou, Rafael de Cid.
Abstract
BACKGROUND: During the last decade, interest in applying machine learning algorithms to genomic data has increased in many bioinformatics applications. Analyzing this type of data entails several difficulties: managing high-dimensional data, handling class imbalance during knowledge extraction, identifying important features, and classifying individuals. In this study, we propose a general framework to tackle these challenges with different machine learning algorithms and techniques. We apply the configuration of this framework to lung cancer patients, identifying genetic signatures for classifying response to drug treatment. We intersect these relevant SNPs with the GWAS Catalog of the National Human Genome Research Institute and explore the RegulomeDB and GTEx databases for functional analysis purposes.
Keywords: Classification; Feature selection; GWAS; Lung cancer; Machine learning
Year: 2018 PMID: 30458782 PMCID: PMC6245589 DOI: 10.1186/s12918-018-0615-5
Source DB: PubMed Journal: BMC Syst Biol ISSN: 1752-0509
Relevant clinical and socio-demographic variables in the ML-based analysis
| Variable | BREC (all patients) | | Disease progression: No | | Disease progression: Yes | | |
|---|---|---|---|---|---|---|---|
| | N | % | N | % | N | % | |
| Gender | |||||||
| Male (1) | 139 | 78 | 104 | 76 | 35 | 85 | |
| Female (2) | 39 | 22 | 33 | 24 | 6 | 15 | |
| Smoker | |||||||
| Yes (1) | 167 | 94 | 126 | 92 | 41 | 100 | |
| No (2) | 10 | 6 | 10 | 7 | 0 | 0 | |
| NA | 1 | 0 | 1 | 1 | 0 | 0 | |
| ECOG | |||||||
| 0 | 59 | 33 | 45 | 33 | 14 | 34 | |
| 1 | 114 | 64 | 88 | 65 | 26 | 64 | |
| 2 | 2 | 1 | 2 | 1 | 0 | 0 | |
| NA | 3 | 2 | 2 | 1 | 1 | 2 | |
| Histology | |||||||
| ADCA (1) | 99 | 56 | 83 | 61 | 16 | 39 | |
| SCC (2) | 64 | 36 | 44 | 32 | 20 | 49 | |
| LCC (3) | 6 | 3 | 6 | 4 | 0 | 0 | |
| Others (4) | 9 | 5 | 4 | 3 | 5 | 12 | |
| Treatment | |||||||
| doce/cis (1) | 123 | 69 | 93 | 68 | 30 | 73 | |
| gemci/cis (2) | 44 | 25 | 36 | 26 | 8 | 20 | |
| doce (3) | 11 | 6 | 8 | 6 | 3 | 7 | |
| Arm | |||||||
| Control | 95 | 53 | 72 | 53 | 23 | 56 | |
| Biomarker-directed | 83 | 47 | 65 | 47 | 18 | 44 | |
| RECIST | |||||||
| PD (1) | 41 | 23 | |||||
| SD (0) | 56 | 31 | |||||
| PR (0) | 58 | 32 | |||||
| CR (0) | 23 | 14 | |||||
Fig. 1 Extended Pipeline Configuration
Advantages and disadvantages of types of feature selection methods used in the pipeline configuration
| FS method | Advantages | Disadvantages |
|---|---|---|
| Filter | They scale easily to very high-dimensional data sets. | They do not interact with the classification algorithm. |
| | They are computationally fast and simple. | Most of these methods are univariate; that is, they consider features independently or only with regard to the target feature, thereby ignoring feature dependencies. |
| | They are independent of the classification algorithm used in the subsequent model construction. | |
| Wrapper | They include the interaction between the feature subset search and the classification algorithm that is “wrapped”. | They have a higher risk of overfitting, depending on how exhaustive the feature subset search is. |
| | They take into account feature dependencies. | They are very computationally intensive, especially if the “wrapped” classifier has a high computational cost. |
| Embedded | They include the interaction between the feature subset search and the final classification model constructed. | They depend on the specific learning method of the final model constructed. |
| | They take into account feature dependencies. | |
| | They are computationally faster than wrapper methods. | |
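
The univariate behaviour of filter methods described in the table can be illustrated with a minimal sketch. It scores each feature independently with a one-way ANOVA F statistic and keeps the best ones. The function names (`anova_f_score`, `filter_top_features`) and the toy genotype matrix are hypothetical and are not taken from the paper's implementation.

```python
from statistics import mean


def anova_f_score(values, labels):
    """One-way ANOVA F statistic of a single feature across class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = mean(values)
    # Between-group variability: distance of each class mean from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Within-group variability: spread of samples around their own class mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def filter_top_features(X, y, n_keep):
    """Score every feature on its own and keep the n_keep highest-scoring ones."""
    n_features = len(X[0])
    scores = [anova_f_score([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_keep]), scores


# Toy data (illustrative only): rows are patients, columns are SNPs coded 0/1/2.
X = [
    [0, 1, 2],
    [0, 2, 2],
    [0, 1, 1],
    [2, 1, 2],
    [2, 2, 1],
    [1, 1, 2],
]
y = [0, 0, 0, 1, 1, 1]

selected, scores = filter_top_features(X, y, n_keep=1)
print(selected, scores)
```

Because each feature is scored without reference to the others or to the downstream classifier, the cost grows linearly with the number of features, which is why this family scales to genome-wide data but cannot detect feature dependencies.
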
Advantages and disadvantages of classification methods chosen for the pipeline configuration
| Classification method | Advantages | Disadvantages |
|---|---|---|
| Linear SVM | By introducing the kernel, SVMs gain flexibility in the choice of the form of the threshold separating samples from different classes, which need not be linear and need not even have the same functional form for all data, since its function is non-parametric and operates locally. | The results lack transparency. |
| | Since the kernel implicitly contains a non-linear transformation, no assumptions are necessary about the functional form of the transformation that makes the data linearly separable. | The SVM moves the problem of over-fitting from optimizing the parameters to model selection. |
| | SVMs provide good out-of-sample generalization if the parameters (C, for example) are appropriately chosen. This means that, by choosing an appropriate generalization grade, SVMs can be robust even when the training sample has some bias. | |
| | SVMs deliver a unique solution, since the optimization problem is convex. | |
| RF | It decides the final classification by voting, decreasing the variance of the model without increasing the bias. | It is hard to visualize the model or understand why it predicted something, as compared to a single decision tree. |
| | It uses a random subset of features at each node of the decision trees to identify the best split among this subset, and the subsets are different at each node. This prevents the most powerful features from being selected too frequently in each tree, which would make the trees more correlated with each other. | A large number of trees may make the algorithm slow for real-time prediction. |
| | It is fast even on large data-sets. | RFs have been observed to over-fit for some data-sets with noisy classification/regression tasks. |
| | It gives estimates of which variables are important in the classification. | |
| KNN | The cost of the learning process is zero. | The algorithm must compute the distance and sort all the training data at each prediction, which can be slow if there are a large number of training examples. |
| | No assumptions about the characteristics of the concepts to learn have to be made. | The algorithm does not learn anything from the training data, which can result in the algorithm not generalizing well and also not being robust to noisy data. |
| | Complex concepts can be learned by local approximation using simple procedures. | Changing |
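
The KNN trade-off listed in the table (no training cost, but a full scan and sort of the training data at every prediction) can be seen in a minimal sketch. The function `knn_predict` and the toy points are hypothetical, and Euclidean distance is an assumption; the source does not state which metric was used.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Predict the majority label among the k training points nearest to query."""
    # Nothing is fitted in advance: each call measures and sorts all training points.
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy data (illustrative only): two well-separated clusters.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]

near_origin = knn_predict(train_X, train_y, (0.5, 0.5), k=3)
near_far_cluster = knn_predict(train_X, train_y, (5.5, 5.5), k=3)
print(near_origin, near_far_cluster)  # prints "0 1"
```
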
Fig. 2 Initial steps of “General Framework”
Fig. 3 Main loop of “General Framework” where the “Partial Analysis” is executed for each chromosome in the genome and results are finally merged in the “Final Analysis”
Fig. 4 Output of “General Framework” corresponding to each of the 36 pipeline configurations
Parameters tested using grid-search and 5-fold CV. EFD refers to the “Extended Framework Design”
| Pipeline step | Parameter options |
|---|---|
| ANOVA | EFD (Partial analysis): percentile = 2% of total # of variables |
| | EFD (Final analysis): percentile = 10% of total # of variables |
| RFE-LR | LR: penalty = 'l1', C = 1 |
| | RFE, EFD (Partial analysis): n_features_to_select = 2% of total # of variables, step = 4% |
| | RFE, EFD (Final analysis): n_features_to_select = 10% of total # of variables, step = 10% |
| RLR-L1 | penalty = 'l1' |
| | EFD (Partial analysis): C = [100, 500, 1000, 1500, 5000, 10000] |
| | EFD (Final analysis): C = [100, 500, 1000, 1500, 5000, 10000] |
| | threshold = 1 |
| Linear SVM | C = [0.001, 0.01, 0.1, 1, 10, 100, 1000] |
| RF | n_estimators = [30, 47, 75, 119, 189, 299, 475, 753, 1194, 1892, 2999] |
| KNN | n_neighbors = [5, 20, 35, 50] |
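
As a rough sketch of how the search space in this table unfolds, the snippet below enumerates pipeline configurations and counts the model fits implied by a 5-fold grid search over the classifier parameters. The four sampling labels are taken from the result tables; treating the configurations as the full cross-product of three feature selectors, four sampling strategies and three classifiers is an inference that happens to yield the 36 configurations mentioned in the figure captions. The helper `expand_grid` is hypothetical, and the RLR-L1 `C` grid is left out for brevity.

```python
from itertools import product

feature_selectors = ["ANOVA", "RFE-LR", "RLR-L1"]
# Sampling labels as they appear in the result tables (assumed to be exhaustive).
samplers = ["No sampling", "Up-sampling", "Down-sampling", "SMOTE-sampling"]
classifier_grids = {
    "Linear SVM": {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
    "RF": {"n_estimators": [30, 47, 75, 119, 189, 299, 475, 753, 1194, 1892, 2999]},
    "KNN": {"n_neighbors": [5, 20, 35, 50]},
}
N_FOLDS = 5  # 5-fold cross-validation, as stated in the table caption


def expand_grid(grid):
    """Return every parameter combination of a {name: [values]} grid."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


# One configuration = (feature selection, sampling, classifier).
configurations = list(product(feature_selectors, samplers, classifier_grids))

# Fits per configuration = classifier parameter candidates x CV folds.
fits_per_classifier = {
    name: len(expand_grid(grid)) * N_FOLDS for name, grid in classifier_grids.items()
}

print(len(configurations))
print(fits_per_classifier)
```
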
LC related traits from the GWAS Catalog v1.0 (release date: 2017-07-31)
| LC related traits |
|---|
| Pulmonary function |
| Lung adenocarcinoma |
| Lung cancer |
| Lung cancer (DNA repair capacity) |
| Lung cancer (smoking interaction) |
| Non-small cell lung cancer |
| Non-small cell lung cancer (recurrence rate) |
| Non-small cell lung cancer (survival) |
| Response to platinum-based agents |
| Response to platinum-based chemotherapy (carboplatin) |
| Response to platinum-based chemotherapy (cisplatin) |
| Response to platinum-based chemotherapy in non-small-cell lung cancer |
| Adverse response to chemotherapy (neutropenia/leucopenia) (cisplatin) |
1/2 Cancer related traits from the GWAS Catalog v1.0 (release date: 2017-07-31)
| Cancer related traits |
|---|
| Adverse response to chemotherapy (neutropenia/leucopenia) (cisplatin) |
| Adverse response to chemotherapy in breast cancer (alopecia) |
| Adverse response to chemotherapy in breast cancer (alopecia) (anti-microtubule) |
| Adverse response to chemotherapy in breast cancer (alopecia) (cyclophosphamide+doxorubicin+/-5FU) |
| Adverse response to chemotherapy in breast cancer (alopecia) (cyclophosphamide+epirubicin+/-5FU) |
| Adverse response to chemotherapy in breast cancer (alopecia) (docetaxel) |
| Adverse response to chemotherapy in breast cancer (alopecia) (paclitaxel) |
| Anthracycline-induced cardiotoxicity in childhood cancer |
| Bladder cancer |
| Bladder cancer (smoking interaction) |
| Body mass index (change over time) in cancer |
| Body mass index (change over time) in cancer or chronic obstructive pulmonary disease |
| Body mass index (change over time) in gastrointestinal cancer |
| Body mass index (change over time) in gastrointestinal cancer or chronic obstructive pulmonary disease |
| Body mass index (change over time) in lung cancer |
| Body mass index (change over time) in lung cancer or chronic obstructive pulmonary disease |
| Breast cancer |
| Breast cancer (early onset) |
| Breast cancer (estrogen-receptor negative |
| Breast cancer (estrogen-receptor negative) |
| Breast cancer (estrogen-receptor positive) |
| Breast cancer (male) |
| Breast cancer (menopausal hormone therapy interaction) |
| Breast cancer (prognosis) |
| Breast cancer (survival) |
| Breast Cancer in BRCA1 mutation carriers |
| Breast cancer in BRCA2 mutation carriers |
| Breast cancer-free interval (treatment with aromatase inhibitor) |
| Cancer |
| Cancer (pleiotropy) |
| Cardia gastric cancer |
| Cervical cancer |
| Colon cancer |
| Colorectal cancer |
| Colorectal cancer (alcohol consumption interaction) |
| Colorectal cancer (aspirin and/or NSAID use interaction) |
| Colorectal cancer (calcium intake interaction) |
| Colorectal cancer (diet interaction) |
| Colorectal cancer (interaction) |
| Colorectal cancer (oestrogen-progestogen hormone therapy interaction) |
| Colorectal or endometrial cancer |
| Disease-free survival in breast cancer |
| Docetaxel-induced peripheral neuropathy in metastatic castrate-resistant prostate cancer |
| Endometrial cancer |
| Epithelial ovarian cancer |
| Erectile dysfunction and prostate cancer treatment |
| Esophageal cancer |
| Esophageal cancer (alcohol interaction) |
| Esophageal cancer (squamous cell) |
| Esophageal cancer and gastric cancer |
| Esophageal squamous cell cancer (length of survival) |
| Estradiol plasma levels (breast cancer) |
| Estrogen receptor status in breast cancer |
| Estrogen receptor status in HER2 negative breast cancer |
| Estrone conjugates/estrone ratio in resected early stage estrogen-receptor positive breast cancer |
| Estrone/androstenedione ratio in resected early stage-receptor positive breast cancer |
| Gallbladder cancer |
| Gastric cancer |
| Lobular breast cancer (menopausal hormone therapy interaction) |
| Lung adenocarcinoma |
| Lung cancer |
| Lung cancer (asbestos exposure interaction) |
| Lung cancer (DNA repair capacity) |
| Lung cancer (smoking interaction) |
| Multiple cancers (lung cancer |
| Multiple keratinocyte cancers |
| Non-cardia gastric cancer |
2/2 Cancer related traits from the GWAS Catalog v1.0 (release date: 2017-07-31)
| Cancer related traits |
|---|
| Non-melanoma skin cancer |
| Non-small cell lung cancer |
| Non-small cell lung cancer (recurrence rate) |
| Non-small cell lung cancer (survival) |
| Obesity in adult survivors of childhood cancer exposed to cranial radiation |
| Obesity in adult survivors of childhood cancer not exposed to cranial radiation |
| Oral cavity and pharyngeal cancer |
| Oral cavity cancer |
| Oropharynx cancer |
| Ovarian cancer |
| Ovarian cancer in BRCA1 mutation carriers |
| Pancreatic cancer |
| Plasma androstenedione levels in resected early stage-receptor positive breast cancer |
| Plasma estrone conjugates levels in resected early stage estrogen-receptor positive breast cancer |
| Plasma estrone levels in resected estrogen-receptor positive breast cancer |
| Platinum-induced myelosuppression in non-small cell lung cancer |
| Progression free survival in metastatic colorectal cancer (CAPOX-B vs CAPOX-B plus cetuximab) |
| Progression free survival in metastatic colorectal cancer (treatment interaction) |
| Prostate cancer |
| Prostate cancer (early onset) |
| Prostate cancer (interaction) |
| Prostate cancer (survival) |
| Prostate cancer aggressiveness |
| Pulmonary function |
| Response to carboplatin and paclitaxel in ovarian cancer (Caspase 3/7 EC50) |
| Response to carboplatin and paclitaxel in ovarian cancer (MTT IC50) |
| Response to carboplatin in ovarian cancer (MTT IC50) |
| Response to chemotherapy in breast cancer (hypertension) (bevacizumab) |
| Response to chemotherapy in breast cancer hypertensive cases (cumulative dose) (bevacizumab) |
| Response to gemcitabine in pancreatic cancer |
| Response to irinotecan and platinum-based chemotherapy in non-small-cell lung cancer |
| Response to irinotecan in non-small-cell lung cancer |
| Response to paclitaxel in ovarian cancer (Caspase 3/7 EC50) |
| Response to paclitaxel in ovarian cancer (MTT IC50) |
| Response to Pazopanib in cancer (hepatotoxicity) |
| Response to platinum-based agents |
| Response to platinum-based chemotherapy (carboplatin) |
| Response to platinum-based chemotherapy (cisplatin) |
| Response to platinum-based chemotherapy in non-small-cell lung cancer |
| Response to platinum-based neoadjuvant chemotherapy in cervical cancer |
| Response to radiotherapy in cancer (late toxicity) |
| Response to radiotherapy in prostate cancer (overall toxicity) |
| Response to radiotherapy in prostate cancer (toxicity |
| Response to radiotherapy in prostate cancer (toxicity |
| Response to radiotherapy in prostate cancer (toxicity |
| Response to radiotherapy in prostate cancer (toxicity) |
| Response to tamoxifen in breast cancer |
| Small-cell lung cancer (survival) |
| Survival in colon cancer |
| Survival in colorectal cancer |
| Survival in colorectal cancer (distant metastatic) |
| Survival in colorectal cancer (non-distant metastatic) |
| Survival in endocrine treated breast cancer (estrogen-receptor positive) |
| Survival in head and neck cancer |
| Survival in microsatellite instability low/stable colorectal cancer |
| Survival in rectal cancer |
| Testicular cancer |
| Testicular germ cell cancer |
| Thyroid cancer |
| Thyroid cancer (Papillary |
| Urinary bladder cancer |
| Urinary symptoms in response to radiotherapy in prostate cancer |
Fig. 5 Parameter sensitivity analysis of top 2 pipeline configurations with the highest CV F1 score obtained during model selection
Fig. 6 Parameter sensitivity analysis of pipeline configurations in third and fourth positions with the highest CV F1 score obtained during model selection
Fig. 7 Parameter sensitivity analysis of pipeline configuration in fifth position with the highest CV F1 score obtained during model selection
Model selection and evaluation metrics (general and per class) of top 5 models from 36 possible instantiations of pipeline using LC data-set
| Rank | FS | Sampling | Classifier | CV F1 (mean ± std) | CV Precision (mean ± std) | CV Recall (mean ± std) | Train F1 | Test F1 | Test Precision | Test Recall | Test F1 (0) | Test Precision (0) | Test Recall (0) | Test F1 (1) | Test Precision (1) | Test Recall (1) | Model parameters |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | RFE-LR | Up-sampling | RF | 0.72 ± 0.054 | 0.686 ± 0.102 | 0.79 ± 0.039 | 1 | 0.722 | 0.778 | 0.729 | 0.871 | 0.964 | 0.794 | 0.2 | 0.125 | 0.5 | n_estimators = 30 |
| 2 | RLR-L1 | SMOTE-sampling | KNN | 0.712 ± 0.087 | 0.68 ± 0.122 | 0.762 ± 0.066 | 0.777 | 0.741 | 0.806 | 0.844 | 0.889 | 1 | 0.8 | 0.222 | 0.125 | 1 | n_neighbors = 5, C = 100 |
| 3 | ANOVA | No sampling | RF | 0.698 ± 0.077 | 0.651 ± 0.12 | 0.776 ± 0.061 | 1 | 0.652 | 0.722 | 0.595 | 0.839 | 0.929 | 0.765 | 0 | 0 | 0 | n_estimators = 30 |
| 4 | RFE-LR | SMOTE-sampling | RF | 0.689 ± 0.077 | 0.648 ± 0.119 | 0.761 ± 0.071 | 1 | 0.681 | 0.778 | 0.605 | 0.875 | 1 | 0.778 | 0 | 0 | 0 | n_estimators = 30 |
| 5 | ANOVA | No sampling | Linear SVM | 0.687 ± 0.113 | 0.687 ± 0.136 | 0.707 ± 0.112 | 1 | 0.811 | 0.833 | 0.823 | 0.9 | 0.964 | 0.844 | 0.5 | 0.375 | 0.75 | C = 0.1 |
They are ordered by CV F1. FS stands for feature selection and CV for cross-validation. F1 is the model evaluation measure defined as 2 × Precision × Recall / (Precision + Recall). Precision is the proportion of examples classified as positive that are truly positive, and Recall is the proportion of truly positive examples that are classified as positive. Std stands for standard deviation. Train indicates that the metric was computed on the training set, and Test that it was computed on the test set. (0) marks an evaluation metric for class 0 and (1) a metric for class 1.
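
As a quick consistency check, the standard F1 score (the harmonic mean of precision and recall, 2 × P × R / (P + R)) can be recomputed from the per-class precision and recall values in the table. The sketch below is illustrative only; the function name `f1_score` and the choice of rows are ours, and returning 0 when both inputs are 0 is a common convention rather than something stated in the source.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, F1 reported in the table) for selected per-class test entries.
table_entries = [
    (0.964, 0.794, 0.871),  # model 1, class 0
    (0.125, 0.5, 0.2),      # model 1, class 1
    (0.125, 1.0, 0.222),    # model 2, class 1
    (0.0, 0.0, 0.0),        # model 3, class 1
    (0.964, 0.844, 0.9),    # model 5, class 0
    (0.375, 0.75, 0.5),     # model 5, class 1
]

recomputed = [round(f1_score(p, r), 3) for p, r, _ in table_entries]
for (p, r, reported), value in zip(table_entries, recomputed):
    print(f"P={p} R={r} recomputed F1={value} reported F1={reported}")
```

The recomputed values agree with the reported ones to three decimals, which holds only with the factor of 2 in the numerator.
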
Fig. 8 CV F1 mean scores with their corresponding standard deviations for all 36 pipeline instantiations using LC data-set
Fig. 9 Confusion matrix of LC test data-set using first pipeline model: RFE-LR + Up-sampling + RF (left) and fifth pipeline model: ANOVA + No sampling + Linear SVM (right)
Results of analysis of intersection of relevant SNPs given by the ML models, with GWAS Catalog records associated with LC and Cancer
| Pipeline | # of features | ML Rank cat ALL | ML Rank cat LUNG | ML Rank cat CANCER |
|---|---|---|---|---|
| RFE-LR + Up-sampling + RF | 257 | 0 | 0 | 0 |
| RLR-L1 + SMOTE-sampling + KNN | 13 | 0 | 0 | 0 |
| ANOVA + No sampling + RF | 144 | 0 | 0 | 0 |
| RFE-LR + SMOTE-sampling + RF | 238 | 1 | 0 | 0 |
| ANOVA + No sampling + Linear SVM | 193 | 0 | 0 | 0 |
| ANOVA + Up-sampling + Linear SVM | 193 | 0 | 0 | 0 |
| ANOVA + SMOTE-sampling + Linear SVM | 193 | 0 | 0 | 0 |
| RLR-L1 + SMOTE-sampling + RF | 3 | 0 | 0 | 0 |
| ANOVA + No sampling + KNN | 95 (a) | 0 | 0 | 0 |
| RFE-LR + No sampling + RF | 305 | 0 | 0 | 0 |
| RFE-LR + No sampling + KNN | 148 (b) | 2 | 0 | 0 |
| RLR-L1 + No sampling + KNN | 17 | 0 | 0 | 0 |
| RLR-L1 + Up-sampling + KNN | 16 | 0 | 0 | 0 |
| RFE-LR + Down-sampling + KNN | 148 (b) | 2 | 0 | 0 |
| RFE-LR + No sampling + Linear SVM | 148 (b) | 2 | 0 | 0 |
| RFE-LR + Up-sampling + Linear SVM | 148 (b) | 2 | 0 | 0 |
| RFE-LR + SMOTE-sampling + Linear SVM | 148 (b) | 2 | 0 | 0 |
| ANOVA + SMOTE-sampling + RF | 193 | 0 | 0 | 0 |
| RLR-L1 + No sampling + RF | 17 | 0 | 0 | 0 |
| ANOVA + Up-sampling + RF | 193 | 0 | 0 | 0 |
(a) Corresponds to 5% of the top features selected by the ANOVA feature selection method. (b) Corresponds to 0.1% of the top features selected by the RFE-LR feature selection method.
Intersection of relevant features from top 20 pipeline models that coincide with the same configuration of FS + Classifier
| FS + Classifier | # of relevant features selected by pipelines | # of features that match |
|---|---|---|
| ANOVA + LINEAR SVM | 193 / 193 / 193 | 193 |
| ANOVA + RF | 144 / 193 / 193 | 144 |
| ANOVA + KNN | 95 | N/A |
| RFE-LR + LINEAR SVM | 148 / 148 / 148 | 148 |
| RFE-LR + RF | 257 / 238 / 305 | 3 |
| RFE-LR + KNN | 148 / 148 | 148 |
| RLR-L1 + RF | 3 / 7 | 3 |
| RLR-L1 + KNN | 13 / 17 / 16 | 12 |