Damiano Verda, Stefano Parodi, Enrico Ferrari, Marco Muselli.
Abstract
BACKGROUND: Logic Learning Machine (LLM) is an innovative method of supervised analysis capable of constructing models based on simple and intelligible rules. In this investigation, the performance of LLM in classifying patients with cancer was evaluated using a set of eight publicly available gene expression databases for cancer diagnosis. LLM accuracy was assessed by summary ROC curve (sROC) analysis and estimated by the area under the sROC curve (sAUC). Its performance was compared in cross-validation with that of standard supervised methods, namely decision tree, artificial neural network, support vector machine (SVM) and k-nearest neighbor classifier.
Keywords: Cancer; Decision tree; Diagnosis; Gene expression; K-nearest neighbor classifier; Logic learning machine; Microarrays; Neural network; Prognosis; Support vector machine
Year: 2019; PMID: 31757200; PMCID: PMC6873393; DOI: 10.1186/s12859-019-2953-8
Source DB: PubMed; Journal: BMC Bioinformatics; ISSN: 1471-2105; Impact factor: 3.169
Microarray data sets excluded from the analyses and reason for their exclusion
| GEO dataset accession | Disease | Reason for exclusion |
|---|---|---|
| | Squamous cell carcinoma of the tongue | Non-independent sampling: repeated measures |
| | Clear cell renal carcinoma | Non-independent sampling: repeated measures |
| | Prostate cancer | Non-independent sampling: repeated measures |
| | Chronic lymphocytic leukaemia | Non-independent sampling: repeated measures |
| | Breast cancer | Non-independent sampling: repeated measures |
| | Pancreatic adenocarcinoma | Non-independent sampling: different tissues from the same patient |
| | Clear cell renal carcinoma | Non-independent sampling: different tissues from the same patient |
| | Chronic lymphocytic leukaemia | Non-independent sampling: different tissues from the same patient |
| | Chronic lymphocytic leukaemia | Non-independent sampling: different tissues from the same patient |
| | Colorectal adenocarcinoma | Insufficient sample size in at least one class |
| | Glioblastoma | Insufficient sample size in at least one class |
| | Colorectal cancer | Insufficient sample size in at least one class |
| | Acute myeloid leukaemia | Insufficient sample size in at least one class |
| | Acute lymphoblastic leukaemia | Insufficient sample size in at least one class |
| | Acute lymphoblastic leukaemia | Insufficient sample size in at least one class |
| | T-lymphoblastic leukaemia | Insufficient sample size in at least one class |
| | Acute myeloid leukaemia | Insufficient sample size in at least one class |
| | Acute myeloid leukaemia | Insufficient sample size in at least one class |
| | Chronic lymphocytic leukaemia | Insufficient sample size in at least one class |
| | Chronic lymphocytic leukaemia | Insufficient sample size in at least one class |
| | Breast cancer | Insufficient sample size in at least one class |
| | Glioblastoma and glioma | Insufficient sample size in at least one class |
| | B-cell lymphoma | Insufficient sample size in at least one class |
| | Breast cancer | Insufficient sample size in at least one class |
| | Inflammatory bowel disease | No malignant cancer and insufficient sample size |
| | Malignant melanoma | No available classes to compare |
| | Colorectal cancer | Non-human tissue (transplantation on mice) |
| | Breast cancer | No available classes for cancer diagnosis |
| | Colorectal cancer | No available classes for cancer diagnosis |
| | Colorectal cancer | No available classes for cancer diagnosis |
| | Stage I endometrial cancer | No available classes for cancer diagnosis |
| | Colon cancer | No available classes for cancer diagnosis |
| | Bladder cancer | No available classes for cancer diagnosis |
| | Colorectal cancer | No available classes for cancer diagnosis |
| | Acute myeloid leukaemia | No available classes for cancer diagnosis |
| | Hodgkin’s lymphoma | No available classes for cancer diagnosis |
| | Acute lymphoid leukaemia | No available classes for cancer diagnosis |
| | Gastric adenocarcinoma | No available classes for cancer diagnosis |
| | Acute myeloid leukaemia | No available classes for cancer diagnosis |
| | Prostate cancer | No available classes for cancer diagnosis |
| | Breast cancer | No available classes for cancer diagnosis |
| | Breast cancer | No available classes for cancer diagnosis |
| | Non-small cell lung cancer | No available classes for cancer diagnosis |
| | Myelodysplastic syndrome | No available classes for cancer diagnosis |
Microarray data sets included in the analyses and the related classes at comparison
| GEO dataset accession | Disease | Classes at comparison | N |
|---|---|---|---|
| | Multiple myeloma | Monoclonal gammopathy ( | 99 |
| | Hepatocellular carcinoma | Hepatocellular carcinoma ( | 40 |
| | Small cell lung cancer | Normal cells ( | 65 |
| | Breast cancer | Cancer cells ( | 80 |
| | Medulloblastoma | Classic medulloblastoma ( | 76 |
| | Many different malignancies | Renal cancer ( | 174 |
| | Breast cancer | Benign ( | 162 |
| | Renal clear cell carcinoma | T3 thyronine ( | 42 |
N: number of samples
Analysis of gene expression profiles in eight selected data sets for cancer diagnosis. Comparison between five methods of supervised data mining in cross-validation
| Method | Sens. % | Spec. % | Youden Index % | Empirical Accuracy % | Cohen’s Kappa % | |
|---|---|---|---|---|---|---|
| LLM | 98.1 | 90.2 | 88.4 | 94.7 | 91.7 | |
| DT | 96.2 | 95.1 | 93.1 | 95.7 | 93.3 | |
| ANN | 94.3 | 95.1 | 89.4 | 93.6 | 90.0 | |
| SVM | 98.1 | 97.5 | 95.7 | 97.9 | 96.7 | |
| kNN | 98.1 | 97.6 | 95.7 | 96.8 | 95.1 | |
| LLM | 100 | 95.0 | 95.0 | 97.5 | 95.0 | |
| DT | 100 | 95.0 | 95.0 | 97.5 | 95.0 | |
| ANN | 100 | 100 | 100 | 100 | 100 | |
| SVM | 100 | 100 | 100 | 100 | 100 | |
| kNN | 100 | 100 | 100 | 100 | 100 | |
| LLM | 100 | 100 | 100 | 100 | 100 | |
| DT | 100 | 100 | 100 | 100 | 100 | |
| ANN | 100 | 100 | 100 | 100 | 100 | |
| SVM | 100 | 100 | 100 | 100 | 100 | |
| kNN | 100 | 100 | 100 | 100 | 100 | |
| LLM | 100 | 100 | 100 | 100 | 100 | |
| DT | 100 | 100 | 100 | 100 | 100 | |
| ANN | 97.3 | 100 | 97.3 | 98.8 | 97.5 | |
| SVM | 100 | 100 | 100 | 100 | 100 | |
| kNN | 100 | 100 | 100 | 100 | 100 | |
| LLM | 99.0 | 96.0 | 95.0 | 97.4 | 94.0 | |
| DT | 88.2 | 76.0 | 64.2 | 84.2 | 64.2 | |
| ANN | 82.4 | 88.0 | 70.4 | 84.2 | 66.3 | |
| SVM | 98.0 | 96.0 | 94.0 | 97.4 | 97.4 | |
| kNN | 94.2 | 96.0 | 90.1 | 94.7 | 88.3 | |
| LLM | 97.8 | 96.2 | 94.0 | 96.6 | 95.7 | |
| DT | 75.8 | 100 | 75.8 | 63.3 | 53.1 | |
| ANN | 98.9 | 96.2 | 95.1 | 93.2 | 91.4 | |
| SVM | 100 | 100 | 100 | 100 | 100 | |
| kNN | 100 | 100 | 100 | 100 | 100 | |
| LLM | 96.4 | 90.3 | 87.6 | 92.2 | 89.4 | |
| DT | 94.5 | 100 | 94.5 | 70.2 | 57.7 | |
| ANN | 97.3 | 90.3 | 87.6 | 76.6 | 67.2 | |
| SVM | 100 | 100 | 100 | 95.7 | 94.2 | |
| kNN | 100 | 100 | 100 | 97.2 | 96.1 | |
| LLM | 100 | 100 | 100 | 100 | 100 | |
| DT | 90.5 | 100 | 90.5 | 95.2 | 90.5 | |
| ANN | 14.3 | 81.0 | −4.8 | 46.6 | −4.8 | |
| SVM | 95.2 | 100 | 100 | 97.6 | 95.2 | |
| kNN | 85.7 | 95.2 | 81.0 | 90.5 | 81.0 | |
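The indices in the table above can all be derived from a two-class confusion matrix. The following is a minimal sketch, not the authors' code: the function name `binary_metrics` and the example counts (20 cases and 20 controls, with one control misclassified) are assumptions chosen only because they reproduce a row reporting 100, 95.0, 95.0, 97.5 and 95.0. It applies to two-class comparisons only; the source does not describe how the indices were obtained for multi-class data sets.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return the table's five indices (in %) for a two-class confusion matrix.

    tp/fn: diseased subjects classified correctly/incorrectly
    tn/fp: non-diseased subjects classified correctly/incorrectly
    """
    n = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Youden index: sensitivity + specificity - 1 (0 = chance, 1 = perfect)
    youden = sensitivity + specificity - 1.0
    accuracy = (tp + tn) / n
    # Cohen's kappa: observed agreement corrected for the agreement
    # expected by chance from the row and column marginals
    p_chance = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    kappa = (accuracy - p_chance) / (1.0 - p_chance)
    values = {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden": youden,
        "accuracy": accuracy,
        "kappa": kappa,
    }
    return {name: round(100.0 * value, 1) for name, value in values.items()}


# Hypothetical counts: 20 cases all detected, 19 of 20 controls correct
print(binary_metrics(tp=20, fn=0, tn=19, fp=1))
```

With balanced classes the chance agreement is 50%, which is why the Youden index and Cohen's kappa coincide in this example.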
Fig. 1 Summary ROC curves for the eight diagnostic comparisons
Results of summary ROC analysis
| Method | sAUC | 95%CI | sOR | 95%CI |
|---|---|---|---|---|
| Diagnostic studies | | | | |
| | 0.995 | | 1546 | |
| | 0.964 | | 104 | |
| | 0.904 | | 26 | |
| | 0.996 | | 1736 | |
| | 0.991 | | 635 | |
sAUC: summary Area Under the ROC Curve; 95%CI: 95% confidence interval; sOR: summary Odds Ratio
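For a symmetric summary ROC curve, the area under the curve is a closed-form function of the odds ratio: AUC = OR / (OR − 1)² × [(OR − 1) − ln OR]. The sketch below applies that standard relationship to the sOR values reported in the table. The source does not state how its sAUC values were computed, so treating them as derived this way is an assumption; the function name `auc_from_odds_ratio` is likewise illustrative.

```python
import math


def auc_from_odds_ratio(odds_ratio):
    """Area under a symmetric summary ROC curve with constant odds ratio.

    Uses AUC = OR / (OR - 1)^2 * ((OR - 1) - ln(OR)); an odds ratio of 1
    corresponds to the chance diagonal (AUC = 0.5).
    """
    if odds_ratio <= 0:
        raise ValueError("odds ratio must be positive")
    if math.isclose(odds_ratio, 1.0):
        return 0.5
    return odds_ratio / (odds_ratio - 1.0) ** 2 * (
        (odds_ratio - 1.0) - math.log(odds_ratio)
    )


# (reported sAUC, reported sOR) pairs copied from the table above
reported = [(0.995, 1546), (0.964, 104), (0.904, 26), (0.996, 1736), (0.991, 635)]
for s_auc, s_or in reported:
    print(s_or, s_auc, round(auc_from_odds_ratio(s_or), 4))
```

Under this assumption the computed areas agree with the reported sAUC values to within about 0.001, the size of difference expected from rounding of the published sOR.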
Classification rules identified by Logic Learning Machine applied to gene expression profiles in eight selected data sets for cancer diagnosis
| Output | Condition 1 | Condition 2 | Condition 3 | Covering |
|---|---|---|---|---|
| Monoc. Gamm. | SNHG3_1 ≤ 9.28 | SNORA14B ≤ 4.30 | – | 95.0% |
| Monoc. Gamm. | Control_3389 ≤ 8.20 | – | – | 60.0% |
| MM | THOP1 > 6.23 | TARP_5 ≤ 6.71 | – | 85.4% |
| MM | C22orf23 ≤ 5.20 | FLJ20712 ≤ 3.14 | – | 26.8% |
| Smold. MM | DNAJC7 > 8.13 | IGK_2 ≤ 10.4561 | DEK > 6.50 | 97.0% |
| Smold. MM | HNRNPA1 > 6.44 | – | – | 51.5% |
| HC | AQP7 ≤ 8.46 | – | – | 100% |
| Non tumor | CLPX_1 > 11.4116 | – | – | 100% |
| Normal cells | DSCC1_1 ≤ 110.1 | – | – | 100% |
| SCLC | CBX3_1 > 2232.75 | – | – | 100% |
| Breast cancer | FMN2 ≤ 116.32 | – | – | 100% |
| Fibroblast | SHC4 > 52.20 | – | – | 100% |
| Classic MB | EFHD2_1 > 3.87 | LOC100132891 ≤ 4.37 | – | 88.3% |
| Classic MB | TCL1A > 4.66 | – | – | 31.4% |
| Other MB | LOC100132891 > 4.18 | 5.47 < ZMYM5_3 ≤ 6.17 | – | 76.0% |
| Other MB | CHIAP2 > 3.38 | ZNF212 ≤ 6.45 | – | 40.0% |
| Colon cancer | KLK6 > 7.71 | – | – | 100% |
| Melanoma | EDNRB > 5.72279 | – | – | 100% |
| Non-SCLC | 5.61 < TMEM51 ≤ 6.55 | FAM177A1 > 8.27 | LINC00936 > 5.59 | 100% |
| Ovarian cancer | TMEM101 ≤ 6.15 | – | – | 85% |
| Ovarian cancer | MEIS1_1 > 6.70 | – | – | 57.1% |
| Renal cancer | LRRN4 > 4.69 | APBB1IP_2 > 7.46 | – | 100% |
| Benign diseaseᵃ | 2.32 < IGHV7-81 ≤ 3.29 | 2.87 < BM983749 ≤ 4.06 | LIM2 > 4.11 | 83.8% |
| Benign diseaseᵃ | LCP2_1 > 9.07 | ST8SIA2_1 ≤ 2.215 | – | 27.0% |
| Ectopic cancers | ST3GAL1 > 6.55 | PWWP2A > 6.18 | – | 100% |
| Healthy controls | USMG5 > 11.85 | – | – | 90.3% |
| Healthy controls | NUFIP2_1 > 8.81 | – | – | 41.9% |
| Breast cancerᵃ | MKNK1 ≤ 3.91 | 227762_at ≤ 8.3 | BF194770 > 2.385 | 80.4% |
| Breast cancer | ZNF81 ≤ 2.99 | MMAB_1 ≤ 4.095 | – | 29.4% |
| Breast cancer | AU143882 > 4.57 | – | – | 21.6% |
| Untreated controls | COQ10A ≤ 125.66 | – | – | 100% |
| Renal cancer | COQ10A > 125.66 | – | – | 100% |
Monoc. Gamm.: Monoclonal Gammopathy; MM: Multiple Myeloma; Smold. MM: Smoldering Multiple Myeloma; SCLC: Small Cell Lung Cancer; HC: Hepatocellular Carcinoma; MB: Medulloblastoma
ᵃ Classification algorithm truncated to the first three rules with the highest covering
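Each row of the table above is an if-then rule: a conjunction of threshold conditions on expression values that implies an output class. A minimal sketch of how such rules could be represented and applied follows. The data layout, the function names and the sample expression values are all hypothetical, and covering is taken here to mean the percentage of samples of the rule's class that satisfy the rule, which is an assumption about the definition rather than something stated in this record.

```python
# A condition is (gene, lower, upper), read as lower < value <= upper;
# None means the bound is absent. Thresholds are copied from the table.
RULES = [
    ("Untreated controls", [("COQ10A", None, 125.66)]),
    ("Renal cancer", [("COQ10A", 125.66, None)]),
]


def satisfies(sample, conditions):
    """True if the sample meets every condition of a rule."""
    for gene, lower, upper in conditions:
        value = sample[gene]
        if lower is not None and not value > lower:
            return False
        if upper is not None and not value <= upper:
            return False
    return True


def classify(sample, rules):
    """Return the output class of the first rule the sample satisfies."""
    for label, conditions in rules:
        if satisfies(sample, conditions):
            return label
    return None  # no rule fired


def covering(samples, labels, rule_label, conditions):
    """Percentage of samples of class rule_label that satisfy the rule."""
    in_class = [s for s, y in zip(samples, labels) if y == rule_label]
    hits = sum(satisfies(s, conditions) for s in in_class)
    return 100.0 * hits / len(in_class)


# Hypothetical expression values, for illustration only
samples = [{"COQ10A": 98.0}, {"COQ10A": 120.5}, {"COQ10A": 140.2}]
labels = ["Untreated controls", "Untreated controls", "Renal cancer"]
print([classify(s, RULES) for s in samples])
print(covering(samples, labels, "Renal cancer", RULES[1][1]))
```

Rules with two-sided conditions, such as 5.61 < TMEM51 ≤ 6.55, fit the same layout by setting both bounds.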
Fig. 2 Schematic representation of the Logic Learning Machine algorithm. In the first phase (Latticization), each variable is transformed into a string of binary data using inverse only-one code binarization, and all strings are then concatenated into a single large string for each subject. In the second phase (Shadow Clustering), a set of binary vectors (the “implicants”) is generated, each of which identifies a cluster in the input space associated with a specific output class. Finally, all the implicants are transformed into simple conditions and combined into a set of intelligible rules.
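The Latticization phase can be illustrated with a short sketch. It assumes that inverse only-one coding maps a variable discretized into k intervals to a k-bit string containing a single 0 at the position of the interval that holds the value, and 1 elsewhere. The cutoffs are borrowed from thresholds in the rules table purely for illustration; how the algorithm actually chooses them is not described in this record, and the function names are hypothetical.

```python
from bisect import bisect_left


def inverse_only_one(value, cutoffs):
    """Encode a value as len(cutoffs) + 1 bits, all 1 except one 0.

    Sorted cutoffs c0 < c1 < ... define the intervals
    (-inf, c0], (c0, c1], ..., (c_last, +inf); the 0 marks the
    interval that contains the value.
    """
    index = bisect_left(cutoffs, value)
    bits = ["1"] * (len(cutoffs) + 1)
    bits[index] = "0"
    return "".join(bits)


def latticize(sample, cutoffs_by_gene, gene_order):
    """Concatenate the per-variable codes into one binary string per subject."""
    return "".join(
        inverse_only_one(sample[gene], cutoffs_by_gene[gene]) for gene in gene_order
    )


# Illustrative cutoffs taken from thresholds that appear in the rules table
cutoffs_by_gene = {"COQ10A": [125.66], "TMEM51": [5.61, 6.55]}
subject = {"COQ10A": 130.0, "TMEM51": 6.0}  # hypothetical expression values
print(latticize(subject, cutoffs_by_gene, ["COQ10A", "TMEM51"]))
```

The second phase would then search these concatenated strings for implicants; that step is not sketched here because the record gives no further detail on Shadow Clustering.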