Sherry Bhalla1,2, Harpreet Kaur3, Rishemjit Kaur4, Suresh Sharma2, Gajendra P S Raghava1.
Abstract
INTRODUCTION: The rising worldwide incidence of thyroid cancer has made it the sixth most common cancer among women. Fine-needle aspiration biopsy is the predominant method for diagnosing the nature of thyroid nodules. However, it is of limited use in determining the tumor's state, i.e., benign or malignant. This study aims to identify the key RNA transcripts that can segregate early- and late-stage samples of Thyroid Carcinoma (THCA) using RNA expression profiles.
Year: 2020 PMID: 32324757 PMCID: PMC7179925 DOI: 10.1371/journal.pone.0231629
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. The overall flow of the study, including the number of samples in each class and types of feature selections explored for the development of machine learning models.
The green arrows represent that only protein-coding features have been explored to segregate cancer and normal samples. The blue arrows indicate that both protein-coding and non-protein coding features individually and in combination have been explored to segregate early vs late-stage samples. For multiclass classification, protein-coding and non-protein coding transcripts in combination are explored to segregate early, late and normal samples.
The performance measures of the prediction models developed based on 37-protein-coding mRNA feature set (THCA-EL-PC) selected by FCBF-WEKA feature selection method on training and validation dataset by implementing various machine-learning algorithms.
| Classifier | Dataset | TP | FP | TN | FN | Recall (%) | Precision | Spec (%) | Acc (%) | MCC | AUROC (95% CI) | F1 Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Training | 219 | 52 | 81 | 46 | 82.64 | 0.81 | 60.9 | 75.38 | 0.44 | 0.79 (0.74–0.84) | 0.75 |
| | Validation | 57 | 18 | 16 | 11 | 83.82 | 0.76 | 47.06 | 71.57 | 0.33 | 0.66 (0.54–0.77) | 0.72 |
| | Training | 244 | 68 | 65 | 21 | 92.08 | 0.78 | 48.87 | 77.64 | 0.47 | 0.70 (0.66–0.75) | 0.78 |
| | Validation | 60 | 24 | 10 | 8 | 88.24 | 0.71 | 29.41 | 68.63 | 0.22 | 0.59 (0.50–0.67) | 0.69 |
| | Training | 180 | 47 | 86 | 85 | 67.92 | 0.79 | 64.66 | 66.83 | 0.31 | 0.66 (0.61–0.72) | 0.66 |
| | Validation | 50 | 16 | 18 | 18 | 73.53 | 0.76 | 52.94 | 66.67 | 0.26 | 0.66 (0.55–0.77) | 0.67 |
| | Training | 190 | 39 | 94 | 75 | 71.7 | 0.83 | 70.68 | 71.36 | 0.4 | 0.77 (0.72–0.82) | 0.71 |
| | Validation | 47 | 15 | 19 | 21 | 69.12 | 0.76 | 55.88 | 64.71 | 0.24 | 0.63 (0.51–0.75) | 0.46 |
| | Training | 225 | 52 | 81 | 40 | 84.91 | 0.81 | 60.9 | 76.88 | 0.47 | 0.8 (0.75–0.85) | 0.76 |
| | Validation | 52 | 18 | 16 | 16 | 76.47 | 0.74 | 47.06 | 66.67 | 0.24 | 0.60 (0.47–0.73) | 0.5 |
TP: True Positive; FP: False Positive; TN: True Negative; FN: False Negative; Spec: Specificity; Acc: Accuracy; MCC: Matthews Correlation Coefficient; AUROC: Area under Receiver operating characteristic curve; CI: Confidence Interval.
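The count-based columns of the table can be recomputed from TP, FP, TN and FN alone. The following is a minimal sketch using the standard textbook definitions; the function name `binary_metrics` is hypothetical and not taken from the study. F1 is left out on purpose, because the source does not state which averaging convention was used for that column.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return recall, precision, specificity, accuracy and MCC as fractions.

    Standard definitions only; this is an illustrative helper, not the
    authors' code.
    """
    recall = tp / (tp + fn)                      # sensitivity
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Counts from the first Training row of the table above.
m = binary_metrics(tp=219, fp=52, tn=81, fn=46)
print(
    round(100 * m["recall"], 2),       # 82.64
    round(m["precision"], 2),          # 0.81
    round(100 * m["specificity"], 2),  # 60.9
    round(100 * m["accuracy"], 2),     # 75.38
    round(m["mcc"], 2),                # 0.44
)
```

For that row the recomputed values agree with the reported recall, precision, specificity, accuracy and MCC, which also confirms that the precision column holds fractions rather than percentages.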
The performance measures of the prediction models developed based on the 36-full feature set (THCA-EL-SVC-L1) selected by SVC-L1 on training and validation dataset by implementing various machine-learning algorithms.
| Classifier | Dataset | TP | FP | TN | FN | Recall (%) | Precision | Spec (%) | Acc (%) | MCC | AUROC (95% CI) | F1 Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Training | 228 | 17 | 116 | 37 | 86.04 | 0.93 | 87.22 | 86.43 | 0.71 | 0.93 (0.91–0.96) | 0.86 |
| | Validation | 52 | 10 | 24 | 16 | 76.47 | 0.84 | 70.59 | 74.51 | 0.45 | 0.73 (0.62–0.84) | 0.75 |
| | Training | 252 | 30 | 103 | 13 | 95.09 | 0.89 | 77.44 | 89.2 | 0.75 | 0.86 (0.82–0.90) | 0.89 |
| | Validation | 59 | 16 | 18 | 9 | 86.76 | 0.79 | 52.94 | 75.49 | 0.42 | 0.7 (0.60–0.79) | 0.75 |
| | Training | 178 | 52 | 81 | 87 | 67.17 | 0.77 | 60.9 | 65.08 | 0.27 | 0.66 (0.60–0.71) | 0.65 |
| | Validation | 51 | 17 | 17 | 17 | 75 | 0.75 | 50 | 66.67 | 0.25 | 0.62 (0.50–0.73) | 0.67 |
| | Training | 239 | 39 | 94 | 26 | 90.19 | 0.86 | 70.68 | 83.67 | 0.63 | 0.87 (0.83–0.91) | 0.84 |
| | Validation | 58 | 15 | 19 | 10 | 85.29 | 0.79 | 55.88 | 75.49 | 0.43 | 0.72 (0.62–0.83) | 0.75 |
| | Training | 197 | 32 | 101 | 68 | 74.34 | 0.86 | 75.94 | 74.87 | 0.48 | 0.84 (0.80–0.88) | 0.73 |
| | Validation | 46 | 11 | 23 | 22 | 67.65 | 0.81 | 67.65 | 67.65 | 0.34 | 0.75 (0.64–0.85) | 0.69 |
TP: True Positive; FP: False Positive; TN: True Negative; FN: False Negative; Spec: Specificity; Acc: Accuracy; MCC: Matthews Correlation Coefficient; AUROC: Area under Receiver operating characteristic curve; CI: Confidence Interval.
The performance of the SVC-based model at different thresholds in terms of the probability of correct prediction, developed using the 36-full feature set (THCA-EL-SVC-L1) on training and validation dataset.
| Threshold (cut-off) | Early-stage: total predictions | Early-stage: correct predictions | PPV (%) | Late-stage: total predictions | Late-stage: correct predictions | NPV (%) |
|---|---|---|---|---|---|---|
| Training dataset | | | | | | |
| 1.00 | 16 | 16 | 100.00 | 6 | 6 | 100.00 |
| 0.95 | 136 | 131 | 96.32 | 42 | 40 | 95.24 |
| 0.90 | 170 | 161 | 94.71 | 64 | 60 | 93.75 |
| 0.85 | 199 | 188 | 94.47 | 71 | 67 | 94.37 |
| 0.80 | 219 | 207 | 94.52 | 80 | 76 | 95.00 |
| 0.75 | 233 | 221 | 94.85 | 86 | 81 | 94.19 |
| 0.70 | 244 | 227 | 93.03 | 91 | 86 | 94.51 |
| 0.65 | 254 | 235 | 92.52 | 98 | 90 | 91.84 |
| 0.60 | 261 | 240 | 91.95 | 106 | 97 | 91.51 |
| Validation dataset | | | | | | |
| 1.00 | 0 | 0 | 0.00 | 0 | 0 | 0.00 |
| 0.95 | 28 | 25 | 89.29 | 9 | 7 | 77.78 |
| 0.90 | 42 | 36 | 85.71 | 12 | 8 | 66.67 |
| 0.85 | 47 | 39 | 82.98 | 15 | 9 | 60.00 |
| 0.80 | 54 | 45 | 83.33 | 16 | 10 | 62.50 |
| 0.75 | 58 | 48 | 82.76 | 19 | 12 | 63.16 |
| 0.70 | 62 | 52 | 83.87 | 22 | 14 | 63.64 |
| 0.65 | 65 | 53 | 81.54 | 22 | 14 | 63.64 |
| 0.60 | 70 | 56 | 80.00 | 25 | 16 | 64.00 |
PPV: Positive Predictive Value; NPV: Negative Predictive Value.
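One plausible way to build a table of this shape is to keep only the predictions whose class probability reaches the cut-off and then count how many of those are correct. The sketch below assumes a binary classifier that outputs a probability for the early-stage class; the helper names `pct` and `confidence_table` and the toy probabilities are hypothetical and are not data from the study.

```python
def pct(correct, total):
    """Percentage rounded to 2 decimals; 0.0 when there are no predictions,
    mirroring the 0.00 entries reported at threshold 1.00."""
    return round(100.0 * correct / total, 2) if total else 0.0


def confidence_table(p_early, is_early, thresholds):
    """For each threshold, count confident early/late calls and their PPV/NPV.

    p_early  : predicted probability of the early-stage class per sample
    is_early : 1 if the sample is truly early-stage, else 0
    """
    rows = []
    for t in thresholds:
        # True labels of samples called early (or late) with probability >= t.
        early = [y for p, y in zip(p_early, is_early) if p >= t]
        late = [y for p, y in zip(p_early, is_early) if (1.0 - p) >= t]
        early_ok = sum(early)             # truly early among early calls
        late_ok = len(late) - sum(late)   # truly late among late calls
        rows.append((t, len(early), early_ok, pct(early_ok, len(early)),
                     len(late), late_ok, pct(late_ok, len(late))))
    return rows


# Toy inputs, invented purely for illustration.
p_early = [0.97, 0.91, 0.72, 0.55, 0.40, 0.08, 0.03, 0.25]
is_early = [1, 1, 0, 1, 0, 0, 0, 1]
for row in confidence_table(p_early, is_early, [0.9, 0.7]):
    print(row)
```

The same `pct` helper reproduces the reported percentages from the counts in the table, for example `pct(131, 136)` gives 96.32 for the 0.95 training row.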
The performance measures of the SVC-based model, developed using the 18-feature set, on the training (TCGA) dataset and the external validation (GSE48953) dataset.
| Dataset | TP | FP | TN | FN | Recall (%) | Precision | Spec (%) | Acc (%) | MCC | AUROC (95% CI) | F1 Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Training (TCGA) | 220 | 50 | 83 | 45 | 83.02 | 0.81 | 62.41 | 76.13 | 0.46 | 0.81 (0.76–0.85) | 0.76 |
| External validation (GSE48953) | 12 | 0 | 3 | 5 | 70.59 | 1.00 | 100 | 75 | 0.51 | 0.78 (0.58–0.98) | 0.75 |
TP: True Positive; FP: False Positive; TN: True Negative; FN: False Negative; Sens: Sensitivity; Spec: Specificity; Acc: Accuracy; MCC: Matthews Correlation Coefficient; AUROC: Area under Receiver operating characteristic curve; CI: Confidence Interval.
Fig 2. The AUROC plot comparing the performance of prediction models based on different feature sets for segregating early and late-stage tissue samples of (A) Training dataset and (B) Validation dataset.
The performance measures of the multiclass prediction model developed based on 107 features selected by SVC-L1 on training and validation dataset by implementing SVC.
| Class | Recall (%) | Precision (%) | Spec (%) | Acc (%) | MCC | AUROC (95% CI) | F1 Score | Number of samples |
|---|---|---|---|---|---|---|---|---|
| Training Data | | | | | | | | |
| Normal | 100 | 98 | 99 | 99 | 0.98 | 0.99 (0.98–0.99) | 0.99 | 46 |
| Early | 88 | 86 | 79 | 86 | 0.72 | 0.93 (0.92–0.94) | 0.88 | 265 |
| Late | 77 | 81 | 92 | 87 | 0.67 | 0.91 (0.90–0.93) | 0.77 | 133 |
| Validation Data | | | | | | | | |
| Normal | 92 | 92 | 99 | 98 | 0.91 | 0.95 (0.84–1.00) | 0.88 | 12 |
| Early | 79 | 75 | 58 | 74 | 0.44 | 0.76 (0.66–0.85) | 0.78 | 68 |
| Late | 53 | 58 | 87 | 75 | 0.34 | 0.72 (0.61–0.83) | 0.55 | 34 |
Spec: Specificity; Acc: Accuracy; MCC: Matthews Correlation Coefficient; AUROC: Area under Receiver operating characteristic curve.
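Per-class figures like these are commonly obtained by treating each class in turn as the positive class against the other two (one-vs-rest). The source does not say how its per-class values were derived, so the following is only a sketch of that common convention. The function name `one_vs_rest_metrics` and the 3x3 confusion matrix are hypothetical, not data from the study.

```python
def one_vs_rest_metrics(confusion, labels):
    """Per-class recall, precision, specificity and accuracy.

    confusion[i][j] = number of samples of true class i predicted as class j.
    Assumes every class has at least one true and one predicted sample.
    """
    total = sum(sum(row) for row in confusion)
    out = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                  # rest of true-class row
        fp = sum(row[k] for row in confusion) - tp   # rest of predicted column
        tn = total - tp - fn - fp
        out[label] = {
            "recall": tp / (tp + fn),
            "precision": tp / (tp + fp),
            "specificity": tn / (tn + fp),
            "accuracy": (tp + tn) / total,
            "support": tp + fn,
        }
    return out


# Toy confusion matrix, invented for illustration (rows: true, cols: predicted).
labels = ["normal", "early", "late"]
confusion = [
    [10, 1, 0],
    [1, 40, 9],
    [0, 8, 16],
]
for label, vals in one_vs_rest_metrics(confusion, labels).items():
    print(label, {k: round(v, 3) for k, v in vals.items()})
```

Under this convention a class with few samples can show high specificity and accuracy even when its recall is modest, because the true negatives are drawn from the two larger remaining classes.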
The performance measures of prediction models developed based on 5-protein coding transcripts (THCA-CN-F) feature set selected by F_ANOVA feature selection method for discriminating cancer and normal patients on training and independent validation dataset.
| Classifier | Dataset | TP | FP | TN | FN | Recall (%) | Precision | Spec (%) | Acc (%) | MCC | AUROC (95% CI) | F1 Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Training | 396 | 4 | 42 | 4 | 99.00 | 0.99 | 91.3 | 98.21 | 0.9 | 0.97 (0.93–1) | 0.98 |
| | Independent Validation | 97 | 0 | 12 | 3 | 97.00 | 1.00 | 100.00 | 97.32 | 0.88 | 0.99 (0.91–1) | 0.97 |
| | Training | 396 | 6 | 40 | 4 | 99.00 | 0.99 | 86.96 | 97.76 | 0.88 | 0.93 (0.88–0.98) | 0.98 |
| | Validation | 97 | 0 | 12 | 3 | 97.00 | 1.00 | 100.00 | 97.32 | 0.88 | 0.98 (0.96–1) | 0.87 |
| | Training | 397 | 9 | 37 | 3 | 99.25 | 0.98 | 80.43 | 97.31 | 0.85 | 0.85 (0.76–0.95) | 0.97 |
| | Validation | 97 | 3 | 9 | 3 | 97.00 | 0.97 | 75.00 | 94.64 | 0.72 | 0.87 (0.74–0.99) | 0.87 |
| | Training | 383 | 3 | 43 | 17 | 95.75 | 0.99 | 93.48 | 95.52 | 0.80 | 0.95 (0.91–0.99) | 0.96 |
| | Validation | 91 | 0 | 12 | 9 | 91.00 | 1.00 | 100.00 | 91.96 | 0.72 | 0.96 (0.93–0.99) | 0.92 |
| | Training | 393 | 5 | 41 | 7 | 98.25 | 0.99 | 89.13 | 97.31 | 0.86 | 0.97 (0.93–1) | 0.97 |
| | Validation | 95 | 0 | 12 | 5 | 95.00 | 1.00 | 100 | 95.54 | 0.82 | 0.99 (0.97–1) | 0.96 |
TP: True Positive; FP: False Positive; TN: True Negative; FN: False Negative; Sens: Sensitivity; Spec: Specificity; Acc: Accuracy; MCC: Matthews Correlation Coefficient; AUROC: Area under Receiver operating characteristic curve; CI: Confidence Interval.