Davide Chicco1,2, Cristina Rovelli3.
Abstract
BACKGROUND: Mesothelioma is a lung cancer that kills thousands of people worldwide annually, especially those with exposure to asbestos. Diagnosis of mesothelioma in patients often requires time-consuming imaging techniques and biopsies. Machine learning can provide for a more effective, cheaper, and faster patient diagnosis and feature selection from clinical data in patient records. METHODS AND
Year: 2019 PMID: 30629589 PMCID: PMC6328132 DOI: 10.1371/journal.pone.0208737
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Dataset features with ranges and measurement units.
We removed “diagnosis method” from the classification and feature selection phases, because it has the same values as the “class of diagnosis” target that we predict. We renamed some features for clarity: “blood lactic dehydrogenise (LDH)” into “lactate dehydrogenase test”, “cell count (WBC)” into “white blood cells (WBC)”, “cytology” into “cytology exam of pleural fluid”, “hemoglobin (HGB)” into “hemoglobin normality test”, “keep side” into “lung side”, “pleural glucose” into “pleural fluid glucose”, and “white blood” into “pleural fluid WBC count”.
| feature name | value range | measurement unit |
|---|---|---|
| ache on chest | 0, 1 | boolean |
| asbestos exposure | 0, 1 | boolean |
| cytology exam of pleural fluid | 0, 1 | boolean |
| dead or not | 0, 1 | boolean |
| diagnosis method | 0, 1 | boolean |
| dyspnoea | 0, 1 | boolean |
| hemoglobin normality test | 0, 1 | boolean |
| pleural effusion | 0, 1 | boolean |
| pleural level of acidity (pH) | 0, 1 | boolean |
| pleural thickness on tomography | 0, 1 | boolean |
| weakness | 0, 1 | boolean |
| city | [0, 8] | category |
| gender | 0, 1 | category |
| habit of cigarette | 0, 1, 2, 3 | category |
| lung side | 0, 1, 2 | category |
| performance status | 0, 1 | category |
| type of malignant mesothelioma | 0, 1, 2 | category |
| age | [19, 85] | years |
| duration of asbestos exposure | [0, 70] | years |
| duration of symptoms | [0.5, 52] | years |
| albumin | [1.5, 6.9] | g/dL (grams per deciliter) |
| alkaline phosphatase (ALP) | [41, 489] | IU/L (international units per liter) |
| C-reactive protein (CRP) | [11, 103] | mg/L (milligrams per liter) |
| lactate dehydrogenase test (LDH) | [55, 1128] | IU/L (international units per liter) |
| glucose | [60, 421] | mg/dL (milligrams per deciliter) |
| platelet count (PLT) | [111, 3335] | kilo platelets per mcL (microliter) |
| pleural albumin | [0, 4.4] | g/dL (grams per deciliter) |
| pleural fluid WBC count | [742, 21500] | cells per microliter (mcL) |
| pleural fluid glucose | [2, 151] | mg/dL (milligrams per deciliter) |
| pleural lactic dehydrogenase | [110, 7541] | IU/L (international units per liter) |
| pleural protein | [0, 6.7] | g/L (grams per liter) |
| sedimentation rate | [7, 129] | mm/hr (millimeters per hour) |
| total protein | [3.1, 8.5] | g/dL (grams per deciliter) |
| white blood cells (WBC) | [4, 22] | cells per mcL (microliter) |
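The renaming and the removal of “diagnosis method” described in the caption can be sketched as a small preprocessing step. This is a minimal illustration, not the authors' code: the `prepare` function and the example record (with made-up values) are assumptions, while the feature names come from the caption.

```python
# Original dataset name -> name used in the paper (taken from the caption).
RENAMES = {
    "blood lactic dehydrogenise (LDH)": "lactate dehydrogenase test",
    "cell count (WBC)": "white blood cells (WBC)",
    "cytology": "cytology exam of pleural fluid",
    "hemoglobin (HGB)": "hemoglobin normality test",
    "keep side": "lung side",
    "pleural glucose": "pleural fluid glucose",
    "white blood": "pleural fluid WBC count",
}
TARGET = "class of diagnosis"
DROPPED = {"diagnosis method"}  # has the same values as the target


def prepare(record):
    """Split one patient record (a dict) into renamed features and the label.

    Hypothetical helper: drops the target and the duplicated column from the
    features and applies the renames above.
    """
    features = {}
    for name, value in record.items():
        if name == TARGET or name in DROPPED:
            continue
        features[RENAMES.get(name, name)] = value
    return features, record[TARGET]


# Hypothetical record with made-up values, for illustration only.
example = {
    "keep side": 1,
    "cytology": 0,
    "age": 47,
    "diagnosis method": 1,
    "class of diagnosis": 1,
}
features, label = prepare(example)
```

Keeping “diagnosis method” among the inputs would leak the target into the classifier, which is why it is excluded before both classification and feature selection.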
Meaning of each feature of the dataset.
We reported a detailed description of each feature in the Supplementary Information.
| feature name | meaning |
|---|---|
| ache on chest | presence or absence of pain in the chest area |
| asbestos exposure | if a patient has been exposed to asbestos during life |
| cytology exam of pleural fluid | test to detect cancer cells and certain other cells in the area that surrounds the lung |
| dead or not | if a patient is still alive |
| diagnosis method | if the patient has had a mesothelioma diagnosed by a common diagnosis method |
| dyspnoea | shortness of breath |
| hemoglobin normality test | test that measures how much hemoglobin is in blood |
| pleural effusion | presence of effusion, common symptom that can inhibit the normal function of the organ |
| pleural level of acidity (pH) | whether the pleural fluid pH is lower than the normal pleural fluid pH, which is neutral |
| pleural thickness on tomography | any form of thickening involving either the parietal or visceral pleura |
| weakness | lack of strength |
| city | place of provenance of the patients |
| gender | female or male |
| habit of cigarette | four categories for the habit of smoking |
| lung side | the side of the lungs which is experiencing pleural plaques or mesothelioma traces |
| performance status | patient’s ability to perform normal tasks |
| type of malignant mesothelioma | mesothelioma stage to which the symptoms seem to belong, according to the TNM Classification of Malignant Tumors |
| age | the age of the patients |
| duration of asbestos exposure | how long the environmental exposure to asbestos lasted |
| duration of symptoms | the time period, in years, in which the patients show symptoms |
| albumin | level of blood albumin |
| alkaline phosphatase (ALP) | test used to help detect liver disease or bone disorders |
| C-reactive protein (CRP) | acute phase reactant, significantly elevated in patients with pleural mesothelioma (MPM) |
| glucose | test which measures the amount of glucose in a sample of blood |
| lactate dehydrogenase test (LDH) | protein that helps produce energy in the body |
| platelet count (PLT) | test to measure how many platelets patients have in the blood |
| pleural albumin | level of albumin in the pleural fluid |
| pleural fluid WBC count | the count of leukocytes in the pleural fluid |
| pleural fluid glucose | low level can be linked to infection or malignancy |
| pleural lactic dehydrogenase | its level indicates whether the fluid is an exudate or a transudate |
| pleural protein | pleural effusions are classified as transudates or exudates on the basis of the fluid protein level |
| sedimentation rate | test to measure how quickly erythrocytes settle in a test tube in one hour |
| total protein | biochemical test for measuring the total amount of protein in serum |
| white blood cells (WBC) | test that measures the number and quality of white blood cells |
Fig 1. Architecture of the probabilistic neural network.
In our model, there are 33 neurons in the input layer, 33 neurons in the pattern layer, and 2 neurons in the summation layer.
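To make the layer roles concrete, the following is a minimal sketch of a textbook probabilistic neural network, not the authors' implementation. It assumes a Gaussian kernel, one pattern unit per stored training pattern, and a smoothing parameter `sigma`; the function name and the toy two-dimensional data are hypothetical.

```python
import math


def pnn_predict(x, patterns, labels, sigma=1.0):
    """Classify x with a textbook PNN (assumed formulation).

    Pattern layer: one Gaussian kernel activation per stored pattern.
    Summation layer: one unit per class, averaging its kernel activations.
    Output: the class whose summation unit is largest.
    """
    sums = {}
    counts = {}
    for pattern, label in zip(patterns, labels):
        squared_distance = sum((a - b) ** 2 for a, b in zip(x, pattern))
        activation = math.exp(-squared_distance / (2.0 * sigma ** 2))
        sums[label] = sums.get(label, 0.0) + activation
        counts[label] = counts.get(label, 0) + 1
    scores = {label: sums[label] / counts[label] for label in sums}
    return max(scores, key=scores.get), scores


# Toy data with two classes (0 = non-mesothelioma, 1 = mesothelioma).
patterns = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
labels = [0, 0, 1, 1]
near_class_0, _ = pnn_predict((0.2, 0.3), patterns, labels)
near_class_1, _ = pnn_predict((5.5, 5.1), patterns, labels)
```

With two classes, the summation layer has two units, which matches the 2 summation neurons mentioned in the caption.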
Fig 2. Architecture of a multi-layer perceptron-based neural network.
In our model, the input layer has 33 neurons. For each program execution, we found a different optimized number of hidden layers and hidden units. The top architecture among the ten executions had 1 hidden layer with 20 hidden units.
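As an illustration of the top architecture (33 inputs, 1 hidden layer of 20 units), here is a minimal forward-pass sketch. The sigmoid activation, the single output unit, the random initialization, and all function names are assumptions made for this example; the caption does not specify them.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Return small random weights (n_out rows of n_in) and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(x, layers):
    """Propagate input x through each (weights, biases) layer in turn."""
    activation = x
    for weights, biases in layers:
        activation = [
            sigmoid(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(weights, biases)
        ]
    return activation


rng = random.Random(0)
# 33 input features -> 20 hidden units -> 1 output unit (assumed).
layers = [init_layer(33, 20, rng), init_layer(20, 1, rng)]
x = [0.5] * 33  # placeholder input, not real patient data
output = forward(x, layers)
```

The output can be read as a score between 0 and 1 that is thresholded to obtain the predicted class.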
Fig 3. Decision tree.
An example of a decision tree, which can classify each patient as healthy (non-mesothelioma) or unhealthy (mesothelioma). Random forest generates a set of predictive decision trees.
Results of the computational predictions of patient diagnosis on the complete dataset.
Matthews correlation coefficient (MCC): Eq 3. Accuracy: Eq 1. F1 score: Eq 4. Sensitivity (true positive rate): Eq 5. Specificity (true negative rate): Eq 6. The scores are the medians of the results of ten separate program executions. We report the results of applying the methods to all the dataset features, plus the results of the decision tree applied only to the two selected features (the row entitled “Decision tree (applied only to lung side & platelet count)”). Dataset imbalance: 29.63% positive data instances (all 96 mesothelioma patients) and 70.37% negative data instances (all 228 non-mesothelioma patients).
| method | MCC | accuracy | F1 score | sensitivity | specificity |
|---|---|---|---|---|---|
| Random forest classifier | | 0.75 | 0.39 | 0.28 | 0.97 |
| Decision tree (applied only to lung side & platelet count) | | 0.76 | 0.37 | 0.28 | 0.95 |
| One rule | | 0.74 | 0.29 | 0.17 | 0.97 |
| Decision tree | | 0.69 | 0.39 | 0.39 | 0.80 |
| Perceptron | | 0.52 | 0.47 | 0.66 | 0.42 |
| Probabilistic neural network | | 0.57 | 0.32 | 0.32 | 0.71 |
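The metrics named in the caption can all be derived from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not the paper's results. Only the class sizes (96 positives, 228 negatives) come from the caption.

```python
import math


def diagnosis_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0 (a convention
    assumed here).
    """
    total = tp + fn + tn + fp
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / total
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return {
        "MCC": mcc,
        "accuracy": accuracy,
        "F1 score": f1,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


# Hypothetical counts on a 96-positive / 228-negative dataset.
hypothetical = diagnosis_metrics(tp=27, fn=69, tn=221, fp=7)
# A classifier that labels every patient as non-mesothelioma.
all_negative = diagnosis_metrics(tp=0, fn=96, tn=228, fp=0)
```

On this imbalanced dataset, the all-negative classifier still reaches an accuracy of about 0.70 while its MCC is 0, which is one reason to read accuracy together with MCC, sensitivity, and specificity.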
Results of the computational predictions of patient diagnosis, after under-sampling.
Matthews correlation coefficient (MCC): Eq 3. Accuracy: Eq 1. F1 score: Eq 4. Sensitivity (true positive rate): Eq 5. Specificity (true negative rate): Eq 6. The scores are the medians of the results of ten separate program executions, each run with a different random selection of the training set, validation set, and test set. We report the results of applying the methods to all the dataset features, plus the results of the decision tree applied only to the two selected features (the row entitled “Decision tree (applied only to lung side & platelet count)”). Dataset balance: 50% positive data instances (all 96 mesothelioma patients) and 50% negative data instances (96 non-mesothelioma patients, randomly selected). Perceptron: learning rate = 0.1.
| method | MCC | accuracy | F1 score | sensitivity | specificity |
|---|---|---|---|---|---|
| Random forest classifier | | 0.82 | 0.80 | 0.75 | 0.86 |
| Decision tree | | 0.79 | 0.77 | 0.72 | 0.82 |
| Decision tree (applied only to lung side & platelet count) | | 0.68 | 0.63 | 0.58 | 0.80 |
| Perceptron | | 0.62 | 0.71 | 0.95 | 0.20 |
| One rule | | 0.57 | 0.55 | 0.47 | 0.67 |
| Probabilistic neural network | | 0.53 | 0.50 | 0.50 | 0.58 |
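The under-sampling step described in the caption (keep all 96 positives, draw 96 negatives at random) can be sketched as follows. This is an illustrative sketch only: the `undersample` helper, the record layout, and the fixed seed are assumptions.

```python
import random


def undersample(records, label_of, rng):
    """Balance a binary dataset by randomly sampling the majority class.

    Keeps every minority-class record and draws, without replacement,
    the same number of majority-class records.
    """
    positives = [r for r in records if label_of(r) == 1]
    negatives = [r for r in records if label_of(r) == 0]
    minority, majority = sorted((positives, negatives), key=len)
    kept = rng.sample(majority, len(minority))
    balanced = minority + kept
    rng.shuffle(balanced)
    return balanced


# Placeholder records as (patient_id, label): 96 positives, 228 negatives.
records = [(i, 1) for i in range(96)] + [(i, 0) for i in range(96, 324)]
balanced = undersample(records, lambda r: r[1], random.Random(0))
```

Repeating the call with a different seed gives a different random subset of negatives, in the same spirit as the ten separate executions mentioned in the caption.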
Fig 4. Mean square error (MSE) decrease in accuracy for each feature removal.
Random forest feature selection relies on bootstrap aggregation (bagging), and therefore does not use a training set, validation set, and test set [69]. The bars represent the drop in the accuracy of the prediction made on the patients’ dataset each time a feature is removed. The larger the accuracy drop when a feature is removed, the more important that feature is (Methods).
Fig 5. Gini impurity decreases of each random forest tree node.
Random forest feature selection relies on bootstrap aggregation (bagging), and therefore does not use a training set, validation set, and test set [69]. The bars represent the importance of each feature, measured through the sum of all the Gini impurity index decreases for each specific feature [39] (Methods).
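For readers unfamiliar with the quantity being summed, the sketch below computes the Gini impurity of a node and the impurity decrease produced by one split, using the standard definitions. The function names and the example split are hypothetical; only the root class counts (96 and 228) come from the dataset description.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def impurity_decrease(parent, children):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    total = sum(parent)
    weighted = sum(sum(child) / total * gini(child) for child in children)
    return gini(parent) - weighted


root = [96, 228]  # mesothelioma, non-mesothelioma
# Hypothetical split, for illustration only.
some_split = impurity_decrease(root, [[60, 40], [36, 188]])
# A perfect split removes all of the root impurity.
perfect_split = impurity_decrease(root, [[96, 0], [0, 228]])
```

A feature's importance in this scheme is the sum of such decreases over every node where the feature is used to split.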
Merged rank of features.
We sorted the features by combining the ranking by node impurity decrease and the ranking by percentage of MSE decrease in accuracy (Methods).
| merged ranking position | feature name | MSE decrease in accuracy % | tree node purity decrease |
|---|---|---|---|
| 1 | lung side | 2.56 × 10⁻² | 4.32 |
| 2 | platelet count (PLT) | 1.52 × 10⁻² | 4.97 |
| 3 | duration of symptoms | 6.92 × 10⁻³ | 4.22 |
| 4 | age | 3.60 × 10⁻³ | 3.78 |
| 5 | city | 1.03 × 10⁻² | 2.80 |
| 6 | duration of asbestos exposure | 4.40 × 10⁻³ | 3.60 |
| 7 | C-reactive protein (CRP) | 3.28 × 10⁻³ | 3.11 |
| 8 | pleural protein | 4.42 × 10⁻³ | 2.66 |
| 9 | sedimentation rate | 1.30 × 10⁻³ | 3.13 |
| 10 | glucose | 1.12 × 10⁻³ | 2.63 |
| 11 | gender | 4.45 × 10⁻³ | 0.87 |
| 12 | pleural albumin | 2.27 × 10⁻³ | 2.31 |
| 13 | pleural fluid glucose | 2.55 × 10⁻⁴ | 3.20 |
| 14 | albumin | 1.01 × 10⁻³ | 2.46 |
| 15 | pleural lactic dehydrogenase | 9.18 × 10⁻⁴ | 1.85 |
| 16 | lactate dehydrogenase test | 3.84 × 10⁻⁶ | 2.74 |
| 17 | white blood cells (WBC) | 4.30 × 10⁻⁴ | 2.11 |
| 18 | habit of cigarette | 5.92 × 10⁻⁴ | 0.86 |
| 19 | type of malignant mesothelioma | 7.23 × 10⁻⁴ | 0.50 |
| 20 | cytology exam of pleural fluid | 3.00 × 10⁻⁴ | 0.42 |
| 21 | pleural thickness on tomography | 2.49 × 10⁻⁴ | 0.44 |
| 22 | pleural fluid WBC count | −2.96 × 10⁻³ | 2.63 |
| 23 | total protein | −8.30 × 10⁻⁴ | 2.38 |
| 24 | alkaline phosphatase (ALP) | −4.54 × 10⁻⁴ | 1.70 |
| 25 | asbestos exposure | 4.49 × 10⁻⁴ | 0.23 |
| 26 | hemoglobin normality test | 1.54 × 10⁻⁴ | 0.41 |
| 27 | performance status | 3.63 × 10⁻⁵ | 0.26 |
| 28 | dyspnoea | −4.23 × 10⁻⁴ | 0.33 |
| 29 | pleural level of acidity (pH) | −2.25 × 10⁻⁴ | 0.27 |
| 30 | ache on chest | −9.76 × 10⁻⁴ | 0.41 |
| 31 | pleural effusion | −6.28 × 10⁻⁵ | 0.15 |
| 32 | weakness | −4.58 × 10⁻⁴ | 0.40 |
| 33 | dead or not | −1.41 × 10⁻⁴ | 0.11 |
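One simple way to merge two rankings is to sum each feature's position in both and sort by the total. The sketch below does this for five rows of the table. The exact merging procedure is described in the paper's Methods and is not reproduced here, so the rank-sum rule, the tie-break on MSE rank, and the function names are all assumptions.

```python
def rank_positions(scores):
    """Map each feature to its 1-based position, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def merged_ranking(mse_decrease, impurity_decrease):
    """Sort features by the sum of their two rank positions.

    Ties are broken by the MSE-based rank (an assumed convention).
    """
    mse_rank = rank_positions(mse_decrease)
    impurity_rank = rank_positions(impurity_decrease)
    return sorted(
        mse_decrease,
        key=lambda name: (mse_rank[name] + impurity_rank[name], mse_rank[name]),
    )


# Values copied from the first five rows of the table.
mse_decrease = {
    "lung side": 2.56e-2,
    "platelet count (PLT)": 1.52e-2,
    "duration of symptoms": 6.92e-3,
    "age": 3.60e-3,
    "city": 1.03e-2,
}
impurity_decrease = {
    "lung side": 4.32,
    "platelet count (PLT)": 4.97,
    "duration of symptoms": 4.22,
    "age": 3.78,
    "city": 2.80,
}
merged = merged_ranking(mse_decrease, impurity_decrease)
```

Because the ranks here are computed within a five-feature subset, the lower positions need not match the full 33-feature table, where each feature's rank depends on all the other features.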
Fig 6. Strip plot of platelet count (PLT) by lung side.
We exclude one outlier on the X axis with 3,335 platelet/microliter. Vertical blue dotted line: lower boundary of the platelet count normality test.