Chip M Lynch, Victor H van Berkel, Hermann B Frieboes.
Abstract
This study applies unsupervised machine learning techniques for classification and clustering to a collection of descriptive variables from 10,442 lung cancer patient records in the Surveillance, Epidemiology, and End Results (SEER) program database. The goal is to automatically classify lung cancer patients into groups based on clinically measurable disease-specific variables in order to estimate survival. Variables selected as inputs for machine learning include Number of Primaries, Age, Grade, Tumor Size, Stage, and TNM, which are numeric or can readily be converted to numeric type. Minimal up-front processing of the data enables exploring the out-of-the-box capabilities of established unsupervised learning techniques, with little human intervention through the entire process. The output of the techniques is used to predict survival time, with the efficacy of the prediction representing a proxy for the usefulness of the classification. A basic single variable linear regression against each unsupervised output is applied, and the associated Root Mean Squared Error (RMSE) value is calculated as a metric to compare the outputs. The results show that self-ordering maps exhibit the best performance, while k-Means performs the best of the simpler classification techniques. When predicting against the full data set, their respective RMSE values (15.591 for self-ordering maps and 16.193 for k-Means) are comparable to those of supervised regression techniques, such as Gradient Boosting Machine (RMSE of 15.048). We conclude that unsupervised data analysis techniques may be of use to classify patients by defining the classes as effective proxies for survival prediction.
Year: 2017 PMID: 28910336 PMCID: PMC5598970 DOI: 10.1371/journal.pone.0184370
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
| Category | Definition |
|---|---|
| T0 | No evidence of a primary tumor |
| T1 | Tumor is < 3 cm in diameter, has not penetrated the visceral pleura, and has not affected the main bronchi branches. If tumor is < 2 cm, it is T1a stage; otherwise, it is T1b. |
| T2 | Tumor is > 3 cm in diameter, or has penetrated the visceral pleura, or has partially occluded the airways, or involves the main bronchus but is > 2 cm away from the carina. If tumor is < 5 cm, it is T2a stage; otherwise, it is T2b. |
| T3 | Tumor is > 7 cm in diameter, or has caused an entire lung to collapse or develop pneumonia, or has grown into the chest wall, diaphragm, mediastinal pleura, or parietal pericardium, or involves the main bronchus and is < 2 cm from the carina (without involving the carina), or two or more tumor nodules are present in the same lung lobe. |
| T4 | Tumor has grown into the mediastinum, heart, large blood vessels near the heart, trachea, esophagus, spinal column, or the carina. |
| N0 | No tumor spread to nearby lymph nodes |
| N1 | Tumor has spread to lymph nodes within the lung or near the hilar lymph nodes. The affected lymph nodes are on the same side of the body as the primary tumor. |
| N2 | Tumor has spread to lymph nodes around the carina or in the mediastinum. The affected lymph nodes are on the same side of the body as the primary tumor. |
| N3 | Tumor has spread to lymph nodes near the clavicle on either side, or to hilar or mediastinal lymph nodes on the opposite side of the body from the primary tumor. |
| M0 | Tumor has not spread to distant organs, to the other lung, or to lymph nodes farther away than those specified in the “N” category. |
| M1 | Tumor has spread to distant organs, to the other lung, or to lymph nodes farther away than those specified in the “N” category. |
Definition of T, N, and M categories for Non-Small Cell Lung Cancer [41].
| Stage Grouping | T | N | M |
|---|---|---|---|
| 1A | T1 | N0 | M0 |
| 1B | T2a | N0 | M0 |
| 2A | T1 | N1 | M0 |
| | T2a | N1 | M0 |
| | T2b | N0 | M0 |
| 2B | T2b | N1 | M0 |
| | T3 | N0 | M0 |
| 3A | T1 to T3 | N2 | M0 |
| | T3 | N1 | M0 |
| | T4 | N0 or N1 | M0 |
| 3B | T1 to T4 | N3 | M0 |
| | T4 | N2 | M0 |
| 4 | T1 to T4 | N1 to N3 | M1 |
Stage grouping for Non-Small Cell Lung Cancer based on TNM category [41].
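The stage grouping lends itself to a simple lookup, which is one way a categorical TNM triple could be turned into a numeric input. The sketch below is illustrative only: the rule list is transcribed from the table as printed, while the function names, the expansion of "T1 to T3" into T1, T2a, T2b, and T3, the folding of T1a/T1b into T1, and the ordinal encoding are assumptions rather than the authors' preprocessing.

```python
from typing import Optional

# Rows transcribed from the stage grouping table: (stage, T values, N values, M value).
# Assumption: ranges such as "T1 to T3" expand to T1, T2a, T2b, T3.
ALL_T = {"T1", "T2a", "T2b", "T3", "T4"}
STAGE_RULES = [
    ("1A", {"T1"}, {"N0"}, "M0"),
    ("1B", {"T2a"}, {"N0"}, "M0"),
    ("2A", {"T1", "T2a"}, {"N1"}, "M0"),
    ("2A", {"T2b"}, {"N0"}, "M0"),
    ("2B", {"T2b"}, {"N1"}, "M0"),
    ("2B", {"T3"}, {"N0"}, "M0"),
    ("3A", ALL_T - {"T4"}, {"N2"}, "M0"),
    ("3A", {"T3"}, {"N1"}, "M0"),
    ("3A", {"T4"}, {"N0", "N1"}, "M0"),
    ("3B", ALL_T, {"N3"}, "M0"),
    ("3B", {"T4"}, {"N2"}, "M0"),
    ("4", ALL_T, {"N1", "N2", "N3"}, "M1"),
]

# Hypothetical ordinal encoding (1 = earliest stage); not taken from the paper.
STAGE_ORDER = ["1A", "1B", "2A", "2B", "3A", "3B", "4"]


def stage_group(t: str, n: str, m: str) -> Optional[str]:
    """Return the stage grouping for a TNM triple, or None if the table has no matching row."""
    if t in ("T1a", "T1b"):  # the table lists T1 without subcategories
        t = "T1"
    for stage, t_values, n_values, m_value in STAGE_RULES:
        if t in t_values and n in n_values and m == m_value:
            return stage
    return None


def stage_to_ordinal(stage: str) -> int:
    """Map a stage grouping to its 1-based position in STAGE_ORDER."""
    return STAGE_ORDER.index(stage) + 1


if __name__ == "__main__":
    for triple in [("T1", "N0", "M0"), ("T4", "N2", "M0"), ("T0", "N0", "M0")]:
        print(triple, "->", stage_group(*triple))
```

Combinations that do not appear in the table as printed (for example T0, or M1 with N0) return `None` instead of a guessed stage.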
| Technique | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 | Class 8 | Class 9 |
|---|---|---|---|---|---|---|---|---|---|
| Model-Based | 1588 | 1314 | 637 | 835 | 1071 | 727 | 774 | 939 | 2557 |
| k-Means | 992 | 1098 | 1154 | 1612 | 1146 | 651 | 1232 | 926 | 1631 |
Distribution of records based on applying the Model-Based and k-Means clustering techniques.
| Model-Based class (rows) / k-Means class (columns) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 36 | 1094 | 28 | 0 | 429 | 1 | 0 | 0 |
| 2 | 21 | 95 | 0 | 63 | 17 | 214 | 44 | 860 | 0 |
| 3 | 26 | 101 | 21 | 61 | 117 | 8 | 240 | 18 | 45 |
| 4 | 0 | 87 | 0 | 675 | 52 | 0 | 9 | 12 | 0 |
| 5 | 0 | 18 | 39 | 15 | 52 | 0 | 938 | 9 | 0 |
| 6 | 0 | 727 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 754 | 20 | 0 | 0 | 0 | 0 |
| 8 | 0 | 12 | 0 | 16 | 888 | 0 | 0 | 23 | 0 |
| 9 | 945 | 22 | 0 | 0 | 0 | 0 | 0 | 4 | 1586 |
Comparison of k-Means to Model-Based clustering. Although several cells contain hundreds of records that both methods place in the same pair of classes, few cells dominate their Model-Based class and their k-Means class at the same time, implying that there is some, but limited, agreement between the two methods.
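One way to put a number on the agreement described in this caption is the adjusted Rand index, computed directly from the cross-tabulation. The sketch below is not part of the original analysis. It assumes that rows are Model-Based classes and columns are k-Means classes, an orientation the code checks against the per-class counts in the distribution table; the variable and function names are illustrative.

```python
from math import comb

# Cross-tabulation copied from the comparison table.
# Assumption: rows are Model-Based classes 1-9, columns are k-Means classes 1-9.
TABLE = [
    [0, 36, 1094, 28, 0, 429, 1, 0, 0],
    [21, 95, 0, 63, 17, 214, 44, 860, 0],
    [26, 101, 21, 61, 117, 8, 240, 18, 45],
    [0, 87, 0, 675, 52, 0, 9, 12, 0],
    [0, 18, 39, 15, 52, 0, 938, 9, 0],
    [0, 727, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 754, 20, 0, 0, 0, 0],
    [0, 12, 0, 16, 888, 0, 0, 23, 0],
    [945, 22, 0, 0, 0, 0, 0, 4, 1586],
]

# Per-class counts from the distribution table.
MODEL_BASED = [1588, 1314, 637, 835, 1071, 727, 774, 939, 2557]
K_MEANS = [992, 1098, 1154, 1612, 1146, 651, 1232, 926, 1631]

row_sums = [sum(row) for row in TABLE]
col_sums = [sum(col) for col in zip(*TABLE)]
total = sum(row_sums)


def adjusted_rand_index(table):
    """Adjusted Rand index of two partitions given their contingency table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    pairs_together = sum(comb(cell, 2) for r in table for cell in r)
    row_pairs = sum(comb(r, 2) for r in rows)
    col_pairs = sum(comb(c, 2) for c in cols)
    expected = row_pairs * col_pairs / comb(n, 2)
    maximum = (row_pairs + col_pairs) / 2
    return (pairs_together - expected) / (maximum - expected)


ari = adjusted_rand_index(TABLE)

if __name__ == "__main__":
    print("rows match Model-Based counts:", row_sums == MODEL_BASED)
    print("columns match k-Means counts:", col_sums == K_MEANS)
    print("total records:", total)
    print("adjusted Rand index:", round(ari, 3))
```

An index of 1 would mean the two partitions are identical up to relabeling, and a value near 0 would mean agreement no better than chance.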
| Class 1 | Class 2 | Class 3 | Class 4 | Class 5 |
|---|---|---|---|---|
| 3610 | 2787 | 2197 | 410 | 1438 |
Set of classes built from Non-Negative Matrix Factorization (r = 5).
| Classification Technique | Root Mean Squared Error of Linear Regression | Coefficient of Determination (R²) |
|---|---|---|
| Hierarchical Clustering | 16.202 | 0.06819 |
| Model-Based Classification | 16.250 | 0.05659 |
| k-Means Classification | 16.193 | 0.06731 |
| Self-Ordering Maps | 15.591 | 0.13539 |
| Non-Negative Matrix Factorization | 16.589 | 0.01923 |
| Principal Component Analysis (PCA) | 16.085 | 0.07969 |
Root mean squared error of linear regression and coefficient of determination values for the various classification techniques evaluated. While PCA did not have specific classes, regression against the component values themselves can be performed, which makes the comparison possible albeit slightly less meaningful.
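The two metrics in this table can be illustrated with a small sketch. It assumes the class label is treated as a categorical predictor, in which case the least-squares fitted value for each record is the mean survival of its class; the abstract does not say how the labels were encoded, so this is an assumption. RMSE here divides by the number of records. The survival values and labels are invented toy data, not SEER records.

```python
from collections import defaultdict
from math import sqrt


def class_mean_regression(labels, survival):
    """Regress survival on a categorical class label and return (RMSE, R^2).

    With a categorical predictor, the least-squares prediction for each
    record is the mean survival of the class it belongs to.
    """
    groups = defaultdict(list)
    for label, value in zip(labels, survival):
        groups[label].append(value)
    class_means = {label: sum(vals) / len(vals) for label, vals in groups.items()}

    predictions = [class_means[label] for label in labels]
    sse = sum((y - p) ** 2 for y, p in zip(survival, predictions))  # residual sum of squares
    grand_mean = sum(survival) / len(survival)
    sst = sum((y - grand_mean) ** 2 for y in survival)  # total sum of squares

    rmse = sqrt(sse / len(survival))
    r_squared = 1 - sse / sst
    return rmse, r_squared


# Toy data (invented for illustration): survival in months and a cluster label per record.
toy_survival = [10, 12, 14, 30, 34, 50, 54, 58]
toy_labels = [0, 0, 0, 1, 1, 2, 2, 2]

rmse, r_squared = class_mean_regression(toy_labels, toy_survival)

if __name__ == "__main__":
    print("RMSE:", rmse)
    print("R^2:", r_squared)
```

A clustering whose classes separate survival times well yields a lower RMSE and a higher R², which is the sense in which the table ranks the techniques.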