Neofytos Dimitriou, Ognjen Arandjelović, David J Harrison, Peter D Caie.
Abstract
Accurate prognosis is fundamental in planning an appropriate therapy for cancer patients. Owing to the heterogeneity of the disease, intra- and inter-pathologist variability, and the inherent limitations of current pathological reporting systems, patient outcome varies considerably within similarly staged patient cohorts. This is particularly true when classifying stage II colorectal cancer patients using the current TNM guidelines. The aim of the present work is to address this problem through the use of machine learning. In particular, we introduce a data-driven framework which makes use of a large number of diverse types of features, readily collected from immunofluorescence imagery. Its outstanding performance in predicting mortality in stage II patients (AUROC = 0.94) exceeds that of current clinical guidelines such as pT stage (AUROC = 0.65), and is demonstrated on a cohort of 173 colorectal cancer patients.
Keywords: Cancer microenvironment; Colorectal cancer
Year: 2018 PMID: 31304331 PMCID: PMC6550189 DOI: 10.1038/s41746-018-0057-x
Source DB: PubMed Journal: NPJ Digit Med ISSN: 2398-6352
Average AUROC and standard deviation (for n = 200) of trained classifiers on the training set using 20-times repeated tenfold cross-validation
| | LSVM | RSVM | LR | RF | KNN | NB |
|---|---|---|---|---|---|---|
| 5 year | 0.89 ± 0.12 | 0.89 ± 0.13 | 0.91 ± 0.12 | 0.89 ± 0.13 | 0.88 ± 0.12 | 0.86 ± 0.14 |
| 10 year | 0.89 ± 0.13 | 0.89 ± 0.12 | 0.91 ± 0.119 | 0.90 ± 0.13 | 0.89 ± 0.13 | 0.88 ± 0.12 |
LSVM linear kernel SVM, RSVM radial basis function kernel SVM, LR logistic regression, RF random forest, KNN k-nearest neighbours, NB naive Bayes
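The n = 200 in the caption above follows from 20 repeats of tenfold cross-validation (20 × 10 splits). The following is a minimal sketch of how such splits could be generated, assuming stratification by class label; the function name `repeated_stratified_kfold`, the seed, and the toy labels are illustrative assumptions, not taken from the paper.

```python
import random

def repeated_stratified_kfold(labels, k=10, repeats=20, seed=0):
    """Yield (train_idx, test_idx) pairs: k * repeats pairs in total.

    Hypothetical helper for illustration; stratification is an assumption.
    """
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    for _ in range(repeats):
        folds = [[] for _ in range(k)]
        offset = 0
        for idxs in by_class.values():
            shuffled = idxs[:]
            rng.shuffle(shuffled)
            # Deal indices round-robin so every fold keeps the class ratio.
            for j, i in enumerate(shuffled):
                folds[(j + offset) % k].append(i)
            offset += len(shuffled)
        for f in range(k):
            test = sorted(folds[f])
            train = sorted(i for g in range(k) if g != f for i in folds[g])
            yield train, test

# Toy labels (not the study data): 30 positives, 70 negatives.
labels = [1] * 30 + [0] * 70
splits = list(repeated_stratified_kfold(labels))
print(len(splits))
```

Each split would yield one AUROC value per classifier; the table entries are then the mean and standard deviation over all such values.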
Fig. 1 Tukey's honestly significant difference (HSD) test. No hyperparameter learning was employed in the experiments corresponding to plots a and b, in contrast to c and d
Fig. 2 Frequency of occurrence of each feature from the 20 runs of SFFS and SFBS each for 5-year prognosis. Only features with at least one occurrence are shown for clarity
Fig. 3 Frequency of occurrence of each feature from the 20 runs of SFFS and SFBS each for 10-year prognosis. Only features with at least one occurrence are shown for clarity
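The frequency plots described in the two captions above amount to counting how often each feature appears across the selection runs. A minimal sketch of that tally, assuming each run returns a list of selected feature names; the `runs` data here are made up for illustration and `selection_frequency` is a hypothetical helper.

```python
from collections import Counter

def selection_frequency(runs):
    """Count in how many runs each feature was selected."""
    counts = Counter()
    for selected in runs:
        # set() guards against a feature being listed twice in one run.
        counts.update(set(selected))
    return counts

# Illustrative runs only (not results from the paper).
runs = [
    ["number of PDCs", "sum area of vessels"],
    ["sum area of vessels", "tumour gland relative area (%)"],
    ["sum area of vessels"],
]
freq = selection_frequency(runs)
for name, n in freq.most_common():
    print(name, n)
```

Features that never occur simply do not appear in the counter, which matches the captions' note that only features with at least one occurrence are shown.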
Features of significance to both prognosis terms, and those which were specific to a particular term; seven and six features were used for 5 and 10-year prognosis, respectively
| | # | Features |
|---|---|---|
| Unique to 5-year prognosis | 4 | Nuclei in tumour mean DAPI intensity, number of CK objects with no associated nuclei, sum area of vessels, average DAPI intensity (tumour area) |
| Unique to 10-year prognosis | 3 | Nuclei in tumour mean D240 intensity, mean compactness of tumour glands, number of PDCs |
| Common to both prognoses | 3 | Nuclei in tumour bud mean DAPI intensity, tumour gland relative area (%), sum area of vessels |
CK pancytokeratin, PDCs poorly differentiated clusters
Fig. 4 Tukey's honestly significant difference (HSD) test. No hyperparameter learning was employed in the experiments corresponding to plots a and b, in contrast to c and d
Average AUROC and standard deviation (for n = 200) of each trained classifier using only features selected by SFFS and SFBS
| | LSVM | RSVM | LR | RF | KNN | NB |
|---|---|---|---|---|---|---|
| 5 years | 0.95 ± 0.08 | 0.95 ± 0.08 | 0.95 ± 0.08 | 0.93 ± 0.11 | 0.95 ± 0.08 | 0.93 ± 0.10 |
| 10 years | 0.95 ± 0.08 | 0.95 ± 0.08 | 0.95 ± 0.08 | 0.92 ± 0.10 | 0.95 ± 0.07 | 0.94 ± 0.09 |
The experiments used 20-times repeated tenfold cross-validation on the training data.
Fig. 5 ROC curves for the two prognostic terms of interest
Fig. 6 KM curves for 5-year prognosis
Fig. 7 KM curves for 10-year prognosis
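For readers unfamiliar with KM (Kaplan-Meier) curves, the estimator multiplies, at each observed death time, the fraction of at-risk patients who survive it: S(t) is the product of (1 − d_i / n_i) over event times t_i ≤ t. A minimal sketch under that standard definition follows; the function name and the toy follow-up data are assumptions for illustration, not the study's data or code.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times:  follow-up time per patient
    events: 1 if death was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set.
        n_at_risk -= removed
    return curve

# Toy data (years of follow-up); not from the study cohort.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Plotting one such curve per predicted risk group (low vs. high) gives the kind of comparison the KM figures show.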
Summary of low vs. high risk patient separation results
| | Differentiation (5/10 year) | T stage (5/10 year) | KNN (5/10 year) |
|---|---|---|---|
| Specificity | 0.95/0.88 | 0.82/0.84 | 0.89/0.84 |
| Sensitivity | 0.39/0.36 | 0.43/0.46 | 0.43/1.00 |
| Accuracy | 0.84/0.72 | 0.75/0.72 | 0.82/0.89 |
| AUROC | 0.62/0.62 | 0.62/0.65 | 0.77/0.94 |
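The four rows of the table above are standard binary-classification metrics. As a reminder of how they are defined, here is a minimal sketch that computes them from labels, thresholded predictions, and scores; the function names, the 0.5 threshold, and the toy data are illustrative assumptions, and AUROC is computed as the fraction of positive/negative pairs ranked correctly (ties count one half).

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fn, fp, tn) for binary labels, 1 = high risk."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn

def summary_metrics(y_true, y_pred):
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }

def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (not study data).
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold
metrics = summary_metrics(y_true, y_pred)
area = auroc(y_true, scores)
print(metrics, area)
```

Sensitivity, specificity, and accuracy depend on the chosen threshold, whereas AUROC summarises ranking quality across all thresholds.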
Summary of patient cohort statistics
| Characteristic | Value |
|---|---|
| Number of patients | 173 |
| Age (years) | |
| Range | 62.5 ± 33.5 |
| Median | 67 |
| Gender | |
| Male | 86 (50%) |
| Female | 87 (50%) |
| T Stage | |
| TX | 1 (1%) |
| T1 | 6 (3%) |
| T2 | 7 (4%) |
| T3 | 122 (71%) |
| T4 | 37 (21%) |
| N Stage | |
| N0 | 163 (94%) |
| N1 | 8 (5%) |
| N2 | 1 (1%) |
| N3 | 1 (1%) |
| M Stage | |
| MX | 9 (5%) |
| M0 | 161 (93%) |
| M1 | 3 (2%) |
| Site | |
| Rectum | 56 (32%) |
| Colon | 117 (68%) |
| Differentiation | |
| Undetermined | 3 (2%) |
| Poor | 25 (14%) |
| Moderate | 138 (80%) |
| Good | 7 (4%) |
The search space of each classifier based on the distributions over its hyperparameters (n.b. F denotes feature count; for biased categorical distributions, tuples (p, v) designate the sampling probability and the value assigned)
| Classifier | Hyperparameter | Distribution | Values |
|---|---|---|---|
| SVM, linear kernel | C | Log-uniform | [ln (1e−5), ln (1e2)] |
| | Class weight | Categorical | Balanced or none |
| SVM, RBF kernel | C | Log-uniform | [ln (1e−5), ln (1e2)] |
| | Gamma | Log-uniform | [ln (1e−3), ln (1e3)] |
| | Class weight | Categorical | Balanced or none |
| LR | Type of penalty | Categorical | L1 or L2 |
| | C | Log-uniform | [ln (1e−5), ln (1e2)] |
| | Class weight | Categorical | Balanced or none |
| RF | Number of trees | Log-uniform integer | [10, 1000] |
| | Criterion | Categorical | Gini or entropy |
| | Maximum features | Biased categorical | (0.2, √F), (0.1, ln F), (0.1, F), (0.6, U(0, F)) |
| | Maximum depth | Biased categorical | (0.1, 2), (0.1, 3), (0.1, 4), (0.7, none) |
| | Bootstrap | Categorical | True or False |
| | Class weight | Categorical | Balanced or none |
| KNN | K | Log-uniform integer | [1, 50] |
| | Weights | Categorical | Uniform, or Euclidean distance |
| | Metric | Categorical | Balanced or none |
| | P | Categorical | Balanced or none |
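To make the distribution types in the table concrete, the sketch below draws one random forest configuration from the listed search space. It is only an illustration under stated assumptions: the helper names are hypothetical, √F and ln F are rounded to integers of at least 1, and U(0, F) is interpreted as a uniform integer in [1, F] since zero features would be unusable. None of these details are confirmed by the source.

```python
import math
import random

def log_uniform(rng, low, high):
    """Sample uniformly in log space, i.e. exp(U(ln low, ln high))."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def log_uniform_int(rng, low, high):
    """Log-uniform sample rounded and clipped to the integer range."""
    return min(high, max(low, round(log_uniform(rng, low, high))))

def biased_choice(rng, pairs):
    """pairs: list of (probability, value); callables are evaluated lazily."""
    probs = [p for p, _ in pairs]
    values = [v for _, v in pairs]
    v = rng.choices(values, weights=probs, k=1)[0]
    return v() if callable(v) else v

def sample_rf(rng, n_features):
    """Draw one RF configuration; F is the feature count from the caption."""
    F = n_features
    return {
        "n_trees": log_uniform_int(rng, 10, 1000),
        "criterion": rng.choice(["gini", "entropy"]),
        "max_features": biased_choice(rng, [
            (0.2, max(1, round(math.sqrt(F)))),  # assumed rounding
            (0.1, max(1, round(math.log(F)))),   # assumed rounding
            (0.1, F),
            (0.6, lambda: rng.randint(1, F)),    # assumed reading of U(0, F)
        ]),
        "max_depth": biased_choice(
            rng, [(0.1, 2), (0.1, 3), (0.1, 4), (0.7, None)]),
        "bootstrap": rng.choice([True, False]),
        "class_weight": rng.choice(["balanced", None]),
    }

rng = random.Random(0)
print(sample_rf(rng, n_features=7))
print(log_uniform(rng, 1e-5, 1e2))  # e.g. a value for the SVM or LR C
```

Sampling in log space spreads draws evenly across orders of magnitude, which suits parameters such as C and gamma whose useful values span several decades.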