Bino Varghese1, Frank Chen2, Darryl Hwang2, Suzanne L Palmer2, Andre Luis De Castro Abreu3, Osamu Ukimura3, Monish Aron3, Manju Aron4, Inderbir Gill3, Vinay Duddalwar2,3, Gaurav Pandey5.
Abstract
Multiparametric magnetic resonance imaging (mpMRI) has become increasingly important for the clinical assessment of prostate cancer (PCa), but its interpretation is generally variable due to its relatively subjective nature. Radiomics and classification methods have shown potential for improving the accuracy and objectivity of mpMRI-based PCa assessment. However, prior studies have been limited to a small number of classification methods, evaluation using only the AUC score, and a non-rigorous assessment of the possible combinations of radiomics and classification methods. This paper presents a systematic and rigorous framework, comprising classification, cross-validation and statistical analyses, that was developed to identify the best performing classifier for PCa risk stratification based on mpMRI-derived radiomic features from a sizeable cohort. This classifier performed well in an independent validation set, including performing better than PI-RADS v2 in some aspects, indicating the value of objectively interpreting mpMRI images using radiomics and classification methods for PCa risk assessment.
Year: 2019 PMID: 30733585 PMCID: PMC6367324 DOI: 10.1038/s41598-018-38381-x
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Results of the performance evaluation of various classification algorithms and their resultant models tested in our framework, grouped by several evaluation measures ((A): high-risk class; (B): lower-risk class). Also shown are the results of the statistical comparison of these performances in the form of Critical Difference (CD) plots for the high (C) and lower (D) PCa risk classes, respectively. Classification algorithms, represented by combined vertical and horizontal lines, are displayed from left to right in order of the average rank obtained by their resultant models across the ten cross-validation rounds, and classifiers producing statistically equivalent performance are connected by horizontal lines. These results show that the Quadratic kernel-based SVM (QSVM) is the best performer overall, because it is the only classifier that is statistically the best performer (leftmost classifier in the plots, either by itself or tied with another classifier such as CSVM or LogReg) in terms of all the evaluation measures for both classes. The CD plots were drawn using open-source Matlab code.
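The average ranks that position each classifier in a CD plot can be sketched as follows. This is a minimal illustration and not the open-source Matlab code used for the figure: the function names, the toy scores, and the choice of `q_alpha` are assumptions. The critical difference uses the standard Nemenyi form, CD = q_alpha * sqrt(k(k+1) / (6N)), for k classifiers compared over N rounds.

```python
from math import sqrt


def average_ranks(scores_by_round):
    """Mean rank of each classifier over cross-validation rounds.

    scores_by_round: list of {classifier_name: score} dicts, higher = better.
    Rank 1 is best; tied classifiers share the average of their positions.
    """
    names = list(scores_by_round[0])
    totals = {name: 0.0 for name in names}
    for scores in scores_by_round:
        ordered = sorted(names, key=lambda name: scores[name], reverse=True)
        start = 0
        while start < len(ordered):
            end = start
            # Extend the group while the next classifier has the same score.
            while (end + 1 < len(ordered)
                   and scores[ordered[end + 1]] == scores[ordered[start]]):
                end += 1
            shared_rank = (start + end) / 2 + 1  # mean of 1-based positions
            for name in ordered[start:end + 1]:
                totals[name] += shared_rank
            start = end + 1
    return {name: totals[name] / len(scores_by_round) for name in names}


def critical_difference(q_alpha, n_classifiers, n_rounds):
    """Nemenyi critical difference: q_alpha * sqrt(k(k+1) / (6N))."""
    return q_alpha * sqrt(n_classifiers * (n_classifiers + 1) / (6.0 * n_rounds))


# Hypothetical scores for three classifiers over two rounds (not study data).
toy_rounds = [
    {"QSVM": 0.9, "CSVM": 0.8, "LogReg": 0.8},
    {"QSVM": 0.7, "CSVM": 0.9, "LogReg": 0.6},
]
ranks = average_ranks(toy_rounds)

# Assumption: q_alpha = 2.949 is the commonly tabulated Nemenyi value for
# 7 classifiers at alpha = 0.05; check a published table before reusing it.
cd = critical_difference(q_alpha=2.949, n_classifiers=7, n_rounds=10)
```

Two classifiers whose average ranks differ by less than `cd` would be treated as statistically equivalent, which corresponds to the horizontal connecting lines in the plots.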
Evaluation of the final QSVM-based radiomics classifier and alternatives/benchmarks on the independent validation set of 53 PCa patients in terms of several performance measures.
| Classifier/Benchmark | AUC | High-risk (minority) class: F-measure | High-risk (minority) class: Precision | High-risk (minority) class: Recall | Lower-risk (majority) class: F-measure | Lower-risk (majority) class: Precision | Lower-risk (majority) class: Recall |
|---|---|---|---|---|---|---|---|
| PI-RADS v2 | 0.73 | 0.52 | 0.45 | 0.61 | 0.82 | 0.87 | 0.78 |
| QSVM-based radiomics classifier | 0.71 (0.673) | 0.69 (4.37 × 10⁻³⁰) | 0.57 (2.07 × 10⁻²⁶) | 0.86 (2.00 × 10⁻²⁵) | 0.85 (4.12 × 10⁻¹⁶) | 0.94 (2.90 × 10⁻¹⁹) | 0.78 (5.58 × 10⁻⁶) |
| Randomized validation sets | 0.51 (0.01) | 0.45 (0.02) | 0.35 (0.03) | 0.62 (0.02) | 0.69 (0.02) | 0.82 (0.02) | 0.6 (0.03) |
The p-values obtained from the Friedman-Nemenyi test-based comparison of the performance of PI-RADS v2 and the radiomics classifier, for each evaluation measure, are shown in parentheses in the row for the QSVM-based radiomics classifier. Standard errors for the measures on the 100 randomized validation sets are shown in parentheses in the row for the randomized validation sets. The radiomics classifier produces more accurate predictions than PI-RADS v2, especially in terms of the class-specific measures (F-measure, Precision and Recall) that are more meaningful for class-imbalanced settings such as the cohorts in our study. It also performed better on the real validation set than on its randomized versions, indicating that the classifier did capture a real relationship between the radiomic features and PCa risk status.
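The class-specific measures in the table can be computed per class from true and predicted labels, as in the minimal sketch below. The function names and toy labels are illustrative. The `randomized_baseline` helper assumes that a randomized validation set is the real set with its true labels permuted, which the text does not spell out.

```python
import random
from statistics import mean, stdev


def class_metrics(y_true, y_pred, target):
    """Precision, recall and F-measure with `target` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure


def randomized_baseline(y_true, y_pred, target, n_sets=100, seed=0):
    """Mean and standard error of the F-measure over label-shuffled copies.

    Assumption: randomization means permuting the true labels while keeping
    the predictions fixed; the source does not describe the procedure.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_sets):
        shuffled = list(y_true)
        rng.shuffle(shuffled)
        scores.append(class_metrics(shuffled, y_pred, target)[2])
    return mean(scores), stdev(scores) / len(scores) ** 0.5


# Toy labels for illustration only (not the study's validation set).
y_true = ["high", "high", "low", "low", "low"]
y_pred = ["high", "low", "high", "low", "low"]
high_risk = class_metrics(y_true, y_pred, "high")
lower_risk = class_metrics(y_true, y_pred, "low")
baseline_mean, baseline_se = randomized_baseline(y_true, y_pred, "high")
```

Reporting the measures separately for each class keeps the minority high-risk class from being masked by the majority class, which is the concern raised above about class imbalance.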
Baseline characteristics of the patients and their PCa tumors in the development and validation sets.
| Characteristic | Development set (N = 68): All | Development set: Lower risk (N = 54; 79.40%) | Development set: High risk (N = 14; 20.60%) | Validation set (N = 53): All | Validation set: Lower risk (N = 39; 73.58%) | Validation set: High risk (N = 14; 26.41%) | Development vs. validation set t-test p-value |
|---|---|---|---|---|---|---|---|
| Gleason | 6.93 (0.86) | 6.58 (0.55) | 8.00 (0.65) | 7.13 (0.91) | 6.43 (0.07) | 8.36 (0.71) | 0.21 |
| PSA | 7.51 (4.54) | 5.99 (2.31) | 12.46 (5.99) | 8.26 (9.42) | 5.33 (2.73) | 15.55 (5.45) | 0.60 |
| PI-RADS v2 | 3.85 (0.60) | 3.71 (0.51) | 4.21 (0.56) | 3.79 (0.74) | 3.31 (0.48) | 4.36 (0.81) | 0.63 |
In addition to the mean value of each characteristic, its standard deviation is shown in parentheses. The p-values from the t-test of the comparison of the characteristics between the development and validation sets, shown in the final column, demonstrate that there are no significant differences between the two sets that could potentially bias the classification results.
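A comparison of this kind can be approximated from the summary statistics alone, as sketched below. The text does not say which t-test variant was used, so Welch's unequal-variance form is an assumption here. The p-value uses a large-sample normal approximation rather than the exact t distribution, so it is only roughly comparable to the tabulated values.

```python
from math import erfc, sqrt


def welch_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Welch t statistic and approximate degrees of freedom from summaries."""
    var1, var2 = sd1 ** 2 / n1, sd2 ** 2 / n2  # squared standard errors
    t = (mean1 - mean2) / sqrt(var1 + var2)
    df = (var1 + var2) ** 2 / (var1 ** 2 / (n1 - 1) + var2 ** 2 / (n2 - 1))
    return t, df


def two_sided_p_normal_approx(t):
    """Two-sided p-value treating t as standard normal (an approximation)."""
    return erfc(abs(t) / sqrt(2))


# PSA row of the table: 7.51 (4.54) with N = 68 versus 8.26 (9.42) with N = 53.
t_psa, df_psa = welch_t_from_summary(7.51, 4.54, 68, 8.26, 9.42, 53)
p_psa = two_sided_p_normal_approx(t_psa)
```

With the exact t distribution (for example from a statistics library) the p-value would differ slightly; the sketch is meant only to show how the means, standard deviations and set sizes in the table enter the test.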
Figure 2. Flowchart of some sample quantitative radiomic features used in our study that were extracted from segmented tumor regions of interest (ROI) of mpMRI images. In summary, 55 different features were extracted per image type (i.e., T2WI or ADC) using four different texture extraction methods, yielding 110 radiomic features per patient. The four texture methods included histogram analysis, Gray-Level Co-occurrence and Difference Matrix methods (GLCM and GLDM) and Fast Fourier Transform (FFT). Some of these features are highlighted in green, blue and red respectively. The full list and details of these features are provided in the online Appendix in Supplementary Information. Note that all these features were 2D, as the input imaging data were two-dimensional.
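As an illustration of two of the texture methods named in the caption, the sketch below computes simple histogram statistics and a gray-level co-occurrence matrix (GLCM) on a small 2D region. It is not the study's feature-extraction code: the function names, the single horizontal offset, the symmetric normalization and the toy ROI are all assumptions, and the actual 55 features are those listed in the Supplementary Information.

```python
from statistics import mean, pstdev


def histogram_features(roi):
    """First-order statistics of the gray levels inside a 2D ROI."""
    pixels = [value for row in roi for value in row]
    return {"mean": mean(pixels), "std": pstdev(pixels)}


def glcm(roi, levels, dx=1, dy=0):
    """Normalized symmetric co-occurrence matrix for one pixel offset."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(roi), len(roi[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = roi[r][c], roi[r2][c2]
                counts[a][b] += 1  # count the pair in both directions
                counts[b][a] += 1  # so that the matrix is symmetric
    total = sum(sum(row) for row in counts)
    return [[value / total for value in row] for row in counts]


def glcm_contrast(p):
    """Contrast: co-occurrence probabilities weighted by squared level gap."""
    return sum(p[i][j] * (i - j) ** 2
               for i in range(len(p)) for j in range(len(p)))


def glcm_energy(p):
    """Energy (angular second moment): sum of squared probabilities."""
    return sum(value ** 2 for row in p for value in row)


# Toy 4x4 ROI already quantized to 4 gray levels (not patient data).
toy_roi = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
first_order = histogram_features(toy_roi)
p = glcm(toy_roi, levels=4)
contrast = glcm_contrast(p)
energy = glcm_energy(p)
```

Applying such functions to each image type (T2WI and ADC) and concatenating the results is one way to arrive at a per-patient feature vector of the kind described in the caption.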
Figure 3. Workflow of our ML-based framework used to identify the best combination of radiomic features and classification algorithm for categorizing PCa patients into high-risk and lower-risk categories. Cross-validation was used to identify the best performing algorithm out of seven commonly used algorithms, which was then used to train the final classifier on the entire development set (68 PCa patients). This classifier was then evaluated on an independent validation set of 53 PCa patients in terms of a variety of performance measures, namely AUC, Fmax, Pmax and Rmax.
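The selection step in this workflow can be sketched as a stratified cross-validation loop over candidate algorithms, followed by retraining the winner on the whole development set. This is a simplified illustration: each algorithm is a hypothetical callable `fit(X, y)` that returns a `predict(x)` function, and the winner is chosen by the mean of a single score, whereas the study compared algorithms statistically across several measures.

```python
import random
from statistics import mean


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal each class round-robin, continuing where the last one stopped.
        for position, index in enumerate(members):
            folds[(position + offset) % k].append(index)
        offset += len(members)
    return folds


def select_algorithm(X, y, algorithms, score, k=10, seed=0):
    """Pick the algorithm with the best mean CV score, then refit on all data.

    algorithms: {name: fit}, where fit(X_train, y_train) returns predict(x).
    score: function (y_true, y_pred) -> float, higher = better.
    """
    cv_scores = {name: [] for name in algorithms}
    for test_idx in stratified_folds(y, k, seed):
        if not test_idx:
            continue
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        y_test = [y[i] for i in test_idx]
        for name, fit in algorithms.items():
            predict = fit(X_train, y_train)
            y_hat = [predict(X[i]) for i in test_idx]
            cv_scores[name].append(score(y_test, y_hat))
    best = max(cv_scores, key=lambda name: mean(cv_scores[name]))
    final_model = algorithms[best](X, y)  # retrain on the full development set
    return best, final_model
```

The returned `final_model` would then be applied once to the independent validation set, so that the validation patients play no part in choosing the algorithm.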