Patrick Leo, Robin Elliott, Natalie N C Shih, Sanjay Gupta, Michael Feldman, Anant Madabhushi.
Abstract
Site variation in fixation, staining, and scanning can confound automated tissue-based image classifiers for disease characterization. In this study we incorporated stability into four feature selection methods for identifying the most robust and discriminating features for two prostate histopathology classification tasks. We evaluated 242 morphology features from N = 212 prostatectomy specimens from four sites for automated cancer detection and grading. We quantified instability as the rate of significant cross-site feature differences. We mapped feature stability and discriminability using 188 non-cancerous and 210 cancerous regions via 3-fold cross validation, then held one site out, creating independent training and testing sets. In training, one feature set was selected only for discriminability, another for discriminability and stability. We trained a classifier with each feature set, testing on the held-out site. Experiments were repeated with 117 Gleason grade 3 and 112 grade 4 regions. Stability was calculated across non-cancerous regions. Gland shape features yielded the best trade-off between stability and area under the receiver operating characteristic curve (AUC), while co-occurrence texture features were generally unstable. Our stability-informed method produced a cancer detection AUC of 0.98 ± 0.05 and increased average Gleason grading AUC by 4.38%. Color normalization of the images tended to exacerbate feature instability.
Year: 2018 PMID: 30297720 PMCID: PMC6175913 DOI: 10.1038/s41598-018-33026-5
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
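The abstract defines instability as the rate of significant cross-site feature differences. A minimal sketch of that idea follows, assuming a two-sided permutation test on the difference in means stands in for whichever significance test the authors used; the function names, the 0.05 threshold, and the toy site data are all hypothetical.

```python
import random
from itertools import combinations


def _mean(values):
    return sum(values) / len(values)


def permutation_p_value(a, b, n_perm=1000, rng=None):
    """Two-sided permutation test on the difference in means (a stand-in test)."""
    if rng is None:
        rng = random.Random(0)
    observed = abs(_mean(a) - _mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[:len(a)]) - _mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


def instability(values_by_site, alpha=0.05, seed=0):
    """Fraction of site pairs whose feature values differ significantly."""
    rng = random.Random(seed)
    pairs = list(combinations(sorted(values_by_site), 2))
    n_significant = sum(
        permutation_p_value(values_by_site[s], values_by_site[t], rng=rng) < alpha
        for s, t in pairs
    )
    return n_significant / len(pairs)


# Hypothetical feature values in non-cancerous regions at four sites.
stable = {site: [1.0, 2.0, 3.0, 4.0, 5.0] for site in "ABCD"}
shifted = {
    site: [10.0 * i + 0.1 * k for k in range(10)]
    for i, site in enumerate("ABCD")
}

print(instability(stable))   # identical distributions at every site
print(instability(shifted))  # every site pair is clearly separated
```

A feature whose distribution is the same at every site scores 0, and one that differs between every pair of sites scores 1, so lower values indicate a more stable feature.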
Figure 1. Effect of small variations in segmentation on the resulting gland-derived feature values. In this experiment we randomly removed a varying percentage of the total number of glands in a digitized image of a radical prostatectomy specimen in order to evaluate the corresponding effect on feature instability. (a) Sample benign regions of interest (ROIs) along with automatically identified gland boundaries are shown. Blue glands are those randomly selected for removal. (b) Gland sub-graph map, a histomorphometric feature which quantitatively captures gland architecture on the digital slide image, is illustrated. The addition of the glands with blue boundaries, though representing just 10% of the total number of glands, greatly changes the sub-graph map by connecting previously disconnected glands. Different gland detection algorithms, no matter how accurate, are likely to miss at least some percentage of glands in the image, and hence the choice of algorithm can substantially impact the resulting features. (c) Plot of the percentage change in six features when removing 0 to 20% of randomly chosen glands from the region in (a). We systematically removed 0 to 20% of all glands in the image, in increments of 1%. For every removal percentage, we ran 10 simulations and computed the six feature values. The averaged values for each gland removal percentage are reported in (c). The values of the three most stable (solid lines) and three most unstable (dashed lines) features are shown.
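The removal experiment in panel (c) can be sketched as follows. This is a toy illustration and not the authors' code: the gland centroids are random points, and mean nearest-neighbor distance is a hypothetical stand-in for the six gland-derived features in the figure.

```python
import math
import random


def mean_nearest_neighbor_distance(points):
    """Hypothetical stand-in for a gland architecture feature."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)


def removal_curve(points, feature, max_pct=20, runs=10, seed=0):
    """Average percentage change in a feature after random gland removal."""
    rng = random.Random(seed)
    baseline = feature(points)
    curve = {}
    for pct in range(max_pct + 1):  # 0% to 20% in 1% increments
        n_remove = round(len(points) * pct / 100)
        changes = []
        for _ in range(runs):  # 10 simulations per removal percentage
            kept = rng.sample(points, len(points) - n_remove)
            changes.append(100.0 * (feature(kept) - baseline) / baseline)
        curve[pct] = sum(changes) / runs
    return curve


# Hypothetical gland centroids (in pixels) for one region of interest.
centroid_rng = random.Random(1)
glands = [
    (centroid_rng.uniform(0, 1000), centroid_rng.uniform(0, 1000))
    for _ in range(60)
]

curve = removal_curve(glands, mean_nearest_neighbor_distance)
for pct in (0, 5, 10, 20):
    print(pct, round(curve[pct], 2))
```

A feature whose curve stays near zero across removal percentages would count as stable under this experiment, while one that drifts quickly would count as unstable.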
Table 1. Patient dataset.
| Site | Patients | Cancer regions | Non-cancer regions | Grade 3 regions | Grade 4 regions |
|---|---|---|---|---|---|
| Univ. of Pennsylvania | 80 | 80 | 73 | 39 | 44 |
| Univ. of Pittsburgh | 35 | 35 | 26 | 24 | 24 |
| Roswell Park | 33 | 33 | 26 | 28 | 22 |
| UH Cleveland | 64 | 62 | 63 | 26 | 22 |
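As a quick consistency check, the per-site counts in the table can be summed and compared with the totals quoted in the abstract (212 specimens, 210 cancerous and 188 non-cancerous regions, 117 grade 3 and 112 grade 4 regions). The variable names below are illustrative only.

```python
# Per-site counts copied from the patient dataset table:
# (patients, cancer regions, non-cancer regions, grade 3 regions, grade 4 regions)
rows = {
    "Univ. of Pennsylvania": (80, 80, 73, 39, 44),
    "Univ. of Pittsburgh": (35, 35, 26, 24, 24),
    "Roswell Park": (33, 33, 26, 28, 22),
    "UH Cleveland": (64, 62, 63, 26, 22),
}
columns = ("patients", "cancer", "non_cancer", "grade3", "grade4")

# Sum each column across the four sites.
totals = {
    name: sum(counts[i] for counts in rows.values())
    for i, name in enumerate(columns)
}
print(totals)
```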
Figure 2. Annotated digitized radical prostatectomy image with corresponding feature maps of three feature families (shape, global graph, and sub-graph). (a) Radical prostatectomy specimen with expert pathologist annotations for non-cancerous (yellow), cancerous (green), homogeneous Gleason 3 (red) and homogeneous Gleason 4 (blue) regions. (b,f,j) Benign, (c,g,k) cancerous, (d,h,l) Gleason 3, and (e,i,m) Gleason 4 regions of interest. (b–e) Visualization of automated segmentation using the method of Nguyen et al.[21]. From these segmentations, gland area, perimeter, and boundary descriptors are calculated. (f–i) Delaunay triangulation, from which measures of gland arrangement and density such as average and standard deviation of edge length and polygon area are extracted. (j–m) Sub-graphs, local graphs of gland architecture from which measurements relating to gland packing, average degree and radius of graphs, number of isolated nodes, and clustering descriptors are extracted.
Figure 3. (a) PI-AUC plot for 242 features for 212 studies across four sites for cancer vs. non-cancer classification. Each feature is represented by a dot, color coded according to feature family. On the X-axis is the PI value for each feature in the non-cancerous regions of the four sites. On the Y-axis is the feature’s AUC averaged across 100 iterations of 3-fold cross validation across all patients from all four sites. Shown are 242 features from the global graph (blue), shape (red), disorder (green), sub-graph (yellow), and Haralick (purple) feature families. The optimal high-AUC, low-PI space is dominated by shape features while the low-AUC, high-PI space comprises Haralick and global graph features. (b) ROIs corresponding to features occupying different regions in the PI-AUC space, see Table 2.
Table 2. 70 features found in ROIs of the PI-AUC space shown in Fig. 3(b).
| (S) Mean distance ratio | (G) Voronoi area std. deviation |
| (S) Mean smoothness | (G) Voronoi chord std. deviation |
| (S) Mean invariant moment 1 | (G) Delaunay side length min/max |
| (S) Mean invariant moment 2 | (G) Delaunay triangle area min/max |
| (S) Mean invariant moment 4 | (G) Delaunay triangle area std. deviation |
| (S) Std. deviation distance ratio | (G) Voronoi polygon area |
| (S) Std. deviation smoothness | (SG) Number of end nodes |
| (S) Std. deviation Fourier descriptor 1 | (H) Mean average intensity |
| (S) Std. deviation Fourier descriptor 2 | (H) Mean entropy |
| (S) Std. deviation Fourier descriptor 3 | (H) Std. deviation entropy |
| (S) Std. deviation Fourier descriptor 4 | (H) Mean information measure 1 |
| (S) Std. deviation Fourier descriptor 5 | |
| (S) Std. deviation Fourier descriptor 8 | (G) Std. deviation neighbors in 40 micron radius |
| (S) Mean invariant moment 6 | (G) Std. deviation neighbors in 60 micron radius |
| (G) Std. deviation neighbors in 80 micron radius | |
| (S) Mean fractal dimension | (G) Std. deviation neighbors in 100 micron radius |
| (S) Mean Fourier descriptor 2 | |
| (S) Mean Fourier descriptor 5 | (D) Std. deviation tensor contrast energy |
| (S) Median invariant moment 7 | (D) Mean tensor contrast inverse moment |
| (S) Median fractal dimension | (D) Std. deviation tensor contrast inverse moment |
| (S) Median Fourier descriptor 1 | (D) Mean tensor contrast average |
| (S) Median Fourier descriptor 2 | (D) Std. deviation tensor contrast average |
| (S) Min/max invariant moment 6 | (D) Std. deviation tensor contrast variance |
| (S) Min/max invariant moment 7 | (D) Mean tensor contrast entropy |
| (S) Min/max Fourier descriptor 7 | (D) Mean tensor contrast entropy |
| (S) Min/max Fourier descriptor 9 | (D) Std. deviation tensor intensity average |
| (D) Mean tensor contrast entropy | |
| (D) Range tensor contrast energy | (D) Std. deviation tensor intensity variance |
| (D) Range tensor contrast inverse moment | (D) Mean tensor intensity entropy |
| (D) Range tensor contrast variance | (D) Std. deviation tensor intensity entropy |
| (D) Range tensor contrast entropy | (D) Mean tensor entropy |
| (D) Range tensor intensity average | (D) Mean tensor energy |
| (D) Range tensor intensity variance | (D) Std. deviation tensor energy |
| (D) Range tensor intensity entropy | (D) Mean tensor correlation |
| (D) Range tensor entropy | (D) Std. deviation tensor correlation |
| (D) Range tensor correlation | (D) Mean tensor information measure 2 |
| (D) Range tensor information measure 1 | (D) Std. deviation tensor information measure 2 |
Feature PI and AUC values were found by using all 212 patients from all four sites. PI is calculated across non-cancerous regions and AUC is the mean from 100 iterations of 3-fold cross validation for the cancer vs. non-cancer classification task. Cross validation was performed independently for each site. The family of each feature is indicated as graph (G), shape (S), disorder (D), sub-graph (SG) or Haralick (H).
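To make the PI-AUC space concrete, the sketch below computes a single-feature AUC with the pairwise rank formulation and then picks out features in the high-AUC, low-PI corner. It is an illustration only: the thresholds, feature names, and values are hypothetical, and the authors' AUC comes from cross-validated classifiers, not from this direct calculation.

```python
def auc(pos, neg):
    """Probability that a positive sample scores above a negative one (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def high_auc_low_pi(features, min_auc=0.8, max_pi=0.2):
    """Names of features in the desirable corner of the PI-AUC plot.

    `features` maps a feature name to an (AUC, PI) pair; both thresholds
    are assumptions chosen for illustration.
    """
    return sorted(
        name
        for name, (feature_auc, pi) in features.items()
        if feature_auc >= min_auc and pi <= max_pi
    )


# Hypothetical (AUC, PI) pairs for three features.
features = {
    "shape_feature": (0.92, 0.05),
    "texture_feature": (0.90, 0.60),
    "graph_feature": (0.61, 0.10),
}

print(auc([3, 4, 5], [1, 2, 3]))
print(high_auc_low_pi(features))
```

Only the feature that is both discriminating and stable survives the filter; the texture-like example is discriminating but unstable, and the graph-like example is stable but weakly discriminating.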
Table 3. Mean (standard deviation) of AUC for the cancer vs. non-cancer classification problem across the four hold-one-site-out folds with the stable and discriminating feature set and the discriminating-only feature set.
| Classifier | SFS: stable and discriminating | SFS: discriminating only | SFS: % improvement | WLCX: stable and discriminating | WLCX: discriminating only | WLCX: % improvement |
|---|---|---|---|---|---|---|
| LDA | 0.99 (0.01) | 0.99 (0.01) | −0.15 | 0.96 (0.04) | 0.97 (0.04) | −0.22 |
| QDA | 0.98 (0.02) | 0.99 (0.02) | −0.11 | 0.95 (0.04) | 0.96 (0.05) | −1.18 |
| SVM | 0.98 (0.01) | 0.99 (0.01) | −0.68 | 0.96 (0.03) | 0.97 (0.04) | −0.36 |
| RF | 0.98 (0.02) | 0.99 (0.01) | −0.84 | 0.95 (0.03) | 0.96 (0.03) | −1.63 |

| Classifier | Stable and discriminating | Discriminating only | % improvement | Stable and discriminating | Discriminating only | % improvement |
|---|---|---|---|---|---|---|
| LDA | 0.99 (0.01) | 0.99 (0.01) | −0.37 | 0.99 (0.01) | 0.99 (0.01) | 0.00 |
| QDA | 0.96 (0.04) | 0.98 (0.02) | −2.32 | 0.98 (0.03) | 0.98 (0.05) | 0.00 |
| SVM | 0.97 (0.02) | 0.99 (0.01) | −1.50 | 0.98 (0.02) | 0.98 (0.01) | −0.03 |
| RF | 0.99 (0.01) | 0.99 (0.01) | | 0.98 (0.00) | 0.99 (0.00) | −0.66 |
For each classifier model, the top 5 most stable and discriminating features and the top 5 most discriminating features were used to construct the stable and discriminating feature set and the discriminating-only feature set, respectively. For every feature selection-classification pair, four models were trained and validated, one for every possible combination of three of the four sites. The three chosen sites were combined and used for training and the held-out site was used for validation. The improvement of the stable and discriminating set over the discriminating-only set is shown; a positive improvement indicates that the stable and discriminating set outperformed the discriminating-only set. Note that for this particular problem, the prediction AUCs for all models were very high, nearly perfect in most cases.
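The hold-one-site-out protocol described above can be sketched as follows. The site names come from the patient dataset table; everything else is an assumption for illustration, including the toy data and the trivial single-feature "classifier" that stands in for LDA, QDA, SVM, and RF.

```python
from statistics import mean, stdev

SITES = ["Univ. of Pennsylvania", "Univ. of Pittsburgh", "Roswell Park", "UH Cleveland"]


def hold_one_site_out(sites):
    """Yield (training sites, held-out site) for every choice of held-out site."""
    for held_out in sites:
        yield [s for s in sites if s != held_out], held_out


def auc(pos, neg):
    """Pairwise-rank AUC; ties count half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate_feature(regions, sites):
    """Mean and standard deviation of held-out AUC across the folds.

    `regions` is a list of dicts with keys 'site', 'value', 'label' (1 = cancer).
    """
    fold_aucs = []
    for train_sites, test_site in hold_one_site_out(sites):
        train = [r for r in regions if r["site"] in train_sites]
        test = [r for r in regions if r["site"] == test_site]
        # Stand-in "training": learn whether higher values indicate cancer.
        pos_mean = mean(r["value"] for r in train if r["label"] == 1)
        neg_mean = mean(r["value"] for r in train if r["label"] == 0)
        sign = 1.0 if pos_mean >= neg_mean else -1.0
        pos = [sign * r["value"] for r in test if r["label"] == 1]
        neg = [sign * r["value"] for r in test if r["label"] == 0]
        fold_aucs.append(auc(pos, neg))
    return mean(fold_aucs), stdev(fold_aucs)


# Hypothetical, perfectly separable toy data: three regions per class per site.
regions = []
for site in SITES:
    regions += [{"site": site, "value": v, "label": 1} for v in (5.0, 6.0, 7.0)]
    regions += [{"site": site, "value": v, "label": 0} for v in (1.0, 2.0, 3.0)]

mean_auc, sd_auc = evaluate_feature(regions, SITES)
print(mean_auc, sd_auc)
```

Because no region from the held-out site is seen during training, each fold's AUC reflects how well a model transfers to an unseen site, which is the quantity the table summarizes as mean (standard deviation).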
Figure 4. (a) PI-AUC plot for 242 features for 157 studies across four sites for Gleason 3 vs. Gleason 4 classification. Each feature is represented by a dot, color coded according to feature family. On the X-axis is the PI value for each feature in the non-cancerous regions of the four sites. On the Y-axis is the feature’s AUC averaged across 100 iterations of 3-fold cross validation across all patients from all four sites. Shown are 242 features from the global graph (blue), shape (red), disorder (green), sub-graph (yellow), and Haralick (purple) feature families. The optimal high-AUC, low-PI space comprises shape and sub-graph features while the low-AUC, high-PI space comprises chiefly Haralick and global graph features. (b) ROIs corresponding to features occupying different regions in the PI-AUC space, see Table 4.
Table 4. 32 features found in ROIs of the PI-AUC space shown in Fig. 4(b).
| (S) Mean invariant moment 1 | (G) Voronoi perimeter average |
| (S) Std. deviation Fourier descriptor 2 | (G) Delaunay side length std. deviation |
| (S) Std. deviation Fourier descriptor 3 | (G) Delaunay side length average |
| (S) Median invariant moment 1 | (G) Delaunay triangle area std. deviation |
| (S) Median invariant moment 6 | (G) Delaunay triangle area average |
| (S) Median fractal dimension | (G) Delaunay triangle area disorder |
| (SG) Mean of edge length | |
| (SG) Skewness of edge length | (SG) Average eccentricity |
| (SG) Kurtosis of edge length | (SG) Diameter |
| (SG) Radius | |
| (D) Mean tensor contrast inverse moment | (SG) Average eccentricity 90th percentile |
| (D) Mean tensor contrast entropy | (SG) Diameter 90th percentile |
| (D) Mean tensor intensity entropy | (SG) Radius 90th percentile |
| (D) Mean tensor entropy | (SG) Average path length |
| (D) Mean tensor energy | (SG) Clustering coefficient C |
| (D) Mean tensor correlation | (SG) Clustering coefficient D |
| (D) Mean tensor information measure 2 | (SG) Clustering coefficient E |
Feature PI values were found by using all 188 non-cancerous regions from all four sites. PI is calculated across non-cancerous regions and AUC is the mean from 100 iterations of 3-fold cross validation for the Gleason 3 vs. Gleason 4 classification task. Cross validation was performed independently for each site. The family of each feature is indicated as graph (G), shape (S), disorder (D), sub-graph (SG) or Haralick (H).
Table 5. Mean (standard deviation) of AUC for the Gleason 3 vs. Gleason 4 classification problem across the four hold-one-site-out folds with the stable and discriminating feature set and the discriminating-only feature set.
| Classifier | SFS: stable and discriminating | SFS: discriminating only | SFS: % improvement | WLCX: stable and discriminating | WLCX: discriminating only | WLCX: % improvement |
|---|---|---|---|---|---|---|
| LDA | 0.75 (0.07) | 0.67 (0.06) | | 0.74 (0.05) | 0.71 (0.10) | |
| QDA | 0.77 (0.04) | 0.69 (0.09) | | 0.70 (0.06) | 0.69 (0.06) | |
| SVM | 0.71 (0.08) | 0.65 (0.10) | | 0.71 (0.05) | 0.67 (0.09) | |
| RF | 0.72 (0.06) | 0.68 (0.10) | | 0.76 (0.04) | 0.71 (0.07) | |

| Classifier | Stable and discriminating | Discriminating only | % improvement | Stable and discriminating | Discriminating only | % improvement |
|---|---|---|---|---|---|---|
| LDA | 0.71 (0.03) | 0.71 (0.02) | −0.66 | 0.70 (0.08) | 0.70 (0.05) | −0.15 |
| QDA | 0.66 (0.06) | 0.60 (0.08) | | 0.71 (0.05) | 0.69 (0.07) | |
| SVM | 0.64 (0.06) | 0.60 (0.08) | | 0.72 (0.06) | 0.71 (0.08) | |
| RF | 0.68 (0.04) | 0.72 (0.06) | −6.02 | 0.74 (0.03) | 0.74 (0.07) | |
For each classifier model, the top 5 most stable and discriminating features and the top 5 most discriminating features were used to construct the stable and discriminating feature set and the discriminating-only feature set, respectively. For every feature selection-classification pair, four models were trained and validated, one for every possible combination of three of the four sites. The three chosen sites were combined and used for training and the held-out site was used for validation. The improvement of the stable and discriminating set over the discriminating-only set is shown; a positive improvement indicates that the stable and discriminating set outperformed the discriminating-only set. The stable and discriminating set improved on the discriminating-only set in 13 of the 16 cases, with an average improvement of 5.92% in those 13 cases, compared to 2.28% in the cases where the discriminating-only set was superior.
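The 2.28% figure can be checked against the three negative entries that survive in the table (−0.66, −0.15, and −6.02). The sketch below also shows one plausible definition of percentage improvement, relative to the discriminating-only AUC; that formula is an assumption, and because the tabulated AUCs are rounded to two decimals it does not reproduce the tabulated percentages exactly.

```python
def percent_improvement(auc_stable, auc_discriminating_only):
    """Assumed definition: relative change with respect to the discriminating-only AUC."""
    return 100.0 * (auc_stable - auc_discriminating_only) / auc_discriminating_only


# The three negative improvements listed in the Gleason grading table.
negative_cases = [-0.66, -0.15, -6.02]
average_decline = sum(abs(x) for x in negative_cases) / len(negative_cases)
print(round(average_decline, 2))

# Rounded RF AUCs from the table (0.68 vs. 0.72); the table reports -6.02,
# presumably computed from unrounded values.
print(round(percent_improvement(0.68, 0.72), 2))
```

The three negative cases plus the 13 positive ones account for all 16 feature selection-classifier pairs.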
Figure 5. (a) Boxplots of mean red, green, and blue intensities in 188 non-cancerous regions pre- (left boxes) and post- (right boxes) color normalization. While our chosen normalization method works in the stain vector space, not in the RGB space, the range and variation of mean color intensities decreased after normalization, especially in the green channel. This suggests that normalization has reduced the variation in color across the images. (b) Feature PI by family before (lighter bars) and after (darker bars) color normalization. PI was measured across the 188 non-cancerous regions of the four sites before and after those regions were normalized. Color normalization increased instability in every feature family with an especially strong effect on the disorder features. These results suggest that color normalization is inadequate to resolve the problem of feature instability from site variation and may even worsen instability.
Table 6. 20 features which showed an improvement (left columns) or worsening (right columns) in instability following color normalization.
| Feature stabilized by normalization | PI change | Feature destabilized by normalization | PI change |
|---|---|---|---|
| (S) Std. deviation distance ratio | −0.59 | (S) Min/max Fourier descriptor 9 | 0.58 |
| (S) Std. deviation area ratio | −0.59 | (S) Median perimeter ratio | 0.56 |
| (G) Voronoi chord std. deviation | −0.55 | (S) Median Fourier descriptor 8 | 0.54 |
| (G) Voronoi perimeter std. deviation | −0.54 | (S) Median invariant moment 1 | 0.49 |
| (S) Std. deviation of std. deviation of distance | −0.54 | (S) Min/max Fourier descriptor 1 | 0.46 |
| (G) Voronoi area min/max | −0.46 | (D) Mean tensor contrast average | 0.45 |
| (G) Voronoi area disorder | −0.45 | (H) Mean energy | 0.44 |
| (D) Range of tensor information measure 1 | −0.43 | (S) Median Fourier descriptor 7 | 0.44 |
| (S) Std. deviation of variance of distance | −0.40 | (S) Median invariant moment 2 | 0.43 |
| (G) Voronoi chord disorder | −0.37 | (S) Mean invariant moment 1 | 0.42 |
N = 188 non-cancerous regions from 188 patients across all four sites were color normalized to a template and instability across the four sites was calculated before and after normalization. The 10 features with the largest absolute PI change pre- and post-normalization in each direction are shown. A negative PI change signifies a reduction in feature instability. The family of each feature is listed as graph (G), shape (S), disorder (D), sub-graph (SG), and Haralick (H).
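The ranking behind this table can be sketched as a difference of PI values followed by a sort in each direction. The feature names and PI values below are hypothetical; only the sign convention (negative change means reduced instability) is taken from the text.

```python
def pi_changes(pi_before, pi_after):
    """PI change per feature: post-normalization PI minus pre-normalization PI."""
    return {name: pi_after[name] - pi_before[name] for name in pi_before}


def top_changes(changes, k=10):
    """Largest PI decreases (stabilized) and increases (destabilized), up to k each."""
    ordered = sorted(changes.items(), key=lambda item: item[1])
    stabilized = [item for item in ordered if item[1] < 0][:k]
    destabilized = [item for item in reversed(ordered) if item[1] > 0][:k]
    return stabilized, destabilized


# Hypothetical PI values before and after color normalization.
pi_before = {"f1": 0.75, "f2": 0.25, "f3": 0.5, "f4": 0.5}
pi_after = {"f1": 0.25, "f2": 0.75, "f3": 0.25, "f4": 0.5}

stabilized, destabilized = top_changes(pi_changes(pi_before, pi_after))
print(stabilized)
print(destabilized)
```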