Robert Robinson1, Vanya V Valindria2, Wenjia Bai2, Ozan Oktay2, Bernhard Kainz2, Hideaki Suzuki3, Mihir M Sanghvi4,5, Nay Aung4,5, José Miguel Paiva4, Filip Zemrak4,5, Kenneth Fung4,5, Elena Lukaschuk6, Aaron M Lee4,5, Valentina Carapella6, Young Jin Kim6,7, Stefan K Piechnik6, Stefan Neubauer6, Steffen E Petersen4,5, Chris Page8, Paul M Matthews3,9, Daniel Rueckert2, Ben Glocker2.
Abstract
BACKGROUND: The trend towards large-scale studies including population imaging poses new challenges in terms of quality control (QC). This is a particular issue when automatic processing tools such as image segmentation methods are employed to derive quantitative measures or biomarkers for further analyses. Manual inspection and visual QC of each segmentation result is not feasible at large scale. However, it is important to be able to automatically detect when a segmentation method fails in order to avoid inclusion of wrong measurements into subsequent analyses which could otherwise lead to incorrect conclusions.
Keywords: Automatic quality control; Population imaging; Segmentation
Year: 2019 PMID: 30866968 PMCID: PMC6416857 DOI: 10.1186/s12968-019-0523-x
Source DB: PubMed Journal: J Cardiovasc Magn Reson ISSN: 1097-6647 Impact factor: 5.364
Fig. 1 Reverse Classification Accuracy - Single-atlas Registration Classifier. Reverse Classification Accuracy (RCA), with single-atlas registration classifier, as applied in our study. A set of reference images is first registered to the test-image before the resulting transformations are used to warp the corresponding reference segmentations. Dice Similarity Coefficient (DSC) is calculated between the warped segmentations and the test-segmentation, with the maximum DSC taken as a proxy for the accuracy of the test-segmentation. Note that in practice, the ground truth test-segmentation is absent. Images and segmentations are annotated as referred to in the text
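The scoring step described in this caption can be illustrated with a minimal sketch. It assumes registration and warping have already been done elsewhere, so each reference segmentation arrives as a binary mask in the space of the test-image. The names `dice` and `rca_predict_dsc`, the flat 0/1 list representation, and the toy masks are illustrative assumptions, not the authors' implementation.

```python
def dice(a, b):
    """Dice Similarity Coefficient between two binary masks (flat 0/1 sequences)."""
    if len(a) != len(b):
        raise ValueError("masks must have the same number of voxels")
    size_a = sum(1 for x in a if x)
    size_b = sum(1 for y in b if y)
    if size_a + size_b == 0:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    overlap = sum(1 for x, y in zip(a, b) if x and y)
    return 2.0 * overlap / (size_a + size_b)


def rca_predict_dsc(test_seg, warped_reference_segs):
    """Proxy for test-segmentation quality: the maximum DSC over all
    warped reference segmentations (hypothetical helper)."""
    return max(dice(test_seg, ref) for ref in warped_reference_segs)


# Toy example: one test-segmentation and three already-warped references.
test_seg = [0, 1, 1, 1, 0, 0]
warped_refs = [
    [0, 1, 1, 0, 0, 0],  # overlap 2, sizes 3 and 2 -> DSC 0.8
    [1, 1, 0, 0, 0, 0],  # overlap 1, sizes 3 and 2 -> DSC 0.4
    [0, 0, 0, 0, 1, 1],  # no overlap -> DSC 0.0
]
predicted = rca_predict_dsc(test_seg, warped_refs)
print(predicted)
```

For multi-class segmentations the same function could be applied per label after binarising each class, which is how the per-class and whole-heart figures in the tables are separated.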
A summary of the experiments performed in this study
| Experiment | Dataset | Size | GT | Seg. Method |
|---|---|---|---|---|
| A | Hammersmith | 100 | Yes | RF |
| B | UKBB-2964 | 4805 | Yes | RF and CNN |
| C | UKBB-18545 | 7250 | No | Multi-Atlas |
Experiment A uses data from an internal dataset which is segmented with a multi-atlas segmentation approach and manually validated by experts at Hammersmith Hospital, London. These manual validations are counted as ‘ground truth’ (GT) and 100 of them are taken for the reference set used in all experiments. UKBB datasets are shown with their application numbers. In experiment B we segment with both random forests (RF) and a convolutional neural network (CNN); the CNN is that of Bai et al. [4]
Fig. 2 Example Results from RCA. Examples of RCA results on one proposed segmentation. The panels in the top row show (left to right) the MRI scan, the predicted segmentation, an overlay and the manual annotation. The array below shows a subset of the 100 reference images ordered by Dice similarity coefficient (DSC) and equally spaced from highest to lowest DSC. The array shows (left) the reference image, (middle) its ground truth segmentation and (right) the test-segmentation from the upper row which has been warped to the reference image. The real DSC between each reference image and warped segmentation is shown for each pair. RCA-predicted and real GT-calculated DSCs are shown for the whole-heart binary classification case at the top alongside the metrics for each individual class in the segmentation
Initial reverse classification accuracy validation on 400 random forest segmentations
| Class | Threshold | Acc. | TPR | FPR | MAE (DSC) | MAE (MSD) | MAE (RMS) | MAE (HD) |
|---|---|---|---|---|---|---|---|---|
| LVC | DSC | 0.973 | 0.977 | 0.036 | 0.020 | 4.104 | 5.593 | 14.15 |
| | MSD | 0.980 | 0.975 | 0.019 | | | | |
| LVM | DSC | 0.815 | 0.947 | 0.215 | 0.044 | 3.756 | 4.741 | 13.08 |
| | MSD | 0.990 | 0.987 | 0.008 | | | | |
| RVC | DSC | 0.985 | 0.923 | 0.012 | 0.030 | 4.104 | 5.022 | 16.63 |
| | MSD | 0.943 | 0.914 | 0.047 | | | | |
| Av. | DSC | 0.924 | 0.949 | 0.089 | 0.031 | 3.988 | 5.119 | 14.62 |
| | MSD | 0.971 | 0.959 | 0.025 | | | | |
| WH | DSC | | | | | 4.445 | 5.504 | 15.11 |
| | MSD | | | | | | | |
Classes are LV Cavity (LVC), LV Myocardium (LVM), RV Cavity (RVC), an average over the classes (Av.) and a binary segmentation of the whole heart (WH). First row for each class shows the binary classification accuracy for ‘poor’ and ‘good’ segmentations in the Dice Similarity Coefficient (DSC) ranges [0.0 0.7) and [0.7 1.0] respectively. Second row for each class shows the binary classification accuracy for ‘poor’ and ‘good’ segmentations in the Mean Surface Distance (MSD) ranges [>2.0mm] and [0.0mm 2.0mm] respectively. True-positive and false-positive rates are also shown. We report mean absolute errors (MAE) on the predictions of DSC and additional surface-distance metrics: root-mean-squared surface distance (RMS) and Hausdorff distance (HD)
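As an illustration of how the Acc., TPR, FPR and MAE columns relate to predicted and real scores, the following sketch thresholds DSC at 0.7 as in the caption. The function name `qc_metrics`, the toy scores, and the choice of ‘good’ as the positive class are assumptions made for this example; the source does not state which class is treated as positive.

```python
def qc_metrics(predicted_dsc, real_dsc, threshold=0.7):
    """Binary 'good'/'poor' agreement between predicted and real DSC.

    Assumption: 'good' (DSC >= threshold) is the positive class.
    """
    if not predicted_dsc or len(predicted_dsc) != len(real_dsc):
        raise ValueError("need two non-empty score lists of equal length")
    tp = fp = tn = fn = 0
    for pred, real in zip(predicted_dsc, real_dsc):
        pred_good = pred >= threshold   # [0.7, 1.0] counts as 'good'
        real_good = real >= threshold   # [0.0, 0.7) counts as 'poor'
        if pred_good and real_good:
            tp += 1
        elif pred_good:
            fp += 1
        elif real_good:
            fn += 1
        else:
            tn += 1
    n = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / n,
        "tpr": tp / (tp + fn) if tp + fn else float("nan"),
        "fpr": fp / (fp + tn) if fp + tn else float("nan"),
        # Mean absolute error of the predicted score itself.
        "mae": sum(abs(p - r) for p, r in zip(predicted_dsc, real_dsc)) / n,
    }


# Toy scores for five segmentations (invented for illustration only).
predicted = [0.90, 0.80, 0.60, 0.75, 0.50]
real = [0.95, 0.65, 0.55, 0.80, 0.72]
metrics = qc_metrics(predicted, real)
print(metrics)
```

The MSD-based rows would follow the same pattern with the comparison reversed, since a smaller distance (at most 2.0 mm) marks a ‘good’ segmentation.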
Fig. 3 RCA Validation on 400 cardiac MRI. 400 cardiac MRI segmentations were generated with a Random Forest classifier. 500 trees and depths in the range [5, 40] were used to simulate various degrees of segmentation quality. RCA with single-atlas classifier was used to predict the Dice Similarity Coefficient (DSC), mean surface distance (MSD), root mean-squared surface distance (RMS) and Hausdorff distance (HD). Ground truth for the scans is known so real metrics are also calculated. All calculations are on the whole-heart binary classification task. We report low mean absolute error (MAE) for all metrics and 99% binary classification accuracy (TPR = 0.98, FPR = 0.00) with a DSC threshold of 0.70. Accuracy is also high for individual segmentation classes. Absolute error for each image is shown for each metric. We note increasing error with decreasing quality of segmentation based on the real metric score
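The three surface-distance metrics named in this caption can be sketched on small point sets. Definitions vary between toolkits; this example assumes a symmetric formulation in which nearest-neighbour distances are pooled from both directions, and it takes surface points as given rather than extracting them from masks. The function name and the toy contours are hypothetical.

```python
import math


def surface_distance_metrics(points_a, points_b):
    """MSD, RMS and HD between two surfaces given as lists of point tuples.

    Assumption: distances from A to B and from B to A are pooled
    (symmetric variant); other conventions exist.
    """
    def nearest(p, others):
        return min(math.dist(p, q) for q in others)

    distances = [nearest(p, points_b) for p in points_a]
    distances += [nearest(q, points_a) for q in points_b]
    msd = sum(distances) / len(distances)
    rms = math.sqrt(sum(d * d for d in distances) / len(distances))
    hd = max(distances)
    return msd, rms, hd


# Toy 2D contours (coordinates in arbitrary units).
contour_a = [(0, 0), (1, 0), (2, 0)]
contour_b = [(0, 1), (1, 1), (2, 3)]
msd, rms, hd = surface_distance_metrics(contour_a, contour_b)
print(msd, rms, hd)
```

The example makes the behaviour of the metrics visible: a single outlying point dominates HD, raises RMS noticeably, and affects MSD least.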
Analysis of 4,805 Random Forest segmentations with available ground truth
| Class | Threshold | Acc. | TPR | FPR | MAE (DSC) | MAE (MSD) | MAE (RMS) | MAE (HD) |
|---|---|---|---|---|---|---|---|---|
| LVC | DSC | 0.968 | 0.997 | 0.330 | 0.042 | 0.906 | 2.514 | 11.09 |
| | MSD | 0.975 | 0.962 | 0.011 | | | | |
| LVM | DSC | 0.454 | 0.956 | 0.571 | 0.125 | 0.963 | 2.141 | 11.83 |
| | MSD | 0.972 | 0.962 | 0.012 | | | | |
| RVC | DSC | 0.868 | 0.957 | 0.352 | 0.057 | 1.140 | 2.790 | 15.23 |
| | MSD | 0.969 | 0.977 | 0.040 | | | | |
| Av. | DSC | 0.763 | 0.970 | 0.418 | 0.075 | 1.003 | 2.482 | 12.72 |
| | MSD | 0.972 | 0.967 | 0.032 | | | | |
| WH | DSC | | | | | 1.156 | 2.762 | 12.52 |
| | MSD | | | | | | | |
4,805 RF segmentations at various depths [5 40] and 500 trees. Manual contours were available through Biobank Application 2964. Classes are LV Cavity (LVC), LV Myocardium (LVM), RV Cavity (RVC), an average over the classes (Av.) and a binary segmentation of the whole heart (WH). First row for each class shows the binary classification accuracy for ‘poor’ and ‘good’ segmentations in the Dice Similarity Coefficient (DSC) ranges [0.0 0.7) and [0.7 1.0] respectively. Second row for each class shows the binary classification accuracy for ‘poor’ and ‘good’ segmentations in the Mean Surface Distance (MSD) ranges [>2.0mm] and [0.0mm 2.0mm] respectively. True-positive and false-positive rates are also shown. We report mean absolute errors (MAE) on the predictions of DSC and additional surface-distance metrics: root-mean-squared surface distance (RMS) and Hausdorff distance (HD)
Fig. 4 Validation on 4805 Random Forest segmentations of UKBB Imaging Study with Ground Truth. 4,805 cardiac MRI were segmented with a Random Forest classifier. 500 trees and depths in the range [5 40] were used to simulate various degrees of segmentation quality. Manual contours were available through Biobank Application 2964. RCA with single-atlas classifier was used to predict the Dice Similarity Coefficient (DSC), mean surface distance (MSD), root mean-squared surface distance (RMS) and Hausdorff distance (HD). All calculations on the whole-heart binary classification task. We report low mean absolute error (MAE) for all metrics and 95% binary classification accuracy (TPR = 0.97 and FPR = 0.15) with a DSC threshold of 0.70. High accuracy for individual segmentation classes
Fig. 5 Extensive Reverse Classification Accuracy Validation on 900 UKBB Segmentations. Convolutional neural network (CNN) segmentation as in Bai et al. [4]. Manual contours were available through Biobank Application 2964. RCA with single-atlas classifier was used to predict the Dice Similarity Coefficient (DSC), mean surface distance (MSD), root mean-squared surface distance (RMS) and Hausdorff distance (HD). All calculations for the binary quality classification task on (top) ‘Whole Heart’ average and (bottom) Left Ventricular Myocardium. We report low mean absolute error (MAE) for all metrics and 99.8% binary classification accuracy (TPR = 1.00 and FPR = 0.00) with a DSC threshold of 0.70
Analysis of 900 CNN segmentations with available ground truth
| Class | Threshold | Acc. | TPR | FPR | MAE (DSC) | MAE (MSD) | MAE (RMS) | MAE (HD) |
|---|---|---|---|---|---|---|---|---|
| LVC | DSC | 0.998 | 1.000 | 0.000 | 0.082 | 0.386 | 0.442 | 1.344 |
| | MSD | 1.000 | 1.000 | 0.000 | | | | |
| LVM | DSC | 0.051 | 1.000 | 0.001 | 0.268 | 0.510 | 0.547 | 2.127 |
| | MSD | 1.000 | 1.000 | 0.000 | | | | |
| RVC | DSC | 0.901 | 1.000 | 0.033 | 0.146 | 0.588 | 0.656 | 2.086 |
| | MSD | 0.997 | 0.997 | 0.000 | | | | |
| Av. | DSC | 0.650 | 1.000 | 0.011 | 0.165 | 0.495 | 0.548 | 1.852 |
| | MSD | 0.999 | 0.999 | 0.000 | | | | |
| WH | DSC | | | | | 0.460 | 0.509 | 1.698 |
| | MSD | | | | | | | |
CNN segmentations as in Bai et al. [4]. Manual contours were available through Biobank Application 2964. Classes are LV Cavity (LVC), LV Myocardium (LVM), RV Cavity (RVC), an average over the classes (Av.) and a binary segmentation of the whole heart (WH). First row for each class shows the binary classification accuracy for ‘poor’ and ‘good’ segmentations in the Dice Similarity Coefficient (DSC) ranges [0.0 0.7) and [0.7 1.0] respectively. Second row for each class shows the binary classification accuracy for ‘poor’ and ‘good’ segmentations in the Mean Surface Distance (MSD) ranges [>2.0mm] and [0.0mm 2.0mm] respectively. True-positive and false-positive rates are also shown. We report mean absolute errors (MAE) on the predictions of DSC and additional surface-distance metrics: root-mean-squared surface distance (RMS) and Hausdorff distance (HD)
Fig. 6 RCA Application on 7250 Cardiac MRI segmentations of UKBB Imaging Study. 7,250 cardiac MRI segmentations generated with a multi-atlas segmentation approach [18]. Manual QC scores given in the range [0 6] (i.e. [0 2] for each of basal, mid and apical slices). RCA with single-atlas classifier was used to predict the Dice Similarity Coefficient (DSC), mean surface distance (MSD), root mean-squared surface distance (RMS) and Hausdorff distance (HD). All calculations on the LV Myocardium binary classification task. We show correlation in all metrics. Examples show: a) and b) agreement between low predicted DSC and low manual QC score, c) successful automated identification of poor segmentation with low predicted DSC despite high manual QC score and d) agreement between high predicted DSC and high manual QC score. Inserts in top row display extended range of y-axis
Fig. 7 Investigating the Effect of Reference Set Size on Prediction Accuracy. 4,805 automated segmentations from Experiment B were processed with Reverse Classification Accuracy (RCA) using differing numbers of reference images. Random subsets of 10, 15, 35, 50, 65 and 75 reference images were taken from the full set of 100 available reference images. Five random runs were performed to obtain error bars for each setting. Average prediction accuracy increases with increasing number of reference images and the variance between runs also decreases
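The subsampling protocol in this caption can be sketched as follows. The example covers only the random-subset and repeated-run mechanics for a single test-segmentation; the per-reference DSC values are synthetic, and the function name, seed, and use of the population standard deviation as the error bar are assumptions rather than details from the study.

```python
import random
import statistics


def prediction_spread(dsc_per_reference, subset_size, runs=5, seed=0):
    """Mean and spread of the RCA prediction (max DSC) over random
    reference subsets of a given size (hypothetical helper)."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(runs):
        subset = rng.sample(dsc_per_reference, subset_size)
        predictions.append(max(subset))  # RCA proxy: best DSC in the subset
    return statistics.mean(predictions), statistics.pstdev(predictions)


# Synthetic DSC of one test-segmentation against 100 references.
data_rng = random.Random(42)
scores = [data_rng.uniform(0.3, 0.9) for _ in range(100)]

for size in (10, 15, 35, 50, 65, 75, 100):
    mean_pred, spread = prediction_spread(scores, size)
    print(size, round(mean_pred, 3), round(spread, 3))

full_mean, full_spread = prediction_spread(scores, 100)
small_mean, small_spread = prediction_spread(scores, 10)
```

Because the prediction is a maximum, a smaller subset can only match or underestimate the full-set value, which is consistent with the trend the caption reports.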