Francesco Ciompi, Kaman Chung, Sarah J van Riel, Arnaud Arindra Adiyoso Setio, Paul K Gerke, Colin Jacobs, Ernst Th Scholten, Cornelia Schaefer-Prokop, Mathilde M W Wille, Alfonso Marchianò, Ugo Pastorino, Mathias Prokop, Bram van Ginneken.
Abstract
The introduction of lung cancer screening programs will produce an unprecedented amount of chest CT scans in the near future, which radiologists will have to read in order to decide on a patient follow-up strategy. According to the current guidelines, the workup of screen-detected nodules strongly relies on nodule size and nodule type. In this paper, we present a deep learning system based on multi-stream multi-scale convolutional networks, which automatically classifies all nodule types relevant for nodule workup. The system processes raw CT data containing a nodule without the need for any additional information such as nodule segmentation or nodule size, and learns a representation of 3D data by analyzing an arbitrary number of 2D views of a given nodule. The deep learning system was trained with data from the Italian MILD screening trial and validated on an independent set of data from the Danish DLCST screening trial. We analyze the advantage of processing nodules at multiple scales with a multi-stream convolutional network architecture, and we show that the proposed deep learning system achieves performance at classifying nodule type that surpasses that of classical machine learning approaches and is within the inter-observer variability among four experienced human observers.
Year: 2017 PMID: 28422152 PMCID: PMC5395959 DOI: 10.1038/srep46479
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Examples of triplets of patches for different nodule types in axial, coronal and sagittal views.
Each triplet is depicted using three different patch sizes, namely 10 mm, 20 mm and 40 mm.
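Extracting such triplets of 2D views from a 3D CT volume can be sketched as follows; the voxel spacing, patch sizes in voxels, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def extract_triplet(volume, center, size_vox):
    """Cut axial, coronal and sagittal square patches of side size_vox,
    centred on a candidate nodule. volume is indexed (z, y, x); the
    patch is assumed to fit entirely inside the volume."""
    z, y, x = center
    h = size_vox // 2
    axial    = volume[z, y - h:y + h, x - h:x + h]   # fixed z slice
    coronal  = volume[z - h:z + h, y, x - h:x + h]   # fixed y slice
    sagittal = volume[z - h:z + h, y - h:y + h, x]   # fixed x slice
    return axial, coronal, sagittal

# Hypothetical example: 1 mm isotropic voxels, so 10/20/40 mm -> 10/20/40 voxels
vol = np.random.default_rng(0).normal(size=(64, 64, 64))
triplets = [extract_triplet(vol, (32, 32, 32), s) for s in (10, 20, 40)]
```

Each element of `triplets` is one axial/coronal/sagittal triplet at one scale, matching the three patch sizes shown in the figure.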
Detailed number of nodules and samples in the training, validation and test sets.
| Nodule type | Training nodules (MILD, 943 patients) | N | Training samples | Validation nodules (MILD) | Test nodules (DLCST, 468 patients; all/observer study) |
|---|---|---|---|---|---|
| Solid | 694 | 8 | 88,832 | 232 | 382/27 |
| Calcified | 233 | 22 | 82,016 | 78 | 58/27 |
| Part-solid | 63 | 80 | 80,640 | 21 | 37/27 |
| Non-solid | 152 | 33 | 80,256 | 50 | 87/27 |
| Perifissural | 181 | 28 | 81,088 | 62 | 48/27 |
| Spiculated | 29 | 167 | 77,488 | 10 | 27/27 |
| Total | 1,352 | — | 490,320 | 453 | 639/162 |
The MILD dataset was used for training and validation; the DLCST dataset was used for testing. For the test set, the number of nodules per class randomly selected for the observer study is also reported. For each nodule type, N indicates the number of class-specific planes per nodule used to extract training data (see also Fig. 4). The numbers of patients used from MILD (943) and DLCST (468) are also indicated.
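The per-class sample counts in the table are consistent with training samples = nodules × N × 16, which suggests a fixed factor of 16 samples per extracted plane (this factor is our inference from the numbers, not stated here; it presumably corresponds to data augmentation). A quick check:

```python
# Counts taken from the table above; AUG = 16 is inferred, not stated.
nodules = {"Solid": 694, "Calcified": 233, "Part-solid": 63,
           "Non-solid": 152, "Perifissural": 181, "Spiculated": 29}
planes_N = {"Solid": 8, "Calcified": 22, "Part-solid": 80,
            "Non-solid": 33, "Perifissural": 28, "Spiculated": 167}
AUG = 16  # assumed number of samples generated per class-specific plane

samples = {c: nodules[c] * planes_N[c] * AUG for c in nodules}
total = sum(samples.values())
```

Choosing N roughly inversely proportional to class frequency balances the per-class sample counts (77k to 89k) even though the nodule counts range from 29 to 694.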
Cohen's κ statistics with 95% confidence intervals for agreement between the computer and the observers.
| | O1 | O2 | O3 | O4 | Computer (1 scale) | Computer (2 scales) | Computer (3 scales) |
|---|---|---|---|---|---|---|---|
| O1 | — | 0.59 (0.51–0.68) | 0.65 (0.57–0.74) | 0.68 (0.60–0.76) | 0.63 (0.54–0.72) | 0.64 (0.55–0.73) | 0.65 (0.57–0.74) |
| O2 | 0.59 (0.51–0.68) | — | 0.71 (0.63–0.79) | 0.66 (0.58–0.75) | 0.55 (0.45–0.64) | 0.54 (0.45–0.64) | 0.58 (0.49–0.67) |
| O3 | 0.65 (0.57–0.74) | 0.71 (0.63–0.79) | — | 0.75 (0.67–0.82) | 0.56 (0.47–0.65) | 0.57 (0.48–0.66) | 0.61 (0.52–0.70) |
| O4 | 0.68 (0.60–0.76) | 0.66 (0.58–0.75) | 0.75 (0.67–0.82) | — | 0.62 (0.53–0.70) | 0.64 (0.55–0.73) | 0.67 (0.59–0.75) |
Oi indicates the i-th observer. Results for automatic classification using deep learning systems with different numbers of scales are reported.
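Cohen's κ with a bootstrap confidence interval, as in the agreement table, can be computed along these lines; this is a generic sketch using toy labels, not the authors' evaluation code.

```python
import random
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa between two raters' labels on the same items."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n                   # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)   # chance agreement
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

def kappa_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for kappa."""
    rng = random.Random(seed)
    n = len(a)
    stats = sorted(
        cohen_kappa([a[i] for i in idx], [b[i] for i in idx])
        for idx in ([rng.randrange(n) for _ in range(n)] for _ in range(n_boot))
    )
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

# Toy example: two raters assigning 6 nodule classes to 30 lesions
r1 = [i % 6 for i in range(30)]
r2 = r1[:25] + [(c + 1) % 6 for c in r1[25:]]   # raters disagree on 5 items
```

With real data, `r1` and `r2` would be the class labels assigned by two observers (or one observer and the computer) to the same set of nodules.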
Nodule classification performance in terms of accuracy and F-measure per nodule type.
| | Accuracy | F: Solid | F: Calcified | F: Part-solid | F: Non-solid | F: Perifissural | F: Spiculated | F: Not a nodule |
|---|---|---|---|---|---|---|---|---|
| O1 vs. Computer (3 scales) | 71.5% | 60.8% | 88.4% | 66.7% | 86.3% | 62.2% | 71.4% | — |
| O2 vs. Computer (3 scales) | 66.2% | 62.6% | 82.4% | 47.8% | 72.7% | 80.0% | 56.4% | — |
| O3 vs. Computer (3 scales) | 67.7% | 56.8% | 85.1% | 59.1% | 78.3% | 75.6% | 60.9% | — |
| O4 vs. Computer (3 scales) | 72.8% | 64.2% | 88.9% | 71.7% | 80.0% | 77.3% | 62.7% | — |
| Average | 69.6% | 61.1% | 86.2% | 61.3% | 79.3% | 73.8% | 62.9% | — |
| O1 vs. O2 | 66.0% | 52.7% | 84.0% | 51.3% | 79.2% | 63.6% | 83.3% | 50.0% |
| O1 vs. O3 | 71.0% | 55.0% | 87.0% | 66.7% | 80.0% | 81.5% | 74.4% | 40.0% |
| O1 vs. O4 | 72.8% | 64.8% | 90.9% | 66.7% | 71.7% | 75.5% | 89.4% | 0.0% |
| O2 vs. O3 | 76.5% | 74.7% | 88.9% | 61.5% | 81.0% | 77.3% | 75.7% | 66.7% |
| O2 vs. O4 | 72.2% | 64.4% | 88.5% | 70.8% | 71.1% | 79.1% | 73.2% | 0.0% |
| O3 vs. O4 | 79.0% | 68.4% | 95.8% | 71.1% | 80.9% | 90.6% | 79.2% | 0.0% |
| Average | 72.9% | 63.3% | 89.2% | 64.7% | 77.3% | 77.9% | 79.2% | 26.1% |
Results for each pair of human observers Oi vs. Oj and for each observer versus the computer on the observer-study dataset (167 nodules) are reported. Averages across observer pairs and across computer-observer pairs are also indicated. The additional class "not a nodule" applies only to the observers, since they could exclude lesions during the observer study. The performance of the system on the full test set (639 nodules) is also reported; in this case, the annotations from the DLCST radiologists (O4) are taken as the reference standard.
Comparison of classification performance in terms of accuracy and F-measure when the considered methods are: (1) features based on pixel intensity of patches and linear SVM classifier, (2) features learned from raw nodule patches using the unsupervised learning approach proposed in ref. 34 and linear SVM classifier, (3) the proposed deep learning approach using ConvNets working at 1, 2 and 3 scales.
| | Accuracy | F: Solid | F: Calcified | F: Part-solid | F: Non-solid | F: Perifissural | F: Spiculated |
|---|---|---|---|---|---|---|---|
| Intensity features + SVM | 27.0% | 4.1% | 60.2% | 0.0% | 35.4% | 26.7% | 32.5% |
| Unsupervised features + SVM | 39.9% | 38.4% | 32.0% | 49.4% | 59.2% | 16.9% | 39.5% |
| ConvNets 1 scale | 78.0% | 84.4% | 82.4% | 54.5% | 84.4% | 57.5% | 37.8% |
| ConvNets 2 scales | 79.2% | 85.6% | 84.9% | 52.3% | 87.8% | 63.4% | 36.8% |
| ConvNets 3 scales | 79.5% | 85.6% | 85.7% | 52.2% | 87.4% | 68.2% | 43.4% |
Figure 2. Examples of classified nodules from the test set (DLCST).
Each row depicts nodules from one class as labeled in the DLCST trial, and nodules are sorted from left to right by the probability given by the (3-scale) deep learning system. Examples with low probability (on the left) are atypical cases of each nodule type, while high probability (on the right) is given to typical examples of each nodule type.
Figure 3. Multidimensional scaling of nodules in the test set using the t-SNE algorithm.
Nodules that are close in the embedding have similar characteristics. In (a), clusters of similar nodules are highlighted and grouped with different boxes. A zoomed-in version of each cluster is also shown, and a representative name is given based on their appearance. The nodule label assigned in the DLCST trial is also reported as a coloured dot for each nodule patch (see legend for nodule types).
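An embedding like the one in Figure 3 can be produced by running t-SNE on per-nodule feature vectors, e.g. the activations of the penultimate network layer. The sketch below stands in random 256-D features (matching the size of the last fully-connected layer) for the real ones; the feature source and all parameter values are assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE  # requires scikit-learn

# Stand-in for the 256-D penultimate-layer features of 60 test nodules
rng = np.random.default_rng(0)
feats = rng.normal(size=(60, 256))

# Project to 2-D; perplexity must be smaller than the number of samples
emb = TSNE(n_components=2, perplexity=20, init="random",
           learning_rate=200.0, random_state=0).fit_transform(feats)
```

In the figure, each nodule patch is then drawn at its 2-D coordinate and coloured by its DLCST label.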
Precision and recall values (in %) for the 3-scale deep learning system on the test set.
| | Solid | Calcified | Part-solid | Non-solid | Perifissural | Spiculated |
|---|---|---|---|---|---|---|
| Precision | 89.2 | 88.9 | 43.6 | 87.4 | 78.4 | 32.7 |
| Recall | 82.2 | 82.8 | 64.9 | 87.4 | 60.4 | 64.3 |
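The per-class F-measures reported for the 3-scale ConvNet in the method-comparison table can be reproduced from these precision and recall values via F = 2PR/(P+R):

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall (both in %)."""
    return 2 * p * r / (p + r)

# Precision/recall of the 3-scale system, from the table above (in %)
pr = {"Solid": (89.2, 82.2), "Calcified": (88.9, 82.8),
      "Part-solid": (43.6, 64.9), "Non-solid": (87.4, 87.4),
      "Perifissural": (78.4, 60.4), "Spiculated": (32.7, 64.3)}
f = {c: round(f_measure(p, r), 1) for c, (p, r) in pr.items()}
```

These reproduce the 3-scale ConvNet row of the comparison table (85.6, 85.7, 52.2, 87.4, 68.2, 43.4), confirming that its per-class columns are F-measures in the same class order.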
Figure 4. (a) Examples of triplets of nodule patches extracted by varying the parameter N. (b) Examples of pyramidal triplets of patches used to feed the proposed deep learning system. The system consists of three groups of three streams, one group for each considered scale (namely 10 mm, 20 mm and 40 mm patch size). Convolutional layers, max-pooling layers, fully-connected layers and one soft-max layer are the building blocks of the proposed network. The last fully-connected layer, with 256 neurons, combines the three groups of three streams, and a 6-value probability vector is generated as output.
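A minimal forward pass matching this description (nine streams: three views at each of three scales, merged by a 256-unit fully-connected layer into a 6-class soft-max) can be sketched in plain NumPy. All layer sizes, kernel sizes and the random weights below are illustrative assumptions, not the published architecture's hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, w):
    """Valid cross-correlation of a single-channel patch with one kernel."""
    kh, kw = w.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def max_pool(x, k=2):
    """Non-overlapping k x k max pooling."""
    h, w = (x.shape[0] // k) * k, (x.shape[1] // k) * k
    return x[:h, :w].reshape(h // k, k, w // k, k).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

relu = lambda a: np.maximum(a, 0.0)

# Nine input patches: axial/coronal/sagittal views at three scales,
# each assumed resampled to a common 32x32 grid
patches = [rng.normal(size=(32, 32)) for _ in range(9)]

# One conv(5x5) + pool stream per patch, then concatenate all streams
kernels = [rng.normal(size=(5, 5)) * 0.1 for _ in range(9)]
streams = [max_pool(relu(conv2d(p, k))).ravel() for p, k in zip(patches, kernels)]
features = np.concatenate(streams)                  # 9 streams * 14*14 = 1764 values

# Combiner: fully-connected layer with 256 units, then a 6-class soft-max
W1 = rng.normal(size=(256, features.size)) * 0.01
W2 = rng.normal(size=(6, 256)) * 0.1
probs = softmax(W2 @ relu(W1 @ features))           # 6-value probability vector
```

The real network has multiple convolutional and pooling layers per stream and trained weights; the point of the sketch is the data flow: per-view, per-scale streams feeding one shared combiner that outputs one probability per nodule type.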