Francesco Ciompi, Kaman Chung, Sarah J van Riel, Arnaud Arindra Adiyoso Setio, Paul K Gerke, Colin Jacobs, Ernst Th Scholten, Cornelia Schaefer-Prokop, Mathilde M W Wille, Alfonso Marchianò, Ugo Pastorino, Mathias Prokop, Bram van Ginneken.
Abstract
The introduction of lung cancer screening programs will produce an unprecedented amount of chest CT scans in the near future, which radiologists will have to read in order to decide on a patient follow-up strategy. According to the current guidelines, the workup of screen-detected nodules strongly relies on nodule size and nodule type. In this paper, we present a deep learning system based on multi-stream multi-scale convolutional networks, which automatically classifies all nodule types relevant for nodule workup. The system processes raw CT data containing a nodule without the need for any additional information such as nodule segmentation or nodule size, and learns a representation of 3D data by analyzing an arbitrary number of 2D views of a given nodule. The deep learning system was trained with data from the Italian MILD screening trial and validated on an independent set of data from the Danish DLCST screening trial. We analyze the advantage of processing nodules at multiple scales with a multi-stream convolutional network architecture, and we show that the proposed deep learning system achieves performance at classifying nodule type that surpasses that of classical machine learning approaches and is within the inter-observer variability among four experienced human observers.
Year: 2017 PMID: 28422152 PMCID: PMC5395959 DOI: 10.1038/srep46479
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Examples of triplets of patches for different nodule types in axial, coronal and sagittal views.
Each triplet is depicted using three different patch sizes, namely 10 mm, 20 mm and 40 mm.
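Extracting such triplets of 2D views from a 3D CT volume can be sketched as follows; the voxel spacing, patch sizes in voxels, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def extract_triplet(volume, center, size_vox):
    """Cut axial, coronal and sagittal square patches of side size_vox,
    centred on a candidate nodule. volume is indexed (z, y, x); the
    patch is assumed to fit entirely inside the volume."""
    z, y, x = center
    h = size_vox // 2
    axial    = volume[z, y - h:y + h, x - h:x + h]   # fixed z slice
    coronal  = volume[z - h:z + h, y, x - h:x + h]   # fixed y slice
    sagittal = volume[z - h:z + h, y - h:y + h, x]   # fixed x slice
    return axial, coronal, sagittal

# Hypothetical example: 1 mm isotropic voxels, so 10/20/40 mm -> 10/20/40 voxels
vol = np.random.default_rng(0).normal(size=(64, 64, 64))
triplets = [extract_triplet(vol, (32, 32, 32), s) for s in (10, 20, 40)]
```

Each element of `triplets` is one axial/coronal/sagittal triplet at one scale, matching the three patch sizes shown in the figure.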
Detailed number of nodules and samples in the training, validation and test sets.
| Nodule type | Training nodules (MILD, 943 patients) | N | Training samples | Validation nodules (MILD) | Test nodules (DLCST, 468 patients; all/observer study) |
|---|---|---|---|---|---|
| Solid | 694 | 8 | 88,832 | 232 | 382/27 |
| Calcified | 233 | 22 | 82,016 | 78 | 58/27 |
| Part-solid | 63 | 80 | 80,640 | 21 | 37/27 |
| Non-solid | 152 | 33 | 80,256 | 50 | 87/27 |
| Perifissural | 181 | 28 | 81,088 | 62 | 48/27 |
| Spiculated | 29 | 167 | 77,488 | 10 | 27/27 |
| Total | 1,352 | — | 490,320 | 453 | 639/162 |
The MILD dataset was used for training and validation; the DLCST dataset was used for testing. For the test set, the number of nodules per class randomly selected for the observer study is also reported. For each nodule type, N indicates the number of class-specific planes per nodule used to extract training data (see also Fig. 4). The numbers of patients used from MILD (943) and DLCST (468) are also indicated.
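The per-class sample counts in the table are consistent with training samples = nodules × N × 16, which suggests a fixed factor of 16 samples per extracted plane (this factor is our inference from the numbers, not stated here; it presumably corresponds to data augmentation). A quick check:

```python
# Counts taken from the table above; AUG = 16 is inferred, not stated.
nodules = {"Solid": 694, "Calcified": 233, "Part-solid": 63,
           "Non-solid": 152, "Perifissural": 181, "Spiculated": 29}
planes_N = {"Solid": 8, "Calcified": 22, "Part-solid": 80,
            "Non-solid": 33, "Perifissural": 28, "Spiculated": 167}
AUG = 16  # assumed number of samples generated per class-specific plane

samples = {c: nodules[c] * planes_N[c] * AUG for c in nodules}
total = sum(samples.values())
```

Choosing N roughly inversely proportional to class frequency balances the per-class sample counts (77k to 89k) even though the nodule counts range from 29 to 694.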
Cohen's κ statistics with 95% confidence intervals for agreement between the computer and the observers.
| | O1 | O2 | O3 | O4 | Computer (1 scale) | Computer (2 scales) | Computer (3 scales) |
|---|---|---|---|---|---|---|---|
| O1 | — | 0.59 (0.51–0.68) | 0.65 (0.57–0.74) | 0.68 (0.60–0.76) | 0.63 (0.54–0.72) | 0.64 (0.55–0.73) | 0.65 (0.57–0.74) |
| O2 | 0.59 (0.51–0.68) | — | 0.71 (0.63–0.79) | 0.66 (0.58–0.75) | 0.55 (0.45–0.64) | 0.54 (0.45–0.64) | 0.58 (0.49–0.67) |
| O3 | 0.65 (0.57–0.74) | 0.71 (0.63–0.79) | — | 0.75 (0.67–0.82) | 0.56 (0.47–0.65) | 0.57 (0.48–0.66) | 0.61 (0.52–0.70) |
| O4 | 0.68 (0.60–0.76) | 0.66 (0.58–0.75) | 0.75 (0.67–0.82) | — | 0.62 (0.53–0.70) | 0.64 (0.55–0.73) | 0.67 (0.59–0.75) |
Oi indicates the i-th observer. Results for automatic classification using deep learning systems with different numbers of scales are reported.
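Cohen's κ with a bootstrap confidence interval, as in the agreement table, can be computed along these lines; this is a generic sketch using toy labels, not the authors' evaluation code.

```python
import random
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa between two raters' labels on the same items."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n                   # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)   # chance agreement
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

def kappa_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for kappa."""
    rng = random.Random(seed)
    n = len(a)
    stats = sorted(
        cohen_kappa([a[i] for i in idx], [b[i] for i in idx])
        for idx in ([rng.randrange(n) for _ in range(n)] for _ in range(n_boot))
    )
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

# Toy example: two raters assigning 6 nodule classes to 30 lesions
r1 = [i % 6 for i in range(30)]
r2 = r1[:25] + [(c + 1) % 6 for c in r1[25:]]   # raters disagree on 5 items
```

With real data, `r1` and `r2` would be the class labels assigned by two observers (or one observer and the computer) to the same set of nodules.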
Nodule classification performance in terms of accuracy and F-measure per nodule type.
| | Accuracy | F: Solid | F: Calcified | F: Part-solid | F: Non-solid | F: Perifissural | F: Spiculated | F: Not a nodule |
|---|---|---|---|---|---|---|---|---|
| O1 vs. Computer (3 scales) | 71.5% | 60.8% | 88.4% | 66.7% | 86.3% | 62.2% | 71.4% | — |
| O2 vs. Computer (3 scales) | 66.2% | 62.6% | 82.4% | 47.8% | 72.7% | 80.0% | 56.4% | — |
| O3 vs. Computer (3 scales) | 67.7% | 56.8% | 85.1% | 59.1% | 78.3% | 75.6% | 60.9% | — |
| O4 vs. Computer (3 scales) | 72.8% | 64.2% | 88.9% | 71.7% | 80.0% | 77.3% | 62.7% | — |
| Average | 69.6% | 61.1% | 86.2% | 61.3% | 79.3% | 73.8% | 62.9% | — |
| O1 vs. O2 | 66.0% | 52.7% | 84.0% | 51.3% | 79.2% | 63.6% | 83.3% | 50.0% |
| O1 vs. O3 | 71.0% | 55.0% | 87.0% | 66.7% | 80.0% | 81.5% | 74.4% | 40.0% |
| O1 vs. O4 | 72.8% | 64.8% | 90.9% | 66.7% | 71.7% | 75.5% | 89.4% | 0.0% |
| O2 vs. O3 | 76.5% | 74.7% | 88.9% | 61.5% | 81.0% | 77.3% | 75.7% | 66.7% |
| O2 vs. O4 | 72.2% | 64.4% | 88.5% | 70.8% | 71.1% | 79.1% | 73.2% | 0.0% |
| O3 vs. O4 | 79.0% | 68.4% | 95.8% | 71.1% | 80.9% | 90.6% | 79.2% | 0.0% |
| Average | 72.9% | 63.3% | 89.2% | 64.7% | 77.3% | 77.9% | 79.2% | 26.1% |
Results for each pair of human observers Oi vs. Oj and for each observer versus the computer on the observer-study dataset (167 nodules) are reported. Averages across observer pairs and across computer-observer pairs are also indicated. The additional class "not a nodule" applies only to the observers, since they could exclude lesions during the observer study. The performance of the system on the full test set (639 nodules) is also reported; in this case, the annotations from the DLCST radiologists (O4) are taken as the reference standard.
Comparison of classification performance in terms of accuracy and F-measure when the considered methods are: (1) features based on pixel intensity of patches and linear SVM classifier, (2) features learned from raw nodule patches using the unsupervised learning approach proposed in ref. 34 and linear SVM classifier, (3) the proposed deep learning approach using ConvNets working at 1, 2 and 3 scales.
| | Accuracy | F: Solid | F: Calcified | F: Part-solid | F: Non-solid | F: Perifissural | F: Spiculated |
|---|---|---|---|---|---|---|---|
| Intensity features + SVM | 27.0% | 4.1% | 60.2% | 0.0% | 35.4% | 26.7% | 32.5% |
| Unsupervised features + SVM | 39.9% | 38.4% | 32.0% | 49.4% | 59.2% | 16.9% | 39.5% |
| ConvNets 1 scale | 78.0% | 84.4% | 82.4% | 54.5% | 84.4% | 57.5% | 37.8% |
| ConvNets 2 scales | 79.2% | 85.6% | 84.9% | 52.3% | 87.8% | 63.4% | 36.8% |
| ConvNets 3 scales | 79.5% | 85.6% | 85.7% | 52.2% | 87.4% | 68.2% | 43.4% |
Figure 2. Examples of classified nodules from the test set (DLCST).
Each row depicts nodules from one class as labeled in the DLCST trial, and nodules are sorted from left to right by the probability given by the (3-scale) deep learning system. Examples with low probability (on the left) are atypical cases of each nodule type, while high probability (on the right) is given to typical examples of each nodule type.
Figure 3. Multidimensional scaling of nodules in the test set using the t-SNE algorithm.
Nodules that are close in the embedding have similar characteristics. In (a), clusters of similar nodules are highlighted and grouped with different boxes. A zoomed-in version of each cluster is also shown, and a representative name is given based on their appearance. The nodule label assigned in the DLCST trial is also reported as a coloured dot for each nodule patch (see legend for nodule types).
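An embedding like the one in Figure 3 can be produced by running t-SNE on per-nodule feature vectors, e.g. the activations of the penultimate network layer. The sketch below stands in random 256-D features (matching the size of the last fully-connected layer) for the real ones; the feature source and all parameter values are assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE  # requires scikit-learn

# Stand-in for the 256-D penultimate-layer features of 60 test nodules
rng = np.random.default_rng(0)
feats = rng.normal(size=(60, 256))

# Project to 2-D; perplexity must be smaller than the number of samples
emb = TSNE(n_components=2, perplexity=20, init="random",
           learning_rate=200.0, random_state=0).fit_transform(feats)
```

In the figure, each nodule patch is then drawn at its 2-D coordinate and coloured by its DLCST label.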
Precision and recall values (in %) for the 3-scale deep learning system on the test set.
| | Solid | Calcified | Part-solid | Non-solid | Perifissural | Spiculated |
|---|---|---|---|---|---|---|
| Precision | 89.2 | 88.9 | 43.6 | 87.4 | 78.4 | 32.7 |
| Recall | 82.2 | 82.8 | 64.9 | 87.4 | 60.4 | 64.3 |
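The per-class F-measures reported for the 3-scale ConvNet in the method-comparison table can be reproduced from these precision and recall values via F = 2PR/(P+R):

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall (both in %)."""
    return 2 * p * r / (p + r)

# Precision/recall of the 3-scale system, from the table above (in %)
pr = {"Solid": (89.2, 82.2), "Calcified": (88.9, 82.8),
      "Part-solid": (43.6, 64.9), "Non-solid": (87.4, 87.4),
      "Perifissural": (78.4, 60.4), "Spiculated": (32.7, 64.3)}
f = {c: round(f_measure(p, r), 1) for c, (p, r) in pr.items()}
```

These reproduce the 3-scale ConvNet row of the comparison table (85.6, 85.7, 52.2, 87.4, 68.2, 43.4), confirming that its per-class columns are F-measures in the same class order.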
Figure 4. (a) Examples of triplets of nodule patches extracted by varying the parameter N. (b) Examples of pyramidal triplets of patches used to feed the proposed deep learning system. The system consists of three groups of three streams, one group for each considered scale (namely 10 mm, 20 mm and 40 mm patch size). Convolutional layers, max-pooling layers, fully-connected layers and one soft-max layer are the building blocks of the proposed network. The last fully-connected layer, with 256 neurons, combines the three groups of three streams, and a 6-value probability vector is generated as output.
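A minimal forward pass matching this description (nine streams: three views at each of three scales, merged by a 256-unit fully-connected layer into a 6-class soft-max) can be sketched in plain NumPy. All layer sizes, kernel sizes and the random weights below are illustrative assumptions, not the published architecture's hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, w):
    """Valid cross-correlation of a single-channel patch with one kernel."""
    kh, kw = w.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def max_pool(x, k=2):
    """Non-overlapping k x k max pooling."""
    h, w = (x.shape[0] // k) * k, (x.shape[1] // k) * k
    return x[:h, :w].reshape(h // k, k, w // k, k).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

relu = lambda a: np.maximum(a, 0.0)

# Nine input patches: axial/coronal/sagittal views at three scales,
# each assumed resampled to a common 32x32 grid
patches = [rng.normal(size=(32, 32)) for _ in range(9)]

# One conv(5x5) + pool stream per patch, then concatenate all streams
kernels = [rng.normal(size=(5, 5)) * 0.1 for _ in range(9)]
streams = [max_pool(relu(conv2d(p, k))).ravel() for p, k in zip(patches, kernels)]
features = np.concatenate(streams)                  # 9 streams * 14*14 = 1764 values

# Combiner: fully-connected layer with 256 units, then a 6-class soft-max
W1 = rng.normal(size=(256, features.size)) * 0.01
W2 = rng.normal(size=(6, 256)) * 0.1
probs = softmax(W2 @ relu(W1 @ features))           # 6-value probability vector
```

The real network has multiple convolutional and pooling layers per stream and trained weights; the point of the sketch is the data flow: per-view, per-scale streams feeding one shared combiner that outputs one probability per nodule type.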