Assessing the Accuracy of a Deep Learning Method to Risk Stratify Indeterminate Pulmonary Nodules.

Pierre P Massion¹,², Sanja Antic¹, Sarim Ather³, Carlos Arteta⁴, Jan Brabec⁵, Heidi Chen⁶, Jerome Declerck⁴, David Dufek⁵, William Hickes³, Timor Kadir⁴, Jonas Kunst⁵, Bennett A Landman⁷, Reginald F Munden⁸, Petr Novotny⁴, Heiko Peschl³, Lyndsey C Pickup⁴, Catarina Santos⁴, Gary T Smith⁹,¹⁰, Ambika Talwar³, Fergus Gleeson³.

Abstract

Rationale: The management of indeterminate pulmonary nodules (IPNs) remains challenging, resulting in invasive procedures and delays in diagnosis and treatment. Strategies to decrease the rate of unnecessary invasive procedures and optimize surveillance regimens are needed.
Objectives: To develop and validate a deep learning method to improve the management of IPNs.
Methods: A Lung Cancer Prediction Convolutional Neural Network model was trained using computed tomography images of IPNs from the National Lung Screening Trial, internally validated, and externally tested on cohorts from two academic institutions.
Measurements and Main Results: The areas under the receiver operating characteristic curve in the external validation cohorts were 83.5% (95% confidence interval [CI], 75.4-90.7%) and 91.9% (95% CI, 88.7-94.7%), compared with 78.1% (95% CI, 68.7-86.4%) and 81.9% (95% CI, 76.1-87.1%), respectively, for a commonly used clinical risk model for incidental nodules. Using 5% and 65% malignancy thresholds defining low- and high-risk categories, the overall net reclassifications in the validation cohorts for cancers and benign nodules compared with the Mayo model were 0.34 (Vanderbilt) and 0.30 (Oxford) as a rule-in test, and 0.33 (Vanderbilt) and 0.58 (Oxford) as a rule-out test. Compared with traditional risk prediction models, the Lung Cancer Prediction Convolutional Neural Network was associated with improved accuracy in predicting the likelihood of disease at each threshold of management and in our external validation cohorts.
Conclusions: This study demonstrates that this deep learning algorithm can correctly reclassify IPNs into low- or high-risk categories in more than a third of cancers and benign nodules when compared with conventional risk models, potentially reducing the number of unnecessary invasive procedures and delays in diagnosis.

Keywords: computer-aided image analysis; early detection; lung cancer; neural networks; risk stratification

Year: 2020 PMID: 32326730 PMCID: PMC7365375 DOI: 10.1164/rccm.201903-0505OC

Source DB: PubMed Journal: Am J Respir Crit Care Med ISSN: 1073-449X Impact factor: 21.405

At a Glance Commentary

Scientific Knowledge on the Subject

It is unknown whether a deep learning algorithm applied to chest computed tomography scans of individuals presenting with indeterminate pulmonary nodules allows their reclassification into lower- or higher-risk groups.

What This Study Adds to the Field

These results suggest the potential utility of the Lung Cancer Prediction Convolutional Neural Network algorithm to revise the probability of disease for indeterminate pulmonary nodules, with the goal of decreasing invasive procedures and shortening the time to diagnosis.

Lung cancer remains the leading cause of cancer-related deaths in the United States and worldwide. In the United States alone, an estimated 228,820 adults will receive a diagnosis of lung cancer in 2020 (1). Despite recent progress in immunotherapy and other treatment modalities, the 5-year survival rate is 21.7% (2), mainly because most lung cancers are diagnosed at an advanced stage. Early diagnosis can markedly improve outcomes: the survival for patients with stage IA1 non–small cell cancer is 92% (3). There are two principal routes to an early lung cancer diagnosis. The first is screening using low-dose computed tomography (LDCT), which has been shown to reduce lung cancer deaths by 20% in the U.S. National Lung Screening Trial (NLST) (4), and by 26% in the European NELSON (Nederlands–Leuvens Longkanker Screenings Onderzoek) trial (5). The second route is the detection of cancer as an incidental finding in patients undergoing imaging for an unrelated reason. Indeterminate pulmonary nodules (IPNs) are reported as incidental findings in ∼30% of chest CTs, and it has been estimated that 1.57 million patients with pulmonary nodules are identified in this way every year in the United States (6). Regardless of the route to detection, the management of screen-detected and incidentally detected IPNs is a challenging clinical problem. One issue is the high false-positive rate of LDCT. The rate of positive LDCT screening tests in the NLST was reported to be ∼27% in the first two rounds and 17% in the third year of screening (4). More than 96% of all positive screens were false positives and 72% had some form of diagnostic follow-up.
Variability in image interpretation among radiologists is known to be high, and this may lead to variability in management (7, 8). Moreover, CT scans on which incidental nodules occur are frequently read by generalist radiologists with limited thoracic experience. Guidelines published by the American College of Radiology for screen-detected nodules (Lung-RADS) (9, 10), and by the Fleischner Society (11) and the British Thoracic Society (12) for incidentally detected nodules recommend management strategies based on qualitative or quantitative estimates of malignancy risk. Such estimates may incorporate clinical parameters such as patient age, smoking history, and cancer history, and radiological parameters such as nodule diameter, appearance, and location (11). Standard-of-care guidelines for incidental IPNs suggest thresholds for patient stratification (12–15); for example, nodule risks below 5% indicate interval surveillance imaging, whereas those above 65% indicate active intervention (biopsy/surgery). Such guidelines aim to optimize patient benefit given the performance of currently available tests. Despite their availability, adherence to these guidelines can be variable and patient stratification can be subjective (16). For example, intermediate-category nodules (i.e., 5–65% for American College of Chest Physicians and American College of Radiology Lung-RADS category 4) present challenges in the clinic because the guidelines do not provide specific recommendations for what is a very broad range of risk profiles. Patients with intermediate-risk nodules are typically associated with a large number of expensive and invasive tests (17), with an unacceptable rate of surgeries on benign nodules (13, 16). Such poor stratification may result in delayed diagnosis and treatment and potential upstaging. 
The current preferred option for smaller IPNs is growth assessment over time, which has been shown to contribute significantly to risk assessment (18, 19), although waiting may be difficult for patients and delay diagnosis and potential treatment. Growth identified over a short interval is less reliable for diagnosing malignancy than growth identified over longer periods (20, 21). Logistic regression–based methods, such as the Mayo and Brock risk models (22, 23), are recommended by some guidelines but are limited at least partly by their reliance on qualitative—and hence inconsistent—human interpretation of variables such as nodule size and morphology, and patients’ estimates of factors such as smoking history. Computer-aided risk stratification using machine learning (ML) classification of benign and malignant nodules could potentially address some of these limitations, and the availability of large datasets and increasingly powerful computational resources has made the development of such techniques feasible. Such techniques work directly with the image and patient clinical data, negating the need to first describe the morphology or measure the size of the nodule. Prior ML work on previous datasets has shown that such tools have the potential to outperform conventional risk models (24–30), but their performance has not been evaluated on multiple independent datasets, including incidentally detected nodules in smokers and nonsmokers. Moreover, the published literature lacks external validation, including data acquired using heterogeneous CT technology and protocols from a variety of clinical practices. Our study offers such a level of clinical validation, which is required for future clinical trials and ultimately for clinical practice. Our objective in this study was to derive and validate a computer-aided tool to classify benign and malignant nodules—a “digital biomarker” for use in patient stratification and management of IPNs. 
We aimed to investigate the performance of a deep learning risk stratification tool developed using the LDCT arm of the NLST (i.e., current/former smokers, 55–75 yr old, ≥30 pack-years) and internally and externally validated on multiple cohorts, including never-smokers. The eventual goal of this tool is to accelerate the diagnosis and treatment of malignant nodules, and to avoid unnecessary imaging and invasive procedures in patients with benign disease. Some of the results of these studies were reported in 2018 in the form of abstracts (31–33).

Methods

Study Design

In this study we used a prospective-specimen collection, retrospective-blinded-evaluation design (34). The Lung Cancer Prediction Convolutional Neural Network (LCP-CNN) developed by Optellum was derived and internally validated using the NLST dataset with cross-validation. Two independent external validation datasets were obtained from Vanderbilt University Medical Center (VUMC) and Oxford University Hospitals National Health Service Foundation Trust (OUH). These datasets included incidentally detected IPNs that had been brought to the attention of pulmonary physicians. The LCP-CNN was applied without modification to nodules from these populations (characterized in Table E1 in the online supplement).

Datasets

Deep learning methods require large representative datasets for training. Therefore, the derivation and internal validation dataset contained CT images of all solid and semisolid nodules of at least 6 mm in diameter from the NLST dataset. Ground-glass opacities were then excluded because there were too few malignant examples to train the system reliably. Working under the supervision of expert thoracic radiologists from OUH, a team of doctors and medical students performed an extensive data curation process (summarized in Figure E1). The final dataset contained 14,761 benign nodules from 5,972 patients and 932 malignant nodules from 575 patients. Note that each patient had up to three annual images, hence a nodule could be present on up to three CTs. The VUMC external validation dataset contains prospectively collected data from patients with incidental pulmonary nodules who were referred to a lung nodule clinic (Table E1). Patients of either sex, ≥18 years of age, with a CT scan reporting a solid pulmonary nodule 5–30 mm in diameter were included, provided that the patient had no history of a cancer diagnosis within 2 years before the nodule was detected. Nodules were only included if they had a diagnosis provided by histology or 2-year stability based on diameter. When multiple images of a nodule were available, the earliest study for which a thin-slice CT section (≤ 1.25 mm thick) was available was selected. The VUMC external validation dataset contained 116 nodules (52 benign [including at least 3 histoplasmosis] and 64 malignant) from 116 patients. The OUH external validation dataset contained retrospectively collected data from patients with incidental IPNs (Table E1). The same inclusion criteria as described above were used, except for a size range of 5–15 mm, a 5-year cancer cutoff, and no more than five nodules per patient. 
Although the criteria specified a diameter of 5–15 mm, all longitudinal studies were collected, and the earliest study for which a noncontrast CT was available was selected; therefore, the dataset included nodules up to 18.8 mm in diameter. The dataset contained 463 nodules from 427 patients. These included 63 cancer nodules from 62 different patients. Deidentified NLST datasets were obtained through the National Cancer Institute’s Cancer Data Access System (35). This research was approved by the OUH (Health Research Authority Integrated Research Application System ID: 214451) and VUMC institutional review boards (000616 and 030763).

Derivation of the LCP-CNN Model

The LCP-CNN system is based on the Dense Convolutional Network (36), a widely used type of deep learning CNN architecture (37) that was designed for computer vision tasks (Figure 1). An eightfold cross-validation strategy was used for training and validation on the NLST data, and the datasets were split into eight approximately equal subsets (see Figure E1). This approach allowed us to report results that were not used for training. In all reported results, the output of the LCP-CNN model is a score between 0% and 100% to represent a likelihood of malignancy. During development, it was found that clinical variables (e.g., age, sex, and smoking history) did not contribute significantly to the performance of the model, and hence they were excluded. Further details are provided in the online supplement.
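The eightfold split described above can be illustrated with a minimal sketch. It assumes, as the text does not state, that folds are assigned at the patient level so that repeated annual CTs of the same nodule never straddle training and test subsets. The function name `assign_folds` and the patient IDs are hypothetical, and the training step is left as a comment.

```python
import random

def assign_folds(patient_ids, n_folds=8, seed=0):
    """Give every nodule a fold index; nodules from one patient share a fold.

    Patient-level grouping is an assumption of this sketch, not a detail
    confirmed by the study description.
    """
    patients = sorted(set(patient_ids))
    random.Random(seed).shuffle(patients)
    # Round-robin assignment keeps the folds approximately equal in size.
    fold_of_patient = {p: i % n_folds for i, p in enumerate(patients)}
    return [fold_of_patient[p] for p in patient_ids]

# Hypothetical nodule-level records: one patient ID per nodule image.
nodule_patient_ids = ["p01", "p01", "p02", "p03", "p03", "p03", "p04",
                      "p05", "p06", "p07", "p08", "p09", "p10"]
folds = assign_folds(nodule_patient_ids)

for held_out in range(8):
    train_idx = [i for i, f in enumerate(folds) if f != held_out]
    test_idx = [i for i, f in enumerate(folds) if f == held_out]
    # A model would be fitted on train_idx and scored on test_idx here,
    # so every nodule is scored by a model that never saw it in training.
```

Pooling the held-out scores from all eight rounds gives one score per nodule, which is what makes a single cross-validated AUC over the whole dataset possible.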

Figure 1.

Schematics showing the (A) Lung Cancer Prediction Convolutional Neural Network (LCP-CNN) architecture, (B) the training procedure, and (C) application of the trained model to novel data. The input to the network is a three-dimensional anisotropically resampled box ∼56 mm in width.

Performance Metrics and Statistical Analysis

We measured the performance of the LCP-CNN model in three different ways. First, we examined the area under the curve (AUC) for the LCP-CNN classifier over all testing data and compared the results with those obtained for relevant risk models. Second, we examined its impact on patient stratification by conducting a reclassification analysis, using a rule-in threshold of >65% and a rule-out threshold of <5% according to the American College of Chest Physicians guidelines (11). We reported net reclassification indices (NRIs) for cases and controls separately (38, 39) for both thresholds to measure the LCP-CNN’s potential to change management. A two-way reclassification analysis was performed. For example, at 65%, we calculated the fraction of cancers that were correctly classified compared with the Mayo model (“net cancer”) by counting the cancers that scored >65% using LCP-CNN but scored ≤ 65% using Mayo (“cancer up”), and subtracting the number of cancers that scored >65% with Mayo and ≤ 65% with LCP-CNN (“cancer down”). We compared the internal validation dataset, which contained screening data, with the Brock model (23), as that model is appropriate for screening patients (i.e., older patients with a significant smoking history). The external validation dataset contained only incidentally detected nodules, including those detected in smokers and nonsmokers; therefore, we compared it with the more generally applicable Mayo model (22). Data regarding a family history of cancer or emphysema, which are necessary for the Brock model, were missing from many of the external datasets. To enable a cross-comparison, we also included Mayo results on the NLST. Third, we calculated the diagnostic likelihood ratio (DLR) to evaluate the clinical value added. Nonparametric bootstrapping with 10,000 samples was used for all confidence intervals and P values (40, 41).
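The two-way reclassification count and the bootstrap interval described above can be sketched as follows. This is an illustrative reading of the definitions in the text, not the authors' code: the function names, the `inclusive` flag (used so that "<5%" and ">65%" can both be expressed), the percentile method for the interval, and the toy scores are all assumptions of this sketch.

```python
import random

def net_reclassification(new, ref, is_cancer, threshold, inclusive=False):
    """Return (net cancer, net benign, overall) at one risk threshold.

    new, ref  : risk scores in percent from the new and reference models
    is_cancer : truthy for malignant nodules
    inclusive : treat a score equal to the threshold as "above" (rule-out case)
    """
    def above(score):
        return score >= threshold if inclusive else score > threshold

    n_cancer = sum(1 for c in is_cancer if c)
    n_benign = len(is_cancer) - n_cancer
    cancer_net = benign_net = 0
    for s_new, s_ref, cancer in zip(new, ref, is_cancer):
        move = int(above(s_new)) - int(above(s_ref))  # +1 up, -1 down, 0 same
        if cancer:
            cancer_net += move   # cancers moving up are improvements
        else:
            benign_net -= move   # benign nodules moving down are improvements
    net_cancer = cancer_net / n_cancer
    net_benign = benign_net / n_benign
    return net_cancer, net_benign, net_cancer + net_benign

def bootstrap_ci(new, ref, is_cancer, threshold, n_boot=1000, seed=0,
                 inclusive=False):
    """Percentile 95% interval for the overall index (nonparametric bootstrap)."""
    rng = random.Random(seed)
    n = len(new)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        labels = [is_cancer[i] for i in idx]
        if all(labels) or not any(labels):
            continue  # resample is missing one class, so the index is undefined
        stats.append(net_reclassification([new[i] for i in idx],
                                          [ref[i] for i in idx],
                                          labels, threshold, inclusive)[2])
    stats.sort()
    return stats[int(0.025 * (len(stats) - 1))], stats[int(0.975 * (len(stats) - 1))]

# Toy scores (hypothetical): three cancers followed by three benign nodules.
new_scores = [70, 60, 80, 3, 10, 70]
ref_scores = [60, 70, 50, 10, 3, 2]
labels = [1, 1, 1, 0, 0, 0]
net_cancer, net_benign, overall = net_reclassification(
    new_scores, ref_scores, labels, threshold=65)
low, high = bootstrap_ci(new_scores, ref_scores, labels, threshold=65)
```

In the toy data, two cancers move above 65% and one moves below, while one benign nodule moves above, so the net cancer and net benign fractions cancel in the overall index. The study used 10,000 bootstrap samples; the smaller default here only keeps the sketch fast.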

Results

AUC Performance

The model was first internally validated using cross-validation on the NLST dataset (Figure E1). The AUC over all the testing data for the LCP-CNN classifier was 92.1% (95% confidence interval [CI], 91.2–92.9%), compared with 85.6% (95% CI, 84.3–86.8%) for the Brock model (Figure 2A) (P < 0.001) and 85.2% (95% CI, 84.1–86.4%) for the Mayo model (P < 0.001). The performances of the Brock and Mayo models were not statistically different on the NLST (P = 0.126).

Figure 2.

Receiver operating characteristic curves and area under the curve (AUC) analysis of the (A) internal National Lung Screening Trial (NLST) dataset using eight-way cross-validation, (B) external Vanderbilt dataset, and (C) external Oxford dataset. The Brock model was used as a comparator for the screening population, and the Mayo model was used for the incidental nodule populations for the two independent validation datasets. LCP-CNN = Lung Cancer Prediction Convolutional Neural Network.

To demonstrate generalizability beyond the NLST data, we tested the LCP-CNN on the two independent, nonscreening external cohorts. The AUCs on these represented an improvement of 5–10 percentage points of AUC compared with existing clinical prediction tools. On the OUH data, the AUC for the LCP-CNN classifier was 91.9% (95% CI, 88.7–94.7%) versus 81.9% (95% CI, 76.1–87.1%) for Mayo (P = 0.018). On the VUMC data, the AUC for the LCP-CNN classifier was 83.5% (95% CI, 75.4–90.7%) versus 78.1% (95% CI, 68.7–86.4%) for Mayo (P = 0.082). Figures 2B and 2C show the corresponding receiver operating characteristic curves.

Reclassification Performance

We analyzed the model by comparing its ability to reclassify benign and malignant nodules with that of conventional risk models using >65% (rule-in) and <5% (rule-out) thresholds. Figure 3 illustrates the benefit of the LCP-CNN in reclassifying nodules compared with the Brock and Mayo models selected for the clinical setting (screening or incidental). Table E2 provides a numerical annotation of Figure 3. The reclassification indices (42) for <5% and >65% risk thresholds were calculated separately, defining low- and high-risk categories, and are shown in Table 1. NRI results for Mayo applied to the NLST are shown in Figure E6, and reclassifications against other guideline-relevant thresholds are included in Table E3.

Figure 3.

Table 1.

Reclassification of Indeterminate Pulmonary Nodules with the Lung Cancer Prediction Convolutional Neural Network

National Lung Screening Trial Reclassification: Compared with Brock (Screening Population)
Target (%)	Cancer Up (95% CI)	Cancer Down (95% CI)	Net Cancer (95% CI)	Net Cancer P Value	Benign Up (95% CI)	Benign Down (95% CI)	Net Benign (95% CI)	Net Benign P Value	Overall (95% CI)	Overall P Value
5	0.11 (0.09 to 0.13)	0.02 (0.01 to 0.02)	0.09 (0.07 to 0.11)	<0.0001	0.16 (0.15 to 0.16)	0.12 (0.12 to 0.13)	−0.04 (−0.04 to 0.03)	<0.0001	0.06 (0.03 to 0.08)	<0.0001
65	0.54 (0.51 to 0.57)	0.02 (0.01 to 0.03)	0.52 (0.49 to 0.56)	<0.0001	0.05 (0.04 to 0.05)	0.00 (0.00 to 0.01)	−0.04 (−0.05 to 0.04)	<0.0001	0.48 (0.45 to 0.51)	<0.0001

Definition of abbreviation: CI = confidence interval.

Reclassification indices for cancers and benign nodules on the National Lung Screening Trial, Vanderbilt, and Oxford University Hospitals datasets for the rule-out test with a 5% threshold and the rule-in test with a 65% threshold are shown. For each threshold, the proportion of cancers that moved above a given threshold (i.e., scored below the threshold on the comparator model and above the threshold on the Lung Cancer Prediction Convolutional Neural Network) is designated as “cancer up.” Movement of cancers and benign nodules is recorded in both the up and down directions as a proportion of the total number of cancers or benign nodules, respectively. The “net cancer” movement is positive when more cancers are reclassified above the threshold than are reclassified below the threshold, and conversely, the “net benign” movement is positive when more benign nodules are reclassified below the threshold.

Reclassification diagrams. (A) National Lung Screening Trial (NLST) dataset for 200 cases and 200 benign nodules (randomly selected; numbers were limited for readability of the figure). (B) Vanderbilt University Medical Center dataset. (C) Oxford University Hospitals dataset. Reclassification diagrams are a useful way to visualize the impact of a new biomarker compared with a reference at predefined thresholds. Here we use rule-out and rule-in thresholds at 5% and 65%, respectively, as shown by the black lines. Red triangles indicate cancers, and blue circles indicate controls. If a new biomarker improves classification of cancers compared with the reference, then one would expect, for example, cases (red triangles) that were below 65% on the horizontal axis to move above 65% on the vertical axis, that is, from the central rectangular region to the region immediately above it. For example, on the Vanderbilt and Oxford datasets, 45% and 32% of the cancers, respectively, are reclassified up compared with the Mayo model. Similarly, a new biomarker improves benign classification compared with the reference if it moves controls (blue circles) that were above the 5% threshold on the horizontal axis to below 5% on the vertical axis. For nodules that stay within the three square regions intersected by the green diagonal, the Lung Cancer Prediction Convolutional Neural Network (LCP-CNN) does not add value because none of the nodules are correctly reclassified compared with the Brock or Mayo model. On the Vanderbilt and Oxford datasets, 33% and 61% of the benign nodules, respectively, are reclassified down compared with the Mayo model.

Rule-in Test (>65%)

On the VUMC dataset, the NRI was 0.34 (95% CI, 0.15 to 0.52; P = 0.0004). Of the 64 cancers, 45 (70%) were classified as high-risk by the LCP-CNN, compared with 16 (25%) classified by Mayo (net cancer: 0.45 [95% CI, 0.33 to 0.58]; P < 0.0001). The LCP-CNN false-positive rate was slightly closer to that expected at this probability threshold compared with Mayo. Of 52 benign nodules, 11 (21%) were false positives with LCP-CNN, compared with 5 (10%) with Mayo (net benign: −0.12 [95% CI, −0.25 to 0.00]; P = 0.0439). On the OUH dataset, the NRI was 0.29 (95% CI, 0.18 to 0.41; P < 0.0001). Of 63 cancers, 23 (36%) were classified as high-risk, compared with 3 (5%) classified by Mayo (net cancer: 0.32 [95% CI, 0.21 to 0.43]; P < 0.0001). Among the benign nodules, the LCP-CNN had 10 (2.5%) false positives and Mayo had 1 (0.25%), resulting in a false-positive rate slightly closer to that expected at this risk threshold (net benign: −0.02 [95% CI, −0.04 to −0.01]; P < 0.0001).

Rule-out Test (<5%)

The VUMC NRI was 0.33 (95% CI, 0.20–0.47; P < 0.0001). Of the 52 benign nodules in VUMC, 23 (44%) were ruled out by LCP-CNN and 6 (12%) were ruled out by Mayo (net benign: 0.33 [95% CI, 0.19–0.46]; P < 0.0001), and both had 1 (2%) false negative. The OUH NRI was 0.58 (95% CI, 0.51–0.64; P < 0.0001). Of the 400 benign nodules in OUH, 257 (64%) were ruled out by LCP-CNN and 12 (3%) were ruled out by Mayo (net benign: 0.62 [95% CI, 0.57–0.67]; P < 0.0001). There were two (3%) false negatives for LCP-CNN and none for Mayo.
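The reported indices can be approximately recovered from the nodule counts quoted in the rule-in and rule-out paragraphs. The sketch below assumes the overall index is the sum of the net cancer and net benign fractions, which is the usual category-based definition and agrees with Table 1 to within rounding. The variable names are illustrative.

```python
# Rule-in (>65%): cancers flagged high risk count for the new model,
# benign nodules flagged high risk (false positives) count against it.
vumc_rule_in = (45 - 16) / 64 + (5 - 11) / 52    # LCP-CNN vs Mayo, VUMC
ouh_rule_in = (23 - 3) / 63 + (1 - 10) / 400     # LCP-CNN vs Mayo, OUH

# Rule-out (<5%): benign nodules ruled out count for the new model,
# cancers ruled out (false negatives) count against it.
vumc_rule_out = (23 - 6) / 52 + (1 - 1) / 64     # one false negative each
ouh_rule_out = (257 - 12) / 400 + (0 - 2) / 63   # two vs zero false negatives

for name, value in [("VUMC rule-in", vumc_rule_in),
                    ("OUH rule-in", ouh_rule_in),
                    ("VUMC rule-out", vumc_rule_out),
                    ("OUH rule-out", ouh_rule_out)]:
    print(f"{name}: {value:.3f}")
```

These evaluate to roughly 0.338, 0.295, 0.327, and 0.581, in line with the reported 0.34, 0.29, 0.33, and 0.58. Only the marginal counts are needed, because the number moving up minus the number moving down equals the change in the count above the threshold.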

DLR Performance

Table 2 presents the sensitivity and specificity of all models, with the corresponding positive and negative DLRs. The positive DLR for rule-in at >65% using the LCP-CNN was 3.32 for VUMC and 14.6 for OUH, and the negative DLR for rule-out (<5% threshold) was 0.04 for VUMC and 0.05 for OUH.
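As a check on how these ratios follow from sensitivity and specificity, the sketch below applies the standard definitions (positive DLR = sensitivity / (1 − specificity); negative DLR = (1 − sensitivity) / specificity) to the LCP-CNN nodule counts quoted in the reclassification paragraphs. The function names are illustrative, and the counts are taken from the text rather than from Table 2.

```python
def positive_dlr(sensitivity, specificity):
    """How much a positive (above-threshold) result raises the odds of cancer."""
    return sensitivity / (1 - specificity)

def negative_dlr(sensitivity, specificity):
    """How much a negative (below-threshold) result lowers the odds of cancer."""
    return (1 - sensitivity) / specificity

# Rule-in at >65%: sensitivity = cancers above threshold / all cancers,
# specificity = benign nodules not above threshold / all benign nodules.
vumc_pos = positive_dlr(45 / 64, (52 - 11) / 52)
ouh_pos = positive_dlr(23 / 63, (400 - 10) / 400)

# Rule-out at <5%: sensitivity = cancers not ruled out / all cancers,
# specificity = benign nodules ruled out / all benign nodules.
vumc_neg = negative_dlr((64 - 1) / 64, 23 / 52)
ouh_neg = negative_dlr((63 - 2) / 63, 257 / 400)

print(round(vumc_pos, 2), round(ouh_pos, 1), round(vumc_neg, 2), round(ouh_neg, 2))
```

The rounded values match the 3.32, 14.6, 0.04, and 0.05 reported above, which suggests the reported ratios were computed at the nodule level from the same counts.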

Table 2.

Sensitivity, Specificity, and Diagnostic Likelihood Ratio Testing Associated with the Lung Cancer Prediction Convolutional Neural Network at Specific Thresholds

Threshold (%)	NLST
	Brock			LCP-CNN
	Sensitivity	Specificity	DLR⁻	Sensitivity	Specificity	DLR⁻
5	86.5 (84.1–88.6)	66.5 (65.8–67.2)	0.20 (0.17–0.24)	95.6 (94.2–96.9)	62.9 (62.1–63.7)	0.07 (0.05–0.09)

Definition of abbreviations: DLR = diagnostic likelihood ratio; Inf = infinity; LCP-CNN = Lung Cancer Prediction Convolutional Neural Network; NLST = National Lung Screening Trial; OUH = Oxford University Hospitals; VUMC = Vanderbilt University Medical Center.

The sensitivity, specificity, and DLRs for the NLST, VUMC, and OUH datasets at 5% and 65% probability thresholds are shown. For rule-out at 5%, a good risk model is one that can provide the greatest specificity while maintaining an adequately high sensitivity. For rule-in at 65%, a high specificity indicates few unnecessary procedures for patients with benign nodules. The LCP-CNN has a much higher sensitivity for most of these operating points, indicating that many more cancers than indicated by the Brock or Mayo model could be ruled in for fast-tracked interventions.


Discussion

The management of screen-detected and incidentally detected IPNs is a challenging and growing clinical problem. Pulmonary nodules are detected in up to 30% of chest CT studies, and the vast majority of these are benign. Importantly, establishing a definite diagnosis of IPNs can take up to 2 years and can result in many follow-up procedures, including imaging, biopsy, and surgery. In this study, we report on the derivation and validation of the LCP-CNN, a deep learning lung cancer malignancy prediction tool, to classify and risk stratify IPNs from screening and nonscreening data. This study is the first to validate such a tool on multiple independent cohorts, including a large multicenter screening dataset (n = 15,693) and real-world clinical nodules (n = 579), and to show a reclassification performance that is significantly superior to that of existing risk models (net reclassification of at least 30% on the external validation cohorts compared with Mayo) and could potentially change patient management. Although previous studies demonstrated the potential of radiomics/ML for predicting IPN malignancy (43–47), most of these studies used small datasets (e.g., ∼100 nodules) or did not perform external validation. Recent studies by Ardila and colleagues (30) and Huang and colleagues (48) also trained on the NLST. The former used an external validation dataset from one center that included 27 cancers, and the latter is not directly comparable because it used human-reported parameters rather than CT images directly. In contrast, our model exhibited a robust performance on multiple independent, real-world, heterogeneous datasets (acquired with many different imaging protocols and scanners; see Table E4) across two continents, independently of differences in patient demographics. 
Also, for the first time, we demonstrate that a deep learning method that is appropriately trained on screening data generalizes well to the complex problem of incidentally detected nodules, including those in smokers and nonsmokers. Performance on the internal validation NLST data is not directly comparable between our study and that by Ardila and colleagues (30). In our work, each malignant nodule was tracked to earlier CTs and considered malignant. Ardila and colleagues used only the CT nearest in time to the diagnosis; therefore, our NLST dataset had a greater number of smaller, more difficult to detect cancers. Moreover, Ardila and colleagues combined both detection and classification steps and classified at the image level, whereas the LCP-CNN performs only classification and considers each nodule separately. Quantitative measures of IPN growth, such as the volume doubling time, were shown to provide excellent classification performance within the NELSON screening trial (19). However, such measures have not been extensively validated for incidental IPNs, and additionally require at least a second follow-up CT and accurate segmentation, which may fail in a substantial number of patients. The LCP-CNN uses only one CT image, often the earliest, and segmentation is not required. Although logistic regression–based risk models, such as the Brock (23), Mayo (22), and Gould (14) models, may be helpful for standardizing nodule management, they require information about the patient and nodule, and their results are dominated by nodule size. They typically use radiologist-reported parameters, which are subject to variability (7, 8, 49). In contrast, the LCP-CNN is both more performant and unaffected by such subjective assessments, deriving its information directly from the image. The LCP-CNN demonstrates superiority to the Mayo or Brock models, encouraging further exploration of its utility. 
The test provides excellent negative predictive value, and hence LCP-CNN scores below 5% would indicate the need for surveillance according to Fleischner Society guidelines (3). Above 65%, it would indicate the need for a tissue diagnosis. Because patients’ preferences with regard to management depend on their understanding of the risks involved, a reliable estimate of the probability of cancer would be helpful in shared decision-making. Although the NRI analysis was performed over the full range of IPNs, an inspection of Figures 3B and 3C and Table E3 shows that the LCP-CNN left fewer patients in the intermediate-risk region than the Mayo model on both validation sets, and correctly reclassified many of Mayo’s intermediate cases. Figures E2–E5 show examples of IPNs and LCP-CNN scores. An examination of these results is useful for gaining an intuitive understanding of the tool. For example, many intrapulmonary lymph nodes are assigned very low scores (typically <0.5%), whereas more complex benign cases (infection/immune response) tend to score higher, perhaps because of their suspicious appearance, which more closely resembles a malignancy. The three lowest-scoring cancer nodules from the Vanderbilt population (Figure E5) were rather small, indolent tumors, and a diagnosis was only available 606, 1,872, and 537 days, respectively, after the CT on which the LCP-CNN score was calculated, and hence may be safely monitored with follow-up imaging. An inspection of misclassifications also provides excellent feedback for refining our digital biomarker. For example, the fourth lowest-scoring nodule was a carcinoid. These nodules were underrepresented in the training set and typically had a benign-looking, smooth, round appearance. One histoplasmoma was misclassified as high risk; however, granulomas are known to represent a significant clinical challenge (38, 39). 
Using a model trained on screening data but tested on incidental IPNs is adequate for biomarker validation studies, which use large numbers of patients from heterogeneous populations (50). Although an inability to generalize to “real-world” clinical care can cause many biomarkers to fail in validation studies, our LCP-CNN demonstrated efficacy on nodules obtained from two separate pulmonary clinics with very different populations and disease prevalence. The guideline thresholds for low and high risk are chosen as a function of the accuracy of the available tests and may shift as better tests become available. We demonstrated the advantage of LCP-CNN over traditional prediction models using multiple metrics of biomarker performance, including discrimination (51), reclassification (42), and likelihood statistic testing.
The work presented here has limitations. Although we compared the performance of LCP-CNN with that of relevant clinical risk models, we did not report its potential to change clinical decision-making. Because some clinical parameters were missing, not all risk models could be run on all datasets. In the future, comparisons with multiple models would be desirable (52, 53). Because of the smaller size of the VUMC dataset (n = 116), the difference in AUC was not significant (P = 0.082), although all VUMC reclassification results were significant. As discussed above, despite the differences in disease prevalence and patient populations across the three validation datasets, the same linear calibration between the LCP-CNN and risk was used for all the results shown in Figure 3; however, the results may be further optimized by a population-specific calibration. For example, although the reclassification of VUMC and OUH datasets was very good, on the NLST, 3.5% of controls were incorrectly classified as intermediate risk compared with Brock, because of the low prevalence of disease.
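To illustrate the reclassification metric referred to above, the following is a minimal sketch of a two-category net reclassification index at a single risk threshold, in its commonly used form: the net proportion of cancers moved above the threshold plus the net proportion of benign nodules moved below it. This is an assumption about the general form of the statistic and may differ from the exact rule-in and rule-out computations used in this study; all names and the example values are hypothetical.

```python
def net_reclassification(old_risk, new_risk, is_cancer, threshold):
    """Two-category NRI of a new model versus an old model at one threshold.

    Returns (nri_cancers, nri_benign, nri_total). A nodule counts as
    "high" when its risk is at or above the threshold (an assumption).
    """
    up_cancer = down_cancer = n_cancer = 0
    up_benign = down_benign = n_benign = 0
    for old, new, cancer in zip(old_risk, new_risk, is_cancer):
        old_high = old >= threshold
        new_high = new >= threshold
        moved_up = new_high and not old_high
        moved_down = old_high and not new_high
        if cancer:
            n_cancer += 1
            up_cancer += moved_up
            down_cancer += moved_down
        else:
            n_benign += 1
            up_benign += moved_up
            down_benign += moved_down
    if n_cancer == 0 or n_benign == 0:
        raise ValueError("need at least one cancer and one benign nodule")
    # Cancers should move up; benign nodules should move down.
    nri_cancers = (up_cancer - down_cancer) / n_cancer
    nri_benign = (down_benign - up_benign) / n_benign
    return nri_cancers, nri_benign, nri_cancers + nri_benign


# Hypothetical risks for four nodules (two cancers, two benign).
result = net_reclassification(
    old_risk=[0.30, 0.70, 0.40, 0.70],
    new_risk=[0.80, 0.90, 0.10, 0.20],
    is_cancer=[True, True, False, False],
    threshold=0.65,
)
```

In this made-up example, one of two cancers moves above the 65% threshold and one of two benign nodules moves below it, so each component is 0.5.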
The OUH dataset did not capture the patients’ history of cancer, which is necessary to calculate the Mayo risk scores, although patients who had received a cancer diagnosis in the last 5 years were excluded. Therefore, in calculating the Mayo scores, it was assumed that the OUH patients had no history of cancer. Although the results are at the nodule level rather than the patient level, the VUMC dataset only had one nodule per patient, and the mean number of nodules per patient in the OUH dataset was 1.08.
In summary, using an ML method as a diagnostic algorithm, our LCP-CNN model provided a significant improvement in AUC over the clinically validated risk models (Brock and Mayo). Furthermore, it achieved a strong improvement in DLRs in both clinical validation sets, which included different patient populations. Our model is intended to be improved over time as data collections are added and structured curation efforts continue. Although more stringent clinical validations on additional (external and independent) datasets are needed, our results suggest that it may be possible to address a major problem in the management of individuals presenting with IPNs by using an ML-derived prediction model.

41 in total

1. Detours on the road to personalized medicine: barriers to biomarker validation and implementation.

Authors: Louis D Fiore; Leonard William D'Avolio
Journal: JAMA Date: 2011-11-02 Impact factor: 56.272

2. Guidelines for Management of Incidental Pulmonary Nodules Detected on CT Images: From the Fleischner Society 2017.

Authors: Heber MacMahon; David P Naidich; Jin Mo Goo; Kyung Soo Lee; Ann N C Leung; John R Mayo; Atul C Mehta; Yoshiharu Ohno; Charles A Powell; Mathias Prokop; Geoffrey D Rubin; Cornelia M Schaefer-Prokop; William D Travis; Paul E Van Schil; Alexander A Bankier
Journal: Radiology Date: 2017-02-23 Impact factor: 11.105

3. Reduced lung-cancer mortality with low-dose computed tomographic screening.

Authors: Denise R Aberle; Amanda M Adams; Christine D Berg; William C Black; Jonathan D Clapp; Richard M Fagerstrom; Ilana F Gareen; Constantine Gatsonis; Pamela M Marcus; JoRean D Sicks
Journal: N Engl J Med Date: 2011-06-29 Impact factor: 91.245

4. A clinical model to estimate the pretest probability of lung cancer in patients with solitary pulmonary nodules.

Authors: Michael K Gould; Lakshmi Ananth; Paul G Barnett
Journal: Chest Date: 2007-02 Impact factor: 9.410

5. Prediction of lung cancer risk at follow-up screening with low-dose CT: a training and validation study of a deep learning method.

Authors: Peng Huang; Cheng T Lin; Yuliang Li; Martin C Tammemagi; Malcolm V Brock; Sukhinder Atkar-Khattra; Yanxun Xu; Ping Hu; John R Mayo; Heidi Schmidt; Michel Gingras; Sergio Pasian; Lori Stewart; Scott Tsai; Jean M Seely; Daria Manos; Paul Burrowes; Rick Bhatia; Ming-Sound Tsao; Stephen Lam
Journal: Lancet Digit Health Date: 2019-10-17

6. Predicting Malignant Nodules from Screening CT Scans.

Authors: Samuel Hawkins; Hua Wang; Ying Liu; Alberto Garcia; Olya Stringfield; Henry Krewer; Qian Li; Dmitry Cherezov; Robert A Gatenby; Yoganand Balagurunathan; Dmitry Goldgof; Matthew B Schabath; Lawrence Hall; Robert J Gillies
Journal: J Thorac Oncol Date: 2016-07-13 Impact factor: 15.609

Review 7. Evaluation of individuals with pulmonary nodules: when is it lung cancer? Diagnosis and management of lung cancer, 3rd ed: American College of Chest Physicians evidence-based clinical practice guidelines.

Authors: Michael K Gould; Jessica Donington; William R Lynch; Peter J Mazzone; David E Midthun; David P Naidich; Renda Soylemez Wiener
Journal: Chest Date: 2013-05 Impact factor: 9.410

Review 8. Net reclassification indices for evaluating risk prediction instruments: a critical review.

Authors: Kathleen F Kerr; Zheyu Wang; Holly Janes; Robyn L McClelland; Bruce M Psaty; Margaret S Pepe
Journal: Epidemiology Date: 2014-01 Impact factor: 4.822

9. Towards automatic pulmonary nodule management in lung cancer screening with deep learning.

Authors: Francesco Ciompi; Kaman Chung; Sarah J van Riel; Arnaud Arindra Adiyoso Setio; Paul K Gerke; Colin Jacobs; Ernst Th Scholten; Cornelia Schaefer-Prokop; Mathilde M W Wille; Alfonso Marchianò; Ugo Pastorino; Mathias Prokop; Bram van Ginneken
Journal: Sci Rep Date: 2017-04-19 Impact factor: 4.379

10. 3D multi-view convolutional neural networks for lung nodule classification.

Authors: Guixia Kang; Kui Liu; Beibei Hou; Ningbo Zhang
Journal: PLoS One Date: 2017-11-16 Impact factor: 3.240
