Gregory R Hart, David A Roffman, Roy Decker, Jun Deng.
Abstract
The objective of this study is to train and validate a multi-parameterized artificial neural network (ANN) based on personal health information to predict lung cancer risk with high sensitivity and specificity. The 1997-2015 National Health Interview Survey adult data were used to train and validate our ANN, with the following inputs: gender, age, BMI, diabetes, smoking status, emphysema, asthma, race, Hispanic ethnicity, hypertension, heart diseases, vigorous exercise habits, and history of stroke. We identified 648 cancer and 488,418 non-cancer cases. For the training set, the sensitivity was 79.8% (95% CI, 75.9%-83.6%), specificity was 79.9% (79.8%-80.1%), and AUC was 0.86 (0.85-0.88). For the validation set, sensitivity was 75.3% (68.9%-81.6%), specificity was 80.6% (80.3%-80.8%), and AUC was 0.86 (0.84-0.89). Our results indicate that the use of an ANN based on personal health information gives high specificity and modest sensitivity for lung cancer detection, offering a cost-effective and non-invasive clinical tool for risk stratification.
Year: 2018 PMID: 30356283 PMCID: PMC6200229 DOI: 10.1371/journal.pone.0205264
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
The demographics of the NHIS dataset that was used in our ANN.
We show means and standard deviations for the continuous variables, means for the binary variables, and the percentage for each race.
| Input | Lung Cancer | Non-Cancer |
|---|---|---|
| Age | 65.6 (±11.8) | 46.1 (±17.6) |
| BMI | 25.8 (±5.9) | 27.3 (±6.0) |
| Heart Disease Score | 0.13 (±0.22) | 0.040 (±0.13) |
| Number of Vigorous Exercise Sessions per Week | 0.38 (±1.7) | 1.60 (±3.0) |
| Female | 53.8% | 54.9% |
| Ever Smoked | 83.8% | 41.8% |
| Has Emphysema | 24.1% | 1.53% |
| Has Asthma | 18.8% | 11.2% |
| Has Diabetes | 17.4% | 7.92% |
| Ever Had a Stroke | 9.57% | 2.55% |
| Has Hypertension | 18.8% | 11.2% |
| Hispanic Ethnicity | 7.10% | 16.7% |
| Race: | | |
| Caucasian | 82.4% | 77.3% |
| African American | 14.2% | 15.3% |
| Asian | 1.39% | 4.96% |
| Native American/Alaska Native | 0.309% | 0.868% |
| Multiracial | 1.70% | 1.55% |
Fig 1. A sketch of our ANN.
All lines are weights connecting one layer to the next, and each circle is an input, a neuron, or an output. The bias terms are analogous to intercepts and improve the model’s performance.
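As a rough illustration of the structure sketched in Fig 1, the following minimal forward pass shows how weights and bias terms combine layer by layer. The hidden-layer width, the sigmoid activation, and the random weights are assumptions made for this sketch; the source text here does not specify them, and a trained network would use fitted values instead.

```python
import math
import random

def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One fully connected layer: each neuron computes sigmoid(w . x + b)."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(x, layers):
    """Propagate an input vector through a list of (weights, biases) layers."""
    for weights, biases in layers:
        x = dense(x, weights, biases)
    return x

def random_layer(n_in, n_out, rng):
    """Hypothetical initialisation; a trained network would load fitted values."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases

rng = random.Random(0)
N_INPUTS = 13   # the 13 inputs named in the abstract
N_HIDDEN = 8    # assumed hidden-layer width, not taken from the paper
layers = [random_layer(N_INPUTS, N_HIDDEN, rng), random_layer(N_HIDDEN, 1, rng)]

example_input = [0.5] * N_INPUTS   # every input is assumed pre-scaled to [0, 1]
risk_score = forward(example_input, layers)[0]
print(f"untrained risk score: {risk_score:.3f}")
```

The bias plays the role of an intercept: with all weights at zero, a neuron's output is determined by its bias alone.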
A description of the inputs used in our ANN.
| Input | Input Type | Input Range | Details |
|---|---|---|---|
| Age | Continuous | 0-1 | 18-85, (85+ recorded as 85) |
| BMI | Continuous | 0-1 | BMI of 99.95+ recorded as 99.95 |
| Heart Disease Score | Continuous | 0-1 | Coronary heart disease, Angina, Heart attacks, and other heart complications each contribute 0.25 to the score |
| Vigorous Exercise | Continuous | 0-1 | Number of times per week vigorous exercise is performed; 28+ is treated as 28. Minimum time for exercise to count was 10 minutes, except for the first half of 1997 for which it was 20 minutes. |
| Gender | Binary | 0 or 1 | 0 is a man and 1 is a woman |
| Ever Smoked | Binary | 0 or 1 | Never smoked is 0 and current and former smokers are 1 |
| Emphysema | Binary | 0 or 1 | No COPD is 0 and COPD is 1 |
| Asthma | Binary | 0 or 1 | No asthma is 0 and asthma is 1 |
| Diabetes | Binary | 0 or 1 | Non-diabetics and pre-diabetics are 0, with diabetics being 1 |
| Strokes | Binary | 0 or 1 | No stroke is 0 and a prior stroke is 1 |
| Hypertension | Binary | 0 or 1 | No hypertension is 0, and having a single measurement of it is 1 |
| Hispanic Ethnicity | Binary | 0 or 1 | Non-Hispanic is 0 and Hispanic is 1 |
| Race | Continuous | 0-1 | Each race is assigned a value equal to its own fraction of the sample plus the fractions of all less common races |
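The encoding described in the table above could be sketched as follows. This is an illustration under stated assumptions, not the authors' preprocessing code: the min-max mapping onto 0-1, the lower bound of 0 for BMI and exercise, the dictionary field names, and the example race fractions are all hypothetical.

```python
def clip(value, low, high):
    """Cap a value to the closed interval [low, high]."""
    return max(low, min(high, value))

def scale(value, low, high):
    """Min-max scale to [0, 1] after capping (the exact mapping is an assumption)."""
    return (clip(value, low, high) - low) / (high - low)

def heart_disease_score(chd, angina, heart_attack, other):
    """Each of the four conditions contributes 0.25 to the score."""
    return 0.25 * sum(bool(c) for c in (chd, angina, heart_attack, other))

def race_values(fractions):
    """A race's own sample fraction plus the fractions of all less common races."""
    values, running = {}, 0.0
    for race, frac in sorted(fractions.items(), key=lambda kv: kv[1]):
        running += frac
        values[race] = min(running, 1.0)  # guard against floating-point overshoot
    return values

def encode(person, race_code):
    """Build the 13-element input vector; field names are hypothetical."""
    return [
        scale(person["age"], 18, 85),                        # 85+ recorded as 85
        scale(person["bmi"], 0, 99.95),                      # lower bound 0 assumed
        heart_disease_score(*person["heart_conditions"]),
        scale(person["vigorous_exercise_per_week"], 0, 28),  # 28+ treated as 28
        1.0 if person["female"] else 0.0,
        1.0 if person["ever_smoked"] else 0.0,
        1.0 if person["emphysema"] else 0.0,
        1.0 if person["asthma"] else 0.0,
        1.0 if person["diabetes"] else 0.0,
        1.0 if person["stroke"] else 0.0,
        1.0 if person["hypertension"] else 0.0,
        1.0 if person["hispanic"] else 0.0,
        race_code[person["race"]],
    ]

# Hypothetical sample fractions, for illustration only.
race_code = race_values({"A": 0.70, "B": 0.20, "C": 0.10})

person = {
    "age": 90, "bmi": 27.3, "heart_conditions": (True, False, True, False),
    "vigorous_exercise_per_week": 3, "female": True, "ever_smoked": True,
    "emphysema": False, "asthma": False, "diabetes": False, "stroke": False,
    "hypertension": True, "hispanic": False, "race": "B",
}
vector = encode(person, race_code)
print(vector)
```

With this cumulative scheme the most common race always maps to 1 and rarer races map to progressively smaller values.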
Fig 2. The sensitivity and specificity for the training and validation datasets as functions of the cutoff values.
Fig 3. An ROC plot for our ANN’s training and validation datasets.
Fig 4. An ROC plot for our ANN’s training and validation datasets as well as the performance of Random Forest and Support Vector Machine.
Fig 5. Cumulative distribution functions for high risk (solid line) and low risk (dashed line) for the population without cancer (orange) and the population with cancer (blue) in the validation dataset.
Allowing for a 1% misclassification rate (black line), we can divide individual cancer risk into 3 categories: high (red), medium (yellow), and low (green, too narrow to see on the left of this figure).
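One possible reading of this three-way split is sketched below: the low-risk cutoff is placed so that at most 1% of cancer cases score below it, and the high-risk cutoff so that at most 1% of non-cancer cases score above it. This interpretation, the function names, and the synthetic beta-distributed scores are assumptions made for illustration; they are not the authors' procedure or data.

```python
import random

def quantile(sorted_values, q):
    """Simple rank-based quantile on a pre-sorted list (0 <= q <= 1)."""
    index = min(len(sorted_values) - 1, max(0, int(q * len(sorted_values))))
    return sorted_values[index]

def risk_cutoffs(cancer_scores, non_cancer_scores, tolerance=0.01):
    """Cutoffs that misclassify at most `tolerance` of each group."""
    low_cut = quantile(sorted(cancer_scores), tolerance)
    high_cut = quantile(sorted(non_cancer_scores), 1.0 - tolerance)
    return low_cut, high_cut

def categorize(score, low_cut, high_cut):
    """Assign a risk category from a model score and the two cutoffs."""
    if score < low_cut:
        return "low"
    if score > high_cut:
        return "high"
    return "medium"

# Synthetic scores standing in for ANN outputs (not real data).
rng = random.Random(42)
cancer_scores = [rng.betavariate(5, 2) for _ in range(1000)]
non_cancer_scores = [rng.betavariate(2, 5) for _ in range(1000)]

low_cut, high_cut = risk_cutoffs(cancer_scores, non_cancer_scores)
cancer_low = sum(categorize(s, low_cut, high_cut) == "low" for s in cancer_scores)
non_cancer_high = sum(categorize(s, low_cut, high_cut) == "high" for s in non_cancer_scores)
print(f"low cutoff {low_cut:.3f}, high cutoff {high_cut:.3f}")
print(f"cancer cases labelled low: {cancer_low / len(cancer_scores):.1%}")
print(f"non-cancer cases labelled high: {non_cancer_high / len(non_cancer_scores):.1%}")
```

Everyone between the two cutoffs falls into the medium category, which is why that band dominates when the two score distributions overlap heavily.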
NHIS 2016 data risk stratification results by our ANN.
| | # People | # Low Risk | % Low Risk | # Medium Risk | % Medium Risk | # High Risk | % High Risk |
|---|---|---|---|---|---|---|---|
| Cancer | 55 | 1 | 1.82% | 44 | 80.0% | 10 | 18.2% |
| Non-Cancer | 27,844 | 3,362 | 12.1% | 24,159 | 86.8% | 323 | 1.16% |
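The percentages in the table follow directly from the counts, and a short script can reproduce them as a consistency check. The counts are copied from the table; the share of cancer cases within each risk category is an extra derived quantity computed here for illustration, not a figure reported in the source.

```python
# Counts copied from the NHIS 2016 risk stratification table.
counts = {
    "Cancer": {"low": 1, "medium": 44, "high": 10},
    "Non-Cancer": {"low": 3362, "medium": 24159, "high": 323},
}

def row_percentages(row):
    """Percentage of a group's people that fall in each risk category."""
    total = sum(row.values())
    return {category: 100.0 * n / total for category, n in row.items()}

percentages = {group: row_percentages(row) for group, row in counts.items()}
for group, row in percentages.items():
    total = sum(counts[group].values())
    cells = ", ".join(f"{category} {value:.2f}%" for category, value in row.items())
    print(f"{group} (n={total}): {cells}")

def cancer_share(category):
    """Derived quantity: fraction of a risk category that has cancer."""
    cancer = counts["Cancer"][category]
    return cancer / (cancer + counts["Non-Cancer"][category])

for category in ("low", "medium", "high"):
    print(f"cancer share in {category}-risk group: {cancer_share(category):.2%}")
```

On these counts the high-risk group contains 10 cancer cases out of 333 people, roughly 3.0%, while the low-risk group contains 1 out of 3,363.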
The various screening methods, with their sensitivities and specificities.
| Method | Sensitivity | Specificity | Pros and Cons |
|---|---|---|---|
| Our developed ANN | 75.3% | 80.6% | Noninvasive, Cost-effective, Easy to implement; Less Sensitive than LDCT |
| Low-Dose CT Scan* | 93.8% | 73.4% | Noninvasive, High sensitivity; Expensive, False positives, Radiation exposure |
| Chest X-ray* | 73.5% | 91.3% | Noninvasive; Expensive, False positives, Radiation exposure |
| Sputum Cytology | 16% | 99.1% | Noninvasive; Low sensitivity |
| Automated Sputum Cytometry | 40% | 91% | Noninvasive, High throughput; Low sensitivity |
| hnRNP A2/B1 Expression | 80.5% | 73.5% | High accuracy; Expensive |
| Promoter Hypermethylation | 63%-86% | 75%-92% | High accuracy; Expensive |
| Microarray Gene Spectrometry | 80% | 84% | High accuracy; Expensive, More invasive |
| Gas Chromatography-Mass Spectrometry | 51%-96.5% | 66.7%-100% | Noninvasive, Can be accurate; Expensive, Difficult to perform correctly |
| Electronic Noses | 71.4%-87% | 48%-100% | Noninvasive, Can be accurate; Expensive, Difficult to perform correctly |
| Biomarkers in Blood | 41%-77% | 80%-93% | Approaching high accuracy; Blood draw and analysis |
| Buccal Mucosa Analysis | 79% | 83% | Noninvasive, Quick; Limited testing |
| Urine Analysis | 72%-79% | 85%-100% | Noninvasive, High accuracy; Limited testing |
* These values are based on three years of screening and follow-up on a positive screen. When each scan or radiograph is instead considered in isolation, the false positive rate rises sharply (to 96.4% and 94.5%) and the positive predictive value drops to 3.8% and 5.7% for Low-Dose CT scans and chest X-rays, respectively [32, 56].
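Positive predictive value depends on prevalence as well as on sensitivity and specificity, which is why it can be low even for a sensitive test. The sketch below applies Bayes' rule to the validation-set sensitivity and specificity quoted for the ANN. The prevalence values are assumptions: the first is simply the case fraction in the NHIS sample (648 of 489,066), which need not match the prevalence in any screened population, and the others are hypothetical.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Bayes' rule: P(disease | positive test)."""
    true_positive = sensitivity * prevalence
    false_positive = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive / (true_positive + false_positive)

# Validation-set figures quoted in the abstract.
ANN_SENSITIVITY = 0.753
ANN_SPECIFICITY = 0.806

# Case fraction in the NHIS sample; an assumption when used as a prevalence.
sample_prevalence = 648 / (648 + 488_418)

# The remaining prevalence values are hypothetical, for comparison only.
for prevalence in (sample_prevalence, 0.01, 0.05):
    ppv = positive_predictive_value(ANN_SENSITIVITY, ANN_SPECIFICITY, prevalence)
    print(f"prevalence {prevalence:.4%} -> PPV {ppv:.2%}")
```

For fixed sensitivity and specificity, PPV rises with prevalence, so the same test yields a higher PPV when applied to a pre-selected higher-risk group.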