Anne-Michelle Noone (1), Clara J K Lam (1), Angela B Smith (2,3), Matthew E Nielsen (2,3), Eric Boyd (4), Angela B Mariotto (1), Mousumi Banerjee (5).
1. Division of Cancer Control and Population Sciences, National Cancer Institute, Bethesda, MD.
2. University of North Carolina Lineberger Comprehensive Cancer Center, Chapel Hill, NC.
3. Biostatistics and Clinical Data Management Core, University of North Carolina Lineberger Comprehensive Cancer Center, Chapel Hill, NC.
4. Information Management Services Inc, Calverton, MD.
5. University of Michigan, Ann Arbor, MI.
Abstract
PURPOSE: Population-based cancer incidence rates of bladder cancer may be underestimated. Accurate estimates are needed for understanding the burden of bladder cancer in the United States. We developed and evaluated the feasibility of a machine learning-based classifier to identify bladder cancer cases missed by cancer registries, and estimated the rate of bladder cancer cases potentially missed. METHODS: Data were from a population-based cohort of 37,940 bladder cancer cases 65 years of age and older in the SEER cancer registries linked with Medicare claims (2007-2013). Cases with other urologic cancers, abdominal cancers, and unrelated cancers were included as control groups. A cohort of cancer-free controls was also selected using the Medicare 5% random sample. We used five supervised machine learning methods for predicting bladder cancer: classification and regression trees, random forest, logic regression, support vector machines, and logistic regression. RESULTS: Registry linkages yielded 37,940 bladder cancer cases and 766,303 cancer-free controls. Using health insurance claims, classification and regression trees distinguished bladder cancer cases from noncancer controls with very high accuracy (95%). Bacille Calmette-Guerin, cystectomy, and mitomycin were the most important predictors for identifying bladder cancer. From 2007 to 2013, we estimated that up to 3,300 bladder cancer cases in the United States may have been missed by the SEER registries. This would result in an average increase of 3.5% in the reported incidence rate. CONCLUSION: SEER cancer registries may potentially miss bladder cancer cases during routine reporting. These missed cases can be identified by leveraging Medicare claims and data analytics, leading to more accurate estimates of bladder cancer incidence.
Cancer surveillance relies on a comprehensive system to collect information on newly diagnosed patients with cancer. This information is critical to accurately estimate cancer incidence and survival. In turn, these data provide a basis for public health research to understand and work toward reducing the cancer burden in the population. Data integrity is dependent upon cancer surveillance meeting high standards of completeness.
CONTEXT
Key Objective: To develop and evaluate the feasibility of a machine learning–based classifier to identify bladder cancer cases potentially missed by cancer registries.
Knowledge Generated: A classification tree was able to identify bladder cancer cases versus noncancer controls with very high accuracy using treatment and comorbidity information from medical claims. Common treatments for bladder cancer including Bacille Calmette-Guerin, cystectomy, and mitomycin were important predictors for identifying bladder cancer cases. We estimated that the incidence rate of bladder cancer reported by cancer registries is likely to be underestimated by 3.5%.
Relevance: Cancer registries may not record all cases of bladder cancer primarily because of diagnosis and treatment outside of hospital settings. Machine learning–based classifiers, such as a classification tree, may accurately identify these unrecorded cases. This would lead to more accurate reporting of bladder cancer incidence rates.
The North American Association of Central Cancer Registries estimates the extent to which all incident cases are reported to the registry using the incidence-mortality ratio.[1-3] The assumptions of this method may not be met because external factors, such as trends in cancer risk factors or screening, may influence cancer incidence; therefore, this ratio may not fully capture registry case completeness.[4] Furthermore, cases were historically diagnosed and treated in the hospital setting and easily identified by cancer registrars. Over time, cancer diagnosis and treatment have evolved substantially, often now in outpatient settings, making measures of case completeness even more critical. In particular, bladder cancer is often diagnosed and treated in urology offices and may not get reported routinely to cancer registries.[5,6] Such cases are therefore not included in population-based incidence rate estimates, potentially resulting in underestimation of bladder cancer incidence.
The bladder cancer diagnosis and treatment cascade is unique compared with other cancers. Specifically, a diagnostic and initial therapeutic procedure is often a transurethral resection of bladder tumor (TURBT).[7] The goal of this procedure is to make the correct diagnosis and remove visible lesions. Another common treatment is intravesical Bacille Calmette-Guerin (BCG) after surgery. If the tumor has invaded the muscle, then either a radical or partial cystectomy is standard. Patients may also receive combinations of radiation and chemotherapy depending on tumor stage.[7,8] Because many of these procedures are specific to bladder cancer, there is an opportunity to develop an algorithm using procedure codes to identify unreported cases.
Machine learning methods are powerful statistical techniques for developing classification tools. In contrast to tools based on clinical knowledge of disease and treatment, these methods choose the best algorithm that results in the lowest misclassification error. Machine learning methods can easily handle large amounts of data and many predictor variables. They are well suited to identify nonlinear relationships including interactions or Boolean combinations of variables that may not be known a priori. To our knowledge, there is no published research using these techniques to identify unreported cases of cancer. We used five machine learning methods, specifically logistic regression, classification and regression trees (CART), random forest, support vector machines (SVM), and logic regression, to build a classifier, and compared these based on predictive accuracy.
The primary objective was to develop an algorithm for cancer surveillance (ie, to detect unreported cases of bladder cancer) that is critical for estimating bladder cancer incidence in the United States. Furthermore, these results may also highlight whether claims data could be used by registries to ascertain unreported bladder cancers.
METHODS
Data Sources
The SEER program of the National Cancer Institute is a system of 18 population-based cancer registries that covers 35% of the US population from geographically defined areas. Individuals in the SEER data eligible for Medicare have been matched to their Medicare claims to create the linked SEER-Medicare data. These linked data contain longitudinal claims with codes for medical services and diagnoses associated with services and dates.[9] Specifically, individuals in the SEER data were matched to Medicare’s master enrollment file maintained by the Centers for Medicare and Medicaid Services. Ninety-four percent of those reported to SEER 65 years of age or older have been linked to their Medicare claims. The linked database also contains a random sample of Medicare beneficiaries who do not have cancer. This cancer-free group is a random 5% sample of Medicare beneficiaries residing in the SEER areas without a cancer recorded in a SEER registry. The Medicare claims in both the SEER-Medicare data and the 5% random sample include hospital care (part A) as well as physician and outpatient services (part B). Part B requires that a beneficiary pay a premium, and its claims include International Classification of Diseases (ICD)-9 diagnosis codes and Healthcare Common Procedure Coding System codes for treatments. Similar claims data are available for the 5% noncancer sample.
Study Sample
We included all individuals in the SEER-Medicare data who were diagnosed with nonmetastatic bladder cancer from 2007 to 2013. We also constructed several control groups to distinguish bladder cancer from similar cancers and those without cancer. The first set was patients diagnosed with other urologic cancers (ie, kidney and renal pelvis, ureter, and other urinary organs); the second was a set of other abdominal cancers (ie, stomach, liver, pancreas, colon, rectum, and gallbladder); and the third was a set of unrelated cancers (eg, female breast and prostate). Cancer types were defined by the SEER ICD-O-3 codes.[10] We included in our study all consecutive individuals who were at least 65 years of age, had continuous part A, part B, and fee-for-service coverage, and were not enrolled in a Health Maintenance Organization (HMO) during that year. Individuals could have more than one cancer recorded by SEER. If this was the case, the first primary cancer was used to determine whether they were selected into the bladder cancer group or one of the other cancer groups. We also selected controls without a cancer diagnosis from the 5% noncancer sample, which comprised a fourth control group. Ten cancer-free controls were randomly selected for each bladder cancer case, matched by birth year. Thus, we constructed four data sets using the four different control groups, and each was randomly split into 2/3 for training and 1/3 for testing.
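The matching and splitting steps described above can be sketched in a few lines. This is only an illustration in Python using the standard library, not the authors' code (the analysis was run in R). The tuple layout, the function names, the fixed seed, and the choice to draw controls without replacement are all assumptions made for the example.

```python
import random
from collections import defaultdict

def match_controls(cases, controls, per_case=10, seed=0):
    """Draw `per_case` cancer-free controls per case, matched on birth year.

    `cases` and `controls` are lists of (person_id, birth_year) tuples,
    a hypothetical layout. Sampling without replacement is an assumption.
    """
    rng = random.Random(seed)
    pool = defaultdict(list)
    for person_id, birth_year in controls:
        pool[birth_year].append(person_id)
    for ids in pool.values():
        rng.shuffle(ids)  # random order within each birth year

    matched = {}
    for case_id, birth_year in cases:
        available = pool[birth_year]
        # Take up to `per_case` controls; fewer if the pool runs short.
        n_take = min(per_case, len(available))
        matched[case_id] = [available.pop() for _ in range(n_take)]
    return matched

def split_train_test(ids, train_fraction=2 / 3, seed=0):
    """Randomly split ids into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

With 30 records and the default fraction, `split_train_test` returns 20 training and 10 test records.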
Identification of Medical Conditions and Cancer Treatment
An individual was considered to have a specified medical condition if they had a claim with a diagnosis code for that condition in the year before cancer diagnosis. For the cancer-free controls, conditions that occurred in the year before diagnosis of their matched case were ascertained. Medical conditions included individual comorbidities as well as chronic obstructive pulmonary disease and history of smoking (Table 1). Information on cancer treatment and medical procedures was identified using Medicare claims. An individual was considered to have treatment if at least one Medicare claim included a code for the specific treatment or procedure within the first year of cancer diagnosis. Codes used to identify treatment are listed in Appendix Tables A1 and A2.
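The two time windows above (conditions in the year before diagnosis, treatment within the first year of diagnosis) can be expressed as simple date checks. The sketch below is illustrative Python; the function names, the use of exact calendar dates, the 365-day year, and the handling of the window boundaries are assumptions, since the text does not specify them.

```python
from datetime import date, timedelta

ONE_YEAR = timedelta(days=365)  # assumed window length

def has_condition(claim_dates, index_date):
    """True if any diagnosis claim falls in the year before the index date."""
    return any(index_date - ONE_YEAR <= d < index_date for d in claim_dates)

def has_treatment(claim_dates, index_date):
    """True if any treatment claim falls within one year after the index date."""
    return any(index_date <= d <= index_date + ONE_YEAR for d in claim_dates)
```

For a cancer-free control, `index_date` would be the diagnosis date of the matched case, as described in the text.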
TABLE 1.
Demographic, Comorbid, and Treatment Characteristics of 766,303 Individuals From SEER-Medicare With Selected Cancers and Without Cancer
TABLE A1.
List of CPT and ICD-9 Codes to Identify Treatment
TABLE A2.
List of HCPC to Identify Chemotherapy
Statistical Analyses
We used five supervised machine learning methods, namely, CART, random forest, logic regression, SVM, and logistic regression, for predicting bladder cancer. We included all variables listed in Table 1 in the models except for TURBT. TURBT was excluded because it is used as a confirmatory procedure for bladder cancer. So, using TURBT to predict bladder cancer would not be clinically useful since patients who underwent TURBT but did not have bladder cancer would result in false positives.
CART is a binary recursive partitioning technique.[11-13] The method begins with all subjects in the top node of the tree. Subjects are passed down the tree with decisions made at each node to split into two daughter nodes until no further splitting is done and a terminal node is reached. Each nonterminal node contains a question on which a split is based. In a classification tree, the covariate space is partitioned recursively in a binary fashion based on homogeneity of the nodes. To prevent overfitting, this tree is pruned using a cost-complexity parameter that imposes a penalty for large trees.[14] The final tree is selected based on a 10-fold cross-validation[11] and is the one that has the lowest misclassification error to predict bladder cancer.
Random forest is an ensemble of unpruned classification or regression trees grown using bootstrap resamples of the data.[13,14] This method overcomes much of the inherent instability with a single CART tree. Here, a tree is grown on bootstrap samples using a random selection of covariates at each step of the tree growing. A patient is classified by each tree and the final classification is the one with majority votes across all trees in the forest. We grew a forest of 500 trees to predict bladder cancer. The Gini index, a measure of variable importance, was also estimated.[14] This measures the impact a single variable has on the error rates of the forest and is used to rank the variables.
Logic regression is an adaptive classification and regression procedure that constructs Boolean combinations of binary predictors.[15,16] For example, a decision to classify a patient in a certain group may be based on rules such as if X1 and X2 but not X3 are true. Logic regression searches among all possible Boolean combinations of predictors while remaining in the regression modeling framework. The quality of the models is determined by an appropriate score function. In our analysis, we used binomial deviance as the score function to predict bladder cancer, and a stochastic optimization algorithm to search for the Boolean expressions.[15,16]
We also used SVM, which is a nonprobabilistic supervised learning procedure that creates a multidimensional hyperplane to partition the covariate space into two groups allowing for classification.[17,18] SVMs create hyperplanes by maximizing the margin between the nearest data points on either side of the hyperplane based on a cost penalty for each misclassified patient.
Finally, we estimated the predicted probabilities of being a bladder cancer case using logistic regression. All covariates were entered into the model and no model selection was performed. Patients were classified as having bladder cancer if their predicted probability was > 50%.
Models were trained and tested using the split sample: 2/3 were used for training and the remaining 1/3 was used as a test set. Classification performance was evaluated using overall accuracy, sensitivity, and specificity based on the test set. Overall accuracy is the proportion of correct classifications using the classifier, and sensitivity is the proportion of individuals classified as having bladder cancer among those who were reported to SEER data (true disease status). Specificity is the proportion of individuals classified as not having bladder cancer among those who did not have a bladder cancer diagnosis. We also calculated the area under the receiver operating characteristic curve and the F1 score.[19] Analysis was conducted using R software, version 3.5.1.[20] Specifically, we used the packages rpart[21] (CART), randomForest[22] (random forest), e1071[23] (SVM), and LogicReg[24] (logic regression). R code and model parameters are available in Appendix Table A3 and the Data Supplement.
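To make the performance measures concrete, the sketch below computes them from the four cells of a test-set confusion matrix. It is an illustrative Python example rather than the authors' R code; the function name and the counts in the usage note are hypothetical, and the F1 score is taken in its usual form as the harmonic mean of positive predictive value and sensitivity.

```python
def classification_metrics(tp, fp, tn, fn):
    """Summary measures from confusion-matrix counts.

    tp: bladder cancer cases classified as bladder cancer
    fn: bladder cancer cases classified as not bladder cancer
    tn: controls classified as not bladder cancer
    fp: controls classified as bladder cancer
    """
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)  # positive predictive value (precision)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "f1": f1,
    }
```

For example, hypothetical counts of `tp=90, fp=10, tn=990, fn=10` give a sensitivity of 0.90 and a specificity of 0.99. The area under the receiver operating characteristic curve is not covered here because it requires predicted scores rather than counts.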
TABLE A3.
List of Parameters and R Function Used for Each Method
Estimates of Unreported Bladder Cancer
We used the entire noncancer sample from Medicare to estimate the number and rate of potential missed cases of bladder cancer in SEER. The date of the first claim for any treatment shown in Table 1 was used as the index date, and the same inclusion and exclusion criteria stated above were applied. That is, the person must have been at least 65 years of age and resided in a SEER registry area before the index date and had continuous Medicare parts A and B enrollment in the year following the index date.
For the purpose of estimating potential missed cases of bladder cancer, we chose the classifier with the highest positive predictive value (PPV) and applied it to the noncancer sample. In this context, the PPV quantifies the percentage of correctly identified bladder cancers among those predicted to have bladder cancer by our model. Therefore, we based our estimation of potentially missed bladder cancer cases on the classifier with the highest PPV to provide the most conservative estimate of the number of missed bladder cancer cases. The denominator from the noncancer sample was obtained using the same exclusion criteria as the cases. Starting from 2007 through 2013, we identified the annual number of people included in the noncancer sample who were 65+ years of age and enrolled in Medicare at the first of the year.
The incidence of missed bladder cancer cases was compared with the crude annual incidence rate of bladder cancer among those 65+ years of age reported to SEER. The crude incidence rate was estimated using SEER*Stat.[25] Finally, since we were using the 5% sample, the number of missed cases was multiplied by 20 to scale to the SEER-Medicare noncancer population and used to estimate the total and percentage of bladder cancer cases that were unreported to the registries.
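The selection step described above, choosing the classifier with the highest PPV, can be illustrated as follows. This is a Python sketch with hypothetical classifier names and made-up counts. It shows only the selection logic and does not reproduce any result from the study.

```python
def ppv(tp, fp):
    """Positive predictive value: true positives among all predicted positives."""
    return tp / (tp + fp)

def highest_ppv(confusions):
    """Return the name of the classifier with the highest PPV.

    `confusions` maps a classifier name to a (tp, fp) pair from the test set.
    """
    return max(confusions, key=lambda name: ppv(*confusions[name]))

# Hypothetical test-set counts, for illustration only.
example = {
    "classifier_a": (900, 10),  # PPV = 900 / 910
    "classifier_b": (920, 30),  # PPV = 920 / 950
}
chosen = highest_ppv(example)
```

Here `classifier_a` is chosen even though it finds fewer true positives, because a smaller share of its predicted cases are false positives. That is the sense in which a high-PPV classifier yields a conservative count of missed cases.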
RESULTS
A total of 37,940 patients with bladder cancer were included (Table 1). Most patients were diagnosed with noninvasive (50%) or localized bladder cancer (36%). Compared to the control groups, patients with bladder cancer were more likely to be male and older. The noncancer group was less likely to have comorbid conditions and less likely to have been smokers compared with the bladder cases. Patients with bladder cancer were more likely to have undergone treatment with BCG, cystectomy, TURBT, and mitomycin compared with those diagnosed with other cancers or those without cancer. For example, almost 90% of bladder cancer cases received TURBT treatment within 1 year of diagnosis compared with 8% of those diagnosed with other urinary cancers. Approximately 9% of bladder cancer cases received a cystectomy within a year of diagnosis compared with < 1% of other cancer cases or cancer-free controls.
All five machine learning methods classified bladder cancer cases with very high accuracy (Table 2). Distinguishing bladder cancer from other urinary cancers was the most difficult with an overall accuracy of about 70% across all methods. The overall accuracy for distinguishing bladder from abdominal cancer was ≥ 86%, and more than 91% for bladder cancer versus other unrelated cancers. The highest accuracy was achieved for distinguishing bladder cancer from cancer-free controls, at 95% or higher across all methods.
TABLE 2.
Overall Accuracy, Sensitivity, Specificity, AUC, and F1 Score of Each Classifier
Sensitivity and specificity were also fairly high across all methods and comparison groups (Table 2). The sensitivity for distinguishing bladder cancer versus other urinary cancers was high (range, 77.8%-81.2%). The sensitivity for distinguishing bladder cancer versus other control groups was higher than that for other urinary cancers. Specificity, which is the probability of correctly predicting that a patient was not diagnosed with bladder cancer, was very high, more than 99%, across all methods and for all comparison groups except other urinary cancers. Specificity was the lowest for classifying bladder cancer versus other urinary cancers (range, 61.5%-72.0%). There was not much variation in the F1 statistic across methods, although the random forest tended to give slightly higher values. The area under the curve was highest for logistic regression compared with the other methods.
CART, random forest, and logistic regression identified BCG, mitomycin, and cystectomy as the most important variables to distinguish bladder cancer versus any of the comparison groups. Figure 1 shows the final classification tree used to classify bladder cancer versus noncancer controls and Figure 2 is the variable importance plot from the random forest. Based on variable importance from random forest, receipt of BCG was the most important variable, followed by mitomycin and cystectomy, which had substantially lower importance compared with BCG. BCG had the largest odds ratio followed by cystectomy in the logistic regression. The SVM assigned largest weights to the use of certain chemotherapies (ie, cisplatin, carboplatin, gemcitabine, and doxorubicin) followed by cystectomy and BCG.
FIG 1.
Classification tree to identify bladder cancer cases versus cancer-free controls. BCG, Bacille Calmette-Guerin.
FIG 2.
Variable importance plot from the random forest to identify bladder cancer cases versus cancer-free controls. BCG, Bacille Calmette-Guerin; COPD, chronic obstructive pulmonary disease; CVD, cardiovascular disease; MI, myocardial infarction.
For all comparison groups, logic regression identified interactions between ever having a cystectomy, BCG, and mitomycin. The classification rule for distinguishing bladder cancer from other abdominal cancers was having at least one of cystectomy, BCG, or mitomycin within a year of diagnosis. The classification rules for bladder cancer versus other groups were more complex. For example, the classification rule for identifying bladder cancer versus other urinary cancers included age, sex, and receipt of cisplatin, whereas that for distinguishing bladder cancer from other unrelated cancers included gemcitabine. Finally, if a patient had one of cystectomy, mitomycin, cisplatin, BCG, or carboplatin, then they were classified as having bladder cancer versus a cancer-free control (Fig 3).
From 2007 to 2013, there were 165 missed bladder cancer cases identified in the 5% noncancer control sample. The classification tree was used to estimate the potential number of missed cases since it had the highest PPV. The annual rate of missed incident bladder cancer ranged from 16.5 per 100,000 in 2007 to 9.6 per 100,000 in 2013. In comparison, there were 90,714 bladder cancer cases reported to the SEER registries among persons 65+ years of age during that time. This resulted in crude incidence rates ranging from 128.4 per 100,000 in 2007 to 118.1 per 100,000 in 2013. Inclusion of the missed bladder cancer cases would increase the cases reported to SEER by 3.5% from 2007 to 2013. The increase was 6.2% in 2007 declining to 2.6% in 2013.
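The scaling from the 5% sample to the full noncancer population is simple arithmetic, sketched below in Python. The count of 165 and the multiplier of 20 come from the text. The function names are illustrative, and the annual denominators are not given in this section, so the rate helper is shown without study data.

```python
SCALE_FACTOR = 20  # a 5% random sample represents 1 in 20 beneficiaries

def scale_to_population(count_in_sample, factor=SCALE_FACTOR):
    """Scale a count observed in the 5% sample to the full population."""
    return count_in_sample * factor

def rate_per_100k(cases, population):
    """Crude rate per 100,000 persons."""
    return cases * 100_000 / population

missed_in_sample = 165  # flagged in the 5% noncancer sample, 2007 to 2013
missed_scaled = scale_to_population(missed_in_sample)  # 165 * 20 = 3300
```

The scaled value of 3,300 matches the number of potentially missed cases reported for 2007 to 2013. As a purely hypothetical illustration of the rate helper, 50 cases among 400,000 persons correspond to 12.5 per 100,000.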
FIG 3.
Logic tree to identify bladder cancer cases versus cancer-free controls. If a patient had one of cystectomy, mitomycin, cisplatin, BCG, or carboplatin, then they were classified as having bladder cancer versus being a cancer-free control. BCG, Bacille Calmette-Guerin.
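The rule in Figure 3 is a Boolean OR over five treatments, which can be written directly as a set intersection. The Python sketch below uses illustrative lowercase treatment labels and an assumed function name; it encodes only the rule as stated in the caption, not the fitted logic regression model.

```python
# Treatments named in the logic rule for bladder cancer vs cancer-free controls.
RULE_TREATMENTS = {"cystectomy", "mitomycin", "cisplatin", "bcg", "carboplatin"}

def classify_vs_cancer_free(treatments_received):
    """Return True (classified as bladder cancer) if any rule treatment was received."""
    return bool(RULE_TREATMENTS & set(treatments_received))
```

For instance, a patient with only a BCG claim is classified as a bladder cancer case, whereas a patient with only a gemcitabine claim is not.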
DISCUSSION
Bladder cancer cases were identified with high accuracy using our machine learning–based rules. BCG is a unique therapeutic procedure for bladder cancer and emerged as the most important variable to distinguish patients with bladder cancer versus other urologic, abdominal, and unrelated cancers followed by mitomycin and cystectomy. Indeed, using CART, we found that approximately 3,300 potential bladder cancer cases were unreported to SEER from 2007 to 2013. Overall, these cases would have increased the SEER reported incidence rate by 3.5%. Registries that have access to claims or other treatment data for cancer cases and the general population may use this classifier to flag cases and follow-up with outpatient facilities. Moreover, some registries may be able to implement case tracking with claims processed in real time. These results are in accordance with standards used for clinical treatment of bladder cancer, namely, BCG and cystectomy being primary treatments.[26]
All machine learning methods had very high sensitivity and specificity in distinguishing bladder cancer cases from the cancer-free controls. CART and logic regression are preferred because of their tree-based structures (lending to easy interpretation), as well as ease of implementation. Therefore, these methods are more amenable to real-time implementation.
Medical claims have been used extensively to identify treatments and cancer diagnoses and our results are concordant with prior studies. One study used automated software for processing billing data from community urology practices to identify an additional 12% of bladder cancer cases that were unreported to the central registry.[5] Lam et al[26] recently developed an algorithm to identify missed cases of bladder cancer using SEER-Medicare data. The algorithm, based on clinical expertise, uses combinations of diagnosis codes, treatment, procedures, and oncology consultations to identify bladder cancer cases. They found about 4% of cases were missed in SEER from 2008 to 2015.
Including TURBT increased predictive performance across all models and all metrics, compared with not including TURBT. However, TURBT is used as a confirmatory procedure for bladder cancer. The prevalence of TURBT among bladder cancer cases is extremely high (90%). So, using TURBT to predict bladder cancer would, by definition, result in a highly sensitive tool. Indeed, secondary analyses including TURBT resulted in TURBT being the only variable to distinguish bladder cancers versus other cancers with high sensitivity. However, as mentioned, TURBT is used for diagnosis when a patient is suspected to have bladder cancer. Many patients undergoing the procedure may not have bladder cancer. Therefore, using only TURBT to predict bladder cancer would not be useful clinically, since patients who underwent TURBT for suspicion of bladder cancer but did not have it confirmed on TURBT would result in false positives. For this reason, we decided to exclude TURBT from the analyses.
Strengths of our study include a large data set with many cancer cases along with a comprehensive list of treatment and comorbid conditions. We used several comparison groups to classify bladder cancer to challenge the algorithm to distinguish bladder cancer versus other cancers that may have similar profiles. Limitations include age restriction (≥ 65 years) for Medicare eligibility. However, the majority of bladder cancer cases occur after 65 years of age. Also, patients with HMO insurance and multiple cancers were excluded, which may limit generalizability. Some treatments or comorbid conditions may have been missed if a claim for payment was not processed through Medicare. Finally, some cancers such as upper tract urothelial carcinoma have similar treatment modalities as bladder cancer. This may have affected our prediction performance when distinguishing bladder cancer from other urologic cancers.
In summary, using machine learning methods, we identified common treatments as the most important variables in distinguishing individuals with bladder cancer compared to those with other cancers or without cancer with very high accuracy. Our results validate what is known clinically about the treatment of bladder cancer and therefore may be useful to cancer registries in identifying cases that may have been unreported to the cancer registry.
Authors: Marko Babjuk; Andreas Böhle; Maximilian Burger; Otakar Capoun; Daniel Cohen; Eva M Compérat; Virginia Hernández; Eero Kaasinen; Joan Palou; Morgan Rouprêt; Bas W G van Rhijn; Shahrokh F Shariat; Viktor Soukup; Richard J Sylvester; Richard Zigeuner Journal: Eur Urol Date: 2016-06-17 Impact factor: 20.096
Authors: Clara J K Lam; Joan L Warren; Matthew Nielsen; Angela Smith; Eric Boyd; Michael J Barrett; Angela B Mariotto Journal: J Natl Cancer Inst Monogr Date: 2020-05-01