Qianyu Yuan, Tianrun Cai, Chuan Hong, Mulong Du, Bruce E Johnson, Michael Lanuti, Tianxi Cai, David C Christiani.
Abstract
Importance: Electronic health records (EHRs) provide a low-cost means of accessing detailed longitudinal clinical data for large populations. A lung cancer cohort assembled from EHR data would be a powerful platform for clinical outcome studies.
Objective: To investigate whether a clinical cohort assembled from EHRs could be used in a lung cancer prognosis study.
Design, Setting, and Participants: In this cohort study, patients with lung cancer were identified among 76 643 patients with at least 1 lung cancer diagnostic code deposited in an EHR in the Mass General Brigham health care system from July 1988 to October 2018. Patients were identified via a semisupervised machine learning algorithm, for which clinical information was extracted from structured and unstructured data via natural language processing tools. Data completeness and accuracy were assessed by comparing with the Boston Lung Cancer Study and against criterion standard EHR review results. A prognostic model for non-small cell lung cancer (NSCLC) overall survival was further developed for clinical application. Data were analyzed from March 2019 through July 2020.
Exposures: Clinical data deposited in EHRs for cohort construction and variables of interest for the prognostic model were collected.
Main Outcomes and Measures: The primary outcomes were the performance of the lung cancer classification model and the quality of the extracted variables; the secondary outcome was the performance of the prognostic model.
Year: 2021 PMID: 34232304 PMCID: PMC8264641 DOI: 10.1001/jamanetworkopen.2021.14723
Source DB: PubMed Journal: JAMA Netw Open ISSN: 2574-3805
Figure 1. Overview of Electronic Health Record (EHR) Cohort Assembling
Data were initially from EHRs; lung cancer diagnosis code was used as a filter to create a data mart containing structured data and narrative notes. Structured data were queried, and narrative notes were processed using natural language processing tools. Structured data and narrative notes were combined to develop the phenotyping algorithm and extract variables of interest. The performance of the phenotyping algorithm was compared with a random sample of patients selected for EHR review. The accuracies of the extracted variables were compared with EHR reviewed samples and Boston Lung Cancer Study cohort data. BMI indicates body mass index; CPT, Current Procedural Terminology; ECOG, Eastern Cooperative Oncology Group; EXTEND, Extraction of Electronic Medical Record Numerical Data; NER, named-entity recognition; NICE, Natural Language Processing Interpreter for Cancer Extraction.
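The comparison of phenotyping output with EHR review described in the caption can be illustrated with a short sketch. The function name, the label lists, and the choice of metrics are assumptions made for illustration; this excerpt does not state which metrics the authors reported or how their evaluation was implemented.

```python
def classification_metrics(predicted, reviewed):
    """Compare algorithm labels with chart-review labels.

    Both arguments are equal-length sequences of booleans, where True
    means "classified as lung cancer". The chart-review labels are
    treated as the criterion standard.
    """
    if len(predicted) != len(reviewed):
        raise ValueError("predicted and reviewed must have the same length")

    tp = sum(1 for p, r in zip(predicted, reviewed) if p and r)
    fp = sum(1 for p, r in zip(predicted, reviewed) if p and not r)
    fn = sum(1 for p, r in zip(predicted, reviewed) if not p and r)
    tn = sum(1 for p, r in zip(predicted, reviewed) if not p and not r)

    def ratio(numerator, denominator):
        # Return NaN instead of raising when a denominator is empty.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical labels for five reviewed patients (not study data).
algorithm_labels = [True, True, False, False, True]
review_labels = [True, False, False, True, True]
metrics = classification_metrics(algorithm_labels, review_labels)
```

In practice the reviewed sample would be drawn at random from the data mart, as the caption describes, so that the estimates generalize to the full set of screened patients.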
Data Sources, Extraction Method, and Description
| Key variable | Structured data source | Unstructured data extraction method | Variable description |
|---|---|---|---|
| Demographic characteristic | |||
| Birth date | Demographics | NA | NA |
| Sex | Demographics | NA | NA |
| Race/ethnicity | Demographics | NA | NA |
| Clinical outcomes | |||
| Diagnosis date | Diagnosis codes ( | NICE | Date of lung cancer diagnosis |
| Date of death | Death report | NA | NA |
| Prognostic factors | |||
| Stage | NA | NICE | TNM stage and clinical stage |
| Histologic type | NA | NICE | NSCLC (ie, adenocarcinoma, squamous cell carcinoma, other non-small cell carcinoma) or small cell lung cancer |
| Smoking status | NA | NA | Smoker or nonsmoker |
| BMI | Vital signs | EXTEND | Calculated as weight in kilograms divided by height in meters squared |
| ECOG performance status | NA | EXTEND | Grade 0 to 4 |
| Laboratory test | Laboratory test codes | NA | Complete blood count, metabolic panel, lipid panel, liver panel, hemoglobin A1C, and urinalysis |
| Tumor somatic variant information | NA | NICE | Genetic alterations in |
| Medical history | Diagnosis codes ( | NA | Respiratory disease (eg, COPD and asthma), cardiovascular disease, type 2 diabetes, and others |
| Treatment | |||
| Surgical treatment | Procedure codes ( | NA | Surgical procedure (ie, lobectomy, segmentectomy, wedge resection, video-assisted thoracic surgical procedure) with surgical admission and discharge dates |
| Radiation therapy | Procedure codes ( | NA | Radiation therapy procedure, treatment start and end dates |
| Chemotherapy | Procedure codes ( | NA | Chemotherapy procedures, chemotherapy drugs, and treatment start and end dates |
| Target therapy and immunotherapy | Medication name codes | NA | Target therapy and immunotherapy drugs and treatment start and end dates |
Abbreviations: BMI, body mass index; COPD, chronic obstructive pulmonary disease; ECOG, Eastern Cooperative Oncology Group; EXTEND, Extraction of Electronic Medical Record Numerical Data; ICD-9, International Classification of Diseases, Ninth Revision; ICD-10, International Statistical Classification of Diseases and Related Health Problems, Tenth Revision; NA, not applicable; NICE, Natural Language Processing Interpreter for Cancer Extraction; NSCLC, non–small cell lung cancer.
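The BMI definition in the table (weight in kilograms divided by height in meters squared) can be written out as a small sketch. The category cutoffs below are the commonly used WHO thresholds and are an assumption; this excerpt does not state which cutoffs the authors applied.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in meters squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def bmi_category(value):
    """Map a BMI value to a category.

    Cutoffs (18.5, 25, 30) are the usual WHO thresholds and are assumed
    here; they are not taken from the source.
    """
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "reference range"
    if value < 30:
        return "overweight"
    return "obesity"


# Hypothetical patient: 70 kg, 1.75 m tall.
example_bmi = bmi(70, 1.75)
example_category = bmi_category(example_bmi)
```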
Demographic and Clinical Characteristics of Final Cohort
| Characteristic | No. (%) (N = 35 375) |
|---|---|
| Age at initial diagnosis, median (IQR), y | 66.7 (58.4-74.1) |
| Sex | |
| Women | 18 756 (53.0) |
| Men | 16 613 (47.0) |
| Unknown | 6 (0.02) |
| Race/ethnicity | |
| White | 30 140 (85.2) |
| Black | 1040 (2.9) |
| Asian | 857 (2.4) |
| Hispanic | 323 (0.9) |
| Other | 267 (0.8) |
| Unknown | 2748 (7.8) |
| Smoking status | |
| Smoker | 32 650 (92.3) |
| Nonsmoker | 2725 (7.7) |
| Histologic type | |
| Completeness | 30 813 (87.1) |
| Adenocarcinoma | 18 331 (59.5) |
| Squamous cell carcinoma | 5816 (18.9) |
| NSCLC unspecified | 3601 (11.7) |
| Small cell lung cancer | 3065 (9.9) |
| Stage | |
| Completeness | 26 843 (75.9) |
| 1 | 7083 (26.4) |
| 2 | 3069 (11.4) |
| 3 | 5889 (21.9) |
| 4 | 8495 (31.6) |
| Limited | 1222 (4.6) |
| Extensive | 1085 (4.0) |
| Treatment received within MGB health care system | |
| Surgical treatment | 13 628 (38.5) |
| Chemotherapy | 14 039 (39.7) |
| Radiation therapy | 14 710 (41.6) |
| Target therapy | 2631 (7.4) |
| Immunotherapy | 504 (1.4) |
| 4655 (13.1) | |
| Variant positive | 857 (18.4) |
| Variant negative | 3798 (81.6) |
| 4655 (13.1) | |
| Variant positive | 1242 (26.7) |
| Variant negative | 3413 (73.3) |
| 4655 (13.1) | |
| Variant positive | 171 (3.7) |
| Variant negative | 4484 (96.3) |
| 3791 (10.7) | |
| Rearrangement present | 203 (5.4) |
| Rearrangement not present | 3588 (94.6) |
| 2436 (6.9) | |
| Rearrangement present | 51 (2.1) |
| Rearrangement not present | 2385 (97.9) |
| Follow-up from initial diagnosis, median (IQR), y | 1.62 (0.63-4.14) |
Abbreviations: IQR, interquartile range; MGB, Mass General Brigham; NSCLC, non–small cell lung cancer.
Treatment received within the MGB health care system was identified from available International Classification of Diseases, Ninth Revision (ICD-9) or International Statistical Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10) codes, procedure codes, or medication codes.
EGFR, KRAS, and BRAF were tested using the SNaPshot assay (Thermo Fisher Scientific).
ALK and ROS were tested using fluorescence in situ hybridization or immunohistochemistry.
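The percentages in the table use two different denominators: completeness rows are relative to the full cohort (N = 35 375), while the subcategory rows are relative to the patients with that variable available. A minimal sketch of the arithmetic, using counts copied from the table (the helper name `pct` is an assumption):

```python
COHORT_N = 35_375

# Counts copied from the histologic type rows of the table.
HISTOLOGY_COMPLETE = 30_813
HISTOLOGY_COUNTS = {
    "Adenocarcinoma": 18_331,
    "Squamous cell carcinoma": 5_816,
    "NSCLC unspecified": 3_601,
    "Small cell lung cancer": 3_065,
}


def pct(count, denominator):
    """Percentage rounded to one decimal place."""
    return round(100 * count / denominator, 1)


# Completeness is measured against the whole cohort.
completeness_pct = pct(HISTOLOGY_COMPLETE, COHORT_N)

# Subtype percentages are measured against patients with histology available.
subtype_pct = {
    name: pct(count, HISTOLOGY_COMPLETE)
    for name, count in HISTOLOGY_COUNTS.items()
}
```

The subtype counts sum to the completeness count, which is a quick way to confirm which denominator a block of rows uses.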
Factors Associated With Overall Survival at 5 Years
| Factor | Training set, HR (95% CI) | Training set, P value | Testing set, HR (95% CI) | Testing set, P value |
|---|---|---|---|---|
| Sex | ||||
| Women | 1 [Reference] | NA | 1 [Reference] | NA |
| Men | 1.23 (1.16-1.30) | <.001 | 1.30 (1.17-1.44) | <.001 |
| Age | 1.02 (1.01-1.02) | <.001 | 1.02 (1.01-1.02) | <.001 |
| Smoking status | ||||
| Nonsmoker | 1 [Reference] | NA | 1 [Reference] | NA |
| Smoker | 1.66 (1.46-1.89) | <.001 | 1.92 (1.51-2.44) | <.001 |
| Stage | ||||
| 1 | 1 [Reference] | NA | 1 [Reference] | NA |
| 2 | 1.73 (1.54-1.94) | <.001 | 1.32 (1.08-1.61) | <.001 |
| 3 | 2.92 (2.66-3.20) | <.001 | 2.44 (2.08-2.86) | <.001 |
| 4 | 5.08 (4.65-5.54) | <.001 | 4.83 (4.16-5.62) | <.001 |
| Histologic type | ||||
| Adenocarcinoma | 1 [Reference] | NA | 1 [Reference] | NA |
| Squamous cell carcinoma | 1.05 (0.98-1.13) | .16 | 1.14 (1.01-1.29) | .03 |
| Other | 1.52 (1.40-1.65) | <.001 | 1.32 (1.16-1.51) | <.001 |
| BMI | ||||
| Reference range | 1 [Reference] | NA | 1 [Reference] | NA |
| Obesity | 0.88 (0.80-0.97) | .01 | 0.86 (0.73-1.01) | .06 |
| Overweight | 0.90 (0.83-0.98) | .02 | 0.90 (0.78-1.04) | .15 |
| Underweight | 1.28 (1.06-1.55) | .01 | 0.94 (0.65-1.35) | .73 |
| Missing | 1.38 (1.28-1.49) | <.001 | 1.32 (1.16-1.51) | <.001 |
| Albumin, g/dL | ||||
| ≤3.5 | 1 [Reference] | NA | 1 [Reference] | NA |
| >3.5 | 0.66 (0.61-0.71) | <.001 | 0.59 (0.52-0.68) | <.001 |
| Missing | 0.48 (0.41-0.57) | <.001 | 0.45 (0.34-0.61) | <.001 |
| Alkaline phosphatase, U/L | ||||
| ≤140 | 1 [Reference] | NA | 1 [Reference] | NA |
| >140 | 1.40 (1.27-1.54) | <.001 | 1.54 (1.29-1.83) | <.001 |
| Missing | 1.17 (0.99-1.38) | .06 | 1.09 (0.81-1.47) | .57 |
| Creatinine, mg/dL | ||||
| Reference range | 1 [Reference] | NA | 1 [Reference] | NA |
| High | 1.02 (0.95-1.10) | .57 | 0.93 (0.81-1.06) | .27 |
| Low | 1.45 (1.26-1.67) | <.001 | 1.45 (1.26-1.67) | <.001 |
| Missing | 1.19 (0.90-1.57) | .22 | 0.93 (0.81-1.06) | .16 |
| Hemoglobin, g/dL | ||||
| Reference range | 1 [Reference] | NA | 1 [Reference] | NA |
| High | 1.35 (1.08-1.70) | .01 | 1.34 (0.89-2.03) | .15 |
| Low | 1.16 (1.09-1.24) | <.001 | 1.06 (0.95-1.19) | .28 |
| Missing | 1.56 (0.96-2.55) | .07 | 0.54 (0.18-1.65) | .32 |
| Red cell distribution width, % | ||||
| ≤14.5 | 1 [Reference] | NA | 1 [Reference] | NA |
| >14.5 | 1.12 (1.05-1.20) | <.001 | 1.36 (1.20-1.53) | <.001 |
| Missing | 0.74 (0.46-1.20) | .23 | 0.84 (0.27-2.61) | .77 |
| WBC count, per μL | ||||
| 4500-11 000 | 1 [Reference] | NA | 1 [Reference] | NA |
| >11 000 | 1.17 (1.09-1.25) | <.001 | 1.24 (1.10-1.40) | <.001 |
| Missing | 1.01 (0.64-1.59) | .97 | 2.82 (0.73-10.91) | .13 |
| Neutrophil-lymphocyte ratio | ||||
| ≤4 | 1 [Reference] | NA | 1 [Reference] | NA |
| >4 | 1.34 (1.25-1.43) | <.001 | 1.23 (1.10-1.38) | <.001 |
| Missing | 0.85 (0.76-0.96) | .01 | 0.98 (0.80-1.20) | .83 |
| Calcium, mg/dL | ||||
| 8.5-10.5 | 1 [Reference] | NA | 1 [Reference] | NA |
| <8.5 | 0.72 (0.66-0.79) | <.001 | 0.68 (0.58-0.79) | <.001 |
| >10.5 | 1.37 (1.17-1.59) | <.001 | 1.56 (1.15-2.11) | .003 |
| Missing | 1.08 (0.84-1.40) | .54 | 1.12 (0.75-1.67) | .59 |
| Sodium, mEq/L | ||||
| 135-145 | 1 [Reference] | NA | 1 [Reference] | NA |
| <135 | 1.32 (1.23-1.43) | <.001 | 1.25 (1.10-1.43) | <.001 |
| >145 | 0.95 (0.77-1.16) | .59 | 1.07 (0.74-1.55) | .72 |
| Missing | 1.04 (0.73-1.48) | .82 | 0.97 (0.58-1.65) | .92 |
Abbreviations: BMI, body mass index; HR, hazard ratio; NA, not applicable; WBC, white blood cell.
SI conversion factors: To convert albumin to grams per liter, multiply by 10; alkaline phosphatase to microkatals per liter, multiply by 0.0167; calcium to millimoles per liter, multiply by 0.25; creatinine to micromoles per liter, multiply by 88.4; hemoglobin to grams per liter, multiply by 10.0; sodium to millimoles per liter, multiply by 1.0; WBC count to ×10⁹ per liter, multiply by 0.001.
The reference range for creatinine is 0.6 to 1.2 mg/dL for men and 0.5 to 1.1 mg/dL for women.
The reference range for hemoglobin is 13.5 to 17.5 g/dL for men and 12.0 to 15.5 g/dL for women.
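The SI conversion factors listed in the table footnotes can be applied with a simple lookup. The factors are copied from the footnote; the dictionary keys and the function name `to_si` are assumptions made for illustration.

```python
# Conversion factors from the table footnote: conventional unit -> SI unit.
SI_FACTORS = {
    "albumin": (10.0, "g/L"),                 # from g/dL
    "alkaline_phosphatase": (0.0167, "µkat/L"),  # from U/L
    "calcium": (0.25, "mmol/L"),              # from mg/dL
    "creatinine": (88.4, "µmol/L"),           # from mg/dL
    "hemoglobin": (10.0, "g/L"),              # from g/dL
    "sodium": (1.0, "mmol/L"),                # from mEq/L
    "wbc_count": (0.001, "×10^9/L"),          # from cells per µL
}


def to_si(analyte, value):
    """Convert a laboratory value in conventional units to SI units.

    Returns a (converted_value, si_unit) tuple.
    """
    try:
        factor, unit = SI_FACTORS[analyte]
    except KeyError:
        raise ValueError(f"no conversion factor for {analyte!r}") from None
    return value * factor, unit


# The albumin cutoff of 3.5 g/dL used in the table, expressed in g/L.
albumin_cutoff_si, albumin_unit = to_si("albumin", 3.5)
```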
Figure 2. Prognostic Nomogram for Patients With Non–Small Cell Lung Cancer
The results of the multivariate Cox regression model incorporating variables from penalized regression were used to build the final nomogram and generate probabilities of overall survival at 1 year to 5 years after diagnosis. BMI indicates body mass index; NA, not applicable; WBC, white blood cell. SI conversion factors: To convert albumin to grams per liter, multiply by 10; alkaline phosphatase to microkatals per liter, multiply by 0.0167; calcium to millimoles per liter, multiply by 0.25; creatinine to micromoles per liter, multiply by 88.4; hemoglobin to grams per liter, multiply by 10.0; sodium to millimoles per liter, multiply by 1.0; WBC count to ×10⁹ per liter, multiply by 0.001.
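A nomogram of this kind adds up contributions that correspond to the log hazard ratios of a Cox model. As a rough sketch of that idea, the code below combines a few training-set hazard ratios from the table into a hazard relative to a patient in every reference category. It is not the authors' nomogram: only a subset of factors is included, the names are assumptions, and absolute survival probabilities cannot be computed because the baseline survival function is not given in this excerpt.

```python
import math

# Training-set hazard ratios copied from the table (subset of factors only).
TRAINING_HR = {
    "sex": {"women": 1.0, "men": 1.23},
    "smoking": {"nonsmoker": 1.0, "smoker": 1.66},
    "stage": {"1": 1.0, "2": 1.73, "3": 2.92, "4": 5.08},
    "albumin": {"<=3.5": 1.0, ">3.5": 0.66},
}


def relative_hazard(profile):
    """Hazard relative to the all-reference patient under a Cox model.

    `profile` maps a factor name to the patient's level. Factors left
    out of the profile are treated as being at their reference level.
    """
    # Under proportional hazards, log hazard ratios are additive.
    log_hr = sum(
        math.log(TRAINING_HR[factor][level])
        for factor, level in profile.items()
    )
    return math.exp(log_hr)


# Hypothetical patient: man, smoker, stage 3 disease.
example = relative_hazard({"sex": "men", "smoking": "smoker", "stage": "3"})
```

Summing on the log scale and exponentiating is equivalent to multiplying the individual hazard ratios, which is why a nomogram can represent the model as additive points.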