
Automated Information Extraction on Treatment and Prognosis for Non-Small Cell Lung Cancer Radiotherapy Patients: Clinical Study.

Fusheng Wang, Wei Zou, Shuai Zheng, Salma K Jabbour, Shannon E O'Reilly, James J Lu, Lihua Dong, Lijuan Ding, Ying Xiao, Ning Yue.

Abstract

BACKGROUND: In outcome studies of oncology patients undergoing radiation, researchers extract valuable information from medical records generated before, during, and after radiotherapy visits, such as survival data, toxicities, and complications. Clinical studies rely heavily on these data to correlate the treatment regimen with the prognosis to develop evidence-based radiation therapy paradigms. These data are available mainly in forms of narrative texts or table formats with heterogeneous vocabularies. Manual extraction of the related information from these data can be time consuming and labor intensive, which is not ideal for large studies.
OBJECTIVE: The objective of this study was to adapt the interactive information extraction platform Information and Data Extraction using Adaptive Learning (IDEAL-X) to extract treatment and prognosis data for patients with locally advanced or inoperable non-small cell lung cancer (NSCLC).
METHODS: We transformed patient treatment and prognosis documents into normalized structured forms using the IDEAL-X system for easy data navigation. The adaptive learning and user-customized controlled toxicity vocabularies were applied to extract categorized treatment and prognosis data, so as to generate structured output.
RESULTS: In total, we extracted data from 261 treatment and prognosis documents relating to 50 patients, with overall precision and recall more than 93% and 83%, respectively. For toxicity information extractions, which are important to study patient posttreatment side effects and quality of life, the precision and recall achieved 95.7% and 94.5% respectively.
CONCLUSIONS: The IDEAL-X system is capable of extracting study data regarding NSCLC chemoradiation patients with significant accuracy and effectiveness, and therefore can be used in large-scale radiotherapy clinical data studies. ©Shuai Zheng, Salma K Jabbour, Shannon E O'Reilly, James J Lu, Lihua Dong, Lijuan Ding, Ying Xiao, Ning Yue, Fusheng Wang, Wei Zou. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 01.02.2018.


Keywords: chemoradiation treatment; information extraction; information storage and retrieval; natural language processing; non–small cell lung; oncology; prognosis

Year: 2018 PMID: 29391345 PMCID: PMC5814605 DOI: 10.2196/medinform.8662

Source DB: PubMed Journal: JMIR Med Inform

Introduction

Locally advanced or inoperable non–small cell lung cancer (NSCLC) accounts for approximately 20% to 30% of all cases of NSCLC [1] and may be treated with a combination of definitive concurrent chemotherapy and radiation. Modern radiotherapy has made great advances in the care of NSCLC patients by reducing potential toxicities using involved field irradiation while improving survival rates [2-4]. Assessing the effects of new developments in treatment techniques and regimens requires studies on the correlation between the treatment and prognosis [5-7]. Such studies involve extracting extensive patient information on chemoradiation treatments and follow-up assessments, including survival, tumor control, and toxicities. Information about treatment and prognosis is embedded in treatment summaries and clinical encounter notes, which have various formats and diverse vocabularies. Manual extraction from large volumes of patient treatment summaries and records describing prognosis is time consuming and labor intensive. There is a need for an automated information system, built on natural language processing, to extract the needed patient treatment and prognosis data.

In recent years, automated information systems have become widely used in medical and biomedical domains. The clinical Text Analysis and Knowledge Extraction System specializes in clinical information extraction [8]. The Cancer Tissue Information Extraction System focuses on annotating cancer text [9]. MedLEE supports mapping values to controlled vocabularies [10]. MedEx aims to extract medication-related information such as dosage and duration [11]. The Clinical Language Annotation, Modeling, and Processing toolkit integrates award-winning algorithms and enables users to customize natural language processing components so as to encode clinical text automatically [12,13]. Medical text extraction processes pathology reports and uses rule-based methods to classify lung cancer stages [14]. A recent study also demonstrated that the metastatic site and status of lung cancer could be extracted from pathology reports using a pipeline [15]. Another study showed that cancer stage information could also be extracted with natural language processing [16].

Most traditional information extraction systems rely on batch training or predefined rules and were designed for only limited medical domains or tasks. To support a retrospective study of NSCLC chemoradiotherapy patients, we adapted our in-house–developed information extraction platform, the Information and Data Extraction using Adaptive Learning (IDEAL-X; X represents controlled vocabulary) system [17-19]. This information extraction system aims to transform free-text clinical documents into structured data and has been used by projects in cardiology and pathology. IDEAL-X has features that distinguish it from the systems mentioned above: (1) users may freely customize the attributes to be extracted; (2) the system extracts information from narrative medical documents and generates normalized values to populate output tables and assist manual annotation; (3) it requires no mandatory configuration or training before performing annotation and adaptive learning; and (4) the system learns transparently from users' normal interactions, and establishes and refines decision models incrementally, which further reduces manual annotation effort. Figure 1 shows how the IDEAL-X system processes the input from free-text reports generated during physician and patient encounters and delivers structured output.

Figure 1

Screenshot of the Information and Data Extraction using Adaptive Learning (IDEAL-X) platform, and example input and output.

Methods

Patient Information

We collected NSCLC patient data to investigate the relationship between shrinkage of the treated tumor and each category of prognosis data: survival, tumor control, and toxicities. The patient treatment data we needed to identify included the chemoradiotherapy drugs used, dose, and treatment time frame. From the follow-up clinical notes, we needed to extract tumor control information diagnosed from the patient's follow-up computed tomography and positron emission tomography images, as well as patient toxicity and complication data, including skin, internal organ, blood, and overall body reactions to treatment. We further categorized toxicities into different toxicity grades [20]. After extracting the information in a structured format, we intended to use it to statistically correlate shrinkage of the treated tumor with survival time, disease control rate, and toxicities.

From studies approved by the institutional review boards of both Rutgers University and Emory University, we retrospectively identified 50 patients who had primary unresectable, locally advanced, biopsy-proven stage II-III NSCLC, and who had received chemoradiotherapy with a median follow-up of 22 months. In total, we exported 261 treatment and patient follow-up documents from the patient electronic health record system ARIA (Varian Medical Systems, Inc, Palo Alto, CA, USA) and anonymized the data for this study.

IDEAL-X System Development

We adapted the IDEAL-X system to support automated information extraction from the NSCLC chemoradiation patients' documents. After a requirement analysis, we added new features, such as extracting temporal expressions (timex) and parsing tabular information, to enhance the original system. We also implemented corresponding feature extraction and machine learning processes for timex and tabular formats, and constructed a dictionary to assist toxicity data extraction. We extracted patient information, such as treatment time frame and chemoradiotherapy, from treatment records with an adaptive learning process (Table 1). In extracting this information, the system began without any prior training and created its machine learning model incrementally. During the extraction of toxicity information, the adaptive learning process was disabled, and we used the dictionary shown in Textbox 1 instead. Along with the extracted values, the sentences in which the values were embedded were also output in a spreadsheet, which could be used for further manual differentiation of toxicity grades based on the Common Terminology Criteria for Adverse Events guidelines v4.0 grades that had been designated previously in the patient charts [20].
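The dictionary-driven step described above can be illustrated with a minimal sketch. This is not the IDEAL-X implementation: the function names, the naive sentence splitter, and the choice of matching the longest term first are assumptions made for illustration, and only a handful of terms from Textbox 1 are included. The sketch returns each matched term together with its sentence, mirroring the spreadsheet output used for manual grading.

```python
import re

# A few terms taken from the toxicity dictionary (Textbox 1); the full list is longer.
TOXICITY_TERMS = ["radiation esophagitis", "esophagitis", "fatigue", "nausea", "dysphagia"]

def split_sentences(text):
    """Naive splitter (an assumption): break after ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def extract_toxicities(text, terms=TOXICITY_TERMS):
    """Return (term, sentence) rows for every dictionary term found in the text."""
    # Longest terms first so "radiation esophagitis" wins over "esophagitis".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b", re.IGNORECASE
    )
    rows = []
    for sentence in split_sentences(text):
        for match in pattern.finditer(sentence):
            rows.append((match.group(1).lower(), sentence))
    return rows

if __name__ == "__main__":
    note = "Patient reports grade 2 fatigue. Radiation esophagitis improved."
    for term, sentence in extract_toxicities(note):
        print(term, "|", sentence)
```

A plain dictionary lookup like this does not account for negated or hypothetical mentions; those need the separate filters described under System Data Flow.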

Table 1

Information extracted from treatment records of patients with non–small cell lung cancer.

Attributes	Text data type	Number of values	Dictionary	Adaptive learning
Treatment site	Nominal	68	N/A^a	Yes
Chemotherapy information	Nominal	56	N/A	Yes
Treatment time frame	Date	92	N/A	Yes
Radiation therapy dose	Numerical	97	N/A	Yes
Toxicities	Nominal	331	Yes	N/A

^a N/A: not applicable.

In addition, to verify the extracted data, we asked 2 physicians to manually annotate these reports. We used the manually annotated ground truth to validate the automatically generated output from the IDEAL-X system. We used precision and recall results to estimate the effectiveness of extraction.
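For reference, precision is the fraction of extracted values that are correct, TP / (TP + FP), and recall is the fraction of ground-truth values that were extracted, TP / (TP + FN). The sketch below computes both under the assumption that a value counts as correct only when it exactly matches a ground-truth entry for the same document and attribute; the study does not state its matching criterion, and the triples shown are made-up toy data.

```python
def precision_recall(extracted, ground_truth):
    """Compute (precision, recall) over (document_id, attribute, value) triples.

    Exact matching of triples is an assumption made for this illustration.
    """
    extracted, ground_truth = set(extracted), set(ground_truth)
    true_positives = len(extracted & ground_truth)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall

if __name__ == "__main__":
    # Toy data, not taken from the study.
    truth = [
        ("doc1", "dose", "60 Gy"),
        ("doc1", "site", "left lung"),
        ("doc2", "dose", "66 Gy"),
        ("doc2", "site", "right lung"),
    ]
    system = [
        ("doc1", "dose", "60 Gy"),
        ("doc1", "site", "left lung"),
        ("doc2", "dose", "60 Gy"),
    ]
    print(precision_recall(system, truth))
```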

IDEAL-X Adaptive Learning Process

Through adaptive learning, IDEAL-X established its decision model through ordinary operations in manual annotation. First, the user designated the value to fill every attribute in the structured output form. After a few initial documents, the system quickly learned important and related information that the user sought and began to generate standardized values automatically in subsequent documents. The system continued to learn and update its knowledge, without special user intervention. This incremental learning process made the system domain agnostic and not limited to a specific medical report. When available, a user-defined controlled dictionary and other configurations could also be provided by the user to facilitate this learning process, but they were not mandatory.

System Data Flow

Figure 2 demonstrates the system's data flow. Each time the system loaded a document, it moved through the preprocessing phase and parsed the text to identify important linguistic features and natural language elements. These features and elements included (1) part of speech: the part-of-speech tag of each word, for example, noun and verb; (2) timex: the system relied on predefined regular expressions to identify timex, such as 2010-01-09 and Sep 13, 2013, and then indexed them based on their position in the text; (3) tabular information: the system identified and parsed tables in the input text to comprehend the underlying relations between values and the metadata in a table; (4) negation terms: the system detected negation terms and the regions they affected, for example, in the case of "patient denies fever and fatigue," "fever" and "fatigue" were not extracted as toxicities; and (5) uncertain terms: the system identified uncertain phrases and the regions they governed, for example, "We explained to her that the risks of the treatment included dysphagia and pneumonitis" meant that dysphagia and pneumonitis had not yet appeared as symptoms. We used these features to mark the input text and provide detailed linguistic indications during extraction.
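Two of the preprocessing features above, timex identification and negation detection, can be sketched with regular expressions. The patterns and cue words below are hypothetical and cover only the examples quoted in the text; the actual expressions used by IDEAL-X are not given in this article. The negation rule also assumes that a cue governs everything that follows it in the same sentence, which is a simplification.

```python
import re

# Hypothetical patterns covering the two date styles quoted in the text.
TIMEX_PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),  # e.g. 2010-01-09
    re.compile(
        r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? \d{1,2}, \d{4}\b"
    ),  # e.g. Sep 13, 2013
]

def find_timex(text):
    """Return (start_offset, matched_text) pairs ordered by position in the text."""
    hits = [(m.start(), m.group(0)) for p in TIMEX_PATTERNS for m in p.finditer(text)]
    return sorted(hits)

# Hypothetical negation cues; each is assumed to govern the rest of its sentence.
NEGATION_CUES = re.compile(r"\b(?:denies|no|without|negative for)\b", re.IGNORECASE)

def is_negated(sentence, term):
    """True if a negation cue appears before the term within the same sentence."""
    position = sentence.lower().find(term.lower())
    if position < 0:
        return False
    return NEGATION_CUES.search(sentence[:position]) is not None

if __name__ == "__main__":
    print(find_timex("Seen on 2010-01-09, follow-up Sep 13, 2013."))
    print(is_negated("Patient denies fever and fatigue.", "fatigue"))
    print(is_negated("Patient reports fatigue.", "fatigue"))
```

An uncertainty filter could follow the same shape, with cue phrases such as "risks of" in place of the negation cues.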

Figure 2

Data flow in the Information and Data Extraction using Adaptive Learning (IDEAL-X) platform. EMR: electronic medical record.

After preprocessing, the parsed text was examined by the automated annotation component of the system to populate the output form automatically. First, sentences where possible values may be located were extracted based on text hierarchy, frequently co-occurring terms, previously extracted values, or user-customized vocabularies. The system then identified candidate phrases from the located sentences using either a hidden Markov model [21] chunker or a dictionary chunker. Subsequently, candidate values were examined by various filters based on linguistic features collected during preprocessing, such as part of speech, certainty, or negation. After filtering, the sentence score and the chunk score were combined, on the basis of which a classifier determined the overall confidence score of each candidate value and categorized it as "accept" or "reject."

We then reviewed the automatically extracted values manually for the purpose of adaptive learning. We considered positive and negative scenarios: if the user navigated to the next document without changing any values, we regarded the values generated by the system as positive training cases; if the user modified any values, we regarded the system-generated values as negative training cases and the manually updated values as positive ones. We used the results of the review to support further improvements in the automated annotation component. Different feature extraction procedures, which model the traits of numerical, nominal, timex, and tabular data elements, were applied to the corresponding positive and negative instances. By repeating these steps, the system improved incrementally and delivered more accurate results.
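The review rule above, in which accepted values become positive cases and corrected values become a negative and positive pair, can be written down as a small function. This is a sketch under stated assumptions rather than the IDEAL-X code: the function name, the dictionary representation of a document's values, and the decision to label each attribute independently are all hypothetical.

```python
def training_cases(system_values, reviewed_values):
    """Turn one reviewed document into (attribute, value, label) training cases.

    Both arguments map attribute name -> value. Labeling per attribute is an
    assumption; the article describes the rule only at the level of values.
    """
    cases = []
    for attribute, proposed in system_values.items():
        final = reviewed_values.get(attribute)
        if final == proposed:
            # The user left the value unchanged: treat it as a positive case.
            cases.append((attribute, proposed, "positive"))
        else:
            # The user changed the value: the system's proposal is a negative
            # case and the manual correction (if any) is a positive case.
            cases.append((attribute, proposed, "negative"))
            if final is not None:
                cases.append((attribute, final, "positive"))
    return cases

if __name__ == "__main__":
    proposed = {"dose": "60 Gy", "site": "left lung"}
    reviewed = {"dose": "66 Gy", "site": "left lung"}
    for case in training_cases(proposed, reviewed):
        print(case)
```

Cases produced this way would then be passed to the type-specific feature extraction procedures before the decision model is updated.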
Textbox 1

Dictionary used to aid toxicity information extraction.

Anemia; Lymphopenia; Anorexia; Dehydration; Dyspnea; Fatigue; Mucosal inflammation; Radiation esophagitis; Weight decrease; Cough; Febrile neutropenia; Neutropenia; Bronchitis; Diarrhea; Esophagitis; Hyponatremia; Nausea; Radiation pneumonitis; Dermatitis; Leukopenia; Thrombocytopenia; Decreased appetite; Dysphagia; Failure to thrive; Localized infection; Pneumonia; Vomiting; Insomnia

Results

Figure 3 shows the validation results against the manually annotated ground truth. In the validation for patient characteristics and tumor control, the system achieved an overall precision of over 93%. The recall values for all information were more than 83%. Recall was lower than precision because it was measured over the entire adaptive learning process, including the early stage in which the system had processed only a few documents and was still constructing and refining its decision model.

Figure 3

Effectiveness of data extraction as estimated by precision and recall of automatically generated output compared with manually annotated ground truth.

In the extraction of toxicities in particular, the negation detection and certainty detection filters contributed directly to the accuracy of extraction. With the help of a controlled dictionary, the system achieved an overall precision of 95.7% and recall of 94.5%. A well-trained system can process multipage patient documents and output the results in a predefined format within 1 second. Compared with manual review, which requires reading through the entire document and manually annotating the notes on each patient, this system significantly improved the efficiency of information extraction.

Discussion

IDEAL-X employed adaptive learning and a controlled vocabulary to support information extraction, which eased both the training and the deployment processes that can be expensive when applying a traditional information extraction system. The data types IDEAL-X supports cover the most important and common information in oncology reports, which makes it well suited to our use case. We have demonstrated that the system improves the efficiency of information extraction while maintaining high accuracy when applied to extracting NSCLC patient treatment and prognosis data from heterogeneous document formats. In addition, because the system improves its performance incrementally, its accuracy could be further improved with additional training documents. Once trained, the system was able to process further reports in batch mode without revision. Without interfering with the regular manual reporting process, which handles input documents in sequence, the system accumulates knowledge transparently to support the task and, therefore, could be conveniently integrated into a regular clinical workflow. The technology it uses is domain agnostic and, therefore, could be transferred to other disease sites and studies in radiation oncology.

Limitations

The validation analysis also revealed some limitations of the system. The system identified and comprehended information based on explicitly expressed keywords. For example, the phrases "neoadjuvant chemo" and "upfront chemotherapy" may be used as keywords to identify chemotherapy induction. However, in situations where relevant information is distributed across different regions in the text, deeper comprehension becomes necessary. For example, in the case of "After 4 cycles of chemotherapy and abdomen...we began radiation...," the system was not able to interpret "4 cycles" as meaning "neoadjuvant chemotherapy." In general, this scenario reveals a limitation of the information extraction-based approach: the system requires explicit keywords or hints to determine an event and cannot reason over factors collected from different sources. Such cases resulted in lower recall for chemotherapy than for other attributes and demanded a manual review. Therefore, to facilitate the manual review, we output each extracted value together with its associated sentence in tabular format for later review and validation by the user.

Conclusion

We adapted the IDEAL-X system to automatically extract treatment and prognostic information for stage II and III NSCLC patients who had received chemoradiation. With this system, patient information was extracted efficiently from medical documents in various formats. The system, together with minimized manual review effort, generated outputs with high precision and recall. It significantly improved the effectiveness of information extraction and can be easily applied to other radiation oncology patient studies at larger scales.

16 in total

1. caTIES: a grid based system for coding and retrieval of surgical pathology reports and tissue specimens in support of translational research.

Authors: Rebecca S Crowley; Melissa Castine; Kevin Mitchell; Girish Chavan; Tara McSherry; Michael Feldman
Journal: J Am Med Inform Assoc Date: 2010 May-Jun Impact factor: 4.497

2. MedEx: a medication information extraction system for clinical narratives.

Authors: Hua Xu; Shane P Stenner; Son Doan; Kevin B Johnson; Lemuel R Waitman; Joshua C Denny
Journal: J Am Med Inform Assoc Date: 2010 Jan-Feb Impact factor: 4.497

3. Reduction in Tumor Volume by Cone Beam Computed Tomography Predicts Overall Survival in Non-Small Cell Lung Cancer Treated With Chemoradiation Therapy.

Authors: Salma K Jabbour; Sinae Kim; Syed A Haider; Xiaoting Xu; Alson Wu; Sujani Surakanti; Joseph Aisner; John Langenfeld; Ning J Yue; Bruce G Haffty; Wei Zou
Journal: Int J Radiat Oncol Biol Phys Date: 2015-04-15 Impact factor: 7.038

4. Symbolic rule-based classification of lung cancer stages from free-text pathology reports.

Authors: Anthony N Nguyen; Michael J Lawley; David P Hansen; Rayleen V Bowman; Belinda E Clarke; Edwina E Duhig; Shoni Colquist
Journal: J Am Med Inform Assoc Date: 2010 Jul-Aug Impact factor: 4.497

5. Combined chemoradiotherapy regimens of paclitaxel and carboplatin for locally advanced non-small-cell lung cancer: a randomized phase II locally advanced multi-modality protocol.

Authors: Chandra P Belani; Hak Choy; Phil Bonomi; Charles Scott; Patrick Travis; John Haluschak; Walter J Curran
Journal: J Clin Oncol Date: 2005-08-08 Impact factor: 44.544

Review 6. Meta-analysis of concomitant versus sequential radiochemotherapy in locally advanced non-small-cell lung cancer.

Authors: Anne Aupérin; Cecile Le Péchoux; Estelle Rolland; Walter J Curran; Kiyoyuki Furuse; Pierre Fournel; Jose Belderbos; Gerald Clamon; Hakki Cuneyt Ulutin; Rebecca Paulus; Takeharu Yamanaka; Marie-Cecile Bozonnat; Apollonia Uitterhoeve; Xiaofei Wang; Lesley Stewart; Rodrigo Arriagada; Sarah Burdett; Jean-Pierre Pignon
Journal: J Clin Oncol Date: 2010-03-29 Impact factor: 44.544

7. Phase III study of concurrent versus sequential thoracic radiotherapy in combination with mitomycin, vindesine, and cisplatin in unresectable stage III non-small-cell lung cancer.

Authors: K Furuse; M Fukuoka; M Kawahara; H Nishikawa; Y Takada; S Kudoh; N Katagami; Y Ariyoshi
Journal: J Clin Oncol Date: 1999-09 Impact factor: 44.544

8. Daily megavoltage computed tomography in lung cancer radiotherapy: correlation between volumetric changes and local outcome.

Authors: Samuel Bral; Mark De Ridder; Michaël Duchateau; Thierry Gevaert; Benedikt Engels; Denis Schallier; Guy Storme
Journal: Int J Radiat Oncol Biol Phys Date: 2010-07-16 Impact factor: 7.038

9. Influence of technologic advances on outcomes in patients with unresectable, locally advanced non-small-cell lung cancer receiving concomitant chemoradiotherapy.

Authors: Zhongxing X Liao; Ritsuko R Komaki; Howard D Thames; Helen H Liu; Susan L Tucker; Radhe Mohan; Mary K Martel; Xiong Wei; Kunyu Yang; Edward S Kim; George Blumenschein; Waun Ki Hong; James D Cox
Journal: Int J Radiat Oncol Biol Phys Date: 2009-06-08 Impact factor: 7.038

10. Identifying Metastases-related Information from Pathology Reports of Lung Cancer Patients.

Authors: Ergin Soysal; Jeremy L Warner; Joshua C Denny; Hua Xu
Journal: AMIA Jt Summits Transl Sci Proc Date: 2017-07-26
