
Identifying Abdominal Aortic Aneurysm Cases and Controls using Natural Language Processing of Radiology Reports.

Sunghwan Sohn, Zi Ye, Hongfang Liu, Christopher G Chute, Iftikhar J Kullo.

Abstract

Prevalence of abdominal aortic aneurysm (AAA) is increasing due to longer life expectancy and implementation of screening programs. Patient-specific longitudinal measurements of AAA are important to understand pathophysiology of disease development and modifiers of abdominal aortic size. In this paper, we applied natural language processing (NLP) techniques to process radiology reports and developed a rule-based algorithm to identify AAA patients and also extract the corresponding aneurysm size with the examination date. AAA patient cohorts were determined by a hierarchical approach that: 1) selected potential AAA reports using keywords; 2) classified reports into AAA-case vs. non-case using rules; and 3) determined the AAA patient cohort based on a report-level classification. Our system was built in an Unstructured Information Management Architecture framework that allows efficient use of existing NLP components. Our system produced an F-score of 0.961 for AAA-case report classification with an accuracy of 0.984 for aneurysm size extraction.


Year: 2013 PMID: 24303276 PMCID: PMC3845740

Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc

Introduction

Abdominal aortic aneurysm (AAA) is present in about 10% of men older than 65 years [1, 2]. Most cases are asymptomatic and are detected incidentally on imaging studies. Rupture of an AAA is the most severe complication and is associated with a high mortality rate of 90%; it is the 14th leading cause of death in the U.S.A. [3]. The US Preventive Services Task Force recommends one-time screening in all men older than 65 years who have ever smoked and in those older than 60 years with a family history of AAA [3, 4]. The commonly used thresholds are >=3 cm for diagnosis of AAA and >=5–5.5 cm for surgical repair [5, 6]. It is generally accepted that larger AAAs tend to grow more rapidly than small AAAs and are at higher risk of rupture, although not in all cases [7, 8]. Recent studies have shown the influence of genetic factors on AAA [9, 10], and several genetic variants have been reported to be associated with AAA [11–13]. Identifying genetic determinants associated with AAA can facilitate understanding of the underlying pathophysiology and of modifiers of abdominal aortic size. As a first step to this end, we describe the development of a tool that identifies patients with AAA and extracts information on progression by extracting sequential AAA size changes using natural language processing (NLP). The study of aneurysm progression may identify novel biomarkers of disease development. In this paper, we implemented a rule-based system that identifies AAA-cases from radiology reports and extracts the corresponding AAA sizes with date information. Our system utilized the NLP components from MedTagger [14] to process radiology reports and was built in the Apache Unstructured Information Management Architecture (UIMA) ( http://incubator.apache.org/uima/ ) framework, which provides an efficient way to add new components as well as reuse existing NLP modules.

Background

AAA is typically diagnosed by physical examination, ultrasound, or CT scan, and the findings are recorded in radiology reports. Medical experts can manually review radiology reports to identify AAA patients and extract the corresponding AAA sizes for clinical studies and patient history summarization. Manual review, however, is time consuming and often impractical for routine practice and large-scale clinical studies. To overcome these drawbacks, NLP techniques, which process unstructured text and convert it to a structured format, can be used to automatically extract AAA-related information and identify patient cohorts. Over the past decade, advances in NLP have produced promising results in information extraction from clinical text [15] and have been successfully applied in various clinical applications including patient medical status extraction [16, 17], sentiment analysis [18], decision support [19, 20], genome-wide association studies [21, 22], and diagnosis code assignment [23, 24]. Recently, Mayo Clinic developed MedTagger [14], an NLP pipeline with a fast dictionary lookup, to process clinical text and annotate clinical concepts. Our system used basic NLP components in MedTagger, including dictionary lookup, to process unstructured text and find medical concepts in radiology reports. The main annotators are: Sentence Detection, which finds sentence boundaries; Tokenization, which finds word token boundaries; Normalization, which generates one form for the various morphological variants of a word through the NLM’s Lexical Variant Generation (LVG) tool ( http://SPECIALIST.nlm.nih.gov ), making it possible to use normalized terms for dictionary lookup; Size Annotation, which extracts the aneurysm size; AAA NE (Named Entity) Detection, which discovers AAA-related concepts based on dictionary lookup; Negation, which identifies negated NEs; and AAA Identification, which identifies AAA-case reports and extracts the corresponding aneurysm sizes.

Methods

This study used Mayo Clinic radiology reports (ultrasound, CT, MRI, and angiography reports) from eMERGE [25] patient cohorts. Many of these radiology reports, however, are not related to AAA. To maximize system efficiency, a hierarchical approach was used to determine the AAA patient cohort. First, potential AAA radiology reports were selected using keywords. Second, reports were classified into AAA-case vs. non-case using manually crafted rules based on keywords and aneurysm size. Last, the AAA patient cohort was determined based on the report-level classification. The detailed methods are as follows:

Selection of potential AAA reports

Mayo’s radiology reports consist of multiple fields including test codes and test descriptions. Initially, we selected radiology reports based on specific CPT (Current Procedural Terminology) codes and code descriptions. However, this approach missed many AAA-related reports. A better alternative was to use keywords, i.e., to select potential AAA reports that contain both “aorta” and “abdominal” relevant terms, because AAA-related reports must include these terms. This keyword-based search caught the reports missed by the code-based search and retrieved a much higher number of potential AAA reports: out of 180K reports, the code-based approach retrieved 3,370 reports and the keyword-based approach retrieved 11,420 reports. Table 1 shows the keywords we used. The terms were expanded through both UMLS ( http://www.nlm.nih.gov/research/umls/ ) concepts and the most frequent terms used in Mayo clinical notes.

Table 1.

“Aorta” and “abdominal” related keywords

“aorta” terms	“abdominal” terms

aorta	abdominal
aortae	abd
aortas	abdomen
aortic	abdomens
	abdomina
	abdominals
	abdominopelvic region
	abdominopelvic regions
	abdominopelvis
	ccs_abdominal
	intrabdominal
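
The keyword-based selection can be illustrated with a short Python sketch. This is not the authors' implementation: the function name `is_potential_aaa_report` and the choice of case-insensitive whole-term matching with regular expressions are assumptions made for illustration; only the term lists come from Table 1.

```python
import re

# Term lists copied from Table 1.
AORTA_TERMS = ["aorta", "aortae", "aortas", "aortic"]
ABDOMINAL_TERMS = [
    "abdominal", "abd", "abdomen", "abdomens", "abdomina", "abdominals",
    "abdominopelvic region", "abdominopelvic regions", "abdominopelvis",
    "ccs_abdominal", "intrabdominal",
]


def _compile(terms):
    # Try longer terms first so multi-word terms are not cut short.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    # Lookarounds require the term to stand alone as a whole word.
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)


AORTA_RE = _compile(AORTA_TERMS)
ABDOMINAL_RE = _compile(ABDOMINAL_TERMS)


def is_potential_aaa_report(text):
    """Return True when the report contains a term from each column."""
    return bool(AORTA_RE.search(text)) and bool(ABDOMINAL_RE.search(text))
```

A report is kept only when both patterns match, which mirrors the requirement that a potential AAA report contain both an “aorta” term and an “abdominal” term.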

AAA report classification

Figure 1 shows the pseudo code of the algorithm. After the potential AAA reports were selected, each report was classified as AAA-case vs. non-case as follows:

Figure 1.

Algorithm of AAA report classification.

AAA-case: Reports that contain “abdominal aorta” or “abdominal aorta aneurysm” related terms and in which the aneurysm size at the examination date is equal to or greater than 3 cm.

Non-case:
1) Reports that contain status post indications (e.g., aortic/aorto + endograft, abdominal aortic endograft, repair of AAA, s/p AAA repair, endovasc repair AAA, etc.).
2) Reports that contain only AAA related terms without the size information, or “ectasia of abdominal aorta.”
3) Reports that contain explicit terms indicating a “normal” condition (e.g., normal caliber abdominal aorta, normal distal aorta), negated AAA (e.g., negative for abdominal aortic aneurysm), or an aneurysm size of less than 3 cm.
4) Reports that do not contain any AAA related information.

It should be noted that our definition of AAA-case in this study excludes patients with open surgery or endovascular repair, although they are AAA cases from a clinical perspective. This is because our system focuses on abstracting aneurysm sizes longitudinally and tracking aneurysm growth.
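
The rules above can be sketched in Python as a single decision function. This is a minimal sketch under stated assumptions, not the algorithm of Figure 1: the `ReportFeatures` container and its field names are hypothetical stand-ins for the upstream annotations, the order in which the rules are checked is an assumption, and the “ectasia of abdominal aorta” rule is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional

AAA_SIZE_THRESHOLD_CM = 3.0  # diagnostic threshold stated in the text


@dataclass
class ReportFeatures:
    """Hypothetical container for annotations produced upstream."""
    has_aaa_term: bool          # AA or AAA related term found
    has_status_post_term: bool  # e.g. "endograft", "s/p AAA repair"
    has_normal_term: bool       # e.g. "normal caliber abdominal aorta"
    aaa_negated: bool           # e.g. "negative for abdominal aortic aneurysm"
    size_cm: Optional[float]    # current-exam aneurysm size, None if absent


def classify_report(f: ReportFeatures) -> str:
    """Return "AAA-case" or "non-case" (rule order is an assumption)."""
    if not f.has_aaa_term:
        return "non-case"  # no AAA related information
    if f.has_status_post_term:
        return "non-case"  # repaired aneurysm is excluded by definition
    if f.has_normal_term or f.aaa_negated:
        return "non-case"  # explicitly normal or negated
    if f.size_cm is None:
        return "non-case"  # AAA terms only, no size information
    if f.size_cm >= AAA_SIZE_THRESHOLD_CM:
        return "AAA-case"
    return "non-case"      # size below 3 cm
```

Checking the status post flag before the size means that a repaired aneurysm with a reported sac size of 3 cm or more is still a non-case, in line with the definition above.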

Size extraction:

In radiology reports, AAA sizes are expressed in one, two, or three dimensions (AP, width/transverse, and length) and are described in numerous ways (e.g., 4.4cm, measuring 4.4×5.3 cm, 4.4×5.3×6.1 cm, maximum AP diameter of 3.7cm and a transverse diameter of 3.7cm, etc.). Although the aneurysm size description can have more than one dimension, only one value (i.e., the maximum of AP and width/transverse) is used to determine an AAA-case. Some radiology reports contain the size(s) from a previous examination, but we only considered the size of the current examination. The size annotator used regular expressions to extract the size description from free text. It then selected the maximum value from AP and width/transverse and normalized the value to “cm” (some values are in “mm”). The sizes that are not associated with the given examination date (i.e., sizes from previous examinations) were excluded based on description patterns as follows:
1) Select the size that comes with a current indication word (e.g., “now measures/measuring”).
2) Exclude the size that comes with previous indication words (e.g., “previously/earlier/prior,” “previous measurement(s) was/were,” “prior exam,” “compared to/with,” “increased/decreased from,” etc.).
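
A regular-expression size extractor along these lines might look like the Python sketch below. The patterns, the 40-character context window, and the function name `extract_current_size_cm` are assumptions for illustration and are not the patterns used in the study; the sketch also assumes that the first two numbers of a multi-dimension size are AP and width/transverse.

```python
import re

NUM = r"\d+(?:\.\d+)?"
# One to three dimensions separated by "x" or "×", followed by a unit.
SIZE_RE = re.compile(
    rf"({NUM})(?:\s*[x×]\s*({NUM}))?(?:\s*[x×]\s*({NUM}))?\s*(cm|mm)\b",
    re.IGNORECASE,
)
PREVIOUS_RE = re.compile(
    r"\b(?:previously|earlier|prior|previous|compared\s+(?:to|with)|"
    r"(?:increased|decreased)\s+from)\b",
    re.IGNORECASE,
)
CURRENT_RE = re.compile(r"\bnow\s+measur(?:es|ing)\b", re.IGNORECASE)


def _last_end(pattern, text):
    """End offset of the last match of pattern in text, or -1."""
    end = -1
    for m in pattern.finditer(text):
        end = m.end()
    return end


def extract_current_size_cm(text, window=40):
    """Return the largest current-exam size in cm, or None if none found."""
    candidates = []
    for m in SIZE_RE.finditer(text):
        context = text[max(0, m.start() - window):m.start()]
        # Skip the size when the nearest indication word says "previous".
        if _last_end(PREVIOUS_RE, context) > _last_end(CURRENT_RE, context):
            continue
        # Use only the first two dimensions (assumed AP and transverse).
        dims = [float(g) for g in m.group(1, 2) if g is not None]
        value = max(dims)
        if m.group(4).lower() == "mm":
            value /= 10.0  # normalize millimetres to centimetres
        candidates.append(value)
    return max(candidates) if candidates else None
```

For a sentence such as “AAA now measures 3.6 x 3.4 cm, previously 3.5 x 3.3 cm,” the second size is skipped because “previously” precedes it, and the function returns 3.6.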

AAA-related keywords:

Table 2 includes keywords for abdominal aorta (AA), abdominal aorta aneurysm (AAA), status post (S/P), and normal. They were initially provided by a medical expert and expanded through UMLS concepts and frequent terms used in Mayo clinical notes. They were also normalized into canonical forms through NLM’s LVG in order to match variations (e.g., “aneurysm abdominal” can match with “aneurysm abdominals”). We used the dictionary-lookup in MedTagger to find those keywords.

Table 2.

AAA related keywords (normalized through LVG)

AA	AAA	S/P	Normal

infrarenal aorta	a.a.a.	post a.a.a. repair	normal caliber abdominal aorta
abdominal aorta	abdominal aortic aneurysm	s/p a.a.a. repair	normal distal aorta
aorta abdominal	aneurysm abdominal aorta	endograft	abdominal aorta normal caliber
infrarenal location	aneurysm abdominal	endovascular	aorta normal caliber
	aneurysm abdominal aortic	aneurysm sac
	aorta abdominal aneurysm	bifurcate endograft
	aortic aneurysm abdominal	endoleak
	infrarenal abdominal aorta
	infrarenal aortic aneurysm
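
To illustrate how normalization lets one dictionary entry match several surface variants, the Python sketch below applies the same crude normalizer to both the keywords and the report text before matching. The normalizer (lowercasing plus stripping a trailing plural “s”) is a deliberately simple stand-in for LVG, not its actual behavior, and the function names and the reduced keyword set are assumptions; MedTagger's dictionary lookup is not reproduced here.

```python
import re

# A small subset of Table 2, keyed by category.
KEYWORDS = {
    "AA": ["infrarenal aorta", "abdominal aorta", "aorta abdominal"],
    "AAA": ["abdominal aortic aneurysm", "aneurysm abdominal"],
    "S/P": ["endograft", "endovascular", "endoleak"],
    "Normal": ["normal caliber abdominal aorta", "normal distal aorta"],
}


def normalize(text):
    """Crude stand-in for LVG: lowercase tokens, strip a trailing 's'."""
    tokens = re.findall(r"[a-z0-9/]+", text.lower())
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def lookup(text):
    """Return (category, keyword) pairs whose normalized form occurs in text."""
    tokens = normalize(text)
    hits = []
    for category, phrases in KEYWORDS.items():
        for phrase in phrases:
            target = normalize(phrase)
            n = len(target)
            # Slide a window of n tokens across the report.
            if any(tokens[i:i + n] == target for i in range(len(tokens) - n + 1)):
                hits.append((category, phrase))
    return hits
```

With this sketch, the text “aneurysm abdominals” matches the keyword “aneurysm abdominal,” as in the example given above.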

AAA Patient cohort identification

Generally, patients have more than one examination and therefore more than one report. After all reports were classified (i.e., AAA-case vs. non-case), the AAA patients could be determined. If any report for a given patient is an AAA-case, the patient is considered an AAA patient. Our AAA patient cohort also includes the corresponding aneurysm sizes and examination dates.
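
The patient-level rule can be sketched in a few lines of Python. The input layout (tuples of patient ID, ISO-formatted examination date, report class, and size) and the function name are assumptions; keeping every reported size for a cohort patient, including sizes from non-case reports, is also an assumption made so that growth over time can be followed.

```python
from collections import defaultdict


def identify_aaa_patients(reports):
    """Map each AAA patient to a date-sorted list of (exam_date, size_cm).

    reports: iterable of (patient_id, exam_date, report_class, size_cm),
    where exam_date is an ISO string and size_cm may be None.
    """
    case_patients = set()
    sizes = defaultdict(list)
    for patient_id, exam_date, report_class, size_cm in reports:
        if report_class == "AAA-case":
            case_patients.add(patient_id)  # one AAA-case report is enough
        if size_cm is not None:
            sizes[patient_id].append((exam_date, size_cm))
    return {pid: sorted(sizes[pid]) for pid in case_patients}
```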

Results

Figure 2 shows the annotation types and values of a de-identified sample report in the UIMA CAS Visual Debugger. The right window shows a radiology report snippet processed to populate annotations as they appear in the left window. The bottom left window shows annotation types and values, including the report class and the AAA size. In this example, there are two AAA-related terms, “aaa” and “infrarenal aorta,” and two AAA size descriptions (3.6 × 3.4cm and 3.5 × 3.3cm). However, we only consider the AAA size of the current examination date (3.6 × 3.4cm). As the final size, we extract the larger value (3.6 cm).

Figure 2.

AAA annotations visualized through the UIMA CAS Visual Debugger.

A medical expert manually examined 650 radiology reports and classified them as AAA-case vs. non-case. If the report was an AAA-case, the corresponding size was also extracted. We used 400 reports to train the system and held out 250 reports for testing. Our system was able to catch most AAA-case reports, with a high F-score of 0.961 (61 TPs, 4 FPs, and 1 FN) on the test set. Table 3 shows the corresponding evaluation performance.

Table 3.

AAA-case report classification in the test set

Evaluation	value
precision	0.939
recall	0.984
F-score	0.961
size accuracy *	0.984

* size accuracy = # correct sizes / # TP AAA cases

The performance of AAA patient cohort identification is shown in Table 4. The test set contains 25 AAA-case patients, and the system identified 27 patients as AAA-cases, which led to 2 FPs and 0 FNs.

Table 4.

AAA patient identification in the test set

Evaluation	value
precision	0.926
recall	1
F-score	0.962
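
The values in Tables 3 and 4 follow from the reported counts. The Python sketch below recomputes them from the true positive, false positive, and false negative counts given in the text; the function name is an assumption. With 61 TPs and 4 FPs, the report-level precision computes to 61/65 ≈ 0.938, which differs from the tabulated 0.939 only in the last digit.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F-score (harmonic mean) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts reported in the text for the test set.
report_level = precision_recall_f1(tp=61, fp=4, fn=1)    # Table 3
patient_level = precision_recall_f1(tp=25, fp=2, fn=0)   # Table 4
```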

Our system was also able to generate sequential size changes over time, which are required to build a sophisticated AAA phenotype algorithm in the future. For example:
PatientID|2.5cm:**/**/1999|3.2cm:**/**/2008|3.5cm:**/**/2009|3.5cm:**/**/2009|3.6cm:**/**/2010|3.8cm:**/**/2011
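
A pipe-delimited record of this shape can be produced with a few lines of Python. The function name and the input layout (a chronologically ordered list of date and size pairs) are assumptions; only the output format is taken from the example above.

```python
def format_size_timeline(patient_id, measurements):
    """Build "PatientID|size:date|..." from (exam_date, size_cm) pairs.

    measurements is assumed to be in chronological order already.
    """
    parts = [f"{size_cm:.1f}cm:{exam_date}" for exam_date, size_cm in measurements]
    return "|".join([patient_id] + parts)
```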

Discussion

Our system was able to classify most AAA-case reports with a high F-score. There was one false negative, caused by the S/P-relevant term “endovascular”: the report contained the term “pre-endovascular,” which indicates a report written before surgery and should not be treated as S/P. False positives were due to: incorrect negation (e.g., in “abdominal aorta negative for aneurysm,” “aneurysm” is negated but “abdominal aorta” is not); incorrect size determination (e.g., “under 3cm” was not treated as < 3 cm); and incorrect association with a segment other than the abdominal aorta (e.g., “a fusiform 5.5cm aneurysm of the distal thoracic and upper abdominal aorta extending…”). A radiology report can contain more than one AAA size description, mainly because of a size description from a previous examination. We eliminated previous sizes by filtering out sizes associated with words that indicate “previous” and achieved an accuracy of 0.984 (size accuracy in Table 3). The AAA patient cohort identification was based on a simple rule: examining the report-level AAA classification. Although report-level classification is not perfect, an AAA patient who has more than one AAA-case report can still be identified as long as one of those reports is correctly classified. Our results show that a rule-based system using NLP techniques can effectively identify an AAA patient cohort and extract aneurysm sizes from radiology reports. There is a potential role for an NLP-based size extractor in generating an electronic alert that notifies the referring physician about an AAA that exceeds a certain size threshold. By adjusting the pattern matching rules, our approach may also help ascertain from radiology reports the presence of other pathologies that have size-based criteria, for example, cerebral or other arterial aneurysms. The automated system for AAA patient cohort identification enables large-scale clinical studies.
Currently, our system is being applied to a larger patient cohort to identify AAA patients with size and date information for the eMERGE II phenotype study.

23 in total

1. Automated encoding of clinical documents based on natural language processing.

Authors: Carol Friedman; Lyudmila Shagina; Yves Lussier; George Hripcsak
Journal: J Am Med Inform Assoc Date: 2004-06-07 Impact factor: 4.497

2. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications.

Authors: Guergana K Savova; James J Masanz; Philip V Ogren; Jiaping Zheng; Sunghwan Sohn; Karin C Kipper-Schuler; Christopher G Chute
Journal: J Am Med Inform Assoc Date: 2010 Sep-Oct Impact factor: 4.497

Review 3. Screening for abdominal aortic aneurism.

Authors: Janelle Guirguis-Blake; Tracy A Wolff
Journal: Am Fam Physician Date: 2005-06-01 Impact factor: 3.292

4. Automating the assignment of diagnosis codes to patient encounters using example-based and machine learning techniques.

Authors: Serguei V S Pakhomov; James D Buntrock; Christopher G Chute
Journal: J Am Med Inform Assoc Date: 2006-06-23 Impact factor: 4.497

5. The same sequence variant on 9p21 associates with myocardial infarction, abdominal aortic aneurysm and intracranial aneurysm.

Authors: Anna Helgadottir; Gudmar Thorleifsson; Kristinn P Magnusson; Solveig Grétarsdottir; Valgerdur Steinthorsdottir; Andrei Manolescu; Gregory T Jones; Gabriel J E Rinkel; Jan D Blankensteijn; Antti Ronkainen; Juha E Jääskeläinen; Yoshiki Kyo; Guy M Lenk; Natzi Sakalihasan; Konstantinos Kostulas; Anders Gottsäter; Andrea Flex; Hreinn Stefansson; Torben Hansen; Gitte Andersen; Shantel Weinsheimer; Knut Borch-Johnsen; Torben Jorgensen; Svati H Shah; Arshed A Quyyumi; Christopher B Granger; Muredach P Reilly; Harland Austin; Allan I Levey; Viola Vaccarino; Ebba Palsdottir; G Bragi Walters; Thorbjorg Jonsdottir; Steinunn Snorradottir; Dana Magnusdottir; Gudmundur Gudmundsson; Robert E Ferrell; Sigurlaug Sveinbjornsdottir; Juha Hernesniemi; Mika Niemelä; Raymond Limet; Karl Andersen; Gunnar Sigurdsson; Rafn Benediktsson; Eric L G Verhoeven; Joep A W Teijink; Diederick E Grobbee; Daniel J Rader; David A Collier; Oluf Pedersen; Roberto Pola; Jan Hillert; Bengt Lindblad; Einar M Valdimarsson; Hulda B Magnadottir; Cisca Wijmenga; Gerard Tromp; Annette F Baas; Ynte M Ruigrok; Andre M van Rij; Helena Kuivaniemi; Janet T Powell; Stefan E Matthiasson; Jeffrey R Gulcher; Gudmundur Thorgeirsson; Augustine Kong; Unnur Thorsteinsdottir; Kari Stefansson
Journal: Nat Genet Date: 2008-01-06 Impact factor: 38.330

6. Combining decision support methodologies to diagnose pneumonia.

Authors: D Aronsky; M Fiszman; W W Chapman; P J Haug
Journal: Proc AMIA Symp Date: 2001

7. Leveraging informatics for genetic studies: use of the electronic medical record to enable a genome-wide association study of peripheral arterial disease.

Authors: Iftikhar J Kullo; Jin Fan; Jyotishman Pathak; Guergana K Savova; Zeenat Ali; Christopher G Chute
Journal: J Am Med Inform Assoc Date: 2010 Sep-Oct Impact factor: 4.497

8. The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies.

Authors: Catherine A McCarty; Rex L Chisholm; Christopher G Chute; Iftikhar J Kullo; Gail P Jarvik; Eric B Larson; Rongling Li; Daniel R Masys; Marylyn D Ritchie; Dan M Roden; Jeffery P Struewing; Wendy A Wolf
Journal: BMC Med Genomics Date: 2011-01-26 Impact factor: 3.063

9. A genome-wide association study of red blood cell traits using the electronic medical record.

Authors: Iftikhar J Kullo; Keyue Ding; Hayan Jouni; Carin Y Smith; Christopher G Chute
Journal: PLoS One Date: 2010-09-28 Impact factor: 3.240

10. Genetic and environmental contributions to abdominal aortic aneurysm development in a twin population.

Authors: Carl Magnus Wahlgren; Emma Larsson; Patrik K E Magnusson; Rebecka Hultgren; Jesper Swedenborg
Journal: J Vasc Surg Date: 2009-11-24 Impact factor: 4.268
