
Towards Evidence-based Precision Medicine: Extracting Population Information from Biomedical Text using Binary Classifiers and Syntactic Patterns.

Kalpana Raja, Naman Dasot, Pawan Goyal, Siddhartha R Jonnalagadda.

Abstract

Precision medicine is an emerging approach for prevention and treatment of disease that considers individual variability in genes, environment, and lifestyle for each person. Disseminating individualized evidence by automatically identifying population information in the literature is key to evidence-based precision medicine at the point of care. We propose a hybrid approach that uses natural language processing techniques to automatically extract population information from biomedical literature. Our approach first applies a binary classifier to separate sentences with population information from those without it. A rule-based system built on syntactic-tree regular expressions is then applied to the sentences containing population information to extract the population named entities. The proposed two-stage approach achieved an F-score of 0.81 using a MaxEnt classifier and the rule-based system, and an F-score of 0.87 using a Naïve-Bayes classifier and the rule-based system, and performed relatively well compared to many existing systems. The system and evaluation dataset are being released as open source.

Year:  2016        PMID: 27570671      PMCID: PMC5001749     

Source DB:  PubMed          Journal:  AMIA Jt Summits Transl Sci Proc


Introduction

The goal of precision medicine is to develop prevention and treatment strategies that account for individual variability, based on each patient’s unique biological characteristics (e.g. inherited variation in drug response) and disease processes (e.g. tumor genomic characteristics). The approach extends beyond personalized medicine, evidence-based medicine and genome medicine. Precision medicine aims to bridge individual patient characterization and phenotype with evidence-based medicine. Recent years have witnessed the development of biological databases from human genome projects, characterization of patients using systems biology approaches (e.g. proteomics, metabolomics, genomics, and diverse cellular assays), phenotype, and computational methods to enhance the health and wellness of each person rather than just treating the disease.[1] The approaches for characterizing patients incorporate knowledge derived from proteomics, genomics, metabolomics, and even social and mobile health.[1,2] Evidence-based medicine integrates the best evidence from well-designed research with clinical expertise and patient values. The four components of precision medicine have been defined as predictive, preventive, personalized, and participatory medicine.[3,4] Khoury et al. [5] proposed the integration of a “fifth P”: the population perspective, which describes the balance between individual and population interventions for improving health and the evaluation of their comparative effectiveness. A population perspective applies the concept of population screening to preventive medicine, and the use of evidence-based practice to personalized medicine.
It was argued that the application of population science to precision medicine is the key to deciding the most appropriate treatment for every individual patient.[5] The short-term goal of precision medicine is to come closer to curing cancers and diabetes, and the long-term goal is to provide access to personalized knowledge for all diseases.[1] The primary source of knowledge is biomedical literature, which is growing at an exponential rate. Natural language processing (NLP) techniques are being used to automatically extract information from biomedical literature. Various studies have explored the extraction of the number of participants, their age, sex, ethnicity, country, comorbidities, spectrum of presenting symptoms, current treatments, etc. While most studies only highlighted the sentences containing the population data elements, six studies[6-11] extracted the data elements themselves. For example, Kelly and Yang[6] extracted age of participants, duration of study, ethnicity of participants, gender of subjects, health status of participants, and number of participants on a dataset of 386 abstracts. Unfortunately, each of these studies used a different corpus of reports, which makes direct comparisons impossible. None of the studies have made their systems available as open source, except the RIDeM tool,[12] which is available as a web service. In the current study, we developed an automated approach for extracting population named entities from biomedical literature. Our hybrid approach integrates a binary classifier for preprocessing and a rule-based system for extracting the population named entity. The binary classifier, either a MaxEnt classifier or a Naïve-Bayes classifier, identifies the input sentences that contain population information. The rule-based system uses a set of syntactic patterns to identify and extract the population named entity.
This approach has the potential to provide personalized evidence updates to clinicians and patients based on their individual characteristics. The evaluation dataset and the code are available as open source to enable implementation in wider precision medicine applications.[13] To our knowledge, no evaluation dataset is publicly available for testing and training population extraction approaches.

Methods

For the scope of this paper, population refers to the cohort of patients with shared characteristics such as age, gender, treatments and diseases. The algorithm for extracting population named entities operates at the sentence level. Input sentences are first preprocessed to identify those with population information. Two types of binary classifiers are used in the current study: a MaxEnt classifier[14] and a Naïve-Bayes classifier.[15] We use MALLET (Machine Learning for LanguagE Toolkit)[16] for sentence classification. The hybrid approach combining the classifiers with the population extraction algorithm is described in detail below.

Population Extraction Algorithm

Syntax Pattern Generation: Our approach uses two NLP tools: the Stanford lexical parser[22] to parse an input sentence, and Stanford Tregex (tree regular expressions), simply called Tregex,[23] to query the parse tree structure for extracting the population information. Our approach for generating syntax patterns is semi-automated and uses the Tregex-specific syntax for matching the nodes in a parse tree. In a previous study, we generated Tregex patterns for extracting protein-protein interaction information from biomedical literature,[24] where we achieved an F-score of 66.05%, outperforming most of the existing systems. In the current study, we use Tregex patterns for extracting population named entities. The constituency parse trees for a set of sentences (used for developing Tregex patterns) are generated with the Stanford lexical parser. A parse tree details the grammatical components such as noun phrase (NP) and verb phrase (VP), as shown in Figure 1A. It may contain more than one noun phrase, and sometimes a noun phrase is nested within another noun phrase or verb phrase. For each parse tree, we identified the sub-tree encompassing the population named entity. We manually developed Tregex patterns that best describe the nodes of the sub-tree representing a population named entity (Figure 1). Tregex patterns are similar to regular expressions and are easy to use. The patterns are then incorporated into the population extraction algorithm for automatic extraction of population named entities. The Tregex symbols applied for identifying the relationships between nodes are listed in Table 1. The Tregex patterns developed for extracting population named entities are listed in Table 2.
Figure 1.

Extraction of population named entity

Table 1.

List of Tregex symbols for pattern generation

Tregex Symbol | Description
Node A << Node B | Node A dominates Node B
Node A >> Node B | Node A is dominated by Node B (Node B << Node A)
Node A < Node B | Node A immediately dominates Node B
Node A > Node B | Node A is immediately dominated by Node B (Node B < Node A)
Node A $ Node B | A and B are sisters, i.e. at the same level in the parse tree (but are not equal)
@Node | Selects the entire phrase (noun or verb) mentioned, e.g. @NP
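
The relations in Table 1 can be illustrated on a small hand-built tree. The following is a minimal Python sketch, not the Stanford Tregex implementation: the Node class, the helper functions and the hand-written parse of "treat patients with cardiac failure" are assumptions made for illustration only, and a real parse would come from the Stanford parser.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional


@dataclass
class Node:
    """A constituency-tree node; leaves are words with no children."""
    label: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def __post_init__(self) -> None:
        for child in self.children:
            child.parent = self

    def subtree(self) -> Iterator["Node"]:
        """Yield this node and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.subtree()

    def text(self) -> str:
        """Join the leaf words under this node."""
        if not self.children:
            return self.label
        return " ".join(child.text() for child in self.children)


def immediately_dominates(a: Node, b: Node) -> bool:
    """A < B: a is the parent of b."""
    return b.parent is a


def dominates(a: Node, b: Node) -> bool:
    """A << B: a is an ancestor of b."""
    ancestor = b.parent
    while ancestor is not None:
        if ancestor is a:
            return True
        ancestor = ancestor.parent
    return False


def sisters(a: Node, b: Node) -> bool:
    """A $ B: a and b are distinct nodes sharing the same parent."""
    return a is not b and a.parent is not None and a.parent is b.parent


def find(root: Node, a_label: str,
         relation: Callable[[Node, Node], bool], b_label: str) -> List[Node]:
    """Return nodes labelled a_label that stand in `relation` to a b_label node."""
    nodes = list(root.subtree())
    return [a for a in nodes if a.label == a_label
            and any(b.label == b_label and relation(a, b) for b in nodes)]


def leaf(tag: str, word: str) -> Node:
    """Build a part-of-speech node over a single word."""
    return Node(tag, [Node(word)])


# Hand-written parse (an assumption, not parser output).
tree = Node("VP", [
    leaf("VB", "treat"),
    Node("NP", [
        Node("NP", [leaf("NNS", "patients")]),
        Node("PP", [
            leaf("IN", "with"),
            Node("NP", [leaf("JJ", "cardiac"), leaf("NN", "failure")]),
        ]),
    ]),
])

# "NP > PP": NP immediately dominated by PP (arguments swapped).
np_under_pp = [n.text() for n in
               find(tree, "NP", lambda a, b: immediately_dominates(b, a), "PP")]
np_sister_pp = [n.text() for n in find(tree, "NP", sisters, "PP")]
np_over_pp = [n.text() for n in find(tree, "NP", dominates, "PP")]
print(np_under_pp, np_sister_pp, np_over_pp)
```

In this toy tree, the noun phrase that dominates the prepositional phrase covers the whole population phrase, while the sister and immediately-dominated matches return only fragments of it, which is why the choice and ordering of patterns matters.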
Table 2.

Tregex patterns for population named entity extraction (to be used in conjunction with "population-related concepts")

Pattern | Output Phrase | Example Sentence
NP > PP | Noun phrase succeeding prepositional phrase | Aldosterone blockade has been shown to be effective in reducing total mortality as well as hospitalization for heart failure in patients with systolic left ventricular dysfunction (SLVD) due to chronic heart failure and in patients with SLVD post acute myocardial infarction. (PMID: 15134801)
PP $ NP | Prepositional phrase and noun phrase are sisters | Rosuvastatin did not reduce mortality compared to placebo in patients with heart failure and left ventricular systolic dysfunction due to ischaemic heart disease in the CORONA study. (PMID: 18179987)
NP $ NP | Two noun phrases as sisters | Implantation of CRT-D rather than an implantable cardioverter defibrillator in patients with mild heart failure and QRS >/=130 ms reduced the risk of hospitalization for heart failure in MADIT-CRT; (PMID: 19926603)
NP $ NNS | Noun phrase and noun are sisters | Diuretics are indicated for symptomatic patients as needed for volume overload. (PMID: 18441861)
@NP | Noun phrase | So far, nebivolol is the only beta-blocker to have been shown effective in elderly heart failure patients, regardless of their left ventricular ejection fraction. (PMID: 20307222)
PP , NP | Prepositional phrase immediately follows noun phrase | It is suggested that beta-receptor blockade should be added to conventional treatment with digitalis and diuretics in all patients with severe myocardial failure caused by congestive cardiomyopathy. (PMID: 6107090)
PP $ PP | Two prepositional phrases as sisters | This article reviews the physiological changes that occur in the elderly and the treatment approach that can be taken in elderly patients with heart failure. (PMID: 9205849)
NP , PP | Noun phrase immediately follows prepositional phrase | It is suggested that potassium depletion is not a major problem in patients with heart-failure treated with diuretics. (PMID: 62899)
NP $ PP | Noun phrase and prepositional phrase are sisters | Piretanide, a diuretic that acts on the loop of Henle, was used to treat patients with cardiac failure. (PMID: 6990212)
@VP | Verb phrase | Isosorbide dinitrate and hydralazine hydrochloride should be tried in patients who cannot tolerate ACE inhibitors or who have refractory symptoms. (PMID: 7933398)
Syntax Pattern Application: The input sentence is parsed with the Stanford lexical parser using the probabilistic context-free grammar (PCFG) model.[22] The generated parse tree is queried using Tregex patterns to identify and extract the population named entities. Not all the named entities that match the Tregex patterns are population named entities. We used UMLS to obtain a set of 130 population-related concepts belonging to the “patient or disabled group” semantic type. An additional set of 22 terms related to population was manually identified from MEDLINE citations (Supplementary data 1). These concepts and terms are used to filter the population named entities (Figure 2). For named entities extracted with the patterns NP > PP, PP $ NP, PP $ PP and @VP, the algorithm trims the sub-tree from the noun phrase matching population-related concepts or terms. Rarely, more than one pattern is applicable to the same population named entity in a sentence; in such cases, the first matching pattern based on a predetermined order of precedence is used. Named entities that are identified using the above patterns but match selected stop phrases, namely ‘patient education’, ‘patient survival’, ‘patient preference’, ‘patient factors’, ‘patient characteristics’, ‘patient confidentiality’, ‘patient permission’, ‘patient status’, ‘patient selection’, ‘patient level data’ and ‘patient refusal’, are filtered out.
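
The filtering step can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the released implementation: the stop phrases are quoted from the text above, but POPULATION_TERMS is a small hypothetical stand-in for the 130 UMLS concepts and 22 manually identified terms, and the function names are invented for this example.

```python
import re
from typing import FrozenSet, List, Optional

# Stop phrases quoted from the text above.
STOP_PHRASES: FrozenSet[str] = frozenset({
    "patient education", "patient survival", "patient preference",
    "patient factors", "patient characteristics", "patient confidentiality",
    "patient permission", "patient status", "patient selection",
    "patient level data", "patient refusal",
})

# Hypothetical stand-in for the population-related concepts and terms;
# the real lists are in the paper's supplementary data.
POPULATION_TERMS: FrozenSet[str] = frozenset({
    "patient", "patients", "elderly", "women", "men", "children",
})


def is_population_phrase(phrase: str) -> bool:
    """Keep a candidate only if it names a population and is not a stop phrase."""
    text = phrase.lower()
    if any(stop in text for stop in STOP_PHRASES):
        return False
    tokens = set(re.findall(r"[a-z]+", text))
    return bool(tokens & POPULATION_TERMS)


def first_population_phrase(candidates: List[str]) -> Optional[str]:
    """Return the first surviving candidate.

    `candidates` is assumed to be ordered by pattern precedence, so the
    first match wins when several patterns fire on the same sentence.
    """
    for candidate in candidates:
        if is_population_phrase(candidate):
            return candidate
    return None


candidates = [
    "the loop of Henle",           # matched a pattern, no population term
    "patient selection",           # population term, but a stop phrase
    "patients with cardiac failure",
]
print(first_population_phrase(candidates))
```

Substring matching on stop phrases is a simplification chosen for brevity; matching on token boundaries would be a safer choice for a larger vocabulary.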

Rule-based approach with binary classifier

Extraction of population by the rule-based system alone may not be sufficient for sentences with complex tree structures, thereby producing a number of false negatives. Our initial study included only five Tregex patterns: NP > PP, PP $ NP, NP $ NP, NP $ NN and @NP (Table 2). Attempting to decrease the number of false negatives by developing new patterns for specific requirements extracted many incorrect population phrases, thereby producing a number of false positives. Therefore, to make the population extraction algorithm more reliable and accurate, and to accommodate more syntactic patterns without increasing false positives, we designed a binary classifier for preprocessing. The binary classifier determines whether a sentence contains a population named entity or not. If the sentence contains a population named entity, it is sent to the rule-based system described above. Otherwise, the sentence is rejected. This additional layer helped us eliminate many sentences where the probability of recognizing a population named entity is very low. Thus, we were able to concentrate on extracting the population named entity from the sentences that have a higher probability of including it. This allowed us to increase the number of Tregex patterns in our rule-based system from five to ten (Table 2). We used two different types of binary classifiers, namely MaxEnt[14] and Naïve-Bayes[15] classifiers. The system architecture of the binary classifier with the rule-based system is shown in Figure 3.
MaxEnt classifier with Rule-based approach: The dataset collected for validating the population extraction algorithm was split into a training dataset (80%) and a test dataset (20%). Sentences from the training dataset are converted to a list of instances, where each instance is a feature vector. We applied a set of basic filters to the features prior to learning: removal of non-ASCII or Unicode characters, conversion to lower case, removal of stop words, and lemmatization. Our feature set includes all the unigrams and bigrams in the sentences.
Naïve-Bayes classifier with Rule-based approach: As an alternative, we also implemented a Naïve-Bayes classifier. The feature set for training the classifier includes the 50 terms (unigrams and bigrams) with the maximum information gain identified from the training dataset. The frequency of each feature in the training dataset is calculated and used for estimating the probability.
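
The classification stage can be sketched as a small multinomial Naïve-Bayes model over unigram and bigram features. The code below is a simplified illustration, not the MALLET-based system used in this study: the stop-word list and the six training sentences are invented for the example, lemmatization and information-gain feature selection are omitted, and add-one smoothing is an assumption.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Optional, Set, Tuple

# Small assumed stop-word list; the study's actual list is not given.
STOP_WORDS = frozenset({"the", "a", "an", "of", "in", "to", "and", "is",
                        "was", "are", "for", "on", "that"})


def features(sentence: str) -> List[str]:
    """Drop non-ASCII characters, lower-case, remove stop words, add bigrams."""
    ascii_text = sentence.encode("ascii", "ignore").decode("ascii").lower()
    tokens = [t for t in re.findall(r"[a-z0-9]+", ascii_text)
              if t not in STOP_WORDS]
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (assumed)."""

    def __init__(self) -> None:
        self.class_counts: Counter = Counter()
        self.feature_counts: Dict[bool, Counter] = {True: Counter(),
                                                    False: Counter()}
        self.vocabulary: Set[str] = set()

    def fit(self, examples: List[Tuple[str, bool]]) -> "NaiveBayes":
        for sentence, label in examples:
            feats = features(sentence)
            self.class_counts[label] += 1
            self.feature_counts[label].update(feats)
            self.vocabulary.update(feats)
        return self

    def log_score(self, sentence: str, label: bool) -> float:
        total = sum(self.feature_counts[label].values())
        prior = self.class_counts[label] / sum(self.class_counts.values())
        score = math.log(prior)
        for feat in features(sentence):
            if feat in self.vocabulary:  # skip features unseen in training
                count = self.feature_counts[label][feat]
                score += math.log((count + 1) / (total + len(self.vocabulary)))
        return score

    def predict(self, sentence: str) -> bool:
        return self.log_score(sentence, True) > self.log_score(sentence, False)


def two_stage(sentence: str, classifier: NaiveBayes,
              extract: Callable[[str], Optional[str]]) -> Optional[str]:
    """Run the rule-based extractor only on sentences the classifier accepts."""
    return extract(sentence) if classifier.predict(sentence) else None


# Toy training data invented for this sketch (True = has population).
TRAIN = [
    ("Diuretics are indicated for symptomatic patients with volume overload.", True),
    ("Piretanide was used to treat patients with cardiac failure.", True),
    ("Nebivolol was effective in elderly heart failure patients.", True),
    ("Piretanide is a diuretic that acts on the loop of Henle.", False),
    ("Aldosterone blockade reduces sodium retention.", False),
    ("Beta-receptor blockade lowers heart rate.", False),
]

model = NaiveBayes().fit(TRAIN)
with_population = "Rosuvastatin did not reduce mortality in patients with heart failure."
without_population = "Aldosterone blockade lowers sodium retention."
print(model.predict(with_population), model.predict(without_population))
```

The two_stage function shows the gating idea only; in the system described here the extractor would be the Tregex-based rule system rather than an arbitrary callable.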

Evaluation approach and Dataset

We performed an experiment to extract population named entities related to congestive heart failure (CHF) and atrial fibrillation (AFib). We selected these diseases based on statistics from the Centers for Disease Control and Prevention (CDC), which state that heart disease is the leading cause of death in the United States.[25] The evaluation dataset for developing the population named entity extraction algorithms consists of 714 sentences from MEDLINE citations retrieved from SemMedDB (Table 3). Since the goal of our overall research is to apply these algorithms to precision medicine applications in cardiovascular treatment and diagnosis, we focused on sentences related to the diagnosis and treatment of CHF and AFib.
Table 3.

Evaluation Dataset

Dataset | Citations with population | Citations without population
Diagnosis for CHF | 80 | 120
Treatment for CHF | 98 | 102
Diagnosis for AFib | 140 | 60
Treatment for AFib | 56 | 58

Citation extraction from MEDLINE

Our overall strategy aims at retrieving population information from high-quality clinical journals (Figure 4A). Two Boolean queries were built to retrieve articles on systematic reviews (SRs) and randomized controlled trials (RCTs) from MEDLINE (Figure 4B) (Supplementary data 2).

Sentence extraction from SemMedDB

A set of MEDLINE abstracts for a given clinical condition (e.g., treatments and diagnosis for CHF and AFib) is retrieved from SemMedDB for these citations.[16] Our information retrieval approach uses a list of UMLS concept unique identifiers (CUIs) (41 CUIs for CHF and 25 CUIs for AFib) (Supplementary data 3) to query SemMedDB for the sentences. For example, CUIs such as ‘C0018802’, ‘C0264719’, ‘C0264722’, ‘C2039715’ and ‘C2183328’ are related to CHF. Each unit of information retrieved for a condition consists of a PMID, a sentence and a predication, as in Example 1.
Example 1 - PMID: 2539290
Sentence: Enalapril provides significant haemodynamic, symptomatic and clinical improvement when added to maintenance therapy with digitalis and diuretics in patients with congestive heart failure [NYHA (New York Heart Association) classes II to IV].
Predication: Diuretics-TREATS-Congestive Heart Failure
The evaluation dataset was divided into a training dataset (80% of the dataset) and a test dataset (20% of the dataset). A 5-fold cross-validation is run on the training dataset for the binary classifiers. The test dataset is then used for evaluating the performance of each classifier combined with the rule-based system: the MaxEnt classifier with the rule-based system, and the Naïve-Bayes classifier with the rule-based system.
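
The 80/20 split and 5-fold cross-validation described in this section can be sketched in Python as follows. The sketch makes assumptions the text does not state: sentences are shuffled with a fixed seed before splitting, folds are formed by simple striding rather than stratification, and the function names are illustrative.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def train_test_split(items: Sequence[int], train_fraction: float = 0.8,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle with a fixed seed (an assumption) and cut at train_fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def k_folds(items: Sequence[int],
            k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, validation) pairs; each item is validated exactly once."""
    for i in range(k):
        validation = list(items[i::k])
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, validation


# 714 sentence indices, matching the size of the evaluation dataset.
sentence_ids = list(range(714))
train_ids, test_ids = train_test_split(sentence_ids)
fold_sizes = [len(validation) for _, validation in k_folds(train_ids)]
print(len(train_ids), len(test_ids), fold_sizes)
```

With 714 sentences this gives 571 training and 143 test sentences, and the held-out test portion is never seen during cross-validation.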

Results

Table 4 shows the performance of the system on the evaluation dataset consisting of sentences related to the diagnosis and treatment of CHF and AFib. The standard metrics of precision, recall and F-score were used for evaluating system performance. The rule-based system alone achieved an F-score of 0.64. The two-stage approach, which first classifies the input sentence as having or not having population information and then extracts the population named entity, achieved an F-score of 0.81 with the MaxEnt classifier and rule-based system, and an F-score of 0.87 with the Naïve-Bayes classifier and rule-based system.
Table 4.

System performance

System | Precision | Recall | F-score
Rule-based system | 0.67 | 0.62 | 0.64
MaxEnt classifier + Rule-based system | 0.87 | 0.76 | 0.81
Naïve-Bayes classifier + Rule-based system | 0.90 | 0.83 | 0.87
Preprocessing with the MaxEnt or Naïve-Bayes classifier passed only the sentences likely to contain population information to the rule-based system. Relative to the rule-based system alone, this improved the F-score by 17 percentage points with the MaxEnt classifier (0.81 vs. 0.64) and by 23 percentage points with the Naïve-Bayes classifier (0.87 vs. 0.64). Table 5 shows the performance of the binary classifiers, i.e. the MaxEnt classifier and the Naïve-Bayes classifier, on classifying sentences with or without population information.
Table 5.

Performance of binary classifiers

System | Precision | Recall | F-score
MaxEnt classifier | 0.87 | 0.82 | 0.84
Naïve-Bayes classifier | 0.89 | 0.91 | 0.90
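
The F-scores in Tables 4 and 5 are the harmonic mean of precision and recall. A short Python check, using only the rounded values printed in the tables (the function and variable names are illustrative), recomputes them:

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as printed in Tables 4 and 5.
reported = {
    "Rule-based system": (0.67, 0.62),
    "MaxEnt + Rule-based": (0.87, 0.76),
    "Naive-Bayes + Rule-based": (0.90, 0.83),
    "MaxEnt classifier": (0.87, 0.82),
    "Naive-Bayes classifier": (0.89, 0.91),
}

recomputed = {name: round(f_score(p, r), 2) for name, (p, r) in reported.items()}
for name, value in recomputed.items():
    print(f"{name}: {value:.2f}")
```

Recomputing from the rounded inputs reproduces the reported values except for the Naïve-Bayes classifier with the rule-based system, where it gives 0.86 rather than 0.87; a difference of this size is consistent with the precision and recall having been rounded to two decimals.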

Discussion

Existing systems for population extraction

Table 6 lists the performance of our system and of other systems available for similar tasks, together with the techniques and datasets each system used for extracting the population named entity from a given sentence. None of these systems are available as open source, except the RIDeM tool.[12]
Table 6.

Approach and dataset used by various systems

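System | Dataset | Sentence Classification Model | Sentence Classification F-score | Population Extraction Model | Population Extraction F-score | Remarks
Xu et al. [26] | Abstracts only from PubMed | HMM + NLP techniques | 92% | Classification + parse tree (Stanford) | 0.51 | -
Zhu et al. [27] | - | - | - | Partially matched using MetaMap | 0.84 | -
Zhu et al. [27] | - | | | Partially matched using NLP-based method | 0.83 |
Zhu et al. [27] | - | | | Exact matched using MetaMap | 0.42 |
Zhu et al. [27] | - | | | Exact matched using NLP-based method | 0.64 |
Demner Fushman and Lin [8] | MEDLINE abstracts | - | - | Baseline | 0.53 | Returns a set of results
Demner Fushman and Lin [8] | MEDLINE abstracts | | | Extractor | 0.80 |
Zhao et al. | Reduced dataset | Mallet CRF | | Independent | 0.78 | 4 different methods used
Zhao et al. | Reduced dataset | | | Sentence-First | 0.78 |
Zhao et al. | Reduced dataset | | | Word-First | 0.78 |
Zhao et al. | Reduced dataset | | | Joint | 0.75 |
Zhao et al. | Full dataset | | | Independent | 0.64 |
Zhao et al. | Full dataset | | | Sentence-First | 0.64 |
Zhao et al. | Full dataset | | | Word-First | 0.63 |
Zhao et al. | Full dataset | | | Joint | 0.60 |
Kelly [6] | Abstracts from PubMed | - | - | Partial match | 0.877 | Dependency parse
Kelly [6] | Abstracts from PubMed | | | Exact match | 0.601 | Regular expressions
RIDeM Tool [12] | Tested on our dataset | - | - | Original | 0.63 | Upper bound on precision
RIDeM Tool [12] | Tested on our dataset | | | With add-ons | 0.766 |
Our current system | Tested on our dataset | - | - | Rule-based | 0.64 | -
Our current system | Tested on our dataset | MaxEnt | 0.844 | MaxEnt + Rule-based | 0.81 |
Our current system | Tested on our dataset | Naïve-Bayes | 0.9 | Naïve-Bayes + Rule-based | 0.87 |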
The F-scores are not directly comparable since the datasets are different. We were able to compare our tool with the RIDeM tool developed by Demner Fushman and Thoma on our dataset. The F-score of the RIDeM tool[12] on our dataset (76.6%) is comparable to its F-score on their own dataset (80.0%). This to some extent supports the validity of our dataset. The F-score of our best approach (87%) compares favorably with the other approaches. However, all these approaches would have to be evaluated on our dataset before drawing conclusions.

Error Analysis

Based on an analysis of 20 randomly selected sentences, the following are the major reasons for errors in population named entity recognition, apart from classification errors.
The parse tree can be too complex to extract the population phrase with Tregex patterns, i.e. the patient and disease terms are in different sub-trees (65% of errors). For example, in the sentence “The left ventricular partitioning device appears to be relatively safe and potentially effective in the treatment for patients with heart failure and a prior anterior myocardial infarction” (PMID: 22607859), our system extracts “patients with heart failure” as the population.
In some cases the Stanford parser returns an incorrect parse tree for the sentence, with a wrong structure or wrong tags assigned to nodes. Tregex patterns are then unable to extract the population phrase successfully (15% of errors). For example, in the sentence “Cardiac resynchronization therapy produces both short-term hemodynamic and long-term symptomatic/mortality benefits in symptomatic heart failure patients with a QRS duration >120 ms” (PMID: 19822812), our system extracts "in symptomatic heart failure patients with a QRS duration >120 ms" as the population.
Some population phrases could be extracted only by introducing new Tregex patterns. However, including additional patterns also extracted incorrect population information, which increased the number of false positives to a large extent and lowered the precision of the system (20% of errors). Therefore, we decided to limit our use of Tregex patterns to ten (Table 2).
We believe sentence simplification, which has been shown to improve the accuracy of both parsers[31] and information extraction,[32] might improve the accuracy of population extraction.

Applying the system for evidence summarization

We have independently studied the efficacy of a preliminary population extraction system that uses only the rule-based component in summarizing individualized evidence for clinicians; the goal was to automatically generate clinically useful sentences that provide a specific recommendation for an intervention (e.g. medication treatment) employed with a specific patient population.[29] We found that such an approach is feasible and that it is possible to classify such clinically actionable sentences both from MEDLINE abstracts and from clinical knowledge systems such as UpToDate.[30] The gold standard used for testing the population extraction algorithm for evidence summarization is different from the one presented here. It consists of 4,499 sentences from UpToDate documents on the treatment of six chronic conditions, namely coronary artery disease, hypertension, depression, heart failure, diabetes mellitus, and prostate cancer. The system achieved 90.5% precision, 96.7% recall and 93.6% F-score when tested on the gold standard generated from UpToDate documents. The gold standard is available for sharing upon obtaining permission from UpToDate.

Conclusions and Future Work

Our work aimed to extract population information pertaining to evidence, to support the retrieval of citations and ultimately evidence-based precision medicine. We used three different methods: a rule-based system, a MaxEnt classifier with the rule-based system, and a Naïve-Bayes classifier with the rule-based system. In all three methods, we used the rule-based system to extract the population named entities. The F-score of the best classifier is 90% and that of the whole system is 87%. We are optimistic about the use of our system to advance precision medicine, especially in being able to deliver individualized evidence summaries at the point of care.
  17 in total

1.  The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text.

Authors:  Thomas C Rindflesch; Marcelo Fiszman
Journal:  J Biomed Inform       Date:  2003-12       Impact factor: 6.317

2.  The Unified Medical Language System (UMLS): integrating biomedical terminology.

Authors:  Olivier Bodenreider
Journal:  Nucleic Acids Res       Date:  2004-01-01       Impact factor: 16.971

Review 3.  Systems biology, proteomics, and the future of health care: toward predictive, preventative, and personalized medicine.

Authors:  Andrea D Weston; Leroy Hood
Journal:  J Proteome Res       Date:  2004 Mar-Apr       Impact factor: 4.466

4.  Extracting subject demographic information from abstracts of randomized clinical trial reports.

Authors:  Rong Xu; Yael Garten; Kaustubh S Supekar; Amar K Das; Russ B Altman; Alan M Garber
Journal:  Stud Health Technol Inform       Date:  2007

5.  Automatic extracting of patient-related attributes: disease, age, gender and race.

Authors:  Huijia Zhu; Yuan Ni; Peng Cai; Zhaoming Qiu; Feng Cao
Journal:  Stud Health Technol Inform       Date:  2012

6.  A new initiative on precision medicine.

Authors:  Francis S Collins; Harold Varmus
Journal:  N Engl J Med       Date:  2015-01-30       Impact factor: 91.245

7.  Exploiting classification correlations for the extraction of evidence-based practice information.

Authors:  Jin Zhao; Praveen Bysani; Min-Yen Kan
Journal:  AMIA Annu Symp Proc       Date:  2012-11-03

8.  A system for extracting study design parameters from nutritional genomics abstracts.

Authors:  Cassidy Kelly; Hui Yang
Journal:  J Integr Bioinform       Date:  2013-04-04

Review 9.  Predictive, personalized, preventive, participatory (P4) cancer medicine.

Authors:  Leroy Hood; Stephen H Friend
Journal:  Nat Rev Clin Oncol       Date:  2011-03       Impact factor: 66.675

Review 10.  Deep phenotyping for precision medicine.

Authors:  Peter N Robinson
Journal:  Hum Mutat       Date:  2012-05       Impact factor: 4.878
