Xiao Fu, Riza Batista-Navarro, Rafal Rak, Sophia Ananiadou.
Abstract
BACKGROUND: Chronic obstructive pulmonary disease (COPD) is a life-threatening lung disorder whose recent prevalence has led to an increasing burden on public healthcare. Phenotypic information in electronic clinical records is essential in providing suitable personalised treatment to patients with COPD. However, as phenotypes are often "hidden" within free text in clinical records, clinicians could benefit from text mining systems that facilitate their prompt recognition. This paper reports on a semi-automatic methodology for producing a corpus that can ultimately support the development of text mining tools that, in turn, will expedite the process of identifying groups of COPD patients.
Keywords: Automatic annotation workflows; Chronic obstructive pulmonary disease; Corpora for clinical text mining; Corpus annotation; Ontology linking; Phenotype curation
Year: 2015 PMID: 25789153 PMCID: PMC4364458 DOI: 10.1186/s13326-015-0004-6
Source DB: PubMed Journal: J Biomed Semantics
Figure 1. Distribution of COPD-relevant articles over COPD-focussed journals. A total of 974 full-text articles were retrieved from 10 journals in the PubMed OpenAccess subset.
The proposed typology for capturing COPD phenotypes
| Type | Description |
|---|---|
| 1) Problem | an overall category for any COPD indications of concern |
| a) MedicalCondition* | any disease or medical condition; includes COPD comorbidities |
| b) RiskFactor* | a phenotype signifying a patient’s increased chances of having COPD |
| i) SignOrSymptom* | an observable irregularity manifested by a COPD patient |
| ii) IndividualBehaviour* | a patient’s habits leading to susceptibility of having COPD |
| iii) TestOrMeasureResult* | findings based on COPD-relevant examinations |
| 2) Treatment | any medication, therapy or program for treating COPD |
| 3) TestOrMeasure | an overall category for any COPD-relevant examinations or measures/parameters |
| a) RadiologicalTest | any of the radiological tests for detecting COPD |
| b) MicrobiologicalTest | an examination of a COPD-relevant specimen |
| c) PhysiologicalTest | a measurement of a COPD patient’s capacity to exercise |
Types marked with an asterisk (*) were adapted from the PhenoCHF scheme.
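The hierarchy in the typology can be written down as a simple parent map, which makes it easy to ask whether a fine-grained type rolls up to a coarser one. The following is a minimal Python sketch; the names `PARENT`, `ancestors` and `is_a` are illustrative assumptions and are not taken from the paper or from Argo.

```python
# Parent map for the typology; top-level types map to None.
# The nesting follows the enumerators in the table (1, a, i, ...).
PARENT = {
    "Problem": None,
    "MedicalCondition": "Problem",
    "RiskFactor": "Problem",
    "SignOrSymptom": "RiskFactor",
    "IndividualBehaviour": "RiskFactor",
    "TestOrMeasureResult": "RiskFactor",
    "Treatment": None,
    "TestOrMeasure": None,
    "RadiologicalTest": "TestOrMeasure",
    "MicrobiologicalTest": "TestOrMeasure",
    "PhysiologicalTest": "TestOrMeasure",
}


def ancestors(type_name):
    """Return the chain of parent types, nearest parent first."""
    chain = []
    parent = PARENT[type_name]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def is_a(type_name, candidate):
    """True if type_name is candidate itself or one of its descendants."""
    return type_name == candidate or candidate in ancestors(type_name)
```

With this map, `ancestors("SignOrSymptom")` walks up through `RiskFactor` to `Problem`, so an annotation of the finer type can also be counted under either coarser category.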
Examples of phenotypic information represented using our proposed annotation scheme
Figure 2. Our semi-automatic annotation workflow in Argo.
Figure 3. The user interface for linking mentions to ontologies.
Figure 4. The Manual Annotation Editor’s graphical user interface. The article excerpt shown is annotated using our proposed scheme for finer-grained COPD phenotype annotations.
Number of unique concepts for each type, based on the nine manually annotated articles
| Type | Unique concepts |
|---|---|
| Treatment | 430 |
| RiskFactor | 415 |
| MedicalCondition | 371 |
| TestOrMeasure | 282 |
| Drug | 192 |
| AnatomicalConcept | 96 |
| Quality | 59 |
| Protein | 40 |
Evaluation of annotations automatically generated by the text mining-assisted workflow against gold standard data
| Type | P | R | F1 | P | R | F1 |
|---|---|---|---|---|---|---|
| AnatomicalConcept | 0.1923 | 0.7527 | 0.3063 | 0.2814 | 0.9038 | 0.4292 |
| Drug | 0.5861 | 0.2744 | 0.3738 | 0.7921 | 0.6463 | 0.7118 |
| MedicalCondition | 0.0290 | 0.2842 | 0.2868 | 0.3697 | 0.6313 | 0.4663 |
| TestOrMeasure | 0.1425 | 0.0680 | 0.0920 | 0.1914 | 0.1039 | 0.1347 |
| Treatment | 0.3080 | 0.1494 | 0.2012 | 0.4688 | 0.4015 | 0.4325 |
| Micro-average | 0.2670 | 0.2283 | 0.2462 | 0.4050 | 0.5243 | 0.4570 |
| Macro-average | 0.3037 | 0.3057 | 0.3047 | 0.4207 | 0.5374 | 0.4719 |
Results are reported for only nine full-text papers.
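To see how the averaged rows relate to the per-type rows, the sketch below recomputes F1 as the harmonic mean of precision and recall, and the macro-average as the unweighted mean over types. It assumes that the three values in each column group are precision, recall and F1, and it uses the last three columns of the table as input. The function and variable names are illustrative, and the observation that the macro-averaged F1 is derived from the macro-averaged precision and recall is an inference from the reported numbers, not something the source states.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (precision, recall) per type, copied from the last three columns of the table.
rows = {
    "AnatomicalConcept": (0.2814, 0.9038),
    "Drug": (0.7921, 0.6463),
    "MedicalCondition": (0.3697, 0.6313),
    "TestOrMeasure": (0.1914, 0.1039),
    "Treatment": (0.4688, 0.4015),
}

# Per-type F1, rounded to the four decimals used in the table.
per_type_f1 = {name: round(f1(p, r), 4) for name, (p, r) in rows.items()}

# Macro-average: every type counts equally, regardless of how many mentions it has.
macro_p = sum(p for p, _ in rows.values()) / len(rows)
macro_r = sum(r for _, r in rows.values()) / len(rows)
macro_f1 = f1(macro_p, macro_r)

print(per_type_f1)
print(round(macro_p, 4), round(macro_r, 4), round(macro_f1, 4))
```

Rounded to four decimals, these values match the corresponding Macro-average entries (0.4207, 0.5374, 0.4719). The micro-average cannot be recomputed the same way, because it pools true-positive, false-positive and false-negative counts across types, and those counts are not given here.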
Results of 10-fold cross validation of concept recognisers, using exact matching
| Type | P | R | F1 | P | R | F1 |
|---|---|---|---|---|---|---|
| AnatomicalConcept | 0.2361 | 0.6617 | 0.3428 | 0.7602 | 0.4990 | 0.5912 |
| Drug | 0.7318 | 0.2161 | 0.3283 | 0.8576 | 0.4499 | 0.5873 |
| MedicalCondition | 0.3986 | 0.2436 | 0.3010 | 0.8510 | 0.4590 | 0.5932 |
| TestOrMeasure | 0.0766 | 0.0182 | 0.0289 | 0.6850 | 0.3190 | 0.4332 |
| Treatment | 0.4330 | 0.1021 | 0.1635 | 0.8276 | 0.3458 | 0.4829 |
| Micro-average | 0.3305 | 0.1776 | 0.2310 | 0.7929 | 0.3970 | 0.5291 |
| Macro-average | 0.3752 | 0.2483 | 0.2988 | 0.7963 | 0.4145 | 0.5452 |
Performance is compared with that of the components utilised in the text mining-assisted workflow.
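The source does not describe how the folds were built. As a generic illustration of 10-fold cross validation, the Python sketch below shuffles item indices once with a fixed seed and deals them into ten near-equal folds, each of which serves once as the test set. The function name `k_fold_indices`, the seed, and the item count of 381 are assumptions made for the example only.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the folds reproducible
    folds = [indices[i::k] for i in range(k)]  # deal round-robin into k folds
    for i, test in enumerate(folds):
        # Training data is everything outside the current test fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical item count, used only to exercise the function.
splits = list(k_fold_indices(381))
```

Each item lands in exactly one test fold, so every annotation contributes to the evaluation once. Per-fold scores are then typically averaged, which is one reason an averaged F1 need not equal the harmonic mean of the averaged precision and recall.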
Results of evaluation using a fixed split over 381 paragraphs (training set: 75% or 286 paragraphs; held-out set: 25% or 95 paragraphs), using exact matching
| Type | P | R | F1 | P | R | F1 |
|---|---|---|---|---|---|---|
| AnatomicalConcept | 0.2602 | 0.6145 | 0.3656 | 0.8000 | 0.4314 | 0.5605 |
| Drug | 0.6885 | 0.1900 | 0.2979 | 0.7966 | 0.4196 | 0.5497 |
| MedicalCondition | 0.4494 | 0.2492 | 0.3206 | 0.8673 | 0.3899 | 0.5380 |
| TestOrMeasure | 0.0250 | 0.0041 | 0.0070 | 0.6719 | 0.2966 | 0.4115 |
| Treatment | 0.4111 | 0.0847 | 0.1404 | 0.8400 | 0.2903 | 0.4315 |
| Micro-average | 0.3735 | 0.1614 | 0.2254 | 0.8034 | 0.3552 | 0.4926 |
| Macro-average | 0.3669 | 0.2285 | 0.2816 | 0.7952 | 0.3656 | 0.5009 |
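The 75%/25% split over 381 paragraphs gives 286 training and 95 held-out paragraphs once the cut point is rounded to the nearest integer. A minimal Python sketch of such a fixed split follows. Whether the authors shuffled the paragraphs, and which seed or rounding rule they used, is not stated in the source; the function name `fixed_split`, the seed and the integer stand-ins for paragraphs are assumptions.

```python
import random


def fixed_split(items, train_fraction=0.75, seed=0):
    """Shuffle once with a fixed seed and cut into (train, held_out)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))  # 0.75 * 381 = 285.75, rounds to 286
    return shuffled[:cut], shuffled[cut:]


paragraphs = list(range(381))  # integer stand-ins for the 381 paragraphs
train, held_out = fixed_split(paragraphs)
print(len(train), len(held_out))
```

Because the split is fixed, each score is computed once on the held-out set, so the reported F1 values in this table equal the harmonic mean of the corresponding precision and recall.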