Nigel Collier, Mai-vu Tran, Hoang-quynh Le, Quang-Thuy Ha, Anika Oellrich, Dietrich Rebholz-Schuhmann.
Abstract
The identification of phenotype descriptions in the scientific literature, case reports and patient records is a rewarding task for biomedical text mining. Any progress will support knowledge discovery and linkage to other resources. However, because phenotype descriptions vary widely, a number of challenges remain in their identification and semantic normalisation before they can be fully exploited for research purposes. This paper presents novel techniques for identifying potential complex phenotype mentions by exploiting a hybrid model based on machine learning, rules and dictionary matching. A systematic study is made of how to combine sequence labels from these modules, as well as of the merits of various ontological resources. We evaluated our approach on a subset of Medline abstracts cited by the Online Mendelian Inheritance in Man database related to auto-immune diseases. Using partial matching, the best micro-averaged F-score for phenotypes and five other entity classes was 79.9%. A best performance of 75.3% was achieved for phenotype candidates using all semantic resources. We observed the advantage of using SVM-based learn-to-rank for sequence label combination over maximum entropy and a priority list approach. The results indicate that the identification of simple entity types such as chemicals and genes is robustly supported by single semantic resources, whereas phenotypes require combinations. Altogether we conclude that our approach coped well with the compositional structure of phenotypes in the auto-immune domain.
Year: 2013 PMID: 24155869 PMCID: PMC3796529 DOI: 10.1371/journal.pone.0072965
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Representation of phenotypes in textual narratives and as pre-composed and post-composed terms. Image of Mus musculus courtesy of George Shuklin, published at Wikimedia Commons.
Figure 2. Example tagging of phenotypes along with features from external vocabularies and ontologies.
Figure 3. The stages of our experimental phenotype candidate system.
Referential semantics and scoping of mentions by entity type.
| Entity type | Specific reference | Generic reference | Underspecified reference | Modifiers | Conjunctions/disjunction | Processes | Negation |
| GG | Yes | Yes | No | No | Yes | No | No |
| DS | Yes | Yes | No | No | Yes | No | No |
| CD | Yes | Yes | No | No | Yes | No | No |
| OR | Yes | Yes | No | No | Yes | No | No |
| AN | Yes | Yes | No | Yes | Yes | No | No |
| PH | Yes | Yes | No | Yes | Yes | Yes | Yes |
Notes on annotation:
Where there is elision of the head, e.g. [IA/H5 virus], then annotate the whole expression. Otherwise annotate each expression separately, e.g. [IA virus] and [H5 virus].
Markable expressions include specific people, e.g. [Jane], as well as definite noun phrases such as the [24-year-old man].
Quantitative modifiers are included, e.g. [both kidneys], as well as spatial modifiers, e.g. [left collar bone].
When modifiers are considered to be part of the disease name they are included, e.g. [highly pathogenic avian influenza], [end-stage renal disease].
However, we exclude finite verb forms, infinitive verb forms with 'to', verbs in a progressive or perfect aspect, verb phrases, clauses or sentences, and any phrase with a relative clause or complement clause.
If the negation appears in a noun phrase with an anatomical entity then we generally allow it, e.g. [absent ankle reflexes], [no left kidney].
Qualitative modifiers are included. For example, physical components: [black hair], underspecified ranges: [normal height], locational modifiers: [low set ears], and level modifiers: [quite small fingers].
Auto-immune diseases from OMIM represented in the Phenominer A corpus.
| Disease | Organism |
| Auto immune thyroid disease | human |
| Auto immune skin diseases | human |
| Immune mediated diseases | human |
| Immuno-mediated gastrointestinal diseases | human |
| Celiac's disease/Celiac disease | human |
| Grave's disease/Grave disease | human |
| Hashimoto's disease/Hashimoto disease | human |
| Crohn's disease/Crohn disease | human |
| Addison's disease/Addison disease | human |
| Type 1 diabetes | human |
| Rheumatoid arthritis | human |
| Multiple sclerosis | human |
| Systemic lupus erythematosus | human |
| Asthma | human |
| Familial psoriasis | human |
| Auto immune encephalomyelitis | mouse |
| Inflammatory arthritis | mouse |
| Histamine sensitization | mouse |
| Mouse lupus | mouse |
Descriptive statistics for entities in the Phenominer A corpus.
| Entity | # Entities | # Unique Entities | Average length of entity |
| PH | 472 | 393 | 3.0 |
| OR | 764 | 402 | 1.8 |
| DS | 875 | 270 | 1.9 |
| GG | 1611 | 885 | 1.7 |
| AN | 188 | 132 | 2.2 |
| CD | 48 | 31 | 1.4 |
Figure 4. Hypothesis resolution using a priority list.
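As a rough illustration of what priority-list resolution can look like, the sketch below picks, for each token, the first non-`O` label proposed by the labeling modules in a fixed order. The module names, the `O` outside label and the ordering in `PRIORITY` are assumptions made for this example; the record does not state the order the authors used.

```python
from typing import Dict, List

# Hypothetical module order: earlier entries win when labels disagree.
PRIORITY = ["rule", "dictionary", "ml"]

def resolve(hypotheses: Dict[str, List[str]],
            priority: List[str] = PRIORITY) -> List[str]:
    """Combine per-token labels from several modules into one sequence."""
    n_tokens = len(next(iter(hypotheses.values())))
    resolved = []
    for i in range(n_tokens):
        label = "O"  # default: token is outside any entity
        for module in priority:
            candidate = hypotheses[module][i]
            if candidate != "O":
                label = candidate  # first module with an opinion wins
                break
        resolved.append(label)
    return resolved

# Toy input: three tokens, three modules (labels are illustrative).
example = {
    "ml":         ["O",  "GG", "O"],
    "rule":       ["PH", "PH", "O"],
    "dictionary": ["O",  "O",  "DS"],
}
print(resolve(example))  # ['PH', 'PH', 'DS']
```

A fixed order is simple and transparent, but it cannot weigh evidence per token, which is the gap that the learned resolution methods are meant to address.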
Figure 5. Handling ambiguous versus unambiguous cases.
Figure 6. Hypothesis resolution using maximum entropy with beam search (ME+BS).
Features used by the Maximum Entropy model for hypothesis resolution.
| No. | Feature | Example |
| 1 | Current word | |
| 2 | Context words | |
| 3 | ME+BS labels | |
| 4 | Rule matching labels | |
| 5 | PH dictionary labels | |
| 6 | DS dictionary labels | |
| 7 | CD dictionary labels | |
| 8 | AN dictionary labels | |
| 9 | GG dictionary labels | |
Figure 7. Hypothesis resolution using support vector machine and learn to rank (SVM+LTR).
Defining the test metrics.
| System class | Gold standard class: True | Gold standard class: False |
| True | TP | FP (Type 1 error) |
| False | FN (Type 2 error) | TN |
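The counts in the table above lead to precision, recall and F-score in the usual way, and a micro-average pools the counts over all classes before computing the score. The sketch below shows this arithmetic; the `Counts` type and the numbers in the usage example are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class Counts:
    tp: int = 0  # true positives
    fp: int = 0  # false positives (Type 1 errors)
    fn: int = 0  # false negatives (Type 2 errors)

def precision(c: Counts) -> float:
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0

def recall(c: Counts) -> float:
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0

def f_score(c: Counts) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0

def micro_average(per_class: Dict[str, Counts]) -> Counts:
    """Pool TP/FP/FN over all classes; score the pooled counts."""
    total = Counts()
    for c in per_class.values():
        total.tp += c.tp
        total.fp += c.fp
        total.fn += c.fn
    return total

# Made-up counts for two classes.
per_class = {"PH": Counts(tp=8, fp=2, fn=2), "GG": Counts(tp=2, fp=0, fn=2)}
pooled = micro_average(per_class)
print(round(f_score(per_class["PH"]), 3), round(f_score(pooled), 3))
```

Because the counts are pooled, frequent classes dominate a micro-averaged score, which is worth keeping in mind when class sizes differ as much as they do in this corpus.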
Performance of named entity recognition using partial matching, with ME+BS in the machine learning labeler and the priority list in the resolution module.
| J | U | H | M | G | L | F | P | C | B | PH | OR | AN | GG | CD | DS | ALL |
| − | + | + | + | + | + | + | + | + | + | 73.7 | 75.6 | 76.2 | | 78.9 | 74.2 | 68.8 |
| + | − | + | + | + | + | + | + | + | + | | 72.1 | 76.8 | 83.2 | 78.7 | | 73.1 |
| + | + | − | + | + | + | + | + | + | + | | 74.0 | 77.1 | 84.8 | 80.4 | 73.6 | 73.7 |
| + | + | + | − | + | + | + | + | + | + | | 75.2 | 75.6 | 85.0 | 80.4 | 73.2 | 72.1 |
| + | + | + | + | − | + | + | + | + | + | 74.6 | 75.4 | 77.1 | | 80.4 | 74.3 | 78.9 |
| + | + | + | + | + | − | + | + | + | + | 73.2 | | 76.7 | 85.2 | 79.3 | 73.8 | 77.4 |
| + | + | + | + | + | + | − | + | + | + | 74.9 | 75.4 | | 85.2 | 80.4 | 74.3 | 77.1 |
| + | + | + | + | + | + | + | − | + | + | | 75.4 | 77.1 | 85.2 | 80.4 | 74.3 | 79.1 |
| + | + | + | + | + | + | + | + | − | + | 74.9 | 75.4 | 77.1 | 85.2 | | 74.3 | 75.2 |
| + | + | + | + | + | + | + | + | + | − | 74.9 | 75.4 | | 85.2 | 80.4 | 74.3 | 79.1 |
| + | + | + | + | + | + | + | + | + | + | 74.9 | 75.4 | 77.1 | 85.2 | 80.4 | 74.3 | 79.2 |
Each horizontal row shows a combination of features and the associated F-scores for each class on test data. ALL shows micro-averaged F-score. Key to external resources: J: JNLPBA model, U: UMLS and MetaMap, H: Human Phenotype Ontology, M: Mammalian Phenotype Ontology, G: Gene Dictionary from NCBI, L: Linnaeus, F: Foundation Model of Anatomy, P: Phenotypic Trait Ontology, C: Jochem's dictionary, B: Brenda Tissue Ontology.
Performance of named entity recognition using exact matching for ME+BS in machine learning labeler and priority list in resolution module.
| J | U | H | M | G | L | F | P | C | B | PH | OR | AN | GG | CD | DS | ALL |
| − | + | + | + | + | + | + | + | + | + | 36.0 | 61.3 | 58.0 | 48.5 | 71.3 | 55.3 | 50.1 |
| + | − | + | + | + | + | + | + | + | + | 35.3 | 60.4 | 58.0 | 57.1 | 71.2 | 49.4 | 52.2 |
| + | + | − | + | + | + | + | + | + | + | 33.5 | 58.2 | 58.0 | 56.4 | 71.3 | 54.3 | 53.0 |
| + | + | + | − | + | + | + | + | + | + | 30.0 | 57.4 | 57.4 | 58.7 | 71.3 | 53.7 | 52.7 |
| + | + | + | + | − | + | + | + | + | + | 36.0 | 61.3 | 58.0 | 58.2 | 71.3 | 55.3 | 54.4 |
| + | + | + | + | + | − | + | + | + | + | 35.4 | 35.6 | 57.6 | 59.2 | 70.8 | 55.0 | 53.2 |
| + | + | + | + | + | + | − | + | + | + | 36.3 | 61.3 | 39.2 | 59.2 | 71.3 | 55.3 | 54.5 |
| + | + | + | + | + | + | + | − | + | + | 35.5 | 61.3 | 58.0 | 59.2 | 71.3 | 55.3 | 55.4 |
| + | + | + | + | + | + | + | + | − | + | 36.3 | 61.3 | 58.0 | 59.2 | 38.4 | 55.3 | 55.3 |
| + | + | + | + | + | + | + | + | + | − | 36.3 | 61.3 | 56.9 | 59.2 | 71.3 | 55.3 | 55.3 |
| + | + | + | + | + | + | + | + | + | + | 36.3 | 61.3 | 58.0 | 59.2 | 71.3 | 55.3 | 55.4 |
Each horizontal row shows a combination of features and the associated F-scores for each class on test data. ALL shows micro-averaged F-score. Key to external resources: J: JNLPBA model, U: UMLS and MetaMap, H: Human Phenotype Ontology, M: Mammalian Phenotype Ontology, G: Gene Dictionary from NCBI, L: Linnaeus, F: Foundation Model of Anatomy, P: Phenotypic Trait Ontology, C: Jochem's dictionary, B: Brenda Tissue Ontology.
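The gap between the partial-matching and exact-matching tables comes from how a predicted mention is counted as correct. The sketch below contrasts the two criteria under an assumed definition: exact matching requires identical boundaries and class, while partial matching accepts any token overlap with the same class. The `Mention` type and the overlap rule are assumptions for illustration; this record does not give the authors' precise definition of a partial match.

```python
from typing import Callable, List, NamedTuple

class Mention(NamedTuple):
    start: int   # first token index (inclusive)
    end: int     # last token index (exclusive)
    label: str   # entity class, e.g. "PH"

def exact_match(pred: Mention, gold: Mention) -> bool:
    return pred == gold

def partial_match(pred: Mention, gold: Mention) -> bool:
    # Same class and at least one shared token (assumed criterion).
    same_class = pred.label == gold.label
    overlaps = pred.start < gold.end and gold.start < pred.end
    return same_class and overlaps

def count_true_positives(predicted: List[Mention], gold: List[Mention],
                         match: Callable[[Mention, Mention], bool]) -> int:
    """Greedy one-to-one matching: each gold mention is used at most once."""
    unmatched = list(gold)
    tp = 0
    for p in predicted:
        for g in unmatched:
            if match(p, g):
                unmatched.remove(g)
                tp += 1
                break
    return tp

gold = [Mention(0, 3, "PH"), Mention(5, 6, "GG")]
pred = [Mention(1, 3, "PH"), Mention(5, 6, "GG")]  # first span is too short
print(count_true_positives(pred, gold, exact_match),
      count_true_positives(pred, gold, partial_match))  # 1 2
```

Multi-token classes are the ones most exposed to boundary errors, which is consistent with PH, the class with the longest average mention length, showing the largest drop between the two tables.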
Figure 8. Statistical significance tests for differences in performance using approximate randomization on resource contributions.
The entries in cells indicate that the two systems are significantly different in F-scores. AR: All resources, J: JNLPBA model, U: UMLS and MetaMap, H: Human Phenotype Ontology, M: Mammalian Phenotype Ontology, G: Gene Dictionary from NCBI, L: Linnaeus, F: Foundation Model of Anatomy, P: Phenotypic Trait Ontology, C: Jochem's dictionary, B: Brenda Tissue Ontology, -: No significant difference. Significance is decided at p ≤ 0.05.
Performance of named entity recognition using Priority List (PL), ME plus beam search (ME+BS) and SVM learn-to-rank (SVM+LTR).
| NE class | PL: P | PL: R | PL: F | ME+BS: P | ME+BS: R | ME+BS: F | SVM+LTR: P | SVM+LTR: R | SVM+LTR: F |
| PH | 73.7 | 76.1 | 74.9 | 73.3 | 68.2 | 70.7 | 74.3 | 76.4 | 75.3 |
| GG | 87.0 | 83.5 | 85.2 | 84.7 | 84.0 | 84.4 | 86.8 | 85.0 | |
| OR | 72.8 | 78.1 | 75.4 | 62.1 | 65.9 | 63.9 | 70.2 | 77.2 | 73.5 |
| CD | 79.6 | 81.3 | 80.4 | 74.2 | 71.6 | 72.9 | 80.5 | 81.4 | |
| AN | 72.4 | 82.5 | 77.1 | 69.4 | 71.6 | 70.5 | 75.6 | 80.1 | |
| DS | 75.8 | 72.9 | 74.3 | 71.9 | 70.4 | 71.1 | 73.2 | 71.6 | 72.4 |
| ALL | - | - | 79.2 | - | - | 74.9 | - | - | 79.9 |
Each horizontal row shows Precision, Recall and F-score performance for a class using alternative methods. ALL shows micro-averaged F-score.
Statistical significance tests for differences in performance using approximate randomization on Resolution methods.
| | Priority list | ME+BS |
| SVM LTR | GG, OR | PH, GG, OR, AN, DS |
| Priority list | | PH, GG, OR, AN, DS |
The entries in cells indicate that the two systems are significantly different in F-scores. CD has no significant difference for all tests. Significance is decided at p ≤ 0.05.
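For readers unfamiliar with approximate randomization, the sketch below shows the general shape of the test for an F-score difference between two systems. It assumes each system's output is summarised as per-document `(tp, fp, fn)` counts and that the outputs of the two systems are swapped document by document with probability 0.5; the shuffling unit, the number of trials and the function names are assumptions for illustration, not details taken from the paper.

```python
import random
from typing import List, Tuple

DocCounts = Tuple[int, int, int]  # (tp, fp, fn) for one document

def f_score(docs: List[DocCounts]) -> float:
    """F-score from counts pooled over all documents."""
    tp = sum(d[0] for d in docs)
    fp = sum(d[1] for d in docs)
    fn = sum(d[2] for d in docs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def approximate_randomization(a: List[DocCounts], b: List[DocCounts],
                              trials: int = 10000, seed: int = 0) -> float:
    """Estimate a p-value for the observed |F(a) - F(b)|."""
    rng = random.Random(seed)
    observed = abs(f_score(a) - f_score(b))
    at_least_as_large = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for doc_a, doc_b in zip(a, b):
            if rng.random() < 0.5:  # swap the two systems on this document
                doc_a, doc_b = doc_b, doc_a
            shuffled_a.append(doc_a)
            shuffled_b.append(doc_b)
        if abs(f_score(shuffled_a) - f_score(shuffled_b)) >= observed:
            at_least_as_large += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_large + 1) / (trials + 1)

# Toy data: one system is perfect on every document, the other always wrong.
system_a = [(10, 0, 0)] * 20
system_b = [(0, 10, 10)] * 20
p_value = approximate_randomization(system_a, system_b, trials=500)
print(p_value <= 0.05)
```

The test makes no distributional assumptions about F-score, which is why it is a common choice for comparing systems on a shared test set.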
Performance of named entity recognition using SVM learn-to-rank (SVM+LTR) for all entities in the cross-validation test and unique entities only.
| NE class | All: P | All: R | All: F | Unique: P | Unique: R | Unique: F | Unique rate |
| PH | 74.3 | 76.4 | 75.3 | 65.4 | 60.3 | 62.3 | 26.2 |
| GG | 86.8 | 85.0 | | 80.2 | 79.4 | 79.8 | 14.6 |
| OR | 70.2 | 77.2 | 73.5 | 67.3 | 69.3 | 68.3 | 22.9 |
| CD | 80.5 | 81.4 | | 74.3 | 71.0 | 72.6 | 41.3 |
| AN | 75.6 | 80.1 | | 71.3 | 72.6 | 72.0 | 19.2 |
| DS | 73.2 | 71.6 | 72.4 | 70.1 | 69.2 | 69.7 | 12.3 |
| ALL | - | - | 79.9 | - | - | 73.2 | - |
Each horizontal row shows Precision, Recall and F-score performance for a class using alternative methods. Unique Rate shows the percentage of unique entity mentions seen in the cross-validation test for each class. ALL shows micro-averaged F-score.
Examples of mentions in the corpus where we noticed a gain in recall with each of the resources.
| No. | Resource | Entity example | Named entity class |
| 1 | JNLPBA ME+BS corpus | [human gammaglobulin] | PH |
| | | [eukaryotic elongation factor 1A-1] | PH |
| | | [high-affinity human mAb] | PH |
| 2 | UMLS & MetaMap | [disorder of the Steroidogenic Acute Regulatory Protein] | PH |
| | | [Dermatitis Herpetiformis] | DS |
| | | [uveitis] | DS |
| 3 | HPO | [immunoglobulin abnormality] | PH |
| | | [asthma phenotype] | PH |
| | | [autoimmunity] | PH |
| 4 | MP | [oxidative stress pathway] | PH |
| | | [intestinal inflammation] | PH |
| | | [insulitis] | PH |
| 5 | Gene dictionary | [CEACAM6] | GG |
| | | [COL29A1] | GG |
| | | [Slc30A8] | GG |
| 6 | Linnaeus | [adenoviruses] | OR |
| | | [murine] | OR |
| | | [adherent-invasive E. coli] | OR |
| 7 | FMA Ontology | [lung] | AN |
| | | [multiple organ systems] | AN |
| | | [central nervous system] | AN |
| 8 | PATO | [high IgE levels] | PH |
| 9 | Jochem dictionary | [S-nitrosoglutathione] | CD |
| | | [histamine] | CD |
| | | [dapsone] | CD |
| 10 | Brenda Tissue ontology | [ileal mucosa] | AN |
Named entity class gives the correct class for each example.
Errors by resolution module using Priority List (PL) and SVM learn-to-rank (SVM LTR).
| No. | Entity | CA | ML | RB | DB: PH | DB: GG | DB: DS | DB: CD | DB: AN | PL | LTR |
| 1 | [susceptibilities to autoimmune disease] | PH | PH | - | - | - | DS | - | - | DS | |
| 2 | [asthma and atopy phenotypes] | PH | PH | - | PH | - | DS | - | - | DS | |
| 3 | [IgE levels] | PH | GG | - | PH | - | - | - | - | | GG |
| 4 | [Toll-like receptor/IL-1R pathways] | PH | GG | - | - | GG | - | - | - | GG | GG |
| 5 | [MyD88-deficiency] | PH | GG | - | - | - | - | - | - | GG | GG |
| 6 | [allergen-induced bronchial inflammation] | PH | DS | - | - | - | - | - | - | DS | DS |
CA: Corpus annotation. Key to labelers: ML: Machine learning labeler, RB: Rule-based labeler, DB: Dictionary-based labeler. PL: Priority list, LTR: SVM learn-to-rank. The resources that the dictionary-based labelers used to recognize the entities are as follows:
UMLS C0004364,
HP 0002099,
UMLS C0004096,
MP 0002492 and HP 0003212,
NCBI Gene dictionary.