Manuel Lobo, Andre Lamurias, Francisco M Couto.
Abstract
Named-Entity Recognition is commonly used to identify biological entities such as proteins, genes, and chemical compounds found in scientific articles. The Human Phenotype Ontology (HPO) is an ontology that provides a standardized vocabulary for phenotypic abnormalities found in human diseases. This article presents the Identifying Human Phenotypes (IHP) system, tuned to recognize HPO entities in unstructured text. IHP uses Stanford CoreNLP for text processing and applies Conditional Random Fields trained with a rich feature set, which includes linguistic, orthographic, morphologic, lexical, and context features created for the machine learning-based classifier. However, the main novelty of IHP is its validation step based on a set of carefully crafted manual rules, such as the negative connotation analysis, that combined with a dictionary can filter incorrectly identified entities, find missed entities, and combine adjacent entities. The performance of IHP was evaluated using the recently published HPO Gold Standardized Corpora (GSC), where the system Bio-LarK CR obtained the best F-measure of 0.56. IHP achieved an F-measure of 0.65 on the GSC. Due to inconsistencies found in the GSC, an extended version of the GSC was created, adding 881 entities and modifying 4 entities. IHP achieved an F-measure of 0.863 on the new GSC.
Year: 2017 PMID: 29250549 PMCID: PMC5700471 DOI: 10.1155/2017/8565739
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. Layout of IHP's annotation pipeline. IHP requires as input a gold-standard corpus that serves as the training set for CRFSuite and is used to evaluate IHP's performance at the end; a feature set to use in CRFSuite; and a list of rules and a dictionary to resolve potential errors.
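One of the validation behaviours mentioned in the abstract, combining adjacent entities with the help of a dictionary, can be illustrated with a minimal sketch. This is not IHP's actual rule set, which is not spelled out here; the function name `merge_adjacent`, the whitespace-only gap condition, and the toy dictionary are all assumptions made for illustration.

```python
def merge_adjacent(text, spans, dictionary):
    """Merge neighbouring (start, end) spans when the combined text is a known term.

    Hypothetical helper: the merge condition is an assumption, not IHP's rule.
    """
    merged = []
    for start, end in sorted(spans):
        if merged:
            prev_start, prev_end = merged[-1]
            gap = text[prev_end:start]
            candidate = text[prev_start:end].lower()
            # Merge only if the spans are separated by whitespace alone and
            # the combined string appears in the dictionary.
            if gap.strip() == "" and candidate in dictionary:
                merged[-1] = (prev_start, end)
                continue
        merged.append((start, end))
    return merged


def span_of(text, word):
    """Return the (start, end) offsets of the first occurrence of word."""
    start = text.index(word)
    return (start, start + len(word))


text = "Patient presented with short stature and seizures."
spans = [span_of(text, w) for w in ("short", "stature", "seizures")]
terms = {"short stature", "seizures"}  # toy dictionary, not the HPO itself

result = merge_adjacent(text, spans, terms)
print([text[s:e] for s, e in result])  # ['short stature', 'seizures']
```

Requiring a dictionary match before merging keeps the rule conservative: two entities that merely sit next to each other are left separate unless their combination is a recognised term.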
Comparative performance of IHP and Bio-LarK CR in the Gold Standard Corpora.
| System | Precision | Recall | F-measure |
|---|---|---|---|
| IHP | 0.56 | 0.79 | 0.65 |
| Bio-LarK CR | 0.65 | 0.49 | 0.56 |
Performance of IHP with each type of feature: linguistic (L), orthographic (O), morphological (M), context (C), lexical (Le), and others (X). These features were tested in a single cross-validation iteration.
| Features | Precision | Recall | F-measure |
|---|---|---|---|
| Baseline | 0.452 | 0.594 | 0.514 |
| L | 0.463 | 0.720 | 0.564 |
| O | 0.452 | 0.594 | 0.514 |
| M | 0.457 | 0.766 | 0.573 |
| C | 0.469 | 0.783 | 0.587 |
| Le | 0.453 | 0.606 | 0.518 |
| X | 0.428 | 0.697 | 0.530 |
| L + O | 0.458 | 0.720 | 0.560 |
| L + O + M | 0.451 | 0.760 | 0.566 |
| L + O + M + C | 0.478 | 0.800 | 0.598 |
| L + O + M + C + Le | 0.478 | 0.805 | 0.600 |
| L + O + M + C + Le + X | 0.482 | 0.823 | 0.608 |
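To make the feature categories more concrete, the following sketch shows what orthographic, morphological, and context features for a single token might look like when expressed as a feature dictionary for a CRF. The feature names and the specific choices (three-character affixes, one-token window) are assumptions for illustration and are not taken from IHP's feature set; linguistic features such as part-of-speech tags would come from a tool like Stanford CoreNLP and are omitted here.

```python
def token_features(tokens, i):
    """Return an illustrative feature dict for tokens[i] (hypothetical names)."""
    tok = tokens[i]
    return {
        # Orthographic: surface shape of the token
        "is_capitalized": tok[:1].isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "has_hyphen": "-" in tok,
        # Morphological: short prefixes and suffixes
        "prefix3": tok[:3].lower(),
        "suffix3": tok[-3:].lower(),
        # Context: neighbouring tokens, with boundary markers
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


sentence = ["Severe", "short", "stature", "was", "noted"]
feats = token_features(sentence, 1)
print(feats["prev"], feats["next"], feats["suffix3"])  # severe stature ort
```

A sequence of such dictionaries, one per token, is the usual input shape for CRF toolkits that accept named features.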
Performance of IHP on the Gold Standard Corpora with no validation rules, only identification rules, only removal rules, and all validation rules.
| Configuration | Precision | Recall | F-measure |
|---|---|---|---|
| No Validation Rules | 0.672 | 0.614 | 0.642 |
| With Identification Rules | 0.442 | 0.797 | 0.568 |
| With Removal Rules | 0.754 | 0.609 | 0.674 |
| With Validation Rules | 0.549 | 0.791 | 0.649 |
Potential performance in the Gold Standard Corpora by removing false positives found in either the HPO database or GSC.
| Configuration | Precision | Recall | F-measure |
|---|---|---|---|
| No Filter | 0.549 | 0.791 | 0.649 |
| With Filter | 0.845 | 0.791 | 0.817 |
Performance of IHP, OBO Annotator, NCBO Annotator, and MER on the GSC+.
| System | Precision | Recall | F-measure |
|---|---|---|---|
| IHP | 0.872 | 0.854 | 0.863 |
| OBO Annotator | 0.769 | 0.344 | 0.475 |
| NCBO Annotator | 0.688 | 0.455 | 0.548 |
| MER | 0.649 | 0.405 | 0.499 |
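The F-measure values in the tables can be checked against the standard definition, the harmonic mean of precision and recall, F = 2PR / (P + R). The short sketch below recomputes the GSC+ results from the reported precision and recall; the function name `f_measure` is an assumption introduced here. For other tables, recomputing from the rounded precision and recall can differ from the reported F-measure in the last digit.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the GSC+.
gsc_plus = {
    "IHP": (0.872, 0.854),
    "OBO Annotator": (0.769, 0.344),
    "NCBO Annotator": (0.688, 0.455),
    "MER": (0.649, 0.405),
}

scores = {name: round(f_measure(p, r), 3) for name, (p, r) in gsc_plus.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```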