Tudor Groza, Sebastian Köhler, Sandra Doelken, Nigel Collier, Anika Oellrich, Damian Smedley, Francisco M Couto, Gareth Baynam, Andreas Zankl, Peter N Robinson.
Abstract
Concept recognition tools rely on the availability of textual corpora to assess their performance and enable the identification of areas for improvement. Typically, corpora are developed for specific purposes, such as gene name recognition. Gene and protein name identification are longstanding goals of biomedical text mining, and therefore a number of different corpora exist. However, phenotypes only recently became an entity of interest for specialized concept recognition systems, and hardly any annotated text is available for performance testing and training. Here, we present a unique corpus, capturing text spans from 228 abstracts manually annotated with Human Phenotype Ontology (HPO) concepts and harmonized by three curators, which can be used as a reference standard for free text annotation of human phenotypes. Furthermore, we developed a test suite for standardized concept recognition error analysis, incorporating 32 different types of test cases corresponding to 2164 HPO concepts. Finally, three established phenotype concept recognizers (NCBO Annotator, OBO Annotator and Bio-LarK CR) were comprehensively evaluated, and results are reported against both the text corpus and the test suites. The gold standard and test suite corpora are available from http://bio-lark.org/hpo_res.html. Database URL: http://bio-lark.org/hpo_res.html.
Year: 2015 PMID: 25725061 PMCID: PMC4343077 DOI: 10.1093/database/bav005
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Table 1. Distribution of disorders associated with the HPO gold standard corpus
| Disorder (OMIM) | Count |
|---|---|
| Angelman syndrome (OMIM:105830) | 56 |
| Neurofibromatosis type II (OMIM:101000) | 46 |
| Basal cell nevus syndrome (OMIM:109400) | 40 |
| Branchiootorenal syndrome 1 (OMIM:113650) | 27 |
| Brachydactyly type C (OMIM:113100) | 14 |
| Branchiooculofacial syndrome (OMIM:113620) | 13 |
| Townes-Brocks syndrome (OMIM:107480) | 11 |
| Arthrogryposis distal type 1 (OMIM:108120) | 9 |
| Brachydactyly type A1 (OMIM:112500) | 7 |
| Popliteal pterygium syndrome (OMIM:119500) | 6 |
| Prader-Willi syndrome (OMIM:176270) | 5 |
| Arthrogryposis distal type 2B (OMIM:601680) | 4 |
| Van der Woude syndrome (OMIM:119300) | 3 |
| Neurofibromatosis type I (OMIM:162200) | 3 |
| Arthrogryposis distal type 2A (OMIM:193700) | 3 |
| Arthrogryposis distal type 5 (OMIM:108145) | 2 |
| Gordon syndrome (OMIM:114300) | 2 |
| Trismus-pseudocamptodactyly syndrome (OMIM:158300) | 2 |
| Schwannomatosis (OMIM:162091) | 2 |
| Neurofilament protein heavy polypeptide (OMIM:162230) | 2 |
| Hemifacial microsomia (OMIM:164210) | 2 |
| Symphalangism proximal cushing symphalangism (OMIM:185800) | 2 |
| Branchiootic syndrome 1 (OMIM:602588) | 2 |
| Arthrogryposis distal type 4 (OMIM:609128) | 2 |
| Acrodysostosis 1 with or without hormone resistance (OMIM:101800) | 1 |
| Arthrogryposis-like hand anomaly and sensorineural deafness (OMIM:108200) | 1 |
| Stickler syndrome type I (OMIM:108300) | 1 |
| Brachydactyly type A2 (OMIM:112600) | 1 |
| Charcot-Marie-Tooth disease demyelinating type 1B (OMIM:118200) | 1 |
| Arthrogryposis distal type 9 (OMIM:121050) | 1 |
| Arthrogryposis distal type 2E (OMIM:121070) | 1 |
| Crouzon syndrome (OMIM:123500) | 1 |
| Duane retraction syndrome 1 (OMIM:126800) | 1 |
| Multiple endocrine neoplasia type I (OMIM:131100) | 1 |
| Treacher Collins-Franceschetti syndrome (OMIM:154500) | 1 |
| Mesothelioma malignant (OMIM:156240) | 1 |
| Neurofibromatosis familial spinal (OMIM:162210) | 1 |
| Neurofibromatosis type III mixed central and peripheral (OMIM:162260) | 1 |
| Noonan syndrome 1 (OMIM:163950) | 1 |
| Oculodentodigital dysplasia (OMIM:164200) | 1 |
| Polydactyly postaxial type A1 (OMIM:174200) | 1 |
| Greig cephalopolysyndactyly syndrome (OMIM:175700) | 1 |
| Hutchinson-Gilford progeria syndrome (OMIM:176670) | 1 |
| Multiple pterygium syndrome autosomal dominant (OMIM:178110) | 1 |
| Symphalangism C. S. Lewis type (OMIM:185650) | 1 |
| Thumbs stiff with brachydactyly type A1 and developmental delay (OMIM:188201) | 1 |
| Waardenburg syndrome type 1 (OMIM:193500) | 1 |
| Williams-Beuren syndrome (OMIM:194050) | 1 |
| Diarrhea 1 secretory chloride congenital (OMIM:214700) | 1 |
| Cystic fibrosis (OMIM:219700) | 1 |
| Hydrocephalus autosomal dominant (OMIM:600256) | 1 |
| Bor-Duane hydrocephalus contiguous gene syndrome (OMIM:600257) | 1 |
| Cholesteatoma congenital (OMIM:604183) | 1 |
| Basal cell carcinoma susceptibility to 1 (OMIM:605462) | 1 |
The listing includes the name of the OMIM disease and the number of abstracts associated with it (the Count column).
Figure 1. Distribution of HPO test cases according to their types, mapped to the top-level HPO categories. The larger the symbol, the more test case entries the corresponding mapping has. For example, the largest number of test case entries of Length-1 is present in Abnormality of the integument. In addition to providing an overview of the test suite content, this figure also gives a bird's-eye view of how the characteristics of the concept lexical representations vary across the top-level HPO categories. We can observe, e.g., that only very few top-level categories contain concept labels with a length greater than 10. Similarly, metaphoric constructs seem to be present only in skeletal abnormalities, which, together with the abnormalities of the integument and of the metabolism, also dominate the range of labels containing punctuation.
Figure 2. Distribution of HPO annotations according to the top-level HPO categories. Two distributions are shown: an overall distribution that accounts for duplicate concept annotations (i.e. every instance of an annotation is counted), and a unique distribution that shows the counts of the unique concept annotations (i.e. every concept is counted a single time, regardless of how many annotations exist in the corpus).
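The difference between the overall and the unique distribution can be illustrated with a minimal Python sketch. The annotation tuples, concept identifiers and category names below are hypothetical placeholders, not data from the corpus, and the function name is an assumption made for this example.

```python
from collections import Counter

def category_distributions(annotations):
    """Return (overall, unique) annotation counts per top-level category.

    `annotations` is a list of (concept_id, category) pairs, one entry per
    annotated text span. Both names are illustrative assumptions.
    """
    # Overall: every annotation instance is counted, duplicates included.
    overall = Counter(category for _, category in annotations)
    # Unique: each distinct concept is counted once, however often it occurs.
    unique = Counter(category for _, category in set(annotations))
    return overall, unique

# Hypothetical annotations (placeholder concept IDs and categories).
annotations = [
    ("concept-1", "skeletal"),
    ("concept-1", "skeletal"),
    ("concept-2", "skeletal"),
    ("concept-3", "integument"),
    ("concept-3", "integument"),
    ("concept-3", "integument"),
]

overall, unique = category_distributions(annotations)
print(sorted(overall.items()))  # [('integument', 3), ('skeletal', 3)]
print(sorted(unique.items()))   # [('integument', 1), ('skeletal', 2)]
```

In this toy input both categories have the same overall count, but the unique counts differ because one concept accounts for all annotations in the second category.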
Table 2. System performance on the HPO corpus using exact matching and concept identification
| System | Precision | Recall | F1 |
|---|---|---|---|
| NCBO Annotator | 0.54 | 0.39 | 0.45 |
| OBO Annotator | 0.69 | 0.44 | 0.54 |
| Bio-LarK CR | 0.65 | 0.49 | 0.56 |
OBO Annotator and Bio-LarK CR achieve similar overall performance, with F-Scores that differ by only 0.02. The F-Score of the NCBO Annotator was on average about 10 percentage points lower than those of the other two systems.
Figure 3. F-Score results achieved by the three systems on the HPO gold standard, distributed according to the HPO top-level category.
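As a rough illustration of how such scores can be computed, the sketch below assumes that an annotation counts as correct only if both its text span boundaries and its concept identifier match a gold annotation exactly. The exact matching criteria used in the evaluation are not spelled out in this section, so this is one plausible reading; the function name, tuple layout and all annotation values are hypothetical.

```python
def evaluate_exact(gold, predicted):
    """Precision, recall and F1 under exact span and concept matching.

    Each annotation is a (doc_id, start, end, concept_id) tuple; a predicted
    annotation is a true positive only if the identical tuple is in `gold`.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical gold and system annotations (placeholder values).
gold = [
    ("doc1", 0, 12, "concept-1"),
    ("doc1", 20, 35, "concept-2"),
    ("doc2", 5, 18, "concept-3"),
    ("doc2", 40, 52, "concept-4"),
]
predicted = [
    ("doc1", 0, 12, "concept-1"),   # exact match: true positive
    ("doc1", 20, 30, "concept-2"),  # right concept, wrong boundaries
    ("doc2", 5, 18, "concept-9"),   # right boundaries, wrong concept
]

p, r, f1 = evaluate_exact(gold, predicted)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.33 0.25 0.29
```

Under this strict criterion, a system is penalized twice for a near miss: the imperfect annotation lowers precision, and the unmatched gold annotation lowers recall.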
Table 3. System performance on the HPO test suites using exact matching and concept identification
| System | Precision | Recall | F1 |
|---|---|---|---|
| NCBO Annotator | 0.95 | 0.84 | 0.89 |
| OBO Annotator | 0.54 | 0.26 | 0.35 |
| Bio-LarK CR | 0.97 | 0.93 | 0.95 |
In contrast to the results listed in Table 2, the NCBO Annotator achieved an overall F-Score close to that of Bio-LarK CR (0.89 compared to 0.95). Surprisingly, the OBO Annotator performed much worse than the other two systems (F1 of 0.35), although on real data it performed on par with Bio-LarK CR.
Figure 4. F-Score results achieved by the three systems on the HPO test suites, distributed according to the type of the test case.
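The comparison between the two evaluations can be reproduced from the tabulated values alone. The sketch below assumes that F1 is the standard harmonic mean of precision and recall, an assumption that the reported figures are consistent with after rounding, and then computes how much each system's F1 changes from the corpus to the test suites. The dictionary layout and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, F1) as reported in the two performance tables.
corpus = {
    "NCBO Annotator": (0.54, 0.39, 0.45),
    "OBO Annotator": (0.69, 0.44, 0.54),
    "Bio-LarK CR": (0.65, 0.49, 0.56),
}
test_suites = {
    "NCBO Annotator": (0.95, 0.84, 0.89),
    "OBO Annotator": (0.54, 0.26, 0.35),
    "Bio-LarK CR": (0.97, 0.93, 0.95),
}

# Check that each reported F1 follows from the reported precision and recall.
for table in (corpus, test_suites):
    for system, (p, r, f1) in table.items():
        assert round(f1_score(p, r), 2) == f1, system

# Change in F1 when moving from the corpus to the test suites.
delta = {
    system: round(test_suites[system][2] - corpus[system][2], 2)
    for system in corpus
}
print(delta)
# {'NCBO Annotator': 0.44, 'OBO Annotator': -0.19, 'Bio-LarK CR': 0.39}
```

The NCBO Annotator and Bio-LarK CR both gain about 0.4 in F1 on the test suites, whereas the OBO Annotator is the only system whose F1 drops.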