Nigel Collier, Tudor Groza, Damian Smedley, Peter N Robinson, Anika Oellrich, Dietrich Rebholz-Schuhmann.
Abstract
Scientific and clinical phenotypes reported in the experimental literature have been curated manually to build high-quality databases such as the Online Mendelian Inheritance in Man (OMIM). However, the identification and harmonization of phenotype descriptions struggle with the diversity of human expressivity. We introduce a novel automated extraction approach called PhenoMiner that exploits full parsing and conceptual analysis. Apriori association mining is then used to identify relationships to human diseases. We applied PhenoMiner to the BMC open access collection and identified 13,636 phenotype candidates. We identified 28,155 phenotype-disorder hypotheses covering 4898 phenotypes and 1659 Mendelian disorders. Analysis showed: (i) the semantic distribution of the extracted terms against linked ontologies; (ii) a comparison of term overlap with the Human Phenotype Ontology (HP); (iii) moderate support for phenotype-disorder pairs in both OMIM and the literature; (iv) strong associations of phenotype-disorder pairs to known disease-gene pairs using PhenoDigm. The full list of PhenoMiner phenotypes (S1), phenotype-disorder associations (S2), association-filtered linked data (S3) and user database documentation (S5) is available as supplementary data and can be downloaded at http://github.com/nhcollier/PhenoMiner under a Creative Commons Attribution 4.0 license. Database URL: phenominer.mml.cam.ac.uk.
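The association step named in the abstract can be illustrated with a minimal sketch of Apriori-style mining restricted to phenotype-disorder pairs. This is not the PhenoMiner implementation: the function name `mine_pairs`, the thresholds, and the placeholder data are assumptions made for illustration only. The first pass keeps items that meet a minimum support, and the second pass counts only pairs built from those frequent items, which is the pruning idea behind Apriori.

```python
from collections import Counter
from itertools import product


def mine_pairs(docs, min_support=2, min_confidence=0.6):
    """Return {(phenotype, disorder): (support, confidence)}.

    docs is a list of (phenotypes, disorders) tuples, one per document.
    Hypothetical helper, not part of PhenoMiner.
    """
    # Pass 1: count how many documents mention each single item.
    pheno_count = Counter()
    disorder_count = Counter()
    for phenos, disorders in docs:
        pheno_count.update(set(phenos))
        disorder_count.update(set(disorders))

    # Apriori pruning: a pair can only be frequent if both items are frequent.
    frequent_phenos = {p for p, c in pheno_count.items() if c >= min_support}
    frequent_disorders = {d for d, c in disorder_count.items() if c >= min_support}

    # Pass 2: count co-occurrences among the surviving items only.
    pair_count = Counter()
    for phenos, disorders in docs:
        kept_p = set(phenos) & frequent_phenos
        kept_d = set(disorders) & frequent_disorders
        pair_count.update(product(kept_p, kept_d))

    rules = {}
    for (p, d), support in pair_count.items():
        confidence = support / pheno_count[p]  # estimate of P(disorder | phenotype)
        if support >= min_support and confidence >= min_confidence:
            rules[(p, d)] = (support, confidence)
    return rules


# Placeholder data, not taken from the paper.
docs = [
    ({"phenotype_a"}, {"disorder_x"}),
    ({"phenotype_a", "phenotype_b"}, {"disorder_x"}),
    ({"phenotype_b"}, {"disorder_y"}),
    ({"phenotype_b"}, {"disorder_x"}),
]
rules = mine_pairs(docs)
```

Here `disorder_y` occurs in only one document, so it is pruned before any pair involving it is counted.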
Year: 2015 PMID: 26507285 PMCID: PMC4622021 DOI: 10.1093/database/bav104
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Overview of PhenoMiner illustrating the flow of data from the literature, to text mining, to association discovery and into an integrated semantic representation.
Figure 2. Semantic representation of a text fragment in the PhenoMiner system. Keyword search identifies the potential trigger word "unusual", causing the sentence to be selected for grammatical parsing. The adjectives "thickened" and "median", along with the common noun "nerve", are identified and their corresponding ontology terms are mapped as shown. A semantically typed regular expression then guides the system to select the phrase as a phenotype candidate.
Figure 3. Text/data mining pipeline showing processes and resources. Highlighted indexes correspond to steps in the Text/Data Mining section.
Figure 4. Example of semantic tree matching with a Tregex rule.
Representation of external concepts in disorder phenotype terms
| Ontology (O) | Probability | Ontology (O) | Probability |
|---|---|---|---|
| PATO | 0.99 | MP | 0.24 |
| SNOMEDCT | 0.98 | ORDO | 0.21 |
| OMIM | 0.96 | MA | 0.15 |
| HP | 0.57 | RxNORM | 0.09 |
| FMA | 0.44 | ChEBI | 0.02 |
| DOID | 0.30 | GO | 0.00 |
Probability that an extracted phenotype term (T) has an ontology concept (O) associated with it, based on NCBO Annotator data (18). The total number of terms is 4898.
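One plausible reading of the probabilities in this table is the fraction of extracted terms that received at least one annotation from a given ontology. The sketch below computes that fraction under this assumption. The function name `ontology_coverage` and the toy annotations are hypothetical, and the input is assumed to be a mapping from each term to the set of ontologies that annotated it.

```python
def ontology_coverage(annotations, ontologies):
    """Fraction of terms with at least one concept from each ontology.

    annotations maps a term to the set of ontology names that annotated it.
    Hypothetical helper; the input format is an assumption.
    """
    total = len(annotations)
    if total == 0:
        raise ValueError("no terms to evaluate")
    return {
        onto: sum(1 for hits in annotations.values() if onto in hits) / total
        for onto in ontologies
    }


# Placeholder annotations, not taken from the paper.
annotations = {
    "term_1": {"PATO", "HP"},
    "term_2": {"PATO"},
    "term_3": {"PATO", "SNOMEDCT"},
    "term_4": set(),
}
coverage = ontology_coverage(annotations, ["PATO", "HP", "GO"])
```

With these four placeholder terms, three carry a PATO annotation, one carries an HP annotation and none carries a GO annotation.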
HP coverage
| Affected system | ID | % |
|---|---|---|
| Abnormality of the endocrine system | HP:0000818 | 9.0 |
| Abnormality of prenatal development or birth | HP:0001197 | 6.7 |
| Neoplasm | HP:0002664 | 17.3 |
| Abnormality of the respiratory system | HP:0002086 | 14.1 |
| Abnormality of the genitourinary system | HP:0000119 | 13.2 |
| Abnormality of the nervous system | HP:0000707 | 12.6 |
| Abnormality of the musculature | HP:0003011 | 7.0 |
| Abnormality of metabolism/homeostasis | HP:0001939 | 8.4 |
| Abnormality of blood and blood-forming tissues | HP:0001871 | 15.2 |
| Abnormality of the immune system | HP:0002715 | 15.9 |
| Abnormality of the voice | HP:0001608 | 13.3 |
| Abnormality of the skeletal system | HP:0000924 | 2.7 |
| Abnormality of the ear | HP:0000598 | 5.7 |
| Abnormality of head and neck | HP:0000152 | 8.4 |
| Abnormality of the breast | HP:0000769 | 17.4 |
| Abnormality of the integument | HP:0001574 | 6.5 |
| Growth abnormality | HP:0001507 | 15.7 |
| Abnormality of the abdomen | HP:0001438 | 14.6 |
| Abnormality of the cardiovascular system | HP:0001626 | 12.5 |
| Abnormality of the eye | HP:0000478 | 8.0 |
| Abnormality of connective tissue | HP:0003549 | 6.2 |
Percentage overlap between PhenoMiner terms and HPO terms estimated using Bio-LarK’s concept alignment.
Figure 5. ROC curve for known gene-disease associations from OMIM’s MorbidMap using HP and PhenoMiner annotations. For each disease, the true- and false-positive rates are calculated from the list of mouse genes in MGD ordered by phenotypic similarity and the known associated gene in OMIM.
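The per-disease calculation described in this caption can be sketched as follows, assuming a gene list already sorted from most to least phenotypically similar and a set of known associated genes. The function names `roc_points` and `auc`, and the placeholder gene identifiers, are assumptions for illustration. Ties in similarity score are not handled, and this is not the PhenoDigm code.

```python
def roc_points(ranked_genes, positives):
    """Return (false-positive rate, true-positive rate) points for one disease.

    ranked_genes is ordered from most to least similar; positives holds the
    known associated genes. Hypothetical helper.
    """
    positives = set(positives)
    n_pos = sum(1 for g in ranked_genes if g in positives)
    n_neg = len(ranked_genes) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative gene")

    tp = fp = 0
    points = [(0.0, 0.0)]
    # Lower the rank cutoff one gene at a time and record the rates.
    for gene in ranked_genes:
        if gene in positives:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Placeholder ranking, not taken from the paper.
ranked = ["gene_a", "gene_b", "gene_c", "gene_d"]
points = roc_points(ranked, {"gene_b"})
area = auc(points)
```

In this toy ranking the known gene sits second, so two of the three non-associated genes rank below it and the area is two thirds.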