Raphael Cohen, Avitan Gefen, Michael Elhadad, Ohad S Birk.
Abstract
BACKGROUND: The OMIM database is a tool used daily by geneticists. Syndrome pages include a Clinical Synopsis section that lists the known phenotypes comprising a clinical syndrome. The phenotypes are written in free text, and different phrases are often used to describe the same phenotype; the differences originate in spelling variations or typing errors, varying sentence structures and terminological variants. These variations hinder searching for syndromes and using the large amount of phenotypic information for research purposes. In addition, negated forms create false positives when searching the textual descriptions of phenotypes and introduce noise in text-mining applications.

DESCRIPTION: Our method allows efficient and complete search of OMIM phenotypes as well as improved data mining of the OMIM phenome. Applying natural language processing, each phrase is tagged with additional semantic information using UMLS and MeSH. Using a grammar-based method, annotated phrases are clustered into groups denoting similar phenotypes. These groups of synonymous expressions enable precise search: query terms can be matched with the many variations that appear in OMIM, while avoiding over-matching of expressions that include the query term in a negative context. On the basis of these clusters, we computed pairwise similarity among syndromes in OMIM. Using this new similarity measure, we identified 79,770 new connections between syndromes, an average of 16 new connections per syndrome. Our project is Web-based and available at http://fohs.bgu.ac.il/s2g/csiomim
Year: 2011 PMID: 21362185 PMCID: PMC3053257 DOI: 10.1186/1471-2105-12-65
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
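The abstract states that pairwise similarity among syndromes was computed on the basis of the phrase clusters, but this record does not give the formula. The following is a minimal sketch of one common choice, Jaccard overlap between the sets of cluster identifiers attached to each syndrome. The Jaccard measure, the threshold, the syndrome ids and the cluster names are all illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Fraction of phenotype clusters shared by two syndromes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical data: syndrome id -> set of phenotype-cluster ids.
syndrome_clusters = {
    "S1": {"c_short_stature", "c_brainstem_hypoplasia", "c_seizures"},
    "S2": {"c_brainstem_hypoplasia", "c_seizures"},
    "S3": {"c_polydactyly"},
}


def connections(clusters_by_syndrome, threshold=0.5):
    """Return syndrome pairs whose similarity meets the (assumed) threshold."""
    links = {}
    for s, t in combinations(sorted(clusters_by_syndrome), 2):
        score = jaccard(clusters_by_syndrome[s], clusters_by_syndrome[t])
        if score >= threshold:
            links[(s, t)] = score
    return links


links = connections(syndrome_clusters)
for pair, score in links.items():
    print(pair, round(score, 3))
```

Comparing cluster ids rather than raw phrases is what lets two syndromes match even when their Clinical Synopsis entries word the same phenotype differently.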
Areas of phenotypes identified
| Area Name | Area# | #Distinct phrases | #Clusters identified | Avg Similar Phrases Cluster Size | % Phrases clustered in area |
|---|---|---|---|---|---|
| Syndrome names | 1 | 4,801 | 278 | 3.45 | 19.9 |
| Abdomen/gi | 2 | 707 | 33 | 2.3 | 10.1 |
| Respiratory | 3 | 546 | 39 | 2.17 | 15.6 |
| Gu/renal | 4 | 985 | 44 | 2.25 | 10.0 |
| Gu/genitalia | 5 | 975 | 44 | 2.25 | 10.0 |
| Cardiovascular | 6 | 808 | 38 | 2.13 | 10.0 |
| Muscle | 7 | 1,143 | 68 | 2.98 | 17.8 |
| Endo | 8 | 344 | 16 | 2.25 | 10.4 |
| Neuro | 9 | 4,721 | 247 | 2.68 | 14.0 |
| Oncology | 10 | 891 | 21 | 2.62 | 6.17 |
| Heme | 11 | 576 | 16 | 2.75 | 7.6 |
| Immune | 12 | 542 | 19 | 2.05 | 7.2 |
| Eyes | 13 | 2,265 | 152 | 2.5 | 16.8 |
| Face | 14 | 3,927 | 283 | 2.56 | 18.4 |
| Teeth | 15 | 3,542 | 258 | 2.48 | 18.1 |
| Neck | 16 | 3,490 | 252 | 2.49 | 17.9 |
| Head | 17 | 4,196 | 281 | 2.51 | 16.8 |
| Limb | 18 | 4,522 | 283 | 3.15 | 19.7 |
| Skel | 19 | 3,936 | 247 | 3.02 | 18.9 |
| Chest | 20 | 4,307 | 273 | 3.03 | 19.2 |
| Growth | 21 | 497 | 37 | 4.24 | 31.6 |
| Nails | 22 | 1,617 | 106 | 2.41 | 15.8 |
| Skin | 23 | 2,098 | 140 | 2.42 | 16.2 |
| Hair | 24 | 1,701 | 116 | 2.41 | 16.5 |
| Lab | 25 | 3,553 | 106 | 2.9 | 8.6 |
| Misc | 26 | 4,021 | Was not clustered | | |
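The columns of this table appear to be related: for several rows, the reported percentage of clustered phrases is close to (clusters × average cluster size) / distinct phrases. This relation is an inference from the numbers, not something stated in the record, and the sketch below only checks it on a few rows copied from the table.

```python
# (area, distinct phrases, clusters, avg cluster size, reported % clustered)
rows = [
    ("Syndrome names", 4801, 278, 3.45, 19.9),
    ("Growth", 497, 37, 4.24, 31.6),
    ("Oncology", 891, 21, 2.62, 6.17),
    ("Lab", 3553, 106, 2.9, 8.6),
]


def estimated_percent(distinct: int, clusters: int, avg_size: float) -> float:
    """Assumed relation: clustered phrases = clusters * average cluster size."""
    return 100.0 * clusters * avg_size / distinct


deviations = {}
for area, distinct, clusters, avg_size, reported in rows:
    estimate = estimated_percent(distinct, clusters, avg_size)
    deviations[area] = abs(estimate - reported)
    print(f"{area}: estimated {estimate:.2f}%, reported {reported}%")
```

For the rows shown, the estimate and the reported value differ by less than 0.1 percentage points, which is consistent with rounding of the average cluster size.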
Figure 1. Distribution of phrase instances by number of words. Distribution of the length of noun phrases in the OMIM Clinical Synopsis.
Rough Semantic Categories.
| Rough Semantic Category | Description | Examples |
|---|---|---|
| Pathology or Finding | Names and symptoms of diseases | "Perthes", "Hexadactyly", "Diffuse atrophy" or "Short finger" |
| Named entities | Names of chemicals, functions, microorganisms or proteins | "Actin", "Tyrosine", "Insulin" or "Agglutination" |
| Anatomy | The body part or organ the phenotype occurs in | "Cranium bifidum", "Thumb" or "Distal femur" |
| Modifiers | Concepts describing the phenotype and changing its meaning. | "Absent", "Hypoplastic", "Mild", "Enlarged" |
Figure 2. Parsing result. a) Parse of the phrase "brainstem hypoplasia". MetaMap recognized the token "brainstem" as a UMLS term (CUI C0006121) of semantic type "Body Part, Organ, or Organ Component" and the token "hypoplasia" as CUI C0243069 of semantic type "Pathologic Function". b) Parse tree of the phrase "mri shows brainstem hypoplasia". The same concepts were recognized as in (a); "mri" is marked as a noun, and "shows" is also marked as a noun because we treat the phrases as noun phrases without verbs. c) Parse tree of the phrase "hypoplasia of the brainstem". MetaMap reduces the entire phrase to the two concepts identified in (a), in reversed order.
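The Figure 2 caption notes that differently worded phrases reduce to the same two UMLS concepts, possibly in a different order. A minimal sketch of why that helps grouping: if each phrase is keyed by its unordered set of concept identifiers, the three example phrases collide on one key. The lookup table below is a toy stand-in for a concept tagger (only the two CUIs quoted in the caption are used), and dropping unmapped tokens is an assumption of this sketch; the source's actual clustering is grammar-based and is not reproduced here.

```python
# Toy stand-in for a concept tagger; the two CUIs are those quoted in the caption.
TOKEN_TO_CUI = {
    "brainstem": "C0006121",   # Body Part, Organ, or Organ Component
    "hypoplasia": "C0243069",  # Pathologic Function
}


def concept_key(phrase: str) -> frozenset:
    """Order-insensitive key: the set of concept ids found in the phrase.

    Tokens without a mapping ("of", "the", "mri", "shows") are ignored,
    which is a simplifying assumption of this sketch.
    """
    tokens = phrase.lower().split()
    return frozenset(TOKEN_TO_CUI[t] for t in tokens if t in TOKEN_TO_CUI)


phrases = [
    "brainstem hypoplasia",
    "mri shows brainstem hypoplasia",
    "hypoplasia of the brainstem",
]

# Group phrases that share the same concept key.
groups = {}
for phrase in phrases:
    groups.setdefault(concept_key(phrase), []).append(phrase)

for key, members in groups.items():
    print(sorted(key), members)
```

Using a `frozenset` makes the key hashable and independent of word order, so "hypoplasia of the brainstem" and "brainstem hypoplasia" fall into the same group.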
Figure 3. CSI-OMIM: Negation detection. Negations are marked in italics and are ignored in the clustering process.
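The record does not say how negations are detected. As a rough illustration of the idea in Figure 3, the sketch below flags phrases containing a negation cue and sets them aside before clustering. The cue list and the function names are hypothetical, and a simple cue match like this is much cruder than a real negation detector.

```python
import re

# Hypothetical cue list; "absent" is deliberately excluded because the
# Rough Semantic Categories table treats it as a phenotype modifier.
NEGATION_CUES = ["no", "not", "without", "negative for"]

# Whole-word match so that "no" does not fire inside "nodular" or "normal".
_CUE_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(cue) for cue in NEGATION_CUES) + r")\b",
    re.IGNORECASE,
)


def is_negated(phrase: str) -> bool:
    """True if the phrase contains any of the assumed negation cues."""
    return _CUE_RE.search(phrase) is not None


def split_by_negation(phrases):
    """Separate phrases to cluster from negated phrases to set aside."""
    kept, negated = [], []
    for phrase in phrases:
        (negated if is_negated(phrase) else kept).append(phrase)
    return kept, negated


kept, negated = split_by_negation([
    "mental retardation",
    "no mental retardation",
    "normal intelligence",
    "not progressive",
    "nodular lesions",
])
print("cluster:", kept)
print("set aside:", negated)
```

Keeping negated phrases out of the clusters is what prevents a query such as "mental retardation" from matching a syndrome whose synopsis says "no mental retardation".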
Figure 4. CSI-OMIM: Clusters of similar phrases. Clusters of similar phrases returned for a "white matter abnormalities" search.