Zhihui Luo, Robert Duffy, Stephen Johnson, Chunhua Weng.
Abstract
We describe a corpus-based approach to creating a semantic lexicon using UMLS knowledge sources. We extracted 10,000 sentences from the eligibility criteria sections of clinical trial summaries contained in ClinicalTrials.gov. The UMLS Metathesaurus and SPECIALIST Lexical Tools were used to extract and normalize UMLS-recognizable terms. When annotated with Semantic Network types, the corpus had a lexical ambiguity of 1.57 (= total types for unique lexemes / total unique lexemes) and a word occurrence ambiguity of 1.96 (= total type occurrences / total word occurrences). A set of semantic preference rules was developed and applied to completely eliminate ambiguity in semantic type assignment. The lexicon covered 95.95% of the UMLS-recognizable terms in our corpus. A total of 20 UMLS semantic types, representing about 17% of all the distinct semantic types assigned to corpus lexemes, covered about 80% of the vocabulary of our corpus.
Year: 2010 PMID: 21347142 PMCID: PMC3041551
Source DB: PubMed Journal: Summit Transl Bioinform ISSN: 2153-6430
Figure 1: System modules and data flow of the pipeline architecture for creating a semantic lexicon for clinical research eligibility criteria from UMLS
Example of semantic assignment before applying semantic preference rules
| Term | Semantic type(s) |
| immunodeficiency | Disease or Syndrome |
| recent | Temporal Concept |
| patient | Idea or Concept; Intellectual Product; Patient or Disabled Group; Organism |
| therapy | Therapeutic or Preventive Procedure; Functional Concept; Finding |
| while | |
| bulky | |
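The two ambiguity ratios defined in the abstract can be illustrated on the example terms above. The following is a minimal sketch, not the authors' implementation: the function names and the token sequence are hypothetical, and it assumes that terms without any semantic type (such as "while" and "bulky") are left out of both ratios, which the source does not state.

```python
# Semantic types per lexeme, copied from the example table above.
# Untyped terms ("while", "bulky") are omitted (assumption, see lead-in).
LEXEME_TYPES = {
    "immunodeficiency": ["Disease or Syndrome"],
    "recent": ["Temporal Concept"],
    "patient": [
        "Idea or Concept",
        "Intellectual Product",
        "Patient or Disabled Group",
        "Organism",
    ],
    "therapy": [
        "Therapeutic or Preventive Procedure",
        "Functional Concept",
        "Finding",
    ],
}


def lexical_ambiguity(lexeme_types):
    """Total types for unique lexemes / total unique lexemes."""
    total_types = sum(len(types) for types in lexeme_types.values())
    return total_types / len(lexeme_types)


def occurrence_ambiguity(tokens, lexeme_types):
    """Total type occurrences / total word occurrences (typed tokens only)."""
    typed_tokens = [tok for tok in tokens if tok in lexeme_types]
    total_type_occurrences = sum(len(lexeme_types[tok]) for tok in typed_tokens)
    return total_type_occurrences / len(typed_tokens)


# Hypothetical token sequence, invented purely for illustration.
tokens = ["patient", "recent", "therapy", "patient", "immunodeficiency", "patient"]

print(lexical_ambiguity(LEXEME_TYPES))
print(occurrence_ambiguity(tokens, LEXEME_TYPES))
```

Because frequent words such as "patient" tend to carry several types, the occurrence-level ratio can exceed the lexeme-level ratio, which matches the direction of the reported values (1.96 versus 1.57).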
Frequently applied preference rules
| Overridden type | Preferred type | Example terms |
| Health Care Activity | Diagnostic Procedure | liver biopsy, lumbar puncture |
| Intellectual Product | Health Care Related Organization | intensive care unit, hospital |
| Quantitative Concept | Temporal Concept | minutes, second |
| Spatial Concept | Body Location or Region | mediastinal, pericardial |
| Idea or Concept | Organism Function | recovery, birth, death |
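One way to picture how such rules remove ambiguity is a pairwise filter: when a term carries both types named in a rule, the less preferred one is dropped. The sketch below assumes that the second type in each row is the one that is kept, which is an interpretation of the table (the example terms fit the more specific second type) and not something the source states. The function name `apply_preference_rules` is hypothetical.

```python
# (overridden type, preferred type) pairs taken from the table above.
# The direction of preference is an assumption, see the lead-in.
PREFERENCE_RULES = [
    ("Health Care Activity", "Diagnostic Procedure"),
    ("Intellectual Product", "Health Care Related Organization"),
    ("Quantitative Concept", "Temporal Concept"),
    ("Spatial Concept", "Body Location or Region"),
    ("Idea or Concept", "Organism Function"),
]


def apply_preference_rules(types, rules=PREFERENCE_RULES):
    """Drop the overridden type whenever its preferred partner is also present."""
    remaining = list(types)
    for overridden, preferred in rules:
        if overridden in remaining and preferred in remaining:
            remaining.remove(overridden)
    return remaining


# "liver biopsy" carries both types of the first rule.
print(apply_preference_rules(["Health Care Activity", "Diagnostic Procedure"]))

# A term with a single type is returned unchanged.
print(apply_preference_rules(["Temporal Concept"]))
```

A full rule set would need enough pairs to reduce every multi-typed term to a single type; the five rules shown here only cover the frequent cases listed in the table.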
Coverage of the corpus by UMLS types
| | Count | Percent | Count | Percent |
| No Type | 1908 | 4.05% | 691 | 9.98% |
| Has Type | 45221 | 95.95% | 6230 | 90.02% |
| Multiple Words | 7334 | 15.56% | 2283 | 32.99% |
| Single Word | 39795 | 84.43% | 4638 | 67.01% |
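The coverage percentages follow directly from the counts in the table. As a small check, the sketch below recomputes the "Has Type" share for each pair of count columns. The helper name `coverage_percent` is hypothetical, and the source excerpt does not label what the two column pairs measure, so they are referred to only as the first and second pair.

```python
def coverage_percent(has_type, no_type):
    """Share of terms with at least one semantic type, in percent."""
    return 100.0 * has_type / (has_type + no_type)


# Counts copied from the "Has Type" and "No Type" rows of the table above.
first_pair = coverage_percent(has_type=45221, no_type=1908)
second_pair = coverage_percent(has_type=6230, no_type=691)

print(f"{first_pair:.2f}%")   # 95.95%
print(f"{second_pair:.2f}%")  # 90.02%
```

The first value reproduces the 95.95% coverage figure quoted in the abstract.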
Coverage of the 20 semantic types
| Semantic type | Coverage |
| Temporal Concept | 11.07% |
| Qualitative Concept | 10.60% |
| Functional Concept | 6.19% |
| Laboratory Procedure | 5.16% |
| Therapeutic or Preventive Procedure | 4.62% |
| Disease or Syndrome | 4.58% |
| Intellectual Product | 4.01% |
| Idea or Concept | 4.00% |
| Pharmacologic Substance | 3.70% |
| Organism Attribute | 3.39% |
| Spatial Concept | 2.99% |
| Health Care Activity | 2.92% |
| Finding | 2.87% |
| Organism Function | 2.67% |
| Population Group | 2.62% |
| Professional or Occupational Group | 2.27% |
| Quantitative Concept | 2.11% |
| Neoplastic Process | 1.72% |
| Patient or Disabled Group | 1.71% |
| Body Part, Organ, or Organ Component | 1.40% |
| Total | 80.60% |
Ambiguity reduction in the top 5 semantic types after applying the semantic preference rules
| Semantic type | Before rules | After rules |
| Idea or Concept | 10017 | 2965 |
| Qualitative Concept | 8408 | 4489 |
| Intellectual Product | 6134 | 1941 |
| Conceptual Entity | 3135 | 218 |
| Manufactured Object | 2470 | 121 |
Functional words and their examples
| Category | Examples |
| Function words | The, of, can, if, while… |
| Number Strings | 18, 60, 1979, 2005 |
| Symbol Strings | -, #, >=, @, +, ?, * |
| Abbreviations | GLD, HCV, NICHD |
| Units | mm3, ph, mmhg, l, kg |