Kristina Doing-Harris, Yarden Livnat, Stephane Meystre.
Abstract
BACKGROUND: We develop medical-specialty-specific ontologies that contain the settled science and common term usage. We leverage current practices in information and relationship extraction to streamline the ontology development process. Our system combines different text types with information and relationship extraction techniques in a low-overhead, modifiable system. Our SEmi-Automated ontology Maintenance (SEAM) system features a natural language processing pipeline for information extraction. Synonym and hierarchical groups are identified using corpus-based semantics and lexico-syntactic patterns. The semantic vectors we use are term frequency by inverse document frequency (TF-IDF) and context vectors. Clinical documents contain the terms we want in an ontology, but they also contain idiosyncratic usage and are unlikely to contain the linguistic constructs associated with synonym and hierarchy identification. By including both clinical and biomedical texts, SEAM can recommend terms from those appearing in both document types. The set of recommended terms is then used to filter the synonyms and hierarchical relationships extracted from the biomedical corpus. We demonstrate the generality of the system across three use cases: ontologies for acute changes in mental status (ACMS), Medically Unexplained Syndromes (MUS), and echocardiogram summary statements.
Keywords: Natural language processing; Ontology; Terminology extraction
Year: 2015 PMID: 25874077 PMCID: PMC4396714 DOI: 10.1186/s13326-015-0011-7
Source DB: PubMed Journal: J Biomed Semantics
Figure 1. A pictorial representation of the SEAM system. This figure shows the three processing stages of the SEAM system. NLP processes are in the orange boxes. Each stage includes one or more phases (A…F). Each box represents the database table created in that processing phase. The dotted line indicates that term processing is separate from relationship processing.
The UMLS semantic types used by SEAM for partial matches and final recommendations
| TUI | Semantic type |
|---|---|
| T020 | Acquired Abnormality |
| T190 | Anatomical Abnormality |
| T049 | Cell or Molecular Dysfunction |
| T019 | Congenital Abnormality |
| T047 | Disease or Syndrome |
| T050 | Experimental Model of Disease |
| T037 | Injury or Poisoning |
| T048 | Mental or Behavioral Dysfunction |
| T191 | Neoplastic Process |
| T046 | Pathologic Function |
| T184 | Sign or Symptom |
| T033 | Finding |
| T029 | Body Location or Region |
| T080 | Qualitative Concept |
| T023 | Body Part, Organ or Organ Component |
| T081 | Quantitative Concept |
Equations used in SEAM
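| C-value (a), in its standard form | $\text{C-value}(a) = \begin{cases} \log_2 \lvert a \rvert \cdot f(a), & \text{if } a \text{ is not nested} \\ \log_2 \lvert a \rvert \left( f(a) - \frac{1}{P(T_a)} \sum_{b \in T_a} f(b) \right), & \text{otherwise} \end{cases}$ |
| where: | |
| $\lvert a \rvert$ is the length of candidate term a in words | |
| $f(a)$ is the frequency of a in the corpus | |
| $T_a$ is the set of extracted candidate terms that contain a | |
| $P(T_a)$ is the number of candidate terms in $T_a$ | |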
| Termhood (a) | $-0.7836 + 0.7541\,\text{FirstPOS\_ADJECTIVE} - 1.3722\,\text{FirstPOS\_ADVERB} + 0.3541\,\text{FirstPOS\_NOUN} + 1.4182\,\text{FirstPOS\_VERB} - 0.7722\,\text{LastPOS\_ADJECTIVE} + 2.2576\,\text{LastPOS\_ADVERB} + 0.0285\,\text{LastPOS\_NOUN} + 0.6038\,\text{LastPOS\_VERB} + 1.2899\,\text{NP\_VALUE} + 1.0475\,\text{REPEAT\_SUP\_GREATER\_MEDIAN} + 0.8417\,\text{REPEAT\_SUB\_GREATER\_MEDIAN} + 0.8422\,\text{DISTINCT\_PERHOST\_GREATER\_THAN\_MEDIAN}$ |
| where: | |
| POS is the part-of-speech tag | |
| REPEAT_SUP is the number of supra-terms (candidate terms containing a) = $P(T_a)$ | |
| REPEAT_SUB is the number of sub-terms (candidate terms that are contained within a) = $P(A_t)$ | |
| NP_VALUE indicates whether a is a noun phrase | |
| DISTINCT_PER_HOST is equivalent to document frequency | |
| MEDIAN is calculated for the whole document set | |
| TF-IDF | $w_{i,j} = TF_{i,j} \times IDF_i$ |
| | $TF_{i,j} = \dfrac{f_{i,j}}{\max_z f_{z,j}}$ |
| where: | |
| $TF_{i,j}$ is the term frequency for keyword $k_i$ in document $d_j$ | |
| $f_{i,j}$ is the number of times $k_i$ appears in $d_j$ | |
| $\max_z f_{z,j}$ is the maximum frequency across all keywords $k_z$ in $d_j$ | |
| | $IDF_i = \log \dfrac{N}{n_i}$ |
| where: | |
| $IDF_i$ is the inverse document frequency for keyword $k_i$ | |
| $N$ is the total number of documents in the corpus | |
| $n_i$ is the number of documents in which $k_i$ appears | |
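| Cosine similarity | $\cos(d_j, d_k) = \dfrac{\sum_i w_{i,j}\, w_{i,k}}{\sqrt{\sum_i w_{i,j}^2}\,\sqrt{\sum_i w_{i,k}^2}}$ |
| where: | |
| $w_{i,j}$ is the TF-IDF weight defined above | |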
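To make the two term-scoring equations in this table concrete, here is a minimal Python sketch. It assumes the standard C-value definition (log2 of the term length in words, times the term frequency, discounted by the mean frequency of the longer candidates that contain the term) and it treats every Termhood feature as a 0/1 indicator. The function names, the tuple-of-words representation, and the toy frequencies are assumptions made for illustration; this is not SEAM's own code.

```python
import math

# Coefficients copied from the Termhood equation above.
# Assumption: each feature is a 0/1 indicator.
TERMHOOD_INTERCEPT = -0.7836
TERMHOOD_WEIGHTS = {
    "FirstPOS_ADJECTIVE": 0.7541,
    "FirstPOS_ADVERB": -1.3722,
    "FirstPOS_NOUN": 0.3541,
    "FirstPOS_VERB": 1.4182,
    "LastPOS_ADJECTIVE": -0.7722,
    "LastPOS_ADVERB": 2.2576,
    "LastPOS_NOUN": 0.0285,
    "LastPOS_VERB": 0.6038,
    "NP_VALUE": 1.2899,
    "REPEAT_SUP_GREATER_MEDIAN": 1.0475,
    "REPEAT_SUB_GREATER_MEDIAN": 0.8417,
    "DISTINCT_PERHOST_GREATER_THAN_MEDIAN": 0.8422,
}


def c_value(term, freq):
    """Standard C-value. `freq` maps candidate terms (tuples of words) to corpus frequency."""
    n = len(term)
    # T_a: longer candidates that contain `term` as a contiguous word sequence.
    containing = [
        other for other in freq
        if len(other) > n
        and any(other[i:i + n] == term for i in range(len(other) - n + 1))
    ]
    length_weight = math.log2(n)  # note: 0 for single-word terms
    if not containing:
        return length_weight * freq[term]
    mean_nested = sum(freq[other] for other in containing) / len(containing)
    return length_weight * (freq[term] - mean_nested)


def termhood(features):
    """Linear Termhood score from a dict of 0/1 features; missing features count as 0."""
    return TERMHOOD_INTERCEPT + sum(
        weight * features.get(name, 0) for name, weight in TERMHOOD_WEIGHTS.items()
    )


# Toy frequencies, invented for this example.
toy_freq = {
    ("mental", "status"): 10,
    ("altered", "mental", "status"): 4,
    ("mental", "status", "change"): 2,
}
print(c_value(("mental", "status"), toy_freq))
print(termhood({"FirstPOS_NOUN": 1, "LastPOS_NOUN": 1, "NP_VALUE": 1}))
```

In this sketch a nested term such as "mental status" is penalized by the average frequency of the longer candidates that contain it, which is the intent of the second branch of the C-value equation.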
Lexico-syntactic patterns used to identify relationships [25]
| Relationship | Patterns |
|---|---|
| Synonymy | "%also%known%as%", "(", "aka", "so called", "also called", "%also% referred% to%", "%referred% to%" |
| Hierarchy | "%such%as%", "%or other%", "%and other%", "%including%", "%associated with", "is", "are", "is ", "%type of%", "are " |
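The `%` characters in these patterns resemble SQL LIKE wildcards. Assuming that reading, the sketch below shows one way such patterns could be applied to sentences in Python. The translation to regular expressions, the whole-word treatment of patterns without `%`, and the choice of a pattern subset (the very broad "is" and "are" patterns are left out) are all assumptions for illustration, not a description of how SEAM matches them.

```python
import re

# A subset of the patterns from the table above.
SYNONYMY_PATTERNS = ["%also%known%as%", "aka", "so called", "also called", "%also% referred% to%"]
HIERARCHY_PATTERNS = ["%such%as%", "%or other%", "%and other%", "%including%", "%type of%"]


def like_to_regex(pattern):
    """Translate a LIKE-style pattern into a regex: '%' matches any run of characters."""
    body = ".*".join(re.escape(part) for part in pattern.split("%"))
    if "%" not in pattern:
        # Assumption: patterns without wildcards are matched as whole words.
        body = r"\b" + body + r"\b"
    return re.compile(body, re.IGNORECASE)


SYNONYMY_REGEXES = [like_to_regex(p) for p in SYNONYMY_PATTERNS]
HIERARCHY_REGEXES = [like_to_regex(p) for p in HIERARCHY_PATTERNS]


def classify(sentence):
    """Return the relationship types whose patterns occur in the sentence."""
    labels = []
    if any(rx.search(sentence) for rx in SYNONYMY_REGEXES):
        labels.append("synonymy")
    if any(rx.search(sentence) for rx in HIERARCHY_REGEXES):
        labels.append("hierarchy")
    return labels


print(classify("Delirium, also known as acute confusional state, is common."))
print(classify("Symptoms such as fatigue and other complaints were reported."))
```

A matching sentence only flags a candidate relationship; identifying which two terms the pattern links would need a further step that this sketch does not attempt.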
The breakdown in potential terms found from each corpus with those found in the existing ontology and matched to the UMLS Metathesaurus
| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | 39,863 | 14,407 | 23,936 | 1,520 | 66 | 163/166 | 1,138 | 8,021 |
| | 86,931 | 27,152 | 56,883 | 2,896 | 88 | 219/219 | 4,962 | 18,410 |
| | 83,368 | 78,358 | 4,268 | 1,342 | 198 | 198/198 | 4,695 | 15,843 |
SEAM term results for the ACMS targeted clinical corpus (n = 199), with 19 biomedical articles
| | | |
|---|---|---|
| | Filter 1: Direct UMLS matches in both corpora | 81 |
| | Filter 2: Partial matches to UMLS in both corpora | 101 |
| | Filter 3: Non-matches with c-value > 50 in clinical corpus or both | - |
| | Filter 4: Non-matches with Termhood score > 4.6 in clinical corpus or both | - |
| | Combined recommended terms with the same CUI | -9 |
| | Fully Matched | 12 |
| | Partially Matched | 13 |
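The four term filters and the CUI merge listed in this table can be read as a cascade. The following Python sketch shows one possible reading. The `Candidate` fields, the function names, the `nonmatch_requires_both` switch (to cover both the "in clinical corpus or both" and the "in both corpora" wording used across the result tables), and the placeholder CUI strings are assumptions for illustration; only the thresholds 50 and 4.6 come from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    text: str
    in_clinical: bool
    in_biomedical: bool
    umls_match: Optional[str] = None  # "direct", "partial", or None (hypothetical encoding)
    cui: Optional[str] = None         # UMLS concept identifier, if matched
    c_value: float = 0.0
    termhood: float = 0.0


def assign_filter(c, nonmatch_requires_both=True, c_value_min=50.0, termhood_min=4.6):
    """Return the number of the first filter that recommends the candidate, or None."""
    in_both = c.in_clinical and c.in_biomedical
    if in_both and c.umls_match == "direct":
        return 1
    if in_both and c.umls_match == "partial":
        return 2
    if c.umls_match is None:
        eligible = in_both if nonmatch_requires_both else c.in_clinical
        if eligible and c.c_value > c_value_min:
            return 3
        if eligible and c.termhood > termhood_min:
            return 4
    return None


def recommend(candidates, nonmatch_requires_both=True):
    """Keep candidates passing any filter, merging those that share a CUI."""
    seen_cuis = set()
    kept = []
    for c in candidates:
        if assign_filter(c, nonmatch_requires_both) is None:
            continue
        if c.cui is not None:
            if c.cui in seen_cuis:
                continue  # same concept already recommended
            seen_cuis.add(c.cui)
        kept.append(c)
    return kept


# Toy candidates; the CUI value is a placeholder, not a real UMLS identifier.
toy = [
    Candidate("delirium", True, True, "direct", "CUI-A"),
    Candidate("acute delirium", True, True, "partial", "CUI-A"),
    Candidate("waxing and waning", True, False, c_value=60.0),
    Candidate("sundowning", True, True, termhood=5.0),
    Candidate("the patient", True, True, c_value=10.0, termhood=1.0),
]
print([c.text for c in recommend(toy)])
print([c.text for c in recommend(toy, nonmatch_requires_both=False)])
```

Merging by CUI after filtering mirrors the negative "Combined recommended terms with the same CUI" row, which subtracts duplicates of a concept from the filter totals.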
SEAM relationship results for the ACMS targeted clinical corpus (n = 199), with 19 biomedical articles
| | | |
|---|---|---|
| | Filter 1: TF-IDF | 16 |
| | Filter 2: LSP | 21 |
| | Filter 3: ASIUM | 14 |
| | Filter 4: Context Vectors | - |
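The TF-IDF filter relies on the TF-IDF weights and cosine similarity given in the equations table. The Python sketch below computes both, assuming TF is the raw count normalized by the most frequent keyword in the document and IDF is log(N / n_i). The function names and the toy documents are assumptions for illustration, and the sketch does not cover how SEAM turns similarity scores into recommended relationships.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {keyword: weight} dict per document."""
    n_docs = len(docs)
    # n_i: number of documents in which keyword k_i appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        if not counts:
            vectors.append({})
            continue
        max_f = max(counts.values())  # max_z f_{z,j}
        vectors.append({
            term: (f / max_f) * math.log(n_docs / doc_freq[term])
            for term, f in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy documents, invented for this example.
toy_docs = [
    ["delirium", "confusion", "confusion"],
    ["delirium", "agitation"],
    ["echocardiogram", "ejection", "fraction"],
]
toy_vectors = tfidf_vectors(toy_docs)
print(cosine(toy_vectors[0], toy_vectors[1]))
print(cosine(toy_vectors[0], toy_vectors[2]))
```

A keyword that appears in every document gets an IDF of log(1) = 0 and so contributes nothing to the similarity, which is the usual motivation for the IDF factor.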
Expert review of terms for each use case, Recommended vs. Accepted terms. Qualitative terms are reported separately because they were not considered when the ontologies were first constructed
| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | 173 | 25 (14%) | 31 (18%) | 48 (28%) | 103 (60%) | 23 (13%) | 47 (26%) | 0.38 (Fair) |
| | 271 | 61 (23%) | 35 (13%) | 67 (25%) | 163 (60%) | 26 (9%) | 83 (31%) | 0.29 (Fair) |
| | 363 | 289 (80%) | N/A | 37 (10%) | 326 (90%) | 14 (4%) | 23 (6%) | 0.66 (Substantial) |
Expert review of relationships for each use case, Recommended vs. Accepted relationships
| | Relationship | Recommended | Accepted |
|---|---|---|---|
| | Synonymy | 51 | 33 (57%) |
| | Hierarchy | 6 | 2 (33%) |
| | Synonymy | 75 | 56 (75%) |
| | Hierarchy | 14 | 5 (36%) |
| | Synonymy | 127 | 93 (73%) |
| | Hierarchy | 16 | 11 (69%) |
SEAM term results for the MUS large clinical corpus (n = 696), with 47 biomedical articles
| | | |
|---|---|---|
| | Filter 1: Direct UMLS matches in both corpora | 134 |
| | Filter 2: Partial matches to UMLS in both corpora | 148 |
| | Filter 3: Non-matches with c-value > 50 in both corpora | 5 |
| | Filter 4: Non-matches with Termhood score > 4.6 in both corpora | 2 |
| | Combined recommended terms with the same CUI | -18 |
| | Fully Matched | 12 |
| | Partially Matched | 50 |
SEAM relationship results for the MUS large clinical corpus (n = 696), with 47 biomedical articles
| | | |
|---|---|---|
| | Filter 1: TF-IDF | - |
| | Filter 2: LSP | 40 |
| | Filter 3: ASIUM | 19 |
| | Filter 4: Context Vectors | 16 |
SEAM term results for the Echocardiogram large clinical corpus (n = 2874/5 = 575), with 232 case reports
| | | |
|---|---|---|
| | Filter 1: Direct UMLS matches in both corpora | 90 |
| | Filter 2: Partial matches to UMLS in both corpora | 180 |
| | Filter 3: Non-matches with c-value > 50 in both corpora | 71 |
| | Filter 4: Non-matches with Termhood score > 4.6 in both corpora | 89 |
| | Combined recommended terms with the same CUI | -70 |
SEAM relationship results for the Echocardiogram large clinical corpus (n = 2874/5 = 575), with 232 case reports
| | | |
|---|---|---|
| | Filter 1: TF-IDF | 66 |
| | Filter 2: LSP | 29 |
| | Filter 3: ASIUM | 25 |
| | Filter 4: Context Vectors | 1 |