Stephen T Wu, Hongfang Liu, Dingcheng Li, Cui Tao, Mark A Musen, Christopher G Chute, Nigam H Shah.
Abstract
OBJECTIVE: To characterise empirical instances of Unified Medical Language System (UMLS) Metathesaurus term strings in a large clinical corpus, and to illustrate what types of term characteristics are generalisable across data sources.
Year: 2012 PMID: 22493050 PMCID: PMC3392861 DOI: 10.1136/amiajnl-2011-000744
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Figure 1. The number of words in a term versus relative frequency of Unified Medical Language System (UMLS) terms with that number of words.
Figure 2. The number of characters in a term versus how many Unified Medical Language System (UMLS) terms had that number of characters.
Figure 3. Distribution of the most frequent terms in clinical versus biomedical data.
Top terms in clinical text (Mayo corpus) and biomedical text (Medline 2011), by term frequency
| (A) Clinical text | | (B) Biomedical text | |
| Term | Frequency | Term | Frequency |
| Patient | 38 434 437 | Patients | 10 393 786 |
| Not | 18 601 179 | Cells | 4 855 359 |
| History | 16 650 248 | Treatment | 4 103 013 |
| Pain | 15 125 464 | Study | 4 032 105 |
| Time | 14 667 600 | Results | 3 498 940 |
| Normal | 13 656 279 | Cell | 3 082 455 |
| Right | 13 181 157 | Using | 2 840 963 |
| Left | 13 170 124 | Effect | 2 754 055 |
| Daily | 10 923 371 | Activity | 2 610 750 |
| Well | 9 534 581 | Protein | 2 332 732 |
Top terms in clinical text by tf-idf weight
| Term | Frequency | Document frequency | tf-idf |
| Patient | 38 434 437 | 12 163 186 | 5.5E+07 |
| Not | 18 601 179 | 6 921 338 | 3.7E+07 |
| Pain | 15 125 464 | 4 883 178 | 3.5E+07 |
| History | 16 650 248 | 7 375 392 | 3.2E+07 |
| Normal | 13 656 279 | 5 265 335 | 3.1E+07 |
| Daily | 10 923 371 | 2 984 235 | 3.1E+07 |
| Right | 13 181 157 | 5 351 140 | 3.0E+07 |
| Left | 13 170 124 | 5 388 304 | 3.0E+07 |
| Time | 14 667 600 | 7 177 814 | 2.9E+07 |
| Day | 8 288 472 | 3 358 834 | 2.3E+07 |
Figure 4. Tf-idf values of the most frequent terms in clinical data.
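The tf-idf column can be approximated from the two count columns. The sketch below uses one common variant, tf × ln(N / df). The table does not say which variant or which total document count N was used, so the function name `tf_idf`, the natural-log choice, and the `N_DOCS` value are all assumptions made for illustration.

```python
import math


def tf_idf(term_freq: int, doc_freq: int, n_docs: int) -> float:
    """Raw term frequency weighted by ln(n_docs / doc_freq)."""
    if doc_freq <= 0 or n_docs < doc_freq:
        raise ValueError("need 0 < doc_freq <= n_docs")
    return term_freq * math.log(n_docs / doc_freq)


# N_DOCS is a placeholder assumption, not a figure taken from the source.
N_DOCS = 50_000_000

# Frequency and document frequency for "Patient" and "Time" from the table.
patient = tf_idf(38_434_437, 12_163_186, N_DOCS)
time_ = tf_idf(14_667_600, 7_177_814, N_DOCS)
print(f"{patient:.1e} {time_:.1e}")
```

Because the weight shrinks as document frequency grows, Time, which appears in more documents than Pain at a similar raw frequency, drops from fifth place by frequency to ninth by tf-idf.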
Top source vocabularies and their degree of utilisation, by number of unique term strings in clinical notes
| (A) UMLS | | (B) Clinical text (Mayo) | | | | (C) Biomedical text (Medline) | | |
| Source | Unique | Source | Unique | % Use | Frequency | Source | Unique | % Use |
| SNOMED-CT | 988 733 | CHV | 106 426 | 74.4 | 1 866 925 442 | MSH | 242 462 | 32.6 |
| MSH | 743 332 | SNOMED-CT | 94 788 | 9.6 | 1 538 745 839 | SNOMED-CT | 215 217 | 21.8 |
| MEDCIN | 726 724 | MSH | 51 584 | 6.9 | 753 847 562 | NCI | 101 807 | 58.0 |
| NCBI | 662 674 | NCI | 50 536 | 28.8 | 981 062 417 | CHV | 85 473 | 59.7 |
| RXNORM | 455 466 | RCD | 42 668 | 12.3 | 1 683 517 327 | NCBI | 84 129 | 12.7 |
| RCD | 346 922 | MEDCIN | 32 335 | 4.4 | 298 650 586 | RCD | 69 519 | 20.0 |
| LNC | 313 431 | SNMI | 30 280 | 18.5 | 629 881 044 | SNMI | 57 177 | 34.8 |
| ICD10 | 249 863 | MDR | 28 714 | 39.8 | 310 815 333 | SCTSPA | 56 735 | 3.8 |
| NCI | 175 679 | MTH | 21 642 | 15.3 | 866 386 287 | OMIM | 46 339 | 34.5 |
| SNMI | 164 069 | SCTSPA | 17 661 | 1.2 | 369 476 316 | MTH | 43 029 | 30.5 |
Frequency of terms from each source in clinical text is also shown.
CHV, Consumer Health Vocabulary; ICD10, International Classification of Diseases, 10th revision; LNC, Logical Observation Identifier Names and Codes (LOINC); MDR, Medical Dictionary for Regulatory Activities Terminology (MedDRA); MSH, Medical Subject Headings; MTH, UMLS Metathesaurus; NCBI, National Center for Biotechnology Information; NCI, NCI Thesaurus; OMIM, Online Mendelian Inheritance in Man; RCD, Clinical Terms Version 3 (Read Codes); SCTSPA, SNOMED Terminos Clinicos; SNMI, SNOMED International v3.5; SNOMED-CT, Systematized Nomenclature of Medicine - Clinical Terms.
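The % Use figures appear to be the number of unique strings from a source found in the corpus, divided by that source's unique strings in the UMLS. That reading is inferred from the table values rather than stated in the table. A minimal sketch, with `percent_use` as a hypothetical helper name:

```python
def percent_use(unique_in_corpus: int, unique_in_umls: int) -> float:
    """Share of a source vocabulary's unique strings seen in a corpus, in percent."""
    if unique_in_umls <= 0:
        raise ValueError("unique_in_umls must be positive")
    return 100.0 * unique_in_corpus / unique_in_umls


# SNOMED-CT and NCI rows: clinical text (Mayo) counts against the UMLS column.
snomed = percent_use(94_788, 988_733)
nci = percent_use(50_536, 175_679)
print(round(snomed, 1), round(nci, 1))  # 9.6 28.8
```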
Figure 5. (A) Frequencies of terms discovered in clinical versus biomedical text, by semantic group; (B) number of unique terms, by semantic group.
Figure 6. Percentage of unique terms that are noun phrase (NP) dominated, by semantic group.
Transferability of corpus-based filtering of the Unified Medical Language System (UMLS)
| | UMLS | | Mayo Clinic term occurrences | | | | i2b2/VA term occurrences | | | |
| Filter | Unique | % rdn | Unique | % exc | Matches (n) | % exc | Unique | % exc | Matches (n) | % exc |
| Full UMLS | 8 335 125 | – | 296 798 | – | 2.376×10^9 | – | 17 570 | – | 376 350 | – |
| 1. Sp. Char. | 5 146 096 | 38.26 | 296 798 | 0.00 | 2.376×10^9 | 0.00 | 17 570 | 0.00 | 376 350 | 0.00 |
| 2. MaxWord | 6 157 283 | 26.13 | 295 385 | 0.48 | 2.376×10^9 | 0.00 | 17 564 | 0.03 | 376 343 | 0.00 |
| 3. MaxChar | 6 477 250 | 22.29 | 296 516 | 0.10 | 2.376×10^9 | 0.00 | 17 569 | 0.01 | 376 349 | 0.00 |
| 4. Language | 5 610 576 | 32.69 | 296 167 | 0.21 | 2.375×10^9 | 0.05 | 17 552 | 0.10 | 376 234 | 0.03 |
| 5. Sources | 3 409 183 | 59.10 | 251 361 | 15.31 | 2.327×10^9 | 2.08 | 16 491 | 6.14 | 368 682 | 2.04 |
| 6. SemGroup | 7 798 937 | 6.43 | 273 300 | 7.92 | 2.289×10^9 | 3.68 | 16 343 | 6.98 | 361 018 | 4.07 |
| 7. EmpFilt | 296 798 | 96.44 | 296 798 | 3.56 | 2.376×10^9 | 0.00 | 17 371 | 1.13 | 319 258 | 15.17 |
| 8. TermFreq | 230 011 | 97.24 | 226 697 | 23.62 | 2.376×10^9 | 0.00 | 17 326 | 1.39 | 319 039 | 15.23 |
| Filters 1–8 | 181 523 | 97.82 | 181 523 | 38.84 | 2.244×10^9 | 5.57 | 15 139 | 13.84 | 301 473 | 19.90 |
| Filters 1–6 | 1 448 811 | 82.62 | 230 860 | 22.22 | 2.244×10^9 | 5.56 | 15 343 | 12.68 | 354 274 | 5.87 |
| Filters 1–5 | 1 594 674 | 80.87 | 250 192 | 15.70 | 2.327×10^9 | 2.09 | 16 486 | 6.17 | 368 676 | 2.04 |
The UMLS columns show % rdn, the reduction in lexicon size relative to the full UMLS (a larger % rdn means a smaller, more efficient lexicon). The Mayo and i2b2/VA columns show % exc, the exclusion rate: the share of UMLS terms found in the corpus that are no longer mapped once the filter is applied. Discrepancies in % exc between the two corpora indicate corpus differences.
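Both percentages can be recomputed from the counts in the table. The sketch below does so for two rows; `percent_drop` is a hypothetical helper name, and the formula is inferred from the table values rather than quoted from the authors.

```python
def percent_drop(before: int, after: int) -> float:
    """Percentage by which a count shrinks after filtering."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


FULL_UMLS = 8_335_125  # unique strings in the full UMLS
FULL_MAYO = 296_798    # unique UMLS strings matched in the Mayo corpus

# Filter 1 (Sp. Char.): reduction in lexicon size (% rdn).
rdn_sp_char = percent_drop(FULL_UMLS, 5_146_096)
# Filter 5 (Sources): exclusion of unique Mayo strings (% exc).
exc_sources = percent_drop(FULL_MAYO, 251_361)
print(round(rdn_sp_char, 2), round(exc_sources, 2))  # 38.26 15.31
```

A filter is attractive when % rdn is high and % exc stays low: filter 1 removes 38.26% of the lexicon while excluding nothing in either corpus.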