Jonathan D Wren, Jeffrey T Chang, James Pustejovsky, Eytan Adar, Harold R Garner, Russ B Altman.
Abstract
Longer words and phrases are frequently mapped onto shorter forms, such as abbreviations or acronyms, for efficiency of communication. These abbreviations are pervasive in all aspects of biology and medicine, and as the amount of biomedical literature grows, so does the number of abbreviations and the average number of definitions per abbreviation. Even more confusing, different authors will often abbreviate the same word or phrase differently. This ambiguity impedes our ability to retrieve information, integrate databases and mine textual databases for content. Efforts to standardize nomenclature, especially those doing so retrospectively, need to be aware of different abbreviatory mappings and spelling variations. To address this problem, there have been several efforts to develop computer algorithms to identify the mapping of terms between short and long form within a large body of literature. To date, four such algorithms have been applied to create online databases that comprehensively map biomedical terms and abbreviations within MEDLINE: ARGH (http://lethargy.swmed.edu/ARGH/argh.asp), the Stanford Biomedical Abbreviation Server (http://bionlp.stanford.edu/abbreviation/), AcroMed (http://medstract.med.tufts.edu/acro1.1/index.htm) and SaRAD (http://www.hpl.hp.com/research/idl/projects/abbrev.html). In addition to serving as useful computational tools, these databases serve as valuable references that help biologists keep up with an ever-expanding vocabulary of terms.
Year: 2005 PMID: 15608198 PMCID: PMC540091 DOI: 10.1093/nar/gki137
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Number of results returned when searching either PubMed or Ovid using the phrases above typed in exactly as shown
| Pattern | Search pattern | No. of results in PubMed | No. of results in Ovid |
|---|---|---|---|
| 1 | JNK | 5477 | 7902 |
| 2 | c-jun N-terminal kinase | 3773 | 2912 |
| 3 | c-jun NH2-terminal kinase | 503 | 731 |
| 4 | c-jun amino-terminal kinase | 3057 | 3039 |
| 5 | jun N-terminal kinase | 2451 | 3445 |
| 6 | #2 OR #3 OR #4 OR #5 | 4487 | 5860 |
| 7 | MAPK8 (official LocusLink name, ID#5599) | 2 | 3 |
| 8 | Mitogen activated protein kinase 8 | 381 | 382 |
The results were as of May 14, 2004. First, the gene name JNK is used as the query. Then its official name, according to LocusLink, is used (MAPK8). Notice the literature has more references to JNK, but the number retrieved depends upon how it is spelled. Retrieval numbers are more consistent with the standardized name.
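The combined search in pattern 6 can be reproduced programmatically. The following is a minimal sketch in Python, assuming a generic boolean syntax in which quoted phrases are joined by OR; the helper name `or_query` is hypothetical, and the exact syntax accepted by PubMed or Ovid may differ.

```python
def or_query(variants):
    """Join spelling variants into one boolean OR query string.

    Assumption: the target search engine accepts quoted phrases
    separated by the keyword OR. Duplicates are dropped
    case-insensitively, keeping the first spelling seen.
    """
    seen_keys = set()
    unique = []
    for variant in variants:
        variant = variant.strip()
        key = variant.lower()
        if variant and key not in seen_keys:
            seen_keys.add(key)
            unique.append(variant)
    return " OR ".join(f'"{v}"' for v in unique)


# The four long-form spellings from patterns 2 to 5 in the table.
jnk_variants = [
    "c-jun N-terminal kinase",
    "c-jun NH2-terminal kinase",
    "c-jun amino-terminal kinase",
    "jun N-terminal kinase",
]
query = or_query(jnk_variants)
print(query)
```

Generating the query from a list of variants keeps the search reproducible as new spellings are discovered, but it only covers variants someone has already enumerated.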
Summary statistics for each of the databases
| Database | Unique acronyms | Unique definitions | Total acronym-definition pairs | MEDLINE records processed | Last updated |
|---|---|---|---|---|---|
| ARGH | 206 348 | 767 609 | 885 060 | 12 808 695 | January 2004 |
| Stanford | 699 043 | 1 490 909 | 1 716 288 | 11 447 996 | March 2002 |
| AcroMed | 211 000 | 703 924 | 481 531 | 11 000 000 | December 2002 |
| SaRAD | 64 764 | 193 103 | 3 960 168 | 11 253 125 | January 2002 |
Comparison of the four databases described herein
| Database | Base method<sup>a</sup> | Query method<sup>b</sup> | Stem<sup>c</sup>? | Terms normalized<sup>d</sup>? | Quality evaluation<sup>e</sup>? | Grouped<sup>f</sup>? | Relative frequency<sup>g</sup>? | Concept mapping<sup>h</sup>? |
|---|---|---|---|---|---|---|---|---|
| ARGH | HR | P+W | N | N | N | N | Y | N |
| Stanford | DP | D | N | Y | Y | Y | N | N |
| AcroMed | NLP | D | Y | Y | N | Y | Y | Y |
| SaRAD | HS | D | Y | Y | N | Y | N | Y |
<sup>a</sup>HR, heuristic/rule-based; HS, heuristic/score-based; DP, dynamic programming (alignment) based; and NLP, natural language processing.
<sup>b</sup>Query (search) method available to the user to find terms: P, precise match; D, degenerate match (e.g. a search on JNK also retrieves JNK-1); and W, wildcard matching.
<sup>c</sup>Stemming removes plural endings.
<sup>d</sup>Term normalization treats certain characters or patterns as equivalent (e.g. ‘beta-carotene’ and ‘beta carotene’ would be considered the same term).
<sup>e</sup>Quality evaluation provides a score of how confident the algorithm was in pairing short and long forms.
<sup>f</sup>Grouping clusters together long-form terms considered to be conceptually the same definition (e.g. by stemming/normalizing or some other means).
<sup>g</sup>Relative frequency indicates how common one definition is relative to the others, as a percentage.
<sup>h</sup>Concept mapping associates extracted terms with higher-level concepts such as MeSH terms.
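Several of the features defined in the footnotes above (degenerate matching, stemming, term normalization, grouping and relative frequency) can be illustrated with a short Python sketch. This is not the implementation used by any of the four databases: the function names, the normalization rules (lowercasing, treating hyphens and spaces as equivalent), the naive plural stripping and the example pairs are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict


def normalize(term):
    """Lowercase and treat hyphens and whitespace runs as one space (assumed rules)."""
    return re.sub(r"[-\s]+", " ", term.lower()).strip()


def stem(term):
    """Naively strip a trailing plural 's'; real stemmers handle many more cases."""
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term


def group_definitions(pairs):
    """Count long forms per short form after normalizing and stemming them."""
    groups = defaultdict(Counter)
    for short_form, long_form in pairs:
        groups[short_form][stem(normalize(long_form))] += 1
    return groups


def relative_frequency(counter):
    """Express each grouped definition's count as a percentage of the total."""
    total = sum(counter.values())
    return {definition: 100.0 * n / total for definition, n in counter.items()}


def degenerate_match(query, short_forms):
    """Return short forms that equal or extend the query (e.g. JNK also finds JNK-1)."""
    q = query.lower()
    return [s for s in short_forms if s.lower().startswith(q)]


# Hypothetical extracted pairs, for illustration only.
pairs = [
    ("JNK", "c-jun N-terminal kinase"),
    ("JNK", "c-Jun N terminal kinases"),
    ("JNK", "c-jun NH2-terminal kinase"),
    ("JNK-1", "c-jun N-terminal kinase 1"),
]
groups = group_definitions(pairs)
jnk_frequencies = relative_frequency(groups["JNK"])
matches = degenerate_match("JNK", list(groups))
print(jnk_frequencies)
print(matches)
```

With these assumed rules, the first two JNK long forms collapse into one group while the NH2 spelling stays separate, which shows why normalization alone does not resolve every spelling variant.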
Figure 1. Screenshot of SaRAD. The user has searched for ‘SS’ and clicked to get details of the sub-definition ‘sjorgen's syndrome’. The possible filters are MeSH terms useful for limiting search results.