Abstract
MOTIVATION: The use or study of chemical compounds permeates almost every scientific field, and in each of them the amount of textual information is growing rapidly. There is a need to accurately identify chemical names within text for a number of informatics efforts such as database curation, report summarization, tagging of named entities and keywords, or the development/curation of reference databases.
Year: 2006 PMID: 17118146 PMCID: PMC1683569 DOI: 10.1186/1471-2105-7-S2-S3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Example of the principle by which a first-order Markov model (MM) works, using the words "ethanol" and "booze". State transition frequencies are calculated for each letter in a word (including the spaces on both sides of the word) and compared with the models the MM has been trained on, in this case chemicals and words. The probability of observing a sequence of letters within each model is calculated as the product of the probabilities of its state (character) transitions. To reflect a statistical distance between the two models, the log10 ratio of these probabilities is taken.
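The principle described in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the tiny training lists, the function names, the add-alpha smoothing for unseen transitions, and the `alphabet_size` value are all assumptions made so the example runs on its own.

```python
import math
from collections import defaultdict

def train(words):
    """Count character-to-character transitions, padding each word with spaces."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in words:
        padded = f" {word.lower()} "
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts

def log10_prob(word, counts, alphabet_size=64, alpha=1.0):
    """Sum of log10 transition probabilities, i.e. log10 of their product.

    Add-alpha smoothing (an assumption, not taken from the source) keeps
    unseen transitions from producing a probability of zero.
    """
    padded = f" {word.lower()} "
    total = 0.0
    for prev, nxt in zip(padded, padded[1:]):
        row = counts.get(prev, {})
        numerator = row.get(nxt, 0) + alpha
        denominator = sum(row.values()) + alpha * alphabet_size
        total += math.log10(numerator / denominator)
    return total

def score(word, chem_model, word_model):
    """log10 ratio of P(word | chemical model) to P(word | word model)."""
    return log10_prob(word, chem_model) - log10_prob(word, word_model)

# Toy training sets, hypothetical and far too small for real use.
chem_model = train(["ethanol", "methanol", "propanol", "butanol", "ethane"])
word_model = train(["booze", "book", "snooze", "zebra", "goose"])

ethanol_score = score("ethanol", chem_model, word_model)  # positive: looks chemical
booze_score = score("booze", chem_model, word_model)      # negative: looks like a word
```

Working in log space turns the product of transition probabilities into a sum, which avoids numerical underflow on long chemical names.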
Figure 2. MM training curves converge at different rates (light blue line = 200-period moving average). (A) MM training on non-scientific text, in this case Tolstoy's "War and Peace". Note that convergence is faster and more stable than when trained on scientific text (B), which is more complex. (C) Training on chemical names requires a relatively large training set, but reaches convergence.
Figure 3. The effects on precision and recall rates from using a cutoff score. Test sets containing chemical names and words were evaluated with an MM trained on both types of data, and cutoff scores ranging from 10 to zero were used to define which entries were valid. The data points from left to right reflect the precision and recall rates obtained by using each cutoff value, shown in descending order from 10 (far left) to zero (far right). The optimal tradeoff between precision and recall appears to be somewhere between a cutoff of one and two.
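The cutoff sweep behind Figure 3 can be illustrated as follows. This is a sketch under stated assumptions: the scores are made up for illustration and are not data from the study, the function name is hypothetical, and treating an entry as valid when its score is greater than or equal to the cutoff is an assumed convention.

```python
def precision_recall(scored, cutoff):
    """Return (precision, recall) when entries scoring >= cutoff are accepted.

    scored is a list of (score, is_chemical) pairs.
    """
    tp = sum(1 for s, is_chem in scored if s >= cutoff and is_chem)
    fp = sum(1 for s, is_chem in scored if s >= cutoff and not is_chem)
    fn = sum(1 for s, is_chem in scored if s < cutoff and is_chem)
    precision = tp / (tp + fp) if tp + fp else 1.0  # nothing accepted: no false positives
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up scores for illustration only, not data from the study.
toy = [(9.1, True), (4.2, True), (1.6, True), (0.4, True),
       (2.3, False), (0.9, False), (0.2, False), (-1.5, False)]

# Sweep the cutoff from 10 down to zero, as in the figure.
curve = {cutoff: precision_recall(toy, cutoff) for cutoff in range(10, -1, -1)}
```

Lowering the cutoff accepts more entries, so recall can only rise while precision tends to fall; the tradeoff point is chosen from the resulting curve.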
Scalability of MM term evaluation for chemical names when applied to a large corpus, in this case approximately 13.1 million MEDLINE records that contain approximately 7.4 million abstracts. Using these estimates, the overall precision for chemical term entry into the database is 82.7%.
| 42% | 46% | 48% | 54.7% | 3.1% | 203,985 | 92,473 | |
| 27% | 25% | 22% | 75.3% | 2.5% | 319,000 | 78,687 | |
| 5% | 3% | 5% | 95.7% | 1.2% | 202,655 | 8,782 | |
| 2% | 0% | 1% | 99.0% | 1.0% | 164,286 | 1,643 | |
| 0% | 0% | 0% | 100.0% | 0.0% | 162,728 | - | |
| 82.7% | |||||||
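The 82.7% figure can be reproduced from the last two numeric columns of the table. The column headers did not survive extraction, so this sketch assumes they hold the number of terms entered and the estimated number of false positives per row. That reading is an inference: it fits because each false-positive count is close to the term count multiplied by the mean of the row's first three percentages.

```python
# (terms entered, estimated false positives) per row, copied from the table.
# The column meanings are inferred, not stated in the surviving text.
rows = [
    (203_985, 92_473),
    (319_000, 78_687),
    (202_655, 8_782),
    (164_286, 1_643),
    (162_728, 0),  # the table shows "-" here; treated as zero
]

total_terms = sum(terms for terms, _ in rows)
total_false_positives = sum(fp for _, fp in rows)

# Overall precision = fraction of entered terms that are not false positives.
overall_precision = 1 - total_false_positives / total_terms
overall_precision_pct = round(overall_precision * 100, 1)
```

Under these assumptions the weighted result matches the 82.7% reported in the caption.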
Example of morphological (spelling) variations for a chemical, 8-SPT, as observed within MEDLINE abstracts. 8-SPT is also found abbreviated as 8SPT, 8SPTH, and 8-PSPT.
| # | Spelling variant | Occurrences | % of total |
| 1 | 8-(p-sulfophenyl)theophylline | 13 | 19.1% |
| 2 | 8-sulfophenyltheophylline | 10 | 14.7% |
| 3 | 8-sulphophenyltheophylline | 8 | 11.8% |
| 4 | 8-(p-sulphophenyl)theophylline | 5 | 7.4% |
| 5 | 8-(p-sulfophenyl)-theophylline | 4 | 5.9% |
| 6 | 8-(p-sulphophenyl)-theophylline | 4 | 5.9% |
| 7 | 8-(p-sulphophenyl) theophylline | 3 | 4.4% |
| 8 | 8-(p-sulfophenyl) theophylline | 3 | 4.4% |
| 9 | 8-p-sulpho-phenyltheophylline | 2 | 2.9% |
| 10 | 8(p-sulfophenyl)theophylline | 2 | 2.9% |
| 11 | 8-(sulfophenyl)theophylline | 1 | 1.5% |
| 12 | 8-p-sulfophenyl theophylline | 1 | 1.5% |
| 13 | 8-(4-sulfophenyl)theophyline | 1 | 1.5% |
| 14 | 8 (p-sulphophenyl) theophylline | 1 | 1.5% |
| 15 | 8-(sulfophenyl) theophylline | 1 | 1.5% |
| 16 | 8-p-sulfophenyltheophylline | 1 | 1.5% |
| 17 | 8-p-sulphophenyltheophylline | 1 | 1.5% |
| 18 | 8(p-sulfophenyl)-theophylline | 1 | 1.5% |
| 19 | 8-rho-(sulfophenyl)theophylline | 1 | 1.5% |
| 20 | 8-sulphophenyl-theophylline | 1 | 1.5% |
| 21 | 8-(p-sulfophenyl)-theophyllin | 1 | 1.5% |
| 22 | 8-(para-sulfophenyl)theophylline | 1 | 1.5% |
| 23 | 8-p-sulfophenyl-theophylline | 1 | 1.5% |
| 24 | 8-(p-sulfophenyl)-theophylline) | 1 | 1.5% |
8-SPT as an example of how variation in chemical nomenclature affects information retrieval. Here, Ovid was used to map 5 variations of the chemical name 8-SPT to subject headings. Only adenosine and theophylline were common to each variant tested. 8-SPT is an adenosine receptor antagonist. For comparison, PubMed retrieved 341 unique records when using each of the keywords separated by "OR" in the query.
| Chemical Name mapped to term(s) | 8-p-sulphophenyltheophylline | 8-sulphophenyltheophylline | 8-p-sulfophenyltheophylline | 8-SPT | 8SPT |
| Adenosine | X | X | X | X | X |
| Adenosine Triphosphate | X | X | X | ||
| Aorta | X | ||||
| Autonomic Nervous System | X | ||||
| Cerebellum | X | ||||
| Cerebral Cortex | X | ||||
| Coronary Circulation | X | ||||
| Coronary Vessels | X | ||||
| Creatine Kinase | X | ||||
| Endothelium, Vascular | X | ||||
| Heart | X | ||||
| Hippocampus | X | ||||
| Hyperhomocysteinemia | X | ||||
| Iris | X | ||||
| Muscle Contraction | X | ||||
| Muscle, Smooth | X | ||||
| Myocardial Reperfusion Injury | X | ||||
| Neurons | X | ||||
| Neutrophils | X | ||||
| Parasympathetic Nervous System | X | ||||
| Phenethylamines | X | ||||
| Rats, Wistar | X | ||||
| Receptors, Adrenergic, alpha-1 | X | ||||
| Receptors, Cell Surface | X | ||||
| Receptors, Purinergic | X | X | X | ||
| Receptors, Purinergic P1 | X | X | X | ||
| Receptors, Purinergic P2 | X | ||||
| Spinal Cord | X | ||||
| Synaptic Transmission | X | ||||
| Theophylline | X | X | X | X | X |
| Vasodilation | X | X | |||
| Xanthines | X | ||||
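The comparison query mentioned in the caption, in which each keyword is separated by "OR", can be assembled as shown here. This is only a sketch: the exact query string the authors submitted is not given, and wrapping each variant in double quotes is an assumption made to keep hyphenated names together.

```python
# The five name variants listed in the table header.
variants = [
    "8-p-sulphophenyltheophylline",
    "8-sulphophenyltheophylline",
    "8-p-sulfophenyltheophylline",
    "8-SPT",
    "8SPT",
]

# Quote each variant (assumed convention) and join them with OR.
query = " OR ".join(f'"{variant}"' for variant in variants)
```

A single OR query returns the union of the records matched by any variant, which is why it retrieves more than any one spelling alone.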
Figure 4. The number of spelling variants for a given chemical name is correlated with the number of times it has appeared within the literature. Here the number of spelling variants per chemical name is plotted against the total number of times the chemical name and all variants appeared within MEDLINE. The black line represents a 100-period moving average.