Elizabeth J Goldsmith, Saurabh Mendiratta, Radha Akella, Kathleen Dahlgren.
Abstract
MOTIVATION: With the increasing volume of scientific papers and the heterogeneous nomenclature of the biomedical literature, an improvement over the standard pattern matching available in existing search engines is required. Cognition Search Information Retrieval (CSIR) is a natural language processing (NLP) technology that possesses a large dictionary (lexicon) and large semantic databases, so that search can be based on meaning. Encoded synonymy, ontological relationships, phrases, and seeds for word sense disambiguation offer significant improvement over pattern matching. Thus, CSIR has the right architecture to form the basis for a scientific search engine.

RESULT: Here we have augmented CSIR to improve access to the MEDLINE database of scientific abstracts. New biochemical, molecular biological, and medical language and acronyms were introduced from curated web-based sources. The resulting system was used to interpret MEDLINE abstracts. Meaning-based search of MEDLINE abstracts yields high precision (estimated at >90%) and high recall (estimated at >90%) where synonyms, ontology, phrases, and sense seeds have been encoded. The present implementation can be found at http://MEDLINE.cognition.com.

CONTACT: Elizabeth.goldsmith@UTsouthwestern.edu, Kathleen.dahlgren@cognition.com
Year: 2009 PMID: 21347167 PMCID: PMC3041583
Source DB: PubMed Journal: Summit Transl Bioinform ISSN: 2153-6430
Figure 1: Architecture of CSIR
Cognition Dictionary by numbers
| Cognition’s Semantic Map (Based on Computational Linguistic Science) | |
|---|---|
| Word Stems | 506,000 Word stems |
| Words and Phrases | 536,000 Word senses or concepts |
| Meanings in context | 4,000,000 Semantic contexts |
| Different Word Meanings | 17,000 Ambiguous word definitions |
| Complex Word Series Meanings | 191,000 Phrases |
| Ontology or Taxonomy | 7,000 Nodes |
| Synonyms | 76,000 Thesaural concept groups |
Ontology of Biochemistry and Molecular Biology
| A. |
|---|
| Macromolecule-node |
| Protein-stuff |
| antibody |
| binding protein |
| enzyme |
| Nucleic-acid |
| Laboratory-procedure |
| Electrophoresis |
| Spectroscopy |
Figure 2: Coverage of MEDLINE
Precision and Recall: Comparison between Cognition and PubMed.
| Query | Cognition good/20 | Cognition bad/20 | Cognition total | PubMed good/20 | PubMed bad/20 | PubMed total |
|---|---|---|---|---|---|---|
| Genetic correlates of alcoholism | 18 | 2 | 1436 | 6 | 14 | 44 |
| DNA repair and aging | 17 | 3 | 1220 | 11 | 9 | 1265 |
| Drugs for fibromyalgia | 17 | 3 | 1484 | 9 | 11 | 220 |
| Genetic interactions of BCL2 | 18 | 2 | 876 | 8 | 11 | 19 |
| Oxidative stress in plants | 18 | 2 | 3122 | 9 | 11 | 3197 |
| Spectroscopy of amidohydrolases | 17 | 3 | 861 | 7 | 13 | 1142 |
| Benzene induced neuropathy | 18 | 2 | 220 | 6 | 1 | 7 |
| Birth defects from glycol ether | 16 | 4 | 20 | 13 | 7 | 61 |
| Depression in aging | 19 | 1 | 13381 | 7 | 13 | 3658 |
| Symptoms of type II diabetes mellitus | 18 | 2 | 241 | 7 | 13 | 24704 |
| Menopause and depression | 18 | 2 | 696 | 11 | 9 | 1146 |
| Treatment for bronchiectasis | 18 | 2 | 2163 | 6 | 14 | 3207 |
| OCD and anorexia | 20 | 0 | 176 | 14 | 6 | 247 |
| Proteolysis in SARS virus entry | 4 | 0 | 4 | 2 | 0 | 2 |
| Total | 280 | 60 | 18433 | 125 | 127 | 34080 |

| | Cognition | MEDLINE |
|---|---|---|
| Precision | 0.90 | 0.50 |
| Recall (assuming total recall is the total of the Cognition retrievals) | 0.99 | 0.54 |
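
The per-query precision behind the comparison can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: it assumes precision is the fraction of judged results marked good, and the function name `precision` and the three sample rows (copied from the table) are chosen here for illustration only.

```python
def precision(good, bad):
    """Fraction of judged results that were marked good (assumed definition)."""
    judged = good + bad
    if judged == 0:
        raise ValueError("no judged results for this query")
    return good / judged


# (query, Cognition good, Cognition bad, PubMed good, PubMed bad), from the table.
rows = [
    ("Genetic correlates of alcoholism", 18, 2, 6, 14),
    ("Benzene induced neuropathy", 18, 2, 6, 1),
    ("OCD and anorexia", 20, 0, 14, 6),
]

for query, cog_good, cog_bad, pm_good, pm_bad in rows:
    print(
        f"{query}: Cognition {precision(cog_good, cog_bad):.2f}, "
        f"PubMed {precision(pm_good, pm_bad):.2f}"
    )
```

Dividing by `good + bad` rather than a fixed 20 matters for queries where fewer than 20 results were returned, such as the PubMed results for "Benzene induced neuropathy".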
Finer-grained protein kinase ontology.
| B. |
|---|
| protein-kinases |
| protein-histidine-kinases |
| serine-threonine-kinases |
| AGC-kinases |
| STE-kinase |
| Tyrosine-kinase |
| ACK-kinase |
| EGFR-kinase |
| Tyrosine-Like-Kinase |
| MLK-kinase |
| RAF-kinase |
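
How an ontology of this kind might support meaning-based search can be sketched with a small query-expansion example. The nesting below is an assumption: the table lists the terms without their original indentation, so the parent of each term is inferred from the standard kinase classification, not taken from the source. The names `PARENT`, `ancestors`, `descendants`, and `expand_query` are hypothetical and are not part of CSIR.

```python
# Hypothetical child -> parent map. The nesting is inferred, not given in the table.
PARENT = {
    "protein-histidine-kinases": "protein-kinases",
    "serine-threonine-kinases": "protein-kinases",
    "AGC-kinases": "serine-threonine-kinases",
    "STE-kinase": "serine-threonine-kinases",
    "Tyrosine-kinase": "protein-kinases",
    "ACK-kinase": "Tyrosine-kinase",
    "EGFR-kinase": "Tyrosine-kinase",
    "Tyrosine-Like-Kinase": "protein-kinases",
    "MLK-kinase": "Tyrosine-Like-Kinase",
    "RAF-kinase": "Tyrosine-Like-Kinase",
}


def ancestors(term):
    """Return the chain of broader terms, nearest parent first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def descendants(term):
    """Return every narrower term below `term` in the ontology."""
    found = []
    stack = [term]
    while stack:
        current = stack.pop()
        for child, parent in PARENT.items():
            if parent == current:
                found.append(child)
                stack.append(child)
    return found


def expand_query(term):
    """Expand a query term with its narrower terms, so that a search for a
    class of kinases can also match documents naming a specific member."""
    return [term] + sorted(descendants(term))


print(expand_query("Tyrosine-kinase"))
print(ancestors("RAF-kinase"))
```

Under these assumptions, a query for "Tyrosine-kinase" would also match abstracts that mention only "EGFR-kinase", which plain pattern matching on the query string would miss.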