Leon French, Suzanne Lane, Lydia Xu, Paul Pavlidis.
Abstract
The ability to computationally extract mentions of neuroanatomical regions from the literature would help link those regions to other entities within and outside of an article. Examples include extracting reports of connectivity or region-specific gene expression. To facilitate text mining of neuroscience literature, we have created a corpus of manually annotated brain region mentions. The corpus contains 1,377 abstracts with 18,242 brain region annotations. Interannotator agreement was evaluated for a subset of the documents and was 90.7% and 96.7% for strict and lenient matching, respectively. We observed a large vocabulary of over 6,000 unique brain region terms and 17,000 words. For automatic extraction of brain region mentions, we evaluated simple dictionary methods and complex natural language processing techniques. The dictionary methods based on neuroanatomical lexicons recalled 36% of the mentions with 57% precision. The best performance was achieved using a conditional random field (CRF) with a rich feature set. Features were based on morphological, lexical, syntactic and contextual information. The CRF recalled 76% of mentions at 81% precision; when partial matches are counted, recall and precision increase to 86% and 92%, respectively. We suspect that a large amount of the error is due to coordinating conjunctions, previously unseen words and brain regions of less commonly studied organisms. We found context windows, lemmatization and abbreviation expansion to be the most informative techniques. The corpus is freely available at http://www.chibi.ubc.ca/WhiteText/.
Keywords: conditional random field; corpus; natural language processing; neuroanatomy; text mining
Year: 2009 PMID: 19750194 PMCID: PMC2741206 DOI: 10.3389/neuro.11.029.2009
Source DB: PubMed Journal: Front Neuroinform ISSN: 1662-5196 Impact factor: 4.081
Figure 1. A representative annotated abstract with several expanded abbreviations (Gabbott et al.).
Top 20 context features from the text-only CRF.
| Token type | Position | Count | CRF weight | Normalized score |
|---|---|---|---|---|
| the | Previous token | 28,376 | 11.4 | 117.2 |
| and | Previous token | 13,109 | 10.8 | 102.8 |
| | Previous token | 16,811 | 10.4 | 101.3 |
| from | Previous token | 2,295 | 10.4 | 80.6 |
| in | Previous token | 12,203 | 8.5 | 80.1 |
| to | Previous token | 6,630 | 9.1 | 79.9 |
| with | Previous token | 2,957 | 9.8 | 78.1 |
| that | Previous token | 3,581 | 9.2 | 75.5 |
| rat | Previous token | 777 | 10.4 | 69.2 |
| into | Previous token | 758 | 9.6 | 63.9 |
| monkey | Previous token | 216 | 11.8 | 63.6 |
| | Previous token | 10,944 | 6.7 | 61.9 |
| labeled | Previous token | 785 | 9.0 | 60.2 |
| projections | Second preceding token | 904 | 8.6 | 58.3 |
| The | Previous token | 3,274 | 7.0 | 56.4 |
| or | Previous token | 1,198 | 7.9 | 56.2 |
| mouse | Previous token | 171 | 10.9 | 56.0 |
| and | Next token | 13,108 | 5.8 | 54.7 |
| of | Previous token | 19,205 | 5.5 | 54.6 |
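
The context features in the table above (previous token, second preceding token, next token) come from a window around each token. The following is a minimal sketch of how such window features might be generated, not the authors' implementation; the function name `context_features`, the feature keys and the window size of 2 are assumptions chosen for illustration.

```python
from typing import Dict, List


def context_features(tokens: List[str], i: int, window: int = 2) -> Dict[str, str]:
    """Return context-window features for the token at position i.

    Hypothetical helper: feature names such as "prev_1" are illustrative only.
    """
    feats = {"token": tokens[i]}
    for offset in range(1, window + 1):
        # Tokens to the left: prev_1 is the previous token,
        # prev_2 is the second preceding token, and so on.
        if i - offset >= 0:
            feats[f"prev_{offset}"] = tokens[i - offset]
        # Tokens to the right: next_1 is the next token, and so on.
        if i + offset < len(tokens):
            feats[f"next_{offset}"] = tokens[i + offset]
    return feats


# Toy sentence, whitespace-tokenized for simplicity.
tokens = "projections from the superior colliculus and thalamus".split()
features = [context_features(tokens, i) for i in range(len(tokens))]
```

For the token "superior", this sketch yields "the" as the previous token and "from" as the second preceding token, the kind of cue that carries high weight in the table. A sequence labeler such as a CRF would consume one such feature dictionary per token.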
Top 40 frequently occurring mentions.
| Mention | Frequency |
|---|---|
| Retina | 313 |
| Retinal | 280 |
| Spinal cord | 256 |
| Cortical | 239 |
| Superior colliculus | 142 |
| Cortex | 140 |
| Olfactory bulb | 134 |
| Brainstem | 127 |
| Thalamic | 122 |
| Thalamus | 115 |
| Hippocampus | 108 |
| Hypothalamus | 100 |
| Lateral geniculate nucleus | 92 |
| Olfactory | 92 |
| Cerebellum | 86 |
| Thalamocortical | 85 |
| Suprachiasmatic nucleus | 83 |
| Amygdala | 78 |
| Hippocampal | 76 |
| Optic nerve | 74 |
| Forebrain | 73 |
| Striatum | 73 |
| Inferior colliculus | 72 |
| Visual cortex | 71 |
| Cerebral cortex | 69 |
| Basal forebrain | 68 |
| Nucleus of the solitary tract | 64 |
| Spinal | 64 |
| Cerebellar | 63 |
| Globus pallidus | 61 |
| Midbrain | 60 |
| Periaqueductal gray | 60 |
| Locus coeruleus | 59 |
| Basal ganglia | 57 |
| Nucleus accumbens | 55 |
| Substantia nigra | 55 |
| v2 | 55 |
| Area 17 | 54 |
| Prefrontal cortex | 52 |
Results from evaluated techniques.
| Name | Strict precision | Strict recall | Strict F-measure | Lenient precision | Lenient recall | Lenient F-measure |
|---|---|---|---|---|---|---|
| TextPresso Lexicon | 0.529 | 0.185 | 0.274 | 0.824 | 0.288 | 0.427 |
| Neuronames Lexicon | 0.572 | 0.355 | 0.438 | 0.839 | 0.521 | 0.643 |
| Features CRF | 0.751 | 0.595 | 0.664 | 0.889 | 0.704 | 0.786 |
| Lemma CRF | 0.773 | 0.681 | 0.724 | 0.890 | 0.784 | 0.834 |
| Text CRF | 0.811 | 0.717 | 0.761 | 0.924 | 0.818 | 0.868 |
| Features + Lemma + Text CRF | 0.813 | 0.761 | | 0.916 | 0.857 | 0.886 |
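
To make the strict and lenient figures concrete, here is a minimal sketch of span-level scoring. It assumes that strict matching requires identical span boundaries and that lenient matching accepts any character overlap; the source only says that lenient scoring counts partial matches, so the exact overlap rule is an assumption. The names `evaluate`, `overlaps` and `f_measure` are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def evaluate(gold: List[Span], predicted: List[Span], lenient: bool = False):
    """Return (precision, recall, F-measure) for predicted spans against gold."""
    def matches(x: Span, y: Span) -> bool:
        # Assumption: lenient = any overlap, strict = identical boundaries.
        return overlaps(x, y) if lenient else x == y

    # Predicted spans that match some gold span (precision numerator).
    matched_pred = sum(any(matches(p, g) for g in gold) for p in predicted)
    # Gold spans that are matched by some predicted span (recall numerator).
    matched_gold = sum(any(matches(g, p) for p in predicted) for g in gold)

    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Toy data: one exact match, one partial match, one miss.
gold = [(0, 8), (20, 30), (40, 50)]
predicted = [(0, 8), (22, 30), (60, 70)]
strict_scores = evaluate(gold, predicted)
lenient_scores = evaluate(gold, predicted, lenient=True)
```

In this toy case the partially overlapping prediction counts only under lenient scoring, which raises both precision and recall from one third to two thirds. As a check against the table, applying `f_measure` to the TextPresso Lexicon row's strict precision (0.529) and recall (0.185) gives 0.274, the value listed next in that row.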