Renaud Richardet, Jean-Cédric Chappelier, Martin Telefont, Sean Hill.
Abstract
MOTIVATION: In neuroscience, as in many other scientific domains, the primary form of knowledge dissemination is through published articles. One challenge for modern neuroinformatics is finding methods to make the knowledge from the tremendous backlog of publications accessible for search, analysis and the integration of such data into computational models. A key example of this is metascale brain connectivity, where results are not reported in a normalized repository. Instead, these experimental results are published in natural language, scattered among individual scientific publications. This lack of normalization and centralization hinders the large-scale integration of brain connectivity results. In this article, we present text-mining models to extract and aggregate brain connectivity results from 13.2 million PubMed abstracts and 630 216 full-text publications related to neuroscience. The brain regions are identified with three different named entity recognizers (NERs) and then normalized against two atlases: the Allen Brain Atlas (ABA) and the atlas from the Brain Architecture Management System (BAMS). We then use three different extractors to assess inter-region connectivity.
Year: 2015 PMID: 25609795 PMCID: PMC4426844 DOI: 10.1093/bioinformatics/btv025
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Overview of datasets, methods and models. Three named entity recognizers (NER) identify and normalize brain region mentions: BAMS and ABA (lexical-based) and BraiNER (machine learning-based). Three different extractors predict the connectivity probability of brain region co-occurrences: Filters takes a top–down filtering approach, Kernel is a machine learning-based classifier and Rules consists of hand-written extraction rules. Connectivity results are presented in a searchable web interface. In the future, feedback from the interface can be used to retrain the NERs and extractors for continuous model improvement.
Example of sentences exhibiting connectivity statements between brain regions
| Sample sentence | Connectivity statement, comment |
|---|---|
| The nucleus accumbens (AC) receives projections from both the substantia nigra (SN) and the ventral tegmental area (VTA) (Dworkin, 1988). | (SN, VTA) → AC |
| Substantial numbers of tyrosine hydroxylase-immunoreactive cells in the dorsal raphe nucleus (DR) were found to project to the nucleus accumbens (AC) (Stratford and Wirtshafter, 1990). | DR → AC |
| The dentate gyrus (DG) is, of course, not only an input link between the entorhinal cortex (Ent) and the hippocampus proper (CAs) but also a major site of projection from the hippocampus (CA), as are the amygdala (Amg), entorhinal cortex (Ent) and septum (Spt) (Izquierdo and Medina, 1997). | CAs → DG → Ent, (CA, Amg, Ent, Spt) → DG; complex, long-range relationships |
| This latter nucleus (N?), which projects to the striatum (CP), receives inputs from motor cortex (MO) as well as the basal ganglia (BG) and is situated to integrate these and then provide feedback to the basal ganglia (BG) (Strutz, 1987). | MO → N? → CP, BG ↔ N?; anaphora: ‘latter nucleus (N?)’ was defined in the previous sentence |
| In this review, we summarize a classic injury model, lesioning of the perforant path, which removes the main extrahippocampal input to the dentate gyrus (Perederiy and Westbrook, 2013). | Injury model, not normal conditions |
| The most commonly proposed mechanism is that the periaqueductal gray of the midbrain (PAG) or the cerebral cortex (Cx) have descending influences to the spinal cord (SpC) to modulate pain transmission at the spinal cord (SpC) level (Andersen, 1986). | PAG → SpC, Cx → SpC; ‘proposed’ implies a hypothesis, not a finding |
Abbreviations have been manually added.
NERs for brain regions
| NER name | Description | Brain regions | Terms |
|---|---|---|---|
| ABA | Lexicon from ABA Institute | 1197 | 1197 |
| ABA-SYN | ABA + automated synonyms enrichment from other lexica | 1197 | 3882 |
| BAMS | Lexicon from BAMS, version | 832 | 832 |
| BAMS-SYN | BAMS + automated synonyms enrichment from other lexica | 832 | 2705 |
| BraiNER | Machine learning-based NER (linear chain CRF) | ( | ( |
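The lexicon-based NERs in the table above can be pictured as dictionary matching followed by normalization of each synonym to its atlas entry. The following is a minimal sketch under stated assumptions, not the authors' implementation: the two-entry `LEXICON`, the function names and the longest-term-first regular expression are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical mini-lexicon: canonical region name -> accepted surface forms.
# A real ABA-SYN or BAMS-SYN lexicon would hold thousands of terms.
LEXICON = {
    "nucleus accumbens": ["nucleus accumbens", "accumbens nucleus"],
    "ventral tegmental area": ["ventral tegmental area", "VTA"],
}


def build_matcher(lexicon):
    """Compile one case-insensitive pattern over all terms, longest first."""
    term_to_region = {}
    for region, terms in lexicon.items():
        for term in terms:
            term_to_region[term.lower()] = region
    # Longer terms come first so that a multi-word name wins over a substring.
    ordered = sorted(term_to_region, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)
    return pattern, term_to_region


def find_regions(text, lexicon=LEXICON):
    """Return (start, end, canonical_region) for every lexicon hit in text."""
    pattern, term_to_region = build_matcher(lexicon)
    return [
        (m.start(), m.end(), term_to_region[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]
```

Calling `find_regions` on a sentence returns character spans together with the normalized region name, so that a synonym such as "accumbens nucleus" counts toward the same region as the canonical term. This is why the -SYN variants in the table list more terms for the same number of brain regions.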
Performance comparison of brain region NER models against the WhiteText corpus (partially matching spans)
| Model | Exact precision | Exact recall | Exact F-score | Lenient precision | Lenient recall | Lenient F-score |
|---|---|---|---|---|---|---|
| ABA lexicon | 58.4% | 11.1% | 18.6% | 89.9% | 16.9% | 28.5% |
| ABA-SYN lexicon | 58.4% | 21.9% | 31.9% | | 34.2% | 49.9% |
| BAMS lexicon | 61.1% | 11.0% | 18.6% | 90.7% | 16.2% | 27.5% |
| BAMS-SYN lexicon | 61.3% | 17.5% | 27.2% | 89.8% | 25.5% | 39.7% |
| WhiteText ( | 81.3% | 76.1% | 78.6% | 91.6% | | |
| BraiNER-W (features from WhiteText) | 83.6% (3.3) | 76.4% (4.6) | 79.8% (3.9) | 87.1% (3.6) | 77.8% (7.4) | 82.1% (5.8) |
| BraiNER (with additional features) | 88.4% (1.0) | 81.0% (1.8) | 84.6% (1.3) | | | |
For machine learning-based NERs [French and BraiNER], average values over 8-fold cross validation with splits at document level and 5 repetitions, including standard deviation in parentheses where appropriate.
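The difference between exact and lenient comparison can be made concrete with a short scoring routine. This sketch assumes that a lenient match means any character overlap between a predicted span and a gold span; the precise criterion used for the evaluation above is not stated here, so that rule and the function names are assumptions.

```python
def overlaps(a, b):
    """True if half-open spans a=(start, end) and b=(start, end) share a character."""
    return a[0] < b[1] and b[0] < a[1]


def score(gold, predicted, lenient=False):
    """Return (precision, recall, F-score) for predicted spans against gold spans."""
    if lenient:
        match = overlaps
    else:
        match = lambda a, b: a == b  # exact: identical boundaries required

    # A prediction is correct if it matches at least one gold span;
    # a gold span is found if at least one prediction matches it.
    correct_predictions = sum(any(match(p, g) for g in gold) for p in predicted)
    found_gold = sum(any(match(g, p) for p in predicted) for g in gold)

    precision = correct_predictions / len(predicted) if predicted else 0.0
    recall = found_gold / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score
```

Under this assumption, a prediction that captures only part of a multi-word region name is an error in the exact setting but a hit in the lenient one, which is consistent with the lenient scores in the table being higher than the exact ones.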
Evaluation of extraction models against the WhiteText corpus
| Extractor | Precision | Recall | F-score |
|---|---|---|---|
| All co-occurrences (all permutations) | 9% | | 16% |
| Filter sentence > 500 characters | 10% | 93% | 18% |
| Filter sentence with > 7 brain regions | 11% | 80% | 19% |
| Keep if contain trigger words | 15% | 53% | 23% |
| Keep nearest neighbor co-occurrence | 28% | 51% | 36% |
| All filters (FILTERS) | 45% | 31% | 37% |
| Shallow linguistic kernel (KERNEL) | 60% | 68% | |
| Ruta rules (RULES) | | 12% | 21% |
| FILTERS and KERNEL | 66% | | |
| FILTERS and RULES | 80% | 7% | 13% |
| KERNEL and RULES | 81% | 10% | 18% |
| FILTERS and KERNEL and RULES | | 7% | 12% |
| (FILTERS or KERNEL) and RULES | 80% | 11% | 19% |
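The four filter rows of the table above can be chained into a single top-down extractor. The sketch below is illustrative only: the trigger-word list is hypothetical, "nearest neighbor" is assumed to mean adjacent region mentions within a sentence, and the function name and input format are not taken from the original work.

```python
# Hypothetical trigger stems; the actual list is not given in the text.
TRIGGERS = ("project", "afferent", "efferent", "input", "innervat")


def filter_candidates(sentence, regions, max_chars=500, max_regions=7,
                      triggers=TRIGGERS):
    """Return candidate (region, region) pairs that survive all filters.

    regions is a list of (start, end, name) mentions found in the sentence.
    """
    # Filters 1 and 2: drop overly long or overly crowded sentences.
    if len(sentence) > max_chars or len(regions) > max_regions:
        return []
    # Filter 3: keep the sentence only if it contains a trigger word.
    lowered = sentence.lower()
    if not any(trigger in lowered for trigger in triggers):
        return []
    # Filter 4: keep only pairs of adjacent mentions (assumed meaning of
    # "nearest neighbor"), skipping repeated mentions of the same region.
    ordered = sorted(regions)
    return [
        (left[2], right[2])
        for left, right in zip(ordered, ordered[1:])
        if left[2] != right[2]
    ]
```

Each filter discards candidates, so precision rises while recall falls as filters are added, matching the trend from the all co-occurrences row down to the FILTERS row.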
Statistics of the corpora used, extracted brain regions and connections using all three extractors (FILTERS or KERNEL or RULES)
| Corpus | Documents | Words | Brain regions (ABA) | Brain regions (BAMS) | Brain regions (BraiNER) | Connectivity statements (ABA) | Connectivity statements (BAMS) | Connectivity statements (BraiNER) |
|---|---|---|---|---|---|---|---|---|
| All PubMed abstracts | 13 293 649 | 2.1 × 10⁹ | 1 705 549 | 1 918 561 | 1 992 747 | 41 965 | 50 331 | 188 994 |
| Full-text neuroscience articles | 630 216 | 6.1 × 10⁹ | 2 327 586 | 2 514 523 | 2 751 952 | 62 095 | 72 602 | 279 100 |
The number of documents and words refers to non-empty documents after pre-processing. Two generic terms from BAMS (brain and nerves) are omitted.
Fig. 2. Number of extracted connections for the three extractors, on PubMed and full-text corpora using the ABA-SYN NER.
Fig. 3. Evaluation against AMBCA. AMBCA contains 16 954 distinct connected brain region pairs (AMBCA Pos) and 28 415 unconnected pairs (AMBCA Neg). Connectivity data extracted from the literature contain 7949 distinct connected brain region pairs (LIT), of which 904 are connected in AMBCA (LIT TP) and 261 are not connected in AMBCA (LIT TN).
Fig. 4. Comparison of the inter-region connectivity matrices, renormalized between 0 (white) and 1 (blue). Rows and columns correspond to ABA brain regions. Left: connection matrix from AMBCA (ipsilateral), using ABA’s inter-region connectivity model, with values representing a combination of connection strength and statistical confidence [see Fig. 4a of Oh ]. Middle: same matrix from AMBCA, but symmetrized (connection directionality is ignored, since the NLP models do not extract directionality). Right: connection matrix from the results extracted from the literature (LIT) with values representing the number of extracted connectivity statements, weighted by the estimated precision of each connectivity extractor.
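The three matrix operations named in the Fig. 4 caption (symmetrizing, renormalizing to the range 0 to 1, and weighting statement counts by extractor precision) can be sketched as follows. Several choices here are assumptions rather than the authors' method: symmetrization takes the larger of the two directed values, renormalization is min-max scaling, and the precision weights in the usage note are simply illustrative numbers.

```python
def symmetrize(matrix):
    """Ignore directionality: entry (i, j) becomes the larger of (i, j) and (j, i)."""
    n = len(matrix)
    return [[max(matrix[i][j], matrix[j][i]) for j in range(n)] for i in range(n)]


def renormalize(matrix):
    """Min-max scale all entries to the interval [0, 1]."""
    values = [v for row in matrix for v in row]
    low, high = min(values), max(values)
    if high == low:
        # A constant matrix carries no contrast; map everything to 0.
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - low) / (high - low) for v in row] for row in matrix]


def weighted_counts(statements, precision, n_regions):
    """Sum extractor precisions over statements into a symmetric n x n matrix.

    statements: iterable of (region_index_a, region_index_b, extractor_name).
    precision: mapping from extractor name to its estimated precision.
    """
    matrix = [[0.0] * n_regions for _ in range(n_regions)]
    for a, b, extractor in statements:
        weight = precision[extractor]
        matrix[a][b] += weight
        if a != b:
            matrix[b][a] += weight  # undirected: mirror the contribution
    return matrix
```

For example, `renormalize(weighted_counts(statements, {"FILTERS": 0.45, "KERNEL": 0.60}, n))` would give a literature matrix on the same 0 to 1 scale as a symmetrized atlas matrix, so the two can be compared cell by cell.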