Leon French, Po Liu, Olivia Marais, Tianna Koreman, Lucia Tseng, Artemis Lai, Paul Pavlidis.
Abstract
We describe the WhiteText project and its progress towards automatically extracting statements of neuroanatomical connectivity from text. We review progress to date on the three main steps of the project: recognition of brain region mentions, standardization of brain region mentions to neuroanatomical nomenclature, and connectivity statement extraction. We further describe a new version of our manually curated corpus that adds 2,111 connectivity statements from 1,828 additional abstracts. Cross-validation classification within the new corpus replicates results on our original corpus, recalling 67% of connectivity statements at 51% precision. The resulting merged corpus provides 5,208 connectivity statements that can be used to seed species-specific connectivity matrices and to better train automated techniques. Finally, we present a new web application that allows fast interactive browsing of the more than 70,000 sentences indexed by the system, as a tool for accessing the data and assisting in further curation. Software and data are freely available at http://www.chibi.ubc.ca/WhiteText/.
Keywords: connectome; information retrieval; natural language processing; text mining
Year: 2015 PMID: 26052282 PMCID: PMC4439553 DOI: 10.3389/fninf.2015.00013
Source DB: PubMed Journal: Front Neuroinform ISSN: 1662-5196 Impact factor: 4.081
Figure 1. Visualization of processing steps for an example sentence.
Figure 2. Flow chart depicting the origins and evaluations of the connectivity corpora. Arrows represent the use of annotated data from one corpus (source) to test or create a corpus (target). JCN, Journal of Comparative Neurology; BAMS, Brain Architecture Management System.
Summary connectivity statistics for curated and predicted corpora.
| | Original corpus (French et al.) | JCN evaluations | JCN predictions (French et al.) | MScanner |
|---|---|---|---|---|
| Abstracts | 1377 | 1828 | 12557 | 8264 |
| Source | Curation | Evaluations | Classification | Classification |
| Region annotations | Manual | Automatic | Automatic | Automatic |
| Region pairs | 22577 | 11825 | 156741 | 164555 |
| Connections | 3097 (16%) | 2111 (18%) | 28107 (22%) | 36566 (22%) |
| Recall | 70% | 67% | NA | NA |
| Precision | 50% | 51% | NA | NA |
This table presents summary counts of abstracts and sentence-level connectivity counts for abstracts with predicted and curated connections. Region pairs and connections are counted within sentences only. Connections were predicted with a shallow linguistic kernel trained on the Original Corpus for both the JCN Predictions (Journal of Comparative Neurology) and MScanner sets. Precision and recall values were computed with the shallow linguistic kernel in a cross-validation framework.
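For readers who want to relate the reported precision and recall to a single score, the following is a minimal sketch of the standard definitions. The function names and the example counts are illustrative assumptions and are not taken from the WhiteText software; the F1 value is not reported in the table and is derived here only from the published 51% precision and 67% recall.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts.

    tp: predicted connections that are correct
    fp: predicted connections that are wrong
    fn: true connections that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = f1_from_rates(precision, recall)
    return precision, recall, f1


def f1_from_rates(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not from the corpus).
p, r, f = precision_recall_f1(tp=2, fp=2, fn=1)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")

# F1 implied by the reported JCN evaluation figures (51% precision, 67% recall).
jcn_f1 = f1_from_rates(0.51, 0.67)
print(f"JCN evaluations F1 = {jcn_f1:.3f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two rates, which is why the combined score stays below 0.6 despite the 67% recall.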
Figure 3. Screenshot of example results from WhiteText Web. The top text input field attempts to match typed text to brain regions in NIFSTD while the user types. The query region column shows the original named brain regions that were matched to the given input of “Habenula” or its children. Sentence text is directly linked to the source abstract in PubMed. Query and connected regions are colored, with underlines marking words that suggest connectivity. Results can be sorted by all columns except the first. A single click on the gray flag in the “Report” column allows users to mark sentences that were incorrectly parsed. The “Export Table” link (top left) provides a tab-separated file containing the returned results.
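As a rough illustration of how an exported tab-separated file could be tallied into region-pair counts (for example, as a first step towards seeding a connectivity matrix), consider the sketch below. The column names `query_region` and `connected_region`, the helper `count_region_pairs`, and the sample rows are all assumptions made for this example; the actual header of the WhiteText Web export is not specified here and should be checked before use.

```python
import csv
import io
from collections import Counter

# Made-up placeholder rows. The column names are assumptions,
# not the real header of the WhiteText Web export.
SAMPLE_EXPORT = (
    "query_region\tconnected_region\tsentence\n"
    "Habenula\tRegion X\tPlaceholder sentence 1.\n"
    "Habenula\tRegion X\tPlaceholder sentence 2.\n"
    "Habenula\tRegion Y\tPlaceholder sentence 3.\n"
)


def count_region_pairs(tsv_text,
                       query_col="query_region",
                       connected_col="connected_region"):
    """Count how many sentences mention each (query, connected) region pair."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    pair_counts = Counter()
    for row in reader:
        pair_counts[(row[query_col], row[connected_col])] += 1
    return pair_counts


pair_counts = count_region_pairs(SAMPLE_EXPORT)
for (query, connected), n in pair_counts.most_common():
    print(f"{query}\t{connected}\t{n}")
```

To use a downloaded file instead of the inline sample, read it with `open(path, newline="", encoding="utf-8").read()` and pass the text to `count_region_pairs`, adjusting the two column arguments to match the real header.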
Species with the most associated connections in the combined corpus.
| Species name | NCBI species identifier | Connections |
|---|---|---|
| Rattus norvegicus | 10116 | 24690 |
| Cat | 9685 | 12469 |
| Rhesus monkey | 9544 | 3113 |
| Rat | 10118 | 2368 |
| Rabbit | 9986 | 1497 |
| Human | 9606 | 1258 |
| Macaca fascicularis | 9541 | 1218 |
| Mouse | 10090 | 1107 |
| Chicken | 9031 | 728 |
| Guinea-pig | 10141 | 611 |
Connection counts combine predicted and curated connections in the corpus. NCBI taxonomy identifiers are provided.
Top ten most frequent journals in the combined corpus.
| Journal name | Abstracts |
|---|---|
| The Journal of comparative neurology | 9815 |
| Brain Research | 1643 |
| Neuroscience | 938 |
| Experimental brain research | 627 |
| The Journal of neuroscience | 369 |
| Brain research bulletin | 365 |
| Neuroscience letters | 326 |
| Brain, behavior and evolution | 251 |
| Anatomy and embryology | 231 |
| The European journal of neuroscience | 207 |
Figure 4. Bar plot of yearly counts of abstracts with connectivity information in the combined corpus.