Fabio Rinaldi, Oscar Lithgow, Socorro Gama-Castro, Hilda Solano, Alejandra Lopez, Luis José Muñiz Rascado, Cecilia Ishida-Gutiérrez, Carlos-Francisco Méndez-Cruz, Julio Collado-Vides.
Abstract
Experimentally generated biological information needs to be organized and structured in order to become meaningful knowledge. However, new information is being published at a rate that manual curation is increasingly unable to match. Devising new curation strategies that leverage data mining and text analysis is, therefore, a promising avenue to help life science databases cope with the deluge of novel information. In this article, we describe the integration of text mining technologies into the curation pipeline of the RegulonDB database, and discuss how the process can enhance the productivity of the curators. Specifically, a named entity recognition approach is used to pre-annotate terms referring to a set of domain entities that are potentially relevant for the curation process. The annotated documents are presented to the curator, who, thanks to a custom-designed interface, can select sentences containing specific types of entities, thus restricting the amount of text that needs to be inspected. Additionally, a module capable of computing semantic similarity between sentences across the entire collection of articles to be curated is being integrated into the system. We tested the module using three sets of scientific articles and six domain experts. All these improvements are gradually enabling us to obtain a high-throughput curation process with the same quality as manual curation.
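The pre-annotation and sentence-selection steps described in the abstract can be illustrated with a minimal sketch. It assumes a simple dictionary lookup in place of the actual named entity recognizer, which the abstract does not specify; the lexicon, the entity type names, and the function names are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-lexicon mapping lowercase terms to entity types.
# A real curation pipeline would use a much larger, curated resource.
LEXICON = {
    "arac": "transcription_factor",
    "crp": "transcription_factor",
    "arabad": "operon",
    "arabinose": "effector",
}

SAMPLE = (
    "AraC activates the araBAD operon in the presence of arabinose. "
    "Cells were grown at 37 C. "
    "CRP binds upstream of the promoter."
)


@dataclass
class Annotation:
    start: int        # character offset where the term begins
    end: int          # character offset just past the term
    text: str         # surface form as it appears in the sentence
    entity_type: str  # type assigned by the lexicon


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def annotate(sentence):
    """Pre-annotate every token that matches a lexicon entry."""
    annotations = []
    for match in re.finditer(r"[A-Za-z0-9]+", sentence):
        entity_type = LEXICON.get(match.group().lower())
        if entity_type is not None:
            annotations.append(
                Annotation(match.start(), match.end(), match.group(), entity_type)
            )
    return annotations


def filter_sentences(text, wanted_types):
    """Keep only sentences mentioning at least one of the wanted entity types."""
    wanted = set(wanted_types)
    selected = []
    for sentence in split_sentences(text):
        annotations = annotate(sentence)
        if wanted & {a.entity_type for a in annotations}:
            selected.append((sentence, annotations))
    return selected
```

Calling `filter_sentences(SAMPLE, {"transcription_factor"})` keeps the first and third sentences and drops the one about growth conditions, which is the kind of reduction in text to inspect that the abstract describes.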
Year: 2017 PMID: 28365731 PMCID: PMC5467564 DOI: 10.1093/database/bax012
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Example of using the syntactic structure to validate a potential relationship.
Figure 2. Simplified example of distributional vectors.
Figure 3. A screenshot of the curation system’s interface.
Figure 5. Using filters to focus on sections of the text more likely to contain the desired information.
Figure 6. Self-filling forms can be used to speed up the curation process.
Figure 7. Preliminary version of the new interface, including cross-document semantic linking.
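The idea of comparing sentences through distributional vectors (Figure 2) and linking similar sentences across documents (Figure 7) can be sketched as follows. This is an illustrative toy under stated assumptions, not the module described in the article: it assumes raw co-occurrence counts within a fixed window as word vectors, a sum of word vectors as the sentence vector, and cosine similarity as the comparison. The corpus and all function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus; in practice the vectors would be built
# from the full collection of articles to be curated.
CORPUS = [
    "AraC activates the araBAD operon",
    "AraC represses the araBAD operon",
    "Cells were grown in minimal medium",
]


def tokenize(sentence):
    """Lowercase the sentence and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def sentence_vector(sentence, vectors):
    """Represent a sentence as the sum of its words' context vectors."""
    total = Counter()
    for word in tokenize(sentence):
        total.update(vectors.get(word, {}))
    return total


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


VECTORS = cooccurrence_vectors(CORPUS)
SENTENCE_VECTORS = [sentence_vector(s, VECTORS) for s in CORPUS]
```

With these vectors, `cosine(SENTENCE_VECTORS[0], SENTENCE_VECTORS[1])` is higher than `cosine(SENTENCE_VECTORS[0], SENTENCE_VECTORS[2])`, because the first two sentences share most of their context words while the third shares none. A cross-document linking step could rank candidate sentences by this score.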