Carlos Rodríguez-Penagos, Heladia Salgado, Irma Martínez-Flores, Julio Collado-Vides.
Abstract
BACKGROUND: Manual curation of biological databases, an expensive and labor-intensive process, is essential for high-quality integrated data. In this paper we report the implementation of a state-of-the-art Natural Language Processing system that creates computer-readable networks of regulatory interactions directly from different collections of abstracts and full-text papers. Our major aim is to understand how automatic annotation using text-mining techniques can complement manual curation of biological databases. We implemented a rule-based system to generate networks from different sets of documents dealing with regulation in Escherichia coli K-12.
Year: 2007 PMID: 17683642 PMCID: PMC1964768 DOI: 10.1186/1471-2105-8-293
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Annotation workflow. A suggested workflow for parallel manual and automatic annotations of transcriptional regulation, with manual review of automatically generated networks shown in shaded lines. Curators would check the interactions mined from text, since they would be provided with the reference papers and the textual segments from which the system retrieved them.
Document sets (corpora) used in this work
| Corpus | Documents | Size | Type | Description |
|---|---|---|---|---|
| RegulonDB Network References | 724 | 24.9 | full-text | Full-text papers from the RegulonDB database references that curators have identified as referring specifically to the regulatory network, as opposed to those referring to other objects from the database. |
| RegulonDB papers | 2,475 | 99 | full-text | Full-text papers from the complete RegulonDB references that we were able to access and download. |
| RegulonDB Abstracts | 3,075 | 3.3 | abstracts | Abstracts from the complete RegulonDB references, as of June 2006. |
| RegulonDB search strategies | 12,059 | 12.3 | abstracts | Corpus generated by using the RegulonDB curators' search strategies, without any subsequent filtering. |
| EcoCyc Abstracts | 13,334 | 14.4 | abstracts | Abstracts from references in the 2006 EcoCyc database that describes the genome and the biochemical machinery of E. coli K-12. |
| STRING-IE | 58,312 | 10.7 | sentences | Corpus of distinct sentences generated by the STRING-IE team by searching in PubMed for " |
Description of the different full text and abstract corpora used for extraction of regulatory interactions. The document sets are based on PubMed searches and on reference lists from database curation efforts.
Figure 2. Corpus coverage of transcriptional regulation in E. coli K-12. The Venn diagram illustrates overlapping coverage in the corpora used in this work, with dots representing papers relevant for transcriptional regulation in E. coli K-12. Different selection strategies (keyword searches on PubMed and references from curated databases) result in diverse document sets, which in some cases share groups of documents and also contain non-relevant papers.
Final network extraction system evaluation metrics
| Unique interactions (1) | Matches, function undetermined (2) | Matches with function (3) | Overall matches (4) | % correct (5) | Recall (6) | Precision 1 (7) | Precision 2 (8) | F-measure (9) |
|---|---|---|---|---|---|---|---|---|
| 3108 | - | 3397 | 3397 | 100% | 100% | - | - | - |
| 3148 | 768 | 661 | 1429 | 45% | 45% | 0.45 | 0.77 | 0.57 |
| 2649 | 569 | 535 | 1104 | 41% | 35.5% | 0.41 | 0.72 | 0.47 |
| 2650 | 711 | 605 | 1316 | 49% | 42% | 0.49 | 0.78 | 0.55 |
| 2202 | 522 | 491 | 1013 | 46% | 32% | 0.46 | 0.74 | 0.45 |
| 1643 | 555 | 471 | 1026 | 62% | 33% | 0.62 | 0.85 | 0.47 |
| 1354 | 426 | 385 | 811 | 59% | 26% | 0.59 | 0.81 | 0.39 |
| 627 | 262 | 140 | 402 | 64% | 12% | 0.64 | 0.95 | 0.22 |
| 554 | 217 | 114 | 331 | 59% | 10% | 0.59 | 0.91 | 0.19 |
| 718 | 254 | 146 | 400 | 55% | 12% | 0.55 | 0.91 | 0.22 |
| 630 | 207 | 121 | 328 | 52% | 10% | 0.52 | 0.86 | 0.18 |
| 691 | 199 | 143 | 342 | 49% | 11% | 0.49 | 0.90 | 0.19 |
| 628 | 170 | 118 | 288 | 45% | 9% | 0.45 | 0.86 | 0.16 |
| 414 | 207 | 115 | 322 | 77% | 10% | 0.77 | 1.00 | 0.18 |
| 370 | 180 | 97 | 277 | 74% | 8% | 0.74 | 0.99 | 0.16 |
An asterisk [*] next to a source name indicates that no multiple-unit objects (such as the individual elements of two-component systems, or operons) were added. The RegulonDB "dual" interactions (that is, those presenting both activation and repression) are counted here as two distinct interactions. AS represents a file containing all compiled interactions found in all textual sources used in this work, and constitutes the sum of all non-redundant interactions extracted from the full-text and abstract documents.
1. Unique, non-repeated interactions found for each file
2. Interactions that match RegulonDB interaction pairs, but whose function (activation or repression) was not determined by the system.
3. Interactions that match RegulonDB entries, and also match repressor/activator function
4. Overall matches ((2) + (3)), including both interactions with complete information and underspecified ones
5. % of total interactions in file which are correct: (4)/(1)
6. Recall: percentage of RegulonDB's 3108 interactions that is represented by correct interactions in file: ((4)*100)/3108
7. Precision 1: Number of overall matches (4)/unique interactions in file (1)
8. Precision 2: Number of overall matches (4)/interactions in file that contain a RegulonDB-annotated regulator
9. F-Measure (F = 2RP/(R + P))
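The column definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the class and field names are hypothetical, and the denominator of Precision 2 (interactions with a RegulonDB-annotated regulator) is not listed in the table, so that value is passed in rather than computed. The tabulated F values sit closer to what Precision 2 yields than to what Precision 1 yields; this is an inference from the numbers, not something the notes state, and small differences from the table are expected because the published figures are rounded.

```python
from dataclasses import dataclass

# Size of the RegulonDB reference network, as given in note 6.
REGULONDB_TOTAL = 3108


@dataclass
class NetworkRow:
    """One table row; field names are hypothetical labels for columns 1-3."""
    unique: int                 # (1) unique interactions in the file
    matches_no_function: int    # (2) pair matches, function undetermined
    matches_with_function: int  # (3) pair matches, function also matches

    @property
    def overall_matches(self) -> int:
        # (4) = (2) + (3)
        return self.matches_no_function + self.matches_with_function

    @property
    def precision1(self) -> float:
        # (5) and (7): overall matches over unique interactions, (4)/(1)
        return self.overall_matches / self.unique

    @property
    def recall(self) -> float:
        # (6) as a fraction: overall matches over RegulonDB's interactions
        return self.overall_matches / REGULONDB_TOTAL

    def f_measure(self, precision: float) -> float:
        # (9): harmonic mean of recall and the supplied precision
        r = self.recall
        return 2 * r * precision / (r + precision)


# Second data row of the table; 0.77 is the tabulated Precision 2.
row = NetworkRow(unique=3148, matches_no_function=768, matches_with_function=661)
print(row.overall_matches, round(row.precision1, 2), round(row.recall, 3))
print(round(row.f_measure(0.77), 3))
```

Passing the precision explicitly keeps the sketch honest about what can and cannot be recomputed from the published columns.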
Figure 3. Curated and retrieved articles for RegulonDB, by year. Comparison between all references initially retrieved from PubMed using the RegulonDB curators' search algorithms and the references that were finally reviewed in full to populate the database. Since the search algorithms are refined and changed continuously, this comparison is shown only for illustration.
Informational density of various corpora
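| Corpus (A) | Documents (B) | Size (C) | RegulonDB interactions (D) | Total interactions (E) | % in RegulonDB (F) | Average document size (G) | RegulonDB interactions per document (H) |
|---|---|---|---|---|---|---|---|
| RegulonDB Network References | 724 | 24.9 | 1026 | 1643 | 62.4 | 65.9 | 1.41 |
| RegulonDB papers | 2475 | 99.0 | 1316 | 2650 | 49.6 | 40.0 | 0.53 |
| RegulonDB Abstracts | 3075 | 3.3 | 322 | 414 | 77.7 | 1.07 | 0.1 |
| EcoCyc Abstracts | 13334 | 14.4 | 402 | 627 | 64.1 | 1.08 | 0.03 |
| RegulonDB search strategies | 12059 | 12.3 | 400 | 718 | 55.7 | 1.02 | 0.03 |
| STRING-IE | 58312 | 10.7 | 342 | 691 | 49.5 | 0.18 | 0.005 |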
A comparison of how informative various corpora are with regard to transcriptional regulation in E. coli K-12, as established from the number of RegulonDB-attested interactions they contain. The table includes the total number of documents and interactions (Cols. B & E), the number and percentage of interactions found in RegulonDB (Cols. D & F), and the average size of each document in the corpus (Col. G). Rows are ordered by number of RegulonDB interactions per document (Col. H).
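As a rough check on the density measure in Col. H, the ratio can be recomputed from the document counts and the RegulonDB-attested interaction counts. The sketch below assumes that each row corresponds to the corpus with the same document count in the corpora table; the function name is hypothetical. Recomputed values may differ slightly from the table in the last digit, and the two corpora listed at 0.03 are close enough that their relative order depends on rounding.

```python
# (corpus name, number of documents, RegulonDB-attested interactions)
# Names are matched to rows by document count, which is an assumption.
CORPORA = [
    ("RegulonDB Network References", 724, 1026),
    ("RegulonDB papers", 2475, 1316),
    ("RegulonDB Abstracts", 3075, 322),
    ("EcoCyc Abstracts", 13334, 402),
    ("RegulonDB search strategies", 12059, 400),
    ("STRING-IE", 58312, 342),
]


def interactions_per_document(documents: int, attested: int) -> float:
    """RegulonDB-attested interactions divided by corpus size in documents."""
    return attested / documents


# Rank corpora from most to least informative per document.
ranked = sorted(
    ((name, interactions_per_document(docs, hits)) for name, docs, hits in CORPORA),
    key=lambda item: item[1],
    reverse=True,
)

for name, density in ranked:
    print(f"{name}: {density:.3f}")
```

The curated full-text set comes out on top and the sentence-level corpus at the bottom, which is the ordering the table reports for those two rows.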