Zuoshuang Xiang, Wenjie Zheng, Yongqun He.
Abstract
BACKGROUND: Brucella species are Gram-negative, facultative intracellular bacteria that cause brucellosis in humans and animals. Sequences of four Brucella genomes have been published, and various Brucella gene and genome data and analysis resources exist. A web gateway that integrates these resources would greatly facilitate Brucella research. Brucella genome data in current databases are largely derived from computational analysis, without the experimental validation typically found in peer-reviewed publications. This is partially due to the lack of a literature mining and curation system able to efficiently incorporate the large amount of literature data into genome annotation. It is further hypothesized that literature-based Brucella gene annotation would increase understanding of complicated Brucella pathogenesis mechanisms.
Year: 2006 PMID: 16842628 PMCID: PMC1539029 DOI: 10.1186/1471-2105-7-347
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The BBP system architecture. A PubMed literature extraction and parsing program loads all Brucella-related papers from PubMed into the Brucella Limix database and the TextPresso-powered text processing pipeline. An automatic literature update program also extracts Brucella papers published in the most recent and preceding months. The Limix system supports efficient literature searching and data extraction, editing, and submission by integrating computational text mining programs with manual literature curation and management features. InterBru integrates Brucella genome data from different data sources, including our in-house curated data from the Brucella Limix database. The Brucella Genome Browser (BGBrowser) provides graphic visualization of Brucella genome data and offers many analysis tools. InterBru and BGBrowser share the same output page, which displays comprehensive Brucella gene and protein information.
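The automatic literature update step can be illustrated with a minimal sketch. It assumes the papers are retrieved through the NCBI E-utilities ESearch endpoint, which the source does not state; the function name, the plain `Brucella` search term, and the 60-day window are hypothetical choices for illustration. The sketch only builds the query URL and does not contact the network.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint (assumed here; the source does not
# say which interface the BBP update program actually uses).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_brucella_query_url(days_back=60, max_records=200):
    """Build an ESearch URL for PubMed records mentioning Brucella.

    days_back and max_records are illustrative defaults, not values
    taken from the paper.
    """
    params = {
        "db": "pubmed",         # search the PubMed database
        "term": "Brucella",     # hypothetical query term
        "reldate": days_back,   # only records from the last N days
        "datetype": "edat",     # interpret reldate against the Entrez date
        "retmax": max_records,  # cap on the number of returned PMIDs
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_brucella_query_url()
print(url)
```

Fetching this URL would return a list of PubMed identifiers, which a loader could then pass on to the literature database and the text processing pipeline.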
Public databases and software programs linked or used in BBP. Unique database identifiers (e.g., RefSeq ID) are usually stored for linking to public database web pages. Brucella literature abstracts and full text PDF files are also extracted from PubMed. Software programs are integrated into BBP in different ways.
| Resources | | |
| Databases | | |
| NCBI | PubMed | Biomedical publications |
| | MeSH | Medical Subject Headings |
| | RefSeq | Reference sequences |
| | Genome | Genome summary |
| | Gene | Gene information |
| | Protein | Protein information |
| | Nucleotide | Nucleotide information |
| | CDD | Conserved domains |
| | COGs | Clusters of orthologous groups |
| | Taxonomy | |
| | 3D structure DB | 3D structures (typically of related proteins) |
| | Feature tables | Protein coding genes, functional and structural RNAs, |
| EBI & SIB | Swissprot | Annotated protein data |
| | TrEMBL | Protein data |
| | InterPro | Protein families, domains and functions |
| | PROSITE | Protein families and domains |
| TIGR | CMR | Codon usage, EC numbers, condensed genome display, role category graph, GO terms, gene attribute, role category gene list, RNAs, terminators |
| | TIGRfam | TIGRfam assignments |
| Others | PFam | Protein domains and families |
| | ProDom | Protein domain families |
| Software programs integrated | | |
| NCBI | BLAST | Blastn, blastp, blastx, tblastn, tblastx, PSI/PHI Blast, Mega Blast, Blast 2 sequences |
| GMOD | GBrowse | Genome browsing and analysis |
| | TextPresso | NLP text mining |
| | PubSearch | Literature curation management |
| Other | BioPerl | Programming tools |
Figure 2. A usage scenario. (A) The InterBru database allows users to search public databases (e.g., RefSeq, Swissprot) for Brucella genes and proteins by different characteristics or identifiers. Here a user searches for the Brucella sodC gene. (B) BGBrowser localizes the sodC gene and its neighboring genes in Brucella genomes and provides many add-on gene analysis tools. (C) The detailed gene information table shared by InterBru and BGBrowser provides sequences and functional annotation of the Brucella sodC gene and its encoded protein, Cu/Zn superoxide dismutase. Links to various databases and detailed curated data from Limix are summarized. Local BLAST programs are also available from this page for similarity analysis.
Figure 3. MeSH Browser. All the Brucella publications can be browsed with the interactive MeSH-tree browser. The two clickable numbers on each line link to all publications that have the term as a MeSH term or as a major MeSH term, respectively. This figure shows the hierarchical MeSH tree structure leading to Mutagenesis and Gene Deletion.
Figure 4. Integrated computational text mining and manual curation in Limix. The computational text mining frame shows a typical TextPresso-type result after a query for the sodC keyword and the "mutant" category. All sodC words and words under the mutant category are labeled in colors. One sentence containing both sodC and a mutant-category word is highlighted in bold and counted as one match. A curator can highlight and copy text from this frame to an editable text field below the frame within the same page. The data can then be edited further and submitted to a backend database by clicking an 'update' button. Other literature retrieval approaches (e.g., keyword search) are also available in the computational text mining frame.
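The matching rule described in this caption (a sentence counts as one match when it contains both the keyword and at least one word from the chosen category) can be sketched as follows. This is a simplified illustration, not TextPresso's actual implementation: the `MUTANT_TERMS` lexicon, the function names, and the example text are all made up for demonstration, and the sentence splitter is deliberately naive.

```python
import re

# Hypothetical miniature stand-in for a "mutant" category lexicon.
MUTANT_TERMS = {"mutant", "mutants", "mutation", "mutations", "deletion", "knockout"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_matches(text, keyword, category_terms):
    """Return sentences containing the keyword and at least one category term."""
    matches = []
    for sentence in split_sentences(text):
        words = {w.lower() for w in re.findall(r"[A-Za-z0-9]+", sentence)}
        if keyword.lower() in words and words & category_terms:
            matches.append(sentence)
    return matches


# Invented example text, for illustration only.
abstract = (
    "A sodC deletion mutant was constructed in the test strain. "
    "The sodC gene encodes Cu/Zn superoxide dismutase. "
    "The mutant was then examined in mice."
)
matches = find_matches(abstract, "sodC", MUTANT_TERMS)
print(len(matches), matches)
```

Only the first sentence qualifies here: the second mentions sodC without a category word, and the third has a category word without sodC.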
Clustering of 75 attenuated Brucella genes found from literature search using the COG classification method.
| COG category | Number of genes |
| C: Energy production and conversion | 4 |
| D: Cell cycle control, mitosis and meiosis | 1 |
| E: Amino acid transport and metabolism | 15 |
| F: Nucleotide transport and metabolism | 10 |
| G: Carbohydrate transport and metabolism | 7 |
| H: Coenzyme transport and metabolism | 3 |
| I: Lipid transport and metabolism | 2 |
| J: Translation | 3 |
| K: Transcription | 3 |
| L: Replication, recombination and repair | 2 |
| M: Cell wall/membrane biogenesis | 6 |
| O: Posttranslational modification, protein turnover, chaperones | 5 |
| P: Inorganic ion transport and metabolism | 2 |
| Q: Secondary metabolites biosynthesis, transport and catabolism | 1 |
| R: General function prediction only | 2 |
| T: Signal transduction mechanisms | 4 |
| U: Intracellular trafficking and secretion | 9 |
| -: Not in COGs | 6 |
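As a quick sanity check on the table above, the per-category counts can be tallied in a few lines. The dictionary simply transcribes the counts as listed; the variable names are illustrative. The total exceeds the 75 genes named in the caption, which may mean that some genes are assigned to more than one COG category, although the source does not say so explicitly.

```python
# Counts transcribed from the COG table above, keyed by COG letter.
cog_counts = {
    "C": 4, "D": 1, "E": 15, "F": 10, "G": 7, "H": 3,
    "I": 2, "J": 3, "K": 3, "L": 2, "M": 6, "O": 5,
    "P": 2, "Q": 1, "R": 2, "T": 4, "U": 9, "-": 6,
}

total_assignments = sum(cog_counts.values())
largest_category = max(cog_counts, key=cog_counts.get)

print(total_assignments)  # sum of all category counts
print(largest_category)   # category with the most genes
```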
TextPresso-predicted and manually curated Brucella genetic interactions. One match is one highlighted sentence containing at least two genes and at least one word under the "association" or "regulation" category. Each match represents one predicted genetic interaction. Results are shown as manually verified/TextPresso-predicted interactions. The numbers of verified and predicted interactions depend on the numbers (#) of matches and papers used as cutoffs, and on whether full-text content is used in addition to paper abstracts.
| # of papers | # of matches | | | | | | # of papers | # of matches | | | | | | |
| | 1 | 2 | 3 | 4 | 5 | 10 | | 1 | 2 | 3 | 4 | 5 | 10 | 20 |
| 1 | 17/38 | 6/10 | 1/1 | 1/1 | 1/1 | 0/0 | 1 | 58/1330 | 50/213 | 42/105 | 34/81 | 31/63 | 16/26 | 6/7 |
| 2 | 3/5 | 3/5 | 1/1 | 1/1 | 1/1 | 0/0 | 2 | 46/172 | 46/172 | 41/94 | 34/73 | 31/55 | 16/26 | 6/7 |
| 3 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 0/0 | 3 | 33/71 | 33/71 | 33/71 | 30/60 | 28/45 | 15/24 | 6/7 |
| 4 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 0/0 | 4 | 28/50 | 28/50 | 28/50 | 28/50 | 27/40 | 15/23 | 6/7 |
| 8 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 8 | 8/10 | 8/10 | 8/10 | 8/10 | 8/10 | 8/10 | 6/7 |
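To make the cutoff logic and the verified/predicted cells easier to read, the sketch below computes the verified fraction for a cell and shows how a paper-count and match-count cutoff might be applied to a candidate gene pair. The function names are hypothetical, and reading each cell as "verified/predicted" follows the table caption; the example cells are taken directly from the table.

```python
def verified_fraction(cell):
    """Parse a 'verified/predicted' cell and return verified / predicted.

    Returns None when no interactions were predicted (a '0/0' cell).
    """
    verified, predicted = (int(part) for part in cell.split("/"))
    if predicted == 0:
        return None
    return verified / predicted


def passes_cutoff(n_papers, n_matches, min_papers, min_matches):
    """A candidate gene pair is kept only if it meets both cutoffs."""
    return n_papers >= min_papers and n_matches >= min_matches


# Cells copied from the table above.
loose = verified_fraction("58/1330")   # lowest cutoffs in the wider panel
strict = verified_fraction("6/7")      # highest match cutoff in the wider panel
empty = verified_fraction("0/0")

print(round(loose, 3), round(strict, 3), empty)
```

Raising the cutoffs trades coverage for reliability: far fewer interactions are predicted, but a much larger share of them is manually verified.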
Figure 5. Limix is used to find and confirm 62 Brucella genetic interactions. In the Brucella genetic interaction map, displayed in SVG form, any node can be clicked for detailed gene information, and any edge can be clicked to show a description of the specific interaction.