Pablo Pareja-Tobes, Marina Manrique, Eduardo Pareja-Tobes, Eduardo Pareja, Raquel Tobes.
Abstract
BG7 is a new system for de novo bacterial, archaeal and viral genome annotation, based on an approach specifically designed for annotating genomes sequenced with next-generation sequencing (NGS) technologies. The system is versatile and can annotate genes even at the preliminary assembly stage of a genome. It is especially efficient at detecting unexpected genes acquired horizontally from distant bacterial or archaeal genomes, phages, plasmids, and mobile elements. From the initial phases of the gene annotation process, BG7 exploits the massive availability of annotated protein sequences in databases. BG7 predicts ORFs and infers their function based on protein similarity to a wide set of reference proteins, integrating the ORF prediction and functional annotation phases in a single step. BG7 is especially tolerant of sequencing errors in start and stop codons, of frameshifts, and of assembly or scaffolding errors. The system is also tolerant of the high level of gene fragmentation frequently found in incompletely assembled genomes. The current version of BG7, which is developed in Java, takes advantage of Amazon Web Services (AWS) cloud computing features, but it can also be run locally on any operating system. BG7 is a fast, automated and scalable system that can cope with the challenge of analyzing the huge number of genomes being sequenced with NGS technologies. Its capabilities and efficiency were demonstrated during the 2011 EHEC outbreak in Germany, in which BG7 was used to obtain the first annotations the day after the first enterohemorrhagic E. coli genome sequences were made publicly available. The suitability of BG7 for genome annotation has been demonstrated for Illumina, 454, Ion Torrent, and PacBio sequencing technologies. Moreover, thanks to its plasticity, our system could easily be adapted to work with new technologies in the future.
Year: 2012 PMID: 23185310 PMCID: PMC3504008 DOI: 10.1371/journal.pone.0049239
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Pipeline of BG7.
Java programs are represented by blue ellipses, quality control programs by green trapezoids, and the blue cylinders connect the programs that provide the final results in different formats.
Number of genes predicted for E. coli K12.
| Feature | BG7 | NCBI |
| Protein coding genes | 4370 | 4145 |
| RNA | 156 | 175 |
BG7 detection of NCBI E. coli K12 genes.
| NCBI coding genes in BG7 annotation | Number of genes | % of genes |
| Detected Identical by BG7 | 3458 | 83.43 |
| Detected with minimal differences | 471 | 11.36 |
| Not Detected by BG7 | 216 | 5.21 |
| Total | 4145 | 100.00 |
Minimal differences: genes predicted in the same region of the genome, sharing a high percentage of sequence positions but differing in the start position and/or stop position.
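The percentages in the table follow directly from the gene counts. A minimal Python check is sketched below; the variable names are illustrative and not taken from the paper, and the combined "recovered" figure is simply the sum of the first two rows.

```python
# Counts of NCBI E. coli K12 coding genes, copied from the table above.
counts = {
    "Detected identical by BG7": 3458,
    "Detected with minimal differences": 471,
    "Not detected by BG7": 216,
}

total = sum(counts.values())  # total number of NCBI coding genes

# Percentage of NCBI genes in each category, rounded to two decimals.
percentages = {label: round(100 * n / total, 2) for label, n in counts.items()}

for label, pct in percentages.items():
    print(f"{label}: {counts[label]} ({pct:.2f}%)")

# Genes recovered either identically or with minimal differences.
recovered = total - counts["Not detected by BG7"]
recovered_pct = round(100 * recovered / total, 2)
print(f"Recovered by BG7: {recovered} of {total} ({recovered_pct:.2f}%)")
```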
Figure 2. BG7 annotation at different states of completion and error rates of the E. coli O104:H4 TY-2482 genome.
False positive and false negative genes in the BG7 annotation were determined with reference to the genes predicted by the Broad Institute in the annotation available from the “Escherichia coli O104:H4 Sequencing Project”, Broad Institute of Harvard and MIT (http://www.broadinstitute.org/). The gene sequences were downloaded on 20-Aug-2012 from: http://www.broadinstitute.org/annotation/genome/Ecoli_O104_H4/FeatureSearch.html. We used BLASTN to compare the nucleotide sequences of the BG7-predicted genes with those from the Broad annotation. The graph shows that the number of genes not detected by BG7 (false negatives) is very similar in two very different states of genome assembly with very different sequence error rates.
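The comparison described in this caption can be sketched in Python as follows. This is only an illustration under stated assumptions: the caption does not give the BLASTN parameters or the criteria used to call a match, so the identity and length thresholds, the gene identifiers, and the sample hits are all hypothetical. The only external convention relied on is the default column order of BLAST tabular output (`-outfmt 6`): qseqid, sseqid, pident, length, and so on.

```python
import csv
import io


def classify_genes(bg7_ids, ref_ids, blast_tsv, min_identity=90.0, min_length=100):
    """Split genes into false positives and false negatives from BLASTN hits.

    blast_tsv holds BLAST tabular output with BG7 genes as queries and
    reference genes as subjects. The thresholds are assumptions, not
    values reported in the paper.
    """
    matched_bg7, matched_ref = set(), set()
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if not row:
            continue
        qseqid, sseqid = row[0], row[1]
        pident, length = float(row[2]), int(row[3])
        if pident >= min_identity and length >= min_length:
            matched_bg7.add(qseqid)
            matched_ref.add(sseqid)
    false_positives = set(bg7_ids) - matched_bg7  # BG7 genes with no reference match
    false_negatives = set(ref_ids) - matched_ref  # reference genes BG7 did not detect
    return false_positives, false_negatives


# Hypothetical identifiers and hits, for illustration only.
bg7_genes = ["bg7_0001", "bg7_0002", "bg7_0003"]
ref_genes = ["ref_A", "ref_B", "ref_C"]
hits = [
    ["bg7_0001", "ref_A", "99.50", "900", "4", "0", "1", "900", "1", "900", "0.0", "1600"],
    ["bg7_0002", "ref_B", "85.00", "300", "45", "0", "1", "300", "1", "300", "1e-80", "300"],
]
blast_tsv = "\n".join("\t".join(fields) for fields in hits)

fp, fn = classify_genes(bg7_genes, ref_genes, blast_tsv)
print("False positives:", sorted(fp))
print("False negatives:", sorted(fn))
```

Stricter or looser thresholds change which hits count as matches, so the resulting counts are only comparable between assemblies when the same criteria are applied to both.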