Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern, Peter Meinicke.
Abstract
BACKGROUND: Metagenomics is an approach to the characterization of microbial genomes via the direct isolation of genomic sequences from the environment without prior cultivation. The amount of metagenomic sequence data is growing fast, while computational methods for metagenome analysis are still in their infancy. In contrast to genomic sequences of single species, which can usually be assembled and analyzed by many available methods, a large proportion of metagenome data remains as unassembled anonymous sequencing reads. One of the aims of all metagenomic sequencing projects is the identification of novel genes. The short length of the fragments (Sanger sequencing, for example, yields fragments of 700 bp on average) and the unknown phylogenetic origin of most fragments require approaches to gene prediction that differ from the currently available methods for genomes of single species. In particular, the large size of metagenomic samples requires fast and accurate methods with small numbers of false positive predictions.
Year: 2008 PMID: 18442389 PMCID: PMC2409338 DOI: 10.1186/1471-2105-9-217
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The figure illustrates possible localizations of open reading frames (ORFs) in a fragment (shown only for the forward strand). ORFs are shown as grey bars, »▹« denotes stop codons, »|« indicates the position of translation initiation site candidates. ORFs that are related by a common stop codon are grouped and we refer to them as ORF-sets. The box symbolizes the fragment range. Everything that might be located outside the box is invisible to gene prediction algorithms. Further explanations are given in section »Methods«.
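The grouping described in the Figure 1 caption can be illustrated with a short sketch. It is not the authors' implementation: the function name `find_orf_sets`, the `OrfSet` record, and the choice of ATG, GTG and TTG as candidate start codons are assumptions made for illustration. The sketch scans the three forward-strand reading frames, collects translation initiation site candidates, and closes an ORF-set at each in-frame stop codon. ORFs that may extend beyond the fragment are flagged, since anything outside the fragment range is invisible.

```python
from dataclasses import dataclass, field
from typing import List, Optional

START_CODONS = {"ATG", "GTG", "TTG"}  # assumed set of TIS candidate codons
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard stop codons


@dataclass
class OrfSet:
    """All ORF candidates in one frame that share the same stop codon."""
    frame: int                 # 0, 1 or 2 (forward strand only)
    stop: Optional[int]        # index of the stop codon's first base;
                               # None if the stop lies beyond the fragment
    starts: List[int] = field(default_factory=list)  # TIS candidate positions
    open_left: bool = False    # True if no in-frame stop precedes this
                               # ORF-set inside the fragment, so the true
                               # start may lie upstream of the fragment


def find_orf_sets(fragment: str) -> List[OrfSet]:
    """Group forward-strand ORF candidates of a fragment into ORF-sets."""
    seq = fragment.upper()
    orf_sets: List[OrfSet] = []
    for frame in range(3):
        starts: List[int] = []
        open_left = True  # nothing upstream of the fragment is visible
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon in STOP_CODONS:
                # Keep the set if it has a start candidate, or if it may
                # continue a gene that begins upstream of the fragment.
                if starts or (open_left and pos > frame):
                    orf_sets.append(OrfSet(frame, pos, starts, open_left))
                starts, open_left = [], False
            elif codon in START_CODONS:
                starts.append(pos)
        # Candidates whose stop codon lies beyond the fragment end.
        if starts or (open_left and len(seq) - frame >= 3):
            orf_sets.append(OrfSet(frame, None, starts, open_left))
    return orf_sets


if __name__ == "__main__":
    for orf_set in find_orf_sets("ATGAAAGTGCCCTAAGG"):
        print(orf_set)
```

Each `OrfSet` holds several alternative starts for one stop codon, which mirrors the caption's notion of ORFs related by a common stop. Choosing among those starts is a separate prediction step that this sketch does not attempt.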
Genomes of microbial species that were used for the evaluation of our method. The upper three species are archaea while the lower ten species belong to the bacterial domain. The table shows GenBank accession numbers (GenBank Acc.), and genome sizes (Size).
| Species | GenBank Acc. | Size (Mbp) |
| | | 2.2 |
| | | 1.7 |
| | | 2.6 |
| | | 0.6 |
| | | 7.2 |
| | | 4.2 |
| | | 2.5 |
| | | 2.2 |
| | | 4.6 |
| | | 1.6 |
| | | 6.3 |
| | | 1.7 |
| | | 1.1 |
Figure 2. Average gene prediction performance of the neural network in fragments of the lengths 100 to 2000 bp. The performance values from thirteen test species were averaged by arithmetic mean.
Mean and standard deviation for gene prediction performance of our method (Neural Net) and MetaGene. Performance was measured on 700 bp fragments that were randomly excised from each test genome to 5-fold coverage (ten replications per species). The harmonic mean is a measure that combines sensitivity and specificity.
| SENSITIVITY | SPECIFICITY | HARMONIC MEAN | ||||
| Species | Neural Net | MetaGene | Neural Net | MetaGene | Neural Net | MetaGene |
| 87.2 ± 0.21 | 92.7 ± 0.16 | 90.2 ± 0.17 | ||||
| 91.7 ± 0.17 | 92.7 ± 0.19 | 93.9 ± 0.10 | ||||
| 87.9 ± 0.22 | 92.7 ± 0.17 | 90.8 ± 0.16 | ||||
| 90.6 ± 0.37 | 91.1 ± 0.29 | 92.9 ± 0.28 | ||||
| 87.9 ± 0.11 | 85.1 ± 0.13 | 89.0 ± 0.08 | ||||
| 89.8 ± 0.14 | 89.3 ± 0.19 | 89.5 ± 0.14 | ||||
| 89.7 ± 0.24 | 89.2 ± 0.21 | 90.5 ± 0.13 | ||||
| 82.1 ± 0.25 | 88.4 ± 0.26 | 86.4 ± 0.19 | ||||
| 91.7 ± 0.16 | 90.9 ± 0.10 | 92.1 ± 0.07 | ||||
| 90.2 ± 0.14 | 89.6 ± 0.23 | 89.9 ± 0.15 | ||||
| 90.4 ± 0.14 | 91.4 ± 0.09 | 91.4 ± 0.12 | ||||
| 87.2 ± 0.21 | 90.8 ± 0.20 | 91.4 ± 0.15 | ||||
| 87.2 ± 0.27 | 71.2 ± 0.54 | 79.7 ± 0.45 | ||||
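The harmonic mean reported in the table above combines sensitivity and specificity into a single score. A minimal sketch of that calculation follows. The function name and the input values are hypothetical and are not taken from the table, and how sensitivity and specificity themselves are computed is not stated in this section.

```python
def harmonic_mean(sensitivity: float, specificity: float) -> float:
    """Harmonic mean of two performance values given on the same scale.

    Returns 0.0 when both inputs are zero to avoid division by zero.
    """
    total = sensitivity + specificity
    if total == 0:
        return 0.0
    return 2.0 * sensitivity * specificity / total


if __name__ == "__main__":
    # Hypothetical values in percent, chosen only to illustrate the formula.
    print(round(harmonic_mean(90.0, 80.0), 2))
```

Because the harmonic mean is pulled toward the smaller of the two values, a method cannot reach a high score by trading away specificity for sensitivity or vice versa.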
Translation initiation site prediction correctness (TIS correctness) and complete/incomplete classification accuracy (Gene Type Accuracy) of the Neural Net and MetaGene according to GenBank annotation. Performance was measured on 700 bp fragments that were randomly excised from each test genome to 5-fold coverage (mean and standard deviation for 10 replicates per species are given).
| TIS CORRECTNESS | GENE TYPE ACCURACY | |||
| Species | Neural Net | MetaGene | Neural Net | MetaGene |
| | 69.8 ± 0.32 | 73.6 ± 0.32 | 98.1 ± 0.05 | 97.2 ± 0.07 |
| | 69.4 ± 0.52 | 73.3 ± 0.52 | 99.0 ± 0.09 | 97.6 ± 0.12 |
| | 75.2 ± 0.58 | 82.9 ± 0.28 | 96.9 ± 0.16 | 97.6 ± 0.09 |
| | 86.5 ± 0.40 | 88.6 ± 0.64 | 99.1 ± 0.09 | 98.3 ± 0.21 |
| | 70.1 ± 0.45 | 73.0 ± 0.28 | 97.6 ± 0.08 | 96.9 ± 0.09 |
| | 79.7 ± 0.32 | 66.1 ± 0.42 | 98.6 ± 0.05 | 97.0 ± 0.08 |
| | 78.2 ± 0.49 | 73.4 ± 0.68 | 98.1 ± 0.08 | 96.6 ± 0.11 |
| | 68.1 ± 0.46 | 71.9 ± 0.45 | 98.1 ± 0.08 | 96.7 ± 0.13 |
| | 84.5 ± 0.31 | 78.2 ± 0.15 | 98.7 ± 0.06 | 97.0 ± 0.08 |
| | 87.3 ± 0.40 | 77.1 ± 0.33 | 99.2 ± 0.09 | 96.4 ± 0.16 |
| | 78.4 ± 0.22 | 81.0 ± 0.36 | 97.7 ± 0.03 | 97.2 ± 0.07 |
| | 86.6 ± 0.40 | 88.6 ± 0.47 | 99.0 ± 0.07 | 97.8 ± 0.10 |
| | 79.3 ± 0.77 | 79.9 ± 0.42 | 98.7 ± 0.13 | 96.9 ± 0.17 |
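The evaluation above relies on 700 bp fragments randomly excised from each genome to 5-fold coverage. The sketch below shows one way such a test set could be generated. It is an assumption-laden illustration, not the authors' procedure: the function name `excise_fragments` is hypothetical, the genome is treated as a linear string, start positions are drawn uniformly, and only the forward strand is sampled.

```python
import math
import random
from typing import List


def excise_fragments(genome: str, length: int = 700,
                     coverage: float = 5.0, seed: int = 0) -> List[str]:
    """Draw random fixed-length fragments until the target coverage is met.

    The number of fragments is chosen so that their total length is at
    least coverage * len(genome).
    """
    if length <= 0 or len(genome) < length:
        raise ValueError("fragment length must be positive and fit the genome")
    rng = random.Random(seed)  # seeded so that a replicate is reproducible
    n_fragments = math.ceil(coverage * len(genome) / length)
    last_start = len(genome) - length
    fragments = []
    for _ in range(n_fragments):
        start = rng.randint(0, last_start)  # inclusive on both ends
        fragments.append(genome[start:start + length])
    return fragments


if __name__ == "__main__":
    toy_genome = "ACGT" * 1750  # 7000 bp toy sequence, not a real genome
    fragments = excise_fragments(toy_genome)
    print(len(fragments), len(fragments[0]))
```

Repeating the call with ten different seeds would give ten replicates per species, from which a mean and standard deviation like those reported in the tables could be computed.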
Translation initiation site prediction performance of the new gene prediction algorithm (Neural Net) and MetaGene according to »reliable annotation subsets« (a subset of »verified genes« from »EcoGene« for Escherichia coli [28], all non-y genes of the Bacillus subtilis GenBank annotation, and the »PseudoCAP« annotation of Pseudomonas aeruginosa [29]). TIS prediction sensitivity and correctness were measured on artificial 700 bp fragments that were randomly excised from each test genome to 5-fold coverage. Mean and standard deviation over 10 replicates per species are shown.
| SENSITIVITY TIS | TIS CORRECTNESS | |||
| Species | Neural Net | MetaGene | Neural Net | MetaGene |
| 62.1 ± 1.43 | 84.1 ± 0.51 | 70.2 ± 0.64 | ||
| 75.1 ± 0.61 | 86.6 ± 0.57 | 77.5 ± 0.67 | ||
| 68.0 ± 0.22 | 80.7 ± 0.20 | 83.7 ± 0.36 | ||