Yongchu Liu, Jiangtao Guo, Gangqing Hu, Huaiqiu Zhu.
Abstract
BACKGROUND: Metagenomic sequencing is becoming a powerful technology for exploring micro-organisms from various environments, such as the human body, without isolation and cultivation. Accurately identifying genes from metagenomic fragments is one of the most fundamental issues.
Year: 2013 PMID: 23735199 PMCID: PMC3622649 DOI: 10.1186/1471-2105-14-S5-S12
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The distributions of TIS scores of protein-coding genes (upper panel) and non-coding ORFs (lower panel). We simulated shotgun sequences by randomly sampling DNA fragments with a fixed length of 870 bp from the E. coli K12 genomic sequence. Upfalse, True and Downfalse stand for the probabilities of a TIS being a candidate TIS from a non-coding region, being the true TIS, and being a candidate TIS from a coding region, respectively.
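The fragment simulation described in the caption can be sketched as follows. This is a minimal illustration, not the authors' simulator: the function name `sample_fragments`, the seed, and the toy random "genome" are assumptions made for the example, and only one strand is sampled.

```python
import random

def sample_fragments(genome: str, length: int = 870, n: int = 5, seed: int = 0):
    """Return n (start, fragment) pairs of fixed length drawn uniformly at random.

    Hypothetical helper: the caption only states that fragments of a fixed
    length were sampled at random from the genome.
    """
    if length > len(genome):
        raise ValueError("fragment length exceeds genome length")
    rng = random.Random(seed)
    fragments = []
    for _ in range(n):
        # randint is inclusive on both ends, so the last valid start is len - length
        start = rng.randint(0, len(genome) - length)
        fragments.append((start, genome[start:start + length]))
    return fragments

# Toy stand-in for a genome; a real run would load the E. coli K12 sequence instead.
toy_rng = random.Random(1)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(5000))
frags = sample_fragments(toy_genome, length=870, n=3)
```

Keeping the start coordinate alongside each fragment makes it possible to map annotated genes and their TISs back onto the fragment later.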
Gene prediction performance on simulated shotgun sequences.
| Methods | 1200 bp Sn(%) | 1200 bp Sp(%) | 1200 bp Hm(%) | 870 bp Sn(%) | 870 bp Sp(%) | 870 bp Hm(%) | 535 bp Sn(%) | 535 bp Sp(%) | 535 bp Hm(%) | 120 bp Sn(%) | 120 bp Sp(%) | 120 bp Hm(%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MG | 97.7 | 97.4 | 96.9 | 93.2 | 91.4 | |||||||
| MGC | 98.0 | 97.7 | 97.2 | 93.3 | ||||||||
| MP | 97.5 | 93.6 | 95.5 | 97.2 | 93.5 | 95.3 | 96.8 | 92.9 | 94.8 | 92.0 | 85.5 | 88.7 |
| GLM | 93.3 | 95.6 | 93.3 | 95.6 | 93.1 | 95.3 | 88.7 | |||||
| MGM | 97.5 | 92.7 | 95.1 | 97.1 | 92.9 | 94.9 | 96.7 | 92.8 | 94.7 | 90.1 | 89.1 | 89.6 |
| MGA | 97.4 | 91.7 | 94.4 | 97.2 | 91.4 | 94.2 | 96.8 | 90.5 | 93.5 | 91.3 | 83.7 | 87.4 |
| FGS | 95.7 | 87.3 | 91.3 | 95.5 | 88.0 | 91.6 | 95.2 | 88.4 | 91.6 | 90.4 | 82.1 | 86.1 |
| Net | 94.6 | 94.7 | 94.6 | 94.1 | 94.7 | 94.4 | 93.3 | 94.6 | 93.9 | 82.0 | 76.4 | 79.1 |
The gene prediction methods are denoted by abbreviations. MG: MetaGUN, MGC: complete version of MetaGUN trained on all 261 training genomes, MP: MetaProdigal, GLM: Glimmer-MG, MGM: MetaGeneMark, FGS: FragGeneScan, MGA: MetaGeneAnnotator, Net: Orphelia.
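The column abbreviations are not defined in this excerpt. Assuming Sn is sensitivity, Sp is specificity and Hm is their harmonic mean, the Hm entries can be recomputed from the other two columns as sketched below. The function name is illustrative. Because the table values are already rounded, a recomputed Hm may differ from the printed one in the last digit for some rows.

```python
def harmonic_mean(sn: float, sp: float) -> float:
    """Harmonic mean of sensitivity and specificity (both in percent).

    Assumption: Hm = 2 * Sn * Sp / (Sn + Sp); the excerpt does not state this.
    """
    if sn + sp == 0:
        return 0.0
    return 2 * sn * sp / (sn + sp)

# Rows taken from the table: MetaProdigal and FragGeneScan at 1200 bp.
hm_mp = round(harmonic_mean(97.5, 93.6), 1)
hm_fgs = round(harmonic_mean(95.7, 87.3), 1)
print(hm_mp, hm_fgs)
```

The harmonic mean penalizes an imbalance between the two measures, which is why a method with high Sn but lower Sp scores below one with two moderately high values.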
Figure 2. TIS prediction experiments by modified MetaTISA on simulated shotgun DNA fragments. The artificial shotgun sequences are sampled from E. coli K12 with a fixed length of 870 bp. The upstream minimum length is the minimum number of upstream bases required for scoring when the available upstream region is shorter than 50 bp, and the TIS accuracy is the overall accuracy over both the internal and the external TISs. The supervised TIS parameters used for the experiments include those trained on the RefSeq annotations and those trained on the TriTISA annotations, with Markov models ranging from 0-order to 4th-order.
TIS prediction performance on experimentally characterized gene starts.
| Methods | 870 bp Total | 870 bp Internal | 870 bp External | 535 bp Total | 535 bp Internal | 535 bp External |
|---|---|---|---|---|---|---|
| MG | 98.5% | 98.8% | ||||
| MP | 95.1% | 90.1% | 95.6% | 88.1% | ||
| GLM | 95.0% | 91.2% | 98.7% | 95.4% | 89.2% | 98.8% |
| MGM | 92.1% | 84.3% | 99.4% | 93.4% | 82.5% | 99.4% |
| MGA | 90.9% | 82.3% | 98.9% | 92.4% | 81.1% | 98.6% |
| FGS | 86.2% | 72.8% | 98.8% | 89.4% | 72.2% | 98.9% |
| Net | 84.3% | 78.6% | 89.8% | 88.0% | 72.4% | 96.4% |
The abbreviations of the gene prediction methods are the same as in Table 1. We follow Hyatt et al. [13] in assessing the TIS accuracy on both the internal TISs and the external TISs. An internal TIS is a TIS located inside a fragment, and an external TIS is one that lies beyond the edge of a fragment. The total is the overall accuracy over both the internal and the external TISs.
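The internal/external distinction and the three reported accuracies can be sketched as follows. This is an illustration under stated assumptions: the half-open genome coordinates, the function names, and the record format are hypothetical and not taken from the evaluation protocol of Hyatt et al.

```python
from collections import Counter

def classify_tis(tis_pos: int, frag_start: int, frag_end: int) -> str:
    """Label a true TIS as internal if it falls inside the fragment.

    Assumption: genome coordinates, frag_start inclusive, frag_end exclusive.
    """
    return "internal" if frag_start <= tis_pos < frag_end else "external"

def tis_accuracy(records):
    """records: iterable of (kind, is_correct) pairs.

    Returns the fraction of correct predictions per kind and overall ("total").
    """
    totals, hits = Counter(), Counter()
    for kind, ok in records:
        for key in (kind, "total"):
            totals[key] += 1
            hits[key] += int(ok)
    return {key: hits[key] / totals[key] for key in totals}

kind_a = classify_tis(100, 50, 920)   # TIS inside an 870 bp fragment
kind_b = classify_tis(10, 50, 920)    # TIS upstream of the fragment edge
acc = tis_accuracy([("internal", True), ("internal", False), ("external", True)])
```

The total is computed over the pooled records rather than as the mean of the two per-class accuracies, so it is weighted by how many TISs fall into each class.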
Application to 2 human gut microbiome samples.
| Samples | Size(M) | Contigs | Annotated | Methods | Predicted | Additional | Potential novel |
|---|---|---|---|---|---|---|---|
| Sub. 7 | 15.8 | 10411 | 20487 | MG | 21524 (94.8%) | 2101 (58.1%) | |
| Sub. 7 | 15.8 | 10411 | 20487 | MP | 22056 (96.3%) | 2332 (54.1%) | 5 |
| Sub. 7 | 15.8 | 10411 | 20487 | GLM | 22116 (96.4%) | 2361 (54.5%) | 5 |
| Sub. 7 | 15.8 | 10411 | 20487 | MGM | 22200 (96.8%) | 2365 (56.7%) | 5 |
| Sub. 7 | 15.8 | 10411 | 20487 | MGA | 22102 (96.3%) | 2377 (57.2%) | 3 |
| Sub. 7 | 15.8 | 10411 | 20487 | FGS | 23215 (95.6%) | 3634 (34.9%) | 4 |
| Sub. 7 | 15.8 | 10411 | 20487 | Net | 21421 (94.5%) | 2067 (48.7%) | 3 |
| Sub. 8 | 20.5 | 12020 | 25943 | MG | 26881 (95.0%) | 2241 (64.5%) | |
| Sub. 8 | 20.5 | 12020 | 25943 | MP | 27737 (97.0%) | 2589 (61.6%) | 5 |
| Sub. 8 | 20.5 | 12020 | 25943 | GLM | 28127 (97.1%) | 2931 (58.2%) | 5 |
| Sub. 8 | 20.5 | 12020 | 25943 | MGM | 27931 (97.1%) | 2728 (63.7%) | 4 |
| Sub. 8 | 20.5 | 12020 | 25943 | MGA | 27627 (96.2%) | 2666 (63.1%) | 4 |
| Sub. 8 | 20.5 | 12020 | 25943 | FGS | 29462 (96.5%) | 4433 (36.0%) | 4 |
| Sub. 8 | 20.5 | 12020 | 25943 | Net | 26780 (95.0%) | 2126 (58.0%) | 4 |
In this experiment, Orphelia was run with the 'Net700' parameter and FragGeneScan in 'complete' mode, because the sequences in these samples are highly assembled. The other methods were run with default settings. Percentages in the column 'Predicted genes' are ratios of successfully predicted genes to annotated genes, and percentages in the column 'Additional genes' are the ratios of annotated missed genes to additional genes.