Abstract
BACKGROUND: A necessary step for a genome-level analysis of cellular metabolism is the in silico reconstruction of the metabolic network from genome sequences. The available methods are mainly based on the annotation of genome sequences, which involves two successive steps: the prediction of coding sequences (CDS) and their function assignment. The annotation process is time-consuming, and the available methods often encounter difficulties when dealing with unfinished, error-containing genomic sequences.
Year: 2004 PMID: 15312235 PMCID: PMC514700 DOI: 10.1186/1471-2105-5-112
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Schematic illustration of methods for reconstructing the metabolic network from genome data: (A) conventional three-step method based on coding sequence (CDS) prediction, function assignment and metabolic reconstruction; (B) proposed two-step method based on a simultaneous and direct identification of coding sequences and their functions from raw genome sequences. Note that the query process has been reversed in the new method in comparison to the conventional method.
Evaluation of the method IdentiCS for identification of CDSs in the genome of Salmonella typhimurium. KEGG: KEGG genome database; SW: SWISS-PROT + TrEMBL + TrEMBL updates.
| Database | KEGG | SW |
| True positive | 4121 | 4339 |
| False positive | 907 | 987 |
| False negative | 328 | 110 |
| Sensitivity %* | 92.6 (91.1) | 97.5 (98.2) |
| Specificity %* | 82.0 (94.9) | 81.5 (87.2) |
| Inconsistence rate (%) in TP | 0.35 | 0.64 |
*: Values shown in parentheses are calculated at the nucleotide level according to Burset and Guigo [32].
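The CDS-level percentages in the table above can be reproduced from the raw counts. The sketch below assumes the definitions commonly used in gene-prediction benchmarks, sensitivity = TP / (TP + FN) and specificity = TP / (TP + FP); these formulas are not spelled out in this record, but they agree with the tabulated KEGG values. The function names are illustrative only.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Percentage of reference CDSs that were recovered (assumed definition)."""
    return 100.0 * tp / (tp + fn)


def specificity(tp: int, fp: int) -> float:
    """Percentage of predicted CDSs that match a reference CDS (assumed definition)."""
    return 100.0 * tp / (tp + fp)


# Counts from the KEGG column of the table above.
kegg = {"tp": 4121, "fp": 907, "fn": 328}

print(round(sensitivity(kegg["tp"], kegg["fn"]), 1))  # 92.6
print(round(specificity(kegg["tp"], kegg["fp"]), 1))  # 82.0
```

Note that "specificity" in this sense is what other fields call precision or positive predictive value; true negatives do not enter the calculation.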
Effects of different scoring criteria on CDS identification in the genome of S. typhimurium using IdentiCS and the database SWISS-PROT and TrEMBL. Criteria 1: E-value <= E-10; Criteria 2: E-value <= E-10 and Bits score >= 75; Criteria 3: E-value <= E-10, Bits score >= 75 and Identities >= 25%
| | Criteria 1 | Criteria 2 | Criteria 3 |
| True positive | 4339 | 4305 | 4303 |
| False positive | 987 | 602 | 555 |
| False negative | 110 | 144 | 146 |
| Sensitivity % | 97.5 | 96.8 | 96.7 |
| Specificity % | 81.5 | 87.7 | 88.6 |
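The three nested scoring criteria can be read as progressively stricter filters on alignment hits. The following is a minimal sketch of that idea, not the IdentiCS implementation: the `Hit` record, its field names, and the example hits are hypothetical, and "E-10" is interpreted here as 1e-10.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical alignment hit; field names are illustrative only."""
    evalue: float
    bit_score: float
    identity_pct: float


def passes(hit: Hit, criteria: int) -> bool:
    """Apply Criteria 1, 2 or 3 as described in the table caption."""
    ok = hit.evalue <= 1e-10              # Criteria 1 (assumed meaning of E-10)
    if criteria >= 2:
        ok = ok and hit.bit_score >= 75   # added by Criteria 2
    if criteria >= 3:
        ok = ok and hit.identity_pct >= 25  # added by Criteria 3
    return ok


# Invented example hits, for illustration only.
hits = [
    Hit(evalue=1e-40, bit_score=160.0, identity_pct=48.0),
    Hit(evalue=1e-12, bit_score=60.0, identity_pct=30.0),
    Hit(evalue=1e-15, bit_score=80.0, identity_pct=22.0),
]

for level in (1, 2, 3):
    kept = [h for h in hits if passes(h, level)]
    print(level, len(kept))
```

Because each criterion only adds conditions, the set of accepted hits can only shrink from Criteria 1 to Criteria 3, which matches the trend in the table: false positives fall while a few true positives are lost.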
Evaluation of the performance of IdentiCS for the prediction of EC number-containing CDSs (EC-CDSs) with the EC number-containing subset of the protein database SWISS-PROT and TrEMBL.
| | Compared to originally annotated EC-CDSs | Compared to all originally annotated CDSs |
| EC-CDSs predicted | 1894 | 1894 |
| True positive | 1204 | 1813 |
| False positive | 690 | 55 |
| False negative | 14 | NA |
| Sensitivity | 98.9% | NA |
| Specificity | 63.6% | 97.1% |
| Inconsistence rate in T.P. | 0.08% | 3.32% |
NA: not applicable.
Comparison of EC numbers identified with different methods and different versions of the genome sequence of Klebsiella pneumoniae. WIT: WIT version of annotation by gene prediction from the 3.9-fold genome sequences; KEGG3.9 and KEGG7.9: annotations of the 3.9-fold and 7.9-fold genome sequences by applying the KEGG genome database. SW3.9 and SW7.9: annotations of the 3.9-fold and 7.9-fold genome sequences by applying SWISS-PROT and TrEMBL protein databases.
| Annotation version | Number of unique ECs identified | Version-specific EC numbers* compared to: | | | | |
| | | WIT | KEGG3.9 | KEGG7.9 | SW3.9 | SW7.9 |
| WIT | 764 | - | 162 | 159 | 152 | 131 |
| KEGG3.9 | 646 | 44 | - | 4 | 63 | 56 |
| KEGG7.9 | 653 | 48 | 11 | - | 68 | 57 |
| SW3.9 | 693 | 81 | 110 | 108 | - | 15 |
| SW7.9 | 735 | 102 | 145 | 139 | 57 | - |
*The version-X-specific EC numbers compared to version Y are the EC numbers that are found in annotation version X but not in version Y. For example, KEGG3.9 has only 4 version-specific EC numbers compared to KEGG7.9. Conversely, KEGG7.9 has 11 version-specific EC numbers compared to KEGG3.9.
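The footnote's definition is a plain set difference, and it implies a consistency check on the table: the number of EC numbers shared by two versions must be the same whichever version it is computed from. The sketch below illustrates both points. The EC sets `x` and `y` are invented for illustration; the totals and version-specific counts are taken from the table above, and the helper names are hypothetical.

```python
def version_specific(x: set, y: set) -> set:
    """EC numbers found in annotation version X but not in version Y."""
    return x - y


# Invented toy annotations, for illustration only.
x = {"1.1.1.1", "2.7.1.1", "4.1.2.13"}
y = {"1.1.1.1", "2.7.1.2", "4.1.2.13", "5.3.1.9"}
print(sorted(version_specific(x, y)))  # ['2.7.1.1']
print(sorted(version_specific(y, x)))  # ['2.7.1.2', '5.3.1.9']

# Counts taken from the table above.
totals = {"KEGG3.9": 646, "KEGG7.9": 653}
specific = {("KEGG3.9", "KEGG7.9"): 4, ("KEGG7.9", "KEGG3.9"): 11}


def shared_count(a: str, b: str) -> int:
    """EC numbers common to a and b, derived from a's total and its specific count."""
    return totals[a] - specific[(a, b)]


print(shared_count("KEGG3.9", "KEGG7.9"))  # 642
print(shared_count("KEGG7.9", "KEGG3.9"))  # 642
```

The two derived counts agree, which is what the set-difference definition requires.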
Distribution of the unique EC numbers of K. pneumoniae in different function categories compared to other organisms (see legend of Fig. 3 for name abbreviations of organisms). The EC numbers for K. pneumoniae were identified from the unannotated 7.9-fold coverage genome sequences by SWISS-PROT-TrEMBL based IdentiCS. The EC numbers for other organisms are taken from the KEGG genome annotations. The total number of strain-specific ECs is shown in parentheses under the strain name.
| Function categories (a) | ECs (b) | KPN | ECO | STM | STY | PAE | YPE |
| Carbohydrate metabolism | 427 | 151 | 154 | 147 | 101 | 105 | 123 |
| Energy metabolism | 167 | 60 | 62 | 62 | 46 | 61 | 58 |
| Lipid metabolism | 126 | 27 | 25 | 25 | 17 | 27 | 17 |
| Nucleotide metabolism | 166 | 83 | 82 | 80 | 45 | 66 | 72 |
| Amino Acid metabolism | 561 | 211 | 189 | 189 | 132 | 205 | 183 |
| Other Amino Acids | 146 | 49 | 45 | 43 | 26 | 47 | 41 |
| Complex Carbohydrates | 184 | 63 | 58 | 55 | 33 | 43 | 53 |
| Complex Lipids metabolism | 171 | 46 | 38 | 34 | 24 | 31 | 34 |
| Cofactors and Vitamins | 225 | 104 | 105 | 107 | 74 | 100 | 96 |
| Sum of unique EC numbers | 1899 | 567 | 537 | 529 | 349 | 464 | 476 |
a: Functional categorization according to KEGG; b: unique EC numbers of corresponding metabolism category calculated from KEGG metabolic pathway maps.
Comparison of coding sequence (CDS) prediction by IdentiCS and CRITICA from unfinished genome sequences of K. pneumoniae with different genome sequence coverage.
| | 3.9× genome data | | 7.9× genome data | |
| | CRITICA | IdentiCS | CRITICA | IdentiCS |
| Number of all CDSs | 6734 | 5650 | 5135 | 5261 |
| CDSs shared by both programs | 6332* | 4302 | 4823 | 4512 |
| CDSs merely identified by the respective program | 402 | 1348 | 312 | 749 |
*A CDS predicted by IdentiCS can cover more than one smaller CDS predicted by CRITICA.
Figure 3. Glycolysis pathway as an example of metabolic comparison among different organisms. KPN: K. pneumoniae MGH78578, ECO: Escherichia coli K-12 MG1655, STM: Salmonella typhimurium LT2, STY: Salmonella typhi, PAE: Pseudomonas aeruginosa PAO1, YPE: Yersinia pestis strain CO92. A green background behind an enzyme's EC number means that the enzyme exists in all the compared organisms. Colored rectangles under the EC number box represent links to the strain-specific annotation of the corresponding enzyme. Colored bars above the EC number link the enzyme to the corresponding entry in different public databases.