Abstract
SUMMARY: A huge amount of metagenomic sequence data has been produced as a result of rapidly increasing worldwide efforts to study microbial communities as a whole. Most, if not all, sequenced metagenomes are complex mixtures of chromosomal and plasmid sequence fragments from multiple organisms, possibly from different kingdoms. Computational methods for predicting genomic elements such as genes differ significantly between chromosomes and plasmids, which raises the need to separate chromosomal from plasmid sequences in a metagenome. We present a program that classifies a metagenome set into chromosomal and plasmid sequences, based on their distinguishing pentamer frequencies. On a large training set consisting of all the sequenced prokaryotic chromosomes and plasmids, the program achieves approximately 92% classification accuracy. On a large set of simulated metagenomes with sequence lengths ranging from 300 bp to 100 kbp, the program achieves classification accuracies ranging from 64.45% to 88.75%. On a large independent test set, the program achieves 88.29% classification accuracy.
AVAILABILITY: The program has been implemented as a standalone prediction program, cBar, which is available at http://csbl.bmb.uga.edu/~ffzhou/cBar.
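The pentamer-frequency features mentioned in the summary can be illustrated with a minimal sketch. This is not cBar's code: the handling of ambiguous bases, the normalization, and the names `pentamer_frequencies`, `PENTAMERS` and `INDEX` are all assumptions made for illustration, since the abstract does not specify these details.

```python
from itertools import product

ALPHABET = "ACGT"
K = 5  # pentamers

# All 4**5 = 1024 pentamers in a fixed lexicographic order.
PENTAMERS = ["".join(p) for p in product(ALPHABET, repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(PENTAMERS)}


def pentamer_frequencies(sequence):
    """Return a 1024-element list of relative pentamer frequencies.

    Assumption: overlapping windows are counted on the given strand only,
    and windows containing non-ACGT characters (e.g. N) are skipped.
    """
    seq = sequence.upper()
    counts = [0] * len(PENTAMERS)
    for i in range(len(seq) - K + 1):
        idx = INDEX.get(seq[i:i + K])
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        # Sequence shorter than 5 bp or made up only of ambiguous bases.
        return [0.0] * len(PENTAMERS)
    return [c / total for c in counts]


# Hypothetical 10 bp fragment: it contains 6 overlapping pentamers.
freqs = pentamer_frequencies("ACGTACGTAC")
```

A vector like `freqs` could then be passed to any standard classifier; the table below compares several such classifiers.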
Year: 2010 PMID: 20538725 PMCID: PMC2916713 DOI: 10.1093/bioinformatics/btq299
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Prediction performance of five classification approaches: C4.5 decision tree, Bayes network, SVM with the RBF kernel, SMO and nearest neighbor
| Dataset | Strategy | Algorithm | Sn | Sp | Ac | MCC | AUC |
|---|---|---|---|---|---|---|---|
| Training | 10FCV | C4.5 | 0.8766 | 0.7832 | 0.8371 | 0.6646 | 0.832 |
| | | Bayes net | 0.6594 | 0.8034 | 0.7203 | 0.4585 | 0.807 |
| | | RBF | 0.9463 | 0.7613 | 0.8681 | 0.7315 | 0.854 |
| | | SMO | 0.9474 | 0.8783 | 0.9182 | 0.8321 | 0.913 |
| | | NN | 0.9177 | 0.8128 | 0.8734 | 0.7395 | 0.865 |
| Testing | | C4.5 | 0.7667 | 0.8627 | 0.8108 | 0.6280 | 0.811 |
| | | Bayes net | 0.6333 | 0.8039 | 0.7117 | 0.4398 | 0.764 |
| | | RBF | 0.8500 | 0.8235 | 0.8378 | 0.6735 | 0.837 |
| | | SMO | 0.8667 | 0.9020 | 0.8829 | 0.7664 | 0.884 |
| | | NN | 0.8667 | 0.8039 | 0.8378 | 0.6730 | 0.835 |
| All | 10FCV | C4.5 | 0.8759 | 0.8020 | 0.8445 | 0.6809 | 0.841 |
| | | Bayes net | 0.6877 | 0.7905 | 0.7314 | 0.4730 | 0.815 |
| | | RBF | 0.9358 | 0.7746 | 0.8672 | 0.7290 | 0.855 |
| | | SMO | 0.9422 | 0.8887 | 0.9195 | 0.8349 | 0.915 |
| | | NN | 0.9112 | 0.8165 | 0.8709 | 0.7349 | 0.864 |
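These algorithms were evaluated using 10-fold cross-validation (10FCV) on the Training dataset, the Testing dataset and the All dataset. Sn: sensitivity; Sp: specificity; Ac: accuracy; MCC: Matthews correlation coefficient; AUC: area under the ROC curve.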
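For readers who want to check how the Sn, Sp, Ac and MCC columns relate to each other, the sketch below computes them from a confusion matrix using the standard textbook definitions. The counts (52, 8, 46, 5) are not given in the source. They are back-calculated here as one set of integers consistent with the ratios reported for SMO on the Testing dataset, and which class is treated as positive is likewise an assumption.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard sensitivity, specificity, accuracy and MCC from raw counts."""
    sn = tp / (tp + fn)                    # sensitivity (recall on positives)
    sp = tn / (tn + fp)                    # specificity (recall on negatives)
    ac = (tp + tn) / (tp + fn + tn + fp)   # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, ac, mcc


# Assumed counts (not from the source): 60 positives and 51 negatives,
# chosen because they reproduce the reported SMO/Testing ratios.
sn, sp, ac, mcc = binary_metrics(tp=52, fn=8, tn=46, fp=5)
print(f"Sn={sn:.4f} Sp={sp:.4f} Ac={ac:.4f} MCC={mcc:.4f}")
```

With these assumed counts the printed values agree with the SMO row for the Testing dataset to four decimal places, which suggests the table's columns follow the standard definitions.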