Peng Jia, Liming Xuan, Lei Liu, Chaochun Wei.
Abstract
Metagenomic sequence classification assigns sequences to their source genomes and is an important step in metagenomic sequence data analysis. Although many methods exist, classifying high-throughput metagenomic sequence data in a limited time is still a challenge. We present here an ultra-fast metagenomic sequence classification system (MetaBinG) using graphics processing units (GPUs). The accuracy of MetaBinG is comparable to the best existing systems, and it can classify a million 454 reads within five minutes, which is more than two orders of magnitude faster than existing systems. MetaBinG is publicly available at http://cbb.sjtu.edu.cn/~ccwei/pub/software/MetaBinG/MetaBinG.php.
Year: 2011 PMID: 22132069 PMCID: PMC3223155 DOI: 10.1371/journal.pone.0025353
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Impact of the order (K) of Markov models in MetaBinG.
| Sequence Length (bps) | K = 3 | K = 4 | K = 5 | K = 6 | K = 7 |
| 100 | 1/47.5 | 2/49.6 | 4/50.6 | 15/50.6 | |
| 200 | 2/56.7 | 2/58.9 | 5/60.8 | | 55/61.6 |
| 300 | 2/62.2 | 3/64.9 | | 16/67.6 | 57/66.9 |
| 400 | 3/66.3 | 4/69.7 | | 17/71.6 | 57/71.4 |
| 500 | 3/69.2 | 4/72.7 | | 18/74.5 | 59/73.3 |
| 600 | 4/71.9 | 5/75.3 | 8/77.2 | | 59/74.9 |
| 700 | 5/74.5 | 6/77.8 | 9/79.2 | | 61/76.9 |
| 800 | 6/75.4 | 7/78.4 | | 22/79.9 | 62/77.2 |
| 900 | 8/76.6 | 9/79.2 | | 22/80.6 | 63/78.4 |
| 1000 | 8/78.5 | 10/81.5 | 13/82.4 | | 65/78.9 |
The impact of the order of the Markov models (K) in MetaBinG was tested for K values from 3 to 7. The sequence data sets are the same as in Table 2. Ten different sequence lengths from 100 bps to 1000 bps were used for testing, with 6,640 sequences per length. Each column corresponds to one K value, the order of the Markov model. The total computing time (in seconds) and the accuracy were measured as in Table 2. Each cell contains the total computing time and the accuracy (%), separated by a “/”.
For each sequence length, the best performance is shown in bold. K is set to 5 by default in MetaBinG.
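To give a sense of why computing time rises with K, the sketch below counts the transition probabilities in a K-th order Markov model over the four-letter DNA alphabet: there are 4^K possible contexts, each followed by one of 4 bases. The function name `markov_model_size` is an illustrative choice, and the source does not say how MetaBinG actually stores its models, so this is only a rough indication of how model size scales.

```python
def markov_model_size(k: int, alphabet_size: int = 4) -> int:
    """Number of transition probabilities in a k-th order Markov model.

    Each of the alphabet_size**k contexts can be followed by any of the
    alphabet_size symbols.
    """
    return alphabet_size ** k * alphabet_size


if __name__ == "__main__":
    # K values tested in the table above.
    for k in range(3, 8):
        print(f"K = {k}: {markov_model_size(k)} parameters per model")
```

Each step in K multiplies the per-model parameter count by four, so a model of order 7 is 256 times larger than one of order 3.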
Comparison of Phymm and MetaBinG.
| Sequence Length (bps) | Phymm Accuracy (%) | Phymm Time (s) | MetaBinG Accuracy (%) | MetaBinG Time (s) | Speedup |
| 100 | 53.62 | 573 | 50.61 | 4 | 143 |
| 200 | 64.21 | 880 | 60.82 | 5 | 176 |
| 300 | 70.71 | 1262 | 67.66 | 6 | 210 |
| 400 | 73.36 | 1652 | 71.56 | 6 | 275 |
| 500 | 76.02 | 1949 | 74.48 | 8 | 244 |
| 600 | 78.47 | 2330 | 77.24 | 8 | 291 |
| 700 | 79.89 | 2632 | 79.21 | 9 | 292 |
| 800 | 81.86 | 3006 | 80.25 | 10 | 301 |
| 900 | 82.40 | 3403 | 80.77 | 12 | 284 |
| 1000 | 84.18 | 3795 | 82.35 | 13 | 292 |
Ten different sequence lengths from 100 bps to 1000 bps were used for testing, with 6,640 sequences per length. The accuracy and total computing time (in seconds) for the 6,640 sequences are listed in the table. Accuracy is measured at the phylum level. The last column shows the speedup of MetaBinG over Phymm. Both Phymm and MetaBinG were tested on the same Linux machine with 2 Intel Xeon E5520 processors (8 cores in total), 16 GB RAM and one NVIDIA Tesla C1060 GPU card (240 cores). Default parameters were used for Phymm. The same input sequences and reference databases were used for both MetaBinG and Phymm. Accuracy is defined as the number of correctly predicted sequences divided by the total number of test sequences, since both methods assign every sequence to a source genome. The measured time includes all overhead except the creation of the reference databases.
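The two derived quantities in the table can be written out as a short sketch. The function names are illustrative, and the speedup is assumed to be the Phymm time divided by the MetaBinG time, which reproduces the rounded values in the last column. The count of correct predictions in the usage example is hypothetical, since the source reports only percentages.

```python
def accuracy_percent(num_correct: int, num_total: int) -> float:
    """Accuracy as defined in the caption: correct predictions / all test sequences."""
    return 100.0 * num_correct / num_total


def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """How many times faster the new method is than the baseline."""
    return baseline_seconds / new_seconds


if __name__ == "__main__":
    # Times taken from the 100 bps and 1000 bps rows of the table.
    print(round(speedup(573, 4)))     # Phymm 573 s vs MetaBinG 4 s
    print(round(speedup(3795, 13)))   # Phymm 3795 s vs MetaBinG 13 s
    # Hypothetical count of correct predictions out of 6,640 sequences.
    print(accuracy_percent(3320, 6640))
```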
Figure 1. Biogas metagenome recovered by MetaBinG and Phymm.
The 616,072 454 reads in the biogas metagenome dataset were classified using MetaBinG and Phymm. The classification accuracy was measured at the phylum level. The histogram shows only the top 15 phyla from the metagenome recovered by Phymm. In general, the results recovered by MetaBinG and Phymm are similar, except for some small differences in Euryarchaeota and Actinobacteria. Among the top 15 phyla generated by Phymm, 14 were in the top 15 produced by MetaBinG. The relative ranks of these phyla differ between the two methods by at most two. MetaBinG is almost 1500-fold faster than Phymm.
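The kind of agreement described in this caption (overlap of the top-N lists and the largest rank shift among shared phyla) could be computed along the following lines. This is a sketch under assumptions: the function names are illustrative, the source does not describe how the ranks were compared, and the read counts and labels (`P1` to `P5`) in the example are made-up placeholders rather than data from the paper.

```python
def top_ranks(counts, top_n):
    """Map each of the top_n most abundant labels to its 1-based rank."""
    # Sort by descending count; break ties alphabetically for determinism.
    ordered = sorted(counts, key=lambda name: (-counts[name], name))
    return {name: rank for rank, name in enumerate(ordered[:top_n], start=1)}


def compare_rankings(counts_a, counts_b, top_n):
    """Return (number of shared top_n labels, largest rank difference among them)."""
    ranks_a = top_ranks(counts_a, top_n)
    ranks_b = top_ranks(counts_b, top_n)
    shared = set(ranks_a) & set(ranks_b)
    max_shift = max((abs(ranks_a[p] - ranks_b[p]) for p in shared), default=0)
    return len(shared), max_shift


if __name__ == "__main__":
    # Placeholder read counts per phylum for two hypothetical classifiers.
    tool_x = {"P1": 50, "P2": 30, "P3": 15, "P4": 5}
    tool_y = {"P1": 45, "P3": 28, "P2": 20, "P5": 7}
    print(compare_rankings(tool_x, tool_y, top_n=3))
```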
Figure 2. The system design of MetaBinG.
First, the pre-built kth-order Markov models (kMMs) are loaded into GPU memory. Second, the CPU transforms the input FASTA sequences into vectors of k-mer frequencies, which are then transferred to GPU memory. Third, the GPU compares these vectors against the pre-built Markov models. Finally, the minimum scores are returned to the CPU, which annotates each input sequence with NCBI taxonomy information.
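The data flow in this caption can be illustrated with a small CPU-only sketch. It is not MetaBinG's implementation: the score is assumed here to be a negative log-likelihood (so the minimum score marks the best-matching model), the add-one smoothing is an assumption, and all names (`kmer_counts`, `train_model`, `score`, `classify`, `taxon_a`, `taxon_b`) and the toy reference sequences are hypothetical. In the described system, the per-model scoring step is the part that runs on the GPU.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def kmer_counts(seq, k):
    """Count overlapping (k+1)-mers: a k-base context plus the base after it."""
    seq = seq.upper()
    words = (seq[i:i + k + 1] for i in range(len(seq) - k))
    # Skip words containing ambiguous bases such as N.
    return Counter(w for w in words if set(w) <= set(ALPHABET))


def train_model(genome, k, pseudocount=1.0):
    """Collect the counts needed for a k-th order Markov model of one genome."""
    words = kmer_counts(genome, k)
    contexts = Counter()
    for word, n in words.items():
        contexts[word[:-1]] += n
    return {"words": words, "contexts": contexts, "pseudocount": pseudocount}


def score(read_counts, model):
    """Negative log-likelihood of a read under a model (assumed scoring rule)."""
    pc = model["pseudocount"]
    total = 0.0
    for word, n in read_counts.items():
        num = model["words"][word] + pc
        den = model["contexts"][word[:-1]] + pc * len(ALPHABET)
        total += n * -math.log(num / den)
    return total


def classify(read, models, k):
    """Return the name of the model with the minimum score for this read."""
    read_counts = kmer_counts(read, k)
    return min(models, key=lambda name: score(read_counts, models[name]))


K = 2  # small order for the toy example; MetaBinG's default is 5
# Toy reference "genomes" (hypothetical, for illustration only).
models = {
    "taxon_a": train_model("ACGT" * 50, K),
    "taxon_b": train_model("AATT" * 50, K),
}

if __name__ == "__main__":
    print(classify("ACGTACGTACGT", models, K))
    print(classify("AATTAATTAATT", models, K))
```

Because the read is reduced to a count vector once, scoring it against each model amounts to a weighted sum over the same counts, which is the sort of independent, repeated arithmetic that parallelizes well across GPU threads.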