Ivan Borozan, Vincent Ferretti.
Abstract
SUMMARY: Sequence comparison of genetic material between known and unknown organisms plays a crucial role in genomics, metagenomics and phylogenetic analysis. Emerging long-read sequencing technologies can now produce reads tens of kilobases in length, which promise a more accurate assessment of their origin. To facilitate the classification of long and short DNA sequences, we have developed a Python package implementing a new sequence classification model that we have shown to improve classification accuracy compared with other state-of-the-art classification methods. To validate the package and demonstrate its usefulness, we test the combined sequence similarity score classifier (CSSSCL) on three different datasets, including a metagenomic dataset composed of short reads.
Year: 2015 PMID: 26454281 PMCID: PMC4734043 DOI: 10.1093/bioinformatics/btv587
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
The classification performance across three datasets obtained with CSSSCL, NBC (Rosen et al., 2011) and Kraken (Wood and Salzberg, 2014)
| Classifiers | Viral (precision, recall, [gp/min, RAM (GB)], time_db (h:m), time_cl (h:m)) | Bacterial I (precision, recall, [gp/min, RAM (GB)], time_db (h:m), time_cl (h:m)) | Bacterial II (precision, recall, [rp/min, RAM (GB)], time_db (h:m), time_cl (h:m)) |
|---|---|---|---|
| CSSSCL(blast, kmers, compression) | 95.0, 94.0, [2, 4], 12 h 34 m, 7 h 41 m | NA* | NA |
| CSSSCL(blast, kmers) | 91.0, 90.0, [254, 4], 1 h 23 m, 0 h 4 m | 87.0, 87.0, [51, 24], 12 h 16 m, 0 h 23 m | NA |
| CSSSCL(kmers) | 77.0, 76.0, [254, 4], 1 h 7 m, 0 h 4 m | 85.0, 86.0, [62, 50], 1 h 50 m, 0 h 19 m | NA |
| CSSSCL(blast) | 92.0, 90.0, [3390, 4], 0 h 31 m, 0 h 0.3 m | 89.0, 89.0, [591, 24], 7 h 27 m, 0 h 2 m | 95.0, 88.0, [2500, 12], 2 h 7 m, 0 h 4 m |
| NBC | 83.0, 77.0, [0.9, 0.04], 0 h 18 m, 18 h 30 m | NA* | 77.8, 77.8, [3, NA], NA |
| Kraken | 65.0, 45.0, [686, 13], 0 h 4 m, 0 h 1.5 m | 91.0, 82.0, [204, 50], 4 h 3 m, 0 h 6 m | 94.7, 73.5, [892 472, 70], NA |
For Bacterial dataset I (full-length bacterial sequences), we do not present results for NBC or for CSSSCL with the compression measure included, because of the very long run times (>4 weeks; marked NA*). For Bacterial dataset II (short reads), the CSSSCL program selects only the blast-based similarity measure, since the kmer- and compression-based measures are eliminated during the optimization phase (marked NA). In the table, gp/min is the number of genomes processed per minute, rp/min the number of reads processed per minute, RAM the maximum RAM usage in GB, time_db the time to process/train the reference database, and time_cl the time to classify the sequences in the test set once the reference database has been processed. The viral dataset was run on a 16-core AMD 64-bit processor with 16 GB of RAM, and the bacterial datasets on a 16-core AMD 64-bit processor with 100 GB of RAM (see also the Supplementary Data for the parameter settings used to run the algorithms).
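Each cell of the table packs five quantities into one string, which makes the rows hard to compare programmatically. The following is a minimal sketch, not part of the CSSSCL package, of how a fully reported cell could be split into named fields with times converted to minutes. The names `CELL` and `parse_cell` and the dictionary keys are hypothetical, chosen here for illustration only.

```python
import re

# Pattern for one fully reported cell, e.g.
# "91.0, 90.0, [254, 4], 1 h 23 m, 0 h 4 m"
# (precision, recall, [throughput per minute, RAM in GB], time_db, time_cl).
CELL = re.compile(
    r"(?P<precision>[\d.]+),\s*(?P<recall>[\d.]+),\s*"
    r"\[(?P<rate>[\d.]+),\s*(?P<ram>[\d.]+)\],\s*"
    r"(?P<db_h>\d+) h (?P<db_m>[\d.]+) m,\s*"
    r"(?P<cl_h>\d+) h (?P<cl_m>[\d.]+) m"
)


def parse_cell(cell):
    """Return a dict of metrics, or None for NA / partially reported cells."""
    match = CELL.fullmatch(cell.strip())
    if match is None:
        return None
    g = match.groupdict()
    return {
        "precision": float(g["precision"]),
        "recall": float(g["recall"]),
        "per_min": float(g["rate"]),      # gp/min or rp/min, depending on the column
        "ram_gb": float(g["ram"]),
        "time_db_min": int(g["db_h"]) * 60 + float(g["db_m"]),
        "time_cl_min": int(g["cl_h"]) * 60 + float(g["cl_m"]),
    }


# CSSSCL(blast, kmers) on the viral dataset, copied from the table.
row = parse_cell("91.0, 90.0, [254, 4], 1 h 23 m, 0 h 4 m")
print(row)
```

Cells marked NA or NA*, and cells in which only some values are reported, do not match the pattern and are returned as `None` rather than being guessed at.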