Lőrinc S Pongor, Roberto Vera, Balázs Ligeti.
Abstract
Next-generation sequencing (NGS) of metagenomic samples is becoming a standard approach to detect individual species or pathogenic strains of microorganisms. Computer programs used in the NGS community must balance speed against sensitivity; as a result, species- or strain-level identification is often inaccurate and low-abundance pathogens can be missed. We have developed Taxoner, an open-source taxon assignment pipeline that includes a fast aligner (e.g. Bowtie2) and a comprehensive DNA sequence database. We tested the program on simulated datasets as well as experimental data from Illumina, IonTorrent, and Roche 454 sequencing platforms. We found that Taxoner performs as well as, and often better than, BLAST but requires two orders of magnitude less running time, meaning that it can be run on desktop or laptop computers. Taxoner is slower than approaches that use small marker databases but is more sensitive due to its comprehensive reference database. In addition, it can be easily tuned to specific applications using small tailored databases. When applied to metagenomic datasets, Taxoner can provide a functional summary of the genes mapped and can provide strain-level identification. Taxoner is written in C for Linux operating systems. The code and documentation are available for research applications at http://code.google.com/p/taxoner.
Year: 2014 PMID: 25077800 PMCID: PMC4117525 DOI: 10.1371/journal.pone.0103441
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Typical running times for the alignment.
| Program | Database | Running time, 1 thread | Running time, 4 threads | Running time, 12 threads |
| MetaPhlAn | own bacterial marker database | 14 sec | 7 sec | 6 sec |
| Taxoner | NCBI nt Bacteria | 165 sec | 105 sec | 90 sec |
| Taxoner | NCBI nt full database | 2446 sec | 2031 sec | 1866 sec |
| MEGABLAST | NCBI nt Bacteria | 8.3 h | n/a | 3.9 h |
| MEGABLAST | NCBI nt full database | 37.6 h | n/a | 9.4 h |
Read dataset: Dataset A, Table 1. Processor: Intel(R) Xeon(R) CPU E5-2640.
Own bacterial marker database: the built-in dataset is 366,988,039 nucleotides (367 MB) and contains only bacterial sequences.
NCBI nt Bacteria: 15,400,949,699 nucleotides (15 GB), downloaded on 11/07/2013.
NCBI nt full database: 52,380,339,934 nucleotides (54 GB), downloaded on 11/07/2013.
Times include taxon assignment;
time of taxon assignment by MEGAN is not included.
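The "two orders of magnitude" claim in the abstract can be checked against the single-thread column of the table above. The following is a minimal arithmetic sketch; the dictionary layout and the `speedup` helper are illustrative names, not part of Taxoner.

```python
# Single-thread running times copied from the table above, in seconds.
HOUR = 3600
times = {
    "NCBI nt Bacteria": {"Taxoner": 165, "MEGABLAST": 8.3 * HOUR},
    "NCBI nt full database": {"Taxoner": 2446, "MEGABLAST": 37.6 * HOUR},
}


def speedup(slow_seconds, fast_seconds):
    """Return how many times faster the second measurement is."""
    return slow_seconds / fast_seconds


ratios = {
    db: speedup(t["MEGABLAST"], t["Taxoner"]) for db, t in times.items()
}

for db, ratio in ratios.items():
    print(f"{db}: MEGABLAST / Taxoner = {ratio:.0f}x")
```

On these figures the ratio is roughly 181x for the bacterial subset and roughly 55x for the full database. Because the notes state that MEGAN's taxon assignment time is not included, the ratios are, if anything, conservative.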
Read assignment for Staphylococcus aureus genome sequencing data.
| | | Roche 454 (Dataset A) | | | Ion Torrent (Dataset B) | | | Illumina (Dataset C) | | |
| Program | Level: | Genus | Species | Strain | Genus | Species | Strain | Genus | Species | Strain |
| Taxoner | Total | 93692 | 93692 | 93692 | 37175 | 37175 | 37175 | 27531 | 27531 | 27531 |
| | Positive | 93189 | 92728 | 62 | 36482 | 35919 | 0 | 26023 | 17019 | 0 |
| | Negative | 4 | 43 | 875 | 28 | 126 | 17174 | 29 | 121 | 1213 |
| | FNR % | 0.004 | 0.046 | 0.934 | 0.075 | 0.339 | 46.198 | 0.105 | 0.440 | 4.406 |
| MetaPhlAn | Total | 8525 | 8525 | 8525 | 2522 | 2522 | 2522 | 1692 | 1692 | 1692 |
| | Positive | 8209 | 8063 | N/A | 2402 | 2399 | 0 | 1650 | 1613 | N/A |
| | Negative | 0 | 42 | N/A | 43 | 36 | 0 | 2 | 28 | N/A |
| | FNR % | 0.000 | 0.493 | N/A | 1.705 | 1.427 | N/A | 0.118 | 1.655 | N/A |
| BLASTALL + MEGAN | Total | 86752 | 86752 | 86752 | 68696 | 68696 | 68696 | 45721 | 45721 | 45721 |
| | Positive | 83718 | 82951 | 156 | 65264 | 63801 | 0 | 41189 | 27375 | 0 |
| | Negative | 25 | 29 | 53 | 114 | 125 | 5310 | 408 | 441 | 3094 |
| | FNR % | 0.029 | 0.033 | 0.061 | 0.166 | 0.182 | 7.730 | 0.892 | 0.965 | 6.767 |
| DC-MEGABLAST + MEGAN | Total | 84211 | 84211 | 84211 | 64858 | 64858 | 64858 | 48677 | 48677 | 48677 |
| | Positive | 81000 | 80035 | 140 | 61662 | 60199 | 0 | 44161 | 29466 | 0 |
| | Negative | 48 | 68 | 131 | 94 | 128 | 3052 | 81 | 180 | 3421 |
| | FNR % | 0.057 | 0.081 | 0.156 | 0.145 | 0.197 | 4.706 | 0.166 | 0.370 | 7.028 |
100,000 randomly selected reads from experimental data; details in section 4.1.
FNR: false negative rate.
N/A: not available.
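The FNR % rows in the table are consistent with dividing the Negative count by the Total count and expressing the result as a percentage. The sketch below assumes that definition, which is inferred from the tabulated numbers rather than stated in the text; `fnr_percent` is an illustrative name.

```python
def fnr_percent(negative, total):
    """False negative rate as a percentage of all assigned reads.

    Assumption: FNR % = 100 * Negative / Total, inferred from the table.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * negative / total


# Taxoner, strain level, Roche 454 (Dataset A): 875 negatives of 93692 reads.
taxoner_454_strain = fnr_percent(875, 93692)

# Taxoner, strain level, Ion Torrent (Dataset B): 17174 negatives of 37175.
taxoner_ion_strain = fnr_percent(17174, 37175)

print(f"{taxoner_454_strain:.3f} {taxoner_ion_strain:.3f}")
```

Under this definition, the two example cells reproduce the tabulated values of 0.934 and 46.198.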
Figure 1. The Taxoner principle.
Reads are mapped to genomes and the corresponding taxon names are read from an ontology, in this case a taxonomic tree. For function analysis, the name of the mapped gene is read from an ontology of function names such as GO.
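One way to picture the lookup in the taxonomic tree is shown below. This is a minimal sketch, assuming that a read mapping to several genomes is reported at the lowest taxon shared by all of them; the caption does not confirm that rule, and the toy `PARENT` table, the strain labels, and the function names are all hypothetical.

```python
# Hypothetical toy taxonomy, stored as child -> parent links.
PARENT = {
    "S. aureus strain X": "Staphylococcus aureus",
    "S. aureus strain Y": "Staphylococcus aureus",
    "Staphylococcus aureus": "Staphylococcus",
    "Staphylococcus epidermidis": "Staphylococcus",
    "Staphylococcus": "root",
}


def lineage(taxon):
    """Return the path from a taxon up to the root, starting at the taxon."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_ancestor(taxa):
    """Return the deepest node that lies on the lineage of every taxon."""
    taxa = list(taxa)
    if not taxa:
        raise ValueError("at least one taxon is required")
    common = lineage(taxa[0])
    for taxon in taxa[1:]:
        other = set(lineage(taxon))
        # Keep leaf-to-root order so the first survivor is the lowest node.
        common = [node for node in common if node in other]
    if not common:
        raise ValueError("taxa do not share a root")
    return common[0]


# A read hitting two strains of one species resolves to that species.
print(lowest_common_ancestor(["S. aureus strain X", "S. aureus strain Y"]))
```

A read that hits only one genome keeps its strain-level label, while a read shared between species is pushed up to the genus.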
Figure 2. Detection of low abundance strains.
Number of reads necessary on average to detect an unknown species at various taxonomy levels. Error bars indicate standard deviation of the mean, calculated from 400 repetitions.
Analysis of known and unknown B. anthracis strains.
| Taxa assigned | | MetaPhlAn | Taxoner |
| A) Strain included in the database ( | | | |
| All | | 991 reads | 104,573 reads |
| Genus | | 100.00% | 100.00% |
| Species | | 76.60% | 100.00% |
| Species | | 15.40% | 0.00% |
| Species | | 8.00% | 0.00% |
| Species | other | 0.00% | 0.00% |
| False negative % | | | |
| B) Strain not included in the database ( | | | |
| All | | 96,045 reads | 7,379,118 reads |
| Genus | | 99.58% | 100.00% |
| Species | | 65.30% | 96.50% |
| Species | | 18.70% | 0.80% |
| Species | | 15.60% | 0.40% |
| Species | other | 0.00% | 2.30% |
| False negative % | | | |
The values are taken from the standard output of the program.
Values indicate the number of reads expressed as % of the total.
False negative is the % of taxa (MetaPhlAn) and reads (Taxoner) detected but not present.
Detection of species in metagenomic datasets.
| A) Illumina sequenced HMP Mock Community sample |
| | | MetaPhlAn | | | | Taxoner | | | | WGSQUICKR | | | |
| Level | No of positives (taxa present) | TP | FN | FP | F-measure | TP | FN | FP | F-measure | TP | FN | FP | F-measure |
| strain | 22 | NA | NA | NA | NA | 14 | 7 | 8 | 0.65 | 1 | 20 | 79 | 0.02 |
| species | 22 | 21 | 1 | 7 | 0.84 | 20 | 2 | 0 | 0.95 | 9 | 13 | 67 | 0.18 |
| genus | 19 | 18 | 1 | 5 | 0.86 | 17 | 2 | 0 | 0.94 | 13 | 6 | 45 | 0.34 |
| family | 18 | 18 | 0 | 6 | 0.86 | 17 | 1 | 0 | 0.97 | 13 | 5 | 29 | 0.43 |
| B) 454 sequenced HMP Mock Community sample |
| | | MetaPhlAn | | | | Taxoner | | | | WGSQUICKR | | | |
| Level | No of positives (taxa present) | TP | FN | FP | F-measure | TP | FN | FP | F-measure | TP | FN | FP | F-measure |
| strain | 22 | NA | NA | NA | NA | 9 | 12 | 19 | 0.37 | 1 | 20 | 58 | 0.03 |
| species | 22 | 19 | 3 | 0 | 0.93 | 19 | 3 | 0 | 0.93 | 5 | 17 | 52 | 0.13 |
| genus | 19 | 16 | 3 | 0 | 0.91 | 16 | 3 | 0 | 0.91 | 8 | 11 | 37 | 0.25 |
| family | 18 | 16 | 2 | 0 | 0.94 | 16 | 2 | 0 | 0.94 | 9 | 9 | 23 | 0.36 |
The data was a mock community dataset provided by the Human Microbiome Project and consisted of 22 strains.
Only hits (read-taxon assignments) were considered where the worst alignment score was at least 0.9. Positive taxa predicted by Taxoner are those that received at least 1000 hits (dataset G).
TP: true positives.
FN: false negatives.
FP: false positives.
NA: not available.
Hits (read-taxon assignments) were only considered where the worst alignment score was at least 0.9. Positive taxa predicted by Taxoner are those that received at least 100 hits (dataset H).
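The filtering described in the notes and the F-measure column can be sketched as follows. This is a minimal illustration, not Taxoner's code: it assumes each hit carries a single score compared against the 0.9 cut-off, and that the F-measure is the usual harmonic mean of precision and recall, a definition that reproduces the tabulated values. The function names are illustrative.

```python
from collections import Counter


def call_taxa(hits, min_score=0.9, min_hits=1000):
    """Return taxa that received at least min_hits hits scoring >= min_score.

    hits is an iterable of (taxon, score) pairs.
    """
    counts = Counter(taxon for taxon, score in hits if score >= min_score)
    return {taxon for taxon, n in counts.items() if n >= min_hits}


def confusion(predicted, truth):
    """Count true positives, false negatives and false positives."""
    tp = len(predicted & truth)
    fn = len(truth - predicted)
    fp = len(predicted - truth)
    return tp, fn, fp


def f_measure(tp, fn, fp):
    """Harmonic mean of precision and recall; 0.0 when nothing is correct."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Species level, Illumina sample: MetaPhlAn (21, 1, 7) and Taxoner (20, 2, 0).
print(f"{f_measure(21, 1, 7):.2f} {f_measure(20, 2, 0):.2f}")
```

With the species-level counts from the Illumina sample, this gives 0.84 for MetaPhlAn and 0.95 for Taxoner, matching the table. Lowering `min_hits` from 1000 to 100 corresponds to the threshold quoted for dataset H.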
Figure 3. Screenshot of the Taxoner summary of gene functions.
Each functional category is characterized by the number of genes that received hits.
Benchmark datasets.
| Dataset | ID | Sequencing platform (number of spots; average read length) | Taxon | Note |
| A | SRR292150 | 454 GS 20 (183203; 110.31) | | Randomly selected |
| B | ERR236069 | Ion Torrent PGM (1338465; 262.05) | | Randomly selected |
| C | SRR017390 | Illumina Genome Analyzer II (26391487; 76) | | Randomly selected |
| D | DRR000184 | Illumina Genome Analyzer II (7631281; 50) | | Randomly selected |
| E | DRR000184 | Illumina Genome Analyzer II (7631281; 50) | | Whole run |
| F | AE017225 | NA (104574; 99.99) | | Full genome sampling |
| G | SRX055380 | Illumina Genome Analyzer II (6562065; 75.00) | HMP Mock Community even sample | Whole genome sequencing |
| H | SRX030841 | 454 GS FLX Titanium (1386198; 530.22) | HMP Mock Community even sample | Whole genome sequencing |
Randomly selected datasets were produced by a Python script that uniformly sampled the read collection without replacement. The sample size was 100,000.
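The authors' sampling script is not reproduced here. A minimal sketch of uniform sampling without replacement, assuming the reads fit in memory as a list, could look like this; `sample_reads` and its parameters are illustrative names.

```python
import random


def sample_reads(reads, sample_size=100_000, seed=None):
    """Draw sample_size distinct reads uniformly at random.

    reads must be a sequence (for example a list of read records);
    a seed makes the draw reproducible.
    """
    if sample_size > len(reads):
        raise ValueError("sample_size exceeds the number of reads")
    rng = random.Random(seed)
    # random.Random.sample never picks the same position twice.
    return rng.sample(reads, sample_size)


# Toy usage with placeholder read identifiers.
toy_reads = [f"read_{i}" for i in range(1000)]
subset = sample_reads(toy_reads, sample_size=100, seed=1)
print(len(subset))
```

For read collections too large to hold in memory, reservoir sampling over the input stream would be the usual alternative.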
Supplementary files are deposited at http://pongor.itk.ppke.hu/taxoner/examples/.
The whole run was analyzed.
The genome was sampled with overlapping reads. The read length was uniformly 100 bp and the overlap between the adjacent reads was 50 bp.
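The overlapping-read scheme for the full genome sampling can be sketched as a sliding window with a step of read length minus overlap, that is 50 bp. How the final, shorter fragment at the end of the genome is handled is an assumption here (it is kept as a shorter read); `tile_genome` is an illustrative name.

```python
def tile_genome(genome, read_length=100, overlap=50):
    """Cut a sequence into reads where neighbours share `overlap` bases."""
    if not 0 <= overlap < read_length:
        raise ValueError("overlap must be in [0, read_length)")
    step = read_length - overlap
    reads = []
    for start in range(0, len(genome), step):
        reads.append(genome[start:start + read_length])
        # Stop once a window has reached the end of the sequence.
        if start + read_length >= len(genome):
            break
    return reads


# Toy usage on a 240 bp placeholder sequence.
toy_genome = "ACGT" * 60
toy_tiles = tile_genome(toy_genome)
print([len(r) for r in toy_tiles])
```

Keeping a short trailing read would explain an average read length slightly below 100 bp, as reported for dataset F, but that reading is an inference.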