Christina Ander, Ole B Schulz-Trieglaff, Jens Stoye, Anthony J Cox.
Abstract
Environmental shotgun sequencing (ESS) has potential to give greater insight into microbial communities than targeted sequencing of 16S regions, but requires much higher sequence coverage. The advent of next-generation sequencing has made it feasible for the Human Microbiome Project and other initiatives to generate ESS data on a large scale, but computationally efficient methods for analysing such data sets are needed.

Here we present metaBEETL, a fast taxonomic classifier for environmental shotgun sequences. It uses a Burrows-Wheeler Transform (BWT) index of the sequencing reads and an indexed database of microbial reference sequences. Unlike other BWT-based tools, our method has no upper limit on the number or the total size of the reference sequences in its database. By capturing sequence relationships between strains, our reference index also allows us to classify reads which are not unique to an individual strain but are nevertheless specific to some higher phylogenetic order.

Tested on datasets with known taxonomic composition, metaBEETL gave results that are competitive with existing similarity-based tools: due to normalization steps which other classifiers lack, the taxonomic profile computed by metaBEETL closely matched the true environmental profile. At the same time, its moderate running time and low memory footprint allow metaBEETL to scale well to large data sets.

Code to construct the BWT indexed database and for the taxonomic classification is part of the BEETL library, available as a github repository at git@github.com:BEETL/BEETL.git.
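
As a rough illustration of the Burrows-Wheeler Transform on which the index is built, the following Python sketch computes the BWT of a short string by sorting its rotations and then inverts it. This is a textbook construction suitable only for small inputs, not the BEETL implementation; the function names `bwt` and `inverse_bwt` and the `$` sentinel are assumptions made for this example.

```python
def bwt(text, sentinel="$"):
    """Naive BWT: last column of the sorted rotations of text + sentinel."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, sentinel="$"):
    """Naive inversion: repeatedly prepend the BWT column and re-sort."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(c + row for c, row in zip(transformed, table))
    # The row that ends with the sentinel is the original text plus sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


read = "GATTACA"
transformed = bwt(read)
print(transformed)               # ACTGA$TA
print(inverse_bwt(transformed))  # GATTACA
```

The quadratic memory use of the rotation table is the reason real tools rely on specialized index construction; the sketch only shows what the transform produces.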
Year: 2013 PMID: 23734710 PMCID: PMC3622627 DOI: 10.1186/1471-2105-14-S5-S2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Composition of the simulated metagenomic dataset with an even distribution of microbes.
| Name | Taxonomic id | Size | Fraction in simulation | Read count |
|---|---|---|---|---|
| Blattabacterium sp. str. BPLAN | 600809 | 0.64 Mb | 6.67% | 2372 |
| Borrelia hermsii DAH chromosome | 314723 | 0.92 Mb | 6.67% | 3616 |
| Candidatus Blochmannia pen. str. BPEN | 291272 | 0.79 Mb | 6.67% | 3122 |
| Candidatus Sulcia muelleri DMIN | 641892 | 0.24 Mb | 6.67% | 3122 |
| Candidatus Zinderia insecticola CARI | 871271 | 0.21 Mb | 6.67% | 816 |
| Catenulispora acidiphila DSM 44928 | 479433 | 10.47 Mb | 6.67% | 41950 |
| Chloroflexus aggregans DSM 9485 | 326427 | 4.68 Mb | 6.67% | 18684 |
| Clostridium sp. BNL1100 | 755731 | 4.61 Mb | 6.67% | 18248 |
| Deinococcus radiodurans R1 | 243230 | 3.06 Mb | 6.67% | 12066 |
| Escherichia coli DH1 | 536056 | 4.63 Mb | 6.67% | 18400 |
| Fluviicola taffensis DSM 16823 | 755732 | 4.63 Mb | 6.67% | 18258 |
| Frankia sp. CcI3 | 106370 | 5.43 Mb | 6.67% | 21282 |
| Geobacter bemidjiensis Bem | 404380 | 4.61 Mb | 6.67% | 18344 |
| Mycoplasma pneumoniae M129 | 272634 | 0.82 Mb | 6.67% | 3286 |
| Yersinia enterocolitica subsp. e. 8081 | 150052 | 4.62 Mb | 6.67% | 18548 |
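
The table lists an equal fraction (1/15 ≈ 6.67%) for each genome, while read counts grow with genome size. A minimal sketch of why raw read counts alone would misstate such a profile, using three rows copied from the table above: dividing each read count by genome size before converting to fractions brings the three genomes close to equal shares. Genome-length normalization is an assumption chosen for illustration; the abstract does not specify which normalization steps metaBEETL applies, and the helper names are hypothetical.

```python
# Three rows copied from the table above: (name, genome size in Mb, read count).
rows = [
    ("Blattabacterium sp. str. BPLAN", 0.64, 2372),
    ("Catenulispora acidiphila DSM 44928", 10.47, 41950),
    ("Escherichia coli DH1", 4.63, 18400),
]


def to_fractions(values):
    """Scale a list of non-negative numbers so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def raw_fractions(rows):
    """Share of reads per genome, ignoring genome size."""
    return to_fractions([reads for _, _, reads in rows])


def size_normalized_fractions(rows):
    """Share per genome after dividing read counts by genome size (assumed normalization)."""
    return to_fractions([reads / size_mb for _, size_mb, reads in rows])


for (name, _, _), raw, norm in zip(
    rows, raw_fractions(rows), size_normalized_fractions(rows)
):
    print(f"{name}: raw {raw:.1%}, size-normalized {norm:.1%}")
```

Among these three rows, the largest genome accounts for about two thirds of the raw reads, yet each genome lands near one third once read counts are divided by genome size.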
Running time and memory requirements of the tested classifiers on the simulated data set.
| | metaBEETL | CARMA3 | MEGAN | Genometa |
|---|---|---|---|---|
| Memory | 1 GB | 13 GB | 13 GB | 3 GB |
| Time | 46 m | 18 h 35 m | 14 h 58 m | 2 m |
CARMA3 and MEGAN were run on a compute cluster using 100 nodes. metaBEETL was run on an SSD drive. Memory consumption is the peak memory usage of one thread. All times are wall-clock times. For CARMA3 and MEGAN, the reported time is that of the longest-running of the 100 threads; the average time was 12 h 15 m for MEGAN and 12 h 30 m for CARMA3.
Comparison of correctly classified (true positive, TP) and incorrectly classified (false positive, FP) reads of the simulated metagenome for the classifiers metaBEETL, CARMA3, MEGAN and Genometa.
| Taxonomic Level | metaBEETL TP | metaBEETL FP | CARMA3 TP | CARMA3 FP | MEGAN TP | MEGAN FP | Genometa TP | Genometa FP |
|---|---|---|---|---|---|---|---|---|
| Superkingdom | 129,290 | 0 | 161,162 | 153 | 178,712 | 5 | 118,340 | 0 |
| Phylum | 129,280 | 10 | 158,904 | 395 | 176,604 | 31 | 113,138 | 5,202 |
| Class | 129,279 | 11 | 157,545 | 395 | 175,718 | 42 | 113,138 | 5,202 |
| Order | 129,279 | 11 | 155,625 | 220 | 174,737 | 47 | 113,138 | 5,202 |
| Family | 129,278 | 11 | 151,684 | 363 | 171,227 | 103 | 113,125 | 5,208 |
| Genus | 129,262 | 28 | 132,251 | 513 | 151,292 | 649 | 109,884 | 8,435 |
| Species | 129,242 | 48 | 51,920 | 232 | 110,728 | 1,196 | 109,444 | 8,896 |
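
One way to summarize the table is the precision TP / (TP + FP) per classifier. The sketch below computes it for the species-level row; the counts are copied from the table, and the function name `precision` is an assumption for this example. Precision does not capture how many reads each tool left unclassified, so it complements rather than replaces the TP counts.

```python
# Species-level (TP, FP) counts copied from the table above.
species_level = {
    "metaBEETL": (129_242, 48),
    "CARMA3": (51_920, 232),
    "MEGAN": (110_728, 1_196),
    "Genometa": (109_444, 8_896),
}


def precision(tp, fp):
    """Fraction of classified reads that were classified correctly."""
    classified = tp + fp
    if classified == 0:
        raise ValueError("no classified reads")
    return tp / classified


for tool, (tp, fp) in species_level.items():
    print(f"{tool}: precision {precision(tp, fp):.4f}")
```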
Comparison of the simulated taxonomic profile of an artificial metagenome and the predicted profiles from metaBEETL, CARMA3, MEGAN and Genometa.
| Taxonomic Level | metaBEETL | CARMA3 | MEGAN | Genometa |
|---|---|---|---|---|
| Superkingdom | 1 | 1 | 1 | - |
| Phylum | 7.47 | 22.44 | 22.89 | - |
| Class | 7.48 | 25.70 | 23.84 | - |
| Order | 9.45 | 24.26 | 24.23 | - |
| Family | 9.39 | 22.15 | 19.60 | - |
| Genus | 10.85 | 26.22 | 21.56 | 38.82 |
| Species | 10.59 | 19.02 | 22.44 | 38.16 |
We compared profiles using the Euclidean distance to the simulated profile. Results from Genometa were only available at the genus and species levels.
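
For reference, the Euclidean distance between a predicted and a true profile can be sketched as follows. The two profiles are hypothetical toy values (percent abundance per taxon), not data from the paper, and the function name `profile_distance` is an assumption; taxa missing from one profile are treated as having zero abundance.

```python
import math


def profile_distance(predicted, truth):
    """Euclidean distance between two taxon -> abundance mappings."""
    taxa = set(predicted) | set(truth)
    return math.sqrt(
        sum((predicted.get(t, 0.0) - truth.get(t, 0.0)) ** 2 for t in taxa)
    )


# Hypothetical toy profiles in percent; not taken from the paper.
truth = {"taxon_a": 50.0, "taxon_b": 30.0, "taxon_c": 20.0}
predicted = {"taxon_a": 53.0, "taxon_b": 26.0, "taxon_c": 20.0, "taxon_d": 1.0}

print(round(profile_distance(predicted, truth), 3))
```

A distance of 0 means the predicted profile matches the simulated one exactly; both over-predicted and spurious taxa (such as `taxon_d` here) increase the distance.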
Figure 1. Species-level classification computed by metaBEETL for sample SRS013948 from the Human Microbiome Project.