Colin F Davenport, Jens Neugebauer, Nils Beckmann, Benedikt Friedrich, Burim Kameri, Svea Kokott, Malte Paetow, Björn Siekmann, Matthias Wieding-Drewes, Markus Wienhöfer, Stefan Wolf, Burkhard Tümmler, Volker Ahlers, Frauke Sprengel.
Abstract
UNLABELLED: Metagenomic studies use high-throughput sequence data to investigate microbial communities in situ. However, considerable challenges remain in the analysis of these data, particularly with regard to speed and reliable analysis of microbial species as opposed to higher level taxa such as phyla. We here present Genometa, a computationally undemanding graphical user interface program that enables identification of bacterial species and gene content from datasets generated by inexpensive high-throughput short read sequencing technologies. Our approach was first verified on two simulated metagenomic short read datasets, detecting 100% and 94% of the bacterial species included with few false positives or false negatives. Subsequent comparative benchmarking analysis against three popular metagenomic algorithms on an Illumina human gut dataset revealed Genometa to attribute the most reads to bacteria at species level (i.e. including all strains of that species) and demonstrate similar or better accuracy than the other programs. Lastly, speed was demonstrated to be many times that of BLAST due to the use of modern short read aligners. Our method is highly accurate if bacteria in the sample are represented by genomes in the reference sequence but cannot find species absent from the reference. This method is one of the most user-friendly and resource efficient approaches and is thus feasible for rapidly analysing millions of short reads on a personal computer. AVAILABILITY: The Genometa program, a step-by-step tutorial and Java source code are freely available from http://genomics1.mh-hannover.de/genometa/ and on http://code.google.com/p/genometa/. This program has been tested on Ubuntu Linux and Windows XP/7.
Year: 2012 PMID: 22927906 PMCID: PMC3424124 DOI: 10.1371/journal.pone.0041224
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. A screenshot displaying key new features with a glacier ice metagenome dataset loaded.
An aligner can be run from the graphical dialogue (top right) against a reference sequence. The resulting file is then converted to the standard BAM format and read in, revealing the number of reads mapped to each species in a sortable list which can be exported for further analysis (left). A bar graph displays the number of reads attributed to each taxon. Clicking on a blue bar takes the user to a genome level view of the distribution of reads mapped against a taxon. Large datasets can thus be easily aligned, analysed and tested for plausibility from a graphical user interface.
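The per-species read list described in this caption can be illustrated with a minimal sketch. It is not Genometa's implementation (Genometa is a Java program that reads BAM files). As an assumption for illustration, the sketch works on plain-text SAM records and treats the reference sequence name (SAM column 3) as the taxon label. The function name `count_reads_per_reference` and the toy records are hypothetical.

```python
from collections import Counter


def count_reads_per_reference(sam_lines):
    """Count mapped reads per reference sequence name in SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        reference_name = fields[2]
        # Flag bit 0x4 marks an unmapped read; "*" means no reference.
        if flag & 0x4 or reference_name == "*":
            continue
        counts[reference_name] += 1
    return counts


# Hypothetical toy records; real reference names depend on the reference file.
example_sam = [
    "@SQ\tSN:Escherichia_coli\tLN:1000",
    "@SQ\tSN:Bacteroides_vulgatus\tLN:1000",
    "read1\t0\tEscherichia_coli\t10\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t16\tEscherichia_coli\t50\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t0\tBacteroides_vulgatus\t7\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]

read_counts = count_reads_per_reference(example_sam)

# Print taxa sorted by descending read count, like a sortable list.
for taxon, n_reads in read_counts.most_common():
    print(f"{taxon}\t{n_reads}")
```

Sorting by count mirrors the sortable list in the screenshot; grouping several strains into one species would need an extra mapping from reference names to species, which is not shown here.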
Comparative duration of alignment by BLAST and Bowtie versus the same metagenomic reference for four metagenome datasets.
| Dataset | Alignment tool | Number of reads | Alignment duration, 7 threads (s) | Normalised alignment duration, 1 thread (s) | Reads per thread per second | Dataset reference |
|---|---|---|---|---|---|---|
| Human gut | BLAST | 1501409 | 885898 | 6201286 | 0.24 | Kurokawa et al. 2007 |
| Human gut | Bowtie | 1501409 | 109 | 763 | 1967.77 | Kurokawa et al. 2007 |
| Human stool Diarrhea | BLAST | 96941 | 54180 | 379260 | 0.26 | Nakamura et al. 2008 |
| Human stool Diarrhea | Bowtie | 96941 | 14 | 98 | 989.19 | Nakamura et al. 2008 |
| Vineyard | BLAST | 9623513 | 5854784 | 40983488 | 0.23 | Coetzee et al. 2010 |
| Vineyard | Bowtie | 9623513 | 2617 | 18319 | 525.33 | Coetzee et al. 2010 |
| CF lung | BLAST | 772097 | 432374 | 3026618 | 0.26 | Willner et al. 2009 |
| CF lung | Bowtie | 772097 | 69 | 483 | 1598.54 | Willner et al. 2009 |
Duration was extrapolated after ∼24 hours.
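The derived columns in the table above can be reproduced with a short calculation. The sketch below assumes, as the tabulated values suggest, that the single-thread duration is the 7-thread duration multiplied by 7 (linear scaling) and that throughput is reads divided by that normalised duration. The function name `normalise` and the speed-up column are illustrative additions, not part of the original analysis.

```python
THREADS = 7  # number of threads used for the timed runs in the table


def normalise(reads, duration_s, threads=THREADS):
    """Return (single-thread duration in s, reads per thread per second).

    Assumes run time scales linearly with the number of threads.
    """
    single_thread_s = duration_s * threads
    return single_thread_s, reads / single_thread_s


# (dataset, reads, BLAST duration in s, Bowtie duration in s) from the table
timings = [
    ("Human gut", 1501409, 885898, 109),
    ("Human stool Diarrhea", 96941, 54180, 14),
    ("Vineyard", 9623513, 5854784, 2617),
    ("CF lung", 772097, 432374, 69),
]

for dataset, reads, blast_s, bowtie_s in timings:
    _, blast_rate = normalise(reads, blast_s)
    _, bowtie_rate = normalise(reads, bowtie_s)
    # Speed-up is the ratio of durations measured on the same read set.
    speedup = blast_s / bowtie_s
    print(f"{dataset}: BLAST {blast_rate:.2f}, Bowtie {bowtie_rate:.2f} "
          f"reads/thread/s, speed-up about {speedup:.0f}x")
```

Because both tools were timed with the same thread count on the same reads, the speed-up ratio does not depend on the linear-scaling assumption.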
Figure 2. Number of reads per species present in an in-house simulated ocean metagenome compared to the number of reads assigned to a reference containing all known strains by Genometa.
All bacterial species present were detected. Reads were retrieved in the same stoichiometric proportions in which they were inserted. Halobacterium sp NRC-1 was also detected, but this strain is colinear and practically identical to the included strain Halobacterium salinarum R1 [29].
Figure 3. Number of reads from an artificial metagenome of known composition (SimLC dataset) which were included in the metagenome (black bars) and assigned to the correct bacterial species by Genometa (blue bars).
Only the top 21 species of the 113 bacteria included in the dataset are shown. Genometa achieves a high accuracy on this dataset. Asterisks indicate strains which are included in the SimLC dataset but not in the Genometa reference sequence. Inter-strain differences generally mean fewer reads are attributed to these taxa. The cross denotes a species which is not present in the Genometa reference sequence.
Figure 4. The number of reads, out of 100,000 Illumina human gut 100 bp reads (SRR042027, Human Microbiome Project, [17]), assigned to bacterial species by four metagenomic programs.
Note the general agreement between the different programs but the higher number of read assignments achieved by Genometa and MG-RAST. All programs found bacterial species typical of a human gut metagenome.
Software recommendations for analysis of different types of metagenome datasets.
| Metagenome dataset type | Read length (bp) | Recommended algorithms | Reference |
|---|---|---|---|
| 16S rDNA | 400 | QIIME, Mothur, RDP classifier | |
| Whole genome shotgun 454/Ion torrent | 200–400 | MEGAN/MG-RAST WebMGA/EBI/FR-HIT | |
| Whole genome shotgun SOLiD/Illumina | 50–120 | Genometa/MG-RAST | This study |