Monzoorul Haque Mohammed, Tarini Shankar Ghosh, Rachamalla Maheedhar Reddy, Chennareddy Venkata Siva Kumar Reddy, Nitin Kumar Singh, Sharmila S Mande.
Abstract
BACKGROUND: Taxonomic classification of metagenomic sequences is the first step in metagenomic analysis. Existing taxonomic classification approaches are of two types: similarity-based and composition-based. Similarity-based approaches, though accurate and specific, are extremely slow. Since metagenomic projects generate millions of sequences, adopting similarity-based approaches becomes virtually infeasible for research groups with modest computational resources. In this study, we present INDUS, a composition-based approach that incorporates the following novel features. First, INDUS discards the 'one genome-one composition' model adopted by existing compositional approaches. Second, INDUS uses 'compositional distance' information for identifying appropriate assignment levels. Third, INDUS incorporates steps that attempt to reduce biases due to database representation.
Year: 2011 PMID: 22369237 PMCID: PMC3333187 DOI: 10.1186/1471-2164-12-S3-S4
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Work-flow of the INDUS algorithm. A schematic work-flow depicting the various steps followed by the INDUS algorithm for taxonomic assignment of query sequences.
Range of thresholds for determining an appropriate taxonomic level of assignments (TL)
| Lowest taxonomic level where the query sequence can be assigned | Distance range, Sanger (800 bp) | Distance range, 454-Titanium (400 bp) | Distance range, 454-Standard (250 bp) | Distance range, 454-GS20 (100 bp) |
|---|---|---|---|---|
| | < 0.28 | < 0.35 | < 0.43 | < 0.6 |
| | 0.28 – 0.32 | 0.35 – 0.41 | 0.43 – 0.51 | > 0.6 |
| | > 0.32 | > 0.41 | > 0.51 | |
Range of distance values (between vectors corresponding to a query sequence and the closest genome fragment in reference database) to be used for determining an appropriate taxonomic level (TL) of assignment for a given query sequence.
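The lookup implied by these thresholds can be sketched in a few lines. This is a minimal illustration, not the INDUS implementation: the function name `threshold_tier` and the dictionary `TIER_BOUNDS` are hypothetical, the taxonomic-level labels of the rows are not legible in this copy of the table (so the sketch returns the row position instead of a level name), and the handling of values that fall exactly on a boundary is an assumption.

```python
from bisect import bisect_right

# Cut-off distances separating the rows of the threshold table, per read type.
# The 454-GS20 column lists only two ranges, so it has a single cut-off.
TIER_BOUNDS = {
    "Sanger (800 bp)": (0.28, 0.32),
    "454-Titanium (400 bp)": (0.35, 0.41),
    "454-Standard (250 bp)": (0.43, 0.51),
    "454-GS20 (100 bp)": (0.6,),
}


def threshold_tier(distance, read_type):
    """Return the 0-based table row whose distance range contains `distance`.

    Row 0 is the smallest-distance range. Assumption: a distance exactly
    equal to a cut-off is placed in the higher-numbered row.
    """
    return bisect_right(TIER_BOUNDS[read_type], distance)
```

For example, a Sanger-length query whose nearest reference fragment lies at distance 0.30 falls in the middle row (index 1) of the table.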
Figure 2. Results of validation on simulated test data sets. Graphical representation of the obtained pattern of assignments and the time taken by INDUS, TACOA, SOrt-ITEMS, MEGAN and SPHINX on the (A) Sanger, (B) 454-400, (C) 454-250 and (D) 454-100 test data sets.
Results of validation on FAMeS Data sets
| FAMeS data set | Taxonomic assignment category | INDUS (complete database) | TACOA (complete database) | SOrt-ITEMS (complete database) | MEGAN (complete database) | SPHINX (complete database) | INDUS (modified database) | TACOA (modified database) | SOrt-ITEMS (modified database) | MEGAN (modified database) | SPHINX (modified database) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SimLC (96732) | Correct | 89.6 | 79.3 | 94.9 | 95 | 93.1 | 81.6 | 74.4 | 78.1 | 72.6 | 83.7 |
| | Wrong | 2.6 | 9 | 2 | 2.7 | 2.8 | 7.1 | 13.9 | 9.2 | 23.6 | 11.5 |
| | Specific | 81.3 | 24.4 | 94.9 | 93.9 | 82.8 | 66.7 | 21 | 78.1 | 65 | 71 |
| | Non-specific | 8.3 | 54.9 | 0 | 1.1 | 10.3 | 14.9 | 53.5 | 0 | 7.6 | 12.7 |
| | Unassigned | 7.9 | 11.6 | 3.1 | 2.3 | 4.1 | 11.4 | 11.6 | 12.6 | 3.7 | 4.8 |
| SimMC (113373) | Correct | 92.8 | 82.5 | 93.7 | 94 | 92.6 | 86 | 79 | 79.1 | 69.9 | 81.8 |
| | Wrong | 1 | 7.3 | 3.1 | 3.5 | 3.6 | 4.6 | 10.8 | 9.1 | 26.6 | 13.5 |
| | Specific | 84.1 | 23.7 | 93.7 | 93.1 | 81.1 | 71.9 | 23 | 79.1 | 63.2 | 70.1 |
| | Non-specific | 8.7 | 58.8 | 0 | 1 | 11.5 | 14.1 | 56 | 0 | 6.7 | 11.6 |
| | Unassigned | 6.2 | 10.2 | 3.2 | 2.4 | 3.8 | 9.4 | 10.2 | 11.8 | 3.5 | 4.7 |
| SimHC (115592) | Correct | 89.6 | 72.1 | 92 | 91.9 | 86 | 78.9 | 67 | 76.6 | 77.9 | 77.6 |
| | Wrong | 1.9 | 11.4 | 3.6 | 4.9 | 7.1 | 5.7 | 16 | 9.5 | 17.6 | 10.1 |
| | Specific | 76.4 | 18.7 | 92 | 90.3 | 79.6 | 63.4 | 18.2 | 76.6 | 69.5 | 71.2 |
| | Non-specific | 13.2 | 53.4 | 0 | 1.5 | 6.4 | 15.5 | 48.8 | 0 | 8.4 | 6.4 |
| | Unassigned | 8.6 | 16.5 | 4.3 | 3.2 | 6.9 | 15.5 | 17 | 13.9 | 4.5 | 12.3 |
| Average for FAMeS data sets | Correct | 90.6 | 78 | 93.5 | 93.6 | 90.6 | 82.2 | 73.5 | 77.9 | 73.5 | 81 |
| | Wrong | 1.8 | 9.3 | 2.9 | 3.7 | 4.5 | 5.8 | 13.6 | 9.3 | 22.6 | 11.7 |
| | Specific | 80.6 | 22.3 | 93.5 | 92.4 | 81.2 | 67.3 | 20.7 | 77.9 | 65.9 | 70.8 |
| | Non-specific | 10.1 | 55.7 | 0 | 1.2 | 9.4 | 14.8 | 52.8 | 0 | 7.6 | 10.2 |
| | Unassigned | 7.6 | 12.8 | 3.5 | 2.6 | 4.9 | 12.1 | 12.9 | 12.8 | 3.9 | 7.3 |
Summary of the results obtained with the FAMeS metagenomic data sets. The complete and modified reference databases contained genome fragments from 952 and 652 prokaryotic genomes, respectively. The number of sequences in each data set is indicated in parentheses.
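The per-data-set figures in the FAMeS table appear to satisfy Correct = Specific + Non-specific, although this excerpt does not state that relation explicitly. The following sketch checks it for the INDUS columns; the names `INDUS_FAMES` and `max_discrepancy` are hypothetical, and the values are copied from the table.

```python
# (correct, specific, non_specific) percentages reported for INDUS,
# keyed by (data set, reference database).
INDUS_FAMES = {
    ("SimLC", "complete"): (89.6, 81.3, 8.3),
    ("SimMC", "complete"): (92.8, 84.1, 8.7),
    ("SimHC", "complete"): (89.6, 76.4, 13.2),
    ("SimLC", "modified"): (81.6, 66.7, 14.9),
    ("SimMC", "modified"): (86.0, 71.9, 14.1),
    ("SimHC", "modified"): (78.9, 63.4, 15.5),
}


def max_discrepancy(rows):
    """Largest absolute gap between Correct and Specific + Non-specific."""
    return max(abs(correct - (specific + non_specific))
               for correct, specific, non_specific in rows.values())
```

A result well below the table's rounding step of 0.1 suggests that the "Specific" and "Non-specific" rows subdivide the "Correct" row rather than being independent categories.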
Validation results of INDUS with Sargasso sea metagenomic data set.
| Order | Sequences assigned (%) | Class | Sequences assigned (%) |
|---|---|---|---|
| Burkholderiales | 31.02 | Betaproteobacteria | 33.93 |
| Alteromonadales | 10.64 | Gammaproteobacteria | 20.07 |
| Prochlorales | 1.81 | Alphaproteobacteria | 0.9 |
| Aeromonadales | 1.22 | Bacilli | 0.34 |
| Chroococcales | 1.06 | Clostridia | 0.29 |
| Enterobacteriales | 0.53 | Actinobacteria (class) | 0.26 |
| Pseudomonadales | 0.52 | Mollicutes | 0.25 |
| Rhizobiales | 0.26 | Spirochaetes (class) | 0.03 |
| Clostridiales | 0.21 | Deltaproteobacteria | 0.02 |
| Actinomycetales | 0.11 | Epsilonproteobacteria | 0.02 |
| Rickettsiales | 0.11 | | |
| Mycoplasmatales | 0.09 | **Phylum** | **Sequences assigned (%)** |
| Bacillales | 0.07 | Proteobacteria | 60.98 |
| Xanthomonadales | 0.06 | Cyanobacteria | 3.67 |
| Lactobacillales | 0.04 | Firmicutes | 1.15 |
| Nitrosopumilales | 0.04 | Actinobacteria | 0.3 |
| Spirochaetales | 0.03 | Tenericutes | 0.29 |
| Thiotrichales | 0.02 | Thaumarchaeota | 0.06 |
| Campylobacterales | 0.02 | Spirochaetes | 0.04 |
| Rhodobacterales | 0.02 | Bacteroidetes | 0.02 |
| Vibrionales | 0.02 | Euryarchaeota | 0.02 |
| - | - | Thermotogae | 0.02 |
| - | - | Planctomycetes | 0.02 |
Taxonomic profile, indicating the cumulative percentage of sequences assigned at various taxonomic levels. The cumulative number of sequences assigned at each taxonomic level is indicated in parentheses.
Comparison of results of INDUS with other binning methods for Sargasso sea sample 1* metagenomic data set.
| Binning method | Total number of sequences | Time taken for analysis (minutes) | Total number of sequences assigned | Cumulative number assigned at phylum level | Cumulative number assigned at class level | Cumulative number assigned at order level |
|---|---|---|---|---|---|---|
| INDUS | 10000 | 13 | 8167 | 5793 | 4416 | 3748 |
| TACOA | 10000 | 180 | 8870 | 3518 | 2739 | 2545 |
| SOrt-ITEMS | 10000 | 347 | 8528 | 8173 | 6921 | 5506 |
| MEGAN | 10000 | 321 | 8866 | 8417 | 7559 | 7461 |
| SPHINX | 10000 | 23 | 9116 | 5346 | 3702 | 2726 |

| Taxonomic level | Taxon | INDUS | TACOA | SOrt-ITEMS | MEGAN | SPHINX |
|---|---|---|---|---|---|---|
| Order | Burkholderiales | 22.79 | 16.75 | 25.6 | 28.63 | 20.31 |
| Order | Alteromonadales | 12.81 | 5.57 | 17.24 | 18.65 | 5.57 |
| Order | Rickettsiales | - | - | 5.58 | 12.78 | - |
| Order | Prochlorales | 1.88 | - | 3.01 | 2.94 | - |
| Order | Enterobacteriales | - | 1.75 | - | - | - |
| Class | Betaproteobacteria | 24.31 | 16.76 | 28.2 | 28.89 | 20.31 |
| Class | Gammaproteobacteria | 19.85 | 9.48 | 22.91 | 24.6 | 16.71 |
| Class | Alphaproteobacteria | - | - | 15.55 | 18.41 | - |
| Class | Flavobacteria | - | - | - | 2.12 | - |
| Phylum | Proteobacteria | 52.54 | 32.1 | 73.6 | 75.71 | 47.78 |
| Phylum | Cyanobacteria | 3.73 | 0.53 | 4.84 | 4.83 | - |
| Phylum | Firmicutes | 1.66 | 1.25 | 0.27 | 0.25 | 4.17 |
| Phylum | Bacteroidetes | - | 0.01 | 1.97 | 2.19 | - |
| Phylum | Tenericutes | - | 0.47 | - | - | 1.51 |

The cumulative percentage** of sequences assigned by INDUS, TACOA, SOrt-ITEMS, MEGAN and SPHINX at order, class and phylum levels.
* Sample 1 refers to the subset of 10000 reads from the Sargasso sea data set [18] earlier analysed using MEGAN [2] and SOrt-ITEMS [4].
** Percentages shown are with respect to the total number of sequences (i.e., 10000) in the Sample 1 data set. Only those taxa are shown for which at least one of the methods assigned a minimum of 1.5% of the sequences in the data set.
Estimates of time required for taxonomic binning of some real metagenomic data sets
| Metagenome | Total number of sequences | Sequence length range | INDUS time (minutes) | TACOA time (minutes) | SOrt-ITEMS time (minutes) | MEGAN time (minutes) | SPHINX time (minutes) | Reference(s) |
|---|---|---|---|---|---|---|---|---|
| Global Ocean Survey | 7521215 | ~800 bp | 10530 (~7 days) | 129580 (~90 days) | 319330 (~221 days) | 287095 (199 days) | 15901 (11 days) | [ |
| Lean and obese mouse metagenome | 1744283 | ~100 bp | 1544 (~1 day) | 8771 (6 days) | 52097 (36 days) | 48390 (33 days) | 2093 (1.5 days) | [ |
| Malnourished child metagenome | 1496170 | ~250 bp - 400 bp | 1795 (1.2 days) | 17526 (12 days) | 51297 (~36 days) | 44885 (~31 days) | 2308 (1.6 days) | [ |
| Acid Mine Drainage | 180713 | ~800 bp | 252 | 3113 | 7672 | 6898 | 382 | [ |
Approximate time (in minutes) estimated to be taken by INDUS, TACOA, SOrt-ITEMS, MEGAN and SPHINX for binning some of the real metagenomic data sets (on a desktop with an Intel Xeon-Quad core processor and 4 GB RAM)
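The day counts and relative speeds in the timing table follow from simple arithmetic on the minute estimates. The sketch below reproduces that arithmetic for the Global Ocean Survey row; the names `GOS_MINUTES`, `minutes_to_days` and `relative_time` are hypothetical, and the figures are copied from the table rather than measured.

```python
MINUTES_PER_DAY = 24 * 60

# Estimated binning time in minutes for the Global Ocean Survey data set.
GOS_MINUTES = {
    "INDUS": 10530,
    "TACOA": 129580,
    "SOrt-ITEMS": 319330,
    "MEGAN": 287095,
    "SPHINX": 15901,
}


def minutes_to_days(minutes):
    """Convert a duration in minutes to (fractional) days."""
    return minutes / MINUTES_PER_DAY


def relative_time(reference, times):
    """Express each method's time as a multiple of the reference method's time."""
    ref = times[reference]
    return {method: t / ref for method, t in times.items()}
```

For instance, `minutes_to_days(GOS_MINUTES["INDUS"])` gives roughly 7.3 days, in line with the "~7 days" shown in the table, and `relative_time("INDUS", GOS_MINUTES)` shows the similarity-based methods taking more than 25 times as long as INDUS on this data set.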