Literature DB >> 23658416

A poor man's BLASTX--high-throughput metagenomic protein database search using PAUDA.

Daniel H Huson1, Chao Xie.   

Abstract

SUMMARY: In the context of metagenomics, we introduce a new approach to protein database search called PAUDA, which runs ~10,000 times faster than BLASTX, while achieving about one-third of the assignment rate of reads to KEGG orthology groups, and producing gene and taxon abundance profiles that are highly correlated to those obtained with BLASTX. PAUDA requires <80 CPU hours to analyze a dataset of 246 million Illumina DNA reads from permafrost soil for which a previous BLASTX analysis (on a subset of 176 million reads) reportedly required 800,000 CPU hours, leading to the same clustering of samples by functional profiles. AVAILABILITY: PAUDA is freely available from: http://ab.inf.uni-tuebingen.de/software/pauda. Also supplementary method details are available from this website.

Entities:  

Mesh:

Year:  2013        PMID: 23658416      PMCID: PMC3866550          DOI: 10.1093/bioinformatics/btt254

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


In metagenomics studies, millions of DNA or cDNA reads are sequenced from environmental samples, and these are then analyzed in an attempt to determine the functional or taxonomic content of the samples (Handelsman ). An important computational step is to determine the genes or coding sequences present, which is usually done by aligning the sequences against a reference database of protein sequences. In most projects, BLASTX (Altschul ) has been the method of choice, despite the fact that running BLASTX requires thousands of CPU hours per million reads. In the related area of read mapping, numerous methods have been developed to solve the problem of aligning sequencing reads against DNA reference sequences in a high-throughput manner (for example, Langmead and Salzberg, 2012). Using read mapping tools directly for analyzing complex metagenomes is problematic because environmental reads usually do not match existing genome reference sequences. Moreover, the underlying algorithms cannot easily be extended to protein sequences. In this article, we present a new paradigm for the alignment of environmental sequencing reads called PAUDA, an acronym for ‘protein alignment using a DNA aligner’. It allows one to harness the high efficiency of DNA read aligners to compute BLASTX-like alignments. The key idea is to convert all protein sequences into ‘pseudo DNA’, or ‘pDNA’ for short, by mapping the amino acid alphabet onto a four-lettered alphabet that reflects which amino acids are likely to replace each other in significant BLASTX alignments. A high-throughput sequencing read aligner, such as Bowtie2, is then used to compare pDNA reads with a pDNA database. For any match found, the participating pDNA sequences are translated back into protein sequences, and the corresponding protein alignment is calculated so as to determine statistical significance. The final output is a file of statically significant protein alignments in BLASTX format. We have implemented this approach in a new software package called PAUDA. The package provides two scripts, pauda-build and pauda-run. The first script is run on the protein reference database and builds an appropriate index. The second script is run on a file of DNA reads and produces a BLASTX file as output. The two scripts use the Bowtie2 suite and a number of new Java programs that we have written. Bowtie2 can easily be replaced by some other method, if desired. An overview of the package is given in Figure 1.
Fig. 1.

An overview of the PAUDA approach

An overview of the PAUDA approach Using Bowtie2 as the comparison engine, PAUDA runs ∼10 000 times faster than BLASTX, while assigning about one-third as many reads to KO groups. Because of the huge computational burden of running BLASTX on a large dataset, BLASTX is rarely run to completion; therefore, the key question is how many reads can be assigned per hour. PAUDA assigns ∼3000 as many reads as BLASTX does, per hour. Mackelprang present a taxonomic and functional analysis of 12 permafrost datasets. Reanalysis of their data, a comparison of 246 million Illumina reads with the KEGG database (Kanehisa and Goto, 2000), takes ∼2 h on a single workstation (64 cores, 512 GB of main memory) using PAUDA, reproducing the main result of the article. In addition, we applied an early version of PAUDA to an unpublished dataset consisting of all 2.9 billion reads of a whole HiSeq2000 run on a waste-waster sample, requiring ∼2 days on 150 cores, whereas RAPSearch2 (Zhao ) required 115 days. We produced a benchmark dataset for comparing the performance of PAUDA, BLASTX and RAPSearch2 by taking the first 600 000 good quality reads from each of the 12 samples published in Mackelprang . We then ran all three programs on each of the 12 samples benchmark samples, comparing with the KEGG database. Running all samples in parallel on a single workstation using 48 cores, the runtime ranged from 7 min (PAUDA) to 7 days (BLASTX), Table 1. We used the metagenome analysis program MEGAN (Huson ) to assign reads to KEGG orthology (KO) groups based on their alignments.
Table 1.

Alignment of Illumina reads from permafrost data against the KEGG database

MethodaTimebSpeed-upcReads assignedd
KOse
True KOsf
PAUDA7155 82433%418278%1717 99.0%
RAPSearch2510449 14496%523798%1712 98.7%
BLASTX30 240465 588100%5363100%1735100%

aThe method used.

bThe number of wall-clock minutes required on 48 cores to process all 12 datasets.

cThe speed-up over BLASTX.

dThe number and percentage of reads that obtain a KO assignment.

eThe number and percentage of different KO groups identified.

fThe number of ‘true’ KO groups identified, defined as those that account for 99% of all reads with BLASTX hits. Percentages are in comparison with the results obtained by BLASTX. Note that half of the runtime reported here for PAUDA is start up overhead and on larger datasets the speed-up is .

Alignment of Illumina reads from permafrost data against the KEGG database aThe method used. bThe number of wall-clock minutes required on 48 cores to process all 12 datasets. cThe speed-up over BLASTX. dThe number and percentage of reads that obtain a KO assignment. eThe number and percentage of different KO groups identified. fThe number of ‘true’ KO groups identified, defined as those that account for 99% of all reads with BLASTX hits. Percentages are in comparison with the results obtained by BLASTX. Note that half of the runtime reported here for PAUDA is start up overhead and on larger datasets the speed-up is . Using PAUDA, the rate of assignment is 33% of that of BLASTX. In more detail, for alignments with a protein identity of 60, 70, 80, 90 and 100% the sensitivity is 35.1, 48.4, 61.6 and 78.5%, respectively. For alignments with identity <50%, the sensitivity is <8.1%. For those reads for which both BLASTX and PAUDA are able to assign a KO group, the assignment differs in ∼2% of all cases. Assuming a false-positive error rate of 1% for the assignment of reads to KO groups, BLASTX identifies 1735 ‘true’ KO groups for this dataset that account for 99% of all reads with BLASTX hits. PAUDA identifies 99% of these. The number of reads assigned to individual KO groups by PAUDA and BLASTX is highly correlated, as shown in Figure 2. The Pearson correlation is 0.977 for linear read counts and 0.949 for log-transformed counts.
Fig. 2.

KEGG comparison of PAUDA and BLASTX. Left: Each true KO group is represented by a dot with coordinates that correspond to the number of reads assigned to the KO group by BLASTX (on the x-axis) and PAUDA (on the y-axis). Right: To show the low abundance KO groups more clearly, here, we plot the same data on a logarithmic scale

KEGG comparison of PAUDA and BLASTX. Left: Each true KO group is represented by a dot with coordinates that correspond to the number of reads assigned to the KO group by BLASTX (on the x-axis) and PAUDA (on the y-axis). Right: To show the low abundance KO groups more clearly, here, we plot the same data on a logarithmic scale Using the LCA assignment algorithm as implemented in MEGAN, we also performed a taxonomic analysis of these datasets at a number of different taxonomic ranks. The results based on PAUDA and BLASTX are highly correlated, with a Pearson’s correlation coefficient r that ranges from 0.993 for the taxonomic rank of class to 0.953 for species. The corresponding range for log-transformed counts is 0.982–0.914. To further illustrate the accuracy of PAUDA, we applied the program to all 12 permafrost samples in their entirety, in total comparing 246 million reads with the KEGG database. A key result of (Mackelprang ) is that, on the one hand, two different frozen samples taken from the active layer of the permafrost have similar functional profiles, and that these change only little after thawing for 2 or 7 days. Although, on the other hand, two frozen samples obtained from the permafrost layer initially exhibit distinctive profiles that gradually become more similar during thawing. A PCoA analysis of Bray–Curtis distances (Mitra ) based on a PAUDA comparison of the data with the KEGG database delivers the same result in a small fraction of the computational time. Funding: National Research Foundation and Ministry of Education Singapore under its Research Centre of Excellence Programme. Conflict of Interest: none declared.
  8 in total

1.  KEGG: kyoto encyclopedia of genes and genomes.

Authors:  M Kanehisa; S Goto
Journal:  Nucleic Acids Res       Date:  2000-01-01       Impact factor: 16.971

2.  Metagenomic analysis of a permafrost microbial community reveals a rapid response to thaw.

Authors:  Rachel Mackelprang; Mark P Waldrop; Kristen M DeAngelis; Maude M David; Krystle L Chavarria; Steven J Blazewicz; Edward M Rubin; Janet K Jansson
Journal:  Nature       Date:  2011-11-06       Impact factor: 49.962

3.  Basic local alignment search tool.

Authors:  S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal:  J Mol Biol       Date:  1990-10-05       Impact factor: 5.469

4.  Comparison of multiple metagenomes using phylogenetic networks based on ecological indices.

Authors:  Suparna Mitra; Jack A Gilbert; Dawn Field; Daniel H Huson
Journal:  ISME J       Date:  2010-04-29       Impact factor: 10.302

5.  Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products.

Authors:  J Handelsman; M R Rondon; S F Brady; J Clardy; R M Goodman
Journal:  Chem Biol       Date:  1998-10

6.  Integrative analysis of environmental sequences using MEGAN4.

Authors:  Daniel H Huson; Suparna Mitra; Hans-Joachim Ruscheweyh; Nico Weber; Stephan C Schuster
Journal:  Genome Res       Date:  2011-06-20       Impact factor: 9.043

7.  Fast gapped-read alignment with Bowtie 2.

Authors:  Ben Langmead; Steven L Salzberg
Journal:  Nat Methods       Date:  2012-03-04       Impact factor: 28.547

8.  RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data.

Authors:  Yongan Zhao; Haixu Tang; Yuzhen Ye
Journal:  Bioinformatics       Date:  2011-10-28       Impact factor: 6.937

  8 in total
  22 in total

1.  Evaluating techniques for metagenome annotation using simulated sequence data.

Authors:  Richard J Randle-Boggis; Thorunn Helgason; Melanie Sapp; Peter D Ashton
Journal:  FEMS Microbiol Ecol       Date:  2016-05-08       Impact factor: 4.194

2.  Fast and sensitive protein alignment using DIAMOND.

Authors:  Benjamin Buchfink; Chao Xie; Daniel H Huson
Journal:  Nat Methods       Date:  2014-11-17       Impact factor: 28.547

3.  Frameshift alignment: statistics and post-genomic applications.

Authors:  Sergey L Sheetlin; Yonil Park; Martin C Frith; John L Spouge
Journal:  Bioinformatics       Date:  2014-08-28       Impact factor: 6.937

4.  Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification.

Authors:  Ivan Borozan; Stuart Watt; Vincent Ferretti
Journal:  Bioinformatics       Date:  2015-01-07       Impact factor: 6.937

5.  Tax4Fun: predicting functional profiles from metagenomic 16S rRNA data.

Authors:  Kathrin P Aßhauer; Bernd Wemheuer; Rolf Daniel; Peter Meinicke
Journal:  Bioinformatics       Date:  2015-05-07       Impact factor: 6.937

6.  BlastXtract2: Improving early exploration of (meta) genomes.

Authors:  Heleen de Weerd; Bernd E van der Veen; Marcus J Claesson
Journal:  Bioinformation       Date:  2015-04-30

7.  Alignment behaviors of short peptides provide a roadmap for functional profiling of metagenomic data.

Authors:  Rohita Sinha; Jennifer Clarke; Andrew K Benson
Journal:  BMC Genomics       Date:  2015-12-21       Impact factor: 3.969

8.  Challenges of the Unknown: Clinical Application of Microbial Metagenomics.

Authors:  Graham Rose; David J Wooldridge; Catherine Anscombe; Edward T Mee; Raju V Misra; Saheer Gharbia
Journal:  Int J Genomics       Date:  2015-09-14       Impact factor: 2.326

9.  De novo transcriptome hybrid assembly and validation in the European earwig (Dermaptera, Forficula auricularia).

Authors:  Anne C Roulin; Min Wu; Samuel Pichon; Roberto Arbore; Simone Kühn-Bühlmann; Mathias Kölliker; Jean-Claude Walser
Journal:  PLoS One       Date:  2014-04-10       Impact factor: 3.240

Review 10.  Metagenomic search strategies for interactions among plants and multiple microbes.

Authors:  Ulrich Melcher; Ruchi Verma; William L Schneider
Journal:  Front Plant Sci       Date:  2014-06-11       Impact factor: 5.753

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.