
PyNAST: a flexible tool for aligning sequences to a template alignment.

J Gregory Caporaso, Kyle Bittinger, Frederic D Bushman, Todd Z DeSantis, Gary L Andersen, Rob Knight.

Abstract

MOTIVATION: The Nearest Alignment Space Termination (NAST) tool is commonly used in sequence-based microbial ecology community analysis, but due to the limited portability of the original implementation, it has not been as widely adopted as possible. Python Nearest Alignment Space Termination (PyNAST) is a complete reimplementation of NAST, which includes three convenient interfaces: a Mac OS X GUI, a command-line interface and a simple application programming interface (API).
RESULTS: The availability of PyNAST will make the popular NAST algorithm more portable and thereby applicable to datasets orders of magnitude larger by allowing users to install PyNAST on their own hardware. Additionally, because users can align to arbitrary template alignments, a feature not available via the original NAST web interface, the NAST algorithm will be readily applicable to novel tasks outside of microbial community analysis.
AVAILABILITY: PyNAST is available at http://pynast.sourceforge.net.


Year:  2009        PMID: 19914921      PMCID: PMC2804299          DOI: 10.1093/bioinformatics/btp636

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

The Nearest Alignment Space Termination (NAST) tool (DeSantis et al., 2006b) was developed to efficiently align thousands of 16S rRNA genes using an alignment compression algorithm to create multiple sequence alignments (MSAs) with a set number of columns. The Greengenes ribosomal database (DeSantis et al., 2006a) has oriented the 400 000 known near-full-length 16S rRNA genes with NAST, and this aligner has become an integral tool in microbial community analysis. Additionally, users have created project-specific MSAs of this scale via the web interface at http://greengenes.lbl.gov/NAST. While NAST has been available to a wide audience as a web application, the difficulty of installing it locally has limited the applicability of NAST for many users. Here, we present Python Nearest Alignment Space Termination (PyNAST), a complete reimplementation of the NAST algorithm using the PyCogent toolkit (Knight et al., 2007). Several key features have been added in PyNAST, representing significant enhancements. New features include: (i) three convenient interfaces: a Mac OS X GUI (Fig. 1a), a command-line interface and a simple application programming interface (API); (ii) parameterized algorithms at key steps of the analysis [e.g. pairwise alignment can be performed with BLAST, MUSCLE (Edgar, 2004), MAFFT (Katoh et al., 2005), ClustalW (Thompson et al., 1994), or the PyCogent pairwise hidden Markov model (HMM) aligner, and is extensible to incorporate new pairwise aligners]; (iii) an open source software package with minimal dependencies (Python, NumPy and BLAST), designed for easy installation on single machines or in a cluster environment; and (iv) the ability to align an arbitrary sequence against an arbitrary template alignment, rather than only 16S sequences.

Fig. 1.

(A) Screenshot of the PyNAST graphical user interface for Mac OS X. (B) Runtime of PyNAST is compared with that of NAST, each running on a single processor. PyNAST has a slightly shorter per sequence runtime (slope). The candidate sequences used in this evaluation ranged from 917 to 1343 bases, with a median length of 1294. The template alignment was a Greengenes core set (dated November 8, 2007) with 7682 positions and 4938 sequences.

2 ALGORITHM

The NAST algorithm aligns a candidate sequence to a template alignment. The output, the aligned candidate sequence, is guaranteed to be the same length as the input template alignment. In the original NAST implementation, each user-submitted candidate sequence is aligned to the Greengenes ‘Core Set’ (a template alignment), comprising approximately 5000 aligned non-chimeric sequences representative of the currently recognized diversity among bacteria and archaea. In PyNAST, the user can specify an arbitrary template alignment, supplied as a standard FASTA alignment file, to which candidate sequences should be aligned.

The NAST algorithm is applied to a candidate sequence and template alignment as follows. First, the sequence in the template alignment that is most similar to the candidate sequence (the template sequence) is identified using BLAST (Altschul et al., 1990). Gaps are removed from the template sequence, and it is then pairwise aligned with the candidate sequence. Next, the gap spacing from the template alignment is reintroduced into the pairwise alignment, yielding an alignment that may be longer than the template alignment. To reduce the length of the pairwise alignment to that of the template alignment, gaps must be removed from the pairwise alignment. New gaps that were introduced in the aligned template sequence during pairwise alignment are removed, and to maintain the alignment, the nearest gap character in the aligned candidate sequence is also removed. Thus, by introducing local misalignments, the candidate sequence is globally aligned to the template alignment without disrupting the length of the template alignment.
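The gap reintroduction and gap removal steps described above can be illustrated with a minimal Python sketch. This is not the PyNAST source code: the function names (reintroduce_template_gaps, remove_new_gaps, nast_fit) are hypothetical, the pairwise alignment is assumed to have been computed already, and the search for the "nearest" gap is simplified to a symmetric scan that prefers the left-hand side on ties.

```python
GAP = "-"

def reintroduce_template_gaps(template_gapped, pw_template, pw_candidate):
    """Expand the pairwise-aligned candidate with the template's gap columns.

    template_gapped: the template sequence as it appears in the template alignment.
    pw_template, pw_candidate: pairwise alignment of the degapped template
    with the candidate (equal lengths).
    Returns the expanded candidate (list of characters) and the column indices
    of "new" gaps, i.e. columns where the pairwise aligner opened a gap in the
    template to accommodate a candidate insertion.
    """
    expanded, new_cols = [], []
    i = 0  # current column in the pairwise alignment
    for ch in template_gapped:
        if ch == GAP:
            expanded.append(GAP)  # original template-alignment gap column
            continue
        while pw_template[i] == GAP:  # candidate insertion before this base
            new_cols.append(len(expanded))
            expanded.append(pw_candidate[i])
            i += 1
        expanded.append(pw_candidate[i])
        i += 1
    while i < len(pw_candidate):  # candidate insertions after the last base
        new_cols.append(len(expanded))
        expanded.append(pw_candidate[i])
        i += 1
    return expanded, new_cols

def remove_new_gaps(expanded, new_cols):
    """For each new template gap, delete the nearest gap in the candidate."""
    removed = set()
    for col in new_cols:
        for d in range(1, len(expanded)):
            hits = [j for j in (col - d, col + d)
                    if 0 <= j < len(expanded)
                    and expanded[j] == GAP and j not in removed]
            if hits:
                removed.add(hits[0])
                break
        else:
            # Simplifying assumption: give up if no gap can absorb the insertion.
            raise ValueError("no candidate gap available to absorb insertion")
    return "".join(c for j, c in enumerate(expanded) if j not in removed)

def nast_fit(template_gapped, pw_template, pw_candidate):
    expanded, new_cols = reintroduce_template_gaps(
        template_gapped, pw_template, pw_candidate)
    return remove_new_gaps(expanded, new_cols)

# Candidate has one extra base (T) relative to the template sequence.
print(nast_fit("AC--GT-A", "AC-GTA", "ACTGTA"))
print(nast_fit("ACGT--", "AC-GT", "ACTGT"))
```

In the first call the inserted base is absorbed by an adjacent template gap column, so no residues are misaligned. In the second call the nearest candidate gap is several columns away, so the bases in between shift by one column; this is the local misalignment that the algorithm accepts in order to preserve the template alignment length.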

3 SPEED BENCHMARK

The runtime of PyNAST was compared against the runtime of NAST on a collection of 30 000 16S rRNA sequences, and subsets of this collection containing 5000, 10 000, 15 000, 20 000 and 25 000 sequences each. The command-line implementation of NAST was used in this study, and compared directly with command-line PyNAST. BLAST 2.2.16 was used for the database search and pairwise alignment steps in both applications. PyNAST runs faster than the original NAST implementation, requiring 1.46 s/sequence versus 1.55 s/sequence (Fig. 1b). The rate-limiting step in PyNAST is the pairwise alignment, and the algorithm is therefore of complexity corresponding to the pairwise alignment algorithm. When BLAST is used for the pairwise alignment, the complexity is O(nm), where n and m refer to the lengths of the candidate and template sequences, respectively.
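To put the per-sequence rates in context, the short Python calculation below extrapolates them to the dataset sizes used in the benchmark. It assumes that runtime scales linearly with the number of sequences and has no fixed start-up cost; only the slopes are reported above, so the totals are illustrative estimates rather than measured values. The function name total_hours is hypothetical.

```python
# Per-sequence rates reported in the benchmark (seconds per sequence).
PYNAST_S_PER_SEQ = 1.46
NAST_S_PER_SEQ = 1.55

def total_hours(n_sequences, seconds_per_sequence):
    """Estimated wall-clock hours on one processor, assuming linear scaling."""
    return n_sequences * seconds_per_sequence / 3600.0

for n in (5000, 10000, 15000, 20000, 25000, 30000):
    pynast = total_hours(n, PYNAST_S_PER_SEQ)
    nast = total_hours(n, NAST_S_PER_SEQ)
    print(f"{n:>6} sequences: PyNAST {pynast:5.2f} h, NAST {nast:5.2f} h")
```

Under these assumptions, the full collection of 30 000 sequences takes roughly 12.2 h with PyNAST and 12.9 h with NAST on a single processor, a difference of about 45 min.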

4 CONCLUSIONS

The key advantage of the NAST algorithm is that it aligns relatively conserved regions well, and avoids expanding the alignment with new gaps in non-conserved regions. Because new sequences are aligned to the same length as the template alignment, newly acquired sequences can be analyzed in the context of existing alignments which may have been developed through costly processes such as manual alignment. Users are thus, for example, able to calculate distance matrices for diversity estimates and taxonomically classify organisms in microbiome samples based on existing high-quality alignments. PyNAST is a reimplementation of NAST, introducing new features that increase its portability and flexibility. Its availability as an open source application with three convenient interfaces will allow the application of the NAST algorithm on a wider basis, to larger datasets, and in novel domains.

Funding: This work was funded in part by grants T15LM009451 to JGC; a Bill and Melinda Gates Foundation Mal-ED Network Discovery Project to William Petri; 1U01HG004866-01 to Owen White; Human Microbiome Demonstration project grant UH2DK083981 to FDB, James Lewis, and Gary Wu. This work was also partially supported by grant UH2CA140233 from Human Microbiome Project of the NIH Roadmap Initiative and National Cancer Institute to Zhiheng Pei.

Conflict of Interest: none declared.

REFERENCES

1.  Basic local alignment search tool.

Authors:  S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal:  J Mol Biol       Date:  1990-10-05       Impact factor: 5.469

2.  Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB.

Authors:  T Z DeSantis; P Hugenholtz; N Larsen; M Rojas; E L Brodie; K Keller; T Huber; D Dalevi; P Hu; G L Andersen
Journal:  Appl Environ Microbiol       Date:  2006-07       Impact factor: 4.792

3.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.

Authors:  J D Thompson; D G Higgins; T J Gibson
Journal:  Nucleic Acids Res       Date:  1994-11-11       Impact factor: 16.971

4.  NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes.

Authors:  T Z DeSantis; P Hugenholtz; K Keller; E L Brodie; N Larsen; Y M Piceno; R Phan; G L Andersen
Journal:  Nucleic Acids Res       Date:  2006-07-01       Impact factor: 16.971

5.  MAFFT version 5: improvement in accuracy of multiple sequence alignment.

Authors:  Kazutaka Katoh; Kei-ichi Kuma; Hiroyuki Toh; Takashi Miyata
Journal:  Nucleic Acids Res       Date:  2005-01-20       Impact factor: 16.971

6.  MUSCLE: a multiple sequence alignment method with reduced time and space complexity.

Authors:  Robert C Edgar
Journal:  BMC Bioinformatics       Date:  2004-08-19       Impact factor: 3.169

7.  PyCogent: a toolkit for making sense from sequence.

Authors:  Rob Knight; Peter Maxwell; Amanda Birmingham; Jason Carnes; J Gregory Caporaso; Brett C Easton; Michael Eaton; Micah Hamady; Helen Lindsay; Zongzhi Liu; Catherine Lozupone; Daniel McDonald; Michael Robeson; Raymond Sammut; Sandra Smit; Matthew J Wakefield; Jeremy Widmann; Shandy Wikman; Stephanie Wilson; Hua Ying; Gavin A Huttley
Journal:  Genome Biol       Date:  2007       Impact factor: 13.583
