MetaPathways v2.5: quantitative functional, taxonomic and usability improvements.

Kishori M Konwar¹, Niels W Hanson², Maya P Bhatia¹, Dongjae Kim³, Shang-Ju Wu³, Aria S Hahn¹, Connor Morgan-Lang², Hiu Kan Cheung¹, Steven J Hallam⁴.

Abstract

Next-generation sequencing is producing vast amounts of sequence information from natural and engineered ecosystems. Although this data deluge has enormous potential to transform our lives, knowledge creation and translation need software applications that scale with increasing data processing and analysis requirements. Here, we present improvements to MetaPathways, an annotation and analysis pipeline for environmental sequence information that expedites this transformation. We specifically address pathway prediction hazards through integration of a weighted taxonomic distance and enable quantitative comparison of assembled annotations through a normalized read-mapping measure. Additionally, we improve LAST homology searches through BLAST-equivalent E-values and output formats that are natively compatible with prevailing software applications. Finally, an updated graphical user interface allows for keyword annotation query and projection onto user-defined functional gene hierarchies, including the Carbohydrate-Active Enzyme database.

AVAILABILITY AND IMPLEMENTATION: MetaPathways v2.5 is available on GitHub: http://github.com/hallamlab/metapathways2.
CONTACT: shallam@mail.ubc.ca
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2015 PMID: 26076725 PMCID: PMC4595896 DOI: 10.1093/bioinformatics/btv361

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Since the publication of MetaPathways (Konwar et al.), a modular annotation and analysis pipeline that enables construction of environmental pathway/genome databases using Pathway Tools (Karp et al., 2010) and MetaCyc (Caspi et al.; Karp et al., 2002a), the software has gained the Knowledge Engine data structure, a graphical user interface (GUI) for data management and browsing, and a master–worker model for task distribution on grids and clouds (Hanson et al.). Version 2.5 features faster and more accurate quantitative functional and taxonomic inference. Inspired by the pathway-centric analysis of the Hawaii Ocean Time-series (Hanson et al.), a weighted taxonomic distance (WTD) has been integrated to detect taxonomic divergence of predicted MetaCyc pathways. Next, because it is difficult to determine relative open reading frame (ORF) abundance in assembled datasets, we adopt reads per kilobase per million mapped (RPKM) to provide a quantitative measure of sequence coverage on a per-ORF basis (Patil et al.). Additionally, the LAST code has been modified to calculate BLAST-equivalent bit-score and E-value statistics (Altschul et al.; Kiełbasa et al.), producing output files compatible with prevailing software applications, including the MetaGenome ANalyser (Huson et al.). Finally, query and projection features have been enhanced with keyword-based searches, with support for Carbohydrate-Active EnZymes database entries (Cantarel et al.).

2 Methods

Here, we describe MetaPathways v2.5 improvements in more detail.

2.1 Weighted taxonomic distance

MetaPathways runs the PathoLogic algorithm without taxonomic pruning, which allows MetaCyc pathways to be predicted outside their expected taxonomic range. WTD measures the taxonomic divergence between the RefSeq taxonomy observed for a predicted pathway and that pathway's expected taxonomic range (Hanson et al.). Briefly, for each predicted pathway P, the WTD D is calculated on the connecting path between x_obs, the lowest common ancestor of the observed annotations, and each member x_exp of the expected taxonomic range, where e_{a,b} is an edge between nodes a and b on the connecting path and d(a) is the depth of node a. For complete algorithm details and motivation, see the Online Methods and Supplementary Note S2 of 'Metabolic pathways for the whole community' (Hanson et al.).
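The path-based calculation described above can be illustrated with a minimal sketch. Several details are assumptions made for illustration, not taken from this section: the taxonomy is a hypothetical child-to-parent dictionary, each edge e_{a,b} is weighted by 1/2^d(a) with a taken as the shallower endpoint, and the sign is positive when x_obs falls within the lineage of x_exp (consistent with the hazard-class convention in Fig. 1a). How the per-member distances are combined into a single value per pathway is not specified here, so the sketch handles one (x_obs, x_exp) pair at a time.

```python
def depth(node, parent):
    """Number of edges between node and the root (root has depth 0)."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d


def lineage(node, parent):
    """Return [node, parent(node), ..., root]."""
    chain = [node]
    while parent[chain[-1]] is not None:
        chain.append(parent[chain[-1]])
    return chain


def weighted_distance(x_obs, x_exp, parent):
    """Signed, depth-weighted path length between x_obs and x_exp.

    ASSUMPTION: each edge contributes 1 / 2**d(a), where a is the
    shallower node of the edge; the exact weighting is not given in
    this section.
    """
    obs_chain = lineage(x_obs, parent)
    exp_chain = lineage(x_exp, parent)
    exp_nodes = set(exp_chain)
    # Lowest common ancestor: first node in the observed lineage that
    # also appears in the expected lineage.
    lca = next(n for n in obs_chain if n in exp_nodes)

    total = 0.0
    for chain in (obs_chain, exp_chain):
        for child in chain:
            if child == lca:
                break
            a = parent[child]  # shallower endpoint of the edge
            total += 1.0 / 2 ** depth(a, parent)

    # ASSUMPTION: positive if x_obs is x_exp or one of its descendants.
    sign = 1.0 if lca == x_exp else -1.0
    return sign * total


# Hypothetical toy taxonomy (child -> parent).
toy_taxonomy = {
    "root": None,
    "Bacteria": "root",
    "Archaea": "root",
    "Proteobacteria": "Bacteria",
    "Gammaproteobacteria": "Proteobacteria",
}

inside = weighted_distance("Gammaproteobacteria", "Bacteria", toy_taxonomy)
outside = weighted_distance("Gammaproteobacteria", "Archaea", toy_taxonomy)
```

With this weighting, disagreements near the root of the tree contribute more than disagreements near the tips, so a pathway observed in the wrong domain receives a much larger negative distance than one observed in a sister genus.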

2.2 Reads per kilobase per million mapped

Functional analysis of de novo assembled environmental sequence information is impeded by the lack of quantitative ORF annotations. ORF counts are affected by both sequencing depth and ORF length: longer ORFs naturally encompass more reads, which makes quantitative comparisons between samples difficult. To resolve this, we have implemented a BWA-based version of the RPKM measure (Li and Durbin, 2010). Intuitively, RPKM is a simple proportion of the number of reads mapped to a sequence section, normalized for sequencing depth and ORF length.
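In its standard form, RPKM for an ORF is the number of reads mapped to that ORF divided by the product of the ORF length in kilobases and the total number of mapped reads in millions. The sketch below implements that textbook definition; the function and parameter names are hypothetical, and the exact read totals used by the MetaPathways implementation are not stated in this section.

```python
def rpkm(reads_mapped_to_orf, orf_length_bp, total_mapped_reads):
    """Reads per kilobase per million mapped (standard definition).

    reads_mapped_to_orf: reads aligned to this ORF
    orf_length_bp:       ORF length in base pairs
    total_mapped_reads:  all mapped reads in the sample
    """
    if orf_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    kilobases = orf_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return reads_mapped_to_orf / (kilobases * millions)


# Two ORFs in the same sample: the longer one attracts twice the reads
# simply because it is twice as long, and RPKM removes that effect.
long_orf = rpkm(200, 1_500, 4_000_000)
short_orf = rpkm(100, 750, 4_000_000)
```

Because both length and depth appear in the denominator, values computed this way are comparable across ORFs of different lengths and across samples sequenced to different depths.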

2.3 LAST bit-score and E-value

Although both LAST and BLAST are seed-and-extend approximations to the Smith–Waterman dynamic programming algorithm (Altschul and Erickson, 1986; Smith and Waterman, 1981), in practice LAST, with its adaptive seed lengths and simpler code base, is 20 to 100 times faster, more accurate and more portable. However, LAST adoption has lagged due to the absence of a BLAST-like output format and statistics. We modified the LAST code to produce compatible bit-score and E-value calculations.
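For orientation, BLAST-style statistics follow the Karlin-Altschul relations: a raw alignment score S is converted to a bit-score S' = (λS − ln K) / ln 2, and the E-value is E = m·n·2^(−S'), where m and n are the query and database lengths. The sketch below applies these relations directly. It is not the modified LAST code; the λ and K values are illustrative parameters that depend on the scoring scheme, and effective-length corrections are omitted.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit-score: (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, lam, k, query_len, db_len):
    """Expected number of chance hits scoring at least raw_score.

    Uses E = m * n * 2**(-bit_score), with no edge-effect correction.
    """
    return query_len * db_len * 2.0 ** (-bit_score(raw_score, lam, k))


# Illustrative parameters only (assumed, scoring-scheme dependent).
LAMBDA, K = 0.267, 0.041

bits = bit_score(100, LAMBDA, K)
expect = e_value(100, LAMBDA, K, query_len=300, db_len=1_000_000)
```

Reporting bit-scores rather than raw scores makes hits comparable across scoring schemes, which is what allows downstream tools written for BLAST output to consume LAST results unchanged.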

3 Results

We benchmarked the improvements described above using Illumina-sequenced marine metagenomic samples (Joint Genome Institute: ‘Marine microbial communities from Expanding Oxygen minimum zones project’; JGI Project IDs: 4093112, 4093113, 4093125, 4093127–4093132, 4093144–4093149, 4096364–4096371, 4096373, 4096375, 4096377–4096379, 4096381–4096383, 4096385–4096387, 4096389–4096396, 4096398–4096406 and 4096409–4096453). The WTD distribution can be used to place pathways into different taxonomic hazard classes based on their order statistics (Fig. 1a). Protein annotations from BLAST and LAST are highly correlated in terms of E-value (Fig. 1b), suggesting roughly equivalent results, with LAST being significantly faster. Although there is a positive correlation between RPKM score and ORF count, variance about the regression line indicates that RPKM corrects the raw count in many instances (Fig. 1c).

Fig. 1.

Quantitative functional and taxonomic improvements. (a) WTD provides a measure of taxonomic agreement between the observed RefSeq lowest common ancestor (LCA) taxonomy and the expected taxonomic range of predicted MetaCyc pathways, separated into the ‘High’ (Red), ‘Medium’ (Orange) and ‘Low’ (Green) taxonomic hazard classes based on negative quartile order statistics. Positive distances represent taxa found within a pathway's expected taxonomic range and so have a hazard class of ‘None’ (Grey). (b) The LAST and BLAST homology search algorithms are highly correlated in terms of E-value (P < 0.01). (c) ORF count and the RPKM measure show a linear relationship (R² = 0.816, P < 0.01). The 90% prediction intervals, displayed as a pair of thin blue lines about the fitted line, capture ∼96.7% and 91.3% of observed points in (b) and (c), respectively. Analysis code can be found in the Supplementary information.

4 Conclusions

MetaPathways v2.5 addresses pathway prediction hazards and quantitative functional comparison through WTD and RPKM calculations, provides fast LAST output equivalent to that of BLAST, and supports more flexible annotation subsetting and projection via GUI keyword searches. These improvements enable better large-scale comparative analysis of next-generation environmental sequence information.

15 in total

1. Basic local alignment search tool.

Authors: S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal: J Mol Biol Date: 1990-10-05 Impact factor: 5.469

2. Adaptive seeds tame genomic sequence comparison.

Authors: Szymon M Kiełbasa; Raymond Wan; Kengo Sato; Paul Horton; Martin C Frith
Journal: Genome Res Date: 2011-01-05 Impact factor: 9.043

3. MEGAN analysis of metagenomic data.

Authors: Daniel H Huson; Alexander F Auch; Ji Qi; Stephan C Schuster
Journal: Genome Res Date: 2007-01-25 Impact factor: 9.043

4. Pathway Tools version 13.0: integrated software for pathway/genome informatics and systems biology.

Authors: Peter D Karp; Suzanne M Paley; Markus Krummenacker; Mario Latendresse; Joseph M Dale; Thomas J Lee; Pallavi Kaipa; Fred Gilham; Aaron Spaulding; Liviu Popescu; Tomer Altman; Ian Paulsen; Ingrid M Keseler; Ron Caspi
Journal: Brief Bioinform Date: 2009-12-02 Impact factor: 11.622

5. Taxonomic metagenome sequence assignment with structured output models.

Authors: Kaustubh R Patil; Peter Haider; Phillip B Pope; Peter J Turnbaugh; Mark Morrison; Tobias Scheffer; Alice C McHardy
Journal: Nat Methods Date: 2011-03 Impact factor: 28.547

6. MetaPathways: a modular pipeline for constructing pathway/genome databases from environmental sequence information.

Authors: Kishori M Konwar; Niels W Hanson; Antoine P Pagé; Steven J Hallam
Journal: BMC Bioinformatics Date: 2013-06-21 Impact factor: 3.169

7. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases.

Authors: Ron Caspi; Tomer Altman; Kate Dreher; Carol A Fulcher; Pallavi Subhraveti; Ingrid M Keseler; Anamika Kothari; Markus Krummenacker; Mario Latendresse; Lukas A Mueller; Quang Ong; Suzanne Paley; Anuradha Pujar; Alexander G Shearer; Michael Travers; Deepika Weerasinghe; Peifen Zhang; Peter D Karp
Journal: Nucleic Acids Res Date: 2011-11-18 Impact factor: 16.971

8. Metabolic pathways for the whole community.

Authors: Niels W Hanson; Kishori M Konwar; Alyse K Hawley; Tomer Altman; Peter D Karp; Steven J Hallam
Journal: BMC Genomics Date: 2014-07-22 Impact factor: 3.969

9. Fast and accurate long-read alignment with Burrows-Wheeler transform.

Authors: Heng Li; Richard Durbin
Journal: Bioinformatics Date: 2010-01-15 Impact factor: 6.937

10. The Carbohydrate-Active EnZymes database (CAZy): an expert resource for Glycogenomics.

Authors: Brandi L Cantarel; Pedro M Coutinho; Corinne Rancurel; Thomas Bernard; Vincent Lombard; Bernard Henrissat
Journal: Nucleic Acids Res Date: 2008-10-05 Impact factor: 16.971
