
Augur: a bioinformatics toolkit for phylogenetic analyses of human pathogens.

John Huddleston^1,2, James Hadfield^2, Thomas R Sibley^2, Jover Lee^2, Kairsten Fay^2, Misja Ilcisin^2, Elias Harkins^2, Trevor Bedford^1,2, Richard A Neher^3,4, Emma B Hodcroft^3,4,5.

Abstract

The analysis of human pathogens requires a diverse collection of bioinformatics tools. These tools include standard genomic and phylogenetic software and custom software developed to handle the relatively numerous and short genomes of viruses and bacteria. Researchers increasingly depend on the outputs of these tools to infer transmission dynamics of human diseases and make actionable recommendations to public health officials (Black et al., 2020; Gardy et al., 2015). In order to enable real-time analyses of pathogen evolution, bioinformatics tools must scale rapidly with the number of samples and be flexible enough to adapt to a variety of questions and organisms. To meet these needs, we developed Augur, a bioinformatics toolkit designed for phylogenetic analyses of human pathogens.


Year: 2021 PMID: 34189396 PMCID: PMC8237802 DOI: 10.21105/joss.02906

Source DB: PubMed Journal: J Open Source Softw ISSN: 2475-9066

Augur originally existed as an internal component of the nextflu (Neher & Bedford, 2015) and Nextstrain (Hadfield et al., 2018) applications. As a component of nextflu, Augur consisted of a single monolithic Python script that performed most operations in memory. This script prepared a subset of seasonal influenza sequences and metadata and then processed those data to produce an annotated phylogeny for visualization in the nextflu web application. When Nextstrain replaced nextflu and expanded to support multiple viral and bacterial pathogens, each pathogen received its own copy of the original script. The resulting redundancy of these large scripts complicated efforts to debug analyses, add new features for all pathogens, and add support for new pathogens. Critically, this software architecture led to long-lived, divergent branches of untested code in version control that Nextstrain team members could not confidently merge without potentially breaking existing analyses.

Implementation

To address these issues, we refactored the original Augur scripts into a toolkit of individual subcommands wrapped by a single command line executable, augur. With this approach, we followed the pattern established by samtools (Li et al., 2009) and bcftools (Li, 2011), where subcommands perform single, tightly scoped tasks (e.g., “view,” “sort,” “merge”) that can be chained together in bioinformatics pipelines. We migrated or rewrote the existing functionality of the original Augur scripts into corresponding Augur subcommands. To enable interoperability with existing bioinformatics tools, we designed subcommands to accept inputs and produce outputs in standard bioinformatics file formats wherever possible. For example, we represented all raw sequence data in FASTA format, alignments in either FASTA or VCF format, and phylogenies in Newick format. To handle the common case where a standard file format could not represent some or all of the outputs produced by an Augur command, we implemented a lightweight JSON schema to store the remaining data. The “node data” JSON format is one such Augur-specific file format; it supports arbitrary annotations of phylogenies, indexed by the name assigned to each internal node or tip. To provide a standard interface for our own analyses, we also designed several Augur subcommands to wrap existing bioinformatics tools, including augur align, which wraps mafft (Katoh et al., 2002), and augur tree, which wraps FastTree (Price, 2010), RAxML (Stamatakis, 2014), and IQ-TREE (Nguyen et al., 2014). Many commands, including augur refine, augur traits, and augur ancestral, make extensive use of TreeTime (Sagulenko et al., 2018) to provide time-scaled phylogenetic trees or further annotate the phylogeny. By implementing the core components of Augur as a command line tool, we were able to rewrite our existing pathogen analyses as straightforward bioinformatics workflows using existing workflow management software like Snakemake (Köster & Rahmann, 2012).
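To make the idea of a node data file concrete, the following Python sketch builds two small sets of annotations keyed by node name and merges them into one JSON document. It is an illustration only: the top-level "nodes" key, the attribute names (num_date, country), and the node names are assumptions chosen for the example, and the helper functions are hypothetical rather than part of Augur.

```python
import json


def make_node_data(annotations):
    """Wrap per-node annotations in a node-data-style dictionary.

    The top-level "nodes" key is an assumption made for this sketch.
    """
    return {"nodes": {name: dict(values) for name, values in annotations.items()}}


def merge_node_data(*node_data_dicts):
    """Combine annotations from several dictionaries, matching on node name."""
    merged = {"nodes": {}}
    for node_data in node_data_dicts:
        for name, values in node_data.get("nodes", {}).items():
            merged["nodes"].setdefault(name, {}).update(values)
    return merged


# Hypothetical annotations for one internal node and one tip.
dates = make_node_data({
    "NODE_0000001": {"num_date": 2015.5},
    "sample_A": {"num_date": 2016.1},
})
traits = make_node_data({
    "NODE_0000001": {"country": "Brazil"},
    "sample_A": {"country": "Colombia"},
})

combined = merge_node_data(dates, traits)
node_data_json = json.dumps(combined, indent=2, sort_keys=True)
print(node_data_json)
```

Because every annotation is indexed by node name, files produced by separate steps can be combined without either step knowing about the other, as long as both use the same names as the tree.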
Most pathogen workflows begin with user-curated sequences in a FASTA file (e.g., sequences.fasta) and metadata describing each sequence in a tab-delimited text file (e.g., metadata.tsv). Users can apply a series of Augur commands and other standard bioinformatics tools to these files to create annotated phylogenies that can be viewed in Auspice, the web application that serves Nextstrain (Figure 1). This approach allows users to leverage the distributed computing abilities of workflow managers to run multiple steps of the workflow in parallel and also run individual commands that support multiprocessing in parallel. Further, the Augur modules can be easily recombined both with each other and with user-generated scripts to flexibly address the differing questions and restrictions posed by a variety of human pathogens.
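As a rough illustration of how such a workflow chains subcommands through files, the Python sketch below assembles the command lines for a minimal build without running them. The intermediate file names are hypothetical, and the flags follow the pattern used in the public Zika tutorial as best recalled; they should be checked against the Augur documentation for the installed version before use.

```python
import shlex


def build_workflow(sequences="sequences.fasta", metadata="metadata.tsv"):
    """Return ordered (step name, argv) pairs for a minimal example build.

    Intermediate file names are hypothetical; flags are assumptions to
    verify against the Augur command line reference.
    """
    return [
        ("filter", ["augur", "filter",
                    "--sequences", sequences,
                    "--metadata", metadata,
                    "--output", "filtered.fasta"]),
        ("align", ["augur", "align",
                   "--sequences", "filtered.fasta",
                   "--output", "aligned.fasta"]),
        ("tree", ["augur", "tree",
                  "--alignment", "aligned.fasta",
                  "--output", "tree_raw.nwk"]),
        ("refine", ["augur", "refine",
                    "--tree", "tree_raw.nwk",
                    "--alignment", "aligned.fasta",
                    "--metadata", metadata,
                    "--output-tree", "tree.nwk",
                    "--output-node-data", "branch_lengths.json"]),
        ("export", ["augur", "export", "v2",
                    "--tree", "tree.nwk",
                    "--metadata", metadata,
                    "--node-data", "branch_lengths.json",
                    "--output", "auspice/example.json"]),
    ]


def render(steps):
    """Render the steps as shell-quoted command lines, one per line."""
    return "\n".join(shlex.join(argv) for _, argv in steps)


workflow = build_workflow()
script = render(workflow)
print(script)
```

Each step reads only files written by earlier steps, which is the property a workflow manager such as Snakemake relies on to infer execution order from inputs and outputs.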

Figure 1:

Example workflows composed with Snakemake from Augur commands for A) Zika virus, B) tuberculosis, C) a BEAST analysis, and D) the Nextstrain SARS-CoV-2 pipeline as of 2020-11-27. Each node in the workflow graph represents a command that performs a specific part of the analysis (e.g., aligning sequences, building a tree, etc.) with Augur commands in black, external software in red, and custom scripts in blue. A typical workflow starts by filtering sequences and metadata to a desired subset for analysis followed by inference of a phylogeny, annotation of that phylogeny, and export of the annotated phylogeny to a JSON that can be viewed on Nextstrain. Workflows for viral (A) and bacterial (B) pathogens follow a similar structure but also support custom pathogen-specific steps. Augur’s modularity enables workflows that build on outputs from other tools in the field like BEAST (C) as well as more complicated analyses such as that behind Nextstrain’s daily SARS-CoV-2 builds (D) which often require custom scripts to perform analysis-specific steps. Multiple outgoing edges from a single node represent opportunities to run the workflow in parallel. See the full workflows behind A, B, and D at https://github.com/nextstrain/zika-tutorial, https://github.com/nextstrain/tb, and https://github.com/nextstrain/ncov.
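The relationship between outgoing edges and parallelism can be sketched with Python's standard graphlib module. The dependency graph below is a simplified, hypothetical workflow and not a transcription of any panel in Figure 1; steps that become ready at the same time form a batch that a workflow manager could run concurrently.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
dependencies = {
    "filter": set(),
    "align": {"filter"},
    "tree": {"align"},
    "refine": {"tree"},
    "ancestral": {"refine"},
    "traits": {"refine"},
    "export": {"ancestral", "traits"},
}


def parallel_batches(graph):
    """Group steps into batches whose members have no unmet dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps runnable right now
        batches.append(ready)
        sorter.done(*ready)  # mark the whole batch as finished
    return batches


batches = parallel_batches(dependencies)
for index, batch in enumerate(batches, start=1):
    print(index, batch)
```

In this sketch the "refine" step has two outgoing edges, so "ancestral" and "traits" land in the same batch and could run side by side before "export" collects their outputs.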

The modular Augur interface has enabled phylogenetic and genomic epidemiological analyses by academic researchers, public health laboratories, and private companies. Most recently, these tools have supported the real-time tracking of SARS-CoV-2 evolution at global and local scales (Alm et al., 2020; Bedford et al., 2020; The Nextstrain Team, 2020). This success has attracted contributions from the open source community that have allowed us to improve Augur’s functionality, documentation, and test coverage. To facilitate Augur’s continued use as part of wider bioinformatics pipelines in public health, we have committed to work with and contribute to open data standards such as PHA4GE (Griffiths et al., 2020) and follow recommendations for open pathogen genomic analyses (Black et al., 2020). Augur can be installed from PyPI (nextstrain-augur) and Bioconda (augur). See the full documentation for more details about how to use or contribute to development of Augur.

14 in total

1. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform.

Authors: Kazutaka Katoh; Kazuharu Misawa; Kei-ichi Kuma; Takashi Miyata
Journal: Nucleic Acids Res Date: 2002-07-15 Impact factor: 16.971

2. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data.

Authors: Heng Li
Journal: Bioinformatics Date: 2011-09-08 Impact factor: 6.937

3. Snakemake--a scalable bioinformatics workflow engine.

Authors: Johannes Köster; Sven Rahmann
Journal: Bioinformatics Date: 2012-08-20 Impact factor: 6.937

4. Real-time digital pathogen surveillance - the time is now.

Authors: Jennifer Gardy; Nicholas J Loman; Andrew Rambaut
Journal: Genome Biol Date: 2015-07-30 Impact factor: 13.583

5. nextflu: real-time tracking of seasonal influenza virus evolution in humans.

Authors: Richard A Neher; Trevor Bedford
Journal: Bioinformatics Date: 2015-06-26 Impact factor: 6.937

6. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies.

Authors: Alexandros Stamatakis
Journal: Bioinformatics Date: 2014-01-21 Impact factor: 6.937

7. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies.

Authors: Lam-Tung Nguyen; Heiko A Schmidt; Arndt von Haeseler; Bui Quang Minh
Journal: Mol Biol Evol Date: 2014-11-03 Impact factor: 16.240

8. Nextstrain: real-time tracking of pathogen evolution.

Authors: James Hadfield; Colin Megill; Sidney M Bell; John Huddleston; Barney Potter; Charlton Callender; Pavel Sagulenko; Trevor Bedford; Richard A Neher
Journal: Bioinformatics Date: 2018-12-01 Impact factor: 6.931

9. TreeTime: Maximum-likelihood phylodynamic analysis.

Authors: Pavel Sagulenko; Vadim Puller; Richard A Neher
Journal: Virus Evol Date: 2018-01-08

10. Geographical and temporal distribution of SARS-CoV-2 clades in the WHO European Region, January to June 2020.

Authors: Erik Alm; Eeva K Broberg; Thomas Connor; Emma B Hodcroft; Andrey B Komissarov; Sebastian Maurer-Stroh; Angeliki Melidou; Richard A Neher; Áine O'Toole; Dmitriy Pereyaslov
Journal: Euro Surveill Date: 2020-08
