
COVID-Align: accurate online alignment of hCoV-19 genomes using a profile HMM.

Frédéric Lemoine^1,2, Luc Blassel^1,3, Jakub Voznica^1,4, Olivier Gascuel^1,5.

Abstract

MOTIVATION: The first cases of the COVID-19 pandemic emerged in December 2019. Until the end of February 2020, the number of available genomes was below 1000 and their multiple alignment was easily achieved using standard approaches. Subsequently, the availability of genomes has grown dramatically. Moreover, some genomes are of low quality with sequencing/assembly errors, making accurate re-alignment of all genomes nearly impossible on a daily basis. A more efficient, yet accurate approach was clearly required to pursue all subsequent bioinformatics analyses of this crucial data.
RESULTS: hCoV-19 genomes are highly conserved, with very few indels and no recombination. This makes the profile HMM approach particularly well suited to align new genomes, add them to an existing alignment and filter problematic ones. Using a core of ∼2500 high quality genomes, we estimated a profile using HMMER, and implemented this profile in COVID-Align, a user-friendly interface to be used online or as standalone via Docker. The alignment of 1000 genomes requires ∼50 minutes on our cluster. Moreover, COVID-Align provides summary statistics, which can be used to determine the sequencing quality and evolutionary novelty of input genomes (e.g. number of new mutations and indels).
AVAILABILITY AND IMPLEMENTATION: https://covalign.pasteur.cloud, hub.docker.com/r/evolbioinfo/covid-align.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2021 PMID: 33045068 PMCID: PMC7745650 DOI: 10.1093/bioinformatics/btaa871

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Since the emergence of the hCoV-19 virus (or SARS-CoV-2) responsible for the COVID-19 pandemic, unprecedented efforts have been made across the world to sequence genomes of this virus and share the data. As of today (September 9, 2020), GISAID (Shu) provides access to more than 105 000 full genomes, compared with 23 000 for the NCBI and the EBI. The first genomes were sequenced in China by the end of December 2019. Their number first increased slowly, and then rapidly once the pandemic had reached all continents. Submissions of several thousand sequences to GISAID in a single day have become common. Moreover, some genomes are submitted incomplete, or with sequencing and assembly errors. These characteristics pose major challenges to bioinformatics, notably for multiple sequence alignment (MSA; Chatzou), which is crucial for subsequent analyses (phylogeny, transmission clusters, mutation study, structure, etc.). To address this difficulty, we use a profile HMM-based approach (Durbin), which is the norm for HIV (www.hiv.lanl.gov) and is particularly well suited to hCoV-19, as its genome is highly conserved, without known recombination in human hosts (De Maio; Xiaolu). With a profile, adding new data to an existing MSA takes computing time that is linear in the number of input genomes, since each genome is aligned to the profile independently of the others. Moreover, profile-based MSA has proved to be very accurate (Earl; Nute and Warnow, 2016). This approach is implemented in COVID-Align, which can be used through a Web service or via Docker.

2 Materials and methods

To estimate our profile HMM, we proceeded in several steps, in order to select an appropriate set of sequences and obtain a clean and reliable MSA to give as input to HMMER (www.hmmer.org). We downloaded all hCoV-19 genomes available on GISAID (April 24, 2020) and used MAFFT (Katoh and Standley, 2013) to align each of these genomes pairwise with the reference strain hCoV-19/Wuhan/WIV04/2019, sequenced in China on December 30, 2019. This genome was found to be perfectly conserved not only in China, but also in Thailand, Japan, USA, UK, etc., and is considered to be the origin of the virus (Li; www.gisaid.org). Then, using loose thresholds, we removed the genomes that were excessively divergent from the reference and had too many unknown (N) characters. We edited the remaining ones (e.g. removing the first gappy positions and the poly-A tail) and aligned them with MAFFT. The MSA so obtained was further filtered by removing the genomes having too many unique (i.e. not shared by any other genome) mutations and indels, this time with more stringent thresholds than in the previous stage. This resulted in an MSA of 2426 genomes, from which the first 12 and last 22 positions of the reference genome were removed due to poor alignment and low signal; all other reference positions were preserved and showed high conservation. We used HMMER to estimate our profile from this curated MSA. All details and program options are available in Supplementary Information.
The resulting profile was implemented in a Nextflow (Di Tommaso) and Galaxy workflow combining hmmalign from HMMER to align the input genomes to the profile, GoAlign to format the input/output files (https://github.com/evolbioinfo/goalign), and Python to compute summary statistics.
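As an illustration of the first, loose filtering stage described above, the following Python sketch scores one pairwise alignment against the reference and decides whether to keep the genome. It is not the code used for COVID-Align: the function names and both thresholds are hypothetical (the actual values are given in Supplementary Information), and the sketch assumes a genome is discarded as soon as either threshold is exceeded.

```python
from dataclasses import dataclass

# Hypothetical thresholds, for illustration only; the values actually used
# are reported in the paper's Supplementary Information.
MAX_N_FRACTION = 0.05
MAX_DIVERGENCE = 0.01


@dataclass
class PairwiseStats:
    n_fraction: float  # share of query bases that are N
    divergence: float  # share of compared columns that differ from the reference


def pairwise_stats(ref_aln: str, query_aln: str) -> PairwiseStats:
    """Summarise one pairwise alignment given as two equal-length gapped strings."""
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have the same length")
    query_bases = n_count = compared = mismatches = 0
    for r, q in zip(ref_aln.upper(), query_aln.upper()):
        if q != "-":
            query_bases += 1
            if q == "N":
                n_count += 1
        # Only score columns where both sequences carry an unambiguous base.
        if r in "ACGT" and q in "ACGT":
            compared += 1
            if r != q:
                mismatches += 1
    return PairwiseStats(
        n_fraction=n_count / query_bases if query_bases else 1.0,
        divergence=mismatches / compared if compared else 1.0,
    )


def keep_genome(stats: PairwiseStats,
                max_n: float = MAX_N_FRACTION,
                max_div: float = MAX_DIVERGENCE) -> bool:
    """Assumption: a genome is kept only if it passes both thresholds."""
    return stats.n_fraction <= max_n and stats.divergence <= max_div
```

Gaps are ignored when computing divergence, so indels do not count as substitutions here; they would be handled by the second, more stringent stage.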
These statistics help users evaluate the sequencing quality and potential evolutionary novelties of input genomes; for example: number of unique mutations and indels, number of mutations compared to the reference genome. A user-friendly interface, implemented in Go (similar to Lemoine), allows users to launch their analyses without having to know how to use the Galaxy system. For advanced users, COVID-Align can be installed locally via Docker (https://www.docker.com). COVID-Align is also available on www.gisaid.org, using GISAID sequence identifiers.
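The kind of per-genome statistics mentioned above can be sketched in a few lines of Python. The function below is a hypothetical example, not the COVID-Align implementation: it assumes that the reference and the input genome are two rows of the same MSA restricted to reference positions, and it counts substitutions, deleted positions, gap openings and unknown (N) characters.

```python
def genome_stats(ref_row: str, query_row: str) -> dict:
    """Count simple events in one aligned genome relative to the reference row."""
    if len(ref_row) != len(query_row):
        raise ValueError("rows of an MSA must have the same length")
    mutations = gaps = gap_openings = n_count = 0
    in_gap = False
    for r, q in zip(ref_row.upper(), query_row.upper()):
        if q == "-":
            gaps += 1
            if not in_gap:
                # A new run of gap characters starts at this column.
                gap_openings += 1
            in_gap = True
            continue
        in_gap = False
        if q == "N":
            n_count += 1
        elif r in "ACGT" and q in "ACGT" and q != r:
            mutations += 1
    return {
        "mutations": mutations,
        "gaps": gaps,
        "gap_openings": gap_openings,
        "n_count": n_count,
    }
```

One such dictionary per input genome could then be written as a row of a CSV file with Python's standard `csv.DictWriter`.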

3 Results

All results are returned in a zipped file. It contains the MSA of the input genomes plus the reference genome, which is displayed first, with the first 12 and last 22 reference positions cut. With small datasets, this MSA can be visualized using MSAviewer (Fig. 1; Yachdav).

Fig. 1.

Visualization and statistics summary. Left: MSAviewer visualization of the Receptor Binding Domain (RBD) of the Spike gene, with the reference genome (top), recently sequenced ones and the Bat and Pangolin genomes (bottom). The site numbering corresponds to that of the reference, and can be used to recover the ORFs and genes. In the RBD region the Pangolin virus genome is closer to Human’s than is Bat’s, suggesting a possible recombination. In contrast, Human viruses are highly conserved. Right: Statistics summary, displaying the number of High and Low Quality genomes, and the number of evolutionary events (mutations, gaps, gap openings, insertions, insertion openings). We distinguish the number of unique events (not seen yet and present only once in submitted genomes, possibly due to errors) and the number of new events (not seen yet and present at least twice, likely corresponding to evolutionary novelties). This table was filled with GISAID sequences deposited between August 10 and September 21, 2020, with unique and new statistics computed with respect to the database as of August 9 (Supplementary Material).

The zipped file also contains the hmmalign output in FASTA format for each of the input genomes, which can be used to recover the insertions, deletions and match positions (to be mapped onto the reference genome); a CSV file with all statistics computed for each of the input genomes; and a table in CSV format summarizing the main average statistics and features of the submitted genomes (Fig. 1). Unique mutations and indels are possibly due to errors (sequencing, assembly, etc.), while new ones (seen for the first time, and at least twice in the submitted genomes) likely correspond to evolutionary novelties (see Supplementary Information for details).
Our Web service processes 1000 genomes and returns the MSA in ∼50 minutes (average over 10 trials, max = 80 minutes), thanks to parallelization, which is easy to set up with profiles.
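The distinction between unique and new events can be made concrete with a short Python sketch. This is an illustration under stated assumptions rather than the COVID-Align code: events are encoded as hypothetical strings such as "C241T", `known` stands for the events already present in the database, and the function name is invented for the example.

```python
from __future__ import annotations

from collections import Counter


def classify_events(submitted: dict[str, set[str]],
                    known: set[str]) -> tuple[set[str], set[str]]:
    """Split not-yet-seen events into unique (one genome) and new (two or more).

    submitted maps a genome identifier to the set of events observed in it;
    known is the set of events already recorded in the database.
    """
    counts = Counter(
        event
        for events in submitted.values()
        for event in events
        if event not in known
    )
    unique = {event for event, n in counts.items() if n == 1}
    new = {event for event, n in counts.items() if n >= 2}
    return unique, new
```

Because each genome contributes a set, an event is counted once per genome, so "seen at least twice" means seen in at least two submitted genomes.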
Comparison with the MAFFT-based GISAID MSA shows that our MSA: (i) can be used as is, while MAFFT’s cannot due to ∼10 000 highly gappy columns resulting from sequencing and assembly errors; (ii) helps to detect and filter these errors; (iii) is similar for most sequences to a properly trimmed version of MAFFT’s MSA, and more accurate for the few others (Supplementary Information). Importantly, our profile and statistics will be regularly updated to account for user needs and the evolutionary novelties (mutations, indels) of the emerging genomes to come.
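To show what trimming highly gappy columns can look like in practice, here is a minimal Python sketch. It is only an example: the function names and the 0.5 threshold are hypothetical, and the trimming actually applied to the MAFFT-based MSA is described in Supplementary Information.

```python
from __future__ import annotations


def gap_fractions(msa: list[str]) -> list[float]:
    """Return, for each column of the MSA, the fraction of rows holding a gap."""
    if not msa:
        return []
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all rows of an MSA must have the same length")
    return [sum(row[i] == "-" for row in msa) / len(msa) for i in range(length)]


def trim_gappy_columns(msa: list[str], max_gap_fraction: float = 0.5) -> list[str]:
    """Drop every column whose gap fraction exceeds max_gap_fraction (hypothetical cutoff)."""
    fractions = gap_fractions(msa)
    keep = [i for i, f in enumerate(fractions) if f <= max_gap_fraction]
    return ["".join(row[i] for i in keep) for row in msa]
```

Columns created by a handful of erroneous insertions are almost entirely gaps, so they fall above any reasonable cutoff, while genuine deletions shared by few genomes leave the column in place.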

References

1. Multiple sequence alignment modeling: methods and applications.

Authors: Maria Chatzou; Cedrik Magis; Jia-Ming Chang; Carsten Kemena; Giovanni Bussotti; Ionas Erb; Cedric Notredame
Journal: Brief Bioinform Date: 2015-11-27 Impact factor: 11.622

2. Nextflow enables reproducible computational workflows.

Authors: Paolo Di Tommaso; Maria Chatzou; Evan W Floden; Pablo Prieto Barja; Emilio Palumbo; Cedric Notredame
Journal: Nat Biotechnol Date: 2017-04-11 Impact factor: 54.908

3. MAFFT multiple sequence alignment software version 7: improvements in performance and usability.

Authors: Kazutaka Katoh; Daron M Standley
Journal: Mol Biol Evol Date: 2013-01-16 Impact factor: 16.240

4. Alignathon: a competitive assessment of whole-genome alignment methods.

Authors: Dent Earl; Ngan Nguyen; Glenn Hickey; Robert S Harris; Stephen Fitzgerald; Kathryn Beal; Igor Seledtsov; Vladimir Molodtsov; Brian J Raney; Hiram Clawson; Jaebum Kim; Carsten Kemena; Jia-Ming Chang; Ionas Erb; Alexander Poliakov; Minmei Hou; Javier Herrero; William James Kent; Victor Solovyev; Aaron E Darling; Jian Ma; Cedric Notredame; Michael Brudno; Inna Dubchak; David Haussler; Benedict Paten
Journal: Genome Res Date: 2014-10-01 Impact factor: 9.043

5. Scaling statistical multiple sequence alignment to large datasets.

Authors: Michael Nute; Tandy Warnow
Journal: BMC Genomics Date: 2016-11-11 Impact factor: 3.969

6. MSAViewer: interactive JavaScript visualization of multiple sequence alignments.

Authors: Guy Yachdav; Sebastian Wilzbach; Benedikt Rauscher; Robert Sheridan; Ian Sillitoe; James Procter; Suzanna E Lewis; Burkhard Rost; Tatyana Goldberg
Journal: Bioinformatics Date: 2016-07-13 Impact factor: 6.937

7. GISAID: Global initiative on sharing all influenza data - from vision to reality.

Authors: Yuelong Shu; John McCauley
Journal: Euro Surveill Date: 2017-03-30

8. NGPhylogeny.fr: new generation phylogenetic services for non-specialists.

Authors: Frédéric Lemoine; Damien Correia; Vincent Lefort; Olivia Doppelt-Azeroual; Fabien Mareuil; Sarah Cohen-Boulakia; Olivier Gascuel
Journal: Nucleic Acids Res Date: 2019-07-02 Impact factor: 16.971
