Swarm v3: towards tera-scale amplicon clustering.

Frédéric Mahé^1,2, Lucas Czech^3,4, Alexandros Stamatakis^3,5, Christopher Quince^6,7,8, Colomban de Vargas^9,10, Micah Dunthorn^11,12, Torbjørn Rognes^13,14.

Abstract

MOTIVATION: Previously we presented swarm, an open-source amplicon clustering program that produces fine-scale molecular operational taxonomic units (OTUs) that are free of arbitrary global clustering thresholds. Here we present swarm v3 to address issues of contemporary datasets that are growing towards tera-byte sizes.
RESULTS: When compared to previous swarm versions, swarm v3 has modernized C++ source code, reduced memory footprint by up to 50%, optimized CPU-usage and multithreading (more than 7 times faster with default parameters), and it has been extensively tested for its robustness and logic.
AVAILABILITY: Source code and binaries are available at https://github.com/torognes/swarm.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2021; PMID: 34244702; PMCID: PMC8696092; DOI: 10.1093/bioinformatics/btab493

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

1 Introduction

In emerging planetary biology, large-scale amplicon sequencing datasets are used to unravel global ecological and evolutionary patterns within and across biomes and biota (de Vargas ; Mahé ; Giner ). With today’s sequencing platforms, such as Illumina and PacBio, a single environmental diversity study can produce massive amounts of data. A critical bioinformatics step in handling these massive metabarcoding datasets is to cluster the sequencing reads into operational taxonomic units (OTUs). OTUs commonly serve as units of comparison in downstream statistical analyses and are often interpreted as proxies for species and other taxa (Santoferrara ).
Swarm v1 (Mahé ) was introduced as a novel approach to cluster amplicons into OTUs, inspired by previous single-linkage methods, such as DOTUR (Schloss and Handelsman, 2005). The key underlying idea of swarm was to use a local, iterative, single-linkage clustering process to group closely related sequences (by default, sequences with one difference in their nucleotide sequences, i.e. d = 1). Swarm’s clustering process differs from global clustering threshold approaches, which apply an arbitrary fixed minimal similarity between the OTU seed and other OTU members, often set at 97% or 98% (Edgar ), and from model-based noise-filtering methods, such as DADA2 (Callahan ) and Deblur (Amir ). The recommended usage of these methods is to process samples or sequencing runs independently and then to merge the results. Swarm offers a fast alternative allowing users to (re-)process entire datasets at once.
Swarm v2 (Mahé ), implemented entirely in C++, added two features to refine clustering: OTU-breaking, which splits OTUs that are only linked via low-abundance sequences (--no-otu-breaking to disable); and merging, which grafts low-abundance OTUs onto higher-abundance OTUs (--fastidious to enable). Swarm v2 was also substantially faster due to algorithmic advances when used with default parameters (d = 1).
There was still room for improvement. There were issues with code standardization that could limit compile-time optimization and raise warnings or errors with future compilers (Darriba ; Wilson ). The code could only be executed on GNU/Linux and macOS on x86-64 CPUs. And although swarm v2 was multithreaded and fast, its time and memory requirements could become a limiting factor on very large current and future datasets, especially as amplicon sequences become longer. Swarm v3 addresses these issues.
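The local, iterative, single-linkage idea described above (d = 1) can be illustrated with a minimal C++ sketch. It is not swarm's implementation: the `Amplicon` type and the function names are hypothetical, the all-against-all comparison is quadratic (swarm avoids this with hashing), and OTU-breaking and the fastidious option are left out. The sketch assumes that seeds are taken in order of decreasing abundance and that a cluster grows by repeatedly adding every unassigned amplicon within one difference of any current member.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <queue>
#include <string>
#include <vector>

// Hypothetical record type for this sketch (not swarm's internal layout).
struct Amplicon {
    std::string seq;
    unsigned long abundance;
};

// True if a and b differ by at most one substitution, insertion or deletion.
bool within_one_difference(const std::string& a, const std::string& b) {
    const std::string& s = a.size() <= b.size() ? a : b;  // shorter sequence
    const std::string& l = a.size() <= b.size() ? b : a;  // longer sequence
    if (l.size() - s.size() > 1) return false;
    std::size_t i = 0;
    while (i < s.size() && s[i] == l[i]) ++i;  // length of the common prefix
    if (i == s.size()) return true;  // identical, or one extra trailing nucleotide
    if (s.size() == l.size())        // substitution: skip position i in both
        return s.compare(i + 1, std::string::npos, l, i + 1, std::string::npos) == 0;
    // insertion/deletion: skip position i in the longer sequence only
    return s.compare(i, std::string::npos, l, i + 1, std::string::npos) == 0;
}

// Returns one cluster id per amplicon, indexed in input order.
std::vector<int> cluster_d1(const std::vector<Amplicon>& amplicons) {
    const std::size_t n = amplicons.size();
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    // Assumption of this sketch: seeds are visited by decreasing abundance.
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t x, std::size_t y) {
                         return amplicons[x].abundance > amplicons[y].abundance;
                     });
    std::vector<int> cluster(n, -1);  // -1 means "not assigned yet"
    int next_id = 0;
    for (std::size_t seed : order) {
        if (cluster[seed] != -1) continue;
        cluster[seed] = next_id;
        std::queue<std::size_t> frontier;
        frontier.push(seed);
        // Grow the cluster: neighbours of members become members themselves.
        while (!frontier.empty()) {
            const std::size_t cur = frontier.front();
            frontier.pop();
            for (std::size_t other = 0; other < n; ++other) {
                if (cluster[other] == -1 &&
                    within_one_difference(amplicons[cur].seq, amplicons[other].seq)) {
                    cluster[other] = next_id;
                    frontier.push(other);
                }
            }
        }
        ++next_id;
    }
    return cluster;
}
```

In this sketch, ACCA joins the cluster seeded by ACGT only through the intermediate ACGA, which is what distinguishes a local single-linkage process from a fixed similarity threshold around the seed.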

2 Code quality and portability

Following the recommendations of Darriba ), swarm v3 features substantially revised and improved documentation (e.g. help and man page), as well as clearer and more helpful warnings and error messages. Swarm’s logic and behavior have been tested extensively via automatically generated input (afl-fuzz; https://lcamtuf.coredump.cx/afl/) and 669 hand-crafted functional software tests (https://github.com/frederic-mahe/swarm-tests/), covering more than 95% of swarm’s code (the remaining code is CPU architecture-specific). The Codecov (https://codecov.io) tool tracks how code coverage evolves, and the Travis-CI (https://travis-ci.org) service automatically runs the test suite on each new code modification to prevent regressions. To facilitate swarm’s long-term maintenance and portability, advanced compiler options [gcc (https://gcc.gnu.org) and clang (https://clang.llvm.org)] as well as state-of-the-art static [cppcheck (http://cppcheck.sourceforge.net) and clang-tidy (https://clang.llvm.org/extra/clang-tidy/)] and dynamic C++ analyzers [valgrind (https://www.valgrind.org)] were used to detect unsafe or deprecated code not reported by commonly used compiler options. More than 1600 warnings have been fixed so far, improving swarm’s global code quality score as assessed by SoftWipe (Zapletal ) from 5.2 to 6.6 out of 10. Swarm has now been ported to new combinations of CPU architectures and operating systems: Microsoft’s Windows on x86-64, GNU/Linux and macOS on ARM 64 and GNU/Linux on POWER8, in addition to the already available versions for GNU/Linux and macOS on x86-64.

3 Time and space optimization, real-world results

DNA sequences are stored in silico as strings of the four characters A, C, G and T. Rather than using a byte of memory for storing each nucleotide, it is possible to only use two bits. Thereby, four nucleotides can be stored per byte. This compression reduces the global memory-footprint but also requires some storage overhead and additional encoding-decoding operations as CPUs cannot operate directly on anything smaller than a byte. To alleviate this, swarm v3 deploys a faster hash function (Zobrist, 1970) and an efficient Bloom filter (Putze ), and was re-written to operate on fixed-length chunks of compressed sequences, rather than on individual nucleotides (see Supplementary File). It should be noted that this new algorithm only applies to the default value for swarm’s d parameter (d = 1). Higher d values use the same algorithm as in swarm v2. On a dataset of 10.6 million unique SSU-rRNA V4 sequences (representing 31.6 million reads, 380 bp on average, Mahé ), and a series of subsamplings (1% and 10–90% steps), swarm v3 outperformed swarm v2 in every performance metric, while yielding exactly identical clustering results. With both versions running on 1 core, v3 was more than 7 times faster than v2. When both were running on 16 cores, v3 was about 10 times faster than v2. The memory requirement of v3 was about half that of v2 (Supplementary Fig. S1). Comparable results were obtained on a second dataset of 10.6 million unique SSU-rRNA V9 sequences (130 bp on average, de Vargas ), but with a less pronounced memory-footprint reduction as the storage overhead of two-bit compressed sequences has a larger impact with shorter sequences (see Supplementary Figs. S2, S3 and Supplementary File for a detailed benchmark description). When using the merging option (named fastidious), swarm v3 is more than 5 times faster for SSU-rRNA V9 (130 bp), and more than 9 times faster for SSU-rRNA V4 (380 bp) (Supplementary Fig. S2). 
With the fastidious option, the memory-footprint is only reduced by 5–10% because the fastidious algorithm relies on a Bloom filter to store hash values instead of DNA sequences, and therefore does not profit from the two-bit sequence compression.
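The two-bit compression and the Zobrist hashing mentioned in this section can be illustrated with a short C++ sketch. It rests on several assumptions that are not taken from swarm's source code: the code assignment A=0, C=1, G=2, T=3, the use of 64-bit chunks holding 32 nucleotides each, and all type and function names (`pack`, `ZobristTable`, `hash_after_substitution`), which are hypothetical. Only substitutions are covered; insertions and deletions are not handled here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Map A, C, G, T to two-bit codes 0..3 (an arbitrary choice for this sketch).
inline std::uint64_t encode(char nt) {
    switch (nt) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
    }
    assert(false && "only A, C, G and T are supported");
    return 0;
}

// Pack a sequence into 64-bit chunks, 32 nucleotides per chunk.
std::vector<std::uint64_t> pack(const std::string& seq) {
    std::vector<std::uint64_t> chunks((seq.size() + 31) / 32, 0);
    for (std::size_t i = 0; i < seq.size(); ++i)
        chunks[i / 32] |= encode(seq[i]) << (2 * (i % 32));
    return chunks;
}

// Read back the two-bit code stored at position i.
inline unsigned code_at(const std::vector<std::uint64_t>& chunks, std::size_t i) {
    return static_cast<unsigned>((chunks[i / 32] >> (2 * (i % 32))) & 3U);
}

// Zobrist table: one random 64-bit key per (position, nucleotide code) pair.
struct ZobristTable {
    std::vector<std::uint64_t> keys;  // keys[4 * position + code]
    explicit ZobristTable(std::size_t max_len, std::uint64_t seed = 1)
        : keys(4 * max_len) {
        std::mt19937_64 rng(seed);
        for (auto& k : keys) k = rng();
    }
    std::uint64_t key(std::size_t pos, unsigned code) const {
        return keys[4 * pos + code];
    }
};

// The hash of a sequence is the XOR of the keys of its (position, code) pairs.
std::uint64_t zobrist_hash(const ZobristTable& t,
                           const std::vector<std::uint64_t>& chunks,
                           std::size_t len) {
    std::uint64_t h = 0;
    for (std::size_t i = 0; i < len; ++i) h ^= t.key(i, code_at(chunks, i));
    return h;
}

// Hash of the variant with one substitution at pos, obtained in constant time:
// XOR removes the old key and adds the new one without rescanning the sequence.
std::uint64_t hash_after_substitution(const ZobristTable& t, std::uint64_t h,
                                      std::size_t pos, unsigned old_code,
                                      unsigned new_code) {
    return h ^ t.key(pos, old_code) ^ t.key(pos, new_code);
}
```

With this layout a 32-nucleotide sequence occupies 8 bytes instead of 32, while shorter sequences still pay for a whole chunk, which is one way to picture the storage overhead noted above for short reads. The constant-time update shows why a Zobrist-style hash suits the generation of single-difference variants.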

4 Conclusion

Swarm v3 is a clustering method designed to maximize taxonomic resolution, sensitivity and speed. If coupled with ‘lossy’ post-clustering filtering steps, such as chimera detection, quality filtering and multi-sample co-occurrence patterns (e.g. Frøslev ), swarm has the potential to yield robust, single-nucleotide resolution results. Swarm v3 can be used on short and long read metabarcoding data (with sequences up to 10 Mbp when using d = 1), or on meta-transcriptomic/genomic data that have been subsampled from the same locus. It offers a comprehensive set of options that give users full control and access to intermediate internal data, such as the complete pairwise sequence network (see Forster , for a usage example). Swarm v3 is open-source, actively maintained, portable and efficient, thus reducing the need for expensive computational resources. As an example, the UniEuk project (Berney ) gathered from the global research community an SSU-rRNA V4 dataset with nearly 324 million unique sequences (123 billion nucleotides), more than three times the volume of the recently published Earth Microbiome Project (Thompson ). Using default parameters, swarm v3 required 50 min to cluster the UniEuk dataset on a 16-core system. We estimate that it would take less than six hours on the same machine to process a one trillion nucleotide, or one tera-byte dataset.

References (15 in total, 10 listed)

1. Robert C Edgar. Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 2010-08-12.

2. Colomban de Vargas; Stéphane Audic; Nicolas Henry; Johan Decelle; Frédéric Mahé; Ramiro Logares; Enrique Lara; Cédric Berney; Noan Le Bescot; Ian Probert; Margaux Carmichael; Julie Poulain; Sarah Romac; Sébastien Colin; Jean-Marc Aury; Lucie Bittner; Samuel Chaffron; Micah Dunthorn; Stefan Engelen; Olga Flegontova; Lionel Guidi; Aleš Horák; Olivier Jaillon; Gipsi Lima-Mendez; Julius Lukeš; Shruti Malviya; Raphael Morard; Matthieu Mulot; Eleonora Scalco; Raffaele Siano; Flora Vincent; Adriana Zingone; Céline Dimier; Marc Picheral; Sarah Searson; Stefanie Kandels-Lewis; Silvia G Acinas; Peer Bork; Chris Bowler; Gabriel Gorsky; Nigel Grimsley; Pascal Hingamp; Daniele Iudicone; Fabrice Not; Hiroyuki Ogata; Stephane Pesant; Jeroen Raes; Michael E Sieracki; Sabrina Speich; Lars Stemmann; Shinichi Sunagawa; Jean Weissenbach; Patrick Wincker; Eric Karsenti. Ocean plankton. Eukaryotic plankton diversity in the sunlit ocean. Science, 2015-05-22.

3. Frédéric Mahé; Colomban de Vargas; David Bass; Lucas Czech; Alexandros Stamatakis; Enrique Lara; David Singer; Jordan Mayor; John Bunge; Sarah Sernaker; Tobias Siemensmeyer; Isabelle Trautmann; Sarah Romac; Cédric Berney; Alexey Kozlov; Edward A D Mitchell; Christophe V W Seppey; Elianne Egge; Guillaume Lentendu; Rainer Wirth; Gabriel Trueba; Micah Dunthorn. Parasites dominate hyperdiverse soil protist communities in Neotropical rainforests. Nat Ecol Evol, 2017-03-20.

4. Benjamin J Callahan; Paul J McMurdie; Michael J Rosen; Andrew W Han; Amy Jo A Johnson; Susan P Holmes. DADA2: High-resolution sample inference from Illumina amplicon data. Nat Methods, 2016-05-23.

5. Greg Wilson; D A Aruliah; C Titus Brown; Neil P Chue Hong; Matt Davis; Richard T Guy; Steven H D Haddock; Kathryn D Huff; Ian M Mitchell; Mark D Plumbley; Ben Waugh; Ethan P White; Paul Wilson. Best practices for scientific computing. PLoS Biol, 2014-01-07.

6. Frédéric Mahé; Torbjørn Rognes; Christopher Quince; Colomban de Vargas; Micah Dunthorn. Swarm v2: highly-scalable and high-resolution amplicon clustering. PeerJ, 2015-12-10.

7. Tobias Guldberg Frøslev; Rasmus Kjøller; Hans Henrik Bruun; Rasmus Ejrnæs; Ane Kirstine Brunbjerg; Carlotta Pietroni; Anders Johannes Hansen. Algorithm for post-clustering curation of DNA amplicon data yields reliable biodiversity estimates. Nat Commun, 2017-10-30.

8. Cédric Berney; Andreea Ciuprina; Sara Bender; Juliet Brodie; Virginia Edgcomb; Eunsoo Kim; Jeena Rajan; Laura Wegener Parfrey; Sina Adl; Stéphane Audic; David Bass; David A Caron; Guy Cochrane; Lucas Czech; Micah Dunthorn; Stefan Geisen; Frank Oliver Glöckner; Frédéric Mahé; Christian Quast; Jonathan Z Kaye; Alastair G B Simpson; Alexandros Stamatakis; Javier Del Campo; Pelin Yilmaz; Colomban de Vargas. UniEuk: Time to Speak a Common Language in Protistology! (Review). J Eukaryot Microbiol, 2017-04-21.

9. Amnon Amir; Daniel McDonald; Jose A Navas-Molina; Evguenia Kopylova; James T Morton; Zhenjiang Zech Xu; Eric P Kightley; Luke R Thompson; Embriette R Hyde; Antonio Gonzalez; Rob Knight. Deblur Rapidly Resolves Single-Nucleotide Community Sequence Patterns. mSystems, 2017-03-07.

10. Diego Darriba; Tomáš Flouri; Alexandros Stamatakis. The State of Software for Evolutionary Biology (Review). Mol Biol Evol, 2018-05-01.
