MOTIVATION: Efficient and fast next-generation sequencing (NGS) algorithms are essential to analyze the terabytes of data generated by NGS machines. A serious bottleneck can be the design of such algorithms, as they require sophisticated data structures and advanced hardware implementation. RESULTS: We propose an open-source library dedicated to genome assembly and analysis to speed up the process of developing efficient software. The library is based on a recent optimized de-Bruijn graph implementation allowing complex genomes to be processed on desktop computers using fast algorithms with low memory footprints. AVAILABILITY AND IMPLEMENTATION: The GATB library is written in C++ and is available at the following Web site http://gatb.inria.fr under the A-GPL license. CONTACT: lavenier@irisa.fr SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
The analysis of next-generation sequencing (NGS) data remains a time- and space-consuming task. Many efforts have been made to provide efficient data structures for indexing the terabytes of data generated by fast sequencing machines (suffix array, Burrows–Wheeler transform, Bloom filter, etc.). Genome assemblers such as Velvet (Zerbino and Birney, 2008), ABySS (Simpson ), SOAPdenovo2 (Luo ) and SPAdes (Bankevich ), mappers such as BWA (Li and Durbin, 2009) and variant detection tools such as CRAC (Philippe ) make intensive use of these data structures to keep their memory footprint as low as possible.
At the same time, parallelism has been widely investigated to reduce execution time. Many strategies such as GPU implementation (Liu ), cloud deployment (Zhao ), algorithm vectorization (Rizk and Lavenier, 2010) and multithreading have demonstrated high potential for NGS processing.
The overall efficiency of NGS software depends on a smart combination of data representation and use of the available processing units. Developing such software is thus a real challenge, as it requires a broad range of skills, from high-level data structure and algorithm concepts down to fine implementation details.
The GATB library aims to ease the design of NGS algorithms. It offers a set of high-level optimized building blocks to speed up the development of NGS tools related to genome assembly and/or genome analysis. The underlying data structure is a memory-efficient de-Bruijn graph (Compeau ), and the general parallelism model is multithreading. The GATB library targets standard computing resources such as current multicore processors (laptop computer, small server) with a few gigabytes of memory.
Hence, from the high-level C++ functions available in the GATB library, designers of NGS programs can rapidly build their own software based on state-of-the-art algorithms and data structures of the domain.
Other bioinformatics libraries built on the same idea exist, from which domain-specific tools can be developed. The NGS++ library (Markovits ) is specifically tailored for developing applications that work with genomic regions and features, such as epigenomic marks, gene features and data that are associated with BED-type files. The SeqAn library (Doring ) is a general-purpose library targeting standard sequence processing. Advanced data structures such as de-Bruijn graphs are not included in SeqAn. Khmer (Crusoe ) is a library and toolkit for k-mer-based analysis of NGS datasets. As with GATB, most of khmer relies on an underlying probabilistic data structure (Bloom filter). The khmer library can be used for various k-mer processing tasks such as abundance filtering, error trimming, graph size filtering or partitioning.
2 METHODS
One of the main concerns of the GATB-core library is to provide computing modules able to run on standard machines, i.e. computers that do not require a large amount of main memory.
The central data structure is a de-Bruijn graph from which numerous actions can be performed, as shown in Figure 1: data error correction, assembly, biological motif detection [e.g. single nucleotide polymorphism (SNP)], etc. The graph is constructed by extracting and counting all the distinct k-mers from one or several sequencing datasets. This time- and space-consuming task is conducted by a disk streaming algorithm, DSK (Rizk ), which adapts its memory requirement to the available computer memory. A trade-off between execution time and memory occupancy can be set: the larger the computer memory, the shorter the computation time (fewer disk accesses).
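The k-mer extraction and counting step described above can be illustrated with a minimal sketch. It is not the DSK algorithm and does not use the GATB API: everything is kept in memory in a hash table, whereas DSK streams partitions of k-mers to disk to bound memory use. The function names (`encodeBase`, `countKmers`, `solidKmers`), the 2-bit encoding, the use of canonical k-mers (the smaller of a k-mer and its reverse complement) and the limit k ≤ 31 are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical type for this sketch: canonical k-mer (2 bits per base) -> abundance.
using KmerCounts = std::unordered_map<std::uint64_t, std::uint32_t>;

// Map a nucleotide to a 2-bit code (A=0, C=1, G=2, T=3); -1 for any other character.
inline int encodeBase(char c) {
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return -1;
    }
}

// Count canonical k-mers over all reads with a rolling 2-bit encoding.
// Windows containing a non-ACGT character (e.g. 'N') are skipped.
KmerCounts countKmers(const std::vector<std::string>& reads, std::size_t k) {
    assert(k >= 1 && k <= 31);  // a k-mer must fit in 64 bits in this sketch
    const std::uint64_t mask = (std::uint64_t{1} << (2 * k)) - 1;
    KmerCounts counts;
    for (const std::string& read : reads) {
        std::uint64_t fwd = 0;   // forward strand encoding of the current window
        std::uint64_t rev = 0;   // reverse complement encoding of the same window
        std::size_t valid = 0;   // number of consecutive valid bases seen so far
        for (char c : read) {
            const int b = encodeBase(c);
            if (b < 0) { valid = 0; fwd = 0; rev = 0; continue; }
            fwd = ((fwd << 2) | static_cast<std::uint64_t>(b)) & mask;
            rev = (rev >> 2) | (static_cast<std::uint64_t>(3 - b) << (2 * (k - 1)));
            if (++valid >= k) ++counts[std::min(fwd, rev)];
        }
    }
    return counts;
}

// Keep only k-mers seen at least minAbundance times ("solid" k-mers), sorted.
std::vector<std::uint64_t> solidKmers(const KmerCounts& counts, std::uint32_t minAbundance) {
    std::vector<std::uint64_t> solid;
    for (const auto& entry : counts)
        if (entry.second >= minAbundance) solid.push_back(entry.first);
    std::sort(solid.begin(), solid.end());
    return solid;
}
```

For instance, calling `countKmers` on the single read `ACGTA` with k = 3 yields two canonical k-mers, because `ACG` and `CGT` are reverse complements of each other and share one counter. A disk-based variant would hash each k-mer to a partition file and count one partition at a time, which is the kind of memory/time trade-off the text describes.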
Fig. 1. Schematic view of the GATB organization
The de-Bruijn graph memory footprint is kept low thanks to an optimized Bloom filter representation (Chikhi and Rizk, 2012; Salikhov ). Only the vertices of the de-Bruijn graph are stored. Edges are deduced by querying the Bloom filter. False positives (owing to the probabilistic behavior of the Bloom filter) are suppressed by an extra data structure that enumerates the critical vertices. This efficient de-Bruijn graph representation fits, for example, a complete mammalian genome in ∼4 GB.
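The principle of this representation can be sketched as follows. This is a simplified illustration under stated assumptions, not the GATB or Minia implementation: k-mers are plain strings on a single strand (no reverse complements), the hash functions are an arbitrary double-hashing scheme, and the class names `BloomFilter` and `BloomGraph` are hypothetical. The set of critical false positives is taken here to be the false positives that are neighbors of true k-mers, so that a traversal starting from a true node never leaves the graph.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Plain Bloom filter over strings, using double hashing (illustrative choice).
class BloomFilter {
public:
    BloomFilter(std::size_t nBits, std::size_t nHashes) : bits_(nBits, false), nHashes_(nHashes) {
        assert(nBits > 0 && nHashes > 0);
    }
    void insert(const std::string& kmer) {
        for (std::size_t i = 0; i < nHashes_; ++i) bits_[position(kmer, i)] = true;
    }
    bool mayContain(const std::string& kmer) const {
        for (std::size_t i = 0; i < nHashes_; ++i)
            if (!bits_[position(kmer, i)]) return false;
        return true;  // may be a false positive
    }
private:
    std::size_t position(const std::string& kmer, std::size_t i) const {
        const std::uint64_t h1 = static_cast<std::uint64_t>(std::hash<std::string>{}(kmer));
        std::uint64_t h2 = h1;  // second hash: splitmix64-style mixing of the first
        h2 ^= h2 >> 30; h2 *= 0xbf58476d1ce4e5b9ULL;
        h2 ^= h2 >> 27; h2 *= 0x94d049bb133111ebULL;
        h2 ^= h2 >> 31; h2 |= 1;
        return static_cast<std::size_t>((h1 + i * h2) % bits_.size());
    }
    std::vector<bool> bits_;
    std::size_t nHashes_;
};

// Nodes live in the Bloom filter; edges are implicit and found by querying it.
class BloomGraph {
public:
    BloomGraph(const std::vector<std::string>& kmers, std::size_t nBits, std::size_t nHashes)
        : bloom_(nBits, nHashes) {
        for (const std::string& kmer : kmers) bloom_.insert(kmer);
        // The exact set is only needed during construction, to find critical false positives.
        const std::unordered_set<std::string> truth(kmers.begin(), kmers.end());
        for (const std::string& kmer : kmers)
            for (bool forward : {true, false})
                for (const std::string& cand : candidates(kmer, forward))
                    if (bloom_.mayContain(cand) && truth.count(cand) == 0)
                        criticalFalsePositives_.insert(cand);
    }
    bool contains(const std::string& kmer) const {
        return bloom_.mayContain(kmer) && criticalFalsePositives_.count(kmer) == 0;
    }
    // Successors of a node: the (at most four) one-base extensions present in the graph.
    std::vector<std::string> successors(const std::string& kmer) const {
        std::vector<std::string> out;
        for (const std::string& cand : candidates(kmer, true))
            if (contains(cand)) out.push_back(cand);
        return out;
    }
private:
    // The four possible neighbors of a k-mer in one direction (k >= 1 assumed).
    static std::vector<std::string> candidates(const std::string& kmer, bool forward) {
        std::vector<std::string> out;
        for (char c : std::string("ACGT"))
            out.push_back(forward ? kmer.substr(1) + c : c + kmer.substr(0, kmer.size() - 1));
        return out;
    }
    BloomFilter bloom_;
    std::unordered_set<std::string> criticalFalsePositives_;
};
```

Even with a deliberately small filter, `successors` returns only true k-mers when called on a true k-mer, because every false-positive neighbor has been recorded at construction time. Queries on arbitrary k-mers that are not neighbors of a true node can still return false positives, which is why this sketch, like the representation described above, is meant to be traversed from known nodes.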
3 IMPLEMENTATION
The GATB library is composed of five main packages: the system, tools, bank, kmer and de-Bruijn packages.
The system package holds all the operations related to the operating system (OS): file management, memory management and thread management. This abstraction makes client code independent of the OS, which removes the need for compilation directives inside the code and allows some OS accesses to be improved by hiding OS-specific optimizations. The supported operating systems are Linux, Mac and Windows.
The tools package offers generic operations that are used throughout the user application but are not specific to genomics. For example, this package includes design pattern tools (such as iterators, observers, smart pointers, etc.) and object collections (such as containers, bags, iterables, etc.). It also optimizes the way GATB data structures are saved. The HDF5 file format is currently used (HDF5, 2012). This technology is well suited to large and complex data collections such as those handled in the GATB library.
The bank package provides operations related to standard genomic sequence dataset management. All the main sequence file formats are supported, and high-level interfaces allow sequences to be easily iterated regardless of the input format. In other words, algorithms are written independently of the input formats.
The kmer package is dedicated to fine-grained manipulation of k-mers. Optimized routines are provided to perform k-mer counting from large sequence datasets, to find the neighborhood of a k-mer or to select k-mers based on different criteria.
Finally, the de-Bruijn package provides high-level functions to manipulate a static de-Bruijn graph data structure: creation from a set of k-mers, iteration through different kinds of nodes (simple k-mers, branching k-mers, etc.), extraction of neighbor nodes, etc. Additional information (e.g. k-mer coverage, markers of visited nodes) is stored in the branching nodes of the graph. From this abstraction level, developing new tools based on de-Bruijn graphs is fast, and does not require programmers to delve into low-level details.
The GATB library takes advantage of the parallel nature of today’s multicore microprocessor architectures. When possible, time-consuming parts of the code are multithreaded to provide fast execution.
The GATB library is developed in C++ under the A-GPL license and is available from the following Web site: http://gatb.inria.fr. Extensive documentation with tutorials is available to guide designers in the process of developing new NGS tools from the GATB building blocks: http://gatb-core.gforge.inria.fr (see also Supplementary File 2 for technical implementation details).
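The distinction between simple and branching nodes mentioned for the de-Bruijn package can be illustrated with a short, self-contained sketch. It does not use the GATB API: the graph is an exact in-memory set of single-strand k-mers, and the names `KmerSet`, `classify` and `extendSimplePath` are hypothetical. The sketch also assumes that a simple node has exactly one predecessor and one successor, and that any other node counts as branching.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical exact representation: the node set of a de-Bruijn graph.
using KmerSet = std::unordered_set<std::string>;

enum class NodeKind { Simple, Branching };

// k-mers of the graph that overlap kmer by k-1 bases on its right side.
std::vector<std::string> successors(const KmerSet& graph, const std::string& kmer) {
    std::vector<std::string> out;
    for (char c : std::string("ACGT")) {
        const std::string cand = kmer.substr(1) + c;
        if (graph.count(cand) != 0) out.push_back(cand);
    }
    return out;
}

// k-mers of the graph that overlap kmer by k-1 bases on its left side.
std::vector<std::string> predecessors(const KmerSet& graph, const std::string& kmer) {
    std::vector<std::string> out;
    for (char c : std::string("ACGT")) {
        const std::string cand = c + kmer.substr(0, kmer.size() - 1);
        if (graph.count(cand) != 0) out.push_back(cand);
    }
    return out;
}

// Assumption of this sketch: simple means in-degree 1 and out-degree 1.
NodeKind classify(const KmerSet& graph, const std::string& kmer) {
    const bool simple = predecessors(graph, kmer).size() == 1 &&
                        successors(graph, kmer).size() == 1;
    return simple ? NodeKind::Simple : NodeKind::Branching;
}

// Walk forward from start while the path is unambiguous, and return the spelled sequence.
std::string extendSimplePath(const KmerSet& graph, const std::string& start) {
    assert(graph.count(start) == 1);
    std::string path = start;
    std::string current = start;
    KmerSet visited{start};  // guards against cycles
    while (true) {
        const std::vector<std::string> next = successors(graph, current);
        if (next.size() != 1) break;                              // dead end or fork
        if (predecessors(graph, next[0]).size() != 1) break;      // merge point
        if (!visited.insert(next[0]).second) break;               // already seen
        path += next[0].back();
        current = next[0];
    }
    return path;
}
```

With the k-mers `ACG`, `CGG` and `GGT`, extending from `ACG` spells `ACGGT`; adding `CGA` to the set creates a fork after `ACG`, and the extension stops immediately. Contig construction in a de-Bruijn assembler follows the same idea of walking non-branching paths, although a production implementation handles both strands and stores the graph far more compactly.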
4 RESULTS
To demonstrate the efficiency of the GATB library, a few software tools built with GATB are briefly presented. The idea is to give a quick overview of the application spectrum of the GATB library and some performance figures.
Minia (Chikhi and Rizk, 2012) is a short-read de-Bruijn assembler capable of assembling large and complex genomes into contigs on a desktop computer. The assembler produces contigs of similar length and accuracy to other de-Bruijn assemblers, e.g. Velvet (Zerbino and Birney, 2008). As an example, a Boa constrictor constrictor (1.6 Gb) dataset (Illumina 2 × 120 bp reads, 125× coverage) from Assemblathon 2 (Bradnam ) can be processed in ∼45 h with 3 GB of memory on a standard computer (3.4 GHz 8-core processor) using a single core, yielding a contig N50 of 3.6 kb (prior to scaffolding and gap-filling).
Bloocoo is a k-mer spectrum-based read error corrector, designed to correct large datasets with a low memory footprint. It uses the disk streaming k-mer counting algorithm contained in the GATB library and inserts solid k-mers in a Bloom filter. The correction procedure is similar to the Musket multistage approach (Liu ). Bloocoo yields similar results while requiring far less memory: for example, it can correct whole human genome re-sequencing reads at 70× coverage with <4 GB of memory (see Supplementary File 1 for extra information on Bloocoo).
DiscoSNP aims to discover single nucleotide polymorphisms from non-assembled reads, without a reference genome. From one or several datasets a global de-Bruijn graph is constructed, then scanned to locate specific SNP graph patterns (Uricaru ). A coverage analysis of these particular locations can finally be performed to validate and assign scores to the detected biological elements. Applied to a mouse dataset (2.88 Gb, 100 bp Illumina reads), DiscoSNP takes 34 h and requires 4.5 GB of RAM. In the same spirit, the TakeABreak software discovers inversion variants from non-assembled reads. It directly finds particular patterns in the de-Bruijn graph and provides execution performance similar to DiscoSNP (Lemaitre ).
Funding: ANR (French National Research Agency) (ANR-12-EMMA-0019-01).
Conflict of interest: none declared.