Literature DB >> 22815363

MolBioLib: a C++11 framework for rapid development and deployment of bioinformatics tasks.

Abstract

SUMMARY: We developed MolBioLib to address the need for adaptable next-generation sequencing analysis tools. The result is a compact, portable and extensively tested C++11 software framework and set of applications tailored to the demands of next-generation sequencing data and applicable to many other applications. MolBioLib is designed to work with common file formats and data types used both in genomic analysis and general data analysis. A central relational-database-like Table class is a flexible and powerful object to intuitively represent and work with a wide variety of tabular datasets, ranging from alignment data to annotations. MolBioLib has been used to identify causative single-nucleotide polymorphisms in whole genome sequencing, detect balanced chromosomal rearrangements and compute enrichment of messenger RNAs (mRNAs) on microtubules, typically requiring applications of under 200 lines of code. MolBioLib includes programs to perform a wide variety of analysis tasks, such as computing read coverage, annotating genomic intervals and novel peak calling with a wavelet algorithm. Although MolBioLib was designed primarily for bioinformatics purposes, much of its functionality is applicable to a wide range of problems. Complete documentation and an extensive automated test suite are provided. AVAILABILITY: MolBioLib is available for download at: http://sourceforge.net/projects/molbiolib CONTACT: ohsumit@molbio.mgh.harvard.edu.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2012 PMID： 22815363 PMCID： PMC3463126 DOI： 10.1093/bioinformatics/bts458

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Next-generation sequencing requires a data analysis approach capable of handling large, complex and varied datasets, from large sets of reads to complex polymorphisms to existing feature files. In addition, the competitive nature of research demands rapid development of methods that are flexible enough to integrate new and quickly evolving algorithms. Tools have been developed to address these needs, such as GATK (McKenna ). However, packages written in Java (e.g. GATK) require the maximum memory heap space to be specified at run time (Oracle, 2011), limiting how the input data are formatted and handled. For example, a coverage program would require more memory to compute coverage of a query-ordered SAM file versus a position-ordered SAM file, because a sliding window of coverage cannot be used. Programs written in C++ do not require the heap size to be specified and are only limited by the amount of available memory. Other packages written in C++ have their strengths, but they also have limitations that suggest a niche for our software, MolBioLib. Arachne (Batzoglou ; Jaffe ), the .NET Bio project by Outercurve Foundation (Outercurve, 2012) and NCBI's C++ Toolkit (Vatakov, 2012) provide many functions, but are not compact and do not always clearly identify the primary objects. Furthermore, the .NET Bio project is specific to the Windows environment (Mercer, 2012) and Arachne is specific to a particular Linux environment. IBM's GenomicTools (Tsirigos, 2012) has many very useful tools, but addresses common bioinformatics tasks at a lower level than MolBioLib, such as providing command-line tools rather than a unified program to generate ChIP-seq output. Other packages, such as Bio++ (Dutheil, 2008), libsequence (Thornton, 2003) and TIGR++ (Majoros, 2012), are targeted toward specific applications and not designed to provide breadth of functionality. The package that most closely resembles MolBioLib's philosophy is SeqAn (Döring, 2008), though it is written in an older version of C++ and thus does not take advantage of the variadic templates or other modern features of C++11 (ISO/IEC, 2011). MolBioLib fills the need for a platform-independent, extensively tested, compact and efficient C++11 library and an extensive set of bioinformatics applications that can be used to analyze data and rapidly develop new tools. MolBioLib's library includes a variety of useful objects and functions, such as a relational-database-like object, a text file reader object that simplifies data input, statistical functions and peak calling methods that can operate on any array of values, such as per base sequence coverage. In addition, MolBioLib includes a broad range of tools, such as to generate coverage, hits of reads to features and ChIP-seq, all in one unified package. The design of MolBioLib is based on four principles. The first is to simplify bioinformatics programming in C++11, achieved by developing a library that includes many common bioinformatics tasks. For example, C++11 requires programmers to write specialized data structures to sort associated data keeping them together, such as feature information associated with a position. Additionally, to iterate either sequentially or randomly through a tab-separated-values (TSV) file and select values from specific columns would require the creation of a function to split a line on tabs and constructs to index and traverse a text file. These, and many other common tasks, are built into MolBioLib, thus greatly simplifying the code one needs to write. It is hoped that MolBioLib will allow bioinformaticians to consider C++11 as a possible language of choice. Second, MolBioLib is efficient. C++11 is used because it is the new standard that introduces constructs for making objects such as Table. C++11 is efficient since it is a compiled language with no inherent restriction on memory heap size at run time. Templates are used extensively to compact code, avoid inefficient virtual table lookups and maintain type safety. Objects and method parameters are often templated so that they may be in-lined by the compiler. Third, MolBioLib promotes clarity and compactness by consolidating common operations into a concise set of objects. We also provide an extensive library of functions that are not intrinsic to one object, such as those that convert one data type to another, e.g. splitString converts a string to a vector. Given the range of problems MolBioLib addresses, the source code is compact: ~10 000 lines of code and comments for the core objects and functions. Among the 101 included applications, 86% are coded in fewer than 200 lines and 59% in fewer than 100 lines. In contrast, without such a framework, the user would have to code the thousands of lines of code to reproduce MolBioLib's functionality. Finally, MolBioLib is extensively tested and facilitates easy testing and debugging of its applications. Automated tests are provided for all objects and functions. Additional validation of the code base comes from extensive application of MolBioLib to many molecular biology projects (Lau ; Sharp ; Talkowski ; Zhao ; Raif S. Geha, manuscript in preparation). To simplify use of MolBioLib, all libraries are include files following the Boost convention (Schaling, 2011). Debugging and memory checking is thus facilitated with tools such as with Valgrind (Nethercote and Seward, 2007; Seward and Nethercote, 2005) since applications in MolBioLib consist of a main program file with many include files. Additional input and programming checks are incorporated into the framework through optional compiler flags.

2 METHODS

MolBioLib is hierarchically structured for ease of use. It contains three main components: the library consisting of a set of objects and functions, the set of applications and the documentation. A script is included to compile all the external packages [such as BAMTools (Barnett )], applications and optionally build and run tests. MolBioLib can be used independently of the external packages and interfaces. The documentation for all of MolBioLib may be generated automatically using the included Doxygen configuration file (van Heesch, 2011). The introductory pages of the Doxygen output show how to compile and use MolBioLib both as a set of tools as well as a programming framework. Functions that transform one data type to another are separated from the objects. Finally, the applications are hierarchically organized by usage type. Several novel classes power rapid development with MolBioLib. The primary object that stores data in MolBioLib is Table, which is a container class similar to the C++ STL vector class except that each column may store a different data type, through C++11's variadic templates. Variadic templates allow the coding of objects that can accept an arbitrary number of template parameters. However, writing variadic templated objects and functions can become cumbersome (Gregor and Järvi, 2008; Kalev, 2008). Therefore, MolBioLib includes the Table object to represent tabular data, a mainstay of bioinformatics data exchange, in an intuitive and easily used fashion. The Table structure was based on the relational database (Codd, 1970) model, where related data are stored row-wise. Column data types are specified through the template parameters. A Table may be thought of as a generalized vector. It includes data row insertion and retrievals operations that are simple to use. Database-like operations for Table, such as concatenation, filtering and inner and outer joins, are provided. Example usage of a Table is: where the readTSVTable is a function to read a TSV file into a table. This tabular grouping of data can be used for many bioinformatics tasks. One example is the Fasta object, derived from the Table object, which stores sequences and their headers. The Fasta object simplifies access to sequence data: The primary object to read text files is ReadOnlyStringFile. The class automatically creates an index of a file, if not already present, so the file may be accessed as if it were a string array. The index file is created by going through the text file once and noting the starting file position of each line in the index file. The index file itself has a fixed length per line, simplifying the process of finding the index position. Thus, to access a line in the text file, the appropriate line in the index file is looked up. Subsequently, the line starting at the file position indicated by the look up is read in. Almost no memory is required in using the index. A ReadOnlyStringFile object functions like an array in which each element is one row of the file. ReadOnlyTSVFile is a particularly useful derived class of ReadOnlyStringFile. The values of each tab-separated field of each row can be accessed by the operator[] method, returning a vector containing the parsed elements of one row. Applications written with MolBioLib capitalize on the ability of the ReadOnlyStringFile class to hide all the housekeeping chores involved in parsing data from delimited text files. For example, to sample a random subset of a TSV input file, one would code: If a file only needs to be traversed once, sequentially a line at a time, ReadOnlyStringFile can traverse the file without creating an index file. This eliminates the time to build and store an index. The ReadOnlySequencesFile, based on the ReadOnlyStringFile, is a FASTA/FASTQ reader object. It can work in a random access mode or sequentially traverse the file, providing all the read-only operations of the Fasta object, thus greatly simplifying access to sequence read files. Other general objects in MolBioLib include a sparse vector object, e.g. for use in storing sparse coverage, a map facilitator between keys and rows in a Table, and a parameterized type interval object with associated overlap and set functions. A random number generator class that includes a permutation vector is also provided. Other bioinformatics objects included are an alignment object, a Sequence class with operations such as reverse-complementing, a feature object and a peak object for storing local extrema of numeric data. All of these classes have been used to simplify coding of novel bioinformatics analyses. Functions included in MolBioLib are diverse, intended to cover four broad categories: algorithms, file readers and writers for various data types, system utilities and transformation of data types to different data types. Algorithm functions include Table functions, peak detection and statistics. Table functions include smoothing values, sorting and upper_bound and lower_bound of Tables analogous to their vector class counterparts. File reader and writers include those for various alignment formats, including SAM (Li ), Helicos BioSciences' TXT format (Helicos, 2010), NCBI tabular BLAST output (Madden, 2003), feature readers for tabular files such as for UCSC's refGene annotation table (refGene.txt; Kent, 2012), Ensembl's Biomart in TSV format (Flicek ; Smedley ) and the GFF format (Wellcome, 2012), streams for intervals and vectors, tuple streams and writers and Table reader and writers. Peak detection functions for numeric arrays will be discussed in more detail in the ChIP-seq section. System utilities include a powerful command-line parsing system. Various command-line argument types are provided, including numeric and string. Input and output file name argument types provide file checking to ensure all input files are present at run time and to prevent accidental overwrites of output files. Furthermore, the command-line parser automatically records the date and time the program was compiled as well as the command line to simplify documentation of computational steps and pipelines. System utilities also provide functions to transform one data type to another, such as string conversions to and from various data types as well as string splitting (on one or more delimiters). The compiler of choice for MolBioLib is clang++ version 3.0 and above (Clang, 2012) using the associated libc++ library and is available on the Linux, Mac OS X and MS Windows platforms. It supports a large subset of C++11, has very good compiler error messages and is efficient. MolBioLib also works with GNU g++ 4.7 and above (Gcc, 2012).

3 RESULTS

One of the primary goals of MolBioLib is to provide a set of programs that address the most common bioinformatics analyses. Here we describe applications in MolBioLib that address four common bioinformatics analyses: annotating a list of features, counts of hits to features, coverage and ChIP-Seq. We also touch upon additional useful utilities included in MolBioLib. The MolBioLib application addFeaturesToTSVFile performs the very common task of adding gene annotations or more generically ‘features’, to an input file in which each row describes a genomic interval. Examples of annotations include the refGene.txt file downloadable from UCSC's genome browser site (Fujita ) that contains the gene ID, name, chromosome, strand, and start and stop positions of the transcript. Other annotation files include tabular data from the Ensembl/Biomart website (Flicek ; Smedley ), where one can download any set of genes with user-selected attributes such as IDs, names, positions, expression data and protein domain. There are numerous other annotation sources, many of which consist of carefully curated private data, on a topical website, or in a published supplement to a journal article. Other common tabular formats include the BED, PSL and GFF formats (Kent, 2012; Wellcome, 2012). Using addFeaturesToTSVFile, the genomic footprint of any such annotation can be intersected with another TSV file containing genomic intervals. The application will use the genomic interval specified on each row of the input file and find all intersecting feature coordinates (with matching strand, if specified) and add the appropriate annotation(s) to the row in the output. Importantly, this application will take in any input TSV as well as any annotations in TSV form (such as those noted above) and thus may be used on a wide variety of projects. Another common bioinformatics task is to count the number of alignments mapping to a set of features in a TSV file, such as refGene.txt, promoter regions or classes of sequence repeats. For example, we have used this method to quantify the number of reads derived from genic regions, different classes of genomic repeats and from different classes of non-coding RNAs (Sharp ). refFeaturesAnalysis offers a number of options, such as shifting the positions of features (e.g. probing hits to upstream UTRs instead of the genes), filtering for a specific set of reads, maximum error in alignments and RPKM normalization (Mortazavi ). The input features are printed along with the count of alignments overlapping each input feature and meeting any filtering criteria. Sequence coverage can be used for determining the number of reads mapped to a base or region and also for finding polymorphisms. MolBioLib has a unique coverage tool that outputs strand-specific statistics as well as a count of mismatched bases observed at each position. Coverage can be run on a user-defined set of regions and normalized to the number of million reads in the input reads file. Additionally, coverage can be executed using only the midpoint position to identify where the reads align. Finally, coverage can be run using uniquely aligning reads. Programs are provided to post-process coverage output in various ways, such as multiplying by some constants (either for the whole file or by contig/chromosome), windowing the coverage and converting the coverage to a wiggle file. Coverage output is the input for the ChIP-Seq analysis program discussed next. An experimental ChIP-Seq program with a novel wavelet method is our final example of application of the MolBioLib library. Peak detection is one major next-generation sequencing application. Given an alignment file, this algorithm finds the coverage on each strand at each location, computed either per base or in bins spanning a user-defined number of bases. This is done for both the sample as well as for a background sample or control, and coverage is smoothed using a kernel smoother. One challenge of peak detection is finding peaks having a wide range of widths and heights. We address this by applying a translation-independent wavelet smoother applied at various scales, finding local peaks at each scale, and then ranking putative peaks by using ridge lines that identify peaks detected across multiple scales (Du ). Peaks with a longer ridge length are more isolated from other peaks, because they show up as peaks at various length scales. Optional filters remove low signals and spikes. This peak detection method is included in MolBioLib. Other smoothers, such as Gaussian, are also implemented in MolBioLib. We validated the wavelet peak detector on published H3K4me3 ChIP-Seq data (Myers ). As expected, H3K4me3 peaks are enriched in promoter regions (Zhao ). The percentage of peaks in promoter regions was 57% compared to 33% using the window tag density method (Kharchenko ) and 53% using MACS (Zhang ). Peaks were well defined, with a mean width of 232 bp compared to 2840 bp and 654 bp for the other two methods tested on the same data. There are many other applications included with MolBioLib that address much of the essential bioinformatics analyses done in next-generation sequencing projects. Some of the more common tasks include computing statistics on a list of numbers, creating histogram files from data files (both numeric and string), converting alignment formats. Additional tasks include common operations on FASTA and FASTQ files, such as obtaining a subset, trimming and removing duplicate reads. Moreover, there are programs to combine, print subsets and inner join TSV files. Intersection, subtraction and union operations of text files are also included.

4 DISCUSSION

MolBioLib fills the need for an efficient, reliable and compact C++11 bioinformatics framework. It is portable across many platforms and aligner formats and is fully documented. MolBioLib is unique in offering complete analysis programs for a variety of other very common tasks not addressed by other toolkits, from feature hit counts to coverage to ChIP-Seq. MolBioLib classes offer considerable power and convenience for creating novel analysis applications. A central and very general Table class simulating the functionality of a database eases construction of many programs. The Table class is based on a collection vectors, thus having a small memory overhead compared to other data structures such as a map. Capacity for larger datasets is only limited by the amount of available memory. File readers provide efficient methods to perform ubiquitous file I/O tasks. These classes will have general utility for application development beyond the specific needs of computational biology. As MolBioLib gains adoption, we aim to incorporate many of the applications both user-contributed and those developed for our projects into the main distribution through the SourceForge.net code repository mechanism.

24 in total

1. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Authors: Aaron McKenna; Matthew Hanna; Eric Banks; Andrey Sivachenko; Kristian Cibulskis; Andrew Kernytsky; Kiran Garimella; David Altshuler; Stacey Gabriel; Mark Daly; Mark A DePristo
Journal: Genome Res Date: 2010-07-19 Impact factor: 9.043

2. Whole-genome mapping of histone H3 Lys4 and 27 trimethylations reveals distinct genomic compartments in human embryonic stem cells.

Authors: Xiao Dong Zhao; Xu Han; Joon Lin Chew; Jun Liu; Kuo Ping Chiu; Andre Choo; Yuriy L Orlov; Wing-Kin Sung; Atif Shahab; Vladimir A Kuznetsov; Guillaume Bourque; Steve Oh; Yijun Ruan; Huck-Hui Ng; Chia-Lin Wei
Journal: Cell Stem Cell Date: 2007-09-13 Impact factor: 24.633

3. Genome-wide identification of polycomb-associated RNAs by RIP-seq.

Authors: Jing Zhao; Toshiro K Ohsumi; Johnny T Kung; Yuya Ogawa; Daniel J Grau; Kavitha Sarma; Ji Joon Song; Robert E Kingston; Mark Borowsky; Jeannie T Lee
Journal: Mol Cell Date: 2010-12-22 Impact factor: 17.970

4. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

5. Systematic and single cell analysis of Xenopus Piwi-interacting RNAs and Xiwi.

Authors: Nelson C Lau; Toshiro Ohsumi; Mark Borowsky; Robert E Kingston; Michael D Blower
Journal: EMBO J Date: 2009-08-27 Impact factor: 11.598

6. Non-homogeneous models of sequence evolution in the Bio++ suite of libraries and programs.

Authors: Julien Dutheil; Bastien Boussau
Journal: BMC Evol Biol Date: 2008-09-22 Impact factor: 3.260

7. BioMart--biological queries made easy.

Authors: Damian Smedley; Syed Haider; Benoit Ballester; Richard Holland; Darin London; Gudmundur Thorisson; Arek Kasprzyk
Journal: BMC Genomics Date: 2009-01-14 Impact factor: 3.969

8. SeqAn an efficient, generic C++ library for sequence analysis.

Authors: Andreas Döring; David Weese; Tobias Rausch; Knut Reinert
Journal: BMC Bioinformatics Date: 2008-01-09 Impact factor: 3.169

9. Model-based analysis of ChIP-Seq (MACS).

Authors: Yong Zhang; Tao Liu; Clifford A Meyer; Jérôme Eeckhoute; David S Johnson; Bradley E Bernstein; Chad Nusbaum; Richard M Myers; Myles Brown; Wei Li; X Shirley Liu
Journal: Genome Biol Date: 2008-09-17 Impact factor: 13.583

10. Design and analysis of ChIP-seq experiments for DNA-binding proteins.

Authors: Peter V Kharchenko; Michael Y Tolstorukov; Peter J Park
Journal: Nat Biotechnol Date: 2008-11-16 Impact factor: 54.908

6 in total

1. Recurrent viral infections associated with a homozygous CORO1A mutation that disrupts oligomerization and cytoskeletal association.

Authors: Christina S Yee; Michel J Massaad; Wayne Bainter; Toshiro K Ohsumi; Niko Föger; Andrew C Chan; Nurten A Akarsu; Caner Aytekin; Deniz Çagdas Ayvaz; Ilhan Tezcan; Özden Sanal; Raif S Geha; Janet Chou
Journal: J Allergy Clin Immunol Date: 2015-10-21 Impact factor: 10.793

2. Deficiency of base excision repair enzyme NEIL3 drives increased predisposition to autoimmunity.

Authors: Michel J Massaad; Jia Zhou; Daisuke Tsuchimoto; Janet Chou; Haifa Jabara; Erin Janssen; Salomé Glauzy; Brennan G Olson; Henner Morbach; Toshiro K Ohsumi; Klaus Schmitz; Markianos Kyriacos; Jennifer Kane; Kumiko Torisu; Yusaku Nakabeppu; Luigi D Notarangelo; Eliane Chouery; Andre Megarbane; Peter B Kang; Eman Al-Idrissi; Hasan Aldhekri; Eric Meffre; Masayuki Mizui; George C Tsokos; John P Manis; Waleed Al-Herz; Susan S Wallace; Raif S Geha
Journal: J Clin Invest Date: 2016-10-17 Impact factor: 14.808

3. Regulatory T-cell deficiency and immune dysregulation, polyendocrinopathy, enteropathy, X-linked-like disorder caused by loss-of-function mutations in LRBA.

Authors: Louis-Marie Charbonnier; Erin Janssen; Janet Chou; Toshiro K Ohsumi; Sevgi Keles; Joyce T Hsu; Michel J Massaad; Maria Garcia-Lloret; Rima Hanna-Wakim; Ghassan Dbaibo; Abdullah A Alangari; Abdulrahman Alsultan; Daifulah Al-Zahrani; Raif S Geha; Talal A Chatila
Journal: J Allergy Clin Immunol Date: 2014-11-17 Impact factor: 10.793

4. The CIP2A-TOPBP1 axis safeguards chromosome stability and is a synthetic lethal target for BRCA-mutated cancer.

Authors: Salomé Adam; Silvia Emma Rossi; Nathalie Moatti; Mara De Marco Zompit; Yibo Xue; Timothy F Ng; Alejandro Álvarez-Quilón; Jessica Desjardins; Vivek Bhaskaran; Giovanni Martino; Dheva Setiaputra; Sylvie M Noordermeer; Toshiro K Ohsumi; Nicole Hustedt; Rachel K Szilard; Natasha Chaudhary; Meagan Munro; Artur Veloso; Henrique Melo; Shou Yun Yin; Robert Papp; Jordan T F Young; Michael Zinda; Manuel Stucki; Daniel Durocher
Journal: Nat Cancer Date: 2021-11-11

5. A homozygous mucosa-associated lymphoid tissue 1 (MALT1) mutation in a family with combined immunodeficiency.

Authors: Haifa H Jabara; Toshiro Ohsumi; Janet Chou; Michel J Massaad; Halli Benson; Andre Megarbane; Eliane Chouery; Raymond Mikhael; Oliver Gorka; Andreas Gewies; Pierre Portales; Toshinori Nakayama; Hiroyuki Hosokawa; Patrick Revy; Henry Herrod; Francoise Le Deist; Gerard Lefranc; Jürgen Ruland; Raif S Geha
Journal: J Allergy Clin Immunol Date: 2013-05-31 Impact factor: 10.793

6. The microglial sensome revealed by direct RNA sequencing.

Authors: Suzanne E Hickman; Nathan D Kingery; Toshiro K Ohsumi; Mark L Borowsky; Li-chong Wang; Terry K Means; Joseph El Khoury
Journal: Nat Neurosci Date: 2013-10-27 Impact factor: 24.884

6 in total