The Sleipnir library for computational functional genomics.

Curtis Huttenhower, Mark Schroeder, Maria D Chikina, Olga G Troyanskaya.

Abstract

MOTIVATION: Biological data generation has accelerated to the point where hundreds or thousands of whole-genome datasets of various types are available for many model organisms. This wealth of data can lead to valuable biological insights when analyzed in an integrated manner, but the computational challenge of managing such large data collections is substantial. In order to mine these data efficiently, it is necessary to develop methods that use storage, memory and processing resources carefully.
RESULTS: The Sleipnir C++ library implements a variety of machine learning and data manipulation algorithms with a focus on heterogeneous data integration and efficiency for very large biological data collections. Sleipnir allows microarray processing, functional ontology mining, clustering, Bayesian learning and inference and support vector machine tasks to be performed for heterogeneous data on scales not previously practical. In addition to the library, which can easily be integrated into new computational systems, prebuilt tools are provided to perform a variety of common tasks. Many tools are multithreaded for parallelization in desktop or high-throughput computing environments, and most tasks can be performed in minutes for hundreds of datasets using a standard personal computer.
AVAILABILITY: Source code (C++) and documentation are available at http://function.princeton.edu/sleipnir and compiled binaries are available from the authors on request.

Year: 2008 PMID: 18499696 PMCID: PMC2718674 DOI: 10.1093/bioinformatics/btn237

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Whole-genome assays have now become pervasive, and the resulting wealth of data represents a new opportunity for biological discovery. A single genome-scale dataset can capture a snapshot of cellular function; integrative analysis of hundreds or thousands of genome-scale datasets can provide even more extensive systems-level insights regarding gene interactions under diverse conditions (Troyanskaya, 2005). Integrated approaches have already resulted in important biological discoveries (Hong et al., 2008; Myers and Troyanskaya, 2007), and the breadth and depth of possible analyses will only increase as additional experimental data is collected.

As the amount of data to be analyzed continues to increase, computational efficiency becomes a greater concern. Specialized resources exist to enable very high-throughput computing for specific applications (Pekurovsky et al., 2004; Swindells et al., 2002), but few computational options exist allowing researchers to quickly mine large collections of genome-scale datasets.

To address this need, we have created the Sleipnir library for computational functional genomics. The library contains algorithms and data types for efficiently manipulating and mining very large biological data collections. The core C++ library can be integrated into computational systems to provide rapid analysis of functional genomic data. Additionally, a variety of tools are provided that use the library to perform common tasks: microarray processing, Bayesian and support vector machine (SVM) learning and so forth. Even when analyzing individual datasets, Sleipnir often outperforms existing utilities in processing time, memory usage or both (Table 1). Tools provided with Sleipnir address common data manipulation requirements, in many cases processing hundreds of datasets on a standard desktop computer. Additionally, the core Sleipnir library can be easily employed to efficiently apply new algorithms to complex biological data.

Table 1.

Sleipnir efficiency on integration and single dataset tasks

Implementation	Peak RAM (KB)	Time (s)
Bayesian learning (500 genes, 15 datasets)
Sleipnir	1376	<1
GeNIe	6832	4
BNT	593 180	15
Bayesian inference (500 genes, 15 datasets)
Sleipnir	1216	1
BNT	273 992	>600
Missing value estimation (10% missing, k=10)
Sleipnir	27 232	195
knnimpute	115 708	368
Hierarchical clustering
Sleipnir	83 188	156
Cluster 3.0	176 836	154
MeV	198 292	361
K-means clustering (k=100)
Sleipnir	8780	114
Cluster 3.0	28 544	102
MeV	198 292	361

Memory usage and runtimes for Sleipnir and a number of other common tools for Bayesian analysis and biological data manipulation (de Hoon et al., 2004; Druzdzel, 1999; Murphy, 2001; Saeed et al., 2003; Troyanskaya et al., 2001). All microarray operations were performed on the 300 conditions and 6153 genes of Hughes et al. (2000) using Euclidean distance. Bayesian operations were performed on simulated data using a binary gold standard and five randomly distributed values per dataset. Tests were run in a single thread on a 2 GHz Intel Core 2 Duo. In every case, Sleipnir demonstrates a substantial advantage in speed, memory usage or both.
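To make the comparison in Table 1 easier to read, the following Python sketch expresses each tool's peak RAM and runtime as a ratio relative to Sleipnir. The numbers are transcribed from the table; the dictionary layout and the function name `ratios_vs_sleipnir` are illustrative choices and not part of Sleipnir. The inexact entries "<1" and ">600" are stored as their bounds, so the time ratios derived from them are lower bounds.

```python
# Peak RAM (KB) and runtime (s) transcribed from Table 1.
# "<1" and ">600" are stored as the bounds 1 and 600 (an approximation),
# so time ratios that use them are lower bounds rather than exact values.
TABLE1 = {
    "Bayesian learning": {"Sleipnir": (1376, 1), "GeNIe": (6832, 4), "BNT": (593180, 15)},
    "Bayesian inference": {"Sleipnir": (1216, 1), "BNT": (273992, 600)},
    "Missing value estimation": {"Sleipnir": (27232, 195), "knnimpute": (115708, 368)},
    "Hierarchical clustering": {"Sleipnir": (83188, 156), "Cluster 3.0": (176836, 154),
                                "MeV": (198292, 361)},
    "K-means clustering": {"Sleipnir": (8780, 114), "Cluster 3.0": (28544, 102),
                           "MeV": (198292, 361)},
}


def ratios_vs_sleipnir(task):
    """Return {tool: (ram_ratio, time_ratio)}; values above 1 favour Sleipnir."""
    base_ram, base_time = TABLE1[task]["Sleipnir"]
    return {
        tool: (ram / base_ram, seconds / base_time)
        for tool, (ram, seconds) in TABLE1[task].items()
        if tool != "Sleipnir"
    }


for task in TABLE1:
    for tool, (ram_ratio, time_ratio) in ratios_vs_sleipnir(task).items():
        print(f"{task:26s} {tool:12s} RAM x{ram_ratio:7.1f}  time x{time_ratio:6.2f}")
```

Read this way, the two clustering rows show runtimes for Cluster 3.0 that are slightly lower than Sleipnir's (time ratio below 1), so the advantage reported there is in memory usage rather than speed.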

2 METHODS

The Sleipnir library contains a wide variety of tools for consuming standard biological data formats, manipulating and normalizing data and performing machine learning and prediction. These are discussed extensively in the user and developer documentation included with the library (http://function.princeton.edu/sleipnir) and are presented here in summary.

Sleipnir provides C++ classes to parse pairwise interaction data and standard microarray file formats. Microarray data can be converted into pairwise similarity/distance scores using a variety of measures, discretized, normalized, randomized for bootstrapping or synthetic data production, split or merged, imputed or clustered. To facilitate functional enrichment analysis, gene function prediction and gold standard generation from known gene functions and relationships, Sleipnir provides a uniform interface to several organism-independent function annotation catalogs. Information from organism-specific annotations can be merged with these functional annotations. Sleipnir also includes collections of data structures for dealing with common biological entities: gene identifiers, sets of genes, groups of related files, etc. Other utility classes include resources for multithreading, a ready-made network client/server class and a variety of mathematical and statistical tools.

Sleipnir provides several tools for rapid machine learning and data mining. The SMILE Bayesian network library (Druzdzel, 1999) and the SVM Light library (Joachims, 1999) are used to learn and evaluate Bayesian or SVM models from very large collections of biological data. Arbitrary Bayesian structures are allowed, with parameters learned either discriminatively or generatively (Greiner and Zhou, 2005) from data in a context-specific manner (Huttenhower et al., 2006); extremely fast, customized learning and evaluation implementations are used for naive structures.
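As an illustration of the conversion of microarray data into pairwise scores and their subsequent discretization, the following Python sketch computes a score for every gene pair and maps it to a bin index. It is a minimal sketch of the general idea only: the function names, the gene labels `geneA` to `geneC`, the toy expression values and the bin edges are all hypothetical, and none of this reproduces Sleipnir's C++ classes or file formats.

```python
from bisect import bisect_right
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant vector: correlation undefined, treated here as 0
    return cov / sqrt(var_x * var_y)


def euclidean(x, y):
    """Euclidean distance, the measure used for the microarray rows of Table 1."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pairwise_scores(expression, measure=pearson):
    """Score every unordered gene pair; keys are (gene1, gene2) in sorted order."""
    return {
        (g1, g2): measure(expression[g1], expression[g2])
        for g1, g2 in combinations(sorted(expression), 2)
    }


def discretize(score, edges):
    """Map a score to a bin index given ascending bin edges."""
    return bisect_right(edges, score)


# Hypothetical toy data: three genes measured under four conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
scores = pairwise_scores(expression)
bins = {pair: discretize(s, edges=[-0.5, 0.5]) for pair, s in scores.items()}
print(bins)
```

Discretized pairwise scores of this kind are a convenient input for the Bayesian learning step, since each dataset then contributes one small categorical variable per gene pair.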

3 RESULTS

While Sleipnir's efficiency in integrating and mining biological datasets is most critical for very large data collections, it is also practical for single dataset tasks and smaller analyses (Table 1). When compared to several common tools for microarray manipulation or Bayesian learning, Sleipnir consistently demonstrates a substantial advantage in runtime, memory usage or both. These improvements arise from a variety of optimizations but are broadly attributable to the flexibility allowed by C++ in manipulating large quantities of individual data (microarray values, interaction pairs, etc.). What Sleipnir trades off in generality (e.g. with respect to BNT) or robustness to malformed input (e.g. with respect to MeV), it gains in speed, memory management and overall scalability, allowing it to efficiently manipulate large data collections.

The Sleipnir library is particularly useful for large integration tasks involving hundreds of diverse biological datasets; example applications of Sleipnir in such settings include Huttenhower et al. (2006) and Myers and Troyanskaya (2007). A schematic of such a task is shown in Figure 1, where Sleipnir was used to learn 200 context-specific Bayesian classifiers each integrating 186 Saccharomyces cerevisiae datasets. Conditional probability tables were learned for each dataset within each context, entailing ∼75 000 probability distributions. The resulting Bayesian classifiers were used to infer context-specific functional relationship networks, each consuming 90 MB of disk space and calculated in 16.3 min. Sleipnir also supports an online mode for functional relationship inference in which no additional disk space is consumed and individual context-specific functional relationships can be produced in as little as 100 ns. Parallelization on four processor cores yields an optimal 4-fold speedup, reducing the total time to ∼13 h each for Bayesian learning and inference. Every stage of this complex data integration and machine-learning task was performed using Sleipnir and its associated tools.
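The per-dataset conditional probability tables described above can be illustrated with a minimal naive Bayesian sketch in Python. It assumes a binary gold standard (functionally related or not), discretized dataset values, and generative learning by counting with a pseudocount; these choices, the function names and the toy gene pairs are assumptions made for illustration and are not Sleipnir's implementation, which relies on SMILE and customized C++ code.

```python
def learn_naive_bayes(gold, datasets, n_bins, pseudocount=1.0):
    """Learn a class prior and one conditional probability table per dataset.

    gold:     {gene_pair: 0 or 1}, the binary gold standard
    datasets: list of {gene_pair: bin index}; absent pairs are missing values
    Returns (prior, cpts) where cpts[i][c][v] = P(dataset i = v | class c).
    """
    class_counts = [0, 0]
    for label in gold.values():
        class_counts[label] += 1
    total = sum(class_counts)
    prior = [count / total for count in class_counts]

    cpts = []
    for data in datasets:
        # Start every bin at the pseudocount so no probability is exactly zero.
        counts = [[pseudocount] * n_bins for _ in range(2)]
        for pair, label in gold.items():
            value = data.get(pair)
            if value is not None:
                counts[label][value] += 1
        cpts.append([[c / sum(row) for c in row] for row in counts])
    return prior, cpts


def infer(pair, prior, cpts, datasets):
    """Posterior probability that a gene pair is functionally related."""
    joint = list(prior)
    for data, cpt in zip(datasets, cpts):
        value = data.get(pair)
        if value is None:
            continue  # a missing value contributes no evidence
        for c in (0, 1):
            joint[c] *= cpt[c][value]
    return joint[1] / (joint[0] + joint[1])


# Hypothetical toy example: four labelled pairs, two datasets, two bins.
gold = {("a", "b"): 1, ("a", "c"): 0, ("b", "c"): 1, ("a", "d"): 0}
datasets = [
    {("a", "b"): 1, ("a", "c"): 0, ("b", "c"): 1, ("a", "d"): 0, ("c", "d"): 1},
    {("a", "b"): 1, ("a", "c"): 1, ("b", "c"): 0, ("c", "d"): 1},
]
prior, cpts = learn_naive_bayes(gold, datasets, n_bins=2)
posterior = infer(("c", "d"), prior, cpts, datasets)
print(f"P(related | data) for (c, d) = {posterior:.3f}")
```

Under these assumptions, a context-specific classifier would be obtained by restricting the gold standard to gene pairs within one biological context and repeating the learning step per context. With hundreds of datasets, a practical version would accumulate log-probabilities instead of multiplying raw probabilities to avoid underflow.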

Fig. 1.

Sample application of the Sleipnir library to integrate 186 heterogeneous genomic datasets in S. cerevisiae within 200 biological contexts. White boxes indicate externally generated data, grey boxes data generated by Sleipnir, arrows processing performed by Sleipnir, and black bubbles highlight time-consuming tasks. Times were generated on a 2 GHz Intel Xeon CPU; peak RAM usage was ∼200 MB. Sleipnir is extensively parallelizable, and running these tasks on four cores reduces processing time by an optimal 4-fold to ∼13 h each for Bayesian learning and inference.
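The scale figures quoted for this task can be cross-checked with simple arithmetic, sketched below in Python. The inputs are the numbers reported in the text (200 contexts, 186 datasets, 16.3 min and 90 MB per network, four cores); the factor of two class values per conditional probability table is an assumption based on a binary gold standard, and the variable names are illustrative.

```python
contexts = 200              # context-specific classifiers (from the text)
datasets = 186              # datasets integrated per classifier (from the text)
class_values = 2            # ASSUMPTION: binary gold standard, one distribution per class
minutes_per_network = 16.3  # inference time per context-specific network (from the text)
mb_per_network = 90         # disk space per network (from the text)
cores = 4

# One distribution per dataset, per context, per class value.
distribution_count = contexts * datasets * class_values

# Total inference time if networks are computed one after another, then on four cores.
serial_hours = contexts * minutes_per_network / 60
parallel_hours = serial_hours / cores

# Total disk space for all precomputed networks (decimal gigabytes).
disk_gb = contexts * mb_per_network / 1000

print(f"probability distributions: {distribution_count}")
print(f"inference, one core:   {serial_hours:.1f} h")
print(f"inference, four cores: {parallel_hours:.1f} h")
print(f"disk for all networks: {disk_gb:.0f} GB")
```

Under these assumptions the count of distributions comes to 74 400, in line with the reported ∼75 000, and four-core inference comes to about 13.6 h, in line with the reported ∼13 h. The learning time cannot be checked the same way because no per-context learning time is given.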

4 DISCUSSION

The Sleipnir library for computational functional genomics provides a wide range of data processing and machine learning algorithms optimized for integrating very large collections of heterogeneous biological data. These include algorithms for data integration, machine learning by Bayesian networks or SVMs, and data types for manipulating microarrays, gene identifiers, functional annotations and other common biological entities. Several tools are provided with the core library to perform common tasks, and most algorithms are multithreaded or parallelizable for distributed computing. The Sleipnir library enables computational biologists to efficiently integrate thousands of genomic datasets and to rapidly mine them for biological knowledge.

REFERENCES

1. Functional discovery via a compendium of expression profiles.

Authors: T R Hughes; M J Marton; A R Jones; C J Roberts; R Stoughton; C D Armour; H A Bennett; E Coffey; H Dai; Y D He; M J Kidd; A M King; M R Meyer; D Slade; P Y Lum; S B Stepaniants; D D Shoemaker; D Gachotte; K Chakraburtty; J Simon; M Bard; S H Friend
Journal: Cell Date: 2000-07-07 Impact factor: 41.582

2. TM4: a free, open-source system for microarray data management and analysis.

Authors: A I Saeed; V Sharov; J White; J Li; W Liang; N Bhagabati; J Braisted; M Klapa; T Currier; M Thiagarajan; A Sturn; M Snuffin; A Rezantsev; D Popov; A Ryltsov; E Kostukovich; I Borisovsky; Z Liu; A Vinsavich; V Trush; J Quackenbush
Journal: Biotechniques Date: 2003-02 Impact factor: 1.993

3. Application of high-throughput computing in bioinformatics.

Authors: Mark Swindells; Mark Rae; Martyn Pearce; Stuart Moodie; Rob Miller; Pat Leach
Journal: Philos Trans A Math Phys Eng Sci Date: 2002-06-15 Impact factor: 4.226

4. A case study of high-throughput biological data processing on parallel platforms.

Authors: D Pekurovsky; I N Shindyalov; P E Bourne
Journal: Bioinformatics Date: 2004-03-25 Impact factor: 6.937

5. Open source clustering software.

Authors: M J L de Hoon; S Imoto; J Nolan; S Miyano
Journal: Bioinformatics Date: 2004-02-10 Impact factor: 6.937

6. Putting microarrays in a context: integrated analysis of diverse biological data.

Authors: Olga G Troyanskaya
Journal: Brief Bioinform Date: 2005-03 Impact factor: 11.622

7. A scalable method for integration and functional analysis of multiple microarray datasets.

Authors: Curtis Huttenhower; Matt Hibbs; Chad Myers; Olga G Troyanskaya
Journal: Bioinformatics Date: 2006-09-27 Impact factor: 6.937

8. Context-sensitive data integration and prediction of biological networks.

Authors: Chad L Myers; Olga G Troyanskaya
Journal: Bioinformatics Date: 2007-06-28 Impact factor: 6.937

9. Gene Ontology annotations at SGD: new data sources and annotation methods.

Authors: Eurie L Hong; Rama Balakrishnan; Qing Dong; Karen R Christie; Julie Park; Gail Binkley; Maria C Costanzo; Selina S Dwight; Stacia R Engel; Dianna G Fisk; Jodi E Hirschman; Benjamin C Hitz; Cynthia J Krieger; Michael S Livstone; Stuart R Miyasato; Robert S Nash; Rose Oughtred; Marek S Skrzypek; Shuai Weng; Edith D Wong; Kathy K Zhu; Kara Dolinski; David Botstein; J Michael Cherry
Journal: Nucleic Acids Res Date: 2007-11-03 Impact factor: 16.971
