Curtis Huttenhower, Mark Schroeder, Maria D Chikina, Olga G Troyanskaya.
Abstract
MOTIVATION: Biological data generation has accelerated to the point where hundreds or thousands of whole-genome datasets of various types are available for many model organisms. This wealth of data can lead to valuable biological insights when analyzed in an integrated manner, but the computational challenge of managing such large data collections is substantial. In order to mine these data efficiently, it is necessary to develop methods that use storage, memory and processing resources carefully.
Year: 2008 PMID: 18499696 PMCID: PMC2718674 DOI: 10.1093/bioinformatics/btn237
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Sleipnir efficiency on integration and single dataset tasks
| Implementation | Peak RAM (KB) | Time (s) |
|---|---|---|
| Bayesian learning (500 genes, 15 datasets) | ||
| Sleipnir | 1376 | <1 |
| GeNIe | 6832 | 4 |
| BNT | 593 180 | 15 |
| Bayesian inference (500 genes, 15 datasets) | ||
| Sleipnir | 1216 | 1 |
| BNT | 273 992 | >600 |
| Missing value estimation (10% missing, | ||
| Sleipnir | 27 232 | 195 |
| knnimpute | 115 708 | 368 |
| Hierarchical clustering | ||
| Sleipnir | 83 188 | 156 |
| Cluster 3.0 | 176 836 | 154 |
| MeV | 198 292 | 361 |
| Sleipnir | 8780 | 114 |
| Cluster 3.0 | 28 544 | 102 |
| MeV | 198 292 | 361 |
Memory usage and runtimes for Sleipnir and a number of other common tools for Bayesian analysis and biological data manipulation (de Hoon et al., 2004; Druzdzel, 1999; Murphy, 2001; Saeed et al., 2003; Troyanskaya et al., 2001). All microarray operations were performed on the 300 conditions and 6153 genes of Hughes et al. (2000) using Euclidean distance. Bayesian operations were performed on simulated data using a binary gold standard and five randomly distributed values per dataset. Tests were run in a single thread on a 2 GHz Intel Core 2 Duo. In every case, Sleipnir demonstrates a substantial advantage in speed, memory usage or both.
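As a rough illustration of how the table's figures can be compared, the following sketch computes peak-RAM ratios relative to Sleipnir for the two Bayesian tasks. The numbers are copied from the table; the dictionary layout and the function name `ram_ratio` are assumptions made for this example and are not part of Sleipnir or any of the benchmarked tools.

```python
# Peak RAM in KB, copied from the table for the two Bayesian tasks.
# The structure and names here are illustrative only.
PEAK_RAM_KB = {
    "Bayesian learning": {"Sleipnir": 1376, "GeNIe": 6832, "BNT": 593180},
    "Bayesian inference": {"Sleipnir": 1216, "BNT": 273992},
}


def ram_ratio(task: str, tool: str, baseline: str = "Sleipnir") -> float:
    """Return how many times more peak RAM `tool` used than `baseline`."""
    rows = PEAK_RAM_KB[task]
    return rows[tool] / rows[baseline]


if __name__ == "__main__":
    for task, rows in PEAK_RAM_KB.items():
        for tool in rows:
            if tool != "Sleipnir":
                print(f"{task}: {tool} used {ram_ratio(task, tool):.1f}x Sleipnir's peak RAM")
```

For Bayesian learning, this gives roughly a 5-fold difference against GeNIe and more than a 400-fold difference against BNT.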
Fig. 1. Sample application of the Sleipnir library to integrate 186 heterogeneous genomic datasets in S. cerevisiae within 200 biological contexts. White boxes indicate externally generated data, grey boxes data generated by Sleipnir, arrows processing performed by Sleipnir, and black bubbles highlight time-consuming tasks. Times were generated on a 2 GHz Intel Xeon CPU; peak RAM usage was ∼200 MB. Sleipnir is extensively parallelizable, and running these tasks on four cores reduces processing time by an optimal 4-fold to ∼13 h each for Bayesian learning and inference.
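The scaling statement in the caption can be checked with simple arithmetic. Assuming ideal linear scaling (the "optimal 4-fold" reduction), ∼13 h on four cores implies roughly 52 h on a single core for each of Bayesian learning and inference. The helper below is a hypothetical sketch of that calculation; the function name and the 52 h figure are derived here and are not stated in the source.

```python
def ideal_runtime_hours(single_core_hours: float, cores: int) -> float:
    """Runtime under ideal linear scaling: work divides evenly across cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return single_core_hours / cores


# Caption: ~13 h per task on four cores after an optimal 4-fold reduction.
FOUR_CORE_HOURS = 13.0
CORES = 4

# Implied single-core runtime (an inference from the caption, not a reported value).
implied_single_core_hours = FOUR_CORE_HOURS * CORES

if __name__ == "__main__":
    print(f"Implied single-core runtime: ~{implied_single_core_hours:.0f} h")
    print(f"Ideal 4-core runtime: ~{ideal_runtime_hours(implied_single_core_hours, CORES):.0f} h")
```

Real workloads rarely scale perfectly, so the single-core estimate should be read as approximate.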