
CloudNMF: a MapReduce implementation of nonnegative matrix factorization for large-scale biological datasets.

Ruiqi Liao¹, Yifan Zhang¹, Jihong Guan², Shuigeng Zhou³.

Abstract

In the past decades, advances in high-throughput technologies have led to the generation of huge amounts of biological data that require analysis and interpretation. Recently, nonnegative matrix factorization (NMF) has been introduced as an efficient way to reduce the complexity of data as well as to interpret them, and has been applied to various fields of biological research. In this paper, we present CloudNMF, a distributed open-source implementation of NMF on a MapReduce framework. Experimental evaluation demonstrated that CloudNMF is scalable and can be used to deal with huge amounts of data, which may enable various kinds of high-throughput biological data analysis in the cloud. CloudNMF is freely accessible at http://admis.fudan.edu.cn/projects/CloudNMF.html.


Keywords: Bioinformatics; MapReduce; Nonnegative matrix factorization


Year: 2013 PMID: 23933456 PMCID: PMC4411332 DOI: 10.1016/j.gpb.2013.06.001

Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691

Introduction

The explosion of biological data brought about by high-throughput technologies poses a great challenge to bioinformatics research. In order to learn the hidden structures of these high-dimensional data, nonnegative matrix factorization (NMF) [1] was introduced into biological research. NMF was quickly applied to various fields of biological data analysis, such as capturing expression patterns in microarray data [2], discovery of cancer subtypes [3], clustering of gene expression data [4,5], identification of histone modification modules [6], and biological text mining [7,8]. The intrinsic nature of the NMF method makes it very suitable for an integrative analysis of multi-dimensional genomics data [9]. Devarajan presented a comprehensive review of the application of NMF to computational biology [10]. With the increasing dimensionality of biological data, it is foreseeable that the application of NMF to biological research will continue to grow. For example, sequencing technologies are generating terabytes (TBs) or even petabytes (PBs) of data for multi-dimensional analysis. However, current implementations of NMF in the biology area can only deal with matrices of thousands-by-thousands size. For example, bioNMF [11], an implementation of NMF for bioinformatics analysis, can only handle matrices of up to 4096 × 512 (according to the documentation of the bioNMF server), and thus would fail to process data with more attributes or samples. Another implementation using R [12] fails to work when the data size reaches gigabytes (GBs) on a standalone machine. In their original papers, both implementations were used to analyze a microarray dataset represented by a 5000 × 38 gene expression data matrix [13]. However, a much more scalable implementation will be needed to deal with data of a significantly greater size, such as protein–protein interaction (PPI) data or sequencing data.
To facilitate biological data analysis in the “Big Data” era [14], we present CloudNMF, an open-source implementation of NMF in the MapReduce framework. The implementation was developed on the Hadoop platform and can enable the nonnegative factorization of sparse matrices up to million-by-million size. Furthermore, CloudNMF is provided as a JAR file ready to be deployed anywhere. In particular, CloudNMF can be easily deployed on Amazon Elastic MapReduce to utilize the power of cloud computing for biological data analysis.

Methods

NMF was first introduced by Lee and Seung as a method for learning the substructure of a data matrix [1]. It is defined as the factorization of a nonnegative m × n matrix A into the product of two other nonnegative matrices W and H, of sizes m × k and k × n, respectively, where k is the target dimensionality, a number smaller than the minimum of m and n. NMF aims to minimize the Euclidean distance between A and WH, and can be used as an effective technique for dimension reduction and unsupervised clustering. In 2010, Liu et al. proposed an algorithm to perform NMF in the MapReduce framework [15] and showed that the algorithm can be used to factorize huge nonnegative matrices up to millions-by-millions size. However, this algorithm was aimed at Web applications, and no source code of the algorithm is available for public use. Our work is the first open-source implementation of NMF in the MapReduce framework, targeted at dealing with the explosion of biological data. Our work follows the method previously reported [15], which is based on the well-known iterative updating rule of W and H described by Lee and Seung [16]:

$$H \leftarrow H \mathbin{.*} \frac{W^{T}A}{W^{T}WH}, \qquad W \leftarrow W \mathbin{.*} \frac{AH^{T}}{WHH^{T}}$$

Here, .* denotes the element-wise product, the fractions denote element-wise division, and the superscript T denotes the matrix transpose. Similar to the method used in [15], in each iteration the updates of H and W are each decomposed into five MapReduce steps, and the computation of each step can be easily distributed across multiple machines to achieve speedup; see Table 1 for the details of the algorithm.

Table 1

Algorithm for CloudNMF

Input: nonnegative matrix A, dimension k, iteration number i

Output: nonnegative matrices W and H

1: initialize W and H using random nonnegative values
2: for each iteration:
3:     calculate X₁ = W^T A using two MapReduce steps
4:     calculate Y₁ = W^T W H using two MapReduce steps
5:     update H with H = H .* X₁ / Y₁ using one MapReduce step
6:     calculate X₂ = A H^T using two MapReduce steps
7:     calculate Y₂ = W H H^T using two MapReduce steps
8:     update W with W = W .* X₂ / Y₂ using one MapReduce step
9: output W and H
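As a reading aid, the updates in Table 1 can be sketched on a single machine in plain Python. This is only an illustration of the update rule, not the CloudNMF code: the function names (`matmul`, `transpose`, `frobenius_error`, `nmf`), the dense list-of-lists representation, the fixed random seed, and the small constant `eps` that guards against division by zero are all assumptions made for this sketch.

```python
import random


def matmul(X, Y):
    """Dense product of two list-of-lists matrices."""
    Y_cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in Y_cols] for row in X]


def transpose(X):
    """Transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*X)]


def frobenius_error(A, W, H):
    """Squared Euclidean distance between A and WH."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb))


def nmf(A, k, iterations, seed=0, eps=1e-9):
    """Multiplicative updates in the order given in Table 1 (hypothetical helper)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    # Random nonnegative initialisation of W (m x k) and H (k x n).
    W = [[rng.random() + eps for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + eps for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        Wt = transpose(W)
        X1 = matmul(Wt, A)                # X1 = W^T A      (k x n)
        Y1 = matmul(matmul(Wt, W), H)     # Y1 = W^T W H    (k x n)
        # H = H .* X1 / Y1, element-wise; eps is an assumption of this sketch.
        H = [[h * x / (y + eps) for h, x, y in zip(hr, xr, yr)]
             for hr, xr, yr in zip(H, X1, Y1)]
        Ht = transpose(H)
        X2 = matmul(A, Ht)                # X2 = A H^T      (m x k)
        Y2 = matmul(W, matmul(H, Ht))     # Y2 = W H H^T    (m x k)
        # W = W .* X2 / Y2, element-wise.
        W = [[w * x / (y + eps) for w, x, y in zip(wr, xr, yr)]
             for wr, xr, yr in zip(W, X2, Y2)]
    return W, H


if __name__ == "__main__":
    A = [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0], [2.0, 0.0, 4.0], [0.0, 1.0, 0.0]]
    W, H = nmf(A, k=2, iterations=200)
    print(frobenius_error(A, W, H))
```

Because every update multiplies a nonnegative entry by a ratio of nonnegative quantities, W and H stay nonnegative throughout. In the distributed setting each of the matrix products above corresponds to the MapReduce steps listed in Table 1.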

The program was implemented in Java and packaged as a JAR file which can run on local Hadoop clusters (Figure 1). We provide a command-line interface for the program; its usage is documented on our website (http://admis.fudan.edu.cn/projects/CloudNMF.html). Moreover, the Amazon Elastic MapReduce service (http://aws.amazon.com/cn/elasticmapreduce/) offers on-demand computing clusters preinstalled with Hadoop, and provides a web interface to run Hadoop JAR files using only a web browser (see http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/CLI_JobFlowUsingCustomJAR.html). Inexperienced users who find it hard to build their own Hadoop clusters can upload their data and CloudNMF into the cloud and perform their analysis remotely (Figure 2).

Figure 1

Using CloudNMF with a local Hadoop cluster

Figure 2

Using CloudNMF with Amazon Web Services

Experimental evaluation

In order to test the performance of our program, we applied it to both real data and simulated matrices. The PPI data from the STRING database [17], which include 108,133,799 protein interactions from 1134 species, were used for performance testing. The dataset can be represented by a 1,349,909 × 1,349,909 matrix, where 1,349,909 is the number of distinct proteins in the dataset. Since the interactions between proteins are both nonnegative and sparse, the dataset is quite suitable for the application of NMF. Based on the STRING dataset, three submatrices of different sizes were generated. The four datasets (the full matrix and the three submatrices) are described in Table S1 and the performance of CloudNMF on these four datasets is summarized in Table S2. We also generated three simulated matrices of different sizes but containing the same number of nonzero elements to test the impact of matrix size on the performance of the program (Table S3). The experiments were performed on an 8-machine Hadoop cluster, in which each machine has a Duo Core CPU and 4 GB memory. From Figure 3 we observed a very interesting feature of CloudNMF: the runtime increases in proportion to the number of nonzero elements (the number of PPIs) in the matrix (Figure 3A). This may be attributed to the MapReduce implementation of the algorithm: only nonzero elements are stored and distributed for computation. When the number of nonzero elements is held fixed and the size of the matrix grows, the computation time increases only logarithmically (Figure 3B). These features make the algorithm better suited to sparse nonnegative matrices than the traditional implementations.
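The remark that only nonzero elements are stored and distributed can be illustrated with a small single-process sketch of one product from Table 1, X₁ = W^T A. The layout below is an assumption made for illustration and is not taken from the CloudNMF source: A is given as a list of (row, column, value) triples, the map phase emits one record per nonzero entry, and the reduce phase sums the records that share a column index. The names `map_phase`, `reduce_phase`, and `wt_times_a` are hypothetical.

```python
from collections import defaultdict


def map_phase(nonzeros, W):
    """Emit one (column index, partial k-vector) record per nonzero of A."""
    for i, j, a_ij in nonzeros:
        # Entry a_ij contributes a_ij * (row i of W) to column j of W^T A.
        yield j, [a_ij * w for w in W[i]]


def reduce_phase(records, k):
    """Sum the partial vectors that share a key; key j gives column j of W^T A."""
    sums = defaultdict(lambda: [0.0] * k)
    for j, vec in records:
        acc = sums[j]
        for t, v in enumerate(vec):
            acc[t] += v
    return dict(sums)


def wt_times_a(nonzeros, W):
    """Return the columns of W^T A and the number of records the map phase emitted."""
    k = len(W[0])
    records = list(map_phase(nonzeros, W))
    return reduce_phase(records, k), len(records)


if __name__ == "__main__":
    # A = [[1, 0, 2], [0, 3, 0]] stored as its three nonzero entries.
    nonzeros = [(0, 0, 1.0), (0, 2, 2.0), (1, 1, 3.0)]
    W = [[1.0, 2.0], [3.0, 4.0]]
    columns, emitted = wt_times_a(nonzeros, W)
    print(columns)   # {0: [1.0, 2.0], 2: [2.0, 4.0], 1: [9.0, 12.0]}
    print(emitted)   # 3
```

In this sketch the number of emitted records equals the number of nonzero entries of A, regardless of the matrix dimensions, which is consistent with the runtime growing in proportion to the number of nonzero elements.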

Figure 3

Performance of CloudNMF. A. Performance of CloudNMF on four real datasets shows that the runtime per iteration increases linearly with the number of nonzero elements in the matrix. B. Performance of CloudNMF on simulated matrices of different sizes but with the same number of nonzero elements shows that the runtime per iteration is linear in the logarithm of matrix size. Note that the X-axis is on a logarithmic scale.

Discussion

CloudNMF is the first open-source implementation of MapReduce-based nonnegative matrix factorization, and is capable of handling significantly larger data than existing NMF implementations in bioinformatics. Besides being deployed on local Hadoop clusters, CloudNMF can also be easily used on cloud computing platforms such as Amazon Web Services via only a web browser. Moreover, experimental results show that the algorithm can effectively deal with sparse matrices such as protein–protein interaction networks. CloudNMF also has some limitations. Although the program achieved considerable performance when dealing with large matrices, the high overhead of the MapReduce paradigm may make it less efficient than existing implementations for small matrices. In addition, while bioinformatics analyses using NMF may involve many pre-processing or post-processing steps, we only implemented the basic NMF algorithm. However, the code of CloudNMF is freely accessible at our website; users can integrate the code into their own pipelines to perform more specific analyses. To sum up, CloudNMF is the first open-source implementation of a MapReduce-based NMF algorithm and can be easily used to process large amounts of data. With the explosion of biological data and the wide application of NMF to biological research, we expect that CloudNMF will play a more important role in bioinformatics in the upcoming “Big Data” era.

Authors’ contributions

Ruiqi Liao drafted the manuscript and developed the software. Yifan Zhang participated in the software development. Shuigeng Zhou proposed the idea of the software and revised the manuscript. Jihong Guan revised the manuscript. All authors have read and approved the final manuscript.

Competing interests

The authors have no competing interests to declare.

References (15 in total)

1. Learning the parts of objects by non-negative matrix factorization.

Authors: D D Lee; H S Seung
Journal: Nature Date: 1999-10-21 Impact factor: 49.962

2. Metagenes and molecular pattern discovery using matrix factorization.

Authors: Jean-Philippe Brunet; Pablo Tamayo; Todd R Golub; Jill P Mesirov
Journal: Proc Natl Acad Sci U S A Date: 2004-03-11 Impact factor: 11.205

3. LinkNMF: identification of histone modification modules in the human genome using nonnegative matrix factorization.

Authors: Inkyung Jung; Dongsup Kim
Journal: Gene Date: 2012-12-22 Impact factor: 3.688

4. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.

Authors: T R Golub; D K Slonim; P Tamayo; C Huard; M Gaasenbeek; J P Mesirov; H Coller; M L Loh; J R Downing; M A Caligiuri; C D Bloomfield; E S Lander
Journal: Science Date: 1999-10-15 Impact factor: 47.728

5. A flexible R package for nonnegative matrix factorization.

Authors: Renaud Gaujoux; Cathal Seoighe
Journal: BMC Bioinformatics Date: 2010-07-02 Impact factor: 3.169

6. Discovering gene functional relationships using FAUN (Feature Annotation Using Nonnegative matrix factorization).

Authors: Elina Tjioe; Michael W Berry; Ramin Homayouni
Journal: BMC Bioinformatics Date: 2010-10-07 Impact factor: 3.169

7. Discovering semantic features in the literature: a foundation for building functional associations.

Authors: Monica Chagoyen; Pedro Carmona-Saez; Hagit Shatkay; Jose M Carazo; Alberto Pascual-Montano
Journal: BMC Bioinformatics Date: 2006-01-26 Impact factor: 3.169

8. Biclustering of gene expression data by Non-smooth Non-negative Matrix Factorization.

Authors: Pedro Carmona-Saez; Roberto D Pascual-Marqui; F Tirado; Jose M Carazo; Alberto Pascual-Montano
Journal: BMC Bioinformatics Date: 2006-02-17 Impact factor: 3.169

9. Nonnegative matrix factorization: an analytical and interpretive tool in computational biology. (Review)

Authors: Karthik Devarajan
Journal: PLoS Comput Biol Date: 2008-07-25 Impact factor: 4.475

10. bioNMF: a web-based tool for nonnegative matrix factorization in biology.

Authors: E Mejía-Roa; P Carmona-Saez; R Nogales; C Vicente; M Vázquez; X Y Yang; C García; F Tirado; A Pascual-Montano
Journal: Nucleic Acids Res Date: 2008-05-30 Impact factor: 16.971
