Literature DB >> 20679333

Multi-netclust: an efficient tool for finding connected clusters in multi-parametric networks.

Arnold Kuzniar1, Somdutta Dhir, Harm Nijveen, Sándor Pongor, Jack A M Leunissen.   

Abstract

UNLABELLED: Multi-netclust is a simple tool that allows users to extract connected clusters of data represented by different networks given in the form of matrices. The tool uses user-defined threshold values to combine the matrices, and uses a straightforward, memory-efficient graph algorithm to find clusters that are connected in all or in either of the networks. The tool is written in C/C++ and is available either as a form-based or as a command-line-based program running on Linux platforms. The algorithm is fast, processing a network of > 10(6) nodes and 10(8) edges takes only a few minutes on an ordinary computer. AVAILABILITY: http://www.bioinformatics.nl/netclust/.

Entities:  

Mesh:

Year:  2010        PMID: 20679333      PMCID: PMC2944197          DOI: 10.1093/bioinformatics/btq435

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Finding tightly connected clusters in large datasets is a frequent task in many areas of bioinformatics such as the analysis of protein similarity networks, microarray or protein–protein interaction data. Classical clustering algorithms have difficulties in handling large datasets used in bioinformatics owing to high demands on computer resources. Fast heuristic algorithms have been developed for specific tasks, e.g. BLASTClust (ftp://ftp.ncbi.nih.gov/blast/documents/blastclust.html) from the NCBI–BLAST package (Altschul et al., 1990), Tribe-MCL (Enright et al., 2002) or the CD-HIT (Li and Godzik, 2006) can delineate protein or gene families in a large network of sequence similarities (e.g. BLAST E-values). However, there are no apparent tools that could efficiently handle large multiple networks, such as those necessary to group proteins using more than one similarity criterion (e.g. based on sequence, structure or function) (Fig. 1A).
Fig. 1.

The principle of Multi-netclust is illustrated on a two-parameter network. Thick and thin edges correspond to distinct similarity data (A). Dotted lines denote edges that are below the respective threshold, and hence can be omitted from the networks. Two different aggregation rules are implemented: the weighted arithmetic averaging (‘sum rule’) gives clusters that are connected within either of the two networks (B); the weighted geometric averaging (‘product rule’) gives clusters that are connected within both networks (C). M denotes the value assigned to the edges, w is the weighting factor (‘alpha’) of the two matrices (hence n = 2) and Mmix refers to the aggregated matrix.

The principle of Multi-netclust is illustrated on a two-parameter network. Thick and thin edges correspond to distinct similarity data (A). Dotted lines denote edges that are below the respective threshold, and hence can be omitted from the networks. Two different aggregation rules are implemented: the weighted arithmetic averaging (‘sum rule’) gives clusters that are connected within either of the two networks (B); the weighted geometric averaging (‘product rule’) gives clusters that are connected within both networks (C). M denotes the value assigned to the edges, w is the weighting factor (‘alpha’) of the two matrices (hence n = 2) and Mmix refers to the aggregated matrix. We developed an efficient, semi-supervised tool that takes the users’ empirical knowledge of cutoff values into account (below which interactions or similarities can be neglected) to combine multiple data networks using an averaging or kernel fusion method (Kittler et al., 1998). The resulting combined network can then be queried for connected components (clusters) using an efficient implementation of the union-find algorithm (Tarjan, 1975). The clusters correspond to groups of nodes that are connected either by any or by all of the constituent networks, depending on the aggregation rule used (Fig. 1B and C, respectively). To adapt this method to large heterogeneous datasets, we combined the thresholding, aggregation as well as the connected component search into a single, memory- and time-efficient tool, Multi-netclust. Multi-netclust is not a new clustering method but an optimized implementation of existing graph algorithms suitable for handling large networks of > 106 nodes and 108 edges.

2 MULTI-NETCLUST INPUT AND OUTPUT

Multi-netclust uses external memory rather than the in-core approach (Chiang, 1995) for matrix manipulations so that the size of the datasets is not a limiting factor. The input to Multi-netclust are networks given in sparse matrix format, as well as the aggregation rule, ‘alpha’ weighting factor, and similarity (or distance) cutoff value(s) associated with a processing step(s). Generally, the product rule results in more reliable connections confirmed by multiple data sources, whereas the sum rule expands the network with new (not necessarily reliable) connections. Setting the ‘alpha’ value for each matrix provides means, e.g. to weight the reliability of different data sources or to decide which dataset is more likely to contribute with new (additional) information. A permissive cutoff value usually results in a few large clusters, while a strict cutoff value tend to produce many small (singleton) clusters. The data can be entered either via a CGI interface, or from the command line. The output of Multi-netclust is a list of the connected clusters given in a structured text format. Multi-netclust is written in the C/C++ language, and the CGI interface is a Perl script. The source code, sample data, explanations and benchmark results are available on the website http://www.bioinformatics.nl/netclust/. There is also a web-based application suitable to run smaller test-sets.

3 PERFORMANCE

The run-time of Multi-netclust subsumes: (i) the time needed for reading-in the data, thresholding and aggregation (>99.9%); and (ii) finding the connected components and writing the results (<0.1%). A benchmark dataset of 1357 proteins, taken from the Protein Classification Benchmark database (Sonego et al., 2007) was used to combine sequence similarities calculated by the BLAST and Smith–Waterman (Smith and Waterman, 1981) algorithms, and DALI 3D structure similarities (Holm and Sander, 1995). The analysis took 4 s on a 2 GHz processor, the influence of parameter settings on the purity of connected clusters is apparent from the results (Table 1). An interesting example is the immunoglobulin superfamily (SCOP b.1.1), which has 125 members in the benchmark dataset. Using DALI alone, the b.1.1 proteins clustered together with the ‘E set domains’ (SCOP b.1.18), grouping proteins related to immunoglobulin and/or fibronectin Type III superfamilies. Using BLAST alone, the b.1.1. proteins clustered together with a number of other superfamilies. Surprisingly, the combination of both DALI and BLAST datasets made 94% of the group b.1.1 cluster correctly.
Table 1.

Protein classification results obtained for the individual and combined similarity networks

DatasetCorrectIncorrectSingletons
SW × DALI1 (251)9100447
BLAST (0.1) × DALI2 (0.4)8880469
BLAST (0.4) + DALI2 (0.4)80346985
SW (251)31601041
DALI1 (251)56126635
DALI2 (0.4)79047592
BLAST (0.4)3601321
BLAST (0.1)661101190

Numbers in parentheses denote the threshold used. Symbols ‘×’ and ‘+’ refer to the product and sum aggregation rules, respectively. The results were obtained for ‘alpha’ weighting factor 0.5.

DALI1, matrix of raw scores; DALI2, matrix of diagonally normalized scores; correct, proteins connected only to members of the same SCOP superfamily; incorrect, proteins connected to members of other SCOP superfamilies.

Protein classification results obtained for the individual and combined similarity networks Numbers in parentheses denote the threshold used. Symbols ‘×’ and ‘+’ refer to the product and sum aggregation rules, respectively. The results were obtained for ‘alpha’ weighting factor 0.5. DALI1, matrix of raw scores; DALI2, matrix of diagonally normalized scores; correct, proteins connected only to members of the same SCOP superfamily; incorrect, proteins connected to members of other SCOP superfamilies. The external memory-based, connected component search algorithm is fast and memory-efficient compared to single-linkage-based clustering methods and in-memory graph algorithms used for similar purposes within the bioinformatics community (see Supplementary Material). The strength of Multi-netclust becomes more apparent when we deal with large datasets that cannot be handled with other algorithms. For example, a network of 2 713 908 nodes and 781 328 458 edges took <5 min on an ordinary computer. Of the other algorithms tested (see case studies on the website), only BLASTClust was able to handle a dataset of similar size; however, its use is limited to BLAST similarity networks (and at greater expense of CPU time and memory required), whereas Multi-netclust is generally applicable. To conclude, Multi-netclust is an efficient tool that can aid exploratory analyzes of large biological networks using an ordinary computer. Specifically, the potential applications include any task where network data of heterogeneous sources, such as sequence and structure similarities, gene expression or protein–protein interaction data, are to be combined together, resulting in new and/or improved biologically relevant predictions.
  6 in total

1.  An efficient algorithm for large-scale detection of protein families.

Authors:  A J Enright; S Van Dongen; C A Ouzounis
Journal:  Nucleic Acids Res       Date:  2002-04-01       Impact factor: 16.971

2.  Basic local alignment search tool.

Authors:  S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal:  J Mol Biol       Date:  1990-10-05       Impact factor: 5.469

3.  Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.

Authors:  Weizhong Li; Adam Godzik
Journal:  Bioinformatics       Date:  2006-05-26       Impact factor: 6.937

4.  Dali: a network tool for protein structure comparison.

Authors:  L Holm; C Sander
Journal:  Trends Biochem Sci       Date:  1995-11       Impact factor: 13.807

5.  Identification of common molecular subsequences.

Authors:  T F Smith; M S Waterman
Journal:  J Mol Biol       Date:  1981-03-25       Impact factor: 5.469

6.  A Protein Classification Benchmark collection for machine learning.

Authors:  Paolo Sonego; Mircea Pacurar; Somdutta Dhir; Attila Kertész-Farkas; András Kocsor; Zoltán Gáspári; Jack A M Leunissen; Sándor Pongor
Journal:  Nucleic Acids Res       Date:  2006-11-16       Impact factor: 16.971

  6 in total
  1 in total

1.  Promoter propagation in prokaryotes.

Authors:  Mariana Matus-Garcia; Harm Nijveen; Mark W J van Passel
Journal:  Nucleic Acids Res       Date:  2012-08-28       Impact factor: 16.971

  1 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.