jClust: a clustering and visualization toolbox.

Georgios A Pavlopoulos, Charalampos N Moschopoulos, Sean D Hooper, Reinhard Schneider, Sophia Kossida.

Abstract

jClust is a user-friendly application that provides access to a set of widely used clustering and clique-finding algorithms. The toolbox allows a range of filtering procedures to be applied and is combined with an advanced implementation of the Medusa interactive visualization module. The implemented algorithms are k-Means, Affinity propagation, Bron–Kerbosch, MULIC, the Restricted neighborhood search cluster algorithm, Markov clustering and Spectral clustering, while the supported filtering procedures are haircut, outside-inside, best neighbors and density control operations. The combination of a simple input file format and a set of clustering and filtering algorithms linked to the visualization tool provides a powerful means of data analysis and information extraction. AVAILABILITY: http://jclust.embl.de/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2009 PMID: 19454618 PMCID: PMC2712340 DOI: 10.1093/bioinformatics/btp330

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

A large variety of clustering algorithms exists, applicable to a wide range of problems. Most of them are available as source code, as part of a software package such as R or Matlab, or online. Besides the commercially available ones, there are a few web-based or standalone tools, such as NeAT (Brohee et al., 2008), Cluster 3.0 (de Hoon et al., 2004) and Cluto (Zhao and Karypis, 2005), which provide access to some of the clustering algorithms. Nevertheless, it typically takes some effort to integrate the source code into one's own projects, to become familiar with a specific software package or to prepare the data in the required input format. A major weakness of most currently available tools is that they lack interactivity and an easy-to-use visualization module for exploring and navigating through the data. Here, we present the toolbox jClust, which aims to bridge the gap between analysis and visualization by integrating clustering algorithms with tools that present their results visually. The tool provides access to a widely used set of clustering algorithms and simultaneously allows interactive visualization of the data. It reads a very simple input file format and produces a human-readable output file. jClust comes with a user-friendly GUI that makes the functionality and the parameterization of the algorithms easy, and we believe that it gives users the opportunity to analyze and visualize biological data in a fast, easy and efficient way.

2 CLUSTERING

jClust supports a variety of supervised and unsupervised clustering analysis methods. These are k-Means (MacQueen, 1967), Spectral clustering (Paccanaro et al., 2006), Affinity propagation (Frey and Dueck, 2007), the Restricted neighborhood search cluster algorithm (RNSC; King et al., 2004), Markov clustering (MCL; Enright et al., 2002), MULIC (Andreopoulos et al., 2007a, b) and Bron–Kerbosch (Bron and Kerbosch, 1973). For k-Means and Spectral clustering, the number of clusters needs to be defined by the user. The k-Means algorithm requires a full, all-against-all distance matrix to run, whereas this is not a requirement for the other implemented algorithms. All of the algorithms besides k-Means are suitable for sparse graphs, and all of the methods are able to analyze large-scale data as long as the local computer memory permits it. The Bron–Kerbosch algorithm is a very well-known algorithm for finding cliques in a graph, meaning that it isolates strongly connected sub-areas in which every node is connected to every other node of the same clique (all-against-all connections). All of the other clustering algorithms assign each node to exactly one cluster, whereas the Bron–Kerbosch algorithm allows a node to belong to more than one cluster.
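
The clique search described above can be illustrated with a minimal Python sketch of the basic Bron–Kerbosch recursion (without pivoting). It is not the jClust implementation: the names build_graph and bron_kerbosch and the toy edge list are assumptions made for illustration, and the graph is taken to be undirected and unweighted.

```python
from typing import Dict, FrozenSet, Iterable, Iterator, Optional, Set, Tuple

Graph = Dict[str, Set[str]]


def build_graph(edges: Iterable[Tuple[str, str]]) -> Graph:
    """Build an undirected adjacency map; self-loops are ignored."""
    graph: Graph = {}
    for a, b in edges:
        if a == b:
            continue
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def bron_kerbosch(
    graph: Graph,
    r: Optional[Set[str]] = None,
    p: Optional[Set[str]] = None,
    x: Optional[Set[str]] = None,
) -> Iterator[FrozenSet[str]]:
    """Yield every maximal clique of the graph.

    r: nodes in the clique being grown
    p: candidates that could still extend r
    x: nodes already explored, used to avoid reporting non-maximal cliques
    """
    r = set() if r is None else r
    p = set(graph) if p is None else p
    x = set() if x is None else x
    if not p and not x:
        # Nothing can extend r, so r is a maximal clique.
        yield frozenset(r)
        return
    for v in list(p):
        yield from bron_kerbosch(graph, r | {v}, p & graph[v], x & graph[v])
        p.remove(v)
        x.add(v)


# Hypothetical toy network: a triangle A-B-C plus an extra edge C-D.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
cliques = sorted(sorted(c) for c in bron_kerbosch(build_graph(edges)))
print(cliques)  # [['A', 'B', 'C'], ['C', 'D']]
```

Node C appears in both maximal cliques, which is the overlap behavior that distinguishes clique finding from the partitioning algorithms.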

3 FILTERING

jClust gives the user the opportunity to filter noise from the predicted clusters that have been calculated by one of the previous methods. In this second step, clusters can be enriched with nodes that are important or shrunk by removing nodes that should not belong to them. Here, we implemented the following procedures: (i) density, (ii) haircut, (iii) best neighbor and (iv) cutting edge operation. The density method applies a threshold that filters out clusters below a certain allowed density. The haircut operation detects vertices with a low degree of connectivity and excludes them from the potential cluster. In contrast to the haircut operation, the best neighbor method detects candidate vertices that are considered good 'neighbors' and enriches the clusters with them. The cutting edge operation filters out cases of densely connected sub-areas, which are only sparsely connected to the rest of the network. A detailed explanation of how these methodologies are mathematically defined and how they can be parameterized is given online in the Supplementary Material.
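
As a rough illustration of the density and haircut filters, the Python sketch below uses common textbook definitions, which are assumptions here: density is the fraction of possible intra-cluster edges that are present, and haircut repeatedly removes nodes with fewer than a minimum number of neighbors inside the cluster. The exact definitions used by jClust are given in the Supplementary Material and may differ; all function names are hypothetical.

```python
from typing import Dict, List, Set

Graph = Dict[str, Set[str]]


def density(cluster: Set[str], graph: Graph) -> float:
    """Assumed definition: internal edges / possible edges, 2E / (n(n-1))."""
    n = len(cluster)
    if n < 2:
        return 0.0
    # Each internal edge is counted once from each endpoint, hence the / 2.
    internal = sum(len(graph.get(v, set()) & cluster) for v in cluster) / 2
    return 2 * internal / (n * (n - 1))


def haircut(cluster: Set[str], graph: Graph, min_degree: int) -> Set[str]:
    """Assumed definition: iteratively drop nodes with too few
    neighbors inside the cluster until no such node remains."""
    kept = set(cluster)
    while True:
        weak = {v for v in kept if len(graph.get(v, set()) & kept) < min_degree}
        if not weak:
            return kept
        kept -= weak


def density_filter(clusters: List[Set[str]], graph: Graph, threshold: float) -> List[Set[str]]:
    """Keep only clusters whose density reaches the threshold."""
    return [c for c in clusters if density(c, graph) >= threshold]


# Hypothetical toy data: triangle A-B-C with a loosely attached node D.
graph = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
cluster = {"A", "B", "C", "D"}
trimmed = haircut(cluster, graph, min_degree=2)
print(sorted(trimmed), density(trimmed, graph))
```

In this toy case the untrimmed cluster has density 4/6 and would be rejected at a threshold of 0.7, while the trimmed cluster is a full triangle and passes.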

4 VISUALIZATION

We updated the Medusa (Hooper and Bork, 2005) visualization tool to represent the produced clusters graphically. Medusa can be used as an external application or can be called from within the jClust application. Medusa is now more interactive and supports many layout algorithms, which make the tool more informative and the extraction of biological knowledge easier. In contrast to the previous version, users can isolate the connections of specific nodes and hyperlink them to external data sources. A predefined clustering layout algorithm distributes the nodes so that distinct clusters are easy to see. In this layout, N centers, where N is the number of clusters produced, are first placed on a grid, and the nodes that belong to the same cluster are then placed in a circle around their center. This way, users can easily identify distinct groups of nodes, see patterns and visually evaluate the correctness of their analysis. Through the Medusa application, users can save the final results in other formats that are readable by external visualization tools.
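
The cluster layout described above (grid of centers, members on a circle around each center) might be sketched in Python as follows. This is an illustrative assumption rather than Medusa's code: the near-square grid, the spacing and radius parameters and the function name cluster_layout are all hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple


def cluster_layout(
    clusters: List[Set[str]],
    spacing: float = 10.0,
    radius: float = 3.0,
) -> Dict[str, Tuple[float, float]]:
    """Return an (x, y) position for every node.

    One center per cluster is placed on a near-square grid (an assumed
    grid shape), and the members are spread evenly on a circle around it.
    """
    cols = max(1, math.ceil(math.sqrt(len(clusters))))
    positions: Dict[str, Tuple[float, float]] = {}
    for i, members in enumerate(clusters):
        cx = (i % cols) * spacing   # grid column of this cluster's center
        cy = (i // cols) * spacing  # grid row of this cluster's center
        ordered = sorted(members)   # fixed order gives a reproducible layout
        for k, node in enumerate(ordered):
            angle = 2 * math.pi * k / len(ordered)
            positions[node] = (
                cx + radius * math.cos(angle),
                cy + radius * math.sin(angle),
            )
    return positions


# Hypothetical clusters; node names are placeholders.
layout = cluster_layout([{"A", "B"}, {"C"}, {"D", "E", "F"}])
for node in sorted(layout):
    print(node, layout[node])
```

Keeping the spacing larger than twice the radius prevents the circles of neighboring clusters from overlapping.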

5 FUNCTIONALITY

The input file is very simple. It only requires a list of weighted connections, where the weight determines the importance of the connection. These files could contain, for example, protein–protein interaction data resulting from experiments, data from other sources such as the protein–chemical interactions in the Stitch database (Kuhn et al., 2008), or experimentally derived sets such as yeast protein–protein datasets (Gavin et al., 2006). jClust provides a Java interface, which allows any of the available algorithms to be parameterized and shows the final and intermediate results in the text areas of the GUI; these results are simultaneously saved as text files. The files also record the distinct clusters, the nodes that belong to them and the connections between the member nodes of each cluster.
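
A list of weighted connections can be loaded into a graph with a few lines of Python. The exact jClust file layout is not specified in this text, so the sketch below assumes one connection per line in the form "nodeA nodeB weight", separated by whitespace, with blank lines and lines starting with "#" skipped; the function name and the sample identifiers are hypothetical.

```python
from typing import Dict, Iterable

WeightedGraph = Dict[str, Dict[str, float]]


def parse_connections(lines: Iterable[str]) -> WeightedGraph:
    """Parse 'nodeA nodeB weight' lines into a symmetric weighted graph.

    The three-column whitespace-separated layout is an assumption made
    for illustration, not the documented jClust format.
    """
    graph: WeightedGraph = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 fields, got {len(fields)}")
        a, b, weight = fields[0], fields[1], float(fields[2])
        # Store the connection in both directions (undirected network).
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph


# Hypothetical sample input.
sample = ["# protein interactions", "P1 P2 0.9", "", "P2 P3 0.4"]
network = parse_connections(sample)
print(network["P2"])
```

To read from disk, the same function can be given an open file object, since iterating over a text file yields its lines.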

6 CONCLUDING REMARKS

We believe that the jClust toolbox provides a simple yet powerful tool for researchers in the life sciences, as it integrates a strong collection of recently implemented clustering algorithms with an easy-to-use visualization tool. jClust can be used to address various questions, such as classifying similar literature abstracts, identifying protein families according to their sequence or domain similarity, or predicting protein complexes from protein–protein interaction data. The usefulness of the tool has already been shown in a recently published biological case study (Moschopoulos et al., 2008). There, we show how the combination of clustering (in that case RNSC and MCL) and filtering algorithms can be applied to protein–protein interaction data to predict protein complexes (see Figure 1). The newer version of the Medusa visualization application provides enriched functionality and interactivity, which makes exploration of and navigation through the data easier. Further information about the algorithms, the filters and their parameters, together with some typical application examples and real biological datasets, is offered online in the Supplementary Material section.

Fig. 1.

Protein complexes predicted by applying the Spectral clustering algorithm to a yeast protein–protein dataset (Gavin et al., 2006) and filtering the results with the parameters density=0.7 and haircut=3. The budding yeast Arp2/3 complex, shown on the right of the figure, was predicted successfully, in agreement with the literature (Winter et al., 1999).

REFERENCES

1. An efficient algorithm for large-scale detection of protein families.

Authors: A J Enright; S Van Dongen; C A Ouzounis
Journal: Nucleic Acids Res Date: 2002-04-01 Impact factor: 16.971

2. Protein complex prediction via cost-based clustering.

Authors: A D King; N Przulj; I Jurisica
Journal: Bioinformatics Date: 2004-06-04 Impact factor: 6.937

3. Medusa: a simple tool for interaction graph analysis.

Authors: Sean D Hooper; Peer Bork
Journal: Bioinformatics Date: 2005-09-27 Impact factor: 6.937

4. Data clustering in life sciences.

Authors: Ying Zhao; George Karypis
Journal: Mol Biotechnol Date: 2005-09 Impact factor: 2.695

5. Proteome survey reveals modularity of the yeast cell machinery.

Authors: Anne-Claude Gavin; Patrick Aloy; Paola Grandi; Roland Krause; Markus Boesche; Martina Marzioch; Christina Rau; Lars Juhl Jensen; Sonja Bastuck; Birgit Dümpelfeld; Angela Edelmann; Marie-Anne Heurtier; Verena Hoffman; Christian Hoefert; Karin Klein; Manuela Hudak; Anne-Marie Michon; Malgorzata Schelder; Markus Schirle; Marita Remor; Tatjana Rudi; Sean Hooper; Andreas Bauer; Tewis Bouwmeester; Georg Casari; Gerard Drewes; Gitte Neubauer; Jens M Rick; Bernhard Kuster; Peer Bork; Robert B Russell; Giulio Superti-Furga
Journal: Nature Date: 2006-01-22 Impact factor: 49.962

6. Clustering by passing messages between data points.

Authors: Brendan J Frey; Delbert Dueck
Journal: Science Date: 2007-01-11 Impact factor: 47.728

7. Clustering by common friends finds locally significant proteins mediating modules.

Authors: Bill Andreopoulos; Aijun An; Xiaogang Wang; Michalis Faloutsos; Michael Schroeder
Journal: Bioinformatics Date: 2007-02-21 Impact factor: 6.937

8. Spectral clustering of protein sequences.

Authors: Alberto Paccanaro; James A Casbon; Mansoor A S Saqi
Journal: Nucleic Acids Res Date: 2006-03-17 Impact factor: 16.971

9. STITCH: interaction networks of chemicals and proteins.

Authors: Michael Kuhn; Christian von Mering; Monica Campillos; Lars Juhl Jensen; Peer Bork
Journal: Nucleic Acids Res Date: 2007-12-15 Impact factor: 16.971

10. NeAT: a toolbox for the analysis of biological networks, clusters, classes and pathways.

Authors: Sylvain Brohée; Karoline Faust; Gipsi Lima-Mendez; Olivier Sand; Rekin's Janky; Gilles Vanderstocken; Yves Deville; Jacques van Helden
Journal: Nucleic Acids Res Date: 2008-06-04 Impact factor: 16.971
