
WordCloud: a Cytoscape plugin to create a visual semantic summary of networks.

Layla Oesper, Daniele Merico, Ruth Isserlin, Gary D Bader.

Abstract

BACKGROUND: When biological networks are studied, it is common to look for clusters, i.e. sets of nodes that are highly inter-connected. To understand the biological meaning of a cluster, the user usually has to sift through many textual annotations that are associated with biological entities.
FINDINGS: The WordCloud Cytoscape plugin generates a visual summary of these annotations by displaying them as a tag cloud, where more frequent words are displayed using a larger font size. Word co-occurrence in a phrase can be visualized by arranging words in clusters or as a network.
CONCLUSIONS: WordCloud provides a concise visual summary of annotations which is helpful for network analysis and interpretation. WordCloud is freely available at http://baderlab.org/Software/WordCloudPlugin.


Year: 2011 PMID: 21473782 PMCID: PMC3083346 DOI: 10.1186/1751-0473-6-7

Source DB: PubMed Journal: Source Code Biol Med ISSN: 1751-0473

Findings

Introduction

Networks are widely used to represent relationships between biological entities, such as proteins and genes. Biological networks are typically explored using tools such as Cytoscape [1]. One common analysis consists of identifying sub-networks characterized by a specific feature, such as the presence of dense interconnections compared to the rest of the network [2]. For example, comprehensive maps of protein-protein physical interactions have been mined for dense regions, which represent protein complexes, using clustering algorithms [3]. Once sub-networks have been identified, however, it is often difficult to interpret their biological meaning. Bio-entities typically have rich textual information associated with them, such as Gene Ontology (GO) annotations [4]. A popular method for interpreting sub-networks using this information is enrichment analysis, where node and edge attributes are mined for statistically enriched text terms. For example, a sub-network can be searched for enriched biological pathways associated with the list of nodes. While highly useful, enrichment analysis takes time to perform and produces a simple table of enriched attributes. When deciding which sub-networks are interesting, it is useful to have quick visual feedback displaying frequent node annotation. In previous work, we manually created 'word clouds' to help us with this task [5]. The purpose of the WordCloud plugin is to automatically generate concise visual summaries of such textual attributes for fast access during network exploration (Figure 1).

Figure 1

Tag cloud for a protein interaction cluster. The network consists of physical interactions between S. cerevisiae proteins involved in DNA replication (A). A group of highly inter-connected proteins was selected (blue circle) and their full names were mined using WordCloud. The results are shown for the three layouts: network (B), simple (C) and clustered (D). "Origin recognition complex component" and "Minichromosome maintenance complex component" are the dominating themes. The corresponding words are ranked on top in the simple cloud layout, but only the clustered and network layouts reconstruct the correct connections between them, based on word co-occurrence patterns. Since clustering is non-overlapping, the words "complex" and "component" are forced to appear in only one cluster (with "minichromosome maintenance"), whereas the network layout also displays their association with "origin recognition".

The WordCloud plugin implements a visual information retrieval system known as a tag cloud. Tag cloud systems are used in a variety of domains, from social bookmarking services [6] to summarization of PubMed database searches [7]. The WordCloud implementation extends the basic tag cloud concept of a simple collection of words by also displaying information about word co-occurrence [8,9]. WordCloud can also be used in combination with enrichment analysis to summarize any type of gene list. Gene-set enrichment analysis is a popular approach to functionally characterize gene lists [10], including gene clusters from protein networks. Known gene-sets, typically derived from standardized annotation systems such as the Gene Ontology, are statistically tested for overrepresentation in the query gene list. However, enrichment analysis can produce long lists of enriched gene-sets, which are often redundant or interrelated, thus hindering the interpretation of the results. To overcome this problem, several visualization methods have been developed to arrange gene-sets as similarity networks, where clusters correspond to functionally related gene-sets [11-13]. WordCloud can be effectively used to summarize these gene-set clusters (Figure 2).

Figure 2

Application of WordCloud to gene-set enrichment analysis results. The transcriptional response of breast cancer cells to estrogen treatment was analyzed for gene-set enrichment, as described in [11]. Gene-sets were then arranged as a network using the Enrichment Map visualization technique [11]; edges represent gene-set overlap and clusters correspond to functional groups. A sub-network (A) was selected and analyzed using the WordCloud network layout (B). The most frequent words in gene-set names are "Mitotic Cell Cycle", "DNA Replication", "Ubiquitin Ligase Activity/Regulation", "Chromosome", "Microtubule"; this suggests that the sub-network consists of gene-sets involved in the control of cell proliferation. Specific parts of the sub-network (purple circles) relate to specific functional groups, as suggested by clustered word clouds (C,D).

Methods and Implementation

WordCloud is a freely available, open source Cytoscape plugin written in Java and compatible with Cytoscape versions 2.6, 2.7 and 2.8. Given a user-defined node selection (i.e. a sub-network), a word cloud can be generated using one or more user-selected node attributes that are of type string or list of string. Input text from all selected attributes is collected and broken down into words using separation characters, such as punctuation and space delimiters. Flagged words, such as commonly occurring English words and numbers, can be removed. In addition, words that share the same stem (e.g. cell and cells) can be mapped to that stem using the Porter Stemming Algorithm [14]. Font size for all words is then calculated proportionally to word frequency in the input text.

The user can optionally scale font size using 'network-weighting', which considers word frequencies of all text in the entire network, rather than just the node selection, to penalize words that appear frequently outside the node selection. In this case, the font size of any word $w$ in a tag cloud is directly proportional to:

$$\frac{sel_w / sel_{tot}}{(net_w / net_{tot})^k}$$

where $sel_w$ is the number of selected nodes that contain the word $w$, $sel_{tot}$ is the total number of selected nodes, $net_w$ is the number of nodes in the entire network that contain the word $w$, $net_{tot}$ is the total number of nodes in the network, and $k$ is the network normalization coefficient, which can be tuned by the user through an interactive slider bar.

The WordCloud plugin supports several layout options for the tag cloud. The most basic layout consists of the sequence of words arranged in order of descending frequency. The clustered and network layouts offer semantically richer summaries by considering co-occurrence patterns between words. Clusters are built by step-wise aggregation of frequently co-occurring word pairs. Specifically, the WordCloud plugin uses a greedy clustering algorithm similar to hierarchical clustering.
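The text-processing and network-weighting steps described above can be illustrated with a minimal Python sketch. It is not the plugin's Java implementation: the stop-word list, the function names and the toy annotations are hypothetical, stemming is omitted, and the weighting is read as the ratio (sel_w / sel_tot) / (net_w / net_tot)^k, using the quantities defined in the text. The sketch also assumes that the selected nodes are a subset of the network, so that net_w is never zero.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the plugin's actual flagged words are not given in the text.
STOP_WORDS = {"the", "of", "and", "a", "in", "to"}


def tokenize(text):
    """Split on non-alphanumeric characters, lowercase, drop stop words and numbers."""
    words = re.split(r"[^A-Za-z0-9]+", text.lower())
    return [w for w in words if w and w not in STOP_WORDS and not w.isdigit()]


def node_counts(annotations):
    """For each word, count the number of nodes (annotations) that contain it."""
    counts = Counter()
    for text in annotations:
        counts.update(set(tokenize(text)))
    return counts


def weighted_scores(selected, network, k):
    """Score each selected word as (sel_w / sel_tot) / (net_w / net_tot) ** k.

    Assumes every selected annotation is also present in `network`.
    """
    sel_w, net_w = node_counts(selected), node_counts(network)
    sel_tot, net_tot = len(selected), len(network)
    return {w: (c / sel_tot) / (net_w[w] / net_tot) ** k for w, c in sel_w.items()}


# Toy annotations (hypothetical): four nodes, of which the first two are selected.
network = [
    "origin recognition complex subunit 1",
    "origin recognition complex subunit 2",
    "DNA polymerase alpha subunit",
    "DNA primase subunit",
]
selected = network[:2]
unweighted = weighted_scores(selected, network, k=0.0)
weighted = weighted_scores(selected, network, k=1.0)
```

With k = 0 the network term equals 1, so every word in this toy selection scores 1.0. With k = 1, "subunit", which occurs in every node of the network, stays at 1.0, while "origin", which occurs only in the selection, rises to 2.0. Font sizes would then be assigned in proportion to these scores.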
Every ordered pair of words $\{w_1, w_2\}$ that appear next to each other in at least one of the selected nodes is assigned a similarity score, defined by the ratio of the observed joint probability of these words appearing next to each other in the specified order to the expected independent probability of these words appearing next to each other:

$$\frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$$

Each word starts in its own cluster. Next, the most similar word pair is merged to form a larger cluster, maintaining word order, and the process is repeated. Similarity between multi-word clusters is defined as the similarity of the last word appearing in the first cluster and the first word appearing in the second cluster. This helps maintain the order of words in the cluster in the standard left-to-right English text direction. The cluster merging process is bounded by a user-defined threshold on the word pair similarity score.

Cluster order is determined by the number of words in a cluster and word frequency information. For any word $w$ appearing in a tag cloud, $s(w)$ is the font size assigned to word $w$. A clustered tag cloud consists of a set of clusters $C = \{C_1, \ldots, C_n\}$ where each $C_i$ contains some set of words $\{w_1, \ldots, w_m\}$. The clusters are laid out in decreasing order according to the following value:

$$\sqrt{\sum_{j=1}^{m} s(w_j)^2}$$

This is the L2 norm (i.e. Euclidean length) of the cluster's word size vector.

The greedy clustering algorithm described above does not consider the co-occurrence of all word pairs in the input text. Thus, as an alternative to the clustered layout, words can be visualized as a similarity network. Each word is represented as a node, with node and label size proportional to word frequency as previously described. Words are connected by edges whose width is proportional to their similarity score, as defined above. The resulting network can be laid out, analyzed and clustered using Cytoscape functionalities. The network layout is particularly useful when words tend to have multiple co-occurrence partners, rather than a single one.
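The clustering procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the plugin's code: the text does not say how the probabilities are estimated, so the sketch assumes P(w1, w2) is the fraction of adjacent word pairs equal to (w1, w2) and P(w) is the fraction of all words equal to w. The function names, the toy word lists and the font sizes are hypothetical, and pairs are merged only when their score strictly exceeds the threshold.

```python
from collections import Counter
from math import sqrt


def pair_similarity(token_lists):
    """Observed adjacent-pair probability divided by the product of word probabilities."""
    words = Counter(w for toks in token_lists for w in toks)
    pairs = Counter(p for toks in token_lists for p in zip(toks, toks[1:]))
    n_words, n_pairs = sum(words.values()), sum(pairs.values())
    return {
        (a, b): (c / n_pairs) / ((words[a] / n_words) * (words[b] / n_words))
        for (a, b), c in pairs.items()
    }


def greedy_clusters(token_lists, threshold):
    """Repeatedly merge the most similar pair of clusters, preserving word order."""
    sim = pair_similarity(token_lists)
    # Each distinct word starts in its own cluster.
    clusters = [[w] for w in dict.fromkeys(w for toks in token_lists for w in toks)]
    while True:
        best, best_score = None, threshold
        for a in clusters:
            for b in clusters:
                if a is b:
                    continue
                # Cluster similarity: last word of the first, first word of the second.
                score = sim.get((a[-1], b[0]), 0.0)
                if score > best_score:
                    best, best_score = (a, b), score
        if best is None:
            return clusters
        a, b = best
        clusters.remove(b)
        a.extend(b)


def order_clusters(clusters, size):
    """Sort clusters by the L2 norm of their words' font sizes, largest first."""
    return sorted(clusters, key=lambda c: sqrt(sum(size[w] ** 2 for w in c)), reverse=True)


# Toy input (hypothetical): tokenized names of three selected nodes.
token_lists = [
    ["origin", "recognition", "complex"],
    ["origin", "recognition", "complex"],
    ["minichromosome", "maintenance", "protein"],
]
clusters = greedy_clusters(token_lists, threshold=1.0)
# Hypothetical font sizes: words seen in two nodes are larger than words seen in one.
sizes = {w: 3.0 for w in token_lists[0]} | {w: 2.0 for w in token_lists[2]}
ordered = order_clusters(clusters, sizes)
```

On this toy input the procedure yields two clusters, "origin recognition complex" and "minichromosome maintenance protein", and the first is laid out first because its size vector has the larger Euclidean length. Because each word belongs to exactly one cluster, a word shared by two phrases would end up in only one of them, which is the limitation that the network layout addresses.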

Conclusions

WordCloud is a configurable tool for creating quick visual summaries of sub-networks within Cytoscape, and it aids interactive network exploration. The configuration options provide a high degree of control over tag cloud visualization, resulting in a publication-quality summary of a sub-network. WordCloud also includes clustered tag cloud and word similarity network visualization options that retain the meaning of phrases by maintaining word order, rather than just displaying individual words.

Availability and Requirements

Project name: WordCloud
Project home page: http://baderlab.org/Software/WordCloudPlugin
Operating system: Platform independent
Programming language: Java
Other requirements: Cytoscape version 2.6 or newer, Java SE 5
License: GNU LGPL
Any restrictions to use by non-academics: None

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

LO designed and developed the software and drafted the manuscript. DM, RI and GDB conceived the project, contributed to the design of the software and aided in the drafting of the manuscript. All authors have read and approved the final manuscript.

References

1. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

2. Cytoscape: a software environment for integrated models of biomolecular interaction networks.

Authors: Paul Shannon; Andrew Markiel; Owen Ozier; Nitin S Baliga; Jonathan T Wang; Daniel Ramage; Nada Amin; Benno Schwikowski; Trey Ideker
Journal: Genome Res Date: 2003-11 Impact factor: 9.043

3. Gene-set approach for expression pattern analysis.

Authors: Dougu Nam; Seon-Young Kim
Journal: Brief Bioinform Date: 2008-01-17 Impact factor: 11.622

4. ConceptGen: a gene set enrichment and gene set relation mapping tool.

Authors: Maureen A Sartor; Vasudeva Mahavisno; Venkateshwar G Keshamouni; James Cavalcoli; Zachary Wright; Alla Karnovsky; Rork Kuick; H V Jagadish; Barbara Mirel; Terry Weymouth; Brian Athey; Gilbert S Omenn
Journal: Bioinformatics Date: 2009-12-09 Impact factor: 6.937

5. Pathway analysis of dilated cardiomyopathy using global proteomic profiling and enrichment maps.

Authors: Ruth Isserlin; Daniele Merico; Rasoul Alikhani-Koupaei; Anthony Gramolini; Gary D Bader; Andrew Emili
Journal: Proteomics Date: 2010-03 Impact factor: 3.984

6. Enrichment map: a network-based method for gene-set enrichment visualization and interpretation.

Authors: Daniele Merico; Ruth Isserlin; Oliver Stueker; Andrew Emili; Gary D Bader
Journal: PLoS One Date: 2010-11-15 Impact factor: 3.240

7. How to visually interpret biological data using networks.

Authors: Daniele Merico; David Gfeller; Gary D Bader
Journal: Nat Biotechnol Date: 2009-10 Impact factor: 54.908

8. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae.

Authors: Nevan J Krogan; Gerard Cagney; Haiyuan Yu; Gouqing Zhong; Xinghua Guo; Alexandr Ignatchenko; Joyce Li; Shuye Pu; Nira Datta; Aaron P Tikuisis; Thanuja Punna; José M Peregrín-Alvarez; Michael Shales; Xin Zhang; Michael Davey; Mark D Robinson; Alberto Paccanaro; James E Bray; Anthony Sheung; Bryan Beattie; Dawn P Richards; Veronica Canadien; Atanas Lalev; Frank Mena; Peter Wong; Andrei Starostine; Myra M Canete; James Vlasblom; Samuel Wu; Chris Orsi; Sean R Collins; Shamanta Chandran; Robin Haw; Jennifer J Rilstone; Kiran Gandi; Natalie J Thompson; Gabe Musso; Peter St Onge; Shaun Ghanny; Mandy H Y Lam; Gareth Butland; Amin M Altaf-Ul; Shigehiko Kanaya; Ali Shilatifard; Erin O'Shea; Jonathan S Weissman; C James Ingles; Timothy R Hughes; John Parkinson; Mark Gerstein; Shoshana J Wodak; Andrew Emili; Jack F Greenblatt
Journal: Nature Date: 2006-03-22 Impact factor: 49.962

9. ClueGO: a Cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks.

Authors: Gabriela Bindea; Bernhard Mlecnik; Hubert Hackl; Pornpimol Charoentong; Marie Tosolini; Amos Kirilovsky; Wolf-Herman Fridman; Franck Pagès; Zlatko Trajanoski; Jérôme Galon
Journal: Bioinformatics Date: 2009-02-23 Impact factor: 6.937
