MOTIVATION: Clustering protein sequence data into functionally specific families is a difficult but important problem in biological research. One useful approach represents the sequence dataset as a protein similarity network and then clusters the network using graph analysis techniques. Although many such network clustering algorithms have been developed over the past few years, comparing them is often difficult because performance is affected by the specifics of network construction. We investigate an important aspect of network construction used in analyzing protein superfamilies and present a heuristic approach for improving the performance of several algorithms.

RESULTS: We analyzed how the performance of network clustering algorithms relates to thresholding the network prior to clustering. Our results, over four different datasets, show that for each input dataset there exists an optimal threshold range over which an algorithm generates its most accurate clustering output. Our results further show that the optimal threshold range correlates with the shape of the edge weight distribution of the input similarity network. We used this correlation to develop an automated threshold selection heuristic that filters a similarity network prior to clustering. This heuristic allows researchers to process their protein datasets with runtime-efficient network clustering algorithms without sacrificing the clustering accuracy of the final results.

AVAILABILITY: Python code implementing the automated threshold selection heuristic, together with the datasets used in our analysis, is available at http://www.rbvi.ucsf.edu/Research/cytoscape/threshold_scripts.zip.
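The thresholding step described in RESULTS can be illustrated with a minimal sketch. This is not the authors' heuristic, which selects a threshold from the shape of the edge weight distribution and is available at the URL above. The sketch only shows how filtering edges by weight changes the resulting clusters. It assumes that higher weights mean greater similarity, uses a made-up toy network, and substitutes connected components for a real clustering algorithm. The names `threshold_edges` and `connected_components` are hypothetical.

```python
from collections import defaultdict


def threshold_edges(edges, threshold):
    """Keep only edges whose similarity weight is at least `threshold`.

    Assumption: higher weight means more similar sequences.
    """
    return [(u, v, w) for u, v, w in edges if w >= threshold]


def connected_components(nodes, edges):
    """Group nodes into connected components using union-find.

    Stand-in for a clustering algorithm; real tools are more elaborate.
    """
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for u, v, _ in edges:
        parent[find(u)] = find(v)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted(sorted(g) for g in groups.values())


# Toy similarity network (illustrative values, not from the paper).
nodes = ["A", "B", "C", "D", "E"]
edges = [
    ("A", "B", 0.90),
    ("B", "C", 0.80),
    ("C", "D", 0.20),
    ("D", "E", 0.85),
    ("A", "E", 0.10),
]

# Sweep a few candidate thresholds and report the resulting clusters.
for t in (0.0, 0.5, 0.95):
    kept = threshold_edges(edges, t)
    print(t, len(kept), connected_components(nodes, kept))
```

On this toy network, a threshold that is too low merges everything into one cluster and one that is too high leaves only singletons, while an intermediate value separates two groups. This is the kind of behavior that motivates looking for an optimal threshold range.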