Julia Handl (1), Joshua Knowles, Douglas B Kell. (1) School of Chemistry, University of Manchester, Faraday Building, Sackville Street, PO Box 88, Manchester M60 1QD, UK. J.Handl@postgrad.manchester.ac.uk
Abstract
MOTIVATION: The discovery of novel biological knowledge from the ab initio analysis of post-genomic data relies upon the use of unsupervised processing methods, in particular clustering techniques. Much recent research in bioinformatics has therefore been focused on the transfer of clustering methods introduced in other scientific fields and on the development of novel algorithms specifically designed to tackle the challenges posed by post-genomic data. The partitions returned by a clustering algorithm are commonly validated using visual inspection and concordance with prior biological knowledge; whether the clusters actually correspond to the real structure in the data is somewhat less frequently considered. Suitable computational cluster validation techniques are available in the general data-mining literature, but have been given only a fraction of the same attention in bioinformatics.
RESULTS: This review paper aims to familiarize the reader with the battery of techniques available for the validation of clustering results, with a particular focus on their application to post-genomic data analysis. Synthetic and real biological datasets are used to demonstrate the benefits, and also some of the perils, of analytical cluster validation.
AVAILABILITY: The software used in the experiments is available at http://dbkweb.ch.umist.ac.uk/handl/clustervalidation/.
SUPPLEMENTARY INFORMATION: Enlarged colour plots are provided in the Supplementary Material, which is available at http://dbkweb.ch.umist.ac.uk/handl/clustervalidation/.
Authors: Partha Mukhopadhyay; Guy Brock; Vasyl Pihur; Cynthia Webb; M Michele Pisano; Robert M Greene Journal: Birth Defects Res A Clin Mol Teratol Date: 2010-07
Authors: Elisabeth J Ploran; Steven M Nelson; Katerina Velanova; David I Donaldson; Steven E Petersen; Mark E Wheeler Journal: J Neurosci Date: 2007-10-31 Impact factor: 6.167
Authors: Xiting Yan; Jen-Hwa Chu; Jose Gomez; Maria Koenigs; Carole Holm; Xiaoxuan He; Mario F Perez; Hongyu Zhao; Shrikant Mane; Fernando D Martinez; Carole Ober; Dan L Nicolae; Kathleen C Barnes; Stephanie J London; Frank Gilliland; Scott T Weiss; Benjamin A Raby; Lauren Cohn; Geoffrey L Chupp Journal: Am J Respir Crit Care Med Date: 2015-05-15 Impact factor: 21.405
Authors: Jeff W Chou; Tong Zhou; William K Kaufmann; Richard S Paules; Pierre R Bushel Journal: BMC Bioinformatics Date: 2007-11-02 Impact factor: 3.169
Authors: Daniel L Cooke; David B McCoy; Van V Halbach; Steven W Hetts; Matthew R Amans; Christopher F Dowd; Randall T Higashida; Devon Lawson; Jeffrey Nelson; Chih-Yang Wang; Helen Kim; Zena Werb; Charles McCulloch; Tomoki Hashimoto; Hua Su; Zhengda Sun Journal: Transl Stroke Res Date: 2017-09-13 Impact factor: 6.829