Counting clusters using R-NN curves.

Rajarshi Guha, Debojyoti Dutta, David J Wild, Ting Chen.

Abstract

Clustering is a common task in the field of cheminformatics. A key parameter that needs to be set for nonhierarchical clustering methods, such as k-means, is the number of clusters, k. Traditionally, the value of k is obtained by performing the clustering with different values of k and selecting the value that leads to the optimal clustering. In this study, we describe an approach to selecting k, a priori, based on the R-NN curve algorithm described by Guha et al. (J. Chem. Inf. Model., 2006, 46, 1713-1722), which uses a nearest-neighbor technique to characterize the spatial location of compounds in arbitrary descriptor spaces. The algorithm generates a set of curves for the data set, which are then analyzed to estimate the natural number of clusters. We then performed k-means clustering with the predicted value of k as well as with similar values to check that the correct number of clusters was obtained. In addition, we compared the predicted value to the number indicated by the average silhouette width as a cluster quality measure. We tested the algorithm on simulated data as well as on two chemical data sets. Our results indicate that the R-NN curve algorithm is able to determine the natural number of clusters and is in general agreement with the average silhouette width in identifying the optimal number of clusters.
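
As a rough illustration of the two quantities the abstract mentions, the following minimal sketch computes a set of R-NN curves and an average silhouette width for a toy data set. It assumes that the R-NN curve of a point is the number of neighbors lying within radius R of it, with R stepped up to 100% of the maximum pairwise distance; the abstract itself does not spell this out, so treat that definition as an assumption. The function names, the step count, and the toy points are hypothetical, and the curve-analysis step that estimates k is not reproduced here.

```python
import math


def pairwise_distances(points):
    """Full Euclidean distance matrix as a list of lists."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def rnn_curves(points, steps=10):
    """One curve per point: neighbor counts within radius R, where R runs over
    1/steps, 2/steps, ..., 1 times the maximum pairwise distance (assumed definition)."""
    dist = pairwise_distances(points)
    d_max = max(max(row) for row in dist)
    curves = []
    for i, row in enumerate(dist):
        curve = []
        for s in range(1, steps + 1):
            radius = d_max * (s / steps)
            # Count every other point within the current radius.
            curve.append(sum(1 for j, d in enumerate(row) if j != i and d <= radius))
        curves.append(curve)
    return curves


def average_silhouette_width(points, labels):
    """Mean of (b - a) / max(a, b) over all points; needs at least two clusters."""
    dist = pairwise_distances(points)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the point's own cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to any other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / len(widths)


# Toy data: two well-separated groups of three points each (made up for illustration).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
curves = rnn_curves(points, steps=10)
good = average_silhouette_width(points, [0, 0, 0, 1, 1, 1])
bad = average_silhouette_width(points, [0, 1, 0, 1, 0, 1])
print(curves[0])
print(round(good, 3), round(bad, 3))
```

On data like this, a labeling that matches the two groups should score a much higher average silhouette width than one that mixes them, which is the sense in which the silhouette serves as a cluster quality measure.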

Year:  2007        PMID: 17602604      PMCID: PMC2543137          DOI: 10.1021/ci600541f

Source DB:  PubMed          Journal:  J Chem Inf Model        ISSN: 1549-9596            Impact factor:   4.956

