Counting clusters using R-NN curves.

Rajarshi Guha, Debojyoti Dutta, David J Wild, Ting Chen.

Abstract

Clustering is a common task in the field of cheminformatics. A key parameter that must be set for nonhierarchical clustering methods, such as k-means, is the number of clusters, k. Traditionally, the value of k is obtained by performing the clustering with different values of k and selecting the value that leads to the optimal clustering. In this study, we describe an approach to selecting k a priori, based on the R-NN curve algorithm described by Guha et al. (J. Chem. Inf. Model., 2006, 46, 1713-1722), which uses a nearest-neighbor technique to characterize the spatial location of compounds in arbitrary descriptor spaces. The algorithm generates a set of curves for the data set, which are then analyzed to estimate the natural number of clusters. We then performed k-means clustering with the predicted value of k as well as with similar values to check that the correct number of clusters was obtained. In addition, we compared the predicted value to the number indicated by the average silhouette width as a cluster quality measure. We tested the algorithm on simulated data as well as on two chemical data sets. Our results indicate that the R-NN curve algorithm is able to determine the natural number of clusters and is in general agreement with the average silhouette width in identifying the optimal number of clusters.
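
The conventional baseline mentioned in the abstract, running k-means for several values of k and keeping the k with the highest average silhouette width, can be sketched roughly as follows. This is a minimal pure-Python illustration and not the authors' code: the toy data set, the farthest-point seeding, the range of k, and the function names (`toy_data`, `kmeans`, `avg_silhouette`, `choose_k`) are all assumptions made for the example. It does not implement the R-NN curve analysis itself.

```python
import math

def toy_data():
    """Three well-separated 3x3 grids of 2-D points (hypothetical data)."""
    offsets = [(dx, dy) for dx in (-0.5, 0.0, 0.5) for dy in (-0.5, 0.0, 0.5)]
    blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
    return [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

def farthest_point_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start at the
    first point, then repeatedly add the point farthest from all centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(xs) / len(members)
                                   for xs in zip(*members))
    return labels

def avg_silhouette(points, labels):
    """Mean of s = (b - a) / max(a, b) over all points, where a is the mean
    distance to the point's own cluster and b the smallest mean distance to
    any other cluster. Singleton clusters contribute s = 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two non-empty clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # The point itself adds distance 0, so divide by len(own) - 1.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        total += (b - a) / max(a, b)
    return total / len(points)

def choose_k(points, k_values):
    """Cluster once per candidate k; return the best k and all scores."""
    scores = {k: avg_silhouette(points, kmeans(points, k)) for k in k_values}
    return max(scores, key=scores.get), scores

points = toy_data()
best_k, scores = choose_k(points, range(2, 7))
print(best_k)  # prints 3 for the three-blob toy data
```

The sweep has to cluster the data once per candidate k, which is the cost that an a priori estimate of k is meant to avoid; in the workflow the abstract describes, a sweep like this serves only as a check around the predicted value.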

Year: 2007 PMID: 17602604 PMCID: PMC2543137 DOI: 10.1021/ci600541f

Source DB: PubMed Journal: J Chem Inf Model ISSN: 1549-9596 Impact factor: 4.956
