Rajarshi Guha, Debojyoti Dutta, Peter C Jurs, Ting Chen.
Abstract
Libraries of chemical structures are used in a variety of cheminformatics tasks, such as virtual screening and QSAR modeling, and are generally characterized using molecular descriptors. When working with libraries, it is useful to understand the distribution of compounds in the space defined by a set of descriptors. We present a simple approach to analyzing the spatial distribution of the compounds in a library in general, and to outlier detection in particular, based on counts of neighbors within a series of increasing radii. The resultant curves, termed R-NN curves, appear to follow a logistic model for any given descriptor space, which we justify theoretically for the 2D case. The method can be applied to data sets of arbitrary dimensions. The R-NN curves provide a visual method to easily detect compounds lying in a sparse region of a given descriptor space. We also present a method to numerically characterize the R-NN curves, thus allowing identification of outliers in a single plot.
Year: 2006 PMID: 16859303 DOI: 10.1021/ci060013h
Source DB: PubMed Journal: J Chem Inf Model ISSN: 1549-9596 Impact factor: 4.956
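
The neighbor-counting idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function name `rnn_curves`, the use of Euclidean distance, the choice to scale the radii as fractions of the largest pairwise distance, and reporting counts as percentages are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def rnn_curves(X, n_radii=100):
    """Hypothetical helper: neighbor-count curves for each row of X.

    Returns (radii, curves), where curves[i, k] is the percentage of the
    other points that lie within radii[k] of point i.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Full pairwise Euclidean distance matrix (O(n^2) memory; fine for a sketch).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Assumption: radii increase in equal steps up to the largest pairwise distance.
    radii = dist.max() * np.arange(1, n_radii + 1) / n_radii
    # For every point and radius, count points within that radius.
    within = dist[:, :, None] <= radii[None, None, :]
    # Subtract 1 so that a point is not counted as its own neighbor.
    counts = within.sum(axis=1) - 1
    return radii, 100.0 * counts / (n - 1)
```

Under these assumptions, a point in a dense region gains neighbors at small radii, so its curve rises early, while a point in a sparse region stays near zero until the radius is large enough to reach the bulk of the data. Plotting each row of `curves` against `radii` gives one curve per compound. A single-number summary such as the mean height of each curve (again an illustrative choice, not necessarily the characterization used in the paper) would rank the sparsest points lowest.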