Peter Itskowitz, Alexander Tropsha.
Abstract
Variable selection k Nearest Neighbor (kNN) QSAR is a popular nonlinear methodology for building correlation models between chemical descriptors of compounds and biological activities. The models are built by finding a subspace of the original descriptor space where activity of each compound in the data set is most accurately predicted as the averaged activity of its k nearest neighbors in this subspace. We have formulated the problem of searching for the optimized kNN QSAR models with the highest predictive power as a variational problem. We have investigated the relative contribution of several model parameters such as the selection of variables, the number (k) of nearest neighbors, and the shape of the weighting function used to evaluate the contributions of k nearest neighbor compound activities to the predicted activity of each compound. We have derived the expression for the weighting function which maximizes the model performance. This optimization methodology was applied to several experimental data sets divided into the training and test sets. We report a significant improvement of both the leave-one-out cross-validated R² (q²) for the training sets and predictive R² of the test sets in all cases. Depending on the data set, the average improvements in the prediction accuracy (prediction R²) for the test sets ranged between 1.1% and 94% and for the training sets (q²) between 3.5% and 118%. We also describe a modified computational procedure for model building based on the use of relational databases to store descriptors and calculate compounds' similarities, which simplifies calculations and increases their efficiency.
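The prediction scheme summarized in the abstract can be illustrated with a minimal Python sketch. It predicts each compound's activity as a weighted average of its k nearest neighbors within a chosen descriptor subspace, then scores the model with leave-one-out q². All function names here are hypothetical, and the two weighting functions (uniform and exponential decay) are illustrative assumptions only; the abstract does not give the optimal weighting function derived in the paper, so it is not reproduced here.

```python
import math

def select_subspace(X, columns):
    """Keep only the chosen descriptor columns (the 'variable selection' step)."""
    return [[row[c] for c in columns] for row in X]

def loo_knn_predictions(X, y, k, weight):
    """Leave-one-out predictions: each compound's activity is the weighted
    average of the activities of its k nearest neighbors."""
    preds = []
    for i, xi in enumerate(X):
        # Euclidean distances to all other compounds; compound i is left out.
        neighbors = sorted(
            (math.dist(xi, xj), y[j]) for j, xj in enumerate(X) if j != i
        )[:k]
        weights = [weight(d) for d, _ in neighbors]
        weighted_sum = sum(w * a for w, (_, a) in zip(weights, neighbors))
        preds.append(weighted_sum / sum(weights))
    return preds

def q2(y, preds):
    """Cross-validated R^2: 1 - PRESS / total sum of squares."""
    mean_y = sum(y) / len(y)
    press = sum((a - p) ** 2 for a, p in zip(y, preds))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - press / ss_tot

# Illustrative weighting functions (assumptions, not the paper's derived form).
def uniform_weight(d):
    return 1.0

def exponential_weight(d):
    return math.exp(-d)
```

A search over descriptor subsets, values of k, and weighting shapes would call `q2(y, loo_knn_predictions(select_subspace(X, cols), y, k, weight))` for each candidate and keep the best-scoring combination. How that search is carried out in the paper is not stated in the abstract.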
Year: 2005 PMID: 15921467 DOI: 10.1021/ci049628+
Source DB: PubMed Journal: J Chem Inf Model ISSN: 1549-9596 Impact factor: 4.956