Abstract
Machine learning algorithms are generally developed in computer science or adjacent disciplines and find their way into chemical modeling by a process of diffusion. Though particular machine learning methods are popular in chemoinformatics and quantitative structure-activity relationships (QSAR), many others exist in the technical literature. This discussion is methods-based and focused on some algorithms that chemoinformatics researchers frequently use. It makes no claim to be exhaustive. We concentrate on methods for supervised learning, predicting the unknown property values of a test set of instances, usually molecules, based on the known values for a training set. Particularly relevant approaches include Artificial Neural Networks, Random Forest, Support Vector Machine, k-Nearest Neighbors and naïve Bayes classifiers.
Year: 2014; PMID: 25285160; PMCID: PMC4180928; DOI: 10.1002/wcms.1183
Source DB: PubMed; Journal: Wiley Interdiscip Rev Comput Mol Sci; ISSN: 1759-0884
Figure 1. We can conceive of chemoinformatics as a two-part problem: encoding chemical structure as features, and mapping the features to the output property. The second of these is most often the province of machine learning.
Figure 2. Five illustrative decision trees forming a (very small) Random Forest for classification. The terminal leaf nodes are shown as squares and colored red or green according to class. The path taken through each tree by a query instance is shown in orange. Trees A, B, C, and E predict that the instance belongs to the red class, tree D dissenting, so that the Random Forest will assign it to the red class by a 4–1 majority vote.
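The voting step described in the Figure 2 caption can be sketched in a few lines of Python. This is only an illustration of majority voting over per-tree predictions; the function name `majority_vote` and the dictionary `tree_votes` are hypothetical, and the trees themselves are not modeled, only their outputs as stated in the caption.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent class label and the number of votes it received."""
    label, count = Counter(predictions).most_common(1)[0]
    return label, count


# Per-tree predictions taken from the caption: A, B, C, E say red; D dissents.
tree_votes = {"A": "red", "B": "red", "C": "red", "D": "green", "E": "red"}

forest_label, forest_count = majority_vote(tree_votes.values())
print(forest_label, forest_count)
```

With an odd number of trees and two classes, as here, the vote cannot tie; a forest with an even number of trees would need an explicit tie-breaking rule.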
Figure 3. Illustration of a kNN classification model. For k = 1, the model will classify the blue query instance as a member of the red class; for k = 3, it will again be assigned to the red class, this time by a 2–1 vote; however, since the fourth and fifth nearest neighbors are both green, a k = 5 model would classify it as part of the green class by a 3–2 majority.
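The dependence on k described in the Figure 3 caption can be reproduced with a minimal kNN sketch. The coordinates below are invented for illustration (the figure's actual positions are not given); they are chosen only so that the neighbors, in order of distance from the query, are red, green, red, green, green. The function name `knn_classify` and the use of Euclidean distance are assumptions.

```python
import math
from collections import Counter


def knn_classify(query, training, k):
    """Classify `query` by majority vote among its k nearest training points."""
    ranked = sorted(training, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2D training set: (coordinates, class label).
training = [
    ((1.0, 0.0), "red"),    # nearest neighbor
    ((0.0, 1.5), "green"),  # second nearest
    ((2.0, 0.0), "red"),    # third nearest
    ((0.0, 2.5), "green"),  # fourth nearest
    ((3.0, 0.0), "green"),  # fifth nearest
    ((5.0, 5.0), "red"),    # too far away to matter for k <= 5
]
query = (0.0, 0.0)

results = {k: knn_classify(query, training, k) for k in (1, 3, 5)}
print(results)
```

Using odd values of k with two classes avoids tied votes, which is why the caption's examples use k = 1, 3, and 5.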
Some Other Machine-Learning Methods Used in Chemoinformatics
| Algorithm | Description |
|---|---|
| Ant Colony | Uses virtual pheromones based on ant behavior for optimization |
| Relevance Vector Machine (RVM) | Sparse probabilistic binary classifier related to SVM; gives probabilities rather than all-or-nothing classification |
| Parzen-Rosenblatt Window | Kernel density estimation method that allows molecular similarities to be transformed into probabilities of class membership |
| Fuzzy Logic | Designed to give interpretable rules based on descriptor values |
| Rough Sets | Rule-based method designed to give interpretable rules |
| Support Vector Inductive Logic Programming (SVILP) | Rule-based method incorporating SVM ideas |
| Winnow | For every class, Winnow learns a vector of weights for each feature. Test instances are compared with these using score thresholds |
| Decision Tree | Like one tree from a Random Forest, but without randomization |
| Linear Discriminant Analysis (LDA) | Models statistical differences between classes in order to make a classification |
| kScore | Analogous to a weighted kNN scheme in which the weights are optimized by Leave-One-Out cross-validation |
| Projection to Latent Structures (PLS) | Obtains a linear regression by projecting |
Figure 4. Design of a cross-validation exercise, here shown for eight-fold cross-validation. The identities of the six training, one test, and one internal validation folds are cyclically permuted.
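The cyclic permutation of fold roles in the Figure 4 caption can be sketched as follows. This is a minimal illustration that works on fold indices only; the function name `cyclic_cv_splits` is hypothetical, and the choice of which position in the rotation serves as the internal validation fold versus the test fold is an assumption, since the caption does not specify it.

```python
def cyclic_cv_splits(n_folds=8):
    """Yield (training folds, validation fold, test fold) for each rotation.

    Assumption: in each rotation the last fold is the test fold and the
    second-to-last is the internal validation fold; the rest are training.
    """
    for shift in range(n_folds):
        order = [(i + shift) % n_folds for i in range(n_folds)]
        yield order[:-2], order[-2], order[-1]


splits = list(cyclic_cv_splits(8))
for train, validation, test in splits:
    print(f"train={train} validation={validation} test={test}")
```

Across the eight rotations, every fold serves exactly once as the test fold and exactly once as the internal validation fold, so each instance is predicted once by a model that never saw it during training.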