Daniele Raimondi, Gabriele Orlando, Wim F Vranken, Yves Moreau.
Abstract
Machine learning (ML) is ubiquitous in bioinformatics, due to its versatility. One of the most crucial aspects of training an ML model is carefully selecting the optimal feature encoding for the problem at hand. Biophysical propensity scales are widely adopted in structural bioinformatics because they describe amino acid properties that are intuitively relevant for many structural and functional aspects of proteins, and they are thus commonly used as input features for ML methods. In this paper we reproduce three classical structural bioinformatics prediction tasks to investigate the main assumptions about the use of propensity scales as input features for ML methods. We investigate their usefulness with different randomization experiments and show that their effectiveness varies among the ML methods used and the tasks. We show that while linear methods are more dependent on the feature encoding, the specific biophysical meaning of the features is less relevant for non-linear methods. Moreover, we show that even among linear ML methods, the simpler one-hot encoding can surprisingly outperform the "biologically meaningful" scales. We also show that feature selection performed with non-linear ML methods may not be able to distinguish between randomized and "real" propensity scales by properly prioritizing the latter. Finally, we show that learning problem-specific embeddings could be a simple, assumption-free and optimal way to perform feature learning/engineering for structural bioinformatics tasks.
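To make the two feature encodings contrasted in the abstract concrete, here is a minimal sketch of encoding a sequence window either as one-hot vectors or via a single propensity scale. The scale values below are made up for illustration (the paper samples real scales from curated collections):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Hypothetical propensity scale: one real-valued property per residue type.
# These numbers are placeholders, not a real biophysical scale.
TOY_SCALE = {aa: round(np.sin(i), 2) for i, aa in enumerate(AMINO_ACIDS)}

def onehot_encode(window):
    """Encode a sequence window as a flat one-hot vector (len(window) * 20)."""
    vec = np.zeros((len(window), len(AMINO_ACIDS)))
    for pos, aa in enumerate(window):
        vec[pos, AA_INDEX[aa]] = 1.0
    return vec.ravel()

def scale_encode(window, scale):
    """Encode a sequence window with a single propensity scale (one value per residue)."""
    return np.array([scale[aa] for aa in window])

window = "ACDY"
print(onehot_encode(window).shape)            # (80,)
print(scale_encode(window, TOY_SCALE).shape)  # (4,)
```

The one-hot encoding carries no biophysical assumptions, while the scale encoding compresses each residue to a single "meaningful" number; the paper's experiments probe how much that meaning actually matters.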
Year: 2019 PMID: 31729443 PMCID: PMC6858301 DOI: 10.1038/s41598-019-53324-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Results of the randomization experiments. This figure shows the results of the experiments on the 3 tasks tested in this study (Cysteine oxidation (CYS), Relative Solvent Accessibility (RSA) and Secondary Structure (SS) predictions). The plots show the performances obtained by 4 ML methods (Multilayer Perceptron (MLP), Random Forest (RF), Ridge classifier (Ridge) and Linear Support Vector Machine (LinSVC)) in terms of their AUC (MCC in the SS case). For each ML method and each task we ran five simulations, testing the scores obtained with different feature encodings. For each combination of task and ML method, we tested the one-hot encoding (ONEHOT), randomly sampled real propensity scales (REAL), randomly shuffled propensity scales (SHUFFLED), randomly generated scales (RANDGEN) and a true randomization of the vectors (see Methods for more details).
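The scale randomizations compared in Figure 1 can be sketched as follows; a SHUFFLED scale keeps the exact value distribution of a real scale but reassigns values to residues, while a RANDGEN scale is drawn from scratch (the normal draws here are placeholders for whatever generating distribution the Methods specify):

```python
import numpy as np

rng = np.random.default_rng(0)
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

# A "real" scale: one value per amino acid (placeholder values here).
real_scale = dict(zip(AMINO_ACIDS, rng.normal(size=20)))

# SHUFFLED: the same 20 values, randomly reassigned to amino acids,
# destroying the biophysical meaning while preserving the value distribution.
shuffled_values = rng.permutation(list(real_scale.values()))
shuffled_scale = dict(zip(AMINO_ACIDS, shuffled_values))

# RANDGEN: freshly generated random values with no relation to any real scale.
randgen_scale = dict(zip(AMINO_ACIDS, rng.normal(size=20)))

# SHUFFLED preserves the multiset of values; RANDGEN does not.
assert sorted(shuffled_scale.values()) == sorted(real_scale.values())
```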
Table 1. Best AUC (MCC in the case of the SS task) scores obtained in our simulations.

| Task | Method | ONEHOT | REAL | SHUFFLED | RANDGEN |
|---|---|---|---|---|---|
| RSA | MLP | 77.5 | 77.5 | 77.4 | |
| RSA | RF | 76.0 | 76.9 | 76.9 | |
| RSA | Ridge | 77.2 | 75.4 | 75.9 | |
| RSA | LinSVC | 77.0 | 73.5 | 75.6 | |
| CYS | MLP | 67.7 | 71.5 | 71.6 | |
| CYS | RF | 68.0 | 72.9 | 73.2 | |
| CYS | Ridge | 71.1 | 76.5 | 74.9 | |
| CYS | LinSVC | 71.0 | 72.9 | 71.0 | |
| SS | MLP | 45.5 | 44.9 | 44.6 | |
| SS | RF | 39.9 | 41.4 | 41.2 | |
| SS | Ridge | 39.7 | 36.1 | 35.7 | |
| SS | LinSVC | 38.4 | 33.8 | 33.2 | |
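A minimal sketch of how the four classifiers in the table could be trained and scored with scikit-learn. The synthetic data stands in for the encoded sequence windows, and the hyperparameters are placeholders rather than the paper's actual settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in for an encoded-window dataset.
X = np.random.default_rng(0).normal(size=(400, 40))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

models = {
    "MLP": MLPClassifier(max_iter=500, random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "Ridge": RidgeClassifier(),
    "LinSVC": LinearSVC(),
}

aucs = {}
for name, model in models.items():
    model.fit(Xtr, ytr)
    # Probabilistic models score with predict_proba; margin-based models
    # (Ridge, LinSVC) use decision_function instead.
    if hasattr(model, "predict_proba"):
        score = model.predict_proba(Xte)[:, 1]
    else:
        score = model.decision_function(Xte)
    aucs[name] = roc_auc_score(yte, score)

print(aucs)
```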
Figure 2. Plots showing the distribution of the approximate Shapley value scores evaluating the contributions of the REAL and SHUFFLED features for the RF and Ridge methods on the RSA task. r < 0.6 and r < 0.4 indicate the maximum Pearson correlation allowed among the pool of sampled scales in each of these experiments, which regulates the maximum allowed redundancy among the scales used.
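Shapley values are usually approximated by Monte Carlo sampling over feature orderings; the following is a generic sketch of that idea, not necessarily the specific estimator used in the paper. `predict`, `background`, and `n_samples` are illustrative names:

```python
import numpy as np

def shapley_mc(predict, x, background, j, n_samples=200, rng=None):
    """Monte Carlo estimate of the Shapley value of feature j for input x.

    For each sampled permutation, features ordered before j take x's values
    and the rest come from a random background example; the marginal effect
    of switching feature j from background to x is averaged over samples.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    d = len(x)
    total = 0.0
    for _ in range(n_samples):
        perm = rng.permutation(d)
        pos = int(np.where(perm == j)[0][0])
        b = background[rng.integers(len(background))]
        z = b.copy()
        z[perm[:pos]] = x[perm[:pos]]  # features preceding j take x's values
        with_j = z.copy()
        with_j[j] = x[j]
        without_j = z.copy()
        without_j[j] = b[j]
        total += predict(with_j) - predict(without_j)
    return total / n_samples

# Sanity check: for an additive model, the Shapley value of feature 0 is
# exactly its contribution relative to the background.
predict = lambda v: float(v.sum())
x = np.array([2.0, 1.0])
background = np.zeros((1, 2))
print(shapley_mc(predict, x, background, j=0))  # 2.0
```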
Figure 3. Plot showing the MCC scores obtained with embeddings on non-linear ML methods (left) and linear ones (right). Purple-ish colors indicate experiments using the embeddings, while orange-ish colors indicate the MCCs obtained by the same methods with propensity scales. The grey line represents the best score for the SS task reported in Table 1.
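A problem-specific embedding replaces the fixed scales with a trainable lookup table, one vector per amino acid, optimized jointly with the predictor. A minimal sketch of the lookup (the embedding dimension is a hypothetical choice, and the table stays at its random initialization here; in practice it would be trained end-to-end, e.g. as an embedding layer in a neural network):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
EMB_DIM = 5  # hypothetical embedding size

rng = np.random.default_rng(0)
# Trainable lookup table: one EMB_DIM-dimensional vector per amino acid.
embedding = rng.normal(size=(len(AMINO_ACIDS), EMB_DIM))

def embed(window):
    """Map a sequence window to the concatenation of its residue embeddings."""
    idx = [AA_INDEX[aa] for aa in window]
    return embedding[idx].ravel()

print(embed("ACDY").shape)  # (20,)
```

Unlike a fixed propensity scale, the table imposes no prior biophysical assumptions: whatever structure the task needs is learned from the data itself.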