Anna Cichonska, Tapio Pahikkala, Sandor Szedmak, Heli Julkunen, Antti Airola, Markus Heinonen, Tero Aittokallio, Juho Rousu.
Abstract
Motivation: Many inference problems in bioinformatics, including drug bioactivity prediction, can be formulated as pairwise learning problems, in which one is interested in making predictions for pairs of objects, e.g. drugs and their targets. Kernel-based approaches have emerged as powerful tools for such problems. Multiple kernel learning (MKL) is especially promising because it integrates various types of complex biomedical information sources in the form of kernels and learns their importance for the prediction task. However, the immense size of pairwise kernel spaces remains a major bottleneck, making existing MKL algorithms computationally infeasible even for a small number of input pairs.
Year: 2018 PMID: 29949975 PMCID: PMC6022556 DOI: 10.1093/bioinformatics/bty277
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Schematic overview of the pairwiseMKL method for learning with multiple pairwise kernels, using drug response prediction in cancer cell lines as an example. First, two drug kernels and three cell line kernels are calculated from available chemical and genomic data sources, respectively. Each resulting matrix relates all drugs to one another, or all cell lines to one another, so a kernel can be viewed as a similarity measure. Since we are interested in learning bioactivities of pairs of input objects, here drug–cell line pairs, pairwise kernels relating all drug–cell line pairs are needed. They are calculated as Kronecker products (⊗) of drug kernels and cell line kernels (2 drug kernels × 3 cell line kernels = 6 pairwise kernels). In the first learning stage, the pairwise kernel mixture weights are determined (Section 2.2.1). A weighted combination of pairwise kernels is then used for anticancer drug response prediction with a regularized least-squares pairwise regression model (Section 2.2.2). Importantly, pairwiseMKL performs these two steps efficiently by avoiding explicit construction of any massive pairwise matrices, which makes it well suited for solving large pairwise learning problems.
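The idea of working with Kronecker-product kernels without ever building them can be illustrated with the standard "vec trick" identity, (A ⊗ B) vec(X) = vec(A X Bᵀ) for row-major vectorization. The sketch below is only an illustration of that identity, not the authors' implementation: the helper names (`random_kernel`, `weighted_pairwise_matvec`), the random stand-in kernels, and the toy sizes are all assumptions, and it further assumes that every drug–cell line pair is present.

```python
import numpy as np

def random_kernel(n, rng):
    """Random symmetric positive semidefinite matrix standing in for a real kernel."""
    features = rng.standard_normal((n, n))
    return features @ features.T / n

def weighted_pairwise_matvec(drug_kernels, cell_kernels, weights, v):
    """Multiply sum_ij weights[i, j] * kron(Kd_i, Kc_j) by v without forming any kron.

    v is ordered drug-major: entry d * n_cells + c belongs to the pair (drug d, cell line c).
    """
    n_drugs = drug_kernels[0].shape[0]
    n_cells = cell_kernels[0].shape[0]
    V = v.reshape(n_drugs, n_cells)
    out = np.zeros_like(V)
    for i, Kd in enumerate(drug_kernels):
        for j, Kc in enumerate(cell_kernels):
            # kron(Kd, Kc) @ vec(V) equals vec(Kd @ V @ Kc.T) for row-major vec
            out += weights[i, j] * (Kd @ V @ Kc.T)
    return out.ravel()

rng = np.random.default_rng(0)
n_drugs, n_cells = 6, 5  # toy sizes (assumption)
drug_kernels = [random_kernel(n_drugs, rng) for _ in range(2)]
cell_kernels = [random_kernel(n_cells, rng) for _ in range(3)]
weights = rng.random((2, 3))
weights /= weights.sum()  # six mixture weights, one per pairwise kernel
v = rng.standard_normal(n_drugs * n_cells)

# Explicit construction, feasible only because the toy problem is tiny.
K_explicit = sum(
    weights[i, j] * np.kron(Kd, Kc)
    for i, Kd in enumerate(drug_kernels)
    for j, Kc in enumerate(cell_kernels)
)
max_abs_diff = np.max(np.abs(K_explicit @ v - weighted_pairwise_matvec(
    drug_kernels, cell_kernels, weights, v)))
print(max_abs_diff)
```

The explicit route stores a matrix with (n_drugs × n_cells)² entries, while the factorized route only ever touches matrices of size n_drugs × n_drugs, n_cells × n_cells and n_drugs × n_cells.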
Memory and time needed for a naïve MKL approach explicitly computing pairwise kernels (Section 2.1) and pairwiseMKL (Section 2.2), depending on the number of drugs and cell lines used in the drug bioactivity prediction experiment
| Number of drugs | Number of cell lines | Memory (GB), naïve approach | Memory (GB), pairwiseMKL | Time (h), naïve approach | Time (h), pairwiseMKL |
|---|---|---|---|---|---|
| 50 | 50 | 9.810 | 0.001 | 2.976 | 0.003 |
| 60 | 60 | 20.290 | 0.001 | 7.797 | 0.005 |
| 70 | 70 | 37.750 | 0.043 | 17.678 | 0.057 |
| 80 | 80 | 64.000 | 0.044 | 37.691 | 0.069 |
| 90 | 90 | 103.180 | 0.046 | 77.408 | 0.087 |
| 100 | 100 | 156.890 | 0.048 | 145.312 | 0.106 |
| 110 | 110 | 229.670 | 0.050 | >168.000 (a) | 0.118 |
| 120 | 120 | >256.000 (b) | 0.053 | ≫168.000 | 0.123 |
Note: A single round of 10-fold CV was run using different-sized subsets of the data on anticancer drug responses (described in Section 2.3.1) with 10 drug kernels and 12 cell line kernels. Regularization hyperparameter λ was set to 0.1 in both methods.
(a) Program did not complete within 7 days (168 h).
(b) Program did not run given 256 GB of memory.
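The growth of the naïve memory column can be sanity-checked with a back-of-the-envelope count of kernel entries. The sketch below is an assumption-laden estimate, not a description of how the authors measured memory: it assumes dense 8-byte floating-point storage, counts only the kernel matrices themselves, and uses hypothetical function names. Its absolute numbers therefore sit below the measured ones, but the scaling matches: doubling both dimensions from 50 to 100 multiplies the estimate by 16, in line with the measured increase from 9.810 GB to 156.890 GB.

```python
def naive_kernel_storage_gb(n_drugs, n_cells, n_drug_kernels=10,
                            n_cell_kernels=12, bytes_per_entry=8):
    """Storage for all explicit pairwise kernels (assumes dense float64 matrices)."""
    n_pairs = n_drugs * n_cells
    n_pairwise_kernels = n_drug_kernels * n_cell_kernels
    return n_pairwise_kernels * n_pairs ** 2 * bytes_per_entry / 1e9

def factor_kernel_storage_gb(n_drugs, n_cells, n_drug_kernels=10,
                             n_cell_kernels=12, bytes_per_entry=8):
    """Storage for the drug and cell line kernels only, with no Kronecker products."""
    n_entries = n_drug_kernels * n_drugs ** 2 + n_cell_kernels * n_cells ** 2
    return n_entries * bytes_per_entry / 1e9

for n in (50, 100, 120):
    print(n, naive_kernel_storage_gb(n, n), factor_kernel_storage_gb(n, n))
```

The naïve estimate grows with the fourth power of the number of drugs and cell lines, whereas the factor kernels grow only quadratically.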
Fig. 2. Pairwise kernel mixture weights obtained with pairwiseMKL and KronRLS-MKL (average across 10 outer CV folds) in the task of (a) drug response prediction in cancer cell lines and (b) drug–protein binding affinity prediction (note: KronRLS-MKL did not execute with 1 TB memory); only non-zero weights are shown. KronRLS-MKL finds separate weights for drug kernels and cell line (protein) kernels instead of pairwise kernels. Numbers at the end of kernel names indicate the kernel hyperparameter values, in particular (i) the kernel width hyperparameter in the case of Gaussian kernels (e.g. Kc-cn-146 with a kernel width of 146), and (ii) the maximum sub-string length L, σ1 controlling the shifting contribution term and σ2 controlling the amino acid similarity term in the case of GS kernels (e.g. Kp-GS-atp-5-4-4 with L = 5, σ1 = 4 and σ2 = 4; see Section 2.3.2 for details). (c) Summary of drug, cell line and protein kernels used in this work for the two prediction problems.
Fig. 3. Prediction performance of pairwiseMKL in the tasks of (a) drug response prediction in cancer cell lines and (b) drug–protein binding affinity prediction. Scatter plots between original and predicted bioactivity values across (a) 15 376 drug–cell line pairs and (b) 167 995 drug–protein pairs. Performance measures were averaged over 10 outer CV folds. F1 score was calculated using the threshold of (a) ln(IC50) = 5 nM, (b) −log10(IC50) = 7 M, both corresponding to a low drug concentration of roughly 100 nM, i.e. a relatively stringent potency threshold (red dotted lines). Color coding indicates the number of training data points, i.e. drug–cell line (respectively drug–protein) pairs including the same drug or cell line (drug or protein) as the test data point.
Prediction performance, memory usage and running time of pairwiseMKL and KronRLS-MKL methods in the task of drug response in cancer cell line prediction.
| Anticancer drug response prediction | RMSE | Pearson r | F1 score | Memory (GB) | Time (h) |
|---|---|---|---|---|---|
| pairwiseMKL | 1.682 | 0.858 | 0.630 | 0.057 | 1.45 |
| KronRLS-MKL | 1.899 | 0.849 | 0.378 | 3.890 | 8.42 |
Performance measures were averaged over 10 outer CV folds. F1 score was calculated using the threshold of ln(IC50) = 5 nM.
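For readers who want to see how the three performance measures relate to continuous predictions, here is a minimal sketch. It is not the authors' evaluation code: the function name `regression_and_f1`, the toy values, and in particular the assumption that a pair counts as "active" when ln(IC50) falls below the threshold (lower IC50 meaning higher potency) are illustrative choices.

```python
import numpy as np

def regression_and_f1(y_true, y_pred, threshold, active_if_below=True):
    """Return RMSE, Pearson correlation, and F1 after binarizing at `threshold`."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    pearson = float(np.corrcoef(y_true, y_pred)[0, 1])
    # Assumption: for ln(IC50), values below the threshold are the potent (positive) class;
    # for -log10(IC50) the direction flips, hence the switch.
    if active_if_below:
        true_active, pred_active = y_true < threshold, y_pred < threshold
    else:
        true_active, pred_active = y_true > threshold, y_pred > threshold
    tp = int(np.sum(true_active & pred_active))
    fp = int(np.sum(~true_active & pred_active))
    fn = int(np.sum(true_active & ~pred_active))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return rmse, pearson, f1

# Toy ln(IC50) values in nM (made up for illustration only).
y_true = [3.0, 4.5, 6.0, 7.5, 2.0, 5.5]
y_pred = [3.5, 5.2, 5.8, 7.0, 2.5, 4.8]
rmse, pearson, f1 = regression_and_f1(y_true, y_pred, threshold=5.0)
print(rmse, pearson, f1)
```

Because F1 depends only on which side of the threshold each value falls, two models with similar RMSE and correlation can still differ substantially in F1 when many true values lie close to the cut-off.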