Ekaterina A Sosnina1,2, Sergey Sosnin1,3, Anastasia A Nikitina4,5, Ivan Nazarov1, Dmitry I Osolodkin5,6, Maxim V Fedorov1,3,7. 1. Center for Computational and Data-Intensive Science and Engineering, Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30/1, Moscow 143026, Russia. 2. Institute of Physiologically Active Compounds, RAS, Severniy pr. 1, Chernogolovka 142432, Russia. 3. Syntelly LLC, Skolkovo Innovation Center, Bolshoy Boulevard 30, Moscow 121205, Russia. 4. Department of Chemistry, Lomonosov Moscow State University, Leninskie Gory 1 bd. 3, Moscow 119991, Russia. 5. FSBSI "Chumakov FSC R&D IBP RAS", Poselok Instituta Poliomielita 8 bd. 1, Poselenie Moskovsky, Moscow 108819, Russia. 6. Institute of Translational Medicine and Biotechnology, Sechenov First Moscow State Medical University, Trubetskaya Ulitsa 8, Moscow 119991, Russia. 7. Physics John Anderson Building, University of Strathclyde, 107 Rottenrow East, Glasgow G4 0NG, U.K.
Abstract
Recommender systems (RSs), which underwent rapid development and had an enormous impact on e-commerce, have the potential to become useful tools for drug discovery. In this paper, we applied RS methods for the prediction of the antiviral activity class (active/inactive) for compounds extracted from ChEMBL. Two main RS approaches were applied: collaborative filtering (Surprise implementation) and content-based filtering (sparse-group inductive matrix completion (SGIMC) method). The effectiveness of RS approaches was investigated for prediction of antiviral activity classes ("interactions") for compounds and viruses, for which some of their interactions with other viruses or compounds are known, and for prediction of interaction profiles for new compounds. Both approaches achieved relatively good prediction quality for binary classification of individual interactions and compound profiles, as quantified by cross-validation and external validation receiver operating characteristic (ROC) score >0.9. Thus, even simple recommender systems may serve as an effective tool in antiviral drug discovery.
Multitask learning[1] has gained popularity
in different fields owing to the continued rapid growth of available
information and the development of advanced algorithms. The advantage
of multitask learning is that it provides an opportunity to use additional
information from related tasks for prediction for a task with insufficient
information. It allows one both to improve the prediction performance
and to process small or imbalanced data sets that are prevalent in
the drug discovery field.[2]

The multitask
learning approaches in different fields evolved separately
and are characterized by different definitions and notations (see
Appendix 1). The concept of multitask
learning has become popular in chemoinformatics,[3−7] and the clearest examples of its realization are
proteochemometrics[8−10] and the “read-across” approach.[11] In e-commerce, multitask learning is known
as the recommender system (RS).

RS is one of the approaches[12] based
on multitask learning and allows one to perform multitask prediction
(occasionally referred to as multitarget prediction[13]). Interest in RSs began to grow in October 2006
with the announcement of the Netflix Prize competition,[14,15] which aimed to create an accurate system to analyze users’ preferences
and suggest video content to them.

RS methods are classified,
based on the information used for model
creation, into collaborative filtering (CF), content-based filtering
(CBF), hybrid approaches, and others.[16,17] The choice
of approach for a certain task is grounded on the required prediction
accuracy and available computational resources.[18−20]

Collaborative
filtering (CF)[16,21,22] is one of the most common RS methods, popularized during the Netflix
competition due to its simplicity. It rests on the assumption that users with similar preferences
have similar profiles and vice versa. A
CF RS model recommends new content to a user based on its evaluation
by other users with similar preference profiles. In the drug discovery
context, CF methods may rely on the similarity between compound or
target interaction profiles to predict interaction values and select
compound–target pairs with higher interaction scores.

CF methods are the easiest to implement but have several limitations.[16,23−25] The most important one is the cold-start problem
(CS): interaction values cannot be reliably predicted for pairs consisting
of new compounds or targets due to an inability to calculate similarity
for their (empty) interaction profiles. The second limitation is the
sparsity problem: the fewer the interaction values known, the more
complicated it is to calculate similarity. The third limitation is
scalability: the computational and memory complexities of CF algorithms
are generally quadratic.

Content-based filtering (CBF) RS methods
are more advanced and
allow one to predict interaction values based on additional feature
information, also called side-channel information, which characterizes
both compounds and targets.[16,26,27] CBF recommends items similar to those liked by a user in the past
based on the assessment of similarity of their features. In drug discovery,
CBF may employ similarity based on features, or descriptors, of compounds
or targets. Feature information allows one to overcome the limitations
inherent to CF methods, namely, the inability to predict for new compounds or targets and
the difficulty of handling very sparse data matrices. Also, a valuable advantage of CBF algorithms
is the possibility of interpreting the model by analysis of important
features. The disadvantages of CBF include the risk of overfitting
and the need for feature calculation, which may be complicated in
the case of target characterization.

Among the rapidly growing
number of multitask prediction applications
in drug discovery, only a couple of dozen studies regarded their approaches
as RS. These studies were usually concerned with the analysis of approved
drugs and their possible side effects,[28−30] drug repurposing,[30−33] drug–drug interactions,[34,35] toxicogenomics
prediction,[36] or treatment recommendations.[37,38] A comparison of several methods demonstrated the possibility to
successfully use rather simple RS algorithms in drug discovery.[39] Also, there were publications describing new
methods of matrix completion with validation on drug databases,[40] and estimating their robustness on bioactivity
data sets.[41] In clinical medicine, RSs
have been applied since 2008 to improve treatment recommendation schemes.
For example, the RS approaches were used for automatic detection of
omissions in medication lists,[42,43] as well as for treatment
optimization in the context of the information overload problem, by
suggesting knowledge-based items of interest to clinicians for specific
diseases.[44]

The search for new antivirals
is an attractive field for the application
of the RS approach. It is rather different from other medicinal chemistry
fields because the majority of primary antiviral activity data are
obtained from phenotypic antiviral assays, usually cell-based.[45] Contrary to common approaches, where targets
are represented by individual proteins, antiviral activity is usually
measured in much more complicated systems, containing at least viruses,
cells, and compounds, and approaches based on individual targets are
of limited use here. Thus, the search for broad-spectrum antivirals
or antivirals against less-studied viruses reduces to the application
of common molecules or privileged classes, such as nucleosides, as
could be seen during the current coronavirus disease pandemic.[46−48] In our previous studies, we compiled a large annotated data set
of small-molecule antiviral activity, ViralChEMBL v. 0.1.[45] After filtering, the data on activity and inactivity
of approximately 250K compounds against 158 viral species were represented
as a sparse matrix of compound–virus interactions, containing
only 400K data points of 40M possible. Typically, the sparsity of
interaction matrix M in the RS setting reaches 90–99%,[49,50] so this data set forms a proper base for the development of predictive
models based on matrix completion methods.

In this paper, we
present an attempt to apply RS approaches in
the antiviral drug discovery context. To complete the antiviral activity
matrix, we used the CF algorithms implemented in the Surprise package[51] and the sparse-group inductive matrix completion
(SGIMC) implementation of CBF.[52] Several
questions were addressed in our study: (1) Are the RS approaches effective
in the context of antiviral activity prediction taking into account
the data sparsity and unusual complexity of targets (viruses)? (2)
Which RS approach gives a more accurate prediction? (3) Can we obtain
a reliable prediction result for new compounds or new viral species?
To address these challenges, we developed scenarios for prediction
of new point interactions for compounds and viruses, which were used
for model building, and prediction of interaction profiles for new
compounds or viruses, not used for model building.
Materials and Methods
Recommender System Approaches
Collaborative Filtering
We used
the CF implementations of the Surprise Python package.[51] The methods we used operate only on the interaction matrix
elements and can be divided into three groups:

(1) k-Nearest-neighbor (kNN)-based algorithms are implemented in the knns.KNNBasic class and identify the neighbors for the compounds and viruses based on the similarity of their interaction profiles. We used cosine similarity and mean-squared difference metrics for similarity calculation.

(2) Clustering algorithms are implemented in the co_clustering.CoClustering class. They identify the neighborhood by grouping compounds or viruses into coclusters, simultaneously clustering the columns and rows of a matrix, and generate predictions based on the average interaction values.

(3) Matrix factorization algorithms are represented by singular value decomposition (matrix_factorization.SVD) and non-negative matrix factorization (matrix_factorization.NMF) methods. They are based on the idea of interaction matrix decomposition and determination of the latent variables allowing for completion of the missing interaction values.

Model hyperparameters are given in Supporting Information Table S1.
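The neighborhood idea behind the kNN group can be illustrated with a minimal NumPy sketch. It is not the Surprise implementation: the function names, the toy matrix, and the fallback to the column mean for a compound without known interactions are assumptions made for illustration only.

```python
import numpy as np

def profile_similarity(M, known):
    """Cosine similarity between row profiles over their shared known columns."""
    n = M.shape[0]
    sim = np.zeros((n, n))
    for a in range(n):
        for b in range(a + 1, n):
            common = known[a] & known[b]
            if not common.any():
                continue  # no shared measurements: similarity stays 0
            va, vb = M[a, common], M[b, common]
            denom = np.linalg.norm(va) * np.linalg.norm(vb)
            if denom > 0:
                sim[a, b] = sim[b, a] = float(va @ vb) / denom
    return sim

def knn_predict(M, known, sim, row, col, k=2):
    """Similarity-weighted average over the k most similar rows measured in `col`."""
    candidates = [b for b in range(M.shape[0])
                  if b != row and known[b, col] and sim[row, b] > 0]
    if not candidates:
        # Assumed fallback for cold-start rows: mean of the known column values.
        return float(M[known[:, col], col].mean())
    candidates.sort(key=lambda b: sim[row, b], reverse=True)
    top = candidates[:k]
    weights = np.array([sim[row, b] for b in top])
    values = np.array([M[b, col] for b in top])
    return float(weights @ values / weights.sum())

nan = np.nan
# Toy matrix (hypothetical): rows are compounds, columns are viruses;
# 1 = active, 0 = inactive, nan = not measured.
M = np.array([[1, 1, nan],
              [1, 1, 1],
              [0, nan, 0],
              [nan, 1, nan],
              [nan, nan, nan]])  # last row: compound with no known interactions
known = ~np.isnan(M)
sim = profile_similarity(M, known)

pred_known = knn_predict(M, known, sim, row=0, col=2)  # uses the similar compound 1
pred_cold = knn_predict(M, known, sim, row=4, col=2)   # falls back to the column mean
print(pred_known, pred_cold)
```

In this toy case the compound with an empty profile has zero similarity to every other compound, so its prediction carries no compound-specific information, which is the cold-start limitation described in the Introduction.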
Content-Based Filtering
The sparse-group
inductive matrix completion algorithm implemented in the SGIMC package[52] was used as an example of CBF RS. It is based
on the inductive matrix completion (IMC) method and allows one to
filter out noninformative features. To recover the missing entries
of matrix $M \in \mathbb{R}^{n_1 \times n_2}$, where only $M_{ij}$ for
$(i, j) \in \Omega \subset \{1,\dots,n_1\} \times \{1,\dots,n_2\}$ are known, IMC takes into account side feature
information provided in matrices $X$ and $Y$. In our case, $X \in \mathbb{R}^{n_1 \times d_1}$ contains descriptors of the compounds and $Y \in \mathbb{R}^{n_2 \times d_2}$ contains virus features (here, taxa), $n_1$ and $n_2$ are the
numbers of compounds and species, and $d_1$ and $d_2$ are the numbers of their features,
respectively.

The approach is based on the assumption that the
elements of the matrix $M$ may be predicted via a bilinear
model: $M_{ij} \sim x_i^{\top} W y_j$ for a low-rank matrix $W \in \mathbb{R}^{d_1 \times d_2}$, where $x_i$ and $y_j$ are the feature vectors of the ith compound and the jth species, given that the features are predictive.
As the matrix $W$ must have rank $k < \min(d_1, d_2)$, according to the constraint of the inductive matrix completion
approach, the matrix $W$ can be represented by a low-rank
product $UV^{\top}$ with $U \in \mathbb{R}^{d_1 \times k}$ and $V \in \mathbb{R}^{d_2 \times k}$. Then, the penalized minimization problem
is solved

$$\min_{U, V} \sum_{(i,j) \in \Omega} \ell\left(M_{ij},\, x_i^{\top} U V^{\top} y_j\right) + \lambda_U R(U) + \lambda_V R(V),$$

where $\ell$ is
the smooth convex loss function for
the binary classification problem, $\ell(y, p) = \log(1 + e^{-yp})$ for $y$ and $p$ as known and
predicted values, respectively. $R(\cdot)$ is the
penalty, and $\lambda_U$ and $\lambda_V$ are the appropriate regularization coefficients.

The SGIMC algorithm shares the idea of the IMC approach of matrix
completion by combining feature vectors associated with rows and columns
of an interaction matrix with a low-rank matrix. The method differs
by application of the sparse-group penalty for selection of side features,
in addition to the classic ridge and lasso regularizations used in
IMC. Thus, the penalty function $R(\cdot)$ is represented
as a sum of three terms: the sparsity-inducing penalty $\|Z\|_{2,1}$, the squared Frobenius norm $\|Z\|_F^2$, and the matrix $L_1$-norm $\|Z\|_{1,1}$,

$$\|Z\|_{2,1} = \sum_i \|Z^{\top} e_i\|_2, \qquad \|Z\|_F^2 = \sum_{i,j} Z_{ij}^2, \qquad \|Z\|_{1,1} = \sum_{i,j} |Z_{ij}|,$$

where $e_i$ is
the ith unit vector, which conforms
to the dimensionality of its context. The algorithm relies on single
penalty functions or their combinations by setting the proper regularization
coefficients $C_{\mathrm{lasso}}$, $C_{\mathrm{ridge}}$, and $C_{\mathrm{group}}$.

We
investigated the influence of regularization coefficients, the
rank of low-rank matrix W, and the number of training
iterations on the predictive ability of RS in the case of antiviral
activity data. The ranges of the investigated hyperparameter values
are provided in Supporting Information Table S1.
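As a rough illustration of the bilinear model and the three penalty terms described in this section, the sketch below evaluates an IMC-style objective with NumPy. It is not the SGIMC package code. The function names, the toy matrices, the mapping of class labels to +1/-1 for the logistic loss, and the plain weighted sum of the three penalty terms are all assumptions.

```python
import numpy as np

def imc_scores(X, Y, U, V):
    """Bilinear scores p_ij = x_i^T U V^T y_j for every (row, column) pair."""
    return (X @ U) @ (Y @ V).T

def logistic_loss(y, p):
    """log(1 + exp(-y * p)); labels y are assumed to be coded as +1 / -1."""
    return np.logaddexp(0.0, -y * p)

def penalty(Z, c_lasso, c_group, c_ridge):
    """Weighted sum of the three terms (the weighting scheme is an assumption)."""
    lasso = np.abs(Z).sum()                  # ||Z||_{1,1}
    group = np.linalg.norm(Z, axis=1).sum()  # ||Z||_{2,1}: sum of row norms
    ridge = np.sum(Z ** 2)                   # squared Frobenius norm
    return c_lasso * lasso + c_group * group + c_ridge * ridge

def objective(M_pm, mask, X, Y, U, V, lam_u, lam_v, c=(1.0, 1.0, 1.0)):
    """Loss over known entries plus penalties on both factor matrices."""
    P = imc_scores(X, Y, U, V)
    data_term = logistic_loss(M_pm[mask], P[mask]).sum()
    return data_term + lam_u * penalty(U, *c) + lam_v * penalty(V, *c)

# Hypothetical toy problem: 3 compounds x 2 features, 2 species x 2 features, rank 1.
X = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]])
Y = np.eye(2)
U = np.array([[1.0], [0.0]])   # second row is zero: compound feature 2 is switched off
V = np.array([[1.0], [-1.0]])
M_pm = np.array([[1.0, -1.0], [0.0, 0.0], [1.0, -1.0]])
mask = np.array([[True, True], [False, False], [True, False]])

scores = imc_scores(X, Y, U, V)
value = objective(M_pm, mask, X, Y, U, V, lam_u=0.1, lam_v=0.1)
print(scores)
print(value)
```

A zero row in U removes the corresponding compound feature from every prediction, which is the effect the row-wise group penalty is meant to encourage when it filters out noninformative features.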
Data Preparation
We used ViralChEMBL[45] as the source of information about compound–virus
interactions to create data sets for cross-validation and model training.
The ViralChEMBL data set contains 615 029 antiviral activity
data points extracted from ChEMBL v.20, standardized and annotated
by virus species according to ICTV taxonomy. We prepared compound–virus
interaction data based on the workflow depicted in Figure 1 and described in the Supporting
Information. As a result of the data processing, the DB_main data set for training and cross-validation
comprised 247 994 compounds, 158 viral species, and 400 281
interaction values.
Figure 1
Scheme of data preparation.
External validation was based on compound–virus interaction
information from ChEMBL v.24 for the period 2017–2019.[53] Test sets were prepared for assessment of the
interaction prediction quality in two cases: for known compounds and
viruses (DB_ext_points) and for new compounds, that is, compoundwise CS prediction (DB_ext_CS). DB_ext_points consisted of 659 interactions
between 447 compounds and 43 viral species, while DB_ext_CS contained 10 730
interactions between 10 730 compounds and 55 viral species.
The descriptive statistics of the data sets are given in Table 1.
Table 1
Data Sets for Model Training, Cross-Validation, and External Validation

| interaction data sets | DB_main | DB_ext_points | DB_ext_CS |
|---|---|---|---|
| number of compounds | 247 994 | 447 | 10 730 |
| number of viral species | 158 | 43 | 55 |
| number of interactions | 400 281 | 659 | 10 730 |
| active/inactive class ratio | 9/1 | 3/1 | 4/1 |
| minimum/average/maximum number of interactions with viruses for each compound | 1/1.61/36 | 1/1.47/12 | 1/1/1 |
| minimum/average/maximum number of interactions with compounds for each virus | 1/2533.42/85 823 | 1/15.33/155 | 1/195.09/2621 |
| sparsity | 1.02% | 3.43% | 1.82% |
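The derived rows of Table 1 follow from the three counts of each data set. The short check below recomputes them, assuming that the row labeled "sparsity" reports the share of filled cells and using the 55 viral species stated in the text for DB_ext_CS; the dictionary and function names are illustrative only.

```python
# Counts per data set: (compounds, viral species, interactions).
datasets = {
    "DB_main": (247_994, 158, 400_281),
    "DB_ext_points": (447, 43, 659),
    "DB_ext_CS": (10_730, 55, 10_730),
}

def summarize(n_compounds, n_species, n_interactions):
    """Return (filled share in %, mean interactions per compound, mean per virus)."""
    filled = 100.0 * n_interactions / (n_compounds * n_species)
    per_compound = n_interactions / n_compounds
    per_virus = n_interactions / n_species
    return round(filled, 2), round(per_compound, 2), round(per_virus, 2)

summary = {name: summarize(*counts) for name, counts in datasets.items()}
for name, row in summary.items():
    print(name, row)
```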
Compound Features
Two-dimensional
descriptors of chemical structures were calculated with Dragon 7.0.8[54] and used as the features of the compounds. Descriptors
were calculated and selected with default options, except the following
set “on”: (1) exclude descriptors with constant and
near-constant values, (2) exclude descriptors with a standard deviation
(SD) of <0.0001, (3) exclude descriptors with all missing values,
and (4) round descriptor values. The features with “NaN”
values were dropped according to the SGIMC requirements. The selected
features were standardized with the StandardScaler class of the sklearn.preprocessing module. The
resulting compound feature matrices DB_c.main, DB_c.ext_points, and DB_c.ext_CS were composed of 2016 feature columns and different numbers of compound
rows (Table 1).

For an additional cross-validation test, we reduced the number of
features to investigate their influence on the predictivity. DB_c.50d, DB_c.25d, and DB_c.10d contained a reduced
number of features: 50, 25, and 10% of the original set, respectively. DB_c.8 contained only the eight simplest features: AMW,
average molecular weight; nSK, number of non-hydrogen atoms; snBO,
number of non-hydrogen bonds; sRBN, number of rotatable bonds; nDB,
number of double bonds; sMLOGP, Moriguchi octanol–water partition
coefficient (logP); nHDon, number of hydrogen bond donors (N and O);
and nHAcc, number of H-bond acceptors (N, O, F). DB_c.1 was the unit feature vector with the sole feature
equal to 1 for all compounds. Feature selection for DB_c.50d, DB_c.25d, and DB_c.10d data sets was performed
in three random replicates to ensure the robustness of selection.
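The feature preparation steps of this section (dropping features with missing values, standardization, and random feature subsets) can be sketched with NumPy as follows. The study used the scikit-learn StandardScaler; the code below only mirrors its centering and scaling. Uniform sampling without replacement, the seeds, and the random toy matrix are assumptions.

```python
import numpy as np

def drop_nan_columns(F):
    """Remove every feature column that contains at least one NaN."""
    return F[:, ~np.isnan(F).any(axis=0)]

def standardize(F):
    """Center each column and scale it to unit (population) standard deviation."""
    mean = F.mean(axis=0)
    std = F.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (F - mean) / std

def random_feature_subset(F, fraction, seed):
    """Keep a random fraction of the columns (uniform sampling is assumed)."""
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(round(fraction * F.shape[1])))
    cols = np.sort(rng.choice(F.shape[1], size=n_keep, replace=False))
    return F[:, cols], cols

# Hypothetical descriptor matrix: 6 compounds x 21 descriptors, one with a missing value.
rng = np.random.default_rng(0)
F = rng.normal(size=(6, 21))
F[2, 5] = np.nan

clean = standardize(drop_nan_columns(F))
subsets = {
    frac: [random_feature_subset(clean, frac, seed)[0] for seed in (1, 2, 3)]
    for frac in (0.50, 0.25, 0.10)
}
print(clean.shape, {frac: reps[0].shape for frac, reps in subsets.items()})
```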
Viral Species Features
We used
viral species (species_id) and genus (genus_id) information from the ViralChEMBL data set according
to the ICTV 2014 taxonomy.[45,55] We exploited the genus
assignment as the only (pseudo)feature of the species and represented
it as a binary vector according to the following rule: if a species
belonged to a genus, then the bit corresponding to the genus_id was set to 1, otherwise to 0. The resulting
viral feature matrices DB_v.main, DB_v.ext_points, and DB_v.ext_CS contained 74 feature columns. It was not expected that such features
should contain sufficient predictive information. The utilization
of viral feature matrices was mandatory in the SGIMC method, while
the models were designed with mostly compound feature-based prediction
in mind.
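The genus-based binary encoding described above amounts to a one-hot matrix with one row per species and one column per genus. A minimal sketch, with hypothetical species_id and genus_id values:

```python
import numpy as np

def genus_feature_matrix(species_to_genus):
    """Build a binary species x genus matrix from a species_id -> genus_id mapping."""
    species = sorted(species_to_genus)
    genera = sorted(set(species_to_genus.values()))
    column = {g: j for j, g in enumerate(genera)}
    Y = np.zeros((len(species), len(genera)), dtype=int)
    for i, s in enumerate(species):
        Y[i, column[species_to_genus[s]]] = 1  # bit of the species' own genus
    return species, genera, Y

# Hypothetical identifiers: species 101 and 102 share genus 7, species 103 is in genus 9.
species, genera, Y_feat = genus_feature_matrix({101: 7, 102: 7, 103: 9})
print(Y_feat)
```

Because every species belongs to exactly one genus, each row contains a single 1, so two species look identical to the model whenever they share a genus.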
Prediction Scenarios
Three main challenges
can be addressed in the multitask prediction (Figure 2): prediction of new interactions for compounds
and viruses, data for which were used for model building (Figure 2a), prediction of
interaction profiles for compounds that were not used for model building
(compoundwise CS prediction) (Figure 2b), and prediction of interactions for viral species
that were not used for model building (specieswise CS prediction)
(Figure 2c). We assessed
the performance of the methods in all three scenarios, where possible.
Figure 2
Addressed
challenges: (a) prediction of point compound–virus
interactions, (b) compoundwise CS prediction, and (c) specieswise
CS prediction. Matrix of interactions, green; matrix of species features,
pink; matrix of compound features, yellow; and unknown compound–virus
interactions, white.
CF algorithms process
only the interaction matrix and thereby cannot
perform CS predictions. Thus, Surprise CF algorithms were used to
address only the prediction of new interactions between compounds
and viruses utilized for model building (Figure 2a). Hyperparameters for the best models were
selected by grid search (Supporting Information Table S1) and 10-fold cross-validation in the Surprise model_selection module. Models were based on the DB_main data set and external validation
was performed on the DB_ext_points data set.

The SGIMC CBF algorithm was applied
for solving all three challenges
(Figure 2). Models
were trained on the interaction matrix DB_main with side feature matrices DB_c.main and DB_v.main.
External validation was performed using the interaction matrix DB_ext_points and the
feature matrices DB_c.ext_points and DB_v.ext_points in the case of interaction prediction for known compounds
and species and using matrices DB_ext_CS, DB_c.ext_CS, and DB_v.ext_CS in the case of compoundwise CS prediction. Model selection
was performed based on a grid search of hyperparameters (Supporting
Information Table S1) and cross-validation.
We used the sklearn.model_selection (v.0.21.3)
classes KFold for 10-fold cross-validation in the
compoundwise CS scenario and StratifiedKFold for
stratified 10-fold cross-validation of prediction of new interactions
of known compounds and viruses, the latter because of the substantial data set imbalance.

Models for solving specieswise CS problems were built without cross-validation
using DB_main, DB_c.main, and DB_v.main data sets and hyperparameters of the best model for prediction of
new interactions between known compounds and viral species. Model
building and evaluation were performed by excluding the activity profiles
of each species one by one from the model building and applying them
as external test sets. Due to a small number of interaction values
for the majority of species, the assessment of their prediction power
could not be accurate.
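The three scenarios differ mainly in how known interactions are split into training and test parts. The NumPy sketch below shows one possible way to build such splits. It does not reproduce the scikit-learn KFold and StratifiedKFold classes used in the study; the function names, the toy interaction list, and the assumption that the compoundwise split assigns whole compounds to folds are illustrative.

```python
import numpy as np

def stratified_folds(labels, n_folds, seed=0):
    """Assign each interaction to a fold while keeping class proportions similar."""
    rng = np.random.default_rng(seed)
    fold = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        fold[idx] = np.arange(len(idx)) % n_folds
    return fold

def compoundwise_folds(compound_ids, n_folds, seed=0):
    """Assign whole compounds to folds, so test compounds never occur in training."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.unique(compound_ids))
    fold_of = {c: i % n_folds for i, c in enumerate(shuffled)}
    return np.array([fold_of[c] for c in compound_ids])

def leave_one_species_out(species_ids):
    """Yield (species, train indices, test indices), holding out one species at a time."""
    for s in np.unique(species_ids):
        held_out = species_ids == s
        yield s, np.flatnonzero(~held_out), np.flatnonzero(held_out)

# Hypothetical interaction list: one entry per known compound-virus pair.
compounds = np.array([0, 0, 1, 1, 2, 3, 4, 5, 6, 7])
species_ids = np.array([0, 1, 0, 2, 0, 1, 0, 1, 0, 0])
labels = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])

point_fold = stratified_folds(labels, n_folds=2)
cs_fold = compoundwise_folds(compounds, n_folds=2)
species_splits = list(leave_one_species_out(species_ids))
print(point_fold, cs_fold, len(species_splits))
```

In the point-interaction split a compound may appear on both sides, whereas in the compoundwise split all interactions of a held-out compound are hidden together, which is what makes it a cold-start test.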
Influence of the Number of Features
We carried out an additional test to investigate
the influence of
the number of features on the predictive power of the SGIMC algorithm.
The model for the DB_main matrix
was built. Feature matrices were represented by DB_v.main and DB_c.main subsets with different numbers of features, three generated by random
samplings: DB_c.50d, DB_c.25d, and DB_c.10d, and two with fixed features: DB_c.8 and DB_c.1. Assessment of prediction
on reduced matrices was based on the stratified 10-fold cross-validation
with hyperparameters from the best model for prediction of interaction
for known compounds and viruses based on the original DB_c.main data set.
Evaluation and Metrics
Models were
built based on the DB_main set and
validated via cross-validation and external test sets (DB_ext_CS and DB_ext_points).[56,57] To avoid possible errors caused by the substantial imbalance of
the data set with regard to activity classes, we used stratified 10-fold
cross-validation keeping the constant proportion of active and inactive
class assignments (90 and 10%, respectively) in the training and test
sets.[58] We used grid search to optimize
the hyperparameters of our algorithms. Varied hyperparameters and
their values are shown in Supporting Information Table S1.

We used the receiver operating characteristic
area under the curve (ROC AUC) score and two metrics based on it to
assess prediction quality. The ROC AUC score was calculated using
the roc_auc_score function of sklearn.metrics for all predicted values.[59] The mean and median of n-fold-averaged
ROC AUC for a set of viral species[60] were
also calculated

$$\text{mean ROC AUC} = \operatorname{mean}_{t}\left[\frac{1}{N}\sum_{n=1}^{N} \mathrm{ROC\,AUC}_n(t)\right], \qquad \text{median ROC AUC} = \operatorname{median}_{t}\left[\frac{1}{N}\sum_{n=1}^{N} \mathrm{ROC\,AUC}_n(t)\right],$$

where $\mathrm{ROC\,AUC}_n(t)$ is the ROC AUC score calculated for viral species t for the test fold n, N = 10 in the case of cross-validation and N = 1
in the case of external validation. Standard deviation (SD) was calculated
with the numpy.std function (numpy v. 1.17.2).

We used the ROC AUC score to assess the prediction quality for
all of the interaction values in the test set. The mean and median
ROC AUC scores were used to demonstrate the difference in prediction
quality for the separate viral species. For the comparison of models,
we used median ROC AUC as the main measure not skewed by extremely
large or small values, so it would better describe the real prediction
quality. Also, SD was defined as a deviation for predictions for different
viral species in the mean ROC AUC calculation and as a deviation in
ROC AUC values for different cross-validation runs. The quality of
specieswise CS prediction was assessed by the ROC AUC score for each
species separately. ROC AUC, mean and median ROC AUC scores for the
10 best models for every algorithm are collected in Supporting Information Tables S2–S9. A code snippet for the metrics
calculation is available as Supporting Information File SI2.

The robustness of the models was assessed
by y-scrambling.[56,61] The median
ROC AUC scores for
normal models were compared with the median ROC AUC scores of the
ones based on y-scrambled data sets. We used 1 –
(n/m) as a measure of robustness,[62,63] where m and n are the numbers
of “normal” and scrambled models with ROC AUC >0.6,
respectively. The y-scrambling was performed for
the 10 best models in each scenario according to the median ROC AUC
score and was applied for both cross-validation and external validation
(Supporting Information Tables S2–S9).

To assess the applicability of constructed models, we compared
training and test data sets based on the similarity distance between
their compounds. The distance was evaluated between DB_c.main and DB_c.ext_points, between DB_c.main and DB_c.ext_CS, and as an average of distances between training and test sets in
every fold in the case of cross-validation (DB_c.main). The distance between each pair of compounds in
the training and test data sets was computed based on their feature
values. The cosine distance was calculated with the spatial.distance.cdist function of the scipy package. The similarity between
the training and test sets was assessed based on the distribution
of distances between every ith compound in a test
set and all of the compounds in the training set ($\mathrm{DIST}_i$), which is defined in terms of the following quantities: $n_{ij}$ is the number of known interactions of the ith compound of the test set and the jth compound
of the training set with the same viral species, $N_i$ is the overall number of known interactions
for the ith compound, and $\mathrm{sim}_{ij}$ is the cosine distance between the ith compound
of the test set and the jth compound of the training
set containing $T$ compounds.
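The pairwise step of this comparison, the cosine distance between the feature vectors of test and training compounds, can be written in a few lines of NumPy. For non-zero vectors the result matches scipy.spatial.distance.cdist with metric="cosine". The weighting by shared viral species is not reproduced here, and the toy vectors are hypothetical.

```python
import numpy as np

def cosine_distance_matrix(A, B):
    """Cosine distance 1 - cos(a_i, b_j) between every row of A and every row of B."""
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 1.0 - A_unit @ B_unit.T

# Hypothetical feature vectors: 2 test compounds and 3 training compounds.
test_features = np.array([[1.0, 0.0], [1.0, 1.0]])
train_features = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

D = cosine_distance_matrix(test_features, train_features)
mean_distance = D.mean(axis=1)     # average distance of each test compound to the training set
nearest_distance = D.min(axis=1)   # distance to the closest training compound
print(D)
print(mean_distance, nearest_distance)
```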
Results and Discussion
The efficiency of antiviral activity
class prediction with CF and
CBF techniques was assessed and compared for small molecules from
ViralChEMBL and ChEMBL 24. We represented compound–virus interactions
as two classes, active and inactive, and encoded them in the interaction
matrix as 1 and 0, respectively. In the case of a lack of experimental
measurements, the corresponding value was kept empty. To understand
the performance and robustness of the RS approaches, we investigated
four scenarios: (1) prediction of
point interactions for known compounds
and viruses, (2) compoundwise CS prediction
(prediction of interaction
profiles for new compounds), (3) specieswise
CS prediction (prediction of interaction
profiles for new viruses), and (4) prediction
of compound–virus interactions with
a reduced number of compound features.

In all
of the scenarios, the ROC AUC scores for the best models
were much higher than the corresponding mean and median ROC AUC (Supporting
Information Tables S2–S9). Thus,
the ROC AUC scores could be used only to illustrate how precise the
prediction was for all of the interaction values in the test set.
It was calculated based on all predicted values and did not take into
account the specifics of each viral species. At the same time, the
mean/median ROC AUC was the average/median of ROC AUC values separately
calculated for each viral species. A moderate mean/median ROC AUC
value and a high SD of the mean ROC AUC indicated satisfactory prediction
for the whole data set along with a substantial difference in prediction
power for different viral species: the activity class prediction based
on the same model for some viral species was perfect, while for the
others it was unsatisfactory.

We performed 10-fold cross-validation
and optimized hyperparameters
of models by grid search. To prove the lack of impact of data set
imbalance on the prediction results, the y-scrambling
test was performed for the 10 best models under each scenario in both
cross-validation and external validation settings (Supporting Information Tables S2–S9). Upon y-randomization, the quality of models decreased, supporting
the relevance of our prediction models.

We examined
the possibility of using the Balanced Accuracy and
the Precision-Recall AUC as model quality metrics.[59] Their use led to overestimation of prediction results:[59] for the best models, their values were equal
to 0.98–0.99. Thus, we did not use them for the model assessment.
We also did not assess the accuracy of our models because it is easy
to get high accuracy even for a poor model for an imbalanced data
set.[64] We did not use specificity, sensitivity/recall,
precision, median balanced accuracy, and similar metrics that are
suitable for imbalanced data because these metrics require an active/inactive
threshold, which is not applicable for a data set based on several
targets.

We also evaluated the similarity of data sets by comparing
the
distance from each compound in the test set to all of the compounds
in the training sets. The results are presented in Supporting Information Figure S1 and Table S10. External test sets were found to consist of compounds that are
more distant from the training set compounds compared with the compounds
in the training and test sets during cross-validation.
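The gap discussed in this section between the ROC AUC over all predictions and the mean or median of species-wise ROC AUC values can be reproduced on a toy example. The sketch below uses a pairwise-comparison ROC AUC written in NumPy rather than the scikit-learn roc_auc_score used in the study; all labels and scores are invented for illustration.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """Share of (active, inactive) pairs ranked correctly; ties count as one half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true == 1], y_score[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        return np.nan  # undefined when a species has only one class
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def specieswise_auc(species, y_true, y_score):
    """ROC AUC computed separately for every species, in sorted species order."""
    species = np.asarray(species)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    return np.array([roc_auc(y_true[species == s], y_score[species == s])
                     for s in np.unique(species)])

# Hypothetical predictions for three species A, B, and C.
species = ["A"] * 4 + ["B"] * 4 + ["C"] * 2
y_true = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
y_score = [0.9, 0.8, 0.85, 0.95, 0.1, 0.2, 0.15, 0.05, 0.6, 0.4]

pooled_auc = roc_auc(y_true, y_score)
per_species = specieswise_auc(species, y_true, y_score)
mean_auc = float(np.nanmean(per_species))
median_auc = float(np.nanmedian(per_species))
print(pooled_auc, per_species, mean_auc, median_auc)
```

Here the pooled score benefits from the model separating mostly active species from mostly inactive ones, while the ranking within two of the three species is wrong, so the species-wise mean and median are much lower than the pooled value.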
Prediction of Point Interactions
Collaborative Filtering
We explored
three collaborative filtering techniques: k-nearest
neighbors, coclustering, and matrix factorization. The performance
of the best models is given in Table 2 and is illustrated in Figure 3.
Table 2
Predictivity of Surprise Models

| Surprise methods | cross-validation ROC AUC ± SD | cross-validation mean ROC AUC ± SD | cross-validation median ROC AUC | external validation ROC AUC | external validation mean ROC AUC ± SD | external validation median ROC AUC |
|---|---|---|---|---|---|---|
| knns.KNNBasic msd (virus-based) | 0.808 ± 0.004 | 0.8 ± 0.3 | 0.86 | 0.603 | 0.7 ± 0.3 | 0.72 |
| knns.KNNBasic msd (compound-based) | 0.888 ± 0.004 | 0.8 ± 0.3 | 0.83 | 0.888 | 0.8 ± 0.3 | 0.83 |
| knns.KNNBasic cosine (virus-based) | 0.806 ± 0.005 | 0.8 ± 0.3 | 0.86 | 0.606 | 0.7 ± 0.3 | 0.75 |
| knns.KNNBasic cosine (compound-based) | 0.872 ± 0.004 | 0.7 ± 0.3 | 0.79 | 0.872 | 0.7 ± 0.3 | 0.75 |
| co_clustering.CoClustering | 0.863 ± 0.01 | 0.7 ± 0.3 | 0.81 | 0.702 | 0.7 ± 0.3 | 0.76 |
| matrix_factorization.SVD | 0.939 ± 0.003 | 0.8 ± 0.3 | 0.88 | 0.764 | 0.7 ± 0.3 | 0.78 |
| matrix_factorization.NMF | 0.939 ± 0.003 | 0.8 ± 0.3 | 0.88 | 0.709 | 0.7 ± 0.3 | 0.68 |
Figure 3
Violin plot of ROC AUC values for viral species
in cross-validation
(blue) and external validation (red). Dotted lines inside the violins
represent the quartiles of the distribution.
It should
be noted that the cross-validation prediction results
in Surprise suffer from the CS problem. Due to a high data set sparsity,
more than 60% of the compounds possess only one interaction (it is
about 40% of compound–species interactions in the DB_main). Predicted values for these compound–virus
pairs during cross-validation will be equal to the mean of all interactions
from their viral species profile.

KNNBasic methods
are directly derived from the k-nearest-neighbors
approach and follow the basic paradigm
of chemoinformatics: similar compounds possess similar properties.
In our case, this statement can also be extended as follows: similar
compounds interact with similar viruses and similar viruses are inhibited
by similar compounds. The similarity is calculated for the interaction
profiles of compounds or viruses. The performance of models varies
depending on the similarity metric as well as the direction of similarity
calculation: virus- or compound-based similarity. Compound-based models
demonstrate better predictive power (Table 2). However, the similarity calculation is
both the key factor and the bottleneck of this algorithm. Upon an
increase of the number of interaction profiles N,
the predictive power of the model increased, probably due to the increase
of the information capacity of the similarity matrix, but at the same
time, space and time complexity is $O(N^2)$. This limits the applicability of similarity-based CF
methods to large data sets. For example, the DB_main data set required at most 1.5 GB RAM and a
dozen seconds for the calculation of an msd or cosine similarity matrix
for 158 viral species. The same data set required at least 1700 GB
RAM and 2 h or 2500 GB and 6 h for an msd or cosine
calculation of 250K compounds, respectively.

Methods based on
coclustering and matrix factorization do not rely
on profile similarity; therefore, they do not need large RAM resources
(no more than 1.5 GB of RAM) and take several minutes under the same
computational conditions as those for KNNBasic methods.
In coclustering, rows and columns of the interaction matrix are simultaneously grouped to compare the profiles and complete the missing values. The best coclustering model shows a cross-validation median ROC AUC of 0.81, which lies between those of the cosine virus-based kNN (0.79) and compound-based kNN (0.86) models. Both compound- and virus-based msd models also perform better than coclustering in cross-validation (median ROC AUCs of 0.83 and 0.86). For the test set, the median ROC AUC is almost the same for coclustering and kNN methods (around 0.75), except for compound-based msd (0.83). Thus, coclustering may be used in place of kNN if computational resources are limited.

The matrix factorization approach solves the problem of matrix
completion by finding latent features that determine the internal
relationship in data (in our case, between compounds and viruses).
Models based on this approach showed the best performance in the 10-fold
cross-validation protocol: median ROC AUC = 0.88 for both the SVD and NMF models. On
the test set, the NMF models demonstrated the worst performance (median
ROC AUC = 0.68), while the predictive power of the SVD models (median
ROC AUC = 0.78) is second only to that of the msd compound-based
kNN model (median ROC AUC = 0.83).
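To make the similarity-based models discussed above more concrete, the following minimal sketch computes msd and cosine similarities between two interaction profiles over their shared entries and estimates the size of one dense N × N similarity matrix. The dictionary representation, the 1/0 activity encoding, and all function names are assumptions made for illustration. The formulas follow the usual definitions over co-rated items and are not a copy of the Surprise implementation.

```python
import math


def common_pairs(p, q):
    """Values of two profiles on the keys they share (e.g., viral species)."""
    return [(p[k], q[k]) for k in p.keys() & q.keys()]


def msd_similarity(p, q):
    """1 / (mean squared difference + 1); 0.0 if nothing is shared."""
    pairs = common_pairs(p, q)
    if not pairs:
        return 0.0
    msd = sum((a - b) ** 2 for a, b in pairs) / len(pairs)
    return 1.0 / (msd + 1.0)


def cosine_similarity(p, q):
    """Cosine of the two profiles restricted to shared keys; 0.0 if degenerate."""
    pairs = common_pairs(p, q)
    num = sum(a * b for a, b in pairs)
    den = (math.sqrt(sum(a * a for a, _ in pairs))
           * math.sqrt(sum(b * b for _, b in pairs)))
    return num / den if den else 0.0


def dense_matrix_gb(n_profiles, bytes_per_entry=8):
    """Size of ONE dense n x n matrix in GB (float64 entries by default)."""
    return n_profiles ** 2 * bytes_per_entry / 1e9


# Toy compound profiles (hypothetical): 1 = active, 0 = inactive.
c1 = {"virus_A": 1, "virus_B": 0, "virus_C": 1}
c2 = {"virus_A": 1, "virus_B": 1, "virus_D": 0}

sim_msd = msd_similarity(c1, c2)      # shared: virus_A, virus_B
sim_cos = cosine_similarity(c1, c2)
gb_species = dense_matrix_gb(158)       # well below 1 MB
gb_compounds = dense_matrix_gb(250_000) # hundreds of GB for a single matrix
```

The estimate covers a single matrix only. Any intermediate N × N arrays that an implementation holds during the calculation add to it, which is one plausible reason why the measured requirements are higher.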
CBF
Prediction of Point Interactions with SGIMC
The problem of matrix completion is considered as an
optimization procedure using the features of compounds and viruses.
The SGIMC algorithm shares the idea of the IMC approach of matrix
completion by combining feature vectors, associated with row and column
entities of the interaction matrix, with a low-rank matrix. Three
matrices are required to train an SGIMC model: a partially filled
interaction matrix and full feature matrices for compounds and viruses.
Our compound feature matrix DB_c.main was filled with Dragon descriptors, whereas the virus feature matrix DB_v.main included only genus assignments.
By design, SGIMC has an option for feature selection, which is implemented through a sparsity-inducing group penalty with the regularization coefficient Cgroup. In addition, the coefficients Cridge and Classo, which weight the squared Frobenius norm and the matrix L1-norm penalties, respectively, are involved in regularization. These regularization coefficients were varied along with the rank of the internal low-rank matrix W and the number of training iterations to choose the best SGIMC model.

For our best model (median ROC AUC = 0.84), the hyperparameters were as follows: rank = 10, number of iterations = 70, Classo = 0.0, Cridge = 120.0, and Cgroup = 0.0. An increase of Classo or Cgroup leads to a notable decrease in performance, whereas an increase of Cridge leads to a slight improvement up to the optimal value of about 120, followed by a slow decline (Figure ).
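The structure of an IMC-type model described above can be sketched as follows: an interaction score is a bilinear form of compound and virus features through a low-rank matrix, and the three coefficients weight three different penalties on the factor matrices. The factorization W = P Qᵀ, the one-hot genus encoding, the function names, and the exact scaling of the penalties are assumptions made for illustration. This is not the SGIMC implementation and it contains no training loop.

```python
import math


def low_rank(P, Q):
    """W = P Q^T for P of shape d_c x k and Q of shape d_v x k."""
    return [[sum(p[r] * q[r] for r in range(len(p))) for q in Q] for p in P]


def bilinear_score(x, W, y):
    """Interaction score x^T W y for compound features x and virus features y."""
    return sum(x[i] * W[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


def penalty(M, c_lasso, c_group, c_ridge):
    """Weighted sum of L1, row-wise L2 (group), and squared Frobenius norms.
    Each row of M corresponds to one feature (assumption)."""
    l1 = sum(abs(v) for row in M for v in row)
    group = sum(math.sqrt(sum(v * v for v in row)) for row in M)
    ridge = sum(v * v for row in M for v in row)
    return c_lasso * l1 + c_group * group + c_ridge * ridge


# Toy factors (hypothetical): 3 compound features, 2 virus features, rank 2.
P = [[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
Q = [[1.0, 1.0], [0.5, 0.0]]
W = low_rank(P, Q)

x = [1.0, 0.5, 3.0]  # descriptors of one compound
y = [1.0, 0.0]       # one-hot genus of one virus
score = bilinear_score(x, W, y)

ridge_only = penalty(P, c_lasso=0.0, c_group=0.0, c_ridge=120.0)
group_only = penalty(P, c_lasso=0.0, c_group=1.0, c_ridge=0.0)
```

In this toy example the third row of P is zero, so the third compound feature does not influence any score. Driving whole rows to zero in this way is how a group penalty can act as feature selection.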
Figure 4
Guided grid search of Classo, Cridge, and Cgroup coefficients for interaction prediction for known compounds
and
viral species based on (a) ROC AUC, (b) mean ROC AUC, and (c) median
ROC AUC. Rank = 10, number of iterations = 70.
Cold-Start Prediction with CBF
The cold-start problem is a possible lack of performance of a recommender system applied to a new compound or virus for which there are no experimental data. The problem is particularly critical for collaborative filtering methods, which are based on the interaction matrix only. To tackle this issue, CBF approaches (e.g., SGIMC), based on available side-channel information (features of compounds/viruses), may be used to obtain reliable predictions in the cold-start mode.

We established the hyperparameters for
the best SGIMC model for the compoundwise CS prediction based on a
cross-validation grid search. The compoundwise CS performance appeared
not to differ substantially from prediction for known compounds and
viruses in the quality of interaction prediction (Figure ). For example, the models
with one of the highest predictive powers (median ROC AUCs of 0.82
and 0.86, respectively) were built on the same hyperparameter set:
rank = 10, number of iterations = 70, Classo = 0.0, Cridge = 120.0, and Cgroup = 0.0. The test set performance of the model with
these hyperparameter values, assessed by median ROC AUC,
was 0.71 for compoundwise CS prediction and 0.69 for point
interaction prediction for known compounds and viruses
(Figure a,b). This substantial
decrease in predictive quality on the external test sets is a result
of differences between their compounds and the compounds in the training
set.
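The difference between point interaction prediction and compoundwise CS prediction lies in how the data are split: in the cold-start mode, all interactions of a held-out compound must be absent from training. A minimal sketch of such a split is given below. The triple layout (compound, species, label), the function name, and the round-robin fold assignment are assumptions made for illustration and do not describe the exact protocol used here.

```python
import random


def compoundwise_folds(interactions, n_folds, seed=0):
    """Yield (train, test) lists such that all interactions of a compound
    fall into the same test fold (cold-start setting for compounds)."""
    compounds = sorted({compound for compound, _, _ in interactions})
    random.Random(seed).shuffle(compounds)
    fold_of = {c: i % n_folds for i, c in enumerate(compounds)}
    for k in range(n_folds):
        train = [t for t in interactions if fold_of[t[0]] != k]
        test = [t for t in interactions if fold_of[t[0]] == k]
        yield train, test


# Toy interaction list (hypothetical): (compound, species, activity class).
interactions = [
    ("c1", "virus_A", 1), ("c1", "virus_B", 0),
    ("c2", "virus_A", 0),
    ("c3", "virus_B", 1), ("c3", "virus_C", 1),
    ("c4", "virus_C", 0),
]
folds = list(compoundwise_folds(interactions, n_folds=2))
```

A specieswise split can be obtained in the same way by grouping on the second element of each triple instead of the first.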
Figure 5
Violin plots of ROC AUC values for viral species: (a) prediction
of point compound–virus interactions, (b) compoundwise CS prediction,
and (c) specieswise CS prediction. The prediction was assessed in
cross-validation (light blue and coral) and external validation (dark
blue, red, and green). Lines depict the dependence of median ROC AUC
scores on the number of iterations. Dotted lines inside the violins
represent the quartiles of the distribution. Rank = 10, Classo = 0.0, Cgroup = 0.0,
and Cridge = 120.0.
SGIMC models for specieswise CS prediction based on the interaction matrix DB_main and feature matrices DB_c.main and DB_v.main appeared to be inferior to the models for compoundwise CS prediction (Figure b,c) (Supporting Information Tables S3 and S4). In specieswise CS prediction, the median value of ROC AUC for all viral species is 0.65 at 70 iterations, while it is 0.75 in the case of compoundwise CS prediction on the external test set (rank = 10, Classo = 0.0, Cridge = 120.0, Cgroup = 0.0). Moreover, the distribution of median ROC AUC values for new compounds (Figure b) is shifted more toward 1 than the distribution of ROC AUC values for new species (Figure c). For example, the top quartile is more than 0.9 for compoundwise CS prediction and more than 0.8 for specieswise CS prediction, while the bottom quartile is close to 0.5 in both cases.

The decrease
of the predictive power in the specieswise CS was apparently caused by the insufficient virus features, represented by the genus assignment only. To test this hypothesis, we modeled the situation with completely uninformative features by replacing the DB_c.main and DB_v.main matrices with unit vectors (Figure ). The models based on the original feature matrices clearly possess the best predictive quality (Figure , red). In the case of the unit vector for species (Figure , blue), the models were based on sufficient information from compound features and were still able to predict interaction values, though with lower prediction quality. In the case of the unit vector for compounds (Figure , green), interaction prediction was not meaningful, indicating that the original species features are not sufficient. We hope that the proper introduction of virus features will improve the results for all scenarios.
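One way to see how uninformative features limit a feature-based model is the toy calculation below. It assumes that a "unit vector" means a single constant feature equal to 1 for every entity and that the score is purely bilinear, x^T W y, without bias terms. Both are assumptions of this sketch, as are the weights and labels, which are hypothetical. Under these assumptions, a constant compound feature gives every compound the same score against a given species, so compounds cannot be ranked within that species profile.

```python
def bilinear_score(x, W, y):
    """Score x^T W y of a purely bilinear model without bias terms."""
    return sum(x[i] * W[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


def roc_auc(labels, scores):
    """Probability that a random active outranks a random inactive
    (ties count as 0.5)."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy activity classes of four compounds against one viral species.
labels = [1, 0, 1, 0]

# Case 1: unit vector for compounds, one-hot genus for the species.
W_species_only = [[0.3, -0.2]]  # shape 1 x 2, hypothetical weights
genus = [1.0, 0.0]
scores_1 = [bilinear_score([1.0], W_species_only, genus) for _ in labels]
auc_unit_compounds = roc_auc(labels, scores_1)

# Case 2: unit vector for the species, two toy compound descriptors.
W_compound_only = [[0.8], [-0.5]]  # shape 2 x 1, hypothetical weights
descriptors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
scores_2 = [bilinear_score(x, W_compound_only, [1.0]) for x in descriptors]
auc_unit_species = roc_auc(labels, scores_2)
```

In the first case all four scores coincide and the per-species ROC AUC equals 0.5 by construction. In the second case the compounds can still be ranked, although the same ranking is then applied to every species.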
Figure 6
Dependence of the median ROC AUC score for point interaction
prediction
on number of iterations through cross-validation with original feature
matrices (red), unit vector for species (blue), and unit vector for
compounds (green) (rank = 10, Classo =
0.0, Cgroup = 0.0, and Cridge = 120.). Error bars represent the SD.
Influence of the Number of Features
The compound feature information in the DB_c.main matrix is redundant, so we supposed that features could be randomly removed without a significant deterioration in prediction quality. To test this hypothesis, we carried out a series of experiments for known compounds and viruses in which the DB_c.main matrix was replaced with the truncated feature matrices DB_c.50d, DB_c.25d, DB_c.10d, DB_c.8, and DB_c.1.

This experiment showed that the compound feature information in DB_c.main is excessive for the SGIMC models (Figure ). Reducing the number of features to 50, 25, and 10% steadily, but only slightly, reduced the prediction quality of the models. For the models built on the DB_c.main, DB_c.50d, DB_c.25d, and DB_c.10d sets, the median ROC AUC scores were 0.81, 0.78, 0.74, and 0.72, respectively (rank = 10, number of iterations = 60, Classo = 0.0, Cridge = 120.0, Cgroup = 0.0). With the same hyperparameters, the models based on the eight simplest features showed a critical decrease in prediction quality, with a median ROC AUC of 0.67. The models based on unit vectors were not predictive at all.
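Random truncation of a feature matrix, as used for the reduced descriptor sets above, can be sketched as follows. The list-of-rows representation, the function name, the seed, and the rounding rule are assumptions made for illustration. The sketch only shows the column subsampling step and not how the truncated matrices in this work were actually generated.

```python
import random


def truncate_features(matrix, fraction, seed=0):
    """Keep a random subset of feature columns.

    matrix: list of rows, one row of descriptor values per compound.
    Returns the truncated matrix and the sorted indices of the kept columns.
    """
    n_features = len(matrix[0])
    n_keep = max(1, round(n_features * fraction))
    keep = sorted(random.Random(seed).sample(range(n_features), n_keep))
    return [[row[j] for j in keep] for row in matrix], keep


# Toy descriptor matrix (hypothetical): 3 compounds x 8 features.
full = [[float(10 * i + j) for j in range(8)] for i in range(3)]

half, kept_half = truncate_features(full, 0.50)
quarter, kept_quarter = truncate_features(full, 0.25)
tenth, kept_tenth = truncate_features(full, 0.10)
```

The same columns are kept for every compound, so the truncated matrix remains a consistent descriptor table that can replace the full one without other changes to the model.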
Figure 7
Dependence of mean ROC
AUC (a) and median ROC AUC (b) for models
with a different number of compound features on the number of iterations.
Rank = 10, Classo = 0.0, Cgroup = 0.0, and Cridge =
120.0. Compound feature matrices: DB_c.main (red ★), DB_c.50d (blue
■), DB_c.25d (magenta ◆), DB_c.10d (green ×), DB_c.8 (orange •), and DB_c.1 (light blue ▲). Error bars represent
the SD.
SGIMC allows one to filter out noninformative features using the Cgroup penalty coefficient: as Cgroup increases, the number of selected features decreases. In the SGIMC authors’ benchmarks, the method was able to select about 6000 features from the set of 355 709.[52] However, in our experiment, an increase of Cgroup led to a decrease of the mean/median ROC AUC values and, thereby, of the predictive quality of the model. The extent of the decrease depended on the other hyperparameters (Figure ).
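The link between a larger group penalty and a larger number of zeroed features can be illustrated with row-wise soft thresholding, the proximal operator of the row-wise L2 norm that is a standard building block of group-sparse solvers. The sketch below is not the SGIMC solver. The factor matrix, the thresholds, and the function names are hypothetical, and each row is assumed to correspond to one feature.

```python
import math


def group_soft_threshold(M, threshold):
    """Shrink every row of M toward zero; rows whose L2 norm does not
    exceed the threshold become exactly zero."""
    out = []
    for row in M:
        norm = math.sqrt(sum(v * v for v in row))
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        out.append([v * scale for v in row])
    return out


def zeroed_rows(M):
    """Number of rows (features) that are entirely zero."""
    return sum(1 for row in M if all(v == 0.0 for v in row))


# Toy factor matrix (hypothetical) with row norms 5, 0.5, and 1.
P = [[3.0, 4.0], [0.3, 0.4], [0.0, 1.0]]

thresholds = [0.0, 0.6, 2.0, 6.0]
n_zeroed = [zeroed_rows(group_soft_threshold(P, t)) for t in thresholds]
```

Features with small weights are removed first, and the surviving rows are shrunk as well, which is one reason why a strong group penalty can reduce predictive quality even before many features are dropped.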
Figure 8
Influence of the Cgroup regularization
coefficient in cross-validation for point interaction prediction on
the mean/median ROC AUC at 70 (a) and 10 (b) iterations. Continuous
and dashed red lines indicate the mean and median ROC AUC, and continuous
and dashed blue lines indicate the mean and median number of zeroed
features. Shaded areas represent the corresponding standard deviations.
The black dash-dotted line shows median ROC AUC with 50% of compound
features. Classo = 0.0, Cridge = 120.0, and rank = 10.
It is clear from the comparison of Figures and 8 that models based on the same number of selected compound features possess different predictive powers depending on whether the features were selected randomly (Figure ) or using the Cgroup coefficient (Figure ). For example, in cross-validation for point interaction prediction with 50% of the compound features, the ROC AUC scores were 0.79 for the models with random selection and 0.68 for the models with Cgroup selection (number of iterations = 70, rank = 10, Classo = 0.0, Cridge = 120.0). This difference arose because the Cgroup coefficient was applied to both compound and species feature selection, i.e., using Cgroup zeroed compound and species features simultaneously. Simultaneous feature selection is a sensible strategy when there is a huge number of noisy features, which is not the case in our task, characterized by the insufficiency of species features. Here, it led to a critical loss of feature information and deterioration of model quality. Separate regularization coefficients for the compound and species feature matrices should solve this problem.
Conclusions
Multitask prediction algorithms have been gaining ground rapidly
with the appearance of databases storing multitarget data. The recommender
system (RS) as an approach of multitask prediction may be a powerful
tool for compound–target interaction prediction. These methods
allow one to predict the activity class for all combinations of compounds
and targets in a data set and select the best of them for further
experimental investigations. However, the current experience in this
domain is limited and far from complete.

Our experiments demonstrated that RS algorithms based on collaborative and content-based filtering, applied to a sparse matrix of antiviral activity data, can achieve stable performance in antiviral activity class prediction. Both approaches showed a very high predictive ability in cross-validation and external validation as measured by ROC AUC and mean/median ROC AUC.

Collaborative filtering (CF) methods demonstrate high performance, but they possess several crucial limitations. The models based on
the calculation of compound profile similarity demonstrate the best
predictive ability among the investigated CF methods but the application
of these methods is challenging due to the requirement of a huge amount
of RAM for the similarity calculation and storage. Improvement of
the algorithm by reducing the required RAM during model building would
allow the wider use of these methods for data sets with thousands
of compounds. The matrix factorization methods lead to models with
moderate predictive ability. Their preference over other CF methods
is determined by the simplicity of their application. The main disadvantage
of all CF methods is a limited applicability domain: we can make a
prediction only for compounds or viruses whose interaction profiles
were used during model creation.

The application of content-based filtering (CBF) algorithms is
preferable because of the possibility of using feature information
for compounds and viruses. Using compounds’ features, CBF makes
the cold-start prediction possible. The main disadvantage of the approach
is the requirement of generation and processing of additional feature
information, which can be a challenging task in the case of viruses,
and may require a lot of computational resources. The SGIMC method
allows one to reduce the number of used features to several thousand,
which was not possible in our case with only 2016 features. Without
feature selection, the SGIMC algorithm was virtually reduced to the
IMC. Using this algorithm, we demonstrated that the prediction of
antiviral activity for both new and known compounds against known
viruses can be performed with rather high accuracy, while prediction
of the antiviral activity of known compounds against new viruses was
less accurate due to the insufficient characterization of the viruses
in our data set. We believe that the development of appropriate virus
features could solve the problem; however, this may be a challenging task in itself.

This research revealed the promising applicability and effectiveness of RS approaches in drug discovery. We hope that further progress in this field can be achieved with hybrid RS approaches that can make the best of CF and CBF models.