Apichat Suratanee1, Teerapong Buaboocha2,3, Kitiporn Plaimas3,4. 1. Department of Mathematics, Faculty of Applied Science, King Mongkut's University of Technology North Bangkok, Bangkok, Thailand. 2. Department of Biochemistry, Faculty of Science, Chulalongkorn University, Bangkok, Thailand. 3. Omics Sciences and Bioinformatics Center, Faculty of Science, Chulalongkorn University, Bangkok, Thailand. 4. Advanced Virtual and Intelligent Computing (AVIC) Center, Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, Bangkok, Thailand.
Abstract
Malaria caused by Plasmodium vivax can lead to severe morbidity and death. In addition, resistance has been reported to existing drugs in treating this malaria. Therefore, the identification of new human proteins associated with malaria is urgently needed for the development of additional drugs. In this study, we established an analysis framework to predict human-P. vivax protein associations using network topological profiles from a heterogeneous network structure of human and P. vivax, machine-learning techniques and statistical analysis. Novel associations were predicted and ranked to determine the importance of human proteins associated with malaria. With the best-ranking score, 411 human proteins were identified as promising proteins. Their regulations and functions were statistically analyzed, which led to the identification of proteins involved in the regulation of membrane and vesicle formation, and proteasome complexes as potential targets for the treatment of P. vivax malaria. In conclusion, by integrating related data, our analysis was efficient in identifying potential targets providing an insight into human-parasite protein associations. Furthermore, generalizing this model could allow researchers to gain further insights into other diseases and enhance the field of biomedical science.
Plasmodium is a parasite that has proven difficult to
eradicate. Plasmodium vivax is 1 of the 5 Plasmodium species
that infect humans.
P. vivax is virulent in humans and can survive
in human hosts, and it has been categorized as a benign infection. At present, however,
P. vivax malaria is recognized as a cause of severe morbidity
and mortality.
Approximately 14.3 million cases of P. vivax infection are
recorded annually.
Although the global incidence of P. vivax malaria infection
has decreased by 42% since 2000, the disease burden has increased in the Middle East
and South America since 2013.
In addition, P. vivax is able to evolve its strategy to
interact with the host, which has led to the development of drug-resistant
parasites. The first-line treatment drug for P. vivax is
chloroquine to treat blood-stage parasitemia together with primaquine to eradicate
persistent liver-stage infection.
However, P. vivax parasites resistant to their respective
first-line therapies have been found in Southeast Asia.
Recently, tafenoquine, a promising new drug, has been highlighted as a
radical cure for P. vivax infection. It
resulted in a significantly lower risk of P. vivax recurrence than
placebo in patients with normal glucose-6-phosphate dehydrogenase (G6PD) activity.
However, tafenoquine causes hemolysis in patients with G6PD deficiency.
Therefore, there is a need for testing G6PD activity before prescription of
tafenoquine.[7-9] The
Plasmodium parasite has the ability to evade the human immune
system, recruit host responses to regulate its life cycle, and adapt to the host environment.
Specifically, P. vivax invades erythrocytes during
blood-stage growth in humans. Duffy antigen receptor for chemokines (DARC), which is
a host receptor, is recognized by a critical invasion ligand, P.
vivax Duffy Binding Proteins (DBP), for the invasion of immature red
blood cells.
Therefore, DBP has been highlighted as a leading vaccine candidate against
P. vivax malaria.
To control this parasite, we require a better understanding of host-parasite
interactions, which is crucial for the development and design of therapeutic
approaches for this infectious disease.

Although recent technological advances in high-throughput techniques have enabled the
characterization of proteins that may be involved in the parasitic invasion of
target cells, maintaining a continuous in vitro culture for
P. vivax is still very difficult to standardize.
This is the main obstacle to the development of a new effective vaccine.
However, computational methods can be employed to solve this problem. One of the
most widely used methods is a network-based approach that focuses on protein-protein
interaction (PPI) networks. The analysis of a PPI network has been widely studied in
several organisms.[14-17] In
Plasmodium, several studies have investigated the PPI networks
with the aim of revealing many important aspects of protein interactions.[10,18-24] Most studies of PPI networks
have applied the calculation of degree and centralities, focusing on a single
organism in their analyses. In addition, PPI networks have also been used to study
the associations between proteins and diseases[14,25-27] and host-parasite protein
associations.[10,18,19,24,28] Saha et al
investigated the characteristics of a host-pathogen protein interaction
network based on interconnectivity and centrality properties. They analyzed the
significance of central, peripheral, hub and non-hub protein nodes in the infection
process of malaria. They also found few topologically unimportant but biologically
significant proteins between humans and malaria. Notably, most such studies have
been performed for Plasmodium falciparum. Several studies have used
ortholog-based methods to predict the association of proteins across
species.[29-33] Specifically, Cuesta-Astroz
et al
developed a method based on orthologous proteins to identify a transferred
interaction between host and parasite proteins. They identified common and specific
mechanisms of parasitic infection and survival in 15 human parasites. They also
intensively analyzed the human-Schistosoma mansoni protein
interaction network and revealed biological processes, pathways, and tissue-specific
interactions that may be essential in the life cycle of the parasites. Lee et al
predicted PPIs between P. falciparum calmodulin and
H. sapiens proteins based on orthologous pairs. From the
associations between host and parasite, they found that P.
falciparum may use calcium-modulating proteins in the host cell to
maintain the Ca2+ levels. Recently, a heterogeneous network has been
developed to propagate interaction information from the human PPI network and the
P. vivax PPI network to infer new associations between human
and P. vivax proteins.
This method was based on protein interactions that were considered to
globally represent these 2 networks. The study used protein similarities between
human and parasite proteins to establish their associations; the idea behind this is
that a malaria protein that is homologous to a human protein may interact or work
together with human proteins to sustain the parasite in the host and be related to
the same set of cooperative proteins in humans. Thus, the study of the relationship
between similar proteins in humans and malarial parasites is of great interest to
investigate their network topology in PPI networks. Similar proteins may also have
the same level of importance in the PPI, as the centrality measures reflect the
essentiality of a protein in terms of the network topology and connections under a
specific aspect of the measure. For example, the betweenness centrality provides an
insight into a node that may be involved with the paths of communication of any
pairs of nodes in the network.[17,35,36] Therefore, the integration of
these network topologies for the recognition of human-parasite protein associations
via machine learning has the potential to provide important insights and reveal new
associations and protein targets in human hosts.

In this study, alternative properties based on local network topology features and
machine-learning techniques were used to elucidate new associations between human
and P. vivax proteins. The associations presented in this study
indicate the existence of functional interactions between human and P.
vivax proteins, implying that these proteins cooperate to perform a
task in the underlying mechanisms. A ranking technique was also developed to predict
potential protein targets in humans which may be important for the treatment of
P. vivax malaria. Clustering analysis was performed using
information from the heterogeneous network analysis to identify groups of related
proteins and functional proteins. Finally, a list of human proteins that are crucial
for the cellular mechanisms of P. vivax was reported and validated
via a literature search. This list may be useful in further studies that aim to
develop drugs for the treatment of P. vivax malaria.
Materials and Methods
Overview of the analysis framework
The analysis framework was initiated with the network reconstruction process as
shown in Figure 1.
First, PPI networks for humans and malarial parasites were constructed based on
the interaction information obtained from the STRING database.
Each protein node in each network was then extracted for its network
topological features such as the degree and the betweenness centrality.
Subsequently, both networks were linked together to form a heterogeneous network
based on their protein sequence similarity. Then, the topological features of a
pair of human and malaria proteins were compared and evaluated to obtain the
strength of the differences and to build a similarity profile of the
human-parasite protein pairs. The protein sequence similarities obtained from
BlastP searches (E-value ⩽ 1e−05) were then used as an
initial class label of a pair of human and P. vivax proteins.
The complete profile was then applied to various machine-learning techniques
(naïve Bayes, neural network, random forest, and support vector machine).
Cross-validations were performed for each technique, and the performances were
measured using the receiver operating characteristic (ROC) curve. The top
classifiers from the best technique were selected as models to predict new
potential associations. Finally, the human proteins in the list of predicted
associations were ranked to identify potential protein targets for malaria
invasion in the human host.
Figure 1.
Analytical framework. An overview of the identification processes to
infer human protein targets from human-parasite protein associations
obtained using machine-learning methods with network topology
features.
Network construction and topology features
Our analysis was performed on PPI networks of human proteins and P.
vivax proteins. The networks were obtained from the STRING database
(version 11.0).
To ensure that only reliable interactions were obtained, interactions
with a high confidence score (>900) were retained. A total of 12 038 human
proteins with 313 359 interactions and 1787 P. vivax proteins
with 11 477 interactions were obtained. Subsequently, a heterogeneous network
was constructed by connecting human-human protein interactions and P.
vivax-P. vivax protein interactions with the
human-P. vivax protein associations.

The network topology features of all proteins were extracted based on centrality
measurements. Several studies have shown that a relationship exists between gene
essentiality and network centrality in PPI networks.[38-40] Thus, we further
investigated 5 topological features: betweenness centrality, closeness
centrality, degree, eccentricity, and Kleinberg’s hub centrality. Each of these
features captures a different aspect of the network topology. Betweenness centrality
reflects the importance of a node in terms of the load of paths passing through it in
the communication of the network.[35,36] Closeness centrality
measures how close a given node is to the other nodes in the network.[35,36] The
degree represents the level of the local connections of a given node.[35,36]
Eccentricity calculates the local density of the connections among neighboring
nodes of a given node. The Kleinberg’s hub measures the importance of a given
node connecting the other important nodes.
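The filtering and feature-extraction steps described above can be illustrated with a minimal sketch. It assumes edges are available as (protein, protein, combined score) tuples and uses only the Python standard library; the function names and the toy edge list are hypothetical and are not taken from the study. Only degree, closeness, and eccentricity are shown. Eccentricity is computed here in its standard graph-theoretic sense (the greatest shortest-path distance from a node), and closeness is computed within the connected component of each node, which may differ from the exact definitions used by the authors' software.

```python
from collections import deque

def build_graph(scored_edges, min_score=900):
    """Keep edges whose combined score exceeds min_score; return adjacency sets."""
    adj = {}
    for a, b, score in scored_edges:
        if score > min_score and a != b:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj

def bfs_distances(adj, source):
    """Shortest-path distances (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def node_features(adj):
    """Degree, closeness, and eccentricity for every node (assumed definitions)."""
    feats = {}
    for node in adj:
        dist = bfs_distances(adj, node)
        reachable = len(dist) - 1
        total = sum(dist.values())
        feats[node] = {
            "degree": len(adj[node]),
            "closeness": reachable / total if total else 0.0,
            "eccentricity": max(dist.values()),
        }
    return feats

# Toy edge list (hypothetical identifiers); the A-D edge falls below the cutoff.
edges = [("A", "B", 950), ("B", "C", 990), ("C", "D", 910), ("A", "D", 400)]
adj = build_graph(edges)
features = node_features(adj)
```

Betweenness centrality and Kleinberg's hub score are omitted to keep the sketch short; in practice a graph library would normally be used for all 5 measures.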
Defining the human-P. vivax protein associations
To define the initial associations between human and P. vivax
proteins, we used the information obtained from a sequence similarity search.
When 2 protein sequences shared significant similarity with the
BlastP expectation value (E-value) less than 1e−05, they
were inferred to be homologous. This means that they did not arise
independently, but rather shared a common ancestor.
Therefore, we could define an association between 2 sequences when they
share more similarity than would be expected by chance. However, when no
statistically significant match was found between the 2 protein sequences, we
could not ensure that no homologs were present. Thus, the machine-learning
method may be able to reveal hidden homologs. The P. vivax
protein sequences were retrieved from the Kyoto Encyclopedia of Genes and
Genomes (KEGG) database[42,43] using the Rcpi package
and then searched against all human protein sequences from the NCBI
database. Two protein sequences were defined as homologous when
BlastP (https://blast.ncbi.nlm.nih.gov) gave an E-value less than
1e−05, and the pair was then labeled as associated.

In addition, the relationship between network topologies and functions has been
revealed in several studies with the assumption that for each function, the
wiring patterns of the proteins are similar.
Different standard network topologies can be used to understand the
information contained in the wiring of a protein in the PPI.[45,46]
Therefore, we integrated initial associations from the protein sequence
similarity search and the similarities from network topological features and fed
them into machine-learning algorithms to predict new associations using both
types of similarity information. It is worth noting that our method is a
homology-based method that relies on sequence similarity, similar to previous
studies.[29-34] Protein associations were
predicted based on the initial associations from sequence similarity. Moreover,
homology-based methods have been used to infer functionally interacting proteins
in previous studies.[29-34]
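As an illustration of the labeling rule, the following sketch assigns initial associations from a list of BlastP hits. The input format (parasite identifier, human identifier, E-value), the identifiers, and the function name are assumptions made for the example. A strict "less than 1e−05" threshold is used here, as stated in this section.

```python
E_VALUE_CUTOFF = 1e-5  # threshold stated in the text

def label_associations(blast_hits, cutoff=E_VALUE_CUTOFF):
    """Return the set of (parasite, human) pairs with an E-value below the cutoff."""
    positives = set()
    for parasite_id, human_id, e_value in blast_hits:
        if e_value < cutoff:
            positives.add((parasite_id, human_id))
    return positives

# Hypothetical hits: (P. vivax protein, human protein, E-value)
hits = [
    ("PV_0001", "HS_A", 3e-40),
    ("PV_0001", "HS_B", 0.2),
    ("PV_0002", "HS_A", 1e-5),   # not strictly below the cutoff
    ("PV_0003", "HS_C", 2e-7),
]
positives = label_associations(hits)
```

Every pair that is not in `positives` would fall into the undefined set, which is not treated as a confirmed negative.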
Features of topological differences for machine learning
Based on the 5 network topology features, we established a vector
$V(h_i, p_j)$, that is, a similarity profile, representing the relationship
between the topological values of a human protein $h_i$
and a P. vivax protein $p_j$, as follows:

$$V(h_i, p_j) = \left( d_1(h_i, p_j),\ d_2(h_i, p_j),\ \ldots,\ d_5(h_i, p_j) \right),$$

where $d_k(h_i, p_j) = \left| c_k(h_i) - c_k(p_j) \right|$, $i = 1, 2, \ldots, m$ and
$j = 1, 2, \ldots, n$. $m$ and $n$ are the numbers of human and P. vivax
proteins, respectively. $k$ is the index for each topological
feature, ranging from 1 to 5.
$c_k(h_i)$ represents the kth centrality value of a
human protein $h_i$, and $c_k(p_j)$
represents the kth centrality value of a
P. vivax protein $p_j$. Therefore, $d_k(h_i, p_j)$
denotes the topological similarity between the
kth centrality values of human protein i
and P. vivax protein j. A low value of $d_k(h_i, p_j)$
indicates a high similarity between the topological features
k of these 2 different types of proteins.
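A minimal sketch of how such a profile could be assembled is given below. It assumes that each per-feature entry is the absolute difference between the two centrality values, which is consistent with the statement that a low value indicates high similarity but is an assumption about the exact form. The feature keys and the example values are hypothetical.

```python
# Order of the 5 topological features (key names are illustrative).
FEATURES = ("betweenness", "closeness", "degree", "eccentricity", "hub")

def similarity_profile(human_feats, parasite_feats):
    """Per-feature absolute difference (assumed form) between two proteins."""
    return [abs(human_feats[k] - parasite_feats[k]) for k in FEATURES]

human = {"betweenness": 0.02, "closeness": 0.25, "degree": 12,
         "eccentricity": 7, "hub": 0.004}
parasite = {"betweenness": 0.05, "closeness": 0.5, "degree": 8,
            "eccentricity": 7, "hub": 0.001}

profile = similarity_profile(human, parasite)
```

Each human-parasite pair yields one 5-element vector, and these vectors form the feature matrix passed to the classifiers.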
Training and validating of the association classifiers and calculating
association scores
We investigated all possible pairs of proteins to identify human-parasite protein
associations. To this end, we employed machine-learning techniques to classify
defined and undefined associations. Four classification algorithms, namely naïve
Bayes, neural network, random forest, and support vector machine algorithms,
were employed. Each of these classifiers is a well-known algorithm for
recognizing and creating classifiers in different ways. The naïve Bayes’
approach uses the statistics and likelihoods to make a final decision. A neural
network calculates a set of optimal weights for a weighted network structure to
separate different classes based on the features. Random forest creates complex
and hierarchical rules along the features to provide a predicted class. The
support vector machine builds a hyperplane to identify an optimal classifier
with maximum margin. Because these algorithms search for
the best solution in different ways, all 4 were applied to identify
the best classifier. Different parameters of each algorithm were optimized to
determine the optimal models of each algorithm.

For the naïve Bayes classification, we tuned 3 hyperparameters. The first
parameter selected between a kernel density estimation and a Gaussian density
estimation. The second parameter adjusted the bandwidth of the kernel
density when kernel density estimation was used; we optimized
it from 0 to 5. The third parameter was the
Laplace smoother, which we tuned from 0 to 5.

For neural networks, we optimized the number of units in the hidden layers
(H) and the weight decay used to avoid overfitting
(d) by employing a grid search with H = 1,
2, 3, ..., 10 and d = 0.5, 0.1, 1e−2, 1e−3, 1e−4, 1e−5, 1e−6,
and 1e−7. The maximum number of iterations was set to 1000.

For the random forest algorithm, we varied the number of variables randomly
sampled at each split, using values of 2^n for
n ∈ {0, 1, 2, 3, 4, 5}.

For the support vector machine, we used a radial basis kernel and optimized the
cost of false classification (C) and kernel width (γ) by
employing a grid search with C = {0.75, 1.0, 1.25} and γ =
{0.01, 0.015, 0.2}.

Ten 10-fold cross-validations were performed to evaluate the performance of the
classifiers. In each repetition, the undefined association set was randomly sampled
to the same size as the defined set. A total of 80% of these data were used to
optimize the parameters using the cross-validation technique. In each round of
the cross-validation, the defined and undefined associations were randomly split
into 10 parts of equal size. Nine parts were concatenated and used to train and optimize
the parameters. Testing was performed with the remaining part, and the
performance was measured by comparing the predictions with the true class labels.
This experiment was repeated 10 times, each time with a newly sampled undefined set. Several
cutoffs on the probabilities of positive class predictions were applied,
yielding an ROC curve, which is a plot of the true-positive rate (TPR) against
the false-positive rate (FPR) at the different cutoffs. Using the ROC curve, a
broader view of the performance over various cutoffs could be obtained by
calculating the area under the curve (AUC). An AUC of 1 indicates that the
classifier separates the 2 classes perfectly,
whereas an AUC of 0.5 indicates that the performance is no better than
random prediction.

Subsequently, the AUCs of the aforementioned 4 classification algorithms were
compared. The algorithm with the highest AUC was used as the prediction model.
Ten classifiers from the final model were employed as the ensemble classifiers.
Each classifier provided the probabilities of positive prediction for a
human-parasite protein pair. The voting score (S) was
calculated as the average of the probabilities from the 10 classifiers. Therefore, the
score was computed as follows:

$$S = \frac{1}{10} \sum_{M=1}^{10} P_M,$$

where $P_M$ is the probability of a positive prediction derived from the
output of the Mth machine. The score was applied to all defined
and undefined associations in this study.
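The balanced sampling of the undefined set and the averaging of classifier outputs can be sketched as follows. The function names, the seed argument, and the toy data are hypothetical; the probabilities stand in for the outputs of 10 trained classifiers, which are not reproduced here.

```python
import random

def sample_balanced_negatives(undefined_pairs, n_positive, seed):
    """Draw as many undefined pairs as there are defined (positive) pairs."""
    rng = random.Random(seed)
    return rng.sample(undefined_pairs, n_positive)

def voting_score(probabilities):
    """Average positive-class probability over the ensemble of classifiers."""
    return sum(probabilities) / len(probabilities)

# Toy undefined set and a positive set of size 5 (illustrative only).
undefined = [("H%d" % i, "P%d" % (i % 3)) for i in range(50)]
negatives = sample_balanced_negatives(undefined, n_positive=5, seed=0)

# Positive-class probabilities from 10 classifiers for one protein pair.
score = voting_score([0.9, 0.8, 1.0, 0.7, 0.6, 0.9, 0.8, 1.0, 0.7, 0.6])
```

Drawing a fresh undefined sample for each repetition means that each of the 10 classifiers sees a different set of unlabeled pairs, which is one reason to average their outputs.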
Ranking score calculation for each human protein
Using machine-learning algorithms to perform the classifications, we obtained a
promising list of human-parasite protein associations. It would be interesting
to use these associations to identify human proteins crucial for the P.
vivax malaria mechanism. It is worth noting that one human protein
could be associated with more than 1 P. vivax protein. To
identify the impact of a human protein on the list, we applied a ranking method
for all human proteins in the list. The probability of a positive prediction for
a pair of human and P. vivax proteins was used to rank the
protein pairs. The pair with the highest probability value was ranked first.
Notably, several pairs can have the same probability value. In this case, they
were assigned the same rank. The ranking score of a human protein $h_i$
was calculated as follows:

$$RS(h_i) = \frac{1}{\min_{j} r(h_i, p_j)},$$

where $r(h_i, p_j)$ is the rank of a pair of a human protein $h_i$
and P. vivax protein $p_j$, for all possible $j$,
according to the prediction probability score of the
association.
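The ranking procedure can be sketched under two explicit assumptions: tied pairs share a rank using dense ranking (1, 1, 2, ...), and the score of a human protein is the reciprocal of its best pair rank, so that a protein appearing in a top-ranked pair scores 1. Neither choice is confirmed by the text beyond the statements that ties share a rank and that the best score is 1. All names and values are illustrative.

```python
def rank_pairs(pair_scores):
    """Dense ranking: the highest score gets rank 1 and tied pairs share a rank."""
    distinct = sorted(set(pair_scores.values()), reverse=True)
    rank_of_score = {s: r for r, s in enumerate(distinct, start=1)}
    return {pair: rank_of_score[s] for pair, s in pair_scores.items()}

def human_ranking_scores(pair_ranks):
    """Assumed score: reciprocal of the best (lowest) rank over all pairs of a protein."""
    best = {}
    for (human, _parasite), rank in pair_ranks.items():
        best[human] = min(rank, best.get(human, rank))
    return {human: 1.0 / r for human, r in best.items()}

# Voting scores for hypothetical (human, parasite) pairs.
pair_scores = {
    ("H1", "P1"): 1.0,
    ("H1", "P2"): 0.4,
    ("H2", "P1"): 1.0,
    ("H3", "P3"): 0.7,
}
ranks = rank_pairs(pair_scores)
scores = human_ranking_scores(ranks)
```

With this form, a human protein associated with several parasite proteins is scored by its strongest association rather than by the number of associations.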
Gene ontology enrichment analysis
To infer gene functions from the human candidate sets, we employed Gene Ontology
(GO) enrichment analysis to determine which GO terms were overrepresented in our
candidate proteins. To this end, the Cytoscape 3.7.2
plugin ClueGO v2.5.6
was used. ClueGO constructed a gene network based on GO terms by
employing all differentially expressed genes. A 2-sided hypergeometric test with
Benjamini-Hochberg correction was performed to identify the significant GO
terms. Only GO terms with adjusted p-values less than 0.05 were
considered.
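The statistical idea behind the enrichment step can be illustrated with a short sketch. It shows a one-sided (over-representation) hypergeometric tail followed by the Benjamini-Hochberg adjustment; it does not reproduce ClueGO or the 2-sided variant used in the study, and the function names and example p-values are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the GO term."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adjusted]
```

A GO term would be retained only when its adjusted p-value falls below 0.05, matching the cutoff stated above.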
Results
Network structures and node properties of human and P. vivax networks
In this study, we constructed 2 PPI networks of human and P.
vivax from the information of the STRING database.
The reconstructed human PPI network consisted of 12 038 proteins and 313
359 edges, while the malaria PPI network comprised 1787 proteins and 11 477
edges. The structures of the human PPI network and malaria PPI network followed
the power-law distribution (Figure 2A and B, respectively), indicating that there are small numbers of
high-degree nodes and large numbers of low-degree nodes in the networks. The
topological network features of each protein were calculated based on node
properties in the networks, namely betweenness centrality, closeness centrality,
degree, eccentricity, and Kleinberg’s hub. The deviations of these features are
shown as boxplots in Figure
3. Interestingly, both networks had similar average betweenness
centrality, degree and eccentricity, but large differences in closeness
centrality and a small difference in Kleinberg’s hub. A node with a high
betweenness score was indicative of a node with overloading paths passing
through it, that is, the node may act as a bridge between 2 or more communities.
The boxplot of betweenness centrality scores showed that both human and parasite
networks had a similar mean overload for each node in the entire network.
Likewise, the mean degrees and eccentricities were similar for both
networks.
Figure 2.
Degree distributions of 2 networks: the degree distributions of (A) human
protein-protein interaction network and (B) malaria protein-protein
interaction network.
Figure 3.
Boxplots for the properties of each node.
Closeness centrality measures how centrally a given node is located,
that is, how short its paths to the other nodes are. The
human network showed lower closeness scores than the parasite
network. This may be because the human network contains
several proteins and protein interactions that form protein complexes,
in contrast to the parasite network. Kleinberg’s hub represents the protein
nodes that may connect to other important nodes in the network. The boxplot
shows that, on average, human proteins are slightly more likely to connect with
other important nodes than parasite proteins are. Although the boxplots
show the overall distributions of each node property in the entire network, they
do not represent all single differences of each protein in both networks. In
addition, these differences may provide a good view of how human and parasite
proteins relate to each other in terms of the cooperative community in the
network. Thus, the similarity profiles of these topological node properties for
each pair of human and Plasmodium proteins were determined.
This profile was used as a feature to train the machine-learning
classifiers.

We calculated the topological similarity of each feature for each pair of human
and Plasmodium proteins. All possible combinations of these 2
types of proteins resulted in 225 675 478 human-Plasmodium
protein pairs. Next, the similarity features based on the node properties were
calculated (see Materials and Methods) for each pair of
human-Plasmodium proteins. Initially, we defined 19 939
pairs as positive association pairs based on protein sequence similarities. The
remaining pairs, namely 225 655 539 pairs, were defined as an undefined set.
These data sets were prepared to be fed into the established classification
processes. Before the classification process, it was interesting to analyze the
topology features to determine the relationship between proteins in the positive
pairs. We then calculated an uncentered correlation of each node property
between human and parasite proteins in the positive set, as shown in Table 1. This
uncentered correlation provides the value of the relationship, ranging from 0 to
1. As expected, we found a high correlation of closeness centrality between the
human and parasite proteins, with a correlation coefficient of 0.9805. In
addition, a moderate correlation of eccentricity between the human and parasite
proteins with a correlation coefficient of 0.6827 in the positive set was
observed. A low correlation of degree and betweenness centrality between human
and parasite proteins was observed, with correlation coefficients of 0.3507 and
0.1316, respectively. With Kleinberg’s hub, no correlation was observed, with
correlation coefficient of 0.0556 between human and parasite proteins. The
characterization of the topological features of human and parasite protein
interaction networks may help to identify underlying proteins that cooperate
with host cell recognition and invasion by parasite proteins.
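For reference, the sketch below computes an uncentered correlation, assuming the standard definition in which the means are not subtracted (equivalent to the cosine of the angle between the two vectors). The text does not give the formula, so this definition is an assumption; for non-negative centrality values it yields a coefficient between 0 and 1, in line with the range stated above. The input vectors are illustrative.

```python
from math import sqrt

def uncentered_correlation(x, y):
    """Correlation without mean subtraction: sum(x*y) / (||x|| * ||y||)."""
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return numerator / denominator if denominator else 0.0

# Centrality values of the human and parasite members of each positive pair.
human_values = [1.0, 2.0, 3.0]
parasite_values = [2.0, 4.0, 6.0]
r = uncentered_correlation(human_values, parasite_values)
```

Because the means are not removed, two features that are both large in absolute terms can show a high coefficient even when they do not co-vary strongly, which should be kept in mind when comparing features measured on different scales.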
Table 1.
Correlation coefficient values of each topological feature between human
and parasite proteins in the positive set.
Degree | Closeness | Betweenness | Eccentricity | Kleinberg’s hub
0.3507 | 0.9805 | 0.1316 | 0.6827 | 0.0556
Performance of the classifications used to recognize human-parasite protein
associations
Four classification algorithms, naïve Bayes, neural network, random forest, and
support vector machine, were used to recognize human-parasite protein
associations. Their performances were compared to select the best classifier for
the recognition of human-parasite protein similarities, based on topological
features. Ten 10-fold cross-validations were applied for each algorithm, which
yielded the performance in terms of an ROC curve with an AUC, as shown in Figure 4. The random
forest algorithm provided the best classifier, with an AUC of 0.85. The neural
network algorithm yielded a slightly lower performance, with an AUC of 0.79.
Similarly, the support vector machine achieved an AUC of 0.77. The naïve Bayes
classifier yielded a slightly lower performance compared with that of the neural
network and support vector machine, with an AUC of 0.74. Notably, the random
forest algorithm provided the best performance, with an AUC clearly
higher than those of the other algorithms. This is of great interest because the
results obtained for this algorithm indicate its potential for identifying new
human-parasite protein associations and, furthermore, for selecting key human
proteins for the parasite.
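The AUC values compared above can be understood through a small sketch. It uses the equivalence between the area under the ROC curve and the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair, with ties counted as one half. The function name and the toy labels and scores are hypothetical, and the quadratic loop is meant for illustration rather than for data sets of the size used in the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
auc = roc_auc(labels, scores)
```

Under this reading, a classifier that assigns the same score to every pair obtains an AUC of 0.5, which corresponds to the random baseline.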
Figure 4.
Receiver operating characteristic (ROC) curves for the predictions of
human-parasite protein associations of each machine-learning
algorithm.
AUC indicates area under the curve; ROC, receiver operating
characteristic.
The classifier showed a better performance than random selection, which
may result in 50% correct predictions. Moreover, we attempted to demonstrate the
reliability of the relationship between sequence similarity and network
topologies by performing several random experiments. These experiments were
performed by randomly shuffling class labels and retraining the random forest
classifiers. Ten 10-fold cross-validations were performed following the same
procedure. An AUC of 0.5 was obtained for these random experiments. This was
also a good indication that the network topologies of protein nodes in the PPI
networks could be used to infer the relationship between human and parasite
proteins in terms of sequence similarity, reflecting the homologs and similar
cooperation in the network community.

Based on the best performance and the results of the random forest classifiers,
we defined a voting score for a pair of human and parasite proteins. Ten
probability values of the positive prediction for a pair of human and parasite
proteins were obtained. The average of these probability values was calculated
and defined as a voting score for a pair of human and parasite proteins (see
Materials and Methods). This score was used to define the stringency of
predicting human-parasite protein associations. Initially, we identified 12 038
human proteins in the human PPI network and 1787 parasite proteins in the
parasite PPI network. This resulted in a total of 225 675 478 human-parasite
protein pairs. A total of 19 939 pairs were initially defined as positive
association pairs based on protein sequence similarities. After performing the
random forest classification, the average voting score was calculated for each
pair. It is worth noting that these scores indicated associations based on the
network topological profiles of the human-parasite protein pairs using machine
learning. It was also interesting to combine these scores with the other
association scores from other aspects such as the heterogeneous network study.
With the heterogeneous network model, the network propagation algorithm
with a decay factor of 0.1 was performed on the network to prioritize
human-parasite protein associations.
A total of 21 511 906 overlapping pairs with scores greater than 0 in both the
machine-learning and network propagation techniques were obtained and used
for the further analysis and selection of key human proteins. Of these pairs,
831 had the highest voting scores according to our
machine-learning analysis (Supplementary Table S1).
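The voting score described above can be sketched as a simple average. This is a minimal illustration only: the function name and the ten example probabilities are hypothetical, and the actual classifier outputs are not reproduced here.

```python
from statistics import mean

def voting_score(probabilities):
    """Average the positive-class probabilities obtained for one protein pair."""
    if not probabilities:
        raise ValueError("at least one probability is required")
    return mean(probabilities)

# Hypothetical probabilities from ten cross-validation runs for a single pair.
example_probs = [0.9, 0.8, 1.0, 0.7, 0.9, 0.8, 1.0, 0.9, 0.8, 0.7]
score = voting_score(example_probs)

# Number of candidate pairs implied by the two network sizes reported in the text.
n_human, n_parasite = 12038, 1787
n_pairs = n_human * n_parasite
```

A higher average means that the classifiers agreed more consistently on a positive prediction, which is why the score can be used as a stringency threshold.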
Identifying promising key human proteins from predicted associations
All human proteins among the 21 511 906 pairs were ranked to calculate their
ranking scores under the assumption that human proteins in associations with high
ranking scores may be important for parasite mechanisms. The final ranking score
for each human protein was obtained as the product of 2 ranking scores (see
section “Ranking score calculation for each human protein”): one calculated from the
ranked pairs obtained using the machine-learning method and one
calculated from the ranked pairs using the network propagation method. The
histogram of the logarithmic transformation of the final ranking scores of all
12 038 human proteins is shown in Figure 5. Notably, most of the ranking
scores were less than 0.0001, while the best ranking score was 1 (the
logarithm of 1 is 0). Selecting the proteins with this top score yielded 411 human
proteins. These human proteins were defined as the first list of promising
target proteins in human hosts. A complete list of these 411 human proteins is
provided in Supplementary Table S2. The bar plot representing the number of
highest-score associations for these 411 proteins is shown in Figure 6. Note that only
proteins found in more than 2 association pairs are presented in the figure.
Overall, we identified Ras-related proteins, kinesin family members, and
proteasome 20S subunits alpha and beta in the list.
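The combination step can be illustrated with a short sketch, assuming that each method has already produced one ranking score per human protein. The protein labels and scores are hypothetical, the treatment of proteins missing from one method (score 0) is an assumption, and base-10 logarithms are used only for illustration because the base is not stated in this section.

```python
import math

def final_ranking_scores(ml_scores, propagation_scores):
    """Multiply the two per-protein ranking scores.

    Assumption: a protein missing from either method receives a score of 0.
    """
    proteins = set(ml_scores) | set(propagation_scores)
    return {
        p: ml_scores.get(p, 0.0) * propagation_scores.get(p, 0.0)
        for p in proteins
    }

# Hypothetical ranking scores from the two methods.
ml = {"PROT_A": 1.0, "PROT_B": 0.5, "PROT_C": 0.01}
prop = {"PROT_A": 1.0, "PROT_B": 0.2, "PROT_C": 0.001}

final = final_ranking_scores(ml, prop)

# Proteins whose final score equals the best possible value of 1.
top = sorted(p for p, s in final.items() if math.isclose(s, 1.0))

# Log-transformed scores, as would be plotted in a histogram.
log_scores = {p: math.log10(s) for p, s in final.items() if s > 0}
```

Because the final score is a product, a protein reaches the top score of 1 only if it is ranked best by both methods.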
Figure 5.
Histogram showing the frequency of ranking scores in logarithm scale for
human proteins in the predicted human-parasite associations.
Figure 6.
Bar plot illustrating the number of the highest-score associations of
each human protein. Only proteins associated with more than 2 pairs were
presented.
Clusters of human protein candidates associated to malaria
As mentioned in section “Identifying promising key human proteins from predicted
associations,” we integrated the association scores from our machine-learning
techniques and the heterogeneous network model. First, the association scores of
candidate human-parasite protein pairs from the heterogeneous network method
were ranked to calculate their ranking scores for each protein in the same
manner as in our study (see Materials and Methods). Next, we combined the
ranking scores of these 2 methods as the attributes to cluster the human
proteins using hierarchical clustering. The aim was to group human proteins with
similar levels of importance in both aspects. Figure 7 shows the hierarchical
clustering of these proteins. By selecting the cut height of the dendrogram tree
as 8, we obtained 7 groups of proteins consisting of 2 groups of Ras-related
proteins, a single group of histone H2B proteins, kinesin family members,
ubiquitin specific peptidase 17 like family members, zinc finger proteins, and a
remaining group of mixed types of proteins. Figure S1 shows the high-resolution
circular dendrogram of the clustering analysis. The complete list of these
proteins in each cluster is provided in Supplementary Table S3. Ras proteins are members of a
superfamily of small GTPases that are involved in many processes of cell growth
control. Ubiquitin-specific peptidase 17 like family members regulate different
cellular processes, such as cell proliferation, cell migration, progression
through the cell cycle, apoptosis, and cellular response to viral
infection.[49-51]
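Cutting a dendrogram at a fixed height can be sketched as follows. This is a hedged illustration rather than the procedure used in the study: the linkage method and distance metric are not stated in this section, so complete linkage with Euclidean distance is assumed, and the protein labels and 2-dimensional attribute values are hypothetical.

```python
import math

def cluster_at_height(points, cut_height):
    """Agglomerative clustering (complete linkage, Euclidean distance).

    Merging stops once the closest pair of clusters is farther apart than
    cut_height, which corresponds to cutting the dendrogram at that height.
    Returns a sorted list of clusters, each a sorted list of labels.
    """
    clusters = [[label] for label in points]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(points[x], points[y]) for x in a for y in b)

    while len(clusters) > 1:
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > cut_height:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

# Hypothetical proteins described by two attributes (one per scoring method).
points = {
    "P1": (0.0, 0.0), "P2": (1.0, 1.0),
    "P3": (10.0, 10.0), "P4": (11.0, 10.0),
    "P5": (30.0, 0.0),
}
groups = cluster_at_height(points, cut_height=8)
```

Lowering the cut height yields more and smaller groups, whereas raising it merges groups, so the number of clusters obtained depends directly on this choice.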
Figure 7.
Circular dendrogram of the hierarchical clustering analysis.
Functional characteristics of annotated human proteins
Interpreting the functions of these 411 annotated human proteins may reveal the
related mechanisms of the human host and parasite. We investigated these human
proteins using functional enrichment analyses. Gene ontology annotations were
used to obtain an overview of the biological processes. The analysis was
performed using the Cytoscape plugin ClueGO. Gene ontology associations based on
biological processes were selected using the intermediate level of detail in the
ClueGO settings panel, which covers GO levels 3 to 8. Based on the
PPI of STRING, a second enrichment analysis was performed with a group of genes
that were connected in the GO network using CluePedia (version 1.5.6). This
analysis revealed 9 functional groups of GO terms, as shown in Table 2 and Figure 8. The complete
list of these overrepresented GO terms in the biological process category is
provided in Supplementary Table S4. Interestingly, regulation of
transcription, DNA-templated (GO:0006355), was the most
significant term. In addition, Rab protein signal transduction (GO:0032482) and
regulation of vesicle size (GO:0097494) were found in a high proportion of our
candidate proteins. Rab proteins are a subfamily of the Ras protein family
and commonly possess a GTPase fold. These Rab GTPases regulate the
processes of membrane trafficking, vesicle formation, and membrane
fusion.[52-54] Most of
our candidate proteins are involved in the regulation of membrane and vesicle
formation. These proteins may assist parasite transport within the host and could
be potential targets for the treatment of malaria. Figure 8 presents the network of the
main enriched GO terms of the 9 clusters, denoted as 9 different colors. Each
cluster contained associated GO terms and was named with its principal GO
term.
Table 2.
Nine functional groups based on principal gene ontology (GO) terms.
Cluster number | GO ID | Principal GO term | Adjusted P value* | Percentage of associated proteins
1 | GO:0006355 | Regulation of transcription, DNA-templated | 8.52E−112 | 7.29
2 | GO:0003700 | DNA-binding transcription factor activity | 6.50E−27 | 6.80
3 | GO:0032482 | Rab protein signal transduction | 9.87E−20 | 30.26
4 | GO:0070647 | Protein modification by small protein conjugation or removal | 9.98E−20 | 6.85
5 | GO:0006511 | Ubiquitin-dependent protein catabolic process | 8.17E−12 | 7.23
6 | GO:0090382 | Phagosome maturation | 5.41E−03 | 11.11
7 | GO:0097494 | Regulation of vesicle size | 5.67E−03 | 21.43
8 | GO:0001217 | DNA-binding transcription repressor activity | 1.43E−02 | 4.53
9 | GO:0006904 | Vesicle docking involved in exocytosis | 2.67E−02 | 8.33
Abbreviation: GO, gene ontology.
*P values were adjusted according to the Benjamini-Hochberg correction method.
Figure 8.
Representative network of gene ontology (GO) terms of our candidate human
proteins using ClueGO.
GO indicates gene ontology.
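For readers unfamiliar with the Benjamini-Hochberg correction mentioned in the footnote of Table 2, the adjustment can be sketched in a few lines. This is a generic illustration of the standard procedure, not the ClueGO implementation, and the raw P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw P values from four enrichment tests.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
```

Each raw P value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values in the same order as the raw ones.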
Protein complexes to potential protein targets
To identify sets of these 411 proteins that interact with each other and play
essential roles in regulatory processes, cellular functions, and signaling
cascades, we performed enrichment analysis in protein complexes. Enrichment
analysis of these proteins was performed on the CORUM protein complex database
(version 3.0).
Four protein complexes were significantly enriched at a Bonferroni-adjusted
P value <0.05. These 4 protein
complexes were the 20S proteasome, 26S proteasome, PA28gamma-20S
proteasome, and PA28-20S proteasome. The proteins overrepresented in all of
these complexes were PSMA4, PSMB2, PSMB4, PSMB5, PSMB6, and PSMB7; only
the 26S proteasome contained 1 additional protein (PSMC1). Thus, these
proteins may be interesting targets in future studies. Table 3 presents a list of the
overrepresented protein complexes.
Table 3.
The list of protein complexes enriched in 411 promising candidate proteins.
Protein complex | Adjusted P value | Associated proteins
20S proteasome | 8.34E−03 | PSMA4, PSMB2, PSMB4, PSMB5, PSMB6, PSMB7
26S proteasome | 1.29E−02 | PSMA4, PSMB2, PSMB4, PSMB5, PSMB6, PSMB7, PSMC1
PA28gamma-20S proteasome | 1.35E−02 | PSMA4, PSMB2, PSMB4, PSMB5, PSMB6, PSMB7
PA28-20S proteasome | 2.10E−02 | PSMA4, PSMB2, PSMB4, PSMB5, PSMB6, PSMB7
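A complex enrichment test of this kind is often computed as a one-sided hypergeometric test followed by a Bonferroni correction. The sketch below assumes that test; the statistic actually used for the CORUM analysis is not stated in this section, and all counts and P values are hypothetical.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, K successes, n draws).

    N: background proteins, K: proteins in the complex,
    n: candidate proteins, k: candidates found in the complex.
    """
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return hits / total

def bonferroni(p_values):
    """Multiply each P value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical example: 20 background proteins, a complex of 5 members,
# 4 candidate proteins, 2 of which belong to the complex.
p = hypergeom_sf(k=2, N=20, K=5, n=4)

# Hypothetical raw P values from three tested complexes.
adjusted = bonferroni([0.01, 0.5, 0.2])
```

The Bonferroni correction is conservative, so only complexes with a strong overlap remain below the 0.05 threshold after adjustment.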
Furthermore, to examine the importance of the proposed human proteins, these
proteins were searched for in the DrugBank database.
Interestingly, Proteasome 20S Subunit Beta 2 (PSMB2) and Proteasome 20S
Subunit Beta 5 (PSMB5) were identified as known drug targets in
the DrugBank database. PSMB2 and PSMB5 play several roles. They were found to be
enriched in the principal GO terms of regulation of transcription,
DNA-templated, protein modification by small protein conjugation or removal, and
ubiquitin-dependent protein catabolic process. Interestingly, PSMB2 was found to
be a drug target of carfilzomib (DB08889), while PSMB5 is a drug target of
carfilzomib and bortezomib (DB00188). Carfilzomib is a synthetic proteasome
inhibitor. It is an analogue of the natural product epoxomicin, which
effectively kills parasites. Bortezomib is the first therapeutic proteasome
inhibitor to be tested in humans, which induces cell cycle arrest and apoptosis.
Bortezomib interrupts the degradation of proapoptotic proteins in cancerous
cells. It is currently used for the treatment of relapsed multiple myeloma and
mantle cell lymphoma. Both carfilzomib and bortezomib have been reported to be
relevant to malaria treatment.
Carfilzomib has been reported to potently block P.
falciparum replication at effective concentrations and to
kill asexual blood-stage P. falciparum.
Bortezomib exhibits antiplasmodial activity and has been examined for
efficacy against P. falciparum.
PSMB2 and PSMB5 were found in all our resulting protein complexes (Table 3). Thus, these
complexes may be a valuable starting point for further studies aiming to design
and develop drugs against malaria. In addition, PSMB2 and PSMB5 were observed in
the mixed-type protein group of 62 proteins in our clustering results (see
section “Clusters of human protein candidates associated to malaria” and
Supplementary Table S3). Therefore, the remaining 60 proteins in
the same cluster may also be promising therapeutic targets for
P. vivax malaria. A list of these proteins is provided in
Supplementary Table S5. In addition, to evaluate the relationship between these
411 human proteins and P. vivax malaria, orthologous proteins of
P. vivax and the 411 human proteins were retrieved
from the EggNOG database (version 5.0).
The results are presented in Supplementary Table S6.
Discussion
Our understanding of the invasion mechanism of P. vivax remains
deficient due to the lack of a robust in vitro culture system for
this parasite. In an attempt to resolve this, the host-parasite interactions were
studied, including direct interactions at the protein level inside the cell. In this
study, we initially reconstructed the human and parasite PPI networks, and compared
their network structures. In principle, both networks follow a power-law degree distribution,
and the analysis of network topologies between these 2 networks revealed a
correlation between the within-network connections of human and parasite
proteins in the positive set. The high correlation of closeness centrality between
these proteins indicated that most similar human and parasite proteins occupy
comparable positions with respect to the shortest paths connecting them to the other proteins. These proteins
also formed similar local communities around them, as a high correlation was also
observed in terms of eccentricity. Although the degree, betweenness centrality, and
Kleinberg’s hub score did not show significant correlations among these proteins, the
machine-learning approaches applied here may help reveal several more human and
parasite protein associations in future studies.

A ranking score calculation for the human proteins was developed based on the rank of
the associations according to their voting scores. A total of 411 human proteins
with the best-ranking score were selected as promising target candidates. In
the histogram shown in Figure 5, there was a clear gap between the best score and
the second-best score, while the
remaining scores were far lower still. The majority of the human proteins
had a ranking score of approximately 0.00001, which is very low in terms of the
probability of being a reliable association. Thus, these 411 proteins were selected
for further analysis together with heterogeneous network prioritization and
qualified in terms of clusters, functions, and protein complexes.

The results showed that Ras-related proteins, a single group of histone H2B proteins,
kinesin family members, ubiquitin-specific peptidase 17 like family members, and
zinc finger proteins were the most prominent in our candidate list. These proteins
are involved in several processes of cell growth control and regulation of membrane
and vesicle formation. Several proteins related to proteasome 20S subunits have been
previously reported as promising multistage targets for malaria therapy.
These proteins may be exploited by parasites during invasion of the host cell and
have been identified as potential drug targets in the human host.
Conclusion
In this study, we established an analysis framework that uses a machine-learning
approach based on a heterogeneous network structure. We used the network topology
features of proteins in the human PPI network and the P. vivax PPI
network and integrated protein sequence similarities into the framework to predict
human-parasite protein associations. We also developed a ranking score calculation
to identify promising protein targets in humans for the treatment of malaria
infections. The candidate human proteins that were selected as promising targets
were then qualified by clustering analysis together with the information on the
existing targets from the heterogeneous network prioritization, as well as by
functional and protein complex enrichment analyses. We found that proteins in the
cluster of PSMB2 and PSMB5 (known drug targets), human proteins involved in the
regulation of membrane and vesicle formation, and complexes such as the 20S
proteasome, 26S proteasome, and PA28gamma-20S and PA28-20S proteasomes are potential targets for
the design and development of drugs for the treatment of malaria.

In conclusion, the integration of data related to network topologies and sequence
similarity provides us with an opportunity to define associations between human and
P. vivax proteins. Human protein candidates extracted from
these associations were used to compile a list of promising targets in humans for
further validation in wet-laboratory experiments in future studies. An enhanced
understanding of potential host proteins at the molecular level will provide
insights to support malaria control efforts and the production of novel antimalarial
drugs.

Supplemental material, sj-pdf-1-bbi-10.1177_11779322211013350, sj-xls-2-bbi-10.1177_11779322211013350, and sj-xls-3-bbi-10.1177_11779322211013350 for Prediction of
Human-Plasmodium vivax Protein Associations From Heterogeneous Network
Structures Based on Machine-Learning Approach by Apichat Suratanee, Teerapong
Buaboocha and Kitiporn Plaimas in Bioinformatics and Biology Insights.