Review of computational methods for virus-host protein interaction prediction: a case study on novel Ebola-human interactions.

Anup Kumar Halder¹, Pritha Dutta¹, Mahantapas Kundu¹, Subhadip Basu¹, Mita Nasipuri¹.

Abstract

Identification of potential virus-host interactions is vital for controlling highly infectious viral diseases and may contribute to the development of new drugs to treat viral infections. Recently, database records of clinically and experimentally validated interactions between a small set of human proteins and Ebola virus (EBOV) have been published. Using information on the known human interaction partners of EBOV, our main objective is to identify a set of proteins that may interact with EBOV proteins. Here, we first review the state-of-the-art computational methods used for prediction of novel virus-host interactions for infectious diseases, followed by a case study on EBOV-human interactions. The assessment shows that the predicted human host proteins are highly similar to the known human interaction partners of EBOV in terms of structure and semantics and are involved in similar biochemical activities, pathways and host-pathogen relationships.

Substances：
Glycoproteins
Proteins

Year: 2018 PMID： 29028879 PMCID： PMC7109800 DOI： 10.1093/bfgp/elx026

Source DB: PubMed Journal: Brief Funct Genomics ISSN： 2041-2649 Impact factor: 4.241

Introduction

Infectious diseases remain among the major causes of death in humans, mostly because the pathogenic mechanisms of viruses remain unknown [1]. Viral pathogens affect their eukaryotic hosts directly through complex interaction mechanisms [2], in which molecular-level interactions between the virus and its host play a key role. Thus, virus–host protein–protein interactions (PPIs) are crucial for a better understanding of the infection mechanism and pathogenesis of infectious diseases [3, 4]. In proteomic studies, PPI prediction has remained a hot topic for decades. Owing to the limited availability of proteomic data, most previous studies focused on predicting intra-species PPIs, i.e. interactions within a single organism. More recently, several improvements have been reported in predicting inter-species PPIs, i.e. interactions between different organisms. These prediction methods offer important information for further analysis of infection mechanisms between different species. PPIs between virus and host proteins allow pathogenic microorganisms to manipulate host mechanisms, to use host capabilities and to escape host immune responses [5-8]. Therefore, a complete understanding of infection mechanisms through PPIs is crucial for the development of new and more effective therapeutics. Computational PPI prediction methods primarily use sequence information [9-14], domain information [5, 15–17], protein structure [18-21], physicochemical properties [22], semantic analysis [23, 24] and known interactions between virus and host proteins [25]. Classical machine learning techniques are well-accepted tools for PPI prediction when a sufficient number of known interactions is available. In contrast, prediction of inter-species (virus–host) PPIs is a relatively young field of study, which requires new model-based approaches.

To tackle the problem of data scarcity, eliciting and transferring data from related domains to a desired formulation can be a promising solution [26, 27]. Multitask learning [28-30] exploits the relationships among different domains and learns the tasks simultaneously, which leads to better performance than conducting the learning task on each domain individually. In this article, we concentrate on computational approaches for the prediction of virus–host interactions, followed by a case study on the prediction of novel interactions between Ebola virus (EBOV) and human host proteins. We also present a brief cluster analysis of the predicted host proteins. This analysis includes the infection pathways and functional annotations that can be valuable in the prevention, diagnosis and treatment of EBOV infection.

Computational approaches

Several computational approaches have been developed to predict novel host–virus protein interactions. Depending on the availability of interaction information, different predictive models have been proposed. Becerra et al. [31] proposed a short linear motif-based method for predicting PPIs between HIV-1 and human proteins. They implemented three filtering methods to obtain sets of linear motifs that are (i) conserved in viral proteins (C), (ii) located in disordered regions (D) and (iii) rare or scarce in a set of randomized viral sequences (R). These three sets were then used to find the disordered protein regions among the HIV-1 sequences and host sequences. Their study shows that the majority of conserved linear motifs in the virus are located in disordered regions. Segura et al. [32] proposed a method to model the viral–human interaction network using motif–domain interactions. Kharrat et al. [33] used structure sequence to cluster human viral proteomes. They analyzed the charged residues of the amino acid composition (AC) of the viral and target host proteins and provided a better understanding of several pathological and biological processes (BPs). The charged residues in protein sequences mediate the interactions and play an important role in protein transport, localization and regulatory functions.

Mukhopadhyay et al. [34, 35] proposed a bi-clustering approach to predict HIV-1-infected human proteins by applying interaction-based analysis. A set of association rules was mined by the bi-clustering technique from the interactions between HIV-1 proteins, and a set of high-confidence rules was then extracted to predict novel human protein interactions. As a further improvement to their work, Mukhopadhyay et al. [36] introduced type- and direction-based (host-to-virus and virus-to-host) bi-clustering of existing interactions to predict novel host proteins. Mondal et al. [37] also proposed an HIV–human interaction prediction method using hierarchical bi-clustering and minimal covers of association rule mining. Plotkin et al. [38] proposed an approach for predicting the future dominant hemagglutinin gene of influenza A virus, in which antigenic evolution over the host genes was analyzed. The spatiotemporal distribution of the viral swarm and the evolution of hemagglutinin structure were analyzed for clustering, and a critical length scale in amino acid space was applied to cluster the viral sequences. A sequence-based hierarchical clustering approach over EBOV and influenza virus was introduced in [39]. In a study by Spencer et al. [40], phosphorylation clustering was applied over the infection of bronchitis virus protein.

Schleker et al. [41] proposed sequence similarity-based detection of domain–domain interactions and used it to predict Salmonella–human interactions. Several other approaches have been proposed for Salmonella–human interaction prediction [42, 43]. Support Vector Machine (SVM)-based approaches have been successfully applied in virus–host protein interaction prediction studies [44, 45]. Cui et al. [44] proposed an SVM-based approach that uses a fixed-length feature vector indicating the relative frequency of consecutive amino acids in the protein sequence. Doolittle et al. [46] proposed a method to predict the interactions between HIV-1 and human proteins based on protein structural similarity, where crystal structures are compared to compute the structural similarity between host and pathogen proteins. In this approach, human proteins that have high structural similarity to an HIV protein are identified, and their known interacting partners are considered as targets. They applied a similar approach to develop an interaction network between Dengue virus and the host [47]. Cao et al. [48] proposed a network-based approach to predict EBOV–human interactions. Their prediction method relies on a principle called ‘guilt by association’ (GBA), which states that proteins interacting with each other are likely to have the same or similar functions. Based on this assumption, they predicted EBOV infection-related human genes from a PPI network using the Dijkstra algorithm.

Several works [49-52] have reported that molecular mimicry plays a key role in viral–host interactions, where a viral protein mimics the structural binding surface of a host protein. As a result, the viral protein competitively binds to another host protein and the virus spreads through the host. Available experimental data [53-56] suggest that pathogenic agents make extensive use of molecular mimicry to their advantage. Mei et al. [57] proposed a transfer learning-based technique with three different classifiers, where each individual classifier was executed on three gene ontology (GO)-based features. Classifier ensembling was then applied to produce the final result by weighting the probability outputs of the individual classifiers. In addition, for analyzing the biological activity of proteins, GO-based semantic similarity creates an evolutionary orientation in PPI [58-63]. GO annotation-based semantic similarity has been regarded as one of the most powerful indicators of interaction [23, 64]. Thus, structural and semantic similarities are valuable features for predicting interactions involving novel host proteins. Table 1 summarizes the methods that have been successfully applied to virus–host (human) interaction prediction in the literature.
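The fixed-length sequence feature described above for Cui et al. [44] can be illustrated with a minimal sketch. It assumes that "consecutive amino acids" means overlapping triplets over the 20 standard residues, which gives a vector of length 8000. The function name, the constant names and the handling of non-standard residues are hypothetical choices made for this example and are not taken from [44].

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
K = 3  # assumed k-mer size (triplets); an illustrative choice

# Fixed ordering of all 20**K k-mers, so every protein maps to the same layout.
KMER_INDEX = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=K))}


def kmer_frequency_vector(sequence: str) -> List[float]:
    """Return the relative frequency of each overlapping k-mer in the sequence.

    k-mers containing non-standard residues (e.g. 'X') are skipped.
    """
    counts = [0] * len(KMER_INDEX)
    total = 0
    seq = sequence.upper()
    for start in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[start:start + K])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


if __name__ == "__main__":
    vec = kmer_frequency_vector("ACDACD")
    print(len(vec), vec[KMER_INDEX["ACD"]])
```

Because the vector length depends only on the alphabet and on K, proteins of any length map to inputs of the same size, which is what a classifier such as an SVM requires.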

Table 1.

Computational approaches on virus host (human) interaction prediction

Interactors	Approaches/methods	References
Ebola–human	Graph-based multitask learning-based approach	[26]
Ebola–human	Network similarity-based approach	[48]
Dengue virus–human	Sequence- and structure-based method	[42]
	Structural motif–domain interaction-based approach	[32]
	Structural similarity of DENV proteins to human proteins having known interactions	[47]
Human papillomavirus–human	Relative frequency of amino acid triplets (RFATs), frequency difference of amino acid triplets (FDATs) and AC	[45]
Human papillomavirus–human	Fixed-length feature vector of protein sequence	[44]
Hepatitis C virus–human	Graph-based multitask learning-based approach	[26]
	RFATs, FDATs and AC	[45]
	Sequence, network topology, domain, GO and pathway-based kernel method	[12]
	Topological and functional properties of interaction network and domain interaction-based method	[25]
	Fixed-length feature vector of protein sequence	[44]
Salmonella–human	Sequence- and structure-based method	[42]
	Multi-instance homolog transfer-based approach	[43]
	Virus–host domain interaction, gene expression, pathway sharing and sequence-based	[26, 28]
	Obtain host–pathogen interactome using sequence and interacting domain similarity to known PPIs	[41]
	Homology detection method using template PPI databases	[17]
Plasmodium falciparum–human	K-mer-based sequence homology and pathway-based approach	[14]
	GO annotation and sequence filtering-based approach	[10]
	Homology detection method using template PPI databases	[11]
	Domain–domain interaction probability-based approach	[6]
Influenza A–human	Graph-based multitask learning-based approach	[26]
Influenza A–human	Structural homology-based approach	[21]
HIV-1–human	Short linear motifs-based approach	[31]
	Bi-clustering with association rule mining	[34–36]
	Sequence-based classifier ensembling	[13]
	Differential gene expression between virus and host	[24]
	Hierarchical bi-clusters and minimal covers of association rule-based approach	[37]
	Supervised learning and prediction of physical interactions	[5]
Escherichia coli–human	Homology detection method using template PPI databases	[17]
Mycobacterium tuberculosis (H37Rv)–human	Stringent homology which uses inter-species template PPI	[4]

In the following sections, a case study on the prediction of novel EBOV–human interactions is discussed. Structural and GO-based semantic assessment scores are used as features of the predictive models, which are built with four classifiers. Finally, a cluster of predicted EBOV host proteins is extracted from the prediction results. The known host proteins are considered as seeds for the predictions and unknown human proteins as targets; each seed and target together form an interaction pair. The predictive analysis of these EBOV–human protein pairs is discussed in the following sections. This assessment may facilitate the diagnosis and treatment of EBOV infections.

The role of Ebola glycoprotein in virus–host interaction

Ebola hemorrhagic fever, caused by EBOV infection, has a high fatality rate in humans. EBOV, a member of the Filoviridae family, is a negative-sense RNA virus [65]. The EBOV genome consists of seven genes, namely, nucleoprotein (NP), viral polymerase complex protein 35 (VP35), matrix protein (VP40), glycoprotein (GP), minor NP (VP30), membrane-associated protein (VP24) and polymerase (L-protein) [66]. Among the products of these seven genes, GP is the only viral protein that is present on the EBOV surface and is responsible for mediating attachment to host cell surface receptors and entry of the virus into host cells. Mature GP is a trimer of three disulfide-linked GP1-GP2 heterodimers [67-69]. GP1 mediates adhesion of the virus to host cells and regulates GP2, which carries out membrane fusion [70-72]. EBOV initially targets specific cell types, including liver cells, immune system cells and endothelial cells. EBOV GP can damage cell adhesion, so that cells do not remain attached to each other and to the extracellular matrix. By targeting liver cells, EBOV disrupts the removal of toxins from the bloodstream. The infection leads to organ failure, fever and severe internal bleeding, which ultimately leads to death [73]. The first step in EBOV infection is attachment to host cell surface receptors. Specifically, EBOV GP allows the virus to introduce its contents into monocytes, macrophages, dendritic cells and/or endothelial cells [74-77], which causes the release of cytokines associated with inflammation and fever. EBOV GP has been found to be the most important factor required for EBOV entry into a host cell, which it achieves by binding to surface receptors on the host cell [78]. EBOV binds to human cells through various receptors expressed on the cell membrane surface: C-type lectins, which interact with glycans on EBOV GP [79-84], phosphatidylserine receptors, which interact with phosphatidylserine on EBOV GP [85-92], and integrins [93-95]. The cholesterol transporter, Niemann–Pick C1 (NPC1), facilitates EBOV entry into host cells and the release of the virus from the vacuole into the cytoplasm of the host [96, 97]. EBOV entry into endosomal compartments is primarily achieved through macropinocytosis, as well as other entry mechanisms, such as clathrin-dependent endocytosis [91, 98–100]. As EBOV GP mediates the entry of the virus into host cells, it is essential to understand the interactions between EBOV GP and human cell surface receptors. The known interactions between EBOV GP and human proteins, with domain annotations, are listed in Table 2.

Table 2.

Known human target proteins of EBOV GP

Functional group	Gene name	Protein name
C-type lectin domain family	CLEC4M [82]	Liver/lymph node-specific ICAM-3 grabbing non-integrin (L-SIGN)
	CLEC4G [83]	Liver and lymph node sinusoidal endothelial cell C-type lectin (LSECTIN)
	CLEC10A [81]	Human macrophage galactose-and N-acetyl-galactosamine-specific C-type lectin (hMGL)
Dendritic cell-specific intercellular adhesion molecule (ICAM)	CD209 [82]	Dendritic cell-specific intercellular adhesion molecule-3-grabbing non-integrin (DC-SIGN)
Tyrosine-protein kinase receptor	AXL [90]	Tyrosine-protein kinase receptor UFO
	TYRO3 [89]	Tyrosine-protein kinase receptor TYRO3
	MER [89]	Tyrosine-protein kinase Mer
T-cell immunoglobulin and mucin domain	TIM1 [85]	T-cell immunoglobulin and mucin domain-containing protein 1
T-cell immunoglobulin and mucin domain	TIM4 [79]	T-cell immunoglobulin and mucin domain-containing protein 4
Integrin domain	ITGB1 [94]	Integrin beta-1
Integrin domain	ITGA5 [93]	Integrin alpha-5
Lactadherin	MFGE8 [92]	Lactadherin
Growth arrest-specific protein	GAS6 [79]	Growth arrest-specific protein 6

Structure-based similarity assessment

In structure-based analysis, each residue on the surface is compared with the target residue to extract the structural neighbors. A variety of features have been derived in different analyses [101-110] from the structural components of virus–host protein pairs. In this analytic approach, three-dimensional structure-based protein features are incorporated to find the structural neighbors. Five scoring metrics, Template modeling-score (TM-score) [111], RMSD [112], MaxSub-score [113], GDT_score [114] and Tm-Rm score [Equation (1)], are used to quantify the structural similarity of two proteins. TM-score [111] gives a value in the range (0, 1], where 1 indicates a perfect match in the topological similarity of two protein structures. Scores <0.17 indicate no structural similarity, whereas a score >0.5 suggests that the two structures have a similar fold. We use the TM-align algorithm [115], which identifies the best structural alignment between two proteins, to compare the structures. After the optimal superposition, RMSD [112] represents the root-mean-square deviation of all the equivalent atom pairs of the two protein structures. In general, a lower RMSD indicates better superposition. For the identification of similar structural domains, databases such as SCOP and CATH set an RMSD threshold of 5 Å. An RMSD value <3 Å indicates a high degree of structural similarity. A lower RMSD and a higher TM-score both indicate better structural similarity; thus, the two measures are inversely related, with an RMSD of 0 and a TM-score of 1 representing optimal structural similarity. The MaxSub-score method [113] identifies the largest subset of atoms of a protein structure that superimposes well over another structure and provides a single normalized score. The MaxSub score ranges from 0 to 1, where 0 indicates a wrong superimposition and 1 indicates perfect superimposition [113].

The global distance test (GDT) score [114] is calculated from the largest set of residue-based atoms in a structure that fall within a defined distance cutoff of their position in the other structure. The GDT score increases as the cutoff is relaxed; a plateau in this increase may indicate an extreme divergence between a structure pair, such that no additional atoms are included in any cutoff of a reasonable distance [116]. Finally, a new structural similarity-based property is defined using both TM-score and RMSD. Here, the RMSD score is restricted to at most 3 Å to capture higher structural similarity. In Equation (1), any RMSD value <3 Å contributes positively together with the TM-score.
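To make the scores above concrete, the following is a minimal sketch of RMSD, TM-score and a GDT-style score computed from the distances between equivalent residue pairs. It assumes the two structures are already optimally superposed and aligned (the job of TM-align in the text), uses the standard TM-score length scale d0 = 1.24 (L - 15)^(1/3) - 1.8 with a floor of 0.5 Å as a simplification for short chains, and averages GDT over the commonly used cutoffs of 1, 2, 4 and 8 Å. The function names are hypothetical. MaxSub and the Tm-Rm score are left out, the latter because Equation (1) is not reproduced in this text.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def pair_distances(a: Sequence[Point], b: Sequence[Point]) -> List[float]:
    """Distances between equivalent atoms of two pre-superposed structures."""
    if len(a) != len(b):
        raise ValueError("structures must have the same number of aligned atoms")
    return [math.dist(p, q) for p, q in zip(a, b)]


def rmsd(distances: Sequence[float]) -> float:
    """Root-mean-square deviation over the aligned atom pairs."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def tm_score(distances: Sequence[float], target_length: int) -> float:
    """TM-score for a fixed superposition, normalized by the target length."""
    if target_length > 15:
        d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    else:
        d0 = 0.5
    d0 = max(d0, 0.5)  # floor for short chains (simplifying assumption)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


def gdt_ts(distances: Sequence[float], target_length: int,
           cutoffs: Sequence[float] = (1.0, 2.0, 4.0, 8.0)) -> float:
    """Average fraction of residues within each distance cutoff."""
    fractions = [sum(1 for d in distances if d <= c) / target_length
                 for c in cutoffs]
    return sum(fractions) / len(fractions)


if __name__ == "__main__":
    ref = [(float(i), 0.0, 0.0) for i in range(20)]
    moved = [(x, y + 1.0, z) for x, y, z in ref]
    d = pair_distances(ref, moved)
    print(rmsd(d), tm_score(d, 20), gdt_ts(d, 20))
```

In practice the reported TM-score is the maximum over superpositions, so a value computed from a single fixed superposition, as here, is a lower bound on it.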

GO-based semantic assessment

The semantic similarity between human proteins is estimated by combining the similarities of their annotating GO term pairs belonging to a particular ontology [molecular function (MF), BP or cellular component (CC)]. The similarity of a GO term pair is determined from certain topological properties (shortest path lengths) of the GO graph and the average information content (IC) of the disjunctive common ancestors (DCAs) of the GO terms, as proposed in [23]. In this measure, to estimate the semantic similarity between two GO terms t1 and t2, certain GO terms are first selected as cluster centers based on a value called propTerms(t), assigned to each GO term t in the GO graph, which gives the proportion of GO terms connected directly or indirectly to t in the ontology. The GO terms whose propTerms value is above a given threshold are selected as cluster centers. The threshold values for selecting cluster centers with respect to the MF, CC and BP ontologies are given in Table 3. These threshold values are chosen according to the interaction prediction results: the threshold starts at 0.1 and is gradually increased, and for each threshold the area under the receiver operating characteristic (ROC) curve (AUC) is determined for values of k (the width of the Gaussian function) from 1 to 10. Finally, the threshold and k that give the highest AUC are chosen. After selecting the cluster centers, the degree of membership of a GO term to each of the selected cluster centers is calculated using its shortest path length to the corresponding cluster center. The membership of the GO term to a cluster decreases as its shortest path length to the cluster center increases. Next, using the difference in the membership values of the GO terms t1 and t2 with respect to each cluster center, a weight parameter is defined as one minus the maximum membership difference value. This weight value reflects how dissimilar two GO terms can be with respect to the cluster centers. Next, the average IC [117] of the DCAs of GO terms t1 and t2 is determined. Finally, the semantic similarity between the two GO terms t1 and t2 is defined as the product of the weight parameter and the average IC of the DCAs of the two GO terms.
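The final steps of this measure (membership, weight and the product with the average IC) can be sketched as follows. The text only states that membership decreases with shortest path length and that k is the width of a Gaussian function, so the exact form exp(-d^2 / (2 k^2)) used here is an assumption, as are the function names. Shortest path lengths to the cluster centers and the IC values of the DCAs are taken as precomputed inputs rather than derived from the GO graph.

```python
import math
from typing import Dict, Sequence


def membership(path_length: float, k: float) -> float:
    """Gaussian membership of a GO term to a cluster center (assumed form)."""
    return math.exp(-(path_length ** 2) / (2.0 * k ** 2))


def go_term_similarity(paths_t1: Dict[str, float],
                       paths_t2: Dict[str, float],
                       dca_ic: Sequence[float],
                       k: float) -> float:
    """Weight (1 - max membership difference) times the average IC of the DCAs.

    paths_t1 / paths_t2 map each cluster center to the shortest path length
    from t1 / t2 to that center.
    """
    centers = paths_t1.keys() & paths_t2.keys()
    max_diff = max(
        (abs(membership(paths_t1[c], k) - membership(paths_t2[c], k))
         for c in centers),
        default=0.0,
    )
    weight = 1.0 - max_diff
    avg_ic = sum(dca_ic) / len(dca_ic) if dca_ic else 0.0
    return weight * avg_ic


if __name__ == "__main__":
    # Hypothetical cluster centers and IC values, for illustration only.
    t1 = {"center_a": 2, "center_b": 5}
    t2 = {"center_a": 3, "center_b": 5}
    print(go_term_similarity(t1, t2, dca_ic=[2.0, 4.0], k=2.0))
```

Two terms at identical distances from every cluster center receive a weight of 1, so their similarity reduces to the average IC of their DCAs.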

Table 3.

GO-based cluster center threshold with k-value

GO	Cluster center threshold	Width k
MF	0.18	2
CC	0.55	1
BP	0.31	3

To determine the semantic similarity score of a protein pair, the semantic similarity scores of their respective GO terms are combined using the best-match average approach [118, 119]. Here, the semantic similarity is estimated separately with respect to the BP, CC and MF ontologies of the GO database [120].
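The best-match average step can be sketched as below. Several variants of this measure exist; the one shown averages the row-wise and column-wise best matches separately and then takes their mean, which may differ in detail from the formulation in [118, 119]. The function name and the toy similarity table are hypothetical.

```python
from typing import Callable, Sequence


def best_match_average(terms_a: Sequence[str],
                       terms_b: Sequence[str],
                       term_sim: Callable[[str, str], float]) -> float:
    """Combine term-level similarities into a protein-level score.

    Each term of one protein is matched with its most similar term of the
    other protein; the two directional averages are then averaged.
    """
    if not terms_a or not terms_b:
        return 0.0
    best_a = [max(term_sim(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(term_sim(a, b) for a in terms_a) for b in terms_b]
    return 0.5 * (sum(best_a) / len(best_a) + sum(best_b) / len(best_b))


if __name__ == "__main__":
    # Toy term-level similarities; the term names are placeholders.
    table = {("a1", "b1"): 1.0, ("a1", "b2"): 0.2,
             ("a2", "b1"): 0.4, ("a2", "b2"): 0.6}
    score = best_match_average(["a1", "a2"], ["b1", "b2"],
                               lambda a, b: table[(a, b)])
    print(score)
```

Running the same function once per ontology (BP, CC and MF) yields the three separate semantic features described in the text.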

Data sources

Ontology data

Ontology data are downloaded from the GO database [120, 121] (dated July 2015) containing 43 368 ontology terms subdivided into 28 539 BP terms, 10 868 MF terms and 3961 CC terms.

GO annotation data

GO annotations for human proteins are downloaded from the UniProt database [122, 123].

Seed proteins

The seed human proteins (referred to as set SD) for clustering are those proteins that are known to interact with EBOV GP. These known human seeds are collected from a literature survey. The list of known target human proteins (seeds in our approach) of EBOV GP is given in Table 2.

Target proteins

The target human proteins are collected from the UniProt database [122, 123]. These target proteins are selected such that they are first-level interaction partners of the human seed proteins and have structural information in the Protein Data Bank (PDB) [124]. In addition, human protein interaction information is collected from the DIP [125], MINT [126], BioGrid [127], STRING [128] and iRefWeb [129] databases.
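The target selection rule described here (first-level partners of a seed that also have a known structure) amounts to a simple set operation, sketched below. The function name and the placeholder protein identifiers are hypothetical; in practice the interaction pairs would come from the interaction databases listed above and the structure set from PDB cross-references.

```python
from typing import Iterable, Set, Tuple


def select_targets(seeds: Iterable[str],
                   interactions: Iterable[Tuple[str, str]],
                   has_structure: Iterable[str]) -> Set[str]:
    """Return non-seed proteins that directly interact with a seed and
    have structural information available."""
    seed_set = set(seeds)
    structured = set(has_structure)
    neighbors: Set[str] = set()
    for a, b in interactions:  # interactions are treated as undirected
        if a in seed_set and b not in seed_set:
            neighbors.add(b)
        if b in seed_set and a not in seed_set:
            neighbors.add(a)
    return neighbors & structured


if __name__ == "__main__":
    seeds = ["SEED_A", "SEED_B"]
    ppi = [("SEED_A", "X1"), ("X2", "SEED_B"), ("X1", "X3"), ("SEED_A", "SEED_B")]
    print(sorted(select_targets(seeds, ppi, has_structure=["X1", "X3"])))
```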

Clustering analysis

The cluster analysis strategy integrates four classifiers, namely, Decision Tree (DT) Classifier [130], KNeighbors (KNN) Classifier [131], SVM [132] and Gaussian Naive Bayes (GNB) [133, 134]. A 3-fold cross-validation is performed for all four classifiers, and their respective ROC curves [135] are given in Figure 1. A pairwise relation with respect to the seed proteins is generated from the structural and semantic features defined above. All pairwise combinations (Seed, Seed) within the set SD are considered as the positive samples for the classifiers. The negative data are created as pairs of proteins (Seed, Nseed), where Seed is a member of SD and Nseed is a protein that has no interaction evidence with the seed proteins or with EBOV proteins. Finally, the results of all classifiers are aggregated to form the final cluster. To obtain a more accurate cluster, we consider only those novel interactions on which all four classifiers agree. The basic workflow of clustering is shown in Figure 2.
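The pair construction and the unanimous-agreement rule described above can be sketched as follows. The classifiers themselves are not implemented here: their per-pair predictions (1 for a predicted interaction, 0 otherwise) are assumed to be available as plain dictionaries, and all function names and protein identifiers are hypothetical. Note that the negative-pair helper enumerates every (Seed, Nseed) combination, whereas the study reports a manually created negative set.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def positive_pairs(seeds: Iterable[str]) -> List[Pair]:
    """All unordered (Seed, Seed) combinations within the seed set."""
    return list(combinations(sorted(set(seeds)), 2))


def negative_pairs(seeds: Iterable[str], non_seeds: Iterable[str]) -> List[Pair]:
    """All (Seed, Nseed) combinations with proteins lacking interaction evidence."""
    return [(s, n) for s in sorted(set(seeds)) for n in sorted(set(non_seeds))]


def unanimous_positives(predictions: Dict[str, Dict[Pair, int]]) -> Set[Pair]:
    """Keep only the pairs that every classifier labels as interacting (1)."""
    per_classifier = [
        {pair for pair, label in preds.items() if label == 1}
        for preds in predictions.values()
    ]
    if not per_classifier:
        return set()
    return set.intersection(*per_classifier)


if __name__ == "__main__":
    p1, p2 = ("SEED_A", "X1"), ("SEED_A", "X2")
    votes = {
        "DT": {p1: 1, p2: 1},
        "KNN": {p1: 1, p2: 0},
        "SVM": {p1: 1, p2: 1},
        "GNB": {p1: 1, p2: 1},
    }
    print(unanimous_positives(votes))
```

Requiring agreement across all classifiers trades recall for precision: a pair rejected by any single classifier is dropped from the final cluster.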

Figure 1.

ROC curves of DT, KNN, SVM and GNB. (A colour version of this figure is available online at: https://academic.oup.com/bfg)

Figure 2.

The workflow of cluster analysis based on known human target proteins of EBOV GP using DT, KNN, SVM and GNB. (A colour version of this figure is available online at: https://academic.oup.com/bfg)

Discussion

The predictive model is able to retrieve potential human proteins that may interact with EBOV GP and facilitate entry of the virus into host cells. Structural and GO-based functional annotation is the key element of this analysis. In total, 10 EBOV GP hosts are selected as seeds (Table 2). Among them, only 28 pairs of structural comparison are possible, and these are considered as positive data pairs for training. In addition, 28 pairs of negative data are manually created for the classifiers. A new test data set is created, with respect to each seed protein, from human proteins that have structural information. These proteins are selected as first-level interactors of the EBOV GP hosts. Finally, a total of 116 proteins result from the cluster as potential EBOV hosts. To establish the involvement of these proteins in viral infection, we searched the KEGG database (http://www.genome.jp/kegg/) for common pathways. A set of pathways (hsa05162:Measles, hsa04145:Phagosome, hsa05203:Viral carcinogenesis, hsa05152:Tuberculosis, hsa05133:Pertussis and hsa05205:Proteoglycans in cancer) is found to be common between the EBOV GP hosts and these cluster proteins (shown in Table 4). A GO-based functional analysis of these proteins is shown in Tables 5–7. These proteins share many biological activities related to virus receptor activity, immune response and viral genome replication with the known EBOV GP interactors.

Table 4.

Significant common KEGG pathways found in known and new human proteins

Serial number	KEGG	Term	Host: % of proteins	Host: P-value	New host: % of proteins	New host: P-value
1	hsa05162	Measles	15.38	5.7E-2	22.52	7.8E-4
2	hsa04145	Phagosome	30.77	1.1E-5	27.27	3.10E-06
3	hsa05203	Viral carcinogenesis	18.22	1.30E-11	33.21	5.30E-09
4	hsa05152	Tuberculosis	15.31	7.5E-2	17.07	4.80E-02
5	hsa05133	Pertussis	25.81	3.2E-2	14.34	5.0E-3
6	hsa05205	Proteoglycans in cancer	12.23	8.40E-03	18.79	2.90E-02

Table 5.

Significant common GO terms (biological process) found in known and new human proteins

Term	Host: % of proteins	Host: P-value	New host: % of proteins	New host: P-value
Antigen processing and presentation	35.38	3.50E-02	21.73	7.40E-06
Modulation by virus of host morphology or physiology	15.32	4.60E-03	24.22	5.60E-04
Innate immune response	30.7	2.40E-03	17.39	2.50E-02
Viral genome replication	23.08	3.50E-05	13.62	1.80E-02
Integrin-mediated signaling pathway	14.83	6.30E-02	28.19	4.70E-04
Platelet activation	16.61	5.00E-05	18.4	6.80E-02

Table 7.

Significant common GO terms (cellular component) found in known and new human proteins

Serial number	GO ID	Term	Host: % of proteins	Host: P-value	New host: % of proteins	New host: P-value
1	GO:0005615	Extracellular space	31.87	5.40E-02	19.56	5.50E-06
2	GO:0009986	Cell surface	26.14	4.70E-03	17.97	7.10E-03
3	GO:0005886	Plasma membrane	53.84	3.50E-02	28.2	7.70E-02
4	GO:0001726	Ruffle	22.15	5.80E-02	9.6	4.50E-03

Conclusion

In this article, we have reviewed host–virus interaction prediction methods across a variety of pathogenic species and their human host. These computational methods may play an important role in paving the way for experimental verification of virus–host interactions by highlighting high-potential interactions. Depending on the availability of the required data, some virus–host interaction mechanisms are better studied and more often targeted in research than others. The main challenge for computational virus–host interaction prediction is the lack of verified interactions and of the relevant feature information required by most prediction methods. Finally, a case study on EBOV–human interaction prediction is presented. Here, four different classifiers, DT Classifier [130], KNN Classifier [131], SVM [132] and GNB [133, 134], are used for predictions. In this approach, a cluster of potential human proteins is retrieved from the predicted novel interactions. These proteins have close structural and semantic similarities with known EBOV GP human target proteins and may therefore facilitate EBOV entry into host cells through interaction with EBOV GP. For defining the structural similarity feature, we use five scores, namely, TM-score [111], RMSD [112], MaxSub-score [113], GDT_score [114] and Tm-Rm score [Equation (1)]. The semantic similarity feature is determined by GO graph-based properties. The proteins predicted by this method are highly likely to interact with EBOV GP and facilitate EBOV entry into human cells. This method points to a promising future direction for predicting novel host–virus interactions.

Key Points

The review presents computational approaches toward host–virus interaction-based predictive models.
EBOV, a member of the Filoviridae family, is a negative-sense RNA virus that causes a high fatality rate in humans.
Among the seven genes of Ebola, GP is responsible for mediating attachment to host cell surface receptors and entry of the virus into host cells.
A case study on the prediction of novel EBOV–human protein interactions using structural and semantic similarity features is presented.
The structural similarity feature is defined using five structural alignment scores, namely, TM-score, RMSD, MaxSub-score, GDT_score and Tm-Rm score.
The semantic similarity feature is determined by using properties of the GO graph and the IC of GO terms.
Finally, pathway and GO-based functional annotation is provided for the novel EBOV–human interactions.

Funding

This work is partially supported by the CMATER research laboratory of the Computer Science and Engineering Department, Jadavpur University, India, PURSE-II and UPE-II project and Research Award [F.30-31/2016(SA-II)] from University Grants Commission, Government of India and Visvesvaraya PhD scheme from DeitY, Government of India.

Table 6.

Significant common GO terms (molecular function) found in known and new human proteins

Serial number	GO ID	Term	Host: % of proteins	Host: P-value	New host: % of proteins	New host: P-value
1	GO:0001618	Virus receptor activity	61.54	1.20E-14	19.43	3.20E-06
2	GO:0001786	Phosphatidylserine binding	33.17	2.10E-06	17.3	4.20E-07
3	GO:0004714	Transmembrane receptor protein tyrosine kinase activity	23.08	3.20E-04	9.8	1.60E-03

