
GFCNet: Utilizing graph feature collection networks for coronavirus knowledge graph embeddings.

Zhiwen Xie¹, Runjie Zhu², Jin Liu¹, Guangyou Zhou³, Jimmy Xiangji Huang⁴, Xiaohui Cui⁵.

Abstract

In response to the COVID-19 pandemic, researchers in machine learning and artificial intelligence have constructed medical knowledge graphs (KGs) from existing COVID-19 datasets. However, these KGs contain a considerable number of semantic relations that are incomplete or missing. In this paper, we focus on the task of knowledge graph embedding (KGE), which serves as an important solution for inferring the missing relations. Many KGE models with different scoring functions have been published to learn entity and relation embeddings. However, these models share the problem of rarely taking into account important KG features other than relation triples, such as attribute features, when dealing with heterogeneous, complex and incomplete COVID-19 medical data. To address this issue, we propose a graph feature collection network (GFCNet) for the COVID-19 KGE task, which considers both neighbor and attribute features in KGs. Extensive experiments on the COVID-19 drug KG dataset show promising results and demonstrate the effectiveness and efficiency of the proposed model. In addition, we outline future directions for deepening the study of the COVID-19 KGE task.

Entities: Chemical

Keywords: COVID-19; Knowledge Graph; Natural Language Processing; Text Mining

Year: 2022 PMID: 35855405 PMCID: PMC9279179 DOI: 10.1016/j.ins.2022.07.031

Source DB: PubMed Journal: Inf Sci (N Y) ISSN: 0020-0255 Impact factor: 8.233

Introduction

Pandemics have been a critical but unexpected factor threatening people’s lives throughout human history. The outbreak of COVID-19 is an unprecedented global health crisis that not only shocks the healthcare industry but may also lead to serious economic and financial consequences. According to the real-time statistics announced by the Johns Hopkins Coronavirus Resource Center, as of April 1, 2022, there have been more than 488 million confirmed coronavirus cases worldwide, and the total death toll has exceeded 6.14 million people. A great amount of COVID-19 data has accumulated over the past approximately nine months. This has motivated researchers and scholars from various backgrounds to devote themselves to the fight against the COVID-19 pandemic. For example, ACL 2020 hosted an NLP COVID-19 Workshop focusing on research themes such as document analysis and retrieval across COVID-19 corpora, COVID-19 question answering for mental health, and social media analysis of the COVID-19 pandemic [46], [11], [3]. While many people are still processing and completing the COVID-19 medical data, which is purely a collection of facts, we need to generate COVID-19 related knowledge from these open-source data collections with advanced data mining or artificial intelligence (AI) approaches within a restricted time. The knowledge graph (KG), which serves as an interpretable and explainable base for medical text mining methods, plays an important role in understanding the information structures and entity relations among the heterogeneous COVID-19 data. Currently, some researchers and scholars have constructed knowledge graphs based on the existing COVID-19 data collections. For instance, COKG-19 integrates many open knowledge graphs, which allows future studies to stand on the shoulders of giants. Take, for example, the COVID-19 antiviral drug knowledge graph (DrugKG) in COKG-19, shown in Fig. 1, which uncovers links between drugs, viruses and proteins. There are four types of entities, namely Drug, Virus, HostProtein and VirusProtein, which are represented in different colors. These entities are connected by four types of relations, namely effect, produce, binding and interaction. For example, the edge (Vidarabine, effect, HHV-3) indicates that the drug Vidarabine has an effect on the virus HHV-3. Other recently published COVID-19 KGs include [63], [47], [40], [51], [7]. These knowledge graphs serve as fundamental building blocks for COVID-19 downstream tasks, such as automatically supporting information retrieval and extraction for COVID-19 pneumonia diagnosis, detection and treatment.

Fig. 1

An example of a sub-graph in DrugKG. Different types of entities are represented in different colors: the purple nodes represent drugs, the pink nodes represent viruses, the green nodes represent host proteins, and the blue nodes represent virus proteins.

Although many domain-specific KGs have been constructed, we find that the coverage of knowledge among the existing COVID-19 knowledge graphs is very limited. These knowledge graphs are new, sparse and scattered, and contain a considerable number of semantic relations that are incomplete or missing. For example, when we took a close look at the DrugKG in the COVID-19 research knowledge graph, we found that around 36.1% of the drug effect relations, around 92.9% of the protein binding relations and around 32.6% of the protein interaction relations are missing, and around 1% of the virus produce relations are incomplete. The missing nodes and the incompleteness of the KG strongly affect the accuracy of analyses on COVID-19 datasets. Even worse, if the analytical results generated from such data are applied to real-life clinical decision support, they might lead to hundreds or millions of life-or-death consequences. Knowledge graph embedding (KGE), which represents the entities and relations of a KG in a low-dimensional space, is a fundamental task for inferring relations between nodes and gaining insights into the vast existing COVID-19 medical knowledge graphs. In the past few years, researchers have proposed a number of KGE models with different scoring functions to learn entity and relation embeddings. Examples include, but are not limited to, TransE [9], TransD [23], TransH [50], TransR [30], DistMult [58], ComplEx [43] and RotatE [42].
Although these models deliver promising results on general-domain datasets, they share the problem of rarely taking into account important KG features other than relation triples, such as attribute features, when dealing with heterogeneous, complex and incomplete COVID-19 medical data. In this paper, we aim to take a heterogeneous approach to automatically infer the missing semantic relations in the COVID-19 knowledge graph. To the best of our knowledge, we are the first to perform the KGE task on the open-source COVID-19 antiviral drug knowledge graph. Specifically, we propose a graph feature collection network (GFCNet) for the KGE task on COVID-19 DrugKG. Our proposed model utilizes both neighboring and attribute information, including the entity type and drug category, to enhance the entity representation. Different from the well-defined R-GCN [39], we propose a simple and parameter-efficient R-GCN (SR-GCN), which uses a GCN to fuse the neighboring and attribute information for better KGE. The main contributions of this paper are as follows: (1) We contribute to the global COVID-19 research community by investigating the newly built antiviral drug knowledge graph (DrugKG) and constructing a KGE task on top of it. (2) We propose a graph feature collection network (GFCNet), which combines a neighbor collector and an attribute collector to tackle the KGE problem in DrugKG. (3) We demonstrate the effectiveness of our proposed model and provide a baseline for future studies on COVID-19 medical knowledge graphs. The remainder of the paper is structured as follows. Section 2 reviews related work on knowledge graph embedding and the state-of-the-art approaches. Section 3 describes the framework of our proposed method. Section 4 presents the experimental details and discusses the experimental results in comparison with a list of baseline models. In Section 5, we conclude with a summary and suggest several possible directions for future work.

Related Work

Knowledge Discovery for COVID-19

Medical knowledge is useful for improving the performance of downstream tasks [55]. Therefore, knowledge discovery has gained extensive attention in recent years. Knowledge discovery refers to the process of extracting relevant information and finding useful knowledge from large amounts of structured or unstructured data. Natural language processing and text mining for knowledge discovery in the medical domain are by nature even more difficult tasks. Therefore, most researchers dedicate themselves to COVID-19 literature studies by building information retrieval tools and knowledge bases that could ease future research and enhance explainability. Shen et al. [40] develop an end-to-end recommendation system that matches academic research papers with potential use cases in COVID-19 studies. Wise et al. [51] construct the COVID-19 Knowledge Graph (CKG) to understand and present complex relations between COVID-19 scientific articles, with a latent schema and enriched entity information generated by Amazon Web Services (AWS) technologies. Wang et al. [49] design the EVIDENCEMINER system to help researchers and scholars mine textual evidence from the COVID-19 literature corpus. Soni et al. [41] conduct empirical studies on COVID-19 information retrieval results from two commercial search engines, Google and Amazon, and compare them to the more academic prototypes evaluated by the TREC–COVID track [37]. Zhao et al. [66] combine knowledge graph embedding and logic rules to refine the COVID-19 knowledge graph. Other COVID-19 knowledge discovery research comprises diagnosis and detection studies. Most of the existing studies use CT scans and X-rays to diagnose and detect COVID-19 positive cases. For example, COVID-Net [31] is an open-source COVID-19 diagnosis and detection platform for chest X-ray images, and [14] develops an automated chest CT scan analysis tool for COVID-19 diagnosis and detection with 2D and 3D CNN algorithms. Other related work includes, but is not limited to, [48], [33], [1]. However, these works mainly focus on the understanding and analysis of X-ray and CT images, which is outside the scope of this study.

KG Embedding

With the development of big data, data sets with graph structure are ubiquitous, ranging from social networks to the World Wide Web and knowledge graphs (KGs) [61], [62]. Among these graphs, KGs are heterogeneous networks that carry richer information and semantic meaning. Learning embeddings for KGs is useful for capturing the rich information hidden in them. KGE aims to embed the entities and relations of KGs into continuous vector spaces. The purpose of embedding is to preserve the original structure of the knowledge graph while simplifying its later manipulation. As the number and scale of KGs grow rapidly, especially in the medical domain, KGE becomes more important in KG analysis and semantic data modeling. Translation-based models are one of the major approaches to the KGE problem. Bordes et al. proposed the most representative translational distance method, TransE [9]. Later, various improvements were made to boost model accuracy [23], [50], [30]. However, the shallow structure of these models restricts their expressiveness. Bilinear or semantic-based models are another major approach to the KGE problem. This group of models uses matrix decomposition to learn KGE. Specifically, they match the latent semantics of entities and relations contained in their vector space representations and compute scoring functions based on similarity. Examples in the semantic-based category include [36], [43], [25], [5], [58]. Similar to the translation-based models, the expressiveness of bilinear models is limited unless the embedding size is increased, which could lead to a considerable increase in parameters and a fundamental limit on scalability. The third major category of KGE models is rotate-based models, represented by RotatE [42] and QuatE [65]. RotatE [42] is named after its concept of mapping the source entity to the target entity through a relation-specific rotation in a complex vector space. In essence, it is a translational model that specializes in modeling and inferring various relation patterns, such as symmetry, inversion and composition. QuatE [65], unlike RotatE, which involves rotation in only one plane, involves geometric rotation in two planes and is in essence a semantic matching model. QuatE uses quaternion representations and relational rotation quaternions to perform semantic matching between head and tail entities. Aside from the three major categories listed, other KGE approaches adopt multi-layer neural networks, such as CNN-based and GNN-based models. ConvE [12], ConvKB [35], RSN [15] and R-GCN [39] are representative examples of these neural approaches. For example, ConvE [12] uses a multi-layer convolutional neural network (CNN) for link prediction to capture more expressive features. Other studies, such as InteractE [44] and ReInceptionE [56], follow this direction to learn more expressive features. CNN-based models deliver promising results but do not model structural information. GNN-based models, such as R-GCN [39], KBGAT [34] and BiGAT [57], leverage graph convolutional networks (GCNs) to aggregate neighborhood features. ManifoldE [53] and MAKR [16] embed entities and relations using manifold-based embedding. Xiao et al. [54] proposed a CNN-based model to integrate text features and structural features. These models generally perform well on link prediction and entity classification. In general, all the papers mentioned above focus more on models than on specific datasets. Although these approaches achieve promising results in their own settings, they generally work well only for KGs with more neighbors and more complete nodes and relations. On our COVID-19 KG dataset, where data points are scarce and missing links are a major issue, these models are not able to perform well. Meanwhile, the existing models do not consider both the structure and the attributes of the KG, which makes them unsuitable for our COVID-19 dataset. Beyond COVID-19, our approach can also be applied to a wide range of downstream tasks, such as drug discovery [2], [8], [64], contact tracing [60], [17], biomedical and genomics information retrieval [21], [22], [59] and detection of coronavirus-themed mobile malware [20]. In this paper, we focus on the link prediction task on COVID-19 KGs and leave these applications to future work.

Differences from Existing Methods

Compared with the models mentioned above, our major contribution is a graph feature collection network (GFCNet) for KGE on COVID-19 DrugKG, which aims to capture both neighboring and attribute information effectively. We note that some existing models (e.g., R-GCN [39], KR-EAR [29], MARINE [13]) also use neighboring or attribute information for KGE. Here we highlight the novelty and differences in two ways: (1) R-GCN needs to learn a different parameter matrix for each relation, which increases the number of parameters. Different from R-GCN, we propose a simple and parameter-efficient R-GCN (SR-GCN). The proposed SR-GCN uses a simple relation-specific diagonal matrix for each relation instead of a full matrix, which is parameter-efficient. (2) KR-EAR [29] and MARINE [13] use attribute information to enrich the representation of entities, but they are not able to capture neighborhood information. We propose a graph feature collection network (GFCNet), which combines a neighbor collector and an attribute collector to tackle the KGE problem in DrugKG. The neighbor collector aggregates neighborhood features and the attribute collector learns attribute information. We combine these two kinds of features to obtain the final entity representations.

Our Approach

In this section, we present our graph feature collection network (GFCNet) for the KGE task in detail. We first describe the notations and problem definition used in our paper. Then we elaborate on the model architecture, followed by the loss function and training. Lastly, we give a complexity analysis of our model.

Notations and Problem Definition

In this subsection, we first present the KG and KGE mathematical notations used in the rest of the study. For ease of description, this study uses lower case letters to represent vectors, upper case letters to represent matrices, and italic lower case letters to represent subscript indices. The notations and definitions used in the following sections are shown in Table 1.

Table 1

The Notation and Definition of KG and KGE.

Name	Notation	Definition
Knowledge Graph	$\Delta$	A set of triplets in the form $(h,r,t)$
Entity Collection	$E$	Vocabulary collection of the entity
Entity Relations	$R$	The set of pre-defined entity relations
Entity Attributes	$A_{e_i}$	A set of attributes for entity $e_i$
Entity Embedding	$E^l$	The entity embedding in the $l$-th layer
Vector Dimension	$d$	The dimension of entity, attribute and relation vectors
Activation Function	$\sigma$	The activation function
Entity Vectors	$h, t$	The entity vectors for entity $h$ and $t$
Relation Vector	$r$	The vector for relation $r$
Attribute Vector	$a_k$	The vector for the attribute $a_k \in A_{e_i}$
Hidden Neighbor Vector	$e_{i_h}^{l+1}$	The hidden feature vector for entity $e_i$ obtained by Neighbor Collector
Hidden Attribute Vector	$e_{i_a}^{l+1}$	The hidden feature vector for entity $e_i$ obtained by Attribute Collector
Scoring Function	$f(h,r,t)$	The scoring function for the triple $(h,r,t)$
Loss Function	$\mathcal{L}$	The loss function of the model

Typically, a KG is denoted as $\Delta$, which stands for a collection of triples in the form $(h,r,t)$ with $h, t \in E$ and $r \in R$, where $E$ is the vocabulary collection of the entity and $R$ is the set of pre-defined entity relations. KGE aims to embed knowledge into a low-dimensional continuous vector space while preserving certain properties. In the vector space, each node (entity) is represented as a point, and each relation between nodes is represented as an operation on the embeddings of entities.

Model Architecture

In this study, in response to the COVID-19 pandemic, we investigate a new antiviral drug knowledge graph (DrugKG) and construct a knowledge graph embedding (KGE) task on it. In the DrugKG, several useful features are available for the KGE task, including relation triples (i.e., KG structure features) and attribute triples (i.e., entity attribute features). Most previous studies use only relation triples to learn knowledge graph embeddings [9], [58], [43]. Some methods based on graph neural networks (GNNs) try to leverage more information from neighborhood features [39], [34], [6]. However, existing methods rarely consider other important features in the KG, such as attribute features. In this paper, we propose a graph feature collection network (GFCNet) for knowledge graph embedding in DrugKG. As shown in Fig. 2, the proposed GFCNet consists of three components, namely the neighbor collector, the attribute collector and the feature ensemble. The neighbor collector learns local neighborhood features around the given entity. The attribute collector is designed to make full use of the attributes of each entity. Given an entity with its neighbors and attributes, we first represent them as vectors through embedding layers (i.e., the entity embedding layer, attribute embedding layer and relation embedding layer). Then, we apply the neighbor collector and the attribute collector to gather neighborhood and attribute features. Finally, these two kinds of features are integrated by a feature ensemble module.

Fig. 2

The structure of the proposed graph feature collection network (GFCNet). The proposed model consists of three components: neighbor collector, attribute collector and feature ensemble. The embedding layers are used to convert the relations, entities and attributes to low-dimensional vectors. The FFN denotes a feed-forward network.

Neighbor Collector

Fig. 2 gives an example of the neighbor collector. The drug entity Vidarabine aggregates information from its neighbors HHV-3 and HHV-4. Recently, graph neural networks (GNNs) such as GCN [27] have been successfully applied to model graph data. However, traditional GNNs are designed for unlabeled graphs, whose edges carry no labels. In contrast, KGs are graphs with different relations, which makes traditional GNNs unsuitable for modeling KG data. To address this issue, R-GCN [39] uses relation-specific transformation matrices to aggregate neighbors linked by different relations, which is formulated as:

$$e_i^{l+1} = \sigma\left(\sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{c_{i,r}} W_r^l e_j^l + W_0^l e_i^l\right) \quad (1)$$

where $N_i^r$ denotes the set of indices of the neighbors of entity $e_i$ under relation $r$, $c_{i,r}$ is a normalization factor, $W_r^l$ is the relation-specific transformation parameter matrix for relation $r$, $\sigma$ is an activation function, and $W_0^l$ is the parameter matrix of a one-layer feed-forward network (FFN). R-GCN is able to model relational graph data and has shown that explicitly modeling neighborhood information is important for the KGE task. However, R-GCN needs to learn a different parameter matrix for each relation, which increases the number of parameters. To make full use of the neighborhood information while reducing the number of parameters, we use a simple and parameter-efficient R-GCN (SR-GCN) to aggregate neighboring nodes for each entity. Specifically, instead of using relation-specific transformation matrices, which need $|R|d^2$ parameters for all relations (where $d$ is the dimension of the embedding), we use a simple relation-specific diagonal matrix as the transformation matrix, inspired by [58]. Formally, in our SR-GCN, the entity embeddings are updated as:

$$e_{i_h}^{l+1} = \sigma\left(\sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{c_{i,r}} \mathrm{diag}(r^l)\, e_j^l + W_0^l e_i^l\right) \quad (2)$$

where $r^l$ is the relation-specific vector for relation $r$, and $\mathrm{diag}(r^l)$ is a diagonal matrix for $r^l$. By applying SR-GCN, we can efficiently model KGs with different relations using $|R|d$ transformation parameters for all the relations, which is much smaller than in R-GCN.
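The neighbor aggregation described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy two-dimensional vectors, the choice of the normalization factor as the number of neighbors under each relation, the self-loop term and the use of ReLU as the activation are all assumptions made for the example.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sr_gcn_update(i, ent, rel, W0, triples):
    """One SR-GCN-style update for entity i (hypothetical helper).

    ent: entity name -> vector, rel: relation name -> vector (the diagonal),
    W0: d x d self-loop matrix, triples: list of (head, relation, tail).
    """
    d = len(ent[i])
    # Group the neighbours of i by relation.
    neighbours = {}
    for h, r, t in triples:
        if h == i:
            neighbours.setdefault(r, []).append(t)
    agg = [0.0] * d
    for r, tails in neighbours.items():
        c = len(tails)  # assumed normalisation factor: neighbours under r
        for j in tails:
            # diag(r) times e_j is an element-wise product: d weights per relation.
            for k in range(d):
                agg[k] += rel[r][k] * ent[j][k] / c
    self_term = matvec(W0, ent[i])
    # ReLU activation (assumed).
    return [max(0.0, a + s) for a, s in zip(agg, self_term)]


# Toy vectors, invented for illustration only.
ent = {"Vidarabine": [1.0, 2.0], "HHV-3": [1.0, 0.0], "HHV-4": [3.0, 4.0]}
rel = {"effect": [0.5, -1.0]}
W0 = [[1.0, 0.0], [0.0, 1.0]]
triples = [("Vidarabine", "effect", "HHV-3"), ("Vidarabine", "effect", "HHV-4")]
updated = sr_gcn_update("Vidarabine", ent, rel, W0, triples)
```

Because each relation contributes only a length-d vector rather than a d x d matrix, the per-relation cost drops from d^2 to d weights, which is the saving the text attributes to SR-GCN.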

Attribute Collector

In the DrugKG, each entity has attributes that are important for learning its representation. For example, the drug entity Vidarabine has the attribute triples (Vidarabine, type, Drug), (Vidarabine, drug_category, Anti-infective Agents) and (Vidarabine, drug_category, Antiviral Agents), among others. These attributes contain useful information about the entity, e.g., the type of the entity is Drug, and the drug is used as an anti-infective and antiviral treatment. We believe this attribute information can be used to enrich the entity embedding and improve the performance of the KGE task. However, previous KGE studies rarely consider attribute information, especially on medical knowledge graphs. To this end, we design an attribute collector module to take full advantage of the attribute information. As shown in Fig. 2, we first represent each attribute value as a vector using the attribute embedding layer. Thus, we obtain the attribute features for the centre entity $e_i$, denoted as $\{a_1, a_2, \ldots, a_n\}$. Then, the attribute information is obtained by aggregating these attribute embeddings:

$$e_{i_a}^{l+1} = \sigma\left(\frac{1}{n} \sum_{k=1}^{n} W_a^l a_k\right) \quad (3)$$

where $W_a^l$ is the parameter matrix of an FFN layer for attribute features in the $l$-th layer, and $n = |A_{e_i}|$ is the number of attributes of $e_i$.
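As a rough illustration of the attribute collector, the sketch below averages the projected attribute vectors of one entity and applies an activation. It assumes mean aggregation, a ReLU activation and toy two-dimensional attribute vectors; the function names and all numeric values are hypothetical and chosen only to make the example easy to check by hand.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def attribute_collect(attr_vecs, Wa):
    """Average the FFN-projected attribute vectors, then apply ReLU (assumed)."""
    d_out = len(Wa)
    if not attr_vecs:
        # An entity without attributes contributes a zero vector.
        return [0.0] * d_out
    total = [0.0] * d_out
    for a in attr_vecs:
        proj = matvec(Wa, a)
        total = [s + p for s, p in zip(total, proj)]
    n = len(attr_vecs)
    return [max(0.0, s / n) for s in total]


# Toy attribute embeddings for the three attribute triples of Vidarabine.
attrs = {
    ("type", "Drug"): [1.0, 0.0],
    ("drug_category", "Anti-infective Agents"): [0.0, 2.0],
    ("drug_category", "Antiviral Agents"): [2.0, 1.0],
}
Wa = [[1.0, 0.0], [0.0, 1.0]]  # identity, so the result is the plain mean
e_a = attribute_collect(list(attrs.values()), Wa)
```

With the identity matrix as the projection, the result is simply the mean of the three attribute vectors, which makes it easy to see how each attribute contributes equally.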

Feature Ensemble

The neighborhood and attribute features are obtained by the neighbor collector and the attribute collector, respectively. In our study, we apply a sum operation to combine these two features. To make model training more stable, we also use a residual connection [19] followed by layer normalization [4]. The output entity embedding for entity $e_i$ is computed as:

$$e_i^{l+1} = \mathrm{LN}\left(e_{i_h}^{l+1} + e_{i_a}^{l+1} + e_i^l\right) \quad (4)$$

where $\mathrm{LN}$ denotes layer normalization [4]. We can stack $L$ layers to propagate the neighborhood and attribute features as defined in Eq. 2, Eq. 3 and Eq. 4. Thus we obtain the final entity embedding $E^L$, which consists of all the entity embeddings.
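The feature ensemble step (sum, residual connection, layer normalization) can be sketched as follows. This is an illustrative version that omits the learnable gain and bias usually attached to layer normalization; the function names, the epsilon value and the toy inputs are assumptions.

```python
import math


def layer_norm(v, eps=1e-5):
    """Normalise a vector to zero mean and unit variance (no learnable gain/bias)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def feature_ensemble(e_prev, e_h, e_a):
    """Sum neighbour and attribute features, add the residual, then normalise."""
    summed = [p + h + a for p, h, a in zip(e_prev, e_h, e_a)]
    return layer_norm(summed)


# Toy inputs: previous-layer embedding, neighbour feature, attribute feature.
out = feature_ensemble([1.0, 0.0], [2.0, 0.0], [1.0, 1.0])
```

The residual term keeps the previous-layer embedding in the sum, so stacking several layers does not discard what earlier layers learned.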

Loss Function and Training

In order to optimize the parameters of GFCNet, we use a simple and efficient function to compute the score of a triple [58]. Different from the existing scoring function, we compute the score based on the entity embeddings obtained by our model:

$$f(h,r,t) = h^{\top} \mathrm{diag}(r)\, t \quad (5)$$

where $h$ and $t$ are the embeddings of the head and tail entities taken from $E^L$, and $r$ is the embedding of relation $r$. The objective function is a very important factor for the KGE task and has a great effect on performance [24], [32], [38]. In this paper, we use the efficient cross-entropy loss function over the distribution of all the entities to optimize our model. For a given triple $(h,r,t)$, we compute the probability of the tail entity as:

$$p(t \mid h,r) = \frac{\exp(f(h,r,t))}{\sum_{t' \in E} \exp(f(h,r,t'))} \quad (6)$$

where $t'$ is a candidate tail entity for the triple. Then, the loss function is defined as:

$$\mathcal{L} = -\sum_{(h,r,t) \in \Delta} \log p(t \mid h,r) \quad (7)$$

Note that we use reciprocal relations [25], [28] in our model, which introduce an inverse triple $(t, r^{-1}, h)$ for each $(h,r,t)$. Thus, when predicting the head entity for the triple $(?, r, t)$, we can predict the probability of its inverse form $(t, r^{-1}, ?)$.
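To make the training objective concrete, the sketch below scores a triple with a DistMult-style product, turns the scores of all candidate tails into log-probabilities with a softmax, and computes the negative log-likelihood of the true tail. It also shows how reciprocal triples can be generated. The helper names and the `_inv` suffix used to name inverse relations are hypothetical, and the vectors are toy values.

```python
import math


def distmult_score(h, r, t):
    """Three-way element-wise product, summed over dimensions."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def tail_log_probs(h, r, entity_vecs):
    """Log-softmax over the scores of every candidate tail entity."""
    scores = {e: distmult_score(h, r, v) for e, v in entity_vecs.items()}
    m = max(scores.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {e: s - log_z for e, s in scores.items()}


def triple_loss(triple, ent, rel):
    """Cross-entropy loss of one triple over all candidate tails."""
    h, r, t = triple
    return -tail_log_probs(ent[h], rel[r], ent)[t]


def add_reciprocal(triples):
    """Add an inverse triple (t, r_inv, h) for every (h, r, t)."""
    return triples + [(t, r + "_inv", h) for h, r, t in triples]


ent = {"Vidarabine": [1.0, 2.0], "HHV-3": [1.0, 0.0], "HHV-4": [3.0, 4.0]}
rel = {"effect": [0.5, -1.0]}
triples = [("Vidarabine", "effect", "HHV-3")]
loss = triple_loss(triples[0], ent, rel)
augmented = add_reciprocal(triples)
```

With reciprocal triples in the training set, head prediction reduces to tail prediction on the inverse relation, so a single softmax over tails covers both directions.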

Complexity Analysis

In addition to performance, efficiency is also important for the KGE task, especially for COVID-19 KGs, whose scale grows rapidly. Hence, it is essential to develop efficient models to address the KGE problem. Table 2 shows the parameter efficiency of our model and some strong baseline models. Compared to the related GNN-based model (e.g., R-GCN) with the same number of layers and embedding size, our model uses fewer parameters since we use a diagonal matrix as the relation-specific transformation. Note that the entity and relation embeddings contribute most of the parameters, since they depend on the size of the KG, which is always very large in practice. Therefore, our model is more parameter-efficient than models with double or more entity and relation embeddings, such as RotatE and QuatE. Specifically, according to Table 2, RotatE has $2|E|d+|R|d$ parameters while our model has $|E|d+|R|d+L|R|d+2Ld^2$, which is fewer whenever $|E| > L(|R|+2d)$. Thus, GFCNet is efficient and can be easily adapted to many real-world applications.

Table 2

Parameter efficiency of different models.

Model	Number of parameters
TransE	$|E|d+|R|d$
DistMult	$|E|d+|R|d$
Rescal	$|E|d+|R|d^2$
RotatE	$2|E|d+|R|d$
QuatE	$4|E|d+4|R|d$
TuckER	$|E|d+|R|d+d^3$
R-GCN	$|E|d+|R|d+L|R|d^2+Ld^2$
GFCNet	$|E|d+|R|d+L|R|d+2Ld^2$

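The formulas in Table 2 can be evaluated directly to compare models for a given graph size. The sketch below does so for a hypothetical configuration; the entity count, embedding size and layer count are made-up values for illustration, since the number of entities in DrugKG is not stated here. Only the four relation types come from the dataset description.

```python
def param_counts(n_ent, n_rel, d, n_layers):
    """Parameter counts from Table 2 for a given KG size and model shape."""
    base = n_ent * d + n_rel * d
    return {
        "TransE": base,
        "DistMult": base,
        "Rescal": n_ent * d + n_rel * d ** 2,
        "RotatE": 2 * n_ent * d + n_rel * d,
        "QuatE": 4 * n_ent * d + 4 * n_rel * d,
        "TuckER": base + d ** 3,
        "R-GCN": base + n_layers * n_rel * d ** 2 + n_layers * d ** 2,
        "GFCNet": base + n_layers * n_rel * d + 2 * n_layers * d ** 2,
    }


# Hypothetical sizes: 10,000 entities, 4 relations, d = 100, 2 layers.
counts = param_counts(n_ent=10_000, n_rel=4, d=100, n_layers=2)
for name, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{name:9s} {n:>10,d}")
```

For these assumed sizes the entity embeddings dominate every count, which is why models that double or quadruple the embedding tables end up with the most parameters.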

Experiments

In this section, our proposed model is evaluated on the DrugKG dataset and compared with existing KGE models from five categories of popular approaches. We then implement several variants of GFCNet by removing different components and conduct an ablation study comparing GFCNet with these variants. The ablation study shows that each component of GFCNet plays an important role in the KGE task.

Datasets

In response to the COVID-19 pandemic, we conduct experiments on a new antiviral drug knowledge graph (DrugKG) released in the COVID-19 Research KG. The DrugKG is constructed from the relationships between antiviral drugs, viruses, virus-related proteins, the host and host proteins in the DrugBank [52] database. As shown in Fig. 1, there are four kinds of entities in DrugKG, namely Drug, Virus, VirusProtein and HostProtein, and four types of relations between them, namely effect, produce, binding and interaction. The relation effect holds between Drug and Virus entities and indicates that an antiviral drug has a certain effect on the virus. The relation produce holds between Virus and VirusProtein entities and expresses the relationship between a virus and the protein it expresses. The relations binding and interaction describe interactions between VirusProtein and HostProtein entities. In DrugKG, each entity also has some important attributes, such as type and drug_category. Following previous KGE studies [9], we split the triples in DrugKG into training, validation and testing sets. The statistics of the dataset are summarized in Table 3.

Table 3

Statistics of the DrugKG dataset.

Relations	Training	Validation	Testing
effect	47	0	6
produce	709	28	84
binding	8790	458	1028
interaction	13999	723	1636

Total	23545	1209	2754


Evaluation

For KGE models, a common evaluation method is a link prediction task, which aims to predict missing triples in the KG, namely, to predict the missing tail entity $t$ given $(h, r, ?)$ or the missing head entity $h$ given $(?, r, t)$. Specifically, we rank all the entities in the KG to predict the most probable ones. To evaluate the performance of the model, we use three popular metrics: MRR (mean reciprocal rank), MR (mean rank) and Hits@N (the percentage of correct entities ranked in the top N, where $N \in \{1, 3, 10\}$). However, directly ranking all the entities can distort the metrics on the testing set when some corrupted triples are themselves correct, such as triples that exist in the training set [9]. In this situation, correct triples from the training set may rank above the current test triple $(h,r,t)$. Thus, even if the model scores the test triple highly, it may rank behind other correct triples, making the MRR, MR and Hits@N metrics inexact. To avoid this problem, we evaluate the model in the filtered setting [9], [12], namely, we filter out all the correct triples that occur in the training, validation and testing sets except the current triple itself. Specifically, let $rank_h$ denote the filtered rank of the head entity $h$ and $rank_t$ the filtered rank of the tail entity $t$ for a test triple, and let $T$ denote the set of test triples. Then MR is computed as $\frac{1}{2|T|}\sum_{(h,r,t) \in T}(rank_h + rank_t)$; MRR is computed as $\frac{1}{2|T|}\sum_{(h,r,t) \in T}\left(\frac{1}{rank_h} + \frac{1}{rank_t}\right)$; and Hits@N is computed as $\frac{1}{2|T|}\sum_{(h,r,t) \in T}\left(\mathbb{I}[rank_h \le N] + \mathbb{I}[rank_t \le N]\right)$, where $\mathbb{I}[\cdot]$ is 1 if the condition is true and 0 otherwise.
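The filtered ranking protocol and the three metrics can be sketched as below. The helper names are hypothetical, the candidate scores are toy values, and ties are broken optimistically (only strictly higher scores push the target down), which is one of several conventions in use.

```python
def filtered_rank(scores, target, known_true):
    """Rank of `target` after skipping other entities known to be correct.

    scores: candidate entity -> model score (higher is better).
    known_true: other correct answers from the train/valid/test sets.
    """
    target_score = scores[target]
    rank = 1
    for cand, s in scores.items():
        if cand == target or cand in known_true:
            continue  # filtered setting: ignore other correct triples
        if s > target_score:
            rank += 1
    return rank


def link_prediction_metrics(ranks, ns=(1, 3, 10)):
    """MR, MRR and Hits@N over a list holding both head and tail ranks."""
    n = len(ranks)
    out = {"MR": sum(ranks) / n, "MRR": sum(1.0 / r for r in ranks) / n}
    for k in ns:
        out[f"Hits@{k}"] = sum(1 for r in ranks if r <= k) / n
    return out


# Toy example: "A" outranks the target "C" but is itself a known correct answer.
scores = {"A": 0.9, "B": 0.8, "C": 0.5, "D": 0.1}
raw = filtered_rank(scores, "C", known_true=set())
filtered = filtered_rank(scores, "C", known_true={"A"})
metrics = link_prediction_metrics([1, 2, 4, 20])
```

In the toy example the raw rank of the target is 3, while the filtered rank is 2 because the known correct answer is no longer counted against it.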

Experimental Setups

All the experiments are conducted on a Linux server with 128 GB of memory and RTX 2080Ti GPUs. We implement our model using PyTorch, a popular deep learning framework, and choose Adam [26] as the optimizer. We select the hyper-parameters (the embedding size of the KG embeddings, the weight decay, the batch size, the dropout rate, the learning rate and the number of hidden layers $L$) by grid search according to Hits@10 on the validation set. We finally set the embedding size to 100 and the batch size to 128. The activation function is ReLU. We initialize the parameters of our model using Kaiming initialization [18], a robust initialization method designed for rectifier nonlinearities. Most KGE models use negative sampling to train the model [9], [38], randomly constructing negative triples by perturbing the head or tail entity of a correct triple $(h,r,t)$. However, the performance of these models depends on the quantity and quality of the negative samples. Another training method, termed 1vsAll [28], [38], takes all possible triples obtained by replacing the head or tail entity with all other entities. In our study, we apply the 1vsAll method to train our model due to its simplicity and efficiency.
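The difference between the two training schemes mentioned above can be illustrated with a small sketch. It only corrupts tail entities for brevity, and the function names and the entity list are hypothetical; it is meant to show the shape of the training signal rather than any particular library's sampler.

```python
import random


def sample_negatives(triple, entities, k, rng):
    """Negative sampling: corrupt the tail with k randomly drawn entities."""
    h, r, t = triple
    candidates = [e for e in entities if e != t]
    # A drawn entity may still form a true triple; this is the sample-quality issue.
    return [(h, r, rng.choice(candidates)) for _ in range(k)]


def one_vs_all_target(triple, entities):
    """1vsAll: every entity is a candidate tail; the target is a one-hot vector."""
    h, r, t = triple
    return (h, r), [1.0 if e == t else 0.0 for e in entities]


entities = ["Vidarabine", "HHV-3", "HHV-4", "HHV-5"]  # hypothetical entity list
triple = ("Vidarabine", "effect", "HHV-3")
negatives = sample_negatives(triple, entities, k=2, rng=random.Random(0))
query, target = one_vs_all_target(triple, entities)
```

With 1vsAll there is no sampling step to tune: the one-hot target over all entities pairs naturally with a softmax cross-entropy loss over candidate tails.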

Baselines of Comparison

To investigate the KGE performance of our model, we compare it with popular baselines, which can be divided into five categories: translation-based models, bilinear models, rotate-based models, CNN-based models and GNN-based models. Translation-based models view the relation $r$ as a translation from the head entity $h$ to the tail entity $t$ and assume that the head entity vector plus the relation vector should be close to the tail entity vector. Translation-based methods include TransE [9], TransD [23], TransH [50], TransR [30] and KR-EAR [29]. Bilinear models, also called semantic matching models, use matrix decomposition to learn knowledge graph embeddings and include Rescal [36], DistMult [58], ComplEx [43], SimplE [25], TuckER [5] and MARINE [13]. Rotate-based models view the relation as a rotation in complex space and include RotatE [42] and QuatE [65]. CNN-based models, such as ConvE [12] and ConvKB [35], use CNNs to capture more expressive features. GNN-based models, such as R-GCN [39] and KBGAT [34], apply the graph convolutional network (GCN) [27] and the graph attention network (GAT) [45] to gather neighborhood features. The results of these baseline models are obtained by downloading and running the available source code on the DrugKG dataset.
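The first three baseline families differ mainly in their scoring functions, which the sketch below writes out for one representative each: a TransE-style translation distance, a DistMult-style bilinear product and a RotatE-style rotation in the complex plane. The L1 distance and the parametrization of rotations by phase angles are choices made for this illustration, and the function names are hypothetical.

```python
import cmath
import math


def transe_score(h, r, t):
    """Translation: h + r should land close to t (negative L1 distance)."""
    return -sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))


def distmult_score(h, r, t):
    """Bilinear with a diagonal relation matrix: sum of three-way products."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def rotate_score(h, phases, t):
    """Rotation: each complex coordinate of h is rotated by a phase angle."""
    return -sum(abs(hi * cmath.exp(1j * p) - ti) for hi, p, ti in zip(h, phases, t))


# A perfect translation and a perfect quarter-turn rotation both score about 0.
s_trans = transe_score([1.0, 2.0], [1.0, -1.0], [2.0, 1.0])
s_mult = distmult_score([1.0, 2.0], [3.0, 4.0], [5.0, 6.0])
s_rot = rotate_score([1 + 0j], [math.pi / 2], [1j])
```

Distance-based scores such as the first and third are at most zero and peak when the transformed head coincides with the tail, whereas the bilinear score is an unbounded similarity.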

Experimental Results

Table 4 shows the link prediction results of some strong baselines and our model on DrugKG. Among the translation-based models, the baseline TransE [9] achieves good performance on DrugKG and surpasses several of its extensions, such as TransD [23], TransH [50] and TransR [30]. TransE also outperforms some bilinear models, such as DistMult [58], ComplEx [43] and SimplE [25]. This indicates that the simple TransE model is more suitable for modeling DrugKG than these more complex models. Among the bilinear models, TuckER [5] performs much better than the others and also outperforms the strong TransE baseline. The rotate-based model RotatE achieves the second-best results on MRR, Hits@10, Hits@3 and Hits@1.

Table 4

Link prediction results on DrugKG. The best results are in bold and the second-best results are underlined.

Models		MRR	MR	Hits@10	Hits@3	Hits@1
Translation-based models	TransE [9]	0.196	734.81	0.367	0.226	0.108
	TransD [23]	0.147	750.71	0.332	0.181	0.052
	TransH [50]	0.153	765.69	0.330	0.188	0.061
	TransR [30]	0.130	792.03	0.282	0.147	0.056
	KR-EAR [29]	0.184	768	0.346	0.205	0.102
Bilinear models	Rescal [36]	0.104	880.45	0.202	0.103	0.055
	DistMult [58]	0.169	796.35	0.302	0.180	0.104
	ComplEx [43]	0.171	1004.97	0.313	0.184	0.104
	SimplE [25]	0.172	788.11	0.308	0.179	0.106
	TuckER [5]	0.224	1242.00	0.368	0.249	0.150
	MARINE [13]	0.177	1126	0.338	0.186	0.115
Rotate-based models	RotatE [42]	0.243	820.40	0.408	0.273	0.160
	QuatE [65]	0.198	777.81	0.351	0.220	0.123
CNN-based models	ConvE [12]	0.193	970.09	0.331	0.214	0.123
	ConvKB [35]	0.069	816.49	0.205	0.090	0.000
GNN-based models	R-GCN [39]	0.181	1341.61	0.294	0.196	0.124
	KBGAT [34]	0.092	761.00	0.198	0.090	0.040
	GFCNet (ours)	0.270	630.12	0.432	0.299	0.188

Compared to the baseline models, the proposed GFCNet achieves the best performance across all the metrics. Our model outperforms strong baselines such as TransE [9], RotatE [42] and TuckER [5]. KR-EAR [29] and MARINE [13] are two representative models that incorporate attribute features for KGE. The empirical results show that our GFCNet also outperforms KR-EAR and MARINE. The advantage behind these comparisons is that our model is able to aggregate heterogeneous graph features, including neighborhood features and attribute features, which are essential for learning good representations of the entities in a knowledge graph. These experimental results show the effectiveness of our GFCNet.
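For reference, the metrics reported in Table 4 can be computed from the rank of the correct entity in each test query. The following is a minimal sketch with made-up ranks; the function name and the example values are illustrative assumptions, and the way ranks are obtained (for instance, how ties or other known correct entities are handled) is left out.

```python
def link_prediction_metrics(ranks, ks=(1, 3, 10)):
    """Compute MR, MRR and Hits@k from 1-based ranks of the correct entities."""
    n = len(ranks)
    metrics = {
        "MR": sum(ranks) / n,                      # mean rank, lower is better
        "MRR": sum(1.0 / r for r in ranks) / n,    # mean reciprocal rank, higher is better
    }
    for k in ks:
        # Fraction of queries whose correct entity is ranked within the top k.
        metrics[f"Hits@{k}"] = sum(1 for r in ranks if r <= k) / n
    return metrics

toy_ranks = [1, 2, 4, 20]   # hypothetical ranks for four test queries
toy_metrics = link_prediction_metrics(toy_ranks)
print(toy_metrics)
```

Because MR averages raw ranks, a few badly ranked queries can dominate it, which is one reason a model can lead on MRR and Hits@k without having the lowest MR.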

Results on Different Relations

Fig. 3 illustrates the detailed MRR results on the different relations of DrugKG. Our GFCNet achieves the best results on all four relations. When predicting the head entity h given the relation and the tail entity, we obtain large performance gains over the state-of-the-art RotatE model, especially on the relations binding, produce and effect. When predicting the tail entity t given the head entity and the relation, our model also outperforms RotatE by large margins (e.g., 0.04 MRR on relation interaction, 0.692 MRR on relation effect). Compared with previous state-of-the-art models, our model is able to capture both neighborhood and attribute features in DrugKG, which enables it to learn better embeddings for entities. The experimental results demonstrate that our GFCNet achieves good performance on different relations.

Fig. 3

MRR results on different relations.

Ablation Study

Recently, some studies have shown that training strategies may have a great impact on performance, making it difficult to tell whether performance gains come from the model architecture. Therefore, an ablation study needs to be performed to evaluate the effect of the different parts of the model. In this section, we conduct ablation experiments under the same experimental setting to explore the importance of the different components of our model. The results of the ablation study are shown in Table 5 . “GFCNet w/o A” denotes the model without the attribute collector, “GFCNet w/o N” denotes the model without the neighbor collector, and “GFCNet w/o A&N” denotes the model without both the neighbor and attribute collectors. “GFCNet-RGCN” denotes the model obtained by replacing the SR-GCN with the traditional R-GCN.

Table 5

Ablation Study.

Model	MRR	MR	Hits@10	Hits@3	Hits@1
GFCNet w/o A	0.265	674.51	0.426	0.291	0.185
GFCNet w/o N	0.264	508.39	0.424	0.292	0.184
GFCNet w/o A&N	0.253	1052.89	0.402	0.283	0.175
GFCNet-RGCN	0.267	654.34	0.429	0.294	0.187
GFCNet	0.270	630.12	0.432	0.299	0.188

From Table 5, we can see that GFCNet with both the neighbor collector and the attribute collector achieves the best performance on all metrics except MR. Without the attribute collector or the neighbor collector, the model performs worse than GFCNet, which indicates that both the neighborhood information and the attribute information contribute to the performance. Comparing SR-GCN with the traditional R-GCN (i.e., GFCNet vs. GFCNet-RGCN), the proposed GFCNet also performs better even though SR-GCN uses fewer parameters, which verifies the effectiveness of SR-GCN.
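The size of each component's contribution can be read off Table 5 by subtracting each variant's MRR from that of the full model. The short sketch below does only this arithmetic; the numbers are copied from Table 5, while the dictionary layout and variable names are illustrative assumptions.

```python
# MRR values copied from Table 5.
table5_mrr = {
    "GFCNet w/o A": 0.265,
    "GFCNet w/o N": 0.264,
    "GFCNet w/o A&N": 0.253,
    "GFCNet-RGCN": 0.267,
    "GFCNet": 0.270,
}

full_mrr = table5_mrr["GFCNet"]
# Absolute MRR lost by each ablated variant relative to the full model.
mrr_drop = {
    name: round(full_mrr - mrr, 3)
    for name, mrr in table5_mrr.items()
    if name != "GFCNet"
}
for name, drop in mrr_drop.items():
    print(f"{name}: MRR drop {drop:.3f}")
```

On these numbers, removing either collector alone costs about 0.005 to 0.006 MRR, removing both costs 0.017, and swapping SR-GCN for R-GCN costs 0.003.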

Analysis

Visualization

In order to analyse the representation ability of our model, Fig. 4 shows the two-dimensional PCA projection of the entity embeddings learned by different models. From Fig. 4 we can see that the embeddings of different entity types learned by the baseline TransE are mixed together in the two-dimensional space, which indicates that TransE cannot model different types of entities well. The R-GCN model is able to learn similar representations for entities of the same type, but it still cannot differentiate between Virus and HostProtein entities, because both of them have relationships with VirusProtein. Compared with these two baselines, the state-of-the-art model RotatE and our proposed GFCNet are able to distinguish all four types of entities in DrugKG: they learn similar embeddings for entities of the same type and different embeddings for entities of different types. This is mainly because different types of entities have different attributes. By aggregating attribute features, our model can easily capture the differences between types of entities.

Fig. 4

The visualization of the two-dimensional PCA projection of the entity embeddings for different models.
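As a rough illustration of how such a projection is obtained, the sketch below implements a two-component PCA in plain Python using power iteration with deflation. This is an assumption-laden toy, not the procedure used for Fig. 4: in practice a library routine would normally be used, and the function names and the four toy "embeddings" are invented for the example.

```python
import random

def center(rows):
    """Subtract the per-dimension mean from every embedding."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(d)] for i in range(d)]

def mat_vec(mat, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]

def top_eigenpair(mat, iters=500, seed=0):
    """Power iteration: repeatedly multiply and normalize to find the dominant eigenvector."""
    rng = random.Random(seed)
    vec = [rng.random() + 0.1 for _ in mat]
    for _ in range(iters):
        nxt = mat_vec(mat, vec)
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    eigval = sum(v * w for v, w in zip(vec, mat_vec(mat, vec)))
    return eigval, vec

def pca_2d(rows):
    """Project rows onto the two directions of largest variance."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(2):
        eigval, vec = top_eigenpair(cov)
        components.append(vec)
        # Deflation: remove the variance explained by the component just found.
        cov = [[cov[i][j] - eigval * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in centered]

# Toy 3-dimensional "embeddings": most variance on axis 0, less on axis 1, none on axis 2.
embeddings = [[-2.0, -1.0, 0.0], [-2.0, 1.0, 0.0], [2.0, -1.0, 0.0], [2.0, 1.0, 0.0]]
points = pca_2d(embeddings)
print(points)
```

Each entity then becomes one two-dimensional point, which can be colored by entity type to inspect whether types form separate clusters.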

Performance for Entities with Different Degree

In this subsection, we further analyse how the degree of an entity affects the performance. As shown in Fig. 5 , we compare our model with three baseline models: TransE, RotatE and R-GCN. As the degree increases, the MRR first improves rapidly and then, beyond a threshold, drops considerably. All four models exhibit this behavior, including the graph-based methods (R-GCN and GFCNet) and the other models (TransE and RotatE). Previous studies [10] observed a similar phenomenon in experiments with graph-based methods, but did not compare against methods that do not aggregate neighbors; they attribute it to the fact that too few neighbors provide limited neighbor information, while too many neighbors make the model hard to optimize. However, our experiment on DrugKG shows that models that do not gather neighborhood information follow the same tendency. Therefore, we think the main reasons for this phenomenon are: (1) entities with low degree have only a few triples to train on, which makes it difficult to learn good representations; (2) entities with too many neighbors often suffer from the many-to-many relation pattern, which is difficult for all KGE models, including graph-based, rotate-based and translation-based models.

Fig. 5

The MRR results for entities with different degree.
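An analysis of this kind can be sketched as follows: count each entity's degree in the training triples, assign test queries to degree buckets, and average the reciprocal rank per bucket. The bucket boundaries, entity names and ranks below are hypothetical and chosen only to make the example small; the grouping used for Fig. 5 is not specified here.

```python
from bisect import bisect_right
from collections import Counter, defaultdict

def entity_degrees(triples):
    """Degree = number of training triples in which the entity is head or tail."""
    degrees = Counter()
    for head, _, tail in triples:
        degrees[head] += 1
        degrees[tail] += 1
    return degrees

def mrr_by_degree(ranked_queries, degrees, bounds):
    """Average reciprocal rank per degree bucket.

    ranked_queries: (entity, rank) pairs, rank being the 1-based rank of the
    correct answer for a query about that entity.
    bounds: ascending boundaries; bucket i holds degrees below bounds[i] that
    are not in an earlier bucket, and the last bucket holds the rest.
    """
    buckets = defaultdict(list)
    for entity, rank in ranked_queries:
        buckets[bisect_right(bounds, degrees[entity])].append(1.0 / rank)
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}

train = [("a", "r", "b"), ("a", "r", "c"), ("a", "r", "d"),
         ("b", "r", "c"), ("a", "r", "e")]
degrees = entity_degrees(train)            # a: 4, b: 2, c: 2, d: 1, e: 1
queries = [("a", 2), ("b", 1), ("c", 4), ("d", 10)]
per_bucket = mrr_by_degree(queries, degrees, bounds=(2, 4))
print(per_bucket)                          # bucket 0: degree < 2, 1: degree 2-3, 2: degree >= 4
```

Running the same bucketing for every model on the same test queries gives curves that can be compared directly across graph-based and non-graph-based methods.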

Case Study of the Link Prediction

Table 6 gives some examples predicted by our GFCNet model on the test set. Given a head entity h and a relation r, we predict possible tail entities by ranking all the entities in DrugKG and show the top-ranked results. In these examples, the correct tail entities in DrugKG rank among the top results. For example, given the query (Famciclovir, effect,?), the correct tail entity HHV-3 ranks first and another correct tail entity, HSV-1, ranks second. We make similar observations for the other examples, such as (Vidarabine, effect,?), (PUF60, interaction,?) and (Influenza A virus, produce,?). These examples illustrate the good performance of our model in an intuitive way.

Table 6

Some example predictions on the test set using our model. Bold indicates the true tails in DrugKG.

Input:(h,r,?)	Predicted Tails
(Famciclovir, effect,?)	HHV-3, HSV-1, Influenza A virus, HBV genotype C, HHV-5, HIV-1
(Vidarabine, effect,?)	HSV-1, HHV-4, HSV-2, HHV-3, Walleye dermal sarcoma virus, HHV-5
(Ganciclovir, effect,?)	HIV-1, HSV-1, HHV-5, Vaccinia virus WR, Human gammaherpesvirus 4
(PUF60, interaction,?)	NCAP, VE2, NRAM, POLG, M1, REP78
(Influenza A virus, produce,?)	NS1, NEP, Q30NB4, E3, POLG, NEF
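Producing such a list amounts to scoring every entity as a candidate tail for the query (h, r, ?) and keeping the highest-scoring ones. The sketch below uses a translation-style score with an L1 distance purely as a stand-in, since any scoring function could be plugged in; the entity names, relation name and two-dimensional vectors are placeholders, not DrugKG data.

```python
def top_k_tails(head, relation, entity_vecs, relation_vecs, k=3):
    """Rank all entities as candidate tails for (head, relation, ?) and return the top k."""
    h = entity_vecs[head]
    r = relation_vecs[relation]

    def score(t):
        # Stand-in score: negative L1 distance between h + r and t.
        return -sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))

    ranked = sorted(entity_vecs, key=lambda name: score(entity_vecs[name]), reverse=True)
    return ranked[:k]

# Placeholder embeddings for illustration only.
entity_vecs = {
    "drug_a": [0.0, 0.0],
    "virus_x": [1.0, 1.0],
    "virus_y": [1.0, 2.0],
    "protein_p": [5.0, 5.0],
}
relation_vecs = {"effect": [1.0, 1.0]}

predicted = top_k_tails("drug_a", "effect", entity_vecs, relation_vecs, k=3)
print(predicted)
```

Comparing the returned list against the known tails of the query is what distinguishes the true tails from the other top-ranked candidates in a table like Table 6.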

Conclusion and Future Work

In this paper, we study the COVID-19 KGE task and take a heterogeneous approach to automatically infer the missing semantic relations in the COVID-19 knowledge graph. To the best of our knowledge, we are the first to perform the KGE task on the open-sourced COVID-19 antiviral drug knowledge graph (DrugKG). Existing models rarely take into account important features of a KG beyond relation triples, such as neighboring and attribute features. To tackle this problem, we propose a novel graph feature collection network (GFCNet) that utilizes attribute and neighbor information to enhance the entity representation, in a simpler and more parameter-efficient way than R-GCN. The extensive experiments carried out on DrugKG demonstrate the effectiveness of our proposed model. However, a lot of future work remains to be done based on our proposed method. First, we look forward to more high-quality open-sourced COVID-19 KGs becoming available and applicable to our research in the near future. Due to the limited time and the urgency of work on COVID-19, high-quality COVID-19 KGs are not yet readily accessible. Therefore, we applied the proposed model to the only dataset that is available. Hopefully, as the global research community cooperates further and more deeply in the COVID-19 NLP and text mining domain, we will be able to apply our GFCNet model to a wider variety of datasets, such as those for drug discovery [2], [8], [64], contact tracing [60], [17] and detection of coronavirus-themed mobile malware [20]. Second, given the limited length of this paper, our proposed model is just a starting point for solving the COVID-19 KGE problem. We have planned further advancements to our method, which include but are not limited to incorporating dynamic data concepts to meet the need for real-time data as the pandemic develops. This improvement of our model also requires more high-quality data. Third, given that COVID-19 is unprecedented and new, we need further research on how to apply our models safely and appropriately as a building block to support various real-world medical downstream applications, including clinical decision support and other epidemiological research.

CRediT authorship contribution statement

Zhiwen Xie: Writing - original draft, Data curation, Software, Validation. Runjie Zhu: Writing - original draft, Writing - review & editing. Jin Liu: Writing - review & editing, Supervision. Guangyou Zhou: Writing - review & editing, Methodology, Supervision. Jimmy Xiangji Huang: Writing - review & editing, Methodology, Supervision. Xiaohui Cui: Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
