
Constructing co-occurrence network embeddings to assist association extraction for COVID-19 and other coronavirus infectious diseases.

David Oniani¹, Guoqian Jiang², Hongfang Liu², Feichen Shen².

Abstract

OBJECTIVE: As coronavirus disease 2019 (COVID-19) started its rapid emergence and gradually transformed into an unprecedented pandemic, the need for a knowledge repository for the disease became crucial. To address this issue, a new COVID-19 machine-readable dataset known as the COVID-19 Open Research Dataset (CORD-19) has been released. Based on this, our objective was to build computable co-occurrence network embeddings to assist association detection among COVID-19-related biomedical entities.
MATERIALS AND METHODS: Leveraging a Linked Data version of CORD-19 (ie, CORD-19-on-FHIR), we first utilized SPARQL to extract co-occurrences among chemicals, diseases, genes, and mutations and build a co-occurrence network. We then trained the representation of the derived co-occurrence network using node2vec with 4 edge embeddings operations (L1, L2, Average, and Hadamard). Six algorithms (decision tree, logistic regression, support vector machine, random forest, naïve Bayes, and multilayer perceptron) were applied to evaluate performance on link prediction. An unsupervised learning strategy was also developed incorporating the t-SNE (t-distributed stochastic neighbor embedding) and DBSCAN (density-based spatial clustering of applications with noise) algorithms for case studies.
RESULTS: The random forest classifier showed the best performance on link prediction across different network embeddings. For edge embeddings generated using the Average operation, random forest achieved the optimal average precision of 0.97 along with an F1 score of 0.90. For unsupervised learning, 63 clusters were formed with a silhouette score of 0.128. Significant associations were detected for 5 coronavirus infectious diseases in their corresponding subgroups.
CONCLUSIONS: In this study, we constructed COVID-19-centered co-occurrence network embeddings. Results indicated that the generated embeddings were able to extract significant associations for COVID-19 and coronavirus infectious diseases.

Entities: Disease

Keywords: COVID-19; association extraction; co-occurrence network embeddings; coronavirus infectious diseases


Year: 2020 PMID: 32458963 PMCID: PMC7314034 DOI: 10.1093/jamia/ocaa117

Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497

INTRODUCTION

Having now affected millions of people worldwide, coronavirus disease 2019 (COVID-19)/the novel coronavirus has become a major pandemic of the century. Most countries have declared a state of national emergency and taken immediate action to slow the spread. Researchers and medical personnel around the world have published and released thousands of articles over a short period of time, covering a vast scientific ground and exploring medical treatments and possible vaccines for the virus. With all this information, it is important to assemble the available heterogeneous information and be aware of the explicit or implicit associations among subjects related to COVID-19 (eg, certain genes could be linked to other genes and/or mutations related to COVID-19 and other coronavirus infectious diseases). Figuring out which subjects appear together is one of the approaches for identifying these associations and linking them together. Traditionally, text semantic similarity has been one of the approaches for detecting links between words or sentences from unstructured data. One limitation is that it is inefficient to apply this approach over a large collection of free-text data, hampering the global view needed to detect significant associations across literature from heterogeneous domains. Normalized data stored in a semi-structured graph format is more suitable for global link detection, as linked data by nature provide an efficient query scheme over triplets to interpolate between “macroscopic” and “microscopic” search. Several efforts were made in the graph-based analysis of COVID-19. For example, Ahamed and Samad developed a graph-based model using abstracts of 10 683 COVID-19–related scientific articles and applied betweenness centrality to rank the importance of keywords related to drugs, diseases, pathogens, hosts of pathogens, and biomolecules. 
Bellomarini et al presented a report on ongoing work about the application of automated reasoning and knowledge graph technology to address the impact of the COVID-19 outbreak on the network of Italian companies. Tsiotas and Magafas used visibility graphs to study the Greek COVID-19 infection curve as a complex network. Per request of the White House Office of Science and Technology Policy, a new COVID-19 machine-readable dataset (CORD-19) has been released, and several studies have used this dataset to investigate COVID-19–related topics. For example, Wolinski used CORD-19 to extract diseases at risk and calculate relevant indicators, and also created VIDAR-19 (VIsualization of Diseases At Risk in CORD-19). Wang et al conducted CORD-19 named entity recognition leveraging the distant supervision strategy. CORD-19-on-FHIR is a Linked Data version of CORD-19. It is represented in FHIR RDF, and was produced by data mining the CORD-19 dataset and adding semantic annotations. In addition, Groza featured CORD-19-on-FHIR in the analysis of how a semantically annotated dataset can be applied for detecting and preventing the potential spread of deceptive information regarding COVID-19. The vast co-occurrence information contained in the CORD-19 dataset allows for detection of novel associations across findings from various research articles. However, such information has been largely unexplored for association extraction. Moreover, the lack of measurable associations among heterogeneous biomedical entities hampers the capability for quantitative analysis. Inspired by the success of word embeddings in building distributed semantic representations for each word given a corpus, network embeddings provide a solution to map graph nodes to distributional representations and translate nodes’ relationships from graph space to embedding space, which makes the association between the nodes measurable. 
In this study, we filled this gap by constructing network embeddings for the CORD-19 co-occurrence network. Specifically, we first derived a co-occurrence network by querying the CORD-19-on-FHIR and focused on the extraction of biomedical entities falling in 4 categories: chemical, disease, gene, and mutation. We then applied the node2vec model over the generated network and constructed network embeddings. We conducted the evaluation quantitatively and qualitatively. For the quantitative evaluation, we generated different embeddings with 4 embeddings generation operations using a downstream application on graph link prediction and measured the performance with 6 machine learning algorithms. For the qualitative evaluation, we visualized clusters generated by the optimal COVID-19 network embeddings and analyzed associations of heterogeneous biomedical entities related to COVID-19 and other coronavirus infectious diseases.

MATERIALS AND METHODS

CORD-19-on-FHIR

The purpose of building CORD-19-on-FHIR is to represent linkage with other biomedical datasets and enable answering research questions. In this study, we used a subset of CORD-19-on-FHIR datasets annotated by Pubtator and LitCovid, including 3207 COVID-19–related articles in total. Each article was stored in one specific annotation file. For each file and for each paragraph in the file, CORD-19-on-FHIR provides a way to capture all the annotated biomedical entities. A high-level example of data stored in the Terse RDF Triple Language (Turtle) format is shown:

    pmc:annotations [
        pmc:id "1";
        pmc:infons [pmc:identifier "MESH:D003371"; pmc:type "Disease"];
        pmc:locations [pmc:length "5"^^xsd:int; pmc:offset "20312"^^xsd:int];
        pmc:text "cough"],
        pmc:id "2";
        pmc:infons [pmc:identifier "MESH:C000657245"; pmc:type "Disease"];
        pmc:locations [pmc:length "19"^^xsd:int; pmc:offset "14766"^^xsd:int];
        pmc:text "2019-nCoV infection"] ,],
    pmc:annotations [
        pmc:id "5";
        pmc:infons [pmc:identifier "59272"; pmc:ncbi_homologene "41448"; pmc:type "Gene"];
        pmc:locations [pmc:length "31"^^xsd:int; pmc:offset "1986"^^xsd:int];
        pmc:text "angiotensin-converting enzyme 2"],
        pmc:id "7";
        pmc:infons [pmc:identifier "MESH:C000657245"; pmc:type "Disease"];
        pmc:locations [pmc:length "19"^^xsd:int; pmc:offset "14766"^^xsd:int];
        pmc:text "2019-nCoV infection"] ,],

where “pmc:annotations” was used to differentiate paragraphs within the same article, and “pmc:id” was used to indicate different biomedical entities along with the entity type (“pmc:type”), the location given as length and offset (“pmc:locations”), and the original text from the literature (“pmc:text”). Such encoding of the data made it possible to easily detect co-occurrence of biomedical entities within a single paragraph for building the network across the literature.
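As a rough illustration of how paragraph-level co-occurrence can be read off such annotations, the sketch below uses a hypothetical in-memory list of paragraphs, each holding (identifier, text, type) tuples taken from the example above. The data structure and the function name `cooccurrences` are assumptions made for illustration; they are not part of CORD-19-on-FHIR. The sketch counts each pair once per paragraph, which may differ from the counting used in the study.

```python
from collections import Counter
from itertools import product

# Hypothetical in-memory stand-in for two annotated paragraphs; each entity is
# (identifier, text, type). This is not the CORD-19-on-FHIR storage format.
paragraphs = [
    [("MESH:D003371", "cough", "Disease"),
     ("MESH:C000657245", "2019-nCoV infection", "Disease")],
    [("59272", "angiotensin-converting enzyme 2", "Gene"),
     ("MESH:C000657245", "2019-nCoV infection", "Disease")],
]


def cooccurrences(paragraphs, source_type, target_type="Disease"):
    """Count, per (source, target) pair, the paragraphs in which both appear."""
    counts = Counter()
    for entities in paragraphs:
        sources = {(ident, text) for ident, text, kind in entities if kind == source_type}
        targets = {(ident, text) for ident, text, kind in entities if kind == target_type}
        for pair in product(sources, targets):
            counts[pair] += 1
    return counts


gene_disease = cooccurrences(paragraphs, "Gene")
for (gene, disease), n in gene_disease.items():
    print(gene[1], "<->", disease[1], n)
```

Only the second paragraph contains a gene, so the single gene-disease pair reported links angiotensin-converting enzyme 2 with 2019-nCoV infection.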

Node2Vec

The node2vec model used a random walk-based sampling strategy to balance the graph homophily and structural equivalence. The reason we chose to use node2vec is its ability to learn node representations with a balance between the breadth-first search and depth-first search, which is essential for learning associations in a graph with both local and global views.

METHODS

The workflow of this study is made of 3 modules, including a CORD-19-on-FHIR–based co-occurrence network generation module, a network embeddings construction module, and an unsupervised learning module (Figure 1).

Figure 1.

Study workflow.

Co-occurrence network generation

For each article, we considered co-occurrence at the paragraph level. We first designed a SPARQL query to extract paragraph-level co-occurrences of biomedical entities from CORD-19-on-FHIR. In particular, to broadly collect coronavirus-related diseases and comorbidities, we built a list of keywords for diseases and symptoms to constrain the search space: COVID-19, SARS, pneumonia, fever, fibrosis, diarrhea, coronavirus, bronchitis, Ebola, influenza, and ZIKA. We extracted co-occurrences between gene-disease, mutation-disease, and chemical-disease pairs using the following SPARQL query, replacing “Biomedical_Entity” with “Gene,” “Mutation,” and “Chemical,” respectively:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX fhir: <http://hl7.org/fhir/>
    PREFIX pmc: <https://www.ncbi.nlm.nih.gov/pmc/articles#>
    SELECT DISTINCT ?pmc_id0 ?text0 ?pmc_id1 ?text1 (COUNT(?text1) AS ?count)
    WHERE {
      ?pmc pmc:annotations [ pmc:id ?id0; pmc:text ?text0;
                             pmc:infons [ pmc:type ?type0; pmc:identifier ?pmc_id0 ] ] .
      FILTER ((?type0 = "Biomedical_Entity")) .
      { SELECT * WHERE {
          ?pmc pmc:annotations [ pmc:id ?id1; pmc:text ?text1;
                                 pmc:infons [ pmc:type ?type1; pmc:identifier ?pmc_id1 ] ] .
          FILTER ((?type1 = "Disease") &&
                  (contains(lcase(str(?text1)), "coronavirus") ||
                   contains(lcase(str(?text1)), "sars") ||
                   contains(lcase(str(?text1)), "covid-19") ||
                   contains(lcase(str(?text1)), "pneumonia") ||
                   contains(lcase(str(?text1)), "fever") ||
                   contains(lcase(str(?text1)), "fibrosis") ||
                   contains(lcase(str(?text1)), "diarrhea") ||
                   contains(lcase(str(?text1)), "bronchitis") ||
                   contains(lcase(str(?text1)), "ebola") ||
                   contains(lcase(str(?text1)), "influenza") ||
                   contains(lcase(str(?text1)), "zika"))) .
      } }
    }
    GROUP BY ?pmc_id0 ?text0 ?pmc_id1 ?text1
    ORDER BY DESC(?count)

The outputs of the query were composed of a list of pairwise biomedical entities with co-occurrence frequency. 
We then built a network based on this list by adding a link between any 2 biomedical entities if they have co-occurred at least once. As shown in Figure 1, the co-occurrence network was represented by source-target pairs, which were then used as input data for training node representations.
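The step from query output to network can be sketched as follows. This is a minimal illustration under assumptions: the rows, their counts, and the helper names `build_network` and `edge_list` are hypothetical, and the study's own implementation may differ. An undirected edge is added whenever a pair has co-occurred at least once, and the frequency itself is discarded.

```python
from collections import defaultdict

# Hypothetical query output rows: (entity, disease, co-occurrence count).
# The counts are made up for illustration only.
rows = [
    ("ACE2", "COVID-19", 12),
    ("ACE2", "SARS", 7),
    ("rs180047", "fibrosis", 1),
    ("TNF", "COVID-19", 0),  # illustrative row with no co-occurrence: no edge
]


def build_network(rows, min_count=1):
    """Undirected, unweighted adjacency sets; an edge needs count >= min_count."""
    adjacency = defaultdict(set)
    for source, target, count in rows:
        if count >= min_count:
            adjacency[source].add(target)
            adjacency[target].add(source)
    return adjacency


def edge_list(adjacency):
    """Source-target pairs, one per undirected edge, in a stable order."""
    return sorted({tuple(sorted((u, v))) for u, nbrs in adjacency.items() for v in nbrs})


network = build_network(rows)
edges = edge_list(network)
print(edges)
```

The resulting list of source-target pairs is the kind of input a node embedding method would consume.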

Network embeddings representation learning

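We applied the node2vec model in this module. Node2vec implements a second-order random walk over the graph topological structure, meaning that 3 types of nodes are involved in a specific walk, namely the source entity, the intermediate entity, and the target entity. Given any source entity as $t$, target entity as $x$, intermediate entity that exists on the path between $t$ and $x$ as $v$, and normalization constant as $Z$, the distribution of entity $x$ with a fixed length of random walk can be represented as:

$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{\pi_{vx}}{Z} & \text{if } (v, x) \in E \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

where $c_i$ is the $i$th entity in the walk, $E$ is the edge set, and $\pi_{vx}$ is a transition probability between entities $v$ and $x$. Given the weight over edge $(v, x)$ as $w_{vx}$, $\pi_{vx}$ could be calculated as:

$$\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx} \qquad (2)$$

$\alpha_{pq}(t, x)$ is a searching bias term developed in node2vec. Specifically, node2vec introduced 2 hyperparameters $p$ and $q$ to balance between the breadth-first search and depth-first search strategies for both local and global optimization. Given the shortest distance between $t$ and $x$ as $d_{tx}$, $\alpha_{pq}$ for entities $t$ and $x$ is computed based on $p$ and $q$:

$$\alpha_{pq}(t, x) = \begin{cases} \dfrac{1}{p} & \text{if } d_{tx} = 0 \\ 1 & \text{if } d_{tx} = 1 \\ \dfrac{1}{q} & \text{if } d_{tx} = 2 \end{cases} \qquad (3)$$

After sampling the network data using random walks, we then leveraged the Skip-Gram model to train entity representations on the sampled data. For each entity node $u$ and all its sampled neighbors $N_S(u)$, the loss function for entity representation learning could be described as:

$$\mathcal{L} = -\sum_{u \in V} \log \Pr\left(N_S(u) \mid f(u)\right) \qquad (4)$$

where $V$ is the node set and $f(u)$ is the embedding of $u$. In the end, we normalized the prediction distribution by using a nonlinearity (eg, softmax) and optimized this loss function using stochastic gradient descent.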
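To make the second-order walk concrete, the following sketch computes the search bias and the normalized transition probabilities for one step, then samples a short walk. It is a minimal illustration assuming the standard node2vec bias (1/p for returning to the previous node, 1 for a candidate adjacent to the previous node, 1/q otherwise). The toy graph and all function names are hypothetical, and it is not the implementation used in the study.

```python
import random

# Toy undirected graph as {node: {neighbor: edge weight}}; hypothetical data.
graph = {
    "t": {"v": 1.0, "x1": 1.0},
    "v": {"t": 1.0, "x1": 1.0, "x2": 1.0},
    "x1": {"t": 1.0, "v": 1.0},
    "x2": {"v": 1.0},
}


def search_bias(prev, candidate, graph, p, q):
    """Bias by the shortest distance between the previous node and the candidate."""
    if candidate == prev:          # distance 0: step back to the previous node
        return 1.0 / p
    if candidate in graph[prev]:   # distance 1: candidate also neighbors prev
        return 1.0
    return 1.0 / q                 # distance 2: move away from prev


def transition_probs(prev, cur, graph, p, q):
    """Normalized probabilities of moving from cur to each neighbor, given prev."""
    weights = {x: search_bias(prev, x, graph, p, q) * w for x, w in graph[cur].items()}
    z = sum(weights.values())      # normalization constant
    return {x: w / z for x, w in weights.items()}


def biased_walk(start, length, graph, p, q, rng):
    """Sample one walk containing `length` nodes."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        if not graph[cur]:
            break
        if len(walk) == 1:         # first step has no previous node: use edge weights
            probs = graph[cur]
        else:
            probs = transition_probs(walk[-2], cur, graph, p, q)
        nodes = list(probs)
        walk.append(rng.choices(nodes, weights=[probs[n] for n in nodes], k=1)[0])
    return walk


# With p = 0.5 and q = 0.25, arriving at "v" from "t" gives t: 2/7, x1: 1/7, x2: 4/7.
print(transition_probs("t", "v", graph, p=0.5, q=0.25))
print(biased_walk("t", 5, graph, 0.5, 0.25, random.Random(0)))
```

In this toy case a small q makes the walk favor the node two hops away from where it came from, which is the depth-first tendency the hyperparameter is meant to control.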

Unsupervised clustering of network embeddings

To project the relatively high-dimensional network embeddings into a lower-dimensional space, we utilized the t-distributed stochastic neighbor embedding (t-SNE) algorithm to map the embeddings for all entity nodes into a 2-dimensional space. t-SNE does not perform clustering in and of itself, but instead maps each node embedding to a coordinate. As such, additional postprocessing is needed to group these points into discrete clusters. The density-based spatial clustering of applications with noise (DBSCAN) algorithm was therefore applied to the output generated by t-SNE to partition the entities into distinct clusters. Given a parameter ε that denotes how close points should be to each other and another parameter MinPts that indicates the minimum number of points, DBSCAN clustered similar entity nodes together based on density according to these 2 predefined parameters.
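The density-based grouping can be sketched in a few lines of plain Python. This is a minimal illustration of DBSCAN on hypothetical 2-dimensional points standing in for t-SNE output. The parameter names `eps` and `min_pts` and the sample coordinates are assumptions, and in practice a library implementation would normally be used instead.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)

    def neighbors(i):
        # All points within eps of point i (the point itself included).
        return [j for j, other in enumerate(points) if dist(points[i], other) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:       # not dense enough to start a cluster
            labels[i] = noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:     # noise reachable from a core point: border point
                labels[j] = cluster
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is itself a core point: keep expanding
                queue.extend(reachable)
    return labels


# Hypothetical 2-D coordinates: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

Points that are not within reach of any dense region receive the label -1, so the number of clusters is not fixed in advance but follows from the two parameters.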

EXPERIMENTS

From CORD-19-on-FHIR, we extracted 49 696 co-occurrences among 3626 coronavirus-related diseases, 5741 genes, 524 mutations, and 6878 chemicals. Thus, the derived co-occurrence network contains 16 769 nodes and 49 696 edges in total. For quantitative evaluation, we selected the optimal network embeddings by performing a downstream link prediction task. Link prediction is a procedure in which the goal is to predict the relationship between any 2 nodes; the prediction performance is then used to evaluate the quality of the generated network embeddings. Edge embeddings were used in this task in order to investigate the relationships between nodes leveraging distributional representations provided by entity embeddings. For any given nodes $u$, $v$ and their corresponding entity representations $f(u)$ and $f(v)$, edge embeddings were calculated component-wise using 4 operations, namely Hadamard, Average, L1, and L2, as shown in equations 5-8, respectively:

$$\text{Hadamard: } [f(u) \odot f(v)]_i = f_i(u) \cdot f_i(v) \qquad (5)$$

$$\text{Average: } [f(u) \oplus f(v)]_i = \frac{f_i(u) + f_i(v)}{2} \qquad (6)$$

$$\text{L1: } \lVert f(u) - f(v) \rVert_{\bar{1}i} = \lvert f_i(u) - f_i(v) \rvert \qquad (7)$$

$$\text{L2: } \lVert f(u) - f(v) \rVert_{\bar{2}i} = \lvert f_i(u) - f_i(v) \rvert^2 \qquad (8)$$

We used 6 conventional classification algorithms to evaluate the performance of different edge embeddings on the link prediction task, including decision tree (DT), logistic regression (LR), support vector machine (SVM), random forest (RF), naïve Bayes (NB), and multilayer perceptron (MLP). Specifically, a Boolean function $y(u, v)$ was used to determine the existence of an edge between nodes $u$ and $v$, where $y(u, v) = 1$ indicates positive links and $y(u, v) = 0$ represents negative links. We fit features of edge embeddings with labels provided by $y(u, v)$ to train the model. For positive examples, for each of the 4 networks, 60%, 10%, and 30% of all their edges were used for training, validation, and testing purposes, respectively. For negative examples, an equal number of node pairs were randomly sampled (with the same ratio among training, validation, and testing sets as 60%, 10%, and 30%, respectively). For each classifier, we plotted the receiver-operating characteristic (ROC) curve and computed the area under the ROC curve to report link prediction performance. 
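The 4 edge embedding operations named above (Hadamard, Average, L1, and L2) can be illustrated component by component. The sketch below assumes the standard node2vec definitions of these binary operators; the two short vectors are hypothetical node embeddings, not values from the study.

```python
def hadamard(fu, fv):
    """Element-wise product of the two node embeddings."""
    return [a * b for a, b in zip(fu, fv)]


def average(fu, fv):
    """Element-wise mean of the two node embeddings."""
    return [(a + b) / 2 for a, b in zip(fu, fv)]


def weighted_l1(fu, fv):
    """Element-wise absolute difference."""
    return [abs(a - b) for a, b in zip(fu, fv)]


def weighted_l2(fu, fv):
    """Element-wise squared difference."""
    return [(a - b) ** 2 for a, b in zip(fu, fv)]


EDGE_OPERATIONS = {
    "Hadamard": hadamard,
    "Average": average,
    "L1": weighted_l1,
    "L2": weighted_l2,
}

# Hypothetical 3-dimensional node embeddings for illustration.
f_u = [1.0, -2.0, 0.5]
f_v = [3.0, 2.0, 0.5]

for name, op in EDGE_OPERATIONS.items():
    print(name, op(f_u, f_v))
# Hadamard [3.0, -4.0, 0.25]
# Average [2.0, 0.0, 0.5]
# L1 [2.0, 4.0, 0.0]
# L2 [4.0, 16.0, 0.0]
```

Each operation returns a vector with the same dimensionality as the node embeddings, which can then serve as the feature vector of a node pair for a binary classifier.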
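Moreover, as shown in equations 9-12, we used precision, recall, F1 score, and average precision (AP) to quantify the link prediction performance among 4 edge embeddings:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (9)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (10)$$

$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (11)$$

$$\text{AP} = \sum_{n} (R_n - R_{n-1}) P_n \qquad (12)$$

where $TP$, $FP$, and $FN$ are the numbers of true positives, false positives, and false negatives, and $P_n$ and $R_n$ are the precision and recall at the $n$th threshold. For qualitative evaluation, we first visualized the network embeddings clustering output and used the silhouette score to evaluate clustering outputs. The silhouette score compares the average distance to entities in the same cluster with the average distance to entities in other clusters. Given any entity node $i$ in cluster $C_i$, the internal mean distance is defined as:

$$a(i) = \frac{1}{\lvert C_i \rvert - 1} \sum_{j \in C_i,\, j \neq i} d(i, j) \qquad (13)$$

where $d(i, j)$ is the distance between node $i$ and $j$ in $C_i$. Similarly, the external mean distance is described as:

$$b(i) = \min_{k \neq i} \frac{1}{\lvert C_k \rvert} \sum_{j \in C_k} d(i, j) \qquad (14)$$

Overall, the silhouette score is calculated incorporating both internal and external mean distances:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \qquad (15)$$

For some selected coronavirus infectious diseases, we also located the cluster they belonged to and checked the most similar entities within the same cluster using cosine similarity. Let $c$ denote any given coronavirus infectious disease and $e$ denote any biomedical entity inferred by network embeddings, and let $f(c)$ and $f(e)$ represent the embeddings for $c$ and $e$, respectively; cosine similarity was calculated as shown in equation 16:

$$\cos(c, e) = \frac{f(c) \cdot f(e)}{\lVert f(c) \rVert \, \lVert f(e) \rVert} \qquad (16)$$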
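The two qualitative measures described above can be sketched in plain Python. This is a minimal illustration assuming Euclidean distance for the silhouette score and the usual definition of cosine similarity. The points, labels, embeddings, entity names, and helper names are hypothetical and do not come from the study.

```python
from math import dist, sqrt


def silhouette_score(points, labels):
    """Mean silhouette over all points; needs at least 2 clusters."""
    scores = []
    for i, (point, label) in enumerate(zip(points, labels)):
        same = [dist(point, q) for j, (q, l) in enumerate(zip(points, labels))
                if l == label and j != i]
        if not same:                   # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = sum(same) / len(same)      # internal mean distance
        b = min(                       # mean distance to the nearest other cluster
            sum(dist(point, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))


def closest(query, embeddings, k):
    """Top-k entities ranked by cosine similarity to the query entity."""
    ranked = [(name, cosine_similarity(embeddings[query], vec))
              for name, vec in embeddings.items() if name != query]
    return sorted(ranked, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical data: two well-separated clusters and three toy embeddings.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
labels = [0, 0, 1, 1]
score = silhouette_score(points, labels)
print(round(score, 3))

embeddings = {"COVID-19": [1.0, 0.0], "entity A": [0.9, 0.1], "entity B": [0.0, 1.0]}
print(closest("COVID-19", embeddings, k=1))
```

A silhouette score near 1 indicates tight, well-separated clusters, while values near 0 indicate overlapping clusters.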

RESULTS

Different embeddings were generated by using neighbor size of 10, number of walks of 10, window size of 10, and dimensionality of 128. The optimal p and q were also tuned as 0.5 and 0.25 for each training process, respectively. Detailed network embeddings and the corresponding clustering information could be found online (https://github.com/shenfc/COVID-19-network-embeddings). We have also implemented a Web-based user-friendly tool for clustering visualization and entity similarity checking (https://www.davidoniani.com/covid-19-network).

Quantitative evaluation

Table 1 presents the evaluation results of the 4 edge embedding operations along with the 6 classification algorithms. We found that, in general, the embeddings trained by the Average operation achieved the best performance across all the evaluation metrics. The optimal AP, ROC score, precision, and recall were reached when RF was used, and the optimal F1 score was achieved by using NB. L1 and L2 had roughly the same performance, both peaking at ROC = 0.95 and AP = 0.96 with the RF classifier. Among all 4 operations, Hadamard yielded the worst performance, peaking at ROC = 0.89 and AP = 0.92 with the RF classifier. Across all 6 classification algorithms, the worst performance was shown by the DT and MLP classifiers.

Table 1.

Evaluation results for the 4 edge embeddings operations along with 6 machine learning algorithms

Operations	Algorithms	Average Precision (AP)	ROC score	Precision	Recall	F1 score
Hadamard	DT	0.79	0.82	0.84	0.82	0.81
	LR	0.89	0.83	0.86	0.82	0.81
	SVM	0.80	0.81	0.85	0.81	0.81
	RF	0.92	0.89	0.87	0.86	0.86
	NB	0.82	0.84	0.86	0.84	0.84
	MLP	0.56	0.60	0.63	0.60	0.57
Average	DT	0.81	0.84	0.85	0.84	0.84
	LR	0.94	0.92	0.87	0.85	0.85
	SVM	0.83	0.86	0.87	0.86	0.85
	RF	0.97^a	0.96^a	0.91^a	0.91^a	0.90
	NB	0.88	0.91	0.91^a	0.91^a	0.91^a
	MLP	0.78	0.84	0.84	0.84	0.84
L1	DT	0.75	0.80	0.80	0.80	0.80
	LR	0.95	0.94	0.89	0.89	0.89
	SVM	0.87	0.89	0.90	0.89	0.89
	RF	0.96	0.95	0.89	0.88	0.88
	NB	0.85	0.88	0.89	0.88	0.88
	MLP	0.87	0.89	0.89	0.89	0.89
L2	DT	0.75	0.80	0.80	0.80	0.80
	LR	0.94	0.93	0.89	0.88	0.88
	SVM	0.87	0.88	0.90	0.88	0.88
	RF	0.96	0.95	0.89	0.88	0.88
	NB	0.85	0.87	0.88	0.87	0.87
	MLP	0.85	0.87	0.88	0.87	0.87

DT: decision tree; LR: logistic regression; MLP: multilayer perceptron; NB: naïve Bayes; RF: random forest; ROC: receiver-operating characteristic; SVM: support vector machine.

^a Highest value.

Regarding different classification algorithms, for the L1 embedding operation, RF and LR had similar performance, with ROC_RF = 0.95, AP_RF = 0.96, and ROC_LR = 0.94, AP_LR = 0.95. SVM, NB, and MLP were also not much different from each other. DT had the worst performance, with ROC = 0.80 and AP = 0.75. Similarly, for L2, RF and LR had roughly the same performance. SVM, NB, and MLP also performed equally well. DT had the worst performance, with ROC = 0.80 and AP = 0.75. For the Average operation, RF outperformed all the other classification methods, with ROC = 0.96 and AP = 0.97. LR and NB had similar performance. DT, SVM, and MLP were similar, yet all of them were behind RF, LR, and NB. For Hadamard, RF showed the best performance, with ROC = 0.89 and AP = 0.92. The rest of the methods showed roughly the same performance, except for MLP, which had the worst performance across all classification algorithms as well as across every embedding generation approach, with ROC = 0.60 and AP = 0.56. We selected the network embeddings generated by the Average operation as the optimal ones, and an ROC curve for the performance across the 6 classification algorithms is shown in Figure 2.

Figure 2.

Receiver-operating characteristic scores for the average operation with 6 machine learning algorithms. DT: decision tree; LR: logistic regression; MLP: multilayer perceptron; NB: naïve Bayes; RF: random forest; SVM: support vector machine.

Qualitative evaluation

We clustered the network embeddings by selecting the optimal values of the 2 DBSCAN hyperparameters (the neighborhood distance and the minimum number of points). Sixty-three clusters were generated with a silhouette score of 0.128. We used the network embeddings generated through the optimal Average operation to conduct a further qualitative evaluation. We first visualized clusters for diseases and clusters for all the entities, as shown in Figure 3. In Figure 3A, we found that pneumonia, fever, fibrosis, and bronchitis appeared in each cluster, indicating that they are common comorbidities among all types of coronavirus infectious diseases. We also observed that COVID-19 co-occurred with coronavirus, SARS, Ebola, and ZIKA in different clusters, suggesting substantial overlap between these diseases regarding underlying mechanisms. Clusters shown in Figure 3B further illustrate how different genes, mutations, and chemicals can link diseases with similar mechanisms. For example, COVID-19 was grouped in cluster #6, which contains 606 biomedical entities in total, including infection of SARS, Ebola viruses, the rs180047 mutation, and the carbohydrates chemical component. Based on a literature search, we found that rs180047 is strongly related to TGF-β1, a master regulator for pulmonary fibrosis, which is a common comorbidity related to COVID-19, infection of SARS, and Ebola viruses. In addition, a carbohydrate-based diagnostic, recently reported as a potential new approach for testing COVID-19, was also detected in cluster #6. The comprehensive information for entities included in each cluster is available online (https://github.com/shenfc/COVID-19-network-embeddings).

Figure 3.

Clustering visualization for (A) diseases and (B) all the biomedical entities. COVID-19 (coronavirus disease 2019) is represented in red, SARS (severe acute respiratory syndrome) is represented in black, coronavirus is represented in green, pneumonia is represented in blue, fever is represented in cyan, fibrosis is represented in yellow, diarrhea is represented in magenta, bronchitis is represented in olive drab, Ebola is represented in pink, influenza is represented in dark orchid, ZIKA is represented in khaki, all the genes are represented in purple, all the mutations are represented in silver, and all the chemicals are represented in salmon.

We then selected 5 coronavirus infectious diseases and listed the top 10 closest entities using cosine similarity, as shown in Table 2. COVID-19 was grouped in cluster #6, and the top 2 closest entities in cluster #6 are VP35 and HD11 (both genes). VP35 is a viral protein of another highly infectious disease, Ebola. HD11, also known as Homeobox protein, is known for regulating infectious diseases such as avian infectious bronchitis virus (IBV), which is in the coronavirus family. Pulmonary coronavirus infection was grouped in cluster #1 and has the closest association with the gene PTP (protein tyrosine phosphatase), which has been mentioned in SARS-CoV replication inhibition studies. In the case of SARS-CoV–infected human airway epithelia cell cultures, the entity is evidently linked directly to coronavirus infection. As for “SARS-CoV infection damages lung,” listed in cluster #2, IL-1α (interleukin 1 alpha) has the closest association with SARS-CoV, and supporting evidence can be found in a research study. In particular, IL-1α is a proinflammatory cytokine that increases upon SARS-CoV infection. 
Sucralfate is a chemical compound that also holds a close association with SARS-CoV; it has been studied as a potentially effective means of protection against early-onset ventilator-associated pneumonia. Coronavirus upper respiratory infection was found in cluster #23, and pleuropneumoniae (disease) and plasmin (gene) are the 2 most similar entities. Pleuropneumoniae is a pneumonia complicated with pleurisy, which has been linked to the porcine upper respiratory tract, and plasminogen (PLG) (the zymogen of plasmin) has also been shown to be related to coronavirus upper respiratory infection in research studies. Coronavirus-infected pneumonia was detected in cluster #10. The top 2 closest entities are respiratory syncytial viral infection and pegylated interferon-alpha, which is supported by the research studies of Zou and Zhu and Haagmans et al.

Table 2.

Top 10 intracluster closest biomedical entities for 5 selected coronavirus infectious diseases

Coronavirus infectious diseases	Top 10 closest entities	Cosine similarity score
COVID-19 (cluster #6)	VP35 (Gene)	0.9777
	HD11 (Gene)	0.9774
	Coronavirus infection process (Disease)	0.9700
	Fibroblast growth factor (FGF)-2 (Gene)	0.9655
	Acute respiratory infection illness (Disease)	0.9596
	PIGS (Gene)	0.9576
	TGF alpha (Gene)	0.9571
	SFPQ (Gene)	0.9561
	Tumor necrosis factor (TNF) (Gene)	0.9549
	Praziquantel (Chemical)	0.9537
Pulmonary coronavirus infection (cluster #1)	PTP (Gene)	0.9754
	SARS-CoV–infected human airway epithelia cell cultures (Disease)	0.9699
	“5'-tgg gat tca aca” (Chemical)	0.9672
	Trachea nasal respiratory epithelial cells and llamas (lama glama) (Disease)	0.9658
	Suppressor of cytokine signaling 3 (Gene)	0.9620
	KAT (Gene)	0.9604
	CD32 (Gene)	0.9573
	Maternal SARS infection (Disease)	0.9553
	Respiratory syndrome coronavirus (MERS-CoV) infections (Disease)	0.9547
	S27 (Gene)	0.9546
SARS-COV infection damages lung (cluster #2)	IL-1α (Gene)	0.9560
	Sucralfate prn (Chemical)	0.9589
	Acute respiratory syndrome-cov infection (Disease)	0.9555
	IL-5– and IL-13–producing ilc-iis (Gene)	0.9487
	HAP1 (Gene)	0.9342
	FSK (Chemical)	0.9337
	Low fever (Disease)	0.9328
	HIV and Ebola virus infection (Disease)	0.9327
	YKL-40 (Gene)	0.9288
	ETF (Gene)	0.9280
Coronavirus upper respiratory infection (cluster #23)	Viruses Actinobacillus pleuropneumoniae (Disease)	0.9890
	Plasmin (Gene)	0.9719
	JAM-1 (Gene)	0.9654
	TNF receptor–associated factor 6 (Gene)	0.9648
	GPC3 (Gene)	0.9613
	Renin (Gene)	0.9582
	ZO-1 (Gene)	0.9563
	Cathepsin G (Gene)	0.9556
	rs5743313 (Mutation)	0.9547
	Alpha1 antitrypsin (Gene)	0.9544
Coronavirus-infected pneumonia (cluster #10)	Respiratory syncytial viral infection (Disease)	0.9923
	Pegylated interferon-alpha (Chemical)	0.9891
	IFITM6 (Gene)	0.9872
	Feline b (Chemical)	0.9858
	E119V (Mutation)	0.9854
	Epac2 (Gene)	0.9850
	GFTP2 (Gene)	0.9849
	Hepatitis coronavirus infection (Disease)	0.9843
	Ouabain (Chemical)	0.9797
	LY6G (Gene)	0.9786

Cluster ID and the type of entities are marked in parentheses.


DISCUSSION

In this study, we used 11 keywords to query CORD-19-on-FHIR for constructing the COVID-19–centered coronavirus co-occurrence network. As research studies of COVID-19 are published on a daily basis, we will keep monitoring new results and adding more significant diseases and comorbidities to the keyword list to provide timely updates for the co-occurrence network. PMID information was not incorporated as an attribute into the co-occurrence network, which makes it difficult to trace each biomedical entity back to the original literature. In the future, we will add a PMID list for each entity. On the one hand, it will help us evaluate whether closely associated terms come from the same article or from different publications. On the other hand, it will provide more evidence to scientists and clinical investigators to assist their research studies on COVID-19 in an efficient manner. We did not consider the co-occurrence frequency for the pairs of biomedical entities while training the network embeddings. Instead of treating each edge equally, in the future, we will add weights over edges using frequency in order to better represent associations in the network and provide more accurate edge embeddings to better quantify the strength of associations among biomedical entities. In an unsupervised learning approach, there is always a balance between the number of clusters and the silhouette score. In most cases, the silhouette score tends to be higher if the number of clusters is small. In this study, we used a heuristic approach to determine the silhouette score and the number of clusters for making clear separations over the biomedical entities. In the future, we will seek to use our previously developed hierarchical clustering optimization algorithms to dynamically balance the optimal silhouette score and suitable cluster density. 
Moreover, after checking the top similar entities for 5 selected coronavirus infectious diseases, we observed that applying clustering over the network embeddings could detect both explicit and implicit associations. It is easy to check explicit links, as most of them might be documented in existing studies. Implicit associations, however, might hold huge potential for new discoveries; in order to validate their correctness, we will invite a clinical investigator from the Mayo Clinic Division of Pulmonary and Critical Care Medicine for manual evaluation.

CONCLUSION

This study has explored the construction of co-occurrence network embeddings for COVID-19 and related coronavirus infectious diseases. We have tested different edge embedding operations along with different machine learning algorithms to optimize the final network embeddings and developed an unsupervised clustering strategy to examine specific COVID-19–related associations in depth. Results indicated that the co-occurrence network embeddings were able to perform the link prediction task well and detect both explicit and implicit associations for COVID-19, demonstrating their potential usage for discovering new disease management and treatment plans for COVID-19. Detailed implementations and data sources can be found online (https://github.com/shenfc/COVID-19-network-embeddings). A Web-based user-friendly tool for clustering visualization and entity similarity checking is available online (https://www.davidoniani.com/covid-19-network).

FUNDING

This work was supported by National Institutes of Health grant U01TR0062-1 (PI: HL and Co-I: FS).

AUTHOR CONTRIBUTIONS

DO was responsible for algorithm development, tool implementation, and manuscript drafting. GJ was responsible for extracting the needed information from CORD-19-on-FHIR. HL provided support for system evaluation. FS conceived and supervised the study and was responsible for research topic formulation, algorithm development, experiment design, tool implementation, and manuscript writing and revision.

