| Literature DB >> 36157791 |
Tao Dai, Jie Zhao, Dehong Li, Shun Tian, Xiangmo Zhao, Shirui Pan.
Abstract
The outbreak of COVID-19 has produced one of the largest explosions of scientific literature ever. Faced with such a volume of literature, it is hard for researchers, especially junior ones, to find the desired citations when carrying out COVID-19 related research. This paper presents a novel neural-network-based method, called citation relational BERT with heterogeneous deep graph convolutional network (CRB-HDGCN), for the COVID-19 inline citation recommendation task. CRB-HDGCN contains two main stages. The first stage enhances the representation learning of the BERT model for the COVID-19 inline citation recommendation task through CRB. To achieve this, an augmented citation sentence corpus, in which each citation placeholder is replaced with the title of the cited paper, is used to lightly retrain the BERT model. In addition, we extract three types of sentence pairs according to citation relations and establish sentence prediction tasks to further fine-tune the BERT model. The second stage learns effective dense vectors for the nodes of the COVID-19 bibliographic graph through HDGCN. HDGCN contains four layers, each of which is essentially a sub-neural network. The first layer is the initial embedding layer, which generates initial input vectors of fixed size through CRB and a multilayer perceptron. The second layer is a heterogeneous graph convolutional layer, in which we extend the traditional homogeneous graph convolutional network to heterogeneous graphs by subtly adding heterogeneous nodes and relations. The third layer is a deep attention layer, which uses trainable projection vectors to reweight node importance simultaneously according to both node types and convolution layers, further improving the quality of the learnt node vectors. The last layer, the decoder layer, recovers the graph structure and makes the whole network trainable. The recommendation is finally achieved by integrating the high-performance heterogeneous vectors learnt by CRB-HDGCN with the query vectors.
We conduct experiments on the CORD-19 and LitCovid datasets. The results show that, compared with the second-best method CO-Search, CRB-HDGCN improves MAP, MRR, P@100 and R@100 by 21.8%, 22.7%, 37.6% and 21.2% on CORD-19, and by 29.1%, 25.9%, 15.3% and 11.3% on LitCovid, respectively.
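For clarity on what these figures measure, here is a minimal Python sketch of the four reported ranking metrics (MAP is the mean of average precision over all queries, MRR the mean of reciprocal ranks); the function names are illustrative and not taken from the paper's code:

```python
def precision_recall_at_k(ranked, relevant, k):
    """P@k and R@k for one query: `ranked` is the ranked list of document
    ids returned by the system, `relevant` the set of ground-truth ids."""
    hits = sum(1 for d in ranked[:k] if d in relevant)
    return hits / k, hits / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document; 0 if none is retrieved."""
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def average_precision(ranked, relevant):
    """Average of the precision values at the ranks of relevant documents."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0
```

MAP, MRR, mean P@k and mean R@k are then the averages of these per-query values over the test set.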
Keywords: COVID-19 citation recommendation; Citation enhanced BERT; Deep graph convolutional network; Heterogeneous graph; Text representation learning
Year: 2022 PMID: 36157791 PMCID: PMC9482209 DOI: 10.1016/j.eswa.2022.118841
Source DB: PubMed Journal: Expert Syst Appl ISSN: 0957-4174 Impact factor: 8.665
Summary of related work on COVID-19 paper retrieval.
| Category | Title | Date | Techniques used |
|---|---|---|---|
| COVID-19 | CORD-19: The COVID-19 open research dataset. | 2020 | |
| | LitCovid: An open database of COVID-19 literature. | 2020 | |
| Commercial | Amazon AWS CORD-19 search. | 2020 | Knowledge Graph; |
| | Google COVID-19 research explorer. | 2020 | BERT; |
| COVID-19 paper retrieval methods | Rapidly deploying a neural search engine for the COVID-19 open research dataset. | 2020 | BioBERT; BM25; |
| | Covidseer: Extending the CORD-19 dataset. | 2020 | TF-IDF; SciBERT |
| | COVID-19 knowledge graph: accelerating information retrieval and discovery for scientific literature. | 2020 | Knowledge graph; |
| | Discovering associations in COVID-19 related research papers. | 2020 | Association rule; |
| | Co-search: COVID-19 information retrieval with semantic search, question answering, and abstractive summarization. | 2020 | TF-IDF; BM25; |
| | Aspect-based document similarity for research papers. | 2020 | PLMs |
| | SLEDGE-Z: A zero-shot baseline for COVID-19 literature search. | 2020 | BM25; SciBERT |
| | Information mining for COVID-19 research from a large volume of scientific literature. | 2020 | Betweenness centrality |
| | Searching scientific literature for answers on COVID-19 questions. | 2020 | BM25; BioBERT |
| | CMT in TREC-COVID round 2: mitigating the generalization gaps from web to special domain search. | 2020 | Domain-adaptive |
| | Ad-hoc document retrieval using weak supervision with BERT and GPT2. | 2020 | BERT; GPT2 |
| | Covrelex: A covid-19 retrieval system with relation extraction. | 2021 | BERT; named entity |
| | COVID-19 literature mining and retrieval using text mining approaches. | 2022 | BOW; TF-IDF; |
Summary of related work on citation recommendation.
| Category | Title | Date | Techniques used |
|---|---|---|---|
| Global citation recommendation | Global citation recommendation using knowledge graphs. | 2018 | Knowledge graph; |
| | Recommending scientific paper via heterogeneous knowledge embedding based attentive recurrent neural networks. | 2021 | Graph learning; |
| | Gated relational stacked denoising autoencoder with localized author embedding for global citation recommendation. | 2021 | SDAE; attention |
| Inline citation recommendation | Context-aware citation recommendation. | 2010 | Non-parametric |
| | A neural probabilistic model for context based citation recommendation. | 2015 | Multi-layer neural |
| | Neural citation network for context-aware citation recommendation. | 2017 | Time delay neural |
| | Personalized citation recommendation via convolutional neural networks. | 2017 | CNN |
| | Using citation-context to reduce topic drifting on pure citation-based recommendation. | 2018 | Topic model; Word2Vec |
| | Attentive stacked denoising autoencoder with Bi-LSTM for personalized context-aware citation recommendation. | 2019 | SDAE; Bi-LSTM; |
| | Local citation recommendation with hierarchical-attention text encoder and SciBERT-based reranking. | 2022 | SciBERT; attention |
| | Dual attention model for citation recommendation with analyses on explainability of attention mechanisms and qualitative experiments. | 2022 | Neural model; Dual |
Fig. 1 The brief retraining and fine-tuning process of CRB.
An example construction of the citation-augmented corpus.
| Original citing sentence | Augmented citing sentence |
|---|---|
| These structures defined a novel family of enzymes, now called sedolisins or serine-carboxyl peptidases, that is characterized by the utilization of a fully conserved catalytic triad (Ser, Glu, Asp) and by the presence of an Asp in the oxyanion hole [8]. | These structures defined a novel family of enzymes, now called sedolisins or serine-carboxyl peptidases, that is characterized by the utilization of a fully conserved catalytic triad (Ser, Glu, Asp) and by the presence of an Asp in the oxyanion hole , Aorsin, a novel serine proteinase with trypsin-like specificity at acidic pH. |
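The augmentation illustrated in the table above, where a numbered citation placeholder is replaced with the cited paper's title, can be sketched in a few lines. The `[8]`-style placeholder pattern and the `titles` mapping are illustrative assumptions, not the paper's exact preprocessing code:

```python
import re

def augment_citing_sentence(sentence, titles):
    """Replace each numbered citation placeholder such as [8] with the
    title of the cited paper, producing an augmented citing sentence
    for retraining BERT. `titles` maps citation keys to paper titles."""
    def repl(match):
        key = match.group(1)
        # Fall back to the original placeholder if the title is unknown.
        return ", " + titles.get(key, match.group(0))
    return re.sub(r"\[(\d+)\]", repl, sentence)
```

Applied to the table's example, `augment_citing_sentence("... oxyanion hole [8].", {"8": "Aorsin, a novel serine proteinase with trypsin-like specificity at acidic pH"})` yields the augmented sentence shown in the right-hand column.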
Fig. 2 The detailed architecture of the heterogeneous deep graph convolutional network. There are four layers in the network. First, the initial embedding layer transforms the nodes of the COVID-19 paper dataset into fixed-size vectors via a multilayer perceptron. Second, the heterogeneous graph convolutional layer embeds the initial vectors into heterogeneous matrices by utilizing the multiple relations among them. Third, for the vector matrices of each convolution layer, the deep attention layer utilizes learnable projection vectors to extract importance regulating factors. The final vectors are then generated by combining all reweighted vectors across all convolutional layers. Finally, the decoder layer recovers the network structure from the obtained vectors to ensure the trainability of the whole network.
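As a rough illustration of the convolve-then-attend idea described in the caption (a simplified sketch, not the authors' implementation), one heterogeneous convolution step and the attention combination across layers might look like this in NumPy; all names, the mean-pooled attention scores, and the per-relation normalization are illustrative assumptions:

```python
import numpy as np

def hetero_gcn_layer(H, adj_by_rel, W_by_rel):
    """One heterogeneous graph convolution: aggregate neighbours separately
    per relation type with a relation-specific weight matrix, sum the
    relation-wise messages, and apply a ReLU nonlinearity."""
    out_dim = W_by_rel[next(iter(W_by_rel))].shape[1]
    out = np.zeros((H.shape[0], out_dim))
    for rel, A in adj_by_rel.items():
        deg = A.sum(axis=1, keepdims=True).clip(min=1)  # row-normalize
        out += (A / deg) @ H @ W_by_rel[rel]
    return np.maximum(out, 0.0)

def attention_reweight(layer_outputs, q):
    """Deep attention across convolution layers: score each layer's node
    matrix with a trainable projection vector q, softmax the scores, and
    combine the layers into the final node embeddings."""
    scores = np.array([Z.mean(axis=0) @ q for Z in layer_outputs])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return sum(wi * Z for wi, Z in zip(w, layer_outputs))
```

In the paper's model the attention additionally reweights by node type; here only the per-layer reweighting is sketched.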
Fig. 3 A demonstration graph illustrating the node and relationship types involved in the bibliographic heterogeneous graph.
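A bibliographic heterogeneous graph of this kind can be represented with typed nodes and typed edges. The sketch below uses paper/author/venue node types and cites/writes/publishes relations as plausible examples; the paper's actual schema may differ:

```python
from collections import defaultdict

class BiblioGraph:
    """A minimal heterogeneous bibliographic graph with typed nodes
    (e.g. paper, author, venue) and typed edges (e.g. cites, writes,
    publishes). Illustrative only."""
    def __init__(self):
        self.node_type = {}                 # node id -> type name
        self.edges = defaultdict(set)       # (relation, src) -> {dst, ...}

    def add_node(self, node_id, ntype):
        self.node_type[node_id] = ntype

    def add_edge(self, src, rel, dst):
        self.edges[(rel, src)].add(dst)

    def neighbors(self, node_id, rel):
        """All nodes reachable from node_id via the given relation."""
        return self.edges[(rel, node_id)]
```

Per-relation adjacency structures like `edges` are what a heterogeneous convolution iterates over, one relation type at a time.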
Fig. 4 A running example of HDGCN with 2 convolutional layers.
Statistics of the datasets used.
| Dataset | Split | Papers | Authors | Venues |
|---|---|---|---|---|
| CORD-19 | train | 327,619 | 284,581 | 15,924 |
| | test | 37,289 | 35,417 | 7,423 |
| LitCovid | train | 103,175 | 90,692 | 6,305 |
| | test | 6,613 | 5,934 | 2,081 |
Performance comparison between different methods on CORD-19.
| Methods | MAP | MRR | P@25 | P@50 | P@75 | P@100 | R@25 | R@50 | R@75 | R@100 |
|---|---|---|---|---|---|---|---|---|---|---|
| TF-IDF | 0.0541 | 0.0604 | 0.0191 | 0.0148 | 0.0062 | 0.0013 | 0.0719 | 0.1005 | 0.1251 | 0.1694 |
| BM25 | 0.0562 | 0.0689 | 0.0224 | 0.0161 | 0.0074 | 0.0017 | 0.0943 | 0.1139 | 0.1354 | 0.1825 |
| SciBERT | 0.0716 | 0.0845 | 0.0314 | 0.0193 | 0.0124 | 0.0048 | 0.1152 | 0.1382 | 0.1584 | 0.2136 |
| BERT-DVGCN | 0.0826 | 0.0912 | 0.0408 | 0.0214 | 0.0147 | 0.0085 | 0.1282 | 0.1426 | 0.1671 | 0.2216 |
| AR | 0.1107 | 0.1169 | 0.0594 | 0.0377 | 0.0272 | 0.0142 | 0.1442 | 0.1681 | 0.1869 | 0.2454 |
| CO-Search | 0.1227 | 0.1289 | 0.0697 | 0.0493 | 0.0423 | 0.0197 | 0.1569 | 0.1727 | 0.2036 | 0.2563 |
| CRB-HDGCN | 0.1495 | 0.1582 | 0.0848 | 0.0608 | 0.0572 | 0.0271 | 0.1953 | 0.2279 | 0.2483 | 0.3107 |
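The headline improvements over CO-Search quoted in the abstract follow directly from this table. A quick arithmetic check of the CORD-19 figures:

```python
def rel_improvement(new, old):
    """Relative improvement of `new` over `old`, as a percentage
    rounded to one decimal place."""
    return round(100 * (new - old) / old, 1)

# CRB-HDGCN vs. CO-Search on CORD-19 (values taken from the table above)
gains = {
    "MAP":   rel_improvement(0.1495, 0.1227),
    "MRR":   rel_improvement(0.1582, 0.1289),
    "P@100": rel_improvement(0.0271, 0.0197),
    "R@100": rel_improvement(0.3107, 0.2563),
}
# gains == {"MAP": 21.8, "MRR": 22.7, "P@100": 37.6, "R@100": 21.2}
```

These match the 21.8%, 22.7%, 37.6% and 21.2% improvements reported in the abstract.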
Performance comparison between different methods on LitCovid.
| Methods | MAP | MRR | P@25 | P@50 | P@75 | P@100 | R@25 | R@50 | R@75 | R@100 |
|---|---|---|---|---|---|---|---|---|---|---|
| TF-IDF | 0.0705 | 0.0753 | 0.0334 | 0.0237 | 0.0102 | 0.0034 | 0.1153 | 0.1406 | 0.1690 | 0.1907 |
| BM25 | 0.0723 | 0.0831 | 0.0365 | 0.0253 | 0.0179 | 0.0065 | 0.1268 | 0.1512 | 0.1784 | 0.2198 |
| SciBERT | 0.0923 | 0.1005 | 0.0564 | 0.0301 | 0.0240 | 0.0185 | 0.1706 | 0.1880 | 0.2037 | 0.2610 |
| BERT-DVGCN | 0.1081 | 0.1138 | 0.0626 | 0.0459 | 0.0307 | 0.0204 | 0.1792 | 0.2065 | 0.2204 | 0.2709 |
| AR | 0.1492 | 0.1673 | 0.1051 | 0.0761 | 0.0529 | 0.0306 | 0.2054 | 0.2275 | 0.2373 | 0.3096 |
| CO-Search | 0.1594 | 0.1717 | 0.1165 | 0.0852 | 0.0621 | 0.0378 | 0.2107 | 0.2388 | 0.2463 | 0.3159 |
| CRB-HDGCN | 0.2057 | 0.2161 | 0.1295 | 0.0952 | 0.0784 | 0.0431 | 0.2409 | 0.2770 | 0.2951 | 0.3517 |
Performance comparison between different variants on CORD-19.
| Methods | MAP | MRR | P@25 | P@50 | P@75 | P@100 | R@25 | R@50 | R@75 | R@100 |
|---|---|---|---|---|---|---|---|---|---|---|
| CRB | 0.0842 | 0.0929 | 0.0415 | 0.0237 | 0.0151 | 0.0093 | 0.1293 | 0.1442 | 0.1689 | 0.2236 |
| BERT-HDGCN-PA | 0.1224 | 0.1335 | 0.0653 | 0.0374 | 0.0349 | 0.0185 | 0.1702 | 0.1983 | 0.2263 | 0.2752 |
| BERT-HDGCN-PJ | 0.1135 | 0.1267 | 0.0606 | 0.0344 | 0.0316 | 0.0169 | 0.1683 | 0.1901 | 0.2185 | 0.2703 |
| BERT-HDGCN | 0.1265 | 0.1388 | 0.0717 | 0.0452 | 0.0392 | 0.0215 | 0.1762 | 0.2043 | 0.2308 | 0.2841 |
| CRB-HDGCN-PA | 0.1452 | 0.1541 | 0.0803 | 0.0559 | 0.0543 | 0.0246 | 0.1914 | 0.2218 | 0.2411 | 0.2983 |
| CRB-HDGCN-PJ | 0.1385 | 0.1479 | 0.0762 | 0.0512 | 0.0515 | 0.0226 | 0.1892 | 0.2176 | 0.2385 | 0.2925 |
| CRB-HDGCN | 0.1495 | 0.1582 | 0.0848 | 0.0608 | 0.0572 | 0.0271 | 0.1953 | 0.2279 | 0.2483 | 0.3107 |
Performance comparison between different variants on LitCovid.
| Methods | MAP | MRR | P@25 | P@50 | P@75 | P@100 | R@25 | R@50 | R@75 | R@100 |
|---|---|---|---|---|---|---|---|---|---|---|
| CRB | 0.1403 | 0.1492 | 0.0872 | 0.0593 | 0.0362 | 0.0261 | 0.1813 | 0.2096 | 0.2283 | 0.2856 |
| BERT-HDGCN-PA | 0.1782 | 0.1852 | 0.1105 | 0.0762 | 0.0614 | 0.0335 | 0.2203 | 0.2373 | 0.2552 | 0.3169 |
| BERT-HDGCN-PJ | 0.1646 | 0.1791 | 0.1025 | 0.0707 | 0.0585 | 0.0308 | 0.2102 | 0.2285 | 0.2497 | 0.3102 |
| BERT-HDGCN | 0.1892 | 0.1915 | 0.1175 | 0.0812 | 0.0693 | 0.0641 | 0.2261 | 0.2496 | 0.2634 | 0.3253 |
| CRB-HDGCN-PA | 0.1993 | 0.2093 | 0.1226 | 0.0885 | 0.0724 | 0.0382 | 0.2336 | 0.2698 | 0.2867 | 0.3428 |
| CRB-HDGCN-PJ | 0.1925 | 0.2037 | 0.1176 | 0.0824 | 0.0700 | 0.0360 | 0.2301 | 0.2635 | 0.2839 | 0.3380 |
| CRB-HDGCN | 0.2057 | 0.2161 | 0.1295 | 0.0952 | 0.0784 | 0.0431 | 0.2409 | 0.2770 | 0.2951 | 0.3517 |
Retraining and fine-tuning time of CRB.
| Dataset | Retraining | Fine-tuning |
|---|---|---|
| CORD-19 | 10,401 s | 6,729 s |
| LitCovid | 6,158 s | 4,613 s |
Efficiency evaluation on CORD-19.
| Methods | Train time (10%) | Test time (10%) | R@100 (10%) | Train time (50%) | Test time (50%) | R@100 (50%) | Train time (100%) | Test time (100%) | R@100 (100%) |
|---|---|---|---|---|---|---|---|---|---|
| BERT-DVGCN | 14,052 s | 582 s | 0.0937 | 73,269 s | 2,353 s | 0.1547 | 143,839 s | 4,523 s | 0.2216 |
| CO-Search | 11,503 s | 904 s | 0.1037 | 56,731 s | 3,462 s | 0.1835 | 116,493 s | 6,620 s | 0.2563 |
| CRB-HDGCN | 14,937 s | 597 s | 0.1352 | 74,820 s | 2,495 s | 0.2235 | 149,522 s | 4,782 s | 0.3107 |
Efficiency evaluation on LitCovid.
| Methods | Train time (10%) | Test time (10%) | R@100 (10%) | Train time (50%) | Test time (50%) | R@100 (50%) | Train time (100%) | Test time (100%) | R@100 (100%) |
|---|---|---|---|---|---|---|---|---|---|
| BERT-DVGCN | 4,641 s | 117 s | 0.1163 | 21,659 s | 450 s | 0.1703 | 47,628 s | 953 s | 0.2709 |
| CO-Search | 3,128 s | 182 s | 0.1328 | 16,510 s | 738 s | 0.2094 | 42,627 s | 1,773 s | 0.3159 |
| CRB-HDGCN | 5,002 s | 128 s | 0.1561 | 23,569 s | 497 s | 0.2554 | 49,953 s | 1,084 s | 0.3517 |
Fig. 5 The impact of hyperparameters with different numbers of convolution layers: (a) on CORD-19; (b) on LitCovid; (c) on CORD-19; (d) on LitCovid.
Fig. 6 The visualization of cosine similarity between words. The seven words are "coronavirus", "bacteria", "antibodies", "quarantine", "vaccine", "respiratory" and "dysgeusia", respectively. The sub-graphs show the results for different BERT models: (a) vanilla BERT, (b) SciBERT, (c) COVID-Twitter-BERT and (d) CRB.
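The similarity scores visualized in Fig. 6 are standard cosine similarities between word embedding vectors; a minimal sketch (the word vectors themselves would come from the respective BERT model):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors: 1.0 for
    identical directions, 0.0 for orthogonal vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

A well-adapted model should, for instance, place "coronavirus" closer to "vaccine" than to an unrelated term, which is what the CRB panel of Fig. 6 is meant to show.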
An example of 4 recommended citations by CRB-HDGCN-5 and CRB-HDGCN-15 for paper “PMC7096991”.
| Ground truths: | CRB-HDGCN-5 | CRB-HDGCN-15 |
|---|---|---|