Nikos Kanakaris, Nikolaos Giarelis, Ilias Siachos, Nikos Karacapilidis.
Abstract
We consider the prediction of future research collaborations as a link prediction problem applied to a scientific knowledge graph. To the best of our knowledge, this is the first work on the prediction of future research collaborations that combines structural and textual information of a scientific knowledge graph through a purposeful integration of graph algorithms and natural language processing techniques. Our work: (i) investigates whether the integration of unstructured textual data into a single knowledge graph affects the performance of a link prediction model, (ii) studies the effect of previously proposed graph-kernel-based approaches on the performance of an ML model for link prediction, and (iii) proposes a three-phase pipeline that enables the exploitation of structural and textual information, as well as of pre-trained word embeddings. We benchmark the proposed approach against classical link prediction algorithms using accuracy, recall, and precision as our performance metrics. Finally, we empirically test our approach through various feature combinations with respect to the link prediction problem. Our experiments with the new COVID-19 Open Research Dataset (CORD-19) demonstrate a significant improvement of these performance metrics in the prediction of future research collaborations.
Keywords: document representation; future research collaborations; graph kernels; knowledge graph; link prediction; natural language processing; word embeddings
Year: 2021 PMID: 34070422 PMCID: PMC8226892 DOI: 10.3390/e23060664
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
Figure 1. The phases of the proposed approach. p denotes nodes of the ‘Paper’ type. w denotes nodes of the ‘Word’ type. a denotes nodes of the ‘Author’ type. loc denotes nodes of the ‘Location’ type. i denotes nodes of the ‘Institution’ type. lab denotes nodes of the ‘Laboratory’ type. The word embedding of a word (w) is denoted by e. SF and TF denote structure-related and text-related features, respectively. label denotes the label (0 or 1) that corresponds to the sample x of the given dataset. The concatenation operator joins the structure-related and text-related features to generate the feature vector of the sample x of the given dataset.
Figure 2. The data schema of the proposed scientific knowledge graph. Dotted lines connect properties associated with the entities of the knowledge graph.
Figure 3. Snapshots of the knowledge graph that is generated from the CORD-19 dataset: limited to 1000 (upper-left), 2000 (upper-right), 3000 (bottom-left) and 30,000 (bottom-right) nodes. The different node and edge colors highlight the heterogeneity of the produced graph.
Number of training samples (|Training subset samples|) and number of testing subset samples (|Testing subset samples|) of each dataset.
| Dataset ID | |Training Subset Samples| | |Testing Subset Samples| |
|---|---|---|
| D1 | 1000 | 330 |
| D2 | 1000 | 330 |
| D3 | 1000 | 330 |
| D4 | 1000 | 330 |
| D5 | 1000 | 330 |
| D6 | 1000 | 330 |
| D7 | 6000 | 1890 |
| D8 | 9900 | |
| D9 | 1000 | 330 |
| D10 | 1000 | 330 |
The features of each sample of the extracted datasets. A feature is associated with either a textual or a structural relationship of two authors.
| Feature | Description | Type |
|---|---|---|
| adamic adar | The sum of the inverse logarithm of the degree of the set of common neighbor ‘ | Structural (SF) |
| common neighbors | The number of neighbor ‘ | Structural (SF) |
| preferential attachment | The product of the in-degree values of a pair of ‘ | Structural (SF) |
| total neighbors | The product of the in-degree values of a pair of ‘ | Structural (SF) |
| pyramid match | The similarity of the text of the graph-of-docs graphs of two nodes of ‘ | Textual (TF) |
| propagation | The similarity of the text of the graph-of-docs graphs of two nodes of ‘ | Textual (TF) |
| weisfeiler pyramid match | The similarity of the text of the graph-of-docs graphs of two nodes of ‘ | Textual (TF) |
| jaccard | The similarity of the text of the graph-of-docs graphs of two nodes of ‘ | Structural and Textual (SF and TF) |
| Label | It denotes an edge of the ‘ | Class |
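The structural features listed above can be computed directly from the neighbor sets of two nodes. The following is a minimal Python sketch, not the authors' implementation: the toy graph and all function names are hypothetical, the graph is treated as undirected, and the textbook definitions are assumed (in particular, total neighbors is taken as the size of the union of the two neighbor sets, and Jaccard is computed over neighbor sets only, without any textual component).

```python
import math

# Hypothetical toy co-authorship graph: author -> set of neighboring authors.
graph = {
    "a1": {"a2", "a3", "a4"},
    "a2": {"a1", "a3"},
    "a3": {"a1", "a2", "a4"},
    "a4": {"a1", "a3"},
}


def common_neighbors(g, u, v):
    """Number of nodes adjacent to both u and v."""
    return len(g[u] & g[v])


def total_neighbors(g, u, v):
    """Number of distinct nodes adjacent to u or v (assumed definition)."""
    return len(g[u] | g[v])


def preferential_attachment(g, u, v):
    """Product of the degrees of u and v."""
    return len(g[u]) * len(g[v])


def adamic_adar(g, u, v):
    """Sum of 1 / log(degree) over the common neighbors of u and v.

    Common neighbors of degree 1 are skipped to avoid dividing by log(1) = 0.
    """
    return sum(
        1.0 / math.log(len(g[z])) for z in g[u] & g[v] if len(g[z]) > 1
    )


def jaccard(g, u, v):
    """Size of the intersection of the neighbor sets divided by their union."""
    union = g[u] | g[v]
    return len(g[u] & g[v]) / len(union) if union else 0.0


def structural_features(g, u, v):
    """Collect the structural features of a candidate author pair."""
    return {
        "adamic_adar": adamic_adar(g, u, v),
        "common_neighbors": common_neighbors(g, u, v),
        "preferential_attachment": preferential_attachment(g, u, v),
        "total_neighbors": total_neighbors(g, u, v),
        "jaccard": jaccard(g, u, v),
    }


if __name__ == "__main__":
    # a2 and a4 are not connected but share both of their neighbors.
    print(structural_features(graph, "a2", "a4"))
```

In this toy graph, `a2` and `a4` share both of their neighbors, so the pair scores highly on every feature even though no edge connects them yet; this is the kind of signal a link prediction classifier can learn from.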
The feature combinations used to test how different sets of features affect the performance of the ML models in link prediction.
| Feature Combination Name | Features Included | Proposed In |
|---|---|---|
| ALL | Adamic Adar, Common Neighbors, Preferential attachment, Total Neighbors, Pyramid match, Weisfeiler Pyramid match, Jaccard, Propagation | [ |
| PM | Pyramid Match | [ |
| WPM | Weisfeiler Pyramid match | [ |
| AA_J (baseline) | Adamic Adar, Jaccard | [ |
| AA (baseline) | Adamic Adar | [ |
| | Propagation | [ |
| J (baseline) | Jaccard | [ |
| AA_WPM | Adamic Adar, Weisfeiler Pyramid match | |
| AA_P | Adamic Adar, Propagation | |
| AA_PM | Adamic Adar, Pyramid match | |
Performance of the logistic regression classifier for each feature combination. * indicates statistical significance in improvement (p < 0.05) for each evaluation metric using the micro sign test against the AA_J baseline.
| Feature Combination | Accuracy | Recall | Precision |
|---|---|---|---|
| ALL | 0.6588 | 0.9963 * | 0.6345 |
| J | 0.5093 | 0.0233 | |
| AA | 0.9818 | 0.9643 | 0.9995 |
| AA_J | 0.9834 | 0.9671 | 0.9998 |
| | 0.6669 | 0.5589 | 0.8157 |
| PM | 0.838 | 0.6965 | 0.9752 |
| WPM | 0.9476 | 0.9044 | 0.9905 |
| AA_P | 0.9652 | 0.9923 * | 0.9625 |
| AA_PM | 0.998 * | 0.9966 * | 0.9995 |
| AA_WPM | | | 0.9995 |
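For readers who want to reproduce the three reported metrics, the following is a minimal Python sketch of accuracy, recall, and precision for binary link labels. The function names and the toy labels are hypothetical and are not taken from the paper; a label of 1 is assumed to mean that a collaboration edge exists.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 marks a link."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def recall(y_true, y_pred):
    """Fraction of true links that the classifier recovers."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(y_true, y_pred):
    """Fraction of predicted links that are true links."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


if __name__ == "__main__":
    # Hypothetical balanced test set; the classifier rarely predicts a link.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 0, 0, 0]
    print(accuracy(y_true, y_pred))   # 0.625
    print(recall(y_true, y_pred))     # 0.25
    print(precision(y_true, y_pred))  # 1.0
```

The toy example shows why the three metrics are reported together: a classifier that almost never predicts a link can have low recall while its accuracy stays moderate and its precision stays high.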
Performance of the neural network classifier for each feature combination. * indicates statistical significance in improvement (p < 0.05) for each evaluation metric using the micro sign test against the AA_J baseline. The train and test losses are the average binary cross-entropy between the true and predicted label values.
| Feature Combination | Accuracy | Recall | Precision | Train Loss | Test Loss | Abs Loss Difference |
|---|---|---|---|---|---|---|
| ALL | 0.9908 | | 0.9886 | 0.102 | 0.0499 | 0.0521 |
| J | 0.5093 | 0.0233 | | 0.6647 | 0.6858 | |
| AA | 0.9922 | 0.985 | 0.9995 | 0.1303 | 0.0497 | 0.0806 |
| AA_J | 0.9925 | 0.9856 | 0.9995 | 0.1097 | 0.0413 | 0.0684 |
| | 0.6954 | 0.5045 | 0.8624 | 0.6289 | 0.6057 | 0.0232 |
| PM | 0.8452 | 0.7085 | 0.9816 | 0.3219 | 0.399 | 0.0771 |
| WPM | 0.9248 | 0.859 | 0.9905 | 0.2612 | 0.239 | 0.0222 |
| AA_P | 0.9923 | 0.9851 | 0.9995 | 0.1311 | 0.0464 | 0.0847 |
| AA_PM | | 0.9886 * | 0.9995 | 0.1281 | 0.0395 | 0.0886 |
| AA_WPM | 0.9932 | 0.987 | 0.9995 | 0.1108 | 0.0372 | 0.0736 |
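The loss columns above can be illustrated with a short Python sketch of average binary cross-entropy and of the absolute difference between train and test loss. The function names and the clipping constant `eps` are assumptions made for this illustration, not details from the paper; only the two loss values in the usage lines are taken from the AA_J row of the table.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average binary cross-entropy between true labels and predicted probabilities.

    Probabilities are clipped to [eps, 1 - eps] so that log(0) never occurs.
    """
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


def abs_loss_difference(train_loss, test_loss):
    """Absolute gap between train and test loss."""
    return abs(train_loss - test_loss)


if __name__ == "__main__":
    # An uninformative prediction of 0.5 for every sample gives a loss of ln(2).
    print(binary_cross_entropy([1, 0], [0.5, 0.5]))
    # Train and test loss of the AA_J row in the table above.
    print(round(abs_loss_difference(0.1097, 0.0413), 4))  # 0.0684
```

A small absolute difference indicates that the model behaves similarly on seen and unseen samples, which is one way to read the last column of the table.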
Figure 4. (a) Comparison of validation accuracies of the NN model using the AA_PM and the AA_J feature combinations; (b) Comparison of cross-entropy of the NN model using the AA_PM and the AA_J feature combinations.