Ilya Makarov, Mikhail Makarov, Dmitrii Kiselev
Abstract
Today, increased attention is drawn towards network representation learning, a technique that maps the nodes of a network into vectors in a low-dimensional embedding space. A network embedding constructed this way aims to preserve node similarity and other specific network properties. The embedding vectors can later be used for downstream machine learning problems such as node classification, link prediction, and network visualization. Naturally, some networks have text information associated with them. For instance, in a citation network each node is a scientific paper associated with its abstract or title; in a social network, users may be viewed as nodes and each user's posts as textual attributes. In this work, we explore how combining existing methods of text and network embeddings can increase accuracy on downstream tasks, and we propose modifications to popular architectures to better capture textual information in network embedding and fusion frameworks.
Keywords: Community detection; Graph embeddings; Graph visualization; Information fusion; Link prediction; Network science; Node classification; Node clustering; Text embeddings
Year: 2021 PMID: 34084929 PMCID: PMC8157042 DOI: 10.7717/peerj-cs.526
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
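As a rough illustration of the text-only baselines reported below (e.g. the TF-IDF rows), the protocol embeds each document, trains a linear classifier on a small labelled fraction, and scores micro-F1 on the rest. This is a hedged sketch with a toy corpus and made-up labels, not the paper's pipeline or data:

```python
# Toy sketch: TF-IDF text embedding + logistic regression, scored with
# micro-F1 on the held-out nodes, mirroring the "% Labels" columns below.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Illustrative corpus standing in for paper abstracts/titles.
docs = ["neural networks for images", "graph embeddings of citation networks",
        "convolutional neural networks", "random walks on graphs"] * 10
labels = [0, 1, 0, 1] * 10

X = TfidfVectorizer().fit_transform(docs)  # sparse TF-IDF document matrix

# Keep only 10% of the labels for training, as in the 10% column.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, train_size=0.1, stratify=labels, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
micro_f1 = f1_score(y_te, clf.predict(X_te), average="micro")
```

The same train-fraction/micro-F1 protocol applies to every node-classification table that follows; only the embedding method changes.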
Text methods on Cora for node classification (micro-F1; values lie in (0, 1), higher is better).
| % Labels | 5% | 10% | 30% | 50% |
|---|---|---|---|---|
| BoW | 0.63 | 0.68 | | |
| TF-IDF | 0.35 | 0.49 | 0.70 | 0.76 |
| LDA | 0.49 | 0.57 | 0.60 | 0.61 |
| SBERT pretrained | 0.57 | 0.61 | 0.68 | 0.70 |
| Word2Vec pretrained | 0.34 | 0.44 | 0.59 | 0.63 |
| Word2Vec (d = 300) | 0.64 | 0.68 | 0.70 | 0.71 |
| Word2Vec (d = 64) | 0.65 | 0.68 | 0.70 | 0.72 |
| Doc2Vec pretrained | 0.54 | 0.61 | 0.65 | 0.67 |
| Doc2Vec (d = 300) | 0.49 | 0.58 | 0.66 | 0.68 |
| Doc2Vec (d = 64) | 0.50 | 0.58 | 0.65 | 0.67 |
| Sent2Vec pretrained | 0.63 | 0.69 | 0.74 | 0.77 |
| Sent2Vec (d = 600) | 0.75 | 0.77 | | |
| Sent2Vec (d = 64) | 0.75 | 0.77 | | |
| Ernie pretrained | 0.43 | 0.52 | 0.62 | 0.65 |
Note:
The best values with respect to confidence intervals are highlighted in bold.
Text methods on Citeseer-M10 for node classification (micro-F1; values lie in (0, 1), higher is better).
| % Labels | 5% | 10% | 30% | 50% |
|---|---|---|---|---|
| BoW | 0.62 | 0.66 | 0.73 | |
| TF-IDF | 0.61 | 0.66 | 0.72 | 0.75 |
| LDA | 0.37 | 0.38 | 0.39 | 0.39 |
| SBERT pretrained | 0.66 | 0.68 | 0.72 | 0.73 |
| Word2Vec pretrained | 0.67 | 0.69 | 0.72 | 0.73 |
| Word2Vec (d = 300) | 0.55 | 0.57 | 0.59 | 0.60 |
| Word2Vec (d = 64) | 0.58 | 0.59 | 0.61 | 0.62 |
| Doc2Vec pretrained | 0.75 | | | |
| Doc2Vec (d = 300) | 0.53 | 0.56 | 0.59 | 0.61 |
| Doc2Vec (d = 64) | 0.56 | 0.59 | 0.62 | 0.63 |
| Sent2Vec pretrained | 0.73 | 0.75 | | |
| Sent2Vec (d = 600) | 0.64 | 0.66 | 0.70 | 0.71 |
| Sent2Vec (d = 64) | 0.63 | 0.65 | 0.68 | 0.69 |
| Ernie pretrained | 0.59 | 0.63 | 0.67 | 0.68 |
Note:
The best values with respect to confidence intervals are highlighted in bold.
Text methods on DBLP for node classification (micro-F1; values lie in (0, 1), higher is better).
| % Labels | 5% | 10% | 30% | 50% |
|---|---|---|---|---|
| BoW | 0.75 | 0.77 | | |
| TF-IDF | 0.74 | 0.76 | | |
| LDA | 0.54 | 0.55 | 0.55 | 0.56 |
| SBERT pretrained | 0.69 | 0.72 | 0.75 | 0.75 |
| Word2Vec pretrained | 0.72 | 0.73 | 0.74 | 0.74 |
| Word2Vec (d = 300) | 0.76 | 0.76 | 0.77 | 0.77 |
| Word2Vec (d = 64) | 0.76 | 0.76 | 0.76 | 0.77 |
| Doc2Vec pretrained | 0.73 | 0.75 | 0.76 | 0.76 |
| Doc2Vec (d = 300) | 0.55 | 0.56 | 0.57 | 0.58 |
| Doc2Vec (d = 64) | 0.54 | 0.54 | 0.55 | 0.55 |
| Sent2Vec pretrained | 0.73 | 0.75 | 0.77 | 0.77 |
| Sent2Vec (d = 600) | 0.79 | | | |
| Sent2Vec (d = 64) | 0.78 | 0.78 | | |
| Ernie pretrained | 0.70 | 0.71 | 0.71 | 0.73 |
Note:
The best values with respect to confidence intervals are highlighted in bold.
Network methods for node classification on Cora, Citeseer-M10, and DBLP (micro-F1; values lie in (0, 1), higher is better; columns give the fraction of labelled nodes).
| Dataset | Method | 5% | 10% | 30% | 50% |
|---|---|---|---|---|---|
| Cora | DeepWalk | 0.72 | | | |
| Cora | Node2Vec | 0.76 | 0.80 | 0.81 | |
| Cora | HOPE | 0.29 | 0.30 | 0.30 | 0.31 |
| Citeseer-M10 | DeepWalk | | | | |
| Citeseer-M10 | Node2Vec | | | | |
| Citeseer-M10 | HOPE | 0.12 | 0.13 | 0.17 | 0.20 |
| DBLP | DeepWalk | | | | |
| DBLP | Node2Vec | | | | |
| DBLP | HOPE | 0.29 | 0.30 | 0.31 | 0.31 |
Note:
The best values with respect to confidence intervals are highlighted in bold.
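DeepWalk and Node2Vec above both start by sampling truncated random walks over the graph and then feeding them to a skip-gram model (e.g. gensim's Word2Vec) to produce node vectors. A minimal sketch of the first stage, uniform random walks over a toy adjacency list (not one of the paper's datasets):

```python
# Generate DeepWalk-style truncated uniform random walks.
import random

def random_walks(adj, num_walks=10, walk_length=5, seed=0):
    """Return num_walks walks per node; each walk is a list of node ids."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            while len(walk) < walk_length:
                nbrs = adj[walk[-1]]
                if not nbrs:  # dead end: stop the walk early
                    break
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks

# Toy 4-node cycle graph as adjacency lists.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
walks = random_walks(adj)
```

Node2Vec differs only in biasing the neighbor choice with its return/in-out parameters; HOPE instead factorizes a similarity matrix and samples no walks.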
Fusion methods on Cora for node classification (micro-F1; values lie in (0, 1), higher is better).
| % Labels | 5% | 10% | 30% | 50% |
|---|---|---|---|---|
| BoW + DeepWalk | 0.74 | 0.80 | 0.84 | 0.86 |
| Sent2Vec + DeepWalk | 0.76 | 0.79 | 0.84 | 0.85 ± 0.01 |
| TADW - TF-IDF | 0.72 | 0.80 ± 0.01 | 0.85 ± 0.01 | 0.86 ± 0.01 |
| TADW - Sent2Vec | 0.75 | 0.80 ± 0.01 | 0.83 ± 0.00 | 0.85 ± 0.00 |
| TADW - Ernie | 0.57 | 0.69 | 0.80 | 0.82 |
| TriDNR | 0.59 | 0.68 ± 0.00 | 0.75 ± 0.01 | 0.78 ± 0.01 |
| GCN - TF-IDF | 0.80 | 0.83 ± 0.01 | 0.86 ± 0.01 | 0.87 ± 0.01 |
| GCN - Sent2Vec | 0.77 | 0.82 ± 0.00 | 0.85 ± 0.01 | 0.87 ± 0.01 |
| GCN - Ernie | 0.60 | 0.67 | 0.77 | 0.81 |
| GAT - TF-IDF | | | | |
| GAT - Sent2Vec | 0.78 | 0.81 ± 0.00 | 0.85 ± 0.01 | 0.86 ± 0.00 |
| GAT - Ernie | 0.58 | 0.62 | 0.71 | 0.73 |
| GraphSAGE - TF-IDF | 0.80 | 0.87 ± 0.01 | | |
| GraphSAGE - Sent2Vec | 0.75 ± 0.01 | 0.80 ± 0.01 | 0.86 ± 0.01 | |
| GraphSAGE - Ernie | 0.29 | 0.33 | 0.34 | 0.37 |
| GIC - TF-IDF | 0.74 ± 0.01 | 0.81 ± 0.00 | 0.85 ± 0.00 | |
| GIC - Sent2Vec | 0.66 ± 0.00 | 0.76 ± 0.02 | 0.84 ± 0.00 | 0.86 ± 0.00 |
| GIC - Ernie | 0.34 | 0.37 | 0.37 | 0.38 |
Note:
The best values with respect to confidence intervals are highlighted in bold.
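The simplest fusion rows above ("BoW + DeepWalk", "Sent2Vec + DeepWalk") just concatenate the two embeddings per node. A hedged numpy sketch of that baseline, with random matrices standing in for the real text and network embeddings; the L2 normalization step is a common precaution so neither modality dominates, not necessarily the paper's exact recipe:

```python
# Concatenation fusion: normalize each modality, then stack per node.
import numpy as np

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(100, 64))  # stand-in for Sent2Vec vectors
net_emb = rng.normal(size=(100, 64))   # stand-in for DeepWalk vectors

def l2_normalize(m):
    """Scale each row to unit Euclidean norm."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

fused = np.hstack([l2_normalize(text_emb), l2_normalize(net_emb)])
```

The fused vectors then feed the same classifier/protocol as any single embedding; TADW, TriDNR, and the GNN rows replace this concatenation with joint training.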
Fusion methods on Citeseer-M10 for node classification (micro-F1; values lie in (0, 1), higher is better).
| % Labels | 5% | 10% | 30% | 50% |
|---|---|---|---|---|
| BoW + DeepWalk | 0.73 | 0.76 | 0.81 | 0.83 |
| Sent2Vec + DeepWalk | 0.73 | 0.75 | 0.79 | 0.80 |
| TADW - TF-IDF | 0.47 | 0.51 | 0.57 | 0.59 |
| TADW - Sent2Vec | 0.57 | 0.60 | 0.65 | 0.66 |
| TADW - Ernie | 0.41 | 0.46 | 0.53 | 0.56 |
| TriDNR | 0.63 | 0.68 | 0.74 | 0.77 |
| GCN - TF-IDF | 0.71 | 0.76 | 0.81 | 0.83 |
| GCN - Sent2Vec | 0.73 | 0.84 | | |
| GCN - Ernie | 0.71 | 0.75 | 0.78 | 0.79 |
| GAT - TF-IDF | 0.72 | 0.76 | 0.82 | 0.84 |
| GAT - Sent2Vec | 0.79 | 0.81 | 0.83 | |
| GAT - Ernie | 0.70 | 0.74 | 0.77 | 0.78 |
| GraphSAGE - TF-IDF | 0.72 | 0.77 | 0.83 | 0.85 |
| GraphSAGE - Sent2Vec | 0.86 | | | |
| GraphSAGE - Ernie | 0.58 | 0.63 | 0.65 | 0.68 |
| GIC - TF-IDF | 0.66 | 0.70 | 0.80 | 0.83 |
| GIC - Sent2Vec | 0.74 | 0.78 | 0.83 | 0.84 |
| GIC - Ernie | 0.49 | 0.57 | 0.57 | 0.63 |
Note:
The best values with respect to confidence intervals are highlighted in bold.
Fusion methods on DBLP for node classification (micro-F1; values lie in (0, 1), higher is better).
| % Labels | 5% | 10% | 30% | 50% |
|---|---|---|---|---|
| BoW + DeepWalk | 0.77 | 0.79 | 0.81 | 0.82 |
| Sent2Vec + DeepWalk | 0.78 | 0.80 | 0.80 | |
| TriDNR | 0.72 | 0.75 | 0.78 | 0.79 |
| GCN - TF-IDF | 0.71 | 0.76 | 0.81 | |
| GCN - Sent2Vec | 0.78 | 0.81 | 0.81 | |
| GCN - Ernie | 0.74 | 0.75 | 0.76 | 0.77 |
| GAT - TF-IDF | 0.82 | | | |
| GAT - Sent2Vec | 0.79 | 0.80 | 0.80 | |
| GAT - Ernie | 0.73 | 0.73 | 0.75 | 0.75 |
| GraphSAGE - TF-IDF | 0.79 | 0.81 | 0.82 | |
| GraphSAGE - Sent2Vec | 0.81 | 0.81 | | |
| GraphSAGE - Ernie | 0.70 | 0.70 | 0.71 | 0.72 |
| GIC - TF-IDF | 0.75 | 0.77 | 0.80 | 0.81 |
| GIC - Sent2Vec | 0.78 | 0.79 | 0.81 | 0.81 |
| GIC - Ernie | 0.51 | 0.57 | 0.63 | 0.71 |
Note:
The best values with respect to confidence intervals are highlighted in bold.
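The "GCN - ..." rows above fuse text and structure by running text features through graph convolutions. A hedged numpy sketch of a single GCN layer, H' = ReLU(D̂^(-1/2) (A + I) D̂^(-1/2) H W), where H would hold per-node text features (e.g. TF-IDF vectors); the toy graph, feature sizes, and random weights are illustrative only:

```python
# One GCN layer: symmetric-normalized neighborhood averaging + linear map.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)  # toy adjacency matrix
H = rng.normal(size=(4, 8))                # per-node text features
W = rng.normal(size=(8, 4))                # learnable layer weights

A_hat = A + np.eye(4)                      # add self-loops
d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
H_next = np.maximum(A_norm @ H @ W, 0.0)   # ReLU activation
```

GAT replaces the fixed normalization with learned attention weights, and GraphSAGE samples a fixed number of neighbors instead of using the full adjacency matrix.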
Text embeddings on Cora for link prediction (micro-F1; values lie in (0, 1), higher is better).
| % Train edges | 5% | 10% | 30% | 50% |
|---|---|---|---|---|
| BoW | 0.69 | 0.71 | | |
| TF-IDF | 0.67 | 0.69 | 0.72 | 0.74 |
| LDA | 0.68 | 0.69 | 0.71 | 0.71 |
| SBERT pretrained | 0.69 | 0.71 | 0.74 | |
| Word2Vec pretrained | 0.60 | 0.62 | 0.63 | 0.64 |
| Word2Vec (d = 300) | 0.68 | 0.70 | 0.72 | 0.73 |
| Word2Vec (d = 64) | 0.70 | 0.70 | 0.72 | 0.73 |
| Doc2Vec pretrained | 0.63 | 0.66 | 0.70 | 0.70 |
| Doc2Vec (d = 300) | 0.67 | 0.70 | 0.73 | 0.74 |
| Doc2Vec (d = 64) | 0.66 | 0.68 | 0.69 | 0.69 |
| Sent2Vec pretrained | 0.66 | 0.69 | 0.73 | 0.75 |
| Sent2Vec (d = 600) | | | | |
| Sent2Vec (d = 64) | 0.70 | 0.71 | 0.73 | 0.74 |
| Ernie pretrained | 0.56 | 0.58 | 0.62 | 0.63 |
Note:
The best values with respect to confidence intervals are highlighted in bold.
Text embeddings on Citeseer-M10 for link prediction (micro-F1; values lie in (0, 1), higher is better).
| % Train edges | 5% | 10% | 30% | 50% |
|---|---|---|---|---|
| BoW | 0.52 | 0.52 | 0.52 | 0.52 |
| TF-IDF | 0.52 | 0.52 | 0.53 | 0.53 |
| LDA | 0.69 | 0.69 | 0.70 | 0.71 |
| SBERT pretrained | | | | |
| Word2Vec pretrained | 0.53 | 0.53 | 0.54 | 0.54 |
| Word2Vec (d = 300) | 0.54 | 0.54 | 0.54 | 0.54 |
| Word2Vec (d = 64) | 0.54 | 0.54 | 0.54 | 0.54 |
| Doc2Vec pretrained | 0.55 | 0.55 | 0.55 | 0.55 |
| Doc2Vec (d = 300) | 0.77 | 0.77 | 0.78 | 0.79 |
| Doc2Vec (d = 64) | 0.77 | 0.77 | 0.77 | 0.78 |
| Sent2Vec pretrained | 0.54 | 0.54 | 0.55 | 0.55 |
| Sent2Vec (d = 600) | 0.54 | 0.55 | 0.55 | 0.56 |
| Sent2Vec (d = 64) | 0.53 | 0.53 | 0.54 | 0.54 |
| Ernie pretrained | 0.84 | 0.85 | 0.85 | |
Note:
The best values with respect to confidence intervals are highlighted in bold.
Network embeddings for link prediction on Cora and Citeseer-M10 (micro-F1; values lie in (0, 1), higher is better; columns give the fraction of training edges).
| Dataset | Method | 5% | 10% | 30% | 50% |
|---|---|---|---|---|---|
| Cora | DeepWalk | 0.56 | 0.60 | 0.66 | |
| Cora | Node2Vec | 0.65 | | | |
| Cora | HOPE | 0.50 | 0.50 | 0.51 | 0.52 |
| Citeseer-M10 | DeepWalk | 0.66 | 0.66 | | |
| Citeseer-M10 | Node2Vec | | | | |
| Citeseer-M10 | HOPE | 0.50 | 0.51 | 0.54 | 0.57 |
Note:
The best values with respect to confidence intervals are highlighted in bold.
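A common link-prediction protocol behind tables like these: hold out a fraction of edges, build a feature per node pair as the Hadamard (element-wise) product of the two endpoint embeddings, and train a binary classifier against sampled non-edges. A hedged sketch with random placeholder embeddings and a toy ring graph, one plausible setup rather than the paper's exact one:

```python
# Link prediction via Hadamard edge features + logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 16))               # one vector per node
pos = [(i, (i + 1) % 50) for i in range(50)]  # "observed" edges (toy ring)
neg = [(i, (i + 25) % 50) for i in range(50)] # sampled non-edges

X = np.array([emb[u] * emb[v] for u, v in pos + neg])  # Hadamard features
y = np.array([1] * len(pos) + [0] * len(neg))

# Train on half of the node pairs, score micro-F1 on the rest.
idx = rng.permutation(len(y))
tr, te = idx[:50], idx[50:]
clf = LogisticRegression().fit(X[tr], y[tr])
micro_f1 = f1_score(y[te], clf.predict(X[te]), average="micro")
```

The "% Train edges" columns correspond to how many observed edges remain visible to the embedding method before this classification step.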
Fusion embeddings on Cora for link prediction (micro-F1; values lie in (0, 1), higher is better).
| % Train Edges | 5% | 10% | 30% | 50% |
|---|---|---|---|---|
| TADW - BoW | 0.72 | 0.72 | 0.73 | 0.73 |
| TADW - TF-IDF | 0.73 | 0.74 | 0.74 | 0.75 |
| TADW - Sent2Vec | 0.70 | 0.70 | 0.71 | 0.73 |
| TADW - Word2Vec | 0.64 | 0.68 | 0.71 | 0.72 |
| TADW - Ernie | 0.51 | 0.53 | 0.54 | 0.54 |
| GCN - TF-IDF | | | | |
| GCN - Sent2Vec | 0.69 | 0.71 | 0.73 | 0.75 |
| GCN - SBERT | 0.67 | 0.69 | 0.71 | 0.73 |
| GCN (Custom) | 0.72 | 0.75 | 0.75 | 0.75 |
| GCN - Ernie | 0.62 | 0.63 | 0.63 | 0.68 |
| GAT - TF-IDF | 0.71 | 0.73 | 0.75 | 0.75 |
| GAT - Sent2Vec | 0.61 | 0.61 | 0.65 | 0.68 |
| GAT - SBERT | 0.65 | 0.69 | 0.72 | 0.74 |
| GAT - Ernie | 0.56 | 0.56 | 0.59 | 0.62 |
| GraphSAGE - TF-IDF | 0.75 | | | |
| GraphSAGE - Sent2Vec | 0.66 | 0.70 | 0.74 | 0.75 |
| GraphSAGE - SBERT | 0.58 | 0.62 | 0.69 | 0.64 |
| GraphSAGE - Ernie | 0.50 | 0.50 | 0.53 | 0.56 |
| GIC - TF-IDF | 0.73 | 0.75 | 0.77 | 0.78 |
| GIC - Sent2Vec | 0.74 | 0.75 | 0.77 | 0.78 |
| GIC - SBERT | 0.74 | 0.76 | 0.78 | |
| GIC - Ernie | 0.65 | 0.69 | 0.69 | 0.74 |
Note:
The best values with respect to confidence intervals are highlighted in bold.
Fusion embeddings on Citeseer-M10 for link prediction (micro-F1; values lie in (0, 1), higher is better).
| % Train Edges | 5% | 10% | 30% | 50% |
|---|---|---|---|---|
| TADW - BoW | 0.50 ± 0.01 | 0.51 ± 0.02 | 0.51 ± 0.01 | 0.52 ± 0.01 |
| TADW - TF-IDF | 0.51 ± 0.01 | 0.51 ± 0.01 | 0.51 ± 0.01 | 0.52 ± 0.01 |
| TADW - Sent2Vec | 0.52 ± 0.01 | 0.53 ± 0.00 | 0.53 ± 0.00 | 0.54 ± 0.00 |
| TADW - Word2Vec | 0.52 ± 0.01 | 0.52 ± 0.01 | 0.53 ± 0.00 | 0.53 ± 0.00 |
| TADW - Ernie | 0.78 ± 0.01 | | | |
| GCN - TF-IDF | 0.68 ± 0.01 | 0.69 ± 0.01 | 0.70 ± 0.01 | 0.70 ± 0.01 |
| GCN - Sent2Vec | 0.59 ± 0.01 | 0.62 ± 0.01 | 0.67 ± 0.01 | 0.68 ± 0.01 |
| GCN - SBERT | 0.68 ± 0.01 | 0.70 ± 0.01 | 0.72 ± 0.01 | 0.77 ± 0.01 |
| GCN (Custom) | 0.61 ± 0.01 | 0.67 ± 0.00 | 0.68 ± 0.01 | 0.68 ± 0.01 |
| GCN - Ernie | 0.67 ± 0.01 | 0.67 ± 0.01 | 0.76 ± 0.00 | 0.78 ± 0.01 |
| GAT - TF-IDF | 0.60 ± 0.01 | 0.63 ± 0.01 | 0.65 ± 0.01 | 0.64 ± 0.01 |
| GAT - Sent2Vec | 0.59 ± 0.01 | 0.63 ± 0.01 | 0.64 ± 0.01 | 0.63 ± 0.01 |
| GAT - SBERT | 0.61 ± 0.01 | 0.65 ± 0.01 | 0.71 ± 0.01 | 0.73 ± 0.01 |
| GAT - Ernie | 0.61 ± 0.00 | 0.64 ± 0.01 | 0.69 ± 0.01 | 0.70 ± 0.01 |
| GraphSAGE - TF-IDF | 0.66 ± 0.01 | 0.67 ± 0.01 | 0.73 ± 0.01 | 0.78 ± 0.01 |
| GraphSAGE - Sent2Vec | 0.64 ± 0.01 | 0.66 ± 0.01 | 0.73 ± 0.01 | 0.78 ± 0.01 |
| GraphSAGE - SBERT | 0.61 ± 0.01 | 0.63 ± 0.01 | 0.71 ± 0.01 | |
| GraphSAGE - Ernie | 0.63 ± 0.02 | 0.72 ± 0.01 | 0.72 ± 0.01 | 0.80 ± 0.01 |
| GIC - TF-IDF | 0.62 ± 0.01 | 0.66 ± 0.01 | 0.74 ± 0.01 | 0.80 ± 0.01 |
| GIC - Sent2Vec | 0.62 ± 0.01 | 0.66 ± 0.01 | 0.75 ± 0.01 | 0.81 ± 0.01 |
| GIC - SBERT | 0.63 ± 0.01 | 0.66 ± 0.01 | 0.75 ± 0.01 | 0.78 ± 0.01 |
| GIC - Ernie | 0.63 ± 0.01 | 0.66 ± 0.00 | 0.73 ± 0.01 | 0.81 ± 0.00 |
Note:
The best values with respect to confidence intervals are highlighted in bold.
Figure 1: Embeddings visualization on Cora.
(A) TF-IDF (text embedding). (B) Sent2Vec (text embedding). (C) DeepWalk (network embedding). (D) TADW (fusion). (E) TriDNR (fusion). (F) GCN (fusion).