Emeric Dynomant, Romain Lelong, Badisse Dahamna, Clément Massonnaud, Gaétan Kerdelhué, Julien Grosjean, Stéphane Canu, Stefan J Darmoni.
Abstract
BACKGROUND: Word embedding technologies, a set of language modeling and feature learning techniques in natural language processing (NLP), are now used in a wide range of applications. However, no formal evaluation or comparison has been made of the ability of the 3 best-known unsupervised implementations (Word2Vec, GloVe, and FastText) to capture the semantic similarities between words when trained on the same dataset.
Keywords: data curation; data mining; natural language processing
Year: 2019 PMID: 31359873 PMCID: PMC6690161 DOI: 10.2196/12310
Source DB: PubMed Journal: JMIR Med Inform
Hyperparameter values used to train the 5 word embedding models.
| Parameter and applied to model | Value |
| Word2Vec/FastText | 25 |
| GloVe | 100 |
| All 3 models | 20 |
| All 3 models | 7 |
| All 3 models | 2.5e-2 |
| All 3 models | 80 |
| All 3 models | 0.05 |
| Word2Vec/FastText | 12 |
| GloVe | 1e-6 |
The 10 most common words in our corpus. Note that Rouen is the city where the training data come from.
| French | English | Occurrences |
| de | of | 9,501,137 |
| docteur | doctor | 4,822,797 |
| le | the | 3,975,735 |
| téléphone | phone | 3,147,286 |
| d’ | ’s | 3,036,198 |
| Rouen | Rouen | 2,763,918 |
| à | at | 2,271,317 |
| l’ | the | 2,129,090 |
| et | and | 2,091,502 |
| dans | in | 2,001,135 |
Figure 1. Two-dimensional t-SNE projection of 10,000 documents randomly selected among the main classes in the HDW. The five colors correspond to the five types of documents selected (discharge summaries [green], surgery [blue] or procedure [purple] reports, drug prescriptions [yellow], letters from a general practitioner [red]).
Training time of each algorithm (min).
| Algorithm | Training time (min) |
| FastText SG | 1678.1 |
| FastText CBOW | 1577.0 |
| Word2Vec SG | 182.0 |
| Word2Vec CBOW | 33.4 |
| GloVe | 17.5 |
Percentage of pairs validated by the 5 trained models on 2 UMNSRS evaluation sets.
| Algorithm | UMNSRS-Sim | UMNSRS-Rel |
| FastText SG | 3.89 | 5.04 |
| FastText CBOW | 3.89 | 3.79 |
| Word2Vec CBOW | 3.57 | 4.10 |
| Word2Vec SG | 2.92 | 4.10 |
| GloVe | 1.29 | 0.94 |
Percentage of odd one tasks successfully performed by each of the 5 trained models.
| Algorithm | Odd one |
| Word2Vec SG | 65.4 |
| Word2Vec CBOW | 63.5 |
| FastText SG | 44.4 |
| FastText CBOW | 40.7 |
| GloVe | 18.5 |
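As an illustration of what an odd one task asks of a model, here is a minimal sketch, assuming the odd word is taken to be the one whose vector has the lowest cosine similarity to the mean vector of the group. This selection rule, the function names, and the toy 3-dimensional vectors are assumptions made for the example; they are not the evaluation code or the embeddings used in the study.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def odd_one_out(words, vectors):
    """Return the word least similar to the mean vector of the group.

    Assumed rule for this sketch: the odd word is the one with the
    lowest cosine similarity to the component-wise mean of all vectors.
    """
    dim = len(vectors[words[0]])
    mean = [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]
    return min(words, key=lambda w: cosine(vectors[w], mean))

# Hypothetical toy embeddings: three time-related words and one intruder.
toy_vectors = {
    "semaine": [0.90, 0.10, 0.00],
    "jour":    [0.80, 0.20, 0.10],
    "année":   [0.85, 0.15, 0.05],
    "docteur": [0.10, 0.90, 0.30],
}

result = odd_one_out(["semaine", "jour", "année", "docteur"], toy_vectors)
print(result)
```

Under this rule, a task would count as successfully performed when the returned word matches the intruder chosen by the task designer.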
Figure 2. Overall view of the rating agreement between the 2 evaluators (CM and SJD). Scores assigned to a model output range from 0 (poor match) to 2 (good match). Colors range from light green (high agreement) to red (low agreement).
Comparison between the cosine distance computed by each model and the human evaluation (scores ranging from 0 to 2). Scores and distances are averaged over the top 5 closest vectors for 112 queries on each model, for each of the 2 evaluators (evaluator 1, SJD; evaluator 2, CM).
| Model | Cosine | Evaluator 1 | Evaluator 2 |
| Word2Vec SG | 0.776 | 1.469 | 1.200 |
| Word2Vec CBOW | 0.731 | 1.355 | 1.148 |
| FastText SG | 0.728 | 1.200 | 1.111 |
| FastText CBOW | 0.748 | 1.214 | 1.048 |
| GloVe | 0.884 | 0.925 | 0.480 |
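The averaging behind a table of this kind can be sketched as follows: for a query word, retrieve the k closest vectors, average their cosine values, and average the evaluator scores given to the same words. This is a minimal sketch under stated assumptions: the values are treated as cosine similarities, and the toy vectors, the `toy_ratings` scores, and the function names are hypothetical rather than taken from the study.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_neighbors(query, vectors, k=5):
    """Return the k (word, cosine) pairs closest to the query word."""
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)

# Hypothetical toy embeddings, not vectors from the trained models.
toy_vectors = {
    "semaine": [0.90, 0.10, 0.00],
    "jour":    [0.80, 0.20, 0.10],
    "année":   [0.85, 0.15, 0.05],
    "docteur": [0.10, 0.90, 0.30],
}

# Hypothetical evaluator scores (0 = poor match, 2 = good match)
# for candidate neighbors of the query "jour".
toy_ratings = {"année": 2, "semaine": 2, "docteur": 0}

neighbors = top_k_neighbors("jour", toy_vectors, k=2)
avg_cosine = mean(score for _, score in neighbors)
avg_rating = mean(toy_ratings[word] for word, _ in neighbors)
print(neighbors, avg_cosine, avg_rating)
```

Repeating this over every query and averaging again would give one cosine value and one score per evaluator for each model, which is the shape of the table above.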
Figure 3. Comparison of the computed cosine distance with the score given by the two human evaluators. In both cases, the lower the score, the lower the average distance (evaluator 1, SJD; evaluator 2, CM).
Figure 4. Small cluster of words found with both Word2Vec SG and CBOW (the latter is shown). Année(s) and an(s) mean year(s), semaine(s) means week(s), and jour(s) means day(s). The meta-token "number", used to replace numbers, is visible in the expression numberj.
Figure 5. Cluster of size-related words found by reducing the number of dimensions of the word vectors produced by the GloVe algorithm.
Figure 6. Pooled scores on each task for each of the five trained models. A log scale has been used to facilitate visualization. The cosine score is duplicated for each UMNSRS set used.