Adam Tsakalidis, Pierpaolo Basile, Marya Bazzi, Mihai Cucuringu, Barbara McGillivray.
Abstract
Lexical semantic change detection (detecting shifts in the meaning and usage of words) is an important task for social and cultural studies as well as for Natural Language Processing applications. Diachronic word embeddings (time-sensitive vector representations of words that preserve their meaning) have become the standard resource for this task. However, given the significant computational resources needed to generate them, very few resources make diachronic word embeddings available to the scientific community. In this paper we present DUKweb, a set of large-scale resources designed for the diachronic analysis of contemporary English. DUKweb was created from the JISC UK Web Domain Dataset (1996-2013), a very large archive that collects resources from the Internet Archive that were hosted on domains ending in '.uk'. DUKweb consists of a series of word co-occurrence matrices and two types of word embeddings for each year in the JISC UK Web Domain Dataset. We show the reuse potential of DUKweb and its quality standards via a case study on word meaning change detection.
Year: 2021 PMID: 34654827 PMCID: PMC8520005 DOI: 10.1038/s41597-021-01047-x
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1. Flowchart of the creation of DUKweb.
Fig. 2. Example of co-occurrence matrix for the word linux (year 2011) in DUKweb.
Statistics about the co-occurrence matrices in DUKweb.
| Year | Vocabulary Size | #co-occurrences | File Size |
|---|---|---|---|
| 1996 | 454,751 | 1,201,630,516 | 645.6MB |
| 1997 | 711,007 | 17,244,958,174 | 2.7GB |
| 1998 | 704,453 | 10,963,699,018 | 2.4GB |
| 1999 | 769,824 | 32,760,590,881 | 3.6GB |
| 2000 | 847,318 | 107,529,345,578 | 5.8GB |
| 2001 | 911,499 | 197,833,301,500 | 9.2GB |
| 2002 | 945,565 | 274,741,483,798 | 11GB |
| 2003 | 992,192 | 539,189,466,798 | 14GB |
| 2004 | 1,040,470 | 975,622,607,090 | 18.2GB |
| 2005 | 1,060,117 | 793,029,668,228 | 16.9GB |
| 2006 | 1,076,523 | 721,537,927,839 | 16.7GB |
| 2007 | 1,093,980 | 834,261,488,677 | 18.1GB |
| 2008 | 1,105,511 | 1,067,076,347,615 | 19.6GB |
| 2009 | 1,105,901 | 481,567,239,481 | 14.15GB |
| 2010 | 1,125,201 | 778,111,567,761 | 16.7GB |
| 2011 | 1,145,990 | 1,092,441,542,978 | 18.9GB |
| 2012 | 1,144,764 | 1,741,038,554,999 | 20.6GB |
| 2013 | 1,044,436 | 393,672,000,378 | 8.9GB |
The first column shows the year, the second column contains the size of the vocabulary for that year in terms of number of word types, the third column contains the total number of co-occurrences of vocabulary terms for that year, and the last column shows the size (compressed) of the co-occurrence matrix file.
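To make the co-occurrence counts concrete, the following is a minimal sketch of how such counts can be tallied from a token sequence. The function name `cooccurrence_counts`, the window size of 2, and the symmetric counting scheme are illustrative assumptions; the settings actually used to build DUKweb are not stated in this record.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count symmetric co-occurrences of word pairs within `window` tokens.

    Hypothetical helper for illustration; not the DUKweb pipeline.
    """
    counts = Counter()
    for i, target in enumerate(tokens):
        # Look only to the right, then record the pair in both directions.
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            context = tokens[j]
            counts[(target, context)] += 1
            counts[(context, target)] += 1
    return counts


# Toy sentence, not taken from the dataset.
tokens = "linux is a free kernel and linux is open".split()
counts = cooccurrence_counts(tokens, window=2)
total_cooccurrences = sum(counts.values())
```

Under these assumptions, a row of the matrix for a word such as linux would hold the counts `counts[("linux", context)]` for every context word, and the "#co-occurrences" column would correspond to a total like `total_cooccurrences`.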
Statistics about the vocabulary in terms of overall number of words and (compressed) file size, per year and per method (TRI, SGNS).
| Year | TRI Vocabulary Size | TRI File Size | SGNS Vocabulary Size | SGNS File Size |
|---|---|---|---|---|
| 1996 | 454,751 | 284.9MB | — | — |
| 1997 | 711,007 | 904.8MB | — | — |
| 1998 | 704,453 | 823.3MB | — | — |
| 1999 | 769,824 | 1.1GB | — | — |
| 2000 | 847,318 | 1.5GB | 235,428 | 114.7MB |
| 2001 | 911,499 | 1.9GB | 407,074 | 198.2MB |
| 2002 | 945,565 | 2.1GB | 571,419 | 277.8MB |
| 2003 | 992,192 | 2.4GB | 884,393 | 430.4MB |
| 2004 | 1,040,470 | 2.7GB | 1,270,804 | 619.1MB |
| 2005 | 1,060,117 | 2.7GB | 1,202,899 | 585.4MB |
| 2006 | 1,076,523 | 2.7GB | 1,007,582 | 490.6MB |
| 2007 | 1,093,980 | 2.8GB | 1,124,179 | 548.2MB |
| 2008 | 1,105,511 | 2.9GB | 1,173,870 | 572.2MB |
| 2009 | 1,105,901 | 2.6GB | 671,940 | 327.6MB |
| 2010 | 1,125,201 | 2.8GB | 1,183,907 | 576.8MB |
| 2011 | 1,145,990 | 2.9GB | 1,309,804 | 637.7MB |
| 2012 | 1,144,764 | 3.0GB | 1,607,272 | 784.0MB |
| 2013 | 1,044,436 | 1.9GB | 587,035 | 285.6MB |
Fig. 3. Number of words included in the TRI and SGNS representations contained in DUKweb, along with the size of their intersected vocabulary, per year.
Fig. 4. Time series based on TRI (below) and SGNS (above) of four words whose semantics have changed between 2001–2013 according to the Oxford English Dictionary (i.e., they have acquired a new meaning during this time period).
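One common way to obtain a per-word time series from diachronic embeddings is to compare the word's vector in consecutive years with cosine similarity. The sketch below illustrates that idea only; it assumes the vectors for different years live in a shared space, and the helper names and toy vectors are hypothetical. How the series in the figure were actually constructed is not stated in this record.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_series(vectors_by_year):
    """Cosine similarity between a word's vectors in consecutive years."""
    years = sorted(vectors_by_year)
    return {
        (prev, curr): cosine_similarity(vectors_by_year[prev], vectors_by_year[curr])
        for prev, curr in zip(years, years[1:])
    }


# Toy 2-dimensional vectors for a single word (assumed, not real DUKweb data).
toy = {2001: [1.0, 0.0], 2002: [1.0, 0.0], 2003: [0.0, 1.0]}
series = similarity_series(toy)
```

In this toy case the similarity stays at 1 between 2001 and 2002 and drops to 0 between 2002 and 2003, which is the kind of dip one would look for when a word acquires a new meaning.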
Fig. 5. Results of the SGNS and TRI embeddings and the two baseline models on the Word Analogy task for the four categories: Family, Grammar, Geography and Currency.
Fig. 6. Results of the SGNS and TRI embeddings and the two baseline models on the Word Similarity and Word Relatedness tasks.
Average rank of a semantically shifted word; lower scores indicate a better model.
| Model | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SGNS | 36.33 | 34.03 | 30.90 | 26.85 | 29.29 | 27.16 | 26.88 | 28.60 | 27.21 | 25.69 | 29.13 | ||
| SGNS | 36.83 | 32.30 | 31.67 | 27.23 | 27.27 | 28.15 | 26.54 | 27.59 | 30.42 | 28.78 | |||
| SGNS | 27.99 | 30.38 | 27.17 | 28.96 | 27.84 | ||||||||
| TRI | 54.65 | 51.22 | 56.64 | 55.63 | 50.90 | 55.98 | 60.96 | 58.03 | 58.94 | 56.59 | 59.00 | 45.96 | 45.72 |
| TRI | 56.77 | 59.79 | 54.81 | 54.61 | 53.22 | 54.39 | 55.44 | 55.12 | 59.76 | 59.22 | 53.22 | 46.30 | 50.07 |
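As a rough illustration of the metric, the sketch below computes the average rank of known shifted words in a list ordered by predicted change. It assumes that higher scores mean more predicted change and that ranks are expressed as a percentage of the list length; both are assumptions made for this example, and the function name and toy words are hypothetical.

```python
def average_rank_percent(scores, shifted_words):
    """Mean rank of the shifted words, as a percentage of the list length.

    `scores` maps each word to a predicted change score (higher = more change).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    position = {word: i + 1 for i, word in enumerate(ranked)}
    ranks = [position[w] for w in shifted_words]
    return 100.0 * sum(ranks) / (len(ranks) * len(ranked))


# Toy scores (assumed): the two shifted words are ranked 1st and 2nd of 4.
toy_scores = {"tweet": 0.9, "cloud": 0.8, "table": 0.3, "river": 0.1}
avg_rank = average_rank_percent(toy_scores, ["tweet", "cloud"])
```

A model that places the shifted words near the top of its ranking yields a low value, which matches the "lower is better" reading of the table.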
Recall at 10% of our different SGNS and TRI models.
| Model | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SGNS | 23.08 | 21.54 | 26.15 | 30.77 | 32.31 | 29.23 | 32.31 | 32.31 | |||||
| SGNS | 15.38 | 23.08 | 23.08 | 30.77 | 32.31 | 33.85 | 33.85 | 26.15 | 32.31 | ||||
| SGNS | 15.38 | 18.46 | 23.08 | 27.69 | 30.77 | 23.08 | 18.46 | ||||||
| TRI | 7.69 | 12.31 | 6.15 | 3.08 | 12.31 | 7.69 | 7.69 | 7.69 | 6.15 | 7.69 | 4.62 | 13.85 | 12.31 |
| TRI | 12.31 | 4.62 | 7.69 | 10.77 | 10.77 | 4.62 | 13.85 | 7.69 | 3.08 | 4.62 | 12.31 | 21.54 | 13.85 |
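Recall at 10% can be read as the share of known shifted words that appear in the top 10% of a model's ranking. The sketch below follows that reading; the score direction (higher means more change), the rounding of the cutoff, and all names are assumptions made for illustration.

```python
import math


def recall_at_percent(scores, shifted_words, percent=10):
    """Percentage of shifted words found in the top `percent`% of the ranking."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Round the cutoff up and keep at least one word (an assumed convention).
    cutoff = max(1, math.ceil(len(ranked) * percent / 100))
    top = set(ranked[:cutoff])
    hits = sum(1 for w in shifted_words if w in top)
    return 100.0 * hits / len(shifted_words)


# Toy ranking of ten words with strictly decreasing scores (assumed data).
toy_ranking = {f"w{i}": 1.0 - i / 10 for i in range(10)}
recall = recall_at_percent(toy_ranking, ["w0", "w5"])
```

Here the top 10% contains only `w0`, so one of the two shifted words is recovered and the recall is 50%.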
Examples of easy-to-predict (top-5) and hard-to-predict (bottom-5) words by our SGNS and TRI models.
| SGNS | SGNS | SGNS | TRI | TRI |
|---|---|---|---|---|
| cloud | sars | eris | tweet | root |
| sars | fap | ds | qe | purple |
| tweet | trending | follow | parmesan | blackberry |
| trending | eris | blw | event | tweet |
| fap | tweet | fap | sup | follow |
| tweeter | preloading | unlike | status | eta |
| like | chugging | chugging | grime | prep |
| preloading | bloatware | roasting | prep | grime |
| bloatware | tweeter | even | trending | status |
| parmesan | parmesan | parmesan | tomahawk | tomahawk |
| Measurement(s) | Natural Language • lexical semantic change |
| Technology Type(s) | Programming Language • word embeddings • Cosine Distance Method |
| Factor Type(s) | time period |