Peter D Turney, Saif M Mohammad.
Abstract
We introduce a dataset for studying the evolution of words, constructed from WordNet and the Google Books Ngram Corpus. The dataset tracks the evolution of 4,000 synonym sets (synsets), containing 9,000 English words, from 1800 AD to 2000 AD. We present a supervised learning algorithm that is able to predict the future leader of a synset: the word in the synset that will have the highest frequency. The algorithm uses features based on a word's length, the characters in the word, and the historical frequencies of the word. It can predict change of leadership (including the identity of the new leader) fifty years in the future, with an F-score considerably above random guessing. Analysis of the learned models provides insight into the causes of change in the leader of a synset. The algorithm confirms observations linguists have made, such as the trend to replace the -ise suffix with -ize, the rivalry between the -ity and -ness suffixes, and the struggle between economy (shorter words are easier to remember and to write) and clarity (longer words are more distinctive and less likely to be confused with one another). The results indicate that integration of the Google Books Ngram Corpus with WordNet has significant potential for improving our understanding of how language evolves.
Year: 2019 PMID: 30689665 PMCID: PMC6349325 DOI: 10.1371/journal.pone.0211512
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. The normalized frequencies of the rapturous–ecstatic synset from 1800 AD to 2000 AD.
The sum of the five frequencies for any given year is 1.0. The data has not been smoothed, in order to show the level of noise in the trends. This synset is typical with respect to the shapes of the curves and the level of noise in the trends, but it is atypical in that it contains more words than most of the synsets. Most of the synsets contain two to three words.
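The normalization described in the caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `normalize_frequencies` and the raw counts are hypothetical, and the only property taken from the source is that the shares of the words in a synset sum to 1.0 for any given year.

```python
def normalize_frequencies(counts):
    """Scale the raw counts for one year so the synset's shares sum to 1.0."""
    total = sum(counts.values())
    if total == 0:
        # No occurrences at all in this year: leave every share at zero.
        return {word: 0.0 for word in counts}
    return {word: count / total for word, count in counts.items()}


# Hypothetical raw counts for a single year (not taken from the corpus).
raw_counts = {
    "ecstatic": 120,
    "enraptured": 80,
    "rapt": 300,
    "rapturous": 450,
    "rhapsodic": 50,
}
shares = normalize_frequencies(raw_counts)
```

Applying the function year by year yields one curve per word, which is the kind of series plotted in the figure.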
Time periods for the training and testing sets, given a fifty-year cycle of eleven-year samples.
The average synset contains 2.23 to 2.25 words.
| Period | Train1 | Test1 | Train2 | Test2 |
|---|---|---|---|---|
| 1800 ± 5 | past | | | |
| 1850 ± 5 | present | past | past | |
| 1900 ± 5 | future | present | present | past |
| 1950 ± 5 | | future | future | present |
| 2000 ± 5 | | | | future |
| Synsets | 2,528 | 3,484 | 3,484 | 4,092 |
| Words | 5,640 | 7,795 | 7,795 | 9,198 |
| Words per synset | 2.23 | 2.24 | 2.24 | 2.25 |
| Change | 17.3% | 19.0% | 19.0% | 13.3% |
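The layout of the training and testing periods can be sketched in a few lines. The sketch assumes (this is an inference from the tables, not something the source states) that sample years are counted back from 2000 in steps of one cycle, that each dataset is a (past, present, future) triple of consecutive samples, and that each training triple sits one cycle earlier than its test triple. The function names are hypothetical.

```python
def sample_periods(cycle, first_year=1800, last_year=2000):
    """Centre years of the samples, counted back from last_year."""
    years = []
    year = last_year
    while year >= first_year:
        years.append(year)
        year -= cycle
    return sorted(years)


def train_test_splits(cycle):
    """Pair each (past, present, future) triple with the one a cycle later."""
    years = sample_periods(cycle)
    triples = [tuple(years[i:i + 3]) for i in range(len(years) - 2)]
    # The training triple ends one cycle before the test triple, so the
    # "future" of the test set is never seen during training.
    return list(zip(triples[:-1], triples[1:]))


splits = train_test_splits(50)
```

Under these assumptions a fifty-year cycle gives two train/test pairs, and the same function gives the test periods listed for the thirty-, forty-, and sixty-year cycles.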
A sample of the Test1 dataset entries for the rapturous–ecstatic synset.
The highest frequencies for each time period are marked in bold, indicating the winners.
| Test1 dataset | Past frequency | Present frequency | Future frequency |
|---|---|---|---|
| ecstatic#a#1 | 5,576 | ||
| enraptured#a#1 | 4,334 | 7,148 | 5,263 |
| rapt#a#1 | 5,243 | 18,750 | 14,845 |
| rapturous#a#1 | 15,320 | 9,544 | |
| rhapsodic#a#1 | 45 | 696 | 3,595 |
The frequency of synset leadership changes over 200 years, given a fifty-year cycle of eleven-year samples.
Change of leadership is common.
| Changes in leadership | Number of synsets | Percent of synsets |
|---|---|---|
| ≥ 1 change | 1,817 | 42.14% |
| ≥ 2 changes | 518 | 12.01% |
| ≥ 3 changes | 65 | 1.51% |
| = 4 changes | 2 | 0.05% |
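Counting leadership changes for one synset can be sketched as below. The frequencies and word names are hypothetical, and breaking ties in favour of the first word listed is an assumption of this sketch, not a rule taken from the source.

```python
def leader(frequencies):
    """Return the word with the highest frequency in one period."""
    # On a tie, max() keeps the first word encountered (an assumption here).
    return max(frequencies, key=frequencies.get)


def count_leadership_changes(periods):
    """Count how often the leader differs between consecutive periods."""
    leaders = [leader(freqs) for freqs in periods]
    return sum(1 for prev, curr in zip(leaders, leaders[1:]) if prev != curr)


# Hypothetical frequencies for a two-word synset over five sample periods.
periods = [
    {"alpha": 90, "beta": 10},
    {"alpha": 60, "beta": 40},
    {"alpha": 45, "beta": 55},
    {"alpha": 30, "beta": 70},
    {"alpha": 52, "beta": 48},
]
changes = count_leadership_changes(periods)
```

With five sample periods there are four transitions, which is why four is the largest number of changes in the table.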
A sample of the Test1 vector elements for two of the five words in the rapturous–ecstatic synset.
| Feature | rapturous#a#1 | ecstatic#a#1 |
|---|---|---|
| Normalized length | 0.900 | 0.800 |
| Syllable count | 3 | 3 |
| Unique ngrams | uro, rou, ous, us\| | \|ec, ecs, cst, sta, tat, ati, tic |
| Shared ngrams | 0.556 | 0.125 |
| Categorial variations | 3 | 2 |
| Relative growth | −0.122 | 0.107 |
| Linear extrapolation | 0.119 | 0.449 |
| Present age | 258 | 213 |
| Target class | 0 | 1 |
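Three of the features in the table can be reproduced with a short sketch. It assumes that the ngrams are character trigrams over the word padded with a `|` boundary marker, that "shared ngrams" is the fraction of a word's trigrams that also occur in another word of the synset, and that "normalized length" is the word's length divided by the longest length in the synset. These definitions are inferred from the table values rather than stated in the text, and the function names are hypothetical.

```python
def trigrams(word, boundary="|"):
    """Character trigrams of the word, padded with a boundary marker."""
    padded = f"{boundary}{word}{boundary}"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def word_features(word, synset):
    """Length and ngram features of one word relative to its synset."""
    # Collect every trigram that occurs in some other word of the synset.
    others = set()
    for other in synset:
        if other != word:
            others.update(trigrams(other))
    grams = trigrams(word)
    unique = [g for g in grams if g not in others]
    return {
        "normalized_length": len(word) / max(len(w) for w in synset),
        "unique_ngrams": unique,
        "shared_ngrams": (len(grams) - len(unique)) / len(grams),
    }


synset = ["ecstatic", "enraptured", "rapt", "rapturous", "rhapsodic"]
rapturous = word_features("rapturous", synset)
ecstatic = word_features("ecstatic", synset)
```

For this synset the sketch gives the same normalized lengths, unique ngrams, and shared-ngram fractions as the table; the remaining features depend on frequency history and WordNet data that are not reproduced here.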
The 2 × 2 contingency table for change in the leadership of a synset.
| | Condition positive | Condition negative |
|---|---|---|
| Predicted positive | True positive | False positive |
| Predicted negative | False negative | True negative |
Various statistics for NBCP and random systems.
All numbers are percentages, except for number of synsets.
| Statistic | Test1 | Test2 |
|---|---|---|
| Number of synsets | 3,484 | 4,092 |
| Percent changed | 19.0 | 13.3 |
| Percent stable | 81.0 | 86.7 |
| Precision for random | 16.9 | 10.9 |
| Recall for random | 46.1 | 42.4 |
| F-score for random | 24.8 | 17.3 |
| Precision for NBCP | 51.0 | 47.3 |
| Recall for NBCP | 31.0 | 40.0 |
| F-score for NBCP | 38.5 | 43.3 |
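The relationship between the precision, recall, and F-score rows can be sketched as follows, assuming the F-score is the balanced harmonic mean of precision and recall (the text here does not say which variant is used). The counts in the example are hypothetical. Because the table reports rounded percentages, recomputing the F-score from them agrees with the published values only to within about 0.1.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from the cells of a 2 x 2 contingency table."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 30 true positives, 20 false positives, 70 false negatives.
precision, recall = precision_recall(tp=30, fp=20, fn=70)
example_f = f_score(precision, recall)

# Recomputed from the rounded percentages reported for NBCP.
nbcp_test1 = f_score(51.0, 31.0)
nbcp_test2 = f_score(47.3, 40.0)
```

True negatives do not enter any of the three measures, so a system cannot raise its score simply by predicting that most synsets stay stable.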
The drop in F-score when a feature is removed from the NBCP system.
Numbers that are statistically significant with 95% confidence are marked in bold. Negative numbers indicate that a feature is making a useful contribution to the system.
| Feature | Test1 | Test2 |
|---|---|---|
| Normalized length | 0.00 | −0.61 |
| Syllable count | 0.12 | 0.03 |
| Unique ngrams | 0.71 | |
| Shared ngrams | 0.00 | 0.00 |
| Categorial variations | −0.07 | −0.58 |
| Relative growth | −1.43 | −0.76 |
| Linear extrapolation | ||
| Present age | −0.19 | −0.29 |
The F-score of each feature alone minus the F-score of random guessing.
Numbers that are statistically significant with 95% confidence are marked in bold. Positive numbers indicate that a feature is better than random guessing.
| Feature | Test1 | Test2 |
|---|---|---|
| Normalized length | ||
| Syllable count | 0.86 | |
| Unique ngrams | ||
| Shared ngrams | −0.08 | |
| Categorial variations | −0.02 | 1.22 |
| Relative growth | ||
| Linear extrapolation | ||
| Present age |
The effect that varying cycle lengths has on the F-score of NBCP and random guessing.
| Cycle | Test1 | Test2 | Test3 | Test4 |
|---|---|---|---|---|
| 30 years | 1910 ± 5 | 1940 ± 5 | 1970 ± 5 | 2000 ± 5 |
| F-score for NBCP | 34.4 | 40.6 | 38.4 | 38.8 |
| F-score for random | 21.0 | 18.0 | 15.5 | 12.6 |
| 40 years | 1920 ± 5 | 1960 ± 5 | 2000 ± 5 | |
| F-score for NBCP | 34.7 | 38.3 | 42.5 | |
| F-score for random | 22.7 | 19.0 | 16.1 | |
| 50 years | 1950 ± 5 | 2000 ± 5 | | |
| F-score for NBCP | 38.5 | 43.3 | | |
| F-score for random | 24.8 | 17.3 | | |
| 60 years | 2000 ± 5 | | | |
| F-score for NBCP | 39.5 | | | |
| F-score for random | 21.2 | | | |
The effect that varying cycle lengths has on the percentage of synsets that have changed leadership from present to future.
| Cycle | Test1 | Test2 | Test3 | Test4 |
|---|---|---|---|---|
| 30 years | 1910 ± 5 | 1940 ± 5 | 1970 ± 5 | 2000 ± 5 |
| Percent changed | 14.7 | 13.7 | 11.0 | 8.4 |
| Number of synsets | 3,041 | 3,622 | 3,958 | 4,275 |
| 40 years | 1920 ± 5 | 1960 ± 5 | 2000 ± 5 | |
| Percent changed | 17.5 | 14.5 | 11.0 | |
| Number of synsets | 3,038 | 3,732 | 4,203 | |
| 50 years | 1950 ± 5 | 2000 ± 5 | | |
| Percent changed | 19.0 | 13.3 | | |
| Number of synsets | 3,484 | 4,092 | | |
| 60 years | 2000 ± 5 | | | |
| Percent changed | 15.4 | | | |
| Number of synsets | 3,958 | | | |
Analysis of the naive Bayes models for Test1.
The difference column is the mean of the Gaussian of the winners (class 1) minus the mean of the Gaussian of the losers (class 0). Differences that are statistically significant are marked in bold. Significance is measured by a two-tailed unpaired t test with a 95% confidence level. This table omits unique ngrams, which are presented in the next table.
| Feature | Mean of Gaussian (losers) | Mean of Gaussian (winners) | Difference | Mean of the winners is … |
|---|---|---|---|---|
| Normalized length | 0.9128 | 0.9087 | −0.0041 | lower |
| Syllable count | 3.2494 | 3.2077 | −0.0417 | lower |
| Shared ngrams | 0.4797 | 0.4726 | −0.0071 | lower |
| Categorial variations | 3.3075 | 3.4432 | 0.1357 | higher |
| Relative growth | −0.0012 | 0.0956 | **0.0968** | higher |
| Linear extrapolation | 0.2058 | 0.8408 | **0.6350** | higher |
| Present age | 130.0 | 180.5 | **50.5** | higher |
Analysis of the unique ngrams features in the naive Bayes models for Test1.
The table lists the top dozen trigrams with the greatest separation between the means. Differences that are statistically significant are marked in bold; all of the differences are significant. Significance is measured by a two-tailed unpaired t test with a 95% confidence level.
| Trigram | Mean of Gaussian (losers) | Mean of Gaussian (winners) | Difference | Presence of the trigram suggests … |
|---|---|---|---|---|
| ize | 0.0055 | 0.0285 | **0.0230** | winner |
| ise | 0.0289 | 0.0083 | **−0.0206** | loser |
| nes | 0.0328 | 0.0134 | **−0.0194** | loser |
| ty\| | 0.0112 | 0.0297 | **0.0185** | winner |
| ss\| | 0.0379 | 0.0202 | **−0.0177** | loser |
| ity | 0.0100 | 0.0269 | **0.0169** | winner |
| ze\| | 0.0022 | 0.0174 | **0.0152** | winner |
| ess | 0.0373 | 0.0229 | **−0.0144** | loser |
| se\| | 0.0206 | 0.0083 | **−0.0123** | loser |
| lis | 0.0154 | 0.0032 | **−0.0122** | loser |
| ic\| | 0.0228 | 0.0348 | **0.0120** | winner |
| liz | 0.0022 | 0.0115 | **0.0093** | winner |