Shu-Yen Lin, Hsueh-Chih Chen, Tao-Hsing Chang, Wei-En Lee, Yao-Ting Sung.
Abstract
The application of word associations has become increasingly widespread. However, the association norms produced by traditional free association tests tend not to exceed 10,000 stimulus words, making the number of associated words too small to be representative of the overall language. In this study we used text corpora totaling over 400 million Chinese words, along with a multitude of association measures, to automatically construct a Chinese Lexical Association Database (CLAD) comprising the lexical association of over 80,000 words. Comparison of the CLAD with a database of traditional Chinese word association norms shows that word associations extracted from large text corpora are similar in strength to those elicited from free association tests but contain a much greater number of associative word pairs. Additionally, the relatively small numbers of participants involved in the creation of traditional norms result in relatively coarse scales of association measurement, whereas the differentiation of association strengths is greatly enhanced in the CLAD. The CLAD provides researchers with a great supplement to traditional word association norms. A query website at www.chinesereadability.net/LexicalAssociation/CLAD/ affords access to the database.
Keywords: Association measures; Chinese text corpora; Corpus-based; Corpus-derived; Lexical association; Word association; Word co-occurrence
Year: 2019 PMID: 31429062 PMCID: PMC6797702 DOI: 10.3758/s13428-019-01208-2
Source DB: PubMed Journal: Behav Res Methods ISSN: 1554-351X
Contingency table of observed frequencies of (co-)occurrences of a word pair
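| | $y$ | $\bar{y}$ | Total |
|---|---|---|---|
| $x$ | $f(xy)$ | $f(x\bar{y})$ | $f(x*)$ |
| $\bar{x}$ | $f(\bar{x}y)$ | $f(\bar{x}\bar{y})$ | $f(\bar{x}*)$ |
| Total | $f(*y)$ | $f(*\bar{y})$ | $N$ |

$f(xy)$ = number of times word x and word y co-occur. $f(x\bar{y})$ = number of times that word x occurs and word y does not ($\bar{y}$ = any word except y). $f(x*)$ = sum of $f(xy)$ and $f(x\bar{y})$, that is, the occurrence frequency of x (* = any word). $N$ = size of the corpus.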
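The definitions above can be turned into a small helper. The following is a minimal Python sketch, not the authors' implementation: the class name `Contingency` and the function `contingency_from_counts` are hypothetical, and it assumes that the co-occurrence count, the two word frequencies, and N are all counted in the same unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contingency:
    """Observed 2x2 frequencies for a word pair (x, y). Names are illustrative."""
    f_xy: int        # x and y co-occur
    f_x_noty: int    # x occurs, y does not
    f_notx_y: int    # y occurs, x does not
    f_notx_noty: int # neither x nor y occurs

    @property
    def f_x_any(self) -> int:
        # f(x*): occurrence frequency of x
        return self.f_xy + self.f_x_noty

    @property
    def f_any_y(self) -> int:
        # f(*y): occurrence frequency of y
        return self.f_xy + self.f_notx_y

    @property
    def n(self) -> int:
        # N: sum of all four cells
        return self.f_xy + self.f_x_noty + self.f_notx_y + self.f_notx_noty


def contingency_from_counts(f_xy: int, f_x: int, f_y: int, n: int) -> Contingency:
    """Derive the four cells from the marginal counts (assumed consistent)."""
    if not (0 <= f_xy <= min(f_x, f_y)) or f_x + f_y - f_xy > n:
        raise ValueError("inconsistent counts")
    return Contingency(f_xy, f_x - f_xy, f_y - f_xy, n - f_x - f_y + f_xy)
```

For example, `contingency_from_counts(30, 100, 50, 1000)` yields the cells 30, 70, 20, and 880, whose row and column sums recover the marginals 100 and 50.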
Fig. 1 Construction pipeline of the Chinese Lexical Association Database (CLAD)
Corpora used for the construction of the CLAD and their sizes
| Corpus | Number of Word Tokens |
|---|---|
| United Daily News (UDN) Corpus | 253,952,479 |
| Books for children and adolescents | 99,340,090 |
| Novels | 59,307,704 |
| PTT (a bulletin board system) | 10,734,678 |
| Sinica Corpus | 9,343,428 |
| Total | 432,678,379 |
Total number and mean size for each co-occurrence window type
| Window type | Total number | Size in Words (Mean) | Size in Words (SD) |
|---|---|---|---|
| Clause | 68,559,948 | 6.5 | 4.9 |
| Sentence | 17,007,069 | 26.3 | 21.7 |
| Paragraph | 4,310,468 | 96.0 | 129.6 |
About 7% of the corpora lack paragraph information and were excluded from the construction of the paragraph windows.
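As an illustration of window-based co-occurrence counting, the sketch below splits a pre-segmented token stream into clause or sentence windows and counts word and word-pair frequencies per window. It is a simplified sketch under stated assumptions, not the CLAD pipeline: the delimiter sets, the function names, and the choice to count each word type at most once per window are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

# Assumed delimiter sets; the source does not list the punctuation actually used.
CLAUSE_DELIMS = frozenset("，；：。！？")
SENTENCE_DELIMS = frozenset("。！？")


def split_windows(tokens, delimiters):
    """Cut a token list into windows at the given delimiter tokens."""
    windows, current = [], []
    for tok in tokens:
        if tok in delimiters:
            if current:
                windows.append(current)
                current = []
        else:
            current.append(tok)
    if current:
        windows.append(current)
    return windows


def count_cooccurrences(windows):
    """Count, per window, each word type once and each unordered pair once."""
    word_freq, pair_freq = Counter(), Counter()
    for window in windows:
        types = sorted(set(window))
        word_freq.update(types)
        pair_freq.update(combinations(types, 2))  # keys are sorted tuples
    return word_freq, pair_freq


tokens = ["我", "喜歡", "貓", "，", "貓", "喜歡", "魚", "。", "魚", "游泳", "。"]
clauses = split_windows(tokens, CLAUSE_DELIMS)
sentences = split_windows(tokens, SENTENCE_DELIMS)
clause_words, clause_pairs = count_cooccurrences(clauses)
sentence_words, sentence_pairs = count_cooccurrences(sentences)
```

Wider windows admit more pairs: in this toy input, 我 and 魚 never share a clause but do share a sentence, which mirrors why the number of associative pairs grows from the clause to the paragraph window.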
Fig. 2 Dendrogram of the word association measures applied to create the CLAD
Fig. 3 Viewing the whole word list of the CLAD
Fig. 4 Using the keyword search function
Fig. 5 Using the part-of-speech (POS) and frequency search function
Fig. 6 Selecting the co-occurrence window and association measure to display and download association data
Sizes of the Chen norms and the CLAD
| Database | Number of Keywords/Stimuli | Number of Associative Word Pairs | Associates/Responses per Word (Mean) | Associates/Responses per Word (SD) |
|---|---|---|---|---|
| CLAD, clause window | 84,674 | 16,997,424 | 401 | 1,338 |
| CLAD, sentence window | 84,674 | 66,313,915 | 1,523 | 3,569 |
| CLAD, paragraph window | 84,674 | 207,538,494 | 4,516 | 6,889 |
| Chen norms | 1,200 | 103,006 | 86 | 17 |
Coverage rates of the Chen norms in the CLAD, computed using stimuli and associates in varying frequency and strength
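| Measure | Clause, All, 10 | Clause, All, 40 | Clause, All, 70 | Clause, All, All | Clause, 75%, 10 | Clause, 75%, 40 | Clause, 75%, 70 | Clause, 75%, All | Clause, 50%, 10 | Clause, 50%, 40 | Clause, 50%, 70 | Clause, 50%, All | Clause, 25%, 10 | Clause, 25%, 40 | Clause, 25%, 70 | Clause, 25%, All | Sentence, All, 10 | Sentence, All, 40 | Sentence, All, 70 | Sentence, All, All | Sentence, 75%, 10 | Sentence, 75%, 40 | Sentence, 75%, 70 | Sentence, 75%, All | Sentence, 50%, 10 | Sentence, 50%, 40 | Sentence, 50%, 70 | Sentence, 50%, All | Sentence, 25%, 10 | Sentence, 25%, 40 | Sentence, 25%, 70 | Sentence, 25%, All | Paragraph, All, 10 | Paragraph, All, 40 | Paragraph, All, 70 | Paragraph, All, All | Paragraph, 75%, 10 | Paragraph, 75%, 40 | Paragraph, 75%, 70 | Paragraph, 75%, All | Paragraph, 50%, 10 | Paragraph, 50%, 40 | Paragraph, 50%, 70 | Paragraph, 50%, All | Paragraph, 25%, 10 | Paragraph, 25%, 40 | Paragraph, 25%, 70 | Paragraph, 25%, All |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|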
| Added value | .12 | .30 | .38 | .42 | .16 | .39 | .49 | .55 | .21 | .50 | .63 | .70 | .23 | .55 | .70 | .78 | .23 | .46 | .54 | .59 | .30 | .57 | .68 | .73 | .36 | .68 | .80 | .86 | .36 | .70 | .83 | .90 | .30 | .55 | .66 | .71 | .37 | .66 | .78 | .84 | .41 | .74 | .86 | .92 | .38 | .73 | .86 | .93 |
| Baroni-Urbani | .17 | .31 | .38 | .42 | .22 | .40 | .49 | .54 | .31 | .53 | .63 | .69 | .40 | .63 | .73 | .78 | .29 | .46 | .54 | .58 | .37 | .58 | .67 | .72 | .50 | .71 | .80 | .85 | .59 | .81 | .88 | .91 | .36 | .56 | .64 | .70 | .45 | .67 | .75 | .82 | .58 | .80 | .87 | .91 | .65 | .87 | .92 | .94 |
| Conditional probability | .17 | .31 | .38 | .42 | .22 | .41 | .49 | .55 | .29 | .54 | .64 | .70 | .36 | .63 | .73 | .79 | .28 | .47 | .54 | .59 | .36 | .59 | .68 | .73 | .46 | .74 | .82 | .86 | .54 | .81 | .88 | .91 | .35 | .58 | .66 | .71 | .44 | .71 | .80 | .84 | .55 | .83 | .90 | .92 | .59 | .87 | .92 | .94 |
| Gini index | .17 | .31 | .38 | .43 | .23 | .41 | .49 | .55 | .31 | .53 | .63 | .70 | .37 | .60 | .71 | .78 | .32 | .48 | .55 | .59 | .41 | .60 | .68 | .73 | .52 | .73 | .81 | .86 | .58 | .78 | .86 | .91 | .41 | .60 | .67 | .71 | .51 | .72 | .80 | .84 | .62 | .83 | .89 | .92 | .66 | .85 | .91 | .94 |
| Jaccard | .19 | .33 | .40 | .43 | .25 | .43 | .51 | .55 | .34 | .57 | .66 | .70 | .41 | .65 | .74 | .79 | .33 | .50 | .56 | .59 | .42 | .63 | .70 | .73 | .54 | .76 | .83 | .86 | .60 | .82 | .88 | .91 | .42 | .61 | .68 | .71 | .52 | .73 | .80 | .84 | .62 | .84 | .90 | .92 | .64 | .87 | .92 | .94 |
| Joint probability | .17 | .31 | .38 | .42 | .22 | .41 | .49 | .55 | .29 | .54 | .64 | .70 | .36 | .63 | .73 | .79 | .28 | .47 | .54 | .59 | .36 | .59 | .68 | .73 | .46 | .74 | .82 | .86 | .54 | .81 | .88 | .91 | .35 | .58 | .66 | .71 | .44 | .71 | .80 | .84 | .55 | .83 | .90 | .92 | .59 | .87 | .92 | .94 |
| Kappa | .19 | .33 | .39 | .43 | .25 | .42 | .50 | .55 | .34 | .56 | .64 | .70 | .42 | .64 | .72 | .78 | .33 | .49 | .55 | .59 | .43 | .62 | .69 | .73 | .55 | .75 | .81 | .86 | .62 | .81 | .86 | .91 | .43 | .61 | .68 | .71 | .53 | .73 | .80 | .84 | .64 | .84 | .89 | .92 | .66 | .87 | .92 | .94 |
| Log likelihood ratio | .19 | .32 | .39 | .43 | .25 | .42 | .50 | .55 | .34 | .54 | .63 | .70 | .40 | .61 | .71 | .78 | .34 | .49 | .55 | .59 | .44 | .61 | .68 | .73 | .54 | .73 | .81 | .86 | .60 | .78 | .86 | .91 | .44 | .62 | .68 | | .54 | .74 | .80 | | .64 | .84 | .89 | | .67 | .86 | .91 | |
| Michael | .19 | .33 | .39 | .43 | .25 | .43 | .50 | .55 | .33 | .56 | .64 | .70 | .41 | .64 | .72 | .79 | .32 | .49 | .55 | .59 | .41 | .62 | .69 | .73 | .53 | .75 | .82 | .86 | .60 | .81 | .86 | .91 | .39 | .60 | .68 | .71 | .49 | .73 | .80 | .84 | .60 | .84 | .90 | .92 | .64 | .88 | .92 | .94 |
| Mutual expectation | .19 | .33 | .40 | .43 | .25 | .43 | .51 | .55 | .33 | .56 | .65 | .70 | .40 | .65 | .74 | .79 | .32 | .50 | .56 | .59 | .41 | .63 | .70 | .73 | .51 | .75 | .83 | .86 | .57 | .82 | .88 | .91 | .42 | .62 | .68 | .72 | .50 | .74 | .81 | .84 | .59 | .84 | .90 | .92 | .61 | .87 | .92 | .94 |
| R cost | .18 | .33 | .39 | .43 | .23 | .42 | .51 | .55 | .31 | .55 | .65 | .70 | .38 | .63 | .73 | .79 | .33 | .50 | .56 | .59 | .42 | .62 | .70 | .73 | .53 | .75 | .83 | .86 | .60 | .81 | .88 | .91 | .44 | .62 | .68 | .72 | .53 | .74 | .81 | .84 | .63 | .84 | .90 | .92 | .67 | .87 | .92 | .94 |
| Reverse conditional probability | .10 | .29 | .38 | .42 | .14 | .37 | .49 | .54 | .18 | .48 | .62 | .69 | .21 | .53 | .69 | .77 | .19 | .43 | .54 | .59 | .24 | .54 | .67 | .73 | .30 | .64 | .78 | .85 | .32 | .67 | .83 | .90 | .24 | .52 | .64 | .71 | .29 | .61 | .75 | .83 | .32 | .68 | .83 | .91 | .33 | .69 | .85 | .93 |
| Simpson | .12 | .29 | .38 | .42 | .16 | .38 | .48 | .54 | .21 | .49 | .62 | .69 | .23 | .55 | .70 | .78 | .22 | .44 | .53 | .58 | .28 | .56 | .67 | .73 | .35 | .68 | .80 | .86 | .36 | .71 | .84 | .90 | .28 | .54 | .65 | .70 | .35 | .66 | .77 | .83 | .41 | .74 | .86 | .92 | .39 | .74 | .87 | .93 |
| Sokal–Michener | .03 | .16 | .27 | .37 | .04 | .20 | .34 | .47 | .05 | .24 | .41 | .59 | .04 | .22 | .41 | .64 | .05 | .19 | .32 | .49 | .06 | .22 | .38 | .60 | .05 | .21 | .39 | .69 | .04 | .16 | .33 | .70 | .06 | .19 | .31 | .57 | .06 | .19 | .33 | .66 | .04 | .14 | .29 | .70 | .04 | .12 | .25 | .70 |
| Squared log likelihood ratio | .04 | .19 | .30 | .39 | .06 | .25 | .39 | .50 | .07 | .30 | .48 | .62 | .06 | .29 | .49 | .68 | .07 | .24 | .37 | .52 | .08 | .29 | .45 | .63 | .08 | .29 | .48 | .72 | .06 | .23 | .42 | .73 | .08 | .23 | .37 | .60 | .08 | .24 | .39 | .68 | .05 | .19 | .35 | .72 | .04 | .14 | .29 | .71 |
| U cost | .05 | .18 | .29 | .38 | .07 | .23 | .37 | .49 | .08 | .29 | .47 | .62 | .10 | .33 | .53 | .70 | .10 | .29 | .42 | .54 | .12 | .35 | .52 | .66 | .14 | .40 | .60 | .77 | .21 | .54 | .73 | .86 | .15 | .39 | .54 | .66 | .17 | .46 | .63 | .77 | .22 | .55 | .73 | .86 | .32 | .70 | .83 | .92 |
| Unigram subtuples | .14 | .31 | .39 | .43 | .18 | .40 | .50 | .55 | .24 | .52 | .64 | .70 | .29 | .59 | .72 | .78 | .27 | .48 | .56 | .59 | .34 | .60 | .69 | .73 | .42 | .72 | .82 | .86 | .47 | .77 | .87 | .91 | .35 | .59 | .67 | .71 | .43 | .70 | .79 | .84 | .49 | .79 | .88 | .92 | .50 | .81 | .90 | .94 |
Tendencies of the association measures to align with the Chen norms
| Measure | Clause Window | Sentence Window | Paragraph Window |
|---|---|---|---|
| Added value | 2.98 | 2.25 | 2.11 |
| Baroni-Urbani | 2.04 | 1.65 | 1.57 |
| Conditional probability | 2.13 | 1.78 | 1.70 |
| Gini index | 2.04 | 1.59 | 1.49 |
| Jaccard | 1.91 | 1.57 | 1.50 |
| Joint probability | 2.13 | 1.78 | 1.70 |
| Kappa | 1.88 | 1.53 | 1.46 |
| Log likelihood ratio | 1.88 | 1.51 | 1.42 |
| Michael | 1.93 | 1.58 | 1.56 |
| Mutual expectation | 1.94 | 1.62 | 1.55 |
| R cost | 2.05 | 1.57 | 1.45 |
| Reverse conditional probability | 3.35 | 2.60 | 2.53 |
| Simpson | 2.96 | 2.29 | 2.14 |
| Sokal–Michener | 8.24 | 7.74 | 7.77 |
| Squared log likelihood ratio | 6.96 | 6.25 | 7.28 |
| U cost | 5.48 | 4.07 | 3.23 |
| Unigram subtuples | 2.63 | 1.94 | 1.80 |
Smaller values indicate a stronger tendency to align.
Percentages of word pairs in the Chen norms and the priming experiment, according to their normative association strengths
| Association Strength in the Chen Norms | Percentage in the Chen Norms | Percentage in the Priming Experiment |
|---|---|---|
| ≧.03 | 7.2 | 15.7 |
| ≧.015 & < .03 | 10.5 | 26.5 |
| = .01 | 13.2 | 21.6 |
| = .005 | 69.2 | 36.3 |
Association strength is calculated by dividing the number of particular responses by the number of respondents, which is always 200 for a stimulus in the Chen norms.
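The calculation in the note above is simple enough to sketch directly. The function name below is hypothetical; the default of 200 respondents comes from the note. With 200 respondents, strengths can only take values in steps of .005, which is what makes the normative scale coarse.

```python
def association_strength(response_count: int, respondents: int = 200) -> float:
    """Proportion of respondents who gave a particular response to a stimulus."""
    if respondents <= 0:
        raise ValueError("respondents must be positive")
    if not 0 <= response_count <= respondents:
        raise ValueError("response_count must lie between 0 and respondents")
    return response_count / respondents


# One, two, and six respondents out of 200 giving the same response.
strengths = [association_strength(k) for k in (1, 2, 6)]
```

A strength of .005 therefore corresponds to a response given by a single respondent, and .01 to a response given by two.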
Means, standard deviations, and ranges for the predictor variables used in the regression analyses
| Predictor Group | Variable | Mean | SD | Range |
|---|---|---|---|---|
| Prime (related and unrelated) | Stroke | 21.61 | 6.07 | (7, 43) |
| | Word frequency | 7.81 | 1.47 | (4.81, 10.65) |
| | Left orthographic neighbors | 59.36 | 54.10 | (0, 298) |
| | Right orthographic neighbors | 102.07 | 121.31 | (1, 654) |
| | zRTCLP | −0.13 | 0.43 | (−0.87, 1.05) |
| Target | Stroke | 21.48 | 6.87 | (5, 40) |
| | Word frequency | 8.92 | 1.29 | (5.87, 11.87) |
| | Left orthographic neighbors | 80.5 | 74.61 | (2, 289) |
| | Right orthographic neighbors | 102.16 | 131.35 | (2, 654) |
| | zRTCLP | −0.32 | 0.33 | (−0.95, 1.17) |
| Associative (Chen norms) | Forward | .016324 | .016619 | (.005, .105) |
| | Backward | .225938 | .280689 | (.004, 1) |
| Associative (CLAD) | Added value | 0.048222 | 0.076665 | (−9.4e-05, 0.515925) |
| | Baroni-Urbani | 0.551819 | 0.202924 | (0.058325, 0.903283) |
| | Conditional probability | 0.038788 | 0.057208 | (0.000319, 0.449438) |
| | Gini index | 6e-06 | 1.8e-05 | (4.63e-10, 8.8e-05) |
| | Jaccard | 0.00828 | 0.012235 | (9.5e-05, 0.071429) |
| | Joint probability | 2.5e-05 | 4.6e-05 | (1e-06, 0.000274) |
| | Kappa | 0.015398 | 0.023186 | (−0.000159615, 0.133028) |
| | Log likelihood ratio | 513.248121 | 1,134.09 | (0.180993, 6,932.112495) |
| | Michael | 9.30e-05 | 0.000177 | (2e-06, 0.001083) |
| | Mutual expectation | 1.00e-06 | 4.00e-06 | (4.42e-10, 0.000029) |
| | R cost | 0.002334 | 0.007047 | (3e-06, 0.058224) |
| | Reverse conditional probability | 0.023922 | 0.060867 | (9.6e-05, 0.518809) |
| | Simpson | 0.051452 | 0.077473 | (0.002340824, 0.518809) |
| | Sokal–Michener | 0.996286 | 0.004341 | (0.975672, 0.999796) |
| | Squared log likelihood ratio | 16.441521 | 30.86232 | (0.174048, 201.581215) |
| | U cost | 0.351233 | 0.263277 | (0.002478, 0.979513) |
| | Unigram subtuples | 2.079332 | 1.534075 | (−1.313466, 6.388045) |
Stroke = number of strokes of the two characters of a word. Word frequency = logarithmic transformation (base e) of word frequency in the text corpora described in Table 3. Left orthographic neighbors = number of words sharing the same first character. Right orthographic neighbors = number of words sharing the same second character. zRTCLP = standardized RT according to the Chinese Lexicon Project. Associative = Chen and CLAD association strengths computed using the various measures. Forward = proportion of the target in the overall responses to the prime in the Chen norms. Backward = proportion of the prime as a stimulus in the overall associations in the Chen norms when the target was a response.
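Two of the lexical predictors defined above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and whether the word itself is excluded from its own neighbor count, and which lexicon the counts are taken over, are assumptions not stated in the source.

```python
import math


def neighbor_counts(word: str, lexicon) -> tuple[int, int]:
    """Return (left, right) orthographic neighbor counts for a word.

    Left neighbors share the word's first character; right neighbors share
    its last (second) character. The word itself is excluded (an assumption).
    """
    left = sum(1 for w in lexicon if w != word and w[:1] == word[:1])
    right = sum(1 for w in lexicon if w != word and w[-1:] == word[-1:])
    return left, right


def log_frequency(count: int) -> float:
    """Natural-log (base e) transformation of a raw corpus frequency."""
    if count <= 0:
        raise ValueError("count must be positive")
    return math.log(count)


# Toy lexicon; 電腦 shares its first character with 電話 and 電影,
# and its second character with 大腦 and 頭腦.
lexicon = {"電腦", "電話", "電影", "大腦", "頭腦", "手機"}
left, right = neighbor_counts("電腦", lexicon)
```

Stroke counts are omitted here because they require a character-to-stroke lookup table that the source does not describe.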
R² of the multiple regression models and beta weights for the lexical and behavioral variables predicting the priming effect
| Model | R² | Un Prime Stroke | Un Prime L Ortho | Un Prime R Ortho | Un Prime Freq | Un Prime zRTCLP | Rel Prime Stroke | Rel Prime L Ortho | Rel Prime R Ortho | Rel Prime Freq | Rel Prime zRTCLP | Target Stroke | Target L Ortho | Target R Ortho | Target Freq | Target zRTCLP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Added value | .24 | .08 | .15 | – .01 | .06 | .14 | .01 | – .04 | – .05 | – .15 | – .10 | .06 | – .01 | .15 | – .33† | .12 |
| Baroni-Urbani | .29* | .08 | .10 | – .01 | .13 | .19 | .00 | – .05 | – .06 | – .17 | – .11 | .07 | .02 | .16 | .02 | .23 |
| Conditional probability | .24 | .08 | .16 | .00 | .06 | .15 | .01 | – .05 | – .06 | – .13 | – .12 | .05 | – .02 | .15 | – .35† | .12 |
| Gini index | .26 | .10 | .17 | – .03 | .09 | .16 | .00 | – .07 | – .05 | – .23 | – .14 | .05 | – .02 | .14 | – .31† | .14 |
| Jaccard | .29* | .11 | .13 | .00 | .18 | .25† | .02 | – .10 | – .09 | – .27† | – .16 | .05 | – .01 | .17 | – .21 | .17 |
| Joint probability | .28* | .11 | .16 | – .01 | .14 | .19 | .03 | – .10 | – .06 | – .32* | – .15 | .06 | – .01 | .15 | – .34† | .14 |
| Kappa | .29* | .11 | .13 | .00 | .18 | .25† | .02 | – .10 | – .09 | – .27† | – .16 | .05 | – .01 | .17 | – .21 | .17 |
| Log likelihood ratio | .29* | .11 | .16 | – .03 | .16 | .21 | .02 | – .10 | – .06 | – .30* | – .16 | .04 | – .01 | .15 | – .29 | .14 |
| Michael | .28* | .11 | .16 | – .02 | .15 | .19 | .03 | – .10 | – .06 | – .32* | – .15 | .06 | – .01 | .15 | – .33† | .14 |
| Mutual expectation | .29* | .10 | .18 | – .02 | .16 | .22† | .04 | – .11 | – .08 | – .27† | – .15 | .03 | – .01 | .15 | – .29 | .13 |
| R cost | .26 | .09 | .16 | – .01 | .10 | .19 | .02 | – .06 | – .07 | – .20 | – .14 | .05 | – .02 | .15 | – .29 | .14 |
| Reverse cond. prob. | .24 | .08 | .14 | – .02 | .05 | .14 | .00 | – .04 | – .04 | – .16 | – .08 | .08 | .00 | .15 | – .30 | .12 |
| Simpson | .24 | .08 | .15 | – .01 | .05 | .14 | .01 | – .04 | – .05 | – .15 | – .10 | .06 | – .01 | .15 | – .33† | .12 |
| Sokal–Michener | .25 | .05 | .13 | – .03 | .08 | .14 | .02 | – .05 | – .06 | – .09 | – .09 | .06 | – .02 | .14 | – .12 | .14 |
| Squared log likelihood | .24 | .08 | .14 | – .01 | .04 | .14 | .01 | – .05 | – .06 | – .08 | – .08 | .06 | – .01 | .14 | – .27 | .12 |
| U cost | .27* | .08 | .11 | .01 | .02 | .15 | .01 | – .04 | – .07 | – .20 | – .07 | .06 | – .03 | .12 | – .16 | .20 |
| Unigram subtuples | .26 | .09 | .12 | – .01 | .11 | .17 | – .01 | – .05 | – .06 | – .13 | – .10 | .06 | .01 | .17 | – .22 | .15 |
Un = unrelated; Rel = related; Freq = frequency; L = left; R = right; Ortho = orthographic neighbors; zRTCLP = z-score standardized reaction time from the Chinese Lexicon Project (Tse et al., 2017). *p < .05; †p < .10.
Beta weights for the associative variables predicting the priming effect
| Measure | CLAD | Chen Norms (Forward) | Chen Norms (Backward) |
|---|---|---|---|
| Added value | .09 | .12 | – .01 |
| Baroni-Urbani | .37* | – .01 | – .01 |
| Conditional probability | .12 | .09 | – .01 |
| Gini index | .22 | .07 | – .04 |
| Jaccard | .34* | – .02 | – .03 |
| Joint probability | .30* | .05 | – .02 |
| Kappa | .34* | – .02 | – .03 |
| Log likelihood ratio | .32* | .02 | – .02 |
| Michael | .31* | .05 | – .02 |
| Mutual expectation | .31* | .03 | – .01 |
| R cost | .21 | .04 | – .03 |
| Reverse cond. prob. | .10 | .14 | – .01 |
| Simpson | .08 | .12 | – .01 |
| Sokal–Michener | .19 | .13 | .03 |
| Squared log likelihood | .08 | .14 | .02 |
| U cost | .27* | .16 | .03 |
| Unigram subtuples | .21 | .06 | – .02 |
*p < .05, †p < .10.
The inventory of word association measures used in this study to compute word association strength
| # | Measure | Formula | Reference |
|---|---|---|---|
| 1. | Joint probability | | (Giuliano, |
| 2. | Conditional probability | | (Gregory, Raymond, Bell, Fosler-Lussier, & Jurafsky, |
| 3. | Reverse conditional probability | | (Gregory et al., |
| 4. | Pointwise mutual information | log | (Church & Hanks, |
| 5. | Mutual dependency (MD) | log | (Thanopoulos, Fakotakis, & Kokkinakis, |
| 6. | Log frequency biased MD | log | (Thanopoulos et al., |
| 7. | Normalized expectation | | (Smadja & McKeown, |
| 8. | Mutual expectation | | (Dias, Guilloré, Bassano, & Lopes, |
| 9. | Salience | log | (Kilgarriff & Tugwell, |
| 10. | Pearson’s | | (Manning & Schütze, |
| 11. | Fisher’s exact test | | (Pedersen, |
| 12. | | | (Church & Hanks, |
| 13. | | | (Berry-Rogghe, |
| 14. | Poisson significance | | (Quasthoff & Wolff, |
| 15. | Log likelihood ratio | −2 | (Dunning, |
| 16. | Squared log likelihood ratio | −2 | (Inkpen & Hirst, |
| 17. | Russel–Rao | | (Russel & Rao, |
| 18. | Sokal–Michener | | (Sokal & Michener, |
| 19. | Rogers–Tanimoto | | (Rogers & Tanimoto, |
| 20. | Hamann | | (Hamann, |
| 21. | Third Sokal–Sneath | | (Sokal & Sneath, |
| 22. | Jaccard | | (Jaccard, |
| 23. | First Kulczynski | | (Kulczynski, |
| 24. | Second Sokal–Sneath | | (Sokal & Sneath, |
| 25. | Second Kulczynski | | (Kulczynski, |
| 26. | Fourth Sokal–Sneath | | (Kulczynski, |
| 27. | Odds ratio | | (Tan, Kumar, & Srivastava, |
| 28. | Yule’s ω | | (Tan et al., |
| 29. | Yule’s Q | | (Tan et al., |
| 30. | Driver–Kroeber | | (Driver & Kroeber, |
| 31. | Fifth Sokal–Sneath | | (Sokal & Sneath, |
| 32. | Pearson | | (Pearson, 1950) |
| 33. | Baroni-Urbani | | (Baroni-Urbani & Buser, |
| 34. | Braun–Blanquet | | (Braun-Blanquet, |
| 35. | Simpson | | (Simpson, |
| 36. | Michael | | (Michael, |
| 37. | Mountford | | (Kaufman & Rousseeuw, |
| 38. | Fager | | (Kaufman & Rousseeuw, |
| 39. | Unigram subtuples | log | (Blaheta & Johnson, |
| 40. | U cost | log | (Tulloss, |
| 41. | S cost | log | (Tulloss, |
| 42. | R cost | log | (Tulloss, |
| 43. | T combined cost | | (Tulloss, |
| 44. | Phi | | (Tan et al., |
| 45. | Kappa | | (Tan et al., |
| 46. | | max[ | (Tan et al., |
| 47. | Gini index | | (Tan et al., |
| 48. | Confidence | max[ | (Clark & Boswell, |
| 49. | Laplace | | (Clark & Boswell, |
| 50. | Conviction | | (Brin, Motwani, Ullman, & Tsur, |
| 51. | Piatetsky-Shapiro | | (Piatetsky-Shapiro, |
| 52. | Certainty factor | | (Shortliffe & Buchanan, |
| 53. | Added value | max[ | (Sahar & Mansour, |
| 54. | Collective strength | | (Aggarwal & Yu, |
| 55. | Klösgen | | (Klösgen, |
Measures that work under the assumption of statistical independence employ both observed frequencies (as shown in Table 1) and expected frequencies $\hat{f}(xy) = f(x*) \, f(*y) / N$. The estimated probabilities of (co-)occurrence are expressed by P. P(x|y) is interpreted as the probability of word x given word y.
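Because the formula column of the inventory did not survive in this copy, the following Python sketch uses common textbook definitions of a few of the listed measures rather than the authors' exact formulas. The function name, the dictionary keys, the use of the natural logarithm, and the direction assigned to the conditional and reverse conditional probabilities are all assumptions made for illustration.

```python
import math


def association_measures(f_xy: int, f_x: int, f_y: int, n: int) -> dict:
    """Compute a few association scores from observed counts.

    f_xy: co-occurrence count f(xy); f_x: f(x*); f_y: f(*y); n: N.
    Probabilities are estimated as relative frequencies.
    """
    if not (0 < f_xy <= min(f_x, f_y) and max(f_x, f_y) <= n):
        raise ValueError("counts are inconsistent")
    expected = f_x * f_y / n  # expected frequency under independence
    return {
        "expected_frequency": expected,
        "joint_probability": f_xy / n,                    # P(xy)
        "conditional_probability": f_xy / f_x,            # assumed to be P(y|x)
        "reverse_conditional_probability": f_xy / f_y,    # assumed to be P(x|y)
        "pointwise_mutual_information": math.log(f_xy / expected),
        "jaccard": f_xy / (f_x + f_y - f_xy),
    }


scores = association_measures(f_xy=30, f_x=100, f_y=50, n=1000)
```

In this toy example the pair co-occurs 30 times against an expected 5 under independence, so the pointwise mutual information is positive (the natural log of 6).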