Alice F Jackson, Donald J Bolger.
Abstract
The GOLD model (Graph Of Language Distribution) is a network model constructed based on co-occurrence in a large corpus of natural language that may be used to explore what information may be present in a graph-structured model of language, and what information may be extracted through theoretically driven algorithms as well as standard graph analysis methods. The present study will employ GOLD to examine two types of relationship between words: semantic similarity and associative relatedness. Semantic similarity refers to the degree of overlap in meaning between words, while associative relatedness refers to the degree to which two words occur in the same schematic context. It is expected that a graph-structured model of language constructed based on co-occurrence should easily capture associative relatedness, because this type of relationship is thought to be present directly in lexical co-occurrence. However, it is hypothesized that semantic similarity may be extracted from the intersection of the set of first-order connections, because two words that are semantically similar may occupy similar thematic or syntactic roles across contexts and thus would co-occur lexically with the same set of nodes. Two versions of the GOLD model that differed in terms of the co-occurrence window, bigGOLD at the paragraph level and smallGOLD at the adjacent word level, were directly compared to the performance of a well-established distributional model, Latent Semantic Analysis (LSA). The superior performance of the GOLD models (big and small) suggests that a single acquisition and storage mechanism, namely co-occurrence, can account for associative and conceptual relationships between words and is more psychologically plausible than models using singular value decomposition (SVD).
Keywords: co-occurrence; computational model of language; distribution model; graph; similarity
Year: 2014 PMID: 24860525 PMCID: PMC4026710 DOI: 10.3389/fpsyg.2014.00385
Source DB: PubMed Journal: Front Psychol ISSN: 1664-1078
and
tags as paragraph breaks; and the “sentence” level, which used sentence-final punctuation such as periods and exclamation points as delimiters in addition to the paragraph breaks. The GOLD model was constructed based on the paragraph-level data, as a compromise between the computational complexity of full-document processing and the limited span of the sentence-level data. A total of 19,646 comment threads were collected, totaling 4,342,302 paragraphs and 97,976,253 words (tokens), with 431,822 unique words (types). Average paragraph length was 22.8 words, with a median of 15 words, a minimum length of 1 word, a maximum length of 2,013 words, and a standard deviation of 24.5 words.
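To make the two co-occurrence windows concrete, the following is a minimal sketch of how edge weights might be accumulated at the paragraph level (as in bigGOLD) and at the adjacent-word level (as in smallGOLD). The function name `cooccurrence_weights`, the whitespace tokenization, and the choice to count each word pair at most once per paragraph are assumptions made for illustration, not details taken from the study.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(paragraphs, window="paragraph"):
    """Accumulate undirected co-occurrence counts between words.

    Hypothetical helper: tokenization and counting rules are assumptions.
    Edges are stored as alphabetically sorted (word_a, word_b) tuples.
    """
    weights = Counter()
    for paragraph in paragraphs:
        tokens = paragraph.lower().split()  # assumed: naive whitespace tokenizer
        if window == "paragraph":
            # Every unordered pair of distinct words in the paragraph,
            # counted once per paragraph (an assumption).
            pairs = combinations(sorted(set(tokens)), 2)
        elif window == "adjacent":
            # Only immediately neighboring words; self-pairs are skipped.
            pairs = (
                tuple(sorted(pair))
                for pair in zip(tokens, tokens[1:])
                if pair[0] != pair[1]
            )
        else:
            raise ValueError(f"unknown window: {window!r}")
        weights.update(pairs)
    return weights


paragraphs = ["the grumpy cat sat", "the cat purred"]
big = cooccurrence_weights(paragraphs, window="paragraph")
small = cooccurrence_weights(paragraphs, window="adjacent")
print(big[("cat", "the")], small[("cat", "the")])
```

With this toy input, `cat` and `the` share two paragraphs but are adjacent only once, so the paragraph-level graph gives that edge a larger weight than the adjacent-word graph.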
Figure 1. First-order associates of cat and grumpy. Connectivity between associates is not displayed. The large cloud of nodes is the set of associates of cat that are not also connected to grumpy; the small cloud of nodes is the set of associates of grumpy that are not also connected to cat; and the round blob between them is the set of nodes that is connected to both grumpy and cat. Figure produced using Force Atlas and Yifan-Hu layout algorithms in Gephi (Bastian et al., 2009).
Figure 2. First-order associates of […]. Connectivity between associates is not displayed. This subgraph is small enough to display weight information as well; weight of connections is depicted by color (red = large weights) as well as thickness. Figure produced using Force Atlas and Yifan-Hu layout algorithms in Gephi (Bastian et al., 2009).
Weight normalization methods.
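| # | Normalization method | Formula |
|---|---|---|
| 1 | Raw weights | Weight |
| 2 | Pointwise mutual information (PMI) | |
| 3 | Sum of IDFs | (w1idf + w2idf) ∗ weight |
| 4 | Product of IDFs | (w1idf ∗ w2idf) ∗ weight |
| 5 | Sum of document frequencies | (w1df + w2df) ∗ weight |
| 6 | Product of document frequencies | (w1df ∗ w2df) ∗ weight |
| 7 | Inverse of sum of IDFs | Weight/(w1idf + w2idf) |
| 8 | Inverse of product of IDFs | Weight/(w1idf ∗ w2idf) |
| 9 | Inverse of sum of document frequencies | Weight/(w1df + w2df) |
| 10 | Inverse of product of document frequencies | Weight/(w1df ∗ w2df) |
| 11 | Sum of frequencies | (w1f + w2f) ∗ weight |
| 12 | Sum of frequencies multiplied by log sum of frequencies | Sum freq ∗ log sum freq |
| 13 | Product of frequencies multiplied by log product of frequencies | Prod freq ∗ log prod freq |
| 14 | Sum of frequencies divided by log sum of frequencies | Sum freq/log sum freq |
| 15 | Product of frequencies divided by log product of frequencies | Prod freq/log prod freq |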
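As an illustration of the normalizations listed above, here is a minimal sketch that applies methods 1 and 3 through 11 to a single edge, following the formulas as they are spelled out in the top-30 feature table (for example `(w1idf + w2idf) ∗ weight` and `Weight/(w1df + w2df)`). The function name `normalized_weights`, its parameter names, and the IDF definition `log(n_docs / df)` are assumptions; the source does not state which IDF variant was used. PMI and the log-frequency methods (12 to 15) are omitted because their exact formulas are not recoverable from the text.

```python
import math


def normalized_weights(weight, f1, f2, df1, df2, n_docs):
    """Apply several edge-weight normalizations to one word pair.

    weight : raw co-occurrence weight of the edge
    f1, f2 : corpus frequencies of word 1 and word 2
    df1, df2 : document frequencies of word 1 and word 2
    n_docs : number of documents (assumed: used only for IDF)
    """
    # Assumed IDF definition; it is zero when a word occurs in every
    # document, which would make the inverse-IDF variants undefined.
    idf1 = math.log(n_docs / df1)
    idf2 = math.log(n_docs / df2)
    return {
        "raw": weight,
        "sum_idf": (idf1 + idf2) * weight,
        "prod_idf": idf1 * idf2 * weight,
        "sum_df": (df1 + df2) * weight,
        "prod_df": df1 * df2 * weight,
        "inv_sum_idf": weight / (idf1 + idf2),
        "inv_prod_idf": weight / (idf1 * idf2),
        "inv_sum_df": weight / (df1 + df2),
        "inv_prod_df": weight / (df1 * df2),
        "sum_freq": (f1 + f2) * weight,
    }


example = normalized_weights(weight=10, f1=100, f2=50, df1=20, df2=5, n_docs=1000)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Returning all variants in one dictionary keeps the sketch easy to compare against the table; a real pipeline would more likely compute only the normalization selected for a given feature.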
Figure 3. Simplified graph of associates of a word pair. Nodes in the blue region are the overlapping nodes, each of which is connected to both words in the word pair. Nodes in the green regions are the non-overlapping nodes, each of which is connected to only one of the words in the word pair. For clarity, only a few nodes are displayed.
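The partition shown in Figure 3 can be sketched with plain set operations. The example below assumes the graph is stored as a dictionary mapping each word to the set of its first-order associates; the function name `split_associates`, the toy graph, and the decision to exclude the two target words from each other's associate sets are illustrative assumptions rather than details from the study.

```python
def split_associates(graph, word_a, word_b):
    """Partition first-order associates of a word pair.

    Returns (overlap, only_a, only_b), where overlap holds nodes connected
    to both words and only_a / only_b hold nodes connected to just one.
    """
    # Assumed: the two target words are not counted as associates of each other.
    neighbors_a = set(graph.get(word_a, ())) - {word_b}
    neighbors_b = set(graph.get(word_b, ())) - {word_a}
    overlap = neighbors_a & neighbors_b
    return overlap, neighbors_a - overlap, neighbors_b - overlap


# Hypothetical toy graph, for illustration only.
toy_graph = {
    "cat": {"dog", "pet", "fur", "meow"},
    "dog": {"cat", "pet", "fur", "bark"},
}
overlap, only_cat, only_dog = split_associates(toy_graph, "cat", "dog")
print(sorted(overlap), sorted(only_cat), sorted(only_dog))
```

In this toy case the overlapping nodes are `fur` and `pet`, while `meow` and `bark` are each connected to only one word of the pair.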
Overall classifier accuracy on the Plaut and Booth word pairs.
| Model | Accuracy |
|---|---|
| smallGOLD | 90.0% |
| bigGOLD | 90.4% |
| LSA | 74.4% |
Classifier performance for the related and unrelated word pairs.
Percentages shown in red are the correct classifications.
Classifier accuracy on the Chiarello et al. word pairs.
| Model | Accuracy |
|---|---|
| smallGOLD | 60.2% |
| bigGOLD | 57.9% |
| LSA | 38.8% |
Classifier performance for the associated-only, both associated and similar, and similar-only word pairs.
Percentages shown in red are the correct classifications.
Methods and accuracy of the top 30 pairs of features described above.
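| Rank | Feature 1 | Feature 2 | Method 1 | Method 2 | Normalization 1 | Normalization 2 | Accuracy (%) |
|---|---|---|---|---|---|---|---|
| 1 | Assoc. | Sim. | PMI | Method5 | Prod freq ∗ log prod freq | Sum freq/log sum freq | 63.23 |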
| 2 | Assoc. | Sim. | PMI | Method5 | Weight/(w1df ∗ w2df) | Sum freq/log sum freq | 62.58 |
| 3 | Assoc. | Sim. | PMI | Method5 | (w1idf + w2idf) ∗ weight | Sum freq/log sum freq | 61.94 |
| 4 | Assoc. | Sim. | PMI | Method5 | Prod freq/log prod freq | Sum freq/log sum freq | 61.94 |
| 5 | Assoc. | Sim. | PMI | Method5 | Sum freq ∗ log sum freq | Sum freq/log sum freq | 61.94 |
| 6 | Assoc. | Sim. | PMI | Method5 | w1idf ∗ w2idf ∗ weight | Sum freq/log sum freq | 61.94 |
| 7 | Assoc. | Sim. | PMI | Method5 | Weight/(w1df + w2df) | Sum freq/log sum freq | 59.35 |
| 8 | Assoc. | Sim. | PMI | Method5 | Prod freq ∗ log prod freq | Weight/(w1df + w2df) | 59.35 |
| 9 | Assoc. | Sim. | PMI | Method5 | Weight/(w1df ∗ w2df) | Weight/(w1df + w2df) | 58.71 |
| 10 | Assoc. | Sim. | Rel | Method4 | Sum freq/log sum freq | (w1idf + w2idf) ∗ weight | 58.06 |
| 11 | Assoc. | Sim. | PMI | Method5 | Sum freq/log sum freq | Sum freq/log sum freq | 58.06 |
| 12 | Assoc. | Sim. | PMI | Method5 | w1df ∗ w2df ∗ weight | Sum freq/log sum freq | 58.06 |
| 13 | Assoc. | Sim. | Rel | Method4 | Weight/(w1df + w2df) | (w1idf + w2idf) ∗ weight | 57.42 |
| 14 | Assoc. | Sim. | PMI | Method5 | (w1df + w2df) ∗ weight | Sum freq/log sum freq | 57.42 |
| 15 | Assoc. | Sim. | PMI | Method5 | (w1f + w2f) ∗ weight | Sum freq/log sum freq | 57.42 |
| 16 | Assoc. | Sim. | PMI | Method5 | Sum freq ∗ log sum freq | Weight/(w1df + w2df) | 57.42 |
| 17 | Assoc. | Sim. | Rel | Method4 | Weight/(w1df + w2df) | Weight/(w1idf ∗ w2idf) | 56.77 |
| 18 | Assoc. | Sim. | PMI | Method5 | (w1df + w2df) ∗ weight | Weight/(w1df + w2df) | 56.77 |
| 19 | Assoc. | Sim. | PMI | Method5 | (w1f + w2f) ∗ weight | Weight/(w1df + w2df) | 56.77 |
| 20 | Assoc. | Sim. | Rel | Method4 | Sum freq/log sum freq | Raw | 56.13 |
| 21 | Assoc. | Sim. | Rel | Method4 | Sum freq ∗ log sum freq | Weight/(w1idf ∗ w2idf) | 56.13 |
| 22 | Assoc. | Sim. | PMI | Method5 | (w1idf + w2idf) ∗ weight | Weight/(w1df + w2df) | 56.13 |
| 23 | Assoc. | Sim. | PMI | Method5 | Prod freq/log prod freq | Weight/(w1df + w2df) | 56.13 |
| 24 | Assoc. | Sim. | PMI | Rel | Prod freq/log prod freq | (w1idf + w2idf) ∗ weight | 56.13 |
| 25 | Assoc. | Sim. | PMI | Method1 | Prod freq ∗ log prod freq | Raw | 56.13 |
| 26 | Assoc. | Sim. | PMI | Method1 | Prod freq/log prod freq | Raw | 56.13 |
| 27 | Assoc. | Sim. | Rel | Method4 | w1idf ∗ w2idf ∗ weight | (w1idf + w2idf) ∗ weight | 55.48 |
| 28 | Assoc. | Sim. | Rel | Method4 | w1idf ∗ w2idf ∗ weight | Raw | 55.48 |
| 29 | Assoc. | Sim. | Rel | Method4 | Sum freq/log sum freq | Weight/(w1idf + w2idf) | 55.48 |
| 30 | Sim. | Sim. | Method1 | Method5 | Prod freq ∗ log prod freq | Sum freq/log sum freq | 55.48 |