Helena Gómez-Adorno, Ilia Markov, Grigori Sidorov, Juan-Pablo Posadas-Durán, Miguel A Sanchez-Perez, Liliana Chanona-Hernandez.
Abstract
We introduce a lexical resource for preprocessing social media data. We show that a neural network-based feature representation is enhanced by using this resource. We conducted experiments on the PAN 2015 and PAN 2016 author profiling corpora and obtained better results when performing the data preprocessing using the developed lexical resource. The resource includes dictionaries of slang words, contractions, abbreviations, and emoticons commonly used in social media. Each of the dictionaries was built for the English, Spanish, Dutch, and Italian languages. The resource is freely available.
Year: 2016 PMID: 27795703 PMCID: PMC5066026 DOI: 10.1155/2016/1638936
Source DB: PubMed Journal: Comput Intell Neurosci
Number of entries in each dictionary.
| Type of dictionary | Dutch | Italian | English | Spanish |
|---|---|---|---|---|
| Abbreviations | 1,237 | 107 | 1,346 | 527 |
| Contractions | 15 | 56 | 131 | 11 |
| Slang words | 250 | 362 | 1,296 | 939 |
| Emoticons | — | — | 482 | 482 |
| Total | 1,520 | 525 | 3,255 | 1,959 |
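One plausible way to apply the four dictionary types listed above is a lookup-and-replace pass over each text before feature extraction. The sketch below is only an illustration: the sample entries, the `normalize` function, and the choice to replace emoticons with a descriptive word are all assumptions, not the contents or the procedure of the actual resource.

```python
import re

# Hypothetical sample entries; the real dictionaries are far larger.
SLANG = {"gr8": "great", "luv": "love"}
CONTRACTIONS = {"can't": "cannot", "i'm": "i am"}
ABBREVIATIONS = {"btw": "by the way", "idk": "i do not know"}
EMOTICONS = {":)": "happy", ":(": "sad"}


def normalize(text: str) -> str:
    """Replace dictionary entries in a lowercased text with standard forms."""
    text = text.lower()
    # Emoticons are made of punctuation, so replace them as substrings first.
    for emoticon, meaning in EMOTICONS.items():
        text = text.replace(emoticon, f" {meaning} ")
    word_map = {**SLANG, **CONTRACTIONS, **ABBREVIATIONS}
    # Split into word-like tokens (keeping apostrophes) and single punctuation marks.
    tokens = re.findall(r"[\w']+|[^\w\s]", text)
    return " ".join(word_map.get(token, token) for token in tokens)


print(normalize("BTW I'm gr8 :)"))
```

Keeping one dictionary per entry type, as the resource does, makes it easy to switch individual replacements on or off per language.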
Figure 1: (a) Framework for learning word vectors and (b) framework for learning document vector.
Parameters of the Doc2vec method for each language.
| Language | Vector length | Window size | Minimum frequency |
|---|---|---|---|
| English | 200 | 14 | 3 |
| Spanish | 350 | 10 | 3 |
| Dutch | 200 | 11 | 5 |
| Italian | 200 | 4 | 4 |
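As a rough sketch, the per-language settings above could be passed to a Doc2vec implementation such as gensim. The use of gensim is an assumption here, as are the helper names `gensim_kwargs` and `build_model`; the keyword names `vector_size`, `window`, and `min_count` follow gensim 4.

```python
# Per-language settings copied from the table above.
D2V_PARAMS = {
    "English": {"vector_length": 200, "window_size": 14, "min_frequency": 3},
    "Spanish": {"vector_length": 350, "window_size": 10, "min_frequency": 3},
    "Dutch": {"vector_length": 200, "window_size": 11, "min_frequency": 5},
    "Italian": {"vector_length": 200, "window_size": 4, "min_frequency": 4},
}


def gensim_kwargs(language: str) -> dict:
    """Translate the table's parameter names into gensim 4 keyword names."""
    params = D2V_PARAMS[language]
    return {
        "vector_size": params["vector_length"],
        "window": params["window_size"],
        "min_count": params["min_frequency"],
    }


def build_model(language: str):
    """Create an untrained Doc2Vec model (assumes gensim 4 is installed)."""
    # Imported lazily so the parameter mapping works without gensim.
    from gensim.models.doc2vec import Doc2Vec

    return Doc2Vec(**gensim_kwargs(language))


print(gensim_kwargs("Spanish"))
```

All other training options (number of epochs, training algorithm, and so on) are left at library defaults in this sketch because the table does not specify them.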
Age and gender distribution over the PAN author profiling 2015 training corpus.
| Age group or gender | English | Spanish | Dutch | Italian |
|---|---|---|---|---|
| 18–24 | 58 | 22 | — | — |
| 25–34 | 60 | 46 | — | — |
| 35–49 | 22 | 22 | — | — |
| 50–xx | 12 | 10 | — | — |
| Male | 76 | 50 | 17 | 19 |
| Female | 76 | 50 | 17 | 19 |
Age and gender distribution over the PAN author profiling 2016 training corpus.
| Age group or gender | English | Spanish | Dutch |
|---|---|---|---|
| 18–24 | 26 | 16 | — |
| 25–34 | 135 | 64 | — |
| 35–49 | 181 | 126 | — |
| 50–64 | 78 | 38 | — |
| 65–xx | 6 | 6 | — |
| Male | 213 | 125 | 192 |
| Female | 213 | 125 | 192 |
Obtained results (accuracy, %) for age and gender classification on the PAN author profiling 2015 English training corpus under 10-fold cross-validation.
| Feature set (age) | LR-NP | LR-WP | SVM-NP | SVM-WP |
|---|---|---|---|---|
| D2V (1-gram) | 66.45 | | 68.42 | |
| D2V (1 + 2-grams) | | | | |
| D2V (1 + 2 + 3-grams) | 69.73 | | 68.42 | |
| Character 3-grams | 65.78 | | 66.44 | |
| Bag-of-Words | | | | |

| Feature set (gender) | LR-NP | LR-WP | SVM-NP | SVM-WP |
|---|---|---|---|---|
| D2V (1-gram) | 59.87 | | 56.57 | |
| D2V (1 + 2-grams) | 63.15 | | 61.84 | |
| D2V (1 + 2 + 3-grams) | | | | |
| Character 3-grams | 57.23 | | 59.21 | |
| Bag-of-Words | | 56.57 | | 55.26 |
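The feature-set labels in these tables refer to word n-grams (used as input tokens for Doc2vec) and character 3-grams. The following sketch shows how such token sequences could be produced; joining the words of an n-gram with an underscore is an assumption made for illustration, and both function names are hypothetical.

```python
def word_ngrams(tokens, max_n):
    """Return all word n-grams for n = 1..max_n, shortest n first."""
    grams = []
    for n in range(1, max_n + 1):
        # Slide a window of n tokens over the sequence.
        grams.extend(
            "_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return grams


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


tokens = "i love this song".split()
print(word_ngrams(tokens, 1))  # input for a "1-gram" configuration
print(word_ngrams(tokens, 2))  # input for a "1 + 2-grams" configuration
print(char_ngrams("song"))
```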
Obtained results (accuracy, %) for age and gender classification on the PAN author profiling 2015 Spanish training corpus under 10-fold cross-validation.
| Feature set (age) | LR-NP | LR-WP | SVM-NP | SVM-WP |
|---|---|---|---|---|
| D2V (1-gram) | 59.00 | | | 60.00 |
| D2V (1 + 2-grams) | 59.00 | | | |
| D2V (1 + 2 + 3-grams) | 62.00 | | 64.00 | |
| Character 3-grams | | | 64.00 | |
| Bag-of-Words | | 62.00 | | 60.00 |

| Feature set (gender) | LR-NP | LR-WP | SVM-NP | SVM-WP |
|---|---|---|---|---|
| D2V (1-gram) | | 63.00 | 63.00 | |
| D2V (1 + 2-grams) | | 66.00 | | 61.00 |
| D2V (1 + 2 + 3-grams) | | 67.00 | | 66.00 |
| Character 3-grams | | | | |
| Bag-of-Words | | 71.00 | | 72.00 |
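The accuracies in these tables come from 10-fold cross-validation. A minimal, dependency-free sketch of that evaluation loop is shown below. Stratifying the folds by class and pooling correct predictions over all folds are assumptions, and `train_and_predict` is a hypothetical callback standing in for a classifier such as logistic regression or an SVM.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, k=10):
    """Assign each sample index to one of k folds, spreading classes evenly."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


def cross_val_accuracy(samples, labels, train_and_predict, k=10):
    """Return accuracy (%) pooled over the k held-out folds."""
    correct = 0
    for held_out in stratified_folds(labels, k):
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        predictions = train_and_predict(
            [samples[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [samples[i] for i in held_out],
        )
        correct += sum(p == labels[i] for p, i in zip(predictions, held_out))
    return 100.0 * correct / len(samples)


def majority_baseline(train_x, train_y, test_x):
    """Toy classifier: always predict the most frequent training label."""
    top_label = Counter(train_y).most_common(1)[0][0]
    return [top_label] * len(test_x)


labels = ["male"] * 60 + ["female"] * 40
print(cross_val_accuracy(list(range(100)), labels, majority_baseline))
```

With balanced gender classes, as in the PAN corpora, a majority baseline of this kind sits near 50%, which gives a useful reference point for the gender columns.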
Obtained results (accuracy, %) for gender classification on the PAN author profiling 2015 Dutch training corpus under 10-fold cross-validation.
| Feature set (gender) | LR-NP | LR-WP | SVM-NP | SVM-WP |
|---|---|---|---|---|
| D2V (1-gram) | 61.76 | | 61.76 | |
| D2V (1 + 2-grams) | 64.71 | | 67.65 | |
| D2V (1 + 2 + 3-grams) | | 58.82 | | 64.71 |
| Character 3-grams | | | | |
| Bag-of-Words | 64.71 | | 64.71 | |
Obtained results (accuracy, %) for gender classification on the PAN author profiling 2015 Italian training corpus under 10-fold cross-validation.
| Feature set (gender) | LR-NP | LR-WP | SVM-NP | SVM-WP |
|---|---|---|---|---|
| D2V (1-gram) | | | | 68.42 |
| D2V (1 + 2-grams) | | | 68.42 | |
| D2V (1 + 2 + 3-grams) | 78.95 | | 78.95 | |
| Character 3-grams | | | | |
| Bag-of-Words | 76.32 | | | |
Obtained results (accuracy, %) for age and gender classification on the PAN author profiling 2016 English training corpus under 10-fold cross-validation.
| Feature set (age) | LR-NP | LR-WP | SVM-NP | SVM-WP |
|---|---|---|---|---|
| D2V (1-gram) | 44.71 | | | 41.65 |
| D2V (1 + 2-grams) | | | | 42.49 |
| D2V (1 + 2 + 3-grams) | 41.41 | | 40.71 | |
| Character 3-grams | 39.53 | | 37.65 | |
| Bag-of-Words | | 39.44 | | 39.91 |

| Feature set (gender) | LR-NP | LR-WP | SVM-NP | SVM-WP |
|---|---|---|---|---|
| D2V (1-gram) | 73.18 | | | 70.66 |
| D2V (1 + 2-grams) | | | 71.53 | |
| D2V (1 + 2 + 3-grams) | 71.53 | | 69.41 | |
| Character 3-grams | 68.47 | | 69.65 | |
| Bag-of-Words | 69.18 | | 67.76 | |
Obtained results (accuracy, %) for age and gender classification on the PAN author profiling 2016 Spanish training corpus under 10-fold cross-validation.
| Feature set (age) | LR-NP | LR-WP | SVM-NP | SVM-WP |
|---|---|---|---|---|
| D2V (1-gram) | 44.40 | | | 43.60 |
| D2V (1 + 2-grams) | 47.20 | | 46.40 | |
| D2V (1 + 2 + 3-grams) | | | | |
| Character 3-grams | 50.80 | | | 47.60 |
| Bag-of-Words | | 47.60 | 44.00 | |

| Feature set (gender) | LR-NP | LR-WP | SVM-NP | SVM-WP |
|---|---|---|---|---|
| D2V (1-gram) | | 68.00 | | 64.80 |
| D2V (1 + 2-grams) | 69.60 | | 68.00 | |
| D2V (1 + 2 + 3-grams) | 70.40 | | | |
| Character 3-grams | 68.00 | | 61.60 | |
| Bag-of-Words | | 63.60 | 58.80 | |
Obtained results (accuracy, %) for gender classification on the PAN author profiling 2016 Dutch training corpus under 10-fold cross-validation.
| Feature set (gender) | LR-NP | LR-WP | SVM-NP | SVM-WP |
|---|---|---|---|---|
| D2V (1-gram) | 74.74 | | 71.09 | |
| D2V (1 + 2-grams) | 70.83 | | 71.09 | |
| D2V (1 + 2 + 3-grams) | 73.44 | | 70.31 | |
| Character 3-grams | | 72.66 | | 72.92 |
| Bag-of-Words | | 71.88 | | 70.83 |
Obtained results (accuracy, %) for age and gender classification using SVM classifier on the PAN author profiling 2015 English training corpus under 10-fold cross-validation with vectors extracted only from the training data.
| Feature set | Age, SVM-NP | Age, SVM-WP | Gender, SVM-NP | Gender, SVM-WP |
|---|---|---|---|---|
| D2V (1-gram) | 66.57 | | 59.38 | |
| D2V (1 + 2-grams) | | 65.55 | 63.13 | |
| D2V (1 + 2 + 3-grams) | 66.97 | | | |
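The phrase "vectors extracted only from the training data" means the representation model is fitted inside each fold, so held-out documents never influence it. The sketch below illustrates that separation with a simple count vocabulary standing in for the Doc2vec model; the swap is deliberate to keep the example dependency-free, and both function names are hypothetical.

```python
from collections import Counter


def fit_vocabulary(train_docs, min_frequency=1):
    """Build a token-to-column index from the training documents only."""
    counts = Counter(token for doc in train_docs for token in doc)
    kept = sorted(t for t, c in counts.items() if c >= min_frequency)
    return {token: column for column, token in enumerate(kept)}


def vectorize(doc, vocabulary):
    """Count vector for one document; tokens unseen in training are ignored."""
    vector = [0] * len(vocabulary)
    for token in doc:
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector


train_docs = [["good", "song"], ["good", "day"]]
test_doc = ["good", "good", "movie"]

vocabulary = fit_vocabulary(train_docs)  # fitted on the training fold only
print(vectorize(test_doc, vocabulary))   # "movie" contributes nothing
```

Fitting the representation on the full corpus instead would let information from held-out documents shape the features, which tends to make cross-validation scores optimistic.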
Obtained results (accuracy, %) for age and gender classification using SVM classifier on the PAN author profiling 2015 Spanish training corpus under 10-fold cross-validation with vectors extracted only from the training data.
| Feature set | Age, SVM-NP | Age, SVM-WP | Gender, SVM-NP | Gender, SVM-WP |
|---|---|---|---|---|
| D2V (1-gram) | | 56.67 | | |
| D2V (1 + 2-grams) | 68.33 | | | |
| D2V (1 + 2 + 3-grams) | | | 65.00 | |
Obtained results (accuracy, %) for gender classification using SVM classifier on the PAN author profiling 2015 Dutch training corpus under 10-fold cross-validation with vectors extracted only from the training data.
| Feature set (gender) | SVM-NP | SVM-WP |
|---|---|---|
| D2V (1-gram) | | |
| D2V (1 + 2-grams) | 65.00 | |
| D2V (1 + 2 + 3-grams) | 65.00 | |
Obtained results (accuracy, %) for age and gender classification using SVM classifier on the PAN author profiling 2016 English training corpus under 10-fold cross-validation with vectors extracted only from the training data.
| Feature set | Age, SVM-NP | Age, SVM-WP | Gender, SVM-NP | Gender, SVM-WP |
|---|---|---|---|---|
| D2V (1-gram) | 41.89 | | | 73.27 |
| D2V (1 + 2-grams) | | | | |
| D2V (1 + 2 + 3-grams) | 42.83 | | 75.77 | |
Obtained results (accuracy, %) for age and gender classification using SVM classifier on the PAN author profiling 2016 Spanish training corpus under 10-fold cross-validation with vectors extracted only from the training data.
| Feature set | Age, SVM-NP | Age, SVM-WP | Gender, SVM-NP | Gender, SVM-WP |
|---|---|---|---|---|
| D2V (1-gram) | | 42.94 | 73.17 | |
| D2V (1 + 2-grams) | | 43.21 | 74.42 | |
| D2V (1 + 2 + 3-grams) | 44.53 | | | |
Obtained results (accuracy, %) for gender classification using SVM classifier on the PAN author profiling 2016 Dutch training corpus under 10-fold cross-validation with vectors extracted only from the training data.
| Feature set (gender) | SVM-NP | SVM-WP |
|---|---|---|
| D2V (1-gram) | | 73.67 |
| D2V (1 + 2-grams) | | |
| D2V (1 + 2 + 3-grams) | 73.95 | |
Significance levels.
| Symbol | Significance level | Significance |
|---|---|---|
| = | p > 0.05 | Not significant |
| + | 0.05 ≥ p > 0.01 | Significant |
| ++ | 0.01 ≥ p | Very significant |
| +++ | | Highly significant |
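The symbol scheme above can be expressed as a small lookup on the p-value. In this sketch the 0.05 and 0.01 boundaries come from the table, while treating "=" as p > 0.05 is inferred from them and the default cutoff of 0.001 for "+++" is an assumption based on common convention; the function name and the `highly` parameter are hypothetical.

```python
def significance_symbol(p_value: float, highly: float = 0.001) -> str:
    """Map a p-value to the table's significance symbol.

    The `highly` cutoff for "+++" is an assumed value, not taken from the table.
    """
    if p_value > 0.05:
        return "="    # not significant
    if p_value > 0.01:
        return "+"    # significant
    if p_value > highly:
        return "++"   # very significant
    return "+++"      # highly significant


for p in (0.2, 0.03, 0.005, 0.0005):
    print(p, significance_symbol(p))
```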
Significance of results differences between pairs of experiments for the English, Spanish, and Dutch languages, where NP corresponds to “without preprocessing” and WP, “with preprocessing.”
| Approaches | English 2015 | English 2016 | Spanish 2015 | Spanish 2016 | Dutch 2015 | Dutch 2016 |
|---|---|---|---|---|---|---|
| D2V-NP versus D2V-WP | + | + | = | = | = | = |
| Char. 3-grams-NP versus D2V-WP | +++ | +++ | + | +++ | ++ | ++ |
| Bag-of-Words-NP versus D2V-WP | +++ | +++ | = | +++ | + | +++ |