Helena Gómez-Adorno, Grigori Sidorov, David Pinto, Darnes Vilariño, Alexander Gelbukh.
Abstract
We apply the integrated syntactic graph feature extraction methodology to the task of automatic authorship detection. This graph-based representation integrates different levels of language description into a single structure. We extract textual patterns based on features obtained from shortest path walks over integrated syntactic graphs and use them to determine the authors of documents. On average, our method outperforms state-of-the-art approaches and, unlike existing methods, gives consistently high results across different corpora. Our results show that these textual patterns are useful for the task of authorship attribution.
Keywords: authorship attribution; authorship verification; integrated syntactic graphs; shortest paths walks; syntactic n-grams; textual patterns
Year: 2016 PMID: 27589740 PMCID: PMC5038652 DOI: 10.3390/s16091374
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Dependency trees of three sentences of the target text, using the word_POS combination for the nodes and dependency labels for the edges.
Figure 2. The integrated syntactic graph for the three sentences shown in Figure 1.
Figure 3. Syntactic tree with the synonym expansion of a sentence: “Yes, here they come”.
Representation of a text using the feature extraction technique.
| Initial Node to Final Node | Lexical Features | | | | Morphological Features | | | | Syntactic Features | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | ⋯ | | | | ⋯ | | | | ⋯ | |
| | 1 | 1 | ⋯ | 0 | 1 | 1 | ⋯ | 0 | 1 | 0 | ⋯ | 1 |
| | 1 | 0 | ⋯ | 1 | 1 | 0 | ⋯ | 1 | 0 | 1 | ⋯ | 0 |
| | 0 | 0 | ⋯ | 1 | 0 | 0 | ⋯ | 1 | 0 | 1 | ⋯ | 0 |
| ⋮ | ⋮ | | | | ⋮ | | | | ⋮ | | | |
| | 0 | 0 | ⋯ | 0 | 0 | 0 | ⋯ | 1 | 1 | 0 | ⋯ | 0 |
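The representation above can be illustrated with a minimal sketch. It assumes toy dependency triples, a shared `ROOT` start node, directed head-to-dependent edges, and simple feature counts along each path; the function names, the example sentences, and the `lex:`/`pos:`/`dep:` feature keys are hypothetical and are not taken from the paper. Nodes follow the word_POS convention and edges carry dependency labels, as in the figure captions.

```python
from collections import Counter, deque

# Hypothetical toy input: each sentence is a list of (head, label, dependent)
# triples. Nodes use the word_POS convention; edges carry dependency labels.
SENTENCES = [
    [("ROOT", "root", "come_VBP"), ("come_VBP", "nsubj", "they_PRP"),
     ("come_VBP", "advmod", "here_RB")],
    [("ROOT", "root", "come_VBP"), ("come_VBP", "nsubj", "we_PRP"),
     ("come_VBP", "advmod", "soon_RB")],
]


def build_integrated_graph(sentences):
    """Merge dependency trees; identical word_POS nodes collapse into one."""
    graph = {}
    for triples in sentences:
        for head, label, dep in triples:
            # If the same edge recurs with another label, the last one wins
            # (a simplification made for this sketch).
            graph.setdefault(head, {})[dep] = label
            graph.setdefault(dep, {})
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest node path, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


def path_features(graph, path):
    """Count lexical, morphological and syntactic features along a path."""
    feats = Counter()
    for head, dep in zip(path, path[1:]):
        word, pos = dep.rsplit("_", 1)
        feats["lex:" + word] += 1               # lexical: word form
        feats["pos:" + pos] += 1                # morphological: POS tag
        feats["dep:" + graph[head][dep]] += 1   # syntactic: edge label
    return feats


graph = build_integrated_graph(SENTENCES)
path = shortest_path(graph, "ROOT", "we_PRP")
features = path_features(graph, path)
```

Each (initial node, final node) pair then yields one row of feature values. Mapping the counts onto a fixed vocabulary would give a vector with the same layout as the table above.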
Setup used in the experiments.
| Features | |||||||
|---|---|---|---|---|---|---|---|
| ✓ | ✓ | ||||||
| ✓ | ✓ | ||||||
| ✓ | ✓ | ✓ | |||||
| ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||
Results for authorship attribution using the C10 corpus. The best performing feature set is in bold.
| Feature Set | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| | 43.0 | 63.0 | 36.4 | 46.1 |
| | 55.0 | 65.2 | 51.2 | 57.4 |
| | 67.0 | 71.2 | 66.8 | 68.9 |
| | | 72.1 | 68.0 | 70.0 |
| | 65.0 | 65.2 | 65.6 | 65.4 |
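As a consistency check, the last three numeric values in each row can be read as precision, recall, and F1-score. The sketch below assumes the F1-score is the harmonic mean of the reported precision and recall, which the source does not state explicitly; the helper name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) triples copied from the table rows.
ROWS = [
    (63.0, 36.4, 46.1),
    (65.2, 51.2, 57.4),
    (71.2, 66.8, 68.9),
    (72.1, 68.0, 70.0),
    (65.2, 65.6, 65.4),
]

# Largest gap between the recomputed and the reported F1 value.
max_gap = max(abs(f1_score(p, r) - reported) for p, r, reported in ROWS)
```

Under this assumption, every reported F1-score is reproduced to within rounding (`max_gap` stays below 0.1).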
Figure 4. Accuracy for each of the ten authors.
Results for authorship attribution with supervised methods, using the C10 corpus. The best performing supervised approach is in bold.
| Supervised Methods | Accuracy |
|---|---|
| BOW (words) | 78.2 |
| BOW (characters) | 75.0 |
| Char trigrams | 80.8 |
| LH Char 3-grams [ |
Results for authorship verification using the extrinsic and intrinsic approaches for the PAN’13 English corpus. The best performing feature sets are in bold.
| Feature Set | Instance Based: Extrinsic | Profile Based: Extrinsic | Profile Based: Intrinsic |
|---|---|---|---|
| | 53.3 | 73.3 | 76.6 |
| 66.6 | 76.6 | ||
| 66.6 | 76.6 | ||
| Random Baseline | − | − | 50.0 |
| Best PAN’13 System | 80.0 | − | − |
Results for authorship verification using the instance-based and profile-based approaches with extrinsic and intrinsic methods for the PAN’14 English corpus. The best performing approach for each corpus is in bold, while our best results are underlined.
| Profile Based | ||||
|---|---|---|---|---|
| 48.8 | 51.5 | 60.0 | ||
| 48.5 | 52.0 | 66.5 | 61.0 | |
| 48.0 | 49.5 | 62.5 | 54.87 | |
| 52.7 | 57.0 | − | − | |
| 52.2 | − | − | ||
| 51.2 | 53.0 | − | − | |
| Baseline System | 53.0 | 44.5 | ||
| 1st. PAN’14 System | ||||
| 2nd. PAN’14 System | 65.5 | 65.0 | ||