| Literature DB >> 29527072 |
Andi Rexha, Mark Kröll, Hermann Ziak, Roman Kern.
Abstract
The goal of our work is inspired by the task of associating segments of text to their real authors. In this work, we focus on analyzing the way humans judge different writing styles. This analysis can help to better understand this process and to thus simulate/ mimic such behavior accordingly. Unlike the majority of the work done in this field (i.e. authorship attribution, plagiarism detection, etc.) which uses content features, we focus only on the stylometric, i.e. content-agnostic, characteristics of authors. Therefore, we conducted two pilot studies to determine, if humans can identify authorship among documents with high content similarity. The first was a quantitative experiment involving crowd-sourcing, while the second was a qualitative one executed by the authors of this paper. Both studies confirmed that this task is quite challenging. To gain a better understanding of how humans tackle such a problem, we conducted an exploratory data analysis on the results of the studies. In the first experiment, we compared the decisions against content features and stylometric features. While in the second, the evaluators described the process and the features on which their judgment was based. The findings of our detailed analysis could (1) help to improve algorithms such as automatic authorship attribution as well as plagiarism detection, (2) assist forensic experts or linguists to create profiles of writers, (3) support intelligence applications to analyze aggressive and threatening messages and (4) help editor conformity by adhering to, for instance, journal specific writing style.Entities:
Keywords: Authorship identification; Content agnostic stylometry; High content similarity; Writing style analysis
Year: 2018 PMID: 29527072 PMCID: PMC5838116 DOI: 10.1007/s11192-018-2661-6
Source DB: PubMed Journal: Scientometrics ISSN: 0138-9130 Impact factor: 3.238
Fig. 1Description of a task. Three evaluators are assigned to each task and rank each target of the experiment from the “most similar” to the “least similar”
Fig. 2Example of a target snippet presented to the users in the qualitative evaluation. The list of features is added to half of the experiments to help the ranking process
List of stylometric features used to calculate the similarity between text snippets
| Feature name | Description |
|---|---|
| Alpha-chars-ratio | The fraction of total characters in the paragraph which are letters |
| Digit-chars-ratio | The fraction of total characters in the paragraph which are digits |
| Upper-chars-ratio | The fraction of total characters in the paragraph which are upper-case |
| White-chars-ratio | The fraction of total characters in the paragraph which are whitespace characters |
| Type-token-ratio | Ratio between the size of the vocabulary (i.e. the number of different words) and the total number of words |
| Hapax-legomena | The number of words occurring once |
| Hapax-dislegomena | The number of words occurring twice |
| Yules-k | A vocabulary richness measure defined by Yule |
| Simpsons-d | A vocabulary richness measure defined by Simpson |
| Brunets-w | A vocabulary richness measure defined by Brunet |
| Sichels-s | A vocabulary richness measure defined by Sichel |
| Honores-h | A vocabulary richness measure defined by Honore |
| Average-word-length | Average length of words in characters |
| Average-sentence-char-length | Average length of sentences in characters |
| Average-sentence-word-length | Average length of sentences in words |
Many of these features are defined in Tweedie and Baayen (1998)
Pilot study #1: ranking by the crowd-sourcing evaluators for snippets from the same author and snippets from the same journal but a different author
| Snippet/ranking | Most similar (%) | Similar (%) | Less similar (%) | Least similar (%) |
|---|---|---|---|---|
| Same author | 19 | 45 | 24 | 12 |
| Same journal | 21 | 41 | 26 | 12 |
The expectation for a random selection is 25%
Fig. 3Box plots representing the distribution of the annotators’ agreement over the similarity of the content
Fig. 4Box plots representing the distribution of the annotators’ agreement over the similarity of the writing style
Fig. 5Scatter plot relating three dimensions: style similarity, content similarity and agreement between annotators
Pilot study #2: precision of the annotators in finding the same author as the source snippet
| | Precision without features | Precision with features |
|---|---|---|
| Annotator #1 | 0.22 | 0.22 |
| Annotator #2 | 0.27 | 0.29 |
| Annotator #3 | 0.16 | 0.25 |
| Annotator #4 | 0.11 | 0.15 |
Results are presented for experiments provided with and without features to help with the ranking (the expected precision for a random selection is 25%). Please note: Annotator #4 performed the ranking task with respect to content features