Maryam Ebrahimpour, Tālis J Putniņš, Matthew J Berryman, Andrew Allison, Brian W-H Ng, Derek Abbott.
Abstract
In this paper, we develop two automated authorship attribution schemes, one based on Multiple Discriminant Analysis (MDA) and the other based on a Support Vector Machine (SVM). The classification features we exploit are based on word frequencies in the text. We adopt an approach of preprocessing each text by stripping it of all characters except a-z and space. This is in order to increase the portability of the software to different types of texts. We test the methodology on a corpus of undisputed English texts, and use leave-one-out cross validation to demonstrate classification accuracies in excess of 90%. We further test our methods on the Federalist Papers, which have a partly disputed authorship and a fair degree of scholarly consensus. Finally, we apply our methodology to the question of the authorship of the Letter to the Hebrews by comparing it against a number of original Greek texts of known authorship. These tests identify where some of the limitations lie, motivating a number of open questions for future work. An open source implementation of our methodology is freely available for use at https://github.com/matthewberryman/author-detection.
Year: 2013 PMID: 23437047 PMCID: PMC3577839 DOI: 10.1371/journal.pone.0054998
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
English Text Corpus of Known Authorship.
| Author | No. | Title |
| Bower, B. M. (1871–1940) | 1 | Cabin Fever |
| | 2 | Casey Ryan |
| | 3 | Chip, of the Flying U |
| | 4 | Cow-Country |
| | 5 | The Flying U Ranch |
| | 6 | The Flying U's Last Stand |
| | 7 | Good Indian |
| | 8 | The Gringos |
| | 9 | The Happy Family |
| | 10 | The Heritage of the Sioux |
| | 11 | Her Prairie Knight |
| | 12 | Jean of the Lazy A |
| | 13 | Lonesome Land |
| | 14 | The Lonesome Trail and Other Stories |
| | 15 | The Long Shadow |
| | 16 | The Phantom Herd |
| | 17 | The Range Dwellers |
| | 18 | Rowdy of the Cross L |
| | 19 | Starr, of the Desert |
| | 20 | The Thunder Bird |
| | 21 | The Trail of the White Mule |
| | 22 | The Uphill Climb |
| | 23 | Good Indian |
| | 24 | The Gringos |
| | 25 | Good Indian End |
| Davis | 1 | The Amateur |
| | 2 | Billy and the Big Stick |
| | 3 | Captain Macklin |
| | 4 | A Charmed Life |
| | 5 | Cinderella and Other Stories |
| | 6 | The Congo and Coasts of Africa |
| | 7 | The Consul |
| | 8 | The Deserter |
| | 9 | The Frame Up |
| | 10 | Gallegher and Other Stories |
| | 11 | In the Fog |
| | 12 | The King's Jackal |
| | 13 | Lion and the Unicorn |
| | 14 | The Log of the Jolly Polly |
| | 15 | The Lost House |
| | 16 | The Lost Road |
| | 17 | The Make-Believe Man |
| | 18 | The Man Who Could Not Lose |
| | 19 | The Messengers |
| | 20 | My Buried Treasure |
| | 21 | The Nature Faker |
| | 22 | Peace Manoeuvres |
| | 23 | The Princess Aline |
| | 24 | A Question of Latitude |
| | 25 | Ranson's Folly |
| | 26 | The Red Cross Girl |
| Dickens | 1 | Barnaby Rudge |
| | 2 | The Battle of Life |
| | 3 | Bleak House |
| | 4 | The Chimes |
| | 5 | A Christmas Carol |
| | 6 | Dombey and Son |
| | 7 | George Silverman's Explanation |
| | 8 | Going into Society |
| | 9 | Great Expectations |
| | 10 | Hard Times |
| | 11 | A House to Let |
| | 12 | Hunted Down |
| | 13 | The Lamplighter |
| | 14 | Lazy Tour of Two Idle Apprentices |
| | 15 | Little Dorrit |
| | 16 | The Loving Ballad of Lord Bateman |
| | 17 | Martin Chuzzlewit |
| | 18 | Master Humphrey's Clock |
| | 19 | A Message from the Sea |
| | 20 | Mrs. Lirriper's Legacy |
| | 21 | Mugby Junction |
| | 22 | Oliver Twist |
| | 23 | The Holly-Tree |
| | 24 | A House to Let |
| | 25 | Hunted Down |
| Doyle | 1 | The Adventure of the Bruce-Partington Plans |
| | 2 | The Adventure of the Cardboard Box |
| | 3 | The Adventure of the Devil's Foot |
| | 4 | The Adventure of the Dying Detective |
| | 5 | The Adventure of the Red Circle |
| | 6 | The Adventure of Wisteria Lodge |
| | 7 | The Adventures of Gerard |
| | 8 | The Adventures of Sherlock Holmes |
| | 9 | Beyond the City |
| | 10 | The Captain of the Polestar |
| | 11 | The Doings of Raffles Haw |
| | 12 | A Duet: A Duologue |
| | 13 | The Exploits of Brigadier Gerard |
| | 14 | The Firm of Girdlestone |
| | 15 | The Green Flag |
| | 16 | His Last Bow |
| | 17 | The Hound of the Baskervilles |
| | 18 | The Lost World |
| | 19 | The Mystery of Cloomber |
| | 20 | The Parasite |
| | 21 | The Poison Belt |
| | 22 | Round the Red Lamp |
| | 23 | The Sign of the Four |
| | 24 | The Valley of Fear |
| | 25 | The Adventure of the Red Circle |
| | 26 | The Adventure of Wisteria Lodge |
| Grey | 1 | Betty Zane |
| | 2 | The Border Legion |
| | 3 | The Call of the Canyon |
| | 4 | The Day of the Beast |
| | 5 | Desert Gold |
| | 6 | The Desert of Wheat |
| | 7 | Heritage of the Desert |
| | 8 | The Last of the Plainsmen |
| | 9 | The Last Trail |
| | 10 | Light of the Western Stars |
| | 11 | The Man of the Forest |
| | 12 | The Mysterious Rider |
| | 13 | The Rainbow Trail |
| | 14 | The Redheaded Outfield |
| | 15 | Riders of the Purple Sage |
| | 16 | The Rustlers of Pecos County |
| | 17 | The Spirit of the Border |
| | 18 | Tales of Lonely Trails |
| | 19 | To the Last Man |
| | 20 | The U. P. Trail |
| | 21 | Wildfire |
| | 22 | The Young Forester |
| | 23 | The Border Legion End |
| | 24 | Light of the Western Stars End |
| | 25 | To the Last Man End |
| | 26 | |
| James | 1 | The Altar of the Dead |
| | 2 | The Ambassadors |
| | 3 | The American |
| | 4 | The Aspern Papers |
| | 5 | The Awkward Age |
| | 6 | The Beast in the Jungle |
| | 7 | The Beldonald Holbein |
| | 8 | A Bundle of Letters |
| | 9 | The Chaperon |
| | 10 | Confidence |
| | 11 | The Coxon Fund |
| | 12 | Daisy Miller |
| | 13 | The Death of the Lion |
| | 14 | The Diary of a Man of Fifty |
| | 15 | Eugene Pickering |
| | 16 | The Europeans |
| | 17 | The Figure in the Carpet |
| | 18 | Glasses |
| | 19 | Greville Fane |
| | 20 | An International Episode |
| | 21 | In the Cage |
| | 22 | The Jolly Corner |
| | 23 | The Lesson of the Master |
| | 24 | Louisa Pallant |
| | 25 | Madame De Mauves |
| | 26 | The Madonna of the Future |
| Lang | 1 | Alfred Tennyson |
| | 2 | Angling Sketches |
| | 3 | The Arabian Nights |
| | 4 | The Blue Fairy Book |
| | 5 | The Book of Dreams and Ghosts |
| | 6 | Books and Bookmen |
| | 7 | The Brown Fairy Book |
| | 8 | Cock Lane and Common-Sense |
| | 9 | The Crimson Fairy Book |
| | 10 | Custom and Myth |
| | 11 | Grass of Parnassus |
| | 12 | The Green Fairy Book |
| | 13 | In the Wrong Paradise |
| | 14 | The Library |
These are the known texts used in this study for benchmarking the algorithms and indicating their accuracy.
Figure 1. Number of function words vs. LOO-CV accuracy.
The SVM uses a polynomial kernel with parameters […]. Both MDA and SVM accuracies increase with the number of function words up to 100 words, and neither improves significantly beyond this point. These tests use the known English corpus given in Table 1.
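As a point of reference, a polynomial kernel is commonly written in the LIBSVM and scikit-learn convention as K(x, y) = (gamma · ⟨x, y⟩ + coef0)^degree. The sketch below evaluates that general form. The function name, parameter names, and default values are illustrative assumptions; they are not the settings used for this figure.

```python
from typing import Sequence


def polynomial_kernel(x: Sequence[float], y: Sequence[float],
                      gamma: float = 1.0, coef0: float = 0.0,
                      degree: int = 3) -> float:
    """Evaluate (gamma * <x, y> + coef0) ** degree.

    The defaults are placeholders (assumptions), not the paper's values.
    """
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


# Two toy word-frequency vectors (made-up numbers for illustration only).
print(polynomial_kernel([1.0, 2.0], [3.0, 4.0], gamma=1.0, coef0=1.0, degree=2))
```

In an SVM, this value replaces the plain dot product between two feature vectors, which lets the classifier draw curved decision boundaries in the original word-frequency space.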
LOO-CV results for MDA classification of the English corpus.
| Actual author / predicted author | Bower | Davis | Dickens | Doyle | Grey | James | Lang | Total |
| Bower | 25 | 0 | 0 | 0 | 0 | 0 | 0 | 25 |
| Davis | 0 | 26 | 0 | 0 | 0 | 0 | 0 | 26 |
| Dickens | 0 | 0 | 25 | 0 | 0 | 0 | 0 | 25 |
| Doyle | 0 | 1 | 0 | 24 | 0 | 1 | 0 | 26 |
| Grey | 0 | 0 | 0 | 1 | 25 | 0 | 0 | 26 |
| James | 0 | 0 | 0 | 0 | 0 | 26 | 0 | 26 |
| Lang | 1 | 1 | 0 | 0 | 0 | 1 | 11 | 14 |
Here, 162 out of 168 texts are classified correctly so the accuracy is 96.4%.
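The accuracy figure follows directly from the confusion matrix: the diagonal holds the correctly classified texts and the sum of all cells is the corpus size. A minimal sketch, with the matrix copied from the MDA table above (the variable and function names are illustrative):

```python
# Rows are actual authors, columns are predicted authors, in the order
# Bower, Davis, Dickens, Doyle, Grey, James, Lang.
MDA_CONFUSION = [
    [25, 0, 0, 0, 0, 0, 0],
    [0, 26, 0, 0, 0, 0, 0],
    [0, 0, 25, 0, 0, 0, 0],
    [0, 1, 0, 24, 0, 1, 0],
    [0, 0, 0, 1, 25, 0, 0],
    [0, 0, 0, 0, 0, 26, 0],
    [1, 1, 0, 0, 0, 1, 11],
]


def accuracy(matrix):
    """Return (correct, total, fraction correct) for a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct, total, correct / total


correct, total, fraction = accuracy(MDA_CONFUSION)
print(f"{correct} of {total} correct: {100 * fraction:.1f}%")
```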
LOO-CV results for SVM classification of the English corpus.
| Actual author / predicted author | Bower | Davis | Dickens | Doyle | Grey | James | Lang | Total |
| Bower | 22 | 2 | 1 | 0 | 0 | 0 | 0 | 25 |
| Davis | 0 | 24 | 0 | 0 | 0 | 2 | 0 | 26 |
| Dickens | 0 | 0 | 24 | 1 | 0 | 0 | 0 | 25 |
| Doyle | 0 | 0 | 0 | 25 | 0 | 0 | 1 | 26 |
| Grey | 1 | 0 | 0 | 1 | 24 | 0 | 0 | 26 |
| James | 1 | 0 | 0 | 2 | 0 | 23 | 0 | 26 |
| Lang | 0 | 1 | 0 | 0 | 0 | 0 | 13 | 14 |
Here, 155 texts out of 168 are classified correctly so the accuracy is 92.2%.
Figure 2. Number of texts per author vs. accuracy of MDA classifier.
This graph investigates accuracy versus the size of the training dataset for the MDA case, with a fixed set of 100 function words, for the benchmark English corpus of known texts given in Table 1. The upper curve shows the LOO-CV accuracy of MDA, as a function of the number of author texts, by deliberately limiting the size of the training dataset. The lower curve shows the MDA accuracy that is obtained by inputting the hold-out texts to the classifier at each step.
Figure 3. Number of texts per author vs. accuracy of SVM classifier.
This graph investigates accuracy versus the size of the training dataset for the SVM case, with a fixed set of 95 function words, for the benchmark English corpus of known texts given in Table 1. The SVM utilizes a polynomial kernel with parameters […].
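Leave-one-out cross validation (LOO-CV), which underlies the accuracies reported in these figures, trains on all texts but one, classifies the held-out text, and repeats this for every text. The sketch below shows the loop. The nearest-centroid classifier is a stand-in chosen for brevity; it is an assumption for illustration and is neither the MDA nor the SVM classifier of the paper, and all function names are hypothetical.

```python
def fit_centroids(vectors, labels):
    """Stand-in classifier: compute the mean feature vector per author."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict_centroid(model, vec):
    """Return the author whose centroid is nearest in Euclidean distance."""
    return min(model, key=lambda lab: sum((a - b) ** 2
                                          for a, b in zip(vec, model[lab])))


def loo_cv_accuracy(vectors, labels, fit=fit_centroids, predict=predict_centroid):
    """Hold out each text in turn, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(vectors)):
        train_x = vectors[:i] + vectors[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        correct += predict(model, vectors[i]) == labels[i]
    return correct / len(vectors)


# Toy data: two well-separated "authors" with three texts each.
X = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [1.0, 0.9], [0.9, 1.0], [1.0, 1.0]]
y = ["a", "a", "a", "b", "b", "b"]
print(loo_cv_accuracy(X, y))
```

Passing different `fit` and `predict` callables lets the same loop score any classifier, and truncating `X` and `y` per author reproduces the kind of training-size experiment shown in these figures.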
Extraction of function words.
| Rank | English corpus | Federalist Papers | Greek texts | Rank | English corpus | Federalist Papers | Greek texts |
| 1 | the | the | kai | 41 | said | at | loipon |
| 2 | and | of | eis | 42 | when | one | ean |
| 3 | of | to | o | 43 | if | them | hmwn |
| 4 | to | and | toy | 44 | out | people | toyto |
| 5 | a | in | na | 45 | what | these | epi |
| 6 | i | a | ton | 46 | we | if | kata |
| 7 | in | be | de | 47 | been | those | egw |
| 8 | he | that | en | 48 | would | any | ws |
| 9 | was | it | to | 49 | up | most | tas |
| 10 | that | which | thn | 50 | no | no | tis |
| 11 | it | is | pros | 51 | or | we | legei |
| 12 | his | as | ayton | 52 | man | who | peri |
| 13 | you | by | aytoy | 53 | who | can | alla |
| 14 | had | this | dia | 54 | them | his | meta |
| 15 | with | or | den | 55 | are | must | ostis |
| 16 | as | for | ths | 56 | then | there | panta |
| 17 | for | have | h | 57 | upon | constitution | opoion |
| 18 | her | would | twn | 58 | into | upon | ihsoy |
| 19 | at | will | einai | 59 | their | union | qeos |
| 20 | she | not | oti | 60 | could | such | ymwn |
| 21 | but | from | toys | 61 | your | was | all |
| 22 | him | with | oi | 62 | very | so | hmas |
| 23 | not | their | ta | 63 | little | i | esas |
| 24 | is | on | sas | 64 | do | same | hto |
| 25 | on | are | aytoys | 65 | some | every | xristoy |
| 26 | my | government | qeoy | 66 | like | against | se |
| 27 | have | an | mh | 67 | down | national | omws |
| 28 | be | they | moy | 68 | more | authority | kyrioy |
| 29 | me | states | me | 69 | will | should | qeon |
| 30 | they | been | qelei | 70 | can | our | qelw |
| 31 | from | may | dioti | 71 | over | might | ec |
| 32 | this | power | tw | 72 | did | were | kaqws |
| 33 | which | all | ek | 73 | about | ought | tayta |
| 34 | there | other | soy | 74 | now | into | kyrios |
| 35 | one | its | apo | 75 | see | federal | ypo |
| 36 | all | but | aytwn | 76 | old | general | as |
| 37 | so | has | ti | 77 | only | under | aytos |
| 38 | were | state | ihsoys | 78 | time | public | eme |
| 39 | an | more | th | 79 | know | had | gar |
| 40 | by | than | eipe | 80 | any | shall | seis |
| 81 | never | great | aythn | 91 | come | time | palin |
| 82 | before | men | pantes | 92 | young | well | oy |
| 83 | well | only | legw | 93 | here | united | oytos |
| 84 | back | some | pneyma | 94 | mr | could | di |
| 85 | has | less | met | 95 | made | part | sy |
| 86 | other | he | idoy | 96 | good | us | ihsoyn |
| 87 | than | between | chapter | 97 | eyes | different | epeidh |
| 88 | two | each | tois | 98 | under | members | oyxi |
| 89 | how | necessary | para | 99 | first | particular | yios |
| 90 | where | first | men | 100 | each | legislative | eipen |
In this paper, our investigation uses up to 100 function words for the English corpus, the Federalist Papers, and the Koine Greek texts. The frequencies of function words in each text are used as classification features. The texts are initially pre-processed as follows: (i) all letters are changed to lower case; (ii) all accents are removed; (iii) all ASCII characters not in the set a-z (ASCII codes 97–122) and space (ASCII code 32) are removed without insertion of a space, with the exception of the hyphen (ASCII code 45), which is replaced by a space (ASCII code 32); (iv) all headings are removed from the texts, so that they contain only free-flowing paragraphs; (v) any extra items added by modern editors, such as editorial notes, are removed. Whilst hagiographers wrote the Greek texts in capital letters without accents or punctuation, modern editors insert these items for ease of interpretation. Thus, in order to recover the original Greek text, we apply steps (i) to (v). We do the same to the English texts so that they act as a punctuation-free benchmark test. As our software only handles the reduced 27-character ASCII set a-z and space, the Greek text is transliterated using Table 8. After this pre-processing, all the texts within each corpus are concatenated and word frequencies are counted. Words are ranked in descending order of frequency of occurrence, as shown in this table. These key words are called function words.
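Steps (i) to (iii) and the ranking of function words can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: newlines and tabs are treated as spaces, runs of spaces are collapsed, and frequencies are normalised by the total word count of each text. Steps (iv) and (v), which remove headings and editorial notes, are assumed to have been done beforehand. All function names are hypothetical.

```python
import re
import unicodedata
from collections import Counter


def preprocess(text: str) -> str:
    """Reduce a text to the 27-character set a-z plus space."""
    text = text.lower()                                   # step (i)
    text = unicodedata.normalize("NFD", text)             # step (ii): split off accents
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Step (iii): a hyphen becomes a space. Assumption: other whitespace
    # (newlines, tabs) is also treated as a space.
    text = re.sub(r"[-\s]", " ", text)
    text = re.sub(r"[^a-z ]", "", text)                   # drop everything else, no space inserted
    return re.sub(r" +", " ", text).strip()               # assumption: collapse repeated spaces


def top_function_words(texts, n=100):
    """Concatenate a corpus and return its n most frequent words."""
    counts = Counter()
    for text in texts:
        counts.update(preprocess(text).split())
    return [word for word, _ in counts.most_common(n)]


def feature_vector(text, function_words):
    """Relative frequency of each function word in one text (assumed normalisation)."""
    words = preprocess(text).split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[word] / total for word in function_words]


corpus = ["The cat and the dog.", "The end"]
words = top_function_words(corpus, n=3)
print(words)
print(feature_vector(corpus[0], words))
```

Because punctuation is deleted without inserting a space, a contraction such as "isn't" becomes the single token "isnt", which is consistent with the rule stated in step (iii).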
Figure 4. Canonical discriminant functions for the Federalist Papers.
This is the result of MDA on the Federalist Papers using two discriminant functions. Each point represents a text, which is plotted according to the values of its discriminant functions. Here, 75 function words are utilised, which yields the most accurate result. Open circles indicate known texts, asterisks indicate the 13 disputed texts in question, and the crosses indicate the centroids of the known author clusters.
The predicted authors for the 13 disputed Federalist Papers.
| Text No. | MDA | SVM |
| 49 | Madison | Madison |
| 50 | Hamilton | Madison |
| 51 | Madison* | Madison |
| 52 | Madison | Madison |
| 53 | Madison* | Madison |
| 54 | Madison | Madison |
| 55 | Hamilton* | Hamilton |
| 56 | Madison | Hamilton |
| 57 | Madison | Hamilton |
| 58 | Madison | Madison |
| 62 | Madison* | Madison |
| 63 | Madison | Madison |
| 64 | Jay | Jay |
These results use 75 function words for both methods, which yields the best accuracy. The MDA results are selected using a simple approach in which the author with the lowest distance (highlighted in bold in Table 6) is selected. However, greater certainty is achieved if the other contending authors have high distances from the text being classified. The cases where both remaining authors have distances greater than or approximately equal to the longest distance (LD) of a known text are indicated with an asterisk. Thus entries with asterisks have a high degree of certainty, whereas those without asterisks are less certain and may have resulted from collaboration with the next nearest author.
Mahalanobis distances from each Federalist Paper of disputed authorship to each author centroid.
| Text No. | Distance to Madison centroid (LD = 2.2) | Distance to Hamilton centroid (LD = 2.6) | Distance to Jay centroid (LD = 1.7) |
| 49 | […] | 2.0 | 3.2 |
| 50 | 2.0 | […] | 2.5 |
| 51 | […] | 3.5 | 3.9 |
| 52 | […] | 2.3 | 3.7 |
| 53 | […] | 2.6 | 2.6 |
| 54 | […] | 1.5 | 3.5 |
| 55 | 2.9 | […] | 3.6 |
| 56 | […] | 1.4 | 3.0 |
| 57 | […] | 1.4 | 2.9 |
| 58 | […] | 2.2 | 3.9 |
| 62 | […] | 2.5 | 3.1 |
| 63 | […] | 2.2 | 2.2 |
| 64 | 1.9 | 3.2 | […] |
The texts with the closest distance to each author centroid are highlighted in bold. The longest distance (LD) between an undisputed authorship text and its author centroid is given in the table header.
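The asterisk rule used for the predicted-authors table can be sketched as a check on the two non-selected authors. One reading of the rule, assumed here, is that each rival author's distance is compared with that rival's own LD, and that "approximately equal" means within a small tolerance; the tolerance of 0.15 and the function name are assumptions. The distances are the rival-author values from the table above.

```python
# Longest distance (LD) from a known text to its own author centroid.
LD = {"Madison": 2.2, "Hamilton": 2.6, "Jay": 1.7}


def is_high_certainty(rival_distances, ld=LD, tolerance=0.15):
    """True if every rival author is at least (approximately) LD away.

    rival_distances maps each non-selected author to the distance between
    the disputed text and that author's centroid. The tolerance is an
    assumed reading of "approximately equal".
    """
    return all(dist + tolerance >= ld[author]
               for author, dist in rival_distances.items())


rivals = {
    49: {"Hamilton": 2.0, "Jay": 3.2},
    51: {"Hamilton": 3.5, "Jay": 3.9},
    55: {"Madison": 2.9, "Jay": 3.6},
    62: {"Hamilton": 2.5, "Jay": 3.1},
    64: {"Madison": 1.9, "Hamilton": 3.2},
}
for text_no, others in rivals.items():
    print(text_no, "*" if is_high_certainty(others) else "")
```

Under these assumptions the check flags texts 51, 55 and 62 and leaves 49 and 64 unmarked, which agrees with the asterisks shown for those texts.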
Greek to English character look-up table.
| Greek | English equivalent |
| α | a |
| β | b |
| γ | g |
| δ | d |
| ε | e |
| ζ | z |
| η | h |
| θ | q |
| ι | i |
| κ | k |
| λ | l |
| μ | m |
| ν | n |
| ξ | x |
| ο | o |
| π | p |
| ρ | r |
| σ | s |
| τ | t |
| υ | u |
| φ | f |
| χ | c |
| ψ | y |
| ω | w |
This look-up table is used to transliterate the Koine Greek alphabet to an English equivalent. This is used because the software only handles ASCII characters a-z and space. The software only requires a one-to-one correspondence between Greek letters and our reduced ASCII set, and thus the actual ASCII characters can be entirely arbitrary.
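A minimal sketch of such a transliteration is shown below, using the mapping from the look-up table. Two details are assumptions: accents and breathings are stripped by Unicode decomposition before the look-up, and the final sigma (ς) is treated as an ordinary sigma. The names `GREEK_TO_LATIN` and `transliterate` are hypothetical.

```python
import unicodedata

GREEK_TO_LATIN = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "h", "θ": "q", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "x", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "τ": "t", "υ": "u", "φ": "f", "χ": "c", "ψ": "y", "ω": "w",
    "ς": "s",  # assumption: final sigma is mapped like sigma
}


def transliterate(text: str) -> str:
    """Map Greek text onto the reduced a-z plus space character set."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    out = []
    for ch in decomposed:
        if unicodedata.category(ch) == "Mn":
            continue                      # accent or breathing mark: drop it
        if ch in GREEK_TO_LATIN:
            out.append(GREEK_TO_LATIN[ch])
        elif ch.isspace():
            out.append(" ")
        # any other character (punctuation, digits) is dropped
    return "".join(out)


print(transliterate("καὶ θεός"))
```

Since only a one-to-one correspondence is required, swapping any two values in the dictionary would change the printed words but not the word frequencies that the classifiers use.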
Authorship attribution for the Federalist Papers ranked by likelihood.
| Likelihood | Author | Essay number |
| 20.96 | Madison | 62 |
| 5.74 | Hamilton | 55 |
| 3.40 | Madison | 53 |
| 2.37 | Madison | 49 |
| 2.15 | Hamilton | 50 |
| 1.62 | Madison | 52 |
| 1.26 | Madison | 63 |
| 1.26 | Madison | 51 |
| 0.82 | Madison | 58 |
| 0.14 | Madison | 56 |
| 0.14 | Madison | 57 |
| 0.06 | Jay | 64 |
| 0.04 | Hamilton | 54 |
Source Texts from New Testament and Apostolic Fathers in Koine Greek.
| No. | Author | Book | Words |
| 1 | Luke | Gospel of Luke | 21,318 |
| | | Acts of the Apostles | 20,612 |
| 2 | Mark | Gospel of Mark | 12,844 |
| 3 | Matthew | Gospel of Matthew | 21,957 |
| 4 | John | Gospel of John | 23,887 |
| 5 | Paul | Epistle to the Romans | 8,341 |
| | | First Epistle to the Corinthians | 7,932 |
| | | Second Epistle to the Corinthians | 5,149 |
| | | Epistle to the Galatians | 2,617 |
| | | Epistle to the Philippians | 1,890 |
| | | First Epistle to the Thessalonians | 1,666 |
| | | Epistle to Philemon | 335 |
| 6 | Clement | First Epistle of Clement | 9,833 |
| 7 | Ignatius | To the Ephesians | 1,828 |
| | | Letter to the Magnesians | 1,053 |
| | | Letter to the Trallians | 938 |
| | | Letter to the Romans | 1,092 |
| | | Letter to the Philadelphians | 965 |
| | | Letter to the Smyrnaeans | 1,149 |
| | | Letter to Polycarp, Bishop of Smyrna | 813 |
| 8 | Barnabas | The Epistle of Barnabas | 6,710 |
| 9 | Disputed | Letter to the Hebrews | 5,819 |
This table lists all the Greek texts that we compare against the Letter to the Hebrews. We restrict this training corpus to texts whose authorship is largely undisputed. The listing is grouped by author, and the numbers in the right-hand column give the total word count of each text. To reduce the variation in total word count from author to author, we concatenate the texts in each author group and divide the result into four parts. This is why, in Figure 5, there are four data points for each author.
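The concatenate-and-divide step can be sketched as below. How the split handles a word count that is not a multiple of four is not stated, so giving the leftover words to the first parts is an assumption, as are the function names.

```python
def split_into_parts(words, parts=4):
    """Split a word list into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(words), parts)
    chunks, start = [], 0
    for i in range(parts):
        # Assumption: the first `extra` chunks each take one leftover word.
        end = start + size + (1 if i < extra else 0)
        chunks.append(words[start:end])
        start = end
    return chunks


def author_samples(texts, parts=4):
    """Concatenate all texts by one author and return `parts` sample texts."""
    words = " ".join(texts).split()
    return [" ".join(chunk) for chunk in split_into_parts(words, parts)]


# Toy example with placeholder words standing in for an author's texts.
print(author_samples(["a b c", "d e f g h i"]))
```

Each returned sample would then be turned into one feature vector, giving four data points per author regardless of how many books the author contributed.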
Figure 5. First three canonical discriminant functions for New Testament authors and Apostolic Fathers.
This plot shows the MDA results for the Greek texts, in order to determine which author's cluster of texts is closest to the Letter to the Hebrews. We use seven discriminant functions in this analysis; however, only the first three are plotted here for illustrative purposes. There are four data points for each author, as all their known texts are concatenated and divided into four parts.
The Mahalanobis distance between the Letter to the Hebrews and the centroids of authors.
| Author | Mahalanobis distance | Longest distance (LD) |
| Luke | 12.7 | 5.9 |
| Mark | 16.5 | 9.2 |
| Matthew | 14.0 | 6.9 |
| John | 18.7 | 10.0 |
| Paul | […] | 7.4 |
| Clement | 13.5 | 6.8 |
| Ignatius | 22.7 | 8.1 |
| Barnabas | 15.0 | 9.6 |
The Mahalanobis distance is calculated using the values of all seven discriminant functions. The first column shows the Mahalanobis distance between the disputed Letter to the Hebrews and the centroids of known author texts. The second column gives the longest distance (LD) between an author centroid and the known texts by the same author. If a disputed text is classified outside this LD bound, then it is only a weak match. The lowest Mahalanobis distance is indicated in bold and belongs to the Apostle Paul. However, this is outside the LD bound (i.e. […]) and the likelihood index is less than unity at […], suggesting a weak match.
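The distance and the LD test described above can be sketched in plain Python. The covariance matrix in discriminant-function space is taken as an input, since how it is estimated is not stated here; that, and the helper names `solve`, `mahalanobis`, `longest_distance` and `is_weak_match`, are assumptions for illustration.

```python
import math


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mahalanobis(point, centroid, cov):
    """sqrt((p - c)^T cov^-1 (p - c)); cov is supplied by the caller."""
    diff = [p - c for p, c in zip(point, centroid)]
    weighted = solve(cov, diff)
    return math.sqrt(sum(d * w for d, w in zip(diff, weighted)))


def longest_distance(known_points, centroid, cov):
    """LD: the largest distance from an author's own texts to the centroid."""
    return max(mahalanobis(p, centroid, cov) for p in known_points)


def is_weak_match(distance, ld):
    """A disputed text farther away than LD is only a weak match."""
    return distance > ld


# Values for Luke from the table above: distance 12.7, LD 5.9.
print(is_weak_match(12.7, 5.9))
```

With an identity covariance matrix the function reduces to the ordinary Euclidean distance, which is a convenient sanity check before plugging in real discriminant scores.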