Victoria Bobicev, Marina Sokolova, Khaled El Emam, Yasser Jafer, Brian Dewar, Elizabeth Jonker, Stan Matwin.
Abstract
BACKGROUND: Participants in medical forums often reveal personal health information about themselves in their online postings. To feel comfortable revealing sensitive personal health information, some participants may hide their identity by posting anonymously. They can do this by using fake identities, nicknames, or pseudonyms that cannot readily be traced back to them. However, individual writing styles have unique features and it may be possible to determine the true identity of an anonymous user through author attribution analysis. Although there has been previous work on the authorship attribution problem, there has been a dearth of research on automated authorship attribution on medical forums. The focus of the paper is to demonstrate that character-based author attribution works better than word-based methods in medical forums.
Keywords: medical forums; personal health information; privacy; text data mining
Year: 2013 PMID: 24091380 PMCID: PMC3806358 DOI: 10.2196/jmir.2514
Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 5.428
Statistics on the analyzed subforums on the IVF.ca website at the time of data collection.
| Subforum name | Topics, n | Posts, n | Posts per topic, mean |
| Introduction | 1716 | 13,569 | 7.91 |
| IVF/FET/IUI Cycle Buddies | 2167 | 116,994 | 53.99 |
| IVF Ages 35+ | 506 | 16,362 | 32.34 |
| Waiting Lounge | 418 | 3816 | 9.13 |
| Donor & Surrogacy Buddies | 893 | 7381 | 8.27 |
| Adoption Buddies | 304 | 4210 | 13.85 |
Figure 1. Number of posts per author distribution for the selected subforums, IVF Ages 35+ (n=865) and IVF/FET/IUI Cycle Buddies (n=1195).
Figure 2. Distribution of the number of posts per author (most prolific) for IVF Ages 35+ subforum (n=30).
Figure 3. Distribution of the average post length (number of words) for the 30 most-prolific authors in the IVF Ages 35+ subforum.
Figure 4. Number of posts per topic for each author (most prolific) in the IVF Ages 35+ subforum (n=30).
Statistics for authors and distribution of their posts per subforum.
| Author | Introduction, n | Cycle_Buddies, n | Age_35+, n |
| Author 1 | 3 | 6 | 278 |
| Author 2 | 35 | 445 | 1 |
| Author 3 | 7 | 91 | 3 |
| Author 4 | 30 | 11 | 69 |
| Author 5 | 6 | 88 | 30 |
| Author 6 | 67 | 264 | 4 |
| Author 7 | 13 | 16 | 820 |
| Author 8 | 54 | 94 | 1 |
| Author 9 | 8 | 7 | 355 |
| Author 10 | 5 | 130 | 6 |
Classification results for author identification on the IVF Ages 35+ subforum; 10-fold cross-validation, 30 authors, 100 posts per author.
| Model | F score | Precision | Recall |
| Letters | 0.793 | 0.803 | 0.784 |
| Characters lowercase | 0.822 | 0.830 | 0.831 |
| Original capitalization | 0.826 | 0.836 | 0.817 |
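The first numeric column in these results tables appears to be the F score (the captions of Figures 6 and 7 refer to the F score), that is, the harmonic mean of precision and recall. A minimal Python sketch of that relationship follows; the function name `f_score` is ours, and the sample values are taken from the Letters row above.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F score (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Letters row: precision 0.803, recall 0.784
print(round(f_score(0.803, 0.784), 3))  # 0.793
```

Most rows in the tables are consistent with this reading when rounded to 3 decimal places.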
Classification results for author identification on IVF/FET/IUI Cycle Buddies subforum; 10-fold cross-validation, 30 authors, 100 posts per author.
| Features | F score | Precision | Recall |
| Letters | 0.836 | 0.851 | 0.822 |
| Characters lowercase | 0.887 | 0.896 | 0.877 |
| Original capitalization | 0.902 | 0.911 | 0.894 |
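Both tables above report 10-fold cross-validation with 30 authors and 100 posts per author. The sketch below shows one way such folds could be built so that every fold holds the same number of posts per author. It is an illustration under our own assumptions, not the authors' procedure: the function `ten_fold_splits` and the synthetic data are hypothetical.

```python
from collections import defaultdict


def ten_fold_splits(posts, n_folds=10):
    """Yield (train, test) lists of (author, text) pairs.

    Each author's posts are dealt round-robin into the folds, so every
    fold receives an equal share of posts from every author.
    """
    by_author = defaultdict(list)
    for author, text in posts:
        by_author[author].append(text)

    folds = [[] for _ in range(n_folds)]
    for author, texts in by_author.items():
        for i, text in enumerate(texts):
            folds[i % n_folds].append((author, text))

    for k in range(n_folds):
        test = folds[k]
        train = [p for j, fold in enumerate(folds) if j != k for p in fold]
        yield train, test


# Synthetic stand-in data: 30 authors, 100 posts each (not the forum data).
posts = [(f"author{a}", f"post {i}") for a in range(30) for i in range(100)]
train, test = next(ten_fold_splits(posts))
print(len(train), len(test))  # 2700 300
```

With this layout, each round trains on 90 posts per author and tests on the remaining 10.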
Dependency of author detection accuracy on the number of candidate authors on the IVF/FET/IUI Cycle Buddies subforum.
| Number of authors | F score | Precision | Recall |
| 10 | 0.965 | 0.967 | 0.963 |
| 15 | 0.932 | 0.937 | 0.927 |
| 20 | 0.924 | 0.931 | 0.917 |
| 25 | 0.912 | 0.921 | 0.904 |
| 30 | 0.902 | 0.911 | 0.894 |
| 35 | 0.881 | 0.891 | 0.872 |
| 40 | 0.845 | 0.856 | 0.835 |
| 45 | 0.838 | 0.849 | 0.827 |
| 50 | 0.831 | 0.842 | 0.820 |
Dependency of author detection accuracy on the number of candidate authors on the IVF Ages 35+ subforum.
| Number of authors | F score | Precision | Recall |
| 10 | 0.919 | 0.921 | 0.916 |
| 15 | 0.918 | 0.922 | 0.914 |
| 20 | 0.885 | 0.889 | 0.882 |
| 25 | 0.875 | 0.882 | 0.869 |
| 30 | 0.826 | 0.836 | 0.817 |
Figure 5. Dependency of the accuracy on candidate author number for author detection task on the IVF/FET/IUI Cycle Buddies and IVF Ages 35+ subforums.
Dependency of the accuracy on training data volume for the author detection task.
| Number of training files | F score | Precision | Recall |
| 20 | 0.503 | 0.496 | 0.511 |
| 40 | 0.668 | 0.669 | 0.667 |
| 60 | 0.765 | 0.773 | 0.758 |
| 80 | 0.794 | 0.800 | 0.787 |
| 100 | 0.806 | 0.812 | 0.800 |
| 120 | 0.815 | 0.823 | 0.808 |
| 140 | 0.826 | 0.834 | 0.819 |
| 160 | 0.834 | 0.841 | 0.827 |
| 180 | 0.837 | 0.843 | 0.831 |
Figure 6. Dependency of the F score on the training data volume for the author attribution.
Dependency of the results on test file size for the author detection task.
| Test file size (words) | F score | Precision | Recall |
| 25 | 0.605 | 0.613 | 0.599 |
| 50 | 0.752 | 0.759 | 0.745 |
| 75 | 0.825 | 0.833 | 0.817 |
| 100 | 0.886 | 0.895 | 0.877 |
| 125 | 0.907 | 0.914 | 0.901 |
| 150 | 0.920 | 0.926 | 0.915 |
| 175 | 0.936 | 0.940 | 0.933 |
| 200 | 0.948 | 0.952 | 0.943 |
| 225 | 0.958 | 0.963 | 0.953 |
| 250 | 0.962 | 0.967 | 0.957 |
| 275 | 0.970 | 0.973 | 0.967 |
| 300 | 0.973 | 0.976 | 0.971 |
| 325 | 0.972 | 0.975 | 0.969 |
| 350 | 0.976 | 0.979 | 0.973 |
| 375 | 0.975 | 0.978 | 0.973 |
| 400 | 0.979 | 0.981 | 0.976 |
| 425 | 0.977 | 0.980 | 0.975 |
| 450 | 0.980 | 0.982 | 0.978 |
| 475 | 0.978 | 0.980 | 0.975 |
| 500 | 0.979 | 0.982 | 0.977 |
Figure 7. Dependency of the F score on the test text size for the author attribution.
Results for author detection task using Naïve Bayes and support vector machine (SVM) classification models implemented in WEKA.
| Subforum | Features | Naïve Bayes | SVM |
| IVF Ages 35+ | Frequent words only | 0.636 | 0.760 |
| IVF Ages 35+ | Frequent words + punctuation + figures + capital letters frequency | 0.624 | 0.766 |
| IVF Ages 35+ | Frequent 5-character sequences | 0.586 | 0.743 |
| IVF/FET/IUI Cycle Buddies | Frequent words only | 0.575 | 0.690 |
| IVF/FET/IUI Cycle Buddies | Frequent words + punctuation + figures + capital letters frequency | 0.567 | 0.694 |
| IVF/FET/IUI Cycle Buddies | Frequent 5-character sequences | 0.550 | 0.701 |
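The "Frequent 5-character sequences" feature set can be illustrated with a short Python sketch. This page does not describe how the features were extracted or how WEKA was configured, so everything below is an assumption for illustration: the helper names (`char_ngrams`, `frequent_ngrams`, `feature_vector`), the vocabulary size, and the sample posts are all hypothetical.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 5) -> Counter:
    """Count every overlapping n-character sequence in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def frequent_ngrams(texts, n=5, top_k=3):
    """Return the top_k most frequent character n-grams across all texts."""
    total = Counter()
    for text in texts:
        total.update(char_ngrams(text, n))
    return [gram for gram, _ in total.most_common(top_k)]


def feature_vector(text, vocabulary, n=5):
    """Relative frequency of each vocabulary n-gram within one text."""
    counts = char_ngrams(text, n)
    total = sum(counts.values()) or 1  # avoid division by zero on short texts
    return [counts[gram] / total for gram in vocabulary]


# Invented sample posts, used only to exercise the functions.
sample_posts = [
    "Good luck with your transfer today!",
    "Good luck everyone, thinking of you today.",
]
vocabulary = frequent_ngrams(sample_posts, n=5, top_k=3)
vectors = [feature_vector(post, vocabulary) for post in sample_posts]
print(vocabulary)
print(vectors)
```

Vectors of this kind, one per post, are the sort of input a Naïve Bayes or SVM classifier could be trained on. Because the sequences span spaces and punctuation, they also capture some of each author's capitalization and punctuation habits.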