Xiaoxiao Li, Amy Zhang, Rabah Al-Zaidy, Amrita Rao, Stefan Baral, Le Bao, C Lee Giles.
Abstract
There remains a limited understanding of the HIV prevention and treatment needs among female sex workers in many parts of the world. Systematic reviews of existing literature can help fill this gap; however, well-done systematic reviews are time-consuming and labor-intensive. Here, we propose an automatic document classification approach to systematic review that significantly reduces the effort of reviewing documents and helps optimize empirical decision making. We first describe a manual document classification procedure used to curate a pertinent training dataset and then propose three classifiers: a keyword-guided method, a cluster analysis-based method, and a random forest approach that uses a large set of feature tokens. This approach is used to identify documents studying female sex workers that contain content relevant to either HIV or experienced violence. We compare the performance of the three classifiers by cross-validation in terms of the area under the curve of the receiver operating characteristic (ROC) and of the precision-recall (PR) plot, and find that the random forest approach reduces the amount of manual reading for our example by 80%. In a sensitivity analysis, we find that even when trained with only 10% of the data, the classifier can still avoid reading 75% of future documents (68% of total) while retaining 80% of relevant documents. In sum, the automated procedure of document classification presented here could improve both the precision and efficiency of systematic reviews and facilitate live reviews, where reviews are updated regularly. We expect to obtain a reasonable classifier by taking 20% of retrieved documents as training samples. The proposed classifier could also be used for more meaningfully assembling literature in other research areas and for rapid document screening on a tight schedule, such as COVID-related work during the crisis.
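The "68% of total" figure appears consistent with simple arithmetic: if 10% of documents are labeled by hand for training, the remaining 90% are future documents, and skipping 75% of those skips 0.75 × 0.90 = 67.5% of the whole corpus. The sketch below reproduces that calculation; the function name is illustrative and not taken from the paper, and the reading of "total" as the full retrieved corpus is an assumption.

```python
def fraction_of_total_skipped(train_fraction, skipped_of_future):
    """Share of the whole corpus that never needs manual reading.

    train_fraction: share of documents labeled by hand for training.
    skipped_of_future: share of the remaining (future) documents
        that the classifier screens out.
    Assumes "total" means all retrieved documents (not stated in the source).
    """
    if not (0.0 <= train_fraction <= 1.0 and 0.0 <= skipped_of_future <= 1.0):
        raise ValueError("fractions must lie in [0, 1]")
    return (1.0 - train_fraction) * skipped_of_future


share = fraction_of_total_skipped(0.10, 0.75)
print(f"{share:.1%}")  # prints 67.5%, which rounds to the reported 68%
```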
Year: 2022 PMID: 35771807 PMCID: PMC9246134 DOI: 10.1371/journal.pone.0270034
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
Fig 1. Flow diagram for systematic reviews of HIV prevalence, the HIV treatment cascade (HIV testing, linkage to care, treatment, viral suppression), and experienced violence (physical, sexual, and intimate partner) among female sex workers.
Medical Subject Headings (MeSH) and Text Words (TW) used to retrieve relevant articles.
| Category | MeSH | TW |
|---|---|---|
| Female Sex Workers (FSW) | Prostitution, Sex Worker | prostitut* |
| HIV | HIV, Acquired Immunodeficiency Syndrome, HIV Infections | human immunodeficiency virus |
| Violence | Violence, Domestic Violence, Workplace Violence, Crime Victims, Battered Women, Rape, Homicide, Coercion | Violen* |
* Represents usage of a wildcard operator in the database query, e.g., “violen*” returns both “violence” and “violent”.
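To illustrate the wildcard behavior described in the footnote, the sketch below approximates a truncated text word with a regular expression. It is only an approximation for illustration: the helper names are hypothetical, and the matching rules of the actual bibliographic database may differ.

```python
import re


def wildcard_to_regex(term):
    """Compile a query term containing '*' into a case-insensitive regex.

    Each '*' is treated as "zero or more word characters"; the term must
    start and end on a word boundary. This mimics, but is not, the
    database's own truncation operator.
    """
    parts = [re.escape(p) for p in term.split("*")]
    return re.compile(r"\b" + r"\w*".join(parts) + r"\b", re.IGNORECASE)


def matches(term, text):
    """Return True if the wildcard term occurs anywhere in the text."""
    return bool(wildcard_to_regex(term).search(text))


for phrase in ["workplace violence", "violent crime", "violin lessons"]:
    print(phrase, "->", matches("violen*", phrase))
```

The first two phrases match the stem "violen" and the third does not, which mirrors the footnote's example of "violen*" returning both "violence" and "violent".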
Fig 2. (a) ROC and (b) PR curves for the baseline model (keyword-guided approach), RF with 15 refined clusters, and RF with top tokens (20, 50, 100, 250, and 500 tokens in addition to the 15 clusters).
As a summary metric, area under the curve (AUC) is provided in the legend.
The area under the curve (AUC) for ROC and PR of random forest models.
All models include 15 refined clusters. Column names give the number of additional tokens.
| Metric | 0 tokens | 20 tokens | 50 tokens | 100 tokens | 250 tokens |
|---|---|---|---|---|---|
| AUC-ROC | 0.83 | 0.90 | 0.90 | 0.90 | 0.90 |
| AUC-PR | 0.34 | 0.50 | 0.56 | 0.56 | 0.57 |
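For readers unfamiliar with the two summary metrics in this table, the sketch below computes them from labels and classifier scores in plain Python. It is a minimal illustration on made-up toy data, not the authors' evaluation code: AUC-ROC is computed as the fraction of (relevant, irrelevant) pairs ranked correctly, and AUC-PR is estimated by average precision, which is one common estimator and may differ from the one used in the paper.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the share of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` documents
    if hits == 0:
        raise ValueError("need at least one positive label")
    return total / hits


# Toy data (invented for illustration): 1 = relevant document, 0 = irrelevant.
labels = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(round(auc_roc(labels, scores), 3))
print(round(average_precision(labels, scores), 3))
```

When relevant documents are rare, AUC-PR tends to be more sensitive than AUC-ROC to how well the top of the ranking is ordered, which may explain why it keeps separating the models in the table after AUC-ROC has leveled off at 0.90.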
The area under the curve (AUC) for ROC and PR of Random Forest, ElasticNet, Support Vector Machine (SVM), Neural Network, and Boosting, all with the top 50 tokens.
Models are listed in decreasing order of AUC-PR.
| Metric | Random Forest | ElasticNet | SVM | Neural Network | Boosting |
|---|---|---|---|---|---|
| AUC-ROC | 0.90 | 0.89 | 0.88 | 0.85 | 0.86 |
| AUC-PR | 0.56 | 0.49 | 0.44 | 0.36 | 0.12 |
Fig 3. AUC for (a) ROC and (b) PR on testing data for RF with the top 250 tokens, with the proportion of pre-labeled documents ranging from 1% to 80%.