Chuchu Liu, Ziqiang Cao, Xin Lu.
Abstract
BACKGROUND: Understanding the geographic distribution of hidden populations, such as men who have sex with men (MSM), sex workers, or injecting drug users, is of great importance for the adequate deployment of intervention strategies and for public health decision making. However, because such populations are hard to access (e.g., no sampling frame, sensitivity issues, reporting errors), traditional survey methods are largely limited when studying them. Using data extracted from a very active online community of MSM in China, this study adopts and develops location inferring methods to achieve high-resolution mapping of the community's users at the national level.
Keywords: Geographic distribution; Hidden population; Location inference; MSM; Text analysis
Year: 2020 PMID: 33298074 PMCID: PMC7724834 DOI: 10.1186/s12942-020-00245-x
Source DB: PubMed Journal: Int J Health Geogr ISSN: 1476-072X Impact factor: 3.918
Fig. 1 The workflow and links between location inferring algorithms
Comparison of mainstream location inferring algorithms
| | The Gazetteer-based Method | Part-of-speech (POS) Tagging | Named Entity Recognition (NER) |
|---|---|---|---|
| Features | Identifying geographical names according to external location knowledge (e.g., a dictionary containing names of cities and states) | Recognizing geographical terms in a corpus based on the part of speech of its component words, according to both their definitions and contexts | Identifying and classifying words mentioned in an unstructured corpus into pre-defined entity classes, i.e., persons, locations, organizations, etc., based on HMM models |
| Strengths | It is a popular approach when looking for locations in Web text | Part-of-speech information is a prerequisite in many NLP (Natural Language Processing) algorithms | The algorithm is fast and suitable for processing large-scale datasets |
| Limitations | Largely relies on the gazetteer, and is easily affected by external geographic databases | Vulnerable to linguistic errors and idiosyncratic style; algorithm accuracy is relatively low | Cannot identify names of local streets or buildings, non-standard place abbreviations, or misspellings, which are common in microtext |
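The gazetteer-based column of the table can be illustrated with a minimal sketch. This is not the authors' implementation: the `GAZETTEER` dictionary, its five entries, and the function name `gazetteer_match` are hypothetical, and plain substring search is assumed as the matching rule because Chinese text has no word delimiters.

```python
# Hypothetical toy gazetteer: city name -> province.
# A real gazetteer would hold far more entries; these are for illustration only.
GAZETTEER = {
    "北京": "北京",
    "上海": "上海",
    "广州": "广东",
    "深圳": "广东",
    "成都": "四川",
}


def gazetteer_match(text, gazetteer=GAZETTEER):
    """Return sorted (position, city, province) tuples for every gazetteer name in text."""
    hits = []
    for city, province in gazetteer.items():
        start = text.find(city)
        while start != -1:
            hits.append((start, city, province))
            # Continue searching after this occurrence.
            start = text.find(city, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    print(gazetteer_match("坐标成都，有人吗"))  # a genuine mention of 成都
    print(gazetteer_match("马上海边见"))        # 上海 is matched inside an unrelated phrase
```

The second call shows why a bare dictionary lookup is fragile: the characters 上海 occur inside a phrase that has nothing to do with the city, so some additional filtering of candidate matches is needed.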
Chinese keywords employed in pattern recognition
| | Former keywords | Later keywords | Global keywords |
|---|---|---|---|
| In Chinese | 坐标,定位,同,在,从,去,是,也是,求,就是,大… | 人,上学,上班,有,有吗,附近,的,滴,是,加… | 交友,大学,学院,公司,同城,私聊… |
| In English | Coordinate, location, same, in, from, go, am, also be, seek, big (usually describes someone’s own place), etc. | Person, go to school, work, have, any, close, is, add, follow, etc. | Making friends, university, college, company, same city, private chat, etc. |
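One way such keywords could be used to filter candidate place names is sketched below. The matching rule is an assumption, since this record does not spell it out: a city mention is accepted if a former keyword immediately precedes it, a later keyword immediately follows it, or a global keyword appears anywhere in the post. The keyword lists contain only the entries shown in the table, and `CITIES`, `matches_pattern`, and `infer_cities` are hypothetical names.

```python
# Keywords copied from the table above (the table's lists are truncated with "…").
FORMER = ["坐标", "定位", "同", "在", "从", "去", "是", "也是", "求", "就是", "大"]
LATER = ["人", "上学", "上班", "有", "有吗", "附近", "的", "滴", "是", "加"]
GLOBAL = ["交友", "大学", "学院", "公司", "同城", "私聊"]

# Hypothetical candidate city names, for illustration only.
CITIES = ["北京", "上海", "成都"]


def matches_pattern(text, city):
    """True if any occurrence of city is directly preceded by a former
    keyword or directly followed by a later keyword (assumed rule)."""
    start = text.find(city)
    while start != -1:
        before = text[:start]
        after = text[start + len(city):]
        if any(before.endswith(k) for k in FORMER):
            return True
        if any(after.startswith(k) for k in LATER):
            return True
        start = text.find(city, start + 1)
    return False


def infer_cities(text):
    """Return the candidate cities in text that pass the keyword filter."""
    has_global = any(k in text for k in GLOBAL)
    return [c for c in CITIES
            if c in text and (has_global or matches_pattern(text, c))]


if __name__ == "__main__":
    for post in ["坐标成都", "上海有人吗", "北京同城交友", "马上海边见"]:
        print(post, infer_cities(post))
```

Under this assumed rule, the first three posts each yield a city, while the last post is rejected because no listed keyword is adjacent to 上海 and no global keyword occurs in it.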
The performances of three location inferring algorithms
| Gazetteer | Part-of-speech | Chinese NER | |
|---|---|---|---|
| 0.352 | 0.487 | ||
| 0.892 | 0.927 | ||
| 0.370 | 0.502 | ||
| 0.945 | 0.964 |
The performance of the gazetteer-based method after strengthening the filtering rules
| Gazetteer | Gazetteer with context analysis | Gazetteer with pattern recognition | |
|---|---|---|---|
| 0.503 | 0.493 | ||
| 0.932 | 0.733 | ||
| 0.518 | 0.667 | ||
| 0.966 | 0.800 |
The performance of the gazetteer-based method with a strong constraint
| Gazetteer | Gazetteer with context analysis (weak) | Gazetteer with context analysis (strong) | |
|---|---|---|---|
| 0.503 | 0.512 | ||
| 0.932 | 0.929 | ||
| 0.518 | 0.528 | ||
| 0.966 | 0.965 |
The algorithm performance with keyword augmentation
| Strong constraint | Strong constraint with more keywords | Strong constraint (full-mode segmentation) | |
|---|---|---|---|
| 0.542 | |||
| 0.762 | |||
| 0.614 | |||
| 0.866 |
The performance of HVA-LI
| S-Gazetteer & PT | Gazetteer & NER | S-Gazetteer & NER | |
|---|---|---|---|
| 0.578 | 0.586 | ||
| 0.756 | 0.767 | ||
| 0.666 | 0.668 | ||
| 0.883 | 0.881 |
Fig. 2 Absolute accuracy of location inferring algorithms
Fig. 3 a The city distribution of gay-bar users from location fields in profiles. b The location distribution extracted from the GPS coordinates of gay-bar users at both the city level and the province level
Fig. 4 The geographic distribution of gay-bar users inferred from the published posts. a City-level distribution. b Province-level distribution