Junqing He, Mingming Fu, Manshu Tu.
Abstract
BACKGROUND: Medical and clinical question answering (QA) has recently attracted considerable attention from researchers. Despite remarkable advances in this field, progress in the Chinese medical domain has lagged behind. This can be attributed to the difficulty of Chinese text processing and the lack of large-scale datasets. To bridge the gap, this paper introduces a Chinese medical QA dataset and proposes effective methods for the task.
Keywords: Chinese word segmentation; Convolutional neural networks; Deep learning; Medical question answering; Semantic matching
Year: 2019 PMID: 30961607 PMCID: PMC6454599 DOI: 10.1186/s12911-019-0761-8
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Comparison of cMedQA and our webMedQA dataset
| Dataset | | cMedQA | webMedQA |
|---|---|---|---|
| # Ans | Train | 94134 | 253050 |
| | Dev | 3774 | 31685 |
| | Test | 3835 | 31685 |
| | Total | 101743 | 316420 |
| # Ques | Train | 50000 | 50610 |
| | Dev | 2000 | 6337 |
| | Test | 2000 | 6337 |
| | Total | 54000 | 63284 |
| Contain category | | No | Yes |
The statistics of answers and questions in webMedQA dataset
| | Train | Dev | Test |
|---|---|---|---|
| Number of Ans. | 253050 | 31685 | 31685 |
| Avg. Length of Ans. | 146.88 | 147.74 | 148.50 |
| Max Length of Ans. | 500 | 499 | 499 |
| Min Length of Ans. | 2 | 2 | 2 |
| Number of Ques. | 50610 | 6337 | 6337 |
| Avg. Length of Ques. | 86.68 | 87.43 | 86.08 |
| Max Length of Ques. | 1312 | 1302 | 1150 |
| Min Length of Ques. | 2 | 3 | 5 |
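The split sizes above imply a fixed ratio of candidate answers to questions. The following is a small arithmetic check that uses only the counts from the tables; the variable and function names are illustrative and not part of the dataset release.

```python
# Counts copied from the webMedQA statistics table.
splits = {
    "train": {"answers": 253050, "questions": 50610},
    "dev": {"answers": 31685, "questions": 6337},
    "test": {"answers": 31685, "questions": 6337},
}

def answers_per_question(split):
    """Return the average number of candidate answers per question."""
    return split["answers"] / split["questions"]

ratios = {name: answers_per_question(s) for name, s in splits.items()}
total_answers = sum(s["answers"] for s in splits.values())
total_questions = sum(s["questions"] for s in splits.values())

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} answers per question")
print(f"total: {total_answers} answers, {total_questions} questions")
```

Every split works out to five answers per question, and the totals agree with the 316420 answers and 63284 questions reported in the comparison table.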
Fig. 1 A sample in webMedQA. The five fields are on the left, with their contents on the right
The frequency distribution over the categories
| Category | Frequency | Category | Frequency |
|---|---|---|---|
| Internal Medicine | 18327 | Cosmetology | 775 |
| Surgery | 13511 | Drugs | 529 |
| Gynecology | 8691 | Health Care | 439 |
| Pediatrics | 5312 | Assistant Inspection | 430 |
| Dermatology | 4969 | Rehabilitation | 276 |
| Ophthalmology & Otolaryngology | 3983 | Home Environment | 253 |
| | | Child Education | 247 |
| Oncology | 2118 | Nutrition and Health | 172 |
| Mental Health | 1536 | Slimming | 169 |
| Chinese Medicine | 1452 | Genetics | 86 |
| Infectious Diseases | 1360 | Medical Examination | 64 |
| Plastic Surgery | 1211 | Others | 31 |
Fig. 2 Illustration of CSCR with a character-level input. m is the length of the input sentence and d is the length of the embedding for each character
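To make the shapes in the caption concrete, here is a minimal sketch of how a character-level input becomes an m × d matrix. It only illustrates the input representation, not CSCR itself. The embedding length, the random initialisation, the example sentence and the helper name `build_char_matrix` are all assumptions for illustration.

```python
import random

def build_char_matrix(sentence, embeddings, d, rng):
    """Map each character to a d-dimensional vector, giving an m x d matrix."""
    matrix = []
    for ch in sentence:
        if ch not in embeddings:
            # Hypothetical initialisation: unseen characters get a small random vector.
            embeddings[ch] = [rng.uniform(-0.1, 0.1) for _ in range(d)]
        matrix.append(embeddings[ch])
    return matrix

rng = random.Random(0)
d = 8  # illustrative embedding length, not the value used in the paper
sentence = "感冒发烧怎么办"  # made-up question ("what to do about a cold and fever")
char_matrix = build_char_matrix(sentence, {}, d, rng)
m = len(char_matrix)  # one row per character, so m equals the sentence length
print(m, len(char_matrix[0]))
```

Because each Chinese character is its own input unit, this representation needs no word segmentation step.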
Performance of different CWS tools on webMedQA with MV-LSTM
| CWS tool | Vocab Size | P@1(%) | MAP(%) |
|---|---|---|---|
| Ansj | 44140 | 57.7 | 73.5 |
| Fnlp | 145058 | 57.9 | 74.4 |
| jieba | 94630 | 59.3 | 75.3 |
The performance of different matching models using character-level and word-level inputs
| Input Unit | Model | P@1(%) | MAP(%) |
|---|---|---|---|
| | Random | 20.0 | 45.7 |
| Char | BM25 | 26.6 | 51.2 |
| | multiCNN | 39.8 | 60.1 |
| | MV-LSTM | 58.1 | 74.5 |
| | MatchPyramid | 66.0 | 79.3 |
| Word | BM25 | 23.6 | 49.0 |
| | multiCNN | 40.0 | 60.5 |
| | MV-LSTM | 59.3 | 75.3 |
| | MatchPyramid | 58.8 | 74.9 |
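The two metrics in this table can be sketched as follows. This excerpt does not define them, so the code assumes the standard definitions of precision at rank 1 (P@1) and mean average precision (MAP). It also assumes, for the random baseline only, that each question has five candidate answers of which exactly one is correct. All function names are illustrative.

```python
def precision_at_1(ranked_labels):
    """Fraction of questions whose top-ranked candidate is relevant.

    ranked_labels holds one list per question: 0/1 relevance labels in ranked order.
    """
    return sum(labels[0] for labels in ranked_labels) / len(ranked_labels)

def average_precision(labels):
    """Average of precision values at the ranks where a relevant answer appears."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(ranked_labels):
    """Mean of the per-question average precision."""
    return sum(average_precision(l) for l in ranked_labels) / len(ranked_labels)

# Expected score of a uniformly random ranking under the stated assumption:
# the single correct answer is equally likely to land at each of the 5 ranks.
n = 5
all_positions = [[1 if i == pos else 0 for i in range(n)] for pos in range(n)]
random_p1 = precision_at_1(all_positions)
random_map = mean_average_precision(all_positions)
print(f"P@1 = {random_p1:.1%}, MAP = {random_map:.1%}")
```

Under that assumption the expected values are 20.0% and 45.7%, which agrees with the Random row. This is consistent with one correct answer among five candidates but does not prove it.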
Fig. 3 P@1 of matching models with and without CSCR using different input units
Fig. 4 MAP of matching models with and without CSCR using different input units
Fig. 5 The segmentation results of CWS tools on a sample. Segments are separated by /