Wei-Nan Zhang1, Ting Liu1, Yang Yang2, Liujuan Cao3, Yu Zhang1, Rongrong Ji4.
Abstract
With the rise of Web 2.0, Community Question Answering (CQA) services such as Yahoo! Answers (http://answers.yahoo.com), WikiAnswer (http://wiki.answers.com), and Baidu Zhidao (http://zhidao.baidu.com) have emerged as alternatives for knowledge and information acquisition. Over time, a large number of high-quality question and answer (Q&A) pairs contributed through human intelligence have accumulated into a comprehensive knowledge base. Unlike search engines, which return long lists of results, search in CQA services can return correct answers to a question query by automatically finding similar questions that other users have already answered. This greatly improves the efficiency of online information retrieval. However, given a question query, finding similar and well-answered questions is a non-trivial task. The main challenge is the word mismatch between the question query (query) and the candidate question for retrieval (question). To address this problem, we capture the word-level semantic similarity between query and question using topic modeling. We then propose an unsupervised machine-learning approach to finding similar questions in CQA Q&A archives. The experimental results show that our proposed approach significantly outperforms the state-of-the-art methods.
Year: 2014 PMID: 24595052 PMCID: PMC3942313 DOI: 10.1371/journal.pone.0071511
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
An example of a Q&A pair in the Yahoo! Answers repository.
| What phone is best iPhone 5 or Samsung galaxy s3? |
| What phone is best to have iPhone 5 or Samsung galaxy s3? I've heard that the Samsung galaxy battery only lasts for 6 hrs and what's the best for apps? |
| Galaxy s3 hands down. Galaxy s3 features. Quad core processor. Several days of battery life. Flash. Better camera with more features such as burst mode. Will soon get Android 4.1 jellybean. Turn by turn voice navigation (said to be superior to apple maps) Built in FM radio. Micro SD card. Double your storage size. Multiple buttons. 4.8 inch super AMOLED plus display. More durable. Uses polycarbonate instead of glass… |
Figure 1. The graphical representation of the LDA model.
Figure 2. The transformation from documents to topic-vector representations by the LDA model.
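Once documents are represented as topic vectors, a query and a candidate question can be compared in topic space even when they share no words. The sketch below illustrates that idea with cosine similarity. The choice of cosine similarity and the four-dimensional vectors are assumptions made for illustration; the record does not state which similarity measure the authors use, and the values are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector carries no topic information
    return dot / (norm_u * norm_v)


# Hypothetical topic vectors over four topics (illustrative values only).
query_topics = [0.70, 0.10, 0.10, 0.10]
related_question_topics = [0.60, 0.20, 0.10, 0.10]
unrelated_question_topics = [0.05, 0.05, 0.10, 0.80]

# The related question should score higher than the unrelated one.
print(cosine_similarity(query_topics, related_question_topics))
print(cosine_similarity(query_topics, unrelated_question_topics))
```

In a full pipeline the vectors would come from LDA inference on each text rather than being written by hand.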
A summary of the features used in the task of finding similar questions.
| Feature Name | Feature Description |
| | the lexical feature of the question |
| | the topic distribution of the question |
| | the weight of the topic |
Figure 3. The framework of the proposed approach to finding similar questions in the Yahoo! Answers Q&A repository.
Statistics of the experimental data set.
| # of queries | total # of questions | # of similar questions |
| 10,000 | 1,123,134 | 20,800 |
Experimental results of the compared systems for finding similar questions.
| Models | BC | BCF | upBC | STM | LDAC | LDACF |
| AP | 0.543 | 0.556 | 0.564 | 0.575 | 0.638 | |
| % AP improvements over | | | | | | |
| BC | N/A | +2.39 | +3.87 | +5.89 | +17.50 | +20.81 |
| BCF | N/A | N/A | +1.44 | +3.42 | +14.75 | +17.99 |
| upBC | N/A | N/A | N/A | +1.95 | +13.12 | +16.31 |
| STM | N/A | N/A | N/A | N/A | +10.96 | +14.09 |
| LDAC | N/A | N/A | N/A | N/A | N/A | +2.82 |
| | 0.550 | 0.561 | 0.577 | 0.585 | 0.648 | |
|
indicates that the results of our proposed methods are statistically significant relative to the four baseline methods (within the 0.95 confidence interval using the -test). The results of our proposed approach are in bold.
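The improvement percentages in the table above are consistent with the relative change in AP between two systems, that is, (AP_new - AP_base) / AP_base × 100. The sketch below recomputes several of them from the AP values that survive in the table. The function name `relative_improvement` is hypothetical, and treating this as the authors' exact formula is an assumption, although it reproduces the reported figures to two decimal places.

```python
def relative_improvement(ap_new, ap_base):
    """Percentage improvement of ap_new over ap_base."""
    return (ap_new - ap_base) / ap_base * 100.0


# AP values as they appear in the results table.
reported_ap = {
    "BC": 0.543,
    "BCF": 0.556,
    "upBC": 0.564,
    "STM": 0.575,
    "LDAC": 0.638,
}

# Recompute the "improvement over BC" row for the systems listed above.
for name, value in reported_ap.items():
    if name != "BC":
        gain = relative_improvement(value, reported_ap["BC"])
        print(f"{name} over BC: {gain:+.2f}%")
```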
The upper bound of the evaluation data, obtained by removing the error clusters from the similar-question clustering results.
| | BC | LDACF |
| Upper bound of evaluation data | 93.7% | 99.1% |
The experimental results of the BC and LDACF approaches on the refined data set.
| | BC | LDACF |
| AP | 0.58 | |
| | 0.587 | |
* indicates that the results of LDACF are statistically significant relative to BC (within the 0.95 confidence interval using the -test). The results of our proposed approach are in bold.
Experimental results of the compared systems on the diverse data sets for finding similar questions.
| Models | Cong et al. | | | |
| | AP | | AP | |
| | 0.517 | 0.520 | 0.551 | 0.570 |
| | 0.525 | 0.532 | 0.559 | 0.570 |
| | 0.533 | 0.544 | 0.577 | 0.585 |
| | 0.554 | 0.560 | 0.593 | 0.600 |
| | 0.598 | 0.615 | 0.608 | 0.620 |
indicates that the results of our proposed methods are statistically significant relative to the four baseline methods (within the 0.95 confidence interval using the -test). The results of our proposed approach are in bold.
Figure 4. The change in average precision as the number of topics varies.
Four topics and the words mined from our experiment data set.
| TOPIC 1 ( | TOPIC 2 ( | TOPIC 3 ( | TOPIC 4 ( |
| E71 | Music | Card | internet |
| Nokia | player | memory | wifi |
| E63 | firmware | phone | connect |
| work | update | PC | connection |
| phones | device | file | WLAN |
| cheaper | version | transfer | access |
| N97 | media | computer | home |
| E51 | latest | contacts | settings |
| 5730 | problem | folder | point |
| features | quality | suite | wireless |
| prefer | sound | bluetooth | laptop |
| LG | files | data | working |
| Phone | software | cable | password |
| Black | reason | copy | router |
| | information | format | Wi |
| | flash | installed | |
| | refresh | USB | |
| | Songs | | |
My [Nokia E71] 1 [music player] 2 is not [working] 4 properly even restored to factory [setting] 4. How can I fix this [problem] 4? After I install some added [features] 1 on my [Nokia E71] 1, it started not to [work] 1 properly. Having it restored to default factory [settings] 4, my [music player] 2 is not [working] 4 properly afterwards (was [working] 4 before restoring). Please help me resolve this issue, it would be highly appreciated Thanks!
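In the example above, the number following each bracketed span appears to be the id of the topic assigned to that span (an assumption, since the original superscripts were flattened in this record). Under that reading, a rough topic distribution for a question can be obtained by counting the annotated spans per topic and normalizing. The sketch below does this for the first two sentences of the example; the helper name `topic_proportions` is hypothetical, and counting each bracketed span once is a simplification rather than the LDA inference procedure itself.

```python
import re
from collections import Counter

# Matches "[some span] k", where k is the topic id written after the bracket.
ANNOTATION = re.compile(r"\[([^\]]+)\]\s*(\d+)")


def topic_proportions(annotated_text, num_topics):
    """Share of annotated spans assigned to each topic id 1..num_topics."""
    topic_ids = [int(topic) for _, topic in ANNOTATION.findall(annotated_text)]
    counts = Counter(t for t in topic_ids if 1 <= t <= num_topics)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_topics
    return [counts.get(k, 0) / total for k in range(1, num_topics + 1)]


snippet = (
    "My [Nokia E71] 1 [music player] 2 is not [working] 4 properly "
    "even restored to factory [setting] 4. How can I fix this [problem] 4?"
)

# One span each for topics 1 and 2, none for topic 3, three for topic 4.
print(topic_proportions(snippet, 4))
```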