Abstract
To help patients participate in online health communities and obtain the informational and emotional support they need, this paper proposes an approach for automatically identifying the topics of health-related messages, so that patients can efficiently reach the messages most relevant to their queries. The approach uses a feature-based classification framework. We first collected messages related to a set of predefined topics from an online health community. We then combined three types of features (n-gram-based features, domain-specific features, and sentiment features) to build four feature sets for representing health-related text. Finally, we evaluated the topic classification model with three text classification techniques: C4.5, Naïve Bayes, and SVM. Comparing the feature sets and classification techniques, we found that n-gram-based, domain-specific, and sentiment features were all effective in distinguishing different types of health-related topics. Feature reduction based on information gain further improved topic classification performance. Among the classification techniques, SVM significantly outperformed C4.5 and Naïve Bayes. The experimental results demonstrate that the proposed approach can identify the topics of online health-related messages efficiently.
Keywords: Online health community; Text classification; Topic classification; Topic identification
Year: 2013 PMID: 23961389 PMCID: PMC3736074 DOI: 10.1186/2193-1801-2-309
Source DB: PubMed Journal: Springerplus ISSN: 2193-1801
Figure 1. The design framework of automatic topic identification.
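The feature-extraction and feature-reduction steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the whitespace tokenizer, the binary presence/absence treatment of each n-gram, and the toy messages are all assumptions made for illustration, and the classifiers (C4.5, Naïve Bayes, SVM) are not shown.

```python
import math
from collections import Counter


def word_ngrams(text, n_max=2):
    """Return the set of word n-grams (n = 1..n_max) found in one message."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    grams = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_sets, labels, feature):
    """Entropy reduction from knowing whether `feature` occurs in a message."""
    present = [y for f, y in zip(feature_sets, labels) if feature in f]
    absent = [y for f, y in zip(feature_sets, labels) if feature not in f]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def select_features(messages, labels, n_max=2, min_gain=0.0):
    """Keep only the n-grams whose information gain exceeds `min_gain`."""
    feature_sets = [word_ngrams(m, n_max) for m in messages]
    vocabulary = set().union(*feature_sets)
    gains = {f: information_gain(feature_sets, labels, f) for f in vocabulary}
    return {f: g for f, g in gains.items() if g > min_gain}


# Hypothetical toy messages, invented for illustration only (not from the study).
toy_messages = [
    "chemo starts next week",
    "feeling scared and alone",
    "chemo side effects",
    "so scared today",
]
toy_labels = ["Treatment", "Emotional", "Treatment", "Emotional"]

selected = select_features(toy_messages, toy_labels, min_gain=0.5)
```

On this toy data, n-grams such as `chemo` and `scared` separate the two labels perfectly and are retained, while an n-gram such as `next`, which occurs in only one message, falls below the threshold and is dropped.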
The UMLS semantic types used
| Abbr. | Semantic types | Abbr. | Semantic types |
|---|---|---|---|
| Aapp | Amino acid, peptide, or protein | Imft | Immunologic factor |
| Acab | Acquired abnormality | Inpo | Injury or poisoning |
| Anab | Anatomical abnormality | Lbpr | Laboratory procedure |
| Bdsy | Body system | Mobd | Mental or behavioral dysfunction |
| Blor | Body location or region | Neop | Neoplastic process |
| Bmod | Biomedical occupation or discipline | Orch | Organic chemical |
| Bpoc | Body part, organ, or organ component | Patf | Pathologic function |
| Diap | Diagnostic procedure | Phsu | Pharmacologic substance |
| Dsyn | Disease or syndrome | Sosy | Sign or symptom |
| Horm | Hormone | Topp | Therapeutic or preventive procedure |
Figure 2. Accuracies for different frequency thresholds for word n-grams.
Figure 3. Accuracies using different feature sets and classification techniques.
Pairwise t-tests on accuracy and F-measure for different feature sets
| Feature set | P-value on accuracy (C4.5) | P-value on accuracy (Naïve Bayes) | P-value on accuracy (SVM) | P-value on F-measure (C4.5) | P-value on F-measure (Naïve Bayes) | P-value on F-measure (SVM) |
|---|---|---|---|---|---|---|
| FS1 < FS2 | <0.0001 | <0.0001 | <0.0001 | <0.0001 | <0.0001 | 0.0002 |
| FS2 < FS3 | 0.1717 | 0.0583 | 0.0036 | 0.3392 | 0.2104 | 0.0075 |
| FS3 < FS4 | 0.0282 | <0.0001 | <0.0001 | 0.0425 | <0.0001 | <0.0001 |
Pairwise t-tests on accuracy and F-measure for different classification techniques
| Classifiers | P-value on accuracy (FS1) | P-value on accuracy (FS2) | P-value on accuracy (FS3) | P-value on accuracy (FS4) | P-value on F-measure (FS1) | P-value on F-measure (FS2) | P-value on F-measure (FS3) | P-value on F-measure (FS4) |
|---|---|---|---|---|---|---|---|---|
| C4.5 < Naïve Bayes | 0.2997 | 0.3439 | 0.3667 | 0.00146 | 0.03679 | 0.1158 | 0.0748 | <0.0001 |
| C4.5 < SVM | <0.0001 | <0.0001 | <0.0001 | <0.0001 | <0.0001 | <0.0001 | <0.0001 | <0.0001 |
| Naïve Bayes < SVM | <0.0001 | <0.0001 | <0.0001 | <0.0001 | <0.0001 | <0.0001 | <0.0001 | <0.0001 |
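This record does not say how the pairwise t-tests were set up. As a rough sketch only, the code below assumes a paired design in which two configurations are scored on the same evaluation folds, and computes the paired t statistic and its degrees of freedom. The function name `paired_t_statistic` and that paired-fold assumption are illustrative, not taken from the paper. Turning the statistic into a p-value needs the Student t distribution, which a statistics library such as SciPy provides.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for the hypothesis that scores_b exceeds scores_a.

    Both lists must hold one score per shared evaluation fold
    (assumption: the same folds were used for both configurations).
    Returns (t, degrees_of_freedom).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired scores")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

A large positive t with n - 1 degrees of freedom corresponds to a small one-sided p-value, which is how rows such as "C4.5 < SVM" in the table above would be read under this assumption.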
Performance measures of different topic groups
| Feature set | Topic | Precision | Recall | F-measure |
|---|---|---|---|---|
| FS1 (F1) | Treatment | 75.2% | 70.3% | 72.6% |
| | Emotional | 82.2% | 72.4% | 77.0% |
| | Survivorship | 76.2% | 87.1% | 81.3% |
| | Average | 77.6% | 77.4% | 77.2% |
| FS2 (F1+F2) | Treatment | 77.0% | 74.1% | 75.5% |
| | Emotional | 84.5% | 75.3% | 79.7% |
| | Survivorship | 78.4% | 87.2% | 82.6% |
| | Average | 79.7% | 79.5% | 79.4% |
| FS3 (F1+F2+F3) | Treatment | 77.2% | 74.0% | 75.6% |
| | Emotional | 84.6% | 75.3% | 79.7% |
| | Survivorship | 78.3% | 87.5% | 82.6% |
| | Average | 79.8% | 79.6% | 79.5% |
| FS4 (selected F1+F2+F3) | Treatment | 80.2% | 73.9% | 76.9% |
| | Emotional | 87.9% | 75.5% | 81.2% |
| | Survivorship | 77.2% | 90.3% | 83.2% |
| | Average | 81.2% | 80.7% | 80.6% |
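The per-topic F-measure values above are consistent, within rounding, with the standard harmonic mean of precision and recall, F = 2PR / (P + R). The sketch below assumes that definition, since this record does not state the formula; the function name `f_measure` is illustrative. The "Average" rows are not reproduced by applying the formula to the averaged precision and recall, so they were presumably averaged in some other way that the record does not describe.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure).

    Inputs may be fractions or percentages, as long as both use the
    same scale; the result is on that scale too.
    """
    if precision + recall == 0:
        return 0.0  # convention: F is 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Per-topic rows for FS4 from the table above, as (precision, recall) in percent.
fs4_rows = {
    "Treatment": (80.2, 73.9),
    "Emotional": (87.9, 75.5),
    "Survivorship": (77.2, 90.3),
}
fs4_f = {topic: f_measure(p, r) for topic, (p, r) in fs4_rows.items()}
```

Rounded to one decimal place, `fs4_f` matches the F-measure column for the three FS4 topic rows.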