Jun Gao, Ninghao Liu, Mark Lawley, Xia Hu.
Abstract
Online healthcare forums (OHFs) have become increasingly popular places for patients to share their health-related experiences. The healthcare-related texts posted in OHFs can help doctors and patients better understand specific diseases and the situations of other patients. A common way to extract the meaning of a post is to classify its sentences into several predefined semantic categories. However, the unstructured form of online posts poses challenges for existing classification algorithms. In addition, although sophisticated classification models such as deep neural networks may have good predictive power, the models and their predictions are hard to interpret, and interpretability is critical in healthcare applications. To tackle these challenges, we propose an effective and interpretable OHF post classification framework. Specifically, we classify sentences into three classes: medication, symptom, and background. Each sentence is projected into an interpretable feature space consisting of labeled sequential patterns, UMLS semantic types, and other heuristic features. A forest-based model is developed for categorizing OHF posts. An interpretation method is also developed, in which decision rules can be explicitly extracted to gain insight into the useful information in texts. Experimental results on real-world OHF data demonstrate the effectiveness of our proposed computational framework.
Year: 2017 PMID: 29065580 PMCID: PMC5559930 DOI: 10.1155/2017/2460174
Source DB: PubMed Journal: J Healthc Eng ISSN: 2040-2295 Impact factor: 2.682
Figure 1: An example of an online health forum post.
Figure 2: An overview of the interpretable classification framework.
Tags introduction.
| Tag | Description |
|---|---|
| | Part-of-speech tags ( |
| | Medications or drug terms ( |
| | Symptom terms ( |
| | Frequency phrases (customized regular expressions) |
Algorithm 1: Frequent Labeled Sequential Patterns Generation.
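Only the caption of Algorithm 1 is reproduced here, so the following is not the authors' procedure. It is a minimal sketch of the two quantities that frequent labeled sequential pattern mining typically relies on, assuming a pattern matches a sentence when its tags occur in order (gaps allowed) in the sentence's tag sequence. The helper names and the four toy sentences are hypothetical.

```python
def contains(tags, pattern):
    """True if `pattern` occurs in `tags` as an ordered, possibly gapped, subsequence."""
    it = iter(tags)
    # Each pattern element must be found after the previous match,
    # because the shared iterator is consumed as we go.
    return all(any(tag == p for tag in it) for p in pattern)


def support(pattern, label, labeled_sentences):
    """Fraction of all sentences that contain `pattern` and carry `label`."""
    hits = sum(
        1 for tags, lab in labeled_sentences
        if lab == label and contains(tags, pattern)
    )
    return hits / len(labeled_sentences)


def confidence(pattern, label, labeled_sentences):
    """Among sentences containing `pattern`, the fraction labeled `label`."""
    covered = [lab for tags, lab in labeled_sentences if contains(tags, pattern)]
    return covered.count(label) / len(covered) if covered else 0.0


# Hypothetical tagged sentences: (tag sequence, class label).
toy_sentences = [
    (["PRP", "VBP", "SYMP", "CC", "SYMP"], "Sym."),
    (["PRP", "VBD", "CD", "NN"], "Med."),
    (["PRP", "VBP", "JJ", "SYMP"], "Sym."),
    (["PRP", "SYMP", "NN"], "Med."),
]

toy_pattern = ("PRP", "SYMP")
toy_support = support(toy_pattern, "Sym.", toy_sentences)
toy_confidence = confidence(toy_pattern, "Sym.", toy_sentences)
```

Under these assumptions, a pattern would be kept as a feature when both values exceed chosen thresholds, and each kept pattern becomes a binary indicator such as `(PRP, SYMP) = 1`.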
Labeled sentences result.
| Med. | Symp. | Others | Total |
|---|---|---|---|
| 1127 | 772 | 200 | 2099 |
Model evaluation. Each model is evaluated using 5-fold cross validation. The columns report the average accuracy, weighted average precision, weighted average recall, and weighted average F-score for the medication class (M.), the symptom class (S.), and overall. Each row reports a model trained on a different feature combination.
| Model | Ft. set | M. Acc. | M. Prec. | M. Rec. | M. F1. | S. Acc. | S. Prec. | S. Rec. | S. F1. | Acc. | Prec. | Rec. | F1. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Select + SVM | Word-based | 0.843 | 0.846 | 0.867 | 0.856 | 0.886 | 0.875 | 0.804 | 0.838 | 0.798 | 0.808 | 0.798 | 0.802 |
| | + Semantic | | 0.854 | | | 0.884 | 0.874 | 0.801 | 0.836 | 0.804 | 0.816 | 0.804 | 0.808 |
| | + Position | 0.843 | 0.846 | 0.867 | 0.856 | 0.886 | 0.875 | 0.805 | 0.838 | 0.798 | 0.808 | 0.798 | 0.802 |
| | + Thr. Crt. | 0.844 | 0.846 | 0.867 | 0.857 | 0.896 | | 0.814 | 0.852 | 0.800 | 0.812 | 0.800 | 0.805 |
| | + Morpho. | 0.848 | 0.855 | 0.864 | 0.859 | 0.891 | 0.883 | 0.811 | 0.846 | 0.801 | 0.816 | 0.801 | 0.807 |
| | + Word Cnt. | 0.802 | 0.785 | | 0.826 | 0.864 | 0.888 | 0.722 | 0.796 | 0.761 | 0.773 | 0.761 | 0.763 |
| | LSP | 0.799 | | 0.709 | 0.790 | 0.831 | 0.862 | 0.644 | 0.737 | 0.691 | 0.821 | 0.691 | 0.731 |
| | + Semantic | 0.849 | 0.865 | 0.852 | 0.858 | 0.891 | 0.878 | 0.818 | 0.846 | 0.806 | | 0.806 | |
| | + Position | 0.841 | 0.851 | 0.852 | 0.851 | 0.893 | 0.883 | 0.817 | 0.848 | 0.800 | 0.815 | 0.800 | 0.806 |
| | + Thr. Crt. | 0.844 | 0.852 | 0.859 | 0.855 | | 0.885 | 0.826 | 0.855 | 0.801 | 0.814 | 0.801 | 0.807 |
| | + Morpho. | | 0.860 | 0.864 | 0.861 | 0.896 | 0.883 | 0.826 | 0.854 | | 0.820 | | |
| | + Word Cnt. | 0.848 | 0.856 | 0.862 | 0.859 | | 0.884 | | | 0.807 | 0.819 | 0.807 | 0.812 |
| | + Word-based | 0.810 | 0.810 | 0.844 | 0.826 | 0.870 | 0.887 | 0.739 | 0.806 | 0.768 | 0.792 | 0.768 | 0.776 |
| Lasso | Word-based | 0.794 | 0.730 | | | 0.886 | | 0.712 | 0.820 | 0.791 | | 0.791 | 0.756 |
| | + Semantic | 0.793 | 0.741 | 0.947 | 0.831 | 0.886 | 0.923 | 0.752 | 0.828 | 0.789 | 0.754 | 0.789 | 0.757 |
| | + Position | 0.795 | 0.742 | 0.947 | 0.832 | 0.886 | 0.920 | 0.754 | 0.829 | 0.790 | 0.757 | 0.790 | 0.758 |
| | + Thr. Crt. | 0.796 | 0.745 | 0.945 | 0.833 | 0.889 | 0.922 | 0.762 | 0.834 | 0.791 | 0.756 | 0.791 | 0.759 |
| | + Morpho. | 0.797 | 0.745 | 0.947 | 0.834 | 0.889 | 0.924 | 0.759 | 0.833 | 0.792 | 0.757 | 0.792 | 0.760 |
| | + Word Cnt. | 0.798 | | 0.947 | 0.834 | 0.891 | 0.927 | 0.762 | 0.836 | 0.793 | 0.759 | 0.793 | 0.762 |
| | LSP | 0.715 | 0.663 | 0.955 | 0.782 | 0.802 | 0.875 | 0.538 | 0.666 | 0.711 | 0.678 | 0.711 | 0.665 |
| | + Semantic | 0.769 | 0.712 | 0.955 | 0.816 | 0.861 | 0.911 | 0.689 | 0.785 | 0.767 | 0.727 | 0.767 | 0.728 |
| | + Position | 0.767 | 0.710 | 0.955 | 0.814 | 0.860 | 0.910 | 0.686 | 0.782 | 0.765 | 0.716 | 0.765 | 0.725 |
| | + Thr. Crt. | 0.771 | 0.715 | 0.953 | 0.817 | 0.864 | 0.911 | 0.700 | 0.791 | 0.769 | 0.728 | 0.769 | 0.731 |
| | + Morpho. | 0.771 | 0.715 | 0.953 | 0.817 | 0.864 | 0.910 | 0.698 | 0.790 | 0.769 | 0.728 | 0.769 | 0.730 |
| | + Word Cnt. | 0.771 | 0.715 | 0.953 | 0.817 | 0.864 | 0.910 | 0.698 | 0.790 | 0.769 | 0.728 | 0.769 | 0.730 |
| | + Word-based | | 0.745 | 0.950 | 0.835 | | 0.930 | | | | 0.759 | | |
| Forest-based | Word-based | | 0.795 | | | 0.881 | 0.891 | 0.773 | 0.827 | 0.819 | 0.808 | 0.819 | 0.795 |
| | + Semantic | 0.815 | 0.761 | 0.956 | 0.847 | 0.878 | 0.901 | 0.751 | 0.819 | 0.802 | 0.805 | 0.802 | 0.778 |
| | + Position | 0.820 | 0.767 | 0.957 | 0.851 | 0.887 | | 0.772 | 0.833 | 0.807 | 0.791 | 0.807 | 0.779 |
| | + Thr. Crt. | 0.817 | 0.765 | 0.949 | 0.847 | 0.872 | 0.884 | 0.749 | 0.811 | 0.799 | 0.792 | 0.799 | 0.774 |
| | + Morpho. | 0.832 | 0.776 | 0.965 | 0.860 | 0.890 | 0.907 | 0.781 | 0.838 | 0.816 | | 0.816 | 0.789 |
| | + Word Cnt. | 0.830 | 0.779 | 0.954 | 0.858 | | 0.893 | 0.804 | | 0.814 | 0.797 | 0.814 | 0.783 |
| | LSP | 0.786 | 0.742 | 0.921 | 0.822 | 0.863 | 0.861 | 0.748 | 0.801 | 0.771 | 0.725 | 0.771 | 0.739 |
| | + Semantic | 0.837 | 0.824 | 0.887 | 0.854 | 0.879 | 0.860 | 0.802 | 0.829 | 0.809 | 0.805 | 0.809 | |
| | + Position | 0.840 | | 0.873 | 0.854 | 0.882 | 0.844 | | 0.839 | 0.808 | 0.800 | 0.808 | 0.803 |
| | + Thr. Crt. | 0.832 | 0.825 | 0.875 | 0.849 | 0.879 | 0.849 | 0.814 | 0.831 | 0.802 | 0.796 | 0.802 | 0.797 |
| | + Morpho. | 0.841 | 0.829 | 0.886 | 0.856 | 0.881 | 0.843 | 0.832 | 0.837 | 0.812 | 0.802 | 0.812 | 0.804 |
| | + Word Cnt. | 0.829 | 0.816 | 0.881 | 0.847 | 0.880 | 0.856 | 0.808 | 0.831 | 0.800 | 0.791 | 0.800 | 0.793 |
| | + Word-based | 0.848 | 0.816 | 0.927 | 0.868 | 0.887 | 0.861 | 0.827 | 0.843 | | 0.803 | | 0.802 |
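Wherever both values are present, the overall Acc. and Rec. columns in the table coincide. That is what support-weighted recall implies: weighting each class's recall by its share of the samples gives back overall accuracy. The following is a minimal sketch of the weighted metrics, assuming "weighted average" means support-weighted averaging over the classes (the source does not define it here). The function name `weighted_prf` and the toy labels are hypothetical, and the averaging over the five folds is omitted.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Return support-weighted precision, recall, and F1 over all true classes."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, n_label in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_predicted = sum(1 for p in y_pred if p == label)
        p_c = tp / n_predicted if n_predicted else 0.0
        r_c = tp / n_label
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = n_label / total  # class share of the samples
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1


# Hypothetical labels for five sentences.
y_true = ["med", "med", "sym", "sym", "bg"]
y_pred = ["med", "sym", "sym", "sym", "med"]

w_precision, w_recall, w_f1 = weighted_prf(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

On the toy labels, `w_recall` and `accuracy` are both 0.6, mirroring the matching Acc. and Rec. columns.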
Top 10 average weight of word-based, LSP, semantic features in Lasso.
| Word-based | Average weight | LSP | Average weight | Semantic | Average weight |
|---|---|---|---|---|---|
| Avoiding | −0.413 | (PRP, PRP, RB, SYMP) | 0.081 | sosy | 0.329 |
| Wrong | −0.363 | (PRP, PRP, VB, SYMP) | 0.060 | mobd | 0.207 |
| Avoid | −0.343 | (VBZ, CC, SYMP) | 0.058 | patf | 0.190 |
| Prescribe | −0.323 | (SYMP, SYMP, SYMP) | 0.054 | resa | −0.173 |
| Bleeding | 0.283 | (PRP, SYMP, CC, SYMP, IN) | −0.053 | inpo | 0.100 |
| Anxiety | 0.281 | (CC, SYMP, IN, SYMP) | −0.052 | anab | 0.094 |
| Swelling | 0.233 | (PRP, SYMP, VBG) | 0.049 | mcha | −0.092 |
| Increased | −0.185 | (RB, SYMP, VB) | 0.048 | aggp | −0.090 |
| Migraines | 0.185 | (JJ, IN, JJ, SYMP) | 0.036 | plnt | −0.063 |
| Fever | 0.160 | (NN, SYMP, RB, SYMP) | −0.033 | mamm | −0.052 |
Top 10 feature contributions for medication and symptom class in a random forest model.
| Feature | Back. | Med. | Sym. |
|---|---|---|---|
| Medication class | | | |
| Prescribed = 1 | −0.00275 | 0.01195 | −0.00920 |
| (PRP, CD, CD) = 1 | −0.00251 | 0.01156 | −0.00905 |
| Morpho. = 1 | −0.00206 | 0.00660 | −0.00455 |
| hlca = 1 | −0.00071 | 0.00559 | −0.00489 |
| (NN, SYMP, SYMP, CC) = 0 | 0.00115 | 0.00429 | −0.00544 |
| sosy = 0 | 0.00191 | 0.00406 | −0.00597 |
| (PRP, CD, IN, NN, NN) = 1 | −0.00075 | 0.00402 | −0.00327 |
| (CD, IN, CD, CD) = 1 | −0.00120 | 0.00396 | −0.00276 |
| thr. Crt. = 0 | 0.00154 | 0.00381 | −0.00535 |
| (PRP, CD, JJ, JJ) = 1 | −0.00086 | 0.00362 | −0.00276 |
| Symptom class | | | |
| sosy = 1 | −0.00589 | −0.00783 | 0.01371 |
| Prescribed = 0 | 0.00234 | −0.015734 | 0.01339 |
| thr. Crt. = 1 | −0.00381 | −0.00683 | 0.01064 |
| (PRP, CD, CD) = 0 | 0.00271 | −0.01264 | 0.00993 |
| (SYMP, SYMP, SYMP) = 1 | −0.00330 | −0.00564 | 0.00895 |
| (NN, SYMP, SYMP, CC) = 1 | −0.00209 | −0.00667 | 0.00876 |
| Position < | −0.00334 | −0.00540 | 0.00874 |
| patf = 1 | −0.00254 | −0.00379 | 0.00633 |
| (SYMP, CC, JJ) = 1 | −0.00172 | −0.00404 | 0.00576 |
| Word count > | −0.00131 | −0.00423 | 0.00554 |
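Each row of the table sums to roughly zero across the three classes. This is consistent with reading a contribution as a shift in predicted class probability, where a gain for one class is offset by losses for the others; that reading is an assumption, since the source does not define the contributions here. The following minimal sketch checks it on four rows copied from the table. The helper names are hypothetical.

```python
CLASSES = ("Back.", "Med.", "Sym.")

# Rows copied from the table: feature condition -> (Back., Med., Sym.)
contributions = {
    "Prescribed = 1": (-0.00275, 0.01195, -0.00920),
    "(PRP, CD, CD) = 1": (-0.00251, 0.01156, -0.00905),
    "sosy = 1": (-0.00589, -0.00783, 0.01371),
    "thr. Crt. = 1": (-0.00381, -0.00683, 0.01064),
}


def favored_class(row):
    """Class whose contribution is largest for this feature condition."""
    best_index = max(range(len(row)), key=lambda i: row[i])
    return CLASSES[best_index]


def row_sum(row):
    """Net contribution across classes; near zero if contributions offset."""
    return sum(row)


summary = {
    name: (favored_class(row), row_sum(row))
    for name, row in contributions.items()
}
```

In `summary`, the two medication-related conditions favor `Med.`, the two symptom-related conditions favor `Sym.`, and every net sum stays within rounding error of zero.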
Top 10 discriminative patterns in a DPClass model.
| Pattern | Leaf class |
|---|---|
| ((RB, CD, CD) = 0) ∩ ((PRP, CD, CD, JJ) = 0) ∩ ((PRP, CD, NN, NN, NN) = 0) ∩ ((TO, VB, CD) = 1) | Med. |
| ((IN, NN, NN, comma, SYMP) = 0) ∩ ((CD, RB, CD) = 1) ∩ ((RB, IN, IN, CD, IN) = 0) ∩ ((PRP, CC, CD, NN) = 1) | Med. |
| ((SYMP, NN, VBG) = 1) | Sym. |
| ((VBP, CD, NN, NN) = 0) ∩ ((SYMP, SYMP, NN) = 1) | Sym. |
| ((RB, CD, IN, IN) = 0) ∩ ((VBP, IN, CD, CD, NN) = 0) ∩ (“mg” = 0) ∩ (“prescribed” = 1) ∩ (dsyn = 1) | Med. |
| ((PRP, VBP, CD) = 0) ∩ ((CD, CD, NN, NN) = 1) ∩ ((TO, CD, IN) = 1) | Med. |
| (“cough” = 1) | Sym. |
| ((RB, CD, IN, IN) = 0) ∩ ((VBP, IN, CD, CD, NN) = 0) ∩ (“mg” = 0) ∩ (“prescribed” = 0) ∩ (fndg = 0) ∩ ((NN, comma, comma, SYMP) = 1) | Sym. |
| ((RB, CD, IN, IN) = 0) ∩ ((VBP, IN, CD, CD, NN) = 0) ∩ (“mg” = 0) ∩ (“prescribed” = 1) ∩ (dsyn = 0) | Med. |
| (“anxiety” = 1) | Sym. |
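Each pattern in the table is a conjunction of conditions on binary features. A minimal sketch of how such a conjunction can be checked against one sentence follows, assuming that a feature absent from the sentence counts as 0. The representation, the function names, and the example feature dictionaries are hypothetical, and the sketch only reports which patterns fire; it does not reproduce how DPClass combines them into a prediction.

```python
# Three patterns copied from the table: ([(feature, required value), ...], leaf class)
PATTERNS = [
    ([(("VBP", "CD", "NN", "NN"), 0), (("SYMP", "SYMP", "NN"), 1)], "Sym."),
    ([("cough", 1)], "Sym."),
    (
        [
            (("PRP", "VBP", "CD"), 0),
            (("CD", "CD", "NN", "NN"), 1),
            (("TO", "CD", "IN"), 1),
        ],
        "Med.",
    ),
]


def matches(conditions, features):
    """True when every condition holds; features missing from the dict count as 0."""
    return all(features.get(name, 0) == value for name, value in conditions)


def matched_leaves(features, patterns=PATTERNS):
    """Leaf classes of all patterns whose conjunction is satisfied, in table order."""
    return [leaf for conditions, leaf in patterns if matches(conditions, features)]


# Hypothetical sentences, given as the binary features that are switched on.
symptom_like = {("SYMP", "SYMP", "NN"): 1}
mixed = {"cough": 1, ("CD", "CD", "NN", "NN"): 1, ("TO", "CD", "IN"): 1}

symptom_result = matched_leaves(symptom_like)
mixed_result = matched_leaves(mixed)
```

Tuples stand for labeled sequential patterns and plain strings for word features, matching the two kinds of conditions that appear in the table.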