Víctor M Prieto, Sérgio Matos, Manuel Álvarez, Fidel Cacheda, José Luís Oliveira.
Abstract
With the proliferation of social networks and blogs, the Internet is increasingly being used to disseminate personal health information rather than just as a source of information. In this paper we exploit the wealth of user-generated data available through the micro-blogging service Twitter to estimate and track the incidence of health conditions in society. The method has two stages: we first extract possibly relevant tweets using a set of specially crafted regular expressions, and then classify these initial messages using machine learning methods. We also select relevant features to improve the results and the execution times. To test the method, we considered four health states or conditions, namely flu, depression, pregnancy and eating disorders, and two locations, Portugal and Spain. We present the results obtained and show that both the detection results and the performance of the method improve after feature selection. The results are promising, with areas under the receiver operating characteristic curve between 0.7 and 0.9, and F-measure values of around 0.8 to 0.9. This indicates that such an approach provides a feasible solution for measuring and tracking the evolution of health states within society.
Year: 2014 PMID: 24489699 PMCID: PMC3906034 DOI: 10.1371/journal.pone.0086191
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
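The two-stage method summarized in the abstract (regular-expression filtering followed by a learned classifier) can be sketched roughly as follows. This is only an illustration under stated assumptions: `FLU_PATTERN`, the toy tweets, and the small `NaiveBayes` class are hypothetical and are not the authors' expressions, data, or implementation.

```python
import math
import re
from collections import Counter

# Hypothetical stage-1 pattern; NOT one of the paper's regular expressions.
FLU_PATTERN = re.compile(r"\b(gripe|fiebre|febre)\b", re.IGNORECASE)


def candidate_tweets(tweets):
    """Stage 1: keep only tweets that match the keyword pattern."""
    return [t for t in tweets if FLU_PATTERN.search(t)]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in tokenize(text):
                if w in self.vocab:  # words unseen in training are skipped
                    score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy labelled examples (invented for this sketch).
train_texts = [
    "tengo gripe y fiebre, me quedo en cama",
    "estoy con gripe desde ayer",
    "vacuna contra la gripe disponible en farmacias",
    "noticias: campaña de la gripe empieza hoy",
]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(train_texts, train_labels)

stream = [
    "hoy tengo gripe y mucha fiebre",
    "qué buen partido de fútbol",
    "campaña de vacuna contra la gripe",
]
# Stage 2: classify only the tweets that survived the regex filter.
candidates = candidate_tweets(stream)
predictions = [(t, model.predict(t)) for t in candidates]
```

The regular expression acts as a cheap, high-recall filter, and the classifier then separates personal reports from tweets that merely mention the keyword.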
Regular expressions for detecting health disorders in Spanish tweets.
| Flu | Regular Expression |
| Flu | |
| Cold | |
| Flu Symptoms | |
| Pregnancy | |
| Common phrases | |
| Depression | |
| Depressed | |
| Common phrases | |
| Obesity | |
| Overweight | |
| Bulimia | |
| Anorexia | |
| Bigorexia | |
| Ebigorexia | |
| Orthorexia | |
| Common phrases | |
Regular expressions for detecting health disorders in Portuguese tweets.
| Flu | Regular Expression |
| Flu | |
| Cold | |
| Flu Symptoms | |
| Pregnancy | |
| Common phrases | |
| Depression | |
| Depressed | |
| Common phrases | |
| Obesity | |
| Overweight | |
| Bulimia | |
| Anorexia | |
| Bigorexia | |
| Ebigorexia | |
| Orthorexia | |
| Common phrases | |
Number of examples and features in each dataset.
| | Spanish | | Portuguese | |
| Dataset | Tweets | Features | Tweets | Features |
| Depression | 3253 | 721 | 2846 | 983 |
| Pregnancy | 1985 | 698 | 2629 | 1042 |
| Flu | 827 | 608 | 1153 | 842 |
| Eating Disorders | 412 | 567 | 468 | 747 |
Number of tweets labelled as positive, negative and undecided.
| | Spanish Tweets | | | Portuguese Tweets | | |
| Disease | Positive | Negative | Undecided | Positive | Negative | Undecided |
| Depression | 160 | 3093 | 0 | 120 | 2725 | 1 |
| Pregnancy | 65 | 1920 | 0 | 38 | 2588 | 3 |
| Flu | 663 | 164 | 0 | 649 | 501 | 3 |
| Eating Disorders | 111 | 301 | 0 | 87 | 368 | 13 |
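The class balance implied by the table above can be worked out with a few lines of arithmetic. The sketch below copies the Spanish counts from the table; the dictionary layout and the `positive_share` helper are illustrative choices, not something taken from the paper.

```python
# (positive, negative, undecided) counts for the Spanish tweets, from the table above.
spanish_counts = {
    "Depression": (160, 3093, 0),
    "Pregnancy": (65, 1920, 0),
    "Flu": (663, 164, 0),
    "Eating Disorders": (111, 301, 0),
}


def positive_share(counts):
    """Fraction of labelled tweets in a dataset that were marked positive."""
    positive, negative, undecided = counts
    return positive / (positive + negative + undecided)


for disease, counts in spanish_counts.items():
    print(f"{disease}: {positive_share(counts):.1%} positive of {sum(counts)} tweets")
```

For the Spanish data this gives roughly 4.9% positives for depression and 80.2% for flu, so the datasets are far from balanced, and the row totals match the tweet counts reported for each dataset.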
Results obtained on the datasets.
| | | Spanish Tweets | | | | Portuguese Tweets | | | |
| Disease | Classifier | F-Measure | Precision | Recall | AUC | F-Measure | Precision | Recall | AUC |
| Depression | Naïve Bayes | 0.913 | 0.949 | 0.891 | 0.878 | 0.912 | 0.947 | 0.887 | 0.833 |
| | SVM | 0.946 | 0.948 | 0.944 | 0.739 | 0.902 | 0.934 | 0.876 | 0.691 |
| | Decision Tree | 0.976 | 0.968 | 0.985 | 0.845 | 0.974 | 0.963 | 0.985 | 0.762 |
| | kNN | 0.862 | 0.937 | 0.814 | 0.784 | 0.900 | 0.937 | 0.871 | 0.768 |
| Pregnancy | Naïve Bayes | 0.952 | 0.948 | 0.957 | 0.703 | 0.977 | 0.973 | 0.982 | 0.877 |
| | SVM | 0.940 | 0.942 | 0.939 | 0.644 | 0.945 | 0.975 | 0.920 | 0.679 |
| | Decision Tree | 0.947 | 0.944 | 0.951 | 0.689 | 0.978 | 0.971 | 0.985 | 0.801 |
| | kNN | 0.949 | 0.945 | 0.953 | 0.701 | 0.979 | 0.975 | 0.985 | 0.714 |
| Flu | Naïve Bayes | 0.766 | 0.759 | 0.775 | 0.743 | 0.667 | 0.667 | 0.669 | 0.746 |
| | SVM | 0.755 | 0.749 | 0.764 | 0.696 | 0.681 | 0.691 | 0.690 | 0.671 |
| | Decision Tree | 0.779 | 0.757 | 0.804 | 0.670 | 0.672 | 0.672 | 0.674 | 0.746 |
| | kNN | 0.761 | 0.756 | 0.799 | 0.786 | 0.687 | 0.687 | 0.689 | 0.745 |
| Eating Disorders | Naïve Bayes | 0.720 | 0.720 | 0.720 | 0.714 | 0.786 | 0.785 | 0.817 | 0.744 |
| | SVM | 0.683 | 0.688 | 0.679 | 0.607 | 0.725 | 0.729 | 0.720 | 0.650 |
| | Decision Tree | 0.785 | 0.756 | 0.817 | 0.630 | 0.869 | 0.838 | 0.902 | 0.690 |
| | kNN | 0.684 | 0.714 | 0.669 | 0.696 | 0.667 | 0.737 | 0.630 | 0.686 |
AUC = Area Under the receiver operating characteristic Curve.
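For readers less familiar with the metrics in the table above, here is a small sketch of how precision, recall, F-measure and AUC are commonly computed for a binary classifier. The toy scores and counts are invented, and this section does not state how the paper aggregates the metrics across classes, so the sketch is not meant to reproduce the reported numbers.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy classifier scores: one positive tweet (0.35) is ranked below a negative one (0.6).
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2]
toy_labels = [1, 1, 1, 0, 0]
toy_auc = auc(toy_scores, toy_labels)  # 5 of 6 pairs are ordered correctly
```

Unlike the F-measure, AUC depends only on how the classifier ranks positives against negatives, not on a particular decision threshold.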
Number of features selected by each selection method.
| | Spanish | | | | Portuguese | | | |
| Disease | CFS | Pearson | Gain Ratio | Relief | CFS | Pearson | Gain Ratio | Relief |
| Pregnancy | 52 | 51 | 51 | 51 | 31 | 51 | 51 | 201 |
| Depression | 55 | 51 | 51 | 51 | 62 | 51 | 51 | 201 |
| Flu | 74 | 51 | 301 | 101 | 52 | 51 | 301 | 201 |
| Eating Disorders | 58 | 51 | 51 | 51 | 62 | 51 | 51 | 101 |
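As one illustration of the selection methods named in the table above, the sketch below ranks features by the absolute Pearson correlation between each feature column and the class label and keeps the top `k`. The toy matrix, the feature names, and the cutoff are assumptions; how the paper configured its Pearson-based selection is not described in this section.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def top_k_features(rows, labels, names, k):
    """Return the k feature names with the largest |correlation| to the label."""
    scored = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        scored.append((abs(pearson(column, labels)), name))
    # Sort by descending score; break ties alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:k]]


# Toy binary bag-of-words matrix (invented): one row per tweet, one column per word.
feature_names = ["gripe", "fiebre", "hoy"]
rows = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [1, 0, 1],
]
labels = [1, 1, 0, 0]
selected = top_k_features(rows, labels, feature_names, k=1)
```

In this toy example "gripe" appears in every tweet and therefore carries no information about the label, while "fiebre" separates the two classes perfectly, so it is the feature that is kept.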
Classification results obtained on subsets generated using CFS.
| | | Spanish Tweets | | | | Portuguese Tweets | | | |
| Disease | Classifier | F-Measure | Precision | Recall | AUC | F-Measure | Precision | Recall | AUC |
| Depression | NB | 0.913+ | 0.949 | 0.890− | 0.892 | 0.912+ | 0.949 | 0.888+ | 0.861 |
| | kNN | 0.932+ | 0.943 | 0.923+ | 0.876+ | 0.934+ | 0.931− | 0.938+ | 0.821 |
| Pregnancy | NB | 0.955 | 0.953 | 0.958 | 0.832 | 0.982 | 0.981 | 0.985 | 0.882 |
| | kNN | 0.950+ | 0.942− | 0.959+ | 0.774 | 0.979+ | 0.972− | 0.985+ | 0.847+ |
| Flu | NB | 0.818 | 0.813 | 0.828+ | 0.818+ | 0.688 | 0.687 | 0.690 | 0.752 |
| | kNN | 0.755− | 0.748− | 0.796− | 0.677− | 0.572 | 0.687 | 0.690 | 0.752 |
| Eating Disorders | NB | 0.762 | 0.765 | 0.760+ | 0.796 | 0.844+ | 0.841+ | 0.853+ | 0.858+ |
| | kNN | 0.705 | 0.765+ | 0.760+ | 0.796+ | 0.782+ | 0.773+ | 0.800+ | 0.762+ |
Statistically significant differences (), as compared to using all the features and measured by a two-tailed t-test, are marked as ‘+’ (positive differences) and ‘−’ (negative differences).
Classification results obtained on subsets generated using Pearson correlation.
| | | Spanish Tweets | | | | Portuguese Tweets | | | |
| Disease | Classifier | F-Measure | Precision | Recall | AUC | F-Measure | Precision | Recall | AUC |
| Depression | NB | 0.902 | 0.946− | 0.875 | 0.872 | 0.900− | 0.947 | 0.868− | 0.832 |
| | kNN | 0.927+ | 0.941 | 0.916+ | 0.840+ | 0.934+ | 0.932− | 0.934+ | 0.783 |
| Pregnancy | NB | 0.919− | 0.955+ | 0.893− | 0.778 | 0.961 | 0.981 | 0.946 | 0.892 |
| | kNN | 0.946− | 0.944− | 0.949− | 0.701 | 0.974− | 0.973− | 0.976− | 0.872+ |
| Flu | NB | 0.815 | 0.812 | 0.820 | 0.788+ | 0.672 | 0.672 | 0.674 | 0.744 |
| | kNN | 0.761+ | 0.752 | 0.973+ | 0.643− | 0.584− | 0.601− | 0.607− | 0.628 |
| Eating Disorders | NB | 0.759 | 0.760 | 0.743 | 0.781 | 0.834 | 0.838 | 0.831+ | 0.849+ |
| | kNN | 0.714+ | 0.717 | 0.711+ | 0.680 | 0.721+ | 0.752+ | 0.701+ | 0.685− |
Statistically significant differences (), as compared to using all the features and measured by a two-tailed t-test, are marked as ‘+’ (positive differences) and ‘−’ (negative differences).
Classification results obtained on subsets generated using Gain Ratio.
| | | Spanish Tweets | | | | Portuguese Tweets | | | |
| Disease | Classifier | F-Measure | Precision | Recall | AUC | F-Measure | Precision | Recall | AUC |
| Depression | NB | 0.919+ | 0.949 | 0.900+ | 0.909+ | 0.944+ | 0.942− | 0.958+ | 0.864+ |
| | kNN | 0.936+ | 0.932 | 0.943+ | 0.844+ | 0.937+ | 0.926− | 0.957+ | 0.813 |
| Pregnancy | NB | 0.954 | 0.949+ | 0.963 | 0.750 | 0.979 | 0.980 | 0.971 | 0.879 |
| | kNN | 0.951− | 0.936− | 0.967+ | 0.743 | 0.978− | 0.972− | 0.984− | 0.849+ |
| Flu | NB | 0.803+ | 0.800 | 0.808+ | 0.797 | 0.719 | 0.713 | 0.714 | 0.786+ |
| | kNN | 0.773 | 0.770 | 0.805+ | 0.690− | 0.736+ | 0.736+ | 0.736+ | 0.673− |
| Eating Disorders | NB | 0.710− | 0.797+ | 0.767+ | 0.729+ | 0.786+ | 0.800+ | 0.826+ | 0.750+ |
| | kNN | 0.640− | 0.739+ | 0.735+ | 0.710+ | 0.738+ | 0.850+ | 0.815+ | 0.731+ |
Statistically significant differences (), as compared to using all the features and measured by a two-tailed t-test, are marked as ‘+’ (positive differences) and ‘−’ (negative differences).
Classification results obtained on subsets generated using Relief.
| | | Spanish Tweets | | | | Portuguese Tweets | | | |
| Disease | Classifier | F-Measure | Precision | Recall | AUC | F-Measure | Precision | Recall | AUC |
| Depression | NB | 0.904 | 0.946− | 0.877 | 0.873 | 0.885− | 0.945 | 0.842− | 0.820 |
| | kNN | 0.924+ | 0.942 | 0.911+ | 0.820+ | 0.925+ | 0.936 | 0.915+ | 0.787 |
| Pregnancy | NB | 0.893− | 0.950+ | 0.849− | 0.711 | 0.961 | 0.981 | 0.945 | 0.887 |
| | kNN | 0.945− | 0.943− | 0.947− | 0.707 | 0.965− | 0.975 | 0.956− | 0.747 |
| Flu | NB | 0.785+ | 0.770 | 0.748− | 0.746+ | 0.643 | 0.635 | 0.635 | 0.730− |
| | kNN | 0.758 | 0.747 | 0.787 | 0.700 | 0.670− | 0.650− | 0.640− | 0.695− |
| Eating Disorders | NB | 0.731 | 0.733 | 0.728 | 0.736 | 0.824 | 0.826+ | 0.822+ | 0.777 |
| | kNN | 0.711+ | 0.707− | 0.715+ | 0.700+ | 0.781+ | 0.778+ | 0.782+ | 0.677− |
Statistically significant differences (), as compared to using all the features and measured by a two-tailed t-test, are marked as ‘+’ (positive differences) and ‘−’ (negative differences).
Figure 1. Classifier ROC curves for the Spanish datasets.
The ROC curves illustrate the performance for classifying tweets as related to (positive) or not related to (negative) a given health state. Results shown are for a Naïve Bayes classifier trained with the subsets of features generated by four different feature selection algorithms. A) Depression, B) Pregnancy, C) Flu and D) Eating Disorders.
Figure 2. Classifier ROC curves for the Portuguese datasets.
The ROC curves illustrate the performance for classifying tweets as related to (positive) or not related to (negative) a given health state. Results shown are for a Naïve Bayes classifier trained with the subsets of features generated by four different feature selection algorithms. A) Depression, B) Pregnancy, C) Flu and D) Eating Disorders.
Time (s) spent training the models and performing classification.
| Features | Classifier | Training | Classification |
| All Features | Naïve Bayes | 23.25 | 219.32 |
| | kNN | - | 60.95 |
| CFS | Naïve Bayes | 0.01 | 0.01 |
| | kNN | - | 15.95 |
| Pearson correlation | Naïve Bayes | 0.01 | 0.10 |
| | kNN | - | 15.78 |
| Gain Ratio | Naïve Bayes | 0.02 | 0.22 |
| | kNN | - | 20.88 |
| Relief | Naïve Bayes | 0.01 | 0.33 |
| | kNN | - | 40.62 |