| Literature DB >> 36171569 |
Kazi Zainab1, Gautam Srivastava2,3,4, Vijay Mago1.
Abstract
BACKGROUND: Twitter is a popular social networking site whose short messages, or "tweets", have been used extensively for research purposes. However, little research has been done on mining medical professions, such as detecting users' occupations from their biographical content. Mining such professions can be used to build efficient recommender systems for cost-effective targeted advertisements. Moreover, it is important to develop effective methods of identifying user occupations, since conventional classification methods rely on features engineered by human intelligence. Although such features may yield favorable results, it remains extremely challenging for traditional classifiers to predict medical occupations accurately, since the task involves predicting multiple occupations. Hence, this study focuses on predicting the medical occupational class of users from their public biographical ("bio") content. We conducted our analysis by annotating the bio content of Twitter users. In this paper, we propose a method that combines word embeddings with state-of-the-art neural network models: Long Short-Term Memory (LSTM), Bidirectional LSTM (Bi-LSTM), Gated Recurrent Unit (GRU), Bidirectional Encoder Representations from Transformers (BERT), and A Lite BERT (ALBERT). We also observed that combining word embeddings with these neural network models removes the need to construct any hand-crafted attributes or features: the bio contents are formatted as dense vectors, which are fed into the neural network models as sequences of vectors. RESULT: Performance metrics, including accuracy, precision, recall, and F1-score, show a significant advantage for our method of combining word embeddings with neural network models over traditional methods.
The scores show that our proposed approach outperforms traditional machine learning techniques for detecting medical occupations among users. ALBERT performed best among the deep learning networks, with an F1-score of 0.90.
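The abstract describes feeding bios into the networks as sequences of dense vectors produced by an embedding lookup. A minimal pure-Python sketch of that step, using toy dimensions and random weights (illustrative assumptions, not the paper's trained embeddings):

```python
import random

random.seed(0)

VOCAB_SIZE = 1000   # toy vocabulary size (illustrative, not the paper's)
EMBED_DIM = 4       # toy embedding dimension

# Embedding matrix: one dense vector per vocabulary index.
embedding = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)]
             for _ in range(VOCAB_SIZE)]

def embed(token_indexes, max_len=6, pad_index=0):
    """Pad/truncate a sequence of term indexes and look up dense vectors."""
    padded = (token_indexes + [pad_index] * max_len)[:max_len]
    return [embedding[i] for i in padded]

# Term indexes for the example bio "Pharmacist professor and specialist in geriatrics".
vectors = embed([467, 43, 254, 78, 51, 165])
print(len(vectors), len(vectors[0]))  # 6 4
```

The resulting fixed-length sequence of vectors is what a recurrent or transformer model would consume as input.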
Keywords: Deep learning; Medical data; Natural language processing; Text classification; Twitter
Year: 2022 PMID: 36171569 PMCID: PMC9520792 DOI: 10.1186/s12859-022-04933-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Metric scores
| Model | Accuracy | Precision weighted | Recall weighted | F1 weighted score | Training time (s) |
|---|---|---|---|---|---|
| Naïve-Bayes | 0.73 | 0.62 | 0.60 | 0.64 | 586 |
| LR | 0.78 | 0.70 | 0.70 | 0.70 | 574 |
| SGD | 0.79 | 0.74 | 0.78 | 0.75 | 607 |
| GRU | 0.89 | 0.83 | 0.86 | 0.88 | 357 |
| LSTM | 0.90 | 0.89 | 0.85 | 0.84 | 453 |
| Bi-LSTM | 0.91 | 0.86 | 0.85 | 0.88 | 762 |
| BERT | 0.94 | 0.90 | 0.89 | 0.89 | 908 |
| ALBERT | 0.95 | 0.92 | 0.91 | 0.90 | 350 |
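The "weighted" columns in the table above average per-class precision, recall, and F1 with each class weighted by its support (the number of users in that class). A hedged sketch of that computation with made-up per-class values for a three-class problem (not the paper's data):

```python
def weighted_average(per_class_scores, supports):
    """Support-weighted average of per-class scores."""
    total = sum(supports)
    return sum(s * n for s, n in zip(per_class_scores, supports)) / total

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Illustrative per-class values (assumed for this sketch, not from the paper).
precisions = [0.90, 0.80, 0.70]
recalls    = [0.85, 0.75, 0.65]
supports   = [500, 300, 200]

f1s = [f1(p, r) for p, r in zip(precisions, recalls)]
print(round(weighted_average(f1s, supports), 3))  # 0.804
```

Weighting by support matters here because the occupational classes are imbalanced (e.g. 2137 physicians vs. 90 paramedics), so a plain macro average would overstate the influence of tiny classes.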
Fig. 5 Precision, recall, and F1-score metrics of the various models. Sub-figure (a) shows the LSTM model, (b) the Bi-LSTM model, (c) the GRU model, (d) the BERT model, (e) the ALBERT model, (f) the Naïve-Bayes model, (g) the Logistic Regression model, and (h) the SGD model
The table lists the occupational classes (left column) and the corresponding National Occupational Classification sub-major groups of jobs (middle column)
| Occupational class | National occupation classification | Users |
|---|---|---|
| 1 | Nursing Occ. | 1691 |
| 2 | Doctor/Physician Occ. | 2137 |
| 3 | Dentist Occ. | 257 |
| 4 | Pharmacist Occ. | 614 |
| 5 | Dietitian/Nutritionist Occ. | 1831 |
| 6 | Medical technologist, technician Occ. | 112 |
| 7 | Paramedic Occ. | 90 |
The right-most column represents the number of users
Text and tokenized term indexes

| Text | Pharmacist | professor | and | specialist | in | geriatrics |
|---|---|---|---|---|---|---|
| Term index | 467 | 43 | 254 | 78 | 51 | 165 |
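The mapping above assigns each vocabulary word a fixed integer term index. A minimal sketch of building such a vocabulary from non-annotated texts and tokenizing a bio against it (the indexes produced here are illustrative and will not match the paper's, e.g. 467 for "Pharmacist"):

```python
def build_vocabulary(texts, start_index=1):
    """Assign an integer index to each distinct token, in order of
    first appearance; index 0 is reserved for padding/unknown tokens."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            if token not in vocab:
                vocab[token] = start_index + len(vocab)
    return vocab

def to_term_indexes(text, vocab, unknown_index=0):
    """Convert a bio into its sequence of term indexes."""
    return [vocab.get(token, unknown_index) for token in text.lower().split()]

# Toy corpus of bios (assumed examples, not the paper's dataset).
bios = ["Pharmacist professor and specialist in geriatrics",
        "Registered nurse and diabetes specialist"]
vocab = build_vocabulary(bios)
print(to_term_indexes("Pharmacist and specialist", vocab))  # [1, 3, 4]
```

These index sequences are what gets mapped to dense vectors by the embedding layer before entering the networks.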
Fig. 1 Subset of the NOC classification hierarchy
Fig. 2 The generated vocabulary and VSM from the non-annotated texts
Fig. 3 Input sequence of vectors fed into the neural networks for classification (Fig. 4 describes the architecture in detail)
Fig. 4 Architecture of the model