Abstract
We exploit the Twitter platform to create a dataset of news articles derived from tweets concerning COVID-19, and use the associated tweets to define a number of popularity measures. The focus on (potentially) biomedical news articles allows the quantity of biomedically valid information (as extracted by biomedical relation extraction) to be included in the list of explored features. Aside from forming part of a systematic correlation exploration, the features, ranging from semantic relations through readability measures to the article's digital content, are used within a number of machine learning classification and regression algorithms. Unsurprisingly, the results indicate that more complex articles (as determined by a readability measure) tend to exhibit more sophisticated syntactic structure. Only a weak correlation is found between popularity and the information within an article, suggesting that other factors, such as the number of videos, have a notable impact on the popularity of a news article. The best popularity prediction performance is obtained using a random forest algorithm, and the feature describing the quantity of biomedical information ranks among the top three most important features in almost a third of the experiments performed. Additionally, this feature proves more valuable than the widely used named entity recognition.
Keywords: Grammatical relations; Popularity prediction; SemRep relations; Twitter
Year: 2022 PMID: 35035262 PMCID: PMC8742664 DOI: 10.1007/s11042-021-11621-5
Source DB: PubMed Journal: Multimed Tools Appl ISSN: 1380-7501 Impact factor: 2.577
Overview of news article based features
| Feature category | Specific features |
|---|---|
| Text based | Number of words in the title & body of text, average word lengths in title & body of text, total number of sentences in title & body of text, number of words per sentence, numbers of images and videos. |
| Readability based | Readability measures as described in the table below. |
| Semantic content based | Numbers of grammatical relations extracted by a parser tuned for biomedical text (SemRep) and by the Stanford parser, and numbers of named entities. |
Summary of readability measures
| Readability measure | Purpose | Based on numbers of ... |
|---|---|---|
| Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (F-K) | General | Syllables, words and sentences. |
| Automated Readability Index (ARI) | Technical | Characters, words and sentences. |
| Coleman-Liau (C-L) | Education | Characters, words and sentences. |
| Gunning Fog Index (FOG) | Business & product | Words, complex words (three or more syllables) and sentences. |
| Simple Measure of Gobbledygook (SMOG) index | Healthcare | Sentences and polysyllabic words. |
| Dale-Chall index (D-C) | General / education | Words, sentences and ‘difficult words’ from its own word list. |
| Linsear Write metric (LW) | Technical | Easy and hard words, sentences. |
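Most of these measures are simple ratios of surface counts. As an illustration, here is a minimal sketch of the Flesch Reading Ease score, using its standard formula (206.835 − 1.015 × words/sentence − 84.6 × syllables/word) and a naive vowel-group syllable counter; the helper names are illustrative, not taken from the paper:

```python
import re

def count_syllables(word):
    # Naive heuristic: one syllable per vowel group, minimum one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # FRE = 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

Higher scores mean easier text; production work would use a library such as textstat, which the paper compares against a second readability implementation.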
Correlations between the number of re-tweets and some of the features
| Feature | Restriction | Pearson r | p | Spearman ρ | p | Kendall τ | p |
|---|---|---|---|---|---|---|---|
| Avg Stanford GRs | SMOG | 0.38 | 0.00 | 0.21 | 0.07 | 0.18 | 0.05 |
| Avg SemRep GRs | ARI | 0.12 | 0.15 | 0.27 | 0.00 | 0.22 | 0.00 |
| Number of videos | ARI | 0.45 | 0.00 | 0.28 | 0.00 | 0.25 | 0.00 |
| Avg word length in title | D-C | −0.8 | 0.00 | −0.32 | 0.19 | −0.26 | 0.17 |
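The three coefficients capture different notions of association: Pearson's r measures linear correlation on the raw values, while Spearman's ρ is Pearson's r applied to ranks (Kendall's τ similarly compares concordant and discordant rank pairs). A minimal pure-Python sketch of the first two, assuming no external statistics library is available:

```python
from statistics import mean, pstdev

def pearson(x, y):
    # Linear correlation: covariance over the product of standard deviations.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

def ranks(values):
    # 1-based average ranks, so tied values share the same rank.
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    # Rank correlation: Pearson's r on the ranks of the data.
    return pearson(ranks(x), ranks(y))
```

Spearman's ρ reaching 1.0 on monotonic but non-linear data (where Pearson's r falls short of 1.0) is why the two can disagree, as in the table above.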
Statistical overview of features considered as popularity measures
| Statistic | Favourites | Followers | Hashtags | Re-tweets | Log_followers |
|---|---|---|---|---|---|
| Minimum | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1st quartile | 0.0 | 306.5 | 0.0 | 0.0 | 5.7 |
| Median | 0.17 | 1973.0 | 0.0 | 0.0 | 7.6 |
| 3rd quartile | 1.0 | 17038.0 | 1.0 | 2.0 | 9.7 |
| Maximum | 1342.0 | 26200902 | 19 | 17779.4 | 17.1 |
| Mean | 2.28 | 219892.0 | 0.7 | 9.7 | 7.9 |
| Standard deviation | 26.7 | 1329625.9 | 1.6 | 2480996.8 | 3.1 |
Fig. 1 Boxplots for log(distribution) of the four popularity features: favourite count, followers, hashtags, re-tweets
Fig. 2 Quantile plot for followers
For each percentage division, the actual cut-off value is given, together with the number of non-influential (0) and influential (1) articles it yields
| Measure | Value (85%) | 0 | 1 | Value (87.5%) | 0 | 1 | Value (90%) | 0 | 1 |
|---|---|---|---|---|---|---|---|---|---|
| Favourites | 2 | 10432 | 2056 | 2.4 | 10927 | 1561 | 3 | 11142 | 1346 |
| Followers | 79902 | 10517 | 1971 | 114611 | 10927 | 1561 | 214131 | 11233 | 1255 |
| Hashtags | 1.5 | 10609 | 1879 | 2 | 10841 | 1647 | 2 | 10841 | 1647 |
| Re-tweets | 4.1 | 10477 | 2011 | 5.5 | 10927 | 1561 | 7.7 | 11239 | 1249 |
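The binary labels above come from thresholding a popularity measure at a percentile of its distribution: articles above the cut-off value are labelled influential (1), the rest non-influential (0). A minimal sketch, assuming a nearest-rank style percentile (the paper's exact interpolation rule is not specified):

```python
def percentile_value(values, pct):
    # Nearest-rank style percentile on the sorted values.
    s = sorted(values)
    idx = round(pct / 100 * (len(s) - 1))
    return s[idx]

def binarize(values, pct):
    # 1 = influential (above the cut-off), 0 = non-influential.
    cut = percentile_value(values, pct)
    return [1 if v > cut else 0 for v in values]
```

Raising the percentage division shrinks the positive class, which is why the class counts in the table grow more imbalanced from 85% to 90%.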
Fig. 3 Distribution of the most common sources by Flesch Reading Ease; readability calculation on the left, textstat on the right
Feature vector lengths when restrictions are applied
| Restriction | Readability grades | Textstat | Readability | All |
|---|---|---|---|---|
| Num features | 24 | 38 | 50 | 74 |
Fig. 4 Experiment setup
Best performing combination of algorithm and dataset balancing, by F-measure, for each threshold (T) using the readability features
| T | Algorithm & optimal hyperparameters | Balancing | F-measure |
|---|---|---|---|
| 80% | RFC | Random oversampling | 0.504 |
| criterion: gini, max_depth: 23 | |||
| max_features: log2, n_estimators: 500 | |||
| 82.5% | RFR | Random oversampling | 0.482 |
| criterion: mse, max_depth: 16 | |||
| max_features: auto, n_estimators: 200 | |||
| 85% | RFC | Borderline SMOTE SVM | 0.438 |
| criterion: gini, max_depth: 15 | |||
| max_features: auto, n_estimators: 800 | |||
| 87.5% | RFC | Borderline SMOTE SVM | 0.361 |
| criterion: gini, max_depth: 15 | |||
| max_features: sqrt, n_estimators: 800 | |||
| 90% | RFC | SMOTE & undersampling | 0.338 |
| criterion: entropy, max_depth: 23 | |||
| max_features: sqrt, n_estimators: 1100 |
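Because the influential class is small at every threshold, each learner is paired with a dataset-balancing step. Random oversampling, the method that wins at the 80% and 82.5% thresholds, simply duplicates randomly chosen minority-class examples until the classes are the same size; a minimal sketch (the function name is illustrative, not from the paper):

```python
import random

def random_oversample(X, y, seed=0):
    # Duplicate randomly chosen minority-class samples until every
    # class matches the size of the largest one.
    rng = random.Random(seed)
    counts = {c: y.count(c) for c in set(y)}
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for c, n in counts.items():
        idx = [i for i, label in enumerate(y) if label == c]
        for _ in range(target - n):
            i = rng.choice(idx)
            X_out.append(X[i])
            y_out.append(y[i])
    return X_out, y_out
```

SMOTE and its borderline variants instead synthesise new minority points by interpolating between nearest neighbours, which is what the higher thresholds favour in these results.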
Statistical information for the information-containing features at the 80% threshold (binary80)
| Feature | Rank 1 | Rank 1-3 | Rank 1-5 | Rank 1-10 | Mean | SD |
|---|---|---|---|---|---|---|
| SemRep | 22% | 31% | 33% | 34% | 2.95 | 6.68 |
| Average SemRep | 0% | 0% | 1% | 19% | 0.03 | 0.05 |
| Stanford | 0% | 0% | 0% | 0% | 1322.48 | 2095.63 |
| Average Stanford | 0% | 0% | 2% | 33% | 12.68 | 4.98 |
| Named entity | 0% | 3% | 8% | 19% | 72.28 | 140.05 |
Hyperparameter grids explored for each algorithm
| Algorithm | Abbrev | Parameter grid |
|---|---|---|
| Decision tree classifier | DTC | max_depth: range(3,20) |
| criterion: [gini, entropy] | ||
| Decision tree regressor | DTR | max_depth: range(3,20) |
| criterion: [mse, mae] | ||
| Random forest classifier | RFC | n_estimators: [200, 500, 800, 1100] |
| max_features: [auto, sqrt, log2] | ||
| max_depth: range(8,25) | ||
| criterion: [gini, entropy] | ||
| Random forest regressor | RFR | n_estimators: [200, 500, 800, 1000] |
| max_features: [auto, sqrt, log2] | ||
| max_depth: range(8,25) | ||
| criterion: [mse, mae] | ||
| Gradient boosting | GBC | learning_rate: [0.05, 0.1, 0.2, 0.5] |
| n_estimators: [100, 200, 500] | ||
| max_features: [log2, sqrt] | ||
| max_depth: range(4,12,2) | | |
| criterion: [friedman_mse, mse] | ||
| K-nearest neighbours | KNN | n_neighbors: range(1, 31) |
| weights: [uniform, distance] | ||
| Support vector machines | SVM | C: [0.1, 1, 10, 100] |
| gamma: [0.1, 0.01, 0.001] | ||
| kernel: [rbf, poly] | ||
| Multilayer perceptron | MLP | activation: [relu, tanh] |
| hidden_layer_sizes: [(50,), (25,50,), (25,37,50,)] | ||
| solver: [adam, lbfgs] | ||
| early_stopping: [True] | ||
| max_iter: [5000] |
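Each algorithm's grid is searched exhaustively with cross-validation. A scaled-down sketch of the random forest search using scikit-learn's GridSearchCV, on synthetic imbalanced data standing in for the article feature vectors (the grid here is deliberately tiny; the paper's full grid appears in the table above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the article feature vectors: imbalanced binary labels.
X, y = make_classification(n_samples=200, n_features=10,
                           weights=[0.85, 0.15], random_state=0)

param_grid = {
    "n_estimators": [50],           # paper: 200-1100
    "max_depth": [5, 10],           # paper: range(8, 25)
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="f1", cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Scoring by F1 matches the F-measure reported throughout the results; in practice the balancing step would be applied to the training folds before fitting.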
| Threshold | Algorithm | Balancing | F-measure |
|---|---|---|---|
| 80% | RFC | Random oversampling | 0.504 |
| RFR | Random oversampling | 0.496 | |
| SVM | Random oversampling | 0.476 | |
| KNN | ADASYN | 0.459 | |
| MLP | SMOTE | 0.457 | |
| GBC | Borderline SMOTE SVM | 0.454 | |
| DTR | Random oversampling | 0.446 | |
| DTC | Random oversampling | 0.438 | |
| 82.5% | RFR | Random oversampling | 0.482 |
| RFC | SMOTE | 0.475 | |
| SVM | Borderline SMOTE SVM | 0.450 | |
| GBC | SMOTE | 0.446 | |
| KNN | Borderline SMOTE | 0.428 | |
| DTC | SMOTE & undersampling | 0.415 | |
| DTR | Random oversampling | 0.409 | |
| MLP | ADASYN | 0.409 | |
| 85% | RFC | Borderline SMOTE SVM | 0.438 |
| RFR | Random undersampling | 0.433 | |
| GBC | Borderline SMOTE | 0.413 | |
| SVM | Random undersampling | 0.405 | |
| MLP | Random undersampling | 0.381 | |
| KNN | Random undersampling | 0.377 | |
| DTC | Borderline SMOTE SVM | 0.375 | |
| DTR | ADASYN | 0.358 | |
| 87.5% | RFC | Borderline SMOTE SVM | 0.361 |
| GBC | Random undersampling | 0.351 | |
| SVM | Random undersampling | 0.350 | |
| RFR | Random undersampling | 0.342 | |
| MLP | SMOTE | 0.321 | |
| KNN | Random undersampling | 0.319 | |
| DTC | Random oversampling | 0.310 | |
| DTR | Random oversampling | 0.310 | |
| 90% | RFC | SMOTE & undersampling | 0.338 |
| RFR | SMOTE & undersampling | 0.336 | |
| SVM | SMOTE & undersampling | 0.332 | |
| KNN | SMOTE & undersampling | 0.330 | |
| MLP | SMOTE | 0.299 | |
| GBC | Random undersampling | 0.279 | |
| DTC | ADASYN | 0.273 | |
| DTR | Borderline SMOTE | 0.270 |