Sidney Loyola de Sá, Antonio A. de A. Rocha, Aline Paes.
Abstract
The Internet's popularization has increased the amount of content produced and consumed on the web. To take advantage of this new market, major content producers such as Netflix and Amazon Prime have emerged, focusing on video streaming services. However, despite the large number and diversity of videos these providers make available, only a few attract the attention of most users. For example, in the data explored in this article, the most popular 6% of videos account for 85% of total views. Predicting in advance which videos will become popular is not trivial, especially given the many variables that influence popularity. Nevertheless, a tool with this ability would be of great value for sizing network infrastructure and recommending new content to users. This manuscript therefore examines the machine learning-based approaches that have been proposed for predicting web content popularity. To this end, we first survey the literature and elaborate a taxonomy that classifies models according to their predictive features, and we describe the state-of-the-art features and techniques used for this task. While analyzing previous works, we saw an opportunity to use textual features for video popularity prediction. We therefore also present a case study that combines features obtained through feature engineering with word embeddings to predict the popularity of a video. The first approach is based on predictive attributes defined by feature engineering. The second takes advantage of word embeddings from video descriptions and titles. We experimented with the proposed techniques on a set of videos from GloboPlay, the largest provider of video streaming services in Latin America. A combination of engineered features and embeddings using the Random Forest algorithm achieved the best result, with an accuracy of 87%.
Keywords: machine learning; popularity prediction; video; word embeddings
Year: 2021 PMID: 34770633 PMCID: PMC8588537 DOI: 10.3390/s21217328
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Taxonomy according to the prediction methods and attributes used.
Studies classified according to Taxonomy.
| Prediction Task | Features | |||||
|---|---|---|---|---|---|---|
| C | R | Textual | Visual | Meta-Data | Content Type | References |
| ✓ | ✓ | Videos and News | [ | |||
| ✓ | ✓ | Videos | [ | |||
| ✓ | ✓ | Videos | [ | |||
| ✓ | ✓ | Images | [ | |||
| ✓ | ✓ | Videos | [ | |||
| ✓ | ✓ | Videos | [ | |||
| ✓ | ✓ | ✓ | News | [ | ||
| ✓ | ✓ | News | [ | |||
| ✓ | ✓ | ✓ | News | [ | ||
| ✓ | ✓ | Videos | [ | |||
| ✓ | ✓ | News | [ | |||
| ✓ | ✓ | Videos | [ | |||
C = Classification; R = Regression.
Performance of models.
| Best Performance | ||||
|---|---|---|---|---|
| Task | Model | Metric | Performance | References |
| R | LN model | RSE | graphic | [ |
| R | Linear Regression | Spearman | 0.8539 | [ |
| R | MRBF model | RSE | 0.1723 | [ |
| R | SVR | Spearman | 0.81 | [ |
| R | Popularity-SVR | Spearman | 0.9413 | [ |
| R | CI Random Forest | | 0.8 | [ |
| C | Bagging | Accuracy | 83.96% | [ |
| C | Random Forest | AUC | 0.73 | [ |
| C | AD Tree | AUC | 0.837 | [ |
| C | Popularity-LRCN | Accuracy | 0.7 | [ |
| C | Gradient Boosting | Accuracy | 79% | [ |
C = Classification; R = Regression.
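Several regression results above are reported as Spearman correlations between predicted and observed popularity. As a minimal sketch of that metric (not the evaluation code of any cited study), the rank correlation can be computed as the Pearson correlation of the two rank vectors. The view counts below are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of equal values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(predicted, observed):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(observed))


# Hypothetical predicted and observed view counts for five items.
predicted = [120, 40, 900, 15, 300]
observed = [100, 60, 1500, 10, 250]
rho = spearman(predicted, observed)
```

Because only the ordering matters, a model can score a high Spearman correlation while still being far off in absolute view counts.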
Features observed in literature.
| Feature | References |
|---|---|
| Category | [ |
| Author or Source | [ |
| Title subjectivity | [ |
| Content subjectivity score | [ |
| Number of friends/followers of Author | [ |
| Number of Named Entities | [ |
| Number of keywords | [ |
| Frequency of positive words | [ |
| Frequency of negative words | [ |
| Number of words in title | [ |
| Number of words in content | [ |
| HOG | [ |
| GIST | [ |
| Output of CaffeNet | [ |
| Output of ResNet | [ |
| Video’s length | [ |
| Video’s resolution | [ |
| HUE | [ |
| Thumbnail contrast | [ |
| Number of tweets/retweets | [ |
| Number of Shares | [ |
| Number of Views in the first day | [ |
| Number of Views | [ |
Main features observed in the literature on popularity prediction.
Figure 2. Example of the ROC curve.
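As an illustration of how a ROC curve and its AUC are obtained, the following sketch sweeps the decision threshold over classifier scores and integrates the resulting curve with the trapezoidal rule. The labels and scores are hypothetical and are not taken from the study.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Visit examples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for idx, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example that shares this score.
        is_last_of_tie = idx + 1 == len(ranked) or ranked[idx + 1][0] != score
        if is_last_of_tie:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels (1 = popular) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
area = auc(curve)
```

The area equals the fraction of (popular, not popular) pairs in which the popular item receives the higher score, which is 8 of 9 pairs in this toy example.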
Figure 3. Complementary cumulative distribution function of number of views in log scale.
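A complementary cumulative distribution function of this kind can be computed directly from raw view counts. The sketch below uses the convention P(X ≥ x), which is an assumption because the source does not state which convention the figure uses, and the view counts are hypothetical.

```python
def ccdf(view_counts):
    """Return sorted (x, P(X >= x)) pairs for the distinct values observed."""
    n = len(view_counts)
    ordered = sorted(view_counts)
    points = []
    for i, x in enumerate(ordered):
        # At the first occurrence of x (index i), n - i samples are >= x.
        if i == 0 or ordered[i - 1] != x:
            points.append((x, (n - i) / n))
    return points


# Hypothetical view counts with a heavy tail.
views = [1, 1, 2, 5, 5, 40, 1200, 90000]
curve = ccdf(views)
for x, p in curve:
    print(f"{x:>6} views: P(X >= x) = {p:.3f}")
```

When plotting on logarithmic axes, items with zero views need to be dropped or shifted, since zero has no logarithm.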
Figure 4. Percentage of total views separated by five classes of number of views.
Figure 5. Percentage of total payload separated by five classes of number of views.
Number of videos with corresponding percentage of total views and total payload.
| Number of Views | Number of Videos | % Views | % Payload |
|---|---|---|---|
| 0–3 | 2500 | 0.10 | 0.10 |
| 3–20 | 2564 | 0.60 | 1.10 |
| 20–83 | 2434 | 2.70 | 5.30 |
| 83–1000 | 1875 | 10.90 | 20.20 |
| 1000+ | 616 | 85.70 | 73.30 |
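The concentration of views can be checked with a short calculation over the rows of the table above. The sketch below copies the table values and derives the share of videos in each class. The variable names are illustrative only.

```python
# Rows copied from the table: (view-count class, number of videos, % views, % payload).
classes = [
    ("0–3", 2500, 0.10, 0.10),
    ("3–20", 2564, 0.60, 1.10),
    ("20–83", 2434, 2.70, 5.30),
    ("83–1000", 1875, 10.90, 20.20),
    ("1000+", 616, 85.70, 73.30),
]

total_videos = sum(n for _, n, _, _ in classes)

for label, n, pct_views, pct_payload in classes:
    pct_videos = 100 * n / total_videos
    print(
        f"{label:>8}: {pct_videos:5.1f}% of videos, "
        f"{pct_views:5.1f}% of views, {pct_payload:5.1f}% of payload"
    )

# Share of all videos that fall in the most-viewed class.
top_share = 100 * classes[-1][1] / total_videos
```

The 616 videos with more than 1000 views are about 6.2% of the 9989 videos in the table, yet they account for 85.7% of views and 73.3% of payload.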
Textual features collected from the titles and descriptions of Globoplay videos.
| Number | Feature | Number | Feature |
|---|---|---|---|
| 1 | Number of words of the title | 19 | Weekday is Saturday? |
| 2 | Number of words of the description | 20 | Weekday is Sunday? |
| 3 | Rate of unique words of the Description | 21 | Is Weekend? |
| 4 | Rate of non-stop words in the Description | 22 | Title Polarity |
| 5 | Rate of unique non stop words in the Description | 23 | Title Subjectivity |
| 6 | Average of word length in the Description | 24 | Description Polarity |
| 7 | Number of NER in the Description | 25 | Description Subjectivity |
| 8 | Topic LDA | 26 | Rate of Negative Words in Description |
| 9 | Closeness to LDA Topic 0 | 27 | Rate of Positive words in the Description |
| 10 | Closeness to LDA Topic 1 | 28 | Rate of Positive Words among non-neutral in the Description |
| 11 | Closeness to LDA Topic 2 | 29 | Rate of Negative Words among non-neutral in the Description |
| 12 | Closeness to LDA Topic 3 | 30 | Average of Negative Polarity among words in the Description |
| 13 | Closeness to LDA Topic 4 | 31 | Maximum of Negative Polarity among words in the Description |
| 14 | Weekday is Monday? | 32 | Minimum Negative Polarity among words in the Description |
| 15 | Weekday is Tuesday? | 33 | Average of Positive Polarity among words in the Description |
| 16 | Weekday is Wednesday? | 34 | Maximum of Positive Polarity among words in the Description |
| 17 | Weekday is Thursday? | 35 | Minimum Positive Polarity among words in the Description |
| 18 | Weekday is Friday? | - | - |
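A few of the count-based and calendar-based features in the table can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the authors' pipeline: the stop-word list is hypothetical, tokenization is plain whitespace splitting, and the sentiment, named-entity and LDA features are omitted because they require external NLP models.

```python
from datetime import date

# Hypothetical stop-word list; a real one would come from an NLP library
# and match the language of the descriptions.
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is"}


def textual_features(title, description, published):
    """Compute a handful of the count-based and calendar-based features."""
    title_words = title.lower().split()
    desc_words = description.lower().split()
    non_stop = [w for w in desc_words if w not in STOP_WORDS]
    n_desc = len(desc_words)
    return {
        "n_words_title": len(title_words),
        "n_words_description": n_desc,
        "rate_unique_words": len(set(desc_words)) / n_desc if n_desc else 0.0,
        "rate_non_stop_words": len(non_stop) / n_desc if n_desc else 0.0,
        "avg_word_length": sum(len(w) for w in desc_words) / n_desc if n_desc else 0.0,
        # date.weekday(): Monday is 0 and Sunday is 6.
        "is_weekend": published.weekday() >= 5,
    }


features = textual_features(
    title="Highlights of the final",
    description="the best goals of the final in the stadium",
    published=date(2021, 11, 6),  # a Saturday
)
```

Each weekday flag in the table could be derived the same way by comparing `published.weekday()` against a single day index.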
Classification results using the engineered NLP features.
| Model | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|
| KNN | 0.65 | 0.67 | 0.66 | 0.72 |
| Naive Bayes | 0.57 | 0.59 | 0.53 | 0.55 |
| SVM | 0.78 | 0.57 | 0.57 | 0.78 |
| Random Forest | 0.73 | 0.76 | 0.74 | 0.80 |
| AdaBoost | 0.68 | 0.68 | 0.68 | 0.76 |
| MLP | 0.71 | 0.73 | 0.72 | 0.78 |
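The four metrics reported in these result tables can be derived from true and predicted labels as sketched below. The source does not state how precision, recall and F1 are averaged across the two classes. The sketch assumes macro-averaging (the unweighted mean of the per-class scores), and the labels are hypothetical.

```python
def per_class_scores(tp, fp, fn):
    """Precision, recall and F1 for one class, guarding against empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_report(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1 for labels 0 and 1."""
    scores = []
    for cls in (0, 1):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        scores.append(per_class_scores(tp, fp, fn))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # Macro average: unweighted mean of each metric over the classes.
    precision, recall, f1 = (sum(col) / len(col) for col in zip(*scores))
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical labels: 1 = popular, 0 = not popular.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
report = macro_report(y_true, y_pred)
```

With imbalanced classes, accuracy can stay high while macro recall and F1 drop, because a classifier that favors the majority class scores poorly on the minority class.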
The five most important features in the Random Forest model.
| Rank | Feature | Importance |
|---|---|---|
| 1 | Avg polarity of Negative words | 0.11636 |
| 2 | Closeness to top 2 LDA topic | 0.09072 |
| 3 | Rate of Negative words | 0.07067 |
| 4 | Rate of Positive words | 0.06947 |
| 5 | Avg polarity of Positive words | 0.05893 |
Classification results using description embeddings.
| Model | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|
| KNN | 0.59 | 0.61 | 0.61 | 0.52 |
| Naive Bayes | 0.56 | 0.56 | 0.42 | 0.43 |
| SVM | 0.64 | 0.68 | 0.65 | 0.71 |
| Random Forest | 0.63 | 0.65 | 0.64 | 0.72 |
| AdaBoost | 0.49 | 0.49 | 0.49 | 0.63 |
| MLP | 0.68 | 0.67 | 0.67 | 0.76 |
Classification results using title embeddings.
| Model | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|
| KNN | 0.70 | 0.75 | 0.70 | 0.74 |
| Naive Bayes | 0.59 | 0.59 | 0.45 | 0.45 |
| SVM | 0.74 | 0.77 | 0.75 | 0.80 |
| Random Forest | 0.77 | 0.77 | 0.77 | 0.82 |
| AdaBoost | 0.51 | 0.51 | 0.50 | 0.60 |
| MLP | 0.76 | 0.75 | 0.76 | 0.82 |
Classification results using NLP features + title embeddings.
| Model | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|
| KNN | 0.71 | 0.74 | 0.72 | 0.78 |
| Naive Bayes | 0.60 | 0.59 | 0.43 | 0.43 |
| SVM | 0.88 | 0.51 | 0.45 | 0.76 |
| Random Forest | 0.81 | 0.83 | 0.82 | 0.87 |
| AdaBoost | 0.77 | 0.80 | 0.78 | 0.83 |
| MLP | 0.75 | 0.75 | 0.75 | 0.81 |
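One plausible way to build the combined input behind results like these is to average the word vectors of a title and concatenate the result with the engineered feature values. Both the averaging and the concatenation are assumptions made for illustration, since the source does not describe how the two feature sets were merged. The tiny embedding table and the feature values are hypothetical.

```python
# Hypothetical 3-dimensional word vectors; real embeddings would be loaded
# from a pretrained model and have many more dimensions.
EMBEDDINGS = {
    "final": [0.2, 0.1, 0.7],
    "goals": [0.4, 0.3, 0.5],
    "highlights": [0.6, 0.2, 0.2],
}
DIM = 3


def title_embedding(title):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [EMBEDDINGS[w] for w in title.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def combined_features(engineered, title):
    """Concatenate engineered feature values with the title embedding."""
    return list(engineered) + title_embedding(title)


engineered = [4, 9, 0.78, 1.0]  # hypothetical hand-crafted feature values
row = combined_features(engineered, "Highlights of the final")
```

Each `row` built this way would be one training example for a classifier such as Random Forest. Tree-based models are insensitive to feature scale, so the mixed ranges of counts and embedding components would not need normalization for them.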
Classification results using all features.
| Model | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|
| KNN | 0.72 | 0.75 | 0.73 | 0.78 |
| Naive Bayes | 0.58 | 0.57 | 0.41 | 0.41 |
| SVM | 0.88 | 0.51 | 0.44 | 0.76 |
| Random Forest | 0.80 | 0.83 | 0.81 | 0.86 |
| AdaBoost | 0.77 | 0.80 | 0.78 | 0.83 |
| MLP | 0.74 | 0.74 | 0.74 | 0.80 |