Sangjae Lee, Bikash Kc, Joon Yeon Choeh.
Abstract
While many business intelligence methods have been applied to predict movie box office revenue, studies that use an ensemble approach for this task are almost nonexistent. In this study, we apply ensemble methods to decision trees, k-nearest-neighbors (k-NN), and linear regression, and compare the prediction performance of decision trees based on random forests, bagging, and boosting with that of k-NN and linear regression based on bagging and boosting, using a sample of 1439 movies. The results indicate that ensemble methods based on decision trees (random forests, bagging, boosting) outperform ensemble methods based on k-NN (bagging, boosting) in predicting box office revenue at weeks 1, 2, and 3 after release. Ensemble methods based on decision trees also predict box office revenue at week 1 after release better than ensemble methods based on linear regression. A comparison of ensemble and non-ensemble methods explains this result: for decision trees, unlike the other methods, ensemble methods predict better than their non-ensemble counterpart. Decision trees therefore benefit more from ensemble methods than k-NN or linear regression do.
Keywords: Big data; Business management; Data analysis; Data analytics; Decision trees; Ensemble methods; Management; Movie box office revenue; Prediction of box office revenue
Year: 2020 PMID: 32613125 PMCID: PMC7322254 DOI: 10.1016/j.heliyon.2020.e04260
Source DB: PubMed Journal: Heliyon ISSN: 2405-8440
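The bagging idea named in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes a hand-written k-NN regressor as the base learner (chosen only for brevity), synthetic two-feature data in place of the 1439-movie sample, and mean absolute error as the comparison metric. All function names (`knn_predict`, `bagged_predict`, `mean_absolute_error`) are hypothetical.

```python
import random
import statistics

def knn_predict(train, x, k=3):
    """Average the targets of the k training rows closest to x (squared Euclidean distance)."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x))
    )[:k]
    return statistics.fmean(y for _, y in nearest)

def bagged_predict(train, x, n_models=25, k=3, seed=0):
    """Bagging: average base-learner predictions over bootstrap resamples of train."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        # Bootstrap resample: draw len(train) rows with replacement.
        sample = [rng.choice(train) for _ in train]
        preds.append(knn_predict(sample, x, k))
    return statistics.fmean(preds)

def mean_absolute_error(train, test, predict):
    """Mean absolute error of predict(train, x) over the held-out rows."""
    return statistics.fmean(abs(predict(train, x) - y) for x, y in test)

# Synthetic stand-in for (features, revenue) pairs; NOT the study's data.
data_rng = random.Random(42)
data = []
for _ in range(120):
    features = (data_rng.uniform(0, 10), data_rng.uniform(0, 10))
    revenue = 3 * features[0] + 2 * features[1] + data_rng.gauss(0, 1)
    data.append((features, revenue))
train, test = data[:90], data[90:]

single_mae = mean_absolute_error(train, test, knn_predict)
bagged_mae = mean_absolute_error(train, test, bagged_predict)
print(f"single k-NN MAE: {single_mae:.3f}")
print(f"bagged k-NN MAE: {bagged_mae:.3f}")
```

Whether the bagged variant beats the single model depends on the base learner and the data, which is the question the study examines for decision trees, k-NN, and linear regression.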
Figure 1. Comparison of ensemble methods for movie box office revenue prediction.
Summary of variables.
| Category | Variables | Description | Number of possible values |
|---|---|---|---|
| eWOM | Average review rating | Represents the average of the review ratings. | Real values |
| | Average number of reviews | Represents the average number of user reviews until the movie is released in a new market | Real values |
| | Average emotional reviews | Represents the proportion of emotional reviews among total reviews | Real values |
| | Helpfulness | Proportion of positive answers among all answers to the question asking whether the review is helpful | Real values |
| | Total helpfulness votes for reviewer | Represents the total number of helpfulness votes for the reviewer | Real values |
| Movie related variables | Award | Indicates whether a movie is an award winner or nominee (value of 1) or not (value of 0) | 2 |
| | Code film rating | Rating of a movie (2 stars, 3 stars, 5 stars) | Real values |
| | Sequel | Indicates whether a movie is a sequel (value of 1) or not (value of 0). | 2 |
| | Timing of release | Indicates whether the movie is released in the high (peak) season or the low season | 2 |
| | Similar movie revenue | Revenue of competing movies released in the first day (d1), first week (wk1), and second week (wk2) | Real values |
| | Genre | Represents the content category the movie belongs to. A movie can belong to more than one content category at the same time. This study uses one dummy variable for the drama genre, which accounts for the largest share of the sample. | 2 |
| | Nationality (Nation1, Nation2) | Movie released in the respective country (South Korea, USA) | Real values |
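As a rough illustration of how a few of the variables above could be turned into model inputs, the sketch below encodes the binary variables as 0/1 dummies and computes helpfulness as a proportion. The `MovieRecord` fields, the `encode` function, and the choice of 1 for peak season are assumptions made for this example; the table does not specify them.

```python
from dataclasses import dataclass

@dataclass
class MovieRecord:
    # Hypothetical raw fields; the names are illustrative only.
    helpful_votes: int   # positive answers to "was this review helpful?"
    total_votes: int     # all answers to that question
    is_sequel: bool
    has_award: bool
    genres: tuple        # a movie may belong to several genres
    peak_season: bool

def encode(movie):
    """Map a raw record to numeric features following the variable table."""
    # Helpfulness is a proportion; fall back to 0.0 when no votes exist (assumption).
    helpfulness = movie.helpful_votes / movie.total_votes if movie.total_votes else 0.0
    return {
        "helpfulness": helpfulness,
        "sequel": int(movie.is_sequel),             # 1 = sequel, 0 = not
        "award": int(movie.has_award),              # 1 = winner/nominee, 0 = not
        "genre_drama": int("Drama" in movie.genres),  # single drama dummy
        "timing": int(movie.peak_season),           # 1 = peak season (assumed coding)
    }

example = MovieRecord(30, 40, False, True, ("Drama", "Romance"), True)
print(encode(example))
```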
Descriptive statistics of samples.
| | Frequency | Percent | Mean (revenue sum) (Won) |
|---|---|---|---|
| a) Age | |||
| No restrictions on age | 229 | 15.9 | 1.838E+09 |
| Allowed for teenagers (12 < age < 18) | 815 | 56.5 | 3.046E+09 |
| Allowed for adults (age > 18) | 395 | 27.4 | 1.596E+09 |
| Total | 1439 | 100 | 2.456E+09 |
| b) Genre | |||
| Action | 101 | 7 | 1.16E+07 |
| Animation | 95 | 6.6 | 2.09E+09 |
| Comedy | 176 | 12 | 2.24E+09 |
| Crime | 22 | 1.5 | 5.38E+08 |
| Documentary | 89 | 7 | 1.65E+10 |
| Drama | 673 | 46.8 | 2.02E+10 |
| Family | 5 | 0.3 | 7.62E+08 |
| Fantasy | 4 | 0.3 | 6.57E+09 |
| History | 2 | 0.1 | 3.99E+09 |
| Horror | 54 | 3.8 | 2.47E+09 |
| Music, Romance | 147 | 10 | 1.21E+09 |
| Musical | 5 | 0.3 | 3.03E+08 |
| Science Fiction | 8 | 0.5 | 5.83E+09 |
| Thriller, Mystery | 55 | 3.8 | 2.22E+09 |
| War | 1 | 0 | 2.12E+09 |
| Total | 1439 | 100 | 9.22E+07 |
| c) Award | |||
| Not awarded | 1412 | 98.1 | 2.315E+09 |
| Awarded | 27 | 1.9 | 9.840E+09 |
| Total | 1439 | 100 | 2.456E+09 |
| d) Nation | |||
| South Korea | 571 | 40 | 2.738E+09 |
| USA | 394 | 27 | 2.769E+09 |
| UK | 66 | 4.6 | 3.504E+09 |
| France | 87 | 6 | 1.924E+09 |
| Japan | 127 | 8.8 | 1.138E+09 |
| China | 13 | 1 | 5.213E+09 |
| Others | 181 | 12.6 | 1.484E+09 |
| Total | 1439 | 100 | 2.456E+09 |
| e) Sequel | |||
| No sequel | 1415 | 98.3 | 2.315E+09 |
| Sequel | 24 | 1.7 | 9.840E+09 |
| Total | 1439 | 100 | 2.456E+09 |
| f) Timing release | |||
| Released at the start of holidays | 907 | 63.0 | 1.902E+09 |
| Released in other times | 532 | 37 | 3.401E+09 |
| Total | 1439 | 100 | 2.456E+09 |
Comparing decision trees using ensemble methods with k-nearest-neighbors (best-k).
| | Compared Models | Mean | t | Sig. (2-tailed) |
|---|---|---|---|---|
| Week 1 | decision trees (random forest) ─ k-nearest-neighbors | -4.821E+08 | -3.874 | 0.000 |
| | decision trees (bagging) ─ k-nearest-neighbors | -4.834E+08 | -3.610 | 0.001 |
| | decision trees (boosting) ─ k-nearest-neighbors | -5.131E+08 | -3.224 | 0.003 |
| | decision trees (random forest, bagging, boosting) ─ k-nearest-neighbors | -5.941E+08 | -4.332 | 0.000 |
| Week 2 | decision trees (random forest) ─ k-nearest-neighbors | -5.174E+08 | -5.710 | 0.000 |
| | decision trees (bagging) ─ k-nearest-neighbors | -4.976E+08 | -5.147 | 0.000 |
| | decision trees (boosting) ─ k-nearest-neighbors | -3.563E+08 | -2.643 | 0.012 |
| | decision trees (random forest, bagging, boosting) ─ k-nearest-neighbors | -5.085E+08 | -5.004 | 0.000 |
| Week 3 | decision trees (random forest) ─ k-nearest-neighbors | -2.841E+08 | -4.993 | 0.000 |
| | decision trees (bagging) ─ k-nearest-neighbors | -2.733E+08 | -3.462 | 0.001 |
| | decision trees (boosting) ─ k-nearest-neighbors | -1.798E+08 | -1.938 | 0.061 |
| | decision trees (random forest, bagging, boosting) ─ k-nearest-neighbors | -2.788E+08 | -3.859 | 0.000 |
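The Mean, t, and Sig. (2-tailed) columns in the comparison tables resemble the output of a paired-samples t-test on the difference between two models' errors. The source does not state the exact procedure, so the sketch below is an assumption: it computes the mean difference, the t statistic, and a two-tailed p-value for paired error values, using only the standard library. The function names and the toy error values are hypothetical.

```python
import math
import statistics

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on the t density over [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1 - 2 * area)

def paired_t_test(errors_a, errors_b):
    """Return (mean difference, t, two-tailed p) for paired errors a - b."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)  # assumes non-constant diffs
    t = mean_diff / std_err
    return mean_diff, t, two_tailed_p(t, n - 1)

# Toy paired errors for two models on the same five test cases (made up).
mean_diff, t, p = paired_t_test([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(f"mean={mean_diff:.3f}, t={t:.3f}, p={p:.4f}")
```

Under this reading, a negative mean with a small p-value says the first model in a pair has the lower error, which matches the abstract's statement that decision tree ensembles outperform k-NN.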
Comparing decision trees using ensemble methods with k-nearest-neighbors (bagging).
| | Compared Models | Mean | t | Sig. (2-tailed) |
|---|---|---|---|---|
| Week 1 | decision trees (random forest) ─ k-nearest-neighbors (bagging) | -8.505E+08 | -7.561 | 0.000 |
| | decision trees (bagging) ─ k-nearest-neighbors (bagging) | -8.519E+08 | -7.185 | 0.000 |
| | decision trees (boosting) ─ k-nearest-neighbors (bagging) | -8.816E+08 | -6.289 | 0.000 |
| | decision trees (random forest, bagging, boosting) ─ k-nearest-neighbors (bagging) | -9.626E+08 | -8.068 | 0.000 |
| Week 2 | decision trees (random forest) ─ k-nearest-neighbors (bagging) | -8.618E+08 | -7.869 | 0.000 |
| | decision trees (bagging) ─ k-nearest-neighbors (bagging) | -8.420E+08 | -7.301 | 0.000 |
| | decision trees (boosting) ─ k-nearest-neighbors (bagging) | -7.007E+08 | -4.607 | 0.000 |
| | decision trees (random forest, bagging, boosting) ─ k-nearest-neighbors (bagging) | -8.529E+08 | -7.051 | 0.000 |
| Week 3 | decision trees (random forest) ─ k-nearest-neighbors (bagging) | -5.638E+08 | -7.751 | 0.000 |
| | decision trees (bagging) ─ k-nearest-neighbors (bagging) | -5.530E+08 | -6.229 | 0.000 |
| | decision trees (boosting) ─ k-nearest-neighbors (bagging) | -4.596E+08 | -4.649 | 0.000 |
| | decision trees (random forest, bagging, boosting) ─ k-nearest-neighbors (bagging) | -5.585E+08 | -6.736 | 0.000 |
Comparing decision trees using ensemble methods with k-nearest-neighbors (boosting).
| | Compared Models | Mean | t | Sig. (2-tailed) |
|---|---|---|---|---|
| Week 1 | decision trees (random forest) ─ k-nearest-neighbors (boosting) | -1.063E+09 | -7.698 | 0.000 |
| | decision trees (bagging) ─ k-nearest-neighbors (boosting) | -1.064E+09 | -7.456 | 0.000 |
| | decision trees (boosting) ─ k-nearest-neighbors (boosting) | -1.094E+09 | -7.101 | 0.000 |
| | decision trees (random forest, bagging, boosting) ─ k-nearest-neighbors (boosting) | -1.175E+09 | -8.496 | 0.000 |
| Week 2 | decision trees (random forest) ─ k-nearest-neighbors (boosting) | -9.864E+08 | -6.373 | 0.000 |
| | decision trees (bagging) ─ k-nearest-neighbors (boosting) | -9.667E+08 | -6.165 | 0.000 |
| | decision trees (boosting) ─ k-nearest-neighbors (boosting) | -8.254E+08 | -4.504 | 0.000 |
| | decision trees (random forest, bagging, boosting) ─ k-nearest-neighbors (boosting) | -9.776E+08 | -6.060 | 0.000 |
| Week 3 | decision trees (random forest) ─ k-nearest-neighbors (boosting) | -6.168E+08 | -6.211 | 0.000 |
| | decision trees (bagging) ─ k-nearest-neighbors (boosting) | -6.060E+08 | -5.190 | 0.000 |
| | decision trees (boosting) ─ k-nearest-neighbors (boosting) | -5.126E+08 | -4.073 | 0.000 |
| | decision trees (random forest, bagging, boosting) ─ k-nearest-neighbors (boosting) | -6.115E+08 | -5.512 | 0.000 |
Comparing decision trees using ensemble methods with linear regression.
| | Compared Models | Mean | t | Sig. (2-tailed) |
|---|---|---|---|---|
| Week 1 | decision trees (random forest) ─ linear regression | -1.842E+08 | -1.367 | 0.180 |
| | decision trees (bagging) ─ linear regression | -1.856E+08 | -1.328 | 0.193 |
| | decision trees (boosting) ─ linear regression | -2.153E+08 | -1.606 | 0.117 |
| | decision trees (random forest, bagging, boosting) ─ linear regression | -2.963E+08 | -2.303 | 0.027 |
| Week 2 | decision trees (random forest) ─ linear regression | -6.204E+07 | -.850 | 0.401 |
| | decision trees (bagging) ─ linear regression | -4.228E+07 | -.555 | 0.582 |
| | decision trees (boosting) ─ linear regression | 9.901E+07 | 1.315 | 0.197 |
| | decision trees (random forest, bagging, boosting) ─ linear regression | -5.318E+07 | -.811 | 0.423 |
| Week 3 | decision trees (random forest) ─ linear regression | 3.689E+06 | .076 | 0.940 |
| | decision trees (bagging) ─ linear regression | 1.451E+07 | .392 | 0.697 |
| | decision trees (boosting) ─ linear regression | 1.080E+08 | 2.100 | 0.043 |
| | decision trees (random forest, bagging, boosting) ─ linear regression | 9.034E+06 | .250 | 0.804 |
Comparing decision trees using ensemble methods with linear regression (bagging).
| | Compared Models | Mean | t | Sig. (2-tailed) |
|---|---|---|---|---|
| Week 1 | decision trees (random forest) ─ linear regression (bagging) | -1.802E+08 | -1.257 | 0.217 |
| | decision trees (bagging) ─ linear regression (bagging) | -1.816E+08 | -1.230 | 0.227 |
| | decision trees (boosting) ─ linear regression (bagging) | -2.112E+08 | -1.489 | 0.145 |
| | decision trees (random forest, bagging, boosting) ─ linear regression (bagging) | -2.923E+08 | -2.131 | 0.040 |
| Week 2 | decision trees (random forest) ─ linear regression (bagging) | -7.242E+07 | -1.038 | 0.306 |
| | decision trees (bagging) ─ linear regression (bagging) | -5.267E+07 | -.757 | 0.454 |
| | decision trees (boosting) ─ linear regression (bagging) | 8.863E+07 | 1.223 | 0.229 |
| | decision trees (random forest, bagging, boosting) ─ linear regression (bagging) | -6.357E+07 | -1.046 | 0.303 |
| Week 3 | decision trees (random forest) ─ linear regression (bagging) | -7.922E+06 | -.148 | 0.883 |
| | decision trees (bagging) ─ linear regression (bagging) | 2.899E+06 | .082 | 0.935 |
| | decision trees (boosting) ─ linear regression (bagging) | 9.634E+07 | 1.899 | 0.066 |
| | decision trees (random forest, bagging, boosting) ─ linear regression (bagging) | -2.577E+06 | -.068 | 0.946 |
Comparing decision trees using ensemble methods with linear regression (boosting).
| | Compared Models | Mean | t | Sig. (2-tailed) |
|---|---|---|---|---|
| Week 1 | decision trees (random forest) ─ linear regression (boosting) | -1.806E+08 | -1.434 | 0.160 |
| | decision trees (bagging) ─ linear regression (boosting) | -1.819E+08 | -1.391 | 0.173 |
| | decision trees (boosting) ─ linear regression (boosting) | -2.116E+08 | -1.663 | 0.105 |
| | decision trees (random forest, bagging, boosting) ─ linear regression (boosting) | -2.926E+08 | -2.435 | 0.020 |
| Week 2 | decision trees (random forest) ─ linear regression (boosting) | -5.891E+07 | -.932 | 0.358 |
| | decision trees (bagging) ─ linear regression (boosting) | -3.916E+07 | -.599 | 0.553 |
| | decision trees (boosting) ─ linear regression (boosting) | 1.021E+08 | 1.415 | 0.166 |
| | decision trees (random forest, bagging, boosting) ─ linear regression (boosting) | -5.006E+07 | -.891 | 0.379 |
| Week 3 | decision trees (random forest) ─ linear regression (boosting) | 1.785E+05 | .004 | 0.997 |
| | decision trees (bagging) ─ linear regression (boosting) | 1.100E+07 | .325 | 0.747 |
| | decision trees (boosting) ─ linear regression (boosting) | 1.044E+08 | 2.034 | 0.050 |
| | decision trees (random forest, bagging, boosting) ─ linear regression (boosting) | 5.523E+06 | .162 | 0.872 |
Comparing decision trees using ensemble methods with single decision trees.
| | Compared Models | Mean | t | Sig. (2-tailed) |
|---|---|---|---|---|
| Week 1 | decision trees (random forest) ─ single decision trees | -1.110E+08 | -2.547 | 0.015 |
| | decision trees (bagging) ─ single decision trees | -1.123E+08 | -2.906 | 0.006 |
| | decision trees (boosting) ─ single decision trees | -1.420E+08 | -1.607 | 0.117 |
| | decision trees (random forest, bagging, boosting) ─ single decision trees | -2.230E+08 | -4.335 | 0.000 |
| Week 2 | decision trees (random forest) ─ single decision trees | -1.497E+08 | -3.077 | 0.004 |
| | decision trees (bagging) ─ single decision trees | -1.299E+08 | -2.527 | 0.016 |
| | decision trees (boosting) ─ single decision trees | 1.139E+07 | .112 | 0.911 |
| | decision trees (random forest, bagging, boosting) ─ single decision trees | -1.408E+08 | -2.421 | 0.021 |
| Week 3 | decision trees (random forest) ─ single decision trees | -1.575E+07 | -.560 | 0.579 |
| | decision trees (bagging) ─ single decision trees | -4.930E+06 | -.177 | 0.861 |
| | decision trees (boosting) ─ single decision trees | 8.852E+07 | 1.712 | 0.096 |
| | decision trees (random forest, bagging, boosting) ─ single decision trees | -1.041E+07 | -.389 | 0.699 |
Comparing k-nearest-neighbors using ensemble methods with k-nearest-neighbors (best-k).
| | Compared Models | Mean | t | Sig. (2-tailed) |
|---|---|---|---|---|
| Week 1 | k-nearest-neighbors (bagging) ─ k-nearest-neighbors (best-k) | 5.702E+08 | 4.214 | 0.000 |
| | k-nearest-neighbors (boosting) ─ k-nearest-neighbors (best-k) | 7.823E+08 | 4.897 | 0.000 |
| | k-nearest-neighbors (bagging, boosting) ─ k-nearest-neighbors (best-k) | 6.421E+08 | 4.454 | 0.000 |
| Week 2 | k-nearest-neighbors (bagging) ─ k-nearest-neighbors (best-k) | 1.012E+09 | 7.719 | 0.000 |
| | k-nearest-neighbors (boosting) ─ k-nearest-neighbors (best-k) | 1.136E+09 | 6.401 | 0.000 |
| | k-nearest-neighbors (bagging, boosting) ─ k-nearest-neighbors (best-k) | 1.032E+09 | 6.739 | 0.000 |
| Week 3 | k-nearest-neighbors (bagging) ─ k-nearest-neighbors (best-k) | -2.915E+08 | -1.862 | 0.071 |
| | k-nearest-neighbors (boosting) ─ k-nearest-neighbors (best-k) | -2.385E+08 | -1.367 | 0.180 |
| | k-nearest-neighbors (bagging, boosting) ─ k-nearest-neighbors (best-k) | -3.066E+08 | -1.874 | 0.069 |
Comparing linear regression using ensemble methods with linear regression.
| | Compared Models | Mean | t | Sig. (2-tailed) |
|---|---|---|---|---|
| Week 1 | linear regression (bagging) ─ linear regression | -3.672E+06 | -.219 | 0.828 |
| | linear regression (boosting) ─ linear regression | -4.016E+06 | -.187 | 0.852 |
| | linear regression (bagging, boosting) ─ linear regression | -3.844E+06 | -.237 | 0.814 |
| Week 2 | linear regression (bagging) ─ linear regression | -3.122E+06 | -.114 | 0.910 |
| | linear regression (boosting) ─ linear regression | 1.039E+07 | .285 | 0.777 |
| | linear regression (bagging, boosting) ─ linear regression | 3.632E+06 | .120 | 0.905 |
| Week 3 | linear regression (bagging) ─ linear regression | 3.510E+06 | .383 | 0.704 |
| | linear regression (boosting) ─ linear regression | 1.161E+07 | .688 | 0.496 |
| | linear regression (bagging, boosting) ─ linear regression | 7.561E+06 | .686 | 0.497 |