Juvenal José Duarte, Sahudy Montenegro González, José César Cruz.
Abstract
Market participants use a wide set of information before they decide to invest in risk assets, such as stocks. Investors often follow the news to collect the information that will help them decide which strategy to follow. In this study, we analyze how public news and historical prices can be used together to anticipate and prevent financial losses on the Brazilian stock market. We include an extensive set of 64 securities in our analysis, which represent various sectors of the Brazilian economy. Our analysis compares the traditional Buy & Hold and moving average strategies to several experiments designed with 11 machine learning algorithms. We explore daily, weekly and monthly time horizons for both publication and return windows. With this approach we were able to assess the most relevant set of news for investors' decisions, and to determine for how long the information remains relevant to the market. We found a strong relationship between news publications and stock price changes in Brazil, suggesting even short-term arbitrage opportunities. The study shows that it is possible to predict stock price falls using a set of news in Portuguese, and that text mining-based approaches can outperform traditional strategies when forecasting losses. © Springer Science+Business Media, LLC, part of Springer Nature 2020.
Keywords: Brazilian Portuguese news; Brazilian stock market; Machine learning algorithms; Price forecasting; Text mining
Year: 2020 PMID: 33223615 PMCID: PMC7670979 DOI: 10.1007/s10614-020-10060-y
Source DB: PubMed Journal: Comput Econ ISSN: 0927-7099 Impact factor: 1.876
Fig. 1 Rolling windows: past, today, future
Fig. 2 Method description to obtain a predictor for a single stock
Monthly news volume by portal
| Year/Month | br.advfn.com | br.investing.com | estadao.com.br | exame.com.br | g1.globo.com | infomoney.com.br | ultimoinstante.com.br | uol.com.br | valor.globo.com | veja.com.br | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016/08 | 0 | 0 | 0 | 0 | 59 | 0 | 0 | 8 | 0 | 0 | 67 |
| 2016/09 | 1071 | 0 | 980 | 3 | 6431 | 0 | 0 | 605 | 0 | 4 | 9094 |
| 2016/10 | 2104 | 0 | 1721 | 1 | 13,420 | 1 | 0 | 1199 | 0 | 2 | 18,448 |
| 2016/11 | 2383 | 0 | 1655 | 2 | 12,645 | 0 | 0 | 1252 | 0 | 3 | 17,940 |
| 2016/12 | 2331 | 0 | 1652 | 1 | 10,597 | 0 | 0 | 1246 | 0 | 7 | 15,834 |
| 2017/01 | 1504 | 1 | 1076 | 7 | 6950 | 0 | 0 | 895 | 0 | 3 | 10,436 |
| 2017/02 | 2354 | 0 | 1566 | 20 | 9357 | 1 | 0 | 1024 | 0 | 15 | 14,337 |
| 2017/03 | 2753 | 0 | 1751 | 21 | 11,004 | 4 | 0 | 1190 | 1 | 22 | 16,746 |
| 2017/04 | 2513 | 0 | 1669 | 18 | 3971 | 4 | 0 | 1209 | 1 | 11 | 9396 |
| 2017/05 | 2822 | 0 | 2043 | 13 | 2153 | 6 | 0 | 1555 | 19 | 12 | 8623 |
| 2017/06 | 2592 | 0 | 1729 | 32 | 2039 | 8 | 0 | 1520 | 37 | 10 | 7967 |
| 2017/07 | 2548 | 7 | 1525 | 889 | 3227 | 12,296 | 250 | 1354 | 2373 | 2214 | 26,683 |
| 2017/08 | 3163 | 27 | 1792 | 1612 | 3538 | 4276 | 945 | 1419 | 4217 | 3936 | 24,925 |
| 2017/09 | 1921 | 14 | 1487 | 1161 | 2580 | 2951 | 513 | 1112 | 3274 | 2665 | 17,678 |
| 2017/10 | 2462 | 25 | 1641 | 0 | 3265 | 3348 | 788 | 1300 | 4092 | 1937 | 18,858 |
| 2017/11 | 2048 | 25 | 1383 | 0 | 2733 | 4535 | 588 | 1065 | 3831 | 1711 | 17,919 |
| 2017/12 | 1683 | 19 | 1152 | 0 | 2270 | 2945 | 75 | 944 | 3121 | 145 | 12,354 |
| 2018/01 | 437 | 26 | 1544 | 0 | 2723 | 3916 | 679 | 1208 | 3551 | 0 | 14,084 |
| 2018/02 | 0 | 20 | 1096 | 0 | 1410 | 2655 | 513 | 942 | 2972 | 0 | 9608 |
| 2018/03 | 0 | 2 | 123 | 0 | 61 | 735 | 2 | 83 | 470 | 0 | 1476 |
| 2018/04 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 33 | 0 | 34 |
| 2018/05 | 0 | 0 | 44 | 0 | 8 | 52 | 0 | 0 | 310 | 0 | 414 |
| Total | 36,689 | 166 | 27,629 | 3780 | 100,441 | 37,734 | 4353 | 21,130 | 28,302 | 12,697 | 272,921 |
Number of training and test samples for each stock, split into negative and positive classes, for return windows of 0, 5 and 21 days
| Stock | Train neg. 0 d | Train neg. 5 d | Train neg. 21 d | Train pos. 0 d | Train pos. 5 d | Train pos. 21 d | Test neg. 0 d | Test neg. 5 d | Test neg. 21 d | Test pos. 0 d | Test pos. 5 d | Test pos. 21 d | Avg. pos. (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SUZB3 | 49 | 54 | 66 | 31 | 26 | 14 | 21 | 30 | 38 | 17 | 8 | 0 | 26 |
| CVCB3 | 152 | 187 | 220 | 129 | 94 | 61 | 74 | 86 | 108 | 64 | 52 | 30 | 34 |
| MGLU3 | 178 | 204 | 238 | 103 | 77 | 43 | 67 | 72 | 97 | 71 | 66 | 41 | 35 |
| FIBR3 | 139 | 169 | 187 | 142 | 112 | 94 | 82 | 94 | 105 | 56 | 44 | 33 | 37 |
| RENT3 | 154 | 180 | 185 | 127 | 101 | 96 | 70 | 82 | 108 | 68 | 56 | 30 | 38 |
| TIMP3 | 158 | 175 | 191 | 123 | 106 | 90 | 75 | 79 | 96 | 63 | 59 | 42 | 39 |
| SANB11 | 153 | 166 | 182 | 128 | 115 | 99 | 75 | 81 | 104 | 63 | 57 | 34 | 39 |
| VALE3 | 140 | 153 | 161 | 141 | 128 | 120 | 79 | 89 | 109 | 59 | 49 | 29 | 40 |
| PETR3 | 133 | 152 | 148 | 148 | 129 | 133 | 82 | 94 | 96 | 56 | 44 | 42 | 41 |
| BRAP4 | 140 | 165 | 166 | 141 | 116 | 115 | 81 | 76 | 93 | 57 | 62 | 45 | 42 |
| PETR4 | 140 | 153 | 157 | 141 | 128 | 124 | 74 | 94 | 91 | 64 | 44 | 47 | 42 |
| EQTL3 | 147 | 162 | 168 | 134 | 119 | 113 | 76 | 81 | 87 | 62 | 57 | 51 | 42 |
| RAIL3 | 156 | 179 | 177 | 125 | 102 | 104 | 75 | 76 | 74 | 63 | 62 | 64 | 42 |
| ABEV3 | 155 | 162 | 178 | 126 | 119 | 103 | 75 | 73 | 85 | 63 | 65 | 53 | 43 |
| SBSP3 | 148 | 169 | 160 | 133 | 112 | 121 | 72 | 81 | 86 | 66 | 57 | 52 | 43 |
| BBAS3 | 157 | 170 | 189 | 124 | 111 | 92 | 75 | 76 | 66 | 63 | 62 | 72 | 43 |
| ESTC3 | 159 | 166 | 169 | 122 | 115 | 112 | 78 | 66 | 82 | 60 | 72 | 56 | 43 |
| USIM5 | 136 | 168 | 176 | 145 | 113 | 105 | 72 | 87 | 73 | 66 | 51 | 65 | 44 |
| B3SA3 | 159 | 175 | 187 | 122 | 106 | 94 | 76 | 70 | 65 | 62 | 68 | 73 | 44 |
| GOLL4 | 137 | 164 | 187 | 144 | 117 | 94 | 67 | 62 | 98 | 71 | 76 | 40 | 44 |
| GOAU4 | 146 | 150 | 169 | 135 | 131 | 112 | 69 | 84 | 85 | 69 | 54 | 53 | 44 |
| ITUB4 | 162 | 174 | 190 | 119 | 107 | 91 | 71 | 72 | 65 | 67 | 66 | 73 | 44 |
| GGBR4 | 139 | 151 | 161 | 142 | 130 | 120 | 72 | 87 | 85 | 66 | 51 | 53 | 44 |
| BTOW3 | 147 | 163 | 164 | 134 | 118 | 117 | 72 | 75 | 85 | 66 | 63 | 53 | 44 |
| HYPE3 | 142 | 156 | 177 | 139 | 125 | 104 | 77 | 74 | 80 | 61 | 64 | 58 | 44 |
| ITSA4 | 153 | 173 | 177 | 128 | 108 | 104 | 64 | 76 | 76 | 74 | 62 | 62 | 44 |
| BBDC4 | 159 | 185 | 186 | 122 | 96 | 95 | 70 | 67 | 64 | 68 | 71 | 74 | 44 |
| LREN3 | 159 | 173 | 207 | 122 | 108 | 74 | 71 | 73 | 48 | 67 | 65 | 90 | 45 |
| VVAR3 | 156 | 175 | 195 | 125 | 106 | 86 | 62 | 59 | 73 | 76 | 79 | 65 | 45 |
| BRKM5 | 136 | 165 | 188 | 145 | 116 | 93 | 66 | 71 | 75 | 72 | 67 | 63 | 45 |
| WEGE3 | 152 | 149 | 192 | 129 | 132 | 89 | 60 | 79 | 70 | 78 | 59 | 68 | 46 |
| SMLS3 | 155 | 170 | 193 | 126 | 111 | 88 | 70 | 70 | 54 | 68 | 68 | 84 | 46 |
| BBDC3 | 162 | 169 | 175 | 119 | 112 | 106 | 71 | 63 | 60 | 67 | 75 | 78 | 47 |
| CYRE3 | 143 | 167 | 169 | 138 | 114 | 112 | 70 | 69 | 67 | 68 | 69 | 71 | 47 |
| CSAN3 | 151 | 165 | 142 | 130 | 116 | 139 | 74 | 65 | 75 | 64 | 73 | 63 | 47 |
| NATU3 | 152 | 153 | 165 | 129 | 128 | 116 | 61 | 72 | 75 | 77 | 66 | 63 | 47 |
| IGTA3 | 148 | 170 | 195 | 133 | 111 | 86 | 69 | 67 | 48 | 69 | 71 | 90 | 47 |
| PCAR4 | 138 | 169 | 188 | 143 | 112 | 93 | 67 | 62 | 59 | 71 | 76 | 79 | 48 |
| MRVE3 | 138 | 146 | 156 | 143 | 135 | 125 | 68 | 68 | 75 | 70 | 70 | 63 | 48 |
| ECOR3 | 146 | 175 | 197 | 135 | 106 | 84 | 65 | 58 | 48 | 73 | 80 | 90 | 49 |
| FLRY3 | 146 | 158 | 183 | 135 | 123 | 98 | 65 | 70 | 51 | 73 | 68 | 87 | 49 |
| JBSS3 | 136 | 149 | 142 | 145 | 132 | 139 | 66 | 70 | 77 | 72 | 68 | 61 | 49 |
| UGPA3 | 145 | 141 | 154 | 136 | 140 | 127 | 77 | 71 | 58 | 61 | 67 | 80 | 49 |
| LAME4 | 128 | 150 | 158 | 153 | 131 | 123 | 72 | 64 | 70 | 66 | 74 | 68 | 49 |
| EMBR3 | 137 | 154 | 161 | 144 | 127 | 120 | 61 | 70 | 66 | 77 | 68 | 72 | 49 |
| RADL3 | 137 | 157 | 169 | 144 | 124 | 112 | 69 | 57 | 64 | 69 | 81 | 74 | 50 |
| VIVT4 | 153 | 162 | 144 | 128 | 119 | 137 | 70 | 64 | 54 | 68 | 74 | 84 | 50 |
| QUAL3 | 160 | 174 | 200 | 121 | 107 | 81 | 68 | 45 | 24 | 70 | 93 | 114 | 52 |
| MULT3 | 161 | 163 | 180 | 120 | 118 | 101 | 67 | 56 | 28 | 71 | 82 | 110 | 52 |
| MRFG3 | 148 | 153 | 164 | 133 | 128 | 117 | 62 | 49 | 58 | 76 | 89 | 80 | 52 |
| EGIE3 | 126 | 139 | 134 | 155 | 142 | 147 | 63 | 61 | 76 | 75 | 77 | 62 | 52 |
| BBSE3 | 135 | 140 | 152 | 146 | 141 | 129 | 68 | 63 | 52 | 70 | 75 | 86 | 53 |
| CMIG4 | 129 | 137 | 126 | 152 | 144 | 155 | 68 | 67 | 65 | 70 | 71 | 73 | 53 |
| ENBR3 | 146 | 152 | 171 | 135 | 129 | 110 | 64 | 57 | 38 | 74 | 81 | 100 | 53 |
| CSNA3 | 130 | 137 | 155 | 151 | 144 | 126 | 62 | 70 | 50 | 76 | 68 | 88 | 53 |
| CPLE6 | 134 | 141 | 146 | 147 | 140 | 135 | 64 | 57 | 59 | 74 | 81 | 79 | 53 |
| CIEL3 | 132 | 121 | 91 | 149 | 160 | 190 | 68 | 64 | 70 | 70 | 74 | 68 | 55 |
| ELET6 | 131 | 127 | 134 | 150 | 154 | 147 | 61 | 62 | 50 | 77 | 76 | 88 | 56 |
| KROT3 | 145 | 158 | 180 | 136 | 123 | 101 | 63 | 43 | 18 | 75 | 95 | 120 | 56 |
| ELET3 | 127 | 113 | 132 | 154 | 168 | 149 | 62 | 63 | 52 | 76 | 75 | 86 | 57 |
| BRML3 | 137 | 168 | 162 | 144 | 113 | 119 | 60 | 45 | 25 | 78 | 93 | 113 | 57 |
| CCRO3 | 145 | 145 | 139 | 136 | 136 | 142 | 60 | 48 | 25 | 78 | 90 | 113 | 58 |
| BRFS3 | 133 | 129 | 124 | 148 | 152 | 157 | 57 | 37 | 21 | 81 | 101 | 117 | 63 |
Fig. 3 Sequence of data splits during validation and test phases. The first bar, “Val 1”, illustrates the first training-validation round. The second bar, “Val 2”, illustrates the second training-validation round. The third bar, “Test”, shows the percent of the dataset covered in the final training-test round
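The split scheme in Fig. 3 can be sketched as an expanding-window (walk-forward) procedure: each round trains on all data preceding an evaluation block and evaluates on the next block, never on the past. The record does not give the exact split fractions, so the `test_frac` below is an illustrative assumption.

```python
def walk_forward_splits(n_samples, n_rounds=3, test_frac=0.25):
    """Expanding-window splits: round r trains on everything before its
    evaluation block and evaluates on that block. The block size
    (test_frac of the data) is an assumption, not the paper's value."""
    block = int(n_samples * test_frac)
    splits = []
    for r in range(1, n_rounds + 1):
        train_end = n_samples - (n_rounds - r + 1) * block
        if train_end <= 0:
            raise ValueError("not enough samples for the requested rounds")
        splits.append((list(range(0, train_end)),
                       list(range(train_end, train_end + block))))
    return splits

# e.g. 100 samples: two validation rounds ("Val 1", "Val 2") plus a
# final training-test round ("Test"), each evaluated on the next block
for train_idx, eval_idx in walk_forward_splits(100):
    print(len(train_idx), len(eval_idx))
```

Because the evaluation block always lies strictly after the training data, this avoids look-ahead leakage, which an ordinary shuffled cross-validation would introduce in a time-series setting.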
Evaluation metrics and their formulas
| Metric | Formula |
|---|---|
| Recall | TP / (TP + FN) |
| Precision | TP / (TP + FP) |
| F1 or F-score | 2 · Precision · Recall / (Precision + Recall) |
| Return factor | |
| Loss factor | |
| Loss (%) | |
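The three classification metrics in the table have their standard definitions; the return factor, loss factor and loss percentage are the paper's financial metrics, whose formulas did not survive extraction and are therefore not reproduced here. A minimal sketch of the standard metrics, treating a predicted price fall as the positive class:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Standard classification metrics. Here the positive class is a
    price fall, following the paper's forecasting target."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# toy example: 2 true positives, 1 false positive, 1 false negative
p, r, f = precision_recall_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```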
Example predictions for a single stock. “L” stands for long position and “S” for short position
| Date | Return factor | Forecast | Inv. position |
|---|---|---|---|
| 2017-09-19 | 0.98 | 0 | L |
| 2017-09-20 | 0.97 | 1 | S |
| 2017-09-21 | 0.99 | 0 | S |
| 2017-09-22 | 1.01 | 1 | S |
| 2017-09-25 | 0.95 | 1 | S |
| 2017-09-26 | 0.97 | 0 | S |
| 2017-09-27 | 1.03 | 0 | S |
| 2017-09-28 | 1.00 | 0 | S |
| 2017-09-29 | 1.04 | 0 | S |
| 2017-10-02 | 1.03 | 0 | L |
| 2017-10-02 | 1.01 | 0 | L |
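One plausible way to turn the table's forecasts and return factors into a capital curve is to compound the daily return factor when long and its mirrored move when short. The paper's exact position-sizing and holding rules are not given in this record, so both the short-return formula (2 − r) and the position sequence below are assumptions for illustration.

```python
def simulate(return_factors, positions):
    """Compound capital across days. Assumption (not stated in this
    record): a long ("L") day multiplies capital by the return factor
    r_t = p_t / p_{t-1}, and a short ("S") day by 2 - r_t, i.e. the
    mirrored daily move."""
    capital = 1.0
    for r, pos in zip(return_factors, positions):
        capital *= r if pos == "L" else 2.0 - r
    return capital

# the rows of the table above
factors = [0.98, 0.97, 0.99, 1.01, 0.95, 0.97, 1.03, 1.00, 1.04, 1.03, 1.01]
positions = ["L", "S", "S", "S", "S", "S", "S", "S", "S", "L", "L"]
final = simulate(factors, positions)  # > 1: the short stretch captures the falls
```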
Fig. 4 F-score average for each feature space subset
Fig. 5 Loss percent average for each feature space subset
Function w applied to the news’ vector representation
| ID | Window (days) | Equation |
|---|---|---|
| TF-IDF-1 | 1 | |
| TF-IDF-5 | 5 | |
| TF-IDF-21 | 21 | |
| EW-TF-IDF-5 | 5 | |
| EW-TF-IDF-21 | 21 | |
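The EW- (exponentially weighted) variants plausibly aggregate the TF-IDF vectors of the news published over the 5- or 21-day window, weighting recent days more heavily; the equations themselves are lost from this record, so the decay form and value below are assumptions, not the paper's definition.

```python
def ew_combine(daily_vectors, decay=0.5):
    """Combine daily TF-IDF vectors (most recent day first) into one
    feature vector, weighting day t-i by decay**i. Both the weighting
    form and the decay value are illustrative assumptions."""
    combined = {}
    for i, vec in enumerate(daily_vectors):  # index 0 = most recent day
        weight = decay ** i
        for term, value in vec.items():
            combined[term] = combined.get(term, 0.0) + weight * value
    return combined

# three days of toy TF-IDF vectors over a 3-day window, most recent first
days = [{"alta": 0.4}, {"alta": 0.2, "queda": 0.6}, {"queda": 0.3}]
features = ew_combine(days)
```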
Fig. 6Average F1 among classifiers trained under different representations, with and without news replication. Light tones indicate higher performance, dark tones worse
Algorithms and parameters for grid search
| Algorithm | Parameters |
|---|---|
| G-NB | Default |
| DT | Criteria = {Gini, Entropy}; Balance = {Automatic, None}; Max leaf nodes = {5, 10, 35, None}; Min. samples on leaf = {1, 3, 5} |
| RF | Criteria = {Gini, Entropy}; Estimators = {1500}; Balance = {Automatic, None}; OOB score = {True}; Max leaf nodes = {5, 10, 35, None}; Min. samples on leaf = {1, 3, 5} |
| GB | Criterion = {Friedman MSE}; Estimators = {500}; Learn. rate = {0.001, 0.01, 0.1}; Max. depth = {3, 5, 10}; Max. leaf nodes = {5, 10, 35, None}; Min. samples on leaf = {1, 3, 5} |
| XGB | Booster = {gbtree, gblinear, dart}; Max. delta step = {0, 1, 5}; Reg. lambda = {1, 3, 5, 10, 50, 100} |
| MLP | Shuffle = {Yes, No}; Alpha = {0.0001, 0.01, 1, 10}; Epochs = {100}; Neurons = {100, 50} |
| SVM | Kernel = {Linear, RBF, Polynomial}; C = {0.001, 0.01, 0.1, 1}; Degree = {2, 3}; Gamma = {0.001, 0.01, 0.1, 1, Auto}; Balance = {Automatic, None} |
| LR | Solver = {lbfgs}; C = {0.001, 0.01, 0.1, 1, 10} |
| KNN | K = {3, 8, 15}; Weights = {Uniform, Distance} |
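Grid search over parameter sets like those above means scoring every combination in the cartesian product of the listed values and keeping the best. A minimal self-contained sketch (the paper presumably used a library implementation; the scoring function here is a toy stand-in):

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Exhaustive search over the cartesian product of parameter values,
    mirroring the per-algorithm grids in the table above."""
    names = list(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# e.g. the KNN grid from the table, with a toy scoring function that
# happens to prefer k = 8 (a real search would use validation F1)
knn_grid = {"k": [3, 8, 15], "weights": ["uniform", "distance"]}
best, score = grid_search(knn_grid, lambda p: -abs(p["k"] - 8))
```

In practice each candidate would be scored on the walk-forward validation splits rather than a fixed function, so the number of model fits is (combinations × validation rounds), which is why the grids above are kept small.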
Computational and financial metrics comparison for daily, weekly, and monthly return windows
| Return window | Algorithm | Precision | Recall | F1 | Return F. | Loss (%) |
|---|---|---|---|---|---|---|
| 0 | MLP | 0.9390 | 0.9087 | 0.9181 | 2.1795 | 13.5336 |
| 0 | LR | 0.9267 | 0.8679 | 0.8843 | 2.0673 | 19.1129 |
| 0 | XGB | 0.9252 | 0.8597 | 0.8773 | 2.0375 | 20.2866 |
| 0 | SVM-L | 0.8902 | 0.8121 | 0.8232 | 1.8362 | 26.3662 |
| 0 | SVM-R | 0.8233 | 0.7698 | 0.7627 | 1.7300 | 29.0689 |
| 0 | RF | 0.7284 | 0.6797 | 0.6765 | 1.4130 | 41.5300 |
| 0 | GB | 0.7179 | 0.6641 | 0.6705 | 1.3721 | 43.5216 |
| 0 | SVM-P | 0.6606 | 0.7698 | 0.6670 | 1.3937 | 27.1112 |
| 0 | DT | 0.5451 | 0.5113 | 0.4908 | 1.0017 | 58.7191 |
| 0 | KNN | 0.5563 | 0.5575 | 0.4776 | 1.0073 | 50.5174 |
The values are the average among all stocks
Bold values represent the best scores
Fig. 7 Classifier performance for the worst-performing stock in the period: BRFS3
Fig. 8 Distribution of returns and losses for all stocks, classifiers and strategies
F-score obtained by each method for different return windows
| Classifier | 0 d Train | 0 d Validation | 0 d Test | 5 d Train | 5 d Validation | 5 d Test | 21 d Train | 21 d Validation | 21 d Test |
|---|---|---|---|---|---|---|---|---|---|
| NB-G | 0.93 | 0.98 | 0.93 | 0.93 | 0.93 | 0.86 | 0.94 | 0.70 | 0.78 |
| MLP | 0.99 | 0.97 | 0.91 | 0.99 | 0.90 | 0.84 | 0.99 | 0.63 | 0.73 |
| LR | 0.96 | 0.88 | 0.88 | 0.98 | 0.66 | 0.79 | 0.99 | 0.46 | 0.64 |
| SVM-L | 0.97 | 0.67 | 0.82 | 0.98 | 0.61 | 0.77 | 0.99 | 0.39 | 0.63 |
| SVM-R | 0.86 | 0.81 | 0.76 | 0.84 | 0.66 | 0.70 | 0.77 | 0.39 | 0.59 |
| SVM-P | 0.76 | 0.68 | 0.66 | 0.73 | 0.57 | 0.59 | 0.86 | 0.51 | 0.58 |
| XGB | 0.92 | 0.92 | 0.87 | 0.91 | 0.74 | 0.80 | 0.94 | 0.55 | 0.61 |
| GB | 0.80 | 0.63 | 0.67 | 0.82 | 0.54 | 0.43 | 0.89 | 0.50 | 0.46 |
| RF | 0.91 | 0.73 | 0.67 | 0.88 | 0.48 | 0.55 | 0.95 | 0.20 | 0.65 |
| DT | 0.86 | 0.60 | 0.49 | 0.87 | 0.52 | 0.49 | 0.93 | 0.48 | 0.49 |
| KNN | 0.87 | 0.47 | 0.47 | 0.78 | 0.31 | 0.44 | 0.73 | 0.23 | 0.47 |