Ilya Zheludev, Robert Smith, Tomaso Aste.
Abstract
Social media analytics is showing promise for the prediction of financial markets. However, the true value of such data for trading is unclear due to a lack of consensus on which instruments can be predicted and how. Current approaches are based on the evaluation of message volumes and are typically assessed via retrospective (ex-post facto) evaluation of trading strategy returns. In this paper, we present instead a sentiment analysis methodology to quantify and statistically validate which assets could qualify for trading from social media analytics in an ex-ante configuration. We use sentiment analysis techniques and Information Theory measures to demonstrate that social media message sentiment can contain statistically-significant ex-ante information on the future prices of the S&P500 index and a limited set of stocks, in excess of what is achievable using solely message volumes.
Year: 2014 PMID: 24572909 PMCID: PMC5379406 DOI: 10.1038/srep04213
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Twitter Filters used to collect the social media data. We set up a custom-built Twitter Collection Framework (TCF) to select, from up to 10% of all messages on Twitter's elevated-access Gardenhose Feed, those that we deemed to refer to the financial instruments considered in this study. Two types of string filter were used for stocks: industry Ticker-IDs only, or industry Ticker-IDs AND/OR Company Names. Other filters, such as those for additional currency pairs or stocks, were excluded because of insufficient daily Tweet volumes (<24 per day), as determined prior to the study.
| Instrument | Filter type | Filter |
|---|---|---|
| Apple, Inc. CFDs | Ticker ID AND/OR Company Name | $AAPL AND/OR “Apple” |
| Apple, Inc. CFDs | Ticker ID | $AAPL |
| Amazon.com, Inc. CFDs | Ticker ID AND/OR Company Name | $AMZN AND/OR “Amazon” |
| Amazon.com, Inc. CFDs | Ticker ID | $AMZN |
| American Express, Co. CFDs | Ticker ID AND/OR Company Name | $AXP AND/OR “American Express” |
| Bank of America, Corp. CFDs | Ticker ID AND/OR Company Name | $BAC AND/OR “Bank of America” |
| Bank of America, Corp. CFDs | Ticker ID | $BAC |
| Cisco Systems, Inc. CFDs | Ticker ID AND/OR Company Name | $CSCO AND/OR “Cisco” |
| EURUSD CFDs | Ticker ID | $EURUSD |
| EURUSD Futures | Ticker ID | $EURUSD |
| GBPUSD CFDs | Ticker ID | $GBPUSD |
| GBPUSD Futures | Ticker ID | $GBPUSD |
| General Electric, Co. CFDs | Ticker ID AND/OR Company Name | $GE AND/OR “GE” AND/OR “General Electric” |
| General Electric, Co. CFDs | Ticker ID | $GE |
| Google, Inc. CFDs | Ticker ID AND/OR Company Name | $GOOG AND/OR “Google” |
| Google, Inc. CFDs | Ticker ID | $GOOG |
| The Home Depot, Inc. CFDs | Ticker ID AND/OR Company Name | $HD AND/OR “Home Depot” |
| Hewlett Packard, Co. CFDs | Ticker ID AND/OR Company Name | $HPQ AND/OR “Hewlett-Packard” AND/OR “Hewlett Packard” |
| Hewlett Packard, Co. CFDs | Ticker ID | $HPQ |
| IBM, Corp. CFDs | Ticker ID AND/OR Company Name | $IBM AND/OR “IBM” |
| IBM, Corp. CFDs | Ticker ID | $IBM |
| Intel Corp. CFDs | Ticker ID AND/OR Company Name | $INTC AND/OR “Intel” |
| Intel Corp. CFDs | Ticker ID | $INTC |
| Johnson & Johnson, Co. CFDs | Ticker ID AND/OR Company Name | $JNJ AND/OR “Johnson & Johnson” AND/OR “Johnson and Johnson” |
| J.P. Morgan, Inc. CFDs | Ticker ID AND/OR Company Name | $JPM AND/OR “JPMorgan” AND/OR “JP Morgan” |
| J.P. Morgan, Inc. CFDs | Ticker ID | $JPM |
| Coca-Cola, Co. CFDs | Ticker ID AND/OR Company Name | $KO AND/OR “Coca-Cola” AND/OR “Coca Cola” |
| Coca-Cola, Co. CFDs | Ticker ID | $KO |
| McDonald's, Corp. CFDs | Ticker ID AND/OR Company Name | $MCD AND/OR “McDonald's” AND/OR “McDonalds” |
| McDonald's, Corp. CFDs | Ticker ID | $MCD |
| 3M, Co. CFDs | Ticker ID AND/OR Company Name | $MMM AND/OR “3M” |
| Microsoft, Corp. CFDs | Ticker ID AND/OR Company Name | $MSFT AND/OR “Microsoft” |
| Microsoft, Corp. CFDs | Ticker ID | $MSFT |
| Oracle, Corp. CFDs | Ticker ID AND/OR Company Name | $ORCL AND/OR “Oracle” |
| Oracle, Corp. CFDs | Ticker ID | $ORCL |
| FTSE100 Index CFDs | UK Geographical | String-unfiltered UK Tweets |
| FTSE100 Index Futures | UK Geographical | String-unfiltered UK Tweets |
| S&P500 Index CFDs | US Geographical | String-unfiltered US Tweets |
| S&P500 Index Futures | US Geographical | String-unfiltered US Tweets |
| AT&T, Inc. CFDs | Ticker ID AND/OR Company Name | $T AND/OR “AT&T” |
| AT&T, Inc. CFDs | Ticker ID | $T |
| Wal-Mart, Inc. CFDs | Ticker ID AND/OR Company Name | $WMT AND/OR “Wal-Mart” AND/OR “Wal Mart” |
| Exxon Mobil, Corp. CFDs | Ticker ID AND/OR Company Name | $XOM AND/OR “Exxon Mobil” |
| Exxon Mobil, Corp. CFDs | Ticker ID | $XOM |
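The two filter types in the table above can be illustrated with a small string matcher. This is only a sketch, not the authors' Twitter Collection Framework: the function name `make_filter`, the case-insensitive matching, and the whole-word rule for company names are all assumptions made for illustration.

```python
import re


def make_filter(ticker, names=()):
    """Build a predicate for a 'Ticker ID AND/OR Company Name' style filter.

    Hypothetical helper: a Tweet passes if it contains the cashtag
    and/or any of the company names.
    """
    # Assumption: a cashtag must not be followed by another letter,
    # so "$GOOG" does not match inside "$GOOGL".
    patterns = [re.escape(ticker) + r"(?![A-Za-z])"]
    # Assumption: company names are matched as whole words.
    patterns += [r"\b" + re.escape(name) + r"\b" for name in names]
    regex = re.compile("|".join(patterns), re.IGNORECASE)
    return lambda text: regex.search(text) is not None


# Ticker-only filter versus ticker AND/OR company name filter.
ticker_only = make_filter("$GOOG")
ticker_or_name = make_filter("$GOOG", ["Google"])

print(ticker_only("Long $GOOG into earnings"))      # matches the cashtag
print(ticker_only("Google Maps is down again"))     # no cashtag present
print(ticker_or_name("Google Maps is down again"))  # matches the company name
```

A ticker-only filter passes far fewer messages than the combined filter, which is consistent with the much lower message volumes reported for the "$AAPL" filter compared with "$AAPL AND/OR Apple".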
Figure 1. Examples of when hourly changes in social media sentiment contain lead-time information about securities' hourly returns.
We refer to the percentage increase in Mutual Information between hourly changes in the social media sentiment data and securities' hourly returns at leading time-shifts, relative to zero time-shift, as the information surplus. Here, social media sentiment data is offset such that it precedes financial data, and the Mutual Information between the two time-series is compared to that which is available at no time-shift. If the information surplus is positive, then sentiment data contains more Mutual Information about financial data at an exploitable leading time-shift than in the no-offset configuration. We suggest that in such scenarios, hourly changes in the sentiment data contain lead-time information about securities' hourly returns, as they remove more uncertainty about the financial data time-series ahead of time than when the two time-series are not offset. For social media to be deemed to lead financial data, three further conditions had to be met: the asset's Twitter Filter attracted a minimum mean message volume of 60 messages per hour from our connection to Twitter's 10% Gardenhose feed; the information surplus values were greater when sentiment data preceded financial data than in the converse case (when financial data preceded sentiment data); and the observations were statistically significant at the 99% confidence level (relative to sentiments generated from randomly permuted data). In this manner, we identify twelve instruments for which hourly changes in the sentiments of social media messages contain lead-time information about securities' hourly returns. In this figure, we show the maximum information surplus seen per time-shift. Of the admitted assets, Apple Inc. was the only company for which such an indication was visible using a Twitter Filter searching solely for an asset's industry Ticker-ID (rather than the company name).
Tweets on the remaining individual stocks were obtained by filtering Twitter for Company Names AND/OR their industry Ticker-IDs. Finally, the sentiments of string-unfiltered Tweets from the USA were shown to lead the returns of S&P500 Futures for one time-shift.
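The information surplus described in this caption can be sketched in a few lines. The estimator below is an assumption: it uses a simple equal-width histogram (plug-in) estimate of Mutual Information with a hypothetical `bins` parameter, whereas the caption does not state which estimator or discretisation the authors used. The function names are illustrative only.

```python
import math
from collections import Counter


def discretise(values, bins=8):
    """Map each value to an equal-width bin index (binning is an assumption)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant series
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x, y, bins=8):
    """Plug-in estimate of I(X;Y) in bits for two equal-length series."""
    xs, ys = discretise(x, bins), discretise(y, bins)
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum p(a,b) * log2( p(a,b) / (p(a) p(b)) ) over observed bin pairs.
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def information_surplus(sentiment, returns, shift, bins=8):
    """Percentage increase in MI at a leading shift, relative to zero shift.

    sentiment[t] is paired with returns[t + shift], so sentiment leads.
    """
    if not 0 < shift < len(sentiment):
        raise ValueError("shift must be between 1 and len(series) - 1")
    base = mutual_information(sentiment, returns, bins)
    if base == 0:
        raise ValueError("zero-shift Mutual Information is zero")
    lead = mutual_information(sentiment[:-shift], returns[shift:], bins)
    return 100.0 * (lead - base) / base
```

A positive return value from `information_surplus` corresponds to the case the caption calls exploitable: the shifted sentiment series shares more information with returns than the unshifted one. Histogram estimates of Mutual Information are biased upwards for short series, which is one reason a significance check against permuted data is still needed.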
Figure 2. Hourly changes in Tweet message sentiments lead financial data more than hourly changes in Tweet message volumes.
We use Mutual Information to determine the extent to which Twitter messages on financial instruments can lead their securities' returns. We perform our analysis on hourly changes in Tweet sentiments vs. the hourly returns of forty-four financial instruments, showing that Twitter sentiment leads securities' returns in a statistically-significant manner for twelve instruments. We then perform identical analyses on the hourly changes in Twitter message volumes vs. the hourly returns and the absolute hourly returns of the same forty-four financial instruments, to echo recent studies which compare social media [8,9] and search engine [10,11,12,13] message volumes with financial market performance. We demonstrate that the Tweet sentiments result in proportionally larger maximum information surplus values compared to the maximum information surplus values seen from our Tweet volume (rather than Tweet sentiment) experiments. This is demonstrated in the top chart, where we show the ratios of the maximum leading statistically-significant information surpluses seen from our three experiments: hourly changes in Tweet message sentiments as evaluated against hourly returns (blue bars); hourly changes in Tweet message volumes as evaluated against hourly returns (red bars); and hourly changes in Tweet message volumes as evaluated against absolute hourly returns (green bars). Tweet message sentiments outperformed Tweet message volumes in leading securities' hourly returns in a statistically-significant manner for twelve assets. In the bottom chart we demonstrate the ratios of the number of observed instances of statistically-significant leading information surpluses from our three experiments for each asset.
We observe that for twelve assets, hourly changes in Tweet message sentiments (blue bars) lead the securities' hourly returns more often than hourly changes in Tweet message volumes, whether these volumes are evaluated against hourly returns (red bars) or absolute hourly returns (green bars). In one additional case (Bank of America, Corp.), hourly changes in Tweet message volumes led the security's hourly returns in a statistically-significant manner when Tweet message sentiments did not. For all remaining assets from the original forty-four, Tweets do not lead securities' returns in a statistically-significant manner.
Figure 3. Determining if sentiment data is more leading than trailing.
By way of example, we demonstrate the Mutual Information between hourly changes in sentiments and financial data for the Twitter Filter “$GOOG” AND/OR “Google”, compared with the hourly returns of Google CFDs. For this example, we only consider the negative sentiments as calculated by SentiStrength, a leading [20] research-orientated sentiment classification tool tailored for the lexically and grammatically incorrect nature of social media text. The data is presented for time-shifts between 0 and 24 hours, both in a leading configuration (such that hourly changes in the sentiment data lead the security's hourly returns) and in a trailing configuration (such that the security's hourly returns lead the hourly changes in the sentiment data). We only admit those time-shifts for which the per-time-shift leading Mutual Information exceeds the mean trailing Mutual Information, as indicated by the vertical green bar, and reject those time-shifts for which the per-time-shift leading Mutual Information is less than the mean trailing Mutual Information, as indicated by the vertical red bar.
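The admission rule in this caption, keeping only those leading time-shifts whose Mutual Information exceeds the mean trailing Mutual Information, can be written down directly. The sketch below assumes the Mutual Information values have already been computed per time-shift; the function name and the numbers in the usage example are illustrative, not taken from the paper, and the caption does not say whether the zero time-shift is included in the trailing mean.

```python
from statistics import mean


def admitted_shifts(leading_mi, trailing_mi):
    """Return leading time-shifts whose MI exceeds the mean trailing MI.

    Both arguments map a time-shift in hours to a Mutual Information value.
    """
    threshold = mean(trailing_mi.values())
    return sorted(shift for shift, mi in leading_mi.items() if mi > threshold)


# Illustrative values only (not results from the study).
leading = {1: 0.10, 2: 0.30, 3: 0.25}
trailing = {1: 0.20, 2: 0.20, 3: 0.20}
print(admitted_shifts(leading, trailing))  # [2, 3]
```

Comparing each leading value against a single trailing mean, rather than against the trailing value at the same shift, follows the wording of the caption.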
Figure 4. Determining if sentiment data can lead financial data.
We use the term information surplus to denote situations when hourly changes in the sentiment data carry more information about securities' hourly returns ahead of time than at no leading time-shift. By way of example, we demonstrate the information surplus between hourly changes in the sentiment data for the Twitter Filter: “$GOOG” AND/OR “Google” and the hourly returns of Google, Inc. CFDs. For the sentiment data to be considered leading, it must demonstrate positive information surplus at time-shifts where sentiment data is offset to lead financial data. As in the example above, we admit those leading time-shifts for which the information surplus curve is above the information surplus threshold line of zero.
Figure 5. Sentiment data can lead financial data for a range of time-shifts in a statistically-significant manner.
By way of example, we demonstrate the statistically-significant leading information surplus between hourly changes in the sentiment data for the Twitter Filter “$GOOG” AND/OR “Google” and the hourly returns of Google, Inc. CFDs. Here, we demonstrate the performance of the three different sentiment types (positive, negative and net), as produced by the SentiStrength classifier. Instances where the information surplus is positive denote a leading time-shift for which the hourly changes in the sentiment data contain more information about the security's hourly returns ahead of time than at zero time-shift in a statistically-significant manner, and for which this sentiment data is simultaneously more leading than trailing. Thus, for such instances we can say that social media sentiment data does precede the financial data. Note that for the financial-instrument/Twitter-Filter combination shown in this example, there are no instances where hourly changes in the positive sentiments of the Tweets performed successfully in leading the security's hourly returns. However, there are three instances where hourly changes in the negative sentiment component of the Tweets do lead the security's hourly returns at the 99% confidence level. Similarly, we observe eleven instances in this example where hourly changes in the net sentiment component of the Tweets lead the security's hourly returns in a statistically-significant manner.
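The significance check against randomly permuted data can be illustrated with a generic permutation test. This is a sketch under stated assumptions: the captions do not specify the exact permutation procedure, so here the sentiment series itself is shuffled, the function name `is_significant` and its parameters are hypothetical, and `statistic` stands in for whatever measure is being tested (for example a Mutual Information or information surplus estimate).

```python
import random


def is_significant(sentiment, returns, statistic, n_perm=1000, level=0.99, seed=0):
    """Permutation test: does the observed statistic beat shuffled data?

    Returns True when the observed value strictly exceeds at least
    `level` of the statistics obtained from shuffled sentiment series.
    """
    rng = random.Random(seed)
    observed = statistic(sentiment, returns)
    shuffled = list(sentiment)
    null_values = []
    for _ in range(n_perm):
        # Shuffling destroys any temporal alignment with the returns.
        rng.shuffle(shuffled)
        null_values.append(statistic(shuffled, returns))
    exceeded = sum(observed > value for value in null_values) / n_perm
    return exceeded >= level


def abs_covariance(a, b):
    """Toy statistic used only to demonstrate the test."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return abs(sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)))


series = [float(i) for i in range(50)]
print(is_significant(series, series, abs_covariance))        # aligned series
print(is_significant(series, [0.0] * 50, abs_covariance))    # no relationship
```

With `level=0.99`, an instance is accepted only if fewer than 1% of the shuffled series do at least as well as the real one, which mirrors the 99% threshold quoted in the caption.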
Social media sentiment can lead financial returns. For each instrument we show its largest statistically-significant information surplus seen in the study, i.e. Twitter sentiment's best ability to lead financial data ahead of time, relative to no time-shift. For each instrument, we also summarise: the search characteristics of its Twitter Filter; its mean per-minute message volume over the 3-month collection period; and its largest statistically-significant information surplus. We also report the leading time-shift (in hours) at which this occurs, the corresponding sentiment type (positive, negative or net), and the total number of statistically-significant instances where social media sentiment leads financial data. Note: as discussed in the Methods, the full 24-hour autocorrelation-removal moving mean windows have been used throughout. We observe that Twitter Filter #11 (“$AAPL”) is the only admitted filter which uses just the financial instrument's industry Ticker-ID. *: We witness unexpectedly low hourly volumes of string-unfiltered US Tweets. This is because we employed the most accurate location-detection methodology available: only admitting those Tweets which are stamped with geographical coordinates falling within the extremes of the United States' border. The majority of Tweets are not stamped with geographical coordinates, since typically only those messages which are sent from GPS-enabled devices may contain them. Nonetheless, this hourly message volume was sufficient to pass our minimum mean message volume threshold of 1 message per minute. Finally, we note that our methodology identifies the following financial-instrument/Twitter-Filter combinations as inadmissible due to a lack of statistical significance: Microsoft CFDs, FTSE100 CFDs and Futures, S&P500 CFDs, IBM CFDs, Wal-Mart CFDs and Bank of America CFDs. These assets do attract sufficient Tweet volumes, but their sentiments are not able to lead financial data in a statistically-significant manner for any of the leading time-shifts considered in this investigation.
| # | Instrument name | Twitter Filter | Mean message volume per minute | Largest statistically-significant information surplus | Leading time-shift (hours) | Sentiment type | Number of statistically-significant instances |
|---|---|---|---|---|---|---|---|
| 1 | Apple, Inc. CFDs | $AAPL AND/OR “Apple” | 126.7 | 0.140% | 10 | Negative | 2 |
| 2 | Amazon.com, Inc. CFDs | $AMZN AND/OR “Amazon” | 123.1 | 3.473% | 20 | Net | 30 |
| 3 | Google, Inc. CFDs | $GOOG AND/OR “Google” | 184.0 | 2.638% | 14 | Net | 14 |
| 4 | Intel, Inc. CFDs | $INTC AND/OR “Intel” | 12.9 | 1.414% | 1 | Negative | 2 |
| 5 | Coca-Cola, Co. CFDs | $KO AND/OR “Coca Cola” AND/OR “Coca-Cola” | 24.8 | 0.723% | 8 | Positive | 13 |
| 6 | McDonald's, Corp. CFDs | $MCD AND/OR “McDonald's” AND/OR “McDonalds” | 46.5 | 1.902% | 13 | Net | 7 |
| 7 | S&P500 Futures | String-unfiltered US Tweets | 142.7 | 2.462% | 22 | Net | 1 |
| 8 | Oracle, Corp. CFDs | $ORCL AND/OR “Oracle” | 5.0 | 0.363% | 1 | Net | 1 |
| 9 | Cisco Systems, Inc. CFDs | $CSCO AND/OR “Cisco” | 4.0 | 2.766% | 13 | Net | 15 |
| 10 | The Home Depot, Inc. CFDs | $HD AND/OR “Home Depot” | 1.9 | 2.813% | 11 | Positive | 8 |
| 11 | Apple, Inc. (Ticker only) CFDs | $AAPL | 1.8 | 3.347% | 14 | Negative | 2 |
| 12 | J.P. Morgan, Inc. CFDs | $JPM OR “JPMorgan” OR “JP Morgan” | 1.1 | 3.936% | 12 | Positive | 2 |
k-means clustering of admitted assets by Tweet volume. We run a k-means [20] clustering algorithm on the mean per-minute volumes of Tweets collected over the entire study for the financial-instrument/Twitter-Filter combinations for which we deem hourly changes in social media sentiments to lead securities' hourly returns in a statistically-significant manner. By clustering these volumes into two categories, we compare the mean per-minute Tweet volume to the financial instrument's brand value [24]. We observe that the companies grouped into cluster 1, Apple Inc., Amazon.com Inc. and Google Inc. (with a centroid of 144.1 messages per minute), are also the most popular brands admitted in our study. Cluster 2 encapsulates the remaining companies admitted by our study (with a centroid of 12.3 messages per minute). We therefore quantitatively show the intuitive relationship that companies of high brand value are also represented strongly in terms of Tweet volumes, and suggest that any trading strategies built on the analytics of social media data should give particular attention to such companies due to the high density of Tweets. *: Note that we exclude message volumes attributed to the S&P500 index Futures and to Apple, Inc. CFDs (collected solely via the Ticker-ID Twitter Filter) from these clustering calculations.
| # | Instrument name | Twitter Filter | Mean message volume per minute | k-means clustering category for message volume | Brand value (m) |
|---|---|---|---|---|---|
| 1 | Apple, Inc. CFDs | $AAPL AND/OR “Apple” | 126.7 | 1 | $87,304 |
| 2 | Amazon.com, Inc. CFDs | $AMZN AND/OR “Amazon” | 123.1 | 1 | $36,788 |
| 3 | Google, Inc. CFDs | $GOOG AND/OR “Google” | 184.0 | 1 | $52,132 |
| 4 | Intel, Inc. CFDs | $INTC AND/OR “Intel” | 12.9 | 2 | $21,139 |
| 5 | Coca-Cola, Co. CFDs | $KO AND/OR “Coca Cola” AND/OR “Coca-Cola” | 24.8 | 2 | $34,205 |
| 6 | McDonald's, Corp. CFDs | $MCD AND/OR “McDonald's” AND/OR “McDonalds” | 46.5 | 2 | $21,642 |
| 7 | S&P500 Futures | String-unfiltered US Tweets | 142.7 | 1 | N/A |
| 8 | Oracle, Corp. CFDs | $ORCL AND/OR “Oracle” | 5.0 | 2 | $16,047 |
| 9 | Cisco Systems, Inc. CFDs | $CSCO AND/OR “Cisco” | 4.0 | 2 | $15,468 |
| 10 | The Home Depot, Inc. CFDs | $HD AND/OR “Home Depot” | 1.9 | 2 | $23,423 |
| 11 | Apple, Inc. (Ticker only) CFDs | $AAPL | 1.8 | 2 | N/A |
| 12 | J.P. Morgan, Inc. CFDs | $JPM OR “JPMorgan” OR “JP Morgan” | 1.1 | 2 | $13,775 |
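The two-cluster grouping by message volume can be sketched with a one-dimensional k-means (Lloyd's algorithm). The implementation below is illustrative only: the authors' software, initialisation and exact inputs are not given, so the centroid initialisation here (evenly spaced between the minimum and maximum) is an assumption, and the centroids it computes need not equal the values quoted in the caption. The volumes are the per-minute figures from the table, with the S&P500 Futures and the ticker-only Apple filter left out as the caption's note describes.

```python
def kmeans_1d(values, k=2, iters=100):
    """Lloyd's algorithm on scalars; returns (centroids, labels). Needs k >= 2."""
    # Assumption: initial centroids are evenly spaced between min and max.
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = []
    for _ in range(iters):
        # Assignment step: each value joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: abs(v - centroids[j])) for v in values
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [v for v, label in zip(values, labels) if label == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, labels


# Mean message volume per minute, taken from the table above.
volumes = {
    "Apple": 126.7, "Amazon": 123.1, "Google": 184.0, "Intel": 12.9,
    "Coca-Cola": 24.8, "McDonald's": 46.5, "Oracle": 5.0, "Cisco": 4.0,
    "Home Depot": 1.9, "J.P. Morgan": 1.1,
}
centroids, labels = kmeans_1d(list(volumes.values()))
high_volume = [name for name, label in zip(volumes, labels) if label == 1]
print(high_volume)  # ['Apple', 'Amazon', 'Google']
```

With these inputs the high-volume cluster contains Apple, Amazon and Google, which matches the cluster 1 membership reported in the table.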