


Affiliation IMT Institute for Advanced Studies, Piazza San Francesco 19, 55100 Lucca, Italy


Affiliation Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia

Affiliations: IMT Institute for Advanced Studies, Piazza San Francesco 19, 55100 Lucca, Italy; Istituto dei Sistemi Complessi (ISC), Via dei Taurini 19, 00185 Rome, Italy; London Institute for Mathematical Sciences, 35a South St. Mayfair, London W1K 2XF, United Kingdom

Abstract

Social media increasingly reflect and influence the behavior of other complex systems. In this paper we investigate the relations between a well-known micro-blogging platform, Twitter, and financial markets. In particular, we consider, over a period of 15 months, the Twitter volume and sentiment about the 30 stock companies that form the Dow Jones Industrial Average (DJIA) index. We find a relatively low Pearson correlation and Granger causality between the corresponding time series over the entire time period. However, we find a significant dependence between the Twitter sentiment and abnormal returns during the peaks of Twitter volume. This holds not only for the expected Twitter volume peaks (e.g., quarterly announcements), but also for peaks corresponding to less obvious events. We formalize the procedure by adapting the well-known “event study” from economics and finance to the analysis of Twitter data. The procedure allows us to automatically identify events as Twitter volume peaks, to compute the prevailing sentiment (positive or negative) expressed in tweets at these peaks, and, finally, to apply the “event study” methodology to relate them to stock returns. We show that the sentiment polarity of Twitter peaks implies the direction of cumulative abnormal returns. The magnitude of the cumulative abnormal returns is relatively low (about 1–2%), but the dependence is statistically significant for several days after the events.

Normal and abnormal returns. To appraise the event’s impact, one needs a measure of the abnormal return. The abnormal return is the actual ex-post return of the stock over the event window minus the normal return of the stock over the event window. The normal return is defined as the return that would be expected if the event did not take place. For each company i and event date d we have

$$AR_{i,d} = R_{i,d} - E[R_{i,d}],$$

where $AR_{i,d}$, $R_{i,d}$, and $E[R_{i,d}]$ are the abnormal, actual, and expected normal returns, respectively. There are two common choices for modeling the expected normal return: the constant-mean-return model and the market model. The constant-mean-return model, as the name implies, assumes that the mean return of a given stock is constant through time. The market model, used in this paper, assumes a stable linear relation between the overall market return and the stock return.

Estimation of the normal return model. Once a normal return model has been selected, the parameters of the model must be estimated using a subset of the data known as the estimation window. The most common choice, when feasible, is to use the period prior to the event window for the estimation window (cf. Fig 1 ). For example, in an event study using daily data and the market model, the market model parameters could be estimated over the 120 days prior to the event. Generally, the event period itself is not included in the estimation period to prevent the event from influencing the normal return model parameter estimates.

Statistical validation. With the estimated parameters of the normal return model, the abnormal returns can be calculated. The null hypothesis, $H_0$, is that external events have no impact on the returns. It has been shown that under $H_0$, abnormal returns are normally distributed, $AR_{i,\tau} \sim \mathcal{N}(0, \sigma^2(AR_{i,\tau}))$ [34]. This forms the basis for a procedure which tests whether an abnormal return is statistically significant.

Event detection from Twitter data. The following subsections first define the algorithm used to detect Twitter activity peaks, which are then treated as events. Next, a method to assign a polarity to the events, using the Twitter sentiment, is described. Finally, we discuss a specific type of event for the companies studied, the earnings announcement (abbreviated EA), which is already known to produce abnormal price jumps.

Detection of Twitter peaks. To identify Twitter activity peaks, for every company we use the time series of its daily Twitter volume, $TW_d$. We use a sliding window of $2L+1$ days ($L = 5$) centered at day $d_0$, and let $d_0$ slide along the time line. Within this window we evaluate the baseline volume activity $TW_b$ as the median of the window [54]. Then, we define the outlier fraction $\phi(d_0)$ of the central time point $d_0$ as the relative difference of the activity $TW_{d_0}$ with respect to the median baseline $TW_b$: $\phi(d_0) = [TW_{d_0} - TW_b] / \max(TW_b, n_{min})$. Here, $n_{min} = 10$ is a minimum activity level used to regularize the definition of $\phi(d_0)$ for low activity values. We say that there is an activity peak at $d_0$ if $\phi(d_0) > \phi_t$, where $\phi_t = 2$. The threshold $\phi_t$ determines the number of detected peaks and the number of overlaps between the event windows; both decrease as $\phi_t$ increases. One should maximize the number of detected peaks and minimize the number of overlaps [41]. We have analyzed the effects of varying $\phi_t$ from 0.5 to 10 (as in [54]). The decrease in the number of overlaps is substantial for $\phi_t$ ranging from 0.5 to 2; for larger values the decrease is slower. Therefore, we settled for $\phi_t = 2$. As a final step we apply a filter which removes detected peaks that are fewer than 21 days (the size of the event window) apart from other peaks.
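
Since the detector is only a few lines, a minimal Python sketch may be useful. The function name, the min_sep parameter, and the rule of keeping the earlier of two peaks closer than 21 days are our assumptions; the text does not specify how such conflicts are resolved.

```python
import numpy as np

def detect_peaks(tw, L=5, n_min=10, phi_t=2.0, min_sep=21):
    """Detect Twitter volume peaks via the outlier fraction phi.

    tw: 1-D array of daily Twitter volumes TW_d for one company.
    Returns the indices of the detected peak days.
    """
    tw = np.asarray(tw, dtype=float)
    peaks = []
    for d0 in range(L, len(tw) - L):
        window = tw[d0 - L:d0 + L + 1]            # 2L+1 days centered at d0
        tw_b = np.median(window)                  # baseline volume
        phi = (tw[d0] - tw_b) / max(tw_b, n_min)  # outlier fraction
        if phi > phi_t:
            peaks.append(d0)
    # drop peaks closer than min_sep days to the previously kept peak
    kept = []
    for p in peaks:
        if not kept or p - kept[-1] >= min_sep:
            kept.append(p)
    return kept
```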

As an illustration, the resulting activity peaks for the Nike company are shown in Fig 2. After the peak detection procedure, we treat all the detected peaks as events. These events are then assigned a polarity (from the Twitter sentiment) and a type (earnings announcement or not).

Fig 2. Daily time series of Twitter volume for the Nike company.

Detected Twitter peaks and actual EA events are indicated.

https://doi.org/10.1371/journal.pone.0138441.g002

Polarity of events. Each event is assigned one of three polarities: negative, neutral, or positive. The polarity of an event is derived from the sentiment polarity $P_d$ of the tweets on the peak day. From our data we detected 260 events. The distribution of the $P_d$ values for the 260 events is not uniform but predominantly positive, as shown in Fig 3. To obtain three sets of events of approximately the same size, we select the following thresholds and define the event polarity as follows:

  • If P d ∈ [−1,0.15) the event is a negative event ,
  • If P d ∈ [0.15,0.7] the event is a neutral event ,
  • If P d ∈ (0.7,1] the event is a positive event .
Fig 3. Distribution of sentiment polarity for the 260 detected Twitter peaks.

The two red bars indicate the chosen thresholds of the polarity values.

https://doi.org/10.1371/journal.pone.0138441.g003
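
Given the thresholds above, assigning a polarity to an event is a one-line decision; the helper below is ours, not from the paper.

```python
def event_polarity(p_d, neg_max=0.15, pos_min=0.7):
    """Map the peak-day sentiment polarity P_d in [-1, 1] to an event class,
    using the thresholds chosen in the text."""
    if p_d < neg_max:
        return "negative"       # P_d in [-1, 0.15)
    elif p_d <= pos_min:
        return "neutral"        # P_d in [0.15, 0.7]
    return "positive"           # P_d in (0.7, 1]
```

Since the stated goal is three classes of approximately equal size, the thresholds could equivalently be derived as empirical tertiles of the 260 polarity values, e.g., np.quantile(p_values, [1/3, 2/3]).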

Putting thresholds on a signal is always somewhat arbitrary, and there is no systematic treatment of this issue in the event study literature [41]. The justification for our approach is that sentiment should be regarded in relative terms, in the context of related events. Sentiment polarity has no absolute meaning; it merely provides an ordering of events on the scale from −1 (negative) to +1 (positive). The most straightforward choice is then to distribute all the events uniformly between the three classes. Conceptually similar approaches, i.e., treating the sentiment in relative terms, were already applied to compare the sentiment leaning of network communities towards different environmental topics [39], and to compare the emotional reactions to conspiracy and science posts on Facebook [38]. Additionally, in the closely related work by Sprenger et al. [36], the authors use the percentage of positive tweets on a given day d to determine the event polarity. Since they also report an excess of positive tweets, they use the median share of positive tweets as the threshold between positive and negative events.

Event types. For a specific type of event in finance, the quarterly earnings announcement (EA), it is known that the price return of a stock jumps abnormally in the direction of the earnings [34, 35]. In our case, the Twitter data show high posting activity during the EA events, as expected. However, there are also other peaks in the Twitter activity which do not correspond to EA; we abbreviate these as non-EA events. See Fig 2 for the example of Nike.

The total number of peaks that our procedure detects in the period of the study is 260. Manual examination reveals that in the same period there are 151 EA events (obtained from http://www.zacks.com/). Our event detection procedure detects 118 of them; the remaining detected peaks are non-EA events. This means that the recall (the fraction of all EA events that were correctly detected as EA) of our peak detection procedure is 78%. In contrast, Sprenger et al. [36] detect 224 out of 672 EA events, yielding a recall of 33%. They apply a simpler peak detection procedure: a Twitter peak is defined as a one-standard-deviation increase of the tweet volume over the previous five days.

The number of detected peaks indicates that there is a large number of interesting events on Twitter which cannot be explained by earnings announcements. The impact of the EA events on price returns is already known in the literature, and our goal is to reconfirm these results. The impact of the non-EA events, on the other hand, is not known, and it is interesting to verify whether they have an impact on prices similar to that of the EA events.

Therefore, we perform the event study in two scenarios, with explicit detection of the two types of events: all events (including EA), and non-EA events only:

  • Detecting all events from the complete time interval of the data, including the EA days. In total, 260 events are detected; 118 of these are EA events.
  • Detecting non-EA events from a subset of the data. For each of the 151 EA events, where d is the event day, we first remove the interval [d − 1, d + 1], and then perform the event detection again. This results in 182 non-EA events detected (see the sketch following this list).

We report all the detected peaks, with their dates and polarity, in S1 Appendix. The EA events are in Table 1, and the non-EA events are in Table 2.
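
The exclusion step in the second scenario is easy to make concrete. The helper below is hypothetical; representing the removed days [d − 1, d + 1] as NaN is our choice, since the text does not say how the gap is handled.

```python
import numpy as np

def mask_ea_windows(tw, ea_days, half_width=1):
    """Return a copy of the daily volume series with EA windows removed.

    For each earnings-announcement day d, the interval [d-1, d+1] is
    excluded (set to NaN) before re-running the peak detection.
    """
    tw = np.asarray(tw, dtype=float).copy()
    for d in ea_days:
        lo, hi = max(0, d - half_width), min(len(tw), d + half_width + 1)
        tw[lo:hi] = np.nan
    return tw
```

When the detector is re-run on the masked series, np.median would need to become np.nanmedian, and center days that are NaN would be skipped.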

Table 2. A comparison of the inter-annotator agreement and the classifier performance.

The inter-annotator agreement is computed from the examples labeled twice. The classifier performance is estimated from the 10-fold cross-validation.

https://doi.org/10.1371/journal.pone.0138441.t002

The first scenario allows us to compare the results of the Twitter sentiment with the existing literature in financial econometrics [34]. It is worth noting, however, that the variable used there to infer the “polarity” of the events is the difference between the expected and announced earnings. The analysis of the non-EA events in the second scenario tests whether the Twitter sentiment data contain useful information about the behavior of investors for other types of events, in addition to the already well-known EA events.

Estimation of normal returns. Here we briefly explain the market model procedure for the estimation of normal returns. Our methodology follows the one presented in [34] and [55]. The market model is a statistical model which relates the return of a given stock to the return of the market portfolio. The model’s linear specification follows from the assumed joint normality of stock returns. We use the DJIA index as the market portfolio. This choice helps us avoid adding too many variables to our model and simplifies the computation of the result. The aggregated DJIA index is computed from the mean weighted prices of all the stocks in the index. For any stock i and date d, the market model is:

$$R_{i,d} = \alpha_i + \beta_i R_{DJIA,d} + \epsilon_{i,d}, \qquad (4)$$
$$E[\epsilon_{i,d}] = 0, \qquad (5)$$
$$\mathrm{var}(\epsilon_{i,d}) = \sigma^2_{\epsilon_i}, \qquad (6)$$

where $R_{i,d}$ and $R_{DJIA,d}$ are the returns of stock i and the market portfolio, respectively, and $\epsilon_{i,d}$ is the zero-mean disturbance term. $\alpha_i$, $\beta_i$, and $\sigma^2_{\epsilon_i}$ are the parameters of the market model. To estimate these parameters for a given event and stock, we use an estimation window of L = 120 days, following the suggestion in [34]. Using the notation presented in Fig 1 for the time line, the estimated normal return is $\hat{E}[R_{i,d}] = \hat{\alpha}_i + \hat{\beta}_i R_{DJIA,d}$, where $\hat{\alpha}_i$ and $\hat{\beta}_i$ are the parameters estimated by the OLS procedure [34]. The abnormal return for company i at day d is the residual: $AR_{i,d} = R_{i,d} - \hat{\alpha}_i - \hat{\beta}_i R_{DJIA,d}$.
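
A compact sketch of this estimation, under the choices just stated (market model, OLS, a 120-day estimation window preceding a 21-day event window), is given below. The helper name and the exact alignment of the windows are our assumptions.

```python
import numpy as np

def market_model_abnormal(r_stock, r_djia, event_idx, est_len=120, win=10):
    """Abnormal returns of one stock around one event under the market model.

    r_stock, r_djia: aligned 1-D arrays of daily returns R_{i,d}, R_{DJIA,d}.
    event_idx: index of the event day; the event window spans 2*win+1 days.
    """
    est = slice(event_idx - win - est_len, event_idx - win)  # estimation window
    beta, alpha = np.polyfit(r_djia[est], r_stock[est], 1)   # OLS fit
    ev = slice(event_idx - win, event_idx + win + 1)         # event window
    expected = alpha + beta * r_djia[ev]    # normal return E[R_{i,d}]
    return r_stock[ev] - expected           # abnormal returns AR_{i,d}
```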

Statistical validation. Our null hypothesis, $H_0$, is that external events have no impact on the behavior of returns (mean or variance). The distributional properties of the abnormal returns can be used to draw inferences over any period within the event window. Under $H_0$, the distribution of the sample abnormal return of a given observation in the event window is normal:

$$AR_{i,\tau} \sim \mathcal{N}(0, \sigma^2(AR_{i,\tau})). \qquad (9)$$

Eq (9) takes into account the aggregation of the abnormal returns.

The abnormal return observations must be aggregated in order to draw overall conclusions for the events of interest. The aggregation is along two dimensions: through time and across stocks. By aggregating across all the stocks [55], we get:

$$\overline{AR}_\tau = \frac{1}{N} \sum_{i=1}^{N} AR_{i,\tau},$$

where N is the total number of events. The cumulative abnormal return (CAR) from time $\tau_1$ to $\tau_2$ is the sum of the abnormal returns:

$$CAR(\tau_1, \tau_2) = \sum_{\tau=\tau_1}^{\tau_2} \overline{AR}_\tau.$$

To calculate the variance of the CAR, we assume (as shown in, e.g., [34, 55]):

$$\mathrm{var}(CAR(\tau_1, \tau_2)) = \sum_{\tau=\tau_1}^{\tau_2} \sigma^2(\overline{AR}_\tau), \qquad \sigma^2(\overline{AR}_\tau) = \frac{1}{N^2} \sum_{i=1}^{N} \sigma^2_{\epsilon_i}.$$

Finally, we introduce the test statistic

$$\theta = \frac{CAR(\tau_1, \tau_2)}{\sqrt{\mathrm{var}(CAR(\tau_1, \tau_2))}} \sim \mathcal{N}(0, 1).$$

With this quantity we can test whether the measured return is abnormal; here $\tau$ is the time index inside the event window, and $|\tau_2 - \tau_1|$ is the total length of the event window.
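
Aggregation and the test statistic follow directly from these formulas; a sketch with our own variable names is:

```python
import numpy as np

def car_test(ar, resid_var):
    """CAR trajectory and test statistic theta over the event window.

    ar: array of shape (N, T), abnormal returns for N events over the T
        days of the event window.
    resid_var: length-N array of residual variances sigma^2_eps estimated
        in the market-model estimation windows.
    """
    n, t = ar.shape
    ar_mean = ar.mean(axis=0)                    # mean AR across events
    car = np.cumsum(ar_mean)                     # CAR(tau_1, tau)
    var_ar_mean = resid_var.sum() / n**2         # var of the mean AR per day
    var_car = var_ar_mean * np.arange(1, t + 1)  # grows with window length
    theta = car / np.sqrt(var_car)               # ~ N(0, 1) under H0
    return car, theta
```

Under $H_0$, days where $|\theta|$ exceeds the standard normal critical values (1.96 at the 5% level, 2.58 at the 1% level) would be flagged as significant.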

Results

This section first presents an exhaustive evaluation of the Twitter sentiment classification model. It then shows the correlation and Granger causality results over the entire time period. Finally, it presents the statistically significant results of the event study methodology as applied to Twitter data.

Twitter sentiment classification

In machine learning, a standard approach to evaluating a classifier is cross-validation. We performed a 10-fold cross-validation on the set of 103,262 annotated tweets. The whole training set is randomly partitioned into 10 folds; one is set apart for testing, and the remaining nine are used to train the model, which is then evaluated on the test fold. The process is repeated 10 times, until each fold is used for testing exactly once. The results are computed from the 10 tests, and the means and standard deviations of the different evaluation measures are reported in Table 2.
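
The procedure is standard, so a brief sketch may help. The pipeline below is an assumption: the paper uses an SVM classifier (see below), but the exact features and implementation are not specified here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def cv_accuracy(texts, labels):
    """10-fold cross-validated accuracy for a tweet sentiment classifier.

    texts: list of tweet strings; labels: sentiment annotations in {-1, 0, +1}.
    The TF-IDF features and LinearSVC are placeholder choices, not the
    paper's exact setup.
    """
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    scores = cross_val_score(model, texts, labels, cv=10)  # one score per fold
    return scores.mean(), scores.std()
```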

Cross-validation gives an estimate of the sentiment classifier’s performance on the application data, assuming that the training set is representative of the application set. However, it does not indicate the highest performance achievable. We claim that the agreement between human experts provides an upper bound on what the best automated classifier can achieve. The inter-annotator agreement is computed from the fraction of tweets annotated twice: during the annotation process, 6,143 tweets were annotated by two different annotators. These were used to compute various agreement measures.

There are several measures to evaluate the performance of classifiers and to compute the inter-annotator agreement. We have selected the following three: Accuracy(−,0,+), Accuracy±1(−,+), and $\overline{F_1}(-,+)$. Accuracy(−,0,+) is the fraction of correctly classified examples over all three sentiment classes. This is the simplest and most common measure, but it does not take into account the ordering of the classes. On the other extreme, Accuracy±1(−,+) (a shorthand for Accuracy within one neighboring class) completely ignores the neutral class. It counts as errors only the negative sentiment examples predicted as positive, and vice versa. $\overline{F_1}(-,+)$ is the average of $F_1$ for the negative and positive classes. It does not account for the misclassification of the neutral class, which is considered less important than the extremes, i.e., negative or positive sentiment. However, the misclassification of the neutral sentiment is taken into account implicitly, as it affects the precision and recall of the extreme classes. $F_1$ is the harmonic mean of Precision and Recall for each class. Precision is the fraction of correctly predicted examples out of all the predictions of a particular class. Recall is the fraction of correctly predicted examples out of all actual members of the class. $\overline{F_1}(-,+)$ is a standard measure of performance for sentiment classifiers [56].
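
The three measures can be stated precisely in a few lines of code. This is a sketch of the definitions above; normalizing Accuracy±1 over all examples is our reading of the text.

```python
import numpy as np
from sklearn.metrics import f1_score

def evaluate(y_true, y_pred):
    """Accuracy(-,0,+), Accuracy+-1(-,+), and mean F1 of the extreme classes,
    for sentiment labels in {-1, 0, +1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc = (y_true == y_pred).mean()
    # Accuracy+-1: only negative<->positive confusions count as errors
    errors = ((y_true == -1) & (y_pred == 1)) | ((y_true == 1) & (y_pred == -1))
    acc1 = 1.0 - errors.mean()
    # average F1 over the negative and positive classes only
    f1_neg, f1_pos = f1_score(y_true, y_pred, labels=[-1, 1], average=None)
    return acc, acc1, (f1_neg + f1_pos) / 2
```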

Table 2 gives the comparison of the inter-annotator agreement and the classifier performance. The classifier reaches the inter-annotator agreement on all three measures. In closely related work, Sprenger et al. [36] use Naive Bayes for sentiment classification. Their classifier is trained on 2,500 examples, and the 10-fold cross-validation yields an Accuracy of 64.2%.

We argue that in our case there is no need to invest further work into improving the classifier. Most of the hypothetical improvements would likely be the result of overfitting the training data. We speculate that the high quality of the sentiment classifier is mainly a consequence of the large number of training examples. In our experience across diverse domains, one needs about 50,000 to 100,000 labeled examples to reach the inter-annotator agreement.

If we compare the $F_1$ measures, we observe a difference in the respective Precision and Recall. For both classes, − and +, the sentiment classifier has a considerably higher Precision, at the expense of a lower Recall. This means that tweets classified into the extreme sentiment classes (− or +) are likely indeed negative or positive (Precision about 70%), even though the classifier finds only a smaller fraction of them (Recall about 40%). This suits the purpose of this study well. Note that it is relatively easy to modify the SVM classifier, without retraining it, to narrow the space of the neutral class, thus increasing the recall of the negative and positive classes and decreasing their precision. One possible criterion for such a modification is to match the distribution of classes in the application set, as predicted by the classifier, to the actual distribution in the training set.
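
This modification can be sketched as a post-processing step on the SVM decision scores. Everything here is an assumption of the sketch, including the margin value and the use of one-vs-rest decision scores; the paper only states that such a modification is possible.

```python
import numpy as np

def predict_with_narrow_neutral(model, texts, margin=0.25):
    """Shrink the neutral region of a trained one-vs-rest SVM classifier.

    Subtracting a fixed margin from the neutral-class score before the argmax
    raises the recall of the extreme classes at the cost of their precision.
    """
    scores = model.decision_function(texts)   # shape (n_samples, n_classes)
    classes = model.classes_                  # e.g., array([-1, 0, 1])
    neutral = int(np.flatnonzero(classes == 0)[0])
    scores[:, neutral] -= margin
    return classes[scores.argmax(axis=1)]
```

The margin could, for instance, be tuned until the distribution of predicted classes on the application set matches the class distribution of the training set, which is the criterion suggested above.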

Correlation and Granger causality

Pearson correlation. Table 3 shows the computed Pearson correlations, as defined in the Methods section. The computed coefficients are small, but in line with the results of [30]. In our opinion, these findings, together with those published in [30], indicate that when the entire time period is considered, the many days with a low number of tweets weaken the measured correlation.

Table 3. Results of the Pearson correlation and Granger causality tests.

Companies are ordered as in Table 1. The arrows indicate a statistically significant Granger causality relation for a company, at the 5% significance level. A right arrow indicates that the Twitter variable (sentiment polarity $P_d$ or volume $TW_d$) Granger-causes the market variable (return $R_d$), while a left arrow indicates that the market variable Granger-causes the Twitter variable. The counts at the bottom show the total number of companies passing the Granger test.

https://doi.org/10.1371/journal.pone.0138441.t003

Granger causality. The results of the Granger causality tests are also in Table 3, for both directions: from the Twitter variables to the market variables, and vice versa. The table gives the Granger causality links per company between (a) sentiment polarity and price return, and (b) the volume of tweets and absolute price return. The conclusions that can be drawn are listed below; a code sketch of the test follows the list:

  • The polarity variable is not useful for predicting the price return, as only three companies pass the Granger test.
  • The number of tweets for a company Granger-causes the absolute price return for one third of the companies. This indicates that the amount of attention on Twitter is useful for predicting the price volatility. Previously, this was known only for an aggregated index, but not for individual stocks [ 28 , 30 ].
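
As one concrete realization, the per-company tests behind Table 3 can be run with statsmodels. This is a sketch, not the paper’s code; the lag order is our assumption, since the text does not state the lag used.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

def granger_pvalue(cause, effect, maxlag=1):
    """p-value for 'cause Granger-causes effect' at the given lag (ssr F-test).
    grangercausalitytests treats the second column as the causing variable."""
    data = pd.DataFrame({"effect": effect, "cause": cause})
    res = grangercausalitytests(data[["effect", "cause"]], maxlag=maxlag)
    return res[maxlag][0]["ssr_ftest"][1]

# e.g., does Twitter volume Granger-cause the absolute price return?
# p = granger_pvalue(tw_volume, np.abs(returns))  # significant if p < 0.05
# The Pearson part of Table 3 is scipy.stats.pearsonr(sentiment, returns).
```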
Cumulative abnormal returns

The results of the event study are shown in Figs 4 and 5, where the cumulative abnormal returns (CAR) are plotted for the two types of events defined earlier. The results are largely consistent with the existing literature on the information content of earnings [34, 35]. The evidence strongly supports the hypothesis that tweets do indeed convey information relevant for stock returns.

Fig 4. CAR for all detected events, including EA.

The x axis is the lag between the event and the CAR, and the red markers indicate days with a statistically significant abnormal return.

https://doi.org/10.1371/journal.pone.0138441.g004

Fig 4 shows the CAR for all the detected Twitter peaks, including the EA events (45% of the detected events are earnings announcements). The average CAR increases abnormally after the positive sentiment peaks and decreases after the negative sentiment peaks. This is confirmed in detail in Table 4. The values of CAR are significant at the 1% level for ten days after the positive sentiment events. Given this result, the null hypothesis that the events have no impact on price returns is rejected. The same holds for the negative sentiment events, but the CAR (actually a loss) is twice as large in absolute terms. The CAR after the neutral events is very low, and barely significant at the 5% level on two days; on the other days one cannot reject the null hypothesis. We speculate that the positive, barely significant CAR values for the neutral events are the result of the uniform distribution of the Twitter peaks into three event classes (see Fig 3). An improvement over this baseline approach remains a subject of further research.

Table 4. Values of the $\theta$ statistic for each type of event.

Significant results at the 1% level ($|\theta| > 2.58$) are denoted by **, and at the 5% level ($|\theta| > 1.96$) by *.

https://doi.org/10.1371/journal.pone.0138441.t004

A more interesting result concerns the non-EA events in Fig 5. Even after removing the earnings announcements, whose impact on price returns is already known, one can reject the null hypothesis. In this case, the average CAR of the non-EA events increases abnormally after the detected positive peaks and decreases after the negative peaks. Table 4 shows that after the event days the values of CAR remain significant at the 1% level for four days after the positive events, and for eight days after the negative events. The period of impact of Twitter sentiment on price returns is shorter when the EA events are removed, and the values of CAR are lower, but in both cases the impact is statistically significant. The CAR for the neutral events tends to be slightly negative (in contrast to the EA events), albeit not statistically significant. However, this again indicates that the distribution of Twitter peaks into the event classes could be improved.

These results are similar to the ones reported by Sprenger et al. [36]. In addition, those authors show a statistically significant increase in the CAR values even before the positive event days. They argue that this is due to information leakage before the earnings announcements. We observe a similar phenomenon, but with very low CAR values that are not statistically significant (cf. the positive events at day −1 in Fig 4).

Discussion

In this work we present significant evidence of the dependence between stock price returns and the Twitter sentiment in tweets about the companies. As a series of other papers has already shown, there is a signal connecting social media and market behavior that is worth investigating. This opens the way, if not to forecasting, then at least to “now-casting” financial markets. The caveat is that this dependence becomes useful only when the data are properly selected, or different sources of data are analyzed together. For this reason, in this paper we first identify events, marked by increased activity of Twitter users, and then observe the market behavior in the days following the events. This choice is due to our hypothesis that only at some moments, identified as events, is there a strong interaction between the financial market and Twitter sentiment. Our main result is that the aggregate Twitter sentiment during the events implies the direction of market evolution. While this can be expected for peaks related to “known” events, like earnings announcements, it is noteworthy that a similar conclusion also holds when peaks do not correspond to any expected news about the stock traded.

Similar results were corroborated in a recent, independent study by Sprenger et al. [36]. The authors made an additional step and classified the non-EA events into a comprehensive set of 16 company-specific categories. They used the same training set of 2,500 manually classified tweets to train a Naive Bayes classifier which can then discriminate reasonably well between the 16 categories. In our future work, we plan to identify topics, which are not predefined, from all the tweets of the non-EA events. We intend to apply standard topic detection algorithms, such as Latent Dirichlet Allocation (LDA), or clustering techniques.

Studies such as this one could be used to establish a direct relation between social networks and market behavior. A specific application could, for example, detect and possibly mitigate panic diffusion in the market using social network analysis. To this end, some additional research remains to be done. One possible direction is to test the forecasting power of the sentiment time series. Following an approach similar to the one presented by Moat et al. [57], one could decide to buy or sell according to the presence of a peak in the tweet volume and the level of polarity in the corresponding direction. However, the detection of Twitter events would then have to rely only on the current and past Twitter volume.

Also, during the events, we might move to a finer time scale, e.g., from daily to hourly resolution, as done in [32]. Finally, our short-term plan is to extend the analysis to a larger number of companies with high Twitter volume, and over a longer period of time.

Supporting Information

S1 Appendix. Event dates and polarity.

Detailed information about the detected events from the Twitter data and their polarity. We show the 118 detected EA events and the 182 detected non-EA events.

https://doi.org/10.1371/journal.pone.0138441.s001

(PDF)
