Charu Sharma1, Niteesh Sahni1. 1. Department of Mathematics, Shiv Nadar University, Uttar Pradesh, India.
Abstract
In this paper, we explore mutual information based stock networks to build a regular vine (R-vine) copula structure on high frequency log returns of stocks and use it to estimate the Value at Risk (VaR) of a portfolio of stocks. Our model is data driven: it learns from the high frequency time series of log returns of the top 50 stocks listed on the National Stock Exchange (NSE) in India for the year 2014. The Ljung-Box test revealed the presence of autocorrelation as well as heteroscedasticity in the underlying time series. Analysing the goodness of fit of a number of variants of the GARCH model on each working day of the year 2014, that is, 229 days in all, we observed that ARMA(1,1)-EGARCH(1,1) demonstrated the best fit. The joint probability distribution of the portfolio is computed by constructing an R-vine copula structure on the data, with the mutual information guided minimum spanning tree as the key building block. The joint PDF is then fed into a Monte-Carlo simulation procedure to compute the VaR. When mutual information is replaced by Kendall's Tau in the construction of the R-vine copula structure, the resulting VaR estimates turn out to be inferior, suggesting the presence of non-linear relationships among stock returns.
1) Introduction
Developing multivariate models and estimating joint density functions is an area of key interest amongst researchers, not only in finance but also in various other fields [1-3]. In finance, researchers have long discarded the multivariate Gaussian distribution for log returns of stocks, and hence methods for estimating the joint distribution of stock returns have always attracted a lot of interest [4]. In this paper we use copula functions to achieve the important goal of estimating the joint probability distribution of the portfolio. Sklar's theorem [5] expresses a multivariate cumulative distribution function in terms of the univariate cumulative distribution functions and a copula function. So we need to overcome two challenges: first, to identify the probability distributions of the individual stocks; and second, to devise a computationally efficient method of combining these marginal distributions with an appropriate copula to obtain the joint distribution of the portfolio. The Kolmogorov-Smirnov test [6] suggested the Student's t-distribution as a good choice for the probability distribution functions of the individual stocks. The second step was handled using the R-vine copula structure, which was originally introduced in [7-9] by extending the concept of Markov trees. Two special subclasses of R-vine copulas, namely the D-vine and C-vine copulas, were studied in [10] and have since become immensely popular in the analysis of financial data owing to their simple structure [11-14]. Working with a general R-vine structure is computationally challenging, especially in higher dimensions. The sequential algorithm due to Dißmann et al. [15] is a breakthrough in this direction and enables an efficient construction of general R-vine copula structures in higher dimensions. In [15], joint distribution functions of 16 variables are computed; in this paper we go as high as 50 variables via this algorithm.
It is relevant to point out that the construction of the R-vine structure in [15] made use of Kendall's Tau, a non-parametric measure which captures an ordinal relationship between two random variables and can also indicate a non-linear relationship between them. In [16-18], this approach has been applied successfully to a number of financial markets. Some very recent works [19-23] reveal the growing popularity of mutual information between two random variables as a quantifier of linear or non-linear relationships. Mutual information (MI) between two random variables is defined to be the relative entropy between the joint distribution and the product of the marginal distributions. A direct consequence of this definition is that the MI of independent random variables is zero. MI captures the reduction in the uncertainty of one random variable given knowledge of another random variable. In particular, Sharma and Habib [23], in the context of high frequency data, have demonstrated that mutual information based methods capture the non-linear relationship between log returns better than Spearman correlation based methods. This observation motivates a key part of the present paper, which deals with the computation of the joint density function of log returns of stocks using a mutual information based R-vine copula structure. Our analysis begins with the removal of autocorrelation and heteroscedasticity from the time series of log returns using GARCH models. A number of popular GARCH models were fitted, and the best of the lot turned out to be ARMA(1,1)-EGARCH(1,1). Next, the R-vine copula structure was constructed using the error (residual) terms of the ARMA(1,1)-EGARCH(1,1) model. This approach is similar to the one taken in [18], in which daily returns of 96 stocks listed on the S&P index were analysed. The above R-vine structure is then used to estimate the Value at Risk (VaR) of portfolios through Monte-Carlo simulation.
Recently, multivariate copula based models for the estimation of VaR have been proposed in [24], and a Kendall's Tau based vine copula model for estimating VaR is presented in [18]. For earlier models focused on VaR estimation, the reader is referred to [25, 26]. However, none of these models employ mutual information. We have considered 5% and 10% VaRs for portfolios consisting of 5, 10, 25 and 50 stocks in our analysis. The remaining part of the paper is divided into four sections. In section 2, we give a brief description of the data used in our analysis. In section 3, we give an overview of the methods and methodology used. In section 4, we compare the effectiveness of VaR estimation based on the Kendall's Tau method and the MI method. In the last section, we summarize our observations and findings.
2) Data description
The high frequency data analysed in the present paper is an instant-by-instant record of the prices and volumes of all the stocks listed on the CNX100 index of the National Stock Exchange (NSE) for each working day of the year 2014. The working hours of the NSE are from 9:00AM till 4:00PM. We divide this duration into time intervals of length 30 seconds and call each such interval a tick. The interval length of 30 seconds ensures sufficiently many data points for the fitted models to have small bias. We chose to ignore the data of the first and the last half hour (that is, 9-9:30AM and 3:30-4PM) due to some ambiguity and incompleteness in the recorded data. Thus the total number of ticks considered for each working day is 720. In general, corresponding to any tick t, that is, in the t-th 30-second interval, there will be several transactions for various stocks. Let V(i,k) be the volume of the stock k traded at an instant i (within the duration corresponding to the tick t) and P(i,k) be the price of stock k at the instant i. We now define the volume weighted average price S(t,k) for the tick t by

S(t,k) = Σ_i P(i,k) V(i,k) / Σ_i V(i,k)    (1)
Here the summation runs over all possible instants within the 30-second duration corresponding to the tick t. The log return of each stock k at tick t is given by

R(t,k) = ln S(t,k) − ln S(t−1,k)    (2)
In our data we encountered 30-second intervals in which zero trade was recorded. This would make formula (1) indeterminate for those ticks. To overcome this issue, the most recent value of S for each stock k was carried forward to these ticks. We include only the 50 stocks in our analysis that had either no gap intervals or very few gap intervals; in other words, these stocks are highly traded in the market. Also, 2014 was the year when General Elections were held in India and a change in government was seen after 10 years. One may expect high volatility during election times, and we wanted to study how our model is impacted during the election, pre-election and post-election periods. Thus our discrete time series data was studied over three periods: (a) pre-election period: Jan-Feb 2014, (b) election period: Mar-May 2014, (c) post-election period: Jun-Dec 2014.
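As an illustration, the tick construction of Eqs (1) and (2), together with the carry-forward rule for zero-trade ticks, can be sketched as follows (the function names and the per-tick array layout are ours, not the paper's):

```python
import numpy as np

def vwap_series(prices, volumes):
    """Volume weighted average price S(t,k) per tick (Eq 1).

    prices, volumes: lists of per-tick arrays of trade prices and volumes
    for one stock; an empty array marks a tick with no trades.
    A tick with zero trade reuses the most recent available VWAP.
    """
    S = np.empty(len(prices))
    last = np.nan
    for t, (p, v) in enumerate(zip(prices, volumes)):
        if len(v) > 0 and v.sum() > 0:
            last = np.dot(p, v) / v.sum()
        S[t] = last  # carry forward the most recent VWAP for gap ticks
    return S

def log_returns(S):
    """Log return R(t,k) = ln S(t,k) - ln S(t-1,k) (Eq 2)."""
    return np.diff(np.log(S))
```

A gap tick simply inherits the previous VWAP, so its log return is zero, which matches the paper's treatment of intervals with no recorded trade.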
3) Methods and methodology
3.1 Pair copula construction
Before we explain the construction of the R-vine structure, it is important to have a clear understanding of a copula, so we first recall some preliminaries. For any natural number n, let I^n denote the unit cube [0,1]^n in the extended n-dimensional real space. The elements of this space are n-tuples of extended real numbers a = (a1, …, an). For any two such tuples a and b, we shall write a ≤ b whenever ai ≤ bi for all i. Now for any a ≤ b, the Cartesian product of closed intervals, B = [a1,b1] × … × [an,bn], is called an n-box and will be denoted by [a,b]. The set of vertices, V, of B is the collection of all n-tuples (c1, …, cn) for which each ci = ai or bi. Let H be a real valued function with domain of the form S1 × … × Sn, where each Si is a subset of the extended real numbers. The H-volume of B is defined to be the sum

V_H(B) = Σ_{c ∈ V} sgn(c) H(c)
Here sgn(c) takes the value +1 if ci = ai for an even number of indices i, and −1 otherwise. Note that the above summation is finite since the total number of vertices is finite. The reader is referred to [27] for other equivalent forms of V_H(B). An n-dimensional copula is a function C: I^n → I satisfying the following axioms:
(i) C(u) = 0 if there exists an i such that ui = 0;
(ii) C(u) = uk if ui = 1 for all i ≠ k;
(iii) V_C([a,b]) ≥ 0 for any n-box [a,b] with a, b ∈ I^n.
A real valued function F defined on the extended n-dimensional real space is called an n-dimensional distribution function if
(i) V_F(B) ≥ 0 for all n-boxes B with vertices in the extended space;
(ii) F(u) = 0 whenever ui = −∞ for some i;
(iii) F(u) = 1 whenever ui = ∞ for all i.
It has been established in [27] that an n-dimensional distribution function F has one dimensional marginal distribution functions F1, F2, …, Fn. The famous Sklar's theorem guarantees that there exists an n-dimensional copula function C such that F(x1, x2, …, xn) = C(F1(x1), F2(x2), …, Fn(xn)). However, we are more interested in the converse, which states that for a given n-dimensional copula function C and univariate distribution functions F1, F2, …, Fn, the formula F(x1, x2, …, xn) = C(F1(x1), F2(x2), …, Fn(xn)) defines an n-dimensional distribution function whose marginals are F1, F2, …, Fn. Equivalently, the joint density function is f(x1, x2, …, xn) = f1(x1) f2(x2) … fn(xn) c(F1(x1), F2(x2), …, Fn(xn)), where c is the nth order mixed partial derivative of C. Thus, if we wish to study the joint behaviour of n random variables, we can first fit the marginal distribution function of each random variable separately and then combine them through an appropriate multivariate copula. The process of constructing a multivariate copula that we adopt is the pair-wise copula construction (PCC), which relies on vine copulas (or pair copulas) introduced in [7]. At the heart of this process lies the fact that the joint copula function is broken down into a product of bivariate copula functions that can be estimated independently.
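The converse of Sklar's theorem can be illustrated numerically. The sketch below assumes a bivariate Gaussian copula with Student's t marginals; this is only one possible pairing chosen for illustration, not the specific pair copula families fitted in the paper:

```python
from scipy.stats import norm, t, multivariate_normal

def gaussian_copula_density(u, v, rho):
    """Gaussian copula density c(u,v) = phi2(z1,z2;rho) / (phi(z1) phi(z2)),
    where z = Phi^{-1}(u) and phi2 is the bivariate normal density."""
    z1, z2 = norm.ppf(u), norm.ppf(v)
    num = multivariate_normal.pdf([z1, z2], mean=[0.0, 0.0],
                                  cov=[[1.0, rho], [rho, 1.0]])
    return num / (norm.pdf(z1) * norm.pdf(z2))

def joint_density(x1, x2, rho, df=5):
    """Sklar's converse: f(x1,x2) = f1(x1) f2(x2) c(F1(x1), F2(x2)),
    here with Student's t marginals (df degrees of freedom)."""
    f1, f2 = t.pdf(x1, df), t.pdf(x2, df)
    return f1 * f2 * gaussian_copula_density(t.cdf(x1, df), t.cdf(x2, df), rho)
```

With rho = 0 the copula density is identically 1, so the joint density factorizes into the product of the marginals, exactly as independence requires.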
Thus, bivariate copulas are the building blocks of the PCC method. An R-vine on n variables, as introduced by Bedford and Cooke [9], is a finite sequence of trees Tj = (Nj, Ej), j = 1, 2, …, n−1, with node sets Nj and edge sets Ej satisfying the conditions:
(i) the tree T1 has nodes N1 = {1, 2, …, n};
(ii) the trees Tj are connected, with Nj = Ej−1 for j ≥ 2, and the cardinality of Nj is n−j+1 for each j = 1, 2, …, n−1;
(iii) if a = {a1, a2} and b = {b1, b2} are two elements of Nj (2 ≤ j ≤ n−1), then {a,b} ∈ Ej provided that the cardinality of a ∩ b is exactly one.
The last axiom says that two nodes are joined by an edge only when these nodes, interpreted as edges of the preceding tree, have exactly one node of the preceding tree in common. Bedford and Cooke [9] follow a convenient way of enumerating the nodes of trees in an R-vine structure in terms of conditioned and conditioning sets. For further details and illustrative examples the reader may refer to [9, 15]. We make use of the same enumeration strategy to write down the probability density function corresponding to the distribution realized by the R-vine copula structure for the portfolio of stocks. In order to construct an R-vine structure on stocks, we start with a tree T1 whose n nodes (N1) represent the stocks, with edge set E1. In our analysis, T1 is the minimum spanning tree of the stock network based on either the mutual information metric (Eq 10) or the Kendall's Tau based metric (Eq 11). An edge e in E1 is represented by a bivariate copula C{s(e),t(e)}, where s(e) and t(e) are the nodes connected by the edge e. Then we move on to the next tree, T2, with node set N2 equal to the edge set E1. Each node in T2 is thus represented by some C{s(e),t(e)}, and an edge in E2 is represented by a conditional copula C{s(e),t(e)|D(e)}, where D(e) is the common node. Similarly we keep on building the trees T3, T4, …, Tn−1. Once we have constructed an R-vine structure on n stocks with random variables X1, X2, …, Xn, their joint density function, with marginal density functions f1, f2, …, fn, is given by
f(x1, …, xn) = Π_{k=1}^{n} f_k(x_k) × Π_{j=1}^{n−1} Π_{e ∈ Ej} c_{s(e),t(e)|D(e)}( F(x_{s(e)} | x_{D(e)}), F(x_{t(e)} | x_{D(e)}) )    (4)

where F(x_{s(e)} | x_{D(e)}) is the distribution function of the conditional random variable X_{s(e)} | X_{D(e)}, and c_{s(e),t(e)|D(e)} is the second order mixed partial derivative of the copula connecting X_{s(e)} | X_{D(e)} and X_{t(e)} | X_{D(e)}. For example, the joint density function f of three random variables can be decomposed as f = f1 f(2|1) f(3|12), where f(2|1) denotes a conditional density function. We can further decompose the conditional density f(2|1) as

f(2|1) = f12 / f1 = c12(F1, F2) f2    (5)

where f12 is the joint density of variables 1 and 2, and c12 is the second order mixed partial derivative of the copula C12 connecting variables 1 and 2. Similarly, we have

f(3|12) = c(23|1) f(3|1) = c(23|1) c13(F1, F3) f3    (6)

Thus, using Eqs (5) and (6), the joint density function of the 3 variables can be decomposed as f = f1 · f2 c12 · f(3|1) c(23|1) = f1 f2 f3 c12 c13 c(23|1). The analogous R-vine copula is given in Fig 1.
Fig 1
R-vine copula with 3 stocks.
T1, T2, T3 correspond to trees 1, 2 and 3 respectively.
For fast execution of statistical methods such as maximum likelihood estimation, Morales-Nápoles et al. [28] proposed an efficient scheme for storing an R-vine on n variables as an n × n lower triangular matrix M = (m_{i,j}). The matrix M has interesting properties: each column has distinct elements, and deleting the first row and first column of M yields an (n−1)-dimensional R-vine matrix. The decomposition in Eq (4) can now be expressed in terms of the R-vine matrix:
Note that the above equation is in terms of a bivariate copula function. An efficient algorithm for computing the conditional distributions appearing as arguments of this copula function has been proposed in [15].
3.2 Mutual information and Kendall’s Tau based metrics
Mutual information (MI) between two random variables captures the mutual dependence between them and is zero if and only if they are independent. The MI of two random variables equals the difference between the sum of their individual entropies and their joint entropy. The mutual information of discrete random variables X and Y is defined as

I(X;Y) = Σ_x Σ_y p(x,y) log [ p(x,y) / (p(x) p(y)) ]

where p(x,y) is the joint probability mass function and p(x), p(y) are the marginals. The generalization to the continuous case is

I(X;Y) = ∫∫ f(x,y) log [ f(x,y) / (f(x) f(y)) ] dx dy

where f denotes the corresponding joint and marginal density functions.
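The discrete definition can be computed directly from a joint probability table; the sketch below (our own helper, in nats) makes the independence property explicit:

```python
import numpy as np

def mutual_information(pxy):
    """MI of discrete X, Y from their joint probability table pxy:
    I(X;Y) = sum_{x,y} p(x,y) log[ p(x,y) / (p(x) p(y)) ], in nats."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)   # marginal of X (column vector)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of Y (row vector)
    mask = pxy > 0                        # 0 log 0 = 0 by convention
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())
```

For an independent pair the table factorizes as p(x,y) = p(x)p(y), every log term vanishes, and the MI is exactly zero; for two perfectly coupled binary variables it equals log 2, their common entropy.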
Based on mutual information, the normalized distance [23] between two random variables X and Y is defined as

d(X,Y) = 1 − I(X;Y) / H(X,Y)    (10)

where I is the mutual information and H is the joint entropy. Based on this metric, we can construct a minimum spanning tree (MST) network on the n stocks. There are two well-known methods for constructing a minimum spanning tree: Kruskal's algorithm and Prim's algorithm. We used Prim's algorithm for the construction of the stock networks, since stock networks are dense and Prim's algorithm works well in such cases. We also considered building stock networks based on the Kendall's Tau quantifier. The metric used is
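The distance of Eq (10) and the Prim step can be sketched as follows; the implementation below is a generic dense-graph Prim's algorithm, not the paper's code:

```python
import numpy as np

def mi_distance(I, H):
    """Normalized MI distance d(X,Y) = 1 - I(X;Y)/H(X,Y) (Eq 10)."""
    return 1.0 - I / H

def prim_mst(dist):
    """Prim's algorithm on a dense n x n distance matrix; returns the
    n-1 MST edges as (parent, child) pairs. Prim's method suits the
    dense stock networks used here."""
    n = dist.shape[0]
    visited = np.zeros(n, dtype=bool)
    visited[0] = True                  # grow the tree from node 0
    best = dist[0].copy()              # cheapest known edge into the tree
    parent = np.zeros(n, dtype=int)
    edges = []
    for _ in range(n - 1):
        masked = np.where(visited, np.inf, best)
        j = int(np.argmin(masked))     # nearest node outside the tree
        edges.append((int(parent[j]), j))
        visited[j] = True
        closer = dist[j] < best        # does j offer a cheaper connection?
        parent[closer] = j
        best = np.minimum(best, dist[j])
    return edges
```

For independent stocks I = 0, so d = 1, the maximum distance; strongly dependent stocks sit close together and end up adjacent in the spanning tree, which is exactly what makes the MST a natural first tree T1 for the R-vine.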
where τ is the Kendall's Tau coefficient between X and Y (Eq 11). Sharma and Habib [23] studied MI based stock networks and showed the existence of nonlinearity in stock returns data at the high frequency level.
3.3 Fitting univariate models to log returns of stocks
A stochastic process R1, R2, …, Rn is a white noise process with mean μ and variance σ2 if E(Rt) = μ for all t, Var(Rt) = σ2 for all t, and Cov(Rt, Rs) = 0 for all t ≠ s. In order to check whether the log returns of stocks exhibit the properties of white noise, we carried out the Ljung-Box test [29] for autocorrelation and heteroscedasticity in the log returns at the 1% level of significance.
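The Ljung-Box statistic itself is simple to state: Q = n(n+2) Σ_{k=1}^{h} ρ̂_k² / (n−k), compared against a chi-squared distribution with h degrees of freedom. A minimal sketch (our own implementation, not the paper's software):

```python
import numpy as np
from scipy.stats import chi2

def ljung_box(x, lags=1):
    """Ljung-Box Q statistic and p-value for H0: no autocorrelation
    up to the given number of lags (the paper tests at lag 1).
    Applying the test to x**2 probes for heteroscedasticity."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    denom = np.dot(x, x)
    Q = 0.0
    for k in range(1, lags + 1):
        rho_k = np.dot(x[:-k], x[k:]) / denom   # sample autocorrelation
        Q += rho_k ** 2 / (n - k)
    Q *= n * (n + 2)
    return Q, float(chi2.sf(Q, df=lags))
```

A small p-value rejects the white noise hypothesis, which is what Figs 2A and 2B report stock by stock and day by day.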
We carried out hypothesis testing for each day and each stock on the log returns and the squares of log returns; Fig 2A corresponds to log returns and Fig 2B to squares of log returns. The presence of autocorrelation and heteroscedasticity can be seen at a lag of 1. Thus, GARCH methods are applied to our data with the aim of removing the autocorrelation and heteroscedasticity in the time series. We tested the GARCH(1,1), ARMA(1,1)-GARCH(1,1) and ARMA(1,1)-EGARCH(1,1) models, with the error modelled by a Student's t-distribution. A process R_t is called an ARCH(p) process if R_t = μ + σ_t ε_t, where ε_t is a white noise and σ_t is the conditional standard deviation of R_t given the past values R_{t−1}, …, R_{t−p}. Note that an ARCH(p) process has constant mean and constant unconditional variance, but its conditional variance is not constant. The GARCH(p,q) model, on the other hand, tries to improve on some of the deficiencies of the ARCH(p) model by expressing σ_t in terms of the past values σ_{t−1}, …, σ_{t−q} in addition to the past values R_{t−1}, …, R_{t−p}. Specifically, we have

σ_t² = ω + Σ_{i=1}^{p} α_i (R_{t−i} − μ)² + Σ_{j=1}^{q} β_j σ_{t−j}²

For further details, the reader can refer to [30, 31].
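The GARCH(1,1) case of the recursion above can be sketched directly; this is the standard Bollerslev parametrization for illustration, not the paper's fitted equations:

```python
import numpy as np

def garch_variance(r, mu, omega, alpha, beta):
    """GARCH(1,1) conditional variance recursion:
    sigma_t^2 = omega + alpha * (r_{t-1} - mu)^2 + beta * sigma_{t-1}^2,
    initialized at the unconditional variance omega / (1 - alpha - beta)."""
    sigma2 = np.empty(len(r))
    sigma2[0] = omega / (1.0 - alpha - beta)
    for t in range(1, len(r)):
        sigma2[t] = omega + alpha * (r[t - 1] - mu) ** 2 + beta * sigma2[t - 1]
    return sigma2
```

A large squared shock at t−1 feeds straight into the variance at t, which is how the model soaks up the volatility clustering that the Ljung-Box test detects in the squared returns.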
Fig 2
Ljung-Box test on log returns and squares of log returns.
On the horizontal axis we list the 50 stocks and on the vertical axis the working days of the year 2014. Black and white represent that the null hypothesis (H0: the data are independent; H1: the data exhibit serial correlation) is rejected or accepted, respectively. (a) Test applied to log returns. (b) Test applied to squares of log returns.
We also tried fitting a Normal Inverse Gaussian (NIG) distribution to the error terms. We used the Kolmogorov-Smirnov test to check the goodness of fit of a univariate distribution on the errors. Both the NIG and the Student's t distribution turned out to be better choices than the normal distribution. Due to computational simplicity, we used the Student's t distribution in our model. In Fig 3, we summarize the p-values corresponding to the test applied to the error terms obtained after fitting the ARMA(1,1)-EGARCH(1,1) model for each of the 50 stocks, computed daily.
Fig 3
Kolmogorov Smirnov test for testing t-distribution for error terms from ARMA(1,1)-EGARCH(1,1).
On the horizontal axis we list the 50 stocks and on the vertical axis the working days of the year 2014. Black and white represent that the null hypothesis (H0: the data follow a t-distribution) is rejected or accepted, respectively. (a) Test at 5% level of significance. (b) Test at 1% level of significance.
In all the equations given below, R(t,k) is as defined in Eq (2). The GARCH(1,1) model [31] for the kth stock is given by
where we fit a Student's t-distribution to the noise ϵ(t,k). The ARMA(1,1)-GARCH(1,1) model [31] for the kth stock is given by
where we fit a Student's t-distribution to the noise ϵ(t,k). The ARMA(1,1)-EGARCH(1,1) model [31] for the kth stock is given by
where we fit a Student's t-distribution to the noise ϵ(t,k). In all three models, we tested whether the noise term ϵ exhibits the properties of a white noise by again running Ljung-Box tests at the 1% level of significance. Figs 4 and 5 correspond to the results obtained from running the Ljung-Box test on ϵ and ϵ2 respectively. Clearly, ARMA(1,1)-EGARCH(1,1) proves to be a better fitting model than the other two. We also used AIC values to compare the three methods: 94.84% of the time, ARMA(1,1)-EGARCH(1,1) had the lowest AIC value, so it again emerged as the better fit. We used the adjusted Pearson chi-squared goodness of fit test [32] to check the effectiveness of the univariate model for each stock on each working day at the 5% and 1% levels of significance. Fig 6 shows whether the null hypothesis, H0: ARMA(1,1)-EGARCH(1,1) is a good fit, was rejected (black) or accepted (white) for each stock on each working day. 32 stocks out of 50 passed the test more than 90% of the time, i.e. the null hypothesis was not rejected at the 1% level of significance more than 90% of the time. Moreover, all the stocks showed a good fit more than 72% of the time. Thus, we conclude that ARMA(1,1)-EGARCH(1,1) is a good fit.
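The volatility recursion that gives EGARCH its edge can be sketched in Nelson's standard form. This is an illustrative sketch only: it omits the ARMA(1,1) mean equation, and it uses E|z| = sqrt(2/π) for a standard normal shock, whereas the paper's fitted model (Eqs 16-17) uses a Student's t innovation for which this constant differs:

```python
import numpy as np

def egarch_logvar(z, omega, alpha, gamma, beta):
    """EGARCH(1,1) log-variance recursion (Nelson's form):
    ln sigma_t^2 = omega + beta * ln sigma_{t-1}^2
                   + alpha * (|z_{t-1}| - E|z|) + gamma * z_{t-1},
    where z are standardized residuals. The gamma * z leverage term
    lets negative shocks raise volatility more than positive ones
    (for gamma < 0), an asymmetry plain GARCH cannot express."""
    Eabs = np.sqrt(2.0 / np.pi)          # E|z| for a standard normal z
    logs2 = np.empty(len(z))
    logs2[0] = omega / (1.0 - beta)      # unconditional log-variance level
    for t in range(1, len(z)):
        logs2[t] = (omega + beta * logs2[t - 1]
                    + alpha * (abs(z[t - 1]) - Eabs) + gamma * z[t - 1])
    return logs2
```

Because the recursion is written on ln σ², the variance stays positive with no sign constraints on the coefficients, another practical advantage over GARCH(1,1).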
Fig 4
Ljung-Box test on errors.
On the horizontal axis we list the 50 stocks and on the vertical axis the working days of the year 2014. Black and white represent that the null hypothesis (H0: the data are independent; H1: the data exhibit serial correlation) is rejected or accepted, respectively. (a), (b) and (c) correspond to the test applied to the error terms (ϵ) in the GARCH(1,1), ARMA(1,1)-GARCH(1,1) and ARMA(1,1)-EGARCH(1,1) models respectively.
Fig 5
Ljung-Box test on squares of errors.
On the horizontal axis we list the 50 stocks and on the vertical axis the working days of the year 2014. Black and white represent that the null hypothesis (H0: the data are independent; H1: the data exhibit serial correlation) is rejected or accepted, respectively. (a), (b) and (c) correspond to the test applied to the squares of the error terms (ϵ2) in the GARCH(1,1), ARMA(1,1)-GARCH(1,1) and ARMA(1,1)-EGARCH(1,1) models respectively.
Fig 6
Goodness of fit test for ARMA(1,1)-EGARCH(1,1) model.
On the horizontal axis we list the 50 stocks and on the vertical axis the working days of the year 2014. Black and white represent that the null hypothesis (H0: ARMA(1,1)-EGARCH(1,1) is a good fit) is rejected or accepted, respectively. (a) Test at 5% level of significance. (b) Test at 1% level of significance.
3.4 Value at risk (VaR) prediction
Value at risk (VaR) of a portfolio is a measure of the risk associated with it. For example, if a portfolio has a one-tick 5% VaR of amount x, then there is a 5% chance that the portfolio loses value by an amount x over the time duration of one tick in the absence of trading. It is well known that the α% VaR of a portfolio is given by the α-percentile of the log returns of the portfolio [25]. Once the joint distribution function of the n stocks is known, we can use a Monte-Carlo simulation to estimate the VaR of the underlying portfolio. In this paper we have drawn inferences by calculating the VaR for equally weighted portfolios of 5, 10, 25 and 50 stocks. Consider a portfolio P consisting of n stocks, with S(t,k) and R(t,k) as defined in Eqs (1) and (2). Let wk be the weight associated with the kth stock in the portfolio and SP(t) be the value of the portfolio corresponding to the tick t. Then the log return of the portfolio, R(t,P), over the time interval [t, t+1] is given by
Using the identities e^x ≈ 1 + x and ln(1 + x) ≈ x for small x in the above equation, we get

R(t,P) ≈ Σ_{k=1}^{n} wk R(t,k)    (19)
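The quality of this first order approximation at tick-level return magnitudes is easy to check numerically (the two helper functions below are ours, for illustration; weights are value fractions summing to 1):

```python
import numpy as np

def portfolio_log_return_exact(r, w):
    """Exact log return of a portfolio whose current value is split
    across stocks in fractions w (summing to 1): ln(sum_k w_k e^{r_k})."""
    return float(np.log(np.dot(w, np.exp(r))))

def portfolio_log_return_approx(r, w):
    """First order approximation (Eq 19): R_P ~ sum_k w_k R_k, obtained
    via e^x ~ 1 + x and ln(1 + x) ~ x for small x."""
    return float(np.dot(w, r))
```

At 30-second ticks, per-stock log returns are of order 10^-3 or smaller, so the error of Eq (19), which is quadratic in the returns, is negligible compared to the VaR quantiles being estimated.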
We first use ARMA(1,1)-EGARCH(1,1) to model the univariate log returns of each stock and then use the R-vine copula construction on the error terms ϵ(t,k) to estimate their joint distribution. We fit the model on the first 4 hours of each day and use it to predict the VaR for the next 2 hours. We summarize the algorithm as follows:
1. Consider the log returns of each stock for the first 4 hours, i.e. 9:30AM to 1:30PM (this gives 480 terms in each time series), on each day.
2. Fit an ARMA(1,1)-EGARCH(1,1) model to the log returns of each stock obtained in step 1, with a univariate Student's t-distribution assumed on the error term ϵ(t,k) of each stock k. So if there are n stocks in the portfolio, the data generated at this step can be written conveniently as a 480 × n matrix (ϵ(t,k)).
3. Fit an R-vine copula structure to the random variables ϵ1, ϵ2, …, ϵn (sampled at 480 ticks in step 2) to obtain the joint distribution of the error terms. In the R-vine algorithm we choose the first tree T1 as the minimum spanning tree based on the Kendall's Tau metric (Eq 11) and also on the MI based metric (Eq 10). In this paper we fitted the R-vine structure on n = 50 stocks.
4. Using the joint distribution obtained in step 3, employ Monte-Carlo simulation to generate a large number of values (say N = 5000) of (ϵ(481,1), ϵ(481,2), …, ϵ(481,n)) simultaneously and substitute these into Eqs 16 and 17 to estimate the corresponding values of (R(481,1), R(481,2), …, R(481,n)). For each of the N tuples obtained, compute the portfolio log return R(481,P) using Eq 19. In our analysis, we have worked with equally weighted portfolios of 5, 10, 25 and 50 stocks respectively.
5. The α% VaR for the 481st tick, VaR(481,P), is now calculated by finding the α-percentile of the N simulated values of R(481,P). Here P is a portfolio whose size is chosen to be 5, 10, 25 or 50 stocks. In this paper we have considered α = 5% and 10%.
6. Compare the actual R(481,P) with the estimated VaR(481,P).
7. Once the actual R(481,k) is known, we can use Eq (16) to calculate the actual ϵ(481,k), and then Eq (17) to calculate σ(482,k). We then repeat steps 4, 5 and 6 to predict (R(482,1), R(482,2), …, R(482,n)) and compare the actual R(482,P) with the estimated VaR(482,P). In this way we calculate the estimated VaR(i,P) for i = 483, …, 719 and compare these values with the respective actual R(i,P). Note that the model is fitted only once a day.
8. Repeat steps 1 to 7 for all working days in the year 2014.
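The VaR extraction in step 5 reduces to a percentile of the simulated portfolio returns. A minimal sketch (the standard-normal draws below are a stand-in for the copula-driven simulation, used only to exercise the function):

```python
import numpy as np

def estimate_var(simulated_portfolio_returns, alpha):
    """alpha% VaR as the alpha-percentile of the N simulated portfolio
    log returns (step 5); a breach occurs on a tick whose realized
    return falls below this quantile."""
    return float(np.percentile(simulated_portfolio_returns, alpha))

rng = np.random.default_rng(42)
sim = rng.standard_normal(5000) * 0.002   # stand-in for N draws of R(481,P)
var5 = estimate_var(sim, 5)               # 5% VaR for the next tick
var10 = estimate_var(sim, 10)             # 10% VaR (less extreme quantile)
```

Because both VaR levels come from the same simulated sample, the 5% VaR always lies below the 10% VaR, deeper in the loss tail.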
4) Discussion
Data for each day was divided into two subsets: training data from 9:30AM to 1:30PM and testing data from 1:31PM to 3:30PM. We fit both the Kendall's Tau based and the MI based vine copula structures on the training data, as discussed in the previous section. We then estimated the VaRs corresponding to equally weighted portfolios for each time tick of the testing data. In our analysis we have considered portfolios consisting of all 50 stocks, and of randomly picked sets of 25, 10 and 5 stocks, and we have considered 5% and 10% VaRs in all cases. To check the effectiveness of our model we carried out the unconditional (UC) and conditional (CC) coverage tests formulated by Christoffersen [33].
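The unconditional coverage test is a likelihood ratio test on the breach frequency; the sketch below implements only this UC part (the CC test additionally checks independence of consecutive breaches via a first-order Markov chain, which is not reproduced here):

```python
import numpy as np
from scipy.stats import chi2

def uc_test(breaches, p):
    """Unconditional coverage (Kupiec) test: H0 says the fraction of
    VaR breaches equals the nominal level p. `breaches` is a 0/1 array,
    1 meaning the realized return fell below the estimated VaR."""
    n, x = len(breaches), int(np.sum(breaches))
    pi = x / n                        # observed breach frequency

    def loglik(q):
        # Bernoulli(q) log-likelihood with x ones; 0 * log 0 = 0
        return (x * np.log(q) if x else 0.0) + \
               ((n - x) * np.log(1 - q) if x < n else 0.0)

    LR = -2.0 * (loglik(p) - loglik(pi))
    return LR, float(chi2.sf(LR, df=1))
```

A day "passes" when the p-value exceeds the chosen significance level (0.01 or 0.05 in Tables 1-3); the success rate is the fraction of days that pass.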
There are 32, 39 and 113 days in the pre-election, election and post-election periods respectively for which our proposed model was a good fit. We carried out the hypothesis testing for each such day and calculated the percentage of times the null hypothesis was not rejected; we refer to this percentage as the success rate of the model. Tables 1-3 summarize the results obtained in the pre-election, election and post-election periods.
Table 1
Unconditional (UC) and conditional (CC) coverage tests at 1% and 5% level of significance: Pre-election period.
| Pre-election Period | 5 stocks 10% VaR | 10 stocks 10% VaR | 25 stocks 10% VaR | 50 stocks 10% VaR | 5 stocks 5% VaR | 10 stocks 5% VaR | 25 stocks 5% VaR | 50 stocks 5% VaR |
|---|---|---|---|---|---|---|---|---|
| UC p-value > 0.01 (Kendall's Tau method) | 87.50% | 84.38% | 78.13% | 87.50% | 87.50% | 90.63% | 84.38% | 84.38% |
| UC p-value > 0.01 (MI method) | 87.50% | 87.50% | 81.25% | 87.50% | 87.50% | 90.63% | 84.38% | 87.50% |
| UC p-value > 0.05 (Kendall's Tau method) | 75.00% | 84.38% | 71.88% | 75.00% | 81.25% | 84.38% | 75.00% | 78.13% |
| UC p-value > 0.05 (MI method) | 75.00% | 84.38% | 71.88% | 75.00% | 81.25% | 84.38% | 78.13% | 78.13% |
| CC p-value > 0.01 (Kendall's Tau method) | 90.63% | 78.13% | 62.50% | 43.75% | 87.50% | 81.25% | 78.13% | 68.75% |
| CC p-value > 0.01 (MI method) | 90.63% | 78.13% | 65.63% | 43.75% | 87.50% | 81.25% | 78.13% | 71.88% |
| CC p-value > 0.05 (Kendall's Tau method) | 71.88% | 71.88% | 37.50% | 28.13% | 84.38% | 78.13% | 65.63% | 62.50% |
| CC p-value > 0.05 (MI method) | 71.88% | 71.88% | 43.75% | 21.88% | 84.38% | 78.13% | 65.63% | 59.38% |
Table 3
Unconditional (UC) and conditional (CC) coverage tests at 1% and 5% level of significance: Post-election period.
| Post-election Period | 5 stocks 10% VaR | 10 stocks 10% VaR | 25 stocks 10% VaR | 50 stocks 10% VaR | 5 stocks 5% VaR | 10 stocks 5% VaR | 25 stocks 5% VaR | 50 stocks 5% VaR |
|---|---|---|---|---|---|---|---|---|
| UC p-value > 0.01 (Kendall's Tau method) | 91.15% | 88.50% | 81.42% | 81.42% | 90.27% | 89.38% | 83.19% | 81.42% |
| UC p-value > 0.01 (MI method) | 91.15% | 88.50% | 83.19% | 81.42% | 91.15% | 89.38% | 85.84% | 83.19% |
| UC p-value > 0.05 (Kendall's Tau method) | 78.76% | 81.42% | 74.34% | 69.91% | 82.30% | 81.42% | 74.34% | 74.34% |
| UC p-value > 0.05 (MI method) | 80.53% | 82.30% | 74.34% | 72.57% | 84.07% | 82.30% | 74.34% | 75.22% |
| CC p-value > 0.01 (Kendall's Tau method) | 88.50% | 82.30% | 69.03% | 54.87% | 89.38% | 85.84% | 70.80% | 66.37% |
| CC p-value > 0.01 (MI method) | 88.50% | 84.07% | 68.14% | 55.75% | 89.38% | 86.73% | 70.80% | 69.03% |
| CC p-value > 0.05 (Kendall's Tau method) | 76.11% | 73.45% | 49.56% | 37.17% | 82.30% | 77.88% | 57.52% | 53.98% |
| CC p-value > 0.05 (MI method) | 76.99% | 73.45% | 49.56% | 36.28% | 84.07% | 78.76% | 61.06% | 56.64% |
It was observed that the VaR predictions were more accurate for portfolios consisting of a small number of stocks, like 5 or 10, than for portfolios consisting of a large number of stocks, like 25 or 50. Also, the success rate of the MI based model was markedly better than that of the Kendall's Tau based model: the MI based model's success rate was higher 41 out of 96 times (42.71%), compared with only 6 out of 96 times (6.25%) when the Kendall's Tau based model's success rate was higher. On the remaining 49 out of 96 occasions (51.04%), the success rates of the two methods were at par. One can also observe that even during the election period, which was full of uncertainty, the success rate of the model was quite high.
5) Conclusion
This paper demonstrates the power of incorporating mutual information based metrics into the construction of R-vine copula structures for learning the joint distribution of a large number of stocks from high frequency market data. The data considered in the present analysis is an instant-by-instant record of transactions of 89 stocks listed on the National Stock Exchange (NSE) of India in the year 2014. In order to give a time series interpretation to our data, we divide each working day into 720 “ticks”, where each tick represents a 30 second duration. Out of the 89 stocks, we consider only the top 50 traded stocks. The Ljung-Box test revealed autocorrelation and heteroscedasticity in the time series of log returns of this portfolio of 50 stocks; among the candidate models, ARMA(1,1)-EGARCH(1,1) captured these features significantly better than the well-known GARCH(1,1) and ARMA(1,1)-GARCH(1,1) models. In fact, on 94.84% of the occasions the AIC value obtained after fitting ARMA(1,1)-EGARCH(1,1) was the lowest among the three models (a lower AIC indicates a better fit). The joint distribution of the error terms in the ARMA(1,1)-EGARCH(1,1) model applied to each stock is then computed by learning R-Vine copula structures in two ways: first, by starting with the minimum spanning tree computed on the basis of the mutual information metric; and second, by starting with the minimum spanning tree computed on the basis of the Kendall’s Tau based metric. Next, the VaR of the underlying 50 stock portfolio is computed through Monte-Carlo simulations in both cases. Christoffersen’s UC and CC tests show that the VaR predictions in the mutual information case outperform those in the Kendall’s Tau case: the success rate obtained from the MI based method is higher than that of the Kendall’s Tau based method on 42.71% of the occasions.
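To make the first building block concrete, the opening tree of the R-vine can be chosen as the spanning tree that maximises total pairwise dependence, i.e. a minimum spanning tree under the edge weight −MI (or 1−|τ| in the Kendall's Tau variant). The following is a minimal sketch, not the authors' implementation: it uses a crude plug-in histogram MI estimator (the bin count is an arbitrary assumption) and Prim's algorithm, with function names of our own choosing; a production analysis would rely on dedicated packages such as R's VineCopula.

```python
import math

def mutual_information(x, y, bins=10):
    """Plug-in histogram estimate (in nats) of the mutual information
    between two equal-length samples. Crude, for illustration only."""
    n = len(x)
    def bin_of(v, lo, hi):
        if hi == lo:
            return 0
        return min(int((v - lo) / (hi - lo) * bins), bins - 1)
    lx, hx, ly, hy = min(x), max(x), min(y), max(y)
    joint, px, py = {}, {}, {}
    for xi, yi in zip(x, y):
        bx, by = bin_of(xi, lx, hx), bin_of(yi, ly, hy)
        joint[(bx, by)] = joint.get((bx, by), 0) + 1
        px[bx] = px.get(bx, 0) + 1
        py[by] = py.get(by, 0) + 1
    mi = 0.0
    for (bx, by), c in joint.items():
        # term: p(x,y) * log( p(x,y) / (p(x) p(y)) ), with counts c, px, py
        mi += (c / n) * math.log(c * n / (px[bx] * py[by]))
    return mi

def mst_edges(nodes, weight):
    """Prim's algorithm: minimum spanning tree over `nodes` under the
    symmetric edge weight function weight(u, v); returns chosen edges.
    Passing weight = -MI yields the dependence-maximising first tree."""
    in_tree = {nodes[0]}
    edges = []
    while len(in_tree) < len(nodes):
        u, v = min(((a, b) for a in in_tree for b in nodes if b not in in_tree),
                   key=lambda e: weight(*e))
        edges.append((u, v))
        in_tree.add(v)
    return edges
```

Subsequent trees of the R-vine are then built on the conditioned pairs admitted by the proximity condition, with pair-copula families fitted edge by edge.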
On 51.04% of the occasions the success rates of the two methods were at par. The predictions were quite good even during the election period, when there is a lot of anticipation amongst the buyers. We finally conclude that the MI based R-Vine copula model is able to capture the joint distribution well and thus leads to better VaR predictions in a high frequency scenario.
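The final Monte-Carlo step reduces to an empirical quantile: simulate joint draws from the fitted R-vine, map them through the marginal ARMA-EGARCH models to portfolio log returns, and read off the p-quantile of the simulated distribution. A minimal sketch of that quantile step (our illustration; the lower-empirical-quantile convention is an assumption, and the function name is ours):

```python
import math

def monte_carlo_var(simulated_returns, p):
    """p-level VaR (e.g. p = 0.05) from Monte-Carlo draws of the
    portfolio log return: the negated lower empirical p-quantile,
    so a positive value represents a loss."""
    srt = sorted(simulated_returns)
    k = max(0, math.floor(p * len(srt)) - 1)  # index of the p-quantile
    return -srt[k]
```

With enough simulated draws the empirical quantile converges to the true quantile of the learned joint distribution, which is what the UC and CC backtests above evaluate.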
Data corresponding to Figs 2–6 and Tables 1–3.
(XLSX)
Table 2. Unconditional (UC) and conditional (CC) coverage tests at the 1% and 5% levels of significance: Election period.