Implied volatility directional forecasting: a machine learning approach

This study investigates whether the direction of U.S. implied volatility, VIX index, can be forecast. Multiple forecasts are generated based on standard econometric models, but, more importantly, on several machine learning techniques. Their statistical significance is assessed by a plethora of performance evaluation measures, while real-time investment strategies are devised to appraise the investment implications of the underlying modeling approaches. The main conclusion of the analysis is that the implementation of machine learning techniques in implied volatility forecasting can be more effective compared to mainstream econometric models and model selection techniques, as they are superior both in a statistical and an economic evaluation setting.


Introduction
Over time, a great deal of attention has centered on the predictability of equity market volatility, be it realized and/or implied, as changes in market volatility have significant repercussions on investor preferences. History shows that asset performance and its associated volatility are distinct and asymmetric across different market volatility regimes, as the pricing of market risk and, as a result, investor sentiment are time-varying. As such, identifying, but, more importantly, predicting aggregate market volatility is of critical importance for the implementation of effective asset allocation programs, investment and/or trading strategies, and for hedging purposes in particular.
As periods of transition from a low to a high market volatility regime can be very abrupt and relatively short-lived, the development of an effective modeling framework is of critical importance for the design and implementation of active portfolio immunization strategies in order to avoid sizeable drawdowns during periods of turmoil in particular.
In practice, a number of different metrics have been introduced for the estimation of market volatility; the most well-known and followed metrics, however, are realized and implied volatility. Realized volatility gauges the fluctuations of underlying securities or indices by measuring price changes over predetermined time periods, while implied volatility is a forward-looking metric that represents future expectations of the market's uncertainty. This study focuses on the Chicago Board Options Exchange's (CBOE) VIX index (VIX), which can be considered a model-free estimator of the equity market's implied volatility. More specifically, the VIX is designed to produce a measure of constant, 30-day expected volatility, derived from real-time, mid-quote prices of S&P 500 index call and put options. The VIX is one of the most recognized volatility measures globally.
*Corresponding author. Email: svrontos@essex.ac.uk
A substantial amount of research by both academics and practitioners has focused on the investigation of volatility forecasting and, consequently, on identifying variables that have predictive ability for time-varying volatility dynamics. It is noteworthy that a large part of the empirical evidence on implied volatility predictability is mixed. Dumas et al. (1998) explore whether S&P 500 implied volatility dynamics are predictable across different strike prices and expiry dates over alternative periods. They conclude that there is too much instability in the underlying forecasts for them to be useful for pricing and hedging purposes. By contrast, Goncalves and Guidolin (2006) detect a statistically predictable pattern that cannot be meaningfully exploited due to the high transaction costs involved.
Part of the literature has concentrated on the predictability of short-term at-the-money volatility; more specifically, Harvey and Whaley (1992) study at-the-money option volatility for the S&P 100 index, Guo (2000) for the Philadelphia Stock Exchange currency options, and Brooks and Oozeer (2002) for LIFFE long Gilt futures and options markets. A set of macroeconomic predictors is used to conclude that changes in implied volatility are partially predictable, but, once again, the results are not economically significant. In a similar study, Gemmill and Kamiyama (2000) find that changes in index option implied volatilities in a certain market are impacted by lagged changes in other markets' implied volatilities, i.e. there are large spillover effects. Moreover, Goyal and Saretto (2006) use information stemming from the cross-section of various stock option implied volatilities to find predictable patterns in implied volatility dynamics, and they conclude that there is both significant statistical and economic predictability.
The empirical evidence on the predictability of implied volatility indices is less extensive. Early studies include Aboura (2003) and Ahoniemi (2006), who conclude that the evolution of implied volatility indices is statistically predictable. In addition, Ahoniemi (2006) investigates the economic significance of the underlying forecasts and finds that implementing a trading strategy based on S&P 500 options does not achieve abnormal profits. Konstantinidi et al. (2008) explore whether European and U.S. index implied volatility can be forecast. More specifically, four major U.S. (VIX, VXO, VXN, VXD) and three European (VDAX, VCAC, and VSTOXX) implied volatility indices are studied. A set of economic and financial market variables that exhibit asset return predictability serves as potential predictors. Four different types of models are employed to generate forecasts: an economic variables model, uni- and multivariate (VAR) autoregressive models, a Principal Components (PCA) model, as well as ARIMA and ARFIMA models. In sample, the estimated models exhibit low explanatory power, with the VAR and PCA ones performing better than the others. Moreover, there is higher explanatory power associated with European indices. Out of sample, both point and interval forecasts are generated. Regarding point forecasts, the results show that in 28% of the cases one of the estimated models outperforms the random walk model. Predictability is, once again, higher within European indices, as in 41% of the cases there is outperformance over the random walk model, compared to just 18% in the case of their U.S. counterparts. The VAR and PCA models continue to generate the best forecasts related to the European indices, while the ARIMA and ARFIMA models perform best in the case of the U.S. indices. The overall results suggest that there are implied spillover effects between markets, as the information contained in all implied indices can be used to predict separately every European index.
This is not true for the U.S. indices; their underlying autocorrelation structure must be considered in order to forecast their evolution. Regarding interval forecasts, predictability is present in 47% of the cases and is stronger for the U.S. indices. Once again, volatility spillovers could be helpful for predictive purposes. However, there is no economic significance associated with the statistically predictable patterns. Clements and Fuller (2012) employ a semi-parametric approach to create a forecasting model regarding the sign of the change in the VIX index. More specifically, VIX forecasts are generated using a weighted average of past signed changes in the index, where the highest weight is assigned to periods with conditions comparable to the time for which the forecasts are being formed. The main focus of the study is to correctly predict increases in the VIX index. The underlying models are able to generate successful forecasts marginally more than 50% of the time, while the proportion of increases being correctly predicted ranges from 49.86% to 54.95%. The generated forecasts are then included in a portfolio immunization context. An active volatility exposure is considered to hedge an underlying long position in the S&P 500 index. It is shown that a long volatility hedge improves the risk-adjusted performance of a long-only equity market position. This is especially true during periods of market turbulence. Fernandes et al. (2014) perform a statistical investigation of the time series properties of the daily VIX index. To comprehend the statistical behavior of this index, they employ a multivariate setting that controls for different macroeconomic and financial market conditions. Variants of the Heterogeneous Autoregressive (HAR) process, including a semi-parametric neural network model with Bayesian regularization, are estimated both in- and out-of-sample.
Out-of-sample forecasts are generated for different daily horizons which show that the simple HAR model seems to exhibit superior performance when compared to alternative HAR specifications and a random walk model, while the neural network model performs best at longer forecast horizons.
Full-blown application of Machine Learning (ML) algorithms in realized and implied volatility forecasting, however, has been relatively scarce. Vortelinos (2017) examines whether nonlinear models, such as Principal Components Combining, Neural Networks and GARCH, are better in forecasting realized volatility in numerous U.S. asset markets relative to the HAR model. Recently, Hosker et al. (2018) compare three established financial models that have been used in equity market volatility forecasting to various machine/deep learning supervised regression methods. More specifically, their analysis provides forecasts for the 1-month VIX futures contract three and five days ahead. They conclude that ML methods based on Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) generate improved results over linear regression, Principal Components Analysis (PCA), and ARIMA models. More recently, Audrino et al. (2020) analyze the impact of sentiment indicators and investor attention variables on realized equity market volatility. They apply an innovative sentiment classification system to explore whether sentiment measures contain additional information content for realized volatility, after controlling for numerous macroeconomic and financial predictive variables. Using a penalized regression framework, they identify the most relevant variables for volatility prediction. Their results reveal that although attention and sentiment indicators improve volatility forecasts, they are not significant from an economic perspective. Hirsa et al. (2021) apply several machine learning techniques and neural networks to replicate the VIX index as well as VIX futures by using a small number of S&P options, and thus, are able to exploit potential arbitrage opportunities between the VIX index and its underlying derivatives.
The majority of the empirical research conducted on realized and implied volatility forecasting has been focused on point or interval forecasts rather than directional ones. The primary aim of this study is to create an effective forecasting framework to predict the direction of the VIX index. Forecasts are generated using standard binary econometric models, combination of forecasts methodologies and numerous machine learning techniques. Various statistical evaluation metrics are employed to assess the predictive ability of the underlying models and a number of investment strategies are devised to appraise their economic significance.
More specifically, the contributions of our work are several; first, a binary-valued dependent variable representing the directional patterns of the VIX index is introduced to explore whether the direction of the VIX can be forecast using a wide set of predictor variables. Second, following the trend of new variable detection, a number of less well studied predictors are considered in the underlying analysis. Third, a plethora of machine learning techniques is implemented in order to find out which of these methods improve predictive performance. Fourth, within the machine learning framework and for penalized likelihood techniques, automatic variable selection is performed to identify important predictors that are crucial in order to understand the evolution of market risk and the way that market returns change over time, and to build real-time investment strategies.
In general, the results reveal that machine learning techniques can lead to substantial improvement in predicting the directional patterns of the VIX index, as revealed by statistical and economic evaluation metrics.
The remainder of the paper is organized as follows. Section 2 provides an outline of the research methodology, i.e. introduces the different model specifications and the machine learning techniques; in Section 3 the data sets are described; Section 4 summarizes the empirical design and findings and finally, Section 5 concludes.

Research methodology: model specifications and machine learning techniques
The modeling approaches and machine learning techniques used to forecast VIX index related directional patterns are outlined in this section. More specifically, standard Logit regression models, penalized likelihood Logit models, as well as several machine learning techniques are briefly discussed.

Binary Logit regression models
A directional binary-valued dependent variable $Y_t$ of the following form is considered throughout this paper:
$$Y_t = \begin{cases} 1, & \text{if } \mathrm{VIX}_t > \mathrm{VIX}_{t-1}, \text{ i.e. there is an increase in the VIX volatility index at time } t, \\ 0, & \text{otherwise, i.e. if } \mathrm{VIX}_t \leq \mathrm{VIX}_{t-1}, \text{ there is a decrease in the VIX volatility index at time } t. \end{cases}$$
Generalized linear models (GLM) can be used to model the dependence of the binary directional variable Y t , given a set of N (lagged) covariates or predictor variables, X j,t−l , j = 1, . . . , N, l = 1, . . . , h and/or autoregressive components, Y t−i , i = 1, . . . , p. In the analysis, GLM are considered to model the mean μ of an observation of Y, which is related to the predictor variables through a link function and a linear predictor model of the form g(μ) = η, where g(μ) is the link function, and η is the linear predictor.
Regarding the Logit regression model, the link function $g(\mu)$ is the Logit transformation, $g(\mu_t) = \ln\left(\frac{p_t}{1-p_t}\right)$, which is a linear function of the predictor variables, the autoregressive components, and the unknown parameters.
Let $Z_t = (X_{1,t-1}, \ldots, X_{1,t-h}, \ldots, X_{N,t-1}, \ldots, X_{N,t-h}, Y_{t-1}, \ldots, Y_{t-p})'$ denote the augmented vector of predictors that contains the lagged predictor variables and the autoregressive components, $h$ being the maximum number of lags used in the model, and $\theta = (\beta', \phi')'$ the total parameter vector with $\beta = (\beta_0, \beta_{1,1}, \ldots, \beta_{1,h}, \ldots, \beta_{N,1}, \ldots, \beta_{N,h})'$ and $\phi = (\phi_1, \ldots, \phi_p)'$. Then, the Logit regression model can be written in the form
$$\ln\left(\frac{p_t}{1-p_t}\right) = \beta_0 + \sum_{j=1}^{N} \sum_{l=1}^{h} \beta_{j,l} X_{j,t-l} + \sum_{i=1}^{p} \phi_i Y_{t-i}.$$
Maximum likelihood estimates of the Logit regression model parameters, $\theta$, can be obtained by assuming that each $Y_t$ is the outcome of an independent Bernoulli random variable with an upward directional move probability $p_t$. The log-likelihood function for a sample of $T$ observations is given by
$$\ell(\theta) = \sum_{t=1}^{T} \left[ Y_t \ln p_t + (1 - Y_t) \ln(1 - p_t) \right].$$
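As an illustration of the estimation criterion, the Bernoulli log-likelihood of the Logit model can be sketched in a few lines of pure Python (an illustrative sketch with our own function names, not the paper's implementation):

```python
import math

def logit_loglik(theta, X, y):
    """Bernoulli log-likelihood of a Logit model.

    theta: parameter vector (intercept first), X: rows of predictor
    values (without an intercept column), y: 0/1 directional outcomes.
    """
    ll = 0.0
    for x_t, y_t in zip(X, y):
        # Linear predictor eta = beta_0 + sum of coefficient * predictor
        eta = theta[0] + sum(b * x for b, x in zip(theta[1:], x_t))
        p_t = 1.0 / (1.0 + math.exp(-eta))   # inverse Logit link
        ll += y_t * math.log(p_t) + (1 - y_t) * math.log(1.0 - p_t)
    return ll
```

Maximizing this function over theta (e.g. by Newton-type iterations, as standard GLM routines do) yields the maximum likelihood estimates.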

Penalized likelihood binary logit regression models
The Ridge, the Least Absolute Shrinkage and Selection Operator (LASSO), and the Elastic Net regularization techniques are presented below, as they are implemented in the empirical analysis using variants based on the Area Under the Curve (AUC), Deviance, and Class metrics. These techniques are penalized likelihood methods that impose shrinkage on the regression coefficients and allow for automatic variable selection. The inclusion of a parameterization penalty in the likelihood function is strongly recommended to overcome overfitting issues, especially in the presence of a large predictor set, and to improve out-of-sample predictive performance.
The Ridge Logit model. Ridge regression (Hoerl and Kennard 1970) can result in better prediction accuracy by shrinking the estimated coefficients towards zero. The specific process has numerous benefits, as it leads to lower variance and decreases the complexity of the underlying model, without reducing the number of predictors. Given the log-likelihood function $\ell(\theta)$ of the ordinary Logit model, with parameter vector $\theta$, the Ridge log-likelihood Logit function introduces a shrinkage penalty that employs the $\ell_2$-norm of $\theta$, $\|\theta\|_2^2 = \sum_i \theta_i^2$, and a tuning parameter $\lambda > 0$ that controls the degree of regularization. Increasing the value of $\lambda$ has the tendency to reduce the magnitude of the estimated coefficients, but does not result in the elimination of any of the predictors. The maximum likelihood estimator of the Ridge Logit model is given by
$$\hat{\theta}^{Ridge} = \arg\max_{\theta} \left\{ \ell(\theta) - \lambda \|\theta\|_2^2 \right\}.$$
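The Ridge-penalized objective can be sketched as follows (illustrative pure Python; leaving the intercept unpenalized is a common convention, not something stated in the text):

```python
import math

def sigmoid(eta):
    """Inverse Logit link."""
    return 1.0 / (1.0 + math.exp(-eta))

def ridge_logit_objective(theta, X, y, lam):
    """Penalized log-likelihood maximized by the Ridge Logit model:
    the sum of Bernoulli log-likelihood terms minus lam times the
    squared l2-norm of the slope coefficients (intercept theta[0]
    left unpenalized by convention)."""
    ll = 0.0
    for x_t, y_t in zip(X, y):
        p_t = sigmoid(theta[0] + sum(b * x for b, x in zip(theta[1:], x_t)))
        ll += y_t * math.log(p_t) + (1 - y_t) * math.log(1.0 - p_t)
    return ll - lam * sum(b * b for b in theta[1:])
```

Larger values of lam pull the slope coefficients toward zero without setting any of them exactly to zero.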

The LASSO Logit model. An alternative modeling approach is the Least Absolute Shrinkage and Selection Operator (LASSO) model, introduced by Tibshirani (1996). By imposing an alternative type of shrinkage, the LASSO Logit model provides better interpretability relative to that of the Ridge regression and also performs variable selection. The LASSO Logit coefficients are estimated by maximizing the corresponding log-likelihood function, while imposing a shrinkage penalty based on the $\ell_1$-norm of $\theta$:
$$\hat{\theta}^{LASSO} = \arg\max_{\theta} \left\{ \ell(\theta) - \lambda \|\theta\|_1 \right\},$$
where $\|\theta\|_1 = \sum_i |\theta_i|$ is the $\ell_1$-vector norm of $\theta$ and $\lambda > 0$ is the LASSO-related tuning parameter that controls the degree of shrinkage of $\theta$. The $\ell_1$ penalty function shrinks the underlying coefficients towards zero, forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter $\lambda$ is sufficiently large. The specific modeling approach eliminates a number of predictors entirely and could, thus, result in a set of predictive variables that are the most important for accurately forecasting the direction of the VIX index.
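The mechanism by which the $\ell_1$ penalty forces coefficients to exactly zero is the soft-thresholding operator used by coordinate-descent LASSO solvers; a minimal sketch (illustrative, not the paper's code):

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator from coordinate-descent LASSO:
    shrinks the unpenalized coordinate update z toward zero and sets
    it exactly to zero whenever |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0
```

Each coordinate update of the LASSO fit passes through this operator, which is why sufficiently small coefficients are eliminated entirely rather than merely shrunk.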

The Elastic Net Logit model. The Elastic Net is an alternative modeling approach, proposed by Zou and Hastie (2005), that has been extended to the Logit regression framework. The Elastic Net was introduced as an improvement (over LASSO and Ridge) that is able to handle highly correlated variables in the predictor set, a feature typically inherent in the analysis of high dimensional data. It combines the penalty terms of the Ridge and LASSO regressions using a convex combination scheme. In practice, the $\ell_1$-norm of the Elastic Net performs automatic variable selection, while the $\ell_2$-norm stabilizes the solution and, thus, raises the out-of-sample predictive performance. The Elastic Net Logit coefficient estimates are obtained by maximizing the log-likelihood function penalized by both the $\ell_1$-vector norm and the $\ell_2$-vector norm of $\theta$. Thus, the parameter estimates are obtained by
$$\hat{\theta}^{EN} = \arg\max_{\theta} \left\{ \ell(\theta) - \lambda \left[ \alpha \|\theta\|_1 + \tfrac{1-\alpha}{2} \|\theta\|_2^2 \right] \right\},$$
where $\lambda$ and $\alpha$ are the Elastic Net tuning parameters. For selecting the tuning parameter $\lambda$ in the Ridge, LASSO, and Elastic Net approaches, a validation set is needed on which the predictive value of the specific penalized Logit model can be compared for various values of the tuning parameter; the optimal parameter should be chosen in such a way that the error rate is minimal. In the analysis, the optimal tuning parameter is chosen through cross-validation.
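The combined penalty can be sketched as follows (assuming the common glmnet-style parameterization with tuning parameters lam and alpha; this convention is an assumption, since the text does not spell out the exact scaling):

```python
def elastic_net_penalty(theta, lam, alpha):
    """Elastic Net penalty as a convex combination of the l1 and l2
    penalties, glmnet-style (an assumed convention):
    lam * (alpha * ||theta||_1 + (1 - alpha)/2 * ||theta||_2^2).
    alpha = 1 recovers the LASSO penalty, alpha = 0 the Ridge penalty."""
    l1 = sum(abs(t) for t in theta)
    l2 = sum(t * t for t in theta)
    return lam * (alpha * l1 + (1.0 - alpha) * l2 / 2.0)
```

Subtracting this quantity from the Logit log-likelihood and maximizing over theta gives the Elastic Net Logit estimates.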

Discriminant analysis
Both Linear and Regularized Discriminant Analysis are applied in this study. These techniques are alternative classification tools that overcome certain limitations linked to standard Probit/Logit type regression models, e.g. when the binary-valued dependent variable is well separated or when there are just a few observations in a certain state, increasing the probability that the Logit regression estimates become unstable.

Linear discriminant analysis. Linear Discriminant Analysis (LDA) is a multivariate statistical technique used to classify and generate forecasts with respect to a binary dependent variable based on a set of predictors. The main idea is to derive a linear combination of the predictors that 'best' differentiates the events related to a specific state. There are several ways to determine discriminant criteria. One is to use a Bayesian approach for classification and derivation of probabilities. Based on Bayes' theorem, the probability of a specific event given a set of predictors $Z$ is estimated by
$$\Pr(Y = k \mid Z = z) = \frac{\pi_k f_k(z)}{\sum_{j=0}^{1} \pi_j f_j(z)}, \quad k = 0, 1,$$
where $k = 0$ and $k = 1$ denote the two possible outcomes of the dependent variable, $\pi_k$ denotes the prior probability, $f_k(z) = \Pr(Z = z \mid Y = k)$ denotes the conditional density of $Z$ given $Y = k$, while $\Pr(Y = k \mid Z = z)$ denotes the posterior probability of higher or lower implied volatility, given the set of predictors $Z$. Assuming that the conditional distribution of the set of predictors $Z$ given $Y = k$ is multivariate normal with mean vector $\mu_k$ and a common covariance matrix $S_k = S$ for $k = 0, 1$, i.e. $f_k(z) \sim N(\mu_k, S)$, LDA assigns an observation $Z = z$ to the regime with the largest value of
$$\delta_k(z) = z' S^{-1} \mu_k - \tfrac{1}{2} \mu_k' S^{-1} \mu_k + \log \pi_k, \quad k = 0, 1.$$
For more details, please refer to James et al. (2013).
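The linear discriminant score can be computed directly once the inverse of the pooled covariance matrix is available; a minimal pure-Python sketch (function name and interface are our own):

```python
import math

def lda_score(z, mu_k, S_inv, pi_k):
    """Linear discriminant score
        delta_k(z) = z' S^{-1} mu_k - 0.5 mu_k' S^{-1} mu_k + log(pi_k).
    z: observation, mu_k: class mean vector, S_inv: inverse of the
    pooled covariance matrix, pi_k: class prior probability.
    The observation is assigned to the class with the largest score."""
    p = len(mu_k)
    # S^{-1} mu_k as a plain matrix-vector product
    Sinv_mu = [sum(S_inv[i][j] * mu_k[j] for j in range(p)) for i in range(p)]
    return (sum(zi * m for zi, m in zip(z, Sinv_mu))
            - 0.5 * sum(mi * m for mi, m in zip(mu_k, Sinv_mu))
            + math.log(pi_k))
```

Evaluating the score for k = 0 and k = 1 and taking the larger value reproduces the assignment rule above.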

Regularized discriminant analysis. The Regularized Discriminant Analysis (RDA) technique shrinks the class-specific covariance estimates used in Quadratic Discriminant Analysis towards the pooled covariance matrix used in LDA, and utilizes regularization to enhance performance. The regularized covariance matrices have the following form:
$$S_k(\lambda) = (1 - \lambda) S_k + \lambda S, \quad k = 0, 1,$$
where $S$ is the pooled covariance matrix used in LDA, $S_k$, $k = 0, 1$, is the class-specific covariance matrix used in Quadratic Discriminant Analysis, and $\lambda \in [0, 1]$ can be specified by employing cross-validation in practice. An alternative way of shrinkage is to use a convex combination that allows $S_k(\lambda)$ to be shrunk towards a scaled identity matrix, using the shrinkage parameter $\gamma$ as follows:
$$S_k(\lambda, \gamma) = (1 - \gamma) S_k(\lambda) + \gamma\, \bar{\sigma}^2(\lambda) I,$$
where $\bar{\sigma}^2(\lambda) = \operatorname{tr}[S_k(\lambda)]/N$ is the mean of the diagonal elements of $S_k(\lambda)$, representing the mean variance of the class predictors. The RDA classifier then assigns an observation to the class with the smallest value of the quadratic discriminant score computed with $S_k(\lambda, \gamma)$,
$$d_k(z) = (z - \mu_k)' S_k(\lambda, \gamma)^{-1} (z - \mu_k) + \ln |S_k(\lambda, \gamma)| - 2 \ln \pi_k,$$
where $\lambda$ is the cross-validation parameter that controls the degree of shrinkage of the individual class covariance matrix estimates towards the pooled estimates, and $\gamma$ is an additional regularization parameter that controls shrinkage toward a multiple of the identity matrix for a given value of $\lambda$. Please refer to Friedman (1989) for more details.
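The two-stage covariance regularization can be sketched as follows (illustrative pure Python; names are our own, and covariance matrices are represented as nested lists):

```python
def rda_covariance(S_k, S_pooled, lam, gamma):
    """Regularized class covariance for RDA: first shrink the class
    covariance toward the pooled one,
        S_k(lam) = (1 - lam) * S_k + lam * S_pooled,
    then shrink the result toward a scaled identity,
        (1 - gamma) * S_k(lam) + gamma * (tr(S_k(lam)) / p) * I."""
    p = len(S_k)
    # Stage 1: convex combination of class-specific and pooled covariances
    S_lam = [[(1 - lam) * S_k[i][j] + lam * S_pooled[i][j]
              for j in range(p)] for i in range(p)]
    # Mean of the diagonal elements (mean variance of the class predictors)
    mean_var = sum(S_lam[i][i] for i in range(p)) / p
    # Stage 2: shrink toward a multiple of the identity matrix
    return [[(1 - gamma) * S_lam[i][j] + (gamma * mean_var if i == j else 0.0)
             for j in range(p)] for i in range(p)]
```

With lam = 0 and gamma = 0 the class covariances of Quadratic Discriminant Analysis are recovered, while lam = 1, gamma = 0 gives the pooled-covariance LDA case.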

Classification and Regression Trees
A classification problem can also be addressed using tree-based classification methods. Classification and Regression Tree (CART) models, introduced by Breiman et al. (1984), offer a flexible way to analyze non-linear relationships between a dependent variable and a set of predictors. In tree-based models, the space of the predictor variables is partitioned into a number of simple regions created recursively employing a sequence of binary decisions. The terminal nodes of the tree correspond to distinct and non-overlapping regions of the partition, and the partition is determined by splitting rules associated with the set of predictors at each of the internal (splitting) nodes. The estimation procedure of the tree structure is usually based on a top-down, greedy algorithm to grow a tree together with a pruning algorithm to avoid overfitting. Starting from a single node at the top of the tree, and then successively splitting the predictor space, the tree is grown by sequentially choosing the new partitions, based at each step on a model fitting criterion, such as the Gini index or entropy. This sequential procedure produces a maximal binary tree that is corrected by pruning through a model selection criterion, such as cost complexity pruning, cross-validation, or an information criterion like AIC or BIC. Based on the estimated tree structure, a prediction for a given observation can be generated by using the mode of the training observations in the partition to which it belongs.
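As an example of the model fitting criterion evaluated at each candidate split, the Gini index of a binary partition can be computed as follows (illustrative sketch for 0/1 class labels; names are our own):

```python
def gini(labels):
    """Gini impurity of a node with 0/1 labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n                     # proportion of class 1
    return 1.0 - p1 * p1 - (1.0 - p1) * (1.0 - p1)

def split_gini(left, right):
    """Size-weighted Gini impurity of a candidate binary split; the
    greedy tree-growing step picks the split minimizing this value."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

A split that produces two pure child nodes attains the minimum weighted impurity of zero.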

Bagging.
Bagging (Breiman 1996) improves the predictive accuracy of a classifier by generating multiple versions of the classifier based on bootstrap replicates of the training data set, and then combines these classifiers in order to create a single one. The specific aggregated process produces a stable forecast/classifier with smaller variance and, thus, succeeds in achieving substantial gains in terms of accuracy.
The core idea of the algorithm is to create $B$ bootstrap samples $S^*_1, S^*_2, \ldots, S^*_B$. From each bootstrap sample $S^*_i$, $i = 1, \ldots, B$, the quantity of interest, i.e. the classifier, say $\hat{C}^*_i$, is estimated based on the same learning procedure. Then, the bagged estimator/classifier, $\hat{C}^{bag}$, can be obtained by aggregating the different bootstrap classifiers.
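The bootstrap-and-aggregate mechanics can be sketched in a few lines (illustrative; classifiers are represented as plain functions and aggregated by majority vote, the standard aggregation rule for classification):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """One bootstrap replicate: n draws from the data with replacement."""
    return [rng.choice(data) for _ in data]

def bagged_classify(classifiers, z):
    """Aggregate B fitted classifiers by majority vote at point z."""
    votes = Counter(c(z) for c in classifiers)
    return votes.most_common(1)[0][0]
```

In a full implementation, each classifier in the list would be trained on its own bootstrap replicate of the training set before being passed to the aggregation step.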

Random Forest.
The Random Forest (Breiman 2001) represents an alternative machine learning technique that generates improved predictive accuracy compared to the standard classification and regression tree models. Random Forest can reduce the classifier variance by aggregating a number of low correlated trees obtained using different bootstrap training samples. This can be achieved by creating multiple decision trees, where the splitting variables of the internal nodes are based on a random process, i.e. by introducing randomness in the tree-growing procedure.
As in the case of bagging, $B$ bootstrap samples $S^*_1, S^*_2, \ldots, S^*_B$ are generated from the training data set. For each bootstrap sample, a Random Forest classification tree $T^{rf}_i$ is created, and the classifier $\hat{C}^{rf}_i$, $i = 1, \ldots, B$, is estimated based on the same learning procedure. Then, the Random Forest classifier, $\hat{C}^{rf}$, can be obtained by aggregating the different bootstrap classifiers using the majority vote rule, i.e. $\hat{C}^{rf} = \text{majority vote}\, \{\hat{C}^{rf}_i\}_{i=1}^{B}$. The Random Forest classifier found by aggregating different bootstrapped classifiers is similar to that generated by the Bagging technique. There is a significant modification in the way that Random Forest classification trees are created, however. The splitting variable is chosen to be the 'best' variable among a random subset of $m$ candidate variables taken from the full set of the $N$ predictive variables. A new random sample of $m$ candidate splitting variables is taken at each splitting node of the tree. Thus, the Random Forest selects, at each node of each tree, a random subset of variables, and only these variables are used as candidates to find the best split for the node. The use of different bootstrap samples, i.e. the creation of different classification trees, together with the introduction of this randomness at each splitting node of each tree, provides a number of uncorrelated trees and, as a consequence, the Random Forest forecast/classifier achieves reduced variance.
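The node-level randomization step that distinguishes Random Forest from bagging can be sketched as follows (illustrative; the common default m ≈ √N for classification is a convention, not stated in the text):

```python
import random

def candidate_split_variables(n_predictors, m, rng):
    """At each splitting node, draw a fresh random subset of m of the
    N predictor indices; only these candidates are searched for the
    best split at that node. A common default for classification is
    m approximately sqrt(N)."""
    return rng.sample(range(n_predictors), m)
```

Because a new subset is drawn at every node of every tree, strong predictors cannot dominate all trees, which decorrelates the ensemble and reduces the variance of the aggregated classifier.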

Adaptive boosting.
Boosting is an ensemble technique that combines several weak classifiers to produce a more accurate and powerful classifier. The basic idea is to combine classifiers that are iteratively created through the resampling of the training data by assigning higher weight to observations that were misclassified, to produce a new classifier that could boost the underlying performance. This process is repeated, generating several classifiers, which are then combined into a final classifier by applying weighted majority vote.
A very interesting feature of the Adaptive Boosting algorithm was presented by Friedman et al. (2000), who showed that Adaptive Boosting can be interpreted as a forward stagewise additive model that minimizes exponential loss. In the analysis, the Adaptive Boosting (Adaboost) algorithm proposed by Freund and Schapire (1997) is employed. Friedman (2001, 2002) developed a framework for the creation of a new generation of boosting techniques, based on boosting's link to the statistical principles of additive modeling and maximum likelihood (Friedman et al. 2000). More specifically, Friedman (2001) presented a general method for estimating/approximating a function F(z), mapping the predictor variables Z to the dependent variable, based on numerical optimization in the function space. The function F(z) can be expressed in terms of forward stagewise additive modeling and the application of steepest-descent minimization. A generic gradient descent 'boosting' algorithm (Gradient Boosting) is developed for additive expansions based on different fitting criteria. The core idea of the Gradient Boosting algorithm is to create additive models by sequentially fitting a base (weak) learner to the current pseudo-residuals at each iteration. Thus, the learning process fits new models consecutively, to provide a more accurate estimate of the dependent variable.
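One Adaboost iteration, i.e. the computation of a weak classifier's vote and the reweighting of misclassified observations, can be sketched as follows (illustrative, using the standard +1/-1 label convention of the Freund and Schapire 1997 algorithm):

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost iteration. Computes the weighted error of the
    current weak classifier, its vote alpha in the final weighted
    majority, and the renormalized observation weights, with
    misclassified observations upweighted. Labels are +1/-1."""
    err = sum(w for w, yt, yp in zip(weights, y_true, y_pred) if yt != yp)
    err = max(min(err, 1 - 1e-12), 1e-12)       # guard against 0/1 error
    alpha = 0.5 * math.log((1 - err) / err)     # classifier weight
    # Upweight misclassified (yt * yp = -1), downweight correct (yt * yp = 1)
    new_w = [w * math.exp(-alpha * yt * yp)
             for w, yt, yp in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]
```

Iterating this step and combining the weak classifiers with weights alpha yields the final weighted-majority-vote classifier described above.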

Gradient boosting.
Assuming that each $Y_t$ ($Y_t = 1$ denotes higher implied volatility at time $t$) is the outcome of an independent Bernoulli random variable with probability of higher implied volatility $p_t$, the quantity of interest is the log-odds, denoted by $\lambda(z) = \ln\left(\frac{p_t}{1-p_t}\right)$, which is a function of the set of predictors $Z_t$. Under Gradient Boosting, $\lambda(z)$ is the function $F(z)$ that can be expressed based on additive expansions of the form
$$F(z) = \sum_{b=1}^{B} \beta_b\, h(z; \gamma_b),$$
where $h(z; \gamma_b)$ is the base learner (e.g. a tree), and $\gamma_b$ is the parameter vector to be estimated (for trees, it contains the identification of the splitting variables, their splitting values, and the constants in the terminal nodes). The steps of the Gradient Boosting algorithm can be found in Friedman (2001) and Efron and Hastie (2016).
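At each Gradient Boosting iteration under the Bernoulli log-likelihood, the pseudo-residuals to which the next base learner is fitted reduce to the differences between the observed outcomes and the currently fitted probabilities; a minimal sketch (illustrative, with our own function names):

```python
import math

def bernoulli_pseudo_residuals(F_values, y):
    """Negative gradient of the Bernoulli deviance with respect to the
    current log-odds fit F(z): r_t = y_t - p_t, where
    p_t = 1 / (1 + exp(-F(z_t))). Each boosting round fits the next
    base learner to these pseudo-residuals."""
    return [y_t - 1.0 / (1.0 + math.exp(-f)) for f, y_t in zip(F_values, y)]
```

Starting from an initial log-odds fit, the algorithm repeatedly computes these residuals, fits a tree to them, and adds a damped version of the fitted tree to the current expansion.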

Recursive Partitioning Algorithm.
Recursive Partitioning Algorithms (rpart) are used to create classification and regression models; that is, the resulting models can be represented as binary trees. The tree-structured model is created using a two-stage procedure: in the first step, the single predictor variable that best splits the data into two groups is found. Several measures can be employed to find the best split; possible choices are the information index and the Gini index. The dataset is partitioned using this splitting variable into two child nodes, and the process is applied independently to each sub-group (child node). This is done recursively until the subgroups or terminal nodes either reach a minimum size, or until no improvement can be achieved. The resulting tree structure may be overly complicated, so the second step of the procedure entails pruning of the full tree. A complexity parameter α, measuring the 'cost' of adding another variable, can be used; α is chosen using cross-validation. More details can be found in Therneau and Atkinson (2019).

k-Nearest Neighbor. k-Nearest Neighbor (k-nn) is a non-parametric technique used for classification. It is a distance-based classifier that takes into account the similarity of the predictive variables at a future time period with the observations of the training set. The new observation is then classified based on the majority vote of its neighbors, i.e. it is assigned to the class that is most common among its k nearest neighbors. Implementation of this technique requires the specification of a distance metric, needed to calculate the distance between the new observation and those of the training sample, and a positive integer k, which determines the number of neighbors used in the voting process. A Euclidean distance metric can be used for computing distances in a multi-dimensional predictor space with quantitative variables, as is the case in the underlying analysis. The choice of k is very important, as it affects the classifier obtained by the technique. A potential solution is to use cross-validation to select the value of k. The specific classification method does not require any model to be fitted, and does not impose any strong assumption regarding the underlying data.
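The distance-and-vote rule can be sketched in a few lines (illustrative pure Python using the Euclidean metric, as in the analysis; function name and interface are our own):

```python
import math
from collections import Counter

def knn_classify(train_X, train_y, z, k):
    """k-nearest-neighbor vote: find the k training points closest to
    z in Euclidean distance and return the majority class among them."""
    dists = sorted((math.dist(x, z), y) for x, y in zip(train_X, train_y))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]
```

Because no model is fitted, the entire training sample must be retained and scanned at prediction time, which is the usual trade-off of the method.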
Naive Bayes. The Naive Bayes learning scheme (see, e.g., Bauer and Kohavi 1999) applies Bayes' theorem and assumes conditional independence of the predictors given the class. The basic mechanism for classification can be derived as follows:
$$\Pr(Y = k \mid Z = z) = \frac{\pi(Y = k) \Pr(Z = z \mid Y = k)}{\sum_{j=0}^{1} \pi(Y = j) \Pr(Z = z \mid Y = j)},$$
where $\Pr(Y = k \mid Z = z)$ denotes the posterior distribution of higher implied volatility ($k = 1$), or lower implied volatility ($k = 0$), given the set of predictors $Z$, $\Pr(Z = z \mid Y = k)$ denotes the conditional density of $Z$ given $Y = k$, and $\pi(Y = k)$ denotes the prior distribution of the binary dependent variable. Under the conditional independence assumption, the Naive Bayes classifier can be found by applying
$$\hat{Y} = \arg\max_{k}\; \pi_k \prod_{j=1}^{N} \Pr(Z_j = z_j \mid Y = k),$$
where $\pi_k$ is estimated by the sample proportions. It is evident that the method assigns an observation $Z = z$ to the class with the largest posterior probability, known as the maximum posterior decision rule. The method is implemented in classification problems, especially when the dependent variable is qualitative, as the independence assumption is less restrictive than might be expected, and, usually, the method assigns maximum probability to the correct class. Its good performance can be attributed to the use of the zero-one loss function.
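The maximum posterior decision rule under the independence assumption can be sketched as follows (illustrative; the per-predictor conditional densities are passed in as functions, and all names are our own):

```python
def naive_bayes_posterior(priors, cond_densities, z):
    """Maximum a posteriori Naive Bayes rule: for each class k,
    multiply the prior pi_k by the product of per-predictor
    conditional densities (the conditional-independence assumption),
    then return the class with the largest score.
    cond_densities[k][j] is a callable giving f(z_j | Y = k)."""
    scores = {}
    for k, pi_k in priors.items():
        score = pi_k
        for f_j, z_j in zip(cond_densities[k], z):
            score *= f_j(z_j)
        scores[k] = score
    return max(scores, key=scores.get)
```

Since the denominator of the posterior is the same for every class, comparing the unnormalized products is sufficient for classification.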

Bayesian generalized linear models.
The Bayesian approach to inference regarding the estimation of stable Logit regression coefficients is briefly outlined. In Bayesian analysis, inference regarding the model parameters is based on the posterior distribution that is computed using the Bayes theorem:

f(θ_m | D) = f(D | θ_m) f(θ_m) / f(D),

where f(θ_m | D) is the posterior distribution of a specific model's m parameters, θ_m, given the data D, f(D | θ_m) is the likelihood function, f(θ_m) is the prior density of model m parameters, and f(D) is the normalizing constant (marginal likelihood). The posterior distribution summarizes all that is known about the model parameters after 'seeing' the data.
To reflect the ex-ante opinion of the uncertainty regarding the model parameters, a prior distribution is assigned. Conditional on having observed the data, the prior opinion can then be updated to a posterior opinion on model parameters using the Bayes theorem.
To implement the Bayesian generalized linear model in the problem of interest, the approach proposed by Gelman et al. (2009) is applied, using the 'bayesglm' function, which finds the approximate posterior mode and variance using extensions of the classical generalized linear model computations. Gelman et al. (2009) propose a new class of prior distributions, the Student-t family, focusing on the Cauchy distribution as a conservative choice, for the Logit regression model parameters, after first scaling the predictor variables. The proposed prior setting enables the production of stable, regularized estimates, even when there is separation in Logit regression, and allows for an automated process in applied data analysis.

Data
As the primary sources of equity market volatility stem from either exogenous and/or endogenous shocks in the economic, credit, and financial cycle, it is only rational to include a set of economic and financial variables as likely predictors. A relatively large number of predictive variables is considered, as empirical research has shown that their effectiveness is, to a large extent, time dependent. More specifically, the set of predictors contains 31 macroeconomic and financial market related indicators. Some of them are widely followed by both policy makers and practitioners, and have been used in the existing literature for implied volatility prediction, while a number of less studied indicators are also included in the analysis. In general, the forecasting variables are representative of risk categories related to macroeconomic and financial market risk, momentum, as well as investor sentiment factors with monthly frequency.
The set of macroeconomic risk factors includes variables, such as the term spread, the oil price, the dollar index, the U.S. Economic Policy Uncertainty index and a proxy for macroeconomic volatility based on the annualized 12-month standard deviation of industrial production growth. Equity market risk is proxied by numerous equity related factor portfolios. Apart from the well-known Fama-French three factor portfolios, HML (Value), SMB (Size), and MKT (Market), additional factor portfolios are employed, representing alternative risk factors, such as QMJ (Quality), BAB (Low Volatility), and RVAR (Residual Variance). In addition, financial and credit risk is proxied by the default and the TED spread, and the Chicago Financial Conditions Leverage index, among others.
The use of medium and short-term equity momentum factors is supported by the large literature on the predictability of asset returns. As a result the Fama-French MOM (medium term price momentum) and STR (short-term reversal) portfolio returns are apparent candidates. However, momentum factors from various other asset classes (fixed income, commodity, and foreign exchange) are also employed, due to the existence of cross-asset or inter-market linkages. Last, a series of investor sentiment related factors based on the American Association of Individual Investor survey is included in the analysis.
The directional indicator is calculated from the Chicago Board Options Exchange (CBOE) Volatility index (VIX). The indicator is a binary variable that takes the value of 1, if the VIX trends higher in month t, and 0, if the VIX trends lower. Autoregressive components of the directional indicator are used as predictors, to capture the likely underlying persistence in the implied volatility series. A comprehensive list of the predictors, as well as their source, is presented in table A1 in the appendix.
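The construction of the directional indicator can be sketched as follows (toy month-end VIX levels; the precise 'trends higher' rule used in the study, e.g. level change versus change in a monthly average, is an assumption here):

```python
def directional_indicator(vix_levels):
    """1 if the VIX is higher than in the previous month, else 0
    (simple month-over-month comparison, assumed for illustration)."""
    return [1 if cur > prev else 0
            for prev, cur in zip(vix_levels, vix_levels[1:])]

vix = [17.2, 19.5, 18.9, 25.4, 24.1]   # hypothetical month-end levels
print(directional_indicator(vix))      # [1, 0, 1, 0]
```

Note that, as in the study, the indicator series is one observation shorter than the level series, which is why the directional indicator begins in February 1990.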
The data cover the period from January 1990 to December 2019 (360 monthly observations). As the initial data point of the VIX index is January 1990, the directional indicator begins in February 1990. The full sample is divided into an in-sample and an out-of-sample period; the in-sample period ranges from February 1990 to December 2003 (167 monthly observations), while an out-of-sample period ranging from January 2004 to December 2019 (192 monthly observations) is used to evaluate predictive performance. Out-of-sample forecasts are generated recursively, using a forecast horizon of one month.
When the machine learning techniques are implemented, each classification model is trained as follows: the training dataset is preprocessed (centered and scaled), the parameter(s) of each model are tuned by cross-validation as well as resampling, and variable importance is determined before out-of-sample forecast generation. The resampling approach, in specific, seeks to determine the values of each of the model parameters (if any) and uses the best tuning parameter(s), based on fitted (in-sample) accuracy measures, to produce out-of-sample forecasts. In each model, the best tuning parameter(s) are used to run the out-of-sample forecasts recursively, and the respective performance evaluation measures are obtained.

Empirical design and analysis
The empirical findings of the analysis are presented in this section. The main objective is to evaluate the out-of-sample statistical and economic performance of the different modeling approaches under consideration. First, the out-of-sample predictive performance of the underlying models will be investigated based on several statistical evaluation measures. Next, the economic significance of the generated forecasts will be assessed through the creation of real-time investment strategies.
Initially, the out-of-sample performance of various univariate Logit regression models is examined using the set of predictor variables, as well as autoregressive components. Motivated by the inferior performance of these basic models, attention is directed to alternative modeling approaches, to find out whether they are able to substantially improve the underlying predictive ability. In this sense, the implemented class of models is extended to a multiple Logit regression framework, to penalized likelihood Logit regression models, and to other machine learning techniques.
In general, under the multiple regression framework, estimation of the model parameters is straightforward; the identification of the most important set of predictors, however, is not an easy task, due to the high dimensionality of the problem. More specifically, with a set of N = 31 predictor variables, using a Logit autoregressive model of order three, i.e. p = 3, and individual predictor variables with one up to three lagged terms, i.e. h = 3, the maximum number of potential lagged predictors is 96, and the number of competing models is vast. Different model/variable selection strategies have been used in the literature to address this issue. Stepwise regression (and/or its variants), standard information criteria such as Akaike's (1973) Information Criterion (AIC) or Schwarz's (1978) Bayesian Information Criterion (BIC), the Bayesian approach to model selection, or penalized likelihood methods, such as the Elastic Net, could be used to narrow down and identify important predictors. While the majority of these approaches are implemented, the Elastic Net regularization approach, which imposes shrinkage on the regression coefficients and allows for automatic variable selection, is the core approach followed in this study, in a real-time out-of-sample framework in particular.
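The dimension count in the paragraph above can be verified with a one-line computation (N = 31 predictors with up to h = 3 lags each, plus p = 3 autoregressive terms):

```python
N, h, p = 31, 3, 3                  # predictors, max lag per predictor, AR order
max_lagged_terms = N * h + p        # lagged exogenous terms plus AR terms
print(max_lagged_terms)             # 96, as stated in the text
```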
The total sample is divided into an in-sample and an out-of-sample period to assess the predictive performance of the different modeling approaches and machine learning techniques. To obtain out-of-sample forecasts, the model selection algorithm that identifies important predictors is implemented, and the predictive models are estimated recursively using an expanding window scheme. That is, the model selection algorithm and the fitted models are first estimated using data from the beginning of the dataset to the end date of the in-sample period, and a one-month-ahead forecast is generated. Then, the estimation window is updated by adding one observation to the estimation sample at each step. As such, model selection is conducted and coefficients are re-estimated after each step of the recursion. Proceeding iteratively in this way through the end of the out-of-sample period, a series of P out-of-sample forecasts is generated.
The proposed algorithm is briefly presented below:
(1) Set the out-of-sample period P.
(2) For iteration = 1, . . . , P:
    (a) Specify the estimation window, which is updated based on a recursive (expanding) scheme.
    (b) Apply the Elastic Net regularization approach and identify the important predictors.
    (c) Implement the proposed modeling approaches, i.e. the machine learning techniques and the standard predictive models.
    (d) Compute VIX directional one-step-ahead forecasts.
    (e) Construct different investment strategies.
(3) Compute statistical and economic evaluation measures.
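The recursive scheme above can be sketched as follows; `fit` and `predict` are placeholders standing in for any of the variable selection and estimation routines (the majority-class "model" below is purely illustrative):

```python
def recursive_forecasts(y, X, fit, predict, in_sample_end):
    """One-step-ahead forecasts under a recursive (expanding) window:
    at each step the model is re-selected/re-estimated on all data up
    to time t, a forecast for time t is produced, and the window grows
    by one observation on the next iteration."""
    forecasts = []
    for t in range(in_sample_end, len(y)):
        model = fit(X[:t], y[:t])                 # steps (a)-(c)
        forecasts.append(predict(model, X[t]))    # step (d)
    return forecasts

# Placeholder "model": predict the majority class seen so far.
fit = lambda X, y: 1 if sum(y) * 2 >= len(y) else 0
predict = lambda model, x: model
y = [1, 0, 1, 1, 0, 1, 1]
print(recursive_forecasts(y, y, fit, predict, in_sample_end=4))  # [1, 1, 1]
```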

Predictive performance evaluation based on statistical metrics
The results of the statistical performance evaluation of the analyzed modeling approaches are presented in this section. The out-of-sample performance of the directional VIX forecasts obtained by the univariate Logit models based on autoregressive (lagged) terms and individual predictor variables, of the form ln(p_t / (1 − p_t)) = φ_0 + Σ_{l=1}^{p} φ_l y_{t−l} and ln(p_t / (1 − p_t)) = β_0 + Σ_{j=1}^{h} β_j x_{i,t−j}, i = 1, . . . , N, respectively, is presented first, in order to illustrate the potential benefits and the motivation behind the use of machine learning techniques.
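Given estimated coefficients, the univariate autoregressive Logit specification maps lagged values of the directional indicator to a probability through the logistic function; a minimal sketch with hypothetical (not estimated) coefficients:

```python
import math

def logit_ar_prob(phi0, phis, y_lags):
    """p_t from ln(p_t / (1 - p_t)) = phi_0 + sum_l phi_l * y_{t-l}."""
    z = phi0 + sum(phi * y for phi, y in zip(phis, y_lags))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients; lagged directions y_{t-1}, y_{t-2}, y_{t-3} = 1, 0, 1.
print(round(logit_ar_prob(-0.2, [0.8, 0.1, 0.3], [1, 0, 1]), 3))  # 0.711
```

A forecast of a higher VIX (class 1) would then correspond to p_t exceeding a chosen threshold, e.g. 0.5.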
Table 1 (panel A) presents the statistical evaluation metrics for the VIX index directional forecasts obtained by the univariate Logit models. The majority of the performance evaluation metrics reported are based on the confusion matrix, which summarizes the forecasting outcomes regarding the classification problem. A detailed description and analysis of several performance evaluation measures used in such problems can be found in Sokolova and Lapalme (2009), Hossin and Sulaiman (2015) and Tharwat (2020), among several others. In particular, metrics such as Area Under the ROC curve (AUC), Misclassification Error (MCE), Accuracy, Kappa statistic, Sensitivity and Specificity, Precision, F1-Score, and Balanced Accuracy are presented. For all metrics, with the exception of MCE, high positive values indicate superior forecasting performance. In addition, the statistical significance of the directional forecasts is assessed via the Pesaran-Timmermann (PT) statistic. The results are in line with Konstantinidi et al. (2008), who point out that autoregressive terms and the market variable are important predictors for the evolution of implied volatility. The most rational move is to focus on multiple Logit regression models, where the basic set up remains the same, i.e. both autoregressive components and predictor variables up to lag three, i.e. p = 3 and h = 3, are considered. A real-time stepwise regression framework based on the Akaike Information Criterion (AIC), the Schwarz Bayesian Information Criterion (BIC), and the Deviance metrics was implemented. The models selected by the BIC and Deviance criteria outperform the corresponding model based on the AIC criterion, and are slightly better than the univariate Logit model with three autoregressive terms, but clearly superior compared to all the rest of the univariate models, as they produce higher AUC (around 60%), lower MCE (around 39%), higher Accuracy (around 61%), Kappa statistic (around 0.21), Precision (around 57% and 58%), and a significant statistical predictive performance based on the PT statistic (2.973 and 2.856, respectively).
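The confusion-matrix-based measures listed above can be computed directly from the forecast outcomes; a compact sketch with toy actual/predicted series (for illustration only):

```python
def confusion_metrics(actual, pred):
    """Confusion-matrix-based evaluation measures for binary forecasts."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, pred))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, pred))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, pred))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, pred))
    n = tp + tn + fp + fn
    acc = (tp + tn) / n
    sens = tp / (tp + fn)            # sensitivity (recall)
    spec = tn / (tn + fp)            # specificity
    prec = tp / (tp + fp)            # precision
    return {
        "Accuracy": acc,
        "MCE": 1 - acc,              # misclassification error
        "Sensitivity": sens,
        "Specificity": spec,
        "Precision": prec,
        "F1": 2 * prec * sens / (prec + sens),
        "BalancedAccuracy": (sens + spec) / 2,
    }

m = confusion_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
print(m["Accuracy"], m["BalancedAccuracy"])
```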
It is noteworthy that the inferior results of the AIC based model might be due to its tendency to overestimate the number of predictors (possible overfitting, as it selects a larger number of predictors in each out-of-sample iteration) compared to the BIC and Deviance criteria. The next step is to use a penalized likelihood method for variable selection, namely the Elastic Net regularization approach. The first autoregressive component of the directional indicator and the equity market momentum factor, MOM EQ t−1, are selected by the algorithm at all the out-of-sample iterations. Another important predictor is the equity market factor, MKT t−1, with a frequency of inclusion of around 99%. In addition, the residual variance factor, RVAR t−1, and the oil price, Oil t−1, are part of the top ranked list, with frequencies of inclusion of about 70% and 25%, respectively. Other predictors that affect the VIX direction, but to a smaller extent and at different lags, are the Fama-French momentum factor, MOM t−1, the lagged autoregressive components of VIX, VIX t−3, the quality minus junk factor, QMJ t−2, and the high minus low devil factor, HMLD t−3. The list of relevant predictors is not surprising; the lagged binary variable and the market factor have already exhibited their significance in the univariate Logit model setting. In addition, the presence of medium-term price momentum factors is consistent with empirical evidence showing that they are significant for asset return predictability. The low volatility, quality, and value-related factor portfolios approximate equity market risk factors. Last, but not least, it has been shown that commodity markets tend to be forward-looking and exhibit leading properties; as such, the inclusion of the oil price is not unforeseen.
To assess the sensitivity of the selected predictors, the variable selection exercise is repeated using the Elastic Net Deviance model, but with an a = 0.75. Table 3 presents the selected predictors together with their inclusion frequency across the out-of-sample iterations. The results are robust, as they are similar to those of table 2, as the first autoregressive component, the equity market momentum factor (MOM EQ), the equity market factor (MKT), the residual variance factor (RVAR) and the oil price (Oil price) are, once again, identified as important predictors.
Having preselected the set of predictors, the ability of the machine learning techniques to provide even more accurate VIX directional forecasts is evaluated next. Table 4 reports the statistical evaluation measures for the different modeling approaches (panels A to H). More specifically, the analysis involves the application of the Ridge, LASSO and Elastic Net regularization techniques, Discriminant analysis and Bayesian methods, as well as Classification Tree related techniques, k-nn and partitioning methods. Useful insights emerge from the reported results. It is evident that the majority of the machine learning techniques produces significant predictive performance, based on the high positive values of the PT statistic and the different confusion matrix-related measures. In particular, the Naive Bayes and the Adaptive Boosting methods seem to be the top performing models, as they clearly outperform the univariate and the multiple stepwise Logit regression models, as well as the other machine learning techniques. To assess the robustness of the results, the real-time out-of-sample forecasting exercise is repeated using the Elastic Net Deviance model with an a = 0.75. Table 5 presents the statistical evaluation measures for the different models and confirms the superior predictive performance of the Naive Bayes method over the competing techniques, since it ranks first with respect to almost all statistical measures. The predictive ability of the Naive Bayes method has been reported in the literature; see, for example, Domingos and Pazzani (1997) and the references therein. Additional modeling approaches that produce high predictive ability and rank among the best performers are the Ridge Deviance model and the Adaptive Boosting method. In general, the results are robust, and similar in spirit to those of table 4, as the majority of the machine learning techniques produces significant out-of-sample performance.
As a final check, a series of combined † implied volatility probability forecasts is generated based on linear weighting schemes, as well as Bayesian methods (Bayesian Model Averaging). Even though the accuracy of the forecasts generated by the combination methods is, in general, higher compared to the univariate Logit and stepwise Logit regression models, in most cases they fail to outperform the higher ranked machine learning techniques.
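A linear weighting scheme of the kind referred to above can be sketched as a weighted average of the individual models' probability forecasts (a simplified illustration with equal weights; the paper's exact combination schemes, including Bayesian Model Averaging, are more involved):

```python
def combine_probs(forecast_sets, weights=None):
    """Linear combination of probability forecasts from several models;
    defaults to the equal-weighted average at each forecast date."""
    n = len(forecast_sets)
    weights = weights or [1.0 / n] * n
    return [sum(w * f[t] for w, f in zip(weights, forecast_sets))
            for t in range(len(forecast_sets[0]))]

# Three models' probability forecasts over two months (toy values).
combined = combine_probs([[0.6, 0.4], [0.8, 0.2], [0.7, 0.6]])
print(combined)
```

The combined probabilities can then be thresholded at 0.5 (or a confidence band) to obtain directional forecasts, exactly as for an individual model.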
In conclusion, the evidence clearly supports the implementation of machine learning techniques in implied volatility prediction, as it significantly improves the accuracy of outof-sample forecasts, based on numerous statistical evaluation measures.

Economic performance evaluation based on real-time trading strategies
To evaluate the economic significance of the forecasts generated by each of the estimated models, a number of investment strategies based on the VIX index and the S&P 500 index have been devised. The strategies are implemented also for models with inferior statistical performance, as statistical evidence is not always confirmed by the underlying financial performance.
A good starting point is to focus on the performance of the most naive strategy: being constantly long the VIX index. It is not surprising that this strategy constitutes a suboptimal investment choice, as the long periods of VIX range trading, combined with the relatively short periods of sizeable rises, are clearly against long positioning. This is reflected in the performance evaluation metrics (see table 6, panel A), with negative Sharpe and Sortino ratio readings, as well as high average drawdowns. Table 6 presents the various metrics, including an estimated alpha and a beta coefficient versus a reference benchmark that consists of 50% cash (the return on the 1-month Treasury bill) and 50% equities (the return of the S&P 500 index). The specific composition aims to reflect an investor that is long an equity portfolio, but at the same time seeks to protect against downside risk.
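The Sharpe and Sortino ratios reported in table 6 can be computed from monthly strategy returns along the following lines (a sketch assuming annualization by √12 and population standard deviations; the paper's exact conventions are not stated):

```python
import math

def sharpe_sortino(monthly_returns, rf=0.0, periods=12):
    """Annualized Sharpe and Sortino ratios from monthly returns.
    The Sortino ratio replaces total volatility with downside
    deviation (root mean square of negative excess returns)."""
    ex = [r - rf for r in monthly_returns]
    mu = sum(ex) / len(ex)
    sd = math.sqrt(sum((r - mu) ** 2 for r in ex) / len(ex))
    dd = math.sqrt(sum(min(r, 0.0) ** 2 for r in ex) / len(ex))
    ann = math.sqrt(periods)
    return mu / sd * ann, mu / dd * ann

# Toy monthly strategy returns, purely for illustration.
sharpe, sortino = sharpe_sortino([0.02, -0.01, 0.03, 0.01, -0.02, 0.04])
print(round(sharpe, 2), round(sortino, 2))
```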
The first strategy (Strategy 1) goes long the VIX index when implied volatility is expected to rise, and remains on the sidelines when implied volatility is expected to fall. The second strategy (Strategy 2) is leveraged, as it invests in the VIX index when implied volatility is forecast to rise and shorts the VIX index when it is forecast to decline. In the third strategy (Strategy 3), the S&P 500 index is bought when implied volatility is expected to trend lower. For robustness and confidence reasons, variants of Strategies 2 and 3 are also formed. More specifically, when the underlying forecast lies within the 45%-55% probability interval, no position is taken; if the estimated probability is above 55%, the strategy goes long the VIX index; alternatively, when the probability is below 45%, the strategy goes short the VIX index (similar to Strategy 2) and long the S&P 500 index (similar to Strategy 3). The specific strategies are tagged Confidence Strategy 2 (Strategy Con. 2) and Confidence Strategy 3 (Strategy Con. 3), respectively.
† Elliott et al. (2013) introduced a new method for combining forecasts based on complete subset regressions. For a given set of potential predictors, the authors propose combining forecasts from all possible linear regressions that keep the number of predictors fixed. For K possible predictors, there are K univariate models and N_{k,K} = K!/((K − k)!k!) different k-variate models for k ≤ K. The set of models for a fixed value of k is referred to as a complete subset, and the authors propose to use equal-weighted combinations of the forecasts from all models within these subsets indexed by k. See, also, Meligkotsidou et al. (2019) and Meligkotsidou et al. (2021), who developed quantile forecast combination schemes for realized volatility prediction and a complete subset quantile regression approach for equity premium prediction.
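The confidence variants can be sketched as a mapping from probability forecasts to positions, here for the VIX leg (toy forecasts and VIX returns, purely for illustration):

```python
def confidence_positions(prob_forecasts, lo=0.45, hi=0.55):
    """Strategy Con. 2 positions: +1 (long VIX) above hi, -1 (short VIX)
    below lo, 0 (no position) inside the lo-hi probability band."""
    return [1 if p > hi else -1 if p < lo else 0 for p in prob_forecasts]

probs = [0.62, 0.50, 0.41, 0.58, 0.47]         # toy probability forecasts
positions = confidence_positions(probs)        # [1, 0, -1, 1, 0]
vix_ret = [0.10, -0.02, -0.05, 0.03, 0.01]     # toy monthly VIX returns
strategy_ret = [pos * r for pos, r in zip(positions, vix_ret)]
print(positions, round(sum(strategy_ret), 4))
```

Strategy Con. 3 follows the same band logic, but replaces the short-VIX leg with a long S&P 500 position.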
All the outlined strategies are fully implementable, as the recent rise in passive investing, or indexing, has enabled investors to gain exposure to previously inaccessible market segments through innovative index-linked financial products. More specifically, all strategies can be executed through the use of Exchange Traded Funds (ETFs), as long and short VIX-related short-term ETFs are nowadays available. No transaction costs are taken into account in the calculations.
Table 6 (panel B) shows the economic performance evaluation metrics of a selected number of univariate Logit models. More specifically, the performance of the four top ranked models, i.e. of the VIX lagged autoregressive, the equity market factor (MKT), the High minus Low Devil factor (HMLD), and the equity market momentum factor (MOM EQ) model, based on the statistical evaluation measures, such as the PT statistic, Balanced Accuracy, Precision, is presented. The results reaffirm the predictive ability of the autoregressive model, as well as the market factor one, as they achieve relatively high risk adjusted performance as reflected in the Sharpe and Sortino ratios, as well as the high estimated alphas versus the reference benchmark. In general, the rest of the univariate models produce suboptimal performance.
Next, the economic performance of the multiple Logit models based on the real-time stepwise regression framework (AIC, BIC and Deviance metrics) is reported in table 6 (panel C). The results suggest that an investor that generates forecasts through the stepwise BIC approach achieves higher positive annualized returns, ranging from 23.81% (Strategy Con. 3) to 71.46% (Strategy 2), compared to the corresponding returns based on the univariate Logit models, as well as those of the stepwise AIC and Deviance approaches. Moreover, the BIC-related forecasts attain higher risk adjusted performance, as the Sharpe and Sortino ratios record values ranging from 0.59 (Strategy Con. 3) to 1.02 (Strategy 2), and from 1.00 (Strategy 1) to 1.66 (Strategy 2), respectively. Strategy 2 provides the investor with the highest annualized returns (71.46%) and also the highest risk adjusted performance metrics, i.e. Sharpe (1.02) and Sortino (1.66) ratios. The stepwise Deviance approach ranks second, with relatively high economic performance metrics.
The economic performance evaluation measures related to the machine learning techniques are depicted in table 7 (a = 0.5 in the Elastic Net variable selection algorithm). As in the case of the univariate and multiple Logit regression models, the results mostly confirm the statistical evaluation rankings. Investment strategies constructed based on forecasts generated by the Naive Bayes, Ridge Deviance, and Discriminant Analysis models, and by combinations of forecasts, produce the best annualized returns and risk adjusted performance measures; the Naive Bayes and the Ridge Deviance based strategies, in particular, rank among the top performers. Table 8 presents the evaluation measures for the machine learning techniques (a = 0.75 in the Elastic Net variable selection algorithm) and confirms the superior economic performance of these techniques. Once again, the Ridge Deviance, the Regularized Discriminant Analysis (RDA), and the Naive Bayes based forecasts, as well as the median forecast combination scheme with a subset of k = 2 predictors, are the best performers among the competing modeling approaches.
In conclusion, the findings support the implementation of machine learning techniques in implied volatility prediction, as there is strong evidence regarding their superior accuracy for directional VIX forecasts that is also reflected in higher risk adjusted returns when real-time investment strategies are implemented.

Conclusion
The primary target of this study was to explore whether the application of innovative modeling approaches can contribute to the literature on the predictability of implied volatility indices. To this end, numerous alternative modeling approaches (Logit, Stepwise Logit, Combination of forecasts, and machine learning techniques) and model specifications have been employed to generate directional forecasts of the CBOE VIX index. The predictive accuracy of the generated out-of-sample forecasts was assessed both in a statistical and an economic setting. The economic significance was evaluated using various investment strategies based on the VIX index and the S&P 500 index.
The main conclusion is that the application of certain machine learning techniques in implied volatility prediction is strongly encouraged, as it significantly improves the accuracy of out-of-sample forecasts, based on numerous statistical evaluation measures, over more mainstream econometric methods. Moreover, the use of penalized likelihood methods, such as the Elastic Net regularization technique, is strongly advised, as it can be used to narrow down the number of candidate predictors and identify the important ones.
As anticipated, there is a certain degree of divergence in the predictive accuracy of the underlying machine learning models. Model accuracy is not consistent across all models; some techniques, however, exhibit consistently superior relative performance (Naive Bayes, Ridge Deviance, Adaptive Boosting, Discriminant Analysis).
The analysis employs an elaborate set of predictive variables containing 31 macroeconomic and financial market related indicators, the majority of which are widely followed and have been used in the existing literature on the predictability of asset returns. It also introduces a number of less studied indicators to the area of implied volatility predictability.

Disclosure statement
No potential conflict of interest was reported by the author(s).