Volatility forecasting using deep neural network with time-series feature embedding

Abstract Volatility is usually a proxy indicator for market variation or tendency, containing essential information for investors and policy-makers. This paper proposes a novel hybrid deep neural network model (HDNN) with temporal embedding for volatility forecasting. The main idea of our HDNN is that it encodes one-dimensional time-series data as two-dimensional GAF images, which enables the follow-up convolution neural network (CNN) to learn volatility-related feature mappings automatically. Specifically, HDNN adopts an elegant end-to-end learning paradigm for volatility forecasting, which consists of feature embedding and regression components. The feature embedding component explores the volatility-related temporal information from GAF images via the elaborate CNN in an underlying temporal embedding space. Then, the regression component takes these embedding vectors as input for volatility forecasting tasks. Finally, we examine the feasibility of HDNN on four synthetic GBM datasets and five real-world Stock Index datasets in terms of five regression metrics. The results demonstrate that HDNN has better performance in most cases than the baseline forecasting models of GARCH, EGACH, SVR, and NN. It confirms that the volatility-related temporal features extracted by HDNN indeed improve the forecasting ability. Furthermore, the Friedman test verifies that HDNN is statistically superior to the compared forecasting models.


Introduction
Financial markets are highly uncertain because of economic dynamics (Barra et al., 2020;Wang et al., 2021).Investors should have a deep insight into the market variation or uncertainty to increase the potential profitability of investments (Khan et al., 2016;Li et al., 2021).As volatility is an indicator of uncertainty, it plays a vital role in financial markets.High volatility is associated with market turbulence and significant price fluctuation, while low volatility describes calm and quiet markets.Hence, forecasting the tendency of volatility has become essential to success for investments.
In recent decades, volatility forecasting has been one of the most active areas in time-series econometrics and financial data mining.Effective forecasting is of great importance for many financial activities, such as risk management (Dvorsky et al., 2021), option pricing (Kim et al., 2021), and portfolio optimisation (Lampariello et al., 2021).Unlike other time-series applications, there are many unpredictable influences on financial volatility, such as political events, investors' psychology, and market information asymmetry.Moreover, how to accurately forecast the fluctuation or tendency of the Stock Index has become an urgent issue for investors and policymakers.Hence, studying the volatility of the Stock Index is imperative.
This paper proposes an end-to-end hybrid deep neural network (HDNN) model with temporal embedding for volatility forecasting.Inspired by the time-series encoding (TSE) framework (Wang & Oates, 2015), we first encode the raw 'log returns' and 'realised volatility' time-series data as particular two-dimensional GAF images.These transformed GAF features allow the deep convolution neural network (CNN) model to 'visually' recognise, exploit, and learn the temporal patterns for volatility forecasting.Therefore, HDNN can automatically explore the temporal characteristics of time-series data and capture the volatility dynamics via the deep CNN model.The main attractive features of the proposed HDNN model are as follows: Rather than learning from time-series data directly, HDNN utilises an elegant strategy to encode one-dimensional time-series data as two-dimensional GAF images.These encoding images can preserve the temporal relationship between successive time intervals.HDNN is an end-to-end learning model for volatility forecasting.Namely, it can simultaneously learn the temporal feature mapping and the volatility forecasting task.The network architecture of HDNN consists of feature embedding and regression.The feature embedding component explores the volatility-related embedding space from GAF images via CNN.Meanwhile, the regression component takes the embedding vectors of the time-series data to perform volatility forecasting tasks.
The rest of the paper is structured as follows.Section 2 presents a comprehensive review of volatility forecasting models.In Section 3, we first provide a brief introduction to the notations and the learning tasks of volatility forecasting.Then, we present the approach for encoding the time-series data into GAF images.Afterwards, we give the structure design of the HDNN model.Section 4 describes the experimental settings.Subsequently, the extensive results on the synthetic and real-world Stock Index datasets are analysed in Section 5. Section 6 ends the paper with the conclusions and future works.

Literature review
In econometrics and financial data mining, the study of volatility has garnered much attention.Therefore, a great deal of research has been conducted to forecast volatility (Poon & Granger, 2005;Shin, 2018;Takahashi et al., 2021;Vychytilov a et al., 2019).Generally, there are two main branches: ARCH and machine learning.

ARCH branch for volatility forecasting
The ARCH branch has been widely considered to explore volatility clustering and leptokurtosis within financial time-series data.The pioneering work could be traced back to ARCH proposed by the most influential economist Engle (1982).The main idea of ARCH is to formulate the error variance as an autoregressive (AR) process.Subsequently, Bollerslev (1986) proposed GARCH with lagged conditional variance to track changes in historical innovations.As a powerful extension of ARCH, GARCH uses a more general model in that the variance is expressed as an autoregressive moving average (ARMA) process.However, both fail to measure leverage effects because of the symmetrical distribution assumption.
As a result, several studies have brought great efforts to their improvements.EGARCH (Nelson, 1991) was proposed to capture asymmetry in time-series data by separating the sign and the volume of previous returns.However, when data are nonstationary, the performance of EGARCH will degrade.Engle (2002) extended GARCH with the dynamic conditional correlation (DCC) strategy for multivariate volatility.However, the DCC procedure is less tractable than expected.Bauwens et al. (2007) proposed Mixed-GARCH by modelling the conditional distribution of time series with a mixture of multivariate normal distributions.Nevertheless, the normal distribution assumption is too strict in practice.To forecast the nonstationary volatility, Li et al. (2018) presented ZD-GARCH with a first-order zero drift strategy for heteroscedasticity.It performs stability with the sample path oscillating randomly between zero and infinity overtime under a specific condition.Paolella et al. (2019) improved GARCH with hidden Markov regime-switching dynamics for volatility and further proposed Markov-GARCH.This model is coherent in the sense of being a well-defined stochastic process.Wang et al. (2021) established ARMA-GARCH models with Student's t-distribution assumption for the daily volatility of stock forecasting.
However, most ARCH models strongly depend on the generalised linearity assumption.Thus, it is challenging to disclose the nonlinear relationship within volatility.Moreover, most of them were originally proposed for low-frequency data (daily, weekly, etc.), which are unsuitable for high-frequency large-scale intraday data.

Machine learning branch for volatility forecasting
Another branch of volatility forecasting is machine learning (ML), which can automatically explore underlying patterns from a large amount of financial time-series data (Sezer et al., 2020;Ge et al.).Henrique et al. (2019) provided a comprehensive review of the most influential literature on financial market prediction, including neural network (NN), support vector machine (SVM), gradient boosting tree (GBT), and random forest (RF).The results showed that machine learning models outperformed the traditional ARCH models with lower forecasting errors in most cases.Yang et al. (2020) verified that SVM could significantly improve predictive ability compared with the GARCH models.Meanwhile, Kim et al. (2021) demonstrated that machine learning models provided better forecasting performance than generalised linear models.Based on approximately 80 relevant papers in the finance volatility forecasting field, Ge et al. (2023) surveyed that NN could effectively learn from the time-series data and achieve excellent forecasting performance.
Other researchers have applied various machine learning methods to financial data for volatility forecasting.To enhance the ability of GARCH models for S&P 500 forecasting, Hajizadeh et al. (2012) proposed a hybrid NN-EGACH model by feeding NN with the estimates of volatility from GARCH.Mishra and Padhy (2019) extended the support vector regression (SVR) for stock volatility and then applied it to portfolio construction.With the Markowitz strategy, Ma et al. (2021) combined SVR, RF, and deep learning (DL) models for portfolio optimisation.Zhang et al. (2021) constructed an NN model with different horizon times to forecast copper fluctuations.To improve the ability of echo state networks, Gabriel et al. (2021) proposed a hybrid model by incorporating the consistent PSO metaheuristic approach for hyperparameter tuning.Overall, many results in recent research (Ecer, 2013;Eduardo et al., 2019;Lee & Kim, 2020;Nayak et al., 2015;Yang et al., 2020) affirmed that machine learning models generally provide better forecasting ability than standard statistical models.
However, most machine learning methods learn directly from the raw time-series data.Such a raw form may not suitably represent the essential data characteristics (Wang & Oates, 2015).Thus, it is urgent to extract the most task-related temporal features from the raw data to explore the underlying volatility patterns.

The learning task of volatility forecasting
First, we give notations used throughout this paper.Upper and lower boldface letters are used for matrices and column vectors, respectively.All vectors are column vectors unless transformed to row vectors by a prime superscript ðÁÞ': A vector of zeros and ones of arbitrary dimensions is represented by 0 and e, respectively.An identity matrix of arbitrary dimensions is denoted as I: In what follows, we describe the learning task of volatility forecasting (Liu, 2019;Yang et al., 2020).
Definition 1. (Log returns) Suppose the transaction price series is P ¼ p 1 , . . ., p tÀ1 , p t ½ , whose element p t denotes the t-th transaction price observation.Then, the t-th log returns r t is defined as Definition 2. (Interval) Suppose two consecutive observations are p t and p tÀ1 : Then, the sampling interval or frequency I D is defined as where TimeðpÞ is the timestamp of the observation p: It can be seen that the higher the frequency of the transaction price series, the smaller the time interval.Definition 3. (Realised volatility, RV) The RV is the square root of the realised variance, which is calculated as the variance of the log returns in a window of m intervals into the future, where r t stands for the average log returns in the considered window.
The RV measures the magnitude of the change in terms of returns, which reflects the actual market risk or reward.Thus, we model RV as the forecasting parameter of volatility.
Definition 4. (The learning task of volatility forecasting) Let R t ¼ r tÀmþ1 , . . ., r t f g 2 R m and V t ¼ RV tÀmþ1 , . . ., RV t f g 2 R m be the time-series of log returns r and realised volatility RV with m historical lags at the t timestamp.The goal of volatility forecasting for our MDVF is to learn a nonlinear function which maps the predictor time-series R t and V t at the t timestamp to the response variable RV tþ1 at the future t þ 1 timestamp.

Encoding the time-series data into GAF images
Recently, the convolution neural network (CNN), one of the most popular deep neural networks, has achieved state-of-the-art computer vision performance (LeCun et al., 2015).CNN has the attractive property that it is good at feature extraction according to the particular learning task without any prior knowledge and human intervention.That is, it can automatically learn the task-related features via the training process, whereas these features should be hand-engineered in traditional algorithms.Moreover, CNN can generate shift-invariance mappings with the sharing weights through the receptive fields.Although CNN achieves significant performance in many domains, it cannot extract the temporal features directly from the raw financial time-series data.The reason is that the representation of time-series data is a one-dimensional vector, while the input of many pre-trained CNNs requires a twodimensional matrix.Therefore, transforming time-series features into 'visual' two-dimensional features has attracted much attention in signal processing and physics.The time-series encoding (TSE) framework (Wang & Oates, 2015) provides an elegant way to transform time-series into GAF images.As a result, it enables the existing pre-trained deep CNN models to learn from the encoding data directly, rather than the time-consuming process of training RNN or 1D-CNN models from scratch.
We now discuss how to encode the predictor time-series R t and V t by GAF images under the TSE framework (Wang & Oates, 2015).For simplicity, we unify R t and V t as the time-series X t ¼ x tÀmþ1 , . . ., x t f g with m lags at the t timestamp.First, we rescale the elements of X t into the interval À1, þ 1 ½ via the following normalisation operation: Then, we transform the rescaled time-series Xt from the Cartesian coordinate system into the polar coordinate system by where / i is the angle of xi , a i is the corresponding amplitude or radius, i is the sequence number in Xt , and m is the length of Xt to regularise the span of the polar coordinate system.Equation ( 6) describes the coordinates of each element of Xt in the polar system, which provides a new perspective on Xt : Based on the new polar system, we construct two Gramian Angular Fields (GAF) matrices by calculating the sum/difference between elements within the series Xt as follows: and Remark 1. From ( 7) and ( 8), we can efficiently analyse the temporal correlation property of the series Xt in the polar system by considering the trigonometric sum and difference between different time intervals.Precisely, the time interval increases as the position moves from top-left to bottom-right in the GAF matrices, which preserves the rich temporal dependency within the series.That is, assuming xi and xj with i À j j j ¼ k in Xt , the element of GAF matrices represents the relative correlation of xi and xj with respect to time interval k: According to some mathematical knowledge, we reformulate ( 7) and ( 8) in the following form and where the dimension of S t and D t is m Â m: Note that the operations in ( 9) and ( 10) can encode the series Xt effectively into two-dimensional GAF matrices, which is suitable for most CNN models.Finally, we merge a series of images I t from the above matrices as where the dimension of I t is 2m Â 2m, and the superscripts r and RV in S t and D t denote the GAF encoding of R t and V t , respectively.The whole procedure for generating the GAF images from the time-series data is illustrated in Figure 1.

A hybrid deep neural network model for volatility forecasting
This subsection introduces a novel hybrid deep neural network (HDNN) for volatility forecasting.The architecture of the HDNN is described in Figure 2. Specifically, our HDNN consists of two components: feature embedding and regression.The feature embedding part aims to automatically learn an underlying embedding space from the GDF images via CNN.To preserve as much temporal information as possible, it generates a series of feature mappings that transform each GDF image from the original space into a new embedding space.Then, the regression part uses these new representation vectors combined with the historical time-series for the volatility regression learning task.Now we discuss the structure of the CNN in our HDNN model.Inspired by the art-of-the-state VGG16 network (Zhang et al., 2016), the CNN comprises six conv2D (2 D convolution) layers and a fully connected dense layer, detailed as follows: Similar to VGG16 (Simonyan & Zisserman, 2015), we use the two stacked 3 Â 3 conv2D layers with a stride of one throughout the whole neural network rather than relatively large receptive fields in the convolution layers.The advantage of such a network structure is that it needs fewer parameters but obtains more nonlinearity or discriminative ability than 5 Â 5 or 7 Â 7 conv2D layers.To extract the underlying temporal feature from GAF images, we configure three groups of two stacked conv2D layers with 32, 64, 128 channels, as shown in different colours in Figure 2. To avoid model overfitting, we implement the Dropout layer after the convolution layers.It offers remarkably effective regularisation to improve the generalisation ability in deep neural networks.Moreover, we use ReLU for the activation function in each layer.Since ReLU has a simple formula maxð0, xÞ, it obtains a more efficient gradient computation than the sigmoid or tanh function.Additionally, it reduces the vanishing gradient issue during the training procedure.Finally, we use a flatten layer to convert the output of the conv2D layers into a single long feature vector.With the flattening vector, we further connect it to a dense layer to perform dimensionality reduction.In other words, we put all the high-dimensional tensors together in one line and make connections with the dense layer.
After obtaining the feature vector from the dense layer in the CNN, we combine it with the historical time-series R t and V t via a concatenation operation.Then, the concatenated vector is fed as the input of the follow-up network for volatility forecasting.Here, we utilise three fully connected dense layers to implement the regression task: the first two have 128 channels with the ReLU activation function, and the third performs volatility forecasting with one output channel.The configuration of these dense layers is the same in all networks.Once the HDNN model is built, we should design a criterion to validate its forecasting ability to guide the model optimisation.It is worth noting that the loss function plays an essential role during the model learning phase, which measures how good the prediction is.Namely, we estimate the model performance iteratively via the loss function and then update the parameters to reduce errors on the subsequent evaluation.In our model learning, we adopt the mean square error (MSE) or least square as the loss function where d RV i is the volatility forecast, RV i is the ground-truth volatility, and m is the scale of the dataset.MSE is the average of the squared differences between the predicted d RV i and actual RV i , which is one of the best measures for volatility.
Remark 2. Different from most existing volatility forecasting methods, our HDNN can automatically learn the volatility-related temporal embedding via a convolution neural network.Specically, it enjoys the following attractive features: (1) Rather than learning from the time-series data directly, HDNN utilises an elegant strategy to encode one-dimensional time-series data as two-dimensional GAF images.These encoding images can preserve the temporal relationship between successive time intervals.
(2) HDNN is an end-to-end learning model for volatility forecasting.There is no need to construct or select features manually.That is, it can learn the temporal feature mapping and the volatility forecasting task simultaneously.
(3) The network architecture of HDNN consists of feature embedding and regression.The feature embedding component explores the volatility-related embedding space from GAF images via CNN.Meanwhile, the regression component takes the embedding vectors of the time-series data to perform volatility forecasting tasks.

Experimental setting
To demonstrate the validity of our HDNN model, we investigate its performance on both synthetic and real-world time-series datasets.The synthetic datasets are generated by the geometric Brownian motion (GBM) process, while the real-world datasets are collected from five Stock Index datasets.The implementations are carried out on a Linux Centos workstation using Python 3.8 with an Intel i9-10090k processor (3.7 GHz) and 128 GB random-access memory (RAM).

Baselines
Our experiments focus on comparisons between the proposed HDNN and five timeseries forecasting methods, including GARCH, EGARCH, linear kernel SVR, RBF kernel SVR and NN, detailed as follows: GARCH (Bollerslev, 1986) and EGARCH (Nelson, 1991) are popular volatility methods that use observations of returns to model volatility shocks.Here, we utilise the 'ARCH' Python library to train both GARCH and EGARCH in our experiments.The optimal lag parameters (p and q) in GARCH and EGARCH are chosen from 1, 2, . . ., 10 f gaccording to the Bayesian Information Criteria (BIC).SVR (Mishra & Padhy, 2019) is the regression version of SVM, which is a preeminent maximum-margin learning paradigm for data mining.The structural risk minimisation principle gives SVR an excellent generalisation ability.We consider the linear SVR (SVR_lin) and the RBF nonlinear SVR (SVR_rbf) for comparison in the experiments.Moreover, we utilise the 'scikit-learn' Python library to train these SVR models.The optimal parameters (C for both SVR models and r for RBF kernel) are chosen from 2 À6 , 2 À5 , . . ., 2 5 , 2 6 f g according to the cross-validation method.Neural Network (NN) (LeCun et al., 2015), is a well-known mathematical model that uses learning algorithms to explore the underlying pattern among data.It employs a series of hidden neurons to approximate the complex relationships between inputs and outputs.For a better comparison, the structure of the baseline NN consists of three dense layers with the same setting as the regression part of the HDNN.The combination vector of the historical log returns and realised volatility time-series is the input of the NN.We utilise the 'PyTorch' deep learning Python library with CUDA 10.2 to develop both the NN and our HDNN.The Adam optimiser is used to train these models on an NVIDIA GPU (Quadro RTX 8000).Moreover, the learning rate is chosen from the set 0:0001, 0:005, 0:001, 0:005 f g :

Performance criteria
In this subsection, we describe the criteria used to evaluate the performance of these volatility forecasting models.Without loss of generality, let m be the number of samples, d RV i be the volatility forecast, RV i be the ground-truth volatility, and RV ¼ 1 m P m i¼1 RV i be the average value of the series RV 1 , RV 2 , . . ., RV m f g : Then we use the following criteria (Davidovic, 2021;Liu, 2019;Wu et al., 2021;Yang et al., 2020) for model evaluation.

RMSE:
The mean squared error (MSE) is a risk metric corresponding to the expected value of the squared error, defined as MAD: The mean absolute deviation (MAD) is a risk metric corresponding to the expected value of the absolute error, defined as MedAE: The median absolute error (MedAE) is calculated by taking the median of all absolute differences between the ground truth and the prediction, defined as MAPE: The mean absolute percentage error (MAPE) is the mean or average of the absolute percentage errors of forecasts, defined as R 2 : The coefficient of determination R 2 is a statistical measure that represents the proportion of the variance for a dependent variable, defined as

Time-series cross-validation for volatility forecasting
For the time-series learning tasks, the traditional cross-validation strategy becomes invalid.Namely, we cannot split training and validating sets in the random sense for the time-series forecasting.The reason is that such a random instance selection mechanism will cause discontinuity and break down the temporal relationship within the series.It makes no sense to forecast the past using future-looking data.Thus, we must avoid choosing future-looking data in the training set during model learning.In short, we should preserve the temporal dependency within the training and validating dataset.
To address the above issue, we utilise cross-validation on a rolling basis (Bergmeir & Ben ıtez, 2012) for our forecasting learning task, as shown in Figure 3. Specifically, we begin with a small subset of series data from the start to the t timestamp for training and forecast the follow-up future data with a length of t þ n: Then, we evaluate the forecasting performance with the criteria in Section 4.2.Subsequently, we expand the training set to the t þ n timestamp with the last forecasted data and forecast the future segment with n length.The process moves the forecasted segment successively until it hits the end of the instances.

Results of the synthetic GBM time-series datasets
To demonstrate the effectiveness of our HDNN model, in this subsection, we perform comparisons on the synthetic time-series datasets generated by the geometric Brownian motion (GBM) process (Chen & Tsai, 2020).GBM is a powerful mathematical finance tool to model stock prices, similar to the famous Black-Scholes pricing model.It describes the pricing model as a continuous-time stochastic process, in which the logarithm of price p t over time obeys a Brownian diffusion process: where r is the risk-free rate, r is the constant volatility for the pricing process, and the random variable w t $N ð0, r 2 tÞ: For an initial price p 0 , Equation ( 18) has an analytic solution: Following the GBM process in Equation ( 19), we generate four pricing time-series with different levels, i.e., r ¼ 0:05, 0:1, 0:15, 0:2 f g : Each pricing series consists of 1000 samples with a random initial value p 0 , as illustrated in Figure 4.For simplicity, we use TypeA $ TypeD to notate them.Figure 4 shows that TypeA $ TypeD have diverse volatilities and tendencies, which can reflect different market situations.Table 1 summarises the statistics of volatility on these datasets.
In our experiments, we train and test each forecasting model via the five-fold time-series cross-validation executions.Table 2 records the mean and std results of six regressors on each GBM time-series dataset in terms of the five metrics.The entries in table with the best performance (the lowest RMSE, MAD, MedAD and MAPE, and the highest R 2 ) are highlighted in bold.Each entry in Table 2 has two parts: the first denotes the mean value of the cross-validation results, and the second stands for plus or minus the standard deviation.From Table 2, we can obtain the following: Our HDNN model yields better generalisation capability than other models in all cases.Specifically, HDNN exhibits the smallest RMSE, MAD, MedAD and MAPE with the largest R 2 among these methods.It confirms the effectiveness of our HDNN in forecasting time-series data with different volatility levels.The comparison results between HDNN and NN show that the feature embedding component helps HDNN construct the informative temporal embedding space.As a result, it can better understand the volatility and tendency within the time-series data.The machine learning models (HDNN, NN, and SVR) outperform the traditional ARCH based methods (GARCH and EGARCH), with a lower forecasting error and variance in most cases.GARCH and EGARCH formulate the volatility in the linear sense, which may not be suitable for high volatility cases.

Results on the real-world Stock Index datasets
In this subsection, we apply our HDNN to forecast the volatility of five 5-minute high-frequency real-world Stock Index datasets.These datasets 1 come from the CSI300 Index (000300.SS), SSE50 Index (000016.SS), ChiNext Index (399006.SZ), S&P 500 Index (.SPX), and NASDAQ Composite (.IXIC), described as follows:  The CSI300 Index is a capitalisation-weighted stock market index designed to reflect the performance of the top 300 A-shares on the Shanghai Stock Exchange (SSE) and the Shenzhen Stock Exchange (SZSE).It consists of the 300 constituent stocks with the largest market capitalisation and liquidity from the entire universe of A-share enterprises in China.
The SSE50 Index consists of the 50 most representative A-shares from the SSE by the scientific and objective method.It aims to reflect the complete picture of those excellent quality enterprises, which have the most influence within SSE.The ChiNext Index is free-float capitalisation-weighted and comprises the 100 largest liquid A-shares listed on the ChiNext Market of SZSE.As the benchmark index of the ChiNext Market, the ChiNext Index aims to reflect the performance of innovative businesses and emerging industries.
The S&P 500 Index is a stock market index tracking the performance of 500 large companies listed on stock exchanges in the United States.It is one of the most commonly followed equity indices.
The NASDAQ Composite is a stock market index that includes almost all stocks listed on the NASDAQ stock exchange in the United States.The composition of the NASDAQ Composite is heavily weighted towards companies in the information technology sector.
Specifically, our experiments use the 5-minute high-frequency data of CSI300, SSE50, ChiNext, S&P 500, and NASDAQ observed in the last three years: 2019, 2020, and 2021, as shown in Figure 5. Firstly, we conduct an exploratory data analysis on the above index datasets to investigate their characteristics.Take CSI300 in 2020 for example.Figure 6 plots the corresponding intraday log returns (left) and realised volatility (right).Continuous fluctuations appear during different timeframes.Due to the effects of COVID-19, the fluctuations in the first half of 2020 are relatively large, while the fluctuations in the second half tend to be stable.This phenomenon indicates that the financial market is more sensitive to unpredictable events than other  forecasting applications.Moreover, Table 3 summarises the statistics of realised volatility on these datasets.
To verify the forecasting ability of the models, we perform the five-fold time-series cross-validation on the above real-world Stock Index datasets.The prediction results of the HDNN, NN, and SVM are better than those of the statistical GARCH and EGARCH models.The reason is that GARCH and EGARCH only consider the linear structure of variance to model volatility, leading to less capability to track the complex high-frequency information.HDNN and NN outperform SVR in most cases.The structure of SVR can be seen as a single hidden layer shallow network, which is suitable for small sample tasks.
For large-scale learning cases, the nonlinear multilayer network of HDNN and NN enjoys more capacity than SVR to exploit the high-frequency volatility trend.HDNN achieves better performance than NN.With the help of the time-series encoding framework, our HDNN can use the CNN to extract the underlying temporal embedding from the high-frequency time-series data.This strategy enables HDNN to learn more informative features for volatility, which helps improve forecasting performance.
To provide more statistical evidence (Hatamlou, 2013;Yu et al., 2017), we employ the Friedman test to verify whether there are significant differences between HDNN and the other regressors on the learning results in Figures 7-11.Tables 4-8 record the rank of regressors obtained via the Friedman test according to each metric.The results show that HDNN is ranked first in the whole situation, followed by NN successively.Now, we calculate the X 2 F value for the Friedman test as follows: where k is the number of regressors, and r i is the rank on the N datasets for the i-th regressor listed in Tables 4-8.In our experiments, k ¼ 6 and N ¼ 15 (3 years with the five-fold time-series cross-validation, i.e., N ¼ 3 Â 5).Let us consider, for example, the RMSE metric on CSI300.The term P k i¼1 r 2 i is computed for the first line of Table 4: Then, substituting k ¼ 6, N ¼ 15, and ( 21) into (20), we obtain Based on the above Friedman statistic X 2 F ¼ 52:8213, we calculate the F-distribution statistic F F with ðk À 1, ðk À 1ÞðN À 1ÞÞ ¼ ð5, 70Þ degrees of freedom     Similarly, we calculate the statistic F F for the other cases, summarised in the 'Statistic' column of Tables 4-8.Additionally, the p-value is provided by the Friedman test.The significance level is set as 0.05.The results reject the null hypothesis (there is no significant difference among the compared regressors.) for these datasets.It indicates that our HDNN is statistically superior to the other forecasting models.Overall, the above results confirm the feasibility of HDNN.

Conclusion
Learning volatility from time-series data plays an essential role in making investment decisions for the financial market.This paper proposes a novel deep neural network model with temporal embedding for volatility forecasting, termed HDNN.The main characteristics of our HDNN are as follows: 1.An elegant time-series encoding strategy is introduced to transform one-dimensional time-series data into two-dimensional GAF images.These images can preserve the temporal relationship among the time-series data.2. With the encoding GAF images, the follow-up CNN can automatically learn the volatility-related feature mappings, which is very challenging for traditional feature engineering methods.The feature embedding component can explore an underlying temporal embedding space from the GAF images via the elaborate CNN.The new representation vectors of the time-series data in the embedding space preserve the temporal information as much as possible.3. HDNN is an end-to-end learning model for volatility forecasting.It can simultaneously carry out temporal feature learning (feature embedding component) and volatility forecasting tasks (regression component).
Note that most existing methods (Henrique et al., 2019;Wang et al., 2021) directly utilise the raw time-series data to build the volatility forecasting model.However, the raw time-series data cannot effectively express the volatility tendency.To address the above issue, our HDNN encodes one-dimensional time-series data as two-dimensional GAF images rather than directly learning from the raw time-series data.Then, the CNN constructs the most volatility-related features in the embedding space.Afterwards, the volatility-related features are used to train the forecasting regressor.Compared with the raw data, these features can better represent the time-series data with more helpful and pure information related to volatility.
Finally, the proposed HDNN model is compared with the state-of-the-art GARCH, EGACH, SVR, and NN models on the GBM synthetic dataset and the real-world high-frequency Stock Index datasets (CSI300, SSE50, ChiNext, S&P 500, and NASDAQ).The extensive results confirm that the HDNN remarkably improves the forecasting ability.Furthermore, the Friedman test results indicate that HDNN is statistically superior to the other forecasting models.

Figure 1 .
Figure 1.The procedure for transforming the time-series data into the encoding GAF images.Source: The authors' illustration.

Figure 3 .
Figure 3.The procedure of time-series cross-validation on rolling windows.Source: The authors' illustration.

Figure 4 .
Figure 4.The illustration of four types of GBM time-series datasets.Source: The authors' illustration.

Figure 6 .
Figure 6.The log returns (left) and realized volatility (right) of CSI300 index in 2020.Source: The authors' illustration.
Figures 7-11 illustrate the box plot of the comparison results of each regressor in terms of RMSE, MAD, MedAD, MAPE and R 2 .Box in these Figures records the metric score of the five-fold

Figure 7 .
Figure 7.The RMSE metric of regressors on the five Stock Index datasets.Source: The authors' illustration.

Figure 8 .
Figure 8.The MAD metric of regressors on the five Stock Index datasets.Source: The authors' illustration.

Figure 9 .
Figure 9.The MedAD metric of regressors on the five Stock Index datasets.Source: The authors' illustration.

Figure 10 .
Figure 10.The MAPE metric of regressors on the five Stock Index datasets.Source: The authors' illustration.

Figure 11 .
Figure 11.The R 2 metric of regressors on the five Stock Index datasets.Source: The authors' illustration.

Table 1 .
The statistics of realised volatility for the four GBM time-series datasets.

Table 2 .
The forecasting results of each regressor on the four GBM time-series datasets, in terms of the five metrics.

Table 3 .
The statistics of realized volatility for the five Stock Index datasets.

Table 4 .
Rank and statistic of the Friedman test on the CSI300 dataset.
Source: The authors' illustration.

Table 5 .
Rank and statistic of the Friedman test on the SSE50 dataset.

Table 6 .
Rank and statistic of the Friedman test on the ChiNext dataset.

Table 7 .
Rank and statistic of the Friedman test on the S&P 500 dataset.

Table 8 .
Rank and statistic of the Friedman test on the NASDAQ dataset.