A Comparative Study of non-deep Learning, Deep Learning, and Ensemble Learning Methods for Sunspot Number Prediction

ABSTRACT Solar activity has significant impacts on human activities and health. One most commonly used measure of solar activity is the sunspot number. This paper compares three important non-deep learning models, four popular deep learning models, and their five ensemble models in forecasting sunspot numbers. In particular, we propose an ensemble model called XGBoost-DL, which uses XGBoost as a two-level nonlinear ensemble method to combine the deep learning models. Our XGBoost-DL achieves the best forecasting performance (RMSE and MAE ) in the comparison, outperforming the best non-deep learning model SARIMA (RMSE and MAE ), the best deep learning model Informer (RMSE and MAE ) and the NASA’s forecast (RMSE and MAE ). Our XGBoost-DL forecasts a peak sunspot number of 133.47 in May 2025 for Solar Cycle 25 and 164.62 in November 2035 for Solar Cycle 26, similar to but later than the NASA’s at 137.7 in October 2024 and 161.2 in December 2034. An open-source Python package of our XGBoost-DL for the sunspot number prediction is available at https://github.com/yd1008/ts_ensemble_sunspot.


Introduction
Human activities and various events on Earth are strongly intertwined with solar activity (Pulkkinen 2007;Hathaway 2015). An increase in solar activity includes increases in extreme ultraviolet and X-ray emissions from the Sun toward Earth, resulting in the atmospheric heating that can be harmful to spacecrafts, satellites and radars (Walterscheid 1989;Ruohoniemi and Greenwald 1997;Lybekk et al. 2012). Increased solar flares and coronal mass ejections due to high solar activity can damage the communication and power systems on Earth (Lewandowski 2015). The approximately 11-year cyclic pattern of solar activity seems easily predictable, but the cycle varies in both amplitude and duration (Cameron and Schüssler 2017). Accurate prediction of solar activity is thus of great interest to estimate the expected impact of space weather on space missions and societal technologies.
Solar cycles are also considered to impact many aspects of human health. Juckett and Rosenberg (1993) observed longer human longevities during solar cycle minimums. rithm to deal with missing values, and the cache-aware access technique to effectively utilize hardware resources. Besides these advantages, we essentially use XGBoost as a two-level ensemble method, because XGBoost itself is an ensemble of decision trees and each decision tree therein nonlinearly combines the forecasting models. The twolevel nonlinear ensemble nature of our usage of XGBoost makes it potentially more powerful than single-level ensemble methods such as those aforementioned.
In this paper, we conduct a comparative study of non-deep learning and deep learning models as well as their ensemble models for the sunspot number prediction. We compare the three important non-deep learning models, Seasonal Autoregressive Integrated Moving Average (SARIMA; Box and Jenkins 1976), Exponential Smoothing, and Prophet, and the four popular deep learning models, LSTM, GRU, Transformer, and Informer. We also consider their ensemble models from basic ensembles, the errorbased method, linear regression, and XGBoost.
The contributions of the paper are summarized below.
• We compare three important non-deep learning models (SARIMA, Exponential Smoothing, and Prophet), four popular deep learning models (LSTM, GRU, Transformer, and Informer), and their five ensemble models (via mean, median, the error-based method, linear regression, and XGBoost) in predicting the sunspot number. • We propose to use XGBoost as a two-level nonlinear ensemble method to combine the results from time-series forecasting models. Our XGBoost-DL model, which uses XGBoost to ensemble the four deep learning models, has the best performance in comparison with other considered base and ensemble models as well as the prediction from the National Aeronautics and Space Administration (NASA). • We provide an open-source Python package of the XGBoost-DL model for the sunspot number prediction at https://github.com/yd1008/ts_ensemble_ sunspot. • We use the proposed XGBoost-DL model to forecast the Solar Cycles 25 and 26, and compare the result with the NASA's prediction.
The rest of this paper is organized as follows. Section 2 introduces the seven aforementioned time-series forecasting methods. Section 3 describes the five ensemble learning methods including the proposed XGBoost-based ensemble. Section 4 compares all considered base and ensemble models as well as NASA's report for the sunspot number prediction. Section 5 makes concluding remarks.

Forecasting Methods
In this section, we introduce seven time-series forecasting methods that will be compared in our experiments for predicting sunspot numbers, including the three non-deep learning methods SARIMA, Exponential Smoothing, and Prophet, and the four deep learning methods LSTM, GRU, Transformer, and Informer. SARIMA and Exponential Smoothing are classical non-deep learning methods widely used for decades in forecasting seasonal time series (Box and Jenkins 1976;Holt 1957;Winters 1960). Facebook's Prophet is relatively new compared to the former two but has been used internally in Facebook for years and also allows for seasonality (Taylor and Letham 2018). LSTM and its simpler variant GRU are two most popular deep-learning models based on gating mechanisms to tackle sequential prediction (Hochreiter and Schmidhuber 1997;Cho et al. 2014;Chung et al. 2014). Transformer is an innovative deep-learning network instead using a self-attention mechanism (Vaswani et al. 2017), and its variant Informer, the winner of AAAI-21 Outstanding Paper Award, improves Transformer on long-sequence time series forecasting (Zhou et al. 2021). We are aware of many other models (e.g., Tabassum, Rabbani, and Omar 2020;Benson et al. 2020;Beltagy, Peters, and Cohan 2020;Child et al. 2019), most of which are similar to or variants of the above seven methods, but we only consider these seven methods that are most widely recognized in time series forecasting.
We denote a univariate time series by y = (y 1 , y 2 , . . . , y t , . . . ), where y t is the observation at time t.
2.1. Non-deep learning methods 2.1.1. SARIMA SARIMA (Box and Jenkins 1976), as an ARMA variant, is one most commonly used model in the past decades to forecast trend and seasonal time series. It uses a mix of autoregressive terms, moving average terms, and differencing procedures for both nonseasonal and seasonal components to represent the current value in a time series based on prior observations. Specifically, an SARIMA(p, d, q)(P, D, Q) s model is defined by where (φ i , θ i , Φ i , Θ i ) and (p, q, P, Q) are the parameters and the orders of the nonseasonal autoregressive, the non-seasonal moving average, the seasonal autoregressive, and the seasonal moving average terms, respectively, t is white noise, L is the lag operator, d is the order of non-seasonal differencing, D is the order of seasonal differencing, s is the span of the seasonality, and c is a constant.

Exponential Smoothing
Proposed in 1950s, Exponential Smoothing (Brown 1956;Holt 1957;Winters 1960) remains to be one of the widely used time series forecasting methods. Although there are several types of Exponential Smoothing, we focus on the Holt-Winters Exponential Smoothing method (Holt 1957;Winters 1960), which can model the trend and seasonality of a time series. The Holt-Winters method and SARIMA have shown comparable performances in a number of previous studies (Lidiema 2017;Liu et al. 2020;Rabbani et al. 2021).
Also known as Triple Exponential Smoothing, the Holt-Winters method comprises three smoothing equations for the level l t , the trend b t , and the seasonality s t , respec-tively. There are two main models of this method, the additive seasonal model and the multiplicative seasonal model where α, β, γ ∈ [0, 1] are smoothing parameters, m is the frequency of the seasonality, and h + m = [(h − 1) mod m] + 1 for h ≥ 1. The two models differ in the nature of the seasonality. The additive model ought to be considered when the seasonal variations are stable over time, while the multiplicative model is used when the seasonal variations are changing proportional to the level of the time series. Due to the variability of the amplitude of sunspot cycles, following Tabassum, Rabbani, and Omar (2020), we use the multiplicative model for the sunspot number prediction.

Prophet
Facebook's Prophet (Taylor and Letham 2018) is a more recent time series forecasting algorithm compared to the previous two. Despite some commonalities with SARIMA and Exponential Smoothing, it provides a more intuitive approach to model the trend and seasonality of time series by incorporating more flexibilities in its configuration.
Prophet has three essential components: trend b t , seasonality s t , and holiday h t . The holiday option allows Prophet to adjust the forecast that may be affected by holidays or major events. The full model with a logistic trend term is where t is the error term, C is the carrying capacity that is the maximum value of the logistic curve, k is the growth rate that controls the steepness of the curve, m is an offset parameter corresponding to the curve's midpoint, the seasonality s t is expressed by a standard Fourier series with parameters {α n , β n } N n=1 and a regular period P that the time series has, D i is the set of dates for holidays, and κ ∈ R L is the change in the forecast caused by holidays.

Deep learning methods
Although non-deep learning methods can handle a wide range of time series forecasting tasks, their theoretical limitations often prevent them from being directly applicable to data or modeling complex non-linearity in data, as well as computational complexity makes them impractical to large datasets. Therefore, deep learning techniques such as Rucurrent Neural Networks (RNNs) were introduced (Lara-Benítez, Carranza-García, and Riquelme 2021). RNNs have shown better performance than non-deep learning methods in time series forecasting due to the ability to deal with longer sequences and better capture complex temporal dependencies (Tsui et al. 1995;Zhang and Man 1998).

LSTM
The LSTM network (Hochreiter and Schmidhuber 1997) is one most popular model in the RNN family. The vanilla RNN suffers from vanishing gradients and is not capable to achieve desirable results for long sequence data (Hochreiter 1998). The LSTM network mitigates this issue rising from long-term dependencies by introducing a gated memory cell architecture, which controlls the information flow with three gates: an input gate i t , an output gate o t , and a forget gate f t . The LSTM cell is formulated as follows: where x t is the input which is the hidden state h t from the previous layer and is y t for the first hidden layer, σ(·) is the sigmoid function, tanh(·) is the hyperbolic tangent function, h t is the hidden state, c t is the state of the memory cell, c t is the candidate state of the memory cell, and W and U are the weights of the input and recurrent connections as well as b is bias with subscripts i, f , o and c for the input gate, forget gate, output gate, and memory cell, respectively.
In each hidden layer of the LSTM network, a sequence of LSTM cells are aligned side-by-side and input data are sequentially fed into each cell. LSTM's capability of communicating across multiple cells comes from the hidden states and the cell states. The hidden state carries over information from the previous cell to the next and a new hidden state is generated from each LSTM cell, while the cell state selectively stores the past information. The input, output, and forget gates generate a new hidden state and update the the cell state in the following procedure. The input gate determines whether new information will be added to the memory in two steps: i t uses a sigmoid function to decide which information needs to be updated, and c t utilizes a hyperbolic tangent function to select new candidate information to be added to the memory. The forget gate decides what information should be discarded from the memory by applying a sigmoid function to the previous hidden state h t−1 and current input value x t . The output of the forget gate ranges between 0 and 1, with 0 indicating complete removal of the previously learnt value and 1 indicating retention of all information. The output gate determines what will be generated as the output. o t acts like a filter by selecting the relevant information from the memory with a sigmoid function to generate an output value, which is then multiplied with the cell state passing through a hyperbolic tangent function to form the representations in the hidden state. The structure of the LSTM cell is illustrated in Figure 1.

GRU
The GRU Chung et al. 2014) network is another well-known RNN model using a gating mechanism similar to that in LSTM, but it has a simpler cell architecture and is computationally more efficient (Torres et al. 2021; Lara-Benítez, Carranza-García, and Riquelme 2021). The GRU cell only has two gates, an update gate z t and a reset gate r t . The update gate decides the amount of previous information to be passed to the next state, helping capture long-term dependencies in the sequence. The reset gate determines how much of the past information to neglect and is responsible to learn short-term dependencies. The GRU cell is formulated by Figure 2. The structure of the GRU cell.
where the notation is similar to that in (1). The functionalities of the gates used in GRU are also similar to those in LSTM. Figure 2 shows the structure of the GRU cell.

Transformer
Since introduced in 2017, Transformer (Vaswani et al. 2017), a solely self-attention based encoder-decoder network, has become the state of the art model in natural language processing, along with a number of its variants (Devlin et al. 2018;Yang et al. 2019;Liu et al. 2019). Recently, Transformer and self-attention based models have gained increasing popularity in time series tasks (Wu et al. 2020;Zhou et al. 2021;Wen et al. 2020;Li et al. 2019). Transformer abandons the recurrent layers of RNNs that process data of the input sequence one after another. It instead uses a self-attention mechanism, which ditches the sequential operations and can access any part of the sequence, to capture global dependencies and enable parallel computation. The core of Transformer is the scaled dot-product self-attention written as where Q, K and V are matrices with the i-th rows being the i-th query, key and value vectors, and d k is the dimension of the key vectors. In a self-attention layer, input data pass through three separate linear layers to form the query, key, and value matrices. Dot products of queries and keys are then calculated. In masked attention layers, a mask of the same size as the dot product matrix, with the upper triangle of the mask  . Transformer's encoder-decoder architecture. The left four blocks form an encoder layer and the right six blocks form a decoder layer. The encoder's input is first passed through a multi-head attention block, followed by residual connection and layer normalization. The output then goes through a feedforward layer, followed by another residual connection and layer normalization. The decoder's input is fed first into a masked multi-head attention block, then residual connection and layer normalization. The procedure for the next four blocks is the same as for the encoder, except that the encoder output is fed into the multi-head attention block with the decoder output from the second block.
having values of −∞ and 0's elsewhere, is added to the dot product to prevent values in a sequence from attending to succeeding ones. The dot product matrix scaled by √ d k is then fed into the softmax function. The attention scores are calculated by the dot product between the output of the softmax function and the value matrix. In multi-head attention layers, multiple self-attention layers are stacked with each layer consisting of different sets of weights. The final attention scores are generated by combining attention scores calculated in parallel from each self-attention layer. To use the sequence-order information, Transformer adopts a positional encoding mechanism based on sine and cosine functions. The encoded values are added to the input data, indicating positional information of each value in the sequence, such that Transformer can distinguish values in one position from another without requiring specific order of the input data. Figure 3 illustrates Transformer's encoder-decoder architecture.

Informer
The quadratic computational complexity of dot-product self attention and the heavy memory usage in stacking layers are major concerns in dealing with long input sequences. There have been some attempts to address these issues (Child et al. 2019;Li et al. 2019;Beltagy, Peters, and Cohan 2020;Zhou et al. 2021). We focus on the award-winning method Informer (Zhou et al. 2021).
Informer is proposed as an enhancement of Transformer on long-sequence time series forecasting. Informer adopts a ProbSparse self-attention mechanism which selects only the most dominant u queries for each key to attend to: where Q only contains the top-u queries chosen under the query sparsity measurement with Kullback-Leibler divergence which is further approximated efficiently with random sampling. The ProbSParse self-attention allows Informer to reduce both time complexity and memory usage from O(L 2 ) down to O(L log L) for input length L. In addition, Informer allows longer input sequences by using the self-attention distilling: where [·] AB is the attention block, X t j+1 is the t-th sequence in the j-th attention layer, MaxPool(·) is a max-pooling layer with stride 2, ELU(·) is the ELU activation function, and Conv1d(·) is a 1-D convolutional filter with kernel size of 3. By distilling, a more condensed feature map is passed from one attention layer to the next, while information is largely preserved. The corresponding memory usage is O((2 − )L log L) instead of O(JL log L) for J-stacking layers, where is a small number.

Ensemble Learning Methods
Ensemble learning combines the results from multiple models to produce predictions (Dong et al. 2020). The combined results usually exhibit better performance because ensemble methods can often decrease the likelihood of overfitting, reduce the chance of being trapped in a local minimum, and extend the size of the search space (Sagi and Rokach 2018). In time series forecasting tasks, commonly used ensemble methods include basic ensemble methods, error-based method, and machine learning methods. In the following sections, Y = ( y ij ) ∈ R n×m denotes the matrix of predictions from m forecasting models at n time points, y ij is the predicted value from the j-th model at the i-th time point, and y i is the final prediction at the i-th time point.

Basic ensemble methods
Basic ensemble methods combine the predictions from all base models by some simple functions such as mean and median: y i = mean( y i1 , . . . , y im ), y i = median( y i1 , . . . , y im ).
Although these methods are straightforward and simple to comprehend, they ignore the possible relationship among base models and thus are not adequate for combining nonstationary models (Allende and Valle 2017).

Error-based method
Unlike the mean ensemble method that weighs all base models equally, error-based method assigns weights inversely proportional to forecast errors (Adhikari and Agrawal 2014). Specifically, the data is divided into training and validation sets. A specific evaluation metric is selected to compute the forecast error e j on the validation set for the j-th base model fitted on the training set, and then the model's weight is

The final prediction of this method is
The error-based method improves the mean ensemble method by allowing different weights for base models, but its performance is still highly dependent on the forecasting performance of each base model.

Linear regression
The mean, median, and error-based ensemble methods are all linear ensemble methods in the form of (3). The linear ensemble problem can be written as the linear model: where the forecasts of base models { y ij } m j=1 are the features, and the true observation y i is the target. Linear regression, or the least squares method, provides the optimal weights that minimize the sum of squared errors n i=1 ε 2 i . Formulate (4) in the matrix form y = Y w + ε.
The optimal weight vector given by linear regression is w = Y † y with Y † the Moore-Penrose pseudo-inverse of Y (Penrose 1956).

XGBoost
As ensemble learning is essentially a regression problem with features { y ij } m j=1 and target y i , any machine learning algorithm for regression is applicable as an ensemble method. In addition to linear regression, there are many nonlinear regression methods such as kernel regression (Li and Racine 2007), support vector regression (Cherkassky and Ma 2004), and tree-based regression algorithms (Breiman et al. 1984;Breiman 2001;Friedman 2001).
We consider the state-of-the-art nonlinear method, XGBoost (Chen and Guestrin 2016), a highly efficient and effective implementation of gradient boosted decision trees. XGBoost aims to solve the objective function and Ω(f ) = γT + 1 2 λ ω 2 2 , where l is a differentiable convex loss function that measures the difference between the prediction y i and the target y i , x i is the vector of input features, {f k } K k=1 are decision trees, and Ω(f ) is the penalty term with tuning parameters γ and λ to control the model complexity in which T and ω are the number of leaves and the leaf weights in the tree. The first term is the objective of the traditional gradient tree boosting, while the second term added by XGBoost is to prevent overfitting.
To apply XGBoost to ensemble multiple forecasting models, we simply let the input vector be the predictions of the base forecasting models, that is, This is an innovative use of XGBoost, in contrast to its ordinary usage in supervised learning where explicit features are ready to be the input. In other words, XGBoost can not be directly applied to predict the time series {y i } i≥0 , since we only have the time series itself and the associated time. If time is used as the input, then the XGBoost model is approximately equivalent to a high-order polynomial of time, which is not recommended in time-series literature because the resulting forecast is often unrealistic when the model is extrapolated (Hyndman and Athanasopoulos 2018). Alternatively, if observations at previous time steps are the input, the XGBoost model resembles the autoregressive model that only captures short-range temporal dependence (Brockwell and Davis 1991). Instead of building a XGBoost-based time series model from scratch, we use XGBoost to ensemble the predictions from existing forecasting methods.
We intrinsically use XGBoost as a two-level ensemble method here to combine the predictions of multiple forecasting methods. As shown in (5), XGBoost itself is an ensemble method that sums the results of K decision trees, each of which is a base model. In addition to the ensemble nature of XGBoost, each decision tree therein nonlinearly ensembles the forecasts of base time-series models. The entire procedure turns out to be an ensemble model that has two levels of ensembles, which is expected to perform better than single-level ensemble methods such as those mentioned above.
We thus propose a model called XGBoost-DL, which uses XGBoost to ensemble the predictions from the four deep learning models LSTM, GRU, Transformer and Informer. Our XGBoost-DL is simple yet novel and effective, because it takes advantage of our innovative use of XGBoost as a two-level nonlinear ensemble method and the superior performance of these deep learning models in time-series forecasting, and indeed achieves the best result in our experiment.

Data description
The sunspot number dataset is obtained from the website of World Data Center SILSO, Royal Observatory of Belgium, Brussels 1 . The dataset contains 3277 records of monthly averaged total sunspot number from January 1749 to January 2022. We also consider NASA's past forecast (from April 1999 to January 2022) and most recent forecast (from February 2022 to October 2041) obtained from its website 2 for comparison with considered methods. NASA uses a linear regression model based on the 13-month smoothed and Lagrangian interpolated data to predict sunspot numbers, for which a brief description is given in Suggs (2017) but detailed implementation and computer code are not disclosed.
The observed sunspot number data is split in chronological order into training (2160 records from January 1749 to December 1928), validation (843 records from January 1929 to March 1999), and testing (274 records from April 1999 to January 2022) sets. The validation set monitors the training for hyperparameter tuning. The testing set evaluates the performance of each trained model on unseen data. The ratio of training and validation sets is 7/3, and the testing set is concurrent with NASA's past forecast. The best model will be used to predict sunspot numbers from February 2022 to October 2041, covering the remaining portion of current Solar Cycle 25 and the coming Solar Cycle 26, in alignment with NASA's current forecasting time range.

Implementation details
We compare the three non-deep learning models, SARIMA, Holt-Winters multiplicative Exponential Smoothing, and Prophet, and the four deep learning models, LSTM, GRU, Transformer, and Informer. We also consider their ensemble models from the mean and median ensemble methods, the error-based method, linear regression, and our proposed ensemble method via XGBoost.
We adopt the two evaluation metrics, the root mean squared error (RMSE) and the mean absolute error (MAE), which are widely-used in time series forecasting problems (Hyndman and Athanasopoulos 2018), to assess the prediction performance of considered methods: where y i and y i are the predicted value and the true value, respectively, and n is the number of predictions.
All the experiments are performed in Python 3.8 (Van Rossum and Drake 2009) environment. All deep learning methods are implemented with Pytorch 1.9.0 (Paszke et al. 2019). SARIMA, basic ensemble methods, and the error-based ensemble method are implemented with Merlion 1.0.0 (Bhatnagar et al. 2021). Exponential Smoothing and the linear regression ensemble method are performed with Darts 0.15.0 (Herzen et al. 2021). The XGBoost ensemble method is carried out using xgboost 1.5.1 (Chen and Guestrin 2016). Models are trained on two NVIDIA RTX8000 GPUs. We use Tune  in Ray API 1.9.0 ) for tuning hyperparameters with best values chosen as the minimizers of the RMSE on the validation set. All deep learning models are trained up to 200 epochs by the AdamW optimizer, and earlystopping is used to prevent overfitting. Source codes with pre-trained models, tuning ranges of model-specific hyperparameters, and best configurations for all models used in this study can be found at our GitHub repository 3 .
All ensemble methods are applied independently to ensemble the forecast outputs of the three non-deep learning models, the four deep learning models, and all the seven models, respectively. For basic ensemble methods, training is not required and performance is evaluated directly on the testing set. The ensemble weights from the error-based method are calculated by the inverse of the mean squared error (MSE) of each base model on the validation set. The linear-regression ensemble method computes the weights from the training and validation sets. MSE is chosen as the loss function for training XGBoost-based ensemble models. Table 1 presents the performance results of the seven base models on the testing set. All the four deep learning models outperform the three non-deep learning models. The two attention-based deep learning models (Transformer and Informer) exhibit better performance than the two RNN models (LSTM and GRU). In particular, Informer achieves the lowest RMSE and MAE with values 29.90 and 22.35, respectively. All the deep learning models except LSTM show more accurate results than NASA with Informer having 38.20% lower RMSE and 41.87% lower MAE. Figure 4 (a) displays the true sunspot numbers and their estimates from deep learning models and NASA for the testing set. It is observed that predictions of deep learning models generally follow along patterns in the ground truth data, while the LSTM and NASA have large deviations around years 2003 and 2020. Figure 4 ( b) shows the forecasts from non-deep learning models. The three models fail to predict the true trend of the series due to highly inaccurate forecasts of the peaks and troughs, albeit have an approximately 11year cycle. With the lowest RMSE and MAE among the three models, the estimates from SARIMA are rather accurate at the beginning of the testing time horizon but notably deviate from the actual values thereafter. The improved performance of deep learning methods over non-deep learning ones can be attributed in part to the formers' ability to effectively capture non-linearities, as observed in the complex variations in solar cycles, whereas the latters primarily use linear combinations. Furthermore, the two attention-based methods (Transformer and Informer) appear to better capture the trend by using self-attention mechanisms that allow dependencies over a longer period of time to be included into forecasts. Table 2 summarizes the testing results from the ensemble models. With larger MAE or RMSE compared to those in Table 1, all ensembles of non-deep learning models and those of all base models fail to improve the performance of SARIMA and Informer, respectively, since non-deep learning models are highly inaccurate in predicting the peaks and troughs of the time series as shown in Figure 4 (b). All ensembles of deep learning models perform better than those of non-deep learning models due to the solid base on the well-performing deep learning models, and ensembles of all base models have performances between those of the former two.

In-sample forecast
In particular, the best ensemble model, our XGBoost-DL model, i.e., the XGBoost ensemble of deep learning models, significantly boosts the performance, with the smallest RMSE of 25.70 and MAE of 19.82 which are 14.05% and 11.32% lower than those of the best base model Informer. Figure 5 illustrates the forecasts from the three types of ensemble models. All deep-learning ensemble models show strong forecasting abilities in capturing the true trend and seasonality of the series.

Future forecast
We further consider the prediction of future sunspot numbers in current Solar Cycle 25 and the coming Solar Cycle 26. For conciseness of presentation, we only consider XGBoost-DL and NASA for the future forecast, because XGBoost-DL is the best model for the in-sample forecast in Section 4.3.1, and NASA is the authority in this field, whose forecast serves as the benchmark here due to no ground truth available. We fine-tune our XGBoost-DL model using the entire sunspot number data with a new training and validation split of ratio 7/3 (data from January 1749 to February 1940 for training and the rest for validation). Figure 6 shows the forecast from our XGBoost-DL model in comparison with NASA's. Our forecast indicates that the Solar Cycle 25 will reach a peak sunspot number of 133.47 in May 2025 and Solar Cycle 26 will have a peak number of 164.62 in November 2035. According to our prediction, the two Solar Cycles will be overall stronger than the past Solar Cycle 24. NASA's forecast shows similar but earlier peak values of 137.7 for Solar Cycle 25 in October 2024 and 161.2 for Solar Cycle 26 in December 2034. The similarity of the peaks predicted by XGBoost-DL and NASA moderately shows the plausibility of our XGBoost-DL. But the slightly different time of their predicted peaks is noteworthy to researchers who use NASA's prediction, because it is inferior to XGBoost-DL in the in-sample forecast shown in Section 4.3.1.
Although little work has been done on the sunspot number forecast for Solar Cycle 26, there are a number of recent forecasts for Solar Cycle 25 in the literature with the peak sunspot number ranging from 57.24 to 228.9 and occurring between 2022 and 2026. Our forecast is around the middle of the receptive ranges of magnitude and time. Labonville, Charbonneau, and Lemerle (2019)

Conclusion
We compare three non-deep learning models (SARIMA, Exponential Smoothing, and Prophet) and four deep learning models (LSTM, GRU, Transformer, and Informer) to forecast sunspot numbers in this study. The deep learning models outperform the non-deep learning models with lower RMSE and MAE. The deep learning models use deep neural networks, which theoretically can approximate any sophisticated nonlinear functions (Haykin 2009), to better model the complex nonlinear temporal dependencies in time series, in contrast to the non-deep learning models that primarily use linear combinations of time series patterns. In particular, Transformer and its variant Informer are superior over LSTM and GRU, since the former two can capture global dependencies in time series by adopting self-attention mechanisms which ditch the sequential operations of the recurrent layers in the latter two and thus provide access to any part of the sequence. The best-performing deep learning model, Informer, en-   hances Transformer on long-sequence time series forecasting by using the ProbSparse self-attention mechanism and the self-attention distilling operation.
Additionally, five ensemble learning methods (via mean, median, the error-based method, linear regression, and XGBoost) are applied to the forecast results of nondeep learning models, deep learning models, and all base models, separately. Ensemble models based on deep learning models have more accurate predictions than those based on non-deep learning models or all base models. Our proposed XGBoost-DL model that uses XGBoost to ensemble the four deep learning models achieves the best performance among all base and ensemble models as well as NASA's forecast. The outstanding result of our XGBoost-DL is owing to its strong two-level nonlinear ensemble architecture formed by decision trees built upon deep learning models. Our prediction indicates that Solar Cycles 25 and 26 will be overall stronger than the most recent Solar Cycle 24, and will have the peak sunspot number of 133.47 in May 2025 and 164.62 in November 2035, respectively, which are similar to but later than the peaks forecast by NASA at 137.7 in October 2024 and 161.2 in December 2034.
There is still room to improve our current work. Our study only considers the point forecasting without taking into account the prediction intervals of the forecasted values. Prediction intervals quantify the forecast uncertainty and are useful in decision making by providing more information to mitigate the risk associated with point forecasts. Although accurate computation of prediction intervals for non-deep learning or deep learning methods is available, e.g., by using quantile regression (Wen et al. 2017), the precise estimation method of prediction intervals for ensemble learning is lacking. Besides, some researchers (Labonville, Charbonneau, and Lemerle 2019;Prasad et al. 2022) investigate the sunspot number prediction by only considering the 13-month smoothed monthly sunspot number series instead of its original monthly mean sunspot number series which is used in our study and many others (Pala and Atici 2019;Benson et al. 2020;Tabassum, Rabbani, and Omar 2020). The original monthly data has relatively large fluctuations and thus is more challenging to predict than the smoothed data. However, it may be interesting to see whether ensembling the deep learning models trained from smoothed data together with those trained from original monthly data will enhance the forecasting performance. We leave them for our future work.

Disclosure Statement
No potential conflict of interest was reported by the author(s).

Funding
This work was supported by Dr. Hai Shu's startup fund from New York University.