Effective forecast of Northeast Pacific sea surface temperature based on a complementary ensemble empirical mode decomposition–support vector machine method

Abstract The sea surface temperature (SST) has substantial impacts on the climate; however, due to its highly nonlinear nature, evidently non-periodic and strongly stochastic properties, it is rather difficult to predict SST. Here, the authors combine the complementary ensemble empirical mode decomposition (CEEMD) and support vector machine (SVM) methods to predict SST. Extensive tests from several different aspects are presented to validate the effectiveness of the CEEMD-SVM method. The results suggest that the new method works well in forecasting Northeast Pacific SST at a 12-month lead time, with an average absolute error of approximately 0.3 °C and a correlation coefficient of 0.85. Moreover, no spring predictability barrier is observed in our experiments.


Introduction
The oscillation of sea surface temperature (SST) has substantial impacts on the climate. Anomalously high SST near the equator (between 5°S and 5°N and the Peruvian coast) causes the El Niño phenomenon, while low SST in this area brings about the La Niña phenomenon, both of which exert considerable influence on temperature, precipitation, and wind globally (Luo 2000; Yuan, Yang, and Li 2012; Lian 2014). Owing to the considerable effect that these phenomena have on humans, many efforts (Feng et al. 2007; Shang et al. 2013; Meng, Lin, and Tang 2015) have been made to analyze SST. In doing so, it has become clear that accurate observations and effective predictions of SST are very important. However, owing to its strongly nonlinear, non-stationary, and strongly stochastic nature, SST is rather difficult to analyze and predict. To the best of our knowledge, no single method has been verified to predict SST accurately.
Empirical mode decomposition (EMD) is a signal processing method originally proposed by Huang et al. (1998). This method decomposes original signals into several intrinsic mode functions (IMFs) with different frequencies.
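As a compact way to state this decomposition, the signal can be written as the sum of its IMFs and a remainder. The symbols x(t), c_i(t), r_n(t), and n are notation assumed here for illustration and are not taken from the text:

```latex
x(t) = \sum_{i=1}^{n} c_i(t) + r_n(t)
```

Here c_i(t) denotes the i-th IMF, ordered from the highest to the lowest frequency, and r_n(t) denotes the remainder left after n IMFs have been extracted.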
These IMFs can be regarded as a nearly orthogonal decomposition of the original signal; moreover, the resulting IMFs are far less non-stationary than the original signal. The EMD method has been applied widely in engineering, meteorology, and some other fields (Chen 2012; Yu et al. 2014; Liang et al. 2015). However, the EMD algorithm has its own defects in practical applications, for instance the endpoint effect and mode confusion. The endpoint effect has been addressed by several approaches, including the extension technique (Xu 2009; Shi et al. 2014). Meanwhile, in order to resolve the problem of mode confusion, Wu and Huang (2009) proposed the ensemble empirical mode decomposition (EEMD) algorithm, which improves the EMD method by adding some random white noise to balance the random error in the original signal. In their study, several EMDs were run for the same signal by adding some random white noise, allowing them to obtain the final IMFs by averaging the resulting IMFs in each run. However, the accumulation of man-made white noise always introduces large reconstruction error in the EEMD method. This problem, though, can be solved to a large extent via complementary ensemble empirical mode decomposition (CEEMD), as suggested by Yeh and Shieh (2010). In this paper, we apply the CEEMD algorithm to decompose the SST data with small reconstruction error for our prediction.
There are many existing approaches, such as curve fitting, mean generating function periodic extrapolation, neural networks, and the support vector machine (SVM) method, for predicting nonlinear time signals. Here, we are interested in the SVM method, which is a machine learning algorithm proposed by Vapnik (1995). Based on Vapnik-Chervonenkis dimension theory and the principle of structural risk minimization, the SVM method has its own advantages in the prediction of nonlinear time signals. Examples of its application (Chen 2012; Cai, Zhang, and Yang 2014) can be found in several fields, such as computer engineering, control and decision-making, and meteorology.
We combine the CEEMD and SVM methods in this paper to construct a new procedure for SST forecasting. Extensive numerical tests are presented to verify the feasibility of our method.

SST data
Observations show that the SST in the Northeast (NE) Pacific Ocean has been higher than usual in the past few years. In particular, from 2014 to the first half of 2016, the SST in this region was 2 °C higher than normal, and has had obvious impacts on both coastal and inland areas (Bond et al. 2015). In this paper, we consider a time series of 408 months of mean values of SST anomaly (SSTA) data in the region (40°-50°N, 150°-135°W). As shown in Figure 1, the observed SSTA series is obviously nonlinear, evidently non-periodic, and strongly stochastic.

CEEMD-SVM method
The CEEMD-SVM method for forecasting the SSTA sequence proceeds as follows:
Step 1. The CEEMD method is applied to decompose the historical SSTA data into several modes (IMFs and the remainder mode, R) with different frequencies.
Step 2. For each available mode, we divide the time sequence into two parts, which serve as the learning and control groups, respectively (the same division is always used for each mode). Then, the SVM is trained and validated using the learning and control groups, respectively.
Step 3. For each available mode, the trained SVM is used to generate the predictive data.
Step 4. The predictive result of the SSTA is reconstructed from the forecasting data of all modes.

Data decomposition by CEEMD
Using the CEEMD method, the original SSTA sequence is decomposed into eight modes (Figure 2), including seven IMFs and the remainder mode, R. It is observed that the frequencies of the seven IMFs decrease successively and the corresponding periods increase. Also, the remainder mode, R, is monotonically increasing, in accordance with the upward trend of the original SSTA data.
Obviously, the modes IMF4-IMF7 are intrinsically regular in time since they have their own characteristic frequencies, while the characteristic frequencies of the first few IMFs, especially IMF1, are indistinct. In other words, the non-periodic and strongly stochastic properties of the original SSTA data are mainly inherited by the first several IMFs, while the regular components in the SSTA data are separated into the other modes, such as IMF4-IMF7 and the remainder mode, R. This characteristic is important for our prediction of the SSTA sequence because, compared with a direct prediction of the observed SSTA data, the prediction accuracy of regular modes could be significantly improved. Naturally, it should be expected that our CEEMD-SVM method generates a better prediction than the SVM method if the reconstruction error of CEEMD is negligible.
The reconstruction error of CEEMD is defined as the absolute value of the difference between the original data and the summation of all decomposed modes. Figure 3 depicts the reconstruction error of our decomposition for the SSTA data. It is apparent that the reconstruction error is around 10⁻¹⁶ °C for most of the total 408 months, and only for 5 months is the error located between 2.5 × 10⁻¹⁶ °C and 5 × 10⁻¹⁶ °C. That is to say, the eight modes in Figure 2 are effective and acceptable.

Twelve-month forecasting experiments
We design a class of 12-month forecasting experiments with the initial time in January to test the accuracy of SVM prediction by running it 10 times. The SSTA data of 2006-2015 are forecasted year by year. The main trend of the predictive series is the same as the real data in the 12-month prediction experiments. One can see that, for each month in 2006-2015, the average absolute error between the simulated SSTA and the observed data is less than 0.4436 °C (Figure 4). The corresponding forecasting errors in three different measures are also reported in Table 1. We find that the forecasting errors are generally small and acceptable. The average error is approximately 0.3 °C, the maximum error is less than 0.8 °C, and the average correlation coefficient between the simulated data and the observed data reaches 0.85. Hence, the 12-month SVM prediction is effective. Note that the final prediction error is accumulated from the forecasting errors of all modes generated by the CEEMD method. We find that the average predicted error of the first mode, IMF1, reaches 0.2 °C and provides the largest contribution to the overall forecasting error; however, this is not unexpected. Owing to its evidently non-periodic and strongly stochastic properties (see Figure 2), IMF1 is rather difficult to forecast in general.

Thirty-six-month forecasting experiments
To examine the predictable time length of the SSTA forecast via the CEEMD-SVM method, we design 36-month forecasting experiments. Similar to the experiments reported in Section 4.1, in this group of experiments the SSTA data are forecasted from 2013 to 2015.
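Both the reconstruction error and the forecasting errors reported in this paper are built on absolute differences between two series, together with a correlation coefficient. A minimal sketch of how such measures might be computed is given below. The function names `reconstruction_error` and `error_measures` and all sample values are hypothetical and chosen for illustration only; they are not the paper's code or data, and the three measures are assumed to be the mean absolute error, the maximum absolute error, and the Pearson correlation.

```python
import math

def reconstruction_error(original, modes):
    """Absolute difference between a series and the sum of its decomposed modes."""
    return [abs(value - sum(mode[i] for mode in modes))
            for i, value in enumerate(original)]

def error_measures(predicted, observed):
    """Return (mean absolute error, maximum absolute error, Pearson correlation)."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("series must have the same length (at least 2)")
    abs_err = [abs(p - o) for p, o in zip(predicted, observed)]
    mean_p = sum(predicted) / len(predicted)
    mean_o = sum(observed) / len(observed)
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for a constant series")
    corr = cov / math.sqrt(var_p * var_o)
    return sum(abs_err) / len(abs_err), max(abs_err), corr

# Hypothetical 12-month SSTA values (°C), for illustration only.
observed = [0.5, 0.7, 0.9, 1.2, 1.0, 0.8, 0.6, 0.4, 0.3, 0.5, 0.7, 0.9]
predicted = [0.4, 0.9, 1.0, 1.0, 1.1, 0.6, 0.5, 0.6, 0.2, 0.4, 0.9, 1.0]
mae, max_err, corr = error_measures(predicted, observed)
print(f"mean abs error = {mae:.3f} °C, max abs error = {max_err:.3f} °C, r = {corr:.3f}")

# Two hypothetical modes that sum exactly to the series they came from.
recon = reconstruction_error([1.5, 2.5], [[1.0, 2.0], [0.5, 0.5]])
```

The same `error_measures` call could be applied to each forecast year separately, which is one way to obtain per-year values like those listed in the error tables.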

Figure 5 depicts the predictive result and the original data for the three years, and Table 2 lists the corresponding forecasting errors in three different measures. It is apparent that the best prediction result appears in the 12 months of 2013, with the minimum average error being 0.3815 °C and the largest correlation coefficient being 0.9112. In the 12 months of 2014, the average prediction error increases to 0.6295 °C and the correlation coefficient decreases to 0.3067. The situation becomes worse in 2015, with the maximum average absolute error being 0.7025 °C and the smallest correlation coefficient being 0.1814. As expected, the forecasting effect of the CEEMD-SVM method becomes worse when the prediction time is prolonged. Overall, the present experimental results suggest that the predictable time length of our CEEMD-SVM method is about 12 months, at least for the current NE Pacific SSTA data.

Forecast experiments via the SVM method
The effectiveness of SVM prediction, including the prediction accuracy, the predictable time length, and the practical effect of different starting months of prediction, will be examined here before we apply it to generate the predictive data of each mode.

Sensitivity of the SSTA forecast to the starting month
According to previous experience, climate forecasting may encounter the so-called spring predictability barrier problem (Zhang, Yu, and Duan 2012). Specifically, the prediction of ENSO events possesses seasonal dependence and the prediction error always increases fastest in spring (April, May), as compared with other seasons. Thus, starting the prediction from different months may affect the forecasting results. To examine this, the 12-month forecasting experiments are run five times, from 2010 to 2014, using twelve starting months (from January to December). The forecasting results based on the different starting months are plotted in Figure 6, and the corresponding prediction errors and correlation coefficients between the forecasting results and the observed data are reported in Table 3. We find that the forecasting results based on the 12 different starting months, from January to December, are acceptable and effective. The average absolute errors are approximately 0.3 °C and the correlation coefficients are no less than 0.73. Additionally, compared with other starting months, forecasts starting in January, February, or March introduce relatively smaller prediction errors. Forecasting errors increase when the prediction starts in April or May, but forecasts starting from June to December successfully avoid the negative influence produced by starting in April or May.
As a result, for the present SSTA data and the CEEMD-SVM method, there is no spring predictability barrier. The reason for the disappearance of this phenomenon is not clear, so we cannot rule out its reappearance when predicting SSTA sequences in other regions using our CEEMD-SVM method.

namics, Institute of Atmospheric Physics, Chinese Academy of Sciences, who assisted in providing the SST data, and also put forward valuable comments. In addition, the authors wish to extend particular thanks to the anonymous reviewers, whose comments were very helpful.

Disclosure statement
No potential conflict of interest was reported by the authors.

Summary
In this paper, the CEEMD and SVM methods are combined to predict the monthly SSTA sequence in the NE Pacific Ocean, which is characterized by strongly nonlinear, evidently non-periodic, and strongly stochastic properties.
(1) The observed SSTA is decomposed using CEEMD, and several modes with different periods are obtained. Compared with the original sequence, the decomposed modes are less non-stationary and more clearly periodic, which lays the foundation for the forecasting.
(2) Owing to the evidently non-periodic and strongly stochastic properties of the first mode, IMF1, it contributes the largest prediction error to the overall forecasting error, with an averaged error of around 0.2 °C.
(3) For the present SSTA data and the CEEMD-SVM method, no spring predictability barrier is observed. However, we cannot explain why this is the case, meaning it might reappear when predicting SSTA sequences in other areas.
(4) From the experimental results, the CEEMD-SVM method works well when forecasting the SSTA data with a 12-month lead time, and the forecasting result becomes worse when the prediction time is prolonged. The suggested method is therefore effective for a 12-month SSTA forecast.
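The decompose, predict per mode, and reconstruct workflow summarized in items (1)-(4) can be sketched in a few lines. This is only a structural illustration under strong assumptions: a moving-average split stands in for CEEMD, a least-squares AR(1) fit stands in for the SVM, the 80/20 learning/control split is arbitrary, and the input is a synthetic series. None of the function names or parameters below come from the paper.

```python
import math

def moving_average(x, w):
    """Centered moving average with window w; edges use the available samples."""
    h = w // 2
    out = []
    for i in range(len(x)):
        window = x[max(0, i - h): i + h + 1]
        out.append(sum(window) / len(window))
    return out

def decompose(x, windows=(3, 13)):
    """Stand-in for CEEMD (NOT CEEMD): peel off fast components step by step.
    Returns modes from fast to slow plus a remainder; together they sum to x."""
    modes, rest = [], list(x)
    for w in windows:
        smooth = moving_average(rest, w)
        modes.append([r - s for r, s in zip(rest, smooth)])
        rest = smooth
    return modes + [rest]

def fit_ar1(series):
    """Stand-in for the SVM (NOT an SVM): least-squares fit of x[t] = a*x[t-1] + b."""
    prev, nxt = series[:-1], series[1:]
    mean_prev, mean_next = sum(prev) / len(prev), sum(nxt) / len(nxt)
    var = sum((p - mean_prev) ** 2 for p in prev)
    cov = sum((p - mean_prev) * (n - mean_next) for p, n in zip(prev, nxt))
    a = cov / var if var else 0.0
    return a, mean_next - a * mean_prev

def forecast(last_value, model, steps):
    """Iterate the fitted one-step model forward for the requested lead time."""
    a, b = model
    out = []
    for _ in range(steps):
        last_value = a * last_value + b
        out.append(last_value)
    return out

def predict_series(x, steps=12, learn_fraction=0.8):
    """Steps 1-4: decompose, train/validate per mode, forecast per mode, sum."""
    total = [0.0] * steps
    control_errors = []
    for mode in decompose(x):
        split = int(len(mode) * learn_fraction)      # same split for every mode
        learning, control = mode[:split], mode[split:]
        a, b = fit_ar1(learning)
        # Validate on the control group with one-step-ahead absolute errors.
        one_step = [abs(a * p + b - c) for p, c in zip(control[:-1], control[1:])]
        control_errors.append(sum(one_step) / len(one_step))
        for k, value in enumerate(forecast(mode[-1], (a, b), steps)):
            total[k] += value                         # reconstruct by summation
    return total, control_errors

# Synthetic 408-month series (trend plus annual cycle), for illustration only.
series = [0.002 * t + 0.5 * math.sin(2 * math.pi * t / 12) for t in range(408)]
modes = decompose(series)
prediction, control_errors = predict_series(series)
print(len(modes), len(prediction), [round(e, 4) for e in control_errors])
```

Keeping the decomposition and the per-mode predictor as separate functions means either stand-in could be swapped for a real CEEMD or SVM implementation without changing the surrounding loop.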
Finally, it is important to highlight that this paper considers SSTA data from 1982 to 2015 only, and only for the region (40°-50°N, 150°-135°W) in the NE Pacific. It may therefore serve as a reference for studying SST prediction in other regions.

Table 3. The absolute forecasting errors from different starting months (units: °C) and the correlation coefficients between the simulated and observed data.