Decision trees using local support vector regression models for large datasets

ABSTRACT Our proposed decision trees using local support vector regression models (tSVR, rtSVR) aim to efficiently handle the regression task for large datasets. The tSVR algorithm learns regression models in two main steps. The first step constructs a decision tree regressor that partitions the full training dataset into k terminal-nodes (subsets); the second step learns an SVR model from each terminal-node to predict the data locally, in a parallel way on multi-core computers. The rtSVR algorithm learns a random forest of decision trees with local SVR models to improve the prediction correctness over the tSVR model alone. The performance analysis shows that our algorithms tSVR and rtSVR are efficient in terms of algorithmic complexity and generalization ability compared to the classical SVR. Experimental results on five large datasets from the UCI repository show that the proposed tSVR and rtSVR algorithms are faster than the standard SVR in training non-linear regression models from large datasets while achieving high prediction correctness. On average, tSVR and rtSVR train 1282.66 and 482.29 times faster than the standard SVR; furthermore, tSVR and rtSVR improve the relative prediction correctness by 59.43% and 63.70%, respectively, compared to the standard SVR.


Introduction
In the last decades, there has been an explosion of data due to very fast progress in computer hardware, the rapidly increasing number of internet users and mobile devices accessing the internet, and the emergence of e-commerce and huge search engines. A report from researchers at the University of Berkeley estimates that about 1 Exabyte (10^9 Gigabytes) of data is generated every year (Lyman et al., 2003). A recent book (Committee on the Analysis of Massive Data; Committee on Applied and Theoretical Statistics; Board on Mathematical Sciences and Their Applications; Division on Engineering and Physical Sciences; National Research Council, 2013) shows that Google, Yahoo!, Microsoft, Facebook, Twitter, YouTube and other internet-based companies, with hundreds of millions of users and billions of daily active users, collect Exabytes of data. Such a huge amount of data raises challenges for data analysis because current state-of-the-art techniques are not well suited to that data scale. It is therefore a high priority to scale up learning algorithms to address massive datasets. Our research focuses on creating new algorithms of decision trees using local support vector regression models to efficiently handle the non-linear regression task for large datasets. Support vector machines (SVM), proposed by Vapnik (1995), and kernel-based methods have shown state-of-the-art performance on many data mining problems, including classification, regression and novelty detection (Guyon, 1999). The SVM algorithm produces some of the most accurate models in practice. Nevertheless, the learning problem requires the solution of a quadratic programming (QP) problem, so the computational cost of an SVM approach is at least quadratic in the number of training datapoints, making SVM impractical for large datasets. There is therefore a need to scale up SVM learning algorithms to handle massive datasets.
This paper extends our work in Tran-Nguyen, Bui, Kim, and Do (2018). We propose new learning algorithms of decision trees using local support vector regression (called tSVR and rtSVR) to efficiently deal with the non-linear regression task on large datasets. Instead of training a global SVR model, as done by the classical SVR algorithm, which is very hard on large datasets, our tSVR algorithm learns in a parallel way an ensemble of local models that are easily trained by the standard SVR algorithm. The tSVR trains regression models in two main steps. The first step uses the decision tree algorithm (Breiman, Friedman, Olshen, & Stone, 1984) to partition the large training dataset into k terminal-nodes (subsets). The idea is to reduce the data size for training local non-linear SVR models in the second step. The tSVR then learns k non-linear SVR models in a parallel way on multi-core computers, in which an SVR model is trained in each terminal-node to predict the data locally. The extension of tSVR is the rtSVR algorithm, which trains a random forest of decision trees with local prediction SVR models to improve the prediction correctness over the tSVR model alone.
The performance analysis shows that our algorithms tSVR and rtSVR are efficient in terms of algorithmic complexity and generalization ability compared to the classical SVR. Experimental results on five large datasets from the UCI repository (Lichman, 2013) show that the proposed tSVR and rtSVR are faster than the standard SVR for the non-linear regression of large datasets while achieving high prediction correctness.
The remainder of our paper is organized as follows. In Section 2, we briefly present the standard SVR algorithm. In Section 3, we illustrate how the tSVR algorithm learns local regression models from large datasets. The rtSVR algorithm of random decision trees with local prediction SVR models is presented in Section 4. The numerical test results are presented in Section 5, before related works are discussed in Section 6. We then conclude in Section 7.

Support vector regression
Let us consider a regression task with m datapoints x_i (i = 1, ..., m) in the n-dimensional input space R^n, having corresponding response values y_i ∈ R. The support vector regression (SVR) proposed by Vapnik (1995) tries to find the best hyperplane (denoted by the normal vector w ∈ R^n and the scalar b ∈ R) that has at most ϵ deviation from the response values y_i. Figure 1 is a simple example of SVR. The training algorithm of SVR pursues this goal with the Lagrangian dual quadratic programming (1), using the Lagrangian multipliers α_i and α_i*:

min (1/2) Σ_{i=1..m} Σ_{j=1..m} (α_i − α_i*)(α_j − α_j*) K(x_i, x_j) + ϵ Σ_{i=1..m} (α_i + α_i*) − Σ_{i=1..m} y_i (α_i − α_i*)
s.t. Σ_{i=1..m} (α_i − α_i*) = 0 and 0 ≤ α_i, α_i* ≤ C for i = 1, ..., m (1)

where C is a positive constant used to tune the margin size and the error, and K(x_i, x_j) = x_i · x_j is a linear kernel function (the dot product of x_i and x_j). The resolution of the quadratic programming (1) gives SV support vectors, for which α_i, α_i* > 0. The predictive hyperplane and the scalar b are determined by these support vectors. The prediction of a new datapoint x is then as follows:

f(x) = Σ_{i=1..SV} (α_i − α_i*) K(x_i, x) + b (2)

Variations on training SVR models use different types of kernel (Cristianini & Shawe-Taylor, 2000). It suffices to replace the linear kernel function K(x_i, x_j) = x_i · x_j with a non-linear one, e.g.:
- a polynomial function of degree d: K(x_i, x_j) = (x_i · x_j + 1)^d
- a radial basis function (RBF): K(x_i, x_j) = exp(−γ‖x_i − x_j‖²)

SVR models are among the most accurate in practice and pertinent to many successful applications reported in classification, regression and novelty detection (Guyon, 1999).
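As a concrete illustration of ϵ-SVR with an RBF kernel, here is a minimal sketch using scikit-learn's SVR class (which wraps LibSVM); the toy data and the values of C, γ and ϵ are illustrative assumptions, not values from the paper:

```python
import numpy as np
from sklearn.svm import SVR

# Toy 1-D regression problem: a noisy sine wave.
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# epsilon-SVR with an RBF kernel: C tunes the margin/error trade-off,
# epsilon is the width of the insensitive tube around the response.
model = SVR(kernel="rbf", C=10.0, gamma=1.0, epsilon=0.05)
model.fit(X, y)

print(len(model.support_))     # number of support vectors
print(model.predict([[1.5]]))  # prediction f(x) for a new datapoint
```

Only the support vectors (datapoints on or outside the ϵ-tube) determine the predictive function, which is why their number sv appears in the generalization analysis later in the paper.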

tSVR algorithm for dealing with large datasets
Platt studied the algorithmic complexity of the SVM approach in Platt (1999). His analysis illustrates that the training complexity of the SVR solution in (1) is at least quadratic in the number of training datapoints (i.e. O(m²)), making the standard SVR intractable for large datasets. This means that learning a model from a full massive dataset using the standard SVR algorithm in the usual way is a challenge due to the very high computational cost.

tSVR for learning local SVR models
Our proposed tSVR algorithm learns an ensemble of local SVR models that are easily trained by the standard SVR algorithm. As illustrated in Figure 2, the tSVR handles the regression task in two main steps. The first step learns a decision tree regressor model (using the decision tree training algorithm of Breiman et al., 1984) that partitions the full training dataset into k terminal-nodes (subsets). The second step then learns k non-linear SVR models in a parallel way on multi-core computers, in which an SVR model is trained in each terminal-node to predict the data locally.
We consider a simple regression task given a response variable y and a predictor (variable) x. Figure 3 shows the comparison between a global SVR model (left part) and 4 local SVR models obtained by tSVR (right part) for this regression task, using a non-linear RBF kernel function with γ = 10, a positive constant C = 10^5 (i.e. the hyper-parameters θ = {γ, C}) and a tolerance ϵ = 0.05.
Since the terminal-node size is smaller than the size of the full training dataset, the standard SVR can easily perform the training task on a terminal-node, due to the quadratic reduction of the complexity compared to learning from the full dataset. In addition, the k local models in the training task of tSVR are independently learnt from the k terminal-nodes. This is a nice condition for training the k local models in a parallel way, to take advantage of high-performance computing, e.g. multi-core computers or grids. We propose to develop the parallel tSVR algorithm in a simple way based on the shared-memory multiprocessing programming model OpenMP (OpenMP Architecture Review Board, 2008) for multi-core computers. The parallel training of tSVR is described in Algorithm 1.

Prediction of a new individual x using tSVR model
A new individual x is predicted by the tSVR model as follows. Firstly, x is pushed down the tree t of the tSVR model from the root to a terminal-node D_p. Then the local SVR model lsvr_p learnt from this terminal-node D_p is used to predict the response value of x.
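The two training steps and this prediction rule can be sketched in a single class using scikit-learn. This is a simplified single-machine approximation of the paper's method: the class name, the mapping of minobj to min_samples_split, and the threaded joblib parallelism (standing in for the paper's OpenMP implementation) are our assumptions.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

class TSVR:
    """Sketch of tSVR: a decision tree partitions the training set into
    terminal-nodes, then one local SVR model is fit per terminal-node."""

    def __init__(self, minobj=200, C=1e5, gamma=10.0, epsilon=0.05, n_jobs=-1):
        self.minobj, self.C, self.gamma = minobj, C, gamma
        self.epsilon, self.n_jobs = epsilon, n_jobs

    def fit(self, X, y):
        # Step 1: partition with a decision tree regressor; min_samples_split
        # plays the role of minobj (minimum datapoints for a split to be tried).
        self.tree_ = DecisionTreeRegressor(
            min_samples_split=self.minobj).fit(X, y)
        leaves = self.tree_.apply(X)  # terminal-node id of each datapoint

        def fit_leaf(leaf_id):
            mask = leaves == leaf_id
            svr = SVR(kernel="rbf", C=self.C, gamma=self.gamma,
                      epsilon=self.epsilon)
            return leaf_id, svr.fit(X[mask], y[mask])

        # Step 2: train the k local SVR models in parallel, one per node
        # (threads stand in for the paper's OpenMP parallelism).
        self.local_ = dict(Parallel(n_jobs=self.n_jobs, prefer="threads")(
            delayed(fit_leaf)(leaf) for leaf in np.unique(leaves)))
        return self

    def predict(self, X):
        # Push each individual down the tree, then ask its local SVR model.
        leaves = self.tree_.apply(X)
        return np.array([self.local_[leaf].predict(x.reshape(1, -1))[0]
                         for leaf, x in zip(leaves, X)])
```

Each local model then sees only on the order of minobj datapoints, which is where the quadratic cost reduction comes from.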

Performance analysis
Let's consider the performance of the learning algorithm tSVR in terms of the algorithmic complexity and the generalization capacity.
We start with the algorithmic complexity of tSVR for building k local SVR models in the parallel way. Suppose that tSVR uses the decision tree regressor to partition the full dataset (with m datapoints) into k balanced terminal-nodes; the terminal-node size is then about m/k, in other words the minimum number of datapoints for a split to be tried is minobj = m/k. The training complexity of a local SVR model for a terminal-node is

O((m/k)²) = O(minobj · m/k) (3)

Therefore, the algorithmic complexity of tSVR for parallel training of k local SVR models on a P-core processor is

O((k/P) · (m/k)²) = O(m²/(k · P)) = O(minobj · m/P) (4)

This complexity analysis illustrates that parallel learning of the k local SVR models in the tSVR algorithm is k·P times faster than building a global SVR model with the standard SVR algorithm (whose complexity is at least O(m²)). Note that this complexity analysis of tSVR excludes the decision tree regressor learnt to split the full dataset. However, training the decision tree regressor has a very low computational cost compared with the quadratic programming solution required by the SVR learning algorithm.
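To make the k·P factor concrete, a short sketch evaluates the theoretical speedup for an illustrative configuration (m matches the YearPredictionMSD dataset used later in the paper; minobj and P match the experimental setup):

```python
# Theoretical tSVR speedup over a global SVR, per the complexity ratio
# O(m^2) / O(m^2 / (k * P)) = k * P, with k = m / minobj terminal-nodes.
m = 400_000        # training datapoints (e.g. YearPredictionMSD)
minobj = 200       # minimum terminal-node size used in the experiments
P = 4              # cores of the experimental machine
k = m // minobj    # number of terminal-nodes
print(k * P)       # prints 8000 (theoretical speedup factor)
```

Observed speedups are of course smaller, since the constant factors and the tree-building cost are ignored here.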
We now state how to assess the generalization capacity of the models learnt by the tSVR algorithm via the margin size of the binary support vector classification (SVC).
The regression task of the SVR can be considered as a classification problem of the SVC, as illustrated by an intuitive geometric approach (Bi & Bennett, 2003). Let x and y be the predictor variable and the response variable, respectively. In the left part of Figure 4, the SVR training algorithm tries to find the optimal plane (w·x − b = 0) that has at most ϵ deviation from the response values y_i. This task can be viewed as the binary classification illustrated in the right part of Figure 4, where the response values y_i increased by ϵ form the positive class D+ and the response values y_i decreased by ϵ form the negative class D−. The optimal plane separating D+ from D− found by the SVC learning algorithm is identical to the solution of the SVR learning algorithm. One can intuitively see that the largest-margin solution (the biggest separation boundary between the two classes) gives the safest prediction model. The generalization ability of SVR models trained by an SVR algorithm (e.g. tSVR) can then be explained in terms of the margin size of the binary SVC.
The generalization capacity of such local models is presented in Vapnik (1991), Bottou and Vapnik (1992), and Vapnik and Bottou (1993). Recently, Do and his colleague (Do & Poulet, 2016b, 2017) illustrated that learning algorithms of local SVC models give a guarantee of the generalization capacity compared to the standard SVM algorithm for global SVC models. To justify the generalization capacity of local SVR models, we turn back to Theorem 5.2 (Vapnik, 2000, p. 139). This theorem states that a large-margin hyperplane has high generalization ability. With a training set containing m datapoints being separated by the maximal margin hyperplanes, the upper bound of the expectation of the probability of test error is as follows:

E[P(error)] ≤ (1/m) E[min(sv, R²/Δ², n)] (5)

where sv is the number of support vectors, R is the radius of the sphere containing the data, Δ is the margin size and n is the number of dimensions. It means that the generalization ability of the maximal margin hyperplane is justified in terms of

min(sv, R²/Δ_X², n)/m (6)

For the training task, the tSVR algorithm trains the decision tree regressor to split the full dataset having m datapoints into k terminal-nodes (the terminal-node size m_k is about minobj = m/k). The generalization capacity of a local SVR model trained by the tSVR algorithm is then assessed in terms of

min(sv_k, R_k²/Δ_{X_k}², n)/m_k (7)

The justification of the generalization ability is based on the comparison between Equation (6) for the global SVR model trained on the full dataset and Equation (7) for a local SVR model trained from a terminal-node. According to Theorem 1 in Do and Poulet (2016b) and Theorems 2 and 3 in Do and Poulet (2017), the margin size Δ_{X_k} of the local SVR model is greater than the margin size Δ_X of the global one. At the same time, the use of a terminal-node (subset) in the training task of tSVR leads to R_k ≤ R and m_k ≤ m. These allow concluding that there is a compromise between the locality (i.e. the terminal-node size minobj and the radius of the sphere containing the data) and the generalization ability (the margin size). Therefore, a local SVR model trained by the tSVR algorithm can maintain the prediction performance compared to the global SVR one. The performance analysis in terms of the algorithmic complexity (4) and the generalization capacity (7) shows that the parameter minobj gives a trade-off in tSVR between the generalization capacity (the prediction correctness) and the computational cost. This can be understood as follows:
- If minobj is small, then the tSVR algorithm significantly reduces the training time, but the models learnt from such small terminal-nodes have a low capacity (i.e. low prediction correctness).
- If minobj is large, then the tSVR algorithm reduces the training time only insignificantly; however, the large terminal-node size improves the capacity (i.e. high prediction correctness).

Learning algorithm of r random tSVR models for dealing with large datasets
The performance analysis shows that there is a trade-off between the algorithmic complexity and the generalization capacity in the tSVR learning algorithm. If the parameter minobj is set small, then the tSVR algorithm significantly speeds up the training time but also lowers the prediction correctness compared to learning a global SVR. To address this problem, our proposed rtSVR algorithm learns a random forest of r decision trees using local prediction SVR models to improve the generalization capacity over tSVR alone.
According to the Bias-Variance framework proposed by Breiman (1996, 2001), the performance of a learning model can be understood through a Bias term and a Variance term. Bias is the systematic error component, independent of the learning sample. Variance is the error component reflecting the variability of the learning model due to the randomness of the learning sample. Ensemble-based learning algorithms (Breiman, 1996, 1998, 2001; Dietterich, 2000) aim to reduce the Variance component and/or the Bias component by using the randomization of the learning sample. An ensemble-based learning algorithm can thus improve the generalization capacity over the use of a single model. Therefore, our proposed ensemble learning algorithm rtSVR trains r random tSVR models to reduce Variance. This means that rtSVR improves the generalization ability of tSVR.

Learning r random tSVR models
The rtSVR algorithm trains the ensemble of r random tSVR models using the tSVR algorithm (described in Algorithm 1 and Figure 3).
The tSVR algorithm learns the i-th random tSVR model from the i-th bootstrap sample (sampled with replacement from the original dataset). For each non-terminal node in the training phase of the tree regressor, n′ dimensions randomly chosen from the n original ones are used to compute the best split among these n′ dimensions. The tree regressor is grown with early stopping at minobj = m/k and without pruning. Finally, the tSVR algorithm learns k non-linear SVR models for the k terminal-nodes.
As described in Algorithm 2 and Figure 5, the rtSVR constructs independently r random local tSVR models. This allows parallelizing the learning task with OpenMP on multi-core computers.
Thus, the complexity of parallel learning of rtSVR on a P-core processor is as follows:

O(r · m²/(k · P)) = O(r · minobj · m/P) (8)

Figure 5. Training algorithm of r random tSVR models.

Prediction of a new datapoint x using r random tSVR models
The prediction for a new individual x is the average of the prediction results obtained by r random tSVR models.
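The whole rtSVR scheme (bootstrap samples, √n candidate dimensions per split, and averaged predictions) can be sketched compactly; the function names and default values are illustrative assumptions, with scikit-learn's max_features="sqrt" supplying the per-split random dimension choice:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

def fit_rtsvr(X, y, r=10, minobj=200, C=1e5, gamma=10.0, epsilon=0.05,
              seed=0):
    """Sketch of rtSVR: r randomized tSVR models, each grown on a bootstrap
    sample with sqrt(n) candidate dimensions tried at every split."""
    rng = np.random.RandomState(seed)
    ensemble = []
    for _ in range(r):
        boot = rng.randint(0, len(X), len(X))   # bootstrap (with replacement)
        Xb, yb = X[boot], y[boot]
        tree = DecisionTreeRegressor(min_samples_split=minobj,
                                     max_features="sqrt").fit(Xb, yb)
        leaves = tree.apply(Xb)
        # One local SVR per terminal-node of this randomized tree.
        local = {leaf: SVR(kernel="rbf", C=C, gamma=gamma,
                           epsilon=epsilon).fit(Xb[leaves == leaf],
                                                yb[leaves == leaf])
                 for leaf in np.unique(leaves)}
        ensemble.append((tree, local))
    return ensemble

def predict_rtsvr(ensemble, X):
    # Average the r local predictions, as in the rtSVR prediction rule.
    preds = []
    for tree, local in ensemble:
        leaves = tree.apply(X)
        preds.append([local[leaf].predict(x.reshape(1, -1))[0]
                      for leaf, x in zip(leaves, X)])
    return np.mean(preds, axis=0)
```

In a faithful implementation the r models would also be trained in parallel, as Equation (8) assumes.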

Experimental evaluation
We are interested in the experimental performance of our decision trees (tSVR and rtSVR) using local support vector regression models for predicting large datasets. We therefore set up experiments to evaluate the performance in terms of training time and prediction correctness, so that these numerical tests are consistent with the performance analysis in terms of algorithmic complexity and generalization ability.

Software programs
Our algorithms tSVR and rtSVR are implemented in the Python programming language, using the Scikit-learn library (Machine Learning in Python; Pedregosa et al., 2011) and the pymp package (OpenMP-like functionality for Python; Lassner, 2017). The highly efficient standard SVM library LibSVM (Chang & Lin, 2011) with OpenMP (OpenMP Architecture Review Board, 2008) is also used to train global SVR models.
Our evaluation of the performance is reported in terms of training time and prediction correctness. We are interested in comparing the regression results obtained by our proposed tSVR and rtSVR for local SVR models with LibSVM for global SVR models.
All experiments are run on a Linux Fedora 20 machine with an Intel(R) Core i7-4790 CPU (3.6 GHz, 4 cores) and 32 GB of main memory.

Datasets
Experiments are conducted with five datasets from the UCI repository (Lichman, 2013). Table 1 presents the description of the datasets; the evaluation protocols are shown in its last column. The datasets are already divided into a training set (Trn) and a test set (Tst). We used the training data to build the SVR models, then predicted the test set using the resulting models.

Tuning parameters
We propose to use the RBF kernel in tSVR, rtSVR and LibSVM for training SVR models because it is general and efficient (Lin, 2003). A cross-validation protocol (twofold) is used to tune the regression tolerance ϵ, the hyper-parameter γ of the RBF kernel (for two individuals x_i, x_j, K(x_i, x_j) = exp(−γ‖x_i − x_j‖²)) and the cost C (a trade-off between the margin size and the errors) to obtain a good correctness. For the two largest datasets (Buzz in social media Twitter, YearPredictionMSD), we used a randomly sampled subset of about 5% of the training dataset for tuning the hyper-parameters, due to the expensive computational cost.
Our tSVR uses the parameter minobj (the minimum number of datapoints for a split to be tried by the decision tree regressor). We propose to set minobj = 200 (following the advice of Vapnik & Bottou, 1993). The rtSVR algorithm is set to train 10 random tSVR models (r = 10), with the number of randomly chosen dimensions being the square root of the full set (n′ = √n, as advised by Breiman, 2001). Table 2 presents the hyper-parameters of tSVR, rtSVR and LibSVM for training regression models.
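The twofold tuning protocol can be sketched with scikit-learn's GridSearchCV; the synthetic data and the candidate grid below are illustrative assumptions, not the paper's actual search space:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.uniform(-2, 2, (1000, 2))
y = np.sin(X[:, 0]) * X[:, 1]

# On the largest datasets the paper tunes on a ~5% random subsample;
# here we take a subsample the same way.
sub = rng.choice(len(X), size=len(X) // 20, replace=False)

# Twofold cross-validation over a small grid of (C, gamma, epsilon),
# scored by mean absolute error.
grid = GridSearchCV(SVR(kernel="rbf"),
                    param_grid={"C": [1, 10, 100],
                                "gamma": [0.1, 1.0, 10.0],
                                "epsilon": [0.01, 0.1]},
                    cv=2, scoring="neg_mean_absolute_error")
grid.fit(X[sub], y[sub])
print(grid.best_params_)  # hyper-parameters to reuse for the full model
```

The best parameters found on the subsample are then used to train the final model on the full training set.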

Regression results
The regression results of LibSVM, tSVR and rtSVR on the datasets are given in Table 3 and Figures 6-8.

Comparison in training time
The training times in Table 3 show that our decision tree algorithms using local SVR models (tSVR and rtSVR) outperform LibSVM in terms of training time. The average training times of tSVR, rtSVR and LibSVM are 1.19, 3.22 and 1551.15 min, respectively. This means that:
- the training time of tSVR is 1303.49 times faster than LibSVM;
- rtSVR is 481.72 times faster than LibSVM;
- the training time of rtSVR is about 2.71 times longer than that of tSVR.
The comparison of training time for the 3 first small datasets (Appliances energy prediction, Facebook comment volume, BlogFeedback) illustrates that the training speed improvements of tSVR and rtSVR over LibSVM are 346.67 and 127.64 times, respectively.
For the two very large datasets (Buzz in social media Twitter and YearPredictionMSD), the learning time improvements of tSVR and rtSVR compared to LibSVM are more significant: the training speed improvements of tSVR and rtSVR versus LibSVM are 2261.39 times and 834.31 times, respectively.

Comparison in prediction correctness
The prediction correctness (mean absolute error, MAE) presented in Table 3 shows that the average errors made by tSVR, rtSVR and LibSVM are 25.16, 22.51 and 62.01, respectively. The comparison of prediction correctness, dataset by dataset, shows that tSVR and rtSVR are beaten only once (on the BlogFeedback dataset) by LibSVM; tSVR and rtSVR have 4 wins and 1 defeat against LibSVM. rtSVR is more accurate than tSVR with 5/5 wins. The prediction results demonstrate that our tSVR and rtSVR are more efficient than LibSVM.
These numerical test results demonstrate that our tSVR and rtSVR improve not only the training time but also the prediction correctness when dealing with large datasets.

Comparison with decision tree algorithms
We are interested in studying the impact of using local SVR rules instead of the average-prediction rules of classical decision trees. Therefore, we compare the performance of tSVR and rtSVR with the decision tree regressor (DT) and the random forest regressor (RF). The regression results of tSVR and rtSVR compared to DT and RF are presented in Table 4, the improvement of the prediction correctness over DT being 1.21. In comparison to RF, rtSVR also reduces the prediction error by 1.40 but requires 3.26 times the training time.
The experimental results lead us to believe that the decision tree algorithms tSVR and rtSVR using local SVR models are efficient for dealing with such large datasets.

Discussion on related works
Our proposed tSVR belongs to the family of machine learning techniques related to local SVM algorithms. A theoretical analysis in Bottou and Vapnik (1992) and Vapnik and Bottou (1993) illustrates that there is a trade-off between the generalization capacity (the prediction correctness) and the locality of local learning algorithms. The approaches to local SVMs can be categorized into two kinds of strategies: the hierarchical strategy and the nearest neighbours strategy. Local algorithms in the hierarchical strategy perform the training task in two main stages: the partition of the training set and the learning of local models. The first stage splits the full training dataset into partitions (subsets) and the second stage trains local supervised models from the subsets. The learning algorithm proposed in Jacobs, Jordan, Nowlan, and Hinton (1991) uses the expectation-maximization (EM) clustering algorithm (Dempster, Laird, & Rubin, 1977) to split the full training dataset into k joint clusters (the EM clustering algorithm uses the posterior probabilities to make a soft cluster assignment; Bishop, 2006); the algorithm then learns neural network (NN) models from the clusters to locally classify the individuals in the clusters. Instead of building local NN models as done by Jacobs et al. (1991), the parallel mixture of SVMs algorithm in Collobert, Bengio, and Bengio (2002) proposes to learn local SVM models to predict the individuals in the clusters. More recently, the Latent-lSVM (Do & Poulet, 2016a, 2017) uses Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003) to perform the clustering task for sparse data representations.
CSVM (Gu & Han, 2013), kSVR (Bui, Tran-Nguyen, Kim, & Do, 2017) and kSVM (Do, 2015) propose to use the kmeans algorithm (MacQueen, 1967) to split the full training dataset into k disjoint clusters; kSVR and kSVM then learn local non-linear SVM models in a parallel way, instead of training weighted local linear SVMs from the clusters as done in CSVM. krSVM (Do & Poulet, 2015) learns a random ensemble of kSVM models. DTSVM (Chang, Guo, Lin, & Lu, 2010; Chang & Liu, 2012) and tSVM (Do & Poulet, 2016b) use decision tree algorithms (Breiman, Friedman, Olshen, & Stone, 1984; Quinlan, 1993) to split the full training dataset into t terminal-nodes (tree leaves); the tSVM algorithm then builds local SVM models for classifying impure terminal-nodes (with a mixture of labels), while DTSVM learns local SVM models from all tree leaves. These algorithms are shown to reduce the computational cost of dealing with large datasets while maintaining the prediction correctness.
Local algorithms in the nearest neighbours strategy try to find the k nearest neighbours (kNN) of a new testing individual in the training dataset and then learn a local supervised model from only these kNN to classify the new testing individual. The first local learning algorithm, proposed in Bottou and Vapnik (1992) and Vapnik and Bottou (1993), trains a neural network model on the k neighbourhoods to predict the label of the testing individual. The investigation of Vincent and Bengio (2001) trains k-local hyperplane and convex distance nearest neighbour models (large margin nearest neighbours). Recent local SVM algorithms, including SVM-kNN (Zhang, Berg, Maire, & Malik, 2006), ALH (Yang & Kecman, 2008) and FaLK-SVM (Segata & Blanzieri, 2010), find the kNN of the testing individual with different techniques. SVM-kNN tries different metrics. The ALH algorithm combines weighted distances and features to predict the label of the testing individual. FaLK-SVM uses the cover tree (Beygelzimer, Kakade, & Langford, 2006) to reduce the computational cost of finding the kNN in the feature kernel space.

Conclusion and future works
We presented decision tree algorithms using local SVR models that achieve high performance for the non-linear regression of large datasets. The training task of tSVR partitions the full training dataset into k terminal-nodes, which aims at reducing the data size for training the local SVR models. It then easily learns k non-linear SVR models in a parallel way on multi-core computers, in which an SVR model is trained in each terminal-node to predict the data locally. The rtSVR algorithm learns a random forest of decision trees using local prediction SVR models to improve the prediction correctness over the tSVR model alone. The performance analysis and the numerical test results on datasets from the UCI repository showed that the proposed algorithms tSVR and rtSVR are efficient in terms of training time and prediction correctness compared to the standard learning algorithm LibSVM for global SVR models. The training time improvements of tSVR and rtSVR versus LibSVM are 1282.66 and 482.29 times; furthermore, tSVR and rtSVR also improve the relative prediction correctness by 59.43% and 63.70%, respectively, compared to the standard LibSVM. An example of tSVR's effectiveness is the non-linear regression of the YearPredictionMSD dataset (having 400,000 datapoints and 90 dimensions) in 3.297 min with a 7.38 mean absolute error on the test set.
In the near future, we intend to provide more empirical tests on large benchmarks and comparisons with other algorithms. A promising avenue aims at improving the prediction correctness and automatically tuning the hyper-parameters of tSVR.