Hybrid intelligent deep kernel incremental extreme learning machine based on differential evolution and multiple population grey wolf optimization methods

ABSTRACT Focusing on the problem that redundant nodes in the kernel incremental extreme learning machine (KI-ELM) lead to ineffective iterations and reduced learning efficiency, a novel hybrid intelligent deep kernel incremental extreme learning machine (HI-DKIELM) is proposed, combining hybrid intelligent algorithms with a kernel incremental extreme learning machine. First, a hybrid intelligent algorithm based on differential evolution (DE) and multiple population grey wolf optimization (MPGWO) is proposed to optimize the hidden layer neuron parameters and to determine the number of effective hidden layer neurons; the learning efficiency of the algorithm is improved by reducing the network complexity. Then, a deep network structure is introduced into the kernel incremental extreme learning machine to extract features from the original input data layer by layer. The experiment results show that the proposed HI-DKIELM has a more compact network structure, higher prediction accuracy and better generalization ability than other ELM methods.


Introduction
The artificial neural network analyses data through an abstract simulation of the biological neuron network, thereby realizing functions such as data classification, system identification, function approximation and numerical estimation. However, the training efficiency and learning ability of traditional Single Hidden Layer Feedforward Neural Networks (SLFNNs) are still too low, because all parameters of the network need to be updated during the learning process. Recently, Huang et al. [1] proposed the extreme learning machine (ELM) algorithm for training single hidden layer feedforward neural networks. Compared with traditional neural networks, the parameters of the hidden layer nodes in ELM are randomly initialized and then fixed, without iterative tuning or a tedious iterative process. The only free parameters to be learned are the connection weights between the hidden layer and the output layer, and the output weights are obtained through the generalized inverse of a matrix using the regularized least squares method. In this manner, ELM achieves good universal approximation capability as well as high running efficiency thanks to its network learning performance and network structure, thereby avoiding the local minimum and slow convergence problems.
In practice, because of the complexity of various problems, several methods for optimizing the ELM hidden nodes have been proposed to obtain a suitable network structure and size. Huang et al. [2] proposed a standard optimization method for classification problems; Huang [3] then also proved the feasibility of using ELM for arbitrary multi-classification problems and obtained good experimental performance. In [4], class weights were introduced to solve complex unbalanced learning problems and further improve performance. At present, ELM has been widely used in face recognition, speech recognition, licence plate recognition, power systems [5][6][7] and so on. Because of the larger number of classification labels, the lack of training samples and insufficient feature descriptions, the recognition accuracy of ELM on traditional classification problems is often undesirable. Therefore, under the premise of preserving its fast training speed and good generalization performance, further improving the overall classification performance and recognition accuracy of ELM has become the present research focus.
In traditional ELM, higher dimensional network structures have always been used for the purpose of obtaining stronger learning ability, but the optimal number of hidden layer nodes and the scale of the model are difficult to determine. For this reason, Huang et al. [8] proposed the Incremental Extreme Learning Machine (I-ELM), where hidden nodes are added incrementally and the output weights are determined analytically. In [9], a state-of-the-art learning algorithm known as the Enhanced Incremental Extreme Learning Machine (EI-ELM) is presented, which elects effective hidden layer nodes to construct the network structure using novel optimization algorithms while reducing the complexity of the network to some extent. However, when the size of the network is too large, the number of iterations of the EI-ELM increases greatly, which affects the generalization ability. Huang et al. [10] proposed the Barron-optimized Convex Incremental Extreme Learning Machine (CI-ELM), which recalculates the output weights of existing nodes after each increase of hidden layer nodes to obtain a higher convergence rate. In [11], a hybrid incremental Extreme Learning Machine (HI-ELM) is proposed, using a chaos optimization algorithm to optimize the parameters of the hidden layer nodes. However, the current I-ELM still has some problems that need to be solved urgently: the complexity of the network structure keeps increasing because of redundant nodes, which reduces the learning efficiency; the convergence rate is low and the number of hidden layer nodes may exceed the number of learning samples; and the method is sensitive to new data, so its online prediction ability is not strong.
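To make the incremental scheme concrete, the following is a minimal Python sketch of the basic I-ELM update: each randomly generated sigmoid node receives the least-squares output weight β_L = ⟨e, h_L⟩/‖h_L‖² against the current residual e, which then shrinks. The data, node count and target error here are illustrative assumptions, not the experimental settings of the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data (illustrative only).
X = np.linspace(0, 1, 50).reshape(-1, 1)
T = np.sin(2 * np.pi * X[:, 0])

def ielm_train(X, T, max_nodes=50, target_rmse=1e-2):
    """I-ELM: add random sigmoid hidden nodes one at a time.

    Each new node's output weight is the least-squares fit to the
    current residual, so the residual never increases.
    """
    e = T.copy()                                  # residual error, e_0 = T
    params = []                                   # (w, b, beta) per node
    for _ in range(max_nodes):
        w = rng.standard_normal(X.shape[1])       # random input weight
        b = rng.standard_normal()                 # random bias
        h = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # hidden-node outputs
        beta = (e @ h) / (h @ h)                  # beta_L = <e,h>/||h||^2
        e = e - beta * h                          # shrink the residual
        params.append((w, b, beta))
        if np.sqrt(np.mean(e ** 2)) < target_rmse:
            break
    return params, e

params, residual = ielm_train(X, T)
initial_rmse = float(np.sqrt(np.mean(T ** 2)))
final_rmse = float(np.sqrt(np.mean(residual ** 2)))
```

Because each β is optimal for its own node, the training residual decreases monotonically as nodes are added, which is the property the incremental variants above build on.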
The combination of parameters is crucial for the ELM because it affects the training speed and learning accuracy of the ELM to some extent. Therefore, intelligent optimization algorithms that optimize the ELM parameters based on bionic methods, with the purpose of improving learning speed and accuracy, have become a research focus. In [12], a differential evolution (DE) algorithm is proposed to adjust the ELM input parameters. In [13], an adaptive DE algorithm is given to optimize the parameters of the hidden layer nodes, and the MP generalized inverse method is then utilized to solve the output weights. In [14], an improved particle swarm optimization algorithm is utilized to optimize the hidden layer node parameters. In [15], a hybrid intelligent ELM is proposed, using the DE algorithm and the particle swarm optimization method to optimize the hidden layer nodes. However, the aforementioned hybrid intelligent optimization algorithms still face two problems: although the DE algorithm has strong global optimization ability, it suffers from premature convergence, while the particle swarm optimization algorithm performs well in local optimization but its searching speed is too slow.
Meanwhile, another leading trend for hierarchical learning is deep learning (DL). A deep architecture extracts features through a multilayer feature representation framework, where higher layers represent more abstract information than lower ones, and this can be used to improve the ELM performance. In [16], a multilayer ELM is given which combines the excellent feature extraction capabilities of deep learning with the fast training ability of ELM. In [17], a kernel function is introduced and a novel deep kernel ELM is proposed and applied to aero-engine component fault diagnosis to improve diagnostic accuracy.
It is noteworthy that for ELM and its variants, all of the improved algorithms are composed of two stages: first random feature mapping, and then output weight optimization. However, for complex classification problems, the ability of random feature mapping to boost the separability of the original sample space is often limited, which increases the dependence on the subsequent output weight optimization process. Moreover, most current variants of ELM are based on the existing ELM framework, and few variants combine the ELM network structure with other structural adjustments beyond the deep learning network.
In this paper, to ensure the superiority of the proposed network structure, a hybrid intelligent deep kernel incremental extreme learning machine is proposed in order to improve the ELM network performance. First, the deep kernel incremental ELM (DKI-ELM) is proposed based on the incremental kernel ELM and the deep learning network, and the deep network structure is used to extract the data in multiple layers to obtain effective features and improve the classification accuracy. Second, a hybrid intelligent differential evolution multiple population grey wolf optimization algorithm is proposed, combining the global search ability of the DE algorithm with the local search capability of the MPGWO algorithm, in order to obtain the optimal output weights and thereby improve the training speed and classification accuracy of the ELM.
In this paper, our major contributions are summarized as follows: 1) An HI-DKIELM (hybrid intelligent deep kernel incremental extreme learning machine) network classifier is designed. HI-DKIELM consists of a cascade of a deep learning network and a kernel incremental extreme learning machine: the input data pass through the deep learning network to extract more information and boost separability, achieving a higher dimensional spatial mapping, and the ELM network is then utilized to provide a superior classification surface. In this way, the HI-DKIELM proposed in this paper combines the advantages of the deep learning network and the KIELM network and can improve the performance effectively. 2) In order to find the optimal parameters of the Extreme Learning Machine, an appropriate hybrid intelligent optimization algorithm for HI-DKIELM is presented. The proposed hybrid differential evolution multiple population grey wolf optimization algorithm (DE-MPGWO) combines the global search ability of the DE algorithm with the local search capability of the multiple population grey wolf algorithm.
The remainder of this paper is organized as follows. Section 2 reviews the implementation of ELM and KIELM. Section 3 presents the hybrid intelligent differential evolution multiple population grey wolf optimization algorithm. In Section 4, the details of the HI-DKIELM are discussed. The experimental results are presented in Section 5. Section 6 concludes our work and outlines our future work on generalizing the method to multibiometric recognition systems.

Preliminary
In this part, we introduce the notation of the Extreme Learning Machine (ELM) and the kernel incremental extreme learning machine (KI-ELM).

Extreme Learning Machine Theory
Extreme Learning Machine is a highly efficient learning algorithm built on the single-hidden-layer feedforward neural network. Unlike traditional neural networks, all the hidden layer parameters of the Extreme Learning Machine are generated randomly, so the complicated iterative tuning process is avoided. Suppose the training set is composed of N training samples, where the input x_i has dimension d and t_i is the output label; then the output of the ELM is

f(x_i) = Σ_{j=1}^{L} β_j G(w_j, b_j, x_i) = t_i, i = 1, . . . , N. (1)

In Equation (1), w_j = [w_{j1}, w_{j2}, . . . , w_{jd}]^T is the input weight of the jth hidden node, b_j is the bias of the jth hidden node, β_j is the weight connecting the jth hidden node to the output nodes of the ELM, and G(w_j, b_j, x_i) is the output function of the jth hidden node. From Equation (1), h(x_i) = [G(w_1, b_1, x_i), . . . , G(w_L, b_L, x_i)] is the output of the hidden layer with regard to training sample x_i, and Equation (1) can be simplified as

Hβ = T, (2)

where H is the N × L hidden layer output matrix whose ith row is the hidden layer output vector h(x_i) relative to the input x_i, β = [β_1, . . . , β_L]^T is the L × m output weight matrix and T = [t_1, t_2, . . . , t_N]^T is the N × m expected output. In order to improve the generalization ability of ELM, a penalty factor C is introduced into the training objective

min_β ‖β‖² + C ‖Hβ − T‖², (3)

and the output weight matrix β is

β = H^T (I/C + HH^T)^{−1} T. (4)

Then the output of the extreme learning machine can be expressed as

f(x) = h(x) H^T (I/C + HH^T)^{−1} T. (5)
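As a concrete illustration, the two-stage ELM procedure above (random feature mapping, then the regularized least-squares solution of Equation (4)) can be sketched in a few lines of Python; the toy data, node count and penalty factor are assumed values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: learn t = x1 + x2 on random inputs (illustrative only).
X = rng.uniform(-1, 1, size=(200, 2))
T = X[:, 0] + X[:, 1]

L, C = 40, 1e3            # hidden nodes and penalty factor (assumed)

# Stage 1: random feature mapping -- weights drawn once, then fixed.
W = rng.standard_normal((2, L))
b = rng.standard_normal(L)
H = np.tanh(X @ W + b)    # N x L hidden-layer output matrix

# Stage 2: regularized least-squares output weights,
# beta = H^T (I/C + H H^T)^{-1} T   (Equation (4)).
beta = H.T @ np.linalg.solve(np.eye(len(X)) / C + H @ H.T, T)

train_rmse = float(np.sqrt(np.mean((H @ beta - T) ** 2)))
```

The only trained quantities are the output weights beta; the hidden layer is never revisited, which is what gives ELM its speed.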

Kernel Incremental Extreme Learning Machine (KI-ELM)
Incremental Extreme Learning Machine (I-ELM) differs from the original incremental neural network, in which only a specific kind of activation function can be used: I-ELM can use any continuous or piecewise continuous function as the activation function. Under the premise of equal learning accuracy, the training speed of the I-ELM can be up to 1000 times faster than that of SVM and BP neural networks. In the past 5 years, several variants of the I-ELM, such as EI-ELM, PC-ELM, KI-ELM and OP-ELM, have been proposed. These improved Incremental Extreme Learning Machines mainly aim at improving the hidden layer node parameters of I-ELM. The kernel matrix of the KI-ELM can be expressed as

K_ELM = HH^T, with K_ELM(i, j) = h(x_i) · h(x_j) = K(x_i, x_j), (6)

then the output function of KI-ELM can be converted from Equation (5) to

f(x) = [K(x, x_1), . . . , K(x, x_N)] (I/C + K_ELM)^{−1} T. (7)

In Equation (7), assume that at moment t the regularized kernel matrix over the first t samples is

A_t = I/C + K_ELM^{(t)}. (8)

So at the t + 1 moment, when a new sample x_{t+1} arrives, there is

A_{t+1} = [ A_t, k_{t+1}; k_{t+1}^T, K(x_{t+1}, x_{t+1}) + 1/C ], (9)

where k_{t+1} = [K(x_1, x_{t+1}), . . . , K(x_t, x_{t+1})]^T. To simplify Equation (9), suppose

s_{t+1} = K(x_{t+1}, x_{t+1}) + 1/C − k_{t+1}^T A_t^{−1} k_{t+1}, (10)

then the inverse of Equation (9) can be simplified by block inversion as

A_{t+1}^{−1} = [ A_t^{−1} + A_t^{−1} k_{t+1} s_{t+1}^{−1} k_{t+1}^T A_t^{−1}, −A_t^{−1} k_{t+1} s_{t+1}^{−1}; −s_{t+1}^{−1} k_{t+1}^T A_t^{−1}, s_{t+1}^{−1} ]. (11)

Using the new data, we can obtain

f_{t+1}(x) = [K(x, x_1), . . . , K(x, x_{t+1})] α_{t+1}, with (12)

α_{t+1} = A_{t+1}^{−1} T_{t+1}. (13)

In Equation (13), T_{t+1} = [t_1, . . . , t_{t+1}]^T is the expected output vector extended with the new label, so the inverse at moment t + 1 is obtained from A_t^{−1} without recomputing the full matrix inverse.
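A minimal sketch of the batch KI-ELM solution of Equation (7) is shown below, using an RBF kernel as an assumed example kernel; the toy data and penalty factor are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy regression data (illustrative only).
X = rng.uniform(-1, 1, size=(100, 2))
T = np.sin(X[:, 0]) * np.cos(X[:, 1])

def rbf_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma ||a - b||^2), computed pairwise."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

C = 1e3                                  # penalty factor (assumed)
K = rbf_kernel(X, X)                     # K_ELM(i, j) = K(x_i, x_j)
alpha = np.linalg.solve(np.eye(len(X)) / C + K, T)

def predict(x_new):
    # f(x) = [K(x, x_1), ..., K(x, x_N)] (I/C + K_ELM)^{-1} T
    return rbf_kernel(x_new, X) @ alpha

train_rmse = float(np.sqrt(np.mean((predict(X) - T) ** 2)))
```

Note that no explicit feature map h(x) ever appears: the kernel replaces all inner products, which is the point of the kernel variant.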

The proposed DE-MPGWO algorithm
In this part, an improved hybrid intelligent optimization strategy called the Differential Evolution Multiple Population Grey Wolf Optimization algorithm (DE-MPGWO), inspired by the idea of the Frog Leaping Algorithm (FLA), is proposed based on DE and MPGWO. To facilitate the presentation of the proposed optimization algorithm, Sections 3.1 and 3.2 briefly review the DE and MPGWO algorithms, and the hybrid intelligent optimization proposed in this paper is discussed in Section 3.3.

Differential evolution
The DE algorithm is an optimization method based on a group evolution process; it computes the optimal solution through three major operations: differential mutation, binomial crossover and greedy selection. The major computing process is as follows [20]. At first, the DE algorithm generates a population of N_p individuals of dimension D, where the solution of individual i is X_{i,G} = (x_{i1,G}, x_{i2,G}, . . . , x_{iD,G}) and G stands for the iteration number; the initial components are drawn randomly within the search range:

x_{ij,0} = x_j^{min} + rand(0, 1) · (x_j^{max} − x_j^{min}). (14)

Then, a variation vector V_{i,G} is generated for every solution vector X_{i,G} using a differential mutation strategy. The strategy used in this paper is DE/rand/2 [21]:

V_{i,G} = X_{r1,G} + F (X_{r2,G} − X_{r3,G}) + F (X_{r4,G} − X_{r5,G}), (15)

where r1, . . . , r5 are mutually distinct random indices and F is the scaling factor. After this operation, the final probing solution U_{i,G} is generated from every solution vector X_{i,G} and variation vector V_{i,G} using the binomial crossover operation:

u_{ij,G} = v_{ij,G} if rand_j ≤ C_R or j = j_rand; otherwise u_{ij,G} = x_{ij,G}. (16)

In Equation (16), u_{ij,G}, v_{ij,G} and x_{ij,G} are the jth components of the probing solution U_{i,G}, the variation vector V_{i,G} and the solution vector X_{i,G} respectively, C_R is the crossover probability, and j_rand is a random index ensuring that at least one component is taken from V_{i,G}.
Finally, we conduct the greedy selection between the probing solution U_{i,G} and the solution vector X_{i,G}: the better solution is regarded as the new solution X_{i,G+1} and stored in the next generation.
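The three DE operations above can be sketched as follows; the sphere objective, population size and control parameters F and C_R are assumed illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

def sphere(x):
    """Simple convex test objective, f(x) = sum x_j^2."""
    return float(np.sum(x ** 2))

def de_rand_2(fitness, dim=5, n_pop=30, n_gen=200, F=0.5, CR=0.9,
              bounds=(-5.0, 5.0)):
    """DE with DE/rand/2 mutation, binomial crossover, greedy selection."""
    lo, hi = bounds
    pop = rng.uniform(lo, hi, size=(n_pop, dim))     # Equation (14)
    fit = np.array([fitness(p) for p in pop])
    for _ in range(n_gen):
        for i in range(n_pop):
            r1, r2, r3, r4, r5 = rng.choice(
                [j for j in range(n_pop) if j != i], 5, replace=False)
            # DE/rand/2: V = X_r1 + F (X_r2 - X_r3) + F (X_r4 - X_r5)
            v = pop[r1] + F * (pop[r2] - pop[r3]) + F * (pop[r4] - pop[r5])
            v = np.clip(v, lo, hi)
            # Binomial crossover with one forced dimension j_rand.
            mask = rng.random(dim) < CR
            mask[rng.integers(dim)] = True
            u = np.where(mask, v, pop[i])
            # Greedy selection: keep the better of U and X.
            fu = fitness(u)
            if fu <= fit[i]:
                pop[i], fit[i] = u, fu
    return pop[np.argmin(fit)], float(fit.min())

best_x, best_f = de_rand_2(sphere)
```

The greedy selection guarantees that the best fitness in the population never worsens from one generation to the next.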

MPGWO algorithm
The GWO algorithm, proposed by Mirjalili et al. [14], imitates the leadership hierarchy and hunting mechanism of grey wolves in nature. Grey wolves are considered to be at the top of the food chain and prefer to live in a pack. Four types of grey wolves, alpha (α), beta (β), delta (δ) and omega (ω), are employed to simulate the leadership hierarchy. In order to mathematically model the social hierarchy of wolves when designing GWO, we consider the fittest solution as the alpha (α); consequently, the second and third best solutions are named beta (β) and delta (δ), respectively, and the rest of the candidate solutions are assumed to be omega (ω). Figure 1 shows the three main steps of the GWO algorithm, namely hunting (chasing and tracking prey), encircling prey and attacking prey, which are implemented to perform optimization.
Recently, a multi-population version of the GWO (MPGWO) was proposed, which extends the idea of the original GWO to solving optimization problems with multiple populations. In MPGWO, a fixed-size external archive is integrated into the GWO for saving and retrieving the Pareto optimal solutions. This archive is then employed to define the social hierarchy and simulate the hunting behaviour of grey wolves in multi-objective search spaces, and to share and exchange information among the different populations in order to improve population diversity in the search for optimal solutions. The pseudo code of the MPGWO algorithm is given in Table 1.
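For reference, a minimal single-population GWO sketch is given below, implementing the standard encircling/hunting update X' = X_p − A·D with A = 2a·r1 − a, C = 2r2 and D = |C·X_p − X|, averaged over the alpha, beta and delta leaders; the benchmark function and all parameters are illustrative assumptions, and the multi-population archive mechanism of MPGWO is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(4)

def sphere(x):
    return float(np.sum(x ** 2))

def gwo(fitness, dim=5, n_wolves=20, n_iter=200, bounds=(-5.0, 5.0)):
    """Basic GWO: positions are pulled toward alpha, beta and delta."""
    lo, hi = bounds
    pos = rng.uniform(lo, hi, size=(n_wolves, dim))
    for t in range(n_iter):
        fit = np.array([fitness(p) for p in pos])
        order = np.argsort(fit)
        leaders = [pos[order[j]].copy() for j in range(3)]  # alpha/beta/delta
        a = 2.0 * (1.0 - t / n_iter)          # a decreases linearly 2 -> 0
        for i in range(n_wolves):
            x_new = np.zeros(dim)
            for leader in leaders:
                A = 2 * a * rng.random(dim) - a        # A = 2a r1 - a
                Cc = 2 * rng.random(dim)               # C = 2 r2
                D = np.abs(Cc * leader - pos[i])       # D = |C X_p - X|
                x_new += leader - A * D                # X' = X_p - A D
            pos[i] = np.clip(x_new / 3.0, lo, hi)      # mean of 3 pulls
    fit = np.array([fitness(p) for p in pos])
    return pos[np.argmin(fit)], float(fit.min())

best_x, best_f = gwo(sphere)
```

As a shrinks, |A| < 1 and the pack converges on the leaders (exploitation); early on, |A| > 1 allows wolves to diverge from them (exploration).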

The proposed hybrid DE-MPGWO algorithm
The newly proposed hybrid DE-MPGWO algorithm uses DE and MPGWO as the evolution methods and has the ability of meme evolution derived from the Frog Leaping Algorithm (FLA), in order to improve the performance by taking advantage of the two algorithms. The detailed implementation of the proposed optimization algorithm is described in Table 2.

Step 1: In the solution space, randomly generate N D-dimensional solutions as the initial population; the total number of evolution iterations is I_iter_max, the number of iterations within each subpopulation is I_iter, and C_n = I_iter_max / I_iter.
Step 2: Divide the population into N_k subpopulations randomly.
Step 3: Choose k subpopulations randomly among the N_k subpopulations (1 < k < N_k) and use the DE algorithm to evolve each of them for I_iter generations. Divide the remaining N_k − k subpopulations into three grey wolf populations and evolve them with the MPGWO algorithm for I_iter generations. During the entire iteration process, record all changes of the optimal value over all populations.
Step 4: Mix the N_k subpopulations to obtain the new population and judge whether the number of local search iterations has reached the designated number C_n; if so, the iteration stops; if not, return to Step 2.
Step 5: The algorithm terminates.
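Steps 1–5 above can be sketched as follows. The inner DE and GWO generations are simplified (DE/rand/1 mutation and a fixed coefficient a), and the sphere objective and all sizes are assumed illustrative values, so this is a structural sketch of the split–evolve–mix loop rather than the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(5)

def sphere(x):
    return float(np.sum(x ** 2))

def de_step(sub, F=0.5, CR=0.9):
    """One simplified DE generation (DE/rand/1) on a subpopulation."""
    out, n = sub.copy(), len(sub)
    for i in range(n):
        r = rng.choice([j for j in range(n) if j != i], 3, replace=False)
        v = sub[r[0]] + F * (sub[r[1]] - sub[r[2]])
        mask = rng.random(sub.shape[1]) < CR
        mask[rng.integers(sub.shape[1])] = True
        u = np.where(mask, v, sub[i])
        if sphere(u) <= sphere(sub[i]):
            out[i] = u
    return out

def gwo_step(sub, a=0.5):
    """One simplified GWO generation on a subpopulation."""
    fit = np.array([sphere(p) for p in sub])
    leaders = sub[np.argsort(fit)[:3]].copy()
    out = sub.copy()
    for i in range(len(sub)):
        pulls = []
        for ldr in leaders:
            A = 2 * a * rng.random(sub.shape[1]) - a
            Cc = 2 * rng.random(sub.shape[1])
            pulls.append(ldr - A * np.abs(Cc * ldr - sub[i]))
        out[i] = np.mean(pulls, axis=0)
    return out

# Split-evolve-mix loop mirroring Steps 1-5.
dim, n_sub, sub_size, c_n, i_iter, k = 5, 4, 15, 10, 20, 2
pop = rng.uniform(-5, 5, size=(n_sub * sub_size, dim))   # Step 1
for _ in range(c_n):
    rng.shuffle(pop)                                     # Step 2: split
    subs = pop.reshape(n_sub, sub_size, dim)
    new_subs = []
    for s_idx in range(n_sub):
        sub = subs[s_idx].copy()
        step = de_step if s_idx < k else gwo_step        # Step 3
        for _ in range(i_iter):
            sub = step(sub)
        new_subs.append(sub)
    pop = np.concatenate(new_subs)                       # Step 4: mix

best_f = min(sphere(p) for p in pop)
```

The periodic remix in Step 4 is what lets good solutions found by one subpopulation seed the others, in the spirit of the FLA meme exchange.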

The proposed HI-DKIELM
In this part, the traditional extreme learning machine (ELM) is extended to HI-DKIELM based on the kernel incremental extreme learning machine and a deep learning network. The proposed HI-DKIELM consists of an input layer, an output layer and several cascaded hidden layers. The structure of the HI-DKIELM is shown in Figure 2. In the training process, we utilize the DE-MPGWO optimization algorithm given in this paper to optimize the output weights for the purpose of robustness. The initial data, after extraction through k hidden layers, yields the input feature X_k, which is then mapped using the kernel function. The detailed implementation process of the proposed HI-DKIELM is given in Table 3.
The pseudo code of the HI-DKIELM algorithm (Table 3) is as follows:
Step 1: Given training samples (x_i, t_i), t_i ∈ R, set the expected network output error η and the prediction error of the output φ(x_i, t_i); initialize the number of hidden nodes L = 0, the network error e*_0 = T and the number of iterations k = 0.
Step 2: Set the hidden layer nodes L = L + λ; when λ = 1, one node is added to the hidden layer.
Step 3: Compute the prediction error.
Step 4: Compute the optimal parameters Y*_L of the new hidden layer node using the DE-MPGWO algorithm proposed in this paper, then compute the output weight.
Step 5: Compute the output error; if it does not exceed η, the training step terminates, else return to Step 2.
Step 6: Suppose A = I/C + K_ELM; at moment t it is A_t and at moment t + 1 it is A_{t+1}; compute the generalized inverse of A_{t+1}.
Step 7: Update the data online and compute the output Ŷ_test.
Step 8: The algorithm terminates.
In this paper, the proposed hybrid intelligent HI-DKIELM extracts the input data layer by layer in order to obtain more effective features, which are conducive to distinguishing confusable classes easily and improving classification accuracy. In addition, these abstract features are not the original input features; the kernel function calculation replaces the inner product calculation in the high-dimensional space, which is conducive to further improving the classification accuracy.
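The growth loop of Steps 1–5 can be sketched as below; a best-of-n random search stands in for the DE-MPGWO optimization of each new node's parameters, and the toy data, candidate count, weight scale and error threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy data (illustrative only).
X = np.linspace(0, 1, 60).reshape(-1, 1)
T = np.sin(2 * np.pi * X[:, 0])

def add_best_node(X, e, n_candidates=25):
    """Return the candidate node that best reduces the residual e.

    Best-of-n random search stands in here for the DE-MPGWO step
    that optimizes each new node's parameters (Step 4).
    """
    best = None
    for _ in range(n_candidates):
        w = 8.0 * rng.standard_normal(X.shape[1])   # steeper sigmoids
        b = rng.uniform(-8.0, 8.0)
        h = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        beta = (e @ h) / (h @ h)
        err = float(np.sum((e - beta * h) ** 2))
        if best is None or err < best[0]:
            best = (err, w, b, beta, h)
    return best

eta = 0.05                    # expected output error (Step 1)
e = T.copy()                  # network error e*_0 = T
nodes = []
for _ in range(100):          # Steps 2-5: grow until error <= eta
    err, w, b, beta, h = add_best_node(X, e)
    nodes.append((w, b, beta))
    e = e - beta * h
    if np.sqrt(np.mean(e ** 2)) <= eta:
        break

initial_rmse = float(np.sqrt(np.mean(T ** 2)))
final_rmse = float(np.sqrt(np.mean(e ** 2)))
```

Selecting the best candidate per step (rather than accepting the first random node) is what lets the network stay compact, which is the motivation for Step 4's optimizer.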

Experimental settings
In this section, we provide a wide range of experimental results to assess the effectiveness of the proposed method.
In the experiment, the system operating environment is an Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40 GHz with 16 GB memory running a Windows 7 PC, and the programming language is Matlab 2013a. In order to verify the validity and robustness of the proposed HI-DKIELM algorithm, the experiments include four parts: (1) First, we test the performance and robustness of the DE-MPGWO optimization algorithm proposed in Section 3, which is used to obtain the hidden layer node parameters and the output weights. (2) For the HI-DKIELM proposed in Section 4, the number of hidden layers has an important impact on the performance of neural networks; based on the Abalone database, we test the impact of different numbers of hidden layers on the network structure. (3) In order to test the generalization performance of the proposed HI-DKIELM algorithm on regression problems, using 10 groups of data from UCI real data sets, we compare it with the common CI-ELM, EI-ELM, ECI-ELM and DCI-KELM. (4) Similarly, in order to test the generalization performance on classification problems, we compare it with the same four methods on 10 groups of UCI real data sets.

Evaluate the performance and robustness of the DE-MPGWO optimization algorithm
In this section, we will evaluate the performance and robustness of the DE-MPGWO optimization algorithm proposed in Section 3.
In this experiment, we use 10 typical benchmark functions, stated in Table 5, to check the optimization ability. The dimension D of the solution in every typical function is set to 30; the solution range of the typical function F_n6 is set to [−100, 100], that of F_n9 to [−500, 500], and the remainder to [−30, 30].
In order to compare the performance of the DE-MPGWO algorithm, we compare it with four optimization algorithms: DE, PSO, FLA, and the DEPSO proposed in reference [24], which is our former work. The parameters of each method are given in Table 6.
In the numerical experiment, the population scales of all the algorithms are the same, N_p = 40, and the number of iterations in each method is 2000. For each typical function, each optimization algorithm is run 50 times and the average optimization value is used as the final optimization result. Table 7 outlines the optimization results. From the results given in Table 7, we can find that on the typical function F_n2, the searching performance of all the optimization algorithms reaches an ideal result, while the optimization results on the other nine typical functions can be summarized as follows: (1) For the precision of the searching results, the proposed DE-MPGWO optimization method is obviously better than the DE, PSO and FLA methods on the other nine typical functions, and it can obtain more precise solutions. (2) For the ability to escape from local minima, the PSO algorithm falls into a local minimum point quickly, and the period during which it searches effectively is very short; the proposed DE-MPGWO strategy can escape from local minima continuously during the iteration process in order to search for the optimal solution, and it has a better searching ability. In conclusion, the proposed DE-MPGWO optimization method clearly improves the searching ability and achieves a good balance between optimization precision and convergence speed.

Setting the parameters of HI-DKIELM
In HI-DKIELM, the number of hidden layers has an important impact on the performance of the neural network. Based on the Abalone database, the number of hidden layers is set to 1-6 respectively, with 20 nodes in each layer. In this experiment, the impact of different hidden layer numbers on the network structure is tested. Each network structure is tested 10 times and the experimental results are shown in Figure 3.
From the results shown in Figure 3, it can be seen that the testing accuracy does not keep increasing with the number of hidden layers. When the number of hidden layers is 3, the performance of the proposed method is the most stable; when the number of hidden layers continues to increase, the testing accuracy decreases. So in the following experiments, the number of hidden layers was chosen to be 3.
In HI-DKIELM, the kernel function parameter γ and the regularization parameter C have a great influence on the performance. At present, most existing methods select them by cross-validation. In this paper, the values of the two parameters are varied simultaneously over the range 10^0 ∼ 10^10, and the testing accuracy is calculated. The testing accuracy against the values of the two parameters is plotted as a surface, shown in Figure 4.
It can be seen from the experiment results shown in Figure 4 that when the regularization parameter C takes a small value, the performance is poor and varies violently with the kernel function parameter. When the regularization parameter increases, the performance tends to be stable and simultaneously reaches the highest accuracy.
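The cross-validation style sweep over C and γ described above can be sketched as follows; the hold-out split, the RBF kernel, and the (narrower, for speed) parameter grid are assumed for illustration and are not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy data split into train/validation (illustrative only).
X = rng.uniform(-1, 1, size=(120, 2))
T = np.sin(3 * X[:, 0]) + 0.05 * rng.standard_normal(120)
Xtr, Ttr, Xva, Tva = X[:80], T[:80], X[80:], T[80:]

def rbf(A, B, gamma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def val_rmse(C, gamma):
    """Train a kernel ELM on (Xtr, Ttr) and score it on the hold-out set."""
    K = rbf(Xtr, Xtr, gamma)
    alpha = np.linalg.solve(np.eye(len(Xtr)) / C + K, Ttr)
    pred = rbf(Xva, Xtr, gamma) @ alpha
    return float(np.sqrt(np.mean((pred - Tva) ** 2)))

# Sweep C over powers of ten, echoing the paper's 10^0..10^10 range.
grid = [10.0 ** p for p in range(0, 6)]
scores = {(C, g): val_rmse(C, g) for C in grid for g in (0.1, 1.0, 10.0)}
best_C, best_gamma = min(scores, key=scores.get)
best_rmse = scores[(best_C, best_gamma)]
```

Scanning both parameters jointly matters because a too-small C makes the solution over-regularized regardless of γ, which matches the behaviour seen in Figure 4.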

Evaluate the performance of the proposed HI-DKIELM based on the regression problem
In this part, we evaluate the performance of the proposed HI-DKIELM on regression problems. The purpose of the comparison experiment is to evaluate the generalization and robustness of the HI-DKIELM, so we compare it with four basic ELM algorithms: CI-ELM, EI-ELM, ECI-ELM and DCI-KELM. In the experiment, the number of initial hidden layer neurons in the neural network is one and the number of hidden layer neurons is increased by one in each iteration; all the extreme learning machines have the same number of hidden layer neurons and the same number of iterations. The comparison of the training error and the testing error on the regression problems is outlined in Table 8, and Table 9 gives the comparison of the network complexity and training time on the regression problems. The values in brackets are the RMSE values used as the error termination condition.
From the results shown in Table 8, we can find that for the regression problems, the accuracy of the HI-DKIELM method proposed in this paper is improved significantly compared with the other four ELM algorithms. For example, on the CCPP database, when the maximum hidden layer node number is 100 and the error termination criterion RMSE is 0.052, the training error and the testing error of the HI-DKIELM are 0.0417 and 0.0435 respectively, whereas for the DCI-KELM method the training error is 0.0513 and the testing error is 0.0508, and for the ECI-ELM algorithm the training error and the testing error are 0.0535 and 0.0604 respectively. So from the perspective of training error and testing error, the generalization and robustness of the HI-DKIELM are obviously better than those of the other four ELM methods.
The comparison of network complexity and training time for the regression problems among the five ELMs is given in Table 9. From the experiment results, we can find that the network complexity and training time of the HI-DKIELM method proposed in this paper are improved significantly compared with the other four ELM algorithms. For example, on the CCPP database, when the termination criterion RMSE is 0.052, the network nodes and training time of the HI-DKIELM are 27.06 and 1.0277 s respectively, whereas for the DCI-KELM method the network has 45.35 nodes and the training time is 2.0743 s, and for the ECI-ELM algorithm the network nodes and training time are 111.92 and 3.0178 s respectively. So from the perspective of network nodes and training time, the compactness and efficiency of the HI-DKIELM are obviously better than those of the other four ELM methods.

Evaluate the performance of the proposed HI-DKIELM based on the classification problem
In this part, we evaluate the generalization performance of the proposed HI-DKIELM on classification problems. The purpose of the comparison experiment is to evaluate the generalization and robustness of the HI-DKIELM, so we compare it with four basic ELM algorithms: CI-ELM, EI-ELM, ECI-ELM and DCI-KELM. In the experiment, the number of initial hidden layer neurons in the neural network is one and the number of hidden layer neurons is increased by one in each iteration; all the extreme learning machines have the same number of hidden layer neurons and the same number of iterations. The comparison of the training error and the testing error on the classification problems is outlined in Table 10, and Table 11 gives the comparison of the network complexity and training time on the classification problems. From the results shown in Table 10, we can find that for the classification problems, the accuracy of the HI-DKIELM method proposed in this paper is improved significantly compared with the other four ELM algorithms. For example, on the Boston Housing database, when the maximum hidden layer node number is 100 and the error termination criterion RMSE is 0.1, the mean value and the standard deviation of the HI-DKIELM are 98.21 and 0.0038 respectively, whereas for the DCI-KELM method the mean value is 93.01 and the standard deviation is 0.0041, and for the ECI-ELM algorithm the mean value and the standard deviation are 84.82 and 0.0072 respectively. So from the perspective of the mean value and the standard deviation, the generalization and robustness of the HI-DKIELM are obviously better than those of the other four ELM methods.
The comparison of network complexity and training time for the classification problems among the five ELMs is given in Table 11. From the experimental results, we can find that the network complexity and training time of the HI-DKIELM method proposed in this paper are improved significantly compared with the other four ELM algorithms. For example, on the CCPP database, when the termination criterion RMSE is 0.1, the network nodes and training time of the HI-DKIELM are 19.42 and 0.0774 s respectively, whereas for the DCI-KELM method the network has 22.06 nodes and the training time is 0.0942 s, and for the ECI-ELM algorithm, the network

Conclusion
In this paper, a novel HI-DKIELM based on KIELM and the new DE-MPGWO method under a deep learning network structure is proposed. The proposed HI-DKIELM reduces redundant network nodes, which otherwise cause ineffective iteration increases and lower learning efficiency.
The main contributions of this paper can be summarized as follows: (1) An HI-DKIELM network classifier is designed. HI-DKIELM consists of a cascade of a deep learning network and a kernel incremental extreme learning machine: the input data pass through the deep learning network to extract more information and boost separability, achieving a higher dimensional spatial mapping, and the ELM network is then utilized to provide a superior classification surface. In this way, the proposed HI-DKIELM combines the advantages of the deep learning network and the KIELM network and can improve the performance effectively. (2) In order to find the optimal parameters of the Extreme Learning Machine, an appropriate hybrid intelligent optimization algorithm for HI-DKIELM is presented. The proposed hybrid DE-MPGWO algorithm combines the global search ability of the DE algorithm with the local search capability of the multiple population grey wolf algorithm.

Disclosure statement
No potential conflict of interest was reported by the authors.