Deep Learning: Parameter Optimization Using Proposed Novel Hybrid Bees Bayesian Convolutional Neural Network

ABSTRACT Deep Learning (DL) is a type of machine learning used to model big data and extract complex relationships, with the advantage of automatic feature extraction. This paper presents a review of DL, showing its network topologies along with their advantages, limitations, and applications. The most popular Deep Neural Network (DNN) is the Convolutional Neural Network (CNN). The review found that the most important open issue is designing a better CNN topology, which needs to be addressed to improve CNN performance further. This paper addresses this problem by proposing a novel nature-inspired hybrid algorithm that combines the Bees Algorithm (BA), which mimics the behavior of honey bees, with Bayesian Optimization (BO) in order to increase the overall performance of CNN; the hybrid is referred to as BA-BO-CNN. Applying the hybrid algorithm to the Cifar10DataDir benchmark image data increased the validation accuracy from 80.72% to 82.22%. Applying it to the digits dataset gave the same accuracy as the existing original CNN and BO-CNN, but reduced the computational time by 3 min and 12 s. Finally, applying it to concrete crack images produced results almost identical to those of existing algorithms.


Introduction
Artificial Intelligence (AI) builds on the idea of making machines behave like humans, facilitating the development of intelligent systems (Li et al. 2017) that increase productivity and maximize the efficiency of processes such as those run on manufacturing machines. The most popular AI techniques are based on the Artificial Neural Network (ANN), a type of Machine Learning inspired by the biological human brain. These computing systems can be used to model big data and find complex relationships (De Filippis et al. 2017). ANNs can handle high-dimensional real-time data and extract implicit meaningful patterns that can be used to predict the future state of complex systems (Wuest et al. 2016). In addition, ANNs are capable of handling complex dynamic problems due to their ability to deal with nonlinearity. It is worth noting that ANNs are trained on historic data with relative ease by adjusting their control parameters, such as learning rate and momentum.
Big Data analytics (Geissbauer, Vedso, and Schrauf 2016) is an integral part of the Industry 4.0 paradigm, known as the fourth industrial revolution, which aims at creating smart systems where technologies are transformed by the Internet of Things (IoT), Cyber-Physical Systems (CPSs) and Cloud Computing (CC). Modeling an IoT system is based on modeling a stochastic system, addressing the relationship between process and system performance and providing a quantitative analysis of system performance (Ciortea 2018, August). On the other hand, many challenges remain when applying ANNs (Wuest et al. 2016), the biggest being data acquisition, as the availability of relevant data is not guaranteed. In addition, after collecting a dataset, applying appropriate data mining can also be challenging, particularly where a large amount of irrelevant data has been collected, which affects the performance of the resulting models.
As an extension to ANN capabilities, Deep Learning (DL) techniques are now well established and offer better learning capability, as stated by Singh et al. (2018). They have the advantage of automatic feature extraction, learning large numbers of nonlinear filters before the decision-making stage. One of the most popular DL networks is the Convolutional Neural Network (CNN) (Singh et al. 2018).
This paper addresses the issue of designing a better CNN topology by proposing a novel nature-inspired hybrid algorithm that combines the Bees Algorithm (BA), which mimics the behavior of honey bees (Ebubekir 2010), with Bayesian Optimization (BO) in order to increase the overall performance of CNN. The hybrid is referred to as BA-BO-CNN, and the contribution of this study is the improvement of CNN performance using this novel hybrid algorithm.
The paper is structured as follows: section II gives a general review of DL, followed by a review of CNN in particular in section III, together with the gaps and open issues that need to be addressed in order to improve the performance of CNN models. Section IV shows the impact of integrating nature-inspired algorithms with DL networks and proposes the novel hybrid BA-BO-CNN algorithm. Section V presents the results and discussion. Finally, section VI concludes the study and suggests future directions.

Deep Learning
This section introduces the main definition, advantages, limitations, and applications of DL.

Definition
DL is a class of machine-learning techniques (Singh et al. 2018) that utilizes multiple processing layers, where the output from a previous layer is used as an input for the following layer, to learn representations of data with multiple levels of abstraction. The difference between traditional learning and DL can be seen in the feature extraction process, which depends on the data and the model types. Traditional learning consumes significant time in applying a trial-and-error approach to feature extraction, and its success depends on the user's experience. The DL approach, on the other hand, benefits from an automatic feature extraction process through the learning of a large number of nonlinear filters before making decisions. Thus, DL combines feature extraction and decision-making within one model and avoids the often-suboptimal manual handcrafting.

Advantages, Limitations, and Applications of Deep-Learning Networks
DL can have many network structures that can be used in different real-life applications. Each network has its own advantages and limitations depending on the application, as shown in Table 1. These applications come from many fields, such as manufacturing. Wang et al. (2018) mentioned that DL enables advanced analytics for smart manufacturing systems in terms of object, equipment, process, people, and environment using aggregated big data. DL can help describe what happened in a process by capturing products' conditions, and then support the diagnosis of why particular issues occurred by examining the causes and detecting specific failures. After that, DL can be used to help predict what will happen, for example predicting deviations in product quality, and finally to assist in prescribing corrective actions by identifying measures to be taken in order to improve the quality of a product. The deep insights brought by DL can support a company's decision-making process throughout a product lifecycle by improving analyses of data emerging from design, manufacturing, and the supply chain, enhancing process control and shortening downtime. DL has been applied in a wide range of manufacturing systems, especially in the areas of fault diagnosis and product quality inspection.
In addition to the manufacturing context, there are other fields in which DL has important applications. DL has been applied in the pharmaceutical context (Ekins 2016), for example to predict aqueous solubility, the epoxidation site in molecules, and drug-induced liver injuries, as well as to diagnose cancer, extract patterns in gene expression, predict protein disorder, analyze the content of breast cancer, repurpose drugs, and classify microscope images. Thus, DL appears to be a promising tool in the biological context and is generally expected to have even greater applications in the future.

Convolutional Neural Network
This section presents a brief review of CNN principles in order to find gaps and open issues that need to be solved in order to improve the performance of CNN created models.

Definition and Way of Working
CNN is one of the most popular DL networks and is used mainly to perform image classification tasks. MathWorks (MathWorks-1) stated that it is useful for detecting patterns in images that help in the automatic recognition of real physical objects. These patterns are extracted by CNNs directly from image datasets without the need for manual feature extraction, which is the most important factor that makes CNN very popular. In addition, it produces highly accurate recognition results and has the flexibility to be retrained to perform new recognition tasks and to be built on previously created models. It provides an optimal model architecture, enabling advances in detecting and recognizing objects, and is therefore a key technology in automated facial recognition. A CNN might have hundreds of layers that help to detect patterns in images. Filters are used to extract information, starting from simple features, such as brightness, and moving to more complex ones that uniquely identify an object. This filtering is applied to each training image, and the output of each image after convolution can be used as an input to the following layer. CNNs are composed of an input layer, an output layer, and many hidden layers. Every neuron in the hidden layer connects to all input neurons as shown in Figure 1, as mentioned by Le (2015).
The hidden layers perform feature learning using the four most common feature learning layers, which are:
• Convolution: it activates some features in the images using convolutional filters, each represented by a matrix of weights that slides along the pixel brightness input matrix to create a feature map matrix using a special dot product, as mentioned by Hui (2017).
• Rectified linear unit (ReLU): it is used after each convolutional layer to increase the speed and effectiveness of training by mapping negative values to zero and maintaining positive values, which is helpful as an activation.
• Batch normalization: it is used as a supplementary layer after each convolutional layer to mitigate the risk of overfitting by normalizing the input values of the following layers (Yamashita et al. 2018).
• Pooling: it is used between the convolutional layers to reduce the dimensionality of the output volume (McDermott 2021) without losing the important features, which helps to minimize the computational cost. It reduces the number of parameters that need to be learned by performing nonlinear down-sampling that simplifies the output. There are two types: max pooling, which takes the most activated feature, and average pooling, which takes the average presence of the feature. Max pooling is better with a dark background and average pooling is better with a white background, as mentioned in Ouf (2017).
Furthermore, it uses two classification layers:
• Fully connected: it shows the probability of each image being classified for each class.
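The convolution, ReLU, and pooling operations listed above can be illustrated with a minimal pure-Python sketch. The 6x6 image, the vertical-edge kernel, and the function names are illustrative assumptions and are not taken from the networks used in this paper; the pooling windows here are non-overlapping 2x2 blocks.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and take dot products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Map negative values to zero and keep positive values unchanged."""
    return [[max(0, v) for v in row] for row in feature_map]


def pool2d(feature_map, size=2, mode="max"):
    """Down-sample with non-overlapping size x size windows (stride equals size)."""
    reduce_fn = max if mode == "max" else (lambda vals: sum(vals) / len(vals))
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


# Hypothetical 6x6 grayscale image: one bright vertical line on a black background.
image = [[0, 0, 255, 0, 0, 0] for _ in range(6)]
# Hypothetical 3x3 filter that responds to a dark-to-bright vertical edge.
kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

feature_map = conv2d_valid(image, kernel)   # 4x4 feature map
activated = relu(feature_map)               # negative responses removed
pooled_max = pool2d(activated, 2, "max")    # keeps the strongest activation
pooled_avg = pool2d(activated, 2, "average")  # keeps the mean activation
```

In this toy case max pooling keeps the full edge response while average pooling halves it, which matches the intuition that max pooling preserves the most activated feature.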

Gaps and Open Issues
According to Joshi et al. (2019), the most important challenge in training CNN is generalization to unseen datasets, so that the model does not overfit the training data and can give judgment on unknown data. Overfitting is a common issue in training CNN, where the model fits the training data well but does not have the capability to generalize to other datasets. It can be controlled by increasing the sample data using data augmentation techniques, reducing the complexity of the architecture, and stopping the training earlier. In addition, Liang and Liu (2015) and Cogswell et al. (2015) discussed the same overfitting issue and suggested another way to prevent it using dropout. Pan (2017) highlighted that dropout is an efficient method that randomly removes a unit from the network, together with its related edges, independently for each hidden unit and sample. Wu (2017) 2016) and Masko and Hensman (2015) addressed the third issue, which is training on imbalanced classes where the samples are not uniformly distributed; it is a significant and long-standing challenge in training CNN models.
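The dropout idea described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the cited works: it assumes the common "inverted dropout" variant, in which the surviving activations are rescaled during training so that no change is needed at inference time, and the function name and the drop probability of 0.5 are illustrative choices.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero each unit with probability drop_prob (inverted dropout).

    Assumption: surviving units are divided by the keep probability so that
    the expected activation is unchanged; at inference the input is returned as is.
    """
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
hidden = [1.0] * 1000             # hypothetical hidden-layer activations
dropped = dropout(hidden, 0.5, rng)
inference = dropout(hidden, 0.5, rng, training=False)
```

Because a different random mask is drawn on every call, each training sample effectively sees a different thinned network, which is what discourages co-adaptation of units.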
Improving the convergence speed is another future challenge, because obtaining better accuracy sometimes increases the time to convergence, as stated by Chiroma et al. (2019). Furthermore, five research papers, Zhang, Kiranyaz, and Gabbouj (2018), Baldominos, Saez, and Isasi (2018), Sinha, Verma, and Haidar (2017), Ma et al. (2018) and Panwar et al. (2017), discussed the most challenging aspect of training a CNN, which is designing a better topology. The traditional heuristic approach of trial and error might result in a less accurate model, depending on user experience. Applying optimization techniques such as nature-inspired algorithms to optimize the parameters of a CNN can improve the performance of the model. However, designing a better CNN topology is still an open issue, and no approach has yet been found that gives the best CNN topology. The following section discusses the impact of developing hybrid CNN networks on accuracy.

Nature Inspired Algorithms in Deep Learning
This section shows the impact of integrating nature-inspired algorithms with DL networks and proposes a novel hybrid algorithm. Chiroma et al. (2019) discussed the synergy between nature-inspired algorithms and DL. They mentioned that the inspiration for such algorithms can come from animal behavior, human activities, and biological systems. The paper presented many nature-inspired algorithms, such as harmony search, firefly, cuckoo search, evolutionary, ant colony optimization, particle swarm optimization, genetic, simulated annealing, and gravitational search algorithms. The authors stated that combining DL with nature-inspired algorithms has the advantage of solving the local minima problem and improving the performance of the network by increasing the accuracy of its architecture. In addition, the need for trial-and-error techniques in determining the parameters of the DL architecture is eliminated, as nature-inspired algorithms find the best parameter values automatically. However, the optimum parameter setting is still an open problem in the research area. The authors suggested eliminating the need for human intervention in determining the parameters by developing parameter-less nature-inspired algorithms in the future. Finally, the paper suggested applying meta-optimization, which is excessive in the DL area and which tunes an optimization method by using another optimization method.

Impact and Recent Applications
Furthermore, other research papers discussed hybridizing CNN with nature-inspired swarm-based optimization techniques, such as CNN with Evolutionary Algorithm (EA-CNN) and CNN with Genetic Algorithm (GA-CNN), in addition to other techniques such as CNN with Long Short-Term Memory (LSTM-CNN) and CNN with Fuzzy Logic (FL-CNN). Table 2 summarizes the accuracy of the original CNN and the improved accuracy after hybridization. Applying an evolutionary algorithm yielded a highly accurate CNN with 98.88% accuracy for handwritten digit recognition and a lower accuracy of 62.37% for animal image classification. The author used a weight inheritance technique, which treats the training process as a kind of mutation and reduces the evolution cycle time. Bernard and Leprévost (2018) explained the process of evolution as reproducing the population through generations by crossing members and inducing random mutation; evolving the input image would maximize feature activation. The genetic algorithm is one of the evolutionary algorithms and has been applied to optimize the parameters of a CNN model for stock market prediction, where it helped improve the accuracy from 71.69% to 75.95%. Mallawaarachchi (2017) described it in terms of natural selection, in which the fittest individuals in the population are selected. The algorithm produces offspring that inherit the parents' characteristics, so that they have a better chance to survive if their parents have better fitness. The algorithm consists of five phases: initial population, fitness function, selection, crossover, and mutation.
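The five phases of the genetic algorithm can be sketched on a toy problem. The following is only an illustration under stated assumptions: the fitness function (number of ones in a bit string), tournament selection, single-point crossover, the mutation rate, and the retention of the best individual in each generation are all illustrative choices and are not the settings of the cited GA-CNN study.

```python
import random


def fitness(individual):
    """Phase 2, fitness function: toy objective that counts the ones."""
    return sum(individual)


def select(population, rng, k=3):
    """Phase 3, selection: best of k randomly drawn individuals (tournament)."""
    return max(rng.sample(population, k), key=fitness)


def crossover(parent_a, parent_b, rng):
    """Phase 4, crossover: the child takes a prefix of one parent and a suffix of the other."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]


def mutate(individual, rng, rate=0.02):
    """Phase 5, mutation: flip each gene with a small probability."""
    return [1 - g if rng.random() < rate else g for g in individual]


def run_ga(n_genes=20, pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    # Phase 1, initial population: random bit strings.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    initial_best = max(fitness(ind) for ind in population)
    for _ in range(generations):
        elite = max(population, key=fitness)  # assumption: keep the fittest unchanged
        children = [
            mutate(crossover(select(population, rng), select(population, rng), rng), rng)
            for _ in range(pop_size - 1)
        ]
        population = [elite] + children
    return initial_best, max(population, key=fitness)


initial_best, best = run_ga()
```

In a hyperparameter setting, each individual would instead encode a candidate CNN configuration and the fitness would be its validation accuracy, which makes every fitness evaluation a full training run.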
In addition to nature-inspired algorithms, CNN can be integrated with another DL algorithm, Long Short-Term Memory (LSTM), to automatically recognize unsafe worker actions in motion data. The use of LSTM enables sequential feature learning. Dealing with sequential data is an important advantage of this algorithm, as stated by Motepe, Hasan, and Stopforth (2019): it is an effective technique for capturing long-term dependencies, and it avoids recurrent neural network challenges such as the vanishing gradient problem by using nonlinear gating that regulates the flow of information. Furthermore, the hybridization of CNN with FL adds one more layer, a fuzzy self-organization layer. Korshunova (2018) explained that this layer distributes the input data into clusters, whose number is not necessarily equal to the number of output classes, and that the output of the layer is the membership function values for the fuzzy clusters. This new hybrid model improved the handwritten digit recognition accuracy from 97.35% to 99.10%.
However, there are still opportunities to test the integration of CNN with other popular swarm-based algorithms, such as artificial bee colony. Such integration has been done in other applications, for example to optimize the hyperparameters of ANN, as presented by Rashid and Abdullah (2018). They integrated artificial bee colony, genetic algorithm, and back propagation neural network to classify and diagnose diabetes. Adding a genetic operator helps avoid getting stuck in local optima, an issue mentioned by Packianather et al. (2014). In addition, Bullinaria and AlYahya (2014) compared training ANN with back propagation and with artificial bee colony, and found that back propagation is significantly better than artificial bee colony. Xu et al. (2019) applied a modified artificial bee colony that performs better in utilizing neighbor information to accelerate convergence; this new algorithm was used to train ANN. Qolomany et al. (2017) optimized two variables of a DL model, the number of hidden layers and the number of neurons in each layer, using particle swarm optimization. Furthermore, Badem et al. (2017) applied BA along with limited-memory Broyden-Fletcher-Goldfarb-Shanno to train an autoencoder network, while Lee, Park, and Sim (2018) optimized the hyperparameters of CNN using the free harmony search technique. Looking at the literature, the hybridization of the Bees Algorithm with CNN is an important gap, which is addressed in the following section by proposing a novel hybrid algorithm that takes advantage of BA to train CNN.
In addition, nature-inspired algorithms can be hybridized with other deep-learning networks such as the deep Recurrent Neural Network (RNN). Zeybek et al. (2021) presented a novel metaheuristic algorithm that trains a deep RNN using an enhanced Ternary Bees Algorithm (BA-3+) for sentiment classification tasks. The BA-3+ algorithm finds the optimal set of parameters for the deep RNN architecture through the collaborative search of three bees. The authors found that it outperformed other optimization algorithms such as stochastic gradient descent (SGD), differential evolution (DE) and particle swarm optimization (PSO). Training the deep RNN using the BA-3+ algorithm achieved an accuracy between 80% and 90%, while training it using SGD produced an accuracy between 50% and 60% for most datasets.

Proposed Novel Hybrid Algorithm
A new approach for optimizing the parameters and weights of CNN through BO and BA is explained in this section. First, a design of experiments technique is used to investigate the significant factors affecting validation accuracy (VA). Then, BO is used to find the optimal hyperparameters for the network by optimizing the significant factors in order to minimize the classification error on the validation set, which is the objective function. MathWorks (MathWorks-2) stated the benefits of the parameters and defined a specific range for each of them, as shown in Table 3.
Four factors and three levels are used in the experiments, as shown in the following table. A Taguchi orthogonal array (L9) is designed for nine experiments that apply the original CNN with nine different combinations of parameters to the 'Cifar10DataDir' dataset, which is benchmark image data consisting of 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Each class has 6,000 images, so the total sample size is 60,000 images. The VA results are shown in the following table.
After running the experiments and recording VA for each run, analysis of variance (ANOVA) was applied using Minitab software in order to investigate the most influential factors affecting VA, as shown below.
The results reveal that all four factors have a significant effect on VA, so all of them are optimized using the BO technique to minimize the classification error on the validation set, which is the objective function. It is a function of section depth (SD), initial learning rate (ILR), momentum (M), and regularization (R):
Minimize Classification Error on Validation Set = CNN Training and Validation (SD, ILR, M, R)
The BO algorithm builds a probability model of the objective function, uses it to select the hyperparameters of the network, and then evaluates them in the true objective function. It maintains a Gaussian process model of the objective function that uses the network variables as inputs to specify the network architecture. It is applied in conjunction with stochastic gradient descent with momentum (SGDM) as one of the training options in the CNN architecture; momentum adds inertia by letting the previous iteration's update make a proportional contribution to the current update.
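The Taguchi L9 design mentioned above can be written out explicitly. The sketch below lists the standard L9 array for four three-level factors and checks its defining property, namely that every pair of columns contains each combination of levels exactly once. The assignment of the columns to SD, ILR, M, and R is an assumption made for illustration, and the levels are given only as indices 1 to 3 because the actual level values are those of the factor table.

```python
from itertools import combinations, product

# Assumed column order; the paper's actual assignment may differ.
FACTORS = ("SD", "ILR", "M", "R")

# Standard Taguchi L9 orthogonal array: 9 runs, 4 factors, 3 levels each.
L9 = [
    (1, 1, 1, 1),
    (1, 2, 2, 2),
    (1, 3, 3, 3),
    (2, 1, 2, 3),
    (2, 2, 3, 1),
    (2, 3, 1, 2),
    (3, 1, 3, 2),
    (3, 2, 1, 3),
    (3, 3, 2, 1),
]


def is_orthogonal(array, levels=(1, 2, 3)):
    """Check that each pair of columns contains every level pair exactly once."""
    expected = set(product(levels, repeat=2))
    for a, b in combinations(range(len(array[0])), 2):
        pairs = [(row[a], row[b]) for row in array]
        if len(pairs) != len(expected) or set(pairs) != expected:
            return False
    return True


full_factorial_runs = len(list(product((1, 2, 3), repeat=len(FACTORS))))
runs_saved = full_factorial_runs - len(L9)
```

The balance property is what allows the effect of each factor on VA to be estimated from 9 training runs instead of the 81 runs of a full factorial design.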
BA is used to optimize the weight learning rate factor, which adjusts the global learning rate obtained by the BO algorithm in each of the three convolutional layers and the fully connected layer. The optimal learning rate means the optimum amount of weight update (Brownlee 2020) in the convolutional filters and the fully connected layer that perform the classification, so that the classification accuracy on the validation set is improved.
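A minimal sketch of how a per-layer weight learning rate factor could enter an SGDM update is given below. It assumes that the effective learning rate of a layer is the product of the global learning rate and the layer's factor, and it uses the common velocity form of momentum; the function name and the toy quadratic loss are illustrative and are not the MATLAB training routine itself.

```python
def sgdm_step(weights, grads, velocity, global_lr, lr_factor, momentum):
    """One SGDM update for a single layer.

    Assumption: the layer's effective learning rate is global_lr * lr_factor.
    The velocity carries a fraction (momentum) of the previous update forward.
    """
    layer_lr = global_lr * lr_factor
    new_velocity = [momentum * v - layer_lr * g
                    for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Toy example: minimize 0.5 * w**2, whose gradient is simply w.
weights, velocity = [1.0], [0.0]
for _ in range(2):
    grads = list(weights)
    weights, velocity = sgdm_step(
        weights, grads, velocity,
        global_lr=0.1, lr_factor=1.0, momentum=0.9,
    )
```

With a factor above 1 a layer takes larger steps than the global rate prescribes, and with a factor below 1 it takes smaller steps, which is the degree of freedom that BA searches over.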
According to Al-Musawi (2019), BA is one of the most important swarm-based optimization techniques. It performs an intense search after the local search to optimize the variables, so combining it with the BO technique contributes to optimizing the CNN parameters that minimize the classification error on the validation set. Its way of working is inspired by the foraging behavior of honeybees. The algorithm requires the following parameters to be set:
The algorithm starts with the global search, in which the scout bees (n) arrive at random positions and evaluate them based on the fitness value. A local search stage then starts by selecting the best sites (m) and abandoning the remaining sites. After that, an intense search is initiated by selecting elite sites (e), which are the best among the best sites. The next step is selecting the size of the neighborhood search space in order to recruit more bees for the elite sites (e) and fewer bees for the non-elite sites (m-e) to conduct the local search. The global and local searches are performed simultaneously: while the recruited bees explore the best solutions in the neighborhood, the global search on the remaining sites is carried out randomly. This iterative process stops when the optimal solution is found, the maximum number of iterations is exceeded, or no improvement is observed over a specified number of consecutive iterations.
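The search procedure described above can be sketched for a generic minimization problem. This is a simplified illustration of the basic BA and not the authors' MATLAB implementation: the names nep and nsp for the bees recruited to elite and non-elite sites, the fixed iteration budget as the only stopping rule, the clamping of neighbors to the bounds, and the toy objective are all assumptions.

```python
import random


def bees_algorithm(objective, bounds, n=6, m=3, e=1, nep=6, nsp=3,
                   ngh=0.02, iterations=50, seed=0):
    """Basic Bees Algorithm (minimization).

    n: scout bees, m: best sites, e: elite sites,
    nep / nsp: bees recruited to elite / non-elite best sites (assumed names),
    ngh: neighborhood half-width used in the local search.
    """
    rng = random.Random(seed)

    def random_site():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def neighbor(site):
        # Sample inside the neighborhood and clamp to the search bounds.
        return [min(hi, max(lo, x + rng.uniform(-ngh, ngh)))
                for x, (lo, hi) in zip(site, bounds)]

    # Global search: scouts land on random positions and are ranked by fitness.
    sites = sorted((random_site() for _ in range(n)), key=objective)
    history = [objective(sites[0])]
    for _ in range(iterations):
        next_sites = []
        for rank, site in enumerate(sites[:m]):
            recruits = nep if rank < e else nsp  # more bees for elite sites
            candidates = [site] + [neighbor(site) for _ in range(recruits)]
            next_sites.append(min(candidates, key=objective))
        # The remaining n - m bees abandon their sites and scout randomly.
        next_sites += [random_site() for _ in range(n - m)]
        sites = sorted(next_sites, key=objective)
        history.append(objective(sites[0]))
    return sites[0], history


# Hypothetical objective standing in for the validation error of a network
# with four tunable factors, each searched between 0.9 and 1.1.
bounds = [(0.9, 1.1)] * 4
best, history = bees_algorithm(
    lambda x: sum((xi - 1.0) ** 2 for xi in x), bounds
)
```

Because each selected site competes with its own recruits, the best value found never gets worse from one iteration to the next; in the hybrid setting each objective evaluation would be a CNN training and validation run.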
In addition to the basic BA, Baronti (2020) showed two other types of this technique, the shrinking and the standard algorithms. The idea behind the shrinking approach is to take the samples from increasingly small regions of the solution space during the local search, while the standard algorithm adds the abandonment of a site when its local search stagnates to the shrinking procedure. Lindfield and Penny (2017) introduced a further modification by counting the number of times a recruited bee failed to find an improved site; this count is used as a guide in the exploration process to improve the efficiency and effectiveness of the search. Other enhancements introduced by Imanguliyev (2013) include an early neighborhood efficient algorithm, a hybrid Tabu Bees Algorithm, and an autonomous Bees Algorithm. Packianather, Al-Musawi, and Anayi (2019) proposed a new version of BA that discovers rules automatically by adding two parameters, namely quality weight and coverage weight, to avoid ambiguous situations in the prediction stage. They formulated the two new parameters to carry out meta-pruning and make the algorithm suitable for classification tasks. It achieved better classification accuracy and reduced the number of rules, making it more efficient than other classification methods such as JRip and other evolutionary algorithms.
The novel hybrid Bees Bayesian CNN (BA-BO-CNN) uses the basic BA to optimize the weight learning rate factor, which adjusts the global learning rate obtained by the BO algorithm in each of the three convolutional layers and the fully connected layer. BA is applied after the section depth, initial learning rate, momentum, and regularization parameters have been optimized using the BO method. The values of the hyperparameters for the Bees Algorithm are assigned based on the computer capability and the equations in (MathWorks-3):

Results and Discussion of Proposed BA-BO-CNN
MATLAB software is used to apply the hybrid BA-BO-CNN algorithm to three benchmark datasets. The first is the cifar10DataDir benchmark image data, which consists of 60,000 images classified evenly into 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The second dataset is handwritten digits, with ten classes from 0 to 9 and 1,000 images in each class. The last dataset is concrete crack images with two classes: 5,000 negative images without cracks present in the road and 5,000 positive images with cracks.
The hybrid algorithm is applied through the following steps:
• Defining the number of samples for the training, validation, and testing sets, which are selected randomly and shuffled every epoch.
• Defining the CNN architecture. For example, the CNN structure for classifying cifar10DataDir images is composed of 16 layers: one input layer, three convolutional layers, three rectified linear unit layers, three batch normalization layers, two max pooling layers, one average pooling layer, one fully connected layer, one softmax layer, and one classification layer. The mini-batch size is 256.
• Loading the dataset. The input layer is represented by a matrix whose size is the height by the width of the image and which consists of pixel brightness values between 0 and 255 (0 for black and 255 for white). The input layer for the cifar10DataDir data is 32 (height) by 32 (width) by 3 (color channels).
• Defining the objective function for the BO algorithm, which is minimizing the classification error on the validation set of the CNN.
• Defining the optimization variables, namely section depth, initial learning rate, momentum, and regularization, along with a specific range for each one.
• Applying the BO algorithm, which builds a probability model of the objective function, uses it to select the hyperparameters of the network, and then evaluates them in the true objective function. It maintains a Gaussian process model of the objective function that uses the network variables as inputs to specify the network architecture. It is applied in conjunction with SGDM as one of the training options in the CNN architecture.
• The convolutional layers contain 11, 22, and 44 filters respectively, each of size 3x3. Each filter is represented by a matrix of weights that slides along the pixel brightness input matrix to create a feature map matrix using a special dot product, as mentioned by Hui (2017). Padding, which helps to detect the edges of the images, is set to 'same' so that the software calculates the size of the padding automatically at training time and produces an output size equal to the input size if the stride (number of pixels shifted) is one. The section depth is three.
• Max pooling layers with size 3x3 and a stride of two are used between the convolutional layers to reduce the dimensionality of the output volume (McDermott 2021) without losing the important features, which helps to minimize the computational cost. Max pooling takes the most activated feature, whereas average pooling takes the average presence of the feature, so max pooling is better with a dark background and average pooling is better with a white background, as mentioned by Ouf (2017).
• Applying BA to optimize the weight learning rate factors. The optimal learning rate means the optimum amount of weight update (Brownlee 2020) in the convolutional filters and the fully connected layer that perform the classification, so that the classification accuracy on the validation set is improved. BA starts with six scout bees that arrive at random positions and evaluate them based on the validation error value. A local search then starts by selecting the three best sites (m) and abandoning the remaining sites. After that, an intense search is initiated by selecting one elite site (e), which is the best among the best sites since it has the minimum validation error. The next step is selecting the size of the neighborhood search space, which is 0.02, in order to recruit six bees for the elite site (e) and three bees for the non-elite sites (m-e) to conduct the local search within the neighborhood size and update the four parameters. The global and local searches are performed simultaneously: while the recruited bees explore the best solutions in the neighborhood, the global search on the remaining sites is carried out randomly.
• The four learnable parameters for BA are the weight learning rate factors for the three convolutional layers and the fully connected layer. They are updated after the local search within a range of 0.9 to 1.1 to adjust the global learning rate obtained by the BO algorithm.
First, the BO technique is applied to the cifar10DataDir benchmark image data to find the optimum CNN parameters. Figure 5 shows the minimum observed objective function and the estimated minimum objective function. Figure 6 shows the optimum parameter values for section depth, initial learning rate, momentum, and regularization that yielded the minimum classification error obtained by the BO technique. The minimum classification error is 0.1928, which results from a section depth of 3, an initial learning rate of 0.68934, a momentum of 0.84074, and a regularization of 4.5857e-05.
Then, BA is added to adjust the global learning rate in each of the three convolutional layers and the fully connected layer. Figure 7 shows the optimal weight learning rate factors optimized by BA along with the classification error on the validation set. The global learning rate of 0.68934 obtained by BO is adjusted by multiplying it by 0.9308 in the first convolutional layer to become 0.6416. In the second convolutional layer, the adjustment factor is 1.0924, resulting in a learning rate of 0.7530. The factor for the third convolutional layer is 1.0753, so the adjusted value is 0.7412. Finally, the new learning rate for the fully connected layer is 0.6877 after adjusting it by a factor of 0.9977. Figure 8 shows the training progress and validation accuracy after adding BA; the training progress curves for CNN, BO-CNN and BA-BO-CNN are similar.
The following three figures present the confusion matrices for the training, validation, and testing sets for BA-BO-CNN, showing precision and false alarm in the blue and red rows, and recall and miss-out in the blue and red columns respectively.
The following table shows the training, validation, and testing accuracy along with the computational time for the original CNN, BO-CNN, and BA-BO-CNN. The best training accuracy is 92.68%, for the original CNN, while the best validation and testing accuracies are 82.22% and 80.74%, achieved by the novel hybrid BA-BO-CNN. The improvement in validation accuracy from BO-CNN to BA-BO-CNN is 1.5 percentage points, which is higher than the improvement of 0.38 percentage points between the existing original CNN and the existing BO-CNN. It also performs better than the EA-CNN model designed by Badan (2019), which achieved an accuracy of 62.37% on the same cifar10DataDir dataset, as shown in Table 2.
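The reported per-layer learning rates and accuracy gain can be reproduced from the numbers quoted in this section. The short check below uses only values stated in the text; the dictionary keys are illustrative labels for the three convolutional layers and the fully connected layer.

```python
# Values reported in the text.
global_lr = 0.68934
factors = {"conv1": 0.9308, "conv2": 1.0924, "conv3": 1.0753, "fc": 0.9977}
reported_lr = {"conv1": 0.6416, "conv2": 0.7530, "conv3": 0.7412, "fc": 0.6877}

# Effective learning rate of each layer = global rate x BA factor.
layer_lr = {name: global_lr * factor for name, factor in factors.items()}
max_gap = max(abs(layer_lr[name] - reported_lr[name]) for name in factors)

# BO-CNN validation accuracy implied by the minimum classification error.
bo_error = 0.1928
bo_accuracy = round((1 - bo_error) * 100, 2)       # percent
ba_bo_accuracy = 82.22                             # percent, reported
gain_points = round(ba_bo_accuracy - bo_accuracy, 2)
```

The products agree with the reported learning rates to within 1e-4, and the minimum BO classification error of 0.1928 corresponds to the 80.72% validation accuracy from which the 1.5 percentage point gain is measured.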
So, it is concluded that novel hybrid BA-BO-CNN has better classification performance and generalization capability to unseen data.In addition, it is the best algorithm in terms of cost-effectiveness since it achieved lower computational time than BO-CNN by 2 min and 11 s.
The same procedure is followed to apply the original CNN, BO-CNN, and the novel hybrid BA-BO-CNN to the other two datasets; the results are shown in the following two tables. For the digits dataset, the table shows that the novel hybrid BA-BO-CNN produces exactly the same accuracy as the existing algorithms, because the simple features in the digit images can already be classified with high accuracy. However, the computational time of the hybrid algorithm is lower by 3 min and 12 s, so it is better in terms of cost-effectiveness.
For the concrete cracks dataset, the new hybrid algorithm gives results almost identical to those of the existing BO-CNN: since the dataset has only two classes, the existing algorithms already achieve high accuracy.

Conclusion
Artificial Neural Networks have the ability to handle high-dimensional real-time data and extract implicit meaningful patterns that can be used to predict the outcome for previously unseen cases. The performance of ANNs depends on the significance of the features extracted from the training data, and extracting them is often a time-consuming process that leads to sub-optimal solutions. However, this problem can be overcome by Deep Learning (DL), which is a type of Machine Learning technique. DL has better learning capability because it extracts features automatically by learning a large number of nonlinear filters before making decisions.
In this paper, a review of DL has been conducted, showing all its networks along with their advantages, limitations, and applications. The most popular DL network is the CNN, which uses four feature learning layers, namely convolution, rectified linear unit, batch normalization, and pooling layers, in addition to two classification layers, one of which is fully connected and the other is a SoftMax layer used mainly for image classification. The review found five open issues, namely overfitting, the exploding gradient problem, training the model using imbalanced classes, convergence speed, and designing a better CNN topology, which is the most important issue that needs to be addressed in order to improve the performance of CNN models further.
In addition, nature-inspired algorithms have been included, showing their contribution to increasing classification accuracy. This paper addressed the issue of designing a better CNN topology by proposing a novel nature-inspired hybrid algorithm that combines Bayesian Optimization (BO) with the Bees Algorithm (BA) in order to increase the overall performance of CNN, which is referred to as BA-BO-CNN. BO optimized four parameters, namely section depth, initial learning rate, momentum, and regularization, while BA optimized the weight learning rate factor to adjust the global learning rate obtained by BO in each of the three convolutional layers and in the fully connected layer.
Applying the hybrid algorithm to the Cifar10DataDir benchmark image data yielded an increase in validation accuracy from 80.72% to 82.22%. Applying it to the digits dataset gave the same accuracy as the existing original CNN and BO-CNN, but with a reduction in computational time of 3 min and 12 s. Finally, applying it to the concrete cracks images produced results almost identical to those of the existing algorithms.
In the future, the novel hybrid BA-BO-CNN algorithm can be developed further by using BA to optimize the weight regularization factor in the convolutional layers and the fully connected layer, in order to improve the performance of CNN.

• SoftMax: an activation function that works better with multi-class classification problems than with binary classification problems, which require the sigmoid logistic function, as shown in McDermott (2021). It provides the classification output and may not have any parameters, as mentioned by Wu (2017).
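To make the relationship between SoftMax and the sigmoid logistic function concrete, the following is a minimal Python sketch, not taken from the paper. The class scores are hypothetical. It shows that SoftMax turns a vector of raw scores into probabilities that sum to one, and that in the two-class case with one score fixed at zero it gives the same probability as the sigmoid.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    peak = max(logits)  # subtract the maximum to avoid overflow in exp
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [value / total for value in exps]


def sigmoid(z):
    """Logistic function applied to a single binary-class score."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical raw scores for a three-class problem.
probs = softmax([2.0, 1.0, 0.1])

# Two-class case: with the second score fixed at 0, the first softmax
# probability equals the sigmoid of the first score.
two_class = softmax([1.5, 0.0])
print(probs, two_class[0], sigmoid(1.5))
```

The function has no trainable parameters of its own, which is consistent with the remark above that a SoftMax layer may not have any parameters.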

Figure 2
Figure 2 below demonstrates how a CNN works, showing the feature learning and classification layers.

• Number of scout bees: n
• Number of selected bees: m
• Number of elite bees: e
• Number of recruited bees for elite (e) sites: nep
• Number of recruited bees for other best (m-e) sites: nsp
• Neighborhood size for each selected patch (local search): ngh
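The roles of these parameters can be illustrated with a minimal Python sketch of a basic Bees Algorithm. It is a sketch under stated assumptions and not the implementation used in this paper: the objective `toy_error` is a hypothetical stand-in for the validation error of a CNN, `ngh` is treated as an absolute neighborhood radius, and the default parameter values, bounds, iteration count, and seed are arbitrary.

```python
import random


def bees_algorithm(objective, bounds, n=20, m=6, e=2, nep=8, nsp=4,
                   ngh=0.1, iterations=50, seed=0):
    """Minimise `objective` over the box `bounds` with a basic Bees Algorithm."""
    rng = random.Random(seed)

    def random_site():
        # A scout bee samples a site uniformly inside the bounds.
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def neighbour(site):
        # A recruited bee samples within +/- ngh of the site, clipped to bounds.
        return [min(hi, max(lo, x + rng.uniform(-ngh, ngh)))
                for x, (lo, hi) in zip(site, bounds)]

    # The n scout bees start at random sites, ranked from best to worst.
    sites = sorted((random_site() for _ in range(n)), key=objective)
    for _ in range(iterations):
        next_sites = []
        for rank, site in enumerate(sites[:m]):
            # Elite sites receive nep recruits; the other selected sites nsp.
            recruits = nep if rank < e else nsp
            candidates = [site] + [neighbour(site) for _ in range(recruits)]
            # Only the fittest bee of each patch survives.
            next_sites.append(min(candidates, key=objective))
        # The remaining n - m bees scout new random sites (global search).
        next_sites.extend(random_site() for _ in range(n - m))
        sites = sorted(next_sites, key=objective)
    return sites[0], objective(sites[0])


def toy_error(factors):
    # Hypothetical stand-in for validation error; smallest when all factors are 1.0.
    return sum((f - 1.0) ** 2 for f in factors)


# Search four factors, loosely analogous to one factor per adjusted layer.
best_factors, best_error = bees_algorithm(toy_error, bounds=[(0.5, 1.5)] * 4)
print(best_factors, best_error)
```

Because each selected site is kept among its own candidates, the best error found never increases from one iteration to the next. In a real setting, each call to the objective would involve training and validating a network, so the number of evaluations per iteration (roughly `e*nep + (m-e)*nsp + (n-m)`) dominates the cost.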
Joshi et al. (2019)zation technique that helps in reducing kernel redundancy and thus preventing overfitting. However, improving the learning capability is still an open challenge and can be enhanced continuously. In addition, Joshi et al. (2019) reported other issues, such as the exploding gradient problem, where the model stops learning after a certain number of epochs, which causes instability in the learning process resulting in NaN values, Shah et al. (2016)overcome by redesigning the network architecture and selecting an appropriate activation function. Also, Shah et al. (2016) suggested using residual and highway networks that learn in earlier layers, allowing for earlier representation. In addition to Joshi et al. (2019), Fu et al. (

Table 4.
Factors and levels definition.

Table 6.
Classification accuracy and computational time for algorithms (cifar10DataDir dataset).

Table 7.
Classification accuracy and computational time for algorithms (digits dataset).

Table 8.
Classification accuracy and computational time for algorithms (concrete cracks dataset).