New convolutional neural network models for efficient object recognition with humanoid robots

ABSTRACT Humanoid robots are expected to manipulate objects they have not previously seen in real-life environments. Hence, it is important that robots have object recognition capability. However, real-time object recognition at different locations and with different object positions is still a challenging problem. The current paper presents four novel models with small structures, based on Convolutional Neural Networks (CNNs), for object recognition with humanoid robots. In the proposed models, a few combinations of convolutions are used to recognize the class labels. Our models are first tested on the MNIST and CIFAR-10 benchmark datasets. The performance of the proposed models is shown by comparison to that of the best state-of-the-art models. The models are then applied on the Robotis-Op3 humanoid robot to recognize objects of different shapes. The results of the models are compared to those of established models, such as VGG-16 and Residual Network-20 (ResNet-20), in terms of training and validation accuracy and loss, parameter number, and training time. The experimental results show that the proposed models exhibit highly accurate recognition with lower parameter numbers and smaller training times than complex models. Consequently, the proposed models can be considered promising and powerful models for object recognition with humanoid robots.


Introduction
Nowadays, humanoid robots are increasingly used in many areas, such as medicine, education, healthcare, logistics, and house and hotel services, to improve the quality of human life (Andtfolk et al., 2021; Angelopoulos et al., 2021; Dannecker & Hertig, 2021; Garcia-Haro et al., 2021; Nenchev et al., 2018; Oliver et al., 2021). Humanoid robots have been used instead of humans or together with them (Ambrose et al., 2001; Chohra & Madani, 2018; Fitzpatrick & Metta, 2003; Levine et al., 2016; Reforgiato Recupero, 2021; Sakagami et al., 2002). The humanoid robot used in Fitzpatrick and Metta (2003) learns the objects around it by using its body so that it can define and interpret its environment. In this case, vision is considered the main source of the information the robot can obtain. Computer vision technologies are commonly used in such applications, as in this study. For example, computer vision has been used in robotic applications for tasks such as obstacle avoidance and navigation (Chang, 2010; Pandey & Gelin, 2017), human-robot interaction (Le et al., 2018; Yavşan & Uçar, 2016), object detection and recognition for assistive robots (Martinez-Martin & Del Pobil, 2017), and object recognition for grasping (Aslan et al., 2020; 2021; Ku et al., 2017; Levine et al., 2016).
Object recognition and detection are essential for humanoid robots, mobile robots, robotic arms, flying robots, etc., to be able to interpret and understand their environment. Many machine learning algorithms have already been proposed and used in object recognition and detection applications (Bhuvaneswari & Subban, 2018; Lee, 2015). Traditional machine learning methods, such as random forests, Support Vector Machines (SVMs), adaptive boosting, and Feed Forward Neural Networks (FNNs), have limitations in their ability to process raw data (Alpaydin, 2016; Haykin, 2009). These methods require hand-crafted features as inputs; hence, a separate feature extraction step must be performed before classification. They cannot provide end-to-end solutions without a feature engineering step.
Deep learning methods for object recognition have become a popular field of study in recent years (Lee, 2015; Xie et al., 2017). A deep learning model is a graph of layers without a direct loop. The most common example is a linear stack of layers that maps a single input to a single output (Chollet, 2017). Deep Neural Networks (DNNs) can outperform traditional feature-based approaches at computer vision tasks such as perception, recognition, and segmentation (Girshick, 2015; Girshick et al., 2014; Redmon & Farhadi, 2017; Ren et al., 2015). Convolutional Neural Networks (CNNs) are a type of DNN (Srinivas et al., 2016). While feature vectors are used as the inputs in classical machine learning methods, the image can be used directly in CNNs. Therefore, when CNNs are used for classification, they perform better than conventional machine learning methods (Lee, 2015). DNN models such as AlexNet (Krizhevsky et al., 2012), GoogleNet (Szegedy et al., 2015), Inception V3 (Szegedy et al., 2016), Residual Network (ResNet) (He et al., 2016), and VGG-16 (Simonyan & Zisserman, 2014) are particularly deep and powerful structures. However, these models need a large memory to store the weights and a lot of time to run. DNNs usually suffer from excessive parameterization and often encode highly correlated parameters, resulting in inefficient computing and memory usage (Chen et al., 2015; Han et al., 2015). This paper proposes new models to overcome this disadvantage.
Recent works on efficient deep learning have focused on model compression and on reducing the computational operations in DNNs (Chen et al., 2015; Denil et al., 2013; Han et al., 2015; Idelbayev & Carreira-Perpinán, 2021; Lee et al., 2021; Rastegari et al., 2016). Chen et al. (2015) presented a new network architecture called HashedNets that takes advantage of inherent redundancy. Han et al. (2015) introduced a deep compression network with pruning, trained quantization, and Huffman coding. Rastegari et al. (2016) proposed simple and accurate binary approximations to apply faster convolutional operations. Denil et al. (2013) presented a parameter prediction technique for reducing the parameter number in DNNs. Jaderberg et al. (2014) used cross-channel or filter redundancy to generate low-rank filters for speeding up pre-trained CNNs with minimal loss of accuracy. In Lee et al. (2021) and Idelbayev and Carreira-Perpinán (2021), a tensor compose-decompose approximation and an approximation combining low-rank decompositions with different matrix shapes were proposed to obtain compressed DNNs with high performance.
Several works have been carried out to decrease the parameter number of DNNs (Ayinde & Zurada, 2018; Gong et al., 2014; Gowda & Yuan, 2018; Huang et al., 2017; Iandola et al., 2016; Jha et al., 2020; Krizhevsky et al., 2012; Srinivas & Babu, 2015; Yang et al., 2015). Gong et al. (2014), Yang et al. (2015), and Srinivas and Babu (2015) rely on the compression of the fully connected layers, which bear most of the weights; they did not improve the speed of the network. In Iandola et al. (2016), a small stand-alone DNN architecture is proposed that has 50 times fewer parameters than AlexNet (Krizhevsky et al., 2012), but it is slower. Jha et al. (2020) proposed the LightLayers network to reduce the number of parameters in DNNs; it consists of the classical Conv2D and Dense layers and is based on matrix decomposition. In Ayinde and Zurada (2018), a pruning technique was provided for removing redundant features in CNNs. Huang et al. (2017) introduced the Dense Convolutional Network (DenseNet), which uses direct connections between any two layers with the same feature-map size. In Gowda and Yuan (2018), images from seven different colour spaces were used together with dense networks to increase performance.
This paper proposes four CNN models with small structures that, unlike those in the literature, do not require additional pruning or decomposition steps. The performance of the proposed models is comparatively presented against popular models from the literature on the benchmark MNIST and CIFAR-10 datasets (Krizhevsky & Hinton, 2009; LeCun et al., 1998). The Robotis-Op3 humanoid robot is used for a real object recognition application (Robotis-Op3, 2020). The performance of all models, as well as of the VGG-16 and ResNet-20 models generated using transfer learning, is evaluated in terms of the training and validation accuracy, the training and validation loss, the parameter number, and the training time.
The rest of this paper is organized as follows: Section 2 briefly introduces the basic layers of CNNs and presents the proposed CNN models. The comparative results on the MNIST and CIFAR-10 datasets are presented in Section 3.1. In Section 3.2, object recognition with the humanoid robot is carried out and the experimental results are illustrated to demonstrate the performance of the proposed models. Section 4 concludes this paper.

Convolutional neural networks
CNNs are similar to conventional FNNs, optimizing their weights and biases by means of self-learning (Girshick et al., 2014; Srinivas et al., 2016). Each neuron receives an input and applies a mathematical operation such as a dot product followed by a linear or nonlinear activation function. The obtained layer outputs are sent to the next layer through the weights. The final layer includes a loss activation function to approach the ground-truth values. The main difference between CNNs and FNNs is that CNNs can use images as input. This property drastically reduces the parameter number with respect to an FNN, which takes a vector as its input.
CNNs are composed of the input layer, which holds the images, usually with three dimensions (height, width, and depth); convolutional layers; nonlinear activation functions; pooling layers; dropout layers; batch normalization layers; one or more fully connected layers, called dense layers; and a loss activation layer. A simple CNN structure is illustrated in Figure 1.
The main functionalities of the layers are defined as follows: (1) The input layer receives the image pixel values as its inputs.
(2) The convolution layer includes a set of learnable filters. The filter dimensions are smaller than the input dimensions. The local receptive regions of the input are connected to the neurons in the next layer via a dot product between the weights and the inputs in the region at every spatial position. In other words, each filter is moved across the width and height of the input and produces a feature map. Figure 2 depicts the receptive fields and the feature maps. Differently from FNNs, CNNs share the same weights for all local regions. The weight sharing reduces the parameter number and provides learning and representation efficiency, good generalization, and invariance to translation. A small worked example of this reduction is given below.
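As a back-of-the-envelope illustration of this reduction (our own example, not from the paper's tables), consider mapping a 28 × 28 input to a 26 × 26 feature map: a fully connected mapping needs one weight per input-output pair, whereas a shared 3 × 3 kernel needs only nine weights plus a bias.

```python
# Parameters needed to map a 28x28 input to a 26x26 output map
fully_connected = (28 * 28) * (26 * 26)  # one weight per input-output pair
shared_kernel = 3 * 3 + 1                # one 3x3 kernel plus a bias, reused at
                                         # every spatial position
print(fully_connected)                   # 529984
print(shared_kernel)                     # 10
```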
Given $I_i = \mathrm{Input}[:, :, i]$ and the kernel $K_i$ in the form of a square weight matrix, the convolution is applied as
$$\mathrm{Conv}(I, K) = \sum_{i} K_i * I_i,$$
where the asterisk shows the convolution. The 0-th feature map, $F_0$, is calculated as
$$F_0 = g\left(\sum_{i} K_i * I_i + b\right),$$
where $g$ is the activation function and $b$ is the bias value. After the convolution process with each filter, image features, such as edges and corners, are extracted.
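To make the two formulas concrete, the following is a minimal NumPy sketch of a single feature map. The function name, unit stride, and 'valid' (no-padding) convolution are our assumptions for illustration; as in most deep learning libraries, the operation is implemented as cross-correlation.

```python
import numpy as np

def feature_map(inputs, kernels, b, g=lambda x: np.maximum(0.0, x)):
    """Compute F_0 = g(sum_i K_i * I_i + b) for one filter.

    inputs:  (H, W, D) array, one slice I_i per channel
    kernels: (k, k, D) array, one square kernel K_i per channel
    b:       scalar bias; g: activation function (ReLU by default)
    """
    H, W, D = inputs.shape
    k = kernels.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))       # 'valid' convolution, stride 1
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # dot product between the shared weights and the receptive region
            out[y, x] = np.sum(inputs[y:y + k, x:x + k, :] * kernels)
    return g(out + b)

image = np.random.rand(28, 28, 1)                # e.g. one MNIST-sized image
K = 0.1 * np.random.randn(3, 3, 1)
print(feature_map(image, K, b=0.0).shape)        # (26, 26)
```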
(3) The pooling layer applies down-sampling along the spatial dimensions of the convolution layers. Thanks to the pooling layer, the overfitting and vanishing gradient problems are alleviated. Min, max, and average pooling are some of the pooling methods.

(4) The dropout layer is a regularization method. During the training stage, some nodes of the layer are randomly and temporarily removed to prevent overfitting. After the dropout layer, the outputs are usually flattened into a vector.

(5) The batch normalization layer normalizes the outputs of the layer as follows:
$$\bar{z} = \frac{z - c}{\sqrt{s^2 + \epsilon}}, \qquad c = \frac{1}{o}\sum_{j=1}^{o} z_j, \qquad s^2 = \frac{1}{o}\sum_{j=1}^{o} (z_j - c)^2,$$
where $c$, $s$, and $o$ are the mean, standard deviation, and mini-batch size, respectively; $\bar{z}$ is the normalized value and $\epsilon$ is a small number. Batch normalization makes training faster and increases the performance of the network. Batch normalization is usually used together with the Rectified Linear Unit (ReLU) activation function. The ReLU function is defined as max(0, input). The ReLU and its derivative are easy and fast to calculate. The derivative of ReLU is 0 for negative inputs and 1 otherwise. Hence, there is no vanishing gradient problem when applying ReLU (Girshick et al., 2014; Srinivas et al., 2016).

(6) The fully connected layers perform the same duties as in FNNs by connecting every neuron in the current layer to every neuron in the next layer. Before these layers are used, the data are transformed into a vector; this operation is called flattening. These layers include a nonlinear activation function or a softmax activation to produce the class scores (Girshick et al., 2014; Srinivas et al., 2016). The softmax function is used in classification applications. The function assigns a probability to each class,
$$\mathrm{softmax}(x_j) = \frac{e^{x_j}}{\sum_{k=1}^{C} e^{x_k}},$$
making the total of the probabilities equal to 1, where $C$ is the class number.
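The batch normalization and softmax formulas above can be sketched in a few lines of NumPy. This minimal version omits the learnable scale and shift parameters that framework implementations of batch normalization add.

```python
import numpy as np

def batch_norm(z, eps=1e-5):
    """z_bar = (z - c) / sqrt(s^2 + eps) over a mini-batch of o samples."""
    c = z.mean(axis=0)          # per-feature mini-batch mean
    s = z.std(axis=0)           # per-feature mini-batch standard deviation
    return (z - c) / np.sqrt(s ** 2 + eps)

def softmax(x):
    """Assign a probability to each of the C classes; the total equals 1."""
    e = np.exp(x - x.max())     # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])             # raw scores for C = 3 classes
print(softmax(scores), softmax(scores).sum())  # probabilities and their sum (1.0)
```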

Proposed convolutional neural network models
In this paper, we propose four new CNN models with small structures, instead of large ones using multiple convolutions, for object recognition with humanoid robots. Our models are listed in Tables 1-4. The structure of Model 1 consists of repeated sequential blocks of convolution, activation, max pooling, and dropout layers, followed by flatten, dense, activation, dropout, dense, and activation layers. In Model 1, the dimensions of the feature maps are 16, 32, 64, 128, and 256, respectively. Model 2 is generated from three sequential blocks, each including convolution, activation, batch normalization, convolution, activation, batch normalization, max pooling, and dropout layers, followed by flatten and dense layers. In Model 2, the dimensions of the feature maps are 16, 32, and 64, respectively. Model 3 includes convolution, max pooling, convolution, max pooling, flatten, and dense layers. In Model 3, the dimensions of the feature maps are 16 and 32, respectively. Model 4 is constructed from three sequential blocks, each including convolution, activation, max pooling, and dropout layers, followed by flatten, dense, activation, dropout, dense, and activation layers. In Model 4, the dimensions of the feature maps are 32, 64, and 128, respectively. In all models, we used 3 × 3 filters, the ReLU activation function, and same convolution, and we adaptively generated the model structures with respect to the input image size, n × n, and the class number, C.
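As an illustration, the smallest of the four architectures, Model 3, can be sketched in Keras as below. The layer order, 3 × 3 filters, ReLU, same padding, and the 16 and 32 feature maps follow the description above; the 2 × 2 pool size, the input depth argument, and the single softmax dense layer are our assumptions, so the reported parameter count may differ from the paper's tables.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model3(n, C, depth=3):
    """Sketch of Model 3 for an n x n x depth input and C classes:
    Conv(16)-MaxPool-Conv(32)-MaxPool-Flatten-Dense."""
    return keras.Sequential([
        keras.Input(shape=(n, n, depth)),
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),               # assumed 2x2 pooling
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(C, activation="softmax"),
    ])

model = build_model3(32, 10)                  # e.g. a CIFAR-10 configuration
model.summary()                               # prints the parameter count
```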

Experiments
To evaluate the performance of the proposed CNN models, we conducted two different experiments. First, we evaluated our models on the MNIST and CIFAR-10 datasets (Krizhevsky & Hinton, 2009; LeCun et al., 1998) to compare their performance with that of models in the literature. Second, we carried out an object recognition application with a real humanoid robot.

Experimental results on the MNIST dataset
We first used the benchmark MNIST dataset (LeCun et al., 1998). It consists of handwritten digits ranging from 0 to 9. The dataset contains a total of 70,000 images, with 60,000 for training and 10,000 for testing. All images in the dataset have a size of 28 × 28. Figure 3 shows some examples of the MNIST digits.
In the experiments, we applied Adam optimization with a mini-batch size of 32, a learning rate of 0.001, and the cross-entropy loss for 20 epochs. We trained all models on the MNIST dataset. Moreover, we transferred the pre-trained ResNet-20 model and then trained it as well.
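A minimal training sketch with these settings is given below. It reuses the hypothetical build_model3 helper from Section 2 with a single-channel input, and using the test split for validation is an assumption on our part.

```python
from tensorflow import keras

# Load MNIST and scale the pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0          # (60000, 28, 28, 1)
x_test = x_test[..., None] / 255.0            # (10000, 28, 28, 1)

model = build_model3(28, 10, depth=1)         # Model 3 sketch, 28 x 28 grayscale
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),  # Adam, lr = 0.001
    loss="sparse_categorical_crossentropy",                # cross-entropy loss
    metrics=["accuracy"],
)
model.fit(x_train, y_train, batch_size=32, epochs=20,      # mini-batch size 32
          validation_data=(x_test, y_test))
```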
The accuracy and parameter-number results of the proposed CNN models are presented by means of the pink and green blocks in Figure 4 and the values in Table 5. Table 5 lists the test results for our four models and three CNN models from the literature: ResNet-20 (He et al., 2016), Conv2D (Jha et al., 2020), and LightLayers-CNN (Jha et al., 2020). The accuracy values in percent of our Models 1-4, Conv2D, and LightLayers-CNN are 99.25, 98.99, 99.40, 98.70, 98.87, and 97.75, respectively. The highest accuracy values are 99.40, belonging to our Model 4, and 99.25, belonging to our Model 2, as shown in Figure 4.

Experimental results on the CIFAR-10 dataset
We next used the benchmark CIFAR-10 dataset (Krizhevsky & Hinton, 2009). It has 10 classes: aeroplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. It includes 50,000 colour images for training and 10,000 for testing. The image size is 32 × 32. Figure 5 depicts a random sample of images belonging to the dataset.
In the experiments, we applied Adam optimization with a mini-batch size of 32, a learning rate of 0.001, and the cross-entropy loss for 20 epochs. We trained all models on the CIFAR-10 dataset. The test accuracy results of the proposed CNN models are presented by means of the pink and green blocks in Figure 6 and the values in Table 6. Table 6 lists the test results for our four models and seven CNN models from the literature: LightLayers-CNN (K = 5) (Jha et al., 2020), VGG-16-pruned-A (Ayinde & Zurada, 2018), ResNet-56-pruned-A (Ayinde & Zurada, 2018), ResNet-110-pruned-A (Ayinde & Zurada, 2018), Colornet-40-12 (Gowda & Yuan, 2018), Colornet-40-48 (Gowda & Yuan, 2018), and Densenet-BC-100-12 (Huang et al., 2017). The accuracy values in percent of our Models 1-4 and of these seven models are 82.51, 67.55, 75.55, 68.82, 55.76, 93.67, 93.12, 93.27, 95.02, 96.86, and 94.08, respectively. The highest value of accuracy is 95.02, belonging to Colornet-40-12 (Gowda & Yuan, 2018). On the other hand, Model 2 has the highest accuracy (82.51) among our models, as shown in Figure 6. The parameter numbers of all models are shown in Table 6. The lowest parameter number is 16,618, belonging to our Model 3; however, its accuracy is low, at 67.55. The parameter number of Model 2 is 110,458. We can see that Model 2 offers an acceptable trade-off between accuracy and parameter number.
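The accuracy versus parameter-number trade-off can be checked in the same sketch style; Keras' count_params() reports the parameter number, though, as noted for the Model 3 sketch, the exact figure depends on configuration details we have assumed.

```python
from tensorflow import keras

# Load CIFAR-10 and scale the pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = build_model3(32, 10)          # Model 3 sketch for 32 x 32 colour images
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=20)

print("parameters:", model.count_params())
print("test accuracy:", model.evaluate(x_test, y_test, verbose=0)[1])
```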

Experimental results for object recognition with humanoid robots
In the object recognition application, we used the Robotis-Op3 humanoid robot shown in Figure 7 (Robotis-Op3, 2020). Robotis-Op3 has 20 axes and includes an Intel NUC i3 (dual core, 2133 MHz) computer, a 3-axis gyroscope, a 3-axis magnetometer, a 3-axis accelerometer, the Linux operating system, a Logitech C920 HD-Pro camera, C, the Robot Operating System (ROS), and the Dynamixel SDK. The training experiments were run on a workstation with an Nvidia Titan XP GPU.
In the experiments, we used real objects with seven different shapes: a small square, a big square, a cylinder, a ball, a rhombus, and a triangle, in addition to a black sponge. Figure 8 depicts some example objects.
We set the head and neck angles of the humanoid robot to 0 and -0.95 radians, respectively. We took images with a resolution of 640 × 360 pixels from the camera of the robot using the ROS environment. We placed the objects at 20 different locations and views and collected 720 images. We removed the clutter regions not within the grasping area of the robot. The training and validation sets were prepared using 576 and 144 images, respectively. We re-sized the images to 256 × 256. We trained our Models 1-4 for 20 epochs by using the Adam method and the cross-entropy loss function. Moreover, we reconstructed our models for 28 × 28 and 32 × 32 inputs and trained the new models in the same way. We evaluated the resulting 12 models on the dataset we obtained with the humanoid robot. Moreover, we loaded the ResNet-20 and VGG-16 models by using transfer learning and then trained them (a sketch of this setup is given at the end of this section). Table 7 shows the training and validation accuracies, the training and validation losses, the parameter numbers, and the training times of all models. Figures 10-13 show the accuracy and loss values with respect to the epochs. As can be seen from the figures, the accuracy increases with the epoch. The validation accuracy is usually smaller than the training accuracy as the epoch increases, and the training loss is smaller than the validation loss; there is no overfitting. Table 7 and Figures 10-13 show that Model 3 is the best one: it has the lowest training and validation losses, 0.0013 and 0.0017, and training and validation accuracies of 1.0000. Moreover, the parameter numbers in Table 7 show that the proposed models have much less complexity than VGG-16 and ResNet-20. As can be seen from Table 7, small input sizes yielded lower parameter numbers and less training time than the original versions. For example, the parameter number of
Model 3 was reduced from 743,142 to 9,894, which is about 75 times less than that of the original one. Hence, a small input size is important for low training time and fast calculation due to the low computational load in the testing stage. Figure 9 shows, in blocks, the training times of our models, the VGG-16 model, and the ResNet-20 model for inputs of size 256 × 256. As can be seen from the figure, Model 3 takes the smallest training time and is the best one for real-time applications. After embedding our models into the robot in the ROS environment, the robot recognizes all the objects and then proceeds to the manipulation step. In addition, attention should be paid to overfitting, as in Figures 10-13, and the training process should be stopped before the validation loss becomes larger than the training loss. We ran the experiments with different numbers of epochs, such as 20, 40, and 60. We observed higher accuracy on MNIST in all models as the number of epochs was increased, but on CIFAR-10 and our dataset we observed lower accuracy in some models. Hence, we used 20 epochs for all datasets. On the other hand, the performance of our models could be improved by using convolutions with different sizes, as in the Inception network (Szegedy et al., 2016). Moreover, the performance could be increased with different optimization methods and learning rates.
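The following is a minimal sketch of the transfer-learning setup mentioned above, assuming the Keras VGG16 application with ImageNet weights and a frozen base; ResNet-20 is not bundled with keras.applications, so a pre-trained checkpoint would be loaded analogously. The seven-class head matches our object shapes; the other choices, including the commented-out training arrays, are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Pre-trained VGG16 backbone without its classification head
base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(256, 256, 3))
base.trainable = False                        # freeze the convolutional weights

model = keras.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(7, activation="softmax"),    # 7 object shapes in our dataset
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, batch_size=32, epochs=20,
#           validation_data=(val_images, val_labels))   # hypothetical arrays
```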

Conclusion
In this paper, four efficient and fast CNN models were developed for recognizing objects with humanoid robots. The main aim of the proposed models was to exploit the powerful classification capability of CNN models and to provide fast and correct decisions in real time with humanoid robots. The MNIST and CIFAR-10 datasets were first used to test the proposed four CNN models on object recognition. The results on the MNIST dataset show that the proposed models provide a best accuracy of 99.40% even for small numbers of epochs. Moreover, the models have smaller structures than those in the literature. The results on the CIFAR-10 dataset show that the proposed models provided a high accuracy of about 82.5% although they have the smallest parameter numbers and were trained for only 20 epochs.
A comparative study was also conducted on the proposed four models and their versions derived for different input sizes. The experimental results showed that our Model 3 significantly surpassed the performance of the VGG-16 and ResNet-20 models in terms of shorter training time and smaller parameter number; all versions of Model 3 also provided both training and validation recognition accuracies of 100%. In addition, our models can be applied to real-time recognition applications. In future studies, the performance of the proposed models will be improved by using smaller and explainable structures for object recognition.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This work was supported by the Scientific and Technological Research Council of Turkey (Türkiye Bilimsel ve Teknolojik Araştırma Kurumu, TUBITAK) [grant number 117E589]. In addition, the GTX Titan X Pascal GPU used in this research was donated by the NVIDIA Corporation.

Notes on contributors
Simge Nur Aslan (Student Member, IEEE) received his BS degree from the Mechatronics Engineering Department at Firat University, Turkey, in 2019. He is currently pursuing a master's degree in the same department. His research interests include humanoid robots and deep learning.
Ayşegül Uçar (Senior Member, IEEE) received her BS, MS, and PhD degrees from the Electrical and Electronics Engineering Department at Firat University, Turkey, in 1998, 2000, and 2006, respectively. In 2013, she was a visiting professor at Louisiana State University in the USA. She has been a professor in the Department of Mechatronics Engineering since 2020. She has more than 21 years of experience in autonomous technologies and artificial intelligence, their engineering applications, robotic vision, teaching, and research. Uçar is active in several professional bodies; she is an associate editor of IEEE Access and the Turkish Journal of Electrical Engineering and Computer Sciences and a member of the European Artificial Intelligence Alliance Committee.