Human face recognition based on convolutional neural network and augmented dataset

To deal with the issue of human face recognition on a small original dataset, a new approach combining a convolutional neural network (CNN) with an augmented dataset is developed in this paper. The original small dataset is augmented into a large dataset via several transformations of the face images. Based on the augmented face image dataset, the features of the faces can be effectively extracted and a higher face recognition accuracy can be achieved by using the ingenious CNN. The effectiveness and superiority of the proposed approach are verified by several experiments and comparisons with some frequently used face recognition methods.


Introduction
With the rapid development of computer science and technology, artificial intelligence (AI) is increasingly prevalent in our daily life. Among others, deep neural networks (DNNs) have been successfully utilized in a variety of tasks (Hinton et al., 2012;Jurgen, 2015;Liu et al., 2016;Ptucha et al., 2019;Zeng et al., 2018), e.g. character recognition, image recognition, and speech recognition. In particular, face recognition based on DNN has been under intensive investigation in the past few years.
For example, a novel coupled mapping approach was proposed for the recognition of low-resolution face images based on the convolutional neural network (CNN) (Zangeneh et al., 2020), which is one of the architectures of DNN (Liu et al., 2016). Many face recognition experiments were carried out on common face databases in Zhang et al. (2015), and the results showed that DNN can effectively extract facial features based on diverse image preprocessing approaches. In Guo et al. (2015), an algorithm combining principal component analysis (PCA) with support vector machine (SVM) was presented for face recognition, where PCA not only reduces the computation load but also improves the recognition accuracy. In Zhao et al. (2020), CNN was used to extract the facial features, PCA was then employed to reduce the dimension of the obtained features, and a joint Bayesian algorithm was applied for the face recognition. The final experiment results validated the superiority of the hybrid approach in recognition accuracy. Nevertheless, it is worth noting that most DNN-based face recognition methods are developed on a large original dataset (Guo & Zhang, 2019;Sun & Meng, 2016;Zhang et al., 2015). Obviously, a large original dataset can provide more features of the face images, but it is usually much more difficult to obtain than a small original dataset. As a result, some of the existing successful methods could perform poorly on a small original dataset. Besides, though a larger original dataset can bring higher accuracy of the model and stronger generalization ability of the network, labelling a large original dataset is tedious and time-consuming work. Therefore, from a practical point of view, it is a promising topic to develop DNN-based face recognition methods for a small original dataset (Huang & Mu, 2014).
CONTACT Baoye Song songbaoye@sdust.edu.cn
Motivated by the above-mentioned observations and considerations, a new approach combining a convolutional neural network with an augmented dataset is developed for human face recognition in this paper. The main contributions of this paper are threefold. (1) The small original dataset is augmented into a large dataset by using several transformations of the face images. (2) Based on the augmented human face dataset, the face recognition is implemented via an ingenious CNN, which is robust to the image transformations. (3) Several experiments are carried out on a common face database, and the superiority of the new approach is confirmed by comparison with some of the frequently used methods. The remainder of the paper is outlined as follows. A preliminary of CNN is briefly introduced in Section 2 for the sake of completeness. Section 3 is devoted to the face recognition method, where the model of CNN is sketched and the augmentation method of face images is described. In Section 4, some experiments are carried out to verify the effectiveness of the augmented dataset and the superiority of the proposed approach. Finally, the conclusion is drawn in Section 5.

Preliminary of convolutional neural network
Convolutional neural network (CNN) is a kind of neural network with convolutional layers (LeCun et al., 1989, 1998). In general, CNN contains two kinds of hidden layers, i.e. convolutional layers and pooling layers, which are usually arranged alternately in the neural network (Wang et al., 2012;Zeiler & Fergus, 2014).
Similar to a biological neural network, the connection weights of CNN can be shared across the whole neural network, which not only reduces the number of connection weights but also simplifies the network model. Thus, the training time of CNN can be remarkably shortened in most cases. In particular, when an image is the input of CNN, the image can be fed into the neural network directly, which avoids complicated steps such as feature extraction and data reconstruction. Owing to the advantages of weight sharing, pooling and local receptive field, CNN is robust to several image transformation operations, e.g. translation, rotation, and scaling. For the sake of completeness, the preliminary of CNN is briefly introduced in the rest of this section.

Activation function
The performance of a neural network is closely related to not only its structure but also the adopted activation function, which is usually selected as a nonlinear function to deal with complex issues. Three frequently used activation functions in CNN are sigmoid, hyperbolic tangent (tanh) and rectified linear unit (ReLU) (Bengio et al., 2016;Lin & Shen, 2018), which can be formulated as follows and are illustrated in Figure 1:
\[ \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}, \qquad \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad \mathrm{ReLU}(x) = \max(0, x). \]
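As a minimal sketch, the three activation functions named above can be written with NumPy as follows. The function names and the sample input are illustrative choices and are not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent: maps any real input into (-1, 1)."""
    return np.tanh(x)

def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return np.maximum(0.0, x)

# Illustrative input values (assumption, not data from the paper).
x = np.array([-2.0, 0.0, 2.0])
print("sigmoid:", sigmoid(x))
print("tanh:   ", tanh(x))
print("relu:   ", relu(x))
```

Sigmoid and tanh saturate for inputs of large magnitude, whereas ReLU is linear for positive inputs, which is one reason the choice of activation affects training.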

Back propagation algorithm
The back propagation (BP) algorithm (LeCun et al., 1989) is one of the most frequently used algorithms to train a neural network, where learning the mapping between the input and output data is actually a nonlinear optimization problem of the connection weights. Based on the gradient descent (GD) of BP algorithm (LeCun et al., 1998), the connection weights of the neural network can be updated iteratively by minimizing the mean square error (MSE) between the real and expected values of the output. Here, the MSE, which is usually defined as the cost function in the training of a neural network, is a function $E(W, B)$ of the network parameters, where $W$ and $B$ denote, respectively, the weight and bias matrices to be optimized in the neural network; $a_{iL}$ and $t_{iL}$ indicate, respectively, the real and expected output values of the $i$th neuron in the output layer with $N_L$ neurons.
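One common form of the MSE cost that is consistent with the symbols defined above is sketched below. The symbol $E$ and the normalization by $N_L$ are assumptions here; some texts use a factor of $1/2$ instead, which only rescales the gradient and does not change the minimizer.

```latex
\[
E(W, B) = \frac{1}{N_L} \sum_{i=1}^{N_L} \left( a_{iL} - t_{iL} \right)^{2}
\]
```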
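In the neural network shown in Figure 2, the output of the $i$th neuron in the $l$th layer can be calculated as follows:
\[ a_{il} = f_{il}\Big( \sum_{j=1}^{N_{l-1}} w_{ijl}\, a_{j(l-1)} + b_{il} \Big), \]
where $f_{il}(\cdot)$ and $b_{il}$ stand for the activation function and bias of $a_{il}$, respectively; $w_{ijl}$ denotes the connection weight between the $i$th neuron of the $l$th layer and the $j$th neuron of the previous layer; $l$ indicates the layer number of the $L$-layer neural network; and $N_l$ is the number of neurons in the $l$th layer. Based on the forward propagation of the input and the cost function defined above, the weights and biases of the neural network can be updated iteratively as follows:
\[ w_{ijl} \leftarrow w_{ijl} + \Delta w_{ijl}, \qquad b_{il} \leftarrow b_{il} + \Delta b_{il}, \]
where $\Delta w_{ijl}$ and $\Delta b_{il}$ can be calculated according to the chain rule of calculus as follows:
\[ \Delta w_{ijl} = -\eta \frac{\partial E}{\partial w_{ijl}} = -\eta \frac{\partial E}{\partial a_{il}} \frac{\partial a_{il}}{\partial w_{ijl}}, \qquad \Delta b_{il} = -\eta \frac{\partial E}{\partial b_{il}} = -\eta \frac{\partial E}{\partial a_{il}} \frac{\partial a_{il}}{\partial b_{il}}, \]
where $E$ denotes the cost function and $\eta$ is the learning rate that controls the step size of gradient descent. Note that the above-mentioned chains are not expanded in detail for the sake of simplicity. The interested readers can refer to Ranganathan et al. (2019) and Hagan et al. (2002) for more details on BP algorithm.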
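The forward propagation and the gradient-descent update described above can be illustrated with a small fully connected network. This is only a sketch under stated assumptions: sigmoid activations in every layer, a cost scaled by 1/2, a single training sample, and hypothetical layer sizes, learning rate and data. It is not the training code used in the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Forward pass; activations[0] is the input, activations[-1] the output."""
    activations = [x]
    for W, b in params:
        activations.append(sigmoid(W @ activations[-1] + b))
    return activations

def cost(a_out, t):
    # Assumed scaling: half the sum of squared errors.
    return 0.5 * np.sum((a_out - t) ** 2)

def backprop_step(x, t, params, eta):
    """One gradient-descent update of all weights and biases."""
    a = forward(x, params)
    # Error signal of the output layer; the sigmoid derivative is a * (1 - a).
    delta = (a[-1] - t) * a[-1] * (1.0 - a[-1])
    new_params = []
    for l in range(len(params) - 1, -1, -1):
        W, b = params[l]
        grad_W = np.outer(delta, a[l])  # dE/dW for this layer
        grad_b = delta                  # dE/db for this layer
        if l > 0:
            # Propagate the error backwards using the weights before the update.
            delta = (W.T @ delta) * a[l] * (1.0 - a[l])
        new_params.append((W - eta * grad_W, b - eta * grad_b))
    new_params.reverse()
    return new_params

# Hypothetical 3-4-2 network and a single training pair.
rng = np.random.default_rng(0)
params = [(rng.normal(0.0, 0.5, (4, 3)), np.zeros(4)),
          (rng.normal(0.0, 0.5, (2, 4)), np.zeros(2))]
x = np.array([0.2, -0.4, 0.9])
t = np.array([1.0, 0.0])

before = cost(forward(x, params)[-1], t)
for _ in range(200):
    params = backprop_step(x, t, params, eta=0.5)
after = cost(forward(x, params)[-1], t)
print("cost before:", before, "cost after:", after)
```

The error signal of each layer is computed before that layer's weights are overwritten, so the whole update corresponds to a single gradient evaluated at the current parameters.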

Convolution operation
Convolution is a mathematical operation that has been widely used in image processing. The result of convolution can be classified into three modes (Bengio et al., 2016), i.e. the modes of Full, Same and Valid, which can be utilized on different occasions. For example, Valid mode is usually applied in forward propagation to facilitate the feature extraction of the image, and Full mode is often employed in the back propagation to obtain the optimal weights. In the convolution operation, edge zeroing (zero padding) is applied to the input image, where the width of the zero border is determined by the size of the convolution kernel (Lawrence et al., 1997). The purpose of edge zeroing is to ensure the rationality of the results, i.e. the elements of the input image and the convolution kernel can be weighted and summed sequentially. Additionally, the convolution kernel should be flipped both horizontally and vertically as shown in Figure 3, i.e. the kernel is actually rotated 180 degrees around its centre.
It is worth noting that the convolution operation achieves sparse multiplication and parameter sharing, which can compress the dimension of the input data. Unlike a fully connected DNN, CNN does not need to provide a separate connection weight for every element of the input data. Actually, CNN can be regarded as a common feature extraction process like most neural networks used for feature extraction.
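The three convolution modes and the 180-degree kernel rotation described in this subsection can be sketched as follows. The function `conv2d`, the loop-based implementation and the sample arrays are illustrative assumptions; Same mode is written here for odd kernel sizes only.

```python
import numpy as np

def conv2d(image, kernel, mode="valid"):
    """2-D convolution with zero padding selected by mode."""
    kh, kw = kernel.shape
    if mode == "full":        # pad so that every partial overlap is included
        pad_h, pad_w = kh - 1, kw - 1
    elif mode == "same":      # output has the input size (odd kernels assumed)
        pad_h, pad_w = (kh - 1) // 2, (kw - 1) // 2
    elif mode == "valid":     # no padding, only complete overlaps
        pad_h, pad_w = 0, 0
    else:
        raise ValueError("mode must be 'full', 'same' or 'valid'")

    # Edge zeroing: np.pad fills with zeros by default.
    padded = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w)))
    # Rotate the kernel by 180 degrees (flip rows and columns).
    flipped = kernel[::-1, ::-1]

    out_h = padded.shape[0] - kh + 1
    out_w = padded.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * flipped)
    return out

# Illustrative 4 x 4 input and 3 x 3 kernel (not taken from the paper).
image = np.arange(16.0).reshape(4, 4)
kernel = np.array([[1.0, 0.0, -1.0],
                   [2.0, 0.0, -2.0],
                   [1.0, 0.0, -1.0]])
for mode in ("full", "same", "valid"):
    print(mode, conv2d(image, kernel, mode).shape)
```

For an input of size n × n and a kernel of size k × k, the Full, Same and Valid outputs have side lengths n + k - 1, n and n - k + 1, respectively.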

Receptive field
In CNN, the receptive field is the local connection field of a neuron in the hidden layer (Alonso & Chen, 2009). Suppose that the input of the neural network is an image with 100 × 100 pixels and there are 100 neurons in the hidden layer. Then there will be 100 × 100 × 100 connection weights between the input and the hidden layers if every pixel of the image is connected to all neurons of the hidden layer as shown in Figure 4(a). There is no doubt that such a huge computation load will decrease the training efficiency of the neural network. In contrast, if each hidden neuron is connected to a local field of the input image (e.g. 10 × 10 pixels) as shown in Figure 4(b), the number of connection weights will be reduced to 10 × 10 × 100, which is 1/100 of the full connection case.
In practice, the connection weights can be further reduced by using the shared weight method, i.e. all neurons using the same convolution kernel share the same weights. Thus, the number of connection weights can be reduced from 10 × 10 × 100 to 10 × 10. Accordingly, the training speed of the neural network can also be significantly improved.
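The weight counts quoted in this subsection can be reproduced with a few lines of arithmetic. The function names are illustrative, and biases are ignored as in the text.

```python
def fully_connected_weights(image_h, image_w, hidden_neurons):
    """Every pixel is connected to every hidden neuron."""
    return image_h * image_w * hidden_neurons

def locally_connected_weights(field_h, field_w, hidden_neurons):
    """Each hidden neuron sees only its own local receptive field."""
    return field_h * field_w * hidden_neurons

def shared_weights(field_h, field_w):
    """All neurons reuse the weights of one convolution kernel."""
    return field_h * field_w

full = fully_connected_weights(100, 100, 100)
local = locally_connected_weights(10, 10, 100)
shared = shared_weights(10, 10)
print(full, local, shared, full // local)  # 1000000 10000 100 100
```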

Pooling
The pooling layers, which are usually located behind the convolutional layers, are mainly used to compress the output feature data of the convolutional layers. The compressed output of the pooling layer can reduce the likelihood of over-fitting in the neural network. Besides, the features of the image can be further extracted through the pooling operation without impairing the information acquisition of the image. Actually, pooling is a reduction processing of the image, which can be classified as mean-pooling, max-pooling, overlapping-pooling, stochastic-pooling, and global average pooling (He et al., 2015;Krizhevsky et al., 2017). For instance, mean-pooling extracts the average value of the feature points and has the effect of maintaining the relative background, while max-pooling extracts the maximum value of the feature points and achieves better texture extraction. Specifically, for the mean-pooling, if a feature map with size of 4 × 4 is sampled by using a kernel with size of 2 × 2, the output is a feature map with size of 2 × 2 as shown in Figure 5.
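The 4 × 4 to 2 × 2 mean-pooling example above can be sketched with NumPy as follows. The feature-map values are hypothetical and are not those of Figure 5; non-overlapping windows are assumed.

```python
import numpy as np

def pool2d(feature_map, size=2, reduce=np.mean):
    """Non-overlapping pooling with a size x size window."""
    h, w = feature_map.shape
    if h % size or w % size:
        raise ValueError("feature map size must be a multiple of the window size")
    # Split into (h/size, size, w/size, size) blocks and reduce inside each block.
    blocks = feature_map.reshape(h // size, size, w // size, size)
    return reduce(blocks, axis=(1, 3))

# Hypothetical 4 x 4 feature map with entries 1..16.
feature_map = np.arange(1.0, 17.0).reshape(4, 4)
mean_pooled = pool2d(feature_map, size=2, reduce=np.mean)
max_pooled = pool2d(feature_map, size=2, reduce=np.max)
print(mean_pooled)  # [[ 3.5  5.5] [11.5 13.5]]
print(max_pooled)   # [[ 6.  8.] [14. 16.]]
```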

CNN model for face recognition
In this paper, a CNN model is developed to improve the accuracy of face image classification. The structure of the model is similar to the classical LeNet-5 model, but the two models differ in some parameters, such as the input data, the network width and the fully connected layer. The developed CNN is composed of two convolutional layers (C1 and C2) and two pooling layers (S1 and S2). These layers are arranged alternately in the form of C1-S1-C2-S2 as sketched in Figure 6.
There is only one feature map in the input layer, which is used to feed the normalized face image into the CNN model. C1 is the first convolutional layer, which includes 6 feature maps, each obtained by convolving the input with a randomly initialized convolution kernel with size of 5 × 5. S1 is the first pooling layer, whose output is 6 feature maps calculated from the output of the previous layer. Each element of an S1 feature map is obtained by applying a mean (averaging) kernel to the corresponding feature map in the C1 layer, and the receptive fields of the elements do not overlap with each other. C2 and S2 are, respectively, the second convolutional layer and pooling layer, both of which have 12 feature maps and calculation steps similar to those of their previous counterparts. Moreover, a fully connected single-layer perceptron is placed between the S2 layer and the output layer. As shown in Figure 6, the final output is a 40-dimensional vector for the face recognition of 40 individuals, where the sigmoid function is used for the multi-label classification.
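To make the C1-S1-C2-S2 structure concrete, the feature-map sizes can be traced layer by layer. The sketch below rests on assumptions that are not stated in the text: a hypothetical 28 × 28 input, Valid-mode 5 × 5 convolutions in both convolutional layers, and non-overlapping 2 × 2 pooling windows. Only the numbers of feature maps (6 and 12), the 5 × 5 kernel of C1 and the 40 outputs come from the description above.

```python
def conv_valid(size, kernel=5):
    """Side length after a Valid-mode convolution."""
    return size - kernel + 1

def pool(size, window=2):
    """Side length after non-overlapping pooling."""
    return size // window

def trace_shapes(input_size=28):
    """Return (layer name, number of feature maps, side length) per layer."""
    shapes = [("input", 1, input_size)]
    side = conv_valid(input_size)
    shapes.append(("C1", 6, side))
    side = pool(side)
    shapes.append(("S1", 6, side))
    side = conv_valid(side)
    shapes.append(("C2", 12, side))
    side = pool(side)
    shapes.append(("S2", 12, side))
    return shapes

shapes = trace_shapes(28)  # 28 x 28 input is an assumption
for name, maps, side in shapes:
    print(f"{name}: {maps} maps of {side} x {side}")

# Input length and weight count of the fully connected layer with 40 outputs.
_, s2_maps, s2_side = shapes[-1]
flattened = s2_maps * s2_side * s2_side
fc_weights = flattened * 40
print("flattened S2 length:", flattened, "fully connected weights:", fc_weights)
```

Under these assumptions the S2 output is flattened into a vector of length 192 before the fully connected layer; a different input size would change this length but not the layer structure.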

Dataset and its augmentation
The Olivetti Research Laboratory (ORL) face dataset is a collection of 400 human face images from 40 individuals (ATT Laboratories Cambridge, 2005). Each individual contributes 10 face images in diverse states to the dataset. The size of each image is 92 × 112 pixels, and the images are saved in the BMP and PGM file formats. The ORL face dataset is a widely used standard face dataset that is easier to label than others such as the MIT or Yale face datasets. However, the number of images is not sufficient to train a deep neural network for accurate face recognition. To deal with this problem, the dataset is augmented by using four data augmentation methods, including horizontal flip, shift, scaling, and rotation as shown in Figure 7.
Actually, the dataset can be augmented tremendously by tuning the parameters of the augmentation methods, see Figure 8 for example. In this paper, the dataset is enlarged 1000-fold by the aforementioned operations. Then, the images are scaled, normalized and labelled before they are fed into the face recognition system. It can be expected that the augmented dataset will not only reduce the probability of over-fitting but also improve the robustness of the system.
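The four augmentation operations can be sketched with NumPy as follows. This is an illustrative implementation under stated assumptions, not the procedure used in the paper: nearest-neighbour resampling about the image centre, zero filling outside the image, and hypothetical parameter ranges (rotation within ±10 degrees, scaling within 0.9 to 1.1, shifts within ±5 pixels). A random array stands in for a 92 × 112 face image.

```python
import numpy as np

def hflip(img):
    """Horizontal flip (mirror the columns)."""
    return img[:, ::-1]

def warp(img, matrix, offset=(0.0, 0.0)):
    """Nearest-neighbour inverse warp about the image centre.

    Each output pixel p is sampled from matrix @ (p - centre - offset) + centre;
    pixels whose source falls outside the image are set to zero.
    """
    h, w = img.shape
    centre = np.array([(h - 1) / 2.0, (w - 1) / 2.0])[:, None]
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    out_coords = np.stack([rows.ravel(), cols.ravel()]).astype(float)
    move = np.array(offset, dtype=float)[:, None]
    src = np.rint(matrix @ (out_coords - centre - move) + centre).astype(int)
    valid = (src[0] >= 0) & (src[0] < h) & (src[1] >= 0) & (src[1] < w)
    out = np.zeros(h * w, dtype=img.dtype)
    out[valid] = img[src[0, valid], src[1, valid]]
    return out.reshape(h, w)

def shift(img, dy, dx):
    """Translate by dy rows and dx columns."""
    return warp(img, np.eye(2), offset=(dy, dx))

def scale(img, factor):
    """Zoom about the centre; factor > 1 enlarges the content."""
    return warp(img, np.eye(2) / factor)

def rotate(img, degrees):
    """Rotate about the centre; the sign convention follows array axes."""
    t = np.deg2rad(degrees)
    matrix = np.array([[np.cos(t), -np.sin(t)],
                       [np.sin(t), np.cos(t)]])
    return warp(img, matrix)

def augment(img, rng):
    """One random variant; all parameter ranges are illustrative assumptions."""
    out = hflip(img) if rng.random() < 0.5 else img
    out = rotate(out, rng.uniform(-10.0, 10.0))
    out = scale(out, rng.uniform(0.9, 1.1))
    return shift(out, rng.integers(-5, 6), rng.integers(-5, 6))

rng = np.random.default_rng(0)
face = rng.random((112, 92))  # stand-in for one 92 x 112 pixel face image
variants = [augment(face, rng) for _ in range(5)]
print(len(variants), variants[0].shape)
```

Because every variant is drawn with fresh random parameters, repeating the call many times per original image enlarges the dataset by an arbitrary factor while the class label of each image is preserved.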

Remark 3.1:
It is obvious that a large original dataset can provide abundant image features for the human face recognition problem. Unfortunately, such an ideal database is difficult to obtain in practical applications. By using transformations of the face images, a small image dataset can be tremendously augmented into a large one. As such, more image features can be extracted to train the classifier so as to achieve a better face recognition result.

Experiment results and analyses
In this section, some experiments are implemented to evaluate the performance of the developed face recognition method, and the superiority of the method is verified by comparison with some frequently used methods. The experiments are carried out using MATLAB 2018a on a PC with an Intel Core i7-6700 CPU.
To test the effect of the augmented dataset, the face recognition network is trained using different numbers of samples, i.e. 100,000, 200,000, 300,000, and 400,000, where the batch size and the number of epochs are set to 50 and 1, respectively, in all cases. Meanwhile, different numbers of test samples are adopted in the test of the neural network, see Figure 9 for more details. It can be observed that the face recognition accuracy improves as the number of training samples increases, while remaining stable for different numbers of test samples. This means that the augmented dataset can introduce abundant sample features, which can enhance the training of the network and result in high face recognition accuracy.
As mentioned above, the face recognition network is robust to different numbers of test samples, so the number of test samples is kept constant in the following experiments. Now, the network is trained using different numbers of samples and epochs, and the results are summarized in Figure 10. It is clear that the face recognition accuracy increases with more training samples for every number of epochs. Meanwhile, more epochs also improve the accuracy for every number of training samples. That is, the face recognition accuracy is positively correlated with the numbers of epochs and training samples. Actually, this conclusion can also be confirmed in the experiments using more epochs and training samples.
MSE is usually used as the cost function of a neural network. In the back propagation of error, MSE is the expectation of the squared error between the real and expected values of the output. A small MSE indicates high accuracy of the network model. To analyse the MSE of the face recognition network, some experiments are implemented using different numbers of training samples (i.e. 100,000, 200,000, 300,000, and 400,000) and epochs (i.e. 1, 2, 3, and 4) with the batch size of 50. As seen in Figure 11, the MSE decreases as the number of training iterations increases in all cases of training samples. Additionally, the results of MSE further verify the conclusions of the previous experiments.
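The number of training iterations implied by these settings can be estimated as follows, assuming one weight update per mini-batch (an assumption; the text does not define an iteration explicitly). The function name is illustrative.

```python
def training_iterations(samples, batch_size, epochs):
    """Number of mini-batch updates, assuming one update per full batch."""
    return (samples // batch_size) * epochs

for samples in (100_000, 200_000, 300_000, 400_000):
    counts = [training_iterations(samples, 50, epochs) for epochs in (1, 2, 3, 4)]
    print(samples, counts)
```

Under this assumption, the smallest configuration (100,000 samples, 1 epoch) corresponds to 2,000 iterations and the largest (400,000 samples, 4 epochs) to 32,000 iterations.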
To further verify the superiority of the proposed approach, its results are compared with those of some other face recognition methods in the literature based on the ORL face dataset, including ANN (artificial neural network), PCA+ANN, PCA+SVM, Wavelet+SVM, and Wavelet+SVM (Gumus et al., 2010;Guo et al., 2015), and their face recognition accuracies are listed in Table 1. It is clear that the SVM-based methods outperform the ANN-based methods in face recognition accuracy on the ORL face dataset. Meanwhile, the proposed approach, which combines CNN with the augmented face dataset, achieves a higher face recognition accuracy than the others. As such, it can be concluded that the augmented face dataset can improve the performance of face recognition owing to the abundant features it provides.

Conclusion
In this paper, a new approach has been developed to deal with the problem of human face recognition on a small original dataset. The original small dataset is augmented into a large dataset by using transformations of the face images, such as flip, shift, scaling, and rotation. Based on the remarkably augmented face dataset, the face recognition can be effectively implemented via an ingenious CNN. Several experiments have been carried out to verify the effectiveness of the augmented dataset, and the superiority of the new approach has also been confirmed in comparison with some of the frequently used face recognition methods. Actually, the proposed strategy is an economical method to augment the dataset, and can be applied to a variety of fields related to data-based training and learning. Our future work will focus on the application of the data augmentation approach to some more complex problems, e.g. signal processing, image recognition, and image-based fault detection (Chen et al., 2018;Huang et al., 2020;Song et al., 2019;Wang & Yang, 2019;Xu et al., 2018;Zeng et al., 2017).

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This work was supported in part by the National Natural Science Foundation of China [grant number 61703242].