An analysis of rotation matrix and colour constancy data augmentation in classifying images of animals

ABSTRACT In this paper, we examine a novel data augmentation (DA) method that transforms an image into a new image containing multiple rotated copies of the original image. The DA method creates a grid of cells, in which each cell contains a different randomly rotated image and introduces a natural background in the newly created image. We investigate the use of deep learning to assess the classification performance on the rotation matrix or original dataset with colour constancy versions of the datasets. For the colour constancy methods, we use two well-known retinex techniques: the multi-scale retinex and the multi-scale retinex with colour restoration for enhancing both original (ORIG) and rotation matrix (ROT) images. We perform experiments on three datasets containing images of animals, from which the first dataset is collected by us and contains aerial images of cows or non-cow backgrounds. To classify the Aerial UAV images, we use a convolutional neural network (CNN) architecture and compare two loss functions (hinge loss and cross-entropy loss). Additionally, we compare the CNN to classical feature-based techniques combined with a k-nearest neighbour classifier or a support vector machine. The best approach is then used to examine the colour constancy DA variants, ORIG and ROT-DA alone for three datasets (Aerial UAV, Bird-600 and Croatia fish). The results show that the rotation matrix data augmentation is very helpful for the Aerial UAV dataset. Furthermore, the colour constancy data augmentation is helpful for the Bird-600 dataset. Finally, the results show that the fine-tuned CNNs significantly outperform the CNNs trained from scratch on the Croatia fish and the Bird-600 datasets, and obtain very high accuracies on the Aerial UAV and Bird-600 datasets.


Introduction
Data augmentation (DA) is often used in deep learning to increase the number of training images and thereby obtain high classification accuracies. Previous approaches to data augmentation use cropping, rotation, illumination, scaling and colour casting to create more training images. Recent research by Pawara, Okafor, Schomaker, and Wiering (2017) examined the classification performance of two convolutional neural network (CNN) methods (AlexNet and GoogleNet) with several DA techniques on different plant datasets. This research investigates the rotation matrix and colour constancy algorithms as methods for data augmentation, with the objective of using one or more machine learning algorithms to classify images within three animal datasets.
Some researchers have rotated plant images to different angular positions, but did not discuss the effect of the white or zero pixel values introduced during rotation (Ghazi, Yanikoglu, & Aptoula, 2017; Pawara et al., 2017). Their research nevertheless shows that DA techniques can be used to reduce overfitting and improve the overall performance of CNN models. A recent study investigated the relevance of the radial transform (Salehinejad, Valaee, Dowdell, & Barfett, 2018) as a method of data augmentation on character and medical multi-modal images. Additionally, the research by Sladojevic, Arsenovic, Anderla, Culibrk, and Stefanovic (2016) attempts to develop a plant disease recognition CNN model with three image transformation techniques: affine, perspective and rotation.
In contrast to the rotation technique mentioned earlier, colour constancy algorithms have been widely studied in image processing and computer vision as a method for enhancing the quality of an image while preserving the colour information of an object under varying illumination conditions. The authors in Rahman, Jobson, and Woodell (1996) and Jobson, Rahman, and Woodell (1997) proposed a multi-scale retinex (MSR) method, which achieves excellent colour rendition and dynamic range compression compared with their previous work on the single-scale retinex (SSR). An improvement to the MSR was made by the authors in Rahman, Jobson, and Woodell (2004), who incorporated colour restoration to produce the multi-scale retinex with colour restoration (MSRCR). Several further improvements have produced variants of the MSR algorithm. One such method combines MSR with chromaticity preservation (Petro, Sbert, & Morel, 2014). Another modification incorporates the Autolevel algorithm, which removes outliers, improves the contrast level within an image and shows computational improvements when used with a graphical processing unit (Jiang, Woodell, & Jobson, 2015).
However, the unification of colour constancy and rotation matrix algorithms as a method of data augmentation has received limited attention. This paper extends the research by Okafor, Smit, Schomaker, and Wiering (2017) by considering the proposed n × n rotation algorithm together with colour constancy techniques as methods of data augmentation. The proposed techniques are examined on two animal datasets (Croatia fish (Jaeger et al., 2015) and Bird-600 (Lazebnik, Schmid, & Ponce, 2005)) and an aerial image dataset collected using an unmanned aerial vehicle (UAV). The use of UAVs has considerable potential for precision agriculture as well as for livestock monitoring. A previous study (Zhang & Kovacs, 2012) recommended that combining precision agriculture with remote sensing and UAV methods can be very beneficial for agricultural purposes. Other research (Katsigiannis, Misopolinos, Liakopoulos, Alexandridis, & Zalidis, 2016; López-Granados et al., 2016; Lukas et al., 2016) has examined this area with the use of UAVs for different tasks. A novel area of research is recognizing aerial imagery with the use of deep neural networks. The study in Lin, Cui, Belongie, and Hays (2015) demonstrates that the use of a CNN for ground-to-aerial localization yielded a good performance on some datasets.
Another interesting study is the use of deep reinforcement learning for active localization of cows (Caicedo & Lazebnik, 2015). Next to the task of localization, there exists some recent research on the use of UAVs for motion detection and tracking of objects. The study in Fang, Du, Abdoola, Djouani, and Richards (2016) analysed the merits of the use of optical flow with a coarse segmentation approach for aerial motion detection of animals from several videos. Furthermore, in Gonzalez et al. (2016) the authors extended the idea of using UAVs with object detection and tracking algorithms for monitoring wildlife animals. Another approach is detection and tracking of humans from UAV images using local feature extractors and support vector machines (SVMs) (Imamura, Okamoto, & Lee, 2016).
The idea of data augmentation has been successfully applied to UAV data as well. In Jeon et al. (2017), the authors studied augmentation of drone sounds using a publicly available dataset that contains several real-life environmental sounds. Furthermore, the research by Charalambous and Bharath (2016) explored the use of a DA method for training a deep learning algorithm for recognizing gaits. Another interesting use of data augmentation is the development of a model for 3D pose estimation using motion capture data (Rogez & Schmid, 2016). However, limited research has examined colour constancy as a method of data augmentation. The research by Galdran et al. (2017) proposed a DA method adapted for skin lesion analysis with neural networks, with emphasis on the use of colour constancy to normalize the colour information of images within a training set. Moreover, other research has redeveloped colour constancy as a neural network regression technique for estimating the colour of a light source (Lou, Gevers, Hu, & Lucassen, 2015).
Most of the previous DA techniques transform a training image to multiple training images using techniques such as cropping, contrast, illumination, mirroring, colour casting, scaling and rotation. In this paper, we extend the DA method proposed in Okafor et al. (2017) that transforms a single input image to another image containing n × n rotated copies of the original (ORIG) image. This method enhances the amount of information in an image. Additionally, this paper investigates the use of two well-known colour constancy methods (MSR and MSRCR) for creating more samples of both original and rotation matrix versions of three datasets: Aerial UAV, Croatia fish (Jaeger et al., 2015) and Bird-600 (Lazebnik et al., 2005). The objective of this paper is to use CNNs to assess the classification performance on several variants of the used datasets. Moreover, our study examines whether the novel DA methods lead to higher classification accuracies when combined with different machine learning techniques, such as CNNs or classical feature descriptors, on a novel dataset containing aerial images of animals.
Contributions: This paper describes a novel DA technique that transforms a training or test image into a novel single image with multiple randomly rotated copies of the input image. To combine the different rotated images, the proposed method puts them in a grid and adds realistic background pixels to glue them together. This approach presents some merits: (1) it provides more informative images, which may help to yield higher accuracies, and (2) the method can also be used to perform data augmentation on test images in the operational stage. The utility of the proposed approach is evaluated by using a CNN which is derived from the original GoogleNet (Szegedy et al., 2015) architecture by keeping only several inception modules. For training this CNN, we evaluate whether there are differences between using the cross-entropy loss function (softmax classifier) and using a hinge loss function. Furthermore, we compared the CNNs to several classical computer vision techniques using ORIG images and DA images. All techniques were used to investigate the recognition accuracies of aerial images of cows in natural scenes, for which we created our own dataset with a UAV.
Additionally, this paper investigates the use of well-known colour constancy techniques (MSR and MSRCR) for creating new image samples of both ORIG and the new rotation matrix (ROT) images on three datasets: UAV aerial images, Croatia fish (Jaeger et al., 2015) and Bird-600 (Lazebnik et al., 2005), with the aim to increase the amount of training image samples. This approach enhances the colour information of the images which could be very useful to get higher classification accuracies with the CNN. We train the CNN with the cross-entropy loss function and compare the classification performances of the colour constancy data augmentation (with ORIG/ROT), ORIG alone and ROT-DA alone on three datasets. The study also considers two broad forms of data augmentation based on their increase (colour constancy data augmentation) or no increase (ROT-DA alone) in the amount of training images.
The results show that the fine-tuned CNN with an appropriate selection of the grid resolution and angular bounds for the rotation algorithm combined with colour constancy methods yields the highest classification accuracies on most of the used datasets. Moreover, the results show that using fine-tuned CNN models with the proposed data augmentation (ROT-DA) technique on the Aerial UAV images leads to significantly better results than all other approaches. Finally, the results of our proposed approaches to data augmentation combined with the fine-tuned CNN significantly surpass previous results on the Bird-600 dataset (Lazebnik et al., 2005).
Paper outline: Section 2 describes the used datasets and the proposed DA techniques. Section 3 discusses the methods used for classifying the Aerial UAV dataset and two other animal datasets. Section 4 describes the CNN experimental setups and the results obtained from the various classification methods on the used datasets. Finally, the conclusion is presented in Section 5.

Datasets and data augmentation
This section entails the description of three datasets and describes two kinds of data augmentation which are evaluated in Section 4.

Aerial UAV dataset
(1) Dataset collection: We employed the DJI Phantom 3 Advanced UAV for collecting video frames of cows and natural backgrounds at different positions and orientations. An illustration of the UAV is shown in Figure 1. We applied manual cut-outs with a fixed size of 100 × 100 pixels to obtain positive samples of images that contain a cow, while we employed an automatic extraction of negative samples in which no cows are present. We flew the drone three times over different fields containing cows in order to obtain different samples. A summary of the three subsets of the obtained images with the number of positive and negative samples, the video streaming time and the number of unique objects is reported in Table 1. The unique objects denote cows that are recorded at different time frames and therefore have different appearances in time. Figure 2 shows some samples of images of our Aerial UAV dataset.
(2) Cross-set splits: We used cross-set splits whereby each recorded subset is considered as a separate fold. One subset is used for testing and the other subsets are used for the training set. This process is repeated for the three available subsets. The classical feature descriptors combined with supervised learning algorithms and the derived CNN technique are employed for determining the existence of cows in the natural images. We maintain the same dataset splits for all the experiments using the CNN and the feature extraction techniques.
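The cross-set protocol in item (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `cross_set_splits` and `train_val_split` and the subset labels are hypothetical, and the 80%/20% train-validation ratio is the one reported later in the experimental setup.

```python
import random


def cross_set_splits(subsets):
    """Return one fold per recorded subset: that subset is the test set,
    all remaining subsets are pooled into the training set."""
    folds = []
    for test_name in subsets:
        train = [img
                 for name, imgs in subsets.items() if name != test_name
                 for img in imgs]
        folds.append((test_name, train, list(subsets[test_name])))
    return folds


def train_val_split(items, train_fraction=0.8, seed=0):
    """Shuffle the pooled training images and split them into a training
    part and a validation part (80%/20% by default)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(round(train_fraction * len(items)))
    return items[:cut], items[cut:]


# Hypothetical file names standing in for the three recorded subsets.
subsets = {
    "subset1": ["a1.png", "a2.png"],
    "subset2": ["b1.png"],
    "subset3": ["c1.png", "c2.png", "c3.png"],
}
folds = cross_set_splits(subsets)
```

Because every fold holds out a complete flight, images of the same cow recorded in one flight never appear in both the training and the test set.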

Croatian fish dataset
This dataset was originally presented in Jaeger et al. (2015). It contains a total of 794 images and has 12 classes with a non-uniform distribution of the images per class. The authors reported an accuracy of 66.78% in their study using a CNN combined with a linear SVM classifier. We adopted a different split in our experiment because of the imbalance of the image samples within the various classes. We ensured that approximately half of the image samples were kept aside as test sets. Figure 3 shows sample images of this dataset for each of the classes.

Bird-600 dataset
This dataset was originally presented in Lazebnik et al. (2005). The dataset contains a total of 600 images and has 6 classes with 100 individual image samples per class. In our experiments, we adopted a dataset distribution similar to that reported in Lazebnik et al. (2005) by keeping 50% of the total image samples as the test set. The authors reported an accuracy of 92.33% in their study by using a probabilistic part-based method for texture and object recognition. Figure 4 shows sample images of this dataset for each of the classes.

Rotation matrix data augmentation
We propose a new offline DA algorithm called ROT-DA that transforms an input image to a new single image containing multiple randomly rotated versions placed in n × n cells. The use of a larger value for n leads to a new image containing more different poses. For the Aerial UAV dataset, the value of n was set to 4 in the experiments, because higher values of n made the cow images look very small. On the other two animal datasets, we set n ∈ {2, 4} for Croatia fish and n ∈ {1, 2} for Bird-600. An illustration of the proposed DA method and the overall classification system using the CNN is shown in Figure 5. The pseudo-code in Algorithm 1 explains the various transformations of the ORIG image to obtain the multi-orientation image. After inserting the images in the newly created image, background pixels are added to glue them together. This is done by using the nearest neighbour pixels around the edges of the images. We will also perform experiments with ROT-DA without rotations (ROT-DA-NR), but we do this only for the classical feature-based techniques.

Figure 4. Sample images of the Bird-600 dataset for each of the bird species (each column): egret, mandarin, owl, puffin, toucan and wood duck (Lazebnik et al., 2005).

Algorithm 1 Multi-Orientation Data Augmentation Algorithm
Input: Images $I_i(x, y)$ from an input directory, where x, y denote the pixel row and column, and a grid size of n × n.
Output: The data-augmented versions of the images.
1: procedure CONSTRUCT A FILELIST WITH N IMAGES FROM AN INPUT DIRECTORY
2:   for each image $I_i$, i ∈ N do
3:     Initialize the total number of cells n × n = M
4:     for each image $I_i$, for all cells m ∈ M do
5:       Define the size of the image resolution.
6:       Compute a pad-size $I_q = \mathrm{ceil}(\mathrm{size}(I_i)/2)$.
7:       Compute a pad-array $I_p$ using a pixel replication padding technique, given $I_i$, $I_q$, pad value set to 'replicate' and the pad direction set to 'both'.
8:       Rotate $I_p$ with a random angle within the bound [1°, 180°]; this yields a new image $I_r$.
9:       Adjust the image $I_r$ to $I_a$ such that undesired background introduced during rotation is filled with artificial pixels from the nearest neighbour pixels.
10:      Concatenate each $I_a$ into M cells:
11:      $I_c = [\, I_a(k) \;\dots\; I_a(k+n-1);\; \dots\;;\; \dots\; I_a(M = n^2) \,]_{n \times n}$
         Given that k = 1, ∀ M cells, the ellipses (…) denote the column cell entries containing rotated sub-images, and the semicolon (;) in this study represents the start of a new row. Note that each cell in the n × n grid of cells contains a rotated copy of the input image $I_a(k)$ in a reduced size.
12:    end for
13:    Convert the cell structure of $I_c$ into a matrix $I_m$.
14:    Resize the image $I_m$ to 250 × 250 pixels.
15:    Store each $I_m(i)$ into an output directory
16:  end for
17: end procedure
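The steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `rotate_nearest`, `resize_nearest` and `rot_da` are hypothetical, rotation and resizing use nearest-neighbour sampling, and clamping out-of-range source coordinates to the image border stands in for the nearest-neighbour background fill of step 9.

```python
import numpy as np


def rotate_nearest(img, angle_deg):
    """Rotate about the image centre with nearest-neighbour sampling.
    Source coordinates outside the image are clamped to the border, so
    empty corners are filled with the nearest edge pixels."""
    h, w = img.shape[:2]
    theta = np.deg2rad(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for every output pixel, locate its source pixel.
    src_x = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    src_y = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    src_x = np.clip(np.rint(src_x), 0, w - 1).astype(int)
    src_y = np.clip(np.rint(src_y), 0, h - 1).astype(int)
    return img[src_y, src_x]


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize by index sampling."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows][:, cols]


def rot_da(img, n=4, out_size=250, angle_bounds=(1.0, 180.0), rng=None):
    """Build one image holding n x n randomly rotated copies of `img`."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    # Steps 6-7: replicate-pad by half the image size on both sides.
    pad_h, pad_w = int(np.ceil(h / 2)), int(np.ceil(w / 2))
    pad_spec = ((pad_h, pad_h), (pad_w, pad_w)) + ((0, 0),) * (img.ndim - 2)
    padded = np.pad(img, pad_spec, mode="edge")
    grid_rows = []
    for _ in range(n):
        cells = []
        for _ in range(n):
            # Steps 8-9: random rotation with edge-pixel fill.
            angle = rng.uniform(*angle_bounds)
            rotated = rotate_nearest(padded, angle)
            # Each cell holds a reduced-size copy of the rotated image.
            cells.append(resize_nearest(rotated, h, w))
        grid_rows.append(np.concatenate(cells, axis=1))
    # Steps 10-14: assemble the grid and resize to the CNN input size.
    grid = np.concatenate(grid_rows, axis=0)
    return resize_nearest(grid, out_size, out_size)
```

Calling `rot_da(image, n=4)` on a 100 × 100 × 3 array returns a 250 × 250 × 3 array; passing a seeded `rng` makes the random angles reproducible.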

Colour constancy data augmentation
Colour constancy is the property of perception that keeps the perceived colours of objects relatively constant under varying illumination conditions. This area of study has found relevance in image processing and computer vision. Colour constancy algorithms use contrast/lightness enhancement and colour rendition to improve the quality of an image. Most colour constancy algorithms use the retinex theory, which was initially proposed by Land and McCann (1971). The research in Provenzi, Marini, De Carli, and Rizzi (2005) provided the basis for understanding the retinex algorithm from a mathematical standpoint. Our study examines two kinds of MSR algorithms.
(1) Multi-scale retinex: This algorithm was proposed by Rahman et al. (1996) and Rahman et al. (2004). The algorithm provides a trade-off between colour rendition and local dynamic range (Petro et al., 2014). MSR computes the weighted sum of the outputs of several SSRs. According to Jobson et al. (1997), an MSR image $f^{msr}_k(x, y)$ is obtained from the following quantities: $f^{m}_k$ is the SSR output at each of the $M$ scales, $W_m$ denotes the weight of each scale ($W_m = 1/3$), the maximum number of scales is $M = 3$ because the number of RGB image channels is equal to the number of scales, $C_m$ represents the normalization factor and $I_k(x, y)$ denotes the image pixel at coordinates (x, y) for a given colour band k. The standard deviations of the Gaussians for the scales are $\sigma_m \in \{15, 80, 250\}$. We adopted the same parameters as used in Jobson et al. (1997) and Petro et al. (2014), because they also perform well in our study. Furthermore, we further computed $f^{msr}_k(x, y)$ by using the mathematical expression proposed in Moore, Allman, and Goodman (1991), in which each colour channel is modified by the absolute minimum and maximum of the RGB colour channels.
(2) Multi-scale retinex with colour restoration: Jobson et al. (1997) and Rahman et al. (2004) initially proposed the MSRCR algorithm. An MSRCR image $f^{msrcr}_k$ can be computed as the product of the colour restoration functions $C_k$ of the chromaticity and the MSR outputs. We use the modified version of the MSRCR $f^{msrcr}_k(x, y)$ from the research in Petro et al. (2014), in which α controls the strength of the non-linearity and λ is a constant. For the MSRCR experiment, α is set to 125, λ is set to 0.8, β is set to 46 and K represents the total number of spectral bands (K = 3).
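To make the MSR description in item (1) concrete, the sketch below assumes the standard formulation of Jobson et al. (1997): each SSR output is $\log I_k - \log(F_m * I_k)$ with a Gaussian surround $F_m$ of standard deviation $\sigma_m$, and the MSR is their weighted sum. Several details are assumptions rather than the authors' code: the function names are hypothetical, the blur is a circular FFT-based Gaussian whose unit DC gain plays the role of the normalization factor $C_m$, a +1 offset avoids $\log 0$, and the final step is a plain min-max rescaling to [0, 255] over all three channels. The colour restoration step of MSRCR is not included.

```python
import numpy as np


def gaussian_blur_fft(channel, sigma):
    """Circular Gaussian blur computed in the frequency domain.
    The transfer function equals 1 at zero frequency, so the mean
    intensity of the channel is preserved."""
    h, w = channel.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    transfer = np.exp(-2.0 * (np.pi * sigma) ** 2 * (fx ** 2 + fy ** 2))
    return np.real(np.fft.ifft2(np.fft.fft2(channel) * transfer))


def multi_scale_retinex(img, sigmas=(15, 80, 250), weights=None):
    """Weighted sum of single-scale retinex outputs for an RGB image,
    rescaled to uint8 with the global minimum and maximum."""
    pixels = img.astype(np.float64) + 1.0  # offset avoids log(0)
    if weights is None:
        weights = [1.0 / len(sigmas)] * len(sigmas)
    msr = np.zeros_like(pixels)
    for weight, sigma in zip(weights, sigmas):
        for k in range(pixels.shape[2]):
            surround = gaussian_blur_fft(pixels[..., k], sigma)
            surround = np.maximum(surround, 1e-6)  # guard against round-off
            msr[..., k] += weight * (np.log(pixels[..., k]) - np.log(surround))
    lo, hi = msr.min(), msr.max()
    if hi == lo:
        return np.zeros(img.shape, dtype=np.uint8)
    return (255.0 * (msr - lo) / (hi - lo)).astype(np.uint8)
```

For the 250 × 250 inputs used here, the largest scale (σ = 250) covers the whole image, so its surround is close to the per-channel mean intensity.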
Proposed colour constancy data augmentation: This study examines the possibility of feeding the ORIG or ROT images as input to the MSR or MSRCR algorithm. The process can also be reversed by creating the colour constancy images first and then passing them as inputs to the rotation matrix algorithm. The new images are then combined with either the ORIG or ROT images to obtain two or three times the effective size of the initial train-validation image dataset. By three times, we mean combining ORIG+MSRCR-ORIG+MSR-ORIG or ROT+MSRCR-ROT+MSR-ROT. We carried out experiments using two animal datasets and the UAV dataset. Some samples of both the ORIG and ROT images with and without colour constancy are shown in Figures 6, 7 and 8 for the Aerial UAV dataset, Croatia fish dataset and Bird-600 dataset respectively.
We considered different rotational bounds for the ROT-DA alone and for the colour constancy data augmentation with ROT images on the three datasets. (c) On the Bird-600 experiments, we also considered the colour constancy data augmentation with 1 × 1-ROT, which used the same angular rotation bounds as in V2. This setup can be seen as a combined DA method of rotation and colour constancy.

Three inception module CNN architecture
This architecture is directly derived from the well-known GoogleNet architecture proposed in Szegedy et al. (2015). We eliminated all the layers after the inception 4a module, except for the layers that lead to the first classifier, because the used datasets contain few classes (2, 6 and 12 for the Aerial UAV, Bird-600 and Croatia fish datasets respectively). Hence, we want to know how the reduced architecture can handle these problems. We will compare the reduced CNN architecture to the original GoogleNet on the Aerial UAV dataset. Another modification made with respect to the original GoogleNet architecture is the use of Nesterov's Accelerated Gradient Descent (NAGD) rather than the conventional stochastic gradient descent (SGD) to update the weights in the deep neural network. The NAGD optimization update rule (Sutskever, Martens, Dahl, & Hinton, 2013) is described in Equations (6) and (7), where $L \in \{L_h, L_c\}$ is the loss function, μ is the momentum value, $\alpha_L$ is the learning rate, $u_i$ is the momentum variable, ∇L is the gradient (rate of change) of L, i is the iteration number and $W_i$ denotes the learnable weights. We employed randomly initialized weights for the scratch CNN and pretrained weights from the ImageNet dataset for the fine-tuned CNN (GoogleNet architecture).
In addition to our modification, we remark that the original GoogleNet (in the Caffe framework) uses a simple online data augmentation that involves cropping (with a default crop size of 224 × 224 pixels), i.e. cutting out patches from an input image at five positions (arranged like the five on a die), and additionally flipping (horizontal reflection) to obtain more samples. During training of the CNN model, it automatically flips each cropped image to double the effective dataset size. In our customized CNNs, we considered the original and two additional crop sizes: 125 × 125 and 250 × 250 pixels. A crop size of 250 × 250 corresponds to the full size of the input image. Furthermore, we evaluated flip and non-flip conditions. All the input images to the CNN have image sizes of 250 × 250 pixels. For the ROT-DA images, each cell of the 4 × 4 grid contains a copy of the input image in a reduced size and the method fills up empty spaces with nearest neighbour pixels. The derived three inception module CNN architecture is described in Table 2.
Figure 7. Examples of the ORIG and ROT-DA images from the Croatian fish dataset (Jaeger et al., 2015). The first row accounts for the ORIG images (columns 1-4) and ROT-DA images (columns 5-8) without colour constancy. The second and third rows are the MSR and MSRCR versions for both the ORIG and ROT-DA images respectively. The colour constancy algorithms also show improvement in the image resolution compared to the ORIG image samples.
Figure 8. Examples of the ORIG and ROT-DA images. The first row accounts for the ORIG images (columns 1-3), 2 × 2 ROT-DA images using V1 rotation condition (columns 4-6), 2 × 2 ROT-DA images using V2 rotation condition (columns 7-8) and 1 × 1 ROT-DA images using V2 rotation condition (columns 9-10), all without colour constancy. The second and third rows are the MSR and MSRCR versions for both the ORIG and ROT-DA images respectively.
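The NAGD update named above is conventionally written (Sutskever et al., 2013) as $u_{i+1} = \mu u_i - \alpha_L \nabla L(W_i + \mu u_i)$ followed by $W_{i+1} = W_i + u_{i+1}$. The sketch below assumes this standard form; the function name `nagd_step` and the toy quadratic loss are hypothetical and serve only to show how the look-ahead gradient enters the update.

```python
import numpy as np


def nagd_step(weights, velocity, grad_fn, lr=0.001, momentum=0.9):
    """One Nesterov update: the gradient is evaluated at the look-ahead
    point weights + momentum * velocity, not at the current weights."""
    lookahead = weights + momentum * velocity
    new_velocity = momentum * velocity - lr * grad_fn(lookahead)
    return weights + new_velocity, new_velocity


# Toy example: minimise L(W) = 0.5 * ||W||^2, whose gradient is W itself.
w = np.array([1.0, -2.0])
u = np.zeros_like(w)
for _ in range(300):
    w, u = nagd_step(w, u, grad_fn=lambda x: x, lr=0.1, momentum=0.9)
```

With the paper's settings the call would use `lr=0.001` and `momentum=0.9`; the larger learning rate in the toy loop only shortens the demonstration.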
This architecture involves the use of three inception modules that allow the concatenation of filters of different dimensions and sizes into a single new filter (Shin et al., 2016; Szegedy et al., 2015). In each inception module, there are six convolution layers and one pooling layer. Moreover, several rectifiers (ReLUs) are placed immediately after the convolutional and fully connected layers. Furthermore, there are four pooling layers excluding those within the inception modules, two bottom convolutional layers and one top convolutional layer which comes after the average pooling layer. The authors in Lapin, Hein, and Schiele (2017) provide an analysis of loss functions for multi-class problems. We use a top-1 loss function which employs either the hinge loss or the cross-entropy loss (for the Softmax classifier). The $L_1$-norm hinge loss $L_h$ used in our study is defined over targets $y_i^k \in \{1, -1\}$, with $y_i^k = 1$ if $x_i$ belongs to the target class of the kth class output unit, and $y_i^k = -1$ if $x_i$ does not belong to the target class. The variable N denotes the total number of training images in a batch, K is the number of class labels and $z^k = x^T w$ is the final activation of the output units. Here, $x \in \mathbb{R}^D$ denotes the D-dimensional features of the previous hidden layer, and the learnable weights of the last layer are $w \in \mathbb{R}^{D \times K}$.
The cross-entropy loss $L_c$ used in our study is defined over target values $y_i \in \{0, 1\}$. The fraction within the log of this loss is the softmax activation function (Okafor et al., 2016), which computes the probability distribution of the classes in a multi-class classification problem. Note that in this study, we investigate both binary and multi-class classification problems.
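The two losses can be illustrated with a small NumPy sketch. It assumes the common batch-averaged forms, $L_h = \frac{1}{N}\sum_{i}\sum_{k}\max(0,\, 1 - y_i^k z_i^k)$ for the $L_1$ hinge loss and $L_c = -\frac{1}{N}\sum_{i}\sum_{k} y_i^k \log\big(e^{z_i^k} / \sum_{j} e^{z_i^j}\big)$ for the softmax cross-entropy; the exact normalization used in the study is an assumption, and the function names are hypothetical.

```python
import numpy as np


def hinge_loss_l1(z, labels):
    """L1 hinge loss over a batch. z has shape (N, K); labels holds the
    index of the target class for each of the N samples."""
    n, k = z.shape
    y = -np.ones((n, k))
    y[np.arange(n), labels] = 1.0  # +1 for the target class, -1 otherwise
    return np.maximum(0.0, 1.0 - y * z).sum() / n


def cross_entropy_loss(z, labels):
    """Softmax cross-entropy over a batch, computed in a numerically
    stable way by subtracting the row maximum before exponentiating."""
    shifted = z - z.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_softmax[np.arange(len(labels)), labels].mean()
```

With all activations at zero, every output unit violates the hinge margin by exactly 1 and the softmax is uniform, so for K classes the two losses evaluate to K and log K respectively.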
The CNN under study consists of two fully connected (FC) layers: FC 1 with a corresponding ReLU computes the hidden unit activations, which is immediately followed by a regularization dropout of 0.7, and FC 2 contains the output neurons: 2, 12 and 6 for Aerial UAV, Croatia fish and Bird-600 datasets respectively. The working operations of the CNN are well explained in Szegedy et al. (2015).

Classical features combined with supervised learning algorithms
In this section, we describe the three feature extraction techniques which we use and combine with the k-nearest neighbour classifier and the SVM with a linear kernel or a radial basis function (RBF) kernel trained on the Aerial UAV dataset. In our preliminary experiments, we compared the classical approaches to the CNN techniques on the Aerial UAV dataset variants alone (without colour constancy). Note that for the classical techniques, we considered two image resolution sizes: 100 × 100 and 250 × 250 pixels. We remark that the classical methods performed worse compared to the CNN techniques. Hence, we only considered the CNN approach on the other two datasets. The classical methods are described as follows.

Colour histogram
The colour histogram (Colour Hist) is a feature extraction technique that analyses the pixel colour values within an image. For this, the pixel colour values of an image which exist as RGB (Red, Green and Blue) are first transformed to HSV (Hue, Saturation and Value). After that, the value of each pixel in a channel is put in a histogram consisting of different bins. In the experiments, only the saturation channel with a bin size of 32 is used, because it obtained the best performance in preliminary experiments. The resulting feature vector containing 32 values is given to the supervised learning algorithms.
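A minimal sketch of this descriptor in NumPy is given below. The function name `saturation_histogram` is hypothetical, saturation is computed with the usual HSV definition $(\max - \min)/\max$ on RGB values scaled to [0, 1], and normalizing the histogram to sum to one is an assumption that the text does not state.

```python
import numpy as np


def saturation_histogram(rgb, bins=32):
    """32-bin histogram of the HSV saturation channel of an RGB image."""
    scaled = rgb.astype(np.float64) / 255.0
    value = scaled.max(axis=2)
    chroma = value - scaled.min(axis=2)
    # Saturation is chroma / value, defined as 0 for black pixels.
    safe_value = np.where(value > 0, value, 1.0)
    saturation = np.where(value > 0, chroma / safe_value, 0.0)
    hist, _ = np.histogram(saturation, bins=bins, range=(0.0, 1.0))
    return hist / hist.sum()
```

The returned vector has 32 entries and can be passed directly to a KNN or SVM classifier.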

Histogram of oriented gradients
The histogram of oriented gradients (HOG) (Dalal & Triggs, 2005) feature descriptor analyses patches (local regions) of an image. Histograms are then constructed based on the occurrences of gradient orientations within the patches. The HOG descriptor can process greyscale or colour image information. For the UAV dataset, we only considered the grey option. The procedure for constructing the HOG is as follows: convert the colour images of the aerial imagery into greyscale, then apply two gradient kernels to compute the gradient values for each pixel of the greyscale image. The gradients for each pixel within a small block (cell) are put in bins (Junior, Delgado, Gonçalves, & Nunes, 2009; Takahashi, Takahashi, Cui, & Hashimoto, 2014), where each bin defines a specific orientation range. The following parameters were used, because they worked best in preliminary experiments: a grid of 2 × 2 blocks is used, where each block is split into 2 × 2 cells. The number of orientation bins is set to 4. This results in a feature dimension size of 64. This feature vector is fed as input to the supervised learning algorithms.
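The arithmetic behind the 64-dimensional feature (2 × 2 blocks, each with 2 × 2 cells and 4 orientation bins, so 4 × 4 × 4 = 64) can be illustrated with a simplified HOG in NumPy. This is a sketch under assumptions, not the implementation used in the study: the function name `simple_hog` is hypothetical, gradients come from central differences, orientations are unsigned, blocks do not overlap, and each block is L2-normalized.

```python
import numpy as np


def simple_hog(gray, blocks=(2, 2), cells=(2, 2), n_bins=4):
    """Simplified HOG: magnitude-weighted orientation histograms per
    cell, grouped into non-overlapping L2-normalised blocks."""
    gray = gray.astype(np.float64)
    gy, gx = np.gradient(gray)  # central-difference gradients
    magnitude = np.hypot(gx, gy)
    angle = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation
    bin_idx = np.minimum((angle / np.pi * n_bins).astype(int), n_bins - 1)

    n_cy, n_cx = blocks[0] * cells[0], blocks[1] * cells[1]
    h, w = gray.shape
    y_edges = np.linspace(0, h, n_cy + 1).astype(int)
    x_edges = np.linspace(0, w, n_cx + 1).astype(int)

    cell_hist = np.zeros((n_cy, n_cx, n_bins))
    for i in range(n_cy):
        for j in range(n_cx):
            ys = slice(y_edges[i], y_edges[i + 1])
            xs = slice(x_edges[j], x_edges[j + 1])
            cell_hist[i, j] = np.bincount(bin_idx[ys, xs].ravel(),
                                          weights=magnitude[ys, xs].ravel(),
                                          minlength=n_bins)

    features = []
    for bi in range(blocks[0]):
        for bj in range(blocks[1]):
            block = cell_hist[bi * cells[0]:(bi + 1) * cells[0],
                              bj * cells[1]:(bj + 1) * cells[1]].ravel()
            features.append(block / (np.linalg.norm(block) + 1e-6))
    return np.concatenate(features)
```

For a 100 × 100 greyscale image, each of the 16 cells covers 25 × 25 pixels and the returned vector has 64 entries.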

The combination of HOG and Colour Hist
In this technique, the features from both the HOG and Colour Hist are combined to form the HOG-Colour Hist feature descriptor. The features from both the HOG and Colour Hist are first computed separately. The optimal parameters used for HOG in the combined feature are different from the HOG descriptor alone, because they gave slightly better results in the preliminary experiments. The HOG parameters used in this technique use 32 × 32 pixels per cell, for which we used 9 cells in total from 100 × 100 pixel images with a single block. The number of orientation bins is set to 4 and the final feature dimensionality is 36. We used the hue channel from the colour histogram with 32 bins. These features are normalized and concatenated to obtain the final feature vector with 68 elements.
Several experiments were conducted to determine the best choice of parameters for the used classifiers with the different classical feature descriptors. For the k parameter in k-nearest neighbour (KNN), we tried k ∈ {1, 2, 3, 4, 5, 10}. The C parameter of the linear SVM is set to $C = 2^{q-1}$, with the explored values $q \in \{1, 2, \dots, 19\}$. For the SVM with the RBF kernel, we tried C ∈ {1, 2, 3, 5} with $\gamma = 10^{p-1}$, where $p \in \{1, 2, \dots, 4\}$. The optimal parameters used for each of the classifiers are reported in Table 3. All the algorithms used for the classical techniques were developed in Python.

Experimental setup and results
This section entails the description of the experimental setup and shows and discusses the results on the used datasets.

CNN experimental setup
In this section, we explain the experimental setups used for each of the datasets.

CNN experimental setup for the Aerial UAV dataset
The enumeration below briefly describes the CNN setups for the experiments without and with colour constancy DA variants of this dataset.
(1) CNN setup on the non-colour constancy DA variants of the Aerial UAV dataset: All experiments were run in the Caffe deep learning framework on a GeForce GTX 960 GPU. The experimental parameters are as follows: the training display interval is set to 40, average loss is set to 40, the learning rate is set to 0.001, the learning policy is set to step, the step size is set to 4000 iterations, power is set to 0.5, gamma is set to 0.1, the momentum value is set to 0.9, weight decay is set to 0.0002 and the maximum number of iterations is set to 10,000, with a snapshot model generated after every 500 iterations. This resulted in 20 snapshots for the entire training process. These parameters were kept fixed in all the experiments for the different model configurations. The training images from the combination of any two of the subsets reported in Table 1 are further split into 80% for training and 20% for validation. We employed a training batch size of 20 and a testing batch size of 5 for all experiments, but with different test iterations. The altered parameters for the three subsets of the Aerial UAV dataset with their corresponding splits are described in Table 4. We first performed experiments with both the original and our derived CNN trained from scratch on the ORIG images. The preliminary results show that our proposed architecture requires less memory and less training time. This is summarized in Table 5. Additionally, our architecture obtains a similar level of performance compared to the original CNN.
(2) CNN setup on the colour constancy DA variants of the Aerial UAV dataset: For this dataset, the colour constancy DA variants, in either original or rotation matrix form, increase the effective size of the train-validation sets to two or three times the original size for the different subsets. The new versions of the datasets require a slight modification of the CNN training parameters: the changes in the solver test iterations (validation/train) for the respective datasets are detailed in Table 6, which also shows the dataset distribution. Otherwise, we used the same experimental settings as explained before. The test iterations for the three test sets, which exist only in ORIG or ROT-DA form, were kept constant so that the effectiveness of the new CNN models could be examined. Please note that we separated the rotation matrix and original versions of the test sets before applying colour constancy, which was applied only to the train-validation sets.
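The learning rate schedule and snapshot count implied by the solver settings above can be checked with a short sketch. It assumes the usual form of a "step" policy, in which the rate is base_lr × gamma^floor(iteration / step_size); the function names are illustrative and are not taken from the original setup. Note that the power value listed among the parameters does not enter this formula.

```python
import math


def step_lr(iteration, base_lr=0.001, gamma=0.1, step_size=4000):
    """Learning rate under a 'step' policy (assumed form):
    base_lr * gamma ** floor(iteration / step_size)."""
    return base_lr * gamma ** (iteration // step_size)


def snapshot_iterations(max_iter=10_000, snapshot_every=500):
    """Iterations at which a snapshot model would be written."""
    return list(range(snapshot_every, max_iter + 1, snapshot_every))


snapshots = snapshot_iterations()
print(len(snapshots))   # number of snapshots over the whole run
print(snapshots[6])     # iteration of the 7th snapshot

# The rate drops by a factor of gamma every step_size iterations.
for it in (0, 3999, 4000, 8000):
    print(it, step_lr(it))
```

With these values the rate is reduced twice during training (at 4000 and 8000 iterations), and the 7th snapshot corresponds to 3500 iterations.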

CNN experimental setup for Croatia fish dataset
For this dataset, we investigated the ORIG and ROT-DA datasets alone, and colour constancy data augmentation of ORIG and ROT-DA separately. We also studied the impact of grid resolution on the ROT-DA by using 4 × 4 and 2 × 2 ROT-DA images in separate experiments. CNN experimental settings similar to those described in Section 4.1.1 were used. The additional modifications to the proposed CNN are that the batch sizes for training, validation and testing are set to 12, 8 and 1, respectively. Each CNN model is trained for a maximum of 7200 iterations, with a snapshot generated every 720 iterations and the step size set to 3600, so that the learning rate drops to one tenth of the base learning rate of 0.001 after 3600 iterations. For the ORIG and ROT-DA alone, we set the test interval to 240, while for the colour constancy DA versions (ORIG/ROT-DA) it is set to 720. The dataset variants were shuffled based on fivefold cross-validation with five different test sets, ensuring no overlap with the train-validation sets. Please note that we separated the rotation matrix and original versions of the test sets before applying colour constancy, which was applied only to the train-validation sets. The dataset distributions are detailed in Table 6.

CNN experimental setup for Bird-600 dataset
For this dataset, we investigated the ORIG and ROT-DA alone, and colour constancy data augmentation of ORIG and ROT-DA separately. Our preliminary experiments suggest that the 2 × 2 ROT-DA yields better performance than the larger 4 × 4 grid, so we use smaller grids for this dataset. A CNN experimental setup similar to that described in Section 4.1.1 is used. The additional modifications to the proposed CNN are that the batch sizes for training, validation and testing are set to 9, 1 and 1, respectively. Each CNN model is trained for a maximum of 8100 iterations, with a snapshot created every 810 iterations and the step size set to 4000. We used a base learning rate of 0.001. For the ORIG and ROT-DA alone, we set the test interval to 270, while for the colour constancy DA versions (ORIG/ROT-DA) it is set to 810. As before, the dataset variants were shuffled based on fivefold cross-validation with five different test sets, ensuring no overlap with the train-validation sets. Please note that we separated the rotation matrix and original versions of the test sets before applying colour constancy, which was applied only to the train-validation sets. The dataset distributions are detailed in Table 6.
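The fivefold protocol used for the Croatia fish and Bird-600 datasets can be illustrated with a minimal sketch. The function name, the validation fraction of 20%, the random seed and the image count of 100 are all assumptions made for illustration; the actual split sizes are those given in Table 6. The sketch only shows the property described above: each fold has its own held-out test set, and any augmentation would be applied to the train-validation indices only.

```python
import random


def five_fold_splits(n_images, n_folds=5, val_fraction=0.2, seed=0):
    """Shuffle image indices and build one train/val/test split per fold.

    val_fraction and seed are hypothetical choices, not values from the paper.
    """
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    # Interleaved slicing gives n_folds disjoint test folds of near-equal size.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    splits = []
    for k in range(n_folds):
        test = folds[k]
        rest = [i for j, fold in enumerate(folds) if j != k for i in fold]
        n_val = int(len(rest) * val_fraction)
        splits.append({"train": rest[n_val:], "val": rest[:n_val], "test": test})
    return splits


splits = five_fold_splits(n_images=100)  # 100 is a placeholder dataset size
for k, split in enumerate(splits):
    print(k, len(split["train"]), len(split["val"]), len(split["test"]))
```

Colour constancy variants would then be generated for the "train" and "val" indices of each fold, leaving the "test" indices untouched.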

Evaluation of the CNN architecture on the datasets
In this section, we discuss the classification performances on the used datasets.

Results on the Aerial UAV dataset
To average the results over the different subsets of this dataset, we compute the weighted average accuracy, obtained by summing the average accuracies on the testing datasets multiplied by the relative testing dataset sizes. The weighted mean is given by $T_m = \left(\sum_{s=1}^{S} W_s T_s\right) / \left(\sum_{s=1}^{S} W_s\right)$, where $T_m$ denotes the weighted mean test accuracy, $W_s$ denote the weights, which are the numbers of images per test subset, $W_s = \{262, 2569, 1150\}$, and $T_s$ are the test accuracies for the various subsets, with $S = 3$.
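As a small illustration of this weighted mean, the sketch below uses the three test subset sizes given above. The per-subset accuracies are placeholders chosen for the example, not results from the paper, and the function name is illustrative.

```python
import math


def weighted_mean_accuracy(accuracies, weights):
    """Weighted mean: sum(W_s * T_s) / sum(W_s)."""
    if len(accuracies) != len(weights):
        raise ValueError("accuracies and weights must have the same length")
    return sum(w * t for w, t in zip(weights, accuracies)) / sum(weights)


subset_sizes = [262, 2569, 1150]          # W_s, from the text
example_accuracies = [99.0, 98.0, 97.0]   # T_s, placeholder values only

t_m = weighted_mean_accuracy(example_accuracies, subset_sizes)
print(round(t_m, 2))
```

Because the second subset is by far the largest, its accuracy dominates the weighted mean.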
(1) Evaluation of the CNN on Aerial UAV dataset variants (without colour constancy): In our preliminary studies, we carried out experiments on the data augmentation (ROT-DA) version of our dataset to determine the optimal crop size. We used the models generated from the train-validation experiments to evaluate our test sets. We initially employed the scratch CNN with the cross-entropy classification loss, with or without flipping and with different crop sizes: 125 × 125, 224 × 224 and 250 × 250. The results of these experiments are shown in Figure 9(a) and suggest that the optimal method uses a crop size of 224 × 224 pixels with flipping, which yields an accuracy of 98.18% at the 5th snapshot. In general, we observed only marginal differences between the various settings. Based on this outcome, we used the best crop size with flipping to carry out the experiments with the scratch and fine-tuned versions of the CNN, on both the ROT-DA and ORIG images. The validation results in Figure 9(b) show that the scratch and the fine-tuned CNN applied to the two kinds of images converge to a near-maximum level of performance. The reason is that most of the validation images contain objects similar to those in the training set. The validation results at the 5th snapshot are reported in Table 7. From the table, we can see that the use of the original dataset leads to more overfitting. The results of the different CNNs with the cross-entropy loss function are shown in Figure 9(c). From this figure, we observe that the best test accuracy is obtained by the fine-tuned CNN applied to the ROT-DA images at the 2nd snapshot. We further investigated the CNN with the L1 hinge loss, using the earlier mentioned CNN settings (scratch and fine-tuned versions) applied to the two sets of images (ROT-DA and ORIG). The results obtained are shown in Figure 9(d).
Based on the performances recorded during this preliminary investigation, we only compared results obtained at the 5th snapshot, as reported in Table 7. We compared the different approaches using the binomial distribution of correctly classifying test images. The results show that the fine-tuned CNN trained on the data-augmented images yields significantly higher test classification accuracies (P < 0.01) than the fine-tuned CNN trained on the ORIG images of the dataset. Overall, the fine-tuned CNNs obtain the best results, and combined with the data-augmented images the results are very good (99.65%). Finally, the results show that overall the use of the cross-entropy loss function leads to better results than the use of the hinge loss function.
(2) Evaluation of classical descriptors on the Aerial UAV dataset (without colour constancy): The weighted mean test accuracies of the classical techniques on the Aerial UAV dataset are reported in Table 8. We observe that the RBF-SVM outperforms the other two classifiers (K-NN and linear SVM) when combined with each of the feature descriptors. Another observation is that the classifiers with the Colour Hist or HOG-Colour Hist features yield better performances than using the HOG descriptor alone. This shows the importance of using colour information for this classification problem. Still, the results are significantly worse than the results using the CNN methods. Table 8 also shows the results of using the RBF-SVM with different datasets and different feature descriptors using larger images (250 × 250 pixels). The results show that here data augmentation does not lead to significantly better results. This can be explained by the fact that the best feature descriptor, the colour histogram, is not affected by this DA method. Finally, we note that the ORIG image with the smaller 100 × 100 resolution works better for the HOG feature descriptor and therefore also for HOG combined with the colour histogram. This can be explained by the fact that we optimized the HOG parameters using the smaller images.
Although the performances of the CNN techniques are much better, the classical techniques have a lower training computing time: t ≤ 1 min. This is because of the low dimensionality of the extracted features and the low number of trainable parameters.
(3) Results of the CNN on the Aerial UAV dataset variants (with colour constancy): The CNN training time on the colour constancy DA variants for the different subsets is t ≤ 46 min. We used the same approach as before to compute the weighted mean of the accuracies for the three subsets. The subfigures in Figure 10 show the learning curves for both training and testing on the colour constancy DA variants of the ORIG and ROT images, respectively. From Figure 10(a,b), we observe that the CNN validation accuracies of the colour constancy DA methods are very similar for both the fine-tuned and scratch experiments. From Figure 10(c), we observe that the fine-tuned CNN on ROT+MSRCR-ROT+MSR-ROT-DA attained a peak accuracy of 99.5% at the 4th snapshot, while the fine-tuned CNN on ORIG+MSRCR-ORIG+MSR-ORIG obtained 99.06% at the 5th snapshot. In both approaches, the performance decreases with further iterations, which suggests that early stopping is most appropriate for these methods. The validation performance in Figure 10(a) shows that most of the techniques examined were stable after the 7th snapshot (3.5K iterations). Hence we chose this iteration point as the basis of our comparison. A summary of the validation and test accuracies is reported in Table 9. Overall, the fine-tuned CNN applied to ROT+MSR-ROT-DA yields a very good performance at almost all evaluation points.
On this dataset, the fine-tuned CNN with colour constancy data augmentation of ROT images yields a higher accuracy than the fine-tuned CNN using either colour constancy data augmentation of ORIG images or ORIG images alone. However, none of the fine-tuned CNN results obtained with colour constancy DA images surpass the results obtained by the fine-tuned CNN on ROT-DA images alone. This is possibly due to the fact that the test sets only use ROT-DA images. Overall, the proposed rotation matrix algorithm leads to higher accuracies on this dataset with or without the colour constancy algorithm.
In contrast, in the scratch experiments, the scratch CNNs trained with colour constancy data augmentation of ORIG images outperform the CNNs trained on ROT-DA and ORIG images alone. Thus it seems that adding more images to train the scratch CNNs plays the most important role. Based on this observation, we will use the best scratch technique (ORIG+MSRCR-ORIG+MSR-ORIG-DA) and its rotation matrix version on the next two datasets. It is surprising that the scratch CNN performs better than the fine-tuned CNN on the ORIG+MSRCR-ORIG+MSR-ORIG-DA dataset. This may be caused by overfitting, which we observed in the test accuracy of subset 2.
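The significance comparisons in this section are based on the binomial distribution of correctly classified test images. The exact test is not spelled out here, so the following is only one plausible sketch: a two-sided two-proportion z-test using the normal approximation to the binomial. The function name and the counts of correctly classified images are hypothetical; only the total of 3981 test images follows from the subset sizes given earlier.

```python
import math


def two_proportion_z_test(correct_a, correct_b, n_a, n_b):
    """Two-sided z-test for a difference between two binomial proportions.

    Uses the pooled normal approximation; this is an assumed procedure,
    not necessarily the one used for the reported P values.
    """
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        return 0.0, 1.0  # all correct or all wrong in both groups: no evidence
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


n_test = 262 + 2569 + 1150  # total number of test images across the subsets
# Hypothetical counts of correctly classified images for two methods.
z, p = two_proportion_z_test(correct_a=3967, correct_b=3920, n_a=n_test, n_b=n_test)
print(z > 0, p < 0.01)
```

Since both methods are evaluated on the same test images, a paired test such as McNemar's test would be a reasonable alternative when per-image predictions are available.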

Results on Croatia fish dataset
We trained the CNNs using fivefold cross-validation data splits. The training time of the CNN models for each of the methods is t ≤ 16 min. The models generated from the CNNs using colour constancy DA variants with (ROT or ORIG) or (ROT or ORIG alone) were used to compute the accuracy on the test sets that contain either ORIG or ROT-DA images without colour constancy. The learning curves for train validation and testing phases while training for 7200 iterations are shown in Figure 11. The mean accuracies for test and validation sets for the different approaches after that number of iterations are reported in Table 10. We report that there is no significant difference between the test and validation performances for most methods. This indicates that the test and validation performances are consistent.
From Table 10, we observe that the fine-tuned CNN on ORIG alone, the colour constancy data augmentation on ORIG and the 2 × 2-ROT version of the dataset all yield high accuracies. There is no significant difference in accuracy between these three methods. The best method is the fine-tuned CNN on the 2 × 2-ROT+MSRCR-ROT+MSR-ROT-DA variant of this dataset. When we compare the results of the fine-tuned CNN applied to 2 × 2-ROT+MSRCR-ROT+MSR-ROT-DA with those on 4 × 4-ROT-DA, there is a significant difference (P < 0.05). This indicates that the use of colour constancy data augmentation with ROT images and the right choice of grid resolution are important for this dataset. We also note that the fine-tuned CNN significantly outperforms the scratch CNN on this dataset.
For the scratch experiments, training the CNN using ORIG+MSRCR-ORIG+MSR-ORIG-DA yields the highest accuracy. This best scratch CNN approach significantly outperforms the 4 × 4 ROT+MSRCR-ROT+MSR-ROT (P<0.05). Overall, the choice of colour constancy data augmentation with 2 × 2 ROT images works better in our experiment than the use of colour constancy data augmentation with 4 × 4 ROT images.

Results on bird dataset
We trained the CNNs using fivefold cross-validation data splits. The training time of the CNN models for each of the methods is t ≤ 13 min. The models generated from the CNNs using colour constancy DA variants with (ROT or ORIG) or (ROT or ORIG alone) of this dataset were used to compute the accuracies on the test sets, which contain only ORIG or ROT-DA images (without colour constancy images). The learning curves for the train-validation and testing phases, while training for 8100 iterations, are shown in Figure 12. The mean accuracies for the test and validation sets after that number of iterations are reported in Table 11. From this table, we see that there is no significant difference between the test and validation performances for each of the examined methods, which again shows that the test and validation performances are consistent with each other. From the subfigures in Figure 12, we observe that the fine-tuned CNNs outperform the scratch CNN methods on the different dataset variants.
The best techniques are the fine-tuned CNN on either 1 × 1-ROT+MSRCR-ROT+MSR-ROT-DA or ORIG+MSRCR-ORIG+MSR-ORIG-DA. These results indicate the importance of colour constancy on the ROT or ORIG images. This success can be attributed to training the CNN weights with enhanced colour information and with more images. To obtain this better performance, it was important to choose smaller rotational bounds [−15°, 15°], as used in the 1 × 1-ROT+MSRCR-ROT+MSR-ROT-DA, rather than the original rotational bounds [1°, 180°]. Such larger angular bounds may not be suitable for images in which objects appear upright. Furthermore, we compared the results obtained with the fine-tuned CNNs on the different variants of this dataset to the baseline result from Lazebnik et al. (2005), which obtained 92.33% using a probabilistic part-based method (maximum entropy framework). Our best approach significantly outperformed the baseline by a margin of 6.14% using the fine-tuned CNN on 1 × 1-ROT+MSRCR-ROT+MSR-ROT-DA. However, we remark that the scratch CNN results on this dataset were worse than the baseline method.
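The effect of the two rotational bounds can be made concrete with a small sketch that draws one rotation angle per grid cell. Uniform sampling, the function names and the seed are assumptions for illustration; the text only states that each cell holds a randomly rotated copy within the given bounds. The helper `rotate_point` applies the standard 2D rotation to a pixel coordinate and is unrelated to the term "rotation matrix" as used here for the grid of cells.

```python
import math
import random


def sample_grid_angles(grid_size, low_deg, high_deg, seed=None):
    """Draw one rotation angle (degrees) per cell of a grid_size x grid_size grid.

    Uniform sampling is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    return [[rng.uniform(low_deg, high_deg) for _ in range(grid_size)]
            for _ in range(grid_size)]


def rotate_point(x, y, angle_deg, cx=0.0, cy=0.0):
    """Rotate (x, y) counter-clockwise by angle_deg about the centre (cx, cy)."""
    theta = math.radians(angle_deg)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(theta) - dy * math.sin(theta),
            cy + dx * math.sin(theta) + dy * math.cos(theta))


# 1 x 1 grid with the smaller bounds, as for the bird images.
narrow = sample_grid_angles(1, -15.0, 15.0, seed=0)
# 2 x 2 grid with the original bounds.
wide = sample_grid_angles(2, 1.0, 180.0, seed=0)
print(narrow)
print(wide)
```

With the narrow bounds an upright bird stays close to upright, whereas the original bounds can turn it almost upside down, which is consistent with the observation above.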

Conclusion
In deep learning, data augmentation can play an important role if a dataset does not contain many training images. In this paper, we developed a novel DA method that transforms an image into a new image containing multiple random transformations of the image. We combined this method with the use of colour constancy algorithms that add several transformed images to the training datasets. We created different combinations of methods: using ORIG or ROT images combined with colour constancy transformed images or not. These combinations were compared on three different animal datasets: Aerial UAV containing cows or not, a dataset with bird images and a dataset with fish images. Overall we considered two broad forms of data augmentation based on their increase (colour constancy data augmentation with ORIG or ROT-DA) or no increase (ROT-DA alone) in the amount of training images.
The results show that for the Aerial UAV dataset, the augmented ROT images are very useful. The Aerial UAV dataset consists of pictures taken from the sky, and therefore it is important to cope with 2D rotations to obtain the highest accuracies. It should be noted that this DA algorithm is useful for CNNs because, although CNNs are more or less translation invariant, they are not rotation invariant. For the fish and bird datasets, the proposed rotation matrix DA method does not lead to better results than using the ORIG images. For these datasets, the images show objects that are often in an upright position, and therefore there is less need to deal with rotational variance.
Overall, the colour constancy data augmentation helps to obtain better accuracies, but the differences are not very large compared to using the ORIG images. Only on the bird dataset does the colour constancy data augmentation play a very important role when training the CNN from scratch. The variation in colours is quite large for this dataset, and therefore adding images with different illumination levels is helpful. On this dataset, colour constancy data augmentation also improves the results of the fine-tuned CNN.
The results have also shown that the fine-tuned CNNs significantly outperform the CNNs trained from scratch on the Croatia fish and Bird-600 datasets. Furthermore, the fine-tuned CNNs obtain very high accuracies on the Aerial UAV and Bird-600 datasets.
Future work can explore the use of deep neural network architectures to artificially transform colours in images. This could be done with a novel form of data augmentation or by adding initial layers that directly transform the colour pixels. It would also be interesting to create a deep neural network that can generate the best ROT images, possibly trained using an adversarial learning framework.