Video fingerprinting based on quadruplet convolutional neural network

To achieve fast and accurate retrieval of video copies, this paper proposes a compact video fingerprinting algorithm based on a quadruplet convolutional neural network. The algorithm consists of four branch networks with shared weights; each branch performs feature extraction and quantization coding. A projection and excitation network is combined with 3D convolution for feature extraction: it learns feature weights to enhance useful features and suppress valueless ones. The learned deep features are mapped to approximate real-valued vectors through a fully connected layer and quantized to generate binary codes. The model employs an improved quadruplet loss to separate the feature distances of copied videos from those of non-copied videos, and a quantization error term is added to ensure that the fingerprint codes retain as much of the similarity information in the original data as possible. Experimental results on public datasets show that the algorithm effectively improves robustness and distinctiveness, and its average detection accuracy under multiple compound attacks is better than that of the compared algorithms.


Introduction
With the popularization of the Internet and the development of computer technology, multimedia data can be spread conveniently. Taking videos as an example, while they enrich human life and increase human knowledge, illegal content contained in them can directly infringe the owner's copyright and seriously affect the healthy development of society (Gu et al., 2017). For this reason, various video copy detection technologies have been proposed. To reduce memory consumption and accelerate retrieval, video fingerprinting has gradually developed into an important part of video copy detection. Video fingerprint codes are obtained by extracting features from a video and quantizing them into binary form, so that a large amount of data is represented by very little data (Oostveen et al., 2002). A fingerprint usually needs to satisfy robustness, distinctiveness and compactness. Robustness means that the extracted features remain highly similar under distortions; distinctiveness means that the codes of different videos are different; compactness means that the codes should be expressed as compactly as possible.
The key to video fingerprinting is how to extract video features effectively. Traditional video fingerprinting algorithms mainly rely on handcrafted methods to extract features (De et al., 2005; Esmaeili & Ward, 2010; Lee & Yoo, 2006; Li & Monga, 2013; Malekesmaeili et al., 2009), which suffer from complicated pre-processing and insufficient ability to capture spatio-temporal information. In recent years, as deep learning has made outstanding achievements in computer vision, researchers have tried to use convolutional neural networks (CNN), long short-term memory (LSTM), recurrent neural networks (RNN) and other neural networks to extract features autonomously, which has promoted the development of video fingerprinting. Jiang and Wang (Jiang & Wang, 2016) used a pre-trained AlexNet to extract frame features; experimental results showed that it outperforms traditional methods. Wang, Bao and Li used VGGNet to extract features and then reduced their dimensionality through principal component analysis (PCA), which further improved performance. The above methods are based on 2D CNNs, which can only extract spatial features and ignore the temporal information among consecutive frames. For this reason, Li and Chen (Li & Chen, 2017) trained a conditional restricted Boltzmann machine (CRBM) and a denoising auto-encoder (DAE) respectively to obtain robust spatio-temporal descriptors. Li, Zhang and Wan (Li et al., 2018) used a parallel 3D CNN to extract spatio-temporal features directly. Compared with 2D convolution, 3D convolution can capture motion information along the time dimension; however, a common 3D CNN has difficulty mining deeper semantic information to satisfy compactness. Because storing real-valued features consumes a great deal of memory, some works try to combine deep learning and hashing to generate binary codes directly.
Ma, Gu and Gong (Ma et al., 2018) used a CNN and an LSTM to extract spatial and temporal video features respectively, and fused frame-level features into video-level features. Zhang, Wang and Hong (Zhang et al., 2016) integrated feature extraction and quantization coding under a unified framework and proposed a binary LSTM unit. Compared with methods that separate feature extraction from hash retrieval, such end-to-end fingerprinting reduces the information loss in quantization. Guo, Li, Yang and Xu (Guo et al., 2019) proposed a video fingerprinting algorithm based on a triplet network, in which each subnetwork used a 3D residual network to capture global and local spatio-temporal information simultaneously. These algorithms have been shown to significantly improve performance under certain distortions; however, the triplet network is not sufficient to fully measure the similarity among videos.
To address the deficiencies of 3D convolution and the triplet network, we build a quadruplet network. Specifically, each branch of the quadruplet network uses a 3D ResNet (Hara et al., 2017) with an implanted PE Block (Rickmann et al., 2019) as its feature extractor. The PE Block generates weights for the feature channels. In addition, the output layer of the network is used to generate the fingerprint codes. The overall framework realizes an end-to-end mapping from the original video to binary codes. During training, the improved quadruplet loss and the quantization error loss are optimized jointly.
The remainder of this paper is structured as follows. Section 2 shows the overall quadruplet framework. Section 3 describes the detailed quadruplet fingerprinting. Section 4 introduces the complete experimental process of model training and performance testing. Section 5 gives relevant conclusions and further work.

Framework design
As shown in Figure 1, the framework proposed in this paper is a quadruplet network. The input is a set of quadruplet videos: the anchor represents the source video, the positive corresponds to the copied video, and negative1 and negative2 are non-copy videos. Each sub-network uses 3D ResNet-50 as the backbone, with a project and excite block implanted in the convolution layers. The high-dimensional features learned by the feedforward network are mapped to a real-valued vector of fixed length through a fully connected layer, and a binary code is then obtained by binarization. The objective function, composed of the improved quadruplet loss and the quantization error loss, drives model optimization to ensure that the fingerprint meets the robustness and distinctiveness requirements.

Projection and excitation network
The feature maps obtained after each convolution in a convolutional neural network contain rich feature information. For traditional 3D CNNs (Tran et al., 2015), the information obtained in the current layer mainly comes from fusing the spatial and temporal dimensions of the previous feature maps; however, the difference information among the channels is often ignored. Therefore, this paper introduces the projection and excitation network (PENet) (Rickmann et al., 2019) to achieve feature fusion across channels. Specifically, each feature channel is assigned a weight, so that the model learns the relationships among channels and can select the channel features that suit the current task. Figure 2 shows the principle of the projection and excitation network. The input is a feature map of size D × H × W × C, and the calculation includes two steps. In the first step, the projection performs global average pooling along the three dimensions D, H and W of each feature channel to obtain three projection vectors; ⊕ represents an addition operation that fuses the spatial information from each direction. In the second step, the excitation operation consists of two layers of 1 × 1 × 1 convolutions; each convolution layer is activated by a ReLU or Sigmoid function, and r is the dimensionality-reduction coefficient. These operations generate a weight for each feature channel that captures its importance; ⊗ denotes the element-wise multiplication operation.
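The projection and excitation computation described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the paper's implementation: the helper name pe_block and the plain weight matrices standing in for the two 1 × 1 × 1 convolutions are our simplifications.

```python
import numpy as np

def pe_block(x, w_reduce, w_expand):
    """Sketch of a Project & Excite block (after Rickmann et al., 2019).
    x has shape (C, D, H, W); w_reduce (C//r, C) and w_expand (C, C//r)
    play the role of the two 1x1x1 excitation convolutions."""
    # Projection: global average pooling along each pair of axes,
    # keeping one spatial/temporal dimension per projection vector.
    p_d = x.mean(axis=(2, 3), keepdims=True)   # (C, D, 1, 1)
    p_h = x.mean(axis=(1, 3), keepdims=True)   # (C, 1, H, 1)
    p_w = x.mean(axis=(1, 2), keepdims=True)   # (C, 1, 1, W)
    z = p_d + p_h + p_w                        # broadcast add -> (C, D, H, W)
    # Excitation: two 1x1x1 "convolutions" acting across channels.
    z = np.tensordot(w_reduce, z, axes=(1, 0))  # (C//r, D, H, W)
    z = np.maximum(z, 0.0)                      # ReLU
    z = np.tensordot(w_expand, z, axes=(1, 0))  # (C, D, H, W)
    weights = 1.0 / (1.0 + np.exp(-z))          # Sigmoid -> weights in (0, 1)
    return x * weights                          # channel-wise recalibration
```

Because the sigmoid output lies in (0, 1), the block can only attenuate or pass through each channel's response, which is what lets the network suppress valueless features.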

End-to-end structure
According to the framework shown in Figure 1, each branch is an end-to-end structure that maps the original video data to fingerprint codes. The structure and detailed parameters are shown in Figure 3. The input is video data; the Conv1 layer uses a 7 × 7 × 7 convolution kernel to obtain 64 feature maps, and the Max Pool layer uses a sliding window of size 3 × 3 × 3 for max pooling. Conv2_x, Conv3_x, Conv4_x and Conv5_x are stacked from 3, 4, 6 and 3 repeated residual units respectively, each of which contains two 1 × 1 × 1 and one 3 × 3 × 3 convolution kernels. The PE Block is located in each residual unit; this embedding does not change the basic network structure. More importantly, it adaptively recalibrates the feature channels so that the network captures important channel feature information. The Ave Pool layer uses a sliding window of size 1 × 4 × 4 for average pooling. The output consists of two parts: a 2048-dimensional vector and a 16-bit binary code.

Related definitions
The purpose of video fingerprinting is to establish a mapping relationship while keeping the code as compact as possible. Its formula is as follows:

F(v) = [h_1(v), h_2(v), ..., h_k(v)]^T

where h_i(·) (i = 1, 2, ..., k) represents the mapping function and [·]^T represents the transpose operation. In addition, we use f(v; Θ) ∈ R^{k×1} to represent the k-dimensional vector extracted from the video v, where Θ is the model parameter; similarly, we use the sign function sgn(·) to quantize and encode the real-valued features, i.e. b = sgn(f(v; Θ)).
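As a concrete illustration of the quantization step, a real-valued feature vector can be binarized with the sign function and two resulting codes compared by Hamming distance (a minimal sketch; the helper names are ours):

```python
import numpy as np

def binarize(f):
    """Quantize a real-valued feature vector into a +/-1 fingerprint code,
    i.e. b = sgn(f(v; Theta)); zeros are mapped to +1 by convention."""
    return np.where(f >= 0, 1, -1)

def hamming_distance(b1, b2):
    """Number of differing bits between two +/-1 fingerprint codes."""
    return int(np.sum(b1 != b2))
```

At retrieval time, a query fingerprint is compared against the database codes by Hamming distance, which is what makes binary codes fast to match.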

New quadruplet loss
For the quadruplet framework, four videos v_a, v_p, v_n1 and v_n2 form a quadruplet as the training sample, where (v_a, v_p) is a copied pair and (v_a, v_n1), (v_a, v_n2) are non-copy pairs. The quadruplet provides both the constraint on the copied pair and the constraints on the non-copy pairs. This can further reduce the intra-class distance and increase the inter-class distance, prompting the model to generate higher-quality video fingerprints. However, the traditional quadruplet loss is optimized with the max(0, ·) function, which introduces non-differentiable points. In response, we design a new quadruplet loss in which the smoother continuous function ln(1 + exp(·)) replaces max(0, ·) for gradient calculation. The specific formula is:

L_q = ln(1 + exp(||f_e(v_a) − f_e(v_p)||₂² − ||f_e(v_a) − f_e(v_n1)||₂² + ω₁α)) + ln(1 + exp(||f_e(v_a) − f_e(v_p)||₂² − ||f_e(v_n1) − f_e(v_n2)||₂² + ω₂β)) (2)

where f_e(v_a), f_e(v_p), f_e(v_n1) and f_e(v_n2) represent the normalized real-valued features. The continuous feature vectors are used as the optimization objects of the quadruplet loss, which simplifies the calculation and avoids the non-differentiability that arises when optimizing discrete codes.
In (2), ||·||₂² represents the squared Euclidean distance, ω₁α and ω₂β represent the adaptive thresholds, and ω₁ and ω₂ are the corresponding threshold coefficients. Using adaptive thresholds can better constrain the feature distances among the samples. α and β are determined from the quadruplets in the training batch. The mathematical expression is as follows:

α = (1/N) Σ_{i=1}^{N} (||f_e(v_a^i) − f_e(v_n1^i)||₂² − ||f_e(v_a^i) − f_e(v_p^i)||₂²) (3)

β = (1/N) Σ_{i=1}^{N} (||f_e(v_n1^i) − f_e(v_n2^i)||₂² − ||f_e(v_a^i) − f_e(v_p^i)||₂²) (4)

where N is the training batch size. Equation (3) obtains α from the average gap between the distances of the (v_a, v_n1) pairs and the (v_a, v_p) pairs in the feature space; obtaining β is similar, using the (v_n1, v_n2) pairs. Since ln(1 + exp(·)) is differentiable everywhere, the gradients of Equation (2) with respect to f_e(v_a), f_e(v_p), f_e(v_n1) and f_e(v_n2) can be computed directly by the chain rule for backpropagation.
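Under the assumption that the adaptive thresholds are batch-average distance gaps as described above, the smoothed quadruplet loss can be sketched as follows. This is illustrative NumPy code with our own helper names, not the authors' implementation; the default coefficients follow the values given in the Training section.

```python
import numpy as np

def softplus(u):
    """Smooth replacement for max(0, u): ln(1 + exp(u))."""
    return np.log1p(np.exp(u))

def quadruplet_loss(fa, fp, fn1, fn2, w1=1.0, w2=0.5):
    """Sketch of the smoothed quadruplet loss over a batch.
    fa, fp, fn1, fn2: (N, k) arrays of normalized feature vectors for
    the anchor, positive, negative1 and negative2 branches."""
    d_ap = ((fa - fp) ** 2).sum(axis=1)      # squared Euclidean distances
    d_an1 = ((fa - fn1) ** 2).sum(axis=1)
    d_n1n2 = ((fn1 - fn2) ** 2).sum(axis=1)
    # Adaptive thresholds (assumed form): batch-average distance gaps.
    alpha = d_an1.mean() - d_ap.mean()
    beta = d_n1n2.mean() - d_ap.mean()
    # softplus keeps the loss differentiable everywhere, unlike max(0, .).
    loss = softplus(d_ap - d_an1 + w1 * alpha) + softplus(d_ap - d_n1n2 + w2 * beta)
    return loss.mean()
```

Pushing the negatives further from the anchor (and from each other) lowers the loss, which is exactly the intra-class/inter-class behaviour the text describes.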

Quantization error loss
In the binarization process, to reduce the quantization error as much as possible, the following quantization error loss function based on the quadruplet is used:

L_e = ||b_a − f_e(v_a)||₂² + ||b_p − f_e(v_p)||₂² + ||b_n1 − f_e(v_n1)||₂² + ||b_n2 − f_e(v_n2)||₂² (6)

where b_a = sgn(f_e(v_a)), and b_p, b_n1 and b_n2 are defined analogously. Treating the binary codes as constants, the gradients of Equation (6) with respect to f_e(v_a), f_e(v_p), f_e(v_n1) and f_e(v_n2) are:

∂L_e/∂f_e(v_a) = 2(f_e(v_a) − b_a), ∂L_e/∂f_e(v_p) = 2(f_e(v_p) − b_p), ∂L_e/∂f_e(v_n1) = 2(f_e(v_n1) − b_n1), ∂L_e/∂f_e(v_n2) = 2(f_e(v_n2) − b_n2). (7)
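A sketch of this quantization error term, assuming it penalizes the squared gap between each branch's real-valued features and their sign codes (our assumed form, with our helper name):

```python
import numpy as np

def quantization_error_loss(feats):
    """Quantization error summed over the quadruplet branches.
    feats: iterable of real-valued feature vectors, one per branch.
    Each vector is compared against its own sign code b = sgn(f).
    Note np.sign(0) is 0, so features of exactly 0 incur unit error."""
    total = 0.0
    for f in feats:
        b = np.sign(f)                 # target binary code (+1 / -1)
        total += ((b - f) ** 2).sum()  # squared quantization gap
    return total
```

The term is minimized when every feature entry already sits at ±1, which is what drives the real-valued outputs toward their binary codes during training.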

Algorithm flow
The objective function integrates the new quadruplet loss and the quantization error loss. At the same time, to prevent overfitting during training, an L1 regularization term is added to enhance the generalization ability of the model. Because the L1 regularization term pushes many parameters close to 0, important features are kept and trivial ones are discarded, which reinforces the key features shared by the training and testing samples. The objective function is:

L = L_q + λL_e + μ||Θ||₁ (8)

In (8), ||Θ||₁ represents the sum of the absolute values of the parameters of each part of the model, and λ and μ are the weighting parameters.
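Assuming the additive form suggested by the text, the overall objective can be assembled as follows (an illustrative helper; the default λ and μ take the values reported in the Training section):

```python
import numpy as np

def objective(l_quad, l_quant, params, lam=0.01, mu=0.001):
    """Overall training objective: quadruplet loss + lam * quantization
    error + mu * L1 regularization over all model parameter arrays."""
    l1 = sum(float(np.abs(p).sum()) for p in params)  # ||Theta||_1
    return l_quad + lam * l_quant + mu * l1
```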

Experiments
The experiment configuration includes Ubuntu 16.04, an Intel Core i7 CPU (6 cores, 3.70 GHz), 32 GB of memory and an RTX 2080 graphics card. The deep learning framework used is PyTorch.
The specific construction process of our dataset is as follows: first, three public datasets with a resolution of 320 × 240 were selected, giving a total of 4986 videos with large differences in visual content; then, all selected video clips were divided into a training set of 3986 videos and a testing set of 1000 videos; finally, to ensure convenience and uniformity, the first 100 frames of each selected clip were intercepted, and each sequence was normalized to a size of 100 × 320 × 240.

Training
To train the model, video quadruplets must be constructed: three videos are taken randomly from the training set, one of them is selected to generate a distorted copy, and the four videos together form a training sample. To match the input requirements, each video is sampled at equal intervals to obtain 16 frames. The four corners and the centre of each frame are cropped at the five scales 1, 1/2^{1/4}, 1/√2, 1/2^{3/4} and 1/2 to obtain images of size 112 × 112. During the experiments, the relevant parameters in the objective function are set as follows: the threshold coefficients ω₁ and ω₂ are 1 and 0.5, and the hyperparameters λ and μ are 0.01 and 0.001, respectively. It should be noted that the selection of λ and μ affects the results: because these two parameters weight the quantization error term and the L1 regularization term, changing them alters the optimization direction of the overall objective function. Experiments show that the model converges best when λ and μ are 0.01 and 0.001, respectively.
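The equal-interval frame sampling step can be sketched as follows (a minimal helper; the exact index-rounding rule is our assumption):

```python
def sample_frames(n_total=100, n_sample=16):
    """Pick n_sample frame indices at equal intervals from a clip of
    n_total frames, as used to build the 16-frame network input."""
    step = n_total / n_sample          # spacing between sampled frames
    return [int(i * step) for i in range(n_sample)]
```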
The training process adopts the following strategies to accelerate convergence: (1) the PE Block parameters are randomly initialized, while the 3D ResNet-50 parameters are initialized from a model pre-trained on the Kinetics dataset (Carreira & Zisserman, 2017); (2) training uses stochastic gradient descent (SGD), with the momentum set to 0.9 and the weight decay set to 0.001; (3) according to the computer configuration, each epoch selects 10,000 quadruplets; (4) the learning rate is adjusted according to the number of iterations: when it reaches 20,000, the learning rate drops to 0.1 times its original value.
We compare the complete training process of the PE_Quadruplet algorithm with that of the NL_Triplet algorithm (Guo et al., 2019) to reflect the entire training intuitively. For fairness, the NL_Triplet algorithm is trained on the same dataset with the same number of iterations. Figures 4 and 5 show the curves of loss and accuracy as the iterations increase. As Figure 4 shows, both networks fit the video data well. Figure 5 shows more clearly that the recognition accuracy of PE_Quadruplet is significantly higher than that of NL_Triplet.

Testing
The test experiments mainly evaluate the detection accuracy under various distortions. Specifically, one or two simulated distortions are applied to each video in the testing set to generate 8 copies, including single and combined types. The distortions cover three aspects: spatial distortion, geometric distortion and temporal distortion. The specific distortions and parameters are shown in Table 1.
For the evaluation of algorithm performance, we choose the receiver operating characteristic (ROC) curve and the F₁ score. The F₁ score is calculated as:

F₁ = max{2PR/(P + R)} (9)

where P and R are the precision and recall at a given threshold, max{·} means taking the maximum value, and the maximum is taken over η = 100 thresholds selected at equal intervals between d_min and d_max of the Hamming distance. The higher the F₁ score calculated with Equation (9), the better the algorithm performs, and vice versa. To verify the performance of the algorithm comprehensively, we conducted two groups of experiments: the first compares our algorithm under various kinds of distortion; the second compares several algorithms. Figure 6 shows the ROC curves under the distortions. Figure 6(a) is the graph for frame dropping: performance decreases slightly as more frames are dropped. Figure 6(b) is the graph for FPS reduction: the change of FPS has little effect on performance, indicating that the combination of the 3D residual network and the PE Block captures the correlation among frames well. Figure 6(c) shows frame rotation: the larger the rotation angle, the worse the algorithm performs. Figure 6(d) is the graph for frame shifting plus FPS reduction: initially, performance declines as the shift grows, but when the translation reaches a certain distance the performance rises again, indicating that the combination of shifting and FPS reduction does not necessarily reduce performance. Figure 6(e) is the graph for logo insertion: the closer the inserted logo is to the centre of the frame, the worse the anti-interference ability of the algorithm; conversely, the closer the logo is to the edge of the frame, the stronger the anti-interference ability, because most of the key information is contained in the centre area of the video frame.
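The threshold sweep behind the F₁ score can be sketched as follows. The code is illustrative (the pair-label setup and variable names are ours): it sweeps equally spaced Hamming-distance thresholds and keeps the best F₁.

```python
import numpy as np

def best_f1(distances, labels, n_thresholds=100):
    """Sweep n_thresholds equally spaced thresholds between d_min and
    d_max and return the best F1 score over the sweep.
    distances: Hamming distances between fingerprint pairs;
    labels: 1 for copy pairs, 0 for non-copy pairs."""
    best = 0.0
    for eta in np.linspace(distances.min(), distances.max(), n_thresholds):
        pred = distances <= eta                 # declare a pair a copy
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        if tp == 0:
            continue                            # F1 undefined / zero here
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

A perfectly separable set of distances yields F₁ = 1 at some threshold, which is the upper bound the compared algorithms are measured against.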
Figure 6(f) and (g) correspond to median blur plus frame dropping and salt & pepper noise plus frame dropping, respectively.
It can be seen that for both median blur and salt & pepper noise, performance does not decrease sharply as the intensity increases. This is because the 3D residual network has an outstanding ability to learn pixel-level features. Even when some video frames are additionally discarded, the robustness and distinctiveness of the algorithm remain good under these two types of distortion. Figure 6(h) is the graph for video frame scaling. Scaling the frame by certain factors clearly affects performance, and shrinking the frame loses more spatial structure information than enlarging it.
The second group of experiments compares our PE_Quadruplet algorithm with four classic video fingerprinting algorithms, RASH (De et al., 2005), CGO (Lee & Yoo, 2006), TIRI (Esmaeili & Ward, 2010) and SGM (Li & Monga, 2013), and with the deep-learning-based NL_Triplet algorithm (Guo et al., 2019). NL_Triplet uses the same dataset as PE_Quadruplet, and the code length for both is set to 16 bits. Figure 7 shows the ROC curves of these algorithms. The overall detection performance of our algorithm improves on the others. Specifically, in the frame dropping case shown in Figure 7(a), our algorithm is significantly better than TIRI, CGO and RASH, which shows that the 3D residual network combined with the PE Block can still grasp the linkage among frames when the temporal information is damaged. As Figure 7(b) shows, when the FPS is reduced, our algorithm is slightly inferior to SGM and NL_Triplet, but its overall performance remains high. This is because the 3D convolutional network and the projection and excitation network acquire only limited information in the temporal and channel dimensions, so the recognition effect under FPS changes is not particularly outstanding. Figure 7(c) shows that our algorithm performs best among the compared algorithms in the rotation case, because the proposed quadruplet loss further improves the model's resistance to geometric attacks. Figure 7(d) shows that under the double attack of frame shifting and FPS reduction, our performance is significantly higher than that of the other algorithms; the PE_Quadruplet algorithm is thus especially strong against combined geometric and temporal distortions. Figure 7(e) shows that for logo insertion, PE_Quadruplet is still the best among all algorithms, indicating that it is also very robust against local distortion.
Figure 7(f) shows that under the double attack of median blur plus frame dropping, our algorithm exceeds the four traditional algorithms and is almost on a par with the deep learning algorithm NL_Triplet, indicating that PE_Quadruplet is robust to blur-type spatial distortion. Figure 7(g) shows that under the double attack of salt & pepper noise plus frame dropping, our performance is slightly inferior to NL_Triplet but slightly better than the traditional algorithms, reflecting a certain robustness to noise-type spatial distortion. As shown in Figure 7(h), in the scaling case the algorithm achieves the best detection effect among the compared algorithms, demonstrating strong robustness and high distinctiveness against geometric distortion. In addition, Table 2 shows the F₁ scores of the algorithms. Our F₁ scores are clearly the best, showing that the fingerprint codes generated by the quadruplet network are compact while offering outstanding robustness and distinctiveness, which enables fast and accurate retrieval of video copies.

Conclusion
The algorithm in this paper combines deep learning and hashing, using a 3D ResNet with an embedded PE Block to learn the semantically similar features of the quadruplet videos. To facilitate gradient calculation, the designed new quadruplet loss and the quantization error loss jointly train the model. Experimental verification shows that the quadruplet training method indeed improves the overall effect; however, its performance is not yet satisfactory under signal-processing spatial distortions such as added noise and blurring, indicating that the end-to-end network still cannot achieve the expected effect for some individual distortions. Future research will therefore focus on video autoencoders to extract better spatio-temporal features.