Compact video fingerprinting via an improved capsule net

Robustness, distinctiveness and compactness are the three basic performance metrics for video fingerprinting, and the three affect one another, so it is challenging to improve them simultaneously. For this reason, an end-to-end fingerprinting method based on a capsule net is proposed. To capture video features, a capsule net built on a mixed 3D/2D convolution module is designed, which maps raw video data directly to a compact real-valued vector. A newly designed adaptive margin triplet loss function is introduced, which automatically adjusts the loss according to the sample distances; this reduces training difficulty and improves performance. Three open-access video datasets, FCVID, TRECVID and YouTube, are combined for training and testing, and extensive experimental results show that the proposed fingerprinting achieves better performance than both traditional and deep learning methods.


Introduction
With the rapid development of image sensors, computers and the internet, the spread of video has become wider and more convenient. This facilitates people's work and lives; however, the uploading of illegal videos and copyright infringement occur at the same time. Measures such as video fingerprinting have been taken to prevent this. Video fingerprinting is a technology that maps video data to a short vector. The mapping needs to satisfy three conditions: robustness, distinctiveness and compactness. Robustness requires that the fingerprints of an original video and its modified copies be the same or extremely similar. Distinctiveness means that the fingerprints of different videos should be different. Compactness means that the fingerprints should be as short as possible for efficient matching. In general, distinctiveness conflicts with robustness and compactness, and balancing the relationships among them to obtain better overall performance is very challenging. Previous works mainly focused on robustness and distinctiveness (Coskun et al., 2006; De Roover et al., 2005; Esmaeili et al., 2011; Guzmanzavaleta et al., 2017; Hu & Lu, 2018; Jiang & Wang, 2016; Kordopatis-Zilos et al., 2017a, 2017b; Lee & Yoo, 2008). Only a few works pursued compactness while keeping the other two properties (Li & Monga, 2013). In this paper we focus on designing a compact fingerprinting while also improving robustness and distinctiveness.

According to the feature extraction process, there are two kinds of video fingerprinting: traditional methods and deep learning methods. Traditional features are handcrafted based on experience.
In the early stage, various robust features were designed, such as Radial hASHing (RASH) (De Roover et al., 2005), 3D-DCT Hashing (Coskun et al., 2006), Centroids of Gradient Orientations Hashing (CGO) (Lee & Yoo, 2008), Temporally Informative Representative Images (TIRI) (Esmaeili et al., 2011), the Structure Graphic Model (SGM) (Li & Monga, 2013) and a combination of acoustic and visual features (Guzmanzavaleta et al., 2017). These features are extracted from either key frames or video clips, and the clip-based methods, such as 3D-DCT Hashing, TIRI and CGO, remedy the loss of temporal information. In general, these methods were well designed and achieved satisfactory results in some respect at that time.
Currently, since deep neural networks have been applied successfully to many visual tasks, deep learning-based features have also been introduced into multimedia fingerprinting (Hu & Lu, 2018; Jiang & Wang, 2016; Kordopatis-Zilos et al., 2017a, 2017b). Jiang et al. (2016) studied fingerprinting based on a standard CNN and a Siamese CNN; they designed an efficient network to save computation time and achieved better performance than traditional methods. Kordopatis-Zilos et al. (2017a) proposed a layer-based CNN (CNN-L) feature aggregation scheme, which extracts intermediate CNN features and aggregates them into histograms following the bag-of-words model. They also designed a fingerprinting method with deep metric learning (DML) (Kordopatis-Zilos et al., 2017b). Both methods obtained state-of-the-art performance compared with competitive methods at that time.
To extract temporal information well, Hu et al. (2018) combined CNN features with long short-term memory units to represent spatial-temporal information. This method showed better performance than the Siamese CNN. However, the feature extraction and quantization stages are independent in all existing deep feature-based video fingerprinting, so the mapping function is only locally optimized. In fact, all the methods based on handcrafted features, and even most CNN features, consist of several stages, such as feature extraction, feature compression and binary quantization. Independent stages mean local optimization; this could be improved through an integrated mapping function and global optimization. We therefore design an end-to-end fingerprinting with an improved capsule net.
To overcome the limited representation capability of CNNs, Sabour et al. (2017) proposed the capsule net, which describes entities with vector neurons. Unlike a traditional CNN, the capsule net describes entities via the relationship between parts and the whole, and it obtains good, compact representations with few network layers. However, because a forward clustering iteration is nested in the backward propagation, it is hard to apply directly to video data owing to the huge computation. Since CNN features were published, a large number of CNN architectures have been proposed to extract image features, among which the 3D CNN is a typical network for extracting video features (Tran et al., 2015). To reduce computation cost and compress features, a mixed 3D/2D CNN was designed (Zhou et al., 2018). Inspired by this, we design an efficient capsule net with a 3D/2D module.
During fingerprint matching, traditional methods measure the distance or similarity between samples with handcrafted functions, but such functions are difficult to design well for a specific problem. Metric learning can automatically learn a metric from data (Bellet et al., 2013) and has often been employed in video fingerprinting (Hu & Lu, 2018; Kordopatis-Zilos et al., 2017b). In this paper we use a triplet net and design an adaptive margin triplet loss.

Methodology
The proposed video fingerprinting has two main parts: a weight-sharing triplet net and a triplet loss function. The main branch is a 3D/2D mixed capsule net, which realizes capsule computation on video data. The triplet net takes triple inputs: an anchor sample, a positive sample and a negative sample. The triplet loss function impels the net to narrow the positive-pair distance and widen the negative-pair gap. The structure diagram is shown in Figure 1, and its details are discussed in the next sections.
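The weight-sharing idea can be sketched in a few lines: a single set of parameters maps all three inputs to fingerprint vectors. Here `embed` and the matrix `W` are hypothetical toy stand-ins for the capsule branch, used only to illustrate the shared-parameter structure:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))  # one shared parameter set (toy stand-in for the capsule branch)

def embed(x):
    """The same mapping is applied to anchor, positive and negative (weight sharing)."""
    v = x @ W
    return v / np.linalg.norm(v)   # unit-length fingerprint vector

anchor, positive, negative = (rng.standard_normal(64) for _ in range(3))
v_a, v_p, v_n = embed(anchor), embed(positive), embed(negative)
# All three branches reuse W, so only one set of parameters is trained.
assert v_a.shape == v_p.shape == v_n.shape == (32,)
```

Because the three branches are literally the same function, gradients from the triplet loss all update the one shared parameter set.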

Capsule net construction
The capsule net has shown powerful abilities in object representation and feature compression, which makes it suitable for constructing compact fingerprints. However, there is a clustering iteration inside the error backward propagation, and its computation is too heavy to act directly on 3D video data. Unlike still images, video frames carry temporal information, and extracting it exactly and representing it compactly is challenging. A natural idea is to apply 3D-DCT to the video to capture spatial-temporal information, but the difficulty is again the huge computation. We pay particular attention to this problem and design a fast capsule net for video data.
The capsule net is composed of three parts: the mixed convolution module, the primary capsule layer and the advanced capsule layer. The parameters are detailed in Table 1, where the values after each filter denote the kernel size and number of channels.
The mixed convolution module contains three convolution layers and one average pooling layer, with Tanh as the activation function for each layer. The first two layers are 3D convolutions with kernel sizes 3 × 5 × 5 and 3 × 3 × 3 to extract spatial-temporal information, followed by average pooling along the temporal dimension. The third layer is a 2D convolution with kernel size 9 × 9. In this way, the computing cost is reduced considerably. The primary capsule layer performs convolution and dimension transformation to compose the initial capsules. Eight groups of 2D convolutions with kernel size 9 × 9 and 64 channels are applied to the former output, producing eight groups of 64 × 6 × 6 feature maps; each cube is then reshaped to a 1 × 2304 vector. The corresponding elements of the eight vectors are gathered to compose 2304 capsules, so each primary capsule consists of 8 neurons in this paper.
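The regrouping of eight 64 × 6 × 6 feature-map groups into 2304 eight-neuron capsules is pure reshaping. A minimal sketch, with the shapes taken from the text and random values standing in for the conv outputs:

```python
import numpy as np

# 8 groups of 2D-conv outputs, each with 64 channels of 6 x 6 maps (shapes as in the text)
groups = np.random.randn(8, 64, 6, 6)

# Flatten each group to a 2304-vector (64 * 6 * 6 = 2304), then gather the
# corresponding element across the 8 groups: 2304 primary capsules of 8 neurons.
flat = groups.reshape(8, 64 * 6 * 6)   # shape (8, 2304)
primary_capsules = flat.T              # shape (2304, 8)

assert primary_capsules.shape == (2304, 8)
```

Capsule i is simply the i-th element of each of the eight flattened groups, stacked into one 8-dimensional vector neuron.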
Similarly to the primary capsule, the advanced capsule is composed of 32 neurons and is finally a highly compact representation of the 3D video data. Two issues are crucial in transforming an 8-dimensional primary capsule into a 32-dimensional advanced capsule. The first is an 8 × 32 transformation matrix, which increases the dimension. The second is dynamic routing, which is in essence a clustering process (Sabour et al., 2017). After the matrix transformation and dynamic routing, a 32-dimensional capsule fingerprint is obtained. The details of the dynamic routing process can be found in Sabour et al. (2017).
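The transformation-plus-routing step can be sketched in numpy. This is a simplified illustration, not the paper's exact implementation: the squash nonlinearity and agreement update follow Sabour et al. (2017), while the coupling coefficients are normalised over input capsules here (a simplification, since with a single 32-dimensional output capsule the usual softmax over output capsules is trivial), and the matrices are random:

```python
import numpy as np

def squash(s, eps=1e-9):
    """Capsule nonlinearity: short vectors shrink toward 0, long ones toward unit length."""
    sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u, W, iterations=3):
    """u: (num_in, 8) primary capsules; W: (num_in, 8, 32) transformation matrices.
    Returns one 32-d advanced capsule (single output capsule for simplicity)."""
    u_hat = np.einsum('ij,ijk->ik', u, W)    # per-capsule predictions, (num_in, 32)
    b = np.zeros(u.shape[0])                 # routing logits
    for _ in range(iterations):
        c = np.exp(b) / np.exp(b).sum()      # coupling coefficients (simplified softmax)
        s = (c[:, None] * u_hat).sum(axis=0) # weighted sum of predictions
        v = squash(s)                        # output capsule
        b = b + u_hat @ v                    # increase logits where prediction agrees with v
    return v

rng = np.random.default_rng(0)
u = rng.standard_normal((2304, 8))           # 2304 primary capsules of 8 neurons
W = rng.standard_normal((2304, 8, 32)) * 0.01
fingerprint = dynamic_routing(u, W, iterations=3)
assert fingerprint.shape == (32,)
```

The three routing iterations match the setting used later in the experiments, and the squash function guarantees the output length stays below 1.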

Triplet loss with adaptive margin
In metric learning methods, Siamese networks and triplet networks are usually employed. In a Siamese network, positive pairs and negative pairs are the inputs. Because only one pair is fed into the net at a time during training, only absolute similarity is imposed on the net. In fact, sample similarity is continuous rather than a yes-or-no matter. To solve this problem, the triplet network is fed a triple of samples each time, which introduces relative similarity. The triplet loss $T$ is generally formulated as
$$T = \frac{1}{n}\sum_{t=1}^{n}\max\left(0,\; \|v_t - v_t^{+}\|_2^2 - \|v_t - v_t^{-}\|_2^2 + \beta\right), \quad (1)$$
where $v_t, v_t^{+}, v_t^{-} \in \mathbb{R}^{l}$ are the fingerprint vectors of the anchor, positive and negative samples, $l$ is the fingerprint length, $n$ is the number of triples, and $\beta$ is a constant margin. Formula (1) requires the positive similarity to exceed the negative similarity by at least $\beta$. However, the relative similarities differ across different kinds of samples, and a hard $\beta$ leads to inefficient or even non-convergent training. In addition, when absolute similarity is neglected, the positive similarity may still be lower than the negative one, which reduces precision. Figure 2 shows the distances among triple samples.
Considering these issues, we design a new adaptive triplet loss $L$ that integrates absolute similarity, dynamic similarity and a hard threshold. Its formula is
$$L = \frac{1}{n}\sum_{t=1}^{n}\left[\max\left(0,\; S_t^{-} - S_t^{+} + \lambda\,\sigma\!\left(S_t^{-} - S_t^{+}\right)\right) + \left\|1 - S_t^{+}\right\|_2^2\right], \quad (2)$$
where $\sigma(\cdot)$ is the sigmoid function and $\lambda$ is a scale factor, set to 0.8 in this paper. $S_t^{+}$ and $S_t^{-}$ are the cosine similarities of the positive and negative pairs, respectively:
$$S_t^{+} = \frac{v_t \cdot v_t^{+}}{\|v_t\|\,\|v_t^{+}\|}, \qquad S_t^{-} = \frac{v_t \cdot v_t^{-}}{\|v_t\|\,\|v_t^{-}\|}. \quad (3)$$
In formula (2), the loss pushes the gap between $S_t^{+}$ and $S_t^{-}$ towards the adaptive margin $\lambda\,\sigma(S_t^{-} - S_t^{+})$, while the hard constraint $\|1 - S_t^{+}\|_2^2$ drives $S_t^{+}$ towards 1. Thus the absolute, relative and dynamic factors are all involved. Its effectiveness is shown in the experiment section.
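The ingredients of the adaptive loss can be sketched for a single triple. Since the exact algebraic form of formula (2) is only paraphrased in the text, the combination below is an assumption built from the stated pieces: cosine similarities, an adaptive margin $\lambda\,\sigma(S^- - S^+)$ with $\lambda = 0.8$, and the absolute term $\|1 - S^+\|^2$:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_triplet_loss(v_a, v_p, v_n, lam=0.8):
    """Assumed form of the adaptive-margin triplet loss for one triple:
    the margin grows with sigma(S- - S+), and ||1 - S+||^2 pins S+ near 1."""
    s_pos = cosine(v_a, v_p)
    s_neg = cosine(v_a, v_n)
    margin = lam * sigmoid(s_neg - s_pos)          # adaptive margin, not a fixed beta
    relative = max(0.0, s_neg - s_pos + margin)    # relative-similarity term
    absolute = (1.0 - s_pos) ** 2                  # absolute-similarity term
    return relative + absolute

rng = np.random.default_rng(1)
v = rng.standard_normal(32)
loss_easy = adaptive_triplet_loss(v, v, -v)   # well-separated triple: near-zero loss
loss_hard = adaptive_triplet_loss(v, -v, v)   # inverted triple: large loss
assert loss_easy < loss_hard
```

Note how the margin itself depends on how badly the triple is currently ordered: an already-correct triple gets a small margin, a violated one a larger push.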

The process for training and testing
There are two stages for the whole fingerprinting, training and testing. Figure 3 shows the flow chart. The details of training and testing process are listed below.
ALGORITHM 1 Training process.
Input: Training data set. Output: Network parameters.
Step 1. Video pre-treatment. Rescale the video to size 64 × 56 × 56 and transform it to YCrCb format.
Step 2. Parameter initialization. Initialize the kernel filters and transformation matrices with random data.
Step 3. Select n triples randomly from the training set and feed them to the net; compute the advanced capsule vectors through forward propagation.
Step 4. Compute the loss according to formula (2) and update the network parameters with the SGD optimizer.
Step 5. Repeat Steps 3 and 4 for 800 iterations in each epoch, until the loss no longer decreases.

Experiment setting
All the experiments are completed on a computer with the following settings: the CPU is an Intel Core i7-8700K @ 3.70 GHz (6 cores), the GPU is an NVIDIA GeForce RTX 2080 with 8 GB memory, and the system memory is 32 GB. The dataset is composed of a training set and a testing set. The training set contains 4000 videos randomly captured from FCVID (Jiang et al., 2018). The testing set contains 2001 videos, of which 1800 come from YouTube (Test set and Matlab) and 201 come from TRECVID (TRECVID).

ALGORITHM 2 Testing process.
Input: Original and query video. Output: The decision of copy or non-copy.
Step 1. Video pre-treatment. Rescale the video pair to size 64 × 56 × 56 and transform it to YCrCb format.
Step 2. Feed each video in turn to the trained net to obtain its advanced capsule vector, then quantize it with the sign function sign(·) to obtain b_o, b_q ∈ {−1, 1}^l.
Step 3. Calculate the Hamming distance between b_o and b_q.
Step 4. Make a decision according to the Hamming distance and the given threshold.
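The quantization and matching steps of the testing process (Algorithm 2, Steps 2 and 3) amount to a sign binarization followed by a Hamming-distance count; a minimal sketch with random vectors standing in for the net's outputs:

```python
import numpy as np

def quantize(v):
    """Step 2: binarize a real-valued capsule fingerprint with the sign function."""
    b = np.sign(v).astype(int)
    b[b == 0] = 1                     # map the rare exact-zero entry to +1
    return b

def hamming_distance(b_o, b_q):
    """Step 3: number of positions where the two {-1, +1} codes disagree."""
    return int(np.sum(b_o != b_q))

rng = np.random.default_rng(2)
v_original = rng.standard_normal(32)
v_query = v_original + 0.1 * rng.standard_normal(32)   # a near-copy of the original
b_o, b_q = quantize(v_original), quantize(v_query)
dist = hamming_distance(b_o, b_q)
assert 0 <= dist <= 32    # Step 4 compares dist with a chosen threshold
```

For a true copy the perturbation rarely flips signs, so `dist` stays small and falls below the decision threshold.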
Each video in the dataset is transformed with 8 kinds of distortion; the specific parameters are listed in Table 2. These distortions commonly appear in real scenarios. The videos and their copies are used to train and test the proposed network.
For the network settings, the hyperparameters are as follows: 800 iterations per epoch; batch size, momentum factor and initial learning rate of 10, 0.9 and 0.01, respectively; an advanced capsule dimension of 32; and 3 dynamic routing iterations.

Performance evaluation metrics
The Receiver Operating Characteristic (ROC) is usually used to evaluate fingerprinting performance, and in this paper we plot ROC curves to display the results. The miss probability $P_M$ is the ratio of missed true copies to all true copies, which reflects robustness. The false alarm probability $P_{FA}$ is the ratio of non-copies mistaken for copies to all non-copies, which reflects distinctiveness. For a Hamming-distance threshold $\alpha$, they are formulated as
$$P_M(\alpha) = \frac{N_{miss}(\alpha)}{N_{copy}}, \qquad P_{FA}(\alpha) = \frac{N_{false}(\alpha)}{N_{non\text{-}copy}},$$
where $\alpha \in [0, HD_{max}]$ and $HD_{max}$ is the maximum Hamming distance. The $F_1$ score gives a comprehensive evaluation:
$$F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}.$$
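Sweeping the threshold α over [0, HD_max] yields one (P_FA, P_M) point per α, which traces the ROC curve. A sketch with hypothetical Hamming distances (the two lists are illustrative, not experimental data):

```python
import numpy as np

def roc_points(copy_dists, noncopy_dists, hd_max=32):
    """P_M and P_FA for every threshold alpha in [0, HD_max]."""
    copy_dists = np.asarray(copy_dists)
    noncopy_dists = np.asarray(noncopy_dists)
    p_m, p_fa = [], []
    for alpha in range(hd_max + 1):
        p_m.append(np.mean(copy_dists > alpha))      # true copies missed at this alpha
        p_fa.append(np.mean(noncopy_dists <= alpha)) # non-copies falsely accepted
    return np.array(p_m), np.array(p_fa)

def f1_score(tp, fp, fn):
    """F1 from counts of true positives, false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical Hamming distances for copy pairs (small) and non-copy pairs (large)
p_m, p_fa = roc_points([1, 2, 3, 9], [20, 25, 28, 31])
assert p_m[0] == 1.0 and p_m[-1] == 0.0   # a strict alpha misses everything, a loose one nothing
```

Raising α trades misses for false alarms, which is exactly the robustness/distinctiveness trade-off the ROC curve displays.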

Experimental results and analysis
To demonstrate the effectiveness of the proposed fingerprinting, it is compared with traditional and deep learning methods. The traditional methods include RASH (De Roover et al., 2005), TIRI (Esmaeili et al., 2011), CGO (Lee & Yoo, 2008) and SGM (Li & Monga, 2013). The deep learning methods include CNN-L (Kordopatis-Zilos et al., 2017a), DML (Kordopatis-Zilos et al., 2017b) and CNN + LSTM (Hu & Lu, 2018). All parameters of these methods are set as in their original papers to obtain the best performance. Figure 4 shows the ROC curves of these methods under various distortions. The horizontal axis is the false alarm probability P_FA and the vertical axis is the miss probability P_M. The point (0, 0) is the ideal case, so a curve closer to this point indicates better performance. As Figure 4(a-h) shows, the red curves, which correspond to our method, are the best. There is an obvious gap between the traditional and the deep learning methods, which demonstrates that deep learning features are better than handcrafted features. From Figure 4(b,e,h), our method has an obvious advantage over CNN-L, DML and CNN + LSTM, which is owing to the mixed 3D/2D convolution capsule net and the end-to-end pattern. Table 3 shows the F1 scores of all methods. Note that the fingerprint code length of our method and SGM is 32 bits, while that of all the other methods is much higher; for the DML method it is 256 bits.
Whether on average or in total, our method obtains the highest F1 score among all methods. Compared with DML, the best of the compared methods, ours is still 0.21 and 0.25 higher, respectively, while our code length is much shorter than that of DML. This shows that our method greatly reduces the code length while enhancing robustness and distinctiveness. Distortion intensity also affects performance, so to test stability we performed extensive experiments with different distortion intensities. The results are shown in Figure 5, where only a slight variation appears as the distortion intensity varies. This proves that the proposed method is stable under different distortion intensities.
The loss function is crucial to fingerprinting performance. To verify that the designed adaptive margin is effective, experiments on the loss function were conducted. Figure 6 shows the effect of the margin: the fixed margin β in formula (1) is set to 0.2, 0.3, 0.4 and 0.5 for comparison. Among the fixed settings, the curve with margin 0.4 is the best, but the adaptive case shows obvious advantages over all of them. The computational cost of video fingerprinting is also an important indicator of algorithm performance. Taking the fingerprint extraction of a single video as the basis, we analyse the computational cost from three aspects: time complexity, space complexity and running time, comparing with the three deep learning algorithms CNN-L, DML and CNN + LSTM on the same hardware configuration. As Table 4 shows, the space complexity and time complexity of the designed 3D/2D mixed convolutional capsule network are 13.8M parameters and 7.36G floating-point operations, respectively. Compared with CNN-L, which uses the VGG16 network, and CNN + LSTM, which combines ResNet50 and LSTM as the backbone, our network has obvious advantages in parameter count and floating-point operations, which saves a lot of computing time. Compared with DML, which has the lowest computation cost, the running-time gap is small, showing that our algorithm extracts video fingerprints efficiently.

Conclusions
We presented a novel end-to-end fingerprinting method that utilizes a mixed 3D/2D convolution capsule net and an adaptive triplet loss function. The designed capsule net is applied to video data efficiently and obtains a compact representation, and the adaptive loss outperforms a hard threshold. Experiments on public datasets showed that the proposed method achieves state-of-the-art performance. However, designing a fast capsule net for 3D video remains challenging, and how to output binary codes directly is a valuable research issue.