Digital video watermarking algorithm based on asymmetric causal spatio-temporal filter

ABSTRACT Digital video is vulnerable to accidental or malicious alteration during storage, transmission, and processing, which damages the legitimate rights and interests of content owners. Digital video watermarking is a means of protecting intellectual property rights. We analyse the basic principles of the 3D-Harris algorithm and the Gabor filter method. Because video is causal in time, we discard the common symmetric temporal filter, design a causal filter that matches the characteristics of video, and propose an asymmetric causal filter in the space-time domain. On this basis, an improved video watermarking algorithm is proposed that combines space-time feature points with embedding in the DCT domain. Experimental results show that the proposed algorithm not only ensures high invisibility but also effectively resists various attacks in both the time and space domains.


Introduction
Video data captures rich information about the environment and is widely used in many visual tasks, including surveillance, content authentication, and service tracing. However, the ease of distributing and acquiring video data raises the issue of protecting intellectual property rights (IPR). Digital video watermarking is one of the most prominent and effective solutions and has been extensively investigated in recent years.
A video can be considered a set of sequential static images over time (i.e. 2 spatial dimensions + 1 temporal dimension). A distributed watermarked video is often attacked in the time and space domains simultaneously, causing a loss of temporal or spatial synchronization that frequently makes the watermark inaccurate or even impossible to detect or extract. Research on video watermarking algorithms that are robust against space-time attacks (including time-domain operations on video frames, geometric operations in the space domain, signal processing, etc.) is therefore of great significance (Heidari et al., 2017; Hosseini & Ghofrani, 2014; Tong, Chen, Zhang, & Dong, 2012).
Many interesting events in video are characterized by strong variations of the data in both the spatial and the temporal dimensions. As a result, video data can be represented by space-time interest points, which have been used in video analysis, video interpretation, etc. These space-time interest points reveal the intrinsic structure of video in both the spatial and temporal dimensions, and they can be seen as a potential solution for temporal synchronization in video watermarking while preserving the advantages of the spatial domain.
However, applying space-time interest points directly to video watermarking is not sufficient. The response function of the space-time corner method is insensitive to changes along the time dimension, so the number of detected space-time interest points is relatively small and some interest points may be missed, which is not conducive to follow-up analysis (Gaj, Rathore, Sur, & Bora, 2017; Gunjan, Mitra, & Gaur, 2012).
To improve sensitivity along the time axis, a temporal Gabor filter can be used. However, video has a temporal causal structure, while Gabor filters do not. Some researchers have proposed similar spatio-temporal video watermarking schemes, but these did not consider temporal causality either (Hongbo, Yubo, & Xueming, 2011; Li, Guo, & Pan, 2008). We therefore improve the temporal filter into an asymmetric causal filter and combine it with a Gaussian kernel filter in space to form a new spatio-temporal filter (Darabkh, 2014; Shabani, Clausi, & Zelek, 2011; Zhou et al., 2014).
The proposed space-time extraction method meets the following needs: (1) the algorithm is simple and keeps computational complexity as low as possible, given the huge amount of video data; (2) a temporal filter is introduced so that enough interest points are detected, without skipping points of slow-moving or corner-insensitive motion; (3) the improved one-dimensional temporal filter is an asymmetric causal filter, matching the causal temporal structure of real-life video.
The spatio-temporal feature points, extracted from the video itself, determine the watermark embedding locations, and DCT is used to embed the watermark. This yields a new video watermarking algorithm, and experiments verify that it has good invisibility and robustness.
The outline of the rest of the paper is as follows. In section 2, the related space-time interest point detection algorithms are reviewed. In section 3, the spatio-temporal feature point detection method based on an asymmetric causal filter is proposed. In section 4, the proposed watermarking scheme is presented, including watermark embedding and extraction. In section 5, experimental results are reported. Finally, conclusions are given in section 6.

3D-Harris spatio-temporal interest point detection
Spatio-temporal interest points were first proposed by Laptev et al., extending Harris corner detection to the 3D space-time domain (Kalra, Talwar, & Sadawarti, 2015; Laptev, Caputo, Schuldt, & Lindeberg, 2007; Liu, Wang, & Zhu, 2015). A video sequence can be regarded as a 3D space-time function. Because the video sequence has dual time and space characteristics and motion is relatively independent in time and space, the 3D Gaussian convolution kernel has two independent variances, a temporal variance and a spatial variance, corresponding to the time scale τ_l and the spatial scale σ_l respectively. Let f denote the video sequence; its linear scale-space representation L is

L(·; σ_l², τ_l²) = g(·; σ_l², τ_l²) * f(·), (1)

where '·' represents the 3D coordinates (x, y, t) and the 3D Gaussian convolution kernel is defined as

g(x, y, t; σ_l², τ_l²) = exp(−(x² + y²)/(2σ_l²) − t²/(2τ_l²)) / √((2π)³ σ_l⁴ τ_l²). (2)

To detect interest points, the 3D response function is defined as

H = det(μ) − k · trace³(μ), (3)

where μ is the second-moment matrix

μ = g(·; σ_i², τ_i²) * [ L_x²    L_xL_y  L_xL_t ;
                         L_xL_y  L_y²    L_yL_t ;
                         L_xL_t  L_yL_t  L_t²   ]. (4)

Here σ_i² is the integration spatial scale and τ_i² is the integration time scale, while σ_l² and τ_l² are the local spatial and temporal scales; all of them have a smoothing effect. L_x, L_y, L_t are the first-order partial derivatives along the x-, y- and t-axes, defined as L_x = ∂_x(g * f), L_y = ∂_y(g * f), L_t = ∂_t(g * f). The local maxima of the response H are the corner points, and the value of the parameter k affects the result.
Let λ_1, λ_2, λ_3 be the three eigenvalues of μ, and define α = λ_2/λ_1 and β = λ_3/λ_1. The response value can then be written as

H = λ_1λ_2λ_3 − k(λ_1 + λ_2 + λ_3)³ = λ_1³(αβ − k(1 + α + β)³). (5)

Since positive local maxima of the response are selected, requiring H > 0 gives k ≤ αβ/(1 + α + β)³. This bound is largest when α = β = 1, where it takes the value 1/27. The corner-based spatio-temporal interest point detector has limitations: it is sensitive to reciprocating movement such as walking, running and waving, but for ordinary rotational movement it is much less sensitive and often fails to detect space-time interest points. In addition, the spatio-temporal corner method is equally insensitive to very slow-moving behaviour (Nazari, Sharif, & Mollaeefar, 2017; Nyeem, Boles, & Boyd, 2014). Furthermore, because the response function of the corner-based detector is not sensitive to changes along the time dimension, few space-time interest points are detected on some videos. To address these problems, Dollar proposed a linear filter detection method (Nyeem et al., 2014; Ridzoň & Levický, 2013; Tewari, Saxena, & Gupta, 2014).
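As a concrete reference, the 3D-Harris response described above can be sketched with NumPy/SciPy. This is an illustrative sketch, not the paper's implementation: the function name `harris3d_response`, the integration-to-local scale ratio `s` and the default `k = 0.005` are our choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris3d_response(f, sigma_l=1.5, tau_l=1.5, s=2.0, k=0.005):
    """H = det(mu) - k*trace(mu)^3 over a video volume f[t, y, x].

    sigma_l/tau_l are the local spatial/temporal scales; the
    integration scales are s*sigma_l and s*tau_l.
    """
    f = f.astype(np.float64)
    # Scale-space smoothing: per-axis Gaussian (tau along the t axis).
    L = gaussian_filter(f, sigma=(tau_l, sigma_l, sigma_l))
    Lt, Ly, Lx = np.gradient(L)
    # Entries of the second-moment matrix, averaged at the integration scale.
    avg = lambda a: gaussian_filter(a, sigma=(s * tau_l, s * sigma_l, s * sigma_l))
    mxx, myy, mtt = avg(Lx * Lx), avg(Ly * Ly), avg(Lt * Lt)
    mxy, mxt, myt = avg(Lx * Ly), avg(Lx * Lt), avg(Ly * Lt)
    det = (mxx * (myy * mtt - myt**2)
           - mxy * (mxy * mtt - myt * mxt)
           + mxt * (mxy * myt - myy * mxt))
    trace = mxx + myy + mtt
    return det - k * trace**3
```

Interest points would then be taken as the positive local maxima of the returned volume.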

Spatio-temporal interest point detection with Dollar's linear filter
Dollar's linear filter response function employs a Gaussian filter in the spatial domain and a one-dimensional Gabor filter in the time domain. Compared with the space-time corner method, this algorithm detects enough interest points and delineates the human motion region better, although some false positives remain among the detected spatio-temporal interest points. The response function is

R = (f * g_0 * h_e)² + (f * g_0 * h_o)², (6)

where f is the pixel value in two-dimensional space; R is the response value of the point; g_0 is the two-dimensional Gaussian smoothing kernel in the space domain,

g_0(x, y; σ) = exp(−(x² + y²)/(2σ²)) / (2πσ²);

and h_e and h_o are a pair of mutually orthogonal functions forming a one-dimensional Gabor filter acting in the time domain:

h_e(t; τ, w) = −cos(2πtw) e^(−t²/τ²), h_o(t; τ, w) = −sin(2πtw) e^(−t²/τ²), w = 4/τ. (7)

The parameters σ and τ are the detection scales of the response function in the spatial and temporal domains, respectively. The spatio-temporal interest points are the local maxima of the response value R.
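A minimal sketch of this response computation, assuming a video volume indexed as `f[t, y, x]`; the kernel truncation at ±3τ and the helper name `dollar_response` are our choices:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def dollar_response(f, sigma=2.0, tau=2.0):
    """R = (f*g0*h_e)^2 + (f*g0*h_o)^2 for a video f[t, y, x]."""
    f = f.astype(np.float64)
    # Spatial 2D Gaussian smoothing, applied per frame (axes 1 and 2).
    g = gaussian_filter(f, sigma=(0, sigma, sigma))
    # 1D Gabor quadrature pair along the time axis, with w = 4/tau.
    t = np.arange(-int(3 * tau), int(3 * tau) + 1)
    w = 4.0 / tau
    env = np.exp(-t**2 / tau**2)
    h_e = -np.cos(2 * np.pi * t * w) * env
    h_o = -np.sin(2 * np.pi * t * w) * env
    return convolve1d(g, h_e, axis=0)**2 + convolve1d(g, h_o, axis=0)**2
```

Squaring the quadrature pair makes the response phase-insensitive, which is the property the asymmetric filter in the next section also preserves.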

The method of spatio-temporal features point detection of asymmetric causal filter
A digital image can be viewed as a graph network in which each pixel x is a node connected to its neighbouring pixels by a resistance R, the brightness u at each pixel is the charge on a capacitor C, and the flux of brightness between two neighbouring pixels depends on their relative brightness (Nazari et al., 2017). We extend this model to a 3D space-time video signal by letting the potential of the corresponding pixel in the immediately preceding frame influence the potential evolution at the current frame. This satisfies temporal causality and is modelled as an external resistance R_ext. For simplicity, we show this modelling for a 1D + t signal in Figure 1 and derive the equations for this circuit. From Kirchhoff's laws, we derive a diffusion equation describing the change of brightness with a (time-like) diffusion scale s. The potential at a pixel is denoted by u(x, t; s) and the current by i(x, t; s).
The branch currents and the external (causal) current are

i(x, t; s) = (u(x − dx, t; s) − u(x, t; s))/R_1, (7)

i(x + dx, t; s) = (u(x, t; s) − u(x + dx, t; s))/R_2, (8)

I_ext = (u(x, t; s) − u(x, t − dt; s))/R_ext, (9)

and Kirchhoff's current law (KCL) at the node gives

C ∂u(x, t; s)/∂s = i(x, t; s) − i(x + dx, t; s) − I_ext. (10)

Take R_1 = R_2 = R and the per-unit quantities R = r dx, R_ext = r_ext dt, C = c dx and I_ext = i_ext(x, t; s) dx. Substituting Equations (7)-(9) into the KCL Equation (10) and taking the continuum limit yields the causal diffusion equation

∂u(x, t; s)/∂s = (1/(rc)) ∂²u(x, t; s)/∂x² − (1/(r_ext c)) ∂u(x, t; s)/∂t. (11)

The filter in this paper is composed of an asymmetric causal temporal filter and a Gaussian kernel spatial filter. The spatial filter is a Gaussian with kernel G and standard deviation σ = √(2αs). The temporal filter is composed of a sinc function and the step function S(t), and its time scale is τ = βs.
S(t) denotes the Heaviside step function. The spatial filter G and the temporal filter k are

G(x, y; σ) = exp(−(x² + y²)/(2σ²)) / (2πσ²), (12)

k(t; τ) = sinc(t/τ) S(t). (13)

To obtain a phase-insensitive response, we need a pair of orthogonal kernels that capture all the phase information. We therefore build a kernel k_h orthogonal to k, obtained by convolving k with h(t):

k_h(t; τ) = k(t; τ) * h(t). (14)

To obtain the spatio-temporal interest points, the filter response R represents the space-time feature response value, whose local maxima are the sought space-time feature points:

R = (u_0 * G * k)² + (u_0 * G * k_h)², (15)

where u_0 represents the original video sequence values.
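Assuming the temporal kernel is the sinc-times-step product given above, the response computation might be sketched as follows. Obtaining the orthogonal (quadrature) kernel via `scipy.signal.hilbert` is our substitute for the explicit convolution with h(t), and the function name, scale parameters and kernel truncation are illustrative:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d
from scipy.signal import hilbert

def causal_response(u0, s=2.0, alpha=1.0, beta=1.0):
    """Phase-insensitive response with an asymmetric causal temporal kernel.

    Spatial filter: Gaussian with sigma = sqrt(2*alpha*s).
    Temporal filter: truncated sinc times the Heaviside step S(t)
    (zero for t < 0, so only past/present frames contribute),
    with time scale tau = beta*s.
    """
    u0 = u0.astype(np.float64)
    sigma = np.sqrt(2 * alpha * s)
    tau = beta * s
    g = gaussian_filter(u0, sigma=(0, sigma, sigma))   # per-frame smoothing
    t = np.arange(-int(3 * tau), int(3 * tau) + 1)
    step = (t >= 0).astype(float)                      # Heaviside S(t)
    k = np.sinc(t / tau) * step                        # causal, asymmetric
    k_h = np.imag(hilbert(k))                          # quadrature kernel
    return convolve1d(g, k, axis=0)**2 + convolve1d(g, k_h, axis=0)**2
```

The one-sided support of `k` is what distinguishes this filter from the symmetric Gabor pair of the previous section.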

Video watermarking algorithm based on space-time feature points
The space-time feature point detection method based on the asymmetric filter is used to extract feature points, which overcomes the insensitivity of Harris corner detection to certain motions. Like the linear-filter space-time interest detection method, it collects enough interest points, while being more consistent with video characteristics because of the asymmetric temporal filter (Heidari et al., 2017). The execution of the proposed scheme is described in sections 4.1, 4.2 and 4.3, respectively.

Extraction of feature points and selection of feature regions
Extracting feature points with the asymmetric-filter space-time detection method yields enough interest points, together with their space-time coordinates and corresponding response values (Laptev et al., 2007; Liu et al., 2015). Because this method finds many interest points, and too many interest points increase the computational burden, only a few with large response values are selected from all feature points of the video. Figure 2 shows the 10th frame and Figures 3-5 show the 16th frame, marking the feature points with large response values. Since the interest points with large response values are selected over the whole video rather than per frame, the number of interest points in each frame is uncertain; a frame may contain several or none.
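The selection of the few strongest interest points over the whole volume can be sketched as follows; the 3×3×3 local-maximum window and the name `top_k_interest_points` are our choices:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def top_k_interest_points(R, k=50):
    """Pick the k strongest local maxima of a response volume R[t, y, x]."""
    is_max = (R == maximum_filter(R, size=3))   # 3x3x3 local maxima
    cand = np.argwhere(is_max)                  # (t, y, x) coordinates
    vals = R[is_max]
    order = np.argsort(vals)[::-1][:k]          # strongest responses first
    return cand[order], vals[order]
```

Because the top-k cut is global, some frames may contribute several points and others none, matching the behaviour described above.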
For each selected interest point, we construct a feature region in which to embed the watermark. Creating a feature region requires a certain area, and points near the frame border do not have enough surrounding area to embed the watermark, so they are discarded. If the feature regions created by feature points within a frame overlap, the region with the larger response value is kept.

Embedding the watermark
The watermark is embedded repeatedly in each generated feature region.
(1) Transformation into the DCT domain. Suppose the watermark is a binary sequence m, m_i ∈ {0, 1}. Each region is first transformed into the DCT domain. The DCT coefficients are then reordered in ZigZag scanning order to form a sequence running from low frequency to high frequency. The mid-frequency coefficients are selected using a private key and segmented into non-overlapping blocks, each forming a new matrix denoted F.
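A sketch of this step, assuming square feature regions; the mid-band fractions are illustrative, and the private-key selection of coefficients is omitted:

```python
import numpy as np
from scipy.fft import dct

def zigzag_indices(n):
    """(row, col) index pairs of an n x n block in ZigZag scan order."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))

def midband_coeffs(region, lo_frac=0.25, hi_frac=0.75):
    """2D-DCT a feature region and return its mid-frequency coefficients."""
    n = region.shape[0]
    # Separable 2D DCT (type II, orthonormal).
    D = dct(dct(region, axis=0, norm='ortho'), axis=1, norm='ortho')
    seq = np.array([D[r, c] for r, c in zigzag_indices(n)])  # low -> high freq
    lo, hi = int(lo_frac * len(seq)), int(hi_frac * len(seq))
    return seq[lo:hi]
```

The inverse path (re-inserting the modified mid-band values, inverse ZigZag, inverse DCT) mirrors these operations.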
(2) SVD. Each new matrix F is decomposed using singular value decomposition (SVD):

F = U B V^H, B = diag(λ_1, λ_2, ..., λ_r), (16)

where B is a non-negative diagonal matrix, r is the rank of F, λ_i = √σ_i is the i-th singular value (SV) and σ_i is the corresponding eigenvalue. The SVs of an image represent its energy distribution, which is not easily changed, and the SVs of a matrix are invariant under rotation and scaling. The SVs thus reflect intrinsic characteristics of the image, and the watermark embedding position can be determined from these intrinsic features, which benefits the visual effect and makes the watermark better concealed. The singular values in B decrease with the row index, so the SV in the first row can be chosen for the watermark embedding scheme. The rotation and scaling invariance of the singular values also benefits the robustness of the watermark.
(3) Watermark embedding. We compute the Euclidean norm N of the SVs:

N = √(λ_1² + λ_2² + ... + λ_r²). (17)

The SVs are then modified (Hongbo et al., 2011). Let ε be a suitable quantization step selected experimentally according to the HVS, let Q = rem(N, ε), where rem denotes the remainder operation, and let m be the watermark bit. The norm is changed according to

N′ = N − Q + 3ε/4 if m = 1, N′ = N − Q + ε/4 if m = 0. (18)

The largest SV B(1, 1) is adjusted so that the norm takes the new value N′, yielding a new matrix B_m. Applying the inverse SVD with the unmodified U and V^H gives F′ = U B_m V^H. At this point, watermark embedding within one feature-region block is complete.
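Since the modification in (18) is a quantization rule, the embedding and its blind extraction can be sketched with a standard quantization-index-modulation scheme. Rescaling all SVs so that their norm reaches the target value is our simplification of modifying B(1, 1), and ε = 40 is an arbitrary illustrative step:

```python
import numpy as np

def embed_bit(F, m, eps=40.0):
    """Quantize the Euclidean norm of F's singular values to encode bit m.

    The norm N is moved to the centre of the nearest quantization cell
    associated with m (a standard QIM rule).
    """
    U, sv, Vh = np.linalg.svd(F, full_matrices=False)
    N = np.linalg.norm(sv)
    Q = N % eps                                   # rem(N, eps)
    N_new = N - Q + (0.75 if m else 0.25) * eps   # cell centre for bit m
    sv_new = sv * (N_new / N)                     # rescale SVs to the new norm
    return U @ np.diag(sv_new) @ Vh

def extract_bit(F_marked, eps=40.0):
    """Blind extraction: recover the bit from the SV-norm remainder."""
    sv = np.linalg.svd(F_marked, compute_uv=False)
    Q = np.linalg.norm(sv) % eps
    return 1 if Q > eps / 2 else 0
```

The extractor needs only ε (and the private key for coefficient selection), so no original video is required.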
For each non-overlapping block, repeat steps (2) and (3) until the whole watermark is embedded in the host frame. Then perform the inverse ZigZag scan and the inverse DCT, recombining with the low- and high-frequency coefficients, to obtain the watermarked frame.
Repeating this for each frame containing a STIP yields the watermarked video.

Watermark extraction and detection
Extracting the watermark is simple, and the procedure is blind. The watermarked video undergoes the same domain transformation and then singular value decomposition. The specific steps are: (1) find the frames containing feature points and locate their feature regions; (2) transform each region into the DCT domain, perform the ZigZag scan and segment the selected coefficients into blocks; (3) compute the Euclidean norm of the SVs of each block and its remainder; (4) recover the watermark bit from the remainder.

Experimental results and analysis
To assess the performance of the proposed algorithm, a variety of experiments were carried out. The block size for embedding is 80 × 80, the length of the watermark is 128 bits, and the detection threshold is 86 bits; as a result, the false alarm probability is 3 × 10⁻⁶.

Invisibility test
According to the algorithm, we compare the host 10th frame with the watermarked 10th frame. The naked eye can hardly distinguish any watermark traces. The objective evaluation index PSNR is nearly 31.4 dB, so the visual invisibility is good (Figures 6 and 7). Table 1 lists the results after various attacks on the watermarked frames.

Robustness test
In Table 1, the denominator and numerator correspond to the number of detected feature regions and the number of matched feature regions in which the presence of the watermark can be successfully determined after each attack, respectively, and '×' denotes that the attack is not covered by the corresponding scheme. Compared with Li's and Bi's schemes, our scheme is more robust against a variety of attacks, especially in the time domain (Hongbo et al., 2011; Li et al., 2008). The attack experiments verify the robustness of the proposed algorithm and show that the watermark is highly stable.

Conclusion
The 3D-Harris algorithm is fast, but it is insensitive to the time axis and may miss some slow-moving corner points, resulting in fewer detected feature points, which hinders analysis. The Gabor-based algorithm introduces a temporal filter and detects enough points, but its temporal filter does not match real video. To address this shortcoming, an asymmetric causal temporal filter is introduced, and a novel video watermarking technique is proposed by combining the spatio-temporal point extraction method with watermarking in the DCT domain. Experiments on the video watermarking algorithm based on the improved Gabor filter show that it has good invisibility and robustness.

Disclosure statement
No potential conflict of interest was reported by the authors.