A Privacy-preserving Image Retrieval Method Based on Improved BoVW Model in Cloud Environment

ABSTRACT With the rapid development of cloud computing technology, more and more users choose to outsource image data to the cloud. To protect users' privacy and guarantee data confidentiality, images need to be encrypted before being outsourced to the cloud service provider (CSP), but this brings new difficulties to some basic yet important data services, such as content-based image retrieval (CBIR). In this paper, a privacy-preserving image retrieval method based on an improved BoVW model is proposed. The improved BoVW method, based on Hamming embedding, provides binary signatures that refine the matching based on visual words, which improves retrieval precision significantly. Orthogonal transformation is used to implement privacy-preserving image retrieval: image features are divided into two different fields by orthogonal decomposition, encryption and distance comparison are executed on these fields separately, and the two kinds of operation results are fused in the final vector by orthogonal composition. As a result, the cloud server can extract components from encrypted features directly and compare their distance with those of the query image without violating the privacy of images and features. Any algorithm can be used to encrypt features, which enhances the practicability of the proposed method. The security analysis and experimental results demonstrate its security and retrieval performance.


INTRODUCTION
Along with the development of cloud computing technology, public cloud services can provide virtually unlimited storage space and computing ability for massive multimedia data. At the same time, this brings serious security problems, because data owners lose physical control of their data while the cloud service provider (CSP) is considered "honest but curious", which may lead to unauthorized access to data and leakage of personal privacy [1][2][3]. Under such circumstances, data must be encrypted before being outsourced to the CSP to protect its confidentiality. Although data security can be guaranteed by encryption, encryption also brings many new challenges to data management and data sharing services, such as image retrieval. Content-based image retrieval (CBIR) is a very promising method in the image retrieval field; it is characterized by extracting image features and comparing the distance between features automatically, and it has grown rapidly and made progress in both the derivation of new features and the construction of signatures based on these features [4][5][6]. However, conventional CBIR methods cannot be applied in a cloud environment directly, because encrypted data fail to preserve the distance between feature vectors. If users want to retrieve images from the CSP, the encrypted images need to be decrypted first so that retrieval can operate on plaintext, which exposes sensitive information to attackers, breaks privacy, and hence is not desired. Therefore, it is important to develop technologies for CBIR over the encrypted domain, which is also called private/privacy-preserving CBIR (PCBIR) [7][8][9].
Many advances have been achieved in image retrieval over the encrypted domain. Early technology mainly originated from retrieval on text documents. For example, Song et al. [10] proposed a ciphertext scanning method based on a stream cipher to determine whether the search term exists in the ciphertext; Boneh et al. [11] proposed a keyword search method based on public-key encryption; Swaminathan et al. [12] securely ranked documents and extracted the most relevant ones from an encrypted collection; Cao et al. [13] proposed privacy-preserving multi-keyword ranked search over encrypted data. Although these methods can be extended to image retrieval based on user-assigned tags, extension to CBIR is not straightforward. CBIR typically relies on comparing the distance of image features, but comparing similarity among high-dimensional vectors using cryptographic primitives is challenging.
Recently, several methods have been proposed to solve the problem of PCBIR. Lu et al. [14] proposed three distance-preserving methods: bit-plane randomization, random projection, and randomized unary encoding, which are applied to low-level features such as color histograms. Karthik et al. [15] proposed a transparent privacy-preserving hash method, which keeps the statistical rules of encrypted AC coefficients but ignores the spatial information distribution of the image. Xu et al. [16] proposed a secure retrieval method for JPEG images, which preserves the distribution of the AC coefficients and the statistical rule of color after the decoding of encrypted images. Ferreira et al. [17] proposed a method in which color information is encrypted by deterministic encryption techniques to support color-feature-based CBIR, and texture information is encrypted by probabilistic encryption algorithms for better security. Zhang et al. [18] used the Paillier encryption algorithm to protect lower-level features such as color, texture, and shape, and achieved secure retrieval. Some local-feature-based PCBIR methods have also been proposed. Hsu et al. [19] put forward a homomorphic-encryption-based secure SIFT method, which has good security at the cost of serious ciphertext expansion. Xia et al. [20] proposed to use SIFT features and to transform the earth mover's distance (EMD) in such a way that the cloud server can evaluate the similarity between images without learning sensitive information. Huang et al. [21] converted high-dimensional VLAD descriptors to compact binary codes and then adapted asymmetric scalar-product-preserving encryption to design a PCBIR scheme that meets the privacy requirements of the cloud environment. Compared with global-feature approaches, local-feature-based PCBIR methods achieve higher retrieval accuracy, but they require quite complex methods to implement distance preservation, which makes them unsuitable for large-scale image retrieval in a cloud environment.
It should be noted that the above methods all rely on specific encryption methods, such as shuffling and homomorphic encryption, which limits their universality. For example, homomorphic encryption is too computation- and communication-intensive to be used in low-profile devices and large-scale systems, while shuffling is not suitable for situations with high security requirements. Besides, these methods all rely on global features or local features to evaluate the similarity of images, and their retrieval accuracy can hardly meet the requirements of practical applications because of the "semantic gap" between the visual features and the richness of human semantics [22]. To solve these problems, we propose a privacy-preserving image retrieval method for the cloud environment, in which an improved BoVW model is employed to improve retrieval precision and orthogonal transformation is used to implement privacy-preserving retrieval. The improved BoVW method, based on Hamming embedding, provides binary signatures that refine the matching based on visual words, which improves retrieval precision significantly. Orthogonal transformation is used to divide features into two different domains, on which encryption and distance comparison are executed separately. As a result, the cloud server can extract components from encrypted features directly and compare their distance with those of the query image without violating the privacy of images and features. Because the encryption operation and the distance comparison operation are independent, any algorithm can be used to encrypt features, which enhances the practicability of the proposed method. The experimental results demonstrate its effectiveness and security.
The organization of this paper is as follows: Section 2 introduces the system architecture and preliminaries; Section 3 proposes our scheme; Section 4 provides experimental results and performance analysis; and Section 5 presents conclusions.

System Architecture
The system model used in this paper is given in Figure 1.
There are three entities involved in this model: the content owner, the CSP, and the user. The content owner trains visual dictionaries, extracts features from images, and constructs the search index, which is then encrypted and outsourced to the CSP together with the encrypted images. The CSP stores the cipher-images and the secure index and performs image retrieval when it receives a user request. Users send retrieval requests to the content owner and the CSP, decrypt the cipher-images returned by the CSP, and obtain the requested images.

BoVW Algorithm
The bag of visual words (BoVW) model was extended from the natural language processing and information retrieval fields to computer vision in [23]. Descriptors are quantized into visual words with the k-means algorithm, and an image can then be represented by the frequency histogram of visual words obtained by assigning each descriptor of the image to the closest visual word. Fast access to the frequency vectors is achieved by an inverted file system. BoVW has been successfully adopted to enable fast indexing and retrieval of large image collections [24]. The retrieval process consists of four steps: the image content is described by means of a set of visual descriptors such as SIFT and SURF; the descriptors are clustered into visual words, which form a vocabulary; the descriptors are compared and assigned to one or more visual words so as to map the image into a histogram of visual word frequencies; and the images with the closest histogram distance are returned as retrieval results.
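The quantization and histogram steps described above can be illustrated with a minimal sketch. It assumes that the visual dictionary is already available as a list of centroids and that descriptors are plain lists of floats; the function names `nearest_word`, `bovw_histogram`, and `histogram_distance` are hypothetical and are not taken from the paper.

```python
import math

def nearest_word(descriptor, centroids):
    """Index of the centroid (visual word) closest to the descriptor."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((d - c) ** 2 for d, c in zip(descriptor, centroids[k])),
    )

def bovw_histogram(descriptors, centroids):
    """Normalized frequency histogram of visual words for one image."""
    hist = [0] * len(centroids)
    for d in descriptors:
        hist[nearest_word(d, centroids)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist

def histogram_distance(h1, h2):
    """Euclidean distance between two histograms (one possible choice)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

# Toy dictionary with two visual words and four 2-D descriptors.
centroids = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [9.0, 9.0], [0.0, 2.0], [1.0, 1.0]]
histogram = bovw_histogram(descriptors, centroids)
```

Real SIFT descriptors are 128-dimensional and the dictionary is obtained by k-means; the toy values here only show the data flow.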
Although BoVW has shown good performance in image retrieval tasks, it still suffers from several problems, such as the insufficient discriminative power of visual words, the quantization error caused by assigning descriptors to visual words, and the low efficiency caused by comparing distances between high-dimensional vectors. Many methods have been proposed to improve the performance of BoVW, such as [25,26]. Among these methods, [27] achieves higher accuracy for large-scale image search based on Hamming embedding (HE). The main idea of HE is to select a low number of centroids $k$ in k-means clustering to form a coarse visual dictionary and to refine the quantized index $q(x_i)$ (the index of the centroid closest to the descriptor $x_i$) with a $d_b$-dimensional binary signature. The Euclidean distance between two descriptors is thus mapped to a Hamming distance, and mismatched points can be removed by setting a proper threshold, which improves retrieval performance. This method includes four steps:
(1) Assign each descriptor $x$ to its closest centroid, resulting in $q(x)$; project $x$ by means of a $d_b \times d$ orthogonal projection matrix $P$, producing a vector $z = Px = [z_1, \ldots, z_{d_b}]^T$; compute the median values $\tau_{q(x),i}$ of all projected vectors that belong to the same centroid.
(2) Compute the binary signature $b(x) = (b_1(x), \ldots, b_{d_b}(x))$ by comparing each component of $z$ with the corresponding median:
$$b_i(x) = \begin{cases} 1 & \text{if } z_i > \tau_{q(x),i} \\ 0 & \text{otherwise} \end{cases}$$
(3) Use the HE matching function to match two descriptors $x$ and $y$:
$$f_{HE}(x, y) = \begin{cases} \text{tf-idf}(q(x)) & \text{if } q(x) = q(y) \text{ and } h(b(x), b(y)) \le h_t \\ 0 & \text{otherwise} \end{cases}$$
where tf-idf() weights the visual words according to their frequency, and $h()$ is the Hamming distance defined as follows:
$$h(b(x), b(y)) = \sum_{i=1}^{d_b} |b_i(x) - b_i(y)|$$
$h_t$ is a fixed Hamming threshold such that $0 \le h_t \le d_b$.
(4) Given a query image represented by its local descriptors $y_{i'}$ and a set of database images $j = 1, \ldots, n$ represented by their local descriptors $x_{i,j}$, a voting system is used to evaluate the similarity between the query image and each database image.
The image score $s_j^*$ is given as follows:
$$s_j^* = g_j\left(\sum_{i'=1}^{m} \sum_{i=1}^{m_j} f_{HE}(x_{i,j}, y_{i'})\right)$$
where $g_j()$ is a post-processing function, $m_j$ is the number of local descriptors of the $j$-th database image, and $m$ is the number of local descriptors of the query image. The score reflects the number of matches between the query and each database image.
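The signature generation, matching function, and voting score of HE can be sketched as follows. This is an illustrative sketch only: the post-processing function $g_j$ is taken as the identity, the projection matrix `P` is a toy identity matrix rather than a random orthogonal projection, and all function names and values are hypothetical.

```python
def project(P, x):
    """z = P x, with P given as a list of rows."""
    return [sum(p * v for p, v in zip(row, x)) for row in P]

def binary_signature(z, tau):
    """b_i = 1 if z_i > tau_i, else 0 (tau: medians of the assigned centroid)."""
    return tuple(1 if z_i > t_i else 0 for z_i, t_i in zip(z, tau))

def hamming(b_x, b_y):
    """Number of positions in which two binary signatures differ."""
    return sum(u != v for u, v in zip(b_x, b_y))

def he_match(q_x, b_x, q_y, b_y, h_t, tfidf):
    """Non-zero only for the same visual word and signatures within h_t."""
    if q_x == q_y and hamming(b_x, b_y) <= h_t:
        return tfidf[q_x]
    return 0.0

def image_score(query_sigs, image_sigs, h_t, tfidf):
    """Sum of matches over all descriptor pairs (g_j assumed to be identity)."""
    return sum(
        he_match(q_x, b_x, q_y, b_y, h_t, tfidf)
        for q_y, b_y in query_sigs
        for q_x, b_x in image_sigs
    )

# Toy setup: d = d_b = 4, one visual word (index 0) with all medians at 0.
P = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
tau = {0: [0.0, 0.0, 0.0, 0.0]}
tfidf = {0: 1.5}  # hypothetical weight

b_x = binary_signature(project(P, [0.5, -1.0, 2.0, -0.1]), tau[0])
b_y = binary_signature(project(P, [0.4, -0.9, -0.3, -0.2]), tau[0])
score = image_score([(0, b_y)], [(0, b_x)], h_t=1, tfidf=tfidf)
```

With these toy values the two signatures differ in one bit, so the pair contributes to the score for $h_t = 1$ and would be rejected for $h_t = 0$.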

Distance-Preserving Method Based on Orthogonal Transformation
Orthogonal transformation is a kind of vector representation method that consists of two processes: orthogonal decomposition and orthogonal composition. Any vector can be expressed as a sum of a set of component coefficients by orthogonal decomposition, and component coefficients are fused into one composite vector by orthogonal composition.
Suppose a vector $X = (x_1, x_2, \ldots, x_n)^T$ and an orthogonal matrix $B = (b_1, b_2, \ldots, b_n)$; then $X$ can be expressed as follows:
$$X = \sum_{i=1}^{n} y_i b_i = BY$$
The coefficient vector $Y$ can be expressed as follows:
$$Y = (y_1, y_2, \ldots, y_n)^T = B^T X$$
Divide $B$ by columns into two sub-matrices $R$ and $S$, and $Y$ into $Y_1$ and $Y_2$ correspondingly, and we can obtain the following:
$$X = RY_1 + SY_2$$
If encryption $E(\cdot)$ is operated on $Y_1$ and feature extraction $F(\cdot)$ is operated on $Y_2$, then we can obtain the following:
$$X_{ef} = R\,E(Y_1) + S\,F(Y_2)$$
The two operations are independent and do not interfere with each other because of the independence of the orthogonal decomposition, while the two kinds of operation results are fused in the final vector $X_{ef}$ by the orthogonal composition. We can also obtain the following:
$$R^T X_{ef} = E(Y_1), \qquad S^T X_{ef} = F(Y_2)$$
From these equations, we can see that features can be extracted from encrypted vectors directly, and the distance between the component vectors is similar to the one between the original vectors because of the distance-preserving characteristic of the orthogonal transformation.
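A small numerical sketch can make the decomposition and composition concrete. It is written under several assumptions that are not taken from the paper: the orthonormal matrix is built by Gram-Schmidt on Gaussian vectors, the feature operation on the second field is the identity, and the encryption of the first field is replaced by an additive mask that serves purely as a placeholder (it is not AES and offers no security). All names are hypothetical.

```python
import math
import random

def gaussian_orthonormal(n, rng):
    """n orthonormal column vectors from Gram-Schmidt on Gaussian samples."""
    basis = []
    while len(basis) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for b in basis:
            dot = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - dot * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > 1e-9:
            basis.append([vi / norm for vi in v])
    return basis

def decompose(columns, x):
    """Coefficients of x on the given orthonormal columns (C^T x)."""
    return [sum(bi * xi for bi, xi in zip(b, x)) for b in columns]

def compose(columns, coeffs):
    """Linear combination of the columns with the given coefficients."""
    n = len(columns[0])
    return [sum(c * b[k] for c, b in zip(coeffs, columns)) for k in range(n)]

def protect(x, R, S, mask):
    """Mask the R-field (placeholder for encryption), keep the S-field."""
    y1, y2 = decompose(R, x), decompose(S, x)
    y1_enc = [a + k for a, k in zip(y1, mask)]  # placeholder, NOT secure
    return [r + s for r, s in zip(compose(R, y1_enc), compose(S, y2))]

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

rng = random.Random(0)
n, m = 8, 3                      # toy sizes: R has m columns, S has n - m
B = gaussian_orthonormal(n, rng)
R, S = B[:m], B[m:]
x1 = [rng.random() for _ in range(n)]
x2 = [rng.random() for _ in range(n)]
x1_ef = protect(x1, R, S, [rng.uniform(-1.0, 1.0) for _ in range(m)])
x2_ef = protect(x2, R, S, [rng.uniform(-1.0, 1.0) for _ in range(m)])

# Whoever knows S can compare the S-fields without touching the masked R-field.
plain = distance(decompose(S, x1), decompose(S, x2))
cipher = distance(decompose(S, x1_ef), decompose(S, x2_ef))
```

In this sketch `plain` and `cipher` agree up to floating-point error, because the columns of R and S are mutually orthogonal and the mask therefore never leaks into the S-field.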

THE PROPOSED METHOD
Based on the algorithms introduced in Section 2, we propose a privacy-preserving CBIR method that combines an improved BoVW model with orthogonal transformation.
The specific retrieval process is given as follows:

Configuration Stage
In the configuration stage, the visual dictionary, the encryption/decryption keys, and the Gaussian orthonormal matrices are constructed for later use.
• (VD) ← VDGen(TB): The content owner generates the visual dictionary VD from the training image dataset TB in two steps. First, SIFT descriptors are extracted from TB; they are widely used due to their distinctiveness and robust matching across a substantial range of affine distortion, addition of noise, and change in illumination. Second, the descriptors are quantized into visual words with the k-means clustering method to form the visual dictionary.
• ($K_I$, $K_J$) ← KeyGen($1^\alpha$): The encryption keys $K_I$ and $K_J$ are generated with the KeyGen() function, where $\alpha$ is a security parameter used to generate the keys.
• (P, B) ← MatrixGen(): Random Gaussian orthonormal matrices $P$ and $B$ are generated, and the sub-matrices $R$ and $S$ are selected from $B$ according to the user's practical needs for security or retrieval performance.
• ($\tau_{q(x_j)}$) ← MedianMatrixGen(P, $q(x_j)$, $x_j$): The descriptors $x_j$ from VD are assigned to their closest centroid $q(x_j)$ and projected by $P$ to $z = Px_j = [z_1, \ldots, z_{d_b}]^T$; the median value $\tau_{q(x_j)}$ of all projected vectors that belong to the same centroid $q(x_j)$ is then generated.
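The median computation of the configuration stage can be sketched as follows, assuming that the descriptors have already been projected by P and assigned to centroids. The function name `median_matrix` and the toy values are hypothetical.

```python
from collections import defaultdict
from statistics import median

def median_matrix(projected, assignments):
    """Per-centroid, per-dimension medians of the projected descriptors.

    projected   -- list of projected vectors z = P x
    assignments -- centroid index q(x) for each projected vector
    """
    groups = defaultdict(list)
    for z, q in zip(projected, assignments):
        groups[q].append(z)
    # zip(*zs) iterates over the dimensions of all vectors in one cell.
    return {q: [median(dim) for dim in zip(*zs)] for q, zs in groups.items()}

# Three projected descriptors fall into centroid 0, one into centroid 1.
projected = [[1, 5], [3, 1], [2, 9], [10, 10]]
assignments = [0, 0, 0, 1]
tau = median_matrix(projected, assignments)
```

Using the median as the threshold means that, within each centroid, roughly half of the training descriptors receive a 1 and half a 0 in every signature bit.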

Content Owner Side
• ($S_{x_j}$) ← SigBuild(Img, VD, P, $\tau_{q(x_j)}$): SIFT descriptors $x_j$ are extracted from the images Img that will be outsourced to the CSP and are assigned to particular cluster centers $q(x_j)$ ($j = 1, \ldots, m_j$) in VD. Each $x_j$ is projected with the orthogonal projection matrix $P$, and its binary signature $b(x_j)$ is generated by comparing the projection with the median matrix $\tau_{q(x_j)}$ according to Eq. (1). A descriptor $x_j$ can therefore be represented as a signature $S_{x_j} = (q(x_j), b(x_j))$. Orthogonal decomposition is applied to $S_{x_j}$ to obtain the encryption field $S_{1x_j} = (q_1(x_j), b_1(x_j))$; the AES encryption algorithm is executed on it to obtain $(q_{1e}(x_j), b_{1e}(x_j))$ according to Equation (8); finally, the encrypted features $(q_{ef}(x_j), b_{ef}(x_j))$ are obtained by means of orthogonal composition. TF-IDF is used to construct the secure image index.
• ($Img_e$) ← SecureImgBuild(Img, $K_J$): The images that will be outsourced to the CSP are encrypted with the AES encryption algorithm under the key $K_J$, and the cipher-images $Img_e$ are obtained.
The content owner uploads the cipher-images $Img_e$ and the corresponding secure image index to the CSP.

User Side
• The user sends a query request to the content owner. After identity authentication, the owner sends the secure parameters $(R, S, P, K_I)$ to the user securely.
• ($S_q$) ← SigBuild($Img_q$, VD, P): SIFT descriptors $x_q$ are extracted from $Img_q$, and the binary signature $S_q = (q(x_q), b(x_q))$ is generated with the method described in Section 3.2. Orthogonal decomposition is applied to the signature $S_q$ to obtain the encryption field $S_{1q}$; the AES encryption algorithm with the encryption key $K_I$ is executed on $S_{1q}$; orthogonal composition is then applied to obtain the encrypted features $(q_{ef}(x_q), b_{ef}(x_q))$ as the trapdoor. The user sends the trapdoor to the CSP as the query.
• Img ← Dec($Img_{eq}$, $K_J$): After receiving the retrieval results $Img_{eq}$ from the CSP, the user decrypts the cipher-images with $K_J$ and obtains the requested images.

CSP Side
• The CSP obtains the sub-matrix $S$ from the content owner securely and performs orthogonal decomposition on the encrypted features $(q_{ef}(x_{i,j}), b_{ef}(x_{i,j}))$ and on the search trapdoor $(q_{ef}(x_q), b_{ef}(x_q))$ sent by the user according to Equation (10). The distance comparison fields $(q_{2f}(x_{i,j}), b_{2f}(x_{i,j}))$ and $(q_{2f}(x_q), b_{2f}(x_q))$ are thus obtained, and the matching scores are computed by similarity comparison between them according to Equations (11)(12)(13). The higher the matching score, the more similar the corresponding image is to the query image, and the encrypted images $Img_{eq}$ with the highest scores are sent to the requesting user.
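The voting and ranking performed on the CSP side can be sketched with an inverted file. The sketch makes simplifying assumptions that go beyond the text: the distance comparison fields are modeled as opaque word tokens that can be tested for equality plus binary signatures, and every match receives unit weight instead of a TF-IDF weight. The names `build_inverted_file` and `rank` are hypothetical.

```python
from collections import defaultdict

def build_inverted_file(index):
    """Map each word token to the (image_id, signature) pairs that contain it.

    index -- {image_id: [(word_token, signature), ...]}
    """
    inverted = defaultdict(list)
    for image_id, entries in index.items():
        for word, sig in entries:
            inverted[word].append((image_id, sig))
    return inverted

def rank(inverted, trapdoor, h_t, top_k):
    """Vote for images whose entries match the trapdoor entries."""
    scores = defaultdict(float)
    for word, sig in trapdoor:
        for image_id, db_sig in inverted.get(word, []):
            if sum(a != b for a, b in zip(sig, db_sig)) <= h_t:
                scores[image_id] += 1.0  # unit weight in place of TF-IDF
    # Highest score first; ties broken by image id for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_k]

index = {
    "a": [(1, (0, 0, 1)), (2, (1, 1, 1))],
    "b": [(1, (0, 1, 1))],
    "c": [(3, (0, 0, 0))],
}
inverted = build_inverted_file(index)
trapdoor = [(1, (0, 0, 1)), (2, (1, 1, 0))]
results = rank(inverted, trapdoor, h_t=1, top_k=2)
```

Only the images that share a word token with the trapdoor are visited, which is what makes the inverted file faster than comparing the query against every stored image.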

EXPERIMENTAL RESULTS
In this section, we present an experimental evaluation of the proposed method and compare its retrieval results with several classical privacy-preserving image retrieval methods, including random projection, bit-plane randomization, and randomized unary encoding proposed in [14], and the Paillier algorithm put forward in [19]. We perform experiments on the INRIA Holidays dataset [28], which contains 1491 images divided into 500 queries and 991 corresponding relevant images. The performance of our scheme is evaluated in terms of security and retrieval performance.

Security
The orthogonal decomposition is implemented on image features instead of images, so images can be encrypted by any encryption method. We use the AES-128 encryption algorithm to encrypt images. Even with the most powerful biclique attacks, the computational complexity is $2^{126.1}$, which means that the security of AES will not be broken in practice. For the encrypted features, $S$ is known by the CSP, and since $R$ and $S$ are both sub-matrices of $B$, they are not completely independent, so there is a potential security problem. The complexity of attacking $B$ and $R$ when the sub-matrix $S$ is known depends on the following quantities: $n$ is the size of $B$, each element of $B$ is a $k$-bit fixed word-length integer, $m$ is the number of columns of the sub-matrix $S$, and $k_m$ is a factor that includes the efficiency improvement achieved over an exhaustive search by using a different algorithm. The complexity of attacking $B$ and of attacking $R$ tends to $O(n^{2}k)$ and $O(m^{2}k)$, respectively, which means that even if the orthogonal sub-matrix $S$ is known, the computational complexity of attacking the orthogonal matrix $B$ and the orthogonal sub-matrix $R$ is exponentially increasing; therefore, the security of features can also be guaranteed.
Three different metrics are used to evaluate the security of features: autocorrelation function, information entropy, and histogram distribution.
(1) Autocorrelation function
The autocorrelation function of a feature vector measures how correlated the neighboring feature elements are [29]. Suppose $n$ is the size of the feature vector and $T$ is the delay of the signal; the autocorrelation function $R(T)$ is then calculated between the feature elements and their copies delayed by $T$. The autocorrelation functions of the raw color histogram, the randomized features of the three retrieval methods of Lu, a random uniform vector, and the features encrypted with the algorithm proposed in this paper are shown in Figure 2. We can see that the raw color feature has non-negligible correlation, which indicates a strong association between adjacent elements, while our method has an autocorrelation similar to that of the random features encrypted by the Paillier algorithm and much lower than that of Lu's three methods. This shows that features encrypted with our method have low autocorrelation and thus good confidentiality.
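As an illustration of this metric, the following sketch computes a normalized sample autocorrelation at delay T. The normalization used here (mean removal and division by the zero-delay energy) is one common convention and is an assumption; it may differ from the exact definition used for Figure 2.

```python
def autocorrelation(x, lag):
    """Normalized sample autocorrelation of the sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    energy = sum(c * c for c in centered)
    if energy == 0:
        return 0.0  # constant sequence: no variation to correlate
    shifted = sum(centered[i] * centered[i + lag] for i in range(n - lag))
    return shifted / energy

# A strictly alternating sequence is strongly anti-correlated at lag 1.
alternating = [1, -1] * 4
r0 = autocorrelation(alternating, 0)
r1 = autocorrelation(alternating, 1)
```

Values close to zero for all non-zero delays indicate that neighboring elements of the vector reveal little about each other.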
(2) Information entropy
Information entropy describes the uncertainty of random variables. A higher entropy of the encrypted features means that their distribution is closer to uniform, and thus that they have higher randomness and higher security. The entropy is defined as follows:
$$H(x) = -\sum_{i=1}^{n} p(x(i)) \log_2 p(x(i))$$
where $x(i)$ is the feature, $n$ is the length of $x(i)$, and $p(x(i))$ is the probability of $x(i)$. The entropy of the encrypted features for the different algorithms is given in Table 1.
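The entropy of a feature vector can be estimated from the empirical frequencies of its values, as in the sketch below. It assumes that the feature values are discrete (or have been quantized beforehand); the function name is hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Entropy in bits of the empirical distribution of the given values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

concentrated = shannon_entropy([5, 5, 5, 5])   # a single value, no uncertainty
uniform = shannon_entropy([0, 1, 2, 3])        # four equally likely values
```

For a fixed number of distinct values, the entropy is largest when all values are equally frequent, which is why a higher entropy is read as a distribution closer to uniform.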
From Table 1, we can see that both the color histogram and the visual words have low entropy, which means that there is a strong inherent correlation among the feature values. The features encrypted by the proposed method achieve higher entropy than Lu's three methods but lower entropy than the uniform random features encrypted by the Paillier algorithm. Although its security is not as good as that of features encrypted by a complex homomorphic algorithm, the time consumed by encryption is much lower than that of the Paillier algorithm; therefore, it is a more practical method.

(3) Histogram distribution
The histogram distribution is commonly employed to evaluate the security of encrypted data. In the proposed method, the secure image index of each image is outsourced to the CSP and consists of two parts, $(q_{ef}(x_j), b_{ef}(x_j))$: $q_{ef}(x_j)$ is the encrypted cluster center to which the feature $x_j$ belongs, and $b_{ef}(x_j)$ is the encrypted binary signature of $x_j$. The CSP can obtain only the frequency histogram distribution of the encrypted visual words, and it can hardly acquire image features from the encrypted binary signatures. Therefore, it is difficult for the CSP to recover the image features and infer the image content from the frequency histogram distribution alone without the correct decryption key. Figure 3 compares the frequency histogram distributions of the encrypted visual words and the original visual words of an image. The histogram of the encrypted visual words differs from that of the original visual words, so it is difficult for attackers to infer the image content only through the frequency histogram distribution of the encrypted visual words.

Analysis of Time Complexity
Assume that the size of the selected operation data X is n, the dimension of the sub-matrix S is k (k < n), g is the number of bit-planes to randomize, m is the dimension of the projected features, and M is the eigenvector of the feature vector; the time complexity comparison with other methods is then given in Table 2.
From Table 2, we can see that the time complexity of the Paillier homomorphic method is the highest, followed by randomized unary encoding; the time complexity of the proposed method is lower than that of these methods and similar to that of the random projection proposed by Lu.

Retrieval Precision
Retrieval precision is evaluated by the mean average precision (mAP), which is widely used to measure image retrieval performance for a group of queries [30]. It is defined as follows:
$$\text{mAP} = \frac{1}{Q} \sum_{q=1}^{Q} \text{aveP}(q)$$
where $Q$ is the number of queries and $\text{aveP}(q)$ is the average of the precisions measured each time a new relevant image is retrieved for query $q$. In this experiment, a group of 500 query images is retrieved from a database containing 991 images, and the mAP of the query images for the different encryption methods is shown in Table 3.
From Table 3, we can see that the retrieval result of our scheme is much better than those of the other methods. We also compute the mAP of the unencrypted HE method; it is 80.59, which is roughly equal to the result of our encrypted scheme. In other words, our method barely reduces the retrieval accuracy of the original HE.
The above analysis shows that the proposed method achieves a good tradeoff between security and retrieval performance: it not only guarantees the security of retrieval in the cloud environment but also ensures high retrieval precision for practical applications.
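The mAP metric used in this section can be sketched as follows. The sketch assumes that each query comes with a ranked result list and a set of relevant image identifiers, and that the average precision is normalized by the total number of relevant images (a common convention that coincides with the definition above when all relevant images appear in the ranking). The function names and toy data are hypothetical.

```python
def average_precision(ranked, relevant):
    """Mean of the precisions measured at each rank where a relevant image appears."""
    hits = 0
    precisions = []
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(results):
    """results: list of (ranked_list, relevant_set) pairs, one per query."""
    if not results:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in results) / len(results)

queries = [
    (["a", "x", "b"], {"a", "b"}),  # relevant images found at ranks 1 and 3
    (["x", "a"], {"a"}),            # relevant image found at rank 2
]
map_score = mean_average_precision(queries)
```

For the first toy query the precisions are 1/1 and 2/3, for the second 1/2, so rankings that place relevant images earlier obtain a higher score.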

CONCLUSION
A privacy-preserving image retrieval method based on an improved BoVW model in cloud computing is proposed in this paper. An improved BoVW method based on Hamming embedding, which provides binary signatures that refine visual words, is combined with orthogonal transformation to achieve privacy-preserving image retrieval in the cloud environment. Experiments show that our scheme has clear advantages in security and retrieval precision compared with other methods. Future research will focus on further improving retrieval efficiency.

DISCLOSURE STATEMENT
No potential conflict of interest was reported by the authors.