Collaborative classification mechanism for privacy-preserving on horizontally partitioned data

ABSTRACT We propose a novel two-party privacy-preserving classification solution called Collaborative Classification Mechanism for Privacy-preserving (C²MP²) over horizontally partitioned data, inspired by the fact that global and local learning can be executed independently by the two parties. This model collaboratively trains the decision boundary from two hyperplanes individually constructed from each party's own private data and the global data. C²MP² can hide true data entries and ensure both parties' privacy. We describe its definition and provide an algorithm to predict future data points based on Goethals et al.'s private scalar product protocol. Moreover, we show that C²MP² can be transformed into the existing Minimax Probability Machine (MPM), Support Vector Machine (SVM) and Maxi-Min Margin Machine (M⁴) models when the private data satisfy certain conditions. We also extend C²MP² to a nonlinear classifier by exploiting the kernel trick. Furthermore, we perform a series of evaluations on real-world benchmark data sets. A comparison with SVM from the viewpoint of privacy protection demonstrates the advantages of our new model.


Introduction
Collecting training data and predicting on future data are two necessary steps in a pattern classification system. The data instances involved are generally distributed across different parties. Traditional classifiers deal with the data under the assumption that all parties' data can be freely accessed and centralized at a data centre. Increasingly, however, privacy concerns may prevent the parties from directly sharing the data and certain confidential information about the data. It is well documented [1][2][3][4] that the unrestricted exposure of private information through the Internet and other media has reached a point where threats against privacy are very common and deserve serious concern.
Generally, there are two main kinds of approaches to privacy-preserving classification: the perturbation-based approach [5] and the cryptography-based approach [6]. Perturbation-based methods have been widely used for data mining; however, when applied to classification, they must trade off privacy against accuracy. Cryptography-based methods can preserve privacy safely without loss of accuracy, but incur high computing and communication costs.
Each privacy-preserving classifier may face two scenarios: vertically distributed data [7] and horizontally distributed data [6,8]. In the first scenario, the features of a single entry may be distributed across multiple parties. In the second scenario, each party holds all the feature values for its own group of entities, while the other parties hold similar data for other groups of entities.
Global and local learning is a recently emerging field. To the best of our knowledge, the idea was first introduced by Lanckriet et al. in the Minimax Probability Machine (MPM) [9]; this model uses the given mean and covariance matrix of each class to represent that class's global data, and tries to minimize the probability of misclassification of future data points in a worst-case setting. As a result, an optimal linear discriminant is obtained with an explicit upper bound on the probability of misclassification of future data. Following this idea, Huang et al. [12] proposed a model in which global data are represented by the centre and radius of a hyperellipsoid. Also, in recent research [13], a bridge between the Minimum Enclosing Ball (MEB) [14] and Fuzzy Inference Systems (FIS) was established. These studies demonstrate that collaborating on classification with both local data and global data gives a classifier an advantage.
From the viewpoint of privacy preservation, local data can be considered private data, and global data can be estimated from local data. Conversely, when certain conditions are satisfied, local data cannot be derived from global data, so no private information is revealed by the global data. Hence, global data can appropriately be used to hide private information in a classification scheme. A fact of the real world is that one's privacy should be shielded only from others, while remaining freely accessible to oneself. These observations motivate the proposed model.
In this paper, we focus on a two-category classification task in which Alice holds positive-class samples X^(A) = {x_i^(A) ∈ R^n} and negative-class samples Y^(A) = {y_j^(A) ∈ R^n}, while Bob likewise holds X^(B) and Y^(B). The two parties want to collaboratively learn a classifier from those samples. With the traditional method, the data would be centralized to train a classifier; however, for privacy reasons, X^(A) and Y^(A) owned by Alice are prohibited from being accessed by Bob, and likewise X^(B) and Y^(B) are shielded from Alice.
Our proposed Collaborative Classification Mechanism for Privacy-Preserving (C²MP²) differs from existing privacy-preserving classifiers in its collaborative mechanism. Our approach is based on the following idea: Alice and Bob can individually obtain local classifiers by training on their local data, and the two local classifiers can then be combined to reach a joint decision. From the viewpoint of global and local learning, the inaccuracy of each local classifier can be compensated for by introducing global information into the local classifiers.
As shown later, C²MP² is closely related to three models, namely M⁴, MPM and SVM. Another important feature of C²MP² is that no third party is needed; its training and testing algorithms can be executed entirely within the two parties.

The third feature of our proposed model is that the linear version of C²MP² can be extended to a more powerful nonlinear classification approach by using the kernel trick.
The paper is organized as follows. In the next section, we review the related preliminaries and definitions. In Section 3, we introduce the unkernelized linear version of the C²MP² model in detail, including its definition, collaborative mechanism, solving method, and secure training and testing algorithms. In the same section, we also analyse its connections with the existing M⁴, MPM and SVM models. Following that, we present the kernelized nonlinear version of C²MP². We then, in Section 5, evaluate the unkernelized linear version and the kernelized nonlinear version of C²MP² on real-world benchmark data sets. Finally, we summarize the main results of the paper, give concluding remarks and envision possible future work in Section 6.

Related preliminaries
Privacy Information. Most privacy concerns can be classified as either unwanted intrusions into an individual's private life or the right to control the use of personal information about oneself [15]. In this paper, we assume that all raw data except the label attribute are private and should be shielded from other parties; thus, local data are equivalent to private data.
Global Data are data which summarize a data set and provide practitioners with knowledge of its structure [10]. In this paper, the mean value and covariance matrix of each class, denoted {x̄, Σ_x} for the positive class and {ȳ, Σ_y} for the negative class, represent the global data.
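The plug-in estimates behind these global data can be sketched as follows; the divide-by-N (maximum-likelihood) convention and the toy values are illustrative assumptions, not taken from the paper.

```python
# Plug-in "global data" for one class: the sample mean and the
# divide-by-N covariance matrix, computed on a toy positive-class sample.

def mean_vec(X):
    """Column-wise mean of a list of n-dimensional points."""
    N, n = len(X), len(X[0])
    return [sum(x[j] for x in X) / N for j in range(n)]

def cov_mat(X):
    """Plug-in covariance: (1/N) * sum_i (x_i - mean)(x_i - mean)^T."""
    N, n = len(X), len(X[0])
    m = mean_vec(X)
    return [[sum((x[a] - m[a]) * (x[b] - m[b]) for x in X) / N
             for b in range(n)] for a in range(n)]

X_pos = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # made-up local data
x_bar = mean_vec(X_pos)     # -> [3.0, 4.0]
sigma_x = cov_mat(X_pos)    # -> [[8/3, 8/3], [8/3, 8/3]]
```

Only x_bar and sigma_x (the global data) would ever be shared; the rows of X_pos stay local.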
Secure Two-party Computation [16,17] deals with computing any function on any input in a distributed network in which each participant holds one of the inputs, while ensuring that no more information is revealed to a participant than can be inferred from that participant's own input and output.
Homomorphic Encryption Scheme (HES) [18,19] is a public-key cryptosystem represented by a triple (Gen, Enc, Dec), in which Gen is the key generator, and the encryption algorithm Enc and its corresponding decryption algorithm Dec satisfy that, for any two ciphertexts Enc(A) and Enc(B), the product Enc(A) · Enc(B) is a ciphertext of A * B, where * is an algebraic operation in the group G. This property can be used to construct a secure inner product when * is the addition operation.
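The additive homomorphic property can be demonstrated with a toy instance of the Paillier cryptosystem [19]; the tiny primes and plaintexts below are illustrative only and offer no real security.

```python
# Toy Paillier cryptosystem demonstrating Enc(A) * Enc(B) = Enc(A + B).
from math import gcd

p, q = 293, 433                 # toy primes; real keys use >= 1024-bit primes
n, n2 = p * q, (p * q) ** 2
g = n + 1                       # standard choice of generator
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p - 1, q - 1)

def L(u):
    return (u - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)   # modular inverse (Python >= 3.8)

def enc(m, r):
    """Enc(m; r) = g^m * r^n mod n^2, with nonce r coprime to n."""
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return (L(pow(c, lam, n2)) * mu) % n

a, b = 123, 456
ca, cb = enc(a, 17), enc(b, 23)
# Multiplying ciphertexts adds the plaintexts:
sum_plain = dec((ca * cb) % n2)        # -> 579 = a + b
# Raising a ciphertext to a power scales the plaintext:
scaled_plain = dec(pow(ca, 5, n2))     # -> 615 = 5 * a
```

The second property (plaintext scaling by exponentiation) is exactly what the secure scalar product protocol used later relies on.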

Privacy-preserving classification scheme
In the following, we first present the definition and formulation of the unkernelized linear version of C²MP², then introduce its collaborative idea and discuss its connections with other models, including M⁴, MPM and SVM. In this section, we also present its training and testing algorithms and analyse their security.

Linear version of C²MP²
Following the previously described data distribution, let N^(A) and N^(B) respectively denote the numbers of data entries held by Alice and Bob, with N_x^(A), N_y^(A) (resp. N_x^(B), N_y^(B)) their positive- and negative-class parts, and let N = N^(A) + N^(B) denote the total number of data entries. Hereafter, we denote the covariance matrices of the positive and the negative class as Σ_x and Σ_y.
We wish to determine a hyperplane f(z) = w^T z + b, where w ∈ R^n \ {0} and b ∈ R, which separates the above horizontally distributed two classes of data as robustly as possible. Future data points z for which f(z) ≥ 0 will then be classified as positive; otherwise, they will be classified as negative. The training and testing procedures should guarantee that the private data entries in X^(A) and Y^(A) cannot be disclosed to Bob, and that Bob's data entries are likewise shielded from Alice. Moreover, when testing a future data point z held by Tom, the privacy of z should not be disclosed to any other party.
We construct the first classifier from the private data of Alice together with the global data, so that the private data of Alice are shielded. This classifier can be reasonably expressed as

max_{w_a, b_a, ρ_a} ρ_a   (1)
s.t. (w_a^T x_i^(A) + b_a)/√(w_a^T Σ_x w_a) ≥ ρ_a,  i = 1, ..., N_x^(A),   (2)
     −(w_a^T y_j^(A) + b_a)/√(w_a^T Σ_y w_a) ≥ ρ_a,  j = 1, ..., N_y^(A).   (3)

The second classifier is constructed likewise using only the private data of Bob, and can be reasonably expressed as

max_{w_b, b_b, ρ_b} ρ_b   (4)
s.t. (w_b^T x_i^(B) + b_b)/√(w_b^T Σ_x w_b) ≥ ρ_b,  i = 1, ..., N_x^(B),   (5)
     −(w_b^T y_j^(B) + b_b)/√(w_b^T Σ_y w_b) ≥ ρ_b,  j = 1, ..., N_y^(B),   (6)

where Σ_x, Σ_y ∈ R^{n×n} respectively denote the covariance matrices of the positive and the negative class; both are symmetric and positive semi-definite. The first classifier, described by (1)-(3), tries to maximize the margin, defined as the minimum Mahalanobis distance to the private training samples of Alice. This classifier uses only the local data held by Alice and the covariance matrices of the two classes, and can be executed by Alice alone, without disclosure of Alice's private data. The second classifier, described by (4)-(6), works likewise and protects the privacy of Bob. Compared to SVM and M⁴, C²MP² divides the whole classifier into two separate classifiers to protect local private data. We concede that, due to the absence of the opposite party's local private data, the individual classifiers are biased and yield decision errors. However, we will show in the following that this bias is compensated for by jointly combining the two decision hyperplanes.
To deal with the nonseparable case, we introduce slack variables. The optimization of the first classifier then becomes

max_{w_a, b_a, ρ_a, ξ} ρ_a − C_a Σ_{k=1}^{N^(A)} ξ_k   (7)
s.t. (w_a^T x_i^(A) + b_a)/√(w_a^T Σ_x w_a) ≥ ρ_a − ξ_i,  i = 1, ..., N_x^(A),   (8)
     −(w_a^T y_j^(A) + b_a)/√(w_a^T Σ_y w_a) ≥ ρ_a − ξ_{N_x^(A)+j},  j = 1, ..., N_y^(A).   (9)

In a similar way, the second classifier described by (4)-(6) can be rewritten as

max_{w_b, b_b, ρ_b, ε} ρ_b − C_b Σ_{k=1}^{N^(B)} ε_k   (10)
s.t. (w_b^T x_i^(B) + b_b)/√(w_b^T Σ_x w_b) ≥ ρ_b − ε_i,  i = 1, ..., N_x^(B),   (11)
     −(w_b^T y_j^(B) + b_b)/√(w_b^T Σ_y w_b) ≥ ρ_b − ε_{N_x^(B)+j},  j = 1, ..., N_y^(B),   (12)

where ξ_k and ε_k are nonnegative slack variables, which can be regarded as the degree to which the local training data violate the margins ρ_a and ρ_b. C_a and C_b are positive penalty parameters, so C_a Σ_k ξ_k and C_b Σ_k ε_k can be conceptually regarded as the training errors, or empirical errors. In other words, the two optimizations (7)-(9) and (10)-(12) maximize the minimum margin while minimizing the total training error.
The above two classifiers, (7)-(9) and (10)-(12), constitute the unkernelized linear version of C²MP². As can be clearly observed, the optimization (7)-(9) has the same form as M⁴ [11]; i.e. it can be cast as a sequential Second Order Cone Programming (SOCP) problem and solved accordingly.

How the collaborative mechanism works
Several natural questions arise for the linear version of C²MP²: how to obtain Σ_x and Σ_y, why disclosing the covariance matrices will not disclose the private data, and how to derive a final decision hyperplane from the separate f_a(z) and f_b(z). In this section, we address these problems.
The whole positive-class data set X is horizontally split into X^(A) and X^(B). Let N_x = N_x^(A) + N_x^(B) denote the total number of entries in X. The mean value of X can be estimated by

x̄ = (N_x^(A) x̄^(A) + N_x^(B) x̄^(B)) / N_x.   (14)

The covariance of X can be estimated from

Σ_x = (1/N_x) [ Σ_{i=1}^{N_x^(A)} (x_i^(A) − x̄)(x_i^(A) − x̄)^T + Σ_{j=1}^{N_x^(B)} (x_j^(B) − x̄)(x_j^(B) − x̄)^T ].   (15)

As observed from (14) and (15), to obtain x̄, Alice (resp. Bob) first requests N_x^(B) and x̄^(B) (resp. N_x^(A) and x̄^(A)) from Bob (resp. Alice); then Alice and Bob can each calculate their own summand of (15) locally. Finally, by combining the opponent's component, Σ_x can be shared by both Alice and Bob without disclosure of the individual entries x_i^(A) and x_j^(B). Σ_y can be calculated likewise.
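This statistics-combination step can be sketched as follows; the two-round message exchange (counts and means first, local scatter matrices second) follows the description above, with made-up data values.

```python
# Combining per-party statistics into the global mean and covariance:
# each party shares only its count, local mean, and local scatter around
# the global mean, yet the result matches the pooled plug-in estimates.

def mean_vec(X):
    N, n = len(X), len(X[0])
    return [sum(x[j] for x in X) / N for j in range(n)]

def scatter(X, center):
    """Sum_i (x_i - center)(x_i - center)^T, computed locally by one party."""
    n = len(center)
    return [[sum((x[a] - center[a]) * (x[b] - center[b]) for x in X)
             for b in range(n)] for a in range(n)]

XA = [[1.0, 2.0], [3.0, 4.0]]                 # Alice's positive-class share
XB = [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]    # Bob's positive-class share
NA, NB = len(XA), len(XB)
mA, mB = mean_vec(XA), mean_vec(XB)

# Round 1: exchange counts and local means, form the global mean (Eq. (14)).
x_bar = [(NA * mA[j] + NB * mB[j]) / (NA + NB) for j in range(2)]

# Round 2: each party computes its scatter around x_bar; the scatters are
# exchanged and summed (Eq. (15)).
SA, SB = scatter(XA, x_bar), scatter(XB, x_bar)
sigma = [[(SA[a][b] + SB[a][b]) / (NA + NB) for b in range(2)]
         for a in range(2)]

# Sanity check: identical to the plug-in covariance of the pooled data.
pooled = XA + XB
sigma_direct = [[s / len(pooled) for s in row]
                for row in scatter(pooled, mean_vec(pooled))]
```

No individual row of XA or XB ever leaves its owner; only counts, means and scatter matrices are exchanged.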
Some research has shown that the disclosure of statistical values may lead to leakage of privacy [20]; in this paper, we focus on preventing the opponent from deducing the raw data from the covariance matrix. Assume that Bob wants to deduce the entries x_i^(A) from N_x^(A), x̄^(A) and Alice's component of Σ_x, which constitute all the information Alice shares with Bob. This amounts to solving a system of quadratic equations in N_x^(A) × n variables, a problem known to be NP-hard over any field [21,22].
To achieve a joint decision from f_a(z) and f_b(z), theoretically inspired by schemes for combining classifiers [23] and by collaborative learning [24], we consider the points that lie in the margin area and are equally far from f_a(z) and f_b(z) in Mahalanobis distance; this condition defines the point set given by (16). The roots of (16) constitute the final hyperplane, and when the corresponding condition holds, the final decision hyperplane can be jointly expressed as (17). This collaborative mechanism can be considered a modified median rule for combining classifiers [23], which averages over Mahalanobis distance instead of taking an arithmetic average. Moreover, since the Mahalanobis distance takes into account global information about the data set, including compactness and orientation, the final hyperplane combined by this mechanism achieves a more reasonable decision than either individual classifier. Meanwhile, this collaborative strategy compensates for the previously described bias introduced by the individual classifiers. The experimental results on real data sets reported later also demonstrate its effectiveness.

Connections with other models
In this section, based on the above linearly separable version of C²MP², we build connections between C²MP² and other models.

Connection with M⁴
If one assumes w_a = w_b = w, b_a = b_b = b and ρ_a = ρ_b = ρ, then (1)-(3) and (4)-(6) can be combined into a single optimization (18)-(20), which is exactly the M⁴ optimization [10,11]. M⁴ uses both global data and local private data; however, the local private data are directly shared between the parties, so M⁴ is a centralized model without privacy preservation. In comparison, C²MP² is a distributed model which deals with local private data collaboratively and combines two classifiers to achieve the goal of privacy preservation.

Connection with MPM
Following the path from C²MP² to M⁴, adding up all N_x constraints in (19) and averaging the sum, one immediately obtains

w^T x̄ + b ≥ ρ √(w^T Σ_x w).   (21)

Similarly, from the N_y constraints in (20), one obtains

−(w^T ȳ + b) ≥ ρ √(w^T Σ_y w).   (22)

Adding (21) and (22), the two optimizations can be combined into one, i.e.

max_{w, ρ} ρ   s.t.   w^T (x̄ − ȳ) ≥ ρ (√(w^T Σ_x w) + √(w^T Σ_y w)).   (23)
Equation (23) is exactly the MPM optimization [9]. Note that the above derivation cannot be reversed, which means that MPM is looser than C²MP². From the viewpoint of privacy preservation, MPM is a centralized model of C²MP²; the centralization takes effect when setting w_a = w_b = w, b_a = b_b = b and ρ_a = ρ_b = ρ. On the other hand, only global data are used and all local private data are ignored in MPM, and this inobservance may cause inaccuracy. Despite these observations, we must emphasize that the original goal of MPM is not privacy preservation but the provision of guarantees with respect to classification accuracy; here, we merely revisit this model from the privacy-preserving viewpoint.

Connection with SVM
Under the same assumptions that w_a = w_b = w, b_a = b_b = b and ρ_a = ρ_b = ρ, the two optimizations can again be merged. Noticing that the magnitude of w does not influence the optimization, one can set ρ √(w^T w) = 1 without loss of generality. Additionally, if one assumes Σ_x = Σ_y = Σ = I, the merged problem reduces to (27)-(29), which is exactly the standard SVM model. Thus, assuming w_a = w_b = w, b_a = b_b = b, ρ_a = ρ_b = ρ and Σ_x = Σ_y = Σ means that SVM is also a centralized model of C²MP². Assuming Σ = I means that SVM discards orientation and shape information [11] and uses only local private data. Hence, SVM can be considered a centralized model without privacy preservation.
It is worth stressing that the goal of this paper is neither to beat M⁴, MPM or SVM in terms of classification accuracy nor to design a novel cryptosystem; we only explore the applicability of cooperating global and local data for privacy preservation within the M⁴ framework.
To end this section, we summarize the differences and connections among the four models in Table 1.

Secure training algorithm for linear version of C²MP²
As stated previously, the theoretical analysis is over R, while the modular operations used in testing and in computing the Gram matrix require all values to lie in a bounded region before the training algorithm is employed. Usually, attributes are represented as fixed-precision floats; we can encode them as integers by scaling each attribute to the range [−M, M]. Of course, the same scaling must be applied to the testing data before testing.
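A minimal sketch of this fixed-point encoding, assuming a simple per-attribute min-max scaling; the paper does not fix the exact scaling formula, so `fit_range`, `encode` and the choice M = 1000 are illustrative.

```python
# Scale each attribute from its training range [lo, hi] into integers in
# [-M, M], so that modular arithmetic applies. The same (lo, hi, M) learned
# on the training data must be reused for the test data.

M = 1000  # precision parameter; an illustrative choice

def fit_range(column):
    """Learn the scaling range from the training data only."""
    return min(column), max(column)

def encode(v, lo, hi, M=M):
    """Map v in [lo, hi] linearly onto integers in [-M, M]."""
    return round((2 * (v - lo) / (hi - lo) - 1) * M)

train_col = [0.0, 0.5, 1.0]
lo, hi = fit_range(train_col)
encoded_train = [encode(v, lo, hi) for v in train_col]   # -> [-1000, 0, 1000]

# A test value is encoded with the range fitted on the training data:
encoded_test = encode(0.25, lo, hi)                      # -> -500
```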
We then introduce Algorithm 1, which states the training procedure of the linear version of C²MP² and takes the communication between Alice and Bob into account.

After running Algorithm 1, both parties can compute the final output from the received messages and the means and covariances of both sides. Consequently, neither of them learns any additional information beyond the means, the covariances and the decision function (17).

Secure testing algorithm for linear version of C²MP²
Once the final decision hyperplane (17) is securely constructed for each party, to predict a future point z held by Tom using (17), we need to guarantee that z is not disclosed to any other party. When z is held by either Alice or Bob, since (17) is shared equally by them, the test can be computed entirely within the party holding z and no data exchange is needed; in this scenario, testing is naturally secure. However, for commercial interest or to protect intellectual property, (17) can be regarded as a classification knowledge rule which Alice and Bob do not want to disclose to Tom. For example, an online anti-spam mail provider does not open its classification rule to the public, while still distinguishing spam from normal email for customers without revealing their personal privacy.
Based on the existing secure scalar product protocol [18] and the Paillier homomorphic cryptosystem [19], we propose Algorithm 2 to compute the scalar product w · z and then predict the label of z. Since (17) is shared equally by Alice and Bob, we can assume that the testing service is provided by Alice.
Algorithm 2 can be seen as a special case of Protocol 3 proposed by Goethals et al. [18] under the condition S_b ← 0. Goethals et al. [18] have given a formal security proof in the semi-honest model; here, we apply this protocol to execute our testing procedure.

After executing Algorithm 2, Alice obtains no more knowledge than w · z and the predicted label of z, and Tom obtains no knowledge other than the predicted label of z.
Our proposed Algorithm 2, like PP-SVM [25], borrows its core idea from the scalar product protocol [18]. However, our model avoids the semi-honest third party which PP-SVM requires for both training and testing.

Algorithm 2 Testing algorithm for linear version of C²MP²
Input: Service provider Alice holds w and a secure homomorphic encryption keypair (sk, pk)
Input: Customer Tom holds a future point z
Output: Tom receives the predicted label of z
1. for k = 1, ..., n do
2.   Alice generates a random nonce r_k
3.   Alice computes c_k ← Enc_pk(w_k; r_k) and sends c_k to Tom
4. end for
5. Tom computes C ← Π_{k=1}^{n} c_k^{z_k}
6. Tom sends C to Alice
7. Alice computes Dec_sk(C) = w · z
8. Alice computes sign(w · z + b) to predict the label of z and sends the label to Tom
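Algorithm 2 can be sketched end to end with a toy Paillier cryptosystem; the key size, the fixed nonces and the sample w, z and b are illustrative, and attributes are assumed already integer-encoded and nonnegative on Tom's side (negative weights are encoded modulo n).

```python
# End-to-end sketch of Algorithm 2: Alice sends encrypted weights, Tom
# raises each ciphertext to his attribute value (homomorphically scaling
# the plaintext) and multiplies, Alice decrypts w . z and classifies.
from math import gcd

p, q = 293, 433                 # toy primes; real keys use >= 1024-bit primes
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)

def L(u):
    return (u - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)

def enc(m, r):
    """Enc_pk(m; r); negative plaintexts are encoded modulo n."""
    return (pow(g, m % n, n2) * pow(r, n, n2)) % n2

def dec(c):
    return (L(pow(c, lam, n2)) * mu) % n

def centered(m):
    """Map a residue back to a signed integer."""
    return m - n if m > n // 2 else m

# --- Alice: integer-encoded weights, bias, fixed nonces (steps 1-4) ---
w, b = [3, -2, 5], 1
cipher_w = [enc(wk, r) for wk, r in zip(w, [17, 23, 29])]

# --- Tom: nonnegative integer-encoded future point z (steps 5-6) ---
z = [4, 1, 2]
C = 1
for ck, zk in zip(cipher_w, z):
    C = (C * pow(ck, zk, n2)) % n2   # homomorphically accumulates w . z

# --- Alice: decrypt and classify (steps 7-8) ---
wz = centered(dec(C))                # w . z = 3*4 - 2*1 + 5*2 = 20
label = 1 if wz + b >= 0 else -1
```

Tom never sees w, and Alice never sees z; only the single ciphertext C crosses back.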

Kernelization
To handle nonlinear classification problems, we use the kernel trick to map the n-dimensional data points into a high-dimensional feature space R^f via a mapping function ϕ : R^n → R^f. We will demonstrate that, although C²MP² has a significantly different optimization from MPM, the kernelization method used in [9] is still viable, provided that suitable estimates for the means and covariance matrices are applied. This method has also been extended to MEMPM [26] and M⁴ [11]. A similar kernelization of C²MP² is described in the following. After kernelization, (1)-(3) can be written in the feature space as (30)-(32), and Equations (4)-(6) can likewise be rewritten with x and y replaced by ϕ(x) and ϕ(y). To carry out the two optimizations, we need to reformulate them and their final decision hyperplane in terms of a given inner-product kernel function K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j) satisfying Mercer's conditions. We now state Algorithm 4.1, similar to Corollary 5 proposed by Lanckriet et al. [9] and Proposition 1 proposed by Huang et al. [11], and prove its validity in solving the kernelized C²MP² model.
where I_f is the identity matrix of dimension f, and λ_i and θ_j (with their counterparts for the second classifier) are the normalized weights for the data points. The positive constants τ_x and τ_y can be regarded as regularization terms for the covariance matrices. Then the optimal α* and β* for (30)-(32) lie in the space spanned by the training points, i.e. α* ∈ span({ϕ(x_i^(A))} ∪ {ϕ(y_j^(A))}). Proof: Write α = α_d + α_p, where α_d is the projection of α onto the vector space spanned by all training data points and α_p is the component orthogonal to this span. One can then check that the denominator of the left-hand side of (31) depends on α_p only through the regularization term, and that α_d^T α_p = 0. It is evident that an orthogonal component α_p of α does not affect the constraints (30) and (32). Since the objective is to be maximized, the denominator should be as small as possible, which forces α_p* = 0 and hence α* = α_d*. In other words, the optimal α* lies in the vector space spanned by all the training points. According to Algorithm 4.1, α can therefore be written as a linear combination of the training data points, with coefficients μ_i, ν_j ∈ R, and represented in vector form. For clarity, let {z_i}, i = 1, ..., N^(A), denote all N^(A) training data points held by Alice. Following this notation, let K_i ∈ R^{N^(A)} denote the ith row vector of the Gram matrix K, i = 1, 2, ..., N^(A); moreover, let K_x and K_y denote the first N_x^(A) rows and the last N_y^(A) rows respectively. If we use the plug-in estimates to approximate the means and covariance matrices, the estimated covariance matrices can be represented in inner-product form: we define M^(A) with blocks m_x, m_y ∈ R^{N^(A)}, and the plug-in covariance matrices can then be written in terms of K and M^(A) as in (37) and (38). Notice that, in (36), setting τ_x = 0 does not affect the objective (30) or the constraints (31) and (32).
So, we can set τ_x = 0 and τ_y = 0, and substitute (37) and (38) into (31) and (32) respectively. Finally, the first classifier of the kernelized C²MP² can be written in terms of the kernel matrix; solving this problem yields the optimal η^(A)* and b_a*. Similarly, the second classifier can be solved for its optimal solution η^(B)* and b_b*. The optimal decision hyperplane can then be represented as a linear form in the kernel space. This combining operation can be considered as learning with hyperkernels [27].

Table 2. Details of the benchmark data sets.
Data set     #Positive  #Negative  #Features
ionosphere   225        126        34
glass        70         76         9
musk         207        209        167
parkinsons   48         147        22
pima         268        500        8
sonar        97         111        60
vote         168        267        16
yeast        463        429        6
Computing K_ij involves an inner-product computation. Considering the Gaussian kernel, K_ij can be computed purely from scalar products, since ||z_i − z_j||² = z_i·z_i − 2 z_i·z_j + z_j·z_j, so the secure scalar product protocol [18] is still viable for testing future points in the kernel space.
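This reduction of the Gaussian kernel to scalar products can be sketched as follows; the bandwidth sigma2 and the sample points are illustrative.

```python
# The Gaussian Gram entry K_ij depends on z_i and z_j only through scalar
# products, via ||z_i - z_j||^2 = z_i.z_i - 2 z_i.z_j + z_j.z_j. This is
# why a secure scalar product protocol suffices to evaluate the kernel
# classifier on a future point.
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gaussian_gram(Z, sigma2=1.0):
    """K_ij = exp(-||z_i - z_j||^2 / sigma2), built from dot products only."""
    sq = [dot(z, z) for z in Z]
    return [[math.exp(-(sq[i] - 2 * dot(Z[i], Z[j]) + sq[j]) / sigma2)
             for j in range(len(Z))] for i in range(len(Z))]

Z = [[0.0, 0.0], [1.0, 0.0]]
K = gaussian_gram(Z)   # K[0][1] == exp(-1), K[i][i] == 1.0
```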

Experiments
In this section, we evaluate C²MP² and compare its performance with that of SVM on eight benchmark data sets. The covariance matrices are given by the plug-in estimates.

Evaluations on benchmark data sets
We next perform evaluations on eight benchmark data sets obtained from the University of California at Irvine (UCI) machine learning repository [28]. To evaluate the algorithms in the horizontally distributed scenario, we need to construct a training set which is horizontally split between two sides (Alice and Bob). To this end, for each data set, 70 percent of the data examples are randomly selected for training, with one random half held by Alice and the other half held by Bob. All the remaining data are used for testing. This simulates the situation in which the data set is horizontally distributed between two parties. Further details of these data sets are listed in Table 2.
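The partitioning scheme can be sketched as follows; the seed and the example sample count (351, as in ionosphere) are illustrative choices.

```python
# Simulate the horizontally distributed scenario: 70% of the indices go to
# training (one random half to Alice, the other to Bob) and 30% to testing.
import random

def horizontal_split(n_samples, train_frac=0.7, seed=0):
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = int(train_frac * n_samples)
    train, test = idx[:n_train], idx[n_train:]
    alice, bob = train[:n_train // 2], train[n_train // 2:]
    return alice, bob, test

alice, bob, test = horizontal_split(351)   # e.g. the ionosphere data set
```

Each repetition of the experiment would call this with a fresh seed before cross-validation and evaluation.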
We randomly split each data set into training and test sets with the above scheme. Then, fivefold cross-validation is performed on the training set for parameter selection. Using the tuned parameters, the experiment is repeated 10 times independently on each data set, and the average accuracies and standard deviations are summarized in Tables 3 and 4. Furthermore, the paired t-test at the 0.05 significance level is performed over the 10 accuracies for each data set, and the corresponding p-values are also listed. For clarity, we analyse the results of the unkernelized linear versions and the kernelized nonlinear versions separately. As can be seen in Table 3, in comparison with SVM, C²MP² achieves the best overall performance; it loses only on parkinsons, where the total number of positive-class data points is 48, so after the horizontal split each party holds only 16 positive training samples, and training a classifier on so few samples may lead to inaccuracy.
In the kernelized version with the Gaussian kernel, as can be seen in Table 4, C²MP² wins on five out of eight data sets, and is significantly better on musk, parkinsons, pima, sonar and vote. Although the linear C²MP² wins on ionosphere, glass and yeast, the kernelized C²MP² loses on these data sets. These differences may be caused by the approximation errors introduced by the plug-in estimates of the covariance matrices in the kernelized feature space, where data points are very sparse compared with the huge dimensionality induced by the Gaussian kernel.

Computation and communication cost
Training the linear C²MP² has O(N n³) time complexity, equal to that of M⁴. The communication cost of training the linear C²MP² is also quite low.

In Algorithm 1, step 1 transmits (n² + 1) elements and step 7 transmits n elements; owing to the symmetry of the communication, the total number of transmitted elements is 2(n² + n + 1), which is independent of the number of samples.

Assuming the dimension of the samples is n and the maximum length of each transmitted value is m bits, the communication overhead of Algorithm 2 is (n + 2)m bits. In executing this algorithm, Alice must perform n encryptions and one decryption, while Tom has to perform n exponentiations and n − 1 multiplications. Current hardware allows approximately 10^6 multiplications per second, so the computational burden on both parties is tolerable.

Conclusion
We have proposed a novel two-party privacy-preserving classification solution called Collaborative Classification Mechanism for Privacy-preserving (C²MP²), which is theoretically based on combining classifiers and practically respects the facts that one's privacy should be shielded only from others while remaining freely accessible to oneself, and that sharing global data with others does not disclose one's own privacy. Based on global and local learning theory, two local classifiers are constructed without revealing either party's privacy, and they are then combined to give a joint decision. From the viewpoint of privacy preservation, we have also established detailed connections between our model and other models, including M⁴, MPM and SVM. Moreover, we have designed a training algorithm and a testing algorithm to carry out our model securely, without disclosing either party's privacy to any other party. In addition, we have extended our model to a nonlinear classification approach by exploiting the kernel trick. Experimental results on benchmark data sets have demonstrated the advantages of C²MP² in privacy preservation. Extending C²MP² to multi-party classification is an important topic for future work.

Disclosure statement
No potential conflict of interest was reported by the authors.