Association rule hiding using enhanced elephant herding optimization algorithm

Association rule hiding helps organizations avoid the risk of sensitive knowledge leakage when they share data with collaborators. The Cuckoo Optimization Algorithm (COA) can sanitize a transaction database, but it is limited by slow convergence and weak exploitation capability. Hence, in this paper, an Enhanced Elephant Herding Optimization Algorithm for Association Rule Hiding (EEHOA4ARH) is proposed. EEHOA uses two core functions, the clan updating operator and the separating operator, to hide association rules while achieving fast convergence and good exploitation capability. Moreover, the searching strategy that COA4ARH uses to select the best solution is highly time consuming. To reduce the time needed for this selection, a Crowding Distance (CD) concept is combined with EEHOA4ARH. By continuously updating the best elephant and replacing the worst elephant in the population, EEHOA4ARH-CD sanitizes the transaction database effectively. Thus the proposed EEHOA4ARH-CD achieves lower computation time, faster convergence and better exploitation capability by using crowding distance. The experimental results show the effectiveness of the proposed EEHOA4ARH-CD method in terms of hiding failure, lost rule and execution time, with an execution time of 44.66 s.


Introduction
Data mining is the task of examining huge volumes of data to identify the patterns and sensitive information in them. Companies and business organizations apply data mining to huge volumes of data to support appropriate decision making. Most companies and organizations hold some sensitive information that should be secured against unauthorized access. Maintaining the privacy or confidentiality of this information is an essential goal for database security research and for government organizations. As a result, a key challenge is to find a trade-off between the user's requirements and the privacy of information. The use of data mining techniques such as clustering, classification and association rule mining may endanger the database owner's security. Hence, a new research topic called Privacy Preserving Data Mining (PPDM) [1] was introduced.
Nowadays, PPDM has become an important issue because huge volumes of personal information are used by many companies and organizations. In many situations, users are unwilling to reveal their personal data unless the protection of their confidential data is guaranteed. To prevent the disclosure of sensitive information, the algorithms in this research area modify the database (data modification) and alter the amount of data in it (data distortion) [2]. However, the protection that data modification and data distortion offer for sensitive information is imperfect and has some side effects. Association Rule Hiding (ARH) [3] is a subfield of PPDM that analyses the side effects of data mining methods applied to sensitive information belonging to individuals or organizations. The main intention of ARH is to find a sanitized database such that, when a mining technique is applied to it, all sensitive rules are hidden while all non-sensitive rules can still be mined.
ARH sanitizes the original database in a way that at least one of the following goals is accomplished [4].
• All the non-sensitive rules that appear when mining the original database at pre-defined thresholds of support and confidence can be successfully mined from the sanitized database at the same thresholds or higher.
• No rule that is considered sensitive from the owner's perspective and can be mined from the original database at pre-specified support and confidence can be revealed from the sanitized database when this database is mined at the same or at higher thresholds.
• No rule that was not derived from the original database when it was mined at pre-specified thresholds of confidence and support can be derived from its sanitized counterpart when it is mined at the same or at higher thresholds.
One of the powerful and fast families of algorithms that sanitize a database to hide sensitive association rules is the heuristic approach. Because of its scalability and reliability, most researchers in the data mining field have concentrated on the heuristic approach [5][6][7][8][9]. However, this approach suffers from unintended side effects and yields only approximate hiding solutions in various circumstances. So, evolutionary algorithms are used for ARH. The Cuckoo Optimization Algorithm for sensitive Association Rule Hiding (COA4ARH) [10] is an evolutionary algorithm that was used to hide sensitive association rules with fewer side effects.
In COA4ARH, the Apriori algorithm was first applied to the original database to create the association rules. The sensitive rules were then selected based on the user-defined Minimum Support Threshold (MST) and Minimum Confidence Threshold (MCT). Next, the database was pre-processed by selecting the transactions that supported one or more sensitive rules; these are called critical transactions. Only the sensitive items with a critical role in sanitization were addressed for change. Initially, each cuckoo randomly inserted or deleted the sensitive items in the database to sanitize it. A new solution was then generated based on the fitness function of each cuckoo. Finally, a sanitized database was obtained. However, the whole process of COA4ARH is time consuming because of the searching strategy used to select the best solution for ARH.
So in this article, an Enhanced Elephant Herding Optimization Algorithm for Association Rule Hiding (EEHOA4ARH) is proposed, in which EEHOA is used instead of COA for ARH with less time consumption. EEHOA is also an evolutionary algorithm; it mimics the herding behaviour of elephants and can be modelled by two operators, the clan updating operator and the separating operator. Initially in EEHOA, each elephant in the population randomly inserts or deletes the sensitive items to sanitize the database. The fitness value of each elephant is then calculated and the best solution is selected based on the clan updating operator and the separating operator. In addition, the time needed to select the best solution is reduced by introducing crowding distance, with which an optimal solution is generated by resolving the conflicts between the multiple objective functions in the fitness function. Thus the proposed EEHOA4ARH-CD reduces the time consumption for ARH and handles the conflict between the multiple objective functions using CD.
The rest of this article is organized as follows: Section 2 studies the research related to ARH. Section 3 explains the proposed EEHOA4ARH to hide the sensitive rules in the database. Section 4 demonstrates the performance efficiency of the EEHOA4ARH method. Section 5 summarizes the article with future scope.

Literature survey
A rule hiding approach [11] was proposed based on an Evolutionary Multi-Objective Optimization (EMO) algorithm for association rule hiding. While sanitizing the database, a balanced relation among the side effects was analysed and incorporated into the association rule hiding process using the EMO algorithm. In the EMO algorithm, the transactions to be changed were determined by decoding the chromosomes, which also identified the items to be removed. However, the density of the dataset greatly influences the efficiency of this rule hiding approach.
A sanitization approach [12] was proposed for hiding sensitive itemsets in association rules based on the concept of Particle Swarm Optimization (PSO). However, PSO easily falls into a local optimum in high-dimensional space and has a low convergence rate in the iterative process. A distortion-based method [13] was proposed to hide sensitive association rules by removing some items in a database, which reduced the support or confidence of a sensitive rule below the user-defined threshold values. However, the effectiveness of this method depends on the user-defined threshold value.
The Modified Decrease Support of LHS item using Equivalent class transformation (MDSLE) [14] approach was proposed for association rule hiding. In MDSLE, the frequent itemsets and the sensitive items in the transactions were identified by applying equivalent class transformation. The sensitive association rules were then hidden using a heuristic approach. This approach could be enhanced by reducing its undesired side effects to achieve less information loss.
An algorithm [15] was proposed to hide sensitive association rules by inserting dummy items in the rules. However, it is not suitable for huge volumes of data. A fuzzy logic approach [16] was proposed to sanitize the transaction database. This approach used anonymization to hide the sensitive association rules. It avoided the undesired side effects by removing frequent itemsets on newly entered data. The sensitivity degree of every association rule was calculated using suitable membership functions, and anonymization was done with respect to the membership function. However, any change in the membership function causes some change in the height of the appropriate generalization.
An efficient algorithm [17] was proposed to hide association rules using a genetic algorithm. In this algorithm, genetic optimization was used to transform the raw database into a sanitized database. It introduced a simple genetic encoding for sensitive association rule hiding. The solution encoding and the objective function were defined for association rule hiding based on the genetic algorithm. It achieved a better time cost as well as minimal side effects on the non-sensitive rules. This algorithm could be enhanced by hiding a set of rules in one optimization run instead of hiding one rule in every run.
An optimization algorithm called the Electromagnetic Field Optimization Algorithm (EFO4ARH) [18] was proposed for association rule hiding. At first, electromagnetic particles were generated and each particle represented a solution, i.e. a sanitized database. The particles were then split into positive, negative and neutral fields based on their fitness function. A new solution was generated based on the position and pole of the chosen particles, after which its fitness value was compared with the fitness value of the last solution; if the new solution was better, the last solution was removed and the new solution was placed in the list. As future work, a new fitness function was to be defined to reduce the number of lost rules.
The MAXARH algorithm [19] was proposed for finding the sensitive rules and protecting their privacy. However, this algorithm still has side effects when hiding the sensitive association rules. Whale optimization and the Least Lion Optimization Algorithm (LLOA) [20] were introduced for privacy-preserving association rule hiding. LLOA sanitized the database by hiding the sensitive items in the association rules using a privacy factor and a utility factor in its objective function. However, the convergence speed of LLOA depends on the stopping criterion.
A hiding technique based on a Genetic Algorithm (HGA) and Dummy Items Creation (DIC) [21] was proposed for hiding sensitive association rules. However, this technique has a high artefactual error rate. An efficient meta-heuristic chemical reaction optimization-based algorithm [22] was proposed for association rule hiding through an advanced perturbation approach. This algorithm combined the characteristics of the Genetic Algorithm (GA), Particle Swarm Optimization (PSO) and the Cuckoo Optimization Algorithm (COA) to perturb the transaction database and hide the sensitive association rules. However, it does not show a drastic change in the quality of the sanitized transaction database.
A modified genetic algorithm [23] was proposed for association rule hiding. At first, the Frequent Pattern-growth (FP-growth) algorithm was applied to generate association rules, and then a genetic algorithm was applied to them to hide the sensitive association rules. However, it has a slow convergence problem. A pattern sanitization approach [24] was proposed for hiding sensitive itemsets for privacy-preserved pattern sharing. However, this approach hides the sensitive itemsets under a single minimum support threshold. An optimized support balance model [25] was proposed for association rule hiding. This model modified the increase shift left-based association rule hiding technique, in which only the sensitive rules whose confidence value is less than the minimum confidence were hidden. However, this model is not very effective when an association rule has more than two sensitive items.

Proposed methodology
In this section, EEHOA4ARH-CD is described in detail for association rule hiding to preserve the sensitive information in the database. Initially, the Apriori algorithm is applied to the collected transaction database D to generate rules. The database is then pre-processed to avoid generating unrelated solutions for ARH. During pre-processing, critical transactions (i.e. transactions that fully support one or more sensitive rules) are selected, and only those sensitive items with a critical role in sanitization are addressed for change. The sensitive items with a critical role are inserted or deleted by EEHOA4ARH to hide the sensitive association rules. The conflict between the multiple objective functions in the fitness function is resolved by CD. The block diagram of EEHOA4ARH-CD is shown in Figure 1.
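The selection of critical transactions described above can be illustrated with a minimal Python sketch. The data representation (transactions as item sets, rules as antecedent/consequent pairs), the function names and the toy data are assumptions made for illustration; they are not taken from the paper's MATLAB implementation.

```python
from typing import FrozenSet, List, Tuple

Transaction = FrozenSet[str]
Rule = Tuple[FrozenSet[str], FrozenSet[str]]  # (antecedent, consequent), hypothetical layout


def fully_supports(transaction: Transaction, rule: Rule) -> bool:
    """A transaction fully supports a rule when it contains every item of the rule."""
    antecedent, consequent = rule
    return (antecedent | consequent) <= transaction


def critical_transactions(database: List[Transaction],
                          sensitive_rules: List[Rule]) -> List[int]:
    """Return indices of transactions that fully support at least one sensitive rule."""
    return [
        index
        for index, transaction in enumerate(database)
        if any(fully_supports(transaction, rule) for rule in sensitive_rules)
    ]


# Toy data, invented for this example only.
database = [frozenset("abc"), frozenset("ab"), frozenset("bc"), frozenset("acd")]
sensitive = [(frozenset("a"), frozenset("c"))]  # the rule a -> c
critical = critical_transactions(database, sensitive)
print(critical)
```

Only the transactions returned here would be handed to the optimizer, which keeps the search space limited to the part of the database that can affect the sensitive rules.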

Enhanced elephant herding optimization algorithm based sanitization of transaction database
Elephants are social creatures that live in social structures of calves and females. An elephant clan is headed by a matriarch and consists of a number of elephants. Female elephants in the clan prefer to stay with their family members, whereas the male elephants choose to live elsewhere. The males progressively become independent of their families until they leave them completely. EEHOA is inspired by this herding behaviour of elephants. The following assumptions are made in EEHOA:
• The elephant population is composed of clans, each with a fixed number of elephants.
• In every generation, a certain number of male elephants abandon their family group and live alone, far from the main elephant group.
• A matriarch guides the elephants in each clan.
In EEHOA, the behaviour of elephants is modelled by the clan updating operator and the separating operator. Each elephant in the population either inserts or deletes the sensitive items in the transactions to sanitize the database. The elephants are updated using their current position and the matriarch through the clan updating operator, and the separating operator is then applied. EEHOA is designed to fix the convergence speed and to maintain a trade-off between the exploitation and exploration phases.

a) Initialization and Fitness function
Each elephant in a clan represents a solution (i.e. a sanitized database), which is encoded as a sequence of 0s and 1s. A 1 indicates the presence of a sensitive item in a transaction and a 0 indicates its absence. The first elephant in the population is the sequence of critical transactions of the original database. The other elephants in the population set the sensitive items randomly, while their other items are the same as in the first elephant. In this way, an initial population containing a number of solutions is generated. After the initialization process, the fitness value of each elephant is calculated with respect to the number of hiding failures, the number of lost rules, the rule hiding distance, the rule lost distance, the ghost rules and the data loss. In the equations of the fitness function, |HF| denotes the number of hiding failures, |LR| denotes the number of lost rules, RHD is the rule hiding distance, RLD is the rule lost distance, No_of_GR is the number of ghost rules (a ghost rule is a non-sensitive association rule that cannot be discovered from the original database but can be mined from the sanitized database), R is the total number of rules that can be mined with the given Minimum Support Threshold (MST) and Minimum Confidence Threshold (MCT), No_of_S denotes the number of transactions that are sanitized and Size_of_D denotes the size of the database.
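The binary encoding and initialization described above can be sketched as follows. This is a minimal illustration under stated assumptions: the flat bit-vector layout, the list of sensitive positions and the function name `initial_population` are hypothetical, and the fitness function itself is not reproduced here.

```python
import random
from typing import List


def initial_population(first: List[int],
                       sensitive_positions: List[int],
                       size: int,
                       seed: int = 0) -> List[List[int]]:
    """Build an initial population of binary-encoded elephants.

    The first elephant is the encoding of the original critical transactions.
    Every other elephant copies it and randomly re-draws only the bits that
    correspond to sensitive items.
    """
    rng = random.Random(seed)
    population = [list(first)]
    for _ in range(size - 1):
        elephant = list(first)
        for position in sensitive_positions:
            elephant[position] = rng.randint(0, 1)  # 1 = item present, 0 = item absent
        population.append(elephant)
    return population


# Toy encoding, invented for this example: bits 0 and 3 are sensitive items.
first_elephant = [1, 0, 1, 1, 0, 1]
sensitive_bits = [0, 3]
population = initial_population(first_elephant, sensitive_bits, size=5)
```

Restricting the random changes to the sensitive positions keeps every candidate solution close to the original database, which is the stated purpose of the pre-processing step.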
In each generation, the individual with minimum fitness in clan $c_x$ is selected as the matriarch $m^{t}_{x}$ at time $t$:

$$m^{t}_{x} = \arg\min_{e \in e_x} fit(e) \qquad (7)$$

In Equation (7), $e_x$ is the collection of individual elephants in clan $x$.
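The matriarch selection can be sketched in a few lines of Python. The fitness used below (the number of 1 bits) is only a stand-in chosen so that the example runs; it is not the paper's multi-term fitness function.

```python
from typing import Callable, List, Sequence


def select_matriarch(clan: Sequence[List[int]],
                     fitness: Callable[[List[int]], float]) -> List[int]:
    """Return the elephant with the minimum fitness value in the clan."""
    return min(clan, key=fitness)


def placeholder_fitness(elephant: List[int]) -> float:
    """Stand-in fitness for illustration only: counts the 1 bits."""
    return float(sum(elephant))


clan = [[1, 1, 0], [0, 1, 0], [1, 1, 1]]
matriarch = select_matriarch(clan, placeholder_fitness)
```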

b) Clan updating operator
Every elephant $y$ in clan $x$ has an old position $e^{t}_{x,y}$. Its new position $e^{t+1}_{x,y}$ is influenced by the clan matriarch $m^{t}_{x}$ according to Equation (8). In Equation (8), $\alpha$, $\beta$ and $\gamma$ are scaling factors in the range [0, 1] that determine, respectively, the influence of the clan matriarch on the elephant's new position, the affinity of the elephant to move towards the clan centre and the affinity of the elephant to walk randomly. $rand = (2 \times r - 1)(e_{max} - e_{min})$ is a random vector drawn from a uniform distribution, $e_{max}$ and $e_{min}$ are the upper and lower bounds of an individual elephant's position, and $c^{t}_{x}$ is the centre of the clan, which is calculated as

$$c^{t}_{x} = \frac{1}{Num_x} \sum_{y=1}^{Num_x} e^{t}_{x,y} \qquad (9)$$

In Equation (9), $Num_x$ is the number of elephants in clan $x$. To fix the convergence speed, the matriarch update operator in EEHOA is calculated by Equation (10), in which the matriarch's new position is a linear combination of its prior position. The three control factors ($\alpha$, $\beta$, $\gamma$) are used to control the convergence towards the clan centre and the random walk in parallel.
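The two ingredients of the clan update that are fully specified in the text, the clan centre and the random-walk vector $rand$, can be sketched as below. The sketch assumes real-valued positions stored as Python lists and draws $r$ independently per dimension, which the text does not state explicitly. The position update of Equation (8) and the matriarch update of Equation (10) are deliberately not implemented.

```python
import random
from typing import List


def clan_centre(clan: List[List[float]]) -> List[float]:
    """Per-dimension mean position of all elephants in one clan."""
    num = len(clan)
    dim = len(clan[0])
    return [sum(elephant[d] for elephant in clan) / num for d in range(dim)]


def random_walk_vector(dim: int, e_min: float, e_max: float,
                       rng: random.Random) -> List[float]:
    """rand = (2r - 1)(e_max - e_min), with r uniform in [0, 1).

    Assumption: r is drawn independently for every dimension.
    """
    return [(2.0 * rng.random() - 1.0) * (e_max - e_min) for _ in range(dim)]


clan = [[0.0, 2.0], [2.0, 4.0]]
centre = clan_centre(clan)
walk = random_walk_vector(dim=2, e_min=0.0, e_max=1.0, rng=random.Random(42))
```

Each component of `walk` lies between `-(e_max - e_min)` and `e_max - e_min`, so the scaling factor applied to it bounds how far a single random step can move an elephant.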

c) Separating operator
The separating operator, which is performed by the male elephants, can be modelled as

$$e^{t}_{x,worst} = e_{min} + (e_{max} - e_{min}) \times r \qquad (11)$$
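A minimal sketch of the separating operator follows. It assumes that the worst elephant is the one with the largest fitness value (the matriarch being the one with the smallest) and that $r$ is drawn independently per dimension; both are assumptions, as are the function and variable names.

```python
import random
from typing import Callable, List


def separate(clan: List[List[float]],
             fitness: Callable[[List[float]], float],
             e_min: float,
             e_max: float,
             rng: random.Random) -> List[List[float]]:
    """Replace the worst elephant of a clan by a uniformly random position.

    Implements e_worst = e_min + (e_max - e_min) * r with r uniform in [0, 1).
    """
    worst_index = max(range(len(clan)), key=lambda i: fitness(clan[i]))
    dim = len(clan[worst_index])
    new_clan = [list(elephant) for elephant in clan]  # leave the input untouched
    new_clan[worst_index] = [e_min + (e_max - e_min) * rng.random() for _ in range(dim)]
    return new_clan


clan = [[0.1, 0.2], [0.9, 0.9], [0.3, 0.1]]
new_clan = separate(clan, fitness=sum, e_min=0.0, e_max=1.0, rng=random.Random(7))
```

For a binary encoding, the real-valued result would still have to be mapped to 0 or 1, for example with a floor function after scaling.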

Selection of optimal solution based on crowding distance
In COA4ARH, a linear searching strategy is applied to the fitness function to select the best solution for sanitizing the database. The limitation of the linear searching strategy is its high computation time. So in EEHOA, the crowding distance concept is used to select the best solution for ARH. The multiple objective functions in the fitness function do not interact with each other and may conflict with each other. This is known as a multi-objective optimization problem and it is formulated as

$$\min_{e \in \Omega} \overrightarrow{fit}(e) = \left(fit_1(e), fit_2(e), \ldots, fit_{N_{obj}}(e)\right) \qquad (12)$$

In Equation (12), $\Omega$ is the decision space and $e \in \Omega$ is a decision vector. One of the most popular ways to solve the multi-objective optimization problem is to find a Pareto optimal set using crowding distance. A Pareto optimal solution is a solution around which there is no way of improving any objective without degrading at least one other objective. The definition and description of the Pareto set are as follows. A vector $E = (e_1, e_2, \ldots, e_{N_{obj}})$ is said to dominate another vector $E^* = (e^*_1, e^*_2, \ldots, e^*_{N_{obj}})$, denoted as $E \prec E^*$, if $e_i \le e^*_i$ for all $i \in \{1, 2, \ldots, N_{obj}\}$ and $E \ne E^*$. A feasible solution $e^* \in \Omega$ is called a Pareto optimal solution when there is no $e \in \Omega$ such that $\overrightarrow{fit}(e) \prec \overrightarrow{fit}(e^*)$. The collection of all Pareto optimal solutions is called the Pareto Set (PS), which is given as follows:

$$PS = \{\, e^* \in \Omega \mid \nexists\, e \in \Omega : \overrightarrow{fit}(e) \prec \overrightarrow{fit}(e^*) \,\} \qquad (13)$$

The non-dominated elephants in $E_i$ are saved into an external repository rep. At every iteration, the non-dominated solutions are compared one by one with the solutions in rep. When a new solution is dominated by any member of rep, it is discarded. Otherwise, it is added to rep. After the new solution has been added, any solutions in rep that are dominated by it are removed. This process continues until the maximum number of iterations is reached.
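The dominance test and the repository update described above can be sketched as follows, assuming that every objective is minimized and that solutions are represented only by their tuples of objective values. The function names and the sample vectors are illustrative.

```python
from typing import List, Sequence, Tuple

Objectives = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True when a is no worse than b in every objective and better in at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def update_repository(rep: List[Objectives], candidate: Objectives) -> List[Objectives]:
    """Insert a candidate into the external repository of non-dominated solutions."""
    if candidate in rep or any(dominates(member, candidate) for member in rep):
        return rep  # candidate is a duplicate or dominated: discard it
    kept = [member for member in rep if not dominates(candidate, member)]
    kept.append(candidate)
    return kept


rep: List[Objectives] = []
for objective_vector in [(3, 3), (1, 4), (2, 2), (4, 1), (5, 5)]:
    rep = update_repository(rep, objective_vector)
print(rep)
```

In this toy run, (3, 3) is removed once (2, 2) arrives and (5, 5) is never admitted, so the repository ends up holding only mutually non-dominated vectors.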
To reduce the time consumption and to generate an optimal set, the Crowding Distance (CD) mechanism is combined with EEHOA. The CD value of a solution is an estimate of the density of the solutions neighbouring that solution. CD is computed by first arranging the collection of solutions in decreasing order of objective function values. The CD value of a particular solution is the average distance of its two neighbouring solutions. The boundary solutions, which have the highest and lowest objective function values, are given infinite CD values so that they are always chosen. The overall CD value is computed as the sum of the individual distance values related to every objective in the fitness function. The CD value is calculated by Equation (14). A solution $s_1$ is said to constrained-dominate a solution $s_2$ when any of the following criteria is true:
• Solution $s_1$ is feasible and solution $s_2$ is not.
• Both solutions $s_1$ and $s_2$ are infeasible, but solution $s_1$ has a smaller overall constraint violation.
• Both solutions $s_1$ and $s_2$ are feasible and solution $s_1$ dominates solution $s_2$.
When two feasible elephants are compared, the elephant that dominates the other is considered the better solution. On the other hand, when both elephants are infeasible, the elephant with the smaller number of constraint violations is considered the better solution.
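The crowding distance computation described in this subsection can be sketched as below. Because the paper's own CD equation is not reproduced in this text, the sketch uses the widely known NSGA-II form (per-objective neighbour gap normalized by the objective range) as an assumption. It sorts in ascending order, which gives the same distances as a descending sort.

```python
import math
from typing import List, Sequence


def crowding_distances(objectives: List[Sequence[float]]) -> List[float]:
    """Crowding distance of each solution, summed over all objectives.

    Boundary solutions of every objective receive an infinite distance so
    that they are always preferred.
    """
    n = len(objectives)
    if n == 0:
        return []
    distances = [0.0] * n
    for k in range(len(objectives[0])):
        order = sorted(range(n), key=lambda i: objectives[i][k])
        f_min = objectives[order[0]][k]
        f_max = objectives[order[-1]][k]
        distances[order[0]] = math.inf
        distances[order[-1]] = math.inf
        if f_max == f_min:
            continue  # all values equal: this objective adds no information
        for pos in range(1, n - 1):
            gap = objectives[order[pos + 1]][k] - objectives[order[pos - 1]][k]
            distances[order[pos]] += gap / (f_max - f_min)
    return distances


front = [(1, 4), (2, 2), (4, 1)]
distances = crowding_distances(front)
```

A larger distance means a less crowded region, so preferring solutions with a larger CD value keeps the repository spread along the Pareto front.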

Results and discussion
To analyse the efficiency of EEHOA4ARH-CD, the method is executed on the chess, mushroom and bank marketing databases. The proposed EEHOA4ARH-CD is compared with the existing COA4ARH and PSO [12] methods in terms of hiding failure, lost rule and execution time. The existing and proposed ARH methods are implemented in MATLAB (Version 2018a) and run on Microsoft Windows 7 with an Intel processor running at 2.70 GHz and 4 GB of memory. The characteristics of the chess, mushroom and bank marketing databases are shown in Table 1. Table 2 shows the parameter settings of EEHOA4ARH-CD. The download links of these datasets are provided in [26-28].

Hiding failure
Hiding Failure (HF) denotes the number of sensitive rules that the sanitization algorithm could not hide and that can still be mined from the sanitized data. HF is calculated as

$$HF = \frac{|R_s(D')|}{|R_s(D)|}$$

where $|R_s(D')|$ is the number of sensitive rules explored in the sanitized database $D'$ and $|R_s(D)|$ is the number of sensitive rules explored in the original database $D$.
Figure 3 shows the hiding failure of the PSO, COA4ARH and EEHOA4ARH-CD methods on the chess, mushroom and bank marketing databases. The X axis denotes the number of iterations and the Y axis denotes the hiding failure. When the number of iterations is 6, the hiding failure of EEHOA4ARH-CD on the chess database is 40.74% and 20% less than that of the PSO and COA4ARH methods, respectively. This analysis shows that the proposed EEHOA4ARH-CD method has less hiding failure than the other methods on the three databases.
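As a small illustration, hiding failure can be computed from two rule sets. The representation of rules as strings and the sample rules are invented for this example.

```python
from typing import Set


def hiding_failure(sensitive_rules: Set[str], rules_after: Set[str]) -> float:
    """Fraction of the sensitive rules that can still be mined after sanitization."""
    still_mined = sensitive_rules & rules_after
    return len(still_mined) / len(sensitive_rules)


sensitive_rules = {"a->c", "b->d", "a->e", "c->d"}   # sensitive rules mined from D
rules_after = {"a->c", "x->y"}                        # all rules mined from D'
hf = hiding_failure(sensitive_rules, rules_after)
```

A value of 0 means that every sensitive rule was hidden, and a value of 1 means that none was.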

Lost rule
Lost Rule (LR) denotes the number of non-sensitive rules that are lost because of the association rule hiding method; a lost rule can no longer be mined from the sanitized database $D'$. LR is calculated as the relative decrease in the number of non-sensitive rules between $D$ and $D'$. Figure 4 shows the lost rule of the PSO, COA4ARH and EEHOA4ARH-CD methods on the chess, mushroom and bank marketing databases. The X axis denotes the number of iterations and the Y axis denotes the lost rule. When the number of iterations is 6, the lost rule of EEHOA4ARH-CD on the chess database is 37.25% and 11.11% less than that of the PSO and COA4ARH methods, respectively. This analysis shows that the proposed EEHOA4ARH-CD method has fewer lost rules than the other methods on the three databases.
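The lost rule measure can be sketched in the same way as hiding failure. The sketch counts only the original non-sensitive rules that survive sanitization, so that ghost rules do not mask losses; this choice and the sample rules are assumptions made for illustration.

```python
from typing import Set


def lost_rule(non_sensitive_before: Set[str], rules_after: Set[str]) -> float:
    """Fraction of the original non-sensitive rules that are missing after sanitization."""
    preserved = non_sensitive_before & rules_after
    return (len(non_sensitive_before) - len(preserved)) / len(non_sensitive_before)


non_sensitive_before = {"a->b", "b->c", "c->e", "d->e", "e->f"}  # mined from D
rules_after = {"a->b", "b->c", "c->e", "d->e", "x->y"}            # mined from D'
lr = lost_rule(non_sensitive_before, rules_after)
```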

Execution time
Execution time denotes the amount of time taken by ARH methods to sanitize the transaction database.
The execution time of the PSO, COA4ARH and EEHOA4ARH-CD methods on the three datasets is shown in Figure 5. The X axis denotes the databases and the Y axis denotes the execution time in seconds. On the chess database, the execution time of the EEHOA4ARH-CD method is 25.37% and 10.71% less than that of PSO and COA4ARH, respectively. This analysis shows that the proposed EEHOA4ARH-CD has a lower execution time than the compared methods for ARH on the three databases.

Conclusion
In this paper, EEHOA4ARH-CD is proposed for ARH with a fast convergence rate and good exploitation capability. In EEHOA4ARH-CD, the clan updating operator is modified to avoid the problem of slow convergence, which improves the exploration phase and increases population diversity. EEHOA4ARH also addresses the skewed distribution of the initial elephant population by using the separating operator. The time needed to select the best solution is reduced by selecting a Pareto optimal set using the crowding distance concept. Finally, the experiments on the chess, mushroom and bank marketing databases showed that the proposed EEHOA4ARH-CD method achieves less hiding failure, fewer lost rules and a lower execution time than the PSO and COA4ARH methods. In future, big data analytics can be included.
In Equation (11), $e_{min}$ and $e_{max}$ are the lower and upper bounds of the individual elephant position, respectively, and $e^{t}_{x,worst}$ is the worst individual elephant in clan $c_x$. For the separating operator, the probability density function starts with $r$, a Pseudo Random Number Generator (PRNG) function that creates a uniformly distributed random number in the interval [0, 1]. $r$ has to be scaled and shifted to create a uniformly distributed random number in the range $[e_{min}, e_{max})$. A floor function is then used to create a uniformly distributed random integer in a specified range: applying the floor function to values in $[e_{min}, e_{max})$ yields the integers from $e_{min}$ to $e_{max} - 1$, hence a continuous uniform distribution in the range $[e_{min}, e_{max}]$ is used. The overall flow of EEHOA-based sanitization of the transaction database is shown in Figure 2.

Figure 2. Overall flow of EEHOA-based sanitization of transaction database.

Figure 3. Comparison of hiding failure on (a) chess, (b) mushroom and (c) bank marketing database.
The lost rule measure is calculated as $LR = \frac{|{\sim}R_s(D)| - |{\sim}R_s(D')|}{|{\sim}R_s(D)|}$, where $|{\sim}R_s(D)|$ is the number of non-sensitive rules explored in $D$ and $|{\sim}R_s(D')|$ is the number of non-sensitive rules explored in $D'$.

Figure 4. Comparison of lost rule on (a) chess, (b) mushroom and (c) bank marketing database.

Figure 5. Comparison of execution time.

EEHOA4ARH-CD algorithm (fragment):
Save the non-dominated vectors found in E_i into rep.
while gen_count < max_gen_count
    Calculate the CD values of each non-dominated solution in the archive rep using Equation (14).
    for x = 1 to N_clans
        Randomly choose the global best guide for E_x from a specified top portion of the sorted archive rep and store its position to the best elephant.
        …
        Insert all new non-dominated solutions in E_x into rep if they are not dominated by any of the stored solutions. All solutions in the archive that are dominated by the new solution are removed from the archive.
        If the archive is full, the solution to be replaced is determined

Table 1. Database characteristics.