A General Method for Mining High-Utility Itemsets with Correlated Measures

ABSTRACT Discovering high-utility itemsets from a transaction database is one of the important tasks in High-Utility Itemset Mining (HUIM). The discovered high-utility itemsets (HUIs) must meet a user-defined minimum utility threshold. Several methods have been proposed to solve this problem efficiently; however, they focus only on discovering the set of HUIs and ignore the correlation among the patterns. This research proposes a more generalized approach to mine HUIs using any user-specified correlated measure, named the General Method for Correlated High-utility itemset Mining (GMCHM). The proposed approach is able to discover HUIs that are highly correlated according to the all_confidence and bond measures, as well as 38 other correlated measures. Evaluations were carried out on standard HUIM datasets, such as Accidents, BMS_utility and Connect. The results demonstrate the high effectiveness of GMCHM in terms of running time, memory usage and the number of scanned candidates.


Introduction
The task of High-Utility Itemset Mining aims to discover itemsets from customer transaction databases whose utility satisfies a user-specified threshold. Several algorithms have been proposed to perform this task efficiently, such as Two-Phase (Liu et al., 2005), HUI-Miner (Liu & Qu, 2012), EFIM (Zida et al., 2017), HMiner (Krishnamoorthy, 2017), MEFIM and iMEFIM (Nguyen et al., 2019a), EFIM-Closed (Nguyen et al., 2019b) and HMiner-Closed (Nguyen et al., 2019b). However, these algorithms only consider the utility measure of the patterns and ignore the correlation among the high-utility patterns.
Recently, Fournier-Viger et al. proposed an approach for mining HUIs that takes the patterns' correlation into account. The authors developed the FCHM algorithm (Fournier-Viger et al., 2020) to discover HUIs using correlated measures that satisfy the downward-closure property (DCP). Fast measure-calculation techniques and pruning strategies for each measure were also presented (the authors' proposals are based on the all_confidence and bond measures). These DCP-based strategies allow the FCHM algorithm (Fournier-Viger et al., 2020) to outperform the FHM algorithm (Fournier-Viger et al., 2014).
Motivation: Fournier-Viger et al. pointed out the importance of considering the itemsets' correlation in HUIM, and developed a solution to the problem. However, this approach still has the following drawbacks: (1) The correlation measures must satisfy the DCP; that is, for a correlation measure m_C, the condition m_C(X) ≥ m_C(Y), ∀X ⊆ Y, must hold. To apply a correlation measure in the mining process, it must first be proven that the DCP is satisfied, and thus several correlation measures are not applicable.
(2) Each measure requires different strategies to calculate and prune to enhance the performance of the corresponding mining task.
To address the above limitations, this work proposes a more generalized approach to mine HUIs using any correlated measure. Based on previous works on correlation measures (Vo & Le, 2011; Geng & Hamilton, 2006; Huynh et al., 2007), we present an algorithm capable of fast and efficient calculation of the measures by storing all the required information. This substantially improves the effectiveness of mining HUIs with correlated measures. Our major contributions are as follows.
(1) The proposed approach is a HUIM-class algorithm, named GMCHM, which considers the high-correlation property among itemsets within customer transaction databases.
(2) Previous algorithms must use a different strategy for each respective measure. Our work proposes an approach to store all necessary information to compute the given measure in only one database scan.
(3) Evaluation studies on standard databases illustrate that the new GMCHM algorithm is more efficient than the existing algorithms in terms of execution time, memory consumption, and the number of checked candidates. This is discussed in the Experimental Study section of this paper.
The rest of this research is structured as follows. The Related Work section surveys the related works in the field of HUIM with correlated measures. The Preliminaries section presents the preliminaries and definitions. The next section, GMCHM Algorithm, proposes the GMCHM algorithm, along with a new data structure to store the information required while mining HUIs with a specific correlation measure. Evaluation studies are carried out and discussed in the Experimental Study section. Finally, the Conclusion and Future Work section draws conclusions and discusses future improvements of the work.

Related Work
The HUIM problem was initially investigated in 2004 by Yao et al. Then, Liu et al. introduced the Two-Phase algorithm (Liu et al., 2005) to mine HUIs in two stages, as its name implies. Based on the pattern-growth approach, Tseng et al. extended the FP-Growth algorithm (Han et al., 2004) to develop the UP-Growth algorithm (Tseng et al., 2010), which uses the UP-Tree structure. Later, the authors improved that algorithm and introduced the UP-Growth+ algorithm (Srinivasa Rao & Krishna Prasad, 2012). Then Liu and Qu (2012) proposed the HUI-Miner algorithm to mine HUIs in only one phase. The authors also introduced a novel and efficient structure called a utility-list, in which each element is a tuple containing three fields: TID, iutil and rutil. Later, the HMiner algorithm (Krishnamoorthy, 2017) was proposed, using five different pruning strategies to improve the performance of the HUI mining process. The EFIM algorithm (Zida et al., 2017) proposed by Zida et al. incorporates two new techniques to reduce the cost of database scans and memory consumption, named high-utility database projection (HDP) and high-utility transaction merging (HTM). By applying these two strategies, EFIM (Zida et al., 2017) is much more effective in mining HUIs from dense databases. However, sparse databases contain many short or non-identical transactions, and HTM then faces a major performance issue in terms of long execution time. To overcome this issue, Nguyen et al. presented a new extension of the EFIM algorithm, named iMEFIM (Nguyen et al., 2019a), which relies on a novel P-set structure to significantly reduce the cost of database scans. Furthermore, the authors also proposed a novel utility framework enabling previous HUIM algorithms to extract HUIs from databases with dynamic profits.
Mining patterns based on correlated measures aims to find itemsets that satisfy a user-specified correlated threshold, such as association rules or sets of highly correlated itemsets within a database. The use of a database's correlated measures was first proposed by Agrawal et al. to mine association rules (ARs) (Agrawal et al., 1993) based on the itemsets' occurrence frequency (or support). Later on, several works on association rule mining based on other measures were introduced, for instance confidence, lift, chi-square, cosine, gini-index, Laplacian, and so on, up to 40 different measures (Vo & Le, 2011).

Preliminaries
The mining high-correlated itemsets problem is to discover all itemsets that satisfy the following conditions: they must be high-utility itemsets and highly-correlated based on a correlated measure specified by the user. This section states the problem and preliminary concepts with regard to the correlated HUI mining task.
Definition 1 (Transaction database). A database containing a set of transactions {T_1, T_2, ..., T_n} is called a transaction database, denoted as D. The sample data and structure are described in Table 1. Let I = {i_1, i_2, ..., i_m} be the set of all distinct items in D, and let T_q be a transaction in D with the unique identifier q, containing a set of items such that T_q ⊆ I. Each item i in T_q is assigned a non-negative integer value called the quantity, denoted as q(i, T_q). Furthermore, item i also has another associated value, a positive integer called the unit profit and denoted as p(i) (Liu et al., 2005; Zida et al., 2017). Table 2 presents the unit profit of every item from D given in Table 1.
Definition 2 (Utility of an item, an itemset).
- The utility of an item i in a transaction T_q, denoted as u(i, T_q), is defined as the product of its quantity in T_q and its unit profit: u(i, T_q) = p(i) × q(i, T_q) (Liu et al., 2005).
- The utility of an itemset X in a transaction T_q, denoted as u(X, T_q), is the sum of the utility values of every item i of X in T_q: u(X, T_q) = Σ_{i ∈ X ∧ X ⊆ T_q} u(i, T_q) (Liu et al., 2005; Zida et al., 2015).
- The utility of an itemset X within the whole database D, denoted as u(X), is the sum of the utilities of X over all the transactions containing X: u(X) = Σ_{T_q ∈ D ∧ X ⊆ T_q} u(X, T_q) (Liu et al., 2005).
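To make Definition 2 concrete, the utility computations can be sketched as follows. The items, quantities and unit profits below are invented for illustration and do not correspond to the data of Tables 1 and 2:

```java
import java.util.*;

public class UtilitySketch {
    // u(i, Tq) = p(i) * q(i, Tq); a transaction maps each item to its quantity.
    static int itemUtility(String item, Map<String, Integer> transaction, Map<String, Integer> profit) {
        return profit.get(item) * transaction.get(item);
    }

    // u(X, Tq) = sum of u(i, Tq) over i in X, provided X is contained in Tq (else 0).
    static int itemsetUtility(Set<String> x, Map<String, Integer> transaction, Map<String, Integer> profit) {
        if (!transaction.keySet().containsAll(x)) return 0;
        int sum = 0;
        for (String i : x) sum += itemUtility(i, transaction, profit);
        return sum;
    }

    // u(X) = sum of u(X, Tq) over all transactions of D containing X.
    static int databaseUtility(Set<String> x, List<Map<String, Integer>> db, Map<String, Integer> profit) {
        int sum = 0;
        for (Map<String, Integer> t : db) sum += itemsetUtility(x, t, profit);
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> profit = Map.of("a", 5, "b", 2, "c", 1);   // hypothetical unit profits
        List<Map<String, Integer>> db = List.of(
            Map.of("a", 1, "b", 2),          // T1: u({a,b}, T1) = 5*1 + 2*2 = 9
            Map.of("a", 2, "c", 3),          // T2: does not contain b, contributes 0
            Map.of("a", 1, "b", 1, "c", 2)); // T3: u({a,b}, T3) = 5*1 + 2*1 = 7
        System.out.println(databaseUtility(Set.of("a", "b"), db, profit)); // prints 16
    }
}
```

Running the sketch confirms that u({a, b}) aggregates only over the transactions that contain both items.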
Definition 4 (Support, support-bin).
- The support of an itemset X, denoted as supp(X), is the occurrence frequency of that itemset within the database D: supp(X) = |{T_q ∈ D | X ⊆ T_q}|.
- The support-bin allows the support of every single item to be computed in one database scan: initialize the support-bin array supp to zeros; then, for each transaction T_q in the database D, increase supp[z] by one for each item z ∈ T_q.

Definition 5 (Measures). Let X and Y be two itemsets in database D; let n be the number of transactions in D, n = |D|; let n_X, n_Y and n_XY be the occurrence frequencies of X, Y and XY in D, respectively; let n_¬X, n_¬Y, n_¬XY, n_X¬Y and n_¬XY denote the non-occurrence counts of X, of Y, of XY, of X-and-not-Y, and of not-X-and-Y, respectively; and let dissup(XY) be the disjunctive support, i.e., the occurrence count of X, Y or XY: dissup(XY) increases by one for each transaction that contains X, Y or both (Vo & Le, 2011; Geng & Hamilton, 2006; Huynh et al., 2007). The correlation measures are then defined from these quantities; for example, the two measures used throughout this paper are all_confidence(XY) = n_XY / max(n_X, n_Y) and bond(XY) = n_XY / dissup(XY).

Definition 7 (Correlated high-utility itemset). An itemset is called a correlated high-utility itemset if and only if it is a HUI and its interestingness measure is no less than the minimum correlated threshold, where the minimum correlated threshold is specified by the user prior to mining.
Definition 8 (Correlated high-utility itemset mining). With all the definitions and preliminaries presented, the task of Correlated High-utility Itemset Mining (CHIM) using an interestingness measure is to discover the complete set of high-utility itemsets and highly correlated itemsets based on the user-specified interestingness measure.
Example: let minutil = 65 and let the correlated threshold vary in {0.0, 0.5, 0.7}, using the two interestingness measures all_confidence and bond. Table 3 below shows the set of correlated high-utility itemsets discovered from the database D given in Tables 1 and 2 above. The current manuscript also uses several known definitions from the tasks of high-correlated itemset mining and high-utility itemset mining: interestingness measures, the TWU pruning strategy (Liu et al., 2005), local utility (Zida et al., 2015), sub-tree utility (Zida et al., 2015), and the primary and secondary sets (Zida et al., 2015) to eliminate unpromising candidates and reduce the search space. In addition, the transaction merging (Zida et al., 2015) and database projection (Zida et al., 2015) techniques are also used to further enhance the performance of the mining task.
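The two measures used in the example above can be computed directly from the definitions. The following sketch implements all_confidence and bond from their standard formulas on a small invented database (the item names and transactions are illustrative only):

```java
import java.util.*;

public class MeasureSketch {
    // sup(X): number of transactions containing every item of X.
    static long sup(Set<String> x, List<Set<String>> db) {
        return db.stream().filter(t -> t.containsAll(x)).count();
    }

    // dissup(X): number of transactions containing at least one item of X.
    static long dissup(Set<String> x, List<Set<String>> db) {
        return db.stream().filter(t -> x.stream().anyMatch(t::contains)).count();
    }

    // all_confidence(X) = sup(X) / max{ sup({i}) | i in X }
    static double allConfidence(Set<String> x, List<Set<String>> db) {
        long maxItemSup = x.stream().mapToLong(i -> sup(Set.of(i), db)).max().orElse(0);
        return maxItemSup == 0 ? 0 : (double) sup(x, db) / maxItemSup;
    }

    // bond(X) = sup(X) / dissup(X)
    static double bond(Set<String> x, List<Set<String>> db) {
        long d = dissup(x, db);
        return d == 0 ? 0 : (double) sup(x, db) / d;
    }

    public static void main(String[] args) {
        List<Set<String>> db = List.of(
            Set.of("a", "b"), Set.of("a", "c"), Set.of("a", "b", "c"), Set.of("b"));
        Set<String> ab = Set.of("a", "b");
        System.out.println(allConfidence(ab, db)); // sup(ab)=2, max(sup(a), sup(b))=3 -> 2/3
        System.out.println(bond(ab, db));          // sup(ab)=2, dissup(ab)=4 -> 0.5
    }
}
```

With a correlated threshold of 0.5, the itemset {a, b} would pass both measures in this toy database, and would be kept whenever it is also a HUI.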

GMCHM Algorithm
In general, all the correlation measures can be computed from a small set of values: the support of the currently considered itemset (support of X), the support of the extension itemset (support of Y), the support of XY, the disjunctive support of XY, and the total number of transactions n in database D. The remaining values can easily be derived via Definition 5.
To quickly identify the correlation of each itemset in the list of HUIs, we extend the structure to store the HUIs as follows.
- The TID-set records the occurrence of each item in the transactions during the first database scan. The constructed TID-set significantly speeds up the computation of the disjunctive support of an itemset once the itemset has been identified as a HUI.
- The support-bin stores the support of every single item i ∈ I, so that the support of any item in D can be retrieved quickly and conveniently.
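A minimal sketch of building both structures in a single database scan is given below; the item encoding and the class and variable names are illustrative and are not taken from the GMCHM source code:

```java
import java.util.*;

public class StructureSketch {
    static int[] supportBin;        // supportBin[i] = sup({i})
    static List<BitSet> tidSet;     // tidSet.get(i) = set of TIDs whose transaction contains item i

    // One scan over the database fills both the support-bin and the TID-set.
    static void buildStructures(int[][] db, int nItems) {
        supportBin = new int[nItems];
        tidSet = new ArrayList<>();
        for (int i = 0; i < nItems; i++) tidSet.add(new BitSet(db.length));
        for (int tid = 0; tid < db.length; tid++)
            for (int item : db[tid]) {
                supportBin[item]++;         // support of single items, per Definition 4
                tidSet.get(item).set(tid);  // record occurrence for later dissup computations
            }
    }

    // Disjunctive support of an itemset: cardinality of the union of its items' TID-sets.
    static int dissup(int[] items, List<BitSet> tidSet) {
        BitSet union = new BitSet();
        for (int item : items) union.or(tidSet.get(item));
        return union.cardinality();
    }

    public static void main(String[] args) {
        // Hypothetical database with items encoded as integers 0..3.
        int[][] db = { {0, 1}, {0, 2}, {0, 1, 2}, {1, 3} };
        buildStructures(db, 4);
        System.out.println(supportBin[0]);                    // prints 3 (item 0 occurs in T0, T1, T2)
        System.out.println(dissup(new int[]{1, 2}, tidSet));  // prints 4 (T0, T1, T2, T3)
    }
}
```

After the scan, the support of an item is a single array lookup and the disjunctive support of any candidate reduces to a bitwise OR, which is the property the GMCHM structures exploit.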
Based on the above structure, we modified the EFIM algorithm (Zida et al., 2017) to discover HUIs combined with all the techniques to calculate the correlation measure into a new algorithm, called the General Method for Correlated High-utility itemset Mining (GMCHM). The GMCHM algorithm first discovers all HUIs, then computes the correlation measure using the discovered set of HUIs.
Algorithm 1: The GMCHM algorithm
Input: D: a transaction database; minutil: the user-defined minimum utility threshold; mincore: the user-defined correlated threshold
Output: the list of high-utility and correlated itemsets
1 X = ∅;
2 Scan D to build the TID-set and the support-bin; for each item i ∈ I, calculate lu(X, i) and sup(i);
3 Secondary(X) = {i | i ∈ I ∧ lu(X, i) ≥ minutil};
4 Sort Secondary(X) in ascending order of lu(X, i);
5 Scan D to eliminate every item i ∉ Secondary(X) from the transactions, and remove the empty transactions;
6 Sort the transactions in D by the item order of Secondary(X), in ascending order when read backward;
7 Set su(X, i) = the sub-tree utility of each item i ∈ Secondary(X) by scanning D and using a utility-bin array;
8 Primary(X) = {i | i ∈ Secondary(X) ∧ su(X, i) ≥ minutil};
9 RETURN Search(X, D, Primary(X), Secondary(X), minutil, mincore);

Procedure: Search
Input: X: an itemset; D_X: the projected database of X; Primary(X): the primary items of X; Secondary(X): the secondary items of X; the minutil and mincore thresholds
Output: the set of correlated high-utility itemsets that are extensions of X
1 FOR EACH item i ∈ Primary(X) DO
2   β = X ∪ {i};
3   Scan D_X to calculate u(β) and create D_β;
4   IF u(β) ≥ minutil AND measure(β) ≥ mincore THEN output β;
5   Calculate su(β, z) and lu(β, z) for all items z ∈ Secondary(X) by scanning D_β once, using two utility-bin arrays;
6   Primary(β) = {z ∈ Secondary(X) | su(β, z) ≥ minutil};
7   Secondary(β) = {z ∈ Secondary(X) | lu(β, z) ≥ minutil};
8   Search(β, D_β, Primary(β), Secondary(β), minutil, mincore);
9 END FOR

The input parameters of the algorithm are the transaction database D, the minimum utility threshold minutil, and the minimum correlated threshold mincore (for all_confidence, bond, etc.). The two thresholds are user-specified. The major modifications of the algorithm are on line #2 of the GMCHM algorithm (Algorithm 1) and line #4 of the Search procedure. Line #2 of the GMCHM algorithm constructs the TID-set and the support-bin. Line #4 of the Search procedure calculates the correlated measure of an itemset in addition to its utility. Line #3 of the GMCHM algorithm uses the local utility of every single item i ∈ I to construct the Secondary set; at this step, X = ∅. Line #4 then sorts the Secondary set in ascending order of the items' TWU values. Line #5 removes every item i ∈ I that does not appear in the Secondary set and deletes any empty transactions. Line #6 sorts all transactions in D in the ascending order of the items in Secondary when read backward. The block from line #4 to line #6 aims to reduce memory consumption and database scan costs. The GMCHM algorithm then constructs the Primary set from the items of Secondary whose sub-tree utility is no less than minutil.
The algorithm then passes all the gathered information into the recursive Search procedure in order to start the depth-first search process on the search space.
The recursive Search procedure performs the main correlated HUI mining task using the information passed from the main GMCHM algorithm. Its input parameters are the itemset X to be extended, the projected database on X (denoted D_X), the Primary and Secondary sets of X, the minimum utility threshold minutil and the minimum correlated measure threshold mincore. Let β be the itemset X extended with a single item i ∈ Primary(X) (line #2), β = X ∪ {i}. Line #3 scans the projected database D_X to calculate the utility value of β and to construct the projected database of β, denoted D_β. This step also performs transaction merging if necessary. Line #4 determines whether the itemset β is a correlated high-utility itemset by comparing its utility value u(β) and its measure(β) against the specified thresholds minutil and mincore, respectively, where measure(β) is the user-defined function that computes the correlated measure of the itemset β. If the itemset β satisfies both conditions, it is a correlated high-utility itemset and is output.
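For instance, assuming the per-item TID-sets built during the first database scan, the measure(β) check for the bond measure could be evaluated as sketched below; the class and method names are hypothetical, not from the GMCHM implementation:

```java
import java.util.*;

public class MeasureEvalSketch {
    // bond(X) from per-item TID-sets: sup(X) = |intersection of tid(i)|, dissup(X) = |union of tid(i)|.
    static double bond(List<Integer> itemset, Map<Integer, BitSet> tidSets) {
        BitSet inter = (BitSet) tidSets.get(itemset.get(0)).clone();
        BitSet union = (BitSet) tidSets.get(itemset.get(0)).clone();
        for (int k = 1; k < itemset.size(); k++) {
            inter.and(tidSets.get(itemset.get(k)));  // conjunctive support
            union.or(tidSets.get(itemset.get(k)));   // disjunctive support
        }
        return union.isEmpty() ? 0 : (double) inter.cardinality() / union.cardinality();
    }

    public static void main(String[] args) {
        Map<Integer, BitSet> tidSets = new HashMap<>();
        tidSets.put(1, BitSet.valueOf(new long[]{0b0111}));  // item 1 occurs in T0, T1, T2
        tidSets.put(2, BitSet.valueOf(new long[]{0b0110}));  // item 2 occurs in T1, T2
        double m = bond(List.of(1, 2), tidSets);             // sup = 2, dissup = 3 -> 2/3
        System.out.println(m >= 0.5);                        // prints true: passes mincore = 0.5
    }
}
```

Because the TID-sets are stored once at line #2 of Algorithm 1, this check is a pair of bitwise operations per extension rather than an extra database scan, which is why bond is costlier than all_confidence (it needs the intersections) but still cheap.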

Experimental Study
To validate the effectiveness of the GMCHM algorithm, several experiments were carried out on a machine with the following configuration: an Intel® Core™ i5-10400F processor @ 2.90 GHz, 8 GB of RAM, running the Windows 10 Pro 64-bit operating system. The algorithm was implemented in Java using JDK 8. The common standard testing databases from SPMF, an open-source data mining library (Note 1), are used, and their characteristics are shown in Table 4.
Besides the execution time and memory usage in each test case with different thresholds, the number of generated candidates was also recorded. The recorded data were then compared against those of the FCHM algorithm using all_confidence and bond. For a fair comparison, the Java source code of the FCHM algorithm was obtained from SPMF (Note 2). Throughout the experiments, the minutil threshold was varied from 0.04% to 28%, and the mincore value was varied in {0.0, 0.1, 0.3, 0.5}. Figures 1-6 present the comparison results of mining HUIs with mincore = 0.1 on the different datasets.
Runtime: For each tested dataset, the obtained results are shown in Figures 1-3. GMCHM dominates the execution time test: it is up to 7.5 times faster than the FCHM algorithm using the bond measure and up to 2,600 times faster using the all_confidence measure. The same behaviour is also observed on the 38 other measures. The performance of GMCHM is mainly due to the transaction merging and database projection strategies, which significantly reduce the database scan costs. As seen in Figure 1, the runtime of GMCHM is 2 to 170 times faster than that of FCHM with respect to the all_confidence and bond measures. A similar speed-up is shown in Figure 2 on the BMS database: 1.1 to 2 times faster when using bond, and 90 to 170 times faster than the FCHM algorithm for the all_confidence measure. On the Connect database, shown in Figure 3, the proposed algorithm is 3.5 to 7.5 times faster on the bond measure; with the all_confidence measure, a particularly large speed boost of 700 to 2,600 times is observed.
Table 6 shows that, when the correlation threshold changes, the execution time of the GMCHM algorithm does not change significantly and is lower than that of the FCHM algorithm in most cases; the algorithm thus performs much better than the original FCHM algorithm. As the minimum correlated threshold varies, the runtime of GMCHM decreases only slightly, while that of FCHM tends to fall by a large amount. This is because FCHM uses the pruning strategy that relies on the downward-closure property of the all_confidence and bond measures. For the GMCHM algorithm, the runtime with the bond measure is higher than with all_confidence and the remaining 38 measures, since bond requires intersections of the itemsets' TID-sets.

Memory usage:
The results in Figures 4-6 show that, for all tested minimum utility and correlated thresholds, the memory usage of the GMCHM algorithm is much better than that of FCHM with both the all_confidence and bond measures, by a factor of 2 to 10. The high memory consumption of the FCHM algorithm is due to the EUCS structure (Fournier-Viger et al., 2020), which is a two-dimensional array storing the utility and support information of item pairs. In addition, the FCHM algorithm adopts a bit-vector approach to calculate the bond measure (Fournier-Viger et al., 2020) and uses the OR operator to compute the disjunctive support of an itemset, which has O(m × n) complexity, where m is the length of the currently considered transaction and n is the number of transactions to be scanned. This increases the memory consumption of the algorithm in the case of long transactions, as shown in Figures 4 and 6.
Candidate generation: The several pruning strategies adopted in the GMCHM algorithm lead to a greater reduction of the search space than in the FCHM algorithm; the results are described in detail in Table 5. The FCHM algorithm only uses three pruning strategies, based on the TWU, the utility-list and the interestingness measure, to prune the search space. The GMCHM algorithm additionally utilizes more efficient techniques such as High-utility Database Projection (HDP), High-utility Transaction Merging (HTM), and tighter upper bounds, namely the local utility and the sub-tree utility, to prune the candidates. The proposed algorithm thus significantly improves the performance of mining HUIs.

Conclusion and Future Work
Mining correlated high-utility itemsets is an essential task in data mining. In this work, we developed the GMCHM algorithm, which adopts several strategies to efficiently calculate the correlated interestingness measures in combination with the task of high-utility itemset mining. These strategies reduce the cost of database scans and improve the candidate pruning efficiency, achieving better performance in terms of execution time and memory usage.
As observed in the experiments, the bond measure and others provide speed gains in the mining phase, but the improvement is not as great as with the all_confidence measure. Thus, better pruning strategies for such measures are needed to address this drawback. Mining correlated high-utility itemsets from dynamic-profit databases should also be considered, since these reflect the nature of real-world databases. Moreover, it would be worth modifying and deploying the proposed algorithm in a parallel or distributed setting to further utilize the available power of computer systems.

Notes
1. http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php
2. http://www.philippe-fournier-viger.com/spmf/