An efficient approach for mining closed high utility itemsets and generators

ABSTRACT Closed high utility itemsets (CHUIs) serve as a compact and lossless representation of high utility itemsets (HUIs). CHUIs and their generators are useful in analytical and recommendation systems. In this paper, we introduce a lattice approach to extract CHUIs and their generators quickly from a set of HUIs. The experimental results show that mining CHUIs and their generators from a lattice of HUIs is efficient in runtime, with memory usage comparable to that of the existing approach.


Introduction
Traditional association rule mining (Agrawal, Imielinski, & Swami, 1993) considers only the presence of items in a transaction and ignores the utility of items. Many investigations of high utility itemset mining (HUIM) have been published, such as the incremental mining algorithm based on the concept of the fast-update approach for efficiently extracting high utility itemsets (HUIs) (Lin, Lan, & Hong, 2012), mining HUIs from vertically distributed databases (Vo, Nguyen, & Le, 2009), the HUI-Miner algorithm with a utility-list structure to store both utility and heuristic information for pruning the search space (Liu & Qu, 2012), the Efficient high-utility Itemset Mining (EFIM) algorithm using two upper bounds called sub-tree utility and local utility for search-space pruning (Zida, Fournier-Viger, Lin, Wu, & Tseng, 2016), mining top-k HUIs (Tseng, Wu, Fournier-Viger, & Yu, 2016), the UP-Growth algorithm using the UP-Tree data structure for candidate pruning (Tseng, Wu, Shie, & Yu, 2010), the UP-Growth+ algorithm (Tseng, Wu, Shie, & Yu, 2013), mining HUIs with multiple minimum utility thresholds (Gan, Lin, Fournier-Viger, & Chao, 2016) and the Two-Phase algorithm for mining HUIs with effective performance (Liu, Liao, & Choudhary, 2005).
HUIM is more difficult than frequent itemset mining (FIM) (Agrawal & Srikant, 1994). In HUIM, the utility of an itemset is neither monotonic nor anti-monotonic, and thus a HUI may have a superset or subset with lower, equal or higher utility. Therefore, the search-space pruning techniques developed for FIM cannot be applied directly to HUIM, and many recent algorithms for mining HUIs have focused on candidate elimination (Liu et al., 2005; Liu & Qu, 2012; Tseng et al., 2013; Zida et al., 2016).
In this paper, we address the need for CHUIs and their high utility generators (HUGs) and propose an approach to mine CHUIs and their HUGs from HUIs by using a semi-lattice structure of HUIs. The main contributions of this paper are as follows: (1) we propose an algorithm for building a semi-lattice of high utility itemsets (HUIL) from mined HUIs; (2) we propose an algorithm to generate all CHUIs and their generators from a HUIL; (3) we carry out experiments to examine the efficiency of the proposed method.
In the rest of this paper, we present some current works related to mining HUIs, CHUIs and generators. We also introduce some basic definitions and the problem statement. We then propose the HUIL algorithm, which constructs a HUIL from a list of HUIs and attaches IsClosed and IsGenerator flags to each node. Based on the HUIL structure, we propose the LHUCI-Miner algorithm to extract all CHUIs and their HUGs. To evaluate the proposed algorithm, we report experimental results with regard to both runtime and memory usage. Finally, we discuss conclusions and some directions for future work.

Related works
High utility itemset mining
Traditional association rules (Agrawal et al., 1993) do not consider the combination of the weight of an item and its quantity. HUIM addresses this limitation. In HUIM, a transaction can contain multiple copies of the same item; moreover, each item has its own weight (e.g. profit or utility).
Several studies on HUIM have been carried out. Liu et al. (2005) proposed the Two-Phase algorithm to discover all HUIs efficiently with search-space optimization. However, it generates candidates and needs to scan the database multiple times.
To address the huge number of candidates being generated, Li, Yeh, and Chang (2008) proposed an isolated items discarding strategy. To improve the process of mining HUIs and reduce the number of database scans, Ahmed, Tanbeer, Jeong, and Lee (2009) proposed a tree-based algorithm named IHUP. In addition, Tseng et al. (2010) proposed the UP-Growth algorithm for mining HUIs. Tseng et al. (2013) then improved UP-Growth and proposed the UP-Growth+ algorithm to reduce the overestimated utilities.
HUI-Miner was proposed by Liu and Qu (2012) to discover HUIs with a new utility-list data structure. The utility list of an itemset stores the identifiers of all transactions containing the itemset, the utility values of the itemset in those transactions, and the sum of the utilities of the remaining items that can be included in supersets of the itemset in those transactions.
To reduce the runtime and memory consumption of the HUIM process, Lan, Hong, and Tseng (2014) introduced an efficient approach that adopts a projection-based indexing mechanism to extract the result directly from the given database.
Fournier-Viger, Wu, Zida, and Tseng (2014) proposed the Fast High-Utility Miner algorithm to extract all HUIs from a transaction database. While this is considered an efficient approach, it generates a large number of HUIs, and the algorithm may require a large amount of memory and may fail to terminate. Sahoo et al. (2015) proposed a Fast High Utility Itemset Miner (FHIM) and defined the promising utility co-occurrence structure to further reduce the number of candidates. Zida et al. (2016) proposed the EFIM algorithm for fast HUIM. EFIM calculates the remaining utility (called the sub-tree utility) at the parent node rather than at child nodes in the depth-first search, and thus it can prune more nodes in the search space.

CHUIs and generators mining
A CHUI is a HUI that has no proper superset that is also a HUI and appears in the same number of transactions. A generator is an itemset that has no proper subset with the same support. Fournier-Viger et al. investigated the concept of applying generators to HUIM and proposed two new concise representations of HUIs, called HUGs and generators of HUIs (GHUIs). Two new algorithms, HUG-Miner and GHUI-Miner, were also proposed for mining generators. Overall, these are efficient approaches for extracting generators directly from a database. Wu et al. (2015) provided a solution to mine closed+ high utility itemsets, a compact and lossless representation of HUIs. This approach helps to resolve the issue of large numbers of candidates being generated, and thus has better performance in both runtime and memory consumption. Sahoo et al. (2015) investigated integrating the closure property into HUIM and proposed the HUCI-Miner algorithm to mine all CHUIs and their HUGs from HUIs. This is one of the first studies on extracting CHUIs and HUGs from the HUIs already mined from a database. However, the overall process still has performance issues when mining CHUIs and generators from a huge set of HUIs.
Lattice-based approach in data mining
Davey and Priestley (1990) presented the concepts of order and lattice, as well as the concepts of the join and the meet. An ordered set (L, ≤) is a lattice if the join x ∨ y and the meet x ∧ y always exist for any two elements x and y in L. L is called a join semi-lattice if only the join exists, and a meet semi-lattice if only the meet exists.
In FIM, the set of all frequent itemsets is only a meet semi-lattice. For any two frequent itemsets, only their meet is guaranteed to be frequent, while their join may or may not be. Similarly, in HUIM, the set of itemsets is also treated as a meet semi-lattice. However, because utility is neither monotonic nor anti-monotonic, the join and the meet of a pair of HUIs may or may not be HUIs.
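As a small illustration of the join and the meet for itemsets ordered by inclusion, the following Python sketch treats the join as set union and the meet as set intersection. The four-transaction database, the threshold `MIN_COUNT` and the function names are assumptions made for this example only and are not taken from the paper.

```python
# Hypothetical toy database (not the paper's Table 1): each transaction is a set of items.
TRANSACTIONS = [
    frozenset("AB"),
    frozenset("AB"),
    frozenset("AC"),
    frozenset("AC"),
]

MIN_COUNT = 2  # assumed minimum support count for this toy example


def support_count(itemset, transactions=TRANSACTIONS):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def join(x, y):
    """Least upper bound under set inclusion: the union of the two itemsets."""
    return x | y


def meet(x, y):
    """Greatest lower bound under set inclusion: the intersection of the two itemsets."""
    return x & y


x, y = frozenset("AB"), frozenset("AC")

# Both itemsets are frequent, their meet {A} is frequent, but their join {A, B, C} is not.
meet_is_frequent = support_count(meet(x, y)) >= MIN_COUNT
join_is_frequent = support_count(join(x, y)) >= MIN_COUNT
```

In this toy database the meet {A} occurs in all four transactions, while the join {A, B, C} occurs in none, which matches the statement that only the meet of two frequent itemsets is guaranteed to be frequent.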
In FIM, there are many studies based on the concept of a lattice. Zaki and Hsiao (2005) proposed a lattice approach to mine closed itemsets. Vo and Le (2011) proposed the Lattice-FI algorithm for quickly mining closed frequent itemsets and generators, and MNAR_Lattice for quickly generating minimal non-redundant association rules from the lattice. Vo, Hong, and Le (2013) proposed a lattice-based approach for mining most generalization association rules.

Definitions
Definition 1: Given a finite set of items $I = \{i_1, i_2, \ldots, i_m\}$, each item $i_p$ is associated with a unit profit $p(i_p)$. A transaction database $D$ is a set of transactions, where each transaction $T_d \subseteq I$ has a unique identifier $d$, called its Tid. Each item $i_p$ in the transaction $T_d$ is associated with a weight indicator called quantity $q(i_p, T_d)$, which is the purchased number of item $i_p$. An example transaction database is shown in Tables 1 and 2.
Definition 2: The utility of an item $i$ in a transaction $T_d$ is denoted as $u(i, T_d)$ and defined as $u(i, T_d) = p(i) \times q(i, T_d)$. For example, the utility of item A in $t_5$ from Tables 1 and 2 is $u(A, t_5) = 3 \times 3 = 9$.
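Definition 3: The utility of an itemset $X$ in a transaction $T_d$, with $X \subseteq T_d$, is denoted as $u(X, T_d)$ and defined as $u(X, T_d) = \sum_{i \in X} u(i, T_d)$.
Definition 4: The utility of an itemset $X$ in database $D$ is calculated as the total utility of $X$ in all transactions that contain $X$: $u(X) = \sum_{T_d \in D,\, X \subseteq T_d} u(X, T_d)$.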
Definition 5: The support of an itemset $X$ indicates how frequently $X$ appears in database $D$. It is defined as the proportion of transactions in $D$ that contain $X$, denoted as $supp(X)$. For example, the support of $X = \{A, C, E\}$ in database $D$ from Table 1 is 2/9.
Definition 6: The transaction utility (TU) of a transaction $T_d$ is the sum of the utilities of all the items in $T_d$. For example, $TU(t_2) = u(D, t_2) + u(E, t_2) + u(F, t_2) = 2 + 4 + 5 = 11$.
Definition 7: The transaction-weighted utilization (TWU) of an itemset $X$ is defined as the sum of the transaction utilities of the transactions containing $X$. For example, $TWU(A) = TU(t_1) + TU(t_5) = 26 + 15 = 41$.
Definition 8: An itemset X is a HUI if its utility is equal to or larger than the user-specified minimum utility threshold (min-util). If the utility of X is lower than the minimum utility threshold, X is not a HUI; in other words, it is a low utility itemset.
Definition 10: An itemset X is called a HUI generator (HUG, or simply generator) if it is a HUI and no proper subset Z of X has supp(Z) = supp(X).
Definition 11: The local utility value of an item $x_i$ in itemset $X$, denoted as $luv(x_i, X)$, is the total utility of $x_i$ in all transactions containing $X$: $luv(x_i, X) = \sum_{T_d \in D,\, X \subseteq T_d} u(x_i, T_d)$.
Definition 12: Let $X = \{i_1, i_2, \ldots, i_n\}$ be an itemset. The utility unit array of $X$ is defined as $U(X) = \{u_1, u_2, \ldots, u_n\}$, where $u_k = luv(i_k, X)$.
Definition 13: The local utility value of an itemset $X$ in itemset $Y$, with $X \subseteq Y$, is denoted as $luv(X, Y)$ and defined as the sum of the local utility values of each item $x_i \in X$ in itemset $Y$: $luv(X, Y) = \sum_{x_i \in X} luv(x_i, Y)$.
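The definitions above can be made concrete with a short Python sketch. The profit table, the three-transaction database and all function names below are hypothetical (they are not the paper's Tables 1 and 2); the sketch only assumes that the utility of an item is its unit profit multiplied by its quantity.

```python
# Hypothetical toy data (not the paper's Tables 1 and 2).
PROFIT = {"A": 3, "B": 1, "C": 2}   # assumed unit profit p(i)
DATABASE = {                         # tid -> {item: quantity q(i, T_d)}
    "t1": {"A": 1, "B": 2},
    "t2": {"A": 2, "C": 1},
    "t3": {"B": 4, "C": 3},
}


def u_item(item, tid):
    """Utility of one item in one transaction: profit times quantity."""
    return PROFIT[item] * DATABASE[tid][item]


def covering(itemset):
    """Identifiers of the transactions that contain every item of `itemset`."""
    return [tid for tid, t in DATABASE.items() if all(i in t for i in itemset)]


def utility(itemset):
    """Utility of an itemset in the database: summed over all covering transactions."""
    return sum(u_item(i, tid) for tid in covering(itemset) for i in itemset)


def support(itemset):
    """Proportion of transactions that contain the itemset."""
    return len(covering(itemset)) / len(DATABASE)


def tu(tid):
    """Transaction utility: sum of the utilities of all items in the transaction."""
    return sum(u_item(i, tid) for i in DATABASE[tid])


def twu(itemset):
    """Transaction-weighted utilization: sum of TU over the covering transactions."""
    return sum(tu(tid) for tid in covering(itemset))


def luv(item, itemset):
    """Local utility value of `item` within the transactions that contain `itemset`."""
    return sum(u_item(item, tid) for tid in covering(itemset))
```

With this toy data, `utility({"B"})` is 6, `utility({"A", "B"})` is 5 and `utility({"B", "C"})` is 10, so a superset of {B} can have either a lower or a higher utility than {B} itself. The TWU of {A} is 13, which is an upper bound on the utility of {A} and of all its supersets.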

Problem statement
Given a set of HUIs mined from a transaction database D with a user-specified minimum utility threshold (min-util), the problem is to discover all CHUIs and their HUGs from this set of HUIs.
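To make the problem statement concrete, the following Python sketch is a brute-force baseline, not the method proposed in this paper. It assumes that each HUI is given together with its support count and that closedness and generator status are decided by comparing HUIs with each other only. The function name and the sample values are hypothetical.

```python
def closed_and_generators(huis):
    """Classify HUIs by pairwise comparison.

    `huis` maps each itemset (a frozenset) to its support count.
    Returns (closed, generators) as two sets of itemsets.
    """
    closed, generators = set(), set()
    for x, supp_x in huis.items():
        # Closed: no proper superset among the HUIs has the same support.
        if not any(x < y and supp_y == supp_x for y, supp_y in huis.items()):
            closed.add(x)
        # Generator: no proper subset among the HUIs has the same support.
        if not any(z < x and supp_z == supp_x for z, supp_z in huis.items()):
            generators.add(x)
    return closed, generators


# Hypothetical input: three HUIs with their support counts.
sample = {frozenset("B"): 2, frozenset("E"): 3, frozenset("BE"): 2}
closed, generators = closed_and_generators(sample)
```

Here {B} is a generator but not closed because its superset {B, E} has the same support, {B, E} is closed but not a generator, and {E} is both. The pairwise comparison is quadratic in the number of HUIs, which is the cost that a lattice of parent-child links is intended to reduce.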

Constructing a lattice of HUIs
In this section, we refer to a semi-lattice simply as a lattice. We propose building a lattice of HUIs in which each node contains a HUI, an IsClosed flag and an IsGenerator flag, and we call this lattice HUIL. From this HUIL, we can quickly extract useful information about CHUIs and HUGs.

HUIL structure
The HUIL structure contains a root node, which is an empty itemset with utility and support equal to 0, child nodes and connections between pairs of nodes. The connections between each pair of nodes specify their parent-child relationships. Each node contains information about the itemset: utility, support, IsClosed flag and IsGenerator flag. The IsClosed flag in a lattice node indicates that the itemset in this node is a CHUI if its value is true. The IsGenerator flag indicates that the itemset in a lattice node is a generator if its value is true. The name of each node is formed from the collection of items in its itemset. For example, in Figure 1, Root is the root node and B is a node with utility = 20 and support = 2; node B has two child nodes {BE, BF}, and B is not a CHUI but is a generator.

Algorithm for constructing HUIL
Algorithm 1. HUIL (HUIs)
Input: Set of HUIs sorted by number of items (level) ascending (HUIs)
Output: HUIL with root node (rootNode)
SetupLattice ()
...
3. For each X in HUIs.Levels[j] do
...
5. For each childNode in rootNode.Children do
...
15. For each Lc in Ln.Children do
...
22. For each childNode in rootNode.Children do
...
28. If Flag = True then
...
End
Firstly, the algorithm calls the SetupLattice procedure to set up a lattice with an empty root node. Secondly, it scans all HUIs, which are sorted by number of items. For each HUI, the IsTraversed flags of the root and child nodes are reset, and then the algorithm calls the PushNodeToLattice method to add the HUI into the lattice. In the PushNodeToLattice method, the variable Flag is used to determine whether node {X} can be added directly to the current node. If the current rootNode has child nodes with childNode ⊂ X (line 23), the PushNodeToLattice method is called recursively (line 25) to insert the {X} node into the lattice with each such childNode as the root node. If there does not exist any child node Lc ∈ rootNode.Children with Lc ⊂ X, X becomes a child node of the current rootNode (line 29). To prepare information for mining CHUIs and generators, two flags, IsGenerator and IsClosed, are attached to each HUI when it is inserted into the lattice (lines 25-29). From the resulting lattice, we can easily determine which nodes are CHUIs or generators. We can then propose an algorithm to extract all CHUIs and associated generators based on the HUIL, as described in the fifth section.
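The insertion procedure described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the names `Node`, `push_node` and `build_lattice` are hypothetical, a `visited` set stands in for the IsTraversed flags, and the flags are updated whenever a new parent-child link joins two nodes with the same support.

```python
class Node:
    """Lattice node holding one HUI and its two flags."""

    def __init__(self, itemset=frozenset(), utility=0, support=0):
        self.itemset = frozenset(itemset)
        self.utility = utility
        self.support = support
        self.is_closed = True      # cleared when a child has the same support
        self.is_generator = True   # cleared when a parent has the same support
        self.children = []


def push_node(parent, node, visited):
    """Insert `node` below `parent`, descending into every child that is a proper subset."""
    has_subset_child = False
    for child in parent.children:
        if child.itemset < node.itemset:
            has_subset_child = True
            if id(child) not in visited:   # each node is descended into once per insertion
                visited.add(id(child))
                push_node(child, node, visited)
    if not has_subset_child:
        # No child lies between `parent` and `node`, so link them directly.
        parent.children.append(node)
        if parent.itemset and parent.support == node.support:
            parent.is_closed = False
            node.is_generator = False


def build_lattice(huis):
    """`huis` is an iterable of (itemset, utility, support); smaller itemsets are inserted first."""
    root = Node()
    for itemset, utility, support in sorted(huis, key=lambda h: len(h[0])):
        push_node(root, Node(itemset, utility, support), set())
    return root


# Hypothetical HUIs (values chosen for illustration only).
root = build_lattice([("BE", 30, 2), ("B", 20, 2), ("E", 25, 3)])
```

In this example B and E become children of the root and BE becomes a shared child of both. Because B and BE have the same support, B is marked as not closed and BE as not a generator, while E keeps both flags. Processing the HUIs in ascending order of size ensures that every subset of an itemset is already in the lattice when the itemset is inserted.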

Illustration
The HUIs extracted using the FHIM algorithm (Sahoo et al., 2015) from the given database in Tables 1 and 2 are shown in Table 3. The HUIL algorithm (Algorithm 1) is then applied to these HUIs as follows. Let level-i be the set of HUIs in which each itemset has i items (i > 0). The HUIs in Table 3 are processed level by level. Firstly, an empty node is initialized and added into the lattice as the root node rootNode. Secondly, the level-1 set of HUIs is processed, starting with G, which is connected directly to rootNode; after that, B, A, D, F and E are processed and also inserted directly under rootNode as child nodes. Thirdly, considering level-2 of the HUIs, BE is processed; as rootNode has a child node B, the algorithm calls the PushNodeToLattice method recursively with B as the root node, and then adds a connection between B and BE. Similarly, with regard to child node E, a connection is added between E and BE. This process then continues with the rest of the HUIs. The IsClosed and IsGenerator flags are also updated if two nodes have a connection and the same support. The resulting lattice is shown in Figure 1.

Extracting CHUIs and generators from HUIL
Based on the results of Algorithm 1, each node in the lattice carries IsClosed and IsGenerator flags, indicating whether the itemset is a CHUI and/or a generator. In this section, we propose Algorithm 2, named LHUCI-Miner, to extract CHUIs and the list of generators of each CHUI.
Table 3. Extracted HUIs from the database in Tables 1 and 2 with min-util = 20.
Algorithm 2. LHUCI-Miner
1. For each Lc in rootNode.Children do
...
For each Ls in Lc.Children do
...
19. Add all Lc.Children into Queue and TrackList
...
25. For each Ls in Li.Children do
...
27. Add Ls into Queue
28. Add Ls into TrackList
...
End
Initially, the algorithm traverses all child nodes from the root of the lattice. For each child node, it calls ExtendFindingChuiAndGenerator at line 2. ExtendFindingChuiAndGenerator (Lc) adds Lc to the list of CHUIs if Lc.IsClosed equals True (lines 6-8). The Lc itemset can be both a CHUI and a generator if it is a CHUI and its IsGenerator flag is True (lines 11-13). If Lc is a generator and not a CHUI, FindingChuiAndGenerator (Lc) is called to find the CHUI to which Lc belongs (lines 9, 10). In this method, a queue structure and a list structure are used, with all child nodes of Lc as initial values (lines 18, 19). While the queue has items, the method processes each itemset Li in the queue and adds Lc as a generator of Li if Li is a CHUI and has the same support as Lc (lines 22-24). If Li has child nodes, the algorithm adds all of them to the queue (lines 25-30). Example CHUIs and generators for the sample database shown above are presented in Tables 4 and 5.
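The extraction step described above can be sketched in Python as a pair of breadth-first traversals. This is a simplified illustration, not the authors' Algorithm 2: the `Node` class, the function names and the small hand-built lattice are hypothetical, and the `seen` set plays the role of the TrackList.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """Lattice node with flags assumed to be set during lattice construction."""
    name: str
    support: int
    is_closed: bool
    is_generator: bool
    children: list = field(default_factory=list)


def closed_itemsets_of(gen):
    """Breadth-first search below `gen` for closed nodes that have the same support."""
    queue = deque(gen.children)
    seen = {id(c) for c in gen.children}
    found = []
    while queue:
        node = queue.popleft()
        if node.is_closed and node.support == gen.support:
            found.append(node)
        for child in node.children:
            if id(child) not in seen:
                seen.add(id(child))
                queue.append(child)
    return found


def mine(root):
    """Return a dict mapping each closed itemset name to the names of its generators."""
    result = {}
    queue = deque(root.children)
    seen = {id(c) for c in root.children}
    while queue:
        node = queue.popleft()
        if node.is_closed:
            generators = result.setdefault(node.name, [])
            if node.is_generator:          # closed and generator at the same time
                generators.append(node.name)
        elif node.is_generator:            # generator only: look for its closed itemset(s)
            for closed in closed_itemsets_of(node):
                result.setdefault(closed.name, []).append(node.name)
        for child in node.children:
            if id(child) not in seen:
                seen.add(id(child))
                queue.append(child)
    return result


# Hypothetical lattice: B and E under the root, BE shared by both.
be = Node("BE", 2, True, False)
b = Node("B", 2, False, True, [be])
e = Node("E", 3, True, True, [be])
root = Node("", 0, False, False, [b, e])
chuis = mine(root)
```

For this lattice, `chuis` maps BE to the generator B and E to itself. Because the flags are read directly from the nodes, no support comparison between arbitrary pairs of HUIs is needed during extraction; only the descendants of a generator are visited.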

Experimental results
In this section, we execute our algorithm, LHUCI-Miner, and the HUCI-Miner algorithm (Sahoo et al., 2015) to evaluate their performance with regard to time and memory usage when mining CHUIs and HUGs.
The experiments were implemented and tested on a system with the following configuration: Intel Core i7-6500U 2.5 GHz (4 CPUs), 16 GB of RAM, running Windows 10, 64-bit version. The source code was written in C# using Visual Studio 2015 Community, .NET Framework 4.5. The data sets (Fournier-Viger, Gomariz, et al., 2014) used for testing have the features shown in Table 6. For each test data set in Table 6, we carried out the tests with various min-util (%) values to compare the runtime and memory usage of HUCI-Miner and LHUCI-Miner when mining CHUIs and generators. For further reference, we also report some sample results of HUIs, CHUIs and HUGs for the test data sets (Table 7). The min-util values in Table 7 were selected with reference to Sahoo et al. (2015). Moreover, we also executed our proposed algorithm and the High Utility Generic Basic algorithm with many other min-util values to evaluate their performance, and therefore adjusted some values so that more HUIs, CHUIs and HUGs were returned. On the Chainstore data set, we used a very small min-util value (0.005%) to obtain a larger number of HUIs so that we could emphasize the runtime advantage of the proposed LHUCI-Miner algorithm (Figure 5).

Runtime for mining CHUIs and generators
We executed LHUCI-Miner and HUCI-Miner on the data sets listed in Table 6. We also obtained similar results for the time needed to mine CHUIs and generators from the Chess data set (Figure 3). Both algorithms finish quickly on this data set: with min-util = 25%, HUCI-Miner extracts the information within 8 seconds, whereas LHUCI-Miner returns the results within 4 seconds. The average runtime ratio for mining CHUIs and generators on this data set is 51%.
The runtime results for extracting CHUIs and generators from the Retail data set (Figure 3) show that LHUCI-Miner is always faster than HUCI-Miner. We obtained the same runtime comparison between LHUCI-Miner and HUCI-Miner when mining CHUIs and generators from the Mushrooms data set. On the Mushrooms data set, we used various min-util values decreasing from 14% to 10% to observe the runtime of LHUCI-Miner versus that of HUCI-Miner on the same set of HUIs. In general, Figure 4 indicates that the LHUCI-Miner algorithm is more efficient than HUCI-Miner.
Table 7. Sample results of HUIs, CHUIs and HUGs for the test data sets.
Data set    min-util (%)    HUIs    CHUIs    HUGs
Chess       25              6406    3550     4074
Chess       26              2875    1727     1873
Chess       27              1246    816      874
Chess       28              493     339      358
Mushroom    10              9594    119      1623
Mushroom    11              5801    52       1080
Mushroom    12              2726    22       606
Mushroom    13              1152    6
To compare the runtimes of LHUCI-Miner and HUCI-Miner for various numbers of HUIs, we evaluated the performance of LHUCI-Miner by adjusting min-util. As can be seen in Figure 5, with the Chainstore data set, we decreased min-util from 0.03% to 0.004%. The results indicate that the runtime of LHUCI-Miner is always better than that of HUCI-Miner. Figure 6 shows that mining CHUIs and generators from the lattice using LHUCI-Miner is effective. In our experimental results for the Accidents data set with min-util values from 11% to 15%, we can observe that LHUCI-Miner has a better runtime than HUCI-Miner when the input set of HUIs is very large.

Memory usage for mining CHUIs and generators
Mining CHUIs and generators with either HUCI-Miner or LHUCI-Miner is mainly based on the set of HUIs, and the memory usage for the test data sets remains about the same in each case. However, the use of a lattice makes it necessary to store the connections between parent and child nodes, and thus mining CHUIs and generators using a lattice approach can require more memory than using HUCI-Miner.
In our experimental results, the difference in memory consumption between the two algorithms is small (Table 8). In many real applications, runtime is more important than memory usage. Therefore, although the lattice approach for mining CHUIs and generators does not use less memory than HUCI-Miner, it is more efficient in the sense that it needs less execution time.

Conclusions and future works
In this work, we used the utility-confidence framework and the lattice concept to mine CHUIs and generators. We proposed an algorithm, called HUIL, to construct the lattice of a set of HUIs, including IsClosed and IsGenerator flags. Based on the constructed HUIL, we also proposed the LHUCI-Miner algorithm to extract all CHUIs and generators. In our experiments, this approach required less execution time than HUCI-Miner, with comparable memory usage. These results suggest that the proposed algorithm can be used effectively in recommendation systems. In future work, we will investigate the use of the HUIL structure for quickly generating all high utility association rules.