Towards fast and memory efficient discovery of periodic frequent patterns

ABSTRACT Periodic frequent pattern (PFP) mining, the process of discovering frequent patterns that occur at regular periods in databases, is an important data mining task that supports decision-making. Although several algorithms have been proposed for discovering PFPs, most of them employ a two-stage approach: they first derive the set of periods of a pattern from its coverset and then evaluate the pattern's periodicity from the derived set of periods. This two-stage approach makes existing algorithms inefficient in both runtime and memory usage. This paper presents solutions for reducing the runtime and memory usage of periodic frequent pattern discovery. This is achieved by evaluating the periodicity of patterns without deriving the set of periods from their coversets. Experimental analysis on benchmark datasets shows that the proposed solutions reduce both the runtime and memory usage of mining periodic frequent patterns.


Introduction
Frequent pattern mining (the process of discovering patterns which occur frequently together) has been widely studied over the past years for knowledge discovery in databases to support decision-making. Several algorithms based on various approaches have been developed for mining frequent patterns from databases. Typical examples include algorithms that use the apriori candidate generation approach (Agrawal, Imieliński, & Swami, 1993; Zaki, Parthasarathy, Ogihara, & Li, 1997); the frequent pattern growth approach (Han, Pei, & Yin, 2000; Pei et al., 2001); the vertical representation approach (Shenoy et al., 2000; Zaki, 2000; Zaki & Gouda, 2003) and the hierarchical approach (Tseng, 2013). Though frequent pattern mining approaches can reveal the frequently occurring patterns in databases, the frequency measure alone often fails to reveal the occurrence shapes of patterns. For example, in crime data or customer transaction analysis, frequent pattern mining algorithms will reveal the frequent crimes or customer purchases, but they will fail to report the periodic occurrence shapes of those crimes or customer transactions. However, the ability to detect and understand the periodic occurrence shapes of patterns in databases could be vital in decision-making, for instance, in curbing crime or preventing customer attrition. This limitation of frequent pattern mining algorithms, together with the relevance of patterns' occurrence shapes in decision-making, motivated research on periodic frequent pattern mining.
Periodic frequent pattern (PFP) mining from transactional datasets has been widely researched in works such as Fournier-Viger et al. (2017), Kiran and Reddy (2010), Kiran and Kitsuregawa (2013), Nofong (2016), Surana, Kiran, and Reddy (2012) and Tanbeer, Ahmed, Jeong, and Lee (2009). Several algorithms have been proposed for discovering periodic frequent patterns in transactional databases. Notwithstanding the usefulness of these algorithms, they face the following challenges:
• Algorithms proposed in works such as Kiran and Kitsuregawa (2013, 2014), Kiran and Reddy (2010) and Surana et al. (2012), which discover periodic frequent patterns using the maximum periodicity threshold (proposed in Tanbeer et al., 2009), will often miss some important periodic frequent patterns if such patterns have just one periodic (occurrence) interval greater than the user desired maximum periodicity threshold.
• Algorithms proposed in works such as Kumar and Valli-Kumari (2013) and Rashid, Gondal, and Kamruzzaman (2013), which discover periodic frequent patterns using the maximum variance threshold (proposed in Rashid, Karim, Jeong, & Choi, 2012), will often report, for decision-making, periodic frequent patterns whose periods are entirely distinct.
• Most existing algorithms for mining periodic frequent patterns use a two-stage process to evaluate the periodicity of patterns. That is, they first derive the set of periods of a pattern from its coverset and then evaluate the periodicity from the derived set of periods. This makes algorithms that employ the two-stage process inefficient in both runtime and memory usage.
Although some of these challenges have been addressed in recent works, the time and memory inefficiency caused by the two-stage process in periodic frequent pattern discovery is, to the best of our knowledge, yet to be addressed. This paper presents effective and efficient solutions for mining periodic frequent patterns without employing the two-stage approach in evaluating the periodicity of patterns. Eliminating this two-stage process will in turn reduce the runtime and memory used in mining periodic frequent patterns from transactional databases.
The main contributions of this paper towards PFP discovery in transactional databases are:
• It proposes effective and efficient techniques for evaluating the periodicity of patterns without the traditional two-stage approach used in existing works.
• The proposed techniques are incorporated into existing periodic frequent pattern mining algorithms, which then show a reduction in both runtime and memory usage in mining periodic frequent patterns.
The rest of this paper is organized as follows. Section 2 discusses related work while the proposed periodicity evaluation measures are introduced in Section 3. Section 4 presents the experimental analysis and Section 5 draws the conclusion and outlines some future works.

Related work
The associated notations for periodic frequent pattern discovery in transactional databases can be given as follows.
Let $I = \langle i_1, i_2, \ldots, i_n \rangle$ be a set of literals, called items. Then, a transaction is a nonempty set of items. A pattern $S$ is a set of items satisfying some conditions of measures like frequency. A pattern is of length-$k$ if it has $k$ items; for example, $S = \{b, c, d, e\}$ is a length-4 pattern.
Given a transactional database of $k$ transactions, $D = \langle n_1, n_2, n_3, \ldots, n_k \rangle$, where each $n_m$ in $D$ is identified by $m$, called its transaction identifier (TID), the cover of a pattern $S$ in $D$, $cov_D(S)$, is the set of TIDs of transactions that contain $S$. That is,
$$cov_D(S) = \{ m : n_m \in D \wedge S \subseteq n_m \},$$
where $|cov_D(S)|$ is often referred to as the support count of $S \in D$.
The support of a pattern $S \in D$, $sup_D(S)$, is defined as
$$sup_D(S) = \frac{|cov_D(S)|}{|D|}.$$
Given a user desired minimum support ($\varepsilon$), a pattern $S \in D$ is said to be frequent if $sup_D(S) \geq \varepsilon$. For any given pattern $S$ in a transactional database $D$ with $cov_D(S)$ as its coverset, the notation $e.cov_D(S)$ is used to indicate the extension of $cov_D(S)$ by inserting a starting time 0 and the last time $m$ into $cov_D(S)$. That is,
$$e.cov_D(S) = \{0\} \cup cov_D(S) \cup \{m\},$$
where $m = |D|$. The last time, $m$, will be duplicated if it is already in $cov_D(S)$. For instance, given $|D| = 7$ and $cov_D(S) = \{1, 2, 4, 7\}$, then $e.cov_D(S) = \{0\} \cup \{1, 2, 4, 7\} \cup \{7\} = \{0, 1, 2, 4, 7, 7\}$.
Let $(m_j, m_{j+1}) \in e.cov_D(S)$ be two consecutive transaction IDs (occurrence times) of $S$ in $D$; then $p^S_j = m_{j+1} - m_j$ is the $j$th period of $S$ in $D$. The set of all periods of $S$, that is, $P^S$, obtained from its extended cover is denoted as
$$P^S = \{p^S_1, p^S_2, \ldots, p^S_r\},$$
where $r = |e.cov_D(S)| - 1$. For example, given $e.cov_D(S) = \{0, 1, 2, 4, 7, 7\}$, then $P^S = \{1, 1, 2, 3, 0\}$. Thus, for any pattern $S$, it can be derived that
$$|P^S| = r = |cov_D(S)| + 1.$$
To discover the set of patterns in transactional databases with periodic occurrence shapes, Tanbeer et al. (2009) introduced a periodicity measure on patterns as follows.
Definition 2.1 (Tanbeer et al., 2009): Given a database $D$, a pattern $S$ and its set of periods $P^S$ in $D$, the periodicity of $S$, $Per(S)$, is defined as
$$Per(S) = \max(P^S).$$
With the periodicity measure proposed in Definition 2.1, Tanbeer et al. (2009) subsequently defined a periodic frequent pattern as a frequent pattern whose periodicity is not greater than a user defined maximum periodicity threshold, maxPer.
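The notions above can be illustrated with a short sketch that reproduces the running example ($|D| = 7$, cover $\{1, 2, 4, 7\}$). The function names `extended_cover`, `periods` and `max_periodicity` are hypothetical names chosen for this illustration; they do not come from any of the cited implementations.

```python
def extended_cover(cover, db_size):
    """Return e.cov_D(S): the sorted cover with 0 prepended and |D| appended.

    |D| is appended even when it is already in the cover (it is duplicated).
    """
    return [0] + sorted(cover) + [db_size]


def periods(cover, db_size):
    """Return the list of periods: differences of consecutive extended-cover TIDs."""
    ecov = extended_cover(cover, db_size)
    return [later - earlier for earlier, later in zip(ecov, ecov[1:])]


def max_periodicity(cover, db_size):
    """Per(S) in the sense of Definition 2.1: the largest period."""
    return max(periods(cover, db_size))


cover = {1, 2, 4, 7}   # TIDs of the transactions containing S
db_size = 7            # |D|

print(extended_cover(cover, db_size))   # [0, 1, 2, 4, 7, 7]
print(periods(cover, db_size))          # [1, 1, 2, 3, 0]
print(max_periodicity(cover, db_size))  # 3
```

With a hypothetical threshold maxPer = 3, this pattern would be reported as periodic under Definition 2.1, whereas maxPer = 2 would reject it because of the single period of length 3.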
Given a pattern S and its set of periods P S , the approach in Tanbeer et al. (2009) returns S as periodic if the maximal occurring period (that is, the maximal time interval between any two consecutive occurrence times) of S is not greater than the maximum periodicity threshold, maxPer. This idea of discovering periodic frequent patterns using the maximal occurring period, as proposed in Tanbeer et al. (2009), has been used for periodic frequent pattern mining in transactional databases in works such as Kiran and Kitsuregawa (2014), Kiran and Reddy (2010, 2011), Lin, Zhang, Fournier-Viger, Hong, and Zhang (2017) and Surana et al. (2012). Rashid et al. (2012), however, argued that discovering periodic frequent patterns using the periodicity measure proposed in Tanbeer et al. (2009) is inappropriate, as it returns the maximum time interval (period) for which a pattern does not appear in a database as its periodicity. Rashid et al. (2012) thus defined the periodicity of a pattern under the name regularity as follows.
Definition 2.2 (Rashid et al., 2012): Given a database $D$, a pattern $S$ and its set of periods $P^S$ in $D$, the regularity of $S$, $Reg(S)$, is defined as $Reg(S) = var(P^S)$, where $var(P^S)$ is the variance of $P^S$.
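As a worked illustration, take the extended cover $\{0, 1, 2, 4, 7, 7\}$ from the running example, whose periods are $1, 1, 2, 3, 0$. The calculation below assumes that $var$ denotes the population variance (dividing by the number of periods); the text does not state which variant is intended, so this is an assumption.

```latex
\begin{align*}
\bar{x}(P^S) &= \frac{1 + 1 + 2 + 3 + 0}{5} = 1.4 \\
Reg(S) = var(P^S) &= \frac{(1-1.4)^2 + (1-1.4)^2 + (2-1.4)^2 + (3-1.4)^2 + (0-1.4)^2}{5} \\
&= \frac{0.16 + 0.16 + 0.36 + 2.56 + 1.96}{5} = \frac{5.2}{5} = 1.04
\end{align*}
```

For comparison, the maximum-period measure of Tanbeer et al. (2009) gives 3 for the same pattern, so the two measures summarise the same set of periods quite differently.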
Based on the regularity (periodicity) measure proposed in Definition 2.2, Rashid et al. (2012) defined a regular (periodic) frequent pattern as a frequent pattern whose variance among its set of periods is not greater than a user desired maximum regularity threshold, maxReg. This concept of discovering regular frequent patterns based on the proposition in Rashid et al. (2012) has been used in works such as Kumar and Valli-Kumari (2013) and Rashid et al. (2013). Nofong (2016), however, argued that, though the proposition in Rashid et al. (2012) will not miss interesting periodic frequent patterns as in Tanbeer et al. (2009), algorithms that mine periodic frequent patterns using the propositions in both Tanbeer et al. (2009) and Rashid et al. (2012) will always report periodic frequent patterns having totally distinct periods. To report only periodic frequent patterns with similar periods for decision-making, Nofong (2016) defined a periodic frequent pattern as follows.
In Definition 2.3, $Prd(S)$ is the periodicity of $S$ (defined as the mean of $P^S$, that is, $Prd(S) = \bar{x}(P^S)$) and $std(P^S)$ is the standard deviation of $P^S$.
With Definition 2.3, though PFPs having similar periodic shapes will be mined and reported, Nofong (2016) observed that some of the reported periodic frequent patterns may be periodic due to random chance without inherent item relationship. To ensure only periodic frequent patterns having inherent item relationships are mined and returned, the productiveness measure (proposed in Webb, 2010) was incorporated by Nofong (2016) in defining the productive periodic frequent patterns as the set of periodic frequent patterns with inherent item relationship.
Fournier-Viger et al. (2017) introduced PFPM, an efficient algorithm with novel pruning techniques for discovering periodic frequent patterns in transactional databases. Unlike the techniques proposed in Nofong (2016), Rashid et al. (2012) and Tanbeer et al. (2009), PFPM proposed three periodicity measures (that is, the minimum, maximum and average periodicity measures) for mining user desired periodic frequent patterns. The three measures proposed in Fournier-Viger et al. (2017) thus give users more flexibility in discovering periodic frequent patterns in transactional datasets.

Proposed periodicity evaluation measures
We adopt the periodic frequent pattern definition (Definition 2.3) proposed in Nofong (2016). To enable the mining of PFPs based on Definition 2.3 while addressing the time and memory inefficiencies in discovering periodic frequent patterns, we show that the periodicity of a pattern can be evaluated directly from its coverset and the size of the database as follows.
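One way to see why this is possible is sketched here from the definitions in Section 2 rather than quoted from the lemma itself. The symbols $t_j$ and $c$ are notation introduced only for this sketch: $t_1 < t_2 < \cdots < t_c$ are the TIDs in $cov_D(S)$, with $t_0 = 0$ and $t_{c+1} = |D|$ added by the extended cover.

```latex
\begin{align*}
\sum_{j=1}^{c+1} p^S_j
  &= \sum_{j=1}^{c+1} \left( t_j - t_{j-1} \right)
   = t_{c+1} - t_0 = |D| \\
Prd(S) = \bar{x}(P^S)
  &= \frac{1}{c+1} \sum_{j=1}^{c+1} p^S_j
   = \frac{|D|}{|cov_D(S)| + 1}
\end{align*}
```

The sum telescopes, so the mean period depends only on the database size and the size of the coverset. For the running example with $|D| = 7$ and $cov_D(S) = \{1, 2, 4, 7\}$ this gives $7 / 5 = 1.4$, which matches the mean of the periods $1, 1, 2, 3, 0$.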
Let D be a dataset and n the set of frequent patterns in D whose periodicities are to be evaluated. The functions for evaluating the periodicities based on Lemma 3.1 (Function 1) and without Lemma 3.1 (Function 2) are as shown below.
Analysing both Functions 1 and 2 using Big-O notation, Function 1 employs only one for-loop in evaluating the periodicity of all potential periodic frequent patterns, while Function 2 uses a nested for-loop for the same purpose. As such, the runtime complexity of Function 1 based on Lemma 3.1 turns out to be $O(n)$ while that of Function 2 turns out to be $O(n^2)$. Hence, there will be a significant reduction in runtime if Lemma 3.1 is employed in evaluating the periodicity of patterns.
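The contrast between the two ways of evaluating the periodicity can be sketched as follows. This is a minimal illustration and not a reproduction of Functions 1 and 2: the names `periodicity_direct` and `periodicity_two_stage` are hypothetical, and the direct version assumes that the mean period equals $|D| / (|cov_D(S)| + 1)$, which follows from the definitions in Section 2.

```python
import math


def periodicity_direct(cover, db_size):
    """Mean period computed from the cover size and |D| only.

    No list of periods is built; the work per pattern does not depend
    on iterating over the cover.
    """
    return db_size / (len(cover) + 1)


def periodicity_two_stage(cover, db_size):
    """Mean period computed the two-stage way.

    Stage 1 materialises the list of periods from the extended cover;
    stage 2 averages that list.
    """
    ecov = [0] + sorted(cover) + [db_size]
    period_list = [later - earlier for earlier, later in zip(ecov, ecov[1:])]
    return sum(period_list) / len(period_list)


# Hypothetical covers for three frequent patterns in a database of 7 transactions.
covers = [{1, 2, 4, 7}, {3, 6}, {1, 2, 3, 4, 5, 6, 7}]
for cover in covers:
    direct = periodicity_direct(cover, 7)
    two_stage = periodicity_two_stage(cover, 7)
    # Both approaches agree on the value; they differ only in the work done.
    assert math.isclose(direct, two_stage)
    print(sorted(cover), direct)
```

The two-stage version allocates an extended cover and a period list for every pattern, which is the extra time and memory the direct evaluation avoids.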
It is, however, worth noting that, though Lemma 3.1 will evaluate the periodicity of a pattern, it cannot evaluate the standard deviation (see Note 2) among the set of periods of a pattern. In existing works on discovering periodic frequent patterns, the set of periods for each pattern is derived from its coverset before the standard deviation among the derived set of periods is evaluated. As mentioned previously, deriving the set of periods from a pattern's coverset and subsequently evaluating its periodicity from the derived set of periods makes existing algorithms for discovering periodic frequent patterns inefficient in both runtime and memory usage.
To eliminate the two-stage process in periodic frequent pattern discovery, we show how the standard deviation among the set of periods can be directly derived from the coverset without necessarily evaluating the set of periods as follows.
Summing the above expansion results in the following: where $X_S$ is the variance value for the first period in $P^S$.
where $Y_S$ is the variance value for the last period in $P^S$.
$$Z_S = \sum_{j=2}^{m-1} \left( n_j^2 + n_{j-1}^2 + Prd(S)^2 - 2 n_j n_{j-1} \right) \quad \text{(for any other period in } P^S\text{)} \; \ldots$$
where $Z_S$ is the variance value for any other period in $P^S$ which is not the first or last period. Hence for any given pattern $S$ and its coverset, the variance and standard deviation among its periods can be obtained respectively as:
We compare the proposed techniques for evaluating the periodicity of patterns, that is, Lemmas 3.1 and 3.2 (as Function 3), vis-a-vis the existing two-stage approach of evaluating the periodicity of patterns (as Function 4) as follows.
Let $D$ be a dataset and $n$ the set of frequent patterns in $D$ whose periodicities are to be evaluated. With the Big-O notation analysis on Functions 3 and 4, the runtime complexity of Function 3 (based on Lemmas 3.1 and 3.2) turns out to be $O(n^2)$ while that of Function 4 is $O(3n^2)$. However, in the worst-case scenario, both Functions 3 and 4 will have the same runtime complexity of $O(n^2)$.
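The idea of obtaining the standard deviation without first deriving the set of periods can be sketched as below. This is an illustrative sketch under stated assumptions, not a reproduction of Functions 3 and 4: the function names are hypothetical, the cover is assumed to be supplied as a sorted list of TIDs, and the population standard deviation is assumed. The single-pass version treats the first period, the last period and the remaining periods separately, loosely mirroring the first/last/other split described above.

```python
import math


def std_direct(sorted_tids, db_size):
    """Standard deviation of the periods in one pass over the sorted cover.

    Each period is consumed as soon as it is computed, so no list of
    periods (and no extended cover) is ever stored.
    """
    r = len(sorted_tids) + 1          # number of periods
    mean = db_size / r                # mean period from |D| and the cover size
    squared_dev = 0.0
    previous = 0                      # start time 0 of the extended cover
    for tid in sorted_tids:           # first period, then all inner periods
        squared_dev += (tid - previous - mean) ** 2
        previous = tid
    squared_dev += (db_size - previous - mean) ** 2   # last period, up to |D|
    return math.sqrt(squared_dev / r)


def std_two_stage(sorted_tids, db_size):
    """Two-stage baseline: materialise the periods, then compute mean and std."""
    ecov = [0] + list(sorted_tids) + [db_size]
    period_list = [later - earlier for earlier, later in zip(ecov, ecov[1:])]
    mean = sum(period_list) / len(period_list)
    variance = sum((p - mean) ** 2 for p in period_list) / len(period_list)
    return math.sqrt(variance)


tids = [1, 2, 4, 7]   # running example, |D| = 7
assert math.isclose(std_direct(tids, 7), std_two_stage(tids, 7))
print(round(std_direct(tids, 7) ** 2, 2))  # 1.04
```

Both versions still visit every TID in the cover; the saving in the single-pass version comes from not allocating the extended cover and the period list and from not traversing the periods a second and third time.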

Experimental analysis
To show the effectiveness of our proposed periodicity evaluation measures, we incorporate them into existing algorithms for mining periodic frequent patterns and test them on benchmark datasets. The effectiveness of our proposed measures was analysed with regard to runtime (execution time) and memory usage in discovering periodic frequent patterns.
For our experimental analysis (see Note 3), the following implementations were used:
• PFP*: our implementation of the technique for mining all periodic frequent patterns. For any given user thresholds and a given dataset, PFP* discovers and returns the set of all periodic frequent patterns having similar periodicities.
• PFP+: our improved implementation of PFP*, which incorporates our proposed periodicity evaluation measures. For any given user thresholds and a given dataset, PFP+ discovers and returns the set of all periodic frequent patterns having similar periodicities.
• PPFP: an implementation of the periodic frequent pattern mining technique proposed in Nofong (2016). For any given user thresholds and a given dataset, PPFP discovers and returns all productive periodic frequent patterns having similar periodicities.
• PPFP+: our improved implementation of PPFP, which incorporates our proposed periodicity evaluation measures. For any given user thresholds and a given dataset, PPFP+ discovers and returns the set of all productive periodic frequent patterns having similar periodicities.
For the above four algorithms, experimental analyses were conducted with regard to (i) execution time and (ii) memory usage. The datasets described below were used for our experimental analysis. It is worth noting that the compared algorithms were implemented in Java and the experiments were carried out on a 64-bit Windows 10 PC (Intel Core i7, CPU 2.10GHz, 12GB). The results of the experimental analysis with regard to execution time and memory usage are discussed below.

Execution time
For scalability and time performance, we compare the four implementations mentioned above on the datasets described above. The values recorded and plotted for each dataset are averages over ten (10) runs of each experiment. Figures 1-4 show the execution time comparison of the above mentioned implementations in mining periodic frequent patterns from the Kosarak10K, Kosarak45K, Accident and Tafeng datasets respectively.
As can be seen in Figures 1-4, incorporating our proposed periodicity evaluation techniques into existing periodic frequent pattern mining algorithms significantly reduces the runtime required in periodic frequent pattern discovery. For instance, in Figures 1(a) and 2(a), PFP+, which is an implementation based on our proposed techniques, is almost twice as efficient as PFP* with regard to the time required in discovering periodic frequent patterns. Also, as can be seen in Figures 1(b), 2(b), 3 and 4, PFP+ and PPFP+ (both of which incorporate our proposed techniques) are slightly more efficient than PFP* and PPFP in periodic frequent pattern discovery.

Memory usage
We also compare the memory used in discovering periodic frequent patterns by the four mentioned implementations on the datasets described above. The values recorded and plotted for each dataset are average values of the experiments which were run ten (10) times. Figures 5-8 show the memory usage comparison of the four implementations on the Kosarak10K, Kosarak45K, Accident and Tafeng datasets respectively.
As can be seen in Figures 5-8, incorporating our proposed periodicity evaluation techniques on existing periodic frequent pattern mining algorithms significantly reduces the memory usage in periodic frequent pattern discovery. In Figures 5-7 for instance, both PFP+ and PPFP+ (which are implementations incorporating our proposed techniques) are almost twice as efficient in memory usage compared to PFP* and PPFP in periodic frequent pattern discovery.

Conclusion
Despite the usefulness of periodic frequent patterns in revealing useful occurrence shapes in databases, existing algorithms for their discovery often employ a two-stage process, which makes them inefficient in runtime and memory usage. This paper proposes effective and efficient techniques for reducing the runtime and memory used in discovering periodic frequent patterns from databases. By incorporating these techniques into existing periodic frequent pattern mining algorithms, we show experimentally on benchmark datasets that our proposed techniques reduce both the runtime and memory used in periodic frequent pattern discovery. Our future work will focus on further improving the algorithm through pseudo-projection in order to reduce the memory used in periodic frequent pattern mining.

Notes
1. That is, deriving the set of periods and subsequently evaluating the periodicity from the set of periods.
2. Which will be required to identify periodic frequent patterns with similar periodicities; see Definition 2.3.
3. We do not compare our implementations with that proposed in Tanbeer

Disclosure statement
No potential conflict of interest was reported by the authors.

Notes on contributors
His current research interests include data mining, pattern mining, classification and trend prediction.
John Wondoh is a researcher and lecturer at the University of South Australia, Adelaide. He received a Ph.D. degree in Computer and Information Science in 2018 at the University of South Australia, Adelaide. He currently works as part of the team for several courses at the university. His research areas of interest include event recognition and processing, process optimizations and machine learning.