Usage Apriori and clustering algorithms in WEKA tools to mining dataset of traffic accidents

ABSTRACT The aim of this study is finding approaches for investigating association rules mining algorithms and clustering to offer new rules from a broad set of discovered rules which taken from traffic accident data at Alghat Provence in KSA. Several tools are applying in data mining to extracting data. WEKA provides applications of learning algorithms that can efficiently execute any dataset. In WEKA tools, there are many algorithms used to mining data. Apriori and cluster are the first-rate and most famed algorithms. Apriori is the simple algorithm, which applied for mining of repeated the patterns from the transaction dataset to find frequent itemsets and association between various item sets. A cluster is a technique used to group a collection of items having similar features. Association rules applied to find the connection between data items in a transactional database. Association rules data mining algorithms used to discover frequent association. WEKA tools were used to analysing traffic dataset, which composed of 946 instances and 8 attributes. Apriori algorithm and EM cluster were implemented for traffic dataset to discover the factors, which causes accidents. Through the results, shows that the Apriori algorithm is better than the EM cluster algorithm.


Introduction
There is a significant amount of data stored in the databases, and with the rapid spread of the data warehouse, it is necessary to find techniques to extract information and knowledge by exploiting these data stored for used in problem-solving and decision-making using modern computer applications, the current smart technology famous as artificial intelligence.Data mining is an analytical process that combines artificial intelligence, statistics, and machine learning.It is considered a step of knowledge in databases.Data mining and machine learning are topics in artificial intelligence that focus on pattern discovery, prediction, and forecasting based on possessions of gathered data (Witten, Frank, Hall, & Pal, 2016).algorithm is to find frequent itemsets and association between different itemsets, that is, association rule.Apriori is Easy implementation.The algorithm applies information from previous steps to produce the frequent itemsets (Shweta & Garg, 2013).Apriori is the most uncomplicated algorithm, which is employed for mining of repetitive patterns from the transaction database.We have aimed to execute the Apriori algorithm for adequate study work, and we have applied WEKA for mentioning the process of association rule mining.The benefits of using Apriori algorithm are usages large item set property.Easily parallelized, simply and easy to implement, Apriori algorithm is an efficient algorithm for finding all frequent itemsets.EM algorithm is a general method of finding the maximum likelihood estimate of data distribution when data is partially missing or hidden (Prajwala & Sangeeta, 2014).The advantages of using EM algorithm are to give a beneficial result for the real world data set.Moreover, use this algorithm when you want to perform cluster analysis of a small scene or region-of-interest, and are not satisfied with the results obtained from the other algorithms (Sharma, Bajpai, & Litoriya, 2012).EM algorithm is an essential algorithm for data mining.We used this algorithm when we are satisfied the result of other algorithms methods.EM is chosen to cluster data for the many reasons: first, It has a robust statistical basis.Second, it is linear in database size.Third, it is healthy to noisy data.Fourth, it can accept the desired number of the cluster as input.Fifth, it can handle high dimensionality, and final, it converges fast given a proper initialization (Abbas, 2008), guarantees about optimality, easily explainable results (Ordonez & Omiecinski, 2002).Also has several disadvantages, Algorithm is highly involved, it is hard to initialize, and the quality of the final solution depends on the quality of the initial solution (Slimani & Lazzez, 2014).

WEKA
WEKA term is a set of modern machine learning ways and data pre-handling tools.It is identified as a set of machine learning approaches for data extraction tasks (Seppelt, Voinov, & Lange, 2012).It designed so that users can speedily test out existing machine learning modes on new datasets in very flexible ways (Frank et al., 2009).The workbench contains techniques for the first data mining troubles: regressions, classification, clustering, and association rule mining, conception, and attribute selection (Seppelt et al., 2012).It is an excellent appropriate for improving new machine learning methods (Hall et al., 2009).The user can access components through JAVA programming or command line interfaces.It affords graphical user interface in an application named the WEKA Knowledge Flow Environment featuring visual programming and WEKA explorer (Parikh & Tirkha, 2013).
There are three additional graphical user interfaces to WEKA.The Knowledge Stream interface allows the user to design configurations for flowed data handling.WEKA's third interface, the Experimenter intended to help the user to answer a fundamental practical question when implementing classification and regression methods: Which techniques and parameter values work better for the given problem.The fourth interface, so-called the Workbench, is a unified graphical user interface that combines the other three into one application (Witten et al., 2016).In this study, we chose WEKA from other software tools on the market because it is the package that would be recommended for people who are beginners to such software to those who are very adept.The software merely is very robust with built-in features.WEKA is contained many built-in features that require no programming or coding knowledge.Then WEKA has become very common with the academic and industrial researchers, also widely applied for teaching aims.WEKA is better suited for mining association rules, powerful in machine learning techniques and Suited for machine learning.WEKA is user-friendly with a graphical interface that allows for fast setup and operation.WEKA work on the prediction that the user data is gained as a flat file or relation.In another word, that means each data object described by a stable number of attributes that ordinarily are of a specified type, normal alpha-numeric or numeric values (Ramamohan, Vasantharao, Chakravarti, & Ratnam, 2012).
WEKA offers applications of learning algorithms that you can efficiently use to your dataset.It contains a diversity of tools for converting datasets, such as the algorithms for discretization and sampling (Witten et al., 2016).
WEKA makes it easy to compare different solution strategies based on the same evaluation method and identify the one that is most appropriate for the problem at hand.It is implemented in JAVA and runs on almost any computing platform (Frank, Hall, Trigg, Holmes, & Witten, 2004).WEKA provides applications of learning algorithms that can quickly implement any dataset.It also contains a diversity of tools for transforming dataset (Frank et al., 2009).WEKA is an open source software tool for implementing machine-learning algorithms.Bansal and Bhambhu (2013) reported that association rule transacts with frequent itemsets as done by much association algorithms like Apriori algorithm, which used in widely real vitality applications.In this paper, authors contain the use of association rule mining in extracting pattern that frequently happened within a dataset and explanation the implementation of the Apriori algorithm WEKA technique from a dataset, which is gathering of demeaning crimes against women in Session court.This paper studies two association rule algorithms Apriori algorithm and Predictive Apriori algorithm and matches the result of both the algorithms using WEKA tool.Therefore, the result of rules together algorithms visibly shows that Apriori algorithm achieves better and faster than the Predictive Apriori algorithm.The study uses a comprehension of recurrent pattern matching based on support and confidence measures produced excellent results in various fields.The paper indicates that investigation of repetitive pattern matching based on support and confidence measures provided excellent results in multiple areas.

Related work
Apriori Algorithm can saw as a two-stage operation: (I) All item sets having support factor greater than or equal to, the user-specified minimum support.(II) All rules are having the confidence factor more significant than or similar to the userspecified minimum confidence.
Research explained that association rule in data mining shows a critical key in the process of mining data for repeated item sets.Apriori algorithm is applied to find out and comprehend the underlying patterns involved in the court's records from their data contains in various sections.Amira, Pareek, and Araar (2015) offered association rule-mining algorithms are commonly applied to find all rules in the database to pleasing some minimum support and minimum confidence restrictions.The number of generated rules reduced the adaptation of the association rule-mining algorithm to mine only a particular subset of association rules where the classification class attribute is assigned to the right-hand-side was investigated in past research.In this study, a dataset about traffic accidents was gathered from Dubai Traffic Department, UAE.After data preprocessing, Apriori and Predictive Apriori association rules algorithms were applied to the dataset to explore the link between recorded accidents factors to accident acuity in Dubai area.Two sets of class association rules were generated using the two algorithms and summarized to get the most interesting rules using technical measures.Experimental results showed that the class association rules generated by Apriori algorithm were more effective than those generated by Predictive Apriori algorithm were.More associations between accident factors and accident severity level were explored when applying Apriori algorithm.This paper showed that when applying rule covers method on the generated class association rules using Apriori and Predictive Apriori algorithms, many class association rules produced by Apriori algorithm were eliminated, and more effective rules than those generated by Predictive Apriori algorithm were obtained.Shrivastava and Panda (2014) explained there are several algorithms developed to mine the association rules from the huge databases.Authors offered the Apriori algorithm is the best common algorithm to mine the association rules from the dataset.Various tools are existing to execute the Apriori algorithm.WEKA is an open source software tool for implementing machine-learning algorithms.A study defined WEKA is the gathering or a collection of the implements for execution data mining with the application of the association rules in it.Association rules formed by analysing data for various samples and using the standard support and dependability to identify the most important relationships.They are differed into separate classes in data mining and used in the WEKA to perform the operations.The result in Apriori algorithm generates the best association rule for the dataset after operating the WEKA tool.The implementation of Apriori algorithm, it can be more compatible and purposeful in future, by the implementation of the new association algorithms for some other new operations and analysis in this WEKA tool.Agrawal and Agrawal (2017) explained details description about Analysis of Clustering Algorithm of WEKA Tools.Paper defined clustering is a method used in several areas such as image analysis, pattern recognition, and statistical data analysis.Clustering is a partition of data into sets of similar items.Every cluster contains various items that are analogous to them and unlike compared to objects of other sets.Some clustering algorithms represent to produce clusters (Chauhan, Kaur, & Alam, 2010).WEKA tool used to compare different clustering algorithms.It used because it provides a better interface to the user than compare to other data mining tools.In this paper, algorithms are analysing and comparing the various clustering algorithm by using WEKA tool to find out which algorithm will be more comfortable for the operators for execution clustering algorithm.This present the applications of data mining WEKA tool it provides the cluster's huge data set and clustering that provide making a hand in the optimizing of the search engine.Verma, Srivastava, Chack, Diswar, and Gupta (2012) demonstrated EM algorithm is a reiterative procedure for finding maximum likelihood or maximum a posteriori estimates of parameters in statistical paradigm, wherever the model consists of hidden implicit variables.The paper dealt with that the EM iteration rotates among implementing an expectation step, which calculates the expectation of the log-probability estimated using the present evaluate for the parameters, and maximization(M)step, those count parameters maximizing the expected log-probability found on the expectation (E) step.These parameters estimations used to determine the allocation of the potential variables in the sequent expectation step.
Tanna and Ghodasara (2014) discussed the using of Apriori through WEKA for repeated pattern mining.Paper described Apriori algorithm is very an effective for extracting repeated groups for Boolean association rules.The conclusion of this paper that Apriori is the simple algorithm, which applied for mining of repeated patterns from the transaction database.Paper presented the used of WEKA implements for association rule to applying Apriori algorithm.Authors exercised the Apriori Algorithm to get the association rules that have minSupport = 50% and min confidence = 50% by using WEKA GUI.They have tried to implement the Apriori algorithm for sufficient research work, and they have utilized WEKA for referring the process of association rule mining.
AprioriTID: Slimani and Lazzez (2014) reported AprioriTID proposed by Agrawal and Srikant (1994).This algorithm has the extra property that the database is not used at all for counting the support of candidate item set after the first pass.Instead, an encoding of the applicant item sets used in the previous pass is employed for this purpose.The most critical tasks of frequent pattern mining approaches are itemset mining, sequential pattern mining, sequential rule mining, and association rule mining.Apriori algorithm is among the proposed initially structure which deals with association rule problems.In synchronism with Apriori, the AprioriTid and AprioriHybrid algorithms have been offered.For smaller problem sizes, the AprioriTid algorithm is executed equivalently well as Apriori, but the performance degraded two times slower when applied to massive problems.The support counting method included in the Apriori algorithm has involved voluminous research due to the performance of the algorithm.A useful number of ineffective data mining algorithms exist in the literature for frequent mining patterns.This study offered a summary of the status and future directions of frequent pattern mining.
Hierarchical clustering: Steinbach, Karypis, and Kumar (2000) demonstrated the results of experimental research of several general document-clustering techniques: agglomerative hierarchical clustering and K-means.Authors presented Hierarchical clustering is many times represented as the excellent property clustering approach but is limited because of its quadratic time complexity.The runtime of parting K-means is very enticing when compared to that of hierarchical clustering methods.However, through the way of study tests Authors exposed that a simple and efficient variant of K-means, 'bisecting' K-means, can produce clusters of documents that are better than those given by 'regular' K-means and useful or useful than those yielded by agglomerative hierarchical clustering techniques.The study also has been able to find what we think is a reasonable explanation for this behaviour.
FP-growth: Novitasari, Hermawan, Abdullah, Sembiring, and Herawan (2015) presented the candidate set generation and tests are two major drawbacks in Apriori-like algorithms.Therefore, to deal with this problem, a new data structure called frequent pattern tree (FPTree) was introduced.FP-Growth was then developed based on this data structure and currently is a benchmarked and fastest algorithm for mining frequent itemset Lee, Kim, Cai, Han (2003).The benefits of FP-Growth are, it needs two times of scanning the transaction database.First, it scans the database to calculate a list of various items sorted by descending order and eliminates rare items.Then, it scans to compress the database into an FPTree structure and mines the FP-Tree recursively to construct its conditional FP-Tree.
Mansouri and Javad Kargar (2014) showed driving accidents had always been counted as one of the most likely causes of deaths in the societies today.In this study, the rules and issues motivating the traffic road accidents have been mined along with extracting a local data model after collecting the data from a diversity of sources followed by data collection and combination, data cleaning, and separating the inconsistent data.In this study used data mining methods, such as clustering and decision tree.The objective of this research was to analyse and monitor the road traffic accidents using the data mining techniques in suburban roads in Isfahan Province.The obtained results in this study are interesting and significant which can be considered by authorities as invaluable information to be used for decreasing the road accidents.Furthermore, five algorithms existing in data mining was used in this study for knowledge discovery of the accident dataset.The C5.0 decision tree algorithm proved to generate the best results and performance.Later in this research clustering of the data was also performed but did not result in separation of clusters with a specific meaning.Based on the clustering results, it can be concluded that each route follows its particular pattern and differentiating the data concerning time, vehicle, and the road status is not generalizable to all of the routes.In determining the accident type as Casualty, fatal, and car crash, the most important characteristic was the type of vehicle.
Verma, Srivastava, Chack, Diswar, and Gupta (2012) presented data clustering is a manner of setting similar data into groupings.The paper showed that the cluster algorithm divisions a data set into some groups such that the similarity within a group is larger than among groups.This study revises six types of clustering techniqueskmeans clustering, hierarchical clustering, DBS can clustering, density-based clustering, optics, EM algorithm.These clustering techniques are implemented and analysed using a clustering tool WEKA.Performance of the six techniques are obtainable and compared.The paper presented several indicates: The performance of K-means algorithm is better than hierarchical clustering algorithm, all the algorithms have some confusion in some (noisy) data when clustered, K-means and EM algorithm are very sentient for fuss in a dataset.This noise makes it complex for the algorithm to cluster data into convenient clusters while affecting the outcome of the algorithm, K-means algorithm is faster than other clustering algorithm and generates property clusters when applying, a hierarchical clustering algorithm is more sensitive for noisy data.Prajwala and Sangeeta (2014) demonstrated the two clustering algorithms considered are EM and density-based algorithm.EM algorithm is a common way of discovering the maximum likelihood estimation of data distribution when data are lost or concealed.In density-based clustering, clusters are large areas in the data space, split by sections of lower object density.This paper showed WEKA an open source tool is used for comparing these two algorithms.In conditions of likelihood, EM algorithm is better than a densitybased algorithm; the density-based algorithm takes less time than EM algorithm to build the model.Kumar and Rukmani (2010) proposed this research on the web using mining and in particular, efforts on finding the web usage procedures of websites from the server log files.The study used Apriori algorithm and Frequent Pattern Growth algorithm for evaluation memory practice and time usage.The characterize of using Apriori algorithm are Operates large item set property.It easily parallelized and easy to implement.The research showed some restrictions of Apriori algorithm.Treating a huge number of applicant sets is costly.It is tedious to recurrently scan the database and checked a large set of nominees by pattern identical, which is especially true for mining long patterns.The main distinguishes of the FP-growth algorithm is usages compressed data structure and rejects recurrent database scan.The main obstacle of the FP-growth algorithm is the fulminatory amount of lacks a good candidate generation method.Future research can combine FP-Tree with Apriori nominee to make way to solve the drawbacks of together Apriori and FP-growth.Krömer et al. (2013) used to investigate a data set describing traffic accidents in Ethiopia and use a machine learning method based on artificial evolution and fuzzy systems to mine symbolic description of selected features of the data set.Paper demonstrated there are simple fuzzy classifiers as well as complex rule-based fuzzy classification systems that usually build and maintain sophisticated rule bases.The popularity of fuzzy classifiers can be attributed to their ability to perform soft classification, to assign multiple labels to data samples, and to the ease of their interpretation.This study compared the ability of evolutionary fuzzy rules to evolve classifiers for binary and multiclass attributes.While the rules for a binary attribute were successful, the artificial evolution as implemented in this work was not able to find fuzzy rules that would accurately classify data according to selected multi-class attributes.
Rai, Verma, and Thoke (2012) defined MSApriori is an association rule mining algorithm planned to mine frequent itemsets including rare objects and to give better performance in comparison with approaches that employ single minimum support.In association rule, mining MSApriori algorithm plays an important role as it considers rare item sets.This paper proposed a novel approach MSApriori-T algorithm, which uses total support tree structure to make MSApriori algorithm more efficient.T-tree stores each item in a tree as nodes and links are available to its child node.To beat the drawback of an MSApriori algorithm that needs high storage requirement and processing time, authors proposed an approach that combines the MSA prior algorithm with a total support tree storage structure resulting in a more efficient algorithm in terms of storage requirement and processing time.

Methodology
The importance of this research is in suggesting a way using data mining algorithms to determine the causes of accidents in terms of time, road, driver nationality, and type of accident from a large set of discovered rules extracted from Alghat traffic accidents real data.This study was based on traffic accident data which taken from public traffic department in Alghat Provence in KSA within four years (1432, 1434, 1435, and 1436).One of the main obstacles, which researchers faced when collecting data from traffic department that information of the accident in traffic registration form is incomplete.For this reason, many variables had been neglected from the analysis such as the driver age, driver health status, driver behaviour, and weather state.WEKA tools used for preprocessing and analysing data.In WEKA, we implemented two tools, Apriori algorithm in association rules and EM clustering algorithm.A comparison between these algorithms (Ariori and clustering) were made to discover the factors, which caused accidents.
Figure 1 shows an Attributes relationship File Format (ARFF) for the traffic accident dataset after converted it from excel file.The header of the data is started with the name of the relationship (traffic), and a block knows the attributes (year, type of accident, location, number of vehicles, driver, injured, and death).Also, the @data line includes the values entries for each attribute.It is prepared dataset in Attribute relation format file to execute in WIKA interface.ARFF format just gives a dataset; it cannot appoint which of  the attributes the one that is supposed to be predicted.It can be applied to locate different algorithms used in WEKA software.
In this part Figure 2 displays ARFF file for the traffic accident dataset which pre-processing in WEKA explorer.The file contains 8 attributes and 946 instances.At this stage, the data will be ready for mining and extraction information by using various algorithms supported by WEKA tools.Figure 3 shows the use of the Apriori algorithm to find best results that have minSupport = 0.4 and minimum confidence = 0.9.
Figure 4. Shows the best results obtained by the EM cluster algorithm.

Results and discussions
After Apriori algorithm executed, we obtained many results, which based on the size of the set of the large itemsets.Table 2 explains the results obtained with item sets: 10, the results show that the highest number of accidents occurred in 1434, most of the incidents happened during the day, the most common types of accidents were a collision with another vehicle.The highest accidents appeared in highway the drivers who caused the accidents were non-Saudis.Most accidents happened between two cars or more.The total number of accidents was 946; there were 171 injured and 35 death.
Table 3 shows the results taken with item sets: 4, the output display that most of the incidents occurred during the day, the most common types of accidents were a collision with another vehicle.As the results, the highest number of accidents occurred in highway with 51%.Main accidents happened between two vehicles or more.
Table 4 displays the results reached with item sets 3, the highest number of accidents occurred in 1434, most of the incidents happened during the day; the most common types of accidents are a collision with another vehicle, and the highway achieved the highest accidents and drivers who caused the accidents were non-Saudis.
Table 5 shows the best rules found in Apriori algorithm.The results depend on the comparison of confidence, leverage, and convince.All rules in this table support the rules presented in above tables.(1432, 1434, 1435, and 1435), the data show that the most accidents and injured occurred in 1434.The highest death in 1435.
Table 7 represents the summarizes results obtained using the EM clustering algorithm.Through the results obtained from current study showed that in EM cluster algorithm time taken to build model was (1.58 s), and Log likelihood was (7.46685-).Loglikelihood here refers to the probability of identifying a correct group of data elements.The EM algorithm is a general statistical method of maximum likelihood estimation.EM cluster may converge to a poor locally optimal solution, therefore; it needs an unknown number of iterations to converge to a good solution (Ordonez & Omiecinski, 2002).While Appling Apriori algorithm in this research we obtained for the best result because the Apriori algorithm is an efficient algorithm for finding all frequent itemsets.Therefore Apriori algorithm more effective better than the EM cluster algorithm.

Conclusion
The aim of the study to present the implementations of the WEKA tools in data mining techniques.Apriori and cluster algorithms used to discover and concept the underlying patterns involved in the traffic accident dataset in Alghat Provence.As result of rules of both algorithms, display that Apriori algorithm performs better and faster than cluster algorithm.The paper presents Apriori algorithm is a simple and efficient tool to analyse the dataset.In general, WEKA interface is a very useful tool in data mining, which allows the user to choose many different algorithm and compare them to reach the accurately required results.

Disclosure statement
No potential conflict of interest was reported by the authors.

Figure 4 .
Figure 4. Apply the EM cluster algorithm in WEKA.

Faisal
Mohammed Nafie Ali received the B.Sc. from Omdurman Ahlia University, Faculty of Applied & Computer science in Sudan, in 2001.He got M.Sc.and Ph.D. degree in Computer Science in 2009 and 2014 respectively from Alneelain University Faculty of Computer science and Information Technology in Sudan.He worked in the field of education as Computer Science teacher in Sudan from 2002 to 2013.He worked as Assistant Professor in Majmaah University, Suadia Arabia from 2014 until now; He worked as Oracle Database Administrator in National Pensions Fund in Sudan from 2007 to 2014.He has an experience in Data mining using WEKA and Clementine.He has many Certifications in oracle database Administrator.Abdelmoneim Ali Mohamed Hamed received the B.Sc. and M.Sc.in mathematics in Sudan, Alnileen University Faculty of science, in 1989 and 2005 respectively.He has received Ph.D. in applied statistics in Sudan, Sudan University for science and technology, 2012.From 1989 to 2008, He worked in the field of education as a mathematics teacher in Sudan and Saudi Arabia.From 2009 to 2013, He worked as a lecturer at Al Ahfad University for Girls.From 2014 until now, he worked as Assistant Professor at Al Majmaah University.He has an experience in statistical analysis using SPSS and WEKA.

Table 1 .
Advantages and disadvantages of data mining tools.
IBM SPSSSPSS is a good tool for non-programmers.Easy to understand and use, with extensive documentation of use, and is quite strong with quantitative analysis.SPSS is non-code programme SPSS their graphical results not improving and model intervention is limited

Table 3 .
Size of a set of large itemsets L (2): 4.

Table 2 .
Size of a set of large itemsets L (1): 10.

Table 4 .
Size of a set of large itemsets L (2): 13

Table 6 .
The distribution of accidents per year.

Table 7 .
Summarized results obtained by using EM clustering algorithm.

Table 5 .
Best rules using Apriori algorithm.
Table 6 represents the number of accidents happened within four years