Parallel Algorithms for Association Rule Mining
1 Introduction
Data mining is the analysis of very large datasets in order to find interesting trends. These trends or patterns describe the data and offer insight into possible future activities. Of particular interest in the field of data mining is the discovery of association rules. Association rule mining (ARM) finds rules which give us a correlation between sets of items and their occurrences in a set of records. The most typical application of ARM is in market-basket analysis, where items are represented by products and records are represented by sale transactions for a particular retail store. For example, if a set of items A is purchased from a particular store, then we can expect a set of items B to be purchased with a certain amount of confidence. We illustrate this rule more formally by writing A → B.
2 Problem Statement
Suppose we have a set of items I = {i1, i2, i3, ..., in} and let D, the database, be a set of transactions over I. Each transaction contains a subset of I, called an itemset. An itemset with k items is called a k-itemset. For any itemset X, the support of X, denoted supp(X), is the percentage of transactions in which X appears as a subset. All itemsets with support greater than some specified minimum support, minsup, are called frequent or large. For any association rule A → B, the support of the rule is the joint support of itemsets A and B, that is, supp(A ∪ B). Each rule can be asserted with a certain amount of confidence, which is the probability that a transaction contains B given that it contains A, and is given by supp(A ∪ B) / supp(A). All rules with confidence greater than some specified minimum confidence, minconf, are said to be strong.
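To make the definitions concrete, here is a minimal Python sketch that computes support and confidence on a toy database. The transactions and the helper names supp and confidence are my own illustrations, not part of any library, and confidence is computed with the standard definition supp(A ∪ B) / supp(A).

```python
from fractions import Fraction

# Hypothetical toy database: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def supp(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in db if itemset <= t)
    return Fraction(hits, len(db))

def confidence(a, b, db):
    """Confidence of the rule a -> b, i.e. supp(a U b) / supp(a)."""
    return supp(set(a) | set(b), db) / supp(a, db)

# Rule {bread} -> {milk}: its support is supp({bread, milk}).
rule_support = supp({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

Exact fractions are used instead of floating-point percentages only so that the comparison against minsup and minconf is free of rounding effects in such a small example.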
ARM involves finding those rules whose support is greater than minsup and whose confidence is greater than minconf from a database. This can be broken down into two steps and proceeds according to the following property:
The Apriori Property: Every subset of a frequent itemset is also a frequent itemset.
Step 1: Generate all frequent itemsets.
Step 2: Generate strong rules from the itemsets found in step 1. For each frequent itemset A we generate rules of the form A\B → B, where B is a nonempty proper subset of A, and test the confidence of each rule.
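Step 2 can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name generate_rules, the dictionary layout (frozenset mapped to support), and the toy support values are hypothetical. The lookup of the antecedent's support always succeeds because, by the property above, every subset of a frequent itemset is itself frequent.

```python
from itertools import combinations

def generate_rules(frequent, minconf):
    """frequent maps each frequent itemset (a frozenset) to its support.
    Returns (antecedent, consequent, confidence) for every strong rule."""
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a nonempty antecedent and consequent
        for r in range(1, len(itemset)):
            for items in combinations(sorted(itemset), r):
                consequent = frozenset(items)        # B
                antecedent = itemset - consequent    # A \ B
                conf = support / frequent[antecedent]
                if conf > minconf:
                    rules.append((antecedent, consequent, conf))
    return rules

# Hypothetical output of step 1 for a four-transaction toy database.
frequent = {
    frozenset({"bread"}): 0.75,
    frozenset({"milk"}): 0.75,
    frozenset({"butter"}): 0.5,
    frozenset({"bread", "milk"}): 0.5,
    frozenset({"bread", "butter"}): 0.5,
}
rules = generate_rules(frequent, minconf=0.7)
```

With these numbers only butter → bread survives, since every transaction containing butter also contains bread, while the other candidate rules have confidence 0.5 / 0.75.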
3 Apriori Algorithm: A Sequential Overview
Rakesh Agrawal et al. of the IBM Research Division in Almaden proposed an algorithm that is one of the most widely used ARM algorithms today. It is called the Apriori algorithm and proceeds in an iterative bottom-up fashion.
    L1 := set of all frequent 1-itemsets
    k := 2
    repeat until no new frequent itemsets are found
        Ck := set of candidate k-itemsets generated from Lk-1
        foreach transaction t in the database
            update the count of each itemset in Ck that is contained in t
        Lk := all itemsets in Ck with support > minsup
        k := k + 1
    end repeat
    ans := union over all k of Lk

There are three steps involved in each iteration. First, generate candidates of length k from Lk-1 using a self-join. Then prune the search space by removing any candidate that has a subset that is not frequent. Finally, scan the transactions and calculate the supports for all the candidates. To make this process quick, the candidates are stored in a hash-tree data structure where the internal nodes contain hash tables to direct the search and the leaves contain the candidate counts.
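The three steps can be sketched in Python as follows. This is a simplified illustration rather than the original implementation: the function name apriori is my own, the self-join simply unions every pair of (k-1)-itemsets instead of joining on a shared prefix, and a plain dictionary stands in for the hash tree.

```python
from itertools import combinations

def apriori(db, minsup):
    """Return {itemset: support} for every itemset with support > minsup.
    db is a list of sets; minsup is a fraction between 0 and 1."""
    n = len(db)
    # L1: count the single items.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {c: cnt / n for c, cnt in counts.items() if cnt / n > minsup}
    frequent = dict(level)
    k = 2
    while level:
        prev = list(level)
        # Step 1, self-join: unions of two (k-1)-itemsets that have k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Step 2, prune: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        # Step 3, count: one scan over the transactions (dict, not hash tree).
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n > minsup}
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical toy database.
db = [{"bread", "milk"}, {"bread", "butter", "milk"}, {"bread", "butter"}, {"milk"}]
result = apriori(db, minsup=0.4)
```

On this toy input the candidate {bread, butter, milk} is pruned before counting, because its subset {butter, milk} is not frequent.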
4 Parallel Algorithms
Three parallel algorithms based on the Apriori algorithm have been proposed by Agrawal and Shafer. All three assume a distributed-memory architecture, using the standard MPI communication primitives, where the data is evenly distributed across all processors with no special ordering of the transactions. I will discuss all three algorithms, but my main focus will be on the Count Distribution algorithm.
4.1 Count Distribution Algorithm
The Count Distribution algorithm begins by having each processor independently build the candidate hash-tree from Lk-1. Each processor then calculates local supports using its local partition of the database. After all local supports have been calculated, global supports are calculated through a sum reduction in which all the processors exchange their local counts. Each processor can now compute Lk from Ck, and the entire process is repeated for the next pass as needed.
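One pass of Count Distribution can be simulated in a single Python process, with a list of partitions standing in for the processors. The function names and the toy data are hypothetical, and the loop that adds up the counters is only a stand-in for the real exchange; in an MPI program I would expect this step to be a sum all-reduce, but that mapping is my assumption.

```python
from collections import Counter

def local_counts(partition, candidates):
    """The count pass one processor runs over its own partition only."""
    counts = Counter()
    for t in partition:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return counts

def count_distribution_pass(partitions, candidates, minsup):
    """Simulate one pass: count locally, sum the counts, then filter.
    Every simulated processor uses the same full candidate set."""
    total = sum(len(p) for p in partitions)
    per_processor = [local_counts(p, candidates) for p in partitions]
    global_counts = Counter()
    for counts in per_processor:  # stands in for the sum reduction
        global_counts.update(counts)
    return {c for c in candidates if global_counts[c] / total > minsup}

# Hypothetical data: two "processors" with two transactions each.
partitions = [
    [{"a", "b", "c"}, {"a", "b"}],
    [{"a", "c"}, {"b"}],
]
candidates = [frozenset("ab"), frozenset("ac"), frozenset("bc")]
frequent_2 = count_distribution_pass(partitions, candidates, minsup=0.4)
```

Note that only the counts cross the processor boundary, never the transactions, which is where the low communication cost of this method comes from.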
4.2 Data Distribution Algorithm
The problem with the Count Distribution method is that the total amount of processor memory is not exploited: every processor counts the same set of candidates on each pass, so adding processors does not increase the number of candidates that can be counted. The Data Distribution algorithm seeks to better exploit the total system memory by having each processor count mutually exclusive candidates, so that as the number of processors increases, so does the number of candidates that can be counted in one pass. Each processor must first generate Ck from Lk-1 and retain 1/N of the candidates, where N is the number of processors. To generate the global support, each processor must scan its own partition as well as the partitions of the other processors. It is easy to see that this algorithm involves quite a bit of communication, as all the processors must exchange their local partitions.
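The same kind of single-process simulation shows how Data Distribution differs. The sketch below is illustrative only: the function name is hypothetical, and the round-robin split of the candidates is my own assumption, since the text only says that each processor retains 1/N of them.

```python
def data_distribution_pass(partitions, candidates, minsup):
    """Simulate one pass: each processor counts only its own share of the
    candidates, but must scan every partition to obtain global counts."""
    n_proc = len(partitions)
    total = sum(len(p) for p in partitions)
    # Fix an order so that all processors agree on who owns which candidate.
    ordered = sorted(candidates, key=sorted)
    frequent = set()
    for rank in range(n_proc):
        mine = ordered[rank::n_proc]  # assumed round-robin 1/N share
        counts = {c: 0 for c in mine}
        # The local partition plus every remote one; on a real machine the
        # remote partitions would have to be received over the network.
        for partition in partitions:
            for t in partition:
                for c in mine:
                    if c <= t:
                        counts[c] += 1
        frequent |= {c for c in mine if counts[c] / total > minsup}
    return frequent

# Hypothetical data: two "processors" with two transactions each.
partitions = [
    [{"a", "b", "c"}, {"a", "b"}],
    [{"a", "c"}, {"b"}],
]
candidates = [frozenset("ab"), frozenset("ac"), frozenset("bc")]
frequent_2 = data_distribution_pass(partitions, candidates, minsup=0.4)
```

Each simulated processor holds fewer candidates in memory, but the inner loop over all partitions makes the communication cost visible: every transaction is seen by every processor.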
4.3 Candidate Distribution Algorithm
Both of the algorithms above require synchronization at the end of each pass, so processors must wait for whichever processor finishes last. This can be problematic if the amount of work is not evenly balanced. The Candidate Distribution algorithm heuristically determines a pass j at which the frequent itemsets are redistributed so that each processor can independently generate a unique set of candidates. The database must be replicated in order for the processors to continue asynchronously.
5 Performance
I am only going to discuss the results of the Count Distribution algorithm, since it turned out to be superior to the other two. Communication costs for Data Distribution proved to be too high, and the cost of data redistribution for Candidate Distribution turned out to be too much compared to the cost of synchronization.
All experiments were run on a 32-node IBM SP2 Model 302 with each node consisting of a POWER2 processor running at 66.7MHz with 256MB of memory. The processors all run AIX level 3.2.5.
Name            T    I    D1      D16      D32
D2016K.T10.I2   10   2    2016k   32256k   64512k
D1456K.T15.I4   15   4    1456k   23296k   46952k
D1140K.T20.I6   20   6    1140k   18240k   36480k

T = Average transaction length
I = Average size of frequent itemsets
D = Average number of transactions
The datasets used in the testing are listed in the table above, each with the average size of the database for the single-node, 16-node, and 32-node configurations. Scaleup experiments were performed in which the size of the database was increased in direct proportion to the number of nodes in the system. Below are plots of the total response time and of the scaleup, with the response time normalized with respect to the response time for a single processor. The Count algorithm scales nicely: the response time is almost constant as the size of the database and the number of processors increase. With more processors we do see a slight increase in response time, which may be explained by the increase in communication as more processors are used.
Figure 1
For the speedup experiments, the size of all three databases was fixed at 400MB while the number of processors varied. The tests were run for configurations of up to 16 nodes. Here we see that Count has very good speedup performance, except that performance begins to fall off at 8 processors. At that point the amount of data per processor becomes small enough that communication becomes significant in the total response time.
Figure 2
6 Conclusion
Count Distribution has been shown to exhibit excellent scalability. Load balancing was not an issue in these tests: the results show that of the 7.5% overhead, 2.5% was spent in synchronization. The data used for these tests were evenly distributed across all processors in no particular order; however, skewed data or even nodes with unequal capabilities could easily cause a problem. More work would be needed to find a good load-balancing strategy that can be used with the Count Distribution method. The standout feature of Count is the minimal amount of communication needed to compute supports. This is a tradeoff, however, as redundant computation, such as candidate hash-tree building, is needed to avoid further communication.
References
- R. Ramakrishnan, J. Gehrke. Database Management Systems. McGraw-Hill, 2003.
- M. J. Zaki. Parallel and Distributed Association Mining: A Survey. IEEE Concurrency (7), October-December 1999.
- A. Y. Zomaya, T. El-Ghazawi, O. Frieder. Parallel and Distributed Computing for Data Mining. IEEE Concurrency (7), October-December 1999.
- R. Agrawal, J. C. Shafer. Parallel Mining of Association Rules. IEEE Transactions on Knowledge and Data Engineering (8), December 1996.
January 28, 2004