CS61B: Lecture 33 Wednesday, April 17 DISJOINT SETS ============= A _disjoint_sets_ data structure represents a collection of sets that are _disjoint_: that is, no item is found in more than one set. The collection of disjoint sets is called a _partition_, because the items are partitioned among the sets. Moreover, we work with a _universe_ of items. The universe is made up of all of the items that can be a member of a set. Every item is a member of exactly one set. For example, suppose the items in our universe are corporations that still exist today or were acquired by other corporations. Our sets are corporations that still exist under their own name. For instance, "Microsoft," "Forethought," and "Web TV" are all members of the "Microsoft" set. We will limit ourselves to two operations. The first is called a "union" operation, in which we merge two sets into one. The second is called a "find" query, in which we ask a question like, "What corporation does Web TV belong to today?" More generally, a "find" query takes an item and tells us which set it is in. We will not support operations that break a set up into two or more sets (not quickly, anyway). Data structures designed to support these operations are called "partition" or "union/find" data structures. Applications of union/find data structures include maze generation (which you'll do in Homework 9) and Kruskal's algorithm for computing the minimum spanning tree of a graph (which you'll implement in Project 3). Union/find data structures begin with every item in a separate set. -------------- ------------ -------- ------------------- -------- ----------- |Piedmont Air| |Empire Air| |US Air| |Pacific Southwest| |Web TV| |Microsoft| -------------- ------------ -------- ------------------- -------- ----------- The query "find(Empire Air)" returns "Empire Air". Suppose we take the union of Piedmont Air and Empire Air and called the resulting corporation Piedmont Air. Similarly, we unite Microsoft with Web TV and US Air with Pacific SW. -------------- ------------------- ----------- |Piedmont Air| | US Air | |Microsoft| | Empire Air | |Pacific Southwest| | Web TV | -------------- ------------------- ----------- The query "find(Empire Air)" now returns "Piedmont Air". Suppose we further unite US Air with Piedmont Air. -------------------------------- ----------- | US Air Piedmont Air| |Microsoft| |Pacific Southwest Empire Air | | Web TV | -------------------------------- ----------- The query "find(Empire Air)" now returns "US Air". When Microsoft takes over US Air, everything will be in one set and no further mergers will be possible. List-Based Disjoint Sets and the Quick-Find Algorithm ----------------------------------------------------- The obvious data structure for disjoint sets looks like this. - Each set references a list of the items in that set. - Each item references the set that contains it. With this data structure, find operations take O(1) time; hence, we say that list-based disjoint sets use the "quick-find" algorithm. However, union operations are slow, because when two sets are united, we must walk through one set and relabel all the items so that they reference the other set. Time prevents us from analyzing this algorithm in detail (but see Goodrich and Tamassia, Section 12.1.4). Instead, let's move on to the less obvious but flatly superior "quick-union" algorithm. Tree-Based Disjoint Sets and the Quick-Union Algorithm ------------------------------------------------------ In tree-based disjoint sets, union operations take O(1) time at the expense of slower find operations. However, for any sequence of union and find operations, the quick-union algorithm is faster overall than the quick-find algorithm. To support fast unions, each set is maintained as a general tree. The quick- union data structure comprises a _forest_ (a collection of trees), in which each item is initially the root of its own tree; then trees are merged by union operations. The quick-union data structure is simpler than the general tree structures you have studied so far, because there are no child or sibling references. Every node knows only its parent, and you can only walk up the tree. Fortunately, the true identity of each set is recorded at its root. Union is a simple O(1) time operation: we simply make the root of one set become a child of the root of the other set. For example, when we form the union of US Air and Piedmont Air: US Air Piedmont Air US Air ^ ^ ^ ^ | | | | Piedmont Air Pacific Southwest Empire Air Pacific Southwest ====> ^ | Empire Air US Air becomes a set containing four members. However, finding the set to which a given item belongs is not a constant-time operation. The find operation is performed by following the chain of parent references from an item to the root of its tree. For example, find(Empire Air) will follow the trail of references until it reaches US Air. The cost of this operation is proportional to the item's depth in the tree. To keep items from getting too deep, we unite sets intelligently. At each root, we record the height of its tree. When we unite two trees, we make the shorter one a subtree of the taller one (breaking ties arbitrarily). This strategy is called "union-by-height." Implementing Quick-Union with an Array -------------------------------------- Suppose the items are non-negative integers, numbered from zero. We'll use an array to record the parent of each item. If an item has no parent, we'll record the height of its tree. To distinguish it from a parent reference, we'll record the height h as the negative number -h-1. (Think of this as minus the number of nodes on the longest path from the root.) Initially, every item is the root of its own tree, so we set every array element to -1. ------------------------------- |-1|-1|-1|-1|-1|-1|-1|-1|-1|-1| ------------------------------- 0 1 2 3 4 5 6 7 8 9 The forest illustrated at left below is represented by the array at right. 8 1 2 ------------------------ / \ /|\ |1|-2|-1|8|5|8|1|3|-3|1| 5 3 9 0 6 ------------------------ | | 0 1 2 3 4 5 6 7 8 9 4 7 This is a kludgy way to implement tree-based disjoint sets, but it's fast (in terms of the constant hidden in the asymptotic notation). Let root1 and root2 be two items that are roots of their respective trees. Here is code for union. public void union(int root1, int root2) { if (array[root2] < array[root1]) { array[root1] = root2; // root2 is taller; make root2 new root } else { if (array[root1] == array[root2]) { array[root1]--; // Both trees same height; new one is taller } array[root2] = root1; // root1 equal or taller; make root1 new root } } The find() method is equally simple, but we need one more trick to obtain the best possible speed. Suppose a sequence of union operations creates a tall tree, and we perform find() repeatedly on its deepest leaf. Each time we perform find(), we walk up the tree from leaf to root, perhaps at considerable expense. When we perform find() the first time, why not move the leaf up the tree so that it becomes a child of the root? That way, next time we perform find() on the same leaf, it will run much more quickly. Furthermore, why not do the same for _every_ node we encounter as we walk up to the root? 0 0 /|\ /|||\ 1 2 3 / /|\ \ /|\ ==find(7)==> 7 4 1 2 3 4 5 6 /| |\ /|\ 8 9 5 6 7 8 9 In the example above, find(7) walks up the tree from 7, discovers that 0 is the root, and then makes 0 the parent of 4 and 7, so that future find operations on 4, 7, or their children will be faster. This technique is called "path compression." Let x be an item whose tree we wish to identify. Here is code for find, which returns the identity of the item at the root of the tree. Recall that items are numbered starting from zero. public int find(int x) { if (array[x] < 0) { return x; // x is the root of the tree; return it } else { // Find out who the root is; compress path by making the root x's parent. array[x] = find(array[x]); return array[x]; // Return the root } } Path compression changes the height of each tree, so how do we fix union-by- height? We don't. Say that the _rank_ of each tree is the height the tree would have had if we weren't using path compression. We use union-by-rank, so the union() code needs no changes, and we don't need to recompute the height of each tree after path compression (which would be too slow). The asymptotic running time of union-by-rank is just as good as that of union-by-height. What if we want some control over the names of the sets? The solution is to maintain a separate table that maps root items to set names, and perhaps vice versa (depending on the application's needs). For instance, the table might map 0 to Microsoft. The union() method must be modified so that when it unites two sets, it assigns the union an appropriate name. Often, however, we don't care about the name of a set at all; we only want to know if two items x and y are in the same set. (This is true in both Homework 9 and Project 3.) You need only run find(x), run find(y), and check if the two roots are the same. Running Time of Quick-Union --------------------------- A single find operation can take time as great as Theta(log u), where u is the number of union operations that took place prior to the find. However, the worst-case _average_ running time of union/find operations is better because of path compression. The average running time of find and union operations in the quick-union data structure is so close to a constant that it's hardly worth mentioning that, in a rigorous asymptotic sense, it's slightly slower. A sequence of f find and u union operations (in any order and possibly interleaved) takes Theta(u + f alpha(f + u, u)) time in the worst case. alpha is an extremely slowly-growing function known as the "inverse Ackermann function." This function is never larger than 4 for any values of f and u you could ever use. Hence, for all practical purposes (but not on the exam), you should think of quick-union as having find and union operations that run, on average, in constant time.