CS61B: Lecture 39 Wednesday, May 1 AMORTIZED ANALYSIS ================== We've seen several data structures for which I claimed that the average-case time for certain operations is always better than the worst-case time. These data structures include hash tables, tree-based disjoint sets, and splay trees. The analysis technique that proves these claims is called _amortized_analysis_. Amortized analysis is a way of proving that even if an operation is occasionally expensive, its cost is made up for by other, cheaper occurrences of the same operation. Hash Tables ----------- Before we examine amortized analysis, let's look at the average-case time for inserting into a hash table. Assume that we double the size of the table whenever the load factor exceeds one - that is, when the number n of items exceeds the number N of buckets. Suppose we begin with an empty hash table with one bucket. Also assume that the chains don't grow very long, so any operation that doesn't resize the table takes O(1) time - more precisely, suppose it takes at most one second. Next, perform multiple insert operations. Whenever n is a power of two, and an insertion is performed, the table is resized from N = n to N = 2n (before n is incremented). Suppose this resizing operation takes at most 2n seconds. After i insert operations, n = i and N is the smallest power of two >= n. The total seconds taken by _all_ the resizing operations is 2 + 4 + 8 + ... + N/4 + N/2 + N = 2N - 2. So the cost of i insert operations is at most i + 2N - 2 seconds. Because N < 2i, the i insert operations take O(i) time, and so the _average_ running time of an insertion operation is O(1). We say that the _amortized_running_time_ of insertion is O(1), even though the worst-case running time is O(n). For almost any application, the amortized running time is more important than the worst-case running time, because the amortized running time determines the total running time of the application. The main exceptions are some applications that require fast user interaction (like video games), for which one slow operation might cause a noticeable glitch in response time. Amortized Analysis: The Accounting Method ------------------------------------------ The total-time argument for hash tables is straightforward, but it's hard to adapt it to the case where hash tables can shrink when items are removed. Let's examine a more sophisticated method. In the _accounting_method_, we "charge" each operation a certain amount of time. Usually we overcharge, but we can save time in the bank for use during later operations. We don't actually know how many seconds any computation takes, because it varies from computer to computer. Fortunately, constant factors don't matter in asymptotic analysis. Everything a computer does can be broken down into constant-time computations, so we will let a _dollar_ be a unit of time that's large enough to execute the slowest constant-time computation that comes up in the algorithm we're analyzing. Each operation has an _amortized_running_time_, which is the number of dollars that are "charged" to do that operation. Each operation has an _actual_ _running_time_, which is the actual number of constant-time computations it takes. If an operation's amortized time exceeds its actual time, the extra dollars are saved in the bank to be spent on later operations. If an operation's actual time exceeds its amortized time, dollars are withdrawn from the bank to pay for an unusually expensive operation. The bank balance must never fall below zero. If it does, you have failed to prove anything about the amortized running time of the algorithm. Amortized Analysis of Hash Tables --------------------------------- Suppose that every operation (insert, find, remove) takes one dollar of actual running time unless the hash table is resized. We resize the table in two cases: we double its size if the load factor exceeds one, and we halve its size if the load factor drops below 0.25. We charge the following amortized running times. insertItem: 5 dollars remove: 5 dollars findElement: 1 dollar Is this accounting valid, or will we go broke? Let's consider a hash table that is new, or has just been resized. If it has N buckets, then it has between N/2 - 1 and N/2 + 1 items. (You can see this by checking all three cases: if the table is new, it has zero items; if it's just been doubled, it has N/2 + 1 items; if it's just been halved, it has N/2 - 1 items.) Almost every insert operation costs one actual dollar and puts four dollars in the bank. An insert operation only resizes the table if the number of items reaches N + 1. If this happens, at least N/2 insert operations have occurred since the last resizing, so there are at least 2N dollars in the bank. The cost of resizing the hash table from N to 2N buckets is 2N dollars, so we can afford it. Almost every remove operation costs one actual dollar and puts four dollars in the bank. A remove operation only resizes the table if the number of items reaches N/4 - 1. If this happens, at least N/4 remove operations have occurred since the last resizing, so there are at least N dollars in the bank. The cost of resizing the hash table from N to N/2 buckets is N dollars, so we can afford it. The bank balance never drops below zero, so the amortized time of all three operations is O(1). Observe that if we alternate between inserting and deleting the same item over and over, the hash table is never resized, so we save up a lot of money in the bank. This isn't a problem; it just means the algorithm is faster (spends fewer dollars) than the amortized times indicate. Why Does Amortized Analysis Work? --------------------------------- Why does this metaphor about putting money in the bank tell us anything about the actual running time of an algorithm? Suppose our accountant keeps a ledger with two columns: the total amortized time of all operations so far, and the total actual time of all operations so far. So long as the bank balance never drops below zero, the total actual time is less than or equal to the total amortized time. Therefore, the algorithm never takes longer than the total amortized time. Amortized analysis (as presented here) only tells us an upper bound (big-Oh) on the actual running time, and not a lower bound (big-Omega). It might happen that we accumulate a big bank balance and never spend it, and the total actual running time might be much less than the amortized running time. For example, splay tree operations take amortized O(log n) time, but if your only operation is to access the same item n times in a row, the actual average running time is O(1). Amortized Time vs. Expected Time -------------------------------- There's a subtle but important difference between amortized running time and _expected_running_time_. Expected running time is a probabilistic idea, usually applied to randomized algorithms like quicksort and quickselect. Quicksort with random pivots has an _expected_ running time of O(n log n), though its worst-case running time is Theta(n^2). This means that there is a small possibility that quicksort will cost Omega(n^2) dollars, but the probability that this will happen approaches zero as n grows large. On the other hand, a splay tree operation takes O(log n) amortized time, but the worst-case running time for a splay tree operation is Theta(n). Splay trees are not randomized, and the "probability" of a Theta(n) operation is not a meaningful concept. If you take an empty splay tree, insert the items 1...n in order, then run findElement(1), the find operation _will_ cost n dollars. But a sequence of n splay tree operations, starting from an empty tree, _never_ costs more than O(n log n) actual running time. Hash tables are an interesting case, because they use both amortization and randomization. Suppose that each imaginable item hashes to a randomly chosen bucket. (This is what the ideal hash code/hash function combination would do.) Probability theory can show that the expected running time of any operation is O(1) if the table is not resized. However, there is a tiny probability that every item will hash to the same bucket, so the worst-case running time of an operation is Theta(n) - even without resizing. To account for resizing, we use amortized analysis. To account for collisions, we use probabilistic analysis. So when we say that hash table operations run in O(1) time, we mean they run in O(1) _expected_, _amortized_ time. Splay trees O(log n) amortized time / operation * Disjoint sets (tree-based) O(alpha(f + u, u)) amortized time / op ** Quicksort O(n log n) expected time *** Quickselect Theta(n) expected time **** Hash tables Theta(1) expected amortized time / op ***** If you take CS 170, you will learn an amortized analysis of disjoint sets there. Unfortunately, the analyses of both disjoint sets and splay trees are complicated. If you're curious, the handout on splay trees includes the amortized analysis, but you're not required to read or understand it for this class. * Worst-case time is Theta(n), worst-case amortized time is Theta(log n), best-case time is Theta(1). ** For find operations, worst-case time is Theta(log n), worst-case amortized time is Theta(alpha(f + u, u)), best-case time is Theta(1). Union operations always take Theta(1) time. *** Worst-case time is Theta(n^2). Worst-case expected time is Theta(n log n). Recall that quicksort can be implemented so that keys equal to the pivot go into a separate list, in which case the best-case time is Theta(n), with the best case being that all the keys are equal. If quicksort is implemented so that keys equal to the pivot are divided between lists I1 and I2, as is the norm for array-based quicksort, then the best-case time is Theta(n log n). **** Worst-case time is Theta(n^2), expected time is Theta(n), best-case time is Theta(n). ***** Worst-case time is Theta(n), expected worst-case time is Theta(n) (worst case is when table is resized), amortized worst-case time is Theta(n) (worst case is every item in one bucket), expected amortized worst-case time is Theta(1), best-case time is Theta(1). Confused yet?