From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: August14老 2001斥 Tuesday 2:20 PM To: Chris Savarese; MARKGREGORY WHITNEY; Jaein Jeong Subject: A Class of Odd-Weight-Column SEC-DED-SbED Codes for Memory System Applications Hi, This is another single error correction doulbe error detection paper. It makes H matrix in a different way although it detects and corrects errors in the same manner. Let r be an arbitary but positive even integer ( r > 3). Step 1: Let g be a column vector with (r/2) entries. It is preferred if g has all 1's. Step 2: Let fq a column vector with (r/2) entries and odd number of 1's. (fq has even number of 1's if g has odd number of 1's.) Step 3: For distinct i, j, we can make a matrix hij such that: hij = [ g + fi + fj | g + fi + fj | fi | fj ] [ fi | fj | g + fi + fj | g + fi + fj ] Since fi, fj are odd-weight if g is even-weight and fi, fj are odd-weight if g is odd-weight, the columns of hij are odd-weight. Step 4: We can make matrix H by concatenating hij's. H = [h01, h02, h03, ..., hij ... ] Let us have an example for clarity. If we pick g = [11]', then fq's are [01]' [10]' [0001] [0010] h01 = [0010] , h10 = [0001] [0100] [1000] [1000] [0100] [00010010] H = [00100001] [01001000] [10000100] When we multiply H by C, (1) H * C = 0 if there is no errors C = [10110111]' H * C = [0000]' (2) H * C has odd number of 1's: single bit error, can find the location by comparing the H * C with each column of H as we did the previous paper. C = [11101100]' H * C = [0100]' C = [11001101]' H * C = [0100]' C = [11101111]' H * C = [1000]' C = [11111101]' H * C = [1000]' C = [11011010]' H * C = [0001]' C = [11110111]' H * C = [0010]' (3) H * C has even number of 1's: double bit error is detected. C = [11011101]' H * C = [1100]' ------------ Jaein From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: August14老 2001斥 Tuesday 1:12 PM To: MARKGREGORY WHITNEY; Chris Savarese; Jaein Jeong Subject: A Class of Optimal Minimum Odd-weight-column SEC-DED Codes Hi, This week I cover error correction and detection papers. This paper starts with saying that "Minimum weight of matrix H is at least w iff every combination of (w - 1) or fewer columns of H is linearly independent." The author proposes a minimum weight 4 code the H matrix of which meets: (1) There are no all-0 columns. (2) Every column is distinct. (3) Every column contains an odd number of 1's (odd weight). Property (2) implies every combination of two column vectors is linearly independent. Property (3) means every combination of three column vectors is linearly independent. Otherwise, one of the column vector should have even number of 1's, which contradicts the assumption. Since every combination of 3 (= 4 - 1) column vectors is linearly independent, H matrix has minimum weight 4 by definition. Then, how to make matrix H? Suppose the lengths of data word, check bits, and code word are k, r, and n (= k + r). H matrix should have n columns. First choose the columns which have one 1's: rC1 Next, choose those columns which have three 1's: rC3 Then, choose the columns with five 1's: rC5 ... Repeat this procedure until we have n columns For the sake of convenience, we rearrange so that columns with one 1's are on the right end of H like: [ 1000] [ ... 0100] H = [ 0010] ... [ 0001] How to detect and correct errors? (1) Detecting 1-bit error If we have non zero vector when we multiply H by C (code vector) in GF(2), then we have at least 1-bit error. (2) Detecting 2-bit error If the product of H*C has even number of 1's. It has a 2-bit error. Actually, it doesn't discriminate any even number bit errors with 2-bit errors. (3) Correcting 1-bit error If the error is 1-bit error, H*C has odd number of 1's and it should match of one of H's n columns. Suppose H*C matches j'th column of H. Then we can correct the code word by toggling j'th bit of code word. I will give an example: k = 4, r = 4, and n = 8 [11101000] H = [11010100] [10110010] [01110001] b0 = [11111111]' (where ' means transpose) b0 is a correct code word, and H*b = [0000]' See what happens if we change one bit of b0: b08 = [11111110]' H*b08 = [0001]' b04 = [11101111]' H*b04 = [0111]' Both H*b08 and H*b04 have odd number of 1's so they are 1-bit errors. If we compare H*b08 and H*b04 with columns of H, we can know that they are equal to 8-th and 4-th columns. So we found that errors occurred in 8-th and 4-th positions. We get correct the code word by flipping 8-th and 4-th bits. Now, let us change two bits of b0: b078 = [11111100]' H*b078 = [0011]' H*b078 has even number of 1's. So it has a 2-bit error. -------------- Jaein From: Chris Savarese [savarese@EECS.Berkeley.EDU] Sent: July25老 2001斥 Wednesday 2:20 AM To: Jaein Jeong; MARKGREGORY WHITNEY Subject: A Comparitive Analysis of Schemes for Correlated Branch Prediction These people worked really hard to write a paper with lots of really big words and very little interesting content. They reached some pretty interesting conclusions, most of which exhibit insight on par with such observations as "2+2=4". The paper gets off to a good start with its categorization of branch prediction schemes. Basically, a branch predictor can be broken down into 2 parts: the divider and the predictor. The divider is used to parse the branches, separating them by various criterion. The decisions made in the divider control which predictor is accessed for the final branch prediction. This can be thought of as a 2 level branch prediction system, where some criterion is used to access the first-level table, and then the information from that location in the first-level table is used to index the second-level table. Dividers: They discuss several different types of dividers. The simplest is the identity divider, which isn't really a divider. This scheme would imply that there is only one predictor, and that all branches are predicted from this single predictor. The next divider is called a per-branch substream divider. It divides based on each individual branch, implying that there are n predictors associated directly (1 to 1) with n branches in a program. Next, we can subdivide each of these per-branch streams into per-branch global-pattern streams. This method, sometimes referred to as the GAs method, partitions each per-branch substream based on the pattern of directions of the k preceding branches in the program. In other words, the branch prediction for a particular branch is affected by the branches immediatetly near it in sequential execution. Another method is called per-branch branch-pattern streams, aka PAs, and is yet another way to subdivide the per-branch streams. It exploits repeating patterns in the execution of a single branch, but I don't understand how. The final method is the authors' creation, called per-branch global-path streams, or static correlated branch prediction (scbp). It divides based on the path taken to get to the current branch. Note that a path implies more info than a pattern because it specifies both the branch identifiers and directions, rather than just the directions. Static branch prediction schemes assign predictions before a program runs, whereas dynamic prediction schemes can change their predictions as a program executes. Examples of static predictors are "always predict taken," "backwards taken, forwards not taken," and profiling schemes. Profiling schemes use test runs of a program to assign static branch predictions for each branch in a program based on collected statistics. The most popular dynamic predictor is the saturating 2-bit up/down counter. Implementation of these schemes can get messy due to the very large sizes of the tables generated by the above schemes. Dividers often implement aliasing to reduce the size of the second table in the prediction scheme. Aliasing uses the lower bits of a predictor address to access a shared predictor. Thus, a divider that would normally generate n predictor streams only generates n/(2^j) streams with aliasing, where j is the number of high bits in the predictor address that are ignored. The rest of the paper degenerates into simulations where they make lofty claims based on graphs that don't support them. They discovered that aliasing can reduce the accuracy of predictions. This is really incredible, considering that aliasing deliberately underminds the efforts of the divider. I mean really, it's shocking that it should reduce accuracy (note heavy sarcasm here). They seem to forget that aliasing is a necessary part of implementation. Without it, the tables get so large that it's more expensive to implement the branch predictor than is justified by the performance gain derived from it. They also show that path history is better than pattern history. Again, this isn't very surprising, given that path history contains more information than pattern history. Finally, they show that cross procedure correlation can improve the accuracy of static predictors. This isn't very surprising either, given that local knowledge of branch history is a subset of global knowledge, so once again, the better case is better because it has access to more information. From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: July19老 2001斥 Thursday 4:49 PM To: MARKGREGORY WHITNEY; Chris Savarese; Jaein Jeong Subject: A Preliminary Architecture for a Basic Data-Flow Processor Hi, This paper explains how we can make a data flow processor. First, the paper showes the elementary processor, which can do only arithmetic operations. The elementary processor consists of instruction cells, arbitration network, operation units distribution network. Packets(data packets and operation packets) travel along the processor. An instruction cell waits for operands from data packets. When all of its operands are filled, it fires an operation packet. The operation packet flows to an operational unit through arbitration network. The opeartional unit executes the operation and sends a data packets containg the result and the instruction cell where the result should be stored. The data packet moves to an instruction cell through distribution network. The authors develop this idea by adding decision units and control network for flow control intructions. Naturally, we need decision packets and control packets which are counterparts of operation pakcets and data packets. Decision units consist of decider, T-gate, F-gate, merge. Finally, they expand their system into 2-level memory architecture. Instructions which are not used often is moved back to instruction memory. The instructions can be retrieved again when they are needed. Actually, the data flow architecture issues multiple instructions per cycle and executes them in multiple functional units. As we saw in simultaneous multithreading paper, the instructions can have contention between them. In addition, it is not clear they got the machine code for the data flow machine. So the performance of the architecture is dubious. ------- Jaein From: Chris Savarese [savarese@EECS.Berkeley.EDU] Sent: July18老 2001斥 Wednesday 1:13 AM To: Mark Whitney; jaein@eecs.berkeley.edu Subject: A VLIW Architecture for a Trace Scheduling Compiler I'm assuming you know the general philosophies, pros, and cons to VLIWs in general, so I'll skip to the stuff unique to this paper: the compiler and their implementation of a VLIW architecture. Their compiler is a called a "trace scheduling compacting compiler." They claim it finds and exploits fine-grained paralellism in any application with no programmer intervention. It's completely transparent to the code, programmer, and user. The compiler starts by performing a complete set of classical optimization. It then builds a flow graph of the program and measures the likelihood of all paths through the program (paths defined by branching). The most likely path, or "trace," is treated as if it were completely free of all conditional branches, allowing for very large code blocks to be combined into VLIW instructions. Special "compensation code" is included to clean up any off-trace branches that occur. This process is then repeated with the next most-likely trace, and so on until all possible traces through the entire program have been accounted for. They built a VLIW architecture with integer and floating point groups of functional units, all pipelined. 1024-bit instructions initiate 28 operations per instruction, giving a peak performance of 60 MFLOPS. It can support up to 8 memory controllers, for a max of 512MB. They use interleaved memory addressing, and they do not have a data cache. Software sees a 7-beat (3.5-cycle) memory reference pipeline. A max of 4 memory references may be started in each beat (2 beats per cycle, 1 beat = 65ns), for a max memory BW of 492MB/s. There was concern that virtual memory page faults would lead to imprecise interrupts. They implemented a "history queue" to resolve this. I believe this history queue is identical to the history buffer discussed in the precise interrupts paper by Smith. They have a full-width instruction cache that is physically distributed across the boards of the system. Bits of each instruction word are sent to the boards that they will control. The I$ can achieve 984MB/s, and can store up to 1MB of instructions. To keep code size down, they use variable length main memory representation of fixed length machine instructions. In other words, they use 32-bit headers in main memory to allow them to eliminate the NOPs generally produced by code that doesn't perfectly fit into VLIW parallelized blocks. Branching is implemented with a delayed branch slot, and is based on priority judgements. For example, if multiple branches are compacted into a single VLIW instruction word, it is possible that several branches could evaluate to taken, causing ambiguities. These ambiguities arise from situations in which one branch would cause another to be skipped in a sequential machine, while in the VLIW machine, they occur simultaneously. Thus, the compiler asigns priorities to branches based on the sequential machine version, and these priorities are used to decide which branch to accomodate in cases where multiple branches evaluate to taken. Moving instructions around conditional branch boundaries causes other problems. For example, accesses to an array might be controlled by a conditional branch that would prevent out-of-bound accesses in a sequential machine. In the VLIW machine, however, the access might be moved around the branch, allowing illegal accesses, and undesirable exceptions. Another example would be a check to prevent divide-by-zero cases in the FP units. To handle these situations, the designers included special instructions that basically allow such illegal operations to proceed without causing exceptions until an attempt is made to alter the state of the machine with the faulty data. They compared code sizes for several large (100k-300k) Fortran programs as compiled for TRACE and for the VAX. They found that the TRACE code was usually about 3 times the size of the VAX code. They also tested the TRACE on several benchmarks, and compared the results to the VAX and the CRAY. They found that the TRACE had about 4-10 times the performance of the VAX, which costs twice as much as the TRACE, and about 1/2-1/5 the performance of the CRAY, which costs over twenty times as much. That's it. See you both tomorrow on the 4th floor of Soda at 8:30. Jaein, I'm going to plan on calling you in your office again to let me in. Thanks, Chris From: Chris Savarese [savarese@EECS.Berkeley.EDU] Sent: August14老 2001斥 Tuesday 2:55 PM To: Jaein Jeong; MARKGREGORY WHITNEY Subject: A power-hungry processor from 1992 vs. a power-efficient processor from 1996 Hi guys, I'm combining the summary of the following 2 papers: 1) A 200-MHz 64-b Dual-Issue CMOS Microprocessor 2) A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor I'll refer to the processor from the first paper as processor 1, and that of the second paper as processor 2. Both papers have a very long list of authors, and many of the authors are common to both papers. Processor 2 is actually the DEC StrongARM 110, while processor 1 doesn't seem to have a name associated with it. Processor 2 is a greatly improved implementation of processor 1. By improved, I mean that processor 2 gets about half the performance of processor 1 while consuming about 1/60 the power: Processor 1: 200 MHz, 400 MIPS, 30W Processor 2: 160 MHz, 185 MIPS, 0.5W They used several techniques for reducing the power, and they estimate the following savings from each: Vdd reduction, from 3.3V to 1.5V: power cut by factor of 5.3 Reduced functionality*: power cut by factor of 3 Scale process from .75um to .35um: power cut by factor of 2 Reduce clock load**: power cut by factor of 1.3 Reduce clock rate from 200MHz to 160MHz: power cut by factor of 1.25 * They removed the FP unit and the branch history table. In fact, they don't do any branch prediction in processor 2. Instead, they worked to ensure a single-cycle branch penalty, so that the lack of branch prediction HW wasn't such a big deal. ** They switched the latches on the clock tree to single edge-triggered latches to reduce clock load and latch delay. This cut the clock power by a factor of about 2, but clock power only consumed about 65% of the power on processor 1, so this is estimated to cut the overall power by a factor of 1.3. Other than these details, the authors comment that the two architectures are quite similar, with a few cases of processor 2 being simpler than processor 1. For example, #2 is single issue with a classic 5-stage pipeline, whereas #1 is dual issue with 7 stages. #1 has branch prediction and scoreboarding, while #2 has neither. The caching strategies are different too. #1 uses 8kB D and I caches, both direct-mapped. #2 uses 16kB D and I caches, both 32-way set associative. I know this sounds crazy, but well, they are 32-way... This high level of associativity was a result of their use of banks of fully associative caches. Each cache is implemented as 16 fully associative blocks. Processor 2 supports power down modes: Idle and Sleep. In Idle mode, the internal clock grid and the bus clock stop toggling, while the PLL stays active. This cuts the power down to about 20mW. Sleep mode cuts off the internal power to the chip, reducing the power to about 50uA. The caches block on misses, so during cache fills, the internal clock switches over to the external clock speed, which is much slower, and matches the cache fill speed. This saves on power. Nevertheless, as the caches occupy a signigicant portion of the chip (80%?), they dominate the power consumption. From: Chris Savarese [savarese@EECS.Berkeley.EDU] Sent: August08老 2001斥 Wednesday 12:20 AM To: jaein@eecs.berkeley.edu; whitney@eecs.berkeley.edu Subject: Active Messages: a Mechanism for Integrated Communication and Computation They argue that current multiprocessors ("current" meaning 1992) fall into 2 categories: message passing and message driven. Message passing architectures focus primarily on computational abilities of the processors, failing to pay enough attention to efficient communication between processors. Message driven systems focus too heavily on messages, resulting in short computation run-lengths (since processors can only compute so much before they have to wait for the next message), which limits utilization of the processors' abilities. They propose balancing these issues by providing a scheme that allows computation and communication to be overlapped in order to reduce communication overhead. Communication is asynchronous. Messages have handlers and payloads. Basically, instead of just passing data between processors, the processors pass messages that are decoded into instructions with arguments. When a message arrives, it is decoded very quickly and easily and inserted into the normal flow of computation. They present Split-C, an experimental programming model that uses active messages. It's C with the addition of GET and PUT commands. These commands are passed to other processors with arguments, and they then execute on the remote processors, causing data to be transferred without incurring the typical overhead associated with message passing systems or the limitations on processor utilization seen in message driven machines. They site a matrix example that shows that active messages can offer as much as an order of magnitude performance improvement. From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: July24老 2001斥 Tuesday 3:58 PM To: MARKGREGORY WHITNEY; Chris Savarese; Jaein Jeong Subject: Alternative Implementions of Two-level Adaptive Branch Prediction Hi, A branch predictor predicts which instruction is executed after a branch instruction. As a processor executes more instructions in parallel, high prediction rate gets more important. A branch predictor can be categorized either as a static predictor or a dynamic predictor. A static predictor predicts the next instruction without regard to which instructions have been executed. BTFN (Backward Taken and Forward Not-taken), Always-Taken, and static training are examples of static predictions. A dynamic predictor predicts based on the history of instructions executed already. This paper proposed two level adaptive branch predictor, which predicts depending on branch history and pattern history. Branch history is a record of the history whether a branch was taken or not. History pattern tables associate states with each pattern of branch history values. Whether a branch is to be taken relies on how to interpret the states. This paper showed a 1-bit state machine and four 2-bit state machines. 1-bit state machine predicts next instruction just upon previous branch predicting taken if the last one was taken and not-taken if not. 2-bit state machines are saturating counters or variations. Since they have more bits than 1-bit state machine, they are more resistant to noise, thus having higher prediction rate. Two level adaptive prediction can be divided to three kinds depending on whether branch history is shared for all branches or there are separate registers for each branch address and whether pattern history table is shared or not: Global History Register and a Global Pattern History Table (GAg) Per-address Branch History Table and a Global Pattern History Table (PAg) Per-address Branch History Table and Per-address Pattern History Table (PAp) The performance of two level adaptive predictors relies on the branch history register length (which is log of pattern history size) and branch history table size in case of PAg and PAp. The performance got better for more resources and it was much more evident in GAg. Since more resources mean higher cost, they tried to find the most affordable solution while achieving the same prediction rate. The experiment showed that PAg was the most cost effective. GAg was the less affected by context switches than PAg or PAp. Two level adaptive branch predictor did well on most test samples (97% average), whereas static training predictors was not good on some samples, because they were trained to work well on specific samples. ------------- Jaein From: Chris Savarese [savarese@EECS.Berkeley.EDU] Sent: September01老 2001斥 Saturday 11:21 PM To: Mark Whitney; jaein@eecs.berkeley.edu Subject: Architecture Alternatives On the continuum from ASIC to general purpose processor, the following trends hold true as one moves closer to an ASIC: 1) Fixed costs increase, while variable costs decrease 2) performance increases 3) time to market/difficulty of design increases 4) flexibility decreases ASICs fall into 3 subgroups: 1) full-custom 2) semi-programmable 3) programmable With full-custom ASICs, the designer designs everything, including the layout of the transistors. These are usually mixed A/D chips. Semiprogrammable ASICs include CBICs (cell-based ICs) and MGAs (masked-gate arrays). CBICs use standard cell libraries, relying on the designer to do layout at the gate level. For MGAs, the transistors are pre-placed, and the designer defines the interconnect. There are 3 types of MGAs: 1) channelled GA: interconnect/transistors lined up in rows 2) channelless GA: uniform placement of transistors and interconnect 3) structured GA: channelless with a predefined megacell somewhere on the die. the megacell could be a small processor, for example. Programmable ASICs include PLDs (programmable logic devices) and FPGAs (field programmable gate arrays). PLDs have a single, large block of programmable interconnect, and come in 2 flavors: PAL (programmable array logic), and PLA (programmable logic array). PALs have a programmable AND plane and a fixed OR plane, while PLAs have both programmable AND and OR planes. FPGAs can be thought of as big, complex PLDs with multiple interconnect blocks and programmable I/O pads. With the right application and a good implementation, FPGAs can offer 100-1000x the performance of a GP processor due to the potential for highly concurrent, deeply pipelined structures. DSPs are optimized for signal processing. They have a bunch of special functions, almost all of which were created and included with a particular algorithm in mind. For example, multiply-accumulate is really usefull for FIR filters, butterfly addressing is great for FFTs, etc. While high-end GP processors (i.e. Intel) typically aim at speed as their most important quality, DSPs typically aim at a mix of low power and low cost, given a certain level of performance. The example I gave earlier was an image processor. The human eye can only pick up 30 frames per second, so getting something faster than that doesn't do any good. 30 fps then becomes the required level of performance, and the rest of the optimization effort goes into low cost and low power. This brings up another point. What if our image processor were to occassionally drop below 30 fps? That would be bad (Ghostbusters reference here...). Now think on the nature of superscalar processors; potentially fast performance with the allowance for occassional dips in performance due to stalls in dynamic issue hardware. This won't do for our DSP. As a result, most DSPs are VLIWs. Developing software for DSPs is still more difficult than for a general purpose processor, but it's getting better, thanks to better compilers. From: Chris Savarese [savarese@EECS.Berkeley.EDU] Sent: September02老 2001斥 Sunday 12:02 AM To: Mark Whitney; jaein@eecs.berkeley.edu Subject: Arithmetic 2+2=3.99999999999999 (neon radiation suits?? Down with the industry giants!!!) Adders (see Appendix A for diagrams): 1) ripple carry (RCA): carries are rippled across all n-blocks. computation: O(n) 2) carry look ahead (CLA): carries are generated in advance at each block by looking at all input bits. O(nlogn), but fan-in is a problem 3) carry skip: sum/carry generated at each block assuming no carry in. After first carry-out generated at first block, comparison of carry in, speculated carry out, and speculate sum at each block allow carry to skip all the way down the chain of blocks. Basically a really fast ripple, with computation occuring in parallel ahead of and behind the carry ripple. O(sqrt(n)) 4) carry select: each block generates a set of sum/carry bits for carry in=1 and carry in=0. When actual carry arrives, the correct pair of results is selected. O(sqrt(n)). 5) carry save (CSA): n independent full adders. carry and sum info kept separate until end of computation, then combined. worthwhile for larger n (due to overhead of combining carry/sum at end). Basic multiplier and divider diagrams: see pg. A-4 Division can be restoring or non-restoring. Non-restoring division can potentially save an add per quotient bit, but the book states that restoring division implementations have a work-around for this anyway. Signed #'s: 1) sign magnitude: 1 sign bit, n-1 mag bits 2) 1's compliment: a-a=2^(n-1) 3) 2's compliment: a-a=2^n 4) biased: k+bias >= 0, bias is usually 2^(n-1) Multiplying 2's compliment numbers: Could convert to positive, multiply, and then fix sign, but that's slow. Use Booth Recoding instead. This reduces the number of partial products by grouping bits together and representing them in a redundant number system format. Division, on the other hand, is usually done by converting to positive, dividing, and then fixing the sign. The amount of extra time it takes to do this is small relative to the time it takes to divide. System issues: 1) When an integer overflow occurs, should we: a) set a bit? b) trap? c) do nothing? signed operations usually set a bit and provide a maskable trap, while unsigned ops usually trap. unsigned ops usually deal with memory addressing, in which overflow can be used to wrap around the address space. 2) multiplying 2 n-bit numbers results in a 2n-bit number or an n-bit number? most high-level languages suggest that 2n-bits are unecessary (i.e. the product of 2 ints is still an int). assembly language can take advantage of 2n-bits though, for as much as a 3x speed-up for multiply. 3) multiply takes too long on machines that would otherwise have IPC=1. possible solutions: a) 1-cycle multiply steps (pipelined mult) b) use the FP hardware c) use a separate, autonomous unit in the CPU to do the multiplication Floating Point, IEEE standard: single-precision: 32-b 1 sign bit, 8 exponent bits, 23 fraction bits x=(-1)^s * 1.f * 2^(e-127) features of IEEE standard for FP: 1) when rouding halfway cases, choose even 2) special values: NaN, -Inf, +Inf 3) denormals: numbers less then 1.0 * 2^e_min 4) round to nearest by default, but 3 other rounding modes available: towards 0, -Inf, or Inf 5) sophisticated exception handling FP mult: (s1*s2)*2^(e1+e2) 1) multiply the significands 2) round 3) add exponents (don't forget to subtract the bias) Uses sticky bits for rounding As denormals require variable shifting in the mult hardware, they're usually trapped to software FP addition... is harder than FP mult due to rounding since add and sub round at different decimal locations. Can be improved by pipelining or adding hardware to do 2 adds in parallel. FP divide: (s1/s2)*2^(e1-e2): linear cost iterative methods: quadratic cost (much faster), but have imperfect rounding and no remainder Rule of thumb: division should be about 1/3 as fast as mult. Fused FP mult-add: - rounding is expensive. here, we only have to round once. - better accuracy since no internal rounding Speeding up multiplication: 1 adder 1) higher-radix multiplication (i.e. radix-4 Booth Recoding) 2) CSAs Speeding up mult: multiple adders - allows pipelining - combinational circuit: no clocking ==> lower latency Array multipliers: multipass array and even/odd array, both with computation time O(n) Tree multipliers: Wallace tree and binary tree, both with O(logn) Wallace tree uses fewer gates, but is very difficult to lay out. most VLSI designers go with binary tree for easier layout. Faster division with 1 adder: Use CSA and higher radix i.e. Pentium uses radix-4 SRT division. From: Mark Whitney [whitney@EECS.Berkeley.EDU] Sent: July17老 2001斥 Tuesday 9:32 PM To: savarese@EECS.Berkeley.EDU; jaein@EECS.Berkeley.EDU Subject: Burroughs' stack machine hi guys, more stuff coming later tonight...I am still trying to get used to this "work" thing again after my vacation :) "Burroughs' B6500/B7500 stack mechanism" by Hauck and Dent: This goofy architecture features an explicit implementation of a procedure call stack that would normally be emulated in main memory using a stack and frame pointer. The procedure which is currently being executed is allocated a control word (called MSCW) at the bottom of its space in its process' stack and it stores local data on top of this control word. Data is actually placed onto the stack and removed from the stack for computation through 4 registers. The rest of the data in the current procedure's stack an also be addressed by specifying an offset from the top of the stack. The current history of procedure calls is recorded with the use of a special bank of "display registers". Each of these registers points to a control word of a procedure on the stack from the current procedure all the way down lexically to the operating system. This way, upon the return from a procedure the top of the stack is popped off down to the returning procedure's control word and the new stack is specified by the display register pointing to the stack of the procedure being returned to. Non-local variables that are still in the current lexical scope can be accessed through a pair of values, the display register number pointing to the relevant calling procedure's stack and the offset within that stack. Variables that are not in the current scope are not in the available namespace. Each process has its own stack upon which called procedure's allocate local data and control words. The operating system has access to another bank of registers called the "stack vector array". Each register in this bank points to the bottom of one of the active process' stack. A context switch entail changing some of the display registers so that they point to the bottom of the new process' stack up to the currently executing procedure in the process. All in all, an interesting machine but the performance advantage gained by the explicit stack was probably not that great compared to a meager stack and frame pointer. It was also tailored to the language ALGOL 60, enough said. From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: August21老 2001斥 Tuesday 11:39 PM To: Chris Savarese; MARKGREGORY WHITNEY; Jaein Jeong Subject: Chapter 5 - Memory Hierarchy Summary Principle of locality: Memory hierarchy is based on the principle of locality. That is most programs don't access all code and data uniformly. Given a faster hardware on top of the existing memory, most of accesses can be serviced on the faster level with the larger storage limit of lower inexpensive level. In real system, memory hierarchy is organized as follows: Processor | Cache | Main memory | Disk | Tape or optical storage Four fundamental questions on memory hierarchy: At the start of chapter, the authors ask the following four questions: Q1: Where can a block be placed in the upper level? (Block placement) Q2: How is a block found if it is in the upper level? (Block identification) Q3: Which block should be replaced on a miss? (Block replacement) Q4: What happens on a write? (Write strategy) For block placement, we have direct mapped, set associative, and fully associative. In direct mapped, a block in the lower level maps to a designated block in the current memory level. In fully associative, a block can be in any empty blocks. And in set associative, a block can be in a restricted number of blocks. Direct mapped is simplest and fastest. Whereas fully associative hits best. Set associative is between these two. For block identification, we divide an address to tag + index + block offset -------------- block address First, we access find the block with index. Then, we compare the tags of the address and the block if they match, we found, otherwise, we have to get it from the lower level memory. After that, we concatenate the block offset to access the data word. For replacement, we can replace a block randomly or replace a block which is the least likely to be used. We predict that a block which is least recently used (LRU) will be the least likely to be used. For direct mapped, we don't need replacement policy since we have only one block - just kick that block off. Most of the time, especially for set associative, random replacement works almost as well as LRU. LRU can be used when the miss penalty from less elaborate replacement is greater than the overhead of implementing LRU. For write strategy, we have write through and write back. Write through writes a block to the lower level when ever a word is written on the block. Write back keeps writes on the current memory level, and writes to the lower level, when the block is replaced. Write through is simple, but suffers from overhead since it has to access lowerlevel memory for every write. Typically, write through is coupled with a write buffer. The current memory level is not stalled until the write buffer is full. Since write through keeps the current copy of data in the lower level memory, it is coherent. Write back reduces lower level memory accesses. It keeps a dirty bit. When a data word is not written yet, a dirty bit is cleared. Once a data word is written the dirty bit is set. When a block is replaced, it is written back to lower level memory only when the dirty bit is set. Even though write back is not coherent inherently, it is preferred in multiprocessor because it consumes less bandwidth, thus can accomodate more processors. On a write miss, we can read a block from the lower level and write a data to it (write allocate) or just modify the lower level without loading the block into the current block (no-write allocate or write around). When we access one level of memory, first we access that level. If we find a block (Hit), we are done. Otherwise (Miss), we have to find that word in the lower level and that time is miss penalty. If we let the time to access a level HitTime, the probability not to find a block in that level MissRate, and the time to find that word in the lower level MissPenalty. The access time to a level is: HitTime + MissRate * MissPenalty So we can know that to improve performance, we can (1) Reduce miss rate (2) Reduce miss penalty (3) Reduce hit time Policies to reduce miss rate are: 1) Larger Block Size Since an instruction or data next to the current one is likely to be used soon (spatial locality), by reading a block in larger chunk we can reduce the miss rate. However, larger block size increases miss penalty and conflict misses. After a certain block size, miss rate increases for the fixed cache size because of conflict misses or capacity misses (when the cache is small). 2) Higher Associativity Higher associativity reduces conflict misses, that's why set-associative and fully associative cache has lower miss rates than direct mapeed cache. But, set-associative cache and fully associative cache is slower than direct mapped because in direct mapped cache read and tag matching can happen at the same time, which makes direct mapped cache have faster hit time. Thus, first level cache, which is synchronized to processor clock, is usually direct mapped cache. 3) Victim Caches Victime Cache is a buffer where a data kicked out from the cache is stored. When a missed occured in the cache, we can look up the victim cache while we are look at the next level of memory. Victim caches reduces conflict misses without giving overhead to the current level cache. Victim cache is more useful when the cache size is small - has more conflict misses. A four entry victim cache removed 20% to 95% of the conflict misses in a 4KB direct mapped data cache. 4) Pseudo-Associative Caches Pseudo-associative cache is actually alternative use of set associative cache. Rather than look up all the blocks in a set, pseudo-associative cache looks up a designated block first and finds another block when it failed on the first block. If one of two blocks is accessed more frequently, pseudo- associative cache has the same degree of miss rate while having the fast hit time of direct mapped cache. 5) Hardware Prefetching of Instructions and Data This is the idea of stream buffer. This is also based on spatial locality. When a data is loaded in cache, next words are loaded into stream buffer using extra bandwidth. A single instruction stream buffer could catch 15% to 25% of the misses in a 4-KB 16-byte block direct mapped inst cache. 6) Compiler-Controlled Prefetching The idea is that a compiler inserts prefetch instructions. Makes sense when the processor can proceed while prefetching instructions. This requires nonblocking cache. Compiler-controlled prefetching incurs instruction overhead. 7) Compiler Optimizations A compiler reorganizes the code so that locality can be used. Although it doesn't give hardware overhead, writing such a compiler is challenging. Policies to reduce miss penalty are: 1) Giving Prioirty to Read Misses over Writes While write overhead can be hidden once a write is sent to a write buffer, cache has to wait until the requested block is loaded from the lower level. The idea is for a write to yield for a read miss. In a write through cache, we can check the write buffer on a read miss. In a write back cache, we can copy the cache block to write buffer and move that block back to cache. 2) Sub-block Placement for Reduced Miss Penalty A cache block larger than the processor data bits can take longer to read. The idea is to load only the necessary word on a miss rather than load all the words in a block. Thus, we can reduce miss penalty. By associating a valid bit for each word (subblock), we can know which words are valid in a block. 3) Early Restart and Critical Word First The idea is to send a word as soon as it is available, rather than wait until all the words in a block is loaded (Early Start). Critical Word First sends the requested word first. 4) Nonblocking caches to Reduce Stalls on Cache Misses Traditional cache (lockup cache) stalls on a read miss until the requested block is loaded. Nonblocking cache (lockup free cache) proceeds even with a read miss, thus overrides waiting overhead. 5) Second Level Caches To have another level of cache between cache and the main memory. The second level cache has latency between the first level cache and the main memory and has larger space (256KB-16MB) than the first level (1-128KB). Since most of first level cache miss can be serviced from second level, second level cache reduces miss penalty of first level cache. ** Multilevel inclusion When a lower level of memory holds all the data in the upper level of memory, then we say inclusion property or subset property is met. Inclusion property is nice because we need to check the lower level memory for coherence. The problem is happens when the lower level has larger block size than the upper level. Different blocks in the upper level can map to the same block in the lower level. In this case, all the upper level cache blocks that map to a lower level block should be invalidated when that lower block is replaced. Inclusion property may not be met when the product of associativity and set size of upper level memory is larger than that of lower memory. But, we don't need to care about this case since upper level memory is usually made small and direct mapped for fast hit time. Policies to reduce hit time are: 1) Small and Simple Caches Small cache can be made fast enough to be synchoronized to the process. Simple direct mapped cache runs fast because it can read the block while comparing the tag. That's why first level cache is made as a small direct mappped cache. 2) Avoiding Address Translation During Indexing of the Cache In a system with virtual memory, virtual memory address should be translated. In a traditional model, an address should be translated by TLB every time, which is critical because it incurs overhead for every memory access. We can pileline the processor stages more deeply (complicated). If we use virtual address with cache (virtual cache) we may have some problems. First, when a context switch occurs the cache blocks for a swapped process no longer points to the correct address. It should be flushed, which causes overhead. To avoid unnecessary flushing, process ID (PID) can be associated with each cache block by operating system. Then, virtual address and PID pair uniquely refers to an address. Another problem is that there can be two copies of virtual addresses for the same physical address (synomym, alias). If we use page offset to index the cache, then we can read a cache block while translating address because page offset is not affected by address translation. This limits the size of direct mapped cache to the size of a page, usually 4KB. By increasing associativity, we can make a cache larger while virtually indexing and physically tagging. Or we can rely on operation system, so that all aliases have the same last n bits in a 2^n bytes cache, thus those aliases have the different tags. This is called page coloring. 3) Pipelining Writes for Fast Write hits Pipeline write stage to tag checking and write stages. If we can overlap the tag checking of current write with the write stage of the previous one, then we can reduce the write time. Main memory Main memory is between cache and the disk in the memory hierarchy. Main memory is grouped in memory banks and memory banks are connected to the cache through. The goal in memory system is to achieve high bandwidth. First, we can make the data path wider. That means the processor accesses multiple memory banks simultaneously. Second, we can interleave memory banks. Accessing DRAM takes three steps: 1) sending address on the bus 2) initiating DRAM 3) sending data back from DRAM to the bus. Since we can broadcast address on multiple banks and overlap initiating time of the banks, we only need to pay the additional time for sending data when interleaving multiple memory banks. Third, we can make each memory bank operate independently. We need separate address and data paths. Fourth, we can avoid memory bank conflicts in memory interleaving by using prime number of memory banks. Fifth, we can use DRAM-specific interleaving. DRAM addresses its data in two steps to save address lines. First, one half of the address is sent, and this is called row access strobe(RAS). And then, the other half of the address is sent, which is called column access strobe(CAS). In DRAM specific interleavings, cells are accessed differently for higher bandwidth. Nibble mode DRAM can supply three extra bits as well as the requested bit. In page mode DRAM, RAS is not needed for the cells in the same row. Just by changing column address, it can access cells. Static column DRAM is very similar to page mode DRAM, except that we don't need to toggle CAS every time the column address changes. These DRAM specific interleaving uses the circuitry already on the DRAM and increases bandwidth four-fold with little cost. But, we need fast bus for these kinds of DRAMs. RAMBUS DRAM uses bus transaction for each chip rather than RAS/CAS and can deliver one byte every 2ns when the address pipeline is full. Virtual Memory The advantages of virtual memory: 1) Gives large address space 2) Every user has separate view of memory 3) Protection 4) Sharing We can answer the four fundamental questions on virtual memory. 1) Block Placement For placement we have direct mapped, set associative, and fully associative. To determine which policy, we need to remind the memory access time formula again. Average memory access time = HitTime + MissRate * MissPenalty Miss penalty for first level cache is between 8 to 100 clock cycles, whereas the miss penalty for virtual memory is 700000 to 6000000 clock cycles - Enormous. Fast hit time if offset by the huge miss penalty if miss rate is big. Thus, virtual memory is implemented in fully associative. 2) How to find a block? The virtual address is splited into virtual page number and page offset. We look up the page table and find the associated physical page number and concatenate the physical page number with the page offset to get the physical address. The problem is that page table can take too much space while most of its slots are not used. Thus, we can have multiple stages of tables. Since multi-level address translation, which occurs on every memory access, causes overhead, we can store that address translation in a cache called translation look aside buffer (TLB). TLB is made small (32-8192 Byte) and either fully associative or set associative. 3) Which block should be replaced on a miss? Since miss penalty of virtual memory is huge, we use LRU algorithm to minimize miss rate. LRU algorithm can be implemented in use bit or reference bit. The reference bit of a page is set when it is referenced, and the operating system clears the bits periodically. When a miss happens, one of the blocks with reference bit cleared is selected. 4) Write Policy Virtual memory use write back policy due to the huge disk access time. ** Selecting a Page Size The advantages of larger page size are - small page table - larger cache size is possible for virtually indexed and physically tagged cache - TLB can be made smaller - has less overhead than smaller pages The advantages of smaller page size are - Reduces internal fragmentation - Fast process start up time From: Jaein Jeong [jijeong@uclink.berkeley.edu] Sent: August21老 2001斥 Tuesday 7:38 AM To: jaein@cs.berkeley.edu Subject: Chapter 5 Memory-Hierarchy Design Summary Principle of locality: Memory hierarchy is based on the principle of locality. That is most programs don't access all code and data uniformly. Given a faster hardware on top of the existing memory, most of accesses can be serviced on the faster level with the larger storage limit of lower inexpensive level. In real system, memory hierarchy is organized as follows: Processor | Cache | Main memory | Disk | Tape or optical storage Four fundamental questions on memory hierarchy: At the start of chapter, the authors ask the following four questions: Q1: Where can a block be placed in the upper level? (Block placement) Q2: How is a block found if it is in the upper level? (Block identification) Q3: Which block should be replaced on a miss? (Block replacement) Q4: What happens on a write? (Write strategy) For block placement, we have direct mapped, set associative, and fully associative. In direct mapped, a block in the lower level maps to a designated block in the current memory level. In fully associative, a block can be in any empty blocks. And in set associative, a block can be in a restricted number of blocks. Direct mapped is simplest and fastest. Whereas fully associative hits best. Set associative is between these two. For block identification, we divide an address to tag + index + block offset -------------- block address First, we access find the block with index. Then, we compare the tags of the address and the block if they match, we found, otherwise, we have to get it from the lower level memory. After that, we concatenate the block offset to access the data word. For replacement, we can replace a block randomly or replace a block which is the least likely to be used. We predict that a block which is least recently used (LRU) will be the least likely to be used. For direct mapped, we don't need replacement policy since we have only one block - just kick that block off. Most of the time, especially for set associative, random replacement works almost as well as LRU. LRU can be used when the miss penalty from less elaborate replacement is greater than the overhead of implementing LRU. For write strategy, we have write through and write back. Write through writes a block to the lower level when ever a word is written on the block. Write back keeps writes on the current memory level, and writes to the lower level, when the block is replaced. Write through is simple, but suffers from overhead since it has to access lowerlevel memory for every write. Typically, write through is coupled with a write buffer. The current memory level is not stalled until the write buffer is full. Since write through keeps the current copy of data in the lower level memory, it is coherent. Write back reduces lower level memory accesses. It keeps a dirty bit. When a data word is not written yet, a dirty bit is cleared. Once a data word is written the dirty bit is set. When a block is replaced, it is written back to lower level memory only when the dirty bit is set. Even though write back is not coherent inherently, it is preferred in multiprocessor because it consumes less bandwidth, thus can accomodate more processors. On a write miss, we can read a block from the lower level and write a data to it (write allocate) or just modify the lower level without loading the block into the current block (no-write allocate or write around). When we access one level of memory, first we access that level. If we find a block (Hit), we are done. Otherwise (Miss), we have to find that word in the lower level and that time is miss penalty. If we let the time to access a level HitTime, the probability not to find a block in that level MissRate, and the time to find that word in the lower level MissPenalty. The access time to a level is: HitTime + MissRate * MissPenalty So we can know that to improve performance, we can (1) Reduce miss rate (2) Reduce miss penalty (3) Reduce hit time Policies to reduce miss rate are: 1) Larger Block Size Since an instruction or data next to the current one is likely to be used soon (spatial locality), by reading a block in larger chunk we can reduce the miss rate. However, larger block size increases miss penalty and conflict misses. After a certain block size, miss rate increases for the fixed cache size because of conflict misses or capacity misses (when the cache is small). 2) Higher Associativity Higher associativity reduces conflict misses, that's why set-associative and fully associative cache has lower miss rates than direct mapeed cache. But, set-associative cache and fully associative cache is slower than direct mapped because in direct mapped cache read and tag matching can happen at the same time, which makes direct mapped cache have faster hit time. Thus, first level cache, which is synchronized to processor clock, is usually direct mapped cache. 3) Victim Caches Victime Cache is a buffer where a data kicked out from the cache is stored. When a missed occured in the cache, we can look up the victim cache while we are look at the next level of memory. Victim caches reduces conflict misses without giving overhead to the current level cache. Victim cache is more useful when the cache size is small - has more conflict misses. A four entry victim cache removed 20% to 95% of the conflict misses in a 4KB direct mapped data cache. 4) Pseudo-Associative Caches Pseudo-associative cache is actually alternative use of set associative cache. Rather than look up all the blocks in a set, pseudo-associative cache looks up a designated block first and finds another block when it failed on the first block. If one of two blocks is accessed more frequently, pseudo- associative cache has the same degree of miss rate while having the fast hit time of direct mapped cache. 5) Hardware Prefetching of Instructions and Data This is the idea of stream buffer. This is also based on spatial locality. When a data is loaded in cache, next words are loaded into stream buffer using extra bandwidth. A single instruction stream buffer could catch 15% to 25% of the misses in a 4-KB 16-byte block direct mapped inst cache. 6) Compiler-Controlled Prefetching The idea is that a compiler inserts prefetch instructions. Makes sense when the processor can proceed while prefetching instructions. This requires nonblocking cache. Compiler-controlled prefetching incurs instruction overhead. 7) Compiler Optimizations A compiler reorganizes the code so that locality can be used. Although it doesn't give hardware overhead, writing such a compiler is challenging. Policies to reduce miss penalty are: 1) Giving Prioirty to Read Misses over Writes 2) Sub-block Placement for Reduced Miss Penalty 3) Early Restart and Critical Word First 4) Nonblocking caches to Reduce Stalls on Cache Misses 5) Second Level Caches Policies to reduce hit time are: 1) Small and Simple Caches 2) Avoiding Address Translation During Indexing of the Cache 3) Pipelining Writes for Fast Write hits From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: August22老 2001斥 Wednesday 12:20 AM To: Chris Savarese; MARKGREGORY WHITNEY; Jaein Jeong Subject: Chapter 6 - Storage System Summary CPU performance increases 60% per year whereas IO performance increases less than 10% per year. By Amdahl's Law system performance will be limited by IO. To access data in a disk, Move the arm to the cylinder where the data is (seek time) Wait until the sector with the data is on the head (rotation time) Transfer data from disk to IO bus (transfer time) And the controller can cause additional overhead (controller overhead). In disk, capacity, MB per dollar, and transfer rate is growing fast (100% per year for capacity, 100% per year for MB/$, and 40% per year for transfer rate). But rotation time and seek time is improving very slowly (8% per year). The idea of RAID is to increase availability by using multiple number of small disks for data and parity disks. When an error occurs in a disk, the data is reconstructed from other disks. RAID 1: Every disk has its mirror and has very high availability. But it is the most expensive with 100% capacity overhead. It has good performance especially for read. RAID 3: Data is striped on every disk and parity is saved on a parity disk. It is good for large data read. Capacity overhead = 1 / (num data disks) RAID 4: Each disk is accessed independently. Parity is saved on a specific disk. Good for small read RAID 5: Each disk is accessed independently as in RAID 4, but parity is interleaved all the disks, thus reduce conflicts for parity disks. It has good overall performance and capacity overhead = 1 / (num data disks). On a write, reads the data disk and the parity disk. Then, write the new data to the data disk and old data XOR new data XOR old parity to parity disk. In a IO system, latency goes to infinity when utilization is 100% percent. This can be explained through queuing theory. Queuing theory applies to any system in equilibrium. There are job requests and one or more servers. We need to know definitions and formulas of some expression: Arrival Rate (r) is the number of requests per second. Service Time (Tser) is time to service a request. Service Rate (mu) is 1/Tser. Utilization is the rate at which resources are serviced. Queue time (Tq) is the average time waiting to be serviced. System time (Tsys) is the average time per task in the system. Length_Queue (Lq) is the average length of the queue. Length_Sys (Lsys) is the average number of tasks in the system. When service and requests are exponentially distributed, they have memoryless property and called: M/M/1 when there is one server M/M/m when there are m servers When service is not necessarily exponentially distributed, we call it general server and denote M/G/1 when there is one server. For M/M/1, util = r / mu Tq = Tser * util / (1 - util) Tsys = Tser + Tq Lq = r * Tq For M/M/m util = r / (mu * m) Tq = Tser * util / (m * (1 - util)) Tsys = Tser + Tq By the formula above we can know that queue time goes to infinity when utilization is 100%. So when designing an I/O system, we should limit the load (e.g. 65% or 75%) for reasonable response time. From: Mark Whitney [whitney@EECS.Berkeley.EDU] Sent: August27老 2001斥 Monday 2:41 PM To: savarese@EECS.Berkeley.EDU; jaein@EECS.Berkeley.EDU; whitney@EECS.Berkeley.EDU Subject: Chapter 7 Networks Some metrics you probably already know: -bandwidth (bw) - maximum rate network can propagate info that enters the net. Usually given in megabits/sec -time of flight (tof) - time it takes for the 1st bit to get from entering the network to arriving at the receiver. -transmission time - time for message to pass through network (not counting time of flight), assuming no contention, equals (message size / bandwidth). -transport latency - time message spends in the network (= time of flight + transmission time) total communication latency = sender overhead + tof + (msg size/bw) + receiver overhead effective bandwidth = msg size / total latency Network connections are made with many choices of media: twisted copper pair wiring coaxial cable multimode optical fiber single-mode optical fiber Connection topologies Shared media -when a node wants to send: listen to shared bus for traffic, if no traffic, send message if another sends at the same time, there is a collision, which the NI detects sender waits a random amount of time and resends -ethernet is an example, inexpensive but also has limited bw -easy to broadcast since traffic automatically goes to all nodes Switched media -dedicated line from node to switch, dedicated line from switch to node -many communication paths can be simultaneously used so bw is higher Switch topology Centralized switches: -crossbar - any node can get to any other node in 1 pass through the switch - crossbar can simultaneously route any permutation of traffic - size of the network scales as n^2, overkill in most situations -Omega network - log n layers of 2-in, 2-out switches. Book doesn't give properties of the network...not important to us I guess -fat tree - tree structure of switches w/ extra links near the top of the tree to increase bw Distributed switches: -let each node be a switching element -ring, each node connects to exactly two neighboring nodes and they form a connected ring. -fully connected, every node directly connected to every other node -2D grid/mesh, see book -2D torus, " " -hypercube, " " Bisection bw - divide network into 2 parts, each w/ half the nodes, sum bw of lines crossing from one half to the other. Defined as the worst bw sum over all possible bisections. Path allocation circuit switched - sets up a ckt ahead of time, no congestion because bw is reserved before the transfer takes place packet switched - msg is divided up into packets and sent independently, congestion can arise since it is difficult to determine whether available bw will be sufficient. Routing source based routing - message itself specifies path to destination virtual circuit - ckt is established in the switches between source and destination and the message specifies which circuit to use destination based routing - message only has destination address and switch must figure out where to send it that was about it...there were a few case studies at the end... From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: August08老 2001斥 Wednesday 3:48 PM To: Chris Savarese; MARKGREGORY WHITNEY; Jaein Jeong Subject: Client-Server Computing on the SHRIMP Multicomputer Hi, They built a faster interconnection of computers with virtual memory-mapped communication, optimized network protocol and custom network interface. They made new RPC and socket protocol. And their performance was near the hardware limits. In RPC (remote procedure call), servers export their services and clients access the services with simple procedure calls. SunRPC is the industry standard. SunRPC is implemented as follows: RPCGEN (stub generator, exports interface) HIGH ---------------------------------------------------- RPCLIB (RPC library) ---------------------------------------------------- XDR (XDR data representation, abstract higher layers from machine dependent parts) ---------------------------------------------------- stream (buffer management) ---------------------------------------------------- network (read and write system call implementation) LOW ---------------------------------------------------- hardware VMMC (Virtual Memory-Mapped Communication) permits data transfer between two virtual memory spaces over the network. Then, two processors over the network can work only on the user level. This makes stream layer unnecessary. The authors replaced stream layer with SBL which is lighter. Elimination of stream layer also reduces number of data copies by one. The latency dropped by more than a factor of ten (from SunRPC to vRPC). vRPC is SunRPC compatible remote procedure call protocol. They reduced the latency further by developing whole new RPC protocol (ShrimpRPC) which is not SunRPC compatible any more. ShrimpRPC was 2-3 times faster than the optimized vRPC. Likewise, they developed Shrimp socket. And it achieved maximum network bandwidth. This SHRIMP project shows us that we can make multicomputers at a reasonable cost by improving network protocol and interface. ------- Jaein From: Mark Whitney [whitney@EECS.Berkeley.EDU] Sent: August13老 2001斥 Monday 5:12 PM To: whitney@EECS.Berkeley.EDU; jaein@EECS.Berkeley.EDU; savarese@EECS.Berkeley.EDU Subject: Coherent Network Interfaces for Fine-Grain Communication "Coherent Network Interfaces (CNIs) for Fine-Grain Communication" by Mukherjee, Falsafi, Hill, Wood: This paper is basically about providing mechanisms through which the buffers and control registers for various devices (NIs, I/O, etc.) can be cached closer to the processor but still maintain the correct semantics associated with the cached item. The caching of such buffers/registers is desirable since it would greatly decrease bus utilization and latency due to polling of devices in which the contents changes infrequently. Also, caching allows data to be transferred between processor and device in larger blocks which can decrease bus occupancy. The mechanism also allows coherent entries to have their home in main memory (instead of at the actual device); this permits a decoupling between logical and physical locations of the device buffers and allows main memory latency-saving optimizations (like prefetching) to be leveraged for faster access to the device buffers. The bus between the processor and device now maintains a snoopy MOESI invalidation cache coherence protocol. Processors can read data from device registers repeatedly form the cache if the data does not change. When the device generates new data, it obtains write permission and the processor's cache entry is invalidated. When the processor writes data to the device, it obtains write permission to the location and then writes to the cache. The device then performs a read of its invalidated entry (the device maintains a similar cache) to get the new data written by the processor. The simpler cached device registers (CDRs) have issues though. The problem is that the processor may wish to read multiple cache blocks (in this paper, CDRs are assumed to hold a cache line of data) from the device through the CDR. Since it takes multiple loads to read the whole cache block of data from the CDR, simple clear-on-read semantics do not work for the delivery of multiple cache blocks through one CDR. This means that CDRs require an explicit (and slow) three-stage handshake in which the processor signals it is done with the current cached block and the device can deliver the next. Enter cachable queues (CQs), which can hold multiple cache blocks of data at once and maintain a head and tail pointer for easy reading and writing operations. Messages can be sent by writing to an outbound queue and incrementing the tail pointer and then relying on the coherence protocol to propagate the update when appropriate. Messages are received by reading the data from the head and then decrementing the head when finished. One problem with this is that since the head and tail pointers must be kept coherent between processor and device and are modified when reading and writing, the cached head and tail pointers might ping-pong back and forth under alternating updates and reads by processor and device. They present a number of clever optimizations to prevent this detrimental behavior but they are kinda complicated. Additionally, CQs allow the data to have their home in main memory as opposed to the device itself. Use of the less precious main memory resource simplifies deadlock avoidance since many more messages can be buffered in the case of unreliable networks. Address translation can then be handled by the OS through the regular VM system. This gives the added benefit of a simple way of dealing with protection in a multiprogramming environment. Each process can have its own virtual CQ which is mapped to somewhere in virtual memory. Rockin! In the results, they show that CNIs consistently reduce the round trip message latency, with CQs reducing latency more than CDRs. They also consistently increase the effective bandwidth by reducing the bus occupancy (CQs better than CDRs). They consistently provide speedups in the benchmarks of up to 50%! From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: August16老 2001斥 Thursday 1:36 PM To: MARKGREGORY WHITNEY; Chris Savarese; Jaein Jeong Subject: Comments on error correcting codes. Hi, The odd-weight-column SEC-DED codes paper by M.Y.Hsiao says it has better detecting capability than modified Hamming codes. For "A Class of Odd-Weight-Column SEC-DED-SbED Codes" by Shigeo Kaneda, I gave you an wrong example. Even though it said the parity bits (r) can be any even number greater than 3. Actually, it can't be 4. Its code length is 2^(r-1) - 2^(r/2) and parity bits are r bits long. When r is 4, code length is 4 and parity length is 4. That doesn't make sense! - All the code word is used just for parity. So r should be at least 6. In addition, when we choose fi, fj, we don't discriminate the order of (fi, fj). That means (fi, fj) and (fj, fi) are considered essentially the same. ------- Jaein From: whitney@EECS.Berkeley.EDU on behalf of Mark Whitney [whitney@EECS. Berkeley.EDU] Sent: July27老 2001斥 Friday 5:57 PM To: whitney@EECS.Berkeley.EDU; savarese@EECS.Berkeley.EDU; jaein@EECS.Berkeley.EDU Subject: Comparative Evaluation of Latency Reducing and Tolerating Techniques hi guys, "Comparative Evaluation of Latency Reducing and Tolerating Techniques" by Gupta, Hennessy, Gharachorloo, Mowry, and Weber: This paper addresses the problem of memory operation latencies in multiprocessor environments, latencies which are even higher than in single-processor accesses when accessing shared data. The paper presents 4 techniques to reduce the latency or allow processors to perform useful work during these latencies. Coherent caching, consistency model relaxation, prefetching, and multiple contexts are investigated individually and in conjunction with one another. Coherent caching is the most obvious solution to reducing latency and they find that a speedup of at least 2 is achieved on all their benchmarks from caching shared data (private data was already cached). Cache coherence is maintained by an invalidating, distributed directory-based protocol, like the DASH multiprocessor. Since coherent caching performs uniformly well in all cases, they include caching in all further memory enhancements measured here. Relaxing the memory consistency model is next. They relax it from sequential consistency to release consistency and find that the speedup is 1.1 to 1.5 for all benchmarks. They try out non-binding software prefetch with and without consistency relaxation by manually entering a small number of prefetch instructions in the benchmarks to prefetch mostly array data structures and a few linked list structures. They find that the two benchmarks that deal with array accesses mostly get speedups of 30% to 40% both with relaxed and non-relaxed consistency. The benchmark doing more linked-list accesses only speeds up by about 15%. In any case, these trials show that prefetching shared data is effective, regardless of the memory consistency model. Finally, they add the ability to switch between thread contexts in the even of a memory access stall. They try it with seq. consistency, release consistency, and release consistency with prefetching. They also the vary the number of contexts available (2 and 4) and the switching latencies (4 and 16 cycles). Multiple contexts seemed to offer improvements with sequential and relaxed consistency by gave almost no improvement when combined with prefetching. They state a couple reasons for this, prefetching for multiple contexts could overstrain the memory system, also, the placement of prefetch instructions did not take into account the changes in effective latencies from multiple contexts. From: Chris Savarese [savarese@EECS.Berkeley.EDU] Sent: August15老 2001斥 Wednesday 4:33 PM To: Jaein Jeong; MARKGREGORY WHITNEY Subject: Complexity-Efficient Superscalar Processors This paper looks at the tradeoffs between higher IPC counts and higher clock rates as functions of complexity and technology process. Their working definition for complexity is the issue width and the instruction window for the machine. Basically, higher complexity gets you better IPC, but larger critical paths, and hence slower clocks. Simpler machines offer faster clocks, but lower IPCs. They analyze the details of this tradeoff and then propose an architecture that strikes a compromise between IPC and clock frequency to achieve an average of 16% improved performance. They itemize complexity by breaking it down into four major structures: Register rename logic Wakeup logic (responsible for waking up instructions waiting for source operands in order to issue) Selection logic (selects instructions for issue from pool of ready instructions) Bypass logic (basic data forwarding, so as to avoid having to wait for values to be written to registers) Their results are based on analysis and subsequent Spice simulations. They tested issue widths of 2, 4, and 8, and technologies of .8um, .35um, and .18um. Register renaming logic delay was shown to increase linearly with issue width for all three technologies they tested. The relative increase in delay as a function of issue width did not scale with technology though. As feature size is reduced, wire delay becomes increasingly important, and so smaller technologies (.18um) showed higher percentage increases in delay for larger issue widths. Dependence checking was shown to be faster than the table look-up for translation from logical address to physical address. Thus, dependence checking can be hidden behind look-up, and so window size is not a factor for this category. Wakeup logic delay was also shown to increase linearly as issue width increases. There is also a quadratic increase in delay as window size is increased, but this effect is relatively small compared to the issue width. However, consider that these 2 parameters will most often increase together, as larger windows are often needed to take full advantage of larger issue widths. Selection logic delay was shown to increase logarithmically with window size. Also, the relative delays for different technologies scale well with different feature sizes since we're mostly considering logic delays here, and not much wiring. Data bypass logic delay grows quadratically with issue width, due to wiring. Optimized layouts can help reduce the constants tied to the terms in the relationship between issue width and delay, but the relationship will remain quadratic. They call their complexity-efficient machine the dependence-based microarchitecture. They replace the issue window by a small number of FIFOs. Independent instructions are steered towards separate FIFOs, while dependent instructions are steered to the same FIFO, forcing in-order-execution. Only instructions at the heads of the FIFOs are eligible for execution, thus greatly simplifying logic for all 4 of the structures outlined above. They found that their architecture suffered an IPC degradation of about 6.3%, but considering the ability to increase the clock speed, they estimated an overall 16% performance improvement over the traditional window-based architecture. From: Mark Whitney [whitney@EECS.Berkeley.EDU] Sent: July24老 2001斥 Tuesday 4:42 PM To: Chris Savarese; jaein@cs.berkeley.edu; whitney@EECS.Berkeley.EDU Subject: Confidence estimation for speculation control "Confidence Estimation for Speculation Control" by Grunwald, Klauser, Manne, and Pleszkun: This paper asks the question of how does one decide whether to make a speculation during execution when there are alternative options that will perform useful work, like executing instructions from another thread in the case of SMT. In some cases, it is desireable to know how confident the predictor is of its current prediction to determine whether the prediction should be used or whether we should work on something else until the needed non-speculative state is produced. The paper compares various implementations of confidence estimators and notes how well each work with a number of different branch predictors. The confidence estimators are compared based on 4 metrics: Sensitivity - The fraction of correct predictions that are noted "high confidence" Predictive value of a Positive test - Probability that a high confidence estimate is correct Specificity - The fraction of incorrect predictions identified as "low confidence" Predictive value of a Negative test - Probability that a low confidence estimate is correct They use 4 different confidence estimators used with branch predictors: JRS estimator - This consists of a table indexed by the branch address xored with the branch history. Upon successful prediction, the counter for the entry is incremented, on incorrect prediction, it is set to zero. Once the counter goes above a certain threshold, the prediction is said to have high confidence. Pattern History Estimator - This sets a number of branching patterns as being able to provide a high confidence prediction and all other patterns give a low confidence prediction. Saturating Counters Estimator - The state of the saturating counters in the branch predictor itself is used to determine the confidence (saturated taken or saturated not-taken are high confidence estimates). Static Estimator - The branches are profiled ahead of time on the branch predictor and branches that are predicted correctly at a percentatage above a threshold are high confidence. The remainder of the paper is spent characterizing how these estimators perform with 3 branch predictors, gshare, McFarling, and SAg. From: jaein@EECS.Berkeley.EDU on behalf of Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: July12老 2001斥 Thursday 2:35 PM To: MARKGREGORY WHITNEY; Chris Savarese; Jaein@cs.berkeley.edu Subject: Control Data 6600 Paper Hi guys, I summarize "Parallel Operation in the Control Data 6600" paper by James E. Thornton. The persons who built Control Data 6600 tried to achieve high performance with parallelism (a multiplicity of peripheral processors and functional units). A 6600 machine is organized by 10 peripheral/control processors, central memory, and central processor. Within the central processor, it has 10 functional units (ADD, MULTIPLY, ...) and 24 registers. If we draw a diagram it looks like this: 10 Functional Units | 24 Registers (8 Address, 8 Index, 8 Floating Poing Registers) | Central Memory | 10 peripheral/control processors | 12 I/O channels Only one of 10 peripheral/control processors is served by the central processor at a minor cycle (100 nano sec). The other peripheral/control processors do I/O jobs when they are waiting to be served. Along with a multiplicity of functional units and 24 registers, 6600 used a queue and reservation scheme called "scoreboard" to execute as many instructions as possible in parallel. When an instruction is issued, it is executed immediately if all of its registers are available. Otherwise, it waits in the scoreboard until all of its registers are available. After an instruction is executed in one of 10 functional units, a signal is sent to the scoreboard. Then, an instruction waiting fot that register can now be executed. This scoreboard scheme made it possible to issue an instruction until there is no functional units left free or a specifica register is assigned to more than one result. As long as this constraint holds, 6600 can issue an instruction at each minor cycle (100 nano sec). 6600's functional units are not connected to central memory directly. central memory is connected to functional units through registers. This makes data access faster than when access memory directly. For fast access of instruction memory, they made a buffer register and an instruction stack between instruction registers and central memory. When a program branches back to a location which is already in instruction stack, then the instruction doesn't have to be fetched from memory. That can improve performance in a small iteration. The memory itself was interleaved into 32 banks, which makes possible fast sequential memory access. -- Jaein From: Chris Savarese [savarese@EECS.Berkeley.EDU] Sent: August01老 2001斥 Wednesday 5:05 PM To: Jaein Jeong; MARKGREGORY WHITNEY Subject: Data Coherence Problem in a Multicache System Here's the second paper. Sorry it's late (again). Jaein, I'll see you tonight at 7. Thanks for being flexible. They make some lofty claims in this paper. Written in 1985, they predict that, "Advantages offered by these attributes are so significant that [multiprocessor] architectures will undoubtedly play an important role in the computer of the future." Well, not yet... This kind of sets the stage for the paper though. They present a lot of theory, and no data to back it up. Nevertheless, the theorie's pretty good. It's commonly believed that performance doesn't scale very well with the number of processors in a MP system due to synchronization and memory problems. They claim that the synchronization problem was shown to be tolerable in [10], but that memory consistency is still a huge hurdle. They base their analysis of several cache coherence schemes on the concept of the semicritical section, a logical entity that can be accessed by several readers but only one writer. The key questions for any cache coherence scheme are what that logical entity is, how it is enforced, and when it is enforced. Centralized Cache Coherence Models: In the shared cache scheme, the coherence problem is eliminated by only having a shared cache (i.e. no private caches). This greatly underminds the benefits of having a cache at all though, and so the performance gain from this scheme is low. C.mmp uses the OS to help control coherence. The OS designates which lines in memory are safe to cache, where a safe line is a line that is not being written to. Thus, all writes go straight through the main memory, and caches are only used for reads. This algorithm suffers from lower performance and from its dependence on the OS. In fact, if the OS makes a mistake, it can not only lower performance, but it can actually endanger the correctness of the system. Tang's Proposal is a central directory approach. The central directory has an entry for each data unit corresponding to a cache block in main memory. It uses shared-read and private-read states for cache blocks in the directory and in the cache. On reads, a block is marked "shared". If the block was already private in another cache, a write-back occurs, and then it's marked as shared in the new cache. Writes cause blocks to marked as private, and copies of that block in other caches are invalidated. Distributed Cache Coherence Models: The IBM 370/168 and 3033 connect all shared caches by a high-speed broadcast bus used to broadcast modifications (writes) to cache blocks. All caches monitor the bus at all times. All writes are write-through. The biggest problem with this approach is the invalidation track on the bus. This traffic scales proportionally with the number of processors, quickly overloading the bus. The Synapse system uses the Tang algorithm with a broadcast bus instead of a centralized directory. It uses the idea of ownership of cache blocks. By default, main memory has ownership of all blocks. Any cache read miss will cause ownership to return to main memory. Any write will transfer ownership to the cache being written to. Cache to cache ownership transfers are ignored by main memory. This scheme still has invalidation traffic problems, but they're less severe since invalidations only occur on cache-misses, rather than on all writes, as with the IBM scheme. I/O processors are given a special "write-new-data" command that allows them to write directly to main memory and at the same time steal ownership of a block and give it to main memory. The Goodman scheme... I couldn't figure out what kind of performance this one offered, relative to the others. It seems as though Goodman was shooting for compatibility, as they make several comments about how easy it would be to integrate this scheme into existing systems. They don't make any comments about relative performance though. Anyway, he uses various states associated with each cache block: invalid (no data), valid (data in cache, consistent with memory), reserved (data in cache has been locally modified exactly once. write-through), and dirty (data in cache have been modified more than once, and only the first modification was written through. rest are write-back.). This seems like it should cut down on invalidation traffic on the bus. They spend the rest of the paper talking about their model for cache block states. The only thing useful about the last 3 pages is the beta state, which distinguishes between read cases in which the data is truly shared by multiple caches, and read cases in which there's only one cache reading the data. This is beneficial when caches read data and then modifiy later on, since it can skip the overhead needed to change from shared-read to private-read before writing. Nevertheless, it's very expensive to implement since this level of knowledge requires extra hardware and/or communication to identify situations in which only one cache is reading a particular block of data. Their solution to this is to add software assistance via the concept of lookahead semicritical sections. Basically, a processor identifies cases in which private-reads can be used instead of shared-reads. From: Mark Whitney [whitney@EECS.Berkeley.EDU] Sent: July17老 2001斥 Tuesday 10:51 PM To: savarese@EECS.Berkeley.EDU; jaein@EECS.Berkeley.EDU Subject: Decoupled access/execute computer architectures Another nice, elegant paper, Smith is the man! "Decoupled Access/Execute Computer Architectures" by James Smith: This paper presents the first formal layout of an architecture in which memory operations and real computation operations are separated in the archtecture and allowed to execute out-of-order with respect to each other. A program is actually split up into operations which read or write data from main memory and other operations (which would do some sort of computation on this data). The processor is separated into a part to which the memory ops are issued, the A(ccess)-processor, and a part to which the computation operations are issued, the E(xecute)-processor. The two parts are made as independent as possible so that optimally, the memory (access) operations can run ahead of the computation (execute) operations, thereby hiding memory latency. Data values are communicated between the two parts by writing two and reading from two queues. Queue reads and writes are signaled by using special registers are operands. The compiler is responsible for scheduling coordination between the parts. Stalls due to true data dependencies between access and execute instructions manifest themselves when an instruction tries to read from a queue and there are no values. The instruction will then block the pipeline, preventing no more issuing of that type of instruction until the other processor can catch up. Stalls can also occur if one processor runs way past the other and fills up its queue of values to send to the slower processor. It is mentioned that dependencies between memory ops can also be resolved with the use of an associative store buffer. Output and anti dependencies between memory ops and computation ops are eliminated (except for the effects of a finite queue size), leaving only true data dependencies to slow the execution of instructions. Conditional branches can be handled by either processor depending on the data used to resolve the condition. Two more queues are added for this purpose, when one processor resolves a branch condition, it sends its decision to the other processor which executes a B(ranch)F(rom)Q(ueue) instruction which picks the next branch decision from the stack of decisions sent by the other processor. This paper is a great example of a very simple, elegant architecture which acts sort of as a data flow processor with respect to dependencies between memory ops and other computation ops. From: Mark Whitney [whitney@EECS.Berkeley.EDU] Sent: July24老 2001斥 Tuesday 4:43 PM To: Chris Savarese; Jaein Jeong; whitney@EECS.Berkeley.EDU Subject: Design and evaluation of a compiler algorithm for prefetching "Design and Evaluation of a Compiler Algorithm for Prefetching" by Todd Mowry, Monica Lam and Anoop Gupta: This paper presents an compiler algorithm that selectively places non-binding prefetch instructions in the compiled code, attempting to intelligently prefetch arrays in numerical programs. The result is more selective prefetching of arrays, eliminating unnecessary prefetches which represent useless computation and memory bandwidth demand. The algorithm itself takes in information about the L1 cache of the machine the program is being compiled for and analyzes which array elements do not need to be prefetched because they are already in the L1 cache due to spatial or temporal locality in the array accesses. This locality analysis is the key interesting bit about this paper and is what allows the algorithm to put in substantially fewer prefetch instructions while still prefetching most of the data that needs to be prefetched. Given the loop structure of the program, the algorithm figures out which array elements are reused under a set of conditions holding for the loop indicies. After it is determined which array elements are reused eventually, the cache size information is taken into account to calculate which elements will not have been replaced in the cache before they are reused. Then, given the estimated latency of the prefetch operation, the algorithm tries to schedule prefetching of the remaining elements far enough ahead of time to be useful using loop unrolling and software pipelining. The loop unrolling algorithm also performs transformations to ensure that the loop index conditions mentioned above hold or do not hold using a minimal number of conditional branches. They then present a very through experimental analysis of the effectivity of their locality analysis compared to the indiscriminate, prefetch-everything approach. They then show the performance improvements obtained through their fancy versions of loop unrolling and software pipelining. They finally do a sensitivity analysis of the performance of the optimized program to the different compile time parameters (effective cache size, approximate prefetch latency, etc.). From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: August01老 2001斥 Wednesday 6:57 PM To: Chris Savarese; MARKGREGORY WHITNEY; Jaein Jeong Subject: Dynamic Speculation and Synchronization of Data Dependences Hi, Speculation can be categorized either control speculation and data speculation. Branch prediction is a good example of control speculation. Data value speculation and data dependence speculation are examples of data speculation. Depending on the fact that of all the dynamic dependences, only a few static dependences cause most of mis-speculation, they came up with a structure to reduce dependence speculation miss. One is memory dependence prediction table (MDPT) and the other is memory dependence synchronization table (MDST). MDPT keeps load/store pairs whose dependence may be speculated. MDST keeps the pairs that are actually to be synchronized. Their scheme not only outperformed naive algorithm which assumes all load/store pairs are dependent, but also was comparable to ideal algorithm. ------ Jaein From: Mark Whitney [whitney@EECS.Berkeley.EDU] Sent: July18老 2001斥 Wednesday 3:01 AM To: savarese@EECS.Berkeley.EDU; jaein@EECS.Berkeley.EDU Subject: Efficient Superscalar Performance Through Boosting "Efficient Superscalar Performance Through Boosting" by Michael Smith, Mark Horowitz, and Monica Lam: This paper starts off with a nice summary of the previous work in the area of trace scheduling and the inheirent limitations to the trace scheduling approch to instruction scheduling. Their argument against trace scheduling is basically that while the method produces efficient execution along the perceived most probable trace, there is no bound on the amount of code produced for other traces that may not be the highest probability but is still pretty probable. This paper tries to fix this by introducing a new global scheduling algorithm which introduces some restrictions, like no inter-loop/procedure movement of code and only movement of instructions up. This new scheduler limits the amount of code duplication that takes place by moving instructions in simple steps and observing conditions in which basic blocks are equivalent with respect to the control flow and data flow. It also uses the "boosting" feature. The boosting feature is basically a compiler generated from of predicated execution in which an instruction that is being executed speculatively in the trace is annotated with the number of conditional branches this instruction has been moved above. This allows selective squashing of instructions based on which branches were mispredicted. The hardware must supply the mechanism to checkpoint relevant register values at each speculatively executed branch. This paper provides a nice hardware/software interface for instruction scheduling using boosting with a global static scheduling algorithm. Their point is to show that static scheduling is not dead yet and can still compete with more complicated dynamic scheduling techniques which use much more hardware. From: Chris Savarese [savarese@EECS.Berkeley.EDU] Sent: August01老 2001斥 Wednesday 2:54 PM To: Jaein Jeong; MARKGREGORY WHITNEY Subject: Evaluating Stream Buffers as a Secondary Cache Replacement Hey guys, Sorry to be late this week... It's been crazy. This paper is based on the stream buffer concept proposed by Jouppi in the paper I summarized last week. Where Jouppi recommened streams between L1$ and L2$, these guys want streams to completely replace L2$. Their main argument is that streams can offer similar performance to L2$ for MUCH less hardware. All of their experiments use stream buffer depths of 2. They tested using 15 scientific benchmarks. They compared to L2$ of 64K I and 64K D, 4-way set-associative. For unit-stride streams, using 7 to 8 streams, their experiments showed hit rates (L1 miss, stream hit) of about 50-80%, compared to L2$ hit rates of about 70-85%. Given the huge difference in the cost of implementing each method, these numbers favor streams. The big problem with streams though is the extra memory bandwidth that they use up while making useless fetches. Their 15 benchmarks showed an average of 75% extra bandwidth required for streams (ouch). They propose reducing this requirement by avoiding useless fetches. They implemented a filter to avoid fetching on isolated references. The filter monitors L1 misses that also miss in the streams. Misses are logged in the filter as a+1, where a was the address that missed. Future misses are checked against the current entries in the filter. If there is a match, implying a consecutive reference, then a stream is allocated. Their experiments showed that 8-10 entries in the filter was sufficient to reduce the extra BW required by more than 50% for most of the benchmarks with negligible impacts on hit rates. Some programs use non-unit strides, which can negate the benefits of a stream buffer. They propose a means of detecting non-unit strides such that their stream buffers can benefit such programs. They partition the physical address space and then look for strided references within each partition. They keep a history buffer on chip that tracks memory references using a finite state machine. The FSM looks to see if the difference between the 3rd and 2nd addresses is the same as the difference between the 2nd and 1st addresses. Steams are allocated after 3 consecutive strided references. An important detail is the size of each partition, or czone (concentration zone). If too small, patterns may not fall within the same partition, and if too big, multiple streams may clutter the history buffer, causing strides to go undetected. This is very program-dependent. Thus, the czone size is parameterized, and set by the programmer or the compiler. They found that streams typically scaled better with increasing input sizes than L2$s: "...as the data set size for scientific programs increases, it may be more cost-effective to exploit the regular pattern in memory references [i.e. streams] rather than to fit a large data set in a huge second level cache." Note however, that streams will not perform well with programs that involve "widely-scattered array in-directions." (guys, what are "widely-scattered array in-directions"?) In summary, using filtered streams with non-unit stride detection, most of their benchmarks showed hit rates of >60% while increasing memory BW requirements by only 30%. From: Mark Whitney [whitney@EECS.Berkeley.EDU] Sent: July24老 2001斥 Tuesday 4:43 PM To: Chris Savarese; jaein@EECS.Berkeley.EDU; whitney@EECS.Berkeley.EDU Subject: Expandable Split Window Paradigm for Exploiting fine-grain parallelism "The Expandable Split Window Paradigm for Exploiting Fine-Grain Parallelism" by Manoj Franklin and Gurindar Sohi: This paper presents an architecture which provides a nice combination of static and dynamic scheduling techniques to get a large window of instructions in which to gain ILP while minimizing the amount of centralized control and communication usually necessary to resolve dependencies among the large number of potential instructions. Their solution is to use dataflow-like analysis on basic blocks in a program, determining inter-block data and control dependencies at compile time and annotating the basic blocks as such with masks which specify which registers values are created and used by each block. The a simple sequential data path (along with the L1 Icache) is then replicated n times in order to execute n of these basic blocks at once. This architecture decentralizes a large amount of the control and communication because all inter-block dependencies in the dynamic instruction stream are detected at compile time and annotated as such. Instruction bandwidth is also much less limited in this case by the replication of the L1 Icache for each window datapath. Each datapath has its own register file as well which is much like a future file. Basic blocks are issued to ready windows in sequential order and register dependent values are forwarded under the guidance of the use and create masks to later basic blocks in the instruction stream. Inter-block memory dependencies are resolved through the use of the Address Resolution Buffer (ARB). This table consists of entries for the memory access address, the value produced for that address and a store and load bit corresponding to every window in the architecture. Both stores and loads put memory addresses in the table and stores also insert the value to be stored in the table in the value space, the loads and stores to an address also update the window load/store present bits. All basic block results are commited in sequential order and therefore, stores are only sent to memory when its block window is commited but loads in later issued blocks can access the store value through the ARB before the commit of the store, enabling speculative loads of memory values across block boundaries. Additionally, a global architecture register file is updated using the commiting window's future file. Exceptions are resolved at the commit time of a basic block, simply discarding all future windows and executing the exeception code. Branches are predicted dynamically in the program and a misprediction simply results in the invalidation of all basic blocks issued after the one containing mispredicted branch. The neat thing about this architecture is that even though basic blocks must be issued in-order, the instructions do not need to be issued in-order in relations to instructions in other blocks because all the necessary data dependency information between instructions in different blocks is noted at compile time and resolved at basic block issue and not during the decoding of the instruction. From: Mark Whitney [whitney@EECS.Berkeley.EDU] Sent: August13老 2001斥 Monday 3:51 PM To: whitney@EECS.Berkeley.EDU; jaein@EECS.Berkeley.EDU; savarese@EECS.Berkeley.EDU Subject: Exploiting Two-Case Delivery for Fast Protecting Messaging "Exploiting Two-Case Delivery for Fast Protecting Messaging" by Mackenzie, Kubi, Frank, Lee, Agarwal, Kaashoek: In the spirit of Alewife, yet another hardware-supported-fast-path-with-software-supported-exception-handling idea: this architecture implements direct user-to-user communication between processing nodes while also providing the level of resource protection and allocation necessary for a multiprogramming environment. Multiprogramming related events are handled in software. Message send and receive operations are performed by abstract "inject" and "extract" operations which are implemented as user-level interrupt handlers or directly in a user process through polling. In all cases, message "injections" are performed atomically by performing stores to a special memory mapped buffer. If the buffer fills before all the message is written to the buffer, the process is blocked until the buffer has more space. Once the entire message is in the buffer, it is sent with a single "launch" operation. In the fast case, the user handler or polling code to extract a message is entered directly, with no kernel crossings by causing a user-level interrupt or checking user visible message arrival flags. The message can then be read directly from the network interface buffer. In the case of polling, the message receive interrupts can be disabled. Atomicity is guaranteed in the user-level handler by allowing further message receipt interrupts to be disabled and messages buffered until the handler completes. Each message belongs to a group of processes specified by a group ID (GID). If a message is received by a processor that is not currently running a process from this group, the operating system handles the message and buffers it in a virtual buffer for that GID dynamically allocated in main memory. When a process of the pertainent GID is later scheduled, the operating system signals the user-level handler (or polling flag) and emulates the delivery of the message through the virtual buffer (instead of reading directly from the NI buffer). The key point is that the same user-level code is used, the software vs. hardware handling is completely abstracted from the program's perspective. It just takes longer to get the message data from the virtual buffer. Protection comes in a few different ways. First, with the virtual buffering mechanism, only processes of the correct GID can access a message. Second, when a message arrives at a processor, an OS controlled timer is started and if the message is not processed in a certain amount of time, the OS interrupts the processor and buffers the message. This prevents bad code from blocking timely receipt of messages for the whole processor. Deadlock is not discussed in depth here but there is another logical network available only to the OS to resolve deadlock at a low level. In addition, processes which overuse the virtual buffering mechanism, causing potential virtual memory thrashing, are paged out and well-behaved processes are allowed to proceed. All in all, a good paper, alot attention to important details relating to protection. From: Mark Whitney [whitney@EECS.Berkeley.EDU] Sent: August05老 2001斥 Sunday 11:57 PM To: jaein@EECS.Berkeley.EDU; whitney@EECS.Berkeley.EDU; savarese@EECS.Berkeley.EDU Subject: FLASH "The Stanford (boo) FLASH Multiprocessor" by Kuskin, Ofelt, Heinrich, Heinlein, Simoni...blah blah, everybody else doing comp. arch. at Stanford This multiproc is all about super flexible but still fast message processing and shared data cache coherence. Each node has a MIPS processor (with cache), a piece of global memory, a net interface, an I/O interface, and a MAGIC processor where all the "magic" happens, haha. The MAGIC processor is stuck in the middle and is the only path for communication between the NI, I/O interface, main memory, and the compute proc. The MAGIC processor is a specialized programmable processor which can load multiple handlers to effectively implement multiple protocols at once. The MAGIC processor has its own queues to receive and send message from and to the MIPS proc., the main mem., and the interfaces. Data in messages are directly forwarded to special data buffers before being sent on to their destinations (main mem., interfaces). Messages to a process running on the MIPS processor is handled at the system level, sending the data to main memory. Deadlock avoidance due to full queues is avoided by having separate request/reply queues, reserving outgoing queue space for messages generated by processing an incoming message, and the ability to swap out messages that overflow queues to main memory to be processed later. Coherence is maintained through a distributed directory scheme, where the directories are in special areas of main memories. Additionally, shared block entries link to a dynamically maintained linked list of processor where the data is present. The control and data pipelines are decoupled to enable data from a message to be streamed through the MAGIC processor while the message header information is processed in parallel. This enables the MAGIC processor to have the flexiblity associated with its own programming language for a variety or protocols while still being able to execute the protocol at speed on par with more specialized implementations. This multiprocessor tries to do everything which is why this summary is kinda scattered, it addresses most multiproc. communication issues though so it would be worth looking at. From: Jaein Jeong [jijeong@uclink.berkeley.edu] Sent: September03老 2001斥 Monday 10:30 PM To: Savarese, Chris Cc: Jaein Jeong Subject: FW: Prelim Mockup questions -----Original Message----- From: Jaein Jeong [mailto:jijeong@uclink.berkeley.edu] Sent: Monday, September 03, 2001 7:31 PM To: whitney@cs.berkeley.edu Subject: Prelim Mockup questions What is cache coherency? Informally, we can say that a memory system is coherent if any read of a data item returns the most recent value of that data item. A memory system is coherent if 1. A read by processor P to location X following a write by P to X ,with no writes to X by any processor between the write and the read, returns the value written by P. 2. A read by processor P to location X following a write by another processor P1 to X ,when no writes to X by any processor between the write and read and the read and write are sufficiently separated, returns the value written by the P1. 3. writes are serialized. That two writes to the same location are seen in the same order by all processors. Why coherence is important? If memory system is not coherent, data cannot be shared among processors. Identify a cache coherence protocol. Snoopy bus protocol. Describe it briefly. The basic snoopy bus protocol has three states: invalid, shared, exclusive. Initiallly the state is in invalid. When there is a read on the processor, the processor reads data from the main memory and state changes to shared. On write, the processor writes data on the bus and state changes to exclusive. Between shared and exclusive, exclusive, on read miss, writes data block on the bus and changes to shared. Shared, on write, writes data on bus, and changes to exclusive. When there is a write miss from another processor, the state changes to invalid both from exclusive and shared. On exclusive, the processor writes data, too. Otherwise, states don't change. Most modern processors have L2 caches. Descibe how cache coherence is maintained in a processor with L2 cache. Every level of cache should snoop on the bus for any data. When the data is written on the bus. What's the problem with it. Memory bandwidth is wasted for cache coherence in every level of cache and cache controller gets complicated. What is multilevel inclusion? When all the data of an upper level cache is contained in the lower level cache all the time, we say that it meets multilevel inclusion. Why multilevel inclusion is good? Only the 2nd level cache need to deal with cache coherence and invalidate corresponding block in the 1st level cache, if there is any block. Since all the data in 1st level cache can be found in 2nd level cache. Is there any limitations with multilevel inclusion? Multilevel cache can have problem when the block sizes of 1st level is larger than that of 2nd level cache. In that case, we need to invalidate all the blocks in the first level cache that map to the 2nd level cache to be invalidated even though the 1st level cache blocks don't contain stale data. This increases miss rate. What is inorder pipeline processor? An inorder processor is a processor where Instructions are issued and executed in the program order. Identify an inorder pipeline processor. StrongARM SA-110, MIPS 20K Identify pipeline stages of an inorder processor. Inst Fetch - Decode - Execution - DMem - Write Back What is a hazard? We say a pipeline has a hazard when cannot execute instructions. What kinds of hazard there are? Structural hazard: cannot execute pipeline due to limited resources data hazard: cannot execute pipeline due to data dependency control harard: cannot execute pipeline because execution path is not determined. Consider the following instruction sequence: LW R1, 0 (R2) SUB R4, R1, R5 AND R6, R1, R7 OR R8, R1, R9 Assuming we can forward and have separate inst cache and data cache, identify any data hazard in the instruction sequence above. SUB depends on LW for R1. When SUB needs R1 at the start of EX, LW hasn't written the memory value yet - Data hazard. What kinds of data hazard there are? RAW hazards: Read after write hazard, real dependence WAR hazards: write after read hazard WAW hazards: write after write hazard Which data hazards this processor can have? And explain it. This processor has only RAW hazard. Since all the insts are the same length and register value is read at ID stage and written WB stage, there is no WAR hazard. Since all the insts are the same length and data is written at WB stage, there is no WAR hazard. How precise interrupt can be handled in this processor? save all the registers and the pipeline registers and start interrupt handler by loading PC with its start address. After then, we can restore PC, registers, pipeline registers and resume the work again. ALU instructions are actually finish its computation at EX stage. Thus it looks like we can shorten the ALU insts into 4 stages rather than 5 stages. What is the problem with this idea? This is actually an out of order processor. Since all insts are not the same length, an inst issued later can complete before an inst issued earlier. Two insts can access the register file at the same time to write a data. And precise interrupt can not be maintained with earlier method. How we can solve the problem? First we can increase register ports so that two writes to different registers can be done at the same cycle. For two writes to the same location, an instruction should be checked at the ID stage whether its destination register conflicts with other insts issued earlier. In that case, that inst should be stalled until there is no destination register conflict. For precise interrupt, we can have reorder buffer, where completed instructions are stored before they write results to register file. Instructions should be allowed to write their results only after its preceding instructions are completed and write their data. From: Mark Whitney [whitney@EECS.Berkeley.EDU] Sent: July18老 2001斥 Wednesday 2:45 AM To: savarese@EECS.Berkeley.EDU; jaein@EECS.Berkeley.EDU Subject: HPSm, a High Performance Restricted Data Flow Architecture Having Minimal Functionality "HPSm, a High Performance Restricted Data Flow Architecture Having Minimal Functionality" by Hwu and Patt: This paper gives a nice, practical, architectural realization of a data flow processor. Among other things, it lays out how operations would be pipelined through the building of the data flow graph and the subsequent execution. The algorithm and architecture basically implements the Tomasulo algorithm. Upon decoding each register operand is mapped to the appropriate physical register. Then it is either read from the register file or assigned a data flow node if this register value is not currently available. The operand node is given a tag corresponding to the relevant value when it is distributed. The output register number is used to update the register allocation table and is assigned its own tag which is then broadcasted when the instruction completes. This paper basically reiterates Tomasulo's algorithm in data flow terms and applies it to more than just floating point calculations. It also draws out some logic useful for the algorithm. From: Chris Savarese [savarese@EECS.Berkeley.EDU] Sent: July11老 2001斥 Wednesday 4:34 AM To: Jaein Jeong; MARKGREGORY WHITNEY Subject: IBM 360 paper Hi guys. I read the IBM 360 paper. Keep in mind that this was way back in 1964... The paper reads as a discussion of the decisions they made in designing the 360 architecture. In fact, they spend most of the paper listing the issues they were concerned about, so I'll pretty much do the same here, hilighting the more interesting points. - due to new conceptual advances, they allowed for lack of backwards-compatibility - they wanted an architecture that could grow throughout the next decade, hence they needed a general design - specifically, I/O devices differing in data rate, access, and function had to share a general interface - common system functions, such as addressing, I/O management, etc., should be made efficient and need not be diffirent in machines built for different applications. - storage capacities larger than 32,000 words would be needed. - FP may need more than 36 bits of precision - systems should be able to self-diagnos to reduce down-times - in order to accomodate disparate rates of technological improvement in different parts of the system (CPU, I/O, etc.), the system should be able to handle asynchronous interactions between different parts. - they wanted the ability to manipulate individual bits They define "strictly program compatible" as: "A valid program, whose logic will not depend implicitly upon time of execution and which runs upon configuration A, will also run on configuration B if the latter includes at least the required storage, at least the required I/O devices, and at least the required optional features.... Programs dependent on execution-time will operate compatibly if the dependence is explicit..." They discuss the pros and cons of different character sizes, noting that only 4 bits are needed to represent all decimal digits, while 6 bits are needed to represent alphanumeric characters. choices: 1) 6 bits for every character (wasting 2 bits on decimal digits) 2) 4/8 (digits, alphanumeric) 3) 4/6 (digits, alpha) 4) 7 bit characters, using a binary recoding of decimal pairs 1 had been used before (IMB 702-7080 and 1401-7010 families), so was familiar. Also, simple specification of field width was attractive. 3 was unattractive due to need to specify difference between digits and alphanumeric characters. also, only having 6 bits for alphas seemed short-sighted. 2 seemed like a good compromise between 1 and 3, having better coding efficiency in applications that used more decimal digits than alphanumeric characters, and having spare room in the alphanumeric set. (Further comments, including which they finally chose, weren't included.) FP: they used both 32- and 64-bit. base 16 was used, as it was shown by Sweeney (reference # 1 in paper) that it reduced the frequency of pre-shift, overflow, and precision-loss post-shift on FP addition. They used 2's compliment for negative numbers. They thought about only using ASCII, but decided against it because BCD was widely used in code at the time. They use base-register addressing (register-offset) to allow for easy program relocatability. ------------------------------------------- That's about it... I'm also reading the Symbolics paper, which I must say, is MUCH more interesting than the 360 paper so far. It's coming along, and I should have a summary to you before we meet. Sorry to be late on this. See you tomorrow at 8:30. Thanks, Chris From: Chris Savarese [savarese@EECS.Berkeley.EDU] Sent: July17老 2001斥 Tuesday 3:27 AM To: Jaein Jeong; MARKGREGORY WHITNEY Subject: Implementing Precise Interrupts in Pipelined Processors Ahhhh.... good food. Alright, here's the precise interrupt paper. I HIGHLY recommend you take the time to read this paper in its entirety. It's very well-written, very informative, very fundamental, and very important. They almost always ask a question regarding this material. The summary... Definition of a precise interrupt: 1) All instructions preceding the instruction indicated by the saved program counter have been executed and have modified the process state correctly. 2) All instructions following the instruction indicated by the saved program counter are unexecuted and have not modified the porcess state. 3) If the interrupt is caused by an execption condition raised by an instruction in the program, the saved program counter points to the interrupted instruction. The interrupted insttuction may or may not have been executed, dependiing on the definition of the architecture and the cause of the interrupt. Whichever is the case, the interrupted instruction has either completed, or has not started execution. Classification of interrupts: 1) Program interrupts: aka "traps" results from exception conditions detected during fetching and execution of specific instructions 2) External interrupts: caused by sources outside the currently executing process, such as I/O interrupts. Motivating Precise Interrupts: 1) restarting possible in the case of I/O and timer IRQs 2) restarting after page faults possible in VM systems 3) software debugging: precise IRQs help in isolating scenarios generating bugs 4) graceful recovery from arithmetic exceptions 5) unimplemented opcodes can be simulated by system software transparent to user. allows for compatibility between lower and higher performance models of a system. 6) virtual machines can be implemented. History: - IBM 360/91 produced imprecise IRQs - CDC 6600, CDC 7600, and CRAY computers allow out of order completion and have some exceptions that result in imprecise IRQs. Advantages of precise IRQs were sacrificed for maximum parallelism and design simplicity. I/O IRQs in these machines are precise. No VM. - CDC STAR-100 and CYBER 200 machines also have out of order completion instructions, and they support VM. - these achieve precise IRQs using invisible exchange package. resides in memory and captures state info. - CDC CYBER 180/990 uses a history buffer, described below Interrupts prior to Instruction issue - assumed: instructions issue in order, and instructions can't alter state before issuing - exception deteced: halt instruction issue. wait for all currently executing instructions to finish. now in precise state. In-order instruction completion: the result shift register - instructions modify the process state only when all previously issued instructions are known to be free of exception conditions. - issuing insts place control info in result shift register: function unit supplying result, destination register, and validity bit - instructions reserve slots in result shift register according to how long they take to execute - issuing instructions using a stage reserve all earlier stages - when instruction reaches stage 1 in result shift register, checks for exceptions. none: complete instruction. - exception: cancel all subsequent instructions - this protects registers... main memory requires use of a dummy store in result shift register. - a memory write is blocked until the dummy store reaches stage 1 in the result shift register. - result shift register contains a field for program counter so that it too can be restored in case of an exception The reorder buffer - allows out of order execution, but blocks instruction from altering state until exceptions checked for all preceding instructions - reorder buffer: circular buffer with head and tail pointers. - at issue, next available reorder buffer entry (tail) is given to the instruction. tail pointer then incremented - result shift register used as before, but with addition of a tag corresponding to the reorder buffer entry. - instruction completes: results and exception conditions sent to reorder buffer. - when an entry at head of reorder buffer contains valid results, its exceptions are checked. - none: write results to registers - memory writes use dummy stores, as before, except that now they're allowed to issue, but blocked at the write stage until the dummy catches up - program counter saved in field in reorder buffer. - bypass paths (similar to data forwarding) can be used to speed up dependencies. - results can be forwarded from reorder buffer before they're written to registers. - in case of multiple references to same destination register, only the latest result should be forwarded. - disadvantage: lots of circuitry for bypass comparators History Buffer - achieves same performance as reorder buffer with bypassing, but without all the extra circuitry - still used result shift register with tag - instread of reorder buffer, use history buffer - at issue, same control info is loaded into history buffer as was loaded into reorder buffer, but the current value from the destination register is also saved. - results written directly to registers as soon as instructions complete. - when history buffer contains element at head know to have finished without exceptions, that entry is erased. - if all history buffer entries are being used, issue is blocked - exception at head of buffer: buffer held, issue halted, pipeline activity allowed to finish. history buffer then emptied from tail to head, loading history values back into registers. program counter at head is the precise program counter. - memory stores emerging at head of history buffer send signal that another store can commit. Future File - uses 2 separate register files - architectural file represents the sequential machine - future file is the working file that all results are read from and written to. allowed to run ahead of arch. file. - instructions issued and results returned to future file in any order - reorder buffer receives results at same time as future file - when head pointer in reorder buffer encounters a completed instruction without exceptions, the result is written to the arch. file. - program counter written to reorder buffer at time of issue - exception encountered: restore values from arch. file to future file. Extensions - real systems will probably have more state info - registers that point to page and segment tables can be handled just like memory, using dummy stores - condition codes can be stored in reorder buffer - for virtual memory, in-order memory operations assumed... - addressing fault in VM: instruction cancelled in addressing pipeline, all subsequent load/store instructions cancelled - store-through caches: cache updated immediately, but store-through to memory delayed until dummy store encountered. - exception causes cache to be flushed. this can lead to performance problems though, especially for larger caches. - alternative: treat cache like a register file. i.e. use a history buffer for the cache. - write-back cache much more natural to precise interrupts. - before a write-back is made, make sure reorder buffer has emptied or at least checked for data belonging to line being written back. wait until cache updated from reorder buffer before writing back. - linear pipeline structures: much finer pipeline stages, such that reordering of instructions not needed. then pipeline itself acts as a sort of reorder buffer. - sequential architecture model for vectors: one vector instruction completes its last result before the next begins producing results. - vectors: register architectures: keep 2 copies of register vectors, with 1-bit pointers for each register indicating which copy is "current" and which copy is "new". - save these pointers in the reorder buffer, and ensure that a pointer to a specific register appears only once in buffer. - then can alter pointer according to exception conditions when pointer reaches head of buffer. if no exceptions, swap new and current, else don't swap. - similar strategy for history buffer and future file approaches applied to vectors. - vectors: memory to memory systems: buffer results in CPU until all operations associated with a vector instruction have completed w/out exception before allowing memory stores to begin. That's all. I'll do the other 2 tomorrow. See ya' Chris From: Chris Savarese [savarese@EECS.Berkeley.EDU] Sent: July24老 2001斥 Tuesday 11:42 PM To: Jaein Jeong; MARKGREGORY WHITNEY Subject: Improving Direct-Mapped Cached Performance... Hi guys, Here's the paper on additional, small, fully-associative caches and prefetch buffers for reducing cache misses. Miss caching: a small, fully-associative cache between L1 and its refill (either L2 or main memory). The miss cache is on the order of 2-5 cache lines in size. On an L1 miss, data is fetched both to the L1 direct-mapped location and to the LRU location in the miss cache. This technique essentially adds associativity to a direct-mapped cache without incurring the traditional penalties of higher access times due to associativity logic. Adding associativity reduces conflict misses. Simulations with common benchmark applications show that this strategy is much more effective at removing data cache misses than instruction cache misses. Victim caching: similar to miss caching, but with a different fill strategy. On an L1 miss, the new data is loaded into the L1 cache. With miss caching, this same data would also be loaded into the LRU location in the miss cache. This redundancy is often useful, as shown by the performance increase achieved via miss caching, but at times it is also wasteful. Thus, on an L1 miss, instead of loading duplicate data, the victim cache retains the victim data from the L1 cache. Victim caches always out-perform miss caches. Stream buffers: a small buffer used to prefetch lines of data in the event of a cache miss. Prefetching starts at the miss target and continues through sequential lines, fetching in parallel to normal system operation. This paper talks about a 4-entry stream buffer. This technique reduces capacity and compulsory misses. Multi-way stream buffer: use multiple stream buffers in parallel. When a miss occurs in both L1 cache and all stream buffers (they used 4 buffers), the stream buffer that was least recently hit (i.e. LRU) is flushed, and prefetching begins from the miss target in this buffer. This doesn't seem to improve instruction cache misses, as compared to a single stream buffer, but it almost doubled the performance of the data side. Overall, they claim that using victim caches and stream buffers together can reduce the miss rate in a first level cache by a factor of 2-3. The rest of the paper talks about the various performance trends of each strategy as functions of cache size, line size, etc... "In general, smaller direct-mapped caches benefit most from the addition of a victim cache." Victim caches seem to be more efficient with longer line sizes. Instruction stream buffers have constant performance over a wide range of cache sizes. Data stream buffers generally improve in performance as cache size increases. Sream buffers provide less of an improvement as line sizes increase. The paper concludes with a summary of a simluation involving a 4-entry data victim cache, and instruction stream buffer, and a 4-way data stream buffer. Overall performance was improved by 143% on average over 6 benchmarks. From: whitney@EECS.Berkeley.EDU on behalf of Mark Whitney [whitney@EECS. Berkeley.EDU] Sent: July26老 2001斥 Thursday 4:32 PM To: whitney@EECS.Berkeley.EDU; savarese@EECS.Berkeley.EDU; jaein@EECS.Berkeley.EDU Subject: Improving the Accuracy of Static Branch Prediction "Improving the Accuracy of Static Branch Prediction" by Cliff Young and Michael Smith: This paper actually has a quite clever and simple idea to provide decent branch prediction ability without the addition of dynamic correlation counters or other hardware structures. The idea is based on the difference between a branch path history and branch pattern history. The architecture developed in the paper essentially duplicates branch basic blocks such that different instances of a duplicated branch block are reached by the different path histories which are deemed significant through static profiling of the program. Neato! Okay, to be more specific, the compiler starts out by collecting profiling info. It runs the program, keeping track of the last few global branch outcomes and their PCs. For every branch, it has a bank of counter pairs indexed by the global branch path history. One counter counts the number of times the branch is taken under that branch path and the other counts the number of times the branch is executed under the branch path. Next, they merge all the path histories for a branch into a single history tree and prune the tree. Each leaf in the tree (the oldest branches in the path histories) is labeled taken or not taken based on whether that path produced more than half taken outcomes. The tree is then traversed depth-first, at each parent node, if all children are labelled the same (taken or not taken), the children are pruned away and the node is labelled the same as the children. This can be done since the final branch outcome obviously does not depend on the children that were just pruned since all their paths produced the same outcome. If the children are different, the parent is labelled both and the children nodes are not pruned. The next step is to globally duplicate the basic blocks such that each block in a path significant to some branch must be duplicated to create all the necessary distinct paths. The paper has better diagrams and stuff, it if kinda complicated. This incidentally implies that the blocks could need to be duplicated exponentially but in the results, it is shown that only short path histories are needed and that not too many branches benefit form path history based prediction. The last stage involves placing the actual duplicated basic blocks in the binary, they do not go into too much detail on how they do this. As mentioned, these prediction schemes were not as good as a dynamic two level branch predictor but still performed pretty well comparatively. As mentioned before, relatively little history was needed for each branch and a small subset of branches really benefited from this path based history so much of the history tree could be pruned away, increasing code size by around 30% for most programs. From: Chris Savarese [savarese@EECS.Berkeley.EDU] Sent: August08老 2001斥 Wednesday 1:21 AM To: jaein@eecs.berkeley.edu; whitney@eecs.berkeley.edu Subject: LimitLESS Directories: A Scalable Cache Coherence Scheme Distributed full-map directory schemes for cache coherence take ridiculous amounts of physical memory to implement, on the order of O(N^2), where N is the number of processors in the multiprocessor machine. Limited directory schemes address this by providing a finite number of directory entries that are shared. These schemes grow as O(N log N), and can often approach the performance of full-map schemes for programs that are optimized to limit the number of widely-shared objects. LimitLESS realizes the performance of a full-mapped scheme with only the hardware required for a limited scheme, while avoiding the need for software optimization. It does this by spreading the directory responsibilities between hardware and software. Hardware implements a limited directory scheme. When a directory overflows, the memory module interrupts the processor for software emulation of a full-mapped scheme. Most data structures have small worker-sets, or sets of processors that concurrently read a memory location. This means that the limited scheme implemented in hardware is often sufficient to handle the coherence scheme. When data structures have larger worker-sets than can fit in the hardware directory, the processor takes over, emulating the directory in software. Thus, the common case is handled in hardware, and the rare case is handled in software. The MIT Alewife machine implements LimitLESS. The most important feature of this architecture relative to LimitLESS is its fast trapping capabilities. When a memory module interrupts a processor in the case of a hardware directory overflow, it's important that the response be quick, so as to not degrade performance. In addition to this, the architecture must allow processors complete access to coherence related controller state info. Third, the processor must have access to the network that allows it to launch and intercept coherence protocol packets. They ran a bunch of simulations comparing full-map, limited, and limitLESS schemes. For apps with small worker-sets, both limited and limitless performed almost as well as the full-mapped scheme. For apps with larger worker-sets, however, the limited scheme showed about half the performance of the full-mapped scheme, while the limitless scheme was almost as good as the full-mapped scheme. From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: July24老 2001斥 Tuesday 5:16 PM To: MARKGREGORY WHITNEY; Chris Savarese; Jaein Jeong Subject: Lockup-Free Instruction Fetch/Prefetch Cache Organization Hi, Cache access time depends on miss rate, miss penalty, and hit time. Lockup-Free cache or non-blocking cache is a scheme to reduce miss penalty. Lockup-Free cache allows a processor to access cache while the processor is waiting for data. Lockup-free cache needs additional hardware - input stack, miss info holding registers, and miss comparator and status collection. The miss info holding registers hold information which is necessary to handle central memory and cache. The input stack is a buffer to hold data in transit to cache. The data in input stack can be bypassed to CPU before it goes to the cache. The experiments in this paper showed that 4 is enough for MSHR (miss info handling register). Patterson & Hennesey says [p415] the number of outstanding misses (1, 2, or 64) a cache can hold doesn't make much difference for integer applications, but the difference was noticable for floating point applications. This is because integer applications can get most of its benefit from simple hit-under-one-miss whereas floating point applications benefit from more complex schemes. As I mentioned in the beginning, lockup-free cache is to reduce miss penalty. Other schemes to reduce miss penalty are Giving priority to read misses over writes Subblock placement Early restart and critical word first Second-level cache lockup-free cache has the highest complexity. To mix lockup-free cache with these techniques will be a good compromise. Most modern computer architectures have non-blocking caches. Examples are Alpha 21064, MIPS R10000, Pentium Pro, II, and III. ----------- Jaein From: MARKGREGORY WHITNEY [whitney@imap4.CS.Berkeley.EDU] on behalf of MARKGREGORY WHITNEY [whitney@EECS.Berkeley.EDU] Sent: August07老 2001斥 Tuesday 2:37 PM To: jaein@EECS.Berkeley.EDU; savarese@EECS.Berkeley.EDU; whitney@EECS.Berkeley.EDU Subject: Memory Consistency and Event Ordering in Scalable Shared- Memory Multiprocs "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors" by Gharachorloo, Lenoski, Laudon, Gibbons, Gupta, and Hennessy: Okay, unless you are already a consistency bad-ass, you need to read this paper. I can't really summarize it much since there isn't anything to summarize outbut I will give it a whirl. The paper starts off redefinining sequential, processor and weak consistency to have a consistent (no pun intended) description of existing ordering methods. They also have some really boffo diagrams of all the consistency models which I haven't seen before. They go on to categorize memory accesses not only as conflicting and non-conflicting but also as synchronizing and non-synchronizing (a subcategory of conlicting accesses). They point out that the synchronizing accesses can even be divided between aquire and release accesses, differentiating between synchronizing ops which must wait for previous loads and stores, and ops for which previous loads and stores must wait. Then they present their new model, release consistency, which is weaker than all previous models. It is weaker than weak consistency because it differentiates between the synchronous accesses where WC does not. They derive equivalences between all the models and describe a system of labelling accesses which allows them to talk about cases in which some weaker models are as intuitive (for the programmer) as SC. All this and more in only 9 pages! (if you skip the proof appendicies :) Like I said, you should read it. It is very comprehensive, pretty rigorous and still readable. mark From: Chris Savarese [savarese@EECS.Berkeley.EDU] Sent: July25老 2001斥 Wednesday 3:20 AM To: Jaein Jeong; MARKGREGORY WHITNEY Subject: Memory Dependence Prediction using Store Sets Ahhh, so much nicer than the last paper... Naive processors are slowed by the need to wait for possible RAW dependencies. Accurately disambiguating these dependencies ahead of time, either statically or dynamically, is too complicated to be realistic. This paper proposes a means of predicting such dependencies. Loads develop a history of RAW violations, and this history is used to predict exactly which stores each individual load must wait for in the future. This allows loads to proceed earlier than if they were forced to wait for all stores. Each load has a store set. A store set for a specific load is the set of all stores upon which the load has ever depended. Basically, when a program starts, all loads have empty store sets, and a processor allows naive speculation of loads around stores. This causes RAW hazards, which are detected and fixed by the processor, but with a substantial penalty. As these RAW hazards occur though, the stores causing them are added to the load's store set, so that in the future, the load will be forced to wait for the store. These dependencies can change over time though, and so it's useful to have some means of resetting or retiring store set dependencies. 2 proposed means of doing this are a 2-bit saturating counter attached to each store in a store set, and a cyclic clear algorithm. 2-bit counters: After a load is forced to wait for a store based on its store set, the hw checks to see if there actually was a dependency. If there was, the counter is incremented. If there wasn't, the counter is decremented. The load is only required to wait for the store if the high bit of the counter for that store is 1. cyclic clear: Valid bits for each store in all store sets are cleared at regular intervals (i.e. every 1 million cycles). This algorithm is implemented in the form of 2 tables: the Store Set Identifier Table (SSIT), and the Last Fetched Store Table (LFST). The SSIT holds the store sets, and the LFST specifies the status of the most recently fetched store for each store set. Loads and stores access the SSIT to determine if they belong to valid store sets. If so, they use the info there to access the LFST. Loads check to be sure they've waited for all appropriate stores. Stores are forced to store in order, and so a store will append itself to the data concerning the most recent store in the LFST, making itself dependent on this most recent store. Note that stores are limited to existing in only 1 store set at a time. Also, in order to keep hw costs manageable, the size of the SSIT is limited, forcing aliasing among loads. If a store attempts to exist in 2 locations, these locations are merged. These limitations can cause false dependencies. Through simulation, they found that the 2-bit counter approach sometimes degraded performance in the presence of aliasing, and so they opted for the cyclic clearing algorithm instead. They're final simulations showed an average performance improvement of 34%. From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: August01老 2001斥 Wednesday 12:38 PM To: Chris Savarese; MARKGREGORY WHITNEY; Jaein Jeong Subject: Memory-System Design Consideration for Dynamically-Scheduled Processors Hi, Sorry for sending reviews a little late. Chris, we can meet at 7 o'clock as you suggested. In this paper, they tried to find the relation between the complexity of memory and the performance for dynamically-scheduled processors. They ran experiments with different number of registers (48, 64, 96, ...) , in-flight misses sustainable (lockup caches or lockup free cache), number of prefetched entries (for stream buffer) and issue size (4-way or 8-way). The performance metric was number of instructions per cycle (IPC). Higher IPC means better performance clock cycles and ISA are the same among the diffrent configurations. They found the following relations: -- System with less restrictive caches has better performance (4 in-flight miss sustainable cache is better than 2 miss cache and 2 miss cache is better than lockup cache) -- System with more registers has better performance (System with 96 registers is better than that with 64 and system with 64 is better than 48). -- But the increase of performance with additional registers was greater for system with less restrictive cache and wider issue unit. (Performance increase of 4 in-flight miss cache is greater than that of 2 in-flight miss cache when the number of registers was increased from 64 to 128, or when the issue width was widen from 4 to 8.) From this observation we can know that system with more flexible cache and wider issue width can exploit the increased number of registers better since they have higher memory bandwidth and allocate instructions more freely. -- Both more flexible cache and increased number of registers improves performance, but load miss was much more reduced by having more flexible cache. Just restricive cache with a huge register file doesn't help much! -- Stream buffer helped boosting IPC, but the performance increase was greater for more restrictive cache. That is because more restrictive caches suffer from narrow memory bandwith and get more benefit from reduced miss penalty by stream buffer. -- Different stream buffer strategies were employed. That made little difference to system with less restrictive cache whereas it showed a little better performance for more restrictive. To summarize, lockup free cache and stream buffer reduce cache miss penalty and one of them can be used in case is the other is not so much supported. Higher parallelism needs both higher memory bandwidth and larger register file. ------ Jaein From: Chris Savarese [savarese@EECS.Berkeley.EDU] Sent: July17老 2001斥 Tuesday 11:14 PM To: Mark Whitney; jaein@eecs.berkeley.edu Subject: Monsoon: an Explicit Token-Store Architecture Hi guys, Here's the token-store architecture paper. The Explicit Token-Store (ETS) architecture was developed based on ideas taken from the MIT Tagged-Token Dataflow Architecture (TTDA). The desirable properties of the TTDA that were retained in ETS were a large synchronization namespace, inexpensive synchronization, and tolerance to memory and communication latency. Parallel machines lacking these features are dependent on compilers' abilities to decouple dependencies and parallelize code. TTDA machines achieve such tolerance with sophisticated maching hardware. This inhibits the critical path of instruction issue though, and it also allows for the possiblity that token storage will overflow, causing deadlock. ETS take a more aggressive approach to storing tokens... "The central idea in the ETS model is that storage for tokens is dynamically allocated in sizeable blocks, with detailed usage of locations within a block determined at compile time." Each function has its own storage for tokens that is allocated as the function is invoked. Tokens contain a value, a pointer to the instruction to execute (IP), and a pointer to an activation frame (FP). IP and FP form the tag. Each frame slot has presence bits associated with it, indicating whether or not the slot contains data. Instructions get 1 operand from the value in the token and the other from the location specified by FP+r Monsoon is a general purpose multiprocessor using ETS. The pipeline operates as follows: 1) IP from incoming token used to address local instruction memory 2) address in frame-store computed from FP+r 3) presence bits from this frame read, modified, and written back to same location. If the slot was empty, it is now considered full, and vice-a-versa. 4) depending on results from step 3, the data from the frame location is ignored, read, written, or exchanged with the value on the token 5) operands sorted to inputs of ALU 6) operands processed by a functional unit 7) at same time as 6, 2 new results tags are computed 8) ALU results combined with new tags from 7 to create new tokens. Comparison between Monsoon and CRAY using GAMTEB benchmark: The same simulation took a little over one billion instructions on Monsoon and 250 million instructions on CRAY. It was estimated that 50% of the Monsoon instructions were due to memory management, and the other factor of 2 performance difference was expected. See ya' Chris From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: August08老 2001斥 Wednesday 1:38 AM To: Chris Savarese; MARKGREGORY WHITNEY; Jaein Jeong Subject: Multiprocessors Should Support Simple Memory Consistency Models Hi, This paper starts explaining memory consistency models: sequential consistency and relaxed models which relax write-to-read order (processor consistency) or relax all orders (weak ordering, release consistency) It defines sequential consistency like this: A multiprocessor is sequentially consistent if the result of any executions of all the processors were executed in some sequential order, and the operations of each individual processor were to appear in this sequence in the order specified by its program. That means the order all memory operations is determined and their results are to be seen in that order. The benefit of sequential consistency is that programs are guaranteed to run in the same order as in a uniprocessor. So one can write a program for a multiprocessor as he or she did in a uniprocessor. But, programs following sequential consistency cannot benefit the increased number of processors because it cannot overlap two memory operations. That's why relaxed models came out. Relaxed models allow a program to overlap some memory operations while preserving program order. We maintain program order with synchronization. In 'relaxing write-to-read program order' (processor-consistency) a processor's writes may not immediately affect other processors, but when they do the writes are seen in program order. In 'relaxing all orders', reads and writes can be reordered for hardware optimization. To preserve program order, programmers need to include synchronization operations, which are called fences, barriers, membars, and syncs. When those a fence is included, an operation before the fence should be completed before any operations after that fence. It shows that the processor consistency and release consistency reduces execution time up to 10 percent and 16 percent compared to sequential consistency. However, with the increasing transistor budgets and new micro- architectural invention, the gap between sequential consistency and relaxed models is shrinking. And relaxed models still give burden on programmers and compiler writers. Thus, the author expects sequential consistency will win in the end. It's interesting that the person who advocated one of the relaxed memory consistency models (weak order) now prefer the simplest model (sequential consistency). ------ Jaein From: jaein@EECS.Berkeley.EDU on behalf of Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: July06老 2001斥 Friday 11:56 AM To: MARKGREGORY WHITNEY; Chris Savarese; Jaein Jeong Subject: Other RISC readings Hi guys, I summarize two papers: RISC I: A Reduced Instruction Set VLSI Computer A Retrospective on High-Level Language Computer Architecture In the paper "The Case for the Reduced Instruction Set Computer", Patterson and Ditzel pointed out that conventional computers (CISC) has some problems and proposed RISC as an alternative. In RISC I paper, they showed their ideas were right with simulations. To build a simple and effective single chip VLSI computer, they assumed the following constraints: An instruction is executed per cycle. All instructions are of the same size. Memory is accessed only by load and store instructions. High level languages are supported. To counter these constraints, they provided capabilities like: Large register files and overlapped register windows Prefetching and Delayed jump The most time consuming tasks in high level languages are CALL and RETURN, which needed memory access for passing parameters and return values. By using overlapped register window, RISC I could execute CALL and RETURN most of the time just with registers. The only exception was when register files were in overflow or underflow, which were less than 1% of the cases. RISC I also used prefetching (a kind of pipelining) to improve performance. This meant the processor should wait when executing a branch instruction. They made the processor execute the instruction right after a branch all the time. They were able to rearrange the program sequences so that the processor execute useful instructions after a branch in most cases (around 90%). In the statistics, the execution time of RISC I was shorter than that of VAX and the instruction length was at most two thirds more than that of VAX. Reduced memory access was very impressive. Even though RISC I accessed memory only through load and store, it reduced memory access by using registers. In procedure call intensive programs, the performance of RISC I was much better. They prospected RISC I will produce a good result when it is made in a single chip. In "A Retrospective on High-Level Language Computer Architecture" paper, they mention that some issues are appearing again: high level language support in machine level (Java Byte code), compact code size to reduce a big processor memory performance gap or to reduce total system cost. Once again They pointed out that what matters is the effect of combined hardware and software system, not an individual component implemented in hardware. They also mention that some of the possible attributes for High level language computer system was really realized in RISC computers: ISA will be optimized for compilers and compilers will be in important roles. Jaein From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: October05老 2001斥 Friday 9:03 AM To: Jaein Jeong Subject: Paper Summary: A Preliminary Architecture for a Basic Data- Flow Processor (fwd) Hi, This paper explains how we can make a data flow processor. First, the paper showes the elementary processor, which can do only arithmetic operations. The elementary processor consists of instruction cells, arbitration network, operation units distribution network. Packets(data packets and operation packets) travel along the processor. An instruction cell waits for operands from data packets. When all of its operands are filled, it fires an operation packet. The operation packet flows to an operational unit through arbitration network. The opeartional unit executes the operation and sends a data packets containg the result and the instruction cell where the result should be stored. The data packet moves to an instruction cell through distribution network. The authors develop this idea by adding decision units and control network for flow control intructions. Naturally, we need decision packets and control packets which are counterparts of operation pakcets and data packets. Decision units consist of decider, T-gate, F-gate, merge. Finally, they expand their system into 2-level memory architecture. Instructions which are not used often is moved back to instruction memory. The instructions can be retrieved again when they are needed. Actually, the data flow architecture issues multiple instructions per cycle and executes them in multiple functional units. As we saw in simultaneous multithreading paper, the instructions can have contention between them. In addition, it is not clear they got the machine code for the data flow machine. So the performance of the architecture is dubious. ------- Jaein From: jaein@EECS.Berkeley.EDU Sent: September30老 2001斥 Sunday 8:59 AM To: jaein@cs.berkeley.edu Subject: Paper Summary: CDC 6600 CDC 6600 was built more than 30 years ago (1964), but it shows many features of modern processors. First, it is a load/store architecture. All arithmetic operations are done in registers. Memory access is done through registers, not directly with fucntional units. Second, arithmetic operations are 3-address insts. (two source, one dest). Third, it has 10 functional units. Thus can execute multiple instructions in parallel. If a processor executes multiple instructions at a cycle, some insts may depend on others. CDC6600 enforces dependence by using scoreboard. When a inst is executed, a processor checks whether all of its source registers are available. Only after all registers are available, the inst is issued to a functional unit. And its destination register is reserved. All the following insts depending on that register should wait until the register is available. The way CDC 6600 handled multiple inst execution has been adopted most modern processors. From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: September30老 2001斥 Sunday 9:19 AM To: jaein@cs.berkeley.edu Subject: Paper Summary: Cray-1 Cray 1 has scalar insts and vector instructions. A vector instruction is done on homogeneous multiple operations, i.e. array elements. Reasons why Cray 1 was successful: Functional units are accessed only with registers. Can handle non-consecutive address accesses by using non-unit stride. Vectorizing scalar insts was effective. A vector inst was better than equivalent scalar insts when number of operations are more than 2 or 3 operations. Chaining makes possible to use elements as soon as they are available rather than waiting until all the elements of a vector are computed.From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: October02老 2001斥 Tuesday 9:54 AM To: Jaein Jeong Subject: Paper Summary: IBM 360 paper Hi guys. They built IBM 360 based on the folling goals: - Allowing for lack of backwards-compatibility due to new conceptual advances. - Design an architecture so that it can grow throughout the next decade. - I/O devices with different data rate, functions had to share general interface. - common system functions, such as addressing, I/O management, etc., should be made efficient and need not be diffirent in machines built for different applications. They define "strictly program compatible" as: "A valid program, whose logic will not depend implicitly upon time of execution and which runs upon configuration A, will also run on configuration B if the latter includes at least the required storage, at least the required I/O devices, and at least the required optional features.... Programs dependent on execution-time will operate compatibly if the dependence is explicit..." From: Jaein Jeong [jijeong@EECS.Berkeley.EDU] Sent: September21老 2001斥 Friday 9:29 AM To: jaein@cs.berkeley.edu Subject: Paper Summary: Implementation of precise interrupts in pipelined processors What is precise interrupt? Suppose an interrupt happened while executing an instruction. If instructions before that instruction are executed and committed, and instructions following that instructions don't affect the program status, then we say that that interrupt is a precise interrupt. Why is precise interrpt is useful? For servicing I/O. For virtual memory. For supporting floating point numbers. For software debugging. For handling unimplemented instructins. How can precise interrupt be implemented? Precise interrupt can be implemented in reorder buffer, history buffer, and future file. Reorder buffer entries are recorded and committed in program order. On an interrupt, reorder buffer entry before the faulting instructions are committed and insts following that inst are squashed. History buffer is similar to reorder buffer except that history buffer keep old register values rather than current value. Current values are updated after the execution and old values are stored in the history buffer. On an interrupt, old values are recovered from history buffer. Future file keeps two sets of register file, future file and architectural file. Architectural file keeps the committed values whereas future file keeps the current value of registers. On an interrupt, architectural file is copied to future file. From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: October02老 2001斥 Tuesday 9:44 AM To: Jaein Jeong Subject: Paper Summary: RISC I and its retrospective ---------- Forwarded message ---------- Date: Fri, 06 Jul 2001 11:55:56 -0700 From: Jaein Jeong To: MARKGREGORY WHITNEY , Chris Savarese , Jaein Jeong Subject: Other RISC readings Hi guys, I summarize two papers: RISC I: A Reduced Instruction Set VLSI Computer A Retrospective on High-Level Language Computer Architecture In the paper "The Case for the Reduced Instruction Set Computer", Patterson and Ditzel pointed out that conventional computers (CISC) has some problems and proposed RISC as an alternative. In RISC I paper, they showed their ideas were right with simulations. To build a simple and effective single chip VLSI computer, they assumed the following constraints: An instruction is executed per cycle. All instructions are of the same size. Memory is accessed only by load and store instructions. High level languages are supported. To counter these constraints, they provided capabilities like: Large register files and overlapped register windows Prefetching and Delayed jump The most time consuming tasks in high level languages are CALL and RETURN, which needed memory access for passing parameters and return values. By using overlapped register window, RISC I could execute CALL and RETURN most of the time just with registers. The only exception was when register files were in overflow or underflow, which were less than 1% of the cases. RISC I also used prefetching (a kind of pipelining) to improve performance. This meant the processor should wait when executing a branch instruction. They made the processor execute the instruction right after a branch all the time. They were able to rearrange the program sequences so that the processor execute useful instructions after a branch in most cases (around 90%). In the statistics, the execution time of RISC I was shorter than that of VAX and the instruction length was at most two thirds more than that of VAX. Reduced memory access was very impressive. Even though RISC I accessed memory only through load and store, it reduced memory access by using registers. In procedure call intensive programs, the performance of RISC I was much better. They prospected RISC I will produce a good result when it is made in a single chip. In "A Retrospective on High-Level Language Computer Architecture" paper, they mention that some issues are appearing again: high level language support in machine level (Java Byte code), compact code size to reduce a big processor memory performance gap or to reduce total system cost. Once again They pointed out that what matters is the effect of combined hardware and software system, not an individual component implemented in hardware. They also mention that some of the possible attributes for High level language computer system was really realized in RISC computers: ISA will be optimized for compilers and compilers will be in important roles. Jaein From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: October05老 2001斥 Friday 9:51 AM To: Jaein Jeong Subject: Paper Summary: Symbolics Architecture (fwd) ---------- Forwarded message ---------- Date: Wed, 11 Jul 2001 16:33:17 -0700 From: Chris Savarese To: MARKGREGORY WHITNEY , Jaein Jeong Subject: Symbolics Architecture 3 conceptual layers of architecture: 1) system architecture: viewable to user/programmer 2) instruction architecture 3) processor architecture The basic philosophy is that each layer should be independent of the others, such that individual layers can be changed without affecting other layers. Key points regarding the the system architecture layer - "safety and convinience must not be compromised to increase performance" - Symbolics systems attempt to achieve both speed and safety by: - performing low-level checking in HW in parallel with computation and memory access - machine instructions are generic... programs need not know ahead of time what types of data they'll be acting on - function calling is very fast, while maintaining info needed for debugging - built-in substrate facilities - app-specific control of of virtual memory paging costs: - cost and complexity of system hw and sw are increased - performance optimization not always automatic. why lisp? time-sharing no longer economically beneficial, since processors getting cheaper. inter-user communication supported by LANs, so time-share systems no longer needed. symbolic system architecture allows for faster development of programs. why build a symbolic system architecture on unconventional lower-level architectures? conventional architectures are very different... no concept of parallel error-checking and generic instructions, often pay much more attention to complicated addressing, etc., then is needed for symbolic. hw technology of conventional machines will always be a few years ahead of symbolic hw. sw technology of symbolic machines will always be a few years ahead of conventional sw. Key points regarding the the system architecture layer - object references: highest level of data object abstraction. can refer to any type of data. all values of variables, arguments to/from functions, elements of lists are object references. (sound like pointers to me... and a language that uses nothing but pointers...) - object reference: 34-bit quantity, containing 32-bit data and 2-bit major tag. 32-bit chunk can then be broken down into 28-bit data and 4-bit minor tag - addresses are 28 bits, granularity is 1 word (object-oriented architecture doesn't need finer granularity) - headers, special markers (indicating currently empty memory location), and forwarding pointers (pointers to pointers?) used in storage representation of objects. Benefits of tagging every word in memory: - all data are self-describing - HW can process tag in parallel with data - generic insts that alter their operations according to the tags - automatic storage management is simple, since all data structures are the same, regardless of content - compact representations = less storage needed for data 17-bit instructions: 9-bit op field, 8-bit arg field no indexed/indirect addressing modes. function calls sped up by use of stack buffer to reduce cost of memory transactions, by optimizing stack for speed rather than space, by speeding up slower exception checks, and by using entry vector mechanism to simply run-time decision making. purpose of speeding up function calls: encourage programmers to write cleaner, less buggy, quicker-time-to-market code. principle cost of function-calling discipline: space occupied by header and extra copies of args is large. stack buffer large enough that this usually isn't a problem though. Key points regarding the the processor architecture layer - 3 models of 3600 architecture: 3640, 3675, 3620 - all have same instruction architectures, slight differences in implementation technologies and a few cost/performance trade-offs - 4-stage pipeline - 3640 uses 4-inst buffer - 3675 uses 2k-inst cache - 3620 uses 6-inst buffer with prefetch unit - datapath contains several units that function in parallel That's it. See you tonight. From: Jaein Jeong [jijeong@EECS.Berkeley.EDU] Sent: September21老 2001斥 Friday 8:58 AM To: jaein@cs.berkeley.edu Subject: Paper Summary: The Case for the Reduced Instruction Set Computer Reasons for increased complexity In the 60's and 70's the speed gap between processor and the memory was still big. By translating instructions into a sequence of micro-codes which can be executed in processor cycle, processor could execute larger number of operations than simply execute instructions without translation. Another reason of increased complexity was for upward compatibility. As a family of architectures evolves, more and more instruction sets were added, which made the instruction set complex. Before 80's compilers were considered not so smart in generating code, thus instructions sets were made to match a high level language. What is RISC? With the introduction of cache, instructions can be executed at the same rate with the processor, thus there was no benifit of translation. Researchers found that only a smaller number of instructions of the tradisional CISC architecures (20%) were used frequently. By reducing the number of instruction sets, they were able to put a processor in a single chip. Even though only a small number of instructions were used, compiler could generate optimized codes. What is an example of CISC and RISC? Computers before 1980 were all CISC like VAX and PDP. Motorola 68K and early generation of x86 processors are CISCs. MIPS (R10000, R12000 and etc), Alpha (21064, 21164 and etc), PowerPC , and Sparc processors are examples of RISC. Modern x86 processors (Pentium Pro, Pentium II, and Pentium III) are CISC strongly influenced by RISC. In the instructoin level they execute the same instructions with their ancestors, but they translate them into smaller RISC operations and use all the RISC techniques like pipelining, register renaming, and so on. Future of RISC and CISC. RISC proved to be the right way. This is shown by that modern CISC processors adopted many of the RISC techniques. For the applications where software compatibility is not so important like imbedded devices and servers, RISC processors dominate. Whereas in PC, CISC x86 processors take the most part of market share just because of code compatibility. From: Jaein Jeong [jijeong@EECS.Berkeley.EDU] Sent: September21老 2001斥 Friday 9:14 AM To: jaein@cs.berkeley.edu Subject: Paper Summary: Transistors Budgets Go Ballistic Transistor budgets are increasing with the development of technology. This made possible CISC processors use the techniques of RISC. For what else can we use the increased transisor budgets? Obviously, more space can be allotted to cache. But there is a limit. Since the cache should operated at the same cycle with the processor and the cache hit rate doesn't get better after some point, there is a limit for increasing cache size. We can also consider incorporating memory controller, for example North Bridge in x86 PCs. For low end computers, we can consider incorporating graphic controller and multimedia processors like sound card and MPEG decoder and encoder. Integrating DRAM is dubious, because DRAM is optimized low leakage and processor is optimized for fast transistor. Integration will sacrifice either low leakage or fast access. But battery operated devices can be benefitted because on chip memory access reduces power consumption. We can think about implementing multiprocessor or simultaneous multi- threading in a single chip. From: Chris Savarese [savarese@EECS.Berkeley.EDU] Sent: August08老 2001斥 Wednesday 1:16 PM To: Jaein Jeong; MARKGREGORY WHITNEY Subject: Performance Analysis of k-ary n-cube Interconnection Networks Interconnect quickly becomes the limiting factor in concurrent machines. This paper explores the tradeoffs between high and low dimension interconnect networks. Definitions: k-ary n-cube: n dimensions, k nodes in each dimension. N=k^n, where N is the number of nodes wire bisection: the number of wires crossing a cut that evenly divides the machine. unidirectional networks: one-way network traffice bidirectional networks: traffic goes both ways flit: the smallest piece of information that a queue or channel can accept or refuse throughput: the total number of messages the network can handle per unit time network capacity: the total number of messages that can be in the network at once. hot-spot: a pair of nodes that accounts for a disproportionately large portion of the total network traffic hot-spot throughput: the max rate at which messages can be sent from one specific node to another specific node. Throughout the paper, he assumes constant wire bisections for the networks he compares. The networks are all unidirectional. He claims that this does not affect his results much, but that in a real machine, bidirectional networks have the advantage of better exploiting locality of communication. Also, he emphasizes that it's the dimension of the network that's important, not the topology. He mentions 3 types of routing paradigms: store and forward, wormhole, and virtual cut-through. With store and forward, the entire message is buffered at each node, and then sent out to the next node. Wormhole routing is similar to pipelining in concept. Rather than waiting for the entire message to arrive at a node before beginning to pass the message on to the next node, the head of the message is passed on immediately. The trailing parts of the message are passed as they are received. The individual flits of the message contain no routing info though, so interleaving of messages is not allowed. Also, if the head is blocked, the entire message must wait. Virtual cut-through is similar to wormhole except that it buffers messages when they block, rather than leaving them in the network as wormhole does. He uses wormhole routing for everything in this paper. He points out that wire delay is an important factor, and he uses 3 different models for wire delay: constant, linear, and logarithmic, all with respect to wire length. Basically, lower-dimensional networks always win. First, realize that with constant wire bisection density, having a high-d network means having a lot of low bandwidth channels, whereas having a low-d net means having a few high bandwidth channels. The low-d case results in resource-sharing (wires are the resources here) and better queueing performance. Also, realize that in low-d networks, the latency is dominated by distance, whereas in high-d nets, the latency is dominated by message length. Thus, programs that exploit locality of communication will run well on low-d nets, but not show any improvement on high-d nets. So low-d nets have the potential for lower latency. Low-d nets get better queueing performance due to the higher bw of their channels. This results in lower service times for messages, which in turn yields less contention in the network. Essentially, because messages are serviced faster, there is a lower probability of collision, as well as a lower expected wait time in the event of a collision. Finally, hot-spot throughput of low-d nets is better than that of high-d nets simply because of the higher bandwidth between 2 nodes. These conclusions hold for all three wire delay models. In general, he found that latency is minimized when k and n are chosen so as to make D = L/W. D is the latency due to distance (distance measured in hops) L/W is the aspect ratio... L is the length of the message W is the channel width. See you guys at 8:30 tonight. From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: July24老 2001斥 Tuesday 6:13 PM To: MARKGREGORY WHITNEY; Chris Savarese; Jaein Jeong Subject: Pseudo-Randomly Interleaved Memory Hi, Most processors access memory in multiple banks for broad bandwidth. But the performance degrades, if the memory is not accessed evenly over the banks. Sequentially interleaved memory (SIM) works well when memory is accessed evenly with stride of 1. But only one bank is accessed with strides of multiple of M (M is the number of memory banks), which can cause memory stall. This paper introduces pseudo-randonly interleaved memory (PRIM) in the hope that each access is evenly distributed over the banks. The first PRIM discussed is Permutation using XOR function. In this scheme, a physical address A = is converted to B = using formula B = A * H. Where H is a chosen non-singular n by m matrix. We assume GF(2) for all arithmetic, that is numbers are either 0 or 1 and addition, subtraction, multiplication, and division is done in mod 2. The performance depends on a good selection of matrix H. Cydra 5 is an architecture with this scheme. Since permutation using the XOR function is difficult to analyze, the paper proposes irreducible polynomial inter- leaving. It proves several theorems to choose a polynomial P(x) such that B(x) = (A(x) div x^m) * x^m + (A(x) mod P(x)) where A(x) is a physical address and B(x) is a pseudo random address. (A(x) div x^m (high order bits )) becomes word address and (A(x) mod P(x)) becomes bank address. The theorems shows that we should pick P(x) such that P(x) and A(x) has great common divisor (GCD) = 1 and P(x) is odd (coefficient of x^0 = 1). After it picks a polynomial 19 (= x^4 + x + 1, where x = 2), which is called I-poly 19, it shows the experiment results. The two PRIM schemes were better than SIM and H matrix scheme did better than I-poly when we allowed more latency (queue, memory latency register), vice versa. While address translation from physical to pseudo random takes time, I think that pseudo randomly interleaved memory will produce better performance when the cache-memory gap is high and the number of memory banks is large. -------------- Jaein From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: September05老 2001斥 Wednesday 5:43 PM To: Jaein Jeong Subject: Question 1 1. Give an example of out-of-order processor. Describe how data hazards are solved. 2. Describe the path when a processor access virtual memory. What is the critical path? How can you improve this? 3. Discuss the use of increased transistor budgets for different markets. From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: September05老 2001斥 Wednesday 6:41 PM To: Jaein Jeong Subject: Question 2.1 1. Draw a diagram of in-order processor. What is simultaneous multi- threading(SMT)? How can we change the processor to handle SMT? Estimate the number of transistors increased based on the original in-order processor. Is it plausible? 2. Discuss the trends, today's spec of processor speed, memory speed, and disk speed. List the steps taken in cache, memory and disk in a cache read miss. Calculate time to read a cache line based on the estimated specs. 3. Describe the steps a cache line is read from memory module. Discuss the ways where bandwidth can be increased. From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: September05老 2001斥 Wednesday 5:51 PM To: Jaein Jeong Subject: Question 2 1. Draw a diagram of in-order processor. What is simultaneous multi- threading(SMT)? How can we change the processor to handle SMT? Estimate the number of transistors increased based on the original in-order processor. Is it plausible? 2. Discuss the trends, today's spec of processor speed, memory speed, and disk speed. What is the bottleneck? How can we improve this? 3. Describe the steps a cache line is read from memory module. Discuss the ways where bandwidth can be increased. From: Mark Whitney [whitney@EECS.Berkeley.EDU] Sent: July06老 2001斥 Friday 12:54 AM To: savarese@EECS.Berkeley.EDU; jaein@EECS.Berkeley.EDU; whitney@EECS.Berkeley.EDU Subject: RISC readings Hey guys, we should have a web page or something. That way we can consolidate all this stuff better and not have so many emails flying around. I will look into it... Quick summary of the ideas in: "The Case for the Reduced Instruction Set Computer" by Patterson and Ditzel and its rebuttal by Clark and Strecker Patterson and Ditzel go through the assumptions that have been made that rationalize the development of more "complex" instruction sets and conjecture that some of these assumptions may no longer be valid. They go on to say that the use of CISC architectures has adverse consequences like longer design and debugging time, and cases in which a group of simple instructions can perform the same task as a single complex instruction faster. They then give reasons why a RISC is better, including faster time-to-market, shorter critical paths and better area usage. The key ideas in this paper however are that since high-level languages are used for all programmin tasks these days, there is alot of flexibility in the compiler/instruction set boundary and that the tradeoff between complexity in the compiler and complexity in the instruction set does not effect programmers significantly and should be investigated more closely before deciding on an architecture. Clark and Strecker come back with arguments that RISC and CISC are not really opposing philosophies and that there is no clear metric of instruction set complexity. They also point out that reduction in complexity of the processor only adds more complexity to the compiler and/or operating system, and that showing the frequency of instructions executed is misleading and one one must really profile the amount of time doing instructions to make a useful comparison. Besides the point mentioned in paragraph 2, the other key idea to these two papers is the tradeoff between gains in performance through technology improvement and gains through architectural improvement. Simpler architectures will be able to better leverage the rapid improvements in process technology while more complex architectures should be able to provide a higher throughput of operations per cycle by providing more functionality per instruction. Mark Whitney From: jaein@EECS.Berkeley.EDU on behalf of Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: July16老 2001斥 Monday 4:51 PM To: MARKGREGORY WHITNEY; Chris Savarese; Jaein Jeong Subject: Simultaneous Multithreading paper summary Hi guys, This is a summary of "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. In simultaneous multithreading (SM), several independent threads issue instructions in to multiple functional units in a superscalar machine. SM can be constrained by the number of instructions a thread can issue. It ranges from the most complex one ,Full Simultaneous Issue machine in which a thread can issue as many instructions as in funtional units, to Single Issue per Thread, where a thread can issue only one instruction. First, the authors ran workload over fine grain multithreading and SM of different configurations. In Fine grain multithreading, there is only one thread and it issues as many instructions as possible in multiple functional units per cycle. Experiments showed that the speed-up by fine grain multithreading saturates after 4 threads with around 3 insts per cycle. Even though there was a contention between threads, the number of insts per cycle increased for each thread. Full simultaneous issue SM achieved speed up of over 6. The experiments also showed that SM doesn't need to be complex to produce the best results: four issue SM was almost equivalent to full simultaneous issue in speed up. They did similar experiments over SM and single-chip multiprocessing (MP). In MP, a functional unit is devoted a particular processor. Whereas, a functional unit can be allocated more flexibly in SM. The results of experiments showed that SM outperformed MP when they had the same number of functional units even though MP's performance was scalable. In a configuration, SM did as well as MP even though it had less than twice functional units than MP. They ran experiments to find the best cache configuration with four different configurations with overall cache size the same: (1) shared instruction and shared data cache (2) private instruction and shared data cache (3) shared instruction and private data cache (4) private instruction and private data cache (1) showed best throughput for small number of threads (1 to 6). For more than seven threads (2) produced the best inst/cycle throughput. This can be explained as follows: Since the instructions of each program are independent, the effect of contention can be reduced with private cache. Shared cache is favored for data cache because cache data can be accessed over different banks. To summarize, simultaneous multithreading gives high inst/cycle ratio while using resources economically. ------ Jaein From: Chris Savarese [savarese@EECS.Berkeley.EDU] Sent: July11老 2001斥 Wednesday 4:33 PM To: MARKGREGORY WHITNEY; Jaein Jeong Subject: Symbolics Architecture 3 conceptual layers of architecture: 1) system architecture: viewable to user/programmer 2) instruction architecture 3) processor architecture The basic philosophy is that each layer should be independent of the others, such that individual layers can be changed without affecting other layers. Key points regarding the the system architecture layer - "safety and convinience must not be compromised to increase performance" - Symbolics systems attempt to achieve both speed and safety by: - performing low-level checking in HW in parallel with computation and memory access - machine instructions are generic... programs need not know ahead of time what types of data they'll be acting on - function calling is very fast, while maintaining info needed for debugging - built-in substrate facilities - app-specific control of of virtual memory paging costs: - cost and complexity of system hw and sw are increased - performance optimization not always automatic. why lisp? time-sharing no longer economically beneficial, since processors getting cheaper. inter-user communication supported by LANs, so time-share systems no longer needed. symbolic system architecture allows for faster development of programs. why build a symbolic system architecture on unconventional lower-level architectures? conventional architectures are very different... no concept of parallel error-checking and generic instructions, often pay much more attention to complicated addressing, etc., then is needed for symbolic. hw technology of conventional machines will always be a few years ahead of symbolic hw. sw technology of symbolic machines will always be a few years ahead of conventional sw. Key points regarding the the system architecture layer - object references: highest level of data object abstraction. can refer to any type of data. all values of variables, arguments to/from functions, elements of lists are object references. (sound like pointers to me... and a language that uses nothing but pointers...) - object reference: 34-bit quantity, containing 32-bit data and 2-bit major tag. 32-bit chunk can then be broken down into 28-bit data and 4-bit minor tag - addresses are 28 bits, granularity is 1 word (object-oriented architecture doesn't need finer granularity) - headers, special markers (indicating currently empty memory location), and forwarding pointers (pointers to pointers?) used in storage representation of objects. Benefits of tagging every word in memory: - all data are self-describing - HW can process tag in parallel with data - generic insts that alter their operations according to the tags - automatic storage management is simple, since all data structures are the same, regardless of content - compact representations = less storage needed for data 17-bit instructions: 9-bit op field, 8-bit arg field no indexed/indirect addressing modes. function calls sped up by use of stack buffer to reduce cost of memory transactions, by optimizing stack for speed rather than space, by speeding up slower exception checks, and by using entry vector mechanism to simply run-time decision making. purpose of speeding up function calls: encourage programmers to write cleaner, less buggy, quicker-time-to-market code. principle cost of function-calling discipline: space occupied by header and extra copies of args is large. stack buffer large enough that this usually isn't a problem though. Key points regarding the the processor architecture layer - 3 models of 3600 architecture: 3640, 3675, 3620 - all have same instruction architectures, slight differences in implementation technologies and a few cost/performance trade-offs - 4-stage pipeline - 3640 uses 4-inst buffer - 3675 uses 2k-inst cache - 3620 uses 6-inst buffer with prefetch unit - datapath contains several units that function in parallel That's it. See you tonight. From: Chris Savarese [savarese@EECS.Berkeley.EDU] Sent: August08老 2001斥 Wednesday 12:02 AM To: jaein@eecs.berkeley.edu; whitney@eecs.berkeley.edu Subject: Tempest and Typhoon: User-Level Shared Memory I'm not sure I understand this paper, but here goes... First of all, Tempest is an interface specification, and Typhoon is a proposed hardware implementation for Tempest. Essentially, they take the details of a coherence mechanism in a share-memory multiprocessor and make them visible to the compiler/programmer. They motivate this by stating that system designers can't possibly anticipate the memory access patterns of software that will be run on the system, and so the details of the shared memory scheme should be brought up to the user level to allow for customization for each application. (This sounds absurd to me, but maybe I'm missing something... Sure, this could yield impressive performance gains, but what a huge burden on the compiler/programmer! This just seems like one step closer to using fully programmable architectures to me, while ignoring the trade-off with the software development effort. Hell, why not just make giant FPGAs and let the programmer/compiler construct customized data paths and all? Great performance, but what a bitch to use compared to a fixed architecture... Anyway...) Tempest offers the following 4 mechanisms: 1) Low-overhead messaging. They suggest building something off of active messages (the '92 culler paper). This allows fast communication, which is important for any multiprocessor. 2) Bulk data transfer. Transfer large chunks of data in parallel with computation, which is more efficient than using shared memory. 3) Virual memory management. Allows the compiler to manage the address space so as to eliminate unnecessary remote memory accesses. 4) Fine-grain access control. Memory blocks are tagged as ReadWrite, ReadOnly, or Invalid. These tags can be controlled from the user level. In other words, the coherence scheme is visible to the user. They present a new replication policy called Stache. It replicates data using part of each processing node's local memory. It's basically creates a 2nd or 3rd level fully-associative data cache. I think this only helps when the data being worked on would not normally fit in the cache, but would fit in local memory. Otherwise, it doesn't seem to be any different than a local cache with a typical invalidation coherence scheme. Typhoon isn't all that exciting. It's an architecture with support for the 4 mechanisms required by Tempest. They tested Tempest against a standard, all-hardware, directory based machine with 5 benchmarks. In this first experiment, they did not alter the program code at all, so they didn't take advantage of the user-level shared memory philosophy. Instead, the first set of results only looks at Stache. Some of the programs didn't use data sets large enough to make a difference, and so there was no improvement from Stache. Other programs, whose data sets were too large for the local cache, showed as much as a 25% performance improvement. They then adapted the code for one of the apps to take advantage of the user-level shared memory. This bought them an additional 35% performance increase. From: Chris Savarese [savarese@EECS.Berkeley.EDU] Sent: July17老 2001斥 Tuesday 12:01 AM To: Jaein Jeong; MARKGREGORY WHITNEY Subject: The CRAY-1 Computer System Hi all, Here's the summary for the CRAY paper. Be warned: I'm a bit hungry right now... Something to whet your appetites: - 138 MFLOPS sustainable, 250 MFLOPS burst - chaining vector machine (aka data forwarding) - small size (relative to computing power for 1977): 6.5' tall, 8.75' wide, circular base - 1M word memory (64-bits/word) - optimizing Fortran compiler The meat and gravy: - has been called "the world's most expensive love seat" due to its cylindrical shape - 270 degree arc leaves room for an operator to squeeze inside - Only 4 chip types inside machine, all ECL (register chips, memory chips, and jelly-bean logic chips) - main memory supports single-bit error correction and double-bit error detection (SECDED) - CRAY generates about as much heat per cubic inch as the 7600 (how much does the 7600 generate?) - requires massive cooling system based on Freon and aluminum/stainless steel cooling bars - cooling system keeps case temperature at or below 130 degrees F (54C) - 12 functional units, organized into 4 groups: address, scalar, vector, FP - ea. FU pipelined into single clk segments - all FU's can operate concurrently - no divide unit. uses reciprocal approx. instead - 8 24-bit address (A) registers: primarily used as address registers for memory references and as index registers, but also provide values for shift counts, loop control, channel i/o ops - 64 24-bit address-save (B) registers: aux. storage for A regs - 8 64-bit scalar (S) registers: principle data handling registers for all scalar ops - 64 64-bit scalar-save (T) registers: aux. storage for S regs - 8 64-word (4096-bit) vector (V) registers: vector ops - misc. support registers: vector length (VL) and vector mask (VM), program counter (P), base address (BA) and limit address (LA) exchange address (XA), flag (F), and mode (M) - LA allows system to detect out-of-bounds accesses to memory, allowing for memory protection operation - XA used for interrupts - F signals which type of interrupt: normal exit, error exit, i/o interrupt, uncorrected memory error, program range error, operand range error, FP overflow, real-time clk interrupt, console interrupt - M used to mask interrupts - All integer arithmetic is performed in 24-bit or 64-bit 2's complement form - FP represented in signed magnitude form - vector instructions issue at a rate of one instruction per clk period. - once issued, vector insts produce 1st result after FU delay. subsequent results issued at 1/clk - software available at time of writing: CRAY OS (COS), CRAY Fortran Compiler (CFT), CRAY Assembler Language (CAL) (GO BEARS!), and some utilities - CRAY requires a front-end computer for interface. can be attached to any of the 12 i/o channels - biggest problems while developing CRAY: cooling and designing circuits with completely balanced dynamic loads - cooling: aluminum castings are porous, causing Freon to leak - circuits: unbalanced signals can cause standing waves on the ground plane (EMI?) - solution 1: make all signal paths the same length - solution 2: use only simple gates and make sure that both sides of every gate are terminated. - result: only a purely resistive load to the power supply The pizza next-door: - CRAY-1 classified as a second generation vector processor - CDC STAR 100A and TI ASC are 1st gen vector processors - CRAY is better because: - elements of vectors not required to be in consecutive addresses - can handle short vectors (2-4 elements) - very flexible when accessing vectors: user simply specifies start address and increment. only restriction is that increment must be constant And for dessert: a chocolate fudge covered sunday with almonds, walnuts, chocolate sprinkles, whipped cream, and a cherry: - at time of writing (1978), 3 CRAYs had been shipped, with contracts for 3 more - anticipate one CRAY-1 per quarter That's it. I'm going to go get some food now... See ya' Chris From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: August08老 2001斥 Wednesday 1:52 AM To: Chris Savarese; MARKGREGORY WHITNEY; Jaein Jeong Subject: The DASH Prototype: Implementation and Performance Hi, DASH is a cache coherent directory based multiprocessor developed in Stanford University. DASH prototype consists of four clusters, and each cluster has four MIPS R3000 processors, I/O interface, memory, and directory & intercluster interface. These units are connected on a bus and cache coherency between these four processors are maintained through snoopy bus protocol. Clusters are connected in mesh shape to make the entire system. Although the prototype has only four clusters, the system can grow with more clusters. Cache coherency outside a cluster is maintained through directory based protocol. To hide memory latency, DASH supports release consistency. The experiments showed that DASH is scalable: At least 10 times upto 14 times speed-up for most applications ----- Jaein From: Mark Whitney [whitney@EECS.Berkeley.EDU] Sent: August05老 2001斥 Sunday 11:55 PM To: jaein@EECS.Berkeley.EDU; savarese@EECS.Berkeley.EDU; whitney@EECS.Berkeley.EDU Subject: The J-Machine Multicomputer: An Archtectural Evaluation "The J-Machine Multicomputer: An Archtectural Evaluation" by Noakes, Wallach, and Dally: This paper is all about a really big multiprocessor which offers communication and synchronization through explicitly programmed message passing. Therefore for once, this a paper that is not about coherence and all that junk. It is about fast message passing and handling, and hardware mechanisms to assist in synchronization. Each node in the system consists of an integer processor, MMU, router and NI, SRAM, and DRAM (1M). Process data is distributed among all the nodes and a global address space is provided somehow by explicit instructions to insert virtual-physical address pairs into a hardware translation table (no notes on where this table is or how many there are!) Sychronization is provided by good old barrier style routines and by providing tags for every memory block and register to specify whether data is present in that data unit. If a processor tries to read a value from a non-present data unit, the processor is notified that the data is not available yet. This feature is good for producer-consumer based synchronization. Not-present data is tagged explicitly through a future instruction supported by the ISA. This network is a e-cube routed 3-D mesh network, (3-cube network has a low latency given under constant bisection bandwidth, see Dally's "Performance of k-ary n-cube Interconnection Nets".) The processor also supports super fast switching between threads and fast message sending and receiving capabilities to support fine-grained communication. The focus of their microbenchmarks are on total latency and throughput for message send and receive events, which is good due to their fast message handling. They also provide programming through their special version of C, offering direct access to the individual message handling and synchronization instructions. The system also supports Concurrent Smalltalk (good god!). From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: August08老 2001斥 Wednesday 10:04 AM To: Chris Savarese; MARKGREGORY WHITNEY; Jaein Jeong Subject: The MIT Alewife Machine: Architecture and Performance Hi, Alewife is a cache coherent multiprocessor developed in MIT. Coupled with fine grain computation and latency tolerance, integration of shared memory and message passing makes Alewife easily programmable and scalable. Alewife consists of 2-dimensional mesh network of nodes. In each node, there are a Sparcle processor (SPARC compatible processor) , cache, memory (shared memory, directory, private memory), floating point unit, and network router. This paper is based on 32-node Alwife machine. Cache coherency of shared memory is maintained through a software extended scheme - LimitLESS, which is a directory based protocol. Message passing is done in two steps: describe and then launch. messages are described by output descriptor array by 'stio' instruction. After that those descriptions are launched by 'ipilauch' instruction. At the destination, an interrupt occurs. An interrupt handler loads data directory from array using 'ldio' instruction or initiates DMA sequence to store that data word to memory. Shared memory runs fast, but message passing permits broader bandwidth. For fine grain synchronization, full/empty bit is associated with each 32-data bit word in hardware level or J-structure and L-structure with full/empty bits are used for vector elements in software level. Synchronizing data using these full/empty bits reduces unnecessary dependencies in coarse grain synchronization technique as barriers. For latency tolerance Alwife employs block multithreading and non-binding software prefetching. Block multithreading means generating context switch when there occurs a remote cache miss, thus hides latency. Experiments showed that Alwife is scalable both in shared memory applications and message passing applications. Achieving at least 10 speed up for most apps in shared memory. ------- Jaein From: jaein@EECS.Berkeley.EDU on behalf of Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: July13老 2001斥 Friday 10:36 AM To: MARKGREGORY WHITNEY; Chris Savarese; Jaein Jeong Subject: Tomasulo's algorithm paper Hi, This is a summary of "An Efficient Algorithm for Exploiting Multiple Arithmetic Units" by R.M.Tomasulo. Even though the performance of IBM360 was improved performance by pipelining instructions, there was another possibility for improvement - executing instructions in multiple units at the same time. It is natural to execute a fixed point instruction and a floating point instruction at the same time because they require different units and registers (called accumulators in this paper). But what if we execute multiple floating point instructions at the same time? Each instruction depends on another for its registers and we should make sure the precedence is preserved while executing instructions in parallel. First, Tomasulo showed that by associating a busy bit for each register we can execute an instruction and another depending on it in correct order. Since all the depending instructions are waiting for the sink register (a register where a result is saved), the register becomes a bottleneck. How can we overlap instructions while preserving instruction dependence? We need extra functional units, which is natural because we cannot execute two instructions without having at least two functional units. He suggested using reservation stations and common data bus (CDB). In this paper, he assumed we have three adders and two multiplier/dividers. Upon issuing an instruction, we copy the content of a floating point register and execute it immediately if none of source and sink registers are busy. But if any of its source or sink registers are busy, then we copy the instruction in reservation stations denoting the dependent register as a tag from which a result comes. Here the tag means one of six floating point buffers (bufferfor load) 1 through 6. Or the result of three adders, 10 through 12. Or the result of two multiplier/dividers, 8 and 9. At each cycle reservation stations, floating point register files, and store data bus check whether any data is on CDB. If a data is on CDB and the tag of of the data matches that in reservation stations or floating registers, they update the content by ingating it. When we issue an instruction we set the tag of the sink register as the unit number where the instruction will be executed. When the instruction is completed, the floating point register is updated with the result of the specified unit. However, if we issue another instruction which has the same register as its sink, then we set the tag of the register as the unit number of the second instruction is going to be executed. This doesn't affect any instructions who are waiting for the former instruction because they are waiting for the result of the unit where the first instruction is executed, not for the register itself. By renaming a register number with the unit number, this algorithm makes it possible to overlap instructions. ----- Jaein From: Chris Savarese [savarese@EECS.Berkeley.EDU] Sent: July25老 2001斥 Wednesday 12:57 AM To: Jaein Jeong; MARKGREGORY WHITNEY Subject: Trace Processors So it's a different Smith... Yet another great paper. Traces are built dynamically in hw as the program executes and are stored in a trace cache. This paper establishes early on that the unit of control prediction should be a trace, not an individual instruction. In fact, "issue" in this paper usually refers to a trace, not an instruction. They also emphasize and exploit the fact that register values can now be split into 3 distinct groups: live-ins, live-outs, and locals. Live-x values are values that enter or leave a trace, respectively, whereas local values exist only within the trace in question. This leads to 2 separate register files: a local register file and a global register file. Their processor is capable of issuing several traces at a time, with each trace running on a separate processing element. Each processing element is essentially a small-scale superscalar processor, connected via the global register file. A trace is uniquely ID'd by the addresses of all of its intructions. These trace IDs are used to sequence through a program. The trace predictor outputs a primary and secondary trace prediction. The primary is checked for a hit in the trace cache, and also to see if the prediction is correct. If so, execution proceeds. If the prediction is correct but there is a miss in the trace cache, a slow-path sequencer builds the needed trace using the trace ID directly. If the precition is incorrect, as verified by checking against actual branch outcomes, then a new trace must be built using the conventional branch predictor and branch outcomes. In this research, they used 2 trace selection criterion when building traces: 1) stop at a max of 16 instructions per trace, and 2) stop at any call indirect, jump indirect, or return instructions. There are, of course, many other possible criterion... Traces are preprocessed before being stored in the trace cache. Register values are parced into global and local, with the locals undergoing pre-naming. This is basically static register renaming performed dynamically before each trace issues. They also mention that preprocessing could include instruction scheduling (i.e. small-scale VLIW performed dynamically at each trace), but that they didn't implement this. There is an obvious trade-off between increased performance gained from longer traces compared to increased trace misprediction rates due to longer traces. They never presented any arguments justifying their choice of 16 as their max trace length, but they did model this trade-off by simulating two sets of criterion. The more lax set yielded average trace lengths of 14.8 and 13.9 for go and gcc. The stricter set yielded lengths of 11.8 and 10.9. Misprediction rates for go changed from 34% to 14%, respectively. The trace predictor is built in a way similar to that of branch predictors; it's a 2-level history scheme. The first level is the path history, filled with hashed representations of the last few traces. This history is used to form an index into a prediction table with 2^16 entries. Each entry contains the predicted trace ID, an alternate trace ID, and a 2-bit saturating counter used to guide replacement. In addition, there is a second, smaller predictor that keeps only the previous trace ID. This is used to reduce the impact of cold starts and aliasing. Their machine uses value prediction for live-ins only. This predictor is also 2 levels. The first level is indexed by a unique prediction ID, derived from the trace ID. Each entry is a hashed version of the previous 4 data values of the item being predicted. This is then used to look up a 32-bit data prediction in the 2nd level table. This scheme does replacement with a 3-bit saturating counter. Instructions issue with predicted live-in values only if they have a high level confidence, as determined by the predictor. At trace dispatch, the trace ID (i.e. the instructions making up the trace) is decoded, local values are pre-named, global values are renamed, live-ins are predicted, and a section of the reorder buffer is reserved for the trace. Exceptions are precise. Note that exceptions can occur from previous traces. i.e. an exception condition discovered now could have been generated before the current trace was dispatched. Thus, in addition to reorder buffer information for each trace, a snapshot of the register rename map is taken at trace boundaries. When an exception occurs, the snapshots are used to back up to the excepting trace, and then the reorder buffer for that trace is used to back up to the excepting instruction. This, of course, raises the issue of when snapshots can be discarded... Traces must be retired in order to provide for precise interrupts. This implies that the number of outstanding traces is limited by the size of the reorder buffer. PEs are freed when traces retire. Although they don't explicitly state it, this seems like the obvious place to also discard the snapshots. I believe the individual PEs are 1-, 2-, or 4-way issue SSs. They support bypassing to alleviate the pressure on their results buses. Now we come to the value prediction stuff. First, realize that they've built a modular system in the sense that all instructions seem to have memory. More specifically, instruction buffers and store buffers monitor comparator outputs corresponding to live-in predictions they used. If they spot a misprediction, they automatically reissue the first instruction(s) that used the incorrect live-in values. Since mispredictions are only spotted once the true live-in values appear on the global results bus, the only misprediction penalty seen is the validation latency. In the case of an instruction reissuing, all subsequent instructions that used its results will then automatically reissue. In other words, this trace processor supports selective reissue of dependent instructions automatically based on the verification of live-ins. When a trace is dispatched, all of its load/store operations are assigned sequence numbers, defining the sequential order of all memory operations in the trace. These sequence numbers are used to control and fix all memory access hazards. Stores are ordered according to sequence number in a store buffer. Loads snoop all store traffic, reloading when necessary. In the case of a control misprediction (i.e. off-trace branch), three things must happen: 1) instructions from the incorrect path must be replaced with instructions from the correct path. 2) register dependencies must be checked (not all instructions need to be reissued). 3) stores on the wrong path must undo their changes to memory. They simulated the trace processor techniques without value prediction. After a lot of bullshit about comparing apples and oranges, I think they concluded that a fair comparison yielded a 58% improvement when going from 1 PE to 4 (i.e. 1 PE = basic superscalar and 4 PEs = trace processor). They did a pretty good job of clouding this discussion though, so I'm not sure if that's really what they're saying. They also simulated the trace processor with value prediction using standard benchmark apps. These simulations compared the base performance of their trace processor without value prediction to the performance of their trace processor with value prediction. They ran several simulations with varying levels of perfect information used as experimental controls to isolate the performance of various aspects of the data prediction scheme. In the end, with real values everywhere (no perfect stuff), it looks like they got about a 6% improvement over all the benchmarks. From: jaein@EECS.Berkeley.EDU on behalf of Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: July15老 2001斥 Sunday 4:19 PM To: MARKGREGORY WHITNEY; Chris Savarese; Jaein Jeong Subject: VLIW paper summary Hi guys, This is a summary of "Very Long Instruction Word Architectures and the ELI-512" by Joseph A. Fisher Conventional parallel computers like CDC 6600, CRAY-1, IBM 360/91, showed only a factor of 2 or 3 of speed up from parallelism. For example, a vector computer like CRAY-1 was very difficult to program for parallelism and it speeds up only inner loops. As an alternative, Fisher proposes a VLIW architectures that can achieve 10 to 30 times speed up with regard to a sequential machine. VLIW has following properties: A central control unit issuing a single long instruction per cycle. Each long instruction consisting of may tightly coupled independent operations. Each operation statically scheduled by a compiler. He tried to accomplish high parallelism by Trace Scheduling Memory Anti-aliasing (N+1)-way independent jumps Memory bank prediction. In trace scheduling, a VLIW compiler takes a flow of code that has no back edges (loop-free code). The compiler picks the code that has the highest probability of execution. Then, it inserts some codes that connect the flow of code and the rest. After that, the compiler takes another flow of code that includes the house keeping code mentioned above as well as the rest of code. It repeats this process until it schedules all the code. When an instruction depends on another instruction for its operand, then they cannot be scheduled at the same time. In case the operand is a memory location such as array variable, the two instruction may or may not be scheduled at the same time depending on index variable. The VLIW compiler, Bulldog, could solve memory anti-aliasing by analisys for array variable, which takes most of indirect memory access in a scientific program. The VLIW compiler allows (n+1)-way jumps in an instruction by doing n tests. Another factor should be considered when a VLIW compiler schedule multiple operations: avoiding memory bank collision. Instructions can access memory at the same time only when they access memory locations in different banks. It predicts memory bank locations for scalar variables very well. Most of the time, array variables are used in a loop, the compiler unroll the loop so that index variables are incremented by a multiple of number of memory banks. Thus it avoids bank collisions. For unpredictable memory references, it gives precedence over the predictable one. When the compiler can't tell the subscript of an array, it adds preloop before the unrolled loops so that starting point of the loop becomes some known values. This paper showed how VLIW architecture can be made for high parallelism, but lacks the experimental data that support his idea. ----- Jaein From: Jaein Jeong [jaein@EECS.Berkeley.EDU] Sent: August27老 2001斥 Monday 2:19 PM To: Chris Savarese; MARKGREGORY WHITNEY; Jaein Jeong Subject: Vector Processor Summary Vector processors provide high-level operations that work on vectors. A vector instruction is equivalent to executing an entire loop. The benefits of vector processors are: - Since a vector instruction can replace an executing of a loop, instruction bandwidth is saved. - Control hazards are removed by changing a loop into a vector inst. - Due to fast memory access time and a large number of independent memory banks, vector processors are good for heavy usage of main memory. The challenge of vector processors are on vectorizing existing scalar operations into a vector mode. Depending on the applications, the vectorization rate varies from 1% to 100% (Cray X-MP). Techniques to improve vector processors: Chaining: Treat a vector operation as a collection of operations on elements of a vector and forward results to other vector operations which need the results. Vector Mask: When elements of a vector is compared with one scalar value, the vector processor can be set its mask so that it can comparison cuncurrently. The superiority of vector processors have been diminishing: Microprocessors can utilize better technology. This is evidenced by the fact that many vector processor manufactures now building supercomputers out of a collection of microprocessors. However, a vector computer has still its use when high peak performance and extensive memory use is needed. ----- Jaein From: Mark Whitney [whitney@EECS.Berkeley.EDU] Sent: August05老 2001斥 Sunday 11:52 PM To: whitney@EECS.Berkeley.EDU; jaein@EECS.Berkeley.EDU; savarese@EECS.Berkeley.EDU Subject: Weak Ordering-A New Definition grrr....html mail is evil. "Weak Ordering-A New Definition" by Sarita Adve and Mark Hill: This paper talks about different memory consistency models in a multiprocessor environment. It begins by reviewing sequential consistency, the intuitive but most restrictive model. This model can be relaxed to what is called weak ordering, the definition given as according to Dubois. The authors claim that Dubois conditions for weak ordering are unnecessarily restrictive when trying to meet the "intuitive goals" of weak ordering. They define weak ordering as a hardware/software interface and introduce the use of a synchronization model which specifies how and when synchonization needs to be done in relation to memory accesses: "Hardware is weakly ordered with respect to a syncronization model if and only if it appears sequentially consistent to all software that obey the synchronization model." This definition does not appear to be terribly constructive but they go on to give an example synchronization model, DRF0. This model requires that synch operations are hardware recognizable and only operate on a single memory address at a time. Additionally, a partial ordering is imposed on all memory operations which orders conflicting operations by the irreflexive transitive closure of program ordering for instructions on the same processor and synchronization ordering for ops on different processors accessing the same location. They then go on to give requirements that must be satisfied in order for hardware to be weakly ordered. They also prove a bunch of stuff. From: Mark Whitney [whitney@EECS.Berkeley.EDU] Sent: August27老 2001斥 Monday 5:19 PM To: savarese@EECS.Berkeley.EDU; jaein@EECS.Berkeley.EDU; whitney@EECS.Berkeley.EDU Subject: chapter 8 - consistency consistency - when must a processor see a value that was updated by another processor? What properties must be enforced among reads and writes to different locations by different processors? Sequential consistency -requires result to be same as if accesses on the same processor are in program order and accesses on btw different processors are interleaved -implemented by delaying completion of access until invalidations from access are completed synchronization -a program is synchronized if all access to shared data is ordered by synchronization ops -accesses are ordered by sync ops if a write to a variable by a processor and an access of that variable by another processor are separated by a pair of sync ops, one performed by the writing processor after the write and one performed by accessing processor before access -prevents data races -program is synchronized if every execution containing a write then access sequence contains the events write(x) . . release(s) . . acquire(s) . . access(x) Memory ordering restrictions -write fence - all writes by P that occur in program order before a write fence have completed...no writes that occur in program order after the write fence are initiated before the write fence -read fence - same, except replace write with read. -in sequential consistency, all reads are read fences and all writes are write fences -sequential consistency is equivalent to a single memory serializing all accesses -weaker models effectively define fewer fences -sync ops act as the fences instead of ordinary acceses -define models by orderings among reads & writes, ordinary & sync on a single processor -if there is a dependence btw 2 ops, they must be ordered, memory consistency models deal w/ ordering of non-dependent ops Consistency models Notes about some models: processor consistency: allows write buffering so reads can bypass, hiding write miss latency partial store order: allows overlapping/pipelining of writes, writes only cause a stall when we hit a write fence weak ordering: allows overlapping of reads too, processor must have non-blocking reads to take advantage Model name ordinary orderings preserved sync orderings preserved seq. consistency R->R R->W W->R W->W S->W S->R R->S W->S S->S processor R->R R->W W->W " " consistency (total store order) partial store R->R R->W " " order weak ordering " " release consistency S_a->W S_a->R R->S_r W->S_r S_a->S_a S_a->S_r S_r->S_a S_r->S_r From: Mark Whitney [whitney@EECS.Berkeley.EDU] Sent: July24老 2001斥 Tuesday 4:41 PM To: Chris Savarese; Jaein Jeong; whitney@EECS.Berkeley.EDU Subject: dynamic instruction reuse "Dynamic Instruction Reuse" by Avinash Sodani and Gurindar Sohi: This paper attempts to reduce the total number of instructions executed by dynamically caching the results of instructions and reusing these cached results if the same instruction is later executed with the same operands. This paper explores the design space of hardware structures to do such a thing. Instructions that execute are put into a reuse buffer, indexed by the program counter of that instruction. When the next instruction is about to be fetched, the PC is looked up in the reuse buffer and the instruction entries returned are checked for conditions to determine in any instances of the instruction in the buffer can be reused (note that multiple instances of instructions can be present in the reuse buffer in some cases). In the paper, 3 strategies are explored for determining if the operands are valid for a buffered instruction. For the first one, when an instruction execution is buffered, the values of the operands are stored in the entry as well. When the same PC later retrieves the buffer entry(ies), the current operand values are compared to the previously stored operand values, if any buffer entry has the same operand values, the result is used. For the second one, buffer entry validity is determined solely by the names of the operand registers. When an instruction writes to a register, all the entries in the buffer that had that register as an operand are invalidated. For the third scheme, validity is based on operand register names along with other collected dependency information. To track the dependency information, another table is added, the Register Source Table, which has an entry for each archtectural register and each entry points to the slot in the reuse buffer with the latest result for that register. See the paper for more details. This scheme reduces the number of entry invalidations compared to the scheme which determines entry validity through only register names. The results presented show that in some cases, 76% of the dynamic instructions did not need to be executed because they could reuse previous calculations. From this, a more modest maximum speedup of 40% is derived, due to the fact that not all the time used to execute a program is actually spent performing computation. They also show that a good number of speculatively executed and later squashed instructions can be reused using this mechanism as long as the reuse buffer can be updated speculatively. A nice compliment to this paper is "Understanding the Differences Between Value Prediction and Instruction Reuse" by Sodani and Sohi. It is a pretty quick read too after just reading this paper. From: Mark Whitney [whitney@EECS.Berkeley.EDU] Sent: August31老 2001斥 Friday 1:54 PM To: jaein@EECS.Berkeley.EDU; savarese@EECS.Berkeley.EDU Subject: processor reviews Hi, here it is mark From: Mark Whitney [whitney@EECS.Berkeley.EDU] Sent: August27老 2001斥 Monday 1:56 PM To: savarese@EECS.Berkeley.EDU; jaein@EECS.Berkeley.EDU; whitney@EECS.Berkeley.EDU Subject: tcp hey guys, here is the TCP stuff...if you want the beautiful drawings of connection examples I can copy them or something...I don't have time to ascii-art or xfig them I'll see you tonight at 6. mark TCP/IP Review IP -unreliable - something bad happens to a message, it gets thrown out, also, no checksum over data -connectionless - no state info kept about message -every message contains a header which contains source and destination IP addresses -IP also performs routing, messages are routed on a hop by hop basis. When a message arrives at a router, the router looks up the destination IP addr in a table which maps next-hop routers to destination IP addrs or portions of an IP which specify the destination network. TCP -reliable - buffers messages until an ack(nowledgement) is received, performs checksum over all data in a message -connection-oriented - ensures application receives messages in order, w/o duplicates, also performs flow control -stream-based - stream of app data is split into chunks (packets) to be sent separately TCP packet is encapsulated in an IP message TCP header includes a checksum over entire TCP packet Header also contains source and destination port numbers to specify a socket pair...this allows a machine to support multiple connections by using different ports for each connection. Each byte in data transmission is uniquely identified by a 32 bit sequence # Header contains the seq. # of the 1st byte in that packet to aid in ordering the packets later. TCP is a full duplex protocol so data can be xfered in both directions Header has an ack # so a packet of data can ack the recept of data from the other direction Whenever a receiver sends an acknoledgement, the ack # sent is the ack # for the last _in-order_ byte received There is no negative ack in TCP, that is, a receiver cannot send a message to the sender indicating it did not receive some byte in the stream. Unacknowleged data is eventually retransmitted by the sender, the time to retransmit is usually increased exponentially with each unacknowleged send. This means that unacknowledged data bytes must be buffered by the sender so that they can be retransmitted if necessary. Flow control is performed through the advertisment of "window size". The TCP header has a field to specify the number of unacknowledged bytes that can be outstanding at any point in time for that side of the connection. Thus, a receiver "advertises its window size" in the packets it sends to the sender, the sender can then only send up to the number of bytes given by the window size before it must wait for acknowledgements from the receiver. From: Jaein Jeong [jijeong@uclink.berkeley.edu] Sent: August24일 2001년 Friday 11:55 PM To: Savarese, Chris; Whitney, Mark; Jeong, Jaein Subject: On restartable interrupt... Hi, If all the states in the pipeline are saved, then the answer to my question becomes clear. Suppose we have an instruction sequence like this: ADD R1, R2, R3 if R2 = 0 then The compiler changes that code into: if R2 = 0 then ADD R1, R2, R3 Here we assume that delay slot is one inst. Suppse an interrupt happens on the ADD. If we saved only one PC - PC for ADD then, the processor will execute instructions sequentially from ADD, which is not correct when "if" is taken. So we need to save two PCs when we have one delay slot. Likewise, when we allow n delay slots, we need to save n + 1 PCs because a delay slot instruction can be n-inst after the branch. We need to make sure that branch is executed even though branch is before that inst in location. Ask for comment if something is wrong. Thanks, Jaein