CS 252 - Spring 1998 Projects

The IRAM project has several topics that need investigation. As we are going to start building a 250,000,000-transistor microprocessor this summer, likely the most transistors in any MPU when it is finished, such projects would be very timely contributions to what may prove to be a very influential effort.

  • P1: Investigating algorithms, circuits, and floorplan for the VIRAM-1 vector unit.

    While many of the components of the VIRAM design have been investigated, the actual vector unit design has not. The purpose of this project is to investigate options and make recommendations. In all cases we are interested in area, power, speed tradeoffs.

    There are several ways to attack this problem, so multiple groups could easily work on it.

  • Survey existing FPU implementations and guess based on that information
  • Design of datapath with someone else's low-level components. Estimate based on circuit type and number needed
  • The multiplier is a large and variable design by itself, so concentrate on it. Survey existing multipliers, and look at new designs. Possible help from Brian Kingsbury and George Taylor.
  • Look at the T0 fixed-point vector unit layout and apply necessary changes. (T0 is available in Magic process, and was designed at Berkeley.)
  • P2: Algorithms/Benchmarks on VIRAM

    Since we are finalizing the instruction set architecture (ISA) for VIRAM-1 this semester, now is the time to investigate algorithms to see if we should make changes to the ISA. There are really several different projects, so several groups can work in this area. We have simulators and other tools to help collect statistics.

  • Port the NESL Language to VIRAM-1.

    Guy Blelloch's NESL language works very well on vector machines, and he has several programs with irregular parallelism (sort, connected components, sparse matrix multiply, ...) in this language. Guy Blelloch is on sabbatical at Berkeley this semester and can advise on the port. He says there are about a dozen macros that need to be ported, and then the whole library and applications should work. Such code could really exercise VIRAM features.

  • Port the BDTI Benchmarks to VIRAM-1.

    We are hoping to investigate whether a vector architecture with special DSP instructions is a good match to DSP applications. VIRAM-1 has both vectors and special DSP instructions. The BDTI benchmarks contain 11 short DSP kernels (4 FIRs, IIR, vector dot product, vector add, vector maximum, convolution, FSM, and FFT) that have been ported to practically every DSP processor plus a few general purpose processors. This project would have you code the kernels in assembly language and then compare the results to the rest of the machines in their benchmark suite. (We will soon have a result for FFT, which is the hardest kernel.) Again the goal is to gain insight into how well it works, and how VIRAM-1 can be improved. We have a copy of the book that includes all the latest results, including code size, clocks per kernel, instructions per kernel, and cost-performance. Jeff Bier of Berkeley Design Technology, Inc., located at the corner of Shattuck and Dwight Way in Berkeley, is willing to help answer questions.

  • Code other algorithms

    This suggestion is not as tightly formed. Examples include hidden Markov models for speech recognition, Viterbi decoding as an alternative to HMMs (possible help from Eric Fosler), lossless compression algorithms, MPEG2 encoding, GSM cellphone algorithms, and encryption algorithms.

  • P3: TLB design for the VIRAM-1 vector unit.

    Since one design requires 4 addresses per cycle to be translated by the TLB, this project looks at how best to do that. Start by looking at existing high-bandwidth TLBs, and weigh the tradeoffs of a multiported versus an interleaved design. Look at the paper by Todd Austin, called "High Bandwidth TLB for Superscalar".
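    To get a feel for the interleaved option, here is a toy Python model (my sketch, not any committed VIRAM-1 design): the four addresses of a cycle are steered to banks by low-order page-number bits, and addresses that collide on a bank serialize into extra cycles.

```python
# Toy model of an interleaved TLB: each bank services one lookup per
# cycle, so the worst-loaded bank sets the pace for the whole group.
from collections import Counter

def cycles_for_group(addrs, num_banks=4, page_bits=12):
    """Cycles needed to translate one group of 4 addresses."""
    banks = Counter((a >> page_bits) % num_banks for a in addrs)
    return max(banks.values())

# Unit-stride vectors spread nicely across banks; a stride of
# num_banks * page_size hammers a single bank.
unit = [i * 4096 for i in range(4)]          # pages 0,1,2,3
conflict = [i * 4 * 4096 for i in range(4)]  # all land in bank 0
print(cycles_for_group(unit), cycles_for_group(conflict))  # 1 4
```

    A multiported TLB would make every group take one cycle at higher area and power cost, which is exactly the tradeoff to quantify.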

  • P4: Characterize Architecture Metrics for Multiple Commercial Databases

    About 40% of server sales are for database applications, yet little has been published comparing multiple databases on a single SMP. The idea is to use the built-in hardware performance tools of either the 4-way SPARC SMP or the 4-way Intel Pentium II SMP to recreate Kim Keeton's study across several commercial databases. The question is whether architectural support varies by database. Kim Keeton (kkeeton@cs) would be willing to help with this study.

  • P5: Explore energy implications of reconfigurable implementation of compute kernels.

    This project is related to the Berkeley Reconfigurable Architectures, Systems and Software (BRASS) Research Project.

    Per raw bit operation, the potential energy costs of a low-voltage FPGA and a low-power DSP or microprocessor are very similar. Once correlation between bits in a datapath is taken into account, the energy may vary considerably, perhaps by an order of magnitude.

    In particular, a spatial (non-multiplexed) implementation on an FPGA will have a low activation rate when data is highly correlated. The heavy multiplexing and interleaving of operations on the processor will tend to destroy the natural correlation in the data, yielding a higher activity rate.

    For some common kernels, (maybe start with filters, transforms common in signal/video processing) collect the data activity and estimate the actual energy consumed on a processor and an FPGA implementation. The goal would be to understand the source of potential benefits for the reconfigurable architecture and quantify typical effects.

    Andre' DeHon (amd@cs) would give advice on this project. It would likely involve:

  • Find code for simple benchmark kernels;
  • build netlist for FPGA implementation;
  • Get an energy model for each (DeHon believes there are several in the literature and around Berkeley so it's simply a matter of picking and elaborating);
  • Instrument appropriate level of simulations for processors and FPGA to collect bit toggle rates (Again, there are probably several things around to start with...just need to be tailored a bit to this task);
  • Run sample data and collect activity stats;
  • Use energy model to estimate energy consumed;
  • Reflect on results, identify sources of benefits (and costs);
  • P6: Very Large Scale Backup.

    Edward K. Lee (eklee@pa.dec.com) of DEC SRC Research labs would be willing to give advice on this project. He thinks, for example, that designing an efficient backup system where users can get useful work done after only a small part of the backup has been restored is an interesting problem.

    Automatic, reliable backup for large scale storage systems should be done at the device rather than file level. This ensures that all data can be backed up uniformly and regularly without regard to the type of file system or database system the data represents. This also means that if you have very large files, you can back up just the changed portions of the file.

    Some requirements:

  • For large systems, the backup system should be designed such that one never has to do a full backup. Only incrementals should be necessary.
  • Backup should be "consistent" and online.
  • The backup system must be able to efficiently restore individual files (accidental deletes) and entire volumes (catastrophic loss).
  • In the case of catastrophic loss, it is desirable that users can resume work after only a small fraction of the total data has been restored. That is, users shouldn't have to wait for the entire restore (say a Petabyte) to finish, just the stuff they most frequently use (say a Terabyte). This implies that the backup system can organize and restore the more frequently accessed data first in an efficient manner. This is very hard to do using high-latency sequential access devices such as tapes.
  • The backup repository should share as few failure modes with the primary storage system as possible. For example, you might not even want to store your backup data on the same type of storage system as your primary data.
  • Ideally, backup should be done over a network to a remote site to avoid the delays and human involvement needed to physically ship tapes off site. It's a very attractive idea to sell storage boxes that automatically back themselves up over a WAN to a backup center that is professionally run, perhaps by a company like Digital.
  • You should be able to restore any "active" file by using only the most recent "n" incrementals.
  • You should be able to do a full restore with a single pass scan of the most recent "n" incrementals.
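    The last two requirements can be sketched concretely. If each incremental (with the original full dump counted as incremental #0) is a map from block number to new contents, then a block's current value is simply its most recent appearance, so a full restore is one newest-to-oldest pass over the last n incrementals. A toy Python illustration (mine, not from any real backup system):

```python
# Device-level restore from incrementals: walk newest to oldest and
# take the first (i.e., most recent) version of each block seen.
def restore(incrementals):
    """incrementals: oldest-first list of {block: data} dicts."""
    volume = {}
    for inc in reversed(incrementals):       # newest first, one pass
        for block, data in inc.items():
            volume.setdefault(block, data)   # older writes never override
    return volume

incs = [{0: "a0", 1: "b0"},   # full dump counts as incremental #0
        {1: "b1"},            # later change to block 1
        {2: "c2"}]            # new block
print(restore(incs))          # block 1 comes back as "b1", not "b0"
```

    The same walk restores a single file if one knows which blocks it occupies, and restoring frequently-used blocks first only requires visiting them earlier in the pass.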
  • John R. Mashey (mash@mash.engr.sgi.com) made three major suggestions for projects, each containing multiple CS252 projects:

  • P7. Doing better than SPEC;
  • P8. Fixing Knuth Volume 3 & "Algorithms and Data Structures" courses;
  • P9. I/O nightmares on the way.
  • P7. Doing better than SPEC, etc.

    1996's efforts on XSPEC (prior CS252 project) were useful, but this is a never-ending task...

    P7.1 Continue to propose/analyze new benchmarks.

    I.e., along same model as SPEC/XSPEC, take real programs and analyze them. Gathering and publishing data is always useful, although deeper analyses of fewer codes might be good.

    P7.2 No new data, more correlation [need to know some statistics]

    Do a serious mathematical correlation analysis among the existing masses of SPEC95 & XSPEC data to understand the level of redundancy, with goal of selecting well-correlated subsets, and shrinking the number of benchmark cases, not growing them. [One of the implicit goals of SPEC, at least from some of us, was to make sure there was a reasonable body of results to analyze...]
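    As a sketch of the kind of analysis intended (the scores below are made up for illustration), one can compute pairwise Pearson correlations between benchmark columns and flag highly correlated pairs as pruning candidates:

```python
# Given a machines-by-benchmarks score matrix, find benchmark pairs
# whose scores track each other so closely that one is redundant.
from math import sqrt
from itertools import combinations

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def redundant_pairs(scores, names, threshold=0.95):
    """scores[i][j] = score of machine i on benchmark j."""
    cols = list(zip(*scores))   # one column of scores per benchmark
    return [(names[i], names[j])
            for i, j in combinations(range(len(cols)), 2)
            if pearson(cols[i], cols[j]) > threshold]

# Made-up scores for three machines on three kernels: k0 and k1
# track each other almost exactly, so only one need be kept.
scores = [[10, 20, 7], [20, 41, 3], [30, 59, 9]]
print(redundant_pairs(scores, ["k0", "k1", "k2"]))  # [('k0', 'k1')]
```

    The caveat in the warning below applies directly: a column of near-identical machines will inflate every correlation.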

  • P7.2.1 No new data, correlation within product lines

    Pick some product line with a number of members with different clock rates, memory systems, and compiler versions. Study how performance has changed with compiler versions, and how it varies by MHz, cache size, etc. Question: given the SPEC or XSPEC numbers for one family member, how well can you predict the numbers for other family members, given MHz, peak issue rate, cache size, memory latency, memory bandwidth, etc., for entire benchmark sets or individual members?

  • P7.2.2 No new data, correlation across product lines

    Like P7.2.1, but with added complexity of different compilers, cache designs, etc. If one proposes a formula that predicts most of the variation, it is especially important to analyze the mis-predictions (at least, for commercial benchmark fights :-).

    WARNING: be careful on correlation analyses. Sometimes people have reached somewhat-incorrect conclusions by using a set of data points dominated by large subsets of related systems. I.e., if many of the systems used Pentiums of various flavors, one might find that MHz alone accounted for much of the variation.

  • P7.3 Performance versus data cache size & dataset size.

    (Some of this was done in 1996, much more is needed.) Some benchmarks have reasonably-scalable dataset sizes. In some cases, performance is very sensitive to dataset size versus cache size; in others, it is not; and sometimes it depends on how the code is compiled. (I.e., SPEC89's matrix300 was supposed to be a cache & memory thrasher, but cache-blocking compilers invalidated its usefulness for that, as they made it much less sensitive to cache size and memory performance.) Analyze the existing SPEC/XSPEC codes and see if there are any simple models that relate performance, cache size, dataset size, and some metric for the nature of the code. Try to categorize codes better: it is silly to make multiple runs of different sizes if little new information is provided.

    This probably involves running more sizes of data on fewer benchmarks, to look for steep dropoffs in performance.
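    The methodology might be sketched as below. (Python timing is far too noisy to resolve real cache boundaries; a serious version of this would be a C microbenchmark, and the array sizes here are placeholders.)

```python
# Sweep working-set sizes and report a best-of-N cost per access,
# looking for the steep dropoff when the working set spills out of
# a cache level.
import time

def ns_per_access(footprint_bytes, stride=64, reps=5):
    n = footprint_bytes // 8          # treat each element as ~8 bytes
    data = list(range(n))
    step = max(1, stride // 8)        # touch one element per "line"
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        s = 0
        for i in range(0, n, step):
            s += data[i]
        dt = time.perf_counter() - t0
        best = min(best, dt / (n // step) * 1e9)
    return best

for kb in (16, 64, 256, 1024):        # placeholder sweep points
    print(f"{kb:5d} KB  {ns_per_access(kb * 1024):6.1f} ns/access")
```

    Plotting ns/access against footprint (log scale) is the curve the project is after; each plateau corresponds to a level of the hierarchy.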

    P7.4 Performance, data cache size, dataset size across machine types.

    This is like P7.3, but emphasizing inter-machine comparisons.

    [Note: in commercial system benchmarking, fixed-size benchmarks are notorious for allowing misleading/surprising comparisons, and vendor benchmarkers know this well.] If a benchmark does have datset vs caches-size sensitivity, and there are two systems:

  • A: bigger cache, slightly slower memory
  • B: smaller cache, lower memory latency
  • If I sell A, I'll aim to get a benchmark size that is the largest that gets a good hit rate in my cache, because B's will be awful. If I sell B, I'll aim for a small benchmark that fits in my cache as well, or else a huge benchmark that thrashes both of us.

    Of course, the SPEC benchmarks are all fixed-size, and sometimes prone to odd cases where the size just happens to clobber a system, but size/2 or 2*size would not. I've seen cases where one could make either A or B look substantially faster than the other by adjusting dataset sizes.

    Maybe CPU benchmarks should look more like system throughput benchmarks, i.e., shown as a curve. For benchmarks in which cache size versus data size matters, it would be good to compare machines by curves with vertical = performance, horizontal = variable data size, across an appropriate range of sizes, and then see if there are better, succinct metrics derivable from the mass of data. Somehow, one would like a metric that gives credit to performance at a range of sizes, not just one size.

    P7.5 Error bars

    Redo any of the analyses above and get away from individual performance numbers into means with error bars, or just 90% confidence limits, or anything that would bring more rationality to the presentation and analysis of these things. [It is incredible that people talk about SPEC#s with 3 digits of accuracy...]
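    A minimal sketch of what that presentation could look like, using textbook t-based 90% confidence intervals over repeated (made-up) runs:

```python
# Report a benchmark score as mean +/- 90% confidence half-width
# instead of a single 3-digit number.
from math import sqrt

# Two-sided 90% t critical values by degrees of freedom (standard tables).
T90 = {2: 2.920, 3: 2.353, 4: 2.132, 5: 2.015, 9: 1.833}

def mean_ci90(samples):
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)  # sample variance
    half = T90[n - 1] * sqrt(var / n)
    return mean, half

runs = [11.2, 11.9, 10.8, 11.5]   # made-up repeated runs of one benchmark
m, h = mean_ci90(runs)
print(f"score = {m:.2f} +/- {h:.2f} (90% CI)")
```

    Two machines whose intervals overlap simply should not be ranked against each other, which is most of the rationality being asked for.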

    P7.6 Synthetic benchmark (data)

    I'd love to have a reasonably short, scalable, perhaps synthetic benchmark that, with appropriate & understandable input parameters, could predict most of the variation in the various SPEC/XSPEC codes. (This is probably not for 1998, since it should build on some of the other suggestions above, not be invented out of whole cloth.)

    P7.7 Synthetic benchmark (instruction)

    Data-intense codes lend themselves to scaling tests; instruction-cache ones don't. Characterize the instruction cache behavior of the existing codes. Propose a parameterizable synthetic benchmark whose code size can be varied, and investigate its correlation with the existing programs.

  • P7.7.1 Instruction & data-cache benchmarks

    As in P7.4, given the same size external cache, there are odd cases of differences among: 1-level, split I & D caches; 2-level, with a joint L2 cache; and 3-level (as in Alphas).

    (or in earlier days, between a split I&D or joint 1-level cache).

    Also, between direct-mapped and set-associative caches: direct-mapped D-caches sometimes have terrible performance dips as array sizes are increased, due to cache collisions. (Customers get angry out of all proportion; i.e., system A (direct-mapped) may have better average performance than system B (set-associative), but B degrades smoothly, rather than showing the occasional giant dips of A. People hate A.)

    Study the existing benchmarks and analyze the performance results in light of either I-size or D-size being near sensitivity points of current system designs, or sensitivity points on associativity.

  • P7.8 Latency

    Larry McVoy's lmbench benchmark is a useful indicator of performance for small-object-latency-intense codes, i.e., where cache misses are frequent. Since this includes some behavior of DBMS, networking, and OS code, and such behavior is not necessarily well represented by SPEC, these numbers are of interest. On the other hand, the numbers are prone to over-interpretation: over-reliance on the part of the benchmark that measures memory latency alone would lead one to design machines that have the simplest cache design (1-level, blocking loads, no fetchahead) and simplest in-order execution, since the benchmark explicitly defeats use of multiple outstanding cache misses and out-of-order execution, even though these features are believed by many to be useful.

    Analyze lmbench results (of which many exist) and compare to SPEC/XSPEC results, and see which if any of the SPEC/XSPEC components are well-predicted by the lmbench (a) Memory latency or (b) latency of largest cache.

  • P8. Fixing Knuth Volume 3 or "Algorithms and Data Structures" courses [CS61B]


    [At least once a quarter, I stop by at the school down the road from my house, and check out whatever they are using for "Algorithms and Data Structures" or equivalent [for UCB: CS61B especially]. I complain to Hennessy: "These courses are out of date; they seldom even mention cache memory, much less include it in algorithm analysis; they fit CPUs of 10-20 years ago, but not current ones; students ought to be sensitized to the issues a bit earlier, instead of hearing about theoretical machines in a data structures course, and then real machines (but without reference to the data structures) ... rant, rave." Hennessy says "Well, Knuth is rewriting Volume 3" ... yes, good, sometime, but we have a problem, and it is in fact one of the most commonly-asked questions by practicing programmers: what should we do to our data structures to cope with the new CPUs?

    The problem: Knuth Volume 3, and many other textbooks, effectively act like CPUs are stuck in the early-1980s microprocessor days:

  • Timing: 1 memory access ~= arithmetic operation, or close

    But now, we have multi-level cache hierarchies, and it is quite possible for a cache miss to cost 400 or more (peak) instruction issues, and this is going to get worse, not better, for a while.

  • Spatial locality: irrelevant

    But with caches, spatial locality can be very important, especially with out-of-order, superscalar machines, which can take good advantage of multiple elements in a cache line. For example, if one hashes into a large table, think hard about hashing to cache lines, then searching each cache line, rather than rehashing on each collision. You have paid many cycles to obtain the cache line... Also, in real life, sometimes people spend a lot of time aligning data structures on cache line boundaries...
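    The hash-to-cache-lines idea can be sketched as follows. This toy Python table (an illustration of the structure only; Python cannot show the actual memory behavior) probes every slot of one line-sized bucket before moving to the next line:

```python
# Hash table whose buckets are cache-line-sized: 8 slots, standing in
# for one 64-byte line of 8-byte keys. Collisions are resolved first
# within the line you already paid a miss for, then by linear probing
# to the next line.
LINE_SLOTS = 8

class LineHash:
    def __init__(self, num_lines=64):
        self.lines = [[None] * LINE_SLOTS for _ in range(num_lines)]

    def _line(self, key):
        return hash(key) % len(self.lines)

    def insert(self, key):
        li = self._line(key)
        for probe in range(len(self.lines)):   # next line only on overflow
            line = self.lines[(li + probe) % len(self.lines)]
            for s in range(LINE_SLOTS):        # scan within one "line"
                if line[s] is None or line[s] == key:
                    line[s] = key
                    return
        raise RuntimeError("table full")

    def contains(self, key):
        li = self._line(key)
        for probe in range(len(self.lines)):
            line = self.lines[(li + probe) % len(self.lines)]
            for s in range(LINE_SLOTS):
                if line[s] is None:            # slots fill in order
                    return False
                if line[s] == key:
                    return True
        return False

t = LineHash()
for k in range(100):
    t.insert(k)
print(t.contains(42), t.contains(1000))   # True False
```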

  • Temporal locality: irrelevant

    Two algorithms may theoretically have similar computational complexity, but one may run much faster in practice, for many reasons. Cache temporal locality can vary dramatically between otherwise similar-looking algorithms, as for example, in sorting, where people spend serious effort optimizing sorts around the use of caches, making internal sorts act somewhat more like external sorts.

  • Scientific programming, of course, is well familiar with these things, as evidenced by the widespread use of cache-blocking compilers. Unfortunately, analyses for lists, trees, searching, sorting, etc., seem not to have progressed as quickly, although I may have missed something, as I only look at this every once in a while. I think there is room for a lot of publishable papers to update the literature, and one of these days, there should be a new textbook, or at least a section called: "classic data structures versus modern machines". Of course, if everything turns into Java, maybe this won't matter :-)

    P8.1 Classical algorithm analysis meets modern machines

    Go back to your CS61B (or equivalent) textbook/notes.

    Pick a scalable algorithm that can be scaled large enough to stress cache memories well. Run it to produce a plot of elapsed time (vertical) vs. problem size (horizontal), compare the shape of the curve with the theoretical analysis, and analyze any surprises. Do this across several systems with different memory hierarchies. If the systems you use have machine-level profiling tools (i.e., equivalent to those in the MIPS R10000-based systems, which can be easily made to tell you about cache misses, TLB misses, etc.; I think DEC has similar things, I'm not sure how accessible they are elsewhere), use them to explain the curves. Suggest cache-related improvements to the algorithm and see if they matter. Searching, sorting, and memory allocation problems are all relevant, and some people at UCB study parallel versions of some of these ... but it would be nice just to accumulate better analysis for the simplest uniprocessor cases.

    (I'm mostly ignoring MMU issues in favor of cache ones, but the latter can pop up as well.)
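    The measurement loop might start like this sketch, which times sorting over growing inputs and divides by the theoretical n log n term; in the textbook model that ratio is flat, so an upturn at large n is the memory hierarchy showing through. (A real project would use C and hardware counters; the sizes here are placeholders.)

```python
# Time a scalable algorithm (sorting) at several sizes and normalize
# by its theoretical complexity to expose memory-hierarchy effects.
import random
import time
from math import log2

def time_sort(n, reps=3):
    best = float("inf")
    for _ in range(reps):
        data = [random.random() for _ in range(n)]
        t0 = time.perf_counter()
        sorted(data)
        best = min(best, time.perf_counter() - t0)
    return best

for n in (1 << 12, 1 << 14, 1 << 16):   # placeholder problem sizes
    t = time_sort(n)
    print(f"n={n:7d}  time/(n log n) = {t / (n * log2(n)) * 1e9:.2f} ns")
```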

    P8.2 Algorithm 1 meets algorithm 2, in presence of cache

    As in P8.1, but compare two different algorithms that accomplish the same task, i.e., like different sort programs. Analyze cache size/nature sensitivities, and any other surprises. Can the source code be parameterized (with cache sizes, latencies) to give "good" performance across a wide range of machines? Can a binary have the same property and still be clean code? (I.e., how does one write such algorithms so they perform "well" on machines in the same family with 256K -> 8MB caches?) How complex does a machine model need to be to get good results in parameterizing such codes? Do this on one system, or better, on several.

    P8.3 Evil pointers, linked lists, out-of-order machines

    Following a chain of pointers, where each pointer is a cache miss, is one of the worst possible things to make cached machines do, since even a speculative-execution machine with multiple outstanding cache misses (MIPS R10000, HP PA8000, Intel Pentium Pro, Pentium II) gets slowed to DRAM speed in following such a pointer chain. Such CPUs can sometimes unroll tight iterative loops, and get resultant cache misses going faster ... but only if, for example, a linear list is restructured as pointers to vectors of pointers, over which the hardware can iterate.

    Look at algorithms that use lists, or binary trees, and see if converting them to N-ary structures helps on machines with o-o-o execution, or not, especially as compared with those CPUs that block on first use of data.
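    The restructuring described above can be sketched as an "unrolled" list whose nodes each hold a cache-line-sized vector of elements, so each pointer chase amortizes over several elements. Python can show the shape, though not the cache effect:

```python
# Unrolled linked list: one pointer dereference per FANOUT elements
# instead of one per element.
class UnrolledNode:
    __slots__ = ("items", "next")
    def __init__(self):
        self.items = []          # up to FANOUT elements per node
        self.next = None

FANOUT = 8                       # ~ one 64-byte line of 8-byte elements

def build_unrolled(values):
    head = tail = UnrolledNode()
    for v in values:
        if len(tail.items) == FANOUT:
            tail.next = UnrolledNode()
            tail = tail.next
        tail.items.append(v)
    return head

def total(head):
    s, n = 0, head
    while n is not None:         # one pointer chase per FANOUT elements
        s += sum(n.items)
        n = n.next
    return s

head = build_unrolled(range(100))
print(total(head))               # same answer as a flat list: 4950
```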

    P8.4 Algorithms, languages, object-oriented stuff

    Chasing pointers is bad enough; the worst thing for a modern CPU is doing an indirect call to a function pointer located in a structure that has just been accessed via a pointer ... like some object-oriented programming likes. Even a deeply-speculative CPU cannot get beyond that indirect call. Study the I-cache miss behavior of a program coded in this style: perhaps this is not such a bad problem (if the call targets are a small enough set there is a good cache-hit rate), maybe it is. If it is, suggest a hardware solution that makes indirect calls as fast as explicit ones. (50 points, or just send me the solution :-)

  • P9. I/O issues, such as what to do with these huge disks that are coming.

    Some famous people have said that I/O is the orphan of computer architecture, and it is ... but many real-life problems depend a lot more on I/O performance than anything else. Even worse, many of them are very dependent on latency (rather than bandwidth), and latency is not improving as fast as anyone would like. Unfortunately, it is harder to think up good I/O problems that are doable in a few weeks. I'll suggest the #1 "InfraStress" (InfraStructure Stress) issue I know of coming up related to I/O, and maybe some problems will fall out.

    Disks are getting HUGE, fast, and algorithms are breaking (out) all over

    Disk density, and thus capacity per cubic inch, is going up dramatically already, and is about to go up even faster (i.e., due to IBM GMR disks, and Terastor/Quinta laser-enhanced magnetics): if you believe what TeraStor says, one would expect to see 500GB 3.5" drives by 2003, so that one could get a Terabyte on the desktop ... but unfortunately, one would guess that a normal 500GB disk would take 10,000 seconds to read ... not good.
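    The back-of-envelope behind the 10,000-second figure, with the sustained transfer rate as my assumption (the text gives only the capacity):

```python
# Reading 500 GB flat out at an assumed 50 MB/s sustained rate.
capacity_bytes = 500e9      # from the TeraStor projection above
bandwidth = 50e6            # bytes/second -- my assumed sustained rate
seconds = capacity_bytes / bandwidth
print(seconds)              # 10000.0 -- nearly 3 hours for one read
```

    The point is the ratio: capacity is growing much faster than bandwidth, so this number gets worse every year.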

    Unfortunately, disk densities are going up faster than bandwidth, and latency is only dropping very slowly. Such changes in ratios tend to break old limits (which are at least obvious), and more insidiously, existing algorithms tend to have performance breakage in the face of large changes. The coming disks are likely to cause trouble for most current filesystems, and there will need to be some serious rearchitecting.

    Any problems here are likely to need to build on work/environments that are already around UCB, and I'm not sure offhand what there is. Realistically, one would want some environment that would let students do file system evaluations/analysis at user level, and study key algorithms without having to invent everything.

    Greg Pfister of IBM (pfister@us.ibm.com) suggests projects P10 and P11:

  • P10. I/O

    First: Looking over the list of prior projects from last time, I noticed virtually nothing about I/O. You are using a text that doesn't focus there (nor do many others), but nevertheless -- many topics from last time could be revisited in an I/O context: benchmarks of I/O, efficiency of I/O, etc. How good is the memory system at block streaming multiple multimedia streams onto disk and/or a fast network? How about OS overhead for lots of little transactions -- there certainly are imaginative ways it could be reduced, and proposing/measuring the results could be a good project.

  • P11. Application Performance: Messages vs. CC-NUMA

    Second, how about an application comparison of a low-overhead messaging scheme (like VIA) versus a shared-memory implementation using CC-NUMA? Of course it should be based on measurement, so maybe SMP measurement plus some analytical modelling substitutes for CC-NUMA and some analytical work.

  • P12. Networks vs. Busses

    Jim Gray of Microsoft (Gray@Microsoft.com) also thought last time there was not much emphasis on IO or the memory bus. An IO/Networking project:

  • Measure Disk controller bandwidth and latency (via read/write of same cached block).
  • Measure Network Controller bandwidth and latency (via loopback?)
  • Prediction: Network controller has lower latency and bandwidth.
  • Why (why can't we have the best of both?)
  • P13. Evaluating New WinTEL I/O Standard

    David Douglas (douglas@East.Sun.COM) and F. Balint (balintf@East.Sun.COM) from Sun Microsystems made a suggestion regarding a new I/O standard. "I2O" (I-to-O) is the big new I/O architecture definition from WinTel, attempting to push more processing off of the main CPUs and onto the I/O cards. Their question is: Is this any faster than the "smart" IO subsystems people have been building for a while? Will I2O open up new opportunities to move stuff off of the main CPUs, resulting in faster performance?

    One additional issue not mentioned below, but within Berkeley's historic interests, is to look at I2O in light of RAID controllers.

    The project goal is two fold:

  • 1. Establish the performance benefits of I2O (Intelligent IO) subsystems compared to :
  • a. Traditional "smart" IO subsystems
  • b. Traditional "not smart" IO subsystems
  • 2. Make suggestions of how to improve system performance further by perhaps off loading more IO processing into an IOP.
  • What are the factors impacting system performance? What IOP performance is needed to gain the most? Is there a linear relationship between IOP performance and system performance gain?

    The focus should be on storage and on network performance.

    For step #1, the recommendation is to compare the

  • Symbios 875 SCSI controller stand-alone with the same 875 controller connected to an Intel 960RP I2O controller.

  • Also, compare the ISP 1040 Qlogic SCSI controller with the 875/960RP solution.

  • Note that 875 is a traditional "medium" smart SCSI controller

  • The 960RP is the de facto I2O controller used today

  • The ISP1040 is a traditional smart SCSI controller, but not I2O compliant.
  • The data to be looked at is IOs per second for 2K r/w and 8K r/w. The system utilization should be compared (Intelligent IO is supposed to reduce PIO and interrupt rate, and thus CPU utilization).

  • P14. Evaluating Embedded Processors

    Christoforos Kozyrakis (kozyraki@ikaros.CS.Berkeley.EDU) had suggestions based on using the embedded processor board for StrongARM SA110 that we have in the lab, plus a few miscellaneous ideas.

  • Benchmarking embedded processor

    Run Spec95int and other micro-benchmarks (lmbench etc) on an embedded processor (recommended: SA110 that we have available). Measure performance; compare with results for other microprocessors.

  • Comparison of embedded/desktop processors for DSP applications

    Use BDTI benchmarks or other DSP kernels to compare/analyze the performance of an embedded processor (SA110) and a desktop processor (Sparc/Pentium).

  • Vector Algorithms: unit stride vs scatter-gather

    It is believed that scatter/gather support on vector processors is crucial for high performance. Take a small number of vector algorithms that are usually written using scatter/gather. Change them (i.e., use a different algorithm to solve the same problem) to use strided accesses. Quantify the difference in performance using the T0 simulation tools.

  • Texture Caching

    Previous work has indicated the existence of a large amount of locality in texture accesses in high-end 3D machines. Still, industrial experience with 3D accelerators for personal computers has been that texture accesses in 3D games and other PC applications do not cache well.

    Collect traces from 3D applications, games or benchmarks and analyze the locality (temporal and spatial).
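    One standard way to quantify temporal locality in such traces is the LRU reuse distance: the number of distinct cache lines touched since the previous access to the same line. A minimal sketch:

```python
# Compute LRU reuse distances over an address trace. Small distances
# mean good temporal locality; a heavy tail of large distances would
# explain why textures "do not cache well".
def reuse_distances(trace, line_bytes=64):
    stack, out = [], []        # stack: least- to most-recently used line
    for addr in trace:
        line = addr // line_bytes
        if line in stack:
            d = len(stack) - 1 - stack.index(line)
            out.append(d)      # distinct lines touched since last use
            stack.remove(line)
        else:
            out.append(None)   # cold (compulsory) miss
        stack.append(line)
    return out

# Toy trace: lines 0, 1, 0 -> the second access to line 0 has distance 1
print(reuse_distances([0, 64, 0]))   # [None, None, 1]
```

    A histogram of these distances directly predicts hit rate for any fully-associative LRU cache size, which makes it a compact way to compare game traces against the high-end workloads.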

  • Genetic algorithms in computer architecture

    In ISCA97, Joe Emer presented how genetic algorithms can be used to describe and synthesize branch predictors, with impressive results. Genetic algorithms can probably be of use in other areas of computer architecture: data predictors, hardware prefetching. Explore the potential of creating predictors by using genetic algorithms. (Kubi may have more specific ideas here.)
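    As a toy instance of the idea (my illustration, far simpler than Emer's work), a genetic algorithm can evolve a history-indexed prediction table for a repeating branch pattern, with prediction accuracy as the fitness function:

```python
# Evolve a branch predictor: genome = a 2^HIST-entry table indexed by
# recent branch history; fitness = prediction accuracy on a pattern.
import random
random.seed(0)

HIST = 3                                  # bits of branch history
PATTERN = [1, 1, 0] * 40                  # branch outcomes to predict

def accuracy(table):
    hist, correct = 0, 0
    for outcome in PATTERN:
        correct += table[hist] == outcome
        hist = ((hist << 1) | outcome) & ((1 << HIST) - 1)
    return correct / len(PATTERN)

def evolve(pop_size=20, gens=30):
    pop = [[random.randint(0, 1) for _ in range(1 << HIST)]
           for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=accuracy, reverse=True)
        survivors = pop[: pop_size // 2]          # keep the fittest half
        children = []
        for _ in range(pop_size - len(survivors)):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, 1 << HIST)  # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.2:             # occasional mutation
                i = random.randrange(1 << HIST)
                child[i] ^= 1
            children.append(child)
        pop = survivors + children
    return max(pop, key=accuracy)

best = evolve()
print(f"best accuracy: {accuracy(best):.2f}")
```

    A static always-taken predictor scores only 2/3 on this pattern, so anything the GA finds above that is the search doing real work; a data-value or prefetch predictor would just swap in a different genome and fitness function.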

  • List of CS252 projects from last time (look under Course Projects)

    Back to CS252 home page...