CS 252 - Project Suggestions Page

Here are a list of suggested CS 252 projects for Fall 1996. If you use an item that was suggested by someone, you should at least tell the person that you are using it, and hopefully tell them the result of the investigation.

Note that you are not limited to the projects listed here. You may work on anything that you find interesting that is deemed worthy of being a CS 252 project.

If you want more of an idea of just what a CS 252 project entails, check out the projects from last year, Spring 1996 and Fall 1995.

Please try to find a project partner and have a rough idea of what you want to work on by Monday, September 23. Send mail to patterson@cs.berkeley.edu and rfromm@cs.berkeley.edu indicating who you are working with and what you are interested in working on. Please, just one message per group. Feel free to contact us if you have any questions or are looking for suggestions and/or clarifications.

A more detailed survey concerning your project choice will be coming shortly. It is important to get started thinking about this early. Trust me, you don't want to let this all go until the last minute.


eXperimental SPEC for 2002: create benchmark set of programs, data size of future apps

Java vs. SPEC

JAVA 1: characterization of cache behavior, register usage, branch behavior, and depth of calls (to name a few interesting data points) for Java applications.

Suggested by Dileep Bhandarkar ( Dileep_Bhandarkar@ccm.sc.intel.com)

SPEC95 across different architectures

Since you have a variety of architectures, take the gcc compiler and measure SPECint95 on some public version of Unix such as Linux or FreeBSD

Suggested by Dileep Bhandarkar ( Dileep_Bhandarkar@ccm.sc.intel.com)

SPEC95 cache analysis

Repeat Mark Hill's SPEC92 cache analysis for SPEC95.

Suggested by Dileep Bhandarkar ( Dileep_Bhandarkar@ccm.sc.intel.com)

Measure OS primitives

Pick some set of OS primitives (process creation, synchronization etc) and measure the times for NT vs Unix on same platform.

Suggested by Dileep Bhandarkar ( Dileep_Bhandarkar@ccm.sc.intel.com)

A "voting" data-prefetch engine

This H/W device has the following characteristics:

Suggested by Sharad Mehrotra ( Sharad.Mehrotra@Eng.Sun.COM)

Architecture Archeology/Endangered Species Act

Documenting architectural history might attempt to either collect or construct emulators for machines which are disappearing

Suggested by Eugene Miya ( eugene@pioneer.arc.nasa.gov)

Evaluate usefulness of VIS/MMX instructions

Both Intel (MMX) and SPARC (x86) have recently extended their instruction sets with mini-vector-like operations that allow the simultaneous operation on groups of values (i.e. vectors). One area which is being specifically targetted by this is multimedia. Both companies claim significant speedups from the use of these instructions. How valid are these claims? Evaluate the usefulness of these additional instructions for some application(s). Are the instructions worthwhile? Can you suggest any different new instructions which would be better? What really matters is complete, end-to-end performance of an entire application, not just an impressive speedup on a small micro-kernel. Some claims and/or previous evaluations may have been published in past issues of Microprocessor Report.

The following several suggestions deal with the IRAM project. See the IRAM web page and/or contact Dave Patterson or Rich Fromm for more details about the project.

Investigation of logic circuits in memory processes

One problem with building a processor in a DRAM manufacturing process is that the logic and memory processes are optimized for different factors, and logic in a DRAM process is likely to run slower. How much slower is a very important question. It would be very helpful to simulate (with SPICE) various logic circuits in a memory process to try to quantify these differences. Other related topics (besides speed differences) include area and power considerations. Studying the performance of an SRAM cache built in a DRAM process would also be useful, since an IRAM with a large DRAM main memory is still likely to have an L1 cache, which is likely to be built from SRAM. This project would investigate such physical phenomena and apply this knowledge to some simple architectural evaluations.

Stelios Perisakis (sper@cs.berkeley.edu) has already investigated some of these issues, so it would probably be helpful to consult with him before undertaking this project.

IRAM with processor redundancy

Memory chips are typically built with redundancy to improve yield. Can this make sense for logic as well, including logic combined with memory?

On-the-fly compression/decompression

An IRAM will have a tremendous bandwidth to the on-chip memory, but there is still the problem of accesses that must go off-chip. The off-chip memory bandwidth is likely to be orders of magnitude less than the on-chip bandwidth. Perhaps the most sensible solution is to store programs and data in compressed form, decompress them as they are read into the chip, and compress data as it is written off of the chip. Evaluate the usefulness of this idea (with both standard and novel compression schemes) with real programs and data,\ paying careful attention to the architectural feasibility of any implication, not just how good some algorithm might perform theoretically.

IRAM for large programs/data

Get traces of large programs. Compare and contrast:

Evaluation of existing cache design(s) implemented as an IRAM

Take the memory hierarchy of an existing microprocessor and evaluate its effectiveness if implemented as an IRAM. As mentioned above, a processor running in a DRAM process is likely to be slower. Can you quantify how much slower the processor can be to still have a performance advantage for an IRAM vs. a conventional solution. Traces of real programs and/or the utilization of on-chip performance counters on processors in existing systems could be used as the data upon which to base the analysis.

Design and evalutation of a new microarchitecture for IRAM

Similar to above, but instead of using an existing design of a memory hierarchy, come up with your own design that is optimized for an IRAM and evaluate its effectiveness.

IRAM as a single-chip multiprocessor

Integrating the processor and memory provides for a tremendous amount of on-chip bandwidth. An important question is how to best utilize this. One possibility is by placing more than one processor on the same chip. Perhaps this is not the right idea, since if bandwidth is the problem, having multiple processors will just make the problem worse. But if there is enough available bandwidth in an IRAM context, perhaps this would make sense.

A paper at this fall's ASPLOS, The Case for a Single-Chip Multiprocessor, investigated integrating multiple processors onto a single die. Would the results of this investigation be any different looking into the future if this were implemented in the 1 Gb DRAM generation, with the main memory on chip? (The paper assumed a 256 KB L2 SRAM cache on-chip, implemented in a logic process.)

IRAM as a building block of a distributed shared memory multiprocessor

Besides viewing an IRAM as a single uniprocessor, or a single multiprocessor (see above), another potential is that an IRAM could be an ideal building block for a larger multiprocessor system. Need more memory, just throw together multiple IRAMs, and you have the added benefit of having multiple processors too.

There is a substantial body of literature on parallel processing architectures. What if the building block of a distributed shared memory multiprocessor, which is typically a processor with some local memory, were all contained on the same chip? How much of the conclusions drawn from past studies would still apply, and what might be different?

If you want to do any project (whether or not related to IRAM) that is related to parallel processing, I would advise talking to David Culler (culler@cs.berkeley.edu) to get some more advise and insights. One paper that might be a good starting point for ideas comes from this summer's ISCA, "STING: A CC-NUMA Computer System for the Commercial Marketplace".

Treat registers like a cache

Have a (fully?) associative array of readout registers, and only do the writeback once the readout register is replaced. This means that a logical read turns into a physical read. It also implies that the physical DRAM copy is wrong. All the standard cache questions apply: stream buffers, speculation, etc. In addition, there is the question of pre-writeback of dirty lines, as in the virtual memory subpiece of operating systems.

Suggested by Eric Anderson ( eanders@cs.berkeley.edu)

Programmable prefetch unit

Many people are building more and more complicated prefetch units that try to learn indirect references, etc. (See Sharad Mehrota's work at UIUC Examination of a memory access classification scheme for pointer-intensive and numeric programs and Quantifying the performance potential of a data prefetch mechanism for pointer-intensive and numeric programs). A potentially better idea is to let the compiler/user program the prefetch unit. This would have a second little state machine that can inspect the state of the primary CPU and issue memory prefetches based on the primary CPU's state.

Suggested by Eric Anderson ( eanders@cs.berkeley.edu)

Compressed linked lists vs. fully expanded arrays

When does compressing to a linked representation provide a performance improvement .vs. a fully expanded array representation. Linked lists allow for easier insertions into the middle, and more flexible resizing queues and stacks. Some initial work on compressed representations for linked lists providing a middle ground between these two extremes is the work on Cache-Friendly Lists

Suggested by Eric Anderson ( eanders@cs.berkeley.edu)

Width and depth of circuits

Is Transistor depth/cycle relatively constant? If so, then circuits need to be "widenable" for performance improvment. Even with IRAM, CPU's are still likely to improve faster than memory in latency, and hence to keep from being limited by memory, you need to be able to reduce the miss rate arbitrarily by wasting space and width -- but not depth in the circuits. Some of the theory work on circuit depth may be applicable here.

Suggested by Eric Anderson ( eanders@cs.berkeley.edu)

Cache replacement policies

Is farthest future use an optimal cache replacement policy? Given a memory reference stream, and an optimal replacement policy, if the cache does not do well on that stream, then no cache can do well. However, if that cache can do well, and in addition conventional caches do poorly, then there are still questions about improving caching. Also question of how much buffering is neded to do perfect prefetching with the optimal replacement policy. Define optimal as least total misses.

Suggested by Eric Anderson ( eanders@cs.berkeley.edu)

The following several suggestions deal with the BRASS project. See the BRASS web page and/or contact John Wawrzynek or André DeHon for more details about the project.

Comparison between full-custom, FPGA, and processor implementations

There are a few instances (e.g. multipliers, FIRs) where we have custom IC, FPGA, and processor implementations and can compare the area-time efficiency of each. It would be beneficial to expand this set with additional computational tasks/kernels/applications. So, an interesting project might be pick an application which has a plausible custom implementation to compare against and develop/estimate a good processor and array (FPGA/GARP-array/etc.) implementation (perhaps include the processor+FPGA hybrid, as well). In the long run, it will be interesting to have a large set of tasks with {custom,processor,fpga,processor+fpga,... matrix,vector...} implementations to provide a better feel for the tradeoffs involved.

I've been collecting papers on (semi)-custom implementations of various tasks to provide the custom datapoint, including: path planning, DFT/FFT, viterbi, DES, search/matching template matching, IDEA, Reed-Solomon codec.

Suggested by André DeHon ( amd@CS.Berkeley.edu)

FPGAs to accelerate common APIs

John Hauser suggested the use of the coupled array to implement various standard APIs which are being used to allow separate software/hardware implementations (OpenGL, Direct{3D,Draw,X}, Photoshop API?, others?). It might be interesting to pick one (or a subset of operations from one of these) and look at the library implementation one could provide with a coupled FPGA-accelerator. Presumably, we already have (or can easily measure) the performance of the pure software implementations of these APIs for comparison. In some cases, there might be custom hardware accelerators to compare against, as well.

Suggested by Andre DeHon ( amd@CS.Berkeley.edu)

Specialization Opportunity

We know the FPGA implementation is going to be larger/slower than a custom implementation of the *exact* same task. However, the custom implementation must often be heavily generalized over the task one wants to solve at one particular time. The FPGA implementation can be built to solve a more specialized task and in the processes recover some of the area-time penalty (e.g. implementing a multiply by an 8-bit constant rather than having to implement a full 32x32 multiplier).

The goal would be to estimate the opportunity for specialization. How often and how tightly are things known such that a more specialized implementation can be used? and how much smaller/faster is the specialized implementation?

More specialized versions may be possible because:
  1. the structure of the program constrains it (now I multiply by a constant)
  2. the value is bound early and then remains constant (e.g. value changed outside a loop, or parameter calculated only during initialization)
  3. the PDF on a value is highly localized (e.g. this values is almost always 8)

Looking at traces of program runs including their data values, we can identify when things are bound and estimate how much tighter an implementation would be possible if these values were treated as constants. Similarly, for the localized PDFs of values, when they are sufficiently localized, we can look at how much tighter the implementation would be for the particular cases which occur most frequently.

Estimating actual savings might require some finesse. For some operations, we might be able to pull out an equivalent logic implementation of the task and run that through synopsys to get a comparison.

Suggested by André DeHon ( amd@CS.Berkeley.edu)

Coping with Finite Array Size

One issue which comes up regularly is how we deal with a finite array size.

Several models include:

  1. map everything assuming an infinitely large array and perform demand-drive fetching of operations
  2. select a subset of the task which fits onto the array and map the rest onto the coupled processor [and this can be further broken down to: a) single, fixed set b) explicitly swapped overlays ]
  3. "generalize" the structure built on the array until it fits onto the space available
As Seth suggested, the optimum probably lies somewhere in a mix of these techniques.

A project might pick an application or two and try out one or more of these techniques to begin to understand the merits of each and the situations under which each technique is preferable.

[point of reference: BYU's DISC architecture is the only one currently doing demand paging. I think I've only seen an application or two on it. I asked Michael Wirthlim at FCCM this year how things would differ if, instead of demand paging the entire application, he simply picked a subset of ops to implement on the array and implemented the rest on the host processor -- he agreed with me that it would probably run faster had he done that (but that wasn't what *he* was studying).]

[A closely related question of interest which also shows up here is that of "how large" is the configurable array. In discussions with Tom McDermott, he advocated that there are probably natural breakpoints -- i.e. there's a set of things you can do with ~100 configurable elements (e.g. Razdan style, flow control and data conversion), another set you might be able to do with thousands (small loop kernels), and other things you can do when you can map large portions of the application into the array (e.g. SPLASH/PAM apps?). The analog here might be like cache sizes -- you get one break when the inner loop fits into the I-cache, another when the program fits into cache, and yet another when the entire working data set fits into real memory.

So, the question is: are there discontinuous breaks in the benefit curve for configurable logic? and are there natural places where those breaks line up between applications? (If they don't line up, the average across many applications may be continuous as they all break at different points). Natural breaks could argue for building arrays in particular size ranges, whereas, a continuous average might argue that you just want as much as you can justify without worrying about any magic sizes.]

Suggested by André DeHon ( amd@CS.Berkeley.edu)

Multitasking and Configurable Arrays

We need to look at the interactions of multitasking and configurable arrays.


  1. fetch/instruction management scheme (array instruction faults in some schemes)
  2. security
  3. dealing w/ (non-array) memory faults
  4. overheads and swapping costs (comparison w/ other swapping costs in os)
  5. guaranteeing forward progress

One could approach this from various directions:

  1. build software layer on top of the John's defined GARP mechanisms
  2. design management scheme from scratch based on goals/constraints (what should the hardware support to make it work well?)
  3. consider what you might do w/ a pure array scenario

There are a number of performance issues associated with the details of the instruction fetch scheme (explicitly/implicit?, granularity of swapping?,...) and the data swapping schemes. This might be a good tie-in for someone looking for a pair of closely related 262/252 projects.

Suggested by André DeHon ( amd@CS.Berkeley.edu)

Important Program Characteristics

There are also a number of program characteristics we'd like to start trying to extract/estimate:
  1. (derivable) datasizes (i.e. is this operation a 1-bit, 8-bit, 16-bit,... operation?) -- A fine-grained array can narrow the width of deployed resources to the operations (and modern multimedia extensions allows efficient handling of small ops when they can be performed in SIMD sets).
  2. retiming distances (how long between the production and final consumption of data items -- this tells us how much space we really need to hold intermediate data. A PDF on retiming distances can help us understand the mix of retiming depths/resources which will be necessary to run programs).
  3. richness of interconnect (I like to think of it as calculating the Rent parameter for a program -- i.e. how much locality is there in the program, how much physical interconnect will we need to support a given level of parallelism for this application.)

Suggested by André DeHon ( amd@CS.Berkeley.edu)

Usefulness of Vector IRAM and/or reconfigurable logic for real applications

One possible use of the available bandwidth for an IRAM is with a vector unit. Applications that parallelize well are likely to be good matches for either vectorizing or mapping to reconfigurable logic. What are some of these applications, and just how much of a speedup can be obtained using either a vector processor or an array of reconfigurable logic. A 294 project last spring, MPEG Encoding on a Vector Processor, looked at the performance speedup from using vector processing on a portion of the MPEG encoding process. A more complete project would be concerned with total end-to-end performance. It is much harder to claim impressive speedups on entire applications than just on small micro-kernels.

If you want to do any project related to vector computing, I would strongly suggest talking to Krste Asanovic (krste@icsi.berkeley.edu), who worked on the T0 Vector Microprocessor.

Multiprocessor Simulator

We need to understand how different interconnects for large multiprocessors (order 1K-10K nodes, 100 TFLOPS) impact their performance on scientific codes which solve coupled-physics problems with multiple domain decompositions. In these codes, different decompositions are used for different physics, and all physics must be done at every time step. Note that these codes are more complicated than many others due to the multiple domain decompositions.

We wish to determine the dependence of computational performance on interconnect details such as topology, latency, bandwidth, dynamic reconfigurability, and cut-through vs. store&forward routing. Because the codes to be simulated are large and complex, the instruction-level simulator we're currently using is much too slow. We'd like to switch to a trace-based simulator.

The immediate job of the student would be to build a simulator code which can accept a code trace, a simplified description of the compute nodes, and a detailed interconnect model, and provide output which includes total time-to-solution plus details of traffic flow in the interconnect. These details include throughput statistics, latency statistics, and some contention details (where in the code, where in the interconnect, how much added delay).

This project requires a creative soul to define a generic interconnect input description and handling of the interconnect within the simulation, plus the programming to define the simulator. The student would be expected to talk with us about the relevant interconnect features which must be included, and then to (rather independently) build a simulator for this purpose. Obviously, the work needs to be documented well enough so that others can use the simulator.

Once the simulator is completed, the student can either go off to other research, or continue on this project-- becoming involved with simulating different interconnects and understanding what (detailed) attributes are needed for the codes of interest. Either would be acceptable to me.

I'd like to get a simulation code in place by January of 97 at the latest.

Suggested by Bob Deri ( Bob.Deri@quickmail.llnl.gov)

Practical Implementations of ILP

Determine the degree of scalarity that would be practical under realistic physical design constraints for a commercial microarchitecture such as x86, PowerPC, SPARC, Alpha or MIPS.

Use traces or write a simulator that can execute binary code in the target architecture.

Compare results for both in-order and out-of-order execution in hardware. Out-of-order would require the assumption of a certain size reorder window for load/stores and a certain size reorder window for non-load/stores, or a single reorder window for both (HP 8200 has separate windows.)

Assume a certain pipeline (number of stages, etc) and a certain number of execution units of each type (branch, compare, integer, load/store, floating-point)

Consider constraints such as:

  1. 1 vs. 2 load/store units
  2. maximum number of unresolved branches allowed
  3. whether more than one branch could be resolved per cycle
  4. ability to compress two dependent adds into a single cycle (or to compress an add plus the address calculation for a dependent load/store into a single cycle)
  5. 1 vs. 2 instructions per cycle that can set condition codes
  6. availability or lack of rename registers in the fp unit

Suggested by George Taylor ( taylor@exp.com)

The following suggestions were from David Culler for CS 262. Some may be applicable as well for CS 252, some (i.e. many) may not. Use your own discretion.

Stack vs. Register Machines

A re-evaluation of Stack vs reg ala Amdahl, Blau, and Brooks in the presence of multiple issue and serious renaming.

Suggested by David Culler ( culler@cs.berkeley.edu)

New Programming models on NOW

Shared virtual memory systems have mostly been developed in the context of TCP/IP class networks (or highly specialized devices). Identify the best candidate SVM system and port to Active Message on NOW. Run and compare on Splash.

Suggested by David Culler ( culler@cs.berkeley.edu)

Port HPF, PC++, CC++, NESL environments to NOW

Understanding system behavior in a large scale cluster. Instrumentation and visualization of system performance, utilization.

Suggested by David Culler ( culler@cs.berkeley.edu)

Toolbox for Persistent Applications

Most applications that run for a long time are not monolithic. They often evaluate a complex operations over a region of a parameter space, which may be determined dynamically or even via search. Extend the smart-clients model to one of interacting with persistant "application servers". Adaptation to load and configuration changes, as well as fault resilience are inherent in the toolbox.

Suggested by David Culler ( culler@cs.berkeley.edu)

Resource allocation and starvation in a "successful NOW" with persistent applications

Suggested by David Culler ( culler@cs.berkeley.edu)

Resource Allocation, negotiation, coordination in a serverless server

Suggested by David Culler ( culler@cs.berkeley.edu)

Scheduling for Large Scale I/O problems

BIG I/O problems need to be well co-ordinated, and there is time to do it

Suggested by David Culler ( culler@cs.berkeley.edu)

Parallel interface to XFS

Vesta, HPFS, PFS may be the wrong direction.

Suggested by David Culler ( culler@cs.berkeley.edu)

Network Interface Pooling in Clusters of SMPs

In a NOW there are many factors that lead toward a system configuration of one network interface per processor. SMP nodes offer a new degree of freedom, since multiple processors share multiple network interfaces. This allows a single processor to burst data at a much larger rate by striping communication across the interfaces. It also allows the network cost to be reduced by sharing interfaces in a ratio of less than one-to-one. The effects of these factors has yet not been well characterized.

Suggested by David Culler ( culler@cs.berkeley.edu)

Network Desgin in Clusters of SMPs

Investigate how this degree of freedom at the nodes impact on the design of the network itself. With multiple interfaces, each node attaches to the network fabric at several points. This dramatically changes the fault tolerance properties of the network as well as its behavior under contention. It appears likely that many of the benefits obtained by using randomness within the network can be realized in practice by simply arbitrating among multiple ports into the network at each node.

Suggested by David Culler ( culler@cs.berkeley.edu)

Load balancing in Clusters of SMPs

Statistical analysis shows that systems on the scale of a thousand processors will lose substantial efficiency from even very slight variations in workload, such as might arise from cache effects, floating-point unit effects, pipeline stalls and the like. In practice, numerous factors make it difficult to achieve a near perfect workload distribution. Balancing work within and SMP is considerably easier and cheaper than balancing work between nodes. Our initial results suggest that the ability to dynamically balance work within and SMP node significantly improves the efficiency of the overall system. In addition to balancing work between processors within an SMP, there is opportunity to dedicated processors to communication tasks. This potentially reduced synchronization costs while improving attentiveness to the network. To date, there is little systematic exploration of the potential alternatives.

Suggested by David Culler ( culler@cs.berkeley.edu)

Is it fundamental that NT does not virtualize well?

Suggested by David Culler ( culler@cs.berkeley.edu)

If the desktop is nothing but Java stations (NCs), what is the OS on the server?

Suggested by David Culler ( culler@cs.berkeley.edu)

What is the macrolevel (entire division) internet service requirements?

- email(t), web (t) How does it scale with organizational size? Caching, prefetching, watchdogs - Departmental file service?

Suggested by David Culler ( culler@cs.berkeley.edu)

On-line web tracking of AC transit via GPS

Suggested by David Culler ( culler@cs.berkeley.edu)

Prototype design for the home computer (not the office or the game room)

Suggested by David Culler ( culler@cs.berkeley.edu)

Fundamental challenges underlying gigabit ethernet

Suggested by David Culler ( culler@cs.berkeley.edu)

Floating-point registers: caller vs. callee save

One interesting project would be the division of floating point registers into caller saved and callee saved registers.

My survey on the comp.compilers indicated that all other system vendors have at least some callee-saved registers, and Sun is alone in having all registers saved by the caller. This has a performance impact. It would be interesting to find out how much it is.

It doesn't have to be a Sparc-centric study, but a general evaluation based on a generic RISC processor like DLX perhaps.

Suggested by SME architecture group at Sun, via Robert Garner ( robert.garner@Eng.Sun.COM)

Prepare-to-branch vs. branch-prediction (static vs. dynamic; fairness)

Suggested by SME architecture group at Sun, via Robert Garner ( robert.garner@Eng.Sun.COM)

Out-of-order vs. in-order machines: Tradeoffs in hardware/software complexity

Complex machines such as the HP PA8000 use a large window of available instructions and hardware scheduling with register renaming in an attempt to issue more instructions per clock. Is the extra complexity in such processors justified? Can compilers close the gap between such machines and simpler in-order machines through profile feedback and static scheduling?

If they can run their traces through appropriate in-order/out-of-order models (simulators), it would be interesting to see what they come up with.

Suggested by SME architecture group at Sun, via Robert Garner ( robert.garner@Eng.Sun.COM)

Do out-of-order machines need small L1 caches that are very close (low L1 load latency) as much as do in-order machines?

What is the (perf) sensitivity of an out-of-order machine to L1 load latency? If a latency of say up to 5 clocks can be hidden well, then should we just build the largest cache we can at the 5 clock latency that the machine can tolerate?

If the machine is capable of hiding the extra latency, why have a smaller cache that is closer?

This problem can also be studied with the traces and simulators they have.

Suggested by SME architecture group at Sun, via Robert Garner ( robert.garner@Eng.Sun.COM)

Back to CS252 home page...