
CS252: Fall 1999 Project Suggestions <Under construction!>


This document is divided into several sections. The first several describe research efforts at Berkeley that include projects that might be suitable for CS252. At the end are more generic architecture-related projects.

DynaCOMP: The Introspective Computing Project
Suggested by John Kubiatowicz (kubitron@cs.berkeley.edu):

This project is investigating new computing paradigms in which the traditional hardware functionality of a CPU is replaced by feedback-driven, continuous dynamic compilation and execution. Other researchers have called this "introspective computing," since part of execution involves monitoring a running process and changing its behavior in order to optimize performance, power utilization, or other metrics. Note that modern processors such as the Pentium II have hardware "compilers" which translate x86 instructions directly into internal micro-operations. Since this translation is done in hardware, none of the more sophisticated compiler optimizations are possible. An introspective computing processor could compile and recompile many times, optimizing code based on runtime information. Of particular interest are new ways of exploiting runtime information to perform this type of optimization. Some of the prediction techniques discussed in class might be appropriate here. So might genetic algorithms.
As a class project, make use of the SimpleScalar Introspective Simulator to explore some introspective techniques. Come up with an architecture which includes both execution units and monitoring units (these might be the same), and a scheme for exploiting the monitoring process. A complete solution here would be a PhD thesis, so you should consider small pieces of this.

  1. Come up with software versions of various prefetching or branch prediction algorithms

    Can you use a secondary processor to track the behavior of load/store addresses or of branches to improve the performance of processors? If you choose to do something like this, try to be more ambitious with your algorithms than a hardware designer would. Use the introspective simulator to set up an architecture for evaluation. Feedback to the running program could consist of prefetches inserted automatically into the memory access stream of the primary processor, updates to hardware branch prediction tables, or perhaps direct insertion of branch prediction information into the running instructions (changing of static prediction bits over time?).

  2. Node splitting for better branch prediction

    Set up an introspective configuration that monitors the behavior of branches and decides when to perform node-splitting optimizations to improve static branch prediction.

  3. Architecture for Dynamic Compilation

    Is there some set of instructions or hardware mechanisms that could be added to an architecture to make dynamic compilation easier, faster, or more efficient? See if you can port the vcode interface from MIT or the Trimaran ELCOR back-end to the SimpleScalar architecture and potentially add your new instructions in support of these codes.

  4. Recompilation for Power Savings

    As mentioned a number of times in class, power dissipation is currently a major problem in architectures. Come up with some way to exploit the Introspective Computing concept to save power. What monitoring of execution would be appropriate? How would you alter the execution based on this to save power? This is pretty open-ended, and you are guaranteed to get a good conference paper out of it if you come up with something.

  5. Data Value Prediction

    Can you do a better job of predicting data values with an introspective architecture? This would involve coming up with code to do such prediction in a secondary processor, complete with some way of feeding information back to the primary processor.

  6. Optimistic Specialization

    Write an introspective monitoring module that is capable of recognizing constant values in a running program and potentially exploiting them through recompilation. This is a variant of the previous project. Recognize code sequences that are repeated and figure out how to bypass them by inserting their results without recomputing (memoization). Do you do this by recompiling? By freezing the primary processor and inserting results directly? Consider using the SEQUITOR algorithm for this.

  7. Parallelism Extraction

    Figure out how an introspective computing architecture might be able to automatically extract parallelism from running code (loops?) and split this code out into an explicitly parallel version.

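As one small illustration of the first project in this list (a software monitor that tracks load addresses and feeds prefetches back to the primary processor), here is a minimal sketch of a per-PC stride detector. The class names, the confidence threshold, and the `observe` interface are all hypothetical; none of them come from the SimpleScalar Introspective Simulator, and a real project would presumably try something more ambitious than this hardware-style scheme.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class StrideEntry:
    last_addr: int       # most recent address seen for this load PC
    stride: int = 0      # last observed address delta
    confidence: int = 0  # saturating counter in the range 0..3


class StridePrefetcher:
    """Hypothetical monitor: watches (pc, addr) pairs and suggests prefetches."""

    def __init__(self, threshold: int = 2) -> None:
        self.table: Dict[int, StrideEntry] = {}
        self.threshold = threshold  # assumed value, chosen for illustration

    def observe(self, pc: int, addr: int) -> Optional[int]:
        """Record one load; return an address to prefetch, or None."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = StrideEntry(last_addr=addr)
            return None
        stride = addr - entry.last_addr
        if stride == entry.stride and stride != 0:
            # Same non-zero stride as last time: grow confidence.
            entry.confidence = min(entry.confidence + 1, 3)
        else:
            # Stride changed: remember the new one and start over.
            entry.stride = stride
            entry.confidence = 0
        entry.last_addr = addr
        if entry.confidence >= self.threshold:
            return addr + entry.stride
        return None


monitor = StridePrefetcher()
suggestions = [monitor.observe(0x40, a) for a in (100, 108, 116, 124)]
```

With the default threshold, the monitor stays silent until it has seen the same stride confirmed twice, then suggests the next address in the sequence.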
OceanStore: The Oceanic Data Utility
Suggested by John Kubiatowicz (kubitron@cs.berkeley.edu):
The OceanStore project is considering issues associated with a global data utility.  It is considering how to make data available anywhere, anytime in a world infrastructure that is untrusted and unreliable.  Check out the OceanStore proposal for more details.
  1. Mechanisms for data location with promiscuous caching

    The OceanStore proposal talks about a data location mechanism based on a form of attenuated Bloom filters. There are several possible projects here. All of them involve collecting data about file usages/distributions from real systems to develop a realistic model of how data might be spread across the world in an OceanStore system.

  2. Mechanisms for erasure codes in an OceanStore system

    Explore the algorithmic and computational requirements of erasure codes in an OceanStore-like system. Tornado codes (Mike Luby) are of particular interest because they are linear-time encodable/decodable. Can these codes be implemented so that they are fast enough on normal hardware without hardware acceleration? Perhaps compare their implementation on vector processors vs. scalar processors. Perhaps construct a realistic simulation of their use in a world-wide system. Perhaps come up with interesting variants on distribution systems such as appear in the intermemory project (www.intermemory.org).

  3. Explorations of Introspective Computing for data distribution

    One key aspect of the OceanStore system is that it is a complicated optimization problem. Figure out extensions of the "semantic distance" metrics from the Ficus project at UCLA (see, for instance, the paper on Automated Hoarding and the Seer system from Geoff Kuenning's homepage). Can you come up with better/more interesting clustering algorithms for OceanStore? Can you match these against real data from Soda Hall servers?

  4. Explore the complexities of Incremental Cryptography

    Is there one particular type of Incremental Cryptography that is better than another for high-speed, continuous usage in OceanStore? Actually implement an incremental mechanism for signing of data (derived from SHA-1) or for encryption (algorithm of your choice). Compare and contrast against multiple hardware architectures (trying a vector processor such as VIRAM is an obvious choice).

  5. State Machine model of a network server

    Possible collaboration with John Kubiatowicz (kubitron@cs.berkeley.edu) and Eric Brewer (brewer@cs.berkeley.edu)

    The normal software model for a network server involves a complete UNIX-like operating system (or worse, i.e. NT) with many threads, one per ongoing transaction. An alternative model is to view each ongoing transaction as a state machine. Each point at which a thread would have been put to sleep in the normal threaded model would correspond to a state-machine arc in the new model. Such a system could have a really thin layer of software with a small number of threads that never block, but rather continuously grab the next input message or event, run the appropriate state machine to its next arc, then go on to the next input event or message. Figure out how this would work and how a person would specify such a system for a server-based application such as OceanStore. Would this model make the hardware requirements simple enough that something like Intel's new IXP2000 network processor could be attached to a disk and a network and process requests at network bandwidth?
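    The state-machine server model described above can be sketched in a few lines. This is only an illustration under stated assumptions: the two states, the event tuples `(kind, txn_id, payload)`, and the `outbox` of requested actions are hypothetical names invented here, not part of any OceanStore design.

```python
from collections import deque

# Hypothetical states for a single read transaction.
WAIT_REQUEST, WAIT_DISK, DONE = "WAIT_REQUEST", "WAIT_DISK", "DONE"


class Transaction:
    def __init__(self, txn_id):
        self.txn_id = txn_id
        self.state = WAIT_REQUEST


def step(txn, kind, payload, outbox):
    """Follow one arc of the state machine; never blocks."""
    if txn.state == WAIT_REQUEST and kind == "request":
        # Where a thread would sleep on disk I/O, we just record the
        # outstanding operation and return to the event loop.
        outbox.append(("disk_read", txn.txn_id, payload))
        txn.state = WAIT_DISK
    elif txn.state == WAIT_DISK and kind == "disk_done":
        outbox.append(("send", txn.txn_id, payload))
        txn.state = DONE
    else:
        raise ValueError(f"unexpected {kind!r} in state {txn.state}")


def run(events):
    """Single loop: take the next event, advance its machine one arc."""
    pending = deque(events)
    transactions = {}
    outbox = []
    while pending:
        kind, txn_id, payload = pending.popleft()
        if txn_id not in transactions:
            transactions[txn_id] = Transaction(txn_id)
        step(transactions[txn_id], kind, payload, outbox)
    return transactions, outbox


events = [
    ("request", 1, "a"),
    ("request", 2, "b"),
    ("disk_done", 2, "B-data"),  # completions may arrive out of order
    ("disk_done", 1, "A-data"),
]
transactions, outbox = run(events)
```

    Two transactions are interleaved here by a single loop with no per-transaction thread; the only per-transaction storage is the `state` field.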

The IRAM Project

The IRAM project has several topics that need investigation.
  1. Look at scaling the IRAM architecture to multiple processors.

    Suggested by Kathy Yelick (yelick@cs.berkeley.edu)

    The idea is to start with some relatively well-understood algorithms from data mining, document retrieval (LSI), vision, or image/signal processing, write the kernels for VIRAM, and build a performance model to predict overall performance on a larger network of IRAMs, possibly checking that the model is reasonably close to predicting performance on the NOW. I've done this for 3D FFT. The algorithms need to be relatively regular, or the modeling part is too hard. The architecture part is to look at system balance for a large-scale (say 10K processor) system. How fat a network would you need? Is there enough memory per processor on a 16 MB, ... , 100 MB DRAM with an IRAM processor on it?

  2. Explore the extent to which the new VIRAM compiler is able to extract parallelism.

    Suggested by Kathy Yelick (yelick@cs.berkeley.edu)

    There will soon be a compiler for VIRAM, derived from a Cray compiler. It runs right now, but we're waiting for some tool upgrades so that the generated code will work on the simulator. So far, all IRAM studies have been done using hand-coded algorithms. This project would look at the features for VIRAM.

  3. Compiler extensions for vector fixed-point operations on VIRAM

    Suggested by Kathy Yelick (yelick@cs.berkeley.edu)

    Add some simple extensions to the vectorizing compiler to generate fixed-point arithmetic instructions. Currently, the compiler can vectorize floating-point and integer loops, but it can't generate the fixed-point instructions because they would change the semantics. Still, there seem to be some common programming idioms, and there must be a better way to support this than assembly programming. Some companies have their own extensions to C for fixed point, so that's a place to start.

  4. Impact of Memory Models (and synch instructions) on VIRAM code

    Suggested by Kathy Yelick (yelick@cs.berkeley.edu)

    For performance, VIRAM has a much weaker memory model than most multiprocessors. The net effect is that "synch" instructions must be included at stages of the computation in order to maintain correct execution. In particular, this impacts the virtual-processor model of computation. Study the impact of these synchs on VIRAM applications. How would performance change with stronger coherence semantics?

  5. Memory management in an IRAM system with external DRAM

    Suggested last year by Christoforos Kozyrakis (kozyraki@cs.berkeley.edu)

    One question raised by the IRAM architecture is how to support expansion of the built-in memory. One option is to connect external DRAMs to the IRAM chip. This project would investigate tradeoffs in different methods of connecting the external DRAM, and, more importantly, would investigate how to structure this IRAM-based memory hierarchy. For example, should the on-chip memory be like a cache to the external DRAM (with cache- or disk-like paging), or should it be part of the same physical memory space? Who should manage the location of blocks in the hierarchy (hardware, OS, application, some combination)?

    This project will involve selecting one or more simple benchmarks that have large working sets (larger than the on-chip IRAM memory) and modelling the performance of these applications on different IRAM/external-memory configurations, either analytically or via simulation.

  6. Can multimedia-specific hardware be useful for scientific applications?

    Suggested last year by Randi Thomas (randit@cs.berkeley.edu)

    One curiosity of the IRAM architecture is that it is built around a vector processor, yet is targeted primarily at embedded multimedia applications rather than the large-scale scientific applications for which vector processors are typically used. This project investigates whether or not the multimedia-oriented hardware on VIRAM (fixed-point and DSP-like operations) could be leveraged to accelerate traditional scientific and supercomputer applications.

  7. Modeling vectorized scientific and multimedia applications

    Characterize the memory bandwidth requirements and access patterns of multimedia codes, and compare with scientific applications. Ideally, this would involve building a parameterized analytic modelling framework for the memory demands of such applications and the hardware they run on (like LogP did for communication in parallel programs/machines). The model might take into consideration bandwidth, latency, strides, ratio of memory to computation, vectorizability, etc.

  8. Open-ended combination of VIRAM and other architectures

    Suggested by John Kubiatowicz (kubitron@cs.berkeley.edu)

    Use VIRAM along with some other architecture to compare the relative advantages of vector processing. See last year's projects for examples.
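    For the modeling project in this list (a parameterized analytic model of memory demands), the simplest possible starting point is a two-parameter bound that compares compute time against memory transfer time. The sketch below is an assumption-laden illustration, not a proposed IRAM model: it assumes compute and memory traffic overlap perfectly, ignores latency, strides, and vectorizability, and all names and machine figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    peak_ops_per_s: float   # arithmetic throughput (hypothetical figure)
    mem_bytes_per_s: float  # sustained memory bandwidth (hypothetical figure)


@dataclass
class Kernel:
    ops: float          # total arithmetic operations
    bytes_moved: float  # total bytes loaded and stored


def predict(machine, kernel):
    """Return (estimated seconds, limiting resource) for one kernel.

    Assumes compute and memory transfers overlap perfectly, so the
    slower of the two determines the running time.
    """
    compute_s = kernel.ops / machine.peak_ops_per_s
    memory_s = kernel.bytes_moved / machine.mem_bytes_per_s
    if memory_s > compute_s:
        return memory_s, "memory"
    return compute_s, "compute"


machine = Machine(peak_ops_per_s=1e9, mem_bytes_per_s=4e9)
streaming = Kernel(ops=2e6, bytes_moved=16e6)  # 8 bytes per operation
blocked = Kernel(ops=64e6, bytes_moved=16e6)   # 0.25 bytes per operation

streaming_time, streaming_limit = predict(machine, streaming)
blocked_time, blocked_limit = predict(machine, blocked)
```

    A project-quality model would add the terms the project description mentions (latency, strides, vectorizability) and would be validated against simulation.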
     
     

The ISTORE Project

The ISTORE project is a spin-off of the IRAM project that is investigating the integration of processors (intelligence) into the storage systems of large-scale servers. An ISTORE system consists of a traditional front-end CPU or SMP, plus multiple so-called "Intelligent Disks" (IDISKs, disks with integrated processors) interconnected via a fast crossbar-switched network. It may also contain "Intelligent Memory" (IMEM, memory built out of IRAMs). The research issues in ISTORE are in how to adapt server applications (databases, scientific apps, etc.) to this new system model, and in how the system can provide better performance or runtime support to such applications based on the tighter coupling of processing and storage.

The following are some ISTORE-related projects. There are several people who may be able to help with these projects; talk to Aaron Brown (abrown@cs.berkeley.edu) if you're interested in one of the following projects.

The BRASS Project

The Berkeley Reconfigurable Architectures, Systems and Software (BRASS) Research Project is investigating issues involved in building high-performance reconfigurable computing systems. The following projects are related to BRASS.
  1. Implement an interesting application using SCORE (the virtual hardware model)

    Suggested by John Wawrzynek (johnw@cs.berkeley.edu)

    This is pretty open-ended, but would involve comparing an implementation of some algorithm in BRASS with an implementation of the same algorithm on a more conventional processor.

  2. Implement a TCP/IP stack in reconfigurable hardware

    Suggested by John Wawrzynek (johnw@cs.berkeley.edu)

    What is involved in connecting a reconfigurable architecture to the network? As has been stressed in class, the network is a key component of modern computer architecture. Figure out how to implement key aspects of IP and perhaps TCP under the BRASS architecture (using SCORE?). How does this compare with more traditional implementations in software on standard processor hardware? Are there advantages to the reconfigurable approach?

Miscellaneous Projects

This section contains projects that don't fit into simple categories:
 
  1. Explorations of the new IXP2000 Network processor from Intel.

    Suggested by John Kubiatowicz (kubitron@cs.berkeley.edu):

    What are the key architectural advantages of the new Intel network architecture? Are there interesting network applications that are now possible that weren't before? Figure out how to analyze the impact of the new network architecture on TCP/IP. Alternatively, would this chip be fast enough to support the state-machine description of an OceanStore server? (See OceanStore above.)

  2. Genetic algorithms in computer architecture

    Suggested by John Kubiatowicz (kubitron@cs.berkeley.edu):

    In class, we handed out Joel Emer's paper on using genetic algorithms to synthesize branch predictors, with impressive results. Genetic algorithms can probably be of use in other areas of computer architecture, such as data predictors and hardware prefetching. Figure out how to exploit genetic algorithms to design other aspects of hardware architectures.

  3. Data Value Prediction

    Suggested by John Kubiatowicz (kubitron@cs.berkeley.edu):

    Another thing that we discussed in class was the notion of "breaking the dataflow barrier" through Data Value Prediction. Although various architectures have been proposed for (1) predicting values and (2) exploiting these predictions, there is a tremendous amount of room for improvement. Come up with a value prediction strategy and a proposed architecture for exploiting it. Figure out how to evaluate it using a simulation model (Superscalar simulator from Wisconsin would be a good choice).

  4. Branch Prediction

    Suggested by John Kubiatowicz (kubitron@cs.berkeley.edu):

    Similar to the above, can you come up with a new branch-prediction algorithm? This is much-trodden territory, so you would have to make sure that you checked all the literature first. Implement your technique in a simulator such as SimpleScalar and evaluate it against previous techniques, taking into account hardware cost and complexity.

  5. Probability of Deadlock in Direct Network Interfaces

    Suggested by John Kubiatowicz (kubitron@cs.berkeley.edu)
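
    For the Branch Prediction project in this list, any new algorithm needs a baseline to be evaluated against. The sketch below shows one well-known baseline, a table of 2-bit saturating counters, together with a tiny accuracy harness. It is a standalone illustration and does not use SimpleScalar's interfaces; the table size, the direct use of the low PC bits as an index, and the synthetic loop trace are simplifying assumptions.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by low PC bits."""

    def __init__(self, table_bits=10):
        self.mask = (1 << table_bits) - 1
        # Counter values 0..3; 2 and 3 mean "predict taken".
        # Start every entry at 1 (weakly not-taken).
        self.counters = [1] * (1 << table_bits)

    def predict(self, pc):
        return self.counters[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.counters[i] = min(self.counters[i] + 1, 3)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)


def accuracy(predictor, trace):
    """Fraction of branches in trace predicted correctly.

    trace is a sequence of (pc, taken) pairs; the predictor is
    trained on each outcome after making its prediction.
    """
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(trace)


# Synthetic trace: one loop branch, taken 9 times then not taken,
# with the whole loop executed twice.
loop_once = [(0x400, True)] * 9 + [(0x400, False)]
trace = loop_once * 2
baseline = accuracy(TwoBitPredictor(), trace)
```

    On this trace the baseline mispredicts the first iteration and each loop exit, which is the kind of behavior a new predictor would be compared against, alongside its hardware cost.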
Maintained by John Kubiatowicz (kubitron@cs.berkeley.edu). Last modified 1 October 1999.