CS252: Fall 1999 Project Suggestions <Under construction!>
This document is divided into several sections. The first several describe
research projects at Berkeley that include work suitable for CS252.
The final sections list more generic architecture-related projects.
DynaCOMP: The Introspective Computing Project
Suggested by John Kubiatowicz (kubitron@cs.berkeley.edu):
This project is investigating new computing paradigms in which the traditional
hardware functionality of a CPU is replaced by feedback-driven, continuous
dynamic compilation and execution. This has been called "introspective
computing" by other researchers, since part of execution involves monitoring
the behavior of a running process and changing its behavior in order to
optimize performance, power utilization, or other metrics. Note that modern
processors such as the Pentium II have hardware "compilers" which translate
x86 instructions directly into internal micro operations. Since this translation
is done in hardware, none of the more sophisticated compiler optimizations
are possible. An introspective computing processor could compile and recompile
many times, optimizing code based on runtime information. Of particular
interest are new ways of exploiting runtime information to perform this
type of optimization. Some of the prediction techniques discussed in class
might be appropriate here. So might genetic algorithms.
As a class project, make use of the SimpleScalar Introspective Simulator
to explore some introspective techniques. Come up with an architecture
which includes both execution units and monitoring units (these might be
the same), and a scheme for exploiting the monitoring process. A complete
solution here would be a PhD thesis, so you should consider small pieces
of this.
-
Come up with software versions of various prefetching or branch prediction
algorithms
Can you use a secondary processor to track the behavior of load/store
addresses or of branches to improve the performance of processors?
If you choose to do something like this, try to be more ambitious with
your algorithms than a hardware designer would. Use the introspective
simulator to set up an architecture for evaluation. Feedback to the running
program could consist of prefetches inserted automatically into the memory
access stream of the primary processor, updates to hardware branch prediction
tables, or perhaps direct insertion of branch prediction information into
the running instructions (changing of static prediction bits over time?)
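As one illustration of a software monitor of this kind, the sketch below tracks load addresses per instruction and proposes a prefetch once the same nonzero stride has been seen twice in a row. The class name StridePrefetcher, the per-PC table, and the repeat-once confidence rule are assumptions made for illustration; none of this is part of the introspective simulator.

```python
class StridePrefetcher:
    """Per-PC stride detector: a software analogue of a hardware stride table."""

    def __init__(self):
        # pc -> (last_address, last_stride)
        self.table = {}

    def observe(self, pc, addr):
        """Record one load; return an address to prefetch, or None."""
        prefetch = None
        if pc in self.table:
            last_addr, last_stride = self.table[pc]
            stride = addr - last_addr
            # Propose a prefetch only when a nonzero stride repeats.
            if stride != 0 and stride == last_stride:
                prefetch = addr + stride
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, 0)
        return prefetch


def demo_stride():
    p = StridePrefetcher()
    # One load instruction (pc=0x400) walking an array with an 8-byte stride.
    return [p.observe(0x400, a) for a in (1000, 1008, 1016, 1024)]
```

Calling demo_stride() returns [None, None, 1024, 1032]: the first two loads only train the table, after which each load suggests the next address. A more ambitious software monitor could replace the single stride with a history of strides or with correlation across instructions.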
-
Node splitting for better branch prediction
Set up an introspective configuration that monitors the behavior of
branches and decides to perform node-splitting optimizations to improve
static branch prediction.
-
Architecture for Dynamic Compilation
Is there some set of instructions or hardware mechanisms that could
be added to an architecture to make dynamic compilation easier, faster, or more
efficient? See if you can port the vcode interface from MIT or the
Trimaran ELCOR back-end to the SimpleScalar architecture and potentially
add your new instructions to support them.
-
Recompilation for Power Savings
As mentioned a number of times in class, power dissipation is currently
a major problem in architectures. Come up with some way to exploit the
Introspective Computing concept to save power. What monitoring of execution
would be appropriate? How would you alter the execution based on this to
save power? This is pretty open-ended, but you are practically guaranteed a
good conference paper if you come up with something.
-
Data Value Prediction
Can you do a better job of predicting data values with an introspective
architecture? This would involve coming up with code to do such prediction
in a secondary processor, complete with some way of feeding information
back to the primary processor.
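A starting point for such prediction code might look like the sketch below: a last-value predictor with a small confidence counter, of the kind a secondary processor could run over a trace of (pc, value) pairs. The class name, the threshold of 2, and the counter ceiling of 3 are illustrative assumptions, and last-value prediction is only the simplest of the published schemes.

```python
class LastValuePredictor:
    """Predict that an instruction produces the same value as last time."""

    def __init__(self, threshold=2, max_conf=3):
        self.threshold = threshold  # confidence needed before predicting
        self.max_conf = max_conf    # saturation limit of the counter
        self.table = {}             # pc -> [last_value, confidence]

    def predict(self, pc):
        entry = self.table.get(pc)
        if entry is not None and entry[1] >= self.threshold:
            return entry[0]
        return None  # not confident enough to predict

    def update(self, pc, actual):
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = [actual, 0]
        elif entry[0] == actual:
            entry[1] = min(entry[1] + 1, self.max_conf)
        else:
            entry[0] = actual
            entry[1] = 0


def run_trace(trace):
    """Return (predictions made, predictions correct) for (pc, value) pairs."""
    p = LastValuePredictor()
    made = correct = 0
    for pc, value in trace:
        guess = p.predict(pc)
        if guess is not None:
            made += 1
            correct += (guess == value)
        p.update(pc, value)
    return made, correct
```

The two numbers returned by run_trace give coverage and accuracy, which are the quantities you would want to compare against a hardware predictor of similar cost.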
-
Optimistic Specialization
Write an introspective monitoring module that is capable of recognizing
constant values in a running program and potentially exploiting them through
recompilation. This is a variant of the previous project. Recognize
code-sequences that are repeated and figure out how to bypass them by inserting
their results without recomputing (memoization). Do you do this by
recompiling? By freezing the primary processor and inserting results
directly? Consider using the SEQUITOR algorithm for this.
-
Parallelism Extraction
Figure out how an introspective computing architecture might be able
to automatically extract parallelism from running code (loops?) and split
this code out into an explicitly parallel version.
OceanStore: The Oceanic Data Utility
Suggested by John Kubiatowicz (kubitron@cs.berkeley.edu):
The OceanStore project is considering issues associated with a global
data utility. It is considering how to make data available anywhere,
anytime in a world infrastructure that is untrusted and unreliable.
Check out the OceanStore proposal for
more details.
-
Mechanisms for data location with promiscuous caching
The OceanStore proposal talks about a data location mechanism based
on a form of attenuated Bloom filters. There are several possible
projects here. All of them involve collecting data about file
usages/distributions from real systems to develop a realistic model of
how data might be spread across the world in an OceanStore system.
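For experimenting with the data-location mechanism, a toy model may help. The sketch below assumes that an attenuated Bloom filter is an array of ordinary Bloom filters, one per hop distance along a given link; check the OceanStore proposal for the actual definition. The class names, the filter size of 256 bits, and the use of SHA-1 to derive hash positions are all illustrative choices.

```python
import hashlib


class BloomFilter:
    def __init__(self, nbits=256, nhashes=3):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = 0  # a Python int used as a bit vector

    def _positions(self, key):
        # Derive nhashes bit positions from salted SHA-1 digests of the key.
        for i in range(self.nhashes):
            digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.nbits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means definitely absent; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class AttenuatedBloomFilter:
    """One Bloom filter per hop distance along a single outgoing link."""

    def __init__(self, depth=3):
        self.levels = [BloomFilter() for _ in range(depth)]

    def add(self, key, hops):
        self.levels[hops].add(key)

    def nearest_hit(self, key):
        """Smallest hop count whose filter may contain key, else None."""
        for hops, level in enumerate(self.levels):
            if level.might_contain(key):
                return hops
        return None
```

A node would keep one such structure per neighbor and forward a query along the link with the smallest nearest_hit. Feeding it file distributions gathered from real systems would show how filter size and depth trade off against false positives.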
-
Mechanisms for erasure codes in an OceanStore system
Explore the algorithmic and computation requirements of erasure codes
in an OceanStore-like system. Tornado codes (Mike Luby) are of particular
interest because they are linear-time encodeable/decodeable. Can
these codes be implemented so that they are fast enough on normal hardware
without hardware acceleration? Perhaps compare their implementation
on vector processors vs scalar processors. Perhaps construct a realistic
simulation of their use in a world-wide system. Perhaps
come up with interesting variants on distribution systems such as appear
in the intermemory project (www.intermemory.org).
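To make the idea of an erasure code concrete, here is the simplest possible one: a single XOR parity block that lets any one lost fragment be rebuilt. This is not a Tornado code and tolerates only one erasure; it is meant only as a baseline for timing experiments. The function names encode and recover are made up for this sketch.

```python
def encode(blocks):
    """Append one XOR parity block to k equal-length data blocks."""
    parity = bytes(len(blocks[0]))  # all-zero block of the right length
    for b in blocks:
        parity = bytes(x ^ y for x, y in zip(parity, b))
    return list(blocks) + [parity]


def recover(fragments):
    """Rebuild the single missing fragment (marked None) by XORing the rest."""
    missing = [i for i, f in enumerate(fragments) if f is None]
    if len(missing) != 1:
        raise ValueError("this sketch expects exactly one erasure")
    present = [f for f in fragments if f is not None]
    rebuilt = bytes(len(present[0]))
    for f in present:
        rebuilt = bytes(x ^ y for x, y in zip(rebuilt, f))
    out = list(fragments)
    out[missing[0]] = rebuilt
    return out
```

Because the XOR of all k+1 fragments is zero, XORing the survivors yields the lost one. Both encoding and recovery are linear in the data size, which is the property that makes the more powerful linear-time codes attractive.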
-
Explorations of Introspective Computing for data distribution
One key aspect of the OceanStore system is that it is a complicated
optimization problem. Figure out extensions of the "semantic
distance" metrics from the Ficus project at UCLA (see, for instance,
the paper on Automated Hoarding and the Seer system from Geoff
Kuenning's homepage). Can you come up with better/more interesting
clustering algorithms for OceanStore? Can you match these against
real data from Soda Hall servers?
-
Explore the complexities of Incremental Cryptography
Is there one particular type of Incremental Cryptography that is better
than another for high-speed, continuous usage in OceanStore? Actually
implement an incremental mechanism for signing of data (derived from SHA-1)
or for encryption (algorithm of your choice). Compare and contrast against
multiple hardware architectures (trying a vector processor such as VIRAM
is an obvious choice).
-
State Machine model of a network server
Possible collaboration with John Kubiatowicz (kubitron@cs.berkeley.edu)
and Eric Brewer (brewer@cs.berkeley.edu)
The normal software model for a network server involves a complete UNIX-like
operating system (or worse, i.e. NT) with many threads, one per ongoing
transaction. An alternative model is to view each ongoing transaction
as a state machine. Each point at which a thread would have been put
to sleep in the normal threaded model would correspond to a state-machine
arc in the new model. Such a system could have a really thin layer
of software with a small number of threads that never block, but
rather continuously grab the next input message or event, run the appropriate
state machine to its next arc, then go on to the next input event or message.
Figure out how this would work and how a person would specify such a system
for a server-based application such as OceanStore. Would this model
make the hardware requirements simple enough that something like Intel's
new IXP2000 network processor could be attached to a disk and a network
and process requests at network bandwidth?
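The event loop described above might be sketched as follows. The three states (IDLE, WAIT_DISK, DONE), the event fields, and the handler names are hypothetical; a real server would have many more states and would issue the returned actions asynchronously rather than collecting them in a list.

```python
from collections import deque


# Hypothetical transaction: accept a request, wait for the disk, then reply.
def on_request(txn, event):
    txn["key"] = event["key"]
    return "WAIT_DISK", [("disk_read", txn["id"], txn["key"])]


def on_disk_done(txn, event):
    return "DONE", [("send_reply", txn["id"], event["data"])]


# (current state, event type) -> handler that follows one arc
TRANSITIONS = {
    ("IDLE", "request"): on_request,
    ("WAIT_DISK", "disk_done"): on_disk_done,
}


def run(events):
    """Non-blocking loop: each event advances one state machine by one arc."""
    queue = deque(events)
    txns = {}     # transaction id -> state record
    actions = []  # side effects the thin runtime would issue
    while queue:
        event = queue.popleft()
        txn = txns.setdefault(event["txn"], {"id": event["txn"], "state": "IDLE"})
        handler = TRANSITIONS.get((txn["state"], event["type"]))
        if handler is None:
            continue  # event is not valid in this state; drop it
        txn["state"], out = handler(txn, event)
        actions.extend(out)
    return txns, actions
```

Where a threaded server would block in a disk read, this version records WAIT_DISK and returns to the loop, so one thread can interleave any number of transactions. The per-transaction record replaces the thread stack, which is what might make the model small enough for a network processor.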
The IRAM Project
The IRAM project has several
topics that need investigation.
-
Look at scaling the IRAM architecture to multiple processors.
Suggested by Kathy Yelick (yelick@cs.berkeley.edu)
The idea is to start with some relatively well-understood algorithms
from data-mining, document retrieval (LSI), vision, image/signal processing
and write the kernels for VIRAM, and build a performance model to predict
overall performance on a larger network of IRAM, possibly checking that
the model is reasonably close to predicting performance on the NOW.
I've done this for 3D FFT. The algorithms need to be relatively regular,
or the modeling part is too hard. The architecture part is to look
at system balance for a large-scale (say 10K processor) system. How
fat a network would you need? Is there enough memory per processor
on a 16 MB, ... , 100 MB DRAM with an IRAM processor on it?
-
Explore the extent to which the new VIRAM compiler is able to extract
parallelism.
Suggested by Kathy Yelick (yelick@cs.berkeley.edu)
There will soon be a compiler for VIRAM, derived from a Cray compiler.
It runs right now, but we're waiting for some tool upgrades so that the
generated code will work on the simulator. So far, all IRAM studies
have been done using hand-coded algorithms. This would look at the features
for VIRAM.
-
Compiler extensions for vector fixed-point operations on VIRAM
Suggested by Kathy Yelick (yelick@cs.berkeley.edu)
Add some simple extensions to the vectorizing compiler to generate fixed
point arithmetic instructions. Currently, the compiler can vectorize fp
and int loops, but it can't generate the fixed point instructions because
they would change the semantics. Still, there seem to be some common
programming idioms, and there must be a better way to support this than
ASM programming. Some companies have their own extensions to C for
fixed point, so that's a place to start.
-
Impact of Memory Models (and synch instructions) on VIRAM code
Suggested by Kathy Yelick (yelick@cs.berkeley.edu)
For performance, VIRAM has a much weaker memory model than most multiprocessors.
The net effect is that "synch" instructions must be included at stages
of the computation in order to maintain correct execution. In particular,
this impacts the virtual-processor model of computation. Study the impact
of these synchs on VIRAM applications. How would performance change
with stronger coherence semantics?
-
Memory management in an IRAM system with external DRAM
Suggested last year by Christoforos Kozyrakis (kozyraki@cs.berkeley.edu)
One question raised by the IRAM architecture is how to support expansion
of the built-in memory. One option is to connect external DRAMs to the
IRAM chip. This project would investigate tradeoffs in different methods
of connecting the external DRAM, and, more importantly, would investigate
how to structure this IRAM-based memory hierarchy. For example, should
the on-chip memory be like a cache to the external DRAM (with cache- or
disk-like paging), or should it be part of the same physical memory space?
Who should manage the location of blocks in the hierarchy (hardware, OS,
application, some combination)?
This project will involve selecting one or more simple benchmarks that
have large working sets (larger than the on-chip IRAM memory) and modelling
the performance of these applications on different IRAM/external-memory
configurations, either analytically or via simulation.
-
Can multimedia-specific hardware be useful for scientific applications?
Suggested last year by Randi Thomas (randit@cs.berkeley.edu)
One curiosity of the IRAM architecture is that it is built around a
vector processor, yet is targeted primarily at embedded multimedia applications
rather than the large-scale scientific applications for which vector processors
are typically used. This project investigates whether or not the multimedia-oriented
hardware on VIRAM (fixed-point and DSP-like operations) could be leveraged
to accelerate traditional scientific and supercomputer applications.
-
Modeling vectorized scientific and multimedia applications
Characterize the memory bandwidth requirements and access patterns
of multimedia codes, and compare with scientific applications. Ideally,
this would involve building a parameterized analytic modelling framework
for the memory demands of such applications and the hardware they run on
(like LogP did for communication in parallel programs/machines). The model
might take into consideration bandwidth, latency, strides, ratio of memory
to computation, vectorizability, etc.
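A first cut at such a model could be as small as the sketch below, which assumes that computation and memory traffic overlap perfectly, so that run time is the larger of the two. The parameter names, the single stride_penalty factor, and the overlap assumption are all placeholders to be replaced by whatever your measurements support.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    peak_ops_per_s: float   # arithmetic throughput
    mem_bytes_per_s: float  # unit-stride memory bandwidth
    mem_latency_s: float    # startup latency per vector memory operation
    stride_penalty: float   # assumed bandwidth divisor for non-unit strides


@dataclass
class Kernel:
    ops: float              # arithmetic operations performed
    bytes_moved: float      # bytes loaded plus bytes stored
    vector_mem_ops: int     # number of vector loads and stores issued
    unit_stride: bool = True


def predicted_time(m, k):
    """Overlap model: the slower of compute and memory bounds the run time."""
    bandwidth = m.mem_bytes_per_s
    if not k.unit_stride:
        bandwidth /= m.stride_penalty
    memory = k.vector_mem_ops * m.mem_latency_s + k.bytes_moved / bandwidth
    compute = k.ops / m.peak_ops_per_s
    return max(compute, memory)
```

Fitting the Machine parameters from microbenchmarks and the Kernel parameters from traces, then comparing predicted_time with simulation, would show quickly whether multimedia and scientific codes fall on different sides of the compute/memory balance point.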
-
Open-ended combination of VIRAM and other architectures
Suggested by John Kubiatowicz (kubitron@cs.berkeley.edu)
Use VIRAM along with some other architecture to compare the relative
advantages of vector processing. See last year's projects for examples.
The ISTORE Project
The ISTORE project is a spin-off of the IRAM project that is investigating
the integration of processors (intelligence) into the storage systems of
large-scale servers. An ISTORE system consists of a traditional front-end
CPU or SMP, plus multiple so-called "Intelligent Disks" (IDISKs, disks
with integrated processors) interconnected via a fast crossbar-switched
network. It may also contain "Intelligent Memory" (IMEM, memory built out
of IRAMs). The research issues in ISTORE are in how to adapt server applications
(databases, scientific apps, etc.) to this new system model, and in how
the system can provide better performance or runtime support to such applications
based on the tighter coupling of processing and storage.
The following are some ISTORE-related projects. There are several people
who may be able to help with these projects; talk to Aaron Brown (abrown@cs.berkeley.edu)
if you're interested in one of the following projects.
The BRASS Project
The Berkeley Reconfigurable
Architectures, Systems and Software (BRASS) Research Project is investigating
issues involved in building high-performance reconfigurable computing systems.
The following projects are related to BRASS.
-
Implement an interesting application using the SCORE (virtual hardware
model)
Suggested by John Wawrzynek (johnw@cs.berkeley.edu)
This is pretty open-ended, but would involve comparing an implementation
of some algorithm in BRASS with an implementation of the same algorithm
on a more conventional processor.
-
Implement a TCP/IP stack in reconfigurable hardware
Suggested by John Wawrzynek (johnw@cs.berkeley.edu)
What is involved in connecting a reconfigurable architecture to the
network? As has been stressed in class, the network is a key component
of modern Computer Architecture. Figure out how to implement key
aspects of IP and perhaps TCP under the BRASS architecture (using SCORE?).
How does this compare with more traditional implementations in software
on standard processor hardware? Are there advantages to the reconfigurable
approach?
-
P11: Explore energy implications of reconfigurable implementation of
compute kernels.
Suggested by André DeHon (amd@cs.berkeley.edu) for last year's 252 class
Per raw bit operation, the potential energy costs of a low-voltage FPGA
and a low-power DSP or microprocessor are very similar. Once correlation
between bits in a datapath is taken into account, the energy may vary
considerably, perhaps by an order of magnitude.
In particular, a spatial (non-multiplexed) implementation on an FPGA
will have a low activation rate when data is highly correlated. The heavy
multiplexing and interleaving of operations on the processor will tend
to destroy the natural correlation in the data, yielding a higher activity
rate.
For some common kernels (maybe start with filters and transforms common
in signal/video processing), collect the data activity and estimate the
actual energy consumed on a processor and an FPGA implementation. The goal
would be to understand the source of potential benefits for the reconfigurable
architecture and quantify typical effects.
André DeHon (amd@cs) would give advice on this project. It would likely
involve:
-
Find code for simple benchmark kernels;
-
Build netlist for FPGA implementation;
-
Get an energy model for each (DeHon believes there are several in the literature
and around Berkeley so it's simply a matter of picking and elaborating);
-
Instrument appropriate level of simulations for processors and FPGA to
collect bit toggle rates (Again, there are probably several things around
to start with...just need to be tailored a bit to this task);
-
Run sample data and collect activity stats;
-
Use energy model to estimate energy consumed;
-
Reflect on results, identify sources of benefits (costs).
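The toggle-counting and energy-estimation steps above can be prototyped in a few lines before any real simulator is instrumented. In this sketch the 16-bit width, the energy_per_toggle constant, and the interleave helper (standing in for a time-multiplexed processor datapath) are illustrative assumptions; a real study would take the energy constant from a published model.

```python
def toggle_count(values, width=16):
    """Count bit flips between consecutive words on one datapath."""
    mask = (1 << width) - 1
    return sum(bin((a ^ b) & mask).count("1") for a, b in zip(values, values[1:]))


def estimated_energy(values, energy_per_toggle, width=16):
    """Energy = toggles * energy per toggle (a placeholder model constant)."""
    return toggle_count(values, width) * energy_per_toggle


def interleave(a, b):
    """Model a multiplexed datapath that alternates between two streams."""
    out = []
    for x, y in zip(a, b):
        out.extend((x, y))
    return out
```

For two slowly changing streams such as [100, 101, 102, 103] and [0, 1, 2, 3], each stream alone toggles 4 bits, while the interleaved sequence toggles 25. That is the effect described above: multiplexing destroys the correlation that a spatial implementation would preserve.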
Miscellaneous Projects
This section contains projects that don't fit into simple categories:
-
Explorations of the new IXP2000 Network processor from Intel.
Suggested by John Kubiatowicz (kubitron@cs.berkeley.edu):
What are the key architectural advantages to the new Intel network architecture?
Are there interesting network applications that are now possible that weren't
before? Figure out how to analyze the impact of the new network architecture
on TCP/IP. Alternatively, would this chip be fast enough to support
the state-machine description of an OceanStore server? (See OceanStore
above)
-
Genetic algorithms in computer architecture
Suggested by John Kubiatowicz (kubitron@cs.berkeley.edu):
In class, we handed out Joel Emer's paper on using genetic algorithms
to synthesize branch predictors, with impressive results. Genetic algorithms
can probably be of use in other areas of computer architecture, such as data predictors
and hardware prefetching. Figure out how to exploit genetic algorithms to design
other aspects of hardware architectures.
-
Data Value Prediction
Suggested by John Kubiatowicz (kubitron@cs.berkeley.edu):
Another thing that we discussed in class was the notion of "breaking
the dataflow barrier" through Data Value Prediction. Although various architectures
have been proposed for (1) predicting values and (2) exploiting these predictions,
there is a tremendous amount of room for improvement. Come up with a value
prediction strategy and a proposed architecture for exploiting it. Figure
out how to evaluate it using a simulation model (the Superscalar simulator
from Wisconsin would be a good choice).
-
Branch Prediction
Suggested by John Kubiatowicz (kubitron@cs.berkeley.edu):
Similar to above, can you come up with a new branch-prediction algorithm?
This is much-trodden territory, so you would have to make sure that you
checked all the literature first. Implement your technique in a simulator
such as SimpleScalar and evaluate it against previous techniques,
taking into account hardware cost and complexity.
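Before modifying a full simulator, it can help to have a trace-driven baseline to compare against. The sketch below implements the classic table of 2-bit saturating counters indexed by low PC bits; the class name, the 10-bit index, and the initial counter value are illustrative choices, and the (pc, taken) trace format is an assumption rather than SimpleScalar's.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by low PC bits."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.counters = [1] * (1 << index_bits)  # start weakly not-taken

    def predict(self, pc):
        # Counter values 2 and 3 predict taken; 0 and 1 predict not taken.
        return self.counters[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor, trace):
    """Fraction of (pc, taken) outcomes predicted correctly."""
    hits = 0
    for pc, taken in trace:
        hits += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return hits / len(trace)
```

On a loop branch that is taken nine times and then falls through, repeated ten times, this predictor scores 0.89: it misses once on warm-up and once per loop exit. Any new scheme can be dropped into accuracy() as long as it offers the same predict and update methods, and the table size gives a first handle on hardware cost.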
-
Probability of Deadlock in Direct Network Interfaces
Suggested by John Kubiatowicz (kubitron@cs.berkeley.edu)
In his PhD thesis, John Kubiatowicz explored the probability of deadlock
in message-passing multiprocessors with Direct Network Interfaces (e.g.
the Alewife interface). See the description of "DeadSIM" in Chapter 6 of the thesis,
available from the publications link on his homepage. This exploration was done
with probabilistic simulation in mesh networks of varying dimensions. Expand
this analysis to include: (1) networks with virtual channels, (2) networks
with automatic queueing to memory (such as the Wisconsin CNI interface).
Figure out how to make the message traffic more realistic by incorporating
actual multiprocessor traces. Will direct network interfaces with software
deadlock recovery do well in large systems/under heavy load?
Maintained by John Kubiatowicz
(kubitron@cs.berkeley.edu). Last modified 1 October 1999.