CS258: Spring 2008 Final Project Suggestions

You may need simulation infrastructure for some of the following project suggestions.  Some options:
  1. The Simics setup with the GEMS or Flexus add-ons for shared memory
  2. Sun's Adaptive Transactional Memory Test Platform (ATMTP).
    People considering using ATMTP are encouraged to read the following
    two Transact 2008 papers about it.  Please cite the first one in any
    publication describing use of ATMTP.

    http://research.sun.com/scalable/pubs/TRANSACT2008-ATMTP.pdf
    http://research.sun.com/scalable/pubs/TRANSACT2008-ATMTP-Apps.pdf

    The ATMTP developers would like to hear about your experience with
    ATMTP, positive or otherwise.  Send email to:

    atmtp-feedback AT sun.com

    to tell them about your plans or experience, to report any problems
    with the simulator or documentation, to request future announcements
    about ATMTP, or to contribute to future releases.

    ATMTP is brought to you by the Scalable Synchronization Research Group
    (http://research.sun.com/scalable) of Sun Microsystems
    Laboratories. For information about GEMS or to download the latest
    version of GEMS (which includes ATMTP) see the GEMS website
    (http://www.cs.wisc.edu/gems). GEMS and ATMTP are available as open
    source under the terms of version 2 of the GNU General Public License.
  3. RAMP implementations: any of the hardware suggestions given below could be implemented on RAMP with real processors.

Actual Project Suggestions:

  1. RAMP Blue suggestions (Krste)

    1. Add a better processor-network interface.  Explore multiple network interfaces?
    2. Adapt RAMP Blue to provide a testbed for trying out new network designs.
  2. Application work on real machines (Krste)

    1. Figure out how to build an autotuned map-reduce (or some other framework), and see what works best across a range of SMP architectures (Itanium, Intel/x86, AMD/x86, Niagara 1/2, with various socket counts).  A rough skeleton of what such a framework might look like appears after this list.
    2. Implement one of the Parlab dwarfs on two machines and compare different approaches.
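    For suggestion 1 above, here is a minimal sketch in C of an autotunable map-reduce skeleton.  The tuning parameters (num_workers, chunk) and the overall structure are assumptions made for illustration, not part of any existing framework; an autotuner would sweep these parameters per platform.

      /* Hypothetical autotunable map-reduce skeleton.  An autotuner would
       * sweep num_workers and chunk per platform and keep the best setting. */
      #include <pthread.h>
      #include <stdio.h>

      #define N 1000000

      static double input[N];
      static int num_workers = 4;    /* tuning parameter: one per core? per socket? */
      static int chunk = 10000;      /* tuning parameter: work granularity */

      typedef struct { int id; double partial; } worker_t;

      static void *map_reduce_worker(void *arg) {
          worker_t *w = (worker_t *)arg;
          double sum = 0.0;
          /* Each worker statically claims every num_workers-th chunk. */
          for (int start = w->id * chunk; start < N; start += num_workers * chunk) {
              int end = (start + chunk < N) ? start + chunk : N;
              for (int i = start; i < end; i++)
                  sum += input[i] * input[i];          /* "map" */
          }
          w->partial = sum;                            /* per-worker "reduce" */
          return NULL;
      }

      int main(void) {
          for (int i = 0; i < N; i++) input[i] = 1.0;
          pthread_t tid[64];
          worker_t w[64];
          for (int t = 0; t < num_workers; t++) {
              w[t].id = t;
              pthread_create(&tid[t], NULL, map_reduce_worker, &w[t]);
          }
          double total = 0.0;
          for (int t = 0; t < num_workers; t++) {
              pthread_join(tid[t], NULL);
              total += w[t].partial;                   /* final reduce */
          }
          printf("sum = %f\n", total);
          return 0;
      }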
  3. Performance Counters for Debugging and Autotuning (Kubi)

    Parlab is studying the general problem of how to provide a wide array of on-line performance counters to help debug performance, dynamically adapt computation, and find the best place in the power/performance space for a given application.  These projects need careful consideration of how you intend to evaluate your results.  You could either use a simulator or implement actual mechanisms in RAMP.
    1. Design a new performance-counter methodology and architecture.  How can the maximum number of things be measured at any one time without burning too much power/area?  How would the information be kept?  How would you perform consistent snapshots across a large CMP in order to understand causality (what causes what)?  How might you trigger on exceptional events (too much power, processor too busy, etc.) and invoke handlers?
    2. Design a tagging architecture that allows tracking of the source (processor/thread combination or code module) of messages and/or shared-memory traffic in a way that permits performance counters to detect bottlenecks in queues and/or memory interfaces and suggest which code needs to be changed.
    3. Develop a methodology for performance counting of shared resources in a CMP (message traffic, memory traffic, etc.).  What would make sense to count and why?  Would you have special nodes in the system aggregating information, and if so, what information would they collect?
    4. Can you develop a hardware mechanism for observing events in a way that permits deterministic replay of execution results?  Assume that large volumes of information will have to be stored and correlated to provide this replay mechanism.
    5. Pick an application and show how it might be dynamically autotuned based on performance counter information.  Define the counters and their interface to the application.  Assume the Parlab research domain: a handheld device with potentially limited resources.  Perhaps the tuning will reduce the amount of work being done based on circumstances (low power, insufficient network bandwidth): for example, a music or graphics application that attempts to do coarser-grained work when resources get overutilized.  Other options: figure out how a given application can be set up to autotune on-the-fly for each new platform and/or usage model.  A rough sketch of such a counter-driven adaptation loop appears after this list.
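    For suggestion 5 above, here is a minimal sketch in C of a counter-driven adaptation loop.  The counter IDs, thresholds, and read_counter() hook are invented for illustration and are not an existing interface; a real project would define exactly this kind of application-visible counter API.

      /* Hypothetical counter-driven adaptation loop.  read_counter() and the
       * counter IDs stand in for whatever hardware/OS interface you design. */
      #include <stdint.h>
      #include <stdio.h>

      enum { CTR_POWER_MW, CTR_NET_QUEUE };

      /* Stand-in for a real hardware/OS counter read. */
      static uint64_t read_counter(int id) { return id == CTR_POWER_MW ? 1500 : 10; }

      /* Application-supplied work, parameterized by granularity/quality. */
      static void process_next_frame(int grain) { printf("frame at grain %d\n", grain); }

      int main(void) {
          int grain = 1;                 /* 1 = finest-grained, highest-quality work */
          for (int iter = 0; iter < 5; iter++) {
              uint64_t power = read_counter(CTR_POWER_MW);
              uint64_t netq  = read_counter(CTR_NET_QUEUE);
              /* Coarsen when the platform looks overcommitted; refine when headroom returns. */
              if (power > 2000 || netq > 64) {
                  if (grain < 8) grain *= 2;
              } else if (grain > 1) {
                  grain /= 2;
              }
              process_next_frame(grain);
          }
          return 0;
      }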
  4. Network architectures (Kubi)

    1. Can you come up with a new and better network/router architecture for manycore systems?  Clearly you would have to improve on the state of the art in some way.  Analysis based on power consumption and area tradeoffs would be desirable.
  5. Message Passing Architectures (Kubi)

    1. Develop a hardware tagging architecture for messages that works with other processor partitioning mechanisms to provide fast security (in hardware) on every message, such as proposed for software by the Asbestos project (UCLA/MIT: http://asbestos.cs.ucla.edu/doku.php) or the HiStar project (Stanford: http://www.scs.stanford.edu/histar/).  How would you make the label comparisons extremely fast (caching/invalidation)?  What would the hardware interface to the security mechanisms be?  Could you enhance this idea with some sort of Quality of Service mechanism to prevent an unauthorized thread from tying up resources at a destination node by sending messages that will only be rejected?  A toy model of the label check that would have to be made fast appears after this list.
    2. Come up with a new message passing interface that is particularly well suited to manycore processors.  Would it provide interrupts at the destination?  How would you control the cost of these interrupts?  How would it be integrated with user/kernel distinctions, manycore partitioning, and DMA?  How could you ensure QoS such that different parts of the machine would have fair access to a node that was serving as (say) an I/O node?
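    For suggestion 1 above, here is a toy software model of the per-message label check the hardware would have to make fast.  The label encoding, flow rule, and comparison cache are heavy simplifications invented for illustration; real Asbestos/HiStar labels are considerably richer.

      /* Toy label check for messages.  A small (src, dst) -> verdict cache is
       * what would make the common case fast; a miss runs the full comparison. */
      #include <stdbool.h>
      #include <stdint.h>
      #include <string.h>
      #include <stdio.h>

      #define NCATEGORIES 8
      #define CACHE_SLOTS 16

      typedef struct { uint8_t level[NCATEGORIES]; } label_t;

      /* Full comparison: information may flow from src to dst only if src's
       * level in every category is <= dst's (a simplification of HiStar's rules). */
      static bool label_leq(const label_t *src, const label_t *dst) {
          for (int c = 0; c < NCATEGORIES; c++)
              if (src->level[c] > dst->level[c]) return false;
          return true;
      }

      /* Tiny direct-mapped cache of recent comparisons, as hardware might keep. */
      typedef struct { label_t src, dst; bool ok; bool valid; } lcache_entry_t;
      static lcache_entry_t cache[CACHE_SLOTS];

      static bool check_message(const label_t *src, const label_t *dst) {
          unsigned h = (unsigned)(src->level[0] * 31u + dst->level[0]) % CACHE_SLOTS;
          if (cache[h].valid &&
              memcmp(&cache[h].src, src, sizeof *src) == 0 &&
              memcmp(&cache[h].dst, dst, sizeof *dst) == 0)
              return cache[h].ok;                     /* fast path: cached verdict */
          bool ok = label_leq(src, dst);              /* slow path: full comparison */
          cache[h].src = *src;
          cache[h].dst = *dst;
          cache[h].ok = ok;
          cache[h].valid = true;
          return ok;
      }

      int main(void) {
          label_t sender   = { .level = {1} };        /* tainted in category 0 */
          label_t receiver = { .level = {0} };        /* not cleared for category 0 */
          printf("deliver? %d\n", check_message(&sender, &receiver));  /* 0: reject */
          printf("deliver? %d\n", check_message(&sender, &sender));    /* 1: allow  */
          return 0;
      }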
  6. Cache Coherent Multiprocessor suggestions (Krste, Kubi)

    1. Explore the space of dynamic outer-level cache management in CMPs.  Look at the various NUCA schemes, victim replication/migration, industry private/shared adaptive cache protocols, etc.  One thing that might be interesting is figuring out how to do a limit study of the best-case benefit from these schemes.
    2. Propose producer-consumer synchronization mechanisms that work in a CMP cache environment.  Examples: a follow-on to full/empty bits (sort of Alewife revisited, but with different communication tradeoffs).  A software model of the full/empty-bit idea appears after this list.
    3. Look at realistic interconnect traffic - though it's difficult to scale the simulators up to larger node counts where networks become interesting.
    4. Come up with an alternative to Mondriaan Memory Protection (http://www.cag.csail.mit.edu/scale/mondriaan/index.html) that utilizes HiStar labels (see #1 under Message Passing Architectures, above) to provide privileged access to shared memory lines (think of labeling the cache coherence directory!).
    5. Develop mechanisms for providing QoS to memory access when a manycore architecture is partitioned.  Could you guarantee some minimum level of memory access to each partition -- even when some partition is malicious?
    6. Transactional Memory: Come up with a new variant of transactional memory that advances the state of the art.  Evaluate this on a number of applications.
    7. Develop new mechanisms to detect incorrect software behavior such as data races, wild writes, etc., in such a way that violations of invariants are caught without high-overhead monitoring operations.  Show how a compiler could reflect dynamic "expectations" of behavior to the hardware so that the hardware can catch these violations and reflect them back to the user to aid in debugging.  Variant: something that can be used to detect causality and permit nondeterministic executions to be played back in a deterministic fashion for debugging.
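    For suggestion 2 above, here is a software model of full/empty-bit producer-consumer synchronization.  Real hardware would keep the full/empty tag alongside each memory word; this sketch models the tag as a separate atomic flag so the idea can run on a stock machine.

      /* Software model of full/empty-bit synchronization (one producer, one consumer).
       * The "full" flag stands in for a per-word hardware tag bit. */
      #include <stdatomic.h>
      #include <pthread.h>
      #include <stdio.h>

      typedef struct {
          atomic_int full;     /* 0 = empty, 1 = full (the "tag bit") */
          int        value;
      } fe_word_t;

      /* Producer: write-and-set-full; waits while the word is still full. */
      static void fe_write(fe_word_t *w, int v) {
          while (atomic_load_explicit(&w->full, memory_order_acquire) == 1)
              ;                                        /* not yet consumed */
          w->value = v;
          atomic_store_explicit(&w->full, 1, memory_order_release);
      }

      /* Consumer: read-and-set-empty; waits until the word becomes full. */
      static int fe_read(fe_word_t *w) {
          while (atomic_load_explicit(&w->full, memory_order_acquire) == 0)
              ;                                        /* nothing produced yet */
          int v = w->value;
          atomic_store_explicit(&w->full, 0, memory_order_release);
          return v;
      }

      static fe_word_t chan;

      static void *producer(void *arg) {
          (void)arg;
          for (int i = 0; i < 5; i++) fe_write(&chan, i);
          return NULL;
      }

      int main(void) {
          pthread_t p;
          pthread_create(&p, NULL, producer, NULL);
          for (int i = 0; i < 5; i++) printf("got %d\n", fe_read(&chan));
          pthread_join(p, NULL);
          return 0;
      }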
  7. Synchronization Networks (Kubi)

    1. Come up with a partitionable synchronization network that permits very fast barrier synchronization (or parallel prefix, etc.).  Synchronizations on the order of one or two cycles would be interesting and would allow a range of interesting SPMD applications, snapshots across a partition (perhaps integrated with performance counter support), etc.  The idea would be that you could partition a manycore machine into many pieces -- each running separate applications.  The network would work as well with multiple partitions as it would with a single chip-wide partition.  You would need to find applications to evaluate your technology, as well as figure out how to gracefully handle user-level access to and virtualization of synchronization resources within a partition.  A sketch of where such a barrier would sit in SPMD code appears below.
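    The pthread barrier in the sketch below is only a stand-in for the hypothetical one- or two-cycle partition-wide hardware barrier; the program just marks the points such a hardware operation would replace in a simple SPMD neighbor-exchange loop.

      /* SPMD neighbor-exchange loop; each pthread_barrier_wait() marks where a
       * one- or two-cycle hardware barrier would go. */
      #include <pthread.h>
      #include <stdio.h>

      #define NTHREADS 4
      #define STEPS    3

      static pthread_barrier_t barrier;          /* stand-in for the HW barrier */
      static int phase_data[NTHREADS];

      static void *spmd_worker(void *arg) {
          int id = (int)(long)arg;
          for (int t = 0; t < STEPS; t++) {
              phase_data[id] = id * 100 + t;     /* local work for this step */
              pthread_barrier_wait(&barrier);    /* <- fast HW barrier here */
              /* After the barrier every thread may safely read its neighbor. */
              int left = phase_data[(id + NTHREADS - 1) % NTHREADS];
              pthread_barrier_wait(&barrier);    /* <- and again before the next write */
              if (id == 0) printf("step %d: saw %d\n", t, left);
          }
          return NULL;
      }

      int main(void) {
          pthread_t tid[NTHREADS];
          pthread_barrier_init(&barrier, NULL, NTHREADS);
          for (long i = 0; i < NTHREADS; i++)
              pthread_create(&tid[i], NULL, spmd_worker, (void *)i);
          for (int i = 0; i < NTHREADS; i++) pthread_join(tid[i], NULL);
          pthread_barrier_destroy(&barrier);
          return 0;
      }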
  8. Checkpoint/Restart/Fault Tolerance (Kubi)

    1. Come up with fast mechanisms for performing checkpoints across a large manycore chip.  You could either assume shared memory or message passing.  How could you generate a consistent state for the machine in order to quickly recover from faults (say a processor dies, or the software crashes because of a race)?  How might you integrate these ideas into the application level to provide true fault recovery?
    2. Could you produce hardware support for something like #1 that was so inexpensive that you could perform frequent "speculative" executions (such as in http://www.eecs.umich.edu/~enightin/sosp05.pdf)?
    3. How might you produce a versioned shared memory that permits a smooth undo of data-structure modifications, thus providing a very clean way to restart partitions after they fail?  Consider, for instance, the problem of restarting a device driver that has been properly isolated in a partition, but which must interact with software modules in other parts of the machine (see a software version with somewhat high overheads at http://nooks.cs.washington.edu/).  The idea is to provide a smooth undo operation in hardware.  Note that you might consider some variant of the many transactional memory schemes that are out there.  A toy undo-log model of this idea appears below.
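    For suggestion 3 above, here is a toy undo-log model of versioned memory.  The vmem_* interface is invented for illustration; a real design would track old versions per cache line or per page in hardware rather than per explicit write call.

      /* Toy undo-log model of versioned shared memory.  Each tracked write logs
       * the old value so a failed partition can be rolled back to its last
       * checkpoint. */
      #include <stdio.h>

      #define LOG_MAX 1024

      typedef struct { int *addr; int old_value; } undo_entry_t;

      static undo_entry_t undo_log[LOG_MAX];
      static int log_len = 0;

      static void vmem_checkpoint(void) { log_len = 0; }       /* start a new version */

      static void vmem_write(int *addr, int value) {
          if (log_len < LOG_MAX)
              undo_log[log_len++] = (undo_entry_t){ addr, *addr };  /* save old value */
          *addr = value;
      }

      static void vmem_rollback(void) {
          while (log_len > 0) {                  /* undo in reverse order */
              undo_entry_t e = undo_log[--log_len];
              *e.addr = e.old_value;
          }
      }

      int main(void) {
          int driver_state = 42;
          vmem_checkpoint();
          vmem_write(&driver_state, 7);          /* partition runs, mutates state... */
          vmem_rollback();                       /* ...then faults and is rolled back */
          printf("restored to %d\n", driver_state);   /* prints 42 */
          return 0;
      }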
  9. Correctness of parallel computation (Yelick)

    1. Can you come up with a new type of analysis, in the spirit of SharC (in the IVY project: http://ivy.cs.berkeley.edu/ivywiki/index.php/Main/Publications), that can be used to find race conditions in parallel portability layers such as GASNet or UPC?  Apparently there are bugs that can be used to crash both the IBM and Cray runtime systems.  Kathy Yelick can give details.  A minimal example of the kind of race such an analysis must catch appears below.
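    As a minimal, made-up example of the kind of bug such an analysis must catch, the program below races two threads on an unprotected counter; a SharC-style sharing analysis (or a dynamic race detector) would flag the unlocked updates to hits.

      /* A minimal data race: two threads update `hits` without holding the lock
       * that should protect it, so increments can be lost. */
      #include <pthread.h>
      #include <stdio.h>

      static long hits = 0;                                      /* shared data */
      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* never used: the bug */

      static void *worker(void *arg) {
          (void)arg;
          for (int i = 0; i < 100000; i++)
              hits++;                            /* racy read-modify-write */
          return NULL;
      }

      int main(void) {
          pthread_t a, b;
          pthread_create(&a, NULL, worker, NULL);
          pthread_create(&b, NULL, worker, NULL);
          pthread_join(a, NULL);
          pthread_join(b, NULL);
          printf("hits = %ld (expected 200000)\n", hits);
          return 0;
      }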



Maintained by John Kubiatowicz (kubitron@cs.berkeley.edu).
Last modified 1/24/2002