Sourav Chatterji, Manikandan Narayanan and Jason Duell
{souravc,nmani}@cs.berkeley.edu and jduell@lbl.gov
Proposal:
Stream-based applications are applications in which a set of computation
kernels operates on continuously flowing data. The dataflow graphs of these
applications (called stream graphs) can be determined statically and are
modified only occasionally during the execution of the program. In addition,
all communication between the kernels (with the exception of a few irregular
accesses) occurs via streams. Such stream-based applications are common in the
multimedia and wireless domains, and their special properties make them a good
match for grid-based architectures (e.g. Smart Memories, RAW) and other
virtualized architectures (e.g. PipeRench, BRASS).
A "stream model" exposes the high-level structure of stream-based applications to both the compiler and the hardware. The compiler and the hardware can then use this high-level information (ideally provided in the form of stream graphs) to perform stream-specific optimizations. Examples of stream models are SCORE at Berkeley (a collection of threads that communicate only via streams [2]) and StreamIt at MIT (a collection of kernels called filters that communicate mainly via streams and occasionally via irregular accesses [1]).
In this project, we intend to characterize stream-based applications based on the properties of their stream graphs and the flow rates of their streams. The emphasis of the characterization will be on the suitability of an application to an architecture; that is, the characterization should help us determine which stream-based application will run better on which stream processor. We will use two processors, Imagine at Stanford [4] and VIRAM at Berkeley [3], as platforms for our studies. Both are targeted at multimedia applications and (in a sense) provide options in the ISA to pass stream-specific information to the hardware. We will use one of the two models, StreamIt or SCORE, for characterizing the applications. As we study the applications, we will also try to come up with new stream-specific optimization techniques. Examples of such optimizations include scheduling the memory accesses and the computations of a stream graph.
Annotated Bibliography:
[1] William Thies, Michal Karczmarek, and Saman Amarasinghe. StreamIt: A
Language for Streaming Applications. In Proceedings of the 2002 International
Conference on Compiler Construction.
(web - http://cag.lcs.mit.edu/commit/papers/02/streamit-cc.pdf)
The authors, researchers in the compiler group at MIT, present a novel language
for modern stream programming that emphasizes programmer productivity without
sacrificing performance. The concepts of interest to us are the highly
structured model for representing stream programs, which uses filters and
filter compositions for regular high-volume data flow and wavefront-based
dynamic messaging for irregular low-volume control flow. Although the current
model supports only static flow rates and one-dimensional streams, the authors
plan to extend it to dynamic flow rates (for modeling applications like
compression) and multi-dimensional streams (useful in image processing).
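To illustrate why static flow rates matter for scheduling, the following is a minimal Python sketch (not StreamIt code, and not taken from the paper) that computes how many times each filter in a linear pipeline must fire so that every stream returns to its initial occupancy. The function name, the (pop, push) tuple encoding, and the example rates are all assumptions made for illustration.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def _lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def steady_state_repetitions(filters):
    """Return minimal firing counts for a linear pipeline of filters.

    `filters` is a list of (pop, push) pairs in pipeline order
    (a hypothetical encoding): each firing of a filter consumes `pop`
    items from its input stream and produces `push` items on its output.
    The counts r satisfy r[i] * push[i] == r[i + 1] * pop[i + 1].
    """
    if not filters:
        return []
    # Express every firing rate relative to the first filter.
    rates = [Fraction(1)]
    for (_, push_prev), (pop_next, _) in zip(filters, filters[1:]):
        if push_prev <= 0 or pop_next <= 0:
            raise ValueError("connected filters need positive rates")
        rates.append(rates[-1] * Fraction(push_prev, pop_next))
    # Scale the rational rates to the smallest integer vector.
    scale = reduce(_lcm, (r.denominator for r in rates), 1)
    counts = [int(r * scale) for r in rates]
    common = reduce(gcd, counts)
    return [c // common for c in counts]


# Assumed example: a source pushing 2 items per firing, a filter that
# pops 3 and pushes 1, and a sink that pops 2.
example = [(0, 2), (3, 1), (2, 0)]
repetitions = steady_state_repetitions(example)
print(repetitions)
```

With dynamic flow rates these counts cannot be fixed ahead of time, which is one reason the static-rate restriction simplifies compile-time scheduling.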
[2] Eylon Caspi, Randy Huang, Yury Markovskiy, Joseph Yeh, John Wawrzynek, and
André DeHon. A Streaming Multi-Threaded Model. Presented at Third Workshop on
Media and Stream Processors (MSP-3, December 2, 2001).
(web - http://brass.cs.berkeley.edu/documents/msp3.pdf)
The authors describe their SCORE (Stream Computations Organized for
Reconfigurable Execution) model, which provides an abstract interface for
stream programming to allow easy porting across hardware. A Task Description
Format (TDF) is used to break applications into separate finite state machines
that communicate with each other only via stream I/O. These TDF threads can
then be run either in parallel or serially, with stream I/O providing
dataflow-based synchronization. A variety of scheduling approaches, including
dynamic (hardware), static (compiled), and mixed, are described, with a mixed
model doing best on a number of benchmarked multimedia applications. The initial
hardware target (an FPGA) is described, and methods are enumerated for allowing
SCORE programs to also run on other architectures (including standard SMPs) by
adding simple stream operations to hardware.
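The idea of threads that communicate only via streams, with stream I/O acting as the synchronization mechanism, can be sketched in ordinary Python. This is not TDF and does not use any SCORE interface; the bounded queues, the END marker, and all function names below are assumptions chosen for illustration.

```python
import queue
import threading

END = object()  # hypothetical end-of-stream marker, not part of SCORE


def producer(out_stream, values):
    """Write each value to the output stream, then signal end of stream."""
    for value in values:
        out_stream.put(value)  # blocks while the stream buffer is full
    out_stream.put(END)


def scaler(in_stream, out_stream, factor):
    """Read tokens, multiply them by `factor`, and forward the results."""
    while True:
        token = in_stream.get()  # blocks until a token is available
        if token is END:
            out_stream.put(END)
            return
        out_stream.put(token * factor)


def run_pipeline(values, factor, capacity=2):
    """Run producer -> scaler as threads and collect the output stream."""
    first = queue.Queue(maxsize=capacity)
    second = queue.Queue(maxsize=capacity)
    threads = [
        threading.Thread(target=producer, args=(first, values)),
        threading.Thread(target=scaler, args=(first, second, factor)),
    ]
    for thread in threads:
        thread.start()
    results = []
    while True:
        token = second.get()
        if token is END:
            break
        results.append(token)
    for thread in threads:
        thread.join()
    return results


output = run_pipeline([1, 2, 3], factor=10)
print(output)
```

No thread shares state other than the queues, so the blocking reads and writes alone determine when each thread may proceed, which is the dataflow-style synchronization the summary refers to.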
[3] Christoforos Kozyrakis. "A Media-Enhanced Vector Architecture for Embedded
Memory Systems", Technical Report UCB//CSD-99-1059, University of California,
Berkeley, July 1999.
(web - http://www.cs.berkeley.edu/~kozyraki/papers/csd-99-1059.pdf)
This report presents the architecture of the Vector IRAM (VIRAM) processor and its performance and energy efficiency on a set of media kernels. This efficiency comes from its simple, scalable vector hardware partitioned into lanes, high-bandwidth on-chip memory access without caches, and flexible support for media data types, short vectors and DSP features. The memory crossbar for accessing the on-chip memory banks and the interlane communication for implementing vector reductions are the hardware parts that are difficult to scale. Some applications (e.g. iDCT) with strided or indexed memory accesses can suffer from memory bank conflicts and a limited number of address generators (especially if they operate on shorter data types). Sub-banks can mitigate the effects of bank conflicts.
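The effect of stride on bank conflicts can be shown with a small Python sketch. It assumes simple word-interleaved banking (word address modulo the number of banks), which is a textbook model and not necessarily the address mapping VIRAM uses; the bank count of 8 and the function names are likewise assumptions.

```python
from math import gcd


def bank_sequence(stride_words, num_banks, count, base=0):
    """Banks visited by `count` accesses with a fixed stride (in words),
    assuming word-interleaved banks: bank = word_address % num_banks."""
    return [(base + i * stride_words) % num_banks for i in range(count)]


def banks_touched(stride_words, num_banks):
    """Number of distinct banks a long strided access stream can reach
    under the same interleaving assumption."""
    return num_banks // gcd(stride_words, num_banks)


NUM_BANKS = 8  # assumed bank count, for illustration only
for stride in (1, 2, 6, 8):
    print(stride, banks_touched(stride, NUM_BANKS),
          bank_sequence(stride, NUM_BANKS, 8))
```

Under this model a unit stride spreads accesses over all banks, while a stride that shares a factor with the bank count keeps returning to the same few banks, so consecutive accesses may have to wait for a busy bank.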
[4] Brucek Khailany, William J. Dally, Scott Rixner, Ujval J. Kapasi,
Peter Mattson, Jin Namkoong, John D. Owens, Brian Towles, and Andrew Chang.
"Imagine: Media Processing with Streams." IEEE Micro, March/April 2001.
(web - ftp://cva.stanford.edu/pub/publications/imagine-ieeemicro.pdf)
This paper presents the Stanford Imagine processor and its performance and power efficiency on certain media processing applications. The processor consists of a 128KB on-chip stream register file (SRF), a stream controller, a streaming memory system, and floating-point arithmetic units in eight arithmetic clusters controlled by a microcontroller (in a SIMD fashion). The architecture tries to reduce memory latency by prefetching data into stream buffers in the SRF, by having the stream controller take care of strided accesses (common in applications like DFT), and by providing local register files in the arithmetic clusters for storing temporaries. Performance can degrade because of startup/shutdown costs and off-chip memory accesses.