CS 252 Spring 2002 - Project on "Streams" 

Sourav Chatterji, Manikandan Narayanan and Jason Duell 
{souravc,nmani}@cs.berkeley.edu and jduell@lbl.gov 

 

Proposal: 
    Stream-based applications are those in which a set of computation kernels operates on continuously flowing data. The dataflow graphs of these applications (called stream graphs) can be determined statically and are modified only occasionally during the execution of the program. Moreover, with the exception of a few irregular accesses, all communication between the kernels takes place via streams. Such applications are common in the multimedia and wireless domains, and their special properties make them a good match for grid-based architectures (e.g. Smart Memories, RAW) and other virtualized architectures (e.g. PipeRench, BRASS). 

    A "stream model" exposes the high-level structure of stream-based applications to both the compiler and the hardware. The compiler and the hardware can then perform stream-specific optimizations using the high-level information (provided ideally in the form of stream graphs). Examples of stream models are SCORE at Berkeley (collection of threads that communicate only via streams - [2]) and StreamIt at MIT (collection of kernels called filters that communicate mainly via streams and occasionally via irregular accesses - [1]). 

    In this project, we intend to characterize stream-based applications by the properties of their stream graphs and the flow rates of their streams. The emphasis of the characterization will be on the suitability of an application to an architecture, i.e., the characterization should help us determine which stream-based application will run better on which stream processor. We will use two processors, Imagine at Stanford [4] and VIRAM at Berkeley [3], as platforms for our studies. Both are targeted at multimedia applications and (in a sense) provide options in the ISA to pass stream-specific information to the hardware. We will use one of the two models, StreamIt or SCORE, for characterizing the applications. As we study the applications, we will also try to come up with new stream-specific optimization techniques. Scheduling the memory accesses and computations of a stream graph is one example of a stream-specific optimization. 
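
    As one illustration of a property that could be derived from a stream graph and its flow rates, the sketch below computes how often each kernel must fire in a steady state so that every stream is produced and consumed at the same rate. It assumes static rates and a connected graph; the function name, the edge format and the kernel names are hypothetical and are not taken from StreamIt or SCORE.

```python
from fractions import Fraction
from math import gcd


def steady_state_repetitions(edges):
    """Return the smallest integer firing count per kernel.

    edges: list of (producer, consumer, push_rate, pop_rate), where the
    producer pushes push_rate items per firing and the consumer pops
    pop_rate items per firing (hypothetical format, static rates only).
    """
    # Balance condition per edge: reps[producer] * push == reps[consumer] * pop
    neighbours = {}
    for src, dst, push, pop in edges:
        neighbours.setdefault(src, []).append((dst, Fraction(push, pop)))
        neighbours.setdefault(dst, []).append((src, Fraction(pop, push)))

    # Propagate relative firing rates outward from an arbitrary kernel.
    start = next(iter(neighbours))
    ratio = {start: Fraction(1)}
    stack = [start]
    while stack:
        node = stack.pop()
        for other, factor in neighbours[node]:
            expected = ratio[node] * factor
            if other not in ratio:
                ratio[other] = expected
                stack.append(other)
            elif ratio[other] != expected:
                raise ValueError("inconsistent rates: no steady state exists")
    if len(ratio) != len(neighbours):
        raise ValueError("stream graph is not connected")

    # Scale the fractions to the smallest whole-number firing counts.
    scale = 1
    for r in ratio.values():
        scale = scale * r.denominator // gcd(scale, r.denominator)
    counts = {k: int(r * scale) for k, r in ratio.items()}
    common = 0
    for c in counts.values():
        common = gcd(common, c)
    return {k: c // common for k, c in counts.items()}


# Hypothetical three-kernel pipeline: the decimator pops two items per firing.
graph = [("source", "decimate", 1, 2), ("decimate", "sink", 1, 1)]
print(steady_state_repetitions(graph))
```

    Firing counts of this kind give a rough measure of how work is distributed across the kernels of a graph, which is one possible input to the characterization described above.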

 

Annotated Bibliography: 
[1] William Thies, Michal Karczmarek, and Saman Amarasinghe. StreamIt: A Language for Streaming Applications. In Proceedings of the 2002 International Conference on Compiler Construction. 
(web - http://cag.lcs.mit.edu/commit/papers/02/streamit-cc.pdf)

The authors, researchers in the Compiler group at MIT, have come up with a novel language for modern stream programming that emphasizes programmer productivity without sacrificing performance. The concepts of interest to us are the highly structured model for representing stream programs, which uses filters and filter compositions for regular high-volume data flow and wavefront-based dynamic messaging for irregular low-volume control flow. Though the current model supports only static flow rates and one-dimensional streams, the authors plan to extend it to dynamic flow rates (for modeling applications like compression) and multi-dimensional streams (useful in image processing). 
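
The idea of filters with static flow rates composed into a pipeline can be sketched in Python as follows. This is not StreamIt syntax: the class names, the `run_pipeline` helper and the declared peek/pop/push rates are assumptions made for illustration, and the helper runs each filter to completion in turn rather than interleaving them as a real scheduler would.

```python
from collections import deque


class Filter:
    """Hypothetical filter with static rates declared up front."""
    peek = pop = push = 1  # items inspected, consumed, produced per firing

    def work(self, window):
        raise NotImplementedError


class MovingAverage(Filter):
    # Inspects three items but consumes only one, giving a sliding window.
    peek, pop, push = 3, 1, 1

    def work(self, window):
        return [sum(window) / len(window)]


class Decimate(Filter):
    # Consumes two items and keeps the first.
    peek, pop, push = 2, 2, 1

    def work(self, window):
        return [window[0]]


def run_pipeline(filters, items):
    """Feed items through the filters in order, one stage at a time."""
    stream = list(items)
    for f in filters:
        buf = deque(stream)
        out = []
        # Fire while enough input remains to satisfy the declared rates.
        while len(buf) >= max(f.peek, f.pop):
            window = [buf[i] for i in range(f.peek)]
            produced = f.work(window)
            assert len(produced) == f.push, "filter violated its push rate"
            out.extend(produced)
            for _ in range(f.pop):
                buf.popleft()
        stream = out
    return stream


print(run_pipeline([MovingAverage(), Decimate()], [1, 2, 3, 4, 5, 6]))
```

Because the rates are fixed per filter, the amount of data on every stream can be worked out before the program runs, which is what makes static scheduling possible in such a model.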

[2] Eylon Caspi, Randy Huang, Yury Markovskiy, Joseph Yeh, John Wawrzynek, and André DeHon. A Streaming Multi-Threaded Model. Presented at Third Workshop on Media and Stream Processors (MSP-3, December 2, 2001). 
(web - http://brass.cs.berkeley.edu/documents/msp3.pdf)

The authors describe their SCORE (Stream Computations Organized for Reconfigurable Execution) model, which provides an abstract interface for stream programming to allow easy porting across hardware. A Task Description Format (TDF) is used to break applications into separate finite state machines that communicate with each other only via stream I/O. These TDF threads can then be run either in parallel or serially, with stream I/O providing dataflow-based synchronization. A variety of scheduling approaches, including dynamic (hardware), static (compiled), and mixed, are described, with a mixed model doing best on a number of benchmarked multimedia applications. The initial hardware target (an FPGA) is described, and methods are enumerated for allowing SCORE programs to also run on other architectures (including standard SMPs) by adding simple stream operations to hardware.
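
The notion of state-machine threads synchronized only by stream I/O can be imitated in Python with bounded queues, where a blocking read or write plays the role of dataflow synchronization. This is only an analogy and not TDF: the `END` marker, the two-state `pair_sum` operator and the queue sizes are all hypothetical.

```python
import queue
import threading

END = object()  # hypothetical end-of-stream marker


def source(out, values):
    """Write each value to the output stream, then close it."""
    for v in values:
        out.put(v)  # blocks while the bounded stream is full
    out.put(END)


def pair_sum(inp, out):
    """Two-state machine: hold one token, then emit its sum with the next."""
    state, held = "HOLD_NONE", None
    while True:
        token = inp.get()  # blocks until the upstream thread has produced data
        if token is END:
            out.put(END)
            return
        if state == "HOLD_NONE":
            held, state = token, "HOLD_ONE"
        else:
            out.put(held + token)
            state = "HOLD_NONE"


def run(values):
    """Connect source -> pair_sum with bounded streams and collect the output."""
    a = queue.Queue(maxsize=2)
    b = queue.Queue(maxsize=2)
    threads = [
        threading.Thread(target=source, args=(a, values)),
        threading.Thread(target=pair_sum, args=(a, b)),
    ]
    for t in threads:
        t.start()
    results = []
    while True:
        token = b.get()
        if token is END:
            break
        results.append(token)
    for t in threads:
        t.join()
    return results


print(run([1, 2, 3, 4, 5]))
```

The threads never share state directly, so the result does not depend on how the threads are interleaved; the same operators could in principle be run one after another with unbounded buffers and produce the same output.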

[3] Christoforos Kozyrakis. "A Media-Enhanced Vector Architecture for Embedded Memory Systems", Technical Report UCB//CSD-99-1059, University of California, Berkeley, July 1999. 
(web - http://www.cs.berkeley.edu/~kozyraki/papers/csd-99-1059.pdf)

This report presents the architecture of the Vector IRAM (VIRAM) processor and its performance and energy efficiency on a set of media kernels. This efficiency comes from its simple, scalable vector hardware partitioned into lanes, its high-bandwidth on-chip memory access without caches, and its flexible support for media data types, short vectors and DSP features. The memory crossbar for accessing the on-chip memory banks and the interlane communication for implementing vector reductions are the hardware parts that are difficult to scale. Some applications (e.g. iDCT) with strided or indexed memory accesses can suffer from memory bank conflicts and the limited number of address generators (especially if they operate on shorter data types). Sub-banks can mitigate the effects of bank conflicts. 
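
Why strided accesses can cause bank conflicts is easy to see with a little arithmetic. The sketch below assumes a simple word-interleaved mapping in which the bank index is the word address modulo the number of banks; that mapping and the bank count of 8 are assumptions for illustration and are not taken from the VIRAM report.

```python
from math import gcd


def banks_touched(stride, num_banks):
    """Distinct banks hit by a long strided stream (assumed mapping:
    bank = word_address % num_banks)."""
    return num_banks // gcd(stride, num_banks)


def banks_touched_by_simulation(stride, num_banks, accesses):
    """Cross-check by enumerating the bank index of each access."""
    return len({(i * stride) % num_banks for i in range(accesses)})


# With 8 banks (hypothetical), compare a few strides.
for stride in (1, 2, 6, 8):
    print(stride, banks_touched(stride, 8))
```

Under this assumed mapping a unit stride spreads accesses over all banks, while a stride equal to the number of banks sends every access to the same bank, so the accesses are serialized.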


[4] Brucek Khailany, William J. Dally, Scott Rixner, Ujval J. Kapasi, Peter Mattson, Jin Namkoong, John D. Owens, Brian Towles, and Andrew Chang. "Imagine: Media Processing with Streams." IEEE Micro, March/April 2001. 
(web - ftp://cva.stanford.edu/pub/publications/imagine-ieeemicro.pdf)

This paper presents the Stanford Imagine processor and its performance and power efficiency on certain media processing applications. The processor consists of a 128KB on-chip stream register file (SRF), a stream controller, a streaming memory system and floating-point arithmetic units in eight arithmetic clusters controlled by a microcontroller (in a SIMD fashion). The architecture tries to reduce memory latency by prefetching data into stream buffers in the SRF, by having the stream controller handle strided accesses (common in applications like DFT) and by providing local register files in the arithmetic clusters for storing temporaries. Performance can degrade because of startup/shutdown costs and off-chip memory accesses.