1. Introduction
   a. Parallel programming environments are well suited to applications with domain separation (i.e. splitting the data set among the processors), which easily yield homogeneous functional nodes.
      I. synchronization and communication are explicit
      II. load balancing is up to the user
   b. We propose an environment suited to heterogeneous nodes (i.e. each node has a specific function) --> DATAFLOW
      I. Some shortcomings of the traditional programming environment are overcome by exposing dependencies and making synchronization implicit.
      II. An additional benefit is composability (i.e. operators are trivially composed into arbitrary graphs; streaming means an operator need not know its source or sink).

2. Related Work
   a. Dataflow (DF) has been used extensively in the DSP domain and as a mechanism to provide clean and precise communication and execution semantics for embedded systems (fine grain)
   b. SCORE for FPGA/uP arrays (fine grain)
   c. Coarser-grain dataflow: River (Eddies), data processing at larger granularity
      i. get more IO bandwidth (e.g. disk)
      ii. balance load using n-to-m streams with back pressure
   d. Volcano: query processing

3. Motivation/Objective
   a. Make the programmer's life easier:
      - communication is implicit (the infrastructure can take care of buffering, batching, blocking, etc.)
      - synchronization is implicit (avoid race conditions, guarantee determinism)
   b. How do we write a "generic" dataflow app and make it run efficiently on a multinode cluster of computers?
   c. What is the appropriate infrastructure?
   d. What is the cost model?
   e. Dataflow has good properties (e.g. exposed dependencies); how do we take advantage of that knowledge? To demonstrate the strength of DF analysis, do this statically (not required, but proves the point).

4. Where did SCORE come from?
   a. what a reconfigurable array is
   b. SCORE graph
   c. diagrams

5. Infrastructure
   a. translation from the TDF language into an MPI-compatible program
   b. built on top of MPI:
      - stream abstraction (a token is the unit of communication)
      - IO thread? automatic buffering (token aggregation)
   c. each computer spawns several threads; each thread runs one operator's behavioral code
   d. upon startup, the application computes its own static schedule and spawns the threads on the right nodes
   e. diagram of how we go from graph to execution

6. Important issues to address
   a. Lack of applications in the target domain, so we borrowed apps from embedded systems.
   b. Dealing with communication granularity: many small messages (tokens) are very difficult to handle with MPI and with multi-threading in general (overhead figures?)
   c. PARTIAL SUCCESS: automatic token buffering; explain the strategies behind the IO thread (graph of performance improvements with different token-handling mechanisms in the IO thread).

7. Scheduling (1)
   a. An accurate model of performance on a cluster of SMPs is extremely difficult to build.
   b. Synthetic experiments to understand the cost model (show some results?)

8. Scheduling (2)
   a. Partition to optimize performance given the cost model
   b. Naive partition vs. greedy topo walk vs. smart

9. Future Work/Conclusion
   a. TDF is a limited language (designed for embedded systems).
   b. Providing efficient mechanisms to communicate small tokens is difficult due to the large reduction in bandwidth.
   c. Since DF is great at exposing dependencies, static scheduling works (smart scheduling works better).
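
Illustration for the scheduling items (8.b, 9.c): the outline names a "naive" and a "greedy topo walk" partition without defining either, so the sketch below is only one plausible reading, not the method actually used. It assumes the operator graph is acyclic, that each operator has a single scalar cost, and that the naive scheme is round-robin placement. The function names, the example graph, and the costs are all hypothetical. The greedy walk visits operators in topological order and fills one cluster node up to the average load before moving to the next, which tends to keep neighbouring operators together and so reduces the number of streams that cross machines.

```python
from collections import deque


def topo_order(graph):
    """Kahn's algorithm. graph maps each operator to its downstream operators."""
    indegree = {op: 0 for op in graph}
    for outs in graph.values():
        for dst in outs:
            indegree[dst] += 1
    ready = deque(op for op in graph if indegree[op] == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for dst in graph[op]:
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)
    if len(order) != len(graph):
        # Assumption: feedback loops are not handled in this sketch.
        raise ValueError("graph contains a cycle")
    return order


def naive_partition(graph, num_nodes):
    """Hypothetical baseline: round-robin placement that ignores graph structure."""
    return {op: i % num_nodes for i, op in enumerate(graph)}


def greedy_partition(graph, cost, num_nodes):
    """Walk operators in topological order, filling each node up to the average load."""
    target = sum(cost.values()) / num_nodes
    placement = {}
    node, load = 0, 0.0
    for op in topo_order(graph):
        # Move on once the current node has reached its share of the work.
        if load >= target and node < num_nodes - 1:
            node += 1
            load = 0.0
        placement[op] = node
        load += cost[op]
    return placement


def cut_streams(graph, placement):
    """Count streams whose producer and consumer sit on different cluster nodes."""
    return sum(
        1
        for src, outs in graph.items()
        for dst in outs
        if placement[src] != placement[dst]
    )


# Hypothetical four-operator pipeline with equal costs.
EXAMPLE_GRAPH = {"src": ["a"], "a": ["b"], "b": ["sink"], "sink": []}
EXAMPLE_COST = {"src": 1.0, "a": 1.0, "b": 1.0, "sink": 1.0}

if __name__ == "__main__":
    for name, placement in [
        ("naive", naive_partition(EXAMPLE_GRAPH, 2)),
        ("greedy", greedy_partition(EXAMPLE_GRAPH, EXAMPLE_COST, 2)),
    ]:
        print(name, placement, "cut streams:", cut_streams(EXAMPLE_GRAPH, placement))
```

A "smart" partition in the sense of 8.b would presumably replace the single scalar cost with the measured cost model from section 7 (for example, per-token communication overhead between nodes); that part is left out here because the outline gives no details.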