## TITAN: A Next-Generation Infrastructure for Integrating Computing and Communication

Grant Number: CDA-9401156

Computer Science Division
University of California, Berkeley

Technical Progress Report (08/01/97 - 07/31/98)

## Overview

UC Berkeley is developing as its research infrastructure a new type of computing system, called Titan, which harnesses breakthrough communications technology to integrate a large collection of commodity computers into a powerful resource pool that can be accessed directly through its constituent nodes or through inexpensive media stations. This project seeks to investigate the computer science issues in large scale computing systems as they will be. We are roughly midway through the project. The system comprises the largest tightly integrated cluster in the world. It is being utilized on a set of advanced applications with demanding computational, I/O, and graphics requirements throughout the department and by researchers at other institutions. It recently set the world record in a disk-to-disk sorting benchmark and cracked the RSA 40-bit key challenge. It is a critical resource for research throughout the department.

### Background

The University of California at Berkeley is a premier research and teaching institution. Its roots go back to the gold rush days of 1849 when the drafters of the State Constitution required the legislature to "encourage by all suitable means the promotion of intellectual, scientific, moral and agricultural improvement" of the people of California. Fifteen members of the Berkeley faculty have been awarded Nobel Prizes and in 1966 Berkeley was recognized by the American Council on Education as "the best balanced distinguished university in the country."

The Computer Science Division of the Electrical Engineering and Computer Sciences department is recognized as a world leader, rated #1, along with MIT and Stanford, in recent national evaluations. The division consists of 30 faculty, roughly 200 graduate students, and a large undergraduate population in both the College of Engineering and the College of Letters and Science. The department has had a major impact on the technological world with developments such as the BSD UNIX operating system, computer-aided design tools for integrated circuits (such as Spice and Magic), relational database systems, IEEE floating point, and pioneering work in RISC computer architectures and RAID storage systems. It is also renowned for its theoretical work, such as the theory of NP-completeness, and several of its faculty have received the Turing Award.

In the late 80's the Division began a major effort to construct a new Computer Science building. Because of its strong impact on the computing industry, it was able to fund and develop a state-of-the-art facility, Soda Hall. The Titan project grew out of the thought processes of designing the new building and the efforts by the faculty to envision what would be the dominant directions of computing as we enter into the next century. We wanted to create an environment in which to experience and investigate the salient issues of computer systems as they "will be" and in which we could rapidly incorporate advancing technology, especially networking, high-performance computing, and interactive multimedia.

The Computer Science faculty at the University of California proposed to NSF to develop as its computing and communication infrastructure a new type of computing system, called Titan, which would harness breakthrough communications technology to integrate a large collection of commodity computers into a powerful resource pool that can be accessed directly through its constituent nodes or through inexpensive media stations. The vision was to treat the building as an integrated computing system, with a core computing component providing vast amounts of computing power and storage, connected to media stations and other advanced devices. A software architecture for the global operating system and programming language would be developed and the system design would be driven by a set of advanced applications with demanding computational, I/O, and graphics requirements. Funding for the project is shared between the National Science Foundation and the University, with individual research groups adding value to this infrastructure through their research personnel and equipment supported through other sources. NSF requested, in response to the original proposal and its addendum, that the project directly incorporate a significant experimental systems research component, along the lines described in a separate UCB proposal: "NOW: Design and Implementation of a Distributed Supercomputer as a Cost-effective Extension to a Network of Workstations." The Titan project comprises a core computing component, a multimedia component, an advanced networking component, and a set of driving applications.

In this report we outline the progress on Titan during its fourth year. The report organization follows the primary segmentation of the project.

## Physical Infrastructure

The physical infrastructure of Titan has remained unchanged since the last report. The core consists of 100 Ultra 170 workstations connected by a high speed Myrinet, 35 Intel PentiumPro PCs and roughly 400 IBM disks used in a massive storage cluster, and four 8-way Sun Enterprise 5000 SMPs. The mediastation component consists of roughly 150 HP/715s and 100 donated Intel Pentium Pro PCs and monitors from Sony and Samsung. Before the end of this project year, we will upgrade the Myricom network within NOW to overcome hardware problems with the switches and to enhance the storage capacity of the network interface card. The hardware problems have not prevented us from moving forward with the research; in fact, they have driven it in interesting ways. NOW is a substantially larger cluster than the vendor can test and, with our fast communication layers, we drive the network harder than the vendor layers can. As a result, we have revealed a series of hardware problems with the switches. This caused us to incorporate fast error detection and retry within the Active Message layer. The upgrade allows us to move forward to better switch hardware and a more useful configuration.

We deployed a new switched 100 Mb/s network to support the new Intel PC mediastations. Initially the available products from Bay allowed us only to construct a network with limited bisection bandwidth (200 Mb/s). Recently we were able to increase this by an order of magnitude with a new switched ethernet product. This switched network is heavily used for multimedia traffic, as well as file traffic.

We have received gigabit ethernet switches and adapters and are currently testing them in advance of replacing the ATM cloud and 10 Mb/s external network of NOW with a switched gigabit core and 100 Mb/s links to the NOW nodes.

## Core: Architecture and Operating Systems (D. Culler, D. Patterson)

### Architectural Trade-offs (Culler, Patterson)

We have conducted a number of investigations to isolate and quantify the architectural trade-offs in NOWs, SMPs, and Clusters of SMPs against various application workloads. One of these studies focuses on resource usage while performing streaming I/O by contrasting three architectures, a single workstation, a cluster, and an SMP, under various I/O benchmarks. We developed a family of performance analysis tools, based on Shade and the hardware counters in the Ultrasparc and Sun Enterprise system. We have derived analytical and empirically-based models of resource usage during data transfer, examining the I/O bus, memory bus, network, and processor of each system. By investigating each resource in detail, we assess what comprises a well-balanced system for these workloads.

Surprisingly, across the platforms the main limitation to attaining peak I/O performance is the CPU, due to lack of data locality. Increasing processor performance (especially with improved block operation performance) will be of great aid for these workloads in the future. The cluster design requires more memory bandwidth per processor, but the bandwidth available in current designs is adequate and, by design, scales with the number of processors. For a cluster workstation, the I/O bus is a major system bottleneck, because of the increased load placed on it from network communication. A well-balanced cluster workstation should have copious I/O bus bandwidth, perhaps via multiple I/O busses. The SMP suffers from poor memory-system performance; even when there is true parallelism in the benchmark, contention in the shared-memory system leads to reduced performance. As a result, the clustered workstations provide higher absolute performance for streaming I/O workloads.

### Communication (Culler)

Active Messages II, our fast, general purpose communication layer over scalable low-latency networks, has fully stabilized on the full NOW cluster, including a port to Solaris 2.6. This includes underlying driver support for virtual networks, which has been fully stress tested with large numbers of simultaneous parallel applications. We have done a great deal of study and optimization of the protocols. The current performance is comfortably within a factor of two of the dedicated GAM layer, obtaining 32 MB/s one-way bandwidth, 17.5 usec one-way user-to-user time, and 13 usec inter-message gap. This work was presented at Hot Interconnects V and selected for an IEEE Micro paper. Versions of the AM II communication layer and driver have been distributed to several research groups.

As part of this work we have rebuilt the "Lanai" firmware to use a single context, making it simpler and better suited to a wide range of network interface cards, especially those emerging for gigabit ethernet. We have ported the AM II layer, Solaris virtual networks driver, Split-C, and MPI to PentiumPro platforms using the PCI-based Lanai.

The virtual network driver has been ported to operate on SMP nodes with multiple concurrent driver threads and multiple network interface cards per processor. Multi-NIC operation has been demonstrated on a cluster of four 8-way Sun Enterprise 5000 SMPs with three NICs per SMP. We have built an initial multi-protocol AM II layer for Clusters of SMPs (Clumps) and demonstrated its operation through benchmarks and applications on our cluster. This work has raised key issues for such layers in the design of concurrent communication objects, adaptive polling algorithms, and lock-free queue management algorithms.

The AM II layer is completely integrated with an automatic network mapper (demonstrated and proven correct in a SPAA 97 paper). A map of the current NOW (updated every minute) can be found at http://www.cs.berkeley.edu/~alanm/map.html. The network mapper is now running on all nodes by default. It takes a fraction of a second to map a network of 43 switches, 93 hosts, and 192 cables. The automatic mapping algorithm has been substantially revised as well to make it faster, more robust, and to yield better route selection. In its current form, all nodes independently map the network. Improvements to the algorithm and time-out mechanism allow this to be very fast. By allowing each of the nodes to choose its own routes, consistent with an up*-down* ordering of the network graph, a better spread of load is obtained over the physical links. We have experimented with integrating the remapping operation into the error handling loop of the AM layer.
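As a rough illustration of what "consistent with an up*-down* ordering" means, the sketch below labels each link direction as up or down relative to a breadth-first spanning tree and then searches for a shortest route that never takes an up link after a down link. This is a generic textbook-style sketch, not the NOW mapper's code; the function names, the tie-break on node id, and the sample topology are all assumptions made for the example.

```python
from collections import deque


def bfs_levels(adj, root):
    """Distance of every node from the chosen root (hypothetical helper)."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level


def is_up(level, u, v):
    """Link u->v is 'up' if v is closer to the root; ties broken by node id."""
    return (level[v], v) < (level[u], u)


def updown_route(adj, level, src, dst):
    """Shortest route using zero or more up links, then zero or more down links."""
    start = (src, False)          # state: (node, have we gone down yet?)
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        node, gone_down = state
        if node == dst:
            path = []
            while state is not None:
                path.append(state[0])
                state = parent[state]
            return path[::-1]
        for nxt in adj[node]:
            up = is_up(level, node, nxt)
            if up and gone_down:
                continue          # an up link after a down link is forbidden
            nstate = (nxt, gone_down or not up)
            if nstate not in parent:
                parent[nstate] = state
                queue.append(nstate)
    return None


# Illustrative five-node topology (an assumption, not the NOW network).
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
level = bfs_levels(adj, root=0)
route = updown_route(adj, level, 1, 2)
```

In this toy topology the route from node 1 to node 2 goes through the root rather than through node 3, because 1 to 3 is a down link and 3 to 2 would then be an up link. Forbidding that pattern is what rules out cyclic channel dependencies; choosing among the remaining legal routes per node is where load spreading can come in.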

We have conducted an extensive performance evaluation study of the NAS Parallel Benchmarks on NOW by building an in situ measurement facility into our MPI layer. This study demonstrates the power of the direct execution approach: a single run through the Class A benchmark suite is a trillion instructions, and this study involved hundreds of such runs. Class B has also been run, and it is several times larger. The study showed NOW scalability to be substantially better than that of the SP-2 and as good as the Cray T3D. We developed a number of tools to isolate the performance factors, including instruction breakdowns, scaling of computational work, MPI send, receive, and wait time, and cache traces for parallel applications. The study reveals the extensive change in architectural interactions under constant problem size scaling, including changes in communication characteristics and memory load. Although the cost of communication increases and extra work is performed, we obtain perfect scaling on several NPB applications because of improvements in computational efficiency due to a large number of caches to hold the working set. We have been working on follow-up NAS Parallel Benchmark studies on SGI Origin 2000 machines. The basic conclusion from that work is that the SGI machine is highly sensitive to the cache benefit in gaining super-linear speedup. Node performance has a first order impact on the overall scalability of the benchmark, on both the NOW cluster and the SGI Origin 2000.

We built a kernel-to-kernel AM layer and test apparatus to study the sensitivity to communication performance of applications that use the Internet protocols (e.g. TCP, UDP, RPC). The apparatus, currently running on a 4-node cluster, allows independent variation of network overhead, inter-packet gap, latency and bandwidth. We have constructed a controlled environment for studying the sensitivity of NFS on the SPEC SFS benchmark. Using the apparatus, we have run a large number of experiments on the sensitivity of NFS to network performance. We used the SPEC-SFS load generators and evaluation metrics; they are industry standards. The current results suggest strong sensitivity to overhead, and weak sensitivity to bandwidth. Sensitivity to latency and per-message gap is very low for the performance ranges of interest in local area networks.
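The four knobs the apparatus varies (overhead, gap, latency, bandwidth) can be read as the parameters of a simple LogGP-style cost model. The sketch below is only a way to reason about how scaling one parameter changes the time for a burst of messages; it is not the apparatus itself and does not attempt to reproduce the NFS results. The class, function names, and sample values are assumptions chosen for illustration, not measurements.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class NetParams:
    overhead_us: float    # CPU time spent per send and per receive
    gap_us: float         # minimum spacing between consecutive messages
    latency_us: float     # transit time through wires and switches
    mbytes_per_s: float   # bulk bandwidth, with 1 MB taken as 10**6 bytes


def one_way_us(p, nbytes):
    """Send overhead + payload transfer + transit + receive overhead."""
    transfer_us = nbytes / p.mbytes_per_s   # bytes / (MB/s) gives microseconds
    return p.overhead_us + transfer_us + p.latency_us + p.overhead_us


def burst_us(p, count, nbytes):
    """Time until the last of `count` back-to-back messages is delivered."""
    interval = max(p.gap_us, p.overhead_us, nbytes / p.mbytes_per_s)
    return (count - 1) * interval + one_way_us(p, nbytes)


def sensitivity(p, field, factor, count, nbytes):
    """Slowdown of a burst when one parameter is scaled by `factor`."""
    scaled = replace(p, **{field: getattr(p, field) * factor})
    return burst_us(scaled, count, nbytes) / burst_us(p, count, nbytes)


# Illustrative values only (assumed, not taken from any measurement).
base = NetParams(overhead_us=5.0, gap_us=13.0, latency_us=7.5, mbytes_per_s=32.0)
slowdown_if_gap_doubles = sensitivity(base, "gap_us", 2.0, count=10, nbytes=0)
```

Which parameter dominates depends on the workload: a stream of back-to-back small messages is limited by the gap in this model, while a request-response workload with CPU-bound endpoints would be limited by overhead. That dependence on workload is the reason an apparatus that varies each parameter independently is useful.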

We have studied in detail how to minimize the latency of a message through a network that consists of a number of store-and-forward stages, especially for the page size chunks transported within cluster file systems. This research is especially important for today's low overhead communication subsystems that employ dedicated processing elements for protocol processing. We have developed an abstract pipeline model that reveals a crucial performance tradeoff. We exploit this tradeoff in fragmentation algorithms designed to minimize message latency. By applying a rather formal methodology to the Myrinet-GAM system, we have improved its latency by up to 51%. A paper describing this work can be found at http://www.cs.berkeley.edu/~rywang/papers/pipeline
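The tradeoff can be illustrated with a deliberately simplified model: each store-and-forward stage charges a fixed per-fragment cost plus a per-byte cost, so small fragments overlap better across stages but pay the fixed cost more often. The sketch below assumes uniform fragment sizes and treats the last fragment as full size (slightly pessimistic). The stage costs and function names are assumptions for illustration and are not the model or the numbers from the paper.

```python
import math


def pipeline_latency(nbytes, frag, stages):
    """Latency of one message cut into fragments of `frag` bytes.

    `stages` is a list of (fixed_us_per_fragment, us_per_byte) pairs, one per
    store-and-forward stage. The first fragment pays every stage in turn; each
    later fragment is delayed by the slowest stage.
    """
    nfrag = math.ceil(nbytes / frag)
    times = [fixed + per_byte * frag for fixed, per_byte in stages]
    return sum(times) + (nfrag - 1) * max(times)


def best_fragment(nbytes, stages, candidates):
    """Candidate fragment size with the lowest modeled latency."""
    return min(candidates, key=lambda f: pipeline_latency(nbytes, f, stages))


# Three identical stages with assumed costs: 5 us fixed, 0.01 us per byte.
stages = [(5.0, 0.01)] * 3
sizes = [256, 512, 1024, 2048, 4096, 8192]
choice = best_fragment(8192, stages, sizes)
```

With these assumed costs, sending an 8 KB page as a single fragment is modeled as noticeably slower than cutting it into 1 KB fragments, while 256-byte fragments are slower again because the fixed cost dominates.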

A prototype was developed that demonstrates user customization of virtual network interfaces through safe language extensions. A Java Virtual Machine was built for the Myrinet board Lanai processor and a class library was developed to provide the basic communication subprimitives that are used within Lanai control programs. The architecture permits applications to safely specify code to be executed within the NI on message transmission and reception. The design is based on the Cornell U-Net implementation and achieves impressive performance. A draft of our U-Net/SLE whitepaper is at http://www.cs.berkeley.edu/~mdw/proj/unet-sle/unet-sle.ps

We have completed a reference implementation of the Virtual Interface Architecture (VIA). Based on the VIA specification version 1.0 published jointly by Microsoft, Compaq and Intel, the Berkeley VIA successfully achieves a networking interface between the user process and networking hardware that requires no kernel transitions to accomplish data transmission or receipt. When a virtual interface is created, the host allocates pinned memory from which the network interface can perform DMA. The user process writes and reads data and transfer descriptors in this memory region and initiates a transfer by writing a special doorbell token to a memory page that is mapped into the network interface card. By using this paged doorbell approach, system calls are eliminated while assuring some form of protection to the user process. Presently, our VIA supports the Sun Ultrasparc platform running Sun Solaris 2.6 and Myricom's Myrinet network. Analysis of the architecture on this platform yields single-packet latencies as small as 24 microseconds. Measured bandwidth for a 2KB packet is approximately 196 Mbits/sec. We intend to continue the development of the Berkeley VIA by expanding its functionality and porting it to different host and network platforms, such as Intel-based PCs running Windows NT and Gigabit Ethernet.

### Global Operating System Layer (Culler, Anderson)

The global operating system layer has been exercised extensively on a sizable population. A paper describing the design and the development experience has been accepted for a special issue of "Software, Practice, and Experience" on the subject of GLUnix. We have made the first public release of GLUnix available on the web (http://now.cs.berkeley.edu/Glunix). We continue to fix bugs in the GLUnix source. We have developed a new tool to monitor the GLUnix master and daemons to restart them automatically after a crash or a machine reboot.

We have implemented the ideas in our implicit scheduling work and have conducted an extensive empirical investigation of that theory. The implementation process revealed a number of subtle issues in the AM II layer and within the Solaris scheduler. The adaptive two-phase algorithms have been shown to work extremely well in practice, both on the synthetic workloads of the original simulation study and on collections of real programs. However, it is critical that no layer below the run-time library silently block. The critical issue for implicit scheduling work is reacting to the response time of remote operations. This turns out to be much more important than the actual message arrival. We have also been able to demonstrate simple extensions that provide fairness.
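The core of a two-phase wait can be sketched in a few lines: spin for a bounded time in case the remote process is currently scheduled and answers quickly, then give up the processor by blocking. The sketch below is a minimal illustration, assuming a `threading.Event` stands in for the arrival of a remote response; the function name and the way the spin limit is supplied are assumptions, and the adaptive choice of spin time described above is not shown.

```python
import threading
import time


def two_phase_wait(response, spin_s):
    """Wait for `response` (a threading.Event) using spin-then-block.

    Phase 1 polls for at most `spin_s` seconds, which pays off when the
    partner process is running and replies within about one round trip.
    Phase 2 blocks, releasing the CPU to other work until the reply arrives.
    Returns which phase observed the response.
    """
    deadline = time.monotonic() + spin_s
    while time.monotonic() < deadline:
        if response.is_set():
            return "spun"
    response.wait()
    return "blocked"


# Usage: a reply that is already present is caught while spinning.
reply = threading.Event()
reply.set()
phase = two_phase_wait(reply, spin_s=1.0)
```

This also shows why a silently blocking lower layer is harmful: the decision to stop spinning has to be made in one place, based on observed response time, and a hidden block underneath would remove that control from the run-time library.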

We have been working on a system for accessing NOW resources across the wide area. The goal is to allow authorized users across the Internet to utilize unique computational resources, such as the NOW. Our work has resulted in the design and implementation of a new authentication and access control system, called CRISIS. A goal of CRISIS is to explore the systematic application of a number of design principles to building highly secure systems, including: redundancy to eliminate single points of attack, caching to improve performance and availability over slow and unreliable wide area networks, fine-grained capabilities and roles to enable lightweight control of privilege, and complete local logging of all evidence used to make each access control decision (e.g., no implicit reasoning about transfer of rights). Measurements of a prototype CRISIS-enabled wide area file system show that CRISIS adds only marginal overhead relative to unprotected wide area accesses. A paper entitled "The CRISIS Wide Area Security Architecture", by Eshwar Belani, Amin Vahdat, Thomas Anderson, and Michael Dahlin, has been accepted to the USENIX Security Symposium and is available at http://www.cs.berkeley.edu/~vahdat/uss.ps

### File Systems (Anderson, Patterson, Culler)

We ported the Illinois Portable Parallel File System (PPFS) to NOW and conducted several performance studies. The performance was rather disappointing. This prompted development of our own parallel file system, which incorporates the techniques employed in our fast external sorting work. The file system operates within its own virtual network connecting the disk servers to the requesters. We are able to sustain the full disk bandwidth on streaming loads, regardless of whether the disks are local or remote. The difference is in the amount of processing time and I/O bus bandwidth that is consumed by the transfer.

We have begun design and development of an adaptive I/O system for parallel programs on NOW, called River, that utilizes what was learned from numerous high performance I/O studies. The key problem with applications that perform parallel I/O (e.g., NOW-Sort) is that a small perturbation on a single workstation leads to a large-scale performance hit. We call this performance regime "meta-stable" (very much like a marble on top of a hill). What is needed is an I/O environment that provides more "stable" performance, where small perturbations lead to small (or no) performance decreases. The River mantra is "move the I/O to the computation", and the design draws on work from both the parallel I/O and task queue literature. By defining a higher level, more flexible interface, and a dynamic, "perturbance-aware" system underneath, we plan to provide robust parallel I/O to a range of interesting applications (decision support, scientific, etc.).
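The "meta-stable" effect can be seen in a toy model: if work is split statically and evenly, one slow node sets the finish time for everyone, whereas if nodes pull blocks from a shared queue the slowdown is absorbed roughly in proportion to the lost capacity. The sketch below is a generic task-queue simulation written for illustration; it is not River's design, and the function names, block counts, and speeds are assumptions.

```python
import heapq


def static_makespan(nblocks, speeds):
    """Finish time when every worker is handed an equal share up front."""
    share = nblocks / len(speeds)
    return max(share / s for s in speeds)


def dynamic_makespan(nblocks, speeds):
    """Finish time when idle workers pull one block at a time from a queue."""
    ready = [(0.0, i) for i in range(len(speeds))]   # (time worker is free, id)
    heapq.heapify(ready)
    finish = 0.0
    for _ in range(nblocks):
        t, i = heapq.heappop(ready)       # earliest available worker
        done = t + 1.0 / speeds[i]        # speeds are in blocks per second
        finish = max(finish, done)
        heapq.heappush(ready, (done, i))
    return finish


# Eight workers, one perturbed to half speed (illustrative numbers).
speeds = [1.0] * 7 + [0.5]
static_time = static_makespan(800, speeds)
dynamic_time = dynamic_makespan(800, speeds)
```

In this model the static split takes twice as long as it would with no perturbation, while the queue-based version stays within a few percent of the unperturbed time of 100 seconds, which is the kind of graceful degradation the paragraph above describes as "stable".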

Our Large-scale Trace Analysis facility is essentially complete. Unfortunately, there is a problem with one set of traces, so we will need to reprocess them from scratch and recalculate all results for this trace. The analysis done so far includes re-examining the Sprite tests to observe changes over time, using long-term traces to study long-term file lifetimes and disk accesses beyond the Sprite study, examining post-cache file system behavior, and an analysis of disk seeks. It appears that a good case can be made for disk reorganization to help read cost. This analysis could not be done with short-term traces. The traces were used to look at long-term self-similar behavior, to appear in SIGMETRICS 98. Because the traces are long-term, we were able to see at what granularity self-similar behavior pans out. The camera-ready SIGMETRICS paper on this topic has been turned in. We may have a follow-up paper examining the causes of burstiness and post-cache behavior. We are studying file system backup policies (again only long-term traces were useful for this project) and have done initial work on read cost for different layout policies. Initial results show that FFS has slightly lower read cost.

### Tertiary Disk: FAMSF Zoom Project (Patterson)

The new FAMSF web site using the Tertiary Disk (TD)  servers went on line March 2nd. Our portion of the site has been up continuously since then. We have about 35,000 images available now. The TD web servers get an average of 50 users per day, accessing about  80 images. About 70% of these accesses come from the United States. The rest are primarily from Canada, Western Europe and Australia. In two  months, we've had around 2700 users and served around 4700 images  (640MB transferred). The 10 most popular images account for 15-20% of all images served. This collection includes a Picasso, several Monets and a print that was mentioned in the FAMSF newsletter.

We are doing an experiment to see if the failure rate of disk sectors is affected by workload. Basically, there are several groups of disks that are continuously subjected to different load levels and read/write patterns. The load patterns are based on some advice from people at IBM. This experiment has been running for about three weeks. We have detected some new bad sectors, but it's too early to say anything definite.

The basic goal in Self Maintaining Storage is to limit maintenance by a system administrator to regular intervals. Doing this requires a combination of monitoring and fault tolerance. We're attempting to solve this problem for our application (and others like it that are web accessible and read mostly). So far we've built some fault tolerance into our web server (failover for the front end, using mirroring to recover from failed servers). We are porting NOW monitoring software to BSD and adding other modules to monitor additional devices like disk enclosures.

### Recent Publications:

[Gh*98] Douglas P. Ghormley, Steven H. Rodrigues, David Petrou, Thomas E. Anderson. SLIC: An Extensibility System for Commodity Operating Systems,  To appear in the USENIX 1998 Annual Technical Conference, New Orleans, LA, June 1998.

[Ch*98] Virtual Network Transport Protocols for Myrinet, Brent Chun, Alan Mainwaring, and David Culler, IEEE Micro (Special issue on Hot Interconnects), Jan/Feb 1998.

[Gr*98] Gribble, Steven D., Gurmeet Singh Manku, Drew Roselli, Eric A. Brewer, Timothy J. Gibson, and Ethan L. Miller, "Self-Similarity in File Systems," Proceedings of ACM SIGMETRICS 1998, Madison, Wisconsin, June 1998.

[LuCu98] Steven S. Lumetta, David E. Culler. Managing Concurrent Access for Shared Memory Active Messages. IPPS/SPDP 98, Orlando, FL, March, 1998.

[Arp*98] Remzi Arpaci-Dusseau, Andrea Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, David A. Patterson. The Architectural Costs of Streaming I/O: A Comparison of Workstations, Clusters, and SMPs. HPCA 4, Las Vegas, February, 1998.

[Lu*97] Multi-Protocol Active Messages on a Cluster of SMP's, Steven S. Lumetta, Alan M. Mainwaring, David E. Culler. SC'97, San Jose, California, November, 1997.

[Cu*97] Parallel Computing on the Berkeley NOW, David E. Culler, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Brent Chun, Steven Lumetta, Alan Mainwaring, Richard Martin, Chad Yoshikawa, Frederick Wong, JSPP'97 (9th Joint Symposium on Parallel Processing), May 1997, Kobe, Japan.

[Ma*97] Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture, Richard P. Martin, Amin M. Vahdat, David E. Culler, Thomas E. Anderson, ISCA 24, Denver, CO, June, 1997.

[AD*97] High-Performance Sorting on Networks of Workstations. Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, David A. Patterson. SIGMOD '97, Tucson, Arizona, May, 1997.

[Gho*98] GLUnix: A Global Layer Unix for a Network of Workstations. To appear in Software Practice and Experience. Douglas P. Ghormley, David Petrou, Steven H. Rodrigues, Amin M. Vahdat, Thomas E. Anderson.

[Nee*97] Improving the Performance of Log-Structured File Systems with Adaptive Methods. SOSP 16, St. Malo, France, October 5-8, 1997. Jeanna Neefe Matthews, Drew Roselli, Adam M. Costello, Randy Wang, Tom Anderson.

[Cha*97] Experience with a Language for Writing Coherence Protocols. USENIX Conference on Domain-Specific Languages USENIX/DSL, Santa Barbara, California, October 15-17, 1997. Satish Chandra, Michael Dahlin, Bradley Richards, Randolph Wang, Thomas E. Anderson, James R. Larus.

[Dus*97] Extending Proportional-Share Scheduling to a Network of Workstations. International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'97), Las Vegas, Nevada, June, 1997. Andrea C. Arpaci-Dusseau, David E. Culler.

[Wo*98] Architecture Requirements and Scalability of the NAS Parallel Benchmarks, Fredrick Wong, Richard Martin, Remzi Arpaci-Dusseau, and David Culler, submitted for publication.

[Mai*98] Design and Implementation of Virtual Networks, Alan Mainwaring and David Culler, Submitted for publication.

Doug Ghormley received his PhD. Steven Lumetta, Andrea Dusseau, Amin Vahdat, Randy Wang, and Richard Martin will be finishing soon and have obtained faculty or post-doc positions. Several students passed quals and/or received MS degrees.

### Wireless Networking (Katz, McCanne)

The BARWAN Research Group continues to deploy refined versions of its wireless network technology and proxy services within Soda Hall for the use of the TITAN research community. In particular, a Lucent WaveLAN network has been deployed throughout Soda Hall, a DHCP server has been made available, and numerous researchers outside of the BARWAN group are now using the WaveLAN infrastructure within Windows 95, Windows NT, and various forms of PC-based Unix systems for their own research.

In terms of networking research, the group has focused on methods for improving the performance of TCP over cellular wireless networks. We have implemented a TCP-aware link layer protocol called the Snoop Protocol that isolates wired senders from the lossy characteristics of a wireless link. In our latest work, we have employed a novel protocol based on Explicit Loss Notification (ELN) to improve transport performance. This technique is particularly well suited for packet radio networks, in which the lossy link need not be limited to the final network hop. We have obtained extensive packet traces of wireless errors from a production multi-hop wireless network (i.e., the Reinas Remote Sensing network deployed in the Monterey Bay by researchers at UC Santa Cruz) and have derived an empirical model of channel errors based on this data. We have used this to evaluate the performance of "standard" TCP Reno, TCP Selective Acknowledgments, and our Snoop protocol for Web workloads to mobile hosts. Furthermore, we have extensively studied the scaling behavior of the Snoop protocol to understand how it performs under load. This analysis leads to general insights about efficient protocol design for reliable wireless transport.
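The idea of isolating the wired sender from wireless losses can be sketched as a small agent at the base station: it caches segments on their way to the mobile host, and when a duplicate acknowledgment reveals a loss on the wireless hop it retransmits locally and suppresses the duplicate so the sender does not shrink its congestion window. The sketch below is a simplified illustration that numbers whole segments rather than bytes and omits local timers; the class and method names are hypothetical and this is not the group's implementation.

```python
class SnoopAgent:
    """Toy base-station agent for one TCP connection to a mobile host."""

    def __init__(self):
        self.cache = {}       # segment number -> payload, not yet acked
        self.last_ack = 0     # next segment the mobile host expects

    def on_data_from_sender(self, seq, payload):
        """Remember the segment, then pass it on over the wireless link."""
        self.cache[seq] = payload
        return ("forward_to_mobile", seq)

    def on_ack_from_mobile(self, ack):
        """`ack` is cumulative: the next segment number the mobile expects."""
        if ack > self.last_ack:
            # New ACK: drop acknowledged segments and let the sender see it.
            for seq in [s for s in self.cache if s < ack]:
                del self.cache[seq]
            self.last_ack = ack
            return ("forward_to_sender", ack)
        if ack in self.cache:
            # Duplicate ACK for a cached segment: repair the loss locally
            # and hide the duplicate from the wired sender.
            return ("retransmit_to_mobile", ack)
        # Nothing cached to retransmit: the sender has to handle it.
        return ("forward_to_sender", ack)


# Usage: segment 1 is lost on the wireless hop.
agent = SnoopAgent()
for seq in (0, 1, 2):
    agent.on_data_from_sender(seq, b"data")
first = agent.on_ack_from_mobile(1)    # segment 0 arrived
second = agent.on_ack_from_mobile(1)   # duplicate: segment 1 is missing
```

Because the duplicate ACK never reaches the wired sender, the loss is repaired at wireless round-trip scale without triggering end-to-end congestion control.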

### Relevant Publications:

H. Balakrishnan, V. Padmanabhan, S. Seshan, R. H. Katz, "A Comparison of Mechanisms for Improving TCP Performance over Wireless Links," ACM Transactions on Networking, V5, N 6, (December 1997), pp. 756-769.

H. Balakrishnan, M. Stemm, S. Seshan, R. H. Katz, "Analyzing Stability in Wide-Area Network Performance," ACM Sigmetrics Conference, Seattle, WA, (June 1997).

B. Noble, M. Satyanarayanan, G. Nguyen, R. H. Katz, "Trace Based Mobile Network Emulation," ACM SIGCOMM Conference, Cannes, France, (September 1997).

T. Hodes, R. H. Katz, E. Servan-Schreiber, L. A. Rowe, "Composable Ad-Hoc Mobile Services for Universal Interaction," Third ACM Mobicom Conference, Budapest, Hungary, (September 1997). Best Paper Award.

H. Balakrishnan, V. Padmanabhan, R. H. Katz, "The Effects of Asymmetry on TCP Performance over Wide-Area Wireless Networks," Third ACM Mobicom Conference, Budapest, Hungary, (September 1997).

T. Henderson, R. Katz, "Satellite Transport Protocol (STP)--An SSCOP-based Transport Protocol for Datagram Satellite Networks," Second Workshop on Satellite-Based Information Systems (WOSBIS-97), Budapest, Hungary, (October 1997).

M. Stemm, S. Seshan, R. H. Katz, "SPAND: Shared Passive Network Performance Discovery," USENIX Symposium on Internet Technologies and Systems, Monterey, CA, (December 1997).

H. Balakrishnan, M. Stemm, S. Seshan, V. Padmanabhan, R. H. Katz, "TCP Behavior of a Busy Internet Server: Analysis and Solutions," IEEE Infocom Conference, San Francisco, CA, (March 1998).

M. Stemm, H. Balakrishnan, S. Seshan, V. Padmanabhan, R. H. Katz, "TCP Improvements for Heterogeneous Networks: The Daedalus Approach," Proceedings Allerton Conference, Urbana, IL, (September 1997). Invited Paper.

R. H. Katz, "Beyond Third Generation Telecommunications Infrastructures," ACM Sigmobile Newsletter, V. 2, N. 2, (April 1998), pp. 1-5. Invited Paper based on ACM Mobicom Keynote Address, September 1997.

D. Goodman, N. Abramson, E. Cacciamani, J. Engel, M. Epstein, B. Fette, D. Fields, B. Gavish, A. Goldsmith, R. H. Katz, E. Kelley, K. Pahlavan, C. Perkins, T. Rappaport, J. Russell, The Evolution of Untethered Communications, National Research Council Press, 1997.

R. H. Katz, W. L. Scherlis, S. L. Squires, "The National Information Infrastructure: A High Performance Computing and Communications Perspective," in White Papers, The Unpredictable Certainty: Information Infrastructure Through 2000, National Research Council Press, Washington, 1998, pp. 315-334.

## Core: Compiler, Library, and Language Component

### Titanium (Yelick, Aiken, Graham, Hilfinger)

The Titanium language is a Java dialect extended with primitives for SPMD style parallelism within a global address space. Over the past year the group completed the initial language design and implemented a prototype compiler that generates C++ code with either shared memory operations or Active Messages for accessing data between processes. The compiler generates code for the Titan NOW as well as the SMP nodes within the Sun Clump or Intel machines. The performance of the compiled code is quite good: the compiled programs are sometimes faster than the same programs written directly in C/C++ or Fortran and rarely more than 2x slower. The parallel constructs in Titanium are modeled after those of the Split-C language; it uses the same runtime layer as Split-C and has therefore benefited from the work on optimizing the Split-C runtime layer and underlying Active Message layers.

The uniprocessor and SMP versions of Titanium have been used in the graduate parallel computing class (CS267) and by undergraduate researchers for writing parallel algorithms. The group now has several benchmarks designed for these platforms: an electromagnetics model using an unstructured mesh, a Particle-in-Cell code, matrix multiplication, Cholesky and LU decompositions (without pivoting), a multigrid solver for structured meshes, a linear systems solver for tridiagonal systems, a parallel sorting algorithm, and a simple n-body simulation. In addition, a major application is under development with Luigi Semenzato and Phillip Colella at NERSC/LBNL. The application is a Poisson solver that uses Adaptive Mesh Refinement techniques, and it runs on the SMPs (both Intel and Sun) and the NOW.

The group is currently working on improved parallel code generation for distributed memory machines like the NOW and tuning some of the applications for distributed data layout on these machines. There have been several results in the area of program analysis and optimization of explicitly parallel code, such as communication optimization, synchronization analysis, cache optimization for grid-based computation, and analysis and optimization of dynamic memory management.

References:

Titanium: A High-Performance Java Dialect. ACM 1998 Workshop on Java for High-Performance Network Computing. To appear in Concurrency: Practice and Experience.

Analyses and Optimizations for Shared Address Space Programs. A. Krishnamurthy and K. Yelick, Journal of Parallel and Distributed Computation, 1996.

Evaluation of Architectural Support for Global Address-Based Communication in Large Scale Parallel Machines. A. Krishnamurthy, K. Schauser, C. Scheiman, R. Wang, D. Culler, and K. Yelick, Architectural Support for Programming Languages and Operating Systems, November, 1996.

Empirical Evaluation of Global Memory Support on the Cray-T3D and Cray-T3E. A. Krishnamurthy, D. Culler, and K. Yelick, UCB//CSD-98-991.

Alex Aiken and David Gay. Barrier Inference. Proceedings of the Twenty-Fifth Annual ACM Sigplan Symposium on Principles of Programming Languages, San Diego, California, January, 1998.

David Gay and Alex Aiken. Memory Management with Explicit Regions. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, to appear, Montreal, Canada, June 1998.

### pSather (Feldman)

The Sather project on Object-Oriented Software Engineering and its parallel counterpart pSather are both making good progress using the Titan infrastructure. We are redirecting efforts to put greater emphasis on parallel computing, where pSather is much more mature than Java. The latest Sather 1.2 Beta release is based on Active Threads, a new high-performance portable runtime that delivers orders of magnitude higher performance than most commercial thread systems. Active Threads enabled pSather to achieve high performance over a range of architectures and platforms: SPARC, PentiumPro, DEC Alpha, and HPPA SMPs; Myrinet-based networks of SMPs; and others. Active Threads was developed by Boris Weissman, a UCB graduate student on the project. Juergen Quittek, a German postdoc, implemented a distributed synchronization object management system that avoids the performance bottlenecks of the older centralized lock management.

Boris Weissman delivered a paper prepared with Juergen Quittek on efficient synchronization in Sather at the 1997 International Scientific Computing in Object-Oriented Parallel Environments Conference. Another aspect of synchronization in Sather, fairness, was addressed in a paper delivered by a former ICSI postdoc, Michael Philippsen, at the IASTED International Conference on Parallel and Distributed Computing and Systems. Matthias Anlauff, a German postdoc, gave a talk about his work on a system for formal specification of programming language semantics at the International Workshop on the Theory and Practice of Algebraic Specifications, Amsterdam. Efficiency aspects of thread migration on emerging hardware platforms such as networks of SMPs (CLUMPs) are investigated in a paper to appear at the 12th International Parallel Processing Symposium and 9th IEEE Symposium on Parallel and Distributed Processing (IPPS/SPDP 1998). Based on our experience with pSather over the past few years, we have formulated a new programming model for a safe high-performance programming language. This resulted in a paper prepared in cooperation with researchers from Karlsruhe, Germany, and submitted to the European Conference on Object-Oriented Programming (ECOOP 98).

David Stoutamire completed his doctoral dissertation on a new "Zones" model for improving locality in compiling high-level languages. Although the developments are based on Sather and fully implemented in the compiler, they are broadly applicable. David has moved to JavaSoft, where he joins Robert Griesemer, another ICSI alumnus. The Sather project has already had some influence on Java development and promises to have more.

Claudio Fleiner finished his Ph.D. thesis on parallel optimizations in Sather and passed the defense at the University of Fribourg, Switzerland. The thesis was done during Fleiner's stay at ICSI. Claudio moved on to accept a research position with IBM Labs in Zurich.

Ben Gomes completed his doctoral dissertation on reusable parallel frameworks for mapping connectionist networks onto parallel machines using pSather as an implementation platform. He also continued his Sather library work. He will also join the core language group at JavaSoft. Michael Holtkamp, a student from Hamburg, has finished his thesis on thread migration with Active Threads on the CLUMPs.

Serial Sather remains fairly stable. Our current efforts are mostly concerned with further work on high-performance parallel runtimes including thread scheduling for locality.

References:

Quittek, J. and Weissman, B., "Efficient Extensible Synchronization in Sather," The 1997 International Scientific Computing in Object-Oriented Parallel Environments Conference.

Fleiner, C. and Philippsen, M., "Fair Multi-Branch Locking of Several Locks," Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems, Washington D.C., October 1997.

Weissman, B., Gomes, B., Quittek, J.W., Holtkamp, M., "Efficient Fine-Grain Thread Migration with Active Threads," 12th International Parallel Processing Symposium and 9th IEEE Symposium on Parallel and Distributed Processing (IPPS/SPDP 1998), Orlando, March 1998, to appear.

Gomes, B., Lowe, W., Quittek, J.W., Weissman, B., "Safe Sharing of Objects in a High-Performance Parallel Language," submitted to the 12th European Conference on Object-Oriented Programming (ECOOP 98).

Anlauff, M., Kutter, P.W., Pierantonio, A., "Formal Aspects of and Development Environments for Montages," 2nd International Workshop on the Theory and Practice of Algebraic Specifications, Amsterdam, 1997.

Weissman, B., "Active Threads: an Extensible and Portable Light-Weight Thread System," TR-97-036, ICSI, November 1997.

Weissman, B., Gomes, B., Quittek, J.W., Holtkamp, M., "A Performance Evaluation of Fine-Grain Thread Migration with Active Threads," TR-97-054, ICSI, December 1997.

Gomes, B., Stoutamire, D., Weissman, B., Klawitter, H., "Sather 1.1: A Language Manual," currently available on-line at http://www.icsi.berkeley.edu/~sather/Documentation/LanguageDescription/webmaker/index.html, upcoming TR, ICSI.

Fleiner, C., "Advanced Constructs and Compiler Optimizations for a Parallel, Object Oriented, Shared Memory Language running on a Distributed System," Ph.D. Thesis, #1148, University of Fribourg, Institute of Informatics, Switzerland, April 1997.

Stoutamire, D., "Portable, Modular Expression of Locality," Ph.D. Thesis, University of California at Berkeley, 1997.

Gomes, B., "Mapping Connectionist Networks onto Parallel Machines: A Library Approach," Ph.D. Thesis, University of California at Berkeley, 1997.

### BANE: The Berkeley ANalysis Engine (Aiken)

One of the outgrowths of Titanium is our work on automatic analysis of software using constraint resolution techniques. We are building a system (BANE) and working on a number of demonstration applications. The goal is to scale realistic static bug detection to programs in the range of 1,000,000 lines. Our current system scales quite well: we have applied a non-trivial alias analysis to C programs of 100,000 source code lines (450,000 lines of preprocessed source). For comparison, we believe this is roughly 10 times larger than any comparable effort reported in the literature.
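As a rough illustration of what resolving constraints for alias analysis involves, the sketch below implements a small Andersen-style points-to analysis by naive fixed-point iteration. It is not BANE's interface or its resolution algorithm; the statement encoding and the names `solve_points_to` and `may_alias` are assumptions invented for this example.

```python
from collections import defaultdict

def solve_points_to(statements):
    """Andersen-style points-to analysis by naive fixed-point iteration.

    Statement forms (hypothetical encoding for this sketch):
      ("addr",  p, x)   p = &x
      ("copy",  p, q)   p = q
      ("load",  p, q)   p = *q
      ("store", p, q)   *p = q
    """
    pts = defaultdict(set)
    changed = True
    while changed:
        changed = False
        for kind, lhs, rhs in statements:
            if kind == "addr":
                targets, new = [lhs], {rhs}
            elif kind == "copy":
                targets, new = [lhs], set(pts[rhs])
            elif kind == "load":
                targets, new = [lhs], set()
                for v in list(pts[rhs]):
                    new |= pts[v]
            elif kind == "store":
                targets, new = list(pts[lhs]), set(pts[rhs])
            else:
                raise ValueError(kind)
            # Inclusion constraint: pts[t] must contain everything in `new`.
            for t in targets:
                if not new <= pts[t]:
                    pts[t] |= new
                    changed = True
    return {v: s for v, s in pts.items() if s}

def may_alias(pts, p, q):
    """Two pointers may alias if their points-to sets intersect."""
    return bool(pts.get(p, set()) & pts.get(q, set()))
```

For the program `p = &a; q = &b; r = p; *r = q; s = *p`, the solver concludes that `s` and `q` may alias (both can point to `b`) while `p` and `q` cannot. A production system would replace the repeated sweeps with a worklist and graph-based techniques in order to scale.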

The most realistic application we have implemented to date analyzes RLL (Relay Ladder Logic) programs. RLL is an embedded control language used in most US manufacturing facilities. Bugs in RLL programs cost thousands of dollars *per minute* to fix when factory controllers crash. It is not uncommon for a single RLL bug to cost hundreds of thousands of dollars to repair.

In consultation with Rockwell we focused on the problem of detecting "relay races" in RLL programs. Relay races are very difficult to detect using standard testing techniques and are a common source of bugs in practice. Our RLL analysis was very successful at finding these bugs in very large, production RLL programs, including one known bug that had originally required four hours of factory down time to repair. It is fair to characterize this as a very positive result: very few other tools are able to process realistic-size software systems and glean useful information. A paper on this work received the Best Paper Award from the European Association of Programming Languages and Systems federated conference.

Publications

Partial Online Cycle Elimination in Inclusion Constraint Graphs. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, to appear, Montreal, Canada, June 1998 (with M. Faehndrich, J. Foster, and Z. Su).

Detecting Races in Relay Ladder Logic Programs. In Proceedings of the 1st International Conference on Tools and Algorithms for the Construction and Analysis of Systems, Lisbon, Portugal, pages 184-200, April, 1998 (with M. Faehndrich and Z. Su).

A Toolkit for Constructing Type- and Constraint-Based Program Analyses (invited paper). In Proceedings of the 2nd International Workshop on Types in Compilation, Kyoto, Japan, pages 165-169, March 1998 (with M. Faehndrich, J. Foster, and Z. Su).

Program Analysis Using Mixed Term and Set Constraints. Proceedings of the 4th International Static Analysis Symposium, Paris, France, September, 1997 (with M. Faehndrich).

Optimal Representations of Polymorphic Types with Subtyping (Extended Abstract). Theoretical Aspects of Computer Software (TACS), September, 1997 (with E. Wimmers and J. Palsberg).

### JavaTime Project (Newton, Hilfinger)

A second outgrowth is a new successive, formal refinement approach for specification of embedded systems using a general-purpose programming language. Systems are formally modeled as Abstractable Synchronous Reactive systems, and Java is used as the design input language. A policy of use is applied to Java, in the form of language usage restrictions and class-library extensions, to ensure consistency with the formal model. A process of incremental, user-guided program transformation is used to refine a Java program until it is consistent with the policy of use. The final product is a system specification possessing the properties of the formal model, including deterministic behavior, bounded memory usage, and bounded execution time. This approach allows systems design to begin with the flexibility of a general-purpose language, followed by gradual refinement into a more restricted form necessary for specification.

Reference:

James Shin Young, Josh MacDonald, Michael Shilman, Abdallah Tabbara, Paul Hilfinger, and A. Richard Newton, "Design and Specification of Embedded Systems in Java Using Successive, Formal Refinement", Proceedings of Design Automation Conference (DAC), 1998, to appear.

### Numerical Libraries (Demmel)

Tzu-Yi Chen has shown that applying a permuted diagonal similarity transform to a matrix A before calculating its eigenvalues can improve the speed and accuracy with which the eigenvalues are computed. This is often called balancing. We have developed a novel, finite algorithm which runs in O(n^4) time. Our implementation is faster than the dense algorithm when the density of the matrix is less than or equal to approximately 0.5. We have also developed a set of Krylov-based algorithms for matrices that are not given explicitly, i.e., the only operation available on the matrix is matrix-vector multiplication and perhaps matrix-transpose-vector multiplication. On test matrices from important applications, one version of our algorithm is found to return matrices whose norms are within a factor of 2.5 of the norm of the balanced matrices returned by the explicit algorithm. Software may be found at www.cs.berkeley.edu/~tzuyi/Balancing.
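To make the idea of balancing concrete, here is a minimal sketch of the classic dense iteration that rescales one row and column pair at a time (often attributed to Osborne). It is not the sparse or Krylov-based algorithm described above, and it omits the permutation step; the function names `balance` and `frobenius` are assumptions for this example.

```python
def balance(a, sweeps=20):
    """Return (b, d) with b[i][j] = a[i][j] * d[j] / d[i].

    Each step rescales row i and column i so that their off-diagonal
    1-norms become equal. Because b = D^-1 A D is a similarity
    transform, the eigenvalues of b equal those of a.
    """
    n = len(a)
    b = [row[:] for row in a]
    d = [1.0] * n
    for _ in range(sweeps):
        for i in range(n):
            r = sum(abs(b[i][j]) for j in range(n) if j != i)  # row norm
            c = sum(abs(b[j][i]) for j in range(n) if j != i)  # column norm
            if r == 0.0 or c == 0.0:
                continue
            f = (r / c) ** 0.5
            d[i] *= f
            for j in range(n):
                if j != i:
                    b[i][j] /= f   # shrink (or grow) row i
                    b[j][i] *= f   # grow (or shrink) column i
    return b, d

def frobenius(a):
    """Frobenius norm of a matrix stored as a list of rows."""
    return sum(x * x for row in a for x in row) ** 0.5
```

On the badly scaled matrix `[[1.0, 1e4], [1e-4, 1.0]]`, the off-diagonal entries move to roughly 1 each, which sharply reduces the norm while leaving the diagonal and the eigenvalues unchanged.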

Eigenvectors have proved to be an invaluable computational tool in many diverse applications. To the quantum chemist they may signify wave functions; statisticians compute eigenvectors of a covariance matrix to find directions of maximum variance in the data (Principal Components Analysis), while computer scientists have lately used eigenvectors to partition graphs, segment images and retrieve textual information (Latent Semantic Indexing). Dhillon's thesis focuses on the computation of the eigenvectors of a symmetric tridiagonal matrix T, which is an important phase in finding the eigenvectors of any symmetric matrix. Previous practical algorithms to find all n eigenvectors of T take O(n^3) time in the worst case. This is due to the need for Gram-Schmidt (or similar) orthogonalization when eigenvalues are close. The thesis presents a new O(n^2), embarrassingly parallel algorithm that avoids this need by (1) finding multiple representations of T and its translates that determine the locally small eigenvalues to high relative accuracy, (2) using techniques for computing such small eigenvalues to full accuracy, and (3) applying procedures to compute associated eigenvectors that have guaranteed tiny residual norms. An interesting facet of our work is that high accuracy in intermediate computations leads to a much faster overall algorithm.

Our ideas are well illustrated on a problem arising from joint work with computational quantum chemists at the Pacific Northwest National Laboratory (PNNL). In a problem arising in the modeling of a biphenyl molecule, our new eigensolver takes 2 seconds as opposed to the 2 minutes taken by the previous LAPACK inverse iteration algorithm. We observe considerable speedups on a variety of other test matrices. Software based on this new algorithm will soon be available as part of the LAPACK and ScaLAPACK public-domain libraries. An earlier version of this software (for distributed-memory machines) is already available in PNNL's PeIGS software library.

The execution time of a symmetric eigendecomposition depends upon the application, the algorithm, the implementation, and the computer. Symmetric eigensolvers are used in a variety of applications, and no two applications solve exactly the same eigenproblem. Many different algorithms can be used to perform a symmetric eigendecomposition, each with differing computational properties. Different implementations of the same algorithm have different computational properties. The computer on which the eigensolver is run not only affects execution time but may favor certain algorithms and implementations over others. Stanley's thesis explains the performance of the ScaLAPACK symmetric eigensolver, the algorithms that it uses, and other important algorithms for solving the symmetric eigenproblem on today's fastest computers.

The performance of conjugate gradient schemes for minimizing unconstrained energy functionals in the context of electronic structure calculations is studied. The unconstrained functionals allow a straightforward application of conjugate gradients by removing the explicit orthonormality constraints on the quantum-mechanical wave functions. However, the removal of the constraints can lead to slow convergence, in particular when preconditioning is used. The convergence properties of two previously suggested energy functionals are analyzed in Pfrommer's MS thesis, and a new functional is proposed which unifies some of the advantages of the other functionals. A numerical example confirms the analysis.

Blackston's thesis describes the design of several portable and efficient parallel implementations of adaptive N-body methods, including the adaptive Fast Multipole Method, the adaptive version of Anderson's method, and the Barnes-Hut algorithm. Our codes are based on a communication and work-partitioning scheme that allows an efficient implementation of adaptive multipole methods even on high-latency systems. Our test runs demonstrate high performance and speed-up on several parallel architectures, including traditional MPPs, shared-memory machines, and networks of workstations.

The parallel construction of maximal independent sets is a useful building block for many algorithms in the computational sciences, including graph coloring and multigrid coarse grid creation on unstructured meshes. We present an efficient asynchronous maximal independent set algorithm for parallel computers, designed for "well partitioned" graphs that arise from finite element (FE) models. For appropriately partitioned bounded-degree graphs, it is shown that the running time of our algorithm under the CREW PRAM computational model is O(1), which is an improvement over the previous best PRAM complexity for this class of graphs. Adams presents numerical experiments on an IBM SP that confirm our PRAM complexity model is indicative of the performance one can expect with practical partitions on graphs from FE problems.
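For orientation, the sketch below computes a maximal independent set using a round-based scheme in the style of Luby's randomized algorithm, simulated serially. This is a stand-in chosen for brevity, not the asynchronous partition-aware algorithm described above; the function names and the adjacency-dictionary input format are assumptions for this example, and the graph is assumed to be undirected (symmetric adjacency lists).

```python
import random

def maximal_independent_set(adj, seed=0):
    """Round-based MIS in the style of Luby's algorithm (simulated serially).

    adj maps each vertex to a list of its neighbours.
    """
    rng = random.Random(seed)
    prio = {v: (rng.random(), v) for v in adj}
    active = set(adj)
    mis = set()
    while active:
        # A vertex wins the round if it beats every still-active neighbour.
        # In a parallel setting each vertex can decide this independently.
        winners = {v for v in active
                   if all(prio[v] < prio[u]
                          for u in adj[v] if u in active and u != v)}
        mis |= winners
        removed = set(winners)
        for v in winners:
            removed.update(adj[v])   # neighbours of winners can never join
        active -= removed
    return mis

def is_maximal_independent(adj, s):
    """Check that no two members are adjacent and no vertex could be added."""
    independent = all(u not in s for v in s for u in adj[v])
    maximal = all(v in s or any(u in s for u in adj[v]) for v in adj)
    return independent and maximal
```

Every round selects at least the active vertex with the smallest priority, so the loop always terminates; which set is returned depends on the random priorities.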

Publications and Theses:

Tzu-Yi Chen, MS, "Balancing Sparse Matrices for Computing Eigenvalues", 1998

Inderjit Dhillon, PhD, "A New O(n^2) Algorithm for the Symmetric Tridiagonal Eigenvalue/Eigenvector Problem," 1997

Ken Stanley, PhD, "Execution time of Symmetric Eigensolvers", 1997

Bernd Pfrommer, MS, "Minimizing unconstrained electronic structure energy functionals with conjugate gradients on parallel computers", 1997

Bernd Pfrommer, H. Simon, J. Demmel, "Unconstrained Energy Functionals for Electronic Structure Calculations", submitted to J. Comp. Physics

David Blackston and Torsten Suel, "Highly Portable and Efficient Implementations of Parallel Adaptive N-Body Methods", Supercomputing 98

Mark Adams, "A maximal independent set algorithm", Fifth Copper Mountain Conference, 1998 (Best Student Paper Award)

### PHiPAC (Demmel)

BLAS3 matrix-matrix operations usually have great potential for aggressive optimization. Unfortunately, they usually need to be hand-coded for a specific machine and/or compiler to achieve near-peak performance. We have developed a methodology whereby near-peak performance on such routines can be achieved automatically. First, rather than code by hand, we produce parameterized code generators whose parameters are germane to the resulting machine performance. Second, the generated code follows the PHiPAC (Portable High Performance ANSI C) coding suggestions that include manual loop unrolling, explicit removal of unnecessary dependencies in code blocks (if not removed, C semantics would prohibit many optimizations), and use of machine-sympathetic C constructs. Third, we develop search scripts that, for a given code generator, find the best set of parameters for a given architecture/compiler. We have developed a BLAS-GEMM compatible multi-level cache-blocked matrix-matrix multiply code generator that has achieved performance around 90% of peak on the Sparcstation-20/61, IBM RS/6000-590, HP 712/80i, SGI Power Challenge R8k, SGI Octane R10k, and 80% on the SGI Indigo R4k. On the IBM, HP, SGI R4k, and the Sun Ultra-170, the resulting DGEMM is, in fact, faster than the vendor-optimized BLAS GEMM. Other generators, search scripts, and performance results are under development.
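The search step of this methodology can be sketched in a few lines. The example below is a toy in Python, not the PHiPAC generator or its search scripts: it parameterizes a cache-blocked multiply by a single block size and times each candidate. The names `blocked_matmul` and `search_block_size` are assumptions for this illustration, and timings in an interpreted language will not show the cache effects that the generated C code is tuned for.

```python
import time

def blocked_matmul(a, b, n, bs):
    """C = A * B for n-by-n matrices (lists of rows), tiled into bs-by-bs blocks."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                # Multiply one block of A by one block of B.
                for i in range(ii, min(ii + bs, n)):
                    ci, ai = c[i], a[i]
                    for k in range(kk, min(kk + bs, n)):
                        aik, bk = ai[k], b[k]
                        for j in range(jj, min(jj + bs, n)):
                            ci[j] += aik * bk[j]
    return c

def search_block_size(n, candidates):
    """Time each candidate block size and return (fastest, all timings)."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for bs in candidates:
        start = time.perf_counter()
        blocked_matmul(a, b, n, bs)
        timings[bs] = time.perf_counter() - start
    best = min(timings, key=timings.get)
    return best, timings
```

A real search explores several parameters at once (register-level and cache-level block sizes, unrolling depths) and recompiles the generated code for each point, but the structure is the same: generate, measure, keep the best.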

### Sparsity: A Toolbox for Optimized Sparse Matrix Computations (Demmel, Yelick)

We are working towards developing a toolbox for generating high-performance sparse matrix kernels for uniprocessors and SMPs. In particular, sparse matrix-vector multiplication is a core routine of many applications, including large-scale iterative and eigenvalue solvers. As the initial phase of this research, we have studied several optimizations for sparse matrix-vector multiplication, including register blocking, cache blocking, and reordering. The benefits of these techniques vary widely depending on the matrix structure and machine characteristics. Register blocking appears to be the most useful of the three for Finite Element problems and other matrices that arise in physical simulations, because small dense sub-blocks often arise naturally in these matrices. Cache blocking has so far proven useful only for a matrix from a web search application, which has a nearly random structure. Reordering has shown some benefit across problem domains, but the improvement is small, and the result is often worse than leaving the matrix in its natural order if the reordering destroys the natural sub-block structure.
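Register blocking can be illustrated with a fixed 2x2 block size. The sketch below stores a matrix as small dense blocks and multiplies it by a vector; it is a simplified illustration in Python, not the toolbox's generated code, and the function names and storage layout are assumptions for this example (matrix dimensions are assumed to be even).

```python
def to_bsr_2x2(dense):
    """Convert a dense matrix with even dimensions to 2x2 block-sparse-row form.

    Returns (row_ptr, block_col, blocks); all-zero blocks are dropped.
    """
    n_rows, n_cols = len(dense), len(dense[0])
    row_ptr, block_col, blocks = [0], [], []
    for bi in range(0, n_rows, 2):
        for bj in range(0, n_cols, 2):
            blk = (dense[bi][bj], dense[bi][bj + 1],
                   dense[bi + 1][bj], dense[bi + 1][bj + 1])
            if any(blk):
                block_col.append(bj)
                blocks.append(blk)
        row_ptr.append(len(blocks))
    return row_ptr, block_col, blocks

def bsr_matvec_2x2(row_ptr, block_col, blocks, x):
    """Compute y = A * x for a matrix stored by to_bsr_2x2."""
    y = [0.0] * (2 * (len(row_ptr) - 1))
    for br in range(len(row_ptr) - 1):
        y0 = y1 = 0.0                    # accumulators a compiler can keep in registers
        for p in range(row_ptr[br], row_ptr[br + 1]):
            a00, a01, a10, a11 = blocks[p]
            j = block_col[p]
            x0, x1 = x[j], x[j + 1]      # each x entry is loaded once and used twice
            y0 += a00 * x0 + a01 * x1
            y1 += a10 * x0 + a11 * x1
        y[2 * br], y[2 * br + 1] = y0, y1
    return y
```

Compared with a scalar compressed-row format, this stores one column index per block instead of one per nonzero and reuses each loaded element of `x`. The cost is that any zeros inside a stored block are multiplied explicitly, which is why the benefit depends on the matrix having natural dense sub-blocks.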

The Sparsity group is building a toolbox that will be provided as a web service. It will collect information from the user about the matrix structure and machine type, using a combination of questions and answers, supplying example matrices, and downloading code to the user that measures various parameters.

## Shell: Multimedia Component

### Multimedia Proxy Services (Brewer, McCanne)

The proxy services research also continues, culminating in a successful scalable proxy server based on Networks of Workstations, called TranSend. Extending TranSend, the group has developed a generic architecture for supporting diverse Internet applications, suited to the heterogeneous environments dictated by mobility and portable devices. The architecture provides a substrate for Scalable Network Services (SNS), on top of which application developers can design their services without worrying about the details of service management. To demonstrate the utility of this service architecture, we have developed three real-world services on top of it: a web distillation proxy, a proxy-based web browser for PDAs, and an MBone archive server. A particular development by the group is TopGun Wingman, a graphical split Web browser for the Palm Pilot PDA that is currently in use by more than 11,000 users around the world. It is being used with both cellular phones and Metricom packet radios to make truly handheld and untethered access to the Internet a reality.

Media Gateways are software agents which bridge two or more conferencing sessions and process the data streams between the sessions. Examples of such processing include transcoding between two formats, rate limiting, and application of encryption or decryption. One of the primary uses of such gateways is accommodating the heterogeneity inherent in Internet conferences by applying transcoding and rate limiting on the "well connected" portion of the network. Furthermore, a gateway can provide user-level tunnels for bridging multicast-capable islands over a non-multicast-capable link. The Media Gateway (MeGa) Architecture is an experimental system that automatically deploys gateways on the Titan infrastructure on behalf of end users across slow-speed links. The architecture attempts to make the use of the gateways as transparent and seamless as possible. As such, the architecture incorporates the use of the conventional MBone tools: vat, wb, sdr, and vic. Vat and wb can be used in their unmodified versions, while sdr and vic are modified to work within the architecture. (See http://www.cs.berkeley.edu/~elan/mega/)

E. Amir, S. McCanne, R. H. Katz, "Receiver Driven Bandwidth Allocation for Light Weight Session," ACM Multimedia '97 Conference, Seattle, WA (November 1997). Best Paper Award.

S. McCanne, E. Brewer, R. Katz, L. Rowe, E. Amir, Y. Chawathe, A. Coopersmith, K. Mayer-Patel, S. Raman, A. Schuett, D. Simpson, A. Swan, T-L Tung, D. Wu, B. Smith, "Toward a Common Infrastructure for Multimedia-Networking Middleware," Seventh International Workshop on Network and Operating System Support for Digital Audio and Video, St. Louis, (May 1997). Invited Paper.

### Multimedia Toolkit (Rowe)

The Berkeley Multimedia Research Center uses network infrastructure and the NOW supported by Titan and SWW to advance the Berkeley Multimedia Toolkit.

The recent advent of the Internet Multicast service has enabled a number of successful real-time multimedia applications, yet the scalability of these applications remains challenged by the inherent heterogeneity of the underlying Internet. One promising approach for taming this heterogeneity is to encode each media flow as a layered signal that is striped across multiple multicast groups, thereby allowing a receiver to tune its individual reception rate by modulating its subscription to multicast groups. Though significant progress has been made on media transport protocols and congestion control strategies for adjusting multicast groups in this fashion, comparatively little work has been devoted to extending the session directory service and address allocation architecture to meet the needs and requirements of layered media. Moreover, the large-scale deployment of layered media formats is hindered by the lack of support for layered formats in existing session directory tools. To overcome these limitations, we propose a new architecture for session advertisement and caching that exploits multicast "administrative scope" through protocol proxies to admit layered media formats and reduce the start-up latency of a directory-service client by an order of magnitude or more. Our architecture is fully compatible with the existing directory service, allowing our implementation, which is split across a new session directory tool and network proxy, to be incrementally deployed within the current Internet multimedia conferencing architecture.
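The receiver-side idea behind layered transmission can be sketched as follows. This is a simplified illustration, not the protocol used by the tools described here: the function names, the loss threshold, and the example layer rates are all assumptions.

```python
def layers_to_join(layer_rates_kbps, available_kbps):
    """Return how many layers fit in the available bandwidth.

    Layers are cumulative: layer k is only useful together with
    layers 0..k-1, so they are joined in order.
    """
    total = 0
    count = 0
    for rate in layer_rates_kbps:
        if total + rate > available_kbps:
            break
        total += rate
        count += 1
    return count

def adjust_subscription(joined, layer_rates_kbps, loss_rate, loss_threshold=0.05):
    """Drop the top layer under heavy loss; otherwise try adding one more."""
    if loss_rate > loss_threshold:
        return max(joined - 1, 0)
    return min(joined + 1, len(layer_rates_kbps))
```

With layers of 32, 64, 128, and 256 kb/s and 250 kb/s available, `layers_to_join` returns 3, since the first three layers total 224 kb/s. Each layer corresponds to one multicast group, so changing the count means joining or leaving a group, with no per-receiver state at the sender.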

Internet video is emerging as an important multimedia application area. Although development and use of video applications is increasing, the ability to manipulate and process video is missing within this application area. Current video effects processing solutions are not well matched for the Internet video environment. A software-only solution, however, provides enough flexibility to match the constraints and needs of a particular video application. The key to a software solution is exploiting parallelism. Mayer-Patel's papers present the design of a parallel software-only video effects processing system. Preliminary experimental results exploring the use of temporal parallelism are presented. In Wong's paper, we describe the design and implementation of a software video production switcher, vps, that improves the quality of MBone broadcasts. vps is modeled after the broadcast television industry's studio production switcher. It provides special effects processing to incorporate audience discussions, add titles and other information, and integrate stored videos into the presentation. vps is structured to work with other MBone conferencing tools. The ultimate goal is to automate the production of MBone broadcasts.

Andrew Swan, Steven McCanne and Lawrence A. Rowe, Layered Transmission and Caching for the Multicast Session Directory Service, to appear Proceedings of The Sixth Annual ACM International Multimedia Conference, September 1998. Best paper award. (students: A. Swan MS 12/97)

Ketan Mayer-Patel and Lawrence A. Rowe, Exploiting Temporal Parallelism for Software-only Video Effects Processing, to appear Proceedings of The Sixth Annual ACM International Multimedia Conference, September 1998. (students: K. Mayer-Patel MS 12/97)

T. Wong, K. Mayer-Patel, D. Simpson, and L.A. Rowe, A Software-Only Video Production Switcher for the Internet MBone, Multimedia Computing and Networking 1998, Proc. IS&T/SPIE Symposium on Electronic Imaging: Science & Technology, San Jose, CA, January 1998. (students: T. Wong MS 12/97, D. Simpson MS (forthcoming))

K. Mayer-Patel and L.A. Rowe, "Design and Performance of the Berkeley Continuous Media Toolkit," in Multimedia Computing and Networking 1997, Martin Freeman, Paul Jardetzky, Harrick M. Vin, Editors, Proc. SPIE 3020, pp 194-206 (1997). (students: K. Mayer-Patel)

R. Malpani and L.A. Rowe, "Floor Control for Large-Scale MBone Seminars," to appear Proceedings of The Fifth Annual ACM International Multimedia Conference, Seattle, WA, November 1997, pp 155-163. (students: Malpani MS 5/97)

Swan and Mayer-Patel are PhD students. Mayer-Patel has passed quals and will finish 6/99. Swan will take quals in fall 98 and finish the PhD 6/00.

## Driving Applications

### Machine Learning (Russell)

Prof. Russell's research has been enhanced significantly by the availability of the computational infrastructure supported by Titan, particularly the NOW cluster. His research project "Learning complex probabilistic models from data," supported by NSF under grant IRI-9634215, has been a major user of the cluster. Since some of the computational experiments, e.g., on speech recognition, have consumed upwards of 1000 CPU hours, the absence of the NOW would have severely slowed progress.

The project focuses on extending the expressiveness of probabilistic models, extending the scope of learning algorithms, and developing substantial scientific applications. In the last year, we have obtained the following results requiring substantial computational resources:

- *Methods for automatic creation and modification of model structures.* A new algorithm, Structural EM (Friedman, 1997), represents a significant development in computational statistical methods. SEM generalizes the well-known EM algorithm to incorporate structural as well as parametric learning, while retaining the convergence guarantees of EM. We have applied SEM to learn substantial models, including DBN models of speech generation that substantially outperform hidden Markov models for recognition (Zweig & Russell, 1998).
- *Methods for combined model learning and reinforcement learning in partially observable environments.* Andre et al. (1997) have shown how DBN models can be combined with reinforcement learning, providing a powerful method for adaptive control of Markov processes. Dearden et al. (1998) developed a Bayesian formulation of reinforcement learning and derived improved exploration algorithms, with application to several hard problems from the literature.
- *Automated extraction of human driver models from videotapes.* We have developed simple DBN models of human drivers and trained them directly from vehicle tracking data (Oza & Russell, submitted).
- *Adaptive hierarchical control for large Markov processes.* For very large Markov processes, tractable control policies must be hierarchically structured. In (Parr & Russell, 1997), a language is proposed for describing partially specified hierarchical policies. Algorithms are given for efficient online learning of optimal policies consistent with the prior specifications. These results may make practical a theoretically rigorous approach to the control of very large systems. Solution of systems with several thousand states has been demonstrated.
- *Object identification under uncertainty.* When observing over time a system consisting of multiple objects, state estimation and model learning require computing the probability that one observed object is in fact the same as another. We developed appropriate probabilistic models and inference algorithms for this problem, and applied them to estimate freeway travel times using video data streams from widely separated camera sites. The resulting paper (Huang & Russell, 1998) received a Distinguished Paper Award at IJCAI 97 (out of 816 submissions) and will appear as an invited paper in AIJ. This research required both massive computational resources and massive storage facilities for video sequence data.

Selected Relevant Publications (out of 24 papers total)

\begin{enumerate}
\item D. Andre, N. Friedman, and R. Parr, ``Generalized Prioritized Sweeping.'' In {\em NIPS '97}.
\item J. Binder, D. Koller, S. Russell, K. Kanazawa, ``Adaptive Probabilistic Networks with Hidden Variables.'' {\it Machine Learning}, {\bf 29}, 213--244, 1997a.
\item S. Dasgupta, ``The sample complexity of learning Bayesian networks.'' {\it Machine Learning}, {\bf 29}, 165--180, 1997.
\item N. Friedman, ``Learning belief networks in the presence of missing values and hidden variables.'' In {\em ICML-97}.
\item T. Huang and S. Russell, ``Object Identification in a Bayesian Context.'' {\it Artificial Intelligence}, to appear.
\item R. Parr and S. Russell, ``Reinforcement Learning with Hierarchies of Machines.'' In {\em NIPS '97}.
\item G. Zweig and S. Russell, ``Speech Recognition with Dynamic Bayesian Networks.'' In {\em AAAI-98}.

\item John Binder, Kevin Murphy, Stuart Russell, ``Space-Efficient Inference in Dynamic Probabilistic Networks.'' In {\em Proc.~Fifteenth International Joint Conference on Artificial Intelligence}, Nagoya, Japan, 1997b.

\item Sanjoy Dasgupta, ``The sample complexity of learning fixed-structure Bayesian networks.'' {\it Machine Learning}, {\bf 29}, 165--180, 1997.

\item Jeffrey Forbes, Nikunj Oza, Ronald Parr, and Stuart Russell, ``Feasibility Study of Fully Automated Traffic Using Decision-Theoretic Control.'' California PATH Research Report UCB-ITS-PRR-97-18, Institute of Transportation Studies, University of California, Berkeley. April 1997.

\item N. Friedman, D. Geiger, and M. Goldszmidt, ``Bayesian network classifiers.'' {\it Machine Learning}, {\bf 29}, 131--164, 1997.

\item N. Friedman and M. Goldszmidt, ``Sequential update of Bayesian network structure.'' In {\it Proc. Thirteenth Conference on Uncertainty in Artificial Intelligence (UAI)}, Providence, RI, 1997.

\item N. Friedman and M. Goldszmidt, ``Learning Bayesian Networks with Local Structure.'' To appear in M. I. Jordan (Ed.) {\it Learning and Inference in Graphical Models}, 1997.

\item Nir Friedman, Stuart Russell, ``Image Segmentation in Video Sequences.'' In {\it Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence}, Providence, Rhode Island: Morgan Kaufmann, 1997.

\item Daishi Harada, ``Reinforcement learning in time.'' In {\em Proc.~AAAI-97}, Providence, RI, 1997.

\item Timothy Huang and Stuart Russell, ``Object Identification in a Bayesian Context.'' {\it Artificial Intelligence}, to appear (invited paper).

\item Timothy Huang and Stuart Russell, ``Object identification in a Bayesian context.'' Distinguished Paper Prize, in {\em Proc.~Fifteenth International Joint Conference on Artificial Intelligence}, Nagoya, Japan, 1997.

\item Kevin Murphy, ``Inference and Learning in Hybrid Bayesian Networks.'' Technical Report, Computer Science Division, University of California, Berkeley. January 1998.

\item Ron Parr and Stuart Russell, ``Reinforcement Learning with Hierarchies of Machines.'' In {\em NIPS '97: Neural Information Processing Systems}, Denver, 1997.

\item Stuart Russell, Lewis Stiller, and Othar Hansson, ``PNPACK: Computing with Probabilities in Java.'' {\it Concurrency: Practice and Experience}, {\bf 9}, 1333--1339, 1997.

\item Prasad Tadepalli and Stuart Russell, ``Learning from Examples and Membership Queries with Structured Determinations.'' {\it Machine Learning}, to appear.

\item Richard Dearden, Nir Friedman, and Stuart Russell, ``Bayesian Q-Learning.'' To appear in {\it AAAI-98}.

\item Nir Friedman, Kevin Murphy, and Stuart Russell, ``Learning the Structure of Dynamic Probabilistic Networks.'' To appear in {\it UAI-98}.
\end{enumerate}

Students graduated: Nikunj Oza, MS; Geoff Zweig, PhD; Tim Huang, PhD; Ron Parr, PhD; Othar Hansson, PhD.

### Computer Architecture Simulation (Wawrzynek, Patterson, Kubiatowicz)

#### BRASS

The objective of the BRASS project is to develop an architecture and build a prototype that will combine a processor, a high-performance reconfigurable array, and a memory system on a single chip. The reconfigurable array extends the usefulness and efficiency of the processor by providing the means to tailor its circuits for special tasks. The processor improves the efficiency of the reconfigurable array for general-purpose computation. We hope to demonstrate that a processor combined with a reconfigurable array can achieve a significant performance improvement over either a separate processor or a separate reconfigurable device on embedded computing applications.

Our approach is a coordinated attack on the elements needed to demonstrate a combined processor and reconfigurable array: the design of a configurable array architecture that includes features making it more efficient for tight coupling with a processing core; a programming system that can take advantage of the processing core and the reconfigurable resources; a prototype chip implementation of the combined device to verify its practicality; and a demonstration of the efficiency of the device on a set of applications. We have used thousands of CPU hours on the Titan/NOW infrastructure. In this period we specified and began detailed design and layout work on an advanced high-speed reconfigurable array; specified a prototype chip design that merges DRAM and a reconfigurable array; completed a compilation path from C to our Garp chip; developed and demonstrated a library element generator system in the Java programming language; provided interface specifications to the Ptolemy group so that they can use it as an intermediate target when mapping designs to FPGA/RC devices; developed a working strategy for adding placement directives; and worked out a scheme for architecture-specific library inheritance to ease portability. A paper on the generator system has been written and accepted for FCCM'98. Project web page: http://www.cs.berkeley.edu/projects/brass/

T. Callahan, P. Chong, A. DeHon, J. Wawrzynek, "Fast Module Mapping and Placement for Datapaths in FPGAs," published in Proceedings of the 1998 ACM/SIGDA Sixth International Symposium on Field Programmable Gate Arrays (FPGA '98, February 22-24, 1998).

T. Callahan, J. Wawrzynek, "Datapath-oriented FPGA Mapping and Placement for Configurable Computing," presented at FCCM'97, Napa Valley, CA (April 1997).

J. Hauser, and J. Wawrzynek, "Garp: A MIPS Processor with a Reconfigurable Coprocessor," published in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'97, April 16-18, 1997), pp. 24-33.

#### IRAM

Two trends call into question the current practice of fabricating microprocessors and DRAMs as different chips on different fab lines: 1) the gap between the speed of processors and the speed of DRAM is growing at 50% per year; and 2) the size and organization of memory on a single DRAM chip is becoming awkward to use in a system, yet size is growing at 60% per year. Intelligent RAM, or IRAM, merges processing and memory into a single chip to lower memory latency and increase memory bandwidth, as well as to select the best memory size and organization for an application. In addition, IRAM promises savings in power and board area. We have heavily utilized Titan/NOW to evaluate application performance for Vector IRAM by building analytical models of several application kernels. These models are parameterized by hardware variables such as clock rate and number of vector lanes (also called pipes). Using these models, we can predict the performance of the computation as well as the required on-chip memory bandwidth. In addition, we can examine off-chip I/O bandwidth requirements, which are important when IRAM chips are considered in the context of a complete computer system working on problems that do not fit within a single chip.
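As a rough illustration of what such an analytical kernel model can look like, the sketch below predicts run time and required memory bandwidth from clock rate and lane count. It is not the project's actual model: the `Kernel` fields, the `predict` function, the assumption of one operation per lane per cycle, and the DAXPY-like example numbers are all hypothetical, and the model assumes the kernel is purely compute-bound.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    """Hypothetical description of an application kernel."""
    name: str
    ops: float          # arithmetic operations performed
    bytes_moved: float  # bytes read from or written to on-chip memory


def predict(kernel, clock_hz, lanes, ops_per_lane_per_cycle=1.0):
    """Return (seconds, required bytes/second), assuming compute-bound execution.

    Assumption: every lane retires ops_per_lane_per_cycle operations each cycle.
    """
    peak_ops_per_second = clock_hz * lanes * ops_per_lane_per_cycle
    seconds = kernel.ops / peak_ops_per_second
    required_bandwidth = kernel.bytes_moved / seconds
    return seconds, required_bandwidth


# Hypothetical DAXPY-like kernel over one million doubles:
# 2 operations and 24 bytes of memory traffic per element.
daxpy = Kernel("daxpy", ops=2e6, bytes_moved=24e6)

if __name__ == "__main__":
    for lanes in (2, 4, 8):
        seconds, bandwidth = predict(daxpy, clock_hz=200e6, lanes=lanes)
        print(f"{lanes} lanes: {seconds * 1e3:.3f} ms, {bandwidth / 1e9:.1f} GB/s needed")
```

In this simple form, doubling the number of lanes halves the predicted time and doubles the bandwidth the memory system must sustain, which is the kind of trade-off such models are meant to expose.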

The IRAM design has been specified in terms of block diagrams, pins, and functionality. Current work includes the development of a Verilog model for the design and initial circuit synthesis. Full-custom circuits for the low-swing interconnect schemes have also been sized. Project web page: http://iram.cs.berkeley.edu/ .

C. Kozyrakis, S. Perissakis, D. Patterson, T. Anderson, K. Asanovic, N. Cardwell, R. Fromm, J. Golbus, B. Gribstad, K. Keeton, R. Thomas, N. Treuhaft, K. Yelick, "Scalable Processors in the Billion-Transistor Era: IRAM". IEEE Computer, Special Issue: Future Microprocessors - How to use a Billion Transistors, September 1997.

D. Patterson, K. Asanovic, A. Brown, R. Fromm, J. Golbus, B. Gribstad, K. Keeton, C. Kozyrakis, D. Martin, S. Perissakis, R. Thomas, N. Treuhaft, K. Yelick, "Intelligent RAM (IRAM): the Industrial Setting, Applications, and Architecture", Proceedings ICCD '97 International Conference on Computer Design, Austin, Texas, 10-12 October 1997.

D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, K. Yelick, "Intelligent RAM (IRAM): Chips that Remember and Compute," 1997 IEEE International Solid-State Circuits Conference, San Francisco CA, 6-8 February 1997.

D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, K. Yelick, "A Case for Intelligent DRAM: IRAM," IEEE Micro, April 1997.

D. Patterson, T. Anderson, N. Cardwell, R. Fromm, C. Kozyrakis, B. McGaughy, S. Perissakis, K. Yelick, "The Energy Efficiency of IRAM Architectures" ISCA '97: The 24th Annual International Symposium on Computer Architecture, Denver, CO, 2-4 June 1997.

N. Bowman, N. Cardwell, C. Kozyrakis, C. Romer, H. Wang, "Evaluation of Existing Architectures in IRAM Systems" Workshop on Mixing Logic and DRAM: Chips that Compute and Remember at ISCA '97, Denver, CO, 1 June 1997.

D. Patterson, R. Arpaci-Dusseau, K. Keeton, "IRAM and SmartSIMM: Overcoming the I/O Bus Bottleneck" Workshop on Mixing Logic and DRAM: Chips that Compute and Remember at ISCA '97, Denver, CO, 1 June 1997.

### Parallel Databases (Hellerstein, Culler, Patterson)

The NOW infrastructure at Berkeley has been the vehicle for new research in parallel databases, leading to world-record performance on the Sort benchmark. Disk-to-disk sorting is considered an excellent parallel hardware and software benchmark because it stresses the I/O infrastructure of a system, and hence tests peak performance for data-intensive applications like decision support and data mining. The best results preceding NOW-Sort were produced by industrial researchers working on expensive, well-endowed versions of shared-memory parallel computers (SMPs) produced by their parent companies. The NOW-Sort work achieved its record using commodity single-processor workstations connected by a high-speed network, with software developed and tuned by Berkeley students. The NOW-Sort results and lessons have been documented in the series of research papers listed below. The record still stands a year after it was set on the NOW. This is especially surprising given the rate of performance increases in hardware, and is testimony both to the level of tuning that went into the work and to the robustness of the hardware infrastructure.

[1] "Searching for the Sorting Record: Experience with NOW-Sort." Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, David A. Patterson. To appear in 2nd SIGMETRICS Symposium on Parallel and Distributed Tools, August 1998.

[2] "The Architectural Costs of Streaming I/O: A Comparison of Workstations, Clusters, and SMPs." Remzi H. Arpaci-Dusseau, Andrea C. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, David A. Patterson. High Performance Computer Architecture, Feb. 1998.

[3] "High-Performance Sorting on Networks of Workstations." Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, David A. Patterson. ACM SIGMOD Conference on Management of Data, May 1997.

### IMPULSE-based simulation (Canny)

The dynamic simulator IMPULSE was developed by John Canny and graduate student Brian Mirtich. It was designed to study systems with a great deal of intermittent and changing contact, especially vibratory part feeders. The IMPULSE web page, which includes links to papers and videos, is http://www.cs.berkeley.edu/~mirtich/impulse.html.
The principle of the simulator is to use a good {\em local} model of contact - an impulse model - and to derive all contact response from it. The simulator has performed very well in that role and has since been generalized for other tasks. IMPULSE was first described in \cite{MC95:dynsim}. A second paper \cite{Mir95} described its extension to articulated bodies.
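To make the idea of a local impulse model concrete, here is a minimal sketch of a single collision response. It is not the IMPULSE contact model itself: it assumes frictionless point masses and a fixed coefficient of restitution, and the function name `collision_impulse` and its parameters are hypothetical.

```python
def collision_impulse(m1, v1, m2, v2, normal, restitution=0.5):
    """Apply one frictionless collision impulse between two point masses.

    normal is a unit vector pointing from body 1 toward body 2.
    Returns the post-collision velocities (v1_new, v2_new) as tuples.
    """
    # Speed at which body 1 approaches body 2 along the contact normal.
    approach = sum((a - b) * n for a, b, n in zip(v1, v2, normal))
    if approach <= 0.0:
        # Bodies are already separating: no impulse is needed.
        return tuple(v1), tuple(v2)
    # Impulse magnitude from Newton's restitution law and momentum conservation.
    j = (1.0 + restitution) * approach / (1.0 / m1 + 1.0 / m2)
    v1_new = tuple(a - (j / m1) * n for a, n in zip(v1, normal))
    v2_new = tuple(b + (j / m2) * n for b, n in zip(v2, normal))
    return v1_new, v2_new


if __name__ == "__main__":
    # Equal masses, perfectly elastic, head-on: the velocities are exchanged.
    print(collision_impulse(1.0, (1.0, 0.0), 1.0, (0.0, 0.0), (1.0, 0.0), restitution=1.0))
```

A simulator built on this principle applies such impulses whenever bodies touch, rather than maintaining explicit contact constraints; handling friction and rotating rigid bodies requires a considerably richer model than the one shown.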

The simulator was used to predict the stable pose distribution of real industrial parts. The results are described in \cite{MZG96}. This paper includes experimental data on stable poses for a number of different parts. The simulation results correlate very well with these, so this paper provides support for the realism of IMPULSE for simulations involving many rigid-body collisions.

Another paper described the use of IMPULSE as a design tool for vibratory bowl feeders. Vibratory bowls use strategically placed slots and fences to deflect parts that are not properly oriented. They are among the most widely used devices in manufacturing, but their design is currently an entirely manual exercise. Some design principles have been suggested by Boothroyd, but these suppose that the statistical effects of the slots and fences on a given part are known.

The paper \cite{Berk96} describes a simulation-only approach to feeder design. A geometric model of the feeder track is built first, with one or more of the design parameters specified as variable. A supervisory program systematically varies those parameters, and then performs a simulation on each geometry. Each simulation includes parts in several different initial poses. For each geometry, the success rate is computed for each feature and initial pose, and used to construct a state-transition diagram for the feeder. A follow-up paper \cite{BC97} includes some improvements in the design tool and compares the results with trials on a real feeder.
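The supervisory loop described above can be sketched as follows. This is only an outline under stated assumptions: `simulate` is a hypothetical stand-in for a full dynamic simulation (its toy rejection rule is invented for illustration), and `slot_width`, `POSES`, `transition_table`, and `sweep` are illustrative names, not part of the published design tool.

```python
import random
from collections import Counter

POSES = ["upright", "on_side", "inverted"]  # hypothetical initial poses


def simulate(slot_width, initial_pose, rng):
    """Hypothetical stand-in for one dynamic simulation of a part passing a feature.

    Toy rule (an assumption, not simulator physics): wider slots reject a larger
    share of parts that are not upright. Returns the resulting pose or "rejected".
    """
    if initial_pose != "upright" and rng.random() < min(1.0, slot_width):
        return "rejected"
    return initial_pose


def transition_table(slot_width, trials=200, seed=0):
    """Estimate outcome probabilities for each initial pose at one geometry."""
    rng = random.Random(seed)
    table = {}
    for pose in POSES:
        counts = Counter(simulate(slot_width, pose, rng) for _ in range(trials))
        table[pose] = {outcome: n / trials for outcome, n in counts.items()}
    return table


def sweep(slot_widths):
    """Supervisory loop: vary the design parameter and tabulate each geometry."""
    return {width: transition_table(width) for width in slot_widths}


if __name__ == "__main__":
    for width, table in sweep([0.25, 0.5, 0.75]).items():
        print(width, table)
```

Each table row plays the role of the outgoing edges of one node in a state-transition diagram for that geometry; a designer would compare the tables across parameter values to choose the geometry that best passes correctly oriented parts and rejects the rest.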

Simulation was also used to study a new design of planar motion arrays using MEMS. The paper \cite{RBC96} describes our design efforts for a particular MEMS array using IMPULSE. We studied several design parameter changes, and we were able to propose a solution to the binding problem that these arrays exhibit. (We found in simulation that parts being fed would stick at the boundaries between feeder rows; later we found that this happens in the real arrays as well.)

\bibitem[BC96]{Berk96} Dina Berkowitz and John Canny. \newblock Designing part feeders with dynamic simulation. \newblock In {\em IEEE Conference on Robotics and Automation}, pages 1127--1132. IEEE, 1996.

\bibitem[BC97]{BC97} Dina~R. Berkowitz and John Canny. \newblock A comparison of real and simulated designs for vibratory parts feeding. \newblock In {\em Proceedings of the IEEE Conference on Robotics and Automation}, 1997.

\bibitem[MC95]{MC95:dynsim} Brian Mirtich and John Canny. \newblock Impulse-based simulation of rigid bodies. \newblock In {\em Symp. on Interactive 3D Graphics}, 1995. \newblock Monterey, CA.

\bibitem[Mir95]{Mir95} Brian Mirtich. \newblock Hybrid simulation: Combining constraints and impulses. \newblock In {\em Proceedings of First Workshop on Simulation and Interaction in Virtual Environments}, 1995.

\bibitem[MZG{\etalchar{+}}96]{MZG96} Brian Mirtich, Yan Zhuang, Ken Goldberg, John Craig, Rob Zanutta, Brian Carlisle, and John Canny. \newblock Estimating pose statistics for robotic part feeders. \newblock In {\em IEEE International Conference on Robotics and Automation}, May 1996. \newblock Minneapolis.

\bibitem[RBC96]{RBC96} D.~Reznik, S.~Brown, and J.~Canny. \newblock Dynamic simulation as a design tool for a microactuator array. \newblock In {\em IEEE Conf. on Robotics and Automation (ICRA)}, 1996. \newblock Albuquerque, NM.

GRADS: Brian Mirtich completed his Ph.D. in Spring of 1996, and is now a researcher at MERL in Cambridge, Mass.

### Digital Libraries (Wilensky, Forsyth, Fateman)

In the past couple of years our system infrastructure has advanced significantly through Titan and gifts from companies including Sun Microsystems, IBM, HP, Microsoft and EMASS. The Massive Data Store in the Titan proposal is currently an IBM tape library storage system that includes an RS/6000 (model 58H) server with 100 GB of SSA disk, and a 3494 tape library with two 3590 (Magstar) tape drives. With support from EMASS, we are currently in production running AMASS version 4.8.1 on the IBM hardware. We have transferred the 400+ GB of data from our old tertiary storage server to the IBM 3494 tape library, and are continuing to grow the collection. The old tertiary storage manager (Metrum RSS-600, also running AMASS) is still on-line as a read-only archive.

Our main web server has moved from the older HP equipment to our new Sun Ultra Enterprise 3000 server, a dual-processor model with 512MB of memory and 42GB of disk. We have since expanded the system with 50GB of additional disk space, 2 F/W SCSI-2 controllers, and a Sony SDX-300 AIT tape drive for backups. This server now hosts all of our on-line web pages and data for our image, document, and geographic data collections (larger resolutions are stored on tertiary storage). The system also operates as a database server and compute server for OCR, image processing, and other supporting functions for the project. On the software side, we have also migrated from the NCSA web server to the Apache web server.

As our collections continue to grow, we have also acquired disk space on the Sun Sonoma file servers provided by the extension of Titan and major Sun donations. This facility uses RAID-5 for increased reliability. This has relieved the space pressure that we were experiencing on our production server. All of the document data and indexing storage has now been moved to the Sonomas. Some of the freed space has been used to extend overfull partitions and to consolidate our image collection. More space consolidation and expansion is currently underway.

HP has donated 2 new C160 workstations, 3 new Vectra PCs, and 5 HP 5P color scanners. The new Vectra PCs help support our Java development environment, and the C160 workstations have replaced older desktop workstations within the group. We plan to use the scanners to explore further ways to share information and are looking to incorporate them into our daily work. As an example, we have used them to scan and post handwritten meeting notes on the network and to scan and OCR a printed paper to digitize its references.

In the summer of 1996, Intel donated eleven 200 MHz Pentium Pro PCs to support our project efforts. Each PC comes fully equipped with 64MB RAM, an 8x CD-ROM drive, a Fast EtherLink PCI 10/100Base-T network interface card, an Adaptec Ultra-wide SCSI adaptor card, and a Matrox MGA Millennium graphics card with 2MB VRAM for 64-bit graphics. In addition, Microsoft has supplied us with one of just about every piece of PC software they have, on each of our 11 Windows NT machines.

Prof. Fateman has used Titan to implement a web server providing symbolic integration table lookup (http://http.cs.berkeley.~fateman/htest.html). This involves heavy computation (about a week of CPU time) for computing a high-order Taylor series expansion describing the energy dissipation in a classic 3-dimensional vortex problem. The program was written in Macsyma and recompiled for Allegro Common Lisp.

References:

Gary Kopec. "Multilevel character templates for document image decoding", in Document Recognition IV, L. Vincent and J. Hull, editors, Proc. SPIE vol. 3027, 1997.

Serge Belongie, Chad Carson, Hayit Greenspan, and Jitendra Malik. "Color- and Texture-based Image Segmentation Using EM and Its Application to Content-Based Image Retrieval." International Conference on Computer Vision, Jan. 4-7, 1998, Bombay, India.

Serge Belongie and Jitendra Malik. "Finding Boundaries in Natural Images: A New Method Using Point Descriptors and Area Completion." Submitted to the European Conference on Computer Vision, 1998, Freiburg, Germany.

Michael Buckland. "What is a Document." Journal of the American Society for Information Science. 48(9), pp. 804-809, 1997.

Michael Buckland and Christian Plaunt. "Selecting Libraries, Selecting Documents, Selecting Data". International Symposium on Research, Development and Practice in Digital Libraries, ISDL 97, pp. 85-91. Nov. 18-21, 1997, University of Library and Information Science, Tsukuba City, Japan.

Michael Buckland, Youngin Kim and Barbara Norgard. "Search Support for Unfamiliar Metadata Vocabularies." Unpublished manuscript.

Chad Carson, Serge Belongie, Hayit Greenspan, and Jitendra Malik. "Region-Based Image Querying." Workshop on Content-Based Access of Image and Video Libraries, associated with the Conference on Computer Vision and Pattern Recognition, June 20, 1997, San Juan, Puerto Rico.

Chad Carson, Serge Belongie, Hayit Greenspan, and Jitendra Malik. "Color- and Texture-Based Image Segmentation Using EM and Its Application to Image Querying and Classification." Submitted to Pattern Analysis and Machine Intelligence.

Richard J. Fateman. "More Versatile Scientific Documents", Proceedings of the Fourth International Conference on Document Analysis and Recognition, Ulm, Germany, 18-20 Aug. 1997, IEEE Computer Society, 1997, p. 1107-1110, vol.2.

Richard Fateman. "How to Find Mathematics on a Scanned Page." Unpublished manuscript.

D.A. Forsyth and M.M. Fleck. "Finding People and Animals by Guided Assembly", Proc. International Conference on Image Processing, Santa Barbara, 1997.

David Forsyth, Jitendra Malik, and Robert Wilensky. "Searching for Digital Pictures." Scientific American, June 1997.

Gary Kopec. "An EM Algorithm for Character Template Estimation." Submitted to IEEE Trans. PAMI.

Ray R. Larson and Jerome McDonough. "Cheshire II at TREC 6: Interactive Probabilistic Retrieval." In: The Sixth Text REtrieval Conference, D.K. Harman and E.M. Voorhees, eds. (in press)

Thomas Leung and Jitendra Malik. "Contour Continuity in Region Based Image Segmentation." Submitted to the European Conference on Computer Vision, 1998, Freiburg, Germany.

Ginger Ogle. California Native Plant Society newsletter. Oct, 1997.

Thomas A. Phelps and Robert Wilensky. "Multivalent Annotations." In the Proceedings of the First European Conference on Research and Advanced Technology for Digital Libraries, September 1-3, 1997 Pisa, Italy.

Christian Plaunt and Barbara Norgard. "An Association Based Method for Automatic Indexing with a Controlled Vocabulary." To appear in the Journal of the American Society for Information Science.

Lisa R. Schiff, Nancy A. Van House, and Mark H. Butler. "Understanding Complex Information Environments: a Social Analysis of Watershed Planning." Digital Libraries '97: Proceedings of the ACM Digital Libraries Conference, Philadelphia, PA, July, 1997. pp. 161-186.

Jianbo Shi and Jitendra Malik. "Normalized Cuts and Image Segmentation." International Conference on Computer Vision, Jan. 4-7, 1998, Bombay, India.

Nancy A. Van House, Mark H. Butler, and Lisa R. Schiff. "The Situated Nature of Information: Practices and Artifacts." Submitted to the Journal of the American Society for Information Science.

Robert Wilensky and Isaac Cheng. "An Experiment in Enhancing Information Access by Natural Language Processing." UC Berkeley Computer Science Technical Report UCB/CSD-97-963, June 1997.

Richard Fateman. "Symbolic Computation of Turbulence and Energy Dissipation in the Taylor Vortex Model." Intl. J. Modern Physics C, Vol. 9, No. 3 (May 1998).

### Theory (Sinclair)

The computing resources provided by Titan have been of use in several research projects. Two of the most recent examples are the following. In [RSW] we investigate load balancing in distributed networks under various iterative schemes, including diffusion and dimension exchange. We obtain tight bounds on the number of rounds required to achieve coarse balancing (up to a small discrepancy that depends only on the network itself). Although the final results are analytic in nature, we had to perform many simulations in order to guide our analytic work. For example, we were able to prove surprisingly tight bounds for the cycle and torus networks, which were first suggested to us by large-scale simulations on Titan machines.
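A small simulation of the kind that can guide such analysis is sketched below for a diffusion scheme on a cycle. It is an illustrative assumption rather than the setup used in [RSW]: load is treated as continuously divisible, every node exchanges a fixed fraction `alpha` of the load difference with each neighbour per round, and the function names are hypothetical.

```python
def diffusion_round(load, alpha=1.0 / 3.0):
    """One synchronous diffusion round on a cycle of len(load) nodes."""
    n = len(load)
    return [
        load[i]
        + alpha * (load[(i - 1) % n] - load[i])
        + alpha * (load[(i + 1) % n] - load[i])
        for i in range(n)
    ]


def rounds_to_balance(load, tolerance, alpha=1.0 / 3.0, max_rounds=100000):
    """Count rounds until the gap between max and min load is within tolerance."""
    rounds = 0
    while max(load) - min(load) > tolerance and rounds < max_rounds:
        load = diffusion_round(load, alpha)
        rounds += 1
    return rounds, load


if __name__ == "__main__":
    for n in (8, 16, 32):
        start = [float(n)] + [0.0] * (n - 1)  # all load starts on one node
        rounds, _ = rounds_to_balance(start, tolerance=0.01)
        print(f"cycle of {n} nodes: {rounds} rounds")
```

Running the loop for increasing cycle sizes shows how quickly the round count grows with the network, which is the sort of empirical trend that can suggest what bound to try to prove.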

In [R] a novel approach to the classical problem of approximating the permanent of a positive real matrix is investigated. This involves substituting matrix entries by suitably chosen random elements from a Clifford algebra. Surprisingly, this method yields an algorithm that is more efficient in the worst case than any known competitor. In the development of the algorithm, large symbolic computations were performed in order to prove certain algebraic identities. These were crucial to both the design and the analysis of the algorithm.
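The general pattern of "substitute random elements, then evaluate a determinant" can be illustrated with a much simpler, well-known estimator due to Godsil and Gutman, which uses random signs rather than Clifford-algebra elements. The sketch below is that simpler estimator, shown only for orientation; it is not the algorithm of [R], and all function names are illustrative. For a matrix A with nonnegative entries, if B has entries equal to a random sign times the square root of the corresponding entry of A, then the expected value of det(B)^2 equals the permanent of A.

```python
import itertools
import math
import random


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(map(float, row)) for row in matrix]
    n = len(a)
    det = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[pivot][c] == 0.0:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det
        det *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return det


def permanent(matrix):
    """Exact permanent by summing over all permutations (small matrices only)."""
    n = len(matrix)
    return sum(
        math.prod(matrix[i][p[i]] for i in range(n))
        for p in itertools.permutations(range(n))
    )


def estimate_permanent(matrix, samples, rng):
    """Godsil-Gutman estimator: average det(B)^2 over random sign matrices."""
    n = len(matrix)
    total = 0.0
    for _ in range(samples):
        b = [
            [rng.choice((-1.0, 1.0)) * math.sqrt(matrix[i][j]) for j in range(n)]
            for i in range(n)
        ]
        total += determinant(b) ** 2
    return total / samples


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    print(permanent(a), estimate_permanent(a, 2000, random.Random(1)))
```

The estimator is unbiased, but its variance can be large; reducing that variance by drawing the random elements from richer algebraic structures is the motivation for approaches of the kind described above.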

References

[RSW] Yuval Rabani, Alistair Sinclair and Rolf Wanka, "Local divergence of Markov chains and analysis of iterative load-balancing schemes," submitted to IEEE Symposium on Foundations of Computer Science, 1998.

[R] Lars Rasmussen, "New results in approximate counting," PhD thesis, Computer Science Division, UC Berkeley, to be filed July 1998.

### Fire Simulation (Sequin)

The Titan computing resources have been used for distributed simulation and visualization of fires in buildings, funded under a new project (USDC/NIST 60NANB5D0082). Integrating powerful simulation technology with virtual reality visualization systems makes it possible to interpret and visualize the results of complex simulations intuitively via 3D computer graphics. One application domain is the design and evaluation of buildings through the use of virtual computer models that are complete and detailed enough that users can preview architectural designs, evaluate their performance with various metrics, and do simulations and "what-if" experiments cheaply and with no risk. We have developed an environment that allows ordinary users to study the spreading of fires in buildings in an intuitive manner. To obtain realistic answers to such experiments, we had to integrate good physical simulations with virtual environment interfaces.

Such seamless integration of high-performance physical simulations requires large quantities of computing power and the ability to distribute information dynamically between simulators and visualization clients. To that end, we are investigating methods for handling the problems of real-time distributed simulation-visualization data management. The Berkeley Architectural Walkthru has already addressed some of the problems of distributed visualization and of the interaction between the user and the virtual world. In our recent work, we have shown that the basic virtual environment structure used in the Walkthru, a spatial subdivision of the world into densely occluded cells with connecting portals, can be put to good use for simulation data management. In addition to optimizing the visualization task, it is also useful for optimizing bandwidth requirements between a visualizer and simulator running on networked workstations, both for the purpose of communicating conditions to the simulator and communicating simulated states back to the visualizer. Using this structure, we can optimize bandwidth requirements for arbitrarily large visualizations and simulations, and relieve the visualization and simulation designers of the complexity of the data management problem. We are currently extending this solution to multiple distributed visualizers and simulators operating on one virtual world, using networked Windows NT computers and Silicon Graphics workstations to create dynamic, physically realistic, multiuser distributed virtual worlds.
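One way to picture how a cell-and-portal subdivision can limit traffic between a simulator and a visualizer is sketched below. It is a simplified assumption, not the Walkthru's actual data management scheme: interest is approximated by the number of portals crossed from the viewer's cell instead of true visibility through portals, and the names `cells_of_interest` and `filter_updates` are hypothetical.

```python
from collections import deque


def cells_of_interest(portals, viewer_cell, max_hops):
    """Breadth-first search over the portal graph, up to max_hops portals away."""
    hops = {viewer_cell: 0}
    queue = deque([viewer_cell])
    while queue:
        cell = queue.popleft()
        if hops[cell] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in portals.get(cell, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[cell] + 1
                queue.append(neighbour)
    return set(hops)


def filter_updates(updates, interest):
    """Keep only simulated state for cells the visualizer currently cares about."""
    return {cell: state for cell, state in updates.items() if cell in interest}


if __name__ == "__main__":
    # Hypothetical building: each cell lists the cells it connects to via portals.
    portals = {
        "lobby": ["hall"],
        "hall": ["lobby", "office", "stairs"],
        "office": ["hall"],
        "stairs": ["hall", "roof"],
        "roof": ["stairs"],
    }
    interest = cells_of_interest(portals, "lobby", max_hops=2)
    updates = {"office": {"temp_c": 300}, "roof": {"temp_c": 25}}
    print(filter_updates(updates, interest))
```

Because updates for cells outside the interest set are never sent, the bandwidth needed depends on the neighbourhood of the viewer rather than on the size of the whole building model.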

Recent results were reported at the following conferences:

Bukowski, R.W. and Sequin, C.H. Performance Evaluation in a Virtual Environment, Part III: Understanding Performance Through an Interactive Environment. To appear in Proceedings of the Second International Conference on Performance-Based Codes and Fire Safety Design Methods (Maui, Hawaii, May 1998).

Bukowski, R.W. and Sequin, C.H. Interactive Simulation of Fire in Virtual Building Environments. Proceedings of SIGGRAPH 97 (Los Angeles, CA, August 1997).

Bukowski, R.W. and Sequin, C.H. The FireWalk System: Fire Modeling in Interactive Virtual Environments. Proceedings of the 2nd International Conference on Fire Research and Engineering (Gaithersburg, MD, August 1997).

### Rapid Prototyping Interface for 3D Solid Parts (Sequin)

The Titan infrastructure has been used in a rapid prototyping project funded under NSF grant MIP-9632345. We utilize the networking and the Sun workstations in the NOW cluster to run the ACIS solids modeling package needed in the development of SIF (Solid Interchange Format) and its experimental use in the CyberCut project headed by Prof. Paul Wright in Mechanical Engineering.

In this research we are developing a simple and clean language for use as a digital interface for rapid prototyping of mechanical parts using Solid Free-Form Fabrication (SFF) or a special machining approach called CyberCut, in which the part to be fabricated is encapsulated and rigidly held in place with a special plastic material that can later be removed easily. The role of this "Solid Interchange Format" (SIF) is to describe the desired solid part in an unambiguous, fabrication-process-independent way, so that the interaction between designers and fabricators can be simplified and streamlined. As in the current proposal, a key issue is to first understand the interactions between designers and fabricators and to capture the semantics of the information that needs to flow across this interface. Then a robust yet efficient language has to be developed to serve this purpose.

During the first year of our contract period we defined the language and started to use it in research as well as in classroom settings. We also had some parts fabricated from designs produced with a new interactive CAD tool that describes its output in a special dialect, SIF_DSG, developed for the CyberCut machining environment.

S. McMains, C.H. Sequin, "SIF: The Emerging Solids Interchange Format", Fifth SIAM Conference on Geometric Design, Nov 3-6, 1997, Nashville, TN.

## Education and Outreach

As part of our research collaborations and our involvement in the National Partnership for Advanced Computational Infrastructure (NPACI), NOW has been used extensively for a wide variety of applications. NPACI accounts alone amount to 45 users from 17 institutions. In addition, NOW and Clumps were used extensively in the graduate parallel computing seminar this semester, taught by Prof. Kathy Yelick (http://www.cs.berkeley.edu/~dmartin/cs267). There were three programming assignments. The first was a matrix multiplication race to optimize this routine for the memory hierarchy on a single UltraSPARC processor of the NOW. The students used multiple levels of tiling, hand optimizations for loop unrolling and instruction scheduling, and experimented with the various optimizations provided by the compiler. The second project was an n-body calculation on multiple nodes of the NOW and multiple processors of an SMP within a CLUMP. The students were required to implement optimizations for locality to avoid the all-to-all exchange of data that occurs naturally in the O(n^2) algorithm, and several groups also addressed load imbalance issues. The third assignment was an optimization of a conjugate gradient algorithm on the NOW, for which they performed communication optimizations, load balancing, and numerical analysis of the algorithm.
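The tiling optimization used in the first assignment can be illustrated with a single level of blocking, shown here in Python for readability. This is only a sketch of the technique: the function name, the default tile size, and the list-of-lists representation are assumptions, and the students' tuned code would have been written in a compiled language with several levels of tiling.

```python
def matmul_tiled(a, b, n, tile=32):
    """Multiply two n-by-n matrices (lists of rows) using one level of tiling.

    Working on tile-by-tile blocks keeps the operands of the inner loops
    small enough to stay in cache on a real memory hierarchy.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # Multiply the (ii, kk) block of a by the (kk, jj) block of b.
                for i in range(ii, min(ii + tile, n)):
                    row_a = a[i]
                    row_c = c[i]
                    for k in range(kk, min(kk + tile, n)):
                        a_ik = row_a[k]
                        row_b = b[k]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(matmul_tiled(a, a, 3, tile=2))
```

The tile size would normally be chosen so that three tiles fit in the cache level being targeted; the `min` bounds let the routine handle sizes that are not a multiple of the tile.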

Most of the final projects are related to the students' research interests, and many are likely to be continued as part of their thesis research. The projects fall roughly into two categories: parallel applications and system support for parallel machines. In the first category, there is a parallel stiff ODE integrator, a comparison of parallel direct solver software for Finite Element Methods, a radiative transfer algorithm based on Monte Carlo simulation, and a solution of the transport-of-intensity equation. There are also two applications that come from problem domains outside of scientific computing, one of which is a parallelization of a database join.

In the second category, systems development, there is a program to simulate a multiprocessor with a large number of processors on a system with a smaller number of processors; this will allow researchers to investigate scaling issues in algorithms and systems. There are three projects related to thread support: one that adds threads to the SPMD model in the Titanium language, another that studies the performance problems of threads and caching on SMPs in pSather, and a third that looks at the problem of building a thread-safe library from an unsafe one and at tools to aid in this conversion. Finally, there is a project to build support for interactive parallel visualization.