TITAN: A Next-Generation Infrastructure for Integrating Computing and Communication

Grant Number: CDA-9401156

Computer Science Division
University of California, Berkeley

Technical Progress Report (08/01/96 - 07/31/97)


Overview

UC Berkeley is developing as its research infrastructure a new type of computing system, called Titan, which harnesses breakthrough communications technology to integrate a large collection of commodity computers into a powerful resource pool that can be accessed directly though its constituent nodes or though inexpensive media stations. This project seeks to investigate the computer science issues in large scale computing systems as they will be. We are roughly midway through the project. The system comprises the largest tightly integrated cluster in the world. It is being utilized on a set of advanced applications with demanding computational, I/O, and graphics requirements throughout the department and by researchers at other institutions. It recently set the world’s record in disk-to-disk sorting benchmark and cracked the RSA 40-bit key challenge. It is a critical resource to research throughout the department.

Background

The University of California at Berkeley is a premier research and teaching institution. Its roots go back to the gold rush days of 1849 when the drafters of the State Constitution required the legislature to "encourage by all suitable means the promotion of intellectual, scientific, moral and agricultural improvement" of the people of California. Fifteen members of the Berkeley faculty have been awarded Nobel Prizes and in 1966 Berkeley was recognized by the American Council on Education as "the best balanced distinguished university in the country."

The Computer Science Division of the Electrical Engineering and Computer Sciences department is recognized as a world leader, rated #1, along with MIT and Stanford in recent national evaluations. The division consists of 30 faculty, roughly 200 graduate students, and a large undergraduate population in both the College Engineering and the College of Letters and Science. The department has had a major impact on the technological world with developments such as the BSD UNIX operating system, computer-aided design tools for integrated circuits (such as Spice and Magic), relational database systems, IEEE floating point, and pioneering work in RISC computer architectures and RAID storage systems. It is also renown for its theoretical work, such as the theory of NP-completeness, and multiple of its faculty have received the Turing Award.

In the late 80's the Division began a major effort to construct a new Computer Science building. Because of its strong impact on the computing industry, it was able to fund and develop a state-of-the-art facility, Soda Hall. The Titan project grew out of the thought processes of designing the new building and the efforts by the faculty to envision what would be the dominant directions of computing as we enter into the next century. We wanted to create an environment in which to experience and investigate the salient issues of computer systems as they "will be" and in which we could rapidly incorporate advancing technology, especially networking, high-performance computing, and interactive multimedia.

The Computer Science faculty at the University of California proposed to NSF to develop as its computing and communication infrastructure a new type of computing system, called Titan, which would harness breakthrough communications technology to integrate a large collection of commodity computers into a powerful resource pool that can be accessed directly through its constituent nodes or through inexpensive media stations. The vision was to treat the building as an integrated computing system, with a core computing component providing vast amounts of computing power and storage, connected to media stations and other advanced devices. A software architecture for the global operating system and programming language would be developed and the system design would be driven by a set of advanced applications with demanding computational, I/O, and graphics requirements. Funding for the project is shared between the National Science Foundation and the University, with individual research groups adding value to this infrastructure through their research personnel and equipment supported through other sources. NSF requested, in response to the original proposal and its addendum, that the project directly incorporate a significant experimental systems research component, along the lines described in a separate UCB proposal: "NOW: Design and Implementation of a Distributed Supercomputer as a Cost-effective Extension to a Network of Workstations." The Titan project comprises a core computing component, a multimedia component, an advanced networking component, and a set of driving applications.

In this report we outline the progress on Titan during its third year.


Physical Infrastructure

Computing Infrastructure

We completed acquisition of the core computing infrastructure, 100 Ultra 170 workstations connected by a high speed Myrinet. We augmented the core computing infrastructure with the donation of 35 Intel PentiumPro PCs and roughly 400 IBM disks to be used in construction of a massive storage cluster. We also augmented the core computing infrastructure with the donation of 4 8-way Sun Enterprise 5000 SMPs. We augmented the mediastation component with the donation of roughly 100 Intel Pentium Pro PC and monitors from Sony and Samsung.

Networking Infrastructure

We have deployed the building wide scalable networking infrastructure in phases as the technology has evolved. In developing the core, we found that it was critical to provide a scalable external network between the core computing facilities and the rest of the department, in addition to the internal high performance network. We used this structured situation to push the ATM-to-switched-Ethernet technology. This proved to be rather challenging and require multiple commercial generations to get right, and we now have a solid scalable ATM backbone supporting a large number of machines via switched Ethernet. We have tracked the development of ATM, 100 Mb/s Ethernet, and the emerging Gb Ethernet.

We deployed switched Ethernet on an ATM backbone within the core NOW cluster. We deployed an interim switched 100 Mb/s network for the Intel media stations. In addition, we developed and deployed experimental networks support wireless communication, dedicated internet services, experimental protocols, and metropolitan connectivity within the original communications infrastructure of the CS building.

 


Core: Architecture and Operating Systems (T. Anderson, D. Culler, D. Patterson)

Titan Core Computing Component

This year we completed construction of the Titan core as part of the Network of Workstations (NOW) project. It is operational and being used heavily throughout the department, while it continues to undergo development through research activities. The hardware consists of a cluster of 105 Sun UltraSPARC workstations, obtained through NSF and ARPA funds complemented by a major donation from Sun Microcomputer Corp., interconnected by a high performance Myricom network, with 160 MB/s links and multi-GB/s of bisection bandwidth. A picture of a portion of this system is shown in Figure 1. Recently, Sun donated an additional 100 2.3 GB internal disks and 100x64 MB of memory to the project, bringing the full system up to 30 GFLOPS peak computing power with 12.8 GB of memory and 460 GB of internal disks. In addition, we have assembled a cluster of 32 PentiumPros donated by Intel, 24 of which will host 4 TBs of disks, mostly donated by IBM, in a novel massive storage architecture. Furthermore, Sun donated 4 8-way Enterprise 5000 SMPs to further our investigation of this new clustered approach to large scale system design.

 

 

 

 

 

 

 

 

 

 

Figure 1. Titan's Core Computing NOW System

A global Operating System layer, called GLUnix, was completed. It provides NOW-wide process management as a layer on top of Solaris. Using either a global shell, glush, or the glurun command, sequential processes can be started anywhere on the NOW or parallel processes can be started on multiple nodes. The design was enhanced to support dedicating subsets of the resources to particular research activities within the department. The development of this infrastructure gave rise to three new areas of systems research: interpositioning techniques for system functions, smart-clients, and web-based file systems.

The Generic Active Messages (GAM) communication layer, providing low overhead, high bandwidth communication for parallel programs, was completed and thoroughly analyzed in several publications. It provides a 10 µs end-to-end communication of small (8 word messages) and up to 38 MB/s bulk transfer bandwidth. GAM was enhanced to served as an empirical tool for studying the sensitivity of application performance to communication overhead, latency, and bandwidth. A complete, broad-spectrum parallel programming environment became fully operational on top of these new facilities, including MPI, Split-C, HPF, and the Castle and ScaLAPACK software libraries. We demonstrated scalability on NOW to be as good as the T3D, with greater per processor computing power, and better than the IBM SP-2 on the NAS Parallel Benchmarks, using MPI over Active Messages. NOW will be the first cluster on the TOP-500 list for its ScaLAPACK performance.

We showed that the communication potential of the emerging very high performance networks can be delivered to user and system applications in a commercial strength operating system by integrating the networking deeply into the virtual memory system. To this end, we have developed a concept of "virtual networks" and produced a driver that provides multiple, independent user virtual networks, where communication intensive applications get bound dynamically to communication resources.

We demonstrated that scalable file systems can be constructed by distributing all of the functions of the file server across such a cluster: storage, caching, and management. Recently we obtained world-record performance on industry standard disk-to-disk sort benchmark, sorting 8.4 GB of data in 1 minute on 64 nodes, beating SGI’s previous record of 1.6 GB substantially.

The NPACI viewed the NOW architecture as so much a part of the future of scientific computing that access to the experimental machine was made part of the partnership. A complete list of publications on the core computing component can be obtained at the URLs: http://now.cs.berkeley.edu/ and http://www.cs.Berkeley.EDU/Research/Projects/parallel/castle/.

 

Core Communication

Virtual networks enable efficient, protected multiprogramming of high-performance networks. There are several properties of virtual networks that collectively provide a simple and powerful communication abstraction to programmers, much as demand-paged virtual memory does with primary memory. Some recent communication layers contain aspects of virtual networks, but not the full capabilities required to enable a new paradigm of applications. The distinguishing features of virtual network systems are a virtual communications namespace, the use of memory-management facilities to allow many applications to map network hardware directly, the on-demand mapping of virtual communication resources to physical ones, and the automatic management of physical communication resource over-commitment. The virtual network abstraction focuses on the management of communication resources rather than the contracts between specific communication layers and communication hardware. We have designed and implemented a system that demonstrates virtual networks using a switched, multi-giga bit network. It shows that existing Solaris virtual memory facilities can manage a working set of applications with direct memory-mapped access to a network interface, and that the system can dynamically maintain those mappings over time. Using the virtual network infrastructure, the implementations of a next-generation active message system is nearing completion.

We designed, implemented, and analyzed the preliminary performance of the first true virtual network system. The key abstraction introduced is a communication endpoint object that abstracts a process' handle on a virtual network. A large number of processes on a machine can have one or more independent communication endpoints. A collection of communication endpoints spread across nodes form a virtual network. This abstraction is implemented through a sophisticated segment driver and firmware on an intelligent network interface. Virtual networks provide an arbitrary number of applications with the illusion of direct, memory-mapped access to multiple independent and protected networks in an on-demand, general-purpose, and protected manner. We have demonstrated low-latency and high-bandwidth communication under a variety of workloads and examined in detail the performance of single and multiple virtual networks. Our prototype system delivered 50 m s round-trip times, 25+ MB/s of bandwidth per node, fair and scalable performance, efficient bandwidth and endpoint frame utilization, and graceful performance degradation under contention and load.

 

Automatic Network Mapping

As switched, multi-gigabyte system-area networks become increasingly common building blocks for networks of workstations, their communication systems must support dynamic reconfigurability, incremental scalability, and exhibit adaptive behavior as hosts, switches, and links are added and removed from the system. We have developed and proved correct a new network mapping algorithm. The system takes about one .5 m s to map a 35-node network with 13 switches, and 1.5 seconds to map a 101-node network with 37 switches. The mapper runs concurrently with other programs using virtual networks over the Myrinet, dynamically determining the network's topology, and computing and distributing routes.

Protocols

System area networks (SANs) are an emergent class of networks whose unique mix of low-latency, high-bandwidth, and infrequent but non-negligible error and fault characteristics place them in a unique region in the network design space. This space has gone largely unexplored in terms of first principles until now. We have systematically studied trade-offs in protocol design for system area networks, revisiting classical assumptions and techniques, and articulating how design issues differ in this new regime. We have demonstrated single, fast protocols for SANS handling errors and supporting virtual networks while achieving high bandwidth and low latency.

 

Message Passing Interface (MPI) on Active Messages

Despite the performance growth in high speed networks, parallel applications using publicly available parallel programming tools suffer from the high latency and low bandwidth of communication. This creates a large performance gap between networks of workstations and MPPs. MPICH is a freely available implementation of the MPI standard. It runs on networks of workstations, but suffers the high communication costs associated with TCP/IP.

We have implemented MPI on top of Active Messages (AM). The new implementation is aimed at high bandwidth and low latency communication for networks of workstations. With the help of AM, applications can fully utilize the underlying network hardware - thus, reducing the communication latency as well as increasing the bandwidth. Our implementation of MPI on AM, when running on UltraSparc workstations connected with Myrinet switches, achieves a minimum latency of 17 ms end to end, and a maximum bandwidth of 36.8 MB/s. The performance delivered by MPI-AM is comparable to the performance delivered by most MPPs.

 

Active Messages (AM-II)

The second generation Active Message layer has progressed substantially. The latency has been reduced as a result of significant code restructuring. The automatic mapper has been integrated with the main communication layer as a "privilege" application, and this has exercised the return-to-sender error model extensively. Multiple simultaneous user application has been demonstrated, along with tolerance to injected faults. The Solaris driver is being ported to Intel PentiumPro platforms.

 

Application Sensitivity to Communication Performance

We have conducted a systematic study of the impact of communication performance on parallel applications in a high performance network of workstations. We developed an experimental system in which the communication latency, overhead, and bandwidth can be independently varied to observe the effects on applications demonstrating a wide range of architectural requirements. Our results show that improving cluster communication performance to that of tightly integrated parallel machines has resulted in significantly improved application performance. We established that applications demonstrate strong sensitivity to overhead and message bandwidth, slowing down by a factor of 60 on 32 processors when overhead is increased by 100 m s. Surprisingly, many of our benchmark applications are tolerant of increased latency and lower bulk message bandwidth. Finally, applications demonstrate a highly linear dependence to both overhead and gap, indicating that further improvements in communication performance will continue to improve application performance.

 

File Systems

The xFS file system is designed to take full advantage of all resources available in a network of workstations. By distributing metadata management and data storage, xFS is able to utilize the collective cpu power, memory, and disk capacity of an entire NOW without being slowed through the bottleneck of a central server.

Our recent efforts have focused on developing some of the less understood areas of distributed file systems, including cache consistency under dynamic, distributed management, long-term disk behavior, and defragmenting policies. Because cache consistency under dynamic management is complex, we use a verification program (teapot) to check the validity of our algorithms. This method has helped us restructure our original cache consistency model to avoid inconsistencies and deadlock. With the increasing complexity of file systems, we believe this approach will be necessary in future systems.

The increasing performance gap between cpu and disk speed has placed increasing pressure on file systems designers to hide disk latencies. As a result, there has been a recent trend to find disk layout policies that reduce fragmentation to a greater extent than traditional cylinder-group based policies. By using long-term traces to better understand disk behavior over time, we developed several new strategies for defragmenting the disk for better write performance and reorganized the disk for better read performance. We are currently evaluating the efficacy of these strategies.

 

Experience with a Distributed File System Implementation

A major limiting factor to the widespread deployment of NOW-based peer-to-peer distributed systems is the difficulty of producing a correct and efficient implementation. Relative to client/server programs of the past, peer-to-peer systems are more complex due to their performance, scalability, availability, and dynamic reconfiguration requirements. Our experience with building xFS, a peer-to-peer distributed file system, supports the conclusion that new approaches are required in the implementation of these kinds of systems. We have produced a paper summarizing our experiences in xFS, including the benefit of (i) formal methods for proving the correctness of distributed cache coherence protocols, (ii) using single-threaded event loops instead of threads for managing interactions between software modules, (iii) multi-party remote procedure call for allowing peer machines to transparently respond to forwarded requests, and (iv) reengineering the kernel vnode layer to better match the requirements of high performance networks.

Log-Structured File System (LFS)

Today’s file system designers face a dilemma. A log-structured file system (LFS) can offer superior performance for many common workloads such as those with frequent small writes, read traffic which is predominantly absorbed by the cache, and sufficient idle time to clean the log; however, it has poor performance for other workloads such as random updates to a full disk with little idle time for cleaning. We modified the LFS by applying self-tuning principles to provide high performance across a wider range of workloads. First, we improved LFS write performance in three ways: 1) by adaptively choosing the segment size, 2) by modifying the LFS cleaning policy to adapt to changes in disk utilization, and 3) by using cached data to lower cleaning costs. Second, we improved LFS read performance by reorganizing data to match read patterns. Our use of trace-driven simulations on a combination of synthetic and measured workloads demonstrated that the extensions to LFS can significantly improve performance.

 

Global Operating System Layer (T. Anderson and D. Culler)

SLIC (Secure Layer Interposition Code)

The issue of making existing deployed commercial systems resistant to information warfare has been of increasing interest to the defense community. Our approach has been to build a encapsulation system for commercial operating systems to allow us to securely and efficiently interpose extensions at the interfaces to the operating system -- system calls, signals, page faults, and interrupts -- without modifying source in either applications or the operating system kernel. Extensions are dynamically loaded into the kernel; a small amount of binary patching of the kernel is also required. By inserting the extensions into the kernel, we ensure that the extension cannot be avoided by application code. (Our original motivation for this work was to allow us to transparently add NOW global resource management to existing systems, but the technology is more widely applicable.) We have used our prototype to implement a number of useful extensions, including a 25-line patch to fix a security hole described in CERT advisory, an encrypted file system, and a restricted execution environment for untrusted binaries (e.g., to catch viruses that might exploit loopholes in Netscape, Ghostview, or Finger). Performance measurements of the SLIC prototype show that interposition on existing kernel interfaces can be accomplished efficiently.

 

WebOS: Operating System Services for Wide Area Applications

We have investigated providing a common set of OS services to wide area applications, including mechanisms for resource discovery, a global namespace, remote process execution, resource management, authentication, and security. On a single machine, application developers can rely on the local operating system to provide these abstractions. In the wide area, however, application developers are forced to build these abstractions themselves or to do without. This ad-hoc approach wastes programmer effort and system resources. We have prototyped basic operating systems services needed to build applications that are geographically distributed, highly available, incrementally scalable, and dynamically reconfiguring. Experience with a number of applications developed under WebOS indicates that it simplifies system development and improves resource utilization.

[1] B. Chun, A. Mainwaring, S. Schleimer, and D. Wilkerson, "System Area Network Mapping," to appear in SPAA'97, Newport, Rhode Island, (June 1997).

[2] R. P. Martin, A. M. Vahdat, D. E. Culler, and T. E. Anderson, "Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture," ISCA 24, Denver, Colorado, (June 1997).

[3] C. Yoshikawa, B. Chun, P. Eastham, A. Vahdat, T. Anderson, and D. Culler, "Using Smart Clients To Build Scalable Services," USENIX '97, (1997).

[4] S. Rodrigues, T. Anderson, and D. Culler, "High-Performance Local-Area Communication Using Fast Sockets," USENIX '97, (1997).

[5] J. N. Matthews, R. Y. Wang, D. S. Roselli, A. M. Costello, and T. E. Anderson, "Improving the Performance of Log-Structured File Systems with Self-Tuning Methods," submitted to SOSP 16.

[6] D. Ghormley, D. Petrou, S. Rodrigues, and T. E. Anderson, "Interposition as an Operating System Extension Mechanism," submitted to SOSP 16, [CSD-96-920, April 9, 1997].

[7] R. Y. Wang and T. E. Anderson, "Experience with a Distributed File System Implementation," submitted to SOSP 16.

[8] A. Vahdat, P. Eastham, C. Yoshikawa, E. Belani, T. Anderson, D. Culler, and M. Dahlin, "WebOS: Operating System Services For Wide Area Applications," submitted to SOSP 16.

[9] N. Talagala, S. Asami, D. Patterson, and T. Anderson, "Tertiary Disk: Large Scale Distributed Storage," submitted to VLDB 23.

 


Core: Compiler, Library, and Language Component

In this component of Titan we seek to develop the programming support required to make the core computing resources available to application, building on our language, compiler, and library work for large-scale parallel machines. This work is focused on the NOW prototypes described above. This portion of the research is funded in part by DOE through the Castle Project.

 

Multipol (K. Yelick)

Multipol is a library of distributed data structures, designed to ease the programming of "irregular" problems on large scale, distributed-memory multiprocessors [1]. Multipol structures are divided into state structures and scheduling structures. The state structures include hash tables, sets, and trees, and use a combination of replication, partitioning, and software-controlled caching for good locality. The scheduling structures are various kinds of queues that provide good load balancing without destroying the locality required by the state structures in the application. The Multipol library runs on the IBM SP-1 and SP-2, networks of workstations, the Intel Paragon, the Thinking Machines CM-5, and on uniprocessor workstations. It has been publicly released and is available from (http://HTTP.CS.Berkeley.EDU/projects/multipol/).

[1] C.-P. Wen, S. Chakrabarti, E. Deprit, A. Krishnamurthy, and K. Yelick, "Runtime Support for Portable Distributed Data Structures," in Third Workshop on Languages, Compilers, and Run-time Systems for Scalable Computers (LCR), B. K. Szymanski and B. Sinharoy (Editors), Kluwer Academic Publishers, Boston, MA, (May 1995), pp. 111-120.

 

Split-C (D. Culler, K. Yelick)

A new version of Split-C using GAM has been completed, as have two versions using the more powerful AM 2.0 layer. This is in wide use throughout the department. Split-C has also been ported to the Cray T3E A detailed performance comparison of most of these Split-C implementations has been submitted for publication. The Mantis parallel debugger has been ported to NOW.

 

Titanium (A. Aiken, S. Graham, K. Yelick)

Titanium is a project to build an optimizing compiler for explicitly parallel languages like Split-C. The optimizations include minimizing communication and synchronization costs. The current prototype handles a ``pointer-free'' subset of Split-C; it automatically does message pipelining, and analyzes synchronization and shared memory accesses to introduce automatic overlap of read/write operations. Speedups of 20\% to 30\% on kernels like the FFT have been obtained. See ``Optimizing Parallel Programs with Explicit Synchronization,'' A. Krishnamurthy and K. Yelick, Programming Language Design and Implementation, June 1995.

Two major thrusts of the Titanium project are (1) to design a small number of language mechanisms, starting from Split-C, that facilitate explicitly parallel programming, and (2) to use program optimization to achieve high performance for explicitly parallel programs. Since our goal is not to design a totally new language, we will embed our language enhancements in an existing language, much as the Split-C enhancements are embedded in C. However, we expect that many of our successes in program optimization will come from replacing some of the low-level unstructured constructs in a language like C with more disciplined variants. Consequently, rather than developing further modifications to C, we are embedding Titanium in C++. The primary advantage of using an object-oriented language is that extension can take place within the language by (1) defining new classes and methods to create new language features and (2) providing compiler checks for the absence of those language uses that inhibit the full benefit of our optimizations.

We also intend to modify existing compiler components wherever possible, rather than building them from scratch. Since we want to be able to share our prototype software with other researchers, we sought out compilers and components that are both low in cost and available to others with relatively few restrictions. After examining the Cppp, G++ and Lcc systems, we discovered and eventually adopted the EDG system. The Edison Design Group is a small company that sells a C++ front end. The EDG system generates either a high-level intermediate form or source code in C or C++. Thus it can be used for language modifications that can be expressed in C together with library calls, and for optimizations expressible in C. Because EDG supports various C++ dialects and also the standardization activities, the software is designed to facilitate language changes.

EDG has made their software available to us and to other University research groups at no cost under a non-disclosure agreement. That gives us both the opportunity to explore our language designs without extensive software development and the ability to share with other academic researchers. Eventually, we will need to be able to generate code without going through the front end of an existing C or C++ compiler. We are exploring ways to satisfy that need, but it is not an immediate impediment to our research. We completed the licensing legalities this fall and obtained the software.

Applications for Titanium (A. Aiken)

Storage Management: Efficient use of storage resources is very important in high performance computing. For many applications, storage is a limiting factor---users can only run the largest problem that will ``fit'' in the available memory. Because of its importance to overall performance, storage management is handled explicitly by the programmer in high performance numerical software written today.

We are exploring a hybrid approach to automatic storage management. The idea will be not to prohibit explicit management, but rather to supplement it both with efficient garbage-collection-based automatic management at runtime and with some promising new compile-time techniques. While this work is still in its early stages, the long-term goal is to have a compiler and runtime system that will automatically perform most complex memory management in irregular, parallel computations with a negligible performance penalty.

Recently, a new idea---due to Tofte and Talpin---for automatic storage management has been proposed. In this scheme, runtime storage is partitioned into regions. Every computed value is stored in some region. Regions themselves are allocated and deallocated according to a stack discipline akin to the standard implementation of activation records in conventional languages. The assignment of values to regions is decided statically by the compiler and the program is annotated to include operations for managing regions. Thus, there is no need for a garbage collector---all storage allocation and deallocation is statically specified in the program.

The Tofte/Talpin system makes surprisingly economical use of memory. However, it is usually possible to do significantly better and in some cases dramatically better than the Tofte/Talpin algorithm. We have developed an extension to the Tofte/Talpin system that removes the restriction that regions be stack allocated, so that regions may have arbitrarily overlapping extent. We have built a prototype implementation of our system, and preliminary experimental results support our approach. Programs transformed using our analysis typically use significantly less (by a constant factor) memory than the same program annotated with the Tofte/Talpin system alone. We have also found that for some common programming idioms the improvement in memory usage is asymptotic. The memory behavior is never worse than the memory behavior of the same program annotated using the Tofte/Talpin algorithm. Our prototype is available for experimentation on the World Wide Web at URL http://kiwi.cs.berkeley.edu/~nogc.

Effective Cache Usage: One of the most striking trends in machine architecture over the last decade has been the widening gap between CPU and memory speeds. Today, the number of cache misses is often a better measure of program performance than the number of instructions executed.

Neither the design of programming languages nor their implementations have responded well to this change in the underlying architecture. Data structures built using pointers (essentially all data structures except arrays) may or may not have good cache performance depending on the layout of the data in cache lines and the access patterns of the program. Current languages provide no mechanisms for programmers to control layout of data structures to improve cache performance.

We are conducting two studies directed at achieving better cache performance. Both explorations are intended to result in techniques to be used in the Titanium system.

Optimizing Irregular Code for Caches: We have evaluated the effectiveness of several changes to a parallel application's data structures and computational logic to achieve improved cache performance. Our current results are based on hand-optimized code. They are intended to guide the design of compiler optimizations.

The target of our study was the EM3D application, an explicitly parallel Split-C program that models the propagation of electromagnetic waves through objects in three dimensions. A preprocessing step creates an irregular bipartite graph of nodes containing electric or magnetic field values, and performs load balancing. During computation, EM3D iteratively calculates the value of each node based on the values of its neighbors. There are two separate phases in each timestep that modify one part of the bipartite graph while reading the other.

The bipartite graph is naturally represented as a linked structure: each graph node is a C structure with values and pointers to other nodes. We used several approaches when revising the original application. In our first version we converted linked lists to arrays, eliminating storage used for pointers and making accessed data sequential in memory. In our second version we converted an array of structures to a structure of arrays, because the computational pattern only required one field in the structure at a time. Both changes improve the utilization of cache lines. In our third version we attempted to further increase the percentage of sequential memory references during computation by partially sorting the arrays. In the fourth and final version we further compressed the data by joining the smaller arrays into large arrays, and performed a complete sort on the data structures used during computation. This forced a change to the computational logic, which also increased the number of loads and stores performed.

The results were most promising for the first two versions. We achieved a respectable speedup without having to change the algorithm in the main part of the program. We believe these changes could be incorporated into a compiler given sufficient analysis. The third version of the program had much higher overhead for creating the data structure, due to sorting, and did not perform significantly better in the computational loop. In the fourth version the additional loads and stores overwhelmed any advantage of better cache line usage and was therefore worse than the previous versions.

Overall, we found that tuning code for the cache can yield substantial performance improvements, but source-level changes may obscure the code and it is difficult to predict the result of a data transformation. Therefore, we need additional compiler support to aid optimization.

Cache-Friendly Data Structures: The idea of rearranging the layout of linked data structures to improve memory hierarchy performance is not new. Cdr-coding in Lisp systems and B-trees in database systems are well-known examples where a particular data structure is tuned for memory performance. What we are investigating, and what has not been done before, is whether there is a general language mechanism that can both allow programmers to use familiar and convenient pointer-based data structures and simultaneously enable choices among storage layouts to improve cache utilization. Ideally, those choices will be made by the storage allocator, guided by knowledge provided by the programmer.

A basic mechanism for improving cache performance of pointer data structures is to place objects A and B in the same cache line if A points to B (or vice-versa). In small-scale experiments we have verified that substantial speedups are possible from improving cache layout alone (20\%-30\% and sometimes more on small kernels). We are working on the design of an extension to C++s type system that would allow programmers to specify which pointers in a datatype should preferably point to objects in the same cache line. The design of this mechanism is largely complete; in the next phase we plan to implement and evaluate the effectiveness of our design on a set of benchmarks.

Synchronization Analysis: Optimizing explicitly parallel shared memory programs requires new types of static analysis to ensure that accesses reordered on one processor cannot be observed by another. Intuitively, the parallel programmer relies on the notion of sequential consistency: the parallel execution must behave as if it were an interleaving of the sequences of memory operations from each of the processors. If only the local dependencies within a processor are observed, the program execution might not be sequentially consistent. To guarantee sequential consistency under reordering transformations, a new type of analysis called cycle detection is required.

The cycle detection problem is to detect access cycles. For example, if one processor writes to a variable x and then to y, and another processor reads from y and then x, we say there is a cycle: write x, write y, read y, read x. In addition to observing local dependencies within a program, a compiler must ensure that accesses issued by a single processor in a cycle take place in order. In the example, this means that the second processor cannot observe a new value of y and an old value of x. Cycle detection is necessary for most optimizations involving code motion, whether the programs run on physically shared or distributed memory and whether they have dynamic or static thread creation. Cycle detection is not necessary for automatically parallelized sequential programs or data parallel programs with sequential semantics, because every pair of accesses has a fixed order, which is determinable at compile-time. The additional problem for explicitly parallel programs comes directly from the possibility of non-determinism, whether or not the programmer chooses to use it.

We have shown that by restricting attention to Single Program Multiple Data (SPMD) programs, one can significantly reduce the complexity of cycle detection [1]. More recently, we extended this work to consider languages with explicit synchronization constructs and have shown that the quality of the analysis can be improved by taking synchronization into account.

Storage access reordering is a compilation problem that we expect to become increasingly important as superscalar processors, write-buffers, and weak memory models like ``release consistency'' are used in the architectures. Even if the compiler does not reorder the shared memory accesses, reordering may take place at many levels in a multiprocessor system. At the processor level, a superscalar processor may issue an instruction as soon as all its operands are available, so writes to different locations might be issued in the order the values become available. Most processors have write buffers, which allow read operations to overtake write operations that have been issued earlier. In fact, on the SuperSparcs the write-buffer itself is not guaranteed to be FIFO. Reordering may also take place at the network level in distributed memory multiprocessors, because some networks adaptively route packets to avoid congestion. Even if packets do not get reordered, two accesses sent to two different processors may be handled out of order, since latencies may vary. Also, on a machine like DASH, with hardware caching, writes do not wait for all invalidations to complete, so remote accesses might appear to execute in reverse-order. These architectural features usually come with support to ensure sequential consistency, such as a memory barrier or a write-buffer flush to enforce ordering between memory operations, or a test for completion of a remote operation. However, the compiler must insert these new instructions. If a standard uniprocessor compiler is used for generating code, these special instructions would not be automatically inserted.

Uniprocessor compilers are ill suited to the task of compiling explicitly parallel programs, because they do not have information about the semantics of the communication and synchronization mechanisms. As a result, they either generate incorrect code or miss opportunities for optimizing communication and synchronization, and the quality of the scalar code is limited by the inability to move code around parallelism primitives.

We have designed optimizations for multiprocessors with physically distributed memory and hardware or software support for a global address space. Machines with physically distributed memory often have long remote memory latencies. However, most of this latency can be overlapped with local computation or with the initiation of more communication, especially on machines like the J-Machine and *T, with their low overheads for communication startup.

Three important optimizations for these multiprocessors are overlapping communication, eliminating round-trip message traffic, and avoiding communication altogether. The first optimization, message pipelining, changes remote read and write operations into their split-phase analogs, get and put. In a split-phase operation, the initiation of an access is separated from its completion. The operation to force completion of outstanding split-phase operations comes in many forms, the simplest of which (called sync or fence) blocks until all outstanding accesses are complete. To improve communication overlap, puts and gets are moved backwards in the program execution and syncs are moved forward. The second optimization eliminates acknowledgement traffic. A final optimization is the elimination of remote accesses by either re-using values of previous accesses or updating a remote value locally multiple times before issuing a write operation on the final value.

We have a prototype compiler based on the gcc compiler that implements these optimizations, and we quantified the potential payoff of a few of these optimizations on a set of application kernels. The compiler takes parallel shared memory style C as input and produces Split-C as output. The performance improvements on several application kernels are as high as 35% on the CM-5, with even better improvement expected on future architectures with lower communication startup.

[1] A. Krishnamurthy and K. Yelick, "Analyses and Optimizations for Shared Address Space Programs," Journal of Parallel and Distributed Computation, (1996).]

 

pSather (J. Feldman)

There has been progress on several fronts in the Sather project on Object-Oriented Programming and its parallel extension, pSather. Further details are available through the Sather (http://www.icsi.berkeley.edu/~sather/) and pSather (http://www.icsi.berkeley.edu/~sather/psather.html) web pages. Over the past year, one doctoral dissertation [1] has been accepted and two more will be completed this term. Technical results include an implementation of safe thread termination [1], a Zone model for locality management [2], and a system for mapping neural networks to parallel processors [3]. A new release of Sather and greatly improved documentation were also produced. Current work includes a formal specification of Sather semantics, an efficient and portable thread package, and efforts on parallel library design and applications.

[1] Claudio Fleiner, "Advanced Constructs and Compiler Optimizations for a Parallel, Object Oriented, Shared Memory Language Running on a Distributed System," Doctoral Dissertation, U. Fribourg, (1996).

[2] David Stoutamire, "Zones, A Memory Abstraction for Locality Management," Doctoral Dissertation, EECS, University of California, Berkeley, California, (1997).

[3] Benjamin Gomes, "Mapping Connectionist Networks onto Parallel Machines: An Object-Oriented Library Approach," Doctoral Dissertation, EECS, University of California, Berkeley, California, (1997).

 

ScaLAPACK (J. Demmel)

ScaLAPACK is a library of parallel linear algebra routines. It achieves portability by doing all communication using the BLACS (Basic Linear Algebra Communication Subroutine) library, which is in turn has been implemented using MPI, PVM and other lower level libraries. A full version was release in November 1996.

We have produced an Active Message version of the BLACS, which has let us successfully run the ScaLAPACK test codes. We are continuing to work on the Split-C-to-ScaLAPACK interface, to permit calls to ScaLAPACK on arrays defined in Split-C. A description of this work will appear in the MS thesis of M. Ivory. Much of our work has depended on the availability of high performance Basic Linear Algebra Subroutines. These are typically commercial products, specific to each platform, and often expensive; this is a major reason LAPACK has not yet been integrated into packages like Matlab. We are addressing this problem by producing a system, called PHiPAC, for automatically producing high-performance BLAS for any RISC architecture. For example, our system has automatically produced matrix-multiplication routines that on average outperform IBM highly tuned ESSL library, as well as SGI's similar library. The same software has produced 90\% of peak performance on Sparcstation-20/61 and HP 712/80i. An alpha release of the software is available at http://www.icsi.berkeley.edu/~bilmes/phipac/ . Future work will involve producing the entire suite of BLAS routines. This work is jointly funded by the LAPACK project (DOE and NSF).

As part of the November 1996 release, we produced 3 routines for the symmetric eigenvalue problem (PxSYEVX and PxSTEV) and its close relative, the singular value decomposition (PxGESVD). PxSYEVX and PxSYEV both begin by reducing the matrix to symmetric tridiagonal form. After that, PxSYEVX is a parallelization of the serial algorithms bisection (to find the eigenvalues) and inverse iteration (for the eigenvectors). Innovations include the exploitation of IEEE infinity arithmetic to accelerate bisection, and trading off accuracy, workspace, and running time in inverse iteration. Both PxSYEV and PxGESVD are parallel versions of a different algorithm, QR iteration, which is simultaneously somewhat slower but more accurate on a homogeneous machine; its behavior on a heterogeneous machine is discussed below. We have extensive performance analysis and models of these and other routines, to help the user choose the best algorithm. Finally, we have a prototype running of a new algorithm, which may be the ultimate solution for the symmetric eigenproblem on both parallel and serial machines. Loosely based on inverse iteration, it promises to cost just O(n2/p) in contrast to the worst case O(n3/p) of the other algorithms.

We designed and implemented SuperLU, a supernodal version of sparse Gaussian elimination. Much of the design work involved minimizing time in symbolic factorization, and exploiting the memory hierarchy to get nearly BLAS 3 performance on a serial machine; the serial version is available on Netlib [Susan: this will be true as soon as you or Jack can install it!]. We also parallelized it for a multithreaded shared memory environment. SuperLU is currently one of the two fastest serial implementations of sparse Gaussian elimination (on some test problems it wins, on others another code is faster), and the fastest parallel implementation.

[1] A. Zege, "QR Implementation of SVD for ScaLAPACK," Master's Thesis, Computer Science Division, University of California, Berkeley, California, (1996).

[2] J. Demmel, S. Eisenstat, J. Gilbert, X. Li, and J. W. H Liu, "A Supernodal Approach to Sparse Partial Pivoting," to appear in SIAM, J. Mat. Anal. Appl.

[3] X. Li, "Sparse Gaussian Elimination on High Performance Computers," Ph.D. Thesis, Computer Science Division, Department of Electrical Engineering and Computer Science, University of California, Berkeley, California, (September 1996).

[4] J. Demmel, J. Gilbert, and X. Li, "SuperLU Users' Guide," Netlib, (1997). We have completed the outline of our intended book, and gotten the agreement of several other leading researchers to write chapters (Ming Gu, Axel Ruhe, Youcef Saad, Gerard Sleijpen, and Henk van der Vorst). These authors are now preparing their contributions.

[5] Z. Bai, J. Demmel, J. Dongarra, A. Edelman, M. Gu, A. Ruhe, Y. Saad, G. Sleijpen, D. Sorensen, and H. van der Vorst, "Templates for the solution of Eeigenvalue Problems: A Practical Guide," in preparation, (1997).

 

Understanding Heterogeneity

ScaLAPACK was originally conceived to run on a homogeneous network of processors, i.e., all identical hardware running identical object code. Its scope has since expanded to running on heterogeneous networks, where the processors differ. Heterogeneity can cause previously identical results to differ slightly on different processors, resulting in subtle but rare errors. We have identified these and are modifying ScaLAPACK to eliminate as many as possible; not all algorithms can run reliably and efficiently on heterogeneous machines.

[1] L. S. Blackford, A. Cleary, J. Demmel, I. Dhillon, J. J. Dongarra, S. Hammarling, A. Petitet, H. Ren, K. Stanley, and R. C. Whaley, "Practical Experience in the Dangers of Heterogeneous Computing," LAPACK Working Note No.112, Department of Computer Science, University of Tennessee, Knoxville, Tennessee, (1996) [Technical Report #CS-96-330].


Parallel Applications

In addition to the parallel applications discussed in last year's progress report (Phyllogeny Problem, Cell Simulation, and GATOR, a Gas, Aerosol, Transport, and Radiation Chemistry model), we have developed several new parallel applications.

 

NOW-Sort (D. Culler)

We conducted a study of a self-tuning high performance disk-to-disk sort on a Network of Workstations (NOW). On a 64-node cluster, we set world record by sorting 6.0 GB in one minute, extending the performance gap between NOW and the largest available SMPs. The previous record was held by SGI at 1.6 GB. (SGI recently tried to regain the title, but stopped at 5 GB. We recently extended our record to 8.4 GB on 100 nodes.) This "minute-sort" sustains a file bandwidth of 400 MB/s on 64 nodes and demonstrates near perfect scalability. A 32-node cluster finished the Datamation benchmark in 2.41 seconds. The graduate students conducting the study were presented with two trophies at SIGMOD 97, where their NOW-sort was presented.

In addition to applying NOW-Sort to a variety of disk, memory, and processing configurations, we also highlighted salient issues for tuning each component of the system. Our continued evaluation of using commodity operating systems and hardware for parallel sorting found the existing OS primitives for memory management and file access to be adequate. Due to aggregate communication and disk bandwidth requirements, the bottleneck of our system is the workstation I/O bus.

[1] A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, D. E. Culler, J. M. Hellerstein, and D. A. Patterson, "High-Performance Sorting on Networks of Workstations," SIGMOD '97, Tucson, Arizona, (May 1997).

 

TranSend (E. Brewer)

In conjunction with the BARWAN wireless mobile networking project, we began exploring how to trade computational cycles and storage in a NOW for scarce bandwidth, to adapt Internet content on-the-fly to the varying hardware, software, and communications needs of a heterogeneous client population. Our techniques of datatype-specific distillation and refinement (for example, real-time degradation of image resolution or color) led to a successful early prototype for a real-time WWW distillation proxy [1], which has been used for a variety of other projects at Berkeley, at our industrial sponsors, and at other research institutions, and whose fundamental ideas have begun to appear in third-party commercial products from Hitachi and Spyglass. Further analysis of on-demand dynamic distillation and refinement [2] led to a re-usable, layered architecture for building scalable infrastructural services [3]. In this architecture, distillation and refinement are generalized to Transformation, Aggregation, Caching, and Customization of Internet content (TACC services). By enabling rapid deployment of new services, composition of existing services, mass customization, and a near-zero incremental operating cost per user served, the TACC server architecture provides a new platform for effectively deploying Internet services. A prototype of a scalable TACC service, TranSend [5], is now running and available to the public. It is currently hosting a large volume of the campus dial-in users. Excess load spills over onto a portion of NOW.

[1] A. Fox and E. A. Brewer, "Reducing WWW Latency and Bandwidth Requirements via Real-Time Distillation," Proc. WWW-5, Paris, France, (May 1996).

[2] A. Fox, S. D. Gribble, E. Brewer, and E. Amir, "Adapting to Network and Client Variation Via On-Demand Dynamic Distillation," Proc. ASPLOS-VII, Boston, Massachusetts, (October 1996).

[3] A. Fox, S. D. Gribble, Y. Chawathe, and E. Brewer, "Scalable Network Services," submitted to SOSP-16, (1997).

[4] R. H. Katz, E. A. Brewer, E. Amir, H. Balakrishnan, A. Fox, S. Gribble, T. Hodes, D. Jiang, G. T. Nguyen, V. Padmanabhan, and M. Stemm, "The Bay Area Research Wireless Access Network (BARWAN)," Proc. Spring COMPCON Conference, (1996).

[5] The TranSend proxy service: http://transend.cs.berkeley.edu.

 

Inktomi (E. Brewer)

The technology developed in the Inktomi search engine has now been transferred to industry. In a joint venture between WIRED and Inktomi Corporation, a search engine containing the world's largest index of the web was conducted using NOW and Inktomi technology. It is accessible from anywhere on the net at http://www.hotbot.com/.

 

Electromagnetic Simulation (A. Neureuther)

TEMPEST is a parallel electromagnetic simulator built in Split-C by Prof. A. Neureuther's group in the Berkeley EECS Dept. It is intended for production runs in understanding scattering from nonplanar polysilicon features in printing gate tips as they cross the field edge. TEMPEST simulates a 3-dimensional domain of grid points, and the electric and magnetic field values at each grid point are calculated iteratively. An initial development of the TEMPEST application in Split-C was developed. It is being used in industry (at AMD) as well as within the university. The application is being further developed on NOW IRAM Simulation. In order to investigate the performance and power advantages of IRAM computer systems, which integrate processors and DRAM on the same chip, the NOW has been used for thousands of hours of architectural simulation over the last several months. The SimOS complete computer system simulation environment was used to evaluate the performance of eight computer system models, each running 11 different benchmarks, where each of these 88 simulations took 2-3 days on a single 167MHz UltraSparc I workstation. In addition, Shade cache simulations were used to determine the energy consumption of 4 alternative memory hierarchies, each running 8 benchmarks, where each of these 32 simulations took 1-2 days. Results for both simulation efforts are being published in ISCA '97. Because these simulations were independent and thus trivially parallelizable, the NOW was an ideal simulation engine, allowing us to complete the simulation in a matter of a few days, rather than many months.

 

Network Security (E. Brewer)

ISAAC: Internet Security, Applications, Authentication and Cryptography

The network security component of Titan that advanced dramatically through the ISAAC security project (http://HTTP.CS.Berkeley.EDU/projects/isaac/). The ISAAC research group has also made headlines with some research projects chosen for their potential impact. In January, RSA Data Security offered a $1000 challenge: the goal was to break data protected by encryption with a 40-bit key. One group member used the NOW clusters to crack the code with just 3.5 hours after the contest started. This effort made headlines, for 40-bit keys are the strongest encryption currently allowed to be exported under US regulations. In March, the group again made headlines by identifying serious flaws in the privacy protection found in digital cellular phones: a group member analyzed the encryption algorithm used to protect digits dialed on the cellphone's keypad, and discovered that it could be cracked in minutes on a standard personal computer.

The driving principle of this work is that the security system should be orthogonal to its functionality. This directly goes against the traditional advice of security experts, which is to make sure you build in security deeply and from the beginning. We believe that the traditional approach has largely failed; it is too difficult to build security into a complex system. For example, bugs often lead to security holes, which means complex network systems would have to be mostly bug-free to be secure, which is unrealistic. Instead, we advocate orthogonal security: building clean, simple security systems that are adjacent to the system they protect rather than deeply integrated with it. The separation of concerns allows the security system to be significantly simpler and more likely to be secure. Furthermore, orthogonal security can be applied well to legacy, black box, and untrusted code; it can also be composed into multiple layers to provide broader or redundant security. Over the past year, we have used the orthogonal security methodology to solve many well-known problems in practical security, including security for Netscape, Java, and sendmail.

One of the difficult challenges in implementing a large-scale network infrastructure is the need to overcome the myriad of severe security flaws in current network software, including all protocols based on UDP/IP and TCP/IP. For example, it is relatively easy to watch ethernet traffic for passwords, or worse, to falsify packets in order to gain illegal access to a machine. For example, it is possible to spoof NFS with false packets into providing write access to files that a given user should not be able to change. It is possible to eliminate these problems by fixing all of the common network software, including NFS, rlogin and the e-mail system. However, these programs are numerous and very complex, thus prohibiting any realistic chance of fixing all such flaws. Instead, we are building a network security monitor called IPSE (Internet Protocol Scanning Engine). IPSE will be deployed on each of the Titan subnets to monitor traffic for in-progress security attacks. In particular, IPSE performs three jobs:

IPSE is easy to extend on a per-protocol basis, so that the filtering can be easily improved as new security holes are discovered. By taking this approach, we have a manageable solution that sits beside existing network software (with its security flaws). Although we can not prevent every attack, we can prevent several kinds and detect most of the others in-progress, which may allow the system administrator to either prevent completion or discover the identity of the attacker.


Shell: Multimedia Component

The multimedia component is currently operational, with the original set of HP media-stations and Apple-based multimedia authoring studio augmented by a large donation from Intel providing PC-based media-stations on the desktop and a greatly expanded PC-based multimedia authoring capability. Through external funds, we have deployed multimedia presentation capability in all the major meeting rooms in the building and thereby greatly leveraged the NSF investment. This component of the project has given rise to a new interdisciplinary organization, the Berkeley Multimedia Research Center. Recent publications on full-motion video for portable terminals and tools for video-on-demand are available at http://bmrc.berkeley.edu/.

Networking for Continuous Media (D. Ferrari)

The Tenet group has made available the source code of its Real-Time Protocol Suite 1. The suite consists of three protocols (RMTP and RTIP for data delivery, and RCAP for guaranteed-performance channel establishment and teardown), which have been designed to coexist with the Internet protocols. The source code can be freely used for educational and research purposes without a license; its commercial exploitation requires obtaining a license from the Regents of the University of California, who own the copyright to it. A Ph.D. dissertation was completed on this work.

[1] B. Mah, "Quality of Service and Asynchronous Transfer Mode in IP Internetworks," Ph.D. Dissertation, University of California, Berkeley, California, (December 1996).

[2] B. Mah, "An Empirical Model of HTTP Network Traffic," Proceedings of INFOCOM '97, Kobe, Japan, (April 1997).

Deadalus/BARWAN Wireless Infrastructure (E. Brewer, R. Katz)

The two main concerns in this work are as follows:

· Can proxy-enabled mobile computing services really scale to thousands or millions of users?

· Are the display-adaptation mechanisms you propose suitable in practice for viewing Web/Internet content on very impoverished clients?

We have responded to these challenges as follows. From December to February, the GloMop Group built a second-generation distillation proxy to support mobile and wireless computing. The design of this system was influenced both by the lessons learned from our prototype implementation demonstrated at prior retreats and reported on in prior papers [Fox 96a, Fox 96b], and by the lessons learned by the NOW (Network of Workstations) project. The result of this work is a distillation proxy that offers millisecond latency for adaptation via distillation, demonstrates perfect linear scaling up to thousands of users, exhibits excellent availability and incremental growth properties, is extremely cost effective to operate, and tracks bursty offered-load behavior on time scales that are small compared to end-user latencies. This system is described in a submitted paper [Fox 97]. This system is functional and will be deployed to all UC Berkeley dialup IP users in the near future. The user interface is being polished by undergraduate research assistants.

Network Interfaces in Hand-Held Devices

Next generation hand-held devices must provide seamless connectivity while obeying stringent power and size constrains. We have examined this issue from the point of view of the Network Interface (NI). We measured the power usage of two PDAs, the Apple Newton Messagepad and Sony Magic Link, and four NIs, the Metricom Ricochet Wireless Modem, the AT&T Wavelan operating at 915 MHz and 2.4 GHz, and the IBM Infrared Wireless LAN Adapter. These measurements clearly indicated that the power drained by the network interface constitutes a large fraction of the total power used by the PDA. We then examined two classes of optimizations that can be used to reduce network interface energy consumption on these devices: transport-level strategies and application-level strategies. Simulation experiments of transport-level strategies show that the dominant cost does not come from the number of packets sent or received by a particular transport protocol, but from the amount of time that the NI is in an active but idle state. In addition, simulation experiments of application-level strategies indicate that significant energy savings can be made with a minimum of user-visible latency.

Vertical Handoffs in Wireless Overlay Networks

No single network technology simultaneously provides a low latency, high bandwidth, wide area data service to a large number of users. Wireless Overlay Networks - a hierarchical structure of room-size, building-size, and wide area data networks - solve the problem of providing network connectivity to a large number of mobile users in an efficient and scalable way. The specific topology of cells and the wide variety of network technologies that comprise wireless overlay networks present new problems that have not been encountered in previous cellular handoff systems. We have implemented a vertical handoff system that allows users to roam between cells in wireless overlay networks. Our goal is to provide a user with the best possible connectivity for as long as possible with a minimum of disruption during handoff. Results of our initial implementation show that the handoff latency is bounded by the "discovery time", the amount of time before the mobile host discovers that it has moved into or out of a new wireless overlay. This discovery time is measured in seconds: large enough to disrupt reliable transport protocols such as TCP and introduce significant disruptions in continuous multimedia transmission. To efficiently support applications that cannot tolerate these disruptions, we presented enhancements to the basic scheme that significantly reduce the discovery time without assuming any knowledge about specific channel characteristics. For handoffs between room-size and building-size overlays, these enhancements lead to a handoff latency of approximately 170ms with a 1.5% overhead in terms of network resources. For handoffs between building-size and wide-area data networks, the handoff latency is approximately 600ms with a similarly low overhead.

Analyzing Stability in Wide-Area Network Performance

The Internet is a very large scale, complex, dynamical system that is hard to model and analyze. We have developed and analyzed statistical models for the observed end-to-end network performance based on extensive packet-level traces (consisting of approximately 1.5 billion packets) collected from the primary Web site for the Atlanta Summer Olympic Games in 1996. We find that observed mean throughputs for these transfers measured over 60 million complete connections vary widely as a function of end-host location and time of day, confirming that the Internet is characterized by a large degree of heterogeneity. Despite this heterogeneity, we find (using best-fit linear regression techniques) that we can express the throughput for Web transfers to most hosts as a random variable with a log-normal distribution. Then, using observed throughput as the control parameter, we attempt to quantify the spatial (statistical similarity across neighboring hosts) and temporal (persistence over time) stability of network performance. We find that Internet hosts that are close to each other often have almost identically distributed probability distributions of throughput. We also find that throughputs to individual hosts often do not change appreciably for several minutes. Overall, these results indicate that there is promise in protocol mechanisms that cache and share network characteristics both within a single host and amongst nearby hosts.

TCP Performance of a Busy Web Server: Analysis and Improvements

In recent years the rapid growth of the World Wide Web has caused a significant shift in the composition of Internet traffic. Recent studies have shown that Web traffic constitutes up to 50% of the bytes traversing Internet backbone links. Therefore, there is significant value in understanding Web traffic characteristics and its implications on network behavior and network protocol design. We have analyzed TCP behavior using extensive Web traffic traces obtained from a busy server (the official Web server of the 1996 Atlanta Olympic games). We describe the techniques used to gather these traces and reconstruct the behavior of TCP on the server. We then present a detailed analysis of both loss recovery and congestion control from the recorded transfers. We also examine the effects of short transfers and multiple simultaneous connections (used by many popular Web browsers) on end-to-end performance. We used these results to motivate and discuss a set of solutions designed to alleviate several observed problems. We studied new techniques for TCP loss recovery, an integrated approach to congestion control and loss recovery for multiple simultaneous connections, and presented results from simulation experiments to demonstrate the benefits of our solutions.

[Amir 95a] E. Amir, H. Balakrishnan, S. Seshan, and R. H. Katz, "Efficient TCP over Networks with Wireless Links," Proceedings HotOS-V Workshop, Orcus Island, Washington, (May 1995).

[Amir 95b] E. Amir, S. McCanne, and H. Zhang, "An Application Level Video Gateway," Proceedings ACM Multimedia '95, San Francisco, California (November 1995).

[Amir 96] E. Amir, S. McCanne, and M. Vetterli, "A Layered DCT Coder for Internet Video," Proceedings ICIP '96, Lausanne, Switzerland, (September 1996).

[Bala 95] H. Balakrishnan, S. Seshan, E. Amir, and R. H. Katz, "Improving TCP/IP Performance over Wireless Networks," ACM Conference on Mobile Computing and Networks, Oakland, California (November 1995).

[Bala 96a] H. Balakrishnan, S. Seshan, and R. H. Katz, "Reliable Transport and Handoff Protocols for Cellular Wireless Networks," ACM Wireless Networks Journal, V 1, N 3, (December 1995), pp. 469-482.

[Bala 96b] H. Balakrishnan, V.N. Padmanabhan, S. Seshan, and R.H. Katz, "A Comparison of Mechanisms for Improving TCP Performance over Wireless Links," Proceedings ACM SIGCOMM '96, Stanford, California, (August 1996).

[Bala 97a] H. Balakrishnan, S. Seshan, M. Stemm, and R. H. Katz, "Analyzing Stability in Wide-Area Network Performance," Proceedings ACM SIGMETRICS Conference on Measurement & Modeling of Computer Systems (SIGMETRICS 97), Seattle, Washington, (June 1997).

[Bala 97b] H. Balakrishnan, S. Seshan, M. Stemm, V. Padmananbhan, and R. H. Katz, "TCP Performance of a Busy Web Server: Analysis and Improvements," submitted for publication.

[Fox 96a] A. Fox and E. Brewer, "Reducing WWW Latency and Bandwidth Requirements by Real-Time Distillation," Proceedings of the Fifth International World Wide Web Conference, Paris, France, (May 6-10, 1996).

[Fox 96b] A. Fox, S. Gribble, E. Brewer, and E. Amir, "Adapting to Network and Client Variation via On-Demand, Dynamic Distillation," Proceedings ASPLOS-VII, Boston, Massachusetts, (October 1996).

[Fox 96c] A. Fox and S. Gribble, "Security on the Move: Indirect Authentication Using Kerberos," ACM MobiCom 96, White Plains, New York, (November 1996).

[Fox 97] A. Fox, S. D. Gribble, Y. Chawathe, and E. A. Brewer, "Scalable Network Services," submitted (blind review) to SOSP-16.

[Nguyen 96] G. Nguyen, R. Katz, B. Noble, and M. Satyanarayanan, "A Trace-based Approach for Modeling Wireless Channel Behavior," Proceedings of the Winter Simulation Conference, (December 1996).

[Katz 96] R. H. Katz and E. Brewer, "The Case for Wireless Overlay Networks," Proceedings 1996 SPIE Conference on Multimedia and Networking, MMCM '96, San Jose, California (January 1996).

[Liu 97] K. M. Liu, "Lightweight Web Browsing Through HTML Validation," Master's Report, University of California, Berkeley, California (January 1997).

[Nara 96] S. Narayanaswamy, S. Seshan, E. Brewer, R. Broderson, F. Burghardt, A. Burstein, Y-C. Chang, A. Fox, J. Gilbert, R. Han, R. Katz, A. Long, D. Messerschmitt, and J. Rabaey, "Application and Network Support for Infopad," IEEE Personal Communications Magazine, V 3, N 2, (April 1996), pp. 4-17.

[Nguyen 96] G. T. Nguyen, R. H. Katz, B. Noble, and M. Satyanarayan, "A Trace-Based Approach for Modeling Wireless Channel Behavior," Proceedings Winter Simulation Conference '96, Coronado, California, (December 1996).

[Padmanabhan 96] V. N. Padmanabhan, H. Balakrishnan, K. Sklower, E. Amir, and R. H. Katz, "Networking Using Direct Broadcast Satellite," Workshop on Satellite Services, Rye, New York, (November 1996).

[Padmanabhan 96a] V. N. Padmanabhan and R. Caceres (AT&T Labs), "Fast and Scalable Handoffs in Wireless Internetworks," ACM MobiCom Conference, Rye, New York (November 1996).

[Seshan 96] S. Seshan, H. Balakrishnan, and R. H. Katz, "Handoffs in Cellular Wireless Networks: The Daedalus Implementation and Experience," to appear in Wireless Personal Communications, Kluwer Academic Publishers, (1997).

[Stemm 96] M. Stemm, P. Gautier, D. Harada, and R. H. Katz, "Reducing Power Consumption of Network Interfaces in Hand-Held Devices," International Workshop on Mobile Multimedia Communications (MoMUC-3), Princeton, New Jersey, (October 1996).

[Stemm 97a] M. Stemm and R. H. Katz, "Measuring and Reducing Energy Consumption of Network Interfaces in Hand-Held Devices," invited paper in IEICE (Institute of Electronics, Information and Communication Engineers) Transactions on Communications, Special Issue on Mobile Computing.

[Stemm 97b] M. Stemm and R. H. Katz, "Vertical Handoffs in Wireless Overlay Networks," ACM Mobile Networking (MONET), Special Issue on Mobile Networking in the Internet, (Fall 1997).

 

 

Continuous Media Toolkit (L. Rowe)

Development has continued on the Continuous Media Toolkit. This work is now widely used with the Titan environment, and within the formed Berkeley Media Research Center. BMRC also received an NSF Academic Infrastructure Grant to develop a video storage system (i.e., video server and tertiary storage backup) and high-speed network for delivering stored and live video material to researchers in many departments at Berkeley (e.g., EECS, Mechanical Engineering, School of Education, School of Information and Systems, etc.) including Titan researchers. This high-speed multimedia network connecting to the Titan network to deliver real-time video to desktops and classrooms within Soda Hall is now operational.

Multimedia

We have developed two multimedia laboratories within Computer Science that are now widely used by faculty and students to produce media elements for presentations, web publication, and performances. These elements include still image, audio, video, and animation material. In addition, we have pioneered the developed of what we call "hello videos" which are short clips in which students introduce themselves. These clips are being integrated into personal and course web pages. The remainder of this section describes other multimedia research funded in part by TITAN on the development of a video storage and delivery system, an Internet Broadcasting System, experiments developing multimedia content with an emphasis on video material, and development of multimedia middleware.

Video Storage and Delivery

The Berkeley Distributed Video-on-Demand (VOD) System is a hierarchical storage system designed to support transparent access to thousands of hours of video material. The system architecture is composed of a database, one or more video file servers (VFS), and one or more archive servers (AS). The database contains metadata and indexes about the videos stored in the system. The VFSs store videos on magnetic disks for real-time playback. An AS manages tertiary storage devices (e.g., optical disk or tape jukeboxes). In essence, VFS is an on-line cache for videos permanently stored on a tertiary storage (TS) device.

During this period we published two papers describing algorithms we developed for this system. The first algorithm determines on which VFS to load a video object from tertiary storage [Brubeck96]. The second algorithm compares several segmentation algorithms, that is, algorithms that identify shot boundaries in video sequences [Boreczky96]. Both algorithms are important parts of the system. We also completed a collection of software tools for entering metadata about videos into the database and fixing up shot boundaries detected by one of the algorithms studied by Boreczky [Bacher97].

Our earlier design for this system assumed that the VFSs would all be the same system and that each one could deliver material coded using different compression algorithms. We now realize that different coding and delivery systems are supported by different servers. For example, the Sun Media Center only supports MPEG1 streams (CIF format at 1.5 Mb/s or higher) and various low-bit rate commercial servers (e.g., Progressive Networks Real Audio/Video, Xing Media Streams, Vxtreme, VDO, etc.) use their own proprietary CODEC's and delivery protocols. We are revising our architecture and system to take into account the inclusion of different servers and formats. In addition, we are extending the metadata database to include information required to support other types of media content (e.g., images, web pages, web scripts, etc.) and relationships useful for managing assests during the authoring process.

Internet Broadcasting System

For over two years we have broadcast a regularly scheduled seminar, the Berkeley Multimedia and Graphics Seminar, world-wide on the Internet (http://www.bmrc.berkeley.edu/298). The number of viewers ranges from 10 to over 200 people depending on the speaker/topic and the current state of the Internet. Average viewership peaked at around 45 during the Spring 1996 semester before it dropped to around 20 during the Fall 1996 semester as result of poor audio/video delivery due to restructuring of the Internet. The number of viewers climbed again at the end of the Spring 1997 semester when 35-50 people watched each seminar. We regularly have viewers from Europe, throughout North America, and occasionally from Australia and Asia.

The seminar broadcast is produced using the LBL/UCB MBone tools vic and vat to capture and transmit video and audio streams and the shared whiteboard tool wb to distribute slides [Eriksson94, McCanne95]. We produce different versions of the broadcast at different data rates, and hence quality, for different user communities. For example, Berkeley was connected to the BAGNet ATM testbed on which we could send video at bitrates of several Mbits/sec. Participants connected directly to the Berkeley campus can receive video at up to 1 Mbit/sec. Video sent to the world-wide MBone is generally limited to rates below 128 Kbits/sec. To accommodate participants with these different receiving capabilities, we use a video transcoding gateway vgw [Amir95]. The broadcast is recorded using a tool we developed for recording RTP streams. We are currently testing a web-based on-demand playback tool that can be used to replay the seminars and other broadcasts (http://www.bmrc.berkeley.edu/MediaServer).

Our experience producing this broadcast has suggested several tools that can be developed to: 1) reduce the effort required to produce the broadcast (i.e., a Broadcast Management System); 2) improve the seminar interaction (i.e., a QuestionBoard floor control tool); and 3) improve the video quality of the broadcasts (i.e., a Video Production Switcher). The Broadcast Manager database maintains descriptions of hardware and software resources required to produce a broadcast, configurations of processes and streams for a broadcast, and scheduled live and replay broadcasts. The system uses the database to launch the tools and processes required to produce a broadcast which can be over 15 processes. In addition, we developed tools to monitor and control various aspects of the broadcast transmission (e.g., rtpmon). A paper describing the design and implementation of this system has been written [Swan97] and we will deploy the system during the Fall 1997 semester.

The QuestionBoard tool qb allows remote participants to indicate their desire to ask a question. The goal is to encourage remote participants to ask questions. Most people using distance learning technologies have discovered that remote participants are unlikely to ask questions. We wanted to change the dynamics so that asking a question was less interrupting and threatening. A question can either be entered as text or as a "request to ask a question" using audio and/or video. The lecturer or a moderator in the room can signal the existence of a question so the floor can be granted to the person with a question or the moderator/speaker can just state and answer the question. A paper describing the system has been written [Malpani97]. The paper describes our experiences using the tool in four seminars at the end of the Spring 1997 semester.

We are also developing a software-only image and video-effects processing system that will operate on digital audio and video streams represented by RTP packets. Some examples of video-effects operations are transitions, chroma key, and picture-in-picture. Examples of image processing operations are filtering, warping, stitching, animation rendering, and compositing. To achieve real-time performance, these operations are typically implemented by special-purpose hardware, although some examples of software-only rendering (e.g., Pixar Renderman) and video post production systems (e.g., the IBM Power Visualization EFX) have been developed.

A parallel, software-only video-effects system is being developed that will run on a Network-of-Workstations (NOW) composed of general-purpose processors. The system will be extensible and scalable. New effects can be added by coding primitive transformations in a high-level language. The NOW architecture will allow the computation to be scaled depending on the effect complexity and the video resolution. Functional, temporal, and spatial decomposition, dynamic workload balancing, compressed domain processing, and use of special-purpose instructions (e.g., MMX) will be exploited to achieve real-time high-quality results. The system will allow video-effects to be performed on live and stored video streams.

We have developed two prototypes of this system. The first prototype, developed as a class project, demonstrated concept feasibility and the first design of the system and effects streaming abstractions [Simpson95]. The second prototype was developed by another student as part of her master's project. This prototype restructured the user interface to the system and the architecture of the software. This prototype also introduced the first compressed domain processing effects (e.g., wipes and mix transitions). An early report on this work is available [Wong96], and a paper is being written. Our work plan for the next year is to revise the software architecture yet again so the system can be deployed as part of the Internet Broadcasting System discussed below and to interface it to the forthcoming NOW implementation of the effects engine.

Our experience producing the seminar broadcast and other live events convinces us that an internet Broadcasting System can be built using the MBone technology that can support thousands of live and on-demand programs that can be viewed simultaneously by one person or millions of people. A program can be composed of multiple streams of data including audio, video, text, images, and animations. It might be a conventional television program, a lecture or meeting with discussion between participants, or an interactive game. We are developing such a system for the Berkeley campus that will connect with the conventional analog video distribution network and studio/classrooms across campus. Users anywhere on the Internet can tune-in to these programs live or when they are replayed either scheduled for other time zones or on-demand by one viewer. We are experimenting with new ways to present content and novel ways to provide information or guides about available programs.

Experiments with Video Content

We have developed several web titles that use streaming video playback. These titles are being developed to experiment with the use of this new media. One title organizes the video material into a collection of short clips, one clip per web page, with links between the pages. Users follow the links to explore different topics and concepts. We have produced three titles using this model. Two titles are taken from interviews of famous personalities in international relations politics (see http://www.bmrc.berkeley.edu/globalvillage/cho, "An Interview with The Honorable Cho Soon, Mayor of Seoul, Republic of Korea," for an example). The "Cho Soon" title is composed of 25 clips that range from 30 seconds to 2.5 minutes. Altogether the video material requires 250 MB of storage. The second title is an interview with Robert McNamara. The idea is that other users can create new material by linking together various clips from one or more interviews. For example, a title on city government can link together interview clips from different people to illustrate different opinions and experiences. We have experimented with other ways to organize material such as lectures with synchronized slides and transcripts with links to annotations and related material.

We are continuing to experiment with other models for multimedia titles. During the Spring 1997 semester Professor Rowe taught an art course (http://www.bmrc.berkeley/webvideo, "Internet WebVideo Studio). This course covered material on shooting, digitizing, and editing video material for the web. Students in the class did projects that were composed of up to 10 minutes of video material and at least 5 web pages. The best projects were selected for public performance and judged by a distinguished panel of artists and videographers. The winning title was a web site designed to help high school track coaches and athletes learn about and train for the triple jump. The material included instructions on the various phases of the triple jump, exercises that athletes can do to improve each phase, a calculator for determining an athlete's performance given his/her long jump performance, interviews with other triple jumpers, and a historic timeline of various periods of great triple jumpers along with a video to illustrate their performance. Other titles of note include an interactive demonstration on cooking, an information kiosk for the Journalism School at Berkeley, and a novel interactive story titled "What Do You Want?" We will be publishing these titles on the web in the next month (http://www.bmrc.berkeley.edu).

Multimedia Development Infrastructure

The Berkeley Continuous Media Toolkit (CMT) is a portable, freely distributed toolkit for developing distributed multimedia applications [Mayer-Patel97, Rowe96]. Example applications are streaming video playback, desktop video conferencing, and non-linear video editors. The CMT system supports a variety of audio and video devices and includes source and destination abstractions (e.g., files, cameras, displays, etc.), a client application programming interface to create objects, connect them together, and control them (e.g., synchronization primitives), and network protocols to send and receive audio/video packets. Devices represent audio and video hardware (e.g., video capture/compression boards and audio boards). Sources, destinations, and filters are abstractions for continuous media (CM) objects that read, write, or read and write streams of CM data, respectively. Source objects produce CM streams (e.g., a camera or file). Destination objects consume CM streams (e.g., a display or a file). And, filter objects consume and produce CM streams (e.g., an audio mixer or format transcoder). Other CMT abstractions and services include network transport (e.g., an RTP stream encapsulated into IP-Multicast), synchronized clocks and timers, and numerous glue objects (e.g., splitters, multiplexers, audio mixers, etc). And creates sources, destinations, and filters, connects them together, and then issues commands to activate the objects (e.g., begin video capture, reposition an audio file to 10 seconds into the file, play video, etc.).

The first release of CMT was made in 1993. During the past 18 months several releases of CMT version 3 were made and significant progress was completed on the development of CMT version 4 which will be released in June 1997. In addition, new objects were implemented and several applications were developed. Two major improvements were made to the programming model. First, a declarative application programming interface (API) for creating, modifying, and controlling a collection of objects to play synchronized audio and video streams was designed and implemented. This interface is called the Media Playback API [Jackson96]. And second, a continuous media event (cmevent) binding was added to the CMT object abstraction. A cmevent is any state data or action an object wants to communicate to an application. Examples are "frame decoded," "blocks decoded per second," and "blocks dropped." Application code can bind callback procedures to these events which will be called when the object raises the event. This mechanism is used to update statistics and status on user interfaces and to invoke control actions in a play chain, that is, a sequence of objects possibly on different machines. A motivating example for the development of this mechanism was the development of MPEG1 video analysis tools [Banks95].

Many applications have been developed using CMT. Two video editors were implemented using CMT. The first, name CMEdit, is a conventional non-linear editor that supports a timeline window, media clip bins, and playback windows [Baldeschwieler96]. One unconventional aspect of this system is that media clips can be stored on different file servers or in the Berkeley Video-on-Demand System [Rowe97]. A second non-linear video editor was implemented that uses an interface paradigm of film clips that can be cut apart and joined together [Steele96]. A prototype of this editor required less than 2,000 lines of code. The prototype software-only video-effects production switcher, described above, was also implemented using CMT. Many other applications have been developed including 1) data entry tools that allow bibliographic and structural information about videos to be entered into a video database [Bacher97], 2) a web plug-in for on-demand seminar playback with synchronized slides and outlines [Simpson97a], and 3) Internet MBONE session record and playback tools with full-function VCR controls [Simpson97b].

[Amir95] E. Amir, S. McCanne, and H. Zhang, "An Application-level Video Gateway," ACM Multimedia, San Francisco, California, (November 1995), pp. 511-522.

[Bacher97] D. R. Bacher and L. A. Rowe, "Data Entry and Administrative Tools for a Video-on-Demand Metadata Database," submitted for publication, (February 1997). An on-line description of the tools is available at http://www.bmrc.berkeley.edu/~drbacher/projects/bvde.html.

[Baldeschwieler96] J. E. Baldeschwieler and L. A. Rowe, "Editing Extensions to the Berkeley Continuous Media Toolkit," http://www.bmrc.berkeley.edu/papers/111.html, (August 1996).

[Banks95] D. Banks and L. A. Rowe, "Analysis Tools for MPEG-1 Video Streams," unpublished manuscript, Computer Science Division – EECS, University of California, Berkeley, California, (July 1995).

[Boreczky96] J. S. Boreczky and L. A. Rowe, "A Comparison of Video Shot Boundary Detection Techniques," Journal of Electronic Imaging, 5(2), (April 1996), pp. 122-128.

[Brubeck96] D. W. Brubeck and L. A. Rowe, "Hierarchical Storage Management in a Distributed Video-On-Demand System," IEEE Multimedia, Vol. 3, No. 3, (1996), pp. 37-47.

[Eriksson94] H. Eriksson, "MBONE: The Multicast Backbone," ACM Communications, Vol 37, No 8, (August 1994), pp 54-60.

[Jackson96] M. H. Jackson, J. E. Baldeschwieler, and L. A. Rowe, "Berkeley CMT Media Playback API," submitted for publication, (September 1996); also available at http://www.bmrc.berkeley.edu/papers/trey/112.html

[Jackson96] M. H. Jackson, J. E. Baldeschwieler, and L. A. Rowe, "Berkeley CMT Media Playback API," submitted for publication, (September 1996). Available at http://www.bmrc.berkeley.edu/papers/trey/112.html.

[Malpani97] R. Malpani and L.A. Rowe, "Floor Control for Large-Scale MBone Seminars," submitted for publication, (June 1997), also available at http://www.bmrc.berkeley.edu/papers/137.html.

[Mayer-Patel97] K. Mayer-Patel and L. A. Rowe, "Design and Performance of the Berkeley Continuous Media Toolkit," Multimedia Computing and Networking Conference, IS&T/SPIE Conferences, San Jose, California, (January 1997).

[Mayer-Patel97] K. Mayer-Patel and L. A. Rowe, "Design and Performance of the Berkeley Continuous Media Toolkit," in Multimedia Computing and Networking, M. Freeman, P. Jardetzky, and H. M. Vin, Editors, Proc. SPIE 3020, (1997).

[McCanne95] S. McCanne, and V. Jacobson, "vic: A Flexible Framework for Packet Video," ACM Multimedia, San Francisco, California, (November 1995), pp. 511-522.

[Rowe96] L. A. Rowe, et al., "Continuous Media Toolkit," http://www.bmrc.berkeley.edu/cmt, (1996).

[Simpson95] D. Simpson and R. Fromm, http://bmrc.berkeley.edu/~davesimp/classes/cs294-3/Project/, "Software Video Production Switcher," class project CS294 Multimedia Systems and Applications (Fall 1995).

[Simpson97a] D. Simpson and L. A. Rowe, "An On-demand Seminar Playback Web Plug-in," http://www.bmrc.berkeley.edu/events/retreat97/talks/rowe-iv/slide8.html, (January 1997).

[Simpson97b] D. Simpson, "Internet MBONE Recording/Playback Tools," forthcoming, (June 1997).

[Steele96] M. Steele, "The Video Workbench: A New User Interface for Non-linear Video Editors," CS294-3 Class Project, http://bmrc.berkeley.edu/294/projects/steele/, (Fall 1996).

[Swan97] A. Swan and L. A. Rowe, "An Internet MBone Broadcast Management System," submitted for publication, (June 1997).

[Wong96] T. Wong, http://bmrc.berkeley.edu/~twong/cs294-3/, "SVPS: A Software Video Production Switcher for MBone Broadcast," class project CS294 Multimedia Systems and Applications, (Fall 1996).

 


Driving Applications

Several of the driving applications for Titan have conducted investigations that would not have been possible without the massive computing power of the core component.

 

Collaborative Design and Distributed Simulation (C. Sequin)

Virtual environments are of major interest to computer graphics researchers due to their ability to immerse the user in a computer-generated alternate reality in which anything is possible. One of the most exciting application domains is collaborative design, where scientists, engineers, architects, and other professionals can enter a virtual space that allows the physical structure of a system to be evaluated without actually building, creating, or affecting a real instance of that structure. Users could preview architectural designs, evaluate their performance with various metrics, and do simulations and "what-if'' experiments cheaply and with no risk. To obtain realistic answers to such experiments, we need to integrate good physical simulations with virtual environment interfaces. Integration of powerful simulation technology with virtual reality visualization systems affords the possibility of intuitive interpretation and visualization of the results of complex and powerful simulations via 3D computer graphics. These tasks require a large amount of computing power and high-bandwidth connections between the different computing nodes. The Titan infrastructure provides these resources.

In the http://http.cs.berkeley.edu/~bukowski/wkfire/index.html Walkthru-CFAST Project we are attempting to realize some of these advantages for the benefit of fire safety in architectural environments. This work is based on a project to integrate the National Institute of Standards and Technology's (NIST) Consolidated Model of Fire and Smoke Transport (CFAST) [1] into the Berkeley Architectural Walkthrough (Walkthru) [2] system. CFAST currently provides the world's most accurate simulation of the impact of fire and its byproducts on a building environment. Integrated into the Walkthru, it provides real-time, intuitive, realistic and scientific visualization of building conditions in a fire hazard situation from the perspective of a person walking through a burning building. The viewer can observe the natural visual effects of flame and smoke in fire hazard conditions; alternatively, scientific visualization techniques allow the user to "observe" the concentrations of toxic compounds such as carbon monoxide and hydrogen cyanide in the air, as well as the temperatures of the atmosphere, walls, and floor. Warning and suppression systems such as smoke detectors and sprinkler heads can be observed in action to help determine the effectiveness of those systems. This technology will be used to improve fire safety by helping engineers and architects evaluate a building's potential safety and survivability through performance-based standards (i.e., how well the building protects its occupants from the fire). With more development, it could also be used to help train personnel in firefighting techniques and rescue operations by presenting them with practice situations that are too risky to be simulated in the real world.

While the combination of virtual reality and environmental simulation constitutes a framework for very powerful tools, it also raises many implementation challenges. Among these challenges are interaction with the virtual world, setting up and dynamically changing simulation conditions from within the virtual world to a simulator, designing "visualization-oriented" simulators, transporting simulation results to the visualizer, integrating the simulator's results with the virtual environment, and visualizing those results in a way that is useful to the user; either descriptively, in the case of scientific visualization applications, or realistically, in the case of training or entertainment applications. These problems are compounded by an additional desire to distribute both the virtual environment and the simulation over multiple computers -- potentially connected by high-latency, low-bandwidth networks such as the Internet -- when attempting to simulate and visualize large buildings with hundreds of rooms.

In a paper accepted to SIGGRAPH'97 [3], we present an approach to the problem of distributed simulation-visualization data management that is optimized for densely occluded polyhedral environments (i.e., buildings) based on the Walkthru and CFAST programs. Walkthru has already addressed some of the problems of distributed visualization and of the interaction between the user and the virtual world [4]. We show that the basic virtual environment structure used in the Walkthru, a spatial subdivision of the world into densely occluded cells with connecting portals, can be put to good use for simulation data management. In addition to optimizing the visualization task, it is also useful for optimizing bandwidth requirements between a visualizer and simulator, both for the purpose of communicating scenario information to the simulator and communicating simulated states back to the visualizer. Using this structure, we can optimize bandwidth requirements for arbitrarily large visualizations and simulations, and relieve the visualization and simulation designers of the complexity of the data management problem. The solution is easily extensible to multiple distributed visualizers and simulators operating on one virtual world. It also suggests an important attribute of future simulation design for simulation developers who wish to make "virtual reality-oriented" real-time simulators: the ability to "concentrate" simulation efforts on areas of the environment of immediate interest to the observer, denoted by those areas which are currently being observed in real-time.

[1] R. Peacock, et al., "CFAST, the Consolidated Model of Fire Growth and Smoke Transport," NIST Technical Note 1299, U.S. Dept. of Commerce, (February 1993).

[2] T. Funkhouser, C. Sequin, and S. Teller, "Management of Large Amounts of Data on Interactive Building Walkthroughs," Proc. Symp. On Interactive 3D Graphics, Boston, Massachusetts (March 1992).

[3] R. Bukowski and C. Sequin, "Interactive Simulation of Fire in Virtual Building Environments<" to appear in Proc. of SIGGRAPH'97, Los Angeles, California, (August 1997).

[4] R. Bukowski and C. Sequin, "Object Associations: A Simple and Practical Approach to Virtual 3D Manipulation," Proc. Symp. On Interactive 3D Graphics, Monterey, California (April 1995).

 

Internet Simulated Networking Environment (INSANE)

The Tenet real-time networking group used NOW cluster to run simulations of a large IP-over-ATM network, using a new network simulator, the Internet Simulated Networking Environment (INSANE). INSANE simulates the operation of different types of Internet applications over a heterogeneous internetwork including some ATM components.

 

Digital Library Project (R. Wilensky, D. Forsyth, J. Malik)

The Digital Library Project (http://elib.cs.berkeley.edu/) is creating testbeds of data for research and public access. The data includes scanned documents which are then converted to various other image formats (on the NOW cluster). OCR is run (on the NOW cluster) to produce ASCII text and word locations. Titan facilities were critical to the work in finding pictures of objects in large collections of images.

 

FACADE: The first project, FACADE, solves a problem of great interest to researchers and practitioners of computer graphics and virtual reality systems. How do we construct computer models of real world objects, such as the Eiffel Tower or the Notre Dame cathedral, so that we may visit them in our synthetic worlds?

Paul Debevec, C. J. Taylor, and J. Malik have developed a new approach for modeling and rendering existing architectural scenes from a sparse set of still photographs. Our modeling approach, which combines both geometry-based and image-based techniques, has two components. The first component is a photogrammetric modeling method which facilitates the recovery of the basic geometry of the photographed scene. The second component is a model-based stereo algorithm, which recovers how the real scene deviates from the basic model. By making use of the model, our stereo technique robustly recovers accurate depth from widely-spaced image pairs. For producing renderings, we developed view-dependent texture mapping, a method of compositing multiple views of a scene that better simulates geometric detail on basic models.

During 1996-97, the following capabilities were added to FACADE:

Recovering surfaces of revolution. This permits us to reconstruct the geometrical structure of domes, minarets, etc., which are fairly common in Islamic and European Renaissance architecture.

Recovering Arches. This permits us to reconstruct the geometrical structure of surfaces that have been formed by sweeping a planar curve.

A large number of architectural scenes have now been modeled and rendered using the FACADE system. These include:

· Sather Tower, UC Berkeley campus

ROADWATCH: Traffic management and information systems must rely on a system of sensors for estimating traffic parameters in real-time. Currently, the dominant technology for this purpose is that of magnetic loop detectors, which are buried underneath highways to count vehicles passing over them. Video monitoring systems promise a number of advantages. First, a much larger set of traffic parameters can be estimated in addition to vehicle counts and speeds. These include vehicle classifications, link travel times, lane changes, rapid accelerations or decelerations, queue lengths at urban intersections, etc. Second, cameras are less disruptive and less costly to install and repair than loop detectors, which require digging up the road surface.

Jitendra Malik, Stuart Russell, David Beymer, Philip McLauclahan, Joseph Weber, Benn Coiffman, Tim Huang and Daniel Lyddy have developed such a video monitoring system. The core idea is to have video cameras mounted on poles or other tall structures looking down at the traffic scene. We envisage that each camera's field of view extends a couple of hundred yards down the direction of traffic flow, and that the video cameras are located every two miles or so down the length of the freeway, or at every urban intersection if used in that context. Video is captured, digitized, and processed by onsite computers, and then transmitted in summary form to a Transportation Management Center (TMC) for collation and computation of multi-site statistics such as link travel times and origin/destination counts. Processing occurs in two stages:

  1. At each camera site, the video stream is continuously processed to identify individual vehicles. Each vehicle is tracked, i.e., followed from frame to frame, to refine and update its position and velocity in 3D world coordinates. This gives us a space-time plot of the trajectory of each vehicle during a spatiotemporal window when the vehicle is in the clear field of view of this camera. Aggregating these data enables the estimation of local traffic parameters including flow rates, average speeds, lane change frequencies, etc. These parameters, together with track information (time stamp, vehicle type, color, shape, and position), are communicated to the TMC at regular intervals.
  2. At the TMC, link travel times and origin/destination counts are computed by first matching individual vehicles appearing at two or more camera sites. As a vehicle is reidentified at a sequence of camera sites, the TMC builds up a path which records the route taken by the vehicle through the freeway network. Given these paths, it is straightforward to determine link travel time and origin/destination counts. The TMC is also responsible for real-time collation and presentation of local and global traffic parameters.

 

We urge the reader to explore http://HTTP.CS.Berkeley.EDU/~debevec/Research/+ measuring traffic parameters.

 

 

Berkeley Intelligent RAM Project

The Berkeley Intelligent RAM project (http://iram.cs.berkeley.edu/) seeks to understand the entire spectrum of issues involved in designing general-purpose computer systems that integrate a processor and DRAM onto a single chip - from circuits, VLSI design and architectures to compilers and operating systems. IRAM should offer several advantage over today's solutions, including considerably reduced latency and dramatically increased bandwidth to main memory, reduced energy consumption, and reduced space and weight for embedded, portable, desktop, and MPP computer systems.

In order to investigate the performance and power advantages of IRAM computer systems, the NOW has been used for thousands of hours of architectural simulation over the last several months. The SimOS complete computer system simulation environment was used to evaluate the performance of eight computer system models, each running 11 different benchmarks, where each of these 88 simulations took 2-3 days on a single 167MHz UltraSparc I workstation. In addition, Shade cache simulations were used to determine the energy consumption of four alternative memory hierarchies, each running eight benchmarks, where each of these 32 simulations took 1-2 days. Results for both simulation efforts are being published in ISCA '97. Because these simulations were independent and thus trivially parallelizable, the NOW was an ideal simulation engine, allowing us to complete the simulation in a matter of a few days, rather than many months.

 

The OPTICAL Project at UC Berkeley: Computer Aided Cornea Modeling and Visualization (B. A. Barsky)

The OPTICAL project is a multidisciplinary effort in the Computer Science Division and School of Optometry. "OPTICAL" is an acronym for "OPtics and Topography Involving the Cornea and Lens." This project is concerned with the computer-aided measurement, modeling, reconstruction, and visualization of the shape of the human cornea, called corneal topography.

The cornea is the transparent tissue covering the front of the eye. It performs 3/4 of the refraction, or bending, of light in the eye, and focuses light towards the lens and the retina. Thus, subtle variations in the shape of the cornea can significantly diminish visual performance. Eye care practitioners need to know the shape of a patient's cornea to fit contact lenses, to plan and evaluate the results of surgeries that improve vision by altering the shape of the cornea, and to diagnose keratoconus, an eye condition where the cornea has an irregular shape with a local protrusion, or "cone," which has dramatic effects on vision.

Recently, instruments to measure corneal topography have become commercially available. These devices, called videokeratographs, typically shine rings of light onto the cornea and then capture the reflection pattern with a built-in video camera. Instead of allowing the videokeratograph to process the pattern, concerned with the computer-aided measurement, modeling, reconstruction, and visualization of the shape of the human cornea, called corneal topography. The OPTICAL project extract the data and construct a mathematical spline surface representation from these reflection patterns. In addition to developing this novel modeling algorithm, future exploration into new scientific visualization techniques to display the resulting information in an intuitive and accurate manner will be pursued. This video compares our new visualization methods with an existing technique. Using our modeling and visualization software, we show real keratoconic data and then a simulated keratoconic model.

 


Education

We have developed a significant amount of educational material related to Titan.

First, J. Demmel developed a fully on-line course, CS 267, on "Applications of Parallel Computers," offered every spring. It covers parallel architecture, software, and applications. D. Culler taught the CS267 course this year entirely on NOW.D. Culler developed another class, CS 258, on "Parallel Computer Architecture" in collaboration with A. Gupta and J. P. Singh of Stanford. The textbook is under contract to Morgan-Kaufman.