I am a Professor in the Computer Science Division of the EECS Department at the University of California, Berkeley. My main research areas are computer architecture, VLSI design, parallel programming and operating system design. I am Director of the new ASPIRE lab tackling the challenge of improving computational efficiency now that transistor scaling is ending. ASPIRE builds upon the earlier success of the Par Lab, whose goal was to make parallel programming accessible to most programmers. I am also an Associate Director at the Berkeley Wireless Research Center, and hold a joint appointment with the Lawrence Berkeley National Laboratory. Previously at MIT, I led the SCALE group, investigating advanced architectures for energy-efficient high-performance computing.

Active Research Projects

The ASPIRE Lab

ASPIRE is a new 5-year research project that recognizes the shift from transistor-scaling-driven performance improvements to a new post-scaling world where whole-stack co-design is the key to improved efficiency. Building on the success of the Par Lab project, it uses deep hardware and software co-tuning to achieve the highest possible performance and energy efficiency for future warehouse-scale and mobile computing systems.

The RISC-V Instruction Set Architecture

RISC-V is a new instruction set architecture (ISA) developed at UC Berkeley, designed to be a realistic, clean, and open ISA that is easy to extend for research or subset for education. A wide variety of implementations have been produced, including GHz-class silicon fabrications and FPGA emulations, and RISC-V is being used in a number of classes. A full set of software tools for the architecture is also under development and is being prepared for open distribution. RISC-V was initially developed as part of Par Lab and is now part of ASPIRE.
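One design goal of a clean ISA is that instructions decode with simple fixed-position field extraction. As an illustrative sketch (in Python, not part of the RISC-V toolchain), the fields of the base I-type instruction format from the public RISC-V specification can be pulled out with shifts and masks:

```python
# Illustrative sketch: decoding the base RISC-V I-type instruction format.
# Field positions follow the published RISC-V ISA specification.
def decode_itype(insn):
    """Split a 32-bit I-type RISC-V instruction into its fields."""
    opcode = insn & 0x7F               # bits [6:0]
    rd     = (insn >> 7) & 0x1F        # bits [11:7]
    funct3 = (insn >> 12) & 0x7        # bits [14:12]
    rs1    = (insn >> 15) & 0x1F       # bits [19:15]
    imm    = insn >> 20                # bits [31:20]
    if imm & 0x800:                    # sign-extend the 12-bit immediate
        imm -= 0x1000
    return {"opcode": opcode, "rd": rd, "funct3": funct3,
            "rs1": rs1, "imm": imm}

# "addi x1, x0, 5" assembles to 0x00500093
fields = decode_itype(0x00500093)
# -> {'opcode': 0x13, 'rd': 1, 'funct3': 0, 'rs1': 0, 'imm': 5}
```

Because every field sits at a fixed bit position across formats, a hardware decoder can extract them all in parallel with no serial parsing.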

Constructing Hardware in a Scala Embedded Language

Chisel is a new open-source hardware construction language developed at UC Berkeley that supports advanced hardware design using highly parameterized generators and layered domain-specific hardware languages. Chisel is embedded in the Scala programming language, which raises the level of hardware design abstraction by providing concepts including object orientation, functional programming, parameterized types, and type inference. Chisel was originally developed in the DoE Project Isis and Par Lab, and development continues in ASPIRE.
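Chisel itself is embedded in Scala; as a loose illustration of the underlying "hardware generator" idea only (hypothetical Python mini-DSL, not Chisel's actual API), a host-language function can emit a gate-level netlist whose shape is controlled by an ordinary parameter:

```python
# Loose illustration (not Chisel's API) of a parameterized hardware generator:
# a host-language function that emits a gate-level netlist for an n-bit
# ripple-carry adder, with the width as an ordinary software parameter.
def ripple_adder(n):
    """Return a netlist: list of (gate, out, in1, in2) tuples."""
    netlist, carry = [], "c0"
    for i in range(n):
        a, b = f"a{i}", f"b{i}"
        netlist += [
            ("xor", f"p{i}", a, b),                # propagate
            ("xor", f"s{i}", f"p{i}", carry),      # sum bit
            ("and", f"g{i}", a, b),                # generate
            ("and", f"t{i}", f"p{i}", carry),
            ("or",  f"c{i+1}", f"g{i}", f"t{i}"),  # carry out
        ]
        carry = f"c{i+1}"
    return netlist

def simulate(netlist, wires):
    """Evaluate the netlist over a dict of wire values (gates in order)."""
    ops = {"xor": lambda x, y: x ^ y, "and": lambda x, y: x & y,
           "or": lambda x, y: x | y}
    for gate, out, i1, i2 in netlist:
        wires[out] = ops[gate](wires[i1], wires[i2])
    return wires

# 4-bit add: 5 + 6
wires = {"c0": 0}
for i in range(4):
    wires[f"a{i}"] = (5 >> i) & 1
    wires[f"b{i}"] = (6 >> i) & 1
simulate(ripple_adder(4), wires)
result = sum(wires[f"s{i}"] << i for i in range(4))  # -> 11
```

In Chisel the same idea is expressed with Scala's object orientation and type system, so generators compose and type-check rather than being string-pasted.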

FireBox

FireBox is a new project at UC Berkeley proposing a system architecture for third-generation Warehouse-Scale Computers (WSCs). FireBox is a 50kW WSC building block containing a thousand compute sockets and 100 Petabytes (2^57 Bytes) of non-volatile memory connected via a low-latency, high-bandwidth optical switch. Each compute socket contains a System-on-a-Chip (SoC) with around 100 cores connected to high-bandwidth on-package DRAM. Fast SoC network interfaces reduce the software overhead of communicating between application services, and high-radix network backplane switches connected by Terabit/sec optical fibers reduce the network's contribution to tail latency. The very large non-volatile store directly supports in-memory databases, and pervasive encryption ensures that data is always protected in transit and in storage. FireBox is being developed in the Berkeley ASPIRE Lab.

DIABLO: Datacenter-In-A-Box at LOw Cost

DIABLO is a wind tunnel for datacenter research, simulating O(10,000) datacenter servers and O(1,000) switches for O(100) seconds. DIABLO is built with FPGAs and executes real instructions and moves real bytes, while running the full Linux operating system and unmodified datacenter software stacks on each simulated server. DIABLO has successfully reproduced some real-life datacenter phenomena, such as the memcached request latency long tail at large scales. DIABLO was initially developed in the RAMP project, and is now part of ASPIRE.

Resiliency for Extreme Energy Efficiency

Most manycore hardware designs have the potential to achieve maximum energy efficiency when operated across a broad range of supply voltages, spanning from nominal down to near the transistor threshold. As part of ASPIRE, we are working on new circuit and architectural techniques to enable parallel processors to work across this broad supply range while tolerating technology variability and providing immunity to soft and hard errors. We are building several prototype resilient microprocessors codenamed Raven.

Monolithically Integrated CMOS Photonics

In a collaboration with MIT, the University of Colorado at Boulder, and Micron Technology, we are exploring the use of silicon photonics to provide high bandwidth energy-efficient links between processors and memory. Integrated photonics is a key component of the FireBox project.

Graph Algorithm Platform

Graph algorithms are becoming increasingly important, from warehouse-scale computers reasoning about vast amounts of data for analytics and recommendation applications to mobile clients running recognition and machine-learning applications. Unfortunately, graph algorithms execute inefficiently on current platforms, either shared-memory systems or distributed clusters. The Berkeley Graph Algorithm Platform (GAP) Project spans the entire stack, aiming to accelerate graph algorithms through software optimization and hardware acceleration. GAP was begun in Par Lab and is now part of ASPIRE.

DEGAS: Dynamic Exascale Global Address Space Programming Environments

The Dynamic, Exascale Global Address Space programming environment (DEGAS) project will develop the next generation of programming models, runtime systems and tools to meet the challenges of Exascale systems.

A Liquid Thread Environment

Applications built by composing different parallel libraries perform poorly when those libraries interfere with one another by obliviously using the same physical cores, leading to destructive resource oversubscription. Lithe is a low-level substrate that provides basic primitives and a standard interface for composing parallel libraries efficiently, and can be inserted underneath the runtimes of legacy parallel libraries, such as TBB and OpenMP, to provide bolt-on composability without changes to existing application code. Lithe was initially developed in Par Lab and is now part of DEGAS.
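The oversubscription problem and the substrate's role can be sketched in a few lines (illustrative Python; Lithe's real interface is a low-level C substrate underneath the library runtimes, with richer primitives than shown here):

```python
# Hedged sketch of the problem Lithe addresses (not Lithe's actual C API):
# instead of each parallel library assuming it owns every physical core,
# libraries request cores from a shared substrate and yield them back.
class Substrate:
    def __init__(self, ncores):
        self.free = list(range(ncores))

    def request(self, want):
        """Grant up to `want` cores; a library uses only what it is granted."""
        grant, self.free = self.free[:want], self.free[want:]
        return grant

    def yield_back(self, cores):
        """Return cores so another library's runtime can use them."""
        self.free.extend(cores)

sub = Substrate(ncores=8)
lib_a = sub.request(6)   # e.g. an OpenMP-style runtime gets 6 cores
lib_b = sub.request(6)   # e.g. a TBB-style runtime gets only the 2 remaining
# Together the two runtimes never oversubscribe the 8 physical cores.
```

Without such a substrate, each runtime would spawn a worker per core, so a composed application would run 16 workers on 8 cores and thrash.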

Earlier Projects at UC Berkeley

DHOSA: Defending Against Hostile Operating Systems (2009-2014)

The DHOSA research project focuses on building systems that will remain secure even when the operating system is compromised or hostile. DHOSA is a collaborative effort among researchers from Harvard, Stony Brook, UC Berkeley, the University of Illinois at Urbana-Champaign, and the University of Virginia.

Par Lab: The Parallel Computing Laboratory (2008-2013)

With the end of sequential processor performance scaling, multicore processors provide the only path to increased performance and energy efficiency in all platforms from mobile to warehouse-scale computers. The Par Lab was created by a team of Berkeley researchers with the ambitious goal of enabling "most programmers to be productive writing efficient, correct, portable SW for 100+ cores & scale as cores increase every 2 years".

The Maven Vector-Thread Architecture (2007-2013)

Based on our experiences designing, implementing, and evaluating the Scale vector-thread architecture, we identified three primary directions for improvement to simplify both the hardware and software aspects of the VT architectural design pattern: (1) a unified VT instruction set architecture; (2) a VT microarchitecture more closely based on the vector-SIMD pattern; and (3) an explicitly data-parallel VT programming methodology. These ideas formed the foundation for the Maven VT architecture.

Tessellation OS

Tessellation is a manycore OS targeted at the resource management challenges of emerging client devices. Tessellation is built on two central ideas: Space-Time Partitioning and Two-Level Scheduling. Tessellation was initially developed within Par Lab and is now part of the Swarm Lab.

RAMP: Research Accelerator for Multi-Processors (2005-2010)

The RAMP project was a multi-University collaboration to develop new techniques for efficient FPGA-based emulation of novel parallel architectures, thereby overcoming the multicore simulation bottlenecks facing computer architecture researchers. At Berkeley, prototypes included the 1,008-processor RAMP Blue system and the RAMP Gold manycore emulator, as well as the follow-on DIABLO datacenter emulator.

RAMP Gold

RAMP Gold is an FPGA-based emulator for SPARC V8 manycore processors providing a high-throughput, cycle-accurate full-system simulator capable of booting real operating systems. RAMP Gold models target-system timing and functionality separately, and employs host-multithreading for an efficient FPGA implementation. The RAMP Gold prototype runs on a single Xilinx Virtex-5 FPGA board and simulates a 64-core shared-memory target machine.
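Host-multithreading can be illustrated with a toy model (hypothetical Python, not RAMP Gold's actual RTL or SPARC functionality): one host execution loop round-robins over the architectural state of many simulated target cores, advancing a different core each host cycle, so a single physical pipeline emulates a large manycore target:

```python
# Illustrative model of host-multithreading: one host pipeline interleaves
# many target cores' architectural state, one target instruction per host
# cycle. (Toy instruction set, not SPARC V8.)
def host_multithreaded_sim(ncores, program, cycles):
    # Per-core architectural state: here just a PC and an accumulator.
    state = [{"pc": 0, "acc": 0} for _ in range(ncores)]
    for host_cycle in range(cycles):
        core = state[host_cycle % ncores]   # round-robin core selection
        if core["pc"] < len(program):
            op, val = program[core["pc"]]
            if op == "add":
                core["acc"] += val
            core["pc"] += 1
    return state

# Every core runs the same tiny two-instruction program.
final = host_multithreaded_sim(ncores=4,
                               program=[("add", 2), ("add", 3)],
                               cycles=8)
# -> after 8 host cycles, each of the 4 cores has executed both instructions
```

Interleaving many target threads keeps the host pipeline busy even when an individual target core would stall, which is what makes the FPGA implementation efficient.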

RAMP Blue

RAMP Blue was the first large-scale RAMP system built as a demonstrator of the ideas. The system models a cluster of up to 1,008 MicroBlaze cores implemented using up to 84 Virtex-II Pro FPGAs on up to 21 BEE2 boards. The software infrastructure consists of GCC, uClinux, and the UPC parallel language and runtimes, and the prototype can run off-the-shelf scientific applications.

Earlier Projects from the MIT SCALE Group (1998-2007)

The Scale Vector-Thread Microprocessor

The Scale microprocessor introduced a new architectural paradigm, vector-threading, which combines the benefits of vector and threaded execution. The vector-thread unit can smoothly morph its control structure from vector-style to threaded-style execution.

Transactional Memory

In many dynamic thread-parallel applications, lock management is the source of much programming complexity as well as space and time overhead. We are investigating possible practical microarchitectures for implementing transactional memory, which provides a superior solution for atomicity that is much simpler to program than locks, and which also reduces space and time overheads.
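The project investigated hardware microarchitectures; a minimal software analogue (illustrative Python only) conveys the programming model: a transaction buffers its writes, records versions of what it reads, and commits atomically only if nothing it read has changed:

```python
# Minimal software analogue of transactional memory semantics (illustrative;
# the research targets hardware implementations). Writes are buffered and
# the read set is validated at commit.
class Memory:
    def __init__(self):
        self.data = {}
        self.version = {}     # per-location version counters

class Txn:
    def __init__(self, mem):
        self.mem, self.reads, self.writes = mem, {}, {}

    def read(self, addr):
        if addr in self.writes:                       # read-your-own-write
            return self.writes[addr]
        self.reads[addr] = self.mem.version.get(addr, 0)  # version seen
        return self.mem.data.get(addr, 0)

    def write(self, addr, val):
        self.writes[addr] = val   # buffered, invisible until commit

    def commit(self):
        # Abort if any location we read has changed since we read it.
        for addr, ver in self.reads.items():
            if self.mem.version.get(addr, 0) != ver:
                return False
        for addr, val in self.writes.items():
            self.mem.data[addr] = val
            self.mem.version[addr] = self.mem.version.get(addr, 0) + 1
        return True

mem = Memory()
t = Txn(mem)
t.write("counter", t.read("counter") + 1)   # atomic increment, no lock
assert t.commit()
```

The programmer marks a region atomic and retries on abort; there is no lock ordering to reason about, which is the simplicity argument made above.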

Low-power Microprocessor Design

We have been developing techniques that combine new circuit designs and microarchitectural algorithms to reduce both switching and leakage power in components that dominate energy consumption, including flip-flops, caches, datapaths, and register files.

Energy-Exposed Instruction Sets

Modern ISAs, such as RISC or VLIW, expose to software only those properties of the implementation that affect performance. In this project we are developing new energy-exposed hardware-software interfaces that also allow software to have fine-grain control over energy consumption.

Mondriaan Memory Protection

Mondriaan memory protection (MMP) is a fine-grained protection scheme that allows multiple protection domains to flexibly share memory and export protected services. In contrast to earlier page-based systems, MMP allows arbitrary permissions control at the granularity of individual words.
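The contrast with page-based protection can be modeled in a few lines (illustrative Python with a hypothetical permission encoding; MMP's actual design uses compressed hardware permission tables and a lookaside buffer): each protection domain maps individual word addresses to permission bits, so two domains can share one word with different rights:

```python
# Illustrative model of word-granularity protection in the MMP style
# (hypothetical encoding, not MMP's actual hardware table format).
NONE, READ, WRITE = 0, 1, 2

class PermTable:
    def __init__(self):
        self.perms = {}                       # (domain, word_addr) -> bits

    def grant(self, domain, addr, bits):
        self.perms[(domain, addr)] = bits

    def check(self, domain, addr, need):
        """True iff `domain` holds all permission bits in `need` for `addr`."""
        return self.perms.get((domain, addr), NONE) & need == need

pt = PermTable()
pt.grant(domain=1, addr=0x1000, bits=READ | WRITE)  # owner: full access
pt.grant(domain=2, addr=0x1000, bits=READ)          # same word, read-only
pt.check(2, 0x1000, READ)    # -> True
pt.check(2, 0x1000, WRITE)   # -> False
```

A page-based scheme could only express this by placing the word on its own page; word granularity lets a protected service export exactly the data it intends to share.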

Highly Parallel Memory Systems

We are investigating techniques for building high-performance, low-power memory subsystems for highly parallel architectures.

Mobile Computing Systems

Within the context of MIT Project Oxygen, several projects examine the energy and performance of complete mobile wireless systems.

Heads and Tails: Efficient Variable-Length Instruction Encoding

Existing variable-length instruction formats provide higher code densities than fixed-length formats, but are ill-suited to pipelined or parallel instruction fetch and decode. Heads-and-Tails is a new variable-length instruction format that supports parallel fetch and decode of multiple instructions per cycle, allowing both high code density and rapid execution for high-performance embedded processors.
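A simplified sketch of the layout idea (illustrative Python; field sizes and the exact published encoding differ) shows why parallel decode works: each fixed-size bundle packs fixed-size heads from the front, so head i is always at a known offset, while variable-size tails grow from the back:

```python
# Simplified heads-and-tails layout (illustrative, not the published format):
# fixed-size heads packed from the front of a bundle, variable-size tails
# packed from the back.
def pack_bundle(insns, size=16):
    """insns: list of (head_byte, tail_bytes). Returns a bytearray bundle."""
    bundle = bytearray(size)
    tail_end = size
    for i, (head, tail) in enumerate(insns):
        bundle[i] = head                              # heads: fixed front slots
        tail_end -= len(tail)
        bundle[tail_end:tail_end + len(tail)] = tail  # tails: grow backward
    return bundle

def unpack_heads(bundle, n):
    # Parallel-friendly: head i is always at byte offset i, so n heads
    # can be fetched and decoded simultaneously.
    return [bundle[i] for i in range(n)]

b = pack_bundle([(0xA1, b"\x01"), (0xB2, b""), (0xC3, b"\x02\x03")])
unpack_heads(b, 3)   # -> [0xA1, 0xB2, 0xC3]
```

The heads carry enough information to begin decode; the variable tails supply the rest, preserving the density of a variable-length format.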

Early Projects

IRAM: Intelligent RAM (1997-2002)

The Berkeley IRAM project sought to understand the entire spectrum of issues involved in designing general-purpose computer systems that integrate a processor and DRAM onto a single chip - from circuits, VLSI design and architectures to compilers and operating systems.

PHiPAC: Portable High-Performance ANSI C (1994-1997)

PHiPAC was the first autotuning project, automatically generating a high-performance general matrix-multiply (GEMM) routine by using parameterized code generators and empirical search to produce fast code for any platform. Autotuners are now standard in high-performance library development.
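A toy autotuner in the PHiPAC spirit (illustrative Python, not PHiPAC's actual generators, which emit specialized ANSI C) shows the loop: generate blocked matrix-multiply variants over a parameter space of block sizes, time each on the target machine, and keep the fastest:

```python
# Toy autotuner in the PHiPAC spirit (illustrative, not PHiPAC's code):
# parameterized blocked GEMM + empirical search over block sizes.
import time

def blocked_gemm(A, B, n, bs):
    """C = A * B for n x n matrices, with bs x bs cache blocking."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C

def autotune(n, candidates=(4, 8, 16, 32)):
    """Empirical search: just run each variant and measure."""
    A = [[1.0] * n for _ in range(n)]
    B = [[1.0] * n for _ in range(n)]
    best_bs, best_t = None, float("inf")
    for bs in candidates:
        t0 = time.perf_counter()
        blocked_gemm(A, B, n, bs)
        t = time.perf_counter() - t0
        if t < best_t:
            best_bs, best_t = bs, t
    return best_bs

best = autotune(64)   # machine-dependent: whichever block size ran fastest
```

The key insight, now standard practice, is that no single block size is best everywhere, so measuring on the actual machine beats modeling it.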

The T0 Vector Microprocessor (1992-1998)

T0 (Torrent-0) was the first single-chip vector microprocessor. T0 was designed for multimedia, human-interface, neural network, and other digital signal processing tasks. T0 includes a MIPS-II compatible 32-bit integer RISC core, a 1KB instruction cache, a high-performance fixed-point vector coprocessor, a 128-bit wide external memory interface, and a byte-serial host interface. T0 formed the basis of the SPERT-II workstation accelerator.

SPACE: Symbolic Processing in Associative Computing Elements (1987-1992)

In the PADMAVATI prototype system, a hierarchy of packaging technologies cascades multiple SPACE chips to form an associative processor array with 170,496 36-bit processors. Primary applications for SPACE are AI algorithms that require fast searching and processing within large, rapidly changing data structures.

[Krste]
Professor
Computer Science Division
EECS Department
579 Soda Hall, MC #1776
University of California
Berkeley, CA 94720-1776
email: krste at berkeley dot edu
(I don't do social networks, so please don't ask.)
phone: 510-642-6506 (don't phone, use email!)
fax: 510-643-1534
office hours: Wednesdays 3-4pm
579 Soda Hall
(email to confirm)
Administrative Support:
Roxana Infante
563 Soda Hall
phone: 510-643-1455
email: parlab-admin at eecs dot berkeley dot edu

Tammy Johnson
565 Soda Hall
phone: 510-643-4816
email: parlab-admin at eecs dot berkeley dot edu
Grant Administrator:
Lauren Mitchell
617 Soda Hall
phone: 510-642-3417
email: lbailey at cs dot berkeley dot edu