CS262a: Fall 2023 Final Projects

This page will contain final CS262a projects for Fall of 2023. These projects are done in groups of two or three (maximum of four for undergraduates) and span a wide range of topics.

Here are some project suggestions for this year (2023). (Last year's projects are HERE.)


  
1:   Enhancing Data Capsule: A Multi-Credential POSIX-Compliant File System for Secure and Verifiable Data Storage Across Networks
 Qingyang Hu and Yucheng Huang and Manshi Yang
Data Capsule is a secure and distributed storage system similar to blockchains. Many specific applications have been proposed, for example CapsuleDB, a key-value database built upon Data Capsules. However, Data Capsules remain underexplored as generic file-system support for multiple users that ensures both data provenance and integrity. In this work, our objective is to offer a broader and more comprehensive utilization of Data Capsules through the creation of a POSIX-compliant file system that enables multiple credentials for generic storage. Users can use the interface for regular file-system operations while preserving the security and integrity of their data.
Final Project Report: Paper (pdf)
Supporting Documentation: Proposal Poster (pdf)
  
2:   A Secure Dynamically Built Multicast Tree for the Global Data Plane
 Arun Sundaresan and Mikkel Svartveit and Tony Hong
The Global Data Plane (GDP) provides a secure networking architecture for distributed computing on the edge. We propose a protocol for enabling performant and scalable multicast tree building and message sending over the Global Data Plane. In contrast to IP, the Global Data Plane does not use prefix-based routing. We instead utilize a hierarchy of routing information bases (RIBs) to enable efficient and scalable tree building without being bottlenecked by a single router. Our multicast tree building algorithm ensures that message sending is kept as local as possible, avoiding congestion in higher levels of the RIB hierarchy.
Final Project Report: Paper (pdf)
Supporting Documentation: Proposal Poster (pdf)
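
To make the tree-building idea concrete, here is a toy Python model of the graft operation (entirely illustrative; the actual protocol, message formats, and security machinery are in the paper): each RIB maps a multicast group to the set of children it forwards to, and a join walks up the hierarchy only until it reaches a RIB that already carries the group.

    def join(rib_path, group, member):
        # Walk from the leaf RIB toward the root; register the group at each
        # level until one already carries it, then graft and stop. Traffic
        # for the group never climbs above the lowest common ancestor.
        for rib in rib_path:
            already_present = group in rib
            rib.setdefault(group, set()).add(member)
            if already_present:
                return
            member = id(rib)  # the parent forwards down to this whole RIB

    # Two leaves under a shared regional RIB and a global root:
    leaf_a, leaf_b, region, root = {}, {}, {}, {}
    join([leaf_a, region, root], "group1", "sender")
    join([leaf_b, region, root], "group1", "receiver")  # stops at 'region'
    assert len(root["group1"]) == 1  # the root never heard about leaf_b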
  
3:   Implementing and Understanding Performance of Fino
 Siddhant Sharma and Chris Liu with Neil Giridharan
BFT-SMR traditionally suffers from low throughput and high message-communication overhead. However, recent research has advanced DAG-based BFT protocols that decouple reliable message broadcast from transaction ordering, achieving high throughput at the cost of additional latency. Fino, by Malkhi and Szalachowski, is a novel DAG-BFT protocol that emphasizes simplicity while aiming for high performance in both throughput and latency. Protocols such as Narwhal-Bullshark explore this tradeoff space as well, achieving impressive performance in the happy case. Under network latency or Byzantine leaders, however, Bullshark can face high latency, as nodes are often blocked awaiting progress. Fino instead integrates timeouts into its messages to promote view changes, raising questions about how the two protocols compare under various workloads. In this paper, we implement Fino on top of the Narwhal broadcast layer. We observe how this affects performance by running benchmarks against other DAG-BFT protocols across various scenarios and workloads. Furthermore, we develop an end-to-end evaluation framework to provide insight into the holistic performance of consensus protocols when applied to applications with transaction execution; we find this reveals more nuanced behavior than would be visible from maximum-throughput and minimum-latency numbers alone.
Final Project Report: Paper (pdf)
Supporting Documentation: Proposal Poster (pdf)
  
4:   Model Serving with Spot GPUs in the Sky
 Tyler Griggs and Ziming Mao
GPUs are expensive and hard to provision, which in turn makes ML model serving expensive and difficult to manage. Provisioning with reserved or on-demand instances guarantees SLOs but is expensive, especially for difficult-to-predict transient traffic. Many cloud providers offer spot GPU instances, which can be 3x cheaper than on-demand GPU instances but can be preempted at any time, potentially at the cost of missed SLOs. In this work, we will design GPU allocation policies and implement them on top of the existing SkyPilot infrastructure, with the goal of reducing ML model serving cost while maintaining a given SLO. We will do this by mixing instance types (on-demand and spot, a variety of accelerators), creating dynamic spot placement strategies, proactively migrating workloads to cheaper and/or more available instances, and other methods.
Final Project Report: Paper (pdf)
Supporting Documentation: Proposal Poster (pdf)
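
As a deliberately simplified illustration of one point in this policy space (all prices, ratios, and function names here are hypothetical, not SkyPilot's actual policy), the sketch below keeps an on-demand floor sized to the SLO and backfills with on-demand capacity on spot preemption:

    def provision(demand, spot_price, od_price, od_floor=0.3):
        # Keep a guaranteed on-demand floor for the SLO; fill the rest with
        # spot instances, which are ~3x cheaper but preemptible.
        od = max(1, round(demand * od_floor))
        spot = demand - od
        return od, spot, od * od_price + spot * spot_price

    def on_preemption(od, spot, lost):
        # Backfill preempted spot capacity with on-demand so the SLO holds
        # while replacement spot instances are requested.
        return od + lost, spot - lost

    od, spot, cost = provision(demand=10, spot_price=1.0, od_price=3.0)
    print(od, spot, cost)   # 3 on-demand, 7 spot, cost 16 vs 30 all-on-demand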
  
5:   Randomized SVD for Serverless Systems
 YenHsiang Chang and Gabriel Raulet
High-performance linear algebra kernels are a necessary component of many scientific and data-analytic workloads. While researchers develop ever-faster algorithms to run on specialized hardware and clusters, many scientists and data analysts outside the HPC community still have difficulty utilizing these advances in their own work. The relatively recent introduction of cloud computing and its many variants offers the potential to close this accessibility gap. Serverless computing is a paradigm in cloud computing that has attracted interest due to its low startup cost, relative ease of configuration, and flexible scalability. Crucially, it lets researchers perform highly scalable and parallel computations cost-effectively without exposing them to the gritty and frustrating issues that come with interacting with a cluster. In this research, we are particularly interested in randomized SVD for serverless systems, since it includes typical BLAS operations such as QR factorizations and matrix multiplications. Previous research showed that such BLAS operations still suffer from high communication cost due to the stateless nature of serverless computing. Recent research proposed a message-passing interface for serverless computing and demonstrated performance comparable to HPC clusters. The goal of this project is to investigate whether such an interface can alleviate the communication costs associated with BLAS operations and achieve good performance for randomized SVD in serverless systems.
Final Project Report: Paper (pdf)
Supporting Documentation: Proposal Poster (pdf)
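
For readers unfamiliar with the kernel in question, here is a minimal single-node numpy sketch of randomized SVD, following the standard Halko-Martinsson-Tropp recipe; the serverless version would shard exactly these matrix multiplications and QR factorizations across stateless workers.

    import numpy as np

    def randomized_svd(A, k, oversample=10, power_iters=2):
        # Sketch the range of A with a random test matrix, then solve a
        # small SVD. The matmuls and QR below are the BLAS operations whose
        # communication cost dominates in a serverless setting.
        m, n = A.shape
        Y = A @ np.random.randn(n, k + oversample)
        for _ in range(power_iters):        # sharpen slowly decaying spectra
            Y = A @ (A.T @ Y)
        Q, _ = np.linalg.qr(Y)              # orthonormal basis for range(A)
        Ub, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
        return (Q @ Ub)[:, :k], s[:k], Vt[:k, :]

    A = np.random.randn(500, 200)
    U, s, Vt = randomized_svd(A, k=10)
    print(np.linalg.norm(A - U * s @ Vt) / np.linalg.norm(A))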
  
6:   Speculative Memory Programming for Secure Computation
 Alice Yeh with Sam Kumar
The high memory overhead associated with secure computation (SC) has been a long-standing obstacle to its adoption in industry. Recent works have presented solutions to this problem, creating systems that produce efficient memory-management plans to execute SC protocols at speeds close to those of machines with unbounded physical memory. However, these systems often require additional developer effort to adopt and have long offline phases, taking multiple passes to plan before executing a protocol. We propose a secure computation execution engine that incorporates speculative memory programming to enable one-pass execution of SC schemes through informed paging. To evaluate the effectiveness and efficiency of our design, we plan to benchmark our system against both existing systems that do not implement memory programming and recent works that do.
Final Project Report: Paper (pdf)
Supporting Documentation: Proposal Poster (pdf)
  
7:   Enabling Scalable Heterogeneous Hardware Integration Co-simulation with Socket IPC
 Zekai Lin and Richard Yan
Amidst the challenges of an increasingly heterogeneous hardware landscape, the integration and evaluation of new hardware intellectual properties (IPs) remain a significant problem, constrained by complexity and inefficiency. In the first half of this paper, we introduce a simple and intuitive abstraction of procedure calling using socket inter-process communication (IPC) across hardware blocks, designed to address these challenges. We present a lightweight implementation in C++ and evaluate various design choices. In the second half of this paper, we perform two in-depth case studies of hardware integration using the proposed library. The first case study showcases a CPU-GPU co-simulation. We demonstrate the modularity of the communication scheme by showing mix-and-match capabilities through combinations of functional and Register-Transfer Level (RTL) simulations, and show up to 3.1x faster simulation with only 5% cycle inaccuracy on large kernels. In the second case study, we examine a many-accelerator co-simulation executing a transformer encoder workload, showcasing fine-grained communication, parallelization capabilities, and simulation scalability. We show that with 8 workers, parallel simulation achieves around 15-35% faster simulation time than an approximated monolithic SoC integration baseline, and that true cycle counts can be predicted with 94.9% accuracy.
Final Project Report: Paper (pdf)
Supporting Documentation: Proposal Poster (pdf)
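
The core abstraction is easiest to see in a toy example. Below is a hypothetical Python stand-in (the actual library is C++): one endpoint emulates a hardware block serving length-prefixed procedure calls over a socket, and the host side issues blocking calls into it.

    import socket, struct, threading

    def hardware_block(conn):
        # Stand-in for a simulated hardware IP: answer each 4-byte request
        # (here: square the operand) as a remote procedure call.
        while True:
            req = conn.recv(4)
            if not req:
                return
            (arg,) = struct.unpack("!i", req)
            conn.sendall(struct.pack("!i", arg * arg))

    host, device = socket.socketpair()    # stands in for cross-process IPC
    threading.Thread(target=hardware_block, args=(device,), daemon=True).start()

    def call(arg):
        # Blocking procedure call from the host simulator into the "hardware".
        host.sendall(struct.pack("!i", arg))
        return struct.unpack("!i", host.recv(4))[0]

    print(call(7))                        # -> 49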
  
8:   Hierarchical Delegated Routing in GDP
 Mira Sharma and Nia Nasiri and Eric Nguyen and Tianyun Yuan
We present a system of hierarchical Routing Information Bases (RIBs) for the GDP, optimizing for reduced advertisement traffic, scaling to large numbers of DataCapsule servers, and increased DataCapsule availability as servers go offline and later reappear. The scalability of the GDP hinges on the scalability of its routing system. Previous work by Arya et al. implements secure delegated routing with a single global RIB, but routing with a hierarchy of RIBs is better suited to massive scale and better models real-world nested trust domains. We address the challenges that arise from the data-centric, location-agnostic guarantees of the GDP, as well as its flat namespace and security focus.
Final Project Report: Paper (pdf)
Supporting Documentation: Proposal Poster (pdf)
  
9:   Private Analytics via Streaming Sketching and Silently Verifiable Proofs
 Yuwen Zhang with Mayank Rathee and Raluca Ada Popa and Henry Corrigan-Gibbs
Imagine that many clients want to upload private data to a server, which wants to compute an aggregate statistic over all of the client data. One solution has the clients secret-share their data, uploading shares to multiple servers, which then use multi-party computation (MPC) to verify the client data and compute the aggregate statistic privately. We propose a method of compiling existing protocols to batch-verify the well-formedness of client data with only two rounds of aggregator communication. This batch-verification property necessitates some interesting changes in our system architecture compared to prior work.
Final Project Report: Paper (pdf)
Supporting Documentation: Proposal Poster (pdf)
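
For readers new to the setting, here is a minimal Python sketch of the underlying two-server additive-sharing pattern (the field size and values are illustrative); the project's contribution, the batched silently verifiable proofs, sits on top of this.

    import secrets

    P = 2**61 - 1                         # illustrative prime modulus

    def share(x):
        # Split a private value into two additive shares modulo P; either
        # share alone is uniformly random and reveals nothing.
        r = secrets.randbelow(P)
        return r, (x - r) % P

    client_values = [5, 17, 42]           # each client's private input
    shares0, shares1 = zip(*(share(x) for x in client_values))

    # Each server sums only the shares it holds...
    agg0, agg1 = sum(shares0) % P, sum(shares1) % P

    # ...and combining the two aggregates reveals the sum, not the inputs.
    assert (agg0 + agg1) % P == sum(client_values) % P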
  
10:   Optimizing KV Cache Locality for Faster Sequential and Parallel Language Model Inference
 Nikhil Jha and Kevin Wang
As large language models (LLMs) continue to grow in popularity, scaling inference to meet demand is becoming an increasingly relevant problem. We propose three techniques in this paper to improve LLM scalability: a scheduling algorithm for caching contexts that are preserved between runs, a cache-aware query design to interface with language model APIs, and a method of running multi-query attention that takes better advantage of data locality on the GPU. We hope that these three changes will improve performance on a wide variety of LLM inference workloads, including parallel runs on independent inputs, sequential as well as parallel runs on related inputs, and any combination of the above.
Final Project Report: Paper (pdf)
Supporting Documentation: Proposal Poster (pdf)
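
A toy version of the first idea (purely illustrative; the paper's scheduler is more sophisticated): group queries by the context they share, so each context's KV cache is built once and stays resident while all of its queries run.

    from collections import defaultdict

    def cache_aware_schedule(queries):
        # Batch queries sharing a context back-to-back so its KV cache is
        # computed once and reused instead of evicted between interleaved runs.
        by_context = defaultdict(list)
        for context, prompt in queries:
            by_context[context].append(prompt)
        # Largest groups first: the biggest caches get reused the most.
        return sorted(by_context.items(), key=lambda kv: -len(kv[1]))

    queries = [("docA", "q1"), ("docB", "q2"), ("docA", "q3"), ("docA", "q4")]
    for context, batch in cache_aware_schedule(queries):
        print(context, batch)             # docA ['q1','q3','q4'], then docB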
  
11:   Dynamic LoRA Serving System for Offline Context Learning
 Sijun Tan and Shiyi Cao with Ying Sheng and Dacheng Li
The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. In this project, we address two problems: (1) how to derive LoRA adapters from long-context documents so that we can have tailored fine-tuned adapters per user, and (2) how to design a system for scalable serving of many LoRA adapters concurrently. To address (1), we propose offline context learning, a novel strategy to memorize long documents through parameter-efficient finetuning. Our evaluation on the QuALITY QA dataset shows that our proposed method outperforms in-context learning while processing over 32x fewer tokens during inference. To resolve (2), we propose S-LoRA, a serving system scalable to thousands of LoRA adapters on a single GPU. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support for LoRA serving), S-LoRA improves throughput by up to 4x and increases the number of served adapters by several orders of magnitude.
Final Project Report: Paper (pdf)
Supporting Documentation: Proposal Poster (pdf)
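
For context, the arithmetic that makes serving thousands of adapters plausible is small. Here is a numpy sketch of a LoRA-adapted layer (the dimensions and scaling are typical choices, not the project's exact configuration):

    import numpy as np

    d, r, alpha = 1024, 8, 16             # hidden size, adapter rank, scaling
    W = np.random.randn(d, d)             # frozen base weight, shared by all
    A = np.random.randn(r, d) * 0.01      # per-adapter trainable, r x d
    B = np.zeros((d, r))                  # per-adapter trainable, d x r

    def adapted_forward(x):
        # Base path plus low-rank update: W x + (alpha / r) * B (A x).
        return W @ x + (alpha / r) * (B @ (A @ x))

    print(adapted_forward(np.random.randn(d)).shape)   # (1024,)
    # Each adapter adds 2*d*r ~= 16K parameters versus d*d ~= 1M for the
    # base layer, which is why one GPU can hold the base model plus
    # thousands of adapters.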
  
12:   Unified and Secure Storage Plane with DataCapsules
 Ted Lin and Samuel Berkun and Saketh Malyala
DataCapsules are cryptographically hardened data containers that consist of records linked by secure hashes to form a DAG. We will design and implement replicated DataCapsule servers, focusing on efficient integrity-proof generation (rather than naively signing and verifying every record) along with a novel anti-entropy algorithm to patch the "holes" and "branches" caused by network and server crashes. Our evaluation will include: measuring the performance overhead of our system compared to existing non-cryptographically-hardened storage solutions; exploring how application-specific record structures affect performance (particularly with respect to integrity-proof generation); and measuring how efficiently our anti-entropy algorithm reaches convergence compared to existing algorithms.
Final Project Report: Paper (pdf)
Supporting Documentation: Proposal Poster (pdf)
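
A minimal sketch of the record structure in question (hash links only; real DataCapsule records are also signed, and the replication and anti-entropy protocols are the substance of the project):

    import hashlib, json

    store = {}

    def append(payload, prev_hashes):
        # A record holds a payload plus hash pointers to its parents, so the
        # records form a tamper-evident DAG.
        body = {"payload": payload, "prev": sorted(prev_hashes)}
        h = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        store[h] = body
        return h

    def verify(h):
        # Recompute hashes transitively: modifying any ancestor breaks them.
        body = store[h]
        blob = json.dumps(body, sort_keys=True).encode()
        assert h == hashlib.sha256(blob).hexdigest()
        for parent in body["prev"]:
            verify(parent)

    g = append("genesis", [])
    a, b = append("write A", [g]), append("write B", [g])  # a branch
    verify(append("merge", [a, b]))       # a merge record heals the branch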
  
13:   A Protobuf Accelerator Architecture with GC Language Support
 Ethan Wu and Viansa Schmulbach with Sagar Karandikar
Numerous works have explored offloading serialization and deserialization from the CPU to specialized hardware accelerators to cut down on the "datacenter tax" of communication between components of a modern cloud software stack. For example, Karandikar et al.'s accelerator for ProtoBuf, targeting C++, demonstrated a 6.9x speedup in deserialization and a 4.5x speedup in serialization compared to a CPU software approach. However, no such accelerator has multi-language support. For our project, we propose a new accelerator architecture that extends support to garbage-collected languages, so that this hardware acceleration can be leveraged more easily by existing codebases. The project requires a hardware-software co-design that introduces a new serialization accelerator architecture with multi-language support and integration with the software's garbage collector and memory allocator. To do this, we will build on the accelerator developed by Karandikar et al. We aim to develop a generalized, language-interoperable accelerator that can directly serialize and deserialize for multiple host languages, while still communicating over a standard format with existing software that does not use the accelerator.
Final Project Report: Paper (pdf)
Supporting Documentation: Proposal Poster (pdf)
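
To see why this is worth silicon, consider the innermost loop of ProtoBuf wire handling: base-128 varints, whose byte-serial data dependence is awkward for CPUs. A Python sketch of the standard wire format (the format is public; everything about the accelerator itself is in the paper):

    def encode_varint(n):
        # Protobuf varint: 7 payload bits per byte, MSB set while more follow.
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            out.append(byte | (0x80 if n else 0))
            if not n:
                return bytes(out)

    def decode_varint(buf, pos=0):
        # Each byte depends on the previous byte's MSB: the serial chain
        # that software decoders are stuck with and accelerators pipeline.
        result = shift = 0
        while True:
            b = buf[pos]
            pos += 1
            result |= (b & 0x7F) << shift
            if not b & 0x80:
                return result, pos
            shift += 7

    assert encode_varint(300) == b"\xac\x02"
    assert decode_varint(b"\xac\x02") == (300, 2)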
  
14:   Quality-of-Service Aware LLM Serving
 Rithvik Chuppala and Siddharth Jha
Large language models (LLMs) are increasingly used as core services within applications such as chat, browsers, coding tools, and more. Furthermore, the size of popular LLMs has grown significantly over the last few years, with GPT-3 having 175B parameters. With such model sizes, achieving the latency and job-completion guarantees desired by applications becomes impossible during periods of high request rates. To serve requests during high arrival rates, we propose a best-effort system that trades quality for latency while ensuring that client requirements are met to the best of the system's ability. To this end, we explore QoS at the network level as well as trade-offs in model complexity. On the networking side, we utilize QUIC, a transport-level protocol built on top of best-effort UDP. QUIC provides a wide range of functionality, but we specifically leverage its ability to multiplex several streams of data over a single point-to-point connection. By encapsulating model output in a QUIC stream, we can schedule the responses, effectively providing QoS to the stream of client requests. We explore the interaction between various schedulers and application requirements to determine the best-suited scheduler for modern model-serving needs. On the model-complexity side, we take advantage of the fact that the quality gap between smaller and larger models is much smaller for certain tasks than for other, more difficult tasks. We use Q-learning to learn a policy that maps each query to a specific model in order to maximize quality while meeting latency requirements. The methods we develop are largely independent of the specific model replication and partitioning strategy.
Final Project Report: Paper (pdf)
Supporting Documentation: Proposal Poster (pdf)
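
A much-simplified sketch of the model-selection half (a bandit-style reduction of the Q-learning policy described above; the model names, reward shaping, and hyperparameters are all illustrative):

    import random
    from collections import defaultdict

    MODELS = ["small", "large"]           # hypothetical model tiers
    Q = defaultdict(float)                # value of (query_class, model)

    def route(query_class, eps=0.1):
        # Epsilon-greedy: usually pick the highest-value model, sometimes explore.
        if random.random() < eps:
            return random.choice(MODELS)
        return max(MODELS, key=lambda m: Q[(query_class, m)])

    def update(query_class, model, quality, latency, deadline, lr=0.1):
        # Reward answer quality, but only when the latency requirement is met.
        reward = quality if latency <= deadline else 0.0
        Q[(query_class, model)] += lr * (reward - Q[(query_class, model)])

    m = route("short-summary")
    update("short-summary", m, quality=0.9, latency=0.4, deadline=1.0)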
  
15:   Faster Distributed LLM Inference with Dynamic Partitioning
 Isaac Ong with Woosuk Kwon
In light of the rapidly increasing size of large language models (LLMs), this work addresses the challenge of serving LLMs efficiently given the limitations of modern GPU memory. We observe that LLM inference is unique compared to other models due to the wide variation in input lengths, a factor not adequately addressed by existing work. Current inference engines typically employ a static partitioning strategy, which is sub-optimal given the variability in input lengths and the diversity of GPU specifications. To overcome these challenges, we propose a dynamic partitioning strategy for distributed LLM inference, which switches between different partitioning strategies at inference time, optimizing for both GPU characteristics and input length. We systematically search for all Pareto-optimal partitioning strategies for distributed Transformer inference, focusing on their computational requirements, communication overhead, and memory demands. Based on this search, we identify three Pareto-optimal strategies that cater to different scenarios and implement an inference engine for dynamic partitioning. Our evaluation, conducted on NVIDIA L4 and A100 GPUs using the Llama 2 family of models, demonstrates significant improvements over existing approaches. We show notable reductions in time to first token and latency, underlining the effectiveness of dynamic partitioning. Our findings pave the way for more efficient utilization of GPU resources in distributed LLM inference, accommodating the evolving landscape of model sizes and architectures.
Final Project Report: Paper (pdf)
Supporting Documentation: Proposal Poster (pdf)
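
The switching decision itself reduces to a small cost-model lookup at inference time. Here is a sketch with made-up strategy names and coefficients (the paper derives the real strategies and their costs):

    def pick_strategy(seq_len, strategies):
        # Choose the partitioning with the lowest modeled time for this input.
        return min(strategies, key=lambda s: s["cost"](seq_len))

    strategies = [
        # splits compute finely but pays a fixed communication cost per layer
        {"name": "strategy-A", "cost": lambda n: 1.0 * n + 40.0},
        # little communication, but compute is split less finely
        {"name": "strategy-B", "cost": lambda n: 1.8 * n + 5.0},
    ]

    for n in (8, 512):
        print(n, pick_strategy(n, strategies)["name"])
    # short inputs favor the low-overhead strategy; long inputs amortize
    # the communication and favor the finer split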
  
16:   Choosing Neural Network Compression Methods for Reinforcement Learning
 Kaushik Singh and Gaurav Bhatnagar and Andrew Kim and Nidhir Guggilla
The expansion of reinforcement learning into complex tasks like human-level language generation and autonomous driving has greatly increased the computational power required. This very quickly becomes expensive and inefficient to run on servers, and unwieldy if not impossible to run on edge devices. Quantization reduces the number of bits required to represent a number at the cost of range or precision, while sparsification identifies and functionally removes nodes whose removal minimally impacts the final performance of the model. Fine-tuning is a machine learning approach to increase performance in the latter stages of training for specific tasks. We intend to experiment with these compression methods on previously trained, high-performing models during fine-tuning, making them more lightweight while giving them the opportunity to adjust to lower expressivity. One aspect of our project will be a review of relevant literature and comparison between (previously discussed and/or novel) methodologies across various metrics through simulation. This will ideally inform the second aspect of our project: the implementation of our chosen compression algorithms through a novel interface designed to take advantage of the reduced model representation.
Final Project Report: Paper (pdf)
Supporting Documentation: Proposal Poster (pdf)
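
For concreteness, here are minimal numpy versions of the two compression primitives discussed above (textbook forms, not the project's tuned implementations):

    import numpy as np

    def quantize_uniform(w, bits=8):
        # Symmetric uniform quantization: map floats onto 2^bits integer
        # levels; dequantize with q * scale.
        scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
        return np.round(w / scale).astype(np.int8), scale

    def magnitude_prune(w, sparsity=0.5):
        # Unstructured sparsification: zero the smallest-magnitude weights.
        threshold = np.quantile(np.abs(w), sparsity)
        return np.where(np.abs(w) < threshold, 0.0, w)

    w = np.random.randn(4, 4).astype(np.float32)
    q, scale = quantize_uniform(w)
    print("max quantization error:", np.max(np.abs(w - q * scale)))
    print("pruned weights:\n", magnitude_prune(w))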
  
17:   SkyStore: Unified Storage Across Clouds
 Shaopu Song and Junhao Hu
Using resources and services from multiple clouds is a natural evolution from consuming those of a single cloud. Given the evolving nature of data access patterns, continuous monitoring and migration to the most suitable cloud environment become imperative. However, this multi-cloud, multi-region approach introduces significant challenges in data management, including latency issues, increased operational costs, and the complexity of interfacing with diverse storage APIs from various cloud vendors. We aim to support data storage and migration across multi-cloud environments through a library-based approach that optimizes cost, latency, and throughput. We will build SkyStore, a multi-cloud object-store overlay that automatically optimizes data locality to minimize cost and maximize performance. It currently supports S3, Azure Blob, and GCS; users interact with it through the S3 API. SkyStore automatically places data in the cloud provider and region that optimize performance and cost, abstracting away the complexities of the underlying storage infrastructure. It manages data operations end to end, including demand-driven data placement, efficient data movement, and robust consistency management.
Final Project Report: Paper (pdf)
Supporting Documentation: Proposal Poster (pdf)
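
One demand-driven placement rule, reduced to a toy Python form (the prices and regions are hypothetical; SkyStore's actual policy is in the paper): replicate an object into a region once the egress charges it would save exceed the added storage cost.

    def place(home_region, reads_by_region, storage_cost, egress_per_read):
        # Replicate into a region when projected egress savings beat the
        # extra storage cost of another copy.
        placements = {home_region}
        for region, reads in reads_by_region.items():
            if region not in placements and reads * egress_per_read > storage_cost:
                placements.add(region)    # demand-driven replication
        return placements

    print(place("aws:us-east-1",
                {"gcp:us-central1": 500, "azure:westeurope": 2},
                storage_cost=10.0, egress_per_read=0.09))
    # replicates to gcp:us-central1 (500 reads justify a copy), not to azure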
  
18:   Optimizing Distributed Reinforcement Learning with Reactor Model and Lingua Franca
 Pranav Atreya with Jacky Kwok and Amil Dravid
Distributed reinforcement learning (RL) is a critical field that involves both advancing reinforcement learning algorithms to operate in a parallelized fashion and optimizing the distributed compute infrastructure to efficiently run these algorithms. For challenging simulation-based learning problems, optimizing the use of compute infrastructure to collect simulation data and learn the policies is necessary to make the learning problem tractable. We hypothesize that the reactor model of concurrency can outperform the traditional actor models used in distributed reinforcement learning frameworks (e.g., Ray). In prior work, the reactor model has been shown to surpass the performance of a traditional actor framework by a factor of 1.86x. More specifically, the reactor model, with a fixed set of reactors and fixed communication patterns, allows the scheduler to eliminate work needed for synchronization (e.g., acquiring and releasing locks for each reactor, or sending and processing coordination-related messages between environment and policy). In this project we develop novel algorithms for making use of the reactor model within the context of distributed reinforcement learning, and demonstrate improved throughput and performance with an in-house framework and reimplementations of several state-of-the-art RL algorithms.
Final Project Report: Paper (pdf)
Supporting Documentation: Proposal Poster (pdf)

Back to CS262a page
Maintained by John Kubiatowicz (kubitron@cs.berkeley.edu).
Last modified Sun Jan 7 11:50:10 2024