Haoyuan Li
I'm a second-year Computer Science PhD student in the AMP Lab at UC Berkeley, interested in computer systems and cloud computing. My advisors are Scott Shenker and Ion Stoica. Before Berkeley, I studied at Cornell University and Peking University, and also worked at Conviva and Google.
You can contact me at haoyuan@cs.berkeley.edu [Github] [LinkedIn] [Twitter]
Projects
I focus on systems and algorithms for large-scale data-intensive computing. Below is a list of open-source projects that I contribute to:
Tachyon: A fault-tolerant distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce. [Github]
Spark Streaming: Spark Streaming offers a high-level functional programming API, strong consistency, and efficient fault recovery. It supports a new recovery mechanism, parallel recovery of lost state across the cluster, that improves efficiency over the traditional replication and upstream backup solutions. It is now part of Spark, which lets users seamlessly intermix streaming, batch, and interactive queries. [Short Paper] [Tech Report] [Github]
Spark: A cluster computing engine that makes data analytics fast. It provides an efficient abstraction for distributed in-memory computation. Besides the streaming part, I worked on the initial version of the current Storage Manager. [Github]
Shark: A high-speed query engine that runs Hive SQL queries on top of Spark and supports fault recovery and complex analytics (e.g. machine learning). I contributed to the integration with Tachyon. [Github]
Parallel Frequent Pattern Mining: Frequent itemset mining is a useful tool for discovering frequently co-occurring items. Various algorithms have been developed to speed up mining performance. However, when the dataset is huge, both the memory use and the computational cost can still be prohibitively expensive. We designed a parallel FP-Growth algorithm, which is now part of Apache Mahout. [Paper]
Mesos and YARN: Both Mesos and YARN are cluster resource managers. I ported YARN to run on top of Mesos.
Publications
- Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing, Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. UCB EECS Tech Report 2012, December 2012.
- Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters, Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, and Ion Stoica. HotCloud 2012, June 2012.
- Tradeoffs in CDN designs for throughput oriented traffic, Minlan Yu, Wenjie Jiang, Haoyuan Li, and Ion Stoica. CoNEXT 2012, December 2012.
- Quilt: A Patchwork of Multicast Regions, Qi Huang, Ken Birman, Ymir Vigfusson, and Haoyuan Li. DEBS 2010, July 2010.
- Dr. Multicast: Rx for Data Center Communication Scalability, Ymir Vigfusson, Hussam Abu-Libdeh, Mahesh Balakrishnan, Ken Birman, Robert Burgess, Haoyuan Li, Gregory Chockler, and Yoav Tock. EuroSys 2010, April 2010.
- From Declarative Languages to Declarative Processing in Computer Games, Ben Sowell, Alan Demers, Johannes Gehrke, Nitin Gupta, Haoyuan Li, and Walker White. CIDR 2009, January 2009.
- PFP: Parallel FP-Growth for Query Recommendation, Haoyuan Li, Yi Wang, Dong Zhang, Ming Zhang, and Edward Chang. RecSys 2008, October 2008.
Talks
- Tachyon: Reliable File Sharing at Memory-Speed Across Cluster Frameworks
  - Google Ventures, Mountain View, May 2013
  - AMPLab Retreat, Lake Tahoe, January 2013
- Spark Streaming: Fault-Tolerant Stream Processing at Scale
  - Yelp, San Francisco, June 2012
Selected Awards
Olin Fellowship, IBM Fellowship (twice), Morgan Stanley Fellowship, Beijing Outstanding Graduates, Chinese National Fellowship, Innovation Award at Peking University, Pacemaker to Outstanding Students at Peking University (three times), General Electric Fellowship, No. 11 and No. 13 in the ACM-ICPC World Finals 2005 and 2006, No. 8 in the Google Code Jam China Final.