Matei Zaharia

I'm a Ph.D. student in the UC Berkeley AMP Lab, interested in computer systems, networking, and cloud computing. My advisors are Scott Shenker and Ion Stoica. I'm supported by a Google Ph.D. fellowship.

Before joining Berkeley, I got my Bachelor's degree from the University of Waterloo, in Canada, where I worked with Srinivasan Keshav.

You can contact me at matei@berkeley.edu or find me in Soda 493B.

Projects

I focus on systems and algorithms for large-scale data-intensive computing. My projects include:

Spark: As big data analytics evolves beyond simple batch jobs, there is a need for both more complex multi-stage applications (e.g. machine learning algorithms) and more interactive ad-hoc queries. Spark provides efficient and fault-tolerant primitives for in-memory cluster computing, and can run 30x faster than Hadoop MapReduce for these applications. (homepage) (short paper) (tech report)

Mesos: Clusters are running increasingly diverse applications, from batch jobs to interactive services. Mesos is a cluster manager that efficiently supports diverse applications by letting them control their own scheduling. The project is open source in the Apache Incubator. (homepage) (NSDI'11 paper)

Multi-Resource Fairness: Life is not fair, but with a little help, your computer system can be, which ensures predictable time-sharing between users. However, past work on fair sharing considered a single resource (e.g., CPU), while datacenter applications have demands across multiple resources (memory, I/O, CPU, etc.). Dominant resource fairness generalizes max-min fairness to this case. (NSDI'11 paper)

MapReduce Scheduling: I've worked on several scheduling algorithms for MapReduce, including the LATE algorithm for straggler mitigation (OSDI'08) and delay scheduling for data locality (EuroSys'10). Both algorithms are now included in Hadoop. I also developed the Hadoop Fair Scheduler.

SNAP Sequence Aligner: I'm working with colleagues from Microsoft and UCSF on SNAP, a sequence alignment algorithm that is 10-100x faster than current tools and simultaneously more accurate, to handle the growing volume of data from high-throughput DNA sequencers. (arXiv paper)
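
To make the Multi-Resource Fairness idea above concrete, here is a minimal sketch of one way to allocate by dominant share. It is an illustration under stated assumptions, not the implementation used in any of the systems above: the function name drf_allocate, the cluster size, and the per-task demands are all hypothetical, and the loop simply keeps giving one task to whichever user currently has the smallest dominant share (the largest fraction of any single resource that user holds).

```python
from fractions import Fraction


def drf_allocate(capacity, demands):
    """Illustrative dominant-share allocation (hypothetical helper).

    capacity: total amount of each resource, e.g. {"cpu": 9, "mem": 18}
    demands:  per-task demand of each user, e.g. {"A": {"cpu": 1, "mem": 4}}
    Returns the number of tasks granted to each user.
    """
    used = {r: 0 for r in capacity}
    tasks = {u: 0 for u in demands}
    active = set(demands)

    def dominant_share(user):
        # Largest fraction of any one resource currently held by this user.
        # Fractions keep ties exact instead of relying on float rounding.
        return max(
            Fraction(tasks[user] * demands[user][r], capacity[r])
            for r in capacity
        )

    while active:
        # Pick the user with the smallest dominant share.
        # Sorting first makes tie-breaking deterministic (by user name).
        user = min(sorted(active), key=dominant_share)
        need = demands[user]
        if all(used[r] + need[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need[r]
            tasks[user] += 1
        else:
            # Assumption of this sketch: a user whose next task no longer
            # fits is dropped, and the remaining users keep being served.
            active.remove(user)
    return tasks


if __name__ == "__main__":
    cluster = {"cpu": 9, "mem": 18}
    per_task = {"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}}
    print(drf_allocate(cluster, per_task))
```

With these illustrative numbers, user A is memory-bound and user B is CPU-bound, and the loop ends with both holding an equal dominant share of two thirds. With a single resource the same loop reduces to ordinary max-min fair sharing.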

Publications

2012

2011

2010

Earlier

Technical Reports

Talks

Open Source

Almost all of my work is open source. The LATE algorithm for straggler mitigation and the Hadoop Fair Scheduler are part of Apache Hadoop, and I continue to contribute to Hadoop as a committer. Mesos and Spark are both available on GitHub.

Other Activities

Starting in high school, I've participated in a number of programming contests, including the International Olympiad in Informatics and the ACM International Collegiate Programming Contest. I've now stopped doing contests, but I still love algorithmic and mathematical problems.

In undergrad, I contributed to the open-source real-time strategy game 0 A.D., where I worked on gameplay logic, random map generation, water rendering, and multiplayer networking.

I enjoy reading, nature, and food that is either good or free.