Matei Zaharia
I'm a PhD student in the UC Berkeley AMP Lab, interested in computer systems, networks, and cloud computing. My advisors are Scott Shenker and Ion Stoica. I'm supported by a Google PhD fellowship.
Before joining Berkeley, I worked with Srinivasan Keshav at the University of Waterloo.
You can contact me at matei@berkeley.edu.
Projects
I focus on systems and algorithms for large-scale data-intensive computing. My projects include:
Spark: As big data analytics evolves beyond simple batch jobs, there is a need for both more complex multi-stage applications (e.g. machine learning algorithms) and more interactive ad-hoc queries. Spark provides an efficient abstraction for in-memory cluster computing called Resilient Distributed Datasets, and can run up to 100x faster than Hadoop for these applications. (homepage) (short paper) (NSDI'12 paper)
Shark: This high-speed query engine runs Hive SQL queries on top of Spark up to 100x faster than Hive, and supports fault recovery and complex analytics (e.g. machine learning). (homepage) (tech report)
Mesos: Clusters are running increasingly diverse applications, from batch jobs to interactive services. Mesos is a cluster manager that efficiently supports diverse applications by letting them control their own scheduling. The project is open source in the Apache Incubator. (homepage) (NSDI'11 paper)
Multi-Resource Fairness: Life is not fair, but with a little help, your computer system can be, ensuring predictable time-sharing between users. However, past work on fair sharing considered a single resource (e.g. CPU), while cluster applications have demands across multiple resources (memory, IO, CPU, etc.). Dominant resource fairness generalizes max-min fairness to this case. (NSDI'11) (SIGCOMM'12)
MapReduce Scheduling: I've worked on several scheduling algorithms for MapReduce, including the LATE algorithm for straggler mitigation (OSDI'08) and delay scheduling for data locality (EuroSys'10). Both algorithms are now included in Hadoop. I also developed the Hadoop Fair Scheduler.
SNAP Sequence Aligner: SNAP is a new sequence alignment algorithm designed to tackle the growing volume of genomic data. It is 10-100x faster than current tools and also more accurate. (homepage) (arXiv)
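As a rough illustration of the dominant resource fairness idea mentioned under Multi-Resource Fairness, here is a minimal Python sketch of progressive filling: each step hands one more task to the user whose dominant share (their largest fractional use of any single resource) is currently lowest. The function name, the cluster capacity, and the per-task demands are assumptions made up for this example; this is not code from Mesos or Hadoop, and unlike a real scheduler it simply stops considering a user once their next task no longer fits.

```python
from typing import Dict, List


def drf_allocate(capacity: List[float],
                 demands: Dict[str, List[float]]) -> Dict[str, int]:
    """Return the number of tasks each user gets under a simple DRF sketch.

    capacity: total amount of each resource in the (hypothetical) cluster.
    demands:  per-task demand of each user, in the same resource order.
    """
    used = [0.0] * len(capacity)
    tasks = {user: 0 for user in demands}
    active = set(demands)  # users whose next task might still fit

    def dominant_share(user: str) -> float:
        # Largest fraction of any single resource currently held by the user.
        return max(tasks[user] * d / c
                   for d, c in zip(demands[user], capacity))

    while active:
        # Pick the active user with the lowest dominant share
        # (sorted() makes tie-breaking deterministic).
        user = min(sorted(active), key=dominant_share)
        demand = demands[user]
        if all(u + d <= c for u, d, c in zip(used, demand, capacity)):
            used = [u + d for u, d in zip(used, demand)]
            tasks[user] += 1
        else:
            # The next task does not fit; stop considering this user.
            active.remove(user)
    return tasks


# Assumed example: 9 CPUs and 18 GB of memory shared by two users.
# User A runs tasks needing <1 CPU, 4 GB>; user B needs <3 CPUs, 1 GB>.
allocation = drf_allocate([9, 18], {"A": [1, 4], "B": [3, 1]})
print(allocation)  # {'A': 3, 'B': 2}
```

In this example both users end up with a dominant share of 2/3: A holds 12 of the 18 GB of memory, and B holds 6 of the 9 CPUs.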
Publications
2013
- R. Xin, J. Rosen, M. Zaharia, M. Franklin, S. Shenker, and I. Stoica. Shark: SQL and Rich Analytics at Scale, to appear at SIGMOD 2013.
- A. Ghodsi, M. Zaharia, S. Shenker and I. Stoica. Choosy: Max-Min Fair Sharing for Datacenter Jobs with Constraints, EuroSys 2013, April 2013.
2012
- A. Ghodsi, V. Sekar, M. Zaharia and I. Stoica. Multi-Resource Fair Queueing for Packet Processing, SIGCOMM 2012, August 2012. Best Paper Award.
- M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, I. Stoica. Fast and Interactive Analytics over Hadoop Data with Spark, USENIX ;login:, August 2012.
- M. Zaharia, T. Das, H. Li, S. Shenker and I. Stoica. Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters, HotCloud 2012, June 2012.
- L. Martignoni, P. Poosankam, M. Zaharia, J. Han, S. McCamant, D. Song, V. Paxson, A. Perrig, S. Shenker, I. Stoica. Cloud Terminal: Secure Access to Sensitive Applications from Untrusted Systems, USENIX ATC 2012, June 2012.
- C. Engle, A. Lupher, R. Xin, M. Zaharia, M. Franklin, S. Shenker, I. Stoica. Shark: Fast Data Analysis Using Coarse-grained Distributed Memory (demo), SIGMOD 2012, May 2012. Best Demo Award.
- M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, I. Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI 2012, April 2012. Best Paper Award and Honorable Mention for Community Award.
2011
- T. Hunter, T. Moldovan, M. Zaharia, S. Merzgui, J. Ma, M.J. Franklin, P. Abbeel, and A.M. Bayen. Scaling the Mobile Millennium System in the Cloud, SOCC 2011, October 2011.
- M. Chowdhury, M. Zaharia, J. Ma, M.I. Jordan and I. Stoica, Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM 2011, August 2011.
- B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A.D. Joseph, R. Katz, S. Shenker and I. Stoica, Mesos: Flexible Resource Sharing for the Cloud, USENIX ;login:, August 2011.
- M. Zaharia, B. Hindman, A. Konwinski, A. Ghodsi, A.D. Joseph, R. Katz, S. Shenker and I. Stoica, The Datacenter Needs an Operating System, HotCloud 2011, June 2011.
- B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A.D. Joseph, R. Katz, S. Shenker and I. Stoica, Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, NSDI 2011, March 2011.
- A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica, Dominant Resource Fairness: Fair Allocation of Multiple Resource Types, NSDI 2011, March 2011.
2010
- M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker and I. Stoica. Spark: Cluster Computing with Working Sets, HotCloud 2010, June 2010.
- M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker and I. Stoica. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling, EuroSys 2010, April 2010.
- M. Armbrust, A. Fox, R. Griffith, A.D. Joseph, R.H. Katz, A. Konwinski, G. Lee, D.A. Patterson, A. Rabkin, I. Stoica and M. Zaharia, Above the Clouds: A View of Cloud Computing, Communications of the ACM, April 2010.
- S. Guo, M. Derakhshani, M.H. Falaki, U. Ismail, R. Luk, E.A. Oliver, S. Ur Rahman, A. Seth, M.A. Zaharia, S. Keshav, Design and Implementation of the KioskNet System, Computer Networks, ISSN 1389-1286, DOI: 10.1016/j.comnet.2010.08.001
Earlier
- B. Hindman, A. Konwinski, M. Zaharia and I. Stoica, A Common Substrate for Cluster Computing, HotCloud 2009, June 2009.
- R. Luk, M. Zaharia, M. Ho, B. Levine and P. Aoki, ICTD for Healthcare in Ghana: Two Parallel Case Studies, ICTD 2009, April 2009.
- M. Zaharia, A. Konwinski, A.D. Joseph, R. Katz and I. Stoica, Improving MapReduce Performance in Heterogeneous Environments, OSDI 2008, December 2008.
- S. Guo, M.H. Falaki, E.A. Oliver, S. Ur Rahman, A. Seth, M. Zaharia, U. Ismail, and S. Keshav, Design and Implementation of the KioskNet System, ICTD 2007, December 2007.
- S. Guo, M.H. Falaki, E.A. Oliver, S. Ur Rahman, A. Seth, M. Zaharia, and S. Keshav, Very Low-Cost Internet Access Using KioskNet, ACM Computer Communication Review, October 2007.
- M. Zaharia and S. Keshav, Gossip-based Search Selection in Hybrid Peer-to-Peer Networks, J. Concurrency and Computation: Practice and Experience, 2007.
- M. Zaharia, A. Chandel, S. Saroiu, and S. Keshav, Finding Content in File-Sharing Networks When You Can't Even Spell, Proc. IPTPS, February 2007.
- A. Seth, D. Kroeker, M. Zaharia, S. Guo, S. Keshav, Low-cost Communication for Rural Internet Kiosks Using Mechanical Backhaul, Proc. MOBICOM 2006, September 2006.
- M. Zaharia and S. Keshav, Gossip-Based Search Selection in Hybrid Peer-to-Peer Networks, Proc. IPTPS, February 2006.
Full Publication List and Technical Reports
Talks
- Spark and Shark: High-Speed In-Memory Analytics over Hadoop and Hive Data (pptx, pdf), Hadoop Summit 2012, San Jose, CA, June 2012.
- Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters (pptx, pdf), HotCloud 2012, Boston, MA, June 2012.
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (pptx, pdf), NSDI 2012, San Jose, CA, April 2012.
- Spark: In-Memory Cluster Computing for Iterative and Interactive Applications (machine learning focused version) (pptx, pdf), NIPS Big Learning Workshop, Sierra Nevada, Spain, December 2011.
- Spark: In-Memory Cluster Computing for Iterative and Interactive Applications (pptx, pdf), Google Inc, Mountain View, CA, October 2011.
- The Datacenter Needs an Operating System (ppt, pdf), HotCloud 2011, Portland, OR, June 2011.
- Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center (ppt, pdf), NSDI 2011, Boston, MA, March 2011.
- Spark: In-Memory Cluster Computing for Iterative and Interactive Applications (ppt, pdf), Stanford University, Stanford, CA, February 2011.
- Spark: Cluster Computing with Working Sets (ppt, pdf), HotCloud 2010, Boston, MA, June 2010.
- Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling (ppt, pdf), EuroSys 2010, Paris, France, April 2010.
- Job Scheduling with the Fair and Capacity Schedulers (ppt, pdf), Hadoop Summit 2009, Santa Clara, CA, June 2009.
- Job Scheduling for MapReduce (ppt, pdf), Microsoft Research Silicon Valley, Mountain View, CA, January 2009.
- Improving MapReduce Performance in Heterogeneous Environments (ppt, pdf), OSDI 2008, San Diego, CA, December 2008.
Open Source
Almost all of my work is open source:
- The Spark cluster computing framework is available under a BSD license at spark-project.org. We have also open sourced Shark, our Apache Hive-compatible SQL and analytics engine built on Spark.
- The Mesos cluster manager is now hosted in the Apache Incubator.
- The LATE algorithm for straggler mitigation and the Hadoop Fair Scheduler are included in Apache Hadoop.
- The SNAP sequence aligner is available on GitHub.
I'm also a committer on the Apache Hadoop and Mesos projects.
Other Activities
Starting in high school, I've participated in a number of programming contests, including the International Olympiad in Informatics and the ACM International Collegiate Programming Contest. I've now stopped doing contests, but I still love algorithmic and mathematical problems.
In undergrad, I contributed to the open source realtime strategy game 0 A.D., where I worked on gameplay logic, random map generation, water rendering, and multiplayer networking.
I enjoy reading, nature, and food that is either good or free.
