Current Research Projects

Below is a list of some of my current research projects:

CrowdDB

Some queries cannot be answered by machines alone. Processing such queries requires human input, e.g., for providing information that is missing from the database, for performing computationally difficult functions, and for matching, ranking, or aggregating results based on fuzzy criteria. With CrowdDB we explore a new database design that uses human input via crowdsourcing to process queries that neither database systems nor search engines can answer adequately. CrowdDB uses SQL both as a language for posing complex queries and as a way to model data. While CrowdDB leverages many concepts from traditional database systems, there are also important differences. From a conceptual perspective, the traditional closed-world assumption for query processing no longer holds for human input, which is essentially unbounded. From an implementation perspective, CrowdDB uses an operator-based query engine, but this engine is extended with special operators that generate user interfaces in order to solicit, integrate, and cleanse human input.
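To make the idea of a crowd-backed operator more concrete, here is a minimal sketch in Python. It is not CrowdDB's actual implementation: the function name crowd_fill, the ask_crowd callback, and the use of a majority vote as the cleansing step are all assumptions made for illustration. The operator passes rows through unchanged unless a column value is missing, in which case it asks several workers and keeps the most common normalized answer.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator, Optional

Row = Dict[str, Optional[str]]


def crowd_fill(
    rows: Iterable[Row],
    column: str,
    ask_crowd: Callable[[Row, str], str],  # hypothetical: posts a task, returns one answer
    votes: int = 3,
) -> Iterator[Row]:
    """Iterator-style operator that fills missing values of `column` via the crowd."""
    for row in rows:
        if row.get(column) is None:
            # Solicit: ask `votes` independent workers for the missing value.
            answers = [ask_crowd(row, column) for _ in range(votes)]
            # Cleanse: normalize whitespace and case so near-duplicates agree.
            normalized = [a.strip().lower() for a in answers]
            # Integrate: keep the majority answer (an assumed quality-control rule).
            winner, _ = Counter(normalized).most_common(1)[0]
            row = {**row, column: winner}
        yield row


# Usage with a fake crowd that replays canned answers instead of real workers.
_canned = iter(["Berkeley ", "berkeley", "Stanford"])
result = list(
    crowd_fill(
        [{"name": "UC Berkeley", "city": None}, {"name": "ETH", "city": "zurich"}],
        column="city",
        ask_crowd=lambda row, col: next(_canned),
    )
)
```

In a real system the ask_crowd callback would have to generate a task user interface and wait for workers, which is where most of the latency and cost would come from.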

Performance Insightful Query Language (PIQL)

Newly-released web applications often succumb to a "Success Disaster", where overloaded database machines and resulting high response times destroy a previously good user experience. Unfortunately, the data independence provided by a traditional relational database, while useful for rapid development, only exacerbates the problem by hiding potentially expensive queries under simple declarative expressions. As a result, developers of these applications are increasingly abandoning relational databases in favor of imperative code written against distributed key/value stores, losing the many benefits of data independence in the process. Instead, we propose PIQL, a declarative language that also provides scale independence by calculating an upper bound on the number of key/value store operations that will be performed for any query. Coupled with a service level objective (SLO) compliance prediction model and PIQL's scalable database architecture, these bounds make it easy for developers to write success-tolerant applications that support an arbitrarily large number of users while still providing acceptable performance.
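The following Python sketch illustrates what calculating an upper bound on key/value store operations could look like for a small query plan. It is a simplified cost model assumed for illustration, not PIQL's actual optimizer: the operator names (BoundedScan, IndexJoin), the one-operation-per-tuple accounting, and the within_budget helper are all hypothetical. The point is that once every operator has a declared cardinality limit, the bound depends only on the query, not on the size of the database.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class BoundedScan:
    """Index scan capped by a LIMIT clause or a declared cardinality constraint."""
    limit: int

    def max_tuples(self) -> int:
        return self.limit

    def max_ops(self) -> int:
        # Pessimistic assumption: one key/value operation per returned tuple.
        return self.limit


@dataclass
class IndexJoin:
    """For each tuple of `child`, fetch at most `fanout` matching records."""
    child: Union["BoundedScan", "IndexJoin"]
    fanout: int

    def max_tuples(self) -> int:
        return self.child.max_tuples() * self.fanout

    def max_ops(self) -> int:
        # Operations of the child plus one lookup per possible joined record.
        return self.child.max_ops() + self.child.max_tuples() * self.fanout


def within_budget(plan: Union[BoundedScan, IndexJoin], budget: int) -> bool:
    """Accept a query only if its worst-case operation count fits the budget."""
    return plan.max_ops() <= budget


# Example: at most 50 followed users, and at most 10 recent posts for each.
plan = IndexJoin(child=BoundedScan(limit=50), fanout=10)
bound = plan.max_ops()  # 50 + 50 * 10 = 550
```

A query without any limit on one of its operators would have no finite bound in this model, which is the kind of query a scale-independent language would want to reject or flag at compile time.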

Multi-Data-Center Consistency (MDCC)

The recent Amazon East-Coast data-center failure once more demonstrated the need for multi-data-center deployments and recovery. However, multi-data-center deployments challenge many of the design decisions behind existing cloud services. For example, the latency between data centers is so high and unreliable that traditional strong consistency protocols are not applicable. At the same time, data-center failures render a huge fraction of the nodes inside a system unavailable, making replication even more important. The goal of this project is to investigate how alternative architectures, programming and consistency models, as well as recovery techniques can help to build multi-data-center applications.

Failure-Aware Multi-Tenancy Placement

Together with SAP and the Hasso-Plattner-Institute, we are currently exploring new techniques and scheduling algorithms for placing multi-tenant applications in the cloud. SAP became a software-as-a-service provider with their on-demand platform Business by Design. These new SAP services offer hosted multi-tenant applications with very strong availability and response-time guarantees. At the moment, those guarantees are mainly achieved by overprovisioning the hardware for the service. The goal of this project is to develop new techniques and scheduling algorithms that are more cost-effective while still guaranteeing the service level agreements, even in the presence of major failures.

Past Research Projects

Building a Database on Cloud Infrastructure

With this project, we explored the opportunities and limitations of using cloud computing as an infrastructure for general-purpose web-based database applications. Part of this work was to analyze alternative client-server and indexing architectures as well as alternative consistency protocols. Furthermore, we proposed a new transaction paradigm, Consistency Rationing, which not only defines the consistency guarantees on the data instead of at the transaction level, but also allows for switching consistency guarantees automatically at runtime. Thus, the system can adapt on the fly, balancing the cost of consistency against the risk of possible inconsistencies. The outcome of this work was published at SIGMOD 2008 and VLDB 2009 and has been partly commercialized by 28msec Inc.
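As a rough illustration of switching consistency guarantees at runtime, the Python sketch below picks a consistency level per data item by comparing expected costs. It is a minimal sketch under stated assumptions, not the published policy: the function choose_level, its parameters, and the simple expected-cost rule are hypothetical and chosen only to show the trade-off between paying for strong consistency and risking an inconsistency.

```python
from enum import Enum


class Level(Enum):
    STRONG = "strong"
    WEAK = "weak"


def choose_level(
    conflict_probability: float,   # estimated chance of a conflicting update
    inconsistency_penalty: float,  # assumed cost incurred if an inconsistency occurs
    strong_cost: float,            # assumed cost of one strongly consistent operation
    weak_cost: float,              # assumed cost of one weakly consistent operation
) -> Level:
    """Pick the cheaper level in expectation for the next operation on an item."""
    expected_weak = weak_cost + conflict_probability * inconsistency_penalty
    return Level.STRONG if expected_weak > strong_cost else Level.WEAK


# A rarely updated item can run with weak guarantees ...
cold_item = choose_level(0.01, inconsistency_penalty=1.0, strong_cost=0.15, weak_cost=0.1)
# ... while a heavily contended one is switched to strong consistency.
hot_item = choose_level(0.2, inconsistency_penalty=1.0, strong_cost=0.15, weak_cost=0.1)
```

Because the conflict probability can be re-estimated as the workload changes, the same item may move between levels over time, which is the runtime adaptation described above.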

Cloudy/Smoky - a distributed storage and streaming service in the cloud

Cloud computing has changed the view on data management by focusing primarily on cost, flexibility, and availability instead of consistency and performance at any price, as traditional DBMSs do. As a result, cloud data stores run on commodity hardware and are designed to be scalable, easy to maintain, and highly fault-tolerant, often providing relaxed consistency guarantees. The success of key-value stores like Amazon's S3 and the variety of open-source systems reflect this shift. Existing solutions, however, still lack substantial functionality provided by a traditional DBMS (e.g., support for transactions and a declarative query language) and are tailored to specific scenarios, creating a jungle of services. That is, users have to commit to a specific service and are later locked into it, which prevents the evolution of the application and leads to misuse of services and expensive migrations to other services. During my time at ETH we started to build our own highly scalable database, which provides a completely modularized architecture and is not tailored to a specific use case. For example, Cloudy supports stream processing as well as SQL and simple key-value requests. This project is continued at ETH and I am still partially involved in the development.

Zorba - a general purpose XQuery processor implemented in C++

Zorba is a general-purpose XQuery processor that implements the W3C family of specifications in C++. The query processor has been designed to be embeddable in a variety of environments such as other programming languages extended with XML processing capabilities, browsers, database servers, XML message dispatchers, or smartphones. Its architecture employs a modular design, which allows customizing the Zorba query processor to the environment's needs. In particular, the architecture of the query processor allows a pluggable XML store (e.g. main memory, DOM stores, persistent disk-based large stores, S3 stores). Zorba can be accessed through APIs from the following languages: C, C++, Ruby, Python, Java, and PHP. Zorba runs on most platforms and is available under the Apache license v2. My part in this project was the design and implementation of the main run-time components (e.g. iterator contract, variable binding, FLWOR ...). Zorba is still actively developed by a team of 6 full-time developers and several volunteer contributors.

MXQuery & Windowing for XQuery

MXQuery is a low-footprint, extensible open-source XQuery engine implemented in Java. Besides a high level of compliance with XQuery 1.0, it provides wide coverage of upcoming W3C standard proposals (Update, Fulltext, Scripting) and support for a wide range of Java platforms (including mobile/embedded devices). We used MXQuery as our research platform for data stream processing/CEP with XQuery. The windowing extension we proposed during this project was accepted by the W3C for the upcoming XQuery 1.1 standard.