CS262B Reading Summary

Parallel Database Systems: The Future of High Performance Database Processing

David J. DeWitt and Jim Gray

Summary by Feng Zhou
2/17/2004

Strong points of the paper are:

  1. Two kinds of parallelism are identified in DBMSs, namely pipelined parallelism and partitioned parallelism.  Pipelined parallelism is fundamentally limited, for reasons such as the operator pipelines of SQL queries being mostly short.  Partitioned parallelism, in contrast, is not so limited and can be exploited well by a shared-nothing cluster.
  2. The three threats to speedup and scaleup are a useful framework.  They are startup, interference and skew.  Interference is essentially contention for shared resources such as locks, and skew is load imbalance.  The conclusion of the discussion that follows is that the shared-nothing architecture does not pose any obstacle to overcoming these threats.  Therefore, given the economic benefits of clusters, they should be a no-brainer for building parallel databases.  This is a valid point, at least without considering software cost.
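
To make the two points above concrete, here is a minimal sketch of partitioned parallelism and of how startup and skew eat into speedup.  It is my own toy model, not code or a cost model from the paper: the names partition, elapsed and speedup, and the parameters startup_per_node and cost_per_row, are all hypothetical, and interference is not modeled at all.  The only assumptions are that rows are hash-partitioned across nodes, that nodes work independently, and that elapsed time is set by the slowest node.

```python
def partition(keys, n_nodes):
    """Hash-partition keys across n_nodes (shared-nothing: each node owns its part)."""
    parts = [[] for _ in range(n_nodes)]
    for k in keys:
        parts[hash(k) % n_nodes].append(k)
    return parts


def elapsed(parts, startup_per_node=0.0, cost_per_row=1.0):
    """Toy cost model (an assumption): startup grows with the number of nodes,
    and the scan finishes only when the most heavily loaded node finishes."""
    return startup_per_node * len(parts) + cost_per_row * max(len(p) for p in parts)


def speedup(keys, n_nodes, **cost):
    """Speedup = elapsed time on one node / elapsed time on n_nodes, same problem."""
    return elapsed(partition(keys, 1), **cost) / elapsed(partition(keys, n_nodes), **cost)


# Uniform keys spread evenly, so four nodes give linear speedup.
uniform = list(range(1000))

# Skewed keys: 700 rows share key 0 and all land on the same node.
skewed = [0] * 700 + list(range(1, 301))

print(speedup(uniform, 4))                         # no startup cost, no skew
print(speedup(uniform, 4, startup_per_node=10.0))  # startup cost only
print(speedup(skewed, 4))                          # skew only
```

In this model the uniform case reaches the ideal speedup of 4, a per-node startup cost pulls it below 4, and the skewed case stays close to 1 because one node ends up with 775 of the 1000 rows.  None of these losses comes from the shared-nothing design itself, which is the paper's point.
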

One major flaw:

The paper says that Grosch's Law (economies of scale in computing) no longer applies to databases because of the advent of clusters and MPP machines.  However, even though this is mostly true for hardware, it is not at all true for software, which makes up a large part of the cost of a database system.  Nowadays database vendors charge very high prices for parallel databases simply because they don't sell many of them.  Software cost is also one major reason why these MPP companies all died out.  One interesting question is whether a free parallel database project for Internet services, built on top of existing open-source databases, can fly, given that parallel databases have the technical merits discussed in the paper and have been successful in the "high-end" marketplace, such as banks and large companies.  This sounds appealing because currently almost no Internet services use parallel databases, presumably mainly because of cost.