CS267: Applications of Parallel Computers

Shariq Rizvi

First Year Graduate Student
Computer Science Division
rizvi [at] cs [dot] berkeley [dot] edu

Research Interests

Database Systems

Platform and Coding Experiences

Operating Systems: Several Unix-like OSes and Windows
Languages: C, C++, Java, Scheme/LISP
No experience with parallel programming

Misc

Want to get a first experience with issues in parallel programming/computing

Design of Parallel Databases

A database management system is software that stores, modifies and queries large amounts of data efficiently. Research in this area has focused on efficient physical design of databases (disk-level storage issues such as indexing), independence between the physical database design and the logical model presented to the user (a key feature of the relational database model), efficient evaluation of queries (query processing and optimization), and other aspects of databases. Given the massive data and query workloads that database systems are expected to handle, they are one of the most natural domains for exploiting parallelism.

Classification of Parallel Database Proposals

The three possible architectures to exploit multiprocessor parallelism have been described in [1] as:
  1. Shared memory: A collection of processors shares the same physical memory. This is the easiest architecture to support, as the system can run as a single-site DBMS and rely on the OS to schedule processes across the processors.

  2. Shared disk: Each processor has its own memory, but all processors use a shared set of storage disks. The main challenge is keeping the data on disk consistent while multiple processors access it. This becomes difficult because there is no shared memory through which the processors can coordinate.

  3. Shared nothing: Multiple machines share no hardware and communicate with each other over a high-speed LAN. Most research prototypes have been built on this paradigm.

Case Study: Shared Nothing Architecture

The GAMMA project [2] is perhaps one of the most prominent in this line of research. It is based on the shared nothing paradigm, which brings two benefits. First, the system can scale to thousands of processors rather than the 30-40 that a shared memory architecture would allow. Second, system storage scales with the number of processors as well, which increases the I/O bandwidth of the system without the use of specialized disk controllers. Here is a brief overview of GAMMA features:

A Framework for Query Processing in Parallel Databases

The Volcano query processing system [3] is known for the extensibility that results from the uniform interface between its operators. An operator is a query processing module that performs a specific function on the stream of data tuples sent to it. In general, a query execution plan can be expressed as a collection of operators together with the way their input and output streams are connected. Each operator in Volcano is written as if it were running in a single-process environment. A special exchange operator encapsulates all the parallelism in the system, so new operators can be added without any parallelism-specific code.
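The idea of hiding parallelism behind an exchange operator can be sketched as follows. This is a minimal illustration and not Volcano's actual implementation: it uses Python's iterator protocol as a stand-in for the uniform operator interface, and threads in place of whatever processes the real system uses. The names Scan, Select, Exchange and run_plan are hypothetical.

```python
import queue
import threading

class Scan:
    """Leaf operator: streams tuples from an in-memory list (stand-in for a table)."""
    def __init__(self, rows):
        self.rows = rows

    def __iter__(self):
        return iter(self.rows)

class Select:
    """Filter operator: written with no knowledge of parallelism."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def __iter__(self):
        for row in self.child:
            if self.predicate(row):
                yield row

class Exchange:
    """Runs each child plan in its own thread and merges their output streams.

    To its parent it looks like any other operator (an iterable of tuples).
    """
    _DONE = object()  # sentinel marking the end of one child's stream

    def __init__(self, children):
        self.children = children

    def __iter__(self):
        out = queue.Queue()

        def produce(child):
            try:
                for row in child:
                    out.put(row)
            finally:
                out.put(self._DONE)  # always signal completion

        workers = [threading.Thread(target=produce, args=(c,))
                   for c in self.children]
        for w in workers:
            w.start()
        remaining = len(workers)
        while remaining:
            item = out.get()
            if item is self._DONE:
                remaining -= 1
            else:
                yield item
        for w in workers:
            w.join()

def run_plan(rows, predicate, degree):
    """Split the input round-robin and run one Select(Scan) pipeline per part."""
    partitions = [rows[i::degree] for i in range(degree)]
    plan = Exchange([Select(Scan(p), predicate) for p in partitions])
    return list(plan)
```

Note that Select and Scan contain no threading code at all; only Exchange does. The order in which result tuples arrive depends on thread scheduling, so callers that need an order would have to sort the output.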

An overview of parallel database systems is provided in [4]. The authors argue that while researchers in the early 1980s believed specialized hardware was the key to large-scale, scalable databases, by the early 1990s it was clear that parallel databases were a more natural software solution. They point out that relational operators extend very naturally to a parallel architecture, because they take relations as input and produce relations as output.
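To illustrate why relation-in, relation-out operators parallelize naturally, here is a small sketch of a join run independently on partitions of its inputs. It assumes hash partitioning on the join key, which is one common choice and is not described in the text above; the function names are hypothetical, relations are modeled as lists of dicts, and threads stand in for separate nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def hash_partition(relation, key, n):
    """Split a relation into n partitions by hashing the given attribute."""
    parts = [[] for _ in range(n)]
    for row in relation:
        parts[hash(row[key]) % n].append(row)
    return parts

def local_join(left, right, key):
    """Ordinary single-site hash join on one pair of partitions."""
    index = {}
    for row in left:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for r in right for l in index.get(r[key], [])]

def parallel_join(left, right, key, n=4):
    """Join two relations by joining matching partitions independently."""
    left_parts = hash_partition(left, key, n)
    right_parts = hash_partition(right, key, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = pool.map(local_join, left_parts, right_parts, [key] * n)
        # The union of the per-partition results is again a relation.
        return [row for part in results for row in part]
```

Because rows with equal keys always land in the same partition, each pair of partitions can be joined without looking at any other pair. The result is itself a list of rows, so it can be fed directly into another operator.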

References

[1] Readings in Database Systems. Michael Stonebraker. Morgan Kaufmann Publishers, 1994.

[2] The Gamma Database Machine Project. DeWitt et al. IEEE Transactions on Knowledge and Data Engineering, 1990.

[3] Encapsulation of Parallelism in the Volcano Query Processing System. Goetz Graefe. SIGMOD Record, 1990.

[4] Parallel Database Systems: The Future of High Performance Database Processing. David DeWitt and Jim Gray. Communications of the ACM, 1992.