CS267: Applications of Parallel Computers
Shariq Rizvi
First Year Graduate Student
Computer Science Division
rizvi [at] cs [dot] berkeley [dot] edu
Research Interests
Database Systems
Platform and Coding Experiences
Operating Systems: Several Unix-like OSes and Windows
Languages: C, C++, Java, Scheme/LISP
No experience with parallel programming
Misc
Want to get a first experience with issues in parallel
programming/computing
Design of Parallel Databases
A database management system is a piece of software that allows
storage, modification and querying of large amounts of data in an
efficient manner. Research in this area has focused on efficient
physical design of databases (disk-level storage issues such as indexing),
independence between the physical database design and the logical
model presented to the user (a key feature of the relational
model), and efficient evaluation of queries (query processing and
optimization), among other topics.
Given the massive data and query workloads that database systems are
expected to handle, they are among the most natural candidates for
parallelism.
Classification of Parallel Database Proposals
Three architectures for exploiting multiprocessor parallelism are
described in [1]:
- Shared memory: A collection of processors shares the same physical
memory. This is the easiest architecture to support: the DBMS can run
as a single-site system and rely on the OS to schedule its processes
across the processors.
- Shared disk: Each processor has its own memory, but all processors
share a set of storage disks. The main challenge is keeping the data on
disk consistent while multiple processors access it. This becomes
difficult because there is no shared memory through which the
processors can coordinate.
- Shared nothing: Multiple machines share no hardware and communicate
with each other over a high-speed LAN. Most research prototypes have
been built on this paradigm.
Case Study: Shared Nothing Architecture
The Gamma project [2] is perhaps one of the most prominent in this line
of research. It is based on the shared nothing paradigm for two reasons.
First, this allows the system to scale to thousands of processors rather
than the 30-40 that a shared memory architecture would allow. Second,
system storage scales with the number of processors too, which increases
the I/O bandwidth of the system without the use of specialized disk
controllers. Here is a brief overview of Gamma's features:
- Storage: Relations (a fundamental notion in relational databases: a
relation is a collection of "similar" tuples of data) are horizontally
partitioned across all disk drives in the system. This parallelizes the
scan operation, which reads all tuples of a relation. The system gives
the user coarse-grained control over how the tuples of a relation are
partitioned across the disks (round robin, hash based, etc.).
- Indexing: The system supports both clustered and non-clustered
indexes. When the user requests an index, it is created on each
fragment of the relation.
- Query Processing: The query plan generator takes the partitioning
scheme into account when scheduling the query over the processors. For
example, if hash-based partitioning is used, a predicate like "key = a"
may require only the processor that holds the tuples with key value "a"
to do any processing.
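The partitioning and query-routing ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Gamma's actual code: the function names (round_robin_partition, hash_partition, site_for_key, sites_to_scan) and the use of CRC32 as the hash function are hypothetical choices made for this example, and each "site" is just an in-memory list standing in for one processor and its disk.

```python
import zlib

def round_robin_partition(tuples, num_sites):
    """Place the i-th tuple on site i mod num_sites."""
    sites = [[] for _ in range(num_sites)]
    for i, t in enumerate(tuples):
        sites[i % num_sites].append(t)
    return sites

def site_for_key(key, num_sites):
    """Map a key to a site. CRC32 is an assumed, illustrative hash."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_sites

def hash_partition(tuples, key_of, num_sites):
    """Place each tuple on the site chosen by hashing its key."""
    sites = [[] for _ in range(num_sites)]
    for t in tuples:
        sites[site_for_key(key_of(t), num_sites)].append(t)
    return sites

def sites_to_scan(scheme, num_sites, lookup_key):
    """Sites that must take part in the query 'key = lookup_key'."""
    if scheme == "hash":
        return [site_for_key(lookup_key, num_sites)]  # one site is enough
    return list(range(num_sites))  # round robin: any site may hold the key

def select_eq(sites, site_ids, key_of, lookup_key):
    """Scan only the chosen sites and keep the matching tuples."""
    return [t for s in site_ids for t in sites[s] if key_of(t) == lookup_key]

def key_of(t):
    return t[0]

employees = [(i, "emp%d" % i) for i in range(10)]
rr_sites = round_robin_partition(employees, 4)
hash_sites = hash_partition(employees, key_of, 4)

hash_targets = sites_to_scan("hash", 4, 7)
rr_targets = sites_to_scan("round_robin", 4, 7)
hash_result = select_eq(hash_sites, hash_targets, key_of, 7)
rr_result = select_eq(rr_sites, rr_targets, key_of, 7)
```

Both queries return the same tuple, but the hash-partitioned query touches one site while the round-robin query has to touch all four. A full scan, on the other hand, benefits from either scheme because every site reads only its own fragment.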
A Framework for Query Processing in Parallel Databases
The Volcano query processing system [3] is known for the extensibility
that results from the uniform interface between operators. An operator
is a query processing module that performs a specific function on the
stream of data tuples that are sent to it. In general, a query execution
plan can be expressed as a collection of operators and the way their
output and input streams are connected to each other. Each operator in
Volcano is written as if it were running in a single-process
environment. A special exchange operator encapsulates all the
parallelism in the system, so new operators can be added without
writing any parallelism-specific code.
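A minimal sketch of this idea, assuming the open/next/close iterator interface that Volcano operators share. The classes below (Scan, Select, Exchange, and the drain helper) are hypothetical names for this example. The exchange here uses Python threads and a queue purely for illustration; Volcano itself uses operating system processes, and Python threads will not speed up CPU-bound work. The point is only that Scan and Select contain no parallel code at all.

```python
import queue
import threading

class Scan:
    """Leaf operator: streams the tuples of one in-memory fragment."""
    def __init__(self, tuples):
        self._tuples = tuples
    def open(self):
        self._it = iter(self._tuples)
    def next(self):
        return next(self._it, None)  # None marks end of stream
    def close(self):
        self._it = None

class Select:
    """Filters its child's stream; knows nothing about parallelism."""
    def __init__(self, child, predicate):
        self._child = child
        self._predicate = predicate
    def open(self):
        self._child.open()
    def next(self):
        while True:
            t = self._child.next()
            if t is None or self._predicate(t):
                return t
    def close(self):
        self._child.close()

class Exchange:
    """Runs each child plan in its own thread and merges the outputs."""
    _DONE = object()  # sentinel: one producer has finished
    def __init__(self, children):
        self._children = children
    def open(self):
        self._queue = queue.Queue()
        self._remaining = len(self._children)
        self._threads = [threading.Thread(target=self._produce, args=(c,))
                         for c in self._children]
        for th in self._threads:
            th.start()
    def _produce(self, child):
        child.open()
        try:
            while True:
                t = child.next()
                if t is None:
                    break
                self._queue.put(t)
        finally:
            child.close()
            self._queue.put(self._DONE)
    def next(self):
        while self._remaining > 0:
            item = self._queue.get()
            if item is self._DONE:
                self._remaining -= 1
            else:
                return item
        return None
    def close(self):
        for th in self._threads:
            th.join()

def drain(op):
    """Pull every tuple out of a plan through the uniform interface."""
    op.open()
    out = []
    while True:
        t = op.next()
        if t is None:
            break
        out.append(t)
    op.close()
    return out

def is_even(x):
    return x % 2 == 0

fragments = [[1, 2, 3, 4], [5, 6, 7, 8]]
# One Select-over-Scan pipeline per fragment, merged by the exchange.
plan = Exchange([Select(Scan(f), is_even) for f in fragments])
result = sorted(drain(plan))  # arrival order across threads is not fixed
```

Because Exchange exposes the same three methods as every other operator, it can be placed anywhere in a plan, and the operators above and below it are unchanged whether the plan runs on one fragment or many.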
An overview of parallel database systems is provided in [4]. The authors
argue that while researchers in the early 1980s believed specialized
hardware was the key to large-scale, scalable databases, by the early
1990s it was clear that parallel database software running on
conventional hardware was the more natural solution. They point out that
relational operators extend naturally to a parallel architecture because
they take relations as input and produce relations as output.
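The last point can be illustrated with a small sketch. Because each operator consumes a relation and produces a relation, a plan can be composed freely and then applied to each fragment independently, with the per-fragment outputs simply concatenated. The operators and the run_plan helper below are hypothetical names for this example, and a thread pool stands in for separate processors.

```python
from concurrent.futures import ThreadPoolExecutor

def select(relation, predicate):
    """Relation in, relation out: keep the tuples that satisfy predicate."""
    return [t for t in relation if predicate(t)]

def project(relation, columns):
    """Relation in, relation out: keep only the listed column positions."""
    return [tuple(t[c] for c in columns) for t in relation]

def run_plan(relation):
    """A composed plan: select on the third column, then project two columns."""
    return project(select(relation, lambda t: t[2] > 50), (0, 1))

relation = [(i, "name%d" % i, i * 10) for i in range(10)]
fragments = [relation[i::3] for i in range(3)]  # round-robin fragments

serial = run_plan(relation)

# Run the same plan on every fragment, then concatenate the outputs.
with ThreadPoolExecutor(max_workers=3) as pool:
    per_fragment = list(pool.map(run_plan, fragments))
parallel = [t for part in per_fragment for t in part]

same_answer = sorted(serial) == sorted(parallel)
```

This works directly for selection and projection. Operators such as joins and aggregates need the data to be repartitioned or the partial results to be combined, which is the kind of data movement an exchange-style operator handles.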
References
[1] Readings in Database Systems. Michael Stonebraker. Morgan
Kaufmann Publishers, 1994.
[2] The Gamma Database Machine Project. DeWitt et al. IEEE
Transactions on Knowledge and Data Engineering, 1990.
[3] Encapsulation of Parallelism in the Volcano Query Processing
System. Goetz Graefe. SIGMOD Record, 1990.
[4] Parallel Database Systems: The Future of High Performance
Database Processing. David DeWitt and Jim Gray. Communications of
the ACM, 1992.