University of California, Berkeley

College of Engineering

Computer Science Division – EECS

 

Fall 2001                                                                   Anthony D. Joseph and Joe Hellerstein

Midterm Exam Solutions

October 31, 2001

CS262A Advanced Topics in Computer Systems

 

Your Name:

 

 

E-mail Address:

 

 

 

Administrative Instructions:

 

This is an open book/open notes examination; however, no collaboration with students in the class or others is permitted.

 

You have until 5PM PST on Friday November 2, 2001 to answer both questions.

 

Exams should be turned in electronically, in ASCII form only, to the e-mail address cs262prof AT postgres.berkeley.edu. The only exception to the format rule is the graph(s) for question 2.

 

Each question is worth 5 points; there are 10 points in all.

 

As you can imagine, there are no single correct answers to the questions in this exam. Try to focus your discussion on the issues you feel are most important. Limit your answer to each question to no more than 1000 words (approximately 3x the length of each question); please double-check the lengths before turning in your exam, or your answers will be truncated accordingly.

 

If there is something in a question that you believe is open to interpretation, then please ask us about it!

 

 

                                                Good Luck!!

 


 

Problem 1. (5 points total). Using a DBMS on the UNIX Operating System.

 

From the beginning, UNIX has been designed with the characteristics of “simplicity, elegance, and ease of use.” Its most important design goal is to provide a file system. Over time, the implementation (e.g., LFS, FFS, etc.) has changed, but the abstraction semantics have remained largely the same. One reason for the lack of change is that the semantics are sufficiently general purpose for all applications.

 

However, your database designer friend, Bob, approaches you with a dilemma. Bob tells you that he doesn't care whether you use FFS or LFS; the UNIX file abstraction is inappropriate for a DBMS. Your job in this question is to explain why Bob’s statement is correct.

 

After explaining why Bob’s statement is correct, list three specific issues that are faced by potential solutions that extend the file abstraction (instead of replacing it with a new abstraction built on a raw disk device), and give a brief explanation of each issue.

 

For the purposes of this exam, assume the following UNIX API for files:

·        Create: create a new file for reading or writing.

·        Open: open an existing file for reading or writing.

·        Close: close an open file. Dirty buffers are scheduled for flushing.

·        Read: read bytes from a file.

·        Write: write bytes to a file.

·        Seek: seek to a specified byte offset within an open file.

·        Sync: request that the OS schedule a flush to disk of the dirty buffers of all open files.

·        Move: atomically rename a file.

·        HLink: create a hard link to the I-node for a file.

·        SLink: create a symbolic link to a file.

·        Lock: exclusively lock a file to prevent other processes from opening that file.

·        Delete: atomically delete a file.
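
For orientation only, the list above can be lined up against the POSIX calls exposed by Python's os and fcntl modules. This mapping is an assumption made for illustration and is not part of the exam's API definition; the file names and the helper exercise_file_api are hypothetical. Note in particular that fcntl.flock is an advisory lock, which is weaker than the exam's Lock (it does not stop another process from opening the file), and that os.sync asks the OS to flush all buffered data, not only that of open files.

```python
import fcntl
import os
import tempfile


def exercise_file_api(directory):
    """Call each operation in the assumed API once; return the bytes read back."""
    path = os.path.join(directory, "table.dat")      # hypothetical file names
    renamed = os.path.join(directory, "table.v2")
    hard = os.path.join(directory, "table.hard")
    soft = os.path.join(directory, "table.soft")

    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)  # Create
    fcntl.flock(fd, fcntl.LOCK_EX)                     # Lock (advisory only)
    os.write(fd, b"page0page1")                        # Write
    os.lseek(fd, 5, os.SEEK_SET)                       # Seek to byte offset 5
    data = os.read(fd, 5)                              # Read five bytes
    os.sync()                                          # Sync (schedules a flush)
    os.close(fd)                                       # Close (drops the lock)

    os.rename(path, renamed)                           # Move
    os.link(renamed, hard)                             # HLink
    os.symlink(renamed, soft)                          # SLink
    fd = os.open(renamed, os.O_RDONLY)                 # Open an existing file
    os.close(fd)
    for name in (soft, hard, renamed):
        os.unlink(name)                                # Delete
    return data


with tempfile.TemporaryDirectory() as tmp:
    result = exercise_file_api(tmp)
    leftover = os.listdir(tmp)
```

Nothing in this interface names a page, a transaction, or an ordering between writes; the calls operate on byte ranges and whole files.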

 

 


Problem 1 answer:


Problem 2. (5 points total). Time-Travel Layout and Performance in Postgres.

 

The Postgres storage system as described by Stonebraker was an early design, and many of the decisions are subject to debate.  One area where it could be changed is its handling of "tuple differencing".  In the original proposal, the "anchor point" contained the original version of the tuple, with deltas chained off of the anchor point.  In subsequent Postgres implementations (including what you download from postgresql.org, last I checked), there is no tuple differencing at all: each version of a tuple is represented as a complete tuple in the table, with no explicit chaining or deltas.  A third option is to make the anchor point be the most recent version of the tuple, with older versions stored as deltas in a reverse-chronological chain.
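
The three layouts can be made concrete with a minimal sketch. It assumes a tuple is a Python dict with a fixed set of columns and a delta is the dict of changed fields; these representations, the function names, and the sample rows are all hypothetical and are not taken from the Postgres design. Storage is counted in fields only, ignoring tuple headers, pointers, and page layout.

```python
def diff(old, new):
    """Fields of `new` that differ from `old` (same columns assumed in both)."""
    return {k: v for k, v in new.items() if old[k] != v}


def forward_chain(versions):
    """Anchor is the oldest version; delta i turns version i into version i+1."""
    return versions[0], [diff(a, b) for a, b in zip(versions, versions[1:])]


def reverse_chain(versions):
    """Anchor is the newest version; each delta steps one version back in time."""
    newest_first = versions[::-1]
    return newest_first[0], [diff(a, b) for a, b in zip(newest_first, newest_first[1:])]


def reconstruct(anchor, deltas, steps):
    """Apply the first `steps` deltas to the anchor; work grows with `steps`."""
    row = dict(anchor)
    for delta in deltas[:steps]:
        row.update(delta)
    return row


def stored_fields(anchor, deltas):
    """Crude storage measure: number of field values kept on disk."""
    return len(anchor) + sum(len(d) for d in deltas)


# Three versions of one hypothetical tuple, oldest first.
versions = [
    {"id": 1, "name": "ann", "salary": 100},
    {"id": 1, "name": "ann", "salary": 110},
    {"id": 1, "name": "ann", "salary": 125},
]

fwd = forward_chain(versions)
rev = reverse_chain(versions)

current_from_fwd = reconstruct(*fwd, steps=2)  # must walk the whole chain
current_from_rev = reconstruct(*rev, steps=0)  # anchor is already current
oldest_from_rev = reconstruct(*rev, steps=2)   # historical read walks the chain

full_copy_fields = sum(len(v) for v in versions)  # no differencing at all
delta_fields = stored_fields(*fwd)                # either chained layout
```

In this toy case the two chained layouts store the same number of fields and differ only in which version is reachable without walking the chain, while the full-copy layout stores every field of every version.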

 

Your job in this question is to address the tradeoffs between these three options quantitatively.  Define a simple one-table workload, and clarify your assumptions about storage overheads in the different schemes.  Make your workload a mix of queries and updates to the one table.  Try not to make wildly unrealistic assumptions (as was done in the original paper).  Then draw one or two telltale graphs that illustrate key tradeoffs among these three schemes.  The graphs need not be super-accurate; they should just show the relevant trends to illustrate your points.

 

Additional Instructions:

You can draw the graphs freehand on paper, and turn them in separately by sliding them under Prof. Hellerstein's door (685 Soda).  If you really want to turn in the graphs electronically, you can email them – PDF, GIF, or JPEG formats only, please.  The graphs should have labeled axes, but no explanatory captions (i.e., this is not a way to make your answer longer).

 

 


Problem 2 answer: