Advanced Topics in Computer Systems  
Eric Brewer (based on notes from Joe Hellerstein)

SQL Query Optimization

Basics

Given: A query joining n tables
The "Plan Space": Huge number of alternative, semantically equivalent plans.
The Perils of Error: Running time of plans can vary by many orders of magnitude
Ideal Goal: Map a declarative query to the most efficient plan tree.
Conventional Wisdom: You’re OK if you avoid the rotten plans.
Industrial State of the art: Most optimizers use System R technique and work "OK" up to about 10 joins.

Approach 1: The Optimization Oracle

(definitely not to be confused with the company of the same name)
You’d like to get the following information, but in 0 time:

Approach 2: Make Up a Heuristic & See if it Works

University INGRES Oracle 6

OK,OK. Approach 3: Think!

Three issues: Selinger & the System R crowd the first to do this right. The Bible of Query Optimization.
 

SQL Refresher

SELECT {DISTINCT} <list of columns>  
FROM <list of tables>  
{WHERE <list of "Boolean Factors" (predicates in CNF)>}  
{GROUP BY <list of columns>  
{HAVING <list of Boolean Factors>}}  
{ORDER BY <list of columns>};
Semantics are:
  1. take Cartesian product (a/k/a cross-product) of tables in FROM clause, projecting to only those columns that appear in other clauses
  2. if there’s a WHERE clause, apply all filters in it
  3. if there’s a GROUP BY clause, form groups on the result
  4. if there’s a HAVING clause, filter groups with it
  5. if there’s an ORDER BY clause, make sure output is in the right order
  6. if there’s a DISTINCT modifier, remove dups
Of course the plans don’t do this exactly; query optimization interleaves 1 & 2 into a plan tree. GROUP BY, HAVING, DISTINCT and ORDER BY are applied at the end, pretty much in that order.
 

Plan Space

All your favorite query processing algorithms: Note some assumptions folded in here: Some other popular assumptions (System R)

Cost Estimation

The soft underbelly of query optimization.

Requires:

Searching the Plan Space

The System R Optimizer’s Search Algorithm

Look only at left-deep plans: there are n! plans (not factoring in choice of join method)
Observation: many of those plans share common prefixes, so don’t enumerate all of them
Sounds like a job for … Dynamic Programming!
  1. Find all plans for accessing each base relation
  1. For each relation, save cheapest unordered plan, and cheapest plan for each "interesting order". Discard all others.
  2. Now, try all ways of joining all pairs of 1-table plans saved so far. Save cheapest unordered 2-table plans, and cheapest "interesting ordered" 2-table plans.
  1. Now try all ways of combining a 2-table plan with a 1-table plan. Save cheapest unordered and interestingly ordered 3-way plans. You can now throw away the 2-way plans.
  2. Continue combining k-way and 1-way plans until you have a collection of full plan trees
  3. At top, satisfy GROUP BY and ORDER BY either by using interestingly ordered plan, or by adding a sort node to unordered plan, whichever is cheapest.


Some additional details:

Evaluation: Simple variations to improve plan quality: Subqueries 101: Selinger does a very complete job with the basics.