CS262B Reading Summary
Querying the Internet with PIER
Ryan Huebsch et al.
Summary by Feng Zhou
4/12/04
Strong points of the paper are:
- DHTs provide a natural platform for constructing Internet-scale
databases. Hash tables are a key data structure in databases, and DHTs
implement them efficiently and robustly in the wide area. This
capability was not available to traditional distributed databases and
has a chance of making a difference.
- The in situ application scenario is novel for databases. Instead of
collecting all data in a centralized warehouse, the paper proposes
leaving all data on users' machines and running queries in situ.
This makes sense for applications like monitoring and content
aggregation. In these cases the data is constantly changing and even
the nodes are transient, so pushing all data generated at various
nodes to a centralized location is unreasonable.
- The four join algorithms are the key contribution of the paper. They
are quite simple. The first two, symmetric hash join and Fetch Matches,
are adaptations of standard distributed join algorithms to DHTs.
Symmetric semi-join is a simple optimization that publishes pointers
instead of full tuples into the rehashed namespace. This trades higher
latency for lower bandwidth consumption. The last optimization uses
Bloom filters to provide summaries of the tuples available on each
node. Instead of publishing rehashed tuples, Bloom summaries are
broadcast to each node, which in turn matches its local tuples against
them and fetches the real tuples using the DHT. This essentially
replaces the DHT used for rehashed tuples with broadcast Bloom
summaries.
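To make the rehashing idea concrete, here is a minimal single-process sketch of a symmetric hash join over a toy DHT. It is not PIER's implementation or API: the ToyDHT class, the symmetric_hash_join function, the "NQ" namespace name and the sample tuples are all hypothetical, and tuple arrivals are processed one at a time rather than concurrently as they would be in a real network.

```python
from collections import defaultdict
from hashlib import sha1


class ToyDHT:
    """In-memory stand-in for a DHT (hypothetical, not PIER's interface)."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        # One store per simulated node: (namespace, key) -> list of values.
        self.stores = [defaultdict(list) for _ in range(num_nodes)]
        # Each put is modeled as one network message.
        self.messages = 0

    def owner(self, namespace, key):
        """Map (namespace, key) to the node responsible for it."""
        digest = sha1(f"{namespace}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.num_nodes

    def put(self, namespace, key, value):
        node = self.owner(namespace, key)
        self.stores[node][(namespace, key)].append(value)
        self.messages += 1
        return node

    def local_get(self, node, namespace, key):
        """Read values already stored at a node, without a network hop."""
        return self.stores[node].get((namespace, key), [])


def symmetric_hash_join(dht, r_fragments, s_fragments, r_key, s_key):
    """Equi-join R and S by rehashing both on the join key into one namespace.

    Tuples with equal join keys land on the same node. On arrival, a tuple
    probes the opposite table's tuples already stored there, then is inserted.
    """
    results = []
    arrivals = [("R", t, r_key) for frag in r_fragments for t in frag]
    arrivals += [("S", t, s_key) for frag in s_fragments for t in frag]
    for table, tup, key_fn in arrivals:
        key = key_fn(tup)
        node = dht.owner("NQ", key)
        # Probe first, so each matching pair is emitted exactly once.
        for other_table, other in dht.local_get(node, "NQ", key):
            if other_table != table:
                results.append((tup, other) if table == "R" else (other, tup))
        dht.put("NQ", key, (table, tup))
    return results


def key_of(tup):
    """Join attribute: the first field of each sample tuple."""
    return tup[0]


# Each inner list is the fragment of a table held by one node.
r_fragments = [[(1, "a"), (2, "b")], [(3, "c")]]
s_fragments = [[(1, "x")], [(3, "y"), (3, "z"), (4, "w")]]

dht = ToyDHT(num_nodes=4)
pairs = symmetric_hash_join(dht, r_fragments, s_fragments, key_of, key_of)
print(sorted(pairs))
print("put messages:", dht.messages)
```

Note that every tuple of both tables costs one put, whether or not it ever finds a match. That cost is what the semi-join and Bloom filter variants try to reduce.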
One major flaw:
The network locality of all four join algorithms is poor.
Because tuples are scattered throughout the network, every query
involves 'getting' and 'putting' single tuples from all over the
network. Unfortunately, not much batching can be done. Because of the
per-message packet overhead, this wastes bandwidth. It also increases
query latency considerably.
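A rough back-of-the-envelope calculation illustrates the overhead argument. The numbers are assumptions for illustration, not measurements from the paper: a 28-byte header per message (a minimal 20-byte IPv4 header plus an 8-byte UDP header, ignoring any DHT-level header) and 50-byte tuples. The function names are hypothetical.

```python
# Assumption: 20-byte IPv4 header + 8-byte UDP header, no DHT-level header.
HEADER_BYTES = 28


def bytes_on_wire(num_tuples, tuple_bytes, batch_size, header_bytes=HEADER_BYTES):
    """Total bytes sent when tuples are grouped batch_size per message."""
    num_messages = -(-num_tuples // batch_size)  # ceiling division
    return num_tuples * tuple_bytes + num_messages * header_bytes


def overhead_fraction(num_tuples, tuple_bytes, batch_size, header_bytes=HEADER_BYTES):
    """Fraction of transmitted bytes that is header rather than tuple data."""
    total = bytes_on_wire(num_tuples, tuple_bytes, batch_size, header_bytes)
    payload = num_tuples * tuple_bytes
    return (total - payload) / total


# Assumed workload: 1000 tuples of 50 bytes each.
for batch in (1, 20):
    total = bytes_on_wire(1000, 50, batch)
    share = overhead_fraction(1000, 50, batch)
    print(f"batch={batch:2d}  bytes={total}  header share={share:.1%}")
```

Under these assumptions, sending one tuple per message spends over a third of the bytes on headers, while batches of 20 would bring that below 3%. Batching only helps when tuples share a destination, which the rehashing in these join algorithms largely prevents.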