Project Presentations

Tuesday December 9, 2003
12:30 pm - 4:20 pm
310 Soda


Time
Title
Researchers / Abstract
Slides
Paper
12:40
DHT-Based Adaptive Query Processing via Federated Eddies
Ryan Huebsch and Shawn Jeffery
[ppt,ps.gz,pdf] [ps.gz,pdf]
In response to the ever-increasing scale and complexity of data processing, much research in the past few years has focused on adaptive query processing.  However, many of these solutions, although aimed at wide-area data processing, remain centralized.  In this paper, we present FREddies, an extension of the centralized Eddy operator for use in a P2P query processing system.  FREddies operate within the framework of PIER, a DHT-based P2P query processor.  FREddies optimize the query at runtime and require no global knowledge. We show that FREddies using rudimentary routing policies can, on average, outperform a traditional static query optimization approach.  Furthermore, we validate our approach in the real-world environment of PlanetLab.
1:05
Anonymity and Performance in Peer-to-Peer Systems
Nikita Borisov and Jason Waddle
[ppt,ps.gz,pdf] [ps.gz,pdf]
Existing peer-to-peer systems that aim to provide anonymity to their users are based on networks with unstructured or loosely structured routing algorithms.  Structured routing offers performance and robustness guarantees that these systems are unable to achieve.  We therefore investigate adding anonymity support to structured peer-to-peer systems. We apply an entropy-based anonymity metric to Chord and use this metric to quantify the effects that several possible extensions have on anonymity.  We identify particular properties of Chord that have the strongest negative effects on anonymity and propose a routing extension that allows a general trade-off of anonymity and performance. It should be possible to generalize our results to other structured peer-to-peer systems.
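As a rough illustration of what an entropy-based anonymity metric measures, the sketch below computes the Shannon entropy of an attacker's probability distribution over candidate initiators and normalizes it by the maximum possible entropy. This is a generic form of such metrics and is an assumption here; the abstract does not give the exact definition the authors apply to Chord, and the function names are hypothetical.

```python
import math


def entropy_bits(probs):
    """Shannon entropy, in bits, of the attacker's distribution over suspects."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    h = 0.0
    for p in probs:
        if p > 0:
            q = p / total  # normalize so the weights form a distribution
            h -= q * math.log2(q)
    return h


def degree_of_anonymity(probs):
    """Entropy divided by its maximum, log2(N); 1.0 means all N suspects look alike."""
    n = len(probs)
    if n < 2:
        return 0.0
    return entropy_bits(probs) / math.log2(n)
```

With four equally likely suspects the entropy is 2 bits and the normalized degree is 1.0; if the attacker is certain of one suspect, both values drop to 0.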
1:30
Persistence of Data in a Dynamic, Unreliable Network
Rachel Rubin and Hakim Weatherspoon
[ppt,ps.gz,pdf] [ps.gz,pdf]
We present a data layer that is persistent under churn and a branching scheme that allows users to modify old versions of objects.  To accomplish the first goal we needed working host heartbeats, 'critical' redundancy triggers, and an archival layer that repairs fragments. Additionally, if the author of an object dies, the object can still be located through tombstones. To accomplish the second goal, we created time-travel macrobranching that allows modification of an older version of a file or directory by creating a new branch. Finally, we sketch a design of how to bind together branching and the persistent data store for a time-travel NFS client that creates a branch of a modified older directory in the midst of churn.
1:55 BREAK
2:05
Improving Performance in the Gnutella Protocol
Benjamin Poon and Jonathan Hess
[ppt,ps.gz,pdf] [ps.gz,pdf]
The Gnutella protocol describes a completely decentralized P2P file sharing system in which queries are flooded to all neighbors in the search for files. As originally specified, the protocol does not have any notion of providing privacy; as such, because agencies have begun to censor and threaten users of such systems, participation has decreased. In turn, users who continue to utilize the network choose not to share data for fear of litigation. This reduces data redundancy and increases the workload of fully participating peers. As files become less available, Gnutella peers must broadcast deeper into the network. While data participation is relatively uncontrollable, increased redundancy and decreased workload can be achieved by replicating files to other peers. This, however, must be done in a way that preserves the ability of the proxy peers to deny knowledge of file content. In this paper, we present an extension to the Gnutella protocol which achieves redundancy through encrypted mirroring. We further improve performance by directing queries using a Bloom filter mechanism. Through simulation, we explore the performance gains of these protocol extensions in terms of aggregate bandwidth consumption, query efficiency, and average hop count.
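To make the idea of directing queries with a Bloom filter concrete, here is a minimal sketch. It assumes each peer holds one filter per neighbor summarizing that neighbor's shared file names, and forwards a query only to neighbors whose filter may contain the keyword. The class, the function, the filter size, and the hashing scheme are all hypothetical choices for illustration; the abstract does not describe the authors' actual mechanism.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def neighbors_to_query(keyword, neighbor_filters):
    """Return the neighbors whose filter might hold the keyword."""
    return [name for name, bf in neighbor_filters.items() if keyword in bf]
```

Compared with flooding every neighbor, this skips neighbors that certainly lack the keyword, at the cost of occasional wasted forwards caused by false positives.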
2:30
Security Implications of Peer-to-Peer Networks
Karthik Lakshminarayanan and Jayanthkumar Kannan
[ppt,ps.gz,pdf] [ps.gz,pdf]
Recently, two trends have emerged in the field of peer-to-peer networks: widespread deployment of peer-to-peer systems for file sharing and development of distributed hash tables that provide efficient lookups. In this paper, we study how to harness the power of these technologies to further the state of the art in designing and defending against Internet worms. We quantify this advance from three different viewpoints. Firstly, peer-to-peer traffic characteristics differ from those of traditional Internet traffic in several respects, and we quantitatively analyze the effect of these differences on worm propagation and control. Secondly, we show that a DHT is an ideal model for coordination among worms, and design a DHT-enabled worm that improves on existing worm designs in a number of respects.  In this way, this paper attempts to "raise the bar" in worm design, which is essential to the development of suitable defences.  Finally, we offer some preliminary insights on how a DHT can be used to defend against worms.
2:55
Peer-to-Peer Result Dissemination for High-Volume Data Filtering
Shariq Rizvi and Paul Burstein
[ppt,ps.gz,pdf] [ps.gz,pdf]
Data filtering is the problem of matching high-volume document streams against a set of client profiles, often represented by queries. Recent work in high-volume data filtering, like the YFilter project at Berkeley, has focused on efficiently indexing client queries by exploiting the similarity between queries. However, the problem of result dissemination in data filtering has not received enough attention.  Each filtered document has to be delivered to a unique and typically large set of clients, making this a problem of highly dynamic multicast. We observe fundamental limitations in the client-server delivery model for the high-volume data filtering problem and describe a peer-to-peer scheme to solve it. It exploits the bandwidth of participating clients for data dissemination by building an unstructured overlay network. Our deployment on the PlanetLab testbed shows that the proposed approach is scalable and offers acceptable delivery delays and network economy.
3:20 BREAK
3:30
Distributed Web Crawling over DHTs
Owen Cooper, Sailesh Krishnamurthy, and Boon Thau Loo
[ppt,ps.gz,pdf] [ps.gz,pdf]
A traditional web search engine like Google is composed of a crawler, indexer and searcher, with only the search interface exposed to end users. As a result, users can control neither the freshness nor the structure of the crawl, a growing problem as the web evolves towards dynamically generated content. Further, there are emerging data sources, such as an estimated 550 billion documents in the deep web, that remain out of the reach of most centralized crawlers. Finally, it is difficult to experiment with new indexing and searching technologies without an infrastructure for crawling. Unfortunately, crawling is a very heavyweight operation that is not well suited to applications on individual clients.
To address these problems, we propose a distributed crawler that uses the excess bandwidth and computing resources of clients to crawl the web. This co-operative crawler uses a Distributed Hash Table (DHT) as a scalable overlay for easy coordination and distribution of the crawl workload. In designing and building the crawler we study different crawl distribution strategies and investigate the tradeoffs in bandwidth utilization, crawl throughput, balancing load on the crawlers as well as crawl targets, and the ability to exploit network proximity. We present a design and implementation of the distributed crawler using PIER, a relational query processor that runs over the Bamboo DHT, and compare different crawl strategies on PlanetLab querying live web sources.
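One way to picture a crawl distribution strategy over a DHT is as a choice of hashing key. The sketch below places crawler nodes on a simple consistent-hash ring and assigns each URL to the node that owns its key, hashing either the full URL (spreading load across crawlers) or only the hostname (so one crawler handles each target site). The `Ring` class, both functions, and the use of SHA-1 are assumptions for illustration; they are not PIER's or Bamboo's API, and the abstract does not say which strategies the authors compare.

```python
import bisect
import hashlib
from urllib.parse import urlsplit


def key_of(text):
    """Map a string to a 160-bit identifier, as DHTs commonly do."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")


class Ring:
    """Toy consistent-hash ring: a key is owned by its successor node."""

    def __init__(self, node_names):
        self.points = sorted((key_of(name), name) for name in node_names)
        self.ids = [node_id for node_id, _ in self.points]

    def owner(self, key):
        # Wrap around to the first node when the key exceeds every node ID.
        i = bisect.bisect_left(self.ids, key) % len(self.ids)
        return self.points[i][1]


def assign_by_url(ring, url):
    """Partition by full URL: pages of one site spread over many crawlers."""
    return ring.owner(key_of(url))


def assign_by_host(ring, url):
    """Partition by hostname: each site is crawled by a single node."""
    host = urlsplit(url).hostname or ""
    return ring.owner(key_of(host))
```

Hashing by hostname makes it easy for one crawler to rate-limit requests to a site, while hashing by URL tends to balance work across crawlers more evenly; this is the kind of trade-off between crawler load and target load that the abstract mentions.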

3:55
Caching Game
Marco Barreno and Byung-Gon Chun
[ppt,ps.gz,pdf] [ps.gz,pdf]
We introduce a novel game-theoretic model for caching to characterize the placement of replicated resources by server nodes that act selfishly.  Nodes incur cost for replicating resources, but if a node has demand for a resource it does not replicate, it incurs cost for access to a remote replica.
We investigate how the behavior of selfish servers compares with the social optimum.  We show that pure Nash equilibria exist for the game and the price of anarchy (cost of worst-case Nash equilibrium divided by cost of social optimum) can be O(n) in general due to undersupply problems. Under certain topologies, the price of anarchy does have tighter bounds.  For complete graphs and stars, it is O(1). For D-dimensional grids, it is O(n^(1-1/(D+1))). There are important phase transitions in the price of anarchy, such as when the placement cost exceeds the network diameter. We use extensive simulations to investigate how the price of anarchy changes as we vary the demand distribution, underlying physical topology, installation cost, and maintenance cost of replicated resources.
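The definition of the price of anarchy above can be checked by brute force on a tiny instance. The sketch below assumes a much simplified version of the game: one resource, unit demand at every node, a fixed placement cost `alpha` for a node that replicates, and an access cost equal to the distance to the nearest replica otherwise. These simplifications and all function names are assumptions for illustration, not the authors' model or code.

```python
from itertools import product

INF = float("inf")


def node_cost(i, placement, dist, alpha):
    """alpha if node i replicates, else the distance to the nearest replica."""
    if placement[i]:
        return alpha
    return min((dist[i][j] for j, has in enumerate(placement) if has), default=INF)


def social_cost(placement, dist, alpha):
    return sum(node_cost(i, placement, dist, alpha) for i in range(len(placement)))


def is_pure_nash(placement, dist, alpha):
    """True if no single node can strictly lower its own cost by flipping."""
    for i in range(len(placement)):
        flipped = list(placement)
        flipped[i] = not flipped[i]
        if node_cost(i, flipped, dist, alpha) < node_cost(i, placement, dist, alpha):
            return False
    return True


def price_of_anarchy(dist, alpha):
    """Worst pure-Nash social cost divided by the optimal social cost."""
    n = len(dist)
    placements = list(product([False, True], repeat=n))
    optimum = min(social_cost(p, dist, alpha) for p in placements)
    worst_nash = max(
        social_cost(p, dist, alpha)
        for p in placements
        if is_pure_nash(p, dist, alpha)
    )
    return worst_nash / optimum


# Path of three nodes 0 - 1 - 2 with unit-length edges.
PATH3 = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
```

For `PATH3` with `alpha = 1.5`, the optimum places a single replica at the middle node (cost 3.5), while placing replicas at both endpoints is also an equilibrium (cost 4), so `price_of_anarchy(PATH3, 1.5)` is 4 / 3.5, about 1.14. The enumeration is exponential in the number of nodes, so this is only practical for very small graphs.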