CS 267: Describe a Parallel Application

Author: Michael Tung

Background

Very Brief Bio

I am an undergraduate EECS major in my senior year. My hometown is a small city near Atlanta, Georgia, known for its pecans and high-tech startups. I’ve worked in development at Microsoft for two summers, on Mobile Devices and Longhorn OS. In my free time I like to listen to music, travel to exotic locales, and fight people (martial arts). Currently, I live in an apartment near campus that would make for a good group meeting spot.

Research Interests

My primary focus in computer science is on machine learning. Practical problems that involve applying machine learning approaches to discover hidden knowledge in large datasets appeal to me. Currently, I am working in the Berkeley Phylogenomics Group with Professor Kimmen Sjolander in BioE on a project to predict functional residues in proteins using a combination of sequence, structural, and biochemical information.

Why I took this class

The main reasons I took this class were to get a taste of graduate-level CS courses at this school and to further explore whether parallel machine techniques can be applied to my research. My first exploration of parallel machines was writing a Java API for the Parallel Problems Server and attempting to learn MPI programming. This piqued my interest in finding out more about how parallel machines can optimize machine learning methods.

Application Case Study: TREE-PUZZLE

Domain

Phylogenetic trees are data structures used in bioinformatics analysis to infer relationships between related protein sequences. These relationships can be helpful for autonomously classifying new protein sequences, functional prediction, and drug discovery. The leaves in the tree are the proteins themselves, and each branching represents a predicted evolutionary divergence. There are many approaches to phylogenetic tree reconstruction; one of the most popular is maximum likelihood estimation.

Description

TREE-PUZZLE is essentially a phylogenetic tree construction algorithm, a portion of which has been parallelized for multiprocessor systems using the Message Passing Interface (MPI). It uses the maximum-likelihood function for clustering. The authors use a master-worker architecture, which means that there is a central scheduler and a variable number of worker processes that perform the computation. This allows for scaling to large clusters or parallel machines. They use the Guided Self-Scheduling algorithm (Kuck, Hagerup) for load balancing. They claim almost perfect speedup on a cluster of 20 heterogeneous SUN workstations, with all the nodes kept busy.
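
To make the load-balancing idea concrete, here is a minimal sketch of the textbook Guided Self-Scheduling rule, in which each request for work receives roughly the remaining tasks divided by the number of workers. It is only an illustration: the function name gss_chunks and the task counts are hypothetical, there is no MPI here, and TREE-PUZZLE's actual scheduler may differ in its details.

```python
import math


def gss_chunks(total_tasks, num_workers):
    """Yield chunk sizes in the order a master would hand them out.

    Textbook Guided Self-Scheduling: each request gets
    ceil(remaining / num_workers) tasks. This is an assumed,
    simplified form, not TREE-PUZZLE's own code.
    """
    remaining = total_tasks
    while remaining > 0:
        chunk = math.ceil(remaining / num_workers)
        yield chunk
        remaining -= chunk


# Hypothetical workload: 100 independent tasks shared by 4 workers.
chunks = list(gss_chunks(100, 4))
print(chunks)
```

The chunks start large, which keeps scheduling overhead low, and shrink toward single tasks at the end, so that a slow worker on a heterogeneous cluster is never left holding a large final chunk.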

A paper in Molecular Biology and Evolution reports on the performance of the TREE-PUZZLE algorithm, showing that efficiency over traditional methods is increased substantially for very high substitution rates and increased only moderately for more homogeneous sequences. However, the authors point out that the algorithm is computationally faster and needs less memory. Whether this efficiency results from parallelization needs to be explored.

In the paper, they describe a test conducted on a large dataset (215 red algae small subunit rRNA sequences). On a 12-processor HP V-Class, they were able to perform the computation in 2 weeks, compared to 5.5 months for a non-parallelized version.
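
As a rough check on what these figures imply, the short calculation below converts both run times to weeks and computes the speedup and the per-processor efficiency. The month-to-week conversion is my own assumption, and the reported times are rounded, so the result is only approximate.

```python
# Figures quoted above; both are rounded in the source.
SERIAL_MONTHS = 5.5
PARALLEL_WEEKS = 2.0
PROCESSORS = 12

# Assumption: an average month is 52/12 weeks long.
WEEKS_PER_MONTH = 52 / 12

serial_weeks = SERIAL_MONTHS * WEEKS_PER_MONTH
speedup = serial_weeks / PARALLEL_WEEKS
efficiency = speedup / PROCESSORS

print(f"speedup ~ {speedup:.1f}x, efficiency ~ {efficiency:.2f}")
```

Under these assumptions the speedup comes out close to the processor count, which is consistent with the near-perfect speedup claimed for the workstation cluster.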

The TREE-PUZZLE manual states that TREE-PUZZLE cannot parallelize usertree evaluation or likelihood mapping analysis and runs these tasks entirely on the master process. By Amdahl’s Law, this serial portion caps the achievable speedup, so the speedup will level off for large inputs where these tasks take up a significant share of the run time. I want to explore whether these tasks can be parallelized as well.
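
The Amdahl's Law argument can be made concrete with a small sketch. The serial fraction used here (10%) is a made-up value for illustration only; I have not measured how much of a TREE-PUZZLE run is spent in the serial tasks.

```python
def amdahl_speedup(serial_fraction, processors):
    """Amdahl's Law: speedup = 1 / (s + (1 - s) / p)."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Hypothetical serial fraction; not a measured TREE-PUZZLE value.
SERIAL_FRACTION = 0.10

for p in (1, 12, 20, 1000):
    print(p, round(amdahl_speedup(SERIAL_FRACTION, p), 2))
```

With a 10% serial fraction, the speedup can never exceed 10 no matter how many processors are added, which is why parallelizing the remaining serial tasks is worth exploring.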

References

[1] www.tree-puzzle.de

[2] Bioinformatics Vol. 18 no. 3 2002. Pages 502-504

[3] Molecular Biology and Evolution 1997 Vol. 14: Pages 210-211.