Author: Michael Tung
I am an undergraduate EECS major in my senior year. My hometown is a small city near
My primary focus in computer science is on machine learning. Practical problems that involve applying machine learning approaches to discover hidden knowledge in large datasets appeal to me. Currently, I am working in the Berkeley Phylogenomics Group with Professor Kimmen Sjolander in BioE on a project to predict functional residues in proteins using a combination of sequence, structural, and biochemical information.
The main reasons I took this class were to get a taste of graduate-level CS
courses at this school and to further explore whether parallel machine
techniques can be applied to my research. My first exploration of parallel
machines was writing a Java interface for the Parallel Problems Server and
attempting to learn MPI programming. This piqued my interest in finding out
more about how parallel machines can optimize machine learning methods.
Phylogenetic trees are data structures used in bioinformatics analysis to infer
relationships between related protein sequences. These relationships can be
helpful for automatically classifying new protein sequences, predicting
function, and drug discovery. The leaves of the tree are the proteins
themselves, and each branching represents a predicted evolutionary divergence.
There are many approaches to phylogenetic tree reconstruction, but one of the
most popular is maximum likelihood estimation.
TREE-PUZZLE is essentially a phylogenetic tree construction algorithm with a
portion that has been parallelized for multi-processor frameworks using the
Message Passing Interface (MPI). It uses the maximum-likelihood function for
clustering. The authors use a master-worker architecture, which means that
there is a central scheduler and a variable number of worker processes that
perform the computation. This allows for scaling to large clusters or parallel
machines. They use the Guided Self-Scheduling algorithm (Kuck, Hagerup) for
load balancing. They claim almost perfect speedup on a cluster of 20
heterogeneous Sun workstations, keeping all the nodes busy.
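To make the load-balancing idea concrete, here is a minimal sketch of how guided self-scheduling sizes the chunks of work a master hands out. It assumes the textbook rule, where each request receives the remaining work divided by the number of workers, rounded up. The function name and the task counts are hypothetical and are not taken from the TREE-PUZZLE source.

```python
import math


def guided_chunks(total_tasks, num_workers):
    """Yield successive chunk sizes under guided self-scheduling.

    Assumed rule: each request gets ceil(remaining / num_workers) tasks,
    so chunks start large and shrink as the work runs out.
    """
    if total_tasks < 0 or num_workers < 1:
        raise ValueError("need total_tasks >= 0 and num_workers >= 1")
    remaining = total_tasks
    while remaining > 0:
        chunk = math.ceil(remaining / num_workers)
        yield chunk
        remaining -= chunk


# Hypothetical example: 100 independent tasks shared among 4 workers.
chunks = list(guided_chunks(100, 4))
print(chunks)
```

Early chunks are large, which keeps scheduling overhead low, and late chunks are small, which lets slower machines in a heterogeneous cluster finish at about the same time as faster ones.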
A paper in Molecular Biology and Evolution reports the performance of the
TREE-PUZZLE algorithm, showing that efficiency over traditional methods is
increased substantially for very high substitution rates and only moderately
for more homogeneous sequences. However, the authors point out that the
algorithm is computationally faster and needs less memory. Whether this
efficiency results from parallelization needs to be explored. In the paper,
they describe a test conducted on a large dataset (215 red algae small subunit
rRNA sequences). On a 12-processor HP V-Class, they were able to perform the
computation in 2 weeks, compared to 5.5 months for a non-parallelized version.
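As a rough check on these figures, the speedup and parallel efficiency can be estimated directly from the two run times. This is only a back-of-the-envelope sketch: it assumes a 30-day month and that both runs used the same dataset and settings, neither of which is stated above.

```python
# Assumption: one month is treated as 30 days for this estimate.
DAYS_PER_MONTH = 30

serial_days = 5.5 * DAYS_PER_MONTH   # non-parallelized run, 5.5 months
parallel_days = 14                   # parallel run, 2 weeks
processors = 12                      # 12-processor HP V-Class

speedup = serial_days / parallel_days
efficiency = speedup / processors

print(f"speedup ~ {speedup:.1f}x, efficiency ~ {efficiency:.2f}")
```

Under these assumptions the speedup comes out to about 11.8 on 12 processors, which is close to linear.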

The TREE-PUZZLE manual states that TREE-PUZZLE cannot parallelize usertree
evaluation or likelihood mapping analysis and runs these tasks entirely on the
master process. By Amdahl's Law, this means that the achievable speedup is
limited by these serial portions, however many processors are added. I want to
explore whether these tasks can be parallelized as well.
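The effect of a serial portion can be illustrated with the standard form of Amdahl's Law, speedup = 1 / (s + (1 - s) / p), where s is the fraction of run time that stays serial and p is the number of processors. The serial fractions below are made-up values for illustration only, not measurements of TREE-PUZZLE.

```python
def amdahl_speedup(serial_fraction, num_processors):
    """Upper bound on speedup when serial_fraction of the run time is serial."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if num_processors < 1:
        raise ValueError("num_processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_processors)


# Hypothetical serial fractions, evaluated for a 12-processor machine.
for s in (0.0, 0.05, 0.10, 0.25):
    print(f"serial fraction {s:.2f}: speedup <= {amdahl_speedup(s, 12):.2f}")
```

Even a small serial fraction caps the benefit: with 10% of the run time on the master alone, the speedup can never exceed 10, regardless of the number of workers.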
[1] www.tree-puzzle.de
[2] Bioinformatics, Vol. 18, no. 3, 2002, pages 502-504.
[3] Molecular Biology and Evolution, Vol. 14, 1997, pages 210-211.