Rajesh Nishtala (rajeshn@eecs.berkeley.edu)
CS 267 Assignment 0
Who am I?
I am a first-year graduate student working under Professors Kathy Yelick and James Demmel in the BeBop group. Until now I have been working on the problem of automatically tuning sparse matrix-vector multiplication on various architectures. My interests include computer systems and parallel and high-performance computing. I graduated with honors from UC Berkeley with a B.S. in Electrical Engineering and Computer Science, with an emphasis in Computer Systems, in May of 2003.

source: http://folding.stanford.edu
Problem Description:
One of the fundamental elements in biology is the protein, because it acts as
the biological engine that enables a wide variety of reactions within the body.
Proteins link together to form structural elements such as bones, skin, and
hair. Proteins also act as the antibodies that defend against intruders in the
body. The composition of these proteins and their structure give clues to how
they work. The Human Genome Project, which was recently completed, mapped out
the genetic sequences that encode the proteins in the body; however, this
information is insufficient to indicate how they work. Proteins must also
assemble themselves (a process known as folding) properly before they can
carry out their reactions. This folding process is where the interesting part
of proteins lies. If we can figure out how proteins perform these actions
quickly and reliably, then it may be possible to create synthetic polymers
that do the same thing, thereby tackling certain diseases that arise from
protein misfolding, such as Mad Cow or Alzheimer's disease. One of the key
limitations, however, is that the mathematical equations used to simulate this
process are very complicated; with current technology it takes about 1 day to
simulate a nanosecond of folding time. Proteins fold on the order of tens of
microseconds, so a single machine would need roughly 10,000 days or more to
finish the calculation, which motivates the use of a massively parallel
machine.
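The estimate above can be checked with a little arithmetic. The sketch below uses only the figures quoted in the text (about 1 day per simulated nanosecond, folding times of tens of microseconds). The function names are made up for illustration, and the parallel figure assumes perfect speedup, which a real simulation would not achieve.

```python
# Back-of-the-envelope estimate using the figures quoted in the text.
DAYS_PER_SIMULATED_NS = 1.0   # from the text: about 1 day per nanosecond
NS_PER_US = 1_000             # 1 microsecond = 1,000 nanoseconds


def serial_days(fold_time_us, days_per_ns=DAYS_PER_SIMULATED_NS):
    """Wall-clock days to simulate fold_time_us microseconds on one machine."""
    return fold_time_us * NS_PER_US * days_per_ns


def ideal_parallel_days(fold_time_us, machines):
    """Lower bound on wall-clock days, assuming perfect speedup (an idealization)."""
    return serial_days(fold_time_us) / machines


if __name__ == "__main__":
    days = serial_days(10)
    print(f"{days:.0f} days, about {days / 365:.1f} years, on one machine")
    print(f"{ideal_parallel_days(10, 1000):.0f} days on 1,000 machines (ideal)")
```

Even a 10 microsecond fold comes to 10,000 days on a single machine, which is why the parallel systems described later matter.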
Evaluation of Simulations:
There are two different levels of simulation. The first models the interaction of every atom in a protein with every other atom at each time step. This has the advantage of being complete. Experiments have found this technique to be relatively accurate and able to predict protein behavior quite well; however, it costs a great deal of computer time because there are many complicated interactions to evaluate. The second level of simulation assumes that not all atoms interact with each other. This limits the number of interactions and therefore the complexity. The advantage is that the simulations finish much faster, but the results are less accurate than those of the full model. This simplification also creates artifacts in which physically impossible events appear to happen. The overall consensus is therefore to use the first model and hope to obtain the computational power it requires. The Folding@Home project has claimed to have simulated proteins that take 5-10 microseconds to fold and is currently working on more complex proteins.
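To make the cost difference between the two levels concrete, here is a minimal sketch that counts interacting pairs under each model. It is an illustration only: the distance cutoff is an assumed way of deciding which atoms "do not interact," the random coordinates stand in for a real protein, and the function names are hypothetical. Production molecular dynamics codes avoid the brute-force scan used here.

```python
import itertools
import math
import random


def all_pairs(positions):
    """Full model: every unordered pair of atoms, n*(n-1)/2 interactions."""
    return list(itertools.combinations(range(len(positions)), 2))


def pairs_within_cutoff(positions, cutoff):
    """Simplified model (assumed): keep only pairs closer than `cutoff`.

    This brute-force version still examines every pair; it only shows how
    many interactions survive, not how to find them quickly.
    """
    return [(i, j) for i, j in all_pairs(positions)
            if math.dist(positions[i], positions[j]) < cutoff]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    atoms = [tuple(rng.uniform(0, 10) for _ in range(3)) for _ in range(200)]
    print("all pairs:    ", len(all_pairs(atoms)))
    print("within cutoff:", len(pairs_within_cutoff(atoms, cutoff=2.0)))
```

The full model grows quadratically with the number of atoms, while the simplified model keeps only nearby pairs, which is where both the speedup and the loss of accuracy come from.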
Systems Used:
Current Systems:
One of the most novel approaches to protein folding has come from the Folding@Home group at Stanford University. They have created an application that can run on any personal computer to work on the problem. In essence, they have built a distributed computing system out of personal computers around the world. This project is very similar to the SETI@Home project from Berkeley. One of the main problems with this approach, however, is that the various parts of the simulation have to interact with each other, and it is very difficult for the participating machines to communicate with each other. The group has published many papers on protein folding with data gathered from the project, a testament to the strength of their system.
Future Systems:
The work by the Folding@Home team is not the only large-scale effort on the subject. IBM is planning to release a petaflop machine, the Blue Gene computer, to handle precisely this problem. The design of this machine should be well suited to the problem, since the supercomputer will employ some of the latest advances in interconnect technology as well as state-of-the-art microprocessors. Currently IBM is working on a 180-360 teraflop machine called Blue Gene/L, which will be the predecessor to Blue Gene. Blue Gene/L is designed to scale to 65,536 processors. The processors will be interconnected by three different networks: a 3-dimensional torus, a global tree, and Ethernet links.
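As a rough illustration of what a 3-dimensional torus provides, the sketch below computes a node's six nearest neighbors and the minimum hop count between two nodes when every dimension wraps around. The 64 x 32 x 32 shape is an assumption chosen only because it multiplies out to 65,536; the text does not give the dimensions, and the function names are hypothetical.

```python
# Assumed torus shape for illustration; 64 * 32 * 32 = 65,536 nodes.
DIMS = (64, 32, 32)


def torus_neighbors(coord, dims=DIMS):
    """Return the six nearest neighbors of `coord`, with wraparound links."""
    x, y, z = coord
    nx, ny, nz = dims
    return [
        ((x - 1) % nx, y, z), ((x + 1) % nx, y, z),
        (x, (y - 1) % ny, z), (x, (y + 1) % ny, z),
        (x, y, (z - 1) % nz), (x, y, (z + 1) % nz),
    ]


def torus_hops(a, b, dims=DIMS):
    """Minimum number of hops from a to b, going either way around each ring."""
    return sum(min(abs(p - q), n - abs(p - q)) for p, q, n in zip(a, b, dims))


if __name__ == "__main__":
    print(torus_neighbors((0, 0, 0)))
    print(torus_hops((0, 0, 0), (63, 31, 31)))  # opposite corner is close
    print(torus_hops((0, 0, 0), (32, 16, 16)))  # farthest node in this shape
```

Because of the wraparound links, the "opposite corner" of the machine is only a few hops away, and every node has the same number of direct neighbors, which suits a simulation whose work is divided by regions of 3-D space.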
In addition to the compute platforms themselves, a second project at IBM is designing a software system, built around a database, that runs on top of the Blue Gene machine. The system is called Blue Matter. The application framework that Blue Matter provides is an interface that lets users separate the molecular dynamics from the complexity of parallel programming. Thus users can simply encode their applications and let the Blue Matter system worry about the parallelism (in theory). This is an interesting approach because it is very difficult to use the full system effectively without some assistance with the parallelism. Once the Blue Gene system comes out, it will be interesting to see what percentage of the petaflop peak applications will be able to achieve.
Scalability:
One of the main problems with protein folding is the scalability of the simulation. The fundamental problem is that every atom has to interact with every other atom in the environment. If we assume that the work is divided among nodes according to the spatial position of the atoms in a grid, this implies a lot of communication between nodes. This is one of the main motivations for the 3-D topology of the Blue Gene network: it can efficiently handle communication between neighboring nodes. It is also very difficult to divide the problem along the time axis, since the system is causal and previous states are needed to calculate the next states. Scaling is therefore a very tricky problem in this domain, and methods are being developed to get around it.
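The communication cost of a spatial decomposition can be sketched by assigning each atom to the node that owns its region of a cubic box and then counting how many interacting pairs span two nodes. Everything here is an illustrative assumption: the cubic box, the 2 x 2 x 2 node grid, the optional distance cutoff, and the function names are not taken from any of the systems described above.

```python
import itertools
import math
import random


def owner(position, box, grid):
    """Map a position in a cubic box of side `box` to its (i, j, k) node."""
    return tuple(min(int(c / box * g), g - 1) for c, g in zip(position, grid))


def communication_stats(positions, box, grid, cutoff=None):
    """Count interacting pairs that stay on one node vs. cross node boundaries.

    With cutoff=None every pair interacts (the complete model); otherwise
    pairs at distance >= cutoff are skipped (the simplified model).
    """
    local = remote = 0
    for p, q in itertools.combinations(positions, 2):
        if cutoff is not None and math.dist(p, q) >= cutoff:
            continue
        if owner(p, box, grid) == owner(q, box, grid):
            local += 1
        else:
            remote += 1
    return local, remote


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    atoms = [tuple(rng.uniform(0, 10) for _ in range(3)) for _ in range(200)]
    print("all pairs (local, remote):  ", communication_stats(atoms, 10, (2, 2, 2)))
    print("cutoff 2.0 (local, remote): ", communication_stats(atoms, 10, (2, 2, 2), 2.0))
```

When every atom interacts with every other atom, most pairs cross a node boundary, so most of the work requires communication. Restricting interactions to nearby atoms means remote pairs mostly involve adjacent regions, which is the traffic pattern a nearest-neighbor network handles well.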
Conclusions:
The protein folding problem is a very interesting and very important biophysical simulation problem that needs to be solved. The implications of the findings could fundamentally change our understanding of proteins and how the body works, leading to advances in other fields of medicine. Interestingly enough, this problem is also important for computer science, since it seems to be a driving force behind building larger and more complicated parallel machines as well as developing application frameworks that can handle the large amount of parallelism.
References: