Bio: I am a PhD student in Computer Science, interested in data-intensive computation in the datacenter and cloud computing. I would like to learn general techniques for optimizing parallel applications that can be applied to data-intensive computation in large distributed systems.
The application in question: MapReduce in a hybrid datacenter
MapReduce [1] is a software framework that supports data-intensive computation on large clusters. It is best suited to embarrassingly parallel problems and, unlike grid computing, centers on data-intensive tasks: it tries to co-locate the computation with the data instead of fetching the data to the computation site. In particular, this description is based on Hadoop [2], the open-source and most widely used MapReduce implementation.
At its core, a MapReduce program is centered around two functions: a map function and a reduce function. The framework converts the input data into key-value pairs ([K1, V1]), which the map function then translates into new output pairs ([K2, V2]). The framework then groups all values for a particular key together ([K2, <V21, V22, ...>]) and uses the reduce function to translate each group into a new output pair ([K3, V3]). Example values for these key-value pairs, using a MapReduce word count program and the text "coast to coast", can be found in Table 1. In reality, a MapReduce program can be composed of multiple map and reduce phases, with the reduce output from one phase serving as the map input for the next.
Table 1. MapReduce Word Count Example

| Stage | Key-value pairs |
|---|---|
| Map input [K1, V1] | (0, "coast to coast") |
| Map output [K2, V2] | ("coast", 1), ("to", 1), ("coast", 1) |
| Grouped [K2, <V21, V22, ...>] | ("coast", <1, 1>), ("to", <1>) |
| Reduce output [K3, V3] | ("coast", 2), ("to", 1) |
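To make the data flow concrete, below is a minimal sketch of the word count map and reduce functions against Hadoop's org.apache.hadoop.mapreduce API. The class names are illustrative; the types and signatures follow the standard Hadoop word count tutorial pattern.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: [K1, V1] = (byte offset, line of text) -> [K2, V2] = (word, 1)
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE); // "coast to coast" -> (coast,1), (to,1), (coast,1)
    }
  }
}

// Reduce: [K2, <V21, V22, ...>] = (word, <1, 1, ...>) -> [K3, V3] = (word, total)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    context.write(word, new IntWritable(sum)); // (coast, 2), (to, 1)
  }
}
```

Note that the grouping step between the two functions is performed by the framework; the programmer writes only map and reduce.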
In terms of the actual MapReduce architecture, there are four separate components. The client submits the MapReduce job written by the programmer. A global JobTracker schedules the submitted job across the MapReduce cluster and handles node failures. A per-node TaskTracker runs and tracks each sub-component of the job (known as a task). Finally, a distributed file system stores the input and final output data. HDFS is the file system most commonly used with Hadoop, but other alternatives, including cloud storage-based systems, also exist.
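As a sketch of how these components meet, the client-side driver below configures a job and submits it, after which the JobTracker schedules its tasks onto TaskTrackers and the input and output live on the distributed file system. This follows the classic Hadoop 1.x tutorial pattern; the command-line paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");        // the "client" role
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output on HDFS
    // Submission hands the job to the JobTracker, which splits it into tasks.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```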
- Data-local task assignment
Because a MapReduce program is expected to process large amounts of data, reducing data transfer time is critical to overall performance. To this end, the JobTracker tries to assign each map task to a node where the data to be processed resides; if that is not possible, it tries to pick a node in the same rack as the data.
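The following is an illustrative sketch of that preference order, not the actual JobTracker code; the MapTask type and its replica-location fields are hypothetical.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch: when a TaskTracker on `node` (in `rack`) asks for
// work, prefer node-local, then rack-local, then any remaining map task.
class LocalityAwareAssigner {
  static MapTask pickTask(String node, String rack, List<MapTask> pending) {
    for (MapTask t : pending) {                 // 1. node-local: input is on this node
      if (t.replicaNodes.contains(node)) return t;
    }
    for (MapTask t : pending) {                 // 2. rack-local: input is in this rack
      if (t.replicaRacks.contains(rack)) return t;
    }
    return pending.isEmpty() ? null : pending.get(0); // 3. remote read as a last resort
  }
}

class MapTask {
  Set<String> replicaNodes; // nodes holding a replica of this task's input split
  Set<String> replicaRacks; // racks containing those nodes
}
```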
- Speculative execution
Since all map tasks must finish before the reduce tasks can start, and all reduce tasks must finish before the whole job completes, tasks that run behind (laggards) have a significant impact on overall job completion time. To reduce the negative impact of laggards, the JobTracker monitors the progress of each task and speculatively launches redundant copies of tasks that are lagging.
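A rough sketch of the idea, with an assumed progress model and threshold rather than Hadoop's exact policy:

```java
import java.util.List;

// Hypothetical sketch: duplicate any task whose progress falls well behind
// the average; the first copy to finish wins, and the other is killed.
class SpeculationMonitor {
  static final double LAG_THRESHOLD = 0.2; // assumed gap that marks a laggard

  static void checkForLaggards(List<RunningTask> tasks) {
    double avg = tasks.stream().mapToDouble(t -> t.progress).average().orElse(0.0);
    for (RunningTask t : tasks) {
      if (!t.speculated && avg - t.progress > LAG_THRESHOLD) {
        t.speculated = true; // launch a redundant copy on another node (not shown)
      }
    }
  }
}

class RunningTask {
  double progress;    // fraction of the task completed, in [0, 1]
  boolean speculated; // avoid launching more than one redundant copy
}
```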
As a datacenter grows over time, its equipment is likely to span multiple generations of hardware. In this case it is important to place workloads on appropriate nodes intelligently, as this can make a huge difference in performance. Furthermore, we could deliberately build a datacenter from servers with different micro-architectures and different storage to exploit the advantages of a heterogeneous configuration [3].
Currently, the Hadoop implementation of MapReduce assumes a cluster of homogeneous nodes. There is previous work on improving the MapReduce scheduling policy to take heterogeneity into account [4], but in general Hadoop still treats all nodes equally.
Data-local task assignment is effective in many cases, but not always, especially in a hybrid datacenter that consists of different hardware. For example, if a map task is computation-intensive, it may make more sense to assign it to a remote but faster node. Similarly, speculative execution plays an important role in reducing overall completion time, but it is even more important to place tasks properly in the first place so that laggards are avoided. If the cluster consists of different hardware, this heterogeneity should be considered when making placement decisions.
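As an illustration of what a heterogeneity-aware decision might look like, the cost model below trades data locality against compute speed. It is entirely hypothetical: speedFactor, linkBytesPerSec, and the time estimates are assumptions, not anything Hadoop provides.

```java
// Hypothetical cost model: run the task wherever estimated compute time plus
// estimated input transfer time is lowest, instead of always preferring the
// data-local node.
class HeterogeneousPlacement {
  static Node choose(TaskEstimate task, Node localSlow, Node remoteFast,
                     double linkBytesPerSec) {
    double localCost  = task.computeSeconds / localSlow.speedFactor;  // data already here
    double remoteCost = task.computeSeconds / remoteFast.speedFactor
                      + task.inputBytes / linkBytesPerSec;            // pay for the transfer
    return remoteCost < localCost ? remoteFast : localSlow;
  }
}

class TaskEstimate {
  double computeSeconds; // estimated compute time on a baseline node
  long inputBytes;       // size of the input split to move if run remotely
}

class Node {
  double speedFactor; // relative CPU speed; 1.0 = baseline generation
}
```

Under this model, a computation-intensive task (large computeSeconds, small inputBytes) migrates to the fast node, while an I/O-bound task stays data-local.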
[1] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," OSDI '04.
[2] Apache Hadoop, http://hadoop.apache.org/
[3] Byung-Gon Chun, Gianluca Iannaccone, Giuseppe Iannaccone, Randy Katz, Gunho Lee, and Luca Niccolini, "An Energy Case for Hybrid Datacenters," HotPower '09.
[4] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica, "Improving MapReduce Performance in Heterogeneous Environments," OSDI '08.