Bio: I am a PhD student in
Computer Science, interested in data-intensive computation in the datacenter
and in cloud computing. I would like to learn general techniques for optimizing
parallel applications that can be applied to data-intensive computation in large
distributed systems.
The application in question:
MapReduce in hybrid datacenter
MapReduce [1] is a software
framework that supports data-intensive computation on large clusters. It is
best suited for problems that are embarrassingly parallel and, unlike Grid computing,
is centered on data-intensive tasks where it tries to co-locate the computation
with the data instead of fetching the data to the computation site. In
particular, this description is based on Hadoop [2], the open-source and most
widely used MapReduce implementation.
At its core, a MapReduce program
is centered around two functions: a map function and a reduce function. The framework converts the input
data into key and value pairs ([K1, V1]) which the map function then
translates into new output pairs ([K2, V2]). The framework then groups all
values for a particular key together ([K2, <V21, V22, …>]), and uses the reduce
function to translate this group to a new output pair ([K3, V3]). Example values
for these key-value pairs, for a MapReduce Word Count program and the text
“coast to coast”,
can be found in Table 1. In reality, a MapReduce program can be composed of
multiple map and reduce phases with the reduce output from one phase serving as
the map input for the next.
Table 1. MapReduce Word Count Example
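The Word Count flow described above can be sketched in a few lines of plain Python. This is only an illustration of the map, group, and reduce steps on a single machine, not Hadoop code; the function names (map_word_count, group_by_key, reduce_word_count, run_word_count) and the use of the line index as K1 are assumptions made for the example.

```python
from collections import defaultdict


def map_word_count(key, value):
    """Map: (K1, V1) = (line index, line text) -> list of (K2, V2) = (word, 1)."""
    return [(word, 1) for word in value.split()]


def group_by_key(pairs):
    """Framework step: collect every value emitted for the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_word_count(key, values):
    """Reduce: (K2, <V21, V22, ...>) -> (K3, V3) = (word, total count)."""
    return (key, sum(values))


def run_word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    mapped = []
    for index, line in enumerate(lines):
        mapped.extend(map_word_count(index, line))
    grouped = group_by_key(mapped)
    return dict(reduce_word_count(k, v) for k, v in grouped.items())


print(run_word_count(["coast to coast"]))  # {'coast': 2, 'to': 1}
```

In a real cluster the map and reduce calls run on different nodes and the grouping step is performed by the framework; here everything runs in one process so the data flow is easy to follow.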

In terms of the actual MapReduce
architecture, there are four separate components. The client submits the MapReduce
job written by the programmer. A global JobTracker schedules the
submitted job across the MapReduce cluster and handles node failures. A
per-node TaskTracker tracks and runs each job sub-component (known as a task).
Finally, there is a distributed file system that is used to store the input and
final output data. HDFS is one of the more commonly used file systems with
Hadoop but other alternatives, including cloud storage-based systems, also
exist.
- Data-local task assignment
Because a MapReduce program is meant to process large amounts of data, reducing
data transfer time is critical to overall performance. To this end, the
JobTracker tries to assign each map task to a node that already stores the
task's input data. If that is not possible, the JobTracker tries to pick a node
in the same rack as the data.
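The preference order described here (same node, then same rack, then anywhere) can be sketched as follows. This is a simplified illustration rather than Hadoop's scheduler: the function pick_node, its parameters, and the locality labels are hypothetical, and the sketch selects a node for a given task instead of modeling how the scheduler and worker nodes actually communicate.

```python
def pick_node(block_replicas, free_nodes, rack_of):
    """Choose a node for a map task, preferring data locality.

    block_replicas: set of node names that store the task's input block
    free_nodes:     list of node names with a free task slot
    rack_of:        dict mapping node name -> rack name
    Returns (node, locality), or (None, "none") if no node is free.
    """
    # 1. Prefer a free node that already holds the input data.
    for node in free_nodes:
        if node in block_replicas:
            return node, "node-local"

    # 2. Otherwise prefer a free node in the same rack as some replica.
    replica_racks = {rack_of[n] for n in block_replicas}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"

    # 3. Fall back to any free node; the data must cross racks.
    if free_nodes:
        return free_nodes[0], "remote"
    return None, "none"


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
print(pick_node({"n1"}, ["n3", "n2"], rack_of))  # ('n2', 'rack-local')
```

Note that this policy looks only at where the data is, not at how fast each node is, which is the limitation discussed later for hybrid datacenters.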
- Speculative execution
Because all map tasks must finish before the reduce tasks can start, and all
reduce tasks must finish before the whole job completes, tasks that fall behind
(laggards) have a significant impact on overall job completion time. To reduce
this impact, the JobTracker monitors the progress of each task and
speculatively launches redundant copies of lagging tasks.
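One simple way to express the laggard check is to compare each running task's progress with the average progress of all tasks. The sketch below assumes progress is reported as a fraction between 0 and 1; the function name find_laggards and the slack value of 0.2 are illustrative assumptions, not a statement of how Hadoop is configured.

```python
def find_laggards(progress, slack=0.2):
    """Return the ids of running tasks that trail the average progress.

    progress: dict mapping task id -> fraction complete in [0, 1]
    slack:    how far below the average a task may fall before it is
              considered a laggard (assumed value for illustration)
    """
    if not progress:
        return []
    average = sum(progress.values()) / len(progress)
    # Only unfinished tasks are candidates for a speculative copy.
    return sorted(
        task for task, done in progress.items()
        if done < 1.0 and done < average - slack
    )


progress = {"m1": 0.9, "m2": 0.8, "m3": 0.3, "m4": 1.0}
print(find_laggards(progress))  # ['m3']
```

A scheduler would launch a redundant copy of each returned task on another node and keep whichever copy finishes first.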
As a datacenter grows over time, its equipment is likely to span multiple
generations of hardware. In this case, it is important to place workloads
intelligently on appropriate nodes, because placement can make a large
difference in performance. Furthermore, one could deliberately build a
datacenter from servers with different micro-architectures and different storage
to exploit the advantages of a heterogeneous configuration [3].
Currently, the Hadoop implementation of MapReduce assumes a cluster of
homogeneous nodes. Some previous work improves the MapReduce scheduling policy
to take heterogeneity into account [4], but Hadoop still generally treats all
nodes equally.
Data-local task assignment is effective in many cases, but not always,
especially in a hybrid datacenter that consists of different hardware. For
example, if a map task is computation-intensive, it can make more sense to
assign the task to a remote but faster node. Similarly, speculative execution
plays an important role in reducing overall completion time, but it is more
important to place tasks properly in the first place so that laggards are
avoided. If the cluster consists of different hardware, this heterogeneity
should be considered when making placement decisions.
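The trade-off between data locality and node speed can be made concrete with a small cost model: estimated task time is the time to fetch the input (zero if the data is local) plus the compute time scaled by the node's relative speed. This is a sketch under stated assumptions, not an existing Hadoop policy; the Node fields, the speed metric, the default network bandwidth, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    speed: float     # relative compute speed, 1.0 = baseline (assumed metric)
    has_data: bool   # True if the task's input block is stored on this node


def estimated_time(node, compute_s, input_mb, network_mb_per_s):
    """Estimated seconds to run the task on this node."""
    transfer = 0.0 if node.has_data else input_mb / network_mb_per_s
    return transfer + compute_s / node.speed


def place_task(nodes, compute_s, input_mb, network_mb_per_s=100.0):
    """Pick the node with the smallest estimated completion time."""
    return min(
        nodes,
        key=lambda n: estimated_time(n, compute_s, input_mb, network_mb_per_s),
    )


nodes = [
    Node("old-local", speed=1.0, has_data=True),
    Node("new-remote", speed=4.0, has_data=False),
]
# Compute-heavy task: the faster remote node wins despite the transfer.
print(place_task(nodes, compute_s=40.0, input_mb=64.0).name)  # new-remote
# I/O-dominated task: the data-local node wins.
print(place_task(nodes, compute_s=0.5, input_mb=64.0).name)   # old-local
```

With these example numbers, fetching 64 MB costs 0.64 s, so the compute-heavy task takes about 10.64 s remotely versus 40 s locally, while the light task takes about 0.77 s remotely versus 0.5 s locally.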
[1] Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI '04
[2] Apache Hadoop, http://hadoop.apache.org/
[3] Byung-Gon Chun, Gianluca Iannaccone, Giuseppe Iannaccone, Randy Katz, Gunho Lee, Luca Niccolini, An Energy Case for Hybrid Datacenters, HotPower '09
[4] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz and Ion Stoica, Improving MapReduce Performance in Heterogeneous Environments, OSDI '08