MAPREDUCE-6151. Update document of GridMix (Masatake Iwasaki via aw)

Allen Wittenauer 2015-01-30 11:08:20 -08:00
parent f2c91098c4
commit 12e883007c
3 changed files with 57 additions and 53 deletions


@ -266,6 +266,8 @@ Release 2.7.0 - UNRELEASED
 MAPREDUCE-6150. Update document of Rumen (Masatake Iwasaki via aw)
+MAPREDUCE-6151. Update document of GridMix (Masatake Iwasaki via aw)
 OPTIMIZATIONS
 MAPREDUCE-6169. MergeQueue should release reference to the current item


@ -105,6 +105,7 @@
 <item name="Hadoop Streaming" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html"/>
 <item name="Hadoop Archives" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html"/>
 <item name="DistCp" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistCp.html"/>
+<item name="GridMix" href="hadoop-gridmix/GridMix.html"/>
 <item name="Rumen" href="hadoop-rumen/Rumen.html"/>
 </menu>


@ -38,21 +38,14 @@ Overview
 GridMix is a benchmark for Hadoop clusters. It submits a mix of
 synthetic jobs, modeling a profile mined from production loads.
-There exist three versions of the GridMix tool. This document
-discusses the third (checked into `src/contrib` ), distinct
-from the two checked into the `src/benchmarks` sub-directory.
-While the first two versions of the tool included stripped-down versions
-of common jobs, both were principally saturation tools for stressing the
-framework at scale. In support of a broader range of deployments and
-finer-tuned job mixes, this version of the tool will attempt to model
+This version of the tool will attempt to model
 the resource profiles of production jobs to identify bottlenecks, guide
-development, and serve as a replacement for the existing GridMix
-benchmarks.
+development.
 To run GridMix, you need a MapReduce job trace describing the job mix
-for a given cluster. Such traces are typically generated by Rumen (see
-Rumen documentation). GridMix also requires input data from which the
+for a given cluster. Such traces are typically generated by
+[Rumen](../hadoop-rumen/Rumen.html).
+GridMix also requires input data from which the
 synthetic jobs will be reading bytes. The input data need not be in any
 particular format, as the synthetic jobs are currently binary readers.
 If you are running on a new cluster, an optional step generating input
@ -62,10 +55,15 @@ on the same or another cluster, follow these steps:
 1. Locate the job history files on the production cluster. This
 location is specified by the
-`mapred.job.tracker.history.completed.location`
+`mapreduce.jobhistory.done-dir` or
+`mapreduce.jobhistory.intermediate-done-dir`
 configuration property of the cluster.
+([MapReduce historyserver](../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredCommands.html#historyserver)
+moves job history files from `mapreduce.jobhistory.done-dir`
+to `mapreduce.jobhistory.intermediate-done-dir`.)
-2. Run Rumen to build a job trace in JSON format for all or select jobs.
+2. Run [Rumen](../hadoop-rumen/Rumen.html)
+to build a job trace in JSON format for all or select jobs.
 3. Use GridMix with the job trace on the benchmark cluster.
@ -79,13 +77,17 @@ Usage
 Basic command-line usage without configuration parameters:
-org.apache.hadoop.mapred.gridmix.Gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>
+```
+java org.apache.hadoop.mapred.gridmix.Gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>
+```
 Basic command-line usage with configuration parameters:
-org.apache.hadoop.mapred.gridmix.Gridmix \
+```
+java org.apache.hadoop.mapred.gridmix.Gridmix \
 -Dgridmix.client.submit.threads=10 -Dgridmix.output.directory=foo \
 [-generate <size>] [-users <users-list>] <iopath> <trace>
+```
 > Configuration parameters like
 > `-Dgridmix.client.submit.threads=10` and
@ -102,6 +104,8 @@ The `-generate` option is used to generate input data and
 Distributed Cache files for the synthetic jobs. It accepts standard units
 of size suffixes, e.g. `100g` will generate
 100 * 2<sup>30</sup> bytes as input data.
+The minimum size of input data in compressed format (128MB by default)
+is defined by `gridmix.min.file.size`.
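The size arithmetic described in this hunk can be sketched in shell. `size_to_bytes` is a hypothetical helper and not part of GridMix. It assumes that the suffixes `k`, `m`, `g`, and `t` are binary multiples (2<sup>10</sup> through 2<sup>40</sup>), which the text confirms only for `g`, and that "128MB" means 128 * 2<sup>20</sup> bytes.

```shell
# Hypothetical helper, not part of GridMix: expand a -generate style size
# such as 100g into bytes. Binary suffixes other than g are an assumption.
size_to_bytes() {
  num=${1%[kKmMgGtT]}      # numeric part, with any one-letter suffix removed
  suffix=${1#"$num"}       # the suffix itself, or empty if none was given
  case "$suffix" in
    k|K) bits=10 ;;
    m|M) bits=20 ;;
    g|G) bits=30 ;;
    t|T) bits=40 ;;
    *)   bits=0 ;;
  esac
  echo $(( num << bits ))  # num * 2^bits
}

size_to_bytes 100g   # 100 * 2^30 bytes, the example from the text
size_to_bytes 128m   # the default minimum, assuming 128MB = 128 * 2^20 bytes
```

Comparing the two results gives a quick sanity check that a requested `-generate` size is not below the configured minimum.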
 `<iopath>/input` is the destination directory for
 generated input data and/or the directory from which input data will be
 read. HDFS-based Distributed Cache files are generated under the
@ -121,16 +125,17 @@ uncompressed. Use "-" as the value of this parameter if you
 want to pass an *uncompressed* trace via the standard
 input-stream of GridMix.
-The class `org.apache.hadoop.mapred.gridmix.Gridmix` can
-be found in the JAR
-`contrib/gridmix/hadoop-gridmix-$VERSION.jar` inside your
-Hadoop installation, where `$VERSION` corresponds to the
-version of Hadoop installed. A simple way of ensuring that this class
-and all its dependencies are loaded correctly is to use the
-`hadoop` wrapper script in Hadoop:
+GridMix expects certain library *JARs* to be present in the *CLASSPATH*.
+One simple way to run GridMix is to use `hadoop jar` command to run it.
+You also need to add the JAR of Rumen to classpath for both of client and tasks
+as example shown below.
-hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+```
+HADOOP_CLASSPATH=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-2.5.1.jar \
+$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-gridmix-2.5.1.jar \
+-libjars $HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-2.5.1.jar \
 [-generate <size>] [-users <users-list>] <iopath> <trace>
+```
 The supported configuration parameters are explained in the
 following sections.
@ -262,14 +267,14 @@ recorded in the trace. It constructs jobs of two types:
 </td>
 <td>A synthetic job where each task does *nothing* but sleep
 for a certain duration as observed in the production trace. The
-scalability of the Job Tracker is often limited by how many
+scalability of the ResourceManager is often limited by how many
 heartbeats it can handle every second. (Heartbeats are periodic
-messages sent from Task Trackers to update their status and grab new
-tasks from the Job Tracker.) Since a benchmark cluster is typically
+messages sent from NodeManagers to update their status and grab new
+tasks from the ResourceManager.) Since a benchmark cluster is typically
 a fraction in size of a production cluster, the heartbeat traffic
 generated by the slave nodes is well below the level of the
-production cluster. One possible solution is to run multiple Task
-Trackers on each slave node. This leads to the obvious problem that
+production cluster. One possible solution is to run multiple
+NodeManagers on each slave node. This leads to the obvious problem that
 the I/O workload generated by the synthetic jobs would thrash the
 slave nodes. Hence the need for such a job.</td>
 </tr>
@ -334,7 +339,7 @@ Job Submission Policies
 GridMix controls the rate of job submission. This control can be
 based on the trace information or can be based on statistics it gathers
-from the Job Tracker. Based on the submission policies users define,
+from the ResourceManager. Based on the submission policies users define,
 GridMix uses the respective algorithm to control the job submission.
 There are currently three types of policies:
@ -407,8 +412,8 @@ The following configuration parameters affect the job submission policy:
 <td>
 <code>gridmix.throttle.jobs-to-tracker-ratio</code>
 </td>
-<td>In STRESS mode, the minimum ratio of running jobs to Task
-Trackers in a cluster for the cluster to be considered
+<td>In STRESS mode, the minimum ratio of running jobs to
+NodeManagers in a cluster for the cluster to be considered
 *overloaded* . This is the threshold TJ referred to earlier.
 The default is 1.0.</td>
 </tr>
@ -689,19 +694,15 @@ Emulating High-Ram jobs
 -----------------------
 MapReduce allows users to define a job as a High-Ram job. Tasks from a
-High-Ram job can occupy multiple slots on the task-trackers.
-Task-tracker assigns fixed virtual memory for each slot. Tasks from
-High-Ram jobs can occupy multiple slots and thus can use up more
-virtual memory as compared to a default task.
-Emulating this behavior is important because of the following reasons
+High-Ram job can occupy larger fraction of memory in task processes.
+Emulating this behavior is important because of the following reasons.
 * Impact on scheduler: Scheduling of tasks from High-Ram jobs
-impacts the scheduling behavior as it might result into slot
-reservation and slot/resource utilization.
-* Impact on the node : Since High-Ram tasks occupy multiple slots,
-trackers do some bookkeeping for allocating extra resources for
+impacts the scheduling behavior as it might result into
+resource reservation and utilization.
+* Impact on the node : Since High-Ram tasks occupy larger memory,
+NodeManagers do some bookkeeping for allocating extra resources for
 these tasks. Thus this becomes a precursor for memory emulation
 where tasks with high memory requirements needs to be considered
 as a High-Ram task.
@ -808,11 +809,11 @@ job traces and cannot be accurately reproduced in GridMix:
 Appendix
 --------
+There exist older versions of the GridMix tool.
 Issues tracking the original implementations of
-<a href="https://issues.apache.org/jira/browse/HADOOP-2369">GridMix1</a>,
-<a href="https://issues.apache.org/jira/browse/HADOOP-3770">GridMix2</a>,
-and <a href="https://issues.apache.org/jira/browse/MAPREDUCE-776">GridMix3</a>
+[GridMix1](https://issues.apache.org/jira/browse/HADOOP-2369),
+[GridMix2](https://issues.apache.org/jira/browse/HADOOP-3770),
+and [GridMix3](https://issues.apache.org/jira/browse/MAPREDUCE-776)
 can be found on the Apache Hadoop MapReduce JIRA. Other issues tracking
 the current development of GridMix can be found by searching
-<a href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313086">
-the Apache Hadoop MapReduce JIRA</a>
+[the Apache Hadoop MapReduce JIRA](https://issues.apache.org/jira/browse/MAPREDUCE/component/12313086).