MAPREDUCE-6151. Update document of GridMix (Masatake Iwasaki via aw)
This commit is contained in:
parent f2c91098c4
commit 12e883007c
```diff
@@ -266,6 +266,8 @@ Release 2.7.0 - UNRELEASED
 
 MAPREDUCE-6150. Update document of Rumen (Masatake Iwasaki via aw)
 
+MAPREDUCE-6151. Update document of GridMix (Masatake Iwasaki via aw)
+
 OPTIMIZATIONS
 
 MAPREDUCE-6169. MergeQueue should release reference to the current item
```
```diff
@@ -105,6 +105,7 @@
 <item name="Hadoop Streaming" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html"/>
 <item name="Hadoop Archives" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html"/>
 <item name="DistCp" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistCp.html"/>
+<item name="GridMix" href="hadoop-gridmix/GridMix.html"/>
 <item name="Rumen" href="hadoop-rumen/Rumen.html"/>
 </menu>
 
```
```diff
@@ -38,21 +38,14 @@ Overview
 
 GridMix is a benchmark for Hadoop clusters. It submits a mix of
 synthetic jobs, modeling a profile mined from production loads.
-
-There exist three versions of the GridMix tool. This document
-discusses the third (checked into `src/contrib` ), distinct
-from the two checked into the `src/benchmarks` sub-directory.
-While the first two versions of the tool included stripped-down versions
-of common jobs, both were principally saturation tools for stressing the
-framework at scale. In support of a broader range of deployments and
-finer-tuned job mixes, this version of the tool will attempt to model
+This version of the tool will attempt to model
 the resource profiles of production jobs to identify bottlenecks, guide
-development, and serve as a replacement for the existing GridMix
-benchmarks.
+development.
 
 To run GridMix, you need a MapReduce job trace describing the job mix
-for a given cluster. Such traces are typically generated by Rumen (see
-Rumen documentation). GridMix also requires input data from which the
+for a given cluster. Such traces are typically generated by
+[Rumen](../hadoop-rumen/Rumen.html).
+GridMix also requires input data from which the
 synthetic jobs will be reading bytes. The input data need not be in any
 particular format, as the synthetic jobs are currently binary readers.
 If you are running on a new cluster, an optional step generating input
```
```diff
@@ -62,10 +55,15 @@ on the same or another cluster, follow these steps:
 
 1. Locate the job history files on the production cluster. This
 location is specified by the
-`mapred.job.tracker.history.completed.location`
+`mapreduce.jobhistory.done-dir` or
+`mapreduce.jobhistory.intermediate-done-dir`
 configuration property of the cluster.
+([MapReduce historyserver](../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredCommands.html#historyserver)
+moves job history files from `mapreduce.jobhistory.intermediate-done-dir`
+to `mapreduce.jobhistory.done-dir`.)
 
-2. Run Rumen to build a job trace in JSON format for all or select jobs.
+2. Run [Rumen](../hadoop-rumen/Rumen.html)
+to build a job trace in JSON format for all or select jobs.
 
 3. Use GridMix with the job trace on the benchmark cluster.
 
```
```diff
@@ -79,13 +77,17 @@ Usage
 
 Basic command-line usage without configuration parameters:
 
-org.apache.hadoop.mapred.gridmix.Gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>
+```
+java org.apache.hadoop.mapred.gridmix.Gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>
+```
 
 Basic command-line usage with configuration parameters:
 
-org.apache.hadoop.mapred.gridmix.Gridmix \
+```
+java org.apache.hadoop.mapred.gridmix.Gridmix \
 -Dgridmix.client.submit.threads=10 -Dgridmix.output.directory=foo \
 [-generate <size>] [-users <users-list>] <iopath> <trace>
+```
 
 > Configuration parameters like
 > `-Dgridmix.client.submit.threads=10` and
```
```diff
@@ -102,6 +104,8 @@ The `-generate` option is used to generate input data and
 Distributed Cache files for the synthetic jobs. It accepts standard units
 of size suffixes, e.g. `100g` will generate
 100 * 2<sup>30</sup> bytes as input data.
+The minimum size of input data in compressed format (128MB by default)
+is defined by `gridmix.min.file.size`.
 `<iopath>/input` is the destination directory for
 generated input data and/or the directory from which input data will be
 read. HDFS-based Distributed Cache files are generated under the
```
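The hunk above documents binary size suffixes: `100g` means 100 * 2<sup>30</sup> bytes. A minimal sketch of that suffix arithmetic, for illustration only (the function name and suffix table are ours, not GridMix's actual parser):

```python
# Illustrative sketch of binary size-suffix arithmetic as documented above.
# This is NOT GridMix's real argument parser; names here are hypothetical.
SUFFIXES = {"k": 2**10, "m": 2**20, "g": 2**30, "t": 2**40}

def parse_size(spec: str) -> int:
    """Convert a size spec like '100g' into a byte count (binary units)."""
    spec = spec.strip().lower()
    if spec and spec[-1] in SUFFIXES:
        return int(spec[:-1]) * SUFFIXES[spec[-1]]
    return int(spec)  # no suffix: a plain byte count

# '-generate 100g' therefore requests 100 * 2**30 bytes of input data.
```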
```diff
@@ -121,16 +125,17 @@ uncompressed. Use "-" as the value of this parameter if you
 want to pass an *uncompressed* trace via the standard
 input-stream of GridMix.
 
-The class `org.apache.hadoop.mapred.gridmix.Gridmix` can
-be found in the JAR
-`contrib/gridmix/hadoop-gridmix-$VERSION.jar` inside your
-Hadoop installation, where `$VERSION` corresponds to the
-version of Hadoop installed. A simple way of ensuring that this class
-and all its dependencies are loaded correctly is to use the
-`hadoop` wrapper script in Hadoop:
+GridMix expects certain library *JARs* to be present in the *CLASSPATH*.
+One simple way to run GridMix is to use the `hadoop jar` command.
+You also need to add the Rumen JAR to the classpath of both the client
+and the tasks, as in the example below.
 
-hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+```
+HADOOP_CLASSPATH=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-2.5.1.jar \
+$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-gridmix-2.5.1.jar \
+-libjars $HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-2.5.1.jar \
 [-generate <size>] [-users <users-list>] <iopath> <trace>
+```
 
 The supported configuration parameters are explained in the
 following sections.
```
```diff
@@ -262,14 +267,14 @@ recorded in the trace. It constructs jobs of two types:
 </td>
 <td>A synthetic job where each task does *nothing* but sleep
 for a certain duration as observed in the production trace. The
-scalability of the Job Tracker is often limited by how many
+scalability of the ResourceManager is often limited by how many
 heartbeats it can handle every second. (Heartbeats are periodic
-messages sent from Task Trackers to update their status and grab new
-tasks from the Job Tracker.) Since a benchmark cluster is typically
+messages sent from NodeManagers to update their status and grab new
+tasks from the ResourceManager.) Since a benchmark cluster is typically
 a fraction in size of a production cluster, the heartbeat traffic
 generated by the slave nodes is well below the level of the
-production cluster. One possible solution is to run multiple Task
-Trackers on each slave node. This leads to the obvious problem that
+production cluster. One possible solution is to run multiple
+NodeManagers on each slave node. This leads to the obvious problem that
 the I/O workload generated by the synthetic jobs would thrash the
 slave nodes. Hence the need for such a job.</td>
 </tr>
```
```diff
@@ -334,7 +339,7 @@ Job Submission Policies
 
 GridMix controls the rate of job submission. This control can be
 based on the trace information or can be based on statistics it gathers
-from the Job Tracker. Based on the submission policies users define,
+from the ResourceManager. Based on the submission policies users define,
 GridMix uses the respective algorithm to control the job submission.
 There are currently three types of policies:
 
```
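The three policies referenced in the hunk above are named STRESS, REPLAY, and SERIAL in the full GridMix document. A simplified sketch of how a submitter might decide the next submission under each policy (illustrative only; GridMix's real scheduler in `org.apache.hadoop.mapred.gridmix` is more involved):

```python
# Hypothetical sketch of the three job-submission policies; not GridMix's code.
from enum import Enum

class Policy(Enum):
    REPLAY = "replay"   # reproduce the inter-job gaps recorded in the trace
    SERIAL = "serial"   # submit the next job only when the previous one finishes
    STRESS = "stress"   # keep submitting while the cluster is underloaded

def next_submission_delay(policy, trace_gap_sec, prev_job_done, overloaded):
    """Return seconds to wait before the next submission, or None to keep waiting."""
    if policy is Policy.REPLAY:
        return trace_gap_sec                  # honor the recorded gap
    if policy is Policy.SERIAL:
        return 0 if prev_job_done else None   # block until completion
    # STRESS: submit immediately unless the cluster is already overloaded
    return None if overloaded else 0
```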
```diff
@@ -407,8 +412,8 @@ The following configuration parameters affect the job submission policy:
 <td>
 <code>gridmix.throttle.jobs-to-tracker-ratio</code>
 </td>
-<td>In STRESS mode, the minimum ratio of running jobs to Task
-Trackers in a cluster for the cluster to be considered
+<td>In STRESS mode, the minimum ratio of running jobs to
+NodeManagers in a cluster for the cluster to be considered
 *overloaded* . This is the threshold TJ referred to earlier.
 The default is 1.0.</td>
 </tr>
```
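The TJ threshold described above is a simple ratio test. A hedged sketch of that check, with illustrative names (the real logic lives inside GridMix's STRESS-mode submission code):

```python
# Illustrative overload check for STRESS mode; not GridMix's actual code.
def is_overloaded(running_jobs: int, node_managers: int,
                  threshold_tj: float = 1.0) -> bool:
    """The cluster counts as overloaded once running_jobs / node_managers
    reaches the TJ threshold (gridmix.throttle.jobs-to-tracker-ratio,
    default 1.0). While overloaded, STRESS mode pauses submissions."""
    if node_managers <= 0:
        return True  # no capacity to submit against
    return running_jobs / node_managers >= threshold_tj
```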
```diff
@@ -689,19 +694,15 @@ Emulating High-Ram jobs
 -----------------------
 
 MapReduce allows users to define a job as a High-Ram job. Tasks from a
-High-Ram job can occupy multiple slots on the task-trackers.
-Task-tracker assigns fixed virtual memory for each slot. Tasks from
-High-Ram jobs can occupy multiple slots and thus can use up more
-virtual memory as compared to a default task.
-
-Emulating this behavior is important because of the following reasons
+High-Ram job can occupy a larger fraction of memory in task processes.
+Emulating this behavior is important for the following reasons.
 
 * Impact on scheduler: Scheduling of tasks from High-Ram jobs
-impacts the scheduling behavior as it might result into slot
-reservation and slot/resource utilization.
+impacts the scheduling behavior as it might result in
+resource reservation and utilization.
 
-* Impact on the node : Since High-Ram tasks occupy multiple slots,
-trackers do some bookkeeping for allocating extra resources for
+* Impact on the node: Since High-Ram tasks occupy larger memory,
+NodeManagers do some bookkeeping for allocating extra resources for
 these tasks. Thus this becomes a precursor for memory emulation
 where tasks with high memory requirements needs to be considered
 as a High-Ram task.
```
```diff
@@ -808,11 +809,11 @@ job traces and cannot be accurately reproduced in GridMix:
 Appendix
 --------
 
+There exist older versions of the GridMix tool.
 Issues tracking the original implementations of
-<a href="https://issues.apache.org/jira/browse/HADOOP-2369">GridMix1</a>,
-<a href="https://issues.apache.org/jira/browse/HADOOP-3770">GridMix2</a>,
-and <a href="https://issues.apache.org/jira/browse/MAPREDUCE-776">GridMix3</a>
+[GridMix1](https://issues.apache.org/jira/browse/HADOOP-2369),
+[GridMix2](https://issues.apache.org/jira/browse/HADOOP-3770),
+and [GridMix3](https://issues.apache.org/jira/browse/MAPREDUCE-776)
 can be found on the Apache Hadoop MapReduce JIRA. Other issues tracking
 the current development of GridMix can be found by searching
-<a href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313086">
-the Apache Hadoop MapReduce JIRA</a>
+[the Apache Hadoop MapReduce JIRA](https://issues.apache.org/jira/browse/MAPREDUCE/component/12313086).
```