MAPREDUCE-6260. Convert site documentation to markdown (Masatake Iwasaki via aw)

This commit is contained in:
Allen Wittenauer 2015-02-17 06:52:14 -10:00
parent 34b78d51b5
commit 8b787e2fdb
17 changed files with 6586 additions and 7902 deletions


@@ -96,6 +96,9 @@ Trunk (Unreleased)
MAPREDUCE-6250. deprecate sbin/mr-jobhistory-daemon.sh (aw)
MAPREDUCE-6260. Convert site documentation to markdown (Masatake Iwasaki
via aw)
BUG FIXES
MAPREDUCE-6191. Improve clearing stale state of Java serialization


@@ -1,151 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop Map Reduce Next Generation-${project.version} - Distributed Cache Deploy
---
---
${maven.build.timestamp}
Hadoop MapReduce Next Generation - Distributed Cache Deploy
* Introduction
The MapReduce application framework has rudimentary support for deploying a
new version of the MapReduce framework via the distributed cache. By setting
the appropriate configuration properties, users can run a different version
of MapReduce than the one initially deployed to the cluster. For example,
cluster administrators can place multiple versions of MapReduce in HDFS and
configure <<<mapred-site.xml>>> to specify which version jobs will use by
default. This allows the administrators to perform a rolling upgrade of the
MapReduce framework under certain conditions.
* Preconditions and Limitations
The support for deploying the MapReduce framework via the distributed cache
currently does not address the job client code used to submit and query
jobs. It also does not address the <<<ShuffleHandler>>> code that runs as an
auxiliary service within each NodeManager. As a result, the following
limitations apply to MapReduce versions that can be successfully deployed via
the distributed cache in a rolling upgrade fashion:
* The MapReduce version must be compatible with the job client code used to
submit and query jobs. If it is incompatible then the job client must be
upgraded separately on any node from which jobs using the new MapReduce
version will be submitted or queried.
* The MapReduce version must be compatible with the configuration files used
by the job client submitting the jobs. If it is incompatible with that
configuration (e.g.: a new property must be set or an existing property
value changed) then the configuration must be updated first.
* The MapReduce version must be compatible with the <<<ShuffleHandler>>>
version running on the nodes in the cluster. If it is incompatible then the
new <<<ShuffleHandler>>> code must be deployed to all the nodes in the
cluster, and the NodeManagers must be restarted to pick up the new
<<<ShuffleHandler>>> code.
* Deploying a New MapReduce Version via the Distributed Cache
Deploying a new MapReduce version consists of three steps:
[[1]] Upload the MapReduce archive to a location that can be accessed by the
job submission client. Ideally the archive should be on the cluster's default
filesystem at a publicly-readable path. See the archive location discussion
below for more details.
[[2]] Configure <<<mapreduce.application.framework.path>>> to point to the
location where the archive is located. As when specifying distributed cache
files for a job, this is a URL that also supports creating an alias for the
archive if a URL fragment is specified. For example,
<<<hdfs:/mapred/framework/hadoop-mapreduce-${project.version}.tar.gz#mrframework>>>
will be localized as <<<mrframework>>> rather than
<<<hadoop-mapreduce-${project.version}.tar.gz>>>.
[[3]] Configure <<<mapreduce.application.classpath>>> to set the proper
classpath to use with the MapReduce archive configured above. NOTE: An error
occurs if <<<mapreduce.application.framework.path>>> is configured but
<<<mapreduce.application.classpath>>> does not reference the base name of the
archive path or the alias if an alias was specified.
** Location of the MapReduce Archive and How It Affects Job Performance
Note that the location of the MapReduce archive can be critical to job
submission and job startup performance. If the archive is not located on the
cluster's default filesystem then it will be copied to the job staging
directory for each job and localized to each node where the job's tasks
run. This will slow down job submission and task startup performance.
If the archive is located on the default filesystem then the job client will
not upload the archive to the job staging directory for each job
submission. However if the archive path is not readable by all cluster users
then the archive will be localized separately for each user on each node
where tasks execute. This can cause unnecessary duplication in the
distributed cache.
When working with a large cluster it can be important to increase the
replication factor of the archive to increase its availability. This will
spread the load when the nodes in the cluster localize the archive for the
first time.
* MapReduce Archives and Classpath Configuration
Setting a proper classpath for the MapReduce archive depends upon the
composition of the archive and whether it has any additional dependencies.
For example, the archive can contain not only the MapReduce jars but also the
necessary YARN, HDFS, and Hadoop Common jars and all other dependencies. In
that case, <<<mapreduce.application.classpath>>> would be configured to
something like the following example, where the archive basename is
hadoop-mapreduce-${project.version}.tar.gz and the archive is organized
internally similar to the standard Hadoop distribution archive:
<<<$HADOOP_CONF_DIR,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/lib/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/common/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/common/lib/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/yarn/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/yarn/lib/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/hdfs/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/hdfs/lib/*>>>
Another possible approach is to have the archive consist of just the
MapReduce jars and have the remaining dependencies picked up from the Hadoop
distribution installed on the nodes. In that case, the above example would
change to something like the following:
<<<$HADOOP_CONF_DIR,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/lib/*,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*>>>
** NOTE:
If shuffle encryption is also enabled in the cluster, MR jobs may fail with an exception like the following:
+---+
2014-10-10 02:17:16,600 WARN [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to junpingdu-centos5-3.cs1cloud.internal:13562 with 1 map outputs
javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at com.sun.net.ssl.internal.ssl.Alerts.getSSLException(Alerts.java:174)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1731)
at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Handshaker.java:241)
at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Handshaker.java:235)
at com.sun.net.ssl.internal.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1206)
at com.sun.net.ssl.internal.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:136)
at com.sun.net.ssl.internal.ssl.Handshaker.processLoop(Handshaker.java:593)
at com.sun.net.ssl.internal.ssl.Handshaker.process_record(Handshaker.java:529)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:925)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1170)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1197)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1181)
at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:434)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:81)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:61)
at sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:584)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1193)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:318)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:427)
....
+---+
This is because the MR client (deployed from HDFS) cannot access ssl-client.xml on the local filesystem under $HADOOP_CONF_DIR. To fix the problem, add the directory containing ssl-client.xml to the MapReduce classpath specified in "mapreduce.application.classpath" as mentioned above. To avoid the MR application being affected by other local configurations, it is better to create a dedicated directory for ssl-client.xml, e.g. a sub-directory under $HADOOP_CONF_DIR, such as $HADOOP_CONF_DIR/security.


@@ -1,320 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop Map Reduce Next Generation-${project.version} - Encrypted Shuffle
---
---
${maven.build.timestamp}
Hadoop MapReduce Next Generation - Encrypted Shuffle
* {Introduction}
The Encrypted Shuffle capability allows encryption of the MapReduce shuffle
using HTTPS and with optional client authentication (also known as
bi-directional HTTPS, or HTTPS with client certificates). It comprises:
* A Hadoop configuration setting for toggling the shuffle between HTTP and
HTTPS.
* Hadoop configuration settings for specifying the keystore and truststore
properties (location, type, passwords) used by the shuffle service and the
reducer tasks fetching shuffle data.
* A way to re-load truststores across the cluster (when a node is added or
removed).
* {Configuration}
** <<core-site.xml>> Properties
To enable encrypted shuffle, set the following properties in core-site.xml of
all nodes in the cluster:
*--------------------------------------+---------------------+-----------------+
| <<Property>> | <<Default Value>> | <<Explanation>> |
*--------------------------------------+---------------------+-----------------+
| <<<hadoop.ssl.require.client.cert>>> | <<<false>>> | Whether client certificates are required |
*--------------------------------------+---------------------+-----------------+
| <<<hadoop.ssl.hostname.verifier>>> | <<<DEFAULT>>> | The hostname verifier to provide for HttpsURLConnections. Valid values are: <<DEFAULT>>, <<STRICT>>, <<STRICT_I6>>, <<DEFAULT_AND_LOCALHOST>> and <<ALLOW_ALL>> |
*--------------------------------------+---------------------+-----------------+
| <<<hadoop.ssl.keystores.factory.class>>> | <<<org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory>>> | The KeyStoresFactory implementation to use |
*--------------------------------------+---------------------+-----------------+
| <<<hadoop.ssl.server.conf>>> | <<<ssl-server.xml>>> | Resource file from which ssl server keystore information will be extracted. This file is looked up in the classpath, typically it should be in the Hadoop conf/ directory |
*--------------------------------------+---------------------+-----------------+
| <<<hadoop.ssl.client.conf>>> | <<<ssl-client.xml>>> | Resource file from which ssl client keystore information will be extracted. This file is looked up in the classpath, typically it should be in the Hadoop conf/ directory |
*--------------------------------------+---------------------+-----------------+
| <<<hadoop.ssl.enabled.protocols>>> | <<<TLSv1>>> | The supported SSL protocols (JDK6 can use <<TLSv1>>, JDK7+ can use <<TLSv1,TLSv1.1,TLSv1.2>>) |
*--------------------------------------+---------------------+-----------------+
<<IMPORTANT:>> Currently requiring client certificates
(<<<hadoop.ssl.require.client.cert>>>) should be set to false.
Refer to the {{{ClientCertificates}Client Certificates}} section for details.
<<IMPORTANT:>> All these properties should be marked as final in the cluster
configuration files.
*** Example:
------
...
<property>
<name>hadoop.ssl.require.client.cert</name>
<value>false</value>
<final>true</final>
</property>
<property>
<name>hadoop.ssl.hostname.verifier</name>
<value>DEFAULT</value>
<final>true</final>
</property>
<property>
<name>hadoop.ssl.keystores.factory.class</name>
<value>org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory</value>
<final>true</final>
</property>
<property>
<name>hadoop.ssl.server.conf</name>
<value>ssl-server.xml</value>
<final>true</final>
</property>
<property>
<name>hadoop.ssl.client.conf</name>
<value>ssl-client.xml</value>
<final>true</final>
</property>
...
------
** <<<mapred-site.xml>>> Properties
To enable encrypted shuffle, set the following property in mapred-site.xml
of all nodes in the cluster:
*--------------------------------------+---------------------+-----------------+
| <<Property>> | <<Default Value>> | <<Explanation>> |
*--------------------------------------+---------------------+-----------------+
| <<<mapreduce.shuffle.ssl.enabled>>> | <<<false>>> | Whether encrypted shuffle is enabled |
*--------------------------------------+---------------------+-----------------+
<<IMPORTANT:>> This property should be marked as final in the cluster
configuration files.
*** Example:
------
...
<property>
<name>mapreduce.shuffle.ssl.enabled</name>
<value>true</value>
<final>true</final>
</property>
...
------
The Linux container executor should be set to prevent job tasks from
reading the server keystore information and gaining access to the shuffle
server certificates.
Refer to Hadoop Kerberos configuration for details on how to do this.
* {Keystore and Truststore Settings}
Currently <<<FileBasedKeyStoresFactory>>> is the only <<<KeyStoresFactory>>>
implementation. The <<<FileBasedKeyStoresFactory>>> implementation uses the
following properties, in the <<ssl-server.xml>> and <<ssl-client.xml>> files,
to configure the keystores and truststores.
** <<<ssl-server.xml>>> (Shuffle server) Configuration:
The mapred user should own the <<ssl-server.xml>> file and have exclusive
read access to it.
*---------------------------------------------+---------------------+-----------------+
| <<Property>> | <<Default Value>> | <<Explanation>> |
*---------------------------------------------+---------------------+-----------------+
| <<<ssl.server.keystore.type>>> | <<<jks>>> | Keystore file type |
*---------------------------------------------+---------------------+-----------------+
| <<<ssl.server.keystore.location>>> | NONE | Keystore file location. The mapred user should own this file and have exclusive read access to it. |
*---------------------------------------------+---------------------+-----------------+
| <<<ssl.server.keystore.password>>> | NONE | Keystore file password |
*---------------------------------------------+---------------------+-----------------+
| <<<ssl.server.truststore.type>>> | <<<jks>>> | Truststore file type |
*---------------------------------------------+---------------------+-----------------+
| <<<ssl.server.truststore.location>>> | NONE | Truststore file location. The mapred user should own this file and have exclusive read access to it. |
*---------------------------------------------+---------------------+-----------------+
| <<<ssl.server.truststore.password>>> | NONE | Truststore file password |
*---------------------------------------------+---------------------+-----------------+
| <<<ssl.server.truststore.reload.interval>>> | 10000 | Truststore reload interval, in milliseconds |
*--------------------------------------+----------------------------+-----------------+
*** Example:
------
<configuration>
<!-- Server Certificate Store -->
<property>
<name>ssl.server.keystore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.server.keystore.location</name>
<value>${user.home}/keystores/server-keystore.jks</value>
</property>
<property>
<name>ssl.server.keystore.password</name>
<value>serverfoo</value>
</property>
<!-- Server Trust Store -->
<property>
<name>ssl.server.truststore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.server.truststore.location</name>
<value>${user.home}/keystores/truststore.jks</value>
</property>
<property>
<name>ssl.server.truststore.password</name>
<value>clientserverbar</value>
</property>
<property>
<name>ssl.server.truststore.reload.interval</name>
<value>10000</value>
</property>
</configuration>
------
** <<<ssl-client.xml>>> (Reducer/Fetcher) Configuration:
The mapred user should own the <<ssl-client.xml>> file and it should have
default permissions.
*---------------------------------------------+---------------------+-----------------+
| <<Property>> | <<Default Value>> | <<Explanation>> |
*---------------------------------------------+---------------------+-----------------+
| <<<ssl.client.keystore.type>>> | <<<jks>>> | Keystore file type |
*---------------------------------------------+---------------------+-----------------+
| <<<ssl.client.keystore.location>>> | NONE | Keystore file location. The mapred user should own this file and it should have default permissions. |
*---------------------------------------------+---------------------+-----------------+
| <<<ssl.client.keystore.password>>> | NONE | Keystore file password |
*---------------------------------------------+---------------------+-----------------+
| <<<ssl.client.truststore.type>>> | <<<jks>>> | Truststore file type |
*---------------------------------------------+---------------------+-----------------+
| <<<ssl.client.truststore.location>>> | NONE | Truststore file location. The mapred user should own this file and it should have default permissions. |
*---------------------------------------------+---------------------+-----------------+
| <<<ssl.client.truststore.password>>> | NONE | Truststore file password |
*---------------------------------------------+---------------------+-----------------+
| <<<ssl.client.truststore.reload.interval>>> | 10000 | Truststore reload interval, in milliseconds |
*--------------------------------------+----------------------------+-----------------+
*** Example:
------
<configuration>
<!-- Client certificate Store -->
<property>
<name>ssl.client.keystore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.client.keystore.location</name>
<value>${user.home}/keystores/client-keystore.jks</value>
</property>
<property>
<name>ssl.client.keystore.password</name>
<value>clientfoo</value>
</property>
<!-- Client Trust Store -->
<property>
<name>ssl.client.truststore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.client.truststore.location</name>
<value>${user.home}/keystores/truststore.jks</value>
</property>
<property>
<name>ssl.client.truststore.password</name>
<value>clientserverbar</value>
</property>
<property>
<name>ssl.client.truststore.reload.interval</name>
<value>10000</value>
</property>
</configuration>
------
* Activating Encrypted Shuffle
When you have made the above configuration changes, activate Encrypted
Shuffle by re-starting all NodeManagers.
<<IMPORTANT:>> Using encrypted shuffle incurs a significant
performance impact. Users should profile this and potentially reserve
one or more cores for encrypted shuffle.
* {ClientCertificates} Client Certificates
Using Client Certificates does not fully ensure that the client is a
reducer task for the job. Currently, Client Certificate keystore files
(including their private keys) must be readable by all users submitting jobs
to the cluster. This means that a rogue job could read those keystore files
and use
the client certificates in them to establish a secure connection with a
Shuffle server. However, unless the rogue job has a proper JobToken, it won't
be able to retrieve shuffle data from the Shuffle server. A job, using its
own JobToken, can only retrieve shuffle data that belongs to itself.
* Reloading Truststores
By default the truststores will reload their configuration every 10 seconds.
If a new truststore file is copied over the old one, it will be re-read,
and its certificates will replace the old ones. This mechanism is useful for
adding or removing nodes from the cluster, or for adding or removing trusted
clients. In these cases, the client or NodeManager certificate is added to
(or removed from) all the truststore files in the system, and the new
configuration will be picked up without you having to restart the NodeManager
daemons.
* Debugging
<<NOTE:>> Enable debugging only for troubleshooting, and then only for jobs
running on small amounts of data. It is very verbose and slows down jobs by
several orders of magnitude. (You might need to increase mapred.task.timeout
to prevent jobs from failing because tasks run so slowly.)
To enable SSL debugging in the reducers, set <<<-Djavax.net.debug=all>>> in
the <<<mapreduce.reduce.child.java.opts>>> property; for example:
------
<property>
<name>mapreduce.reduce.child.java.opts</name>
<value>-Xmx200m -Djavax.net.debug=all</value>
</property>
------
You can do this on a per-job basis, or by means of a cluster-wide setting in
the <<<mapred-site.xml>>> file.
To set this property in NodeManager, set it in the <<<yarn-env.sh>>> file:
------
YARN_NODEMANAGER_OPTS="-Djavax.net.debug=all $YARN_NODEMANAGER_OPTS"
------


@@ -1,114 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop Map Reduce Next Generation-${project.version} - Backward Compatibility
---
---
${maven.build.timestamp}
Apache Hadoop MapReduce - Migrating from Apache Hadoop 1.x to Apache Hadoop 2.x
* {Introduction}
This document provides information for users to migrate their Apache Hadoop
MapReduce applications from Apache Hadoop 1.x to Apache Hadoop 2.x.
In Apache Hadoop 2.x we have spun off resource management capabilities
into Apache Hadoop YARN, a general-purpose distributed application management
framework, while Apache Hadoop MapReduce (aka MRv2) remains a pure
distributed computation framework.
In general, the previous MapReduce runtime (aka MRv1) has been reused and
no major surgery has been conducted on it. Therefore, MRv2 is able to ensure
satisfactory compatibility with MRv1 applications. However, due to some
improvements and code refactorings, a few APIs have been rendered
backward-incompatible.
The remainder of this page will discuss the scope and the level of backward
compatibility that we support in Apache Hadoop MapReduce 2.x (MRv2).
* {Binary Compatibility}
First, we ensure binary compatibility for the applications that use old
<<mapred>> APIs. This means that applications which were built against MRv1
<<mapred>> APIs can run directly on YARN without recompilation, merely by
pointing them to an Apache Hadoop 2.x cluster via configuration.
* {Source Compatibility}
We cannot ensure complete binary compatibility with the applications that use
<<mapreduce>> APIs, as these APIs have evolved a lot since MRv1. However, we
ensure source compatibility for <<mapreduce>> APIs that break binary
compatibility. In other words, users should recompile their applications that
use <<mapreduce>> APIs against MRv2 jars. One notable binary
incompatibility involves Counter and CounterGroup.
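As a sketch of what recompiling entails, building against the Hadoop 2.x
client artifacts with a Maven dependency such as the following is usually
sufficient (the version number is illustrative):
+---+
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.2.0</version>
</dependency>
+---+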
* {Not Supported}
MRAdmin has been removed in MRv2 because <<<mradmin>>> commands no longer
exist. They have been replaced by the commands in <<<rmadmin>>>. We
neither support binary compatibility nor source compatibility for the
applications that use this class directly.
* {Tradeoffs between MRv1 Users and Early MRv2 Adopters}
Unfortunately, maintaining binary compatibility for MRv1 applications may lead
to binary incompatibility issues for early MRv2 adopters, in particular Hadoop
0.23 users. For <<mapred>> APIs, we have chosen to be compatible with MRv1
applications, which have a larger user base. For <<mapreduce>> APIs, if they
don't significantly break Hadoop 0.23 applications, we still change them to be
compatible with MRv1 applications. Below is the list of MapReduce APIs which
are incompatible with Hadoop 0.23.
*-----------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+
| <<Problematic Function>> | <<Incompatibility Issue>> |
*-----------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+
| <<<org.apache.hadoop.util.ProgramDriver#drive>>> | Return type changes from <<<void>>> to <<<int>>> |
*-----------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+
| <<<org.apache.hadoop.mapred.jobcontrol.Job#getMapredJobID>>> | Return type changes from <<<String>>> to <<<JobID>>> |
*-----------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+
| <<<org.apache.hadoop.mapred.TaskReport#getTaskId>>> | Return type changes from <<<String>>> to <<<TaskID>>> |
*-----------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+
| <<<org.apache.hadoop.mapred.ClusterStatus#UNINITIALIZED_MEMORY_VALUE>>> | Data type changes from <<<long>>> to <<<int>>> |
*-----------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+
| <<<org.apache.hadoop.mapreduce.filecache.DistributedCache#getArchiveTimestamps>>> | Return type changes from <<<long[]>>> to <<<String[]>>> |
*-----------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+
| <<<org.apache.hadoop.mapreduce.filecache.DistributedCache#getFileTimestamps>>> | Return type changes from <<<long[]>>> to <<<String[]>>> |
*-----------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+
| <<<org.apache.hadoop.mapreduce.Job#failTask>>> | Return type changes from <<<void>>> to <<<boolean>>> |
*-----------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+
| <<<org.apache.hadoop.mapreduce.Job#killTask>>> | Return type changes from <<<void>>> to <<<boolean>>> |
*-----------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+
| <<<org.apache.hadoop.mapreduce.Job#getTaskCompletionEvents>>> | Return type changes from <<<o.a.h.mapred.TaskCompletionEvent[]>>> to <<<o.a.h.mapreduce.TaskCompletionEvent[]>>> |
*-----------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+
* {Malicious}
For the users who are going to try <<<hadoop-examples-1.x.x.jar>>> on YARN,
please note that <<<hadoop jar hadoop-examples-1.x.x.jar>>> will still use
<<<hadoop-mapreduce-examples-2.x.x.jar>>>, which is installed together with
other MRv2 jars. By default Hadoop framework jars appear before the users'
jars in the classpath, such that the classes from the 2.x.x jar will still be
picked up. Users should remove <<<hadoop-mapreduce-examples-2.x.x.jar>>>
from the classpath of all the nodes in a cluster. Otherwise, users need to
set <<<HADOOP_USER_CLASSPATH_FIRST=true>>> and
<<<HADOOP_CLASSPATH=...:hadoop-examples-1.x.x.jar>>> to run their target
examples jar, and add the following configuration in <<<mapred-site.xml>>> to
make the processes in YARN containers pick this jar as well.
+---+
<property>
<name>mapreduce.job.user.classpath.first</name>
<value>true</value>
</property>
+---+


@@ -1,233 +0,0 @@
~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements. See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License. You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.
---
MapReduce Commands Guide
---
---
${maven.build.timestamp}
MapReduce Commands Guide
%{toc|section=1|fromDepth=2|toDepth=4}
* Overview
MapReduce commands are invoked by the <<<bin/mapred>>> script. Running the
script without any arguments prints the description for all commands.
Usage: <<<mapred [--config confdir] [--loglevel loglevel] COMMAND>>>
MapReduce has an option parsing framework that supports parsing generic
options as well as running classes.
*-------------------------+---------------------------------------------------+
|| COMMAND_OPTIONS || Description |
*-------------------------+---------------------------------------------------+
| --config confdir | Overwrites the default Configuration directory. Default
| | is $\{HADOOP_PREFIX\}/conf.
*-------------------------+---------------------------------------------------+
| --loglevel loglevel | Overwrites the log level. Valid log levels are FATAL,
| | ERROR, WARN, INFO, DEBUG, and TRACE. Default is INFO.
*-------------------------+---------------------------------------------------+
| COMMAND COMMAND_OPTIONS | Various commands with their options are described
| | in the following sections. The commands have been
| | grouped into {{User Commands}} and
| | {{Administration Commands}}.
*-------------------------+---------------------------------------------------+
* User Commands
Commands useful for users of a hadoop cluster.
** <<<pipes>>>
Runs a pipes job.
Usage: <<<mapred pipes [-conf <path>] [-jobconf <key=value>, <key=value>,
...] [-input <path>] [-output <path>] [-jar <jar file>] [-inputformat
<class>] [-map <class>] [-partitioner <class>] [-reduce <class>] [-writer
<class>] [-program <executable>] [-reduces <num>]>>>
*----------------------------------------+------------------------------------+
|| COMMAND_OPTION || Description
*----------------------------------------+------------------------------------+
| -conf <path> | Configuration for job
*----------------------------------------+------------------------------------+
| -jobconf <key=value>, <key=value>, ... | Add/override configuration for job
*----------------------------------------+------------------------------------+
| -input <path> | Input directory
*----------------------------------------+------------------------------------+
| -output <path> | Output directory
*----------------------------------------+------------------------------------+
| -jar <jar file> | Jar filename
*----------------------------------------+------------------------------------+
| -inputformat <class> | InputFormat class
*----------------------------------------+------------------------------------+
| -map <class> | Java Map class
*----------------------------------------+------------------------------------+
| -partitioner <class> | Java Partitioner
*----------------------------------------+------------------------------------+
| -reduce <class> | Java Reduce class
*----------------------------------------+------------------------------------+
| -writer <class> | Java RecordWriter
*----------------------------------------+------------------------------------+
| -program <executable> | Executable URI
*----------------------------------------+------------------------------------+
| -reduces <num> | Number of reduces
*----------------------------------------+------------------------------------+
** <<<job>>>
Command to interact with Map Reduce Jobs.
Usage: <<<mapred job
| [{{{../../hadoop-project-dist/hadoop-common/CommandsManual.html#Generic_Options}GENERIC_OPTIONS}}]
| [-submit <job-file>]
| [-status <job-id>]
| [-counter <job-id> <group-name> <counter-name>]
| [-kill <job-id>]
| [-events <job-id> <from-event-#> <#-of-events>]
| [-history [all] <jobOutputDir>] | [-list [all]]
| [-kill-task <task-id>] | [-fail-task <task-id>]
| [-set-priority <job-id> <priority>]>>>
*------------------------------+---------------------------------------------+
|| COMMAND_OPTION || Description
*------------------------------+---------------------------------------------+
| -submit <job-file> | Submits the job.
*------------------------------+---------------------------------------------+
| -status <job-id> | Prints the map and reduce completion
| percentage and all job counters.
*------------------------------+---------------------------------------------+
| -counter <job-id> <group-name> <counter-name> | Prints the counter value.
*------------------------------+---------------------------------------------+
| -kill <job-id> | Kills the job.
*------------------------------+---------------------------------------------+
| -events <job-id> <from-event-#> <#-of-events> | Prints the events' details
| received by jobtracker for the given range.
*------------------------------+---------------------------------------------+
| -history [all]<jobOutputDir> | Prints job details, failed and killed tip
| details. More details about the job such as
| successful tasks and task attempts made for
| each task can be viewed by specifying the
| [all] option.
*------------------------------+---------------------------------------------+
| -list [all] | Displays jobs which are yet to complete.
| <<<-list all>>> displays all jobs.
*------------------------------+---------------------------------------------+
| -kill-task <task-id> | Kills the task. Killed tasks are NOT counted
| against failed attempts.
*------------------------------+---------------------------------------------+
| -fail-task <task-id> | Fails the task. Failed tasks are counted
| against failed attempts.
*------------------------------+---------------------------------------------+
| -set-priority <job-id> <priority> | Changes the priority of the job. Allowed
| priority values are VERY_HIGH, HIGH, NORMAL,
| LOW, VERY_LOW
*------------------------------+---------------------------------------------+
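For example, a few common invocations (the job ID is illustrative):
+---+
$ mapred job -list
$ mapred job -status job_1317412330837_0001
$ mapred job -set-priority job_1317412330837_0001 HIGH
+---+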
** <<<queue>>>
Command to interact with and view Job Queue information.
Usage: <<<mapred queue [-list] | [-info <job-queue-name> [-showJobs]]
| [-showacls]>>>
*-----------------+-----------------------------------------------------------+
|| COMMAND_OPTION || Description
*-----------------+-----------------------------------------------------------+
| -list | Gets the list of Job Queues configured in the system,
| along with the scheduling information associated with the
| job queues.
*-----------------+-----------------------------------------------------------+
| -info <job-queue-name> [-showJobs] | Displays the job queue information and
| associated scheduling information of particular job queue.
| If the <<<-showJobs>>> option is present, a list of jobs
| submitted to the particular job queue is displayed.
*-----------------+-----------------------------------------------------------+
| -showacls | Displays the queue name and associated queue operations
| allowed for the current user. The list consists of only
| those queues to which the user has access.
*-----------------+-----------------------------------------------------------+
** <<<classpath>>>
Prints the class path needed to get the Hadoop jar and the required
libraries.
Usage: <<<mapred classpath>>>
** <<<distcp>>>
Copies files or directories recursively. More information can be found at
{{{./DistCp.html}Hadoop DistCp Guide}}.
** <<<archive>>>
Creates a hadoop archive. More information can be found at
{{{./HadoopArchives.html}Hadoop Archives Guide}}.
** <<<version>>>
Prints the version.
Usage: <<<mapred version>>>
* Administration Commands
Commands useful for administrators of a hadoop cluster.
** <<<historyserver>>>
Start JobHistoryServer.
Usage: <<<mapred historyserver>>>
** <<<hsadmin>>>
Runs a MapReduce hsadmin client to execute JobHistoryServer administrative
commands.
Usage: <<<mapred hsadmin
[-refreshUserToGroupsMappings] |
[-refreshSuperUserGroupsConfiguration] |
[-refreshAdminAcls] |
[-refreshLoadedJobCache] |
[-refreshLogRetentionSettings] |
[-refreshJobRetentionSettings] |
[-getGroups [username]] | [-help [cmd]]>>>
*-----------------+-----------------------------------------------------------+
|| COMMAND_OPTION || Description
*-----------------+-----------------------------------------------------------+
| -refreshUserToGroupsMappings | Refresh user-to-groups mappings
*-----------------+-----------------------------------------------------------+
| -refreshSuperUserGroupsConfiguration| Refresh superuser proxy groups mappings
*-----------------+-----------------------------------------------------------+
| -refreshAdminAcls | Refresh acls for administration of Job history server
*-----------------+-----------------------------------------------------------+
| -refreshLoadedJobCache | Refresh loaded job cache of Job history server
*-----------------+-----------------------------------------------------------+
| -refreshJobRetentionSettings|Refresh job history period, job cleaner settings
*-----------------+-----------------------------------------------------------+
| -refreshLogRetentionSettings | Refresh log retention period and log retention
| | check interval
*-----------------+-----------------------------------------------------------+
| -getGroups [username] | Gets the groups to which the given user belongs
*-----------------+-----------------------------------------------------------+
| -help [cmd] | Displays help for the given command or all commands if none is
| | specified.
*-----------------+-----------------------------------------------------------+
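For example (the username is illustrative):
+---+
$ mapred hsadmin -refreshLogRetentionSettings
$ mapred hsadmin -getGroups jane
+---+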


@@ -1,98 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop Map Reduce Next Generation-${project.version} - Pluggable Shuffle and Pluggable Sort
---
---
${maven.build.timestamp}
Hadoop MapReduce Next Generation - Pluggable Shuffle and Pluggable Sort
* Introduction
The pluggable shuffle and pluggable sort capabilities allow replacing the
built-in shuffle and sort logic with alternate implementations. Example use
cases for this are: using an application protocol other than HTTP, such as
RDMA, for shuffling data from the Map nodes to the Reducer nodes; or
replacing the sort logic with custom algorithms that enable Hash aggregation
and Limit-N queries.
<<IMPORTANT:>> The pluggable shuffle and pluggable sort capabilities are
experimental and unstable. This means the provided APIs may change and break
compatibility in future versions of Hadoop.
* Implementing a Custom Shuffle and a Custom Sort
A custom shuffle implementation requires a
<<<org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.AuxiliaryService>>>
implementation class running in the NodeManagers and a
<<<org.apache.hadoop.mapred.ShuffleConsumerPlugin>>> implementation class
running in the Reducer tasks.
The default implementations provided by Hadoop can be used as references:
* <<<org.apache.hadoop.mapred.ShuffleHandler>>>
* <<<org.apache.hadoop.mapreduce.task.reduce.Shuffle>>>
A custom sort implementation requires a <<<org.apache.hadoop.mapred.MapOutputCollector>>>
implementation class running in the Mapper tasks and (optionally, depending
on the sort implementation) a <<<org.apache.hadoop.mapred.ShuffleConsumerPlugin>>>
implementation class running in the Reducer tasks.
The default implementations provided by Hadoop can be used as references:
* <<<org.apache.hadoop.mapred.MapTask$MapOutputBuffer>>>
* <<<org.apache.hadoop.mapreduce.task.reduce.Shuffle>>>
* Configuration
Except for the auxiliary service running in the NodeManagers serving the
shuffle (by default the <<<ShuffleHandler>>>), all the pluggable components
run in the job tasks. This means they can be configured on a per-job basis.
The auxiliary service servicing the Shuffle must be configured in the
NodeManagers configuration.
** Job Configuration Properties (on per job basis):
*--------------------------------------+---------------------+-----------------+
| <<Property>> | <<Default Value>> | <<Explanation>> |
*--------------------------------------+---------------------+-----------------+
| <<<mapreduce.job.reduce.shuffle.consumer.plugin.class>>> | <<<org.apache.hadoop.mapreduce.task.reduce.Shuffle>>> | The <<<ShuffleConsumerPlugin>>> implementation to use |
*--------------------------------------+---------------------+-----------------+
| <<<mapreduce.job.map.output.collector.class>>> | <<<org.apache.hadoop.mapred.MapTask$MapOutputBuffer>>> | The <<<MapOutputCollector>>> implementation(s) to use |
*--------------------------------------+---------------------+-----------------+
These properties can also be set in the <<<mapred-site.xml>>> to change the default values for all jobs.
The collector class configuration may specify a comma-separated list of collector implementations.
In this case, the map task will attempt to instantiate each in turn until one of the
implementations successfully initializes. This can be useful if a given collector
implementation is only compatible with certain types of keys or values, for example.
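For example, a job could name a hypothetical custom collector first and fall
back to the default (the <<<com.example.MyOutputCollector>>> class is
illustrative):
+---+
<property>
  <name>mapreduce.job.map.output.collector.class</name>
  <value>com.example.MyOutputCollector,org.apache.hadoop.mapred.MapTask$MapOutputBuffer</value>
</property>
+---+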
** NodeManager Configuration properties, <<<yarn-site.xml>>> in all nodes:
*--------------------------------------+---------------------+-----------------+
| <<Property>> | <<Default Value>> | <<Explanation>> |
*--------------------------------------+---------------------+-----------------+
| <<<yarn.nodemanager.aux-services>>> | <<<...,mapreduce_shuffle>>> | The auxiliary service name |
*--------------------------------------+---------------------+-----------------+
| <<<yarn.nodemanager.aux-services.mapreduce_shuffle.class>>> | <<<org.apache.hadoop.mapred.ShuffleHandler>>> | The auxiliary service class to use |
*--------------------------------------+---------------------+-----------------+
<<IMPORTANT:>> If setting an auxiliary service in addition to the default
<<<mapreduce_shuffle>>> service, then a new service key should be added to the
<<<yarn.nodemanager.aux-services>>> property, for example <<<mapreduce_shufflex>>>.
Then the property defining the corresponding class must be
<<<yarn.nodemanager.aux-services.mapreduce_shufflex.class>>>.


@@ -0,0 +1,119 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
#set ( $H3 = '###' )
#set ( $H4 = '####' )
#set ( $H5 = '#####' )
Hadoop: Distributed Cache Deploy
================================
Introduction
------------
The MapReduce application framework has rudimentary support for deploying a new version of the MapReduce framework via the distributed cache. By setting the appropriate configuration properties, users can run a different version of MapReduce than the one initially deployed to the cluster. For example, cluster administrators can place multiple versions of MapReduce in HDFS and configure `mapred-site.xml` to specify which version jobs will use by default. This allows the administrators to perform a rolling upgrade of the MapReduce framework under certain conditions.
Preconditions and Limitations
-----------------------------
The support for deploying the MapReduce framework via the distributed cache currently does not address the job client code used to submit and query jobs. It also does not address the `ShuffleHandler` code that runs as an auxiliary service within each NodeManager. As a result, the following limitations apply to MapReduce versions that can be successfully deployed via the distributed cache in a rolling upgrade fashion:
* The MapReduce version must be compatible with the job client code used to
submit and query jobs. If it is incompatible then the job client must be
upgraded separately on any node from which jobs using the new MapReduce
version will be submitted or queried.
* The MapReduce version must be compatible with the configuration files used
by the job client submitting the jobs. If it is incompatible with that
configuration (e.g.: a new property must be set or an existing property
value changed) then the configuration must be updated first.
* The MapReduce version must be compatible with the `ShuffleHandler`
version running on the nodes in the cluster. If it is incompatible then the
new `ShuffleHandler` code must be deployed to all the nodes in the
cluster, and the NodeManagers must be restarted to pick up the new
`ShuffleHandler` code.
Deploying a New MapReduce Version via the Distributed Cache
-----------------------------------------------------------
Deploying a new MapReduce version consists of three steps:
1. Upload the MapReduce archive to a location that can be accessed by the
job submission client. Ideally the archive should be on the cluster's default
filesystem at a publicly-readable path. See the archive location discussion
below for more details.
2. Configure `mapreduce.application.framework.path` to point to the
location where the archive is located. As when specifying distributed cache
files for a job, this is a URL that also supports creating an alias for the
archive if a URL fragment is specified. For example,
`hdfs:/mapred/framework/hadoop-mapreduce-${project.version}.tar.gz#mrframework`
will be localized as `mrframework` rather than
`hadoop-mapreduce-${project.version}.tar.gz`.
3. Configure `mapreduce.application.classpath` to set the proper
classpath to use with the MapReduce archive configured above. NOTE: An error
occurs if `mapreduce.application.framework.path` is configured but
`mapreduce.application.classpath` does not reference the base name of the
archive path or the alias if an alias was specified. A combined
configuration sketch is shown below.
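For instance, a minimal `mapred-site.xml` sketch combining both properties might look like the following (the HDFS path, alias, and version number are illustrative, and the archive is assumed to unpack into a standard `hadoop-mapreduce-2.6.0` top-level directory):

```xml
<property>
  <name>mapreduce.application.framework.path</name>
  <value>hdfs:/mapred/framework/hadoop-mapreduce-2.6.0.tar.gz#mrframework</value>
</property>
<property>
  <!-- must reference the alias (mrframework) chosen above -->
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_CONF_DIR,$PWD/mrframework/hadoop-mapreduce-2.6.0/share/hadoop/mapreduce/*,$PWD/mrframework/hadoop-mapreduce-2.6.0/share/hadoop/mapreduce/lib/*</value>
</property>
```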
$H3 Location of the MapReduce Archive and How It Affects Job Performance
Note that the location of the MapReduce archive can be critical to job submission and job startup performance. If the archive is not located on the cluster's default filesystem then it will be copied to the job staging directory for each job and localized to each node where the job's tasks run. This will slow down job submission and task startup performance.
If the archive is located on the default filesystem then the job client will not upload the archive to the job staging directory for each job submission. However if the archive path is not readable by all cluster users then the archive will be localized separately for each user on each node where tasks execute. This can cause unnecessary duplication in the distributed cache.
When working with a large cluster it can be important to increase the replication factor of the archive to increase its availability. This will spread the load when the nodes in the cluster localize the archive for the first time.
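For example, a short shell sketch for staging the archive and raising its replication (the target path, file name, and replication factor of 10 are illustrative):

```bash
# create a publicly-readable framework directory on the default filesystem
hdfs dfs -mkdir -p /mapred/framework
hdfs dfs -put hadoop-mapreduce-2.6.0.tar.gz /mapred/framework/
# raise the replication factor and wait for re-replication to finish
hdfs dfs -setrep -w 10 /mapred/framework/hadoop-mapreduce-2.6.0.tar.gz
```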
MapReduce Archives and Classpath Configuration
----------------------------------------------
Setting a proper classpath for the MapReduce archive depends upon the composition of the archive and whether it has any additional dependencies. For example, the archive can contain not only the MapReduce jars but also the necessary YARN, HDFS, and Hadoop Common jars and all other dependencies. In that case, `mapreduce.application.classpath` would be configured to something like the following example, where the archive basename is hadoop-mapreduce-${project.version}.tar.gz and the archive is organized internally similar to the standard Hadoop distribution archive:
`$HADOOP_CONF_DIR,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/lib/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/common/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/common/lib/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/yarn/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/yarn/lib/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/hdfs/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/hdfs/lib/*`
Another possible approach is to have the archive consist of just the MapReduce jars and have the remaining dependencies picked up from the Hadoop distribution installed on the nodes. In that case, the above example would change to something like the following:
`$HADOOP_CONF_DIR,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/lib/*,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*`
$H3 NOTE:
If shuffle encryption is also enabled in the cluster, MR jobs may fail with an exception like the following:
```
2014-10-10 02:17:16,600 WARN [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to junpingdu-centos5-3.cs1cloud.internal:13562 with 1 map outputs
javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at com.sun.net.ssl.internal.ssl.Alerts.getSSLException(Alerts.java:174)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1731)
at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Handshaker.java:241)
at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Handshaker.java:235)
at com.sun.net.ssl.internal.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1206)
at com.sun.net.ssl.internal.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:136)
at com.sun.net.ssl.internal.ssl.Handshaker.processLoop(Handshaker.java:593)
at com.sun.net.ssl.internal.ssl.Handshaker.process_record(Handshaker.java:529)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:925)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1170)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1197)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1181)
at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:434)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:81)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:61)
at sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:584)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1193)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:318)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:427)
....
```
This is because the MR client (deployed from HDFS) cannot access ssl-client.xml on the local filesystem under $HADOOP\_CONF\_DIR. To fix the problem, add the directory containing ssl-client.xml to the MapReduce classpath specified in "mapreduce.application.classpath" as mentioned above. To avoid the MR application being affected by other local configurations, it is better to create a dedicated directory for ssl-client.xml, e.g. a sub-directory under $HADOOP\_CONF\_DIR, such as $HADOOP\_CONF\_DIR/security.
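As a sketch, assuming the dedicated `$HADOOP_CONF_DIR/security` directory described above, the classpath entry simply gains one more leading element (the remaining entries match the classpath examples earlier in this document):

```xml
<property>
  <name>mapreduce.application.classpath</name>
  <!-- $HADOOP_CONF_DIR/security holds ssl-client.xml; trailing entries
       are the same as in the alias-based example above -->
  <value>$HADOOP_CONF_DIR/security,$HADOOP_CONF_DIR,$PWD/mrframework/hadoop-mapreduce-2.6.0/share/hadoop/mapreduce/*,$PWD/mrframework/hadoop-mapreduce-2.6.0/share/hadoop/mapreduce/lib/*</value>
</property>
```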


@@ -0,0 +1,255 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Hadoop: Encrypted Shuffle
=========================
Introduction
------------
The Encrypted Shuffle capability allows encryption of the MapReduce shuffle using HTTPS and with optional client authentication (also known as bi-directional HTTPS, or HTTPS with client certificates). It comprises:
* A Hadoop configuration setting for toggling the shuffle between HTTP and
HTTPS.
* Hadoop configuration settings for specifying the keystore and truststore
  properties (location, type, passwords) used by the shuffle service and the
  reducer tasks fetching shuffle data.
* A way to re-load truststores across the cluster (when a node is added or
removed).
Configuration
-------------
### **core-site.xml** Properties
To enable encrypted shuffle, set the following properties in core-site.xml of all nodes in the cluster:
| **Property** | **Default Value** | **Explanation** |
|:---- |:---- |:---- |
| `hadoop.ssl.require.client.cert` | `false` | Whether client certificates are required |
| `hadoop.ssl.hostname.verifier` | `DEFAULT` | The hostname verifier to provide for HttpsURLConnections. Valid values are: **DEFAULT**, **STRICT**, **STRICT\_IE6**, **DEFAULT\_AND\_LOCALHOST** and **ALLOW\_ALL** |
| `hadoop.ssl.keystores.factory.class` | `org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory` | The KeyStoresFactory implementation to use |
| `hadoop.ssl.server.conf` | `ssl-server.xml` | Resource file from which ssl server keystore information will be extracted. This file is looked up in the classpath; typically it should be in the Hadoop conf/ directory |
| `hadoop.ssl.client.conf` | `ssl-client.xml` | Resource file from which ssl client keystore information will be extracted. This file is looked up in the classpath; typically it should be in the Hadoop conf/ directory |
| `hadoop.ssl.enabled.protocols` | `TLSv1` | The supported SSL protocols (JDK6 can use **TLSv1**, JDK7+ can use **TLSv1,TLSv1.1,TLSv1.2**) |
**IMPORTANT:** Currently `hadoop.ssl.require.client.cert` should be set to `false`. Refer to the [Client Certificates](#Client_Certificates) section for details.
**IMPORTANT:** All these properties should be marked as final in the cluster configuration files.
#### Example:
```xml
<property>
<name>hadoop.ssl.require.client.cert</name>
<value>false</value>
<final>true</final>
</property>
<property>
<name>hadoop.ssl.hostname.verifier</name>
<value>DEFAULT</value>
<final>true</final>
</property>
<property>
<name>hadoop.ssl.keystores.factory.class</name>
<value>org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory</value>
<final>true</final>
</property>
<property>
<name>hadoop.ssl.server.conf</name>
<value>ssl-server.xml</value>
<final>true</final>
</property>
<property>
<name>hadoop.ssl.client.conf</name>
<value>ssl-client.xml</value>
<final>true</final>
</property>
```
### `mapred-site.xml` Properties
To enable encrypted shuffle, set the following property in mapred-site.xml of all nodes in the cluster:
| **Property** | **Default Value** | **Explanation** |
|:---- |:---- |:---- |
| `mapreduce.shuffle.ssl.enabled` | `false` | Whether encrypted shuffle is enabled |
**IMPORTANT:** This property should be marked as final in the cluster configuration files.
#### Example:
```xml
<property>
<name>mapreduce.shuffle.ssl.enabled</name>
<value>true</value>
<final>true</final>
</property>
```
The Linux container executor should be configured to prevent job tasks from reading the server keystore information and gaining access to the shuffle server certificates.
Refer to Hadoop Kerberos configuration for details on how to do this.
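As a minimal sketch of the relevant switch (assuming the `LinuxContainerExecutor` class shipped with Hadoop; the full secure-mode setup, including `container-executor.cfg`, is covered by that documentation):

```xml
<property>
  <!-- Run containers with the Linux container executor so task processes
       cannot read the server-side keystore files. -->
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
```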
Keystore and Truststore Settings
--------------------------------
Currently `FileBasedKeyStoresFactory` is the only `KeyStoresFactory` implementation. The `FileBasedKeyStoresFactory` implementation uses the following properties, in the **ssl-server.xml** and **ssl-client.xml** files, to configure the keystores and truststores.
### `ssl-server.xml` (Shuffle server) Configuration:
The mapred user should own the **ssl-server.xml** file and have exclusive read access to it.
| **Property** | **Default Value** | **Explanation** |
|:---- |:---- |:---- |
| `ssl.server.keystore.type` | `jks` | Keystore file type |
| `ssl.server.keystore.location` | NONE | Keystore file location. The mapred user should own this file and have exclusive read access to it. |
| `ssl.server.keystore.password` | NONE | Keystore file password |
| `ssl.server.truststore.type` | `jks` | Truststore file type |
| `ssl.server.truststore.location` | NONE | Truststore file location. The mapred user should own this file and have exclusive read access to it. |
| `ssl.server.truststore.password` | NONE | Truststore file password |
| `ssl.server.truststore.reload.interval` | 10000 | Truststore reload interval, in milliseconds |
#### Example:
```xml
<configuration>
<!-- Server Certificate Store -->
<property>
<name>ssl.server.keystore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.server.keystore.location</name>
<value>${user.home}/keystores/server-keystore.jks</value>
</property>
<property>
<name>ssl.server.keystore.password</name>
<value>serverfoo</value>
</property>
<!-- Server Trust Store -->
<property>
<name>ssl.server.truststore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.server.truststore.location</name>
<value>${user.home}/keystores/truststore.jks</value>
</property>
<property>
<name>ssl.server.truststore.password</name>
<value>clientserverbar</value>
</property>
<property>
<name>ssl.server.truststore.reload.interval</name>
<value>10000</value>
</property>
</configuration>
```
### `ssl-client.xml` (Reducer/Fetcher) Configuration:
The mapred user should own the **ssl-client.xml** file and it should have default permissions.
| **Property** | **Default Value** | **Explanation** |
|:---- |:---- |:---- |
| `ssl.client.keystore.type` | `jks` | Keystore file type |
| `ssl.client.keystore.location` | NONE | Keystore file location. The mapred user should own this file and it should have default permissions. |
| `ssl.client.keystore.password` | NONE | Keystore file password |
| `ssl.client.truststore.type` | `jks` | Truststore file type |
| `ssl.client.truststore.location` | NONE | Truststore file location. The mapred user should own this file and it should have default permissions. |
| `ssl.client.truststore.password` | NONE | Truststore file password |
| `ssl.client.truststore.reload.interval` | 10000 | Truststore reload interval, in milliseconds |
#### Example:
```xml
<configuration>
<!-- Client certificate Store -->
<property>
<name>ssl.client.keystore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.client.keystore.location</name>
<value>${user.home}/keystores/client-keystore.jks</value>
</property>
<property>
<name>ssl.client.keystore.password</name>
<value>clientfoo</value>
</property>
<!-- Client Trust Store -->
<property>
<name>ssl.client.truststore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.client.truststore.location</name>
<value>${user.home}/keystores/truststore.jks</value>
</property>
<property>
<name>ssl.client.truststore.password</name>
<value>clientserverbar</value>
</property>
<property>
<name>ssl.client.truststore.reload.interval</name>
<value>10000</value>
</property>
</configuration>
```
Activating Encrypted Shuffle
----------------------------
When you have made the above configuration changes, activate Encrypted Shuffle by restarting all NodeManagers.
**IMPORTANT:** Using encrypted shuffle will incur a significant performance impact. Users should profile this and potentially reserve 1 or more cores for encrypted shuffle.
Client Certificates
-------------------
Using Client Certificates does not fully ensure that the client is a reducer task for the job. Currently, the keystore files containing the Client Certificates (and their private keys) must be readable by all users submitting jobs to the cluster. This means that a rogue job could read those keystore files and use the client certificates in them to establish a secure connection with a Shuffle server. However, unless the rogue job has a proper JobToken, it won't be able to retrieve shuffle data from the Shuffle server. A job, using its own JobToken, can only retrieve shuffle data that belongs to itself.
Reloading Truststores
---------------------
By default the truststores will reload their configuration every 10 seconds. If a new truststore file is copied over the old one, it will be re-read, and its certificates will replace the old ones. This mechanism is useful for adding or removing nodes from the cluster, or for adding or removing trusted clients. In these cases, the client or NodeManager certificate is added to (or removed from) all the truststore files in the system, and the new configuration will be picked up without you having to restart the NodeManager daemons.
Debugging
---------
**NOTE:** Enable debugging only for troubleshooting, and then only for jobs running on small amounts of data. It is very verbose and slows down jobs by several orders of magnitude. (You might need to increase `mapred.task.timeout` to prevent jobs from failing because tasks run so slowly.)
To enable SSL debugging in the reducers, set `-Djavax.net.debug=all` in the `mapreduce.reduce.child.java.opts` property; for example:
```xml
<property>
  <name>mapreduce.reduce.child.java.opts</name>
  <value>-Xmx200m -Djavax.net.debug=all</value>
</property>
```
You can do this on a per-job basis, or by means of a cluster-wide setting in the `mapred-site.xml` file.
To set this property in the NodeManager, set it in the `yarn-env.sh` file:

```
YARN_NODEMANAGER_OPTS="-Djavax.net.debug=all"
```

@@ -0,0 +1,69 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Apache Hadoop MapReduce - Migrating from Apache Hadoop 1.x to Apache Hadoop 2.x
===============================================================================
Introduction
------------
This document provides information for users to migrate their Apache Hadoop MapReduce applications from Apache Hadoop 1.x to Apache Hadoop 2.x.
In Apache Hadoop 2.x we have spun off resource management capabilities into Apache Hadoop YARN, a general-purpose, distributed application management framework, while Apache Hadoop MapReduce (aka MRv2) remains a pure distributed computation framework.
In general, the previous MapReduce runtime (aka MRv1) has been reused and no major surgery has been conducted on it. Therefore, MRv2 is able to ensure satisfactory compatibility with MRv1 applications. However, due to some improvements and code refactorings, a few APIs have been rendered backward-incompatible.
The remainder of this page will discuss the scope and the level of backward compatibility that we support in Apache Hadoop MapReduce 2.x (MRv2).
Binary Compatibility
--------------------
First, we ensure binary compatibility to the applications that use old **mapred** APIs. This means that applications which were built against MRv1 **mapred** APIs can run directly on YARN without recompilation, merely by pointing them to an Apache Hadoop 2.x cluster via configuration.
Source Compatibility
--------------------
We cannot ensure complete binary compatibility with the applications that use **mapreduce** APIs, as these APIs have evolved a lot since MRv1. However, we ensure source compatibility for **mapreduce** APIs that break binary compatibility. In other words, users should recompile their applications that use **mapreduce** APIs against MRv2 jars. One notable binary incompatibility involves `Counter` and `CounterGroup`.
Not Supported
-------------
MRAdmin has been removed in MRv2 because the `mradmin` commands no longer exist. They have been replaced by the commands in `rmadmin`. We support neither binary compatibility nor source compatibility for applications that use this class directly.
Tradeoffs between MRv1 Users and Early MRv2 Adopters
----------------------------------------------------
Unfortunately, maintaining binary compatibility for MRv1 applications may lead to binary incompatibility issues for early MRv2 adopters, in particular Hadoop 0.23 users. For **mapred** APIs, we have chosen to be compatible with MRv1 applications, which have a larger user base. For **mapreduce** APIs, if they don't significantly break Hadoop 0.23 applications, we still change them to be compatible with MRv1 applications. Below is the list of MapReduce APIs which are incompatible with Hadoop 0.23.
| **Problematic Function** | **Incompatibility Issue** |
|:---- |:---- |
| `org.apache.hadoop.util.ProgramDriver#drive` | Return type changes from `void` to `int` |
| `org.apache.hadoop.mapred.jobcontrol.Job#getMapredJobID` | Return type changes from `String` to `JobID` |
| `org.apache.hadoop.mapred.TaskReport#getTaskId` | Return type changes from `String` to `TaskID` |
| `org.apache.hadoop.mapred.ClusterStatus#UNINITIALIZED_MEMORY_VALUE` | Data type changes from `long` to `int` |
| `org.apache.hadoop.mapreduce.filecache.DistributedCache#getArchiveTimestamps` | Return type changes from `long[]` to `String[]` |
| `org.apache.hadoop.mapreduce.filecache.DistributedCache#getFileTimestamps` | Return type changes from `long[]` to `String[]` |
| `org.apache.hadoop.mapreduce.Job#failTask` | Return type changes from `void` to `boolean` |
| `org.apache.hadoop.mapreduce.Job#killTask` | Return type changes from `void` to `boolean` |
| `org.apache.hadoop.mapreduce.Job#getTaskCompletionEvents` | Return type changes from `o.a.h.mapred.TaskCompletionEvent[]` to `o.a.h.mapreduce.TaskCompletionEvent[]` |
Malicious
---------
For users who are going to try `hadoop-examples-1.x.x.jar` on YARN, please note that `hadoop jar hadoop-examples-1.x.x.jar` will still use `hadoop-mapreduce-examples-2.x.x.jar`, which is installed together with the other MRv2 jars. By default the Hadoop framework jars appear before the users' jars in the classpath, so the classes from the 2.x.x jar will still be picked up. Users should remove `hadoop-mapreduce-examples-2.x.x.jar` from the classpath of all the nodes in a cluster. Otherwise, users need to set `HADOOP_USER_CLASSPATH_FIRST=true` and `HADOOP_CLASSPATH=...:hadoop-examples-1.x.x.jar` to run their target examples jar, and add the following configuration in `mapred-site.xml` to make the processes in YARN containers pick this jar as well.
```xml
<property>
  <name>mapreduce.job.user.classpath.first</name>
  <value>true</value>
</property>
```

@@ -0,0 +1,153 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
MapReduce Commands Guide
========================
* [Overview](#Overview)
* [User Commands](#User_Commands)
* [archive](#archive)
* [classpath](#classpath)
* [distcp](#distcp)
* [job](#job)
* [pipes](#pipes)
* [queue](#queue)
* [version](#version)
* [Administration Commands](#Administration_Commands)
* [historyserver](#historyserver)
* [hsadmin](#hsadmin)
Overview
--------
All mapreduce commands are invoked by the `bin/mapred` script. Running the mapred script without any arguments prints the description for all commands.
Usage: `mapred [SHELL_OPTIONS] COMMAND [GENERIC_OPTIONS] [COMMAND_OPTIONS]`
Hadoop has an option parsing framework that handles generic options as well as running classes.
| FIELD | Description |
|:---- |:---- |
| SHELL\_OPTIONS | The common set of shell options. These are documented on the [Hadoop Commands Reference](../../hadoop-project-dist/hadoop-common/CommandsManual.html#Shell_Options) page. |
| GENERIC\_OPTIONS | The common set of options supported by multiple commands. See the [Hadoop Commands Reference](../../hadoop-project-dist/hadoop-common/CommandsManual.html#Generic_Options) for more information. |
| COMMAND COMMAND\_OPTIONS | Various commands with their options are described in the following sections. The commands have been grouped into [User Commands](#User_Commands) and [Administration Commands](#Administration_Commands). |
User Commands
-------------
Commands useful for users of a hadoop cluster.
### `archive`
Creates a hadoop archive. More information can be found at
[Hadoop Archives Guide](./HadoopArchives.html).
### `classpath`
Prints the class path needed to get the Hadoop jar and the required libraries.
Usage: `mapred classpath`
### `distcp`
Copies files or directories recursively. More information can be found at
[Hadoop DistCp Guide](./DistCp.html).
### `job`
Command to interact with Map Reduce Jobs.
Usage: `mapred job | [GENERIC_OPTIONS] | [-submit <job-file>] | [-status <job-id>] | [-counter <job-id> <group-name> <counter-name>] | [-kill <job-id>] | [-events <job-id> <from-event-#> <#-of-events>] | [-history [all] <jobOutputDir>] | [-list [all]] | [-kill-task <task-id>] | [-fail-task <task-id>] | [-set-priority <job-id> <priority>]`
| COMMAND\_OPTION | Description |
|:---- |:---- |
| -submit *job-file* | Submits the job. |
| -status *job-id* | Prints the map and reduce completion percentage and all job counters. |
| -counter *job-id* *group-name* *counter-name* | Prints the counter value. |
| -kill *job-id* | Kills the job. |
| -events *job-id* *from-event-\#* *\#-of-events* | Prints the events' details received by jobtracker for the given range. |
| -history [all] *jobOutputDir* | Prints job details, failed and killed tip details. More details about the job such as successful tasks and task attempts made for each task can be viewed by specifying the [all] option. |
| -list [all] | Displays jobs which are yet to complete. `-list all` displays all jobs. |
| -kill-task *task-id* | Kills the task. Killed tasks are NOT counted against failed attempts. |
| -fail-task *task-id* | Fails the task. Failed tasks are counted against failed attempts. |
| -set-priority *job-id* *priority* | Changes the priority of the job. Allowed priority values are VERY\_HIGH, HIGH, NORMAL, LOW, VERY\_LOW |
### `pipes`
Runs a pipes job.
Usage: `mapred pipes [-conf <path>] [-jobconf <key=value>, <key=value>, ...] [-input <path>] [-output <path>] [-jar <jar file>] [-inputformat <class>] [-map <class>] [-partitioner <class>] [-reduce <class>] [-writer <class>] [-program <executable>] [-reduces <num>]`
| COMMAND\_OPTION | Description |
|:---- |:---- |
| -conf *path* | Configuration for job |
| -jobconf *key=value*, *key=value*, ... | Add/override configuration for job |
| -input *path* | Input directory |
| -output *path* | Output directory |
| -jar *jar file* | Jar filename |
| -inputformat *class* | InputFormat class |
| -map *class* | Java Map class |
| -partitioner *class* | Java Partitioner |
| -reduce *class* | Java Reduce class |
| -writer *class* | Java RecordWriter |
| -program *executable* | Executable URI |
| -reduces *num* | Number of reduces |
### `queue`
Command to interact with and view Job Queue information.
Usage: `mapred queue [-list] | [-info <job-queue-name> [-showJobs]] | [-showacls]`
| COMMAND\_OPTION | Description |
|:---- |:---- |
| -list | Gets the list of Job Queues configured in the system, along with the scheduling information associated with them. |
| -info *job-queue-name* [-showJobs] | Displays the job queue information and associated scheduling information of a particular job queue. If the `-showJobs` option is present, a list of jobs submitted to that queue is also displayed. |
| -showacls | Displays the queue name and associated queue operations allowed for the current user. The list consists of only those queues to which the user has access. |
### `version`
Prints the version.
Usage: `mapred version`
Administration Commands
-----------------------
Commands useful for administrators of a hadoop cluster.
### `historyserver`
Starts the JobHistoryServer.
Usage: `mapred historyserver`
### `hsadmin`
Runs a MapReduce hsadmin client to execute JobHistoryServer administrative commands.
Usage: `mapred hsadmin [-refreshUserToGroupsMappings] | [-refreshSuperUserGroupsConfiguration] | [-refreshAdminAcls] | [-refreshLoadedJobCache] | [-refreshLogRetentionSettings] | [-refreshJobRetentionSettings] | [-getGroups [username]] | [-help [cmd]]`
| COMMAND\_OPTION | Description |
|:---- |:---- |
| -refreshUserToGroupsMappings | Refresh user-to-groups mappings |
| -refreshSuperUserGroupsConfiguration | Refresh superuser proxy groups mappings |
| -refreshAdminAcls | Refresh acls for administration of Job history server |
| -refreshLoadedJobCache | Refresh loaded job cache of Job history server |
| -refreshJobRetentionSettings | Refresh job history period, job cleaner settings |
| -refreshLogRetentionSettings | Refresh log retention period and log retention check interval |
| -getGroups [username] | Get the groups which given user belongs to |
| -help [cmd] | Displays help for the given command or all commands if none is specified. |

@@ -0,0 +1,73 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Hadoop: Pluggable Shuffle and Pluggable Sort
============================================
Introduction
------------
The pluggable shuffle and pluggable sort capabilities allow replacing the built-in shuffle and sort logic with alternate implementations. Example use cases for this are: using an application protocol other than HTTP, such as RDMA, for shuffling data from the Map nodes to the Reducer nodes; or replacing the sort logic with custom algorithms that enable hash aggregation and limit-N queries.
**IMPORTANT:** The pluggable shuffle and pluggable sort capabilities are experimental and unstable. This means the provided APIs may change and break compatibility in future versions of Hadoop.
Implementing a Custom Shuffle and a Custom Sort
-----------------------------------------------
A custom shuffle implementation requires a
`org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.AuxiliaryService`
implementation class running in the NodeManagers and a
`org.apache.hadoop.mapred.ShuffleConsumerPlugin`
implementation class running in the Reducer tasks.
The default implementations provided by Hadoop can be used as references:
* `org.apache.hadoop.mapred.ShuffleHandler`
* `org.apache.hadoop.mapreduce.task.reduce.Shuffle`
A custom sort implementation requires a `org.apache.hadoop.mapred.MapOutputCollector` implementation class running in the Mapper tasks and (optionally, depending on the sort implementation) a `org.apache.hadoop.mapred.ShuffleConsumerPlugin` implementation class running in the Reducer tasks.
The default implementations provided by Hadoop can be used as references:
* `org.apache.hadoop.mapred.MapTask$MapOutputBuffer`
* `org.apache.hadoop.mapreduce.task.reduce.Shuffle`
Configuration
-------------
Except for the auxiliary service running in the NodeManagers serving the shuffle (by default the `ShuffleHandler`), all the pluggable components run in the job tasks. This means they can be configured on a per-job basis. The auxiliary service serving the shuffle must be configured in the NodeManagers' configuration.
### Job Configuration Properties (on a per-job basis):
| **Property** | **Default Value** | **Explanation** |
|:---- |:---- |:---- |
| `mapreduce.job.reduce.shuffle.consumer.plugin.class` | `org.apache.hadoop.mapreduce.task.reduce.Shuffle` | The `ShuffleConsumerPlugin` implementation to use |
| `mapreduce.job.map.output.collector.class` | `org.apache.hadoop.mapred.MapTask$MapOutputBuffer` | The `MapOutputCollector` implementation(s) to use |
These properties can also be set in the `mapred-site.xml` to change the default values for all jobs.
The collector class configuration may specify a comma-separated list of collector implementations. In this case, the map task will attempt to instantiate each in turn until one of the implementations successfully initializes. This can be useful if a given collector implementation is only compatible with certain types of keys or values, for example.
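For illustration, a minimal sketch of such a per-job configuration; `com.example.CustomShuffle` and `com.example.CustomCollector` are hypothetical implementation classes, with the default collector listed second as a fallback:

```xml
<property>
  <!-- Hypothetical custom ShuffleConsumerPlugin running in the reducer tasks -->
  <name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name>
  <value>com.example.CustomShuffle</value>
</property>
<property>
  <!-- If the custom collector fails to initialize, the default is tried next -->
  <name>mapreduce.job.map.output.collector.class</name>
  <value>com.example.CustomCollector,org.apache.hadoop.mapred.MapTask$MapOutputBuffer</value>
</property>
```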
### NodeManager Configuration properties, `yarn-site.xml` in all nodes:
| **Property** | **Default Value** | **Explanation** |
|:---- |:---- |:---- |
| `yarn.nodemanager.aux-services` | `...,mapreduce_shuffle` | The auxiliary service name |
| `yarn.nodemanager.aux-services.mapreduce_shuffle.class` | `org.apache.hadoop.mapred.ShuffleHandler` | The auxiliary service class to use |
**IMPORTANT:** If setting an auxiliary service in addition to the default
`mapreduce_shuffle` service, then a new service key should be added to the
`yarn.nodemanager.aux-services` property, for example `mapreduce_shufflex`.
The property defining the corresponding class must then be
`yarn.nodemanager.aux-services.mapreduce_shufflex.class`.
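For illustration, a sketch of such a `yarn-site.xml` fragment; the service name `mapreduce_shufflex` follows the note above, and `com.example.CustomShuffleHandler` is a hypothetical implementation class:

```xml
<property>
  <!-- The default shuffle service plus the additional custom one -->
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,mapreduce_shufflex</value>
</property>
<property>
  <!-- Hypothetical AuxiliaryService implementation backing the custom service -->
  <name>yarn.nodemanager.aux-services.mapreduce_shufflex.class</name>
  <value>com.example.CustomShuffleHandler</value>
</property>
```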