YARN-2994. Document work-preserving RM restart. Contributed by Jian He.
This commit is contained in:
parent
2f1e5dc628
commit
b0d81e05ab
@ -83,6 +83,8 @@ Release 2.7.0 - UNRELEASED
|
|||||||
YARN-2616 [YARN-913] Add CLI client to the registry to list, view
|
YARN-2616 [YARN-913] Add CLI client to the registry to list, view
|
||||||
and manipulate entries. (Akshay Radia via stevel)
|
and manipulate entries. (Akshay Radia via stevel)
|
||||||
|
|
||||||
|
YARN-2994. Document work-preserving RM restart. (Jian He via ozawa)
|
||||||
|
|
||||||
IMPROVEMENTS
|
IMPROVEMENTS
|
||||||
|
|
||||||
YARN-3005. [JDK7] Use switch statement for String instead of if-else
|
YARN-3005. [JDK7] Use switch statement for String instead of if-else
|
||||||
|
@ -11,12 +11,12 @@
|
|||||||
~~ limitations under the License. See accompanying LICENSE file.
|
~~ limitations under the License. See accompanying LICENSE file.
|
||||||
|
|
||||||
---
|
---
|
||||||
ResourceManger Restart
|
ResourceManager Restart
|
||||||
---
|
---
|
||||||
---
|
---
|
||||||
${maven.build.timestamp}
|
${maven.build.timestamp}
|
||||||
|
|
||||||
ResourceManger Restart
|
ResourceManager Restart
|
||||||
|
|
||||||
%{toc|section=1|fromDepth=0}
|
%{toc|section=1|fromDepth=0}
|
||||||
|
|
||||||
@ -32,23 +32,26 @@ ResourceManger Restart
|
|||||||
|
|
||||||
ResourceManager Restart feature is divided into two phases:
|
ResourceManager Restart feature is divided into two phases:
|
||||||
|
|
||||||
ResourceManager Restart Phase 1: Enhance RM to persist application/attempt state
|
ResourceManager Restart Phase 1 (Non-work-preserving RM restart):
|
||||||
|
Enhance RM to persist application/attempt state
|
||||||
and other credentials information in a pluggable state-store. RM will reload
|
and other credentials information in a pluggable state-store. RM will reload
|
||||||
this information from state-store upon restart and re-kick the previously
|
this information from state-store upon restart and re-kick the previously
|
||||||
running applications. Users are not required to re-submit the applications.
|
running applications. Users are not required to re-submit the applications.
|
||||||
|
|
||||||
ResourceManager Restart Phase 2:
|
ResourceManager Restart Phase 2 (Work-preserving RM restart):
|
||||||
Focus on re-constructing the running state of ResourceManger by reading back
|
Focus on re-constructing the running state of ResourceManager by combining
|
||||||
the container statuses from NodeMangers and container requests from ApplicationMasters
|
the container statuses from NodeManagers and container requests from ApplicationMasters
|
||||||
upon restart. The key difference from phase 1 is that previously running applications
|
upon restart. The key difference from phase 1 is that previously running applications
|
||||||
will not be killed after RM restarts, and so applications won't lose its work
|
will not be killed after RM restarts, and so applications won't lose its work
|
||||||
because of RM outage.
|
because of RM outage.
|
||||||
|
|
||||||
|
* {Feature}
|
||||||
|
|
||||||
|
** Phase 1: Non-work-preserving RM restart
|
||||||
|
|
||||||
As of Hadoop 2.4.0 release, only ResourceManager Restart Phase 1 is implemented which
|
As of Hadoop 2.4.0 release, only ResourceManager Restart Phase 1 is implemented which
|
||||||
is described below.
|
is described below.
|
||||||
|
|
||||||
* {Feature}
|
|
||||||
|
|
||||||
The overall concept is that RM will persist the application metadata
|
The overall concept is that RM will persist the application metadata
|
||||||
(i.e. ApplicationSubmissionContext) in
|
(i.e. ApplicationSubmissionContext) in
|
||||||
a pluggable state-store when client submits an application and also saves the final status
|
a pluggable state-store when client submits an application and also saves the final status
|
||||||
@ -62,13 +65,13 @@ ResourceManger Restart
|
|||||||
applications if they were already completed (i.e. failed, killed, finished)
|
applications if they were already completed (i.e. failed, killed, finished)
|
||||||
before RM went down.
|
before RM went down.
|
||||||
|
|
||||||
NodeMangers and clients during the down-time of RM will keep polling RM until
|
NodeManagers and clients during the down-time of RM will keep polling RM until
|
||||||
RM comes up. When RM becomes alive, it will send a re-sync command to
|
RM comes up. When RM becomes alive, it will send a re-sync command to
|
||||||
all the NodeMangers and ApplicationMasters it was talking to via heartbeats.
|
all the NodeManagers and ApplicationMasters it was talking to via heartbeats.
|
||||||
Today, the behaviors for NodeMangers and ApplicationMasters to handle this command
|
As of Hadoop 2.4.0 release, the behaviors for NodeManagers and ApplicationMasters to handle this command
|
||||||
are: NMs will kill all its managed containers and re-register with RM. From the
|
are: NMs will kill all its managed containers and re-register with RM. From the
|
||||||
RM's perspective, these re-registered NodeManagers are similar to the newly joining NMs.
|
RM's perspective, these re-registered NodeManagers are similar to the newly joining NMs.
|
||||||
AMs(e.g. MapReduce AM) today are expected to shutdown when they receive the re-sync command.
|
AMs(e.g. MapReduce AM) are expected to shutdown when they receive the re-sync command.
|
||||||
After RM restarts and loads all the application metadata, credentials from state-store
|
After RM restarts and loads all the application metadata, credentials from state-store
|
||||||
and populates them into memory, it will create a new
|
and populates them into memory, it will create a new
|
||||||
attempt (i.e. ApplicationMaster) for each application that was not yet completed
|
attempt (i.e. ApplicationMaster) for each application that was not yet completed
|
||||||
@ -76,13 +79,33 @@ ResourceManger Restart
|
|||||||
applications' work is lost in this manner since they are essentially killed by
|
applications' work is lost in this manner since they are essentially killed by
|
||||||
RM via the re-sync command on restart.
|
RM via the re-sync command on restart.
|
||||||
|
|
||||||
|
** Phase 2: Work-preserving RM restart
|
||||||
|
|
||||||
|
As of Hadoop 2.6.0, we further enhanced RM restart feature to address the problem
|
||||||
|
to not kill any applications running on YARN cluster if RM restarts.
|
||||||
|
|
||||||
|
Beyond all the groundwork that has been done in Phase 1 to ensure the persistency
|
||||||
|
of application state and reload that state on recovery, Phase 2 primarily focuses
|
||||||
|
on re-constructing the entire running state of YARN cluster, the majority of which is
|
||||||
|
the state of the central scheduler inside RM which keeps track of all containers' life-cycle,
|
||||||
|
applications' headroom and resource requests, queues' resource usage etc. In this way,
|
||||||
|
RM doesn't need to kill the AM and re-run the application from scratch as it is
|
||||||
|
done in Phase 1. Applications can simply re-sync back with RM and
|
||||||
|
resume from where it were left off.
|
||||||
|
|
||||||
|
RM recovers its runing state by taking advantage of the container statuses sent from all NMs.
|
||||||
|
NM will not kill the containers when it re-syncs with the restarted RM. It continues
|
||||||
|
managing the containers and send the container statuses across to RM when it re-registers.
|
||||||
|
RM reconstructs the container instances and the associated applications' scheduling status by
|
||||||
|
absorbing these containers' information. In the meantime, AM needs to re-send the
|
||||||
|
outstanding resource requests to RM because RM may lose the unfulfilled requests when it shuts down.
|
||||||
|
Application writers using AMRMClient library to communicate with RM do not need to
|
||||||
|
worry about the part of AM re-sending resource requests to RM on re-sync, as it is
|
||||||
|
automatically taken care by the library itself.
|
||||||
|
|
||||||
* {Configurations}
|
* {Configurations}
|
||||||
|
|
||||||
This section describes the configurations involved to enable RM Restart feature.
|
** Enable RM Restart.
|
||||||
|
|
||||||
* Enable ResourceManager Restart functionality.
|
|
||||||
|
|
||||||
To enable RM Restart functionality, set the following property in <<conf/yarn-site.xml>> to true:
|
|
||||||
|
|
||||||
*--------------------------------------+--------------------------------------+
|
*--------------------------------------+--------------------------------------+
|
||||||
|| Property || Value |
|
|| Property || Value |
|
||||||
@ -92,9 +115,10 @@ ResourceManger Restart
|
|||||||
*--------------------------------------+--------------------------------------+
|
*--------------------------------------+--------------------------------------+
|
||||||
|
|
||||||
|
|
||||||
* Configure the state-store that is used to persist the RM state.
|
** Configure the state-store for persisting the RM state.
|
||||||
|
|
||||||
*--------------------------------------+--------------------------------------+
|
|
||||||
|
*--------------------------------------*--------------------------------------+
|
||||||
|| Property || Description |
|
|| Property || Description |
|
||||||
*--------------------------------------+--------------------------------------+
|
*--------------------------------------+--------------------------------------+
|
||||||
| <<<yarn.resourcemanager.store.class>>> | |
|
| <<<yarn.resourcemanager.store.class>>> | |
|
||||||
@ -103,12 +127,34 @@ ResourceManger Restart
|
|||||||
| | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore>>> |
|
| | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore>>> |
|
||||||
| | , a ZooKeeper based state-store implementation and |
|
| | , a ZooKeeper based state-store implementation and |
|
||||||
| | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore>>> |
|
| | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore>>> |
|
||||||
| | , a Hadoop FileSystem based state-store implementation like HDFS. |
|
| | , a Hadoop FileSystem based state-store implementation like HDFS and local FS. |
|
||||||
|
| | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore>>>, |
|
||||||
|
| | a LevelDB based state-store implementation. |
|
||||||
| | The default value is set to |
|
| | The default value is set to |
|
||||||
| | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore>>>. |
|
| | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore>>>. |
|
||||||
*--------------------------------------+--------------------------------------+
|
*--------------------------------------+--------------------------------------+
|
||||||
|
|
||||||
* Configurations when using Hadoop FileSystem based state-store implementation.
|
** How to choose the state-store implementation.
|
||||||
|
|
||||||
|
<<ZooKeeper based state-store>>: User is free to pick up any storage to set up RM restart,
|
||||||
|
but must use ZooKeeper based state-store to support RM HA. The reason is that only ZooKeeper
|
||||||
|
based state-store supports fencing mechanism to avoid a split-brain situation where multiple
|
||||||
|
RMs assume they are active and can edit the state-store at the same time.
|
||||||
|
|
||||||
|
<<FileSystem based state-store>>: HDFS and local FS based state-store are supported.
|
||||||
|
Fencing mechanism is not supported.
|
||||||
|
|
||||||
|
<<LevelDB based state-store>>: LevelDB based state-store is considered more light weight than HDFS and ZooKeeper
|
||||||
|
based state-store. LevelDB supports better atomic operations, fewer I/O ops per state update,
|
||||||
|
and far fewer total files on the filesystem. Fencing mechanism is not supported.
|
||||||
|
|
||||||
|
** Configurations for Hadoop FileSystem based state-store implementation.
|
||||||
|
|
||||||
|
Support both HDFS and local FS based state-store implementation. The type of file system to
|
||||||
|
be used is determined by the scheme of URI. e.g. <<<hdfs://localhost:9000/rmstore>>> uses HDFS as the storage and
|
||||||
|
<<<file:///tmp/yarn/rmstore>>> uses local FS as the storage. If no
|
||||||
|
scheme(<<<hdfs://>>> or <<<file://>>>) is specified in the URI, the type of storage to be used is
|
||||||
|
determined by <<<fs.defaultFS>>> defined in <<<core-site.xml>>>.
|
||||||
|
|
||||||
Configure the URI where the RM state will be saved in the Hadoop FileSystem state-store.
|
Configure the URI where the RM state will be saved in the Hadoop FileSystem state-store.
|
||||||
|
|
||||||
@ -137,7 +183,7 @@ ResourceManger Restart
|
|||||||
| | Default value is (2000, 500) |
|
| | Default value is (2000, 500) |
|
||||||
*--------------------------------------+--------------------------------------+
|
*--------------------------------------+--------------------------------------+
|
||||||
|
|
||||||
* Configurations when using ZooKeeper based state-store implementation.
|
** Configurations for ZooKeeper based state-store implementation.
|
||||||
|
|
||||||
Configure the ZooKeeper server address and the root path where the RM state is stored.
|
Configure the ZooKeeper server address and the root path where the RM state is stored.
|
||||||
|
|
||||||
@ -184,25 +230,69 @@ ResourceManger Restart
|
|||||||
| | ACLs to be used for setting permissions on ZooKeeper znodes. Default value is <<<world:anyone:rwcda>>> |
|
| | ACLs to be used for setting permissions on ZooKeeper znodes. Default value is <<<world:anyone:rwcda>>> |
|
||||||
*--------------------------------------+--------------------------------------+
|
*--------------------------------------+--------------------------------------+
|
||||||
|
|
||||||
* Configure the max number of application attempt retries.
|
** Configurations for LevelDB based state-store implementation.
|
||||||
|
|
||||||
*--------------------------------------+--------------------------------------+
|
*--------------------------------------+--------------------------------------+
|
||||||
|| Property || Description |
|
|| Property || Description |
|
||||||
*--------------------------------------+--------------------------------------+
|
*--------------------------------------+--------------------------------------+
|
||||||
| <<<yarn.resourcemanager.am.max-attempts>>> | |
|
| <<<yarn.resourcemanager.leveldb-state-store.path>>> | |
|
||||||
| | The maximum number of application attempts. It's a global |
|
| | Local path where the RM state will be stored. |
|
||||||
| | setting for all application masters. Each application master can specify |
|
| | Default value is <<<${hadoop.tmp.dir}/yarn/system/rmstore>>> |
|
||||||
| | its individual maximum number of application attempts via the API, but the |
|
|
||||||
| | individual number cannot be more than the global upper bound. If it is, |
|
|
||||||
| | the RM will override it. The default number is set to 2, to |
|
|
||||||
| | allow at least one retry for AM. |
|
|
||||||
*--------------------------------------+--------------------------------------+
|
*--------------------------------------+--------------------------------------+
|
||||||
|
|
||||||
This configuration's impact is in fact beyond RM restart scope. It controls
|
|
||||||
the max number of attempts an application can have. In RM Restart Phase 1,
|
** Configurations for work-preserving RM recovery.
|
||||||
this configuration is needed since as described earlier each time RM restarts,
|
|
||||||
it kills the previously running attempt (i.e. ApplicationMaster) and
|
*--------------------------------------+--------------------------------------+
|
||||||
creates a new attempt. Therefore, each occurrence of RM restart causes the
|
|| Property || Description |
|
||||||
attempt count to increase by 1. In RM Restart phase 2, this configuration is not
|
*--------------------------------------+--------------------------------------+
|
||||||
needed since the previously running ApplicationMaster will
|
| <<<yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms>>> | |
|
||||||
not be killed and the AM will just re-sync back with RM after RM restarts.
|
| | Set the amount of time RM waits before allocating new |
|
||||||
|
| | containers on RM work-preserving recovery. Such wait period gives RM a chance |
|
||||||
|
| | to settle down resyncing with NMs in the cluster on recovery, before assigning|
|
||||||
|
| | new containers to applications.|
|
||||||
|
*--------------------------------------+--------------------------------------+
|
||||||
|
|
||||||
|
* {Notes}
|
||||||
|
|
||||||
|
ContainerId string format is changed if RM restarts with work-preserving recovery enabled.
|
||||||
|
It used to be such format:
|
||||||
|
|
||||||
|
Container_\{clusterTimestamp\}_\{appId\}_\{attemptId\}_\{containerId\}, e.g. Container_1410901177871_0001_01_000005.
|
||||||
|
|
||||||
|
It is now changed to:
|
||||||
|
|
||||||
|
Container_<<e\{epoch\}>>_\{clusterTimestamp\}_\{appId\}_\{attemptId\}_\{containerId\}, e.g. Container_<<e17>>_1410901177871_0001_01_000005.
|
||||||
|
|
||||||
|
Here, the additional epoch number is a
|
||||||
|
monotonically increasing integer which starts from 0 and is increased by 1 each time
|
||||||
|
RM restarts. If epoch number is 0, it is omitted and the containerId string format
|
||||||
|
stays the same as before.
|
||||||
|
|
||||||
|
* {Sample configurations}
|
||||||
|
|
||||||
|
Below is a minimum set of configurations for enabling RM work-preserving restart using ZooKeeper based state store.
|
||||||
|
|
||||||
|
+---+
|
||||||
|
<property>
|
||||||
|
<description>Enable RM to recover state after starting. If true, then
|
||||||
|
yarn.resourcemanager.store.class must be specified</description>
|
||||||
|
<name>yarn.resourcemanager.recovery.enabled</name>
|
||||||
|
<value>true</value>
|
||||||
|
</property>
|
||||||
|
|
||||||
|
<property>
|
||||||
|
<description>The class to use as the persistent store.</description>
|
||||||
|
<name>yarn.resourcemanager.store.class</name>
|
||||||
|
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
|
||||||
|
</property>
|
||||||
|
|
||||||
|
<property>
|
||||||
|
<description>Comma separated list of Host:Port pairs. Each corresponds to a ZooKeeper server
|
||||||
|
(e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002") to be used by the RM for storing RM state.
|
||||||
|
This must be supplied when using org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
|
||||||
|
as the value for yarn.resourcemanager.store.class</description>
|
||||||
|
<name>yarn.resourcemanager.zk-address</name>
|
||||||
|
<value>127.0.0.1:2181</value>
|
||||||
|
</property>
|
||||||
|
+---+
|
Loading…
Reference in New Issue
Block a user