YARN-2994. Document work-preserving RM restart. Contributed by Jian He.

Tsuyoshi Ozawa 2015-02-13 13:08:13 +09:00
parent 2f1e5dc628
commit b0d81e05ab
2 changed files with 136 additions and 44 deletions


@@ -83,6 +83,8 @@ Release 2.7.0 - UNRELEASED
YARN-2616 [YARN-913] Add CLI client to the registry to list, view
and manipulate entries. (Akshay Radia via stevel)

YARN-2994. Document work-preserving RM restart. (Jian He via ozawa)

IMPROVEMENTS

YARN-3005. [JDK7] Use switch statement for String instead of if-else


@@ -11,12 +11,12 @@
~~ limitations under the License. See accompanying LICENSE file.

---
ResourceManager Restart
---
---
${maven.build.timestamp}

ResourceManager Restart

%{toc|section=1|fromDepth=0}
@@ -32,23 +32,26 @@ ResourceManager Restart
The ResourceManager Restart feature is divided into two phases:

ResourceManager Restart Phase 1 (Non-work-preserving RM restart):
Enhance RM to persist application/attempt state
and other credentials information in a pluggable state-store. RM will reload
this information from state-store upon restart and re-kick the previously
running applications. Users are not required to re-submit the applications.

ResourceManager Restart Phase 2 (Work-preserving RM restart):
Focus on re-constructing the running state of ResourceManager by combining
the container statuses from NodeManagers and container requests from ApplicationMasters
upon restart. The key difference from phase 1 is that previously running applications
will not be killed after RM restarts, and so applications won't lose their work
because of an RM outage.
* {Feature}
** Phase 1: Non-work-preserving RM restart
As of Hadoop 2.4.0 release, only ResourceManager Restart Phase 1 is implemented which
is described below.
The overall concept is that RM will persist the application metadata
(i.e. ApplicationSubmissionContext) in
a pluggable state-store when the client submits an application and also saves the final status
@@ -62,13 +65,13 @@ ResourceManager Restart
applications if they were already completed (i.e. failed, killed, finished)
before RM went down.

NodeManagers and clients during the down-time of RM will keep polling RM until
RM comes up. When RM becomes alive, it will send a re-sync command to
all the NodeManagers and ApplicationMasters it was talking to via heartbeats.
As of Hadoop 2.4.0 release, the behaviors for NodeManagers and ApplicationMasters to handle this command
are: NMs will kill all their managed containers and re-register with RM. From the
RM's perspective, these re-registered NodeManagers are similar to the newly joining NMs.
AMs (e.g. MapReduce AM) are expected to shut down when they receive the re-sync command.

After RM restarts and loads all the application metadata, credentials from state-store
and populates them into memory, it will create a new
attempt (i.e. ApplicationMaster) for each application that was not yet completed
@@ -76,13 +79,33 @@ ResourceManager Restart
applications' work is lost in this manner since they are essentially killed by
RM via the re-sync command on restart.
** Phase 2: Work-preserving RM restart
As of Hadoop 2.6.0, we further enhanced the RM restart feature so that applications
running on the YARN cluster are not killed when RM restarts.

Beyond all the groundwork that has been done in Phase 1 to ensure the persistence
of application state and reload that state on recovery, Phase 2 primarily focuses
on re-constructing the entire running state of the YARN cluster, the majority of which is
the state of the central scheduler inside RM which keeps track of all containers' life-cycle,
applications' headroom and resource requests, queues' resource usage, etc. In this way,
RM doesn't need to kill the AM and re-run the application from scratch as is
done in Phase 1. Applications can simply re-sync back with RM and
resume from where they left off.

RM recovers its running state by taking advantage of the container statuses sent from all NMs.
NM will not kill the containers when it re-syncs with the restarted RM. It continues
managing the containers and sends the container statuses across to RM when it re-registers.
RM reconstructs the container instances and the associated applications' scheduling status by
absorbing these containers' information. In the meantime, AM needs to re-send the
outstanding resource requests to RM because RM may lose the unfulfilled requests when it shuts down.
Application writers using the AMRMClient library to communicate with RM do not need to
worry about AM re-sending resource requests to RM on re-sync, as it is
automatically taken care of by the library itself.
* {Configurations}

** Enable RM Restart.

To enable RM Restart functionality, set the following property in <<conf/yarn-site.xml>> to true:
*--------------------------------------+--------------------------------------+
|| Property || Value |

@@ -92,9 +115,10 @@ ResourceManager Restart

*--------------------------------------+--------------------------------------+
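For example, a minimal <<conf/yarn-site.xml>> snippet that turns the feature on looks like the
following (the same property also appears in the sample configuration at the end of this document):

+---+
<!-- Enables RM state recovery; a state-store must also be configured via
     yarn.resourcemanager.store.class (see the next table). -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
+---+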
** Configure the state-store for persisting the RM state.

*--------------------------------------+--------------------------------------+
|| Property || Description |
*--------------------------------------+--------------------------------------+
| <<<yarn.resourcemanager.store.class>>> | |

@@ -103,14 +127,36 @@ ResourceManager Restart

| | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore>>> |
| | , a ZooKeeper based state-store implementation and |
| | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore>>> |
| | , a Hadoop FileSystem based state-store implementation like HDFS and local FS. |
| | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore>>>, |
| | a LevelDB based state-store implementation. |
| | The default value is set to |
| | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore>>>. |
*--------------------------------------+--------------------------------------+
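For example, to pick the ZooKeeper based implementation, the entry in <<conf/yarn-site.xml>>
would be as follows (the value must be one of the class names listed in the table above):

+---+
<!-- Any of the three state-store classes listed above may be used here. -->
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
+---+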
** How to choose the state-store implementation.

<<ZooKeeper based state-store>>: Users are free to choose any storage to set up RM restart,
but must use the ZooKeeper based state-store to support RM HA. The reason is that only the ZooKeeper
based state-store supports a fencing mechanism to avoid a split-brain situation where multiple
RMs assume they are active and can edit the state-store at the same time.

<<FileSystem based state-store>>: HDFS and local FS based state-stores are supported.
The fencing mechanism is not supported.

<<LevelDB based state-store>>: The LevelDB based state-store is considered more lightweight than the HDFS and ZooKeeper
based state-stores. LevelDB supports better atomic operations, fewer I/O ops per state update,
and far fewer total files on the filesystem. The fencing mechanism is not supported.

** Configurations for Hadoop FileSystem based state-store implementation.

Both HDFS and local FS based state-store implementations are supported. The type of file system to
be used is determined by the scheme of the URI, e.g. <<<hdfs://localhost:9000/rmstore>>> uses HDFS as the storage and
<<<file:///tmp/yarn/rmstore>>> uses the local FS as the storage. If no
scheme (<<<hdfs://>>> or <<<file://>>>) is specified in the URI, the type of storage to be used is
determined by <<<fs.defaultFS>>> defined in <<<core-site.xml>>>.
Configure the URI where the RM state will be saved in the Hadoop FileSystem state-store.
*--------------------------------------+--------------------------------------+
|| Property || Description |

@@ -123,8 +169,8 @@ ResourceManager Restart

| | <<conf/core-site.xml>> will be used. |
*--------------------------------------+--------------------------------------+
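As an illustration, the state URI can be set in <<conf/yarn-site.xml>> as below. The property name
<<<yarn.resourcemanager.fs.state-store.uri>>> is an assumption here (the table row naming it is not
reproduced above); the value is one of the example URIs mentioned earlier.

+---+
<!-- Assumed property name: yarn.resourcemanager.fs.state-store.uri.
     Use an hdfs:// URI for HDFS or a file:// URI for the local FS. -->
<property>
  <name>yarn.resourcemanager.fs.state-store.uri</name>
  <value>hdfs://localhost:9000/rmstore</value>
</property>
+---+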
Configure the retry policy the state-store client uses to connect with the Hadoop
FileSystem.

*--------------------------------------+--------------------------------------+
|| Property || Description |

@@ -137,9 +183,9 @@ ResourceManager Restart

| | Default value is (2000, 500) |
*--------------------------------------+--------------------------------------+
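A sketch of setting this retry policy explicitly, assuming the property name
<<<yarn.resourcemanager.fs.state-store.retry-policy-spec>>> (not named in the truncated table above);
the value is a pair of sleep time in milliseconds and number of retries, matching the documented
default of (2000, 500):

+---+
<!-- Assumed property name; "2000, 500" means sleep 2000 ms between retries, up to 500 retries. -->
<property>
  <name>yarn.resourcemanager.fs.state-store.retry-policy-spec</name>
  <value>2000, 500</value>
</property>
+---+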
** Configurations for ZooKeeper based state-store implementation.

Configure the ZooKeeper server address and the root path where the RM state is stored.

*--------------------------------------+--------------------------------------+
|| Property || Description |

@@ -154,7 +200,7 @@ ResourceManager Restart

| | Default value is /rmstore. |
*--------------------------------------+--------------------------------------+
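For example, pointing RM at a three-node ZooKeeper quorum and keeping the default root znode.
The host names below are illustrative, and the parent-path property name
<<<yarn.resourcemanager.zk-state-store.parent-path>>> is an assumption (only its default value
/rmstore is visible above):

+---+
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
<!-- Assumed property name for the root znode; /rmstore is the documented default. -->
<property>
  <name>yarn.resourcemanager.zk-state-store.parent-path</name>
  <value>/rmstore</value>
</property>
+---+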
Configure the retry policy the state-store client uses to connect with the ZooKeeper server.

*--------------------------------------+--------------------------------------+
|| Property || Description |

@@ -175,7 +221,7 @@ ResourceManager Restart

| | value is 10 seconds |
*--------------------------------------+--------------------------------------+
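A sketch of tuning these client-side retry settings. The property names
<<<yarn.resourcemanager.zk-timeout-ms>>> and <<<yarn.resourcemanager.zk-num-retries>>> are
assumptions (the rows naming them are not reproduced above), and the values are illustrative only:

+---+
<!-- Assumed property names; illustrative values. -->
<property>
  <name>yarn.resourcemanager.zk-timeout-ms</name>
  <value>10000</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-num-retries</name>
  <value>1000</value>
</property>
+---+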
Configure the ACLs to be used for setting permissions on ZooKeeper znodes.

*--------------------------------------+--------------------------------------+
|| Property || Description |

@@ -184,25 +230,69 @@ ResourceManager Restart

| | ACLs to be used for setting permissions on ZooKeeper znodes. Default value is <<<world:anyone:rwcda>>> |
*--------------------------------------+--------------------------------------+
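For example, the documented default ACL can be set explicitly as below, or replaced with a more
restrictive ACL on a secure cluster. The property name <<<yarn.resourcemanager.zk-acl>>> is an
assumption (only the description and default value are visible above):

+---+
<!-- Assumed property name; world:anyone:rwcda is the documented default and grants full access. -->
<property>
  <name>yarn.resourcemanager.zk-acl</name>
  <value>world:anyone:rwcda</value>
</property>
+---+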
** Configurations for LevelDB based state-store implementation.

*--------------------------------------+--------------------------------------+
|| Property || Description |
*--------------------------------------+--------------------------------------+
| <<<yarn.resourcemanager.leveldb-state-store.path>>> | |
| | Local path where the RM state will be stored. |
| | Default value is <<<${hadoop.tmp.dir}/yarn/system/rmstore>>> |
*--------------------------------------+--------------------------------------+
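For example, to keep the LevelDB store on a dedicated local disk instead of the default
<<<${hadoop.tmp.dir}/yarn/system/rmstore>>> (the path below is illustrative):

+---+
<!-- Illustrative local path; the default is ${hadoop.tmp.dir}/yarn/system/rmstore. -->
<property>
  <name>yarn.resourcemanager.leveldb-state-store.path</name>
  <value>/var/lib/hadoop-yarn/rmstore</value>
</property>
+---+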
** Configurations for work-preserving RM recovery.

*--------------------------------------+--------------------------------------+
|| Property || Description |
*--------------------------------------+--------------------------------------+
| <<<yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms>>> | |
| | Set the amount of time RM waits before allocating new |
| | containers on RM work-preserving recovery. Such wait period gives RM a chance |
| | to settle down resyncing with NMs in the cluster on recovery, before assigning |
| | new containers to applications. |
*--------------------------------------+--------------------------------------+
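For example, giving the scheduler a 10-second settle-down window after recovery (the value is
illustrative and in milliseconds; pick one that matches how quickly NMs in the cluster re-register):

+---+
<!-- Illustrative value in milliseconds. -->
<property>
  <name>yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms</name>
  <value>10000</value>
</property>
+---+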
* {Notes}
ContainerId string format is changed if RM restarts with work-preserving recovery enabled.
It used to be in this format:

Container_\{clusterTimestamp\}_\{appId\}_\{attemptId\}_\{containerId\}, e.g. Container_1410901177871_0001_01_000005.

It is now changed to:

Container_<<e\{epoch\}>>_\{clusterTimestamp\}_\{appId\}_\{attemptId\}_\{containerId\}, e.g. Container_<<e17>>_1410901177871_0001_01_000005.

Here, the additional epoch number is a
monotonically increasing integer which starts from 0 and is increased by 1 each time
RM restarts. If the epoch number is 0, it is omitted and the containerId string format
stays the same as before.
* {Sample configurations}
Below is a minimum set of configurations for enabling RM work-preserving restart using ZooKeeper based state store.
+---+
<property>
  <description>Enable RM to recover state after starting. If true, then
  yarn.resourcemanager.store.class must be specified</description>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>

<property>
  <description>The class to use as the persistent store.</description>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>

<property>
  <description>Comma separated list of Host:Port pairs. Each corresponds to a ZooKeeper server
  (e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002") to be used by the RM for storing RM state.
  This must be supplied when using org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
  as the value for yarn.resourcemanager.store.class</description>
  <name>yarn.resourcemanager.zk-address</name>
  <value>127.0.0.1:2181</value>
</property>
+---+