From b0d81e05abd730a923aed4727a8191650aebd2f9 Mon Sep 17 00:00:00 2001 From: Tsuyoshi Ozawa Date: Fri, 13 Feb 2015 13:08:13 +0900 Subject: [PATCH] YARN-2994. Document work-preserving RM restart. Contributed by Jian He. --- hadoop-yarn-project/CHANGES.txt | 2 + .../site/apt/ResourceManagerRestart.apt.vm | 178 +++++++++++++----- 2 files changed, 136 insertions(+), 44 deletions(-) diff --git a/hadoop-yarn-project/CHANGES.txt b/hadoop-yarn-project/CHANGES.txt index 41e5411b06..622072f5a7 100644 --- a/hadoop-yarn-project/CHANGES.txt +++ b/hadoop-yarn-project/CHANGES.txt @@ -83,6 +83,8 @@ Release 2.7.0 - UNRELEASED YARN-2616 [YARN-913] Add CLI client to the registry to list, view and manipulate entries. (Akshay Radia via stevel) + YARN-2994. Document work-preserving RM restart. (Jian He via ozawa) + IMPROVEMENTS YARN-3005. [JDK7] Use switch statement for String instead of if-else diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRestart.apt.vm b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRestart.apt.vm index 30a3a64297..a08c19db06 100644 --- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRestart.apt.vm +++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRestart.apt.vm @@ -11,12 +11,12 @@ ~~ limitations under the License. See accompanying LICENSE file. --- - ResourceManger Restart + ResourceManager Restart --- --- ${maven.build.timestamp} -ResourceManger Restart +ResourceManager Restart %{toc|section=1|fromDepth=0} @@ -32,23 +32,26 @@ ResourceManger Restart ResourceManager Restart feature is divided into two phases: - ResourceManager Restart Phase 1: Enhance RM to persist application/attempt state + ResourceManager Restart Phase 1 (Non-work-preserving RM restart): + Enhance RM to persist application/attempt state and other credentials information in a pluggable state-store. RM will reload this information from state-store upon restart and re-kick the previously running applications. Users are not required to re-submit the applications. - ResourceManager Restart Phase 2: - Focus on re-constructing the running state of ResourceManger by reading back - the container statuses from NodeMangers and container requests from ApplicationMasters + ResourceManager Restart Phase 2 (Work-preserving RM restart): + Focus on re-constructing the running state of ResourceManager by combining + the container statuses from NodeManagers and container requests from ApplicationMasters upon restart. The key difference from phase 1 is that previously running applications will not be killed after RM restarts, and so applications won't lose its work because of RM outage. +* {Feature} + +** Phase 1: Non-work-preserving RM restart + As of Hadoop 2.4.0 release, only ResourceManager Restart Phase 1 is implemented which is described below. -* {Feature} - The overall concept is that RM will persist the application metadata (i.e. ApplicationSubmissionContext) in a pluggable state-store when client submits an application and also saves the final status @@ -62,13 +65,13 @@ ResourceManger Restart applications if they were already completed (i.e. failed, killed, finished) before RM went down. - NodeMangers and clients during the down-time of RM will keep polling RM until + NodeManagers and clients during the down-time of RM will keep polling RM until RM comes up. When RM becomes alive, it will send a re-sync command to - all the NodeMangers and ApplicationMasters it was talking to via heartbeats. 
-  Today, the behaviors for NodeMangers and ApplicationMasters to handle this command
+  all the NodeManagers and ApplicationMasters it was talking to via heartbeats.
+  As of Hadoop 2.4.0 release, the behaviors for NodeManagers and ApplicationMasters to handle this command
   are: NMs will kill all its managed containers and re-register with RM. From the RM's
   perspective, these re-registered NodeManagers are similar to the newly joining NMs.
-  AMs(e.g. MapReduce AM) today are expected to shutdown when they receive the re-sync command.
+  AMs (e.g. MapReduce AM) are expected to shut down when they receive the re-sync command.
   After RM restarts and loads all the application metadata, credentials from state-store
   and populates them into memory, it will create a new
   attempt (i.e. ApplicationMaster) for each application that was not yet completed
@@ -76,13 +79,33 @@ ResourceManger Restart
   applications' work is lost in this manner since they are essentially killed by
   RM via the re-sync command on restart.
 
+** Phase 2: Work-preserving RM restart
+
+  As of Hadoop 2.6.0, the RM restart feature is further enhanced so that applications
+  running on the YARN cluster are not killed when RM restarts.
+
+  Beyond all the groundwork that has been done in Phase 1 to ensure the persistence
+  of application state and reload that state on recovery, Phase 2 primarily focuses
+  on re-constructing the entire running state of the YARN cluster, the majority of which is
+  the state of the central scheduler inside RM, which keeps track of all containers' life-cycle,
+  applications' headroom and resource requests, queues' resource usage, etc. In this way,
+  RM doesn't need to kill the AM and re-run the application from scratch as it is
+  done in Phase 1. Applications can simply re-sync back with RM and
+  resume from where they left off.
+
+  RM recovers its running state by taking advantage of the container statuses sent from all NMs.
+  NM will not kill the containers when it re-syncs with the restarted RM. It continues
+  managing the containers and sends the container statuses across to RM when it re-registers.
+  RM reconstructs the container instances and the associated applications' scheduling status by
+  absorbing these containers' information. In the meantime, AM needs to re-send the
+  outstanding resource requests to RM because RM may lose the unfulfilled requests when it shuts down.
+  Application writers using the AMRMClient library to communicate with RM do not need to
+  worry about the part of AM re-sending resource requests to RM on re-sync, as it is
+  automatically taken care of by the library itself.
+
 * {Configurations}
 
-  This section describes the configurations involved to enable RM Restart feature.
-
-  * Enable ResourceManager Restart functionality.
-
-    To enable RM Restart functionality, set the following property in <<conf/yarn-site.xml>> to true:
+** Enable RM Restart.
 
 *--------------------------------------+--------------------------------------+
 || Property || Value |
 *--------------------------------------+--------------------------------------+
@@ -92,9 +115,10 @@ ResourceManger Restart
 *--------------------------------------+--------------------------------------+
 
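Illustrative sketch, not part of the patch above: a minimal yarn-site.xml fragment that turns the feature on. The property name, value and description are taken from the sample configuration at the end of this patch; a state-store class still has to be chosen, as the following sections describe.

+---+
<property>
  <!-- Turn on RM recovery. A state-store must also be configured via
       yarn.resourcemanager.store.class, as described in the next sections. -->
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
+---+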
-  * Configure the state-store that is used to persist the RM state.
+** Configure the state-store for persisting the RM state.
 
-*--------------------------------------+--------------------------------------+
+
+*--------------------------------------*--------------------------------------+
 || Property || Description |
 *--------------------------------------+--------------------------------------+
 | <<<yarn.resourcemanager.store.class>>> | |
@@ -103,14 +127,36 @@ ResourceManger Restart
 | | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore>>> |
 | | , a ZooKeeper based state-store implementation and |
 | | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore>>> |
-| | , a Hadoop FileSystem based state-store implementation like HDFS. |
+| | , a Hadoop FileSystem based state-store implementation like HDFS and local FS. |
+| | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore>>>, |
+| | a LevelDB based state-store implementation. |
 | | The default value is set to |
 | | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore>>>. |
 *--------------------------------------+--------------------------------------+
 
-  * Configurations when using Hadoop FileSystem based state-store implementation.
+** How to choose the state-store implementation.
 
-    Configure the URI where the RM state will be saved in the Hadoop FileSystem state-store.
+  <<ZooKeeper based state-store>>: User is free to pick any storage to set up RM restart,
+  but must use the ZooKeeper based state-store to support RM HA. The reason is that only the ZooKeeper
+  based state-store supports a fencing mechanism to avoid a split-brain situation where multiple
+  RMs assume they are active and can edit the state-store at the same time.
+
+  <<FileSystem based state-store>>: HDFS and local FS based state-stores are supported. The
+  fencing mechanism is not supported.
+
+  <<LevelDB based state-store>>: The LevelDB based state-store is considered more lightweight than
+  the HDFS and ZooKeeper based state-stores. LevelDB supports better atomic operations, fewer I/O
+  ops per state update, and far fewer total files on the filesystem. The fencing mechanism is not supported.
+
+** Configurations for Hadoop FileSystem based state-store implementation.
+
+  Both HDFS and local FS based state-store implementations are supported. The type of file system to
+  be used is determined by the scheme of the URI, e.g. <<<hdfs://localhost:9000/rmstore>>> uses HDFS as
+  the storage and <<<file:///tmp/yarn/rmstore>>> uses local FS as the storage. If no
+  scheme (<<<hdfs://>>> or <<<file://>>>) is specified in the URI, the type of storage to be used is
+  determined by <<<fs.defaultFS>>> defined in <<<core-site.xml>>>.
+
+  Configure the URI where the RM state will be saved in the Hadoop FileSystem state-store.
 
 *--------------------------------------+--------------------------------------+
 || Property || Description |
 *--------------------------------------+--------------------------------------+
 | <<<yarn.resourcemanager.fs.state-store.uri>>> | |
@@ -123,8 +169,8 @@ ResourceManger Restart
 | | <<conf/core-site.xml>> will be used. |
 *--------------------------------------+--------------------------------------+
 
-    Configure the retry policy state-store client uses to connect with the Hadoop
-    FileSystem.
+  Configure the retry policy state-store client uses to connect with the Hadoop
+  FileSystem.
 
 *--------------------------------------+--------------------------------------+
 || Property || Description |
 *--------------------------------------+--------------------------------------+
 | <<<yarn.resourcemanager.fs.state-store.retry-policy-spec>>> | |
@@ -137,9 +183,9 @@ ResourceManger Restart
 | | Default value is (2000, 500) |
 *--------------------------------------+--------------------------------------+
 
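Illustrative sketch, not part of the patch above: a FileSystem based setup combining the two properties documented in this section. The FileSystemRMStateStore class name and the yarn.resourcemanager.fs.state-store.uri property are as reconstructed in the tables above, and the HDFS host, port and path are placeholders.

+---+
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>

<property>
  <!-- Persist RM state in a Hadoop FileSystem; a file:// URI below would select the local FS instead. -->
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
</property>

<property>
  <!-- Placeholder URI; point it at a path the RM user can write to. -->
  <name>yarn.resourcemanager.fs.state-store.uri</name>
  <value>hdfs://namenode.example.com:8020/rmstore</value>
</property>
+---+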
-  * Configurations when using ZooKeeper based state-store implementation.
+** Configurations for ZooKeeper based state-store implementation.
 
-    Configure the ZooKeeper server address and the root path where the RM state is stored.
+  Configure the ZooKeeper server address and the root path where the RM state is stored.
 
 *--------------------------------------+--------------------------------------+
 || Property || Description |
 *--------------------------------------+--------------------------------------+
 | <<<yarn.resourcemanager.zk-address>>> | |
@@ -154,7 +200,7 @@ ResourceManger Restart
 | | Default value is /rmstore. |
 *--------------------------------------+--------------------------------------+
 
-    Configure the retry policy state-store client uses to connect with the ZooKeeper server.
+  Configure the retry policy state-store client uses to connect with the ZooKeeper server.
 
 *--------------------------------------+--------------------------------------+
 || Property || Description |
 *--------------------------------------+--------------------------------------+
 | <<<yarn.resourcemanager.zk-num-retries>>> | |
@@ -175,7 +221,7 @@ ResourceManger Restart
 | | value is 10 seconds |
 *--------------------------------------+--------------------------------------+
 
-    Configure the ACLs to be used for setting permissions on ZooKeeper znodes.
+  Configure the ACLs to be used for setting permissions on ZooKeeper znodes.
 
 *--------------------------------------+--------------------------------------+
 || Property || Description |
 *--------------------------------------+--------------------------------------+
 | <<<yarn.resourcemanager.zk-acl>>> | |
@@ -184,25 +230,69 @@ ResourceManger Restart
 | | ACLs to be used for setting permissions on ZooKeeper znodes. Default value is <<<world:anyone:rwcda>>> |
 *--------------------------------------+--------------------------------------+
 
-  * Configure the max number of application attempt retries.
+** Configurations for LevelDB based state-store implementation.
 
 *--------------------------------------+--------------------------------------+
 || Property || Description |
 *--------------------------------------+--------------------------------------+
-| <<<yarn.resourcemanager.am.max-attempts>>> | |
-| | The maximum number of application attempts. It's a global |
-| | setting for all application masters. Each application master can specify |
-| | its individual maximum number of application attempts via the API, but the |
-| | individual number cannot be more than the global upper bound. If it is, |
-| | the RM will override it. The default number is set to 2, to |
-| | allow at least one retry for AM. |
+| <<<yarn.resourcemanager.leveldb-state-store.path>>> | |
+| | Local path where the RM state will be stored. |
+| | Default value is <<<${hadoop.tmp.dir}/yarn/system/rmstore>>> |
 *--------------------------------------+--------------------------------------+
 
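Illustrative sketch, not part of the patch above: a LevelDB based setup using the class and property as reconstructed in the table above. The local path is a placeholder; if the path property is omitted, the default listed in the table applies.

+---+
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>

<property>
  <!-- Persist RM state in a local LevelDB database. -->
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore</value>
</property>

<property>
  <!-- Placeholder path on the RM host's local disk. -->
  <name>yarn.resourcemanager.leveldb-state-store.path</name>
  <value>/var/lib/hadoop-yarn/rmstore</value>
</property>
+---+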
-  This configuration's impact is in fact beyond RM restart scope. It controls
-  the max number of attempts an application can have. In RM Restart Phase 1,
-  this configuration is needed since as described earlier each time RM restarts,
-  it kills the previously running attempt (i.e. ApplicationMaster) and
-  creates a new attempt. Therefore, each occurrence of RM restart causes the
-  attempt count to increase by 1. In RM Restart phase 2, this configuration is not
-  needed since the previously running ApplicationMaster will
-  not be killed and the AM will just re-sync back with RM after RM restarts.
+
+** Configurations for work-preserving RM recovery.
+
+*--------------------------------------+--------------------------------------+
+|| Property || Description |
+*--------------------------------------+--------------------------------------+
+| <<<yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms>>> | |
+| | Set the amount of time RM waits before allocating new |
+| | containers on RM work-preserving recovery. Such a wait period gives RM a chance |
+| | to settle down re-syncing with NMs in the cluster on recovery, before assigning |
+| | new containers to applications. |
+*--------------------------------------+--------------------------------------+
+
+* {Notes}
+
+  The ContainerId string format is changed if RM restarts with work-preserving recovery enabled.
+  It used to be in this format:
+
+  Container_\{clusterTimestamp\}_\{appId\}_\{attemptId\}_\{containerId\}, e.g. Container_1410901177871_0001_01_000005.
+
+  It is now changed to:
+
+  Container_<<e\{epoch\}>>_\{clusterTimestamp\}_\{appId\}_\{attemptId\}_\{containerId\}, e.g.
+  Container_<<e17>>_1410901177871_0001_01_000005.
+
+  Here, the additional epoch number is a
+  monotonically increasing integer which starts from 0 and is increased by 1 each time
+  RM restarts. If the epoch number is 0, it is omitted and the ContainerId string format
+  stays the same as before.
+
+* {Sample configurations}
+
+  Below is a minimum set of configurations for enabling RM work-preserving restart using the
+  ZooKeeper based state-store.
+
++---+
+ <property>
+   <description>Enable RM to recover state after starting. If true, then
+   yarn.resourcemanager.store.class must be specified</description>
+   <name>yarn.resourcemanager.recovery.enabled</name>
+   <value>true</value>
+ </property>
+
+ <property>
+   <description>The class to use as the persistent store.</description>
+   <name>yarn.resourcemanager.store.class</name>
+   <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
+ </property>
+
+ <property>
+   <description>Comma separated list of Host:Port pairs. Each corresponds to a ZooKeeper server
+   (e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002") to be used by the RM for storing RM state.
+   This must be supplied when using org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
+   as the value for yarn.resourcemanager.store.class</description>
+   <name>yarn.resourcemanager.zk-address</name>
+   <value>127.0.0.1:2181</value>
+ </property>
++---+
\ No newline at end of file
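Illustrative addendum, not part of the patch above: the sample configuration can optionally be extended with the work-preserving recovery wait documented earlier. The property name is the one reconstructed in that table; 10000 milliseconds is only an example value.

+---+
<property>
  <!-- Example value: give the restarted RM ten seconds to re-sync with the
       NodeManagers before it starts handing out new containers. -->
  <name>yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms</name>
  <value>10000</value>
</property>
+---+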