YARN-2994. Document work-preserving RM restart. Contributed by Jian He.

2015-02-13 13:08:13 +09:00 · 2015-02-13 13:08:13 +09:00 · b0d81e05ab
commit b0d81e05ab
parent 2f1e5dc628
2 changed files with 136 additions and 44 deletions
--- a/hadoop-yarn-project/CHANGES.txt
+++ b/hadoop-yarn-project/CHANGES.txt
@ -83,6 +83,8 @@ Release 2.7.0 - UNRELEASED
     YARN-2616 [YARN-913] Add CLI client to the registry to list, view
     and manipulate entries. (Akshay Radia via stevel)

+    YARN-2994. Document work-preserving RM restart. (Jian He via ozawa)
+
  IMPROVEMENTS

    YARN-3005. [JDK7] Use switch statement for String instead of if-else
--- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRestart.apt.vm
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRestart.apt.vm
@ -11,12 +11,12 @@
 ~~ limitations under the License. See accompanying LICENSE file.

  ---
-  ResourceManger Restart
+  ResourceManager Restart
  ---
  ---
  ${maven.build.timestamp}

-ResourceManger Restart
+ResourceManager Restart

 %{toc|section=1|fromDepth=0}

@ -32,23 +32,26 @@ ResourceManger Restart

  ResourceManager Restart feature is divided into two phases:

-  ResourceManager Restart Phase 1: Enhance RM to persist application/attempt state
+  ResourceManager Restart Phase 1 (Non-work-preserving RM restart):
+  Enhance RM to persist application/attempt state
  and other credentials information in a pluggable state-store. RM will reload
  this information from state-store upon restart and re-kick the previously
  running applications. Users are not required to re-submit the applications.

-  ResourceManager Restart Phase 2:
-  Focus on re-constructing the running state of ResourceManger by reading back
-  the container statuses from NodeMangers and container requests from ApplicationMasters
+  ResourceManager Restart Phase 2 (Work-preserving RM restart):
+  Focus on re-constructing the running state of ResourceManager by combining
+  the container statuses from NodeManagers and container requests from ApplicationMasters
  upon restart. The key difference from phase 1 is that previously running applications
  will not be killed after RM restarts, and so applications won't lose its work
  because of RM outage.

+* {Feature}
+
+** Phase 1: Non-work-preserving RM restart
+
  As of Hadoop 2.4.0 release, only ResourceManager Restart Phase 1 is implemented which
  is described below.

-* {Feature}
-
  The overall concept is that RM will persist the application metadata
  (i.e. ApplicationSubmissionContext) in
  a pluggable state-store when client submits an application and also saves the final status
@ -62,13 +65,13 @@ ResourceManger Restart
  applications if they were already completed (i.e. failed, killed, finished)
  before RM went down.

-  NodeMangers and clients during the down-time of RM will keep polling RM until 
+  NodeManagers and clients during the down-time of RM will keep polling RM until 
  RM comes up. When RM becomes alive, it will send a re-sync command to
-  all the NodeMangers and ApplicationMasters it was talking to via heartbeats.
-  Today, the behaviors for NodeMangers and ApplicationMasters to handle this command
+  all the NodeManagers and ApplicationMasters it was talking to via heartbeats.
+  As of Hadoop 2.4.0 release, the behaviors for NodeManagers and ApplicationMasters to handle this command
  are: NMs will kill all its managed containers and re-register with RM. From the
  RM's perspective, these re-registered NodeManagers are similar to the newly joining NMs. 
-  AMs(e.g. MapReduce AM) today are expected to shutdown when they receive the re-sync command.
+  AMs(e.g. MapReduce AM) are expected to shutdown when they receive the re-sync command.
  After RM restarts and loads all the application metadata, credentials from state-store
  and populates them into memory, it will create a new
  attempt (i.e. ApplicationMaster) for each application that was not yet completed
@ -76,13 +79,33 @@ ResourceManger Restart
  applications' work is lost in this manner since they are essentially killed by
  RM via the re-sync command on restart.

+** Phase 2: Work-preserving RM restart
+
+  As of Hadoop 2.6.0, we further enhanced RM restart feature to address the problem 
+  to not kill any applications running on YARN cluster if RM restarts.
+
+  Beyond all the groundwork that has been done in Phase 1 to ensure the persistency
+  of application state and reload that state on recovery, Phase 2 primarily focuses
+  on re-constructing the entire running state of YARN cluster, the majority of which is
+  the state of the central scheduler inside RM which keeps track of all containers' life-cycle,
+  applications' headroom and resource requests, queues' resource usage etc. In this way,
+  RM doesn't need to kill the AM and re-run the application from scratch as it is
+  done in Phase 1. Applications can simply re-sync back with RM and
+  resume from where it were left off.
+
+  RM recovers its runing state by taking advantage of the container statuses sent from all NMs.
+  NM will not kill the containers when it re-syncs with the restarted RM. It continues
+  managing the containers and send the container statuses across to RM when it re-registers.
+  RM reconstructs the container instances and the associated applications' scheduling status by
+  absorbing these containers' information. In the meantime, AM needs to re-send the
+  outstanding resource requests to RM because RM may lose the unfulfilled requests when it shuts down.
+  Application writers using AMRMClient library to communicate with RM do not need to
+  worry about the part of AM re-sending resource requests to RM on re-sync, as it is
+  automatically taken care by the library itself.
+
 * {Configurations}

-  This section describes the configurations involved to enable RM Restart feature.
-
-  * Enable ResourceManager Restart functionality.
-
-    To enable RM Restart functionality, set the following property in <<conf/yarn-site.xml>> to true:
+** Enable RM Restart.

 *--------------------------------------+--------------------------------------+
 || Property                            || Value                                |
@ -92,9 +115,10 @@ ResourceManger Restart
 *--------------------------------------+--------------------------------------+ 


-  * Configure the state-store that is used to persist the RM state.
+** Configure the state-store for persisting the RM state.

-*--------------------------------------+--------------------------------------+
+
+*--------------------------------------*--------------------------------------+
 || Property                            || Description                        |
 *--------------------------------------+--------------------------------------+
 | <<<yarn.resourcemanager.store.class>>> | |
@ -103,12 +127,34 @@ ResourceManger Restart
 | | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore>>> |
 | | , a ZooKeeper based state-store implementation and  |
 | | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore>>> |
-| | , a Hadoop FileSystem based state-store implementation like HDFS. |
+| | , a Hadoop FileSystem based state-store implementation like HDFS and local FS. |
+| | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore>>>, |
+| | a LevelDB based state-store implementation. |
 | | The default value is set to |
 | | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore>>>. |
 *--------------------------------------+--------------------------------------+ 

-    * Configurations when using Hadoop FileSystem based state-store implementation.
+** How to choose the state-store implementation.
+
+    <<ZooKeeper based state-store>>: User is free to pick up any storage to set up RM restart,
+    but must use ZooKeeper based state-store to support RM HA. The reason is that only ZooKeeper
+    based state-store supports fencing mechanism to avoid a split-brain situation where multiple
+    RMs assume they are active and can edit the state-store at the same time.
+
+    <<FileSystem based state-store>>: HDFS and local FS based state-store are supported. 
+    Fencing mechanism is not supported.
+
+    <<LevelDB based state-store>>: LevelDB based state-store is considered more light weight than HDFS and ZooKeeper
+    based state-store. LevelDB supports better atomic operations, fewer I/O ops per state update,
+    and far fewer total files on the filesystem. Fencing mechanism is not supported.
+
+** Configurations for Hadoop FileSystem based state-store implementation.
+
+    Support both HDFS and local FS based state-store implementation. The type of file system to
+    be used is determined by the scheme of URI. e.g. <<<hdfs://localhost:9000/rmstore>>> uses HDFS as the storage and
+    <<<file:///tmp/yarn/rmstore>>> uses local FS as the storage. If no
+    scheme(<<<hdfs://>>> or <<<file://>>>) is specified in the URI, the type of storage to be used is
+    determined by <<<fs.defaultFS>>> defined in <<<core-site.xml>>>.

    Configure the URI where the RM state will be saved in the Hadoop FileSystem state-store.

@ -137,7 +183,7 @@ ResourceManger Restart
 | | Default value is (2000, 500) |
 *--------------------------------------+--------------------------------------+ 

-    * Configurations when using ZooKeeper based state-store implementation.
+** Configurations for ZooKeeper based state-store implementation.
  
    Configure the ZooKeeper server address and the root path where the RM state is stored.

@ -184,25 +230,69 @@ ResourceManger Restart
 | | ACLs to be used for setting permissions on ZooKeeper znodes. Default value is <<<world:anyone:rwcda>>> |
 *--------------------------------------+--------------------------------------+

-  * Configure the max number of application attempt retries.
+** Configurations for LevelDB based state-store implementation.

 *--------------------------------------+--------------------------------------+
 || Property                            || Description                        |
 *--------------------------------------+--------------------------------------+
-| <<<yarn.resourcemanager.am.max-attempts>>> | |
-| | The maximum number of application attempts. It's a global |
-| | setting for all application masters. Each application master can specify |
-| | its individual maximum number of application attempts via the API, but the |
-| | individual number cannot be more than the global upper bound. If it is, |
-| | the RM will override it. The default number is set to 2, to |
-| | allow at least one retry for AM. |
+| <<<yarn.resourcemanager.leveldb-state-store.path>>> | |
+| | Local path where the RM state will be stored. |
+| | Default value is <<<${hadoop.tmp.dir}/yarn/system/rmstore>>> |
 *--------------------------------------+--------------------------------------+

-    This configuration's impact is in fact beyond RM restart scope. It controls
-    the max number of attempts an application can have. In RM Restart Phase 1,
-    this configuration is needed since as described earlier each time RM restarts,
-    it kills the previously running attempt (i.e. ApplicationMaster) and
-    creates a new attempt. Therefore, each occurrence of RM restart causes the
-    attempt count to increase by 1. In RM Restart phase 2, this configuration is not
-    needed since the previously running ApplicationMaster will
-    not be killed and the AM will just re-sync back with RM after RM restarts.
+
+**  Configurations for work-preserving RM recovery.
+
+*--------------------------------------+--------------------------------------+
+|| Property                            || Description                        |
+*--------------------------------------+--------------------------------------+
+| <<<yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms>>> | |
+| | Set the amount of time RM waits before allocating new |
+| | containers on RM work-preserving recovery. Such wait period gives RM a chance | 
+| | to settle down resyncing with NMs in the cluster on recovery, before assigning|
+| |  new containers to applications.|
+*--------------------------------------+--------------------------------------+
+
+* {Notes}
+
+  ContainerId string format is changed if RM restarts with work-preserving recovery enabled.
+  It used to be such format:
+
+   Container_\{clusterTimestamp\}_\{appId\}_\{attemptId\}_\{containerId\}, e.g. Container_1410901177871_0001_01_000005.
+
+  It is now changed to:
+
+   Container_<<e\{epoch\}>>_\{clusterTimestamp\}_\{appId\}_\{attemptId\}_\{containerId\}, e.g. Container_<<e17>>_1410901177871_0001_01_000005.
+ 
+  Here, the additional epoch number is a
+  monotonically increasing integer which starts from 0 and is increased by 1 each time
+  RM restarts. If epoch number is 0, it is omitted and the containerId string format
+  stays the same as before.
+
+* {Sample configurations}
+
+   Below is a minimum set of configurations for enabling RM work-preserving restart using ZooKeeper based state store.
+
+---+
+  <property>
+    <description>Enable RM to recover state after starting. If true, then 
+    yarn.resourcemanager.store.class must be specified</description>
+    <name>yarn.resourcemanager.recovery.enabled</name>
+    <value>true</value>
+  </property>
+
+  <property>
+    <description>The class to use as the persistent store.</description>
+    <name>yarn.resourcemanager.store.class</name>
+    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
+  </property>
+
+  <property>
+    <description>Comma separated list of Host:Port pairs. Each corresponds to a ZooKeeper server
+    (e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002") to be used by the RM for storing RM state.
+    This must be supplied when using org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
+    as the value for yarn.resourcemanager.store.class</description>
+    <name>yarn.resourcemanager.zk-address</name>
+    <value>127.0.0.1:2181</value>
+  </property>
+---+