This guide provides an overview of the HDFS High Availability (HA) feature and how to configure and manage an HA HDFS cluster, using the Quorum Journal Manager (QJM) feature.
This document assumes that the reader has a general understanding of the components and node types in an HDFS cluster. Please refer to the HDFS Architecture guide for details.
Note: Using the Quorum Journal Manager or Conventional Shared Storage
---------------------------------------------------------------------
This guide discusses how to configure and use HDFS HA using the Quorum Journal Manager (QJM) to share edit logs between the Active and Standby NameNodes. For information on how to configure HDFS HA using NFS for shared storage instead of the QJM, please see [this alternative guide.](./HDFSHighAvailabilityWithNFS.html)
Background
----------
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.
This impacted the total availability of the HDFS cluster in two major ways:
* In the case of an unplanned event such as a machine crash, the cluster would
be unavailable until an operator restarted the NameNode.
* Planned maintenance events such as software or hardware upgrades on the
NameNode machine would result in windows of cluster downtime.
The HDFS High Availability feature addresses the above problems by providing the option of running two (and as of 3.0.0 more than two) redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.
In a typical HA cluster, two or more separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an *Active* state, and the others are in a *Standby* state. The Active NameNode is responsible for all client operations in the cluster, while the Standbys are simply acting as workers, maintaining enough state to provide a fast failover if necessary.
In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called "JournalNodes" (JNs). When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. The Standby node is capable of reading the edits from the JNs, and is constantly watching them for changes to the edit log. As the Standby Node sees the edits, it applies them to its own namespace. In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.
In order to provide a fast failover, it is also necessary that the Standby node have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of all NameNodes, and send block location information and heartbeats to all.
It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called "split-brain scenario," the JournalNodes will only ever allow a single NameNode to be a writer at a time. During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state, allowing the new Active to safely proceed with failover.
Hardware resources
------------------
In order to deploy an HA cluster, you should prepare the following:
* **NameNode machines** - the machines on which you run the Active and
Standby NameNodes should have equivalent hardware to each other, and
equivalent hardware to what would be used in a non-HA cluster.
* **JournalNode machines** - the machines on which you run the JournalNodes.
The JournalNode daemon is relatively lightweight, so these daemons may
reasonably be collocated on machines with other Hadoop daemons, for example
NameNodes, the JobTracker, or the YARN ResourceManager. **Note:** There
must be at least 3 JournalNode daemons, since edit log modifications must be
written to a majority of JNs. This will allow the system to tolerate the
failure of a single machine. You may also run more than 3 JournalNodes, but
in order to actually increase the number of failures the system can tolerate,
you should run an odd number of JNs (i.e. 3, 5, 7, etc.). Note that when
running with N JournalNodes, the system can tolerate at most (N - 1) / 2
failures and continue to function normally.
Note that, in an HA cluster, the Standby NameNodes also perform checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster to be HA-enabled to reuse the hardware which they had previously dedicated to the Secondary NameNode.
Similar to Federation configuration, HA configuration is backward compatible and allows existing single NameNode configurations to work without change. The new configuration is designed such that all the nodes in the cluster may have the same configuration without the need for deploying different configuration files to different machines based on the type of the node.
Like HDFS Federation, HA clusters reuse the `nameservice ID` to identify a single HDFS instance that may in fact consist of multiple HA NameNodes. In addition, a new abstraction called `NameNode ID` is added with HA. Each distinct NameNode in the cluster has a different NameNode ID to distinguish it. To support a single configuration file for all of the NameNodes, the relevant configuration parameters are suffixed with the **nameservice ID** as well as the **NameNode ID**.
### Configuration details
To configure HA NameNodes, you must add several configuration options to your **hdfs-site.xml** configuration file.
The order in which you set these configurations is unimportant, but the values you choose for **dfs.nameservices** and **dfs.ha.namenodes.[nameservice ID]** will determine the keys of those that follow. Thus, you should decide on these values before setting the rest of the configuration options.
* **dfs.nameservices** - the logical name for this new nameservice
Choose a logical name for this nameservice, for example "mycluster", and use
this logical name for the value of this config option. The name you choose is
arbitrary. It will be used both for configuration and as the authority
component of absolute HDFS paths in the cluster.
**Note:** If you are also using HDFS Federation, this configuration setting
should also include the list of other nameservices, HA or otherwise, as a
comma-separated list.
    <property>
      <name>dfs.nameservices</name>
      <value>mycluster</value>
    </property>
* **dfs.ha.namenodes.[nameservice ID]** - unique identifiers for each NameNode in the nameservice
Configure with a list of comma-separated NameNode IDs. This will be used by
DataNodes to determine all the NameNodes in the cluster. For example, if you
used "mycluster" as the nameservice ID previously, and you wanted to use "nn1",
"nn2" and "nn3" as the individual IDs of the NameNodes, you would configure a
property `dfs.ha.namenodes.mycluster`, and its value would be "nn1,nn2,nn3".
**Note:** The minimum number of NameNodes for HA is two, but you can configure more. It is suggested not to exceed 5 NameNodes (3 is recommended) due to communication overheads.
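Continuing that example, a minimal `hdfs-site.xml` entry for this setting might look like the following sketch (the nameservice ID "mycluster" and the NameNode IDs "nn1", "nn2", "nn3" are just the illustrative names chosen above):

    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2,nn3</value>
    </property>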
After all of the necessary configuration options have been set, you must start the JournalNode daemons on the set of machines where they will run. This can be done by running the command "*hdfs \--daemon start journalnode*" and waiting for the daemon to start on each of the relevant machines.
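For example, on each JournalNode machine (assuming the `hdfs` command is on the PATH of the user running the daemons):

    [hdfs]$ hdfs --daemon start journalnode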
You can visit each of the NameNodes' web pages separately by browsing to their configured HTTP addresses. You should notice that next to the configured address will be the HA state of the NameNode (either "standby" or "active".) Whenever an HA NameNode starts, it is initially in the Standby state.
### Administrative commands
Now that your HA NameNodes are configured and started, you will have access to some additional commands to administer your HA HDFS cluster. Specifically, you should familiarize yourself with all of the subcommands of the "*hdfs haadmin*" command. Running this command without any additional arguments will display the following usage information:
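An abbreviated sketch of that usage output (the exact wording varies slightly between releases) looks roughly like:

    Usage: haadmin
        [-transitionToActive <serviceId>]
        [-transitionToStandby <serviceId>]
        [-failover [--forcefence] [--forceactive] <serviceId> <serviceId>]
        [-getServiceState <serviceId>]
        [-getAllServiceState]
        [-checkHealth <serviceId>]
        [-help <command>]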
This guide describes high-level uses of each of these subcommands. For specific usage information of each subcommand, you should run "*hdfs haadmin -help \<command*\>".
* **transitionToActive** and **transitionToStandby** - transition the state of the given NameNode to Active or Standby
These subcommands cause a given NameNode to transition to the Active or Standby
state, respectively. **These commands do not attempt to perform any fencing,
and thus should rarely be used.** Instead, one should almost always prefer to
use the "*hdfs haadmin -failover*" subcommand.
* **failover** - initiate a failover between two NameNodes
This subcommand causes a failover from the first provided NameNode to the
second. If the first NameNode is in the Standby state, this command simply
transitions the second to the Active state without error. If the first NameNode
is in the Active state, an attempt will be made to gracefully transition it to
the Standby state. If this fails, the fencing methods (as configured by
**dfs.ha.fencing.methods**) will be attempted in order until one
succeeds. Only after this process will the second NameNode be transitioned to
the Active state. If no fencing method succeeds, the second NameNode will not
be transitioned to the Active state, and an error will be returned.
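For example, to fail over from the NameNode with ID `nn1` to the one with ID `nn2` (the illustrative IDs used earlier in this guide):

    [hdfs]$ hdfs haadmin -failover nn1 nn2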
* **getServiceState** - determine whether the given NameNode is Active or Standby
Connect to the provided NameNode to determine its current state, printing
either "standby" or "active" to STDOUT appropriately. This subcommand might be
used by cron jobs or monitoring scripts which need to behave differently based
on whether the NameNode is currently Active or Standby.
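A monitoring script might invoke it roughly like this (again using the illustrative NameNode ID `nn1`):

    [hdfs]$ hdfs haadmin -getServiceState nn1
    active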
If you are running a set of NameNodes behind a Load Balancer (e.g. [Azure](https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-custom-probe-overview) or [AWS](https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-healthchecks.html) ) and would like the Load Balancer to point to the active NN, you can use the /isActive HTTP endpoint as a health probe.
http://NN_HOSTNAME/isActive will return a 200 status code response if the NN is in Active HA State, 405 otherwise.
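As a quick sanity check from the command line, something like the following should print the status code (the hostname and port below are placeholders for your own NameNode's HTTP address):

    [hdfs]$ curl -s -o /dev/null -w "%{http_code}\n" http://nn1.example.com:9870/isActive
    200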
Automatic Failover
------------------

The above sections describe how to configure manual failover. In that mode, the system will not automatically trigger a failover from the active to the standby NameNode, even if the active node has failed. This section describes how to configure and deploy automatic failover.
### Components
Automatic failover adds two new components to an HDFS deployment: a ZooKeeper quorum, and the ZKFailoverController process (abbreviated as ZKFC).
Apache ZooKeeper is a highly available service for maintaining small amounts of coordination data, notifying clients of changes in that data, and monitoring clients for failures. The implementation of automatic HDFS failover relies on ZooKeeper for the following things:
* **Failure detection** - each of the NameNode machines in the cluster
maintains a persistent session in ZooKeeper. If the machine crashes, the
ZooKeeper session will expire, notifying the other NameNode(s) that a
failover should be triggered.
* **Active NameNode election** - ZooKeeper provides a simple mechanism to
exclusively elect a node as active. If the current active NameNode crashes,
another node may take a special exclusive lock in ZooKeeper indicating that
it should become the next active.
The ZKFailoverController (ZKFC) is a new component: a ZooKeeper client that also monitors and manages the state of the NameNode. Each of the machines which runs a NameNode also runs a ZKFC, and that ZKFC is responsible for:
* **Health monitoring** - the ZKFC pings its local NameNode on a periodic
basis with a health-check command. So long as the NameNode responds in a
timely fashion with a healthy status, the ZKFC considers the node
healthy. If the node has crashed, frozen, or otherwise entered an unhealthy
state, the health monitor will mark it as unhealthy.
* **ZooKeeper session management** - when the local NameNode is healthy, the
ZKFC holds a session open in ZooKeeper. If the local NameNode is active, it
also holds a special "lock" znode. This lock uses ZooKeeper's support for
"ephemeral" nodes; if the session expires, the lock node will be
automatically deleted.
* **ZooKeeper-based election** - if the local NameNode is healthy, and the
ZKFC sees that no other node currently holds the lock znode, it will itself
try to acquire the lock. If it succeeds, then it has "won the election", and
is responsible for running a failover to make its local NameNode active. The
failover process is similar to the manual failover described above: first,
the previous active is fenced if necessary, and then the local NameNode
transitions to active state.
For more details on the design of automatic failover, refer to the design document attached to HDFS-2185 on the Apache HDFS JIRA.
### Deploying ZooKeeper
In a typical deployment, ZooKeeper daemons are configured to run on three or five nodes. Since ZooKeeper itself has light resource requirements, it is acceptable to collocate the ZooKeeper nodes on the same hardware as the HDFS NameNode and Standby Node. Many operators choose to deploy the third ZooKeeper process on the same node as the YARN ResourceManager. It is advisable to configure the ZooKeeper nodes to store their data on separate disk drives from the HDFS metadata for best performance and isolation.
The setup of ZooKeeper is out of scope for this document. We will assume that you have set up a ZooKeeper cluster running on three or more nodes, and have verified its correct operation by connecting using the ZK CLI.
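For a basic sanity check, connecting with the CLI that ships with ZooKeeper and listing the root znode is usually sufficient (the host and port are placeholders for one of your ZooKeeper servers):

    [zk]$ zkCli.sh -server zk1.example.com:2181
    [zk: zk1.example.com:2181(CONNECTED) 0] ls /
    [zookeeper]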
### Before you begin
Before you begin configuring automatic failover, you should shut down your cluster. It is not currently possible to transition from a manual failover setup to an automatic failover setup while the cluster is running.
### Configuring automatic failover
The configuration of automatic failover requires the addition of two new parameters to your configuration. In your `hdfs-site.xml` file, add:
    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>
This specifies that the cluster should be set up for automatic failover. In your `core-site.xml` file, add:
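    <property>
      <name>ha.zookeeper.quorum</name>
      <!-- the hosts and ports below are examples; list your own ZooKeeper servers -->
      <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
    </property>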
This lists the host-port pairs running the ZooKeeper service.
As with the parameters described earlier in the document, these settings may be configured on a per-nameservice basis by suffixing the configuration key with the nameservice ID. For example, in a cluster with federation enabled, you can explicitly enable automatic failover for only one of the nameservices by setting `dfs.ha.automatic-failover.enabled.my-nameservice-id`.
There are also several other configuration parameters which may be set to control the behavior of automatic failover; however, they are not necessary for most installations. Please refer to the configuration key specific documentation for details.
### Initializing HA state in ZooKeeper
After the configuration keys have been added, the next step is to initialize required state in ZooKeeper. You can do so by running the following command from one of the NameNode hosts.
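    [hdfs]$ hdfs zkfc -formatZK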
This will create a znode in ZooKeeper inside of which the automatic failover system stores its data.
### Starting the cluster with `start-dfs.sh`
Since automatic failover has been enabled in the configuration, the `start-dfs.sh` script will now automatically start a ZKFC daemon on any machine that runs a NameNode. When the ZKFCs start, they will automatically select one of the NameNodes to become active.
### Starting the cluster manually
If you manually manage the services on your cluster, you will need to manually start the `zkfc` daemon on each of the machines that runs a NameNode. You can start the daemon by running:
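    [hdfs]$ hdfs --daemon start zkfc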
### Securing access to ZooKeeper

If you are running a secure cluster, you will likely want to ensure that the information stored in ZooKeeper is also secured. This prevents malicious clients from modifying the metadata in ZooKeeper or potentially triggering a false failover.
In order to secure the information in ZooKeeper, first add the following to your `core-site.xml` file:
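    <property>
      <name>ha.zookeeper.auth</name>
      <!-- the paths below are examples; point them at your own files -->
      <value>@/path/to/zk-auth.txt</value>
    </property>
    <property>
      <name>ha.zookeeper.acl</name>
      <value>@/path/to/zk-acl.txt</value>
    </property>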
Please note the '@' character in these values -- this specifies that the configurations are not inline, but rather point to a file on disk. The authentication info may also be read via a CredentialProvider (please see the CredentialProvider API Guide in the hadoop-common project).
The first configured file specifies a list of ZooKeeper authentications, in the same format as used by the ZK CLI. For example, you may specify something like:
    digest:hdfs-zkfcs:mypassword
...where `hdfs-zkfcs` is a unique username for ZooKeeper, and `mypassword` is some unique string used as a password.
Next, generate a ZooKeeper ACL that corresponds to this authentication, using a command like the following:
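A sketch of that command, assuming ZooKeeper is installed under `$ZK_HOME` (the exact jar name and path depend on your ZooKeeper version and layout):

    [hdfs]$ java -cp $ZK_HOME/lib/*:$ZK_HOME/zookeeper-3.4.2.jar org.apache.zookeeper.server.auth.DigestAuthenticationProvider hdfs-zkfcs:mypassword

Copy the digest that this prints (the portion after the '->') into the ACL file, prefixed with "digest:" and followed by the desired permissions (for example ":rwcda"), and then re-run `hdfs zkfc -formatZK` so that the new ACLs take effect.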
### Verifying automatic failover

Once automatic failover has been set up, you should test its operation. To do so, first locate the active NameNode. You can tell which node is active by visiting the NameNode web interfaces -- each node reports its HA state at the top of the page.
Once you have located your active NameNode, you may cause a failure on that node. For example, you can use `kill -9 <pid of NN>` to simulate a JVM crash. Or, you could power cycle the machine or unplug its network interface to simulate a different kind of outage. After triggering the outage you wish to test, the other NameNode should automatically become active within several seconds. The amount of time required to detect a failure and trigger a failover depends on the configuration of `ha.zookeeper.session-timeout.ms`, but defaults to 5 seconds.
If the test does not succeed, you may have a misconfiguration. Check the logs for the `zkfc` daemons as well as the NameNode daemons in order to further diagnose the issue.
Automatic Failover FAQ
----------------------
* **Is it important that I start the ZKFC and NameNode daemons in any particular order?**
No. On any given node you may start the ZKFC before or after its corresponding NameNode.
* **What additional monitoring should I put in place?**
You should add monitoring on each host that runs a NameNode to ensure that the
ZKFC remains running. In some types of ZooKeeper failures, for example, the
ZKFC may unexpectedly exit, and should be restarted to ensure that the system
is ready for automatic failover.
Additionally, you should monitor each of the servers in the ZooKeeper
quorum. If ZooKeeper crashes, then automatic failover will not function.
* **What happens if ZooKeeper goes down?**
If the ZooKeeper cluster crashes, no automatic failovers will be triggered.
However, HDFS will continue to run without any impact. When ZooKeeper is
restarted, HDFS will reconnect with no issues.
* **Can I designate one of my NameNodes as primary/preferred?**
No. Currently, this is not supported. Whichever NameNode is started first will
become active. You may choose to start the cluster in a specific order such
that your preferred node starts first.
* **How can I initiate a manual failover when automatic failover is configured?**
Even if automatic failover is configured, you may initiate a manual failover
using the same `hdfs haadmin` command. It will perform a coordinated
failover.
HDFS Upgrade/Finalization/Rollback with HA Enabled
--------------------------------------------------
When moving between versions of HDFS, sometimes the newer software can simply be installed and the cluster restarted. Sometimes, however, upgrading the version of HDFS you're running may require changing on-disk data. In this case, one must use the HDFS Upgrade/Finalize/Rollback facility after installing the new software. This process is made more complex in an HA environment, since the on-disk metadata that the NN relies upon is by definition distributed, both on the two HA NNs in the pair, and on the JournalNodes in the case that QJM is being used for the shared edits storage. This documentation section describes the procedure to use the HDFS Upgrade/Finalize/Rollback facility in an HA setup.
**To perform an HA upgrade**, the operator must do the following:
1. Shut down all of the NNs as normal, and install the newer software.
2. Start up all of the JNs. Note that it is **critical** that all the
JNs be running when performing the upgrade, rollback, or finalization
operations. If any of the JNs are down at the time of running any of these
operations, the operation will fail.
3. Start one of the NNs with the `'-upgrade'` flag.
4. On start, this NN will not enter the standby state as usual in an HA
setup. Rather, this NN will immediately enter the active state, perform an
upgrade of its local storage dirs, and also perform an upgrade of the shared
edit log.
5. At this point the other NN in the HA pair will be out of sync with
the upgraded NN. In order to bring it back in sync and once again have a highly
available setup, you should re-bootstrap this NameNode by running the NN with
the `'-bootstrapStandby'` flag. It is an error to start this second NN with
the `'-upgrade'` flag.
Note that if at any time you want to restart the NameNodes before finalizing or rolling back the upgrade, you should start the NNs as normal, i.e. without any special startup flag.
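As a rough sketch of steps 3 through 5 above, assuming the daemons are managed directly with the `hdfs` script (rather than the cluster start/stop scripts):

    # On the NN chosen to drive the upgrade:
    [hdfs]$ hdfs --daemon start namenode -upgrade

    # On each of the other NN machines, once the upgraded NN is active:
    [hdfs]$ hdfs namenode -bootstrapStandby
    [hdfs]$ hdfs --daemon start namenode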
**To query status of upgrade**, the operator will use the `hdfs dfsadmin -upgrade query` command while at least one of the NNs is running. The command will return whether the NN upgrade process is finalized or not, for each NN.
**To finalize an HA upgrade**, the operator will use the `hdfs dfsadmin -finalizeUpgrade` command while the NNs are running and one of them is active. The active NN at the time this happens will perform the finalization of the shared log, and the NN whose local storage directories contain the previous FS state will delete its local state.
**To perform a rollback** of an upgrade, both NNs should first be shut down. The operator should run the rollback command on the NN where they initiated the upgrade procedure, which will perform the rollback on the local dirs there, as well as on the shared log, either NFS or on the JNs. Afterward, this NN should be started and the operator should run `-bootstrapStandby` on the other NN to bring the two NNs in sync with this rolled-back file system state.
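A sketch of that rollback sequence, under the same assumptions as the upgrade sketch above (both NNs already shut down, with the rollback run on the NN where the upgrade was started):

    # On the NN where the upgrade was started:
    [hdfs]$ hdfs namenode -rollback
    [hdfs]$ hdfs --daemon start namenode

    # On the other NN machine(s):
    [hdfs]$ hdfs namenode -bootstrapStandby
    [hdfs]$ hdfs --daemon start namenode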