<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->

HDFS Upgrade Domain
====================

<!-- MACRO{toc|fromDepth=0|toDepth=3} -->


Introduction
------------

The current default HDFS block placement policy guarantees that a block’s 3 replicas will be placed
on at least 2 racks. Specifically, one replica is placed on one rack and the other two replicas
are placed on another rack during the write pipeline. This is a good compromise between rack diversity and write-pipeline efficiency. Note that
subsequent load balancing or machine membership changes might cause the 3 replicas of a block to be distributed
across 3 different racks. Thus any 3 datanodes in different racks could store the 3 replicas of a block.


However, the default placement policy impacts how we should perform datanode rolling upgrades.
The [HDFS Rolling Upgrade document](./HdfsRollingUpgrade.html) explains how datanodes can be upgraded in a rolling
fashion without downtime. Because any 3 datanodes in different racks could store all the replicas of a block, it is
important to perform sequential restarts of datanodes, one at a time, in order to minimize the impact on data availability
and read/write operations. Upgrading one rack at a time is another option, but that increases the chance of
data unavailability if there is a machine failure on another rack during the upgrade.

The side effect of this sequential datanode rolling upgrade strategy is a longer
upgrade duration for larger clusters.


Architecture
-------

To address the limitation the default block placement policy imposes on rolling upgrades, the concept of an upgrade domain
has been added to HDFS via a new block placement policy. The idea is to group datanodes along a new
dimension called upgrade domain, in addition to the existing rack-based grouping.
For example, we can assign all datanodes in the first position of any rack to upgrade domain ud_01,
nodes in the second position to upgrade domain ud_02, and so on.

The namenode provides the BlockPlacementPolicy interface to support custom block placement policies besides
the default block placement policy. A new upgrade domain block placement policy based on this interface
is available in HDFS. It makes sure the replicas of any given block are distributed across machines in different upgrade domains.
By default, the 3 replicas of any given block are placed in 3 different upgrade domains. This means that all datanodes belonging to
a specific upgrade domain collectively won't store more than one replica of any block.

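For illustration, assume a hypothetical cluster with two racks (A and B) whose three rack positions map to upgrade domains ud_01 through ud_03. One placement of a block's 3 replicas that satisfies both the rack constraint and the upgrade domain constraint would be:

```
replica 1 -> rack A, position 01  (upgrade domain ud_01)
replica 2 -> rack B, position 02  (upgrade domain ud_02)
replica 3 -> rack B, position 03  (upgrade domain ud_03)
```

Two racks are used and no upgrade domain holds more than one replica, so any single upgrade domain can be taken down without making the block unavailable.
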
With the upgrade domain block placement policy in place, we can upgrade all datanodes belonging to one upgrade domain at the
same time without impacting data availability. Only after finishing one upgrade domain do we move to the next
upgrade domain, until all upgrade domains have been upgraded. Such a procedure ensures that no two replicas of any given
block are upgraded at the same time, which means we can upgrade many machines at the same time on a large cluster.
And as the cluster continues to scale, new machines can be added to the existing upgrade domains without impacting the
parallelism of the upgrade.

For an existing cluster with the default block placement policy, after switching to the new upgrade domain block
placement policy, any newly created blocks will conform to the new policy. The old blocks allocated based on the old policy
need to be migrated to the new policy. There is a migrator tool you can use. See HDFS-8789 for details.


Settings
-------

To enable upgrade domains on your clusters, please follow these steps:

* Assign datanodes to individual upgrade domain groups.
* Enable the upgrade domain block placement policy.
* Migrate blocks allocated under the old block placement policy to the new upgrade domain policy.

### Upgrade domain id assignment

How a datanode maps to an upgrade domain id is defined by administrators and specific to the cluster layout.
A common way is to use the rack position of the machine as its upgrade domain id.

To configure the mapping from host name to upgrade domain id, we need to use a JSON-based host configuration file
and set the following properties, as explained in [hdfs-default.xml](./hdfs-default.xml).

| Setting | Value |
|:---- |:---- |
|`dfs.namenode.hosts.provider.classname` | `org.apache.hadoop.hdfs.server.blockmanagement.CombinedHostFileManager`|
|`dfs.hosts`| the path of the JSON hosts file |

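For example, these settings might look as follows in `hdfs-site.xml`. This is a minimal sketch; the path `/etc/hadoop/conf/dfs.hosts.json` is only an example, so use whatever path holds your JSON hosts file.

```xml
<!-- Sketch of the relevant hdfs-site.xml entries; adjust the dfs.hosts path for your cluster. -->
<property>
  <name>dfs.namenode.hosts.provider.classname</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.CombinedHostFileManager</value>
</property>
<property>
  <name>dfs.hosts</name>
  <value>/etc/hadoop/conf/dfs.hosts.json</value>
</property>
```
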
The JSON hosts file defines the property for all hosts. In the following example,
there are 4 datanodes in 2 racks; the machines at rack position 01 belong to upgrade domain 01;
the machines at rack position 02 belong to upgrade domain 02.

```json
[
  {
    "hostName": "dcArackA01",
    "upgradeDomain": "01"
  },
  {
    "hostName": "dcArackA02",
    "upgradeDomain": "02"
  },
  {
    "hostName": "dcArackB01",
    "upgradeDomain": "01"
  },
  {
    "hostName": "dcArackB02",
    "upgradeDomain": "02"
  }
]
```


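After editing the JSON hosts file, the namenode can re-read it without a restart; a minimal sketch, assuming the file configured in `dfs.hosts` above has already been updated:

```bash
# Ask the namenode to re-read the dfs.hosts file, including upgrade domain assignments.
hdfs dfsadmin -refreshNodes
```

Note that `dfs.namenode.hosts.provider.classname` itself is read at namenode startup, so switching the hosts provider class still requires a namenode restart.
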
### Enable upgrade domain block placement policy

After each datanode has been assigned an upgrade domain id, the next step is to enable the
upgrade domain block placement policy with the following configuration, as explained in [hdfs-default.xml](./hdfs-default.xml).

| Setting | Value |
|:---- |:---- |
|`dfs.block.replicator.classname`| `org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithUpgradeDomain` |

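In `hdfs-site.xml`, this corresponds to a single property; a minimal sketch:

```xml
<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithUpgradeDomain</value>
</property>
```

You can confirm the value the namenode will read with `hdfs getconf -confKey dfs.block.replicator.classname`.
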
After restarting the namenode, the new policy will be used for any new block allocation.


### Migration

If you change the block placement policy of an existing cluster, you will need to make sure the
blocks allocated prior to the block placement policy change conform to the new block placement policy.

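To get a rough sense of how many existing blocks do not yet satisfy the newly configured policy, `fsck` can help, since it verifies block placement against the active policy and reports violating blocks as mis-replicated (a sketch; the exact summary wording may vary by release):

```bash
# Count blocks whose current placement violates the configured block placement policy.
hdfs fsck / | grep -i 'mis-replicated'
```
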
HDFS-8789 provides the initial draft patch of a client-side migration tool. After the tool is committed,
we will be able to describe how to use the tool.


Rolling restart based on upgrade domains
-------

During cluster administration, we might need to restart datanodes to pick up a new configuration, a new Hadoop release,
a new JVM version, and so on. With upgrade domains enabled and all blocks on the cluster conforming to the new policy, we can now
restart datanodes in batches, one upgrade domain at a time. Whether the process is manual or automated, the steps are as follows (a scripted sketch follows the list):

* Group datanodes by upgrade domain, based on the datanode information from dfsadmin or JMX.
* For each upgrade domain:
    * (Optional) Put all the nodes in that upgrade domain into maintenance state (refer to [HdfsDataNodeAdminGuide.html](./HdfsDataNodeAdminGuide.html)).
    * Restart all those nodes.
    * Check that all datanodes are healthy after the restart. Unhealthy nodes should be decommissioned.
    * (Optional) Take all those nodes out of maintenance state.


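A minimal shell sketch of this loop, assuming the JSON hosts file shown earlier lives at `/etc/hadoop/conf/dfs.hosts.json`, that `jq` is available, and that datanodes are restarted with a site-specific command (the `systemctl` unit name below is only a placeholder):

```bash
#!/usr/bin/env bash
# Restart datanodes one upgrade domain at a time, driven by the JSON hosts file.
HOSTS_JSON=/etc/hadoop/conf/dfs.hosts.json   # example path

# All upgrade domains defined in the hosts file.
domains=$(jq -r '.[].upgradeDomain' "$HOSTS_JSON" | sort -u)

for ud in $domains; do
  echo "Restarting datanodes in upgrade domain $ud"
  for host in $(jq -r --arg ud "$ud" '.[] | select(.upgradeDomain == $ud) | .hostName' "$HOSTS_JSON"); do
    # Site-specific restart command; replace with your own mechanism.
    ssh "$host" 'sudo systemctl restart hadoop-hdfs-datanode' &
  done
  wait   # wait for all restarts in this upgrade domain to finish
  # Verify health (e.g. via 'hdfs dfsadmin -report') before moving to the next domain.
done
```
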
Metrics
-----------

Upgrade domains are part of the namenode's JMX output. As explained in [HDFSCommands.html](./HDFSCommands.html), you can also verify upgrade domains using the following commands.

Use `dfsadmin` to check upgrade domains at the cluster level.

`hdfs dfsadmin -report`

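For example, to list just host names and their upgrade domains, you can filter the report output (assuming the per-datanode report prints `Hostname:` and `Upgrade domain:` lines, as it does when upgrade domains are assigned):

```bash
hdfs dfsadmin -report | grep -E 'Hostname:|Upgrade domain:'
```
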
Use `fsck` to check upgrade domains of datanodes storing data at a specific path.

`hdfs fsck <path> -files -blocks -upgradedomains`