HADOOP-19039. Hadoop 3.4.0 Highlight big features and improvements. (#6462) Contributed by Shilun Fan.

Reviewed-by: He Xiaoqiao <hexiaoqiao@apache.org>
Signed-off-by: Shilun Fan <slfan1989@apache.org>
slfan1989 committed 2024-01-25 15:42:21 +08:00 (committed by GitHub)
parent caba9bbab3
commit 38f10c657e
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194


Apache Hadoop ${project.version}
================================

Apache Hadoop ${project.version} is an update to the Hadoop 3.4.x release branch.

Overview of Changes
===================

Users are encouraged to read the full set of release notes.
This page provides an overview of the major changes.
S3A: Upgrade AWS SDK to V2
----------------------------------------

[HADOOP-18073](https://issues.apache.org/jira/browse/HADOOP-18073) S3A: Upgrade AWS SDK to V2.

This release upgrades Hadoop's AWS connector, S3A, from AWS SDK for Java V1 to AWS SDK for Java V2.
This is a significant change which offers a number of new features, including the ability to work with Amazon S3 Express One Zone storage, the new high-performance, single-AZ storage class.
HDFS DataNode Split one FsDatasetImpl lock to volume grain locks
----------------------------------------

[HDFS-15382](https://issues.apache.org/jira/browse/HDFS-15382) Split one FsDatasetImpl lock to volume grain locks.

Throughput is one of the core performance metrics for a DataNode instance.
However, because of a global coarse-grained lock, the DataNode does not always reach its best performance, especially in Federation deployments, despite various improvements.
A series of issues (including [HDFS-16534](https://issues.apache.org/jira/browse/HDFS-16534), [HDFS-16511](https://issues.apache.org/jira/browse/HDFS-16511), [HDFS-15382](https://issues.apache.org/jira/browse/HDFS-15382) and [HDFS-16429](https://issues.apache.org/jira/browse/HDFS-16429))
splits the global coarse-grained lock into fine-grained, two-level locks for block pool and volume,
to improve throughput and avoid lock interference between block pools and volumes.
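The two-level locking idea can be sketched as follows. This is a hypothetical illustration of the scheme, not the actual `FsDatasetImpl` code; the class and method names are invented for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of two-level (block pool -> volume) locking.
// Operations on different volumes of the same block pool take the pool
// lock in read mode, so they no longer serialize behind one global lock.
public class TwoLevelLocks {
    private final Map<String, ReentrantReadWriteLock> poolLocks = new ConcurrentHashMap<>();
    private final Map<String, ReentrantReadWriteLock> volumeLocks = new ConcurrentHashMap<>();

    private ReentrantReadWriteLock lockFor(Map<String, ReentrantReadWriteLock> m, String key) {
        return m.computeIfAbsent(key, k -> new ReentrantReadWriteLock());
    }

    // Acquire the block pool lock (read) and then the volume lock (write),
    // always in that order, to avoid deadlock between the two levels.
    public void withVolumeWriteLock(String bpid, String volume, Runnable op) {
        ReentrantReadWriteLock pool = lockFor(poolLocks, bpid);
        ReentrantReadWriteLock vol = lockFor(volumeLocks, bpid + "/" + volume);
        pool.readLock().lock();
        try {
            vol.writeLock().lock();
            try {
                op.run();
            } finally {
                vol.writeLock().unlock();
            }
        } finally {
            pool.readLock().unlock();
        }
    }
}
```

Because the pool-level lock is held in read mode for per-volume work, writes to two different volumes of the same block pool can proceed concurrently.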
YARN Federation improvements
----------------------------------------

[YARN-5597](https://issues.apache.org/jira/browse/YARN-5597) YARN Federation improvements.

We have enhanced the YARN Federation functionality for improved usability. The enhanced features are as follows:
1. YARN Router now provides a full implementation of all interfaces, including ApplicationClientProtocol, ResourceManagerAdministrationProtocol, and RMWebServiceProtocol.
2. YARN Router supports application cleanup and automatic offline mechanisms for subClusters.
3. Code improvements were undertaken for the Router and AMRMProxy, along with enhancements to previously pending functionalities.
4. Audit logs and metrics for the Router received upgrades.
5. Cluster security was strengthened with the inclusion of Kerberos support.
6. The Router's web pages have been enhanced.
7. A set of commands has been added to the Router side for operating on SubClusters and Policies.
HDFS RBF: Code Enhancements, New Features, and Bug Fixes
----------------------------------------

The HDFS RBF functionality has undergone significant enhancements, encompassing over 200 commits for feature
improvements, new functionalities, and bug fixes.
Important features and improvements are as follows:

**Feature**

[HDFS-15294](https://issues.apache.org/jira/browse/HDFS-15294) HDFS Federation balance tool introduces a tool to balance data across different namespaces.
**Improvement**

[HDFS-17128](https://issues.apache.org/jira/browse/HDFS-17128) RBF: SQLDelegationTokenSecretManager should use version of tokens updated by other routers.

The SQLDelegationTokenSecretManager enhances performance by maintaining processed tokens in memory. However, there is
a potential issue of router cache inconsistency due to token loading and renewal. This issue has been addressed by the
resolution of HDFS-17128.

[HDFS-17148](https://issues.apache.org/jira/browse/HDFS-17148) RBF: SQLDelegationTokenSecretManager must clean up expired tokens in SQL.

SQLDelegationTokenSecretManager, while fetching and temporarily storing tokens from SQL in a memory cache with a short TTL,
faces an issue where expired tokens are not efficiently cleaned up, leading to a buildup of expired tokens in the SQL database.
This issue has been addressed by the resolution of HDFS-17148.

**Others**

Other changes to HDFS RBF include WebUI, command line, and other improvements. Please refer to the release document.
HDFS EC: Code Enhancements and Bug Fixes
----------------------------------------

HDFS EC has made code improvements and fixed some bugs.
Important improvements and bugs are as follows:

**Improvement**

[HDFS-16613](https://issues.apache.org/jira/browse/HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks.

In an HDFS cluster with a lot of EC blocks, decommissioning a DataNode is very slow. The reason is that, unlike replicated blocks,
which can be copied from any DataNode holding a replica, EC blocks have to be replicated from the decommissioning DataNode itself.
The configurations `dfs.namenode.replication.max-streams` and `dfs.namenode.replication.max-streams-hard-limit` limit
the replication speed, but increasing them puts the whole cluster's network at risk. A new configuration was therefore added
to limit the decommissioning DataNode separately from the cluster-wide max-streams limit.
[HDFS-16663](https://issues.apache.org/jira/browse/HDFS-16663) EC: Allow block reconstruction pending timeout refreshable to increase decommission performance.

In [HDFS-16613](https://issues.apache.org/jira/browse/HDFS-16613), increasing the value of `dfs.namenode.replication.max-streams-hard-limit` maximizes the IO
performance of a decommissioning DataNode with a lot of EC blocks. Besides this, the value of
`dfs.namenode.reconstruction.pending.timeout-sec` (default 5 minutes) also needs to be decreased to shorten the interval for checking
pendingReconstructions; otherwise the decommissioning node sits idle waiting for copy tasks for most of those 5 minutes.
During decommissioning, these two parameters may need to be reconfigured several times. Since [HDFS-14560](https://issues.apache.org/jira/browse/HDFS-14560),
`dfs.namenode.replication.max-streams-hard-limit` can already be reconfigured dynamically without a NameNode restart, and
`dfs.namenode.reconstruction.pending.timeout-sec` is now dynamically reconfigurable as well.
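Assuming a running cluster, the dynamic-reconfiguration workflow uses the standard `hdfs dfsadmin -reconfig` subcommand; this is an operational sketch, and the NameNode host and IPC port below are placeholders.

```shell
# Sketch of the dynamic-reconfiguration workflow (nn-host:8020 is a placeholder).
# 1. Edit hdfs-site.xml on the NameNode, e.g. raise
#    dfs.namenode.replication.max-streams-hard-limit and lower
#    dfs.namenode.reconstruction.pending.timeout-sec.
# 2. Ask the NameNode to pick up the changed properties without a restart:
hdfs dfsadmin -reconfig namenode nn-host:8020 start
# 3. Poll until the reconfiguration task reports completion:
hdfs dfsadmin -reconfig namenode nn-host:8020 status
# 4. List which properties are reconfigurable at runtime:
hdfs dfsadmin -reconfig namenode nn-host:8020 properties
```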
**Bug**

[HDFS-16456](https://issues.apache.org/jira/browse/HDFS-16456) EC: Decommission a rack with only one dn will fail when the rack number is equal with replication.

In the scenario below, decommissioning fails with the `TOO_MANY_NODES_ON_RACK` reason:
- An EC policy such as RS-6-3-1024k is enabled.
- The number of racks in the cluster is equal to or less than the replication number (9).
- A rack has only one DataNode, and that DataNode is decommissioned.

This issue has been addressed by the resolution of HDFS-16456.
[HDFS-17094](https://issues.apache.org/jira/browse/HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes.
During block recovery, the `RecoveryTaskStriped` in the datanode expects a one-to-one correspondence between
`rBlock.getLocations()` and `rBlock.getBlockIndices()`. However, if there are stale locations during a NameNode heartbeat,
this correspondence may be disrupted. Specifically, although there are no stale locations in `recoveryLocations`, the block indices
array remains complete. This discrepancy causes `BlockRecoveryWorker.RecoveryTaskStriped#recover` to generate an incorrect
internal block ID, leading to a failure in the recovery process as the corresponding datanode cannot locate the replica.
This issue has been addressed by the resolution of HDFS-17094.
[HDFS-17284](https://issues.apache.org/jira/browse/HDFS-17284) EC: Fix int overflow in calculating numEcReplicatedTasks and numReplicationTasks during block recovery.
Due to an integer overflow in the calculation of numReplicationTasks or numEcReplicatedTasks, the NameNode's configuration
parameter `dfs.namenode.replication.max-streams-hard-limit` failed to take effect. This led to an excessive number of tasks
being sent to the DataNodes, consequently occupying too much of their memory.
This issue has been addressed by the resolution of HDFS-17284.
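As a minimal, hypothetical illustration of this class of bug (not the actual NameNode code), note how an `int * int` product overflows before it is widened to `long`, while casting one operand first keeps the arithmetic in `long`:

```java
// Hypothetical demo of the overflow class fixed by HDFS-17284; the method
// and variable names are invented for illustration.
public class OverflowDemo {
    // Bug pattern: the multiplication happens in int, overflows, and only
    // the already-wrong result is widened to long.
    static long narrow(int maxTransfers, int factor) {
        return maxTransfers * factor;
    }

    // Fix pattern: widening one operand first keeps the whole
    // multiplication in long arithmetic.
    static long widened(int maxTransfers, int factor) {
        return (long) maxTransfers * factor;
    }

    public static void main(String[] args) {
        System.out.println(narrow(100_000, 50_000));   // overflowed: 705032704
        System.out.println(widened(100_000, 50_000));  // correct: 5000000000
    }
}
```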
**Others**
For other improvements and fixes for HDFS EC, please refer to the release document.
Transitive CVE fixes
--------------------

A lot of dependencies have been upgraded to address recent CVEs.
Many of the CVEs were not actually exploitable through Hadoop,
so much of this work is just due diligence.
However, applications which have these libraries on their classpath may
be vulnerable, and the upgrades should also reduce the number of false
positives security scanners report.

We have not been able to upgrade every single dependency to the latest
can, with care, keep data and computing resources private.
1. Physical cluster: *configure Hadoop security*, usually bonded to the
   enterprise Kerberos/Active Directory systems.
   Good.
2. Cloud: transient or persistent single or multiple user/tenant cluster
   with private VLAN *and security*.
   Good.
   Consider [Apache Knox](https://knox.apache.org/) for managing remote
   access to the cluster.
3. Cloud: transient single user/tenant cluster with private VLAN
   *and no security at all*.
   Requires careful network configuration as this is the sole
   means of securing the cluster.