Security hardening
+ Adds new interceptAndValidateMessageContains() method in LambdaTestUtils to verify a list of strings
can all be found in the toString() value of a raised exception
Contributed by PJ Fanning
ABFS has a client-side throttling mechanism which works on the metrics collected
from past requests
When requests are fail due to server-side throttling it updates its
metrics and recalculates any client side backoff.
The choice of which requests should be used to compute client side
backoff interval is based on the http status code:
- Status code in 2xx range: Successful Operations should contribute.
- Status code in 3xx range: Redirection Operations should not contribute.
- Status code in 4xx range: User Errors should not contribute.
- Status code is 503: Throttling Error should contribute only if they
are due to client limits breach as follows:
* 503, Ingress Over Account Limit: Should Contribute
* 503, Egress Over Account Limit: Should Contribute
* 503, TPS Over Account Limit: Should Contribute
* 503, Other Server Throttling: Should not Contribute.
- Status code in 5xx range other than 503: Should not Contribute.
- IOException and UnknownHostExceptions: Should not Contribute.
Contributed by Anuj Modi
This PR enables one to create the Hadoop
release tarball on Windows, complete with
the native binaries (including winutils.exe).
This PR contains the following changes -
* Prevents splitting during array element
expansion - this is needed since we need
to pass the arguments correctly to maven.
* Install Python 3.11.8 and pip to the
Windows docker image for building
Hadoop.
* pom file changes to get maven to invoke
the releasedocmaker script through
bash.exe on Windows.
This addresses two CVEs triggered by malformed archives
Important: Denial of Service CVE-2024-25710
Moderate: Denial of Service CVE-2024-26308
Contributed by PJ Fanning
Clarifies behaviour of VectorIO methods with contract tests as well as
specification.
* Add precondition range checks to all implementations
* Identify and fix bug where direct buffer reads was broken
(HADOOP-19101; this surfaced in ABFS contract tests)
* Logging in VectoredReadUtils.
* TestVectoredReadUtils verifies validation logic.
* FileRangeImpl toString() improvements
* CombinedFileRange tracks bytes in range which are wanted;
toString() output logs this.
HDFS
* Add test TestHDFSContractVectoredRead
ABFS
* Add test ITestAbfsFileSystemContractVectoredRead
S3A
* checks for vector IO being stopped in all iterative
vector operations, including draining
* maps read() returning -1 to failure
* passes in file length to validation
* Error reporting to only completeExceptionally() those ranges
which had not yet read data in.
* Improved logging.
readVectored()
* made synchronized. This is only for the invocation;
the actual async retrieves are unsynchronized.
* closes input stream on invocation
* switches to random IO, so avoids keeping any long-lived connection around.
+ AbstractSTestS3AHugeFiles enhancements.
+ ADDENDUM: test fix in ITestS3AContractVectoredRead
Contains: HADOOP-19101. Vectored Read into off-heap buffer broken in fallback
implementation
Contributed by Steve Loughran
Change-Id: Ia4ed71864c595f175c275aad83a2ff5741693432
Clarifies behaviour of VectorIO methods with contract tests as well as specification.
* Add precondition range checks to all implementations
* Identify and fix bug where direct buffer reads was broken
(HADOOP-19101; this surfaced in ABFS contract tests)
* Logging in VectoredReadUtils.
* TestVectoredReadUtils verifies validation logic.
* FileRangeImpl toString() improvements
* CombinedFileRange tracks bytes in range which are wanted;
toString() output logs this.
HDFS
* Add test TestHDFSContractVectoredRead
ABFS
* Add test ITestAbfsFileSystemContractVectoredRead
S3A
* checks for vector IO being stopped in all iterative
vector operations, including draining
* maps read() returning -1 to failure
* passes in file length to validation
* Error reporting to only completeExceptionally() those ranges
which had not yet read data in.
* Improved logging.
readVectored()
* made synchronized. This is only for the invocation;
the actual async retrieves are unsynchronized.
* closes input stream on invocation
* switches to random IO, so avoids keeping any long-lived connection around.
+ AbstractSTestS3AHugeFiles enhancements.
Contains: HADOOP-19101. Vectored Read into off-heap buffer broken in fallback implementation
Contributed by Steve Loughran
If the option fs.s3a.committer.magic.track.commits.in.memory.enabled
is set to true, then rather than save data about in-progress uploads
to S3, this information is cached in memory.
If the number of files being committed is low, this will save network IO
in both the generation of .pending and marker files, and in the scanning
of task attempt directory trees during task commit.
Contributed by Syed Shameerur Rahman
This is a followup to #6406:
HADOOP-18980. S3A credential provider remapping: make extensible
It adds extra validation of key-value pairs in a configuration
option, with tests.
Contributed by Viraj Jasani
This reverts most of
HADOOP-18869: [ABFS] Fix behavior of a File System APIs on root path (#6003).
Calling getXAttr("/") or setXAttr("/") on an abfs container will fail with
`Operation failed: "The request URI is invalid.", HTTP 400 Bad Request`
This change is to ensure:
* Consistency across ADLS clients
* Consistency across authentication mechanisms.
Contributed by Anuj Modi