Unless you explicitly set it, the issue date of a delegation token identifier is 0, which confuses spark renewal (SPARK-33440). This patch makes sure that all S3A DT identifiers have the current time as issue date, fixing the problem as far as S3A tokens are concerned.
Contributed by Jungtaek Lim.
Change-Id: Ic80ac7895612a1aa669459c73a78a9c17ecf0c0d
Fixes read-ahead buffer management issues introduced by HADOOP-16852,
"ABFS: Send error back to client for Read Ahead request failure".
Contributed by Sneha Vijayarajan
Contributed by Sneha Vijayarajan
DETAILS:
This change adds config key "fs.azure.enable.conditional.create.overwrite" with
a default of true. When enabled, if create(path, overwrite: true) is invoked
and the file exists, the ABFS driver will first obtain its etag and then attempt
to overwrite the file on the condition that the etag matches. The purpose of this
is to mitigate the non-idempotency of this method. Specifically, in the event of
a network error or similar, the client will retry and this can result in the file
being created more than once which may result in data loss. In essense this is
like a poor man's file handle, and will be addressed more thoroughly in the future
when support for lease is added to ABFS.
TEST RESULTS:
namespace.enabled=true
auth.type=SharedKey
-------------------
$mvn -T 1C -Dparallel-tests=abfs -Dscale -DtestsThreadCount=8 clean verify
Tests run: 87, Failures: 0, Errors: 0, Skipped: 0
Tests run: 457, Failures: 0, Errors: 0, Skipped: 42
Tests run: 207, Failures: 0, Errors: 0, Skipped: 24
namespace.enabled=true
auth.type=OAuth
-------------------
$mvn -T 1C -Dparallel-tests=abfs -Dscale -DtestsThreadCount=8 clean verify
Tests run: 87, Failures: 0, Errors: 0, Skipped: 0
Tests run: 457, Failures: 0, Errors: 0, Skipped: 74
Tests run: 207, Failures: 0, Errors: 0, Skipped: 140
Adds the options to control the size of the per-output-stream threadpool
when writing data through the abfs connector
* fs.azure.write.max.concurrent.requests
* fs.azure.write.max.requests.to.queue
Contributed by Bilahari T H
This reverts changes in HADOOP-13230 to use S3Guard TTL in choosing when
to issue a HEAD request; fixing tests to compensate.
New org.apache.hadoop.fs.s3a.performance.OperationCost cost,
S3GUARD_NONAUTH_FILE_STATUS_PROBE for use in cost tests.
Contributed by Steve Loughran.
Change-Id: I418d55d2d2562a48b2a14ec7dee369db49b4e29e
S3AFileSystem.listStatus() is optimized for invocations
where the path supplied is a non-empty directory.
The number of S3 requests is significantly reduced, saving
time, money, and reducing the risk of S3 throttling.
Contributed by Mukund Thakur.
Change-Id: I7cc5f87aa16a4819e245e0fbd2aad226bd500f3f
This changes directory tree deletion so that only files are incrementally deleted
from S3Guard after the objects are deleted; the directories are left alone
until metadataStore.deleteSubtree(path) is invoke.
This avoids directory tombstones being added above files/child directories,
which stop the treewalk and delete phase from working.
Also:
* Callback to delete objects splits files and dirs so that
any problems deleting the dirs doesn't trigger s3guard updates
* New statistic to measure #of objects deleted, alongside request count.
* Callback listFilesAndEmptyDirectories renamed listFilesAndDirectoryMarkers
to clarify behavior.
* Test enhancements to replicate the failure and verify the fix
Contributed by Steve Loughran
Change-Id: I0e6ea2c35e487267033b1664228c8837279a35c7
Now skips ITestS3AEncryptionWithDefaultS3Settings.testEncryptionOverRename
when server side encryption is not set to sse:kms
Contributed by Mukund Thakur
Change-Id: Ifd83d353e9c7c6f7e1195a2c2f138d85cf876bb1
This adds an option to disable "empty directory" marker deletion,
so avoid throttling and other scale problems.
This feature is *not* backwards compatible.
Consult the documentation and use with care.
Contributed by Steve Loughran.
Change-Id: I69a61e7584dc36e485d5e39ff25b1e3e559a1958