Commit Graph

1396 Commits

Author SHA1 Message Date
Mukund Thakur
7e642ec5a3
HADOOP-17023. Tune S3AFileSystem.listStatus() (#2257)
S3AFileSystem.listStatus() is optimized for invocations
where the path supplied is a non-empty directory.
The number of S3 requests is significantly reduced, saving
time, money, and reducing the risk of S3 throttling.

Contributed by Mukund Thakur.

Change-Id: I7cc5f87aa16a4819e245e0fbd2aad226bd500f3f
2020-09-21 17:30:15 +01:00
Steve Loughran
aa80bcb1ec
Revert "HADOOP-17244. S3A directory delete tombstones dir markers prematurely. (#2280)"
This reverts commit 0c82eb0324.

Change-Id: I6bd100d9de19660b0f28ee0ab16faf747d6d9f05
2020-09-11 18:07:05 +01:00
Steve Loughran
0c82eb0324
HADOOP-17244. S3A directory delete tombstones dir markers prematurely. (#2280)
This changes directory tree deletion so that only files are incrementally deleted
from S3Guard after the objects are deleted; the directories are left alone
until metadataStore.deleteSubtree(path) is invoke.

This avoids directory tombstones being added above files/child directories,
which stop the treewalk and delete phase from working.

Also:

* Callback to delete objects splits files and dirs so that
any problems deleting the dirs doesn't trigger s3guard updates
* New statistic to measure #of objects deleted, alongside request count.
* Callback listFilesAndEmptyDirectories renamed listFilesAndDirectoryMarkers
  to clarify behavior.
* Test enhancements to replicate the failure and verify the fix

Contributed by Steve Loughran

Change-Id: I0e6ea2c35e487267033b1664228c8837279a35c7
2020-09-10 17:29:33 +01:00
Mehakmeet Singh
ccceec8af0
HADOOP-17158. Test timeout for ITestAbfsInputStreamStatistics#testReadAheadCounters (#2272)
Contributed by: Mehakmeet Singh.

Change-Id: I7ebfa5cd1b5d25f7a750f0c645d7d93c81e89240
2020-09-08 14:02:28 +01:00
Mehakmeet Singh
28f1ded9fe
HADOOP-17113. Adding ReadAhead Counters in ABFS (#2154)
Contributed by Mehakmeet Singh

Change-Id: I6bbd8165385a9267ed64831bb1efa18b6554feb1
2020-09-08 14:02:02 +01:00
Mehakmeet Singh
7970710418
HADOOP-17229. No update of bytes received counter value after response failure occurs in ABFS (#2264)
Contributed by Mehakmeet Singh

Change-Id: Ia9ad1b87a460b10d27486bd00ee67c3cedd2b5b5
2020-09-08 13:26:24 +01:00
Mukund Thakur
5236c96ead
HADOOP-17167 ITestS3AEncryptionWithDefaultS3Settings failing (#2187)
Now skips ITestS3AEncryptionWithDefaultS3Settings.testEncryptionOverRename
when server side encryption is not set to sse:kms

Contributed by Mukund Thakur

Change-Id: Ifd83d353e9c7c6f7e1195a2c2f138d85cf876bb1
2020-09-04 15:00:30 +01:00
Steve Loughran
38354006f8 HADOOP-17227. S3A Marker Tool tuning (#2254)
Contributed by Steve Loughran.
2020-09-04 14:58:54 +01:00
Mehakmeet Singh
f6e1ed4f6b
HADOOP-17194. Adding Context class for AbfsClient in ABFS (#2216)
Contributed by Mehakmeet Singh.

Change-Id: I120c9a068d758d8e5d071c878a3b7fbeb95e4de6
2020-08-27 11:28:37 +01:00
Mukund Thakur
0840c0c1f3
HADOOP-17074. S3A Listing to be fully asynchronous. (#2207)
Contributed by Mukund Thakur.

Change-Id: I1b0574a0c9ebc0805f285dd5280a00e5add081f1
2020-08-25 11:30:42 +01:00
swamirishi
ba4f7fb332
HADOOP-17122: Preserving Directory Attributes in DistCp with Atomic Copy (#2133)
Contributed by Swaminathan Balachandran

Change-Id: I86f956dd4ab0b278d923fe7b70037e6b929a8aa1
2020-08-22 18:51:10 +01:00
Steve Loughran
49f8ae965e
HADOOP-13230. S3A to optionally retain directory markers.
This adds an option to disable "empty directory" marker deletion,
so avoid throttling and other scale problems.

This feature is *not* backwards compatible.
Consult the documentation and use with care.

Contributed by Steve Loughran.

Change-Id: I69a61e7584dc36e485d5e39ff25b1e3e559a1958
2020-08-15 20:19:49 +01:00
Mukund Thakur
571737f4ac HADOOP-17192. ITestS3AHugeFilesSSECDiskBlock failing (#2221)
Contributed by Mukund Thakur
2020-08-13 14:33:27 +01:00
Ayush Saxena
2943e6650f HDFS-15514. Remove useless dfs.webhdfs.enabled. Contributed by Fei Hui. 2020-08-07 22:20:42 +05:30
Mukund Thakur
251d2d1fa5
HADOOP-17131. Refactor S3A Listing code for better isolation. (#2148)
Contributed by Mukund Thakur.

Change-Id: I79160b236a92fdd67565a4b4974f1862e600c210
2020-08-04 17:13:06 +01:00
Sneha Vijayarajan
18ca80331c
Hadoop 17132. ABFS: Fix Rename and Delete Idempotency check trigger
- Contributed by Sneha Vijayarajan
2020-07-25 13:13:18 +00:00
ishaniahuja
f24e2ec487
HADOOP-17058. ABFS: Support for AppendBlob in Hadoop ABFS Driver
- Contributed by Ishani Ahuja
2020-07-25 13:12:32 +00:00
Mehakmeet Singh
7c9b459786
HADOOP-16961. ABFS: Adding metrics to AbfsInputStream (#2076)
Contributed by Mehakmeet Singh.
2020-07-25 13:12:09 +00:00
Mehakmeet Singh
bbd3278d09
HADOOP-17065. Add Network Counters to ABFS (#2056)
Contributed by Mehakmeet Singh.
2020-07-25 13:11:34 +00:00
Karthik Amarnath
8b7e77443d
HDFS-15168: ABFS enhancement to translate AAD to Linux identities. (#1978) 2020-07-25 13:10:39 +00:00
Sneha Vijayarajan
903935da0f
HADOOP-17053. ABFS: Fix Account-specific OAuth config setting parsing
Contributed by Sneha Vijayarajan
2020-07-25 13:10:30 +00:00
Sneha Vijayarajan
869a68b81e
HADOOP-16852: Report read-ahead error back
Contributed by Sneha Vijayarajan
2020-07-25 13:10:19 +00:00
Sneha Vijayarajan
27b20f9689
HADOOP-17054. ABFS: Fix test AbfsClient authentication instance
Contributed by Sneha Vijayarajan
2020-07-25 13:09:26 +00:00
Sneha Vijayarajan
eed06b46eb
Hadoop-17015. ABFS: Handling Rename and Delete idempotency
Contributed by Sneha Vijayarajan.
2020-07-25 13:08:01 +00:00
bilaharith
1ae72d2438
HADOOP-17092. ABFS: Making AzureADAuthenticator.getToken() throw HttpException
- Contributed by Bilahari T H

Change-Id: Id9576d9509faaf057bf419ccb1879ac0cef7a07b
2020-07-22 18:26:36 +01:00
Ayush Saxena
e3b8d4eb05 HADOOP-17100. Replace Guava Supplier with Java8+ Supplier in Hadoop. Contributed by Ahmed Hussein. 2020-07-22 18:21:14 +05:30
Steve Loughran
5aa9396a58
HADOOP-17107. hadoop-azure parallel tests not working on recent JDKs (#2118)
Contributed by Steve Loughran.

Change-Id: I972264aed36f384b7ae23e214326ef7870261cf5
2020-07-20 10:54:22 +01:00
bilaharith
e01852181a
HADOOP-16682. ABFS: Removing unnecessary toString() invocations
- Contributed by Bilahari T H

Change-Id: Id55495b44d81533d1d3654de2553c709f505f7eb
2020-07-20 10:53:59 +01:00
Mehakmeet Singh
0d88ed2794
HADOOP-17129. Validating storage keys in ABFS correctly (#2141)
Contributed by Mehakmeet Singh

Change-Id: I8016ee2f9ffbc86ea867f4a3d960b134e507d099
2020-07-16 18:11:52 +01:00
Mukund Thakur
8b601ad7e6 HADOOP-17022. Tune S3AFileSystem.listFiles() API.
Contributed by Mukund Thakur.

Change-Id: I17f5cfdcd25670ce3ddb62c13378c7e2dc06ba52
2020-07-14 15:28:27 +01:00
Anoop Sam John
cac2fc1f58 HADOOP-16998. WASB : NativeAzureFsOutputStream#close() throwing IllegalArgumentException (#2073)
Contributed by Anoop Sam John.
2020-07-14 14:08:46 +01:00
jimmy-zuber-amzn
79fc58def3
HADOOP-17105. S3AFS - Do not attempt to resolve symlinks in globStatus (#2113)
Contributed by Jimmy Zuber.

Change-Id: I2f247c2d2ab4f38214073e55f5cfbaa15aeaeb11
2020-07-13 19:09:50 +01:00
Steve Loughran
a51d72f0c6 HDFS-13934. Multipart uploaders to be created through FileSystem/FileContext.
Contributed by Steve Loughran.

Change-Id: Iebd34140c1a0aa71f44a3f4d0fee85f6bdf123a3
2020-07-13 13:32:04 +01:00
Sebastian Nagel
f9619b0b97
HADOOP-17117 Fix typos in hadoop-aws documentation (#2127)
(cherry picked from commit 5b1ed2113b)
2020-07-09 00:04:46 +09:00
bilaharith
19fb204011
HADOOP-17086. ABFS: Making the ListStatus response ignore unknown properties. (#2101)
Contributed by Bilahari T H.

Change-Id: I82e4683fba8481aef2abab7a6a99e5752f6fffa9
2020-07-03 19:02:21 +01:00
Steve Loughran
7de1ac0547
HADOOP-16798. S3A Committer thread pool shutdown problems. (#1963)
Contributed by Steve Loughran.

Fixes a condition which can cause job commit to fail if a task was
aborted < 60s before the job commit commenced: the task abort
will shut down the thread pool with a hard exit after 60s; the
job commit POST requests would be scheduled through the same pool,
so be interrupted and fail. At present the access is synchronized,
but presumably the executor shutdown code is calling wait() and releasing
locks.

Task abort is triggered from the AM when task attempts succeed but
there are still active speculative task attempts running. Thus it
only surfaces when speculation is enabled and the final tasks are
speculating, which, given they are the stragglers, is not unheard of.

Note: this problem has never been seen in production; it has surfaced
in the hadoop-aws tests on a heavily overloaded desktop

Change-Id: I3b433356d01fcc50d88b4353dbca018484984bc8
2020-06-30 10:52:56 +01:00
Thomas Marquardt
ee192c4826
HADOOP-17089: WASB: Update azure-storage-java SDK
Contributed by Thomas Marquardt

DETAILS: WASB depends on the Azure Storage Java SDK. There is a concurrency
bug in the Azure Storage Java SDK that can cause the results of a list blobs
operation to appear empty. This causes the Filesystem listStatus and similar
APIs to return empty results. This has been seen in Spark work loads when jobs
use more than one executor core.

See Azure/azure-storage-java#546 for details on the bug in the Azure Storage SDK.

TESTS: A new test was added to validate the fix. All tests are passing:

wasb:
mvn -T 1C -Dparallel-tests=wasb -Dscale -DtestsThreadCount=8 clean verify
Tests run: 248, Failures: 0, Errors: 0, Skipped: 11
Tests run: 651, Failures: 0, Errors: 0, Skipped: 65

abfs:
mvn -T 1C -Dparallel-tests=abfs -Dscale -DtestsThreadCount=8 clean verify
Tests run: 64, Failures: 0, Errors: 0, Skipped: 0
Tests run: 437, Failures: 0, Errors: 0, Skipped: 33
Tests run: 206, Failures: 0, Errors: 0, Skipped: 24
2020-06-25 05:43:32 +00:00
Thomas Marquardt
63d236c019
HADOOP-17076: ABFS: Delegation SAS Generator Updates
Contributed by Thomas Marquardt.

DETAILS:
1) The authentication version in the service has been updated from Dec19 to Feb20, so need to update the client.
2) Add support and test cases for getXattr and setXAttr.
3) Update DelegationSASGenerator and related to use Duration instead of int for time periods.
4) Cleanup DelegationSASGenerator switch/case statement that maps operations to permissions.
5) Cleanup SASGenerator classes to use String.equals instead of ==.

TESTS:
Added tests for getXAttr and setXAttr.

All tests are passing against my account in eastus2euap:

 $mvn -T 1C -Dparallel-tests=abfs -Dscale -DtestsThreadCount=8 clean verify
 Tests run: 76, Failures: 0, Errors: 0, Skipped: 0
 Tests run: 441, Failures: 0, Errors: 0, Skipped: 33
 Tests run: 206, Failures: 0, Errors: 0, Skipped: 24
2020-06-19 19:19:31 +00:00
bilaharith
d639c11986
HADOOP-17004. Fixing a formatting issue
Contributed by Bilahari T H.
2020-06-19 19:11:06 +00:00
bilaharith
11307f3be9
HADOOP-17004. ABFS: Improve the ABFS driver documentation
Contributed by Bilahari T H.
2020-06-19 19:10:22 +00:00
Thomas Marquardt
af98f32f7d
HADOOP-16916: ABFS: Delegation SAS generator for integration with Ranger
Contributed by Thomas Marquardt.

DETAILS:

Previously we had a SASGenerator class which generated Service SAS, but we need to add DelegationSASGenerator.
I separated SASGenerator into a base class and two subclasses ServiceSASGenerator and DelegationSASGenreator.  The
code in ServiceSASGenerator is copied from SASGenerator but the DelegationSASGenrator code is new.  The
DelegationSASGenerator code demonstrates how to use Delegation SAS with minimal permissions, as would be used
by an authorization service such as Apache Ranger.  Adding this to the tests helps us lock in this behavior.

Added a MockDelegationSASTokenProvider for testing User Delegation SAS.

Fixed the ITestAzureBlobFileSystemCheckAccess tests to assume oauth client ID so that they are ignored when that
is not configured.

To improve performance, AbfsInputStream/AbfsOutputStream re-use SAS tokens until the expiry is within 120 seconds.
After this a new SAS will be requested.  The default period of 120 seconds can be changed using the configuration
setting "fs.azure.sas.token.renew.period.for.streams".

The SASTokenProvider operation names were updated to correspond better with the ADLS Gen2 REST API, since these
operations must be provided tokens with appropriate SAS parameters to succeed.

Support for the version 2.0 AAD authentication endpoint was added to AzureADAuthenticator.

The getFileStatus method was mistakenly calling the ADLS Gen2 Get Properties API which requires read permission
while the getFileStatus call only requires execute permission.  ADLS Gen2 Get Status API is supposed to be used
for this purpose, so the underlying AbfsClient.getPathStatus API was updated with a includeProperties
parameter which is set to false for getFileStatus and true for getXAttr.

Added SASTokenProvider support for delete recursive.

Fixed bugs in AzureBlobFileSystem where public methods were not validating the Path by calling makeQualified.  This is
necessary to avoid passing null paths and to convert relative paths into absolute paths.

Canonicalized the path used for root path internally so that root path can be used with SAS tokens, which requires
that the path in the URL and the path in the SAS token match.  Internally the code was using
"//" instead of "/" for the root path, sometimes.  Also related to this, the AzureBlobFileSystemStore.getRelativePath
API was updated so that we no longer remove and then add back a preceding forward / to paths.

To run ITestAzureBlobFileSystemDelegationSAS tests follow the instructions in testing_azure.md under the heading
"To run Delegation SAS test cases".  You also need to set "fs.azure.enable.check.access" to true.

TEST RESULTS:

namespace.enabled=true
auth.type=SharedKey
-------------------
$mvn -T 1C -Dparallel-tests=abfs -Dscale -DtestsThreadCount=8 clean verify
Tests run: 63, Failures: 0, Errors: 0, Skipped: 0
Tests run: 432, Failures: 0, Errors: 0, Skipped: 41
Tests run: 206, Failures: 0, Errors: 0, Skipped: 24

namespace.enabled=false
auth.type=SharedKey
-------------------
$mvn -T 1C -Dparallel-tests=abfs -Dscale -DtestsThreadCount=8 clean verify
Tests run: 63, Failures: 0, Errors: 0, Skipped: 0
Tests run: 432, Failures: 0, Errors: 0, Skipped: 244
Tests run: 206, Failures: 0, Errors: 0, Skipped: 24

namespace.enabled=true
auth.type=SharedKey
sas.token.provider.type=MockDelegationSASTokenProvider
enable.check.access=true
-------------------
$mvn -T 1C -Dparallel-tests=abfs -Dscale -DtestsThreadCount=8 clean verify
Tests run: 63, Failures: 0, Errors: 0, Skipped: 0
Tests run: 432, Failures: 0, Errors: 0, Skipped: 33
Tests run: 206, Failures: 0, Errors: 0, Skipped: 24

namespace.enabled=true
auth.type=OAuth
-------------------
$mvn -T 1C -Dparallel-tests=abfs -Dscale -DtestsThreadCount=8 clean verify
Tests run: 63, Failures: 0, Errors: 0, Skipped: 0
Tests run: 432, Failures: 0, Errors: 1, Skipped: 74
Tests run: 206, Failures: 0, Errors: 0, Skipped: 140
2020-06-19 19:00:46 +00:00
Mehakmeet Singh
a2f44344c3
HADOOP-17018. Intermittent failing of ITestAbfsStreamStatistics in ABFS (#1990)
Contributed by: Mehakmeet Singh

In some cases, ABFS-prefetch thread runs in the background which returns some bytes from the buffer and gives an extra readOp. Thus, making readOps values arbitrary and giving intermittent failures in some cases. Hence, readOps values of 2 or 3 are seen in different setups.
2020-06-19 19:00:04 +00:00
bilaharith
76ee7e5494
HADOOP-17002. ABFS: Adding config to determine if the account is HNS enabled or not
Contributed by Bilahari T H.
2020-06-19 18:57:47 +00:00
Steve Loughran
5e290e702f
HADOOP-17050. S3A to support additional token issuers
Contributed by Steve Loughran.

S3A delegation token providers will be asked for any additional
token issuers, an array can be returned,
each one will be asked for tokens when DelegationTokenIssuer collects
all the tokens for a filesystem.

Change-Id: I1bd3035bbff98cbd8e1d1ac7fc615d937e6bb7bb
2020-06-09 14:43:02 +01:00
Mehakmeet Singh
1714589609
HADOOP-17016. Adding Common Counters in ABFS (#1991).
Contributed by: Mehakmeet Singh.

Change-Id: Ib84e7a42f28e064df4c6204fcce33e573360bf42
2020-06-03 20:02:44 +01:00
Steve Loughran
8a642caca8
HADOOP-16568. S3A FullCredentialsTokenBinding fails if local credentials are unset. (#1441)
Contributed by Steve Loughran.

Move the loading to deployUnbonded (where they are required) and add a safety check when a new DT is requested

Change-Id: I03c69aa2e16accfccddca756b2771ff832e7dd58
2020-06-03 17:08:52 +01:00
Mukund Thakur
b0c9e4f1b5
HADOOP-16900. Very large files can be truncated when written through the S3A FileSystem.
Contributed by Mukund Thakur and Steve Loughran.

This patch ensures that writes to S3A fail when more than 10,000 blocks are
written. That upper bound still exists. To write massive files, make sure
that the value of fs.s3a.multipart.size is set to a size which is large
enough to upload the files in fewer than 10,000 blocks.

Change-Id: Icec604e2a357ffd38d7ae7bc3f887ff55f2d721a
2020-06-01 12:01:13 +01:00
Masatake Iwasaki
4d30c395f7 HADOOP-17040. Fix intermittent failure of ITestBlockingThreadPoolExecutorService. (#2020)
(cherry picked from commit 9685314633)
2020-05-22 18:53:53 +09:00
Steve Loughran
f2f727359b Revert "HADOOP-14557. Document HADOOP-8143 (Change distcp to have -pb on by default)."
This reverts commit 44350fdf49.

It is related to the rollback of HADOOP-8143.

Change-Id: If48e3dd670c920ada702dc36461ff398fe9d35cc
2020-05-14 19:20:34 +01:00
Steve Loughran
7f2cf334a8
Revert "HADOOP-8143. Change distcp to have -pb on by default."
This reverts commit dd65eea74b.

Change-Id: I74180cf59d5bbad8c9f66cb331535addcbea863e
2020-05-14 19:14:39 +01:00