Commit Graph

1707 Commits

Author SHA1 Message Date
Mehakmeet Singh
4c8cd61961
HADOOP-17461. Collect thread-level IOStatistics. (#4352)
This adds a thread-level collector of IOStatistics, IOStatisticsContext,
which can be:
* Retrieved for a thread and cached for access from other
  threads.
* reset() to record new statistics.
* Queried for live statistics through the
  IOStatisticsSource.getIOStatistics() method.
* Queries for a statistics aggregator for use in instrumented
  classes.
* Asked to create a serializable copy in snapshot()

The goal is to make it possible for applications with multiple
threads performing different work items simultaneously
to be able to collect statistics on the individual threads,
and so generate aggregate reports on the total work performed
for a specific job, query or similar unit of work.

Some changes in IOStatistics-gathering classes are needed for 
this feature
* Caching the active context's aggregator in the object's
  constructor
* Updating it in close()

Slightly more work is needed in multithreaded code,
such as the S3A committers, which collect statistics across
all threads used in task and job commit operations.

Currently the IOStatisticsContext-aware classes are:
* The S3A input stream, output stream and list iterators.
* RawLocalFileSystem's input and output streams.
* The S3A committers.
* The TaskPool class in hadoop-common, which propagates
  the active context into scheduled worker threads.

Collection of statistics in the IOStatisticsContext
is disabled process-wide by default until the feature 
is considered stable.

To enable the collection, set the option
fs.thread.level.iostatistics.enabled
to "true" in core-site.xml;
	
Contributed by Mehakmeet Singh and Steve Loughran
2022-07-26 20:41:22 +01:00
ashutoshpant
bac2219e3c
HADOOP-18330. S3AFileSystem removes Path when calling createS3Client (#4572)
Adds a new parameter object in s3ClientCreationParameters that holds 
the full s3a path URI

Contributed by Ashutosh Pant
2022-07-21 10:16:39 +01:00
Ayush Saxena
96f8e5b6f4
HADOOP-15789. DistCp does not clean staging folder if class extends DistCp. Contributed by Lawrence Andrews. (#4534)
Signed-off-by: Ayush Saxena <ayushsaxena@apache.org>
2022-07-08 17:04:20 +05:30
Jinhu Wu
3ec4b932c1
HADOOP-18313: AliyunOSSBlockOutputStream should not mark the temporary file for deletion (#4502)
HADOOP-18313: AliyunOSSBlockOutputStream should not mark the temporary file for deletion. Contributed by wujinhu.
2022-07-06 14:23:46 +08:00
Mehakmeet Singh
823f5ee0d4
HADOOP-18242. ABFS Rename Failure when tracking metadata is in an incomplete state (#4331)
ABFS rename fails intermittently when the Storage-blob tracking
metadata is in an incomplete state. This surfaces as the error code
404 and an error message of "RenameDestinationParentPathNotFound"

To mitigate this issue, when a request fails with this response.
the ABFS client issues a HEAD call on the source file
and then retries the rename operation again

ABFS filesystem statistics track when this occurs with new counters
  rename_recovery
  metadata_incomplete_rename_failures
  rename_path_attempts

This is very rare occurrence and appears to be triggered under certain
heavy load conditions, just as with HADOOP-18163.

Contributed by Mehakmeet Singh.
2022-06-27 19:06:59 +01:00
Mukund Thakur
4d1f6f9b99 HADOOP-18106: Handle memory fragmentation in S3A Vectored IO. (#4445)
part of HADOOP-18103.
Handling memory fragmentation in S3A vectored IO implementation by
allocating smaller user range requested size buffers and directly
filling them from the remote S3 stream and skipping undesired
data in between ranges.
This patch also adds aborting active vectored reads when stream is
closed or unbuffer() is called.

Contributed By: Mukund Thakur
2022-06-22 17:29:32 +01:00
Mukund Thakur
1408dd89a7 HADOOP-18107 Adding scale test for vectored reads for large file (#4273)
part of HADOOP-18103.

Contributed By: Mukund Thakur
2022-06-22 17:29:32 +01:00
Mukund Thakur
5db0f34e29 HADOOP-18104: S3A: Add configs to configure minSeekForVectorReads and maxReadSizeForVectorReads (#3964)
Part of HADOOP-18103.
Introducing fs.s3a.vectored.read.min.seek.size and fs.s3a.vectored.read.max.merged.size
to configure min seek and max read during a vectored IO operation in S3A connector.
These properties actually define how the ranges will be merged. To completely
disable merging set fs.s3a.max.readsize.vectored.read to 0.

Contributed By: Mukund Thakur
2022-06-22 17:29:32 +01:00
Mukund Thakur
2daf0a814f HADOOP-11867. Add a high-performance vectored read API. (#3904)
part of HADOOP-18103.
Add support for multiple ranged vectored read api in PositionedReadable.
The default iterates through the ranges to read each synchronously,
but the intent is that FSDataInputStream subclasses can make more
efficient readers especially in object stores implementation.

Also added implementation in S3A where smaller ranges are merged and
sliced byte buffers are returned to the readers. All the merged ranged are
fetched from S3 asynchronously.

Contributed By: Owen O'Malley and Mukund Thakur
2022-06-22 17:29:32 +01:00
Steve Loughran
e199da3fae
HADOOP-17833. Improve Magic Committer performance (#3289)
Speed up the magic committer with key changes being

* Writes under __magic always retain directory markers

* File creation under __magic skips all overwrite checks,
  including the LIST call intended to stop files being
	created over dirs.
* mkdirs under __magic probes the path for existence
  but does not look any further.  	

Extra parallelism in task and job commit directory scanning
Use of createFile and openFile with parameters which all for
HEAD checks to be skipped.

The committer can write the summary _SUCCESS file to the path
`fs.s3a.committer.summary.report.directory`, which can be in a
different file system/bucket if desired, using the job id as
the filename. 

Also: HADOOP-15460. S3A FS to add `fs.s3a.create.performance`

Application code can set the createFile() option
fs.s3a.create.performance to true to disable the same
safety checks when writing under magic directories.
Use with care.

The createFile option prefix `fs.s3a.create.header.`
can be used to add custom headers to S3 objects when
created.


Contributed by Steve Loughran.
2022-06-17 19:11:35 +01:00
monthonk
5ac55b405d
HADOOP-12020. Add s3a storage class option fs.s3a.create.storage.class (#3877)
Adds a new option fs.s3a.create.storage.class which can
be used to set the storage class for files created in AWS S3.
Consult the documentation for details and instructions on how
disable the relevant tests when testing against third-party
stores.

Contributed by Monthon Klongklaew
2022-06-08 19:05:17 +01:00
GuoPhilipse
ba6520f67f
HADOOP-18269. Misleading method name in DistCpOptions.(#4216)
Contributed by guophilipse
2022-05-30 14:02:47 +01:00
Ashutosh Gupta
fb910bd906
HDFS-16453. Upgrade okhttp from 2.7.5 to 4.9.3 (#4229)
Co-authored-by: Ashutosh Gupta <ashugpt@amazon.com>
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
2022-05-21 02:53:14 +09:00
Ashutosh Gupta
a46ef5f2eb
HADOOP-18234. Fix s3a access point xml examples (#4309)
Contributed by Ashutosh Gupta
2022-05-16 17:47:14 +01:00
Daniel Carl Jones
4230162a76
HADOOP-18168. Fix S3A ITestMarkerTool use of purged public bucket. (#4140)
This moves off use of the purged s3a://landsat-pds bucket, so fixing tests
which had started failing.
* Adds a new class, PublicDatasetTestUtils to manage the use of public datasets.
* The new test bucket s3a://usgs-landsat/ is requester pays, so depends upon
  HADOOP-14661.

Consult the updated test documentation when running against other S3 stores.

Contributed by Daniel Carl Jones

Change-Id: Ie8585e4d9b67667f8cb80b2970225d79a4f8d257
2022-05-03 14:28:08 +01:00
Steve Loughran
6ec39d45c9 Revert "HADOOP-18168. . (#4140)"
This reverts commit 6ab7b72cd6.
2022-05-03 14:27:52 +01:00
Daniel Carl Jones
6ab7b72cd6
HADOOP-18168. . (#4140)
This moves off use of the purged s3a://landsat-pds bucket, so fixing tests
which had started failing.
* Adds a new class, PublicDatasetTestUtils to manage the use of public datasets.
* The new test bucket s3a://usgs-landsat/ is requester pays, so depends upon
  HADOOP-14661.

Consult the updated  test documentation when running against other S3 stores.

Contributed by Daniel Carl Jones
2022-05-03 14:26:52 +01:00
Ashutosh Gupta
23e1511cd0
HDFS-16256. Minor fix in HDFS Fedbalance document (#4192) 2022-05-02 08:08:12 +08:00
PJ Fanning
63187083cc
HADOOP-15983. Use jersey-json that is built to use jackson2 (#3988)
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
2022-04-28 14:18:19 +09:00
Steve Loughran
44ae2fa8e5
HADOOP-16202. Enhanced openFile(): hadoop-azure changes. (#2584/4)
Stops the abfs connector warning if openFile().withFileStatus()
is invoked with a FileStatus is not an abfs VersionedFileStatus.

Contributed by Steve Loughran.

Change-Id: I85076b365eb30aaef2ed35139fa8714efd4d048e
2022-04-24 17:33:05 +01:00
Steve Loughran
e0cd0a82e0
HADOOP-16202. Enhanced openFile(): hadoop-aws changes. (#2584/3)
S3A input stream support for the few fs.option.openfile settings.
As well as supporting the read policy option and values,
if the file length is declared in fs.option.openfile.length
then no HEAD request will be issued when opening a file.
This can cut a few tens of milliseconds off the operation.

The patch adds a new openfile parameter/FS configuration option
fs.s3a.input.async.drain.threshold (default: 16000).
It declares the number of bytes remaining in the http input stream
above which any operation to read and discard the rest of the stream,
"draining", is executed asynchronously.
This asynchronous draining offers some performance benefit on seek-heavy
file IO.

Contributed by Steve Loughran.

Change-Id: I9b0626bbe635e9fd97ac0f463f5e7167e0111e39
2022-04-24 17:33:05 +01:00
Steve Loughran
6999acf520
HADOOP-16202. Enhanced openFile(): mapreduce and YARN changes. (#2584/2)
These changes ensure that sequential files are opened with the
right read policy, and split start/end is passed in.

As well as offering opportunities for filesystem clients to
choose fetch/cache/seek policies, the settings ensure that
processing text files on an s3 bucket where the default policy
is "random" will still be processed efficiently.

This commit depends on the associated hadoop-common patch,
which must be committed first.

Contributed by Steve Loughran.

Change-Id: Ic6713fd752441cf42ebe8739d05c2293a5db9f94
2022-04-24 17:33:05 +01:00
GuoPhilipse
214f369073
HDFS-16556. Fix typos in distcp (#4217) 2022-04-22 14:01:20 -04:00
Daniel Carl Jones
a6ebc42671
HADOOP-18201. Remove endpoint config overrides for ITestS3ARequesterPays (#4169)
Contributed by Daniel Carl Jones.
2022-04-14 16:21:34 +01:00
Viraj Jasani
7c20602b17
HDFS-16522. Set Http and Ipc ports for Datanodes in MiniDFSCluster (#4108)
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
2022-04-06 18:17:02 +09:00
litao
966b773a7c
HDFS-16527. Add global timeout rule for TestRouterDistCpProcedure (#4129)
Reviewed-by: Inigo Goiri <inigoiri@apache.org>
Reviewed-by: Ayush Saxena <ayushsaxena@apache.org>
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
2022-04-06 14:34:24 +09:00
9uapaw
4b1a6bfb10 YARN-11102. Fix spotbugs error in hadoop-sls module. Contributed by Szilard Nemeth, Andras Gyori. 2022-04-01 18:24:37 +02:00
Szilard Nemeth
94031b729d YARN-11103. SLS cleanup after previously merged SLS refactor jiras. Contributed by Szilard Nemeth 2022-03-31 14:29:59 +02:00
Szilard Nemeth
ab8c360620 YARN-10550. Decouple NM runner logic from SLSRunner. Contributed by Szilard Nemeth 2022-03-30 19:53:10 +02:00
9uapaw
e386d6a661 YARN-10549. Decouple RM runner logic from SLSRunner. Contributed by Szilard Nemeth. 2022-03-29 09:58:27 +02:00
9uapaw
adbaf48082 YARN-11100. Fix StackOverflowError in SLS scheduler event handling. Contributed by Szilard Nemeth. 2022-03-26 21:43:10 +01:00
9uapaw
08a77a765b YARN-10548. Decouple AM runner logic from SLSRunner. Contributed by Szilard Nemeth. 2022-03-25 18:48:56 +01:00
Benjamin Teke
ffa0eab488 YARN-11094. Follow up changes for YARN-10547. Contributed by Szilard Nemeth 2022-03-25 12:01:44 +01:00
9uapaw
526142447a YARN-10552. Eliminate code duplication in SLSCapacityScheduler and SLSFairScheduler. Contributed by Szilard Nemeth. 2022-03-24 16:24:33 +01:00
9uapaw
077c6c62d6 YARN-10547. Decouple job parsing logic from SLSRunner. Contributed by Szilard Nemeth. 2022-03-24 06:16:26 +01:00
Daniel Carl Jones
9edfe30a60
HADOOP-14661. Add S3 requester pays bucket support to S3A (#3962)
Adds the option fs.s3a.requester.pays.enabled, which, if set to true, allows
the client to access S3 buckets where the requester is billed for the IO.

Contributed by Daniel Carl Jones
2022-03-23 20:00:50 +00:00
Steve Loughran
708a0ce21b
HADOOP-13704. Optimized S3A getContentSummary()
Optimize the scan for s3 by performing a deep tree listing,
inferring directory counts from the paths returned.

Contributed by Ahmar Suhail.

Change-Id: I26ffa8c6f65fd11c68a88d6e2243b0eac6ffd024
2022-03-22 13:21:12 +00:00
Steve Loughran
8294bd5a37
HADOOP-18163. hadoop-azure support for the Manifest Committer of MAPREDUCE-7341
Follow-on patch to MAPREDUCE-7341, adding ABFS support and tests

* resilient rename
* tests for job commit through the manifest committer.

contains
- HADOOP-17976. ABFS etag extraction inconsistent between LIST and HEAD calls
- HADOOP-16204. ABFS tests to include terasort

Contributed by Steve Loughran.

Change-Id: I0a7d4043bdf19bcb00c033fc389730109b93b77f
2022-03-17 11:24:51 +00:00
Mukund Thakur
672e380c4f
HADOOP-18112: Implement paging during multi object delete. (#4045)
Multi object delete of size more than 1000 is not supported by S3 and 
fails with MalformedXML error. So implementing paging of requests to 
reduce the number of keys in a single request. Page size can be configured
using "fs.s3a.bulk.delete.page.size" 

 Contributed By: Mukund Thakur
2022-03-11 13:05:45 +05:30
Viraj Jasani
66b72406bd
HADOOP-18131. Upgrade maven enforcer plugin and relevant dependencies (#4000)
Reviewed-by: Akira Ajisaka <aajisaka@apache.org>
Reviewed-by: Wei-Chiu Chuang <weichiu@apache.org>
Signed-off-by: Takanobu Asanuma <tasanuma@apache.org>
2022-03-08 17:27:04 +09:00
Mehakmeet Singh
6995374b54
HADOOP-18150. Fix ITestAuditManagerDisabled test in S3A. (#4044)
Contributed by Mehakmeet Singh
2022-03-03 18:44:28 +00:00
Steve Loughran
b56af00114
HADOOP-18075. ABFS: Fix failure caused by listFiles() in ITestAbfsRestOperationException (#4040)
Contributed by Sumangala Patki
2022-03-01 11:48:10 +00:00
Sumangala Patki
c18b646020
HADOOP-18071. ABFS: Set driver global timeout for ITestAzureBlobFileSystemBasics (#3866)
Contributed by Sumangala Patki
2022-02-23 19:38:10 +00:00
monthonk
1f157f802d
HADOOP-17386. Change default fs.s3a.buffer.dir to be under Yarn container path on yarn applications (#3908)
Co-authored-by: Monthon Klongklaew <monthonk@amazon.com>
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
2022-02-22 13:50:27 +09:00
Mohanad Elsafty
a4f459097b
HADOOP-18117. Add an option to preserve root directory permissions (#3970) 2022-02-18 19:12:50 +08:00
Ayush Saxena
fe583c4b63
HADOOP-18096. Distcp: Sync moves filtered file to home directory rather than deleting. (#3940). Contributed by Ayush Saxena.
Reviewed-by: Steve Loughran <stevel@apache.org>
Reviewed-by: stack <stack@apache.org>
2022-02-11 01:59:40 +05:30
Steve Loughran
efdec92cab
HADOOP-18091. S3A auditing leaks memory through ThreadLocal references (#3930)
Adds a new map type WeakReferenceMap, which stores weak
references to values, and a WeakReferenceThreadMap subclass
to more closely resemble a thread local type, as it is a
map of threadId to value.

Construct it with a factory method and optional callback
for notification on loss and regeneration.

 WeakReferenceThreadMap<WrappingAuditSpan> activeSpan =
      new WeakReferenceThreadMap<>(
          (k) -> getUnbondedSpan(),
          this::noteSpanReferenceLost);

This is used in ActiveAuditManagerS3A for span tracking.

Relates to
* HADOOP-17511. Add an Audit plugin point for S3A
* HADOOP-18094. Disable S3A auditing by default.

Contributed by Steve Loughran.
2022-02-10 12:31:41 +00:00
Joey Krabacher
a08e69d33e
HADOOP-18114. Documentation correction in assumed_roles.md (#3949)
Fixes typo in hadoop-aws/assumed_roles.md

Contributed by Joey Krabacher
2022-02-09 10:35:11 +00:00
Petre Bogdan Stolojan
5e7ce26e66
HADOOP-18085. S3 SDK Upgrade causes AccessPoint ARN endpoint mistranslation (#3902)
Part of HADOOP-17198. Support S3 Access Points.

HADOOP-18068. "upgrade AWS SDK to 1.12.132" broke the access point endpoint
translation.

Correct endpoints should start with "s3-accesspoint.", after SDK upgrade they start with
"s3.accesspoint-" which messes up tests + region detection by the SDK.

Contributed by Bogdan Stolojan
2022-02-04 15:37:08 +00:00
Steve Loughran
b795f6f9a8
HADOOP-18094. Disable S3A auditing by default.
See HADOOP-18091. S3A auditing leaks memory through ThreadLocal references

* Adds a new option fs.s3a.audit.enabled to controls whether or not auditing
is enabled. This is false by default.

* When false, the S3A auditing manager is NoopAuditManagerS3A,
which was formerly only used for unit tests and
during filsystem initialization.

* When true, ActiveAuditManagerS3A is used for managing auditing,
allowing auditing events to be reported.

* updates documentation and tests.

This patch does not fix the underlying leak. When auditing is enabled,
long-lived threads will retain references to the audit managers
of S3A filesystem instances which have already been closed.

Contributed by Steve Loughran.
2022-01-24 13:37:33 +00:00
Anmol Asrani
7c97c0f969
HADOOP-18084. ABFS: Add testfilePath while verifying test contents are read correctly (#3903)
Contributed by: Anmol Asrani
2022-01-19 10:13:13 +00:00
Steve Loughran
d8ab84275e
HADOOP-18068. upgrade AWS SDK to 1.12.132 (#3864)
With this update, the versions of key shaded dependencies are

  jackson    2.12.3
  httpclient 4.5.13

Contributed by Steve Loughran
2022-01-18 10:31:28 +00:00
Steve Loughran
14ba19af06
HADOOP-17409. Remove s3guard from S3A module (#3534)
Completely removes S3Guard support from the S3A codebase.

If the connector is configured to use any metastore other than
the null and local stores (i.e. DynamoDB is selected) the s3a client
will raise an exception and refuse to initialize.

This is to ensure that there is no mix of S3Guard enabled and disabled
deployments with the same configuration but different hadoop releases
-it must be turned off completely.

The "hadoop s3guard" command has been retained -but the supported
subcommands have been reduced to those which are not purely S3Guard
related: "bucket-info" and "uploads".

This is major change in terms of the number of files
changed; before cherry picking subsequent s3a patches into
older releases, this patch will probably need backporting
first.

Goodbye S3Guard, your work is done. Time to die.

Contributed by Steve Loughran.
2022-01-17 18:08:57 +00:00
monthonk
b27732c69b
HADOOP-14334. S3 SSEC tests to downgrade when running against a mandatory encryption object store (#3870)
Contributed by Monthon Klongklaew
2022-01-09 18:01:47 +00:00
Viraj Jasani
08c803ea30
MAPREDUCE-7371. DistributedCache alternative APIs should not use DistributedCache APIs internally (#3855) 2022-01-09 00:18:10 +09:00
Ayush Saxena
657a2882e9
HADOOP-18056. DistCp: Filter duplicates in the source paths. (#3825). Contributed by Ayush Saxena.
Reviewed-by: tomscut <litao@bigo.sg>
Reviewed-by: Steve Loughran <stevel@apache.org>
2022-01-05 23:53:07 +05:30
Akira Ajisaka
dba139cd0f
HADOOP-18045. Disable TestDynamometerInfra (#3829)
Reviewed-by: Fei Hui <feihui.ustc@gmail.com>
2021-12-28 13:22:12 +09:00
Ashutosh Gupta
ebdbe7eb82
HADOOP-18057. Fix typo: validateEncrytionSecrets -> validateEncryptionSecrets (#3826) 2021-12-27 16:51:17 +08:00
Viraj Jasani
04b6b9a87b
HADOOP-16908. Prune Jackson 1 from the codebase and restrict it's usage for future (#3789)
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
2021-12-20 16:01:34 +09:00
Szilard Nemeth
a967033a9f
YARN-10427. Duplicate Job IDs in SLS output (#3809). Contributed by Szilard Nemeth 2021-12-17 00:34:16 +01:00
Akira Ajisaka
9b9e2ef87f
HADOOP-18040. Use maven.test.failure.ignore instead of ignoreTestFailure (#3774)
Reviewed-by: Masatake Iwasaki <iwasakims@apache.org>
2021-12-10 01:36:31 +09:00
GuoPhilipse
c65c87f211
HADOOP-18026. Fix default value of Magic committer (#3723)
Contributed by guophilipse
2021-11-29 15:50:30 +00:00
Viraj Jasani
215388beea
HADOOP-18022. Add restrict-imports-enforcer-rule for Guava Preconditions and remove remaining usages (#3712)
Reviewed-by: Akira Ajisaka <aajisaka@apache.org>
Signed-off-by: Takanobu Asanuma <tasanuma@apache.org>
2021-11-29 17:37:30 +09:00
Steve Loughran
98fe0d0fc3
HADOOP-17979. Add Interface EtagSource to allow FileStatus subclasses to provide etags (#3633)
Contributed by Steve Loughran
2021-11-24 17:33:12 +00:00
Mehakmeet Singh
a35f7dec25
HADOOP-18016. Make certain methods LimitedPrivate in S3AUtils.java (#3685)
Contributed By: Mehakmeet Singh
2021-11-24 13:32:59 +05:30
Andrew Chung
5b1b2c8ef6
YARN-11003. Make RMNode aware of all (OContainer inclusive) allocated resources (#3646) 2021-11-23 13:20:08 -08:00
Viraj Jasani
c7ec1897c4
HADOOP-18018. unguava: remove Preconditions from hadoop-tools modules (#3688) 2021-11-23 13:34:10 +09:00
Steve Loughran
3391b69692
HADOOP-18002. ABFS rename idempotency broken -remove recovery (#3641)
Cut modtime-based rename recovery as object modification time
is not updated during rename operation.
Applications will have to use etag API of HADOOP-17979
and implement it themselves.

Why not do the HEAD and etag recovery in ABFS client?
Cuts the IO capacity in half so kills job commit performance.
The manifest committer of MAPREDUCE-7341 will do this recovery
and act as the reference implementation of the algorithm.

Contributed by: Steve Loughran
2021-11-16 16:47:44 +05:30
Steve Loughran
45f164a854 Revert "HADOOP-17873. ABFS: Fix transient failures in ITestAbfsStreamStatistics and ITestAbfsRestOperationException (#3341)"
This reverts commit 82658a22d6.
2021-11-05 14:21:15 +00:00
sumangala-patki
e1ac10ceae
HADOOP-17863. ABFS: Fix compiler deprecation warning in TextFileBasedIdentityHandler (#3332)
Closes #3332 

Contributed by Sumangala Patki
2021-11-05 12:53:51 +00:00
sumangala-patki
19644c0cdc
HADOOP-17862. ABFS: Fix unchecked cast compiler warning for AbfsListStatusRemoteIterator (#3331)
closes #3331 

Contributed by Sumangala Patki
2021-11-05 12:50:37 +00:00
sumangala-patki
82658a22d6
HADOOP-17873. ABFS: Fix transient failures in ITestAbfsStreamStatistics and ITestAbfsRestOperationException (#3341)
Addresses transient failures in the following test classes:

* ITestAbfsStreamStatistics: Uses a filesystem level static instance to record read/write statistics, which also tracks these operations in other tests running in parallel. Marked for sequential-only run to avoid transient failure

* ITestAbfsRestOperationException: The use of a static member to track retry count causes transient failures when two tests of this class happen to run together. Switch to non-static variable for assertions on retry count

closes #3341

Contributed by Sumangala Patki
2021-11-04 14:40:37 +00:00
Jinhu Wu
a9c51ea57d
HADOOP-17374. support listObjectV2 (#3587) 2021-11-03 21:47:41 -07:00
Steve Loughran
6c6d1b64d4
HADOOP-17928. Syncable: S3A to warn and downgrade (#3585)
This switches the default behavior of S3A output streams
to warning that Syncable.hsync() or hflush() have been
called; it's not considered an error unless the defaults
are overridden.

This avoids breaking applications which call the APIs,
at the risk of people trying to use S3 as a safe store
of streamed data (HBase WALs, audit logs etc).

Contributed by Steve Loughran.
2021-11-02 13:26:16 +00:00
Tamas Domok
a4a874f532
HADOOP-17974. Import statements in hadoop-aws trigger clover failures.
Contributed by Tamas Domok
2021-10-21 18:31:28 +01:00
Viraj Jasani
516f36c6f1
HADOOP-17967. Keep restrict-imports-enforcer-rule for Guava VisibleForTesting in hadoop-main pom (#3555) 2021-10-21 16:54:25 +09:00
Mehakmeet Singh
cb8c98fbb0
HADOOP-17953. S3A: Tests to lookup global or per-bucket configuration for encryption algorithm (#3525)
Followup to S3-CSE work of HADOOP-13887

Contributed by Mehakmeet Singh
2021-10-19 10:58:27 +01:00
adol001
280ae1c0a9
HADOOP-17932. Distcp file length comparison have no effect (#3519)
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
2021-10-18 19:07:53 +09:00
guoxin12
a38fa941c4
HDFS-16206. The error of Constant annotation in AzureNativeFileSystemStore.java (#3372). Contributed by guoxin. 2021-10-18 09:29:20 +05:30
Viraj Jasani
79e5a7f3e3
HADOOP-17962. Replace Guava VisibleForTesting by Hadoop's own annotation in hadoop-tools modules (#3540) 2021-10-14 17:43:32 +09:00
Viraj Jasani
1151edf12e
HADOOP-17956. Replace all default Charset usage with UTF-8 (#3529)
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
2021-10-14 13:07:24 +09:00
Viraj Jasani
8071dbb9c6
HADOOP-17950. Provide replacement for deprecated APIs of commons-io IOUtils (#3515)
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
2021-10-07 10:58:29 +09:00
Petre Bogdan Stolojan
33608c3bd4
HADOOP-17951. Improve S3A checking of S3 Access Point existence (#3516)
Follow-on to HADOOP-17198. Support S3 Access Points

Contributed by Bogdan Stolojan
2021-10-04 20:58:22 +01:00
Josh Elser
6f2fa87fc8
HADOOP-17934 ABFS: Make sure the AbfsHttpOperation is non-null before using it (#3477)
Contributed by: Josh Elser
2021-09-30 13:38:13 +01:00
Steve Loughran
d609f44aa0
HADOOP-17922. move to fs.s3a.encryption.algorithm - JCEKS integration (#3466)
The ordering of the resolution of new and deprecated s3a encryption options & secrets is the same when JCEKS and other hadoop credentials stores are used to store them as
when they are in XML files: per-bucket settings always take priority over global values,
even when the bucket-level options use the old option names.

Contributed by Mehakmeet Singh and Steve Loughran
2021-09-30 10:38:53 +01:00
Steve Loughran
2fda61fac6
HADOOP-17851. S3A to support user-specified content encoding (#3498)
The option fs.s3a.object.content.encoding declares the content encoding to be set on files when they are written; this is served up in the "Content-Encoding" HTTP header when reading objects back in.

This is useful for people loading the data into other tools in the AWS ecosystem which don't use file extensions to infer compression type (e.g. serving compressed files from S3 or importing into RDS)

Contributed by: Holden Karau
2021-09-29 13:42:07 +01:00
Petre Bogdan Stolojan
b7c2864613
HADOOP-17198. Support S3 Access Points (#3260)
Add support for S3 Access Points. This provides extra security as it
ensures applications are not working with buckets belong to third parties.

To bind a bucket to an access point, set the access point (ap) ARN,
which must be done for each specific bucket, using the pattern

fs.s3a.bucket.$BUCKET.accesspoint.arn = ARN

* The global/bucket option `fs.s3a.accesspoint.required` to
mandate that buckets must declare their access point.
* This is not compatible with S3Guard.

Consult the documentation for further details.

Contributed by Bogdan Stolojan
2021-09-29 10:54:17 +01:00
Mehakmeet Singh
acffe203b8
HADOOP-17195. ABFS: OutOfMemory error while uploading huge files (#3446)
Addresses the problem of processes running out of memory when
there are many ABFS output streams queuing data to upload,
especially when the network upload bandwidth is less than the rate
data is generated.

ABFS Output streams now buffer their blocks of data to
"disk", "bytebuffer" or "array", as set in
"fs.azure.data.blocks.buffer"

When buffering via disk, the location for temporary storage
is set in "fs.azure.buffer.dir"

For safe scaling: use "disk" (default); for performance, when
confident that upload bandwidth will never be a bottleneck,
experiment with the memory options.

The number of blocks a single stream can have queued for uploading
is set in "fs.azure.block.upload.active.blocks".
The default value is 20.

Contributed by Mehakmeet Singh.
2021-09-21 12:48:06 +01:00
Mehakmeet Singh
c54bf19978
HADOOP-17871. S3A CSE: minor tuning (#3412)
This migrates the fs.s3a-server-side encryption configuration options
to a name which covers client-side encryption too.

fs.s3a.server-side-encryption-algorithm becomes fs.s3a.encryption.algorithm
fs.s3a.server-side-encryption.key becomes fs.s3a.encryption.key

The existing keys remain valid, simply deprecated and remapped
to the new values. If you want server-side encryption options
to be picked up regardless of hadoop versions, use
the old keys.

(the old key also works for CSE, though as no version of Hadoop
with CSE support has shipped without this remapping, it's less
relevant)


Contributed by: Mehakmeet Singh
2021-09-15 22:29:22 +01:00
Steve Loughran
10f3abeae7
Revert "HADOOP-17195. OutOfMemory error while performing hdfs CopyFromLocal to ABFS (#3406)" (#3443)
This reverts commit 52c024cc3a.
2021-09-15 22:27:49 +01:00
Mehakmeet Singh
52c024cc3a
HADOOP-17195. OutOfMemory error while performing hdfs CopyFromLocal to ABFS (#3406)
This migrates the fs.s3a-server-side encryption configuration options
to a name which covers client-side encryption too.

fs.s3a.server-side-encryption-algorithm becomes fs.s3a.encryption.algorithm
fs.s3a.server-side-encryption.key becomes fs.s3a.encryption.key

The existing keys remain valid, simply deprecated and remapped
to the new values. If you want server-side encryption options
to be picked up regardless of hadoop versions, use
the old keys.

(the old key also works for CSE, though as no version of Hadoop
with CSE support has shipped without this remapping, it's less
relevant)


Contributed by: Mehakmeet Singh
2021-09-15 22:27:28 +01:00
Steve Loughran
6e3aeb1544
HADOOP-17894. CredentialProviderFactory.getProviders() recursion loading JCEKS file from S3A (#3393)
* CredentialProviderFactory to detect and report on recursion.
* S3AFS to remove incompatible providers.
* Integration Test for this.

Contributed by Steve Loughran.
2021-09-07 15:29:37 +01:00
Mukund Thakur
9b8f81a179
HADOOP-17156. ABFS: Release the byte buffers held by input streams in close() (#3285)
Contributed By: Mukund Thakur
2021-09-07 15:13:36 +05:30
Dongjoon Hyun
265a48e245
HADOOP-17869. fs.s3a.connection.maximum should be bigger than fs.s3a.threads.max (#3337).
The value of `fs.s3a.connection.maximum` has been increased to 96

Contributed by Dongjoon Hyun
2021-08-30 18:30:43 +01:00
sumangala-patki
dcddc6a59f
HADOOP-17682. ABFS: Support FileStatus input to OpenFileWithOptions() via OpenFileParameters (#2975) 2021-08-18 19:14:10 +05:30
Steve Loughran
ee07b90286
HADOOP-17836. Improve logging on ABFS error reporting (#3281)
Contributed by Steve Loughran.
2021-08-18 11:39:17 +01:00
Mehakmeet Singh
8d6a686953
HADOOP-17823. S3A S3Guard tests to skip if S3-CSE are enabled (#3263)
Follow on to
* HADOOP-13887. Encrypt S3A data client-side with AWS SDK (S3-CSE)
* HADOOP-17817. S3A to raise IOE if both S3-CSE and S3Guard enabled

If the S3A bucket is set up to use S3-CSE encryption, all tests which turn
on S3Guard are skipped, so they don't raise any exceptions about
incompatible configurations.

Contributed by: Mehakmeet Singh
2021-08-05 11:46:17 +01:00
Steve Loughran
a67a0fd37a
YARN-10878. move TestNMSimulator off com.google (#3268)
Converting from a class to a lambda-expression removes all need to reference the google stuff

Contributed by Steve Loughran
2021-08-05 11:34:10 +01:00
sumangala-patki
3450522c2f
HADOOP-17618. ABFS: Partially obfuscate SAS object IDs in Logs (#2845)
Contributed by Sumangala Patki
2021-08-04 19:45:57 +01:00
Steve Loughran
4627e9c7ef
HADOOP-17822. fs.s3a.acl.default not working after S3A Audit feature (#3249)
Fixes the regression caused by HADOOP-17511 by moving where the
option  fs.s3a.acl.default is read -doing it before the RequestFactory
is created.

Adds

* A unit test in TestRequestFactory to verify the ACLs are set
  on all file write operations.
* A new ITestS3ACannedACLs test which verifies that ACLs really
  do get all the way through.
* S3A Assumed Role delegation tokens to include the IAM permission
  s3:PutObjectAcl in the generated role.

Contributed by Steve Loughran
2021-08-02 15:26:56 +01:00