hadoop

Author	SHA1	Message	Date
Mehakmeet Singh	4c8cd61961	HADOOP-17461. Collect thread-level IOStatistics. (#4352 ) This adds a thread-level collector of IOStatistics, IOStatisticsContext, which can be: * Retrieved for a thread and cached for access from other threads. * reset() to record new statistics. * Queried for live statistics through the IOStatisticsSource.getIOStatistics() method. * Queried for a statistics aggregator for use in instrumented classes. * Asked to create a serializable copy in snapshot() The goal is to make it possible for applications with multiple threads performing different work items simultaneously to be able to collect statistics on the individual threads, and so generate aggregate reports on the total work performed for a specific job, query or similar unit of work. Some changes in IOStatistics-gathering classes are needed for this feature * Caching the active context's aggregator in the object's constructor * Updating it in close() Slightly more work is needed in multithreaded code, such as the S3A committers, which collect statistics across all threads used in task and job commit operations. Currently the IOStatisticsContext-aware classes are: * The S3A input stream, output stream and list iterators. * RawLocalFileSystem's input and output streams. * The S3A committers. * The TaskPool class in hadoop-common, which propagates the active context into scheduled worker threads. Collection of statistics in the IOStatisticsContext is disabled process-wide by default until the feature is considered stable. To enable the collection, set the option fs.thread.level.iostatistics.enabled to "true" in core-site.xml. Contributed by Mehakmeet Singh and Steve Loughran	2022-07-26 20:41:22 +01:00
ashutoshpant	bac2219e3c	HADOOP-18330. S3AFileSystem removes Path when calling createS3Client (#4572 ) Adds a new parameter object in s3ClientCreationParameters that holds the full s3a path URI Contributed by Ashutosh Pant	2022-07-21 10:16:39 +01:00
Mukund Thakur	4d1f6f9b99	HADOOP-18106: Handle memory fragmentation in S3A Vectored IO. (#4445 ) part of HADOOP-18103. Handling memory fragmentation in S3A vectored IO implementation by allocating smaller user range requested size buffers and directly filling them from the remote S3 stream and skipping undesired data in between ranges. This patch also adds aborting active vectored reads when stream is closed or unbuffer() is called. Contributed By: Mukund Thakur	2022-06-22 17:29:32 +01:00
Mukund Thakur	1408dd89a7	HADOOP-18107 Adding scale test for vectored reads for large file (#4273 ) part of HADOOP-18103. Contributed By: Mukund Thakur	2022-06-22 17:29:32 +01:00
Mukund Thakur	5db0f34e29	HADOOP-18104: S3A: Add configs to configure minSeekForVectorReads and maxReadSizeForVectorReads (#3964 ) Part of HADOOP-18103. Introducing fs.s3a.vectored.read.min.seek.size and fs.s3a.vectored.read.max.merged.size to configure min seek and max read during a vectored IO operation in S3A connector. These properties define how the ranges will be merged. To completely disable merging set fs.s3a.vectored.read.max.merged.size to 0. Contributed By: Mukund Thakur	2022-06-22 17:29:32 +01:00
Mukund Thakur	2daf0a814f	HADOOP-11867. Add a high-performance vectored read API. (#3904 ) part of HADOOP-18103. Add support for multiple ranged vectored read api in PositionedReadable. The default iterates through the ranges to read each synchronously, but the intent is that FSDataInputStream subclasses can make more efficient readers, especially in object store implementations. Also added implementation in S3A where smaller ranges are merged and sliced byte buffers are returned to the readers. All the merged ranges are fetched from S3 asynchronously. Contributed By: Owen O'Malley and Mukund Thakur	2022-06-22 17:29:32 +01:00
Steve Loughran	e199da3fae	HADOOP-17833. Improve Magic Committer performance (#3289 ) Speed up the magic committer with key changes being * Writes under __magic always retain directory markers * File creation under __magic skips all overwrite checks, including the LIST call intended to stop files being created over dirs. * mkdirs under __magic probes the path for existence but does not look any further. Extra parallelism in task and job commit directory scanning Use of createFile and openFile with parameters which allow for HEAD checks to be skipped. The committer can write the summary _SUCCESS file to the path `fs.s3a.committer.summary.report.directory`, which can be in a different file system/bucket if desired, using the job id as the filename. Also: HADOOP-15460. S3A FS to add `fs.s3a.create.performance` Application code can set the createFile() option fs.s3a.create.performance to true to disable the same safety checks when writing under magic directories. Use with care. The createFile option prefix `fs.s3a.create.header.` can be used to add custom headers to S3 objects when created. Contributed by Steve Loughran.	2022-06-17 19:11:35 +01:00
monthonk	5ac55b405d	HADOOP-12020. Add s3a storage class option fs.s3a.create.storage.class (#3877 ) Adds a new option fs.s3a.create.storage.class which can be used to set the storage class for files created in AWS S3. Consult the documentation for details and instructions on how to disable the relevant tests when testing against third-party stores. Contributed by Monthon Klongklaew	2022-06-08 19:05:17 +01:00
Ashutosh Gupta	a46ef5f2eb	HADOOP-18234. Fix s3a access point xml examples (#4309 ) Contributed by Ashutosh Gupta	2022-05-16 17:47:14 +01:00
Daniel Carl Jones	4230162a76	HADOOP-18168. Fix S3A ITestMarkerTool use of purged public bucket. (#4140 ) This moves off use of the purged s3a://landsat-pds bucket, so fixing tests which had started failing. * Adds a new class, PublicDatasetTestUtils to manage the use of public datasets. * The new test bucket s3a://usgs-landsat/ is requester pays, so depends upon HADOOP-14661. Consult the updated test documentation when running against other S3 stores. Contributed by Daniel Carl Jones Change-Id: Ie8585e4d9b67667f8cb80b2970225d79a4f8d257	2022-05-03 14:28:08 +01:00
Steve Loughran	6ec39d45c9	Revert "HADOOP-18168. . (#4140 )" This reverts commit `6ab7b72cd6`.	2022-05-03 14:27:52 +01:00
Daniel Carl Jones	6ab7b72cd6	HADOOP-18168. . (#4140 ) This moves off use of the purged s3a://landsat-pds bucket, so fixing tests which had started failing. * Adds a new class, PublicDatasetTestUtils to manage the use of public datasets. * The new test bucket s3a://usgs-landsat/ is requester pays, so depends upon HADOOP-14661. Consult the updated test documentation when running against other S3 stores. Contributed by Daniel Carl Jones	2022-05-03 14:26:52 +01:00
Steve Loughran	e0cd0a82e0	HADOOP-16202. Enhanced openFile(): hadoop-aws changes. (#2584/3) S3A input stream support for the few fs.option.openfile settings. As well as supporting the read policy option and values, if the file length is declared in fs.option.openfile.length then no HEAD request will be issued when opening a file. This can cut a few tens of milliseconds off the operation. The patch adds a new openfile parameter/FS configuration option fs.s3a.input.async.drain.threshold (default: 16000). It declares the number of bytes remaining in the http input stream above which any operation to read and discard the rest of the stream, "draining", is executed asynchronously. This asynchronous draining offers some performance benefit on seek-heavy file IO. Contributed by Steve Loughran. Change-Id: I9b0626bbe635e9fd97ac0f463f5e7167e0111e39	2022-04-24 17:33:05 +01:00
Daniel Carl Jones	a6ebc42671	HADOOP-18201. Remove endpoint config overrides for ITestS3ARequesterPays (#4169 ) Contributed by Daniel Carl Jones.	2022-04-14 16:21:34 +01:00
Daniel Carl Jones	9edfe30a60	HADOOP-14661. Add S3 requester pays bucket support to S3A (#3962 ) Adds the option fs.s3a.requester.pays.enabled, which, if set to true, allows the client to access S3 buckets where the requester is billed for the IO. Contributed by Daniel Carl Jones	2022-03-23 20:00:50 +00:00
Steve Loughran	708a0ce21b	HADOOP-13704. Optimized S3A getContentSummary() Optimize the scan for s3 by performing a deep tree listing, inferring directory counts from the paths returned. Contributed by Ahmar Suhail. Change-Id: I26ffa8c6f65fd11c68a88d6e2243b0eac6ffd024	2022-03-22 13:21:12 +00:00
Mukund Thakur	672e380c4f	HADOOP-18112: Implement paging during multi object delete. (#4045 ) Multi object delete of size more than 1000 is not supported by S3 and fails with MalformedXML error. So implementing paging of requests to reduce the number of keys in a single request. Page size can be configured using "fs.s3a.bulk.delete.page.size" Contributed By: Mukund Thakur	2022-03-11 13:05:45 +05:30
Viraj Jasani	66b72406bd	HADOOP-18131. Upgrade maven enforcer plugin and relevant dependencies (#4000 ) Reviewed-by: Akira Ajisaka <aajisaka@apache.org> Reviewed-by: Wei-Chiu Chuang <weichiu@apache.org> Signed-off-by: Takanobu Asanuma <tasanuma@apache.org>	2022-03-08 17:27:04 +09:00
Mehakmeet Singh	6995374b54	HADOOP-18150. Fix ITestAuditManagerDisabled test in S3A. (#4044 ) Contributed by Mehakmeet Singh	2022-03-03 18:44:28 +00:00
monthonk	1f157f802d	HADOOP-17386. Change default fs.s3a.buffer.dir to be under Yarn container path on yarn applications (#3908 ) Co-authored-by: Monthon Klongklaew <monthonk@amazon.com> Signed-off-by: Akira Ajisaka <aajisaka@apache.org>	2022-02-22 13:50:27 +09:00
Steve Loughran	efdec92cab	HADOOP-18091. S3A auditing leaks memory through ThreadLocal references (#3930 ) Adds a new map type WeakReferenceMap, which stores weak references to values, and a WeakReferenceThreadMap subclass to more closely resemble a thread local type, as it is a map of threadId to value. Construct it with a factory method and optional callback for notification on loss and regeneration. WeakReferenceThreadMap<WrappingAuditSpan> activeSpan = new WeakReferenceThreadMap<>( (k) -> getUnbondedSpan(), this::noteSpanReferenceLost); This is used in ActiveAuditManagerS3A for span tracking. Relates to * HADOOP-17511. Add an Audit plugin point for S3A * HADOOP-18094. Disable S3A auditing by default. Contributed by Steve Loughran.	2022-02-10 12:31:41 +00:00
Joey Krabacher	a08e69d33e	HADOOP-18114. Documentation correction in assumed_roles.md (#3949 ) Fixes typo in hadoop-aws/assumed_roles.md Contributed by Joey Krabacher	2022-02-09 10:35:11 +00:00
Petre Bogdan Stolojan	5e7ce26e66	HADOOP-18085. S3 SDK Upgrade causes AccessPoint ARN endpoint mistranslation (#3902 ) Part of HADOOP-17198. Support S3 Access Points. HADOOP-18068. "upgrade AWS SDK to 1.12.132" broke the access point endpoint translation. Correct endpoints should start with "s3-accesspoint.", after SDK upgrade they start with "s3.accesspoint-" which messes up tests + region detection by the SDK. Contributed by Bogdan Stolojan	2022-02-04 15:37:08 +00:00
Steve Loughran	b795f6f9a8	HADOOP-18094. Disable S3A auditing by default. See HADOOP-18091. S3A auditing leaks memory through ThreadLocal references * Adds a new option fs.s3a.audit.enabled to control whether or not auditing is enabled. This is false by default. * When false, the S3A auditing manager is NoopAuditManagerS3A, which was formerly only used for unit tests and during filesystem initialization. * When true, ActiveAuditManagerS3A is used for managing auditing, allowing auditing events to be reported. * updates documentation and tests. This patch does not fix the underlying leak. When auditing is enabled, long-lived threads will retain references to the audit managers of S3A filesystem instances which have already been closed. Contributed by Steve Loughran.	2022-01-24 13:37:33 +00:00
Steve Loughran	d8ab84275e	HADOOP-18068. upgrade AWS SDK to 1.12.132 (#3864 ) With this update, the versions of key shaded dependencies are jackson 2.12.3 httpclient 4.5.13 Contributed by Steve Loughran	2022-01-18 10:31:28 +00:00
Steve Loughran	14ba19af06	HADOOP-17409. Remove s3guard from S3A module (#3534 ) Completely removes S3Guard support from the S3A codebase. If the connector is configured to use any metastore other than the null and local stores (i.e. DynamoDB is selected) the s3a client will raise an exception and refuse to initialize. This is to ensure that there is no mix of S3Guard enabled and disabled deployments with the same configuration but different hadoop releases -it must be turned off completely. The "hadoop s3guard" command has been retained -but the supported subcommands have been reduced to those which are not purely S3Guard related: "bucket-info" and "uploads". This is a major change in terms of the number of files changed; before cherry picking subsequent s3a patches into older releases, this patch will probably need backporting first. Goodbye S3Guard, your work is done. Time to die. Contributed by Steve Loughran.	2022-01-17 18:08:57 +00:00
monthonk	b27732c69b	HADOOP-14334. S3 SSEC tests to downgrade when running against a mandatory encryption object store (#3870 ) Contributed by Monthon Klongklaew	2022-01-09 18:01:47 +00:00
Ashutosh Gupta	ebdbe7eb82	HADOOP-18057. Fix typo: validateEncrytionSecrets -> validateEncryptionSecrets (#3826 )	2021-12-27 16:51:17 +08:00
GuoPhilipse	c65c87f211	HADOOP-18026. Fix default value of Magic committer (#3723 ) Contributed by guophilipse	2021-11-29 15:50:30 +00:00
Steve Loughran	98fe0d0fc3	HADOOP-17979. Add Interface EtagSource to allow FileStatus subclasses to provide etags (#3633 ) Contributed by Steve Loughran	2021-11-24 17:33:12 +00:00
Mehakmeet Singh	a35f7dec25	HADOOP-18016. Make certain methods LimitedPrivate in S3AUtils.java (#3685 ) Contributed By: Mehakmeet Singh	2021-11-24 13:32:59 +05:30
Viraj Jasani	c7ec1897c4	HADOOP-18018. unguava: remove Preconditions from hadoop-tools modules (#3688 )	2021-11-23 13:34:10 +09:00
Steve Loughran	6c6d1b64d4	HADOOP-17928. Syncable: S3A to warn and downgrade (#3585 ) This switches the default behavior of S3A output streams to warning that Syncable.hsync() or hflush() have been called; it's not considered an error unless the defaults are overridden. This avoids breaking applications which call the APIs, at the risk of people trying to use S3 as a safe store of streamed data (HBase WALs, audit logs etc). Contributed by Steve Loughran.	2021-11-02 13:26:16 +00:00
Tamas Domok	a4a874f532	HADOOP-17974. Import statements in hadoop-aws trigger clover failures. Contributed by Tamas Domok	2021-10-21 18:31:28 +01:00
Viraj Jasani	516f36c6f1	HADOOP-17967. Keep restrict-imports-enforcer-rule for Guava VisibleForTesting in hadoop-main pom (#3555 )	2021-10-21 16:54:25 +09:00
Mehakmeet Singh	cb8c98fbb0	HADOOP-17953. S3A: Tests to lookup global or per-bucket configuration for encryption algorithm (#3525 ) Followup to S3-CSE work of HADOOP-13887 Contributed by Mehakmeet Singh	2021-10-19 10:58:27 +01:00
Viraj Jasani	79e5a7f3e3	HADOOP-17962. Replace Guava VisibleForTesting by Hadoop's own annotation in hadoop-tools modules (#3540 )	2021-10-14 17:43:32 +09:00
Viraj Jasani	1151edf12e	HADOOP-17956. Replace all default Charset usage with UTF-8 (#3529 ) Signed-off-by: Akira Ajisaka <aajisaka@apache.org>	2021-10-14 13:07:24 +09:00
Petre Bogdan Stolojan	33608c3bd4	HADOOP-17951. Improve S3A checking of S3 Access Point existence (#3516 ) Follow-on to HADOOP-17198. Support S3 Access Points Contributed by Bogdan Stolojan	2021-10-04 20:58:22 +01:00
Steve Loughran	d609f44aa0	HADOOP-17922. move to fs.s3a.encryption.algorithm - JCEKS integration (#3466 ) The ordering of the resolution of new and deprecated s3a encryption options & secrets is the same when JCEKS and other hadoop credentials stores are used to store them as when they are in XML files: per-bucket settings always take priority over global values, even when the bucket-level options use the old option names. Contributed by Mehakmeet Singh and Steve Loughran	2021-09-30 10:38:53 +01:00
Steve Loughran	2fda61fac6	HADOOP-17851. S3A to support user-specified content encoding (#3498 ) The option fs.s3a.object.content.encoding declares the content encoding to be set on files when they are written; this is served up in the "Content-Encoding" HTTP header when reading objects back in. This is useful for people loading the data into other tools in the AWS ecosystem which don't use file extensions to infer compression type (e.g. serving compressed files from S3 or importing into RDS) Contributed by: Holden Karau	2021-09-29 13:42:07 +01:00
Petre Bogdan Stolojan	b7c2864613	HADOOP-17198. Support S3 Access Points (#3260 ) Add support for S3 Access Points. This provides extra security as it ensures applications are not working with buckets belonging to third parties. To bind a bucket to an access point, set the access point (ap) ARN, which must be done for each specific bucket, using the pattern fs.s3a.bucket.$BUCKET.accesspoint.arn = ARN * The global/bucket option `fs.s3a.accesspoint.required` to mandate that buckets must declare their access point. * This is not compatible with S3Guard. Consult the documentation for further details. Contributed by Bogdan Stolojan	2021-09-29 10:54:17 +01:00
Mehakmeet Singh	c54bf19978	HADOOP-17871. S3A CSE: minor tuning (#3412 ) This migrates the fs.s3a-server-side encryption configuration options to a name which covers client-side encryption too. fs.s3a.server-side-encryption-algorithm becomes fs.s3a.encryption.algorithm fs.s3a.server-side-encryption.key becomes fs.s3a.encryption.key The existing keys remain valid, simply deprecated and remapped to the new values. If you want server-side encryption options to be picked up regardless of hadoop versions, use the old keys. (the old key also works for CSE, though as no version of Hadoop with CSE support has shipped without this remapping, it's less relevant) Contributed by: Mehakmeet Singh	2021-09-15 22:29:22 +01:00
Steve Loughran	6e3aeb1544	HADOOP-17894. CredentialProviderFactory.getProviders() recursion loading JCEKS file from S3A (#3393 ) * CredentialProviderFactory to detect and report on recursion. * S3AFS to remove incompatible providers. * Integration Test for this. Contributed by Steve Loughran.	2021-09-07 15:29:37 +01:00
Dongjoon Hyun	265a48e245	HADOOP-17869. `fs.s3a.connection.maximum` should be bigger than `fs.s3a.threads.max` (#3337 ). The value of `fs.s3a.connection.maximum` has been increased to 96 Contributed by Dongjoon Hyun	2021-08-30 18:30:43 +01:00
Mehakmeet Singh	8d6a686953	HADOOP-17823. S3A S3Guard tests to skip if S3-CSE are enabled (#3263 ) Follow on to * HADOOP-13887. Encrypt S3A data client-side with AWS SDK (S3-CSE) * HADOOP-17817. S3A to raise IOE if both S3-CSE and S3Guard enabled If the S3A bucket is set up to use S3-CSE encryption, all tests which turn on S3Guard are skipped, so they don't raise any exceptions about incompatible configurations. Contributed by: Mehakmeet Singh	2021-08-05 11:46:17 +01:00
Steve Loughran	4627e9c7ef	HADOOP-17822. fs.s3a.acl.default not working after S3A Audit feature (#3249 ) Fixes the regression caused by HADOOP-17511 by moving where the option fs.s3a.acl.default is read -doing it before the RequestFactory is created. Adds * A unit test in TestRequestFactory to verify the ACLs are set on all file write operations. * A new ITestS3ACannedACLs test which verifies that ACLs really do get all the way through. * S3A Assumed Role delegation tokens to include the IAM permission s3:PutObjectAcl in the generated role. Contributed by Steve Loughran	2021-08-02 15:26:56 +01:00
Steve Loughran	ee466d4b40	HADOOP-17628. Distcp contract test is really slow with ABFS and S3A; timing out. (#3240 ) This patch cuts down the size of directory trees used for distcp contract tests against object stores, so making them much faster against distant/slow stores. On abfs, the test only runs with -Dscale (as was the case for s3a already), and has the larger scale test timeout. After every test case, the FileSystem IOStatistics are logged, to provide information about what IO is taking place and what its performance is. There are some test cases which upload files of 1+ MiB; you can increase the size of the upload in the option "scale.test.distcp.file.size.kb" Set it to zero and the large file tests are skipped. Contributed by Steve Loughran.	2021-08-02 11:36:43 +01:00
Bobby Wang	266b1bd1bb	HADOOP-17812. NPE in S3AInputStream read() after failure to reconnect to store (#3222 ) This improves error handling after multiple failures reading data -when the read fails and attempts to reconnect() also fail. Contributed by Bobby Wang.	2021-07-30 20:04:11 +01:00
Petre Bogdan Stolojan	a218038960	HADOOP-17139 Re-enable optimized copyFromLocal implementation in S3AFileSystem (#3101 ) This work * Defines the behavior of FileSystem.copyFromLocal in filesystem.md * Implements a high performance implementation of copyFromLocalOperation for S3 * Adds a contract test for the operation: AbstractContractCopyFromLocalTest * Implements the contract tests for Local and S3A FileSystems Contributed by: Bogdan Stolojan	2021-07-30 19:42:08 +01:00
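
The paging described in the HADOOP-18112 entry above (S3 rejects a multi-object delete of more than 1000 keys, so requests are split into pages) can be sketched as follows. This is a minimal illustration under stated assumptions, not the hadoop-aws implementation: the class name `DeletePager` and the method `pages` are hypothetical, and no S3 call is made. The page size is passed in by the caller, in the spirit of "fs.s3a.bulk.delete.page.size".

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: DeletePager and pages() are hypothetical names,
// not classes or methods from hadoop-aws.
class DeletePager {

    /** Split keys into consecutive pages of at most pageSize entries. */
    static List<List<String>> pages(List<String> keys, int pageSize) {
        if (pageSize < 1) {
            throw new IllegalArgumentException("pageSize must be >= 1");
        }
        List<List<String>> result = new ArrayList<>();
        for (int start = 0; start < keys.size(); start += pageSize) {
            int end = Math.min(start + pageSize, keys.size());
            // Copy the slice so each page is independent of the source list.
            result.add(new ArrayList<>(keys.subList(start, end)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            keys.add("dir/file-" + i);
        }
        for (List<String> page : pages(keys, 1000)) {
            // A real client would issue one bulk delete request per page here.
            System.out.println("would delete " + page.size() + " keys");
        }
    }
}
```

With 2500 keys and a page size of 1000, the sketch yields three pages of 1000, 1000 and 500 keys, so no single request exceeds the limit named in the commit message.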


525 Commits