hadoop

Author	SHA1	Message	Date
Steve Loughran	26b9d480e8	HADOOP-17337. S3A NetworkBinding has a runtime dependency on shaded httpclient. (#2599 ) Contributed by Steve Loughran.	2021-02-03 14:29:56 +00:00
Steve Loughran	0bb52a42e5	HADOOP-17483. Magic committer is enabled by default. (#2656 ) * core-default.xml updated so that fs.s3a.committer.magic.enabled = true * CommitConstants updated to match * All tests which previously enabled the magic committer now rely on default settings. This helps make sure it is enabled. * Docs cover the switch, mention that it is enabled and explain why you may want to disable it. Note: this doesn't switch to using the committer; it just enables the path rewriting magic which it depends on. Contributed by Steve Loughran.	2021-01-27 19:04:22 +00:00
Steve Loughran	28cc912a5c	HADOOP-17493. Revert name of DELEGATION_TOKENS_ISSUED constant/statistic (#2649 ) Follow-on to HADOOP-16830/HADOOP-17271. Contributed by Steve Loughran.	2021-01-27 16:39:29 +00:00
Steve Loughran	80c7404b51	HADOOP-17414. Magic committer files don't have the count of bytes written collected by spark (#2530 ) This needs SPARK-33739 in the matching spark branch in order to work Contributed by Steve Loughran.	2021-01-26 19:30:51 +00:00
Steve Loughran	724edb0354	HADOOP-17451. IOStatistics test failures in S3A code. (#2594 ) Caused by HADOOP-16830 and HADOOP-17271. Fixes tests which fail intermittently based on configs and in the case of the HugeFile tests, bulk runs with existing FS instances meant statistic probes sometimes ended up probing those of a previous FS. Contributed by Steve Loughran. Change-Id: I65ba3f44444e59d298df25ac5c8dc5a8781dfb7d	2021-01-12 17:30:32 +00:00
Steve Loughran	05c9c2ed02	Revert "HADOOP-17451. IOStatistics test failures in S3A code. (#2594 )" This reverts commit `d3014e01f3`. (fixing commit text before it is frozen)	2021-01-12 17:29:59 +00:00
Steve Loughran	d3014e01f3	HADOOP-17451. IOStatistics test failures in S3A code. (#2594 ) Caused by HADOOP-16380 and HADOOP-17271. Fixes tests which fail intermittently based on configs and in the case of the HugeFile tests, bulk runs with existing FS instances meant statistic probes sometimes ended up probing those of a previous FS. Contributed by Steve Loughran.	2021-01-12 17:25:14 +00:00
Gabor Bota	42eb9ff68e	HADOOP-17454. [s3a] Disable bucket existence check - set fs.s3a.bucket.probe to 0 (#2593 ) Also fixes HADOOP-16995. ITestS3AConfiguration proxy tests failures when bucket probes == 0 The improvement should include the fix, because the test would fail by default otherwise. Change-Id: I9a7e4b5e6d4391ebba096c15e84461c038a2ec59	2021-01-05 15:43:01 +01:00
Steve Loughran	617af28e80	HADOOP-17271. S3A connector to support IOStatistics. (#2580 ) S3A connector to support the IOStatistics API of HADOOP-16830. This is a major rework of the S3A Statistics collection to * Embrace the IOStatistics APIs * Move from direct references of S3AInstrumentation statistics collectors to interface/implementation classes in new packages. * Ubiquitous support of IOStatistics, including: S3AFileSystem, input and output streams, RemoteIterator instances provided in list calls. * Adoption of new statistic names from hadoop-common Regarding statistic collection, as well as all existing statistics, the connector now records min/max/mean durations of HTTP GET and HEAD requests, and those of LIST operations. Contributed by Steve Loughran.	2020-12-31 21:55:39 +00:00
yzhangal	3d2193cd64	HADOOP-17338. Intermittent S3AInputStream failures: Premature end of Content-Length delimited message body etc (#2497 ) Yongjun Zhang <yongjunzhang@pinterest.com>	2020-12-18 19:08:10 +00:00
Mukund Thakur	03b4e98971	HADOOP-17398. Skipping network I/O in S3A getFileStatus(/) breaks some tests (#2493 ) Follow-on to HADOOP-17323. Contributed by Mukund Thakur.	2020-11-26 20:25:32 +00:00
Steve Loughran	67dc0928c1	HADOOP-17385. ITestS3ADeleteCost.testDirMarkersFileCreation failure (#2473 ). Contributed by Steve Loughran The addition of deprecated S3A configuration options in HADOOP-17318 triggered a reload of default (xml resource) configurations, which breaks tests which fail if there's a per-bucket setting inconsistent with test setup. Creating an S3AFS instance before creating the Configuration() instance for test runs gets that reload out the way before test setup takes place. Along with the fix, extra changes in the failing test suite to fail fast when marker policy isn't as expected, and to log FS state better. Rather than create and discard an instance, add a new static method to S3AFS and invoke it in test setup. This forces the load Change-Id: Id52b1c46912c6fedd2ae270e2b1eb2222a360329	2020-11-26 13:50:33 +01:00
Mukund Thakur	5fee95076b	HADOOP-17323. S3A getFileStatus("/") to skip IO (#2479 ) Contributed by Mukund Thakur.	2020-11-24 11:06:56 +00:00
Steve Loughran	9b4faf2b51	HADOOP-17332. S3A MarkerTool -min and -max are inverted. (#2425 ) This patch * fixes the inversion * adds a precondition check * if the commands are supplied inverted, swaps them with a warning. This is to stop breaking any tests written to cope with the existing behavior. Contributed by Steve Loughran	2020-11-23 20:49:42 +00:00
Jungtaek Lim	f3c629c27e	HADOOP-17388. AbstractS3ATokenIdentifier to issue date in UTC. (#2477 ) Followup to HADOOP-17379. Contributed by Jungtaek Lim.	2020-11-20 10:38:42 +00:00
Steve Loughran	ce7827c82a	HADOOP-17318. Support concurrent S3A commit jobs with same app attempt ID. (#2399 ) See also [SPARK-33402]: Jobs launched in same second have duplicate MapReduce JobIDs Contributed by Steve Loughran. Change-Id: Iae65333cddc84692997aae5d902ad8765b45772a	2020-11-18 13:34:51 +00:00
Steve Loughran	e3c08f285a	HADOOP-17244. S3A directory delete tombstones dir markers prematurely. (#2310 ) This fixes the S3Guard/Directory Marker Retention integration so that when fs.s3a.directory.marker.retention=keep, failures during multipart delete are handled correctly, as are incremental deletes during directory tree operations. In both cases, when a directory marker with children is deleted from S3, the directory entry in S3Guard is not deleted, because it is still critical to representing the structure of the store. Contributed by Steve Loughran. Change-Id: I4ca133a23ea582cd42ec35dbf2dc85b286297d2f	2020-11-18 12:18:11 +00:00
Jungtaek Lim	a7b923c80c	HADOOP-17379. AbstractS3ATokenIdentifier to set issue date == now. (#2466 ) Unless you explicitly set it, the issue date of a delegation token identifier is 0, which confuses spark renewal (SPARK-33440). This patch makes sure that all S3A DT identifiers have the current time as issue date, fixing the problem as far as S3A tokens are concerned. Contributed by Jungtaek Lim.	2020-11-17 14:43:29 +00:00
Ayush Saxena	1e3a6efcef	HADOOP-17288. Use shaded guava from thirdparty. (#2342 ). Contributed by Ayush Saxena.	2020-10-17 12:01:18 +05:30
Dongjoon Hyun	b92f72758b	HADOOP-17258. Magic S3Guard Committer to overwrite existing pendingSet file on task commit (#2371 ) Contributed by Dongjoon Hyun and Steve Loughran Change-Id: Ibaf8082e60eff5298ff4e6513edc386c5bae0274	2020-10-12 13:39:15 +01:00
Steve Loughran	f83e07a20f	HADOOP-17293. S3A to always probe S3 in S3A getFileStatus on non-auth paths This reverts changes in HADOOP-13230 to use S3Guard TTL in choosing when to issue a HEAD request; fixing tests to compensate. New org.apache.hadoop.fs.s3a.performance.OperationCost cost, S3GUARD_NONAUTH_FILE_STATUS_PROBE for use in cost tests. Contributed by Steve Loughran. Change-Id: I418d55d2d2562a48b2a14ec7dee369db49b4e29e	2020-10-08 15:35:57 +01:00
Mukund Thakur	82522d60fb	HADOOP-17281 Implement FileSystem.listStatusIterator() in S3AFileSystem (#2354 ) Contains HADOOP-17300: FileSystem.DirListingIterator.next() call should return NoSuchElementException Contributed by Mukund Thakur	2020-10-07 13:59:06 +01:00
Steve Loughran	7fae4133e0	HADOOP-17261. s3a rename() needs s3:deleteObjectVersion permission (#2303 ) Contributed by Steve Loughran.	2020-09-22 17:22:04 +01:00
Mukund Thakur	83c7c2b4c4	HADOOP-17023. Tune S3AFileSystem.listStatus() (#2257 ) S3AFileSystem.listStatus() is optimized for invocations where the path supplied is a non-empty directory. The number of S3 requests is significantly reduced, saving time, money, and reducing the risk of S3 throttling. Contributed by Mukund Thakur.	2020-09-21 17:20:16 +01:00
Steve Loughran	958cab804e	Revert "HADOOP-17244. S3A directory delete tombstones dir markers prematurely. (#2280 )" This reverts commit `9960c01a25`. Change-Id: I820534c3292f2a343693d835f625488c325fb5d6	2020-09-11 18:07:49 +01:00
Steve Loughran	9960c01a25	HADOOP-17244. S3A directory delete tombstones dir markers prematurely. (#2280 ) This changes directory tree deletion so that only files are incrementally deleted from S3Guard after the objects are deleted; the directories are left alone until metadataStore.deleteSubtree(path) is invoked. This avoids directory tombstones being added above files/child directories, which stop the treewalk and delete phase from working. Also: * Callback to delete objects splits files and dirs so that any problems deleting the dirs don't trigger s3guard updates * New statistic to measure the number of objects deleted, alongside request count. * Callback listFilesAndEmptyDirectories renamed listFilesAndDirectoryMarkers to clarify behavior. * Test enhancements to replicate the failure and verify the fix Contributed by Steve Loughran	2020-09-10 17:03:52 +01:00
Steve Loughran	5346cc3263	HADOOP-17227. S3A Marker Tool tuning (#2254 ) Contributed by Steve Loughran.	2020-09-04 14:58:03 +01:00
Mukund Thakur	cc641534dc	HADOOP-17074. S3A Listing to be fully asynchronous. (#2207 ) Contributed by Mukund Thakur.	2020-08-25 11:29:43 +01:00
Steve Loughran	5092ea62ec	HADOOP-13230. S3A to optionally retain directory markers. This adds an option to disable "empty directory" marker deletion, so as to avoid throttling and other scale problems. This feature is not backwards compatible. Consult the documentation and use with care. Contributed by Steve Loughran. Change-Id: I69a61e7584dc36e485d5e39ff25b1e3e559a1958	2020-08-15 12:51:08 +01:00
Mukund Thakur	ac697571a1	HADOOP-17186. Fixing javadoc in ListingOperationCallbacks (#2196 )	2020-08-05 20:40:49 +09:00
Mukund Thakur	8fd4f5490f	HADOOP-17131. Refactor S3A Listing code for better isolation. (#2148 ) Contributed by Mukund Thakur.	2020-08-04 16:00:02 +01:00
Mukund Thakur	4647a60430	HADOOP-17022. Tune S3AFileSystem.listFiles() API. Contributed by Mukund Thakur. Change-Id: I17f5cfdcd25670ce3ddb62c13378c7e2dc06ba52	2020-07-14 15:27:35 +01:00
jimmy-zuber-amzn	806d84b79c	HADOOP-17105. S3AFS - Do not attempt to resolve symlinks in globStatus (#2113 ) Contributed by Jimmy Zuber.	2020-07-13 19:07:48 +01:00
Steve Loughran	b9fa5e0182	HDFS-13934. Multipart uploaders to be created through FileSystem/FileContext. Contributed by Steve Loughran. Change-Id: Iebd34140c1a0aa71f44a3f4d0fee85f6bdf123a3	2020-07-13 13:30:02 +01:00
Steve Loughran	4249c04d45	HADOOP-16798. S3A Committer thread pool shutdown problems. (#1963 ) Contributed by Steve Loughran. Fixes a condition which can cause job commit to fail if a task was aborted < 60s before the job commit commenced: the task abort will shut down the thread pool with a hard exit after 60s; the job commit POST requests would be scheduled through the same pool, so be interrupted and fail. At present the access is synchronized, but presumably the executor shutdown code is calling wait() and releasing locks. Task abort is triggered from the AM when task attempts succeed but there are still active speculative task attempts running. Thus it only surfaces when speculation is enabled and the final tasks are speculating, which, given they are the stragglers, is not unheard of. Note: this problem has never been seen in production; it has surfaced in the hadoop-aws tests on a heavily overloaded desktop	2020-06-30 10:44:51 +01:00
Steve Loughran	ac5d899d40	HADOOP-17050 S3A to support additional token issuers Contributed by Steve Loughran. S3A delegation token providers will be asked for any additional token issuers, an array can be returned, each one will be asked for tokens when DelegationTokenIssuer collects all the tokens for a filesystem.	2020-06-09 14:39:06 +01:00
Steve Loughran	40d63e02f0	HADOOP-16568. S3A FullCredentialsTokenBinding fails if local credentials are unset. (#1441 ) Contributed by Steve Loughran. Move the loading to deployUnbonded (where they are required) and add a safety check when a new DT is requested	2020-06-03 17:07:00 +01:00
Mukund Thakur	29b19cd592	HADOOP-16900. Very large files can be truncated when written through the S3A FileSystem. Contributed by Mukund Thakur and Steve Loughran. This patch ensures that writes to S3A fail when more than 10,000 blocks are written. That upper bound still exists. To write massive files, make sure that the value of fs.s3a.multipart.size is set to a size which is large enough to upload the files in fewer than 10,000 blocks. Change-Id: Icec604e2a357ffd38d7ae7bc3f887ff55f2d721a	2020-05-20 13:42:25 +01:00
Steve Loughran	93b662db47	HADOOP-16953. tuning s3guard disabled warnings (#1962 ) Contributed by Steve Loughran. The S3Guard absence warning of HADOOP-16484 has been changed so that by default the S3A connector only logs at debug when the connection to the S3 Store does not have S3Guard enabled. The option to control this log level is now fs.s3a.s3guard.disabled.warn.level and can be one of: silent, inform, warn, fail. On a failure, an ExitException is raised with exit code 49. For details on this safety feature, consult the s3guard documentation.	2020-04-20 15:05:55 +01:00
Steve Loughran	42711081e3	HADOOP-16986. S3A to not need wildfly on the classpath. (#1948 ) HADOOP-16986. S3A to not need wildfly JAR on its classpath. Contributed by Steve Loughran This is a successor to HADOOP-16346, which enabled the S3A connector to load the native openssl SSL libraries for better HTTPS performance. That patch required wildfly.jar to be on the classpath. This update: * Makes wildfly.jar optional except in the special case that "fs.s3a.ssl.channel.mode" is set to "openssl" * Retains the declaration of wildfly.jar as a compile-time dependency in the hadoop-aws POM. This means that unless explicitly excluded, applications importing that published maven artifact will, transitively, add the specified wildfly JAR into their classpath for compilation/testing/ distribution. This is done for packaging and to offer that optional speedup. It is not mandatory: applications importing the hadoop-aws POM can exclude it if they choose.	2020-04-20 14:32:13 +01:00
Mukund Thakur	56350664a7	HADOOP-13873. log DNS addresses on s3a initialization. Contributed by Mukund Thakur. If you set the log org.apache.hadoop.fs.s3a.impl.NetworkBinding to DEBUG, then when the S3A bucket probe is made -the DNS address of the S3 endpoint is calculated and printed. This is useful to see if a large set of processes are all using the same IP address from the pool of load balancers to which AWS directs clients when an AWS S3 endpoint is resolved. This can have implications for performance: if all clients access the same load balancer performance may be suboptimal. Note: if bucket probes are disabled, fs.s3a.bucket.probe = 0, the DNS logging does not take place. Change-Id: I21b3ac429dc0b543f03e357fdeb94c2d2a328dd8	2020-04-17 14:15:38 +01:00
Mukund Thakur	7b2d84d19c	HADOOP-16465 listLocatedStatus() optimisation (#1943 ) Contributed by Mukund Thakur Optimize S3AFileSystem.listLocatedStatus() to perform list operations directly and then fallback to head checks for files	2020-04-14 17:19:51 +01:00
Steve Loughran	eaaaba12b1	HADOOP-16939 fs.s3a.authoritative.path should support multiple FS URIs (#1914 ) add unit test, new ITest and then fix the issue: different schema, bucket == skip factored out the underlying logic for unit testing; also moved maybeAddTrailingSlash to S3AUtils (while retaining/forwarding existing method in S3AFS). tested: london, sole failure is testListingDelete[auth=true](org.apache.hadoop.fs.s3a.ITestS3GuardOutOfBandOperations) filed HADOOP-16853 Change-Id: I4b8d0024469551eda0ec70b4968cba4abed405ed	2020-03-26 12:59:11 -06:00
Gabor Bota	c91ff8c18f	HADOOP-16858. S3Guard fsck: Add option to remove orphaned entries (#1851 ). Contributed by Gabor Bota. Adding a new feature to S3GuardTool's fsck: -fix. Change-Id: I2cdb6601fea1d859b54370046b827ef06eb1107d	2020-03-18 12:48:52 +01:00
Steve Loughran	0a9b3c98b1	HADOOP-15430. hadoop fs -mkdir -p path-ending-with-slash/ fails with s3guard (#1646 ) Contributed by Steve Loughran * move qualify logic to S3AFileSystem.makeQualified() * make S3AFileSystem.qualify() a private redirect to that * ITestS3GuardFsShell turned off	2020-03-12 14:13:55 +00:00
Mukund Thakur	f864ef7429	HADOOP-16794. S3A reverts KMS encryption to the bucket's default KMS key in rename/copy. Contributed by Mukund Thakur. This addresses an issue which surfaced with KMS encryption: the wrong KMS key could be picked up in the S3 COPY operation, so renamed files, while encrypted, would end up with the bucket default key. As well as adding tests in the new suite ITestS3AEncryptionWithDefaultS3Settings, AbstractSTestS3AHugeFiles has a new test method to verify that the encryption settings also work for large files copied via multipart operations.	2020-03-02 17:31:12 +00:00
spoganshev	e553eda9cd	HADOOP-16767 Handle non-IO exceptions in reopen() Contributed by Sergei Poganshev. Catches Exception instead of IOException in closeStream() and so handles exceptions such as SdkClientException by aborting the wrapped stream. This will increase resilience to failures, as any which occur during stream closure will be caught. Furthermore, because the underlying HTTP connection is aborted, rather than closed, it will not be recycled to cause problems on subsequent operations.	2020-03-02 17:17:54 +00:00
Mukund Thakur	e77767bb1e	HADOOP-16711. This adds a new option fs.s3a.bucket.probe, range (0-2) to control which probe for a bucket existence to perform on startup. 0: no checks 1: v1 check (as has been performed until now) 2: v2 bucket check, which also includes a permission check. Default. When set to 0, bucket existence checks won't be done during initialization thus making it faster. When the bucket is not available in S3, or if fs.s3a.endpoint points to the wrong instance of a private S3 store, consecutive calls like listing, read, write etc. will fail with an UnknownStoreException. Contributed by: * Mukund Thakur (main patch and tests) * Rajesh Balamohan (v0 list and performance tests) * lqjacklee (HADOOP-15990/v2 list) * Steve Loughran (UnknownStoreException support) modified: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java modified: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java modified: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3ARetryPolicy.java modified: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AUtils.java new file: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/UnknownStoreException.java new file: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/ErrorTranslation.java modified: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md modified: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/performance.md modified: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md modified: hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/AbstractS3AMockTest.java new file: hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3ABucketExistence.java modified: hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/MockS3ClientFactory.java modified: hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/TestS3AExceptionTranslation.java modified: hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/s3guard/AbstractS3GuardToolTestBase.java modified: hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/s3guard/ITestS3GuardToolDynamoDB.java modified: hadoop-tools/hadoop-aws/src/test/resources/core-site.xml Change-Id: Ic174f803e655af172d81c1274ed92b51bdceb384	2020-02-21 13:44:46 +00:00
lqjacklee	c77fc6971b	HADOOP-15961. S3A committers: make sure there's regular progress() calls. Contributed by lqjacklee. Change-Id: I13ca153e1e32b21dbe64d6fb25e260e0ff66154d	2020-02-17 22:06:34 +00:00
Steve Loughran	56dee66770	HADOOP-16823. Large DeleteObject requests are their own Thundering Herd. Contributed by Steve Loughran. During S3A rename() and delete() calls, the list of objects to delete is built up into batches of a thousand and then POSTed in a single large DeleteObjects request. But as the IO capacity allowed on an S3 partition may only be 3500 writes per second and each entry in that POST counts as a single write, one of those posts alone can trigger throttling on an already loaded S3 directory tree. That can trigger backoff and retry, with the same thousand entry post, and so recreate the exact same problem. Fixes * Page size for delete object requests is set in fs.s3a.bulk.delete.page.size; the default is 250. * The property fs.s3a.experimental.aws.s3.throttling (default=true) can be set to false to disable throttle retry logic in the AWS client SDK; it is then all handled in the S3A client. This gives more visibility into when operations are being throttled * Bulk delete throttling events are logged to the org.apache.hadoop.fs.s3a.throttled log at INFO; if this appears often then choose a smaller page size. * The metric "store_io_throttled" adds the entire count of delete requests when a single DeleteObjects request is throttled. * A new quantile, "store_io_throttle_rate" can track throttling load over time. * DynamoDB metastore throttle resilience issues have also been identified and fixed. Note: the fs.s3a.experimental.aws.s3.throttling flag does not apply to DDB IO precisely because there may still be lurking issues there and it is safest to rely on the DynamoDB client SDK. Change-Id: I00f85cdd94fc008864d060533f6bd4870263fd84	2020-02-13 19:09:49 +00:00

300 Commits