This moves off use of the purged s3a://landsat-pds bucket, so fixing tests
which had started failing.
* Adds a new class, PublicDatasetTestUtils to manage the use of public datasets.
* The new test bucket s3a://usgs-landsat/ is requester pays, so depends upon
HADOOP-14661.
Consult the updated test documentation when running against other S3 stores.
Contributed by Daniel Carl Jones
Change-Id: Ie8585e4d9b67667f8cb80b2970225d79a4f8d257
Stops the abfs connector warning if openFile().withFileStatus()
is invoked with a FileStatus is not an abfs VersionedFileStatus.
Contributed by Steve Loughran.
Change-Id: I85076b365eb30aaef2ed35139fa8714efd4d048e
S3A input stream support for the few fs.option.openfile settings.
As well as supporting the read policy option and values,
if the file length is declared in fs.option.openfile.length
then no HEAD request will be issued when opening a file.
This can cut a few tens of milliseconds off the operation.
The patch adds a new openfile parameter/FS configuration option
fs.s3a.input.async.drain.threshold (default: 16000).
It declares the number of bytes remaining in the http input stream
above which any operation to read and discard the rest of the stream,
"draining", is executed asynchronously.
This asynchronous draining offers some performance benefit on seek-heavy
file IO.
Contributed by Steve Loughran.
Change-Id: I9b0626bbe635e9fd97ac0f463f5e7167e0111e39
These changes ensure that sequential files are opened with the
right read policy, and split start/end is passed in.
As well as offering opportunities for filesystem clients to
choose fetch/cache/seek policies, the settings ensure that
processing text files on an s3 bucket where the default policy
is "random" will still be processed efficiently.
This commit depends on the associated hadoop-common patch,
which must be committed first.
Contributed by Steve Loughran.
Change-Id: Ic6713fd752441cf42ebe8739d05c2293a5db9f94
Adds the option fs.s3a.requester.pays.enabled, which, if set to true, allows
the client to access S3 buckets where the requester is billed for the IO.
Contributed by Daniel Carl Jones
Change-Id: I51f64d0f9b3be3c4ec493bcf91927fca3b20407a
The option fs.s3a.object.content.encoding declares the content encoding to be set on files when they are written; this is served up in the "Content-Encoding" HTTP header when reading objects back in.
This is useful for people loading the data into other tools in the AWS ecosystem which don't use file extensions to infer compression type (e.g. serving compressed files from S3 or importing into RDS)
Contributed by: Holden Karau
Change-Id: Ice0da75b516370f51f79e45f391d46c5c7aa4ce4
Optimize the scan for s3 by performing a deep tree listing,
inferring directory counts from the paths returned.
Contributed by Ahmar Suhail.
Change-Id: I26ffa8c6f65fd11c68a88d6e2243b0eac6ffd024
Follow-on patch to MAPREDUCE-7341, adding ABFS support and tests
* resilient rename
* tests for job commit through the manifest committer.
contains
- HADOOP-17976. ABFS etag extraction inconsistent between LIST and HEAD calls
- HADOOP-16204. ABFS tests to include terasort
Contributed by Steve Loughran.
Change-Id: I0a7d4043bdf19bcb00c033fc389730109b93b77f
Multi object delete of size more than 1000 is not supported by S3 and
fails with MalformedXML error. So implementing paging of requests to
reduce the number of keys in a single request. Page size can be configured
using "fs.s3a.bulk.delete.page.size"
Contributed By: Mukund Thakur
Adds a new map type WeakReferenceMap, which stores weak
references to values, and a WeakReferenceThreadMap subclass
to more closely resemble a thread local type, as it is a
map of threadId to value.
Construct it with a factory method and optional callback
for notification on loss and regeneration.
WeakReferenceThreadMap<WrappingAuditSpan> activeSpan =
new WeakReferenceThreadMap<>(
(k) -> getUnbondedSpan(),
this::noteSpanReferenceLost);
This is used in ActiveAuditManagerS3A for span tracking.
Relates to
* HADOOP-17511. Add an Audit plugin point for S3A
* HADOOP-18094. Disable S3A auditing by default.
Contributed by Steve Loughran.
Change-Id: Ibf7bb082fd47298f7ebf46d92f56e80ca9b2aaf8
Part of HADOOP-17198. Support S3 Access Points.
HADOOP-18068. "upgrade AWS SDK to 1.12.132" broke the access point endpoint
translation.
Correct endpoints should start with "s3-accesspoint.", after SDK upgrade they start with
"s3.accesspoint-" which messes up tests + region detection by the SDK.
Contributed by Bogdan Stolojan
Change-Id: I0c0181628ab803afc39036003777eaec79aa378c
Add support for S3 Access Points. This provides extra security as it
ensures applications are not working with buckets belong to third parties.
To bind a bucket to an access point, set the access point (ap) ARN,
which must be done for each specific bucket, using the pattern
fs.s3a.bucket.$BUCKET.accesspoint.arn = ARN
* The global/bucket option `fs.s3a.accesspoint.required` to
mandate that buckets must declare their access point.
* This is not compatible with S3Guard.
Consult the documentation for further details.
Contributed by Bogdan Stolojan
(this commit contains the changes to TestArnResource from HADOOP-18068,
"upgrade AWS SDK to 1.12.132" so that it works with the later SDK.)
Change-Id: I3fac213e52ca6ec1c813effb8496c353964b8e1b
See HADOOP-18091. S3A auditing leaks memory through ThreadLocal references
* Adds a new option fs.s3a.audit.enabled to controls whether or not auditing
is enabled. This is false by default.
* When false, the S3A auditing manager is NoopAuditManagerS3A,
which was formerly only used for unit tests and
during filsystem initialization.
* When true, ActiveAuditManagerS3A is used for managing auditing,
allowing auditing events to be reported.
* updates documentation and tests.
This patch does not fix the underlying leak. When auditing is enabled,
long-lived threads will retain references to the audit managers
of S3A filesystem instances which have already been closed.
Contributed by Steve Loughran.
Change-Id: I671e594cd59e8ca77a1f65be791ad0ae9530b8d9
Completely removes S3Guard support from the S3A codebase.
If the connector is configured to use any metastore other than
the null and local stores (i.e. DynamoDB is selected) the s3a client
will raise an exception and refuse to initialize.
This is to ensure that there is no mix of S3Guard enabled and disabled
deployments with the same configuration but different hadoop releases
-it must be turned off completely.
The "hadoop s3guard" command has been retained -but the supported
subcommands have been reduced to those which are not purely S3Guard
related: "bucket-info" and "uploads".
This is major change in terms of the number of files
changed; before cherry picking subsequent s3a patches into
older releases, this patch will probably need backporting
first.
Goodbye S3Guard, your work is done. Time to die.
Contributed by Steve Loughran.
With this update, the versions of key shaded dependencies are
jackson 2.12.3
httpclient 4.5.13
This backport patch does not include the TestArn changes needed
for the test to work with this version of the SDK; it is only
to be applied to branches without HADOOP-17198. "Support S3 Access Points".
If that patch is backported later, that test suite MUST be
updated to the latest version.
Contributed by Steve Loughran
Change-Id: I8d2b71781ee8472b16469531f9cd0de32dd3356f
Cut modtime-based rename recovery as object modification time
is not updated during rename operation.
Applications will have to use etag API of HADOOP-17979
and implement it themselves.
Why not do the HEAD and etag recovery in ABFS client?
Cuts the IO capacity in half so kills job commit performance.
The manifest committer of MAPREDUCE-7341 will do this recovery
and act as the reference implementation of the algorithm.
Contributed by: Steve Loughran
Change-Id: I810054c9fd05041dac552f13d31fb15d7524721b
Addresses transient failures in the following test classes:
* ITestAbfsStreamStatistics: Uses a filesystem level static instance to record read/write statistics, which also tracks these operations in other tests running in parallel. Marked for sequential-only run to avoid transient failure
* ITestAbfsRestOperationException: The use of a static member to track retry count causes transient failures when two tests of this class happen to run together. Switch to non-static variable for assertions on retry count
closes#3341
Contributed by Sumangala Patki
Change-Id: Ied4dec35c81e94efe5f999acae4bb8fde278202e
This switches the default behavior of S3A output streams
to warning that Syncable.hsync() or hflush() have been
called; it's not considered an error unless the defaults
are overridden.
This avoids breaking applications which call the APIs,
at the risk of people trying to use S3 as a safe store
of streamed data (HBase WALs, audit logs etc).
Contributed by Steve Loughran.
Change-Id: I0a02ec1e622343619f147f94158c18928a73a885
The ordering of the resolution of new and deprecated s3a encryption options
& secrets is the same when JCEKS and other hadoop credentials stores are used
to store them as when they are in XML files: per-bucket settings always take
priority over global values, even when the bucket-level options use the
old option names.
Contributed by Mehakmeet Singh and Steve Loughran
Change-Id: I871672071efa2eb6b600cb2658fceeef57f658a3