Contributed by Steve Loughran.
Not all stores do complete validation here; in particular the S3A
Connector does not: checking up the entire directory tree to see if a path matches
is a file significantly slows things down.
This check does take place in S3A mkdirs(), which walks backwards up the list of
parent paths until it finds a directory (success) or a file (failure).
In practice production applications invariably create destination directories
before writing 1+ file into them -restricting check purely to the mkdirs()
call deliver significant speed up while implicitly including the checks.
Change-Id: I2c9df748e92b5655232e7d888d896f1868806eb0
* Enhanced builder + FS spec
* s3a FS to use this to skip HEAD on open
* and to use version/etag when opening the file
works with S3AFileStatus FS and S3ALocatedFileStatus
Contributed by Steve Loughran.
This complements the StreamCapabilities Interface by allowing applications to probe for a specific path on a specific instance of a FileSystem client
to offer a specific capability.
This is intended to allow applications to determine
* Whether a method is implemented before calling it and dealing with UnsupportedOperationException.
* Whether a specific feature is believed to be available in the remote store.
As well as a common set of capabilities defined in CommonPathCapabilities,
file systems are free to add their own capabilities, prefixed with
fs. + schema + .
The plan is to identify and document more capabilities -and for file systems which add new features, for a declaration of the availability of the feature to always be available.
Note
* The remote store is not expected to be checked for the feature;
It is more a check of client API and the client's configuration/knowledge
of the state of the remote system.
* Permissions are not checked.
Change-Id: I80bfebe94f4a8bdad8f3ac055495735b824968f5
Contributed by Steve Loughran and Gabor Bota.
This
* Asks S3Guard to determine the empty directory status.
* Has S3A's root directory rm("/") command to always return false (as abfs does)
* Documents that object stores MAY do this
* Overloads ContractTestUtils.assertDeleted to let assertions declare that the source directory does not need to exist. This stops inconsistencies in directory listings failing a root test.
It avoids a recent regression (HADOOP-16279) where if there was a tombstone above the first element found in a directory listing, the directory would be considered empty, when in fact there were child entries. That could downgrade an rm(path, recursive) to a no-op, while also confusing rename(src, dest), as dest could be mistaken for an empty directory and so permit the copy above it, rather than reject it "destination path exists and is not empty".
Change-Id: I136a3d1a5a48a67e6155d790a40ff558d0d2c108
This is needed to fix up some confusion about caching of job.addCache() handling of S3A paths; all parent dirs -the files are downloaded by the NM without using the DTs of the user submitting the job. This means that when you submit jobs to an EC2 cluster with lower IAM permissions than the user, cached resources don't get downloaded and the job doesn't start.
Production code changes:
* S3AFileStatus Adds "true" to the superclass's encrypted flag during construction.
Tests
* Base AbstractContractOpenTest can control whether zero byte files created in tests are encrypted. Not done via an XML attribute, just a subclass point. Thoughts?
* Verify that the filecache considers paths to not have the permissions which trigger reduce-privilege downloads
* And extend ITestDelegatedMRJob to test a completely different bucket (open street map), to verify that cached resources do get their tokens picked up
Docs:
* Advise FS developers to say all files are encrypted. It's otherwise harmless and it'll stop other people seeing impossible to debug error messages on app launch.
Contributed by Steve Loughran.
Change-Id: Ifaae4c9d735ccc5eafeebd2584b65daf2d4e5da3
S3A to implement S3 Select through this API.
The new openFile() API is asynchronous, and implemented across FileSystem and FileContext.
The MapReduce V2 inputs are moved to this API, and you can actually set must/may
options to pass in.
This is more useful for setting things like s3a seek policy than for S3 select,
as the existing input format/record readers can't handle S3 select output where
the stream is shorter than the file length, and splitting plain text is suboptimal.
Future work is needed there.
In the meantime, any/all filesystem connectors are now free to add their own filesystem-specific
configuration parameters which can be set in jobs and used to set filesystem input stream
options (seek policy, retry, encryption secrets, etc).
Contributed by Steve Loughran