This work
* Defines the behavior of FileSystem.copyFromLocal in filesystem.md
* Implements a high performance implementation of copyFromLocalOperation
for S3
* Adds a contract test for the operation: AbstractContractCopyFromLocalTest
* Implements the contract tests for Local and S3A FileSystems
Contributed by: Bogdan Stolojan
The S3A connector's rename() operation now raises FileNotFoundException if
the source doesn't exist; a FileAlreadyExistsException if the destination
exists and is unsuitable for the source file/directory.
When renaming to a path which does not exist, the connector no longer checks
for the destination parent directory existing -instead it simply verifies
that there is no file immediately above the destination path.
This is needed to avoid race conditions with delete() and rename()
calls working on adjacent subdirectories.
Contributed by Steve Loughran.
Adds an Abortable.abort() interface for streams to enable output streams to be terminated; this
is implemented by the S3A connector's output stream. It allows for commit protocols
to be implemented which commit/abort work by writing to the final destination and
using the abort() call to cancel any write which is not intended to be committed.
Consult the specification document for information about the interface and its use.
Contributed by Jungtaek Lim and Steve Loughran.
This defines what output streams and especially those which implement
Syncable are meant to do, and documents where implementations (HDFS; S3)
don't. With tests.
The file:// FileSystem now supports Syncable if an application calls
FileSystem.setWriteChecksum(false) before creating a file -checksumming
and Syncable.hsync() are incompatible.
Contributed by Steve Loughran.
This is the API and implementation classes of HADOOP-16830,
which allows callers to query IO object instances
(filesystems, streams, remote iterators, ...) and other classes
for statistics on their I/O Usage: operation count and min/max/mean
durations.
New Packages
org.apache.hadoop.fs.statistics.
Public API, including:
IOStatisticsSource
IOStatistics
IOStatisticsSnapshot (seralizable to java objects and json)
+helper classes for logging and integration
BufferedIOStatisticsInputStream
implements IOStatisticsSource and StreamCapabilities
BufferedIOStatisticsOutputStream
implements IOStatisticsSource, Syncable and StreamCapabilities
org.apache.hadoop.fs.statistics.impl
Implementation classes for internal use.
org.apache.hadoop.util.functional
functional programming support for RemoteIterators and
other operations which raise IOEs; all wrapper classes
implement and propagate IOStatisticsSource
Contributed by Steve Loughran.
This switches the SnappyCodec to use the java-snappy codec, rather than the native one.
To use the codec, snappy-java.jar (from org.xerial.snappy) needs to be on the classpath.
This comesin as an avro dependency, so it is already on the hadoop-common classpath,
as well as in hadoop-common/lib.
The version used is now managed in the hadoop-project POM; initially 1.1.7.7
Contributed by DB Tsai and Liang-Chi Hsieh
Contributed by Steve Loughran.
Not all stores do complete validation here; in particular the S3A
Connector does not: checking up the entire directory tree to see if a path matches
is a file significantly slows things down.
This check does take place in S3A mkdirs(), which walks backwards up the list of
parent paths until it finds a directory (success) or a file (failure).
In practice production applications invariably create destination directories
before writing 1+ file into them -restricting check purely to the mkdirs()
call deliver significant speed up while implicitly including the checks.
Change-Id: I2c9df748e92b5655232e7d888d896f1868806eb0
* Enhanced builder + FS spec
* s3a FS to use this to skip HEAD on open
* and to use version/etag when opening the file
works with S3AFileStatus FS and S3ALocatedFileStatus
Contributed by Steve Loughran.
This complements the StreamCapabilities Interface by allowing applications to probe for a specific path on a specific instance of a FileSystem client
to offer a specific capability.
This is intended to allow applications to determine
* Whether a method is implemented before calling it and dealing with UnsupportedOperationException.
* Whether a specific feature is believed to be available in the remote store.
As well as a common set of capabilities defined in CommonPathCapabilities,
file systems are free to add their own capabilities, prefixed with
fs. + schema + .
The plan is to identify and document more capabilities -and for file systems which add new features, for a declaration of the availability of the feature to always be available.
Note
* The remote store is not expected to be checked for the feature;
It is more a check of client API and the client's configuration/knowledge
of the state of the remote system.
* Permissions are not checked.
Change-Id: I80bfebe94f4a8bdad8f3ac055495735b824968f5
Contributed by Steve Loughran and Gabor Bota.
This
* Asks S3Guard to determine the empty directory status.
* Has S3A's root directory rm("/") command to always return false (as abfs does)
* Documents that object stores MAY do this
* Overloads ContractTestUtils.assertDeleted to let assertions declare that the source directory does not need to exist. This stops inconsistencies in directory listings failing a root test.
It avoids a recent regression (HADOOP-16279) where if there was a tombstone above the first element found in a directory listing, the directory would be considered empty, when in fact there were child entries. That could downgrade an rm(path, recursive) to a no-op, while also confusing rename(src, dest), as dest could be mistaken for an empty directory and so permit the copy above it, rather than reject it "destination path exists and is not empty".
Change-Id: I136a3d1a5a48a67e6155d790a40ff558d0d2c108
This is needed to fix up some confusion about caching of job.addCache() handling of S3A paths; all parent dirs -the files are downloaded by the NM without using the DTs of the user submitting the job. This means that when you submit jobs to an EC2 cluster with lower IAM permissions than the user, cached resources don't get downloaded and the job doesn't start.
Production code changes:
* S3AFileStatus Adds "true" to the superclass's encrypted flag during construction.
Tests
* Base AbstractContractOpenTest can control whether zero byte files created in tests are encrypted. Not done via an XML attribute, just a subclass point. Thoughts?
* Verify that the filecache considers paths to not have the permissions which trigger reduce-privilege downloads
* And extend ITestDelegatedMRJob to test a completely different bucket (open street map), to verify that cached resources do get their tokens picked up
Docs:
* Advise FS developers to say all files are encrypted. It's otherwise harmless and it'll stop other people seeing impossible to debug error messages on app launch.
Contributed by Steve Loughran.
Change-Id: Ifaae4c9d735ccc5eafeebd2584b65daf2d4e5da3
S3A to implement S3 Select through this API.
The new openFile() API is asynchronous, and implemented across FileSystem and FileContext.
The MapReduce V2 inputs are moved to this API, and you can actually set must/may
options to pass in.
This is more useful for setting things like s3a seek policy than for S3 select,
as the existing input format/record readers can't handle S3 select output where
the stream is shorter than the file length, and splitting plain text is suboptimal.
Future work is needed there.
In the meantime, any/all filesystem connectors are now free to add their own filesystem-specific
configuration parameters which can be set in jobs and used to set filesystem input stream
options (seek policy, retry, encryption secrets, etc).
Contributed by Steve Loughran