The S3A connector's rename() operation now raises FileNotFoundException if
the source doesn't exist; a FileAlreadyExistsException if the destination
exists and is unsuitable for the source file/directory.
When renaming to a path which does not exist, the connector no longer checks
for the destination parent directory existing -instead it simply verifies
that there is no file immediately above the destination path.
This is needed to avoid race conditions with delete() and rename()
calls working on adjacent subdirectories.
Contributed by Steve Loughran.
Adds an Abortable.abort() interface for streams to enable output streams to be terminated; this
is implemented by the S3A connector's output stream. It allows for commit protocols
to be implemented which commit/abort work by writing to the final destination and
using the abort() call to cancel any write which is not intended to be committed.
Consult the specification document for information about the interface and its use.
Contributed by Jungtaek Lim and Steve Loughran.
Change-Id: I7fcc25e9dd8c10ce6c29f383529f3a2642a201ae
This defines what output streams and especially those which implement
Syncable are meant to do, and documents where implementations (HDFS; S3)
don't. With tests.
The file:// FileSystem now supports Syncable if an application calls
FileSystem.setWriteChecksum(false) before creating a file -checksumming
and Syncable.hsync() are incompatible.
Contributed by Steve Loughran.
Change-Id: I892d768de6268f4dd6f175b3fe3b7e5bcaa91194
* core-default.xml updated so that fs.s3a.committer.magic.enabled = true
* CommitConstants updated to match
* All tests which previously enabled the magic committer now rely on
default settings. This helps make sure it is enabled.
* Docs cover the switch, mention its enabled and explain why you may
want to disable it.
Note: this doesn't switch to using the committer -it just enables the path
rewriting magic which it depends on.
Contributed by Steve Loughran.
This needs SPARK-33739 in the matching spark branch in order to work
Contributed by Steve Loughran.
Change-Id: I4fe75b057159e35aacc072da3cb7343467c0c3f1
Caused by HADOOP-16830 and HADOOP-17271.
Fixes tests which fail intermittently based on configs and
in the case of the HugeFile tests, bulk runs with existing
FS instances meant statistic probes sometimes ended up probing those
of a previous FS.
Contributed by Steve Loughran.
Change-Id: I65ba3f44444e59d298df25ac5c8dc5a8781dfb7d
This is the API and implementation classes of HADOOP-16830,
which allows callers to query IO object instances
(filesystems, streams, remote iterators, ...) and other classes
for statistics on their I/O Usage: operation count and min/max/mean
durations.
New Packages
org.apache.hadoop.fs.statistics.
Public API, including:
IOStatisticsSource
IOStatistics
IOStatisticsSnapshot (seralizable to java objects and json)
+helper classes for logging and integration
BufferedIOStatisticsInputStream
implements IOStatisticsSource and StreamCapabilities
BufferedIOStatisticsOutputStream
implements IOStatisticsSource, Syncable and StreamCapabilities
org.apache.hadoop.fs.statistics.impl
Implementation classes for internal use.
org.apache.hadoop.util.functional
functional programming support for RemoteIterators and
other operations which raise IOEs; all wrapper classes
implement and propagate IOStatisticsSource
Contributed by Steve Loughran.
Change-Id: If56e8db2981613ff689c39239135e44feb25f78e
See also [SPARK-33402]: Jobs launched in same second have duplicate MapReduce JobIDs
Contributed by Steve Loughran.
Change-Id: Iae65333cddc84692997aae5d902ad8765b45772a
This adds a semaphore to throttle the number of FileSystem instances which
can be created simultaneously, set in "fs.creation.parallel.count".
This is designed to reduce the impact of many threads in an application calling
FileSystem.get() on a filesystem which takes time to instantiate -for example
to an object where HTTPS connections are set up during initialization.
Many threads trying to do this may create spurious delays by conflicting
for access to synchronized blocks, when simply limiting the parallelism
diminishes the conflict, so speeds up all threads trying to access
the store.
The default value, 64, is larger than is likely to deliver any speedup -but
it does mean that there should be no adverse effects from the change.
If a service appears to be blocking on all threads initializing connections to
abfs, s3a or store, try a smaller (possibly significantly smaller) value.
Contributed by Steve Loughran.
Change-Id: I57161b026f28349e339dc8b9d74f6567a62ce196
This fixes the S3Guard/Directory Marker Retention integration so that when
fs.s3a.directory.marker.retention=keep, failures during multipart delete
are handled correctly, as are incremental deletes during
directory tree operations.
In both cases, when a directory marker with children is deleted from
S3, the directory entry in S3Guard is not deleted, because it is still
critical to representing the structure of the store.
Contributed by Steve Loughran.
Change-Id: I4ca133a23ea582cd42ec35dbf2dc85b286297d2f
This switches the SnappyCodec to use the java-snappy codec, rather than the native one.
To use the codec, snappy-java.jar (from org.xerial.snappy) needs to be on the classpath.
This comesin as an avro dependency, so it is already on the hadoop-common classpath,
as well as in hadoop-common/lib.
The version used is now managed in the hadoop-project POM; initially 1.1.7.7
Contributed by DB Tsai and Liang-Chi Hsieh
Change-Id: Id52a404a0005480e68917cd17f0a27b7744aea4e
When a filesystem is closed, the FileSystem log will, at debug level,
log the method calling close/closeAll.
At trace level: the full calling stack.
Contributed by Karen Coppage.
Change-Id: I1444f065c171fd31d42b497c92ba4517969f67f0
This changes directory tree deletion so that only files are incrementally deleted
from S3Guard after the objects are deleted; the directories are left alone
until metadataStore.deleteSubtree(path) is invoke.
This avoids directory tombstones being added above files/child directories,
which stop the treewalk and delete phase from working.
Also:
* Callback to delete objects splits files and dirs so that
any problems deleting the dirs doesn't trigger s3guard updates
* New statistic to measure #of objects deleted, alongside request count.
* Callback listFilesAndEmptyDirectories renamed listFilesAndDirectoryMarkers
to clarify behavior.
* Test enhancements to replicate the failure and verify the fix
Contributed by Steve Loughran
Change-Id: I0e6ea2c35e487267033b1664228c8837279a35c7
Contributed by Steve Loughran.
* Fixes AbstractContractSeekTest test to use readFully
* Doesn't do this to AbstractContractUnbufferTest test as it changes the test too much.
Instead just notes in the error that this may be transient
The issue is that read(buffer) doesn't guarantee that the buffer is filled, only that it will
read up to a point, and that may be just be the amount of data left in the TCP packet.
readFully corrects for this, but using it in the unbuffer test runs the risk that what
is tested for in terms of unbuffering doesn't actually get validated.
Change-Id: I046eadb69b80ba0aac468b354c82c4d510dc3699
This adds an option to disable "empty directory" marker deletion,
so avoid throttling and other scale problems.
This feature is *not* backwards compatible.
Consult the documentation and use with care.
Contributed by Steve Loughran.
Change-Id: I69a61e7584dc36e485d5e39ff25b1e3e559a1958
* Passing C/C++ standard flags -std is
not cross-compiler friendly as not all
compilers support all values.
* Thus, we need to make use of the
appropriate flags provided by CMake in
order to specify the C/C++ standards.
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
(cherry picked from commit 909f1e82d3)
* HDFS-15436. Default mount table name used by ViewFileSystem should be configurable
* Replace Constants.CONFIG_VIEWFS_DEFAULT_MOUNT_TABLE use in tests
* Address Uma's comments on PR#2100
* Sort lists in test to match without concern to order
* Address comments, fix checkstyle and fix failing tests
* Fix checkstyle
(cherry picked from commit bed0a3a374)
Contributed by Steve Loughran.
This is a successor to HADOOP-16346, which enabled the S3A connector
to load the native openssl SSL libraries for better HTTPS performance.
That patch required wildfly.jar to be on the classpath. This
update:
* Makes wildfly.jar optional except in the special case that
"fs.s3a.ssl.channel.mode" is set to "openssl"
* Retains the declaration of wildfly.jar as a compile-time
dependency in the hadoop-aws POM. This means that unless
explicitly excluded, applications importing that published
maven artifact will, transitively, add the specified
wildfly JAR into their classpath for compilation/testing/
distribution.
This is done for packaging and to offer that optional
speedup. It is not mandatory: applications importing
the hadoop-aws POM can exclude it if they choose.
Change-Id: I7ed3e5948d1e10ce21276b3508871709347e113d
Contributed by Steve Loughran.
This moves the new API of HDFS-13616 into a interface which is implemented by
HDFS RPC filesystem client (not WebHDFS or any other connector)
This new interface, BatchListingOperations, is in hadoop-common,
so applications do not need to be compiled with HDFS on the classpath.
They must cast the FS into the interface.
instanceof can probe the client for having the new interface -the patch
also adds a new path capability to probe for this.
The FileSystem implementation is cut; tests updated as appropriate.
All new interfaces/classes/constants are marked as @unstable.
Change-Id: I5623c51f2c75804f58f915dd7e60cb2cffdac681
Contributed by Steve Loughran.
Not all stores do complete validation here; in particular the S3A
Connector does not: checking up the entire directory tree to see if a path matches
is a file significantly slows things down.
This check does take place in S3A mkdirs(), which walks backwards up the list of
parent paths until it finds a directory (success) or a file (failure).
In practice production applications invariably create destination directories
before writing 1+ file into them -restricting check purely to the mkdirs()
call deliver significant speed up while implicitly including the checks.
Change-Id: I2c9df748e92b5655232e7d888d896f1868806eb0
Followup to HADOOP-16885: Encryption zone file copy failure leaks a temp file
Moving the delete() call broke a mocking test, which slipped through the review process.
Contributed by Steve Loughran.
Change-Id: Ia13faf0f4fffb1c99ddd616d823e4f4d0b7b0cbb
Contributed by Xiaoyu Yao.
Contains HDFS-14892. Close the output stream if createWrappedOutputStream() fails
Copying file through the FsShell command into an HDFS encryption zone where
the caller lacks permissions is leaks a temp ._COPYING file
and potentially a wrapped stream unclosed.
This is a convergence of a fix for S3 meeting an issue in HDFS.
S3: a HEAD against a file can cache a 404,
-you must not do any existence checks, including deleteOnExit(),
until the file is written.
Hence: HADOOP-16490, only register files for deletion the create worked
and the upload is not direct.
HDFS-14892. HDFS doesn't close wrapped streams when IOEs are raised on
create() failures. Which means that an entry is retained on the NN.
-you need to register a file with deleteOnExit() even if the file wasn't
created.
This patch:
* Moves the deleteOnExit to ensure the created file get deleted cleanly.
* Fixes HDFS to close the wrapped stream on failures.
Followup to the main openFile().withStatus() patch.
It turns out that this broke the hive builds, which
was not well appreciated.
This patch lists places to review in the hadoop codebase,
and external projects where changes are likely to cause problems.
Contributed by Steve Loughran
Change-Id: Ifac815c65b74d083cd277764b780ac2b5b0f6b36
Contributed by Steve Loughran.
During S3A rename() and delete() calls, the list of objects delete is
built up into batches of a thousand and then POSTed in a single large
DeleteObjects request.
But as the IO capacity allowed on an S3 partition may only be 3500 writes
per second *and* each entry in that POST counts as a single write, then
one of those posts alone can trigger throttling on an already loaded
S3 directory tree. Which can trigger backoff and retry, with the same
thousand entry post, and so recreate the exact same problem.
Fixes
* Page size for delete object requests is set in
fs.s3a.bulk.delete.page.size; the default is 250.
* The property fs.s3a.experimental.aws.s3.throttling (default=true)
can be set to false to disable throttle retry logic in the AWS
client SDK -it is all handled in the S3A client. This
gives more visibility in to when operations are being throttled
* Bulk delete throttling events are logged to the log
org.apache.hadoop.fs.s3a.throttled log at INFO; if this appears
often then choose a smaller page size.
* The metric "store_io_throttled" adds the entire count of delete
requests when a single DeleteObjects request is throttled.
* A new quantile, "store_io_throttle_rate" can track throttling
load over time.
* DynamoDB metastore throttle resilience issues have also been
identified and fixed. Note: the fs.s3a.experimental.aws.s3.throttling
flag does not apply to DDB IO precisely because there may still be
lurking issues there and it safest to rely on the DynamoDB client
SDK.
Change-Id: I00f85cdd94fc008864d060533f6bd4870263fd84
This is a regression caused by HADOOP-16759.
The test TestHarFileSystem uses introspection to verify that HarFileSystem
Does not implement methods to which there is a suitable implementation in
the base FileSystem class. Because of the way it checks this, refactoring
(protected) FileSystem methods in an IDE do not automatically change
the probes in TestHarFileSystem.
The changes in HADOOP-16759 did exactly that, and somehow managed
to get through the build/test process without this being noticed.
This patch fixes that failure.
Caused by and fixed by Steve Loughran.
Change-Id: If60d9c97058242871c02ad1addd424478f84f446
Signed-off-by: Mingliang Liu <liuml07@apache.org>
Contributed by Mustafa Iman.
This adds a new configuration option fs.s3a.connection.request.timeout
to declare the time out on HTTP requests to the AWS service;
0 means no timeout.
Measured in seconds; the usual time suffixes are all supported
Important: this is the maximum duration of any AWS service call,
including upload and copy operations. If non-zero, it must be larger
than the time to upload multi-megabyte blocks to S3 from the client,
and to rename many-GB files. Use with care.
Change-Id: I407745341068b702bf8f401fb96450a9f987c51c
* Enhanced builder + FS spec
* s3a FS to use this to skip HEAD on open
* and to use version/etag when opening the file
works with S3AFileStatus FS and S3ALocatedFileStatus
Introduces `openssl` as an option for `fs.s3a.ssl.channel.mode`.
The new option is documented and marked as experimental.
For details on how to use this, consult the peformance document
in the s3a documentation.
This patch is the successor to HADOOP-16050 "S3A SSL connections
should use OpenSSL" -which was reverted because of
incompatibilities between the wildfly OpenSSL client and the AWS
HTTPS servers (HADOOP-16347). With the Wildfly release moved up
to 1.0.7.Final (HADOOP-16405) everything should now work.
Related issues:
* HADOOP-15669. ABFS: Improve HTTPS Performance
* HADOOP-16050: S3A SSL connections should use OpenSSL
* HADOOP-16371: Option to disable GCM for SSL connections when running on Java 8
* HADOOP-16405: Upgrade Wildfly Openssl version to 1.0.7.Final
Contributed by Sahil Takiar
Change-Id: I80a4bc5051519f186b7383b2c1cea140be42444e
Contributed by David Mollitor.
This adds operations in FileUtil to write text to a file via
either a FileSystem or FileContext instance.
Change-Id: I5fe8fcf1bdbdbc734e137f922a75a822f2b88410
Contains:
HADOOP-16474. S3Guard ProgressiveRenameTracker to mark destination
dirirectory as authoritative on success.
HADOOP-16684. S3guard bucket info to list a bit more about
authoritative paths.
HADOOP-16722. S3GuardTool to support FilterFileSystem.
This patch improves the marking of newly created/import directory
trees in S3Guard DynamoDB tables as authoritative.
Specific changes:
* Renamed directories are marked as authoritative if the entire
operation succeeded (HADOOP-16474).
* When updating parent table entries as part of any table write,
there's no overwriting of their authoritative flag.
s3guard import changes:
* new -verbose flag to print out what is going on.
* The "s3guard import" command lets you declare that a directory tree
is to be marked as authoritative
hadoop s3guard import -authoritative -verbose s3a://bucket/path
When importing a listing and a file is found, the import tool queries
the metastore and only updates the entry if the file is different from
before, where different == new timestamp, etag, or length. S3Guard can get
timestamp differences due to clock skew in PUT operations.
As the recursive list performed by the import command doesn't retrieve the
versionID, the existing entry may in fact be more complete.
When updating an existing due to clock skew the existing version ID
is propagated to the new entry (note: the etags must match; this is needed
to deal with inconsistent listings).
There is a new s3guard command to audit a s3guard bucket/path's
authoritative state:
hadoop s3guard authoritative -check-config s3a://bucket/path
This is primarily for testing/auditing.
The s3guard bucket-info command also provides some more details on the
authoritative state of a store (HADOOP-16684).
Change-Id: I58001341c04f6f3597fcb4fcb1581ccefeb77d91
This hardens the wasb and abfs output streams' resilience to being invoked
in/after close().
wasb:
Explicity raise IOEs on operations invoked after close,
rather than implicitly raise NPEs.
This ensures that invocations which catch and swallow IOEs will perform as
expected.
abfs:
When rethrowing an IOException in the close() call, explicitly wrap it
with a new instance of the same subclass.
This is needed to handle failures in try-with-resources clauses, where
any exception in closed() is added as a suppressed exception to the one
thrown in the try {} clause
*and you cannot attach the same exception to itself*
Contributed by Steve Loughran.
Change-Id: Ic44b494ff5da332b47d6c198ceb67b965d34dd1b
MBeanInfoBuilder's get() method DEBUG logs all the MBeanAttributeInfo attributes that it gathered. This can have a high memory churn that can be easily avoided.
Contributed by Steve Loughran.
This FileSystem instantiation so if an IOException or RuntimeException is
raised in the invocation of FileSystem.initialize() then a best-effort
attempt is made to close the FS instance; exceptions raised that there
are swallowed.
The S3AFileSystem is also modified to do its own cleanup if an
IOException is raised during its initialize() process, it being the
FS we know has the "potential" to leak threads, especially in
extension points (e.g AWS Authenticators) which spawn threads.
Change-Id: Ib84073a606c9d53bf53cbfca4629876a03894f04
* HADOOP-16579 - Upgrade to Apache Curator 4.2.0 and ZooKeeper 3.5.5
- Add a static initializer for the unit tests using ZooKeeper to enable
the four-letter-words diagnostic telnet commands. (this is an interface
that become disabled by default, so to keep the ZooKeeper 3.4.x behavior
we enabled it for the tests)
- Also fix ZKFailoverController to look for relevant fail-over ActiveAttempt
records. The new ZooKeeper seems to respond quicker during the fail-over
tests than the ZooKeeper, so we made sure to catch all the relevant records
by adding a new parameter to ZKFailoverontroller.waitForActiveAttempt().
Co-authored-by: Norbert Kalmár <nkalmar@cloudera.com>
Contributed by Steve Loughran.
Includes
-S3A glob scans don't bother trying to resolve symlinks
-stack traces don't get lost in getFileStatuses() when exceptions are wrapped
-debug level logging of what is up in Globber
-Contains HADOOP-13373. Add S3A implementation of FSMainOperationsBaseTest.
-ITestRestrictedReadAccess tests incomplete read access to files.
This adds a builder API for constructing globbers which other stores can use
so that they too can skip symlink resolution when not needed.
Change-Id: I23bcdb2783d6bd77cf168fdc165b1b4b334d91c7
Contributed by Steve Loughran.
This complements the StreamCapabilities Interface by allowing applications to probe for a specific path on a specific instance of a FileSystem client
to offer a specific capability.
This is intended to allow applications to determine
* Whether a method is implemented before calling it and dealing with UnsupportedOperationException.
* Whether a specific feature is believed to be available in the remote store.
As well as a common set of capabilities defined in CommonPathCapabilities,
file systems are free to add their own capabilities, prefixed with
fs. + schema + .
The plan is to identify and document more capabilities -and for file systems which add new features, for a declaration of the availability of the feature to always be available.
Note
* The remote store is not expected to be checked for the feature;
It is more a check of client API and the client's configuration/knowledge
of the state of the remote system.
* Permissions are not checked.
Change-Id: I80bfebe94f4a8bdad8f3ac055495735b824968f5