HADOOP-17480. Document that AWS S3 is consistent and that S3Guard is not needed (#2636)
Contributed by Steve Loughran. Change-Id: I775e3ee7b60665240ec621859c337b053f747a49
parent 886b245de6
commit bd85f6acea
@ -20,6 +20,27 @@ This document covers the architecture and implementation details of the S3A comm
|
||||
|
||||
For information on using the committers, see [the S3A Committers](./committer.html).
|
||||
|
||||
### January 2021 Update
|
||||
|
||||
Now that S3 is fully consistent, problems related to inconsistent
|
||||
directory listings have gone. However, the rename problem still exists: committing
|
||||
work by renaming directories is unsafe as well as horribly slow.
|
||||
|
||||
This architecture document, and the committers, were written at a time
|
||||
when S3 was inconsistent. The two committers addressed this problem differently:
|
||||
|
||||
* Staging Committer: rely on a cluster HDFS filesystem for safely propagating
|
||||
the lists of files to commit from workers to the job manager/driver.
|
||||
* Magic Committer: require S3Guard to offer consistent directory listings
|
||||
on the object store.
|
||||
|
||||
With consistent S3, the Magic Committer can be safely used with any S3 bucket.
|
||||
The choice of which to use, then, is a matter for experimentation.
|
||||
|
||||
This architecture document was written in 2017, a time when S3 was only
|
||||
consistent when an extra consistency layer such as S3Guard was used.
|
||||
The document indicates where requirements/constraints which existed then
|
||||
are now obsolete.
|
||||
|
||||
## Problem: Efficient, reliable commits of work to consistent S3 buckets
|
||||
|
||||
@ -49,10 +70,10 @@ can be executed server-side, but as it does not complete until the in-cluster
|
||||
copy has completed, it takes time proportional to the amount of data.
|
||||
|
||||
The rename overhead is the most visible issue, but it is not the most dangerous.
|
||||
That is the fact that path listings have no consistency guarantees, and may
|
||||
lag the addition or deletion of files.
|
||||
If files are not listed, the commit operation will *not* copy them, and
|
||||
so they will not appear in the final output.
|
||||
That is the fact that until late 2020, path listings had no consistency guarantees,
|
||||
and may have lagged the addition or deletion of files.
|
||||
If files were not listed, the commit operation would *not* copy them, and
|
||||
so they would not appear in the final output.
|
||||
|
||||
The solution to this problem is closely coupled to the S3 protocol itself:
|
||||
delayed completion of multi-part PUT operations
|
||||
@ -828,6 +849,8 @@ commit sequence in `Task.done()`, when `talkToAMTGetPermissionToCommit()`
|
||||
|
||||
# Requirements of an S3A Committer
|
||||
|
||||
The design requirements of the S3A committer were:
|
||||
|
||||
1. Support an eventually consistent S3 object store as a reliable direct
|
||||
destination of work through the S3A filesystem client.
|
||||
1. Efficient: implies no rename, and a minimal amount of delay in the job driver's
|
||||
@ -841,6 +864,7 @@ the job, and any previous incompleted jobs.
|
||||
1. Security: not to permit privilege escalation from other users with
|
||||
write access to the same file system(s).
|
||||
|
||||
|
||||
|
||||
## Features of S3 and the S3A Client
|
||||
|
||||
@ -852,8 +876,8 @@ MR committer algorithms have significant performance problems.
|
||||
|
||||
1. Single-object renames are implemented as a copy and delete sequence.
|
||||
1. COPY is atomic, but overwrites cannot be prevented.
|
||||
1. Amazon S3 is eventually consistent on listings, deletes and updates.
|
||||
1. Amazon S3 has create consistency, however, the negative response of a HEAD/GET
|
||||
1. [Obsolete] Amazon S3 is eventually consistent on listings, deletes and updates.
|
||||
1. [Obsolete] Amazon S3 has create consistency, however, the negative response of a HEAD/GET
|
||||
performed on a path before an object was created can be cached, unintentionally
|
||||
creating a create inconsistency. The S3A client library does perform such a check,
|
||||
on `create()` and `rename()` to check the state of the destination path, and
|
||||
@ -872,11 +896,12 @@ data, with the `S3ABlockOutputStream` of HADOOP-13560 uploading written data
|
||||
as parts of a multipart PUT once the threshold set in the configuration
|
||||
parameter `fs.s3a.multipart.size` (default: 100MB).
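
As a hedged illustration of the setting named above, the part size could be declared in `core-site.xml` roughly as follows. The value mirrors the 100MB default quoted in the text; the `100M` suffix notation is an assumption about the Hadoop release in use, and some releases may expect the size in bytes instead.

```xml
<!-- Illustrative sketch only; not taken from the original document. -->
<property>
  <name>fs.s3a.multipart.size</name>
  <!-- Assumed notation: 100M = 100 megabytes, the default cited in the text. -->
  <value>100M</value>
</property>
```

Smaller parts start uploading sooner but produce more requests per file; the right trade-off depends on the workload.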
|
||||
|
||||
[S3Guard](./s3guard.html) adds an option of consistent view of the filesystem
|
||||
[S3Guard](./s3guard.html) added an option of consistent view of the filesystem
|
||||
to all processes using the shared DynamoDB table as the authoritative store of
|
||||
metadata. Some S3-compatible object stores are fully consistent; the
|
||||
proposed algorithm is designed to work with such object stores without the
|
||||
need for any DynamoDB tables.
|
||||
metadata.
|
||||
The proposed algorithm was designed to work with consistent object stores without the
need for any DynamoDB tables. Since AWS S3 became consistent in 2020, this
means that the committers will work directly with the store.
|
||||
|
||||
## Related work: Spark's `DirectOutputCommitter`
|
||||
|
||||
@ -1246,8 +1271,8 @@ for parallel committing of work, including all the error handling based on
|
||||
the Netflix experience.
|
||||
|
||||
It differs in that it directly streams data to S3 (there is no staging),
|
||||
and it also stores the lists of pending commits in S3 too. That mandates
|
||||
consistent metadata on S3, which S3Guard provides.
|
||||
and it also stores the lists of pending commits in S3 too. It
|
||||
requires a consistent S3 store.
|
||||
|
||||
|
||||
### Core concept: A new/modified output stream for delayed PUT commits
|
||||
@ -1480,7 +1505,7 @@ The time to commit a job will be `O(files/threads)`
|
||||
Every `.pendingset` file in the job attempt directory must be loaded, and a PUT
|
||||
request issued for every incomplete upload listed in the files.
|
||||
|
||||
Note that it is the bulk listing of all children which is where full consistency
|
||||
[Obsolete] Note that it is the bulk listing of all children which is where full consistency
|
||||
is required. If instead, the list of files to commit could be returned from
|
||||
tasks to the job committer, as the Spark commit protocol allows, it would be
|
||||
possible to commit data to an inconsistent object store.
|
||||
@ -1525,7 +1550,7 @@ commit algorithms.
|
||||
1. It is possible to create more than one client writing to the
|
||||
same destination file within the same S3A client/task, either sequentially or in parallel.
|
||||
|
||||
1. Even with a consistent metadata store, if a job overwrites existing
|
||||
1. [Obsolete] Even with a consistent metadata store, if a job overwrites existing
|
||||
files, then old data may still be visible to clients reading the data, until
|
||||
the update has propagated to all replicas of the data.
|
||||
|
||||
@ -1538,7 +1563,7 @@ all files in the destination directory which where not being overwritten.
|
||||
for any purpose other than for the storage of pending commit data.
|
||||
|
||||
1. Unless extra code is added to every FS operation, it will still be possible
|
||||
to manipulate files under the `__magic` tree. That's not bad, it just potentially
|
||||
to manipulate files under the `__magic` tree. That's not bad, just potentially
|
||||
confusing.
|
||||
|
||||
1. As written data is not materialized until the commit, it will not be possible
|
||||
@ -1693,14 +1718,6 @@ base for relative paths created underneath it.
|
||||
|
||||
The committers can only be tested against an S3-compatible object store.
|
||||
|
||||
Although a consistent object store is a requirement for a production deployment
|
||||
of the magic committer an inconsistent one has appeared to work during testing, simply by
|
||||
adding some delays to the operations: a task commit does not succeed until
|
||||
all the objects which it has PUT are visible in the LIST operation. Assuming
|
||||
that further listings from the same process also show the objects, the job
|
||||
committer will be able to list and commit the uploads.
|
||||
|
||||
|
||||
The committers have some unit tests, and integration tests based on
|
||||
the protocol integration test lifted from `org.apache.hadoop.mapreduce.lib.output.TestFileOutputCommitter`
|
||||
to test various state transitions of the commit mechanism has been extended
|
||||
@ -1766,7 +1783,8 @@ tree.
|
||||
Alternatively, the fact that Spark tasks provide data to the job committer on their
|
||||
completion means that a list of pending PUT commands could be built up, with the commit
|
||||
operations being executed by an S3A-specific implementation of the `FileCommitProtocol`.
|
||||
As noted earlier, this may permit the requirement for a consistent list operation
|
||||
|
||||
[Obsolete] As noted earlier, this may permit the requirement for a consistent list operation
|
||||
to be bypassed. It would still be important to list what was being written, as
|
||||
it is needed to aid aborting work in failed tasks, but the list of files
|
||||
created by successful tasks could be passed directly from the task to committer,
|
||||
@ -1890,9 +1908,6 @@ bandwidth and the data upload bandwidth.
|
||||
|
||||
No use is made of the cluster filesystem; there are no risks there.
|
||||
|
||||
A consistent store is required, which, for Amazon's infrastructure, means S3Guard.
|
||||
This is covered below.
|
||||
|
||||
A malicious user with write access to the `__magic` directory could manipulate
|
||||
or delete the metadata of pending uploads, or potentially inject new work into
|
||||
the commit. Having access to the `__magic` directory implies write access
|
||||
@ -1900,13 +1915,12 @@ to the parent destination directory: a malicious user could just as easily
|
||||
manipulate the final output, without needing to attack the committer's intermediate
|
||||
files.
|
||||
|
||||
|
||||
### Security Risks of all committers
|
||||
|
||||
|
||||
#### Visibility
|
||||
|
||||
* If S3Guard is used for storing metadata, then the metadata is visible to
|
||||
[Obsolete] If S3Guard is used for storing metadata, then the metadata is visible to
|
||||
all users with read access. A malicious user with write access could delete
|
||||
entries of newly generated files, so they would not be visible.
|
||||
|
||||
@ -1941,7 +1955,7 @@ any of the text fields, script which could then be executed in some XSS
|
||||
attack. We may wish to consider sanitizing this data on load.
|
||||
|
||||
* Paths in tampered data could be modified in an attempt to commit an upload across
|
||||
an existing file, or the MPU ID alterated to prematurely commit a different upload.
|
||||
an existing file, or the MPU ID altered to prematurely commit a different upload.
|
||||
These attempts are not going to succeed, because the destination
|
||||
path of the upload is declared on the initial POST to initiate the MPU, and
|
||||
operations associated with the MPU must also declare the path: if the path and
|
||||
|
@ -26,10 +26,30 @@ and reliable commitment of output to S3.
|
||||
For details on their internal design, see
|
||||
[S3A Committers: Architecture and Implementation](./committer_architecture.html).
|
||||
|
||||
### January 2021 Update
|
||||
|
||||
Now that S3 is fully consistent, problems related to inconsistent directory
|
||||
listings have gone. However, the rename problem still exists: committing work by
|
||||
renaming directories is unsafe as well as horribly slow.
|
||||
|
||||
This architecture document, and the committers, were written at a time when S3
|
||||
was inconsistent. The two committers addressed this problem differently:
|
||||
|
||||
* Staging Committer: rely on a cluster HDFS filesystem for safely propagating
|
||||
the lists of files to commit from workers to the job manager/driver.
|
||||
* Magic Committer: require S3Guard to offer consistent directory listings on the
|
||||
object store.
|
||||
|
||||
With consistent S3, the Magic Committer can be safely used with any S3 bucket.
|
||||
The choice of which to use, then, is a matter for experimentation.
|
||||
|
||||
This document was written in 2017, a time when S3 was only
|
||||
consistent when an extra consistency layer such as S3Guard was used. The
|
||||
document indicates where requirements/constraints which existed then are now
|
||||
obsolete.
|
||||
|
||||
## Introduction: The Commit Problem
|
||||
|
||||
|
||||
Apache Hadoop MapReduce (and behind the scenes, Apache Spark) often write
|
||||
the output of their work to filesystems
|
||||
|
||||
@ -50,21 +70,18 @@ or it is at the destination, -in which case the rename actually succeeded.
|
||||
|
||||
**The S3 object store and the `s3a://` filesystem client cannot meet these requirements.**
|
||||
|
||||
1. Amazon S3 has inconsistent directory listings unless S3Guard is enabled.
|
||||
1. The S3A mimics `rename()` by copying files and then deleting the originals.
|
||||
Although S3 is (now) consistent, the S3A client still mimics `rename()`
|
||||
by copying files and then deleting the originals.
|
||||
This can fail partway through, and there is nothing to prevent any other process
|
||||
in the cluster attempting a rename at the same time.
|
||||
|
||||
As a result,
|
||||
|
||||
* Files may not be listed, hence not renamed into place.
|
||||
* Deleted files may still be discovered, confusing the rename process to the point
|
||||
of failure.
|
||||
* If a rename fails, the data is left in an unknown state.
|
||||
* If more than one process attempts to commit work simultaneously, the output
|
||||
directory may contain the results of both processes: it is no longer an exclusive
|
||||
operation.
|
||||
* While S3Guard may deliver the listing consistency, commit time is still
|
||||
* Commit time is still
|
||||
proportional to the amount of data created. It still can't handle task failure.
|
||||
|
||||
**Using the "classic" `FileOutputCommitter` to commit work to Amazon S3 risks
|
||||
@ -163,10 +180,8 @@ and restarting the job.
|
||||
whose output is in the job attempt directory, *and only rerunning all uncommitted tasks*.
|
||||
|
||||
|
||||
None of this algorithm works safely or swiftly when working with "raw" AWS S3 storage:
|
||||
* Directory listing can be inconsistent: the tasks and jobs may not list all work to
|
||||
be committed.
|
||||
* Renames go from being fast, atomic operations to slow operations which can fail partway through.
|
||||
This algorithm does not work safely or swiftly with AWS S3 storage because
renames go from being fast, atomic operations to slow operations which can fail partway through.
|
||||
|
||||
This then is the problem which the S3A committers address:
|
||||
|
||||
@ -341,9 +356,7 @@ task commit.
|
||||
|
||||
However, it has extra requirements of the filesystem
|
||||
|
||||
1. It requires a consistent object store, which for Amazon S3,
|
||||
means that [S3Guard](./s3guard.html) must be enabled. For third-party stores,
|
||||
consult the documentation.
|
||||
1. [Obsolete] It requires a consistent object store.
|
||||
1. The S3A client must be configured to recognize interactions
|
||||
with the magic directories and treat them specially.
|
||||
|
||||
@ -358,14 +371,15 @@ it the least mature of the committers.
|
||||
Partitioned Committer. Make sure you have enough hard disk capacity for all staged data.
|
||||
Do not use it in other situations.
|
||||
|
||||
1. If you know that your object store is consistent, or that the processes
|
||||
writing data use S3Guard, use the Magic Committer for higher performance
|
||||
writing of large amounts of data.
|
||||
1. If you do not have a shared cluster store: use the Magic Committer.
|
||||
|
||||
1. If you are writing large amounts of data: use the Magic Committer.
|
||||
|
||||
1. Otherwise: use the directory committer, making sure you have enough
|
||||
hard disk capacity for all staged data.
|
||||
|
||||
Put differently: start with the Directory Committer.
|
||||
Now that S3 is consistent, there are fewer reasons not to use the Magic Committer.
|
||||
Experiment with both to see which works best for your work.
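
As a rough sketch of what such an experiment might look like in configuration, the fragment below selects the Magic Committer. The `fs.s3a.committer.magic.enabled` option appears elsewhere in this document; `fs.s3a.committer.name` and its values (`magic`, `directory`) are not shown in this excerpt and are assumed from the wider S3A committer documentation, so check them against the release in use.

```xml
<!-- Illustrative sketch only; option values should be verified for your release. -->
<property>
  <!-- Assumed option: names the committer to use for s3a:// destinations. -->
  <name>fs.s3a.committer.name</name>
  <value>magic</value>
</property>

<property>
  <!-- Lets the S3A client treat paths under __magic specially. -->
  <name>fs.s3a.committer.magic.enabled</name>
  <value>true</value>
</property>
```

To compare against the staging approach, the same job could be rerun with the committer name changed to `directory`, with every other setting left alone.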
|
||||
|
||||
## Switching to an S3A Committer
|
||||
|
||||
@ -499,9 +513,6 @@ performance.
|
||||
|
||||
### FileSystem client setup
|
||||
|
||||
1. Use a *consistent* S3 object store. For Amazon S3, this means enabling
|
||||
[S3Guard](./s3guard.html). For S3-compatible filesystems, consult the filesystem
|
||||
documentation to see if it is consistent, hence compatible "out of the box".
|
||||
1. Turn the magic on by setting `fs.s3a.committer.magic.enabled`:
|
||||
|
||||
```xml
|
||||
@ -514,8 +525,6 @@ documentation to see if it is consistent, hence compatible "out of the box".
|
||||
</property>
|
||||
```
|
||||
|
||||
*Do not use the Magic Committer on an inconsistent S3 object store. For
|
||||
Amazon S3, that means S3Guard must *always* be enabled.
|
||||
|
||||
|
||||
### Enabling the committer
|
||||
@ -569,11 +578,9 @@ Conflict management is left to the execution engine itself.
|
||||
|
||||
<property>
|
||||
<name>fs.s3a.committer.magic.enabled</name>
|
||||
<value>false</value>
|
||||
<value>true</value>
|
||||
<description>
|
||||
Enable support in the filesystem for the S3 "Magic" committer.
|
||||
When working with AWS S3, S3Guard must be enabled for the destination
|
||||
bucket, as consistent metadata listings are required.
|
||||
</description>
|
||||
</property>
|
||||
|
||||
@ -726,7 +733,6 @@ in configuration option fs.s3a.committer.magic.enabled
|
||||
The Job is configured to use the magic committer, but the S3A bucket has not been explicitly
|
||||
declared as supporting it.
|
||||
|
||||
The destination bucket **must** be declared as supporting the magic committer.
|
||||
|
||||
This can be done for those buckets which are known to be consistent, either
|
||||
because [S3Guard](s3guard.html) is used to provide consistency,
|
||||
@ -739,10 +745,6 @@ or because the S3-compatible filesystem is known to be strongly consistent.
|
||||
</property>
|
||||
```
|
||||
|
||||
*IMPORTANT*: only enable the magic committer against object stores which
|
||||
offer consistent listings. By default, Amazon S3 does not do this -which is
|
||||
why the option `fs.s3a.committer.magic.enabled` is disabled by default.
|
||||
|
||||
|
||||
Tip: you can verify that a bucket supports the magic committer through the
|
||||
`hadoop s3guard bucket-info` command:
|
||||
|
@ -81,11 +81,12 @@ schemes.
|
||||
* Supports authentication via: environment variables, Hadoop configuration
|
||||
properties, the Hadoop key management store and IAM roles.
|
||||
* Supports per-bucket configuration.
|
||||
* With [S3Guard](./s3guard.html), adds high performance and consistent metadata/
|
||||
directory read operations. This delivers consistency as well as speed.
|
||||
* Supports S3 "Server Side Encryption" for both reading and writing:
|
||||
SSE-S3, SSE-KMS and SSE-C
|
||||
* Instrumented with Hadoop metrics.
|
||||
* Before S3 was consistent, provided a consistent view of inconsistent storage
|
||||
through [S3Guard](./s3guard.html).
|
||||
|
||||
* Actively maintained by the open source community.
|
||||
|
||||
|
||||
|
@ -30,11 +30,11 @@ That's because its a very different system, as you can see:
|
||||
| communication | RPC | HTTP GET/PUT/HEAD/LIST/COPY requests |
|
||||
| data locality | local storage | remote S3 servers |
|
||||
| replication | multiple datanodes | asynchronous after upload |
|
||||
| consistency | consistent data and listings | eventual consistent for listings, deletes and updates |
|
||||
| consistency | consistent data and listings | consistent since November 2020|
|
||||
| bandwidth | best: local IO, worst: datacenter network | bandwidth between servers and S3 |
|
||||
| latency | low | high, especially for "low cost" directory operations |
|
||||
| rename | fast, atomic | slow faked rename through COPY & DELETE|
|
||||
| delete | fast, atomic | fast for a file, slow & non-atomic for directories |
|
||||
| rename | fast, atomic | slow faked rename through COPY and DELETE|
|
||||
| delete | fast, atomic | fast for a file, slow and non-atomic for directories |
|
||||
| writing| incremental | in blocks; not visible until the writer is closed |
|
||||
| reading | seek() is fast | seek() is slow and expensive |
|
||||
| IOPs | limited only by hardware | callers are throttled to shards in an s3 bucket |
|
||||
|
@ -615,24 +615,10 @@ characters can be configured in the Hadoop configuration.
|
||||
|
||||
**Consistency**
|
||||
|
||||
* Assume the usual S3 consistency model applies.
|
||||
Since November 2020, AWS S3 has been fully consistent.
|
||||
This also applies to S3 Select.
|
||||
We do not know what happens if an object is overwritten while a query is active.
|
||||
|
||||
* When enabled, S3Guard's DynamoDB table will declare whether or not
|
||||
a newly deleted file is visible: if it is marked as deleted, the
|
||||
select request will be rejected with a `FileNotFoundException`.
|
||||
|
||||
* When an existing S3-hosted object is changed, the S3 select operation
|
||||
may return the results of a SELECT call as applied to either the old
|
||||
or new version.
|
||||
|
||||
* We don't know whether you can get partially consistent reads, or whether
|
||||
an extended read ever picks up a later value.
|
||||
|
||||
* The AWS S3 load balancers can briefly cache 404/Not-Found entries
|
||||
from a failed HEAD/GET request against a nonexistent file; this cached
|
||||
entry can briefly create create inconsistency, despite the
|
||||
AWS "Create is consistent" model. There is no attempt to detect or recover from
|
||||
this.
|
||||
|
||||
**Concurrency**
|
||||
|
||||
|
@ -22,24 +22,39 @@
|
||||
which can use a (consistent) database as the store of metadata about objects
|
||||
in an S3 bucket.
|
||||
|
||||
It was written between 2016 and 2020, *when Amazon S3 was eventually consistent.*
|
||||
It compensated for the following S3 inconsistencies:
|
||||
* Newly created objects excluded from directory listings.
|
||||
* Newly deleted objects retained in directory listings.
|
||||
* Deleted objects still visible in existence probes and opening for reading.
|
||||
* S3 Load balancer 404 caching when a probe is made for an object before its creation.
|
||||
|
||||
It did not compensate for update inconsistency, though by storing the etag
|
||||
values of objects in the database, it could detect and report problems.
|
||||
|
||||
Now that S3 is consistent, there is no need for S3Guard at all.
|
||||
|
||||
S3Guard
|
||||
|
||||
1. May improve performance on directory listing/scanning operations,
|
||||
1. Permitted a consistent view of the object store.
|
||||
|
||||
1. Could improve performance on directory listing/scanning operations.
|
||||
including those which take place during the partitioning period of query
|
||||
execution, the process where files are listed and the work divided up amongst
|
||||
processes.
|
||||
|
||||
1. Permits a consistent view of the object store. Without this, changes in
|
||||
objects may not be immediately visible, especially in listing operations.
|
||||
|
||||
1. Offers a platform for future performance improvements for running Hadoop
|
||||
workloads on top of object stores
|
||||
|
||||
The basic idea is that, for each operation in the Hadoop S3 client (s3a) that
|
||||
The basic idea was that, for each operation in the Hadoop S3 client (s3a) that
|
||||
reads or modifies metadata, a shadow copy of that metadata is stored in a
|
||||
separate MetadataStore implementation. Each MetadataStore implementation
|
||||
offers HDFS-like consistency for the metadata, and may also provide faster
|
||||
lookups for things like file status or directory listings.
|
||||
separate MetadataStore implementation. The store was
|
||||
1. Updated after mutating operations on the store
|
||||
1. Updated after list operations against S3 discovered changes
|
||||
1. Looked up whenever a probe was made for a file/directory existing.
|
||||
1. Queried for all objects under a path when a directory listing was made; the results were
|
||||
merged with the S3 listing in a non-authoritative path, used exclusively in
|
||||
authoritative mode.
|
||||
|
||||
|
||||
For links to early design documents and related patches, see
|
||||
[HADOOP-13345](https://issues.apache.org/jira/browse/HADOOP-13345).
|
||||
@ -55,6 +70,19 @@ It is essential for all clients writing to an S3Guard-enabled
|
||||
S3 Repository to use the feature. Clients reading the data may work directly
|
||||
with the S3A data, in which case the normal S3 consistency guarantees apply.
|
||||
|
||||
## Moving off S3Guard
|
||||
|
||||
How to move off S3Guard, given that it is no longer needed:
|
||||
|
||||
1. Unset the option `fs.s3a.metadatastore.impl` globally/for all buckets for which it
|
||||
was selected.
|
||||
1. If the option `org.apache.hadoop.fs.s3a.s3guard.disabled.warn.level` has been changed from
|
||||
the default (`SILENT`), change it back. You no longer need to be warned that S3Guard is disabled.
|
||||
1. Restart all applications.
|
||||
|
||||
Once you are confident that all applications have been restarted, _delete the DynamoDB table_.
|
||||
This is to avoid paying for a database you no longer need.
|
||||
This is best done from the AWS GUI.
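
Where simply removing the `fs.s3a.metadatastore.impl` entry is awkward (for example, when it is inherited from a shared configuration file), one possible alternative is to override it explicitly. The sketch below assumes that the `NullMetadataStore` class is the no-op store in the release in use, and `example-bucket` is a hypothetical bucket name.

```xml
<!-- Illustrative sketch only: select the no-op metadata store globally. -->
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <!-- Assumed class name; verify it exists in your Hadoop release. -->
  <value>org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore</value>
</property>

<!-- Per-bucket form; "example-bucket" is a hypothetical name. -->
<property>
  <name>fs.s3a.bucket.example-bucket.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore</value>
</property>
```

An explicit override makes the intent visible in the configuration, which may help when auditing which clusters have actually moved off S3Guard.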
|
||||
|
||||
## Setting up S3Guard
|
||||
|
||||
@ -70,7 +98,7 @@ without S3Guard. The following values are available:
|
||||
* `WARN`: Warn that data may be at risk in workflows.
|
||||
* `FAIL`: S3AFileSystem instantiation will fail.
|
||||
|
||||
The default setting is INFORM. The setting is case insensitive.
|
||||
The default setting is `SILENT`. The setting is case insensitive.
|
||||
The required level can be set in the `core-site.xml`.
|
||||
|
||||
---
|
||||
|
@ -974,16 +974,18 @@ using an absolute XInclude reference to it.
|
||||
**Warning: do not enable any type of failure injection in production. The
|
||||
following settings are for testing only.**
|
||||
|
||||
One of the challenges with S3A integration tests is the fact that S3 is an
|
||||
eventually-consistent storage system. In practice, we rarely see delays in
|
||||
visibility of recently created objects both in listings (`listStatus()`) and
|
||||
when getting a single file's metadata (`getFileStatus()`). Since this behavior
|
||||
is rare and non-deterministic, thorough integration testing is challenging.
|
||||
|
||||
To address this, S3A supports a shim layer on top of the `AmazonS3Client`
|
||||
One of the challenges with S3A integration tests is the fact that S3 was an
|
||||
eventually-consistent storage system. To simulate inconsistencies more
|
||||
frequently than they would normally surface, S3A supports a shim layer on top of the `AmazonS3Client`
|
||||
class which artificially delays certain paths from appearing in listings.
|
||||
This is implemented in the class `InconsistentAmazonS3Client`.
|
||||
|
||||
Now that S3 is consistent, injecting failures during integration and
|
||||
functional testing is less important.
|
||||
There's no need to enable it to verify that S3Guard can recover
from inconsistencies, given that in production such inconsistencies
will never surface.
|
||||
|
||||
## Simulating List Inconsistencies
|
||||
|
||||
### Enabling the InconsistentAmazonS3Client
|
||||
@ -1062,9 +1064,6 @@ The default is 5000 milliseconds (five seconds).
|
||||
</property>
|
||||
```
|
||||
|
||||
Future versions of this client will introduce new failure modes,
|
||||
with simulation of S3 throttling exceptions the next feature under
|
||||
development.
|
||||
|
||||
### Limitations of Inconsistency Injection
|
||||
|
||||
@ -1104,8 +1103,12 @@ inconsistent directory listings.
|
||||
|
||||
## <a name="s3guard"></a> Testing S3Guard
|
||||
|
||||
[S3Guard](./s3guard.html) is an extension to S3A which adds consistent metadata
|
||||
listings to the S3A client. As it is part of S3A, it also needs to be tested.
|
||||
[S3Guard](./s3guard.html) is an extension to S3A which added consistent metadata
|
||||
listings to the S3A client.
|
||||
|
||||
It has not been needed for applications to work safely with AWS S3 since
November 2020. However, it is currently still part of the codebase, and so
something which needs to be tested.
|
||||
|
||||
The basic strategy for testing S3Guard correctness consists of:
|
||||
|
||||
|
@ -1018,61 +1018,6 @@ Something has been trying to write data to "/".
|
||||
These are the issues where S3 does not appear to behave the way a filesystem
|
||||
"should".
|
||||
|
||||
### Visible S3 Inconsistency
|
||||
|
||||
Amazon S3 is *an eventually consistent object store*. That is: not a filesystem.
|
||||
|
||||
To reduce visible inconsistencies, use the [S3Guard](./s3guard.html) consistency
|
||||
cache.
|
||||
|
||||
|
||||
By default, Amazon S3 offers read-after-create consistency: a newly created file
|
||||
is immediately visible.
|
||||
There is a small quirk: a negative GET may be cached, such
|
||||
that even if an object is immediately created, the fact that there "wasn't"
|
||||
an object is still remembered.
|
||||
|
||||
That means the following sequence on its own will be consistent
|
||||
```
|
||||
touch(path) -> getFileStatus(path)
|
||||
```
|
||||
|
||||
But this sequence *may* be inconsistent.
|
||||
|
||||
```
|
||||
getFileStatus(path) -> touch(path) -> getFileStatus(path)
|
||||
```
|
||||
|
||||
A common source of visible inconsistencies is that the S3 metadata
|
||||
database —the part of S3 which serves list requests— is updated asynchronously.
|
||||
Newly added or deleted files may not be visible in the index, even though direct
|
||||
operations on the object (`HEAD` and `GET`) succeed.
|
||||
|
||||
That means the `getFileStatus()` and `open()` operations are more likely
|
||||
to be consistent with the state of the object store, but without S3Guard enabled,
|
||||
directory list operations such as `listStatus()`, `listFiles()`, `listLocatedStatus()`,
|
||||
and `listStatusIterator()` may not see newly created files, and still list
|
||||
old files.
|
||||
|
||||
### `FileNotFoundException` even though the file was just written.
|
||||
|
||||
This can be a sign of consistency problems. It may also surface if there is some
|
||||
asynchronous file write operation still in progress in the client: the operation
|
||||
has returned, but the write has not yet completed. While the S3A client code
|
||||
does block during the `close()` operation, we suspect that asynchronous writes
|
||||
may be taking place somewhere in the stack —this could explain why parallel tests
|
||||
fail more often than serialized tests.
|
||||
|
||||
### File not found in a directory listing, even though `getFileStatus()` finds it
|
||||
|
||||
(Similarly: deleted file found in listing, though `getFileStatus()` reports
|
||||
that it is not there)
|
||||
|
||||
This is a visible sign of updates to the metadata server lagging
|
||||
behind the state of the underlying filesystem.
|
||||
|
||||
Fix: Use [S3Guard](s3guard.html).
|
||||
|
||||
|
||||
### File not visible/saved
|
||||
|
||||
@ -1159,6 +1104,11 @@ for more information.
|
||||
A file being renamed and listed in the S3Guard table could not be found
|
||||
in the S3 bucket even after multiple attempts.
|
||||
|
||||
Now that S3 is consistent, this is a sign that the S3Guard table is out of sync with
the S3 data.
|
||||
|
||||
Fix: disable S3Guard: it is no longer needed.
|
||||
|
||||
```
|
||||
org.apache.hadoop.fs.s3a.RemoteFileChangedException: copyFile(/sourcedir/missing, /destdir/)
|
||||
`s3a://example/sourcedir/missing': File not found on S3 after repeated attempts: `s3a://example/sourcedir/missing'
|
||||
@ -1169,10 +1119,6 @@ at org.apache.hadoop.fs.s3a.impl.RenameOperation.copySourceAndUpdateTracker(Rena
|
||||
at org.apache.hadoop.fs.s3a.impl.RenameOperation.lambda$initiateCopy$0(RenameOperation.java:412)
|
||||
```
|
||||
|
||||
Either the file has been deleted, or an attempt was made to read a file before it
|
||||
was created and the S3 load balancer has briefly cached the 404 returned by that
|
||||
operation. This is something which AWS S3 can do for short periods.
|
||||
|
||||
If this error occurs and the file is on S3, consider increasing the value of
|
||||
`fs.s3a.s3guard.consistency.retry.limit`.
|
||||
|
||||
@ -1180,29 +1126,6 @@ We also recommend using applications/application
|
||||
options which do not rename files when committing work or when copying data
|
||||
to S3, but instead write directly to the final destination.
|
||||
|
||||
### `RemoteFileChangedException`: "File to rename not found on unguarded S3 store"
|
||||
|
||||
```
|
||||
org.apache.hadoop.fs.s3a.RemoteFileChangedException: copyFile(/sourcedir/missing, /destdir/)
|
||||
`s3a://example/sourcedir/missing': File to rename not found on unguarded S3 store: `s3a://example/sourcedir/missing'
|
||||
at org.apache.hadoop.fs.s3a.S3AFileSystem.copyFile(S3AFileSystem.java:3231)
|
||||
at org.apache.hadoop.fs.s3a.S3AFileSystem.access$700(S3AFileSystem.java:177)
|
||||
at org.apache.hadoop.fs.s3a.S3AFileSystem$RenameOperationCallbacksImpl.copyFile(S3AFileSystem.java:1368)
|
||||
at org.apache.hadoop.fs.s3a.impl.RenameOperation.copySourceAndUpdateTracker(RenameOperation.java:448)
|
||||
at org.apache.hadoop.fs.s3a.impl.RenameOperation.lambda$initiateCopy$0(RenameOperation.java:412)
|
||||
```
|
||||
|
||||
An attempt was made to rename a file in an S3 store not protected by S3Guard,
|
||||
the directory list operation included the filename in its results but the
|
||||
actual operation to rename the file failed.
|
||||
|
||||
This can happen because S3 directory listings and the store itself are not
|
||||
consistent: the list operation tends to lag changes in the store.
|
||||
It is possible that the file has been deleted.
|
||||
|
||||
The fix here is to use S3Guard. We also recommend using applications/application
|
||||
options which do not rename files when committing work or when copying data
|
||||
to S3, but instead write directly to the final destination.
|
||||
|
||||
## <a name="encryption"></a> S3 Server Side Encryption
|
||||
|
||||
|