HADOOP-15332. Fix typos in hadoop-aws markdown docs. Contributed by Gabor Bota.
parent 2caba999bb
commit 7ce6b41509

@@ -28,7 +28,7 @@ The standard commit algorithms (the `FileOutputCommitter` and its v1 and v2 algo
rely on directory rename being an `O(1)` atomic operation: callers output their
work to temporary directories in the destination filesystem, then
rename these directories to the final destination as way of committing work.
This is the perfect solution for commiting work against any filesystem with
This is the perfect solution for committing work against any filesystem with
consistent listing operations and where the `FileSystem.rename()` command
is an atomic `O(1)` operation.

@@ -60,7 +60,7 @@ delayed completion of multi-part PUT operations
That is: tasks write all data as multipart uploads, *but delay the final
commit action until until the final, single job commit action.* Only that
data committed in the job commit action will be made visible; work from speculative
and failed tasks will not be instiantiated. As there is no rename, there is no
and failed tasks will not be instantiated. As there is no rename, there is no
delay while data is copied from a temporary directory to the final directory.
The duration of the commit will be the time needed to determine which commit operations
to construct, and to execute them.

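For orientation while reading these excerpts: which committer provides this delayed-completion behaviour is selected through the `fs.s3a.committer.name` option documented in the committer docs. A minimal, illustrative snippet only; the choice of the magic committer here is just an example:

```xml
<property>
  <name>fs.s3a.committer.name</name>
  <!-- Example only: other documented values include "directory", "partitioned" and the classic "file" committer. -->
  <value>magic</value>
</property>
```
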
@@ -109,7 +109,7 @@ This is traditionally implemented via a `FileSystem.rename()` call.

It is useful to differentiate between a *task-side commit*: an operation performed
in the task process after its work, and a *driver-side task commit*, in which
the Job driver perfoms the commit operation. Any task-side commit work will
the Job driver performs the commit operation. Any task-side commit work will
be performed across the cluster, and may take place off the critical part for
job execution. However, unless the commit protocol requires all tasks to await
a signal from the job driver, task-side commits cannot instantiate their output

@@ -241,7 +241,7 @@ def commitTask(fs, jobAttemptPath, taskAttemptPath, dest):
fs.rename(taskAttemptPath, taskCommittedPath)
```

On a genuine fileystem this is an `O(1)` directory rename.
On a genuine filesystem this is an `O(1)` directory rename.

On an object store with a mimiced rename, it is `O(data)` for the copy,
along with overhead for listing and deleting all files (For S3, that's

@@ -257,13 +257,13 @@ def abortTask(fs, jobAttemptPath, taskAttemptPath, dest):
fs.delete(taskAttemptPath, recursive=True)
```

On a genuine fileystem this is an `O(1)` operation. On an object store,
On a genuine filesystem this is an `O(1)` operation. On an object store,
proportional to the time to list and delete files, usually in batches.


### Job Commit

Merge all files/directories in all task commited paths into final destination path.
Merge all files/directories in all task committed paths into final destination path.
Optionally; create 0-byte `_SUCCESS` file in destination path.

```python

@@ -420,9 +420,9 @@ by renaming the files.
A a key difference is that the v1 algorithm commits a source directory to
via a directory rename, which is traditionally an `O(1)` operation.

In constrast, the v2 algorithm lists all direct children of a source directory
In contrast, the v2 algorithm lists all direct children of a source directory
and recursively calls `mergePath()` on them, ultimately renaming the individual
files. As such, the number of renames it performa equals the number of source
files. As such, the number of renames it performs equals the number of source
*files*, rather than the number of source *directories*; the number of directory
listings being `O(depth(src))` , where `depth(path)` is a function returning the
depth of directories under the given path.

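For readers cross-checking the two variants: which algorithm the classic committer uses is selected with the standard MapReduce option shown below. This is only an illustrative snippet; value `1` selects the directory-rename (v1) behaviour and `2` the file-by-file (v2) behaviour discussed above.

```xml
<property>
  <name>mapreduce.fileoutputcommitter.algorithm.version</name>
  <!-- 1 = v1 (directory rename in job commit); 2 = v2 (files renamed during task commit). -->
  <value>1</value>
</property>
```
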
@@ -431,7 +431,7 @@ On a normal filesystem, the v2 merge algorithm is potentially more expensive
than the v1 algorithm. However, as the merging only takes place in task commit,
it is potentially less of a bottleneck in the entire execution process.

On an objcct store, it is suboptimal not just from its expectation that `rename()`
On an object store, it is suboptimal not just from its expectation that `rename()`
is an `O(1)` operation, but from its expectation that a recursive tree walk is
an efficient way to enumerate and act on a tree of data. If the algorithm was
switched to using `FileSystem.listFiles(path, recursive)` for a single call to

@@ -548,7 +548,7 @@ the final destination FS, while `file://` can retain the default

### Task Setup

`Task.initialize()`: read in the configuration, instantate the `JobContextImpl`
`Task.initialize()`: read in the configuration, instantiate the `JobContextImpl`
and `TaskAttemptContextImpl` instances bonded to the current job & task.

### Task Ccommit

@@ -610,7 +610,7 @@ deleting the previous attempt's data is straightforward. However, for S3 committ
using Multipart Upload as the means of uploading uncommitted data, it is critical
to ensure that pending uploads are always aborted. This can be done by

* Making sure that all task-side failure branvches in `Task.done()` call `committer.abortTask()`.
* Making sure that all task-side failure branches in `Task.done()` call `committer.abortTask()`.
* Having job commit & abort cleaning up all pending multipart writes to the same directory
tree. That is: require that no other jobs are writing to the same tree, and so
list all pending operations and cancel them.

@@ -653,7 +653,7 @@ rather than relying on fields initiated from the context passed to the construct

#### AM: Job setup: `OutputCommitter.setupJob()`

This is initated in `org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.StartTransition`.
This is initiated in `org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.StartTransition`.
It is queued for asynchronous execution in `org.apache.hadoop.mapreduce.v2.app.MRAppMaster.startJobs()`,
which is invoked when the service is started. Thus: the job is set up when the
AM is started.

@@ -686,7 +686,7 @@ the job is considered not to have attempted to commit itself yet.


The presence of `COMMIT_SUCCESS` or `COMMIT_FAIL` are taken as evidence
that the previous job completed successfully or unsucessfully; the AM
that the previous job completed successfully or unsuccessfully; the AM
then completes with a success/failure error code, without attempting to rerun
the job.

@@ -871,16 +871,16 @@ base directory. As well as translating the write operation, it also supports
a `getFileStatus()` call on the original path, returning details on the file
at the final destination. This allows for committing applications to verify
the creation/existence/size of the written files (in contrast to the magic
committer covdered below).
committer covered below).

The FS targets Openstack Swift, though other object stores are supportable through
different backends.

This solution is innovative in that it appears to deliver the same semantics
(and hence failure modes) as the Spark Direct OutputCommitter, but which
does not need any changs in either Spark *or* the Hadoop committers. In contrast,
does not need any change in either Spark *or* the Hadoop committers. In contrast,
the committers proposed here combines changing the Hadoop MR committers for
ease of pluggability, and offers a new committer exclusivley for S3, one
ease of pluggability, and offers a new committer exclusively for S3, one
strongly dependent upon and tightly integrated with the S3A Filesystem.

The simplicity of the Stocator committer is something to appreciate.

@@ -922,7 +922,7 @@ The completion operation is apparently `O(1)`; presumably the PUT requests
have already uploaded the data to the server(s) which will eventually be
serving up the data for the final path. All that is needed to complete
the upload is to construct an object by linking together the files in
the server's local filesystem and udate an entry the index table of the
the server's local filesystem and update an entry the index table of the
object store.

In the S3A client, all PUT calls in the sequence and the final commit are

@@ -941,11 +941,11 @@ number of appealing features

The final point is not to be underestimated, es not even
a need for a consistency layer.
* Overall a simpler design.pecially given the need to
* Overall a simpler design. Especially given the need to
be resilient to the various failure modes which may arise.


The commiter writes task outputs to a temporary directory on the local FS.
The committer writes task outputs to a temporary directory on the local FS.
Task outputs are directed to the local FS by `getTaskAttemptPath` and `getWorkPath`.
On task commit, the committer enumerates files in the task attempt directory (ignoring hidden files).
Each file is uploaded to S3 using the [multi-part upload API](http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html),

@@ -966,7 +966,7 @@ is a local `file://` reference.
within a consistent, cluster-wide filesystem. For Netflix, that is HDFS.
1. The Standard `FileOutputCommitter` (algorithm 1) is used to manage the commit/abort of these
files. That is: it copies only those lists of files to commit from successful tasks
into a (transient) job commmit directory.
into a (transient) job commit directory.
1. The S3 job committer reads the pending file list for every task committed
in HDFS, and completes those put requests.

@@ -1028,7 +1028,7 @@ complete at or near the same time, there may be a peak of bandwidth load
slowing down the upload.

Time to commit will be the same, and, given the Netflix committer has already
implemented the paralellization logic here, a time of `O(files/threads)`.
implemented the parallelization logic here, a time of `O(files/threads)`.

### Resilience

@@ -1105,7 +1105,7 @@ This is done by
an abort of all successfully read files.
1. List and abort all pending multipart uploads.

Because of action #2, action #1 is superflous. It is retained so as to leave
Because of action #2, action #1 is superfluous. It is retained so as to leave
open the option of making action #2 a configurable option -which would be
required to handle the use case of >1 partitioned commit running simultaneously/

@@ -1115,7 +1115,7 @@ Because the local data is managed with the v1 commit algorithm, the
second attempt of the job will recover all the outstanding commit data
of the first attempt; those tasks will not be rerun.

This also ensures that on a job abort, the invidual tasks' .pendingset
This also ensures that on a job abort, the individual tasks' .pendingset
files can be read and used to initiate the abort of those uploads.
That is: a recovered job can clean up the pending writes of the previous job

@@ -1129,7 +1129,7 @@ must be configured to automatically delete the pending request.
Those uploads already executed by a failed job commit will persist; those
yet to execute will remain outstanding.

The committer currently declares itself as non-recoverble, but that
The committer currently declares itself as non-recoverable, but that
may not actually hold, as the recovery process could be one of:

1. Enumerate all job commits from the .pendingset files (*:= Commits*).

@@ -1203,7 +1203,7 @@ that of the final job destination. When the job is committed, the pending
writes are instantiated.

With the addition of the Netflix Staging committer, the actual committer
code now shares common formats for the persistent metadadata and shared routines
code now shares common formats for the persistent metadata and shared routines
for parallel committing of work, including all the error handling based on
the Netflix experience.

@@ -1333,7 +1333,7 @@ during job and task committer initialization.

The job/task commit protocol is expected to handle this with the task
only committing work when the job driver tells it to. A network partition
should trigger the task committer's cancellation of the work (this is a protcol
should trigger the task committer's cancellation of the work (this is a protocol
above the committers).

#### Job Driver failure

@@ -1349,7 +1349,7 @@ when the job driver cleans up it will cancel pending writes under the directory.

#### Multiple jobs targeting the same destination directory

This leaves things in an inderminate state.
This leaves things in an indeterminate state.


#### Failure during task commit

@@ -1388,7 +1388,7 @@ Two options present themselves
and test that code as appropriate.

Fixing the calling code does seem to be the best strategy, as it allows the
failure to be explictly handled in the commit protocol, rather than hidden
failure to be explicitly handled in the commit protocol, rather than hidden
in the committer.::OpenFile

#### Preemption

@@ -1418,7 +1418,7 @@ with many millions of objects —rather than list all keys searching for those
with `/__magic/**/*.pending` in their name, work backwards from the active uploads to
the directories with the data.

We may also want to consider having a cleanup operationn in the S3 CLI to
We may also want to consider having a cleanup operation in the S3 CLI to
do the full tree scan and purge of pending items; give some statistics on
what was found. This will keep costs down and help us identify problems
related to cleanup.

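A related housekeeping control already exists on the S3A client side: outstanding multipart uploads older than a threshold can be purged when a filesystem instance is created. The snippet below is only an illustrative sketch; the age shown is an example value, and enabling the purge is unsafe if other jobs may still have uploads in flight against the same bucket.

```xml
<property>
  <name>fs.s3a.multipart.purge</name>
  <!-- Example: purge old incomplete uploads when an S3A filesystem instance is created. -->
  <value>true</value>
</property>
<property>
  <name>fs.s3a.multipart.purge.age</name>
  <!-- Example age, in seconds, after which an incomplete upload is treated as abandoned. -->
  <value>86400</value>
</property>
```
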
@@ -1538,7 +1538,7 @@ The S3A Committer version, would

In order to support the ubiquitous `FileOutputFormat` and subclasses,
S3A Committers will need somehow be accepted as a valid committer by the class,
a class which explicity expects the output committer to be `FileOutputCommitter`
a class which explicitly expects the output committer to be `FileOutputCommitter`

```java
public Path getDefaultWorkFile(TaskAttemptContext context,

@@ -1555,10 +1555,10 @@ Here are some options which have been considered, explored and discarded

1. Adding more of a factory mechanism to create `FileOutputCommitter` instances;
subclass this for S3A output and return it. The complexity of `FileOutputCommitter`
and of supporting more dynamic consturction makes this dangerous from an implementation
and of supporting more dynamic construction makes this dangerous from an implementation
and maintenance perspective.

1. Add a new commit algorithmm "3", which actually reads in the configured
1. Add a new commit algorithm "3", which actually reads in the configured
classname of a committer which it then instantiates and then relays the commit
operations, passing in context information. Ths new committer interface would
add methods for methods and attributes. This is viable, but does still change

@@ -1695,7 +1695,7 @@ marker implied the classic `FileOutputCommitter` had been used; if it could be r
then it provides some details on the commit operation which are then used
in assertions in the test suite.

It has since been extended to collet metrics and other values, and has proven
It has since been extended to collect metrics and other values, and has proven
equally useful in Spark integration testing.

## Integrating the Committers with Apache Spark

@@ -1727,8 +1727,8 @@ tree.

Alternatively, the fact that Spark tasks provide data to the job committer on their
completion means that a list of pending PUT commands could be built up, with the commit
operations being excuted by an S3A-specific implementation of the `FileCommitProtocol`.
As noted earlier, this may permit the reqirement for a consistent list operation
operations being executed by an S3A-specific implementation of the `FileCommitProtocol`.
As noted earlier, this may permit the requirement for a consistent list operation
to be bypassed. It would still be important to list what was being written, as
it is needed to aid aborting work in failed tasks, but the list of files
created by successful tasks could be passed directly from the task to committer,

@@ -1833,7 +1833,7 @@ quotas in local FS, keeping temp dirs on different mounted FS from root.
The intermediate `.pendingset` files are saved in HDFS under the directory in
`fs.s3a.committer.staging.tmp.path`; defaulting to `/tmp`. This data can
disclose the workflow (it contains the destination paths & amount of data
generated), and if deleted, breaks the job. If malicous code were to edit
generated), and if deleted, breaks the job. If malicious code were to edit
the file, by, for example, reordering the ordered etag list, the generated
data would be committed out of order, creating invalid files. As this is
the (usually transient) cluster FS, any user in the cluster has the potential

@@ -1848,7 +1848,7 @@ The directory defined by `fs.s3a.buffer.dir` is used to buffer blocks
before upload, unless the job is configured to buffer the blocks in memory.
This is as before: no incremental risk. As blocks are deleted from the filesystem
after upload, the amount of storage needed is determined by the data generation
bandwidth and the data upload bandwdith.
bandwidth and the data upload bandwidth.

No use is made of the cluster filesystem; there are no risks there.

@@ -1946,6 +1946,6 @@ which will made absolute relative to the current user. In filesystems in
which access under user's home directories are restricted, this final, absolute
path, will not be visible to untrusted accounts.

* Maybe: define the for valid characters in a text strings, and a regext for
* Maybe: define the for valid characters in a text strings, and a regex for
validating, e,g, `[a-zA-Z0-9 \.\,\(\) \-\+]+` and then validate any free text
JSON fields on load and save.

@@ -226,14 +226,14 @@ it is committed through the standard "v1" commit algorithm.
When the Job is committed, the Job Manager reads the lists of pending writes from its
HDFS Job destination directory and completes those uploads.

Cancelling a task is straighforward: the local directory is deleted with
Cancelling a task is straightforward: the local directory is deleted with
its staged data. Cancelling a job is achieved by reading in the lists of
pending writes from the HDFS job attempt directory, and aborting those
uploads. For extra safety, all outstanding multipart writes to the destination directory
are aborted.

The staging committer comes in two slightly different forms, with slightly
diffrent conflict resolution policies:
different conflict resolution policies:


* **Directory**: the entire directory tree of data is written or overwritten,

@@ -278,7 +278,7 @@ any with the same name. Reliable use requires unique names for generated files,
which the committers generate
by default.

The difference between the two staging ommitters are as follows:
The difference between the two staging committers are as follows:

The Directory Committer uses the entire directory tree for conflict resolution.
If any file exists at the destination it will fail in job setup; if the resolution

@@ -301,7 +301,7 @@ It's intended for use in Apache Spark Dataset operations, rather
than Hadoop's original MapReduce engine, and only in jobs
where adding new data to an existing dataset is the desired goal.

Preequisites for successful work
Prerequisites for successful work

1. The output is written into partitions via `PARTITIONED BY` or `partitionedBy()`
instructions.

@@ -401,7 +401,7 @@ Generated files are initially written to a local directory underneath one of the
directories listed in `fs.s3a.buffer.dir`.


The staging commmitter needs a path in the cluster filesystem
The staging committer needs a path in the cluster filesystem
(e.g. HDFS). This must be declared in `fs.s3a.committer.staging.tmp.path`.

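Purely as an illustration of the declaration mentioned above (the path shown is an example, not a statement of the default):

```xml
<property>
  <name>fs.s3a.committer.staging.tmp.path</name>
  <!-- Example cluster-filesystem directory for the staging committer's intermediate data. -->
  <value>/tmp/staging</value>
</property>
```
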
Temporary files are saved in HDFS (or other cluster filesystem) under the path

@@ -460,7 +460,7 @@ What the partitioned committer does is, where the tooling permits, allows caller
to add data to an existing partitioned layout*.

More specifically, it does this by having a conflict resolution options which
only act on invididual partitions, rather than across the entire output tree.
only act on individual partitions, rather than across the entire output tree.

| `fs.s3a.committer.staging.conflict-mode` | Meaning |
| -----------------------------------------|---------|

@@ -508,7 +508,7 @@ documentation to see if it is consistent, hence compatible "out of the box".
<property>
<name>fs.s3a.committer.magic.enabled</name>
<description>
Enable support in the filesystem for the S3 "Magic" committter.
Enable support in the filesystem for the S3 "Magic" committer.
</description>
<value>true</value>
</property>

@@ -706,7 +706,7 @@ This message should not appear through the committer itself —it will
fail with the error message in the previous section, but may arise
if other applications are attempting to create files under the path `/__magic/`.

Make sure the filesytem meets the requirements of the magic committer
Make sure the filesystem meets the requirements of the magic committer
(a consistent S3A filesystem through S3Guard or the S3 service itself),
and set the `fs.s3a.committer.magic.enabled` flag to indicate that magic file
writes are supported.

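As a quick reference for the two settings involved, a minimal sketch (values are illustrative; the filesystem flag only enables support, while `fs.s3a.committer.name` selects the committer actually used):

```xml
<property>
  <name>fs.s3a.committer.magic.enabled</name>
  <!-- Allow "magic" paths such as __magic/ to be intercepted by the S3A filesystem. -->
  <value>true</value>
</property>
<property>
  <name>fs.s3a.committer.name</name>
  <!-- Select the magic committer; "directory" and "partitioned" are the staging alternatives. -->
  <value>magic</value>
</property>
```
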
@@ -741,7 +741,7 @@ at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.
While that will not make the problem go away, it will at least make
the failure happen at the start of a job.

(Setting this option will not interfer with the Staging Committers' use of HDFS,
(Setting this option will not interfere with the Staging Committers' use of HDFS,
as it explicitly sets the algorithm to "2" for that part of its work).

The other way to check which committer to use is to examine the `_SUCCESS` file.

@@ -23,7 +23,7 @@
The S3A filesystem client supports Amazon S3's Server Side Encryption
for at-rest data encryption.
You should to read up on the [AWS documentation](https://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html)
for S3 Server Side Encryption for up to date information on the encryption mechansims.
for S3 Server Side Encryption for up to date information on the encryption mechanisms.


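For readers skimming these fragments: the encryption mechanism the client requests is driven by a single option. A minimal, illustrative snippet (`AES256` here stands for SSE-S3; SSE-KMS and SSE-C are the other documented choices):

```xml
<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <!-- Example: request SSE-S3 ("AES256") encryption for data written by this client. -->
  <value>AES256</value>
</property>
```
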
@@ -135,7 +135,7 @@ it blank to use the default configured for that region.
the right to use it, uses it to encrypt the object-specific key.


When downloading SSE-KMS encrypte data, the sequence is as follows
When downloading SSE-KMS encrypted data, the sequence is as follows

1. The S3A client issues an HTTP GET request to read the data.
1. S3 sees that the data was encrypted with SSE-KMS, and looks up the specific key in the KMS service

@@ -413,8 +413,8 @@ a KMS key hosted in the AWS-KMS service in the same region.

```

Again the approprate bucket policy can be used to guarantee that all callers
will use SSE-KMS; they can even mandata the name of the key used to encrypt
Again the appropriate bucket policy can be used to guarantee that all callers
will use SSE-KMS; they can even mandate the name of the key used to encrypt
the data, so guaranteeing that access to thee data can be read by everyone
granted access to that key, and nobody without access to it.

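For completeness, the client-side counterpart of such a bucket policy is usually just these two options; a hedged sketch only, with a placeholder key ARN you would replace with the key your policy mandates:

```xml
<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <value>SSE-KMS</value>
</property>
<property>
  <name>fs.s3a.server-side-encryption.key</name>
  <!-- Placeholder ARN: supply the ID/ARN of the KMS key mandated by the bucket policy. -->
  <value>arn:aws:kms:us-west-2:123456789012:key/example-key-id</value>
</property>
```
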
@@ -638,7 +638,7 @@ over that of the `hadoop.security` list (i.e. they are prepended to the common l
</property>
```

This was added to suppport binding different credential providers on a per
This was added to support binding different credential providers on a per
bucket basis, without adding alternative secrets in the credential list.
However, some applications (e.g Hive) prevent the list of credential providers
from being dynamically updated by users. As per-bucket secrets are now supported,

@@ -938,7 +938,7 @@ The S3A client makes a best-effort attempt at recovering from network failures;
this section covers the details of what it does.

The S3A divides exceptions returned by the AWS SDK into different categories,
and chooses a differnt retry policy based on their type and whether or
and chooses a different retry policy based on their type and whether or
not the failing operation is idempotent.

@@ -969,7 +969,7 @@ These failures will be retried with a fixed sleep interval set in
`fs.s3a.retry.interval`, up to the limit set in `fs.s3a.retry.limit`.

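For concreteness, the two options named above sit in the normal S3A configuration; the values below are only illustrative, not a statement of the shipped defaults:

```xml
<property>
  <name>fs.s3a.retry.interval</name>
  <!-- Example sleep between retry attempts. -->
  <value>500ms</value>
</property>
<property>
  <name>fs.s3a.retry.limit</name>
  <!-- Example cap on the number of retry attempts. -->
  <value>7</value>
</property>
```
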
### Only retrible on idempotent operations
### Only retriable on idempotent operations

Some network failures are considered to be retriable if they occur on
idempotent operations; there's no way to know if they happened

@@ -997,11 +997,11 @@ it's a no-op if reprocessed. As indeed, is `Filesystem.delete()`.
1. Any filesystem supporting an atomic `FileSystem.create(path, overwrite=false)`
operation to reject file creation if the path exists MUST NOT consider
delete to be idempotent, because a `create(path, false)` operation will
only succeed if the first `delete()` call has already succeded.
only succeed if the first `delete()` call has already succeeded.
1. And a second, retried `delete()` call could delete the new data.

Because S3 is eventially consistent *and* doesn't support an
atomic create-no-overwrite operation, the choice is more ambigious.
Because S3 is eventually consistent *and* doesn't support an
atomic create-no-overwrite operation, the choice is more ambiguous.

Currently S3A considers delete to be
idempotent because it is convenient for many workflows, including the

@@ -1045,11 +1045,11 @@ Notes
1. There is also throttling taking place inside the AWS SDK; this is managed
by the value `fs.s3a.attempts.maximum`.
1. Throttling events are tracked in the S3A filesystem metrics and statistics.
1. Amazon KMS may thottle a customer based on the total rate of uses of
1. Amazon KMS may throttle a customer based on the total rate of uses of
KMS *across all user accounts and applications*.

Throttling of S3 requests is all too common; it is caused by too many clients
trying to access the same shard of S3 Storage. This generatlly
trying to access the same shard of S3 Storage. This generally
happen if there are too many reads, those being the most common in Hadoop
applications. This problem is exacerbated by Hive's partitioning
strategy used when storing data, such as partitioning by year and then month.

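To make the SDK-level knob mentioned in the notes above concrete, it is set like any other S3A option; the value shown is illustrative rather than a recommended default:

```xml
<property>
  <name>fs.s3a.attempts.maximum</name>
  <!-- Example ceiling on the retries performed inside the AWS SDK itself. -->
  <value>20</value>
</property>
```
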
@@ -1087,7 +1087,7 @@ of data asked for in every GET request, as well as how much data is
skipped in the existing stream before aborting it and creating a new stream.
1. If the DynamoDB tables used by S3Guard are being throttled, increase
the capacity through `hadoop s3guard set-capacity` (and pay more, obviously).
1. KMS: "consult AWS about increating your capacity".
1. KMS: "consult AWS about increasing your capacity".

@@ -1173,14 +1173,14 @@ fs.s3a.bucket.nightly.server-side-encryption-algorithm
```

When accessing the bucket `s3a://nightly/`, the per-bucket configuration
options for that backet will be used, here the access keys and token,
options for that bucket will be used, here the access keys and token,
and including the encryption algorithm and key.

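As a reminder of the pattern being exercised here, any `fs.s3a.` option can be scoped to a single bucket by inserting `bucket.NAME` into the key; a small, purely illustrative sketch for the `nightly` bucket mentioned above:

```xml
<property>
  <name>fs.s3a.bucket.nightly.server-side-encryption-algorithm</name>
  <!-- Example: this algorithm applies only when talking to s3a://nightly/. -->
  <value>SSE-KMS</value>
</property>
```
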
### <a name="per_bucket_endpoints"></a>Using Per-Bucket Configuration to access data round the world

S3 Buckets are hosted in different "regions", the default being "US-East".
The S3A client talks to this region by default, issing HTTP requests
The S3A client talks to this region by default, issuing HTTP requests
to the server `s3.amazonaws.com`.

S3A can work with buckets from any region. Each region has its own

@@ -1331,12 +1331,12 @@ The "fast" output stream
to the available disk space.
1. Generates output statistics as metrics on the filesystem, including
statistics of active and pending block uploads.
1. Has the time to `close()` set by the amount of remaning data to upload, rather
1. Has the time to `close()` set by the amount of remaining data to upload, rather
than the total size of the file.

Because it starts uploading while data is still being written, it offers
significant benefits when very large amounts of data are generated.
The in memory buffering mechanims may also offer speedup when running adjacent to
The in memory buffering mechanisms may also offer speedup when running adjacent to
S3 endpoints, as disks are not used for intermediate data storage.

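The buffering mode being contrasted here is selected by a single option; a minimal sketch only, with the on-disk choice shown and the in-memory alternatives noted in the comment:

```xml
<property>
  <name>fs.s3a.fast.upload.buffer</name>
  <!-- "disk" buffers blocks under fs.s3a.buffer.dir; "array" and "bytebuffer" buffer them in memory. -->
  <value>disk</value>
</property>
```
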
@@ -1400,7 +1400,7 @@ upload operation counts, so identifying when there is a backlog of work/
a mismatch between data generation rates and network bandwidth. Per-stream
statistics can also be logged by calling `toString()` on the current stream.

* Files being written are still invisible untl the write
* Files being written are still invisible until the write
completes in the `close()` call, which will block until the upload is completed.

@@ -1526,7 +1526,7 @@ compete with other filesystem operations.

We recommend a low value of `fs.s3a.fast.upload.active.blocks`; enough
to start background upload without overloading other parts of the system,
then experiment to see if higher values deliver more throughtput —especially
then experiment to see if higher values deliver more throughput —especially
from VMs running on EC2.

```xml

@@ -1569,10 +1569,10 @@ from VMs running on EC2.
There are two mechanisms for cleaning up after leftover multipart
uploads:
- Hadoop s3guard CLI commands for listing and deleting uploads by their
age. Doumented in the [S3Guard](./s3guard.html) section.
age. Documented in the [S3Guard](./s3guard.html) section.
- The configuration parameter `fs.s3a.multipart.purge`, covered below.

If an large stream writeoperation is interrupted, there may be
If a large stream write operation is interrupted, there may be
intermediate partitions uploaded to S3 —data which will be billed for.

These charges can be reduced by enabling `fs.s3a.multipart.purge`,

@@ -506,7 +506,7 @@ Input seek policy: fs.s3a.experimental.input.fadvise=normal

Note that other clients may have a S3Guard table set up to store metadata
on this bucket; the checks are all done from the perspective of the configuration
setttings of the current client.
settings of the current client.

```bash
hadoop s3guard bucket-info -guarded -auth s3a://landsat-pds

@@ -798,6 +798,6 @@ The IO load of clients of the (shared) DynamoDB table was exceeded.
Currently S3Guard doesn't do any throttling and retries here; the way to address
this is to increase capacity via the AWS console or the `set-capacity` command.

## Other Topis
## Other Topics

For details on how to test S3Guard, see [Testing S3Guard](./testing.html#s3guard)

@@ -28,7 +28,7 @@ be ignored.

## <a name="policy"></a> Policy for submitting patches which affect the `hadoop-aws` module.

The Apache Jenkins infrastucture does not run any S3 integration tests,
The Apache Jenkins infrastructure does not run any S3 integration tests,
due to the need to keep credentials secure.

### The submitter of any patch is required to run all the integration tests and declare which S3 region/implementation they used.

@@ -319,10 +319,10 @@ mvn verify -Dparallel-tests -Dscale -DtestsThreadCount=8

The most bandwidth intensive tests (those which upload data) always run
sequentially; those which are slow due to HTTPS setup costs or server-side
actionsare included in the set of parallelized tests.
actions are included in the set of parallelized tests.


### <a name="tuning_scale"></a> Tuning scale optins from Maven
### <a name="tuning_scale"></a> Tuning scale options from Maven


Some of the tests can be tuned from the maven build or from the

@@ -344,7 +344,7 @@ then the configuration value is used. The `unset` option is used to

Only a few properties can be set this way; more will be added.

| Property | Meaninging |
| Property | Meaning |
|-----------|-------------|
| `fs.s3a.scale.test.timeout`| Timeout in seconds for scale tests |
| `fs.s3a.scale.test.huge.filesize`| Size for huge file uploads |

@@ -493,7 +493,7 @@ cases must be disabled:
<value>false</value>
</property>
```
These tests reqest a temporary set of credentials from the STS service endpoint.
These tests request a temporary set of credentials from the STS service endpoint.
An alternate endpoint may be defined in `test.fs.s3a.sts.endpoint`.

```xml

@@ -641,7 +641,7 @@ to support the declaration of a specific large test file on alternate filesystem

### Works Over Long-haul Links

As well as making file size and operation counts scaleable, this includes
As well as making file size and operation counts scalable, this includes
making test timeouts adequate. The Scale tests make this configurable; it's
hard coded to ten minutes in `AbstractS3ATestBase()`; subclasses can
change this by overriding `getTestTimeoutMillis()`.

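Besides the subclass override, the scale-test timeout listed in the property table earlier can also be set from test configuration; the value here is only an example for a slow, long-haul link:

```xml
<property>
  <name>fs.s3a.scale.test.timeout</name>
  <!-- Example timeout, in seconds, for scale tests run over a slow connection. -->
  <value>3600</value>
</property>
```
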
@@ -677,7 +677,7 @@ Tests can overrun `createConfiguration()` to add new options to the configuratio
file for the S3A Filesystem instance used in their tests.

However, filesystem caching may mean that a test suite may get a cached
instance created with an differennnt configuration. For tests which don't need
instance created with an different configuration. For tests which don't need
specific configurations caching is good: it reduces test setup time.

For those tests which do need unique options (encryption, magic files),

@@ -888,7 +888,7 @@ s3a://bucket/a/b/c/DELAY_LISTING_ME
```

In real-life S3 inconsistency, however, we expect that all the above paths
(including `a` and `b`) will be subject to delayed visiblity.
(including `a` and `b`) will be subject to delayed visibility.

### Using the `InconsistentAmazonS3CClient` in downstream integration tests

@@ -952,7 +952,7 @@ When the `s3guard` profile is enabled, following profiles can be specified:
DynamoDB web service; launch the server and creating the table.
You won't be charged bills for using DynamoDB in test. As it runs in-JVM,
the table isn't shared across other tests running in parallel.
* `non-auth`: treat the S3Guard metadata as authorative.
* `non-auth`: treat the S3Guard metadata as authoritative.

```bash
mvn -T 1C verify -Dparallel-tests -DtestsThreadCount=6 -Ds3guard -Ddynamo -Dauth

@@ -984,7 +984,7 @@ throttling, and compare performance for different implementations. These
are included in the scale tests executed when `-Dscale` is passed to
the maven command line.

The two S3Guard scale testse are `ITestDynamoDBMetadataStoreScale` and
The two S3Guard scale tests are `ITestDynamoDBMetadataStoreScale` and
`ITestLocalMetadataStoreScale`. To run the DynamoDB test, you will need to
define your table name and region in your test configuration. For example,
the following settings allow us to run `ITestDynamoDBMetadataStoreScale` with

@@ -967,7 +967,7 @@ Again, this is due to the fact that the data is cached locally until the
`close()` operation. The S3A filesystem cannot be used as a store of data
if it is required that the data is persisted durably after every
`Syncable.hflush()` or `Syncable.hsync()` call.
This includes resilient logging, HBase-style journalling
This includes resilient logging, HBase-style journaling
and the like. The standard strategy here is to save to HDFS and then copy to S3.

## <a name="encryption"></a> S3 Server Side Encryption