diff --git a/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/delegation_tokens.md b/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/delegation_tokens.md index f8f9d88d1e..ce204f118a 100644 --- a/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/delegation_tokens.md +++ b/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/delegation_tokens.md @@ -20,7 +20,7 @@ The S3A filesystem client supports `Hadoop Delegation Tokens`. This allows YARN application like MapReduce, Distcp, Apache Flink and Apache Spark to -obtain credentials to access S3 buckets and pass them pass these credentials to +obtain credentials to access S3 buckets and pass them to jobs/queries, so granting them access to the service with the same access permissions as the user. @@ -37,9 +37,9 @@ the S3A client from the AWS STS service. They have a limited duration so restrict how long an application can access AWS on behalf of a user. Clients with this token have the full permissions of the user. -*Role Delegation Tokens:* These contain an "STS Session Token" requested by by the -STS "Assume Role" API, so grant the caller to interact with S3 as specific AWS -role, *with permissions restricted to purely accessing that specific S3 bucket*. +*Role Delegation Tokens:* These contain an "STS Session Token" requested by the +STS "Assume Role" API, granting the caller permission to interact with S3 using a specific IAM +role, *with permissions restricted to accessing a specific S3 bucket*. Role Delegation Tokens are the most powerful. By restricting the access rights of the granted STS token, no process receiving the token may perform @@ -55,13 +55,13 @@ see [S3A Delegation Token Architecture](delegation_token_architecture.html). ## Background: Hadoop Delegation Tokens. 
-A Hadoop Delegation Token are is a byte array of data which is submitted to -a Hadoop services as proof that the caller has the permissions to perform +A Hadoop Delegation Token is a byte array of data which is submitted to +Hadoop services as proof that the caller has the permissions to perform the operation which it is requesting — -and which can be passed between applications to *delegate* those permission. +and which can be passed between applications to *delegate* those permissions. -Tokens are opaque to clients, clients who simply get a byte array -of data which they must to provide to a service when required. +Tokens are opaque to clients. Clients simply get a byte array +of data which they must provide to a service when required. This normally contains encrypted data for use by the service. The service, which holds the password to encrypt/decrypt this data, @@ -79,7 +79,7 @@ After use, tokens may be revoked: this relies on services holding tables of valid tokens, either in memory or, for any HA service, in Apache Zookeeper or similar. Revoking tokens is used to clean up after jobs complete. -Delegation support is tightly integrated with YARN: requests to launch +Delegation Token support is tightly integrated with YARN: requests to launch containers and applications can include a list of delegation tokens to pass along. These tokens are serialized with the request, saved to a file on the node launching the container, and then loaded in to the credentials @@ -103,12 +103,12 @@ S3A now supports delegation tokens, so allowing a caller to acquire tokens from a local S3A Filesystem connector instance and pass them on to applications to grant them equivalent or restricted access. -These S3A Delegation Tokens are special in that they do not contain +These S3A Delegation Tokens are special because they do not contain password-protected data opaque to clients; they contain the secrets needed to access the relevant S3 buckets and associated services. 
They are obtained by requesting a delegation token from the S3A filesystem client. -Issued token mey be included in job submissions, passed to running applications, +Issued tokens may be included in job submissions, passed to running applications, etc. This token is specific to an individual bucket; all buckets which a client wishes to work with must have a separate delegation token issued. @@ -117,7 +117,7 @@ class, which then supports multiple "bindings" behind it, so supporting different variants of S3A Delegation Tokens. Because applications only collect Delegation Tokens in secure clusters, -It does mean that to be able to submit delegation tokens in transient +it does mean that to be able to submit delegation tokens in transient cloud-hosted Hadoop clusters, _these clusters must also have Kerberos enabled_. *Tip*: you should only be deploying Hadoop in public clouds with Kerberos enabled. @@ -141,10 +141,10 @@ for specifics details on the (current) token lifespan. ### S3A Role Delegation Tokens -A Role Delegation Tokens is created by asking the AWS +A Role Delegation Token is created by asking the AWS [Security Token Service](http://docs.aws.amazon.com/STS/latest/APIReference/Welcome.html) -for set of "Assumed Role" credentials, with a AWS account specific role for a limited duration.. -This role is restricted to only grant access the S3 bucket and all KMS keys, +for a set of "Assumed Role" session credentials with a limited lifetime, belonging to a given IAM Role. +The resulting session credentials are restricted to grant access to all KMS keys, and to the specific S3 bucket. They are marshalled into the S3A Delegation Token. Other S3A connectors can extract these credentials and use them to @@ -156,13 +156,13 @@ Issued tokens cannot be renewed or revoked. 
### S3A Full-Credential Delegation Tokens -Full Credential Delegation Tokens tokens contain the full AWS login details +Full Credential Delegation Tokens contain the full AWS login details (access key and secret key) needed to access a bucket. They never expire, so are the equivalent of storing the AWS account credentials in a Hadoop, Hive, Spark configuration or similar. -They differences are: +The differences are: 1. They are automatically passed from the client/user to the application. A remote application can use them to access data on behalf of the user. @@ -181,21 +181,20 @@ Hadoop security enabled —which inevitably means with Kerberos. Even though S3A delegation tokens do not use Kerberos, the code in applications which fetch DTs is normally only executed when the cluster is running in secure mode; somewhere where the `core-site.xml` configuration -sets `hadoop.security.authentication` to to `kerberos` or another valid +sets `hadoop.security.authentication` to `kerberos` or another valid authentication mechanism. -* Without enabling security at this level, delegation tokens will not +*Without enabling security at this level, delegation tokens will not be collected.* -Once Kerberos enabled, the process for acquiring tokens is as follows: +Once Kerberos is enabled, the process for acquiring tokens is as follows: 1. Enable Delegation token support by setting `fs.s3a.delegation.token.binding` to the classname of the token binding to use. -to use. 1. Add any other binding-specific settings (STS endpoint, IAM role, etc.) 1. Make sure the settings are the same in the service as well as the client. 1. In the client, switch to using a [Hadoop Credential Provider](hadoop-project-dist/hadoop-common/CredentialProviderAPI.html) -for storing your local credentials, *with a local filesystem store +for storing your local credentials, with a local filesystem store (`localjceks:` or `jcecks://file`), so as to keep the full secrets out of any job configurations. 1. 
Execute the client from a Kerberos-authenticated account @@ -215,7 +214,7 @@ application configured with the login credentials for an AWS account able to iss Hadoop MapReduce jobs copy their client-side configurations with the job. If your AWS login secrets are set in an XML file then they are picked up and passed in with the job, _even if delegation tokens are used to propagate -session or role secrets. +session or role secrets_. Spark-submit will take any credentials in the `spark-defaults.conf`file and again, spread them across the cluster. @@ -261,7 +260,7 @@ the same STS endpoint. * In experiments, a few hundred requests per second are needed to trigger throttling, so this is very unlikely to surface in production systems. * The S3A filesystem connector retries all throttled requests to AWS services, including STS. -* Other S3 clients with use the AWS SDK will, if configured, also retry throttled requests. +* Other S3 clients which use the AWS SDK will, if configured, also retry throttled requests. Overall, the risk of triggering STS throttling appears low, and most applications will recover from what is generally an intermittently used AWS service. @@ -303,7 +302,7 @@ relevant bucket, then a new session token will be issued. a session delegation token, then the existing token will be forwarded. The life of the token will not be extended. 1. If the application requesting a token does not have either of these, -the the tokens cannot be issued: the operation will fail with an error. +the token cannot be issued: the operation will fail with an error. The endpoint for STS requests are set by the same configuration @@ -353,10 +352,10 @@ it is authenticated with; the role token binding will fail. 
When the AWS credentials supplied to the Session Delegation Token binding through `fs.s3a.aws.credentials.provider` are themselves a set of -session credentials, generated delegation tokens with simply contain these -existing session credentials, a new set of credentials obtained from STS. +session credentials, generated delegation tokens will simply contain these +existing session credentials, not a new set of credentials obtained from STS. This is because the STS service does not let -callers authenticated with session/role credentials from requesting new sessions. +callers authenticated with session/role credentials request new sessions. This feature is useful when generating tokens from an EC2 VM instance in one IAM role and forwarding them over to VMs which are running in a different IAM role. @@ -384,7 +383,7 @@ There are some further configuration options: | **Key** | **Meaning** | **Default** | | --- | --- | --- | -| `fs.s3a.assumed.role.session.duration"` | Duration of delegation tokens | `1h` | +| `fs.s3a.assumed.role.session.duration` | Duration of delegation tokens | `1h` | | `fs.s3a.assumed.role.arn` | ARN for role to request | (undefined) | | `fs.s3a.assumed.role.sts.endpoint.region` | region for issued tokens | (undefined) | @@ -413,7 +412,8 @@ The XML settings needed to enable session tokens are: ``` A JSON role policy for the role/session will automatically be generated which will -consist of +consist of: + 1. Full access to the S3 bucket for all operations used by the S3A client (read, write, list, multipart operations, get bucket location, etc). 1. Full user access to KMS keys. This is to be able to decrypt any data @@ -449,7 +449,7 @@ relevant bucket, then a full credential token will be issued. a session delegation token, then the existing token will be forwarded. The life of the token will not be extended. 1. 
If the application requesting a token does not have either of these, -the the tokens cannot be issued: the operation will fail with an error. +the tokens cannot be issued: the operation will fail with an error. ## Managing the Delegation Tokens Duration @@ -465,7 +465,7 @@ that of the role itself: 1h by default, though this can be changed to 12h [In the IAM Console](https://console.aws.amazon.com/iam/home#/roles), or from the AWS CLI. -*Without increasing the duration of role, one hour is the maximum value; +Without increasing the duration of the role, one hour is the maximum value; the error message `The requested DurationSeconds exceeds the MaxSessionDuration set for this role` is returned if the requested duration of a Role Delegation Token is greater than that available for the role. @@ -545,7 +545,7 @@ Consult [troubleshooting Assumed Roles](assumed_roles.html#troubleshooting) for details on AWS error messages related to AWS IAM roles. The [cloudstore](https://github.com/steveloughran/cloudstore) module's StoreDiag -utility can also be used to explore delegation token support +utility can also be used to explore delegation token support. ### Submitted job cannot authenticate @@ -557,7 +557,7 @@ There are many causes for this; delegation tokens add some more. * This user is not `kinit`-ed in to Kerberos. Use `klist` and `hadoop kdiag` to see the Kerberos authentication state of the logged in user. -* The filesystem instance on the client has not had a token binding set in +* The filesystem instance on the client does not have a token binding set in `fs.s3a.delegation.token.binding`, so does not attempt to issue any. * The job submission is not aware that access to the specific S3 buckets are required. Review the application's submission mechanism to determine @@ -717,7 +717,7 @@ In the initial results of these tests: * A few hundred requests a second can be made before STS block the caller. 
* The throttling does not last very long (seconds) -* Tt does not appear to affect any other STS endpoints. +* It does not appear to affect any other STS endpoints. If developers wish to experiment with these tests and provide more detailed analysis, we would welcome this. Do bear in mind that all users of the @@ -749,7 +749,7 @@ Look at the other examples to see what to do; `SessionTokenIdentifier` does most of the work. Having a `toString()` method which is informative is ideal for the `hdfs creds` -command as well as debugging: *but do not print secrets* +command as well as debugging: *but do not print secrets*. *Important*: Add no references to any AWS SDK class, to ensure it can be safely deserialized whenever the relevant token @@ -835,13 +835,13 @@ Tests the lifecycle of session tokens. #### Integration Test `ITestSessionDelegationInFileystem`. This collects DTs from one filesystem, and uses that to create a new FS instance and -then perform filesystem operations. A miniKDC is instantiated +then perform filesystem operations. A miniKDC is instantiated. * Take care to remove all login secrets from the environment, so as to make sure that the second instance is picking up the DT information. * `UserGroupInformation.reset()` can be used to reset user secrets after every test case (e.g. teardown), so that issued DTs from one test case do not contaminate the next. -* its subclass, `ITestRoleDelegationInFileystem` adds a check that the current credentials +* Its subclass, `ITestRoleDelegationInFileystem` adds a check that the current credentials in the DT cannot be used to access data on other buckets —that is, the active session really is restricted to the target bucket. @@ -851,7 +851,7 @@ session really is restricted to the target bucket. 
It's not easy to bring up a YARN cluster with a secure HDFS and miniKDC controller in test cases —this test, the closest there is to an end-to-end test, uses mocking to mock the RPC calls to the YARN AM, and then verifies that the tokens -have been collected in the job context, +have been collected in the job context. #### Load Test `ILoadTestSessionCredentials`