HDFS-6824. Additional user documentation for HDFS encryption.

This commit is contained in:
Andrew Wang 2014-10-22 14:38:47 -07:00
parent 08457e9e57
commit a36399e09c
2 changed files with 106 additions and 26 deletions

View File

@ -290,6 +290,8 @@ Release 2.7.0 - UNRELEASED
HDFS-2486. Remove unnecessary priority level checks in
UnderReplicatedBlocks. (Uma Maheswara Rao G via szetszwo)
HDFS-6824. Additional user documentation for HDFS encryption. (wang)
OPTIMIZATIONS
BUG FIXES

View File

@ -23,11 +23,28 @@ Transparent Encryption in HDFS
* {Overview}
HDFS implements <transparent>, <end-to-end> encryption.
Once configured, data read from and written to HDFS is <transparently> encrypted and decrypted without requiring changes to user application code.
Once configured, data read from and written to special HDFS directories is <transparently> encrypted and decrypted without requiring changes to user application code.
This encryption is also <end-to-end>, which means the data can only be encrypted and decrypted by the client.
HDFS never stores or has access to unencrypted data or data encryption keys.
HDFS never stores or has access to unencrypted data or unencrypted data encryption keys.
This satisfies two typical requirements for encryption: <at-rest encryption> (meaning data on persistent media, such as a disk) as well as <in-transit encryption> (e.g. when data is travelling over the network).
* {Background}
Encryption can be done at different layers in a traditional data management software/hardware stack.
Choosing to encrypt at a given layer comes with different advantages and disadvantages.
* <<Application-level encryption>>. This is the most secure and most flexible approach. The application has ultimate control over what is encrypted and can precisely reflect the requirements of the user. However, writing applications to do this is hard. This is also not an option for customers of existing applications that do not support encryption.
* <<Database-level encryption>>. Similar to application-level encryption in terms of its properties. Most database vendors offer some form of encryption. However, there can be performance issues. One example is that indexes cannot be encrypted.
* <<Filesystem-level encryption>>. This option offers high performance, application transparency, and is typically easy to deploy. However, it is unable to model some application-level policies. For instance, multi-tenant applications might want to encrypt based on the end user. A database might want different encryption settings for each column stored within a single file.
* <<Disk-level encryption>>. Easy to deploy and high performance, but also quite inflexible. Only really protects against physical theft.
HDFS-level encryption fits between database-level and filesystem-level encryption in this stack. This has a lot of positive effects. HDFS encryption is able to provide good performance and existing Hadoop applications are able to run transparently on encrypted data. HDFS also has more context than traditional filesystems when it comes to making policy decisions.
HDFS-level encryption also prevents attacks at the filesystem-level and below (so-called "OS-level attacks"). The operating system and disk only interact with encrypted bytes, since the data is already encrypted by HDFS.
* {Use Cases}
Data encryption is required by a number of different government, financial, and regulatory entities.
@ -39,33 +56,29 @@ Transparent Encryption in HDFS
* {Architecture}
** {Key Management Server, KeyProvider, EDEKs}
A new cluster service is required to store, manage, and access encryption keys: the Hadoop <Key Management Server (KMS)>.
The KMS is a proxy that interfaces with a backing key store on behalf of HDFS daemons and clients.
Both the backing key store and the KMS implement the Hadoop KeyProvider client API.
See the {{{../../hadoop-kms/index.html}KMS documentation}} for more information.
In the KeyProvider API, each encryption key has a unique <key name>.
Because keys can be rolled, a key can have multiple <key versions>, where each key version has its own <key material> (the actual secret bytes used during encryption and decryption).
An encryption key can be fetched by either its key name, returning the latest version of the key, or by a specific key version.
The KMS implements additional functionality which enables creation and decryption of <encrypted encryption keys (EEKs)>.
Creation and decryption of EEKs happens entirely on the KMS.
Importantly, the client requesting creation or decryption of an EEK never handles the EEK's encryption key.
To create a new EEK, the KMS generates a new random key, encrypts it with the specified key, and returns the EEK to the client.
To decrypt an EEK, the KMS checks that the user has access to the encryption key, uses it to decrypt the EEK, and returns the decrypted encryption key.
In the context of HDFS encryption, EEKs are <encrypted data encryption keys (EDEKs)>, where a <data encryption key (DEK)> is what is used to encrypt and decrypt file data.
Typically, the key store is configured to only allow end users access to the keys used to encrypt DEKs.
This means that EDEKs can be safely stored and handled by HDFS, since the HDFS user will not have access to EDEK encryption keys.
** {Encryption zones}
** {Overview}
For transparent encryption, we introduce a new abstraction to HDFS: the <encryption zone>.
An encryption zone is a special directory whose contents will be transparently encrypted upon write and transparently decrypted upon read.
Each encryption zone is associated with a single <encryption zone key> which is specified when the zone is created.
Each file within an encryption zone has its own unique EDEK.
Each file within an encryption zone has its own unique <data encryption key (DEK)>.
DEKs are never handled directly by HDFS.
Instead, HDFS only ever handles an <encrypted data encryption key (EDEK)>.
Clients decrypt an EDEK, and then use the subsequent DEK to read and write data.
HDFS datanodes simply see a stream of encrypted bytes.
A new cluster service is required to manage encryption keys: the Hadoop Key Management Server (KMS).
In the context of HDFS encryption, the KMS performs three basic responsibilities:
[[1]] Providing access to stored encryption zone keys
[[1]] Generating new encrypted data encryption keys for storage on the NameNode
[[1]] Decrypting encrypted data encryption keys for use by HDFS clients
The KMS will be described in more detail below.
** {Accessing data within an encryption zone}
When creating a new file in an encryption zone, the NameNode asks the KMS to generate a new EDEK encrypted with the encryption zone's key.
The EDEK is then stored persistently as part of the file's metadata on the NameNode.
@ -80,11 +93,33 @@ Transparent Encryption in HDFS
This means that if HDFS is compromised (for example, by gaining unauthorized access to an HDFS superuser account), a malicious user only gains access to ciphertext and encrypted keys.
However, since access to encryption zone keys is controlled by a separate set of permissions on the KMS and key store, this does not pose a security threat.
** {Key Management Server, KeyProvider, EDEKs}
The KMS is a proxy that interfaces with a backing key store on behalf of HDFS daemons and clients.
Both the backing key store and the KMS implement the Hadoop KeyProvider API.
See the {{{../../hadoop-kms/index.html}KMS documentation}} for more information.
In the KeyProvider API, each encryption key has a unique <key name>.
Because keys can be rolled, a key can have multiple <key versions>, where each key version has its own <key material> (the actual secret bytes used during encryption and decryption).
An encryption key can be fetched by either its key name, returning the latest version of the key, or by a specific key version.
The KMS implements additional functionality which enables creation and decryption of <encrypted encryption keys (EEKs)>.
Creation and decryption of EEKs happens entirely on the KMS.
Importantly, the client requesting creation or decryption of an EEK never handles the EEK's encryption key.
To create a new EEK, the KMS generates a new random key, encrypts it with the specified key, and returns the EEK to the client.
To decrypt an EEK, the KMS checks that the user has access to the encryption key, uses it to decrypt the EEK, and returns the decrypted encryption key.
In the context of HDFS encryption, EEKs are <encrypted data encryption keys (EDEKs)>, where a <data encryption key (DEK)> is what is used to encrypt and decrypt file data.
Typically, the key store is configured to only allow end users access to the keys used to encrypt DEKs.
This means that EDEKs can be safely stored and handled by HDFS, since the HDFS user will not have access to unencrypted encryption keys.
* {Configuration}
A necessary prerequisite is an instance of the KMS, as well as a backing key store for the KMS.
See the {{{../../hadoop-kms/index.html}KMS documentation}} for more information.
Once a KMS has been set up and the NameNode and HDFS clients have been correctly configured, an admin can use the <<<hadoop key>>> and <<<hdfs crypto>>> command-line tools to create encryption keys and set up new encryption zones. Existing data can be encrypted by copying it into the new encryption zones using tools like distcp.
** Configuring the cluster KeyProvider
*** dfs.encryption.key.provider.uri
@ -152,6 +187,48 @@ Transparent Encryption in HDFS
List all encryption zones. Requires superuser permissions.
* {Example usage}
These instructions assume that you are running as the normal user or HDFS superuser as is appropriate.
Use <<<sudo>>> as needed for your environment.
-------------------------
# As the normal user, create a new encryption key
hadoop key create myKey
# As the super user, create a new empty directory and make it an encryption zone
hadoop fs -mkdir /zone
hdfs crypto -createZone -keyName myKey -path /zone
# chown it to the normal user
hadoop fs -chown myuser:myuser /zone
# As the normal user, put a file in, read it out
hadoop fs -put helloWorld /zone
hadoop fs -cat /zone/helloWorld
-------------------------
* {Distcp considerations}
** {Running as the superuser}
One common usecase for distcp is to replicate data between clusters for backup and disaster recovery purposes.
This is typically performed by the cluster administrator, who is an HDFS superuser.
To enable this same workflow when using HDFS encryption, we introduced a new virtual path prefix, <<</.reserved/raw/>>>, that gives superusers direct access to the underlying block data in the filesystem.
This allows superusers to distcp data without needing having access to encryption keys, and also avoids the overhead of decrypting and re-encrypting data. It also means the source and destination data will be byte-for-byte identical, which would not be true if the data was being re-encrypted with a new EDEK.
When using <<</.reserved/raw>>> to distcp encrypted data, it's important to preserve extended attributes with the {{-px}} flag.
This is because encrypted file attributes (such as the EDEK) are exposed through extended attributes within <<</.reserved/raw>>>, and must be preserved to be able to decrypt the file.
This means that if the distcp is initiated at or above the encryption zone root, it will automatically create an encryption zone at the destination if it does not already exist.
However, it's still recommended that the admin first create identical encryption zones on the destination cluster to avoid any potential mishaps.
** {Copying between encrypted and unencrypted locations}
By default, distcp compares checksums provided by the filesystem to verify that the data was successfully copied to the destination.
When copying between an unencrypted and encrypted location, the filesystem checksums will not match since the underlying block data is different.
In this case, specify the {{-skipcrccheck}} and {{-update}} distcp flags to avoid verifying checksums.
* {Attack vectors}
** {Hardware access exploits}
@ -208,5 +285,6 @@ Transparent Encryption in HDFS
** {Rogue user exploits}
A rogue user can collect keys to which they have access, and use them later to decrypt encrypted data.
A rogue user can collect keys of files they have access to, and use them later to decrypt the encrypted data of those files.
As the user had access to those files, they already had access to the file contents.
This can be mitigated through periodic key rolling policies.