HDFS-6394. HDFS encryption documentation. (wang)

git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/branches/fs-encryption@1616016 13f79535-47bb-0310-9956-ffa450edef68
Andrew Wang 2014-08-05 21:49:31 +00:00
parent 7a246c447f
commit ac47ad11de
4 changed files with 211 additions and 2 deletions


@@ -74,6 +74,8 @@ fs-encryption (Unreleased)
 HDFS-6780. Batch the encryption zones listing API. (wang)
+HDFS-6394. HDFS encryption documentation. (wang)
 OPTIMIZATIONS
 BUG FIXES


@@ -125,7 +125,7 @@ public String getName() {
     @Override
     public String getShortUsage() {
-      return "[" + getName() + " -keyName <keyName> -path <path> " + "]\n";
+      return "[" + getName() + " -keyName <keyName> -path <path>]\n";
     }
     @Override
@@ -187,7 +187,7 @@ public String getShortUsage() {
     @Override
     public String getLongUsage() {
       return getShortUsage() + "\n" +
-        "List all encryption zones.\n\n";
+        "List all encryption zones. Requires superuser permissions.\n\n";
     }
     @Override


@@ -0,0 +1,206 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop Distributed File System-${project.version} - Transparent Encryption in HDFS
---
---
${maven.build.timestamp}
Transparent Encryption in HDFS
%{toc|section=1|fromDepth=2|toDepth=3}
* {Overview}
HDFS implements <transparent>, <end-to-end> encryption.
Once configured, data read from and written to HDFS is <transparently> encrypted and decrypted without requiring changes to user application code.
This encryption is also <end-to-end>, which means the data can only be encrypted and decrypted by the client.
HDFS never stores or has access to unencrypted data or data encryption keys.
This satisfies two typical requirements for encryption: <at-rest encryption> (meaning data on persistent media, such as a disk) as well as <in-transit encryption> (e.g. when data is travelling over the network).
* {Use Cases}
Data encryption is required by a number of different government, financial, and regulatory entities.
For example, the health-care industry has HIPAA regulations, the card payment industry has PCI DSS regulations, and the US government has FISMA regulations.
Having transparent encryption built into HDFS makes it easier for organizations to comply with these regulations.
Encryption can also be performed at the application-level, but by integrating it into HDFS, existing applications can operate on encrypted data without changes.
This integrated architecture implies stronger encrypted file semantics and better coordination with other HDFS functions.
* {Architecture}
** {Key Management Server, KeyProvider, EDEKs}
A new cluster service is required to store, manage, and access encryption keys: the Hadoop <Key Management Server (KMS)>.
The KMS is a proxy that interfaces with a backing key store on behalf of HDFS daemons and clients.
Both the backing key store and the KMS implement the Hadoop KeyProvider client API.
See the {{{../../hadoop-kms/index.html}KMS documentation}} for more information.
In the KeyProvider API, each encryption key has a unique <key name>.
Because keys can be rolled, a key can have multiple <key versions>, where each key version has its own <key material> (the actual secret bytes used during encryption and decryption).
An encryption key can be fetched by either its key name, returning the latest version of the key, or by a specific key version.
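The relationship between key names, key versions, and key material can be sketched with a small in-memory model. This is only an illustration: <<<ToyKeyStore>>>, its method names, and the <<<name@N>>> version naming used here are assumptions of the sketch, not the Hadoop KeyProvider API.

```java
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory key store for illustration; not the Hadoop KeyProvider API.
class ToyKeyStore {
    // Key name -> key material per version; list index i holds version i.
    private final Map<String, List<byte[]>> keys = new HashMap<>();
    private final SecureRandom random = new SecureRandom();

    // Creates the key if it is new, otherwise rolls it. Returns the new version name.
    String createOrRoll(String keyName) {
        byte[] material = new byte[16]; // 128 bits of fresh key material
        random.nextBytes(material);
        List<byte[]> versions = keys.computeIfAbsent(keyName, k -> new ArrayList<>());
        versions.add(material);
        return keyName + "@" + (versions.size() - 1);
    }

    // Fetch by key name: returns the material of the latest version.
    byte[] getCurrent(String keyName) {
        List<byte[]> versions = keys.get(keyName);
        return versions.get(versions.size() - 1);
    }

    // Fetch by a specific version name such as "zoneKey@0".
    byte[] getVersion(String versionName) {
        int at = versionName.lastIndexOf('@');
        int index = Integer.parseInt(versionName.substring(at + 1));
        return keys.get(versionName.substring(0, at)).get(index);
    }

    public static void main(String[] args) {
        ToyKeyStore store = new ToyKeyStore();
        String v0 = store.createOrRoll("zoneKey");
        String v1 = store.createOrRoll("zoneKey"); // roll: adds a second version
        System.out.println(v0 + " then " + v1);
        // The key name now resolves to the rolled version; the old version stays fetchable.
        System.out.println(Arrays.equals(store.getCurrent("zoneKey"), store.getVersion(v1)));
    }
}
```

Keeping old versions fetchable matters because data encrypted under an earlier version must remain decryptable after the key is rolled.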
The KMS implements additional functionality which enables creation and decryption of <encrypted encryption keys (EEKs)>.
Creation and decryption of EEKs happens entirely on the KMS.
Importantly, the client requesting creation or decryption of an EEK never handles the EEK's encryption key.
To create a new EEK, the KMS generates a new random key, encrypts it with the specified key, and returns the EEK to the client.
To decrypt an EEK, the KMS checks that the user has access to the encryption key, uses it to decrypt the EEK, and returns the decrypted encryption key.
In the context of HDFS encryption, EEKs are <encrypted data encryption keys (EDEKs)>, where a <data encryption key (DEK)> is what is used to encrypt and decrypt file data.
Typically, the key store is configured to only allow end users access to the keys used to encrypt DEKs.
This means that EDEKs can be safely stored and handled by HDFS, since the HDFS user will not have access to EDEK encryption keys.
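The EEK operations described above can be sketched with the standard Java cryptography classes. This is a minimal model under stated assumptions: <<<ToyKms>>> and <<<Eek>>> are hypothetical names, the key lives in memory rather than in a backing key store, and the boolean <<<callerMayUseKey>>> stands in for whatever access check a real KMS performs.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical stand-in for the KMS; the encryption key never leaves this class.
class ToyKms {
    // An encrypted encryption key together with the IV used to encrypt it.
    static final class Eek {
        final byte[] iv;
        final byte[] encryptedKey;
        Eek(byte[] iv, byte[] encryptedKey) {
            this.iv = iv;
            this.encryptedKey = encryptedKey;
        }
    }

    private final SecureRandom random = new SecureRandom();
    private final byte[] encryptionKey = randomBytes(16);

    private byte[] randomBytes(int n) {
        byte[] bytes = new byte[n];
        random.nextBytes(bytes);
        return bytes;
    }

    // Create an EEK: generate a new random key and encrypt it with the encryption key.
    Eek generateEek() {
        byte[] newKey = randomBytes(16);
        byte[] iv = randomBytes(16);
        return new Eek(iv, aesCtr(Cipher.ENCRYPT_MODE, iv, newKey));
    }

    // Decrypt an EEK: check access first, then return the decrypted key.
    byte[] decryptEek(Eek eek, boolean callerMayUseKey) {
        if (!callerMayUseKey) {
            throw new SecurityException("caller may not use this encryption key");
        }
        return aesCtr(Cipher.DECRYPT_MODE, eek.iv, eek.encryptedKey);
    }

    private byte[] aesCtr(int mode, byte[] iv, byte[] input) {
        try {
            Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
            cipher.init(mode, new SecretKeySpec(encryptionKey, "AES"), new IvParameterSpec(iv));
            return cipher.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ToyKms kms = new ToyKms();
        Eek eek = kms.generateEek();
        System.out.println("decrypted key length: " + kms.decryptEek(eek, true).length);
    }
}
```

Note that neither method returns or accepts the encryption key itself, which mirrors the point that the requesting client never handles it.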
** {Encryption zones}
For transparent encryption, we introduce a new abstraction to HDFS: the <encryption zone>.
An encryption zone is a special directory whose contents will be transparently encrypted upon write and transparently decrypted upon read.
Each encryption zone is associated with a single <encryption zone key> which is specified when the zone is created.
Each file within an encryption zone has its own unique EDEK.
When creating a new file in an encryption zone, the NameNode asks the KMS to generate a new EDEK encrypted with the encryption zone's key.
The EDEK is then stored persistently as part of the file's metadata on the NameNode.
When reading a file within an encryption zone, the NameNode provides the client with the file's EDEK and the encryption zone key version used to encrypt the EDEK.
The client then asks the KMS to decrypt the EDEK, which involves checking that the client has permission to access the encryption zone key version.
Assuming that is successful, the client uses the DEK to decrypt the file's contents.
All of the above steps for the read and write path happen automatically through interactions between the DFSClient, the NameNode, and the KMS.
Access to encrypted file data and metadata is controlled by normal HDFS filesystem permissions.
This means that if HDFS is compromised (for example, by gaining unauthorized access to an HDFS superuser account), a malicious user only gains access to ciphertext and encrypted keys.
However, because access to encryption zone keys is controlled by a separate set of permissions on the KMS and key store, such a compromise alone does not expose cleartext.
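The write and read path for a file in an encryption zone can be modeled end to end in a few lines. The sketch below is a toy under several assumptions: all class, field, and method names are hypothetical, the NameNode, DataNode, and KMS are reduced to static maps and methods in one process, the KMS permission check is omitted, and a separate random IV is used for file data.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Toy model of the write and read path. All names here are hypothetical.
class ZoneFlowSketch {
    // What the NameNode keeps per file: the EDEK plus IVs, never the DEK itself.
    static final class FileInfo {
        final byte[] edekIv;
        final byte[] edek;
        final byte[] fileIv;
        FileInfo(byte[] edekIv, byte[] edek, byte[] fileIv) {
            this.edekIv = edekIv;
            this.edek = edek;
            this.fileIv = fileIv;
        }
    }

    static final SecureRandom RANDOM = new SecureRandom();
    static final byte[] ZONE_KEY = randomBytes(16);                 // known only to the KMS
    static final Map<String, FileInfo> NAMENODE = new HashMap<>();  // file metadata
    static final Map<String, byte[]> DATANODE = new HashMap<>();    // ciphertext only

    static byte[] randomBytes(int n) {
        byte[] bytes = new byte[n];
        RANDOM.nextBytes(bytes);
        return bytes;
    }

    static byte[] aesCtr(int mode, byte[] key, byte[] iv, byte[] input) {
        try {
            Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
            cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            return cipher.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // File creation: the NameNode asks the KMS for a new EDEK and stores it as metadata.
    static FileInfo create(String path) {
        byte[] edekIv = randomBytes(16);
        byte[] dek = randomBytes(16); // generated inside the KMS
        byte[] edek = aesCtr(Cipher.ENCRYPT_MODE, ZONE_KEY, edekIv, dek);
        FileInfo info = new FileInfo(edekIv, edek, randomBytes(16));
        NAMENODE.put(path, info);
        return info;
    }

    // KMS call made by the client: decrypt the EDEK (permission check omitted here).
    static byte[] kmsDecrypt(FileInfo info) {
        return aesCtr(Cipher.DECRYPT_MODE, ZONE_KEY, info.edekIv, info.edek);
    }

    // Client write: obtain the DEK via the KMS, then store only ciphertext.
    static void write(String path, String text) {
        FileInfo info = create(path);
        byte[] dek = kmsDecrypt(info);
        byte[] plain = text.getBytes(StandardCharsets.UTF_8);
        DATANODE.put(path, aesCtr(Cipher.ENCRYPT_MODE, dek, info.fileIv, plain));
    }

    // Client read: fetch the EDEK from metadata, decrypt it via the KMS, decrypt the data.
    static String read(String path) {
        FileInfo info = NAMENODE.get(path);
        byte[] dek = kmsDecrypt(info);
        byte[] plain = aesCtr(Cipher.DECRYPT_MODE, dek, info.fileIv, DATANODE.get(path));
        return new String(plain, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        write("/zone/report.txt", "quarterly numbers");
        System.out.println(read("/zone/report.txt"));
    }
}
```

In this model the two maps that represent HDFS hold only the EDEK and ciphertext, so reading either map without <<<ZONE_KEY>>> yields nothing usable.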
* {Configuration}
A necessary prerequisite is an instance of the KMS, as well as a backing key store for the KMS.
See the {{{../../hadoop-kms/index.html}KMS documentation}} for more information.
** Selecting an encryption algorithm and codec
*** hadoop.security.crypto.codec.classes.EXAMPLECIPHERSUITE
The prefix for a given crypto codec (e.g. EXAMPLECIPHERSUITE). The value is a comma-separated list of implementation classes for that codec.
The first implementation is used if it is available; the others are fallbacks.
*** hadoop.security.crypto.codec.classes.aes.ctr.nopadding
Default: <<<org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec,org.apache.hadoop.crypto.JceAesCtrCryptoCodec>>>
Comma-separated list of crypto codec implementations for AES/CTR/NoPadding.
The first implementation is used if it is available; the others are fallbacks.
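The "first available, others are fallbacks" rule can be illustrated with a short selection loop. This is a sketch only, not the codec-loading code in Hadoop: <<<CodecFallbackSketch>>> is a hypothetical name, and "available" is approximated here as "the class can be loaded", which may be weaker than the check a real codec factory performs.

```java
import java.util.Optional;

// Hypothetical helper showing first-available selection from a comma-separated list.
class CodecFallbackSketch {
    // Returns the first class name in the list that can be loaded, if any.
    static Optional<String> firstAvailable(String commaSeparatedClasses) {
        for (String name : commaSeparatedClasses.split(",")) {
            String trimmed = name.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            try {
                Class.forName(trimmed);
                return Optional.of(trimmed);
            } catch (ClassNotFoundException | LinkageError e) {
                // Not usable in this environment; try the next candidate.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // com.example.MissingCodec is a made-up name that should fail to load.
        String configured = "com.example.MissingCodec, javax.crypto.Cipher";
        System.out.println(firstAvailable(configured).orElse("no implementation available"));
    }
}
```

Ordering the list from most to least preferred lets a faster implementation be chosen where it exists while a portable one remains as a fallback.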
*** hadoop.security.crypto.cipher.suite
Default: <<<AES/CTR/NoPadding>>>
Cipher suite for crypto codec.
*** hadoop.security.crypto.jce.provider
Default: None
The JCE provider name used in CryptoCodec.
*** hadoop.security.crypto.buffer.size
Default: <<<8192>>>
The buffer size used by CryptoInputStream and CryptoOutputStream.
** Namenode configuration
*** dfs.namenode.list.encryption.zones.num.responses
Default: <<<100>>>
When listing encryption zones, the maximum number of zones that will be returned in a batch.
Fetching the list incrementally in batches improves namenode performance.
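Batched listing can be pictured as a client that keeps requesting bounded pages until one comes back short. The following sketch is an assumption-laden illustration: <<<BatchedZoneListing>>> is a hypothetical class, the cursor is a plain list offset, and <<<maxBatch>>> merely plays the role of the configured maximum.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of listing zones in bounded batches.
class BatchedZoneListing {
    private final List<String> zones; // server-side state: all zone paths
    private final int maxBatch;       // plays the role of the configured maximum
    int calls = 0;                    // number of batch requests served

    BatchedZoneListing(List<String> zones, int maxBatch) {
        if (maxBatch <= 0) {
            throw new IllegalArgumentException("maxBatch must be positive");
        }
        this.zones = zones;
        this.maxBatch = maxBatch;
    }

    // Server side: return at most maxBatch zones starting at the given offset.
    List<String> listBatch(int offset) {
        calls++;
        int end = Math.min(zones.size(), offset + maxBatch);
        return new ArrayList<>(zones.subList(offset, end));
    }

    // Client side: keep requesting batches until a short batch signals the end.
    List<String> listAll() {
        List<String> result = new ArrayList<>();
        List<String> batch;
        do {
            batch = listBatch(result.size());
            result.addAll(batch);
        } while (batch.size() == maxBatch);
        return result;
    }

    public static void main(String[] args) {
        List<String> zones = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            zones.add("/zone" + i);
        }
        BatchedZoneListing listing = new BatchedZoneListing(zones, 100);
        System.out.println(listing.listAll().size() + " zones in " + listing.calls + " batches");
    }
}
```

Bounding each response keeps any single request cheap for the server, at the cost of more round trips for large listings.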
* {<<<crypto>>> command-line interface}
** {createZone}
Usage: <<<[-createZone -keyName <keyName> -path <path>]>>>
Create a new encryption zone.
*--+--+
<path> | The path of the encryption zone to create. It must be an empty directory.
*--+--+
<keyName> | Name of the key to use for the encryption zone.
*--+--+
** {listZones}
Usage: <<<[-listZones]>>>
List all encryption zones. Requires superuser permissions.
* {Attack vectors}
** {Hardware access exploits}
These exploits assume that the attacker has gained physical access to hard drives from cluster machines, i.e. datanodes and namenodes.
[[1]] Access to swap files of processes containing data encryption keys.
* By itself, this does not expose cleartext, as it also requires access to encrypted block files.
* This can be mitigated by disabling swap, using encrypted swap, or using mlock to prevent keys from being swapped out.
[[1]] Access to encrypted block files.
* By itself, this does not expose cleartext, as it also requires access to DEKs.
** {Root access exploits}
These exploits assume that the attacker has gained root shell access to cluster machines, i.e. datanodes and namenodes.
Many of these exploits cannot be addressed in HDFS, since a malicious root user has access to the in-memory state of processes holding encryption keys and cleartext.
For these exploits, the only mitigation technique is carefully restricting and monitoring root shell access.
[[1]] Access to encrypted block files.
* By itself, this does not expose cleartext, as it also requires access to encryption keys.
[[1]] Dump memory of client processes to obtain DEKs, delegation tokens, and cleartext.
* No mitigation.
[[1]] Recording network traffic to sniff encryption keys and encrypted data in transit.
* By itself, insufficient to read cleartext without the EDEK encryption key.
[[1]] Dump memory of datanode process to obtain encrypted block data.
* By itself, insufficient to read cleartext without the DEK.
[[1]] Dump memory of namenode process to obtain encrypted data encryption keys.
* By itself, insufficient to read cleartext without the EDEK's encryption key and encrypted block files.
** {HDFS admin exploits}
These exploits assume that the attacker has compromised HDFS, but does not have root or <<<hdfs>>> user shell access.
[[1]] Access to encrypted block files.
* By itself, insufficient to read cleartext without the EDEK and EDEK encryption key.
[[1]] Access to encryption zone and encrypted file metadata (including encrypted data encryption keys), via -fetchImage.
* By itself, insufficient to read cleartext without EDEK encryption keys.
** {Rogue user exploits}
A rogue user can collect keys to which they have access, and use them later to decrypt encrypted data.
This can be mitigated through periodic key rolling policies.


@@ -89,6 +89,7 @@
 <item name="HDFS NFS Gateway" href="hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html"/>
 <item name="HDFS Rolling Upgrade" href="hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html"/>
 <item name="Extended Attributes" href="hadoop-project-dist/hadoop-hdfs/ExtendedAttributes.html"/>
+<item name="Transparent Encryption" href="hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html"/>
 <item name="HDFS Support for Multihoming" href="hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html"/>
 </menu>