diff --git a/hadoop-hdfs-project/hadoop-hdfs/CHANGES-fs-encryption.txt b/hadoop-hdfs-project/hadoop-hdfs/CHANGES-fs-encryption.txt
index b88b619247..4673fac180 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/CHANGES-fs-encryption.txt
+++ b/hadoop-hdfs-project/hadoop-hdfs/CHANGES-fs-encryption.txt
@@ -74,6 +74,8 @@ fs-encryption (Unreleased)
 
     HDFS-6780. Batch the encryption zones listing API. (wang)
 
+    HDFS-6394. HDFS encryption documentation. (wang)
+
   OPTIMIZATIONS
 
   BUG FIXES
diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/CryptoAdmin.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/CryptoAdmin.java
index fcad730075..bb52ddd153 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/CryptoAdmin.java
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/CryptoAdmin.java
@@ -125,7 +125,7 @@ public String getName() {
 
     @Override
     public String getShortUsage() {
-      return "[" + getName() + " -keyName <keyName> -path <path> " + "]\n";
+      return "[" + getName() + " -keyName <keyName> -path <path>]\n";
     }
 
     @Override
@@ -187,7 +187,7 @@ public String getShortUsage() {
     @Override
     public String getLongUsage() {
       return getShortUsage() + "\n" +
-        "List all encryption zones.\n\n";
+        "List all encryption zones. Requires superuser permissions.\n\n";
     }
 
     @Override
diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/TransparentEncryption.apt.vm b/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/TransparentEncryption.apt.vm
new file mode 100644
index 0000000000..3689a775ef
--- /dev/null
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/TransparentEncryption.apt.vm
@@ -0,0 +1,206 @@
+~~ Licensed under the Apache License, Version 2.0 (the "License");
+~~ you may not use this file except in compliance with the License.
+~~ You may obtain a copy of the License at
+~~
+~~   http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License. See accompanying LICENSE file.
+
+  ---
+  Hadoop Distributed File System-${project.version} - Transparent Encryption in HDFS
+  ---
+  ---
+  ${maven.build.timestamp}
+
+Transparent Encryption in HDFS
+
+%{toc|section=1|fromDepth=2|toDepth=3}
+
+* {Overview}
+
+  HDFS implements <transparent>, <end-to-end> encryption.
+  Once configured, data read from and written to HDFS is encrypted and decrypted without requiring changes to user application code.
+  This encryption is also <end-to-end>, which means the data can only be encrypted and decrypted by the client.
+  HDFS never stores or has access to unencrypted data or data encryption keys.
+  This satisfies two typical requirements for encryption: <at-rest encryption> (meaning data on persistent media, such as a disk) as well as <in-transit encryption> (e.g. when data is travelling over the network).
+
+* {Use Cases}
+
+  Data encryption is required by a number of different government, financial, and regulatory entities.
+  For example, the health-care industry has HIPAA regulations, the card payment industry has PCI DSS regulations, and the US government has FISMA regulations.
+  Having transparent encryption built into HDFS makes it easier for organizations to comply with these regulations.
+
+  Encryption can also be performed at the application level, but by integrating it into HDFS, existing applications can operate on encrypted data without changes.
+  This integrated architecture implies stronger encrypted file semantics and better coordination with other HDFS functions.
+
+* {Architecture}
+
+** {Key Management Server, KeyProvider, EDEKs}
+
+  A new cluster service is required to store, manage, and access encryption keys: the Hadoop <Key Management Server (KMS)>.
+  The KMS is a proxy that interfaces with a backing key store on behalf of HDFS daemons and clients.
+  Both the backing key store and the KMS implement the Hadoop KeyProvider client API.
+  See the {{{../../hadoop-kms/index.html}KMS documentation}} for more information.
+
+  In the KeyProvider API, each encryption key has a unique <key name>.
+  Because keys can be rolled, a key can have multiple <key versions>, where each key version has its own <key material> (the actual secret bytes used during encryption and decryption).
+  An encryption key can be fetched by either its key name, returning the latest version of the key, or by a specific key version.
+
+  The KMS implements additional functionality which enables creation and decryption of <encrypted encryption keys (EEKs)>.
+  Creation and decryption of EEKs happen entirely on the KMS.
+  Importantly, the client requesting creation or decryption of an EEK never handles the EEK's encryption key.
+  To create a new EEK, the KMS generates a new random key, encrypts it with the specified key, and returns the EEK to the client.
+  To decrypt an EEK, the KMS checks that the user has access to the encryption key, uses it to decrypt the EEK, and returns the decrypted encryption key.
+
+  In the context of HDFS encryption, EEKs are <encrypted data encryption keys (EDEKs)>, where a <data encryption key (DEK)> is what is used to encrypt and decrypt file data.
+  Typically, the key store is configured to only allow end users access to the keys used to encrypt DEKs.
+  This means that EDEKs can be safely stored and handled by HDFS, since the HDFS user will not have access to EDEK encryption keys.
+
+** {Encryption zones}
+
+  For transparent encryption, we introduce a new abstraction to HDFS: the <encryption zone>.
+  An encryption zone is a special directory whose contents will be transparently encrypted upon write and transparently decrypted upon read.
+  Each encryption zone is associated with a single <encryption zone key> which is specified when the zone is created.
+  Each file within an encryption zone has its own unique EDEK.
+
+  When creating a new file in an encryption zone, the NameNode asks the KMS to generate a new EDEK encrypted with the encryption zone's key.
+  The EDEK is then stored persistently as part of the file's metadata on the NameNode.
+
+  When reading a file within an encryption zone, the NameNode provides the client with the file's EDEK and the encryption zone key version used to encrypt the EDEK.
+  The client then asks the KMS to decrypt the EDEK, which involves checking that the client has permission to access the encryption zone key version.
+  Assuming that is successful, the client uses the DEK to decrypt the file's contents.
+
+  All of the above steps for the read and write path happen automatically through interactions between the DFSClient, the NameNode, and the KMS.
+
+  Access to encrypted file data and metadata is controlled by normal HDFS filesystem permissions.
+  This means that if HDFS is compromised (for example, by gaining unauthorized access to an HDFS superuser account), a malicious user only gains access to ciphertext and encrypted keys.
+  However, since access to encryption zone keys is controlled by a separate set of permissions on the KMS and key store, this does not pose a security threat.
+
+* {Configuration}
+
+  A necessary prerequisite is an instance of the KMS, as well as a backing key store for the KMS.
+  See the {{{../../hadoop-kms/index.html}KMS documentation}} for more information.
+
+** Selecting an encryption algorithm and codec
+
+*** hadoop.security.crypto.codec.classes.EXAMPLECIPHERSUITE
+
+  The property prefix for a given crypto codec (e.g. EXAMPLECIPHERSUITE). Its value is a comma-separated list of implementation classes for that codec.
+  The first available implementation is used; the others are fallbacks.
+
+*** hadoop.security.crypto.codec.classes.aes.ctr.nopadding
+
+  Default: <<<org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec,org.apache.hadoop.crypto.JceAesCtrCryptoCodec>>>
+
+  Comma-separated list of crypto codec implementations for AES/CTR/NoPadding.
+  The first available implementation is used; the others are fallbacks.
+
+*** hadoop.security.crypto.cipher.suite
+
+  Default: <<<AES/CTR/NoPadding>>>
+
+  Cipher suite for crypto codec.
+
+*** hadoop.security.crypto.jce.provider
+
+  Default: None
+
+  The JCE provider name used in CryptoCodec.
+
+*** hadoop.security.crypto.buffer.size
+
+  Default: <<<8192>>>
+
+  The buffer size used by CryptoInputStream and CryptoOutputStream.
+
+** Namenode configuration
+
+*** dfs.namenode.list.encryption.zones.num.responses
+
+  Default: <<<100>>>
+
+  When listing encryption zones, the maximum number of zones that will be returned in a batch.
+  Fetching the list incrementally in batches improves namenode performance.
+
+* {<<<crypto>>> command-line interface}
+
+** {createZone}
+
+  Usage: <<<[-createZone -keyName <keyName> -path <path>]>>>
+
+  Create a new encryption zone.
+
+*--+--+
+<path> | The path of the encryption zone to create. It must be an empty directory.
+*--+--+
+<keyName> | Name of the key to use for the encryption zone.
+*--+--+
+
+** {listZones}
+
+  Usage: <<<[-listZones]>>>
+
+  List all encryption zones. Requires superuser permissions.
+
+* {Attack vectors}
+
+** {Hardware access exploits}
+
+  These exploits assume that an attacker has gained physical access to hard drives from cluster machines, i.e. datanodes and namenodes.
+
+  [[1]] Access to swap files of processes containing data encryption keys.
+
+    * By itself, this does not expose cleartext, as it also requires access to encrypted block files.
+
+    * This can be mitigated by disabling swap, using encrypted swap, or using mlock to prevent keys from being swapped out.
+
+  [[1]] Access to encrypted block files.
+
+    * By itself, this does not expose cleartext, as it also requires access to DEKs.
+
+** {Root access exploits}
+
+  These exploits assume that an attacker has gained root shell access to cluster machines, i.e. datanodes and namenodes.
+  Many of these exploits cannot be addressed in HDFS, since a malicious root user has access to the in-memory state of processes holding encryption keys and cleartext.
+  For these exploits, the only mitigation technique is carefully restricting and monitoring root shell access.
+
+  [[1]] Access to encrypted block files.
+
+    * By itself, this does not expose cleartext, as it also requires access to encryption keys.
+
+  [[1]] Dump memory of client processes to obtain DEKs, delegation tokens, and cleartext.
+
+    * No mitigation.
+
+  [[1]] Recording network traffic to sniff encryption keys and encrypted data in transit.
+
+    * By itself, insufficient to read cleartext without the EDEK encryption key.
+
+  [[1]] Dump memory of datanode process to obtain encrypted block data.
+
+    * By itself, insufficient to read cleartext without the DEK.
+
+  [[1]] Dump memory of namenode process to obtain encrypted data encryption keys.
+
+    * By itself, insufficient to read cleartext without the EDEK's encryption key and encrypted block files.
+
+** {HDFS admin exploits}
+
+  These exploits assume that the attacker has compromised HDFS, but does not have root or <<<hdfs>>> user shell access.
+
+  [[1]] Access to encrypted block files.
+
+    * By itself, insufficient to read cleartext without the EDEK and EDEK encryption key.
+
+  [[1]] Access to encryption zone and encrypted file metadata (including encrypted data encryption keys), via -fetchImage.
+
+    * By itself, insufficient to read cleartext without EDEK encryption keys.
+
+** {Rogue user exploits}
+
+  A rogue user can collect keys to which they have access, and use them later to decrypt encrypted data.
+  This can be mitigated through periodic key rolling policies.
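
The EDEK flow from the Architecture section, and the way key rolling interacts with it, can be sketched with a small in-memory model. This is only an illustration under stated assumptions: `ToyKms`, `Edek`, `rollKey`, and the in-memory key map are hypothetical names invented for this example and are not the Hadoop KMS or KeyProvider API. A real KMS would also authenticate the caller and check key permissions before decrypting, which the sketch omits.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Toy in-memory key server. Hypothetical; NOT the Hadoop KMS API. */
final class ToyKms {
  /** An encrypted DEK plus the metadata needed to decrypt it later. */
  static final class Edek {
    final String keyName;
    final int version;       // key version that encrypted this EDEK
    final byte[] iv;
    final byte[] ciphertext;

    Edek(String keyName, int version, byte[] iv, byte[] ciphertext) {
      this.keyName = keyName;
      this.version = version;
      this.iv = iv;
      this.ciphertext = ciphertext;
    }
  }

  private final SecureRandom random = new SecureRandom();
  // key name -> key material for each version, oldest first
  private final Map<String, List<byte[]>> keys = new HashMap<>();

  /** Creates the key if absent, otherwise rolls it; returns the new version index. */
  int rollKey(String keyName) {
    byte[] material = new byte[16];
    random.nextBytes(material);
    List<byte[]> versions = keys.computeIfAbsent(keyName, k -> new ArrayList<>());
    versions.add(material);
    return versions.size() - 1;
  }

  /** Generates a random DEK and returns it encrypted under the latest key version. */
  Edek generateEncryptedKey(String keyName) {
    List<byte[]> versions = keys.get(keyName);
    if (versions == null) {
      throw new IllegalArgumentException("unknown key: " + keyName);
    }
    int version = versions.size() - 1;
    byte[] dek = new byte[16];
    byte[] iv = new byte[16];
    random.nextBytes(dek);
    random.nextBytes(iv);
    byte[] ciphertext = crypt(Cipher.ENCRYPT_MODE, versions.get(version), iv, dek);
    return new Edek(keyName, version, iv, ciphertext);
  }

  /** Decrypts an EDEK with the exact key version recorded in it (no access check here). */
  byte[] decryptEncryptedKey(Edek edek) {
    List<byte[]> versions = keys.get(edek.keyName);
    if (versions == null) {
      throw new IllegalArgumentException("unknown key: " + edek.keyName);
    }
    return crypt(Cipher.DECRYPT_MODE, versions.get(edek.version), edek.iv, edek.ciphertext);
  }

  private static byte[] crypt(int mode, byte[] key, byte[] iv, byte[] input) {
    try {
      Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
      cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
      return cipher.doFinal(input);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    ToyKms kms = new ToyKms();
    kms.rollKey("zoneKey");                          // create the zone key
    Edek old = kms.generateEncryptedKey("zoneKey");  // EDEK for a first file
    kms.rollKey("zoneKey");                          // roll the zone key
    Edek fresh = kms.generateEncryptedKey("zoneKey");
    System.out.println("old EDEK version: " + old.version
        + ", new EDEK version: " + fresh.version);
    System.out.println("old EDEK still decrypts to "
        + kms.decryptEncryptedKey(old).length + " bytes");
  }
}
```

In this model the caller of `generateEncryptedKey` only ever holds the EDEK, which mirrors the NameNode storing EDEKs without seeing DEKs. Recording the key version inside each EDEK is what lets files written before a key roll remain readable while new files pick up the new version.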
diff --git a/hadoop-project/src/site/site.xml b/hadoop-project/src/site/site.xml index ec9329216d..628250f06d 100644 --- a/hadoop-project/src/site/site.xml +++ b/hadoop-project/src/site/site.xml @@ -89,6 +89,7 @@ +