HADOOP-11395. Add site documentation for Azure Storage FileSystem integration. (Contributed by Chris Nauroth)

This commit is contained in:
arp 2014-12-19 18:54:22 -08:00
parent 808cba3821
commit c1f857b0b4
4 changed files with 247 additions and 166 deletions

View File

@ -441,6 +441,9 @@ Release 2.7.0 - UNRELEASED
HADOOP-11213. Typos in html pages: SecureMode and EncryptedShuffle.
(Wei Yan via kasha)
HADOOP-11395. Add site documentation for Azure Storage FileSystem
integration. (Chris Nauroth via Arpit Agarwal)
OPTIMIZATIONS
HADOOP-11323. WritableComparator#compare keeps reference to byte array.

View File

@ -136,6 +136,7 @@
<menu name="Hadoop Compatible File Systems" inherit="top">
<item name="Amazon S3" href="hadoop-aws/tools/hadoop-aws/index.html"/>
<item name="Azure Blob Storage" href="hadoop-azure/index.html"/>
<item name="OpenStack Swift" href="hadoop-openstack/index.html"/>
</menu>

View File

@ -1,166 +0,0 @@
=============
Building
=============
basic compilation:
> mvn clean compile test-compile
Compile, run tests and produce jar
> mvn clean package
=============
Unit tests
=============
Most of the tests will run without additional configuration.
For complete testing, configuration in src/test/resources is required:
src/test/resources/azure-test.xml -> Defines Azure storage dependencies, including account information
The other files in src/test/resources do not normally need alteration:
log4j.properties -> Test logging setup
hadoop-metrics2-azure-file-system.properties -> used to wire up instrumentation for testing
From command-line
------------------
Basic execution:
> mvn test
NOTES:
- The mvn pom.xml includes src/test/resources in the runtime classpath
- detailed output, including log4j messages, appears in target\surefire-reports\TEST-{testName}.xml
Run the tests and generate report:
> mvn site (at least once, to set up some basics including images for the report)
> mvn surefire-report:report (run tests and produce the report)
> mvn surefire-report:report-only (produce a report from the last run)
> mvn surefire-report:report-only -DshowSuccess=false (produce a report from the last run, only showing errors)
> .\target\site\surefire-report.html (view the report)
Via eclipse
-------------
Manually add src\test\resources to the classpath for test run configuration:
- run menu|run configurations|{configuration}|classpath|User Entries|advanced|add folder
Then run via junit test runner.
NOTE:
- if you change log4j.properties, rebuild the project to refresh the eclipse cache.
Run Tests against Mocked storage.
---------------------------------
These run automatically and make use of an in-memory emulation of azure storage.
Running tests against the Azure storage emulator
---------------------------------------------------
A selection of tests can run against the Azure Storage Emulator which is
a high-fidelity emulation of live Azure Storage. The emulator is sufficient for high-confidence testing.
The emulator is a Windows executable that runs on a local machine.
To use the emulator, install Azure SDK 2.3 and start the storage emulator
See http://msdn.microsoft.com/en-us/library/azure/hh403989.aspx
Enable the Azure emulator tests by setting
fs.azure.test.emulator -> true
in src\test\resources\azure-test.xml
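For example, the entry in azure-test.xml uses the standard Hadoop configuration
property format:

  <property>
    <name>fs.azure.test.emulator</name>
    <value>true</value>
  </property>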
Known issues:
Symptom: When running tests for emulator, you see the following failure message
com.microsoft.windowsazure.storage.StorageException: The value for one of the HTTP headers is not in the correct format.
Issue: The emulator can get into a confused state.
Fix: Restart the Azure Emulator. Ensure it is v3.2 or later.
Running tests against live Azure storage
-------------------------------------------------------------------------
In order to run WASB unit tests against a live Azure Storage account, add credentials to
src\test\resources\azure-test.xml. These settings augment the hadoop configuration object.
For live tests, set the following in azure-test.xml:
1. "fs.azure.test.account.name -> {azureStorageAccountName}
2. "fs.azure.account.key.{AccountName} -> {fullStorageKey}"
===================================
Page Blob Support and Configuration
===================================
The Azure Blob Storage interface for Hadoop supports two kinds of blobs, block blobs
and page blobs. Block blobs are the default kind of blob and are good for most
big-data use cases, like input data for Hive, Pig, analytical map-reduce jobs etc.
Page blob handling in hadoop-azure was introduced to support HBase log files.
Page blobs can be written any number of times, whereas block blobs can only be
appended to 50,000 times before you run out of blocks and your writes will fail.
That won't work for HBase logs, so page blob support was introduced to overcome
this limitation.
Page blobs can be used for other purposes beyond just HBase log files though.
They support the Hadoop FileSystem interface. Page blobs can be up to 1TB in
size, larger than the maximum 200GB size for block blobs.
In order to have the files you create be page blobs, you must set the configuration
variable fs.azure.page.blob.dir to a comma-separated list of folder names.
E.g.
/hbase/WALs,/hbase/oldWALs,/data/mypageblobfiles
You can set this to simply / to make all files page blobs.
The configuration option fs.azure.page.blob.size is the default initial
size for a page blob. It must be 128MB or greater, and no more than 1TB,
specified as an integer number of bytes.
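For example (the fs.azure.page.blob.size value below is only an illustration;
134217728 bytes is the 128MB minimum):

  <property>
    <name>fs.azure.page.blob.dir</name>
    <value>/hbase/WALs,/hbase/oldWALs,/data/mypageblobfiles</value>
  </property>
  <property>
    <name>fs.azure.page.blob.size</name>
    <value>134217728</value>
  </property>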
====================
Atomic Folder Rename
====================
Azure storage stores files as a flat key/value store without formal support
for folders. The hadoop-azure file system layer simulates folders on top
of Azure storage. By default, folder rename in the hadoop-azure file system
layer is not atomic. That means that a failure during a folder rename
could, for example, leave some folders in the original directory and
some in the new one.
HBase depends on atomic folder rename. Hence, a configuration setting was
introduced called fs.azure.atomic.rename.dir that allows you to specify a
comma-separated list of directories to receive special treatment so that
folder rename is made atomic. The default value of this setting is just /hbase.
Redo will be applied to finish a folder rename that fails. A file
<folderName>-renamePending.json may appear temporarily and is the record of
the intention of the rename operation, to allow redo in event of a failure.
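For example:

  <property>
    <name>fs.azure.atomic.rename.dir</name>
    <value>/hbase,/data</value>
  </property>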
=============
Findbugs
=============
Run findbugs and show interactive GUI for review of problems
> mvn findbugs:gui
Run findbugs and fail build if errors are found:
> mvn findbugs:check
For help with the findbugs plugin:
> mvn findbugs:help
=============
Checkstyle
=============
Rules for checkstyle @ src\config\checkstyle.xml
- these are based on a core set of standards, with exclusions for non-serious issues
- as a general plan it would be good to turn on more rules over time.
- Occasionally, run checkstyle with the default Sun rules by editing pom.xml.
Command-line:
> mvn checkstyle:check --> just test & fail build if violations found
> mvn site checkstyle:checkstyle --> produce html report
> . target\site\checkstyle.html --> view report.
Eclipse:
- add the checkstyle plugin: Help|Install, site=http://eclipse-cs.sf.net/update
- window|preferences|checkstyle. Add src/config/checkstyle.xml. Set as default.
- project|properties|create configurations as required, eg src/main/java -> src/config/checkstyle.xml
NOTE:
- After any change to the checkstyle rules xml, use window|preferences|checkstyle|{refresh}|OK
=============
Javadoc
=============
Command-line
> mvn javadoc:javadoc

View File

@ -0,0 +1,243 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
# Hadoop Azure Support: Azure Blob Storage
* [Introduction](#Introduction)
* [Features](#Features)
* [Limitations](#Limitations)
* [Usage](#Usage)
* [Concepts](#Concepts)
* [Configuring Credentials](#Configuring_Credentials)
* [Page Blob Support and Configuration](#Page_Blob_Support_and_Configuration)
* [Atomic Folder Rename](#Atomic_Folder_Rename)
* [Accessing wasb URLs](#Accessing_wasb_URLs)
* [Testing the hadoop-azure Module](#Testing_the_hadoop-azure_Module)
## <a name="Introduction" />Introduction
The hadoop-azure module provides support for integration with
[Azure Blob Storage](http://azure.microsoft.com/en-us/documentation/services/storage/).
The built jar file, named hadoop-azure.jar, also declares transitive dependencies
on the additional artifacts it requires, notably the
[Azure Storage SDK for Java](https://github.com/Azure/azure-storage-java).
## <a name="Features" />Features
* Read and write data stored in an Azure Blob Storage account.
* Present a hierarchical file system view by implementing the standard Hadoop
[`FileSystem`](../api/org/apache/hadoop/fs/FileSystem.html) interface.
* Supports configuration of multiple Azure Blob Storage accounts.
* Supports both block blobs (suitable for most use cases, such as MapReduce) and
page blobs (suitable for continuous write use cases, such as an HBase
write-ahead log).
* Reference file system paths using URLs with the `wasb` scheme.
* Also reference file system paths using URLs with the `wasbs` scheme for SSL
encrypted access.
* Can act as a source of data in a MapReduce job, or a sink.
* Tested on both Linux and Windows.
* Tested at scale.
## <a name="Limitations" />Limitations
* The append operation is not implemented.
* File owner and group are persisted, but the permissions model is not enforced.
Authorization occurs at the level of the entire Azure Blob Storage account.
* File last access time is not tracked.
## <a name="Usage" />Usage
### <a name="Concepts" />Concepts
The Azure Blob Storage data model presents 3 core concepts:
* **Storage Account**: All access is done through a storage account.
* **Container**: A container is a grouping of multiple blobs. A storage account
may have multiple containers. In Hadoop, an entire file system hierarchy is
stored in a single container. It is also possible to configure multiple
containers, effectively presenting multiple file systems that can be referenced
using distinct URLs.
* **Blob**: A file of any type and size. In Hadoop, files are stored in blobs.
The internal implementation also uses blobs to persist the file system
hierarchy and other metadata.
### <a name="Configuring_Credentials" />Configuring Credentials
Usage of Azure Blob Storage requires configuration of credentials. Typically
this is set in core-site.xml. The configuration property name is of the form
`fs.azure.account.key.<account name>.blob.core.windows.net` and the value is the
access key. **The access key is a secret that protects access to your storage
account. Do not share the access key (or the core-site.xml file) with an
untrusted party.**
For example:
    <property>
      <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
      <value>YOUR ACCESS KEY</value>
    </property>
In many Hadoop clusters, the core-site.xml file is world-readable. If it's
undesirable for the access key to be visible in core-site.xml, then it's also
possible to configure it in encrypted form. An additional configuration property
specifies an external program to be invoked by Hadoop processes to decrypt the
key. The encrypted key value is passed to this external program as a command
line argument:
    <property>
      <name>fs.azure.account.keyprovider.youraccount</name>
      <value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
    </property>

    <property>
      <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
      <value>YOUR ENCRYPTED ACCESS KEY</value>
    </property>

    <property>
      <name>fs.azure.shellkeyprovider.script</name>
      <value>PATH TO DECRYPTION PROGRAM</value>
    </property>
### <a name="Page_Blob_Support_and_Configuration" />Page Blob Support and Configuration
The Azure Blob Storage interface for Hadoop supports two kinds of blobs,
[block blobs and page blobs](http://msdn.microsoft.com/en-us/library/azure/ee691964.aspx).
Block blobs are the default kind of blob and are good for most big-data use
cases, like input data for Hive, Pig, analytical map-reduce jobs etc. Page blob
handling in hadoop-azure was introduced to support HBase log files. Page blobs
can be written any number of times, whereas block blobs can only be appended to
50,000 times before you run out of blocks and your writes will fail. That won't
work for HBase logs, so page blob support was introduced to overcome this
limitation.
Page blobs can be used for other purposes beyond just HBase log files though.
Page blobs can be up to 1TB in size, larger than the maximum 200GB size for block
blobs.
In order to have the files you create be page blobs, you must set the
configuration variable `fs.azure.page.blob.dir` to a comma-separated list of
folder names.
For example:
    <property>
      <name>fs.azure.page.blob.dir</name>
      <value>/hbase/WALs,/hbase/oldWALs,/data/mypageblobfiles</value>
    </property>
You can set this to simply `/` to make all files page blobs.
The configuration option `fs.azure.page.blob.size` is the default initial
size for a page blob. It must be 128MB or greater, and no more than 1TB,
specified as an integer number of bytes.
The configuration option `fs.azure.page.blob.extension.size` is the page blob
extension size. This defines the amount to extend a page blob if it starts to
get full. It must be 128MB or greater, specified as an integer number of bytes.
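For example, a sketch with illustrative values (both properties are specified in
bytes; 134217728 is 128MB, the minimum for each):

    <property>
      <name>fs.azure.page.blob.size</name>
      <value>134217728</value>
    </property>

    <property>
      <name>fs.azure.page.blob.extension.size</name>
      <value>134217728</value>
    </property>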
### <a name="Atomic_Folder_Rename" />Atomic Folder Rename
Azure storage stores files as a flat key/value store without formal support
for folders. The hadoop-azure file system layer simulates folders on top
of Azure storage. By default, folder rename in the hadoop-azure file system
layer is not atomic. That means that a failure during a folder rename
could, for example, leave some folders in the original directory and
some in the new one.
HBase depends on atomic folder rename. Hence, a configuration setting was
introduced called `fs.azure.atomic.rename.dir` that allows you to specify a
comma-separated list of directories to receive special treatment so that
folder rename is made atomic. The default value of this setting is just
`/hbase`. Redo will be applied to finish a folder rename that fails. A file
`<folderName>-renamePending.json` may appear temporarily and is the record of
the intention of the rename operation, to allow redo in event of a failure.
For example:
    <property>
      <name>fs.azure.atomic.rename.dir</name>
      <value>/hbase,/data</value>
    </property>
### <a name="Accessing_wasb_URLs" />Accessing wasb URLs
After credentials are configured in core-site.xml, any Hadoop component may
reference files in that Azure Blob Storage account by using URLs of the following
format:
    wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>
The schemes `wasb` and `wasbs` identify a URL on a file system backed by Azure
Blob Storage. `wasb` utilizes unencrypted HTTP access for all interaction with
the Azure Blob Storage API. `wasbs` utilizes SSL encrypted HTTPS access.
For example, the following
[FileSystem Shell](../hadoop-project-dist/hadoop-common/FileSystemShell.html)
commands demonstrate access to a storage account named `youraccount` and a
container named `yourcontainer`.
    > hadoop fs -mkdir wasb://yourcontainer@youraccount.blob.core.windows.net/testDir
    > hadoop fs -put testFile wasb://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
    > hadoop fs -cat wasbs://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
    test file content
It's also possible to configure `fs.defaultFS` to use a `wasb` or `wasbs` URL.
This causes all bare paths, such as `/testDir/testFile`, to resolve automatically
to that file system.
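For example, a minimal sketch of such a configuration, reusing the `youraccount`
storage account and `yourcontainer` container names from the commands above:

    <property>
      <name>fs.defaultFS</name>
      <value>wasb://yourcontainer@youraccount.blob.core.windows.net</value>
    </property>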
## <a name="Testing_the_hadoop-azure_Module" />Testing the hadoop-azure Module
The hadoop-azure module includes a full suite of unit tests. Most of the tests
will run without additional configuration by running `mvn test`. This includes
tests against mocked storage, which is an in-memory emulation of Azure Storage.
A selection of tests can run against the
[Azure Storage Emulator](http://msdn.microsoft.com/en-us/library/azure/hh403989.aspx)
which is a high-fidelity emulation of live Azure Storage. The emulator is
sufficient for high-confidence testing. The emulator is a Windows executable
that runs on a local machine.
To use the emulator, install Azure SDK 2.3 and start the storage emulator. Then,
edit `src/test/resources/azure-test.xml` and add the following property:
    <property>
      <name>fs.azure.test.emulator</name>
      <value>true</value>
    </property>
There is a known issue when running tests with the emulator. You may see the
following failure message:
    com.microsoft.windowsazure.storage.StorageException: The value for one of the HTTP headers is not in the correct format.
To resolve this, restart the Azure Emulator. Ensure it is v3.2 or later.
It's also possible to run tests against a live Azure Storage account by adding
credentials to `src/test/resources/azure-test.xml` and setting
`fs.azure.test.account.name` to the name of the storage account.
For example:
    <property>
      <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
      <value>YOUR ACCESS KEY</value>
    </property>

    <property>
      <name>fs.azure.test.account.name</name>
      <value>youraccount</value>
    </property>