HADOOP-11395. Add site documentation for Azure Storage FileSystem integration. (Contributed by Chris Nauroth)
parent 808cba3821 · commit c1f857b0b4
@@ -441,6 +441,9 @@ Release 2.7.0 - UNRELEASED

    HADOOP-11213. Typos in html pages: SecureMode and EncryptedShuffle.
    (Wei Yan via kasha)

    HADOOP-11395. Add site documentation for Azure Storage FileSystem
    integration. (Chris Nauroth via Arpit Agarwal)

  OPTIMIZATIONS

    HADOOP-11323. WritableComparator#compare keeps reference to byte array.
@@ -136,6 +136,7 @@

    <menu name="Hadoop Compatible File Systems" inherit="top">
      <item name="Amazon S3" href="hadoop-aws/tools/hadoop-aws/index.html"/>
      <item name="Azure Blob Storage" href="hadoop-azure/index.html"/>
      <item name="OpenStack Swift" href="hadoop-openstack/index.html"/>
    </menu>
@@ -1,166 +0,0 @@
=============
Building
=============
Basic compilation:
> mvn clean compile test-compile

Compile, run tests and produce a jar:
> mvn clean package
=============
Unit tests
=============
Most of the tests will run without additional configuration.
For complete testing, configuration in src/test/resources is required:

  src/test/resources/azure-test.xml -> Defines Azure storage dependencies, including account information

The other files in src/test/resources do not normally need alteration:
  log4j.properties -> Test logging setup
  hadoop-metrics2-azure-file-system.properties -> used to wire up instrumentation for testing

From command-line
------------------
Basic execution:
> mvn test

NOTES:
 - The mvn pom.xml includes src/test/resources in the runtime classpath
 - Detailed output (such as log4j messages) appears in target\surefire-reports\TEST-{testName}.xml

Run the tests and generate a report:
> mvn site (at least once, to set up some basics including images for the report)
> mvn surefire-report:report (run tests and produce the report)
> mvn surefire-report:report-only (produce a report from the last run)
> mvn surefire-report:report-only -DshowSuccess=false (produce a report from the last run, showing only errors)
> .\target\site\surefire-report.html (view the report)
Via eclipse
-------------
Manually add src\test\resources to the classpath for the test run configuration:
 - run menu | run configurations | {configuration} | classpath | User Entries | advanced | add folder

Then run via the JUnit test runner.
NOTE:
 - If you change log4j.properties, rebuild the project to refresh the eclipse cache.

Run tests against mocked storage
---------------------------------
These run automatically and make use of an in-memory emulation of Azure storage.
Running tests against the Azure storage emulator
---------------------------------------------------
A selection of tests can run against the Azure Storage Emulator, which is
a high-fidelity emulation of live Azure Storage. The emulator is sufficient for
high-confidence testing. The emulator is a Windows executable that runs on a
local machine.

To use the emulator, install Azure SDK 2.3 and start the storage emulator.
See http://msdn.microsoft.com/en-us/library/azure/hh403989.aspx

Enable the Azure emulator tests by setting
  fs.azure.test.emulator -> true
in src\test\resources\azure-test.xml

Known issues:
Symptom: When running tests against the emulator, you see the following failure message:
  com.microsoft.windowsazure.storage.StorageException: The value for one of the HTTP headers is not in the correct format.
Issue: The emulator can get into a confused state.
Fix: Restart the Azure Emulator. Ensure it is v3.2 or later.
Running tests against live Azure storage
-------------------------------------------------------------------------
In order to run WASB unit tests against a live Azure Storage account, add
credentials to src\test\resources\azure-test.xml. These settings augment the
Hadoop configuration object.

For live tests, set the following in azure-test.xml:
  1. fs.azure.test.account.name -> {azureStorageAccountName}
  2. fs.azure.account.key.{AccountName} -> {fullStorageKey}
===================================
Page Blob Support and Configuration
===================================

The Azure Blob Storage interface for Hadoop supports two kinds of blobs, block blobs
and page blobs. Block blobs are the default kind of blob and are good for most
big-data use cases, like input data for Hive, Pig, analytical map-reduce jobs etc.
Page blob handling in hadoop-azure was introduced to support HBase log files.
Page blobs can be written any number of times, whereas block blobs can only be
appended to 50,000 times before you run out of blocks and your writes will fail.
That won't work for HBase logs, so page blob support was introduced to overcome
this limitation.

Page blobs can be used for other purposes beyond just HBase log files though.
They support the Hadoop FileSystem interface. Page blobs can be up to 1TB in
size, larger than the maximum 200GB size for block blobs.

In order to have the files you create be page blobs, you must set the configuration
variable fs.azure.page.blob.dir to a comma-separated list of folder names, e.g.:

  /hbase/WALs,/hbase/oldWALs,/data/mypageblobfiles

You can set this to simply / to make all files page blobs.

The configuration option fs.azure.page.blob.size is the default initial
size for a page blob. It must be 128MB or greater, and no more than 1TB,
specified as an integer number of bytes.
====================
Atomic Folder Rename
====================

Azure storage stores files as a flat key/value store without formal support
for folders. The hadoop-azure file system layer simulates folders on top
of Azure storage. By default, folder rename in the hadoop-azure file system
layer is not atomic. That means that a failure during a folder rename
could, for example, leave some folders in the original directory and
some in the new one.

HBase depends on atomic folder rename. Hence, a configuration setting was
introduced called fs.azure.atomic.rename.dir that allows you to specify a
comma-separated list of directories to receive special treatment so that
folder rename is made atomic. The default value of this setting is just /hbase.
Redo will be applied to finish a folder rename that fails. A file
<folderName>-renamePending.json may appear temporarily and is the record of
the intention of the rename operation, to allow redo in event of a failure.
=============
Findbugs
=============
Run findbugs and show the interactive GUI for review of problems:
> mvn findbugs:gui

Run findbugs and fail the build if errors are found:
> mvn findbugs:check

For help with the findbugs plugin:
> mvn findbugs:help
=============
Checkstyle
=============
Rules for checkstyle @ src\config\checkstyle.xml
 - These are based on a core set of standards, with exclusions for non-serious issues.
 - As a general plan, it would be good to turn on more rules over time.
 - Occasionally, run checkstyle with the default Sun rules by editing pom.xml.

Command-line:
> mvn checkstyle:check --> just test & fail the build if violations are found
> mvn site checkstyle:checkstyle --> produce an html report
> . target\site\checkstyle.html --> view the report

Eclipse:
 - Add the checkstyle plugin: Help | Install, site=http://eclipse-cs.sf.net/update
 - window | preferences | checkstyle. Add src/config/checkstyle.xml. Set as default.
 - project | properties | create configurations as required, e.g. src/main/java -> src/config/checkstyle.xml

NOTE:
 - After any change to the checkstyle rules xml, use window | preferences | checkstyle | {refresh} | OK
=============
Javadoc
=============
Command-line:
> mvn javadoc:javadoc
hadoop-tools/hadoop-azure/src/site/markdown/index.md (new file, 243 lines)
@@ -0,0 +1,243 @@
<!---
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

# Hadoop Azure Support: Azure Blob Storage

* [Introduction](#Introduction)
* [Features](#Features)
* [Limitations](#Limitations)
* [Usage](#Usage)
    * [Concepts](#Concepts)
    * [Configuring Credentials](#Configuring_Credentials)
    * [Page Blob Support and Configuration](#Page_Blob_Support_and_Configuration)
    * [Atomic Folder Rename](#Atomic_Folder_Rename)
    * [Accessing wasb URLs](#Accessing_wasb_URLs)
* [Testing the hadoop-azure Module](#Testing_the_hadoop-azure_Module)
## <a name="Introduction" />Introduction

The hadoop-azure module provides support for integration with
[Azure Blob Storage](http://azure.microsoft.com/en-us/documentation/services/storage/).
The built jar file, named hadoop-azure.jar, also declares transitive dependencies
on the additional artifacts it requires, notably the
[Azure Storage SDK for Java](https://github.com/Azure/azure-storage-java).

## <a name="Features" />Features

* Read and write data stored in an Azure Blob Storage account.
* Present a hierarchical file system view by implementing the standard Hadoop
  [`FileSystem`](../api/org/apache/hadoop/fs/FileSystem.html) interface.
* Supports configuration of multiple Azure Blob Storage accounts.
* Supports both block blobs (suitable for most use cases, such as MapReduce) and
  page blobs (suitable for continuous write use cases, such as an HBase
  write-ahead log).
* Reference file system paths using URLs with the `wasb` scheme.
* Also reference file system paths using URLs with the `wasbs` scheme for SSL
  encrypted access.
* Can act as a source of data in a MapReduce job, or a sink.
* Tested on both Linux and Windows.
* Tested at scale.
## <a name="Limitations" />Limitations

* The append operation is not implemented.
* File owner and group are persisted, but the permissions model is not enforced.
  Authorization occurs at the level of the entire Azure Blob Storage account.
* File last access time is not tracked.

## <a name="Usage" />Usage

### <a name="Concepts" />Concepts

The Azure Blob Storage data model presents 3 core concepts:

* **Storage Account**: All access is done through a storage account.
* **Container**: A container is a grouping of multiple blobs. A storage account
  may have multiple containers. In Hadoop, an entire file system hierarchy is
  stored in a single container. It is also possible to configure multiple
  containers, effectively presenting multiple file systems that can be referenced
  using distinct URLs.
* **Blob**: A file of any type and size. In Hadoop, files are stored in blobs.
  The internal implementation also uses blobs to persist the file system
  hierarchy and other metadata.
### <a name="Configuring_Credentials" />Configuring Credentials

Usage of Azure Blob Storage requires configuration of credentials. Typically
this is set in core-site.xml. The configuration property name is of the form
`fs.azure.account.key.<account name>.blob.core.windows.net` and the value is the
access key. **The access key is a secret that protects access to your storage
account. Do not share the access key (or the core-site.xml file) with an
untrusted party.**

For example:

    <property>
      <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
      <value>YOUR ACCESS KEY</value>
    </property>

In many Hadoop clusters, the core-site.xml file is world-readable. If it's
undesirable for the access key to be visible in core-site.xml, then it's also
possible to configure it in encrypted form. An additional configuration property
specifies an external program to be invoked by Hadoop processes to decrypt the
key. The encrypted key value is passed to this external program as a command
line argument:

    <property>
      <name>fs.azure.account.keyprovider.youraccount</name>
      <value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
    </property>

    <property>
      <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
      <value>YOUR ENCRYPTED ACCESS KEY</value>
    </property>

    <property>
      <name>fs.azure.shellkeyprovider.script</name>
      <value>PATH TO DECRYPTION PROGRAM</value>
    </property>
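As a minimal sketch, a decryption program could look like the following. This assumes the encrypted key arrives as the script's first command-line argument and that the decrypted key is printed on stdout; the base64 step here is only a stand-in for real decryption, not the mechanism Hadoop prescribes.

```shell
#!/bin/sh
# Hypothetical decryption helper for use with ShellDecryptionKeyProvider
# (a sketch, not a shipped tool). Assumption: the encrypted key is passed
# as the first argument and the plain-text key must go to stdout.
ENCRYPTED_KEY="$1"
# Stand-in "decryption": base64-decode the argument. A real deployment
# would call a key-management service or openssl here instead.
printf '%s' "$ENCRYPTED_KEY" | base64 -d
```

Whatever the real decryption mechanism, the contract stays the same: Hadoop runs the configured script and uses its stdout as the storage account key.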
### <a name="Page_Blob_Support_and_Configuration" />Page Blob Support and Configuration

The Azure Blob Storage interface for Hadoop supports two kinds of blobs,
[block blobs and page blobs](http://msdn.microsoft.com/en-us/library/azure/ee691964.aspx).
Block blobs are the default kind of blob and are good for most big-data use
cases, like input data for Hive, Pig, analytical map-reduce jobs etc. Page blob
handling in hadoop-azure was introduced to support HBase log files. Page blobs
can be written any number of times, whereas block blobs can only be appended to
50,000 times before you run out of blocks and your writes will fail. That won't
work for HBase logs, so page blob support was introduced to overcome this
limitation.

Page blobs can be used for other purposes beyond just HBase log files though.
Page blobs can be up to 1TB in size, larger than the maximum 200GB size for block
blobs.

In order to have the files you create be page blobs, you must set the
configuration variable `fs.azure.page.blob.dir` to a comma-separated list of
folder names.

For example:

    <property>
      <name>fs.azure.page.blob.dir</name>
      <value>/hbase/WALs,/hbase/oldWALs,/data/mypageblobfiles</value>
    </property>

You can set this to simply / to make all files page blobs.

The configuration option `fs.azure.page.blob.size` is the default initial
size for a page blob. It must be 128MB or greater, and no more than 1TB,
specified as an integer number of bytes.

The configuration option `fs.azure.page.blob.extension.size` is the page blob
extension size. This defines the amount to extend a page blob if it starts to
get full. It must be 128MB or greater, specified as an integer number of bytes.
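Because these size options are specified as integer byte counts, it is easy to get the literal values wrong. The shell arithmetic below simply computes the two bounds stated above (128 MB minimum, 1 TB maximum) so they can be pasted into configuration:

```shell
# fs.azure.page.blob.size must be an integer number of bytes between
# 128 MB and 1 TB; these lines compute those two bounds.
echo "minimum: $((128 * 1024 * 1024)) bytes"
echo "maximum: $((1024 * 1024 * 1024 * 1024)) bytes"
```

That is, the smallest legal value is 134217728 and the largest is 1099511627776.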
### <a name="Atomic_Folder_Rename" />Atomic Folder Rename

Azure storage stores files as a flat key/value store without formal support
for folders. The hadoop-azure file system layer simulates folders on top
of Azure storage. By default, folder rename in the hadoop-azure file system
layer is not atomic. That means that a failure during a folder rename
could, for example, leave some folders in the original directory and
some in the new one.

HBase depends on atomic folder rename. Hence, a configuration setting was
introduced called `fs.azure.atomic.rename.dir` that allows you to specify a
comma-separated list of directories to receive special treatment so that
folder rename is made atomic. The default value of this setting is just
`/hbase`. Redo will be applied to finish a folder rename that fails. A file
`<folderName>-renamePending.json` may appear temporarily and is the record of
the intention of the rename operation, to allow redo in event of a failure.

For example:

    <property>
      <name>fs.azure.atomic.rename.dir</name>
      <value>/hbase,/data</value>
    </property>
### <a name="Accessing_wasb_URLs" />Accessing wasb URLs

After credentials are configured in core-site.xml, any Hadoop component may
reference files in that Azure Blob Storage account by using URLs of the following
format:

    wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>

The schemes `wasb` and `wasbs` identify a URL on a file system backed by Azure
Blob Storage. `wasb` utilizes unencrypted HTTP access for all interaction with
the Azure Blob Storage API. `wasbs` utilizes SSL encrypted HTTPS access.

For example, the following
[FileSystem Shell](../hadoop-project-dist/hadoop-common/FileSystemShell.html)
commands demonstrate access to a storage account named `youraccount` and a
container named `yourcontainer`.

    > hadoop fs -mkdir wasb://yourcontainer@youraccount.blob.core.windows.net/testDir

    > hadoop fs -put testFile wasb://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile

    > hadoop fs -cat wasbs://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
    test file content

It's also possible to configure `fs.defaultFS` to use a `wasb` or `wasbs` URL.
This causes all bare paths, such as `/testDir/testFile` to resolve automatically
to that file system.
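As a sketch, such a default file system setting in core-site.xml could look like the following; the account and container names are placeholders, matching the examples above:

    <property>
      <name>fs.defaultFS</name>
      <value>wasb://yourcontainer@youraccount.blob.core.windows.net</value>
    </property>

With this in place, a command like `hadoop fs -ls /testDir` resolves against the Azure Blob Storage container rather than HDFS.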
## <a name="Testing_the_hadoop-azure_Module" />Testing the hadoop-azure Module

The hadoop-azure module includes a full suite of unit tests. Most of the tests
will run without additional configuration by running `mvn test`. This includes
tests against mocked storage, which is an in-memory emulation of Azure Storage.

A selection of tests can run against the
[Azure Storage Emulator](http://msdn.microsoft.com/en-us/library/azure/hh403989.aspx)
which is a high-fidelity emulation of live Azure Storage. The emulator is
sufficient for high-confidence testing. The emulator is a Windows executable
that runs on a local machine.

To use the emulator, install Azure SDK 2.3 and start the storage emulator. Then,
edit `src/test/resources/azure-test.xml` and add the following property:

    <property>
      <name>fs.azure.test.emulator</name>
      <value>true</value>
    </property>

There is a known issue when running tests with the emulator. You may see the
following failure message:

    com.microsoft.windowsazure.storage.StorageException: The value for one of the HTTP headers is not in the correct format.

To resolve this, restart the Azure Emulator. Ensure it is v3.2 or later.

It's also possible to run tests against a live Azure Storage account by adding
credentials to `src/test/resources/azure-test.xml` and setting
`fs.azure.test.account.name` to the name of the storage account.

For example:

    <property>
      <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
      <value>YOUR ACCESS KEY</value>
    </property>

    <property>
      <name>fs.azure.test.account.name</name>
      <value>youraccount</value>
    </property>