MAPREDUCE-5638. Port Hadoop Archives document to trunk (Akira AJISAKA via jeagles)
git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/trunk@1591107 13f79535-47bb-0310-9956-ffa450edef68
parent 19176f423a
commit 693025a3d4
@@ -175,6 +175,9 @@ Release 2.5.0 - UNRELEASED
MAPREDUCE-5812. Make job context available to
OutputCommitter.isRecoverySupported() (Mohammad Kamrul Islam via jlowe)

MAPREDUCE-5638. Port Hadoop Archives document to trunk (Akira AJISAKA via
jeagles)

OPTIMIZATIONS

BUG FIXES
@@ -0,0 +1,138 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

#set ( $H3 = '###' )

Hadoop Archives Guide
=====================

- [Overview](#Overview)
- [How to Create an Archive](#How_to_Create_an_Archive)
- [How to Look Up Files in Archives](#How_to_Look_Up_Files_in_Archives)
- [Archives Examples](#Archives_Examples)
    - [Creating an Archive](#Creating_an_Archive)
    - [Looking Up Files](#Looking_Up_Files)
- [Hadoop Archives and MapReduce](#Hadoop_Archives_and_MapReduce)

Overview
--------

Hadoop archives are special-format archives. A Hadoop archive maps to a file
system directory and always has a \*.har extension. A Hadoop archive directory
contains metadata files (_index and _masterindex) and data files (part-\*).
The _index file contains the names of the files that are part of the archive
and their locations within the part files.

How to Create an Archive
------------------------

`Usage: hadoop archive -archiveName name -p <parent> <src>* <dest>`

-archiveName is the name of the archive you would like to create, for example
foo.har. The name should have a \*.har extension. The parent argument
specifies the path relative to which the files are archived. For example:

`-p /foo/bar a/b/c e/f/g`

Here /foo/bar is the parent path, and a/b/c and e/f/g are paths relative to
the parent. Note that the archive is created by a Map/Reduce job, so you need
a Map/Reduce cluster to run this command. For a detailed example, see the
later sections.

If you just want to archive a single directory /foo/bar, you can use

`hadoop archive -archiveName zoo.har -p /foo/bar /outputdir`
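
The usage line above can be sketched as a small dry-run helper that assembles the command without submitting a job. `build_archive_cmd` is a hypothetical function, not part of Hadoop; its checks only mirror what the text states (a \*.har name, sources relative to the parent).

```shell
#!/bin/sh
# build_archive_cmd: hypothetical helper (an assumption, not a Hadoop tool).
# Usage: build_archive_cmd <archiveName> <parent> <dest> [src...]
# Prints the `hadoop archive` command line instead of running it.
build_archive_cmd() {
  name=$1; parent=$2; dest=$3
  shift 3

  # The archive name should carry a .har extension.
  case $name in
    *.har) ;;
    *) echo "archive name should end in .har: $name" >&2; return 1 ;;
  esac

  cmd="hadoop archive -archiveName $name -p $parent"
  for src in "$@"; do
    # Sources are expected to be relative to the parent path.
    case $src in
      /*) echo "source should be relative to $parent: $src" >&2; return 1 ;;
    esac
    cmd="$cmd $src"
  done

  echo "$cmd $dest"
}
```

For instance, `build_archive_cmd zoo.har /foo/bar /outputdir` prints the single-directory command shown above, which can be reviewed before it is run on a cluster.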

How to Look Up Files in Archives
--------------------------------

The archive exposes itself as a file system layer, so all the fs shell
commands work on archives, but with a different URI. Also, note that
archives are immutable, so renames, deletes and creates return an error.
The URI for Hadoop Archives is

`har://scheme-hostname:port/archivepath/fileinarchive`

If no scheme is provided, the underlying filesystem is assumed. In that case
the URI would look like

`har:///archivepath/fileinarchive`
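
As an illustration of the two URI forms, the following sketch builds a har URI from its parts. `har_uri` is a hypothetical helper, and the scheme, host name and port used in the usage note are placeholder values, not defaults taken from this document.

```shell
#!/bin/sh
# har_uri: hypothetical helper (an assumption, not a Hadoop tool).
# Usage: har_uri <archivepath> [fileinarchive] [scheme hostname port]
# With no scheme, the authority is left empty so the underlying
# filesystem is assumed, giving the har:/// form.
har_uri() {
  archive=$1; file=${2:-}; scheme=${3:-}; host=${4:-}; port=${5:-}

  if [ -n "$scheme" ]; then
    authority="$scheme-$host:$port"
  else
    authority=""
  fi

  uri="har://$authority$archive"
  if [ -n "$file" ]; then
    uri="$uri/$file"
  fi
  echo "$uri"
}
```

With an absolute archive path, `har_uri /user/zoo/foo.har dir1` yields the short form, while `har_uri /user/zoo/foo.har dir1 hdfs namenode 8020` yields the fully qualified form for a placeholder NameNode address.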

Archives Examples
-----------------

$H3 Creating an Archive

`hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`

The above example creates an archive using /user/hadoop as the relative
archive directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2
will be archived in the following file system directory: /user/zoo/foo.har.
Archiving does not delete the input files. If you want to delete the input
files after creating the archive (to reduce namespace usage), you will have
to do it yourself.

$H3 Looking Up Files

Looking up files in Hadoop archives is as easy as doing an ls on the
filesystem. After you have archived the directories /user/hadoop/dir1 and
/user/hadoop/dir2 as in the example above, you can see all the files in the
archive by running:

`hdfs dfs -ls -R har:///user/zoo/foo.har/`

To understand the significance of the -p argument, let's go through the above
example again. If you do a plain ls (not ls -R) on the Hadoop archive using

`hdfs dfs -ls har:///user/zoo/foo.har`

the output should be:

```
har:///user/zoo/foo.har/dir1
har:///user/zoo/foo.har/dir2
```

As you may recall, the archive was created with the following command:

`hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`

If we were to change the command to:

`hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo`

then an ls on the Hadoop archive using

`hdfs dfs -ls har:///user/zoo/foo.har`

would give you

```
har:///user/zoo/foo.har/hadoop/dir1
har:///user/zoo/foo.har/hadoop/dir2
```

Notice that the files have been archived relative to /user/ rather
than /user/hadoop.
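
The effect of -p can be sketched as plain path arithmetic: the part of a source path below the parent is what appears inside the archive. `archived_uri` is a hypothetical helper written for this illustration; it only manipulates strings and does not contact a cluster.

```shell
#!/bin/sh
# archived_uri: hypothetical helper (an assumption, not a Hadoop tool).
# Usage: archived_uri <dest> <archiveName> <parent> <absolute-source>
# Prints the har URI under which the source is expected to appear,
# following the -p behaviour described above.
archived_uri() {
  dest=${1%/}; name=$2; parent=${3%/}; src=$4

  case $src in
    # Strip the parent prefix; the remainder is the path inside the archive.
    "$parent"/*) rel=${src#"$parent"/} ;;
    *) echo "$src is not under $parent" >&2; return 1 ;;
  esac

  echo "har://$dest/$name/$rel"
}
```

Calling it with parent /user/hadoop and source /user/hadoop/dir1 gives the first listing above, and calling it with parent /user/ gives the second, which makes the difference between the two commands easy to check before archiving.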

Hadoop Archives and MapReduce
-----------------------------

Using Hadoop Archives in MapReduce is as easy as specifying a different input
filesystem than the default file system. If you have a Hadoop archive stored
in HDFS at /user/zoo/foo.har, then to use this archive as MapReduce input,
all you need to do is specify the input directory as har:///user/zoo/foo.har.
Since a Hadoop archive is exposed as a file system, MapReduce will be able to
use all the logical input files in the archive as input.
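
A minimal dry-run sketch of such a job submission follows. `har_job_cmd` is a hypothetical helper, and the jar file name, program name and output directory in the usage note are placeholders that depend on your installation; only the har:// input path comes from the text above.

```shell
#!/bin/sh
# har_job_cmd: hypothetical helper (an assumption, not a Hadoop tool).
# Usage: har_job_cmd <jar> <program> <path-in-archive> <output-dir>
# Prints a `hadoop jar` command whose input is a har:// path,
# instead of submitting the job.
har_job_cmd() {
  jar=$1; program=$2; input=$3; output=$4
  echo "hadoop jar $jar $program har://$input $output"
}
```

For example, `har_job_cmd hadoop-mapreduce-examples.jar wordcount /user/zoo/foo.har/dir1 /user/zoo/out` prints a word count submission that reads dir1 from the archive; the only change from a regular job is the har:// scheme on the input path.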

@@ -92,6 +92,7 @@
<item name="Encrypted Shuffle" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html"/>
<item name="Pluggable Shuffle/Sort" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html"/>
<item name="Distributed Cache Deploy" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html"/>
<item name="Hadoop Archives" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html"/>
<item name="DistCp" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistCp.html"/>
</menu>
