diff --git a/hadoop-common-project/hadoop-common/dev-support/jdiff/Apache_Hadoop_Common_2.8.3.xml b/hadoop-common-project/hadoop-common/dev-support/jdiff/Apache_Hadoop_Common_2.8.3.xml
new file mode 100644
index 0000000000..bd7e69c026
--- /dev/null
+++ b/hadoop-common-project/hadoop-common/dev-support/jdiff/Apache_Hadoop_Common_2.8.3.xml
@@ -0,0 +1,38433 @@
+ UnsupportedOperationException
+ If a key is deprecated in favor of multiple keys, they are all treated as
+ aliases of each other, and setting any one of them resets all the others
+ to the new value.
+ If you have multiple deprecation entries to add, it is more efficient to
+ use #addDeprecations(DeprecationDelta[] deltas) instead.
+ @param key
+ @param newKeys
+ @param customMessage
+ @deprecated use {@link #addDeprecation(String key, String newKey,
+ String customMessage)} instead]]>
+ UnsupportedOperationException
+ If you have multiple deprecation entries to add, it is more efficient to
+ use #addDeprecations(DeprecationDelta[] deltas) instead.
+ @param key
+ @param newKey
+ @param customMessage]]>
+ UnsupportedOperationException
+ If a key is deprecated in favor of multiple keys, they are all treated as
+ aliases of each other, and setting any one of them resets all the others
+ to the new value.
+ If you have multiple deprecation entries to add, it is more efficient to
+ use #addDeprecations(DeprecationDelta[] deltas) instead.
+ @param key Key that is to be deprecated
+ @param newKeys list of keys that take up the values of deprecated key
+ @deprecated use {@link #addDeprecation(String key, String newKey)} instead]]>
+ UnsupportedOperationException
+ If you have multiple deprecation entries to add, it is more efficient to
+ use #addDeprecations(DeprecationDelta[] deltas) instead.
+ @param key Key that is to be deprecated
+ @param newKey key that takes up the value of deprecated key]]>
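The aliasing rule described in these `addDeprecation` comments can be sketched with plain maps. This is a simplified, hypothetical model: `DeprecationSketch` and its methods are illustrative names and not Hadoop's `Configuration` internals. It assumes that a deprecated key and all of its replacement keys form one alias group, and that setting any member writes the value to every member.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of deprecated-key aliasing; not Hadoop's Configuration. */
final class DeprecationSketch {
    // Every key of an alias group (deprecated key plus replacements) maps to the whole group.
    private final Map<String, List<String>> aliasGroups = new HashMap<>();
    private final Map<String, String> values = new HashMap<>();

    /** Registers oldKey as deprecated in favor of one or more new keys. */
    void addDeprecation(String oldKey, String... newKeys) {
        List<String> group = new ArrayList<>();
        group.add(oldKey);
        group.addAll(Arrays.asList(newKeys));
        for (String key : group) {
            aliasGroups.put(key, group);
        }
    }

    /** Sets the value for the key and for every alias in its group. */
    void set(String key, String value) {
        List<String> group = aliasGroups.getOrDefault(key, Collections.singletonList(key));
        for (String alias : group) {
            values.put(alias, value);
        }
    }

    /** Returns the stored value, or null if the key was never set. */
    String get(String key) {
        return values.get(key);
    }
}
```

With `addDeprecation("old.key", "new.key.a", "new.key.b")` registered, a later `set("new.key.a", "v1")` in this sketch makes `get("old.key")` and `get("new.key.b")` both return `"v1"`.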
+ key is deprecated.
+ @param key the parameter which is to be checked for deprecation
+ @return true if the key is deprecated and false otherwise.
+ final.
+ @param name resource to be added, the classpath is examined for a file
+ with that name.]]>
+ final.
+ @param url url of the resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+ final.
+ @param file file-path of resource to be added, the local filesystem is
+ examined directly to find the resource, without referring to
+ the classpath.]]>
+ final.
+ WARNING: The contents of the InputStream will be cached by this method,
+ so use it sparingly because it increases memory consumption.
+ @param in InputStream to deserialize the object from. The stream will be
+ read when a get or set is next called. After it is read, the stream will be
+ closed.]]>
+ final.
+ @param in InputStream to deserialize the object from.
+ @param name the name of the resource, because InputStream.toString is
+ sometimes not very descriptive.]]>
+ final.
+ @param conf Configuration object from which to load properties]]>
+ name property, null if
+ no such property exists. If the key is deprecated, it returns the value of
+ the first key which replaces the deprecated key and is not null.
+ Values are processed for variable expansion
+ before being returned.
+ @param name the property name, will be trimmed before get value.
+ @return the value of the name or its replacing property,
+ or null if no such property exists.]]>
+ name property, but only for
+ names which have no valid value, usually non-existent or commented
+ out in XML.
+ @param name the property name
+ @return true if the property name exists without value]]>
+ name property as a trimmed String,
+ null if no such property exists.
+ If the key is deprecated, it returns the value of
+ the first key which replaces the deprecated key and is not null.
+ Values are processed for variable expansion
+ before being returned.
+ @param name the property name.
+ @return the value of the name or its replacing property,
+ or null if no such property exists.]]>
+ name property as a trimmed String,
+ defaultValue if no such property exists.
+ See {@link #getTrimmed(String)} for more details.
+ @param name the property name.
+ @param defaultValue the property default value.
+ @return the value of the name or defaultValue
+ if it is not set.]]>
+ name property, without doing
+ variable expansion. If the key is
+ deprecated, it returns the value of the first key which replaces
+ the deprecated key and is not null.
+ @param name the property name.
+ @return the value of the name property or
+ its replacing property, or null if no such property exists.]]>
+ value of the name property. If
+ name is deprecated or there is a deprecated name associated with it,
+ it sets the value for both names. The name is trimmed before it is put into
+ the configuration.
+ @param name property name.
+ @param value property value.]]>
+ value of the name property. If
+ name is deprecated, it also sets the value for
+ the keys that replace the deprecated key. The name is trimmed before it is put
+ into the configuration.
+ @param name property name.
+ @param value property value.
+ @param source the place that this configuration value came from
+ (For debugging).
+ @throws IllegalArgumentException when the value or name is null.]]>
+ name. If the key is deprecated,
+ it returns the value of the first key which replaces the deprecated key
+ and is not null.
+ If no such property exists,
+ then defaultValue is returned.
+ @param name property name, will be trimmed before get value.
+ @param defaultValue default value.
+ @return property value, or defaultValue if the property
+ doesn't exist.]]>
+ name property as an int.
+ If no such property exists, the provided default value is returned.
+ If the specified value is not a valid int,
+ an error is thrown.
+ @param name property name.
+ @param defaultValue default value.
+ @throws NumberFormatException when the value is invalid
+ @return property value as an int
+ or defaultValue
+ name property as a set of comma-delimited
+ int values.
+ If no such property exists, an empty array is returned.
+ @param name property name
+ @return property value interpreted as an array of comma-delimited
+ int values
+ name property to an int.
+ @param name property name.
+ @param value int value of the property.]]>
+ name property as a long.
+ If no such property exists, the provided default value is returned.
+ If the specified value is not a valid long,
+ an error is thrown.
+ @param name property name.
+ @param defaultValue default value.
+ @throws NumberFormatException when the value is invalid
+ @return property value as a long
+ or defaultValue
+ name property as a long or
+ human readable format. If no such property exists, the provided default
+ value is returned, or if the specified value is not a valid
+ long or human readable format, then an error is thrown. You
+ can use the following suffixes (case insensitive): k(kilo), m(mega), g(giga),
+ t(tera), p(peta), e(exa)
+ @param name property name.
+ @param defaultValue default value.
+ @throws NumberFormatException when the value is invalid
+ @return property value as a long
+ or defaultValue
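The suffix handling described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the library's parser: `HumanReadableLongSketch` is a hypothetical name, and the sketch assumes binary multiples (k = 1024, m = 1024^2, and so on), which the comment itself does not spell out.

```java
/** Hypothetical parser for values such as "64k" or "2G"; not Hadoop's implementation. */
final class HumanReadableLongSketch {
    // Suffixes in increasing order; the position decides the power of 1024 (assumption).
    private static final String SUFFIXES = "kmgtpe";

    static long parse(String text) {
        String s = text.trim();
        if (s.isEmpty()) {
            throw new NumberFormatException("empty value");
        }
        char last = Character.toLowerCase(s.charAt(s.length() - 1));
        int index = SUFFIXES.indexOf(last);
        if (index < 0) {
            return Long.parseLong(s); // no suffix: a plain long
        }
        long number = Long.parseLong(s.substring(0, s.length() - 1).trim());
        int shift = 10 * (index + 1); // k = 2^10, m = 2^20, ...
        long result = number << shift;
        if ((result >> shift) != number) {
            throw new NumberFormatException("value out of range: " + text);
        }
        return result;
    }

    /** Returns defaultValue when the raw value is absent, mirroring the comment above. */
    static long parseOrDefault(String rawValue, long defaultValue) {
        return rawValue == null ? defaultValue : parse(rawValue);
    }
}
```

Invalid input such as `"abc"` or an out-of-range value such as `"16e"` raises `NumberFormatException` in this sketch, matching the `@throws` clause.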
+ name property to a long.
+ @param name property name.
+ @param value long value of the property.]]>
+ name property as a float.
+ If no such property exists, the provided default value is returned.
+ If the specified value is not a valid float,
+ an error is thrown.
+ @param name property name.
+ @param defaultValue default value.
+ @throws NumberFormatException when the value is invalid
+ @return property value as a float
+ or defaultValue
+ name property to a float
+ @param name property name.
+ @param value property value.]]>
+ name property as a double.
+ If no such property exists, the provided default value is returned.
+ If the specified value is not a valid double,
+ an error is thrown.
+ @param name property name.
+ @param defaultValue default value.
+ @throws NumberFormatException when the value is invalid
+ @return property value as a double
+ or defaultValue
+ name property to a double
+ @param name property name.
+ @param value property value.]]>
+ name property as a boolean.
+ If no such property is specified, or if the specified value is not a valid
+ boolean, then defaultValue is returned.
+ @param name property name.
+ @param defaultValue default value.
+ @return property value as a boolean
+ or defaultValue
+ name property to a boolean.
+ @param name property name.
+ @param value boolean value of the property.]]>
+ name property to the given type. This
+ is equivalent to set(<name>, value.toString())
+ @param name property name
+ @param value new value]]>
+ name to the given time duration. This
+ is equivalent to set(<name>, value + <time suffix>)
+ @param name Property name
+ @param value Time duration
+ @param unit Unit of time]]>
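The "value + time suffix" encoding mentioned above can be illustrated as follows. The suffix strings used here (`ns`, `us`, `ms`, `s`, `m`, `h`, `d`) are an assumption for the sake of the example, since this section does not list them, and `TimeDurationSketch` is a hypothetical helper rather than part of the documented API.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper showing how a duration could be stored as "value + suffix". */
final class TimeDurationSketch {
    // Assumed suffix per unit; the documented text only says "<time suffix>".
    static String suffixFor(TimeUnit unit) {
        switch (unit) {
            case NANOSECONDS:  return "ns";
            case MICROSECONDS: return "us";
            case MILLISECONDS: return "ms";
            case SECONDS:      return "s";
            case MINUTES:      return "m";
            case HOURS:        return "h";
            case DAYS:         return "d";
            default:
                throw new IllegalArgumentException("unsupported unit: " + unit);
        }
    }

    /** Builds the string that would be stored as the property value. */
    static String format(long value, TimeUnit unit) {
        return value + suffixFor(unit);
    }
}
```

Under these assumptions, `format(30, TimeUnit.SECONDS)` produces the string that a `set(name, ...)` call would store.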
+ name property as a Pattern.
+ If no such property is specified, or if the specified value is not a valid
+ Pattern, then defaultValue is returned.
+ Note that the returned value is NOT trimmed by this method.
+ @param name property name
+ @param defaultValue default value
+ @return property value as a compiled Pattern, or defaultValue]]>
+ Pattern.
+ If the pattern is passed as null, sets the empty pattern which results in
+ further calls to getPattern(...) returning the default value.
+ @param name property name
+ @param pattern new value]]>
+ name property as
+ a collection of Strings.
+ If no such property is specified then an empty collection is returned.
+ This is an optimized version of {@link #getStrings(String)}
+ @param name property name.
+ @return property value as a collection of Strings
+ name property as
+ an array of Strings.
+ If no such property is specified then null is returned.
+ @param name property name.
+ @return property value as an array of Strings,
+ or null
+ name property as
+ an array of Strings.
+ If no such property is specified then the default value is returned.
+ @param name property name.
+ @param defaultValue The default value
+ @return property value as an array of Strings,
+ or default value.]]>
+ name property as
+ a collection of Strings, trimmed of the leading and trailing whitespace.
+ If no such property is specified then an empty Collection is returned.
+ @param name property name.
+ @return property value as a collection of Strings, or empty Collection
+ name property as
+ an array of Strings, trimmed of the leading and trailing whitespace.
+ If no such property is specified then an empty array is returned.
+ @param name property name.
+ @return property value as an array of trimmed Strings,
+ or empty array.]]>
+ name property as
+ an array of Strings, trimmed of the leading and trailing whitespace.
+ If no such property is specified then the default value is returned.
+ @param name property name.
+ @param defaultValue The default value
+ @return property value as an array of trimmed Strings,
+ or default value.]]>
+ name property as
+ comma delimited values.
+ @param name property name.
+ @param values The values]]>
+ hostProperty as an
+ InetSocketAddress. If hostProperty is
+ null, addressProperty will be used. This
+ is useful for cases where we want to differentiate between the host
+ bind address and the address clients should use to establish a connection.
+ @param hostProperty bind host property name.
+ @param addressProperty address property name.
+ @param defaultAddressValue the default value
+ @param defaultPort the default port
+ @return InetSocketAddress]]>
+ name property as an
+ InetSocketAddress.
+ @param name property name.
+ @param defaultAddress the default value
+ @param defaultPort the default port
+ @return InetSocketAddress]]>
+ name property as
+ a host:port
+ name property as a host:port. The wildcard
+ address is replaced with the local host's address. If the host and address
+ properties are configured, the host component of the address will be combined
+ with the port component of the addr to generate the address. This is to allow
+ optional control over which host name is used in multi-home bind-host
+ cases where a host can have multiple names.
+ @param hostProperty the bind-host configuration name
+ @param addressProperty the service address configuration name
+ @param defaultAddressValue the service default address configuration value
+ @param addr InetSocketAddress of the service listener
+ @return InetSocketAddress for clients to connect]]>
+ name property as a host:port. The wildcard
+ address is replaced with the local host's address.
+ @param name property name.
+ @param addr InetSocketAddress of a listener to store in the given property
+ @return InetSocketAddress for clients to connect]]>
+ name property
+ as an array of Class.
+ The value of the property specifies a list of comma separated class names.
+ If no such property is specified, then defaultValue is
+ returned.
+ @param name the property name.
+ @param defaultValue default value.
+ @return property value as a Class[]
+ or defaultValue
+ name property as a Class.
+ If no such property is specified, then defaultValue is
+ returned.
+ @param name the class name.
+ @param defaultValue default value.
+ @return property value as a Class
+ or defaultValue
+ name property as a Class
+ implementing the interface specified by xface.
+ If no such property is specified, then defaultValue is
+ returned.
+ An exception is thrown if the returned class does not implement the named
+ interface.
+ @param name the class name.
+ @param defaultValue default value.
+ @param xface the interface implemented by the named class.
+ @return property value as a Class
+ or defaultValue
+ name property as a List
+ of objects implementing the interface specified by xface.
+ An exception is thrown if any of the classes does not exist, or if it does
+ not implement the named interface.
+ @param name the property name.
+ @param xface the interface implemented by the classes named by
+ name
+ @return a List of objects implementing xface
+ name property to the name of a
+ theClass implementing the given interface xface.
+ An exception is thrown if theClass does not implement the
+ interface xface.
+ @param name property name.
+ @param theClass property value.
+ @param xface the interface implemented by the named class.]]>
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
+ dirsProp with
+ the given path. If dirsProp contains multiple directories,
+ then one is chosen based on path's hash code. If the selected
+ directory does not exist, an attempt is made to create it.
+ @param dirsProp directory in which to locate the file.
+ @param path file-path.
+ @return local file under the directory with the given path.]]>
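The hash-based directory choice described in the two comments above can be sketched as a pure function. This is an illustrative reduction only: `LocalDirChooserSketch` is a hypothetical name, the exact hashing formula is an assumption, and the sketch leaves out the step that creates the selected directory when it does not exist.

```java
/** Hypothetical illustration of choosing one of several directories by path hash. */
final class LocalDirChooserSketch {
    /**
     * Picks a directory from dirs based on the hash code of path.
     * The sign bit is masked off so the index is never negative.
     */
    static String chooseDir(String[] dirs, String path) {
        if (dirs.length == 0) {
            throw new IllegalArgumentException("no directories configured");
        }
        int index = (path.hashCode() & Integer.MAX_VALUE) % dirs.length;
        return dirs[index];
    }
}
```

Because the choice depends only on the path string, the same path maps to the same directory on every call as long as the configured directory list is unchanged.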
+ name.
+ @param name configuration resource name.
+ @return an input stream attached to the resource.]]>
+ name.
+ @param name configuration resource name.
+ @return a reader attached to the resource.]]>
+ String
+ key-value pairs in the configuration.
+ @return an iterator over the entries.]]>
+ When property name is not empty and the property exists in the
+ configuration, this method writes the property and its attributes
+ to the {@link Writer}.
+ When property name is null or empty, this method writes all the
+ configuration properties and their attributes to the {@link Writer}.
+ When property name is not empty but the property doesn't exist in
+ the configuration, this method throws an {@link IllegalArgumentException}.
+ @param out the writer to write to.]]>
+ When propertyName is not empty, and the property exists
+ in the configuration, the format of the output would be,
+ {
+ "property": {
+ "key" : "key1",
+ "value" : "value1",
+ "isFinal" : "key1.isFinal",
+ "resource" : "key1.resource"
+ }
+ }
+ When propertyName is null or empty, it behaves same as
+ {@link #dumpConfiguration(Configuration, Writer)}, the
+ output would be,
+ { "properties" :
+ [ { key : "key1",
+ value : "value1",
+ isFinal : "key1.isFinal",
+ resource : "key1.resource" },
+ { key : "key2",
+ value : "value2",
+ isFinal : "key2.isFinal",
+ resource : "key2.resource" }
+ ]
+ }
+ When propertyName is not empty, and the property is not
+ found in the configuration, this method will throw an
+ {@link IllegalArgumentException}.
+ @param config the configuration
+ @param propertyName property name
+ @param out the Writer to write to
+ @throws IOException
+ @throws IllegalArgumentException when property name is not
+ empty and the property is not found in configuration]]>
+ { "properties" :
+ [ { key : "key1",
+ value : "value1",
+ isFinal : "key1.isFinal",
+ resource : "key1.resource" },
+ { key : "key2",
+ value : "value2",
+ isFinal : "key2.isFinal",
+ resource : "key2.resource" }
+ ]
+ }
+ It does not output the properties of the configuration object which
+ is loaded from an input stream.
+ @param config the configuration
+ @param out the Writer to write to
+ @throws IOException]]>
+ true to set quiet-mode on, false
+ to turn it off.]]>
+ with matching keys]]>
+ Resources
+ Configurations are specified by resources. A resource contains a set of
+ name/value pairs as XML data. Each resource is named by either a
+ String or by a {@link Path}. If named by a String,
+ then the classpath is examined for a file with that name. If named by a
+ Path, then the local filesystem is examined directly, without
+ referring to the classpath.
+ Unless explicitly turned off, Hadoop by default specifies two
+ resources, loaded in-order from the classpath:
+ - core-default.xml: Read-only defaults for hadoop.
+ - core-site.xml: Site-specific configuration for a given hadoop
+ installation.
+ Applications may add additional resources, which are loaded
+ subsequent to these resources in the order they are added.
+ Final Parameters
+ Configuration parameters may be declared final.
+ Once a resource declares a value final, no subsequently-loaded
+ resource can alter that value.
+ For example, one might define a final parameter with:
+ <property>
+ <name>dfs.hosts.include</name>
+ <value>/etc/hadoop/conf/hosts.include</value>
+ <final>true</final>
+ </property>
+ Administrators typically define parameters as final in
+ core-site.xml for values that user applications may not alter.
+ Variable Expansion
+ Value strings are first processed for variable expansion. The
+ available properties are:
+ - Other properties defined in this Configuration; and, if a name is
+ undefined here,
+ - Properties in {@link System#getProperties()}.
+ For example, if a configuration resource contains the following property
+ definitions:
+ <property>
+ <name>basedir</name>
+ <value>/user/${user.name}</value>
+ </property>
+ <property>
+ <name>tempdir</name>
+ <value>${basedir}/tmp</value>
+ </property>
+ When conf.get("tempdir") is called, then ${basedir}
+ will be resolved to another property in this Configuration, while
+ ${user.name} would then ordinarily be resolved to the value
+ of the System property with that name.
+ When conf.get("otherdir") is called, then ${env.BASE_DIR}
+ will be resolved to the value of the ${BASE_DIR} environment variable.
+ It supports ${env.NAME:-default} and ${env.NAME-default} notations.
+ The former is resolved to "default" if the ${NAME} environment variable is undefined
+ or its value is empty.
+ The latter behaves the same way only if ${NAME} is undefined.
+ By default, warnings are logged for any deprecated configuration
+ parameters; these can be suppressed by configuring
+ log4j.logger.org.apache.hadoop.conf.Configuration.deprecation in the
+ log4j.properties file.]]>
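The variable expansion rules in this class comment can be sketched in plain Java. This is a simplified model, not the class's actual code: `VariableExpansionSketch`, its fields, and the substitution limit of 20 are assumptions made for illustration, and the environment is passed in as a map so the behaviour is easy to try out. Lookup order follows the text above: other properties first, then system properties, with the `env.` prefix handled separately.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical model of ${...} expansion; not Hadoop's Configuration. */
final class VariableExpansionSketch {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}$]+)\\}");
    private static final int MAX_SUBSTITUTIONS = 20; // assumed guard against cycles

    private final Map<String, String> props;
    private final Map<String, String> env;

    VariableExpansionSketch(Map<String, String> props, Map<String, String> env) {
        this.props = props;
        this.env = env;
    }

    /** Returns the expanded value of a property, or null if it is not defined. */
    String get(String name) {
        String raw = props.get(name);
        return raw == null ? null : expand(raw);
    }

    String expand(String value) {
        String current = value;
        for (int i = 0; i < MAX_SUBSTITUTIONS; i++) {
            Matcher m = VAR.matcher(current);
            if (!m.find()) {
                return current;
            }
            String replacement = resolve(m.group(1));
            if (replacement == null) {
                return current; // unbound variable: keep the literal ${...}
            }
            current = current.substring(0, m.start()) + replacement + current.substring(m.end());
        }
        throw new IllegalStateException("too many substitutions in: " + value);
    }

    private String resolve(String var) {
        if (var.startsWith("env.")) {
            return resolveEnv(var.substring("env.".length()));
        }
        String fromProps = props.get(var);
        return fromProps != null ? fromProps : System.getProperty(var);
    }

    // Handles NAME, NAME:-default (undefined or empty) and NAME-default (undefined only).
    private String resolveEnv(String spec) {
        int dash = spec.indexOf('-');
        if (dash < 0) {
            return env.get(spec);
        }
        boolean colon = dash > 0 && spec.charAt(dash - 1) == ':';
        String name = spec.substring(0, colon ? dash - 1 : dash);
        String fallback = spec.substring(dash + 1);
        String value = env.get(name);
        if (value == null || (colon && value.isEmpty())) {
            return fallback;
        }
        return value;
    }
}
```

With the `basedir` and `tempdir` definitions from the example above, `get("tempdir")` in this sketch first substitutes `${basedir}` from the property map and then `${user.name}` from the system properties.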
+ This implementation generates the key material and calls the
+ {@link #createKey(String, byte[], Options)} method.
+ @param name the base name of the key
+ @param options the options for the new key.
+ @return the version name of the first version of the key.
+ @throws IOException
+ @throws NoSuchAlgorithmException]]>
+ This implementation generates the key material and calls the
+ {@link #rollNewVersion(String, byte[])} method.
+ @param name the basename of the key
+ @return the name of the new version of the key
+ @throws IOException]]>
+ KeyProvider implementations must be thread safe.]]>
+ NULL if
+ a provider for the specified URI scheme could not be found.
+ @throws IOException thrown if the provider failed to initialize.]]>
+ uri has syntax error]]>
+ uri is
+ not found]]>
+ uri
+ determines a configuration property name,
+ fs.AbstractFileSystem.scheme.impl whose value names the
+ AbstractFileSystem class.
+ The entire URI and conf is passed to the AbstractFileSystem factory method.
+ @param uri for the file system to be created.
+ @param conf which is passed to the file system impl.
+ @return file system for the given URI.
+ @throws UnsupportedFileSystemException if the file system for
+ uri is not supported.]]>
+ default port;]]>
+ describing modifications
+ @throws IOException if an ACL could not be modified]]>
+ describing entries to remove
+ @throws IOException if an ACL could not be modified]]>
+ describing modifications, must include entries
+ for user, group, and others for compatibility with permission bits.
+ @throws IOException if an ACL could not be modified]]>
+ which returns each AclStatus
+ @throws IOException if an ACL could not be read]]>
+ Refer to the HDFS extended attributes user documentation for details.
+ @param path Path to modify
+ @param name xattr name.
+ @param value xattr value.
+ @throws IOException]]>
+ Refer to the HDFS extended attributes user documentation for details.
+ @param path Path to modify
+ @param name xattr name.
+ @param value xattr value.
+ @param flag xattr set flag
+ @throws IOException]]>
+ Refer to the HDFS extended attributes user documentation for details.
+ @param path Path to get extended attribute
+ @param name xattr name.
+ @return byte[] xattr value.
+ @throws IOException]]>
+ Refer to the HDFS extended attributes user documentation for details.
+ @param path Path to get extended attributes
+ @return Map describing the XAttrs of the file or directory
+ @throws IOException]]>
+ Refer to the HDFS extended attributes user documentation for details.
+ @param path Path to get extended attributes
+ @param names XAttr names.
+ @return Map describing the XAttrs of the file or directory
+ @throws IOException]]>
+ Refer to the HDFS extended attributes user documentation for details.
+ @param path Path to get extended attributes
+ @return Map describing the XAttrs of the file or directory
+ @throws IOException]]>
+ Refer to the HDFS extended attributes user documentation for details.
+ @param path Path to remove extended attribute
+ @param name xattr name
+ @throws IOException]]>
+ After a successful call, buf.position() will be advanced by the number
+ of bytes read and buf.limit() should be unchanged.
+ In the case of an exception, the values of buf.position() and buf.limit()
+ are undefined, and callers should be prepared to recover from this
+ eventuality.
+ Many implementations will throw {@link UnsupportedOperationException}, so
+ callers that are not confident in support for this method from the
+ underlying filesystem should be prepared to handle that exception.
+ Implementations should treat 0-length requests as legitimate, and must not
+ signal an error upon their receipt.
+ @param buf
+ the ByteBuffer to receive the results of the read operation.
+ @return the number of bytes read, possibly zero, or -1 if
+ end-of-stream is reached
+ @throws IOException
+ if there is some error performing the read]]>
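The buffer contract described above (position advanced by the number of bytes read, limit unchanged, zero-length requests accepted, -1 at end of stream) can be demonstrated with a small in-memory source. `ByteArraySource` is a hypothetical class written for this illustration; it is not one of the filesystem stream implementations the comment refers to.

```java
import java.nio.ByteBuffer;

/** Hypothetical in-memory source that follows the read(ByteBuffer) contract above. */
final class ByteArraySource {
    private final byte[] data;
    private int offset;

    ByteArraySource(byte[] data) {
        this.data = data.clone();
    }

    /** Returns the number of bytes read, 0 for an empty request, or -1 at end of stream. */
    int read(ByteBuffer buf) {
        int requested = buf.remaining();
        if (requested == 0) {
            return 0; // zero-length requests are legitimate and never an error
        }
        int available = data.length - offset;
        if (available <= 0) {
            return -1;
        }
        int n = Math.min(requested, available);
        buf.put(data, offset, n); // advances buf.position() by n; limit is untouched
        offset += n;
        return n;
    }
}
```

A caller would typically loop until `read` returns -1, calling `flip()` to consume the bytes and `clear()` before the next read.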
+ setReplication of FileSystem
+ @param src file name
+ @param replication new replication
+ @throws IOException
+ @return true if successful;
+ false if file does not exist or is a directory]]>
+ core-default.xml]]>
+ EnumSet.of(CreateFlag.CREATE, CreateFlag.APPEND)
+ Use the CreateFlag as follows:
+ - CREATE - to create a file if it does not exist,
+ else throw FileAlreadyExists.
+ - APPEND - to append to a file if it exists,
+ else throw FileNotFoundException.
+ - OVERWRITE - to truncate a file if it exists,
+ else throw FileNotFoundException.
+ - CREATE|APPEND - to create a file if it does not exist,
+ else append to an existing file.
+ - CREATE|OVERWRITE - to create a file if it does not exist,
+ else overwrite an existing file.
+ - SYNC_BLOCK - to force closed blocks to the disk device.
+ In addition {@link Syncable#hsync()} should be called after each write,
+ if true synchronous behavior is required.
+ - LAZY_PERSIST - Create the block on transient storage (RAM) if
+ available.
+ - APPEND_NEWBLOCK - Append data to a new block instead of end of the last
+ partial block.
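The CREATE, APPEND, and OVERWRITE rows of the list above amount to a small decision table, sketched here in plain Java. Everything in the block is hypothetical: the `Flag` enum stands in for `CreateFlag`, the JDK's `FileAlreadyExistsException` stands in for the exception named in the list, and combinations the list does not describe are simply rejected, which is a choice made for this sketch.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.util.EnumSet;

/** Hypothetical decision table for the flag combinations listed above. */
final class CreateFlagSketch {
    enum Flag { CREATE, OVERWRITE, APPEND }
    enum Action { CREATE_NEW, OVERWRITE_EXISTING, APPEND_TO_EXISTING }

    static Action decide(EnumSet<Flag> flags, boolean exists, String path) throws IOException {
        boolean create = flags.contains(Flag.CREATE);
        boolean overwrite = flags.contains(Flag.OVERWRITE);
        boolean append = flags.contains(Flag.APPEND);
        // Sketch-only rule: anything the list does not describe is rejected.
        if ((append && overwrite) || (!create && !append && !overwrite)) {
            throw new IllegalArgumentException("combination not described in the list: " + flags);
        }
        if (!exists) {
            if (create) {
                return Action.CREATE_NEW;         // CREATE, CREATE|APPEND, CREATE|OVERWRITE
            }
            throw new FileNotFoundException(path); // APPEND or OVERWRITE alone
        }
        if (append) {
            return Action.APPEND_TO_EXISTING;     // APPEND or CREATE|APPEND
        }
        if (overwrite) {
            return Action.OVERWRITE_EXISTING;     // OVERWRITE or CREATE|OVERWRITE
        }
        throw new FileAlreadyExistsException(path); // CREATE alone on an existing file
    }
}
```

The SYNC_BLOCK, LAZY_PERSIST, and APPEND_NEWBLOCK flags affect how data is written rather than which of these outcomes applies, so the sketch leaves them out.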
+ Following combinations are not valid and will result in
+ {@link HadoopIllegalArgumentException}:
+ absOrFqPath is not supported.
+ @throws IOException If the file system for absOrFqPath could
+ not be instantiated.]]>
+ defaultFsUri is not supported]]>
+ NewWdir can be one of:
+ - relative path: "foo/bar";
+ - absolute without scheme: "/foo/bar"
+ - fully qualified with scheme: "xx://auth/foo/bar"
+ Illegal WDs:
+ - relative with scheme: "xx:foo/bar"
+ - non existent directory
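The legal and illegal working-directory forms listed above can be told apart by looking at the scheme and the leading slash. The following sketch uses `java.net.URI` for the parsing, which is an assumption: the documented API has its own path type, and `WorkingDirFormSketch` is a hypothetical name. The "non existent directory" case needs a file system lookup and is not covered here.

```java
import java.net.URI;

/** Hypothetical classifier for the working-directory forms listed above. */
final class WorkingDirFormSketch {
    enum Form { RELATIVE, ABSOLUTE_WITHOUT_SCHEME, FULLY_QUALIFIED, ILLEGAL_RELATIVE_WITH_SCHEME }

    static Form classify(String wd) {
        URI uri = URI.create(wd);
        if (uri.getScheme() == null) {
            return wd.startsWith("/") ? Form.ABSOLUTE_WITHOUT_SCHEME : Form.RELATIVE;
        }
        // With a scheme, the path must be absolute; "xx:foo/bar" parses as an opaque URI.
        String path = uri.getPath();
        if (uri.isOpaque() || path == null || !path.startsWith("/")) {
            return Form.ILLEGAL_RELATIVE_WITH_SCHEME;
        }
        return Form.FULLY_QUALIFIED;
    }
}
```

The four example strings from the list map to the four enum constants in this sketch.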
+ f does not exist
+ @throws AccessControlException if access denied
+ @throws IOException If an IO Error occurred
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server
+ RuntimeExceptions:
+ @throws InvalidPathException If path f is not valid]]>
+ Progress - to report progress on the operation - default null
+ Permission - umask is applied against permission: default is
+ FsPermissions:getDefault()
+ CreateParent - create missing parent path; default is not
+ to create parents
+ The defaults for the following are server-side (SS) defaults of the file
+ server implementing the target path. Not all parameters make sense
+ for all kinds of file system - e.g. localFS ignores Blocksize,
+ replication, checksum
+ - BufferSize - buffersize used in FSDataOutputStream
+ - Blocksize - block size for file blocks
+ - ReplicationFactor - replication for blocks
+ - ChecksumParam - Checksum parameters. server default is used
+ if not specified.
+ @return {@link FSDataOutputStream} for created file
+ @throws AccessControlException If access is denied
+ @throws FileAlreadyExistsException If file f already exists
+ @throws FileNotFoundException If parent of f does not exist
+ and createParent is false
+ @throws ParentNotDirectoryException If parent of f is not a
+ directory.
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server
+ RuntimeExceptions:
+ @throws InvalidPathException If path f is not valid]]>
+ dir already
+ exists
+ @throws FileNotFoundException If parent of dir does not exist
+ and createParent is false
+ @throws ParentNotDirectoryException If parent of dir is not a
+ directory
+ @throws UnsupportedFileSystemException If file system for dir
+ is not supported
+ @throws IOException If an I/O error occurred
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server
+ RuntimeExceptions:
+ @throws InvalidPathException If path dir is not valid]]>
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server
+ RuntimeExceptions:
+ @throws InvalidPathException If path f is invalid]]>
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f
+ is not supported
+ @throws IOException If an I/O error occurred
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+ Fails if path is a directory.
+ Fails if path does not exist.
+ Fails if path is not closed.
+ Fails if new size is greater than current size.
+ @param f The path to the file to be truncated
+ @param newLength The size the file is to be truncated to
+ @return true if the file has been truncated to the desired
+ newLength and is immediately available to be reused for
+ write operations such as append, or
+ false if a background process of adjusting the length of
+ the last block has been started, and clients should wait for it to
+ complete before proceeding with further file updates.
+ @throws AccessControlException If access is denied
+ @throws FileNotFoundException If file f does not exist
+ @throws UnsupportedFileSystemException If file system for f is
+ not supported
+ @throws IOException If an I/O error occurred
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+ f does not exist
+ @throws IOException If an I/O error occurred
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+ Fails if src is a file and dst is a directory.
+ Fails if src is a directory and dst is a file.
+ Fails if the parent of dst does not exist or is a file.
+ If OVERWRITE option is not passed as an argument, rename fails if the dst
+ already exists.
+ If OVERWRITE option is passed as an argument, rename overwrites the dst if
+ it is a file or an empty directory. Rename fails if dst is a non-empty
+ directory.
+ Note that atomicity of rename is dependent on the file system
+ implementation. Please refer to the file system documentation for details.
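The rules above can be read as a small decision function. The following is a minimal sketch in plain Java, not the FileContext implementation: the class RenameRulesSketch, the Kind enum, and renameAllowed are hypothetical names, and checking the rules in the order they are listed is an assumption.

```java
public class RenameRulesSketch {
    /** Hypothetical description of what currently sits at a path. */
    public enum Kind { FILE, EMPTY_DIR, NON_EMPTY_DIR, MISSING }

    /**
     * Returns true if the documented rules would let the rename proceed.
     * This only models the rules in the text; a real file system may apply
     * further checks (permissions, quotas, and so on).
     */
    public static boolean renameAllowed(Kind src, Kind dst,
                                        boolean dstParentIsDir,
                                        boolean overwrite) {
        if (src == Kind.MISSING) {
            return false; // corresponds to the FileNotFoundException case
        }
        if (!dstParentIsDir) {
            return false; // parent of dst does not exist or is a file
        }
        if (dst == Kind.MISSING) {
            return true;  // nothing to overwrite
        }
        boolean srcIsDir = src != Kind.FILE;
        boolean dstIsDir = dst != Kind.FILE;
        if (srcIsDir != dstIsDir) {
            return false; // file onto directory, or directory onto file
        }
        if (!overwrite) {
            return false; // dst exists and OVERWRITE was not passed
        }
        // With OVERWRITE, only a file or an empty directory may be replaced.
        return dst != Kind.NON_EMPTY_DIR;
    }

    public static void main(String[] args) {
        System.out.println(renameAllowed(Kind.FILE, Kind.FILE, true, false));
        System.out.println(renameAllowed(Kind.FILE, Kind.FILE, true, true));
    }
}
```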
+ @param src path to be renamed
+ @param dst new path after rename
+ @throws AccessControlException If access is denied
+ @throws FileAlreadyExistsException If dst already exists and
+ options does not include the {@link Options.Rename#OVERWRITE} option.
+ @throws FileNotFoundException If src
does not exist
+ @throws ParentNotDirectoryException If parent of dst
is not a
+ directory
+ @throws UnsupportedFileSystemException If file system for src
+ and dst
is not supported
+ @throws IOException If an I/O error occurred
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f
+ is not supported
+ @throws IOException If an I/O error occurred
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f
+ is not supported
+ @throws IOException If an I/O error occurred
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server
+ RuntimeExceptions:
+ @throws HadoopIllegalArgumentException If username or
+ groupname is invalid.]]>
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f
+ is not supported
+ @throws IOException If an I/O error occurred
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+ f does not exist
+ @throws IOException If an I/O error occurred
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f
+ is not supported
+ @throws IOException If an I/O error occurred
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f
+ is not supported
+ @throws IOException If an I/O error occurred
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f
+ is not supported
+ @throws IOException If an I/O error occurred]]>
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f
+ is not supported
+ @throws IOException If the given path does not refer to a symlink
+ or an I/O error occurred]]>
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f
+ is not supported
+ @throws IOException If an I/O error occurred
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+ Given a path referring to a symlink of form:
+ <---X--->
+ fs://host/A/B/link
+ <-----Y----->
+ In this path X is the scheme and authority that identify the file system,
+ and Y is the path leading up to the final path component "link". If Y is
+ a symlink itself then let Y' be the target of Y and X' be the scheme and
+ authority of Y'. Symlink targets may be:
+ 1. Fully qualified URIs
+ fs://hostX/A/B/file Resolved according to the target file system.
+ 2. Partially qualified URIs (e.g. scheme but no host)
+ fs:///A/B/file Resolved according to the target file system. E.g. resolving
+ a symlink to hdfs:///A results in an exception because
+ HDFS URIs must be fully qualified, while a symlink to
+ file:///A will not since Hadoop's local file systems
+ require partially qualified URIs.
+ 3. Relative paths
+ path Resolves to [Y'][path]. E.g. if Y resolves to hdfs://host/A and path
+ is "../B/file" then [Y'][path] is hdfs://host/B/file
+ 4. Absolute paths
+ path Resolves to [X'][path]. E.g. if Y resolves to hdfs://host/A/B and path
+ is "/file" then [X'][path] is hdfs://host/file
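The relative and absolute cases above can be approximated with java.net.URI. This is a minimal sketch under stated assumptions, not how FileContext resolves symlinks: SymlinkTargetSketch and resolveTarget are hypothetical names, and the directory Y' is given a trailing slash so that URI.resolve treats it as a directory.

```java
import java.net.URI;

public class SymlinkTargetSketch {
    /**
     * Resolves a symlink target against the (already resolved) directory
     * that contains the link. Fully qualified targets are returned as-is,
     * absolute paths keep only the scheme and authority of the directory,
     * and relative paths are resolved against the directory itself.
     */
    public static URI resolveTarget(String parentDir, String target) {
        // A trailing slash makes URI.resolve treat parentDir as a directory.
        String dir = parentDir.endsWith("/") ? parentDir : parentDir + "/";
        return URI.create(dir).resolve(target);
    }

    public static void main(String[] args) {
        // Case 3, relative path: prints hdfs://host/B/file
        System.out.println(resolveTarget("hdfs://host/A", "../B/file"));
        // Case 4, absolute path: prints hdfs://host/file
        System.out.println(resolveTarget("hdfs://host/A/B", "/file"));
        // Case 1, fully qualified URI: prints hdfs://hostX/A/B/file
        System.out.println(resolveTarget("hdfs://host/A/B", "hdfs://hostX/A/B/file"));
    }
}
```

Case 2 (partially qualified URIs) depends on the target file system and is not modeled here.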
+ @param target the target of the symbolic link
+ @param link the path to be created that points to target
+ @param createParent if true then missing parent dirs are created; if
+ false then parent must exist
+ @throws AccessControlException If access is denied
+ @throws FileAlreadyExistsException If file link already exists
+ @throws FileNotFoundException If target
does not exist
+ @throws ParentNotDirectoryException If parent of link
is not a
+ directory.
+ @throws UnsupportedFileSystemException If file system for
+ target
or link
is not supported
+ @throws IOException If an I/O error occurred]]>
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f
+ is not supported
+ @throws IOException If an I/O error occurred
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+ f does not exist
+ @throws UnsupportedFileSystemException If file system for f
+ is not supported
+ @throws IOException If an I/O error occurred
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+ f is
+ not supported
+ @throws IOException If an I/O error occurred
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+ describing modifications
+ @throws IOException if an ACL could not be modified]]>
+ describing entries to remove
+ @throws IOException if an ACL could not be modified]]>
+ describing modifications, must include entries
+ for user, group, and others for compatibility with permission bits.
+ @throws IOException if an ACL could not be modified]]>
+ which returns each AclStatus
+ @throws IOException if an ACL could not be read]]>
+ Refer to the HDFS extended attributes user documentation for details.
+ @param path Path to modify
+ @param name xattr name.
+ @param value xattr value.
+ @throws IOException]]>
+ Refer to the HDFS extended attributes user documentation for details.
+ @param path Path to modify
+ @param name xattr name.
+ @param value xattr value.
+ @param flag xattr set flag
+ @throws IOException]]>
+ Refer to the HDFS extended attributes user documentation for details.
+ @param path Path to get extended attribute
+ @param name xattr name.
+ @return byte[] xattr value.
+ @throws IOException]]>
+ Refer to the HDFS extended attributes user documentation for details.
+ @param path Path to get extended attributes
+ @return Map describing the XAttrs of the file or directory
+ @throws IOException]]>
+ Refer to the HDFS extended attributes user documentation for details.
+ @param path Path to get extended attributes
+ @param names XAttr names.
+ @return Map describing the XAttrs of the file or directory
+ @throws IOException]]>
+ Refer to the HDFS extended attributes user documentation for details.
+ @param path Path to remove extended attribute
+ @param name xattr name
+ @throws IOException]]>
+ Refer to the HDFS extended attributes user documentation for details.
+ @param path Path to get extended attributes
+ @return List of the XAttr names of the file or directory
+ @throws IOException]]>
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+ Exceptions applicable to file systems accessed over RPC:
+ @throws RpcClientException If an exception occurred in the RPC client
+ @throws RpcServerException If an exception occurred in the RPC server
+ @throws UnexpectedServerException If server implementation throws
+ undeclared exception to RPC server]]>
+ *** Path Names ***
+ The Hadoop file system supports a URI name space and URI names.
+ It offers a forest of file systems that can be referenced using fully
+ qualified URIs.
+ Two common Hadoop file system implementations are
+ - the local file system: file:///path
- the hdfs file system hdfs://nnAddress:nnPort/path
+ While URI names are very flexible, they require knowing the name or address
+ of the server. For convenience one often wants to access the default system
+ in one's environment without knowing its name/address. This has an
+ additional benefit that it allows one to change one's default fs
+ (e.g. admin moves application from cluster1 to cluster2).
+ To facilitate this, Hadoop supports a notion of a default file system.
+ The user can set his default file system, although this is
+ typically set up for you in your environment via your default config.
+ A default file system implies a default scheme and authority; slash-relative
+ names (such as /foo/bar) are resolved relative to that default FS.
+ Similarly a user can also have working-directory-relative names (i.e. names
+ not starting with a slash). While the working directory is generally in the
+ same default FS, the wd can be in a different FS.
+ Hence Hadoop path names can be one of:
+ - fully qualified URI: scheme://authority/path
- slash relative names: /path relative to the default file system
- wd-relative names: path relative to the working dir
+ Relative paths with scheme (scheme:foo/bar) are illegal.
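The three kinds of names, plus the illegal scheme-relative form, can be told apart with java.net.URI. This is a minimal sketch, not Hadoop's Path parsing: PathNameKindSketch, Kind, and classify are hypothetical names, and the example paths are made up for illustration.

```java
import java.net.URI;

public class PathNameKindSketch {
    public enum Kind { FULLY_QUALIFIED, SLASH_RELATIVE, WD_RELATIVE, ILLEGAL }

    /** Classifies a path name according to the categories listed above. */
    public static Kind classify(String name) {
        URI uri = URI.create(name);
        if (uri.getScheme() != null) {
            // "scheme:foo/bar" parses as an opaque URI, so getPath() is null.
            String path = uri.getPath();
            return (path != null && path.startsWith("/"))
                    ? Kind.FULLY_QUALIFIED
                    : Kind.ILLEGAL;
        }
        // No scheme: resolved against the default FS or the working dir.
        return uri.getPath().startsWith("/")
                ? Kind.SLASH_RELATIVE
                : Kind.WD_RELATIVE;
    }

    public static void main(String[] args) {
        System.out.println(classify("hdfs://nn:8020/user/a")); // FULLY_QUALIFIED
        System.out.println(classify("/user/a"));               // SLASH_RELATIVE
        System.out.println(classify("logs/a.txt"));            // WD_RELATIVE
        System.out.println(classify("hdfs:foo/bar"));          // ILLEGAL
    }
}
```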
+ ****The Role of the FileContext and configuration defaults****
+ The FileContext provides file namespace context for resolving file names;
+ it also contains the umask for permissions. In that sense it is like the
+ per-process file-related state in a Unix system.
+ These two properties
+ - default file system (i.e. your slash)
- umask
+ in general, are obtained from the default configuration file
+ in your environment, (@see {@link Configuration}).
+ No other configuration parameters are obtained from the default config as
+ far as the file context layer is concerned. All file system instances
+ (i.e. deployments of file systems) have default properties; we call these
+ server side (SS) defaults. Operations like create allow one to select many
+ properties: either pass them in as explicit parameters or use
+ the SS properties.
+ The file system related SS defaults are
+ - the home directory (default is "/user/userName")
- the initial wd (only for local fs)
- replication factor
- block size
- buffer size
- encryptDataTransfer
- checksum option. (checksumType and bytesPerChecksum)
+ *** Usage Model for the FileContext class ***
+ Example 1: use the default config read from the $HADOOP_CONFIG/core.xml.
+ Unspecified values come from core-defaults.xml in the release jar.
+ - myFContext = FileContext.getFileContext(); // uses the default config
+ // which has your default FS
- myFContext.create(path, ...);
- myFContext.setWorkingDirectory(path)
- myFContext.open (path, ...);
+ Example 2: Get a FileContext with a specific URI as the default FS
+ - myFContext = FileContext.getFileContext(URI)
- myFContext.create(path, ...);
+ ...
+ Example 3: FileContext with local file system as the default
+ - myFContext = FileContext.getLocalFSFileContext()
- myFContext.create(path, ...);
- ...
+ Example 4: Use a specific config, ignoring $HADOOP_CONFIG
+ Generally you should not need to use a config unless you are doing
+ - configX = someConfigSomeOnePassedToYou.
- myFContext = getFileContext(configX); // configX is not changed,
+ // is passed down
- myFContext.create(path, ...);
- ...
+ This implementation throws an UnsupportedOperationException
+ @return the protocol scheme for the FileSystem.]]>
+ fs.scheme.class whose value names the FileSystem class.
+ The entire URI is passed to the FileSystem instance's initialize method.]]>
+ fs.scheme.class whose value names the FileSystem class.
+ The entire URI is passed to the FileSystem instance's initialize method.
+ This always returns a new FileSystem object.]]>
+ Fails if src is a file and dst is a directory.
+ Fails if src is a directory and dst is a file.
+ Fails if the parent of dst does not exist or is a file.
+ If OVERWRITE option is not passed as an argument, rename fails
+ if the dst already exists.
+ If OVERWRITE option is passed as an argument, rename overwrites
+ the dst if it is a file or an empty directory. Rename fails if dst is
+ a non-empty directory.
+ Note that atomicity of rename is dependent on the file system
+ implementation. Please refer to the file system documentation for
+ details. This default implementation is non-atomic.
+ This method is deprecated since it is a temporary method added to
+ support the transition from FileSystem to FileContext for user
+ applications.
+ @param src path to be renamed
+ @param dst new path after rename
+ @throws IOException on failure]]>
+ Fails if path is a directory.
+ Fails if path does not exist.
+ Fails if path is not closed.
+ Fails if new size is greater than current size.
+ @param f The path to the file to be truncated
+ @param newLength The size the file is to be truncated to
+ @return true
if the file has been truncated to the desired
+ newLength
and is immediately available to be reused for
+ write operations such as append
, or
+ false
if a background process of adjusting the length of
+ the last block has been started, and clients should wait for it to
+ complete before proceeding with further file updates.]]>
+ Does not guarantee to return the List of files/directories status in a
+ sorted order.
+ @param f given path
+ @return the statuses of the files/directories in the given path
+ @throws FileNotFoundException when the path does not exist;
+ IOException see specific implementation]]>
+ Does not guarantee to return the List of files/directories status in a
+ sorted order.
+ @param f
+ a path name
+ @param filter
+ the user-supplied path filter
+ @return an array of FileStatus objects for the files under the given path
+ after applying the filter
+ @throws FileNotFoundException when the path does not exist;
+ IOException see specific implementation]]>
+ Does not guarantee to return the List of files/directories status in a
+ sorted order.
+ @param files
+ a list of paths
+ @return a list of statuses for the files under the given paths after
+ applying the default Path filter
+ @throws FileNotFoundException when the path does not exist;
+ IOException see specific implementation]]>
+ Does not guarantee to return the List of files/directories status in a
+ sorted order.
+ @param files
+ a list of paths
+ @param filter
+ the user-supplied path filter
+ @return a list of statuses for the files under the given paths after
+ applying the filter
+ @throws FileNotFoundException when the path does not exist;
+ IOException see specific implementation]]>
+ Return all the files that match filePattern and are not checksum
+ files. Results are sorted by their names.
+ A filename pattern is composed of regular characters and
+ special pattern matching characters, which are:
+ - ? : Matches any single character.
+ - * : Matches zero or more characters.
+ - [abc] : Matches a single character from character set {a,b,c}.
+ - [a-b] : Matches a single character from the character range {a...b}.
+ Note that character a must be lexicographically less than or equal to
+ character b.
+ - [^a] : Matches a single character that is not from character set or range
+ {a}. Note that the ^ character must occur immediately to the right of the
+ opening bracket.
+ - \c : Removes (escapes) any special meaning of character c.
+ - {ab,cd} : Matches a string from the string set {ab, cd}
+ - {ab,c{de,fh}} : Matches a string from the string set {ab, cde, cfh}
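The nested brace case is the least obvious entry in this list. The following minimal sketch expands brace groups into the string set they denote; it is an illustration in plain Java, not the globbing code used by the file system. BraceExpansionSketch and expand are hypothetical names, and escapes (\c) are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;

public class BraceExpansionSketch {
    /**
     * Expands the first top-level {a,b,...} group and recurses on each
     * alternative, so nested groups are expanded as well.
     */
    public static List<String> expand(String pattern) {
        List<String> out = new ArrayList<>();
        int open = pattern.indexOf('{');
        if (open < 0) {
            out.add(pattern); // no group left to expand
            return out;
        }
        // Find the matching close brace and the commas at nesting depth 1.
        int depth = 0;
        int close = -1;
        List<Integer> splits = new ArrayList<>();
        for (int i = open; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    close = i;
                    break;
                }
            } else if (c == ',' && depth == 1) {
                splits.add(i);
            }
        }
        if (close < 0) {
            throw new IllegalArgumentException("Unbalanced braces: " + pattern);
        }
        splits.add(close); // the last alternative ends at the close brace
        String prefix = pattern.substring(0, open);
        String suffix = pattern.substring(close + 1);
        int start = open + 1;
        for (int end : splits) {
            String alternative = pattern.substring(start, end);
            out.addAll(expand(prefix + alternative + suffix));
            start = end + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("{ab,c{de,fh}}")); // [ab, cde, cfh]
    }
}
```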
+ @param pathPattern a regular expression specifying a path pattern
+ @return an array of paths that match the path pattern
+ @throws IOException]]>
+ f does not exist
+ @throws IOException If an I/O error occurred]]>
+ f does not exist
+ @throws IOException if any I/O error occurred]]>
+ describing modifications
+ @throws IOException if an ACL could not be modified]]>
+ describing entries to remove
+ @throws IOException if an ACL could not be modified]]>
+ describing modifications, must include entries
+ for user, group, and others for compatibility with permission bits.
+ @throws IOException if an ACL could not be modified]]>
+ Refer to the HDFS extended attributes user documentation for details.
+ @param path Path to modify
+ @param name xattr name.
+ @param value xattr value.
+ @throws IOException]]>
+ Refer to the HDFS extended attributes user documentation for details.
+ @param path Path to modify
+ @param name xattr name.
+ @param value xattr value.
+ @param flag xattr set flag
+ @throws IOException]]>
+ Refer to the HDFS extended attributes user documentation for details.
+ @param path Path to get extended attribute
+ @param name xattr name.
+ @return byte[] xattr value.
+ @throws IOException]]>
+ Refer to the HDFS extended attributes user documentation for details.
+ @param path Path to get extended attributes
+ @return Map describing the XAttrs of the file or directory
+ @throws IOException]]>
+ Refer to the HDFS extended attributes user documentation for details.
+ @param path Path to get extended attributes
+ @param names XAttr names.
+ @return Map describing the XAttrs of the file or directory
+ @throws IOException]]>
+ Refer to the HDFS extended attributes user documentation for details.
+ @param path Path to get extended attributes
+ @return List of the XAttr names of the file or directory
+ @throws IOException]]>
+ Refer to the HDFS extended attributes user documentation for details.
+ @param path Path to remove extended attribute
+ @param name xattr name
+ @throws IOException]]>
+ This is a default method which is intended to be overridden by
+ subclasses. The default implementation returns an empty storage statistics
+ object.
+ @return The StorageStatistics for this FileSystem instance.
+ Will never be null.]]>
+ All user code that may potentially use the Hadoop Distributed
+ File System should be written to use a FileSystem object. The
+ Hadoop DFS is a multi-machine system that appears as a single
+ disk. It's useful because of its fault tolerance and potentially
+ very large capacity.
+ The local implementation is {@link LocalFileSystem} and distributed
+ implementation is DistributedFileSystem.]]>
+ caller's environment variables to use
+ for expansion
+ @return String[] with absolute path to new jar in position 0 and
+ unexpanded wild card entry path in position 1
+ @throws IOException if there is an I/O error while writing the jar file]]>
+ FilterFileSystem contains
+ some other file system, which it uses as
+ its basic file system, possibly transforming
+ the data along the way or providing additional
+ functionality. The class FilterFileSystem
+ itself simply overrides all methods of
+ FileSystem
with versions that
+ pass all requests to the contained file
+ system. Subclasses of FilterFileSystem
+ may further override some of these methods
+ and may also provide additional methods
+ and fields.]]>
+ -1
+ if there is no more data because the end of the stream has been
+ reached]]>
+ length bytes have been read.
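The read-until-complete contract can be sketched with a java.nio FileChannel, whose positioned read may return fewer bytes than requested. This is a minimal sketch assuming a local file; PositionedReadSketch, readFully, and demo are hypothetical names and do not come from the Hadoop stream classes.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PositionedReadSketch {
    /** Keeps reading until exactly length bytes have been read, or fails. */
    public static void readFully(FileChannel in, long position, byte[] buffer,
                                 int offset, int length) throws IOException {
        int done = 0;
        while (done < length) {
            ByteBuffer slice =
                    ByteBuffer.wrap(buffer, offset + done, length - done);
            int n = in.read(slice, position + done);
            if (n < 0) {
                // The buffer may already hold 'done' bytes at this point.
                throw new EOFException("End of file after " + done
                        + " of " + length + " bytes");
            }
            done += n;
        }
    }

    /** Writes "hello world" to a temp file and reads a range back. */
    public static String demo(long position, int length) {
        try {
            Path tmp = Files.createTempFile("positioned-read", ".txt");
            try {
                Files.write(tmp, "hello world".getBytes(StandardCharsets.UTF_8));
                try (FileChannel in =
                             FileChannel.open(tmp, StandardOpenOption.READ)) {
                    byte[] buf = new byte[length];
                    readFully(in, position, buf, 0, length);
                    return new String(buf, StandardCharsets.UTF_8);
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(6, 5)); // world
    }
}
```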
+ @param position position in the input stream to seek
+ @param buffer buffer into which data is read
+ @param offset offset into the buffer in which data is written
+ @param length the number of bytes to read
+ @throws IOException IO problems
+ @throws EOFException If the end of stream is reached while reading.
+ If an exception is thrown an undetermined number
+ of bytes in the buffer may have been written.]]>
+ path is invalid]]>
+ @return file
+ and the scheme is null, and the authority
+ is null.
+ @return whether the path is absolute and the URI has neither scheme nor
+ authority parts]]>
+ true if and only if pathname
+ should be included]]>
+ Warning: Not all filesystems satisfy the thread-safety requirement.
+ @param position position within file
+ @param buffer destination buffer
+ @param offset offset in the buffer
+ @param length number of bytes to read
+ @return actual number of bytes read; -1 means "none"
+ @throws IOException IO problems.]]>
+ Warning: Not all filesystems satisfy the thread-safety requirement.
+ @param position position within file
+ @param buffer destination buffer
+ @param offset offset in the buffer
+ @param length number of bytes to read
+ @throws IOException IO problems.
+ @throws EOFException the end of the data was reached before
+ the read operation completed]]>
+ Warning: Not all filesystems satisfy the thread-safety requirement.
+ @param position position within file
+ @param buffer destination buffer
+ @throws IOException IO problems.
+ @throws EOFException the end of the data was reached before
+ the read operation completed]]>
+ <----15----> <----15----> <----15----> <-------18------->
+ XAttr is byte[], this class is to
+ convert byte[] to some kind of string representation or convert back.
+ String representation is convenient for display and input. For example,
+ display on screen as a shell response or JSON response, or input as an HTTP
+ or shell parameter.]]>
+ @return ftp
+ A {@link FileSystem} backed by an FTP client provided by Apache Commons Net.
+ ]]>
+ (cause==null ? null : cause.toString()) (which
+ typically contains the class and detail message of cause).
+ @param cause the cause (which is saved for later retrieval by the
+ {@link #getCause()} method). (A null value is
+ permitted, and indicates that the cause is nonexistent or
+ unknown.)]]>
+ But for removeAcl operation it will be false. i.e. AclSpec should
+ not contain permissions.
+ Example: "user:foo,group:bar"
+ @return Returns list of {@link AclEntry} parsed]]>
+ The expected format of ACL entries in the string parameter is the same
+ format produced by the {@link #toStringStable()} method.
+ @param aclStr
+ String representation of an ACL.
+ Example: "user:foo:rw-"
+ @param includePermission
+ for setAcl operations this will be true. i.e. Acl should include
+ permissions.
+ But for removeAcl operation it will be false. i.e. Acl should not
+ contain permissions.
+ Example: "user:foo,group:bar,mask::"
+ @return Returns an {@link AclEntry} object]]>
+ unmodifiable ordered list of all ACL entries]]>
+ Recommended to use this API ONLY if client communicates with the old
+ NameNode, needs to pass the Permission for the path to get effective
+ permission, else use {@link AclStatus#getEffectivePermission(AclEntry)}.
+ @param entry AclEntry to get the effective action
+ @param permArg Permission for the path. However, if the client is NOT
+ communicating with an old NameNode, then this argument has no
+ effect.
+ @return Returns the effective permission for the entry.
+ @throws IllegalArgumentException If the client is communicating with an old
+ NameNode and the permission is not passed as an argument.]]>
+ mode is invalid]]>
+ @return viewfs
+ /user -> hdfs://nnContainingUserDir/user
+ /project/foo -> hdfs://nnProject1/projects/foo
+ /project/bar -> hdfs://nnProject2/projects/bar
+ /tmp -> hdfs://nnTmp/privateTmpForUserXXX
+ ViewFs is specified with the following URI: viewfs:///
+ To use viewfs one would typically set the default file system in the
+ config (i.e. fs.defaultFS = viewfs:///) along with the
+ mount table config variables as described below.
+ ** Config variables to specify the mount table entries **
+ The file system is initialized from the standard Hadoop config through
+ config variables.
+ See {@link FsConstants} for URI and Scheme constants;
+ See {@link Constants} for config var constants;
+ see {@link ConfigUtil} for convenient lib.
+ All the mount table config entries for view fs are prefixed by
+ fs.viewfs.mounttable.
+ For example the above example can be specified with the following
+ config variables:
+ - fs.viewfs.mounttable.default.link./user=
+ hdfs://nnContainingUserDir/user
- fs.viewfs.mounttable.default.link./project/foo=
+ hdfs://nnProject1/projects/foo
- fs.viewfs.mounttable.default.link./project/bar=
+ hdfs://nnProject2/projects/bar
- fs.viewfs.mounttable.default.link./tmp=
+ hdfs://nnTmp/privateTmpForUserXXX
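How such a mount table maps a path in the view onto a target file system can be sketched as a longest-prefix lookup. This is a minimal sketch in plain Java, not the ViewFs implementation: MountTableSketch, addLink, resolve, and example are hypothetical names, and the file name data.txt is made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MountTableSketch {
    private final Map<String, String> links = new LinkedHashMap<>();

    /** Registers a link from a mount point in the view to a target URI. */
    public void addLink(String mountPoint, String target) {
        links.put(mountPoint, target);
    }

    /** Rewrites a path in the view using the longest matching mount point. */
    public String resolve(String path) {
        String best = null;
        for (String mountPoint : links.keySet()) {
            boolean matches = path.equals(mountPoint)
                    || path.startsWith(mountPoint + "/");
            if (matches && (best == null || mountPoint.length() > best.length())) {
                best = mountPoint;
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("No mount point for " + path);
        }
        return links.get(best) + path.substring(best.length());
    }

    /** Builds the mount table from the example above. */
    public static MountTableSketch example() {
        MountTableSketch table = new MountTableSketch();
        table.addLink("/user", "hdfs://nnContainingUserDir/user");
        table.addLink("/project/foo", "hdfs://nnProject1/projects/foo");
        table.addLink("/project/bar", "hdfs://nnProject2/projects/bar");
        table.addLink("/tmp", "hdfs://nnTmp/privateTmpForUserXXX");
        return table;
    }

    public static void main(String[] args) {
        // prints hdfs://nnProject1/projects/foo/data.txt
        System.out.println(example().resolve("/project/foo/data.txt"));
    }
}
```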
+ The default mount table (when no authority is specified) is
+ from config variables prefixed by fs.viewfs.mounttable.default
+ The authority component of a URI can be used to specify a different mount
+ table. For example,
+ - viewfs://sanjayMountable/
+ is initialized from fs.viewfs.mounttable.sanjayMountable.* config variables.
+ **** Merge Mounts **** (NOTE: merge mounts are not implemented yet.)
+ One can also use "MergeMounts" to merge several directories (this is
+ sometimes called union-mounts or junction-mounts in the literature).
+ For example, if the home directories are stored on, say, two file systems
+ (because they do not fit on one), then one could specify a mount
+ entry such as the following, which merges two dirs:
+ - /user -> hdfs://nnUser1/user,hdfs://nnUser2/user
+ Such a mergeLink can be specified with the following config var where ","
+ is used as the separator for each of the links to be merged:
+ - fs.viewfs.mounttable.default.linkMerge./user=
+ hdfs://nnUser1/user,hdfs://nnUser2/user
+ A special case of the merge mount is where mount table's root is merged
+ with the root (slash) of another file system:
+ - fs.viewfs.mounttable.default.linkMergeSlash=hdfs://nn99/
+ In this case the root of the mount table is merged with the root of
+ hdfs://nn99/ ]]>
+ Since these methods are often vendor- or device-specific, operators
+ may implement this interface in order to achieve fencing.
+ Fencing is configured by the operator as an ordered list of methods to
+ attempt. Each method will be tried in turn, and the next in the list
+ will only be attempted if the previous one fails. See {@link NodeFencer}
+ for more information.
+ If an implementation also implements {@link Configurable} then its
+ setConf
method will be called upon instantiation.]]>
+ state (e.g. ACTIVE/STANDBY) as well as
+ some additional information.
+ @throws AccessControlException
+ if access is denied.
+ @throws IOException
+ if other errors happen
+ @see HAServiceStatus]]>
+ hadoop.http.filter.initializers.
+- StaticUserWebFilter - An authorization plugin that makes all
+users a static configured user.
+ public class IntArrayWritable extends ArrayWritable {
+ public IntArrayWritable() {
+ super(IntWritable.class);
+ }
+ }
+ ]]>
+ o is a ByteWritable with the same value.]]>
+ the class of the item
+ @param conf the configuration to store
+ @param item the object to be stored
+ @param keyName the name of the key to use
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+ the class of the item
+ @param conf the configuration to use
+ @param items the objects to be stored
+ @param keyName the name of the key to use
+ @throws IndexOutOfBoundsException if the items array is empty
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+ the class of the item
+ @param conf the configuration to use
+ @param keyName the name of the key to use
+ @param itemClass the class of the item
+ @return restored object
+ @throws IOException : forwards Exceptions from the underlying
+ {@link Serialization} classes.]]>
+ DefaultStringifier offers convenience methods to store/load objects to/from
+ the configuration.
+ @param the class of the objects to stringify]]>
+ o is a DoubleWritable with the same value.]]>
+ value argument is null or
+ its size is zero, the elementType argument must not be null. If
+ the argument value's size is bigger than zero, the argument
+ elementType is not used.
+ @param value
+ @param elementType]]>
+ value should not be null
+ or empty.
+ @param value]]>
+ value and elementType. If the value argument
+ is null or its size is zero, the elementType argument must not be
+ null. If the argument value's size is bigger than zero, the
+ argument elementType is not used.
+ @param value
+ @param elementType]]>
+ o is an EnumSetWritable with the same value,
+ or both are null.]]>
+ o is a FloatWritable with the same value.]]>
+ When two sequence files, which have same Key type but different Value
+ types, are mapped out to reduce, multiple Value types are not allowed.
+ In this case, this class can help you wrap instances with different types.
+ Compared with ObjectWritable
, this class is much more effective,
+ because ObjectWritable
will append the class declaration as a String
+ into the output file in every Key-Value pair.
+ Generic Writable implements {@link Configurable} interface, so that it will be
+ configured by the framework. The configuration is passed to the wrapped objects
+ implementing {@link Configurable} interface before deserialization.
+ How to use it:
+ 1. Write your own class, such as GenericObject, which extends GenericWritable.
+ 2. Implement the abstract method getTypes(), which defines
+ the classes that will be wrapped in GenericObject in the application.
+ Attention: the classes defined in the getTypes() method must
+ implement Writable.
+ The code looks like this:
+ public class GenericObject extends GenericWritable {
+ private static Class[] CLASSES = {
+ ClassType1.class,
+ ClassType2.class,
+ ClassType3.class,
+ };
+ protected Class[] getTypes() {
+ return CLASSES;
+ }
+ }
+ @since Nov 8, 2006]]>
+ o is an IntWritable with the same value.]]>
+ closes the input and output streams
+ at the end.
+ @param in InputStream to read from
+ @param out OutputStream to write to
+ @param conf the Configuration object]]>
+ ignore any {@link IOException} or
+ null pointers. Must only be used for cleanup in exception handlers.
+ @param log the log to record problems to at debug level. Can be null.
+ @param closeables the objects to close]]>
+ This is better than File#listDir because it does not ignore IOExceptions.
+ @param dir The directory to list.
+ @param filter If non-null, the filter to use when listing
+ this directory.
+ @return The list of files in the directory.
+ @throws IOException On I/O error]]>
+ Borrowed from Uwe Schindler in LUCENE-5588
+ @param fileToSync the file to fsync]]>
+ o is a LongWritable with the same value.]]>
+ A map is a directory containing two files, the data
+ containing all keys and values in the map, and a smaller index
+ file, containing a fraction of the keys. The fraction is determined by
+ {@link Writer#getIndexInterval()}.
+ The index file is read entirely into memory. Thus key implementations
+ should try to keep themselves small.
Map files are created by adding entries in-order. To maintain a large
+ database, perform updates by copying the previous version of a database and
+ merging in a sorted change list, to create a new version of the database in
+ a new file. Sorting large change lists can be done with {@link
+ SequenceFile.Sorter}.]]>
+ o is an MD5Hash whose digest contains the
+ same values.]]>
+ className by first finding
+ it in the specified conf. If the specified conf is null,
+ try to load it directly.]]>
+ A {@link Comparator} that operates directly on byte representations of
+ objects.
+ @param
+ @see DeserializerComparator]]>
+ SequenceFiles are flat files consisting of binary key/value
+ pairs.
+ SequenceFile provides {@link SequenceFile.Writer},
+ {@link SequenceFile.Reader} and {@link Sorter} classes for writing,
+ reading and sorting respectively.
+ There are three SequenceFile Writers based on the
+ {@link CompressionType} used to compress key/value pairs:
+ - Writer: Uncompressed records.
+ - RecordCompressWriter: Record-compressed files, only compress
+ values.
+ - BlockCompressWriter: Block-compressed files, both keys &
+ values are collected in 'blocks'
+ separately and compressed. The size of
+ the 'block' is configurable.
+ The actual compression algorithm used to compress key and/or values can be
+ specified by using the appropriate {@link CompressionCodec}.
+ The recommended way is to use the static createWriter methods
+ provided by the SequenceFile to choose the preferred format.
+ The {@link SequenceFile.Reader} acts as the bridge and can read any of the
+ above SequenceFile formats.
+ Essentially there are 3 different formats for SequenceFiles
+ depending on the CompressionType specified. All of them share a
+ common header described below.
+ - version - 3 bytes of magic header SEQ, followed by 1 byte of actual
+ version number (e.g. SEQ4 or SEQ6)
+ - keyClassName - key class
+ - valueClassName - value class
+ - compression - A boolean which specifies if compression is turned on for
+ keys/values in this file.
+ - blockCompression - A boolean which specifies if block-compression is
+ turned on for keys/values in this file.
+ - compression codec - CompressionCodec class which is used for
+ compression of keys and/or values (if compression is
+ enabled).
+ - metadata - {@link Metadata} for this file.
+ - sync - A sync marker to denote end of the header.
+ Uncompressed SequenceFile Format
+ - Header
+ - Record
+   - Record length
+   - Key length
+   - Key
+   - Value
+ - A sync-marker every few bytes or so.
+ Record-Compressed SequenceFile Format
+ - Header
+ - Record
+   - Record length
+   - Key length
+   - Key
+   - Compressed Value
+ - A sync-marker every few bytes or so.
+ Block-Compressed SequenceFile Format
+ - Header
+ - Record Block
+   - Uncompressed number of records in the block
+   - Compressed key-lengths block-size
+   - Compressed key-lengths block
+   - Compressed keys block-size
+   - Compressed keys block
+   - Compressed value-lengths block-size
+   - Compressed value-lengths block
+   - Compressed values block-size
+   - Compressed values block
+ - A sync-marker every block.
+ The compressed blocks of key lengths and value lengths consist of the
+ actual lengths of individual keys/values encoded in ZeroCompressedInteger
+ format.
+ @see CompressionCodec]]>
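The version field in the header layout above (3 magic bytes SEQ followed by 1 version byte) can be checked without any Hadoop classes. The following is a minimal sketch, assuming the caller has already read the first four bytes of a file; `SeqHeaderSketch` and `versionOf` are hypothetical names and this is not how SequenceFile.Reader itself is implemented.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: inspects the 4-byte version field of a SequenceFile header. */
final class SeqHeaderSketch {
    private static final byte[] MAGIC = "SEQ".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns the version number stored in the fourth byte, or -1 if the
     * input is too short or does not start with the magic bytes "SEQ".
     */
    static int versionOf(byte[] firstFourBytes) {
        if (firstFourBytes == null || firstFourBytes.length < 4) {
            return -1;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (firstFourBytes[i] != MAGIC[i]) {
                return -1;
            }
        }
        // The version is a single byte; mask it so it reads as unsigned.
        return firstFourBytes[3] & 0xff;
    }
}
```

For example, the bytes `S`, `E`, `Q`, `6` yield version 6, while any other prefix is rejected.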
+ o is a ShortWritable with the same value.]]>
+ the class of the objects to stringify]]>
+ position. Note that this
+ method avoids using the converter or doing String instantiation
+ @return the Unicode scalar value at position or -1
+ if the position is invalid or points to a
+ trailing byte]]>
+ what in the backing
+ buffer, starting at position start. The starting
+ position is measured in bytes and the return value is in
+ terms of byte position in the buffer. The backing buffer is
+ not converted to a string for this operation.
+ @return byte position of the first occurrence of the search
+ string in the UTF-8 buffer or -1 if not found]]>
+ Note: For performance reasons, this call does not clear the
+ underlying byte array that is retrievable via {@link #getBytes()}.
+ In order to free the byte-array memory, call {@link #set(byte[])}
+ with an empty byte array (for example, new byte[0]).
+ o is a Text with the same contents.]]>
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.]]>
+ replace is true, then
+ malformed input is replaced with the
+ substitution character, which is U+FFFD. Otherwise the
+ method throws a MalformedInputException.
+ @return ByteBuffer: bytes stored at ByteBuffer.array()
+ and length is ByteBuffer.limit()]]>
+ In
+ addition, it provides methods for string traversal without converting the
+ byte array to a string. Also includes utilities for
+ serializing/deserializing a string, coding/decoding a string, checking if a
+ byte array contains valid UTF8 code, calculating the length of an encoded
+ string.]]>
+ This is useful when a class may evolve, so that instances written by the
+ old version of the class may still be processed by the new version. To
+ handle this situation, {@link #readFields(DataInput)}
+ implementations should catch {@link VersionMismatchException}.]]>
+ o is a VIntWritable with the same value.]]>
+ o is a VLongWritable with the same value.]]>
+ out.
+ @param out DataOutput to serialize this object into.
+ @throws IOException]]>
+ in.
+ For efficiency, implementations should attempt to re-use storage in the
+ existing object where possible.
+ @param in DataInput to deserialize this object from.
+ @throws IOException]]>
+ Any key or value type in the Hadoop Map-Reduce
+ framework implements this interface.
+ Implementations typically implement a static read(DataInput)
+ method which constructs a new instance, calls {@link #readFields(DataInput)}
+ and returns the instance.
+ Example:
+ public class MyWritable implements Writable {
+ // Some data
+ private int counter;
+ private long timestamp;
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+ public static MyWritable read(DataInput in) throws IOException {
+ MyWritable w = new MyWritable();
+ w.readFields(in);
+ return w;
+ }
+ }
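The write/readFields pair in the example above only needs `java.io.DataOutput` and `java.io.DataInput`, so a round trip can be sketched without Hadoop on the classpath. This is a minimal sketch under that assumption: `Record` mirrors MyWritable's fields but deliberately omits `implements Writable`, and `WritableRoundTripSketch`, `serialize` and `deserialize` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class WritableRoundTripSketch {
    /** Same fields and methods as MyWritable, minus the Hadoop interface. */
    static final class Record {
        int counter;
        long timestamp;

        void write(DataOutput out) throws IOException {
            out.writeInt(counter);
            out.writeLong(timestamp);
        }

        void readFields(DataInput in) throws IOException {
            counter = in.readInt();
            timestamp = in.readLong();
        }
    }

    /** Serializes a record into a byte array (4 bytes int + 8 bytes long). */
    static byte[] serialize(Record record) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            record.write(out);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a new instance and fills it via readFields, like the static read() idiom. */
    static Record deserialize(byte[] data) {
        try {
            Record record = new Record();
            record.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return record;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Fields must be read in exactly the order they were written, because the stream carries no field names.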
+ WritableComparables can be compared to each other, typically
+ via Comparators. Any type which is to be used as a
+ key in the Hadoop Map-Reduce framework should implement this
+ interface.
+ Note that hashCode() is frequently used in Hadoop to partition
+ keys. It's important that your implementation of hashCode() returns the same
+ result across different instances of the JVM. Note also that the default
+ hashCode() implementation in Object does not
+ satisfy this property.
+ Example:
+ public class MyWritableComparable
+     implements WritableComparable<MyWritableComparable> {
+ // Some data
+ private int counter;
+ private long timestamp;
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+ public int compareTo(MyWritableComparable o) {
+ // Order by counter; Integer.compare avoids subtraction overflow.
+ return Integer.compare(this.counter, o.counter);
+ }
+ public int hashCode() {
+ final int prime = 31;
+ int result = 1;
+ result = prime * result + counter;
+ result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
+ return result;
+ }
+ }
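To see why a JVM-independent hashCode() matters for partitioning, the sketch below recomputes the hash from the example above and maps it to a partition number. The mask-and-modulo step is a common hash-partitioning formula and is an assumption here, not a statement about which partitioner a given job uses; `HashPartitionSketch` and its methods are hypothetical names.

```java
final class HashPartitionSketch {
    /** Same arithmetic as the hashCode() in the example: depends only on field values. */
    static int stableHash(int counter, long timestamp) {
        final int prime = 31;
        int result = 1;
        result = prime * result + counter;
        result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
        return result;
    }

    /**
     * Maps a hash to a partition in [0, numPartitions). Masking with
     * Integer.MAX_VALUE clears the sign bit so the remainder is never negative.
     */
    static int partitionFor(int hash, int numPartitions) {
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Because the hash is derived only from field values, every JVM sends equal keys to the same partition. An identity-based Object.hashCode() would not give that guarantee.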
+ The default implementation reads the data into two {@link
+ WritableComparable}s (using {@link
+ Writable#readFields(DataInput)}), then calls {@link
+ #compare(WritableComparable,WritableComparable)}.]]>
+ The default implementation uses the natural ordering, calling {@link
+ Comparable#compareTo(Object)}.]]>
+ This base implementation uses the natural ordering. To define alternate
+ orderings, override {@link #compare(WritableComparable,WritableComparable)}.
+ One may optimize compare-intensive operations by overriding
+ {@link #compare(byte[],int,int,byte[],int,int)}. Static utility methods are
+ provided to assist in optimized implementations of this method.]]>
+ Enum type
+ @param in DataInput to read from
+ @param enumType Class type of Enum
+ @return Enum represented by String read from DataInput
+ @throws IOException]]>
+ len number of bytes in input stream in
+ @param in input stream
+ @param len number of bytes to skip
+ @throws IOException if fewer than len bytes could be skipped]]>
+ CompressionCodec for which to get the
+ Compressor
+ @param conf the Configuration object which contains settings for creating or reinitializing the compressor
+ @return Compressor for the given
+ CompressionCodec from the pool or a new one]]>
+ CompressionCodec for which to get the
+ Decompressor
+ @return Decompressor for the given
+ CompressionCodec from the pool or a new one]]>
+ Compressor to be returned to the pool]]>
+ Decompressor to be returned to the
+ pool]]>
+ Codec aliases are case insensitive.
+ The codec alias is the short class name (without the package name).
+ If the short class name ends with 'Codec', then there are two aliases for
+ the codec, the complete short class name and the short class name without
+ the 'Codec' ending. For example for the 'GzipCodec' codec class name the
+ aliases are 'gzip' and 'gzipcodec'.
+ @param codecName the canonical class name of the codec
+ @return the codec object]]>
+ Codec aliases are case insensitive.
+ The codec alias is the short class name (without the package name).
+ If the short class name ends with 'Codec', then there are two aliases for
+ the codec, the complete short class name and the short class name without
+ the 'Codec' ending. For example for the 'GzipCodec' codec class name the
+ aliases are 'gzip' and 'gzipcodec'.
+ @param codecName the canonical class name of the codec
+ @return the codec class]]>
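The alias rule described above can be sketched as a small pure-Java function. This is only an illustration of the stated rule, assuming aliases are stored in lower case because lookups are case insensitive; `CodecAliasSketch` and `aliasesFor` are hypothetical names, not part of the codec factory API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class CodecAliasSketch {
    private static final String SUFFIX = "Codec";

    /**
     * Derives the aliases for a codec class name: the lower-cased short class
     * name, plus the short name without a trailing "Codec" when it has one.
     */
    static List<String> aliasesFor(String className) {
        // Strip the package: everything after the last '.' is the short name.
        String shortName = className.substring(className.lastIndexOf('.') + 1);
        List<String> aliases = new ArrayList<>();
        aliases.add(shortName.toLowerCase(Locale.ROOT));
        if (shortName.endsWith(SUFFIX) && shortName.length() > SUFFIX.length()) {
            String trimmed = shortName.substring(0, shortName.length() - SUFFIX.length());
            aliases.add(trimmed.toLowerCase(Locale.ROOT));
        }
        return aliases;
    }
}
```

With this rule, a class named `GzipCodec` gets the aliases `gzipcodec` and `gzip`, matching the example in the text.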
+ Implementations are assumed to be buffered. This permits clients to
+ reposition the underlying input stream then call {@link #resetState()},
+ without having to also synchronize client buffers.]]>
+ true indicating that more input data is required.
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+ true if the input data buffer is empty and
+ #setInput() should be called in order to provide more input.]]>
+ true if the end of the compressed
+ data output stream has been reached.]]>
+ true indicating that more input data is required.
+ (Both native and non-native versions of various Decompressors require
+ that the data passed in via b[] remain unmodified until
+ the caller is explicitly notified--via {@link #needsInput()}--that the
+ buffer may be safely modified. With this requirement, an extra
+ buffer-copy can be avoided.)
+ @param b Input data
+ @param off Start offset
+ @param len Length]]>
+ true if the input data buffer is empty and
+ {@link #setInput(byte[], int, int)} should be called to
+ provide more input.
+ @return true if the input data buffer is empty and
+ {@link #setInput(byte[], int, int)} should be called in
+ order to provide more input.]]>
+ true if a preset dictionary is needed for decompression.
+ @return true if a preset dictionary is needed for decompression]]>
+ true if the end of the decompressed
+ data output stream has been reached. Indicates a concatenated data stream
+ when finished() returns true and {@link #getRemaining()}
+ returns a positive value. finished() will be reset with the
+ {@link #reset()} method.
+ @return true if the end of the decompressed
+ data output stream has been reached.]]>
+ true and getRemaining() returns a positive value. If
+ {@link #finished()} returns true and getRemaining() returns
+ a zero value, indicates that the end of data stream has been reached and
+ is not a concatenated data stream.
+ @return The number of bytes remaining in the compressed data buffer.]]>
+ true and {@link #getRemaining()} returns a positive value,
+ reset() is called before processing of the next data stream in the
+ concatenated data stream. {@link #finished()} will be reset and will
+ return false
when reset() is called.]]>
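The finished() / getRemaining() / reset() protocol for concatenated streams can be tried out with `java.util.zip.Inflater`, which exposes methods with the same names. The sketch below uses Inflater purely as a stand-in and is not Hadoop's Decompressor; `ConcatenatedStreamSketch` and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class ConcatenatedStreamSketch {
    /** Compresses one chunk into a complete, self-terminated zlib stream. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] concat(byte[] a, byte[] b) {
        byte[] joined = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, joined, a.length, b.length);
        return joined;
    }

    /** Inflates one or more zlib streams stored back to back in data. */
    static String inflateAll(byte[] data) {
        Inflater inflater = new Inflater();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            inflater.setInput(data, 0, data.length);
            while (true) {
                int n = inflater.inflate(buf);
                out.write(buf, 0, n);
                if (inflater.finished()) {
                    int remaining = inflater.getRemaining();
                    if (remaining == 0) {
                        break; // end of data, not a concatenated stream
                    }
                    // finished() with bytes left over: another stream follows.
                    inflater.reset();
                    inflater.setInput(data, data.length - remaining, remaining);
                } else if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unsupported input; stop instead of spinning
                }
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The key step is the branch where finished() is true and getRemaining() is positive: the leftover bytes are fed back in after reset().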
+ "none" - No compression.
+ "lzo" - LZO compression.
+ "gz" - GZIP compression.
+ ]]>
+ Block Compression.
+ Named meta data blocks.
+ Sorted or unsorted keys.
+ Seek by key or by file offset.
+ The memory footprint of a TFile includes the following:
+ - Some constant overhead of reading or writing a compressed block.
+ - Each compressed block requires one compression/decompression codec for
+ I/O.
+ - Temporary space to buffer the key.
+ - Temporary space to buffer the value (for TFile.Writer only). Values are
+ chunk encoded, so that we buffer at most one chunk of user data. By default,
+ the chunk buffer is 1MB. Reading chunked values does not require additional
+ memory.
+ - TFile index, which is proportional to the total number of Data Blocks.
+ The total amount of memory needed to hold the index can be estimated as
+ (56+AvgKeySize)*NumBlocks.
+ - MetaBlock index, which is proportional to the total number of Meta
+ Blocks. The total amount of memory needed to hold the index for Meta Blocks
+ can be estimated as (40+AvgMetaBlockName)*NumMetaBlock.
+ The behavior of TFile can be customized by the following variables through
+ Configuration:
+ - tfile.io.chunk.size: Value chunk size. Integer (in bytes). Defaults
+ to 1MB. Values shorter than the chunk size are guaranteed to have a
+ known value length at read time (See
+ {@link TFile.Reader.Scanner.Entry#isValueLengthKnown()}).
+ - tfile.fs.output.buffer.size: Buffer size used for
+ FSDataOutputStream. Integer (in bytes). Defaults to 256KB.
+ - tfile.fs.input.buffer.size: Buffer size used for
+ FSDataInputStream. Integer (in bytes). Defaults to 256KB.
+ Suggestions on performance optimization.
+ - Minimum block size. We recommend a minimum block size between
+ 256KB and 1MB for general usage. A larger block size is preferred if files are
+ primarily for sequential access. However, it leads to inefficient random
+ access (because there is more data to decompress). Smaller blocks are good
+ for random access, but require more memory to hold the block index, and may
+ be slower to create (because we must flush the compressor stream at the
+ conclusion of each data block, which leads to an FS I/O flush). Further, due
+ to the internal caching in the compression codec, the smallest possible block
+ size would be around 20KB-30KB.
+ - The current implementation does not offer true multi-threading for
+ reading. The implementation uses FSDataInputStream seek()+read(), which is
+ shown to be much faster than the positioned-read call in single-thread mode.
+ However, it also means that if multiple threads attempt to access the same
+ TFile (using multiple scanners) simultaneously, the actual I/O is carried out
+ sequentially even if they access different DFS blocks.
+ - Compression codec. Use "none" if the data is not very compressible (by
+ compressible, we mean a compression ratio of at least 2:1). Generally, use "lzo"
+ as the starting point for experimenting. "gz" offers a slightly better
+ compression ratio than "lzo" but requires 4x CPU to compress and 2x CPU to
+ decompress, compared to "lzo".
+ - File system buffering. If the underlying FSDataInputStream and
+ FSDataOutputStream are already adequately buffered, or if applications
+ read/write keys and values in large buffers, we can reduce the sizes of
+ input/output buffering in the TFile layer by setting the configuration parameters
+ "tfile.fs.input.buffer.size" and "tfile.fs.output.buffer.size".
+ Some design rationale behind TFile can be found at Hadoop-3315.]]>
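The two index-size estimates quoted above, (56+AvgKeySize)*NumBlocks and (40+AvgMetaBlockName)*NumMetaBlock, are simple enough to turn into a small calculator. A minimal sketch, assuming all sizes are in bytes; `TFileMemorySketch` and its method names are hypothetical.

```java
final class TFileMemorySketch {
    /** Estimated bytes for the data block index: (56 + AvgKeySize) * NumBlocks. */
    static long dataIndexBytes(long avgKeySize, long numBlocks) {
        return (56 + avgKeySize) * numBlocks;
    }

    /** Estimated bytes for the meta block index: (40 + AvgMetaBlockName) * NumMetaBlock. */
    static long metaIndexBytes(long avgMetaBlockNameSize, long numMetaBlocks) {
        return (40 + avgMetaBlockNameSize) * numMetaBlocks;
    }
}
```

For instance, 1000 data blocks with 44-byte average keys come to (56 + 44) * 1000 = 100,000 bytes of index, which shows how smaller blocks (and so more of them) raise the memory cost.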
+ Utils#writeVLong(out, n).
+ @param out
+ output stream
+ @param n
+ The integer to be encoded
+ @throws IOException
+ @see Utils#writeVLong(DataOutput, long)]]>
+ if n in [-32, 128): encode in one byte with the actual value.
+ Otherwise,
+ if n in [-20*2^8, 20*2^8): encode in two bytes: byte[0] = n/256 - 52;
+ byte[1]=n&0xff. Otherwise,
+ if n in [-16*2^16, 16*2^16): encode in three bytes: byte[0]=n/2^16 -
+ 88; byte[1]=(n>>8)&0xff; byte[2]=n&0xff. Otherwise,
+ if n in [-8*2^24, 8*2^24): encode in four bytes: byte[0]=n/2^24 - 112;
+ byte[1] = (n>>16)&0xff; byte[2] = (n>>8)&0xff; byte[3]=n&0xff. Otherwise:
+ if n in [-2^31, 2^31): encode in five bytes: byte[0]=-125; byte[1] =
+ (n>>24)&0xff; byte[2]=(n>>16)&0xff; byte[3]=(n>>8)&0xff; byte[4]=n&0xff;
+ if n in [-2^39, 2^39): encode in six bytes: byte[0]=-124; byte[1] =
+ (n>>32)&0xff; byte[2]=(n>>24)&0xff; byte[3]=(n>>16)&0xff;
+ byte[4]=(n>>8)&0xff; byte[5]=n&0xff
+ if n in [-2^47, 2^47): encode in seven bytes: byte[0]=-123; byte[1] =
+ (n>>40)&0xff; byte[2]=(n>>32)&0xff; byte[3]=(n>>24)&0xff;
+ byte[4]=(n>>16)&0xff; byte[5]=(n>>8)&0xff; byte[6]=n&0xff;
+ if n in [-2^55, 2^55): encode in eight bytes: byte[0]=-122; byte[1] =
+ (n>>48)&0xff; byte[2] = (n>>40)&0xff; byte[3]=(n>>32)&0xff;
+ byte[4]=(n>>24)&0xff; byte[5]=(n>>16)&0xff; byte[6]=(n>>8)&0xff;
+ byte[7]=n&0xff;
+ if n in [-2^63, 2^63): encode in nine bytes: byte[0]=-121; byte[1] =
+ (n>>56)&0xff; byte[2] = (n>>48)&0xff; byte[3] = (n>>40)&0xff;
+ byte[4]=(n>>32)&0xff; byte[5]=(n>>24)&0xff; byte[6]=(n>>16)&0xff;
+ byte[7]=(n>>8)&0xff; byte[8]=n&0xff;
+ @param out
+ output stream
+ @param n
+ the integer number
+ @throws IOException]]>
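The encoding rules above can be sketched in plain Java. This is an illustrative reading of the listed ranges, not the library's implementation. It makes two labeled assumptions: 127 is treated as a one-byte value (consistent with the decoding rule that any first byte >= -32 is the value itself), and the divisions n/256, n/2^16, n/2^24 are taken as arithmetic right shifts so negative numbers round toward negative infinity. `VLongEncodeSketch` and `encode` are hypothetical names.

```java
import java.io.ByteArrayOutputStream;

final class VLongEncodeSketch {
    /** Encodes n into 1 to 9 bytes following the ranges listed above. */
    static byte[] encode(long n) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // ByteArrayOutputStream.write(int) stores only the low 8 bits.
        if (n >= -32 && n < 128) {
            out.write((int) n);
        } else if (n >= -(20L << 8) && n < (20L << 8)) {
            out.write((int) (n >> 8) - 52);
            out.write((int) n);
        } else if (n >= -(16L << 16) && n < (16L << 16)) {
            out.write((int) (n >> 16) - 88);
            out.write((int) (n >> 8));
            out.write((int) n);
        } else if (n >= -(8L << 24) && n < (8L << 24)) {
            out.write((int) (n >> 24) - 112);
            out.write((int) (n >> 16));
            out.write((int) (n >> 8));
            out.write((int) n);
        } else {
            // Find the smallest payload length (4..8 bytes) whose signed range holds n.
            int len = 4;
            while (len < 8 && (n < -(1L << (8 * len - 1)) || n >= (1L << (8 * len - 1)))) {
                len++;
            }
            out.write(len - 129); // -125 for 4 payload bytes ... -121 for 8
            for (int shift = 8 * (len - 1); shift >= 0; shift -= 8) {
                out.write((int) (n >> shift)); // big-endian payload
            }
        }
        return out.toByteArray();
    }
}
```

Small magnitudes get the shortest forms: 0 and 127 take one byte, 128 takes two, and only values outside the 32-bit range need more than five bytes.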
+ (int)Utils#readVLong(in).
+ @param in
+ input stream
+ @return the decoded integer
+ @throws IOException
+ @see Utils#readVLong(DataInput)]]>
+ if (FB >= -32), return (long)FB;
+ if (FB in [-72, -33]), return (FB+52)<<8 + NB[0]&0xff;
+ if (FB in [-104, -73]), return (FB+88)<<16 + (NB[0]&0xff)<<8 +
+ NB[1]&0xff;
+ if (FB in [-120, -105]), return (FB+112)<<24 + (NB[0]&0xff)<<16 +
+ (NB[1]&0xff)<<8 + NB[2]&0xff;
+ if (FB in [-128, -121]), return interpret NB[FB+129] as a signed
+ big-endian integer.
+ @param in
+ input stream
+ @return the decoded long integer.
+ @throws IOException]]>
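The decoding rules above, where FB is the first byte and NB the bytes that follow, can be sketched over a plain byte array. This is an illustration of the listed cases rather than the library's code; it assumes the array holds a complete encoded value starting at index 0, and `VLongDecodeSketch` and `decode` are hypothetical names.

```java
final class VLongDecodeSketch {
    /** Decodes one variable-length long that starts at buf[0]. */
    static long decode(byte[] buf) {
        int fb = buf[0]; // first byte, sign-extended
        if (fb >= -32) {
            return fb; // one-byte form: the byte is the value
        }
        if (fb >= -72) {
            return ((long) (fb + 52) << 8) | (buf[1] & 0xff);
        }
        if (fb >= -104) {
            return ((long) (fb + 88) << 16)
                | ((buf[1] & 0xff) << 8)
                | (buf[2] & 0xff);
        }
        if (fb >= -120) {
            return ((long) (fb + 112) << 24)
                | ((buf[1] & 0xff) << 16)
                | ((buf[2] & 0xff) << 8)
                | (buf[3] & 0xff);
        }
        // fb in [-128, -121]: the next fb + 129 bytes are a signed big-endian integer.
        int len = fb + 129;
        long value = buf[1]; // keep the sign of the most significant payload byte
        for (int i = 2; i <= len; i++) {
            value = (value << 8) | (buf[i] & 0xff);
        }
        return value;
    }
}
```

Masking each following byte with 0xff matters: without it, a byte such as 0x80 would be sign-extended and corrupt the higher bits.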
+ Type of the input key.
+ @param list
+ The list
+ @param key
+ The input key.
+ @param cmp
+ Comparator for the key.
+ @return The index to the desired element if it exists; or list.size()
+ otherwise.]]>
+ Type of the input key.
+ @param list
+ The list
+ @param key
+ The input key.
+ @param cmp
+ Comparator for the key.
+ @return The index to the desired element if it exists; or list.size()
+ otherwise.]]>
+ Type of the input key.
+ @param list
+ The list
+ @param key
+ The input key.
+ @return The index to the desired element if it exists; or list.size()
+ otherwise.]]>
+ Type of the input key.
+ @param list
+ The list
+ @param key
+ The input key.
+ @return The index to the desired element if it exists; or list.size()
+ otherwise.]]>
+ An experimental {@link Serialization} for Java {@link Serializable} classes.
+ @see JavaSerializationComparator]]>
+ A {@link RawComparator} that uses a {@link JavaSerialization}
+ {@link Deserializer} to deserialize objects that are then compared via
+ their {@link Comparable} interfaces.
+ @param
+ @see JavaSerialization]]>
+This package provides a mechanism for using different serialization frameworks
+in Hadoop. The property "io.serializations" defines a list of
+{@link org.apache.hadoop.io.serializer.Serialization}s that know how to create
+{@link org.apache.hadoop.io.serializer.Serializer}s and
+{@link org.apache.hadoop.io.serializer.Deserializer}s.
+To add a new serialization framework write an implementation of
+{@link org.apache.hadoop.io.serializer.Serialization} and add its name to the
+"io.serializations" property.
+ avro.reflect.pkgs or implement
+ {@link AvroReflectSerializable} interface.]]>
+This package provides Avro serialization in Hadoop. This can be used to
+serialize/deserialize Avro types in Hadoop.
+Use {@link org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization} for
+serialization of classes generated by Avro's 'specific' compiler.
+Use {@link org.apache.hadoop.io.serializer.avro.AvroReflectSerialization} for
+other classes.
+{@link org.apache.hadoop.io.serializer.avro.AvroReflectSerialization} works for
+any class which is either in the package list configured via
+{@link org.apache.hadoop.io.serializer.avro.AvroReflectSerialization#AVRO_REFLECT_PACKAGES}
+or implements {@link org.apache.hadoop.io.serializer.avro.AvroReflectSerializable}.
+The API is abstract so that it can be implemented on top of
+a variety of metrics client libraries. The choice of
+client library is a configuration option, and different
+modules within the same application can use
+different metrics implementation libraries.
+ org.apache.hadoop.metrics.spi
+ - The abstract Server Provider Interface package. Those wishing to
+ integrate the metrics API with a particular metrics client library should
+ extend this package.
+ org.apache.hadoop.metrics.file
+ - An implementation package which writes the metric data to
+ a file, or sends it to the standard output stream.
+ -
+ - An implementation package which sends metric data to
+ Ganglia.
+Introduction to the Metrics API
+Here is a simple example of how to use this package to report a single
+metric value:
+ private ContextFactory contextFactory = ContextFactory.getFactory();
+ void reportMyMetric(float myMetric) {
+ MetricsContext myContext = contextFactory.getContext("myContext");
+ MetricsRecord myRecord = myContext.getRecord("myRecord");
+ myRecord.setMetric("myMetric", myMetric);
+ myRecord.update();
+ }
+In this example there are three names:
+ - myContext
+ - The context name will typically identify either the application, or else a
+ module within an application or library.
+ - myRecord
+ - The record name generally identifies some entity for which a set of
+ metrics are to be reported. For example, you could have a record named
+ "cacheStats" for reporting a number of statistics relating to the usage of
+ some cache in your application.
+ - myMetric
+ - This identifies a particular metric. For example, you might have metrics
+ named "cache_hits" and "cache_misses".
+In some cases it is useful to have multiple records with the same name. For
+example, suppose that you want to report statistics about each disk on a computer.
+In this case, the record name would be something like "diskStats", but you also
+need to identify the disk which is done by adding a tag to the record.
+The code could look something like this:
+ private MetricsRecord diskStats =
+ contextFactory.getContext("myContext").getRecord("diskStats");
+ void reportDiskMetrics(String diskName, float diskBusy, float diskUsed) {
+ diskStats.setTag("diskName", diskName);
+ diskStats.setMetric("diskBusy", diskBusy);
+ diskStats.setMetric("diskUsed", diskUsed);
+ diskStats.update();
+ }
+Buffering and Callbacks
+Data is not sent immediately to the metrics system when
is called. Instead it is stored in an
+internal table, and the contents of the table are sent periodically.
+This can be important for two reasons:
+ - It means that a programmer is free to put calls to this API in an
+ inner loop, since updates can be very frequent without slowing down
+ the application significantly.
+ - Some implementations can gain efficiency by combining many metrics
+ into a single UDP message.
+The API provides a timer-based callback via the
method. The benefit of this
+versus using java.util.Timer
is that the callbacks will be done
+immediately before sending the data, making the data as current as possible.
+It is possible to programmatically examine and modify configuration data
+before creating a context, like this:
+ ContextFactory factory = ContextFactory.getFactory();
+ ... examine and/or modify factory attributes ...
+ MetricsContext context = factory.getContext("myContext");
+The factory attributes can be examined and modified using the following
+ Object getAttribute(String attributeName)
+ String[] getAttributeNames()
+ void setAttribute(String name, Object value)
+ void removeAttribute(attributeName)
initializes the factory attributes by
+reading the properties file hadoop-metrics.properties
if it exists
+on the class path.
+A factory attribute named:
+should have as its value the fully qualified name of the class to be
+instantiated by a call of the CodeFactory
. If this factory attribute is not
+specified, the default is to instantiate
+Other factory attributes are specific to a particular implementation of this
+API and are documented elsewhere. For example, configuration attributes for
+the file and Ganglia implementations can be found in the javadoc for
+their respective packages.]]>
+Implementation of the metrics package that sends metric data to Ganglia.
+Programmers should not normally need to use this package directly. Instead
+they should use org.apache.hadoop.metrics.
+These are the implementation specific factory attributes
+(See ContextFactory.getFactory()):
+ - contextName.servers
+ - Space and/or comma separated sequence of servers to which UDP
+ messages should be sent.
+ - contextName.period
+ - The period in seconds on which the metric data is sent to the
+ server(s).
+ - contextName.multicast
+ - Enable multicast for Ganglia
+ - contextName.multicast.ttl
+ - TTL for multicast packets
+ - contextName.units.recordName.metricName
+ - The units for the specified metric in the specified record.
+ - contextName.slope.recordName.metricName
+ - The slope for the specified metric in the specified record.
+ - contextName.tmax.recordName.metricName
+ - The tmax for the specified metric in the specified record.
+ - contextName.dmax.recordName.metricName
+ - The dmax for the specified metric in the specified record.
+ contextName.tableName. The returned map consists of
+ those attributes with the contextName and tableName stripped off.]]>
+ recordName.
+ Throws an exception if the metrics implementation is configured with a fixed
+ set of record names and recordName
is not in that set.
+ @param recordName the name of the record
+ @throws MetricsException if recordName conflicts with configuration data]]>
+ This class implements the internal table of metric data, and the timer
+ on which data is to be sent to the metrics system. Subclasses must
+ override the abstract emitRecord
method in order to transmit
+ the data.
+ @deprecated Use org.apache.hadoop.metrics2 package instead.]]>
+ update
+ and remove()
+ @deprecated Use {@link org.apache.hadoop.metrics2.impl.MetricsRecordImpl}
+ instead.]]>
+ hostname or hostname:port. If
+ the specs string is null, defaults to localhost:defaultPort.
+ @return a list of InetSocketAddress objects.]]>
+ org.apache.hadoop.metrics.file and
+Plugging in an implementation involves writing a concrete subclass of
. The subclass should get its
+ configuration information using the getAttribute(attributeName)
+ method.]]>
+ Implementations of this interface consume the {@link MetricsRecord} generated
+ from {@link MetricsSource}. It registers with {@link MetricsSystem} which
+ periodically pushes the {@link MetricsRecord} to the sink using
+ {@link #putMetrics(MetricsRecord)} method. If the implementing class also
+ implements {@link Closeable}, then the MetricsSystem will close the sink when
+ it is stopped.]]>
+ the actual type of the source object
+ @param source object to register
+ @return the source object
+ @exception MetricsException]]>
+ the actual type of the source object
+ @param source object to register
+ @param name of the source. Must be unique or null (then extracted from
+ the annotations of the source object.)
+ @param desc the description of the source (or null. See above.)
+ @return the source object
+ @exception MetricsException]]>
+ CollectD StatsD plugin).
+ To configure this plugin, you will need to add the following
+ entries to your hadoop-metrics2.properties file:
+ *.sink.statsd.class=org.apache.hadoop.metrics2.sink.StatsDSink
+ [prefix].sink.statsd.server.host=
+ [prefix].sink.statsd.server.port=
+ [prefix].sink.statsd.skip.hostname=true|false (optional)
+ [prefix].sink.statsd.service.name=NameNode (name you want for service)
+ ,name="
+ Where the serviceName and nameName are the supplied parameters
+ @param serviceName
+ @param nameName
+ @param theMbean - the MBean to register
+ @return the name used to register the MBean]]>
+ hostname or hostname:port. If
+ the specs string is null, defaults to localhost:defaultPort.
+ @param specs server specs (see description)
+ @param defaultPort the default port if not specified
+ @return a list of InetSocketAddress objects.]]>
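The server-spec format described above (hostname or hostname:port, with a localhost:defaultPort fallback for a null string) can be sketched as follows. Two assumptions are not stated in this passage: entries are separated by spaces and/or commas, and addresses are created unresolved to avoid DNS lookups. `ServerSpecSketch` and `parse` here are hypothetical names for illustration only.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

final class ServerSpecSketch {
    /** Parses "host", "host:port" entries; a null specs string yields localhost:defaultPort. */
    static List<InetSocketAddress> parse(String specs, int defaultPort) {
        List<InetSocketAddress> result = new ArrayList<>();
        if (specs == null) {
            result.add(InetSocketAddress.createUnresolved("localhost", defaultPort));
            return result;
        }
        // Assumed separator: any run of spaces and/or commas.
        for (String spec : specs.split("[ ,]+")) {
            if (spec.isEmpty()) {
                continue; // produced by leading separators
            }
            int colon = spec.indexOf(':');
            if (colon < 0) {
                result.add(InetSocketAddress.createUnresolved(spec, defaultPort));
            } else {
                String host = spec.substring(0, colon);
                int port = Integer.parseInt(spec.substring(colon + 1));
                result.add(InetSocketAddress.createUnresolved(host, port));
            }
        }
        return result;
    }
}
```

An entry without a port simply inherits the default, so `"a:1, b"` with default port 9 produces a:1 and b:9.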
+ This method is used when parts of Hadoop need to know whether to apply
+ single rack vs multi-rack policies, such as during block placement.
+ Such algorithms behave differently if they are on multi-switch systems.
+ @return true if the mapping thinks that it is on a single switch]]>
+ This predicate simply assumes that all mappings not derived from
+ this class are multi-switch.
+ @param mapping the mapping to query
+ @return true if the base class says it is single switch, or the mapping
+ is not derived from this class.]]>
+ It is not mandatory to
+ derive {@link DNSToSwitchMapping} implementations from it, but it is strongly
+ recommended, as it makes it easy for the Hadoop developers to add new methods
+ to this base class that are automatically picked up by all implementations.
+ This class does not extend the Configured
+ base class, and should not be changed to do so, as it causes problems
+ for subclasses. The constructor of the Configured base class calls
+ the {@link #setConf(Configuration)} method, which will call into the
+ subclasses before they have been fully constructed.]]>
+ If a name cannot be resolved to a rack, the implementation
+ should return {@link NetworkTopology#DEFAULT_RACK}. This
+ is what the bundled implementations do, though it is not a formal requirement
+ @param names the list of hosts to resolve (can be empty)
+ @return list of resolved network paths.
+ If names is empty, the returned list is also empty]]>
+ Calling {@link #setConf(Configuration)} will trigger a
+ re-evaluation of the configuration settings and so can be used to
+ set up the mapping script.]]>
+ This will get called in the superclass constructor, so a check is needed
+ to ensure that the raw mapping is defined before trying to relay a null
+ configuration.
+ @param conf]]>
+ It contains a static class RawScriptBasedMapping that performs
+ the work: reading the configuration parameters, executing any defined
+ script, handling errors and such like. The outer
+ class extends {@link CachedDNSToSwitchMapping} to cache the delegated
+ queries.
+ This DNS mapper's {@link #isSingleSwitch()} predicate returns
+ true if and only if a script is defined.]]>
+ Simple {@link DNSToSwitchMapping} implementation that reads a 2 column text
+ file. The columns are separated by whitespace. The first column is a DNS or
+ IP address and the second column specifies the rack to which the address maps.
+ This class uses the configuration parameter {@code
+ net.topology.table.file.name} to locate the mapping file.
+ Calls to {@link #resolve(List)} will look up the address as defined in the
+ mapping file. If no entry corresponding to the address is found, the value
+ {@code /default-rack} is returned.
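+ The lookup behavior described above can be sketched as follows. This is a
+ minimal illustration, not the class's actual implementation: the class and
+ method names (TableMappingSketch, parse, resolve) are hypothetical, and the
+ file contents are passed in as a list of lines rather than read from the
+ path named by net.topology.table.file.name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a two-column table lookup; not Hadoop's code.
class TableMappingSketch {
    static final String DEFAULT_RACK = "/default-rack";

    // Builds the address -> rack table from whitespace-separated lines.
    // Lines that do not have exactly two columns are skipped (an assumption).
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> table = new HashMap<>();
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length == 2) {
                table.put(cols[0], cols[1]);
            }
        }
        return table;
    }

    // Resolves each name to its rack, falling back to the default rack.
    static List<String> resolve(Map<String, String> table, List<String> names) {
        List<String> racks = new ArrayList<>(names.size());
        for (String name : names) {
            racks.add(table.getOrDefault(name, DEFAULT_RACK));
        }
        return racks;
    }
}
```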
+ Avro.]]>
+ Avro.]]>
+ =} getCount().
+ @param newCapacity The new capacity in bytes.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Index idx = startVector(...);
+ while (!idx.done()) {
+ .... // read element of a vector
+ idx.incr();
+ }
+ @deprecated Replaced by Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ (DEPRECATED) Hadoop record I/O contains classes and a record description language
+ translator for simplifying serialization and deserialization of records in a
+ language-neutral manner.
+ DEPRECATED: Replaced by Avro.
+ Introduction
+ Software systems of any significant complexity require mechanisms for data
+interchange with the outside world. These interchanges typically involve the
+marshaling and unmarshaling of logical units of data to and from data streams
+(files, network connections, memory buffers etc.). Applications usually have
+some code for serializing and deserializing the data types that they manipulate
+embedded in them. The work of serialization has several features that make
+automatic code generation for it worthwhile. Given a particular output encoding
+(binary, XML, etc.), serialization of primitive types and simple compositions
+of primitives (structs, vectors etc.) is a very mechanical task. Manually
+written serialization code can be susceptible to bugs especially when records
+have a large number of fields or a record definition changes between software
+versions. Lastly, it can be very useful for applications written in different
+programming languages to be able to share and interchange data. This can be
+made a lot easier by describing the data records manipulated by these
+applications in a language agnostic manner and using the descriptions to derive
+implementations of serialization in multiple target languages.
+This document describes Hadoop Record I/O, a mechanism that is aimed at
+- enabling the specification of simple serializable data types (records)
+- enabling the generation of code in multiple target languages for
+marshaling and unmarshaling such types
+- providing target language specific support that will enable application
+programmers to incorporate generated code into their applications
+The goals of Hadoop Record I/O are similar to those of mechanisms such as XDR,
+ASN.1, PADS and ICE. While these systems all include a DDL that enables
+the specification of most record types, they differ widely in what else they
+focus on. The focus in Hadoop Record I/O is on data marshaling and
+multi-lingual support. We take a translator-based approach to serialization.
+Hadoop users have to describe their data in a simple data description
+language. The Hadoop DDL translator rcc generates code that users
+can invoke in order to read/write their data from/to simple stream
+abstractions. Next we list explicitly some of the goals and non-goals of
+Hadoop Record I/O.
+- Support for commonly used primitive types. Hadoop should include as
+primitives commonly used builtin types from programming languages we intend to
- Support for common data compositions (including recursive compositions).
+Hadoop should support widely used composite types such as structs and
- Code generation in multiple target languages. Hadoop should be capable of
+generating serialization code in multiple target languages and should be
+easily extensible to new target languages. The initial target languages are
+C++ and Java.
+- Support for generated target languages. Hadoop should include support
+in the form of headers, libraries, packages for supported target languages
+that enable easy inclusion and use of generated code in applications.
- Support for multiple output encodings. Candidates include
+packed binary, comma-separated text, XML etc.
- Support for specifying record types in a backwards/forwards compatible
+manner. This will probably be in the form of support for optional fields in
+records. This version of the document does not include a description of the
+planned mechanism; we intend to include it in the next iteration.
+- Serializing existing arbitrary C++ classes.
+- Serializing complex data structures such as trees, linked lists etc.
+- Built-in indexing schemes, compression, or check-sums.
+- Dynamic construction of objects from an XML schema.
+The remainder of this document describes the features of Hadoop record I/O
+in more detail. Section 2 describes the data types supported by the system.
+Section 3 lays out the DDL syntax with some examples of simple records.
+Section 4 describes the process of code generation with rcc. Section 5
+describes target language mappings and support for Hadoop types. We include a
+fairly complete description of C++ mappings with intent to include Java and
+others in upcoming iterations of this document. The last section talks about
+supported output encodings.
+Data Types and Streams
+This section describes the primitive and composite types supported by Hadoop.
+We aim to support a set of types that can be used to simply and efficiently
+express a wide range of record types in different programming languages.
+Primitive Types
+For the most part, the primitive types of Hadoop map directly to primitive
+types in high level programming languages. Special cases are the
+ustring (a Unicode string) and buffer types, which we believe
+find wide use and which are usually implemented in library code and not
+available as language built-ins. Hadoop also supplies these via library code
+when a target language built-in is not present and there is no widely
+adopted "standard" implementation. The complete list of primitive types is:
+- byte: An 8-bit unsigned integer.
+- boolean: A boolean value.
+- int: A 32-bit signed integer.
+- long: A 64-bit signed integer.
+- float: A single precision floating point number as described by
+ IEEE-754.
+- double: A double precision floating point number as described by
+ IEEE-754.
+- ustring: A string consisting of Unicode characters.
+- buffer: An arbitrary sequence of bytes.
+Composite Types
+Hadoop supports a small set of composite types that enable the description
+of simple aggregate types and containers. A composite type is serialized
+by sequentially serializing its constituent elements. The supported
+composite types are:
+ - record: An aggregate type like a C-struct. This is a list of
+typed fields that are together considered a single unit of data. A record
+is serialized by sequentially serializing its constituent fields. In addition
+to serialization a record has comparison operations (equality and less-than)
+implemented for it; these are defined as memberwise comparisons.
+- vector: A sequence of entries of the same data type, primitive
+or composite.
+- map: An associative container mapping instances of a key type to
+instances of a value type. The key and value types may themselves be primitive
+or composite types.
+Hadoop generates code for serializing and deserializing record types to
+abstract streams. For each target language Hadoop defines very simple input
+and output stream interfaces. Application writers can usually develop
+concrete implementations of these by putting a one method wrapper around
+an existing stream implementation.
+DDL Syntax and Examples
+We now describe the syntax of the Hadoop data description language. This is
+followed by a few examples of DDL usage.
+Hadoop DDL Syntax
+recfile = *include module *record
+include = "include" path
+path = (relative-path / absolute-path)
+module = "module" module-name
+module-name = name *("." name)
+record = "class" name "{" 1*(field) "}"
+field = type name ";"
+name = ALPHA *(ALPHA / DIGIT / "_")
+type = (ptype / ctype)
+ptype = ("byte" / "boolean" / "int" /
+ "long" / "float" / "double" /
+ "ustring" / "buffer")
+ctype = (("vector" "<" type ">") /
+ ("map" "<" type "," type ">") / name)
+A DDL file describes one or more record types. It begins with zero or
+more include declarations, a single mandatory module declaration
+followed by zero or more class declarations. The semantics of each of
+these declarations are described below:
+- include: An include declaration specifies a DDL file to be
+referenced when generating code for types in the current DDL file. Record types
+in the current compilation unit may refer to types in all included files.
+File inclusion is recursive. An include does not trigger code
+generation for the referenced file.
- module: Every Hadoop DDL file must have a single module
+declaration that follows the list of includes and precedes all record
+declarations. A module declaration identifies a scope within which
+the names of all types in the current file are visible. Module names are
+mapped to C++ namespaces, Java packages etc. in generated code.
+- class: Record types are specified through class
+declarations. A class declaration is like a Java class declaration.
+It specifies a named record type and a list of fields that constitute records
+of the type. Usage is illustrated in the following examples.
+- A simple DDL file links.jr with just one record declaration.
+module links {
+ class Link {
+ ustring URL;
+ boolean isRelative;
+ ustring anchorText;
+ };
+}
+ - A DDL file outlinks.jr which includes another
+include "links.jr"
+module outlinks {
+ class OutLinks {
+ ustring baseURL;
+ vector<links.Link> outLinks;
+ };
+}
+Code Generation
+The Hadoop translator is written in Java. Invocation is done by executing a
+wrapper shell script named rcc. It takes a list of
+record description files as a mandatory argument and an
+optional language argument (the default is Java) --language or
+-l. Thus a typical invocation would look like:
+$ rcc -l C++ ...
+Target Language Mappings and Support
+For all target languages, the unit of code generation is a record type.
+For each record type, Hadoop generates code for serialization and
+deserialization, record comparison and access to record members.
+Support for including Hadoop generated C++ code in applications comes in the
+form of a header file recordio.hh which needs to be included in source
+that uses Hadoop types and a library librecordio.a which applications need
+to be linked with. The header declares the Hadoop C++ namespace which defines
+appropriate types for the various primitives, the basic interfaces for
+records and streams and enumerates the supported serialization encodings.
+Declarations of these interfaces and a description of their semantics follow:
+namespace hadoop {
+ enum RecFormat { kBinary, kXML, kCSV };
+ class InStream {
+ public:
+ virtual ssize_t read(void *buf, size_t n) = 0;
+ };
+ class OutStream {
+ public:
+ virtual ssize_t write(const void *buf, size_t n) = 0;
+ };
+ class IOError : public runtime_error {
+ public:
+ explicit IOError(const std::string& msg);
+ };
+ class IArchive;
+ class OArchive;
+ class RecordReader {
+ public:
+ RecordReader(InStream& in, RecFormat fmt);
+ virtual ~RecordReader(void);
+ virtual void read(Record& rec);
+ };
+ class RecordWriter {
+ public:
+ RecordWriter(OutStream& out, RecFormat fmt);
+ virtual ~RecordWriter(void);
+ virtual void write(Record& rec);
+ };
+ class Record {
+ public:
+ virtual std::string type(void) const = 0;
+ virtual std::string signature(void) const = 0;
+ protected:
+ virtual bool validate(void) const = 0;
+ virtual void
+ serialize(OArchive& oa, const std::string& tag) const = 0;
+ virtual void
+ deserialize(IArchive& ia, const std::string& tag) = 0;
+ };
+- RecFormat: An enumeration of the serialization encodings supported
+by this implementation of Hadoop.
- InStream: A simple abstraction for an input stream. This has a
+single public read method that reads n bytes from the stream into
+the buffer buf. Has the same semantics as a blocking read system
+call. Returns the number of bytes read or -1 if an error occurs.
- OutStream: A simple abstraction for an output stream. This has a
+single write method that writes n bytes to the stream from the
+buffer buf. Has the same semantics as a blocking write system
+call. Returns the number of bytes written or -1 if an error occurs.
- RecordReader: A RecordReader reads records one at a time from
+an underlying stream in a specified record format. The reader is instantiated
+with a stream and a serialization format. It has a read method that
+takes an instance of a record and deserializes the record from the stream.
- RecordWriter: A RecordWriter writes records one at a
+time to an underlying stream in a specified record format. The writer is
+instantiated with a stream and a serialization format. It has a
+write method that takes an instance of a record and serializes the
+record to the stream.
- Record: The base class for all generated record types. This has two
+public methods type and signature that return the typename and the
+type signature of the record.
+Two files are generated for each record file (note: not for each record). If a
+record file is named "name.jr", the generated files are
+"name.jr.cc" and "name.jr.hh" containing serialization
+implementations and record type declarations respectively.
+For each record in the DDL file, the generated header file will contain a
+class definition corresponding to the record type; method definitions for the
+generated type will be present in the '.cc' file. The generated class will
+inherit from the abstract class hadoop::Record. The DDL file's
+module declaration determines the namespace the record belongs to.
+Each '.' delimited token in the module declaration results in the
+creation of a namespace. For instance, the declaration module docs.links
+results in the creation of a docs namespace and a nested
+docs::links namespace. In the preceding examples, the Link class
+is placed in the links namespace. The header file corresponding to
+the links.jr file will contain:
+namespace links {
+ class Link : public hadoop::Record {
+ // ....
+ };
+}
+Each field within the record will cause the generation of a private member
+declaration of the appropriate type in the class declaration, and one or more
+accessor methods. The generated class will implement the serialize and
+deserialize methods defined in hadoop::Record. It will also
+implement the inspection methods type and signature from
+hadoop::Record. A default constructor and virtual destructor will also
+be generated. Serialization code will read/write records into streams that
+implement the hadoop::InStream and the hadoop::OutStream interfaces.
+For each member of a record an accessor method is generated that returns
+either the member or a reference to the member. For members that are returned
+by value, a setter method is also generated. This is true for primitive
+data members of the types byte, int, long, boolean, float and
+double. For example, for an int field called MyField the following
+code is generated.
+ int32_t mMyField;
+ ...
+ int32_t getMyField(void) const {
+ return mMyField;
+ };
+ void setMyField(int32_t m) {
+ mMyField = m;
+ };
+ ...
+For a ustring, buffer or composite field, the generated code
+contains only accessors that return a reference to the field. A const
+and a non-const accessor are generated. For example:
+ std::string mMyBuf;
+ ...
+ std::string& getMyBuf() {
+ return mMyBuf;
+ };
+ const std::string& getMyBuf() const {
+ return mMyBuf;
+ };
+ ...
+Suppose the inclrec.jr file contains:
+module inclrec {
+ class RI {
+ int I32;
+ double D;
+ ustring S;
+ };
+}
+and the testrec.jr file contains:
+include "inclrec.jr"
+module testrec {
+ class R {
+ vector<float> VF;
+ RI Rec;
+ buffer Buf;
+ };
+}
+Then the invocation of rcc such as:
+$ rcc -l c++ inclrec.jr testrec.jr
+will result in generation of four files:
+inclrec.jr.{cc,hh} and testrec.jr.{cc,hh}.
+The inclrec.jr.hh will contain:
+#ifndef _INCLREC_JR_HH_
+#define _INCLREC_JR_HH_
+#include "recordio.hh"
+namespace inclrec {
+ class RI : public hadoop::Record {
+ private:
+ int32_t I32;
+ double D;
+ std::string S;
+ public:
+ RI(void);
+ virtual ~RI(void);
+ virtual bool operator==(const RI& peer) const;
+ virtual bool operator<(const RI& peer) const;
+ virtual int32_t getI32(void) const { return I32; }
+ virtual void setI32(int32_t v) { I32 = v; }
+ virtual double getD(void) const { return D; }
+ virtual void setD(double v) { D = v; }
+ virtual std::string& getS(void) { return S; }
+ virtual const std::string& getS(void) const { return S; }
+ virtual std::string type(void) const;
+ virtual std::string signature(void) const;
+ protected:
+ virtual void serialize(hadoop::OArchive& a) const;
+ virtual void deserialize(hadoop::IArchive& a);
+ };
+} // end namespace inclrec
+#endif /* _INCLREC_JR_HH_ */
+The testrec.jr.hh file will contain:
+#ifndef _TESTREC_JR_HH_
+#define _TESTREC_JR_HH_
+#include "inclrec.jr.hh"
+namespace testrec {
+ class R : public hadoop::Record {
+ private:
+ std::vector<float> VF;
+ inclrec::RI Rec;
+ std::string Buf;
+ public:
+ R(void);
+ virtual ~R(void);
+ virtual bool operator==(const R& peer) const;
+ virtual bool operator<(const R& peer) const;
+ virtual std::vector<float>& getVF(void);
+ virtual const std::vector<float>& getVF(void) const;
+ virtual std::string& getBuf(void);
+ virtual const std::string& getBuf(void) const;
+ virtual inclrec::RI& getRec(void);
+ virtual const inclrec::RI& getRec(void) const;
+ virtual bool serialize(hadoop::OutArchive& a) const;
+ virtual bool deserialize(hadoop::InArchive& a);
+ virtual std::string type(void) const;
+ virtual std::string signature(void) const;
+ };
+}; // end namespace testrec
+#endif /* _TESTREC_JR_HH_ */
+Code generation for Java is similar to that for C++. A Java class is generated
+for each record type with private members corresponding to the fields. Getters
+and setters for fields are also generated. Some differences arise in the
+way comparison is expressed and in the mapping of modules to packages and
+classes to files. For equality testing, an equals method is generated
+for each record type. As per Java requirements a hashCode method is also
+generated. For comparison a compareTo method is generated for each
+record type. This has the semantics as defined by the Java Comparable
+interface, that is, the method returns a negative integer, zero, or a positive
+integer as the invoked object is less than, equal to, or greater than the
+comparison parameter.
+A .java file is generated per record type as opposed to per DDL
+file as in C++. The module declaration translates to a Java
+package declaration. The module name maps to an identical Java package
+name. In addition to this mapping, the DDL compiler creates the appropriate
+directory hierarchy for the package and places the generated .java
+files in the correct directories.
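+As an illustration of the Java mapping described above, the following is a
+hand-written sketch of the kind of class the text describes for the Link
+record from links.jr. It is not actual rcc output: the class name LinkSketch,
+the accessor names and the order in which fields are compared are assumptions
+made for this example, and the serialization methods are omitted.

```java
import java.util.Objects;

// Hypothetical stand-in for a generated record class; not real rcc output.
class LinkSketch implements Comparable<LinkSketch> {
    private String url;          // DDL: ustring URL
    private boolean isRelative;  // DDL: boolean isRelative
    private String anchorText;   // DDL: ustring anchorText

    LinkSketch(String url, boolean isRelative, String anchorText) {
        this.url = url;
        this.isRelative = isRelative;
        this.anchorText = anchorText;
    }

    String getURL() { return url; }
    void setURL(String v) { url = v; }
    boolean getIsRelative() { return isRelative; }
    void setIsRelative(boolean v) { isRelative = v; }
    String getAnchorText() { return anchorText; }
    void setAnchorText(String v) { anchorText = v; }

    // Memberwise comparison in field declaration order (an assumption).
    @Override
    public int compareTo(LinkSketch peer) {
        int c = url.compareTo(peer.url);
        if (c != 0) return c;
        c = Boolean.compare(isRelative, peer.isRelative);
        if (c != 0) return c;
        return anchorText.compareTo(peer.anchorText);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof LinkSketch)) return false;
        return compareTo((LinkSketch) o) == 0;
    }

    // Kept consistent with equals, as Java requires.
    @Override
    public int hashCode() {
        return Objects.hash(url, isRelative, anchorText);
    }
}
```

+Deriving equals from compareTo keeps the two consistent, which matters if
+such records are used as keys in a sorted container like java.util.TreeMap.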
+Mapping Summary
+DDL Type      C++ Type       Java Type
+boolean       bool           boolean
+byte          int8_t         byte
+int           int32_t        int
+long          int64_t        long
+float         float          float
+double        double         double
+ustring       std::string    java.lang.String
+buffer        std::string    org.apache.hadoop.record.Buffer
+class type    class type     class type
+vector        std::vector    java.util.ArrayList
+map           std::map       java.util.TreeMap
+Data encodings
+This section describes the format of the data encodings supported by Hadoop.
+Currently, three data encodings are supported, namely binary, CSV and XML.
+Binary Serialization Format
+The binary data encoding format is fairly dense. Serialization of composite
+types is simply defined as a concatenation of serializations of the constituent
+elements (lengths are included in vectors and maps).
+Composite types are serialized as follows:
+- class: Sequence of serialized members.
- vector: The number of elements serialized as an int. Followed by a
+sequence of serialized elements.
- map: The number of key value pairs serialized as an int. Followed
+by a sequence of serialized (key,value) pairs.
+Serialization of primitives is more interesting, with a zero compression
+optimization for integral types and normalization to UTF-8 for strings.
+Primitive types are serialized as follows:
+- byte: Represented by 1 byte, as is.
- boolean: Represented by 1-byte (0 or 1)
- int/long: Integers and longs are serialized zero compressed.
+Represented as 1-byte if -120 <= value < 128. Otherwise, serialized as a
+sequence of 2-5 bytes for ints, 2-9 bytes for longs. The first byte represents
+the number of trailing bytes, N, as the negative number (-120-N). For example,
+the number 1024 (0x400) is represented by the byte sequence 'x86 x04 x00'.
+This doesn't help much for 4-byte integers but does a reasonably good job with
+longs without bit twiddling.
- float/double: Serialized in IEEE 754 single and double precision
+format in network byte order. This is the format used by Java.
- ustring: Serialized as 4-byte zero compressed length followed by
+data encoded as UTF-8. Strings are normalized to UTF-8 regardless of native
+language representation.
- buffer: Serialized as a 4-byte zero compressed length followed by the
+raw bytes in the buffer.
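+The zero-compressed integer layout described above can be sketched as
+follows. This follows only the description in this document and may differ
+from what the Hadoop implementation actually writes. The text does not say how
+the trailing bytes of negative values are laid out, so this sketch assumes
+big-endian two's complement with the smallest N that preserves the sign; the
+class and method names are hypothetical.

```java
// Hypothetical sketch of the zero-compressed encoding described in the text.
class ZeroCompressedSketch {

    // Encodes v as 1 byte if -120 <= v < 128, else as a marker byte
    // (-120 - N) followed by N big-endian two's-complement bytes.
    static byte[] encode(long v) {
        if (v >= -120 && v < 128) {
            return new byte[] { (byte) v };
        }
        // Find the smallest N (1..8) such that v fits in N signed bytes.
        int n = 1;
        while (n < 8 && (v < -(1L << (8 * n - 1)) || v >= (1L << (8 * n - 1)))) {
            n++;
        }
        byte[] out = new byte[n + 1];
        out[0] = (byte) (-120 - n);
        for (int i = 0; i < n; i++) {
            out[1 + i] = (byte) (v >> (8 * (n - 1 - i)));
        }
        return out;
    }

    // Inverse of encode under the same assumptions.
    static long decode(byte[] b) {
        if (b[0] >= -120) {
            return b[0];
        }
        int n = -120 - b[0];
        long v = b[1]; // the first trailing byte carries the sign
        for (int i = 2; i <= n; i++) {
            v = (v << 8) | (b[i] & 0xFF);
        }
        return v;
    }
}
```

+Under these assumptions, encode(1024) yields the three bytes 0x86 0x04 0x00,
+matching the example given in the text.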
+CSV Serialization Format
+The CSV serialization format has a lot more structure than the "standard"
+Excel CSV format, but we believe the additional structure is useful because
+- it makes parsing a lot easier without detracting too much from legibility
- the delimiters around composites make it obvious when one is reading a
+sequence of Hadoop records
+Serialization formats for the various types are detailed in the grammar that
+follows. The notable feature of the formats is the use of delimiters for
+indicating certain field types.
+- A string field begins with a single quote (').
- A buffer field begins with a sharp (#).
- A class, vector or map begins with 's{', 'v{' or 'm{' respectively and
+ends with '}'.
+The CSV format can be described by the following grammar:
+record = primitive / struct / vector / map
+primitive = boolean / int / long / float / double / ustring / buffer
+boolean = "T" / "F"
+int = ["-"] 1*DIGIT
+long = ";" ["-"] 1*DIGIT
+float = ["-"] 1*DIGIT "." 1*DIGIT ["E" / "e" ["-"] 1*DIGIT]
+double = ";" ["-"] 1*DIGIT "." 1*DIGIT ["E" / "e" ["-"] 1*DIGIT]
+ustring = "'" *(UTF8 char except NULL, LF, % and , / "%00" / "%0a" / "%25" / "%2c" )
+buffer = "#" *(BYTE except NULL, LF, % and , / "%00" / "%0a" / "%25" / "%2c" )
+struct = "s{" record *("," record) "}"
+vector = "v{" [record *("," record)] "}"
+map = "m{" [*(record "," record)] "}"
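+To make the grammar above concrete, here is a small sketch that produces the
+CSV text for a struct with string and boolean fields. It covers only the
+ustring, boolean and struct productions; the class and method names are
+hypothetical, and the result is a Java String that would still need to be
+written out as UTF-8.

```java
// Hypothetical sketch of part of the CSV grammar; not Hadoop's code.
class CsvSketch {

    // ustring: a leading single quote, then the characters with
    // NULL, LF, '%' and ',' percent-escaped as the grammar lists them.
    static String ustring(String s) {
        StringBuilder sb = new StringBuilder("'");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\0': sb.append("%00"); break;
                case '\n': sb.append("%0a"); break;
                case '%':  sb.append("%25"); break;
                case ',':  sb.append("%2c"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // boolean: "T" or "F".
    static String bool(boolean b) {
        return b ? "T" : "F";
    }

    // struct: already-serialized fields joined by commas inside s{ ... }.
    static String struct(String... fields) {
        return "s{" + String.join(",", fields) + "}";
    }
}
```

+For example, a Link record with URL "http://a,b", isRelative true and
+anchorText "50%" would come out as s{'http://a%2cb,T,'50%25} under this sketch.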
+XML Serialization Format
+The XML serialization format is the same used by Apache XML-RPC
+(http://ws.apache.org/xmlrpc/types.html). This is an extension of the original
+XML-RPC format and adds some additional data types. Not all record I/O types
+are directly expressible in this format, and access to a DDL is required in
+order to convert these to valid types. All types, primitive or composite, are
+represented by <value> elements. The particular XML-RPC type is
+indicated by a nested element in the <value> element. The encoding for
+records is always UTF-8. Primitive types are serialized as follows:
+- byte: XML tag <ex:i1>. Values: 1-byte unsigned
+integers represented in US-ASCII
- boolean: XML tag <boolean>. Values: "0" or "1"
- int: XML tags <i4> or <int>. Values: 4-byte
+signed integers represented in US-ASCII.
- long: XML tag <ex:i8>. Values: 8-byte signed integers
+represented in US-ASCII.
- float: XML tag <ex:float>. Values: Single precision
+floating point numbers represented in US-ASCII.
- double: XML tag <double>. Values: Double precision
+floating point numbers represented in US-ASCII.
+- ustring: XML tag <string>. Values: String values
+represented as UTF-8. XML does not permit all Unicode characters in literal
+data. In particular, NULLs and control chars are not allowed. Additionally,
+XML processors are required to replace carriage returns with line feeds and to
+replace CRLF sequences with line feeds. Programming languages that we work
+with do not impose these restrictions on string types. To work around these
+restrictions, disallowed characters and CRs are percent escaped in strings.
+The '%' character is also percent escaped.
+- buffer: XML tag <string>. Values: Arbitrary binary
+data. Represented as hexBinary, each byte is replaced by its 2-byte
+hexadecimal representation.
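+The hexBinary representation described for buffers can be sketched as
+follows. The class and method names are hypothetical, and the use of
+lowercase hex digits is an assumption, since the text does not specify a case.

```java
// Hypothetical sketch of hexBinary encoding for a buffer; not Hadoop's code.
class HexBufferSketch {

    // Replaces each byte with its two-character hexadecimal representation.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            // Mask to 0..255 so negative bytes do not print as 8 hex digits.
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }
}
```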
+Composite types are serialized as follows:
+- class: XML tag <struct>. A struct is a sequence of
+<member> elements. Each <member> element has a <name>
+element and a <value> element. The <name> is a string that must
+match /[a-zA-Z][a-zA-Z0-9_]*/. The value of the member is represented
+by a <value> element.
+- vector: XML tag <array>. An <array> contains a
+single <data> element. The <data> element is a sequence of
+<value> elements each of which represents an element of the vector.
- map: XML tag <array>. Same as vector.
+For example:
+class {
+ int MY_INT; // value 5
+ vector<float> MY_VEC; // values 0.1, -0.89, 2.45e4
+ buffer MY_BUF; // value '\00\n\tabc%'
+}
+is serialized as
+ <struct>
+ <member>
+ <name>MY_INT</name>
+ <value><i4>5</i4></value>
+ </member>
+ <member>
+ <name>MY_VEC</name>
+ <value>
+ <array>
+ <data>
+ <value><ex:float>0.1</ex:float></value>
+ <value><ex:float>-0.89</ex:float></value>
+ <value><ex:float>2.45e4</ex:float></value>
+ </data>
+ </array>
+ </value>
+ </member>
+ <member>
+ <name>MY_BUF</name>
+ <value><string>%00\n\tabc%25</string></value>
+ </member>
+ </struct>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ (DEPRECATED) This package contains classes needed for code generation
+ from the hadoop record compiler. CppGenerator and JavaGenerator
+ are the main entry points from the parser. There are classes
+ corresponding to every primitive type and compound type
+ included in Hadoop record I/O syntax.
+ DEPRECATED: Replaced by Avro.
+ This task takes the given record definition files and compiles them into
+ java or c++
+ files. It is then up to the user to compile the generated files.
+ The task requires the file or the nested fileset element to be
+ specified. Optional attributes are language (set the output
+ language, default is "java"),
+ destdir (name of the destination directory for generated java/c++
+ code, default is ".") and failonerror (specifies error handling
+ behavior, default is true).
+ <recordcc
+ destdir="${basedir}/gensrc"
+ language="java">
+ <fileset include="**\/*.jr" />
+ </recordcc>
+ @deprecated Replaced by Avro.]]>
+ ]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ (DEPRECATED) This package contains code generated by JavaCC from the
+ Hadoop record syntax file rcc.jj. For details about the
+ record file syntax please @see org.apache.hadoop.record.
+ DEPRECATED: Replaced by Avro.
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ Avro.]]>
+ (cause==null ? null : cause.toString()) (which
+ typically contains the class and detail message of cause).
+ @param cause the cause (which is saved for later retrieval by the
+ {@link #getCause()} method). (A null value is
+ permitted, and indicates that the cause is nonexistent or
+ unknown.)]]>
+ mapping
+ and mapping]]>
+ /host@realm.
+ @param principalName principal name of format as described above
+ @return host name if the string conforms to the above format, else null]]>
+ "jack"
+ @param userName
+ @return userName without login method]]>
+ the return type of the run method
+ @param action the method to execute
+ @return the value from the run method]]>
+ the return type of the run method
+ @param action the method to execute
+ @return the value from the run method
+ @throws IOException if the action throws an IOException
+ @throws Error if the action throws an Error
+ @throws RuntimeException if the action throws a RuntimeException
+ @throws InterruptedException if the action throws an InterruptedException
+ @throws UndeclaredThrowableException if the action throws something else]]>
+ (cause==null ? null : cause.toString()) (which
+ typically contains the class and detail message of cause).
+ @param cause the cause (which is saved for later retrieval by the
+ {@link #getCause()} method). (A null value is
+ permitted, and indicates that the cause is nonexistent or
+ unknown.)]]>
+ does not provide the stack trace for security purposes.]]>
+ A User-Agent String is considered to be a browser if it matches
+ any of the regex patterns from browser-useragent-regex; the default
+ behavior is to consider everything a browser that matches the following:
+ "^Mozilla.*,^Opera.*". Subclasses can optionally override
+ this method to use different behavior.
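+ The default matching rule described above can be sketched as follows. This
+ is an illustration only: the class name is hypothetical, the patterns are
+ hard-coded rather than read from browser-useragent-regex, and treating a
+ null User-Agent as "not a browser" is an assumption.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the default User-Agent check; not Hadoop's code.
class BrowserCheckSketch {

    // The two default patterns quoted in the text.
    private static final Pattern[] PATTERNS = {
        Pattern.compile("^Mozilla.*"),
        Pattern.compile("^Opera.*")
    };

    // Returns true if the User-Agent matches any configured pattern.
    static boolean isBrowser(String userAgent) {
        if (userAgent == null) {
            return false; // assumption: no User-Agent means not a browser
        }
        for (Pattern p : PATTERNS) {
            if (p.matcher(userAgent).matches()) {
                return true;
            }
        }
        return false;
    }
}
```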
+ @param userAgent The User-Agent String, or null if there isn't one
+ @return true if the User-Agent String refers to a browser, false if not]]>
+ The type of the token identifier]]>
+ T extends TokenIdentifier]]>
+ DelegationTokenAuthenticatedURL.
+ An instance of the default {@link DelegationTokenAuthenticator} will be
+ used.]]>
+ DelegationTokenAuthenticatedURL.
+ @param authenticator the {@link DelegationTokenAuthenticator} instance to
+ use, if null the default one will be used.]]>
+ DelegationTokenAuthenticatedURL using the default
+ {@link DelegationTokenAuthenticator} class.
+ @param connConfigurator a connection configurator.]]>
+ DelegationTokenAuthenticatedURL.
+ @param authenticator the {@link DelegationTokenAuthenticator} instance to
+ use, if null the default one will be used.
+ @param connConfigurator a connection configurator.]]>
+ The default class is {@link KerberosDelegationTokenAuthenticator}
+ @return the delegation token authenticator class to use as default.]]>
+ This method is provided to enable WebHDFS backwards compatibility.
+ @param useQueryString TRUE if the token is transmitted in the
+ URL query string, FALSE if the delegation token is transmitted
+ using the {@link DelegationTokenAuthenticator#DELEGATION_TOKEN_HEADER} HTTP
+ header.]]>
+ TRUE if the token is transmitted in the URL query
+ string, FALSE if the delegation token is transmitted using the
+ {@link DelegationTokenAuthenticator#DELEGATION_TOKEN_HEADER} HTTP header.]]>
+ Authenticator.
+ @param url the URL to connect to. Only HTTP/S URLs are supported.
+ @param token the authentication token being used for the user.
+ @return an authenticated {@link HttpURLConnection}.
+ @throws IOException if an IO error occurred.
+ @throws AuthenticationException if an authentication exception occurred.]]>
+ Authenticator. If the doAs parameter is not NULL,
+ the request will be done on behalf of the specified doAs user.
+ @param url the URL to connect to. Only HTTP/S URLs are supported.
+ @param token the authentication token being used for the user.
+ @param doAs user to do the request on behalf of, if NULL the request is
+ as self.
+ @return an authenticated {@link HttpURLConnection}.
+ @throws IOException if an IO error occurred.
+ @throws AuthenticationException if an authentication exception occurred.]]>
+ Authenticator
+ for authentication.
+ @param url the URL to get the delegation token from. Only HTTP/S URLs are
+ supported.
+ @param token the authentication token being used for the user where the
+ Delegation token will be stored.
+ @param renewer the renewer user.
+ @return a delegation token.
+ @throws IOException if an IO error occurred.
+ @throws AuthenticationException if an authentication exception occurred.]]>
+ Authenticator
+ for authentication.
+ @param url the URL to get the delegation token from. Only HTTP/S URLs are
+ supported.
+ @param token the authentication token being used for the user where the
+ Delegation token will be stored.
+ @param renewer the renewer user.
+ @param doAsUser the user to do as, which will be the token owner.
+ @return a delegation token.
+ @throws IOException if an IO error occurred.
+ @throws AuthenticationException if an authentication exception occurred.]]>
+ Authenticator for authentication.
+ @param url the URL to renew the delegation token from. Only HTTP/S URLs are
+ supported.
+ @param token the authentication token with the Delegation Token to renew.
+ @throws IOException if an IO error occurred.
+ @throws AuthenticationException if an authentication exception occurred.]]>
+ Authenticator for authentication.
+ @param url the URL to renew the delegation token from. Only HTTP/S URLs are
+ supported.
+ @param token the authentication token with the Delegation Token to renew.
+ @param doAsUser the user to do as, which will be the token owner.
+ @throws IOException if an IO error occurred.
+ @throws AuthenticationException if an authentication exception occurred.]]>
+ Authenticator.
+ @param url the URL to cancel the delegation token from. Only HTTP/S URLs
+ are supported.
+ @param token the authentication token with the Delegation Token to cancel.
+ @throws IOException if an IO error occurred.]]>
+ Authenticator.
+ @param url the URL to cancel the delegation token from. Only HTTP/S URLs
+ are supported.
+ @param token the authentication token with the Delegation Token to cancel.
+ @param doAsUser the user to do as, which will be the token owner.
+ @throws IOException if an IO error occurred.]]>
+ DelegationTokenAuthenticatedURL is a
+ {@link AuthenticatedURL} sub-class with built-in Hadoop Delegation Token
+ functionality.
+ The authentication mechanisms supported by default are Hadoop Simple
+ authentication (also known as pseudo authentication) and Kerberos SPNEGO
+ authentication.
+ Additional authentication mechanisms can be supported via {@link
+ DelegationTokenAuthenticator} implementations.
+ The default {@link DelegationTokenAuthenticator} is the {@link
+ KerberosDelegationTokenAuthenticator} class which supports
+ automatic fallback from Kerberos SPNEGO to Hadoop Simple authentication via
+ the {@link PseudoDelegationTokenAuthenticator} class.
+ AuthenticatedURL instances are not thread-safe.]]>
+ Authenticator
+ for authentication.
+ @param url the URL to get the delegation token from. Only HTTP/S URLs are
+ supported.
+ @param token the authentication token being used for the user where the
+ Delegation token will be stored.
+ @param renewer the renewer user.
+ @throws IOException if an IO error occurred.
+ @throws AuthenticationException if an authentication exception occurred.]]>
+ Authenticator
+ for authentication.
+ @param url the URL to get the delegation token from. Only HTTP/S URLs are
+ supported.
+ @param token the authentication token being used for the user where the
+ Delegation token will be stored.
+ @param renewer the renewer user.
+ @param doAsUser the user to do as, which will be the token owner.
+ @throws IOException if an IO error occurred.
+ @throws AuthenticationException if an authentication exception occurred.]]>
+ Authenticator for authentication.
+ @param url the URL to renew the delegation token from. Only HTTP/S URLs are
+ supported.
+ @param token the authentication token with the Delegation Token to renew.
+ @throws IOException if an IO error occurred.
+ @throws AuthenticationException if an authentication exception occurred.]]>
+ Authenticator for authentication.
+ @param url the URL to renew the delegation token from. Only HTTP/S URLs are
+ supported.
+ @param token the authentication token with the Delegation Token to renew.
+ @param doAsUser the user to do as, which will be the token owner.
+ @throws IOException if an IO error occurred.
+ @throws AuthenticationException if an authentication exception occurred.]]>
+ Authenticator.
+ @param url the URL to cancel the delegation token from. Only HTTP/S URLs
+ are supported.
+ @param token the authentication token with the Delegation Token to cancel.
+ @throws IOException if an IO error occurred.]]>
+ Authenticator.
+ @param url the URL to cancel the delegation token from. Only HTTP/S URLs
+ are supported.
+ @param token the authentication token with the Delegation Token to cancel.
+ @param doAsUser the user to do as, which will be the token owner.
+ @throws IOException if an IO error occurred.]]>
+ KerberosDelegationTokenAuthenticator provides support for
+ Kerberos SPNEGO authentication mechanism and support for Hadoop Delegation
+ Token operations.
+ It falls back to the {@link PseudoDelegationTokenAuthenticator} if the HTTP
+ endpoint does not trigger SPNEGO authentication.]]>
+ PseudoDelegationTokenAuthenticator provides support for
+ Hadoop's pseudo authentication mechanism that accepts
+ the user name specified as a query string parameter and support for Hadoop
+ Delegation Token operations.
+ This mimics the model of Hadoop Simple authentication trusting the
+ {@link UserGroupInformation#getCurrentUser()} value.]]>
+ live.
+ @return a (snapshotted) map of blocker name->description values]]>
+ Do nothing if the service is null or not
+ in a state in which it can be/needs to be stopped.
+ The service state is checked before the operation begins.
+ This process is not thread safe.
+ @param service a service or null]]>
+ Any long-lived operation here will prevent the service state
+ change from completing in a timely manner.
+ If another thread is somehow invoked from the listener, and
+ that thread invokes the methods of the service (including
+ subclass-specific methods), there is a risk of a deadlock.
+ @param service the service that has changed.]]>
+ Clients and/or applications can use the provided Progressable
+ to explicitly report progress to the Hadoop framework. This is especially
+ important for operations which take a significant amount of time since,
+ in lieu of the reported progress, the framework has to assume that an error
+ has occurred and time out the operation.]]>
+ Class is to be obtained
+ @return the correctly typed Class of the given object.]]>
+ kill -0 command or equivalent]]>
+ ".cmd" on Windows, or ".sh"
+ @param parent File parent directory
+ @param basename String script file basename
+ @return File referencing the script in the directory]]>
+ ".cmd" on Windows, or ".sh"
+ @param basename String script file basename
+ @return String script file name]]>
+ IOException.
+ @return the path to {@link #WINUTILS_EXE}
+ @throws RuntimeException if the path is not resolvable]]>
+ Shell interface.
+ @param cmd shell command to execute.
+ @return the output of the executed command.]]>
+ Shell interface.
+ @param env the map of environment key=value
+ @param cmd shell command to execute.
+ @param timeout time in milliseconds after which script should be marked timeout
+ @return the output of the executed command.
+ @throws IOException on any problem.]]>
+ Shell interface.
+ @param env the map of environment key=value
+ @param cmd shell command to execute.
+ @return the output of the executed command.
+ @throws IOException on any problem.]]>
+ CreateProcess synchronization object.]]>
+ os.name property.]]>
+ Important: caller must check for this value being null.
+ The lack of such checks has led to many support issues being raised.
+ @deprecated use one of the exception-raising getter methods,
+ specifically {@link #getWinUtilsPath()} or {@link #getWinUtilsFile()}]]>
+ Shell can be used to run shell commands like du or df.
+ It also offers facilities to gate commands by
+ time-intervals.]]>
+ Tool is the standard for any Map-Reduce tool/application.
+ The tool/application should delegate the handling of
+ standard command-line options to {@link ToolRunner#run(Tool, String[])}
+ and only handle its custom arguments.
+ Here is how a typical Tool is implemented:
+ public class MyApp extends Configured implements Tool {
+   public int run(String[] args) throws Exception {
+     // Configuration processed by ToolRunner
+     Configuration conf = getConf();
+     // Create a JobConf using the processed conf
+     JobConf job = new JobConf(conf, MyApp.class);
+     // Process custom command-line options
+     Path in = new Path(args[1]);
+     Path out = new Path(args[2]);
+     // Specify various job-specific parameters
+     job.setJobName("my-app");
+     job.setInputPath(in);
+     job.setOutputPath(out);
+     job.setMapperClass(MyMapper.class);
+     job.setReducerClass(MyReducer.class);
+     // Submit the job, then poll for progress until the job is complete
+     RunningJob runningJob = JobClient.runJob(job);
+     if (runningJob.isSuccessful()) {
+       return 0;
+     } else {
+       return 1;
+     }
+   }
+   public static void main(String[] args) throws Exception {
+     // Let ToolRunner handle generic command-line options
+     int res = ToolRunner.run(new Configuration(), new MyApp(), args);
+     System.exit(res);
+   }
+ }
+ @see GenericOptionsParser
+ @see ToolRunner]]>
+ Tool by {@link Tool#run(String[])}, after
+ parsing with the given generic arguments. Uses the given
+ Configuration, or builds one if null.
+ Sets the Tool's configuration with the possibly modified
+ version of the conf.
+ @param conf Configuration for the Tool.
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+ Tool with its Configuration.
+ Equivalent to run(tool.getConf(), tool, args).
+ @param tool Tool to run.
+ @param args command-line arguments to the tool.
+ @return exit code of the {@link Tool#run(String[])} method.]]>
+ ToolRunner can be used to run classes implementing the
+ Tool interface. It works in conjunction with
+ {@link GenericOptionsParser} to parse the
+ generic hadoop command line arguments and modifies the
+ Configuration of the Tool. The
+ application-specific options are passed along without being modified.
+ @see Tool
+ @see GenericOptionsParser]]>
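The division of labour described above, where generic options are consumed and application-specific arguments are passed along untouched, can be sketched without any Hadoop dependency. The class below is a hypothetical stand-in, not Hadoop's GenericOptionsParser: it assumes only the `-D key=value` generic option and treats everything else as belonging to the tool.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for generic-option handling; not Hadoop's parser. */
class GenericOptionsSketch {
    /** Properties collected from "-D key=value" pairs. */
    final Map<String, String> conf = new LinkedHashMap<>();
    /** Arguments left for the tool itself, in their original order. */
    final List<String> remaining = new ArrayList<>();

    GenericOptionsSketch(String[] args) {
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                // Consume the generic option and record it in the configuration.
                String[] kv = args[++i].split("=", 2);
                conf.put(kv[0], kv.length > 1 ? kv[1] : "");
            } else {
                // Anything else is application-specific and is passed along as-is.
                remaining.add(args[i]);
            }
        }
    }

    public static void main(String[] args) {
        GenericOptionsSketch parsed = new GenericOptionsSketch(
            new String[] {"-D", "mapreduce.job.reduces=2", "in", "out"});
        System.out.println(parsed.conf + " " + parsed.remaining);
    }
}
```

In this sketch the tool would receive only `in` and `out`, which is why a Tool implementation can index its own arguments without skipping over generic options.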
+ this filter.
+ @param nbHash The number of hash functions to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).]]>
+ Bloom filter, as defined by Bloom in 1970.
+ The Bloom filter is a data structure that was introduced in 1970 and that has been adopted by
+ the networking research community in the past decade thanks to the bandwidth efficiencies that it
+ offers for the transmission of set membership information between networked hosts. A sender encodes
+ the information into a bit vector, the Bloom filter, that is more compact than a conventional
+ representation. Computation and space costs for construction are linear in the number of elements.
+ The receiver uses the filter to test whether various elements are members of the set. Though the
+ filter will occasionally return a false positive, it will never return a false negative. When creating
+ the filter, the sender can choose its desired point in a trade-off between the false positive rate and the size.
+ Originally created by
+ European Commission One-Lab Project 034819.
+ @see Filter The general behavior of a filter
+ @see Space/Time Trade-Offs in Hash Coding with Allowable Errors]]>
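To make the behaviour described above concrete, here is a minimal, self-contained sketch of a Bloom filter backed by a bit vector. It is not the Hadoop class: the class name, the String keys and the seeded FNV-1a style hash are assumptions chosen for illustration, while `vectorSize` and `nbHash` mirror the parameter names used in this documentation.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Illustrative Bloom filter; not org.apache.hadoop.util.bloom.BloomFilter. */
class BloomFilterSketch {
    private final BitSet bits;
    private final int vectorSize;
    private final int nbHash;

    BloomFilterSketch(int vectorSize, int nbHash) {
        this.bits = new BitSet(vectorSize);
        this.vectorSize = vectorSize;
        this.nbHash = nbHash;
    }

    /** Seeded FNV-1a style hash (an assumption; Hadoop plugs in its own Hash types). */
    private int position(String key, int seed) {
        int h = 0x811C9DC5 ^ (seed * 0x9E3779B9);
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 16777619;
        }
        return Math.floorMod(h, vectorSize);
    }

    /** Sets one bit per hash function. */
    void add(String key) {
        for (int i = 0; i < nbHash; i++) {
            bits.set(position(key, i));
        }
    }

    /** True if every bit for the key is set: may be a false positive, never a false negative. */
    boolean membershipTest(String key) {
        for (int i = 0; i < nbHash; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Because `add` only ever sets bits, a key that was added always passes `membershipTest`; a key that was never added can pass only when other keys happen to have set all of its positions.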
+ this filter.
+ @param nbHash The number of hash functions to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).]]>
+ this counting Bloom filter.
+ Invariant: nothing happens if the specified key does not belong to this counting Bloom filter.
+ @param key The key to remove.]]>
+ key -> count map.
+ NOTE: due to the bucket size of this filter, inserting the same
+ key more than 15 times will cause an overflow at all filter positions
+ associated with this key, and it will significantly increase the error
+ rate for this and other keys. For this reason the filter can only be
+ used to store small count values 0 <= N << 15
+ @param key key to be tested
+ @return 0 if the key is not present. Otherwise, a positive value v will
+ be returned such that v == count with probability equal to the
+ error rate of this filter, and v > count otherwise.
+ Additionally, if the filter experienced an underflow as a result of
+ {@link #delete(Key)} operation, the return value may be lower than the
+ count with the probability of the false negative rate of such
+ filter.]]>
+ counting Bloom filter, as defined by Fan et al. in a ToN
+ 2000 paper.
+ A counting Bloom filter is an improvement to a standard Bloom filter as it
+ allows dynamic additions and deletions of set membership information. This
+ is achieved through the use of a counting vector instead of a bit vector.
+ Originally created by
+ European Commission One-Lab Project 034819.
+ @see Filter The general behavior of a filter
+ @see Summary cache: a scalable wide-area web cache sharing protocol]]>
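The counting vector idea can be sketched as follows. This is an illustration rather than the Hadoop implementation: the class name, String keys and hash function are assumptions, and each bucket is modelled as an int capped at 15 to mimic the 4-bit buckets implied by the overflow note above. Duplicate positions for one key are collapsed so that a single insertion raises each bucket by exactly one.

```java
import java.nio.charset.StandardCharsets;
import java.util.stream.IntStream;

/** Illustrative counting Bloom filter with 4-bit style buckets; not the Hadoop class. */
class CountingBloomSketch {
    private static final int BUCKET_MAX = 15; // largest value a 4-bit bucket can hold
    private final int[] buckets;
    private final int nbHash;

    CountingBloomSketch(int vectorSize, int nbHash) {
        this.buckets = new int[vectorSize];
        this.nbHash = nbHash;
    }

    /** Seeded FNV-1a style hash (an assumption made for this sketch). */
    private int hash(String key, int seed) {
        int h = 0x811C9DC5 ^ (seed * 0x9E3779B9);
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 16777619;
        }
        return Math.floorMod(h, buckets.length);
    }

    /** Distinct bucket positions for a key. */
    private int[] positions(String key) {
        return IntStream.range(0, nbHash).map(i -> hash(key, i)).distinct().toArray();
    }

    void add(String key) {
        for (int p : positions(key)) {
            if (buckets[p] < BUCKET_MAX) {
                buckets[p]++; // saturate instead of wrapping around
            }
        }
    }

    boolean membershipTest(String key) {
        for (int p : positions(key)) {
            if (buckets[p] == 0) {
                return false;
            }
        }
        return true;
    }

    /** Does nothing if the key is not a member, matching the invariant stated above. */
    void delete(String key) {
        if (!membershipTest(key)) {
            return;
        }
        for (int p : positions(key)) {
            if (buckets[p] < BUCKET_MAX) {
                buckets[p]--; // a saturated bucket no longer knows its true count
            }
        }
    }

    /** Smallest bucket value over the key's positions; 0 if the key is absent. */
    int approximateCount(String key) {
        int min = Integer.MAX_VALUE;
        for (int p : positions(key)) {
            min = Math.min(min, buckets[p]);
        }
        return min;
    }
}
```

With a single key in an otherwise empty filter the approximate count is exact; collisions with other keys can only push it upward, and saturation at 15 is what limits the filter to small counts.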
+ Builds an empty Dynamic Bloom filter.
+ @param vectorSize The number of bits in the vector.
+ @param nbHash The number of hash functions to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).
+ @param nr The threshold for the maximum number of keys to record in a
+ dynamic Bloom filter row.]]>
+ dynamic Bloom filter, as defined in the INFOCOM 2006 paper.
+ A dynamic Bloom filter (DBF) makes use of an s * m bit matrix but
+ each of the s rows is a standard Bloom filter. The creation
+ process of a DBF is iterative. At the start, the DBF is a 1 * m
+ bit matrix, i.e., it is composed of a single standard Bloom filter.
+ It assumes that nr elements are recorded in the
+ initial bit vector, where nr <= n (n is
+ the cardinality of the set A to record in the filter).
+ As the size of A grows during the execution of the application,
+ several keys must be inserted in the DBF. When inserting a key into the DBF,
+ one must first get an active Bloom filter in the matrix. A Bloom filter is
+ active when the number of recorded keys, nr, is
+ strictly less than the current cardinality of A, n.
+ If an active Bloom filter is found, the key is inserted and
+ nr is incremented by one. On the other hand, if there
+ is no active Bloom filter, a new one is created (i.e., a new row is added to
+ the matrix) according to the current size of A and the element
+ is added in this new Bloom filter and the nr value of
+ this new Bloom filter is set to one. A given key is said to belong to the
+ DBF if the k positions are set to one in one of the matrix rows.
+ Originally created by
+ European Commission One-Lab Project 034819.
+ @see Filter The general behavior of a filter
+ @see BloomFilter A Bloom filter
+ @see Theory and Network Applications of Dynamic Bloom Filters]]>
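The row-growing process can be sketched in a few lines. This is a simplified illustration, not the Hadoop class: the class name, String keys and hash function are assumptions, and a row is treated as active while it holds fewer than `nr` keys, which is how the constructor parameter `nr` above is described.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Illustrative dynamic Bloom filter: a growing list of fixed-size rows. */
class DynamicBloomSketch {
    private final int vectorSize;
    private final int nbHash;
    private final int nr; // maximum number of keys recorded per row
    private final List<BitSet> rows = new ArrayList<>();
    private int keysInLastRow = 0;

    DynamicBloomSketch(int vectorSize, int nbHash, int nr) {
        this.vectorSize = vectorSize;
        this.nbHash = nbHash;
        this.nr = nr;
        rows.add(new BitSet(vectorSize)); // start as a 1 * m matrix
    }

    /** Seeded FNV-1a style hash (an assumption made for this sketch). */
    private int position(String key, int seed) {
        int h = 0x811C9DC5 ^ (seed * 0x9E3779B9);
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 16777619;
        }
        return Math.floorMod(h, vectorSize);
    }

    void add(String key) {
        if (keysInLastRow >= nr) {
            // No active row is left, so add a new row to the matrix.
            rows.add(new BitSet(vectorSize));
            keysInLastRow = 0;
        }
        BitSet active = rows.get(rows.size() - 1);
        for (int i = 0; i < nbHash; i++) {
            active.set(position(key, i));
        }
        keysInLastRow++;
    }

    /** A key belongs to the filter if all of its positions are set in at least one row. */
    boolean membershipTest(String key) {
        for (BitSet row : rows) {
            boolean allSet = true;
            for (int i = 0; i < nbHash && allSet; i++) {
                allSet = row.get(position(key, i));
            }
            if (allSet) {
                return true;
            }
        }
        return false;
    }

    int rowCount() {
        return rows.size();
    }
}
```

Capping the keys per row keeps each row's false positive rate bounded as the set grows, at the cost of checking every row on lookup.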
+ Builds a hash function that must obey a given maximum number of returned values and a highest value.
+ @param maxValue The highest returned value.
+ @param nbHash The number of resulting hashed values.
+ @param hashType type of the hashing function (see {@link Hash}).]]>
+ this hash function. A NOOP]]>
+ The idea is to randomly select a bit to reset.]]>
+ The idea is to select the bit to reset that will generate the minimum
+ number of false negatives.]]>
+ The idea is to select the bit to reset that will remove the maximum number
+ of false positives.]]>
+ The idea is to select the bit to reset that will, at the same time, remove
+ the maximum number of false positives while minimizing the number of false
+ negatives generated.]]>
+ Originally created by
+ European Commission One-Lab Project 034819.]]>
+ this filter.
+ @param nbHash The number of hash functions to consider.
+ @param hashType type of the hashing function (see
+ {@link org.apache.hadoop.util.hash.Hash}).]]>
+ this retouched Bloom filter.
+ Invariant: if the false positive is null, nothing happens.
+ @param key The false positive key to add.]]>
+ this retouched Bloom filter.
+ @param coll The collection of false positives.]]>
+ this retouched Bloom filter.
+ @param keys The list of false positives.]]>
+ this retouched Bloom filter.
+ @param keys The array of false positives.]]>
+ this retouched Bloom filter.
+ @param scheme The selective clearing scheme to apply.]]>
+ retouched Bloom filter, as defined in the CoNEXT 2006 paper.
+ It allows the removal of selected false positives at the cost of introducing
+ random false negatives, and with the benefit of eliminating some random false
+ positives at the same time.
+ Originally created by
+ European Commission One-Lab Project 034819.
+ @see Filter The general behavior of a filter
+ @see BloomFilter A Bloom filter
+ @see RemoveScheme The different selective clearing algorithms
+ @see Retouched Bloom Filters: Allowing Networked Applications to Trade Off Selected False Positives Against False Negatives]]>
diff --git a/hadoop-hdfs-project/hadoop-hdfs/dev-support/jdiff/Apache_Hadoop_HDFS_2.8.3.xml b/hadoop-hdfs-project/hadoop-hdfs/dev-support/jdiff/Apache_Hadoop_HDFS_2.8.3.xml
new file mode 100644
index 0000000000..331dd1e569
--- /dev/null
+++ b/hadoop-hdfs-project/hadoop-hdfs/dev-support/jdiff/Apache_Hadoop_HDFS_2.8.3.xml
@@ -0,0 +1,312 @@
+ A distributed implementation of {@link
+org.apache.hadoop.fs.FileSystem}. This is loosely modelled after
+Google's GFS.
+The most important difference is that unlike GFS, Hadoop DFS files
+have strictly one writer at any one time. Bytes are always appended
+to the end of the writer's stream. There is no notion of "record appends"
+or "mutations" that are then checked or reordered. Writers simply emit
+a byte stream. That byte stream is guaranteed to be stored in the
+order written.
+ This method must return as quickly as possible, since it's called
+ in a critical section of the NameNode's operation.
+ @param succeeded Whether authorization succeeded.
+ @param userName Name of the user executing the request.
+ @param addr Remote address of the request.
+ @param cmd The requested command.
+ @param src Path of affected source file.
+ @param dst Path of affected destination file (if any).
+ @param stat File information for operations that change the file's
+ metadata (permissions, owner, times, etc).]]>
diff --git a/hadoop-mapreduce-project/dev-support/jdiff/Apache_Hadoop_MapReduce_Common_2.8.3.xml b/hadoop-mapreduce-project/dev-support/jdiff/Apache_Hadoop_MapReduce_Common_2.8.3.xml
new file mode 100644
index 0000000000..b3d52bf810
--- /dev/null
+++ b/hadoop-mapreduce-project/dev-support/jdiff/Apache_Hadoop_MapReduce_Common_2.8.3.xml
@@ -0,0 +1,113 @@
diff --git a/hadoop-mapreduce-project/dev-support/jdiff/Apache_Hadoop_MapReduce_Core_2.8.3.xml b/hadoop-mapreduce-project/dev-support/jdiff/Apache_Hadoop_MapReduce_Core_2.8.3.xml
new file mode 100644
index 0000000000..e96d0188e4
--- /dev/null
+++ b/hadoop-mapreduce-project/dev-support/jdiff/Apache_Hadoop_MapReduce_Core_2.8.3.xml
@@ -0,0 +1,27495 @@
+ FileStatus of a given cache file on hdfs
+ @throws IOException]]>
+ DistributedCache is a facility provided by the Map-Reduce
+ framework to cache files (text, archives, jars etc.) needed by applications.
+ Applications specify the files, via urls (hdfs:// or http://) to be cached
+ via the {@link org.apache.hadoop.mapred.JobConf}. The
+ DistributedCache assumes that the files specified via urls are
+ already present on the {@link FileSystem} at the path specified by the url
+ and are accessible by every machine in the cluster.
+ The framework will copy the necessary files on to the slave node before
+ any tasks for the job are executed on that node. Its efficiency stems from
+ the fact that the files are only copied once per job and the ability to
+ cache archives which are un-archived on the slaves.
+ DistributedCache can be used to distribute simple, read-only
+ data/text files and/or more complex types such as archives, jars etc.
+ Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes.
+ Jars may be optionally added to the classpath of the tasks, a rudimentary
+ software distribution mechanism. Files have execution permissions.
+ In older versions of Hadoop Map/Reduce users could optionally ask for symlinks
+ to be created in the working directory of the child task. In the current
+ version symlinks are always created. If the URL does not have a fragment
+ the name of the file or directory will be used. If multiple files or
+ directories map to the same link name, the last one added will be used. All
+ others will not even be downloaded.
+ DistributedCache tracks modification timestamps of the cache
+ files. Clearly the cache files should not be modified by the application
+ or externally while the job is executing.
+ Here is an illustrative example on how to use the
+ DistributedCache:
+ 1. Copy the requisite files to the FileSystem
+ $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
+ $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
+ $ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
+ $ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
+ $ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
+ $ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
+ 2. Setup the application's JobConf
+ JobConf job = new JobConf();
+ DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"),
+ job);
+ DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
+ DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
+ DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);
+ 3. Use the cached files in the {@link org.apache.hadoop.mapred.Mapper}
+ or {@link org.apache.hadoop.mapred.Reducer}:
+ public static class MapClass extends MapReduceBase
+ implements Mapper<K, V, K, V> {
+ private Path[] localArchives;
+ private Path[] localFiles;
+ public void configure(JobConf job) {
+ // Get the cached archives/files
+ File f = new File("./map.zip/some/file/in/zip.txt");
+ }
+ public void map(K key, V value,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Use data from the cached archives/files here
+ // ...
+ // ...
+ output.collect(key, value);
+ }
+ }
+ It is also very common to use the DistributedCache by using
+ {@link org.apache.hadoop.util.GenericOptionsParser}.
+ This class includes methods that should be used by users
+ (specifically those mentioned in the example above, as well
+ as {@link DistributedCache#addArchiveToClassPath(Path, Configuration)}),
+ as well as methods intended for use by the MapReduce framework
+ (e.g., {@link org.apache.hadoop.mapred.JobClient}).
+ @see org.apache.hadoop.mapred.JobConf
+ @see org.apache.hadoop.mapred.JobClient
+ @see org.apache.hadoop.mapreduce.Job]]>
+ JobTracker,
+ as {@link JobTracker.State}
+ {@link JobTracker.State} should no longer be used on M/R 2.x. The function
+ is kept to be compatible with M/R 1.x applications.
+ @return the invalid state of the JobTracker
+ ClusterStatus provides clients with information such as:
+ - Size of the cluster.
+ - Name of the trackers.
+ - Task capacity of the cluster.
+ - The number of currently running map and reduce tasks.
+ - State of the JobTracker.
+ - Details regarding black listed trackers.
+ Clients can query for the latest ClusterStatus, via
+ {@link JobClient#getClusterStatus()}.
+ @see JobClient]]>
+ Counters represent global counters, defined either by the
+ Map-Reduce framework or applications. Each Counter can be of
+ any {@link Enum} type.
+ Counters are bunched into {@link Group}s, each comprising
+ counters from a particular Enum class.
+ Group of counters, comprising counters from a particular
+ counter {@link Enum} class.
+ Group handles localization of the class name and the
+ counter names.
+ FileInputFormat always returns
+ true. Implementations that may deal with non-splittable files must
+ override this method.
+ FileInputFormat implementations can override this and return
+ false to ensure that individual input files are never split-up
+ so that {@link Mapper}s process entire files.
+ @param fs the file system that the file is on
+ @param filename the file name to check
+ @return is this file splitable?]]>
+ FileInputFormat is the base class for all file-based
+ InputFormats. This provides a generic implementation of
+ {@link #getSplits(JobConf, int)}.
+ Implementations of FileInputFormat can also override the
+ {@link #isSplitable(FileSystem, Path)} method to prevent input files
+ from being split-up in certain situations. Implementations that may
+ deal with non-splittable files must override this method, since
+ the default implementation assumes splitting is always possible.]]>
+ true if the job output should be compressed,
+ false
+ Tasks' Side-Effect Files
+ Note: The following is valid only if the {@link OutputCommitter}
+ is {@link FileOutputCommitter}. If OutputCommitter is not
+ a FileOutputCommitter, the task's temporary output
+ directory is the same as {@link #getOutputPath(JobConf)} i.e.
+ ${mapreduce.output.fileoutputformat.outputdir}
+ Some applications need to create/write-to side-files, which differ from
+ the actual job-outputs.
+ In such cases there could be issues with 2 instances of the same TIP
+ (running simultaneously e.g. speculative tasks) trying to open/write-to the
+ same file (path) on HDFS. Hence the application-writer will have to pick
+ unique names per task-attempt (e.g. using the attemptid, say
+ attempt_200709221812_0001_m_000000_0), not just per TIP.
+ To get around this the Map-Reduce framework helps the application-writer
+ out by maintaining a special
+ ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid}
+ sub-directory for each task-attempt on HDFS where the output of the
+ task-attempt goes. On successful completion of the task-attempt the files
+ in the ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid} (only)
+ are promoted to ${mapreduce.output.fileoutputformat.outputdir}. Of course, the
+ framework discards the sub-directory of unsuccessful task-attempts. This
+ is completely transparent to the application.
+ The application-writer can take advantage of this by creating any
+ side-files required in ${mapreduce.task.output.dir} during execution
+ of the reduce-task i.e. via {@link #getWorkOutputPath(JobConf)}, and the
+ framework will move them out similarly - thus there is no need to pick
+ unique paths per task-attempt.
+ Note: the value of ${mapreduce.task.output.dir} during
+ execution of a particular task-attempt is actually
+ ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid}, and this value is
+ set by the map-reduce framework. So, just create any side-files in the
+ path returned by {@link #getWorkOutputPath(JobConf)} from map/reduce
+ task to take advantage of this feature.
+ The entire discussion holds true for maps of jobs with
+ reducer=NONE (i.e. 0 reduces) since output of the map, in that case,
+ goes directly to HDFS.
+ @return the {@link Path} to the task's temporary output directory
+ for the map-reduce job.]]>
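The write-then-promote pattern described above can be imitated on a local file system. The sketch below is only an analogy built on java.nio.file: the class and method names are hypothetical, it does not use the Hadoop OutputCommitter API, and it assumes that a task attempt writes into `_temporary/_<attemptid>` under the job output directory and that a successful attempt has its files moved up one level.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Local-filesystem analogy of per-attempt work directories; not Hadoop code. */
class SideFileCommitSketch {
    /** Returns the per-attempt work directory, creating it if needed. */
    static Path workOutputPath(Path outputDir, String attemptId) throws IOException {
        return Files.createDirectories(
            outputDir.resolve("_temporary").resolve("_" + attemptId));
    }

    /** Promotes every file of a successful attempt into the job output directory. */
    static void commit(Path outputDir, String attemptId) throws IOException {
        Path work = outputDir.resolve("_temporary").resolve("_" + attemptId);
        List<Path> files;
        try (Stream<Path> listing = Files.list(work)) {
            files = listing.collect(Collectors.toList());
        }
        for (Path file : files) {
            Files.move(file, outputDir.resolve(file.getFileName()));
        }
        Files.delete(work); // the attempt directory is empty at this point
    }

    /** Writes a side file in the work directory, commits, and reports what happened. */
    static boolean demo() {
        try {
            Path out = Files.createTempDirectory("job-out");
            String attempt = "attempt_200709221812_0001_m_000000_0";
            Path work = workOutputPath(out, attempt);
            Files.write(work.resolve("side.txt"),
                "side data".getBytes(StandardCharsets.UTF_8));
            boolean hiddenBeforeCommit = !Files.exists(out.resolve("side.txt"));
            commit(out, attempt);
            return hiddenBeforeCommit
                && Files.exists(out.resolve("side.txt"))
                && !Files.exists(work);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Two attempts of the same task write into different attempt directories, so they never collide, and only the attempt that is committed becomes visible in the output directory.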
+ The generated name can be used to create custom files from within the
+ different tasks for the job, the names for different tasks will not collide
+ with each other.
+ The given name is postfixed with the task type, 'm' for maps, 'r' for
+ reduces and the task partition number. For example, given a name 'test'
+ running on the first map of the job the generated name will be
+ 'test-m-00000'.
+ @param conf the configuration for the job.
+ @param name the name to make unique.
+ @return a unique name across all tasks of the job.]]>
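The naming rule above can be reproduced with a short formatting helper. This is a sketch of the rule as documented, not the Hadoop method: the class name and the boolean `isMap` parameter are assumptions, and the five-digit zero padding is inferred from the 'test-m-00000' example.

```java
import java.util.Locale;

/** Illustrative re-creation of the documented naming rule; not Hadoop code. */
class UniqueNameSketch {
    /** Appends the task type ('m' or 'r') and a zero-padded partition number. */
    static String uniqueName(String name, boolean isMap, int partition) {
        return String.format(Locale.ROOT, "%s-%s-%05d",
            name, isMap ? "m" : "r", partition);
    }

    public static void main(String[] args) {
        System.out.println(uniqueName("test", true, 0)); // prints "test-m-00000"
    }
}
```

Because the task type and partition number are part of the name, two different tasks of the same job cannot produce the same file name from the same base name.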
+ The path can be used to create custom files from within the map and
+ reduce tasks. The path name will be unique for each task. The path parent
+ will be the job output directory.
+ This method uses the {@link #getUniqueName} method to make the file name
+ unique for the task.
+ @param conf the configuration for the job.
+ @param name the name for the file.
+ @return a unique path across all tasks of the job.]]>
+ conf.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, recordLength);
+ @see FixedLengthRecordReader]]>
+ Each {@link InputSplit} is then assigned to an individual {@link Mapper}
+ for processing.
+ Note: The split is a logical split of the inputs and the
+ input files are not physically split into chunks. For example, a split could
+ be an <input-file-path, start, offset> tuple.
+ @param job job configuration.
+ @param numSplits the desired number of splits, a hint.
+ @return an array of {@link InputSplit}s for the job.]]>
+ It is the responsibility of the RecordReader to respect
+ record boundaries while processing the logical split to present a
+ record-oriented view to the individual task.
+ @param split the {@link InputSplit}
+ @param job the job that this split belongs to
+ @return a {@link RecordReader}]]>
+ InputFormat describes the input-specification for a
+ Map-Reduce job.
+ The Map-Reduce framework relies on the InputFormat of the
+ job to:
+ - Validate the input-specification of the job.
+ - Split-up the input file(s) into logical {@link InputSplit}s, each of
+ which is then assigned to an individual {@link Mapper}.
+ - Provide the {@link RecordReader} implementation to be used to glean
+ input records from the logical InputSplit for processing by
+ the {@link Mapper}.
+ The default behavior of file-based {@link InputFormat}s, typically
+ sub-classes of {@link FileInputFormat}, is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of the input files. However, the {@link FileSystem} blocksize of
+ the input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+ mapreduce.input.fileinputformat.split.minsize.
+ Clearly, logical splits based on input-size are insufficient for many
+ applications since record boundaries are to be respected. In such cases, the
+ application has to also implement a {@link RecordReader} on whom lies the
+ responsibility to respect record-boundaries and present a record-oriented
+ view of the logical InputSplit to the individual task.
+ @see InputSplit
+ @see RecordReader
+ @see JobClient
+ @see FileInputFormat]]>
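The sizing rule above (block size as an upper bound, configured minimum as a lower bound) and the idea of a purely logical split can be sketched as follows. The class, the `Split` record-like type and the `goalSize` parameter (total input size divided by the requested number of splits) are assumptions for illustration; a real implementation may also apply extra rules, such as tolerating a slightly oversized last split, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative split computation; not the Hadoop FileInputFormat. */
class SplitSketch {
    /** A logical split: nothing is copied, it only names a byte range of a file. */
    static final class Split {
        final String path;
        final long start;
        final long length;

        Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    /** Block size caps the goal size; the configured minimum is a lower bound. */
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    /** Cuts a file of the given length into consecutive byte ranges. */
    static List<Split> splits(String path, long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Split> result = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            result.add(new Split(path, start, Math.min(splitSize, fileLength - start)));
        }
        return result;
    }
}
```

Since the ranges are cut by byte count alone, a record may straddle two splits, which is why the record reader has to handle record boundaries itself.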
+ InputSplit.
+ @return the number of bytes in the input split.
+ @throws IOException]]>
+ InputSplit is
+ located as an array of String
+ @throws IOException]]>
+ InputSplit represents the data to be processed by an
+ individual {@link Mapper}.
+ Typically, it presents a byte-oriented view on the input and is the
+ responsibility of {@link RecordReader} of the job to process this and present
+ a record-oriented view.
+ @see InputFormat
+ @see RecordReader]]>
+ SplitLocationInfos describing how the split
+ data is stored at each location. A null value indicates that all the
+ locations have the data stored on disk.
+ @throws IOException]]>
+ JobClient.]]>
+ jobid doesn't correspond to any known job.
+ @throws IOException]]>
+ JobClient is the primary interface for the user-job to interact
+ with the cluster.
+ JobClient
provides facilities to submit jobs, track their
+ progress, access component-tasks' reports/logs, get the Map-Reduce cluster
+ status information etc.
+ The job submission process involves:
+ -
+ Checking the input and output specifications of the job.
+ -
+ Computing the {@link InputSplit}s for the job.
+ -
+ Setup the requisite accounting information for the {@link DistributedCache}
+ of the job, if necessary.
+ -
+ Copying the job's jar and configuration to the map-reduce system directory
+ on the distributed file-system.
+ -
+ Submitting the job to the cluster and optionally monitoring
+ its status.
+ Normally the user creates the application, describes various facets of the
+ job via {@link JobConf} and then uses the JobClient to submit
+ the job and monitor its progress.
+ Here is an example of how to use JobClient:
+ // Create a new JobConf
+ JobConf job = new JobConf(new Configuration(), MyJob.class);
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+ FileInputFormat.setInputPaths(job, new Path("in"));
+ FileOutputFormat.setOutputPath(job, new Path("out"));
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+ // Submit the job, then poll for progress until the job is complete
+ JobClient.runJob(job);
+ Job Control
+ At times clients would chain map-reduce jobs to accomplish complex tasks
+ which cannot be done via a single map-reduce job. This is fairly easy since
+ the output of the job, typically, goes to distributed file-system and that
+ can be used as the input for the next job.
+ However, this also means that the onus of ensuring jobs are complete
+ (success/failure) lies squarely on the clients. In such situations the
+ various job-control options are:
+ -
+ {@link #runJob(JobConf)} : submits the job and returns only after
+ the job has completed.
+ -
+ {@link #submitJob(JobConf)} : only submits the job; the client then polls the
+ returned handle to the {@link RunningJob} to query status and make
+ scheduling decisions.
+ -
+ {@link JobConf#setJobEndNotificationURI(String)} : sets up a notification
+ on job-completion, thus avoiding polling.
+ @see JobConf
+ @see ClusterStatus
+ @see Tool
+ @see DistributedCache]]>
+ If the parameter {@code loadDefaults} is false, the new instance
+ will not load resources from the default files.
+ @param loadDefaults specifies whether to load from the default files]]>
+ true if framework should keep the intermediate files
+ for failed tasks, false
+ true if the outputs of the maps are to be compressed,
+ false
+ This comparator should be provided if the equivalence rules for keys
+ for sorting the intermediates are different from those for grouping keys
+ before each call to
+ {@link Reducer#reduce(Object, java.util.Iterator, OutputCollector, Reporter)}.
+ For key-value pairs (K1,V1) and (K2,V2), the values (V1, V2) are passed
+ in a single call to the reduce function if K1 and K2 compare as equal.
+ Since {@link #setOutputKeyComparatorClass(Class)} can be used to control
+ how keys are sorted, this can be used in conjunction to simulate
+ secondary sort on values.
+ Note: This is not a guarantee of the combiner sort being
+ stable in any sense. (In any case, with the order of available
+ map-outputs to the combiner being non-deterministic, it wouldn't make
+ that much sense.)
+ @param theClass the comparator class to be used for grouping keys for the
+ combiner. It should implement RawComparator
+ @see #setOutputKeyComparatorClass(Class)]]>
+ This comparator should be provided if the equivalence rules for keys
+ for sorting the intermediates are different from those for grouping keys
+ before each call to
+ {@link Reducer#reduce(Object, java.util.Iterator, OutputCollector, Reporter)}.
+ For key-value pairs (K1,V1) and (K2,V2), the values (V1, V2) are passed
+ in a single call to the reduce function if K1 and K2 compare as equal.
+ Since {@link #setOutputKeyComparatorClass(Class)} can be used to control
+ how keys are sorted, this can be used in conjunction to simulate
+ secondary sort on values.
+ Note: This is not a guarantee of the reduce sort being
+ stable in any sense. (In any case, with the order of available
+ map-outputs to the reduce being non-deterministic, it wouldn't make
+ that much sense.)
+ @param theClass the comparator class to be used for grouping keys.
+ It should implement RawComparator
+ @see #setOutputKeyComparatorClass(Class)
+ @see #setCombinerKeyGroupingComparator(Class)]]>
+ combiner class used to combine map-outputs
+ before being sent to the reducers. Typically the combiner is the same as
+ the {@link Reducer} for the job, i.e. {@link #getReducerClass()}.
+ @return the user-defined combiner class used to combine map-outputs.]]>
+ combiner class used to combine map-outputs
+ before being sent to the reducers.
+ The combiner is an application-specified aggregation operation, which
+ can help cut down the amount of data transferred between the
+ {@link Mapper} and the {@link Reducer}, leading to better performance.
+ The framework may invoke the combiner 0, 1, or multiple times, in both
+ the mapper and reducer tasks. In general, the combiner is called as the
+ sort/merge result is written to disk. The combiner must:
+ - be side-effect free
+ - have the same input and output key types and the same input and
+ output value types
+ Typically the combiner is the same as the Reducer for the
+ job, i.e. {@link #setReducerClass(Class)}.
+ @param theClass the user-defined combiner class used to combine
+ map-outputs.]]>
+ true.
+ @return true if speculative execution is to be used for this job,
+ false
+ true if speculative execution
+ should be turned on, else false
+ true.
+ @return true if speculative execution is to be
+ used for this job for map tasks,
+ false
+ true if speculative execution
+ should be turned on for map tasks,
+ else false
+ true.
+ @return true if speculative execution is to be used
+ for reduce tasks for this job,
+ false
+ true if speculative execution
+ should be turned on for reduce tasks,
+ else false
+ 1.
+ @return the number of map tasks for this job.]]>
+ Note: This is only a hint to the framework. The actual
+ number of spawned map tasks depends on the number of {@link InputSplit}s
+ generated by the job's {@link InputFormat#getSplits(JobConf, int)}.
+ A custom {@link InputFormat} is typically used to accurately control
+ the number of map tasks for the job.
+ How many maps?
+ The number of maps is usually driven by the total size of the inputs
+ i.e. total number of blocks of the input files.
+ The right level of parallelism for maps seems to be around 10-100 maps
+ per-node, although it has been set up to 300 or so for very cpu-light map
+ tasks. Task setup takes a while, so it is best if the maps take at least a
+ minute to execute.
+ The default behavior of file-based {@link InputFormat}s is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of input files. However, the {@link FileSystem} blocksize of the
+ input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+ mapreduce.input.fileinputformat.split.minsize.
+ Thus, if you expect 10TB of input data and have a blocksize of 128MB,
+ you'll end up with 82,000 maps, unless {@link #setNumMapTasks(int)} is
+ used to set it even higher.
+ @param n the number of map tasks for this job.
+ @see InputFormat#getSplits(JobConf, int)
+ @see FileInputFormat
+ @see FileSystem#getDefaultBlockSize()
+ @see FileStatus#getBlockSize()]]>
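The 82,000 figure quoted above is a rounded form of a ceiling division. The sketch below works it through under the simplifying assumption of one split per block and a single contiguous input; the class and method names are hypothetical and not part of Hadoop.

```java
final class MapCountSketch {

    /** Estimates the number of map tasks as one per (possibly partial) block. */
    static long estimateMaps(long totalInputBytes, long blockSizeBytes) {
        if (totalInputBytes < 0 || blockSizeBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and block size positive");
        }
        // Ceiling division: a partial trailing block still needs its own map.
        return (totalInputBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long tenTb = 10L * 1024 * 1024 * 1024 * 1024; // 10 TB in bytes
        long block = 128L * 1024 * 1024;              // 128 MB in bytes
        System.out.println(estimateMaps(tenTb, block)); // prints 81920
    }
}
```

Real jobs split per file, so many small files would yield more maps than this estimate suggests.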
+ 1.
+ @return the number of reduce tasks for this job.]]>
+ How many reduces?
+ The right number of reduces seems to be 0.95 or 1.75 multiplied by (
+ available memory for reduce tasks
+ (The value of this should be smaller than
+ numNodes * yarn.nodemanager.resource.memory-mb
+ since the resource of memory is shared by map tasks and other
+ applications) /
+ mapreduce.reduce.memory.mb).
+ With 0.95 all of the reduces can launch immediately and
+ start transferring map outputs as the maps finish. With 1.75
+ the faster nodes will finish their first round of reduces and launch a
+ second wave of reduces, doing a much better job of load balancing.
+ Increasing the number of reduces increases the framework overhead, but
+ increases load balancing and lowers the cost of failures.
+ The scaling factors above are slightly less than whole numbers to
+ reserve a few reduce slots in the framework for speculative-tasks, failures
+ etc.
+ Reducer NONE
+ It is legal to set the number of reduce-tasks to zero.
+ In this case the output of the map-tasks goes directly to the distributed
+ file-system, to the path set by
+ {@link FileOutputFormat#setOutputPath(JobConf, Path)}. Also, the
+ framework doesn't sort the map-outputs before writing them out to HDFS.
+ @param n the number of reduce tasks for this job.]]>
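The rule of thumb above (a factor of 0.95 or 1.75 times the number of reduce containers that fit in the available memory) can be turned into a small calculation. This is a rough sketch only; the names are hypothetical, and rounding to the nearest whole number is an assumption since the text does not say how to round.

```java
final class ReduceCountSketch {

    /**
     * @param factor                  0.95 (single wave) or 1.75 (two waves)
     * @param availableReduceMemoryMb memory assumed available to reduce tasks, in MB
     * @param reduceMemoryMb          memory requested per reduce task, in MB
     */
    static long estimateReduces(double factor, long availableReduceMemoryMb, long reduceMemoryMb) {
        if (reduceMemoryMb <= 0) {
            throw new IllegalArgumentException("reduceMemoryMb must be positive");
        }
        // Whole reduce containers that fit into the available memory.
        long containers = availableReduceMemoryMb / reduceMemoryMb;
        return Math.round(factor * containers);
    }

    public static void main(String[] args) {
        // 100 GB available for reduces, 1 GB per reduce task.
        System.out.println(estimateReduces(0.95, 102400, 1024));
        System.out.println(estimateReduces(1.75, 102400, 1024));
    }
}
```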
+ mapreduce.map.maxattempts
+ property. If this property is not already set, the default is 4 attempts.
+ @return the max number of attempts per map task.]]>
+ mapreduce.reduce.maxattempts
+ property. If this property is not already set, the default is 4 attempts.
+ @return the max number of attempts per reduce task.]]>
+ noFailures, the
+ tasktracker is blacklisted for this job.
+ @param noFailures maximum no. of failures of a given job per tasktracker.]]>
+ blacklisted for this job.
+ @return the maximum no. of failures of a given job per tasktracker.]]>
+ failed.
+ Defaults to zero, i.e. any failed map-task results in
+ the job being declared as {@link JobStatus#FAILED}.
+ @return the maximum percentage of map tasks that can fail without
+ the job being aborted.]]>
+ failed.
+ @param percent the maximum percentage of map tasks that can fail without
+ the job being aborted.]]>
+ failed.
+ Defaults to zero, i.e. any failed reduce-task results
+ in the job being declared as {@link JobStatus#FAILED}.
+ @return the maximum percentage of reduce tasks that can fail without
+ the job being aborted.]]>
+ failed.
+ @param percent the maximum percentage of reduce tasks that can fail without
+ the job being aborted.]]>
+ The debug script can aid debugging of failed map tasks. The script is
+ given the task's stdout, stderr, syslog, jobconf files as arguments.
+ The debug command, run on the node where the map failed, is:
+ $script $stdout $stderr $syslog $jobconf.
+ The script file is distributed through {@link DistributedCache}
+ APIs. The script needs to be symlinked.
+ Here is an example of how to submit a script:
+ job.setMapDebugScript("./myscript");
+ DistributedCache.createSymlink(job);
+ DistributedCache.addCacheFile(new URI("/debug/scripts/myscript#myscript"), job);
+ @param mDbgScript the script name]]>
+ The debug script can aid debugging of failed reduce tasks. The script
+ is given the task's stdout, stderr, syslog, jobconf files as arguments.
+ The debug command, run on the node where the reduce failed, is:
+ $script $stdout $stderr $syslog $jobconf.
+ The script file is distributed through {@link DistributedCache}
+ APIs. The script file needs to be symlinked.
+ Here is an example of how to submit a script:
+ job.setReduceDebugScript("./myscript");
+ DistributedCache.createSymlink(job);
+ DistributedCache.addCacheFile(new URI("/debug/scripts/myscript#myscript"), job);
+ @param rDbgScript the script name]]>
+ null if it hasn't
+ been set.
+ @see #setJobEndNotificationURI(String)]]>
+ The uri can contain 2 special parameters: $jobId and
+ $jobStatus. Those, if present, are replaced by the job's
+ identifier and completion-status respectively.
+ This is typically used by application-writers to implement chaining of
+ Map-Reduce jobs in an asynchronous manner.
+ @param uri the job end notification uri
+ @see JobStatus]]>
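The placeholder substitution described above can be pictured with plain string replacement. This is an illustrative sketch, not the framework's implementation; the class name, the method name, and the `example.com` URL are all hypothetical.

```java
final class NotificationUriSketch {

    /** Replaces the $jobId and $jobStatus placeholders in a notification URI template. */
    static String expand(String uriTemplate, String jobId, String jobStatus) {
        // String.replace works on literal text, so '$' needs no escaping here.
        return uriTemplate
                .replace("$jobId", jobId)
                .replace("$jobStatus", jobStatus);
    }

    public static void main(String[] args) {
        String template = "http://example.com/notify?id=$jobId&status=$jobStatus";
        System.out.println(expand(template, "job_200707121733_0003", "SUCCEEDED"));
    }
}
```

A chained workflow could listen on such a URL and submit the next job when the reported status indicates success.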
+ When a job starts, a shared directory is created at location
+ ${mapreduce.cluster.local.dir}/taskTracker/$user/jobcache/$jobid/work/
+ This directory is exposed to the users through
+ mapreduce.job.local.dir.
+ So, the tasks can use this space
+ as scratch space and share files among them.
+ This value is also available as a System property.
+ @return The localized job specific shared directory]]>
+ For backward compatibility, if the job configuration sets the
+ key {@link #MAPRED_TASK_MAXVMEM_PROPERTY} to a value different
+ from {@link #DISABLED_MEMORY_LIMIT}, that value will be used
+ after converting it from bytes to MB.
+ @return memory required to run a map task of the job, in MB.]]>
+ For backward compatibility, if the job configuration sets the
+ key {@link #MAPRED_TASK_MAXVMEM_PROPERTY} to a value different
+ from {@link #DISABLED_MEMORY_LIMIT}, that value will be used
+ after converting it from bytes to MB.
+ @return memory required to run a reduce task of the job, in MB.]]>
+ This method is deprecated. Now, different memory limits can be
+ set for map and reduce tasks of a job, in MB.
+ For backward compatibility, if the job configuration sets the
+ key {@link #MAPRED_TASK_MAXVMEM_PROPERTY}, that value is returned.
+ Otherwise, this method will return the larger of the values returned by
+ {@link #getMemoryForMapTask()} and {@link #getMemoryForReduceTask()}
+ after converting them into bytes.
+ @return Memory required to run a task of this job, in bytes.
+ @see #setMaxVirtualMemoryForTask(long)
+ @deprecated Use {@link #getMemoryForMapTask()} and
+ {@link #getMemoryForReduceTask()}]]>
+ mapred.task.maxvmem is split into
+ mapreduce.map.memory.mb
+ and mapreduce.reduce.memory.mb;
+ each of the new keys is set to the value of
+ mapred.task.maxvmem converted from bytes to MB,
+ as the new values are in MB.
+ @param vmem Maximum amount of virtual memory in bytes any task of this job
+ can use.
+ @see #getMaxVirtualMemoryForTask()
+ @deprecated
+ Use {@link #setMemoryForMapTask(long mem)} and
+ {@link #setMemoryForReduceTask(long mem)}]]>
+ k1=v1,k2=v2. Further it can
+ reference existing environment variables via $key on
+ Linux or %key% on Windows.
+ Example:
+ - A=foo - This will set the env variable A to foo.
+ - B=$X:c - This inherits the tasktracker's X env variable on Linux.
+ - B=%X%;c - This inherits the tasktracker's X env variable on Windows.
+ @deprecated Use {@link #MAPRED_MAP_TASK_ENV} or
+ k1=v1,k2=v2. Further it can
+ reference existing environment variables via $key on
+ Linux or %key% on Windows.
+ Example:
+ - A=foo - This will set the env variable A to foo.
+ - B=$X:c - This inherits the tasktracker's X env variable on Linux.
+ - B=%X%;c - This inherits the tasktracker's X env variable on Windows.
+ k1=v1,k2=v2. Further it can
+ reference existing environment variables via $key on
+ Linux or %key% on Windows.
+ Example:
+ - A=foo - This will set the env variable A to foo.
+ - B=$X:c - This inherits the tasktracker's X env variable on Linux.
+ - B=%X%;c - This inherits the tasktracker's X env variable on Windows.
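To make the k1=v1,k2=v2 format above concrete, here is a small parser sketch. It is not how the framework processes the setting: the class and method names are hypothetical, it handles only the Linux-style $key form, and the reference expansion is deliberately naive (a variable whose name is a prefix of another would be mis-expanded).

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TaskEnvSketch {

    /**
     * Parses "k1=v1,k2=v2" and expands $NAME references against the
     * inherited environment passed in by the caller.
     */
    static Map<String, String> parse(String spec, Map<String, String> inherited) {
        Map<String, String> result = new LinkedHashMap<>();
        if (spec == null || spec.isEmpty()) {
            return result;
        }
        for (String pair : spec.split(",")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // skip entries without a key or without '='
            }
            String key = pair.substring(0, eq).trim();
            String value = pair.substring(eq + 1);
            // Naive expansion of Linux-style references such as $X.
            for (Map.Entry<String, String> e : inherited.entrySet()) {
                value = value.replace("$" + e.getKey(), e.getValue());
            }
            result.put(key, value);
        }
        return result;
    }
}
```

With an inherited variable X set to /opt/lib, the specification A=foo,B=$X:c would yield A=foo and B=/opt/lib:c in this sketch.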
+ JobConf is the primary interface for a user to describe a
+ map-reduce job to the Hadoop framework for execution. The framework tries to
+ faithfully execute the job as described by JobConf, however:
+ -
+ Some configuration parameters might have been marked as
+ final by administrators and hence cannot be altered.
+ -
+ While some job parameters are straightforward to set
+ (e.g. {@link #setNumReduceTasks(int)}), some parameters interact subtly
+ with the rest of the framework and/or job-configuration and are relatively
+ more complex for the user to control finely
+ (e.g. {@link #setNumMapTasks(int)}).
+ JobConf typically specifies the {@link Mapper}, combiner
+ (if any), {@link Partitioner}, {@link Reducer}, {@link InputFormat} and
+ {@link OutputFormat} implementations to be used etc.
+ Optionally JobConf is used to specify other advanced facets
+ of the job such as Comparators to be used, files to be put in
+ the {@link DistributedCache}, whether or not intermediate and/or job outputs
+ are to be compressed (and how), debuggability via user-provided scripts
+ ({@link #setMapDebugScript(String)}/{@link #setReduceDebugScript(String)})
+ for doing post-processing on task logs, task's stdout, stderr, syslog,
+ etc.
+ Here is an example of how to configure a job via JobConf:
+ // Create a new JobConf
+ JobConf job = new JobConf(new Configuration(), MyJob.class);
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+ FileInputFormat.setInputPaths(job, new Path("in"));
+ FileOutputFormat.setOutputPath(job, new Path("out"));
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setCombinerClass(MyJob.MyReducer.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+ job.setInputFormat(SequenceFileInputFormat.class);
+ job.setOutputFormat(SequenceFileOutputFormat.class);
+ @see JobClient
+ @see ClusterStatus
+ @see Tool
+ @see DistributedCache]]>
+ any job
+ run on the jobtracker started at 200707121733, we would use:
+ JobID.getJobIDsPattern("200707121733", null);
+ which will return:
+ "job_200707121733_[0-9]*"
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @return a regex pattern matching JobIDs]]>
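The pattern string shown above is an ordinary regular expression, so it can be checked with `java.util.regex`. The builder below is a stand-alone re-creation for illustration only; the wildcard used for a null jobtracker identifier and the four-digit padding of the job number are assumptions of this sketch.

```java
import java.util.regex.Pattern;

final class JobIdPatternSketch {

    /** Builds a regex for job IDs; a null argument acts as a wildcard. */
    static String jobIdsPattern(String jtIdentifier, Integer jobId) {
        StringBuilder sb = new StringBuilder("job_");
        sb.append(jtIdentifier != null ? jtIdentifier : "[^_]*");
        sb.append('_');
        // Assumed formatting: job numbers zero-padded to at least four digits.
        sb.append(jobId != null ? String.format("%04d", jobId) : "[0-9]*");
        return sb.toString();
    }

    /** Returns true if the whole candidate string matches the pattern. */
    static boolean matches(String pattern, String candidate) {
        return Pattern.matches(pattern, candidate);
    }

    public static void main(String[] args) {
        String pattern = jobIdsPattern("200707121733", null);
        System.out.println(pattern);
        System.out.println(matches(pattern, "job_200707121733_0003"));
    }
}
```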
+ An example JobID is:
+ job_200707121733_0003, which represents the third job
+ running at the jobtracker started at 200707121733.
+ Applications should never construct or parse JobID strings, but rather
+ use appropriate constructors or {@link #forName(String)} method.
+ @see TaskID
+ @see TaskAttemptID]]>
+ Output pairs need not be of the same types as input pairs. A given
+ input pair may map to zero or many output pairs. Output pairs are
+ collected with calls to
+ {@link OutputCollector#collect(Object,Object)}.
+ Applications can use the {@link Reporter} provided to report progress
+ or just indicate that they are alive. In scenarios where the application
+ takes a significant amount of time to process individual key/value
+ pairs, this is crucial since the framework might assume that the task has
+ timed-out and kill that task. The other way of avoiding this is to set
+ mapreduce.task.timeout to a high-enough value (or even zero for no
+ time-outs).
+ @param key the input key.
+ @param value the input value.
+ @param output collects mapped keys and values.
+ @param reporter facility to report progress.]]>
+ Maps are the individual tasks which transform input records into
+ intermediate records. The transformed intermediate records need not be of
+ the same type as the input records. A given input pair may map to zero or
+ many output pairs.
+ The Hadoop Map-Reduce framework spawns one map task for each
+ {@link InputSplit} generated by the {@link InputFormat} for the job.
+ Mapper implementations can access the {@link JobConf} for the
+ job via the {@link JobConfigurable#configure(JobConf)} method and initialize
+ themselves. Similarly they can use the {@link Closeable#close()} method for
+ de-initialization.
+ The framework then calls
+ {@link #map(Object, Object, OutputCollector, Reporter)}
+ for each key/value pair in the InputSplit for that task.
+ All intermediate values associated with a given output key are
+ subsequently grouped by the framework, and passed to a {@link Reducer} to
+ determine the final output. Users can control the grouping by specifying
+ a Comparator via
+ {@link JobConf#setOutputKeyComparatorClass(Class)}.
+ The grouped Mapper outputs are partitioned per
+ Reducer. Users can control which keys (and hence records) go to
+ which Reducer by implementing a custom {@link Partitioner}.
+ Users can optionally specify a combiner, via
+ {@link JobConf#setCombinerClass(Class)}, to perform local aggregation of the
+ intermediate outputs, which helps to cut down the amount of data transferred
+ from the Mapper to the Reducer.
+ The intermediate, grouped outputs are always stored in
+ {@link SequenceFile}s. Applications can specify if and how the intermediate
+ outputs are to be compressed and which {@link CompressionCodec}s are to be
+ used via the JobConf.
+ If the job has zero reduces then the output of the Mapper is directly written
+ to the {@link FileSystem} without grouping by keys.
+ Example:
+ public class MyMapper<K extends WritableComparable, V extends Writable>
+ extends MapReduceBase implements Mapper<K, V, K, V> {
+ static enum MyCounters { NUM_RECORDS }
+ private String mapTaskId;
+ private String inputFile;
+ private int noRecords = 0;
+ public void configure(JobConf job) {
+ mapTaskId = job.get(JobContext.TASK_ATTEMPT_ID);
+ inputFile = job.get(JobContext.MAP_INPUT_FILE);
+ }
+ public void map(K key, V val,
+ OutputCollector<K, V> output, Reporter reporter)
+ throws IOException {
+ // Process the <key, value> pair (assume this takes a while)
+ // ...
+ // ...
+ // Let the framework know that we are alive, and kicking!
+ reporter.progress();
+ // Process some more
+ // ...
+ // ...
+ // Increment the no. of <key, value> pairs processed
+ ++noRecords;
+ // Increment counters
+ reporter.incrCounter(MyCounters.NUM_RECORDS, 1);
+ // Every 100 records update application-level status
+ if ((noRecords%100) == 0) {
+ reporter.setStatus(mapTaskId + " processed " + noRecords +
+ " from input-file: " + inputFile);
+ }
+ // Output the result
+ output.collect(key, val);
+ }
+ }
+ Applications may write a custom {@link MapRunnable} to exert greater
+ control on map processing e.g. multi-threaded Mappers etc.
+ @see JobConf
+ @see InputFormat
+ @see Partitioner
+ @see Reducer
+ @see MapReduceBase
+ @see MapRunnable
+ @see SequenceFile]]>
+ Provides default no-op implementations for a few methods; most non-trivial
+ applications need to override some of them.]]>
+ <key, value> pairs.
+ Mapping of input records to output records is complete when this method
+ returns.
+ @param input the {@link RecordReader} to read the input records.
+ @param output the {@link OutputCollector} to collect the output records.
+ @param reporter {@link Reporter} to report progress, status-updates etc.
+ @throws IOException]]>
+ Custom implementations of MapRunnable can exert greater
+ control on map processing e.g. multi-threaded, asynchronous mappers etc.
+ @see Mapper]]>
+ nearly
+ equal content length.
+ Subclasses implement {@link #getRecordReader(InputSplit, JobConf, Reporter)}
+ to construct RecordReaders for MultiFileSplits.
+ @see MultiFileSplit]]>
+ MultiFileSplit can be used to implement {@link RecordReader}s that
+ read one record per file.
+ @see FileSplit
+ @see MultiFileInputFormat]]>
+ <key, value> pairs output by {@link Mapper}s
+ and {@link Reducer}s.
+ OutputCollector is the generalization of the facility
+ provided by the Map-Reduce framework to collect data output by either the
+ Mapper or the Reducer, i.e. intermediate outputs
+ or the output of the job.
+ true if task output recovery is supported,
+ false
+ @throws IOException
+ @see #recoverTask(TaskAttemptContext)]]>
+ true if repeatable job commit is supported,
+ false
+ @throws IOException]]>
+ OutputCommitter. This is called from the application master
+ process, but it is called individually for each task.
+ If an exception is thrown the task will be attempted again.
+ @param taskContext Context of the task whose output is being recovered
+ @throws IOException]]>
+ OutputCommitter describes the commit of task output for a
+ Map-Reduce job.
+ The Map-Reduce framework relies on the OutputCommitter of
+ the job to:
+ -
+ Set up the job during initialization. For example, create the temporary
+ output directory for the job during the initialization of the job.
+ -
+ Clean up the job after the job completion. For example, remove the
+ temporary output directory after the job completion.
+ -
+ Set up the task temporary output.
+ -
+ Check whether a task needs a commit. This is to avoid the commit
+ procedure if a task does not need commit.
+ -
+ Commit of the task output.
+ -
+ Discard the task commit.
+ The methods in this class can be called from several different processes and
+ from several different contexts. It is important to know which process and
+ which context each is called from. Each method should be marked accordingly
+ in its documentation. It is also important to note that not all methods are
+ guaranteed to be called once and only once. If a method is not guaranteed to
+ have this property the output committer needs to handle this appropriately.
+ Also note that methods are called multiple times for the same task
+ only in rare situations.
+ @see FileOutputCommitter
+ @see JobContext
+ @see TaskAttemptContext]]>
+ This is to validate the output specification for the job when
+ a job is submitted. Typically checks that the output does not already exist,
+ throwing an exception when it already exists, so that output is not
+ overwritten.
+ @param ignored
+ @param job job configuration.
+ @throws IOException when output should not be attempted]]>
+ OutputFormat describes the output-specification for a
+ Map-Reduce job.
+ The Map-Reduce framework relies on the OutputFormat of the
+ job to:
+ -
+ Validate the output-specification of the job. For example, check that the
+ output directory doesn't already exist.
+ -
+ Provide the {@link RecordWriter} implementation to be used to write out
+ the output files of the job. Output files are stored in a
+ {@link FileSystem}.
+ @see RecordWriter
+ @see JobConf]]>
+ Typically a hash function on all or a subset of the key.
+ @param key the key to be partitioned.
+ @param value the entry value.
+ @param numPartitions the total number of partitions.
+ @return the partition number for the key
+ Partitioner controls the partitioning of the keys of the
+ intermediate map-outputs. The key (or a subset of the key) is used to derive
+ the partition, typically by a hash function. The total number of partitions
+ is the same as the number of reduce tasks for the job. Hence this controls
+ which of the m reduce tasks the intermediate key (and hence the
+ record) is sent to for reduction.
+ @see Reducer]]>
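A hash-based partition function of the kind described above can be sketched in a few lines. This is a stand-alone illustration with hypothetical names; it uses the key's `hashCode()` and assumes a positive `numPartitions`.

```java
final class HashPartitionSketch {

    /** Maps a key to a partition index in the range [0, numPartitions). */
    static int partitionFor(Object key, int numPartitions) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so the
        // remainder is never negative even for negative hash codes.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("a", 4));
        System.out.println(partitionFor("some-url", 4));
    }
}
```

Because equal keys have equal hash codes, all records for a given key land in the same partition and therefore reach the same reduce task.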
+ 0.0 to 1.0
+ @throws IOException]]>
+ RecordReader reads <key, value> pairs from an
+ {@link InputSplit}.
+ RecordReader, typically, converts the byte-oriented view of
+ the input, provided by the InputSplit, and presents a
+ record-oriented view for the {@link Mapper} and {@link Reducer} tasks for
+ processing. It thus assumes the responsibility of processing record
+ boundaries and presenting the tasks with keys and values.
+ @see InputSplit
+ @see InputFormat]]>
+ RecordWriter to future operations.
+ @param reporter facility to report progress.
+ @throws IOException]]>
+ RecordWriter writes the output <key, value> pairs
+ to an output file.
+ RecordWriter implementations write the job outputs to the
+ {@link FileSystem}.
+ @see OutputFormat]]>
+ Reduces values for a given key.
+ The framework calls this method for each
+ <key, (list of values)> pair in the grouped inputs.
+ Output values must be of the same type as input values. Input keys must
+ not be altered. The framework will reuse the key and value objects
+ that are passed into the reduce, therefore the application should clone
+ the objects they want to keep a copy of. In many cases, all values are
+ combined into zero or one value.
+ Output pairs are collected with calls to
+ {@link OutputCollector#collect(Object,Object)}.
+ Applications can use the {@link Reporter} provided to report progress
+ or just indicate that they are alive. In scenarios where the application
+ takes a significant amount of time to process individual key/value
+ pairs, this is crucial since the framework might assume that the task has
+ timed-out and kill that task. The other way of avoiding this is to set
+ mapreduce.task.timeout to a high-enough value (or even zero for no
+ time-outs).
+ @param key the key.
+ @param values the list of values to reduce.
+ @param output to collect keys and combined values.
+ @param reporter facility to report progress.]]>
+ The number of Reducers for the job is set by the user via
+ {@link JobConf#setNumReduceTasks(int)}. Reducer implementations
+ can access the {@link JobConf} for the job via the
+ {@link JobConfigurable#configure(JobConf)} method and initialize themselves.
+ Similarly they can use the {@link Closeable#close()} method for
+ de-initialization.
+ Reducer has 3 primary phases:
+ -
+ Shuffle
+ The Reducer takes as input the grouped output of a {@link Mapper}.
+ In this phase the framework, for each Reducer, fetches the
+ relevant partition of the output of all the Mappers, via HTTP.
+ -
+ Sort
+ The framework groups Reducer inputs by key
+ (since different Mappers may have output the same key) in this
+ stage.
+ The shuffle and sort phases occur simultaneously i.e. while outputs are
+ being fetched they are merged.
+ SecondarySort
+ If equivalence rules for keys while sorting the intermediates are
+ different from those for grouping keys before reduction, then one may
+ specify a Comparator via
+ {@link JobConf#setOutputValueGroupingComparator(Class)}. Since
+ {@link JobConf#setOutputKeyComparatorClass(Class)} can be used to
+ control how intermediate keys are sorted, these can be used in conjunction
+ to simulate secondary sort on values.
+ For example, say that you want to find duplicate web pages and tag them
+ all with the url of the "best" known example. You would set up the job
+ like:
+ - Map Input Key: url
+ - Map Input Value: document
+ - Map Output Key: document checksum, url pagerank
+ - Map Output Value: url
+ - Partitioner: by checksum
+ - OutputKeyComparator: by checksum and then decreasing pagerank
+ - OutputValueGroupingComparator: by checksum
+ -
+ Reduce
+ In this phase the
+ {@link #reduce(Object, Iterator, OutputCollector, Reporter)}
+ method is called for each <key, (list of values)> pair in
+ the grouped inputs.
+ The output of the reduce task is typically written to the
+ {@link FileSystem} via
+ {@link OutputCollector#collect(Object, Object)}.
+ The output of the Reducer is not re-sorted.
+ Example:
+ public class MyReducer<K extends WritableComparable, V extends Writable>
+ extends MapReduceBase implements Reducer<K, V, K, V> {
+ static enum MyCounters { NUM_RECORDS }
+ private String reduceTaskId;
+ private int noKeys = 0;
+ public void configure(JobConf job) {
+ reduceTaskId = job.get(JobContext.TASK_ATTEMPT_ID);
+ }
+ public void reduce(K key, Iterator<V> values,
+ OutputCollector<K, V> output,
+ Reporter reporter)
+ throws IOException {
+ // Process
+ int noValues = 0;
+ while (values.hasNext()) {
+ V value = values.next();
+ // Increment the no. of values for this key
+ ++noValues;
+ // Process the <key, value> pair (assume this takes a while)
+ // ...
+ // ...
+ // Let the framework know that we are alive, and kicking!
+ if ((noValues%10) == 0) {
+ reporter.progress();
+ }
+ // Process some more
+ // ...
+ // ...
+ // Output the <key, value>
+ output.collect(key, value);
+ }
+ // Increment the no. of <key, list of values> pairs processed
+ ++noKeys;
+ // Increment counters
+ reporter.incrCounter(MyCounters.NUM_RECORDS, 1);
+ // Every 100 keys update application-level status
+ if ((noKeys%100) == 0) {
+ reporter.setStatus(reduceTaskId + " processed " + noKeys);
+ }
+ }
+ }
+ @see Mapper
+ @see Partitioner
+ @see Reporter
+ @see MapReduceBase]]>
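The duplicate-web-page setup described for the Reducer (sort by checksum and then by decreasing pagerank, group by checksum only) can be simulated in memory without any Hadoop classes. The sketch below is only an analogy for what the sort and grouping comparators achieve; `Page`, `SORT`, and `groupAfterSort` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SecondarySortSketch {

    /** Stand-in for a map output record: composite key parts plus the value (url). */
    static final class Page {
        final String checksum;
        final double pagerank;
        final String url;

        Page(String checksum, double pagerank, String url) {
            this.checksum = checksum;
            this.pagerank = pagerank;
            this.url = url;
        }
    }

    /** Sort comparator: by checksum, then by decreasing pagerank. */
    static final Comparator<Page> SORT = (a, b) -> {
        int c = a.checksum.compareTo(b.checksum);
        return c != 0 ? c : Double.compare(b.pagerank, a.pagerank);
    };

    /** Sorts with SORT, then groups by checksum only, keeping the sorted order of urls. */
    static Map<String, List<String>> groupAfterSort(List<Page> pages) {
        List<Page> sorted = new ArrayList<>(pages);
        sorted.sort(SORT);
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Page p : sorted) {
            groups.computeIfAbsent(p.checksum, k -> new ArrayList<>()).add(p.url);
        }
        return groups;
    }
}
```

Within each group the first url belongs to the highest-ranked page, so a reduce step could tag every duplicate with that first value.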
+ Counter of the given group/name.]]>
+ Counter of the given group/name.]]>
+ Enum.
+ @param amount A non-negative amount by which the counter is to
+ be incremented.]]>
+ InputSplit that the map is reading from.
+ @throws UnsupportedOperationException if called outside a mapper]]>
+ {@link Mapper} and {@link Reducer} can use the Reporter
+ provided to report progress or just indicate that they are alive. In
+ scenarios where the application takes a significant amount of time to
+ process individual key/value pairs, this is crucial since the framework
+ might assume that the task has timed-out and kill that task.
+ Applications can also update {@link Counters} via the provided
+ Reporter.
+ @see Progressable
+ @see Counters]]>
+ progress of the job's map-tasks, as a float between 0.0
+ and 1.0. When all map tasks have completed, the function returns 1.0.
+ @return the progress of the job's map-tasks.
+ @throws IOException]]>
+ progress of the job's reduce-tasks, as a float between 0.0
+ and 1.0. When all reduce tasks have completed, the function returns 1.0.
+ @return the progress of the job's reduce-tasks.
+ @throws IOException]]>
+ progress of the job's cleanup-tasks, as a float between 0.0
+ and 1.0. When all cleanup tasks have completed, the function returns 1.0.
+ @return the progress of the job's cleanup-tasks.
+ @throws IOException]]>
+ progress of the job's setup-tasks, as a float between 0.0
+ and 1.0. When all setup tasks have completed, the function returns 1.0.
+ @return the progress of the job's setup-tasks.
+ @throws IOException]]>
+ true if the job is complete, else false
+ @throws IOException]]>
+ true if the job succeeded, else false
+ @throws IOException]]>
+ true if the job retired, else false
+ @throws IOException]]>
+ RunningJob is the user-interface to query for details on a
+ running Map-Reduce job.
+ Clients can get hold of RunningJob via the {@link JobClient}
+ and then query the running-job for details such as name, configuration,
+ progress etc.
+ @see JobClient]]>
+ This allows the user to specify the key class to be different
+ from the actual class ({@link BytesWritable}) used for writing
+ @param conf the {@link JobConf} to modify
+ @param theClass the SequenceFile output key class.]]>
+ This allows the user to specify the value class to be different
+ from the actual class ({@link BytesWritable}) used for writing
+ @param conf the {@link JobConf} to modify
+ @param theClass the SequenceFile output value class.]]>
+ true if auto increment
+ false
+ true if auto increment
+ false
+ Hadoop provides an optional mode of execution in which the bad records
+ are detected and skipped in further attempts.
+ This feature can be used when map/reduce tasks crash deterministically on
+ certain input. This happens due to bugs in the map/reduce function. The usual
+ course would be to fix these bugs. But sometimes this is not possible;
+ perhaps the bug is in third party libraries for which the source code is
+ not available. Due to this, the task never completes even with
+ multiple attempts, and the complete data for that task is lost.
+ With this feature, only a small portion of data is lost surrounding
+ the bad record, which may be acceptable for some user applications.
+ See {@link SkipBadRecords#setMapperMaxSkipRecords(Configuration, long)}.
+ The skipping mode gets kicked off after a certain number of failures;
+ see {@link SkipBadRecords#setAttemptsToStartSkipping(Configuration, int)}.
+ In the skipping mode, the map/reduce task maintains the record range which
+ is getting processed at all times. Before giving the input to the
+ map/reduce function, it sends this record range to the Task tracker.
+ If the task crashes, the Task tracker knows which one was the last reported
+ range. On further attempts that range gets skipped.
+ all task attempt IDs
+ of any jobtracker, in any job, of the first
+ map task, we would use :
+ TaskAttemptID.getTaskAttemptIDsPattern(null, null, true, 1, null);
+ which will return :
+ "attempt_[^_]*_[0-9]*_m_000001_[0-9]*"
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @param isMap whether the tip is a map, or null
+ @param taskId taskId number, or null
+ @param attemptId the task attempt number, or null
+ @return a regex pattern matching TaskAttemptIDs]]>
+ all task attempt IDs
+ of any jobtracker, in any job, of the first
+ map task, we would use :
+ TaskAttemptID.getTaskAttemptIDsPattern(null, null, TaskType.MAP, 1, null);
+ which will return :
+ "attempt_[^_]*_[0-9]*_m_000001_[0-9]*"
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @param type the {@link TaskType}
+ @param taskId taskId number, or null
+ @param attemptId the task attempt number, or null
+ @return a regex pattern matching TaskAttemptIDs]]>
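The string returned in the example above is an ordinary regular expression, so it can be exercised with java.util.regex. The following is a minimal sketch and not part of Hadoop: the class and method names are hypothetical, and the pattern string is copied from the example rather than produced by calling TaskAttemptID.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hadoop: checks attempt ID strings against
// the pattern quoted in the Javadoc example.
class TaskAttemptPatternSketch {
    // Copied verbatim from the example: any jobtracker, any job,
    // map task 000001, any attempt number.
    static final Pattern FIRST_MAP_TASK =
        Pattern.compile("attempt_[^_]*_[0-9]*_m_000001_[0-9]*");

    // Returns true only if the whole string matches the pattern.
    static boolean isFirstMapTaskAttempt(String attemptId) {
        return FIRST_MAP_TASK.matcher(attemptId).matches();
    }

    public static void main(String[] args) {
        // Map task 000001: matches.
        System.out.println(isFirstMapTaskAttempt("attempt_200707121733_0003_m_000001_0"));
        // Map task 000005: does not match.
        System.out.println(isFirstMapTaskAttempt("attempt_200707121733_0003_m_000005_0"));
    }
}
```

The sketch only filters strings that already exist, for example when scanning log lines; it does not construct or parse IDs.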
+ An example TaskAttemptID is:
+ attempt_200707121733_0003_m_000005_0, which represents the
+ zeroth task attempt for the fifth map task in the third job
+ running at the jobtracker started at 200707121733.
+ Applications should never construct or parse TaskAttemptID strings,
+ but rather use appropriate constructors or the {@link #forName(String)}
+ method.
+ @see JobID
+ @see TaskID]]>
+ the first map task
+ of any jobtracker, of any job, we would use :
+ TaskID.getTaskIDsPattern(null, null, true, 1);
+ which will return :
+ "task_[^_]*_[0-9]*_m_000001*"
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @param isMap whether the tip is a map, or null
+ @param taskId taskId number, or null
+ @return a regex pattern matching TaskIDs
+ @deprecated Use {@link TaskID#getTaskIDsPattern(String, Integer, TaskType,
+ Integer)}]]>
+ the first map task
+ of any jobtracker, of any job, we would use :
+ TaskID.getTaskIDsPattern(null, null, TaskType.MAP, 1);
+ which will return :
+ "task_[^_]*_[0-9]*_m_000001*"
+ @param jtIdentifier jobTracker identifier, or null
+ @param jobId job number, or null
+ @param type the {@link TaskType}, or null
+ @param taskId taskId number, or null
+ @return a regex pattern matching TaskIDs]]>
+ An example TaskID is:
+ task_200707121733_0003_m_000005, which represents the
+ fifth map task in the third job running at the jobtracker
+ started at 200707121733.
+ Applications should never construct or parse TaskID strings,
+ but rather use appropriate constructors or the {@link #forName(String)}
+ method.
+ @see JobID
+ @see TaskAttemptID]]>
+ true if the Job was added.]]>
+ ([,]*)
+ func ::= tbl(,"")
+ class ::= @see java.lang.Class#forName(java.lang.String)
+ path ::= @see org.apache.hadoop.fs.Path#Path(java.lang.String)
+ }
+ Reads expression from the mapred.join.expr property and
+ user-supplied join types from mapred.join.define.<ident>
+ types. Paths supplied to tbl are given as input paths to the
+ InputFormat class listed.
+ @see #compose(java.lang.String, java.lang.Class, java.lang.String...)]]>
+ , ) }]]>
+ (tbl(,),tbl(,),...,tbl(,)) }]]>
+ (tbl(,),tbl(,),...,tbl(,)) }]]>
+ mapred.join.define.<ident> to a classname. In the expression
+ mapred.join.expr, the identifier will be assumed to be a
+ ComposableRecordReader.
+ mapred.join.keycomparator can be a classname used to compare keys
+ in the join.
+ @see #setFormat
+ @see JoinRecordReader
+ @see MultiFilterRecordReader]]>
+ ......
+ }]]>
+ capacity children to position
+ id in the parent reader.
+ The id of a root CompositeRecordReader is -1 by convention, but relying
+ on this is not recommended.]]>
+ override(S1,S2,S3) will prefer values
+ from S3 over S2, and values from S2 over S1 for all keys
+ emitted from all sources.]]>
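The preference rule described above can be illustrated without any Hadoop types. This is a minimal sketch under stated assumptions: each source is modelled as an in-memory Map, and the class and method names are hypothetical. It shows only which value wins for a key, not how the record reader streams and joins sorted inputs.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the override(S1,S2,S3) preference rule.
class OverrideJoinSketch {
    // Sources are applied left to right, so a value from a later source
    // replaces the value an earlier source supplied for the same key.
    @SafeVarargs
    static <K extends Comparable<K>, V> Map<K, V> override(Map<K, V>... sources) {
        Map<K, V> result = new TreeMap<>();
        for (Map<K, V> source : sources) {
            result.putAll(source);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> s1 = new TreeMap<>();
        s1.put("a", "s1");
        s1.put("b", "s1");
        Map<String, String> s2 = new TreeMap<>();
        s2.put("b", "s2");
        Map<String, String> s3 = new TreeMap<>();
        s3.put("b", "s3");
        s3.put("c", "s3");
        // Prints {a=s1, b=s3, c=s3}
        System.out.println(override(s1, s2, s3));
    }
}
```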
+ It has to be specified how key and values are passed from one element of
+ the chain to the next, by value or by reference. If a Mapper leverages the
+ assumed semantics that the key and values are not modified by the collector,
+ 'by value' must be used. If the Mapper does not expect these semantics,
+ 'by reference' can be used as an optimization to avoid serialization and
+ deserialization.
+ For the added Mapper, the configuration given for it,
+ mapperConf, has precedence over the job's JobConf. This
+ precedence is in effect when the task is running.
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper; this is done by the addMapper for the last mapper in the chain.
+ @param job job's JobConf to add the Mapper class.
+ @param klass the Mapper class to add.
+ @param inputKeyClass mapper input key class.
+ @param inputValueClass mapper input value class.
+ @param outputKeyClass mapper output key class.
+ @param outputValueClass mapper output value class.
+ @param byValue indicates if key/values should be passed by value
+ to the next Mapper in the chain, if any.
+ @param mapperConf a JobConf with the configuration for the Mapper
+ class. It is recommended to use a JobConf without default values using the
+ JobConf(boolean loadDefaults) constructor with FALSE.]]>
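The difference between passing 'by value' and 'by reference' can be seen in a small stand-alone sketch. It uses no Hadoop classes: a StringBuilder stands in for a mutable key or value, and all names are hypothetical. The copy made here is a plain object copy, used as a stand-in for the serialization round trip mentioned above.

```java
// Hypothetical illustration of by-value versus by-reference hand-over
// between two elements of a chain.
class ChainPassingSketch {
    // A downstream element that modifies the object it receives.
    static void downstream(StringBuilder value) {
        value.append("!");
    }

    // The upstream element hands its object on and afterwards still relies
    // on it being unchanged. Returns what upstream sees after the hand-over.
    static String upstreamView(boolean byValue) {
        StringBuilder value = new StringBuilder("record");
        // By value: downstream gets its own copy. By reference: same object.
        StringBuilder handedOver = byValue ? new StringBuilder(value) : value;
        downstream(handedOver);
        return value.toString();
    }

    public static void main(String[] args) {
        System.out.println(upstreamView(true));   // prints record
        System.out.println(upstreamView(false));  // prints record!
    }
}
```

The by-reference path avoids the copy but exposes the upstream element to changes made later in the chain, which is the trade-off the byValue flag selects.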
+ If this method is overridden, super.configure(...) should be
+ invoked at the beginning of the overriding method.]]>
+ map(...) methods of the Mappers in the chain.]]>
+ If this method is overridden, super.close() should be
+ invoked at the end of the overriding method.]]>
+ The Mapper classes are invoked in a chained (or piped) fashion, the output of
+ the first becomes the input of the second, and so on until the last Mapper,
+ the output of the last Mapper will be written to the task's output.
+ The key functionality of this feature is that the Mappers in the chain do not
+ need to be aware that they are executed in a chain. This enables having
+ reusable specialized Mappers that can be combined to perform composite
+ operations within a single task.
+ Special care has to be taken when creating chains that the key/values output
+ by a Mapper are valid for the following Mapper in the chain. It is assumed
+ all Mappers and the Reduce in the chain use matching output and input key and
+ value classes as no conversion is done by the chaining code.
+ Using the ChainMapper and the ChainReducer classes, it is possible to compose
+ Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. An
+ immediate benefit of this pattern is a dramatic reduction in disk IO.
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper; this is done by the addMapper for the last mapper in the chain.
+ ChainMapper usage pattern:
+ ...
+ conf.setJobName("chain");
+ conf.setInputFormat(TextInputFormat.class);
+ conf.setOutputFormat(TextOutputFormat.class);
+ JobConf mapAConf = new JobConf(false);
+ ...
+ ChainMapper.addMapper(conf, AMap.class, LongWritable.class, Text.class,
+ Text.class, Text.class, true, mapAConf);
+ JobConf mapBConf = new JobConf(false);
+ ...
+ ChainMapper.addMapper(conf, BMap.class, Text.class, Text.class,
+ LongWritable.class, Text.class, false, mapBConf);
+ JobConf reduceConf = new JobConf(false);
+ ...
+ ChainReducer.setReducer(conf, XReduce.class, LongWritable.class, Text.class,
+ Text.class, Text.class, true, reduceConf);
+ ChainReducer.addMapper(conf, CMap.class, Text.class, Text.class,
+ LongWritable.class, Text.class, false, null);
+ ChainReducer.addMapper(conf, DMap.class, LongWritable.class, Text.class,
+ LongWritable.class, LongWritable.class, true, null);
+ FileInputFormat.setInputPaths(conf, inDir);
+ FileOutputFormat.setOutputPath(conf, outDir);
+ ...
+ JobClient jc = new JobClient(conf);
+ RunningJob job = jc.submitJob(conf);
+ ...
+ It has to be specified how key and values are passed from one element of
+ the chain to the next, by value or by reference. If a Reducer leverages the
+ assumed semantics that the key and values are not modified by the collector,
+ 'by value' must be used. If the Reducer does not expect these semantics,
+ 'by reference' can be used as an optimization to avoid serialization and
+ deserialization.
+ For the added Reducer, the configuration given for it,
+ reducerConf, has precedence over the job's JobConf. This
+ precedence is in effect when the task is running.
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainReducer; this is done by the setReducer or the addMapper for the last
+ element in the chain.
+ @param job job's JobConf to add the Reducer class.
+ @param klass the Reducer class to add.
+ @param inputKeyClass reducer input key class.
+ @param inputValueClass reducer input value class.
+ @param outputKeyClass reducer output key class.
+ @param outputValueClass reducer output value class.
+ @param byValue indicates if key/values should be passed by value
+ to the next Mapper in the chain, if any.
+ @param reducerConf a JobConf with the configuration for the Reducer
+ class. It is recommended to use a JobConf without default values using the
+ JobConf(boolean loadDefaults) constructor with FALSE.]]>
+ It has to be specified how key and values are passed from one element of
+ the chain to the next, by value or by reference. If a Mapper leverages the
+ assumed semantics that the key and values are not modified by the collector,
+ 'by value' must be used. If the Mapper does not expect these semantics,
+ 'by reference' can be used as an optimization to avoid serialization and
+ deserialization.
+ For the added Mapper, the configuration given for it,
+ mapperConf, has precedence over the job's JobConf. This
+ precedence is in effect when the task is running.
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper; this is done by the addMapper for the last mapper in the chain.
+ @param job chain job's JobConf to add the Mapper class.
+ @param klass the Mapper class to add.
+ @param inputKeyClass mapper input key class.
+ @param inputValueClass mapper input value class.
+ @param outputKeyClass mapper output key class.
+ @param outputValueClass mapper output value class.
+ @param byValue indicates if key/values should be passed by value
+ to the next Mapper in the chain, if any.
+ @param mapperConf a JobConf with the configuration for the Mapper
+ class. It is recommended to use a JobConf without default values using the
+ JobConf(boolean loadDefaults) constructor with FALSE.]]>
+ If this method is overridden, super.configure(...) should be
+ invoked at the beginning of the overriding method.]]>
+ reduce(...) method of the Reducer with the
+ map(...) methods of the Mappers in the chain.]]>
+ If this method is overridden, super.close() should be
+ invoked at the end of the overriding method.]]>
+ For each record output by the Reducer, the Mapper classes are invoked in a
+ chained (or piped) fashion, the output of the first becomes the input of the
+ second, and so on until the last Mapper, the output of the last Mapper will
+ be written to the task's output.
+ The key functionality of this feature is that the Mappers in the chain do not
+ need to be aware that they are executed after the Reducer or in a chain.
+ This enables having reusable specialized Mappers that can be combined to
+ perform composite operations within a single task.
+ Special care has to be taken when creating chains that the key/values output
+ by a Mapper are valid for the following Mapper in the chain. It is assumed
+ all Mappers and the Reduce in the chain use matching output and input key and
+ value classes as no conversion is done by the chaining code.
+ Using the ChainMapper and the ChainReducer classes, it is possible to compose
+ Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. An
+ immediate benefit of this pattern is a dramatic reduction in disk IO.
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainReducer; this is done by the setReducer or the addMapper for the last
+ element in the chain.
+ ChainReducer usage pattern:
+ ...
+ conf.setJobName("chain");
+ conf.setInputFormat(TextInputFormat.class);
+ conf.setOutputFormat(TextOutputFormat.class);
+ JobConf mapAConf = new JobConf(false);
+ ...
+ ChainMapper.addMapper(conf, AMap.class, LongWritable.class, Text.class,
+ Text.class, Text.class, true, mapAConf);
+ JobConf mapBConf = new JobConf(false);
+ ...
+ ChainMapper.addMapper(conf, BMap.class, Text.class, Text.class,
+ LongWritable.class, Text.class, false, mapBConf);
+ JobConf reduceConf = new JobConf(false);
+ ...
+ ChainReducer.setReducer(conf, XReduce.class, LongWritable.class, Text.class,
+ Text.class, Text.class, true, reduceConf);
+ ChainReducer.addMapper(conf, CMap.class, Text.class, Text.class,
+ LongWritable.class, Text.class, false, null);
+ ChainReducer.addMapper(conf, DMap.class, LongWritable.class, Text.class,
+ LongWritable.class, LongWritable.class, true, null);
+ FileInputFormat.setInputPaths(conf, inDir);
+ FileOutputFormat.setOutputPath(conf, outDir);
+ ...
+ JobClient jc = new JobClient(conf);
+ RunningJob job = jc.submitJob(conf);
+ ...
+ RecordReader's for CombineFileSplit
+ @see CombineFileSplit]]>
+ CombineFileRecordReader.
+ Subclassing is needed to get a concrete record reader wrapper because of the
+ constructor requirement.
+ @see CombineFileRecordReader
+ @see CombineFileInputFormat]]>
+ CombineFileInputFormat-equivalent for
+ SequenceFileInputFormat
+ @see CombineFileInputFormat]]>
+ CombineFileInputFormat-equivalent for
+ TextInputFormat
+ @see CombineFileInputFormat]]>
+ true if the name output is multi, false
+ if it is single. If the name output is not defined it returns
+ false
+ By default these counters are disabled.
+ MultipleOutputs supports counters; by default they are disabled.
+ The counters group is the {@link MultipleOutputs} class name.
+ The names of the counters are the same as the named outputs. For multi
+ named outputs the name of the counter is the concatenation of the named
+ output, an underscore '_' and the multiname.
+ @param conf job conf in which to enable or disable the counters.
+ @param enabled indicates if the counters will be enabled or not.]]>
+ By default these counters are disabled.
+ MultipleOutputs supports counters; by default they are disabled.
+ The counters group is the {@link MultipleOutputs} class name.
+ The names of the counters are the same as the named outputs. For multi
+ named outputs the name of the counter is the concatenation of the named
+ output, an underscore '_' and the multiname.
+ @param conf job conf from which to read the counters setting.
+ @return TRUE if the counters are enabled, FALSE if they are disabled.]]>
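The counter naming rule above amounts to a simple string concatenation. The following sketch only restates that rule; the class and method are hypothetical and do not correspond to any MultipleOutputs method.

```java
// Hypothetical helper restating the counter naming rule from the text.
class NamedOutputCounterNameSketch {
    // For a single named output, pass null as the multi name.
    static String counterName(String namedOutput, String multiName) {
        return (multiName == null) ? namedOutput : namedOutput + "_" + multiName;
    }

    public static void main(String[] args) {
        System.out.println(counterName("text", null)); // prints text
        System.out.println(counterName("seq", "A"));   // prints seq_A
    }
}
```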
+ If overridden, subclasses must invoke super.close() at the
+ end of their close().
+ @throws java.io.IOException thrown if any of the MultipleOutput files
+ could not be closed properly.]]>
+ OutputCollector passed to
+ the map() and reduce() methods of the
+ Mapper and Reducer
+ Each additional output, or named output, may be configured with its own
+ OutputFormat, with its own key class and with its own value
+ class.
+ A named output can be a single file or a multi file. The latter is referred
+ to as a multi named output.
+ A multi named output is an unbounded set of files all sharing the same
+ OutputFormat, key class and value class configuration.
+ When named outputs are used within a Mapper,
+ key/values written to a named output are not part of the reduce phase; only
+ key/values written to the job OutputCollector are part of the
+ reduce phase.
+ MultipleOutputs supports counters; by default they are disabled. The counters
+ group is the {@link MultipleOutputs} class name.
+ The names of the counters are the same as the named outputs. For multi
+ named outputs the name of the counter is the concatenation of the named
+ output, an underscore '_' and the multiname.
+ Job configuration usage pattern is:
+ JobConf conf = new JobConf();
+ FileInputFormat.setInputPaths(conf, inDir);
+ FileOutputFormat.setOutputPath(conf, outDir);
+ conf.setMapperClass(MOMap.class);
+ conf.setReducerClass(MOReduce.class);
+ ...
+ // Defines additional single text based output 'text' for the job
+ MultipleOutputs.addNamedOutput(conf, "text", TextOutputFormat.class,
+ LongWritable.class, Text.class);
+ // Defines additional multi sequencefile based output 'seq' for the
+ // job
+ MultipleOutputs.addMultiNamedOutput(conf, "seq",
+ SequenceFileOutputFormat.class,
+ LongWritable.class, Text.class);
+ ...
+ JobClient jc = new JobClient();
+ RunningJob job = jc.submitJob(conf);
+ ...
+ Usage pattern in the Reducer is:
+ public class MOReduce implements
+ Reducer<WritableComparable, Writable> {
+ private MultipleOutputs mos;
+ public void configure(JobConf conf) {
+ ...
+ mos = new MultipleOutputs(conf);
+ }
+ public void reduce(WritableComparable key, Iterator<Writable> values,
+ OutputCollector output, Reporter reporter)
+ throws IOException {
+ ...
+ mos.getCollector("text", reporter).collect(key, new Text("Hello"));
+ mos.getCollector("seq", "A", reporter).collect(key, new Text("Bye"));
+ mos.getCollector("seq", "B", reporter).collect(key, new Text("Chau"));
+ ...
+ }
+ public void close() throws IOException {
+ mos.close();
+ ...
+ }
+ }
+ It can be used instead of the default implementation,
+ {@link org.apache.hadoop.mapred.MapRunner}, when the Map
+ operation is not CPU bound in order to improve throughput.
+ Map implementations using this MapRunnable must be thread-safe.
+ The Map-Reduce job has to be configured to use this MapRunnable class (using
+ the JobConf.setMapRunnerClass method) and
+ the number of threads the thread-pool can use with the
+ mapred.map.multithreadedrunner.threads property; its default
+ value is 10 threads.
+ R reduces, there are R-1
+ keys in the SequenceFile.
+ @deprecated Use
+ {@link #setPartitionFile(Configuration, Path)}
+ instead]]>
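With R reduces and R-1 sorted split keys, finding the partition for a key is a lookup in the sorted key list. The sketch below illustrates that idea with plain strings and a binary search. It is not the Hadoop partitioner: the names are hypothetical, and sending a key that equals a split key to the higher partition is an assumption of this sketch.

```java
import java.util.Arrays;

// Hypothetical illustration: maps a key to one of R partitions using
// R-1 sorted split keys.
class TotalOrderLookupSketch {
    // sortedSplitPoints must be in ascending order and have length R-1.
    static int partitionFor(String key, String[] sortedSplitPoints) {
        // binarySearch returns the index if found, else -(insertionPoint) - 1.
        int pos = Arrays.binarySearch(sortedSplitPoints, key) + 1;
        // Found at index i: partition i + 1. Not found: the insertion point.
        return (pos < 0) ? -pos : pos;
    }

    public static void main(String[] args) {
        String[] splits = {"g", "n", "t"}; // R = 4 partitions
        System.out.println(partitionFor("a", splits)); // prints 0
        System.out.println(partitionFor("g", splits)); // prints 1
        System.out.println(partitionFor("h", splits)); // prints 1
        System.out.println(partitionFor("z", splits)); // prints 3
    }
}
```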
+ Cluster.
+ @throws IOException]]>
+ ClusterMetrics provides clients with information such as:
+ - Size of the cluster.
+ - Number of blacklisted and decommissioned trackers.
+ - Slot capacity of the cluster.
+ - The number of currently occupied/reserved map and reduce slots.
+ - The number of currently running map and reduce tasks.
+ - The number of job submissions.
+ Clients can query for the latest ClusterMetrics via
+ {@link Cluster#getClusterStatus()}.
+ @see Cluster]]>
+ Counters represent global counters, defined either by the
+ Map-Reduce framework or applications. Each Counter is named by
+ an {@link Enum} and has a long for the value.
+ Counters are bunched into Groups, each comprising
+ counters from a particular Enum
+ the type of counter
+ @param the type of counter group
+ @param counters the old counters object]]>
+ Counters holds per job/task counters, defined either by the
+ Map-Reduce framework or applications. Each Counter can be of
+ any {@link Enum} type.
+ Counters are bunched into {@link CounterGroup}s, each
+ comprising counters from a particular Enum
+ Each {@link InputSplit} is then assigned to an individual {@link Mapper}
+ for processing.
+ Note: The split is a logical split of the inputs and the
+ input files are not physically split into chunks. For example, a split could
+ be an <input-file-path, start, offset> tuple. The InputFormat
+ also creates the {@link RecordReader} to read the {@link InputSplit}.
+ @param context job configuration.
+ @return an array of {@link InputSplit}s for the job.]]>
+ InputFormat describes the input-specification for a
+ Map-Reduce job.
+ The Map-Reduce framework relies on the InputFormat of the
+ job to:
+ - Validate the input-specification of the job.
+ - Split-up the input file(s) into logical {@link InputSplit}s, each of
+ which is then assigned to an individual {@link Mapper}.
+ - Provide the {@link RecordReader} implementation to be used to glean
+ input records from the logical InputSplit for processing by
+ the {@link Mapper}.
+ The default behavior of file-based {@link InputFormat}s, typically
+ sub-classes of {@link FileInputFormat}, is to split the
+ input into logical {@link InputSplit}s based on the total size, in
+ bytes, of the input files. However, the {@link FileSystem} blocksize of
+ the input files is treated as an upper bound for input splits. A lower bound
+ on the split size can be set via
+ mapreduce.input.fileinputformat.split.minsize.
+ Clearly, logical splits based on input-size are insufficient for many
+ applications since record boundaries are to be respected. In such cases, the
+ application has to also implement a {@link RecordReader} on whom lies the
+ responsibility to respect record-boundaries and present a record-oriented
+ view of the logical InputSplit to the individual task.
+ @see InputSplit
+ @see RecordReader
+ @see FileInputFormat]]>
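The size-based splitting described above can be sketched with plain arithmetic. This is a simplified illustration, not the FileInputFormat implementation: the names are hypothetical, the clamping formula is an assumption of this sketch, and each split is represented as a (start, length) pair within a single file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of cutting one file into logical splits.
class LogicalSplitSketch {
    // Assumed rule: the block size is clamped between a lower and an upper bound.
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns {start, length} pairs covering [0, fileLength). No bytes are
    // moved; the pairs only describe ranges of the file.
    static List<long[]> splits(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> result = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            long length = Math.min(splitSize, fileLength - start);
            result.add(new long[] {start, length});
        }
        return result;
    }

    public static void main(String[] args) {
        long size = splitSize(1, Long.MAX_VALUE, 100);
        for (long[] s : splits(250, size)) {
            // Prints 0+100, 100+100 and 200+50 on separate lines.
            System.out.println(s[0] + "+" + s[1]);
        }
    }
}
```

A split boundary computed this way can fall in the middle of a record, which is why the RecordReader is responsible for respecting record boundaries.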
+ SplitLocationInfos describing how the split
+ data is stored at each location. A null value indicates that all the
+ locations have the data stored on disk.
+ @throws IOException]]>
+ InputSplit represents the data to be processed by an
+ individual {@link Mapper}.
+ Typically, it presents a byte-oriented view on the input and is the
+ responsibility of {@link RecordReader} of the job to process this and present
+ a record-oriented view.
+ @see InputFormat
+ @see RecordReader]]>
+ Job makes a copy of the Configuration so
+ that any necessary internal modifications do not reflect on the incoming
+ parameter.
+ A Cluster will be created from the conf parameter only when it's needed.
+ @param conf the configuration
+ @return the {@link Job} , with no connection to a cluster yet.
+ @throws IOException]]>
+ Job makes a copy of the Configuration so
+ that any necessary internal modifications do not reflect on the incoming
+ parameter.
+ @param conf the configuration
+ @return the {@link Job} , with no connection to a cluster yet.
+ @throws IOException]]>
+ Job makes a copy of the Configuration so
+ that any necessary internal modifications do not reflect on the incoming
+ parameter.
+ @param status job status
+ @param conf job configuration
+ @return the {@link Job} , with no connection to a cluster yet.
+ @throws IOException]]>
+ Job makes a copy of the Configuration so
+ that any necessary internal modifications do not reflect on the incoming
+ parameter.
+ @param ignored
+ @return the {@link Job} , with no connection to a cluster yet.
+ @throws IOException
+ @deprecated Use {@link #getInstance()}]]>
+ Job makes a copy of the Configuration so
+ that any necessary internal modifications do not reflect on the incoming
+ parameter.
+ @param ignored
+ @param conf job configuration
+ @return the {@link Job} , with no connection to a cluster yet.
+ @throws IOException
+ @deprecated Use {@link #getInstance(Configuration)}]]>
+ progress of the job's map-tasks, as a float between 0.0
+ and 1.0. When all map tasks have completed, the function returns 1.0.
+ @return the progress of the job's map-tasks.
+ @throws IOException]]>
+ progress of the job's reduce-tasks, as a float between 0.0
+ and 1.0. When all reduce tasks have completed, the function returns 1.0.
+ @return the progress of the job's reduce-tasks.
+ @throws IOException]]>
+ progress of the job's cleanup-tasks, as a float between 0.0
+ and 1.0. When all cleanup tasks have completed, the function returns 1.0.
+ @return the progress of the job's cleanup-tasks.
+ @throws IOException]]>
+ progress of the job's setup-tasks, as a float between 0.0
+ and 1.0. When all setup tasks have completed, the function returns 1.0.
+ @return the progress of the job's setup-tasks.
+ @throws IOException]]>
+ true if the job is complete, else false
+ @throws IOException]]>
+ true if the job succeeded, else false
+ @throws IOException]]>
+ InputFormat to use
+ @throws IllegalStateException if the job is submitted]]>
+ OutputFormat to use
+ @throws IllegalStateException if the job is submitted]]>
+ Mapper to use
+ @throws IllegalStateException if the job is submitted]]>
+ Reducer to use
+ @throws IllegalStateException if the job is submitted]]>
+ Partitioner to use
+ @throws IllegalStateException if the job is submitted]]>
+ true if speculative execution
+ should be turned on, else false
+ true if speculative execution
+ should be turned on for map tasks,
+ else false
+ true if speculative execution
+ should be turned on for reduce tasks,
+ else false
+ true, job-setup and job-cleanup will be
+ considered from {@link OutputCommitter}
+ else ignored.]]>
+ JobTracker is lost]]>
+ It allows the user to configure the
+ job, submit it, control its execution, and query the state. The set methods
+ only work until the job is submitted, afterwards they will throw an
+ IllegalStateException.
+ Normally the user creates the application, describes various facets of the
+ job via {@link Job} and then submits the job and monitors its progress.
+ Here is an example of how to submit a job:
+ // Create a new Job
+ Job job = Job.getInstance();
+ job.setJarByClass(MyJob.class);
+ // Specify various job-specific parameters
+ job.setJobName("myjob");
+ FileInputFormat.addInputPath(job, new Path("in"));
+ FileOutputFormat.setOutputPath(job, new Path("out"));
+ job.setMapperClass(MyJob.MyMapper.class);
+ job.setReducerClass(MyJob.MyReducer.class);
+ // Submit the job, then poll for progress until the job is complete
+ job.waitForCompletion(true);
+ 1.
+ @return the number of reduce tasks for this job.]]>
+ mapred.map.max.attempts
+ property. If this property is not already set, the default is 4 attempts.
+ @return the max number of attempts per map task.]]>
+ mapred.reduce.max.attempts
+ property. If this property is not already set, the default is 4 attempts.
+ @return the max number of attempts per reduce task.]]>
+ An example JobID is:
+ job_200707121733_0003, which represents the third job
+ running at the jobtracker started at 200707121733.
+ Applications should never construct or parse JobID strings, but rather
+ use appropriate constructors or the {@link #forName(String)} method.
+ @see TaskID
+ @see TaskAttemptID]]>
+ the key input type to the Mapper
+ @param the value input type to the Mapper
+ @param the key output type from the Mapper
+ @param the value output type from the Mapper]]>
+ Maps are the individual tasks which transform input records into
+ intermediate records. The transformed intermediate records need not be of
+ the same type as the input records. A given input pair may map to zero or
+ many output pairs.
+ The Hadoop Map-Reduce framework spawns one map task for each
+ {@link InputSplit} generated by the {@link InputFormat} for the job.
+ Mapper implementations can access the {@link Configuration} for
+ the job via the {@link JobContext#getConfiguration()}.
+ The framework first calls
+ {@link #setup(org.apache.hadoop.mapreduce.Mapper.Context)}, followed by
+ {@link #map(Object, Object, org.apache.hadoop.mapreduce.Mapper.Context)}
+ for each key/value pair in the InputSplit. Finally
+ {@link #cleanup(org.apache.hadoop.mapreduce.Mapper.Context)} is called.
+ All intermediate values associated with a given output key are
+ subsequently grouped by the framework, and passed to a {@link Reducer} to
+ determine the final output. Users can control the sorting and grouping by
+ specifying two key {@link RawComparator} classes.
+ The Mapper outputs are partitioned per
+ Reducer. Users can control which keys (and hence records) go to
+ which Reducer by implementing a custom {@link Partitioner}.
+ Users can optionally specify a combiner, via
+ {@link Job#setCombinerClass(Class)}, to perform local aggregation of the
+ intermediate outputs, which helps to cut down the amount of data transferred
+ from the Mapper to the Reducer.
+ Applications can specify if and how the intermediate
+ outputs are to be compressed and which {@link CompressionCodec}s are to be
+ used via the Configuration.
+ If the job has zero
+ reduces then the output of the Mapper is directly written
+ to the {@link OutputFormat} without sorting by keys.
+ Example:
+ public class TokenCounterMapper
+ extends Mapper<Object, Text, Text, IntWritable>{
+ private final static IntWritable one = new IntWritable(1);
+ private Text word = new Text();
+ public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
+ StringTokenizer itr = new StringTokenizer(value.toString());
+ while (itr.hasMoreTokens()) {
+ word.set(itr.nextToken());
+ context.write(word, one);
+ }
+ }
+ }
+ Applications may override the
+ {@link #run(org.apache.hadoop.mapreduce.Mapper.Context)} method to exert
+ greater control on map processing, e.g. multi-threaded Mappers.
+ @see InputFormat
+ @see JobContext
+ @see Partitioner
+ @see Reducer]]>
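The call order described above (setup, then map for each key/value pair, then cleanup) can be traced with a stand-in class. This sketch does not use the Hadoop Mapper: ToyMapper and its methods are hypothetical, records are plain strings, and running cleanup in a finally block is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the setup / map / cleanup call order.
class MapperLifecycleSketch {
    // Stand-in for a mapper; it records every lifecycle call it receives.
    static class ToyMapper {
        final List<String> calls = new ArrayList<>();

        void setup() { calls.add("setup"); }
        void map(String record) { calls.add("map:" + record); }
        void cleanup() { calls.add("cleanup"); }

        // Drives the lifecycle over one split, here modelled as a list.
        void run(List<String> split) {
            setup();
            try {
                for (String record : split) {
                    map(record);
                }
            } finally {
                cleanup();
            }
        }
    }

    // Runs a fresh ToyMapper over the split and returns the recorded calls.
    static List<String> trace(List<String> split) {
        ToyMapper mapper = new ToyMapper();
        mapper.run(split);
        return mapper.calls;
    }

    public static void main(String[] args) {
        // Prints [setup, map:a, map:b, cleanup]
        System.out.println(trace(Arrays.asList("a", "b")));
    }
}
```

Overriding a run-style method, as the text suggests, is what lets an application replace this simple loop, for example with one that hands records to several threads.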
+ MarkableIterator is a wrapper iterator class that
+ implements the {@link MarkableIteratorInterface}.]]>
+ true if task output recovery is supported,
+ false
+ @see #recoverTask(TaskAttemptContext)
+ @deprecated Use {@link #isRecoverySupported(JobContext)} instead.]]>
+ true if repeatable job commit is supported,
+ false
+ @throws IOException]]>
+ true if task output recovery is supported,
+ false
+ @throws IOException
+ @see #recoverTask(TaskAttemptContext)]]>
+ OutputCommitter. This is called from the application master
+ process, but it is called individually for each task.
+ If an exception is thrown the task will be attempted again.
+ This may be called multiple times for the same task, but from different
+ application attempts.
+ @param taskContext Context of the task whose output is being recovered
+ @throws IOException]]>
+ OutputCommitter describes the commit of task output for a
+ Map-Reduce job.
+ The Map-Reduce framework relies on the OutputCommitter of
+ the job to:
+ - Setup the job during initialization. For example, create the temporary
+ output directory for the job during the initialization of the job.
+ - Cleanup the job after the job completion. For example, remove the
+ temporary output directory after the job completion.
+ - Setup the task temporary output.
+ - Check whether a task needs a commit. This is to avoid the commit
+ procedure if a task does not need commit.
+ - Commit of the task output.
+ - Discard the task commit.
+ The methods in this class can be called from several different processes and
+ from several different contexts. It is important to know which process and
+ which context each is called from. Each method should be marked accordingly
+ in its documentation. It is also important to note that not all methods are
+ guaranteed to be called once and only once. If a method is not guaranteed to
+ have this property the output committer needs to handle this appropriately.
+ Also note it will only be in rare situations where they may be called
+ multiple times for the same task.
+ @see org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
+ @see JobContext
+ @see TaskAttemptContext]]>
+ This is to validate the output specification for the job when the
+ job is submitted. Typically checks that it does not already exist,
+ throwing an exception when it already exists, so that output is not
+ overwritten.
+ @param context information about the job
+ @throws IOException when output should not be attempted]]>
+ OutputFormat describes the output-specification for a
+ Map-Reduce job.
+ The Map-Reduce framework relies on the OutputFormat of the
+ job to:
+ -
+ Validate the output-specification of the job. For example, check that the
+ output directory doesn't already exist.
+ Provide the {@link RecordWriter} implementation to be used to write out
+ the output files of the job. Output files are stored in a
+ {@link FileSystem}.
+ @see RecordWriter]]>
+ Typically a hash function on all or a subset of the key.
+ @param key the key to be partitioned.
+ @param value the entry value.
+ @param numPartitions the total number of partitions.
+ @return the partition number for the key
+ Partitioner controls the partitioning of the keys of the
+ intermediate map-outputs. The key (or a subset of the key) is used to derive
+ the partition, typically by a hash function. The total number of partitions
+ is the same as the number of reduce tasks for the job. Hence this controls
+ which of the m reduce tasks the intermediate key (and hence the
+ record) is sent for reduction.
+ Note: If you require your Partitioner class to obtain the Job's configuration
+ object, implement the {@link Configurable} interface.
+ @see Reducer]]>
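A minimal sketch of the hash-based partitioning described above, in plain Java with no Hadoop dependency. The class name `HashPartitionSketch` is hypothetical, and masking the sign bit before taking the modulus is an assumption of this sketch about how a negative hash code is kept in range.

```java
/** Hypothetical stand-in for a hash-based Partitioner; not a Hadoop class. */
final class HashPartitionSketch {
    /** Maps a key to a partition index in the range [0, numPartitions). */
    static int getPartition(Object key, int numPartitions) {
        // Clear the sign bit so a negative hashCode() cannot yield a negative index.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Because the result depends only on the key, every record with the same key lands in the same partition and therefore reaches the same reduce task.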
+ "N/A"
+ @return Scheduling information associated to particular Job Queue]]>
+ @param ]]>
+ RecordWriter to future operations.
+ @param context the context of the task
+ @throws IOException]]>
+ RecordWriter writes the output <key, value> pairs
+ to an output file.
+ RecordWriter implementations write the job outputs to the
+ {@link FileSystem}.
+ @see OutputFormat]]>
+ the class of the input keys
+ @param the class of the input values
+ @param the class of the output keys
+ @param the class of the output values]]>
+ Reducer
+ can access the {@link Configuration} for the job via the
+ {@link JobContext#getConfiguration()} method.
+ Reducer has 3 primary phases:
+ -
+ Shuffle
+ The Reducer copies the sorted output from each
+ {@link Mapper} using HTTP across the network.
+ -
+ Sort
+ The framework merge sorts Reducer inputs by key
+ (since different Mappers may have output the same key).
+ The shuffle and sort phases occur simultaneously i.e. while outputs are
+ being fetched they are merged.
+ SecondarySort
+ To achieve a secondary sort on the values returned by the value
+ iterator, the application should extend the key with the secondary
+ key and define a grouping comparator. The keys will be sorted using the
+ entire key, but will be grouped using the grouping comparator to decide
+ which keys and values are sent in the same call to reduce. The grouping
+ comparator is specified via
+ {@link Job#setGroupingComparatorClass(Class)}. The sort order is
+ controlled by
+ {@link Job#setSortComparatorClass(Class)}.
+ For example, say that you want to find duplicate web pages and tag them
+ all with the url of the "best" known example. You would set up the job
+ like:
+ - Map Input Key: url
+ - Map Input Value: document
+ - Map Output Key: document checksum, url pagerank
+ - Map Output Value: url
+ - Partitioner: by checksum
+ - OutputKeyComparator: by checksum and then decreasing pagerank
+ - OutputValueGroupingComparator: by checksum
+ -
+ Reduce
+ In this phase the
+ {@link #reduce(Object, Iterable, org.apache.hadoop.mapreduce.Reducer.Context)}
+ method is called for each <key, (collection of values)>
+ in the sorted inputs.
+ The output of the reduce task is typically written to a
+ {@link RecordWriter} via
+ {@link Context#write(Object, Object)}.
+ The output of the Reducer is not re-sorted.
+ Example:
+ public class IntSumReducer<Key> extends Reducer<Key,IntWritable,
+ Key,IntWritable> {
+ private IntWritable result = new IntWritable();
+ public void reduce(Key key, Iterable<IntWritable> values,
+ Context context) throws IOException, InterruptedException {
+ int sum = 0;
+ for (IntWritable val : values) {
+ sum += val.get();
+ }
+ result.set(sum);
+ context.write(key, result);
+ }
+ }
+ @see Mapper
+ @see Partitioner]]>
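The secondary-sort setup above (sort by checksum and then decreasing pagerank, group by checksum) can be sketched in plain Java without Hadoop. Everything here is hypothetical illustration: `PageKey`, `SORT`, `GROUP` and `bestUrlPerChecksum` are names invented for the sketch, and an in-memory list stands in for the framework's sorted reducer input.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of secondary sort; not the Hadoop API. */
final class SecondarySortSketch {
    /** Composite key: the natural key (checksum) extended with a secondary key (pagerank). */
    static final class PageKey {
        final String checksum;
        final double pagerank;
        final String url;

        PageKey(String checksum, double pagerank, String url) {
            this.checksum = checksum;
            this.pagerank = pagerank;
            this.url = url;
        }
    }

    /** Sort order: by checksum, then by decreasing pagerank. */
    static final Comparator<PageKey> SORT = (a, b) -> {
        int c = a.checksum.compareTo(b.checksum);
        return c != 0 ? c : Double.compare(b.pagerank, a.pagerank);
    };

    /** Grouping: keys with the same checksum belong to the same reduce call. */
    static final Comparator<PageKey> GROUP = (a, b) -> a.checksum.compareTo(b.checksum);

    /** Returns, per checksum, the url of the highest-ranked page (the first key of each group). */
    static Map<String, String> bestUrlPerChecksum(List<PageKey> keys) {
        List<PageKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, String> best = new LinkedHashMap<>();
        PageKey previous = null;
        for (PageKey k : sorted) {
            // A new group starts whenever the grouping comparator reports a difference.
            if (previous == null || GROUP.compare(previous, k) != 0) {
                best.put(k.checksum, k.url);
            }
            previous = k;
        }
        return best;
    }
}
```

The design point is that the sort comparator sees the whole composite key while the grouping comparator sees only the checksum, so the first value in each group is already the "best" one.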
+ counterName.
+ @param counterName counter name
+ @return the Counter for the given counterName
+ groupName and
+ counterName
+ @param counterName counter name
+ @return the Counter for the given groupName
+ counterName
+ An example TaskAttemptID is:
+ attempt_200707121733_0003_m_000005_0, which represents the
+ zeroth task attempt for the fifth map task in the third job
+ running at the jobtracker started at 200707121733.
+ Applications should never construct or parse TaskAttemptID strings,
+ but rather use appropriate constructors or {@link #forName(String)}
+ method.
+ @see JobID
+ @see TaskID]]>
+ An example TaskID is:
+ task_200707121733_0003_m_000005, which represents the
+ fifth map task in the third job running at the jobtracker
+ started at 200707121733.
+ Applications should never construct or parse TaskID strings,
+ but rather use appropriate constructors or {@link #forName(String)}
+ method.
+ @see JobID
+ @see TaskAttemptID]]>
+ OutputCommitter for the task-attempt]]>
+ the input key type for the task
+ @param the input value type for the task
+ @param the output key type for the task
+ @param the output value type for the task]]>
+ type of the other counter
+ @param type of the other counter group
+ @param counters the counters object to copy
+ @param groupFactory the factory for new groups]]>
+ type of counter inside the counters
+ @param type of group inside the counters]]>
+ type of the counter for the group]]>
+ The key and values are passed from one element of the chain to the next, by
+ value. For the added Mapper the configuration given for it,
+ mapperConf, has precedence over the job's Configuration. This
+ precedence is in effect when the task is running.
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper; this is done by the addMapper for the last mapper in the chain.
+ @param job
+ The job.
+ @param klass
+ the Mapper class to add.
+ @param inputKeyClass
+ mapper input key class.
+ @param inputValueClass
+ mapper input value class.
+ @param outputKeyClass
+ mapper output key class.
+ @param outputValueClass
+ mapper output value class.
+ @param mapperConf
+ a configuration for the Mapper class. It is recommended to use a
+ Configuration without default values using the
+ Configuration(boolean loadDefaults) constructor with
+ FALSE.]]>
+ The Mapper classes are invoked in a chained (or piped) fashion, the output of
+ the first becomes the input of the second, and so on until the last Mapper,
+ the output of the last Mapper will be written to the task's output.
+ The key functionality of this feature is that the Mappers in the chain do not
+ need to be aware that they are executed in a chain. This enables having
+ reusable specialized Mappers that can be combined to perform composite
+ operations within a single task.
+ Special care has to be taken when creating chains that the key/values output
+ by a Mapper are valid for the following Mapper in the chain. It is assumed
+ all Mappers and the Reduce in the chain use matching output and input key and
+ value classes as no conversion is done by the chaining code.
+ Using the ChainMapper and the ChainReducer classes it is possible to compose
+ Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. An
+ immediate benefit of this pattern is a dramatic reduction in disk IO.
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper, this is done by the addMapper for the last mapper in the chain.
+ ChainMapper usage pattern:
+ ...
+ Job job = new Job(conf);
+ Configuration mapAConf = new Configuration(false);
+ ...
+ ChainMapper.addMapper(job, AMap.class, LongWritable.class, Text.class,
+ Text.class, Text.class, mapAConf);
+ Configuration mapBConf = new Configuration(false);
+ ...
+ ChainMapper.addMapper(job, BMap.class, Text.class, Text.class,
+ LongWritable.class, Text.class, mapBConf);
+ ...
+ job.waitForCompletion(true);
+ ...
+ The key and values are passed from one element of the chain to the next, by
+ value. For the added Reducer the configuration given for it,
+ reducerConf, has precedence over the job's Configuration.
+ This precedence is in effect when the task is running.
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainReducer, this is done by the setReducer or the addMapper for the last
+ element in the chain.
+ @param job
+ the job
+ @param klass
+ the Reducer class to add.
+ @param inputKeyClass
+ reducer input key class.
+ @param inputValueClass
+ reducer input value class.
+ @param outputKeyClass
+ reducer output key class.
+ @param outputValueClass
+ reducer output value class.
+ @param reducerConf
+ a configuration for the Reducer class. It is recommended to use a
+ Configuration without default values using the
+ Configuration(boolean loadDefaults) constructor with
+ FALSE.]]>
+ The key and values are passed from one element of the chain to the next, by
+ value. For the added Mapper the configuration given for it,
+ mapperConf, has precedence over the job's Configuration. This
+ precedence is in effect when the task is running.
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainMapper, this is done by the addMapper for the last mapper in the
+ chain.
+ @param job
+ The job.
+ @param klass
+ the Mapper class to add.
+ @param inputKeyClass
+ mapper input key class.
+ @param inputValueClass
+ mapper input value class.
+ @param outputKeyClass
+ mapper output key class.
+ @param outputValueClass
+ mapper output value class.
+ @param mapperConf
+ a configuration for the Mapper class. It is recommended to use a
+ Configuration without default values using the
+ Configuration(boolean loadDefaults) constructor with
+ FALSE.]]>
+ For each record output by the Reducer, the Mapper classes are invoked in a
+ chained (or piped) fashion. The output of the reducer becomes the input of
+ the first mapper and output of first becomes the input of the second, and so
+ on until the last Mapper, the output of the last Mapper will be written to
+ the task's output.
+ The key functionality of this feature is that the Mappers in the chain do not
+ need to be aware that they are executed after the Reducer or in a chain. This
+ enables having reusable specialized Mappers that can be combined to perform
+ composite operations within a single task.
+ Special care has to be taken when creating chains that the key/values output
+ by a Mapper are valid for the following Mapper in the chain. It is assumed
+ all Mappers and the Reduce in the chain use matching output and input key and
+ value classes as no conversion is done by the chaining code.
+ Using the ChainMapper and the ChainReducer classes it is possible to
+ compose Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. An
+ immediate benefit of this pattern is a dramatic reduction in disk IO.
+ IMPORTANT: There is no need to specify the output key/value classes for the
+ ChainReducer, this is done by the setReducer or the addMapper for the last
+ element in the chain.
+ ChainReducer usage pattern:
+ ...
+ Job job = new Job(conf);
+ ...
+ Configuration reduceConf = new Configuration(false);
+ ...
+ ChainReducer.setReducer(job, XReduce.class, LongWritable.class, Text.class,
+ Text.class, Text.class, reduceConf);
+ ChainReducer.addMapper(job, CMap.class, Text.class, Text.class,
+ LongWritable.class, Text.class, null);
+ ChainReducer.addMapper(job, DMap.class, LongWritable.class, Text.class,
+ LongWritable.class, LongWritable.class, null);
+ ...
+ job.waitForCompletion(true);
+ ...
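The piping idea behind ChainMapper and ChainReducer can be illustrated with a small plain-Java sketch. This is not the Hadoop API: `ChainSketch` and `runChain` are hypothetical names, records are simplified to strings, and each stage is modelled as a function that may emit zero or more records. The sketch also processes one whole stage at a time, which is a simplification of the record-by-record hand-off described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical model of a chain of map stages; not the Hadoop API. */
final class ChainSketch {
    /**
     * Pipes the input through the stages in order: the output of one stage
     * becomes the input of the next, and the last stage's output is returned.
     */
    static List<String> runChain(List<String> input,
                                 List<Function<String, List<String>>> stages) {
        List<String> current = input;
        for (Function<String, List<String>> stage : stages) {
            List<String> next = new ArrayList<>();
            for (String record : current) {
                // A stage may emit any number of records per input record.
                next.addAll(stage.apply(record));
            }
            current = next;
        }
        return current;
    }
}
```

No stage needs to know it runs inside a chain, which is the reuse property the documentation emphasises; the only contract is that each stage's output type matches the next stage's input type.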
+ DBInputFormat emits LongWritables containing the record number as
+ key and DBWritables as value.
+ The SQL query and input class can be set using one of the two
+ setInput methods.]]>
+ {@link DBOutputFormat} accepts <key,value> pairs, where
+ key has a type extending DBWritable. Returned {@link RecordWriter}
+ writes only the key to the database with a batch SQL query.]]>
+ DBWritable. DBWritable is similar to {@link Writable}
+ except that the {@link #write(PreparedStatement)} method takes a
+ {@link PreparedStatement}, and {@link #readFields(ResultSet)}
+ takes a {@link ResultSet}.
+ Implementations are responsible for writing the fields of the object
+ to PreparedStatement, and reading the fields of the object from the
+ ResultSet.
+ If we have the following table in the database :
+ timestamp BIGINT NOT NULL,
+ );
+ then we can read/write the tuples from/to the table with :
+ public class MyWritable implements Writable, DBWritable {
+ // Some data
+ private int counter;
+ private long timestamp;
+ //Writable#write() implementation
+ public void write(DataOutput out) throws IOException {
+ out.writeInt(counter);
+ out.writeLong(timestamp);
+ }
+ //Writable#readFields() implementation
+ public void readFields(DataInput in) throws IOException {
+ counter = in.readInt();
+ timestamp = in.readLong();
+ }
+ public void write(PreparedStatement statement) throws SQLException {
+ statement.setInt(1, counter);
+ statement.setLong(2, timestamp);
+ }
+ public void readFields(ResultSet resultSet) throws SQLException {
+ counter = resultSet.getInt(1);
+ timestamp = resultSet.getLong(2);
+ }
+ }
+ RecordReader's for
+ CombineFileSplit
+ @see CombineFileSplit]]>
+ CombineFileRecordReader.
+ Subclassing is needed to get a concrete record reader wrapper because of the
+ constructor requirement.
+ @see CombineFileRecordReader
+ @see CombineFileInputFormat]]>
+ th Path]]>
+ th Path]]>
+ th Path]]>
+ CombineFileSplit can be used to implement {@link RecordReader}'s,
+ with reading one record per file.
+ @see FileSplit
+ @see CombineFileInputFormat]]>
+ CombineFileInputFormat-equivalent for
+ SequenceFileInputFormat
+ @see CombineFileInputFormat]]>
+ CombineFileInputFormat-equivalent for
+ TextInputFormat
+ @see CombineFileInputFormat]]>
+ FileInputFormat always returns
+ true. Implementations that may deal with non-splittable files must
+ override this method.
+ FileInputFormat implementations can override this and return
+ false to ensure that individual input files are never split-up
+ so that {@link Mapper}s process entire files.
+ @param context the job context
+ @param filename the file name to check
+ @return is this file splitable?]]>
+ FileInputFormat is the base class for all file-based
+ InputFormats. This provides a generic implementation of
+ {@link #getSplits(JobContext)}.
+ Implementations of FileInputFormat can also override the
+ {@link #isSplitable(JobContext, Path)} method to prevent input files
+ from being split-up in certain situations. Implementations that may
+ deal with non-splittable files must override this method, since
+ the default implementation assumes splitting is always possible.]]>
+ conf.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, recordLength);
+ @see FixedLengthRecordReader]]>
+ true if the Job was added.]]>
+ ([,]*)
+ func ::= tbl(,"")
+ class ::= @see java.lang.Class#forName(java.lang.String)
+ path ::= @see org.apache.hadoop.fs.Path#Path(java.lang.String)
+ }
+ Reads expression from the mapreduce.join.expr property and
+ user-supplied join types from mapreduce.join.define.<ident>
+ types. Paths supplied to tbl are given as input paths to the
+ InputFormat class listed.
+ @see #compose(java.lang.String, java.lang.Class, java.lang.String...)]]>
+ , ) }]]>
+ (tbl(,),tbl(,),...,tbl(,)) }]]>
+ (tbl(,),tbl(,),...,tbl(,)) }]]>
+ mapreduce.join.define.<ident> to a classname.
+ In the expression mapreduce.join.expr, the identifier will be
+ assumed to be a ComposableRecordReader.
+ mapreduce.join.keycomparator can be a classname used to compare
+ keys in the join.
+ @see #setFormat
+ @see JoinRecordReader
+ @see MultiFilterRecordReader]]>
+ ......
+ }]]>
+ capacity children to position
+ id in the parent reader.
+ The id of a root CompositeRecordReader is -1 by convention, but relying
+ on this is not recommended.]]>
+ override(S1,S2,S3) will prefer values
+ from S3 over S2, and values from S2 over S1 for all keys
+ emitted from all sources.]]>
+ [<child1>,<child2>,...,<childn>]]]>
+ out.
+ TupleWritable format:
+ {@code
+ ......
+ }]]>
+ the map's input key type
+ @param the map's input value type
+ @param the map's output key type
+ @param the map's output value type
+ @param job the job
+ @return the mapper class to run]]>
+ the map input key type
+ @param the map input value type
+ @param the map output key type
+ @param the map output value type
+ @param job the job to modify
+ @param cls the class to use as the mapper]]>
+ It can be used instead of the default implementation,
+ {@link org.apache.hadoop.mapred.MapRunner}, when the Map operation is not CPU
+ bound in order to improve throughput.
+ Mapper implementations using this MapRunnable must be thread-safe.
+ The Map-Reduce job has to be configured with the mapper to use via
+ {@link #setMapperClass(Job, Class)} and
+ the number of threads the thread-pool can use with the
+ {@link #getNumberOfThreads(JobContext)} method. The default
+ value is 10 threads.
+ MapContext to be wrapped
+ @return a wrapped Mapper.Context for custom implementations]]>
+ true if the job output should be compressed,
+ false
+ Tasks' Side-Effect Files
+ Some applications need to create/write-to side-files, which differ from
+ the actual job-outputs.
+ In such cases there could be issues with 2 instances of the same TIP
+ (running simultaneously e.g. speculative tasks) trying to open/write-to the
+ same file (path) on HDFS. Hence the application-writer will have to pick
+ unique names per task-attempt (e.g. using the attemptid, say
+ attempt_200709221812_0001_m_000000_0), not just per TIP.
+ To get around this the Map-Reduce framework helps the application-writer
+ out by maintaining a special
+ ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid}
+ sub-directory for each task-attempt on HDFS where the output of the
+ task-attempt goes. On successful completion of the task-attempt the files
+ in the ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid} (only)
+ are promoted to ${mapreduce.output.fileoutputformat.outputdir}. Of course, the
+ framework discards the sub-directory of unsuccessful task-attempts. This
+ is completely transparent to the application.
+ The application-writer can take advantage of this by creating any
+ side-files required in a work directory during execution
+ of the task i.e. via
+ {@link #getWorkOutputPath(TaskInputOutputContext)}, and
+ the framework will move them out similarly - thus the application-writer
+ doesn't have to pick unique paths per task-attempt.
+ The entire discussion holds true for maps of jobs with
+ reducer=NONE (i.e. 0 reduces) since output of the map, in that case,
+ goes directly to HDFS.
+ @return the {@link Path} to the task's temporary output directory
+ for the map-reduce job.]]>
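The promote-on-success behaviour described above can be modelled with a small in-memory sketch. It is a hypothetical illustration only: `AttemptOutputSketch` is not a Hadoop class, maps stand in for the per-attempt `_temporary` sub-directories and the job output directory, and no file system is touched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory model of per-attempt output directories; not the Hadoop API. */
final class AttemptOutputSketch {
    // Stand-in for the per-attempt temporary sub-directories: attemptId -> (fileName -> data).
    private final Map<String, Map<String, String>> pending = new HashMap<>();
    // Stand-in for the job output directory: fileName -> data.
    private final Map<String, String> committed = new TreeMap<>();

    /** Writes a side-file into the attempt's private work area. */
    void write(String attemptId, String fileName, String data) {
        pending.computeIfAbsent(attemptId, id -> new HashMap<>()).put(fileName, data);
    }

    /** On success, promotes the attempt's files to the job output. */
    void commit(String attemptId) {
        Map<String, String> files = pending.remove(attemptId);
        if (files != null) {
            committed.putAll(files);
        }
    }

    /** On failure, discards the attempt's work area. */
    void abort(String attemptId) {
        pending.remove(attemptId);
    }

    Map<String, String> committedFiles() {
        return committed;
    }
}
```

Two speculative attempts of the same task can use the same file name without colliding, because each writes into its own work area and only the committed attempt's files become visible.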
+ The path can be used to create custom files from within the map and
+ reduce tasks. The path name will be unique for each task. The path parent
+ will be the job output directory.
+ This method uses the {@link #getUniqueFile} method to make the file name
+ unique for the task.
+ @param context the context for the task.
+ @param name the name for the file.
+ @param extension the extension for the file
+ @return a unique path across all tasks of the job.]]>
+ Warning: when the baseOutputPath is a path that resolves
+ outside of the final job output directory, the directory is created
+ immediately and then persists through subsequent task retries, breaking
+ the concept of output committing.]]>
+ Warning: when the baseOutputPath is a path that resolves
+ outside of the final job output directory, the directory is created
+ immediately and then persists through subsequent task retries, breaking
+ the concept of output committing.]]>
+ super.close() at the
+ end of their close()
+ Case one: writing to additional outputs other than the job default output.
+ Each additional output, or named output, may be configured with its own
+ OutputFormat, with its own key class and with its own value
+ class.
+ Case two: to write data to different files provided by user
+ MultipleOutputs supports counters, by default they are disabled. The
+ counters group is the {@link MultipleOutputs} class name. The names of the
+ counters are the same as the output name. These count the number of records
+ written to each output name.
+ Usage pattern for job submission:
+ Job job = new Job();
+ FileInputFormat.setInputPaths(job, inDir);
+ FileOutputFormat.setOutputPath(job, outDir);
+ job.setMapperClass(MOMap.class);
+ job.setReducerClass(MOReduce.class);
+ ...
+ // Defines additional single text based output 'text' for the job
+ MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
+ LongWritable.class, Text.class);
+ // Defines additional sequence-file based output 'seq' for the job
+ MultipleOutputs.addNamedOutput(job, "seq",
+ SequenceFileOutputFormat.class,
+ LongWritable.class, Text.class);
+ ...
+ job.waitForCompletion(true);
+ ...
+ Usage in Reducer:
+ <K, V> String generateFileName(K k, V v) {
+ return k.toString() + "_" + v.toString();
+ }
+ public class MOReduce extends
+ Reducer<WritableComparable, Writable,WritableComparable, Writable> {
+ private MultipleOutputs mos;
+ public void setup(Context context) {
+ ...
+ mos = new MultipleOutputs(context);
+ }
+ public void reduce(WritableComparable key, Iterable<Writable> values,
+ Context context)
+ throws IOException, InterruptedException {
+ ...
+ mos.write("text", key, new Text("Hello"));
+ mos.write("seq", new LongWritable(1), new Text("Bye"), "seq_a");
+ mos.write("seq", new LongWritable(2), new Text("Chau"), "seq_b");
+ mos.write(key, new Text("value"), generateFileName(key, new Text("value")));
+ ...
+ }
+ public void cleanup(Context context) throws IOException, InterruptedException {
+ mos.close();
+ ...
+ }
+ }
+ When used in conjunction with org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat,
+ MultipleOutputs can mimic the behaviour of MultipleTextOutputFormat and MultipleSequenceFileOutputFormat
+ from the old Hadoop API - i.e., output can be written from the Reducer to more than one location.
+ Use MultipleOutputs.write(KEYOUT key, VALUEOUT value, String baseOutputPath) to write key and
+ value to a path specified by baseOutputPath, with no need to specify a named output.
+ Warning: when the baseOutputPath passed to MultipleOutputs.write
+ is a path that resolves outside of the final job output directory, the
+ directory is created immediately and then persists through subsequent
+ task retries, breaking the concept of output committing:
+ private MultipleOutputs<Text, Text> out;
+ public void setup(Context context) {
+ out = new MultipleOutputs<Text, Text>(context);
+ ...
+ }
+ public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
+ for (Text t : values) {
+ out.write(key, t, generateFileName(<parameter list...>));
+ }
+ }
+ protected void cleanup(Context context) throws IOException, InterruptedException {
+ out.close();
+ }
+ Use your own code in generateFileName() to create a custom path to your results.
+ '/' characters in baseOutputPath will be translated into directory levels in your file system.
+ Also, append your custom-generated path with "part" or similar, otherwise your output will be -00000, -00001 etc.
+ No call to context.write() is necessary. See example generateFileName() code below.
+ private String generateFileName(Text k) {
+ // expect Text k in format "Surname|Forename"
+ String[] kStr = k.toString().split("\\|");
+ String sName = kStr[0];
+ String fName = kStr[1];
+ // example for k = Smith|John
+ // output written to /user/hadoop/path/to/output/Smith/John-r-00000 (etc)
+ return sName + "/" + fName;
+ }
+ Using MultipleOutputs in this way will still create zero-sized default output, e.g. part-00000.
+ To prevent this use LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
+ instead of job.setOutputFormatClass(TextOutputFormat.class); in your Hadoop job configuration.
+ This allows the user to specify the key class to be different
+ from the actual class ({@link BytesWritable}) used for writing
+ @param job the {@link Job} to modify
+ @param theClass the SequenceFile output key class.]]>
+ This allows the user to specify the value class to be different
+ from the actual class ({@link BytesWritable}) used for writing
+ @param job the {@link Job} to modify
+ @param theClass the SequenceFile output value class.]]>
+ bytes[left:(right+1)] in Python syntax.
+ @param conf configuration object
+ @param left left Python-style offset
+ @param right right Python-style offset]]>
+ bytes[offset:] in Python syntax.
+ @param conf configuration object
+ @param offset left Python-style offset]]>
+ bytes[:(offset+1)] in Python syntax.
+ @param conf configuration object
+ @param offset right Python-style offset]]>
+ Partition {@link BinaryComparable} keys using a configurable part of
+ the bytes array returned by {@link BinaryComparable#getBytes()}.
+ The subarray to be used for the partitioning can be defined by means
+ of the following properties:
+ -
+ mapreduce.partition.binarypartitioner.left.offset:
+ left offset in array (0 by default)
+ -
+ mapreduce.partition.binarypartitioner.right.offset:
+ right offset in array (-1 by default)
+ Like in Python, both negative and positive offsets are allowed, but
+ the meaning is slightly different. In case of an array of length 5,
+ for instance, the possible offsets are:
+ +---+---+---+---+---+
+ | B | B | B | B | B |
+ +---+---+---+---+---+
+ 0 1 2 3 4
+ -5 -4 -3 -2 -1
+ The first row of numbers gives the position of the offsets 0...4 in
+ the array; the second row gives the corresponding negative offsets.
+ Contrary to Python, the specified subarray has byte i
+ and j as first and last element, respectively, when
+ i and j are the left and right offset.
+ For Hadoop programs written in Java, it is advisable to use one of
+ the following static convenience methods for setting the offsets:
+ - {@link #setOffsets}
+ - {@link #setLeftOffset}
+ - {@link #setRightOffset}
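The offset rules above (negative offsets count from the end, and both the left and the right offset are inclusive) can be sketched in plain Java. `BinaryOffsetSketch` and `slice` are hypothetical names, and normalising an offset with `(offset + length) % length` is an assumption of this sketch that holds for offsets between `-length` and `length - 1`.

```java
import java.util.Arrays;

/** Hypothetical illustration of Python-style inclusive offsets; not the Hadoop API. */
final class BinaryOffsetSketch {
    /** Returns the bytes from offset left to offset right, both inclusive. */
    static byte[] slice(byte[] bytes, int left, int right) {
        int length = bytes.length;
        // Map a negative offset such as -1 onto its non-negative position.
        int from = (left + length) % length;
        int to = (right + length) % length;
        // copyOfRange excludes its upper bound, so add 1 to keep byte 'to'.
        return Arrays.copyOfRange(bytes, from, to + 1);
    }
}
```

With the defaults listed above (left offset 0, right offset -1) the slice covers the whole array, which matches the idea of partitioning on all of the key's bytes unless told otherwise.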
+ total.order.partitioner.natural.order is not false, a trie
+ of the first total.order.partitioner.max.trie.depth(2) + 1 bytes
+ will be built. Otherwise, keys will be located using a binary search of
+ the partition keyset using the {@link org.apache.hadoop.io.RawComparator}
+ defined for this job. The input file must be sorted with the same
+ comparator and contain {@link Job#getNumReduceTasks()} - 1 keys.]]>
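The binary-search lookup mentioned above can be sketched as follows. This is a plain-Java illustration, not the partitioner's actual code: `SplitPointSearchSketch` and `findPartition` are hypothetical names, and sending a key that equals a split point to the partition on its right is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical lookup of a partition from R-1 sorted split points; not the Hadoop API. */
final class SplitPointSearchSketch {
    /** Returns a partition index in [0, splitPoints.length] for the given key. */
    static <K> int findPartition(K[] splitPoints, K key, Comparator<? super K> comparator) {
        // binarySearch returns the match index, or -(insertionPoint) - 1 when absent.
        int pos = Arrays.binarySearch(splitPoints, key, comparator) + 1;
        // Absent keys: -(insertionPoint) becomes insertionPoint. Present keys: index + 1.
        return pos < 0 ? -pos : pos;
    }
}
```

With R reduce tasks there are R-1 split points, so the returned index always falls in the range 0 to R-1, and keys in a lower partition never sort after keys in a higher one.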
+ R reduces, there are R-1
+ keys in the SequenceFile.]]>
+ ReduceContext to be wrapped
+ @return a wrapped Reducer.Context for custom implementations]]>
diff --git a/hadoop-mapreduce-project/dev-support/jdiff/Apache_Hadoop_MapReduce_JobClient_2.8.3.xml b/hadoop-mapreduce-project/dev-support/jdiff/Apache_Hadoop_MapReduce_JobClient_2.8.3.xml
new file mode 100644
index 0000000000..a63b3aca1e
--- /dev/null
+++ b/hadoop-mapreduce-project/dev-support/jdiff/Apache_Hadoop_MapReduce_JobClient_2.8.3.xml
@@ -0,0 +1,16 @@
diff --git a/hadoop-yarn-project/hadoop-yarn/dev-support/jdiff/Apache_Hadoop_YARN_Client_2.8.3.xml b/hadoop-yarn-project/hadoop-yarn/dev-support/jdiff/Apache_Hadoop_YARN_Client_2.8.3.xml
new file mode 100644
index 0000000000..3f6c5eb5e3
--- /dev/null
+++ b/hadoop-yarn-project/hadoop-yarn/dev-support/jdiff/Apache_Hadoop_YARN_Client_2.8.3.xml
@@ -0,0 +1,2316 @@
+ In secure mode, YARN verifies access to the application, queue
+ etc. before accepting the request.
+ If the user does not have VIEW_APP access then the following
+ fields in the report will be set to stubbed values:
+ - host - set to "N/A"
+ - RPC port - set to -1
+ - client token - set to "N/A"
+ - diagnostics - set to "N/A"
+ - tracking URL - set to "N/A"
+ - original tracking URL - set to "N/A"
+ - resource usage report - all values are -1
+ @param appId
+ {@link ApplicationId} of the application that needs a report
+ @return application report
+ @throws YarnException
+ @throws IOException]]>
+ Get a report (ApplicationReport) of all Applications in the cluster.
+ If the user does not have VIEW_APP access for an application
+ then the corresponding report will be filtered as described in
+ {@link #getApplicationReport(ApplicationId)}.
+ @return a list of reports for all applications
+ @throws YarnException
+ @throws IOException]]>
+ Get a report of the given ApplicationAttempt.
+ In secure mode, YARN verifies access to the application, queue
+ etc. before accepting the request.
+ @param applicationAttemptId
+ {@link ApplicationAttemptId} of the application attempt that needs
+ a report
+ @return application attempt report
+ @throws YarnException
+ @throws ApplicationAttemptNotFoundException if application attempt
+ not found
+ @throws IOException]]>
+ Get a report of all (ApplicationAttempts) of Application in the cluster.
+ @param applicationId
+ @return a list of reports for all application attempts for specified
+ application
+ @throws YarnException
+ @throws IOException]]>
+ Get a report of the given Container.
+ In secure mode, YARN verifies access to the application, queue
+ etc. before accepting the request.
+ @param containerId
+ {@link ContainerId} of the container that needs a report
+ @return container report
+ @throws YarnException
+ @throws ContainerNotFoundException if container not found
+ @throws IOException]]>
+ Get a report of all (Containers) of ApplicationAttempt in the cluster.
+ @param applicationAttemptId
+ @return a list of reports of all containers for specified application
+ attempt
+ @throws YarnException
+ @throws IOException]]>
+ {@code
+ AMRMClient.createAMRMClientContainerRequest()
+ }
+ @return the newly created AMRMClient instance.]]>
+ RegisterApplicationMasterResponse
+ @throws YarnException
+ @throws IOException]]>
+ addContainerRequest are sent to the
+ ResourceManager. New containers assigned to the master are
+ retrieved. Status of completed containers and node health updates are also
+ retrieved. This also doubles up as a heartbeat to the ResourceManager and
+ must be made periodically. The call may not always return any new
+ allocations of containers. The application should not make concurrent allocate
+ requests, as this may cause request loss.
+ Note: If the user has not removed container requests that have already
+ been satisfied, then the re-register may end up sending the entire set of
+ container requests to the RM (including matched requests), which would mean
+ the RM could end up giving it a lot of newly allocated containers.
+ @param progressIndicator Indicates progress made by the master
+ @return the response of the allocate request
+ @throws YarnException
+ @throws IOException]]>
+ allocate
+ @param req Resource request]]>
+ allocate.
+ Any previous pending resource change request of the same container will be
+ removed.
+ Application that calls this method is expected to maintain the
+ Containers that are returned from previous successful
+ allocations or resource changes. By passing in the existing container and a
+ target resource capability to this method, the application requests the
+ ResourceManager to change the existing resource allocation to the target
+ resource allocation.
+ @param container The container returned from the last successful resource
+ allocation or resource change
+ @param capability The target resource capability of the container]]>
+ ContainerRequests matching the given
+ parameters. These ContainerRequests should have been added via
+ addContainerRequest earlier in the lifecycle. For performance,
+ the AMRMClient may return its internal collection directly without creating
+ a copy. Users should not perform mutable operations on the return value.
+ Each collection in the list contains requests with identical
+ Resource size that fit in the given capability. In a
+ collection, requests will be returned in the same order as they were added.
+ @return Collection of requests matching the parameters]]>
+ AMRMClient. This cache must
+ be shared with the {@link NMClient} used to manage containers for the
+ AMRMClient.
+ If an NM token cache is not set, the {@link NMTokenCache#getSingleton()}
+ singleton instance will be used.
+ @param nmTokenCache the NM token cache to use.]]>
+ AMRMClient. This cache must be
+ shared with the {@link NMClient} used to manage containers for the
+ AMRMClient.
+ If an NM token cache is not set, the {@link NMTokenCache#getSingleton()}
+ singleton instance will be used.
+ @return the NM token cache.]]>
+ check to return true for each 1000 ms.
+ See also {@link #waitFor(com.google.common.base.Supplier, int)}
+ and {@link #waitFor(com.google.common.base.Supplier, int, int)}
+ @param check]]>
+ check to return true for each
+ checkEveryMillis ms.
+ See also {@link #waitFor(com.google.common.base.Supplier, int, int)}
+ @param check user defined checker
+ @param checkEveryMillis interval to call check
+ check to return true for each
+ checkEveryMillis ms. In the main loop, this method will log
+ the message "waiting in main loop" for each logInterval
+ iteration to confirm the thread is alive.
+ @param check user defined checker
+ @param checkEveryMillis interval to call check
+ @param logInterval interval to log for each]]>
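The polling behaviour described for waitFor can be sketched roughly as follows. This is a standalone illustration, not the Hadoop implementation: it uses java.util.function.Supplier instead of the Guava Supplier named in the Javadoc, the class name WaitForSketch is hypothetical, and the returned poll count exists only to make the sketch easy to check.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the waitFor(check, checkEveryMillis, logInterval) contract.
final class WaitForSketch {

    /**
     * Evaluates check repeatedly, sleeping checkEveryMillis between polls and
     * printing a liveness message once every logInterval unsuccessful polls.
     * Returns how many times check was evaluated (an addition made for this sketch).
     */
    static int waitFor(Supplier<Boolean> check, int checkEveryMillis, int logInterval) {
        if (check == null) {
            throw new NullPointerException("check should not be null");
        }
        if (checkEveryMillis < 0 || logInterval < 0) {
            throw new IllegalArgumentException("intervals should not be negative");
        }
        int polls = 0;
        int untilLog = logInterval;
        while (true) {
            polls++;
            if (check.get()) {
                return polls; // condition met, leave the main loop
            }
            if (--untilLog <= 0) {
                System.out.println("waiting in main loop");
                untilLog = logInterval;
            }
            try {
                Thread.sleep(checkEveryMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and surface the interruption unchecked.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        long deadline = System.currentTimeMillis() + 20;
        int polls = waitFor(() -> System.currentTimeMillis() >= deadline, 5, 2);
        System.out.println("condition met after " + polls + " polls");
    }
}
```

A small checkEveryMillis makes the caller react sooner at the cost of more evaluations of check; logInterval only controls how often the liveness message appears.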
+ Start an allocated container.
+ The ApplicationMaster or other applications that use the
+ client must provide the details of the allocated container, including the
+ Id, the assigned node's Id and the token via {@link Container}. In
+ addition, the AM needs to provide the {@link ContainerLaunchContext} as
+ well.
+ @param container the allocated container
+ @param containerLaunchContext the context information needed by the
+ NodeManager to launch the container
+ @return a map between the auxiliary service names and their outputs
+ @throws YarnException
+ @throws IOException]]>
+ Increase the resource of a container.
+ The ApplicationMaster or other applications that use the
+ client must provide the details of the container, including the Id and
+ the target resource encapsulated in the updated container token via
+ {@link Container}.
+ @param container the container with updated token
+ @throws YarnException
+ @throws IOException]]>
+ Stop a started container.
+ @param containerId the Id of the started container
+ @param nodeId the Id of the NodeManager
+ @throws YarnException
+ @throws IOException]]>
+ Query the status of a container.
+ @param containerId the Id of the started container
+ @param nodeId the Id of the NodeManager
+ @return the status of a container
+ @throws YarnException
+ @throws IOException]]>
+ Set whether the containers that are started by this client, and are
+ still running should be stopped when the client stops. By default, the
+ feature should be enabled. However, containers will be stopped only
+ when the service is stopped, i.e. after {@link NMClient#stop()}.
+ @param enabled whether the feature is enabled or not]]>
+ NMClient. This cache must be
+ shared with the {@link AMRMClient} that requested the containers managed
+ by this NMClient.
+ If an NM token cache is not set, the {@link NMTokenCache#getSingleton()}
+ singleton instance will be used.
+ @param nmTokenCache the NM token cache to use.]]>
+ NMClient. This cache must be
+ shared with the {@link AMRMClient} that requested the containers managed
+ by this NMClient.
+ If an NM token cache is not set, the {@link NMTokenCache#getSingleton()}
+ singleton instance will be used.
+ @return the NM token cache]]>
+ By default Yarn client libraries {@link AMRMClient} and {@link NMClient} use
+ {@link #getSingleton()} instance of the cache.
+ - Using the singleton instance of the cache is appropriate when running a
+ single ApplicationMaster in the same JVM.
+ - When using the singleton, users don't need to do anything special;
+ {@link AMRMClient} and {@link NMClient} are already set up to use the
+ default singleton {@link NMTokenCache}.
+ If running multiple Application Masters in the same JVM, a different cache
+ instance should be used for each Application Master.
+ - If using the {@link AMRMClient} and the {@link NMClient}, setting up
+ and using an instance cache is as follows:
+ NMTokenCache nmTokenCache = new NMTokenCache();
+ AMRMClient rmClient = AMRMClient.createAMRMClient();
+ NMClient nmClient = NMClient.createNMClient();
+ nmClient.setNMTokenCache(nmTokenCache);
+ ...
+ - If using the {@link AMRMClientAsync} and the {@link NMClientAsync},
+ setting up and using an instance cache is as follows:
+ NMTokenCache nmTokenCache = new NMTokenCache();
+ AMRMClient rmClient = AMRMClient.createAMRMClient();
+ NMClient nmClient = NMClient.createNMClient();
+ nmClient.setNMTokenCache(nmTokenCache);
+ AMRMClientAsync rmClientAsync = new AMRMClientAsync(rmClient, 1000, [AMRM_CALLBACK]);
+ NMClientAsync nmClientAsync = new NMClientAsync("nmClient", nmClient, [NM_CALLBACK]);
+ ...
+ - If using {@link ApplicationMasterProtocol} and
+ {@link ContainerManagementProtocol} directly, setting up and using an
+ instance cache is as follows:
+ NMTokenCache nmTokenCache = new NMTokenCache();
+ ...
+ ApplicationMasterProtocol amPro = ClientRMProxy.createRMProxy(conf, ApplicationMasterProtocol.class);
+ ...
+ AllocateRequest allocateRequest = ...
+ ...
+ AllocateResponse allocateResponse = amPro.allocate(allocateRequest);
+ for (NMToken token : allocateResponse.getNMTokens()) {
+ nmTokenCache.setToken(token.getNodeId().toString(), token.getToken());
+ }
+ ...
+ ContainerManagementProtocolProxy nmPro = new ContainerManagementProtocolProxy(conf, nmTokenCache);
+ ...
+ nmPro.startContainer(container, containerContext);
+ ...
+ It is also possible to mix the usage of a client ({@code AMRMClient} or
+ {@code NMClient}, or the async versions of them) with a protocol proxy
+ ({@code ContainerManagementProtocolProxy} or
+ {@code ApplicationMasterProtocol}).]]>
+ The method to claim a resource with the SharedCacheManager.
+ The client uses a checksum to identify the resource and an
+ {@link ApplicationId} to identify which application will be using the
+ resource.
+ The SharedCacheManager responds with whether or not the
+ resource exists in the cache. If the resource exists, a Path
+ to the resource in the shared cache is returned. If the resource does not
+ exist, null is returned instead.
+ @param applicationId ApplicationId of the application using the resource
+ @param resourceKey the key (i.e. checksum) that identifies the resource
+ @return Path to the resource, or null if it does not exist]]>
+ The method to release a resource with the SharedCacheManager.
+ This method is called once an application is no longer using a claimed
+ resource in the shared cache. The client uses a checksum to identify the
+ resource and an {@link ApplicationId} to identify which application is
+ releasing the resource.
+ Note: This method is an optimization and the client is not required to call
+ it for correctness.
+ @param applicationId ApplicationId of the application releasing the
+ resource
+ @param resourceKey the key (i.e. checksum) that identifies the resource]]>
+ Obtain a {@link YarnClientApplication} for a new application,
+ which in turn contains the {@link ApplicationSubmissionContext} and
+ {@link org.apache.hadoop.yarn.api.protocolrecords.GetNewApplicationResponse}
+ objects.
+ @return {@link YarnClientApplication} built for a new application
+ @throws YarnException
+ @throws IOException]]>
+ Submit a new application to YARN. It is a blocking call - it
+ will not return {@link ApplicationId} until the submitted application is
+ submitted successfully and accepted by the ResourceManager.
+ Users should provide an {@link ApplicationId} as part of the parameter
+ {@link ApplicationSubmissionContext} when submitting a new application,
+ otherwise it will throw the {@link ApplicationIdNotProvidedException}.
+ This internally calls {@link ApplicationClientProtocol#submitApplication
+ (SubmitApplicationRequest)}, and after that, it internally invokes
+ {@link ApplicationClientProtocol#getApplicationReport
+ (GetApplicationReportRequest)} and waits till it can make sure that the
+ application gets properly submitted. If RM fails over or RM restart
+ happens before ResourceManager saves the application's state,
+ {@link ApplicationClientProtocol
+ #getApplicationReport(GetApplicationReportRequest)} will throw
+ the {@link ApplicationNotFoundException}. This API automatically resubmits
+ the application with the same {@link ApplicationSubmissionContext} when it
+ catches the {@link ApplicationNotFoundException}.
+ @param appContext
+ {@link ApplicationSubmissionContext} containing all the details
+ needed to submit a new application
+ @return {@link ApplicationId} of the accepted application
+ @throws YarnException
+ @throws IOException
+ @see #createApplication()]]>
+ Fail an application attempt identified by given ID.
+ @param applicationAttemptId
+ {@link ApplicationAttemptId} of the attempt to fail.
+ @throws YarnException
+ in case of errors or if YARN rejects the request due to
+ access-control restrictions.
+ @throws IOException
+ @see #getQueueAclsInfo()]]>
+ Kill an application identified by given ID.
+ @param applicationId
+ {@link ApplicationId} of the application that needs to be killed
+ @throws YarnException
+ in case of errors or if YARN rejects the request due to
+ access-control restrictions.
+ @throws IOException
+ @see #getQueueAclsInfo()]]>
+ Kill an application identified by given ID.
+ @param applicationId {@link ApplicationId} of the application that needs to
+ be killed
+ @param diagnostics for killing an application.
+ @throws YarnException in case of errors or if YARN rejects the request due
+ to access-control restrictions.
+ @throws IOException]]>
+ Get a report of the given Application.
+ In secure mode, YARN verifies access to the application, queue
+ etc. before accepting the request.
+ If the user does not have VIEW_APP access then the following
+ fields in the report will be set to stubbed values:
+ - host - set to "N/A"
+ - RPC port - set to -1
+ - client token - set to "N/A"
+ - diagnostics - set to "N/A"
+ - tracking URL - set to "N/A"
+ - original tracking URL - set to "N/A"
+ - resource usage report - all values are -1
+ @param appId
+ {@link ApplicationId} of the application that needs a report
+ @return application report
+ @throws YarnException
+ @throws IOException]]>
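The stubbing rule listed above for callers without VIEW_APP access can be illustrated with a small model. This is only a sketch under assumptions: AppReportSketch and its fields are hypothetical stand-ins for the real report type, and the two usage counters stand in for the resource usage report.

```java
// Hypothetical model of an application report; not the YARN ApplicationReport class.
final class AppReportSketch {
    static final String NA = "N/A";

    String name;
    String host;
    int rpcPort;
    String clientToken;
    String diagnostics;
    String trackingUrl;
    String originalTrackingUrl;
    long usedMemoryMb; // stands in for the resource usage report
    int usedVcores;    // stands in for the resource usage report

    /** Returns a copy in which the fields named in the Javadoc are set to stubbed values. */
    static AppReportSketch stubFor(AppReportSketch full) {
        AppReportSketch stub = new AppReportSketch();
        stub.name = full.name;          // not in the stubbed list, so kept as-is
        stub.host = NA;
        stub.rpcPort = -1;
        stub.clientToken = NA;
        stub.diagnostics = NA;
        stub.trackingUrl = NA;
        stub.originalTrackingUrl = NA;
        stub.usedMemoryMb = -1;         // all resource usage values are -1
        stub.usedVcores = -1;
        return stub;
    }

    /** Picks the full or the stubbed view depending on the caller's access. */
    static AppReportSketch view(AppReportSketch full, boolean hasViewAppAccess) {
        return hasViewAppAccess ? full : stubFor(full);
    }

    public static void main(String[] args) {
        AppReportSketch full = new AppReportSketch();
        full.name = "wordcount";
        full.host = "am-host.example";
        full.rpcPort = 8030;
        AppReportSketch seen = view(full, false);
        System.out.println(seen.name + " " + seen.host + " " + seen.rpcPort);
    }
}
```

Callers that display reports should therefore treat "N/A" and -1 as "not visible to this user" rather than as real values.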
+ The AMRM token is required for AM to RM scheduling operations. For
+ managed Application Masters, Yarn takes care of injecting it. For unmanaged
+ Application Masters, the token must be obtained via this method and set
+ in the {@link org.apache.hadoop.security.UserGroupInformation} of the
+ current user.
+ The AMRM token will be returned only if all the following conditions are
+ met:
+ - the requester is the owner of the ApplicationMaster
+ - the application master is an unmanaged ApplicationMaster
+ - the application master is in ACCEPTED state
+ Otherwise this method returns null.
+ @param appId {@link ApplicationId} of the application to get the AMRM token
+ @return the AMRM token if available
+ @throws YarnException
+ @throws IOException]]>
+ Get a report (ApplicationReport) of all Applications in the cluster.
+ If the user does not have VIEW_APP access for an application
+ then the corresponding report will be filtered as described in
+ {@link #getApplicationReport(ApplicationId)}.
+ @return a list of reports of all running applications
+ @throws YarnException
+ @throws IOException]]>
+ Get a report (ApplicationReport) of Applications
+ matching the given application types in the cluster.
+ If the user does not have VIEW_APP access for an application
+ then the corresponding report will be filtered as described in
+ {@link #getApplicationReport(ApplicationId)}.
+ @param applicationTypes set of application types you are interested in
+ @return a list of reports of applications
+ @throws YarnException
+ @throws IOException]]>
+ Get a report (ApplicationReport) of Applications matching the given
+ application states in the cluster.
+ If the user does not have VIEW_APP access for an application
+ then the corresponding report will be filtered as described in
+ {@link #getApplicationReport(ApplicationId)}.
+ @param applicationStates set of application states you are interested in
+ @return a list of reports of applications
+ @throws YarnException
+ @throws IOException]]>
+ Get a report (ApplicationReport) of Applications matching the given
+ application types and application states in the cluster.
+ If the user does not have VIEW_APP access for an application
+ then the corresponding report will be filtered as described in
+ {@link #getApplicationReport(ApplicationId)}.
+ @param applicationTypes set of application types you are interested in
+ @param applicationStates set of application states you are interested in
+ @return a list of reports of applications
+ @throws YarnException
+ @throws IOException]]>
+ Get a report (ApplicationReport) of Applications matching the given users,
+ queues, application types and application states in the cluster. If any of
+ the params is set to null, it is not used when filtering.
+ If the user does not have VIEW_APP access for an application
+ then the corresponding report will be filtered as described in
+ {@link #getApplicationReport(ApplicationId)}.
+ @param queues set of queues you are interested in
+ @param users set of users you are interested in
+ @param applicationTypes set of application types you are interested in
+ @param applicationStates set of application states you are interested in
+ @return a list of reports of applications
+ @throws YarnException
+ @throws IOException]]>
+ Get metrics ({@link YarnClusterMetrics}) about the cluster.
+ @return cluster metrics
+ @throws YarnException
+ @throws IOException]]>
+ Get a report of nodes ({@link NodeReport}) in the cluster.
+ @param states The {@link NodeState}s to filter on. If no filter states are
+ given, nodes in all states will be returned.
+ @return A list of node reports
+ @throws YarnException
+ @throws IOException]]>
+ Get a delegation token so as to be able to talk to YARN using those tokens.
+ @param renewer
+ Address of the renewer who can renew these tokens when needed by
+ securely talking to YARN.
+ @return a delegation token ({@link Token}) that can be used to
+ talk to YARN
+ @throws YarnException
+ @throws IOException]]>
+ Get information ({@link QueueInfo}) about a given queue.
+ @param queueName
+ Name of the queue whose information is needed
+ @return queue information
+ @throws YarnException
+ in case of errors or if YARN rejects the request due to
+ access-control restrictions.
+ @throws IOException]]>
+ Get information ({@link QueueInfo}) about all queues, recursively if there
+ is a hierarchy
+ @return a list of queue-information for all queues
+ @throws YarnException
+ @throws IOException]]>
+ Get information ({@link QueueInfo}) about top level queues.
+ @return a list of queue-information for all the top-level queues
+ @throws YarnException
+ @throws IOException]]>
+ Get information ({@link QueueInfo}) about all the immediate children queues
+ of the given queue
+ @param parent
+ Name of the queue whose child-queues' information is needed
+ @return a list of queue-information for all queues who are direct children
+ of the given parent queue.
+ @throws YarnException
+ @throws IOException]]>
+ Get information about acls for current user on all the
+ existing queues.
+ @return a list of queue acls ({@link QueueUserACLInfo}) for
+ current user
+ @throws YarnException
+ @throws IOException]]>
+ Get a report of the given ApplicationAttempt.
+ In secure mode, YARN verifies access to the application, queue
+ etc. before accepting the request.
+ @param applicationAttemptId
+ {@link ApplicationAttemptId} of the application attempt that needs
+ a report
+ @return application attempt report
+ @throws YarnException
+ @throws ApplicationAttemptNotFoundException if application attempt
+ not found
+ @throws IOException]]>
+ Get a report of all (ApplicationAttempts) of Application in the cluster.
+ @param applicationId application id of the app
+ @return a list of reports for all application attempts for specified
+ application.
+ @throws YarnException
+ @throws IOException]]>
+ Get a report of the given Container.
+ In secure mode, YARN verifies access to the application, queue
+ etc. before accepting the request.
+ @param containerId
+ {@link ContainerId} of the container that needs a report
+ @return container report
+ @throws YarnException
+ @throws ContainerNotFoundException if container not found.
+ @throws IOException]]>
+ Get a report of all (Containers) of ApplicationAttempt in the cluster.
+ @param applicationAttemptId application attempt id
+ @return a list of reports of all containers for specified application
+ attempts
+ @throws YarnException
+ @throws IOException]]>
+ Attempts to move the given application to the given queue.
+ @param appId
+ Application to move.
+ @param queue
+ Queue to place it into.
+ @throws YarnException
+ @throws IOException]]>
+ Obtain a {@link GetNewReservationResponse} for a new reservation,
+ which contains the {@link ReservationId} object.
+ @return The {@link GetNewReservationResponse} containing a new
+ {@link ReservationId} object.
+ @throws YarnException if reservation cannot be created.
+ @throws IOException if reservation cannot be created.]]>
+ The interface used by clients to submit a new reservation to the
+ {@code ResourceManager}.
+ The client packages all details of its request in a
+ {@link ReservationSubmissionRequest} object. This contains information
+ about the amount of capacity, temporal constraints, and gang needs.
+ Furthermore, the reservation might be composed of multiple stages, with
+ ordering dependencies among them.
+ In order to respond, a new admission control component in the
+ {@code ResourceManager} performs an analysis of the resources that have
+ been committed over the period of time the user is requesting, verifies that
+ the user requests can be fulfilled, and that they respect a sharing policy
+ (e.g., {@code CapacityOverTimePolicy}). Once it has positively determined
+ that the ReservationRequest is satisfiable, the {@code ResourceManager}
+ answers with a {@link ReservationSubmissionResponse} that includes a
+ {@link ReservationId}. Upon failure to find a valid allocation, the response
+ is an exception with a message detailing the reason for the failure.
+ The semantics guarantee that the returned {@link ReservationId}
+ corresponds to a valid reservation existing in the time range requested by
+ the user. The amount of capacity dedicated to such a reservation can vary
+ over time, depending on the allocation that has been determined, but it is
+ guaranteed to satisfy all the constraints expressed by the user in the
+ {@link ReservationDefinition}.
+ @param request request to submit a new Reservation
+ @return response contains the {@link ReservationId} on accepting the
+ submission
+ @throws YarnException if the reservation cannot be created successfully
+ @throws IOException]]>
+ The interface used by clients to update an existing Reservation. This is
+ referred to as a re-negotiation process, in which a user that has
+ previously submitted a Reservation submits an updated request for it.
+ The allocation is attempted by virtually substituting all previous
+ allocations related to this Reservation with new ones, that satisfy the new
+ {@link ReservationDefinition}. Upon success the previous allocation is
+ atomically substituted by the new one, and on failure (i.e., if the system
+ cannot find a valid allocation for the updated request), the previous
+ allocation remains valid.
+ @param request to update an existing Reservation (the
+ {@link ReservationUpdateRequest} should refer to an existing valid
+ {@link ReservationId})
+ @return response empty on successfully updating the existing reservation
+ @throws YarnException if the request is invalid or reservation cannot be
+ updated successfully
+ @throws IOException]]>
+ The interface used by clients to remove an existing Reservation.
+ @param request to remove an existing Reservation (the
+ {@link ReservationDeleteRequest} should refer to an existing valid
+ {@link ReservationId})
+ @return response empty on successfully deleting the existing reservation
+ @throws YarnException if the request is invalid or reservation cannot be
+ deleted successfully
+ @throws IOException]]>
+ The interface used by clients to get the list of reservations in a plan.
+ The reservationId will be used to search for reservations to list if it is
+ provided. Otherwise, it will select active reservations within the
+ startTime and endTime (inclusive).
+ @param request to list reservations in a plan. Contains fields to select
+ String queue, ReservationId reservationId, long startTime,
+ long endTime, and a bool includeReservationAllocations.
+ queue: Required. Cannot be null or empty. Refers to the
+ reservable queue in the scheduler that was selected when
+ creating a reservation submission
+ {@link ReservationSubmissionRequest}.
+ reservationId: Optional. If provided, other fields will
+ be ignored.
+ startTime: Optional. If provided, only reservations that
+ end after the startTime will be selected. This defaults
+ to 0 if an invalid number is used.
+ endTime: Optional. If provided, only reservations that
+ start on or before endTime will be selected. This defaults
+ to Long.MAX_VALUE if an invalid number is used.
+ includeReservationAllocations: Optional. Flag that
+ determines whether the entire reservation allocations are
+ to be returned. Reservation allocations are subject to
+ change in the event of re-planning as described by
+ {@link ReservationDefinition}.
+ @return response that contains information about reservations that are
+ being searched for.
+ @throws YarnException if the request is invalid
+ @throws IOException if the request failed otherwise]]>
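The selection rules described for listing reservations can be sketched as a plain filter. This is an illustrative model only: ReservationListSketch and Reservation are hypothetical types, and treating negative numbers as the "invalid" values that trigger the defaults is an assumption made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the listing rules; not the ResourceManager implementation.
final class ReservationListSketch {

    static final class Reservation {
        final String id;
        final long start;
        final long end;

        Reservation(String id, long start, long end) {
            this.id = id;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns the ids of the reservations in plan that the request would select. */
    static List<String> select(String queue, List<Reservation> plan,
                               String reservationId, long startTime, long endTime) {
        if (queue == null || queue.isEmpty()) {
            throw new IllegalArgumentException("queue is required");
        }
        List<String> selected = new ArrayList<>();
        if (reservationId != null && !reservationId.isEmpty()) {
            // When an id is provided, the time range is ignored.
            for (Reservation r : plan) {
                if (r.id.equals(reservationId)) {
                    selected.add(r.id);
                }
            }
            return selected;
        }
        // Assumption: negative values count as invalid and fall back to the defaults.
        long from = startTime < 0 ? 0L : startTime;
        long to = endTime < 0 ? Long.MAX_VALUE : endTime;
        for (Reservation r : plan) {
            // Keep reservations that end after startTime and start on or before endTime.
            if (r.end > from && r.start <= to) {
                selected.add(r.id);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Reservation> plan = new ArrayList<>();
        plan.add(new Reservation("r1", 0, 10));
        plan.add(new Reservation("r2", 10, 20));
        System.out.println(select("reservable", plan, null, 5, 15));
    }
}
```

Note the asymmetry in the bounds: a reservation that ends exactly at startTime is excluded, while one that starts exactly at endTime is included.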
+ The interface used by client to get node to labels mappings in existing cluster
+ @return node to labels mappings
+ @throws YarnException
+ @throws IOException]]>
+ The interface used by client to get labels to nodes mapping
+ in existing cluster
+ @return labels to nodes mappings
+ @throws YarnException
+ @throws IOException]]>
+ The interface used by client to get labels to nodes mapping
+ for specified labels in existing cluster
+ @param labels labels for which labels to nodes mapping has to be retrieved
+ @return labels to nodes mappings for specific labels
+ @throws YarnException
+ @throws IOException]]>
+ The interface used by client to get node labels in the cluster
+ @return cluster node labels collection
+ @throws YarnException when there is a failure in
+ {@link ApplicationClientProtocol}
+ @throws IOException when there is a failure in
+ {@link ApplicationClientProtocol}]]>
+ The interface used by client to set priority of an application
+ @param applicationId
+ @param priority
+ @return updated priority of an application.
+ @throws YarnException
+ @throws IOException]]>
+ Signal a container identified by given ID.
+ @param containerId
+ {@link ContainerId} of the container that needs to be signaled
+ @param command the signal container command
+ @throws YarnException
+ @throws IOException]]>
+ Create a new instance of AMRMClientAsync.
+ @param intervalMs heartbeat interval in milliseconds between AM and RM
+ @param callbackHandler callback handler that processes responses from
+ the ResourceManager
+ Create a new instance of AMRMClientAsync.
+ @param client the AMRMClient instance
+ @param intervalMs heartbeat interval in milliseconds between AM and RM
+ @param callbackHandler callback handler that processes responses from
+ the ResourceManager
+ allocate
+ @param req Resource request]]>
+ allocate.
+ Any previous pending resource change request of the same container will be
+ removed.
+ Application that calls this method is expected to maintain the
+ Containers that are returned from previous successful
+ allocations or resource changes. By passing in the existing container and a
+ target resource capability to this method, the application requests the
+ ResourceManager to change the existing resource allocation to the target
+ resource allocation.
+ @param container The container returned from the last successful resource
+ allocation or resource change
+ @param capability The target resource capability of the container]]>
+ check to return true for each 1000 ms.
+ See also {@link #waitFor(com.google.common.base.Supplier, int)}
+ and {@link #waitFor(com.google.common.base.Supplier, int, int)}
+ @param check]]>
+ check to return true for each
+ checkEveryMillis ms.
+ See also {@link #waitFor(com.google.common.base.Supplier, int, int)}
+ @param check user defined checker
+ @param checkEveryMillis interval to call check
+ check to return true for each
+ checkEveryMillis ms. In the main loop, this method will log
+ the message "waiting in main loop" for each logInterval
+ iteration to confirm the thread is alive.
+ @param check user defined checker
+ @param checkEveryMillis interval to call check
+ @param logInterval interval to log for each]]>
+ AMRMClientAsync handles communication with the ResourceManager
+ and provides asynchronous updates on events such as container allocations and
+ completions. It contains a thread that sends periodic heartbeats to the
+ ResourceManager.
+ It should be used by implementing a CallbackHandler:
+ {@code
+ class MyCallbackHandler extends AMRMClientAsync.AbstractCallbackHandler {
+ public void onContainersAllocated(List containers) {
+ [run tasks on the containers]
+ }
+ public void onContainersUpdated(List containers) {
+ [determine if resource allocation of containers have been increased in
+ the ResourceManager, and if so, inform the NodeManagers to increase the
+ resource monitor/enforcement on the containers]
+ }
+ public void onContainersCompleted(List statuses) {
+ [update progress, check whether app is done]
+ }
+ public void onNodesUpdated(List updated) {}
+ public void onReboot() {}
+ }
+ }
+ The client's lifecycle should be managed similarly to the following:
+ {@code
+ AMRMClientAsync asyncClient =
+ createAMRMClientAsync(appAttId, 1000, new MyCallbackHandler());
+ asyncClient.init(conf);
+ asyncClient.start();
+ RegisterApplicationMasterResponse response = asyncClient
+ .registerApplicationMaster(appMasterHostname, appMasterRpcPort,
+ appMasterTrackingUrl);
+ asyncClient.addContainerRequest(containerRequest);
+ [... wait for application to complete]
+ asyncClient.unregisterApplicationMaster(status, appMsg, trackingUrl);
+ asyncClient.stop();
+ }
+ NMClientAsync handles communication with all the NodeManagers
+ and provides asynchronous updates on getting responses from them. It
+ maintains a thread pool to communicate with individual NMs where a number of
+ worker threads process requests to NMs by using {@link NMClientImpl}. The max
+ size of the thread pool is configurable through
+ {@link YarnConfiguration#NM_CLIENT_ASYNC_THREAD_POOL_MAX_SIZE}.
+ It should be used in conjunction with a CallbackHandler. For example
+ {@code
+ class MyCallbackHandler extends NMClientAsync.AbstractCallbackHandler {
+ public void onContainerStarted(ContainerId containerId,
+ Map allServiceResponse) {
+ [post process after the container is started, process the response]
+ }
+ public void onContainerResourceIncreased(ContainerId containerId,
+ Resource resource) {
+ [post process after the container resource is increased]
+ }
+ public void onContainerStatusReceived(ContainerId containerId,
+ ContainerStatus containerStatus) {
+ [make use of the status of the container]
+ }
+ public void onContainerStopped(ContainerId containerId) {
+ [post process after the container is stopped]
+ }
+ public void onStartContainerError(
+ ContainerId containerId, Throwable t) {
+ [handle the raised exception]
+ }
+ public void onGetContainerStatusError(
+ ContainerId containerId, Throwable t) {
+ [handle the raised exception]
+ }
+ public void onStopContainerError(
+ ContainerId containerId, Throwable t) {
+ [handle the raised exception]
+ }
+ }
+ }
+ The client's life-cycle should be managed like the following:
+ {@code
+ NMClientAsync asyncClient =
+ NMClientAsync.createNMClientAsync(new MyCallbackHandler());
+ asyncClient.init(conf);
+ asyncClient.start();
+ asyncClient.startContainer(container, containerLaunchContext);
+ [... wait for container being started]
+ asyncClient.getContainerStatus(container.getId(), container.getNodeId(),
+ container.getContainerToken());
+ [... handle the status in the callback instance]
+ asyncClient.stopContainer(container.getId(), container.getNodeId(),
+ container.getContainerToken());
+ [... wait for container being stopped]
+ asyncClient.stop();
+ }
diff --git a/hadoop-yarn-project/hadoop-yarn/dev-support/jdiff/Apache_Hadoop_YARN_Common_2.8.3.xml b/hadoop-yarn-project/hadoop-yarn/dev-support/jdiff/Apache_Hadoop_YARN_Common_2.8.3.xml
new file mode 100644
index 0000000000..6826c2565f
--- /dev/null
+++ b/hadoop-yarn-project/hadoop-yarn/dev-support/jdiff/Apache_Hadoop_YARN_Common_2.8.3.xml
@@ -0,0 +1,2665 @@
+ Type of proxy.
+ @return Proxy to the ResourceManager for the specified client protocol.
+ @throws IOException]]>
+ Type information of the proxy
+ @return Proxy to the RM
+ @throws IOException]]>
+ Send the information of a number of conceptual entities to the timeline
+ server. It is a blocking API. The method will not return until it gets the
+ response from the timeline server.
+ @param entities
+ the collection of {@link TimelineEntity}
+ @return the error information if the sent entities are not correctly stored
+ @throws IOException
+ @throws YarnException]]>
+ Send the information of a number of conceptual entities to the timeline
+ server. It is a blocking API. The method will not return until it gets the
+ response from the timeline server.
+ This API is only for timeline service v1.5
+ @param appAttemptId {@link ApplicationAttemptId}
+ @param groupId {@link TimelineEntityGroupId}
+ @param entities
+ the collection of {@link TimelineEntity}
+ @return the error information if the sent entities are not correctly stored
+ @throws IOException
+ @throws YarnException]]>
+ Send the information of a domain to the timeline server. It is a
+ blocking API. The method will not return until it gets the response from
+ the timeline server.
+ @param domain
+ a {@link TimelineDomain} object
+ @throws IOException
+ @throws YarnException]]>
+ Send the information of a domain to the timeline server. It is a
+ blocking API. The method will not return until it gets the response from
+ the timeline server.
+ This API is only for timeline service v1.5
+ @param domain
+ a {@link TimelineDomain} object
+ @param appAttemptId {@link ApplicationAttemptId}
+ @throws IOException
+ @throws YarnException]]>
+ Get a delegation token so as to be able to talk to the timeline server in a
+ secure way.
+ @param renewer
+ Address of the renewer who can renew these tokens when needed by
+ securely talking to the timeline server
+ @return a delegation token ({@link Token}) that can be used to talk to the
+ timeline server
+ @throws IOException
+ @throws YarnException]]>
+ Renew a timeline delegation token.
+ @param timelineDT
+ the delegation token to renew
+ @return the new expiration time
+ @throws IOException
+ @throws YarnException]]>
+ Cancel a timeline delegation token.
+ @param timelineDT
+ the delegation token to cancel
+ @throws IOException
+ @throws YarnException]]>
+ parameterized event of type T]]>
+ InputStream to be checksummed
+ @return the message digest of the input stream
+ @throws IOException]]>
+ SharedCacheChecksum object based on the configurable
+ algorithm implementation
+ (see yarn.sharedcache.checksum.algo.impl)
+ @return SharedCacheChecksum
+ The object type on which this state machine operates.
+ @param The state of the entity.
+ @param The external eventType to be handled.
+ @param The event object.]]>
diff --git a/hadoop-yarn-project/hadoop-yarn/dev-support/jdiff/Apache_Hadoop_YARN_Server_Common_2.8.3.xml b/hadoop-yarn-project/hadoop-yarn/dev-support/jdiff/Apache_Hadoop_YARN_Server_Common_2.8.3.xml
new file mode 100644
index 0000000000..f3191e46d5
--- /dev/null
+++ b/hadoop-yarn-project/hadoop-yarn/dev-support/jdiff/Apache_Hadoop_YARN_Server_Common_2.8.3.xml
@@ -0,0 +1,829 @@
+ true if the node is healthy, else false
+ diagnostic health report of the node.
+ @return diagnostic health report of the node]]>
+ last timestamp at which the health report was received.
+ @return last timestamp at which the health report was received]]>
+ It includes information such as:
+ -
+ An indicator of whether the node is healthy, as determined by the
+ health-check script.
+ - The previous time at which the health status was reported.
+ - A diagnostic report on the health status.
+ @see NodeReport
+ @see ApplicationClientProtocol#getClusterNodes(org.apache.hadoop.yarn.api.protocolrecords.GetClusterNodesRequest)]]>
+ true if the iteration has more elements.]]>