null
if
no such property exists. If the key is deprecated, it returns the value of
the first key which replaces the deprecated key and is not null.
Values are processed for variable expansion
before being returned.
@param name the property name.
@return the value of the name
or its replacing property,
or null if no such property exists.]]>
name
property or
its replacing property and null if no such property exists.]]>
name
property. If
name
is deprecated, it sets the value
to the keys
that replace the deprecated key.
@param name property name.
@param value property value.]]>
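The deprecated-key behaviour described above for get and set can be sketched with plain maps. This is a minimal illustration, not Hadoop's Configuration class: the class name, the deprecate method and the key names are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of deprecated-key handling; not Hadoop's Configuration. */
final class DeprecatedKeySketch {
    private final Map<String, String> props = new HashMap<>();
    private final Map<String, List<String>> replacements = new HashMap<>();

    /** Declares oldKey as deprecated in favour of newKeys (assumed helper). */
    void deprecate(String oldKey, String... newKeys) {
        replacements.put(oldKey, Arrays.asList(newKeys));
    }

    /** Setting a deprecated key sets the value on the keys that replace it. */
    void set(String name, String value) {
        List<String> newKeys = replacements.get(name);
        if (newKeys == null) {
            props.put(name, value);
            return;
        }
        for (String key : newKeys) {
            props.put(key, value);
        }
    }

    /** Getting a deprecated key returns the first non-null replacement value. */
    String get(String name) {
        List<String> newKeys = replacements.get(name);
        if (newKeys == null) {
            return props.get(name);
        }
        for (String key : newKeys) {
            String value = props.get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        DeprecatedKeySketch conf = new DeprecatedKeySketch();
        conf.deprecate("old.key", "new.key.a", "new.key.b");
        conf.set("old.key", "42");
        System.out.println(conf.get("new.key.a"));
        System.out.println(conf.get("old.key"));
    }
}
```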
defaultValue
is returned.
@param name property name.
@param defaultValue default value.
@return property value, or defaultValue
if the property
doesn't exist.]]>
int
.
If no such property exists, or if the specified value is not a valid
int
, then defaultValue
is returned.
@param name property name.
@param defaultValue default value.
@return property value as an int
,
or defaultValue
.]]>
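The fallback rule described above (return defaultValue when the property is missing or is not a valid int) can be sketched as follows. The class and the property names are hypothetical stand-ins, and the sketch follows the wording of this description rather than any particular Hadoop release.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a configuration object; not Hadoop's Configuration. */
final class IntPropertySketch {
    private final Map<String, String> props = new HashMap<>();

    void set(String name, String value) {
        props.put(name, value);
    }

    /** Returns the property as an int, or defaultValue if missing or not a valid int. */
    int getInt(String name, int defaultValue) {
        String raw = props.get(name);
        if (raw == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        IntPropertySketch conf = new IntPropertySketch();
        conf.set("io.buffer.size", "4096");
        conf.set("io.retries", "many");
        System.out.println(conf.getInt("io.buffer.size", 1024));
        System.out.println(conf.getInt("io.retries", 3));
        System.out.println(conf.getInt("io.missing", 7));
    }
}
```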
int
.
@param name property name.
@param value int
value of the property.]]>
long
.
If no such property is specified, or if the specified value is not a valid
long
, then defaultValue
is returned.
@param name property name.
@param defaultValue default value.
@return property value as a long
,
or defaultValue
.]]>
long
.
@param name property name.
@param value long
value of the property.]]>
float
.
If no such property is specified, or if the specified value is not a valid
float
, then defaultValue
is returned.
@param name property name.
@param defaultValue default value.
@return property value as a float
,
or defaultValue
.]]>
float
.
@param name property name.
@param value property value.]]>
boolean
.
If no such property is specified, or if the specified value is not a valid
boolean
, then defaultValue
is returned.
@param name property name.
@param defaultValue default value.
@return property value as a boolean
,
or defaultValue
.]]>
boolean
.
@param name property name.
@param value boolean
value of the property.]]>
set(<name>, value.toString())
.
@param name property name
@param value new value]]>
Pattern
.
If no such property is specified, or if the specified value is not a valid
Pattern
, then defaultValue
is returned.
@param name property name
@param defaultValue default value
@return property value as a compiled Pattern, or defaultValue]]>
String
s.
If no such property is specified then an empty collection is returned.
This is an optimized version of {@link #getStrings(String)}.
@param name property name.
@return property value as a collection of String
s.]]>
String
s.
If no such property is specified then null
is returned.
@param name property name.
@return property value as an array of String
s,
or null
.]]>
String
s.
If no such property is specified then default value is returned.
@param name property name.
@param defaultValue The default value
@return property value as an array of String
s,
or default value.]]>
String
s, trimmed of the leading and trailing whitespace.
If no such property is specified then an empty Collection
is returned.
@param name property name.
@return property value as a collection of String
s, or an empty Collection
]]>
String
s, trimmed of the leading and trailing whitespace.
If no such property is specified then an empty array is returned.
@param name property name.
@return property value as an array of trimmed String
s,
or empty array.]]>
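A minimal sketch of the trimmed-array behaviour described above. It assumes the property value is a comma-delimited list; the class, method and property names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of reading a comma-delimited property as trimmed strings. */
final class TrimmedStringsSketch {

    /** Returns the trimmed elements, or an empty array if the property is missing or blank. */
    static String[] getTrimmedStrings(Map<String, String> props, String name) {
        String raw = props.get(name);
        if (raw == null || raw.trim().isEmpty()) {
            return new String[0];
        }
        // Trim the ends first, then split on commas with optional surrounding whitespace.
        return raw.trim().split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("my.hosts", "  alpha , beta ,gamma  ");
        for (String host : getTrimmedStrings(props, "my.hosts")) {
            System.out.println("[" + host + "]");
        }
        System.out.println(getTrimmedStrings(props, "my.missing").length);
    }
}
```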
String
s, trimmed of the leading and trailing whitespace.
If no such property is specified then default value is returned.
@param name property name.
@param defaultValue The default value
@return property value as an array of trimmed String
s,
or default value.]]>
Class
.
The value of the property specifies a list of comma separated class names.
If no such property is specified, then defaultValue
is
returned.
@param name the property name.
@param defaultValue default value.
@return property value as a Class[]
,
or defaultValue
.]]>
Class
.
If no such property is specified, then defaultValue
is
returned.
@param name the class name.
@param defaultValue default value.
@return property value as a Class
,
or defaultValue
.]]>
Class
implementing the interface specified by xface
.
If no such property is specified, then defaultValue
is
returned.
An exception is thrown if the returned class does not implement the named
interface.
@param name the class name.
@param defaultValue default value.
@param xface the interface implemented by the named class.
@return property value as a Class
,
or defaultValue
.]]>
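The interface check described above can be sketched with standard reflection. This is an illustration only: loadClass, the map-based property store and the property names are hypothetical, and the exception type thrown by the real method is not stated here, so a RuntimeException is assumed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of resolving a class-valued property against an interface. */
final class ClassLookupSketch {

    /**
     * Returns the class named by the property, or defaultValue if the property
     * is not set. Throws if the named class does not implement xface.
     */
    static <U> Class<? extends U> loadClass(Map<String, String> props, String name,
                                            Class<? extends U> defaultValue, Class<U> xface) {
        String className = props.get(name);
        if (className == null) {
            return defaultValue;
        }
        try {
            Class<?> candidate = Class.forName(className);
            if (!xface.isAssignableFrom(candidate)) {
                throw new RuntimeException(candidate.getName() + " does not implement " + xface.getName());
            }
            return candidate.asSubclass(xface);
        } catch (ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("list.impl", "java.util.LinkedList");
        System.out.println(loadClass(props, "list.impl", ArrayList.class, List.class));
        System.out.println(loadClass(props, "unset.impl", ArrayList.class, List.class));
    }
}
```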
List
of objects implementing the interface specified by xface
.
An exception is thrown if any of the classes does not exist, or if it does
not implement the named interface.
@param name the property name.
@param xface the interface implemented by the classes named by
name
.
@return a List
of objects implementing xface
.]]>
theClass
implementing the given interface xface
.
An exception is thrown if theClass
does not implement the
interface xface
.
@param name property name.
@param theClass property value.
@param xface the interface implemented by the named class.]]>
false
to turn it off.]]>
Configurations are specified by resources. A resource contains a set of
name/value pairs as XML data. Each resource is named by either a
String
or by a {@link Path}. If named by a String
,
then the classpath is examined for a file with that name. If named by a
Path
, then the local filesystem is examined directly, without
referring to the classpath.
Unless explicitly turned off, Hadoop by default specifies two resources, loaded in-order from the classpath:
Configuration parameters may be declared final.
Once a resource declares a value final, no subsequently-loaded
resource can alter that value.
For example, one might define a final parameter with:
<property>
<name>dfs.client.buffer.dir</name>
<value>/tmp/hadoop/dfs/client</value>
<final>true</final>
</property>
Administrators typically define parameters as final in
core-site.xml for values that user applications may not alter.
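The effect of a final declaration on later resources can be sketched as below. This is a simplified model under stated assumptions, not Hadoop's loader: the class and its load method are hypothetical, and each call to load stands for one property read from one resource, in load order.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of final parameters across resources loaded in order. */
final class FinalParameterSketch {
    private final Map<String, String> values = new HashMap<>();
    private final Set<String> finalNames = new HashSet<>();

    /** Applies one property; a name already declared final is left unchanged. */
    void load(String name, String value, boolean isFinal) {
        if (finalNames.contains(name)) {
            return;
        }
        values.put(name, value);
        if (isFinal) {
            finalNames.add(name);
        }
    }

    String get(String name) {
        return values.get(name);
    }

    public static void main(String[] args) {
        FinalParameterSketch conf = new FinalParameterSketch();
        // First resource: the administrator marks the value final.
        conf.load("dfs.client.buffer.dir", "/tmp/hadoop/dfs/client", true);
        // Later resource: the attempted override has no effect.
        conf.load("dfs.client.buffer.dir", "/home/user/buffers", false);
        System.out.println(conf.get("dfs.client.buffer.dir"));
    }
}
```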
Value strings are first processed for variable expansion. The available properties are:
For example, if a configuration resource contains the following property
definitions:
<property>
<name>basedir</name>
<value>/user/${user.name}</value>
</property>
<property>
<name>tempdir</name>
<value>${basedir}/tmp</value>
</property>
When conf.get("tempdir") is called, then ${basedir}
will be resolved to another property in this Configuration, while
${user.name} would then ordinarily be resolved to the value
of the System property with that name.]]>
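The expansion in the example above can be sketched as a small loop over ${...} references. The lookup order used here (the configuration map first, then System properties) and the depth limit of 20 are assumptions of the sketch, and the class is hypothetical rather than Hadoop's implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of ${name} expansion against a property map and System properties. */
final class VariableExpansionSketch {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}$]+)\\}");
    private static final int MAX_DEPTH = 20; // arbitrary guard against circular references

    static String expand(String value, Map<String, String> props) {
        String current = value;
        for (int depth = 0; depth < MAX_DEPTH; depth++) {
            Matcher m = VAR.matcher(current);
            if (!m.find()) {
                return current; // nothing left to expand
            }
            String name = m.group(1);
            String replacement = props.get(name);
            if (replacement == null) {
                replacement = System.getProperty(name);
            }
            if (replacement == null) {
                return current; // unknown variable: stop and leave the text as-is
            }
            current = current.substring(0, m.start()) + replacement + current.substring(m.end());
        }
        throw new IllegalStateException("Variable substitution depth too large: " + value);
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("basedir", "/user/${user.name}");
        props.put("tempdir", "${basedir}/tmp");
        System.out.println(expand(props.get("tempdir"), props));
    }
}
```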
EnumSet.of(CreateFlag.CREATE, CreateFlag.APPEND)
and pass it to {@link org.apache.hadoop.fs.FileSystem #create(Path f, FsPermission permission,
EnumSet flag, int bufferSize, short replication, long blockSize,
Progressable progress)}.
Combining {@link #OVERWRITE} with either {@link #CREATE}
or {@link #APPEND} has the same effect as using only
{@link #OVERWRITE}.
Combining {@link #CREATE} with {@link #APPEND} has the following semantics:
- create the file if it does not exist;
- append to the file if it already exists.
]]>
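The two combination rules above can be sketched with a local enum. The Flag and Action enums here are hypothetical stand-ins for illustration, not the real CreateFlag type, and the sketch covers only the combinations this description mentions.

```java
import java.util.EnumSet;

/** Hypothetical sketch of the flag-combination rules; not the real CreateFlag enum. */
final class CreateFlagSketch {
    enum Flag { CREATE, OVERWRITE, APPEND }

    enum Action { CREATE_NEW, APPEND_TO_EXISTING }

    /** OVERWRITE combined with CREATE or APPEND behaves like OVERWRITE alone. */
    static EnumSet<Flag> normalize(EnumSet<Flag> flags) {
        return flags.contains(Flag.OVERWRITE) ? EnumSet.of(Flag.OVERWRITE) : EnumSet.copyOf(flags);
    }

    /** CREATE + APPEND: create the file if absent, otherwise append to it. */
    static Action createOrAppend(boolean fileExists) {
        return fileExists ? Action.APPEND_TO_EXISTING : Action.CREATE_NEW;
    }

    public static void main(String[] args) {
        System.out.println(normalize(EnumSet.of(Flag.OVERWRITE, Flag.APPEND)));
        System.out.println(normalize(EnumSet.of(Flag.CREATE, Flag.APPEND)));
        System.out.println(createOrAppend(false));
        System.out.println(createOrAppend(true));
    }
}
```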
f
already exists
@throws FileNotFoundException If parent of f
does not exist
and createParent
is false
@throws ParentNotDirectoryException If parent of f
is not a
directory.
@throws UnsupportedFileSystemException If file system for f
is
not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server
RuntimeExceptions:
@throws InvalidPathException If path f
is not valid]]>
dir
does not exist
and createParent
is false
@throws ParentNotDirectoryException If parent of dir
is not a
directory
@throws UnsupportedFileSystemException If file system for dir
is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server
RuntimeExceptions:
@throws InvalidPathException If path dir
is not valid]]>
f
is
not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server
RuntimeExceptions:
@throws InvalidPathException If path f
is invalid]]>
f
is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
f
is
not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
If OVERWRITE option is not passed as an argument, rename fails if the dst already exists.
If OVERWRITE option is passed as an argument, rename overwrites the dst if it is a file or an empty directory. Rename fails if dst is a non-empty directory.
Note that atomicity of rename is dependent on the file system implementation. Please refer to the file system documentation for details.
@param src path to be renamed
@param dst new path after rename
@throws AccessControlException If access is denied
@throws FileAlreadyExistsException If dst
already exists and
options does not include the {@link Rename#OVERWRITE}
option.
@throws FileNotFoundException If
src
does not exist
@throws ParentNotDirectoryException If parent of dst
is not a
directory
@throws UnsupportedFileSystemException If file system for src
and dst
is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
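The destination rules for rename described above reduce to a small decision table. The sketch below models only the state of dst and the OVERWRITE option; the enum and method are hypothetical, and other failure causes (a missing src, permissions, and so on) are deliberately left out.

```java
/** Hypothetical decision table for rename with and without the OVERWRITE option. */
final class RenameSemanticsSketch {
    enum DstState { ABSENT, FILE, EMPTY_DIR, NON_EMPTY_DIR }

    /** Returns true if the rename is allowed as far as the destination is concerned. */
    static boolean renameAllowed(DstState dst, boolean overwrite) {
        switch (dst) {
            case ABSENT:
                return true;        // nothing in the way
            case FILE:
            case EMPTY_DIR:
                return overwrite;   // replaced only when OVERWRITE is passed
            default:
                return false;       // a non-empty directory is never overwritten
        }
    }

    public static void main(String[] args) {
        for (DstState state : DstState.values()) {
            System.out.println(state + " overwrite=false -> " + renameAllowed(state, false)
                    + ", overwrite=true -> " + renameAllowed(state, true));
        }
    }
}
```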
f
is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
f
is
not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server
RuntimeExceptions:
@throws HadoopIllegalArgumentException If username
or
groupname
is invalid.]]>
f
is
not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
f
is
not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
f
is
not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
f
is
not supported
@throws IOException If an I/O error occurred]]>
f
is
not supported
@throws IOException If an I/O error occurred]]>
f
is
not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server
RuntimeExceptions:
@throws InvalidPathException If path f
is invalid]]>
f
is
not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
link already exists
@throws FileNotFoundException If target
does not exist
@throws ParentNotDirectoryException If parent of link
is not a
directory.
@throws UnsupportedFileSystemException If file system for
target
or link
is not supported
@throws IOException If an I/O error occurred]]>
f
is
not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
The Hadoop file system supports a URI name space and URI names. It offers a forest of file systems that can be referenced using fully qualified URIs. Two common Hadoop file system implementations are
To facilitate this, Hadoop supports a notion of a default file system. Users can set their own default file system, although this is typically set up for you in your environment via your default config. A default file system implies a default scheme and authority; slash-relative names (such as /foo/bar) are resolved relative to that default FS. Similarly, a user can also have working-directory-relative names (i.e., names not starting with a slash). While the working directory is generally in the same default FS, the working directory can be in a different FS.
Hence Hadoop path names can be one of:
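The three kinds of names discussed above (fully qualified URIs, slash-relative names and working-directory-relative names) can be sketched with java.net.URI. The host names, ports and paths are made up for illustration, and the resolve helper is hypothetical rather than FileContext's own logic.

```java
import java.net.URI;

/** Hypothetical sketch of resolving names against a default FS and a working directory. */
final class PathResolutionSketch {

    static URI resolve(URI defaultFs, URI workingDir, String name) {
        URI candidate = URI.create(name);
        if (candidate.isAbsolute()) {
            return candidate;                     // fully qualified: used as given
        }
        if (name.startsWith("/")) {
            return defaultFs.resolve(candidate);  // slash-relative: default scheme and authority
        }
        return workingDir.resolve(candidate);     // relative to the working directory
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://nn1.example.com:8020/");
        // The working directory may live in a different file system than the default.
        URI workingDir = URI.create("hdfs://nn2.example.com:8020/user/alice/");
        System.out.println(resolve(defaultFs, workingDir, "hdfs://nn3.example.com:8020/data/x"));
        System.out.println(resolve(defaultFs, workingDir, "/foo/bar"));
        System.out.println(resolve(defaultFs, workingDir, "notes/today.txt"));
    }
}
```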
****The Role of the FileContext and configuration defaults****
The FileContext provides file namespace context for resolving file names; it also contains the umask for permissions. In that sense it is like the per-process file-related state in a Unix system. These two properties
The file system related SS defaults are
*** Usage Model for the FileContext class ***
Example 1: use the default config read from the $HADOOP_CONFIG/core.xml. Unspecified values come from core-defaults.xml in the release jar.
f
is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
pathPattern
is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
f
is
not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
A filename pattern is composed of regular characters and special pattern matching characters, which are:
pathPattern
is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
src
does not exist
@throws ParentNotDirectoryException If parent of dst
is not
a directory
@throws UnsupportedFileSystemException If file system for
src
or dst
is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server
RuntimeExceptions:
@throws InvalidPathException If path dst
is invalid]]>
If OVERWRITE option is not passed as an argument, rename fails if the dst already exists.
If OVERWRITE option is passed as an argument, rename overwrites the dst if it is a file or an empty directory. Rename fails if dst is a non-empty directory.
Note that atomicity of rename is dependent on the file system implementation. Please refer to the file system documentation for details. This default implementation is non-atomic.
This method is deprecated since it is a temporary method added to support the transition from FileSystem to FileContext for user applications.
@param src path to be renamed
@param dst new path after rename
@throws IOException on failure]]>
A filename pattern is composed of regular characters and special pattern matching characters, which are:
The local implementation is {@link LocalFileSystem} and distributed implementation is DistributedFileSystem.]]>
FilterFileSystem
itself simply overrides all methods of
FileSystem
with versions that
pass all requests to the contained file
system. Subclasses of FilterFileSystem
may further override some of these methods
and may also provide additional methods
and fields.]]>
pathname
should be included]]>
<property>
<name>fs.kfs.impl</name>
<value>org.apache.hadoop.fs.kfs.KosmosFileSystem</value>
<description>The FileSystem for kfs: uris.</description>
</property>
<property>
<name>fs.default.name</name>
<value>kfs://<server:port></value>
</property>
<property>
<name>fs.kfs.metaServerHost</name>
<value><server></value>
<description>The location of the KFS meta server.</description>
</property>
<property>
<name>fs.kfs.metaServerPort</name>
<value><port></value>
<description>The location of the meta server's port.</description>
</property>
export LD_LIBRARY_PATH=<path>
All files in the filesystem are migrated by re-writing the block metadata - no datafiles are touched.
]]>Files are stored in S3 as blocks (represented by {@link org.apache.hadoop.fs.s3.Block}), which have an ID and a length. Block metadata is stored in S3 as a small record (represented by {@link org.apache.hadoop.fs.s3.INode}) using the URL-encoded path string as a key. Inodes record the file type (regular file or directory) and the list of blocks. This design makes it easy to seek to any given position in a file by reading the inode data to compute which block to access, then using S3's support for HTTP Range headers to start streaming from the correct position. Renames are also efficient since only the inode is moved (by a DELETE followed by a PUT since S3 does not support renames).
For a single file /dir1/file1 which takes two blocks of storage, the file structure in S3 would be something like this:
/
/dir1
/dir1/file1
block-6415776850131549260
block-3026438247347758425
Inodes start with a leading /, while blocks are prefixed with block-.
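The seek step described above (read the inode's block list, then work out which block holds a given file position) can be sketched as follows. The block lengths are passed as a plain array and the class is hypothetical; the real INode and Block types are not used.

```java
/** Hypothetical sketch of mapping a file position to a block and an offset within it. */
final class BlockSeekSketch {

    /** Returns {blockIndex, offsetWithinBlock} for an absolute file position. */
    static long[] locate(long[] blockLengths, long position) {
        if (position < 0) {
            throw new IllegalArgumentException("negative position");
        }
        long blockStart = 0;
        for (int i = 0; i < blockLengths.length; i++) {
            long blockEnd = blockStart + blockLengths[i];
            if (position < blockEnd) {
                return new long[] { i, position - blockStart };
            }
            blockStart = blockEnd;
        }
        throw new IllegalArgumentException("position past end of file");
    }

    public static void main(String[] args) {
        long[] blockLengths = { 100, 50 }; // a file stored as two blocks
        long[] where = locate(blockLengths, 120);
        System.out.println("block " + where[0] + ", offset " + where[1]);
    }
}
```

The offset within the block is what a ranged read would then start from.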
f
is a file, this method will make a single call to S3.
If f
is a directory, this method will make a maximum of
(n / 1000) + 2 calls to S3, where n is the total number of
files and directories contained directly in f
.
]]>
Compared with ObjectWritable, this class is much more efficient,
because ObjectWritable appends the class declaration as a String
to the output file in every key-value pair.
GenericWritable implements the {@link Configurable} interface, so that it will be configured by the framework. The configuration is passed to the wrapped objects implementing the {@link Configurable} interface before deserialization.
How to use it: implement the abstract method getTypes(), which defines
the classes that will be wrapped in GenericObject in the application.
Attention: the classes returned by getTypes() must implement the Writable interface.
@since Nov 8, 2006]]>
public class GenericObject extends GenericWritable {

  private static Class[] CLASSES = {
    ClassType1.class,
    ClassType2.class,
    ClassType3.class,
  };

  protected Class[] getTypes() {
    return CLASSES;
  }

}
data
file,
containing all keys and values in the map, and a smaller index
file, containing a fraction of the keys. The fraction is determined by
{@link Writer#getIndexInterval()}.
The index file is read entirely into memory. Thus key implementations should try to keep themselves small.
Map files are created by adding entries in-order. To maintain a large database, perform updates by copying the previous version of a database and merging in a sorted change list, to create a new version of the database in a new file. Sorting large change lists can be done with {@link SequenceFile.Sorter}.]]>
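The data-file/index-file split described above can be sketched in memory: every entry goes into the sorted "data" lists, while only every Nth key is recorded in the "index". This is a simplified model with hypothetical names (including indexInterval as a constructor argument), not the on-disk MapFile format.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory model of a sorted data file with a sparse key index. */
final class SparseIndexSketch {
    private final List<String> keys = new ArrayList<>();            // "data file": all keys, in order
    private final List<String> values = new ArrayList<>();
    private final List<Integer> indexPositions = new ArrayList<>(); // "index file": every Nth position
    private final int indexInterval;

    SparseIndexSketch(int indexInterval) {
        this.indexInterval = indexInterval;
    }

    /** Entries must be appended in key order. */
    void append(String key, String value) {
        if (!keys.isEmpty() && key.compareTo(keys.get(keys.size() - 1)) < 0) {
            throw new IllegalArgumentException("key out of order: " + key);
        }
        if (keys.size() % indexInterval == 0) {
            indexPositions.add(keys.size());
        }
        keys.add(key);
        values.add(value);
    }

    /** Binary-searches the sparse index, then scans at most one interval of the data. */
    String get(String key) {
        int lo = 0;
        int hi = indexPositions.size() - 1;
        int start = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int pos = indexPositions.get(mid);
            if (keys.get(pos).compareTo(key) <= 0) {
                start = pos;      // last indexed key that is <= the wanted key, so far
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (start < 0) {
            return null;          // wanted key sorts before every stored key
        }
        for (int i = start; i < keys.size() && i < start + indexInterval; i++) {
            int cmp = keys.get(i).compareTo(key);
            if (cmp == 0) {
                return values.get(i);
            }
            if (cmp > 0) {
                break;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        SparseIndexSketch map = new SparseIndexSketch(3);
        for (String k : new String[] { "a", "b", "c", "d", "e", "f", "g" }) {
            map.append(k, "value-" + k);
        }
        System.out.println(map.get("e"));
        System.out.println(map.get("x"));
    }
}
```

A larger interval keeps fewer keys in memory at the cost of a longer scan per lookup.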
val
. Returns true if such a pair exists and false when at
the end of the map]]>
key
. Otherwise,
return the record that sorts just after.
@return - the key that was the closest match or null if eof.]]>
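The "closest match, otherwise the entry that sorts just after" rule described above has a close analogue in the JDK: TreeMap.ceilingKey. The sketch below uses it as an in-memory stand-in; the class and method names are hypothetical and no MapFile reader is involved.

```java
import java.util.TreeMap;

/** Hypothetical in-memory analogue of a closest-key lookup over sorted keys. */
final class ClosestKeySketch {

    /** Returns the matching key, else the next key in sort order, else null at the end. */
    static String closest(TreeMap<String, String> sorted, String key) {
        return sorted.ceilingKey(key);
    }

    public static void main(String[] args) {
        TreeMap<String, String> sorted = new TreeMap<>();
        sorted.put("a", "1");
        sorted.put("c", "3");
        sorted.put("e", "5");
        System.out.println(closest(sorted, "c"));
        System.out.println(closest(sorted, "b"));
        System.out.println(closest(sorted, "z"));
    }
}
```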
SequenceFile
provides {@link Writer}, {@link Reader} and
{@link Sorter} classes for writing, reading and sorting respectively.
SequenceFile
Writer
s based on the
{@link CompressionType} used to compress key/value pairs:
Writer
: Uncompressed records.
RecordCompressWriter
: Record-compressed files, only compress
values.
BlockCompressWriter
: Block-compressed files, both keys &
values are collected in 'blocks'
separately and compressed. The size of
the 'block' is configurable.
The actual compression algorithm used to compress key and/or values can be specified by using the appropriate {@link CompressionCodec}.
The recommended way is to use the static createWriter methods
provided by the SequenceFile
to choose the preferred format.
The {@link Reader} acts as the bridge and can read any of the above
SequenceFile
formats.
Essentially there are 3 different formats for SequenceFile
s
depending on the CompressionType
specified. All of them share a
common header described below.
CompressionCodec
class which is used for
compression of keys and/or values (if compression is
enabled).
100
bytes or so.
100
bytes or so.
100
bytes or so.
The compressed blocks of key lengths and value lengths consist of the actual lengths of individual keys/values encoded in ZeroCompressedInteger format.
@see CompressionCodec]]>val
. Returns true if such a pair exists and false when at
end of file]]>
key
, or null if no match exists.]]>
start
. The starting
position is measured in bytes and the return value is in
terms of byte position in the buffer. The backing buffer is
not converted to a string for this operation.
@return byte position of the first occurrence of the search
string in the UTF-8 buffer or -1 if not found]]>
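The distinction between byte positions and character positions matters once the text contains multi-byte characters. The sketch below searches a UTF-8 byte array directly, without converting it to a String; the class and method are hypothetical and only mirror the behaviour described above.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of finding a string inside a UTF-8 byte buffer by byte position. */
final class Utf8FindSketch {

    /** Returns the byte position of the first match at or after start, or -1 if not found. */
    static int find(byte[] buffer, String what, int start) {
        byte[] needle = what.getBytes(StandardCharsets.UTF_8);
        for (int i = Math.max(start, 0); i + needle.length <= buffer.length; i++) {
            int j = 0;
            while (j < needle.length && buffer[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // \u00e9 and \u00f6 each take two bytes in UTF-8.
        String text = "h\u00e9llo w\u00f6rld";
        byte[] buffer = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("byte position: " + find(buffer, "w\u00f6rld", 0));
        System.out.println("char position: " + text.indexOf("w\u00f6rld"));
    }
}
```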
Also includes utilities for serializing/deserializing a string, coding/decoding a string, checking if a byte array contains valid UTF8 code, and calculating the length of an encoded string.]]>
DataOutput
to serialize this object into.
@throws IOException]]>
For efficiency, implementations should attempt to re-use storage in the existing object where possible.
@param in DataInput
to deserialize this object from.
@throws IOException]]>
key
or value
type in the Hadoop Map-Reduce
framework implements this interface.
Implementations typically implement a static read(DataInput)
method which constructs a new instance, calls {@link #readFields(DataInput)}
and returns the instance.
Example:
]]>
public class MyWritable implements Writable {
  // Some data
  private int counter;
  private long timestamp;

  public void write(DataOutput out) throws IOException {
    out.writeInt(counter);
    out.writeLong(timestamp);
  }

  public void readFields(DataInput in) throws IOException {
    counter = in.readInt();
    timestamp = in.readLong();
  }

  public static MyWritable read(DataInput in) throws IOException {
    MyWritable w = new MyWritable();
    w.readFields(in);
    return w;
  }
}
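To see the write/readFields pair from the example working end to end without a Hadoop classpath, the sketch below round-trips the same two fields through JDK byte streams. The Counter class only mirrors the shape of the example and does not implement the real Writable interface; the helper names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical round trip of a Writable-shaped object through JDK data streams. */
final class WritableRoundTripSketch {

    /** Mirrors the shape of the example; does not implement Hadoop's Writable. */
    static final class Counter {
        int counter;
        long timestamp;

        void write(DataOutput out) throws IOException {
            out.writeInt(counter);
            out.writeLong(timestamp);
        }

        void readFields(DataInput in) throws IOException {
            counter = in.readInt();
            timestamp = in.readLong();
        }
    }

    static byte[] serialize(Counter c) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            c.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads into an existing instance, re-using its storage. */
    static Counter deserialize(byte[] data, Counter reuse) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            reuse.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return reuse;
    }

    public static void main(String[] args) {
        Counter original = new Counter();
        original.counter = 7;
        original.timestamp = 1234567890123L;
        byte[] data = serialize(original);
        Counter copy = deserialize(data, new Counter());
        System.out.println(data.length + " bytes, counter=" + copy.counter + ", timestamp=" + copy.timestamp);
    }
}
```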
WritableComparable
s can be compared to each other, typically
via Comparator
s. Any type which is to be used as a
key
in the Hadoop Map-Reduce framework should implement this
interface.
Example:
]]>
public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
  // Some data
  private int counter;
  private long timestamp;

  public void write(DataOutput out) throws IOException {
    out.writeInt(counter);
    out.writeLong(timestamp);
  }

  public void readFields(DataInput in) throws IOException {
    counter = in.readInt();
    timestamp = in.readLong();
  }

  public int compareTo(MyWritableComparable other) {
    int thisValue = this.counter;
    int thatValue = other.counter;
    return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
  }
}
One may optimize compare-intensive operations by overriding {@link #compare(byte[],int,int,byte[],int,int)}. Static utility methods are provided to assist in optimized implementations of this method.]]>
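The optimization mentioned above amounts to comparing keys in their serialized form, without first deserializing them into objects. A minimal sketch for a key whose serialized form starts with a 4-byte big-endian int (the layout DataOutput.writeInt produces); the class and the readInt helper are hypothetical, not the static utilities the text refers to.

```java
import java.nio.ByteBuffer;

/** Hypothetical raw comparator for keys serialized as a big-endian int. */
final class RawIntComparatorSketch {

    /** Decodes a big-endian int starting at offset s. */
    static int readInt(byte[] b, int s) {
        return ((b[s] & 0xff) << 24)
                | ((b[s + 1] & 0xff) << 16)
                | ((b[s + 2] & 0xff) << 8)
                | (b[s + 3] & 0xff);
    }

    /** Compares two serialized keys directly on their bytes; lengths are unused here. */
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    public static void main(String[] args) {
        byte[] five = ByteBuffer.allocate(4).putInt(5).array();
        byte[] minusThree = ByteBuffer.allocate(4).putInt(-3).array();
        System.out.println(compare(five, 0, 4, minusThree, 0, 4));
        System.out.println(compare(minusThree, 0, 4, five, 0, 4));
        System.out.println(compare(five, 0, 4, five, 0, 4));
    }
}
```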
Compressor
@param conf the Configuration
object containing the settings used to create or reinitialize the compressor
@return Compressor
for the given
CompressionCodec
from the pool or a new one]]>
Decompressor
@return Decompressor
for the given
CompressionCodec
from the pool or a new one]]>
true
if a preset dictionary is needed for decompression]]>
The behavior of TFile can be customized by the following variables through Configuration:
Suggestions on performance optimization.
Use {@link Scanner#advance()} to move the cursor to the next key-value pair (or end if none exists). Use seekTo methods ( {@link Scanner#seekTo(byte[])} or {@link Scanner#seekTo(byte[], int, int)}) to seek to any arbitrary location in the covered range (including backward seeking). Use {@link Scanner#rewind()} to seek back to the beginning of the scanner. Use {@link Scanner#seekToEnd()} to seek to the end of the scanner.
Actual keys and values may be obtained through {@link Scanner.Entry} object, which is obtained through {@link Scanner#entry()}.]]>
To add a new serialization framework write an implementation of {@link org.apache.hadoop.io.serializer.Serialization} and add its name to the "io.serializations" property.
]]>Use {@link org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization} for serialization of classes generated by Avro's 'specific' compiler.
Use {@link org.apache.hadoop.io.serializer.avro.AvroReflectSerialization} for other classes. {@link org.apache.hadoop.io.serializer.avro.AvroReflectSerialization} works for any class which is either in the package list configured via {@link org.apache.hadoop.io.serializer.avro.AvroReflectSerialization#AVRO_REFLECT_PACKAGES} or implements the {@link org.apache.hadoop.io.serializer.avro.AvroReflectSerializable} interface.
]]>org.apache.hadoop.metrics.spi
org.apache.hadoop.metrics.file
org.apache.hadoop.metrics.ganglia
private ContextFactory contextFactory = ContextFactory.getFactory();

void reportMyMetric(float myMetric) {
  MetricsContext myContext = contextFactory.getContext("myContext");
  MetricsRecord myRecord = myContext.getRecord("myRecord");
  myRecord.setMetric("myMetric", myMetric);
  myRecord.update();
}
In this example there are three names:
private MetricsRecord diskStats =
  contextFactory.getContext("myContext").getRecord("diskStats");

void reportDiskMetrics(String diskName, float diskBusy, float diskUsed) {
  diskStats.setTag("diskName", diskName);
  diskStats.setMetric("diskBusy", diskBusy);
  diskStats.setMetric("diskUsed", diskUsed);
  diskStats.update();
}
MetricsRecord.update()
is called. Instead it is stored in an
internal table, and the contents of the table are sent periodically.
This can be important for two reasons:
registerUpdater()
method. The benefit of this
versus using java.util.Timer
is that the callbacks will be done
immediately before sending the data, making the data as current as possible.
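The buffering and callback behaviour described above can be modelled in a few lines: updates land in an internal table, and on each period the registered callbacks run immediately before a snapshot is sent. This is a simplified, single-threaded model; the class, timerTick and sent are hypothetical names, and only registerUpdater echoes a name from the text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of buffered metrics with callbacks run just before each send. */
final class BufferedMetricsSketch {
    private final Map<String, Float> table = new LinkedHashMap<>();
    private final List<Runnable> updaters = new ArrayList<>();
    private final List<Map<String, Float>> sent = new ArrayList<>();

    /** Stores the value in the internal table; nothing is sent yet. */
    void update(String metric, float value) {
        table.put(metric, value);
    }

    void registerUpdater(Runnable updater) {
        updaters.add(updater);
    }

    /** Stands in for the periodic timer: run callbacks first, then send a snapshot. */
    void timerTick() {
        for (Runnable updater : updaters) {
            updater.run();
        }
        sent.add(new LinkedHashMap<>(table));
    }

    List<Map<String, Float>> sent() {
        return sent;
    }

    public static void main(String[] args) {
        BufferedMetricsSketch metrics = new BufferedMetricsSketch();
        metrics.update("diskBusy", 0.5f);
        metrics.update("diskBusy", 0.7f); // overwrites the buffered value; still nothing sent
        metrics.registerUpdater(() -> metrics.update("diskUsed", 0.9f));
        metrics.timerTick();
        System.out.println(metrics.sent());
    }
}
```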
ContextFactory factory = ContextFactory.getFactory();
... examine and/or modify factory attributes ...
MetricsContext context = factory.getContext("myContext");
The factory attributes can be examined and modified using the following
ContextFactory methods:
Object getAttribute(String attributeName)
String[] getAttributeNames()
void setAttribute(String name, Object value)
void removeAttribute(String attributeName)
ContextFactory.getFactory()
initializes the factory attributes by
reading the properties file hadoop-metrics.properties
if it exists
on the class path.
A factory attribute named:
contextName.class
should have as its value the fully qualified name of the class to be instantiated by a call of the
ContextFactory
method
getContext(contextName)
. If this factory attribute is not
specified, the default is to instantiate
org.apache.hadoop.metrics.file.FileContext
.
Other factory attributes are specific to a particular implementation of this
API and are documented elsewhere. For example, configuration attributes for
the file and Ganglia implementations can be found in the javadoc for
their respective packages.]]>
myContextName.fileName=/tmp/metrics.log
myContextName.period=5]]>
recordName
is not in that set.
@param recordName the name of the record
@throws MetricsException if recordName conflicts with configuration data]]>
emitRecord
method in order to transmit
the data. ]]>
remove()
.]]>
org.apache.hadoop.metrics.ganglia
.
Plugging in an implementation involves writing a concrete subclass of
AbstractMetricsContext
. The subclass should get its
configuration information using the getAttribute(attributeName)
method.]]>
DEPRECATED: Replaced by Avro.
recfile = *include module *record
include = "include" path
path = (relative-path / absolute-path)
module = "module" module-name
module-name = name *("." name)
record = "class" name "{" 1*(field) "}"
field = type name ";"
name = ALPHA *(ALPHA / DIGIT / "_")
type = (ptype / ctype)
ptype = ("byte" / "boolean" / "int" /
"long" / "float" / "double" /
"ustring" / "buffer")
ctype = (("vector" "<" type ">") /
("map" "<" type "," type ">") / name)
A DDL file describes one or more record types. It begins with zero or
more include declarations, a single mandatory module declaration
followed by zero or more class declarations. The semantics of each of
these declarations are described below:
module links {
class Link {
ustring URL;
boolean isRelative;
ustring anchorText;
};
}
include "links.jr"
module outlinks {
class OutLinks {
ustring baseURL;
vector<links.Link> outLinks;
};
}
$ rcc -l C++ ...
namespace hadoop {
enum RecFormat { kBinary, kXML, kCSV };
class InStream {
public:
virtual ssize_t read(void *buf, size_t n) = 0;
};
class OutStream {
public:
virtual ssize_t write(const void *buf, size_t n) = 0;
};
class IOError : public runtime_error {
public:
explicit IOError(const std::string& msg);
};
class IArchive;
class OArchive;
class RecordReader {
public:
RecordReader(InStream& in, RecFormat fmt);
virtual ~RecordReader(void);
virtual void read(Record& rec);
};
class RecordWriter {
public:
RecordWriter(OutStream& out, RecFormat fmt);
virtual ~RecordWriter(void);
virtual void write(Record& rec);
};
class Record {
public:
virtual std::string type(void) const = 0;
virtual std::string signature(void) const = 0;
protected:
virtual bool validate(void) const = 0;
virtual void
serialize(OArchive& oa, const std::string& tag) const = 0;
virtual void
deserialize(IArchive& ia, const std::string& tag) = 0;
};
}
namespace links {
class Link : public hadoop::Record {
// ....
};
};
Each field within the record will cause the generation of a private member
declaration of the appropriate type in the class declaration, and one or more
accessor methods. The generated class will implement the serialize and
deserialize methods defined in hadoop::Record. It will also
implement the inspection methods type and signature from
hadoop::Record. A default constructor and virtual destructor will also
be generated. Serialization code will read/write records into streams that
implement the hadoop::InStream and the hadoop::OutStream interfaces.
For each member of a record an accessor method is generated that returns
either the member or a reference to the member. For members that are returned
by value, a setter method is also generated. This is true for primitive
data members of the types byte, int, long, boolean, float and
double. For example, for an int field called MyField the following
code is generated.
...
private:
int32_t mMyField;
...
public:
int32_t getMyField(void) const {
return mMyField;
};
void setMyField(int32_t m) {
mMyField = m;
};
...
For a ustring, buffer, or composite field, the generated code
only contains accessors that return a reference to the field. A const
and a non-const accessor are generated. For example:
...
private:
std::string mMyBuf;
...
public:
std::string& getMyBuf() {
return mMyBuf;
};
const std::string& getMyBuf() const {
return mMyBuf;
};
...
module inclrec {
class RI {
int I32;
double D;
ustring S;
};
}
and the testrec.jr file contains:
include "inclrec.jr"
module testrec {
class R {
vector<float> VF;
RI Rec;
buffer Buf;
};
}
Then the invocation of rcc such as:
$ rcc -l c++ inclrec.jr testrec.jr
will result in generation of four files:
inclrec.jr.{cc,hh} and testrec.jr.{cc,hh}.
The inclrec.jr.hh will contain:
#ifndef _INCLREC_JR_HH_
#define _INCLREC_JR_HH_
#include "recordio.hh"
namespace inclrec {
class RI : public hadoop::Record {
private:
int32_t I32;
double D;
std::string S;
public:
RI(void);
virtual ~RI(void);
virtual bool operator==(const RI& peer) const;
virtual bool operator<(const RI& peer) const;
virtual int32_t getI32(void) const { return I32; }
virtual void setI32(int32_t v) { I32 = v; }
virtual double getD(void) const { return D; }
virtual void setD(double v) { D = v; }
virtual std::string& getS(void) { return S; }
virtual const std::string& getS(void) const { return S; }
virtual std::string type(void) const;
virtual std::string signature(void) const;
protected:
virtual void serialize(hadoop::OArchive& a) const;
virtual void deserialize(hadoop::IArchive& a);
};
} // end namespace inclrec
#endif /* _INCLREC_JR_HH_ */
The testrec.jr.hh file will contain:
#ifndef _TESTREC_JR_HH_
#define _TESTREC_JR_HH_
#include "inclrec.jr.hh"
namespace testrec {
class R : public hadoop::Record {
private:
std::vector<float> VF;
inclrec::RI Rec;
std::string Buf;
public:
R(void);
virtual ~R(void);
virtual bool operator==(const R& peer) const;
virtual bool operator<(const R& peer) const;
virtual std::vector<float>& getVF(void);
virtual const std::vector<float>& getVF(void) const;
virtual std::string& getBuf(void);
virtual const std::string& getBuf(void) const;
virtual inclrec::RI& getRec(void);
virtual const inclrec::RI& getRec(void) const;
virtual bool serialize(hadoop::OutArchive& a) const;
virtual bool deserialize(hadoop::InArchive& a);
virtual std::string type(void) const;
virtual std::string signature(void) const;
};
}; // end namespace testrec
#endif /* _TESTREC_JR_HH_ */
DDL Type     C++ Type      Java Type
boolean      bool          boolean
byte         int8_t        byte
int          int32_t       int
long         int64_t       long
float        float         float
double       double        double
ustring      std::string   java.lang.String
buffer       std::string   org.apache.hadoop.record.Buffer
class type   class type    class type
vector       std::vector   java.util.ArrayList
map          std::map      java.util.TreeMap
record = primitive / struct / vector / map
primitive = boolean / int / long / float / double / ustring / buffer
boolean = "T" / "F"
int = ["-"] 1*DIGIT
long = ";" ["-"] 1*DIGIT
float = ["-"] 1*DIGIT "." 1*DIGIT ["E" / "e" ["-"] 1*DIGIT]
double = ";" ["-"] 1*DIGIT "." 1*DIGIT ["E" / "e" ["-"] 1*DIGIT]
ustring = "'" *(UTF8 char except NULL, LF, % and , / "%00" / "%0a" / "%25" / "%2c" )
buffer = "#" *(BYTE except NULL, LF, % and , / "%00" / "%0a" / "%25" / "%2c" )
struct = "s{" record *("," record) "}"
vector = "v{" [record *("," record)] "}"
map = "m{" [*(record "," record)] "}"
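The grammar above can be illustrated with a small encoder. This is only a sketch that follows the productions as written here (leading quote for ustring, "T"/"F" for boolean, "v{...}" for vector); the class and method names are hypothetical and are not part of the Hadoop record I/O API, whose actual escaping rules may differ.

```java
import java.util.List;

/** Hypothetical helper that follows the CSV grammar above; not a Hadoop class. */
final class CsvRecordSketch {

    /** ustring: a leading quote, then %-escapes for NUL, LF, '%' and ','. */
    static String encodeUstring(String s) {
        StringBuilder sb = new StringBuilder("'");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\0': sb.append("%00"); break;
                case '\n': sb.append("%0a"); break;
                case '%':  sb.append("%25"); break;
                case ',':  sb.append("%2c"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** boolean = "T" / "F" */
    static String encodeBoolean(boolean b) {
        return b ? "T" : "F";
    }

    /** vector = "v{" [record *("," record)] "}"; elements must already be encoded. */
    static String encodeVector(List<String> encodedElements) {
        return "v{" + String.join(",", encodedElements) + "}";
    }
}
```

Because commas inside strings are escaped as %2c, the comma can serve unambiguously as the element separator in structs, vectors and maps.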
class {
int MY_INT; // value 5
vector<float> MY_VEC; // values 0.1, -0.89, 2.45e4
buffer MY_BUF; // value '\00\n\tabc%'
}
is serialized as
<value>
<struct>
<member>
<name>MY_INT</name>
<value><i4>5</i4></value>
</member>
<member>
<name>MY_VEC</name>
<value>
<array>
<data>
<value><ex:float>0.1</ex:float></value>
<value><ex:float>-0.89</ex:float></value>
<value><ex:float>2.45e4</ex:float></value>
</data>
</array>
</value>
</member>
<member>
<name>MY_BUF</name>
<value><string>%00\n\tabc%25</string></value>
</member>
</struct>
</value>
]]>
DEPRECATED: Replaced by Avro.
]]> The task requires the file or the nested fileset element to be
specified. Optional attributes are language (sets the output language;
the default is "java"), destdir (name of the destination directory for
generated java/c++ code; the default is ".") and failonerror (specifies
error handling behavior; the default is true).
<recordcc destdir="${basedir}/gensrc" language="java">
  <fileset include="**\/*.jr" />
</recordcc>
@deprecated Replaced by Avro.]]>
DEPRECATED: Replaced by Avro.
]]>Progressable
to explicitly report progress to the Hadoop framework. This is especially
important for operations which take a significant amount of time since,
in lieu of the reported progress, the framework has to assume that an error
has occurred and time out the operation.]]>
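A minimal sketch of the reporting pattern described above. The Progressable interface below is a local stand-in with the same single progress() method, declared here only so the example compiles without Hadoop on the classpath; LongOperationSketch and its reporting interval are hypothetical.

```java
/** Local stand-in for org.apache.hadoop.util.Progressable (assumption: same single method). */
interface Progressable {
    void progress();
}

/** Hypothetical long-running operation that reports progress at a fixed interval. */
final class LongOperationSketch {

    /**
     * Processes the given number of items and calls reporter.progress()
     * after every reportEvery items. Returns the number of reports made.
     */
    static int processAll(int items, int reportEvery, Progressable reporter) {
        int reports = 0;
        for (int i = 1; i <= items; i++) {
            // ... expensive per-item work would go here ...
            if (i % reportEvery == 0) {
                reporter.progress();   // tells the framework the task is still alive
                reports++;
            }
        }
        return reports;
    }
}
```

Reporting at an interval rather than on every item keeps the overhead small while still signalling liveness well within the framework's time-out.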
Class
of the given object.]]>
Tool
is the standard interface for any Map-Reduce tool/application.
The tool/application should delegate the handling of
standard command-line options to {@link ToolRunner#run(Tool, String[])}
and only handle its custom arguments.
Here is how a typical Tool
is implemented:
@see GenericOptionsParser @see ToolRunner]]>
public class MyApp extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // Configuration processed by ToolRunner
    Configuration conf = getConf();

    // Create a JobConf using the processed conf
    JobConf job = new JobConf(conf, MyApp.class);

    // Process custom command-line options
    Path in = new Path(args[1]);
    Path out = new Path(args[2]);

    // Specify various job-specific parameters
    job.setJobName("my-app");
    job.setInputPath(in);
    job.setOutputPath(out);
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);

    // Submit the job, then poll for progress until the job is complete
    JobClient.runJob(job);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // Let ToolRunner handle generic command-line options
    int res = ToolRunner.run(new Configuration(), new MyApp(), args);
    System.exit(res);
  }
}
Configuration
, or builds one if null.
Sets the Tool
's configuration with the possibly modified
version of the conf
.
@param conf Configuration
for the Tool
.
@param tool Tool
to run.
@param args command-line arguments to the tool.
@return exit code of the {@link Tool#run(String[])} method.]]>
Configuration
.
Equivalent to run(tool.getConf(), tool, args)
.
@param tool Tool
to run.
@param args command-line arguments to the tool.
@return exit code of the {@link Tool#run(String[])} method.]]>
ToolRunner
can be used to run classes implementing the
Tool
interface. It works in conjunction with
{@link GenericOptionsParser} to parse the
generic hadoop command line arguments and modifies the
Configuration
of the Tool
. The
application-specific options are passed along without being modified.
@see Tool
@see GenericOptionsParser]]>
The Bloom filter is a data structure that was introduced in 1970 and that has been adopted by the networking research community in the past decade thanks to the bandwidth efficiencies that it offers for the transmission of set membership information between networked hosts. A sender encodes the information into a bit vector, the Bloom filter, that is more compact than a conventional representation. Computation and space costs for construction are linear in the number of elements. The receiver uses the filter to test whether various elements are members of the set. Though the filter will occasionally return a false positive, it will never return a false negative. When creating the filter, the sender can choose its desired point in a trade-off between the false positive rate and the size.
Originally created by European Commission One-Lab Project 034819. @see Filter The general behavior of a filter @see Space/Time Trade-Offs in Hash Coding with Allowable Errors]]>
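The behavior described above (no false negatives, occasional false positives) can be sketched in a few lines. This is an illustration only, not Hadoop's BloomFilter class: the class name, the String key type and the double-hashing scheme are assumptions chosen to keep the example short.

```java
import java.util.BitSet;

/** Illustrative Bloom filter; the hashing scheme is an assumption, not Hadoop's. */
final class BloomSketch {
    private final BitSet bits;
    private final int m;   // size of the bit vector
    private final int k;   // number of hash positions per key

    BloomSketch(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    /** i-th position for a key, using double hashing: (h1 + i * h2) mod m. */
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1;   // odd step derived from h1 (arbitrary choice)
        return Math.floorMod(h1 + i * h2, m);
    }

    /** Sets the k bits associated with the key. */
    void add(String key) {
        for (int i = 0; i < k; i++) {
            bits.set(position(key, i));
        }
    }

    /** False means definitely absent; true means present or a false positive. */
    boolean mightContain(String key) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Raising m lowers the false positive rate at the cost of a larger vector, which is the size/accuracy trade-off the sender chooses when creating the filter.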
Invariant: nothing happens if the specified key does not belong to this counting Bloom filter. @param key The key to remove.]]>
NOTE: due to the bucket size of this filter, inserting the same
key more than 15 times will cause an overflow at all filter positions
associated with this key, and it will significantly increase the error
rate for this and other keys. For this reason the filter can only be
used to store small count values 0 <= N << 15
.
@param key key to be tested
@return 0 if the key is not present. Otherwise, a positive value v will
be returned such that v > count
with probability equal to the
error rate of this filter, and v == count
otherwise.
Additionally, if the filter experienced an underflow as a result of
{@link #delete(Key)} operation, the return value may be lower than the
count
with the probability of the false negative rate of such
filter.]]>
A counting Bloom filter is an improvement to a standard Bloom filter as it allows dynamic additions and deletions of set membership information. This is achieved through the use of a counting vector instead of a bit vector.
Originally created by European Commission One-Lab Project 034819. @see Filter The general behavior of a filter @see Summary cache: a scalable wide-area web cache sharing protocol]]>
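A minimal sketch of the counting idea, assuming small buckets that saturate at 15 as noted for the approximate count above. It is not Hadoop's CountingBloomFilter: the class name, the String keys, the hashing scheme and the one-byte-per-bucket layout are assumptions made for readability, and m is assumed to be a power of two so that the k positions of a key are distinct.

```java
/** Illustrative counting Bloom filter; layout and hashing are assumptions. */
final class CountingBloomSketch {
    private static final int MAX = 15;   // largest value a 4-bit bucket can hold
    private final byte[] buckets;        // one byte per bucket for clarity
    private final int k;

    CountingBloomSketch(int m, int k) {
        this.buckets = new byte[m];      // m assumed to be a power of two
        this.k = k;
    }

    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1;        // odd step, so positions differ when m is a power of two
        return Math.floorMod(h1 + i * h2, buckets.length);
    }

    void add(String key) {
        for (int i = 0; i < k; i++) {
            int p = position(key, i);
            if (buckets[p] < MAX) {
                buckets[p]++;            // saturate instead of overflowing
            }
        }
    }

    /** Smallest bucket value for the key: 0 if absent, otherwise an upper-biased count. */
    int approximateCount(String key) {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < k; i++) {
            min = Math.min(min, buckets[position(key, i)]);
        }
        return min;
    }

    /** Does nothing if the key does not appear to be in the filter. */
    void delete(String key) {
        if (approximateCount(key) == 0) {
            return;
        }
        for (int i = 0; i < k; i++) {
            int p = position(key, i);
            if (buckets[p] > 0 && buckets[p] < MAX) {
                buckets[p]--;            // saturated buckets are left alone: their true value is unknown
            }
        }
    }
}
```

Taking the minimum over the k buckets is what makes the estimate exact in the absence of collisions and an overestimate otherwise.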
A dynamic Bloom filter (DBF) makes use of an s * m
bit matrix but
each of the s
rows is a standard Bloom filter. The creation
process of a DBF is iterative. At the start, the DBF is a 1 * m
bit matrix, i.e., it is composed of a single standard Bloom filter.
It assumes that nr
elements are recorded in the
initial bit vector, where nr <= n
(n
is
the cardinality of the set A
to record in the filter).
As the size of A
grows during the execution of the application,
several keys must be inserted in the DBF. When inserting a key into the DBF,
one must first get an active Bloom filter in the matrix. A Bloom filter is
active when the number of recorded keys, nr
, is
strictly less than the current cardinality of A
, n
.
If an active Bloom filter is found, the key is inserted and
nr
is incremented by one. On the other hand, if there
is no active Bloom filter, a new one is created (i.e., a new row is added to
the matrix) according to the current size of A
and the element
is added in this new Bloom filter and the nr
value of
this new Bloom filter is set to one. A given key is said to belong to the
DBF if the k
positions are set to one in one of the matrix rows.
Originally created by European Commission One-Lab Project 034819. @see Filter The general behavior of a filter @see BloomFilter A Bloom filter @see Theory and Network Applications of Dynamic Bloom Filters]]>
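The row-growing procedure described above can be sketched as follows. This is an illustration under simplifying assumptions, not Hadoop's DynamicBloomFilter: the class name, String keys and hashing scheme are hypothetical, and the per-row capacity n is fixed at construction time.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Illustrative dynamic Bloom filter: a growing list of fixed-size rows. */
final class DynamicBloomSketch {

    /** One matrix row: a standard Bloom filter plus its recorded-key counter nr. */
    private static final class Row {
        final BitSet bits;
        int nr;
        Row(int m) { bits = new BitSet(m); }
    }

    private final List<Row> rows = new ArrayList<>();
    private final int m;   // bits per row
    private final int k;   // hash positions per key
    private final int n;   // maximum keys recorded per row

    DynamicBloomSketch(int m, int k, int n) {
        this.m = m;
        this.k = k;
        this.n = n;
        rows.add(new Row(m));              // starts as a 1 * m matrix
    }

    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1;
        return Math.floorMod(h1 + i * h2, m);
    }

    void add(String key) {
        Row active = rows.get(rows.size() - 1);   // only the last row can still be active
        if (active.nr >= n) {                      // no active row: add a new one
            active = new Row(m);
            rows.add(active);
        }
        for (int i = 0; i < k; i++) {
            active.bits.set(position(key, i));
        }
        active.nr++;
    }

    /** A key belongs to the DBF if all k positions are set in at least one row. */
    boolean mightContain(String key) {
        for (Row row : rows) {
            boolean all = true;
            for (int i = 0; i < k && all; i++) {
                all = row.bits.get(position(key, i));
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    int rowCount() {
        return rows.size();
    }
}
```

Capping each row at n keys keeps the per-row false positive rate bounded as the set grows, at the price of checking every row on lookup.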
Invariant: if the false positive is null
, nothing happens.
@param key The false positive key to add.]]>
It allows the removal of selected false positives at the cost of introducing random false negatives, and with the benefit of eliminating some random false positives at the same time.
Originally created by European Commission One-Lab Project 034819. @see Filter The general behavior of a filter @see BloomFilter A Bloom filter @see RemoveScheme The different selective clearing algorithms @see Retouched Bloom Filters: Allowing Networked Applications to Trade Off Selected False Positives Against False Negatives]]>
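A minimal sketch of selective clearing, assuming the simplest scheme: clear one randomly chosen bit of the false positive. It is not Hadoop's RetouchedBloomFilter; the class and method names, String keys and hashing scheme are hypothetical.

```java
import java.util.BitSet;
import java.util.Random;

/** Illustrative retouched Bloom filter with random selective clearing. */
final class RetouchedBloomSketch {
    private final BitSet bits;
    private final int m;
    private final int k;
    private final Random random;

    RetouchedBloomSketch(int m, int k, Random random) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
        this.random = random;
    }

    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1;
        return Math.floorMod(h1 + i * h2, m);
    }

    void add(String key) {
        for (int i = 0; i < k; i++) {
            bits.set(position(key, i));
        }
    }

    boolean mightContain(String key) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    /**
     * Clears one of the k bits of a known false positive so that it no longer
     * matches. Any recorded key sharing that bit becomes a false negative.
     * Does nothing if the key is null or does not currently match.
     */
    void selectiveClearing(String falsePositive) {
        if (falsePositive == null || !mightContain(falsePositive)) {
            return;
        }
        bits.clear(position(falsePositive, random.nextInt(k)));
    }
}
```

Other clearing schemes would pick the bit to reset so as to minimize the false negatives introduced or maximize the false positives removed; the random choice here is only the simplest case.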