<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

Apache Hadoop ${project.version}
================================

Apache Hadoop ${project.version} incorporates a number of significant
enhancements over the previous major release line (hadoop-2.x).

This is an alpha release to facilitate testing and the collection of
feedback from downstream application developers and users. There are
no guarantees regarding API stability or quality.

Overview
========

Users are encouraged to read the full set of release notes.
This page provides an overview of the major changes.

Minimum required Java version increased from Java 7 to Java 8
------------------

All Hadoop JARs are now compiled targeting a runtime version of Java 8.
Users still using Java 7 or below must upgrade to Java 8.

Support for erasure coding in HDFS
------------------

Erasure coding is a method for durably storing data with significant space
savings compared to replication. Standard encodings like Reed-Solomon (10,4)
have a 1.4x space overhead, compared to the 3x overhead of standard HDFS
replication.
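
The overhead figures above follow directly from the encoding parameters. A
quick back-of-the-envelope check (an illustration only, not Hadoop code):

```python
def storage_overhead(data_units, parity_units):
    """Raw bytes stored per byte of user data."""
    return (data_units + parity_units) / data_units

# Reed-Solomon (10,4): 14 blocks are stored for every 10 data blocks
print(storage_overhead(10, 4))  # 1.4
# 3x replication: the original block plus 2 extra copies
print(storage_overhead(1, 2))   # 3.0
```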
Since erasure coding imposes additional overhead during reconstruction
and performs mostly remote reads, it has traditionally been used for
storing colder, less frequently accessed data. Users should consider
the network and CPU overheads of erasure coding when deploying this
feature.

More details are available in the
[HDFS Erasure Coding](./hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html)
documentation.

YARN Timeline Service v.2
-------------------

We are introducing an early preview (alpha 2) of a major revision of YARN
Timeline Service: v.2. YARN Timeline Service v.2 addresses two major
challenges: improving scalability and reliability of Timeline Service, and
enhancing usability by introducing flows and aggregation.

YARN Timeline Service v.2 alpha 2 is provided so that users and developers
can test it and provide feedback and suggestions for making it a ready
replacement for Timeline Service v.1.x. It should be used only in a test
capacity.

More details are available in the
[YARN Timeline Service v.2](./hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html)
documentation.

Shell script rewrite
-------------------

The Hadoop shell scripts have been rewritten to fix many long-standing
bugs and include some new features. While an eye has been kept towards
compatibility, some changes may break existing installations.

Incompatible changes are documented in the release notes, with related
discussion on [HADOOP-9902](https://issues.apache.org/jira/browse/HADOOP-9902).

More details are available in the
[Unix Shell Guide](./hadoop-project-dist/hadoop-common/UnixShellGuide.html)
documentation. Power users will also be pleased by the
[Unix Shell API](./hadoop-project-dist/hadoop-common/UnixShellAPI.html)
documentation, which describes much of the new functionality, particularly
related to extensibility.

Shaded client jars
------------------

The `hadoop-client` Maven artifact available in 2.x releases pulls
Hadoop's transitive dependencies onto a Hadoop application's classpath.
This can be problematic if the versions of these transitive dependencies
conflict with the versions used by the application.

[HADOOP-11804](https://issues.apache.org/jira/browse/HADOOP-11804) adds
new `hadoop-client-api` and `hadoop-client-runtime` artifacts that
shade Hadoop's dependencies into a single jar. This avoids leaking
Hadoop's dependencies onto the application's classpath.

Support for Opportunistic Containers and Distributed Scheduling.
--------------------

A notion of `ExecutionType` has been introduced, whereby applications can
now request containers with an execution type of `Opportunistic`.
Containers of this type can be dispatched for execution at an NM even if
there are no resources available at the moment of scheduling. In such a
case, these containers will be queued at the NM, waiting for resources to
become available for them to start. Opportunistic containers have lower
priority than the default `Guaranteed` containers and are therefore
preempted, if needed, to make room for Guaranteed containers. This should
improve cluster utilization.
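
The queueing and preemption behavior can be sketched as a toy model. This is
an illustration of the idea only; the `NodeManager` class and `dispatch`
method below are invented for this sketch and do not mirror YARN's actual
classes or APIs:

```python
from collections import deque
from dataclasses import dataclass

GUARANTEED, OPPORTUNISTIC = "GUARANTEED", "OPPORTUNISTIC"

@dataclass
class Container:
    id: str
    execution_type: str

class NodeManager:
    """Toy model of NM-side container queuing (illustration only)."""
    def __init__(self, slots):
        self.slots = slots      # simplified resource model: fixed slot count
        self.running = []
        self.queue = deque()

    def dispatch(self, c):
        if len(self.running) < self.slots:
            self.running.append(c)
        elif c.execution_type == GUARANTEED:
            # Preempt a running opportunistic container to make room, if any.
            victim = next((r for r in self.running
                           if r.execution_type == OPPORTUNISTIC), None)
            if victim is not None:
                self.running.remove(victim)
                self.queue.appendleft(victim)
                self.running.append(c)
        else:
            # Opportunistic: queue at the NM until resources free up.
            self.queue.append(c)

nm = NodeManager(slots=1)
nm.dispatch(Container("opp-1", OPPORTUNISTIC))  # starts immediately
nm.dispatch(Container("g-1", GUARANTEED))       # preempts opp-1
```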
Opportunistic containers are by default allocated by the central RM, but
support has also been added to allow opportunistic containers to be
allocated by a distributed scheduler which is implemented as an
AMRMProtocol interceptor.

Please see [documentation](./hadoop-yarn/hadoop-yarn-site/OpportunisticContainers.html)
for more details.

MapReduce task-level native optimization
--------------------

MapReduce has added support for a native implementation of the map output
collector. For shuffle-intensive jobs, this can lead to a performance
improvement of 30% or more.

See the release notes for
[MAPREDUCE-2841](https://issues.apache.org/jira/browse/MAPREDUCE-2841)
for more detail.

Support for more than 2 NameNodes.
--------------------

The initial implementation of HDFS NameNode high-availability provided
for a single active NameNode and a single Standby NameNode. By replicating
edits to a quorum of three JournalNodes, this architecture is able to
tolerate the failure of any one node in the system.

However, some deployments require higher degrees of fault-tolerance.
This is enabled by this new feature, which allows users to run multiple
standby NameNodes. For instance, by configuring three NameNodes and
five JournalNodes, the cluster is able to tolerate the failure of two
nodes rather than just one.
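
The fault-tolerance figures are simple majority-quorum arithmetic. A sketch
(illustration only, not Hadoop code):

```python
def tolerated_journalnode_failures(journal_nodes):
    # Edits are committed once a strict majority of JournalNodes
    # acknowledge them, so the quorum survives while a majority is alive.
    return (journal_nodes - 1) // 2

def tolerated_namenode_failures(namenodes):
    # The cluster stays available as long as one NameNode can be active.
    return namenodes - 1

print(tolerated_journalnode_failures(3))  # 1
print(tolerated_journalnode_failures(5))  # 2
print(tolerated_namenode_failures(3))     # 2
```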
The [HDFS high-availability documentation](./hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html)
has been updated with instructions on how to configure more than two
NameNodes.

Default ports of multiple services have been changed.
------------------------

Previously, the default ports of multiple Hadoop services were in the
Linux ephemeral port range (32768-61000). This meant that at startup,
services would sometimes fail to bind to the port due to a conflict
with another application.
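
A quick way to see why such defaults were fragile (illustration only): any
fixed port inside the ephemeral range can be grabbed first by an outgoing
connection. For example, the NameNode web UI default moved from 50070 to
9870:

```python
# Typical Linux ephemeral port range, as cited above
EPHEMERAL_PORTS = range(32768, 61001)

def conflicts_with_ephemeral(port):
    return port in EPHEMERAL_PORTS

print(conflicts_with_ephemeral(50070))  # True  (old default, inside the range)
print(conflicts_with_ephemeral(9870))   # False (new default, outside the range)
```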
These conflicting ports have been moved out of the ephemeral range,
affecting the NameNode, Secondary NameNode, DataNode, and KMS. Our
documentation has been updated appropriately, but see the release
notes for [HDFS-9427](https://issues.apache.org/jira/browse/HDFS-9427) and
[HADOOP-12811](https://issues.apache.org/jira/browse/HADOOP-12811)
for a list of port changes.

Support for Microsoft Azure Data Lake and Aliyun Object Storage System filesystem connectors
---------------------

Hadoop now supports integration with Microsoft Azure Data Lake and
Aliyun Object Storage System as alternative Hadoop-compatible filesystems.

Intra-datanode balancer
-------------------

A single DataNode manages multiple disks. During normal write operation,
disks will be filled up evenly. However, adding or replacing disks can
lead to significant skew within a DataNode. This situation is not handled
by the existing HDFS balancer, which concerns itself with inter-, not intra-,
DN skew.

This situation is handled by the new intra-DataNode balancing
functionality, which is invoked via the `hdfs diskbalancer` CLI.
See the disk balancer section in the
[HDFS Commands Guide](./hadoop-project-dist/hadoop-hdfs/HDFSCommands.html)
for more information.

Reworked daemon and task heap management
---------------------

A series of changes have been made to heap management for Hadoop daemons
as well as MapReduce tasks.

[HADOOP-10950](https://issues.apache.org/jira/browse/HADOOP-10950) introduces
new methods for configuring daemon heap sizes.
Notably, auto-tuning is now possible based on the memory size of the host,
and the `HADOOP_HEAPSIZE` variable has been deprecated.
See the full release notes of HADOOP-10950 for more detail.
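
To illustrate the general shape of memory-based auto-tuning, here is a
purely hypothetical sizing rule. This is NOT the heuristic Hadoop's shell
scripts implement; the function, parameters, and constants below are
invented for illustration:

```python
def auto_heap_mb(host_memory_mb, fraction=0.25, floor_mb=1024, cap_mb=8192):
    # Hypothetical rule: take a fraction of host memory, clamped to a
    # sane floor and ceiling. Hadoop's actual logic differs.
    return max(floor_mb, min(cap_mb, int(host_memory_mb * fraction)))

print(auto_heap_mb(4096))   # 1024 (floor applies on a small host)
print(auto_heap_mb(65536))  # 8192 (cap applies on a large host)
```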
[MAPREDUCE-5785](https://issues.apache.org/jira/browse/MAPREDUCE-5785)
simplifies the configuration of map and reduce task
heap sizes, so the desired heap size no longer needs to be specified
in both the task configuration and as a Java option.
Existing configs that already specify both are not affected by this change.
See the full release notes of MAPREDUCE-5785 for more details.

Getting Started
===============

The Hadoop documentation includes the information you need to get started using
Hadoop. Begin with the
[Single Node Setup](./hadoop-project-dist/hadoop-common/SingleCluster.html)
which shows you how to set up a single-node Hadoop installation.
Then move on to the
[Cluster Setup](./hadoop-project-dist/hadoop-common/ClusterSetup.html)
to learn how to set up a multi-node Hadoop installation.