hadoop/hadoop-common-project/hadoop-common/src/site/markdown/Compatibility.md

35 KiB

Apache Hadoop Compatibility

Purpose

This document captures the compatibility goals of the Apache Hadoop project. The different types of compatibility between Hadoop releases that affect Hadoop developers, downstream projects, and end-users are enumerated. For each type of compatibility this document will:

  • describe the impact on downstream projects or end-users
  • where applicable, call out the policy adopted by the Hadoop developers when incompatible changes are permitted.

All Hadoop interfaces are classified according to the intended audience and stability in order to maintain compatibility with previous releases. See the Hadoop Interface Taxonomy for details about the classifications.

Target Audience

This document is intended for consumption by the Hadoop developer community. This document describes the lens through which changes to the Hadoop project should be viewed. In order for end users and third party developers to have confidence about cross-release compatibility, the developer community must ensure that development efforts adhere to these policies. It is the responsibility of the project committers to validate that all changes either maintain compatibility or are explicitly marked as incompatible.

Within a component Hadoop developers are free to use Private and Limited Private APIs, but when using components from a different module Hadoop developers should follow the same guidelines as third-party developers: do not use Private or Limited Private (unless explicitly allowed) interfaces and prefer instead Stable interfaces to Evolving or Unstable interfaces where possible. Where not possible, the preferred solution is to expand the audience of the API rather than introducing or perpetuating an exception to these compatibility guidelines. When working within a Maven module Hadoop developers should observe where possible the same level of restraint with regard to using components located in other Maven modules.

Above all, Hadoop developers must be mindful of the impact of their changes. Stable interfaces must not change between major releases. Evolving interfaces must not change between minor releases. New classes and components must be labeled appropriately for audience and stability. See the Hadoop Interface Taxonomy for details about when the various labels are appropriate. As a general rule, all new interfaces and APIs should have the most limited labels (e.g. Private Unstable) that will not inhibit the intent of the interface or API.

Notational Conventions

The key words "MUST" "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" are to be interpreted as described in RFC 2119.

Deprecation

The Java API provides a @Deprecated annotation to mark an API element as flagged for removal. The standard meaning of the annotation is that the API element should not be used and may be removed in a later version.

In all cases removing an element from an API is an incompatible change. In the case of Stable APIs, the change cannot be made between minor releases within the same major version. In addition, to allow consumers of the API time to adapt to the change, the API element to be removed should be marked as deprecated for a full major release before it is removed. For example, if a method is marked as deprecated in Hadoop 2.8, it cannot be removed until Hadoop 4.0.

Policy

Stable API elements MUST NOT be removed until they have been marked as deprecated (through the @Deprecated annotation or other appropriate documentation) for a full major release. In the case that an API element was introduced as deprecated (to indicate that it is a temporary measure that is intended to be removed) the API element MAY be removed in the following major release. When modifying a Stable API, developers SHOULD prefer introducing a new method or endpoint and deprecating the existing one to making incompatible changes to the method or endpoint.

Compatibility types

Java API

Developers SHOULD annotate all Hadoop interfaces and classes with the @InterfaceAudience and @InterfaceStability annotations to describe the intended audience and stability. Annotations may be at the package, class, or member variable or method level. Member variable and method annotations SHALL override class annotations, and class annotations SHALL override package annotations. A package, class, or member variable or method that is not annotated SHALL be interpreted as implicitly Private and Unstable.

  • @InterfaceAudience captures the intended audience. Possible values are Public (for end users and external projects), LimitedPrivate (for other Hadoop components, and closely related projects like YARN, MapReduce, HBase etc.), and Private (for intra component use).
  • @InterfaceStability describes what types of interface changes are permitted. Possible values are Stable, Evolving, and Unstable.
  • @Deprecated notes that the package, class, or member variable or method could potentially be removed in the future and should not be used.

Use Cases

  • Public-Stable API compatibility is required to ensure end-user programs and downstream projects continue to work without modification.
  • Public-Evolving API compatibility is useful to make functionality available for consumption before it is fully baked.
  • Limited Private-Stable API compatibility is required to allow upgrade of individual components across minor releases.
  • Private-Stable API compatibility is required for rolling upgrades.
  • Private-Unstable API compatibility allows internal components to evolve rapidly without concern for downstream consumers, and is how most interfaces should be labeled.

Policy

The compatibility policy SHALL be determined by the relevant package, class, or member variable or method annotations.

Note: APIs generated from the proto files MUST be compatible for rolling upgrades. See the section on wire protocol compatibility for more details. The compatibility policies for APIs and wire protocols must therefore go hand in hand.

Semantic compatibility

Apache Hadoop strives to ensure that the behavior of APIs remains consistent over versions, though changes for correctness may result in changes in behavior. API behavior SHALL be specified by the JavaDoc API documentation where present and complete. When JavaDoc API documentation is not available, behavior SHALL be specified by the behavior expected by the related unit tests. In cases with no JavaDoc API documentation or unit test coverage, the expected behavior is presumed to be obvious and SHOULD be assumed to be the minimum functionality implied by the interface naming. The community is in the process of specifying some APIs more rigorously and enhancing test suites to verify compliance with the specification, effectively creating a formal specification for the subset of behaviors that can be easily tested.

The behavior of any API MAY be changed to fix incorrect behavior according to the stability of the API, with such a change to be accompanied by updating existing documentation and tests and/or adding new documentation or tests.

Java Binary compatibility for end-user applications i.e. Apache Hadoop ABI

Apache Hadoop revisions SHOUD retain binary compatability such that end-user applications continue to work without any modifications. Minor Apache Hadoop revisions within the same major revision MUST retain compatibility such that existing MapReduce applications (e.g. end-user applications and projects such as Apache Pig, Apache Hive, et al), existing YARN applications (e.g. end-user applications and projects such as Apache Spark, Apache Tez et al), and applications that accesses HDFS directly (e.g. end-user applications and projects such as Apache HBase, Apache Flume, et al) work unmodified and without recompilation when used with any Apache Hadoop cluster within the same major release as the original build target.

For MapReduce applications in particular, i.e. applications using the org.apache.hadoop.mapred and/or org.apache.hadoop.mapreduce APIs, the developer community SHALL support binary compatibility across major releases. The MapReduce APIs SHALL be supported compatibly across major releases. See Compatibility for MapReduce applications between hadoop-1.x and hadoop-2.x for more details.

Some applications may be affected by changes to disk layouts or other internal changes. See the sections that follow for policies on how incompatible changes to non-API interfaces are handled.

Native Dependencies

Hadoop includes several native components, including compression, the container executor binary, and various native integrations. These native components introduce a set of native dependencies for Hadoop, both at compile time and at runtime, such as cmake, gcc, zlib, etc. This set of native dependencies is part of the Hadoop ABI.

Policy

The minimum required versions of the native components on which Hadoop depends at compile time and/or runtime SHALL be considered Evolving. Changes to the minimum required versions SHOULD NOT increase between minor releases within a major version, though updates because of security issues, license issues, or other reasons may occur. When the native components on which Hadoop depends must be updated between minor releases within a major release, where possible the changes SHOULD only change the minor versions of the components without changing the major versions.

Wire Protocols

Wire compatibility concerns data being transmitted "over the wire" between Hadoop processes. Hadoop uses Protocol Buffers for most RPC communication. Preserving compatibility requires prohibiting modification as described below. Non-RPC communication should be considered as well, for example using HTTP to transfer an HDFS image as part of snapshotting or transferring MapReduce map task output. The communications can be categorized as follows:

  • Client-Server: communication between Hadoop clients and servers (e.g., the HDFS client to NameNode protocol, or the YARN client to ResourceManager protocol).
  • Client-Server (Admin): It is worth distinguishing a subset of the Client-Server protocols used solely by administrative commands (e.g., the HAAdmin protocol) as these protocols only impact administrators who can tolerate changes that end users (which use general Client-Server protocols) cannot.
  • Server-Server: communication between servers (e.g., the protocol between the DataNode and NameNode, or NodeManager and ResourceManager)

Protocol Dependencies

The components of Apache Hadoop may have dependencies that include their own protocols, such as Zookeeper, S3, Kerberos, etc. These protocol dependencies SHALL be treated as internal protocols and governed by the same policy.

Transports

In addition to compatibility of the protocols themselves, maintaining cross-version communications requires that the transports supported also be stable. The most likely source of transport changes stems from secure transports, such as SSL. Upgrading a service from SSLv2 to SSLv3 may break existing SSLv2 clients. The minimum supported major version of any transports MUST not increase across minor releases within a major version.

Service ports are considered as part of the transport mechanism. Fixed service port numbers MUST be kept consistent to prevent breaking clients.

Policy

Hadoop wire protocols are defined in .proto (ProtocolBuffers) files. Client-Server and Server-Server protocols SHALL be classified according to the audience and stability classifications noted in their .proto files. In cases where no classifications are present, the protocols SHOULD be assumed to be Private and Stable.

The following changes to a .proto file SHALL be considered compatible:

  • Add an optional field, with the expectation that the code deals with the field missing due to communication with an older version of the code
  • Add a new rpc/method to the service
  • Add a new optional request to a Message
  • Rename a field
  • Rename a .proto file
  • Change .proto annotations that effect code generation (e.g. name of java package)

The following changes to a .proto file SHALL be considered incompatible:

  • Change an rpc/method name
  • Change an rpc/method parameter type or return type
  • Remove an rpc/method
  • Change the service name
  • Change the name of a Message
  • Modify a field type in an incompatible way (as defined recursively)
  • Change an optional field to required
  • Add or delete a required field
  • Delete an optional field as long as the optional field has reasonable defaults to allow deletions

The following changes to a .proto file SHALL be considered incompatible:

  • Change a field id
  • Reuse an old field that was previously deleted.

Hadoop wire protocols that are not defined via .proto files SHOULD be considered to be Private and Stable.

In addition to the limitations imposed by being Stable, Hadoop's wire protocols MUST also be forward compatible across minor releases within a major version according to the following:

  • Client-Server compatibility MUST be maintained so as to allow users to continue using older clients even after upgrading the server (cluster) to a later version (or vice versa). For example, a Hadoop 2.1.0 client talking to a Hadoop 2.3.0 cluster.
  • Client-Server compatibility MUST be maintained so as to allow users to upgrade the client before upgrading the server (cluster). For example, a Hadoop 2.4.0 client talking to a Hadoop 2.3.0 cluster. This allows deployment of client-side bug fixes ahead of full cluster upgrades. Note that new cluster features invoked by new client APIs or shell commands will not be usable. YARN applications that attempt to use new APIs (including new fields in data structures) that have not yet been deployed to the cluster can expect link exceptions.
  • Client-Server compatibility MUST be maintained so as to allow upgrading individual components without upgrading others. For example, upgrade HDFS from version 2.1.0 to 2.2.0 without upgrading MapReduce.
  • Server-Server compatibility MUST be maintained so as to allow mixed versions within an active cluster so the cluster may be upgraded without downtime in a rolling fashion.

New transport mechanisms MUST only be introduced with minor or major version changes. Existing transport mechanisms MUST continue to be supported across minor versions within a major version. Service port numbers MUST remain consistent across minor version numbers within a major version.

REST APIs

REST API compatibility applies to the REST endpoints (URLs) and response data format. Hadoop REST APIs are specifically meant for stable use by clients across releases, even major ones. The following is a non-exhaustive list of the exposed REST APIs:

Each API has an API-specific version number. Any incompatible changes MUST increment the API version number.

Policy

The Hadoop REST APIs SHALL be considered Public and Evolving. With respect to API version numbers, the Hadoop REST APIs SHALL be considered Public and Stable, i.e. no incompatible changes are allowed to within an API version number.

Log Output

The Hadoop daemons and CLIs produce log output via Log4j that is intended to aid administrators and developers in understanding and troubleshooting cluster behavior. Log messages are intended for human consumption, though automation use cases are also supported.

Policy

All log output SHALL be considered Public and Evolving.

Audit Log Output

Several components have audit logging systems that record system information in a machine readable format. Incompatible changes to that data format may break existing automation utilities. For the audit log, an incompatible change is any change that changes the format such that existing parsers no longer can parse the logs.

Policy

All audit log output SHALL be considered Public and Stable. Any change to the data format SHALL be considered an incompatible change.

Metrics/JMX

While the Metrics API compatibility is governed by Java API compatibility, the Metrics data format exposed by Hadoop MUST be maintained as compatible for consumers of the data, e.g. for automation tasks.

Policy

The data format exposed via Metrics SHALL be considered Public and Stable.

File formats & Metadata

User and system level data (including metadata) is stored in files of various formats. Changes to the metadata or the file formats used to store data/metadata can lead to incompatibilities between versions. Each class of file format is addressed below.

User-level file formats

Changes to formats that end users use to store their data can prevent them from accessing the data in later releases, and hence are important to be compatible. Examples of these formats include har, war, SequenceFileFormat, etc.

Policy

User-level file formats SHALL be considered Public and Stable. User-lever file format changes SHOULD be made forward compatible across major releases and MUST be made forward compatible within a major release. The developer community SHOULD prefer the creation of a new derivative file format to making incompatible changes to an existing file format. Such new file formats MUST be created as opt-in, meaning that users must be able to continue using the existing compatible format until and unless they explicitly opt in to using the new file format.

System-internal data schemas

Hadoop internal data may also be stored in files or other data stores. Changing the schemas of these data stores can lead to incompatibilities.

MapReduce

MapReduce uses formats like I-File to store MapReduce-specific data.

Policy

All MapReduce-internal file formats, such as I-File format or the job history server's jhist file format, SHALL be considered Private and Stable.

HDFS Metadata

HDFS persists metadata (the image and edit logs) in a private file format. Incompatible changes to either the format or the metadata prevent subsequent releases from reading older metadata. Incompatible changes MUST include a process by which existing metadata may be upgraded. Changes SHALL be allowed to require more than one upgrade. Incompatible changes MUST result in the metadata version number being incremented.

Depending on the degree of incompatibility in the changes, the following potential scenarios can arise:

  • Automatic: The image upgrades automatically, no need for an explicit "upgrade".
  • Direct: The image is upgradable, but might require one explicit release "upgrade".
  • Indirect: The image is upgradable, but might require upgrading to intermediate release(s) first.
  • Not upgradeable: The image is not upgradeable.

HDFS data nodes store data in a private directory structure. The schema of that directory structure must remain stable to retain compatibility.

Policy

The HDFS metadata format SHALL be considered Private and Evolving. Incompatible changes MUST include a process by which existing metada may be upgraded. The upgrade process MUST allow the cluster metadata to be rolled back to the older version and its older disk format. The rollback MUST restore the original data but is not REQUIRED to restore the updated data. Any incompatible change to the format MUST result in the major version number of the schema being incremented.

The data node directory format SHALL be considered Private and Evolving.

AWS S3A Guard Metadata

For each operation in the Hadoop S3 client (s3a) that reads or modifies file metadata, a shadow copy of that file metadata is stored in a separate metadata store, which offers HDFS-like consistency for the metadata, and may also provide faster lookups for things like file status or directory listings. S3A guard tables are created with a version marker which indicates compatibility.

Policy

The S3A guard metadata schema SHALL be considered Private and Unstable. Any incompatible change to the schema MUST result in the version number of the schema being incremented.

YARN Resource Manager State Store

The YARN resource manager stores information about the cluster state in an external state store for use in fail over and recovery. If the schema used for the state store data does not remain compatible, the resource manager will not be able to recover its state and will fail to start. The state store data schema includes a version number that indicates compatibility.

Policy

The YARN resource manager state store data schema SHALL be considered Private and Evolving. Any incompatible change to the schema MUST result in the major version number of the schema being incremented. Any compatible change to the schema MUST result in the minor version number being incremented.

YARN Node Manager State Store

The YARN node manager stores information about the node state in an external state store for use in recovery. If the schema used for the state store data does not remain compatible, the node manager will not be able to recover its state and will fail to start. The state store data schema includes a version number that indicates compatibility.

Policy

The YARN node manager state store data schema SHALL be considered Private and Evolving. Any incompatible change to the schema MUST result in the major version number of the schema being incremented. Any compatible change to the schema MUST result in the minor version number being incremented.

YARN Federation State Store

The YARN resource manager federation service stores information about the federated clusters, running applications, and routing policies in an external state store for use in replication and recovery. If the schema used for the state store data does not remain compatible, the federation service will fail to initialize. The state store data schema includes a version number that indicates compatibility.

Policy

The YARN federation service state store data schema SHALL be considered Private and Evolving. Any incompatible change to the schema MUST result in the major version number of the schema being incremented. Any compatible change to the schema MUST result in the minor version number being incremented.

Command Line Interface (CLI)

The Hadoop command line programs may be used either directly via the system shell or via shell scripts. The CLIs include both the user-facing commands, such as the hdfs command or the yarn command, and the admin-facing commands, such as the scripts used to start and stop daemons. Changing the path of a command, removing or renaming command line options, the order of arguments, or the command return codes and output break compatibility and adversely affect users.

Policy

All Hadoop CLI paths, usage, and output SHALL be considered Public and Stable. Note that the CLI output SHALL be considered distinct from the log output generated by the Hadoop CLIs. The latter SHALL be governed by the policy on log output. Note also that for CLI output, all changes SHALL be considered incompatible changes.

Web UI

Web UI, particularly the content and layout of web pages, changes could potentially interfere with attempts to screen scrape the web pages for information. The Hadoop Web UI pages, however, are not meant to be scraped, e.g. for automation purposes. Users are expected to use REST APIs to programmatically access cluster information.

Policy

The Hadoop Web UI SHALL be considered Public and Unstable.

Hadoop Configuration Files

Users use Hadoop-defined properties to configure and provide hints to Hadoop and custom properties to pass information to jobs. Users are encouraged to avoid using custom configuration property names that conflict with the namespace of Hadoop-defined properties and should avoid using any prefixes used by Hadoop, e.g. hadoop, io, ipc, fs, net, file, ftp, s3, kfs, ha, file, dfs, mapred, mapreduce, and yarn.

Policy

Hadoop-defined properties (names and meanings) SHALL be considered Public and Stable. The units implied by a Hadoop-defined property MUST NOT change, even across major versions. Default values of Hadoop-defined properties SHALL be considered Public and Evolving.

Log4j Configuration Files

The log output produced by Hadoop daemons and CLIs is governed by a set of configuration files. These files control the minimum level of log message that will be output by the various components of Hadoop, as well as where and how those messages are stored.

Policy

All Log4j configurations SHALL be considered Public and Evolving.

Directory Structure

Source code, artifacts (source and tests), user logs, configuration files, output, and job history are all stored on disk either local file system or HDFS. Changing the directory structure of these user-accessible files can break compatibility, even in cases where the original path is preserved via symbolic links (such as when the path is accessed by a servlet that is configured to not follow symbolic links).

Policy

The layout of source code and build artifacts SHALL be considered Private and Unstable. Within a major version, the developer community SHOULD preserve the overall directory structure, though individual files MAY be added, moved, or deleted with no warning.

The directory structure of configuration files, user logs, and job history SHALL be considered Public and Evolving.

Java Classpath

Hadoop provides several client artifacts that applications use to interact with the system. These artifacts typically have their own dependencies on common libraries. In the cases where these dependencies are exposed to end user applications or downstream consumers (i.e. not shaded) changes to these dependencies can be disruptive. Developers are strongly encouraged to avoid exposing dependencies to clients by using techniques such as shading.

With regard to dependencies, adding a dependency is an incompatible change, whereas removing a dependency is a compatible change.

Some user applications built against Hadoop may add all Hadoop JAR files (including Hadoop's library dependencies) to the application's classpath. Adding new dependencies or updating the versions of existing dependencies may interfere with those in applications' classpaths and hence their correct operation. Users are therefore discouraged from adopting this practice.

Policy

The set of dependencies exposed by the Hadoop client artifacts SHALL be considered Public and Stable. Any dependencies that are not exposed to clients (either because they are shaded or only exist in non-client artifacts) SHALL be considered Private and Unstable

Environment variables

Users and related projects often utilize the environment variables exported by Hadoop (e.g. HADOOP_CONF_DIR). Removing or renaming environment variables can therefore impact end user applications.

Policy

The environment variables consumed by Hadoop and the environment variables made accessible to applications through YARN SHALL be considered Public and Evolving. The developer community SHOULD limit changes to major releases.

Build artifacts

Hadoop uses Maven for project management. Changes to the contents of generated artifacts can impact existing user applications.

Policy

The contents of Hadoop test artifacts SHALL be considered Private and Unstable. Test artifacts include all JAR files generated from test source code and all JAR files that include "tests" in the file name.

The Hadoop client artifacts SHALL be considered Public and Stable. Client artifacts are the following:

  • hadoop-client
  • hadoop-client-api
  • hadoop-client-minicluster
  • hadoop-client-runtime
  • hadoop-hdfs-client
  • hadoop-hdfs-native-client
  • hadoop-mapreduce-client-app
  • hadoop-mapreduce-client-common
  • hadoop-mapreduce-client-core
  • hadoop-mapreduce-client-hs
  • hadoop-mapreduce-client-hs-plugins
  • hadoop-mapreduce-client-jobclient
  • hadoop-mapreduce-client-nativetask
  • hadoop-mapreduce-client-shuffle
  • hadoop-yarn-client

All other build artifacts SHALL be considered Private and Unstable.

Hardware/Software Requirements

To keep up with the latest advances in hardware, operating systems, JVMs, and other software, new Hadoop releases may include features that require newer hardware, operating systems releases, or JVM versions than previous Hadoop releases. For a specific environment, upgrading Hadoop might require upgrading other dependent software components.

Policies

  • Hardware
    • Architecture: The community has no plans to restrict Hadoop to specific architectures, but can have family-specific optimizations.
    • Minimum resources: While there are no guarantees on the minimum resources required by Hadoop daemons, the developer community SHOULD avoid increasing requirements within a minor release.
  • Operating Systems: The community SHOULD maintain the same minimum OS requirements (OS kernel versions) within a minor release. Currently GNU/Linux and Microsoft Windows are the OSes officially supported by the community, while Apache Hadoop is known to work reasonably well on other OSes such as Apple MacOSX, Solaris, etc.
  • The JVM requirements SHALL NOT change across minor releases within the same major release unless the JVM version in question becomes unsupported. The JVM version requirement MAY be different for different operating systems or even operating system releases.
  • File systems supported by Hadoop, e.g. through the HDFS FileSystem API, SHOULD not become unsupported between minor releases within a major version unless a migration path to an alternate client implementation is available.

References

Here are some relevant JIRAs and pages related to the topic: