Docker images for building Hadoop

This folder contains the Dockerfiles for building Hadoop on various platforms.

Dependency management

The mode of installation of the dependencies needed for building Hadoop varies from one platform to another, since different platforms have different toolchains. Some packages tend to be polymorphic across platforms and, most commonly, a package that's readily available in one platform's toolchain isn't available on another. We thus resort to building and installing the package from source, which causes code duplication since this needs to be done in the Dockerfiles for all the platforms. We need a system to track a dependency for a package, for a platform and (optionally) for a release. Thus, there's a lot of diversity that needs to be handled when managing package dependencies, and pkg-resolver caters to that.

Supported platforms

pkg-resolver/platforms.json contains a list of the supported platforms for dependency management.

Package dependencies

pkg-resolver/packages.json maps a dependency to a given platform. Here's the schema of this JSON.

{
  "dependency_1": {
    "platform_1": "package_1",
    "platform_2": [
      "package_1",
      "package_2"
    ]
  },
  "dependency_2": {
    "platform_1": [
      "package_1",
      "package_2",
      "package_3"
    ]
  },
  "dependency_3": {
    "platform_1": {
      "release_1": "package_1_1_1",
      "release_2": [
        "package_1_2_1",
        "package_1_2_2"
      ]
    },
    "platform_2": [
      "package_2_1",
      {
        "release_1": "package_2_1_1"
      }
    ]
  }
}

The root JSON element contains unique dependency children. Each dependency in turn contains the names of the platforms and the list of packages to be installed for each platform. As an example, the above JSON is interpreted as follows -

  1. For dependency_1, package_1 needs to be installed for platform_1.
  2. For dependency_1, package_1 and package_2 need to be installed for platform_2.
  3. For dependency_2, package_1, package_2 and package_3 need to be installed for platform_1.
  4. For dependency_3, package_1_1_1 gets installed only if release_1 has been specified for platform_1.
  5. For dependency_3, the packages package_1_2_1 and package_1_2_2 get installed only if release_2 has been specified for platform_1.
  6. For dependency_3, for platform_2, package_2_1 is always installed, but package_2_1_1 gets installed only if release_1 has been specified.
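
The interpretation rules above can be sketched in a few lines of Python. This is a minimal illustration, not the actual code of pkg-resolver/resolve.py: the function names `flatten` and `resolve` are hypothetical, and the sketch assumes that a release-keyed object contributes packages only when the requested release label matches one of its keys. It also makes no attempt to remove duplicate package names.

```python
import json

# The schema example from this README, used as sample input.
PACKAGES_JSON = json.loads("""
{
  "dependency_1": {
    "platform_1": "package_1",
    "platform_2": ["package_1", "package_2"]
  },
  "dependency_2": {
    "platform_1": ["package_1", "package_2", "package_3"]
  },
  "dependency_3": {
    "platform_1": {
      "release_1": "package_1_1_1",
      "release_2": ["package_1_2_1", "package_1_2_2"]
    },
    "platform_2": ["package_2_1", {"release_1": "package_2_1_1"}]
  }
}
""")


def flatten(entry, release=None):
    """Return the list of packages described by one platform entry.

    Hypothetical helper: an entry may be a package name (str), a list of
    entries, or an object mapping release labels to further entries.
    """
    if isinstance(entry, str):
        return [entry]
    if isinstance(entry, list):
        packages = []
        for item in entry:
            packages.extend(flatten(item, release))
        return packages
    if isinstance(entry, dict):
        # Release-keyed packages apply only when that release was requested.
        if release is not None and release in entry:
            return flatten(entry[release], release)
        return []
    raise TypeError(f"Unsupported entry type: {type(entry).__name__}")


def resolve(packages_json, platform, release=None):
    """Collect the packages of every dependency for the given platform."""
    resolved = []
    for platforms in packages_json.values():
        if platform in platforms:
            resolved.extend(flatten(platforms[platform], release))
    return resolved


print(resolve(PACKAGES_JSON, "platform_2"))
# ['package_1', 'package_2', 'package_2_1']
print(resolve(PACKAGES_JSON, "platform_2", "release_1"))
# ['package_1', 'package_2', 'package_2_1', 'package_2_1_1']
```

The recursion mirrors the three shapes a platform entry can take in the schema, so nesting a release object inside a list (as for platform_2 of dependency_3) needs no special case.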

Tool help

$ pkg-resolver/resolve.py -h
usage: resolve.py [-h] [-r RELEASE] platform

Platform package dependency resolver for building Apache Hadoop

positional arguments:
  platform              The name of the platform to resolve the dependencies for

optional arguments:
  -h, --help            show this help message and exit
  -r RELEASE, --release RELEASE
                        The release label to filter the packages for the given platform
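
For readers who want to see how such a command line could be declared, here is a minimal `argparse` sketch that accepts the same arguments as the help text above. It is an assumption about how the interface might be wired, not a copy of resolve.py; the function name `build_parser` is hypothetical, and `platform_1` and `release_1` are the placeholder names from the schema example rather than real entries of platforms.json.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented resolve.py interface."""
    parser = argparse.ArgumentParser(
        prog="resolve.py",
        description="Platform package dependency resolver for building Apache Hadoop",
    )
    # Required positional argument: the platform to resolve for.
    parser.add_argument(
        "platform",
        help="The name of the platform to resolve the dependencies for",
    )
    # Optional release label; it stays None when the flag is omitted.
    parser.add_argument(
        "-r",
        "--release",
        help="The release label to filter the packages for the given platform",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["platform_1", "--release", "release_1"])
print(args.platform, args.release)
```

Leaving `--release` without a default means the caller can distinguish "no release filter" (`None`) from any real label.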

Standalone packages

Some packages are not available in the toolchains of all platforms, so we need to build and install them from source. Since this has to be done in the Dockerfiles for all the platforms, it could lead to code duplication that is a hassle to manage. Thus, we put the build steps in a pkg-resolver/install-<package>.sh script and invoke it from all the Dockerfiles.