Docker images for building Hadoop
This folder contains the Dockerfiles for building Hadoop on various platforms.
Dependency management
The mode of installation of the dependencies needed for building Hadoop varies from one platform to the other. Different platforms have different toolchains. Some packages tend to be polymorphic across platforms and most commonly, a package that's readily available in one platform's toolchain isn't available on another. We thus, resort to building and installing the package from source, causing duplication of code since this needs to be done for all the Dockerfiles pertaining to all the platforms. We need a system to track a dependency - for a package - for a platform
- (and optionally) for a release. Thus, there's a lot of diversity that needs to be handled for
managing package dependencies and
pkg-resolver
caters to that.
Supported platforms
pkg-resolver/platforms.json
contains a list of the supported platforms for dependency management.
Package dependencies
pkg-resolver/packages.json
maps a dependency to a given platform. Here's the schema of this JSON.
{
"dependency_1": {
"platform_1": "package_1",
"platform_2": [
"package_1",
"package_2"
]
},
"dependency_2": {
"platform_1": [
"package_1",
"package_2",
"package_3"
]
},
"dependency_3": {
"platform_1": {
"release_1": "package_1_1_1",
"release_2": [
"package_1_2_1",
"package_1_2_2"
]
},
"platform_2": [
"package_2_1",
{
"release_1": "package_2_1_1"
}
]
}
}
The root JSON element contains unique dependency children. This in turn contains the name of the _ platforms_ and the list of packages to be installed for that platform. Just to give an example of how to interpret the above JSON -
- For
dependency_1
,package_1
needs to be installed forplatform_1
. - For
dependency_2
,package_1
andpackage_2
needs to be installed forplatform_2
. - For
dependency_2
,package_1
,package_3
andpackage_3
needs to be installed forplatform_1
. - For
dependency_3
,package_1_1_1
gets installed only ifrelease_1
has been specified forplatform_1
. - For
dependency_3
, the packagespackage_1_2_1
andpackage_1_2_2
gets installed only ifrelease_2
has been specified forplatform_1
. - For
dependency_3
, forplatform_2
,package_2_1
is always installed, butpackage_2_1_1
gets installed only ifrelease_1
has been specified.
Tool help
$ pkg-resolver/resolve.py -h
usage: resolve.py [-h] [-r RELEASE] platform
Platform package dependency resolver for building Apache Hadoop
positional arguments:
platform The name of the platform to resolve the dependencies for
optional arguments:
-h, --help show this help message and exit
-r RELEASE, --release RELEASE
The release label to filter the packages for the given platform
Standalone packages
Most commonly, some packages are not available across the toolchains in various platforms. Thus, we
would need to build and install them. Since we need to do this across all the Dockerfiles for all
the platforms, it could lead to code duplication and managing them becomes a hassle. Thus, we put
the build steps in a pkg-resolver/install-<package>.sh
and invoke this in all the Dockerfiles.