By default, the mapreduce manifest committer is used for jobs working with abfs and gcs.
Hadoop mapreduce will pick this up automatically; for Spark it is a bit complicated: read the docs
to see the steps required.
This modifies the manifest committer so that the list of files
to rename is passed between stages as a file of
writeable entries on the local filesystem.
The map of directories to create is still passed in memory;
this map is built across all tasks, so even if many tasks
created files, if they all write into the same set of directories
the memory needed is O(directories) with the
task count not a factor.
The _SUCCESS file reports on heap size through gauges.
This should give a warning if there are problems.
Contributed by Steve Loughran
This:
1. Adds optLong, optDouble, mustLong and mustDouble
methods to the FSBuilder interface to let callers explicitly
passin long and double arguments.
2. The opt() and must() builder calls which take float/double values
now only set long values instead, so as to avoid problems
related to overloaded methods resulting in a ".0" being appended
to a long value.
3. All of the relevant opt/must calls in the hadoop codebase move to
the new methods
4. And the s3a code is resilient to parse errors in is numeric options
-it will downgrade to the default.
This is nominally incompatible, but the floating-point builder methods
were never used: nothing currently expects floating point numbers.
For anyone who wants to safely set numeric builder options across all compatible
releases, convert the number to a string and then use the opt(String, String)
and must(String, String) methods.
Contributed by Steve Loughran
The HDFS lease APIs have been replicated as interfaces in hadoop-common so other filesystems can
also implement them. Applications which use the leasing APIs should migrate to the new
interface where possible.
Contributed by Stephen Wu
Removed JUnit APIs from WebServicesTestUtils and TestContainerLogsUtils.
They are used by MapReduce modules as well as YARN modules, so the
APIs need to be removed to upgrade the JUnit version on a per-module basis.
Also, this effectively reverts the prior fix in #5209 because it didn't actually
fix the issue.
Declares its compatibility with Spark's dynamic
output partitioning by having the stream capability
"mapreduce.job.committer.dynamic.partitioning"
Requires a Spark release with SPARK-40034, which
does the probing before deciding whether to
accept/rejecting instantiation with
dynamic partition overwrite set
This feature can be declared as supported by
any other PathOutputCommitter implementations
whose algorithm and destination filesystem
are compatible.
None of the S3A committers are compatible.
The classic FileOutputCommitter is, but it
does not declare itself as such out of our fear
of changing that code. The Spark-side code
will automatically infer compatibility if
the created committer is of that class or
a subclass.
Contributed by Steve Loughran.
This downgrades jackson from the version switched to in
HADOOP-18033 (2.13.0), to Jackson 2.12.7.
This removes the dependency on javax.ws.rs-api,
so avoiding runtime problems with applications using
jersey-core v1 and/or jsr311-api.
The 2.12.7 release still contains the fix for CVE-2020-36518.
Contributed by PJ Fanning