hadoop/hadoop-tools/hadoop-distcp
Steve Loughran 20eec95867
HADOOP-16932. distcp copy calls getFileStatus() needlessly and can fail against S3 (#1936)
Contributed by Steve Loughran.

Before deciding whether to query the far end for the status of the
(existing/uploaded) file, this strips out all the -p preservation options
which were already processed when the file was uploaded. The query is then
only needed to see if any other attributes need changing.

This will avoid 404 caching-related issues in S3, wherein a newly created
file can have a 404 entry in the S3 load balancer's cache from the
probes for the file's existence prior to the upload.

It partially addresses a regression caused by HADOOP-8143,
"Change distcp to have -pb on by default", which caused HADOOP-13145,
"In DistCp, prevent unnecessary getFileStatus call when not preserving
metadata", to resurface.
2020-04-07 17:55:55 +01:00
src HADOOP-16932. distcp copy calls getFileStatus() needlessly and can fail against S3 (#1936) 2020-04-07 17:55:55 +01:00
pom.xml Preparing for 3.4.0 development 2020-03-29 23:24:25 +05:30
README HADOOP-11437. Remove the version and author information from distcp's README file (Brahma Reddy Battula via aw) 2015-02-11 15:47:36 -08:00

DistCp (distributed copy) is a tool for large inter-cluster and intra-cluster
copying. It uses MapReduce for distribution, error handling and recovery,
and reporting. It expands a list of files and directories into input for map
tasks, each of which copies a partition of the files specified in the source list.
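
The partitioning step described above can be illustrated with a small, standalone sketch. This is not DistCp's own code: the class name CopyListingPartitioner, the Entry type, and the partition method are hypothetical names introduced here, and the strategy (contiguous chunks of roughly equal total bytes) is an assumption chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the hadoop-distcp module.
class CopyListingPartitioner {

    /** One entry of the expanded copy listing: a source path and its length. */
    static final class Entry {
        final String path;
        final long bytes;

        Entry(String path, long bytes) {
            this.path = path;
            this.bytes = bytes;
        }
    }

    /**
     * Splits the listing into at most numMaps contiguous partitions,
     * aiming for a similar number of bytes in each one.
     */
    static List<List<Entry>> partition(List<Entry> listing, int numMaps) {
        if (numMaps <= 0) {
            throw new IllegalArgumentException("numMaps must be positive");
        }
        long total = 0;
        for (Entry e : listing) {
            total += e.bytes;
        }
        // Target bytes per map task, rounded up; at least 1 so that
        // a listing of empty files still produces sensible partitions.
        long target = Math.max(1, (total + numMaps - 1) / numMaps);

        List<List<Entry>> partitions = new ArrayList<>();
        List<Entry> current = new ArrayList<>();
        long currentBytes = 0;
        for (Entry e : listing) {
            current.add(e);
            currentBytes += e.bytes;
            // Close the partition once it reaches the target, but keep
            // the last partition open so it absorbs any remainder.
            if (currentBytes >= target && partitions.size() < numMaps - 1) {
                partitions.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
        }
        if (!current.isEmpty()) {
            partitions.add(current);
        }
        return partitions;
    }
}
```

In this sketch, calling partition with four entries of 100, 100, 50 and 150 bytes and numMaps set to 2 yields two partitions of two entries each; each partition stands in for the work handed to one map task.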