200 Commits

Author SHA1 Message Date
Steve Loughran
6999acf520
HADOOP-16202. Enhanced openFile(): mapreduce and YARN changes. (#2584/2)
These changes ensure that sequential files are opened with the
right read policy, and split start/end is passed in.

As well as offering opportunities for filesystem clients to
choose fetch/cache/seek policies, the settings ensure that
text files on an s3 bucket where the default policy
is "random" will still be processed efficiently.

This commit depends on the associated hadoop-common patch,
which must be committed first.

Contributed by Steve Loughran.

Change-Id: Ic6713fd752441cf42ebe8739d05c2293a5db9f94
2022-04-24 17:33:05 +01:00
GuoPhilipse
214f369073
HDFS-16556. Fix typos in distcp (#4217) 2022-04-22 14:01:20 -04:00
Mohanad Elsafty
a4f459097b
HADOOP-18117. Add an option to preserve root directory permissions (#3970) 2022-02-18 19:12:50 +08:00
Ayush Saxena
fe583c4b63
HADOOP-18096. Distcp: Sync moves filtered file to home directory rather than deleting. (#3940). Contributed by Ayush Saxena.
Reviewed-by: Steve Loughran <stevel@apache.org>
Reviewed-by: stack <stack@apache.org>
2022-02-11 01:59:40 +05:30
Ayush Saxena
657a2882e9
HADOOP-18056. DistCp: Filter duplicates in the source paths. (#3825). Contributed by Ayush Saxena.
Reviewed-by: tomscut <litao@bigo.sg>
Reviewed-by: Steve Loughran <stevel@apache.org>
2022-01-05 23:53:07 +05:30
Viraj Jasani
c7ec1897c4
HADOOP-18018. unguava: remove Preconditions from hadoop-tools modules (#3688) 2021-11-23 13:34:10 +09:00
adol001
280ae1c0a9
HADOOP-17932. Distcp file length comparison have no effect (#3519)
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
2021-10-18 19:07:53 +09:00
Viraj Jasani
79e5a7f3e3
HADOOP-17962. Replace Guava VisibleForTesting by Hadoop's own annotation in hadoop-tools modules (#3540) 2021-10-14 17:43:32 +09:00
Steve Loughran
ee466d4b40
HADOOP-17628. Distcp contract test is really slow with ABFS and S3A; timing out. (#3240)
This patch cuts down the size of directory trees used for
distcp contract tests against object stores, so making
them much faster against distant/slow stores.

On abfs, the test only runs with -Dscale (as was the case for s3a already),
and has the larger scale test timeout.

After every test case, the FileSystem IOStatistics are logged,
to provide information about what IO is taking place and
what its performance is.

There are some test cases which upload files of 1+ MiB; you can
increase the size of the upload with the option
"scale.test.distcp.file.size.kb".
Set it to zero and the large file tests are skipped.

Contributed by Steve Loughran.
2021-08-02 11:36:43 +01:00
bshashikant
dac10fcc20
HDFS-16145. CopyListing fails with FNF exception with snapshot diff. (#3234) 2021-07-28 10:29:00 +05:30
litao
fef53aacc9
HDFS-16122. Fix DistCpContext#toString() (#3191). Contributed by tomscut.
Signed-off-by: Ayush Saxena <ayushsaxena@apache.org>
2021-07-10 13:55:11 +05:30
Masatake Iwasaki
3788fe52da HDFS-13916. Distcp SnapshotDiff to support WebHDFS. Contributed by Xun REN.
Signed-off-by: Masatake Iwasaki <iwasakims@apache.org>
2021-06-26 21:04:56 +00:00
Viraj Jasani
f4b24c68e7
HADOOP-17743. Replace Guava Lists usage by Hadoop's own Lists in hadoop-common, hadoop-tools and cloud-storage projects (#3072) 2021-06-07 13:24:09 +09:00
zhengchenyu
d5ad181684
MAPREDUCE-7287. Distcp will delete exists file , If we use "-delete and -update" options and distcp file. (#2852)
Contributed by zhengchenyu
2021-05-28 20:21:37 +01:00
Viraj Jasani
e4062ad027
HADOOP-17115. Replace Guava Sets usage by Hadoop's own Sets in hadoop-common and hadoop-tools (#2985)
Signed-off-by: Sean Busbey <busbey@apache.org>
2021-05-20 10:47:04 -05:00
Ayush Saxena
6800b21e3b
HADOOP-17620. DistCp: Use Iterator for listing target directory as well. (#2861). Contributed by Ayush Saxena.
Signed-off-by: Vinayakumar B <vinayakumarb@apache.org>
2021-04-23 22:48:15 +05:30
Viraj Jasani
3f2682b92b
HADOOP-17622. Avoid usage of deprecated IOUtils#cleanup API. (#2862)
Signed-off-by: Takanobu Asanuma <tasanuma@apache.org>
2021-04-06 13:39:10 +09:00
Ayush Saxena
03cfc85279
HADOOP-17531. DistCp: Reduce memory usage on copying huge directories. (#2732). Contributed by Ayush Saxena.
Signed-off-by: Steve Loughran <stevel@apache.org>
2021-03-24 02:36:26 +05:30
Ayush Saxena
4781761dc2
HADOOP-17594. DistCp: Expose the JobId for applications executing through run method (#2786). Contributed by Ayush Saxena.
Signed-off-by: Mingliang Liu <liuml07@apache.org>
Signed-off-by: Steve Loughran <stevel@apache.org>
2021-03-19 14:19:49 +05:30
jianghuazhu
375900049c
HDFS-15608. Reset the DistCp#CLEANUP variable definition. (#2351). Contributed by JiangHua Zhu.
Co-authored-by: zhujianghua <zhujianghua@zhujianghuadeMacBook-Pro.local>
2020-11-10 13:02:29 -08:00
Ayush Saxena
1e3a6efcef
HADOOP-17288. Use shaded guava from thirdparty. (#2342). Contributed by Ayush Saxena. 2020-10-17 12:01:18 +05:30
Arpit Agarwal
18fa4397e6
MAPREDUCE-7298. Distcp doesn't close the job after the job is completed. Contributed by Aasha Medhi.
Change-Id: I63d249bbb18ccedaeee9f10123a78e32f9e54ed2
2020-10-02 08:29:55 -07:00
swamirishi
872c2909bd
HADOOP-17122: Preserving Directory Attributes in DistCp with Atomic Copy (#2133)
Contributed by Swaminathan Balachandran
2020-08-22 18:48:21 +01:00
Steve Loughran
d08b9e94e3
Revert "HADOOP-14557. Document HADOOP-8143 (Change distcp to have -pb on by default)."
This reverts commit 44350fdf49.

It is related to the rollback of HADOOP-8143.

Change-Id: If48e3dd670c920ada702dc36461ff398fe9d35cc
2020-05-14 19:04:36 +01:00
Steve Loughran
4486220bb2
Revert "HADOOP-8143. Change distcp to have -pb on by default."
This reverts commit dd65eea74b.

Change-Id: I74180cf59d5bbad8c9f66cb331535addcbea863e
2020-05-14 19:03:56 +01:00
Ayush Saxena
c757cb61eb HADOOP-14254. Add a Distcp option to preserve Erasure Coding attributes. Contributed by Ayush Saxena. 2020-05-14 00:31:20 +05:30
Steve Loughran
20eec95867
HADOOP-16932. distcp copy calls getFileStatus() needlessly and can fail against S3 (#1936)
Contributed by Steve Loughran.

This strips out all the -p preservation options which have already been
processed when uploading a file before deciding whether or not to query
the far end for the status of the (existing/uploaded) file to see if any
other attributes need changing.

This will avoid 404 caching-related issues in S3, wherein a newly created
file can have a 404 entry in the S3 load balancer's cache from the
probes for the file's existence prior to the upload.

It partially addresses a regression caused by HADOOP-8143,
"Change distcp to have -pb on by default" that causes a resurfacing
of HADOOP-13145, "In DistCp, prevent unnecessary getFileStatus call when
not preserving metadata"
2020-04-07 17:55:55 +01:00
Sebastian Nagel
18050bc583
HADOOP-16909 Typo in distcp counters.
Contributed by Sebastian Nagel.
2020-03-09 14:37:08 +00:00
Mukund Thakur
819159fa06
HDFS-14788. Use dynamic regex filter to ignore copy of source files in Distcp.
Contributed by Mukund Thakur.

Change-Id: I781387ddce95ee300c12a160dc9a0f7d602403c3
2020-01-06 19:10:39 +00:00
Steve Loughran
b6dc00f481
HADOOP-16775. DistCp reuses the same temp file within the task for different files.
Contributed by Amir Shenavandeh.

This avoids overwrite consistency issues with S3 and other stores, though
given S3's copy operation is O(data), you are still best off using -direct
when distcp-ing to it.

Change-Id: I8dc9f048ad0cc57ff01543b849da1ce4eaadf8c3
2020-01-02 15:36:33 +00:00
aasha
fccccc9703 HDFS-14869 Copy renamed files which are not excluded anymore by filter (#1530) 2019-12-06 17:41:25 +05:30
pingsutw
14cd969b6e
HADOOP-16512. [hadoop-tools] Fix order of actual and expected expression in assert statements
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
2019-10-07 16:38:08 +09:00
Mukund Thakur
51c64b357d
HDFS-13660. DistCp job fails when new data is appended in the file while the DistCp copy job is running
This uses the length of the file known at the start of the copy to determine the amount of data to copy.

* If a file is appended to during the copy, the original bytes are copied.
* If a file is truncated during a copy, or the attempt to read the data fails with a truncated stream,
  distcp will now fail. Until now these failures were not detected.

Contributed by Mukund Thakur.

Change-Id: I576a49d951fa48d37a45a7e4c82c47488aa8e884
2019-09-24 11:23:24 +01:00
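The HDFS-13660 behaviour described above (copy only the length known at the start, and fail if the source turns out shorter) can be illustrated with a minimal sketch. This is not DistCp's actual code: the class `BoundedCopy` and the method `copyBounded` are hypothetical names, and plain `java.io` streams stand in for the Hadoop filesystem streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical illustration only; not part of hadoop-distcp.
final class BoundedCopy {

    /**
     * Copies exactly expectedLength bytes from in to out.
     * Bytes appended to the source after the length was recorded are ignored;
     * a source that ends early (truncated) causes an EOFException.
     */
    static long copyBounded(InputStream in, OutputStream out, long expectedLength)
            throws IOException {
        byte[] buffer = new byte[8192];
        long remaining = expectedLength;
        while (remaining > 0) {
            // Never request more than the bytes still owed.
            int toRead = (int) Math.min(buffer.length, remaining);
            int read = in.read(buffer, 0, toRead);
            if (read < 0) {
                long copied = expectedLength - remaining;
                throw new EOFException("Source ended after " + copied
                        + " of " + expectedLength + " bytes");
            }
            out.write(buffer, 0, read);
            remaining -= read;
        }
        return expectedLength;
    }

    public static void main(String[] args) throws IOException {
        // The source holds 5 bytes, but only 3 were known when the copy was planned.
        byte[] source = {1, 2, 3, 4, 5};
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copyBounded(new ByteArrayInputStream(source), out, 3);
        System.out.println("copied " + copied + " bytes");
    }
}
```

Bounding the read by the recorded length is what makes an append harmless, while the early end-of-stream check is what surfaces a truncation instead of silently producing a short copy.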
KAI XIE
c765584eb2 HADOOP-16158. DistCp to support checksum validation when copy blocks in parallel (#919)
* DistCp to support checksum validation when copy blocks in parallel

* address review comments

* add checksums comparison test for combine mode
2019-08-18 18:46:31 -07:00
Ayush Saxena
e60f5e2572 HADOOP-16440. Distcp can not preserve timestamp with -delete option. Contributed by ludun. 2019-07-20 13:11:14 +05:30
Steve Loughran
19a001826f
Revert "HDFS-9913. DistCp to add -useTrash to move deleted files to Trash."
Reverting due to test failures if ~/.Trash not present during test setup.

This reverts commit ee3115f488.

Change-Id: Icbeeb261570b9131ff99d765ac0945c335b26658
2019-07-17 13:13:24 +01:00
Shen Yinjie
ee3115f488
HDFS-9913. DistCp to add -useTrash to move deleted files to Trash.
Contributed by Shen Yinjie.

Change-Id: I03ac7d22ab1054f8e5de4aa7552909c734438f4a
2019-07-17 11:50:46 +01:00
Takanobu Asanuma
98d2065643 HDFS-12564. Add the documents of swebhdfs configurations on the client side. Contributed by Takanobu Asanuma.
Signed-off-by: Wei-Chiu Chuang <weichiu@apache.org>
2019-06-20 20:17:24 -07:00
Andrew Olson
c15b3bca86
HADOOP-16294: Enable access to input options by DistCp subclasses.
Adding a protected-scope getter for the DistCpOptions, so that a subclass does
not need to save its own copy of the inputOptions supplied to its constructor,
if it wishes to override the createInputFileListing method with logic similar
to the original implementation, i.e. calling CopyListing#buildListing with a path and input options.

Author: Andrew Olson
2019-05-16 16:11:12 +02:00
Giovanni Matteo Fumarola
7a3188d054 HADOOP-16282. Avoid FileStream to improve performance. Contributed by Ayush Saxena. 2019-05-02 12:58:42 -07:00
Masatake Iwasaki
bbdbc7a9a1 HADOOP-14544. DistCp documentation for command line options is misaligned. Contributed by Masatake Iwasaki. 2019-04-12 11:52:18 +09:00
Siyao Meng
ce4bafdf44
HADOOP-16037. DistCp: Document usage of Sync (-diff option) in detail.
Contributed by Siyao Meng
2019-03-26 18:42:54 +00:00
Andrew Olson
faba3591d3
HADOOP-16147. Allow CopyListing sequence file keys and values to be more easily customized.
Author: Andrew Olson
2019-03-22 10:35:30 +00:00
Ranith Sardar
546c5d70ef
HADOOP-16032. Distcp It should clear sub directory ACL before applying new ACL on. 2019-02-07 21:48:07 +00:00
Andrew Olson
de804e53b9
HADOOP-15281. Distcp to add no-rename copy option.
Contributed by Andrew Olson.
2019-02-07 10:07:22 +00:00
Giovanni Matteo Fumarola
fb8932a727 HADOOP-16029. Consecutive StringBuilder.append can be reused. Contributed by Ayush Saxena. 2019-01-11 10:54:49 -08:00
Kai Xie
188bebbe7e HADOOP-16018. DistCp won't reassemble chunks when blocks per chunk > 0.
Contributed by Kai Xie.
2019-01-08 11:57:57 +00:00
Akira Ajisaka
7f78397036
Revert "HADOOP-14556. S3A to support Delegation Tokens."
This reverts commit d7152332b3.
2019-01-08 14:51:30 +09:00
Steve Loughran
d7152332b3
HADOOP-14556. S3A to support Delegation Tokens.
Contributed by Steve Loughran.
2019-01-07 13:18:03 +00:00
Arpit Agarwal
914b0cf15f HADOOP-12558. distcp documentation is woefully out of date. Contributed by Dinesh Chitlangia. 2018-11-15 13:58:13 -08:00