Commit Graph

514 Commits

Author SHA1 Message Date
Ayush Saxena
28538d628e
HADOOP-19164. Hadoop CLI MiniCluster is broken (#7050). Contributed by Ayush Saxena.
Reviewed-by: Vinayakumar B <vinayakumarb@apache.org>
2024-09-21 21:26:51 +05:30
Steve Loughran
55a576906d
HADOOP-19131. Assist reflection IO with WrappedOperations class (#6686)
1. The class WrappedIO has been extended with more filesystem operations

- openFile()
- PathCapabilities
- StreamCapabilities
- ByteBufferPositionedReadable

All these static methods raise UncheckedIOExceptions rather than
checked ones.

2. The adjacent class org.apache.hadoop.io.wrappedio.WrappedStatistics
provides similar access to IOStatistics/IOStatisticsContext classes
and operations.

Allows callers to:
* Get a serializable IOStatisticsSnapshot from an IOStatisticsSource or
  IOStatistics instance
* Save an IOStatisticsSnapshot to file
* Convert an IOStatisticsSnapshot to JSON
* Given an object which may be an IOStatisticsSource, return an object
  whose toString() value is a dynamically generated, human readable summary.
  This is for logging.
* Separate getters to the different sections of IOStatistics.
* Mean values are returned as a Map.Pair<Long, Long> of (samples, sum)
  from which means may be calculated.

There are examples of the dynamic bindings to these classes in:

org.apache.hadoop.io.wrappedio.impl.DynamicWrappedIO
org.apache.hadoop.io.wrappedio.impl.DynamicWrappedStatistics

These use DynMethods and other classes in the package
org.apache.hadoop.util.dynamic which are based on the
Apache Parquet equivalents.
This makes re-implementing these in that library and others
which their own fork of the classes (example: Apache Iceberg)

3. The openFile() option "fs.option.openfile.read.policy" has
added specific file format policies for the core filetypes

* avro
* columnar
* csv
* hbase
* json
* orc
* parquet

S3A chooses the appropriate sequential/random policy as a 

A policy `parquet, columnar, vector, random, adaptive` will use the parquet policy for
any filesystem aware of it, falling back to the first entry in the list which
the specific version of the filesystem recognizes

4. New Path capability fs.capability.virtual.block.locations

Indicates that locations are generated client side
and don't refer to real hosts.

Contributed by Steve Loughran
2024-08-14 14:43:00 +01:00
Viraj Jasani
321a6cc55e
HADOOP-19072. S3A: expand optimisations on stores with "fs.s3a.performance.flags" for mkdir (#6543)
If the flag list in fs.s3a.performance.flags includes "mkdir"
then the safety check of a walk up the tree to look for a parent directory,
-done to verify a directory isn't being created under a file- are skipped.

This saves the cost of multiple list operations.

Contributed by Viraj Jasani
2024-08-08 17:48:51 +01:00
gavin.wang
783a852029
HDFS-17555. Fix NumberFormatException of NNThroughputBenchmark when configured dfs.blocksize. (#6894). Contributed by wangzhongwei
Reviewed-by: He Xiaoqiao <hexiaoqiao@apache.org>
Signed-off-by: Ayush Saxena <ayushsaxena@apache.org>
2024-07-09 13:52:15 +05:30
Steve Loughran
56c8aa5f1c
HADOOP-19204. VectorIO regression: empty ranges are now rejected (#6887)
- restore old outcome: no-op
- test this
- update spec

This is a critical fix for vector IO and MUST be cherrypicked to all branches with
that feature

Contributed by Steve Loughran
2024-06-19 12:05:24 +01:00
Fateh Singh
90024d8cb1
HDFS-17439. Support -nonSuperUser for NNThroughputBenchmark: useful for testing auth frameworks such as Ranger (#6677) 2024-06-18 13:52:24 +01:00
Cheng Pan
2bde5ccb81
HADOOP-19192. Log level is WARN when fail to load native hadoop libs (#6863)
Updates the documentation to be consistent with the logging.

Contributed by Cheng Pan
2024-06-14 19:05:27 +01:00
Mukund Thakur
06dd3bfee8
HADOOP-19196. Allow base path to be deleted as well using Bulk Delete. (#6872)
Contributed by: Mukund Thakur
2024-06-11 14:06:53 -05:00
Mukund Thakur
47be1ab3b6
HADOOP-18679. Add API for bulk/paged delete of files (#6726)
Applications can create a BulkDelete instance from a
BulkDeleteSource; the BulkDelete interface provides
the pageSize(): the maximum number of entries which can be
deleted, and a bulkDelete(Collection paths)
method which can take a collection up to pageSize() long.

This is optimized for object stores with bulk delete APIs;
the S3A connector will offer the page size of
fs.s3a.bulk.delete.page.size unless bulk delete has
been disabled.

Even with a page size of 1, the S3A implementation is
more efficient than delete(path)
as there are no safety checks for the path being a directory
or probes for the need to recreate directories.

The interface BulkDeleteSource is implemented by
all FileSystem implementations, with a page size
of 1 and mapped to delete(pathToDelete, false).
This means that callers do not need to have special
case handling for object stores versus classic filesystems.

To aid use through reflection APIs, the class
org.apache.hadoop.io.wrappedio.WrappedIO
has been created with "reflection friendly" methods.

Contributed by Mukund Thakur and Steve Loughran
2024-05-20 17:05:25 +01:00
Felix Nguyen
fb0519253d
HDFS-17488. DN can fail IBRs with NPE when a volume is removed (#6759) 2024-05-11 15:37:43 +08:00
zhtttylz
daafc8a0b8
HDFS-17367. Add PercentUsed for Different StorageTypes in JMX (#6735) Contributed by Hualong Zhang.
Signed-off-by: Shilun Fan <slfan1989@apache.org>
2024-04-27 20:36:11 +08:00
Steve Loughran
87fb977777
HADOOP-19098. Vector IO: Specify and validate ranges consistently. #6604
Clarifies behaviour of VectorIO methods with contract tests as well as
specification.

* Add precondition range checks to all implementations
* Identify and fix bug where direct buffer reads was broken
  (HADOOP-19101; this surfaced in ABFS contract tests)
* Logging in VectoredReadUtils.
* TestVectoredReadUtils verifies validation logic.
* FileRangeImpl toString() improvements
* CombinedFileRange tracks bytes in range which are wanted;
   toString() output logs this.

HDFS
* Add test TestHDFSContractVectoredRead

ABFS
* Add test ITestAbfsFileSystemContractVectoredRead

S3A
* checks for vector IO being stopped in all iterative
  vector operations, including draining
* maps read() returning -1 to failure
* passes in file length to validation
* Error reporting to only completeExceptionally() those ranges
  which had not yet read data in.
* Improved logging.

readVectored()
* made synchronized. This is only for the invocation;
  the actual async retrieves are unsynchronized.
* closes input stream on invocation
* switches to random IO, so avoids keeping any long-lived connection around.

+ AbstractSTestS3AHugeFiles enhancements.
+ ADDENDUM: test fix in ITestS3AContractVectoredRead

Contains: HADOOP-19101. Vectored Read into off-heap buffer broken in fallback
implementation

Contributed by Steve Loughran

Change-Id: Ia4ed71864c595f175c275aad83a2ff5741693432
2024-04-03 13:17:52 +01:00
Steve Loughran
b4f9d8e6fa
Revert "HADOOP-19098. Vector IO: Specify and validate ranges consistently."
This reverts commit ba7faf90c8.
2024-04-03 13:15:05 +01:00
Steve Loughran
ba7faf90c8
HADOOP-19098. Vector IO: Specify and validate ranges consistently.
Clarifies behaviour of VectorIO methods with contract tests as well as specification.

* Add precondition range checks to all implementations
* Identify and fix bug where direct buffer reads was broken
  (HADOOP-19101; this surfaced in ABFS contract tests)
* Logging in VectoredReadUtils.
* TestVectoredReadUtils verifies validation logic.
* FileRangeImpl toString() improvements
* CombinedFileRange tracks bytes in range which are wanted;
   toString() output logs this.

HDFS
* Add test TestHDFSContractVectoredRead

ABFS
* Add test ITestAbfsFileSystemContractVectoredRead

S3A
* checks for vector IO being stopped in all iterative
  vector operations, including draining
* maps read() returning -1 to failure
* passes in file length to validation
* Error reporting to only completeExceptionally() those ranges
  which had not yet read data in.
* Improved logging.  

readVectored()
* made synchronized. This is only for the invocation;
  the actual async retrieves are unsynchronized.
* closes input stream on invocation
* switches to random IO, so avoids keeping any long-lived connection around.

+ AbstractSTestS3AHugeFiles enhancements.

Contains: HADOOP-19101. Vectored Read into off-heap buffer broken in fallback implementation

Contributed by Steve Loughran
2024-04-02 20:16:38 +01:00
slfan1989
ff3f2255d2
HADOOP-19112. Hadoop 3.4.0 release wrap-up. (#6640) Contributed by Shilun Fan.
Reviewed-by: He Xiaoqiao <hexiaoqiao@apache.org>
Signed-off-by: Shilun Fan <slfan1989@apache.org>
2024-03-19 20:08:03 +08:00
hfutatzhanghb
7012986fc3
HDFS-17345. Add a metrics to record block report generating cost time. (#6475). Contributed by farmmamba.
Reviewed-by:  Shuyan Zhang <zhangshuyan@apache.org>
Signed-off-by:  Shuyan Zhang <zhangshuyan@apache.org>
2024-03-06 16:59:00 +08:00
DieterDP
be13e94843
HADOOP-18987. Various fixes to FileSystem API docs (#6292)
Contributed by Dieter De Paepe
2024-02-02 11:49:31 +00:00
LiuGuH
5f9932acc4
HDFS-17325. Fix the documentation of fs expunge command in FileSystemShell.md. (#6413) Contributed by liuguanghua.
Reviewed-by: Ayush Saxena <ayushsaxena@apache.org>
Signed-off-by: Shilun Fan <slfan1989@apache.org>
2024-01-05 18:42:55 +08:00
Lei Yang
661c784662
HDFS-17290: Adds disconnected client rpc backoff metrics (#6359) 2024-01-04 20:24:10 -08:00
huangzhaobo
e26139beaa
HDFS-17301. Add read and write dataXceiver threads count metrics to datanode. (#6377)
Reviewed-by: hfutatzhanghb <hfutzhanghb@163.com>
Signed-off-by: Takanobu Asanuma <tasanuma@apache.org>
2023-12-29 12:43:46 +09:00
hfutatzhanghb
e91daae318
HDFS-17152. Fix the documentation of count command in FileSystemShell.md. (#5939). Contributed by farmmamba.
Reviewed-by: Shilun Fan <slfan1989@apache.org>
Signed-off-by:  Shuyan Zhang <zhangshuyan@apache.org>
2023-12-11 16:53:37 +08:00
caozhiqiang
37d6cada14
HDFS-17272. NNThroughputBenchmark should support specifying the base directory for multi-client test (#6319). Contributed by caozhiqiang.
Reviewed-by: Tao Li <tomscut@apache.org>
Signed-off-by: Ayush Saxena <ayushsaxena@apache.org>
2023-12-10 13:43:04 +05:30
zhangshuyan
809ae58e71
HADOOP-18982. Fix doc about loading native libraries. (#6281). Contributed by Shuyan Zhang.
Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org>
2023-12-06 21:24:14 +08:00
Tom
f58945d7d1
HDFS-16791. Add getEnclosingRoot() API to filesystem interface and implementations (#6198)
The enclosing root path is a common ancestor that should be used for temp and staging dirs
as well as within encryption zones and other restricted directories.

Contributed by Tom McCormick
2023-11-08 14:25:21 +00:00
Steve Loughran
7ec636deec
HADOOP-18930. Make fs.s3a.create.performance a bucket-wide setting. (#6168)
If fs.s3a.create.performance is set on a bucket
- All file overwrite checks are skipped, even if the caller says
  otherwise.
- All directory existence checks are skipped.
- Marker deletion is *always* skipped.

This eliminates a HEAD and a LIST for every creation.

* New path capability "fs.s3a.create.performance.enabled" true
  if the option is enabled.
* Parameterize ITestS3AContractCreate to expect the different
  outcomes
* Parameterize ITestCreateFileCost similarly, with
  changed cost assertions there.
* create(/) raises an IOE. existing bug only noticed here.

Contributed by Steve Loughran
2023-10-27 12:23:55 +01:00
huangzhaobo
daa78adc88
HDFS-17200. Add some datanode related metrics to Metrics.md. (#6099). Contributed by huangzhaobo99
Signed-off-by: Ayush Saxena <ayushsaxena@apache.org>
2023-10-06 12:40:44 +05:30
huhaiyang
2831c7ce26
HADOOP-18880. Add some rpc related metrics to Metrics.md (#6015) Contributed by Yanghai Hu.
Reviewed-by: Inigo Goiri <inigoiri@apache.org>
Signed-off-by: Shilun Fan <slfan1989@apache.org>
2023-09-05 17:34:05 +08:00
Wei-Chiu Chuang
e239d40ab1 Post release update
* Add jdiff xml files from 3.3.6 release.
* Declare 3.3.6 as the latest stable release.
* Copy release notes.

(cherry picked from commit 7db9895000)
(cherry picked from commit cc121e2124aa01458dc296a060edc5e21a295268)
2023-06-26 16:08:24 +00:00
Xing Lin
427366b73b
HDFS-17042 Add rpcCallSuccesses and OverallRpcProcessingTime to RpcMetrics for Namenode (#5730) 2023-06-15 13:59:58 -07:00
Steve Loughran
e76c09ac3b
HADOOP-18724. Open file fails with NumberFormatException for S3AFileSystem (#5611)
This:

1. Adds optLong, optDouble, mustLong and mustDouble
   methods to the FSBuilder interface to let callers explicitly
   passin long and double arguments.
2. The opt() and must() builder calls which take float/double values
   now only set long values instead, so as to avoid problems
   related to overloaded methods resulting in a ".0" being appended
   to a long value.
3. All of the relevant opt/must calls in the hadoop codebase move to
   the new methods
4. And the s3a code is resilient to parse errors in is numeric options
   -it will downgrade to the default.

This is nominally incompatible, but the floating-point builder methods
were never used: nothing currently expects floating point numbers.

For anyone who wants to safely set numeric builder options across all compatible
releases, convert the number to a string and then use the opt(String, String)
and must(String, String) methods.

Contributed by Steve Loughran
2023-05-11 17:57:25 +01:00
Tak Lon (Stephen) Wu
0e46388474
HADOOP-18671. Add recoverLease(), setSafeMode(), isFileClosed() as interfaces to hadoop-common (#5553)
The HDFS lease APIs have been replicated as interfaces in hadoop-common so other filesystems can
also implement them.  Applications which use the leasing APIs should migrate to the new
interface where possible.

Contributed by Stephen Wu
2023-05-03 11:05:55 +01:00
Sebastian Baunsgaard
6aac6cb212
HADOOP-18660. Filesystem Spelling Mistake (#5475). Contributed by Sebastian Baunsgaard.
Signed-off-by: Ayush Saxena <ayushsaxena@apache.org>
2023-04-25 21:44:04 +05:30
Nikita Eshkeev
d07356e60e
HADOOP-18597. Simplify single node instructions for creating directories for Map Reduce. (#5305)
Signed-off-by: Ayush Saxena <ayushsaxena@apache.org>
2023-04-20 16:12:44 +05:30
Steve Loughran
405ed1dde6
HADOOP-18470. Hadoop 3.3.5 release wrap-up (#5558)
Post-release updates of the branches

* Add jdiff xml files from 3.3.5 release.
* Declare 3.3.5 as the latest stable release.
* Copy release notes.
2023-04-18 10:12:07 +01:00
Viraj Jasani
b4bcbb9515
HDFS-16959. RBF: State store cache loading metrics (#5497) 2023-03-29 10:43:13 -07:00
rdingankar
0ca5686034
HDFS-16917 Add transfer rate quantile metrics for DataNode reads (#5397)
Co-authored-by: Ravindra Dingankar <rdingankar@linkedin.com>
2023-02-27 18:26:32 +00:00
Arnout Engelen
02fd87a4d8
HADOOP-18627. Add stronger wording in 'secure mode' introduction (#5406)
Make it more clear that when deploying Hadoop 'secure mode' is generally not optional.

Contributed by Arnout Engelen
2023-02-17 16:30:41 +00:00
Steve Loughran
d56977e909
HADOOP-18470. More in the 3.3.5 index.html about security (#5383)
Expands on the comments in cluster config to tell people
they shouldn't be running a cluster without a private VLAN
in cloud, that Knox is good here, and unsecured clusters
without a VLAN are just computation-as-a-service to crypto miners

Contributed by Steve Loughran
2023-02-14 17:22:59 +00:00
Nikita Eshkeev
4de31123ce
Fix "the the" and friends typos (#5267)
Signed-off-by: Nikita Eshkeev <neshkeev@yandex.ru>
2023-01-17 03:33:59 +08:00
Steve Loughran
84b33b897c
HADOOP-18470. index.md update for 3.3.5 release 2022-12-05 16:13:24 +00:00
GuoPhilipse
069bd973d8
HADOOP-18532. Update command usage in FileSystemShell.md (#5141)
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
2022-11-21 15:55:46 +09:00
Ashutosh Gupta
a48e8c9beb
MAPREDUCE-5608. Replace and deprecate mapred.tasktracker.indexcache.mb (#5014)
Co-authored-by: Ashutosh Gupta <ashugpt@amazon.com>
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
2022-11-14 11:07:40 +09:00
Steve Loughran
38b2ed2151
HADOOP-18442. Remove openstack support (#4855)
Contributed by Steve Loughran
2022-10-06 11:49:38 +01:00
Ayush Saxena
cc41ad63f9
HADOOP-18388. Allow dynamic groupSearchFilter in LdapGroupsMapping. (#4798)
* HADOOP-18388. Allow dynamic groupSearchFilter in LdapGroupsMapping.
2022-09-06 18:38:51 -04:00
Mukund Thakur
231e095802
HADOOP-18407. Improve readVectored() api spec (#4760)
part of HADOOP-18103.

Contributed By: Mukund Thakur
2022-08-22 23:19:29 +05:30
Steve Loughran
62dbefd8f2
HADOOP-18305. Release Hadoop 3.3.4: upstream changelog and jdiff files
Add the r3.3.4 changelog, release notes and jdiff xml files.
2022-08-05 14:06:22 +01:00
Masatake Iwasaki
3cce41a1f6 Make upstream aware of 3.2.4 release.
(cherry picked from commit e1637a57dfd41385dbce5de90620c48a45abb263)
2022-07-22 02:27:19 +00:00
Mukund Thakur
4d1f6f9b99 HADOOP-18106: Handle memory fragmentation in S3A Vectored IO. (#4445)
part of HADOOP-18103.
Handling memory fragmentation in S3A vectored IO implementation by
allocating smaller user range requested size buffers and directly
filling them from the remote S3 stream and skipping undesired
data in between ranges.
This patch also adds aborting active vectored reads when stream is
closed or unbuffer() is called.

Contributed By: Mukund Thakur
2022-06-22 17:29:32 +01:00
Mukund Thakur
5db0f34e29 HADOOP-18104: S3A: Add configs to configure minSeekForVectorReads and maxReadSizeForVectorReads (#3964)
Part of HADOOP-18103.
Introducing fs.s3a.vectored.read.min.seek.size and fs.s3a.vectored.read.max.merged.size
to configure min seek and max read during a vectored IO operation in S3A connector.
These properties actually define how the ranges will be merged. To completely
disable merging set fs.s3a.max.readsize.vectored.read to 0.

Contributed By: Mukund Thakur
2022-06-22 17:29:32 +01:00
Mukund Thakur
2daf0a814f HADOOP-11867. Add a high-performance vectored read API. (#3904)
part of HADOOP-18103.
Add support for multiple ranged vectored read api in PositionedReadable.
The default iterates through the ranges to read each synchronously,
but the intent is that FSDataInputStream subclasses can make more
efficient readers especially in object stores implementation.

Also added implementation in S3A where smaller ranges are merged and
sliced byte buffers are returned to the readers. All the merged ranged are
fetched from S3 asynchronously.

Contributed By: Owen O'Malley and Mukund Thakur
2022-06-22 17:29:32 +01:00