hadoop

Author	SHA1	Message	Date
Ayush Saxena	0837c84a9f	Revert "HADOOP-19231. Add JacksonUtil to manage Jackson classes (#6953 )" This reverts commit `fa9bb0d1ac`.	2024-08-29 14:42:03 +05:30
Carl Levasseur	68fcd7234c	HADOOP-18542. Keep MSI tenant ID and client ID optional (#4262 ) Contributed by Carl Levasseur	2024-08-21 14:15:28 +01:00
Anuj Modi	b15ed27cfb	HADOOP-19187: [ABFS][FNSOverBlob] AbfsClient Refactoring to Support Multiple Implementation of Clients. (#6879 ) Refactor AbfsClient into DFS and Blob Client. Contributed by Anuj Modi	2024-08-20 18:07:07 +01:00
PJ Fanning	fa9bb0d1ac	HADOOP-19231. Add JacksonUtil to manage Jackson classes (#6953 ) New class org.apache.hadoop.util.JacksonUtil centralizes construction of Jackson ObjectMappers and JsonFactories. Contributed by PJ Fanning	2024-08-15 16:44:54 +01:00
Steve Loughran	55a576906d	HADOOP-19131. Assist reflection IO with WrappedOperations class (#6686 ) 1. The class WrappedIO has been extended with more filesystem operations - openFile() - PathCapabilities - StreamCapabilities - ByteBufferPositionedReadable All these static methods raise UncheckedIOExceptions rather than checked ones. 2. The adjacent class org.apache.hadoop.io.wrappedio.WrappedStatistics provides similar access to IOStatistics/IOStatisticsContext classes and operations. Allows callers to: * Get a serializable IOStatisticsSnapshot from an IOStatisticsSource or IOStatistics instance * Save an IOStatisticsSnapshot to file * Convert an IOStatisticsSnapshot to JSON * Given an object which may be an IOStatisticsSource, return an object whose toString() value is a dynamically generated, human readable summary. This is for logging. * Separate getters to the different sections of IOStatistics. * Mean values are returned as a Map.Pair<Long, Long> of (samples, sum) from which means may be calculated. There are examples of the dynamic bindings to these classes in: org.apache.hadoop.io.wrappedio.impl.DynamicWrappedIO org.apache.hadoop.io.wrappedio.impl.DynamicWrappedStatistics These use DynMethods and other classes in the package org.apache.hadoop.util.dynamic which are based on the Apache Parquet equivalents. This makes it easier to re-implement these in that library and in others which have their own fork of the classes (example: Apache Iceberg). 3. The openFile() option "fs.option.openfile.read.policy" has added specific file format policies for the core filetypes * avro * columnar * csv * hbase * json * orc * parquet S3A chooses the appropriate sequential/random policy. A policy `parquet, columnar, vector, random, adaptive` will use the parquet policy for any filesystem aware of it, falling back to the first entry in the list which the specific version of the filesystem recognizes. 4. New Path capability fs.capability.virtual.block.locations Indicates that locations are generated client side and don't refer to real hosts. Contributed by Steve Loughran	2024-08-14 14:43:00 +01:00
Pranav Saxena	b60497ff41	HADOOP-19120. ApacheHttpClient adaptation in ABFS. (#6633 ) Apache httpclient 4.5.x is the new default implementation of http connections; this supports a large configurable pool of connections along with the ability to limit their lifespan. The networking library can be chosen using the configuration option fs.azure.networking.library The supported values are - APACHE_HTTP_CLIENT : Use Apache HttpClient [Default] - JDK_HTTP_URL_CONNECTION : Use JDK networking library Important: unless the networking library is switched back to the JDK, the apache httpcore and httpclient must be on the classpath Contributed by Pranav Saxena	2024-07-22 19:03:51 +01:00
Anuj Modi	51cb858cc8	HADOOP-19208: [ABFS] Fixing logic to determine HNS nature of account to avoid extra getAcl() calls (#6893 )	2024-07-15 21:51:54 +05:30
Anuj Modi	005030f7a0	HADOOP-18610: [ABFS] OAuth2 Token Provider support for Azure Workload Identity (#6787 ) Add support for Azure Active Directory (Azure AD) workload identities, which integrate with Kubernetes' native capabilities to federate with any external identity provider. Contributed By: Anuj Modi	2024-06-11 13:06:39 -05:00
Pranav Saxena	2e1deee87a	HADOOP-19137. [ABFS] Prevent ABFS initialization for non-hierarchical-namespace account if Customer-provided-key configs given. (#6752 ) Customer-provided-keys (CPK) configs are not allowed with non-hierarchical-namespace (non-HNS) accounts for ABFS. This patch aims to prevent ABFS initialization for non-HNS accounts if CPK configs are provided. Contributed by: Pranav Saxena	2024-06-10 15:03:41 -05:00
Anuj Modi	bbb17e76a7	HADOOP-19178: [WASB Deprecation] Updating Documentation on Upcoming Plans for Hadoop-Azure (#6862 ) Contributed by Anuj Modi	2024-06-07 14:28:24 +01:00
Anuj Modi	d8b485a512	HADOOP-18516: [ABFS][Authentication] Support Fixed SAS Token for ABFS Authentication (#6552 ) Contributed by Anuj Modi	2024-05-30 20:46:19 +01:00
Anmol Asrani	d168d3ffee	HADOOP-18325: ABFS: Add correlated metric support for ABFS operations (#6314 ) Adds support for metric collection at the filesystem instance level. Metrics are pushed to the store upon the closure of a filesystem instance, encompassing all operations that utilized that specific instance. Collected Metrics: - Number of successful requests without any retries. - Count of requests that succeeded after a specified number of retries (x retries). - Request count subjected to throttling. - Number of requests that failed despite exhausting all retry attempts. etc. Implementation Details: Incorporated logic in the AbfsClient to facilitate metric pushing through an additional request. This occurs in scenarios where no requests are sent to the backend for a defined idle period. By implementing these enhancements, we ensure comprehensive monitoring and analysis of filesystem interactions, enabling a deeper understanding of success rates, retry scenarios, throttling instances, and exhaustive failure scenarios. Additionally, the AbfsClient logic ensures that metrics are proactively pushed even during idle periods, maintaining a continuous and accurate representation of filesystem performance. Contributed by Anmol Asrani	2024-05-23 15:10:10 +01:00
Mukund Thakur	47be1ab3b6	HADOOP-18679. Add API for bulk/paged delete of files (#6726 ) Applications can create a BulkDelete instance from a BulkDeleteSource; the BulkDelete interface provides the pageSize(): the maximum number of entries which can be deleted, and a bulkDelete(Collection paths) method which can take a collection up to pageSize() long. This is optimized for object stores with bulk delete APIs; the S3A connector will offer the page size of fs.s3a.bulk.delete.page.size unless bulk delete has been disabled. Even with a page size of 1, the S3A implementation is more efficient than delete(path) as there are no safety checks for the path being a directory or probes for the need to recreate directories. The interface BulkDeleteSource is implemented by all FileSystem implementations, with a page size of 1 and mapped to delete(pathToDelete, false). This means that callers do not need to have special case handling for object stores versus classic filesystems. To aid use through reflection APIs, the class org.apache.hadoop.io.wrappedio.WrappedIO has been created with "reflection friendly" methods. Contributed by Mukund Thakur and Steve Loughran	2024-05-20 17:05:25 +01:00
xuzifu666	cf9559eb27	HADOOP-19073 WASB: Fix connection leak in FolderRenamePending (#6534 ) Contributed by xuyu	2024-05-15 14:38:06 +01:00
Steve Loughran	c9270600b7	MAPREDUCE-7474. Improve Manifest committer resilience (#6716 ) Improve task commit resilience everywhere and add an option to reduce delete IO requests on job cleanup (relevant for ABFS and HDFS). Task Commit Resilience ---------------------- Task manifest saving is re-attempted on failure; the number of attempts made is configurable with the option: mapreduce.manifest.committer.manifest.save.attempts * The default is 5. * The minimum is 1; asking for less is ignored. * A retry policy adds 500ms of sleep per attempt. * Move from classic rename() to commitFile() to rename the file, after calling getFileStatus() to get its length and possibly etag. This becomes a rename() on gcs/hdfs anyway, but on abfs it does reach the ResilientCommitByRename callbacks in abfs, which report on the outcome to the caller...which is then logged at WARN. * New statistic task_stage_save_summary_file to distinguish from other saving operations (job success/report file). This is only saved to the manifest on task commit retries, and provides statistics on all previous unsuccessful attempts to save the manifests + test changes to match the codepath changes, including improvements in fault injection. Directory size for deletion --------------------------- New option mapreduce.manifest.committer.cleanup.parallel.delete.base.first This makes an initial attempt at deleting the base dir, only falling back to parallel deletes if there's a timeout. This option is disabled by default; consider enabling it for abfs to reduce IO load. Consult the documentation for more details. Success file printing --------------------- The command to print a JSON _SUCCESS file from this committer and any S3A committer is now something which can be invoked from the mapred command: mapred successfile <path to file> Contributed by Steve Loughran	2024-05-13 21:12:34 +01:00
Anuj Modi	a6f2c4617e	HADOOP-19150: [ABFS] Fixing Test Code for ITestAbfsRestOperationException#testAuthFailException (#6756 ) Contributed by: Anuj Modi	2024-04-29 11:48:34 -05:00
Pranav Saxena	6404692c09	HADOOP-19102. [ABFS] FooterReadBufferSize should not be greater than readBufferSize (#6617 ) Contributed by Pranav Saxena	2024-04-22 18:36:12 +01:00
Anuj Modi	bd1a08b2cf	HADOOP-19129: [ABFS] Test Fixes and Test Script Bug Fixes (#6676 ) Contributed by Anuj Modi	2024-04-12 17:52:47 +01:00
Anuj Modi	dbe2d61258	HADOOP-19096. [ABFS] [CST Optimization] Enhance Client-Side Throttling Metrics Logic (#6276 ) ABFS has a client-side throttling mechanism which works on the metrics collected from past requests. When requests fail due to server-side throttling, it updates its metrics and recalculates any client-side backoff. The choice of which requests should be used to compute the client-side backoff interval is based on the http status code: - Status code in 2xx range: Successful Operations should contribute. - Status code in 3xx range: Redirection Operations should not contribute. - Status code in 4xx range: User Errors should not contribute. - Status code is 503: Throttling Errors should contribute only if they are due to a client limits breach, as follows: * 503, Ingress Over Account Limit: Should Contribute * 503, Egress Over Account Limit: Should Contribute * 503, TPS Over Account Limit: Should Contribute * 503, Other Server Throttling: Should not Contribute. - Status code in 5xx range other than 503: Should not Contribute. - IOException and UnknownHostExceptions: Should not Contribute. Contributed by Anuj Modi	2024-04-10 14:46:23 +01:00
Anuj Modi	6ed73896f6	HADOOP-18656. [ABFS] Add Support for Paginated Delete for Large Directories in HNS Account (#6409 ) Contributed by Anuj Modi	2024-04-04 19:48:25 +01:00
Steve Loughran	87fb977777	HADOOP-19098. Vector IO: Specify and validate ranges consistently. #6604 Clarifies behaviour of VectorIO methods with contract tests as well as specification. * Add precondition range checks to all implementations * Identify and fix bug where direct buffer reads were broken (HADOOP-19101; this surfaced in ABFS contract tests) * Logging in VectoredReadUtils. * TestVectoredReadUtils verifies validation logic. * FileRangeImpl toString() improvements * CombinedFileRange tracks bytes in range which are wanted; toString() output logs this. HDFS * Add test TestHDFSContractVectoredRead ABFS * Add test ITestAbfsFileSystemContractVectoredRead S3A * checks for vector IO being stopped in all iterative vector operations, including draining * maps read() returning -1 to failure * passes in file length to validation * Error reporting to only completeExceptionally() those ranges which had not yet read data in. * Improved logging. readVectored() * made synchronized. This is only for the invocation; the actual async retrieves are unsynchronized. * closes input stream on invocation * switches to random IO, so avoids keeping any long-lived connection around. + AbstractSTestS3AHugeFiles enhancements. + ADDENDUM: test fix in ITestS3AContractVectoredRead Contains: HADOOP-19101. Vectored Read into off-heap buffer broken in fallback implementation Contributed by Steve Loughran Change-Id: Ia4ed71864c595f175c275aad83a2ff5741693432	2024-04-03 13:17:52 +01:00
Steve Loughran	b4f9d8e6fa	Revert "HADOOP-19098. Vector IO: Specify and validate ranges consistently." This reverts commit `ba7faf90c8`.	2024-04-03 13:15:05 +01:00
Steve Loughran	ba7faf90c8	HADOOP-19098. Vector IO: Specify and validate ranges consistently. Clarifies behaviour of VectorIO methods with contract tests as well as specification. * Add precondition range checks to all implementations * Identify and fix bug where direct buffer reads were broken (HADOOP-19101; this surfaced in ABFS contract tests) * Logging in VectoredReadUtils. * TestVectoredReadUtils verifies validation logic. * FileRangeImpl toString() improvements * CombinedFileRange tracks bytes in range which are wanted; toString() output logs this. HDFS * Add test TestHDFSContractVectoredRead ABFS * Add test ITestAbfsFileSystemContractVectoredRead S3A * checks for vector IO being stopped in all iterative vector operations, including draining * maps read() returning -1 to failure * passes in file length to validation * Error reporting to only completeExceptionally() those ranges which had not yet read data in. * Improved logging. readVectored() * made synchronized. This is only for the invocation; the actual async retrieves are unsynchronized. * closes input stream on invocation * switches to random IO, so avoids keeping any long-lived connection around. + AbstractSTestS3AHugeFiles enhancements. Contains: HADOOP-19101. Vectored Read into off-heap buffer broken in fallback implementation Contributed by Steve Loughran	2024-04-02 20:16:38 +01:00
PJ Fanning	97c5a6efba	HADOOP-19041. Use StandardCharsets in more places (#6449 )	2024-03-28 23:17:18 -04:00
Anuj Modi	c4fa1b65fb	HADOOP-19089: [ABFS] Reverting Back Support of setXAttr() and getXAttr() on root path (#6592 ) This reverts most of HADOOP-18869: [ABFS] Fix behavior of a File System APIs on root path (#6003). Calling getXAttr("/") or setXAttr("/") on an abfs container will fail with `Operation failed: "The request URI is invalid.", HTTP 400 Bad Request` This change is to ensure: * Consistency across ADLS clients * Consistency across authentication mechanisms. Contributed by Anuj Modi	2024-03-25 14:13:24 +00:00
Anuj Modi	99b9e7fb43	HADOOP-18910: [ABFS] Adding Support for MD5 Hash based integrity verification of the request content during transport (#6069 ) Contributed By: Anuj Modi	2024-02-22 11:49:37 -06:00
Anuj Modi	1336c362e5	HADOOP-18759: [ABFS][Backoff-Optimization] Have a Static retry policy for connection timeout. (#5881 ) Contributed By: Anuj Modi	2024-02-20 11:31:42 -06:00
Pranav Saxena	7dc166ddc7	HADOOP-18883. [ABFS]: Expect-100 JDK bug resolution: prevent multiple server calls (#6022 ) Address JDK bug JDK-8314978 related to handling of HTTP 100 responses. https://bugs.openjdk.org/browse/JDK-8314978 In the AbfsHttpOperation, after sendRequest() we call the processResponse() method from AbfsRestOperation. Even if conn.getOutputStream() fails due to an expect-100 error, we consume the exception and let the code go ahead. This may call getHeaderField() / getHeaderFields() / getHeaderFieldLong() after getOutputStream() has failed. These invocations all lead to server calls. This commit aims to prevent this. If connection.getOutputStream() fails due to an Expect-100 error, the ABFS client does not invoke getHeaderField(), getHeaderFields(), getHeaderFieldLong() or getInputStream(). getResponseCode() is safe, as on failure it sets the responseCode variable in the HttpUrlConnection object. Contributed by Pranav Saxena	2024-01-21 19:14:54 +00:00
slfan1989	8444f69511	Preparing for 3.5.0 development (#6411 ) Co-authored-by: slfan1989 <slfan1989@apache.org>	2024-01-19 15:05:22 +08:00
Anuj Modi	e3c135b0b3	HADOOP-18971. [ABFS] Read and cache file footer with fs.azure.footer.read.request.size (#6270 ) The option fs.azure.footer.read.request.size sets the size of the footer to read and cache; the default value of 524288 has been measured to be good for most workloads running on parquet, ORC and similar file formats. Contributed by Anuj Modi	2024-01-03 12:49:52 +00:00
Pranav Saxena	0b43026cab	HADOOP-17912. ABFS: Support for Encryption Context (#6221 ) Contributed by Pranav Saxena and others.	2024-01-01 19:09:44 +00:00
PJ Fanning	f609460bda	HADOOP-18957. Use StandardCharsets.UTF_8 (#6231 ). Contributed by PJ Fanning. Signed-off-by: Ayush Saxena <ayushsaxena@apache.org>	2023-11-20 23:44:48 +05:30
Anuj Modi	000a39ba2d	HADOOP-18872: [ABFS] [BugFix] Misreporting Retry Count for Sub-sequential and Parallel Operations (#6019 ) Contributed by Anuj Modi	2023-11-13 19:36:33 +00:00
Anuj Modi	597ceaae3a	HADOOP-18874: [ABFS] Add Server request ID in Exception Messages thrown to the caller. (#6004 ) Contributed by Anuj Modi	2023-11-06 20:56:55 +00:00
Anuj Modi	594e9f29f5	HADOOP-18869: [ABFS] Fix behavior of a File System APIs on root path (#6003 ) Contributed by Anuj Modi	2023-10-09 20:05:23 +01:00
Steve Loughran	882378c3e9	Revert "HADOOP-18869: [ABFS] Fix behavior of a File System APIs on root path (#6003 )" This reverts commit `6c6df40d35`. ...so as to give the correct credit	2023-10-09 20:05:07 +01:00
Anuj Modi	6c6df40d35	HADOOP-18869: [ABFS] Fix behavior of a File System APIs on root path (#6003 ) Contributed by Anmol Asrani	2023-10-09 20:01:56 +01:00
Anmol Asrani	9c621fcea7	HADOOP-18861. ABFS: Fix failing tests for CPK (#5979 ) Contributed by Anmol Asrani	2023-10-09 17:40:15 +01:00
Anmol Asrani	666af58700	HADOOP-18876. ABFS: Change default for fs.azure.data.blocks.buffer to bytebuffer (#6009 ) The default value for fs.azure.data.blocks.buffer is changed from "disk" to "bytebuffer". This will speed up writing to azure storage, at the risk of running out of memory, especially if there are many threads writing to abfs at the same time and the upload bandwidth is limited. If jobs do run out of memory writing to abfs, change the option back to "disk". Contributed by Anmol Asrani	2023-10-09 16:51:12 +01:00
PJ Fanning	57100bba1b	HADOOP-18917. Addendum: Upgrade to commons-io 2.14.0 (#6152 ). Contributed by PJ Fanning Co-authored-by: Ayush Saxena <ayushsaxena@apache.org>	2023-10-06 09:40:32 +05:30
Anmol Asrani	ababe3d9b0	HADOOP-18875. ABFS: Add sendMs and recvMs information for each AbfsHttpOperation by default. (#6008 ) Contributed By: Anmol Asrani	2023-10-04 13:55:03 -05:00
Pranav Saxena	f24b73e5f3	HADOOP-18873. ABFS: AbfsOutputStream doesn't close DataBlocks object. (#6010 ) AbfsOutputStream to close the dataBlock object created for the upload. Contributed By: Pranav Saxena	2023-09-20 14:24:36 +05:30
Anmol Asrani	01cc6d0bc8	HADOOP-18865. ABFS: Add "100-continue" in userAgent if enabled (#5987 ) Contributed by Anmol Asrani	2023-08-31 15:10:04 +01:00
Anuj Modi	ba32ea70fd	HADOOP-18826. [ABFS] Fix for GetFileStatus("/") failure. (#5909 ) Contributed by Anmol Asrani	2023-08-08 19:00:02 +01:00
Mehakmeet Singh	fac7d26c5d	HADOOP-18781. ABFS backReference passed down to streams to avoid GC closing the FS. (#5780 ) To avoid the ABFS instance getting closed due to GC while the streams are working, attach the ABFS instance to a backReference opaque object and pass it down to the streams so that we have a hard reference while the streams are working. Contributed by: Mehakmeet Singh	2023-07-11 17:57:05 +05:30
Steve Loughran	7a45ef4164	MAPREDUCE-7435. Manifest Committer OOM on abfs (#5519 ) This modifies the manifest committer so that the list of files to rename is passed between stages as a file of writeable entries on the local filesystem. The map of directories to create is still passed in memory; this map is built across all tasks, so even if many tasks created files, if they all write into the same set of directories the memory needed is O(directories) with the task count not a factor. The _SUCCESS file reports on heap size through gauges. This should give a warning if there are problems. Contributed by Steve Loughran	2023-06-09 17:00:59 +01:00
Ayush Saxena	1d0c9ab433	Revert "HADOOP-18207. Introduce hadoop-logging module (#5503 )" This reverts commit `03a499821c`.	2023-06-05 09:34:40 +05:30
Viraj Jasani	03a499821c	HADOOP-18207. Introduce hadoop-logging module (#5503 ) Reviewed-by: Duo Zhang <zhangduo@apache.org>	2023-06-02 18:07:34 -07:00
Steve Loughran	e76c09ac3b	HADOOP-18724. Open file fails with NumberFormatException for S3AFileSystem (#5611 ) This: 1. Adds optLong, optDouble, mustLong and mustDouble methods to the FSBuilder interface to let callers explicitly pass in long and double arguments. 2. The opt() and must() builder calls which take float/double values now only set long values instead, so as to avoid problems related to overloaded methods resulting in a ".0" being appended to a long value. 3. All of the relevant opt/must calls in the hadoop codebase move to the new methods. 4. The s3a code is resilient to parse errors in its numeric options: it will downgrade to the default. This is nominally incompatible, but the floating-point builder methods were never used: nothing currently expects floating point numbers. For anyone who wants to safely set numeric builder options across all compatible releases, convert the number to a string and then use the opt(String, String) and must(String, String) methods. Contributed by Steve Loughran	2023-05-11 17:57:25 +01:00
Tamas Domok	05e6dc19ea	HADOOP-18705. ABFS should exclude incompatible credential providers. (#5560 ) Contributed by Tamas Domok.	2023-04-24 15:46:40 +01:00
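
The status-code rules listed in the HADOOP-19096 entry can be read as a small decision function. The following is a minimal sketch of those rules only, not the ABFS implementation: the class name `ThrottlingContribution`, the enum `Reason` and the method `shouldContribute` are hypothetical names chosen for illustration, and the sketch assumes the caller has already classified the reason attached to a 503 response.

```java
public class ThrottlingContribution {

    /** Hypothetical labels for the reason attached to an HTTP 503 response. */
    public enum Reason {
        INGRESS_OVER_ACCOUNT_LIMIT,
        EGRESS_OVER_ACCOUNT_LIMIT,
        TPS_OVER_ACCOUNT_LIMIT,
        OTHER_SERVER_THROTTLING
    }

    /**
     * Decide whether a completed request should feed the client-side
     * backoff calculation, following the rules in the commit message.
     * Requests that ended in an IOException have no status code and,
     * per the commit message, never contribute; callers would skip them.
     */
    public static boolean shouldContribute(int statusCode, Reason reason) {
        // 2xx: successful operations contribute.
        if (statusCode >= 200 && statusCode < 300) {
            return true;
        }
        // 503: contributes only when an account limit was breached.
        if (statusCode == 503) {
            return reason == Reason.INGRESS_OVER_ACCOUNT_LIMIT
                || reason == Reason.EGRESS_OVER_ACCOUNT_LIMIT
                || reason == Reason.TPS_OVER_ACCOUNT_LIMIT;
        }
        // 3xx, 4xx and every other 5xx status do not contribute.
        return false;
    }

    public static void main(String[] args) {
        System.out.println(shouldContribute(200, null));                           // true
        System.out.println(shouldContribute(404, null));                           // false
        System.out.println(shouldContribute(503, Reason.TPS_OVER_ACCOUNT_LIMIT));  // true
        System.out.println(shouldContribute(503, Reason.OTHER_SERVER_THROTTLING)); // false
        System.out.println(shouldContribute(500, null));                           // false
    }
}
```

Keeping the 503 branch as an explicit allow-list means any throttling reason not named in the commit message defaults to "does not contribute", which matches the "Other Server Throttling" rule.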

427 Commits