flink_book/basic/checkpoint.md
2023-08-27 23:00:44 +08:00


State Backends and Checkpoints

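  • The state backend holds the job's working state locally.
  • A checkpoint periodically backs that state up to external storage, such as HDFS or OBS, so the data can be restored when the job is restarted.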
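
The split between the two can be illustrated with a toy model in Python. This is not Flink code: a dict stands in for the local state backend, a JSON file in a temporary directory stands in for external storage such as HDFS, and the names `ToyJob`, `checkpoint`, `restore` and `demo_restore` are made up for this sketch.

```python
import json
import tempfile
from pathlib import Path


class ToyJob:
    """Toy word counter; a dict plays the role of the local state backend."""

    def __init__(self, checkpoint_dir: Path):
        self.state = {}  # local working state: word -> count
        # Stand-in for a file on external storage (HDFS, OBS, ...).
        self.checkpoint_file = checkpoint_dir / "chk.json"

    def process(self, word: str) -> None:
        self.state[word] = self.state.get(word, 0) + 1

    def checkpoint(self) -> None:
        # Back up the current local state to the "external" file.
        self.checkpoint_file.write_text(json.dumps(self.state))

    def restore(self) -> None:
        # On restart, reload the last backup if one exists.
        if self.checkpoint_file.exists():
            self.state = json.loads(self.checkpoint_file.read_text())


def demo_restore() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        job = ToyJob(Path(tmp))
        for word in ["a", "b", "a"]:
            job.process(word)
        job.checkpoint()
        job.process("a")  # counted after the checkpoint, so absent from the backup

        restarted = ToyJob(Path(tmp))  # a restart begins with empty local state
        restarted.restore()
        return restarted.state


if __name__ == "__main__":
    print(demo_restore())
```

The restarted job sees only what the last checkpoint captured, which is why the checkpoint interval bounds how much work has to be redone after a failure.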


State backend

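| Option | Default | Description |
| --- | --- | --- |
| state.backend | - | State backend to use. Setting it to rocksdb is recommended. |
| state.backend.latency-track.keyed-state-enabled | false | Whether to track the latency of keyed state operations. Leaving it disabled is recommended. |
| state.backend.latency-track.sample-interval | 100 | Sampling interval for latency tracking: latency is measured once every 100 state accesses. |
| state.backend.latency-track.history-size | 128 | Number of measured latencies kept for each state access operation. |
| table.exec.state.ttl | - | TTL for idle state. Typically set for joins, so that state does not grow until the job fails. If it is not set, idle state is never cleaned up. |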
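
What a state TTL buys, and what it costs, can be sketched with a toy keyed store. This is a simplified model and not Flink's implementation: the class `TtlState`, the explicit `now_ms` clock, and the 1-minute TTL are assumptions for illustration. As with `table.exec.state.ttl`, an entry's timer restarts when the entry is written, and a TTL of 0 is treated here as "never expire".

```python
class TtlState:
    """Toy keyed state whose entries expire ttl_ms after their last write."""

    def __init__(self, ttl_ms: int):
        self.ttl_ms = ttl_ms
        self._rows = {}  # key -> (value, last_write_ms)

    def put(self, key, value, now_ms: int) -> None:
        # Writing (or rewriting) an entry restarts its TTL timer.
        self._rows[key] = (value, now_ms)

    def get(self, key, now_ms: int):
        entry = self._rows.get(key)
        if entry is None:
            return None
        value, written_ms = entry
        if self.ttl_ms > 0 and now_ms - written_ms >= self.ttl_ms:
            # Expired: drop it lazily on access so the store stops growing.
            del self._rows[key]
            return None
        return value

    def size(self) -> int:
        return len(self._rows)


def demo_ttl():
    orders = TtlState(ttl_ms=60_000)  # hypothetical 1-minute TTL
    orders.put("order-1", {"amount": 10}, now_ms=0)
    early = orders.get("order-1", now_ms=30_000)  # within TTL: row still there
    late = orders.get("order-1", now_ms=90_000)   # past TTL: row is gone
    return early, late, orders.size()


if __name__ == "__main__":
    print(demo_ttl())
```

The second lookup returns nothing, which is the trade-off in a join: a matching row that arrives after the TTL finds no partner, so the TTL should be longer than the largest delay expected between the two sides.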

Common checkpoint settings

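| Option | Default | Description |
| --- | --- | --- |
| execution.checkpointing.interval | - | How often a checkpoint is triggered. A value of roughly 1 to 10 min is recommended. |
| execution.checkpointing.mode | EXACTLY_ONCE | EXACTLY_ONCE guarantees exactly-once semantics; AT_LEAST_ONCE guarantees at-least-once. EXACTLY_ONCE is recommended. |
| state.backend.incremental | false | Whether to enable incremental checkpoints. Enabling it is recommended. |
| execution.checkpointing.timeout | 10min | Checkpoint timeout. A longer value, around 30 min, is recommended. |
| execution.checkpointing.unaligned.enabled | false | Whether to enable unaligned checkpoints. Leaving it disabled is recommended. |
| execution.checkpointing.unaligned.forced | false | Whether to force unaligned checkpoints. |
| execution.checkpointing.max-concurrent-checkpoints | 1 | Maximum number of checkpoints that may be in progress at the same time. |
| execution.checkpointing.min-pause | 0 | Minimum pause between the end of one checkpoint and the start of the next. |
| execution.checkpointing.tolerable-failed-checkpoints | - | Number of consecutive checkpoint failures that is tolerated. |
| execution.checkpointing.aligned-checkpoint-timeout | 0 | Alignment timeout for aligned checkpoints. Only relevant when unaligned checkpoints are enabled: a checkpoint starts aligned and switches to unaligned once alignment takes longer than this value; 0 means checkpoints always start unaligned. |
| execution.checkpointing.alignment-timeout | 0 | Deprecated. See execution.checkpointing.aligned-checkpoint-timeout. |
| execution.checkpointing.force | false | Deprecated. Whether to force checkpointing. |
| state.checkpoints.num-retained | 1 | Maximum number of completed checkpoints to retain. |
| state.backend.async | true | Deprecated. Whether to take checkpoints asynchronously. |
| state.savepoints.dir | - | Directory where savepoints are stored. |
| state.checkpoints.dir | - | Directory where checkpoints are stored. |
| state.storage.fs.memory-threshold | 20kb | Minimum size of a state data file. Smaller state chunks are stored inline in the checkpoint metadata file. |
| state.storage.fs.write-buffer-size | 4096 (4 * 1024 bytes) | Default size of the write buffer for checkpoint streams that write to file systems. |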
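
How the interval, min-pause and timeout settings interact is easier to see with numbers. The sketch below is a deliberately simplified timing model and not Flink's scheduler. It assumes one checkpoint in flight at a time, a first attempt exactly one interval after start, and that an expired attempt delays the next one the same way a completed one does. The function name `simulate` and the sample durations are hypothetical.

```python
def simulate(interval_ms, min_pause_ms, timeout_ms, durations_ms):
    """Return a list of (start_ms, end_ms, status), one per checkpoint attempt.

    Simplified rule: the next attempt starts at the later of
      - previous start + interval_ms, and
      - previous end + min_pause_ms.
    An attempt that would run longer than timeout_ms is cut off at the
    timeout and marked "expired".
    """
    results = []
    start = interval_ms  # assumption: first attempt after one full interval
    for duration in durations_ms:
        if duration > timeout_ms:
            end, status = start + timeout_ms, "expired"
        else:
            end, status = start + duration, "completed"
        results.append((start, end, status))
        start = max(start + interval_ms, end + min_pause_ms)
    return results


if __name__ == "__main__":
    # 1 min interval, 30 s min-pause, 10 min timeout (the default timeout above).
    for attempt in simulate(60_000, 30_000, 600_000,
                            [10_000, 50_000, 700_000, 5_000]):
        print(attempt)
```

In this run the second checkpoint takes 50 s, so the pause rather than the interval decides when the third one starts, and the third one runs past the 10 min timeout and expires. Raising the timeout, as recommended above, gives slow checkpoints room to finish, while a non-zero min-pause keeps the job from spending all of its time checkpointing.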

Common errors

The maximum number of queued checkpoint requests exceeded

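More than 1000 unfinished checkpoint requests are queued. Check whether the job is under backpressure or has a similar problem; in most cases backpressure is what makes checkpoints fail.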

Periodic checkpoint scheduler is shut down

The minimum time between checkpoints is still pending

Not all required tasks are currently running

Some operator tasks have already finished. In dimension-table join scenarios, on Flink versions before 1.13, the checkpoint may not be recoverable.

An Exception occurred while triggering the checkpoint.

Asynchronous task checkpoint failed.

The checkpoint was aborted due to exception of other subtasks sharing the ChannelState file

Checkpoint expired before completing

Checkpoint has been subsumed

Checkpoint was declined

Checkpoint was declined (tasks not ready)

Checkpoint was declined (task is closing)

Checkpoint was canceled because a barrier from newer checkpoint was received

Task received cancellation from one of its inputs

Checkpoint was declined because one input stream is finished

CheckpointCoordinator shutdown

Checkpoint Coordinator is suspending

FailoverRegion is restarting

Task has failed

Task local checkpoint failure

Unknown task for the checkpoint to notify

Failure to finalize checkpoint

Trigger checkpoint failure