flink_book/basic/checkpoint.md
2023-08-27 23:00:44 +08:00

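# State backend and checkpoint
- The state backend keeps a job's state locally.
- A checkpoint periodically backs that state up to third-party storage such as HDFS or OBS, so the data can be restored when the job is rerun.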
![pic](https://pan.zeekling.cn//flink/basic/flink_backent_0001.png)
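# State backend
| Option | Default | Description |
|---|---|---|
| **state.backend** | - | Recommended value: rocksdb |
| state.backend.latency-track.keyed-state-enabled | false | Whether to track the latency of keyed state operations. Leaving it disabled is recommended |
| state.backend.latency-track.sample-interval | 100 | Sampling interval for latency tracking: with the default, latency is measured once every 100 state accesses |
| state.backend.latency-track.history-size | 128 | Number of measured latencies kept for each state access operation |
| table.exec.state.ttl | - | TTL for idle state. Typically used in join scenarios to keep the state from growing so large that the job fails |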
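
The recommendations in the table above could be written into `flink-conf.yaml` roughly as follows. This is a minimal sketch: `render_flink_conf` is a hypothetical helper (not part of Flink), and the 1-day TTL is only an illustrative value that should be tuned per job.

```python
def render_flink_conf(settings):
    """Render a dict of Flink options as `key: value` lines for flink-conf.yaml."""
    lines = []
    for key, value in settings.items():
        # Python prints booleans as True/False; emit lowercase true/false instead.
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{key}: {value}")
    return "\n".join(lines)


# Values follow the table above; the TTL value is an assumption for illustration.
state_backend_settings = {
    "state.backend": "rocksdb",
    "state.backend.latency-track.keyed-state-enabled": False,
    "table.exec.state.ttl": "1 d",
}

print(render_flink_conf(state_backend_settings))
```

The printed lines can be appended to the cluster's configuration file, or the same keys can be set per job.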
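# Common checkpoint settings
| Option | Default | Description |
|---|---|---|
| **execution.checkpointing.interval** | - | Interval at which checkpoints are triggered periodically. A value of roughly 1-10 min is generally recommended |
| **execution.checkpointing.mode** | EXACTLY_ONCE | EXACTLY_ONCE: exactly-once guarantee;<br> AT_LEAST_ONCE: at-least-once guarantee. EXACTLY_ONCE is recommended |
| **state.backend.incremental** | false | Whether to enable incremental checkpoints. Enabling it is recommended |
| **execution.checkpointing.timeout** | 10min | Checkpoint timeout. A longer value of around 30 min is recommended |
| **execution.checkpointing.unaligned.enabled** | false | Whether to enable unaligned checkpoints. Leaving it disabled is recommended |
| execution.checkpointing.unaligned.forced | false | Whether to force unaligned checkpoints |
| execution.checkpointing.max-concurrent-checkpoints | 1 | Maximum number of checkpoints that may be in progress at the same time |
| execution.checkpointing.min-pause | 0 | Minimum pause between two checkpoints |
| execution.checkpointing.tolerable-failed-checkpoints | - | Number of consecutive checkpoint failures that can be tolerated |
| execution.checkpointing.aligned-checkpoint-timeout | 0 | Timeout for aligned checkpoints. Only relevant when unaligned checkpoints are enabled: a checkpoint starts aligned and switches to unaligned once alignment takes longer than this value; with 0 it starts unaligned right away |
| execution.checkpointing.alignment-timeout | 0 | See execution.checkpointing.aligned-checkpoint-timeout <span style="color:red;">(deprecated)</span> |
| execution.checkpointing.force | false | Whether to force checkpoints <span style="color:red;">(deprecated)</span> |
| state.checkpoints.num-retained | 1 | Number of completed checkpoints to retain |
| state.backend.async | true | Whether to enable asynchronous checkpoints <span style="color:red;">(deprecated)</span> |
| state.savepoints.dir | - | Directory where savepoints are stored |
| state.checkpoints.dir | - | Directory where checkpoints are stored |
| state.storage.fs.memory-threshold | 20kb | Minimum size of state files; smaller state chunks are stored inline in the checkpoint metadata file |
| state.storage.fs.write-buffer-size | 4 * 1024 | Default size of the write buffer for checkpoint streams that write to the file system |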
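
How `execution.checkpointing.interval` and `execution.checkpointing.min-pause` interact can be sketched with a small model. It is a simplification that assumes `max-concurrent-checkpoints` is 1; the function name and the example numbers are made up for illustration and do not mirror Flink's internal scheduler.

```python
def earliest_next_trigger(last_trigger_ms, last_completion_ms, interval_ms, min_pause_ms):
    """Earliest time (ms) the next periodic checkpoint may start in this simplified model."""
    # The periodic schedule asks for one checkpoint per interval.
    next_by_interval = last_trigger_ms + interval_ms
    # min-pause counts from the completion of the previous checkpoint.
    next_by_pause = last_completion_ms + min_pause_ms
    # Both conditions must hold, so the later of the two wins.
    return max(next_by_interval, next_by_pause)


# Example: 5 min interval, 1 min min-pause; the previous checkpoint started at
# t = 0 and needed 4.5 min to complete.
interval = 5 * 60_000
min_pause = 60_000
print(earliest_next_trigger(0, 270_000, interval, min_pause))  # 330000
```

In this example the slow checkpoint pushes the next trigger from 5 min to 5.5 min, which is the effect `min-pause` is meant to have: the job gets time to make progress between checkpoints.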
# Common errors
## The maximum number of queued checkpoint requests exceeded
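More than 1000 unfinished checkpoint requests are queued. Check whether the job is suffering from backpressure or a similar problem; in general, backpressure causes checkpoints to fail.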
## Periodic checkpoint scheduler is shut down
## The minimum time between checkpoints is still pending
## Not all required tasks are currently running
Some operator tasks have already finished. In dimension-table join scenarios, Flink versions before 1.13 may be unable to restore the checkpoint.
## An Exception occurred while triggering the checkpoint.
## Asynchronous task checkpoint failed.
## The checkpoint was aborted due to exception of other subtasks sharing the ChannelState file
## Checkpoint expired before completing
## Checkpoint has been subsumed
## Checkpoint was declined
## Checkpoint was declined (tasks not ready)
## Checkpoint was declined (task is closing)
## Checkpoint was canceled because a barrier from newer checkpoint was received
## Task received cancellation from one of its inputs
## Checkpoint was declined because one input stream is finished
## CheckpointCoordinator shutdown
## Checkpoint Coordinator is suspending
## FailoverRegion is restarting
## Task has failed
## Task local checkpoint failure
## Unknown task for the checkpoint to notify
## Failure to finalize checkpoint
## Trigger checkpoint failure