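# State Backends and Checkpoints
- The state backend holds the job's state locally.
- A checkpoint periodically backs up that state to third-party storage such as HDFS or OBS, so that the data can be restored when the job is run again.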
![pic](https://pan.zeekling.cn//flink/basic/flink_backent_0001.png)
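# State Backend
| Option | Default | Description |
|---|---|---|
| **state.backend** | - | Recommended value: rocksdb |
| state.backend.latency-track.keyed-state-enabled | false | Whether to track the latency of keyed state operations; enabling it is not recommended |
| state.backend.latency-track.sample-interval | 100 | Sampling interval for latency tracking: latency is measured once every 100 state accesses |
| state.backend.latency-track.history-size | 128 | Number of measured latencies kept for each state access operation |
| table.exec.state.ttl | - | TTL for state; typically used in join scenarios to keep the state from growing so large that the job fails |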
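
As a rough illustration, the options in the table above could be set in `flink-conf.yaml` as follows. This is a sketch, not a tuned configuration: the `24 h` TTL is an assumed value chosen only for the example, and latency tracking is left at its default because the table advises against enabling it.

```yaml
# Use RocksDB as the state backend, as recommended above.
state.backend: rocksdb
# Keep keyed-state latency tracking off (the default).
state.backend.latency-track.keyed-state-enabled: false
# Assumed example value: expire idle state after 24 hours
# so that join state cannot grow without bound.
table.exec.state.ttl: 24 h
```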
# Common Checkpoint Options
| Option | Default | Description |
|---|---|---|
| **execution.checkpointing.interval** | - | Interval at which checkpoints are triggered; a checkpoint starts each time the interval elapses. A value of roughly 1-10 min is recommended |
| **execution.checkpointing.mode** | EXACTLY_ONCE | EXACTLY_ONCE: exactly-once guarantee;<br> AT_LEAST_ONCE: at-least-once guarantee. EXACTLY_ONCE is recommended |
| **state.backend.incremental** | false | Whether to enable incremental checkpoints; enabling it is recommended |
| **execution.checkpointing.timeout** | 10min | Checkpoint timeout; a longer value of about 30 min is recommended |
| **execution.checkpointing.unaligned.enabled** | false | Whether to enable unaligned checkpoints; leaving it disabled is recommended |
| execution.checkpointing.unaligned.forced | false | Whether to force unaligned checkpoints |
| execution.checkpointing.max-concurrent-checkpoints | 1 | Maximum number of checkpoints that may be in progress at the same time |
| execution.checkpointing.min-pause | 0 | Minimum pause between two checkpoints |
| execution.checkpointing.tolerable-failed-checkpoints | - | Number of consecutive checkpoint failures that can be tolerated |
| execution.checkpointing.aligned-checkpoint-timeout | 0 | Timeout for aligned checkpoints |
| execution.checkpointing.alignment-timeout | 0 | See execution.checkpointing.aligned-checkpoint-timeout <span style="color:red;">(deprecated)</span> |
| execution.checkpointing.force | false | Whether to force checkpointing <span style="color:red;">(deprecated)</span> |
| state.checkpoints.num-retained | 1 | Number of completed checkpoints to retain |
| state.backend.async | true | Whether to enable asynchronous checkpoints <span style="color:red;">(deprecated)</span> |
| state.savepoints.dir | - | Directory where savepoints are stored |
| state.checkpoints.dir | - | Directory where checkpoints are stored |
| state.storage.fs.memory-threshold | 20kb | Minimum size of a state file |
| state.storage.fs.write-buffer-size | 4 * 1024 | Default size of the write buffer for checkpoint streams written to the file system |
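
The recommended (bold) options above might be combined in `flink-conf.yaml` roughly as shown below. This is a minimal sketch under stated assumptions: the 5 min interval is one arbitrary point in the suggested 1-10 min range, and both `hdfs:///flink/...` paths are hypothetical placeholders, not locations taken from this document.

```yaml
# Trigger a checkpoint every 5 minutes (assumed value within the 1-10 min range).
execution.checkpointing.interval: 5min
execution.checkpointing.mode: EXACTLY_ONCE
# Give slow checkpoints more room than the 10 min default.
execution.checkpointing.timeout: 30min
# Enable incremental checkpoints, as recommended above.
state.backend.incremental: true
# Hypothetical storage locations; replace with your own HDFS or OBS paths.
state.checkpoints.dir: hdfs:///flink/checkpoints
state.savepoints.dir: hdfs:///flink/savepoints
```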
# Common Errors
## The maximum number of queued checkpoint requests exceeded
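More than 1000 unfinished checkpoint requests are waiting in the queue. Check whether the job is under backpressure or has a similar problem; in general, backpressure causes checkpoints to fail.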
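
While the cause of the backpressure is being investigated, checkpoint scheduling can also be made less aggressive. The fragment below is only a sketch using options from the checkpoint table earlier in this document; the `1min` and `3` values are assumptions for illustration, and the settings do not remove the backpressure itself.

```yaml
# Assumed value: leave at least 1 minute between two checkpoints.
execution.checkpointing.min-pause: 1min
# Keep a single checkpoint in flight (the default).
execution.checkpointing.max-concurrent-checkpoints: 1
# Assumed value: tolerate up to 3 consecutive checkpoint failures.
execution.checkpointing.tolerable-failed-checkpoints: 3
```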
## Periodic checkpoint scheduler is shut down
## The minimum time between checkpoints is still pending
## Not all required tasks are currently running
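Some operator tasks have already finished. In dimension-table join scenarios, however, Flink versions before 1.13 may be unable to restore from a checkpoint.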
## An Exception occurred while triggering the checkpoint.
## Asynchronous task checkpoint failed.
## The checkpoint was aborted due to exception of other subtasks sharing the ChannelState file
## Checkpoint expired before completing
## Checkpoint has been subsumed
## Checkpoint was declined
## Checkpoint was declined (tasks not ready)
## Checkpoint was declined (task is closing)
## Checkpoint was canceled because a barrier from newer checkpoint was received
## Task received cancellation from one of its inputs
## Checkpoint was declined because one input stream is finished
## CheckpointCoordinator shutdown
## Checkpoint Coordinator is suspending
## FailoverRegion is restarting
## Task has failed
## Task local checkpoint failure
## Unknown task for the checkpoint to notify
## Failure to finalize checkpoint
## Trigger checkpoint failure