Monitoring and Alert for Backup and Restore
This document describes monitoring and alerting for the backup and restore feature, including how to deploy the monitoring components, the available monitoring metrics, and common alerts.
Snapshot backup and restore monitoring
To view the snapshot backup and restore metrics, go to the TiKV-Details > Backup & Import dashboard in Grafana.
Log backup monitoring
Log backup supports using Prometheus to collect monitoring metrics. Currently, all monitoring metrics are built into TiKV.
Monitoring configuration
- For clusters deployed using TiUP, Prometheus automatically collects monitoring metrics.
- For clusters deployed manually, follow the instructions in TiDB Cluster Monitoring Deployment to add TiKV-related jobs to the scrape_configs section of the Prometheus configuration file.
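For a manually deployed cluster, the added job might look like the following minimal sketch. The job name and the target addresses are placeholders, and port 20180 assumes that TiKV uses its default status port. Adjust all of them to your deployment.

```yaml
# prometheus.yml (fragment): a hypothetical scrape job for TiKV.
# The job name and targets are illustrative assumptions, not values from this document.
scrape_configs:
  - job_name: 'tikv'
    static_configs:
      - targets:
          - '10.0.1.1:20180'   # placeholder: <tikv-host>:<status-port>
          - '10.0.1.2:20180'   # placeholder: one entry per TiKV instance
```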
Grafana configuration
- For clusters deployed using TiUP, the Grafana dashboard contains the point-in-time recovery (PITR) panel. The Backup Log panel in the TiKV-Details dashboard is the PITR panel.
- For clusters deployed manually, refer to Import a Grafana dashboard and upload the tikv_details JSON file to Grafana. Then find the Backup Log panel in the TiKV-Details dashboard.
Monitoring metrics
Log backup alerts
Alert configuration
Currently, PITR does not have built-in alert items. This section introduces how to configure alert items in PITR and recommends some items.
To configure alert items in PITR, follow these steps:
- Create a configuration file (for example, pitr.rules.yml) for the alert rules on the node where Prometheus is located. In the file, fill in the alert rules according to the Prometheus documentation, the following recommended alert items, and the configuration sample.
- In the rule_files field of the Prometheus configuration file, add the path of the alert rule file.
- Send a SIGHUP signal to the Prometheus process (kill -HUP pid) or send an HTTP POST request to http://prometheus-addr/-/reload (before you send the HTTP request, add the --web.enable-lifecycle parameter when starting Prometheus).
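The second step can be sketched as the following fragment of the Prometheus configuration file. It assumes that the rule file is named pitr.rules.yml, as in the example above, and that it is stored in the same directory as the Prometheus configuration file. If it is stored elsewhere, use its absolute path instead.

```yaml
# prometheus.yml (fragment): register the PITR alert rule file.
# The file name follows the example in step 1; the location is an assumption.
rule_files:
  - 'pitr.rules.yml'
```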
The recommended alert items are as follows:
LogBackupRunningRPOMoreThan10m
- Alert item: max(time() - tidb_log_backup_last_checkpoint / 262144000) by (task) / 60 > 10 and max(tidb_log_backup_last_checkpoint) by (task) > 0 and max(tikv_log_backup_task_status) by (task) == 0
- Alert level: warning
- Description: The log data has not been persisted to the storage for more than 10 minutes. This alert item is a reminder. In most cases, it does not affect log backup.
A configuration sample of this alert item is as follows:
groups:
- name: PiTR
  rules:
  - alert: LogBackupRunningRPOMoreThan10m
    expr: max(time() - tidb_log_backup_last_checkpoint / 262144000) by (task) / 60 > 10 and max(tidb_log_backup_last_checkpoint) by (task) > 0 and max(tikv_log_backup_task_status) by (task) == 0
    labels:
      severity: warning
    annotations:
      summary: RPO of log backup is high
      message: RPO of the log backup task {{ $labels.task }} is more than 10m
LogBackupRunningRPOMoreThan30m
- Alert item: max(time() - tidb_log_backup_last_checkpoint / 262144000) by (task) / 60 > 30 and max(tidb_log_backup_last_checkpoint) by (task) > 0 and max(tikv_log_backup_task_status) by (task) == 0
- Alert level: critical
- Description: The log data has not been persisted to the storage for more than 30 minutes. This alert often indicates anomalies. You can check the TiKV logs to find the cause.
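A configuration sample for this alert item could look like the following sketch. The expression and severity come from the alert item above. The annotation texts are illustrative wording, not an official sample. If you keep several rules in one file, append the rule entry to the existing rules list instead of repeating the groups key.

```yaml
groups:
- name: PiTR
  rules:
  - alert: LogBackupRunningRPOMoreThan30m
    expr: max(time() - tidb_log_backup_last_checkpoint / 262144000) by (task) / 60 > 30 and max(tidb_log_backup_last_checkpoint) by (task) > 0 and max(tikv_log_backup_task_status) by (task) == 0
    labels:
      severity: critical
    annotations:
      # Annotation texts are assumptions; adapt them to your alerting conventions.
      summary: RPO of log backup is very high
      message: RPO of the log backup task {{ $labels.task }} is more than 30m
```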
LogBackupPausingMoreThan2h
- Alert item: max(time() - tidb_log_backup_last_checkpoint / 262144000) by (task) / 3600 > 2 and max(tidb_log_backup_last_checkpoint) by (task) > 0 and max(tikv_log_backup_task_status) by (task) == 1
- Alert level: warning
- Description: The log backup task has been paused for more than 2 hours. This alert item is a reminder. You are expected to run br log resume as soon as possible.
LogBackupPausingMoreThan12h
- Alert item: max(time() - tidb_log_backup_last_checkpoint / 262144000) by (task) / 3600 > 12 and max(tidb_log_backup_last_checkpoint) by (task) > 0 and max(tikv_log_backup_task_status) by (task) == 1
- Alert level: critical
- Description: The log backup task has been paused for more than 12 hours. You are expected to run br log resume as soon as possible to resume the task. A log backup task that stays paused for too long risks data loss.
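The two pause-related alert items can be written as rules in the same way. The following is a sketch: the expressions and severities come from the alert items above, while the annotation texts are illustrative wording rather than an official sample.

```yaml
groups:
- name: PiTR
  rules:
  - alert: LogBackupPausingMoreThan2h
    expr: max(time() - tidb_log_backup_last_checkpoint / 262144000) by (task) / 3600 > 2 and max(tidb_log_backup_last_checkpoint) by (task) > 0 and max(tikv_log_backup_task_status) by (task) == 1
    labels:
      severity: warning
    annotations:
      # Annotation texts are assumptions; adapt them to your alerting conventions.
      summary: Log backup task is paused
      message: The log backup task {{ $labels.task }} has been paused for more than 2h
  - alert: LogBackupPausingMoreThan12h
    expr: max(time() - tidb_log_backup_last_checkpoint / 262144000) by (task) / 3600 > 12 and max(tidb_log_backup_last_checkpoint) by (task) > 0 and max(tikv_log_backup_task_status) by (task) == 1
    labels:
      severity: critical
    annotations:
      summary: Log backup task has been paused for a long time
      message: The log backup task {{ $labels.task }} has been paused for more than 12h
```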
LogBackupFailed
- Alert item: max(tikv_log_backup_task_status) by (task) == 2 and max(tidb_log_backup_last_checkpoint) by (task) > 0
- Alert level: critical
- Description: The log backup task has failed. Run br log status to see the failure reason. If necessary, check the TiKV logs for further details.
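A rule for this alert item might be sketched as follows. The expression and severity are taken from the alert item above. The annotation texts are illustrative wording, not an official sample.

```yaml
groups:
- name: PiTR
  rules:
  - alert: LogBackupFailed
    expr: max(tikv_log_backup_task_status) by (task) == 2 and max(tidb_log_backup_last_checkpoint) by (task) > 0
    labels:
      severity: critical
    annotations:
      # Annotation texts are assumptions; adapt them to your alerting conventions.
      summary: Log backup task failed
      message: The log backup task {{ $labels.task }} has failed
```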
LogBackupGCSafePointExceedsCheckpoint
- Alert item: min(tidb_log_backup_last_checkpoint) by (instance) - max(tikv_gcworker_autogc_safe_point) by (instance) < 0
- Alert level: critical
- Description: Some data has been garbage-collected before the backup. This means that some data has been lost and is very likely to affect your services.
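A rule for this alert item might be sketched as follows. The expression and severity are taken from the alert item above. Because the expression aggregates by instance, the sketch uses the instance label in the message. The annotation texts are illustrative wording, not an official sample.

```yaml
groups:
- name: PiTR
  rules:
  - alert: LogBackupGCSafePointExceedsCheckpoint
    expr: min(tidb_log_backup_last_checkpoint) by (instance) - max(tikv_gcworker_autogc_safe_point) by (instance) < 0
    labels:
      severity: critical
    annotations:
      # Annotation texts are assumptions; adapt them to your alerting conventions.
      summary: GC safe point exceeds the log backup checkpoint
      message: The GC safe point on {{ $labels.instance }} is ahead of the log backup checkpoint
```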