Snapshot Backup and Restore Guide
This document describes how to back up and restore TiDB snapshots using the br command-line tool (hereinafter referred to as br). Before backing up and restoring data, you need to install the br command-line tool first.
Snapshot backup backs up the entire cluster. It is based on multi-version concurrency control (MVCC) and backs up all data in the specified snapshot to a target storage. The size of the backup data is approximately the size of a compressed single replica in the cluster. After the backup is completed, you can restore the backup data to an empty cluster or to a cluster that does not contain conflicting data (data with the same schema or same tables). The restore brings the cluster back to the time point of the snapshot backup and recreates multiple replicas according to the cluster replica settings.
Besides basic backup and restore, snapshot backup and restore also provides the following features:
Back up cluster snapshots
You can back up a TiDB cluster snapshot by running the tiup br backup full command. Run tiup br backup full --help to see the help information. The following example backs up the snapshot taken at 2022-09-08 13:30:00 +08:00 to Amazon S3:
tiup br backup full --pd "${PD_IP}:2379" \
--backupts '2022-09-08 13:30:00 +08:00' \
--storage "s3://backup-101/snapshot-202209081330?access-key=${access-key}&secret-access-key=${secret-access-key}" \
--ratelimit 128
In the preceding command:
- --backupts: The time point of the snapshot. The format can be TSO or timestamp, such as 400036290571534337 or 2018-05-11 01:42:23 +08:00. If the data of this snapshot is garbage collected, the tiup br backup command returns an error and br exits. When backing up using a timestamp, it is recommended to specify the time zone as well. Otherwise, br uses the local time zone to construct the timestamp by default, which might lead to an incorrect backup time point. If you leave this parameter unspecified, br picks the snapshot corresponding to the backup start time.
- --storage: The storage address of the backup data. Snapshot backup supports Amazon S3, Google Cloud Storage, and Azure Blob Storage as backup storage. The preceding command uses Amazon S3 as an example. For more details, see URI Formats of External Storage Services.
- --ratelimit: The maximum speed per TiKV performing backup tasks. The unit is MiB/s.
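Because a timestamp without a time zone is interpreted in the local time zone of the machine running br, it can help to generate the --backupts value with an explicit UTC offset. The following is a minimal sketch, assuming GNU date is available; the helper name format_backupts and the Asia/Shanghai zone are illustrative assumptions, not part of br.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of br): render a time string with an
# explicit UTC offset so that --backupts does not depend on the local zone.
# Requires GNU date for the -d option and the %:z format.
format_backupts() {
  # $1: a time string that GNU date understands
  # $2: the time zone used to interpret and print it (defaults to UTC)
  TZ="${2:-UTC}" date -d "$1" '+%Y-%m-%d %H:%M:%S %:z'
}

# Interpret a wall-clock time in a named zone and print it with its offset.
# The zone name is an example and needs the system time zone database.
format_backupts '2022-09-08 13:30:00' 'Asia/Shanghai'
```

The printed string can then be passed as the value of --backupts.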
During backup, a progress bar is displayed in the terminal as shown below. When the progress bar advances to 100%, the backup task is completed and statistics such as total backup time, average backup speed, and backup data size are displayed.
Full Backup <-------------------------------------------------------------------------------> 100.00%
Checksum <----------------------------------------------------------------------------------> 100.00%
*** ["Full Backup success summary"] *** [backup-checksum=3.597416ms] [backup-fast-checksum=2.36975ms] *** [total-take=4.715509333s] [BackupTS=435844546560000000] [total-kv=1131] [total-kv-size=250kB] [average-speed=53.02kB/s] [backup-data-size(after-compressed)=71.33kB] [Size=71330]
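When backups are scripted, the BackupTS field in the summary line is often the value to record. The following is a minimal sketch that extracts it with sed. The sample line is shortened from the summary shown above, and the helper name extract_backup_ts is a hypothetical choice for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read br output on stdin and print the value of the
# [BackupTS=...] field from the success summary line.
extract_backup_ts() {
  sed -n 's/.*\[BackupTS=\([0-9][0-9]*\)\].*/\1/p'
}

# Sample input, shortened from the summary line in this document.
summary='*** ["Full Backup success summary"] *** [total-take=4.715509333s] [BackupTS=435844546560000000] [total-kv=1131]'

printf '%s\n' "$summary" | extract_backup_ts
```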
Get the backup time point of a snapshot backup
If you manage many backups and need to get the physical time of a snapshot backup, you can run the following command:
tiup br validate decode --field="end-version" \
--storage "s3://backup-101/snapshot-202209081330?access-key=${access-key}&secret-access-key=${secret-access-key}" | tail -n1
The output is as follows, corresponding to the physical time 2022-09-08 13:30:00 +0800 CST:
435844546560000000
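The returned value is a TSO. A TiDB TSO stores the physical time in milliseconds in its high bits and an 18-bit logical counter in its low bits, so shifting right by 18 bits recovers the physical time. The following is a minimal sketch of that conversion, assuming bash (64-bit arithmetic) and GNU date; the helper name tso_to_ms is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: drop the 18 low (logical) bits of a TSO to get the
# physical time as milliseconds since the Unix epoch.
tso_to_ms() {
  echo $(( $1 >> 18 ))
}

tso=435844546560000000
ms=$(tso_to_ms "$tso")
echo "physical time (ms): ${ms}"

# Print the same instant in UTC (GNU date syntax).
date -u -d "@$(( ms / 1000 ))" '+%Y-%m-%d %H:%M:%S UTC'
```

For the TSO above, the result is 05:30:00 UTC on 2022-09-08, which is the same instant as 13:30:00 +0800.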
Restore cluster snapshots
You can restore a snapshot backup by running the tiup br restore full command. Run tiup br restore full --help to see the help information.
The following example restores the preceding backup snapshot to a target cluster:
tiup br restore full --pd "${PD_IP}:2379" \
--storage "s3://backup-101/snapshot-202209081330?access-key=${access-key}&secret-access-key=${secret-access-key}"
During restore, a progress bar is displayed in the terminal as shown below. When the progress bar advances to 100%, the restore task is completed and statistics such as total restore time, average restore speed, and total data size are displayed.
Full Restore <------------------------------------------------------------------------------> 100.00%
*** ["Full Restore success summary"] *** [total-take=4.344617542s] [total-kv=5] [total-kv-size=327B] [average-speed=75.27B/s] [restore-data-size(after-compressed)=4.813kB] [Size=4813] [BackupTS=435844901803917314]
Restore a database or a table
BR supports restoring partial data of a specified database or table from backup data. This feature allows you to filter out unwanted data and restore only a specific database or table.
Restore a database
To restore a database to a cluster, run the tiup br restore db command. The following example restores the test database from the backup data to the target cluster:
tiup br restore db \
--pd "${PD_IP}:2379" \
--db "test" \
--storage "s3://backup-101/snapshot-202209081330?access-key=${access-key}&secret-access-key=${secret-access-key}"
In the preceding command, --db specifies the name of the database to be restored.
Restore a table
To restore a single table to a cluster, run the tiup br restore table command. The following example restores the test.usertable table from the backup data to the target cluster:
tiup br restore table --pd "${PD_IP}:2379" \
--db "test" \
--table "usertable" \
--storage "s3://backup-101/snapshot-202209081330?access-key=${access-key}&secret-access-key=${secret-access-key}"
In the preceding command, --db specifies the name of the database to be restored, and --table specifies the name of the table to be restored.
Restore multiple tables with table filter
To restore multiple tables with more complex filter rules, run the tiup br restore full command and specify the table filters with --filter or -f. The following example restores tables that match the db*.tbl* filter rule from the backup data to the target cluster:
tiup br restore full \
--pd "${PD_IP}:2379" \
--filter 'db*.tbl*' \
--storage "s3://backup-101/snapshot-202209081330?access-key=${access-key}&secret-access-key=${secret-access-key}"
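To get a rough preview of which tables a wildcard rule such as db*.tbl* would select, the rule can be tried against a list of schema.table names before running the restore. The following is a minimal sketch that uses shell glob matching as an approximation. It is an assumption that shell globs behave like the table filter for simple * wildcards; the table filter syntax has additional rules that this sketch does not cover. The helper name matches_filter and the sample table names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: approximate a simple wildcard table filter with
# shell glob matching. This is NOT the BR table filter implementation.
matches_filter() {
  # $1: fully qualified name (schema.table), $2: wildcard rule
  case "$1" in
    $2) return 0 ;;   # unquoted on purpose so that * acts as a wildcard
    *)  return 1 ;;
  esac
}

# Sample names for illustration only.
for name in db1.tbl1 db2.tbl_orders test.usertable; do
  if matches_filter "$name" 'db*.tbl*'; then
    echo "restore: $name"
  else
    echo "skip:    $name"
  fi
done
```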
Restore tables in the mysql schema
- Starting from BR v5.1.0, when you back up snapshots, BR automatically backs up the system tables in the mysql schema, but does not restore these system tables by default.
- Starting from v6.2.0, BR lets you specify --with-sys-table to restore data in some system tables.
- Starting from v7.6.0, BR enables --with-sys-table by default, which means that BR restores data in some system tables by default.
BR can restore data in the following system tables:
+----------------------------------+
| mysql.columns_priv |
| mysql.db |
| mysql.default_roles |
| mysql.global_grants |
| mysql.global_priv |
| mysql.role_edges |
| mysql.tables_priv |
| mysql.user |
| mysql.bind_info |
+----------------------------------+
BR does not restore the following system tables:
- Statistics tables (mysql.stat_*). But statistics can be restored. See Back up statistics.
- System variable tables (mysql.tidb and mysql.global_variables)
- Other system tables
+-----------------------------------------------------+
| capture_plan_baselines_blacklist |
| column_stats_usage |
| gc_delete_range |
| gc_delete_range_done |
| global_variables |
| stats_buckets |
| stats_extended |
| stats_feedback |
| stats_fm_sketch |
| stats_histograms |
| stats_history |
| stats_meta |
| stats_meta_history |
| stats_table_locked |
| stats_top_n |
| tidb |
+-----------------------------------------------------+
When you restore data related to system privileges, note that before restoring data, BR checks whether the system tables in the target cluster are compatible with those in the backup data. "Compatible" means that all the following conditions are met:
- The target cluster has the same system tables as the backup data.
- The number of columns in the system privilege table of the target cluster is the same as that in the backup data. The column order is not important.
- The columns in the system privilege table of the target cluster are compatible with those in the backup data. If the data type of the column is a type with a length (such as integer and string), the length in the target cluster must be >= the length in the backup data. If the data type of the column is an ENUM type, the set of ENUM values in the target cluster must be a superset of that in the backup data.
Performance and impact
Performance and impact of snapshot backup
The backup feature has some impact on cluster performance (transaction latency and QPS). However, you can mitigate the impact by adjusting the number of backup threads backup.num-threads or by adding more clusters.
To illustrate the impact of backup, this document lists the test conclusions of several snapshot backup tests:
- (5.3.0 and earlier) When the backup threads of BR on a TiKV node take up 75% of the total CPU of the node, the QPS is reduced by 35% of the original QPS.
- (5.4.0 and later) When there are no more than 8 threads of BR on a TiKV node and the cluster's total CPU utilization does not exceed 80%, the impact of BR tasks on the cluster (write and read) is 20% at most.
- (5.4.0 and later) When there are no more than 8 threads of BR on a TiKV node and the cluster's total CPU utilization does not exceed 75%, the impact of BR tasks on the cluster (write and read) is 10% at most.
- (5.4.0 and later) When there are no more than 8 threads of BR on a TiKV node and the cluster's total CPU utilization does not exceed 60%, BR tasks have little impact on the cluster (write and read).
You can use the following methods to manually control the impact of backup tasks on cluster performance. However, these two methods also reduce the speed of backup tasks while reducing the impact of backup tasks on the cluster.
- Use the --ratelimit parameter to limit the speed of backup tasks. Note that this parameter limits the speed of saving backup files to external storage. When calculating the total size of backup files, use the backup data size(after compressed) as a benchmark. When --ratelimit is set, to avoid too many tasks causing the speed limit to fail, the concurrency parameter of br is automatically adjusted to 1.
- Adjust the TiKV configuration item backup.num-threads to limit the number of threads used by backup tasks. According to internal tests, when BR uses no more than 8 threads for backup tasks, and the total CPU utilization of the cluster does not exceed 60%, the backup tasks have little impact on the cluster, regardless of the read and write workload.
You can reduce the impact of backup on cluster performance by limiting the number of backup threads, but this also lowers the backup speed. The preceding tests show that the backup speed is proportional to the number of backup threads. When the number of threads is small, the backup speed is about 20 MiB/s per thread. For example, 5 backup threads on a single TiKV node can reach a backup speed of 100 MiB/s.
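The proportional relationship above can be turned into a quick estimate. The following is a minimal sketch, assuming the roughly 20 MiB/s per thread figure from the tests in this document holds for your cluster; the helper names and the 512000 MiB data size are hypothetical, and real speeds depend on hardware and workload.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a back-of-the-envelope estimate.

# Estimated backup speed of one TiKV node in MiB/s.
# $1: number of backup threads, $2: MiB/s per thread (defaults to 20)
estimate_backup_speed() {
  echo $(( $1 * ${2:-20} ))
}

# Estimated backup duration in seconds (integer division, rounded down).
# $1: data size on the node in MiB, $2: speed in MiB/s
estimate_backup_seconds() {
  echo $(( $1 / $2 ))
}

speed=$(estimate_backup_speed 5)
echo "estimated speed: ${speed} MiB/s"
echo "estimated time for 512000 MiB: $(estimate_backup_seconds 512000 "$speed") s"
```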
Performance and impact of snapshot restore
During data restore, TiDB tries to fully utilize the TiKV CPU, disk IO, and network bandwidth resources. Therefore, it is recommended to restore the backup data on an empty cluster to avoid affecting the running applications.
The speed of restoring backup data depends heavily on the cluster configuration, deployment, and running applications. In internal tests, the restore speed of a single TiKV node can reach 100 MiB/s. The performance and impact of snapshot restore vary across user scenarios and should be tested in actual environments.
BR provides a coarse-grained Region scattering algorithm to accelerate Region restore in large-scale Region scenarios. The algorithm is controlled by the command-line parameter --granularity="coarse-grained" and is enabled by default. This algorithm ensures that each TiKV node receives stable and evenly distributed download tasks, thus fully utilizing the resources of each TiKV node and achieving a rapid parallel recovery. In several real-world cases, the snapshot restore speed of the cluster is improved by about 3 times in large-scale Region scenarios. The following is an example:
tiup br restore full \
--pd "${PDIP}:2379" \
--storage "s3://${Bucket}/${Folder}" \
--s3.region "${region}" \
--granularity "coarse-grained" \
--send-credentials-to-tikv=true \
--log-file restorefull.log
Starting from v8.0.0, the br command-line tool introduces the --tikv-max-restore-concurrency parameter to control the maximum number of files that BR downloads and ingests per TiKV node. By configuring this parameter, you can also control the maximum length of the job queue (the maximum length of the job queue = 32 * the number of TiKV nodes * --tikv-max-restore-concurrency), thereby controlling the memory consumption of the BR node.
In normal cases, --tikv-max-restore-concurrency is automatically adjusted based on the cluster configuration, so manual configuration is unnecessary. If the TiKV-Details > Backup & Import > Import RPC count monitoring metric in Grafana shows that the number of files BR downloads remains close to 0 for a long time while the number of files that BR ingests consistently reaches the upper limit, it indicates that ingesting file tasks pile up and the job queue has reached its maximum length. In this case, you can take the following measures to alleviate the task pile-up issue:
- Set the --ratelimit parameter to limit the download speed, ensuring sufficient resources for ingesting file tasks. For example, if the disk throughput of any TiKV node is x MiB/s and the network bandwidth for downloading backup files exceeds x/2 MiB/s, you can set the parameter as --ratelimit x/2. If the disk throughput of any TiKV node is x MiB/s and the network bandwidth for downloading backup files is less than or equal to x/2 MiB/s, you can leave the parameter --ratelimit unset.
- Increase the --tikv-max-restore-concurrency to increase the maximum length of the job queue.
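The job queue formula and the --ratelimit rule of thumb above can be sketched as simple arithmetic. The following is a minimal sketch under stated assumptions: the helper names and the sample values (3 TiKV nodes, a concurrency of 128, 400 MiB/s disk throughput) are hypothetical, and the functions only reproduce the rules described in this document, not any BR internals.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that restate the rules from this document.

# Maximum job queue length = 32 * number of TiKV nodes * concurrency.
# $1: number of TiKV nodes, $2: value of --tikv-max-restore-concurrency
max_job_queue_length() {
  echo $(( 32 * $1 * $2 ))
}

# Suggest a --ratelimit value in MiB/s, or print "unset".
# $1: disk throughput x of a TiKV node in MiB/s
# $2: network bandwidth for downloading backup files in MiB/s
suggest_ratelimit() {
  local disk=$1 net=$2
  # "net > x/2" is checked as "2 * net > x" to stay in integer arithmetic.
  if (( 2 * net > disk )); then
    echo $(( disk / 2 ))
  else
    echo "unset"
  fi
}

echo "max job queue length: $(max_job_queue_length 3 128)"
echo "suggested --ratelimit: $(suggest_ratelimit 400 300)"
```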
- Set the