Performance of EBS Snapshot Backup and Restore

This document describes the performance of EBS snapshot backup and restore, the factors that affect performance, and the performance test results. All performance figures were collected in the AWS us-west-2 region.

Backup performance

This section introduces the performance of EBS snapshot backup using volumes, the factors that affect performance, and the performance test results.

Backup time consumption

EBS snapshot backup using volumes consists of the following processes: creating a backup task, stopping scheduling, disabling GC, and obtaining the backupts and the volume snapshots. For more detailed information about these processes, see Architecture of EBS snapshot volume backup and restore. Among these processes, creating the volume snapshots consumes most of the time. Volume snapshots are created in parallel, so the time taken to complete the entire backup task is determined by the volume snapshot that takes the longest to create.
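Because the snapshots run in parallel, the total backup time can be approximated as the slowest snapshot plus a small fixed overhead. The following is a minimal sketch of that arithmetic, not part of BR or TiDB Operator. The function name, the volume names, and the per-volume durations are hypothetical and only illustrate the idea.

```python
def estimate_backup_minutes(snapshot_minutes, overhead_minutes=1 / 60):
    """Estimate total backup time for snapshots that are created in parallel.

    snapshot_minutes: mapping of volume name -> snapshot creation time (minutes).
    overhead_minutes: time for stopping scheduling, disabling GC, and obtaining
                      the backupts (assumed here to be about 1 second).
    """
    if not snapshot_minutes:
        raise ValueError("at least one volume snapshot is required")
    # Parallel creation: the task finishes when the slowest snapshot finishes.
    return overhead_minutes + max(snapshot_minutes.values())


# Hypothetical per-volume snapshot times, in minutes.
durations = {"tikv-0": 12.0, "tikv-1": 16.0, "tikv-2": 14.5}
total = estimate_backup_minutes(durations)
print(f"estimated backup time: {total:.2f} minutes")
```

In this model, adding more volumes does not lengthen the backup unless one of them becomes the new slowest snapshot.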

Time consumption ratio of backup

| Backup stage | Time taken | Total ratio | Remarks |
|---|---|---|---|
| Create volume snapshots | 16 minutes (50 GB) | 99% | Including the time for creating AWS EBS snapshots |
| Others | 1 second | 1% | Including the time for stopping scheduling, disabling GC, and obtaining the backupts |

Backup performance data

The time taken by snapshot backup using volumes depends on when the last volume snapshot completes, which is handled by AWS EBS. Currently, AWS does not provide quantitative metrics for volume snapshot backup. With the recommended machine type and GP3 storage volumes configured with 400 MiB/s and 7000 IOPS, the time taken by the entire backup process is as follows:

[Figure: EBS snapshot backup performance]

| Volume data | Total volume size | Volume configuration | Approximate backup duration |
|---|---|---|---|
| 50 GB | 500 GB | 7000 IOPS / 400 MiB/s | 20 minutes |
| 100 GB | 500 GB | 7000 IOPS / 400 MiB/s | 50 minutes |
| 200 GB | 500 GB | 7000 IOPS / 400 MiB/s | 100 minutes |
| 500 GB | 1024 GB | 7000 IOPS / 400 MiB/s | 150 minutes |
| 1024 GB | 3500 GB | 7000 IOPS / 400 MiB/s | 350 minutes |
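To compare the rows, it can help to convert each result into an effective backup rate (volume data divided by backup duration). The sketch below only does this arithmetic on the figures in the table. The names `BACKUP_RESULTS` and `effective_rate` are illustrative, and the resulting rates are derived values, not metrics published by AWS.

```python
# (volume data in GB, backup duration in minutes), copied from the table above.
BACKUP_RESULTS = [(50, 20), (100, 50), (200, 100), (500, 150), (1024, 350)]


def effective_rate(results):
    """Return (volume data in GB, GB backed up per minute) for each test row."""
    return [(gb, gb / minutes) for gb, minutes in results]


rates = effective_rate(BACKUP_RESULTS)
for gb, rate in rates:
    print(f"{gb:>5} GB: {rate:.2f} GB/min")
```

For these test results, the effective rate stays between 2 and about 3.3 GB per minute across all data sizes.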

Backup impact

Tests show that the backup impact on clusters is less than 3% when GP3 volumes are used. In the following figure, the backup is initiated after 10:25.

[Figure: EBS snapshot backup impact]

Restore performance

This section describes the performance of EBS snapshot restore using volumes, the factors that affect performance, and the performance test results.

Restore time consumption

EBS snapshot restore using volumes consists of the following processes. For detailed information, see Architecture of EBS snapshot volume backup and restore.

  1. Create a cluster.

    TiDB Operator creates a cluster in recoveryMode and starts all PD nodes.

  2. Restore volumes.

    TiDB Operator creates volume restore subtasks using BR. BR restores the data volumes from the snapshots to start TiKV.

  3. Start TiKV.

    TiDB Operator mounts the TiKV volumes and starts TiKV.

  4. Restore data.

    TiDB Operator creates volume data restore subtasks. BR restores the data volumes to a consistent state.

  5. Start TiDB.

    TiDB is started and the restore is completed.

Time consumption ratio of restore

| Restore stage | Approximate time taken | Restore ratio | Remarks |
|---|---|---|---|
| Create a cluster | 30 seconds | 2% | Including the time for downloading the Docker image and starting PD |
| Restore volumes | 20 seconds | 1% | Including the time for starting the BR Pod and restoring volumes |
| Start TiKV | 10 to 16 minutes | 42% | Including the time for starting RocksDB and reading the metadata of all Regions |
| Restore data | 2 to 20 minutes | 52% | Including the time for restoring data in the Raft consensus layer and deleting MVCC data |
| Start TiDB | 1 minute | 3% | Including the time for downloading the TiDB Docker image and starting TiDB |
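Because the stages run one after another, the per-stage figures above can be summed to get a rough lower and upper bound on the end-to-end restore time. The following sketch does only that addition. The `STAGES` list and the `total_range` helper are illustrative names, and the bounds are derived from the table rather than measured separately.

```python
# (stage, shortest time in seconds, longest time in seconds), from the table above.
STAGES = [
    ("Create a cluster", 30, 30),
    ("Restore volumes", 20, 20),
    ("Start TiKV", 10 * 60, 16 * 60),
    ("Restore data", 2 * 60, 20 * 60),
    ("Start TiDB", 60, 60),
]


def total_range(stages):
    """Sum the per-stage bounds; the stages are sequential, so the times add up."""
    low = sum(shortest for _, shortest, _ in stages)
    high = sum(longest for _, _, longest in stages)
    return low, high


low, high = total_range(STAGES)
print(f"restore takes roughly {low / 60:.1f} to {high / 60:.1f} minutes")
```

Starting TiKV and restoring data account for nearly all of the spread between the two bounds; the other three stages are fixed at under two minutes in total.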

Restore performance data

The time taken by snapshot restore using volumes mainly depends on the time taken to start TiKV and to restore data. Both stages need to read volume data that has been restored from snapshots, and this data is loaded with some latency: a block on a restored volume becomes available only after it has been downloaded from Amazon S3 and written to the volume. As a result, a restored volume does not reach its optimal performance immediately after restore.
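The load latency described above can be pictured with a toy model in which the first read of each block pays an extra download cost and later reads do not. This is only an illustrative simulation: the `LazyVolume` class and both latency values are hypothetical and do not reflect measured EBS behavior.

```python
class LazyVolume:
    """Toy model of a volume restored from a snapshot.

    The latencies are made-up numbers chosen only to show the effect.
    """

    def __init__(self, first_read_ms=10.0, warm_read_ms=0.5):
        self.first_read_ms = first_read_ms  # block must be fetched first
        self.warm_read_ms = warm_read_ms    # block is already on the volume
        self._loaded = set()

    def read(self, block):
        """Return the simulated latency (ms) of reading one block."""
        if block in self._loaded:
            return self.warm_read_ms
        self._loaded.add(block)
        return self.first_read_ms


vol = LazyVolume()
cold = sum(vol.read(b) for b in range(1000))  # first pass touches every block
warm = sum(vol.read(b) for b in range(1000))  # second pass hits loaded blocks
print(f"first pass: {cold:.0f} ms, second pass: {warm:.0f} ms")
```

In this model, a stage that scans data for the first time after restore, as TiKV startup does when it reads Region metadata, pays the first-read cost on every block it touches.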

The data load latency results in high I/O latency when each block is accessed for the first time. Because of this latency, TiKV startup and data restore consume most of the time in the whole process of snapshot restore using volumes. With the recommended machine type and GP3 storage volumes, the test data is as follows:

[Figure: EBS snapshot restore performance]

| Volume data | Total volume size | Volume configuration | Approximate restore duration |
|---|---|---|---|
| 50 GB | 500 GB | 7000 IOPS / 400 MiB/s | 16 minutes |
| 100 GB | 500 GB | 7000 IOPS / 400 MiB/s | 18 minutes |
| 200 GB | 500 GB | 7000 IOPS / 400 MiB/s | 21 minutes |
| 500 GB | 1024 GB | 7000 IOPS / 400 MiB/s | 25 minutes |
| 1024 GB | 3500 GB | 7000 IOPS / 400 MiB/s | 34 minutes |
