
FAQs on EBS Snapshot Backup and Restore

This document describes the common issues that might occur during EBS snapshot backup and restore, and their solutions.

Backup issues

You might encounter the following problems during EBS snapshot backup:

Failed to start a backup or the backup failed immediately after it started

Issue: #4781

  • Symptom 1: After the backup CR YAML file is applied, the pod and job are not created, and the backup cannot start.

    1. Run the following command to check the pod of TiDB Operator:

      kubectl get po -n ${namespace}
      
    2. Run the following command to check the log of tidb-controller-manager:

      kubectl -n ${namespace} logs ${tidb-controller-manager}
      
    3. Check whether the log contains the following error message:

      ```shell
      metadata.annotations: Too long: must have at most 262144 bytes, spec.template.annotations: Too long: must have at most 262144 bytes
      ```
      

      Cause: TiDB Operator uses annotations to pass in the PVC and PV configuration, and the annotations of a backup job cannot exceed 262144 bytes (256 KiB). When the TiKV cluster is excessively large, the PVC and PV configuration exceeds this limit, so the call to the Kubernetes API fails.

  • Symptom 2: After the backup CR YAML file is applied, the pod and job are created successfully, but the backup fails immediately.

    Check the log of the backup job as described in symptom 1. The error message is as follows:

    exec /entrypoint.sh: argument list too long
    

    Cause: Before the backup pod starts, TiDB Operator injects the PVC and PV configuration into the environment variables of the backup pod and then starts the backup task. The environment variables cannot exceed 1 MB. When the PVC and PV configuration is larger than 1 MB, the backup pod cannot load the environment variables and the backup fails.

    Scenario: This issue occurs when the TiKV cluster is excessively large (40+ TiKV nodes) or too many volumes are configured, and TiDB Operator is v1.4.0-beta.2 or an earlier version.

Solution: Upgrade TiDB Operator to the latest version. To confirm that you are hitting the size limits described above, you can check how large the serialized PVC and PV definitions are, as shown in the sketch below.
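The following is a minimal diagnostic sketch, assuming the TiDB cluster runs in ${namespace}; the exact payload that TiDB Operator builds might differ slightly from this estimate.

```shell
# Roughly estimate the size of the serialized PVC and PV definitions.
# If the output approaches 262144 bytes (the annotation limit) or 1 MB
# (the environment variable limit), you are likely hitting this issue.
kubectl -n ${namespace} get pvc -o json | wc -c
kubectl get pv -o json | wc -c
```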

The backup CR of a failed task cannot be deleted

Issue: #4778

Symptom: Deleting the backup CR is stuck.

Scenario: This issue occurs when the TiDB Operator is v1.4.0-beta.2 or earlier.

Solution: Upgrade TiDB Operator to the latest version.
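If you want to inspect the stuck CR before upgrading, the following is a minimal diagnostic sketch; ${backup-name} and ${namespace} are placeholders for the name and namespace of the Backup CR.

```shell
# Inspect the stuck Backup CR and check whether the deletion is pending,
# for example, blocked by a finalizer that the old TiDB Operator version
# never removes.
kubectl -n ${namespace} get backup ${backup-name} -o yaml
```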

Backup failed with the error GC safepoint exceed TS

Issue: #13838

Symptom: After the backup CR YAML file is applied, the pod and job are created successfully, but the backup fails immediately.

Check whether the log contains the following error:

GC safepoint 437271276493996032 exceed TS 437270540511608835

Scenario: This issue occurs when you initiate volume-snapshot backup tasks in a large cluster (20+ TiKV nodes) shortly after performing a large-scale data restore using BR.

Solution: Open the Grafana dashboard ${cluster-name}-TiKV-Details, expand the Resolved-TS row, and check the Max Resolved TS gap panel. Locate the TiKV node with a large Max Resolved TS gap (greater than 1 minute), and then restart that TiKV node, for example, as shown in the sketch below.
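The following is a minimal sketch of restarting a TiKV node on Kubernetes, assuming the affected node is the hypothetical pod ${cluster-name}-tikv-2 in ${namespace}. Deleting the pod causes the TiKV StatefulSet to recreate it, which restarts that TiKV node.

```shell
# Restart the TiKV node that has a large Max Resolved TS gap by deleting
# its pod. The StatefulSet recreates the pod automatically.
kubectl -n ${namespace} delete pod ${cluster-name}-tikv-2
```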

Restore issues

You might encounter the following problems during EBS snapshot restore:

Failed to restore the cluster with the error keepalive watchdog timeout

Symptom: The subtasks of BR data restore failed. The first restore subtask succeeded (volume complete) but the second one failed. The following log information is found in the failed task:

error="rpc error: code = Unavailable desc = keepalive watchdog timeout"

Scenario: This issue occurs when the data volume is large and the TiDB cluster version is v6.3.0.

Solution:

  1. Upgrade the TiDB cluster to v6.4.0 or later.

  2. Edit the configuration of the TiDB cluster and increase TiKV's gRPC keepalive values (see the sketch below for where to apply them):

    config: |
      [server]
        grpc-keepalive-time = "500s"
        grpc-keepalive-timeout = "10s"
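The snippet above typically belongs to the TiKV configuration of the TidbCluster CR (spec.tikv.config). The following is a minimal sketch, assuming the cluster CR is named ${cluster-name} and runs in ${namespace}; if you manage the cluster through Helm values or GitOps, apply the change there instead.

```shell
# Open the TidbCluster CR for editing and add the grpc-keepalive-time and
# grpc-keepalive-timeout settings shown above under spec.tikv.config.
kubectl -n ${namespace} edit tidbcluster ${cluster-name}
```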
    

Restore period is excessively long (longer than 2 hours)

Scenario: This issue occurs when the TiDB cluster version is v6.3.0 or v6.4.0.

Solution:

  1. Upgrade the TiDB cluster to v6.5.0.

  2. In the spec of the Restore CR, temporarily increase the volume performance for the restore, and then manually scale it back down after the restore is completed (see the sketch after the example below). For example, you can specify options such as --volume-iops=8000 and --volume-throughput=600 to provision higher-performance volumes:

    spec:
      backupType: full
      restoreMode: volume-snapshot
      serviceAccount: tidb-backup-manager
      toolImage: pingcap/br:v6.5.0
      br:
        cluster: basic
        clusterNamespace: tidb-cluster
        sendCredToTikv: false
      options:
      - --volume-type=gp3
      - --volume-iops=8000
      - --volume-throughput=600
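After the restore completes, gp3 volumes keep the higher IOPS and throughput until you change them, so scale them back down manually. The following is a minimal sketch using the AWS CLI, where ${volume-id} is a placeholder for the ID of one of the restored TiKV volumes; repeat the command for each volume that was restored with the higher settings.

```shell
# Scale a restored gp3 volume back down to the gp3 baseline
# (3000 IOPS, 125 MiB/s throughput). ${volume-id} is a placeholder.
aws ec2 modify-volume \
  --volume-id ${volume-id} \
  --iops 3000 \
  --throughput 125
```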
    