Use PD Recover to Recover the PD Cluster
PD Recover is a disaster recovery tool for PD. You can use it to recover a PD cluster that cannot start or provide services normally. For a detailed introduction to this tool, see TiDB documentation - PD Recover. This document describes how to download PD Recover and how to use it to recover a PD cluster.
Download PD Recover
Download the official TiDB package:
```shell
wget https://download.pingcap.org/tidb-community-toolkit-${version}-linux-amd64.tar.gz
```

In the command above, `${version}` is the version of the TiDB cluster, such as `v7.5.0`.

Unpack the TiDB package:

```shell
tar -xzf tidb-community-toolkit-${version}-linux-amd64.tar.gz
tar -xzf tidb-community-toolkit-${version}-linux-amd64/pd-recover-${version}-linux-amd64.tar.gz
```

`pd-recover` is in the current directory.
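For example, a minimal end-to-end sketch of the download, assuming the cluster runs `v7.5.0` (substitute your own version):

```shell
# Assumes TiDB v7.5.0; replace the version with the one your cluster runs.
version=v7.5.0
wget https://download.pingcap.org/tidb-community-toolkit-${version}-linux-amd64.tar.gz
tar -xzf tidb-community-toolkit-${version}-linux-amd64.tar.gz
tar -xzf tidb-community-toolkit-${version}-linux-amd64/pd-recover-${version}-linux-amd64.tar.gz
ls -l ./pd-recover   # the pd-recover binary should now be in the current directory
```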
Scenario 1: At least one PD node is alive
This section describes how to recover the PD cluster by using PD Recover and the alive PD nodes. It applies only when the PD cluster still has alive PD nodes. If all PD nodes are unavailable, refer to Scenario 2.
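The kubectl commands in the following steps use `${cluster_name}` and `${namespace}` as placeholders for the TidbCluster name and its namespace. A minimal sketch of setting them once in your shell, using the hypothetical `test` cluster from the example later in this document:

```shell
# Hypothetical values; replace them with your TidbCluster name and its namespace.
cluster_name=test
namespace=test
```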
Step 1. Recover the PD Pod
Use an alive PD node `pd-0` to force recreate the PD cluster. The detailed steps are as follows:

Let the `pd-0` Pod enter debug mode:

```shell
kubectl annotate pod ${cluster_name}-pd-0 -n ${namespace} runmode=debug
kubectl exec ${cluster_name}-pd-0 -n ${namespace} -- kill -SIGTERM 1
```

Enter the `pd-0` Pod:

```shell
kubectl exec ${cluster_name}-pd-0 -n ${namespace} -it -- sh
```

Refer to the default startup script `pd-start-script` or the start script of an alive PD node, and configure the environment variables in `pd-0`:

```shell
# Use HOSTNAME if POD_NAME is unset for backward compatibility.
POD_NAME=${POD_NAME:-$HOSTNAME}

# the general form of variable PEER_SERVICE_NAME is: "<clusterName>-pd-peer"
cluster_name=`echo ${PEER_SERVICE_NAME} | sed 's/-pd-peer//'`
domain="${POD_NAME}.${PEER_SERVICE_NAME}.${NAMESPACE}.svc"
discovery_url="${cluster_name}-discovery.${NAMESPACE}.svc:10261"
encoded_domain_url=`echo ${domain}:2380 | base64 | tr "\n" " " | sed "s/ //g"`

elapseTime=0
period=1
threshold=30
while true; do
    sleep ${period}
    elapseTime=$(( elapseTime+period ))
    if [[ ${elapseTime} -ge ${threshold} ]]
    then
        echo "waiting for pd cluster ready timeout" >&2
        exit 1
    fi

    if nslookup ${domain} 2>/dev/null
    then
        echo "nslookup domain ${domain}.svc success"
        break
    else
        echo "nslookup domain ${domain} failed" >&2
    fi
done

ARGS="--data-dir=/var/lib/pd \
--name=${POD_NAME} \
--peer-urls=http://0.0.0.0:2380 \
--advertise-peer-urls=http://${domain}:2380 \
--client-urls=http://0.0.0.0:2379 \
--advertise-client-urls=http://${domain}:2379 \
--config=/etc/pd/pd.toml \
"

if [[ -f /var/lib/pd/join ]]
then
    # The content of the join file is:
    #   demo-pd-0=http://demo-pd-0.demo-pd-peer.demo.svc:2380,demo-pd-1=http://demo-pd-1.demo-pd-peer.demo.svc:2380
    # The --join args must be:
    #   --join=http://demo-pd-0.demo-pd-peer.demo.svc:2380,http://demo-pd-1.demo-pd-peer.demo.svc:2380
    join=`cat /var/lib/pd/join | tr "," "\n" | awk -F'=' '{print $2}' | tr "\n" ","`
    join=${join%,}
    ARGS="${ARGS} --join=${join}"
elif [[ ! -d /var/lib/pd/member/wal ]]
then
    until result=$(wget -qO- -T 3 http://${discovery_url}/new/${encoded_domain_url} 2>/dev/null); do
        echo "waiting for discovery service to return start args ..."
        sleep $((RANDOM % 5))
    done
    ARGS="${ARGS}${result}"
fi
```

Use the original `pd-0` data directory to force start a new PD cluster:

```shell
echo "starting pd-server ..."
sleep $((RANDOM % 10))
echo "/pd-server --force-new-cluster ${ARGS}"
exec /pd-server --force-new-cluster ${ARGS} &
```

Exit the `pd-0` Pod:

```shell
exit
```

Execute the following command to confirm that PD is started:

```shell
kubectl logs -f ${cluster_name}-pd-0 -n ${namespace} | grep "Welcome to Placement Driver (PD)"
```
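You can additionally query the PD members API from inside the Pod to double-check that the forced single-member cluster is serving requests; a small sketch using `wget`, which is available in the PD image as shown in the startup script above:

```shell
# List the members of the newly forced PD cluster (expects a single member, pd-0).
kubectl exec ${cluster_name}-pd-0 -n ${namespace} -- wget -qO- http://127.0.0.1:2379/pd/api/v1/members
```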
Step 2. Recover the PD cluster
Copy `pd-recover` to the PD Pod:

```shell
kubectl cp ./pd-recover ${namespace}/${cluster_name}-pd-0:./
```

Recover the PD cluster by running the `pd-recover` command against the cluster newly created in the previous step:

```shell
kubectl exec ${cluster_name}-pd-0 -n ${namespace} -- ./pd-recover --from-old-member -endpoints http://127.0.0.1:2379
```

If the command is successfully executed, the following result is printed:

```
recover success! please restart the PD cluster
```
Step 3. Restart the PD Pod
Delete the PD Pod:
```shell
kubectl delete pod ${cluster_name}-pd-0 -n ${namespace}
```

Confirm the Cluster ID is generated:

```shell
kubectl -n ${namespace} exec -it ${cluster_name}-pd-0 -- wget -q http://127.0.0.1:2379/pd/api/v1/cluster
kubectl -n ${namespace} exec -it ${cluster_name}-pd-0 -- cat cluster
```
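If you prefer to print the response directly instead of saving it to a file and reading it back, a small variation of the commands above:

```shell
# Print the cluster information (including the Cluster ID) straight to the terminal.
kubectl -n ${namespace} exec -it ${cluster_name}-pd-0 -- wget -qO- http://127.0.0.1:2379/pd/api/v1/cluster
```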
Step 4. Recreate the other failed or alive PD nodes
In this example, delete the PVCs and Pods of `pd-1` and `pd-2` so that both nodes are recreated:

```shell
kubectl -n ${namespace} delete pvc pd-${cluster_name}-pd-1 --wait=false
kubectl -n ${namespace} delete pvc pd-${cluster_name}-pd-2 --wait=false
kubectl -n ${namespace} delete pod ${cluster_name}-pd-1
kubectl -n ${namespace} delete pod ${cluster_name}-pd-2
```
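You can then watch the PD Pods being recreated by the StatefulSet; for example:

```shell
# Watch the PD Pods until pd-1 and pd-2 are Running again.
kubectl -n ${namespace} get pod -l app.kubernetes.io/component=pd,app.kubernetes.io/instance=${cluster_name} -w
```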
Step 5. Check PD health and configuration
Check health:
```shell
kubectl -n ${namespace} exec -it ${cluster_name}-pd-0 -- ./pd-ctl health
```
Check configuration. The following command uses placement rules as an example:
```shell
kubectl -n ${namespace} exec -it ${cluster_name}-pd-0 -- ./pd-ctl config placement-rules show
```
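You can also list the PD members and TiKV stores to confirm that all nodes have rejoined the cluster; for example, using the same `pd-ctl` binary as above:

```shell
# Show the current PD members and the registered TiKV stores.
kubectl -n ${namespace} exec -it ${cluster_name}-pd-0 -- ./pd-ctl member
kubectl -n ${namespace} exec -it ${cluster_name}-pd-0 -- ./pd-ctl store
```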
Now the TiDB cluster is recovered.
Scenario 2: All PD nodes are down and cannot be recovered
This section describes how to recover the PD cluster by using PD Recover and creating new PD nodes. It applies only when all PD nodes in the cluster have failed and cannot be recovered. If the cluster still has alive PD nodes, refer to Scenario 1.
Step 1. Get Cluster ID
```shell
kubectl get tc ${cluster_name} -n ${namespace} -o='go-template={{.status.clusterID}}{{"\n"}}'
```

Example:

```shell
kubectl get tc test -n test -o='go-template={{.status.clusterID}}{{"\n"}}'
```

```
6821434242797747735
```
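To reuse the value later in the `pd-recover` command, you can capture it in a shell variable; for example:

```shell
# Store the Cluster ID for the pd-recover step below.
cluster_id=$(kubectl get tc ${cluster_name} -n ${namespace} -o='go-template={{.status.clusterID}}{{"\n"}}')
echo ${cluster_id}
```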
Step 2. Get Alloc ID
When you use `pd-recover` to recover the PD cluster, you need to specify `alloc-id`. The value of `alloc-id` must be larger than the largest allocated ID (`Alloc ID`) of the original cluster.
Access the Prometheus monitoring data of the TiDB cluster by taking the steps in Access the Prometheus monitoring data.

Enter `pd_cluster_id` in the input box and click the **Execute** button to make a query. Get the largest value in the query result.

Multiply that value by `100` and use the result as the `alloc-id` value specified when using `pd-recover`.
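If you prefer the command line over the Prometheus web UI, the same query can be issued against the Prometheus HTTP API; a sketch that assumes the monitor's Prometheus Service is reachable as `${cluster_name}-prometheus` on port 9090 in the same namespace (adjust the Service name and port to your TidbMonitor deployment):

```shell
# Assumed Service name and port; check `kubectl get svc -n ${namespace}` for the real ones.
kubectl -n ${namespace} exec -it ${cluster_name}-pd-0 -- \
  wget -qO- "http://${cluster_name}-prometheus:9090/api/v1/query?query=max(pd_cluster_id)"
```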
Step 3. Recover the PD Pod
Delete the Pods of the PD cluster:

Execute the following command to set the value of `spec.pd.replicas` to `0`:

```shell
kubectl patch tc ${cluster_name} -n ${namespace} --type merge -p '{"spec":{"pd":{"replicas": 0}}}'
```

Because the PD cluster is in an abnormal state, TiDB Operator cannot synchronize the change above to the PD StatefulSet. You need to execute the following command to set the `spec.replicas` of the PD StatefulSet to `0`:

```shell
kubectl patch sts ${cluster_name}-pd -n ${namespace} -p '{"spec":{"replicas": 0}}'
```

Execute the following command to confirm that the PD Pods are deleted:

```shell
kubectl get pod -n ${namespace}
```

After confirming that all PD Pods are deleted, execute the following command to delete the PVCs bound to the PD Pods:

```shell
kubectl delete pvc -l app.kubernetes.io/component=pd,app.kubernetes.io/instance=${cluster_name} -n ${namespace}
```

After the PVCs are deleted, scale out the PD cluster to one Pod:

Execute the following command to set the value of `spec.pd.replicas` to `1`:

```shell
kubectl patch tc ${cluster_name} -n ${namespace} --type merge -p '{"spec":{"pd":{"replicas": 1}}}'
```

Because the PD cluster is in an abnormal state, TiDB Operator cannot synchronize the change above to the PD StatefulSet. You need to execute the following command to set the `spec.replicas` of the PD StatefulSet to `1`:

```shell
kubectl patch sts ${cluster_name}-pd -n ${namespace} -p '{"spec":{"replicas": 1}}'
```

Execute the following command to confirm that the PD cluster is started:

```shell
kubectl logs -f ${cluster_name}-pd-0 -n ${namespace} | grep "Welcome to Placement Driver (PD)"
```
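At each of these steps you can check that the PD StatefulSet has reached the expected scale; for example:

```shell
# Shows DESIRED/READY replicas of the PD StatefulSet (0 after scale-in, 1 after scale-out).
kubectl -n ${namespace} get sts ${cluster_name}-pd
```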
Step 4. Recover the cluster
Copy the `pd-recover` command to the PD Pod:

```shell
kubectl cp ./pd-recover ${namespace}/${cluster_name}-pd-0:./
```

Execute the `pd-recover` command to recover the PD cluster:

```shell
kubectl exec ${cluster_name}-pd-0 -n ${namespace} -- ./pd-recover -endpoints http://127.0.0.1:2379 -cluster-id ${cluster_id} -alloc-id ${alloc_id}
```

In the command above, `${cluster_id}` is the Cluster ID obtained in Get Cluster ID, and `${alloc_id}` is the largest value of `pd_cluster_id` (obtained in Get Alloc ID) multiplied by `100`.

After the `pd-recover` command is successfully executed, the following result is printed:

```
recover success! please restart the PD cluster
```
Step 5. Restart the PD Pod
Delete the PD Pod:
```shell
kubectl delete pod ${cluster_name}-pd-0 -n ${namespace}
```

Execute the following command to confirm that the Cluster ID is the one obtained in Get Cluster ID:

```shell
kubectl -n ${namespace} exec -it ${cluster_name}-pd-0 -- wget -q http://127.0.0.1:2379/pd/api/v1/cluster
kubectl -n ${namespace} exec -it ${cluster_name}-pd-0 -- cat cluster
```
Step 6. Scale out the PD cluster
Execute the following command to set the value of `spec.pd.replicas` to the desired number of Pods:

```shell
kubectl patch tc ${cluster_name} -n ${namespace} --type merge -p '{"spec":{"pd":{"replicas": $replicas}}}'
```
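In the command above, `$replicas` is a placeholder for the desired number of PD replicas. For example, to scale the PD cluster back out to three replicas:

```shell
kubectl patch tc ${cluster_name} -n ${namespace} --type merge -p '{"spec":{"pd":{"replicas": 3}}}'
```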
Step 7. Restart TiDB and TiKV
Use the following commands to restart the TiDB and TiKV clusters:
```shell
kubectl delete pod -l app.kubernetes.io/component=tidb,app.kubernetes.io/instance=${cluster_name} -n ${namespace} &&
kubectl delete pod -l app.kubernetes.io/component=tikv,app.kubernetes.io/instance=${cluster_name} -n ${namespace}
```
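To confirm that the components come back up, you can watch the Pods of the cluster until they are all Running again; for example:

```shell
# Watch all Pods of this TidbCluster instance.
kubectl -n ${namespace} get pod -l app.kubernetes.io/instance=${cluster_name} -w
```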
Now the TiDB cluster is recovered.