Tips for troubleshooting TiDB on Kubernetes
This document describes commonly used tips for troubleshooting TiDB on Kubernetes.
Use the debug mode
When a Pod is in the CrashLoopBackOff
state, the containers in the Pod exit and restart repeatedly. As a result, you cannot use kubectl exec
to enter the containers normally, which makes it inconvenient to diagnose issues.
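You can check whether a Pod is in this state by viewing its status, for example:
kubectl get pod ${pod_name} -n ${namespace}
If the STATUS column shows CrashLoopBackOff, the containers in the Pod are crashing repeatedly.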
To solve this problem, TiDB Operator provides the Pod debug mode for PD, TiKV, and TiDB components. In this mode, the containers in the Pod hang directly after they are started, and will not repeatedly crash. Then you can use kubectl exec
to connect to the Pod containers for diagnosis.
To use the debug mode for troubleshooting:
Add an annotation to the Pod to be diagnosed:
kubectl annotate pod ${pod_name} -n ${namespace} runmode=debug
When the container in the Pod is restarted again, it will detect this annotation and enter the debug mode.
Wait for the Pod to enter the Running state.
watch kubectl get pod ${pod_name} -n ${namespace}
Here's an example of using kubectl exec to get into the container for diagnosis:
kubectl exec -it ${pod_name} -n ${namespace} -- /bin/sh
After finishing the diagnosis and resolving the problem, delete the Pod.
kubectl delete pod ${pod_name} -n ${namespace}
After the Pod is rebuilt, it automatically returns to the normal mode.
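To confirm that the rebuilt Pod is no longer in the debug mode, one possible check is to verify that the runmode annotation is gone. The following command should print an empty result for a Pod in the normal mode:
kubectl get pod ${pod_name} -n ${namespace} -o jsonpath='{.metadata.annotations.runmode}'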
Modify the configuration of a TiKV instance
In some test scenarios, if you need to modify the configuration of a TiKV instance without affecting other instances, you can use either of the following methods.
Modify online
Refer to the document and use SQL to modify the configuration of a single TiKV instance online.
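For example, assuming you can reach the cluster with the MySQL client, a single-instance change might look like the following. The TiDB host, the TiKV status address, and the configuration item split.qps-threshold are placeholders only:
mysql -h ${tidb_host} -P 4000 -u root -e 'SET CONFIG "basic-tikv-0.basic-tikv-peer.default.svc:20180" split.qps-threshold=1000;'
The SET CONFIG statement identifies a single TiKV instance by its status address, so the change does not affect other instances.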
Modify manually in debug mode
After the TiKV Pod enters debug mode, you can modify the TiKV configuration file and then manually start the TiKV process using the modified configuration file.
The steps are as follows:
Get the start command from the TiKV log, which will be used in a subsequent step.
kubectl logs ${pod_name} -n ${namespace} -c tikv | head -2 | tail -1
You can see an output similar to the following, which is the start command of TiKV.
/tikv-server --pd=http://${tc_name}-pd:2379 --advertise-addr=${pod_name}.${tc_name}-tikv-peer.default.svc:20160 --addr=0.0.0.0:20160 --status-addr=0.0.0.0:20180 --data-dir=/var/lib/tikv --capacity=0 --config=/etc/tikv/tikv.toml
Turn on the debug mode for the Pod and restart the Pod.
Add an annotation to the Pod and wait for the Pod to restart.
kubectl annotate pod ${pod_name} -n ${namespace} runmode=debug
If the Pod keeps running, you can force restart the container by running the following command:
kubectl exec ${pod_name} -n ${namespace} -c tikv -- kill -SIGTERM 1
Check the log of TiKV to ensure that the Pod is in the debug mode.
kubectl logs ${pod_name} -n ${namespace} -c tikv
The output is similar to the following:
entering debug mode.
Enter the TiKV container by running the following command:
kubectl exec -it ${pod_name} -n ${namespace} -c tikv -- sh
In the TiKV container, copy the configuration file of TiKV to a new file, and modify the new file.
cp /etc/tikv/tikv.toml /tmp/tikv.toml && vi /tmp/tikv.toml
In the TiKV container, modify the start command obtained in Step 1 and set the --config flag to the new configuration file. Then run the modified start command to start the TiKV process:
/tikv-server --pd=http://${tc_name}-pd:2379 --advertise-addr=${pod_name}.${tc_name}-tikv-peer.default.svc:20160 --addr=0.0.0.0:20160 --status-addr=0.0.0.0:20180 --data-dir=/var/lib/tikv --capacity=0 --config=/tmp/tikv.toml
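If you want the TiKV process to keep running after you exit the interactive shell, one option (assuming nohup is available in the TiKV image and using /tmp/tikv-debug.log as an example log path) is to start it in the background and redirect its output:
nohup /tikv-server --pd=http://${tc_name}-pd:2379 --advertise-addr=${pod_name}.${tc_name}-tikv-peer.default.svc:20160 --addr=0.0.0.0:20160 --status-addr=0.0.0.0:20180 --data-dir=/var/lib/tikv --capacity=0 --config=/tmp/tikv.toml > /tmp/tikv-debug.log 2>&1 &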
After the test is completed, if you want to recover the TiKV Pod, you can delete the TiKV Pod and wait for the Pod to be automatically started.
kubectl delete pod ${pod_name} -n ${namespace}
Configure forceful upgrade for the TiKV cluster
Normally, during a TiKV rolling update, TiDB Operator evicts all Region leaders from a TiKV Pod before restarting it. This is meant to minimize the impact of the rolling update on user requests.
In some test scenarios, if you do not need to wait for Region leaders to migrate during the TiKV rolling upgrade, or if you want to speed up the rolling upgrade, you can set the spec.tikv.evictLeaderTimeout
field of the TidbCluster to a small value.
spec:
  tikv:
    evictLeaderTimeout: 10s
For more information about this field, refer to Configure graceful upgrade.
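If you want to set this field without editing the full manifest, one way is a kubectl patch like the following, assuming your TidbCluster is named ${tc_name} and tc is the short name registered for the TidbCluster resource:
kubectl patch tc ${tc_name} -n ${namespace} --type merge -p '{"spec":{"tikv":{"evictLeaderTimeout":"10s"}}}'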
Configure forceful upgrade for the TiCDC cluster
Normally, during a TiCDC rolling update, TiDB Operator drains all replication workloads from a TiCDC Pod before restarting it. This is meant to minimize the impact of the rolling update on replication latency.
In some test scenarios, if you do not need to wait for the draining to complete during the TiCDC rolling upgrade, or if you want to speed up the rolling upgrade, you can set the spec.ticdc.gracefulShutdownTimeout
field of the TidbCluster to a small value.
spec:
  ticdc:
    gracefulShutdownTimeout: 10s
For more information about this field, refer to Configure graceful upgrade.
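Similarly, a kubectl patch can set this field directly (again assuming the TidbCluster is named ${tc_name}):
kubectl patch tc ${tc_name} -n ${namespace} --type merge -p '{"spec":{"ticdc":{"gracefulShutdownTimeout":"10s"}}}'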