Troubleshoot a TiFlash Cluster
This section describes some common issues you might encounter when using TiFlash, along with their causes and solutions.
TiFlash fails to start
This issue can have several causes. It is recommended that you troubleshoot it by following the steps below:
- Check whether your system is CentOS 8.
  CentOS 8 does not have the libnsl.so system library. You can manually install it via the following command:
    dnf install libnsl
- Check your system's ulimit parameter setting:
    ulimit -n 1000000
- Use the PD Control tool to check whether there is any TiFlash instance that failed to go offline on the node (same IP and port), and force the instance(s) to go offline. For detailed steps, refer to Scale in a TiFlash cluster.
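The library and ulimit checks above can be scripted. The following is a minimal sketch, not an official tool: `nofile_ok` is a hypothetical helper name, the threshold of 1000000 comes from the step above, and the library lookup assumes `ldconfig` is available on the host.

```shell
# Sketch only: nofile_ok is a hypothetical helper, not part of TiFlash or TiUP.
# Succeeds when the given open-file limit meets the required minimum
# (1000000 by default, the value used in the step above).
nofile_ok() {
  current="$1"
  required="${2:-1000000}"
  [ "$current" = "unlimited" ] && return 0
  [ "$current" -ge "$required" ]
}

# Check the open-file limit of the current shell.
if nofile_ok "$(ulimit -n)"; then
  echo "ulimit -n is sufficient"
else
  echo "ulimit -n is too low: $(ulimit -n); consider running: ulimit -n 1000000"
fi

# Look for libnsl.so in the dynamic linker cache (assumes ldconfig is available).
if ldconfig -p 2>/dev/null | grep -q 'libnsl\.so'; then
  echo "libnsl.so found"
else
  echo "libnsl.so not found; on CentOS 8, try: dnf install libnsl"
fi
```

Note that `ulimit -n` reports the limit of the shell that runs the script, which may differ from the limit applied to the TiFlash process itself.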
If the above methods cannot resolve your issue, save the TiFlash log files and email them to info@pingcap.com for more information.
TiFlash replica is always unavailable
This is because TiFlash is in an abnormal state caused by configuration errors or environment issues. Take the following steps to identify the faulty component:
- Check whether PD enables the Placement Rules feature:
    echo 'config show replication' | /path/to/pd-ctl -u http://<pd-ip>:<pd-port>
  The expected result is "enable-placement-rules": "true". If not enabled, enable the Placement Rules feature.
- Check whether the TiFlash process is working correctly by viewing UpTime on the TiFlash-Summary monitoring panel.
- Check whether the TiFlash proxy status is normal through pd-ctl:
    echo "store" | /path/to/pd-ctl -u http://<pd-ip>:<pd-port>
  The TiFlash proxy's store.labels includes information such as {"key": "engine", "value": "tiflash"}. You can check this information to confirm a TiFlash proxy.
- Check whether pd buddy can correctly print the logs (the log path is the value of log in the [flash.flash_cluster] configuration item; the default log path is under the tmp directory configured in the TiFlash configuration file).
- Check whether the number of configured replicas is less than or equal to the number of TiKV nodes in the cluster. If not, PD cannot replicate data to TiFlash:
    echo 'config placement-rules show' | /path/to/pd-ctl -u http://<pd-ip>:<pd-port>
  Reconfirm the value of default: count.
- Check whether the remaining disk space of the machine (where store of the TiFlash node is) is sufficient. By default, when the remaining disk space is less than 20% of the store capacity (which is controlled by the low-space-ratio parameter), PD cannot schedule data to this TiFlash node.
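Several of the checks above reduce to simple tests on pd-ctl output or on a pair of numbers. The sketch below illustrates this under stated assumptions: all four function names are hypothetical, the text matching assumes pd-ctl prints JSON in the shape quoted in the steps above, and the 20% threshold is the default described in the last step. It uses loose `grep` matching rather than a real JSON parser, so treat it as a starting point.

```shell
# Sketch only: these helper names are hypothetical, not pd-ctl features.

# Reads `config show replication` output on stdin.
# Succeeds if "enable-placement-rules" is reported as true.
placement_rules_enabled() {
  grep -Eq '"enable-placement-rules"[[:space:]]*:[[:space:]]*"?true"?'
}

# Reads `store` output on stdin.
# Succeeds if any label value is "tiflash" (a loose match, not a JSON parse).
has_tiflash_store() {
  grep -Eq '"value"[[:space:]]*:[[:space:]]*"tiflash"'
}

# Usage: replica_count_ok <configured-replicas> <tikv-nodes>
# Succeeds when the replica count does not exceed the number of TiKV nodes.
replica_count_ok() {
  [ "$1" -le "$2" ]
}

# Usage: disk_space_ok <available> <capacity> [min-free-percent]
# Succeeds when free space is at least min-free-percent of capacity (20 by default).
disk_space_ok() {
  [ $(( $1 * 100 )) -ge $(( $2 * ${3:-20} )) ]
}

# Example wiring (replace the placeholders before running):
#   echo 'config show replication' | /path/to/pd-ctl -u http://<pd-ip>:<pd-port> | placement_rules_enabled
#   echo 'store' | /path/to/pd-ctl -u http://<pd-ip>:<pd-port> | has_tiflash_store
```

Each helper reports its result through the exit status, so it can be combined with `&&`, `||`, or `if` in a larger diagnostic script.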
TiFlash query time is unstable, and the error log prints many Lock Exception messages
This is because large amounts of data are being written to the cluster, so TiFlash queries encounter locks and have to be retried.
You can set the query timestamp to one second earlier in TiDB. For example, if the current time is '2020-04-08 20:15:01', you can execute set @@tidb_snapshot='2020-04-08 20:15:00'; before you execute the query. This way, fewer TiFlash queries encounter a lock, which mitigates the risk of unstable query time.
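Computing the "one second earlier" timestamp by hand is error-prone around minute and hour boundaries. The following sketch generates the statement from a given time; `snapshot_sql` is a hypothetical helper name, and it assumes GNU `date` (the `-d` option), which is not available on every platform.

```shell
# Sketch only: snapshot_sql is a hypothetical helper; it relies on GNU date (-d).
# Usage: snapshot_sql 'YYYY-MM-DD HH:MM:SS'
# Prints a SET statement whose snapshot is one second before the given time.
snapshot_sql() {
  epoch=$(date -d "$1" +%s) || return 1
  earlier=$(date -d "@$((epoch - 1))" '+%Y-%m-%d %H:%M:%S') || return 1
  printf "set @@tidb_snapshot='%s';\n" "$earlier"
}

# Reproduces the example from the text above.
snapshot_sql '2020-04-08 20:15:01'
```

The printed statement can then be executed in the same TiDB session before the query.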
Some queries return the Region Unavailable error
If the load on TiFlash is so heavy that TiFlash data replication falls behind, some queries might return the Region Unavailable error.
In this case, you can balance the load pressure by adding more TiFlash nodes.
Data file corruption
Take the following steps to handle data file corruption:
- Refer to Take a TiFlash node down to take the corresponding TiFlash node down.
- Delete the related data of the TiFlash node.
- Redeploy the TiFlash node in the cluster.