TiFlash Alert Rules
This document introduces the alert rules of the TiFlash cluster.
TiFlash_schema_error
- Alert rule: - increase(tiflash_schema_apply_count{type="failed"}[15m]) > 0
- Description: - When a schema apply error occurs within the last 15 minutes, an alert is triggered. 
- Solution: - The error might be caused by an internal logic error. Get support from PingCAP or the community. 
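To illustrate what this rule checks, the following Python sketch mimics a simplified version of PromQL's `increase` over the samples of a counter inside the 15-minute window. The function name `simple_increase` and the sample values are hypothetical. The real `increase` also extrapolates to the window boundaries, so treat this as an approximation rather than the exact Prometheus behavior.

```python
def simple_increase(samples):
    """Approximate the growth of a counter over a window of samples.

    A drop between two consecutive samples is treated as a counter reset
    (for example, after a process restart), so only the new value is added.
    Extrapolation to the window boundaries is deliberately left out.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev  # normal counter growth
        else:
            total += cur  # counter restarted from zero
    return total


# Hypothetical samples of tiflash_schema_apply_count{type="failed"} in one window.
failed_counts = [3, 3, 3, 4]

# The alert condition is "increase(...) > 0".
alert_fires = simple_increase(failed_counts) > 0
print(alert_fires)
```

With these assumed samples, one failure was recorded inside the window, so the condition holds. A flat series such as `[3, 3, 3]` would not trigger the alert.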
TiFlash_schema_apply_duration
- Alert rule: - histogram_quantile(0.99, sum(rate(tiflash_schema_apply_duration_seconds_bucket[1m])) BY (le, instance)) > 20
- Description: - When the 99th percentile of the schema apply duration on a TiFlash instance exceeds 20 seconds, an alert is triggered. 
- Solution: - It might be caused by internal problems of the TiFlash storage engine. Get support from PingCAP or the community. 
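The `histogram_quantile(0.99, ...)` expressions in this document estimate a 99th percentile from cumulative histogram buckets. The Python sketch below shows the idea with linear interpolation inside the bucket that contains the target rank. It is a simplified approximation of what Prometheus does for classic histograms, not its actual implementation, and the function name and bucket values are made up for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` mirrors the `le` label: each count includes all observations
    less than or equal to the upper bound, and the last bound is +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical bucket counts for the schema apply duration, in seconds.
slow = [(1, 900), (5, 970), (10, 980), (20, 985), (60, 1000), (math.inf, 1000)]
p99 = histogram_quantile(0.99, slow)
print(p99 > 20)  # the alert condition for TiFlash_schema_apply_duration
```

In this assumed data set, only 985 of 1000 observations finish within 20 seconds, so the estimated 99th percentile lands in the 20 to 60 second bucket and the condition holds. The estimate depends on the bucket boundaries, so it is an approximation of the true percentile.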
TiFlash_raft_read_index_duration
- Alert rule: - histogram_quantile(0.99, sum(rate(tiflash_raft_read_index_duration_seconds_bucket[1m])) BY (le, instance)) > 3
- Description: - When the 99th percentile of the read index duration on a TiFlash instance exceeds 3 seconds, an alert is triggered. 
- Solution: - The long duration might result from frequent retries, which in turn might be caused by frequent Region splitting or migration in the TiKV cluster. Check the TiKV cluster status to identify the reason for the retries. 
TiFlash_raft_wait_index_duration
- Alert rule: - histogram_quantile(0.99, sum(rate(tiflash_raft_wait_index_duration_seconds_bucket[1m])) BY (le, instance)) > 2
- Description: - When the 99th percentile of the waiting time for the Region Raft index on a TiFlash instance exceeds 2 seconds, an alert is triggered. 
- Solution: - It might be caused by a communication error between TiKV and the proxy. Get support from PingCAP or the community.