- Key Features
- Horizontal Scalability
- MySQL Compatible Syntax
- Replicate from and to MySQL
- Distributed Transactions with Strong Consistency
- Cloud Native Architecture
- Minimize ETL with HTAP
- Fault Tolerance & Recovery with Raft
- Automatic Rebalancing
- Deployment and Orchestration with Ansible, Kubernetes, Docker
- JSON Support
- Spark Integration
- Read Historical Data Without Restoring from Backup
- Fast Import and Restore of Data
- Hybrid of Column and Row Storage
- SQL Plan Management
- Open Source
- Online Schema Changes
- Key Features
- Get Started
- From Binary Tarball
- Orchestrated Deployment
- Geographic Redundancy
- SQL Language Structure
- Data Types
- Numeric Types
- Date and Time Types
- String Types
- Functions and Operators
- Function and Operator Reference
- Type Conversion in Expression Evaluation
- Control Flow Functions
- String Functions
- Numeric Functions and Operators
- Date and Time Functions
- Bit Functions and Operators
- Cast Functions and Operators
- Encryption and Compression Functions
- Information Functions
- JSON Functions
- Aggregate (GROUP BY) Functions
- Window Functions
- Miscellaneous Functions
- Precision Math
- List of Expressions for Pushdown
- SQL Statements
ADMIN CANCEL DDL
ADMIN CHECKSUM TABLE
ADMIN CHECK [TABLE|INDEX]
ADMIN SHOW DDL [JOBS|QUERIES]
CREATE TABLE LIKE
SET DEFAULT ROLE
SET [NAMES|CHARACTER SET]
SET [GLOBAL|SESSION] <variable>
SHOW ANALYZE STATUS
SHOW CHARACTER SET
SHOW [FULL] COLUMNS FROM
SHOW CREATE TABLE
SHOW CREATE USER
SHOW [FULL] FIELDS FROM
SHOW INDEXES [FROM|IN]
SHOW INDEX [FROM|IN]
SHOW KEYS [FROM|IN]
SHOW [FULL] PROCESSSLIST
SHOW [FULL] TABLES
SHOW TABLE REGIONS
SHOW TABLE STATUS
SHOW [GLOBAL|SESSION] VARIABLES
- System Databases
- Garbage Collection (GC)
- Understanding the Query Execution Plan
- The Blocklist of Optimization Rules and Expression Pushdown
- Introduction to Statistics
- TopN and Limit Push Down
- Optimizer Hints
- Follower Read
- Check the TiDB Cluster Status Using SQL Statements
- Execution Plan Binding
- Statement Summary Table
- Tune TiKV
- Operating System Tuning
- Column Pruning
- Key Monitoring Metrics
- Best Practices
- TiDB Binlog
- Binlog Consumer Client
- TiDB Binlog Relay Log
- Bidirectional Replication Between TiDB Clusters
- TiDB Lightning
- All Releases
In TiDB, data is split into Regions, each storing data for a specific key range. These Regions are distributed among multiple TiKV instances. As data is written into a cluster, millions of or even tens of millions of Regions are created. Too many Regions on a single TiKV instance can bring a heavy burden to the cluster and affect its performance.
This document introduces the workflow of Raftstore (a core module of TiKV), explains why a massive amount of Regions affect the performance, and offers methods for tuning TiKV performance.
A TiKV instance has multiple Regions on it. The Raftstore module drives the Raft state machine to process Region messages. These messages include processing read or write requests on Regions, persisting or replicating Raft logs, and processing Raft heartbeats. However, an increasing number of Regions can affect performance of the whole cluster. To understand this, it is necessary to learn the workflow of Raftstore shown as follows:
This diagram only illustrates the workflow of Raftstore and does not represent the actual code structure.
From the above diagram, you can see that requests from the TiDB servers, after passing through the gRPC and storage modules, become read and write messages of KV (key-value), and are sent to the corresponding Regions. These messages are not immediately processed but are temporarily stored. Raftstore polls to check whether each Region has messages to process. If a Region has messages to process, Raftstore drives the Raft state machine of this Region to process these messages and perform subsequent operations according to the state changes of these messages. For example, when write requests come in, the Raft state machine stores logs into disk and sends logs to other Region replicas; when the heartbeat interval is reached, the Raft state machine sends heartbeat information to other Region replicas.
From the Raftstore workflow diagram, messages in each Region are processed one by one. When a large number of Regions exist, it takes Raftstore some time to process the heartbeats of these Regions, which can cause some delay. As a result, some read and write requests are not processed in time. If read and write pressure is high, the CPU usage of the Raftstore thread might easily become the bottleneck, which further increases the delay and affects the performance.
Generally, if the CPU usage of the loaded Raftstore reaches 85% or higher, Raftstore goes into a busy state and becomes the bottleneck. At the same time,
propose wait duration can be as high as hundreds of milliseconds.
- For the CPU usage of Raftstore as mentioned above, Raftstore is single-threaded. If Raftstore is multi-threaded, you can increase the CPU usage threshold (85%) proportionally.
- Because I/O operations exist in the Raftstore thread, CPU usage cannot reach 100%.
You can check the following monitoring metrics in Grafana's TiKV Dashboard:
Raft store CPUin the Thread-CPU panel
Reference value: lower than
raftstore.store-pool-size * 85%.
Propose wait durationin the Raft Propose panel
Propose wait durationis the delay between the time a request is sent to Raftstore and the time Raftstore actually starts processing the request. Long delay means that Raftstore is busy, or that processing the append log is time-consuming, making Raftstore unable to process the request in time.
Reference value: lower than 50~100 ms according to the cluster size
After finding out the cause of a performance problem, try to solve it from the following two aspects:
- Reduce the number of Regions on a single TiKV instance
- Reduce the number of messages for a single Region
Raftstore has been upgraded to a multi-threaded module since TiDB v3.0, which greatly reduces the possibility that a Raftstore thread becomes the bottleneck.
raftstore.store-pool-size is configured to
2 in TiKV. If a bottleneck occurs in Raftstore, you can properly increase the value of this configuration item according to the actual situation. But to avoid introducing unnecessary thread switching overhead, it is recommended that you do not set this value too high.
In the actual situation, read and write requests are not evenly distributed on every Region. Instead, they are concentrated on a few Regions. Then you can minimize the number of messages between the Raft leader and the followers for the temporarily idle Regions, which is the feature of Hibernate Region. In this feature, Raftstore does sent tick messages to the Raft state machines of idle Regions if not necessary. Then these Raft state machines will not be triggered to generate heartbeat messages, which can greatly reduce the workload of Raftstore.
Region Mergeis enabled by default since TiDB v3.0.
You can also reduce the number of Regions by enabling
Region Merge. Contrary to
Region Merge is the process of merging adjacent small Regions through scheduling. After dropping data or executing the
Drop Table or
Truncate Table statement, you can merge small Regions or even empty Regions to reduce resource consumption.
Region Merge by configuring the following parameters:
>> pd-ctl config set max-merge-region-size 20 >> pd-ctl config set max-merge-region-keys 200000 >> pd-ctl config set merge-schedule-limit 8
The default configuration of the
Region Merge parameters is rather conservative. You can speed up the
Region Merge process by referring to the method provided in PD Scheduling Best Practices.
If I/O resources and CPU resources are sufficient, you can deploy multiple TiKV instances on a single machine to reduce the number of Regions on a single TiKV instance; or you can increase the number of machines in the TiKV cluster.
In addition to reducing the number of Regions, you can also reduce pressure on Raftstore by reducing the number of messages for each Region within a unit of time. For example, you can properly increase the value of the
raft-base-tick-interval configuration item:
[raftstore] raft-base-tick-interval = "2s"
In the above configuration,
raft-base-tick-interval is the time interval at which Raftstore drives the Raft state machine of each Region, which means at this time interval, Raftstore sends a tick message to the Raft state machine. Increasing this interval can effectively reduce the number of messages from Raftstore.
Note that this interval between tick messages also determines the intervals between
election timeout and
heartbeat. See the following example:
raft-election-timeout = raft-base-tick-interval * raft-election-timeout-ticks raft-heartbeat-interval = raft-base-tick-interval * raft-heartbeat-ticks
If Region followers have not received the heartbeat from the leader within the
raft-election-timeout interval, these followers determine that the leader has failed and start a new election.
raft-heartbeat-interval is the interval at which a leader sends a heartbeat to followers. Therefore, increasing the value of
raft-base-tick-interval can reduce the number of network messages sent from Raft state machines but also makes it longer for Raft state machines to detect the leader failure.
This section describes some other problems and solutions.
PD needs to persist Region Meta information on etcd to ensure that PD can quickly resume to provide Region routing services after switching the PD Leader node. As the number of Regions increases, the performance problem of etcd appears, making it slower for PD to get Region Meta information from etcd when PD is switching the Leader. With millions of Regions, it might take more than ten seconds or even tens of seconds to get the meta information from etcd.
To address this problem,
use-region-storage is enabled by default in PD since TiDB v3.0. With this feature enabled, PD stores Region Meta information on local LevelDB and synchronizes the information among PD nodes through other mechanisms.
In TiKV, pd-worker regularly reports Region Meta information to PD. When TiKV is restarted or switches the Region leader, PD needs to recalculate Region's
approximate size / keys through statistics. Therefore, with a large number of Regions, the single-threaded pd-worker might become the bottleneck, causing tasks to be piled up and not processed in time. In this situation, PD cannot obtain certain Region Meta information in time so that the routing information is not updated in time. This problem does not affect the actual reads and writes, but might cause inaccurate PD scheduling and require several round trips when TiDB updates Region cache.
You can check Worker pending tasks under Task in the TiKV Grafana panel to determine whether pd-worker has tasks piled up. Generally,
pending tasks should be kept at a relatively low value.
pd-worker has been optimized for better performance since v3.0.5. If you encounter a similar problem, it is recommended to upgrade to the latest version.
In a large-scale cluster, as the number of TiKV instances increases, Prometheus has greater pressure to query metrics, making it slower for Grafana to display these metrics. To ease this problem, metrics pre-calculation is configured since v3.0.
- Best Practices for TiKV Performance Tuning with Massive Regions
- Raftstore workflow
- Performance problem
- Performance tuning methods
- Other problems and solutions