TiDB Cloud Performance Highlights for TiDB v8.5.0

TiDB v8.5.0 is an important Long-Term Support (LTS) release, which delivers notable improvements in performance, scalability, and operational efficiency.

This document outlines the performance improvements in v8.5.0 across the following areas:

General performance
Hotspot reads with excessive MVCC versions
IO latency jitter
Batch processing
TiKV scaling performance

General performance

With the default Region size increased from 96 MiB to 256 MiB and some other improvements in v8.5.0, significant performance improvements are observed:

oltp_insert performance improves by 27%.
Analyze performance shows a major boost of approximately 45%.

Hotspot reads with excessive MVCC versions

Challenge

In some user scenarios, the following challenges might occur:

Frequent version updates: in some workloads, the data is updated and read very frequently.
Long retention of historical versions: to meet business requirements, such as supporting flashback to specific time points, users might configure overly long GC (Garbage Collection) times (such as 24 hours). This results in an excessive accumulation of Multi-Version Concurrency Control (MVCC) versions, which significantly reduces query efficiency.

The accumulation of MVCC versions creates a large gap between the requested data and the processed data, leading to degraded read performance.

Solution

To address this issue, TiKV introduces In-Memory Engine (IME). By caching the latest versions in memory, TiKV reduces the impact of historical versions on read performance, significantly improving query efficiency.

Test environment

Cluster topology: TiDB (16 vCPU, 32 GiB) * 1 + TiKV (16 vCPU, 32 GiB) * 6

Cluster configurations:

tikv_configs:
[in-memory-engine]
enable = true

Dataset: 24 GiB storage size with about 340 Regions, containing frequently updated data
Workload: rows requiring scans contain numerous MVCC versions introduced by frequent updates

Test results

The query latency decreases by 50%, and the throughput increases by 100%.

Configuration	Duration (s)	Threads	TPS	QPS	Average latency (ms)	P95 latency (ms)
Disable IME	3600	10	123.8	1498.3	80.76	207.82
Enable IME	3600	10	238.2	2881.5	41.99	78.60

IO latency jitter

Challenge

In cloud environments, transient or sustained IO latency fluctuations on cloud disks are a common challenge. These fluctuations can increase request latency, causing timeouts, errors, and disruptions to normal business operations, ultimately degrading service quality.

Solution

TiDB v8.5.0 introduces multiple enhancements to mitigate the impact of cloud disk IO jitter on performance:

Leader write optimization: allows the leader to apply committed but not yet persisted Raft logs early, reducing IO jitter's effect on write latency for the leader peer.
Enhanced slow-node detection: improves the slow-node detection algorithm, enabling slow score detection by default. This triggers the evict-leader scheduler to restore performance when a slow node is identified. The slow node detection mechanism uses evict-slow-store-scheduler to detect and manage slow nodes, reducing the effects of cloud disk jitter.
Unified health controller: adds a unified health controller in TiKV and a feedback mechanism to the KV client. The KV client optimizes error handling and replica selection based on TiKV node health and performance.
Improved replica selector: introduces Replica Selector V2 in the KV client, with refined state transitions that eliminate unnecessary retries and backoff operations.
Additional fixes and improvements: includes enhancements to critical components such as the region cache and KV client health checker, while avoiding unnecessary IO operations in TiKV's store loop.

Test environment

Cluster topology: TiDB (32 vCPU, 64 GiB) * 3 + TiKV (32 vCPU, 64 GiB) * 6
Workload: a read/write ratio of 2:1, with simulated cloud disk IO delays or hangs on one TiKV node

Test results

Based on the preceding test setup, failovers are now improved in multiple IO delay scenarios, and P99/999 latency during impacts is reduced by up to 98%.

In the following table of test results, the Current column shows the results with improvements to reduce IO latency jitter, while the Original column shows the results without these improvements:

Workload description	Failover time		Maximum latency during impacts (P999)		Maximum latency during impacts (P99)
Workload description	Current	Original	Current	Original	Current	Original
IO delay of 2 ms lasts for 10 mins	Almost no effect	Failover not available	14 ms	232 ms	7.9 ms	22.9 ms
IO delay of 5 ms lasts for 10 mins	2 mins	Failover not available	37.9 ms	462 ms	10 ms	246 ms
IO delay of 10 ms lasts for 10 mins	3 mins	Failover not available	69 ms	3 s	25 ms	1.45 s
IO delay of 50 ms lasts for 10 mins	3 mins	Failover not available	1.36 s	13.2 s	238 ms	6.7 s
IO delay of 100 ms lasts for 10 mins	3 mins	Failover not available	7.53 s	32 s	1.7 s	26 s

Further improvements

The severity of disk jitter might also be highly related to users' workload profiles. In latency-sensitive scenarios, designing applications in conjunction with TiDB features can further minimize the impact of IO jitter on applications. For example, in read-heavy and latency-sensitive environments, adjusting the tikv_client_read_timeout system variable according to latency requirements and using stale reads or follower reads can enable faster failover retries to other replica peers for KV requests sent from TiDB. This reduces the impact of IO jitter on a single TiKV node and helps improve query latency. Note that the effectiveness of this feature depends on the workload profile, which should be evaluated before implementation.

Additionally, users deploying TiDB on public cloud can reduce the probability of jitter by choosing cloud disks with higher performance.

Batch processing

Challenge

Large-scale transactions, such as bulk data updates, system migrations, and ETL workflows, involve processing millions of rows and are vital for supporting critical operations. While TiDB excels as a distributed SQL database, handling such transactions at scale presents two significant challenges:

Memory limits: in versions earlier than TiDB v8.1.0, all transaction mutations are held in memory throughout the transaction lifecycle, which strains resources and reduces performance. For operations involving millions of rows, this could lead to excessive memory usage and, in some cases, Out of Memory (OOM) errors when resources are insufficient.
Performance slowdowns: managing large in-memory buffers relies on red-black trees, which introduces computational overhead. As buffers grow, their operations slow down due to the O(N log N) complexity inherent in these data structures.

Solution

These challenges highlight a clear opportunity to improve scalability, reduce complexity, and enhance reliability. With the rise of modern data workloads, TiDB introduces the Pipelined DML feature, designed to optimize the handling of large transactions, and improve resource utilization and overall performance.

Test environment

Cluster topology: TiDB (16 vCPU, 32 GiB) * 1 + TiKV (16 vCPU, 32 GiB) * 3
Dataset: YCSB non-clustered table with 10 million rows (about 10 GiB of data). Primary keys are selectively removed in certain tests to isolate and evaluate the impact of hotspot patterns.
Workload: DML operations including INSERT, UPDATE, and DELETE.

Test results

The execution speed increases by 2x, the maximum TiDB memory usage decreases by 50%, and the TiKV write flow becomes much smoother.

Latency (in seconds)
Workload (10 GiB) Standard DML Pipelined DML Improvement
YCSB-insert-10M 368 159 131.45%
YCSB-update-10M 255 131 94.66%
YCSB-delete-10M 136 42 223.81%
TiDB memory usage peak (GiB)
Workload (10 GiB) Standard DML Pipelined DML Reduction
YCSB-insert-10M 25.8 12 53.49%
YCSB-update-10M 23.1 12.9 44.16%
YCSB-delete-10M 10.1 8.08 20.00%

Workload (10 GiB)	Standard DML	Pipelined DML	Improvement
YCSB-insert-10M	368	159	131.45%
YCSB-update-10M	255	131	94.66%
YCSB-delete-10M	136	42	223.81%

Workload (10 GiB)	Standard DML	Pipelined DML	Reduction
YCSB-insert-10M	25.8	12	53.49%
YCSB-update-10M	23.1	12.9	44.16%
YCSB-delete-10M	10.1	8.08	20.00%

TiKV scaling performance

Challenge

Horizontal scaling is a core capability of TiKV, enabling the system to scale in or out as needed. As business demands grow and the number of tenants increases, TiDB clusters experience rapid growth in databases, tables, and data volume. Scaling out TiKV nodes quickly becomes essential to maintaining service quality.

In some scenarios, TiDB hosts a large number of databases and tables. When these tables are small or empty, TiKV accumulates a significant number of tiny Regions, especially when the number of tables grows to a large scale (such as 1 million or more). These small Regions introduce a substantial maintenance burden, increase resource overhead, and reduce efficiency.

Solution

To address this issue, TiDB v8.5.0 improves the performance of merging small Regions, reducing internal overhead and improving resource utilization. Additionally, TiDB v8.5.0 includes several other enhancements to further improve TiKV scaling performance.

Test environment

Merging small Regions

Cluster topology: TiDB (16 vCPU, 32 GiB) * 1 + TiKV (16 vCPU, 32 GiB) * 3
Dataset: nearly 1 million small tables, with the size of each table < 2 MiB
Workload: automatic merging of small Regions

Scaling out TiKV nodes

Cluster topology: TiDB (16 vCPU, 32 GiB) * 1 + TiKV (16 vCPU, 32 GiB) * 4
Dataset: TPC-C dataset with 20,000 warehouses
Workload: scaling out TiKV nodes from 4 to 7

Test results

The small Region merge speed improves by about 10 times.

Metric	Without improvement	With improvement
Region merging duration (hrs)	20	2

TiKV scaling performance improves by over 40%, and TiKV node scaling out duration gains a 30% reduction.

Metric	Without improvement	With improvement
TiKV scaling out duration (hrs)	5	3.5

Benchmark

In addition to the preceding test data, you can refer to the following benchmark results for v8.5.0 performance:

TiDB Cloud Performance Highlights for TiDB v8.5.0

General performance

Hotspot reads with excessive MVCC versions

Challenge

Solution

Test environment

Test results

IO latency jitter

Challenge

Solution

Test environment

Test results

Further improvements

Batch processing

Challenge

Solution

Test environment

Test results

TiKV scaling performance

Challenge

Solution

Test environment

Merging small Regions

Scaling out TiKV nodes

Test results

Benchmark

Was this page helpful?