Explore HTAP
This guide describes how to explore and use the features of TiDB Hybrid Transactional and Analytical Processing (HTAP).
Use cases
TiDB HTAP can handle the massive data that increases rapidly, reduce the cost of DevOps, and be deployed in either self-hosted or cloud environments easily, which brings the value of data assets in real time.
The following are the typical use cases of HTAP:
Hybrid workload
When using TiDB for real-time Online Analytical Processing (OLAP) in hybrid load scenarios, you only need to provide an entry point of TiDB to your data. TiDB automatically selects different processing engines based on the specific business.
Real-time stream processing
When using TiDB in real-time stream processing scenarios, TiDB ensures that all the data flowed in constantly can be queried in real time. At the same time, TiDB also can handle highly concurrent data workloads and Business Intelligence (BI) queries.
Data hub
When using TiDB as a data hub, TiDB can meet specific business needs by seamlessly connecting the data for the application and the data warehouse.
For more information about use cases of TiDB HTAP, see blogs about HTAP on the PingCAP website.
To enhance the overall performance of TiDB, it is recommended to use HTAP in the following technical scenarios:
Improve analytical processing performance
If your application involves complex analytical queries, such as aggregation and join operations, and these queries are performed on a large amount of data (more than 10 million rows), the row-based storage engine TiKV might not meet your performance requirements if the tables in these queries cannot effectively use indexes or have low index selectivity.
Hybrid workload isolation
While dealing with high-concurrency Online Transactional Processing (OLTP) workloads, your system might also need to handle some OLAP workloads. To ensure the overall system stability, you expect to avoid the impact of OLAP queries on OLTP performance.
Simplify the ETL technology stack
When the amount of data to be processed is of medium scale (less than 100 TB), the data processing and scheduling processes are relatively simple, and the concurrency is not high (less than 10), you might want to simplify the technology stack of your system. By replacing multiple different technology stacks used in OLTP, ETL, and OLAP systems with a single database, you can meet the requirements of both transactional systems and analytical systems. This reduces technical complexity and the need for maintenance personnel.
Strongly consistent analysis
To achieve real-time, strongly consistent data analysis and calculation, and ensure the analysis results to be completely consistent with the transactional data, you need to avoid data latency and inconsistency issues.
Architecture
In TiDB, a row-based storage engine TiKV for Online Transactional Processing (OLTP) and a columnar storage engine TiFlash for Online Analytical Processing (OLAP) co-exist, replicate data automatically, and keep strong consistency.
For more information about the architecture, see architecture of TiDB HTAP.
Environment preparation
Before exploring the features of TiDB HTAP, you need to deploy TiDB and the corresponding storage engines according to the data volume. If the data volume is large (for example, 100 T), it is recommended to use TiFlash Massively Parallel Processing (MPP) as the primary solution and TiSpark as the supplementary solution.
TiFlash
If you have deployed a TiDB cluster with no TiFlash node, add the TiFlash nodes in the current TiDB cluster. For detailed information, see Scale out a TiFlash cluster.
If you have not deployed a TiDB cluster, see Deploy a TiDB Cluster Using TiUP. Based on the minimal TiDB topology, you also need to deploy the topology of TiFlash.
When deciding how to choose the number of TiFlash nodes, consider the following scenarios:
- If your use case requires OLTP with small-scale analytical processing and Ad-Hoc queries, deploy one or several TiFlash nodes. They can dramatically increase the speed of analytic queries.
- If the OLTP throughput does not cause significant pressure to I/O usage rate of the TiFlash nodes, each TiFlash node uses more resources for computation, and thus the TiFlash cluster can have near-linear scalability. The number of TiFlash nodes should be tuned based on expected performance and response time.
- If the OLTP throughput is relatively high (for example, the write or update throughput is higher than 10 million lines/hours), due to the limited write capacity of network and physical disks, the I/O between TiKV and TiFlash becomes a bottleneck and is also prone to read and write hotspots. In this case, the number of TiFlash nodes has a complex non-linear relationship with the computation volume of analytical processing, so you need to tune the number of TiFlash nodes based on the actual status of the system.
TiSpark
- If your data needs to be analyzed with Spark, deploy TiSpark. For specific process, see TiSpark User Guide.
Data preparation
After TiFlash is deployed, TiKV does not replicate data to TiFlash automatically. You need to manually specify which tables need to be replicated to TiFlash. After that, TiDB creates the corresponding TiFlash replicas.
- If there is no data in the TiDB Cluster, migrate the data to TiDB first. For detailed information, see data migration.
- If the TiDB cluster already has the replicated data from upstream, after TiFlash is deployed, data replication does not automatically begin. You need to manually specify the tables to be replicated to TiFlash. For detailed information, see Use TiFlash.
Data processing
With TiDB, you can simply enter SQL statements for query or write requests. For the tables with TiFlash replicas, TiDB uses the front-end optimizer to automatically choose the optimal execution plan.
Performance monitoring
When using TiDB, you can monitor the TiDB cluster status and performance metrics in either of the following ways:
- TiDB Dashboard: you can see the overall running status of the TiDB cluster, analyse distribution and trends of read and write traffic, and learn the detailed execution information of slow queries.
- Monitoring system (Prometheus & Grafana): you can see the monitoring parameters of TiDB cluster-related components including PD, TiDB, TiKV, TiFlash, TiCDC, and Node_exporter.
To see the alert rules of TiDB cluster and TiFlash cluster, see TiDB cluster alert rules and TiFlash alert rules.
Troubleshooting
If any issue occurs during using TiDB, refer to the following documents:
- Analyze slow queries
- Identify expensive queries
- Troubleshoot hotspot issues
- TiDB cluster troubleshooting guide
- Troubleshoot a TiFlash Cluster
You are also welcome to create GitHub Issues or submit your questions on AskTUG.
What's next
- To check the TiFlash version, critical logs, system tables, see Maintain a TiFlash cluster.
- To remove a specific TiFlash node, see Scale out a TiFlash cluster.