Replicate Data to MySQL-compatible Databases

This document describes how to replicate incremental data to the downstream TiDB database or other MySQL-compatible databases using TiCDC. It also introduces how to use the eventually consistent replication feature in disaster scenarios.

Create a replication task

Create a replication task by running the following command:

cdc cli changefeed create \
    --server=http://10.0.10.25:8300 \
    --sink-uri="mysql://root:123456@127.0.0.1:3306/" \
    --changefeed-id="simple-replication-task"

Create changefeed successfully!
ID: simple-replication-task
Info: {"sink-uri":"mysql://root:123456@127.0.0.1:3306/","opts":{},"create-time":"2023-11-28T22:04:08.103600025+08:00","start-ts":415241823337054209,"target-ts":0,"admin-job-type":0,"sort-engine":"unified","sort-dir":".","config":{"case-sensitive":false,"filter":{"rules":["*.*"],"ignore-txn-start-ts":null,"ddl-allow-list":null},"mounter":{"worker-num":16},"sink":{"dispatchers":null},"scheduler":{"type":"table-number","polling-time":-1}},"state":"normal","history":null,"error":null}

--server: The address of any TiCDC server in the TiCDC cluster.
--changefeed-id: The ID of the replication task. The format must match the ^[a-zA-Z0-9]+(\-[a-zA-Z0-9]+)*$ regular expression. If this ID is not specified, TiCDC automatically generates a UUID (the version 4 format) as the ID.
--sink-uri: The downstream address of the replication task. For details, see Configure sink URI with mysql/tidb.
--start-ts: Specifies the starting TSO of the changefeed. From this TSO, the TiCDC cluster starts pulling data. The default value is the current time.
--target-ts: Specifies the ending TSO of the changefeed. To this TSO, the TiCDC cluster stops pulling data. The default value is empty, which means that TiCDC does not automatically stop pulling data.
--config: Specifies the changefeed configuration file. For details, see TiCDC Changefeed Configuration Parameters.

Note

TiCDC only replicates incremental data. To initialize full data, use Dumpling/TiDB Lightning or BR.
After the full data is initialized, you need to specify the start-ts as the TSO when the upstream backup is performed. For example, the pos value in the metadata file under the Dumpling directory, or the backupTS value in the log output after BR completes the backup.

Configure sink URI for MySQL or TiDB

Sink URI is used to specify the connection information of the TiCDC target system. The format is as follows:

[scheme]://[userinfo@][host]:[port][/path]?[query_parameters]

Note

/path is not used for the MySQL sink.

Sample configuration for MySQL:

--sink-uri="mysql://root:12345678@127.0.0.1:3306"

The following are descriptions of sink URI parameters and parameter values that can be configured for MySQL or TiDB:

Parameter/Parameter value	Description
`root`	The username of the downstream database. To replicate data to TiDB or other MySQL-compatible databases, make sure that the downstream database user has certain permissions.
`12345678`	The password of the downstream database (can be encoded using Base64).
`127.0.0.1`	The IP address of the downstream database.
`3306`	The port for the downstream database.
`worker-count`	The number of SQL statements that can be concurrently executed to the downstream (optional, the default value is `16`, and the maximum value is `1024`).
`cache-prep-stmts`	Controls whether to use prepared statements when executing SQL in the downstream and enable prepared statement cache on the client side (optional, `true` by default).
`multi-stmt-enable`	Controls whether the SQL statements executed downstream support multiple SQL statements separated by semicolons (optional, the default value is `true`). If it is set to `false`, each SQL statement is executed as a separate transaction. If it is set to `true`, `cache-prep-stmts` does not take effect.
`max-txn-row`	The batch size of SQL statements executed to the downstream (optional, the default value is `256`, and the maximum value is `2048`).
`max-multi-update-row`	The batch size of `UPDATE ROWS` SQL statements executed to the downstream when batch write (`batch-dml-enable`) is enabled, always less than `max-txn-row` (optional, the default value is `40`, and the maximum value is `256`).
`max-multi-update-row-size`	When batch write (`batch-dml-enable`) is enabled, this parameter controls the batch processing size (in bytes) for `UPDATE ROWS` SQL statements executed to the downstream. If the average size of a single row exceeds this threshold, each row is executed as an independent SQL statement (optional, the default value is `1024`, and the maximum value is `8192`).
`ssl-ca`	The path of the CA certificate file needed to connect to the downstream MySQL instance (optional).
`ssl-cert`	The path of the certificate file needed to connect to the downstream MySQL instance (optional).
`ssl-key`	The path of the certificate key file needed to connect to the downstream MySQL instance (optional).
`time-zone`	The time zone name used when connecting to downstream MySQL and TiDB instances (that is, the `time_zone` of the downstream connection session), which is effective since v4.0.8. This is an optional parameter. If this parameter is not specified, the time zone of TiCDC service processes is used. If this parameter is set to an empty value, such as `time-zone=""`, it means no session time zone is specified when TiCDC connects, and the default time zone of the downstream is used.
`transaction-atomicity`	The atomicity level of a transaction. This is an optional parameter, with the default value of `none`. When the value is `table`, TiCDC ensures the atomicity of a single-table transaction. When the value is `none`, TiCDC splits the single-table transaction.
`batch-dml-enable`	Enables the batch write (batch-dml) feature (optional, the default value is `true`).
`read-timeout`	The go-sql-driver parameter, I/O read timeout (optional, the default value is `2m`).
`write-timeout`	The go-sql-driver parameter, I/O write timeout (optional, the default value is `2m`).
`timeout`	The go-sql-driver parameter, timeout for establishing connections, also known as dial timeout (optional, the default value is `2m`).
`tidb-txn-mode`	Specifies the `tidb_txn_mode` environment variable (optional, the default value is `optimistic`).
`safe-mode`	Specifies how TiCDC handles `INSERT` and `UPDATE` statements when replicating data to the downstream. When it is `true`, TiCDC converts all upstream `INSERT` statements to `REPLACE INTO` statements, and all `UPDATE` statements to `DELETE` + `REPLACE INTO` statements. Before v6.1.3, the default value of this parameter is `true`. Starting from v6.1.3, the default value is changed to `false`. When TiCDC starts, it obtains a current timestamp `ThresholdTs`. For `INSERT` and `UPDATE` statements with `CommitTs` less than `ThresholdTs`, TiCDC converts them to `REPLACE INTO` statements and `DELETE` + `REPLACE INTO` statements respectively. For `INSERT` and `UPDATE` statements with `CommitTs` greater than or equal to `ThresholdTs`, `INSERT` statements are directly replicated to the downstream, while the behavior of `UPDATE` statements follows the TiCDC Behavior in Splitting UPDATE Events.

Note

time-zone is only effective for mysql and tidb sinks. After TiCDC establishes a connection with the downstream, it sets the time_zone for the session. This time_zone is used by the downstream to parse time values affected by time zones, such as TIMESTAMP, when executing DDL and DML statements. The DATETIME, DATE, and TIME data types are not affected by time zone settings.
To avoid data inconsistency caused by inconsistent time zone settings, it is recommended to explicitly set time-zone and ensure that its value is consistent with the TiCDC Server's --tz parameter and the downstream database's time zone.

To encode the database password in the sink URI using Base64, use the following command:

echo -n '12345678' | base64   # '12345678' is the password to be encoded.

The encoded password is as follows:

MTIzNDU2Nzg=

Note

When the sink URI parameters contain special characters such as ! * ' ( ) ; : @ & = + $ , / ? % # [ ], you need to escape the special characters, for example, in URI Encoder.

For example, if the username for connecting to the downstream database is R&D (2) and the certificate file path is /data1/R&D (2).pem, you need to escape these parameters as follows:

--sink-uri="mysql://R%26D%20%282%29:MTIzNDU2Nzg%3D@127.0.0.1:3306/?ssl-cert=/data1/R%26D%20%282%29.pem"
#                    ^~~ ^~~^~~ ^~~            ^~~                                  ^~~ ^~~^~~ ^~~

Permissions required for the downstream database user

To replicate data to TiDB or other MySQL-compatible databases, the downstream database user needs the following permissions:

Select
Index
Insert
Update
Delete
Create
References
Drop
Alter
Create View

To replicate RECOVER TABLE to the downstream TiDB, the downstream database user also needs the Super permission.

If the downstream TiDB cluster has read-only mode enabled, the downstream database user also needs the RESTRICTED_REPLICA_WRITER_ADMIN permission.

Eventually consistent replication in disaster scenarios

The eventually consistent replication feature in TiCDC uses redo logs to ensure data consistency in the event of an upstream disaster. Starting from v6.1.1, this feature becomes GA. Starting from v5.3.0, TiCDC supports backing up incremental data from an upstream TiDB cluster to an object storage or an NFS of the downstream cluster. When the upstream cluster encounters a disaster and becomes unavailable, TiCDC can restore the downstream data to the recent eventually consistent state. With this capability, you can switch applications to the downstream cluster quickly, avoiding long-time downtime and improving service continuity.

Currently, TiCDC can replicate incremental data from a TiDB cluster to another TiDB cluster or a MySQL-compatible database system (including Aurora, MySQL, and MariaDB). In case the upstream cluster crashes, TiCDC can restore data in the downstream cluster within 5 minutes, given the conditions that TiCDC replicates data normally before the crash, and the replication lag is small. It allows data loss of 10s at most, that is, RTO <= 5 min, and P95 RPO <= 10s.

TiCDC replication lag increases in the following scenarios:

The TPS increases significantly in a short time.
Large or long transactions occur in the upstream.
The TiKV or TiCDC cluster in the upstream is reloaded or upgraded.
Time-consuming DDL statements, such as add index, are executed in the upstream.
The PD is configured with aggressive scheduling strategies, resulting in frequent transfer of Region leaders, or frequent Region merge or Region split.

Note

Starting from v6.1.1, the eventually consistent replication feature of TiCDC supports Amazon S3-compatible object storage. Starting from v6.1.4, this feature supports GCS- and Azure-compatible object storage.

Prerequisites

Prepare a highly available object storage or NFS for storing TiCDC's real-time incremental data backup files. These files can be accessed in case of a disaster in the upstream.
Enable this feature for changefeeds that need to have eventual consistency in disaster scenarios. To enable it, you can add the following configuration to the changefeed configuration file.

[consistent]
# Consistency level. Options include:
# - none: the default value. In a non-disaster scenario, eventual consistency is only guaranteed if and only if finished-ts is specified.
# - eventual: Uses redo log to guarantee eventual consistency in case of the primary cluster disasters.
level = "eventual"

# Individual redo log file size, in MiB. By default, it's 64. It is recommended to be no more than 128.
max-log-size = 64

# The interval for flushing or uploading redo logs to Amazon S3, in milliseconds. It is recommended that this configuration be equal to or greater than 2000.
flush-interval = 2000

# The path under which redo log backup is stored. The scheme can be nfs (NFS directory), or Amazon S3, GCS, and Azure (uploaded to object storage).
storage = "$SCHEME://logbucket/test-changefeed?endpoint=http://$ENDPOINT/"

Disaster recovery

When a disaster happens in the primary cluster, you need to recover manually in the secondary cluster by running the cdc redo command. The recovery process is as follows.

Ensure that all the TiCDC processes have exited. This is to prevent the primary cluster from resuming service during data recovery and prevent TiCDC from restarting data synchronization.
Use cdc binary for data recovery. Run the following command:

cdc redo apply --tmp-dir="/tmp/cdc/redo/apply" \
    --storage="s3://logbucket/test-changefeed?endpoint=http://10.0.10.25:24927/" \
    --sink-uri="mysql://normal:123456@10.0.10.55:3306/"

In this command:

tmp-dir: Specifies the temporary directory for downloading TiCDC incremental data backup files.
storage: Specifies the address for storing the TiCDC incremental data backup files, either a URI of object storage or an NFS directory.
sink-uri: Specifies the secondary cluster address to restore the data to. Scheme can only be mysql.

Replicate Data to MySQL-compatible Databases

Create a replication task

Configure sink URI for MySQL or TiDB

Permissions required for the downstream database user

Eventually consistent replication in disaster scenarios

Prerequisites

Disaster recovery

Was this page helpful?