Migrate and Merge MySQL Shards of Large Datasets to TiDB Cloud

This document describes how to migrate and merge a large MySQL dataset (for example, more than 1 TiB) from different partitions into TiDB Cloud. After full data migration, you can use TiDB Data Migration (DM) to perform incremental migration according to your business needs.

The example in this document uses a complex shard migration task across multiple MySQL instances, and involves handling conflicts in auto-increment primary keys. The scenario in this example is also applicable to merging data from different sharded tables within a single MySQL instance.

Environment information in the example

This section describes the basic information of the upstream cluster, DM, and downstream cluster used in the example.

Upstream cluster

The environment information of the upstream cluster is as follows:

MySQL version: MySQL v5.7.18
MySQL instance1:
- schema store_01 and table [sale_01, sale_02]
- schema store_02 and table [sale_01, sale_02]
MySQL instance2:
- schema store_01and table [sale_01, sale_02]
- schema store_02and table [sale_01, sale_02]

Table structure:

CREATE TABLE sale_01 (
id bigint(20) NOT NULL auto_increment,
uid varchar(40) NOT NULL,
sale_num bigint DEFAULT NULL,
PRIMARY KEY (id),
UNIQUE KEY ind_uid (uid)
);

DM

The version of DM is v5.3.0. You need to deploy TiDB DM manually. For detailed steps, see Deploy a DM Cluster Using TiUP.

External storage

This document uses the Amazon S3 as an example.

Downstream cluster

The sharded schemas and tables are merged into the table store.sales.

Perform full data migration from MySQL to TiDB Cloud

The following is the procedure to migrate and merge full data of MySQL shards to TiDB Cloud.

In the following example, you only need to export the data in tables to CSV format.

Step 1. Create directories in the Amazon S3 bucket

Create a first-level directory store (corresponding to the level of databases) and a second-level directory sales (corresponding to the level of tables) in the Amazon S3 bucket. In sales, create a third-level directory for each MySQL instance (corresponding to the level of MySQL instances). For example:

Migrate the data in MySQL instance1 to s3://dumpling-s3/store/sales/instance01/
Migrate the data in MySQL instance2 to s3://dumpling-s3/store/sales/instance02/

If there are shards across multiple instances, you can create one first-level directory for each database and create one second-level directory for each sharded table. Then create a third-level directory for each MySQL instance for easy management. For example, if you want to migrate and merge tables stock_N.product_N from MySQL instance1 and MySQL instance2 into the table stock.products in TiDB Cloud, you can create the following directories:

s3://dumpling-s3/stock/products/instance01/
s3://dumpling-s3/stock/products/instance02/

Step 2. Use Dumpling to export data to Amazon S3

For information about how to install Dumpling, see Dumpling Introduction.

When you use Dumpling to export data to Amazon S3, note the following:

Enable binlog for upstream clusters.
Choose the correct Amazon S3 directory and region.
Choose the appropriate concurrency by configuring the -t option to minimize the impact on the upstream cluster, or export directly from the backup database. For more information about how to use this parameter, see Option list of Dumpling.
Set appropriate values for --filetype csv and --no-schemas. For more information about how to use these parameters, see Option list of Dumpling.

Name the CSV files as follows:

If the data of one table is separated into multiple CSV files, append a numeric suffix to these CSV files. For example, ${db_name}.${table_name}.000001.csv and ${db_name}.${table_name}.000002.csv. The numeric suffixes can be inconsecutive but must be in ascending order. You also need to add extra zeros before the number to ensure all the suffixes are in the same length.

Note

If you cannot update the CSV filenames according to the preceding rules in some cases (for example, the CSV file links are also used by your other programs), you can keep the filenames unchanged and use the Mapping Settings in Step 5 to import your source data to a single target table.

To export data to Amazon S3, do the following:

Get the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY of the Amazon S3 bucket.

[root@localhost ~]# export AWS_ACCESS_KEY_ID={your_aws_access_key_id}
[root@localhost ~]# export AWS_SECRET_ACCESS_KEY= {your_aws_secret_access_key}

Export data from MySQL instance1 to the s3://dumpling-s3/store/sales/instance01/ directory in the Amazon S3 bucket.

[root@localhost ~]# tiup dumpling -u {username} -p {password} -P {port} -h {mysql01-ip} -B store_01,store_02 -r 20000 --filetype csv --no-schemas -o "s3://dumpling-s3/store/sales/instance01/" --s3.region "ap-northeast-1"

For more information about the parameters, see Option list of Dumpling.

Export data from MySQL instance2 to the s3://dumpling-s3/store/sales/instance02/ directory in the Amazon S3 bucket.

[root@localhost ~]# tiup dumpling -u {username} -p {password} -P {port} -h {mysql02-ip} -B store_01,store_02 -r 20000 --filetype csv --no-schemas -o "s3://dumpling-s3/store/sales/instance02/" --s3.region "ap-northeast-1"

For detailed steps, see Export data to Amazon S3 cloud storage.

Step 3. Create schemas in TiDB Cloud cluster

Create schemas in the TiDB Cloud cluster as follows:

mysql> CREATE DATABASE store;
Query OK, 0 rows affected (0.16 sec)
mysql> use store;
Database changed

In this example, the column IDs of the upstream tables sale_01 and sale_02 are auto-increment primary keys. Conflicts might occur when you merge sharded tables in the downstream database. Execute the following SQL statement to set the ID column as a normal index instead of a primary key:

mysql> CREATE TABLE `sales` (
         `id` bigint(20) NOT NULL ,
         `uid` varchar(40) NOT NULL,
         `sale_num` bigint DEFAULT NULL,
         INDEX (`id`),
         UNIQUE KEY `ind_uid` (`uid`)
        );
Query OK, 0 rows affected (0.17 sec)

For more information about the solutions to solve such conflicts, see Remove the PRIMARY KEY attribute from the column.

Step 4. Configure Amazon S3 access

Follow the instructions in Configure Amazon S3 access to get the role ARN to access the source data.

The following example only lists key policy configurations. Replace the Amazon S3 path with your own values.

{
   "Version": "2012-10-17",
   "Statement": [
       {
           "Sid": "VisualEditor0",
           "Effect": "Allow",
           "Action": [
               "s3:GetObject",
               "s3:GetObjectVersion"
           ],
           "Resource": [
               "arn:aws:s3:::dumpling-s3/*"
           ]
       },
       {
           "Sid": "VisualEditor1",
           "Effect": "Allow",
           "Action": [
               "s3:ListBucket",
               "s3:GetBucketLocation"
           ],

           "Resource": "arn:aws:s3:::dumpling-s3"
       }
   ]
}

Step 5. Perform the data import task

After configuring the Amazon S3 access, you can perform the data import task in the TiDB Cloud console as follows:

Open the Import page for your target cluster.
1. Log in to the TiDB Cloud console and navigate to the Clusters page of your project.
  Tip
  You can use the combo box in the upper-left corner to switch between organizations, projects, and clusters.
2. Click the name of your target cluster to go to its overview page, and then click Data > Import in the left navigation pane.
Select Import data from Cloud Storage, and then click Amazon S3.
On the Import Data from Amazon S3 page, fill in the following information:
- Import File Count: for TiDB Cloud Starter or TiDB Cloud Essential, select Multiple files. This field is not available in TiDB Cloud Dedicated.
- Included Schema Files: select No.
- Data Format: select CSV.
- Folder URI: fill in the bucket URI of your source data. You can use the second-level directory corresponding to tables, s3://dumpling-s3/store/sales/ in this example, so that TiDB Cloud can import and merge the data in all MySQL instances into store.sales in one go.
- Bucket Access > AWS Role ARN: enter the Role-ARN you obtained.
If the location of the bucket is different from your cluster, confirm the compliance of cross region.
TiDB Cloud starts validating whether it can access your data in the specified bucket URI. After validation, TiDB Cloud tries to scan all the files in the data source using the default file naming pattern, and returns a scan summary result on the left side of the next page. If you get the AccessDenied error, see Troubleshoot Access Denied Errors during Data Import from S3.
Click Connect.
In the Destination section, select the target database and table.
When importing multiple files, you can use Advanced Settings > Mapping Settings to define a custom mapping rule for each target table and its corresponding CSV file. After that, the data source files will be re-scanned using the provided custom mapping rule.
When you enter the source file URI and name in Source File URIs and Names, make sure it is in the following format s3://[bucket_name]/[data_source_folder]/[file_name].csv. For example, s3://sampledata/ingest/TableName.01.csv.
You can also use wildcards to match the source files. For example:
- s3://[bucket_name]/[data_source_folder]/my-data?.csv: all CSV files starting with my-data followed by one character (such as my-data1.csv and my-data2.csv) in that folder will be imported into the same target table.
- s3://[bucket_name]/[data_source_folder]/my-data*.csv: all CSV files in the folder starting with my-data will be imported into the same target table.
Note that only ? and * are supported.
Note
The URI must contain the data source folder.
Edit the CSV configuration if needed.
You can also click Edit CSV configuration to configure Backslash Escape, Separator, and Delimiter for more fine-grained control.
Note
For the configurations of separator, delimiter, and null, you can use both alphanumeric characters and certain special characters. The supported special characters include \t, \b, \n, \r, \f, and \u0001.
Click Start Import.
When the import progress shows Completed, check the imported tables.

After the data is imported, if you want to remove the Amazon S3 access of TiDB Cloud, simply delete the policy that you added.

Perform incremental data replication from MySQL to TiDB Cloud

To replicate the data changes based on binlog from a specified position in the upstream cluster to TiDB Cloud, you can use TiDB Data Migration (DM) to perform incremental replication.

Before you begin

If you want to migrate incremental data and merge MySQL shards to TiDB Cloud, you need to manually deploy TiDB DM, because TiDB Cloud does not support migrating and merging MySQL shards yet. For detailed steps, see Deploy a DM Cluster Using TiUP.

Step 1. Add the data source

Create a new data source file dm-source1.yaml to configure an upstream data source into DM. Add the following content:

# MySQL Configuration.
source-id: "mysql-replica-01"
# Specifies whether DM-worker pulls binlogs with GTID (Global Transaction Identifier).
# The prerequisite is that you have already enabled GTID in the upstream MySQL.
# If you have configured the upstream database service to switch master between different nodes automatically, you must enable GTID.
enable-gtid: true
from:
 host: "${host}"           # For example: 192.168.10.101
 user: "user01"
 password: "${password}"   # Plaintext passwords are supported but not recommended. It is recommended that you use dmctl encrypt to encrypt plaintext passwords.
 port: ${port}             # For example: 3307

Create another new data source file dm-source2.yaml, and add the following content:

# MySQL Configuration.
source-id: "mysql-replica-02"
# Specifies whether DM-worker pulls binlogs with GTID (Global Transaction Identifier).
# The prerequisite is that you have already enabled GTID in the upstream MySQL.
# If you have configured the upstream database service to switch master between different nodes automatically, you must enable GTID.
enable-gtid: true
from:
 host: "192.168.10.102"
 user: "user02"
 password: "${password}"
 port: 3308

Run the following command in a terminal. Use tiup dmctl to load the first data source configuration into the DM cluster:

[root@localhost ~]# tiup dmctl --master-addr ${advertise-addr} operate-source create dm-source1.yaml

The parameters used in the command above are described as follows:

Parameter	Description
`--master-addr`	The `{advertise-addr}` of any DM-master node in the cluster where `dmctl` is to be connected. For example: 192.168.11.110:9261
`operate-source create`	Loads the data source to the DM cluster.

The following is an example output:

tiup is checking updates for component dmctl ...

Starting component `dmctl`: /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl --master-addr 192.168.11.110:9261 operate-source create dm-source1.yaml

{
   "result": true,
   "msg": "",
   "sources": [
       {
           "result": true,
           "msg": "",
           "source": "mysql-replica-01",
           "worker": "dm-192.168.11.111-9262"
       }
   ]
}

Run the following command in a terminal. Use tiup dmctl to load the second data source configuration into the DM cluster:

[root@localhost ~]# tiup dmctl --master-addr 192.168.11.110:9261 operate-source create dm-source2.yaml

The following is an example output:

tiup is checking updates for component dmctl ...

Starting component `dmctl`: /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl --master-addr 192.168.11.110:9261 operate-source create dm-source2.yaml

{
   "result": true,
   "msg": "",
   "sources": [
       {
           "result": true,
           "msg": "",
           "source": "mysql-replica-02",
           "worker": "dm-192.168.11.112-9262"
       }
   ]
}

Step 2. Create a replication task

Create a test-task1.yaml file for the replication task.

Find the starting point in the metadata file of MySQL instance1 exported by Dumpling. For example:

Started dump at: 2022-05-25 10:16:26
SHOW MASTER STATUS:
       Log: mysql-bin.000002
       Pos: 246546174
       GTID:b631bcad-bb10-11ec-9eee-fec83cf2b903:1-194801
Finished dump at: 2022-05-25 10:16:27

Find the starting point in the metadata file of MySQL instance2 exported by Dumpling. For example:

Started dump at: 2022-05-25 10:20:32
SHOW MASTER STATUS:
       Log: mysql-bin.000001
       Pos: 1312659
       GTID:cd21245e-bb10-11ec-ae16-fec83cf2b903:1-4036
Finished dump at: 2022-05-25 10:20:32

Edit the task configuration file test-task1, to configure the incremental replication mode and replication starting point for each data source.

## ********* Task Configuration *********
name: test-task1
shard-mode: "pessimistic"
# Task mode. The "incremental" mode only performs incremental data migration.
task-mode: incremental
# timezone: "UTC"

## ******** Data Source Configuration **********
## (Optional) If you need to incrementally replicate data that has already been migrated in the full data migration, you need to enable the safe mode to avoid the incremental data migration error.
##  This scenario is common in the following case: the full migration data does not belong to the data source's consistency snapshot, and after that, DM starts to replicate incremental data from a position earlier than the full migration.
syncers:           # The running configurations of the sync processing unit.
 global:           # Configuration name.
   safe-mode: false # # If this field is set to true, DM changes INSERT of the data source to REPLACE for the target database,
                    # # and changes UPDATE of the data source to DELETE and REPLACE for the target database.
                    # # This is to ensure that when the table schema contains a primary key or unique index, DML statements can be imported repeatedly.
                    # # In the first minute of starting or resuming an incremental migration task, DM automatically enables the safe mode.
mysql-instances:
- source-id: "mysql-replica-01"
   block-allow-list:  "bw-rule-1"
   route-rules: ["store-route-rule", "sale-route-rule"]
   filter-rules: ["store-filter-rule", "sale-filter-rule"]
   syncer-config-name: "global"
   meta:
     binlog-name: "mysql-bin.000002"
     binlog-pos: 246546174
     binlog-gtid: "b631bcad-bb10-11ec-9eee-fec83cf2b903:1-194801"
- source-id: "mysql-replica-02"
   block-allow-list:  "bw-rule-1"
   route-rules: ["store-route-rule", "sale-route-rule"]
   filter-rules: ["store-filter-rule", "sale-filter-rule"]
   syncer-config-name: "global"
   meta:
     binlog-name: "mysql-bin.000001"
     binlog-pos: 1312659
     binlog-gtid: "cd21245e-bb10-11ec-ae16-fec83cf2b903:1-4036"

## ******** Configuration of the target TiDB cluster on TiDB Cloud **********
target-database:       # The target TiDB cluster on TiDB Cloud
 host: "tidb.xxxxxxx.xxxxxxxxx.ap-northeast-1.prod.aws.tidbcloud.com"
 port: 4000
 user: "root"
 password: "${password}"  # If the password is not empty, it is recommended to use a dmctl-encrypted cipher.

## ******** Function Configuration **********
routes:
 store-route-rule:
   schema-pattern: "store_*"
   target-schema: "store"
 sale-route-rule:
   schema-pattern: "store_*"
   table-pattern: "sale_*"
   target-schema: "store"
   target-table:  "sales"
filters:
 sale-filter-rule:
   schema-pattern: "store_*"
   table-pattern: "sale_*"
   events: ["truncate table", "drop table", "delete"]
   action: Ignore
 store-filter-rule:
   schema-pattern: "store_*"
   events: ["drop database"]
   action: Ignore
block-allow-list:
 bw-rule-1:
   do-dbs: ["store_*"]

## ******** Ignore check items **********
ignore-checking-items: ["table_schema","auto_increment_ID"]

For detailed task configurations, see DM Task Configurations.

To run a data replication task smoothly, DM triggers a precheck automatically at the start of the task and returns the check results. DM starts the replication only after the precheck is passed. To trigger a precheck manually, run the check-task command:

[root@localhost ~]# tiup dmctl --master-addr 192.168.11.110:9261 check-task dm-task.yaml

The following is an example output:

tiup is checking updates for component dmctl ...

Starting component `dmctl`: /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl --master-addr 192.168.11.110:9261 check-task dm-task.yaml

{
   "result": true,
   "msg": "check pass!!!"
}

Step 3. Start the replication task

Use tiup dmctl to run the following command to start the data replication task:

[root@localhost ~]# tiup dmctl --master-addr ${advertise-addr}  start-task dm-task.yaml

The parameters used in the command above are described as follows:

Parameter	Description
`--master-addr`	The `{advertise-addr}` of any DM-master node in the cluster where `dmctl` is to be connected. For example: 192.168.11.110:9261
`start-task`	Starts the migration task.

The following is an example output:

tiup is checking updates for component dmctl ...

Starting component `dmctl`: /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl --master-addr 192.168.11.110:9261 start-task dm-task.yaml

{
   "result": true,
   "msg": "",
   "sources": [
       {
           "result": true,
           "msg": "",
           "source": "mysql-replica-01",
           "worker": "dm-192.168.11.111-9262"
       },

       {
           "result": true,
           "msg": "",
           "source": "mysql-replica-02",
           "worker": "dm-192.168.11.112-9262"
       }
   ],
   "checkResult": ""
}

If the task fails to start, check the prompt message and fix the configuration. After that, you can re-run the command above to start the task.

If you encounter any problem, refer to DM error handling and DM FAQ.

Step 4. Check the replication task status

To learn whether the DM cluster has an ongoing replication task and view the task status, run the query-status command using tiup dmctl:

[root@localhost ~]# tiup dmctl --master-addr 192.168.11.110:9261 query-status test-task1

The following is an example output:

{
   "result": true,
   "msg": "",
   "sources": [
       {
           "result": true,
           "msg": "",
           "sourceStatus": {
               "source": "mysql-replica-01",
               "worker": "dm-192.168.11.111-9262",
               "result": null,
               "relayStatus": null
           },

           "subTaskStatus": [
               {
                   "name": "test-task1",
                   "stage": "Running",
                   "unit": "Sync",
                   "result": null,
                   "unresolvedDDLLockID": "",
                   "sync": {
                       "totalEvents": "4048",
                       "totalTps": "3",
                       "recentTps": "3",
                       "masterBinlog": "(mysql-bin.000002, 246550002)",
                       "masterBinlogGtid": "b631bcad-bb10-11ec-9eee-fec83cf2b903:1-194813",
                       "syncerBinlog": "(mysql-bin.000002, 246550002)",
                       "syncerBinlogGtid": "b631bcad-bb10-11ec-9eee-fec83cf2b903:1-194813",
                       "blockingDDLs": [
                       ],
                       "unresolvedGroups": [
                       ],
                       "synced": true,
                       "binlogType": "remote",
                       "secondsBehindMaster": "0",
                       "blockDDLOwner": "",
                       "conflictMsg": ""
                   }
               }
           ]
       },
       {
           "result": true,
           "msg": "",
           "sourceStatus": {
               "source": "mysql-replica-02",
               "worker": "dm-192.168.11.112-9262",
               "result": null,
               "relayStatus": null
           },
           "subTaskStatus": [
               {
                   "name": "test-task1",
                   "stage": "Running",
                   "unit": "Sync",
                   "result": null,
                   "unresolvedDDLLockID": "",
                   "sync": {
                       "totalEvents": "33",
                       "totalTps": "0",
                       "recentTps": "0",
                       "masterBinlog": "(mysql-bin.000001, 1316487)",
                       "masterBinlogGtid": "cd21245e-bb10-11ec-ae16-fec83cf2b903:1-4048",
                       "syncerBinlog": "(mysql-bin.000001, 1316487)",
                       "syncerBinlogGtid": "cd21245e-bb10-11ec-ae16-fec83cf2b903:1-4048",
                       "blockingDDLs": [
                       ],
                       "unresolvedGroups": [
                       ],
                       "synced": true,
                       "binlogType": "remote",
                       "secondsBehindMaster": "0",
                       "blockDDLOwner": "",
                       "conflictMsg": ""
                   }
               }
           ]
       }
   ]
}

For a detailed interpretation of the results, see Query Status.

Migrate and Merge MySQL Shards of Large Datasets to TiDB Cloud

Environment information in the example

Upstream cluster

DM

External storage

Downstream cluster

Perform full data migration from MySQL to TiDB Cloud

Step 1. Create directories in the Amazon S3 bucket

Step 2. Use Dumpling to export data to Amazon S3

Step 3. Create schemas in TiDB Cloud cluster

Step 4. Configure Amazon S3 access

Step 5. Perform the data import task

Perform incremental data replication from MySQL to TiDB Cloud

Before you begin

Step 1. Add the data source

Step 2. Create a replication task

Step 3. Start the replication task

Step 4. Check the replication task status

Was this page helpful?