# TiCDC Behavior in Splitting UPDATE Events

## Split `UPDATE` events for MySQL sinks

Starting from v6.5.10, when using the MySQL sink, any TiCDC node that receives a request for replicating a table fetches the current timestamp `thresholdTS` from PD before starting the replication to the downstream. Based on the value of this timestamp, TiCDC decides whether to split `UPDATE` events:
- For transactions containing one or multiple `UPDATE` changes, if the transaction `commitTS` is less than `thresholdTS`, TiCDC splits the `UPDATE` event into a `DELETE` event and an `INSERT` event before writing them to the Sorter module.
- For `UPDATE` events with the transaction `commitTS` greater than or equal to `thresholdTS`, TiCDC does not split them.

For more information, see GitHub issue #10918.
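The splitting decision described above can be sketched as follows. This is an illustrative model only, not actual TiCDC code; the event dictionaries and field names are invented for the example:

```python
# Illustrative sketch (not TiCDC source code) of the split decision for
# MySQL sinks: an UPDATE is split into DELETE + INSERT only when its
# transaction commitTS is below the thresholdTS fetched from PD.

def handle_update(event, threshold_ts):
    """Return the list of events written to the Sorter module."""
    if event["commit_ts"] < threshold_ts:
        # Split: the DELETE removes the old row, the INSERT writes the new one.
        return [
            {"type": "DELETE", "row": event["old"]},
            {"type": "INSERT", "row": event["new"]},
        ]
    # commitTS >= thresholdTS: pass the UPDATE through unchanged.
    return [event]

threshold_ts = 100
split = handle_update({"type": "UPDATE", "commit_ts": 99,
                       "old": (1, 1), "new": (2, 1)}, threshold_ts)
kept = handle_update({"type": "UPDATE", "commit_ts": 100,
                      "old": (1, 1), "new": (2, 1)}, threshold_ts)
print([e["type"] for e in split])  # ['DELETE', 'INSERT']
print([e["type"] for e in kept])   # ['UPDATE']
```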
This behavior change addresses the issue of downstream data inconsistencies caused by the potentially incorrect order of `UPDATE` events received by TiCDC, which can lead to an incorrect order of the split `DELETE` and `INSERT` events.
Take the following SQL statements as an example:
```sql
CREATE TABLE t (a INT PRIMARY KEY, b INT);
INSERT INTO t VALUES (1, 1);
INSERT INTO t VALUES (2, 2);

BEGIN;
UPDATE t SET a = 3 WHERE a = 2;
UPDATE t SET a = 2 WHERE a = 1;
COMMIT;
```
In this example, the two `UPDATE` statements within the transaction have a sequential dependency on execution. The primary key `a` is changed from `2` to `3`, and then the primary key `a` is changed from `1` to `2`. After this transaction is executed, the records in the upstream database are `(2, 1)` and `(3, 2)`.
However, the order of `UPDATE` events received by TiCDC might differ from the actual execution order of the upstream transaction. For example:

```sql
UPDATE t SET a = 2 WHERE a = 1;
UPDATE t SET a = 3 WHERE a = 2;
```
- Before this behavior change, TiCDC writes these `UPDATE` events to the Sorter module and then splits them into `DELETE` and `INSERT` events. After the split, the actual execution order of these events in the downstream is as follows:

    ```sql
    BEGIN;
    DELETE FROM t WHERE a = 1;
    REPLACE INTO t VALUES (2, 1);
    DELETE FROM t WHERE a = 2;
    REPLACE INTO t VALUES (3, 2);
    COMMIT;
    ```

    After the downstream executes the transaction, the only record in the database is `(3, 2)`, which is different from the records in the upstream database (`(2, 1)` and `(3, 2)`), indicating a data inconsistency issue.

- After this behavior change, if the transaction `commitTS` is less than the `thresholdTS` fetched from PD when TiCDC starts replicating the corresponding table to the downstream, TiCDC splits these `UPDATE` events into `DELETE` and `INSERT` events before writing them to the Sorter module. After the sorting by the Sorter module, the actual execution order of these events in the downstream is as follows:

    ```sql
    BEGIN;
    DELETE FROM t WHERE a = 1;
    DELETE FROM t WHERE a = 2;
    REPLACE INTO t VALUES (2, 1);
    REPLACE INTO t VALUES (3, 2);
    COMMIT;
    ```

    After the downstream executes the transaction, the records in the downstream database are the same as those in the upstream database, which are `(2, 1)` and `(3, 2)`, ensuring data consistency.
As you can see from the preceding example, splitting the `UPDATE` event into `DELETE` and `INSERT` events before writing them to the Sorter module ensures that all `DELETE` events are executed before `INSERT` events after the split, thereby maintaining data consistency regardless of the order of `UPDATE` events received by TiCDC.
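The effect can be simulated in a few lines. This is a toy model, not TiCDC code: the table is modeled as a dict keyed by the primary key `a`, and `REPLACE` is modeled as an upsert:

```python
# Toy simulation (not TiCDC code) of why "split, then sort DELETEs before
# INSERTs" restores consistency for the example above.

def apply(events, table):
    """Apply (op, key, row) events to a dict modeling the downstream table."""
    for op, key, row in events:
        if op == "DELETE":
            table.pop(key, None)
        else:  # REPLACE / INSERT: upsert the new row
            table[key] = row
    return table

# The UPDATE events arrive in the wrong order and are split in arrival order:
unsorted = [
    ("DELETE", 1, None), ("REPLACE", 2, (2, 1)),  # from UPDATE a: 1 -> 2
    ("DELETE", 2, None), ("REPLACE", 3, (3, 2)),  # from UPDATE a: 2 -> 3
]
# The same events after the Sorter module orders DELETEs before INSERTs:
sorted_events = [e for e in unsorted if e[0] == "DELETE"] + \
                [e for e in unsorted if e[0] != "DELETE"]

print(apply(unsorted, {1: (1, 1), 2: (2, 2)}))       # {3: (3, 2)} -- row lost
print(apply(sorted_events, {1: (1, 1), 2: (2, 2)}))  # {2: (2, 1), 3: (3, 2)}
```

The unsorted sequence interleaves a `DELETE` after a `REPLACE` on the same key, silently dropping a row; sorting all deletes first makes the final state match the upstream table.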
## Split primary or unique key `UPDATE` events for non-MySQL sinks

### Transactions containing a single `UPDATE` change

Starting from v6.5.3, when using a non-MySQL sink, for transactions that only contain a single `UPDATE` change, if the primary key or non-null unique index value is modified in an `UPDATE` event, TiCDC splits this event into `DELETE` and `INSERT` events. For more information, see GitHub issue #9086.
This change primarily addresses the issue that, by default, TiCDC only outputs the new value without the old value when using the CSV and Avro protocols. Due to this issue, when the primary key or non-null unique index value changes, the consumer can only receive the new value, making it impossible to process the value before the change (for example, delete the old value). Take the following SQL as an example:
```sql
CREATE TABLE t (a INT PRIMARY KEY, b INT);
INSERT INTO t VALUES (1, 1);
UPDATE t SET a = 2 WHERE a = 1;
```
In this example, the primary key `a` is updated from `1` to `2`. If the `UPDATE` event is not split, the consumer can only obtain the new value `a = 2` and cannot obtain the old value `a = 1` when using the CSV and Avro protocols. This might cause the downstream consumer to only insert the new value `2` without deleting the old value `1`.
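The consequence for an upsert-style consumer can be sketched as follows. This is a hypothetical consumer, not part of TiCDC; the event tuples and names are invented for illustration:

```python
# Hypothetical consumer sketch (not a real TiCDC consumer): with only the new
# value available, an upsert leaves the old primary-key row behind; with the
# event split into DELETE + INSERT, the old row is removed first.

def consume(events, table):
    """Apply (op, key, row) events to a dict modeling the downstream table."""
    for op, key, row in events:
        if op == "DELETE":
            table.pop(key, None)
        else:  # INSERT: upsert the new value
            table[key] = row
    return table

table = {1: (1, 1)}  # state after INSERT INTO t VALUES (1, 1)

# Unsplit UPDATE delivered with only the new value (old value unavailable):
print(consume([("INSERT", 2, (2, 1))], dict(table)))
# {1: (1, 1), 2: (2, 1)} -- the stale row with a = 1 is never deleted

# Split delivery: DELETE of the old key, then INSERT of the new row:
print(consume([("DELETE", 1, None), ("INSERT", 2, (2, 1))], dict(table)))
# {2: (2, 1)} -- matches the upstream table
```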
### Transactions containing multiple `UPDATE` changes

Starting from v6.5.4, for transactions containing multiple changes, if the primary key or non-null unique index value is modified in an `UPDATE` event, TiCDC splits the event into `DELETE` and `INSERT` events and ensures that all events follow the sequence of `DELETE` events preceding `INSERT` events. For more information, see GitHub issue #9430.
This change primarily addresses the potential issue of primary key or unique key conflicts that consumers might encounter when writing data changes from the Kafka sink or other sinks to a relational database, or when performing a similar operation. This issue is caused by the potentially incorrect order of `UPDATE` events received by TiCDC.
Take the following SQL as an example:
```sql
CREATE TABLE t (a INT PRIMARY KEY, b INT);
INSERT INTO t VALUES (1, 1);
INSERT INTO t VALUES (2, 2);

BEGIN;
UPDATE t SET a = 3 WHERE a = 1;
UPDATE t SET a = 1 WHERE a = 2;
UPDATE t SET a = 2 WHERE a = 3;
COMMIT;
```
In this example, the three SQL statements swap the primary keys of the two rows, but TiCDC only receives two `UPDATE` change events: changing the primary key `a` from `1` to `2`, and changing the primary key `a` from `2` to `1`. If consumers directly write these two `UPDATE` events to the downstream, a primary key conflict will occur, leading to changefeed errors.
Therefore, TiCDC splits these two events into four events, that is, deleting records `(1, 1)` and `(2, 2)` and writing records `(2, 1)` and `(1, 2)`.
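The conflict, and why the split four-event form avoids it, can be demonstrated directly. In this sketch, Python's built-in `sqlite3` stands in for the downstream relational database; it is illustrative only and not tied to any particular sink:

```python
# Illustrative demo: sqlite3 stands in for the downstream database. The two
# raw UPDATE events hit a primary key conflict, while the split form
# (both DELETEs first, then both INSERTs) succeeds.
import sqlite3

def fresh_db():
    """Create an in-memory table matching the example's upstream state."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INT PRIMARY KEY, b INT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 1), (2, 2)])
    return conn

# Writing the first raw UPDATE event directly causes a primary key conflict:
conn = fresh_db()
try:
    conn.execute("UPDATE t SET a = 2 WHERE a = 1")  # a = 2 already exists
except sqlite3.IntegrityError as e:
    print("conflict:", e)

# The split form: delete both old records, then write both new ones:
conn = fresh_db()
conn.execute("DELETE FROM t WHERE a = 1")
conn.execute("DELETE FROM t WHERE a = 2")
conn.execute("INSERT INTO t VALUES (2, 1)")
conn.execute("INSERT INTO t VALUES (1, 2)")
print(sorted(conn.execute("SELECT a, b FROM t")))  # [(1, 2), (2, 1)]
```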
### Control whether to split primary or unique key `UPDATE` events

Starting from v6.5.10, when using a non-MySQL sink, TiCDC supports controlling whether to split primary or unique key `UPDATE` events via the `output-raw-change-event` parameter, as described in GitHub issue #11211. The specific behavior of this parameter is as follows:
- When you set `output-raw-change-event = false`, if the primary key or non-null unique index value is modified in an `UPDATE` event, TiCDC splits the event into `DELETE` and `INSERT` events and ensures that all events follow the sequence of `DELETE` events preceding `INSERT` events.
- When you set `output-raw-change-event = true`, TiCDC does not split `UPDATE` events, and the consumer side is responsible for dealing with the problems described in [Split primary or unique key `UPDATE` events for non-MySQL sinks](#split-primary-or-unique-key-update-events-for-non-mysql-sinks). Otherwise, there might be a risk of data inconsistency. Note that when the primary key of a table is a clustered index, updates to the primary key are still split into `DELETE` and `INSERT` events in TiDB, and this behavior is not affected by the `output-raw-change-event` parameter.
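As an illustration, the parameter can be supplied when creating a changefeed. The exact syntax below, which passes it as a sink-URI query parameter on a Kafka sink, is an assumption; the address, topic, and changefeed ID are placeholders, and where `output-raw-change-event` is configured can depend on the sink type, so verify against the TiCDC documentation for your version:

```shell
# Illustrative only: output-raw-change-event passed as a sink-URI query
# parameter when creating a Kafka changefeed. Addresses, topic name, and
# changefeed ID are placeholders.
cdc cli changefeed create \
    --server=http://127.0.0.1:8300 \
    --sink-uri="kafka://127.0.0.1:9092/topic-name?protocol=canal-json&output-raw-change-event=true" \
    --changefeed-id="kafka-raw-update-events"
```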
## Release 6.5 compatibility

| Version | Protocol | Splits UK/PK `UPDATE` events | Does not split UK/PK `UPDATE` events | Comments |
|---|---|---|---|---|
| <= v6.5.2 | ALL | ✗ | ✓ | |
| v6.5.3 / v6.5.4 | Canal/Open | ✗ | ✓ | |
| v6.5.3 | CSV/Avro | ✗ | ✗ | Splits but does not sort. See #9086 |
| v6.5.4 | Canal/Open | ✗ | ✗ | Only splits and sorts transactions that contain multiple changes |
| v6.5.5 ~ v6.5.9 | ALL | ✓ | ✗ | |
| >= v6.5.10 | ALL | ✓ (Default: `output-raw-change-event = false`) | ✓ (Optional: `output-raw-change-event = true`) | |