diff --git a/.github/vale-styles/Yugabyte/spelling-exceptions.txt b/.github/vale-styles/Yugabyte/spelling-exceptions.txt index 5924953d9191..e278e0a4185c 100644 --- a/.github/vale-styles/Yugabyte/spelling-exceptions.txt +++ b/.github/vale-styles/Yugabyte/spelling-exceptions.txt @@ -445,6 +445,7 @@ Patroni performant PgBouncer pgLoader +pg_recvlogical Phabricator phaser phasers diff --git a/docs/content/preview/architecture/docdb-replication/cdc-logical-replication.md b/docs/content/preview/architecture/docdb-replication/cdc-logical-replication.md index a406bb93cf3a..146114da766e 100644 --- a/docs/content/preview/architecture/docdb-replication/cdc-logical-replication.md +++ b/docs/content/preview/architecture/docdb-replication/cdc-logical-replication.md @@ -17,7 +17,7 @@ type: docs Change data capture (CDC) in YugabyteDB provides technology to ensure that any changes in data due to operations such as inserts, updates, and deletions are identified, captured, and made available for consumption by applications and other tools. -CDC in YugabyteDB is based on the PostgreSQL Logical Replication model. The fundamental concept here is that of the Replication Slot. A Replication Slot represents a stream of changes that can be replayed to the client in the order they were made on the origin server in a manner that preserves transactional consistency. This is the basis for the support for Transactional CDC in YugabyteDB. Where the strict requirements of Transactional CDC are not present, multiple replication slots can be used to stream changes from unrelated tables in parallel. +CDC in YugabyteDB is based on the PostgreSQL Logical Replication model. The fundamental concept is that of the Replication Slot. A Replication Slot represents a stream of changes that can be replayed to the client in the order they were made on the origin server in a manner that preserves transactional consistency. This is the basis for the support for Transactional CDC in YugabyteDB. 
Where the strict requirements of Transactional CDC are not present, multiple replication slots can be used to stream changes from unrelated tables in parallel. ## Architecture @@ -35,7 +35,7 @@ The following are the main components of the Yugabyte CDC solution: Logical replication starts by copying a snapshot of the data on the publisher database. After that is done, changes on the publisher are streamed to the server as they occur in near real time. -To setup Logical Replication, an application will first have to create a replication slot. When a replication slot is created, a boundary is established between the snapshot data and the streaming changes. This boundary or `consistent_point` is a consistent state of the source database. It corresponds to a commit time (HybridTime value). Data from transactions with commit time <= commit time corresponding to the `consistent_point` are consumed as part of the initial snapshot. Changes from transactions with commit time greater than the commit time of the `consistent_point` are consumed in the streaming phase in transaction commit time order. +To set up Logical Replication, an application will first have to create a replication slot. When a replication slot is created, a boundary is established between the snapshot data and the streaming changes. This boundary or `consistent_point` is a consistent state of the source database. It corresponds to a commit time (HybridTime value). Data from transactions with commit time <= commit time corresponding to the `consistent_point` are consumed as part of the initial snapshot. Changes from transactions with commit time greater than the commit time of the `consistent_point` are consumed in the streaming phase in transaction commit time order. 
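The role of the `consistent_point` boundary described above can be sketched as follows. This is an illustrative Python sketch, not YugabyteDB code; the function and data names are invented for the example. Transactions with commit time at or below the `consistent_point` belong to the initial snapshot, and later transactions are delivered during the streaming phase in commit-time order.

```python
# Illustrative sketch: how a consistent_point partitions transactions
# between the snapshot phase and the streaming phase.

def classify_transactions(transactions, consistent_point):
    """Split transactions by commit time relative to the consistent_point.

    `transactions` is a list of (txn_id, commit_time) pairs; the commit
    times stand in for HybridTime values. Transactions with commit time
    <= consistent_point are covered by the initial snapshot; later ones
    are streamed in commit-time order.
    """
    snapshot = [t for t in transactions if t[1] <= consistent_point]
    streaming = sorted(
        (t for t in transactions if t[1] > consistent_point),
        key=lambda t: t[1],  # streaming delivers in commit-time order
    )
    return snapshot, streaming

txns = [("t1", 100), ("t3", 300), ("t2", 200), ("t4", 400)]
snap, stream = classify_transactions(txns, consistent_point=200)
print(snap)    # transactions visible in the initial snapshot
print(stream)  # transactions delivered during streaming
```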
#### Initial Snapshot diff --git a/docs/content/preview/architecture/docdb-replication/change-data-capture.md b/docs/content/preview/architecture/docdb-replication/change-data-capture.md index b783a90ee1e7..a06c2b9b1757 100644 --- a/docs/content/preview/architecture/docdb-replication/change-data-capture.md +++ b/docs/content/preview/architecture/docdb-replication/change-data-capture.md @@ -33,7 +33,7 @@ Each tablet has its own WAL file. WAL is NOT in-memory, but it is disk persisted YugabyteDB normally purges WAL segments after some period of time. This means that the connector does not have the complete history of all changes that have been made to the database. Therefore, when the connector first connects to a particular YugabyteDB database, it starts by performing a consistent snapshot of each of the database schemas. -The Debezium YugabyteDB connector captures row-level changes in the schemas of a YugabyteDB database. The first time it connects to a YugabyteDB cluster, the connector takes a consistent snapshot of all schemas. After that snapshot is complete, the connector continuously captures row-level changes that insert, update, and delete database content, and that were committed to a YugabyteDB database. +The YugabyteDB Debezium connector captures row-level changes in the schemas of a YugabyteDB database. The first time it connects to a YugabyteDB cluster, the connector takes a consistent snapshot of all schemas. After that snapshot is complete, the connector continuously captures row-level changes that insert, update, and delete database content, and that were committed to a YugabyteDB database. 
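The snapshot-then-stream behavior described above follows from WAL retention: if the position a connector wants to resume from has already been purged from the WAL, streaming alone cannot recover history. The decision can be sketched as follows (illustrative Python, not connector code; names are invented for the example):

```python
# Illustrative sketch: deciding between resuming the stream and taking a
# fresh consistent snapshot, based on what the retained WAL still covers.

def choose_start_mode(resume_position, oldest_retained_wal_position):
    """Return 'stream' if the retained WAL still covers the resume point,
    otherwise 'snapshot' (history before the retained WAL is gone)."""
    if resume_position is None:
        return "snapshot"   # first connection: no prior offset exists
    if resume_position >= oldest_retained_wal_position:
        return "stream"     # WAL still contains everything needed
    return "snapshot"       # gap in history: re-snapshot, then stream

print(choose_start_mode(None, 100))  # first connection
print(choose_start_mode(250, 100))   # resume point within retained WAL
print(choose_start_mode(50, 100))    # resume point already purged
```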
![How does CDC work](/images/explore/cdc-overview-work.png) diff --git a/docs/content/preview/explore/change-data-capture/_index.md b/docs/content/preview/explore/change-data-capture/_index.md index cdc447229f34..80ff0b93b84d 100644 --- a/docs/content/preview/explore/change-data-capture/_index.md +++ b/docs/content/preview/explore/change-data-capture/_index.md @@ -4,7 +4,7 @@ headerTitle: Change data capture (CDC) linkTitle: Change data capture description: CDC or Change data capture is a process to capture changes made to data in the database. headcontent: Capture changes made to data in the database -image: /images/section_icons/index/develop.png +image: fa-light fa-rotate cascade: earlyAccess: /preview/releases/versioning/#feature-maturity menu: @@ -26,9 +26,9 @@ In databases, change data capture (CDC) is a set of software design patterns use YugabyteDB supports the following methods for reading change events. -## PostgreSQL Logical Replication Protocol (Recommended) +## PostgreSQL Replication Protocol -This method uses the PostgreSQL replication protocol, ensuring compatibility with PostgreSQL CDC systems. Logical replication operates through a publish-subscribe model. It replicates data objects and their changes based on the replication identity. +This method uses the [PostgreSQL replication protocol](using-logical-replication/key-concepts/#replication-protocols), ensuring compatibility with PostgreSQL CDC systems. Logical replication operates through a publish-subscribe model. It replicates data objects and their changes based on the replication identity. It works as follows: @@ -36,13 +36,9 @@ It works as follows: 1. Deploy the YugabyteDB Connector in your preferred Kafka Connect environment. 1. The connector uses replication slots to capture change events and publishes them directly to a Kafka topic. -This is the recommended approach for most CDC applications due to its compatibility with PostgreSQL. 
- - ## YugabyteDB gRPC Replication Protocol @@ -55,5 +51,5 @@ It works as follows: 1. The connector captures change events using YugabyteDB's native gRPC replication and directly publishes them to a Kafka topic. {{}} -To learn about gRPC Replication, see [Using YugabyteDB gRPC Replication](./using-yugabytedb-grpc-replication/). +To learn about CDC in YugabyteDB using the gRPC Replication Protocol, see [CDC using gRPC Replication Protocol](./using-yugabytedb-grpc-replication/). {{}} diff --git a/docs/content/preview/explore/change-data-capture/using-logical-replication/_index.md b/docs/content/preview/explore/change-data-capture/using-logical-replication/_index.md index 5ff2d5b2dbf8..1271d80a68be 100644 --- a/docs/content/preview/explore/change-data-capture/using-logical-replication/_index.md +++ b/docs/content/preview/explore/change-data-capture/using-logical-replication/_index.md @@ -1,10 +1,9 @@ - \ No newline at end of file +showRightNav: true +--- + +## Overview + +YugabyteDB CDC captures changes made to data in the database and streams those changes to external processes, applications, or other databases. CDC allows you to track and propagate changes in a YugabyteDB database to downstream consumers based on its Write-Ahead Log (WAL). YugabyteDB CDC captures row-level changes resulting from INSERT, UPDATE, and DELETE operations in the configured database and publishes it further to be consumed by downstream applications. + +### Highlights + +#### Resilience + +YugabyteDB CDC with PostgreSQL Logical Replication provides resilience as follows: + +1. Following a failure of the application, server, or network, the replication can continue from any of the available server nodes. + +2. Replication continues from the transaction immediately after the transaction that was last acknowledged by the application. No transactions are missed by the application. 
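The resume-after-failure behavior in the points above can be sketched as follows (illustrative Python, not YugabyteDB code; the class is invented for the example). Replay restarts at the transaction after the last one the application acknowledged, so a crash between receiving and acknowledging yields redelivery (at-least-once) rather than loss.

```python
# Illustrative sketch: resuming a change stream from the transaction
# after the last acknowledged one, giving at-least-once delivery.

class ChangeStream:
    def __init__(self, committed_txns):
        self.committed_txns = committed_txns  # ordered by commit time
        self.last_acked = 0                   # last txn acked by the client

    def read_from_last_ack(self):
        # Replay starts after the last acknowledged transaction, so no
        # transaction is skipped even if the consumer crashed mid-batch.
        return [t for t in self.committed_txns if t > self.last_acked]

    def ack(self, txn):
        self.last_acked = txn

stream = ChangeStream([1, 2, 3, 4, 5])
print(stream.read_from_last_ack())  # [1, 2, 3, 4, 5]
stream.ack(3)                       # processed up to txn 3, then crashed
print(stream.read_from_last_ack())  # [4, 5], redelivered after restart
```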
+ +#### Security + +Because YugabyteDB is using the PostgreSQL Logical Replication model, the following applies: + +- The CDC user persona will be a PostgreSQL replication client. + +- A standard replication connection is used for consumption, and all the server-side configurations for authentication, authorizations, SSL modes, and connection load balancing can be leveraged automatically. + +#### Guarantees + +CDC in YugabyteDB provides the following guarantees. + +| GUARANTEE | DESCRIPTION | +| :----- | :----- | +| Per-slot ordered delivery guarantee | Changes from transactions from all the tables that are part of the replication slot's publication are received in the order they were committed. This also implies ordered delivery across all the tablets that are part of the publication's table list. | +| At least once delivery | Changes from transactions are streamed at least once. Changes from transactions may be streamed again in case of restart after failure. For example, this can happen in the case of a Kafka Connect node failure. If the Kafka Connect node pushes the records to Kafka and crashes before committing the offset, it will again get the same set of records upon restart. | +| No gaps in change stream | Receiving changes that are part of a transaction with commit time *t* implies that you have already received changes from all transactions with commit time lower than *t*. Thus, receiving any change for a row with commit timestamp *t* implies that you have received all older changes for that row. | + +## Key concepts + +The YugabyteDB logical replication feature makes use of PostgreSQL concepts like replication slot, publication, replica identity, and so on. Understanding these key concepts is crucial for setting up and managing a logical replication environment effectively. + +{{}} +To know more about the key concepts of YugabyteDB CDC with logical replication, see [Key concepts](./key-concepts). 
+{{}} + +## Getting started + +Get started with YugabyteDB logical replication using the YugabyteDB Connector. + +{{}} + +To learn how to get started with the connector, see [Get started](./get-started). + +{{}} + +## Monitoring + +You can monitor the activities and status of the deployed connectors using the HTTP endpoints provided by YugabyteDB. + +{{}} +To learn how to monitor your CDC setup, see [Monitor](./monitor/). +{{}} + +## YugabyteDB Connector + +To capture and stream your changes in YugabyteDB to an external system, you need a connector that can read the changes in YugabyteDB and stream them out. For this, you can use the YugabyteDB Connector, which is based on the Debezium platform. The connector is deployed as a set of Kafka Connect-compatible connectors, so you first need to define a YugabyteDB connector configuration and then start the connector by adding it to Kafka Connect. + +{{}} +To learn about the features and configuration of the connector, see [YugabyteDB Connector](./yugabytedb-connector/). +{{}} + +## Limitations + +- LSN comparisons across slots. + + In YugabyteDB, the LSN does not represent the byte offset of a WAL record. Hence, arithmetic on LSNs, and any other usage of the LSN that relies on this assumption, will not work. Also, currently, comparison of LSN values from messages coming from different replication slots is not supported. + +- The following functions are currently unsupported: + + - `pg_current_wal_lsn` + - `pg_wal_lsn_diff` + - `IDENTIFY SYSTEM` + - `txid_current` + - `pg_stat_replication` + + Additionally, the functions responsible for pulling changes (as opposed to the server streaming them) are unsupported as well. They are described in [Replication Functions](https://www.postgresql.org/docs/11/functions-admin.html#FUNCTIONS-REPLICATION) in the PostgreSQL documentation. 
+ +- Restriction on DDLs + + DDL operations should not be performed from the time of replication slot creation till the start of snapshot consumption of the last table. + +- There should be a primary key on the table you want to stream the changes from. + +- CDC is not supported on a target table for xCluster replication [11829](https://github.com/yugabyte/yugabyte-db/issues/11829). + +- Currently we don't support schema evolution for changes that require table rewrites (ex: ALTER TYPE). + +- YCQL tables aren't currently supported. Issue [11320](https://github.com/yugabyte/yugabyte-db/issues/11320). + +- Support for point-in-time recovery (PITR) is tracked in issue [10938](https://github.com/yugabyte/yugabyte-db/issues/10938). + +- Support for transaction savepoints is tracked in issue [10936](https://github.com/yugabyte/yugabyte-db/issues/10936). + +- Support for enabling CDC on Read Replicas is tracked in issue [11116](https://github.com/yugabyte/yugabyte-db/issues/11116). \ No newline at end of file diff --git a/docs/content/preview/explore/change-data-capture/using-logical-replication/advanced-configuration.md b/docs/content/preview/explore/change-data-capture/using-logical-replication/advanced-configuration.md index 0f676e580ab7..9b64bf7b37f5 100644 --- a/docs/content/preview/explore/change-data-capture/using-logical-replication/advanced-configuration.md +++ b/docs/content/preview/explore/change-data-capture/using-logical-replication/advanced-configuration.md @@ -1,12 +1,36 @@ --- -title: Advanced Configurations -headerTitle: Advanced Configurations -linkTitle: Advanced Configurations -description: Advanced Configurations for Change Data Capture in YugabyteDB. +title: Advanced configurations for CDC using Logical Replication +headerTitle: Advanced configuration +linkTitle: Advanced configuration +description: Advanced Configurations for Logical Replication. 
+headcontent: Tune your CDC configuration menu: preview: parent: explore-change-data-capture-logical-replication identifier: advanced-configurations weight: 40 type: docs ---- \ No newline at end of file +--- + +## YB-TServer flags + +You can use the following [YB-TServer flags](../../../../reference/configuration/yb-tserver/) to tune logical replication deployment configuration: + +- [ysql_yb_default_replica_identity](../../../../reference/configuration/yb-tserver/#ysql-yb-default-replica-identity) +- [cdcsdk_enable_dynamic_table_support](../../../../reference/configuration/yb-tserver/#cdcsdk-enable-dynamic-table-support) +- [cdcsdk_publication_list_refresh_interval_secs](../../../../reference/configuration/yb-tserver/#cdcsdk-publication-list-refresh-interval-secs) +- [cdcsdk_max_consistent_records](../../../../reference/configuration/yb-tserver/#cdcsdk-max-consistent-records) +- [cdcsdk_vwal_getchanges_resp_max_size_bytes](../../../../reference/configuration/yb-tserver/#cdcsdk-vwal-getchanges-resp-max-size-bytes) + +## Retention of resources + +CDC retains resources (such as WAL segments) that contain information related to the changes involved in the transactions. These resources are typically retained until the consuming client acknowledges the receipt of all the transactions contained in that resource. + +Retaining resources has an impact on the system. Clients are expected to consume these transactions within configurable duration limits. Resources will be released if the duration exceeds these configured limits. + +Use the following flags to control the duration for which resources are retained: + +- [cdc_wal_retention_secs](../../../../reference/configuration/yb-tserver/#cdc-wal-retention-secs) +- [cdc_intent_retention_ms](../../../../reference/configuration/yb-tserver/#cdc-intent-retention-ms) + +Resources are retained for each tablet of a table that is part of a database whose changes are being consumed using a replication slot. 
This includes those tables that may not be currently part of the publication specification. diff --git a/docs/content/preview/explore/change-data-capture/using-logical-replication/advanced-topic.md b/docs/content/preview/explore/change-data-capture/using-logical-replication/advanced-topic.md index 73c365ffd09a..715896323163 100644 --- a/docs/content/preview/explore/change-data-capture/using-logical-replication/advanced-topic.md +++ b/docs/content/preview/explore/change-data-capture/using-logical-replication/advanced-topic.md @@ -9,4 +9,200 @@ menu: identifier: advanced-topics weight: 50 type: docs ---- \ No newline at end of file +--- + +This section explores a range of topics designed to provide deeper insights and enhance your understanding of advanced functionalities. + +## Schema evolution + +A change in the schema of the tables (ALTER TABLE) being streamed is transparently handled by the database without manual intervention. + +This is illustrated in the following example. The client used for the example is [pg_recvlogical](../get-started/#get-started-with-pg-recvlogical). + +1. Create a table and create the replication slot. pg_recvlogical uses the test_decoding output plugin by default. + + ```sql + CREATE TABLE demo_table (id INT PRIMARY KEY); + ``` + + ```sh + pg_recvlogical -d yugabyte --slot=demo_slot --create-slot + ``` + +1. Start streaming the changes from the replication slot. + + ```sh + pg_recvlogical -d yugabyte --slot=demo_slot --start -f - + ``` + +1. In a new shell, start ysqlsh and insert some data. + + ```sh + bin/ysqlsh + ``` + + ```sql + INSERT INTO demo_table VALUES (1); + INSERT INTO demo_table VALUES (2); + ``` + + pg_recvlogical receives the inserts and prints it on the console. + + ```output + BEGIN 2 + TABLE public.demo_table: INSERT: id[integer]:1 + COMMIT 2 + BEGIN 3 + TABLE public.demo_table: INSERT: id[integer]:2 + COMMIT 3 + ``` + +1. Add a new column to the `demo_table` and insert some more rows. 
+ + ```sql + ALTER TABLE demo_table ADD COLUMN address TEXT; + INSERT INTO demo_table VALUES (3, 'address1'); + INSERT INTO demo_table VALUES (4, 'address2'); + ``` + + Without any manual intervention, pg_recvlogical receives the inserts with the new schema and prints them on the console. + + ```output + BEGIN 4 + TABLE public.demo_table: INSERT: id[integer]:3 address[text]:'address1' + COMMIT 4 + BEGIN 5 + TABLE public.demo_table: INSERT: id[integer]:4 address[text]:'address2' + COMMIT 5 + ``` + +## Adding tables to publication + +Addition of tables to the streaming list after slot creation is currently a preview feature. To enable dynamic table addition, set the [cdcsdk_enable_dynamic_table_support](../../../../reference/configuration/yb-tserver/#cdcsdk-enable-dynamic-table-support) flag to true. + +The publication's tables list can change in two ways. The first way is by adding a table to the publication using `ALTER PUBLICATION`. + +```sql +CREATE TABLE test_table_1(id INT PRIMARY KEY, aa INT, bb INT); +CREATE TABLE test_table_2(id INT PRIMARY KEY, aa INT, bb INT); + +CREATE PUBLICATION PUB FOR TABLE test_table_1; + +-- Start consumption through a replication slot. + +ALTER PUBLICATION PUB ADD TABLE test_table_2; + +CREATE TABLE test_table_3(id INT PRIMARY KEY, aa INT, bb INT); + +ALTER PUBLICATION PUB ADD TABLE test_table_3; +``` + +The second way is when a table is added to an `ALL TABLES` publication upon creation. + +```sql +CREATE TABLE test_table_1(id INT PRIMARY KEY, aa INT, bb INT); + +CREATE PUBLICATION PUB FOR ALL TABLES; + +-- Start consumption through a replication slot. + +CREATE TABLE test_table_2(id INT PRIMARY KEY, aa INT, bb INT); +-- Since the publication was created for ALL TABLES, ALTER PUBLICATION is not required. +``` + +### YugabyteDB semantics + +Unlike PostgreSQL, any changes made to the publication's tables list are not applied immediately in YugabyteDB. 
Instead the publication's tables list is periodically refreshed, and changes, if any, are applied. The refresh interval is managed using the [cdcsdk_publication_list_refresh_interval_secs](../../../../reference/configuration/yb-tserver/#cdcsdk-publication-list-refresh-interval-secs) flag. The default is one hour (3600 sec). This means that any changes made to the publication's tables list will be applied after `cdcsdk_publication_list_refresh_interval_secs` in the worst case. + +Consider the following example: + +- Suppose that the value of the flag `cdcsdk_publication_list_refresh_interval_secs` is 3600 sec (1 hour) and the publication's tables list is being refreshed every hour at 8 am, 9 am, 10 am, and so on. + +- If any change is made to publication's tables list at 8:01 am, then this change will be applied at 9:00 am. However, any change made to publication's tables list at 8:59 am will also be applied at 9:00 am. + +The value of this flag can be changed at run time, but the change becomes effective only after some time. Continuing the example: + +- Suppose that the value of the flag `cdcsdk_publication_list_refresh_interval_secs` is changed from 3600 sec (1 hour) to 600 sec (10 minutes) at 8:01 am. + +- This change will only be applied after 9:00 am. That is, the publication's tables list will be next refreshed at 9:00 am. Then, the next refresh will happen at 9:10 am, and the subsequent refreshes will take place every 10 minutes. + +### Required settings + +To enable dynamic table addition, perform the following steps: + +1. Set the [cdcsdk_enable_dynamic_table_support](../../../../reference/configuration/yb-tserver/#cdcsdk-enable-dynamic-table-support) to true. + + Because it is a preview flag, first add it to the `allowed_preview_flags_csv` list. + + ```sh + ./yb-ts-cli --server_address= set_flag allowed_preview_flags_csv cdcsdk_enable_dynamic_table_support + ``` + + Then set the `cdcsdk_enable_dynamic_table_support` flag to true. 
+ + ```sh + ./yb-ts-cli --server_address= set_flag cdcsdk_enable_dynamic_table_support true + ``` + +1. Set the [cdcsdk_publication_list_refresh_interval_secs](../../../../reference/configuration/yb-tserver/#cdcsdk-publication-list-refresh-interval-secs) flag to a lower value, such as 60 or 120 seconds. Note that the effect of this setting takes place after the upcoming publication refresh is performed. + + ```sh + ./yb-ts-cli --server_address= set_flag cdcsdk_publication_list_refresh_interval_secs 120 + ``` + +1. After you start receiving records from the newly added table in the publication, reset the `cdcsdk_publication_list_refresh_interval_secs` flag to a high value (for example, 3600 seconds). + + ```sh + ./yb-ts-cli --server_address= set_flag cdcsdk_publication_list_refresh_interval_secs 3600 + ``` + +## Initial snapshot + +The [initial snapshot](../../../../architecture/docdb-replication/cdc-logical-replication/#initial-snapshot) data for a table is consumed by executing a snapshot query (SELECT statement). To ensure that the streaming phase continues exactly from where the snapshot left, this snapshot query is executed as of a specific database state. In YugabyteDB, this database state is represented by a value of `HybridTime`. Changes due to transactions with commit time strictly greater than this snapshot `HybridTime` will be consumed during the streaming phase. + +The consistent database state on which the snapshot query is to be executed is specified using the following command: + +```sql +SET LOCAL yb_read_time TO ' ht'; +``` + +This command should first be executed on the connection (session). The SELECT statement corresponding to the snapshot query should then be executed as part of the same transaction. 
The HybridTime value to use in the `SET LOCAL yb_read_time` command is the value of the `snapshot_name` field that is returned by the [CREATE REPLICATION SLOT](../../../../api/ysql/the-sql-language/statements/#streaming-replication-protocol-statements) command. + +You can also obtain this value by executing the following query: + +```sql +select yb_restart_commit_ht +from pg_replication_slots where slot_name = ; +``` + +For more information on the `pg_replication_slots` catalog view, refer to [pg_replication_slots](../monitor/#pg-replication-slots). + +### Permissions + +Only a superuser can execute the command to set the value of `yb_read_time`. + +For a non-superuser to be able to perform an initial snapshot, perform the following additional setup as a superuser (in addition to granting the required SELECT and USAGE privileges): + +```sql +CREATE ROLE appuser WITH LOGIN REPLICATION; +CREATE SCHEMA appuser AUTHORIZATION appuser; + +CREATE OR REPLACE PROCEDURE appuser.set_yb_read_time(value TEXT) +LANGUAGE plpgsql +AS $$ +BEGIN + EXECUTE 'SET LOCAL yb_read_time = ' || quote_literal(value); +END; +$$ +SECURITY DEFINER; + + +REVOKE EXECUTE ON PROCEDURE appuser.set_yb_read_time FROM PUBLIC; +GRANT EXECUTE ON PROCEDURE appuser.set_yb_read_time TO appuser; +``` + +With this setup, the command to be executed by the application user as part of the transaction prior to executing the snapshot SELECT query would be: + +```sh +CALL set_yb_read_time(' ht') +``` diff --git a/docs/content/preview/explore/change-data-capture/using-logical-replication/best-practices.md b/docs/content/preview/explore/change-data-capture/using-logical-replication/best-practices.md new file mode 100644 index 000000000000..135f91a0a325 --- /dev/null +++ b/docs/content/preview/explore/change-data-capture/using-logical-replication/best-practices.md @@ -0,0 +1,28 @@ +--- +title: Best Practices for logical replication +headerTitle: Best practices +linkTitle: Best practices +description: Best Practices for 
logical replication with Change Data Capture in YugabyteDB. +menu: + preview: + parent: explore-change-data-capture-logical-replication + identifier: best-practices-cdc + weight: 60 +type: docs +--- + +This section describes best practices to achieve scalability and performance while using CDC with logical replication. + +## Parallel consumption + +The recommended approach for consuming changes in parallel from different tables is to use multiple replication slots, for example, one replication slot per table. Each replication slot is independent of the others, and the changes from the tables can be consumed in parallel. + +## Fan out + +Consider a scenario in which multiple applications all need to consume changes from the same table. The recommended approach is to use one replication slot to consume the changes from the table and write them to a system like Kafka. The fan out can then be implemented with the multiple applications consuming from Kafka. + +## Load balancing consumption + +An application can connect to any of the YB-TServer nodes to consume from a replication slot. Furthermore, even in case of an interruption, a fresh connection can be made to a different node (different from the node from which consumption was previously happening) to continue consumption from the same replication slot. + +When there are multiple consuming applications, each consuming from a different replication slot, it is best that the applications connect to different YB-TServer nodes in the cluster. This ensures better load balancing. The [YugabyteDB smart driver](../../../../drivers-orms/smart-drivers/) does this automatically, so it is recommended that applications use this smart driver. 
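The fan-out pattern above can be sketched as follows (illustrative Python, not Kafka code; the class and consumer names are invented for the example). One CDC pipeline publishes each change once to a durable log, and each application reads independently at its own offset.

```python
# Illustrative sketch: fan-out through a durable log. One replication
# slot feeds the log; multiple applications consume it independently.

class Topic:
    def __init__(self):
        self.log = []       # append-only event log (stands in for Kafka)
        self.offsets = {}   # consumer name -> next index to read

    def publish(self, event):
        # The single CDC pipeline (one replication slot) writes here.
        self.log.append(event)

    def poll(self, consumer):
        # Each application reads at its own pace via its own offset.
        start = self.offsets.get(consumer, 0)
        events = self.log[start:]
        self.offsets[consumer] = len(self.log)
        return events

topic = Topic()
for change in ["INSERT r1", "UPDATE r1", "INSERT r2"]:
    topic.publish(change)            # changes from one replication slot

print(topic.poll("analytics"))       # all three events
print(topic.poll("audit"))           # same three events, independent offset
print(topic.poll("analytics"))      # [] since nothing new arrived
```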
diff --git a/docs/content/preview/explore/change-data-capture/using-logical-replication/cdc-best-practices.md b/docs/content/preview/explore/change-data-capture/using-logical-replication/cdc-best-practices.md deleted file mode 100644 index a2e5ab485373..000000000000 --- a/docs/content/preview/explore/change-data-capture/using-logical-replication/cdc-best-practices.md +++ /dev/null @@ -1,12 +0,0 @@ ---- -title: Best Practices -headerTitle: Best Practices -linkTitle: Best Practices -description: Best Practices for Change Data Capture in YugabyteDB. -menu: - preview: - parent: explore-change-data-capture-logical-replication - identifier: best-practices-cdc - weight: 60 -type: docs ---- \ No newline at end of file diff --git a/docs/content/preview/explore/change-data-capture/using-logical-replication/get-started.md b/docs/content/preview/explore/change-data-capture/using-logical-replication/get-started.md index c2f1375ace02..28864d417813 100644 --- a/docs/content/preview/explore/change-data-capture/using-logical-replication/get-started.md +++ b/docs/content/preview/explore/change-data-capture/using-logical-replication/get-started.md @@ -1,13 +1,586 @@ --- -title: Get started with CDC in YugabyteDB +title: Get started with CDC logical replication in YugabyteDB headerTitle: Get started linkTitle: Get started description: Get started with Change Data Capture in YugabyteDB. 
-headcontent: Get set up for using CDC in YugabyteDB +headcontent: Start using CDC with logical replication menu: preview: parent: explore-change-data-capture-logical-replication identifier: get-started weight: 20 type: docs ---- \ No newline at end of file +--- + +To get started streaming data change events from a YugabyteDB database using a replication slot, you can use either of the following client options: + +- [pg_recvlogical](#get-started-with-pg_recvlogical) +- [YugabyteDB connector](#get-started-with-yugabytedb-connector) + +{{< note title="Note" >}} + +CDC via logical replication is supported in YugabyteDB starting from version 2024.1.1. + +{{< /note >}} + +## Get started with pg_recvlogical + +pg_recvlogical is a command-line tool provided by PostgreSQL for interacting with the logical replication feature. It is specifically used to receive changes from the database using logical replication slots. + +YugabyteDB provides the pg_recvlogical binary in the `/postgres/bin/` directory, which is inherited and based on PostgreSQL 11.2. Although PostgreSQL also offers a pg_recvlogical binary, you are strongly advised to use the YugabyteDB version to avoid compatibility issues. + +### Set up pg_recvlogical + +To set up pg_recvlogical, create and start the local cluster by running the following command from your YugabyteDB home directory: + +```sh +./bin/yugabyted start \ + --advertise_address=127.0.0.1 \ + --base_dir="${HOME}/var/node1" \ + --tserver_flags="allowed_preview_flags_csv={cdcsdk_enable_dynamic_table_support},cdcsdk_enable_dynamic_table_support=true,cdcsdk_publication_list_refresh_interval_secs=2" +``` + +#### Create tables + +1. Use ysqlsh to connect to the default `yugabyte` database with the default superuser `yugabyte`, as follows: + + ```sh + bin/ysqlsh -h 127.0.0.1 -U yugabyte -d yugabyte + ``` + +1. In the `yugabyte` database, create a table `employees`. 
+ + ```sql + CREATE TABLE employees ( + employee_id SERIAL PRIMARY KEY, + name VARCHAR(255), + email VARCHAR(255), + department_id INTEGER + ); + ``` + +#### Create a replication slot + +Create a logical replication slot named `test_logical_replication_slot` using the `test_decoding` output plugin with the following function: + +```sql +SELECT * +FROM pg_create_logical_replication_slot('test_logical_replication_slot', 'test_decoding'); +``` + +The following expected output indicates successful creation of the slot: + +```output + slot_name | lsn +-------------------------------+----- + test_logical_replication_slot | 0/2 +``` + +#### Configure and start pg_recvlogical + +The pg_recvlogical binary can be found under `/postgres/bin/`. + +Open a new shell and start pg_recvlogical to connect to the `yugabyte` database with the superuser `yugabyte` and replicate changes using the replication slot you created, as follows: + +```sh +./pg_recvlogical -d yugabyte \ + -U yugabyte \ + -h 127.0.0.1 \ + --slot test_logical_replication_slot \ + --start \ + -f - +``` + +Any changes that get replicated are printed to stdout. + +For more pg_recvlogical configurations, refer to the PostgreSQL [pg_recvlogical](https://www.postgresql.org/docs/11/app-pgrecvlogical.html) documentation. + +#### Verify replication + +Return to the shell where ysqlsh is running. Perform DMLs on the `employees` table. 
+ +```sql +BEGIN; + +INSERT INTO employees (name, email, department_id) +VALUES ('Alice Johnson', 'alice@example.com', 1); + +INSERT INTO employees (name, email, department_id) +VALUES ('Bob Smith', 'bob@example.com', 2); + +COMMIT; +``` + +You should see output similar to the following on stdout, where pg_recvlogical is running: + +```output +BEGIN 2 +table public.employees: INSERT: employee_id[integer]:1 name[character varying]:'Alice Johnson' email[character varying]:'alice@example.com' department_id[integer]:1 +table public.employees: INSERT: employee_id[integer]:2 name[character varying]:'Bob Smith' email[character varying]:'bob@example.com' department_id[integer]:2 +COMMIT 2 +``` + +#### Add tables (dynamic table addition) + +You can add a new table to the `yugabyte` database, and any DMLs performed on the new table are also replicated to pg_recvlogical. + +1. In the `yugabyte` database, create a new table `projects`: + + ```sql + CREATE TABLE projects ( + project_id SERIAL PRIMARY KEY, + name VARCHAR(255), + description TEXT + ); + ``` + +2. Perform DMLs on the `projects` table: + + ```sql + INSERT INTO projects (name, description) + VALUES ('Project A', 'Description of Project A'); + ``` + +You should see output similar to the following on stdout, where pg_recvlogical is running: + +```output +BEGIN 3 +table public.projects: INSERT: project_id[integer]:1 name[character varying]:'Project A' description[text]:'Description of Project A' +COMMIT 3 +``` + +{{% explore-cleanup-local %}} + +## Get started with YugabyteDB connector + +This tutorial demonstrates how to use Debezium to monitor a YugabyteDB database. As the data in the database changes, you will see the resulting event streams. + +In this tutorial, you will start the Debezium services, run a YugabyteDB instance with a basic example database, and use Debezium to monitor the database for changes. + +**Prerequisites** + +- Docker is installed and running.
+ + This tutorial uses Docker and the Debezium container images to run the required services. You should use the latest version of Docker. For more information, see the [Docker Engine installation](https://docs.docker.com/engine/installation/) documentation. + +### Start the services + +Using Debezium requires three separate services: [Zookeeper](http://zookeeper.apache.org/), [Kafka](https://kafka.apache.org), and the Debezium connector service. + +In this tutorial, you will set up a single instance of each service using Docker and the Debezium container images. + +To start the services needed for this tutorial, you must: + +- [Start Zookeeper](#start-zookeeper) +- [Start Kafka](#start-kafka) +- [Start a YugabyteDB database](#start-a-yugabytedb-database) +- [Start Kafka Connect](#start-kafka-connect) + +#### Start Zookeeper + +Zookeeper is the first service you must start. + +1. Open a terminal and use it to start Zookeeper in a container. This command runs a new container using version `2.5.2.Final` of the `debezium/zookeeper` image: + +```sh +docker run -d --rm --name zookeeper -p 2181:2181 -p 2888:2888 -p 3888:3888 debezium/zookeeper:2.5.2.Final +``` + +#### Start Kafka + +After starting Zookeeper, you can start Kafka in a new container. + +1. Open a new terminal and use it to start Kafka in a container. This command runs a new container using version `2.5.2.Final` of the `debezium/kafka` image: + +```sh +docker run -d --rm --name kafka -p 9092:9092 --link zookeeper:zookeeper debezium/kafka:2.5.2.Final +``` + +{{< note title="Note" >}} + +In this tutorial, you will always connect to Kafka from within a Docker container. Any of these containers can communicate with the `kafka` container by linking to it. If you need to connect to Kafka from outside a Docker container, you have to set the `-e` option to advertise the Kafka address through the Docker host (`-e ADVERTISED_HOST_NAME=` followed by either the IP address or resolvable host name of the Docker host).
+ +{{< /note >}} + +#### Start a YugabyteDB database + +At this point, you have started Zookeeper and Kafka, but you still need a database server from which Debezium can capture changes. In this procedure, you start a YugabyteDB instance with an example database. Follow the [Quick Start](../../../../quick-start) to start an instance using yugabyted. + +{{< note title="Note" >}} + +You need to start the database on an IP that is resolvable by the Docker containers. If you use the localhost address (that is, `127.0.0.1`), connectors deployed in the Docker containers won't be able to talk to the database, and will keep trying to connect to `127.0.0.1` inside the container. Use the [--advertise_address option for yugabyted](../../../../reference/configuration/yugabyted#flags-8) to specify the IP address on which to start your database instance. + +For example, Linux users can use the following: + +```sh +./bin/yugabyted start --advertise_address $(hostname -i) +``` + +{{< /note >}} + +##### Use the YSQL command line client + +After starting YugabyteDB, use ysqlsh to create your database: + +1. Connect the client to the database process running on the IP you specified when you started the database instance: + + ```sh + ./bin/ysqlsh -h <IP> + ``` + + You should see output similar to the following: + + ```output + ysqlsh (11.2-YB-2.21.1.0-b0) + Type "help" for help. + + yugabyte=# + ``` + +1. Load the schema of the sample tables: + + ```sql + yugabyte=# \i share/schema.sql + CREATE TABLE + CREATE TABLE + CREATE TABLE + CREATE TABLE + ``` + +1.
List the tables + + ```sql + yugabyte=# \d + ``` + + ```output + List of relations + Schema | Name | Type | Owner + --------+-----------------+----------+---------- + public | orders | table | yugabyte + public | orders_id_seq | sequence | yugabyte + public | products | table | yugabyte + public | products_id_seq | sequence | yugabyte + public | reviews | table | yugabyte + public | reviews_id_seq | sequence | yugabyte + public | users | table | yugabyte + public | users_id_seq | sequence | yugabyte + (8 rows) + ``` + +1. Load data in one of the tables and verify the count. + + ```sql + yugabyte=# \i share/products.sql + ``` + + ```output + yugabyte=# select count(*) from products; + count + ------- + 200 + (1 row) + ``` + +#### Start Kafka Connect + +After starting YugabyteDB, you start the Kafka Connect service. This service exposes a REST API to manage the YugabyteDB connector. + +1. Open a new terminal, and use it to start the Kafka Connect service in a container. + + The following command runs a new container using the `dz.2.5.2.yb.2024.1.SNAPSHOT.1` version of the `quay.io/yugabyte/ybdb-debezium` image: + + ```sh + docker run -it --rm --name connect -p 8083:8083 -p 1976:1976 -e GROUP_ID=1 -e CONFIG_STORAGE_TOPIC=my_connect_configs -e OFFSET_STORAGE_TOPIC=my_connect_offsets -e STATUS_STORAGE_TOPIC=my_connect_statuses -e CLASSPATH=/kafka/connect/ --link zookeeper:zookeeper --link kafka:kafka quay.io/yugabyte/ybdb-debezium:dz.2.5.2.yb.2024.1.SNAPSHOT.1 + ``` + +1. Verify that Kafka Connect started and is ready to accept connections. You should see output similar to the following: + + ```output + ... + 2024-07-19 12:04:33,044 INFO || Kafka version: 3.6.1 [org.apache.kafka.common.utils.AppInfoParser] + ... 
+ 2024-07-19 12:04:33,661 INFO || [Worker clientId=connect-1, groupId=1] Starting connectors and tasks using config offset -1 [org.apache.kafka.connect.runtime.distributed.DistributedHerder] + 2024-07-19 12:04:33,661 INFO || [Worker clientId=connect-1, groupId=1] Finished starting connectors and tasks [org.apache.kafka.connect.runtime.distributed.DistributedHerder] + ``` + +1. Use the Kafka Connect REST API to check the status of the Kafka Connect service. + + Kafka Connect exposes a REST API to manage Debezium connectors. To communicate with the Kafka Connect service, you can use the `curl` command to send API requests to port 8083 of the Docker host (which you mapped to port 8083 in the `connect` container when you started Kafka Connect). + + Open a new terminal and check the status of the Kafka Connect service: + + ```sh + $ curl -H "Accept:application/json" localhost:8083/ + + {"version":"3.6.1","commit":"5e3c2b738d253ff5","kafka_cluster_id":"kafka-cluster-id"} + ``` + +{{< note title="Note" >}} + +These commands use `localhost`. If you are using a non-native Docker platform (such as Docker Toolbox), replace `localhost` with the IP address of your Docker host. + +{{< /note >}} + +### Deploy the YugabyteDB connector + +After starting the Debezium and YugabyteDB services, you are ready to deploy the YugabyteDB connector. To deploy the connector, you must: + +- [Register the YugabyteDB connector to monitor the `yugabyte` database](#register-a-connector-to-monitor-yugabyte-database) +- Watch the connector start + +#### Register a connector to monitor `yugabyte` database + +When you register the YugabyteDB connector, it starts monitoring the `products` table of the YugabyteDB database. When a row in the table changes, Debezium generates a change event.
+ +{{< note title="Note" >}} + +In a production environment, you would typically either use the Kafka tools to manually create the necessary topics, including specifying the number of replicas, or you would use the Kafka Connect mechanism for customizing the settings of [auto-created](https://debezium.io/documentation/reference/2.5/configuration/topic-auto-create-config.html) topics. However, for this tutorial, Kafka is configured to automatically create the topics with just one replica. + +{{< /note >}} + +1. Review the configuration of the YugabyteDB connector that you will register. Before registering the connector, you should be familiar with its configuration. In the next step, you will register the following connector: + + ```json + { + "name": "ybconnector", + "config": { + "tasks.max":"1", + "connector.class": "io.debezium.connector.postgresql.YugabyteDBConnector", + "database.hostname":"'$(hostname -i)'", + "database.port":"5433", + "database.user": "yugabyte", + "database.password":"yugabyte", + "database.dbname":"yugabyte", + "topic.prefix":"dbserver1", + "snapshot.mode":"initial", + "table.include.list":"public.products", + "plugin.name":"yboutput", + "slot.name":"yb_replication_slot" + } + } + ``` + + - `name` - The name of the connector. + - `config` - The connector's configuration. + - `database.hostname` - The database host, which is the IP of the machine running YugabyteDB. If YugabyteDB were running on a normal network, you would specify the IP address or resolvable host name for this value. + - `topic.prefix` - A unique topic prefix. This name will be used as the prefix for all Kafka topics. + - `table.include.list` - Only changes in the table `products` of the schema `public` will be detected. + - `plugin.name` - [Plugin](../key-concepts/#output-plugin) to be used for replication. + - `slot.name` - Name of the [replication slot](../key-concepts/#replication-slot). 
+ + For more information, see [YugabyteDB connector configuration properties](../yugabytedb-connector/#connector-properties). + +1. Open a new terminal and use the `curl` command to register the YugabyteDB connector. + + This command uses the Kafka Connect service API to submit a `POST` request against the `/connectors` resource with a `JSON` document that describes the new connector (called `ybconnector`). + + ```sh + curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '{ + "name": "ybconnector", + "config": { + "tasks.max":"1", + "connector.class": "io.debezium.connector.postgresql.YugabyteDBConnector", + "database.hostname":"'$(hostname -i)'", + "database.port":"5433", + "database.user": "yugabyte", + "database.password":"yugabyte", + "database.dbname":"yugabyte", + "topic.prefix":"dbserver1", + "snapshot.mode":"initial", + "table.include.list":"public.products", + "plugin.name":"yboutput", + "slot.name":"yb_replication_slot" + } + }' + ``` + + {{< note title="Note" >}} +Windows users may need to escape the double-quotes. + {{< /note >}} + +1. Verify that `ybconnector` is included in the list of connectors: + + ```sh + $ curl -H "Accept:application/json" localhost:8083/connectors/ + + ["ybconnector"] + ``` + +#### Watch the connector start + +When you register a connector, it generates a large amount of log output in the Kafka Connect container. By reviewing this output, you can better understand the process that the connector goes through from the time it is created until it begins reading the change events. + +After registering the `ybconnector` connector, you can review the log output in the Kafka Connect container (`connect`) to track the connector's status. + +Kafka Connect reports some "errors". 
However, you can safely ignore these warnings: these messages just mean that new Kafka topics were created and that Kafka had to assign a new leader for each one: + +```output +2021-11-30 01:38:45,555 WARN || [Producer clientId=connector-producer-inventory-connector-0] Error while fetching metadata with correlation id 3 : {dbserver1=LEADER_NOT_AVAILABLE} [org.apache.kafka.clients.NetworkClient] +2021-11-30 01:38:45,691 WARN || [Producer clientId=connector-producer-inventory-connector-0] Error while fetching metadata with correlation id 9 : {dbserver1.public.orders=LEADER_NOT_AVAILABLE} [org.apache.kafka.clients.NetworkClient] +2021-11-30 01:38:45,813 WARN || [Producer clientId=connector-producer-inventory-connector-0] Error while fetching metadata with correlation id 13 : {dbserver1.public.users=LEADER_NOT_AVAILABLE} [org.apache.kafka.clients.NetworkClient] +2021-11-30 01:38:45,927 WARN || [Producer clientId=connector-producer-inventory-connector-0] Error while fetching metadata with correlation id 18 : {dbserver1.public.products=LEADER_NOT_AVAILABLE} [org.apache.kafka.clients.NetworkClient] +2021-11-30 01:38:46,043 WARN || [Producer clientId=connector-producer-inventory-connector-0] Error while fetching metadata with correlation id 22 : {dbserver1.public.reviews=LEADER_NOT_AVAILABLE} [org.apache.kafka.clients.NetworkClient] +``` + +### View change events + +After deploying the YugabyteDB connector, it starts monitoring the `yugabyte` database for data change events. + +For this tutorial, you will explore the `dbserver1.public.products` topic. + +#### View a change event + +Open a new terminal, and use it to start the watch-topic utility to watch the `dbserver1.public.products` topic from the beginning of the topic. 
+ +The following command runs the `watch-topic` utility in a new container using the `2.5.2.Final` version of the `debezium/kafka` image: + +```sh +docker run -it --rm --name consumer --link zookeeper:zookeeper --link kafka:kafka debezium/kafka:2.5.2.Final watch-topic -a dbserver1.public.products +``` + +The `watch-topic` utility returns the event records from the `products` table. There will be 200 events, one for each row in the table which was snapshotted. Each event is formatted in JSON, because that is how you configured the Kafka Connect service. There are two JSON documents for each event: one for the key, and one for the value. + +You should see output similar to the following: + +```output.json +Using ZOOKEEPER_CONNECT=172.17.0.2:2181 +Using KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://172.17.0.7:9092 +Using KAFKA_BROKER=172.17.0.3:9092 +Contents of topic dbserver1.public.products: +... +{"schema":{"type":"struct","fields":[{"type":"struct","fields":[{"type":"struct","fields":[{"type":"int64","optional":false,"default":0,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":false,"name":"id","field":"id"},{"type":"struct","fields":[{"type":"int64","optional":true,"name":"io.debezium.time.MicroTimestamp","version":1,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"created_at","field":"created_at"},{"type":"struct","fields":[{"type":"string","optional":true,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"category","field":"category"},{"type":"struct","fields":[{"type":"string","optional":true,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"ean","field":"ean"},{"type":"struct","fields":[{"type":"double","optional":true,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"price","field":"price"},{"type":"struct","fields":[{"type":"int32","optional":true,"default":5000,"field"
:"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"quantity","field":"quantity"},{"type":"struct","fields":[{"type":"double","optional":true,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"rating","field":"rating"},{"type":"struct","fields":[{"type":"string","optional":true,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"title","field":"title"},{"type":"struct","fields":[{"type":"string","optional":true,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"vendor","field":"vendor"}],"optional":true,"name":"dbserver1.public.products.Value","field":"before"},{"type":"struct","fields":[{"type":"struct","fields":[{"type":"int64","optional":false,"default":0,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":false,"name":"id","field":"id"},{"type":"struct","fields":[{"type":"int64","optional":true,"name":"io.debezium.time.MicroTimestamp","version":1,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"created_at","field":"created_at"},{"type":"struct","fields":[{"type":"string","optional":true,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"category","field":"category"},{"type":"struct","fields":[{"type":"string","optional":true,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"ean","field":"ean"},{"type":"struct","fields":[{"type":"double","optional":true,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"price","field":"price"},{"type":"struct","fields":[{"type":"int32","optional":true,"default":5000,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"quantity","field":"quantity"},{"type":"struct","fields":[{"type":"double","optional":true,"field":"value"},{"type":"boolean","optiona
l":false,"field":"set"}],"optional":true,"name":"rating","field":"rating"},{"type":"struct","fields":[{"type":"string","optional":true,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"title","field":"title"},{"type":"struct","fields":[{"type":"string","optional":true,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"vendor","field":"vendor"}],"optional":true,"name":"dbserver1.public.products.Value","field":"after"},{"type":"struct","fields":[{"type":"string","optional":false,"field":"version"},{"type":"string","optional":false,"field":"connector"},{"type":"string","optional":false,"field":"name"},{"type":"int64","optional":false,"field":"ts_ms"},{"type":"string","optional":true,"name":"io.debezium.data.Enum","version":1,"parameters":{"allowed":"true,last,false,incremental"},"default":"false","field":"snapshot"},{"type":"string","optional":false,"field":"db"},{"type":"string","optional":true,"field":"sequence"},{"type":"string","optional":false,"field":"schema"},{"type":"string","optional":false,"field":"table"},{"type":"int64","optional":true,"field":"txId"},{"type":"int64","optional":true,"field":"lsn"},{"type":"int64","optional":true,"field":"xmin"}],"optional":false,"name":"io.debezium.connector.postgresql.Source","field":"source"},{"type":"string","optional":false,"field":"op"},{"type":"int64","optional":true,"field":"ts_ms"},{"type":"struct","fields":[{"type":"string","optional":false,"field":"id"},{"type":"int64","optional":false,"field":"total_order"},{"type":"int64","optional":false,"field":"data_collection_order"}],"optional":true,"name":"event.block","version":1,"field":"transaction"}],"optional":false,"name":"dbserver1.public.products.Envelope","version":1},"payload":{"before":null,"after":{"id":{"value":147,"set":true},"created_at":{"value":1500306107286000,"set":true},"category":{"value":"Doohickey","set":true},"ean":{"value":"6590063715","set":true},"price":{"value":44
.4315141414441,"set":true},"quantity":{"value":5000,"set":true},"rating":{"value":4.6,"set":true},"title":{"value":"Mediocre Wool Toucan","set":true},"vendor":{"value":"Bradtke, Wilkinson and Reilly","set":true}},"source":{"version":"dz.2.5.2.yb.2024.1-SNAPSHOT","connector":"postgresql","name":"dbserver1","ts_ms":1721400304248,"snapshot":"true","db":"yugabyte","sequence":"[null,\"2\"]","schema":"public","table":"products","txId":2,"lsn":2,"xmin":null},"op":"r","ts_ms":1721400309609,"transaction":null}} +... +``` + +{{< note title="Note" >}} + +This utility keeps watching the topic, so any new events will automatically appear as long as the utility is running. + +{{< /note >}} + +#### Update the database and view the update event + +Now that you have seen how the YugabyteDB connector captured the create events in the `yugabyte` database, change one of the records and see how the connector captures it. + +By completing this procedure, you will learn how to find details about what changed in a database commit, and how you can compare change events to determine when the change occurred in relation to other changes. + +1. In the terminal that is running ysqlsh, run the following statement: + + ```sql + update products set title = 'Enormous Granite Shiny Shoes' where id = 22; + ``` + +1. View the updated `products` table: + + ```sql + yugabyte=# select * from products where id = 22; + ``` + + ```output + id | created_at | category | ean | price | quantity | rating | title | vendor + ----+-------------------------+----------+---------------+------------------+----------+--------+------------------------------+--------------------------- + 22 | 2017-11-24 20:14:28.415 | Gizmo | 7595223735110 | 21.4245199604423 | 5000 | 4.2 | Enormous Granite Shiny Shoes | Mayer, Kiehn and Turcotte + (1 row) + ``` + +1. Switch to the terminal running `watch-topic` to see a new event. + + By changing a record in the `products` table, the YugabyteDB connector generated a new event. 
+ + The details for the payload of the *update* event will look similar to the following (formatted for readability): + + ```json + { + "before": null, + "after": { + "id": { + "value": 22, + "set": true + }, + "created_at": null, + "category": null, + "ean": null, + "price": null, + "quantity": null, + "rating": null, + "title": { + "value": "Enormous Granite Shiny Shoes", + "set": true + }, + "vendor": null + } + } + ``` + +Note that the fields that were not updated are emitted as `null`. This is because the [REPLICA IDENTITY](../key-concepts/#replica-identity) of the table is `CHANGE` by default, which sends only the values of the updated columns in the change event. + +#### Delete a row and view the delete event + +1. In the terminal that is running ysqlsh, run the following statement: + + ```sql + delete from products where id = 22; + ``` + +1. Switch to the terminal running `watch-topic` to see two new events. By deleting a row in the `products` table, you caused the YugabyteDB connector to generate two new events. + + The details for the payload of the first event will look similar to the following (formatted for readability): + + ```json + { + "before": { + "id": { + "value": 22, + "set": true + }, + "created_at": { + "value": null, + "set": true + }, + "category": { + "value": null, + "set": true + }, + "ean": { + "value": null, + "set": true + }, + "price": { + "value": null, + "set": true + }, + "quantity": { + "value": 5000, + "set": true + }, + "rating": { + "value": null, + "set": true + }, + "title": { + "value": null, + "set": true + }, + "vendor": { + "value": null, + "set": true + } + }, + "after": null + } + ``` + +The second event has a *key*, but the *value* is `null`; this is a [tombstone event](../yugabytedb-connector/#tombstone-events) generated by the YugabyteDB connector. + +### Clean up + +After you are finished with the tutorial, you can use Docker to stop all of the running containers.
+ +Run the following command: + +```sh +docker stop zookeeper kafka connect consumer +``` + +Docker stops each container. Because you used the `--rm` option when you started them, Docker also removes them. diff --git a/docs/content/preview/explore/change-data-capture/using-logical-replication/key-concepts.md b/docs/content/preview/explore/change-data-capture/using-logical-replication/key-concepts.md new file mode 100644 index 000000000000..2babbabc821b --- /dev/null +++ b/docs/content/preview/explore/change-data-capture/using-logical-replication/key-concepts.md @@ -0,0 +1,123 @@ +--- +title: Key concepts - logical replication +headerTitle: Key concepts +linkTitle: Key concepts +description: Change Data Capture in YugabyteDB. +headcontent: PostgreSQL logical replication concepts +menu: + preview: + parent: explore-change-data-capture-logical-replication + identifier: key-concepts + weight: 10 +type: docs +--- + +The YugabyteDB logical replication feature uses [PostgreSQL Logical Replication](https://www.postgresql.org/docs/11/logical-replication.html), which operates using a publish-subscribe model. Understanding the following key concepts will help you set up and manage a logical replication environment effectively. + +## Concepts + +### Replication slot + +A replication slot represents a stream of changes that can be replayed to a client in the order they were made on the origin server. Each slot streams a sequence of changes from a single database. + +In logical replication, the fundamental unit of data transmission is a transaction. A logical slot emits each change just once in normal operation. The current position of each slot is persisted only at checkpoint, so if a replication process is interrupted and restarts, even if the checkpoint or the starting LSN falls in the middle of a transaction, **the entire transaction is retransmitted**. 
This behavior guarantees that clients receive complete transactions without missing any intermediate changes, maintaining data integrity across the replication stream. Logical decoding clients are responsible for avoiding ill effects from handling the same message more than once. Clients may wish to record the last LSN they saw when decoding and skip over any repeated data or (when using the replication protocol) request that decoding start from that LSN rather than letting the server determine the start point. + +For more information, refer to [Replication slots](https://www.postgresql.org/docs/11/logicaldecoding-explanation.html#LOGICALDECODING-REPLICATION-SLOTS) in the PostgreSQL documentation. + +### Publication + +A publication is a set of changes generated from a table or a group of tables, and might also be described as a change set or replication set. Each publication exists in only one database. + +Publications are different from schemas and do not affect how the table is accessed. Each table can be added to multiple publications if needed. Publications may currently only contain tables. Objects must be added explicitly, except when a publication is created for ALL TABLES. + +For more information, refer to [Publication](https://www.postgresql.org/docs/11/logical-replication-publication.html#LOGICAL-REPLICATION-PUBLICATION) in the PostgreSQL documentation. + +### Output plugin + +Output plugins transform the data from the write-ahead log's internal representation into a format that can be consumed by replication clients. The plugins are notified, through various callbacks, about the change events that need to be processed and sent. These callbacks are only invoked when the transaction actually commits. + +YugabyteDB supports the following four output plugins: + +- `yboutput` +- `pgoutput` +- `test_decoding` +- `wal2json` + +All these plugins are pre-packaged with YugabyteDB and do not require any external installation.
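+ +For example, the output plugin for a stream is selected when its replication slot is created, as the second argument to `pg_create_logical_replication_slot` (the slot names below are illustrative, and assume a running cluster): + +```sql +-- Create slots bound to different output plugins (illustrative slot names). +SELECT * FROM pg_create_logical_replication_slot('slot_yboutput', 'yboutput'); +SELECT * FROM pg_create_logical_replication_slot('slot_wal2json', 'wal2json'); + +-- List existing slots and the output plugin each one uses. +SELECT slot_name, plugin FROM pg_replication_slots; +```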
+ +{{< note title="Note" >}} + +The plugin `yboutput` is YugabyteDB specific. It is similar to `pgoutput` in most aspects. The only difference is that replica identity `CHANGE` is not supported in `pgoutput`. All other plugins support replica identity `CHANGE`. + +{{< /note >}} + +For more information, refer to [Logical Decoding Output Plugins](https://www.postgresql.org/docs/11/logicaldecoding-output-plugin.html) in the PostgreSQL documentation. + +### LSN + +LSN (Log Sequence Number) in YugabyteDB is an unsigned 64-bit integer that uniquely identifies a change record or a transaction boundary record that is consumed from a given replication slot. + +In YugabyteDB, LSN values from different slots are unrelated and should not be compared. Unlike in PostgreSQL, an LSN in YugabyteDB does not represent the byte offset of a WAL record. + +LSN values for a single replication slot satisfy the following properties: + +- **Uniqueness** + + LSN values for the change and `COMMIT` records for a given replication slot are unique. In particular, changes from different tablets of the same or different tables will have unique LSN values for a replication slot. + +- **Ordering** + + LSN values can be compared (`<`, `>`, `=`). + + The LSN of the change records in a transaction will be strictly lower than the LSN of the `COMMIT` record of the same transaction. + + The LSNs of change records in a transaction will be in increasing order, corresponding to the order in which those changes were made in that transaction. That is, the LSN of an earlier change will have a strictly lower value than the LSN of a later change in the same transaction. This is the case even if the changes correspond to rows in different tablets of the same or different tables. + + For a given replication slot, the LSN of a `COMMIT` record of an earlier transaction will be strictly lower than the LSN of the `COMMIT` record of a later transaction.
+ +- **Determinism** + + For a given replication slot, the LSN value of a change record (or a transaction boundary record) remains the same for the lifetime of that replication slot. In particular, this is true across server and client restarts and client re-connections. Thus, LSN values for a single replication slot may be used to uniquely identify records that are consumed from that replication slot. The values can be compared for determining duplicates at the client side. + +### Replica identity + +Replica identity is a table-level parameter that controls the amount of information being written to the change records. YugabyteDB supports the following four replica identities: + +- CHANGE (default) +- DEFAULT +- FULL +- NOTHING + +The replica identity `INDEX` is not supported in YugabyteDB. + +Replica identity `CHANGE` is the most performant, and is the default. You can change the replica identity of a table using ALTER TABLE. However, for a given slot, any ALTER TABLE performed to change the replica identity after the slot is created has no effect. That is, the effective replica identity of a table for a slot is the replica identity the table had at the time of slot creation. A dynamically created table (a table created after slot creation) will have the default replica identity. For a replica identity modified after slot creation to take effect, a new slot has to be created after performing the ALTER TABLE. + +The [ysql_yb_default_replica_identity](../../../../reference/configuration/yb-tserver/#ysql-yb-default-replica-identity) flag determines the default replica identity for user tables at the time of table creation. The default value is `CHANGE`. The purpose of this flag is to set the replica identities for dynamically created tables. To create a dynamic table with the desired replica identity, set the flag accordingly before creating the table.
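+ +As a sketch of the ordering described above (the table and slot names are illustrative), altering a table's replica identity only affects slots created afterwards: + +```sql +-- Change the replica identity of a table. +ALTER TABLE employees REPLICA IDENTITY FULL; + +-- Existing slots keep the replica identity captured at their creation time; +-- create a new slot for the FULL identity to take effect. +SELECT * FROM pg_create_logical_replication_slot('slot_after_alter', 'yboutput'); +```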
+
+{{< note title="Advisory" >}}
+You should refrain from altering the replica identity of a dynamically created table for at least 5 minutes after its creation.
+{{< /note >}}
+
+For more information, refer to [Replica Identity](https://www.postgresql.org/docs/11/sql-altertable.html#SQL-CREATETABLE-REPLICA-IDENTITY) in the PostgreSQL documentation.
+
+### Replication protocols
+
+PostgreSQL defines replication protocols that clients must follow to establish a replication connection, as well as message structures for streaming data. These include the [Streaming Replication protocol](https://www.postgresql.org/docs/11/protocol-replication.html) and the [Logical Streaming Replication protocol](https://www.postgresql.org/docs/11/protocol-logical-replication.html).
+
+The logical streaming replication protocol sends individual transactions one by one. This means that all messages between a pair of `BEGIN` and `COMMIT` messages belong to the same transaction.
+
+YugabyteDB supports both of the replication protocols used in PostgreSQL for logical replication, maintaining the same semantics described in PostgreSQL:
+
+- Streaming Replication Protocol - This protocol is followed by all output plugins.
+
+- Logical Streaming Replication Protocol - This protocol is followed by `pgoutput` and `yboutput`, in addition to the Streaming Replication Protocol.
+
+{{< note title="Note" >}}
+
+YugabyteDB does not support Physical Replication.
+ +{{< /note >}} + +## Learn more + +[CDC using Logical Replication architecture](../../../../architecture/docdb-replication/cdc-logical-replication/) diff --git a/docs/content/preview/explore/change-data-capture/using-logical-replication/monitor.md b/docs/content/preview/explore/change-data-capture/using-logical-replication/monitor.md index 8aaf773a9913..13887ea4efe7 100644 --- a/docs/content/preview/explore/change-data-capture/using-logical-replication/monitor.md +++ b/docs/content/preview/explore/change-data-capture/using-logical-replication/monitor.md @@ -9,4 +9,79 @@ menu: identifier: monitor weight: 30 type: docs ---- \ No newline at end of file +--- + +## Catalog objects and views + +### pg_publication + +Contains all publication objects contained in the database. + +| Column name | Data type | Description | +| :----- | :----- | :----- | +| oid | oid | Row identifier | +| pubname | name | Name of the publication | +| pubowner | oid | OID of the owner. | +| puballtables | bool | If true, this publication includes all tables in the database including those added in the future. | +| pubinsert | bool | If true, INSERT operations are replicated for tables in the publication. | +| pubupdate | bool | If true, UPDATE operations are replicated for tables in the publication. | +| pubdelete | bool | If true, DELETE operations are replicated for tables in the publication. | +| pubtruncate | bool | If true, TRUNCATE operations are replicated for tables in the publication. | + +### pg_publication_rel + +Contains mapping between publications and tables. This is a many-to-many mapping. + +| Column name | Data type | Description | +| :----- | :----- | :----- | +| oid | oid | Row identifier. | +| prpubid | oid | OID of the publication. References pg_publication.oid. | +| prrelid| oid | OID of the relation. References pg_class.oid. | + +### pg_publication_tables + +Contains mapping between publications and tables. 
It is a wrapper over `pg_publication_rel`, except that it expands publications defined as `FOR ALL TABLES`, so for such publications there is a row for each eligible table.
+
+| Column name | Data type | Description |
+| :----- | :----- | :----- |
+| pubname | name | Name of the publication. |
+| schemaname | name | Name of the schema containing the table. |
+| tablename | name | Name of the table. |
+
+### pg_replication_slots
+
+Provides a list of all replication slots that currently exist on the database cluster, along with their metadata.
+
+| Column name | Data type | Description |
+| :----- | :----- | :----- |
+| slot_name | name | Name of the replication slot. |
+| plugin | name | Output plugin name (always `yboutput`). |
+| slot_type | text | Always `logical`. |
+| datoid | oid | The OID of the database this slot is associated with. |
+| database | text | The name of the database this slot is associated with. |
+| temporary | boolean | True if this is a temporary replication slot. Temporary slots are automatically dropped on error or when the session has finished. |
+| active | boolean | True if this slot is currently actively being used. In YSQL, an "active" replication slot means a slot that has been consumed at least once in a certain time frame. The time frame is defined using the `ysql_replication_slot_activity_threshold` flag, which has a default of 5 minutes. |
+| active_pid | integer | The process ID of the session using this slot if the slot is currently actively being used. `NULL` if no replication process is ongoing. |
+| xmin | xid | The oldest transaction that this slot needs the database to retain. |
+| catalog_xmin | xid | Not applicable for YSQL. Always set to xmin. |
+| restart_lsn | pg_lsn | The LSN of the oldest change record that might still be required by the consumer of this slot, and thus won't be automatically removed during checkpoints. |
+| confirmed_flush_lsn | pg_lsn | The LSN up to which the logical slot's consumer has confirmed receiving data.
Data older than this is no longer available; transactions with a commit LSN lower than the `confirmed_flush_lsn` can no longer be consumed. |
+| yb_stream_id | text | UUID of the CDC stream. |
+| yb_restart_commit_ht | int8 | A uint64 representation of the commit Hybrid Time corresponding to the `restart_lsn`. This can be used by a client (such as the YugabyteDB connector) to perform a consistent snapshot (as of the `consistent_point`) when a replication slot already exists. |
+
+## CDC Service metrics
+
+These metrics provide information about the CDC service in YugabyteDB.
+
+| Metric name | Type | Description |
+| :---- | :---- | :---- |
+| cdcsdk_change_event_count | `long` | The number of records sent by the CDC Service. |
+| cdcsdk_traffic_sent | `long` | The total traffic sent, in bytes. |
+| cdcsdk_event_lag_micros | `long` | The lag, in microseconds, calculated by subtracting the timestamp of the last record sent to the CDC connector from the timestamp of the latest record in the tablet's WAL. |
+| cdcsdk_expiry_time_ms | `long` | The time remaining, in milliseconds, to read records from the WAL before the stream expires. |
+
+## Connector metrics
+
+
+
+Refer to [Monitoring](../yugabytedb-connector/#monitoring) for information on YugabyteDB connector metrics.
diff --git a/docs/content/preview/explore/change-data-capture/using-logical-replication/overview.md b/docs/content/preview/explore/change-data-capture/using-logical-replication/overview.md
deleted file mode 100644
index dbf2a9522fa9..000000000000
--- a/docs/content/preview/explore/change-data-capture/using-logical-replication/overview.md
+++ /dev/null
@@ -1,12 +0,0 @@
----
-title: Overview of CDC - logical replication
-linkTitle: Overview
-description: Change Data Capture in YugabyteDB.
-headcontent: Change Data Capture in YugabyteDB -menu: - preview: - parent: explore-change-data-capture-logical-replication - identifier: overview - weight: 10 -type: docs ---- \ No newline at end of file diff --git a/docs/content/preview/explore/change-data-capture/using-logical-replication/yugabtyedb-connector.md b/docs/content/preview/explore/change-data-capture/using-logical-replication/yugabtyedb-connector.md deleted file mode 100644 index bb203f0a9b04..000000000000 --- a/docs/content/preview/explore/change-data-capture/using-logical-replication/yugabtyedb-connector.md +++ /dev/null @@ -1,12 +0,0 @@ ---- -title: YugabyteDB connector -headerTitle: YugabyteDB connector -linkTitle: YugabyteDB connector -description: YugabyteDB connector for Change Data Capture in YugabyteDB. -menu: - preview: - parent: explore-change-data-capture-logical-replication - identifier: yugabytedb-connector - weight: 70 -type: docs ---- \ No newline at end of file diff --git a/docs/content/preview/explore/change-data-capture/using-logical-replication/yugabytedb-connector.md b/docs/content/preview/explore/change-data-capture/using-logical-replication/yugabytedb-connector.md new file mode 100644 index 000000000000..df1d0eed2805 --- /dev/null +++ b/docs/content/preview/explore/change-data-capture/using-logical-replication/yugabytedb-connector.md @@ -0,0 +1,1679 @@ +--- +title: YugabyteDB connector +headerTitle: YugabyteDB connector +linkTitle: YugabyteDB connector +description: YugabyteDB connector for Change Data Capture in YugabyteDB. +menu: + preview: + parent: explore-change-data-capture-logical-replication + identifier: yugabytedb-connector + weight: 70 +type: docs +--- + +The YugabyteDB Connector is based on the Debezium Connector, and captures row-level changes in the schemas of a YugabyteDB database using the PostgreSQL replication protocol. + +The first time it connects to a YugabyteDB server, the connector takes a consistent snapshot of all schemas. 
After that snapshot is complete, the connector continuously captures row-level changes that insert, update, and delete database content, and that were committed to a YugabyteDB database. The connector generates data change event records and streams them to Kafka topics. For each table, the default behavior is that the connector streams all generated events to a separate Kafka topic for that table. Applications and services consume data change event records from that topic. + +## Overview + +YugabyteDB CDC using logical decoding is a mechanism that allows the extraction of changes that were committed to the transaction log and the processing of these changes in a user-friendly manner with the help of a [PostgreSQL output plugin](https://www.postgresql.org/docs/11/logicaldecoding-output-plugin.html). The output plugin enables clients to consume the changes. + +The YugabyteDB connector contains two main parts that work together to read and process database changes: + +* You must configure a replication slot that uses your chosen output plugin before running the YugabyteDB server. The plugin can be one of the following: + + + * `yboutput` is the plugin packaged with YugabyteDB. It is maintained by Yugabyte and is always present with the distribution. + + * `pgoutput` is the standard logical decoding output plugin in PostgreSQL 10+. It is maintained by the PostgreSQL community, and used by PostgreSQL itself for logical replication. YugabyteDB bundles this plugin with the standard distribution so it is always present and no additional libraries need to be installed. The YugabyteDB connector interprets the raw replication event stream directly into change events. + + +* Java code (the actual Kafka Connect connector) that reads the changes produced by the chosen logical decoding output plugin. It uses the [streaming replication protocol](https://www.postgresql.org/docs/11/protocol-replication.html), by means of the YugabyteDB JDBC driver. 
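+
+For example, a replication slot using one of these plugins can be created from `ysqlsh` before deploying the connector (the slot name is illustrative):
+
+```sql
+-- Create a logical replication slot that uses the yboutput plugin.
+SELECT * FROM pg_create_logical_replication_slot('cdc_slot', 'yboutput');
+
+-- Verify the slot exists and inspect its metadata.
+SELECT slot_name, plugin, slot_type FROM pg_replication_slots;
+```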
+ +The connector produces a change event for every row-level insert, update, and delete operation that was captured, and sends change event records for each table in a separate Kafka topic. Client applications read the Kafka topics that correspond to the database tables of interest, and can react to every row-level event they receive from those topics. + +YugabyteDB normally purges write-ahead log (WAL) segments after some period of time. This means that the connector does not have the complete history of all changes that have been made to the database. Therefore, when the YugabyteDB connector first connects to a particular YugabyteDB database, it starts by performing a consistent snapshot of each of the configured tables. After the connector completes the snapshot, it continues streaming changes from the exact point at which the snapshot was made. This way, the connector starts with a consistent view of all of the data, and does not omit any changes that were made while the snapshot was being taken. + +The connector is tolerant of failures. As the connector reads changes and produces events, it records the LSN for each event. If the connector stops for any reason (including communication failures, network problems, or crashes), upon restart the connector continues reading the WAL where it last left off. + +{{< tip title="Use UTF-8 encoding" >}} + +Debezium supports databases with UTF-8 character encoding only. With a single-byte character encoding, it's not possible to correctly process strings that contain extended ASCII code characters. + +{{< /tip >}} + +## How the connector works + +To optimally configure and run a Debezium connector, it is helpful to understand how the connector performs snapshots, streams change events, determines Kafka topic names, and uses metadata. + +### Security + +To use the Debezium connector to stream changes from a YugabyteDB database, the connector must operate with specific privileges in the database. 
Although one way to grant the necessary privileges is to provide the user with `superuser` privileges, doing so potentially exposes your YugabyteDB data to unauthorized access. Rather than granting excessive privileges to the Debezium user, it is best to create a dedicated Debezium replication user to which you grant specific privileges.
+
+For more information about configuring privileges for the Debezium replication user, see [Setting up permissions](#setting-up-permissions).
+
+### Snapshots
+
+Most YugabyteDB servers are configured to not retain the complete history of the database in the WAL segments. This means that the YugabyteDB connector would be unable to see the entire history of the database by reading only the WAL. Consequently, the first time that the connector starts, it performs an initial consistent snapshot of the database.
+
+#### Default workflow behavior of initial snapshots
+
+The default behavior for performing a snapshot consists of the following steps. You can change this behavior by setting the `snapshot.mode` [connector configuration property](#advanced-configuration-properties) to a value other than `initial`.
+
+1. Start a transaction.
+2. Set the transaction read time to the [consistent point](../../../../architecture/docdb-replication/cdc-logical-replication/#initial-snapshot) associated with the replication slot.
+3. Execute the snapshot by running a `SELECT` query.
+4. Generate a `READ` event for each row and write it to the appropriate table-specific Kafka topic.
+5. Record the successful completion of the snapshot in the connector offsets.
+
+If the connector fails, is rebalanced, or stops after Step 1 begins but before Step 5 completes, upon restart the connector begins a new snapshot. After the connector completes its initial snapshot, the YugabyteDB connector continues streaming from the position that it read in Step 2. This ensures that the connector does not miss any updates.
If the connector stops again for any reason, upon restart, the connector continues streaming changes from where it previously left off. + +The following table describes the options for the `snapshot.mode` connector configuration property. + +| Option | Description | +| :--- | :--- | +| `never` | The connector never performs snapshots. When a connector is configured this way, its behavior when it starts is as follows. If there is a previously stored LSN in the Kafka offsets topic, the connector continues streaming changes from that position. If no LSN has been stored, the connector starts streaming changes from the point in time when the YugabyteDB logical replication slot was created on the server. The `never` snapshot mode is beneficial only when you know all data of interest is still reflected in the WAL. | +| `initial` (default) | The connector performs a database snapshot when no Kafka offsets topic exists. After the database snapshot completes the Kafka offsets topic is written. If there is a previously stored LSN in the Kafka offsets topic, the connector continues streaming changes from that position. | +| `initial_only` | The connector performs a database snapshot and stops before streaming any change event records. If the connector had started but did not complete a snapshot before stopping, the connector restarts the snapshot process and stops when the snapshot completes. | + +### Streaming changes + +The YugabyteDB connector typically spends the vast majority of its time streaming changes from the YugabyteDB server to which it is connected. This mechanism relies on [PostgreSQL's replication protocol](https://www.postgresql.org/docs/11/protocol-replication.html). This protocol enables clients to receive changes from the server as they are committed in the server's transaction logs. + +Whenever the server commits a transaction, a separate server process invokes a callback function from the [logical decoding plugin](../key-concepts/#output-plugin). 
This function processes the changes from the transaction, converts them to a specific format and writes them on an output stream, which can then be consumed by clients. + +The YugabyteDB connector acts as a YugabyteDB client. When the connector receives changes it transforms the events into Debezium create, update, or delete events that include the LSN of the event. The YugabyteDB connector forwards these change events in records to the Kafka Connect framework, which is running in the same process. The Kafka Connect process asynchronously writes the change event records in the same order in which they were generated to the appropriate Kafka topic. + +Periodically, Kafka Connect records the most recent offset in another Kafka topic. The offset indicates source-specific position information that Debezium includes with each event. For the YugabyteDB connector, the LSN recorded in each change event is the offset. + +When Kafka Connect gracefully shuts down, it stops the connectors, flushes all event records to Kafka, and records the last offset received from each connector. When Kafka Connect restarts, it reads the last recorded offset for each connector, and starts each connector at its last recorded offset. When the connector restarts, it sends a request to the YugabyteDB server to send the events starting just after that position. + +### Logical decoding plugin support + +As of YugabyteDB v2024.1.1 and later, YugabyteDB supports the [yboutput plugin](../key-concepts/#output-plugin), a native output plugin for logical decoding. + +Additionally, YugabyteDB also supports the PostgreSQL `pgoutput` plugin natively. This means that the YugabyteDB connector can work with an existing setup configured using `pgoutput`. + +### Topic names + +By default, the YugabyteDB connector writes change events for all `INSERT`, `UPDATE`, and `DELETE` operations that occur in a table to a single Apache Kafka topic that is specific to that table. 
The connector names change event topics as _topicPrefix.schemaName.tableName_. + +The components of a topic name are as follows: + +* _topicPrefix_ - the topic prefix as specified by the `topic.prefix` configuration property. +* _schemaName_ - the name of the database schema in which the change event occurred. +* _tableName_ - the name of the database table in which the change event occurred. + +For example, suppose that `dbserver` is the topic prefix in the configuration for a connector that is capturing changes in a YugabyteDB installation that has a `yugabyte` database and an `inventory` schema that contains four tables: `products`, `products_on_hand`, `customers`, and `orders`. The connector would stream records to these four Kafka topics: + +* `dbserver.inventory.products` +* `dbserver.inventory.products_on_hand` +* `dbserver.inventory.customers` +* `dbserver.inventory.orders` + +Now suppose that the tables are not part of a specific schema but were created in the default public YugabyteDB schema. The names of the Kafka topics would be: + +* `dbserver.public.products` +* `dbserver.public.products_on_hand` +* `dbserver.public.customers` +* `dbserver.public.orders` + +The connector applies similar naming conventions to label its [transaction metadata topics](#transaction-metadata). + +If the default topic names don't meet your requirements, you can configure custom topic names. To configure custom topic names, you specify regular expressions in the logical topic routing SMT. For more information about using the logical topic routing SMT to customize topic naming, see the Debezium documentation on [Topic routing](https://debezium.io/documentation/reference/2.5/transformations/topic-routing.html). + +### Transaction metadata + +Debezium can generate events that represent transaction boundaries and that enrich data change event messages. 
+
+{{< note title="Limits on when Debezium receives transaction metadata" >}}
+
+Debezium registers and receives metadata only for transactions that occur _after you deploy the connector_. Metadata for transactions that occur before you deploy the connector is not available.
+
+{{< /note >}}
+
+For every transaction `BEGIN` and `END`, Debezium generates an event containing the following fields:
+
+* `status` - `BEGIN` or `END`.
+* `id` - String representation of the unique transaction identifier, composed of the YugabyteDB transaction ID and the LSN of the given operation, separated by a colon; that is, the format is `txID:LSN`.
+* `ts_ms` - The time of a transaction boundary event (`BEGIN` or `END` event) at the data source. If the data source does not provide Debezium with the event time, then the field instead represents the time at which Debezium processes the event.
+* `event_count` (for `END` events) - The total number of events emitted by the transaction.
+* `data_collections` (for `END` events) - An array of pairs of `data_collection` and `event_count` that provides the number of events emitted by changes originating from a given data collection.
+
+For example:
+
+```output.json
+{
+  "status": "BEGIN",
+  "id": "571:53195829",
+  "ts_ms": 1486500577125,
+  "event_count": null,
+  "data_collections": null
+}
+
+{
+  "status": "END",
+  "id": "571:53195832",
+  "ts_ms": 1486500577691,
+  "event_count": 2,
+  "data_collections": [
+    {
+      "data_collection": "s1.a",
+      "event_count": 1
+    },
+    {
+      "data_collection": "s2.a",
+      "event_count": 1
+    }
+  ]
+}
+```
+
+Unless overridden via the `transaction.topic` option, transaction events are written to the topic named _topicPrefix_.transaction.
+
+#### Change data event enrichment
+
+When transaction metadata is enabled, the data message `Envelope` is enriched with a new `transaction` field.
This field provides information about every event in the form of a composite of fields: + +* `id` - string representation of unique transaction identifier +* `total_order` - absolute position of the event among all events generated by the transaction +* `data_collection_order` - the per-data collection position of the event among all events emitted by the transaction + +Following is an example of a message: + +```output.json +{ + "before": null, + "after": { + "pk": "2", + "aa": "1" + }, + "source": { + ... + }, + "op": "c", + "ts_ms": "1580390884335", + "transaction": { + "id": "571:53195832", + "total_order": "1", + "data_collection_order": "1" + } +} +``` + +## Data change events + +The YugabyteDB connector generates a data change event for each row-level `INSERT`, `UPDATE`, and `DELETE` operation. Each event contains a key and a value. The structure of the key and the value depends on the table that was changed. + +Debezium and Kafka Connect are designed around _continuous streams of event messages_. However, the structure of these events may change over time, which can be difficult for consumers to handle. To address this, each event contains the schema for its content or, if you are using a schema registry, a schema ID that a consumer can use to obtain the schema from the registry. This makes each event self-contained. + +The following skeleton JSON shows the basic four parts of a change event. However, how you configure the Kafka Connect converter that you choose to use in your application determines the representation of these four parts in change events. A `schema` field is in a change event only when you configure the converter to produce it. Likewise, the event key and event payload are in a change event only if you configure a converter to produce it. If you use the JSON converter and you configure it to produce all four basic change event parts, change events have this structure: + +```output.json +{ + "schema": { --> 1 + ... 
+ }, + "payload": { --> 2 + ... + }, + "schema": { --> 3 + ... + }, + "payload": { --> 4 + ... + } +} +``` + +The following table describes the content of the change events. + +| Item | Field name | Description | +| :--: | :--------- | :---------- | +| 1 | `schema` | The first `schema` field is part of the event key. It specifies a Kafka Connect schema that describes what is in the event key's `payload` portion. In other words, the first `schema` field describes the structure of the primary key, or the unique key if the table does not have a primary key, for the table that was changed. | +| 2 | `payload` | The first `payload` field is part of the event key. It has the structure described by the previous `schema` field and it contains the key for the row that was changed. | +| 3 | `schema` | The second `schema` field is part of the event value. It specifies the Kafka Connect schema that describes what is in the event value's `payload` portion. In other words, the second `schema` describes the structure of the row that was changed. Typically, this schema contains nested schemas. | +| 4 | `payload` | The second `payload` field is part of the event value. It has the structure described by the previous `schema` field and it contains the actual data for the row that was changed. | + +By default, the connector streams change event records to [Kafka topics](#topic-names) with names that are the same as the event's originating table. + +{{< note title="Note" >}} + +Starting with Kafka 0.10, Kafka can optionally record the event key and value with the timestamp at which the message was created (recorded by the producer) or written to the log by Kafka. + +{{< /note >}} + +{{< warning title="Warning" >}} + +The YugabyteDB connector ensures that all Kafka Connect schema names adhere to the Avro schema name format. This means that the logical server name must start with a Latin letter or an underscore, that is, `a-z`, `A-Z`, or `_`. 
Each remaining character in the logical server name and each character in the schema and table names must be a Latin letter, a digit, or an underscore, that is, `a-z`, `A-Z`, `0-9`, or `_`. If there is an invalid character it is replaced with an underscore character. + +This can lead to unexpected conflicts if the topic prefix, a schema name, or a table name contains invalid characters, and the only characters that distinguish names from one another are invalid and thus replaced with underscores. + +{{< /warning >}} + +### Change event keys + +For a given table, the change event's key has a structure that contains a field for each column in the primary key of the table at the time the event was created. Alternatively, if the table has `REPLICA IDENTITY` set to `FULL` there is a field for each unique key constraint. + +Consider a `customers` table defined in the `public` database schema and the example of a change event key for that table. + +**Example table:** + +```sql +CREATE TABLE customers ( + id SERIAL, + first_name VARCHAR(255) NOT NULL, + last_name VARCHAR(255) NOT NULL, + email VARCHAR(255) NOT NULL, + PRIMARY KEY(id) +); +``` + +#### Example change event key + +If the `topic.prefix` connector configuration property has the value `YugabyteDB_server`, every change event for the `customers` table while it has this definition has the same key structure, which in JSON looks like this: + +```output.json +{ + "schema": { --> 1 + "type": "struct", + "name": "YugabyteDB_server.public.customers.Key", --> 2 + "optional": false, --> 3 + "fields": [ --> 4 + { + "name": "id", + "index": "0", + "schema": { + "type": "INT32", + "optional": "false" + } + } + ] + }, + "payload": { --> 5 + "id": "1" + }, +} +``` + +**Description of a change event key:** + +| Item | Field name | Description | +| :--- | :--------- | :---------- | +| 1 | schema | The schema portion of the key specifies a Kafka Connect schema that describes what is in the key's `payload` portion. 
| +| 2 | YugabyteDB_server.public.customers.Key | Name of the schema that defines the structure of the key's payload. This schema describes the structure of the primary key for the table that was changed. Key schema names have the format _connector-name.database-name.table-name.Key_. In this example:
`YugabyteDB_server` is the name of the connector that generated this event.
`public` is the schema which contains the table that was changed.
`customers` is the table that was updated. | +| 3 | optional | Indicates whether the event key must contain a value in its `payload` field. In this example, a value in the key's payload is required. | +| 4 | fields | Specifies each field that is expected in the payload, including each field's name, index, and schema. | +| 5 | payload | Contains the key for the row for which this change event was generated. In this example, the key, contains a single `id` field whose value is `1`. | + +{{< note title="Note" >}} + +Although the `column.exclude.list` and `column.include.list` connector configuration properties allow you to capture only a subset of table columns, all columns in a primary or unique key are always included in the event's key. + +{{< /note >}} + +{{< warning title="Warning" >}} + +CDC is not supported for tables without primary keys. + +{{< /warning >}} + +### Change event values + +The value in a change event is a bit more complicated than the key. Like the key, the value has a `schema` section and a `payload` section. The `schema` section contains the schema that describes the `Envelope` structure of the `payload` section, including its nested fields. Change events for operations that create, update or delete data all have a value payload with an envelope structure. + +Consider the same sample table that was used to show an example of a change event key: + +```sql +CREATE TABLE customers ( + id SERIAL, + first_name VARCHAR(255) NOT NULL, + last_name VARCHAR(255) NOT NULL, + email VARCHAR(255) NOT NULL, + PRIMARY KEY(id) +); +``` + +The value portion of a change event for a change to this table varies according to the `REPLICA IDENTITY` setting and the operation that the event is for. 
+
+### Replica Identity
+
+[REPLICA IDENTITY](https://www.postgresql.org/docs/11/sql-altertable.html#SQL-CREATETABLE-REPLICA-IDENTITY) is a table-level setting that determines the amount of information that is available to the logical decoding plugin for `UPDATE` and `DELETE` events. More specifically, the setting of `REPLICA IDENTITY` controls what (if any) information is available for the previous values of the table columns involved, whenever an `UPDATE` or `DELETE` event occurs.
+
+There are four possible values for `REPLICA IDENTITY`:
+
+* `CHANGE` - Emitted events for `UPDATE` operations contain only the values of the changed columns along with the primary key columns, with no previous values present. Emitted events for `DELETE` operations contain only the previous values of the primary key columns of the table.
+* `DEFAULT` - Only emitted events for `DELETE` operations contain the previous values for the primary key columns of the table. For an `UPDATE` event, no previous values are present, and the new values are present for all columns in the table.
+* `FULL` - Emitted events for `UPDATE` and `DELETE` operations contain the previous values of all columns in the table.
+* `NOTHING` - Emitted events for `UPDATE` and `DELETE` operations do not contain any information about the previous value of any table column.
+
+{{< note title="Note" >}}
+
+YugabyteDB supports the replica identity `CHANGE` only with the plugin `yboutput`.
+ +{{< /note >}} + +#### Message formats for replica identities + +Consider the following employee table into which a row is inserted, subsequently updated, and deleted: + +```sql +CREATE TABLE employee ( + employee_id INT PRIMARY KEY, + employee_name VARCHAR, + employee_dept TEXT); + +INSERT INTO employee VALUES (1001, 'Alice', 'Packaging'); + +UPDATE employee SET employee_name = 'Bob' WHERE employee_id = 1001; + +DELETE FROM employee WHERE employee_id = 1001; +``` + +{{< tabpane text=true >}} + + {{% tab header="CHANGE" lang="change" %}} + +**yboutput plugin** + + + + + + + + + + + + + + +
INSERT UPDATE DELETE
+
+{
+  "before": null,
+  "after": {
+    "employee_id": {
+        "value": 1001,
+        "set": true
+    },
+    "employee_name": {
+      "value": "Alice",
+      "set": true
+    },
+    "employee_dept": {
+        "value": "Packaging",
+        "set": true
+    }
+  },
+  "op": "c"
+}
+
+
+
+{
+  "before": null,
+  "after": {
+    "employee_id": {
+        "value": 1001,
+        "set": true
+    },
+    "employee_name": {
+      "value": "Bob",
+      "set": true
+    },
+    "employee_dept": null
+  },
+  "op": "u"
+}
+
+
+
+{
+  "before": {
+    "employee_id": {
+      "value": 1001,
+      "set": true
+    },
+    "employee_name": null,
+    "employee_dept": null
+  },
+  "after": null,
+  "op": "d"
+}
+
+
+ + {{% /tab %}} + + {{% tab header="DEFAULT" lang="default" %}} + +**yboutput plugin** + + + + + + + + + + + + + + +
INSERT UPDATE DELETE
+
+{
+  "before": null,
+  "after": {
+    "employee_id": {
+        "value": 1001,
+        "set": true
+    },
+    "employee_name": {
+      "value": "Alice",
+      "set": true
+    },
+    "employee_dept": {
+        "value": "Packaging",
+        "set": true
+    }
+  },
+  "op": "c"
+}
+
+
+
+{
+  "before": null,
+  "after": {
+    "employee_id": {
+        "value": 1001,
+        "set": true
+    },
+    "employee_name": {
+      "value": "Bob",
+      "set": true
+    },
+    "employee_dept": null
+  },
+  "op": "u"
+}
+
+
+
+{
+  "before": {
+    "employee_id": {
+      "value": 1001,
+      "set": true
+    },
+    "employee_name": null,
+    "employee_dept": null
+  },
+  "after": null,
+  "op": "d"
+}
+
+
+ +**pgoutput plugin** + + + + + + + + + + + + + + + +
INSERT UPDATE DELETE
+
+{
+  "before": null,
+  "after": {
+    "employee_id": 1001,
+    "employee_name": "Alice",
+    "employee_dept": "Packaging"
+  },
+  "op": "c"
+}
+
+
+
+{
+  "before": null,
+  "after": {
+    "employee_id": 1001,
+    "employee_name": "Bob",
+    "employee_dept": "Packaging"
+  },
+  "op": "u"
+}
+
+
+
+{
+  "before": {
+    "employee_id": 1001
+  },
+  "after": null,
+  "op": "d"
+}
+
+
+ + {{% /tab %}} + + {{% tab header="FULL" lang="full" %}} + +**yboutput plugin** + + + + + + + + + + + + + + +
INSERT UPDATE DELETE
+
+{
+  "before": null,
+  "after": {
+    "employee_id": {
+        "value": 1001,
+        "set": true
+    },
+    "employee_name": {
+      "value": "Alice",
+      "set": true
+    },
+    "employee_dept": {
+        "value": "Packaging",
+        "set": true
+    }
+  },
+  "op": "c"
+}
+
+
+
+{
+  "before": {
+    "employee_id": {
+        "value": 1001,
+        "set": true
+    },
+    "employee_name": {
+      "value": "Alice",
+      "set": true
+    },
+    "employee_dept": {
+      "value": "Packaging",
+      "set": true
+    }
+  },
+  "after": {
+    "employee_id": {
+        "value": 1001,
+        "set": true
+    },
+    "employee_name": {
+      "value": "Bob",
+      "set": true
+    },
+    "employee_dept": {
+      "value": "Packaging",
+      "set": true
+    }
+  },
+  "op": "u"
+}
+
+
+
+{
+  "before": {
+    "employee_id": {
+        "value": 1001,
+        "set": true
+    },
+    "employee_name": {
+      "value": "Bob",
+      "set": true
+    },
+    "employee_dept": {
+      "value": "Packaging",
+      "set": true
+    }
+  },
+  "after": null,
+  "op": "d"
+}
+
+
+ +**pgoutput plugin** + + + + + + + + + + + + + + +
INSERT UPDATE DELETE
+
+{
+  "before": null,
+  "after": {
+    "employee_id": 1001,
+    "employee_name": "Alice",
+    "employee_dept": "Packaging"
+  },
+  "op": "c"
+}
+
+
+
+{
+  "before": {
+    "employee_id": 1001,
+    "employee_name": "Alice",
+    "employee_dept": "Packaging"
+  },
+  "after": {
+    "employee_id": 1001,
+    "employee_name": "Bob",
+    "employee_dept": "Packaging"
+  },
+  "op": "u"
+}
+
+
+
+{
+  "before": {
+    "employee_id": 1001,
+    "employee_name": "Bob",
+    "employee_dept": "Packaging"
+  },
+  "after": null,
+  "op": "d"
+}
+
+
+ + {{% /tab %}} + + {{% tab header="NOTHING" lang="nothing" %}} + +**yboutput plugin** + + + + + + + + + + +
INSERT
+
+{
+  "before": null,
+  "after": {
+    "employee_id": {
+        "value": 1001,
+        "set": true
+    },
+    "employee_name": {
+      "value": "Alice",
+      "set": true
+    },
+    "employee_dept": {
+        "value": "Packaging",
+        "set": true
+    }
+  },
+  "op": "c"
+}
+
+
+ +**pgoutput plugin** + + + + + + + + + + +
INSERT
+
+{
+  "before": null,
+  "after": {
+    "employee_id": 1001,
+    "employee_name": "Alice",
+    "employee_dept": "Packaging"
+  },
+  "op": "c"
+}
+
+
+ + {{% /tab %}} + +{{< /tabpane >}} + +{{< note title="Note" >}} + +If `UPDATE` and `DELETE` operations will be performed on a table in publication without any replica identity (that is, `REPLICA IDENTITY` set to `NOTHING`), then the operations will cause an error on the publisher. For more details, see [Publication](https://www.postgresql.org/docs/11/logical-replication-publication.html). + +{{< /note >}} + +### *create* events + +The following example shows the value portion of a change event that the connector generates for an operation that creates data in the `customers` table: + +```output.json +{ + "schema": { --> 1 + "type": "struct", + "fields": [ + { + "type": "struct", + "fields": [ + { + "type": "int32", + "optional": false, + "field": "id" + }, + { + "type": "string", + "optional": false, + "field": "first_name" + }, + { + "type": "string", + "optional": false, + "field": "last_name" + }, + { + "type": "string", + "optional": false, + "field": "email" + } + ], + "optional": true, + "name": "YugabyteDB_server.inventory.customers.Value", --> 2 + "field": "before" + }, + { + "type": "struct", + "fields": [ + { + "type": "int32", + "optional": false, + "field": "id" + }, + { + "type": "string", + "optional": false, + "field": "first_name" + }, + { + "type": "string", + "optional": false, + "field": "last_name" + }, + { + "type": "string", + "optional": false, + "field": "email" + } + ], + "optional": true, + "name": "YugabyteDB_server.inventory.customers.Value", + "field": "after" + }, + { + "type": "struct", + "fields": [ + { + "type": "string", + "optional": false, + "field": "version" + }, + { + "type": "string", + "optional": false, + "field": "connector" + }, + { + "type": "string", + "optional": false, + "field": "name" + }, + { + "type": "int64", + "optional": false, + "field": "ts_ms" + }, + { + "type": "boolean", + "optional": true, + "default": false, + "field": "snapshot" + }, + { + "type": "string", + "optional": false, + "field": "db" + }, + 
{ + "type": "string", + "optional": false, + "field": "schema" + }, + { + "type": "string", + "optional": false, + "field": "table" + }, + { + "type": "int64", + "optional": true, + "field": "txId" + }, + { + "type": "int64", + "optional": true, + "field": "lsn" + }, + { + "type": "int64", + "optional": true, + "field": "xmin" + } + ], + "optional": false, + "name": "io.debezium.connector.postgresql.Source", --> 3 + "field": "source" + }, + { + "type": "string", + "optional": false, + "field": "op" + }, + { + "type": "int64", + "optional": true, + "field": "ts_ms" + } + ], + "optional": false, + "name": "YugabyteDB_server.public.customers.Envelope" --> 4 + }, + "payload": { --> 5 + "before": null, --> 6 + "after": { --> 7 + "id": 1, + "first_name": "Anne", + "last_name": "Kretchmar", + "email": "annek@noanswer.org" + }, + "source": { --> 8 + "version": "2.5.2.Final", + "connector": "YugabyteDB", + "name": "YugabyteDB_server", + "ts_ms": 1559033904863, + "snapshot": true, + "db": "postgres", + "sequence": "[\"24023119\",\"24023128\"]", + "schema": "public", + "table": "customers", + "txId": 555, + "lsn": 24023128, + "xmin": null + }, + "op": "c", --> 9 + "ts_ms": 1559033904863 --> 10 + } +} +``` + +The following table describes the create event value fields. + +| Item | Field name | Description | +| :---- | :------ | :------------ | +| 1 | schema | The value's schema, which describes the structure of the value's payload. A change event's value schema is the same in every change event that the connector generates for a particular table. | +| 2 | name | In the schema section, each name field specifies the schema for a field in the value's payload.

`YugabyteDB_server.inventory.customers.Value` is the schema for the payload's _before_ and _after_ fields. This schema is specific to the customers table.

Names of schemas for _before_ and _after_ fields are of the form `logicalName.tableName.Value`, which ensures that the schema name is unique in the database. This means that when using the [Avro Converter](https://www.confluent.io/hub/confluentinc/kafka-connect-avro-converter), the resulting Avro schema for each table in each logical source has its own evolution and history. |
+| 3 | name | `io.debezium.connector.postgresql.Source` is the schema for the payload's `source` field. This schema is specific to the YugabyteDB connector. The connector uses it for all events that it generates. |
+| 4 | name | `YugabyteDB_server.public.customers.Envelope` is the schema for the overall structure of the payload, where `YugabyteDB_server` is the connector name, `public` is the schema, and `customers` is the table. |
+| 5 | payload | The value's actual data. This is the information that the change event is providing.

It may appear that the JSON representations of the events are much larger than the rows they describe. This is because the JSON representation must include the schema and the payload portions of the message. However, by using the Avro converter, you can significantly decrease the size of the messages that the connector streams to Kafka topics. | +| 6 | before | An optional field that specifies the state of the row before the event occurred. When the op field is `c` for create, as it is in this example, the `before` field is `null` as this change event is for new content.
{{< note title="Note" >}}Whether or not this field is available is dependent on the [REPLICA IDENTITY](#replica-identity) setting for each table.{{< /note >}} | +| 7 | after | An optional field that specifies the state of the row after the event occurred. In this example, the `after` field contains the values of the new row's `id`, `first_name`, `last_name`, and `email` columns. | +| 8 | source | Mandatory field that describes the source metadata for the event. This field contains information that you can use to compare this event with other events, with regard to the origin of the events, the order in which the events occurred, and whether events were part of the same transaction. The source metadata includes:
  • Debezium version
  • Connector type and name
  • Database and table that contains the new row
  • Stringified JSON array of additional offset information. The first value is always the last committed LSN, the second value is always the current LSN. Either value may be null.
  • Schema name
  • If the event was part of a snapshot
  • ID of the transaction in which the operation was performed
  • Offset of the operation in the database log
  • Timestamp for when the change was made in the database
| +| 9 | op | Mandatory string that describes the type of operation that caused the connector to generate the event. In this example, `c` indicates that the operation created a row. Valid values are:
  • `c` = create
  • `r` = read (applies to only snapshots)
  • `u` = update
  • `d` = delete
| +| 10 | ts_ms | Optional field that displays the time at which the connector processed the event. The time is based on the system clock in the JVM running the Kafka Connect task.

In the `source` object, `ts_ms` indicates the time that the change was made in the database. By comparing the value for `payload.source.ts_ms` with the value for `payload.ts_ms`, you can determine the lag between the source database update and Debezium. |
+
+### *update* events
+
+The value of a change event for an update in the sample `customers` table has the same schema as a create event for that table. Likewise, the event value's payload has the same structure. However, the event value payload contains different values in an update event. The following is an example of a change event value in an event that the connector generates for an update in the `customers` table:
+
+```output.json
+{
+  "schema": { ... },
+  "payload": {
+    "before": null, --> 1
+    "after": { --> 2
+      "id": 1,
+      "first_name": "Anne Marie",
+      "last_name": "Kretchmar",
+      "email": "annek@noanswer.org"
+    },
+    "source": { --> 3
+      "version": "2.5.2.Final",
+      "connector": "YugabyteDB",
+      "name": "YugabyteDB_server",
+      "ts_ms": 1559033904863,
+      "snapshot": false,
+      "db": "postgres",
+      "schema": "public",
+      "table": "customers",
+      "txId": 556,
+      "lsn": 24023128,
+      "xmin": null
+    },
+    "op": "u", --> 4
+    "ts_ms": 1465584025523 --> 5
+  }
+}
+```
+
+The following table describes the update event value fields.
+
+| Item | Field name | Description |
+| :---- | :------ | :------------ |
+| 1 | before | An optional field that contains values that were in the row before the database commit. In this example, no previous value for any of the columns is present because the table's [REPLICA IDENTITY](#replica-identity) setting is `DEFAULT`. For an update event to contain the previous values of all columns in the row, you would have to change the `customers` table by running `ALTER TABLE customers REPLICA IDENTITY FULL`. |
+| 2 | after | An optional field that specifies the state of the row after the event occurred. In this example, the `first_name` value is now `Anne Marie`. 
| +| 3 | source | Mandatory field that describes the source metadata for the event. The `source` field structure has the same fields as in a create event, but some values are different. The source metadata includes:
  • Debezium version
  • Connector type and name
  • Database and table that contains the new row
  • Schema name
  • If the event was part of a snapshot (always `false` for _update_ events)
  • ID of the transaction in which the operation was performed
  • Offset of the operation in the database log
  • Timestamp for when the change was made in the database
| +| 4 | op | Mandatory string that describes the type of operation. In an update event value, the `op` field value is `u`, signifying that this row changed because of an update. | +| 5 | ts_ms | Optional field that displays the time at which the connector processed the event. The time is based on the system clock in the JVM running the Kafka Connect task.

In the `source` object, `ts_ms` indicates the time that the change was made in the database. By comparing the value for `payload.source.ts_ms` with the value for `payload.ts_ms`, you can determine the lag between the source database update and Debezium. | + +{{< note title="Note" >}} + +Updating the columns for a row's primary/unique key changes the value of the row's key. When a key changes, Debezium outputs three events: a `DELETE` event and a [tombstone event](#tombstone-events) with the old key for the row, followed by an event with the new key for the row. Details are in the next section. + +{{< /note >}} + +### Primary key updates + +An `UPDATE` operation that changes a row's primary key field(s) is known as a primary key change. For a primary key change, in place of sending an `UPDATE` event record, the connector sends a `DELETE` event record for the old key and a `CREATE` event record for the new (updated) key. + +### _delete_ events + +The value in a _delete_ change event has the same `schema` portion as create and update events for the same table. The `payload` portion in a delete event for the sample `customers` table looks like this: + +```output.json +{ + "schema": { ... }, + "payload": { + "before": { --> 1 + "id": 1 + }, + "after": null, --> 2 + "source": { --> 3 + "version": "2.5.4.Final", + "connector": "YugabyteDB", + "name": "YugabyteDB_server", + "ts_ms": 1559033904863, + "snapshot": false, + "db": "postgres", + "schema": "public", + "table": "customers", + "txId": 556, + "lsn": 46523128, + "xmin": null + }, + "op": "d", --> 4 + "ts_ms": 1465581902461 --> 5 + } +} +``` + +The following table describes the delete event value fields. + +| Item | Field name | Description | +| :---- | :------ | :------------ | +| 1 | before | Optional field that specifies the state of the row before the event occurred. In a _delete_ event value, the `before` field contains the values that were in the row before it was deleted with the database commit.

In this example, the before field contains only the primary key column because the table's [REPLICA IDENTITY](#replica-identity) setting is `DEFAULT`. | +| 2 | after | Optional field that specifies the state of the row after the event occurred. In a delete event value, the `after` field is `null`, signifying that the row no longer exists. | +| 3 | source | Mandatory field that describes the source metadata for the event. In a delete event value, the source field structure is the same as for create and update events for the same table. Many source field values are also the same. In a delete event value, the `ts_ms` and `lsn` field values, as well as other values, might have changed. But the source field in a delete event value provides the same metadata:
  • Debezium version
  • Connector type and name
  • Database and table that contained the deleted row
  • Schema name
  • If the event was part of a snapshot (always false for delete events)
  • ID of the transaction in which the operation was performed
  • Offset of the operation in the database log
  • Timestamp for when the change was made in the database
| +| 4 | op | Mandatory string that describes the type of operation. The `op` field value is `d`, signifying that this row was deleted. | +| 5 | ts_ms | Optional field that displays the time at which the connector processed the event. The time is based on the system clock in the JVM running the Kafka Connect task.

In the `source` object, `ts_ms` indicates the time that the change was made in the database. By comparing the value for `payload.source.ts_ms` with the value for `payload.ts_ms`, you can determine the lag between the source database update and Debezium. | + +A _delete_ change event record provides a consumer with the information it needs to process the removal of this row. + +YugabyteDB connector events are designed to work with [Kafka log compaction](https://kafka.apache.org/documentation#compaction). Log compaction enables removal of some older messages as long as at least the most recent message for every key is kept. This lets Kafka reclaim storage space while ensuring that the topic contains a complete data set and can be used for reloading key-based state. + +#### Tombstone events + +When a row is deleted, the _delete_ event value still works with log compaction, because Kafka can remove all earlier messages that have that same key. However, for Kafka to remove all messages that have that same key, the message value must be `null`. To make this possible, the YugabyteDB connector follows a _delete_ event with a special tombstone event that has the same key but a `null` value. + + + +## Data type mappings + +The YugabyteDB connector represents changes to rows with events that are structured like the table in which the row exists. The event contains a field for each column value. How that value is represented in the event depends on the YugabyteDB data type of the column. The following sections describe how the connector maps YugabyteDB data types to a literal type and a semantic type in event fields. + +* `literal` type describes how the value is literally represented using Kafka Connect schema types: `INT8`, `INT16`, `INT32`, `INT64`, `FLOAT32`, `FLOAT64`, `BOOLEAN`, `STRING`, `BYTES`, `ARRAY`, `MAP`, and `STRUCT`. 
+* `semantic` type describes how the Kafka Connect schema captures the meaning of the field using the name of the Kafka Connect schema for the field. + +If the default data type conversions do not meet your needs, you can [create a custom converter](https://debezium.io/documentation/reference/2.5/development/converters.html#custom-converters) for the connector. + +### Basic types + +| YugabyteDB data type| Literal type (schema type) | Semantic type (schema name) and Notes | +| :------------------ | :------------------------- | :-------------------------- | +| `BOOLEAN` | `BOOLEAN` | N/A | +| `BIT(1)` | `BOOLEAN` | N/A | +| `BIT( > 1)` | `BYTES` | `io.debezium.data.Bits`
The `length` schema parameter contains an integer that represents the number of bits. The resulting `byte[]` contains the bits in little-endian form and is sized to contain the specified number of bits. For example, `numBytes = n/8 + (n % 8 == 0 ? 0 : 1)` where `n` is the number of bits. | +| `BIT VARYING[(M)]` | `BYTES` | `io.debezium.data.Bits`
The `length` schema parameter contains an integer that represents the number of bits (2^31 - 1 in case no length is given for the column). The resulting `byte[]` contains the bits in little-endian form and is sized based on the content. The specified size (`M`) is stored in the length parameter of the `io.debezium.data.Bits` type. | +| `SMALLINT`, `SMALLSERIAL` | `INT16` | N/A | +| `INTEGER`, `SERIAL` | `INT32` | N/A | +| `BIGINT`, `BIGSERIAL`, `OID` | `INT64` | N/A | +| `REAL` | `FLOAT32` | N/A | +| `DOUBLE PRECISION` | `FLOAT64` | N/A | +| `CHAR [(M)]` | `STRING` | N/A | +| `VARCHAR [(M)]` | `STRING` | N/A | +| `CHARACTER [(M)]` | `STRING` | N/A | +| `CHARACTER VARYING [(M)]` | `STRING` | N/A | +| `TIMESTAMPTZ`, `TIMESTAMP WITH TIME ZONE` | `STRING` | `io.debezium.time.ZonedTimestamp`
A string representation of a timestamp with timezone information, where the timezone is GMT. | +| `TIMETZ`, `TIME WITH TIME ZONE` | `STRING` | `io.debezium.time.ZonedTime`
A string representation of a time value with timezone information, where the timezone is GMT. | +| `INTERVAL [P]` | `INT64` | `io.debezium.time.MicroDuration` (default)
The approximate number of microseconds for a time interval using the `365.25 / 12.0` formula for days per month average. | +| `INTERVAL [P]` | `STRING` | `io.debezium.time.Interval`
(when `interval.handling.mode` is `string`)
The string representation of the interval value that follows the pattern
P\Y\M\DT\H\M\S.
For example, `P1Y2M3DT4H5M6.78S`. | +| `BYTEA` | `BYTES` or `STRING` | n/a

Either the raw bytes (the default), a base64-encoded string, or a base64-url-safe-encoded String, or a hex-encoded string, based on the connector's `binary handling mode` setting.

Debezium only supports Yugabyte `bytea_output` configuration of value `hex`. For more information about PostgreSQL binary data types, see the [Binary data types](../../../../api/ysql/datatypes/type_binary/). | +| `JSON`, `JSONB` | `STRING` | `io.debezium.data.Json`
Contains the string representation of a JSON document, array, or scalar. | +| `UUID` | `STRING` | `io.debezium.data.Uuid`
Contains the string representation of a YugabyteDB UUID value. | +| `INT4RANGE` | `STRING` | Range of integer. | +| `INT8RANGE` | `STRING` | Range of `bigint`. | +| `NUMRANGE` | `STRING` | Range of `numeric`. | +| `TSRANGE` | `STRING` | n/a

The string representation of a timestamp range without a time zone. | +| `TSTZRANGE` | `STRING` | n/a

The string representation of a timestamp range with the local system time zone. | +| `DATERANGE` | `STRING` | n/a

The string representation of a date range. Always has an _exclusive_ upper bound. | +| `ENUM` | `STRING` | `io.debezium.data.Enum`

Contains the string representation of the YugabyteDB `ENUM` value. The set of allowed values is maintained in the allowed schema parameter. | + +### Temporal types + +Other than YugabyteDB's `TIMESTAMPTZ` and `TIMETZ` data types, which contain time zone information, how temporal types are mapped depends on the value of the `time.precision.mode` connector configuration property. The following sections describe these mappings: + +* `time.precision.mode=adaptive` +* `time.precision.mode=adaptive_time_microseconds` +* `time.precision.mode=connect` + +#### time.precision.mode=adaptive + +When the `time.precision.mode` property is set to `adaptive`, the default, the connector determines the literal type and semantic type based on the column's data type definition. This ensures that events _exactly_ represent the values in the database. + +| YugabyteDB data type | Literal type (schema type) | Semantic type (schema name) and Notes | +| :----- | :----- | :----- | +| `DATE` | `INT32` | `io.debezium.time.Date`
Represents the number of days since the epoch. | +| `TIME(1)`, `TIME(2)`, `TIME(3)` | `INT32` | `io.debezium.time.Time`
Represents the number of milliseconds past midnight, and does not include timezone information. | +| `TIME(4)`, `TIME(5)`, `TIME(6)` | `INT64` | `io.debezium.time.MicroTime`
Represents the number of microseconds past midnight, and does not include timezone information. | +| `TIMESTAMP(1)`, `TIMESTAMP(2)`, `TIMESTAMP(3)` | `INT64` | `io.debezium.time.Timestamp`
Represents the number of milliseconds since the epoch, and does not include timezone information. | +| `TIMESTAMP(4)`, `TIMESTAMP(5)`, `TIMESTAMP(6)`, `TIMESTAMP` | `INT64` | `io.debezium.time.MicroTimestamp`
Represents the number of microseconds since the epoch, and does not include timezone information. | + +#### time.precision.mode=adaptive_time_microseconds + +When the `time.precision.mode` configuration property is set to `adaptive_time_microseconds`, the connector determines the literal type and semantic type for temporal types based on the column's data type definition. This ensures that events _exactly_ represent the values in the database, except all `TIME` fields are captured as microseconds. + +| YugabyteDB data type | Literal type (schema type) | Semantic type (schema name) and Notes | +| :----- | :----- | :----- | +| `DATE` | `INT32` | `io.debezium.time.Date`
Represents the number of days since the epoch. | +| `TIME([P])` | `INT64` | `io.debezium.time.MicroTime`
Represents the time value in microseconds and does not include timezone information. YugabyteDB allows precision `P` to be in the range 0-6 to store up to microsecond precision. | +| `TIMESTAMP(1)` , `TIMESTAMP(2)`, `TIMESTAMP(3)` | `INT64` | `io.debezium.time.Timestamp`
Represents the number of milliseconds past the epoch, and does not include timezone information. | +| `TIMESTAMP(4)`, `TIMESTAMP(5)`, `TIMESTAMP(6)`, `TIMESTAMP` | `INT64` | `io.debezium.time.MicroTimestamp`
Represents the number of microseconds past the epoch, and does not include timezone information. |
+
+#### time.precision.mode=connect
+
+When the `time.precision.mode` configuration property is set to `connect`, the connector uses Kafka Connect logical types. This may be useful when consumers can handle only the built-in Kafka Connect logical types and are unable to handle variable-precision time values. However, because YugabyteDB supports microsecond precision, the events generated by a connector with the connect time precision mode result in a loss of precision when the database column has a fractional second precision value that is greater than 3.
+
+| YugabyteDB data type | Literal type (schema type) | Semantic type (schema name) and Notes |
+| :----- | :----- | :----- |
+| `DATE` | `INT32` | `org.apache.kafka.connect.data.Date`
Represents the number of days since the epoch. | +| `TIME([P])` | `INT64` | `org.apache.kafka.connect.data.Time`
Represents the number of milliseconds since midnight, and does not include timezone information. YugabyteDB allows `P` to be in the range 0-6 to store up to microsecond precision, though this mode results in a loss of precision when `P` is greater than 3. | +| `TIMESTAMP([P])` | `INT64` | `org.apache.kafka.connect.data.Timestamp`
Represents the number of milliseconds since the epoch, and does not include timezone information. YugabyteDB allows `P` to be in the range 0-6 to store up to microsecond precision, though this mode results in a loss of precision when `P` is greater than 3. | + +### TIMESTAMP type + +The `TIMESTAMP` type represents a timestamp without time zone information. Such columns are converted into an equivalent Kafka Connect value based on UTC. For example, the `TIMESTAMP` value "2018-06-20 15:13:16.945104" is represented by an `io.debezium.time.MicroTimestamp` with the value "1529507596945104" when `time.precision.mode` is not set to `connect`. + +The timezone of the JVM running Kafka Connect and Debezium does not affect this conversion. + +YugabyteDB supports using +/-infinite values in `TIMESTAMP` columns. These special values are converted to timestamps with value `9223372036825200000` in case of positive infinity or `-9223372036832400000` in case of negative infinity. This behavior mimics the standard behavior of the YugabyteDB JDBC driver. For reference, see the [`org.postgresql.PGStatement`](https://jdbc.postgresql.org/documentation/publicapi/org/postgresql/PGStatement.html) interface. + +### Decimal types + +The setting of the YugabyteDB connector configuration property `decimal.handling.mode` determines how the connector maps decimal types. + +#### decimal.handling.mode=double + +When the `decimal.handling.mode` property is set to `double`, the connector represents all `DECIMAL`, `NUMERIC` and `MONEY` values as Java double values and encodes them as shown in the following table. + +| YugabyteDB data type | Literal type (schema type) | Semantic type (schema name) and Notes | +| :----- | :----- | :----- | +| `NUMERIC[(M[,D])]` | `FLOAT64` | | +| `DECIMAL[(M[,D])]` | `FLOAT64` | | +| `MONEY[(M[,D])]` | `FLOAT64` | | + +#### decimal.handling.mode=string + +The last possible setting for the `decimal.handling.mode` configuration property is `string`. 
In this case, the connector represents `DECIMAL`, `NUMERIC` and `MONEY` values as their formatted string representation, and encodes them as shown in the following table. + +| YugabyteDB data type | Literal type (schema type) | Semantic type (schema name) and Notes | +| :----- | :----- | :----- | +| `NUMERIC[(M[,D])]` | `STRING` | | +| `DECIMAL[(M[,D])]` | `STRING` | | +| `MONEY[(M[,D])]` | `STRING` | | + +{{< note title="Note" >}} + +Decimal handling mode `precise` is not yet supported by `YugabyteDBConnector`. + +{{< /note >}} + +### HSTORE types + +The setting of the YugabyteDB connector configuration property `hstore.handling.mode` determines how the connector maps `HSTORE` values. + +When the `hstore.handling.mode` property is set to json (the default), the connector represents `HSTORE` values as string representations of `JSON` values and encodes them as shown in the following table. When the `hstore.handling.mode` property is set to map, the connector uses the `MAP` schema type for `HSTORE` values. + +| YugabyteDB data type | Literal type (schema type) | Semantic type (schema name) and Notes | +| :----- | :----- | :----- | +| `HSTORE` | `STRING` | `io.debezium.data.Json`

Example: output representation using the JSON converter is `{"key" : "val"}` | +| `HSTORE` | `MAP` | n/a

Example: output representation using the `JSON` converter is `{"key" : "val"}` | + +### Domain types + +YugabyteDB supports user-defined types that are based on other underlying types. When such column types are used, Debezium exposes the column's representation based on the full type hierarchy. + +{{< note title="Note" >}} + +Capturing changes in columns that use YugabyteDB domain types requires special consideration. When a column is defined to contain a domain type that extends one of the default database types and the domain type defines a custom length or scale, the generated schema inherits that defined length or scale. + +When a column is defined to contain a domain type that extends another domain type that defines a custom length or scale, the generated schema does not inherit the defined length or scale because that information is not available in the YugabyteDB driver's column metadata. + +{{< /note >}} + +### Network address types + +YugabyteDB has data types that can store IPv4, IPv6, and MAC addresses. It is better to use these types instead of plain text types to store network addresses. Network address types offer input error checking and specialized operators and functions. + +| YugabyteDB data type | Literal type (schema type) | Semantic type (schema name) and Notes | +| :----- | :----- | :----- | +| `INET` | `STRING` | n/a

IPv4 and IPv6 networks |
| `CIDR` | `STRING` | n/a

IPv4 and IPv6 hosts and networks |
| `MACADDR` | `STRING` | n/a

MAC addresses |
| `MACADDR8` | `STRING` | n/a

MAC addresses in EUI-64 format |

## Setting up YugabyteDB

### Setting up permissions

Setting up a YugabyteDB server to run a Debezium connector requires a database user that can perform replications. Replication can be performed only by a database user that has appropriate permissions, and only for a configured number of hosts.

Although, by default, superusers have the necessary `REPLICATION` and `LOGIN` roles, as mentioned in [Security](#security), it is best not to provide the Debezium replication user with elevated privileges. Instead, create a Debezium user that has the minimum required privileges.

**Prerequisites:**

* YugabyteDB administrative permissions.

**Procedure:**

To provide a user with replication permissions, define a YugabyteDB role that has at least the `REPLICATION` and `LOGIN` permissions, and then grant that role to the user. For example:

```sql
CREATE ROLE <name> REPLICATION LOGIN;
```

### Setting privileges to enable Debezium to create YugabyteDB publications when you use `pgoutput` or `yboutput`

If you use `pgoutput` or `yboutput` as the logical decoding plugin, Debezium must operate in the database as a user with specific privileges.

Debezium streams change events for YugabyteDB source tables from publications that are created for the tables. Publications contain a filtered set of change events that are generated from one or more tables. The data in each publication is filtered based on the publication specification. The specification can be created by the YugabyteDB database administrator or by the Debezium connector. To permit the Debezium connector to create publications and specify the data to replicate to them, the connector must operate with specific privileges in the database.

There are several options for determining how publications are created. In general, it is best to manually create publications for the tables that you want to capture before you set up the connector.
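For example, you could create a publication manually for the tables you plan to capture before configuring the connector. The publication and table names in this sketch are illustrative:

```sql
-- Create a publication limited to the tables whose changes you want to capture.
CREATE PUBLICATION dbz_publication FOR TABLE public.orders, public.customers;

-- Verify which tables the publication includes.
SELECT * FROM pg_publication_tables WHERE pubname = 'dbz_publication';
```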
However, you can configure your environment in a way that permits Debezium to create publications automatically, and to specify the data that is added to them.

Debezium uses include list and exclude list properties to specify how data is inserted in the publication. For more information about the options for enabling Debezium to create publications, see `publication.autocreate.mode`.

For Debezium to create a YugabyteDB publication, it must run as a user that has the following privileges:

* Replication privileges in the database to add the table to a publication.
* `CREATE` privileges on the database to add publications.
* `SELECT` privileges on the tables to copy the initial table data. Table owners automatically have `SELECT` permission for the table.

To add tables to a publication, the user must be an owner of the table. But because the source table already exists, you need a mechanism to share ownership with the original owner. To enable shared ownership, create a YugabyteDB replication group, then add the existing table owner and the replication user to the group.

**Procedure:**

1. Create a replication group.

    ```sql
    CREATE ROLE <replication_group>;
    ```

2. Add the original owner of the table to the group.

    ```sql
    GRANT <replication_group> TO <original_owner>;
    ```

3. Add the Debezium replication user to the group.

    ```sql
    GRANT <replication_group> TO <replication_user>;
    ```

4. Transfer ownership of the table to `<replication_group>`.

    ```sql
    ALTER TABLE <table_name> OWNER TO <replication_group>;
    ```

For Debezium to specify the capture configuration, the value of `publication.autocreate.mode` must be set to `filtered`.

### Configuring YugabyteDB to allow replication with the Debezium connector host

To enable Debezium to replicate YugabyteDB data, you must configure the database to permit replication with the host that runs the YugabyteDB connector.
To specify the clients that are permitted to replicate with the database, add entries to the YugabyteDB host-based authentication file, `ysql_hba.conf`. For more information about the `ysql_hba.conf` file, see the [YugabyteDB documentation](../../../../secure/authentication/host-based-authentication#ysql_hbaconf-file).

**Procedure:**

* Add entries to the `ysql_hba.conf` file to specify the Debezium connector hosts that can replicate with the database host. For example:

```sh
--ysql_hba_conf_csv="local replication <yourdbuser> trust, host replication <yourdbuser> 127.0.0.1/32 trust, host replication <yourdbuser> ::1/128 trust"
```

### Supported YugabyteDB topologies

As mentioned in the beginning, YugabyteDB (all versions later than 2024.1.1) supports logical replication slots. The YugabyteDB connector can communicate with the server by connecting to any node using the [YugabyteDB Java driver](../../../../reference/drivers/java/yugabyte-jdbc-reference). Should any node fail, the connector receives an error and restarts. Upon restart, the connector connects to any available node and continues streaming from that node.

### Setting up multiple connectors for same database server

Debezium uses [replication slots](https://www.postgresql.org/docs/11/logicaldecoding-explanation.html#LOGICALDECODING-REPLICATION-SLOTS) to stream changes from a database. These replication slots maintain the current position in the form of an LSN. This helps YugabyteDB keep the WAL available until it is processed by Debezium. A single replication slot can exist only for a single consumer or process, because different consumers might have different state and may need data from different positions.

Because a replication slot can be used by only a single connector, it is essential to create a unique replication slot for each Debezium connector.
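As a quick sanity check, you can query the `pg_replication_slots` view to see which replication slots exist and whether a consumer is attached (a sketch; column availability may vary by version):

```sql
-- Each Debezium connector should have its own row here,
-- with active = true while its connector is running.
SELECT slot_name, plugin, active FROM pg_replication_slots;
```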
However, when a connector is not active, YugabyteDB may allow another connector to consume the replication slot. This can be dangerous and may lead to data loss, because a slot emits each change just once.

In addition to the replication slot, Debezium uses a publication to stream events when using the `pgoutput` or `yboutput` plugin. Like a replication slot, a publication is at the database level and is defined for a set of tables. Thus, you'll need a unique publication for each connector, unless the connectors work on the same set of tables. For more information about the options for enabling Debezium to create publications, see `publication.autocreate.mode`.

See `slot.name` and `publication.name` on how to set a unique replication slot name and publication name for each connector.

## Deployment

To deploy the connector, you install the connector archive, configure the connector, and start the connector by adding its configuration to Kafka Connect.

**Prerequisites**

* [Zookeeper](https://zookeeper.apache.org/), [Kafka](http://kafka.apache.org/), and [Kafka Connect](https://kafka.apache.org/documentation.html#connect) are installed.
* YugabyteDB is installed and is [set up to run the Debezium connector](#setting-up-yugabytedb).

**Procedure**

1. Download the [YugabyteDB connector plugin archive](https://github.com/yugabyte/debezium/releases/tag/dz.2.5.2.yb.2024.1.SNAPSHOT.1).
2. Extract the files into your Kafka Connect environment.
3. Add the directory with the JAR files to the [Kafka Connect `plugin.path`](https://kafka.apache.org/documentation/#connectconfigs).
4. Restart your Kafka Connect process to pick up the new JAR files.
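To confirm the restart picked up the connector, you can ask the Kafka Connect REST API which connector plugins it has loaded; the YugabyteDB connector class should appear in the output. This sketch assumes the REST interface is listening on `localhost:8083` (the default):

```shell
# List the installed connector plugins and look for the YugabyteDB connector class.
curl -s localhost:8083/connector-plugins | grep -i yugabyte
```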
### Creating Kafka topics

If [auto creation of topics](https://debezium.io/documentation/reference/2.5/configuration/topic-auto-create-config.html) is not enabled in the Kafka Connect cluster, you need to create the following topics manually:

* A topic for each table, in the format `<topic.prefix>.<schemaName>.<tableName>`.
* A heartbeat topic in the format `<topic.heartbeat.prefix>.<topic.prefix>`. The [topic.heartbeat.prefix](#connector-properties) has a default value of `__debezium-heartbeat`.

### Connector configuration example

Following is an example of the configuration for a YugabyteDB connector that connects to a YugabyteDB server on port `5433` at `192.168.99.100`, whose topic prefix is `fulfillment`. Typically, you configure the YugabyteDB connector in a JSON file by setting the configuration properties available for the connector.

You can choose to produce events for a subset of the schemas and tables in a database. Optionally, you can ignore, mask, or truncate columns that contain sensitive data, are larger than a specified size, or that you do not need.

```output.json
{
  "name": "fulfillment-connector", --> 1
  "config": {
    "connector.class": "io.debezium.connector.postgresql.YugabyteDBConnector", --> 2
    "database.hostname": "192.168.99.100:5433,192.168.1.10:5433,192.168.1.68:5433", --> 3
    "database.port": "5433", --> 4
    "database.user": "postgres", --> 5
    "database.password": "postgres", --> 6
    "database.dbname" : "postgres", --> 7
    "topic.prefix": "fulfillment", --> 8
    "table.include.list": "public.inventory" --> 9
  }
}
```

1. The name of the connector when registered with a Kafka Connect service.
2. The name of this YugabyteDB connector class.
3. The addresses of the YugabyteDB YB-TServer nodes. This can take a value of multiple addresses in the format `IP1:PORT1,IP2:PORT2,IP3:PORT3`.
4. The port number of the YugabyteDB server.
5. The name of the YugabyteDB user that has the [required privileges](#setting-up-yugabytedb).
6. The password for the YugabyteDB user that has the [required privileges](#setting-up-yugabytedb).
7. The name of the YugabyteDB database to connect to.
8. The topic prefix for the YugabyteDB server/cluster, which forms a namespace and is used in all the names of the Kafka topics to which the connector writes, the Kafka Connect schema names, and the namespaces of the corresponding Avro schema when the Avro converter is used.
9. A list of all tables hosted by this server that this connector will monitor. This is optional, and there are other properties for listing the schemas and tables to include or exclude from monitoring.

See the [complete list of YugabyteDB connector properties](#connector-properties) that can be specified in these configurations.

You can send this configuration with a `POST` command to a running Kafka Connect service. The service records the configuration and starts one connector task that performs the following actions:

* Connects to the YugabyteDB database.
* Reads the transaction log.
* Streams change event records to Kafka topics.

### Adding connector configuration

To run the connector, create a connector configuration and add the configuration to your Kafka Connect cluster.

**Prerequisites**

* [YugabyteDB is configured to support logical replication.](#setting-up-yugabytedb)
* The YugabyteDB connector is installed.

**Procedure**

1. Create a configuration for the YugabyteDB connector.
2. Use the [Kafka Connect REST API](https://kafka.apache.org/documentation/#connect_rest) to add that connector configuration to your Kafka Connect cluster.

#### Results

After the connector starts, it performs a consistent snapshot of the YugabyteDB server databases that the connector is configured for. The connector then starts generating data change events for row-level operations and streaming change event records to Kafka topics.
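The registration step might look as follows with `curl`, assuming the connector configuration is saved in a file named `fulfillment-connector.json` (a hypothetical name) and the Kafka Connect REST API is listening on `localhost:8083` (the default):

```shell
# Register the connector by POSTing its configuration to Kafka Connect.
curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" \
  localhost:8083/connectors/ -d @fulfillment-connector.json

# Confirm that the connector and its task are running.
curl -s localhost:8083/connectors/fulfillment-connector/status
```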
### Connector properties

The connector has many configuration properties that you can use to achieve the right connector behavior for your application. Many properties have default values. Information about the properties is organized as follows:

* [Required configuration properties](#required-configuration-properties)
* [Advanced configuration properties](#advanced-configuration-properties)
* [Pass-through configuration properties](#pass-through-configuration-properties)

The following configuration properties are _required_ unless a default value is available.

#### Required configuration properties

| Property | Default value | Description |
| :------- | :------------ | :---------- |
| name | No default | Unique name for the connector. Attempting to register again with the same name will fail. This property is required by all Kafka Connect connectors. |
| connector.class | No default | The name of the Java class for the connector. Always use a value of `io.debezium.connector.postgresql.YugabyteDBConnector` for the YugabyteDB connector. |
| tasks.max | 1 | The maximum number of tasks that should be created for this connector. The YugabyteDB connector always uses a single task and therefore does not use this value, so the default is always acceptable. |
| plugin.name | decoderbufs | The name of the YugabyteDB [logical decoding plugin](../key-concepts/#output-plugin) installed on the YugabyteDB server.
Supported values are `yboutput` and `pgoutput`. |
Slot names can contain lower-case letters, numbers, and the underscore character. | +| slot.drop.on.stop | false | Whether or not to delete the logical replication slot when the connector stops in a graceful, expected way. The default behavior is that the replication slot remains configured for the connector when the connector stops. When the connector restarts, having the same replication slot enables the connector to start processing where it left off.
Set to true in only testing or development environments. Dropping the slot allows the database to discard WAL segments. When the connector restarts it performs a new snapshot or it can continue from a persistent offset in the Kafka Connect offsets topic. | +| publication.name | `dbz_publication` | The name of the YugabyteDB publication created for streaming changes when using pgoutput.
This publication is created at start-up if it does not already exist and it includes all tables. Debezium then applies its own include/exclude list filtering, if configured, to limit the publication to change events for the specific tables of interest. The connector user must have superuser permissions to create this publication, so it is usually preferable to create the publication before starting the connector for the first time.
If the publication already exists, either for all tables or configured with a subset of tables, Debezium uses the publication as it is defined. | +| database.hostname | No default | IP address or hostname of the YugabyteDB database server. This needs to be in the format `IP1:PORT1,IP2:PORT2,IP3:PORT3` | +| database.port | 5433 | Integer port number of the YugabyteDB database server. | +| database.user | No default | Name of the YugabyteDB database user for connecting to the YugabyteDB database server. | +| database.password | No default | Password to use when connecting to the YugabyteDB database server. | +| database.dbname | No default | The name of the YugabyteDB database from which to stream the changes. | +| topic.prefix | No default | Topic prefix that provides a namespace for the particular YugabyteDB database server or cluster in which Debezium is capturing changes. The prefix should be unique across all other connectors, as it is used as a topic name prefix for all Kafka topics that receive records from this connector. Only alphanumeric characters, hyphens, dots, and underscores must be used in the database server logical name. {{< warning title="Warning" >}} Do not change the value of this property. If you change the name value, after a restart, instead of continuing to emit events to the original topics, the connector emits subsequent events to topics whose names are based on the new value. {{< /warning >}} | +| schema.include.list | No default | An optional, comma-separated list of regular expressions that match names of schemas for which you **want** to capture changes. Any schema name not included in `schema.include.list` is excluded from having its changes captured. By default, all non-system schemas have their changes captured.
To match the name of a schema, Debezium applies the regular expression that you specify as an _anchored_ regular expression. That is, the specified expression is matched against the entire identifier for the schema; it does not match substrings that might be present in a schema name.
If you include this property in the configuration, do not also set the `schema.exclude.list` property. | +| schema.exclude.list | No default | An optional, comma-separated list of regular expressions that match names of schemas for which you **do not** want to capture changes. Any schema whose name is not included in `schema.exclude.list` has its changes captured, with the exception of system schemas.
To match the name of a schema, Debezium applies the regular expression that you specify as an _anchored_ regular expression. That is, the specified expression is matched against the entire identifier for the schema; it does not match substrings that might be present in a schema name.
If you include this property in the configuration, do not set the `schema.include.list` property. | +| table.include.list | No default | An optional, comma-separated list of regular expressions that match fully-qualified table identifiers for tables whose changes you want to capture. When this property is set, the connector captures changes only from the specified tables. Each identifier is of the form `schemaName.tableName`. By default, the connector captures changes in every non-system table in each schema whose changes are being captured.
To match the name of a table, Debezium applies the regular expression that you specify as an anchored regular expression. That is, the specified expression is matched against the entire identifier for the table; it does not match substrings that might be present in a table name.
If you include this property in the configuration, do not also set the `table.exclude.list` property. | +| table.exclude.list | No default | An optional, comma-separated list of regular expressions that match fully-qualified table identifiers for tables whose changes you do not want to capture. Each identifier is of the form `schemaName.tableName`. When this property is set, the connector captures changes from every table that you do not specify.
To match the name of a table, Debezium applies the regular expression that you specify as an _anchored_ regular expression. That is, the specified expression is matched against the entire identifier for the table; it does not match substrings that might be present in a table name.
If you include this property in the configuration, do not set the `table.include.list` property. | +| column.include.list | No default | An optional, comma-separated list of regular expressions that match the fully-qualified names of columns that should be included in change event record values. Fully-qualified names for columns are of the form `schemaName.tableName.columnName`.
To match the name of a column, Debezium applies the regular expression that you specify as an _anchored_ regular expression. That is, the expression is used to match the entire name string of the column; it does not match substrings that might be present in a column name.
If you include this property in the configuration, do not also set the `column.exclude.list` property. | +| column.exclude.list | No default | An optional, comma-separated list of regular expressions that match the fully-qualified names of columns that should be excluded from change event record values. Fully-qualified names for columns are of the form `schemaName.tableName.columnName`.
To match the name of a column, Debezium applies the regular expression that you specify as an _anchored_ regular expression. That is, the expression is used to match the entire name string of the column; it does not match substrings that might be present in a column name.
If you include this property in the configuration, do not set the `column.include.list` property. | +| skip.messages.without.change | false | Specifies whether to skip publishing messages when there is no change in included columns. This would essentially filter messages if there is no change in columns included as per `column.include.list` or `column.exclude.list` properties.
Note: This works only when the `REPLICA IDENTITY` of the table is set to `FULL`. |

`adaptive` captures the time and timestamp values exactly as in the database using either millisecond, microsecond, or nanosecond precision values based on the database column's type.

`adaptive_time_microseconds` captures the date, datetime and timestamp values exactly as in the database using either millisecond, microsecond, or nanosecond precision values based on the database column's type. An exception is `TIME` type fields, which are always captured as microseconds.

`connect` always represents time and timestamp values by using Kafka Connect built-in representations for `Time`, `Date`, and `Timestamp`, which use millisecond precision regardless of the database columns' precision. For more information, see [Temporal types](#temporal-types). | +| decimal.handling.mode | precise | Specifies how the connector should handle values for `DECIMAL` and `NUMERIC` columns:

`double` represents values by using double values, which might result in a loss of precision but which is easier to use.

`string` encodes values as formatted strings, which are easy to consume but semantic information about the real type is lost. For more information, see [Decimal types](#decimal-types). | +| interval.handling.mode | numeric | Specifies how the connector should handle values for interval columns:

`numeric` represents intervals using approximate number of microseconds.

`string` represents intervals exactly by using the string pattern representation `PYMDTHMS`. For example: `P1Y2M3DT4H5M6.78S`. For more information, see [Basic types](#basic-types). | +| database.sslmode | prefer | Whether to use an encrypted connection to the YugabyteDB server. Options include:

`disable` uses an unencrypted connection.

`allow` attempts to use an unencrypted connection first and, failing that, a secure (encrypted) connection.

`prefer` attempts to use a secure (encrypted) connection first and, failing that, an unencrypted connection.

`require` uses a secure (encrypted) connection, and fails if one cannot be established.

`verify-ca` behaves like require but also verifies the server TLS certificate against the configured Certificate Authority (CA) certificates, or fails if no valid matching CA certificates are found.

`verify-full` behaves like verify-ca but also verifies that the server certificate matches the host to which the connector is trying to connect. For more information, see the [PostgreSQL documentation](https://www.postgresql.org/docs/11/static/libpq-connect.html). | +| database.sslcert | No default | The path to the file that contains the SSL certificate for the client. For more information, see the [PostgreSQL documentation](https://www.postgresql.org/docs/11/static/libpq-connect.html). | +| database.sslkey | No default | The path to the file that contains the SSL private key of the client. For more information, see the [PostgreSQL documentation](https://www.postgresql.org/docs/11/static/libpq-connect.html). | +| database.sslpassword | No default | The password to access the client private key from the file specified by `database.sslkey`. For more information, see the [PostgreSQL documentation](https://www.postgresql.org/docs/11/static/libpq-connect.html). | +| database.sslrootcert | No default | The path to the file that contains the root certificate(s) against which the server is validated. For more information, see the [PostgreSQL documentation](https://www.postgresql.org/docs/11/static/libpq-connect.html). | +| database.tcpKeepAlive | true | Enable TCP keep-alive probe to verify that the database connection is still alive. For more information, see the [PostgreSQL documentation](https://www.postgresql.org/docs/11/static/libpq-connect.html). | +| tombstones.on.delete | true | Controls whether a delete event is followed by a tombstone event.

`true` - a delete operation is represented by a delete event and a subsequent tombstone event.

`false` - only a delete event is emitted.

After a source record is deleted, emitting a tombstone event (the default behavior) allows Kafka to completely delete all events that pertain to the key of the deleted row in case log compaction is enabled for the topic. | +| column.truncate.to.length.chars | n/a | An optional, comma-separated list of regular expressions that match the fully-qualified names of character-based columns. Set this property if you want to truncate the data in a set of columns when it exceeds the number of characters specified by the length in the property name. Set `length` to a positive integer value, for example, `column.truncate.to.20.chars`.

The fully-qualified name of a column observes the following format: `<schemaName>.<tableName>.<columnName>`. To match the name of a column, Debezium applies the regular expression that you specify as an anchored regular expression. That is, the specified expression is matched against the entire name string of the column; the expression does not match substrings that might be present in a column name.

You can specify multiple properties with different lengths in a single configuration. | +| column.mask.with.length.chars | n/a | An optional, comma-separated list of regular expressions that match the fully-qualified names of character-based columns. Set this property if you want the connector to mask the values for a set of columns, for example, if they contain sensitive data. Set `length` to a positive integer to replace data in the specified columns with the number of asterisk (`*`) characters specified by the length in the property name. Set length to `0` (zero) to replace data in the specified columns with an empty string.

The fully-qualified name of a column observes the following format: schemaName.tableName.columnName. To match the name of a column, Debezium applies the regular expression that you specify as an anchored regular expression. That is, the specified expression is matched against the entire name string of the column; the expression does not match substrings that might be present in a column name.

You can specify multiple properties with different lengths in a single configuration. | +| column.mask.hash.hashAlgorithm.with.salt._salt_;
column.mask.hash.v2.hashAlgorithm.with.salt._salt_ | n/a | An optional, comma-separated list of regular expressions that match the fully-qualified names of character-based columns. Fully-qualified names for columns are of the form `<schemaName>.<tableName>.<columnName>`.
To match the name of a column, Debezium applies the regular expression that you specify as an anchored regular expression. That is, the specified expression is matched against the entire name string of the column; the expression does not match substrings that might be present in a column name. In the resulting change event record, the values for the specified columns are replaced with pseudonyms.
A pseudonym consists of the hashed value that results from applying the specified hashAlgorithm and salt. Based on the hash function that is used, referential integrity is maintained, while column values are replaced with pseudonyms. Supported hash functions are described in the [MessageDigest](https://docs.oracle.com/javase/7/docs/technotes/guides/security/StandardNames.html#MessageDigest) section of the Java Cryptography Architecture Standard Algorithm Name Documentation.

In the following example, `CzQMA0cB5K` is a randomly selected salt.

```column.mask.hash.SHA-256.with.salt.CzQMA0cB5K = inventory.orders.customerName, inventory.shipment.customerName```
If necessary, the pseudonym is automatically shortened to the length of the column. The connector configuration can include multiple properties that specify different hash algorithms and salts.

Depending on the `hashAlgorithm` used, the salt selected, and the actual data set, the resulting data set might not be completely masked.

Hashing strategy version 2 should be used to ensure fidelity if the value is being hashed in different places or systems. | +| column.propagate.source.type | n/a | An optional, comma-separated list of regular expressions that match the fully-qualified names of columns for which you want the connector to emit extra parameters that represent column metadata. When this property is set, the connector adds the following fields to the schema of event records:
  • __debezium.source.column.type
  • __debezium.source.column.length
  • __debezium.source.column.scale
These parameters propagate a column's original type name and length (for variable-width types), respectively.
Enabling the connector to emit this extra data can assist in properly sizing specific numeric or character-based columns in sink databases.
The fully-qualified name of a column observes one of the following formats: `databaseName.tableName.columnName`, or `databaseName.schemaName.tableName.columnName`.
To match the name of a column, Debezium applies the regular expression that you specify as an _anchored_ regular expression. That is, the specified expression is matched against the entire name string of the column; the expression does not match substrings that might be present in a column name. | +| datatype.propagate.source.type | n/a | An optional, comma-separated list of regular expressions that specify the fully-qualified names of data types that are defined for columns in a database. When this property is set, for columns with matching data types, the connector emits event records that include the following extra fields in their schema:
  • __debezium.source.column.type
  • __debezium.source.column.length
  • __debezium.source.column.scale
These parameters propagate a column's original type name and length (for variable-width types), respectively.
Enabling the connector to emit this extra data can assist in properly sizing specific numeric or character-based columns in sink databases.
The fully-qualified name of a column observes one of the following formats: `databaseName.tableName.typeName`, or `databaseName.schemaName.tableName.typeName`.
To match the name of a data type, Debezium applies the regular expression that you specify as an _anchored_ regular expression. That is, the specified expression is matched against the entire name string of the data type; the expression does not match substrings that might be present in a type name.
For the list of YugabyteDB-specific data type names, see [Data type mappings](#data-type-mappings). | +| message.key.columns | _empty string_ | A list of expressions that specify the columns that the connector uses to form custom message keys for change event records that it publishes to the Kafka topics for specified tables.
By default, Debezium uses the primary key column of a table as the message key for records that it emits. In place of the default, or to specify a key for tables that lack a primary key, you can configure custom message keys based on one or more columns.

To establish a custom message key for a table, list the table, followed by the columns to use as the message key. Each list entry takes the following format:

`<fully-qualified_tableName>:<keyColumn>,<keyColumn>`

To base a table key on multiple column names, insert commas between the column names.
Each fully-qualified table name is a regular expression in the following format:

`<schemaName>.<tableName>`

The property can include entries for multiple tables. Use a semicolon to separate table entries in the list.

The following example sets the message key for the tables `inventory.customers` and `purchase.orders`:

`inventory.customers:pk1,pk2;(.*).purchaseorders:pk3,pk4`

For the table `inventory.customers`, the columns `pk1` and `pk2` are specified as the message key. For the `purchaseorders` tables in any schema, the columns `pk3` and `pk4` serve as the message key.
There is no limit to the number of columns that you use to create custom message keys. However, it's best to use the minimum number that are required to specify a unique key.
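As an illustrative sketch only (the connector is implemented in Java; this Python model and its helper name are not part of the connector), the resolution of a `message.key.columns` value for a given table can be pictured as:

```python
import re

def resolve_key_columns(config, schema, table):
    """Return the configured key columns for schema.table, or None.

    config follows the documented format, with entries separated by
    semicolons and each entry shaped as tablePattern:col1,col2,...
    """
    full_name = f"{schema}.{table}"
    for entry in config.split(";"):
        pattern, _, columns = entry.partition(":")
        # Table patterns are anchored regular expressions: they must
        # match the entire fully-qualified table name.
        if re.fullmatch(pattern, full_name):
            return columns.split(",")
    return None

cfg = "inventory.customers:pk1,pk2;(.*).purchaseorders:pk3,pk4"
print(resolve_key_columns(cfg, "inventory", "customers"))    # ['pk1', 'pk2']
print(resolve_key_columns(cfg, "public", "purchaseorders"))  # ['pk3', 'pk4']
```

Because the patterns are anchored, `inventory.customers` does not capture a table named `inventory.customers_archive`.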
Note that if this property is set and `REPLICA IDENTITY` is set to `DEFAULT` on the tables, tombstone events are not created properly when the key columns are not part of the table's primary key.
Setting `REPLICA IDENTITY` to `FULL` is the only solution. | +| publication.autocreate.mode | `all_tables` | Applies only when streaming changes by using the [pgoutput plugin](https://www.postgresql.org/docs/11/sql-createpublication.html). The setting determines how creation of a [publication](https://www.postgresql.org/docs/11/logical-replication-publication.html) should work. Specify one of the following values:

`all_tables` - If a publication exists, the connector uses it. If a publication does not exist, the connector creates a publication for all tables in the database for which the connector is capturing changes. For the connector to create a publication, it must access the database through a database user account that has permission to create publications and perform replications. You grant the required permission by using the following SQL command `CREATE PUBLICATION <publication_name> FOR ALL TABLES;`.

`disabled` - The connector does not attempt to create a publication. A database administrator or the user configured to perform replications must have created the publication before running the connector. If the connector cannot find the publication, the connector throws an exception and stops.

`filtered` - If a publication exists, the connector uses it. If no publication exists, the connector creates a new publication for tables that match the current filter configuration as specified by the `schema.include.list`, `schema.exclude.list`, `table.include.list`, and `table.exclude.list` connector configuration properties. For example: `CREATE PUBLICATION <publication_name> FOR TABLE <tbl1, tbl2, tbl3>`. If the publication exists, the connector updates the publication for tables that match the current filter configuration. For example: `ALTER PUBLICATION <publication_name> SET TABLE <tbl1, tbl2, tbl3>`. | +| replica.identity.autoset.values | _empty string_ | The setting determines the value for [replica identity](#replica-identity) at the table level.

This option overwrites the existing value in the database. The value is a comma-separated list of regular expressions that match fully-qualified table names, each paired with the replica identity value to use for that table.

Each expression must match the pattern `<fully-qualified table name>:<replica identity>`, where the table name is defined as `SCHEMA_NAME.TABLE_NAME`, and the replica identity values are:

`DEFAULT` - Records the old values of the columns of the primary key, if any. This is the default for non-system tables.

`FULL` - Records the old values of all columns in the row.

`NOTHING` - Records no information about the old row. This is the default for system tables.

For example,

```schema1.*:FULL,schema2.table2:NOTHING,schema2.table3:DEFAULT```
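As a rough Python model of how such a list could be resolved for a particular table (illustrative only; the assumption that the first matching pattern wins is ours, not documented connector behavior):

```python
import re

def replica_identity_for(config, schema, table):
    """Resolve the replica identity for schema.table from the list.

    Assumption for illustration: entries are tried in order and the
    first pattern that matches the whole table name wins.
    """
    full_name = f"{schema}.{table}"
    for entry in config.split(","):
        pattern, _, identity = entry.rpartition(":")
        if re.fullmatch(pattern, full_name):
            return identity
    return None  # no override configured for this table

cfg = "schema1.*:FULL,schema2.table2:NOTHING,schema2.table3:DEFAULT"
print(replica_identity_for(cfg, "schema1", "orders"))  # FULL
print(replica_identity_for(cfg, "schema2", "table2"))  # NOTHING
```

Because these are regular expressions rather than globs, `schema1.*` matches any full name beginning with `schema1`, not only names in the `schema1` schema.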

{{< warning title="Warning" >}} Tables in YugabyteDB always use the replica identity that was in effect at the time of replication slot creation; it cannot be altered at runtime. If the replica identity is altered, the change is reflected only on a new slot created after the alteration. {{< /warning >}} | +| binary.handling.mode | bytes | Specifies how binary (`bytea`) columns should be represented in change events:

`bytes` represents binary data as a byte array.

`base64` represents binary data as base64-encoded strings.

`base64-url-safe` represents binary data as base64-url-safe-encoded strings.

`hex` represents binary data as hex-encoded (base16) strings. | +| schema.name.adjustment.mode | none | Specifies how schema names should be adjusted for compatibility with the message converter used by the connector. Possible settings:

  • `none` does not apply any adjustment.
  • `avro` replaces the characters that cannot be used in the Avro type name with underscore.
  • `avro_unicode` replaces the underscore or characters that cannot be used in the Avro type name with the corresponding Unicode escape, such as `_uxxxx`. Note: `_` is an escape sequence, like backslash in Java.
| +| field.name.adjustment.mode | none | Specifies how field names should be adjusted for compatibility with the message converter used by the connector. Possible settings:

  • `none` does not apply any adjustment.
  • `avro` replaces the characters that cannot be used in the Avro type name with underscore.
  • `avro_unicode` replaces the underscore or characters that cannot be used in the Avro type name with the corresponding Unicode escape, such as `_uxxxx`. Note: `_` is an escape sequence, like backslash in Java.
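As a sketch of the `avro` adjustment mode (assuming, per the Avro specification, that names may contain only letters, digits, and underscores and must not start with a digit; the connector's exact rules may differ):

```python
import re

def avro_adjust(name):
    """Replace characters that are invalid in Avro names with '_'.

    Illustrative sketch; not the connector's actual implementation.
    """
    adjusted = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # Avro names must not begin with a digit.
    if adjusted and adjusted[0].isdigit():
        adjusted = "_" + adjusted
    return adjusted

print(avro_adjust("my-server.public.orders"))  # my_server_public_orders
```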
For more information, see [Avro naming](https://debezium.io/documentation/reference/2.5/configuration/avro.html#avro-naming). | +| money.fraction.digits | 2 | Specifies how many decimal digits should be used when converting Postgres `money` type to `java.math.BigDecimal`, which represents the values in change events. Applicable only when `decimal.handling.mode` is set to `precise`. | + +#### Advanced configuration properties + +The following advanced configuration properties have defaults that work in most situations and therefore rarely need to be specified in the connector's configuration. + +| Property | Default value | Description | +| :------- | :------------ | :---------- | +| converters | No default | Enumerates a comma-separated list of the symbolic names of the custom converter instances that the connector can use. For example,

```isbn```

You must set the `converters` property to enable the connector to use a custom converter.
For each converter that you configure for a connector, you must also add a `.type` property, which specifies the fully-qualified name of the class that implements the converter interface. The `.type` property uses the following format:
`<converterSymbolicName>.type`
For example,

```isbn.type: io.debezium.test.IsbnConverter```

If you want to further control the behavior of a configured converter, you can add one or more configuration parameters to pass values to the converter. To associate any additional configuration parameter with a converter, prefix the parameter names with the symbolic name of the converter.
For example,

```isbn.schema.name: io.debezium.YugabyteDB.type.Isbn``` | +| snapshot.mode | `initial` | Specifies the criteria for performing a snapshot when the connector starts:

`initial` - The connector performs a snapshot only when no offsets have been recorded for the logical server name.

`never` - The connector never performs snapshots. When a connector is configured this way, its behavior when it starts is as follows. If there is a previously stored LSN in the Kafka offsets topic, the connector continues streaming changes from that position. If no LSN has been stored, the connector starts streaming changes from the point in time when the YugabyteDB logical replication slot was created on the server. The `never` snapshot mode is useful only when you know all data of interest is still reflected in the WAL.

`initial_only` - The connector performs an initial snapshot and then stops, without processing any subsequent changes. | +| snapshot.include.collection.list | All tables included in `table.include.list` | An optional, comma-separated list of regular expressions that match the fully-qualified names (`<schemaName>.<tableName>`) of the tables to include in a snapshot. The specified items must be named in the connector's `table.include.list` property. This property takes effect only if the connector's `snapshot.mode` property is set to a value other than `never`.
This property does not affect the behavior of incremental snapshots.
To match the name of a table, Debezium applies the regular expression that you specify as an _anchored_ regular expression. That is, the specified expression is matched against the entire name string of the table; it does not match substrings that might be present in a table name. | +| event.processing.failure.handling.mode | fail | Specifies how the connector should react to exceptions during processing of events:

`fail` propagates the exception, indicates the offset of the problematic event, and causes the connector to stop.

`warn` logs the offset of the problematic event, skips that event, and continues processing.

`skip` skips the problematic event and continues processing. | +| max.batch.size | 2048 | Positive integer value that specifies the maximum size of each batch of events that the connector processes. | +| max.queue.size | 8192 | Positive integer value that specifies the maximum number of records that the blocking queue can hold. When Debezium reads events streamed from the database, it places the events in the blocking queue before it writes them to Kafka. The blocking queue can provide backpressure for reading change events from the database in cases where the connector ingests messages faster than it can write them to Kafka, or when Kafka becomes unavailable. Events that are held in the queue are disregarded when the connector periodically records offsets. Always set the value of `max.queue.size` to be larger than the value of `max.batch.size`. | +| max.queue.size.in.bytes | 0 | A long integer value that specifies the maximum volume of the blocking queue in bytes. By default, volume limits are not specified for the blocking queue. To specify the number of bytes that the queue can consume, set this property to a positive long value.
If `max.queue.size` is also set, writing to the queue is blocked when the size of the queue reaches the limit specified by either property. For example, if you set `max.queue.size=1000`, and `max.queue.size.in.bytes=5000`, writing to the queue is blocked after the queue contains 1000 records, or after the volume of the records in the queue reaches 5000 bytes. | +| poll.interval.ms | 500 | Positive integer value that specifies the number of milliseconds the connector should wait for new change events to appear before it starts processing a batch of events. Defaults to 500 milliseconds. | +| include.unknown.datatypes | false | Specifies connector behavior when the connector encounters a field whose data type is unknown. The default behavior is that the connector omits the field from the change event and logs a warning.

Set this property to `true` if you want the change event to contain an opaque binary representation of the field. This lets consumers decode the field. You can control the exact representation by setting the [binary handling mode](#connector-properties) property. {{< note title="Note" >}} Consumers risk backward compatibility issues when `include.unknown.datatypes` is set to `true`. Not only may the database-specific binary representation change between releases, but if the data type is eventually supported by Debezium, the data type will be sent downstream in a logical type, which would require adjustments by consumers. In general, when encountering unsupported data types, create a feature request so that support can be added. {{< /note >}} | +| database.initial.statements | No default | A semicolon-separated list of SQL statements that the connector executes when it establishes a JDBC connection to the database. To use a semicolon as a character and not as a delimiter, specify two consecutive semicolons, `;;`.

The connector may establish JDBC connections at its own discretion. Consequently, this property is useful for configuration of session parameters only, and not for executing DML statements.
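The `;;` escape can be modeled as follows (an illustrative Python sketch, not connector code):

```python
def split_statements(value):
    """Split on ';' while treating ';;' as a literal semicolon."""
    # Use a placeholder that cannot appear in SQL text for the escape.
    sentinel = "\x00"
    escaped = value.replace(";;", sentinel)
    return [part.replace(sentinel, ";")
            for part in escaped.split(";") if part]

stmts = split_statements("SET search_path TO myapp;SELECT 'a;;b'")
print(stmts)  # ['SET search_path TO myapp', "SELECT 'a;b'"]
```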

The connector does not execute these statements when it creates a connection for reading the transaction log. | +| status.update.interval.ms | 10000 | Frequency for sending replication connection status updates to the server, given in milliseconds. The property also controls how frequently the database status is checked to detect a dead connection in case the database was shut down. | +| schema.refresh.mode | columns_diff | Specify the conditions that trigger a refresh of the in-memory schema for a table.

`columns_diff` is the safest mode. It ensures that the in-memory schema stays in sync with the database table's schema at all times.

`columns_diff_exclude_unchanged_toast` instructs the connector to refresh the in-memory schema cache if there is a discrepancy with the schema derived from the incoming message, unless unchanged TOASTable data fully accounts for the discrepancy.

This setting can significantly improve connector performance if there are frequently-updated tables that have TOASTed data that are rarely part of updates. However, it is possible for the in-memory schema to become outdated if TOASTable columns are dropped from the table. | +| snapshot.delay.ms | No default | An interval in milliseconds that the connector should wait before performing a snapshot when the connector starts. If you are starting multiple connectors in a cluster, this property is useful for avoiding snapshot interruptions, which might cause re-balancing of connectors. | +| snapshot.fetch.size | 10240 | During a snapshot, the connector reads table content in batches of rows. This property specifies the maximum number of rows in a batch. | +| slot.stream.params | No default | Semicolon-separated list of parameters to pass to the configured logical decoding plugin. | +| slot.max.retries | 6 | If connecting to a replication slot fails, this is the maximum number of consecutive attempts to connect. | +| slot.retry.delay.ms | 10000 (10 seconds) | The number of milliseconds to wait between retry attempts when the connector fails to connect to a replication slot. | +| unavailable.value.placeholder | __debezium_unavailable_value | Specifies the constant that the connector provides to indicate that the original value is a toasted value that is not provided by the database. If the setting of `unavailable.value.placeholder` starts with the `hex:` prefix, the rest of the string is expected to represent hexadecimally encoded octets. | +| provide.transaction.metadata | false | Determines whether the connector generates events with transaction boundaries and enriches change event envelopes with transaction metadata. Specify `true` if you want the connector to do this. For more information, see [Transaction metadata](#transaction-metadata).
| +| flush.lsn.source | true | Determines whether the connector should commit the LSN of the processed records in the source YugabyteDB database so that the WAL logs can be deleted. Specify `false` if you don't want the connector to do this. Note that if set to `false`, the LSN is not acknowledged by Debezium, and as a result the WAL logs are not cleared, which might result in disk space issues. In that case, you are expected to handle the acknowledgement of the LSN outside Debezium. | +| retriable.restart.connector.wait.ms | 10000 (10 seconds) | The number of milliseconds to wait before restarting a connector after a retriable error occurs. | +| skipped.operations | t | A comma-separated list of operation types that will be skipped during streaming. The operations include: `c` for inserts/create, `u` for updates, `d` for deletes, `t` for truncates, and `none` to not skip any operations. By default, truncate operations are skipped. | +| xmin.fetch.interval.ms | 0 | How often, in milliseconds, the XMIN will be read from the replication slot. The XMIN value provides the lower bounds of where a new replication slot could start from. The default value of `0` disables XMIN tracking. | +| topic.naming.strategy | `io.debezium.schema.SchemaTopicNamingStrategy` | The name of the TopicNamingStrategy class that should be used to determine the topic name for data change, schema change, transaction, and heartbeat events; defaults to `SchemaTopicNamingStrategy`. | +| topic.delimiter | `.` | Specify the delimiter for the topic name; defaults to `.`. | +| topic.cache.size | 10000 | The size of the bounded concurrent hash map that is used for holding topic names. This cache helps to determine the topic name corresponding to a given data collection. | +| topic.heartbeat.prefix | `__debezium-heartbeat` | Controls the name of the topic to which the connector sends heartbeat messages. The topic name has this pattern:

`<topic.heartbeat.prefix>.<topic.prefix>`

For example, if the topic prefix is `fulfillment`, the default topic name is `__debezium-heartbeat.fulfillment`. | +| topic.transaction | transaction | Controls the name of the topic to which the connector sends transaction metadata messages. The topic name has this pattern:
`<topic.prefix>.transaction`

For example, if the `topic.prefix` is `fulfillment`, the default topic name is `fulfillment.transaction`. | +| snapshot.max.threads | 1 | Specifies the number of threads that the connector uses when performing an initial snapshot. To enable parallel initial snapshots, set the property to a value greater than 1. In a parallel initial snapshot, the connector processes multiple tables concurrently. This feature is incubating. | +| custom.metric.tags | No default | Accepts key-value pairs that customize the MBean object name; the tags are appended to the end of the regular name. Each key represents a tag for the MBean object name, and the corresponding value is the value of that tag. For example: `k1=v1,k2=v2`. | +| errors.max.retries | -1 | The maximum number of retries on retriable errors (for example, connection errors) before failing (-1 = no limit, 0 = disabled, > 0 = number of retries). | + +#### Pass-through configuration properties + +The connector also supports pass-through configuration properties that are used when creating the Kafka producer and consumer. + +Be sure to consult the [Kafka documentation](https://kafka.apache.org/documentation.html) for all of the configuration properties for Kafka producers and consumers. The YugabyteDB connector does use the [new consumer configuration properties](https://kafka.apache.org/documentation.html#consumerconfigs). + +## Monitoring + +The YugabyteDB connector provides two types of metrics in addition to the built-in support for JMX metrics that Zookeeper, Kafka, and Kafka Connect provide: + +* [Snapshot metrics](#snapshot-metrics) provide information about connector operation while performing a snapshot. +* [Streaming metrics](#streaming-metrics) provide information about connector operation when the connector is capturing changes and streaming change event records.
+ +[Debezium monitoring documentation](https://debezium.io/documentation/reference/2.5/operations/monitoring.html#monitoring-debezium) provides details for how to expose these metrics by using JMX. + +### Snapshot metrics + +The **MBean** is `debezium.postgres:type=connector-metrics,context=snapshot,server=<topic.prefix>`. + +Snapshot metrics are not exposed unless a snapshot operation is active, or if a snapshot has occurred since the last connector start. + +The following table lists the snapshot metrics that are available. + +| Attributes | Type | Description | +| :--------- | :--- | :---------- | +| `LastEvent` | string | The last snapshot event that the connector has read. | +| `MilliSecondsSinceLastEvent` | long | The number of milliseconds since the connector has read and processed the most recent event. | +| `TotalNumberOfEventsSeen` | long | The total number of events that this connector has seen since last started or reset. | +| `NumberOfEventsFiltered` | long | The number of events that have been filtered by include/exclude list filtering rules configured on the connector. | +| `CapturedTables` | string[] | The list of tables that are captured by the connector. | +| `QueueTotalCapacity` | int | The length of the queue used to pass events between the snapshotter and the main Kafka Connect loop. | +| `QueueRemainingCapacity` | int | The free capacity of the queue used to pass events between the snapshotter and the main Kafka Connect loop. | +| `TotalTableCount` | int | The total number of tables that are being included in the snapshot. | +| `RemainingTableCount` | int | The number of tables that the snapshot has yet to copy. | +| `SnapshotRunning` | boolean | Whether the snapshot was started. | +| `SnapshotPaused` | boolean | Whether the snapshot was paused. | +| `SnapshotAborted` | boolean | Whether the snapshot was aborted. | +| `SnapshotCompleted` | boolean | Whether the snapshot completed.
| +| `SnapshotDurationInSeconds` | long | The total number of seconds that the snapshot has taken so far, even if not complete. This also includes the time when the snapshot was paused. | +| `SnapshotPausedDurationInSeconds` | long | The total number of seconds that the snapshot was paused. If the snapshot was paused several times, the paused time adds up. | +| `RowsScanned` | Map | Map containing the number of rows scanned for each table in the snapshot. Tables are incrementally added to the Map during processing. Updates every 10,000 rows scanned and upon completing a table. | +| `MaxQueueSizeInBytes` | long | The maximum buffer of the queue in bytes. This metric is available if `max.queue.size.in.bytes` is set to a positive long value. | +| `CurrentQueueSizeInBytes` | long | The current volume, in bytes, of records in the queue. | + +### Streaming metrics + +The **MBean** is `debezium.postgres:type=connector-metrics,context=streaming,server=<topic.prefix>`. + +The following table lists the streaming metrics that are available. + +| Attributes | Type | Description | +| :--------- | :--- | :---------- | +| `LastEvent` | string | The last streaming event that the connector has read. | +| `MilliSecondsSinceLastEvent` | long | The number of milliseconds since the connector has read and processed the most recent event. | +| `TotalNumberOfEventsSeen` | long | The total number of events that this connector has seen since the last start or metrics reset. | +| `TotalNumberOfCreateEventsSeen` | long | The total number of create events that this connector has seen since the last start or metrics reset. | +| `TotalNumberOfUpdateEventsSeen` | long | The total number of update events that this connector has seen since the last start or metrics reset. | +| `TotalNumberOfDeleteEventsSeen` | long | The total number of delete events that this connector has seen since the last start or metrics reset.
| +| `NumberOfEventsFiltered` | long | The number of events that have been filtered by include/exclude list filtering rules configured on the connector. | +| `CapturedTables` | string[] | The list of tables that are captured by the connector. | +| `QueueTotalCapacity` | int | The length of the queue used to pass events between the streamer and the main Kafka Connect loop. | +| `QueueRemainingCapacity` | int | The free capacity of the queue used to pass events between the streamer and the main Kafka Connect loop. | +| `Connected` | boolean | Flag that denotes whether the connector is currently connected to the database server. | +| `MilliSecondsBehindSource` | long | The number of milliseconds between the last change event's timestamp and the connector processing it. The values will incorporate any differences between the clocks on the machines where the database server and the connector are running. | +| `NumberOfCommittedTransactions` | long | The number of processed transactions that were committed. | +| `SourceEventPosition` | Map | The coordinates of the last received event. | +| `LastTransactionId` | string | Transaction identifier of the last processed transaction. | +| `MaxQueueSizeInBytes` | long | The maximum buffer of the queue in bytes. This metric is available if `max.queue.size.in.bytes` is set to a positive long value. | +| `CurrentQueueSizeInBytes` | long | The current volume, in bytes, of records in the queue. | + +## Behavior when things go wrong + +Debezium is a distributed system that captures all changes in multiple upstream databases; it never misses or loses an event. When the system is operating normally or being managed carefully, Debezium provides _exactly once_ delivery of every change event record. If a fault does happen, the system does not lose any events. However, while it is recovering from the fault, it's possible that the connector might emit some duplicate change events.
In these abnormal situations, Debezium, like Kafka, provides _at least once_ delivery of change events. + +The rest of this section describes how Debezium handles various kinds of faults and problems. + +### Configuration and startup errors + +In the following situations, the connector fails when trying to start, reports an error/exception in the log, and stops running: + +* The connector's configuration is invalid. +* The connector cannot successfully connect to YugabyteDB by using the specified connection parameters. +* The connector is restarting from a previously-recorded LSN and YugabyteDB no longer has that history available. + +In these cases, the error message has details about the problem and possibly a suggested workaround. After you correct the configuration or address the YugabyteDB problem, restart the connector. + +### YB-TServer becomes unavailable + +When the connector is running, the YB-TServer that it is connected to could become unavailable for any number of reasons. If this happens, the connector fails with an error and retries to connect to the YugabyteDB server. Because the connector uses the [YugabyteDB Java driver](../../../../drivers-orms/java/), the connection is handled internally and the connector restores the connection to another running node. + +The YugabyteDB connector externally stores the last processed offset in the form of a YugabyteDB LSN. After a connector restarts and connects to a server instance, the connector communicates with the server to continue streaming from that particular offset. This offset is available as long as the Debezium replication slot remains intact. + +{{< warning title="Warning" >}} + +Never drop a replication slot on the server or you will lose data. + +{{< /warning >}} + +### Cluster failures + +When the connector is running, it is possible that the YugabyteDB server becomes unavailable for any number of reasons. 
If that happens, the connector fails with an error and initiates retries, but because the entire YugabyteDB server is unavailable, all retries fail. + +When the YugabyteDB server is back up, restart the connector to continue streaming where it left off. + +### Kafka Connect process stops gracefully + +Suppose that Kafka Connect is being run in distributed mode and a Kafka Connect process is stopped gracefully. Prior to shutting down that process, Kafka Connect migrates the process's connector tasks to another Kafka Connect process in that group. The new connector tasks start processing exactly where the prior tasks stopped. There is a short delay in processing while the connector tasks are stopped gracefully and restarted on the new processes. + +### Kafka Connect process crashes + +If the Kafka Connect process stops unexpectedly, any connector tasks it was running terminate without recording their most recently processed offsets. When Kafka Connect is being run in distributed mode, Kafka Connect restarts those connector tasks on other processes. However, YugabyteDB connectors resume from the last offset that was recorded by the earlier processes. This means that the new replacement tasks might generate some of the same change events that were processed just prior to the crash. The number of duplicate events depends on the offset flush period and the volume of data changes just before the crash. + +Because there is a chance that some events might be duplicated during a recovery from failure, consumers should always anticipate some duplicate events. Debezium changes are idempotent, so a sequence of events always results in the same state. + +In each change event record, Debezium connectors insert source-specific information about the origin of the event, including the YugabyteDB server's time of the event, the ID of the server transaction, and the position in the write-ahead log where the transaction changes were written.
Consumers can keep track of this information, especially the LSN, to determine whether an event is a duplicate. + +### Kafka becomes unavailable + +As the connector generates change events, the Kafka Connect framework records those events in Kafka by using the Kafka producer API. Periodically, at a frequency that you specify in the Kafka Connect configuration, Kafka Connect records the latest offset that appears in those change events. If the Kafka brokers become unavailable, the Kafka Connect process that is running the connectors repeatedly tries to reconnect to the Kafka brokers. In other words, the connector tasks pause until a connection can be re-established, at which point the connectors resume exactly where they left off. + +### Connector is stopped for a duration + +If the connector is gracefully stopped, the database can continue to be used. Any changes are recorded in the YugabyteDB WAL. When the connector restarts, it resumes streaming changes where it left off. That is, it generates change event records for all database changes that were made while the connector was stopped. + +A properly configured Kafka cluster is able to handle massive throughput. Kafka Connect is written according to Kafka best practices, and given enough resources a Kafka Connect connector can also handle very large numbers of database change events. Because of this, after being stopped for a while, when a Debezium connector restarts, it is very likely to catch up with the database changes that were made while it was stopped. How quickly this happens depends on the capabilities and performance of Kafka and the volume of changes being made to the data in YugabyteDB. 
diff --git a/docs/content/preview/explore/change-data-capture/using-yugabytedb-grpc-replication/_index.md b/docs/content/preview/explore/change-data-capture/using-yugabytedb-grpc-replication/_index.md index 1cf103f6a468..60eb64bdd386 100644 --- a/docs/content/preview/explore/change-data-capture/using-yugabytedb-grpc-replication/_index.md +++ b/docs/content/preview/explore/change-data-capture/using-yugabytedb-grpc-replication/_index.md @@ -1,10 +1,9 @@ --- -title: Using YugabyteDB gRPC replication -headerTitle: Using YugabyteDB gRPC replication -linkTitle: Using YugabyteDB gRPC replication -description: CDC or Change data capture is a process to capture changes made to data in the database. +title: CDC using YugabyteDB gRPC replication protocol +headerTitle: CDC using gRPC replication protocol +linkTitle: gRPC protocol +description: CDC using YugabyteDB gRPC replication protocol. headcontent: Capture changes made to data in the database -image: /images/section_icons/index/develop.png cascade: earlyAccess: /preview/releases/versioning/#feature-maturity aliases: @@ -18,11 +17,9 @@ type: indexpage showRightNav: true --- -## Overview - YugabyteDB CDC captures changes made to data in the database and streams those changes to external processes, applications, or other databases. CDC allows you to track and propagate changes in a YugabyteDB database to downstream consumers based on its Write-Ahead Log (WAL). YugabyteDB CDC uses Debezium to capture row-level changes resulting from INSERT, UPDATE, and DELETE operations in the upstream database, and publishes them as events to Kafka using Kafka Connect-compatible connectors. -![What is CDC](/images/explore/cdc-overview-what.png) +![What is CDC](/images/explore/cdc-overview-work.png) -## YugabyteDB gRPC Connector - -To capture and stream your changes in YugabyteDB to an external system, you need a connector that can read the changes in YugabyteDB and stream it out. 
For this, you can use the YugabyteDB gRPC (Debezium) connector. The connector is deployed as a set of Kafka Connect-compatible connectors, so you first need to define a YugabyteDB connector configuration and then start the connector by adding it to Kafka Connect. - -{{}} -To understand how the various features and configuration of the connector, see [YugabyteDB gRPC Connector](./debezium-connector-yugabytedb). -{{}} - ## Get started Get started with Yugabyte gRPC replication. @@ -56,6 +45,14 @@ You can monitor the activities and status of the deployed connectors using the h To know more about how to monitor your CDC setup, see [Monitor](./cdc-monitor). {{}} +## YugabyteDB gRPC Connector + +To capture and stream your changes in YugabyteDB to an external system, you need a connector that can read the changes in YugabyteDB and stream them out. For this, you can use the YugabyteDB gRPC connector, which is based on the Debezium platform. The connector is deployed as a set of Kafka Connect-compatible connectors, so you first need to define a YugabyteDB connector configuration and then start the connector by adding it to Kafka Connect. + +{{}} +To learn about the features and configuration of the connector, see [YugabyteDB gRPC Connector](./debezium-connector-yugabytedb). +{{}} + ## Known limitations + +* A single stream can only be used to stream data from one namespace only.
@@ -73,13 +70,13 @@ In addition, CDC support for the following features will be added in upcoming re ## Learn more -- [Examples of CDC usage and patterns](https://github.com/yugabyte/cdc-examples/tree/main) {{}} -- [Tutorials to deploy in different Kafka environments](../../../tutorials/cdc-tutorials/) {{}} -- [Data Streaming Using YugabyteDB CDC, Kafka, and SnowflakeSinkConnector](https://www.yugabyte.com/blog/data-streaming-using-yugabytedb-cdc-kafka-and-snowflakesinkconnector/) {{}} -- [Unlock Azure Storage Options With YugabyteDB CDC](https://www.yugabyte.com/blog/unlocking-azure-storage-options-with-yugabytedb-cdc/) {{}} -- [Change Data Capture From YugabyteDB to Elasticsearch](https://www.yugabyte.com/blog/change-data-capture-cdc-yugabytedb-elasticsearch/) {{}} -- [Snowflake CDC: Publishing Data Using Amazon S3 and YugabyteDB](https://www.yugabyte.com/blog/snowflake-cdc-publish-data-using-amazon-s3-yugabytedb/) {{}} -- [Streaming Changes From YugabyteDB to Downstream Databases](https://www.yugabyte.com/blog/streaming-changes-yugabytedb-cdc-downstream-databases/) {{}} -- [Change Data Capture from YugabyteDB CDC to ClickHouse](https://www.yugabyte.com/blog/change-data-capture-cdc-yugabytedb-clickhouse/) {{}} -- [How to Run Debezium Server with Kafka as a Sink](https://www.yugabyte.com/blog/change-data-capture-cdc-run-debezium-server-kafka-sink/) {{}} -- [Change Data Capture Using a Spring Data Processing Pipeline](https://www.yugabyte.com/blog/change-data-capture-cdc-spring-data-processing-pipeline/) {{}} +* [Examples of CDC usage and patterns](https://github.com/yugabyte/cdc-examples/tree/main) {{}} +* [Tutorials to deploy in different Kafka environments](/preview/tutorials/cdc-tutorials/) {{}} +* [Data Streaming Using YugabyteDB CDC, Kafka, and SnowflakeSinkConnector](https://www.yugabyte.com/blog/data-streaming-using-yugabytedb-cdc-kafka-and-snowflakesinkconnector/) {{}} +* [Unlock Azure Storage Options With YugabyteDB 
CDC](https://www.yugabyte.com/blog/unlocking-azure-storage-options-with-yugabytedb-cdc/) {{}} +* [Change Data Capture From YugabyteDB to Elasticsearch](https://www.yugabyte.com/blog/change-data-capture-cdc-yugabytedb-elasticsearch/) {{}} +* [Snowflake CDC: Publishing Data Using Amazon S3 and YugabyteDB](https://www.yugabyte.com/blog/snowflake-cdc-publish-data-using-amazon-s3-yugabytedb/) {{}} +* [Streaming Changes From YugabyteDB to Downstream Databases](https://www.yugabyte.com/blog/streaming-changes-yugabytedb-cdc-downstream-databases/) {{}} +* [Change Data Capture from YugabyteDB CDC to ClickHouse](https://www.yugabyte.com/blog/change-data-capture-cdc-yugabytedb-clickhouse/) {{}} +* [How to Run Debezium Server with Kafka as a Sink](https://www.yugabyte.com/blog/change-data-capture-cdc-run-debezium-server-kafka-sink/) {{}} +* [Change Data Capture Using a Spring Data Processing Pipeline](https://www.yugabyte.com/blog/change-data-capture-cdc-spring-data-processing-pipeline/) {{}} diff --git a/docs/content/preview/explore/change-data-capture/using-yugabytedb-grpc-replication/cdc-get-started.md b/docs/content/preview/explore/change-data-capture/using-yugabytedb-grpc-replication/cdc-get-started.md index 37e92fd28dde..74fa629901e8 100644 --- a/docs/content/preview/explore/change-data-capture/using-yugabytedb-grpc-replication/cdc-get-started.md +++ b/docs/content/preview/explore/change-data-capture/using-yugabytedb-grpc-replication/cdc-get-started.md @@ -10,7 +10,7 @@ menu: preview: parent: explore-change-data-capture-grpc-replication identifier: cdc-get-started - weight: 30 + weight: 10 type: docs --- @@ -33,14 +33,15 @@ The following steps are necessary to set up YugabyteDB for use with the Yugabyte In case CDC is lagging or away for some time, the disk usage may grow and cause YugabyteDB cluster instability. To avoid this scenario, if a stream is inactive for a configured amount of time, the WAL is garbage collected. 
This is configurable using a [YB-TServer flag](../../../../reference/configuration/yb-tserver/#change-data-capture-cdc-flags). ## Deploying the YugabyteDB gRPC Connector + To stream data change events from YugabyteDB databases, follow these steps to deploy the YugabyteDB gRPC Connector: -* Download the Connector: You can download the connector from the [GitHub releases](https://github.com/yugabyte/debezium-connector-yugabytedb/releases) -* Install the Connector: Extract and install the connector archive in your Kafka Connect environment. -* Configure the Connector: Modify the connector configuration to suit your specific requirements. -* Start the Connector: Add the connector's configuration to Kafka Connect and start the connector. +- Download the Connector: You can download the connector from the [GitHub releases](https://github.com/yugabyte/debezium-connector-yugabytedb/releases) +- Install the Connector: Extract and install the connector archive in your Kafka Connect environment. +- Configure the Connector: Modify the connector configuration to suit your specific requirements. +- Start the Connector: Add the connector's configuration to Kafka Connect and start the connector. -For more details on connector configuration and deployment steps, refer to the [YugabyteDB gRPC Connector documentation]((../debezium-connector-yugabytedb/)). +For more details on connector configuration and deployment steps, refer to the [YugabyteDB gRPC Connector documentation](../debezium-connector-yugabytedb/). ## Serialization @@ -537,12 +538,12 @@ Longer values of `cdc_intent_retention_ms`, coupled with longer CDC lags (period ## Content-based routing -By default, the connector streams all of the change events that it reads from a table to a single static topic. However, you may want to re-route the events into different Kafka topics based on the event's content. You can do this using the Debezium `ContentBasedRouter`. 
But first, two additional dependencies need to be placed in the Kafka-Connect environment. These are not included in the official *yugabyte-debezium-connector* for security reasons. These dependencies are: +By default, the connector streams all of the change events that it reads from a table to a single static topic. However, you may want to re-route the events into different Kafka topics based on the event's content. You can do this using the Debezium `ContentBasedRouter`. But first, two additional dependencies need to be placed in the Kafka-Connect environment. These are not included in the official _yugabyte-debezium-connector_ for security reasons. These dependencies are: - Debezium routing SMT (Single Message Transform) - Groovy JSR223 implementation (or other scripting languages that integrate with [JSR 223](https://jcp.org/en/jsr/detail?id=223)) -To get started, you can rebuild the *yugabyte-debezium-connector* image including these dependencies. Here's what the Dockerfile would look like: +To get started, you can rebuild the _yugabyte-debezium-connector_ image including these dependencies. The following shows what the Dockerfile would look like: ```Dockerfile FROM quay.io/yugabyte/debezium-connector:latest @@ -573,6 +574,6 @@ The `` contains the logic for routing of the events. For exa value.after != null ? (value.after?.country?.value == '\''UK'\'' ? '\''uk_users'\'' : null) : (value.before?.country?.value == '\''UK'\'' ? '\''uk_users'\'' : null)" ``` -This expression checks if the value of the row after the operation has the country set to "UK". If *yes* then the expression returns "uk_users." If *no*, it returns *null*, and in case the row after the operation is *null* (for example, in a "delete" operation), the expression also checks for the same condition on row values before the operation. The value that is returned determines which new Kafka Topic will receive the re-routed event. If it returns *null*, the event is sent to the default topic. 
+This expression checks if the value of the row after the operation has the country set to `UK`. If _yes_, then the expression returns `uk_users`. If _no_, it returns _null_, and in case the row after the operation is _null_ (for example, in a "delete" operation), the expression also checks for the same condition on row values before the operation. The value that is returned determines which new Kafka Topic will receive the re-routed event. If it returns _null_, the event is sent to the default topic. For more advanced routing configuration, refer to the [Debezium documentation](https://debezium.io/documentation/reference/stable/transformations/content-based-routing.html) on content-based routing. diff --git a/docs/content/preview/explore/change-data-capture/using-yugabytedb-grpc-replication/cdc-monitor.md b/docs/content/preview/explore/change-data-capture/using-yugabytedb-grpc-replication/cdc-monitor.md index 4dad0867f92a..c6e797f1860d 100644 --- a/docs/content/preview/explore/change-data-capture/using-yugabytedb-grpc-replication/cdc-monitor.md +++ b/docs/content/preview/explore/change-data-capture/using-yugabytedb-grpc-replication/cdc-monitor.md @@ -10,7 +10,7 @@ menu: preview: parent: explore-change-data-capture-grpc-replication identifier: cdc-monitor - weight: 60 + weight: 20 type: docs --- @@ -104,7 +104,7 @@ The following streaming metrics are available: | QueueTotalCapacity | `int` | The length the queue used to pass events between the streamer and the main Kafka Connect loop. | | QueueRemainingCapacity | `int` | The free capacity of the queue used to pass events between the streamer and the main Kafka Connect loop. | | Connected | `boolean` | Indicates whether the connector is currently connected to the database server. | -| MilliSecondsBehindSource | `long` | The number of milliseconds between the last change event’s timestamp and when the connector processed it. 
The value incorporates any differences between the clocks on the machines where the database server and the connector are running. | +| MilliSecondsBehindSource | `long` | The number of milliseconds between the last change event's timestamp and when the connector processed it. The value incorporates any differences between the clocks on the machines where the database server and the connector are running. | | SourceEventPosition | `Map` | The coordinates of the last received event. | | LastTransactionId | `string` | Transaction identifier of the last processed transaction. | | MaxQueueSizeInBytes | `long` | The maximum buffer of the queue in bytes. This metric is available if `max.queue.size.in.bytes` is set to a positive long value. | diff --git a/docs/content/preview/explore/change-data-capture/using-yugabytedb-grpc-replication/cdc-overview.md b/docs/content/preview/explore/change-data-capture/using-yugabytedb-grpc-replication/cdc-overview.md deleted file mode 100644 index ee018de84821..000000000000 --- a/docs/content/preview/explore/change-data-capture/using-yugabytedb-grpc-replication/cdc-overview.md +++ /dev/null @@ -1,13 +0,0 @@ - diff --git a/docs/content/preview/explore/change-data-capture/using-yugabytedb-grpc-replication/debezium-connector-yugabytedb.md b/docs/content/preview/explore/change-data-capture/using-yugabytedb-grpc-replication/debezium-connector-yugabytedb.md index 91c4a5169b02..2f3bc22e0b81 100644 --- a/docs/content/preview/explore/change-data-capture/using-yugabytedb-grpc-replication/debezium-connector-yugabytedb.md +++ b/docs/content/preview/explore/change-data-capture/using-yugabytedb-grpc-replication/debezium-connector-yugabytedb.md @@ -13,14 +13,15 @@ menu: preview: parent: explore-change-data-capture-grpc-replication identifier: debezium-connector-yugabytedb - weight: 20 + weight: 50 type: docs rightNav: hideH4: true --- -The YugabyteDB gRPC Connector captures row-level changes in a YugabyteDB database's schemas. 
-## YugabyteDB gRPC Connector compatibility +The YugabyteDB gRPC Connector is based on the Debezium Connector, and captures row-level changes in a YugabyteDB database's schemas using the YugabyteDB gRPC replication protocol. + +## Connector compatibility The connector is compatible with the following versions of YugabyteDB. @@ -31,49 +32,50 @@ The connector is compatible with the following versions of YugabyteDB. | 2.18.2 | 1.9.5.y.33.2 | | 2.20 | 1.9.5.y.220.2 | -Compatibility -* Kafka Connect: The connector supports version 2.x and later. -* YugabyteDB: The connector supports version 2.14 and later. +In addition, the connector supports the following: + +* Kafka Connect v2.x and later. +* YugabyteDB v2.14 and later. {{< note title="Note" >}} Starting with YugabyteDB v2.20, the naming convention for releases of the connector uses the scheme *major.y.minor*, as follows: -- *major* - Debezium release the connector is based on -- *minor* - version of YugabyteDB the connector works with + +* *major* - Debezium release the connector is based on +* *minor* - version of YugabyteDB the connector works with + The connector is backward compatible with previous releases of YugabyteDB unless stated otherwise. {{< /note >}} -## Initial Snapshot and Continuous Streaming: +## Initial Snapshot and Continuous Streaming * Initial Snapshot: Upon its first connection to a YugabyteDB cluster, the connector takes a consistent snapshot of the configured tables. * Continuous Streaming: After the snapshot, it continuously captures row-level changes (insertions, updates, and deletions) from the database. It then generates data change event records and streams them to Kafka topics. -![What is CDC](/images/explore/cdc-overview-what.png) - - +![What is CDC](/images/explore/cdc-overview-work.png) -## Kafka Integration: +## Kafka integration For each table, the connector streams all generated events to a separate Kafka topic. 
Client applications and services can consume these data change event records from their respective topics. -* CDC (Change Data Capture) Service: The Debezium connector for YugabyteDB leverages the CDC service APIs to read the changes from YugabyteDB. +* CDC (Change Data Capture) Service: The Debezium connector leverages the CDC service APIs to read the changes from YugabyteDB. * Event Production: For every row-level insert, update, and delete operation captured, the connector produces a corresponding change event and sends it to separate Kafka topics dedicated to each table. * Client Consumption: Applications read the Kafka topics corresponding to the database tables they are interested in and react to the row-level events received. -## Failure Tolerance +## Failure tolerance + The connector records the WAL position for each event as it reads changes and produces events. If the connector stops (due to communication failures, network problems, or crashes), it resumes reading the WAL from the last recorded position upon restart. This uses checkpoints managed on both the Kafka side and the YugabyteDB cluster. + {{< tip title="Use UTF-8 encoding" >}} Debezium supports databases with UTF-8 character encoding only. With a single-byte character encoding, it's not possible to correctly process strings that contain extended ASCII code characters. {{< /tip >}} - - ## How the connector works -To optimally configure and run a Debezium YugabyteDB connector, it is helpful to understand how the connector performs snapshots, streams change events, determines Kafka topic names, and uses metadata. +To optimally configure and run a Debezium connector, it is helpful to understand how the connector performs snapshots, streams change events, determines Kafka topic names, and uses metadata. ### Security @@ -89,7 +91,7 @@ Per-user CDC privileges are planned for a future release. Most YugabyteDB servers are configured to not retain the complete history of the database in the WAL segments. 
This means that the YugabyteDB connector would be unable to see the entire history of the database by reading only the WAL. Consequently, the first time that the connector starts, it performs an initial consistent snapshot of the database. You can change this behavior by setting the `snapshot.mode` connector configuration property to a value other than initial. -After the connector completes its initial snapshot, the YugabyteDB connector continues streaming the changes. This ensures that the connector does not miss any updates. If the connector stops again for any reason, upon restart, the connector continues streaming changes from where it previously left off. +After the connector completes its initial snapshot, it continues streaming the changes. This ensures that the connector does not miss any updates. If the connector stops again for any reason, upon restart, the connector continues streaming changes from where it previously left off. Options for the `snapshot.mode` connector configuration property are as follows: @@ -101,11 +103,11 @@ Options for the `snapshot.mode` connector configuration property are as follows: ### Streaming changes -The YugabyteDB connector typically spends the vast majority of its time streaming changes from the YugabyteDB server to which it is connected. +The YugabyteDB gRPC Connector typically spends the vast majority of its time streaming changes from the YugabyteDB server to which it is connected. The connector keeps polling for changes and whenever there is a change, the connector processes them, converts them to a specific format (Protobuf or JSON in the case of the Debezium plugin) and writes them on an output stream, which can then be consumed by clients. -The Debezium YugabyteDB connector acts as a YugabyteDB client. When the connector receives changes it transforms the events into Debezium create, update, or delete events that include the LSN of the event.
The YugabyteDB connector forwards these change events in records to the Kafka Connect framework, which is running in the same process. The Kafka Connect process asynchronously writes the change event records in the same order in which they were generated to the appropriate Kafka topic. +The connector acts as a YugabyteDB client. When the connector receives changes it transforms the events into Debezium create, update, or delete events that include the LSN of the event. The connector forwards these change events in records to the Kafka Connect framework, which is running in the same process. The Kafka Connect process asynchronously writes the change event records in the same order in which they were generated to the appropriate Kafka topic. Periodically, Kafka Connect records the most recent offset in another Kafka topic. The offset indicates source-specific position information that Debezium includes with each event. @@ -113,13 +115,13 @@ When Kafka Connect gracefully shuts down, it stops the connectors, and flushes a {{< note title="Schema changes" >}} -The YugabyteDB connector retrieves schema information as part of the change events which consist of the schema metadata for the table. When there is any schema change on the configured table, the connector will automatically receive an event pertaining to the change and it will update its internal schema. +The connector retrieves schema information as part of the change events which consist of the schema metadata for the table. When there is any schema change on the configured table, the connector will automatically receive an event pertaining to the change and it will update its internal schema. {{< /note >}} ### Topic names -By default, the YugabyteDB connector writes change events for all `INSERT`, `UPDATE`, and `DELETE` operations that occur in a table to a single Apache Kafka topic that is specific to that table. The connector names change event topics as _serverName.schemaName.tableName_. 
+By default, the YugabyteDB gRPC connector writes change events for all `INSERT`, `UPDATE`, and `DELETE` operations that occur in a table to a single Apache Kafka topic that is specific to that table. The connector names change event topics as _serverName.schemaName.tableName_. The components of a topic name are as follows: @@ -147,7 +149,7 @@ If the default topic names don't meet your requirements, you can configure custo ### Meta information -In addition to the data change event, each record produced by the YugabyteDB connector contains some metadata. Metadata includes information about which tablet caused the change event to occur, the commit time, table, database, offset of the event, for example: +In addition to the data change event, each record produced by the connector contains some metadata. Metadata includes information about which tablet caused the change event to occur, the commit time, table, database, offset of the event, for example: ```output.json "source": { @@ -250,7 +252,7 @@ For example: ## Data change events -The Debezium YugabyteDB connector generates a data change event for each row-level `INSERT`, `UPDATE`, and `DELETE` operation. Each event contains a key and a value. The structure of the key and the value depends on the table that was changed. +The connector generates a data change event for each row-level `INSERT`, `UPDATE`, and `DELETE` operation. Each event contains a key and a value. The structure of the key and the value depends on the table that was changed. Debezium and Kafka Connect are designed around continuous streams of event messages. However, the structure of these events may change over time, which can be difficult for consumers to handle. To address this, each event contains the schema for its content. This makes each event self-contained. 
@@ -284,7 +286,7 @@ If you use the JSON converter and you configure it to produce all four basic cha {{< warning title="Naming conflicts due to invalid characters" >}} -The YugabyteDB connector ensures that all Kafka Connect schema names adhere to the [Avro schema name format](http://avro.apache.org/docs/current/spec.html#names). This means that the logical server name must start with a Latin letter or an underscore (a-z, A-Z, or \_). Each remaining character in the logical server name and each character in the schema and table names must be a Latin letter, a digit, or an underscore (a-z, A-Z, 0-9, or \_). Invalid characters are replaced with an underscore character. +The YugabyteDB gRPC connector ensures that all Kafka Connect schema names adhere to the [Avro schema name format](http://avro.apache.org/docs/current/spec.html#names). This means that the logical server name must start with a Latin letter or an underscore (a-z, A-Z, or \_). Each remaining character in the logical server name and each character in the schema and table names must be a Latin letter, a digit, or an underscore (a-z, A-Z, 0-9, or \_). Invalid characters are replaced with an underscore character. This can lead to unexpected conflicts if the logical server name, a schema name, or a table name contains invalid characters, in the event that the only characters that distinguish names from one another are invalid, and thus replaced with underscores. @@ -704,7 +706,7 @@ A `delete` change event record provides a consumer with the information it needs #### Tombstone events -When a row is deleted, the _delete_ event value still works with log compaction, because Kafka can remove all earlier messages that have that same key. However, for Kafka to remove all messages that have that same key, the message value must be `null`. To make this possible, the YugabyteDB connector follows a delete event with a special _tombstone_ event that has the same key but a null value. 
+When a row is deleted, the _delete_ event value still works with log compaction, because Kafka can remove all earlier messages that have that same key. However, for Kafka to remove all messages that have that same key, the message value must be `null`. To make this possible, the connector follows a delete event with a special _tombstone_ event that has the same key but a null value. {{< tip title="TRUNCATE tables when CDC is enabled" >}} @@ -726,14 +728,14 @@ If you set the property to `false` to prevent the connector from saving tombston ## Datatype mappings -The YugabyteDB connector represents changes to rows with events that are structured like the table in which the row exists. The event contains a field for each column value. How that value is represented in the event depends on the YugabyteDB data type of the column. The following sections describe how the connector maps YugabyteDB data types to a literal type and a semantic type in event fields. +The connector represents changes to rows with events that are structured like the table in which the row exists. The event contains a field for each column value. How that value is represented in the event depends on the YugabyteDB data type of the column. The following sections describe how the connector maps YugabyteDB data types to a literal type and a semantic type in event fields. * The literal type describes how the value is literally represented using Kafka Connect schema types: INT8, INT16, INT32, INT64, FLOAT32, FLOAT64, BOOLEAN, STRING, BYTES, ARRAY, MAP, and STRUCT. * The semantic type describes how the Kafka Connect schema captures the meaning of the field using the name of the Kafka Connect schema for the field. ### Default values -If there is a default value for any column in a the YugabyteDB database schema, the YugabyteDB Debezium connector propagates the same value to the Kafka schema. 
+If there is a default value for any column in the YugabyteDB database schema, the connector propagates the same value to the Kafka schema. ### Basic types @@ -826,7 +828,7 @@ YugabyteDB supports using `+/-infinity` values in `TIMESTAMP` columns. These spe ### Decimal types -The setting of the YugabyteDB connector configuration property `decimal.handling.mode` determines how the connector maps decimal types. +The setting of the connector configuration property `decimal.handling.mode` determines how the connector maps decimal types. {{< note title="Note" >}} @@ -926,7 +928,7 @@ Support for the following YugabyteDB data types will be enabled in future releas ## Setting up YugabyteDB -Before using the YugabyteDB connector to monitor the changes on a YugabyteDB server, you need to ensure the following: +Before using the connector to monitor the changes on a YugabyteDB server, you need to ensure the following: * You have a stream ID created on the database you want to monitor the changes for. The stream can be created using the [yb-admin create_change_data_stream](../../../../admin/yb-admin#create_change_data_stream) command. * The table which is supposed to be monitored should have a primary key. Only tables which have a primary key can be streamed. See [limitations](../cdc-overview/#known-limitations). @@ -939,11 +941,11 @@ For example, the connector is lagging behind in streaming the changes. In this c ## Deployment -To deploy a Debezium YugabyteDB connector, you install the Debezium YugabyteDB connector archive, configure the connector, and start the connector by adding its configuration to Kafka Connect. For complete steps, follow the guide to [running the Debezium connector for YugabyteDB](../../../../integrations/cdc/debezium/). +To deploy a Debezium connector, you install the YugabyteDB gRPC Connector archive, configure the connector, and start the connector by adding its configuration to Kafka Connect.
For complete steps, follow the guide to [running the Debezium connector in YugabyteDB](../../../../integrations/cdc/debezium/). ### Connector configuration example -Following is an example of the configuration for a YugabyteDB connector that connects to a YugabyteDB server on port 5433 at 127.0.0.1, whose logical name is `dbserver1`. Typically, you configure the Debezium YugabyteDB connector in a JSON file by setting the configuration properties available for the connector. +Following is an example of the configuration for a connector that connects to a YugabyteDB server on port 5433 at 127.0.0.1, whose logical name is `dbserver1`. Typically, you configure the connector in a JSON file by setting the configuration properties available for the connector. You can choose to produce events for a subset of the schemas and tables in a database. Optionally, you can ignore, mask, or truncate columns that contain sensitive data, are larger than a specified size, or that you do not need. @@ -966,7 +968,7 @@ You can choose to produce events for a subset of the schemas and tables in a dat ``` 1. The name of the connector when registered with a Kafka Connect service. -1. The name of this YugabyteDB connector class. +1. The name of this YugabyteDB gRPC Connector class. 1. The address of this YugabyteDB server. 1. The port number of the YugabyteDB YSQL process. 1. List of comma separated values of master nodes of the YugabyteDB server. Usually in the form `host`:`port`. @@ -996,16 +998,16 @@ See [Transformers](#transformers). ### Adding connector configuration -To run a YugabyteDB connector, create a connector configuration and add the configuration to your Kafka Connect cluster. +To run a connector, create a connector configuration and add the configuration to your Kafka Connect cluster. #### Prerequisites * [YugabyteDB is configured](#setting-up-yugabytedb) for change data capture. -* The YugabyteDB connector is installed. +* The YugabyteDB gRPC connector is installed. 
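When the connector configuration is added through the Kafka Connect REST API, it is wrapped in a JSON document with a `name` and a nested `config` object. The following is a minimal sketch only — the host addresses, database credentials, stream ID, and table list are placeholder values to be replaced with your own:

```json
{
  "name": "ybconnector",
  "config": {
    "connector.class": "io.debezium.connector.yugabytedb.YugabyteDBConnector",
    "database.hostname": "127.0.0.1",
    "database.port": "5433",
    "database.master.addresses": "127.0.0.1:7100",
    "database.user": "yugabyte",
    "database.password": "yugabyte",
    "database.dbname": "yugabyte",
    "database.server.name": "dbserver1",
    "table.include.list": "public.orders",
    "database.streamid": "<your-stream-id>"
  }
}
```

The `database.streamid` value is the stream ID returned by `yb-admin create_change_data_stream`; the same properties can instead be supplied in a standalone properties file, in which case the `name`/`config` wrapper is omitted.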
#### Procedure -1. Create a configuration for the YugabyteDB connector. +1. Create a configuration for the connector. 2. Use the [Kafka Connect REST API](https://kafka.apache.org/documentation/#connect_rest) to add that connector configuration to your Kafka Connect cluster. #### Results @@ -1014,7 +1016,7 @@ After the connector starts, it will perform a snapshot of the tables depending o ### Connector configuration properties -The Debezium YugabyteDB connector has many configuration properties that you can use to achieve the right connector behavior for your application. Many properties have default values. +The connector has many configuration properties that you can use to achieve the right connector behavior for your application. Many properties have default values. The following properties are _required_ unless a default value is available: @@ -1097,11 +1099,11 @@ The following three transformers are available: YBExtractNewRecordState, Extract Transformer type: `io.debezium.connector.yugabytedb.transforms.YBExtractNewRecordState` -Unlike the Debezium Connector for PostgreSQL, the YugabyteDB connector only sends the `after` image of the "set of columns" that are modified. PostgreSQL sends the complete `after` image of the row which has changed. So by default if the column was not changed, it is not a part of the payload that is sent and the default value is set to `null`. +Unlike the Debezium connector for PostgreSQL, the YugabyteDB gRPC Connector only sends the `after` image of the "set of columns" that are modified. PostgreSQL sends the complete `after` image of the row which has changed. So by default if the column was not changed, it is not a part of the payload that is sent and the default value is set to `null`. -To differentiate between the case where a column is set to `null` and the case in which it's not modified, the YugabyteDB connector changes the value type to a struct. 
In this structure, an unchanged column is `{'value': null}`, whereas the column changed to a null value is `{'value': null, 'set': true}`. +To differentiate between the case where a column is set to `null` and the case in which it's not modified, the YugabyteDB gRPC Connector changes the value type to a struct. In this structure, an unchanged column is `{'value': null}`, whereas the column changed to a null value is `{'value': null, 'set': true}`. -A schema registry requires that, once a schema is registered, records must contain only payloads with that schema version. If you're using a schema registry, the YugabyteDB Debezium connector's approach can be problematic, as the schema may change with every message. For example, if we keep changing the record to only include the value of modified columns, the schema of each record will be different (the total number unique schemas will be a result of making all possible combinations of columns) and thus would require sending a schema with every record. +A schema registry requires that, once a schema is registered, records must contain only payloads with that schema version. If you're using a schema registry, the YugabyteDB gRPC Connector's approach can be problematic, as the schema may change with every message. For example, if we keep changing the record to only include the value of modified columns, the schema of each record will be different (the total number of unique schemas will be a result of making all possible combinations of columns) and thus would require sending a schema with every record. To avoid this problem when you're using a schema registry, use the `YBExtractNewRecordState` SMT (Single Message Transformer for Kafka), which interprets these values and sends the record in the correct format (by removing the unmodified columns from the JSON message). Records transformed by `YBExtractNewRecordState` are compatible with all sink implementations.
This approach ensures that the schema doesn't change with each new record and it can work with a schema registry. @@ -1123,7 +1125,7 @@ The transformation defines the following configurations: * `field.name` - The name of the field which should be used as the topic name. If `null` or empty, the entire key or value is used (and assumed to be a string). By default is `null`. * `skip.missing.or.null` - In case the source of the new topic name is `null` or missing, should a record be silently passed without transformation. By default, is `false`. -Here is an example of this transformation configuration: +The following is an example of this transformation configuration: ```properties ... @@ -1168,7 +1170,7 @@ For usage example, refer to YugabyteDB CDC Consistent Streaming Pipeline in the ### Transaction boundaries -The connector publishes metadata that can be used to distinguish transaction boundaries for a downstream application to implement atomicity. Once the configuration property `provide.transaction.metadata` is enabled, the connector will also publish events indicating the beginning and end of the transaction. For more information, see [Transaction metadata](#transaction-metadata). +The connector publishes metadata that can be used to distinguish transaction boundaries for a downstream application to implement atomicity. After the configuration property `provide.transaction.metadata` is enabled, the connector will also publish events indicating the beginning and end of the transaction. For more information, see [Transaction metadata](#transaction-metadata). ### Prerequisites @@ -1203,15 +1205,15 @@ In these cases, the error message has details about the problem and possibly a s ### YB-TServer becomes unavailable -In case one of the tablet servers crashes, the replicas on other YB-TServer nodes will become the leader for the tablets that were hosted on the crashed server. 
The YugabyteDB connector will figure out the new tablet leaders and start streaming from the checkpoint the Debezium maintains. +In case one of the tablet servers crashes, the replicas on other YB-TServer nodes will become the leader for the tablets that were hosted on the crashed server. The YugabyteDB gRPC Connector will figure out the new tablet leaders and start streaming from the checkpoint that Debezium maintains. ### YugabyteDB server failures -In case of YugabyteDB server failures, the Debezium YugabyteDB connector will try for a configurable (using a [flag](../../../../reference/configuration/yb-tserver/#change-data-capture-cdc-flags)) amount of time for the availability of the YB-TServer and will stop if the cluster cannot start. When the cluster is restarted, the connector can be run again and it will start processing the changes with the committed checkpoint. +In case of YugabyteDB server failures, the YugabyteDB gRPC Connector will wait a configurable amount of time for the YB-TServer to become available, and will stop if the cluster cannot start. When the cluster is restarted, the connector can be run again and it will start processing the changes from the committed checkpoint. ### Connector unable to find table association with stream ID -In this case, the Connector throws an exception with an error message that the table is not a part of the stream ID. +In this case, the connector throws an exception with an error message that the table is not a part of the stream ID. This can happen in the following 2 scenarios: @@ -1222,13 +1224,14 @@ This can happen in the following 2 scenarios: When the connector is running, the YugabyteDB server that it is connected to could become unavailable for any number of reasons. If this happens, the connector fails with an error and stops. When the server is available again, restart the connector. -The YugabyteDB connector externally stores the last processed offset in the form of a checkpoint.
After a connector restarts and connects to a server instance, the connector communicates with the server to continue streaming from that particular offset. This offset is available as long as the stream ID remains intact. Never delete a stream ID without deleting all the associated connectors with it, otherwise you will lose data. +The connector externally stores the last processed offset in the form of a checkpoint. After a connector restarts and connects to a server instance, the connector communicates with the server to continue streaming from that particular offset. This offset is available as long as the stream ID remains intact. Never delete a stream ID without first deleting all the connectors associated with it; otherwise, you will lose data. ## Dropping a table part of the replication While the connector is running with a set of tables configured to capture the changes, if one of the tables in the set is dropped, the connector will fail with an error message indicating that the object is not found. To avoid or resolve a failure due to a dropped table, follow these steps: + 1. Delete the connector that contains the table that was dropped, or that you want to drop. 2. Edit the configuration and remove the given table from `table.include.list`. 3. Deploy a new connector with the updated configuration. diff --git a/docs/content/preview/explore/going-beyond-sql/connection-mgr-ysql.md b/docs/content/preview/explore/going-beyond-sql/connection-mgr-ysql.md index 0278a0ef7bf4..22e306cd810d 100644 --- a/docs/content/preview/explore/going-beyond-sql/connection-mgr-ysql.md +++ b/docs/content/preview/explore/going-beyond-sql/connection-mgr-ysql.md @@ -48,7 +48,7 @@ YSQL Connection Manager has the following key features: ## How to use -To start a YugabtyeDB cluster with YSQL Connection Manager, set the [yb-tserver](../../../reference/configuration/yb-tserver/) flag `enable_ysql_conn_mgr` to true.
+To start a YugabyteDB cluster with YSQL Connection Manager, set the [yb-tserver](../../../reference/configuration/yb-tserver/) flag `enable_ysql_conn_mgr` to true. When `enable_ysql_conn_mgr` is set, each YB-TServer starts the YSQL Connection Manager process along with the PostgreSQL process. You should see one YSQL Connection Manager process per YB-TServer. diff --git a/docs/content/preview/integrations/cdc/debezium.md b/docs/content/preview/integrations/cdc/debezium.md index a4b955587759..d749e919fff7 100644 --- a/docs/content/preview/integrations/cdc/debezium.md +++ b/docs/content/preview/integrations/cdc/debezium.md @@ -13,9 +13,9 @@ menu: type: docs --- -Debezium is an open-source distributed platform used to capture the changes in a database. +Debezium is an open-source distributed platform used to capture the changes in a database. The [YugabyteDB gRPC Connector](../../../explore/change-data-capture/using-yugabytedb-grpc-replication/) is based on the Debezium Connector, and captures row-level changes in a YugabyteDB database's schemas. -## Get the Debezium connector for YugabyteDB +## Get the YugabyteDB gRPC Connector Using Docker, you can get the connector from Quay: @@ -154,7 +154,7 @@ Do the following: }' ``` -For a list of all the configuration options provided with the Debezium YugabyteDB connector, see [Connector configuration properties](../../../explore/change-data-capture/using-yugabytedb-grpc-replication/debezium-connector-yugabytedb/#connector-configuration-properties). +For a list of all the configuration options provided with the YugabyteDB gRPC Connector, see [Connector configuration properties](../../../explore/change-data-capture/using-yugabytedb-grpc-replication/debezium-connector-yugabytedb/#connector-configuration-properties). 
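Registering the connector amounts to POSTing its JSON configuration to the Kafka Connect REST API. The following minimal Python sketch shows the shape of that call. The `database.streamid` and `database.master.addresses` property names follow the connector configuration reference; the connector name, stream ID, addresses, and `connect_url` endpoint are placeholders, and a real deployment needs additional properties.

```python
import json
from urllib import request

def build_connector_config(name, stream_id, master_addresses, tables):
    # Illustrative subset of connector properties; see the configuration
    # reference for the full list of required and optional settings.
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.yugabytedb.YugabyteDBConnector",
            "database.streamid": stream_id,
            "database.master.addresses": master_addresses,
            "table.include.list": ",".join(tables),
        },
    }

def register_connector(connect_url, connector_config):
    # POST the configuration to the Kafka Connect REST API
    # (e.g. connect_url = "http://localhost:8083", a placeholder).
    req = request.Request(
        connect_url.rstrip("/") + "/connectors",
        data=json.dumps(connector_config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

Kafka Connect responds with the created connector object; a 409 status indicates a connector with that name already exists.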
{{< tip title="TRUNCATE tables when CDC is enabled" >}} diff --git a/docs/content/preview/reference/configuration/yb-tserver.md b/docs/content/preview/reference/configuration/yb-tserver.md index e1b2381d05cd..11cebc2cda23 100644 --- a/docs/content/preview/reference/configuration/yb-tserver.md +++ b/docs/content/preview/reference/configuration/yb-tserver.md @@ -1296,6 +1296,38 @@ Number of tables to be added to the stream ID per run of the background thread w Default: `2` +The following flags are only relevant for CDC using the PostgreSQL replication protocol. To learn about CDC using the PostgreSQL replication protocol, see [CDC using logical replication](../../../architecture/docdb-replication/cdc-logical-replication). + +##### --ysql_yb_default_replica_identity + +The default replica identity to be assigned to user-defined tables at the time of creation. The flag is case-sensitive and can take only one of four possible values: `FULL`, `DEFAULT`, `NOTHING`, and `CHANGE`. + +Default: `CHANGE` + +##### --cdcsdk_enable_dynamic_table_support + +Tables created after the creation of a replication slot are referred to as dynamic tables. This preview flag can be used to switch the dynamic addition of tables to the publication ON or OFF. + +Default: `false` + +##### --cdcsdk_publication_list_refresh_interval_secs + +Interval in seconds at which the table list in the publication will be refreshed. + +Default: `3600` + +##### --cdcsdk_max_consistent_records + +Controls the maximum number of records sent from Virtual WAL (VWAL) to walsender in consistent order. + +Default: `500` + +##### --cdcsdk_vwal_getchanges_resp_max_size_bytes + +Max size (in bytes) of changes sent from CDC Service to [Virtual WAL](../../../architecture/docdb-replication/cdc-logical-replication) (VWAL) for a particular tablet.
+ +Default: `1 MB` + ## File expiration based on TTL flags ##### --tablet_enable_ttl_file_filter diff --git a/docs/content/preview/tutorials/azure/azure-event-hubs.md b/docs/content/preview/tutorials/azure/azure-event-hubs.md index 0a59ea932953..182e98d42fd8 100644 --- a/docs/content/preview/tutorials/azure/azure-event-hubs.md +++ b/docs/content/preview/tutorials/azure/azure-event-hubs.md @@ -13,7 +13,7 @@ menu: type: docs --- -The [Azure Event Hubs](https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-about) data streaming service is [Apache Kafka](https://kafka.apache.org/intro) compatible, enabling existing workloads to easily be moved to Azure. With the [Debezium Connector for YugabyteDB](../../../explore/change-data-capture/using-yugabytedb-grpc-replication/debezium-connector-yugabytedb), we can stream changes from a YugabyteDB cluster to a Kafka topic using [Kafka Connect](https://docs.confluent.io/platform/current/connect/index.html#:~:text=Kafka%20Connect%20is%20a%20tool,in%20and%20out%20of%20Kafka.). +The [Azure Event Hubs](https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-about) data streaming service is [Apache Kafka](https://kafka.apache.org/intro) compatible, enabling existing workloads to easily be moved to Azure. With the [YugabyteDB Debezium connector](../../../explore/change-data-capture/using-yugabytedb-grpc-replication/debezium-connector-yugabytedb), we can stream changes from a YugabyteDB cluster to a Kafka topic using [Kafka Connect](https://docs.confluent.io/platform/current/connect/index.html#:~:text=Kafka%20Connect%20is%20a%20tool,in%20and%20out%20of%20Kafka.). In this tutorial, we'll examine how the [YugabyteDB CDC](../../../explore/change-data-capture/using-yugabytedb-grpc-replication/cdc-overview/) can be used with Azure Event Hubs to stream real-time data for downstream processing. 
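Downstream processing typically needs to interpret the connector's change events. As described in the connector documentation, each column of the `after` image is encoded as a struct: `{'value': null}` for an unchanged column versus `{'value': null, 'set': true}` for a column explicitly set to NULL. A minimal consumer-side Python sketch of that interpretation (the event shape shown is illustrative):

```python
def interpret_column(cell):
    # {'value': None}              -> unchanged column        -> (False, None)
    # {'value': None, 'set': True} -> explicitly set to NULL  -> (True, None)
    # {'value': 42}                -> modified to 42          -> (True, 42)
    if cell.get("set"):
        return True, cell["value"]
    if cell.get("value") is not None:
        return True, cell["value"]
    return False, None

def modified_columns(after_image):
    # Keep only the columns the transaction actually touched, mirroring
    # what the YBExtractNewRecordState transformer does on the Kafka side.
    result = {}
    for name, cell in after_image.items():
        modified, value = interpret_column(cell)
        if modified:
            result[name] = value
    return result
```

For example, `modified_columns({"id": {"value": 1}, "name": {"value": None, "set": True}, "email": {"value": None}})` returns `{"id": 1, "name": None}`: the `email` column was untouched, while `name` was explicitly set to NULL.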
@@ -35,7 +35,7 @@ This application runs a Node.js process to insert order records to a YugabyteDB - An Azure Cloud account with permissions to create services - [Download Apache Kafka](https://kafka.apache.org/downloads) version 2.12-3.2.0 - [Download YugabyteDB](https://download.yugabyte.com/#/) version 2.16.8.0 -- [Download Debezium Connector for YugabyteDB](https://github.com/yugabyte/debezium-connector-yugabytedb/tree/v1.9.5.y.15) version 1.9.5.y.15 +- [Download YugabyteDB Debezium connector](https://github.com/yugabyte/debezium-connector-yugabytedb/tree/v1.9.5.y.15) version 1.9.5.y.15 - [Node.js](https://github.com/nodejs/release#release-schedule) version 18 ## Get started with YugabyteDB @@ -175,7 +175,7 @@ A Kafka Connect configuration file is required to provide information about the cp /path/to/event-hubs.config /path/to/kafka_2.12-3.2.0/bin ``` -1. Copy the Debezium Connector for YugabyteDB to the Kafka _libs_ directory. +1. Copy the YugabyteDB Debezium connector to the Kafka _libs_ directory. ```sh cp /path/to/debezium-connector-yugabytedb-1.9.5.y.15.jar /path/to/kafka_2.12-3.2.0/libs diff --git a/docs/content/preview/tutorials/cdc-tutorials/cdc-aws-msk.md b/docs/content/preview/tutorials/cdc-tutorials/cdc-aws-msk.md index d8f85949687e..e6e2b77d5748 100644 --- a/docs/content/preview/tutorials/cdc-tutorials/cdc-aws-msk.md +++ b/docs/content/preview/tutorials/cdc-tutorials/cdc-aws-msk.md @@ -14,7 +14,7 @@ type: docs Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed, highly available, and secure Apache Kafka service offered by Amazon Web Services (AWS). Using Amazon MSK, you can build and run applications using Apache Kafka without having to manage and operate your own Kafka clusters. -This tutorial describes how to configure Yugabyte CDC and stream data into Amazon MSK using Debezium connector, and assumes some familiarity with AWS, Apache Kafka, and CDC. 
+This tutorial describes how to configure Yugabyte CDC and stream data into Amazon MSK using a Debezium connector, and assumes some familiarity with AWS, Apache Kafka, and CDC. ![Architecture of YugabyteDB to MSK using Debezium](/images/explore/cdc/aws_msk_images/architecture.jpg) @@ -214,7 +214,7 @@ Create a Security Group with inbound and outbound rules configured to ensure acc ### Upload Debezium connector Jar file to the S3 bucket -Download the YugabyteDB Debezium connector jar from the [repository](https://github.com/yugabyte/debezium-connector-yugabytedb/releases/download/v1.9.5.y.19/debezium-connector-yugabytedb-1.9.5.y.19.jar) and upload it to an S3 bucket. +Download the connector jar from the [repository](https://github.com/yugabyte/debezium-connector-yugabytedb/releases/download/v1.9.5.y.19/debezium-connector-yugabytedb-1.9.5.y.19.jar) and upload it to an S3 bucket. ![Upload to S3](/images/explore/cdc/aws_msk_images/upload-to-s3.png) diff --git a/docs/content/preview/tutorials/cdc-tutorials/cdc-azure-event-hub.md b/docs/content/preview/tutorials/cdc-tutorials/cdc-azure-event-hub.md index a58b503724e0..584c8510293d 100644 --- a/docs/content/preview/tutorials/cdc-tutorials/cdc-azure-event-hub.md +++ b/docs/content/preview/tutorials/cdc-tutorials/cdc-azure-event-hub.md @@ -36,7 +36,7 @@ The following table describes how the data flows through each of these component | Step | Component | Operation | | --- | --- | --- | | 1 | Debezium YugabyteDB Kafka Connect | Stream the changed data sets requested from the source YugabyteDB Anywhere YSQL Tables. | -| 2 | Azure Event Hubs | Stream messages from Kafka to different targets. +| 2 | Azure Event Hubs | Stream messages from Kafka to different targets. | | 3 | Azure Synapse Analytics Pipeline| Used to schedule data-driven workflows that can ingest data from Azure Event Hubs to an Azure Data Lake Storage (ADLS) Gen 2 account. 
| | 4 | ADLS (Azure Data Lake Services) Gen2 | CDC data from the Azure Event Hub is saved to ADLS Gen2 in Parquet format. | | 5 | Azure Synapse workspace | Azure SQL Pools and Spark Pools can be used to analyze the CDC data from Yugabyte in near real time. | @@ -52,7 +52,7 @@ To get started, you'll need the following: ## Get Started -Use the following steps to move YugabyteDB CDC data into Azure Synapse Analytics using the YugabyteDB Debezium Connector that streams data into Azure Event Hubs. This data can be stored as Avro/JSON/Parquet in Azure Data Lake Storage Gen2 and then accessed via SQL Pools or Spark Pools in the Synapse workspace. +Use the following steps to move YugabyteDB CDC data into Azure Synapse Analytics using the YugabyteDB Debezium connector that streams data into Azure Event Hubs. This data can be stored as Avro/JSON/Parquet in Azure Data Lake Storage Gen2 and then accessed via SQL Pools or Spark Pools in the Synapse workspace. ### Step 1: Create an Event Hubs namespace and Event Hubs @@ -66,7 +66,7 @@ Now that you have created Event Hubs in Azure, you need to create a YugabyteDB C 1. [Download Apache Kafka](https://downloads.apache.org/kafka/). -1. Configure your event hub to connect and receive data from the Debezium Connector for Yugabyte. You can create an Azure Event Hub configuration file that will be saved locally on the machine, for example you can save the configuration file as `eventhub.config` in the Kafka `bin` directory. For more details on creating your configuration file, refer to the sample available in the [Kafka Connect for Event Hubs](https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-kafka-connect-tutorial#configure-kafka-connect-for-event-hubs) documentation. +1. Configure your event hub to connect and receive data from the YugabyteDB Debezium connector. 
You can create an Azure Event Hub configuration file that will be saved locally on the machine; for example, you can save the configuration file as `eventhub.config` in the Kafka `bin` directory. For more details on creating your configuration file, refer to the sample available in the [Kafka Connect for Event Hubs](https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-kafka-connect-tutorial#configure-kafka-connect-for-event-hubs) documentation. 1. Download the [Debezium connector jar file](https://github.com/yugabyte/debezium-connector-yugabytedb/releases/download/v1.9.5.y.19/debezium-connector-yugabytedb-1.9.5.y.19.jar) from the Yugabyte GitHub repository. Save this jar file in your Kafka `libs` folder (for example, `/home/azureuser/kafka_2.12-3.2.0/libs`). diff --git a/docs/content/preview/yugabyte-voyager/release-notes.md b/docs/content/preview/yugabyte-voyager/release-notes.md index 8a8c94cc8442..cb9b2e52d4a6 100644 --- a/docs/content/preview/yugabyte-voyager/release-notes.md +++ b/docs/content/preview/yugabyte-voyager/release-notes.md @@ -93,7 +93,7 @@ What follows are the release notes for the YugabyteDB Voyager v1 release series. - Improved UX of `yb-voyager get data-migration-report` for large set of tables by adding pagination. -- The YugabyteDB debezium connector version is upgraded to v1.9.5.y.33.2 to leverage support for precise decimal type handling with YugabyteDB versions 2.20.1.1 and later. +- The YugabyteDB Debezium connector version is upgraded to v1.9.5.y.33.2 to leverage support for precise decimal type handling with YugabyteDB versions 2.20.1.1 and later. - Enhanced [export data status](../reference/data-migration/export-data/#export-data-status) command to report number of rows exported for each table in case of offline migration.
diff --git a/docs/content/stable/architecture/docdb-replication/cdc-logical-replication.md b/docs/content/stable/architecture/docdb-replication/cdc-logical-replication.md new file mode 100644 index 000000000000..0c3bd470c93e --- /dev/null +++ b/docs/content/stable/architecture/docdb-replication/cdc-logical-replication.md @@ -0,0 +1,82 @@ +--- +title: Change data capture using Logical Replication in YugabyteDB +headerTitle: CDC using Logical Replication +linkTitle: CDC using Logical Replication +description: Learn how YugabyteDB supports asynchronous replication of data changes (inserts, updates, and deletes) to external databases or applications. +headContent: Asynchronous replication of data changes (inserts, updates, and deletes) to external databases or applications +badges: ea +menu: + stable: + parent: architecture-docdb-replication + identifier: architecture-docdb-replication-cdc-logical-replication + weight: 500 +type: docs +--- + +Change data capture (CDC) in YugabyteDB provides technology to ensure that any changes in data due to operations such as inserts, updates, and deletions are identified, captured, and made available for consumption by applications and other tools. + +CDC in YugabyteDB is based on the PostgreSQL Logical Replication model. The fundamental concept is that of the Replication Slot. A Replication Slot represents a stream of changes that can be replayed to the client in the order they were made on the origin server in a manner that preserves transactional consistency. This is the basis for the support for Transactional CDC in YugabyteDB. Where the strict requirements of Transactional CDC are not present, multiple replication slots can be used to stream changes from unrelated tables in parallel. + +## Architecture + +![Logical replication architecture](/images/architecture/cdc-logical-replication-architecture.png) + +The following are the main components of the Yugabyte CDC solution: + +1. 
Walsender - A special purpose PG backend responsible for streaming changes to the client and handling acknowledgments. + +2. Virtual WAL (VWAL) - Assembles changes from all the shards of user tables (under the publication) to maintain transactional consistency. + +3. CDCService - Retrieves changes from the WAL of a specified shard starting from a given checkpoint. + +### Data Flow + +Logical replication starts by copying a snapshot of the data on the publisher database. After that is done, changes on the publisher are streamed to the server as they occur in near real time. + +To set up Logical Replication, an application will first have to create a replication slot. When a replication slot is created, a boundary is established between the snapshot data and the streaming changes. This boundary or `consistent_point` is a consistent state of the source database. It corresponds to a commit time (HybridTime value). Data from transactions with commit time <= commit time corresponding to the `consistent_point` are consumed as part of the initial snapshot. Changes from transactions with commit time greater than the commit time of the `consistent_point` are consumed in the streaming phase in transaction commit time order. + +#### Initial Snapshot + +The initial snapshot data for each table is consumed by executing a corresponding snapshot query (SELECT statement) on that table. This snapshot query should be executed as of the database state corresponding to the `consistent_point`. This database state is represented by a value of HybridTime. + +First, a `SET LOCAL yb_read_time TO ' ht'` command should be executed on the connection (session). The SELECT statement corresponding to the snapshot query should then be executed as part of the same transaction. + +The HybridTime value to use in the `SET LOCAL yb_read_time` command is the value of the `snapshot_name` field that is returned by the `CREATE_REPLICATION_SLOT` command. 
Alternatively, it can be obtained by querying the `pg_replication_slots` view. + +During Snapshot consumption, the snapshot data from all tables will be from the same consistent state (`consistent_point`). At the end of Snapshot consumption, the state of the target system is at/based on the `consistent_point`. History of the tables as of the `consistent_point` is retained on the source until the snapshot is consumed. + +#### Streaming Data Flow + +YugabyteDB automatically splits user tables into multiple shards (also called tablets) using either a hash- or range-based strategy. The primary key of each row uniquely determines the tablet in which the row resides. + +Each tablet has its own WAL. WAL is NOT in-memory, but it is disk persisted. Each WAL preserves the information on the changes involved in the transactions (or changes) for that tablet as well as additional metadata related to the transactions. + +**Step 1 - Data flow from the tablet WAL to the VWAL** + +![CDCService-VWAL](/images/architecture/cdc_service_vwal_interaction.png) + +Each tablet sends changes in transaction commit time order. Further, in a transaction, the changes are in the order in which the operations were performed in the transaction. + +**Step 2 - Sorting in the VWAL and sending transactions to the Walsender** + +![VWAL-Walsender](/images/architecture/vwal_walsender_interaction.png) + +VWAL collects changes across multiple tablets, assembles the transactions, assigns LSN to each change and transaction boundary (BEGIN, COMMIT) record, and sends the changes to the Walsender in transaction commit time order. + +**Step 3 - Walsender to client** + +Walsender sends changes to the output plugin, which filters them according to the slot's publication and converts them into the client's desired format. These changes are then streamed to the client using the appropriate streaming replication protocols determined by the output plugin.
Yugabyte follows the same streaming replication protocols as defined in PostgreSQL. + + diff --git a/docs/content/stable/architecture/docdb-replication/change-data-capture.md b/docs/content/stable/architecture/docdb-replication/change-data-capture.md index 329abdaaedec..5259b6190914 100644 --- a/docs/content/stable/architecture/docdb-replication/change-data-capture.md +++ b/docs/content/stable/architecture/docdb-replication/change-data-capture.md @@ -1,45 +1,53 @@ --- -title: Change data capture (CDC) in YugabyteDB -headerTitle: Change data capture (CDC) -linkTitle: Change data capture (CDC) +title: Change data capture (CDC) gRPC Replication in YugabyteDB +headerTitle: CDC using gRPC Replication +linkTitle: CDC using gRPC Replication description: Learn how YugabyteDB supports asynchronous replication of data changes (inserts, updates, and deletes) to external databases or applications. -earlyAccess: /preview/releases/versioning/#feature-maturity +badges: ea menu: stable: parent: architecture-docdb-replication identifier: architecture-docdb-replication-cdc - weight: 500 + weight: 600 type: docs --- -Change data capture (CDC) in YugabyteDB provides technology to ensure that any changes in data due to operations such as inserts, updates, and deletions are identified, captured, and automatically applied to another data repository instance, or made available for consumption by applications and other tools. CDC provides the following guarantees. +## Architecture -- [Ordering is maintained per-tablet](#per-tablet-ordered-delivery) -- [At-least once delivery](#at-least-once-delivery) -- [No gaps](#no-gaps-in-change-stream) +Every YB-TServer has a `CDC service` that is stateless. The main APIs provided by the CDC service are the following: -## Architecture +- `createCDCSDKStream` API for creating the stream on the database. +- `getChangesCDCSDK` API that can be used by the client to get the latest set of changes. 
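The commit-time ordering with LSN assignment that the VWAL performs (Step 2 in the logical replication data flow above) can be modeled as a k-way merge. The following is a toy Python sketch, not the actual implementation; it assumes each per-tablet stream is already sorted by commit time, as each tablet sends its changes in commit time order:

```python
import heapq

def assemble_changes(tablet_streams):
    """Merge per-tablet change streams (each sorted by commit time) into one
    commit-time-ordered stream, assigning a monotonically increasing LSN."""
    merged = heapq.merge(*tablet_streams, key=lambda record: record[0])
    return [
        (lsn, commit_time, change)
        for lsn, (commit_time, change) in enumerate(merged, start=1)
    ]
```

For example, merging `[(100, "INSERT r1"), (300, "UPDATE r1")]` from one tablet with `[(200, "INSERT r2")]` from another yields three records at LSNs 1 through 3, ordered by commit times 100, 200, 300.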
![Stateless CDC Service](/images/architecture/stateless_cdc_service.png) -Every YB-TServer has a `CDC service` that is stateless. The main APIs provided by the CDC service are the following: +## CDC streams -* `createCDCSDKStream` API for creating the stream on the database. -* `getChangesCDCSDK` API that can be used by the client to get the latest set of changes. +YugabyteDB automatically splits user tables into multiple shards (also called tablets) using either a hash- or range-based strategy. The primary key of each row uniquely determines the tablet in which the row resides. -## CDC streams +Each tablet has its own WAL file. WAL is NOT in-memory, but it is disk persisted. Each WAL preserves the order in which transactions (or changes) happened. Hybrid TS, Operation ID, and additional metadata about the transaction are also preserved. + +![How does CDC work](/images/explore/cdc-overview-work2.png) + +YugabyteDB normally purges WAL segments after some period of time. This means that the connector does not have the complete history of all changes that have been made to the database. Therefore, when the connector first connects to a particular YugabyteDB database, it starts by performing a consistent snapshot of each of the database schemas. + +The YugabyteDB Debezium connector captures row-level changes in the schemas of a YugabyteDB database. The first time it connects to a YugabyteDB cluster, the connector takes a consistent snapshot of all schemas. After that snapshot is complete, the connector continuously captures row-level changes that insert, update, and delete database content, and that were committed to a YugabyteDB database. + +![How does CDC work](/images/explore/cdc-overview-work.png) + +The core primitive of CDC is the _stream_. Streams can be enabled and disabled on databases. You can specify which tables to include or exclude.
Every change to a watched database table is emitted as a record in a configurable format to a configurable sink. Streams scale to any YugabyteDB cluster independent of its size and are designed to impact production traffic as little as possible. -Creating a new CDC stream returns a stream UUID. This is facilitated via the [yb-admin](../../../admin/yb-admin/#change-data-capture-cdc-commands) tool. +Creating a new CDC stream returns a stream UUID. This is facilitated via the [yb-admin](../../../admin/yb-admin/#change-data-capture-cdc-commands) tool. A stream ID is created first, per database. You configure the maximum batch size in YugabyteDB, while the polling frequency is configured on the connector side. -## Debezium +Connector tasks can consume changes from multiple tablets. At-least-once delivery is guaranteed. In turn, connector tasks write to the Kafka cluster, and tasks don't need to match Kafka partitions. Tasks can be independently scaled up or down. -To consume the events generated by CDC, Debezium is used as the connector. Debezium is an open-source distributed platform that needs to be pointed at the database using the stream ID. For information on how to set up Debezium for YugabyteDB CDC, see [Debezium integration](../../../integrations/cdc/debezium/). +The connector produces a change event for every row-level insert, update, and delete operation that was captured, and sends change event records for each table in a separate Kafka topic. Client applications read the Kafka topics that correspond to the database tables of interest, and can react to every row-level event they receive from those topics. For each table, the default behavior is that the connector streams all generated events to a separate Kafka topic for that table. Applications and services consume data change event records from that topic. All changes for a row (or rows in the same tablet) are received in the order in which they happened.
A checkpoint per stream ID and tablet is updated in a state table after a successful write to Kafka brokers. -## Pushing changes to external systems +## CDC guarantees -Using the Debezium connector for YugabyteDB, changes are pushed from YugabyteDB to a Kafka topic, which can then be used by any end-user application for the processing and analysis of the records. +CDC in YugabyteDB provides technology to ensure that any changes in data due to operations (such as inserts, updates, and deletions) are identified, captured, and automatically applied to another data repository instance, or made available for consumption by applications and other tools. CDC provides the following guarantees. -## Per-tablet ordered delivery +### Per-tablet ordered delivery All data changes for one row or multiple rows in the same tablet are received in the order in which they occur. Due to the distributed nature of the problem, however, there is no guarantee for the order across tablets. @@ -51,13 +59,13 @@ Consider the following scenario: In this case, it is possible for CDC to push the later update corresponding to `row #2` change to Kafka before pushing the earlier update, corresponding to `row #1`. -## At-least-once delivery +### At-least-once delivery Updates for rows are pushed at least once. With the at-least-once delivery, you never lose a message, however the message might be delivered to a CDC consumer more than once. This can happen in case of a tablet leader change, where the old leader already pushed changes to Kafka, but the latest pushed `op id` was not updated in the CDC metadata. For example, a CDC client has received changes for a row at times `t1` and `t3`. It is possible for the client to receive those updates again. -## No gaps in change stream +### No gaps in change stream When you have received a change for a row for timestamp `t`, you do not receive a previously unseen change for that row from an earlier timestamp. 
This guarantees that receiving any change implies that all earlier changes have been received for a row. diff --git a/docs/content/stable/explore/change-data-capture/_index.md b/docs/content/stable/explore/change-data-capture/_index.md index 638f6782e272..f8a1bd4b17a4 100644 --- a/docs/content/stable/explore/change-data-capture/_index.md +++ b/docs/content/stable/explore/change-data-capture/_index.md @@ -4,7 +4,7 @@ headerTitle: Change data capture (CDC) linkTitle: Change data capture description: CDC or Change data capture is a process to capture changes made to data in the database. headcontent: Capture changes made to data in the database -image: /images/section_icons/index/develop.png +image: fa-light fa-rotate cascade: earlyAccess: /preview/releases/versioning/#feature-maturity menu: @@ -14,63 +14,42 @@ menu: weight: 280 type: indexpage --- -In databases, change data capture (CDC) is a set of software design patterns used to determine and track the data that has changed so that action can be taken using the changed data. CDC is beneficial in a number of scenarios. Let us look at few of them. +In databases, change data capture (CDC) is a set of software design patterns used to determine and track the data that has changed so that action can be taken using the changed data. CDC is beneficial in a number of scenarios: -- **Microservice-oriented architectures** : Some microservices require a stream of changes to the data, and using CDC in YugabyteDB can provide consumable data changes to CDC subscribers. +- **Microservice-oriented architectures**: Some microservices require a stream of changes to the data, and using CDC in YugabyteDB can provide consumable data changes to CDC subscribers. -- **Asynchronous replication to remote systems** : Remote systems may subscribe to a stream of data changes and then transform and consume the changes. Maintaining separate database instances for transactional and reporting purposes can be used to manage workload performance. 
+- **Asynchronous replication to remote systems**: Remote systems may subscribe to a stream of data changes and then transform and consume the changes. Maintaining separate database instances for transactional and reporting purposes can be used to manage workload performance. -- **Multiple data center strategies** : Maintaining multiple data centers enables enterprises to provide high availability (HA). +- **Multiple data center strategies**: Maintaining multiple data centers enables enterprises to provide high availability (HA). -- **Compliance and auditing** : Auditing and compliance requirements can require you to use CDC to maintain records of data changes. +- **Compliance and auditing**: Auditing and compliance requirements can require you to use CDC to maintain records of data changes. -## How does CDC work +YugabyteDB supports the following methods for reading change events. -YugabyteDB CDC captures changes made to data in the database and streams those changes to external processes, applications, or other databases. CDC allows you to track and propagate changes in a YugabyteDB database to downstream consumers based on its Write-Ahead Log (WAL). YugabyteDB CDC uses Debezium to capture row-level changes resulting from INSERT, UPDATE, and DELETE operations in the upstream database, and publishes them as events to Kafka using Kafka Connect-compatible connectors. +## PostgreSQL Replication Protocol -![What is CDC](/images/explore/cdc-overview-what.png) +This method uses the [PostgreSQL replication protocol](using-logical-replication/key-concepts/#replication-protocols), ensuring compatibility with PostgreSQL CDC systems. Logical replication operates through a publish-subscribe model. It replicates data objects and their changes based on the replication identity. -{{}} -To know more about the internals of CDC, see [Overview](./cdc-overview). 
-{{}} - -## Debezium connector - -To capture and stream your changes in YugabyteDB to an external system, you need a connector that can read the changes in YugabyteDB and stream it out. For this, you can use the Debezium connector. Debezium is deployed as a set of Kafka Connect-compatible connectors, so you first need to define a YugabyteDB connector configuration and then start the connector by adding it to Kafka Connect. - -{{}} -To understand how the various features and configuration of the connector, see [Debezium connector](./debezium-connector-yugabytedb). -{{}} - -## Monitoring +It works as follows: -You can monitor the activities and status of the deployed connectors using the http end points provided by YugabyteDB. +1. Create Publications in the YugabyteDB cluster similar to PostgreSQL. +1. Deploy the YugabyteDB Connector in your preferred Kafka Connect environment. +1. The connector uses replication slots to capture change events and publishes them directly to a Kafka topic. -{{}} -To know more about how to monitor your CDC setup, see [Monitor](./cdc-monitor). +{{}} +To learn about CDC in YugabyteDB using the PostgreSQL Replication Protocol, see [CDC using PostgreSQL Replication Protocol](./using-logical-replication). {{}} -For tutorials on streaming data to Kafka environments, including Amazon MSK, Azure Event Hubs, and Confluent Cloud, see [Kafka environments](/preview/tutorials/cdc-tutorials/). +## YugabyteDB gRPC Replication Protocol -{{}} +This method involves setting up a change stream in YugabyteDB that uses the native gRPC replication protocol to publish change events. - {{}} +It works as follows: -{{}} +1. Establish a change stream in the YugabyteDB cluster using the yb_admin CLI commands. +1. Deploy the YugabyteDB gRPC Connector in your preferred Kafka Connect environment. +1. The connector captures change events using YugabyteDB's native gRPC replication and directly publishes them to a Kafka topic. 
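The change stream in step 1 can be established with yb-admin. A sketch, assuming a local cluster with a master at `127.0.0.1:7100` and a YSQL database named `yugabyte`:

```sh
./bin/yb-admin \
    --master_addresses 127.0.0.1:7100 \
    create_change_data_stream ysql.yugabyte
```

The command prints the stream UUID, which is then supplied in the connector configuration.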
-## Learn more - -* [Examples of CDC usage and patterns](https://github.com/yugabyte/cdc-examples/tree/main) {{}} -* [Tutorials to deploy in different Kafka environments](/preview/tutorials/cdc-tutorials/) {{}} -* [Data Streaming Using YugabyteDB CDC, Kafka, and SnowflakeSinkConnector](https://www.yugabyte.com/blog/data-streaming-using-yugabytedb-cdc-kafka-and-snowflakesinkconnector/) {{}} -* [Unlock Azure Storage Options With YugabyteDB CDC](https://www.yugabyte.com/blog/unlocking-azure-storage-options-with-yugabytedb-cdc/) {{}} -* [Change Data Capture From YugabyteDB to Elasticsearch](https://www.yugabyte.com/blog/change-data-capture-cdc-yugabytedb-elasticsearch/) {{}} -* [Snowflake CDC: Publishing Data Using Amazon S3 and YugabyteDB](https://www.yugabyte.com/blog/snowflake-cdc-publish-data-using-amazon-s3-yugabytedb/) {{}} -* [Streaming Changes From YugabyteDB to Downstream Databases](https://www.yugabyte.com/blog/streaming-changes-yugabytedb-cdc-downstream-databases/) {{}} -* [Change Data Capture from YugabyteDB CDC to ClickHouse](https://www.yugabyte.com/blog/change-data-capture-cdc-yugabytedb-clickhouse/) {{}} -* [How to Run Debezium Server with Kafka as a Sink](https://www.yugabyte.com/blog/change-data-capture-cdc-run-debezium-server-kafka-sink/) {{}} -* [Change Data Capture Using a Spring Data Processing Pipeline](https://www.yugabyte.com/blog/change-data-capture-cdc-spring-data-processing-pipeline/) {{}} +{{}} +To learn about CDC in YugabyteDB using the gRPC Replication Protocol, see [CDC using gRPC Replication Protocol](./using-yugabytedb-grpc-replication/). 
+{{}} diff --git a/docs/content/stable/explore/change-data-capture/cdc-overview.md b/docs/content/stable/explore/change-data-capture/cdc-overview.md deleted file mode 100644 index e0094f2b70a1..000000000000 --- a/docs/content/stable/explore/change-data-capture/cdc-overview.md +++ /dev/null @@ -1,59 +0,0 @@ ---- -title: Overview of CDC internals -linkTitle: Overview -description: Change Data Capture in YugabyteDB. -headcontent: Change Data Capture in YugabyteDB -menu: - stable: - parent: explore-change-data-capture - identifier: cdc-overview - weight: 10 -type: docs ---- - -YugabyteDB automatically splits user tables into multiple shards (also called tablets) using either a hash- or range-based strategy. The primary key for each row in the table uniquely identifies the location of the tablet in the row. - -Each tablet has its own WAL file. WAL is NOT in-memory, but it is disk persisted. Each WAL preserves the order in which transactions (or changes) happened. Hybrid TS, Operation ID, and additional metadata about the transaction is also preserved. - -![How does CDC work](/images/explore/cdc-overview-work2.png) - -YugabyteDB normally purges WAL segments after some period of time. This means that the connector does not have the complete history of all changes that have been made to the database. Therefore, when the connector first connects to a particular YugabyteDB database, it starts by performing a consistent snapshot of each of the database schemas. - -The Debezium YugabyteDB connector captures row-level changes in the schemas of a YugabyteDB database. The first time it connects to a YugabyteDB cluster, the connector takes a consistent snapshot of all schemas. After that snapshot is complete, the connector continuously captures row-level changes that insert, update, and delete database content, and that were committed to a YugabyteDB database. 
- -![How does CDC work](/images/explore/cdc-overview-work.png) - -The connector produces a change event for every row-level insert, update, and delete operation that was captured, and sends change event records for each table in a separate Kafka topic. Client applications read the Kafka topics that correspond to the database tables of interest, and can react to every row-level event they receive from those topics. For each table, the default behavior is that the connector streams all generated events to a separate Kafka topic for that table. Applications and services consume data change event records from that topic. - -The core primitive of CDC is the _stream_. Streams can be enabled and disabled on databases. Every change to a watched database table is emitted as a record in a configurable format to a configurable sink. Streams scale to any YugabyteDB cluster independent of its size and are designed to impact production traffic as little as possible. - -![How does CDC work](/images/explore/cdc-overview-work3.png) - -## Known limitations - -* A single stream can only be used to stream data from one namespace only. -* There should be a primary key on the table you want to stream the changes from. -* CDC is not supported on a target table for xCluster replication [11829](https://github.com/yugabyte/yugabyte-db/issues/11829). -* Currently we don't support schema evolution for changes that require table rewrites (ex: ALTER TYPE). -* YCQL tables aren't currently supported. Issue [11320](https://github.com/yugabyte/yugabyte-db/issues/11320). - -In addition, CDC support for the following features will be added in upcoming releases: - -* Support for point-in-time recovery (PITR) is tracked in issue [10938](https://github.com/yugabyte/yugabyte-db/issues/10938). -* Support for transaction savepoints is tracked in issue [10936](https://github.com/yugabyte/yugabyte-db/issues/10936). 
-* Support for enabling CDC on Read Replicas is tracked in issue [11116](https://github.com/yugabyte/yugabyte-db/issues/11116). -* Support for schema evolution with before image is tracked in issue [15197](https://github.com/yugabyte/yugabyte-db/issues/15197). - -## Learn more - -* Refer to [CDC Examples](https://github.com/yugabyte/cdc-examples/tree/main) for CDC usage and pattern examples. -* Refer to [Tutorials](/preview/tutorials/cdc-tutorials/) to deploy in different Kafka environments. -* Refer to blogs about CDC: - * [Data Streaming Using YugabyteDB CDC, Kafka, and SnowflakeSinkConnector](https://www.yugabyte.com/blog/data-streaming-using-yugabytedb-cdc-kafka-and-snowflakesinkconnector/) - * [Unlock Azure Storage Options With YugabyteDB CDC](https://www.yugabyte.com/blog/unlocking-azure-storage-options-with-yugabytedb-cdc/) - * [Change Data Capture From YugabyteDB to Elasticsearch](https://www.yugabyte.com/blog/change-data-capture-cdc-yugabytedb-elasticsearch/) - * [Snowflake CDC: Publishing Data Using Amazon S3 and YugabyteDB](https://www.yugabyte.com/blog/snowflake-cdc-publish-data-using-amazon-s3-yugabytedb/) - * [Streaming Changes From YugabyteDB to Downstream Databases](https://www.yugabyte.com/blog/streaming-changes-yugabytedb-cdc-downstream-databases/) - * [Change Data Capture from YugabyteDB CDC to ClickHouse](https://www.yugabyte.com/blog/change-data-capture-cdc-yugabytedb-clickhouse/) - * [How to Run Debezium Server with Kafka as a Sink](https://www.yugabyte.com/blog/change-data-capture-cdc-run-debezium-server-kafka-sink/) - * [Change Data Capture Using a Spring Data Processing Pipeline](https://www.yugabyte.com/blog/change-data-capture-cdc-spring-data-processing-pipeline/) diff --git a/docs/content/stable/explore/change-data-capture/using-logical-replication/_index.md b/docs/content/stable/explore/change-data-capture/using-logical-replication/_index.md new file mode 100644 index 000000000000..956048ccdd2c --- /dev/null +++ 
b/docs/content/stable/explore/change-data-capture/using-logical-replication/_index.md @@ -0,0 +1,116 @@ +--- +title: CDC using PostgreSQL replication protocol +headerTitle: CDC using PostgreSQL replication protocol +linkTitle: PostgreSQL protocol +description: CDC using YugabyteDB PostgreSQL replication protocol. +headcontent: Capture changes made to data in the database +cascade: + earlyAccess: /preview/releases/versioning/#feature-maturity +menu: + stable: + identifier: explore-change-data-capture-logical-replication + parent: explore-change-data-capture + weight: 240 +type: indexpage +showRightNav: true +--- + +## Overview + +YugabyteDB CDC captures changes made to data in the database and streams those changes to external processes, applications, or other databases. CDC allows you to track and propagate changes in a YugabyteDB database to downstream consumers based on its Write-Ahead Log (WAL). YugabyteDB CDC captures row-level changes resulting from INSERT, UPDATE, and DELETE operations in the configured database and publishes them to be consumed by downstream applications. + +### Highlights + +#### Resilience + +YugabyteDB CDC with PostgreSQL Logical Replication provides resilience as follows: + +1. Following a failure of the application, server, or network, the replication can continue from any of the available server nodes. + +2. Replication continues from the transaction immediately after the transaction that was last acknowledged by the application. No transactions are missed by the application. + +#### Security + +Because YugabyteDB uses the PostgreSQL Logical Replication model, the following applies: + +- The CDC user persona will be a PostgreSQL replication client. + +- A standard replication connection is used for consumption, and all the server-side configurations for authentication, authorizations, SSL modes, and connection load balancing can be leveraged automatically. 
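Because consumption happens over a standard replication connection, provisioning a consumer follows the usual PostgreSQL pattern. A minimal sketch, with a hypothetical role name and password:

```sql
-- Hypothetical CDC consumer role; the REPLICATION attribute lets it
-- open a replication connection and consume from replication slots.
CREATE ROLE cdc_consumer WITH LOGIN REPLICATION PASSWORD 'change-me';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO cdc_consumer;
```

Server-side controls such as SSL modes and host-based authentication rules then apply to this role's replication connections as they would to any other client connection.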
+ +#### Guarantees + +CDC in YugabyteDB provides the following guarantees. + +| GUARANTEE | DESCRIPTION | +| :----- | :----- | +| Per-slot ordered delivery guarantee | Changes from transactions from all the tables that are part of the replication slot's publication are received in the order they were committed. This also implies ordered delivery across all the tablets that are part of the publication's table list. | +| At least once delivery | Changes from transactions are streamed at least once. Changes from transactions may be streamed again in the case of a restart after failure. For example, this can happen in the case of a Kafka Connect node failure. If the Kafka Connect node pushes the records to Kafka and crashes before committing the offset, it will again get the same set of records upon restart. | +| No gaps in change stream | Receiving changes that are part of a transaction with commit time *t* implies that you have already received changes from all transactions with commit time lower than *t*. Thus, receiving any change for a row with commit timestamp *t* implies that you have received all older changes for that row. | + +## Key concepts + +The YugabyteDB logical replication feature makes use of PostgreSQL concepts like replication slot, publication, replica identity, and so on. Understanding these key concepts is crucial for setting up and managing a logical replication environment effectively. + +{{}} +To know more about the key concepts of YugabyteDB CDC with logical replication, see [Key concepts](./key-concepts). +{{}} + +## Getting started + +Get started with YugabyteDB logical replication using the YugabyteDB Connector. + +{{}} + +To learn how to get started with the connector, see [Get started](./get-started). + +{{}} + +## Monitoring + +You can monitor the activities and status of the deployed connectors using the HTTP endpoints provided by YugabyteDB. + +{{}} +To know more about how to monitor your CDC setup, see [Monitor](./monitor/). 
+{{}} + +## YugabyteDB Connector + +To capture and stream your changes in YugabyteDB to an external system, you need a connector that can read the changes in YugabyteDB and stream them out. For this, you can use the YugabyteDB Connector, which is based on the Debezium platform. The connector is deployed as a set of Kafka Connect-compatible connectors, so you first need to define a YugabyteDB connector configuration and then start the connector by adding it to Kafka Connect. + +{{}} +To understand the various features and configuration of the connector, see [YugabyteDB Connector](./yugabytedb-connector/). +{{}} + +## Limitations + +- LSN Comparisons Across Slots. + + In the case of YugabyteDB, the LSN does not represent the byte offset of a WAL record. Hence, arithmetic on LSN and any other usages of the LSN making this assumption will not work. Also, currently, comparison of LSN values from messages coming from different replication slots is not supported. + +- The following functions are currently unsupported: + + - `pg_current_wal_lsn` + - `pg_wal_lsn_diff` + - `IDENTIFY SYSTEM` + - `txid_current` + - `pg_stat_replication` + + Additionally, the functions responsible for pulling changes, instead of the server streaming them, are unsupported as well. They are described in [Replication Functions](https://www.postgresql.org/docs/11/functions-admin.html#FUNCTIONS-REPLICATION) in the PostgreSQL documentation. + +- Restriction on DDLs + + DDL operations should not be performed from the time of replication slot creation until the start of snapshot consumption of the last table. + +- There should be a primary key on the table you want to stream the changes from. + +- CDC is not supported on a target table for xCluster replication [11829](https://github.com/yugabyte/yugabyte-db/issues/11829). + +- Currently we don't support schema evolution for changes that require table rewrites (for example, ALTER TYPE). + +- YCQL tables aren't currently supported. 
Issue [11320](https://github.com/yugabyte/yugabyte-db/issues/11320). + +- Support for point-in-time recovery (PITR) is tracked in issue [10938](https://github.com/yugabyte/yugabyte-db/issues/10938). + +- Support for transaction savepoints is tracked in issue [10936](https://github.com/yugabyte/yugabyte-db/issues/10936). + +- Support for enabling CDC on Read Replicas is tracked in issue [11116](https://github.com/yugabyte/yugabyte-db/issues/11116). \ No newline at end of file diff --git a/docs/content/stable/explore/change-data-capture/using-logical-replication/advanced-configuration.md b/docs/content/stable/explore/change-data-capture/using-logical-replication/advanced-configuration.md new file mode 100644 index 000000000000..c6ec89ad5cd9 --- /dev/null +++ b/docs/content/stable/explore/change-data-capture/using-logical-replication/advanced-configuration.md @@ -0,0 +1,36 @@ +--- +title: Advanced configurations for CDC using Logical Replication +headerTitle: Advanced configuration +linkTitle: Advanced configuration +description: Advanced Configurations for Logical Replication. 
+headcontent: Tune your CDC configuration +menu: + stable: + parent: explore-change-data-capture-logical-replication + identifier: advanced-configurations + weight: 40 +type: docs +--- + +## YB-TServer flags + +You can use the following [YB-TServer flags](../../../../reference/configuration/yb-tserver/) to tune logical replication deployment configuration: + +- [ysql_yb_default_replica_identity](../../../../reference/configuration/yb-tserver/#ysql-yb-default-replica-identity) +- [cdcsdk_enable_dynamic_table_support](../../../../reference/configuration/yb-tserver/#cdcsdk-enable-dynamic-table-support) +- [cdcsdk_publication_list_refresh_interval_secs](../../../../reference/configuration/yb-tserver/#cdcsdk-publication-list-refresh-interval-secs) +- [cdcsdk_max_consistent_records](../../../../reference/configuration/yb-tserver/#cdcsdk-max-consistent-records) +- [cdcsdk_vwal_getchanges_resp_max_size_bytes](../../../../reference/configuration/yb-tserver/#cdcsdk-vwal-getchanges-resp-max-size-bytes) + +## Retention of resources + +CDC retains resources (such as WAL segments) that contain information related to the changes involved in the transactions. These resources are typically retained until the consuming client acknowledges the receipt of all the transactions contained in that resource. + +Retaining resources has an impact on the system. Clients are expected to consume these transactions within configurable duration limits. Resources will be released if the duration exceeds these configured limits. + +Use the following flags to control the duration for which resources are retained: + +- [cdc_wal_retention_secs](../../../../reference/configuration/yb-tserver/#cdc-wal-retention-secs) +- [cdc_intent_retention_ms](../../../../reference/configuration/yb-tserver/#cdc-intent-retention-ms) + +Resources are retained for each tablet of a table that is part of a database whose changes are being consumed using a replication slot. 
This includes those tables that may not be currently part of the publication specification. diff --git a/docs/content/stable/explore/change-data-capture/using-logical-replication/advanced-topic.md b/docs/content/stable/explore/change-data-capture/using-logical-replication/advanced-topic.md new file mode 100644 index 000000000000..4db588af1d29 --- /dev/null +++ b/docs/content/stable/explore/change-data-capture/using-logical-replication/advanced-topic.md @@ -0,0 +1,208 @@ +--- +title: Advanced topics +headerTitle: Advanced topics +linkTitle: Advanced topics +description: Advanced topics for Change Data Capture in YugabyteDB. +menu: + stable: + parent: explore-change-data-capture-logical-replication + identifier: advanced-topics + weight: 50 +type: docs +--- + +This section explores a range of topics designed to provide deeper insights and enhance your understanding of advanced functionalities. + +## Schema evolution + +A change in the schema of the tables (ALTER TABLE) being streamed is transparently handled by the database without manual intervention. + +This is illustrated in the following example. The client used for the example is [pg_recvlogical](../get-started/#get-started-with-pg-recvlogical). + +1. Create a table and create the replication slot. pg_recvlogical uses the test_decoding output plugin by default. + + ```sql + CREATE TABLE demo_table (id INT PRIMARY KEY); + ``` + + ```sh + pg_recvlogical -d yugabyte --slot=demo_slot --create-slot + ``` + +1. Start streaming the changes from the replication slot. + + ```sh + pg_recvlogical -d yugabyte --slot=demo_slot --start -f - + ``` + +1. In a new shell, start ysqlsh and insert some data. + + ```sh + bin/ysqlsh + ``` + + ```sql + INSERT INTO demo_table VALUES (1); + INSERT INTO demo_table VALUES (2); + ``` + + pg_recvlogical receives the inserts and prints them on the console. 
+ + ```output + BEGIN 2 + TABLE public.demo_table: INSERT: id[integer]:1 + COMMIT 2 + BEGIN 3 + TABLE public.demo_table: INSERT: id[integer]:2 + COMMIT 3 + ``` + +1. Add a new column to the `demo_table` and insert some more rows. + + ```sql + ALTER TABLE demo_table ADD COLUMN address TEXT; + INSERT INTO demo_table VALUES (3, 'address1'); + INSERT INTO demo_table VALUES (4, 'address2'); + ``` + + Without any manual intervention, pg_recvlogical receives the inserts with the new schema and prints them on the console. + + ```output + BEGIN 4 + TABLE public.demo_table: INSERT: id[integer]:3 address[text]:'address1' + COMMIT 4 + BEGIN 5 + TABLE public.demo_table: INSERT: id[integer]:4 address[text]:'address2' + COMMIT 5 + ``` + +## Adding tables to publication + +Addition of tables to the streaming list after slot creation is currently a preview feature. To enable dynamic table addition, set the [cdcsdk_enable_dynamic_table_support](../../../../reference/configuration/yb-tserver/#cdcsdk-enable-dynamic-table-support) flag to true. + +The publication's tables list can change in two ways. The first way is by adding a table to the publication by performing an `ALTER PUBLICATION`. + +```sql +CREATE TABLE test_table_1(id INT PRIMARY KEY, aa INT, bb INT); +CREATE TABLE test_table_2(id INT PRIMARY KEY, aa INT, bb INT); + +CREATE PUBLICATION PUB FOR TABLE test_table_1; + +-- Start consumption through a replication slot. + +ALTER PUBLICATION PUB ADD TABLE test_table_2; + +CREATE TABLE test_table_3(id INT PRIMARY KEY, aa INT, bb INT); + +ALTER PUBLICATION PUB ADD TABLE test_table_3; +``` + +The second way is when a table is added to an `ALL TABLES` publication upon creation. + +```sql +CREATE TABLE test_table_1(id INT PRIMARY KEY, aa INT, bb INT); + +CREATE PUBLICATION PUB FOR ALL TABLES; + +-- Start consumption through a replication slot. + +CREATE TABLE test_table_2(id INT PRIMARY KEY, aa INT, bb INT); +-- Since the publication was created for ALL TABLES, alter publication is not required. 
+``` + +### YugabyteDB semantics + +Unlike PostgreSQL, any changes made to the publication's tables list are not applied immediately in YugabyteDB. Instead, the publication's tables list is periodically refreshed, and changes, if any, are applied. The refresh interval is managed using the [cdcsdk_publication_list_refresh_interval_secs](../../../../reference/configuration/yb-tserver/#cdcsdk-publication-list-refresh-interval-secs) flag. The default is one hour (3600 sec). This means that, in the worst case, any changes made to the publication's tables list will be applied after `cdcsdk_publication_list_refresh_interval_secs`. + +Consider the following example: + +- Suppose that the value of the flag `cdcsdk_publication_list_refresh_interval_secs` is 3600 sec (1 hour) and the publication's tables list is being refreshed every hour at 8 am, 9 am, 10 am, and so on. + +- If any change is made to the publication's tables list at 8:01 am, then this change will be applied at 9:00 am. However, any change made to the publication's tables list at 8:59 am will also be applied at 9:00 am. + +The value of this flag can be changed at run time, but the change becomes effective only after some time. Continuing the example: + +- Suppose that the value of the flag `cdcsdk_publication_list_refresh_interval_secs` is changed from 3600 sec (1 hour) to 600 sec (10 minutes) at 8:01 am. + +- This change will only be applied after 9:00 am. That is, the publication's tables list will be next refreshed at 9:00 am. Then, the next refresh will happen at 9:10 am, and the subsequent refreshes will take place every 10 minutes. + +### Required settings + +To enable dynamic table addition, perform the following steps: + +1. Set the [cdcsdk_enable_dynamic_table_support](../../../../reference/configuration/yb-tserver/#cdcsdk-enable-dynamic-table-support) flag to true. + + Because it is a preview flag, first add it to the `allowed_preview_flags_csv` list. 
+ + ```sh + ./yb-ts-cli --server_address= set_flag allowed_preview_flags_csv cdcsdk_enable_dynamic_table_support + ``` + + Then set the `cdcsdk_enable_dynamic_table_support` flag to true. + + ```sh + ./yb-ts-cli --server_address= set_flag cdcsdk_enable_dynamic_table_support true + ``` + +1. Set the [cdcsdk_publication_list_refresh_interval_secs](../../../../reference/configuration/yb-tserver/#cdcsdk-publication-list-refresh-interval-secs) flag to a lower value, such as 60 or 120 seconds. Note that the effect of this setting takes place after the upcoming publication refresh is performed. + + ```sh + ./yb-ts-cli --server_address= set_flag cdcsdk_publication_list_refresh_interval_secs 120 + ``` + +1. After you start receiving records from the newly added table in the publication, reset the `cdcsdk_publication_list_refresh_interval_secs` flag to a high value (for example, 3600 seconds). + + ```sh + ./yb-ts-cli --server_address= set_flag cdcsdk_publication_list_refresh_interval_secs 3600 + ``` + +## Initial snapshot + +The [initial snapshot](../../../../architecture/docdb-replication/cdc-logical-replication/#initial-snapshot) data for a table is consumed by executing a snapshot query (SELECT statement). To ensure that the streaming phase continues exactly from where the snapshot left off, this snapshot query is executed as of a specific database state. In YugabyteDB, this database state is represented by a value of `HybridTime`. Changes due to transactions with commit time strictly greater than this snapshot `HybridTime` will be consumed during the streaming phase. + +The consistent database state on which the snapshot query is to be executed is specified using the following command: + +```sql +SET LOCAL yb_read_time TO ' ht'; +``` + +This command should first be executed on the connection (session). The SELECT statement corresponding to the snapshot query should then be executed as part of the same transaction. 
The HybridTime value to use in the `SET LOCAL yb_read_time` command is the value of the `snapshot_name` field that is returned by the [CREATE REPLICATION SLOT](../../../../api/ysql/the-sql-language/statements/#streaming-replication-protocol-statements) command. + +You can also obtain this value by executing the following query: + +```sql +select yb_restart_commit_ht +from pg_replication_slots where slot_name = ; +``` + +For more information on the `pg_replication_slots` catalog view, refer to [pg_replication_slots](../monitor/#pg-replication-slots). + +### Permissions + +Only a superuser can execute the command to set the value of `yb_read_time`. + +For a non-superuser to be able to perform an initial snapshot, perform the following additional setup as a superuser (in addition to granting the required SELECT and USAGE privileges): + +```sql +CREATE ROLE appuser WITH LOGIN REPLICATION; +CREATE SCHEMA appuser AUTHORIZATION appuser; + +CREATE OR REPLACE PROCEDURE appuser.set_yb_read_time(value TEXT) +LANGUAGE plpgsql +AS $$ +BEGIN + EXECUTE 'SET LOCAL yb_read_time = ' || quote_literal(value); +END; +$$ +SECURITY DEFINER; + + +REVOKE EXECUTE ON PROCEDURE appuser.set_yb_read_time FROM PUBLIC; +GRANT EXECUTE ON PROCEDURE appuser.set_yb_read_time TO appuser; +``` + +With this setup, the command to be executed by the application user as part of the transaction prior to executing the snapshot SELECT query would be: + +```sql +CALL set_yb_read_time(' ht') +``` diff --git a/docs/content/stable/explore/change-data-capture/using-logical-replication/best-practices.md b/docs/content/stable/explore/change-data-capture/using-logical-replication/best-practices.md new file mode 100644 index 000000000000..4f9c2e9dfba0 --- /dev/null +++ b/docs/content/stable/explore/change-data-capture/using-logical-replication/best-practices.md @@ -0,0 +1,28 @@ +--- +title: Best Practices for logical replication +headerTitle: Best practices +linkTitle: Best practices +description: Best Practices for 
logical replication with Change Data Capture in YugabyteDB. +menu: + stable: + parent: explore-change-data-capture-logical-replication + identifier: best-practices-cdc + weight: 60 +type: docs +--- + +This section describes best practices to achieve scalability and performance while using CDC with logical replication. + +## Parallel consumption + +The recommended approach for consuming changes from different tables in parallel is to use multiple replication slots, for example, one replication slot per table. Each replication slot is independent of the others, and the changes from the tables can be consumed in parallel. + +## Fan out + +Consider a scenario in which multiple applications all need to consume changes from the same table. The recommended approach is to use one replication slot to consume the changes from the table and write them to a system such as Kafka. The fan out can then be implemented with the multiple applications consuming from Kafka. + +## Load balancing consumption + +An application can connect to any of the YB-TServer nodes to consume from a replication slot. Furthermore, even in the case of an interruption, a fresh connection can be made to a different node (different from the node from which consumption was previously happening) to continue consumption from the same replication slot. + +When there are multiple consuming applications, each consuming from a different replication slot, it is best that the applications connect to different YB-TServer nodes in the cluster to ensure better load balancing. The [YugabyteDB smart driver](../../../../drivers-orms/smart-drivers/) does this automatically, so it is recommended that applications use it. 
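As a sketch of the one-slot-per-table pattern (the slot and table names here are illustrative), each consuming application creates and owns its own slot:

```sql
-- One independent slot per table; each slot is consumed by a separate
-- application, so changes from the two tables flow in parallel.
SELECT * FROM pg_create_logical_replication_slot('orders_slot', 'test_decoding');
SELECT * FROM pg_create_logical_replication_slot('payments_slot', 'test_decoding');
```

Each consumer then points its client (pg_recvlogical or a connector) at its own slot, ideally connecting to a different YB-TServer node.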
diff --git a/docs/content/stable/explore/change-data-capture/using-logical-replication/get-started.md b/docs/content/stable/explore/change-data-capture/using-logical-replication/get-started.md new file mode 100644 index 000000000000..b30bb0e0b128 --- /dev/null +++ b/docs/content/stable/explore/change-data-capture/using-logical-replication/get-started.md @@ -0,0 +1,586 @@ +--- +title: Get started with CDC logical replication in YugabyteDB +headerTitle: Get started +linkTitle: Get started +description: Get started with Change Data Capture in YugabyteDB. +headcontent: Start using CDC with logical replication +menu: + stable: + parent: explore-change-data-capture-logical-replication + identifier: get-started + weight: 20 +type: docs +--- + +To get started streaming data change events from a YugabyteDB database using a replication slot, you can use either of the following client options: + +- [pg_recvlogical](#get-started-with-pg_recvlogical) +- [YugabyteDB connector](#get-started-with-yugabytedb-connector) + +{{< note title="Note" >}} + +CDC via logical replication is supported in YugabyteDB starting from version 2024.1.1. + +{{< /note >}} + +## Get started with pg_recvlogical + +pg_recvlogical is a command-line tool provided by PostgreSQL for interacting with the logical replication feature. It is specifically used to receive changes from the database using logical replication slots. + +YugabyteDB provides the pg_recvlogical binary in the `<yugabyte-home>/postgres/bin/` directory, which is based on PostgreSQL 11.2. Although PostgreSQL also offers a pg_recvlogical binary, you are strongly advised to use the YugabyteDB version to avoid compatibility issues.
+ +### Set up pg_recvlogical + +To set up pg_recvlogical, create and start the local cluster by running the following command from your YugabyteDB home directory: + +```sh +./bin/yugabyted start \ + --advertise_address=127.0.0.1 \ + --base_dir="${HOME}/var/node1" \ + --tserver_flags="allowed_preview_flags_csv={cdcsdk_enable_dynamic_table_support},cdcsdk_enable_dynamic_table_support=true,cdcsdk_publication_list_refresh_interval_secs=2" +``` + +#### Create tables + +1. Use ysqlsh to connect to the default `yugabyte` database with the default superuser `yugabyte`, as follows: + + ```sh + bin/ysqlsh -h 127.0.0.1 -U yugabyte -d yugabyte + ``` + +1. In the `yugabyte` database, create a table `employees`. + + ```sql + CREATE TABLE employees ( + employee_id SERIAL PRIMARY KEY, + name VARCHAR(255), + email VARCHAR(255), + department_id INTEGER + ); + ``` + +#### Create a replication slot + +Create a logical replication slot named `test_logical_replication_slot` using the `test_decoding` output plugin, as follows: + +```sql +SELECT * +FROM pg_create_logical_replication_slot('test_logical_replication_slot', 'test_decoding'); +``` + +If the slot is created successfully, you should see output similar to the following: + +```output + slot_name | lsn +-------------------------------+----- + test_logical_replication_slot | 0/2 +``` + +#### Configure and start pg_recvlogical + +The pg_recvlogical binary can be found under `<yugabyte-home>/postgres/bin/`. + +Open a new shell and start pg_recvlogical to connect to the `yugabyte` database with the superuser `yugabyte` and replicate changes using the replication slot you created, as follows: + +```sh +./pg_recvlogical -d yugabyte \ + -U yugabyte \ + -h 127.0.0.1 \ + --slot test_logical_replication_slot \ + --start \ + -f - +``` + +Any changes that get replicated are printed to stdout.
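Each data row that pg_recvlogical prints in the default `test_decoding` format has a regular shape, so a consumer can parse it mechanically. The following Python sketch is not part of YugabyteDB; the regular expressions assume the `test_decoding` format shown in this section and ignore edge cases such as escaped quotes:

```python
import re

# Matches lines like:
# table public.employees: INSERT: employee_id[integer]:1 name[character varying]:'Alice Johnson'
ROW_RE = re.compile(r"^table (\S+): (INSERT|UPDATE|DELETE): (.*)$")
COL_RE = re.compile(r"(\w+)\[([^\]]+)\]:('(?:[^']|'')*'|\S+)")

def parse_line(line):
    """Parse one test_decoding line into (table, op, {column: value}).

    Returns None for BEGIN/COMMIT markers and anything else that is not a
    row change. All values are kept as strings for simplicity.
    """
    m = ROW_RE.match(line.strip())
    if not m:
        return None
    table, op, rest = m.groups()
    cols = {}
    for name, _type, raw in COL_RE.findall(rest):
        cols[name] = raw.strip("'") if raw.startswith("'") else raw
    return table, op, cols
```

A consumer would feed each stdout line from pg_recvlogical through `parse_line` and act on the resulting table, operation, and column values.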
+ +For more pg_recvlogical configurations, refer to the PostgreSQL [pg_recvlogical](https://www.postgresql.org/docs/11/app-pgrecvlogical.html) documentation. + +#### Verify replication + +Return to the shell where ysqlsh is running. Perform DMLs on the `employees` table: + +```sql +BEGIN; + +INSERT INTO employees (name, email, department_id) +VALUES ('Alice Johnson', 'alice@example.com', 1); + +INSERT INTO employees (name, email, department_id) +VALUES ('Bob Smith', 'bob@example.com', 2); + +COMMIT; +``` + +You should see output similar to the following on stdout, where pg_recvlogical is running: + +```output +BEGIN 2 +table public.employees: INSERT: employee_id[integer]:1 name[character varying]:'Alice Johnson' email[character varying]:'alice@example.com' department_id[integer]:1 +table public.employees: INSERT: employee_id[integer]:2 name[character varying]:'Bob Smith' email[character varying]:'bob@example.com' department_id[integer]:2 +COMMIT 2 +``` + +#### Add tables (dynamic table addition) + +You can add a new table to the `yugabyte` database, and any DMLs performed on the new table will also be replicated to pg_recvlogical. + +1. In the `yugabyte` database, create a new table `projects`: + + ```sql + CREATE TABLE projects ( + project_id SERIAL PRIMARY KEY, + name VARCHAR(255), + description TEXT + ); + ``` + +2. Perform DMLs on the `projects` table: + + ```sql + INSERT INTO projects (name, description) + VALUES ('Project A', 'Description of Project A'); + ``` + +You should see output similar to the following on stdout, where pg_recvlogical is running: + +```output +BEGIN 3 +table public.projects: INSERT: project_id[integer]:1 name[character varying]:'Project A' description[text]:'Description of Project A' +COMMIT 3 +``` + +{{% explore-cleanup-local %}} + +## Get started with YugabyteDB connector + +This tutorial demonstrates how to use Debezium to monitor a YugabyteDB database. As the data in the database changes, you will see the resulting event streams.
+ +In this tutorial, you will start the Debezium services, run a YugabyteDB instance with a basic example database, and use Debezium to monitor the database for changes. + +**Prerequisites** + +- Docker is installed and running. + + This tutorial uses Docker and the Debezium container images to run the required services. You should use the latest version of Docker. For more information, see the [Docker Engine installation](https://docs.docker.com/engine/installation/) documentation. + +### Start the services + +Using Debezium requires three separate services: [Zookeeper](http://zookeeper.apache.org/), [Kafka](https://kafka.apache.org), and the Debezium connector service. + +In this tutorial, you will set up a single instance of each service using Docker and the Debezium container images. + +To start the services needed for this tutorial, you must: + +- [Start Zookeeper](#start-zookeeper) +- [Start Kafka](#start-kafka) +- [Start a YugabyteDB database](#start-a-yugabytedb-database) +- [Start Kafka Connect](#start-kafka-connect) + +#### Start Zookeeper + +Zookeeper is the first service you must start. + +1. Open a terminal and use it to start Zookeeper in a container. This command runs a new container using version `2.5.2.Final` of the `debezium/zookeeper` image: + +```sh +docker run -d --rm --name zookeeper -p 2181:2181 -p 2888:2888 -p 3888:3888 debezium/zookeeper:2.5.2.Final +``` + +#### Start Kafka + +After starting Zookeeper, you can start Kafka in a new container. + +1. Open a new terminal and use it to start Kafka in a container. This command runs a new container using version `2.5.2.Final` of the `debezium/kafka` image: + +```sh +docker run -d --rm --name kafka -p 9092:9092 --link zookeeper:zookeeper debezium/kafka:2.5.2.Final +``` + +{{< note title="Note" >}} + +In this tutorial, you will always connect to Kafka from within a Docker container. Any of these containers can communicate with the `kafka` container by linking to it.
If you need to connect to Kafka from outside of a Docker container, you have to set the `-e` option to advertise the Kafka address through the Docker host (`-e ADVERTISED_HOST_NAME=` followed by either the IP address or resolvable host name of the Docker host). + +{{< /note >}} + +#### Start a YugabyteDB database + +At this point, you have started Zookeeper and Kafka, but you still need a database server from which Debezium can capture changes. In this procedure, you start a YugabyteDB instance with an example database. Follow the [Quick Start](../../../../quick-start) to start an instance using yugabyted. + +{{< note title="Note" >}} + +You need to start the database on an IP that is resolvable by the Docker containers. If you use the localhost address (that is, `127.0.0.1`), connectors deployed in the Docker containers won't be able to reach the database, and will keep trying to connect to `127.0.0.1` inside the container. Use the [--advertise_address option for yugabyted](../../../../reference/configuration/yugabyted#flags-8) to specify the IP address on which to start your database instance. + +For example, Linux users can use the following: + +```sh +./bin/yugabyted start --advertise_address $(hostname -i) +``` + +{{< /note >}} + +##### Use the YSQL command line client + +After starting YugabyteDB, use ysqlsh to create your database: + +1. Connect the client to the database process running on the IP you specified when you started up the database instance. + + ```sh + ./bin/ysqlsh -h <IP> + ``` + + You should see output similar to the following: + + ```output + ysqlsh (11.2-YB-2.21.1.0-b0) + Type "help" for help. + + yugabyte=# + ``` + +1. Load the schema of the sample tables: + + ```sql + yugabyte=# \i share/schema.sql + CREATE TABLE + CREATE TABLE + CREATE TABLE + CREATE TABLE + ``` + +1.
List the tables + + ```sql + yugabyte=# \d + ``` + + ```output + List of relations + Schema | Name | Type | Owner + --------+-----------------+----------+---------- + public | orders | table | yugabyte + public | orders_id_seq | sequence | yugabyte + public | products | table | yugabyte + public | products_id_seq | sequence | yugabyte + public | reviews | table | yugabyte + public | reviews_id_seq | sequence | yugabyte + public | users | table | yugabyte + public | users_id_seq | sequence | yugabyte + (8 rows) + ``` + +1. Load data in one of the tables and verify the count. + + ```sql + yugabyte=# \i share/products.sql + ``` + + ```output + yugabyte=# select count(*) from products; + count + ------- + 200 + (1 row) + ``` + +#### Start Kafka Connect + +After starting YugabyteDB, you start the Kafka Connect service. This service exposes a REST API to manage the YugabyteDB connector. + +1. Open a new terminal, and use it to start the Kafka Connect service in a container. + + The following command runs a new container using the `dz.2.5.2.yb.2024.1.SNAPSHOT.1` version of the `quay.io/yugabyte/ybdb-debezium` image: + + ```sh + docker run -it --rm --name connect -p 8083:8083 -p 1976:1976 -e GROUP_ID=1 -e CONFIG_STORAGE_TOPIC=my_connect_configs -e OFFSET_STORAGE_TOPIC=my_connect_offsets -e STATUS_STORAGE_TOPIC=my_connect_statuses -e CLASSPATH=/kafka/connect/ --link zookeeper:zookeeper --link kafka:kafka quay.io/yugabyte/ybdb-debezium:dz.2.5.2.yb.2024.1.SNAPSHOT.1 + ``` + +1. Verify that Kafka Connect started and is ready to accept connections. You should see output similar to the following: + + ```output + ... + 2024-07-19 12:04:33,044 INFO || Kafka version: 3.6.1 [org.apache.kafka.common.utils.AppInfoParser] + ... 
+ 2024-07-19 12:04:33,661 INFO || [Worker clientId=connect-1, groupId=1] Starting connectors and tasks using config offset -1 [org.apache.kafka.connect.runtime.distributed.DistributedHerder] + 2024-07-19 12:04:33,661 INFO || [Worker clientId=connect-1, groupId=1] Finished starting connectors and tasks [org.apache.kafka.connect.runtime.distributed.DistributedHerder] + ``` + +1. Use the Kafka Connect REST API to check the status of the Kafka Connect service. + + Kafka Connect exposes a REST API to manage Debezium connectors. To communicate with the Kafka Connect service, you can use the `curl` command to send API requests to port 8083 of the Docker host (which you mapped to port 8083 in the `connect` container when you started Kafka Connect). + + Open a new terminal and check the status of the Kafka Connect service: + + ```sh + $ curl -H "Accept:application/json" localhost:8083/ + + {"version":"3.6.1","commit":"5e3c2b738d253ff5","kafka_cluster_id":"kafka-cluster-id"} + ``` + +{{< note title="Note" >}} + +These commands use `localhost`. If you are using a non-native Docker platform (such as Docker Toolbox), replace `localhost` with the IP address of your Docker host. + +{{< /note >}} + +### Deploy the YugabyteDB connector + +After starting the Debezium and YugabyteDB services, you are ready to deploy the YugabyteDB connector. To deploy the connector, you must: + +- [Register the YugabyteDB connector to monitor the `yugabyte` database](#register-a-connector-to-monitor-yugabyte-database) +- Watch the connector start + +#### Register a connector to monitor `yugabyte` database + +When you register the YugabyteDB connector, it starts monitoring the `products` table of the `yugabyte` database. When a row in the table changes, Debezium generates a change event.
+ +{{< note title="Note" >}} + +In a production environment, you would typically either use the Kafka tools to manually create the necessary topics, including specifying the number of replicas, or you would use the Kafka Connect mechanism for customizing the settings of [auto-created](https://debezium.io/documentation/reference/2.5/configuration/topic-auto-create-config.html) topics. However, for this tutorial, Kafka is configured to automatically create the topics with just one replica. + +{{< /note >}} + +1. Review the configuration of the YugabyteDB connector that you will register. Before registering the connector, you should be familiar with its configuration. In the next step, you will register the following connector: + + ```json + { + "name": "ybconnector", + "config": { + "tasks.max":"1", + "connector.class": "io.debezium.connector.postgresql.YugabyteDBConnector", + "database.hostname":"'$(hostname -i)'", + "database.port":"5433", + "database.user": "yugabyte", + "database.password":"yugabyte", + "database.dbname":"yugabyte", + "topic.prefix":"dbserver1", + "snapshot.mode":"initial", + "table.include.list":"public.products", + "plugin.name":"yboutput", + "slot.name":"yb_replication_slot" + } + } + ``` + + - `name` - The name of the connector. + - `config` - The connector's configuration. + - `database.hostname` - The database host, which is the IP of the machine running YugabyteDB. If YugabyteDB were running on a normal network, you would specify the IP address or resolvable host name for this value. + - `topic.prefix` - A unique topic prefix. This name will be used as the prefix for all Kafka topics. + - `table.include.list` - Only changes in the table `products` of the schema `public` will be detected. + - `plugin.name` - [Plugin](../key-concepts/#output-plugin) to be used for replication. + - `slot.name` - Name of the [replication slot](../key-concepts/#replication-slot). 
+ + For more information, see [YugabyteDB connector configuration properties](../yugabytedb-connector/#connector-properties). + +1. Open a new terminal and use the `curl` command to register the YugabyteDB connector. + + This command uses the Kafka Connect service API to submit a `POST` request against the `/connectors` resource with a `JSON` document that describes the new connector (called `ybconnector`). + + ```sh + curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '{ + "name": "ybconnector", + "config": { + "tasks.max":"1", + "connector.class": "io.debezium.connector.postgresql.YugabyteDBConnector", + "database.hostname":"'$(hostname -i)'", + "database.port":"5433", + "database.user": "yugabyte", + "database.password":"yugabyte", + "database.dbname":"yugabyte", + "topic.prefix":"dbserver1", + "snapshot.mode":"initial", + "table.include.list":"public.products", + "plugin.name":"yboutput", + "slot.name":"yb_replication_slot" + } + }' + ``` + + {{< note title="Note" >}} +Windows users may need to escape the double-quotes. + {{< /note >}} + +1. Verify that `ybconnector` is included in the list of connectors: + + ```sh + $ curl -H "Accept:application/json" localhost:8083/connectors/ + + ["ybconnector"] + ``` + +#### Watch the connector start + +When you register a connector, it generates a large amount of log output in the Kafka Connect container. By reviewing this output, you can better understand the process that the connector goes through from the time it is created until it begins reading the change events. + +After registering the `ybconnector` connector, you can review the log output in the Kafka Connect container (`connect`) to track the connector's status. + +Kafka Connect reports some "errors". 
However, you can safely ignore these warnings: these messages just mean that new Kafka topics were created and that Kafka had to assign a new leader for each one: + +```output +2021-11-30 01:38:45,555 WARN || [Producer clientId=connector-producer-inventory-connector-0] Error while fetching metadata with correlation id 3 : {dbserver1=LEADER_NOT_AVAILABLE} [org.apache.kafka.clients.NetworkClient] +2021-11-30 01:38:45,691 WARN || [Producer clientId=connector-producer-inventory-connector-0] Error while fetching metadata with correlation id 9 : {dbserver1.public.orders=LEADER_NOT_AVAILABLE} [org.apache.kafka.clients.NetworkClient] +2021-11-30 01:38:45,813 WARN || [Producer clientId=connector-producer-inventory-connector-0] Error while fetching metadata with correlation id 13 : {dbserver1.public.users=LEADER_NOT_AVAILABLE} [org.apache.kafka.clients.NetworkClient] +2021-11-30 01:38:45,927 WARN || [Producer clientId=connector-producer-inventory-connector-0] Error while fetching metadata with correlation id 18 : {dbserver1.public.products=LEADER_NOT_AVAILABLE} [org.apache.kafka.clients.NetworkClient] +2021-11-30 01:38:46,043 WARN || [Producer clientId=connector-producer-inventory-connector-0] Error while fetching metadata with correlation id 22 : {dbserver1.public.reviews=LEADER_NOT_AVAILABLE} [org.apache.kafka.clients.NetworkClient] +``` + +### View change events + +After deploying the YugabyteDB connector, it starts monitoring the `yugabyte` database for data change events. + +For this tutorial, you will explore the `dbserver1.public.products` topic. + +#### View a change event + +Open a new terminal, and use it to start the watch-topic utility to watch the `dbserver1.public.products` topic from the beginning of the topic. 
+ +The following command runs the `watch-topic` utility in a new container using the `2.5.2.Final` version of the `debezium/kafka` image: + +```sh +docker run -it --rm --name consumer --link zookeeper:zookeeper --link kafka:kafka debezium/kafka:2.5.2.Final watch-topic -a dbserver1.public.products +``` + +The `watch-topic` utility returns the event records from the `products` table. There will be 200 events, one for each row in the table which was snapshotted. Each event is formatted in JSON, because that is how you configured the Kafka Connect service. There are two JSON documents for each event: one for the key, and one for the value. + +You should see output similar to the following: + +```output.json +Using ZOOKEEPER_CONNECT=172.17.0.2:2181 +Using KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://172.17.0.7:9092 +Using KAFKA_BROKER=172.17.0.3:9092 +Contents of topic dbserver1.public.products: +... +{"schema":{"type":"struct","fields":[{"type":"struct","fields":[{"type":"struct","fields":[{"type":"int64","optional":false,"default":0,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":false,"name":"id","field":"id"},{"type":"struct","fields":[{"type":"int64","optional":true,"name":"io.debezium.time.MicroTimestamp","version":1,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"created_at","field":"created_at"},{"type":"struct","fields":[{"type":"string","optional":true,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"category","field":"category"},{"type":"struct","fields":[{"type":"string","optional":true,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"ean","field":"ean"},{"type":"struct","fields":[{"type":"double","optional":true,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"price","field":"price"},{"type":"struct","fields":[{"type":"int32","optional":true,"default":5000,"field"
:"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"quantity","field":"quantity"},{"type":"struct","fields":[{"type":"double","optional":true,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"rating","field":"rating"},{"type":"struct","fields":[{"type":"string","optional":true,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"title","field":"title"},{"type":"struct","fields":[{"type":"string","optional":true,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"vendor","field":"vendor"}],"optional":true,"name":"dbserver1.public.products.Value","field":"before"},{"type":"struct","fields":[{"type":"struct","fields":[{"type":"int64","optional":false,"default":0,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":false,"name":"id","field":"id"},{"type":"struct","fields":[{"type":"int64","optional":true,"name":"io.debezium.time.MicroTimestamp","version":1,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"created_at","field":"created_at"},{"type":"struct","fields":[{"type":"string","optional":true,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"category","field":"category"},{"type":"struct","fields":[{"type":"string","optional":true,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"ean","field":"ean"},{"type":"struct","fields":[{"type":"double","optional":true,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"price","field":"price"},{"type":"struct","fields":[{"type":"int32","optional":true,"default":5000,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"quantity","field":"quantity"},{"type":"struct","fields":[{"type":"double","optional":true,"field":"value"},{"type":"boolean","optiona
l":false,"field":"set"}],"optional":true,"name":"rating","field":"rating"},{"type":"struct","fields":[{"type":"string","optional":true,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"title","field":"title"},{"type":"struct","fields":[{"type":"string","optional":true,"field":"value"},{"type":"boolean","optional":false,"field":"set"}],"optional":true,"name":"vendor","field":"vendor"}],"optional":true,"name":"dbserver1.public.products.Value","field":"after"},{"type":"struct","fields":[{"type":"string","optional":false,"field":"version"},{"type":"string","optional":false,"field":"connector"},{"type":"string","optional":false,"field":"name"},{"type":"int64","optional":false,"field":"ts_ms"},{"type":"string","optional":true,"name":"io.debezium.data.Enum","version":1,"parameters":{"allowed":"true,last,false,incremental"},"default":"false","field":"snapshot"},{"type":"string","optional":false,"field":"db"},{"type":"string","optional":true,"field":"sequence"},{"type":"string","optional":false,"field":"schema"},{"type":"string","optional":false,"field":"table"},{"type":"int64","optional":true,"field":"txId"},{"type":"int64","optional":true,"field":"lsn"},{"type":"int64","optional":true,"field":"xmin"}],"optional":false,"name":"io.debezium.connector.postgresql.Source","field":"source"},{"type":"string","optional":false,"field":"op"},{"type":"int64","optional":true,"field":"ts_ms"},{"type":"struct","fields":[{"type":"string","optional":false,"field":"id"},{"type":"int64","optional":false,"field":"total_order"},{"type":"int64","optional":false,"field":"data_collection_order"}],"optional":true,"name":"event.block","version":1,"field":"transaction"}],"optional":false,"name":"dbserver1.public.products.Envelope","version":1},"payload":{"before":null,"after":{"id":{"value":147,"set":true},"created_at":{"value":1500306107286000,"set":true},"category":{"value":"Doohickey","set":true},"ean":{"value":"6590063715","set":true},"price":{"value":44
.4315141414441,"set":true},"quantity":{"value":5000,"set":true},"rating":{"value":4.6,"set":true},"title":{"value":"Mediocre Wool Toucan","set":true},"vendor":{"value":"Bradtke, Wilkinson and Reilly","set":true}},"source":{"version":"dz.2.5.2.yb.2024.1-SNAPSHOT","connector":"postgresql","name":"dbserver1","ts_ms":1721400304248,"snapshot":"true","db":"yugabyte","sequence":"[null,\"2\"]","schema":"public","table":"products","txId":2,"lsn":2,"xmin":null},"op":"r","ts_ms":1721400309609,"transaction":null}} +... +``` + +{{< note title="Note" >}} + +This utility keeps watching the topic, so any new events will automatically appear as long as the utility is running. + +{{< /note >}} + +#### Update the database and view the update event + +Now that you have seen how the YugabyteDB connector captured the create events in the `yugabyte` database, change one of the records and see how the connector captures it. + +By completing this procedure, you will learn how to find details about what changed in a database commit, and how you can compare change events to determine when the change occurred in relation to other changes. + +1. In the terminal that is running ysqlsh, run the following statement: + + ```sql + update products set title = 'Enormous Granite Shiny Shoes' where id = 22; + ``` + +1. View the updated `products` table: + + ```sql + yugabyte=# select * from products where id = 22; + ``` + + ```output + id | created_at | category | ean | price | quantity | rating | title | vendor + ----+-------------------------+----------+---------------+------------------+----------+--------+------------------------------+--------------------------- + 22 | 2017-11-24 20:14:28.415 | Gizmo | 7595223735110 | 21.4245199604423 | 5000 | 4.2 | Enormous Granite Shiny Shoes | Mayer, Kiehn and Turcotte + (1 row) + ``` + +1. Switch to the terminal running `watch-topic` to see a new event. + + By changing a record in the `products` table, the YugabyteDB connector generated a new event. 
+ + The details for the payload of the *update* event will look similar to the following (formatted for readability): + + ```json + { + "before": null, + "after": { + "id": { + "value": 22, + "set": true + }, + "created_at": null, + "category": null, + "ean": null, + "price": null, + "quantity": null, + "rating": null, + "title": { + "value": "Enormous Granite Shiny Shoes", + "set": true + }, + "vendor": null + } + } + ``` + +Note that the fields that were not updated are `null`. This is because the [REPLICA IDENTITY](../key-concepts/#replica-identity) of the table is `CHANGE` by default, which sends only the values of the updated columns in the change event. + +#### Delete a row and view the delete event + +1. In the terminal that is running ysqlsh, run the following statement: + + ```sql + delete from products where id = 22; + ``` + +1. Switch to the terminal running `watch-topic` to see two new events. By deleting a row in the `products` table, the YugabyteDB connector generated two new events. + + The details for the payload of the first event will look similar to the following (formatted for readability): + + ```json + { + "before": { + "id": { + "value": 22, + "set": true + }, + "created_at": { + "value": null, + "set": true + }, + "category": { + "value": null, + "set": true + }, + "ean": { + "value": null, + "set": true + }, + "price": { + "value": null, + "set": true + }, + "quantity": { + "value": 5000, + "set": true + }, + "rating": { + "value": null, + "set": true + }, + "title": { + "value": null, + "set": true + }, + "vendor": { + "value": null, + "set": true + } + }, + "after": null + } + ``` + +The second event will have a *key* but the *value* will be `null`; that is a [tombstone event](../yugabytedb-connector/#tombstone-events) generated by the YugabyteDB connector. + +### Clean up + +After you are finished with the tutorial, you can use Docker to stop all of the running containers.
+ +Run the following command: + +```sh +docker stop zookeeper kafka connect consumer +``` + +Docker stops each container. Because you used the `--rm` option when you started them, Docker also removes them. diff --git a/docs/content/stable/explore/change-data-capture/using-logical-replication/key-concepts.md b/docs/content/stable/explore/change-data-capture/using-logical-replication/key-concepts.md new file mode 100644 index 000000000000..cc5f32640b53 --- /dev/null +++ b/docs/content/stable/explore/change-data-capture/using-logical-replication/key-concepts.md @@ -0,0 +1,123 @@ +--- +title: Key concepts - logical replication +headerTitle: Key concepts +linkTitle: Key concepts +description: Change Data Capture in YugabyteDB. +headcontent: PostgreSQL logical replication concepts +menu: + stable: + parent: explore-change-data-capture-logical-replication + identifier: key-concepts + weight: 10 +type: docs +--- + +The YugabyteDB logical replication feature uses [PostgreSQL Logical Replication](https://www.postgresql.org/docs/11/logical-replication.html), which operates using a publish-subscribe model. Understanding the following key concepts will help you set up and manage a logical replication environment effectively. + +## Concepts + +### Replication slot + +A replication slot represents a stream of changes that can be replayed to a client in the order they were made on the origin server. Each slot streams a sequence of changes from a single database. + +In logical replication, the fundamental unit of data transmission is a transaction. A logical slot emits each change just once in normal operation. The current position of each slot is persisted only at checkpoint, so if a replication process is interrupted and restarts, even if the checkpoint or the starting LSN falls in the middle of a transaction, **the entire transaction is retransmitted**. 
This behavior guarantees that clients receive complete transactions without missing any intermediate changes, maintaining data integrity across the replication stream. Logical decoding clients are responsible for avoiding ill effects from handling the same message more than once. Clients may wish to record the last LSN they saw when decoding and skip over any repeated data or (when using the replication protocol) request that decoding start from that LSN rather than letting the server determine the start point. + +For more information, refer to [Replication slots](https://www.postgresql.org/docs/11/logicaldecoding-explanation.html#LOGICALDECODING-REPLICATION-SLOTS) in the PostgreSQL documentation. + +### Publication + +A publication is a set of changes generated from a table or a group of tables, and might also be described as a change set or replication set. Each publication exists in only one database. + +Publications are different from schemas and do not affect how the table is accessed. Each table can be added to multiple publications if needed. Publications may currently only contain tables. Objects must be added explicitly, except when a publication is created for ALL TABLES. + +For more information, refer to [Publication](https://www.postgresql.org/docs/11/logical-replication-publication.html#LOGICAL-REPLICATION-PUBLICATION) in the PostgreSQL documentation. + +### Output plugin + +Output plugins transform the data from the write-ahead log's internal representation into the format that can be consumed by replication clients. These plugins are notified, via various callbacks, about the change events that need to be processed and sent. These callbacks are only invoked when the transaction actually commits. + +YugabyteDB supports the following four output plugins: + +- `yboutput` +- `pgoutput` +- `test_decoding` +- `wal2json` + +All these plugins are pre-packaged with YugabyteDB and do not require any external installation.
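As an illustrative sketch of the publication and output plugin concepts above (the publication, slot, and table names are arbitrary, and the tables are assumed to exist), a publication and a slot using the YugabyteDB-specific plugin can be created as follows:

```sql
-- A publication containing specific tables, and one for all tables.
CREATE PUBLICATION my_publication FOR TABLE users, orders;
CREATE PUBLICATION all_tables_publication FOR ALL TABLES;

-- A replication slot that decodes changes using the yboutput plugin.
SELECT * FROM pg_create_logical_replication_slot('my_slot', 'yboutput');
```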
+
+{{< note title="Note" >}}
+
+The plugin `yboutput` is YugabyteDB specific. It is similar to `pgoutput` in most aspects; the only difference is that replica identity `CHANGE` is not supported in `pgoutput`. All other plugins support replica identity `CHANGE`.
+
+{{< /note >}}
+
+For more information, refer to [Logical Decoding Output Plugins](https://www.postgresql.org/docs/11/logicaldecoding-output-plugin.html) in the PostgreSQL documentation.
+
+### LSN
+
+LSN (Log Sequence Number) in YugabyteDB is an unsigned 64-bit integer that uniquely identifies a change record or a transaction boundary record that is consumed from a given replication slot.
+
+In YugabyteDB, LSN values from different slots are unrelated and should not be compared. Unlike in PostgreSQL, an LSN in YugabyteDB does not represent the byte offset of a WAL record.
+
+LSN values for a single replication slot satisfy the following properties:
+
+- **Uniqueness**
+
+  LSN values for the change and `COMMIT` records for a given replication slot are unique. In particular, changes from different tablets of the same or different tables have unique LSN values for a replication slot.
+
+- **Ordering**
+
+  LSN values can be compared (`<`, `>`, `=`).
+
+  The LSN of the change records in a transaction is strictly lower than the LSN of the `COMMIT` record of the same transaction.
+
+  The LSNs of change records in a transaction are in increasing order and correspond to the order in which those changes were made in that transaction. That is, the LSN of an earlier change has a strictly lower value than the LSN of a later change in the same transaction. This is the case even if the changes correspond to rows in different tablets of the same or different tables.
+
+  For a given replication slot, the LSN of the `COMMIT` record of an earlier transaction is strictly lower than the LSN of the `COMMIT` record of a later transaction.
+
+- **Determinism**
+
+  For a given replication slot, the LSN value of a change record (or a transaction boundary record) remains the same for the lifetime of that replication slot. In particular, this is true across server and client restarts and client re-connections. Thus, LSN values for a single replication slot may be used to uniquely identify records that are consumed from that replication slot. The values can be compared for determining duplicates at the client side.
+
+### Replica identity
+
+Replica identity is a table-level parameter that controls the amount of information being written to the change records. YugabyteDB supports the following four replica identities:
+
+- CHANGE (default)
+- DEFAULT
+- FULL
+- NOTHING
+
+The replica identity `INDEX` is not supported in YugabyteDB.
+
+Replica identity `CHANGE` is the most performant and is the default. The replica identity of a table can be changed using `ALTER TABLE`. However, for a given slot, changing the replica identity of a table after the slot has been created has no effect; the effective replica identity of a table for a slot is the replica identity the table had at the time of slot creation. A dynamically created table (a table created after slot creation) has the default replica identity. For a replica identity modified after slot creation to take effect, create a new slot after performing the `ALTER TABLE`.
+
+The [ysql_yb_default_replica_identity](../../../../reference/configuration/yb-tserver/#ysql-yb-default-replica-identity) flag determines the default replica identity for user tables at the time of table creation. The default value is `CHANGE`. The purpose of this flag is to set the replica identities for dynamically created tables. To create a dynamic table with the desired replica identity, set the flag accordingly before creating the table.
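The slot-creation-time semantics above can be modeled in a few lines of Python. This is a toy model to illustrate the rules, not a YugabyteDB API; all names here are invented for the example.

```python
# Toy model of effective replica identity for a slot (not a YugabyteDB API).
DEFAULT_REPLICA_IDENTITY = "CHANGE"  # default of ysql_yb_default_replica_identity

def effective_replica_identity(slot_snapshot, table, current_identities):
    """Return the replica identity a slot uses for a table.

    slot_snapshot: {table: identity} as captured when the slot was created.
    current_identities: identities after any later ALTER TABLE; only
    consulted for tables created after the slot (dynamic tables).
    """
    if table in slot_snapshot:
        # ALTER TABLE after slot creation has no effect for this slot.
        return slot_snapshot[table]
    return current_identities.get(table, DEFAULT_REPLICA_IDENTITY)

slot_snapshot = {"orders": "FULL"}      # identity when the slot was created
current = {"orders": "NOTHING",         # altered later; invisible to the slot
           "events": "CHANGE"}          # dynamic table, created after the slot
```

In this model the slot keeps seeing `FULL` for `orders` despite the later alter, while the dynamically created `events` table uses the identity it was created with.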
+ +{{< note title="Advisory" >}} +You should refrain from altering the replica identity of a dynamically created table for at least 5 minutes after its creation. +{{< /note >}} + +For more information, refer to [Replica Identity](https://www.postgresql.org/docs/11/sql-altertable.html#SQL-CREATETABLE-REPLICA-IDENTITY) in the PostgreSQL documentation. + +### Replication protocols + +PostgreSQL has defined protocols for replication that need to be followed by clients to establish replication connection as well as message structures for streaming data. This includes the [Streaming Replication protocol](https://www.postgresql.org/docs/11/protocol-replication.html) and the [Logical Streaming Replication protocol](https://www.postgresql.org/docs/11/protocol-logical-replication.html). + +The logical streaming replication protocol sends individual transactions one-by-one. This means that all messages between a pair of `BEGIN` and `COMMIT` messages belong to the same transaction. + +YugabyteDB supports both the streaming replication protocols used in PostgreSQL to support logical replication, maintaining the same semantics described in PostgreSQL: + +- Streaming Replication Protocol - This protocol is followed by all output plugins. + +- Logical Streaming Replication Protocol - This protocol is followed by `pgoutput` and `yboutput`, in addition to the Streaming replication protocol. + +{{< note title="Note" >}} + +YugabyteDB does not support Physical Replication. 
+ +{{< /note >}} + +## Learn more + +[CDC using Logical Replication architecture](../../../../architecture/docdb-replication/cdc-logical-replication/) diff --git a/docs/content/stable/explore/change-data-capture/using-logical-replication/monitor.md b/docs/content/stable/explore/change-data-capture/using-logical-replication/monitor.md new file mode 100644 index 000000000000..38a98272e3f8 --- /dev/null +++ b/docs/content/stable/explore/change-data-capture/using-logical-replication/monitor.md @@ -0,0 +1,87 @@ +--- +title: CDC monitoring in YugabyteDB +headerTitle: Monitor +linkTitle: Monitor +description: Monitor Change Data Capture in YugabyteDB. +menu: + stable: + parent: explore-change-data-capture-logical-replication + identifier: monitor + weight: 30 +type: docs +--- + +## Catalog objects and views + +### pg_publication + +Contains all publication objects contained in the database. + +| Column name | Data type | Description | +| :----- | :----- | :----- | +| oid | oid | Row identifier | +| pubname | name | Name of the publication | +| pubowner | oid | OID of the owner. | +| puballtables | bool | If true, this publication includes all tables in the database including those added in the future. | +| pubinsert | bool | If true, INSERT operations are replicated for tables in the publication. | +| pubupdate | bool | If true, UPDATE operations are replicated for tables in the publication. | +| pubdelete | bool | If true, DELETE operations are replicated for tables in the publication. | +| pubtruncate | bool | If true, TRUNCATE operations are replicated for tables in the publication. | + +### pg_publication_rel + +Contains mapping between publications and tables. This is a many-to-many mapping. + +| Column name | Data type | Description | +| :----- | :----- | :----- | +| oid | oid | Row identifier. | +| prpubid | oid | OID of the publication. References pg_publication.oid. | +| prrelid| oid | OID of the relation. References pg_class.oid. 
| + +### pg_publication_tables + +Contains mapping between publications and tables. It is a wrapper over `pg_publication_rel` as it expands the publications defined as FOR ALL TABLES, so for such publications there will be a row for each eligible table. + +| Column name | Data type | Description | +| :----- | :----- | :----- | +| pubname | name | Name of publication. | +| schemaname | name | Name of schema containing table. | +| tablename | name | Name of table. | + +### pg_replication_slots + +Provides a list of all replication slots that currently exist on the database cluster, along with their metadata. + +| Column name | Data type | Description | +| :----- | :----- | :----- | +| slot_name | name | Name of the replication slot. | +| plugin | name | Output plugin name (Always `yboutput`). | +| slot_type | text | Always logical. | +| datoid | oid | The OID of the database this slot is associated with. | +| database | text | The name of the database this slot is associated with. | +| temporary | boolean | True if this is a temporary replication slot. Temporary slots are automatically dropped on error or when the session has finished. | +| active | boolean | True if this slot is currently actively being used. In YSQL, an "active" replication slot means a slot which has been consumed at least once in a certain time frame. The time is defined using the `ysql_replication_slot_activity_threshold` flag, which has a default of 5 minutes. | +| active_pid | integer | The process ID of the session using this slot if the slot is currently actively being used. `NULL` if no replication process is ongoing. | +| xmin | xid | The oldest transaction that this slot needs the database to retain. | +| catalog_xmin | xid | Not applicable for YSQL. Always set to xmin. | +| restart_lsn | pg_lsn | The LSN of the oldest change record which still might be required by the consumer of this slot and thus won't be automatically removed during checkpoints. 
|
+| confirmed_flush_lsn | pg_lsn | The LSN up to which the logical slot's consumer has confirmed receiving data. Data older than this is no longer available; transactions with a commit LSN lower than the `confirmed_flush_lsn` are no longer available. |
+| yb_stream_id | text | UUID of the CDC stream. |
+| yb_restart_commit_ht | int8 | A uint64 representation of the commit Hybrid Time corresponding to the `restart_lsn`. This can be used by the client (such as the YugabyteDB connector) to perform a consistent snapshot (as of the `consistent_point`) in the case when a replication slot already exists. |
+
+## CDC Service metrics
+
+Provides information about the CDC service in YugabyteDB.
+
+| Metric name | Type | Description |
+| :---- | :---- | :---- |
+| cdcsdk_change_event_count | `long` | The number of records sent by the CDC Service. |
+| cdcsdk_traffic_sent | `long` | The total amount of traffic sent, in bytes. |
+| cdcsdk_event_lag_micros | `long` | The lag, in microseconds, between the timestamp of the latest record in the WAL of a tablet and that of the last record sent to the CDC connector. |
+| cdcsdk_expiry_time_ms | `long` | The time remaining, in milliseconds, to read records from the WAL before the stream expires. |
+
+## Connector metrics
+
+Refer to [Monitoring](../yugabytedb-connector/#monitoring) for information on YugabyteDB connector metrics.
diff --git a/docs/content/stable/explore/change-data-capture/using-logical-replication/yugabytedb-connector.md b/docs/content/stable/explore/change-data-capture/using-logical-replication/yugabytedb-connector.md
new file mode 100644
index 000000000000..b120352505df
--- /dev/null
+++ b/docs/content/stable/explore/change-data-capture/using-logical-replication/yugabytedb-connector.md
@@ -0,0 +1,1679 @@
+---
+title: YugabyteDB connector
+headerTitle: YugabyteDB connector
+linkTitle: YugabyteDB connector
+description: YugabyteDB connector for Change Data Capture in YugabyteDB.
+menu: + stable: + parent: explore-change-data-capture-logical-replication + identifier: yugabytedb-connector + weight: 70 +type: docs +--- + +The YugabyteDB Connector is based on the Debezium Connector, and captures row-level changes in the schemas of a YugabyteDB database using the PostgreSQL replication protocol. + +The first time it connects to a YugabyteDB server, the connector takes a consistent snapshot of all schemas. After that snapshot is complete, the connector continuously captures row-level changes that insert, update, and delete database content, and that were committed to a YugabyteDB database. The connector generates data change event records and streams them to Kafka topics. For each table, the default behavior is that the connector streams all generated events to a separate Kafka topic for that table. Applications and services consume data change event records from that topic. + +## Overview + +YugabyteDB CDC using logical decoding is a mechanism that allows the extraction of changes that were committed to the transaction log and the processing of these changes in a user-friendly manner with the help of a [PostgreSQL output plugin](https://www.postgresql.org/docs/11/logicaldecoding-output-plugin.html). The output plugin enables clients to consume the changes. + +The YugabyteDB connector contains two main parts that work together to read and process database changes: + +* You must configure a replication slot that uses your chosen output plugin before running the YugabyteDB server. The plugin can be one of the following: + + + * `yboutput` is the plugin packaged with YugabyteDB. It is maintained by Yugabyte and is always present with the distribution. + + * `pgoutput` is the standard logical decoding output plugin in PostgreSQL 10+. It is maintained by the PostgreSQL community, and used by PostgreSQL itself for logical replication. 
YugabyteDB bundles this plugin with the standard distribution so it is always present and no additional libraries need to be installed. The YugabyteDB connector interprets the raw replication event stream directly into change events. + + +* Java code (the actual Kafka Connect connector) that reads the changes produced by the chosen logical decoding output plugin. It uses the [streaming replication protocol](https://www.postgresql.org/docs/11/protocol-replication.html), by means of the YugabyteDB JDBC driver. + +The connector produces a change event for every row-level insert, update, and delete operation that was captured, and sends change event records for each table in a separate Kafka topic. Client applications read the Kafka topics that correspond to the database tables of interest, and can react to every row-level event they receive from those topics. + +YugabyteDB normally purges write-ahead log (WAL) segments after some period of time. This means that the connector does not have the complete history of all changes that have been made to the database. Therefore, when the YugabyteDB connector first connects to a particular YugabyteDB database, it starts by performing a consistent snapshot of each of the configured tables. After the connector completes the snapshot, it continues streaming changes from the exact point at which the snapshot was made. This way, the connector starts with a consistent view of all of the data, and does not omit any changes that were made while the snapshot was being taken. + +The connector is tolerant of failures. As the connector reads changes and produces events, it records the LSN for each event. If the connector stops for any reason (including communication failures, network problems, or crashes), upon restart the connector continues reading the WAL where it last left off. + +{{< tip title="Use UTF-8 encoding" >}} + +Debezium supports databases with UTF-8 character encoding only. 
With a single-byte character encoding, it's not possible to correctly process strings that contain extended ASCII code characters.
+
+{{< /tip >}}
+
+## How the connector works
+
+To optimally configure and run a Debezium connector, it is helpful to understand how the connector performs snapshots, streams change events, determines Kafka topic names, and uses metadata.
+
+### Security
+
+To use the Debezium connector to stream changes from a YugabyteDB database, the connector must operate with specific privileges in the database. Although one way to grant the necessary privileges is to provide the user with `superuser` privileges, doing so potentially exposes your YugabyteDB data to unauthorized access. Rather than granting excessive privileges to the Debezium user, it is best to create a dedicated Debezium replication user to which you grant specific privileges.
+
+For more information about configuring privileges for the Debezium replication user, see [Setting up permissions](#setting-up-permissions).
+
+### Snapshots
+
+Most YugabyteDB servers are configured to not retain the complete history of the database in the WAL segments. This means that the YugabyteDB connector would be unable to see the entire history of the database by reading only the WAL. Consequently, the first time that the connector starts, it performs an initial consistent snapshot of the database.
+
+#### Default workflow behavior of initial snapshots
+
+The default behavior for performing a snapshot consists of the following steps. You can change this behavior by setting the `snapshot.mode` [connector configuration property](#advanced-configuration-properties) to a value other than `initial`.
+
+1. Start a transaction.
+2. Set the transaction read time to the [consistent point](../../../../architecture/docdb-replication/cdc-logical-replication/#initial-snapshot) associated with the replication slot.
+3. Execute the snapshot by running a `SELECT` query.
+4. 
Generate a `READ` event for each row and write to the appropriate table-specific Kafka topic. +5. Record successful completion of the snapshot in the connector offsets. + +If the connector fails, is rebalanced, or stops after Step 1 begins but before Step 5 completes, upon restart the connector begins a new snapshot. After the connector completes its initial snapshot, the YugabyteDB connector continues streaming from the position that it read in Step 2. This ensures that the connector does not miss any updates. If the connector stops again for any reason, upon restart, the connector continues streaming changes from where it previously left off. + +The following table describes the options for the `snapshot.mode` connector configuration property. + +| Option | Description | +| :--- | :--- | +| `never` | The connector never performs snapshots. When a connector is configured this way, its behavior when it starts is as follows. If there is a previously stored LSN in the Kafka offsets topic, the connector continues streaming changes from that position. If no LSN has been stored, the connector starts streaming changes from the point in time when the YugabyteDB logical replication slot was created on the server. The `never` snapshot mode is beneficial only when you know all data of interest is still reflected in the WAL. | +| `initial` (default) | The connector performs a database snapshot when no Kafka offsets topic exists. After the database snapshot completes the Kafka offsets topic is written. If there is a previously stored LSN in the Kafka offsets topic, the connector continues streaming changes from that position. | +| `initial_only` | The connector performs a database snapshot and stops before streaming any change event records. If the connector had started but did not complete a snapshot before stopping, the connector restarts the snapshot process and stops when the snapshot completes. 
| + +### Streaming changes + +The YugabyteDB connector typically spends the vast majority of its time streaming changes from the YugabyteDB server to which it is connected. This mechanism relies on [PostgreSQL's replication protocol](https://www.postgresql.org/docs/11/protocol-replication.html). This protocol enables clients to receive changes from the server as they are committed in the server's transaction logs. + +Whenever the server commits a transaction, a separate server process invokes a callback function from the [logical decoding plugin](../key-concepts/#output-plugin). This function processes the changes from the transaction, converts them to a specific format and writes them on an output stream, which can then be consumed by clients. + +The YugabyteDB connector acts as a YugabyteDB client. When the connector receives changes it transforms the events into Debezium create, update, or delete events that include the LSN of the event. The YugabyteDB connector forwards these change events in records to the Kafka Connect framework, which is running in the same process. The Kafka Connect process asynchronously writes the change event records in the same order in which they were generated to the appropriate Kafka topic. + +Periodically, Kafka Connect records the most recent offset in another Kafka topic. The offset indicates source-specific position information that Debezium includes with each event. For the YugabyteDB connector, the LSN recorded in each change event is the offset. + +When Kafka Connect gracefully shuts down, it stops the connectors, flushes all event records to Kafka, and records the last offset received from each connector. When Kafka Connect restarts, it reads the last recorded offset for each connector, and starts each connector at its last recorded offset. When the connector restarts, it sends a request to the YugabyteDB server to send the events starting just after that position. 
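The offset-based restart behavior described above can be sketched as follows. This is illustrative Python, not connector code; `OffsetStore` merely stands in for the Kafka topic where Kafka Connect records offsets, and the event shape is an assumption for the example.

```python
# Illustrative sketch of LSN-offset resume (not actual connector code).

class OffsetStore:
    """Stands in for the Kafka topic where Kafka Connect records offsets."""
    def __init__(self):
        self.last_lsn = 0

    def commit(self, lsn):
        self.last_lsn = lsn

def stream_changes(server_events, offsets):
    """Deliver events newer than the stored offset, committing as we go."""
    delivered = []
    for lsn, event in server_events:
        if lsn <= offsets.last_lsn:
            continue  # already delivered before the restart
        delivered.append(event)
        offsets.commit(lsn)
    return delivered

events = [(1, "change-1"), (2, "change-2"), (3, "change-3")]
store = OffsetStore()
before_crash = stream_changes(events[:2], store)  # deliver, then "crash"
after_restart = stream_changes(events, store)     # resumes just after LSN 2
```

After the simulated restart, only `change-3` is delivered, because the stored offset already covers the first two events.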
+
+### Logical decoding plugin support
+
+In YugabyteDB v2024.1.1 and later, YugabyteDB supports the [yboutput plugin](../key-concepts/#output-plugin), a native output plugin for logical decoding.
+
+Additionally, YugabyteDB supports the PostgreSQL `pgoutput` plugin natively. This means that the YugabyteDB connector can work with an existing setup configured using `pgoutput`.
+
+### Topic names
+
+By default, the YugabyteDB connector writes change events for all `INSERT`, `UPDATE`, and `DELETE` operations that occur in a table to a single Apache Kafka topic that is specific to that table. The connector names change event topics as _topicPrefix.schemaName.tableName_.
+
+The components of a topic name are as follows:
+
+* _topicPrefix_ - the topic prefix as specified by the `topic.prefix` configuration property.
+* _schemaName_ - the name of the database schema in which the change event occurred.
+* _tableName_ - the name of the database table in which the change event occurred.
+
+For example, suppose that `dbserver` is the topic prefix in the configuration for a connector that is capturing changes in a YugabyteDB installation that has a `yugabyte` database and an `inventory` schema that contains four tables: `products`, `products_on_hand`, `customers`, and `orders`. The connector would stream records to these four Kafka topics:
+
+* `dbserver.inventory.products`
+* `dbserver.inventory.products_on_hand`
+* `dbserver.inventory.customers`
+* `dbserver.inventory.orders`
+
+Now suppose that the tables are not part of a specific schema but were created in the default public YugabyteDB schema. The names of the Kafka topics would be:
+
+* `dbserver.public.products`
+* `dbserver.public.products_on_hand`
+* `dbserver.public.customers`
+* `dbserver.public.orders`
+
+The connector applies similar naming conventions to label its [transaction metadata topics](#transaction-metadata).
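The naming convention above is simple enough to express directly. The following is a sketch with a hypothetical helper function, not part of the connector:

```python
# Sketch of the topicPrefix.schemaName.tableName convention described above.

def topic_name(topic_prefix, schema_name, table_name):
    return f"{topic_prefix}.{schema_name}.{table_name}"

tables = ["products", "products_on_hand", "customers", "orders"]

# Tables in the inventory schema, with topic prefix dbserver:
inventory_topics = [topic_name("dbserver", "inventory", t) for t in tables]

# The same tables created in the default public schema instead:
public_topics = [topic_name("dbserver", "public", t) for t in tables]
```

This reproduces the topic lists shown above, such as `dbserver.inventory.products` and `dbserver.public.orders`.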
+
+If the default topic names don't meet your requirements, you can configure custom topic names. To configure custom topic names, you specify regular expressions in the logical topic routing SMT. For more information about using the logical topic routing SMT to customize topic naming, see the Debezium documentation on [Topic routing](https://debezium.io/documentation/reference/2.5/transformations/topic-routing.html).
+
+### Transaction metadata
+
+Debezium can generate events that represent transaction boundaries and that enrich data change event messages.
+
+{{< note title="Limits on when Debezium receives transaction metadata" >}}
+
+Debezium registers and receives metadata only for transactions that occur _after you deploy the connector_. Metadata for transactions that occur before you deploy the connector is not available.
+
+{{< /note >}}
+
+For every transaction `BEGIN` and `END`, Debezium generates an event containing the following fields:
+
+* `status` - `BEGIN` or `END`.
+* `id` - String representation of the unique transaction identifier, composed of the YugabyteDB transaction ID and the LSN of the given operation, separated by a colon; that is, the format is `txID:LSN`.
+* `ts_ms` - The time of a transaction boundary event (`BEGIN` or `END` event) at the data source. If the data source does not provide Debezium with the event time, then the field instead represents the time at which Debezium processes the event.
+* `event_count` (for `END` events) - Total number of events emitted by the transaction.
+* `data_collections` (for `END` events) - An array of pairs of `data_collection` and `event_count` that provides the number of events emitted by changes originating from the given data collection.
+
+For example:
+
+```output.json
+{
+  "status": "BEGIN",
+  "id": "571:53195829",
+  "ts_ms": 1486500577125,
+  "event_count": null,
+  "data_collections": null
+}
+
+{
+  "status": "END",
+  "id": "571:53195832",
+  "ts_ms": 1486500577691,
+  "event_count": 2,
+  "data_collections": [
+    {
+      "data_collection": "s1.a",
+      "event_count": 1
+    },
+    {
+      "data_collection": "s2.a",
+      "event_count": 1
+    }
+  ]
+}
+```
+
+Unless overridden via the `transaction.topic` option, transaction events are written to the topic named _topicPrefix_`.transaction`.
+
+#### Change data event enrichment
+
+When transaction metadata is enabled, the data message `Envelope` is enriched with a new `transaction` field. This field provides information about every event in the form of a composite of fields:
+
+* `id` - string representation of unique transaction identifier
+* `total_order` - absolute position of the event among all events generated by the transaction
+* `data_collection_order` - the per-data collection position of the event among all events emitted by the transaction
+
+Following is an example of a message:
+
+```output.json
+{
+  "before": null,
+  "after": {
+    "pk": "2",
+    "aa": "1"
+  },
+  "source": {
+    ...
+  },
+  "op": "c",
+  "ts_ms": "1580390884335",
+  "transaction": {
+    "id": "571:53195832",
+    "total_order": "1",
+    "data_collection_order": "1"
+  }
+}
+```
+
+## Data change events
+
+The YugabyteDB connector generates a data change event for each row-level `INSERT`, `UPDATE`, and `DELETE` operation. Each event contains a key and a value. The structure of the key and the value depends on the table that was changed.
+
+Debezium and Kafka Connect are designed around _continuous streams of event messages_. However, the structure of these events may change over time, which can be difficult for consumers to handle. To address this, each event contains the schema for its content or, if you are using a schema registry, a schema ID that a consumer can use to obtain the schema from the registry. 
This makes each event self-contained. + +The following skeleton JSON shows the basic four parts of a change event. However, how you configure the Kafka Connect converter that you choose to use in your application determines the representation of these four parts in change events. A `schema` field is in a change event only when you configure the converter to produce it. Likewise, the event key and event payload are in a change event only if you configure a converter to produce it. If you use the JSON converter and you configure it to produce all four basic change event parts, change events have this structure: + +```output.json +{ + "schema": { --> 1 + ... + }, + "payload": { --> 2 + ... + }, + "schema": { --> 3 + ... + }, + "payload": { --> 4 + ... + } +} +``` + +The following table describes the content of the change events. + +| Item | Field name | Description | +| :--: | :--------- | :---------- | +| 1 | `schema` | The first `schema` field is part of the event key. It specifies a Kafka Connect schema that describes what is in the event key's `payload` portion. In other words, the first `schema` field describes the structure of the primary key, or the unique key if the table does not have a primary key, for the table that was changed. | +| 2 | `payload` | The first `payload` field is part of the event key. It has the structure described by the previous `schema` field and it contains the key for the row that was changed. | +| 3 | `schema` | The second `schema` field is part of the event value. It specifies the Kafka Connect schema that describes what is in the event value's `payload` portion. In other words, the second `schema` describes the structure of the row that was changed. Typically, this schema contains nested schemas. | +| 4 | `payload` | The second `payload` field is part of the event value. It has the structure described by the previous `schema` field and it contains the actual data for the row that was changed. 
| + +By default, the connector streams change event records to [Kafka topics](#topic-names) with names that are the same as the event's originating table. + +{{< note title="Note" >}} + +Starting with Kafka 0.10, Kafka can optionally record the event key and value with the timestamp at which the message was created (recorded by the producer) or written to the log by Kafka. + +{{< /note >}} + +{{< warning title="Warning" >}} + +The YugabyteDB connector ensures that all Kafka Connect schema names adhere to the Avro schema name format. This means that the logical server name must start with a Latin letter or an underscore, that is, `a-z`, `A-Z`, or `_`. Each remaining character in the logical server name and each character in the schema and table names must be a Latin letter, a digit, or an underscore, that is, `a-z`, `A-Z`, `0-9`, or `_`. If there is an invalid character it is replaced with an underscore character. + +This can lead to unexpected conflicts if the topic prefix, a schema name, or a table name contains invalid characters, and the only characters that distinguish names from one another are invalid and thus replaced with underscores. + +{{< /warning >}} + +### Change event keys + +For a given table, the change event's key has a structure that contains a field for each column in the primary key of the table at the time the event was created. Alternatively, if the table has `REPLICA IDENTITY` set to `FULL` there is a field for each unique key constraint. + +Consider a `customers` table defined in the `public` database schema and the example of a change event key for that table. 
+ +**Example table:** + +```sql +CREATE TABLE customers ( + id SERIAL, + first_name VARCHAR(255) NOT NULL, + last_name VARCHAR(255) NOT NULL, + email VARCHAR(255) NOT NULL, + PRIMARY KEY(id) +); +``` + +#### Example change event key + +If the `topic.prefix` connector configuration property has the value `YugabyteDB_server`, every change event for the `customers` table while it has this definition has the same key structure, which in JSON looks like this: + +```output.json +{ + "schema": { --> 1 + "type": "struct", + "name": "YugabyteDB_server.public.customers.Key", --> 2 + "optional": false, --> 3 + "fields": [ --> 4 + { + "name": "id", + "index": "0", + "schema": { + "type": "INT32", + "optional": "false" + } + } + ] + }, + "payload": { --> 5 + "id": "1" + }, +} +``` + +**Description of a change event key:** + +| Item | Field name | Description | +| :--- | :--------- | :---------- | +| 1 | schema | The schema portion of the key specifies a Kafka Connect schema that describes what is in the key's `payload` portion. | +| 2 | YugabyteDB_server.public.customers.Key | Name of the schema that defines the structure of the key's payload. This schema describes the structure of the primary key for the table that was changed. Key schema names have the format _connector-name.database-name.table-name.Key_. In this example:
`YugabyteDB_server` is the name of the connector that generated this event.
`public` is the schema which contains the table that was changed.
`customers` is the table that was updated. |
+| 3 | optional | Indicates whether the event key must contain a value in its `payload` field. In this example, a value in the key's payload is required. |
+| 4 | fields | Specifies each field that is expected in the payload, including each field's name, index, and schema. |
+| 5 | payload | Contains the key for the row for which this change event was generated. In this example, the key contains a single `id` field whose value is `1`. |
+
+{{< note title="Note" >}}
+
+Although the `column.exclude.list` and `column.include.list` connector configuration properties allow you to capture only a subset of table columns, all columns in a primary or unique key are always included in the event's key.
+
+{{< /note >}}
+
+{{< warning title="Warning" >}}
+
+CDC is not supported for tables without primary keys.
+
+{{< /warning >}}
+
+### Change event values
+
+The value in a change event is a bit more complicated than the key. Like the key, the value has a `schema` section and a `payload` section. The `schema` section contains the schema that describes the `Envelope` structure of the `payload` section, including its nested fields. Change events for operations that create, update, or delete data all have a value payload with an envelope structure.
+
+Consider the same sample table that was used to show an example of a change event key:
+
+```sql
+CREATE TABLE customers (
+  id SERIAL,
+  first_name VARCHAR(255) NOT NULL,
+  last_name VARCHAR(255) NOT NULL,
+  email VARCHAR(255) NOT NULL,
+  PRIMARY KEY(id)
+);
+```
+
+
+
+### Replica Identity
+
+[REPLICA IDENTITY](https://www.postgresql.org/docs/11/sql-altertable.html#SQL-CREATETABLE-REPLICA-IDENTITY) is a PostgreSQL table-level setting, extended by YugabyteDB, that determines the amount of information that is available to the logical decoding plugin for `UPDATE` and `DELETE` events. More specifically, the setting of `REPLICA IDENTITY` controls what (if any) information is available for the previous values of the table columns involved whenever an `UPDATE` or `DELETE` event occurs.
+
+There are four possible values for `REPLICA IDENTITY`:
+
+* `CHANGE` - Emitted events for `UPDATE` operations contain only the values of the changed columns and the primary key columns, with no previous values present. Events for `DELETE` operations contain only the previous values of the primary key columns.
+* `DEFAULT` - Only `DELETE` events contain previous values, and only for the primary key columns of the table. `UPDATE` events contain no previous values; the new values are present for all columns in the table.
+* `FULL` - Emitted events for `UPDATE` and `DELETE` operations contain the previous values of all columns in the table.
+* `NOTHING` - Emitted events for `UPDATE` and `DELETE` operations do not contain any information about the previous value of any table column.
+
+{{< note title="Note" >}}
+
+YugabyteDB supports the replica identity `CHANGE` only with the `yboutput` plugin.
+ +{{< /note >}} + +#### Message formats for replica identities + +Consider the following employee table into which a row is inserted, subsequently updated, and deleted: + +```sql +CREATE TABLE employee ( + employee_id INT PRIMARY KEY, + employee_name VARCHAR, + employee_dept TEXT); + +INSERT INTO employee VALUES (1001, 'Alice', 'Packaging'); + +UPDATE employee SET employee_name = 'Bob' WHERE employee_id = 1001; + +DELETE FROM employee WHERE employee_id = 1001; +``` + +{{< tabpane text=true >}} + + {{% tab header="CHANGE" lang="change" %}} + +**yboutput plugin** + + + + + + + + + + + + + + +
INSERT UPDATE DELETE
+
+{
+  "before": null,
+  "after": {
+    "employee_id": {
+        "value": 1001,
+        "set": true
+    },
+    "employee_name": {
+      "value": "Alice",
+      "set": true
+    },
+    "employee_dept": {
+        "value": "Packaging",
+        "set": true
+    }
+  },
+  "op": "c"
+}
+
+
+
+{
+  "before": null,
+  "after": {
+    "employee_id": {
+        "value": 1001,
+        "set": true
+    },
+    "employee_name": {
+      "value": "Bob",
+      "set": true
+    },
+    "employee_dept": null
+  },
+  "op": "u"
+}
+
+
+
+{
+  "before": {
+    "employee_id": {
+      "value": 1001,
+      "set": true
+    },
+    "employee_name": null,
+    "employee_dept": null
+  },
+  "after": null,
+  "op": "d"
+}
+
+
+ + {{% /tab %}} + + {{% tab header="DEFAULT" lang="default" %}} + +**yboutput plugin** + + + + + + + + + + + + + + +
INSERT UPDATE DELETE
+
+{
+  "before": null,
+  "after": {
+    "employee_id": {
+        "value": 1001,
+        "set": true
+    },
+    "employee_name": {
+      "value": "Alice",
+      "set": true
+    },
+    "employee_dept": {
+        "value": "Packaging",
+        "set": true
+    }
+  },
+  "op": "c"
+}
+
+
+
+{
+  "before": null,
+  "after": {
+    "employee_id": {
+        "value": 1001,
+        "set": true
+    },
+    "employee_name": {
+      "value": "Bob",
+      "set": true
+    },
+    "employee_dept": null
+  },
+  "op": "u"
+}
+
+
+
+{
+  "before": {
+    "employee_id": {
+      "value": 1001,
+      "set": true
+    },
+    "employee_name": null,
+    "employee_dept": null
+  },
+  "after": null,
+  "op": "d"
+}
+
+
+ +**pgoutput plugin** + + + + + + + + + + + + + + + +
INSERT UPDATE DELETE
+
+{
+  "before": null,
+  "after": {
+    "employee_id": 1001,
+    "employee_name": "Alice",
+    "employee_dept": "Packaging"
+  },
+  "op": "c"
+}
+
+
+
+{
+  "before": null,
+  "after": {
+    "employee_id": 1001,
+    "employee_name": "Bob",
+    "employee_dept": "Packaging"
+  },
+  "op": "u"
+}
+
+
+
+{
+  "before": {
+    "employee_id": 1001
+  },
+  "after": null,
+  "op": "d"
+}
+
+
+ + {{% /tab %}} + + {{% tab header="FULL" lang="full" %}} + +**yboutput plugin** + + + + + + + + + + + + + + +
INSERT UPDATE DELETE
+
+{
+  "before": null,
+  "after": {
+    "employee_id": {
+        "value": 1001,
+        "set": true
+    },
+    "employee_name": {
+      "value": "Alice",
+      "set": true
+    },
+    "employee_dept": {
+        "value": "Packaging",
+        "set": true
+    }
+  },
+  "op": "c"
+}
+
+
+
+{
+  "before": {
+    "employee_id": {
+        "value": 1001,
+        "set": true
+    },
+    "employee_name": {
+      "value": "Alice",
+      "set": true
+    },
+    "employee_dept": {
+      "value": "Packaging",
+      "set": true
+    }
+  },
+  "after": {
+    "employee_id": {
+        "value": 1001,
+        "set": true
+    },
+    "employee_name": {
+      "value": "Bob",
+      "set": true
+    },
+    "employee_dept": {
+      "value": "Packaging",
+      "set": true
+    }
+  },
+  "op": "u"
+}
+
+
+
+{
+  "before": {
+    "employee_id": {
+        "value": 1001,
+        "set": true
+    },
+    "employee_name": {
+      "value": "Bob",
+      "set": true
+    },
+    "employee_dept": {
+      "value": "Packaging",
+      "set": true
+    }
+  },
+  "after": null,
+  "op": "d"
+}
+
+
+ +**pgoutput plugin** + + + + + + + + + + + + + + +
INSERT UPDATE DELETE
+
+{
+  "before": null,
+  "after": {
+    "employee_id": 1001,
+    "employee_name": "Alice",
+    "employee_dept": "Packaging"
+  },
+  "op": "c"
+}
+
+
+
+{
+  "before": {
+    "employee_id": 1001,
+    "employee_name": "Alice",
+    "employee_dept": "Packaging"
+  },
+  "after": {
+    "employee_id": 1001,
+    "employee_name": "Bob",
+    "employee_dept": "Packaging"
+  },
+  "op": "u"
+}
+
+
+
+{
+  "before": {
+    "employee_id": 1001,
+    "employee_name": "Bob",
+    "employee_dept": "Packaging"
+  },
+  "after": null,
+  "op": "d"
+}
+
+
+ + {{% /tab %}} + + {{% tab header="NOTHING" lang="nothing" %}} + +**yboutput plugin** + + + + + + + + + + +
INSERT
+
+{
+  "before": null,
+  "after": {
+    "employee_id": {
+        "value": 1001,
+        "set": true
+    },
+    "employee_name": {
+      "value": "Alice",
+      "set": true
+    },
+    "employee_dept": {
+        "value": "Packaging",
+        "set": true
+    }
+  },
+  "op": "c"
+}
+
+
+ +**pgoutput plugin** + + + + + + + + + + +
INSERT
+
+{
+  "before": null,
+  "after": {
+    "employee_id": 1001,
+    "employee_name": "Alice",
+    "employee_dept": "Packaging"
+  },
+  "op": "c"
+}
+
+
+ + {{% /tab %}} + +{{< /tabpane >}} + +{{< note title="Note" >}} + +If `UPDATE` and `DELETE` operations will be performed on a table in publication without any replica identity (that is, `REPLICA IDENTITY` set to `NOTHING`), then the operations will cause an error on the publisher. For more details, see [Publication](https://www.postgresql.org/docs/11/logical-replication-publication.html). + +{{< /note >}} + +### *create* events + +The following example shows the value portion of a change event that the connector generates for an operation that creates data in the `customers` table: + +```output.json +{ + "schema": { --> 1 + "type": "struct", + "fields": [ + { + "type": "struct", + "fields": [ + { + "type": "int32", + "optional": false, + "field": "id" + }, + { + "type": "string", + "optional": false, + "field": "first_name" + }, + { + "type": "string", + "optional": false, + "field": "last_name" + }, + { + "type": "string", + "optional": false, + "field": "email" + } + ], + "optional": true, + "name": "YugabyteDB_server.inventory.customers.Value", --> 2 + "field": "before" + }, + { + "type": "struct", + "fields": [ + { + "type": "int32", + "optional": false, + "field": "id" + }, + { + "type": "string", + "optional": false, + "field": "first_name" + }, + { + "type": "string", + "optional": false, + "field": "last_name" + }, + { + "type": "string", + "optional": false, + "field": "email" + } + ], + "optional": true, + "name": "YugabyteDB_server.inventory.customers.Value", + "field": "after" + }, + { + "type": "struct", + "fields": [ + { + "type": "string", + "optional": false, + "field": "version" + }, + { + "type": "string", + "optional": false, + "field": "connector" + }, + { + "type": "string", + "optional": false, + "field": "name" + }, + { + "type": "int64", + "optional": false, + "field": "ts_ms" + }, + { + "type": "boolean", + "optional": true, + "default": false, + "field": "snapshot" + }, + { + "type": "string", + "optional": false, + "field": "db" + }, + 
{ + "type": "string", + "optional": false, + "field": "schema" + }, + { + "type": "string", + "optional": false, + "field": "table" + }, + { + "type": "int64", + "optional": true, + "field": "txId" + }, + { + "type": "int64", + "optional": true, + "field": "lsn" + }, + { + "type": "int64", + "optional": true, + "field": "xmin" + } + ], + "optional": false, + "name": "io.debezium.connector.postgresql.Source", --> 3 + "field": "source" + }, + { + "type": "string", + "optional": false, + "field": "op" + }, + { + "type": "int64", + "optional": true, + "field": "ts_ms" + } + ], + "optional": false, + "name": "YugabyteDB_server.public.customers.Envelope" --> 4 + }, + "payload": { --> 5 + "before": null, --> 6 + "after": { --> 7 + "id": 1, + "first_name": "Anne", + "last_name": "Kretchmar", + "email": "annek@noanswer.org" + }, + "source": { --> 8 + "version": "2.5.2.Final", + "connector": "YugabyteDB", + "name": "YugabyteDB_server", + "ts_ms": 1559033904863, + "snapshot": true, + "db": "postgres", + "sequence": "[\"24023119\",\"24023128\"]", + "schema": "public", + "table": "customers", + "txId": 555, + "lsn": 24023128, + "xmin": null + }, + "op": "c", --> 9 + "ts_ms": 1559033904863 --> 10 + } +} +``` + +The following table describes the create event value fields. + +| Item | Field name | Description | +| :---- | :------ | :------------ | +| 1 | schema | The value's schema, which describes the structure of the value's payload. A change event's value schema is the same in every change event that the connector generates for a particular table. | +| 2 | name | In the schema section, each name field specifies the schema for a field in the value's payload.

`YugabyteDB_server.inventory.customers.Value` is the schema for the payload's _before_ and _after_ fields. This schema is specific to the customers table.

Names of schemas for _before_ and _after_ fields are of the form `logicalName.tableName.Value`, which ensures that the schema name is unique in the database. This means that when using the [Avro Converter](https://www.confluent.io/hub/confluentinc/kafka-connect-avro-converter), the resulting Avro schema for each table in each logical source has its own evolution and history. |
+| 3 | name | `io.debezium.connector.postgresql.Source` is the schema for the payload's `source` field. This schema is specific to the YugabyteDB connector. The connector uses it for all events that it generates. |
+| 4 | name | `YugabyteDB_server.public.customers.Envelope` is the schema for the overall structure of the payload, where `YugabyteDB_server` is the connector name, `public` is the schema, and `customers` is the table. |
+| 5 | payload | The value's actual data. This is the information that the change event is providing.

It may appear that the JSON representations of the events are much larger than the rows they describe. This is because the JSON representation must include the schema and the payload portions of the message. However, by using the Avro converter, you can significantly decrease the size of the messages that the connector streams to Kafka topics. | +| 6 | before | An optional field that specifies the state of the row before the event occurred. When the op field is `c` for create, as it is in this example, the `before` field is `null` as this change event is for new content.
{{< note title="Note" >}}Whether or not this field is available is dependent on the [REPLICA IDENTITY](#replica-identity) setting for each table.{{< /note >}} | +| 7 | after | An optional field that specifies the state of the row after the event occurred. In this example, the `after` field contains the values of the new row's `id`, `first_name`, `last_name`, and `email` columns. | +| 8 | source | Mandatory field that describes the source metadata for the event. This field contains information that you can use to compare this event with other events, with regard to the origin of the events, the order in which the events occurred, and whether events were part of the same transaction. The source metadata includes:
  • Debezium version
  • Connector type and name
  • Database and table that contains the new row
  • Stringified JSON array of additional offset information. The first value is always the last committed LSN, the second value is always the current LSN. Either value may be null.
  • Schema name
  • If the event was part of a snapshot
  • ID of the transaction in which the operation was performed
  • Offset of the operation in the database log
  • Timestamp for when the change was made in the database
| +| 9 | op | Mandatory string that describes the type of operation that caused the connector to generate the event. In this example, `c` indicates that the operation created a row. Valid values are:
  • `c` = create
  • `r` = read (applies only to snapshots)
  • `u` = update
  • `d` = delete
| +| 10 | ts_ms | Optional field that displays the time at which the connector processed the event. The time is based on the system clock in the JVM running the Kafka Connect task.

In the `source` object, `ts_ms` indicates the time that the change was made in the database. By comparing the value for `payload.source.ts_ms` with the value for `payload.ts_ms`, you can determine the lag between the source database update and Debezium. |
+
+### *update* events
+
+The value of a change event for an update in the sample `customers` table has the same schema as a create event for that table. Likewise, the event value's payload has the same structure. However, the event value payload contains different values in an update event. The following is an example of a change event value in an event that the connector generates for an update in the `customers` table:
+
+```output.json
+{
+  "schema": { ... },
+  "payload": {
+    "before": null, --> 1
+    "after": { --> 2
+      "id": 1,
+      "first_name": "Anne Marie",
+      "last_name": "Kretchmar",
+      "email": "annek@noanswer.org"
+    },
+    "source": { --> 3
+      "version": "2.5.2.Final",
+      "connector": "YugabyteDB",
+      "name": "YugabyteDB_server",
+      "ts_ms": 1559033904863,
+      "snapshot": false,
+      "db": "postgres",
+      "schema": "public",
+      "table": "customers",
+      "txId": 556,
+      "lsn": 24023128,
+      "xmin": null
+    },
+    "op": "u", --> 4
+    "ts_ms": 1465584025523 --> 5
+  }
+}
+```
+
+The following table describes the update event value fields.
+
+| Item | Field name | Description |
+| :---- | :------ | :------------ |
+| 1 | before | An optional field that contains values that were in the row before the database commit. In this example, no previous value is present for any of the columns because the table's [REPLICA IDENTITY](#replica-identity) setting is `DEFAULT`. For an update event to contain the previous values of all columns in the row, you would have to change the `customers` table by running `ALTER TABLE customers REPLICA IDENTITY FULL`. |
+| 2 | after | An optional field that specifies the state of the row after the event occurred. In this example, the `first_name` value is now `Anne Marie`.
| +| 3 | source | Mandatory field that describes the source metadata for the event. The `source` field structure has the same fields as in a create event, but some values are different. The source metadata includes:
  • Debezium version
  • Connector type and name
  • Database and table that contains the new row
  • Schema name
  • If the event was part of a snapshot (always `false` for _update_ events)
  • ID of the transaction in which the operation was performed
  • Offset of the operation in the database log
  • Timestamp for when the change was made in the database
| +| 4 | op | Mandatory string that describes the type of operation. In an update event value, the `op` field value is `u`, signifying that this row changed because of an update. | +| 5 | ts_ms | Optional field that displays the time at which the connector processed the event. The time is based on the system clock in the JVM running the Kafka Connect task.

In the `source` object, `ts_ms` indicates the time that the change was made in the database. By comparing the value for `payload.source.ts_ms` with the value for `payload.ts_ms`, you can determine the lag between the source database update and Debezium. | + +{{< note title="Note" >}} + +Updating the columns for a row's primary/unique key changes the value of the row's key. When a key changes, Debezium outputs three events: a `DELETE` event and a [tombstone event](#tombstone-events) with the old key for the row, followed by an event with the new key for the row. Details are in the next section. + +{{< /note >}} + +### Primary key updates + +An `UPDATE` operation that changes a row's primary key field(s) is known as a primary key change. For a primary key change, in place of sending an `UPDATE` event record, the connector sends a `DELETE` event record for the old key and a `CREATE` event record for the new (updated) key. + +### _delete_ events + +The value in a _delete_ change event has the same `schema` portion as create and update events for the same table. The `payload` portion in a delete event for the sample `customers` table looks like this: + +```output.json +{ + "schema": { ... }, + "payload": { + "before": { --> 1 + "id": 1 + }, + "after": null, --> 2 + "source": { --> 3 + "version": "2.5.4.Final", + "connector": "YugabyteDB", + "name": "YugabyteDB_server", + "ts_ms": 1559033904863, + "snapshot": false, + "db": "postgres", + "schema": "public", + "table": "customers", + "txId": 556, + "lsn": 46523128, + "xmin": null + }, + "op": "d", --> 4 + "ts_ms": 1465581902461 --> 5 + } +} +``` + +The following table describes the delete event value fields. + +| Item | Field name | Description | +| :---- | :------ | :------------ | +| 1 | before | Optional field that specifies the state of the row before the event occurred. In a _delete_ event value, the `before` field contains the values that were in the row before it was deleted with the database commit.

In this example, the before field contains only the primary key column because the table's [REPLICA IDENTITY](#replica-identity) setting is `DEFAULT`. | +| 2 | after | Optional field that specifies the state of the row after the event occurred. In a delete event value, the `after` field is `null`, signifying that the row no longer exists. | +| 3 | source | Mandatory field that describes the source metadata for the event. In a delete event value, the source field structure is the same as for create and update events for the same table. Many source field values are also the same. In a delete event value, the `ts_ms` and `lsn` field values, as well as other values, might have changed. But the source field in a delete event value provides the same metadata:
  • Debezium version
  • Connector type and name
  • Database and table that contained the deleted row
  • Schema name
  • If the event was part of a snapshot (always false for delete events)
  • ID of the transaction in which the operation was performed
  • Offset of the operation in the database log
  • Timestamp for when the change was made in the database
| +| 4 | op | Mandatory string that describes the type of operation. The `op` field value is `d`, signifying that this row was deleted. | +| 5 | ts_ms | Optional field that displays the time at which the connector processed the event. The time is based on the system clock in the JVM running the Kafka Connect task.

In the `source` object, `ts_ms` indicates the time that the change was made in the database. By comparing the value for `payload.source.ts_ms` with the value for `payload.ts_ms`, you can determine the lag between the source database update and Debezium. | + +A _delete_ change event record provides a consumer with the information it needs to process the removal of this row. + +YugabyteDB connector events are designed to work with [Kafka log compaction](https://kafka.apache.org/documentation#compaction). Log compaction enables removal of some older messages as long as at least the most recent message for every key is kept. This lets Kafka reclaim storage space while ensuring that the topic contains a complete data set and can be used for reloading key-based state. + +#### Tombstone events + +When a row is deleted, the _delete_ event value still works with log compaction, because Kafka can remove all earlier messages that have that same key. However, for Kafka to remove all messages that have that same key, the message value must be `null`. To make this possible, the YugabyteDB connector follows a _delete_ event with a special tombstone event that has the same key but a `null` value. + + + +## Data type mappings + +The YugabyteDB connector represents changes to rows with events that are structured like the table in which the row exists. The event contains a field for each column value. How that value is represented in the event depends on the YugabyteDB data type of the column. The following sections describe how the connector maps YugabyteDB data types to a literal type and a semantic type in event fields. + +* `literal` type describes how the value is literally represented using Kafka Connect schema types: `INT8`, `INT16`, `INT32`, `INT64`, `FLOAT32`, `FLOAT64`, `BOOLEAN`, `STRING`, `BYTES`, `ARRAY`, `MAP`, and `STRUCT`. 
+* `semantic` type describes how the Kafka Connect schema captures the meaning of the field using the name of the Kafka Connect schema for the field. + +If the default data type conversions do not meet your needs, you can [create a custom converter](https://debezium.io/documentation/reference/2.5/development/converters.html#custom-converters) for the connector. + +### Basic types + +| YugabyteDB data type| Literal type (schema type) | Semantic type (schema name) and Notes | +| :------------------ | :------------------------- | :-------------------------- | +| `BOOLEAN` | `BOOLEAN` | N/A | +| `BIT(1)` | `BOOLEAN` | N/A | +| `BIT( > 1)` | `BYTES` | `io.debezium.data.Bits`
The `length` schema parameter contains an integer that represents the number of bits. The resulting `byte[]` contains the bits in little-endian form and is sized to contain the specified number of bits. For example, `numBytes = n/8 + (n % 8 == 0 ? 0 : 1)` where `n` is the number of bits. | +| `BIT VARYING[(M)]` | `BYTES` | `io.debezium.data.Bits`
The `length` schema parameter contains an integer that represents the number of bits (2^31 - 1 in case no length is given for the column). The resulting `byte[]` contains the bits in little-endian form and is sized based on the content. The specified size (`M`) is stored in the length parameter of the `io.debezium.data.Bits` type. | +| `SMALLINT`, `SMALLSERIAL` | `INT16` | N/A | +| `INTEGER`, `SERIAL` | `INT32` | N/A | +| `BIGINT`, `BIGSERIAL`, `OID` | `INT64` | N/A | +| `REAL` | `FLOAT32` | N/A | +| `DOUBLE PRECISION` | `FLOAT64` | N/A | +| `CHAR [(M)]` | `STRING` | N/A | +| `VARCHAR [(M)]` | `STRING` | N/A | +| `CHARACTER [(M)]` | `STRING` | N/A | +| `CHARACTER VARYING [(M)]` | `STRING` | N/A | +| `TIMESTAMPTZ`, `TIMESTAMP WITH TIME ZONE` | `STRING` | `io.debezium.time.ZonedTimestamp`
A string representation of a timestamp with timezone information, where the timezone is GMT. | +| `TIMETZ`, `TIME WITH TIME ZONE` | `STRING` | `io.debezium.time.ZonedTime`
A string representation of a time value with timezone information, where the timezone is GMT. | +| `INTERVAL [P]` | `INT64` | `io.debezium.time.MicroDuration` (default)
The approximate number of microseconds for a time interval using the `365.25 / 12.0` formula for days per month average. | +| `INTERVAL [P]` | `STRING` | `io.debezium.time.Interval`
(when `interval.handling.mode` is `string`)
The string representation of the interval value that follows the pattern
`P<years>Y<months>M<days>DT<hours>H<minutes>M<seconds>S`.
For example, `P1Y2M3DT4H5M6.78S`. | +| `BYTEA` | `BYTES` or `STRING` | n/a

Either the raw bytes (the default), a base64-encoded string, a base64-url-safe-encoded string, or a hex-encoded string, based on the connector's `binary.handling.mode` setting.

Debezium supports only the YugabyteDB `bytea_output` configuration value `hex`. For more information about binary data types, see [Binary data types](../../../../api/ysql/datatypes/type_binary/). |
Contains the string representation of a JSON document, array, or scalar. | +| `UUID` | `STRING` | `io.debezium.data.Uuid`
Contains the string representation of a YugabyteDB UUID value. | +| `INT4RANGE` | `STRING` | Range of integer. | +| `INT8RANGE` | `STRING` | Range of `bigint`. | +| `NUMRANGE` | `STRING` | Range of `numeric`. | +| `TSRANGE` | `STRING` | n/a

The string representation of a timestamp range without a time zone. | +| `TSTZRANGE` | `STRING` | n/a

The string representation of a timestamp range with the local system time zone. | +| `DATERANGE` | `STRING` | n/a

The string representation of a date range. Always has an _exclusive_ upper bound. | +| `ENUM` | `STRING` | `io.debezium.data.Enum`

Contains the string representation of the YugabyteDB `ENUM` value. The set of allowed values is maintained in the allowed schema parameter. | + +### Temporal types + +Other than YugabyteDB's `TIMESTAMPTZ` and `TIMETZ` data types, which contain time zone information, how temporal types are mapped depends on the value of the `time.precision.mode` connector configuration property. The following sections describe these mappings: + +* `time.precision.mode=adaptive` +* `time.precision.mode=adaptive_time_microseconds` +* `time.precision.mode=connect` + +#### time.precision.mode=adaptive + +When the `time.precision.mode` property is set to `adaptive`, the default, the connector determines the literal type and semantic type based on the column's data type definition. This ensures that events _exactly_ represent the values in the database. + +| YugabyteDB data type | Literal type (schema type) | Semantic type (schema name) and Notes | +| :----- | :----- | :----- | +| `DATE` | `INT32` | `io.debezium.time.Date`
Represents the number of days since the epoch. | +| `TIME(1)`, `TIME(2)`, `TIME(3)` | `INT32` | `io.debezium.time.Time`
Represents the number of milliseconds past midnight, and does not include timezone information. | +| `TIME(4)`, `TIME(5)`, `TIME(6)` | `INT64` | `io.debezium.time.MicroTime`
Represents the number of microseconds past midnight, and does not include timezone information. | +| `TIMESTAMP(1)`, `TIMESTAMP(2)`, `TIMESTAMP(3)` | `INT64` | `io.debezium.time.Timestamp`
Represents the number of milliseconds since the epoch, and does not include timezone information. | +| `TIMESTAMP(4)`, `TIMESTAMP(5)`, `TIMESTAMP(6)`, `TIMESTAMP` | `INT64` | `io.debezium.time.MicroTimestamp`
Represents the number of microseconds since the epoch, and does not include timezone information. | + +#### time.precision.mode=adaptive_time_microseconds + +When the `time.precision.mode` configuration property is set to `adaptive_time_microseconds`, the connector determines the literal type and semantic type for temporal types based on the column's data type definition. This ensures that events _exactly_ represent the values in the database, except all `TIME` fields are captured as microseconds. + +| YugabyteDB data type | Literal type (schema type) | Semantic type (schema name) and Notes | +| :----- | :----- | :----- | +| `DATE` | `INT32` | `io.debezium.time.Date`
Represents the number of days since the epoch. | +| `TIME([P])` | `INT64` | `io.debezium.time.MicroTime`
Represents the time value in microseconds and does not include timezone information. YugabyteDB allows precision `P` to be in the range 0-6 to store up to microsecond precision. | +| `TIMESTAMP(1)` , `TIMESTAMP(2)`, `TIMESTAMP(3)` | `INT64` | `io.debezium.time.Timestamp`
Represents the number of milliseconds since the epoch, and does not include timezone information. |
+| `TIMESTAMP(4)`, `TIMESTAMP(5)`, `TIMESTAMP(6)`, `TIMESTAMP` | `INT64` | `io.debezium.time.MicroTimestamp`
Represents the number of microseconds since the epoch, and does not include timezone information. |
+
+#### time.precision.mode=connect
+
+When the `time.precision.mode` configuration property is set to `connect`, the connector uses Kafka Connect logical types. This may be useful when consumers can handle only the built-in Kafka Connect logical types and are unable to handle variable-precision time values. However, because YugabyteDB supports microsecond precision, the events generated by a connector with the `connect` time precision mode result in a loss of precision when the database column has a fractional second precision value that is greater than 3.
+
+| YugabyteDB data type | Literal type (schema type) | Semantic type (schema name) and Notes |
+| :----- | :----- | :----- |
+| `DATE` | `INT32` | `org.apache.kafka.connect.data.Date`
Represents the number of days since the epoch. | +| `TIME([P])` | `INT64` | `org.apache.kafka.connect.data.Time`
Represents the number of milliseconds since midnight, and does not include timezone information. YugabyteDB allows `P` to be in the range 0-6 to store up to microsecond precision, though this mode results in a loss of precision when `P` is greater than 3. | +| `TIMESTAMP([P])` | `INT64` | `org.apache.kafka.connect.data.Timestamp`
Represents the number of milliseconds since the epoch, and does not include timezone information. YugabyteDB allows `P` to be in the range 0-6 to store up to microsecond precision, though this mode results in a loss of precision when `P` is greater than 3. | + +### TIMESTAMP type + +The `TIMESTAMP` type represents a timestamp without time zone information. Such columns are converted into an equivalent Kafka Connect value based on UTC. For example, the `TIMESTAMP` value "2018-06-20 15:13:16.945104" is represented by an `io.debezium.time.MicroTimestamp` with the value "1529507596945104" when `time.precision.mode` is not set to `connect`. + +The timezone of the JVM running Kafka Connect and Debezium does not affect this conversion. + +YugabyteDB supports using +/-infinite values in `TIMESTAMP` columns. These special values are converted to timestamps with value `9223372036825200000` in case of positive infinity or `-9223372036832400000` in case of negative infinity. This behavior mimics the standard behavior of the YugabyteDB JDBC driver. For reference, see the [`org.postgresql.PGStatement`](https://jdbc.postgresql.org/documentation/publicapi/org/postgresql/PGStatement.html) interface. + +### Decimal types + +The setting of the YugabyteDB connector configuration property `decimal.handling.mode` determines how the connector maps decimal types. + +#### decimal.handling.mode=double + +When the `decimal.handling.mode` property is set to `double`, the connector represents all `DECIMAL`, `NUMERIC` and `MONEY` values as Java double values and encodes them as shown in the following table. + +| YugabyteDB data type | Literal type (schema type) | Semantic type (schema name) and Notes | +| :----- | :----- | :----- | +| `NUMERIC[(M[,D])]` | `FLOAT64` | | +| `DECIMAL[(M[,D])]` | `FLOAT64` | | +| `MONEY[(M[,D])]` | `FLOAT64` | | + +#### decimal.handling.mode=string + +The last possible setting for the `decimal.handling.mode` configuration property is `string`. 
In this case, the connector represents `DECIMAL`, `NUMERIC` and `MONEY` values as their formatted string representation, and encodes them as shown in the following table. + +| YugabyteDB data type | Literal type (schema type) | Semantic type (schema name) and Notes | +| :----- | :----- | :----- | +| `NUMERIC[(M[,D])]` | `STRING` | | +| `DECIMAL[(M[,D])]` | `STRING` | | +| `MONEY[(M[,D])]` | `STRING` | | + +{{< note title="Note" >}} + +Decimal handling mode `precise` is not yet supported by `YugabyteDBConnector`. + +{{< /note >}} + +### HSTORE types + +The setting of the YugabyteDB connector configuration property `hstore.handling.mode` determines how the connector maps `HSTORE` values. + +When the `hstore.handling.mode` property is set to json (the default), the connector represents `HSTORE` values as string representations of `JSON` values and encodes them as shown in the following table. When the `hstore.handling.mode` property is set to map, the connector uses the `MAP` schema type for `HSTORE` values. + +| YugabyteDB data type | Literal type (schema type) | Semantic type (schema name) and Notes | +| :----- | :----- | :----- | +| `HSTORE` | `STRING` | `io.debezium.data.Json`

Example: output representation using the JSON converter is `{"key" : "val"}` | +| `HSTORE` | `MAP` | n/a

Example: output representation using the `JSON` converter is `{"key" : "val"}` | + +### Domain types + +YugabyteDB supports user-defined types that are based on other underlying types. When such column types are used, Debezium exposes the column's representation based on the full type hierarchy. + +{{< note title="Note" >}} + +Capturing changes in columns that use YugabyteDB domain types requires special consideration. When a column is defined to contain a domain type that extends one of the default database types and the domain type defines a custom length or scale, the generated schema inherits that defined length or scale. + +When a column is defined to contain a domain type that extends another domain type that defines a custom length or scale, the generated schema does not inherit the defined length or scale because that information is not available in the YugabyteDB driver's column metadata. + +{{< /note >}} + +### Network address types + +YugabyteDB has data types that can store IPv4, IPv6, and MAC addresses. It is better to use these types instead of plain text types to store network addresses. Network address types offer input error checking and specialized operators and functions. + +| YugabyteDB data type | Literal type (schema type) | Semantic type (schema name) and Notes | +| :----- | :----- | :----- | +| `INET` | `STRING` | n/a

IPv4 and IPv6 hosts and networks | +| `CIDR` | `STRING` | n/a

IPv4 and IPv6 networks | +| `MACADDR` | `STRING` | n/a

MAC addresses | +| `MACADDR8` | `STRING` | n/a

MAC addresses in EUI-64 format | + + + + +## Setting up YugabyteDB + +### Setting up permissions + +Setting up a YugabyteDB server to run a Debezium connector requires a database user that can perform replications. Replication can be performed only by a database user that has appropriate permissions and only for a configured number of hosts. + +Although, by default, superusers have the necessary `REPLICATION` and `LOGIN` roles, as mentioned in [Security](#security), it is best not to provide the Debezium replication user with elevated privileges. Instead, create a Debezium user that has the minimum required privileges. + +**Prerequisites:** + +* YugabyteDB administrative permissions. + +**Procedure:** + +To provide a user with replication permissions, define a YugabyteDB role that has at least the `REPLICATION` and `LOGIN` permissions, and then grant that role to the user. For example: + +```sql +CREATE ROLE <name> REPLICATION LOGIN; +``` + +### Setting privileges to enable Debezium to create YugabyteDB publications when you use `pgoutput` or `yboutput` + +If you use `pgoutput` or `yboutput` as the logical decoding plugin, Debezium must operate in the database as a user with specific privileges. + +Debezium streams change events for YugabyteDB source tables from publications that are created for the tables. Publications contain a filtered set of change events that are generated from one or more tables. The data in each publication is filtered based on the publication specification. The specification can be created by the YugabyteDB database administrator or by the Debezium connector. To permit the Debezium connector to create publications and specify the data to replicate to them, the connector must operate with specific privileges in the database. + +There are several options for determining how publications are created. In general, it is best to manually create publications for the tables that you want to capture, before you set up the connector.
However, you can configure your environment in a way that permits Debezium to create publications automatically, and to specify the data that is added to them. + +Debezium uses include list and exclude list properties to specify how data is inserted in the publication. For more information about the options for enabling Debezium to create publications, see `publication.autocreate.mode`. + +For Debezium to create a YugabyteDB publication, it must run as a user that has the following privileges: + +* Replication privileges in the database to add the table to a publication. +* `CREATE` privileges on the database to add publications. +* `SELECT` privileges on the tables to copy the initial table data. Table owners automatically have `SELECT` permission for the table. + +To add tables to a publication, the user must be an owner of the table. But because the source table already exists, you need a mechanism to share ownership with the original owner. To enable shared ownership, create a YugabyteDB replication group, then add the existing table owner and the replication user to the group. + +Procedure + +1. Create a replication group. + + ```sql + CREATE ROLE <replication_group>; + ``` + +2. Add the original owner of the table to the group. + + ```sql + GRANT REPLICATION_GROUP TO <original_owner>; + ``` + +3. Add the Debezium replication user to the group. + + ```sql + GRANT REPLICATION_GROUP TO <replication_user>; + ``` + +4. Transfer ownership of the table to `<replication_group>`. + + ```sql + ALTER TABLE <table_name> OWNER TO REPLICATION_GROUP; + ``` + +For Debezium to specify the capture configuration, the value of `publication.autocreate.mode` must be set to `filtered`. + +### Configuring YugabyteDB to allow replication with the Debezium connector host + +To enable Debezium to replicate YugabyteDB data, you must configure the database to permit replication with the host that runs the YugabyteDB connector.
To specify the clients that are permitted to replicate with the database, add entries to the YugabyteDB host-based authentication file, `ysql_hba.conf`. For more information about the `ysql_hba.conf` file, see the [YugabyteDB documentation](../../../../secure/authentication/host-based-authentication#ysql_hbaconf-file). + +Procedure + +* Add entries to the `ysql_hba.conf` file to specify the Debezium connector hosts that can replicate with the database host. For example: + +```sh +--ysql_hba_conf_csv="local replication <youruser> trust, host replication <youruser> 127.0.0.1/32 trust, host replication <youruser> ::1/128 trust" +``` + +### Supported YugabyteDB topologies + +As mentioned in the beginning, YugabyteDB (for all versions > 2024.1.1) supports logical replication slots. The YugabyteDB connector can communicate with the server by connecting to any node using the [YugabyteDB Java driver](../../../../reference/drivers/java/yugabyte-jdbc-reference). Should any node fail, the connector receives an error and restarts. Upon restart, the connector connects to any available node and continues streaming from that node. + +### Setting up multiple connectors for the same database server + +Debezium uses [replication slots](https://www.postgresql.org/docs/11/logicaldecoding-explanation.html#LOGICALDECODING-REPLICATION-SLOTS) to stream changes from a database. These replication slots maintain the current position in the form of an LSN. This helps YugabyteDB keep the WAL available until it is processed by Debezium. A single replication slot can exist only for a single consumer or process, as different consumers might have different state and may need data from a different position. + +Because a replication slot can only be used by a single connector, it is essential to create a unique replication slot for each Debezium connector.
Note that when a connector is not active, YugabyteDB may allow other connectors to consume the replication slot, which could be dangerous as it may lead to data loss, because a slot emits each change just once. + +In addition to the replication slot, Debezium uses a publication to stream events when using the `pgoutput` or `yboutput` plugin. Similar to the replication slot, the publication is at the database level and is defined for a set of tables. Thus, you'll need a unique publication for each connector, unless the connectors work on the same set of tables. For more information about the options for enabling Debezium to create publications, see `publication.autocreate.mode`. + +See `slot.name` and `publication.name` on how to set a unique replication slot name and publication name for each connector. + +## Deployment + +To deploy the connector, you install the connector archive, configure the connector, and start the connector by adding its configuration to Kafka Connect. + +**Prerequisites** + +* [Zookeeper](https://zookeeper.apache.org/), [Kafka](http://kafka.apache.org/), and [Kafka Connect](https://kafka.apache.org/documentation.html#connect) are installed. +* YugabyteDB is installed and is [set up to run the Debezium connector](#setting-up-yugabytedb). + +**Procedure** + +1. Download the [YugabyteDB connector plugin archive](https://github.com/yugabyte/debezium/releases/tag/dz.2.5.2.yb.2024.1.SNAPSHOT.1). +2. Extract the files into your Kafka Connect environment. +3. Add the directory with the JAR files to the [Kafka Connect `plugin.path`](https://kafka.apache.org/documentation/#connectconfigs). +4. Restart your Kafka Connect process to pick up the new JAR files.
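+
+As an illustration of steps 2 and 3, assuming the archive was extracted to `/kafka/connect/yugabytedb` (a hypothetical path), the Kafka Connect worker configuration would contain an entry such as:
+
+```properties
+# connect-distributed.properties: plugin.path points to the parent
+# directory that contains the extracted connector directory.
+plugin.path=/kafka/connect
+```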
+ +### Creating Kafka topics + +If [auto creation of topics](https://debezium.io/documentation/reference/2.5/configuration/topic-auto-create-config.html) is not enabled in the Kafka Connect cluster, you need to create the following topics manually: + +* A topic for each table, in the format `<topic.prefix>.<schemaName>.<tableName>`. +* A heartbeat topic in the format `<topic.heartbeat.prefix>.<topic.prefix>`. The [topic.heartbeat.prefix](#connector-properties) has a default value of `__debezium-heartbeat`. + +### Connector configuration example + +Following is an example of the configuration for a YugabyteDB connector that connects to a YugabyteDB server on port `5433` at `192.168.99.100`, whose topic prefix is `fulfillment`. Typically, you configure the YugabyteDB connector in a JSON file by setting the configuration properties available for the connector. + +You can choose to produce events for a subset of the schemas and tables in a database. Optionally, you can ignore, mask, or truncate columns that contain sensitive data, are larger than a specified size, or that you do not need. + +```output.json +{ + "name": "fulfillment-connector", --> 1 + "config": { + "connector.class": "io.debezium.connector.postgresql.YugabyteDBConnector", --> 2 + "database.hostname": "192.168.99.100:5433,192.168.1.10:5433,192.168.1.68:5433", --> 3 + "database.port": "5433", --> 4 + "database.user": "postgres", --> 5 + "database.password": "postgres", --> 6 + "database.dbname" : "postgres", --> 7 + "topic.prefix": "fulfillment", --> 8 + "table.include.list": "public.inventory" --> 9 + } +} +``` + +1. The name of the connector when registered with a Kafka Connect service. +2. The name of this YugabyteDB connector class. +3. The addresses of the YugabyteDB YB-TServer nodes. This can take a value of multiple addresses in the format `IP1:PORT1,IP2:PORT2,IP3:PORT3`. +4. The port number of the YugabyteDB server. +5. The name of the YugabyteDB user that has the [required privileges](#setting-up-yugabytedb). +6.
The password for the YugabyteDB user that has the [required privileges](#setting-up-yugabytedb). +7. The name of the YugabyteDB database to connect to. +8. The topic prefix for the YugabyteDB server/cluster, which forms a namespace and is used in all the names of the Kafka topics to which the connector writes, the Kafka Connect schema names, and the namespaces of the corresponding Avro schema when the Avro converter is used. +9. A list of all tables hosted by this server that this connector will monitor. This is optional, and there are other properties for listing the schemas and tables to include or exclude from monitoring. + +See the [complete list of YugabyteDB connector properties](#connector-properties) that can be specified in these configurations. + +You can send this configuration with a `POST` command to a running Kafka Connect service. The service records the configuration and starts one connector task that performs the following actions: + +* Connects to the YugabyteDB database. +* Reads the transaction log. +* Streams change event records to Kafka topics. + +### Adding connector configuration + +To run the connector, create a connector configuration and add the configuration to your Kafka Connect cluster. + +**Prerequisites** + +* [YugabyteDB is configured to support logical replication.](#setting-up-yugabytedb) +* The YugabyteDB connector is installed. + +**Procedure** + +1. Create a configuration for the YugabyteDB connector. +2. Use the [Kafka Connect REST API](https://kafka.apache.org/documentation/#connect_rest) to add that connector configuration to your Kafka Connect cluster. + +#### Results + +After the connector starts, it performs a consistent snapshot of the YugabyteDB server databases that the connector is configured for. The connector then starts generating data change events for row-level operations and streaming change event records to Kafka topics.
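+
+For example, step 2 can be performed with `curl`, assuming the connector configuration is saved in a file named `fulfillment-connector.json` and that the Kafka Connect REST interface is listening on its default port, `8083`:
+
+```sh
+curl -i -X POST -H "Accept: application/json" -H "Content-Type: application/json" \
+  http://localhost:8083/connectors/ -d @fulfillment-connector.json
+```
+
+A successful registration returns `201 Created` along with the stored connector configuration.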
+ +### Connector properties + +The connector has many configuration properties that you can use to achieve the right connector behavior for your application. Many properties have default values. Information about the properties is organized as follows: + +* [Required configuration properties](#required-configuration-properties) +* [Advanced configuration properties](#advanced-configuration-properties) +* [Pass-through configuration properties](#pass-through-configuration-properties) + +The following configuration properties are _required_ unless a default value is available. + +#### Required configuration properties + +| Property | Default value | Description | +| :------- | :------------ | :---------- | +| name | No default | Unique name for the connector. Attempting to register again with the same name will fail. This property is required by all Kafka Connect connectors. | +| connector.class | No default | The name of the Java class for the connector. Always use a value of `io.debezium.connector.postgresql.YugabyteDBConnector` for the YugabyteDB connector. | +| tasks.max | 1 | The maximum number of tasks that should be created for this connector. The YugabyteDB connector always uses a single task and therefore does not use this value, so the default is always acceptable. | +| plugin.name | decoderbufs | The name of the YugabyteDB [logical decoding plugin](../key-concepts/#output-plugin) installed on the YugabyteDB server.
Supported values are `yboutput` and `pgoutput`. | +| slot.name | debezium | The name of the YugabyteDB logical decoding slot that was created for streaming changes from a particular plugin for a particular database/schema. The server uses this slot to stream events to the Debezium connector that you are configuring.
Slot names can contain lower-case letters, numbers, and the underscore character. | +| slot.drop.on.stop | false | Whether or not to delete the logical replication slot when the connector stops in a graceful, expected way. The default behavior is that the replication slot remains configured for the connector when the connector stops. When the connector restarts, having the same replication slot enables the connector to start processing where it left off.
Set to `true` only in testing or development environments. Dropping the slot allows the database to discard WAL segments. When the connector restarts, it performs a new snapshot or it can continue from a persistent offset in the Kafka Connect offsets topic. | +| publication.name | `dbz_publication` | The name of the YugabyteDB publication created for streaming changes when using `pgoutput`.
This publication is created at start-up if it does not already exist and it includes all tables. Debezium then applies its own include/exclude list filtering, if configured, to limit the publication to change events for the specific tables of interest. The connector user must have superuser permissions to create this publication, so it is usually preferable to create the publication before starting the connector for the first time.
If the publication already exists, either for all tables or configured with a subset of tables, Debezium uses the publication as it is defined. | +| database.hostname | No default | IP address or hostname of the YugabyteDB database server. This needs to be in the format `IP1:PORT1,IP2:PORT2,IP3:PORT3` | +| database.port | 5433 | Integer port number of the YugabyteDB database server. | +| database.user | No default | Name of the YugabyteDB database user for connecting to the YugabyteDB database server. | +| database.password | No default | Password to use when connecting to the YugabyteDB database server. | +| database.dbname | No default | The name of the YugabyteDB database from which to stream the changes. | +| topic.prefix | No default | Topic prefix that provides a namespace for the particular YugabyteDB database server or cluster in which Debezium is capturing changes. The prefix should be unique across all other connectors, as it is used as a topic name prefix for all Kafka topics that receive records from this connector. Only alphanumeric characters, hyphens, dots, and underscores must be used in the database server logical name. {{< warning title="Warning" >}} Do not change the value of this property. If you change the name value, after a restart, instead of continuing to emit events to the original topics, the connector emits subsequent events to topics whose names are based on the new value. {{< /warning >}} | +| schema.include.list | No default | An optional, comma-separated list of regular expressions that match names of schemas for which you **want** to capture changes. Any schema name not included in `schema.include.list` is excluded from having its changes captured. By default, all non-system schemas have their changes captured.
To match the name of a schema, Debezium applies the regular expression that you specify as an _anchored_ regular expression. That is, the specified expression is matched against the entire identifier for the schema; it does not match substrings that might be present in a schema name.
If you include this property in the configuration, do not also set the `schema.exclude.list` property. | +| schema.exclude.list | No default | An optional, comma-separated list of regular expressions that match names of schemas for which you **do not** want to capture changes. Any schema whose name is not included in `schema.exclude.list` has its changes captured, with the exception of system schemas.
To match the name of a schema, Debezium applies the regular expression that you specify as an _anchored_ regular expression. That is, the specified expression is matched against the entire identifier for the schema; it does not match substrings that might be present in a schema name.
If you include this property in the configuration, do not set the `schema.include.list` property. | +| table.include.list | No default | An optional, comma-separated list of regular expressions that match fully-qualified table identifiers for tables whose changes you want to capture. When this property is set, the connector captures changes only from the specified tables. Each identifier is of the form `schemaName.tableName`. By default, the connector captures changes in every non-system table in each schema whose changes are being captured.
To match the name of a table, Debezium applies the regular expression that you specify as an anchored regular expression. That is, the specified expression is matched against the entire identifier for the table; it does not match substrings that might be present in a table name.
If you include this property in the configuration, do not also set the `table.exclude.list` property. | +| table.exclude.list | No default | An optional, comma-separated list of regular expressions that match fully-qualified table identifiers for tables whose changes you do not want to capture. Each identifier is of the form `schemaName.tableName`. When this property is set, the connector captures changes from every table that you do not specify.
To match the name of a table, Debezium applies the regular expression that you specify as an _anchored_ regular expression. That is, the specified expression is matched against the entire identifier for the table; it does not match substrings that might be present in a table name.
If you include this property in the configuration, do not set the `table.include.list` property. | +| column.include.list | No default | An optional, comma-separated list of regular expressions that match the fully-qualified names of columns that should be included in change event record values. Fully-qualified names for columns are of the form `schemaName.tableName.columnName`.
To match the name of a column, Debezium applies the regular expression that you specify as an _anchored_ regular expression. That is, the expression is used to match the entire name string of the column; it does not match substrings that might be present in a column name.
If you include this property in the configuration, do not also set the `column.exclude.list` property. | +| column.exclude.list | No default | An optional, comma-separated list of regular expressions that match the fully-qualified names of columns that should be excluded from change event record values. Fully-qualified names for columns are of the form `schemaName.tableName.columnName`.
To match the name of a column, Debezium applies the regular expression that you specify as an _anchored_ regular expression. That is, the expression is used to match the entire name string of the column; it does not match substrings that might be present in a column name.
If you include this property in the configuration, do not set the `column.include.list` property. | +| skip.messages.without.change | false | Specifies whether to skip publishing messages when there is no change in included columns. This would essentially filter messages if there is no change in columns included as per `column.include.list` or `column.exclude.list` properties.
Note: This works only when the `REPLICA IDENTITY` of the table is set to `FULL`. | +| time.precision.mode | adaptive | Time, date, and timestamps can be represented with different kinds of precision:

`adaptive` captures the time and timestamp values exactly as in the database using either millisecond, microsecond, or nanosecond precision values based on the database column's type.

`adaptive_time_microseconds` captures the date, datetime and timestamp values exactly as in the database using either millisecond, microsecond, or nanosecond precision values based on the database column's type. An exception is `TIME` type fields, which are always captured as microseconds.

`connect` always represents time and timestamp values by using Kafka Connect built-in representations for `Time`, `Date`, and `Timestamp`, which use millisecond precision regardless of the database columns' precision. For more information, see [Temporal types](#temporal-types). | +| decimal.handling.mode | precise | Specifies how the connector should handle values for `DECIMAL` and `NUMERIC` columns:

`double` represents values by using double values, which might result in a loss of precision but which is easier to use.

`string` encodes values as formatted strings, which are easy to consume but semantic information about the real type is lost. For more information, see [Decimal types](#decimal-types). | +| interval.handling.mode | numeric | Specifies how the connector should handle values for interval columns:

`numeric` represents intervals using approximate number of microseconds.

`string` represents intervals exactly by using the string pattern representation `P<years>Y<months>M<days>DT<hours>H<minutes>M<seconds>S`. For example: `P1Y2M3DT4H5M6.78S`. For more information, see [Basic types](#basic-types). |

`disable` uses an unencrypted connection.

`allow` attempts to use an unencrypted connection first and, failing that, a secure (encrypted) connection.

`prefer` attempts to use a secure (encrypted) connection first and, failing that, an unencrypted connection.

`require` uses a secure (encrypted) connection, and fails if one cannot be established.

`verify-ca` behaves like `require` but also verifies the server TLS certificate against the configured Certificate Authority (CA) certificates, or fails if no valid matching CA certificates are found.

`verify-full` behaves like `verify-ca` but also verifies that the server certificate matches the host to which the connector is trying to connect. For more information, see the [PostgreSQL documentation](https://www.postgresql.org/docs/11/static/libpq-connect.html). |

`true` - a delete operation is represented by a delete event and a subsequent tombstone event.

`false` - only a delete event is emitted.

After a source record is deleted, emitting a tombstone event (the default behavior) allows Kafka to completely delete all events that pertain to the key of the deleted row in case log compaction is enabled for the topic. | +| column.truncate.to.length.chars | n/a | An optional, comma-separated list of regular expressions that match the fully-qualified names of character-based columns. Set this property if you want to truncate the data in a set of columns when it exceeds the number of characters specified by the length in the property name. Set `length` to a positive integer value, for example, `column.truncate.to.20.chars`.

The fully-qualified name of a column observes the following format: `<schemaName>.<tableName>.<columnName>`. To match the name of a column, Debezium applies the regular expression that you specify as an anchored regular expression. That is, the specified expression is matched against the entire name string of the column; the expression does not match substrings that might be present in a column name.

You can specify multiple properties with different lengths in a single configuration. | +| column.mask.with.length.chars | n/a | An optional, comma-separated list of regular expressions that match the fully-qualified names of character-based columns. Set this property if you want the connector to mask the values for a set of columns, for example, if they contain sensitive data. Set `length` to a positive integer to replace data in the specified columns with the number of asterisk (`*`) characters specified by the length in the property name. Set length to `0` (zero) to replace data in the specified columns with an empty string.

The fully-qualified name of a column observes the following format: schemaName.tableName.columnName. To match the name of a column, Debezium applies the regular expression that you specify as an anchored regular expression. That is, the specified expression is matched against the entire name string of the column; the expression does not match substrings that might be present in a column name.

You can specify multiple properties with different lengths in a single configuration. | +| column.mask.hash.hashAlgorithm.with.salt._salt_;
column.mask.hash.v2.hashAlgorithm.with.salt._salt_ | n/a | An optional, comma-separated list of regular expressions that match the fully-qualified names of character-based columns. Fully-qualified names for columns are of the form `<schemaName>.<tableName>.<columnName>`.
To match the name of a column, Debezium applies the regular expression that you specify as an anchored regular expression. That is, the specified expression is matched against the entire name string of the column; the expression does not match substrings that might be present in a column name. In the resulting change event record, the values for the specified columns are replaced with pseudonyms.
A pseudonym consists of the hashed value that results from applying the specified hashAlgorithm and salt. Based on the hash function that is used, referential integrity is maintained, while column values are replaced with pseudonyms. Supported hash functions are described in the [MessageDigest](https://docs.oracle.com/javase/7/docs/technotes/guides/security/StandardNames.html#MessageDigest) section of the Java Cryptography Architecture Standard Algorithm Name Documentation.

In the following example, `CzQMA0cB5K` is a randomly selected salt.

```column.mask.hash.SHA-256.with.salt.CzQMA0cB5K = inventory.orders.customerName, inventory.shipment.customerName```
If necessary, the pseudonym is automatically shortened to the length of the column. The connector configuration can include multiple properties that specify different hash algorithms and salts.

Depending on the `hashAlgorithm` used, the salt selected, and the actual data set, the resulting data set might not be completely masked.

Hashing strategy version 2 should be used to ensure fidelity if the value is being hashed in different places or systems. | +| column.propagate.source.type | n/a | An optional, comma-separated list of regular expressions that match the fully-qualified names of columns for which you want the connector to emit extra parameters that represent column metadata. When this property is set, the connector adds the following fields to the schema of event records:
  • __debezium.source.column.type
  • __debezium.source.column.length
  • __debezium.source.column.scale
These parameters propagate a column's original type name and length (for variable-width types), respectively.
Enabling the connector to emit this extra data can assist in properly sizing specific numeric or character-based columns in sink databases.
The fully-qualified name of a column observes one of the following formats: `databaseName.tableName.columnName`, or `databaseName.schemaName.tableName.columnName`.
To match the name of a column, Debezium applies the regular expression that you specify as an _anchored_ regular expression. That is, the specified expression is matched against the entire name string of the column; the expression does not match substrings that might be present in a column name. | +| datatype.propagate.source.type | n/a | An optional, comma-separated list of regular expressions that specify the fully-qualified names of data types that are defined for columns in a database. When this property is set, for columns with matching data types, the connector emits event records that include the following extra fields in their schema:
  • __debezium.source.column.type
  • __debezium.source.column.length
  • __debezium.source.column.scale
These parameters propagate a column's original type name and length (for variable-width types), respectively.
Enabling the connector to emit this extra data can assist in properly sizing specific numeric or character-based columns in sink databases.
The fully-qualified name of a column observes one of the following formats: `databaseName.tableName.typeName`, or `databaseName.schemaName.tableName.typeName`.
To match the name of a data type, Debezium applies the regular expression that you specify as an _anchored_ regular expression. That is, the specified expression is matched against the entire name string of the data type; the expression does not match substrings that might be present in a type name.
For the list of YugabyteDB-specific data type names, see [Data type mappings](#data-type-mappings). | +| message.key.columns | _empty string_ | A list of expressions that specify the columns that the connector uses to form custom message keys for change event records that it publishes to the Kafka topics for specified tables.
By default, Debezium uses the primary key column of a table as the message key for records that it emits. In place of the default, or to specify a key for tables that lack a primary key, you can configure custom message keys based on one or more columns.

To establish a custom message key for a table, list the table, followed by the columns to use as the message key. Each list entry takes the following format:

`<fully-qualified_table_name>:<column_name>,<column_name>`

To base a table key on multiple column names, insert commas between the column names.
Each fully-qualified table name is a regular expression in the following format:

`<schemaName>.<tableName>`

The property can include entries for multiple tables. Use a semicolon to separate table entries in the list.

The following example sets the message key for the tables `inventory.customers` and `purchase.orders`:

`inventory.customers:pk1,pk2;(.*).purchaseorders:pk3,pk4`

For the table `inventory.customers`, the columns `pk1` and `pk2` are specified as the message key. For the `purchaseorders` tables in any schema, the columns `pk3` and `pk4` serve as the message key.
There is no limit to the number of columns that you use to create custom message keys. However, it's best to use the minimum number that are required to specify a unique key.
Note that if this property is set and `REPLICA IDENTITY` is set to `DEFAULT` on the tables, tombstone events are not created properly when the key columns are not part of the table's primary key.
Setting `REPLICA IDENTITY` to `FULL` is the only solution. | +| publication.autocreate.mode | `all_tables` | Applies only when streaming changes by using the [pgoutput plugin](https://www.postgresql.org/docs/11/sql-createpublication.html). The setting determines how creation of a [publication](https://www.postgresql.org/docs/11/logical-replication-publication.html) should work. Specify one of the following values:

`all_tables` - If a publication exists, the connector uses it. If a publication does not exist, the connector creates a publication for all tables in the database for which the connector is capturing changes. For the connector to create a publication, it must access the database through a database user account that has permission to create publications and perform replications. You grant the required permission by using the following SQL command: `CREATE PUBLICATION <publication_name> FOR ALL TABLES;`.

`disabled` - The connector does not attempt to create a publication. A database administrator or the user configured to perform replications must have created the publication before running the connector. If the connector cannot find the publication, the connector throws an exception and stops.

`filtered` - If a publication exists, the connector uses it. If no publication exists, the connector creates a new publication for tables that match the current filter configuration as specified by the `schema.include.list`, `schema.exclude.list`, `table.include.list`, and `table.exclude.list` connector configuration properties. For example: `CREATE PUBLICATION <publication_name> FOR TABLE <tbl1, tbl2, tbl3>`. If the publication exists, the connector updates the publication for tables that match the current filter configuration. For example: `ALTER PUBLICATION <publication_name> SET TABLE <tbl1, tbl2, tbl3>`. | +| replica.identity.autoset.values | _empty string_ | The setting determines the value for [replica identity](#replica-identity) at the table level.

This option overwrites the existing value in the database. It is a comma-separated list of regular expressions that match fully-qualified table names, each paired with the replica identity value to use for that table.

Each entry must match the pattern `<fully-qualified table name>:<replica identity>`, where the table name is defined as `SCHEMA_NAME.TABLE_NAME`, and the replica identity values are:

`DEFAULT` - Records the old values of the columns of the primary key, if any. This is the default for non-system tables.

`FULL` - Records the old values of all columns in the row.

`NOTHING` - Records no information about the old row. This is the default for system tables.

For example,

```schema1.*:FULL,schema2.table2:NOTHING,schema2.table3:DEFAULT```

{{< warning title="Warning" >}} Tables in YugabyteDB always use the replica identity that was set at the time of replication slot creation; it cannot be altered at runtime. If the replica identity is altered, the change is only reflected on a new slot created after the alteration. {{< /warning >}} | +| binary.handling.mode | bytes | Specifies how binary (`bytea`) columns should be represented in change events:

`bytes` represents binary data as a byte array.

`base64` represents binary data as base64-encoded strings.

`base64-url-safe` represents binary data as base64-url-safe-encoded strings.

`hex` represents binary data as hex-encoded (base16) strings. | +| schema.name.adjustment.mode | none | Specifies how schema names should be adjusted for compatibility with the message converter used by the connector. Possible settings:

  • `none` does not apply any adjustment.
  • `avro` replaces the characters that cannot be used in the Avro type name with underscore.
  • `avro_unicode` replaces the underscore or characters that cannot be used in the Avro type name with the corresponding Unicode escape, such as `_uxxxx`. Note: `_` is an escape sequence, like backslash in Java.
| +| field.name.adjustment.mode | none | Specifies how field names should be adjusted for compatibility with the message converter used by the connector. Possible settings:

  • `none` does not apply any adjustment.
  • `avro` replaces the characters that cannot be used in the Avro type name with underscore.
  • `avro_unicode` replaces the underscore or characters that cannot be used in the Avro type name with the corresponding Unicode escape, such as `_uxxxx`. Note: `_` is an escape sequence, like backslash in Java.
For more information, see [Avro naming](https://debezium.io/documentation/reference/2.5/configuration/avro.html#avro-naming). | +| money.fraction.digits | 2 | Specifies how many decimal digits should be used when converting Postgres `money` type to `java.math.BigDecimal`, which represents the values in change events. Applicable only when `decimal.handling.mode` is set to `precise`. | + +#### Advanced configuration properties + +The following advanced configuration properties have defaults that work in most situations and therefore rarely need to be specified in the connector's configuration. + +| Property | Default value | Description | +| :------- | :------------ | :---------- | +| converters | No default | Enumerates a comma-separated list of the symbolic names of the custom converter instances that the connector can use. For example,

```isbn```

You must set the converters property to enable the connector to use a custom converter.
For each converter that you configure for a connector, you must also add a `.type` property, which specifies the fully-qualified name of the class that implements the converter interface. The `.type` property uses the following format:
`<converterSymbolicName>.type`
For example,

```isbn.type: io.debezium.test.IsbnConverter```

If you want to further control the behavior of a configured converter, you can add one or more configuration parameters to pass values to the converter. To associate any additional configuration parameter with a converter, prefix the parameter names with the symbolic name of the converter.
For example,

```isbn.schema.name: io.debezium.YugabyteDB.type.Isbn``` | +| snapshot.mode | `initial` | Specifies the criteria for performing a snapshot when the connector starts:

`initial` - The connector performs a snapshot only when no offsets have been recorded for the logical server name.

`never` - The connector never performs snapshots. When a connector is configured this way, its behavior when it starts is as follows. If there is a previously stored LSN in the Kafka offsets topic, the connector continues streaming changes from that position. If no LSN has been stored, the connector starts streaming changes from the point in time when the YugabyteDB logical replication slot was created on the server. The `never` snapshot mode is useful only when you know all data of interest is still reflected in the WAL.

`initial_only` - The connector performs an initial snapshot and then stops, without processing any subsequent changes. | +| snapshot.include.collection.list | All tables included in `table.include.list` | An optional, comma-separated list of regular expressions that match the fully-qualified names (`<schemaName>.<tableName>`) of the tables to include in a snapshot. The specified items must be named in the connector's `table.include.list` property. This property takes effect only if the connector's `snapshot.mode` property is set to a value other than `never`.
This property does not affect the behavior of incremental snapshots.
To match the name of a table, Debezium applies the regular expression that you specify as an _anchored_ regular expression. That is, the specified expression is matched against the entire name string of the table; it does not match substrings that might be present in a table name. | +| event.processing.failure.handling.mode | fail | Specifies how the connector should react to exceptions during processing of events:

`fail` propagates the exception, indicates the offset of the problematic event, and causes the connector to stop.

`warn` logs the offset of the problematic event, skips that event, and continues processing.

`skip` skips the problematic event and continues processing. | +| max.batch.size | 2048 | Positive integer value that specifies the maximum size of each batch of events that the connector processes. | +| max.queue.size | 8192 | Positive integer value that specifies the maximum number of records that the blocking queue can hold. When Debezium reads events streamed from the database, it places the events in the blocking queue before it writes them to Kafka. The blocking queue can provide backpressure for reading change events from the database in cases where the connector ingests messages faster than it can write them to Kafka, or when Kafka becomes unavailable. Events that are held in the queue are disregarded when the connector periodically records offsets. Always set the value of `max.queue.size` to be larger than the value of `max.batch.size`. | +| max.queue.size.in.bytes | 0 | A long integer value that specifies the maximum volume of the blocking queue in bytes. By default, volume limits are not specified for the blocking queue. To specify the number of bytes that the queue can consume, set this property to a positive long value.
If `max.queue.size` is also set, writing to the queue is blocked when the size of the queue reaches the limit specified by either property. For example, if you set `max.queue.size=1000`, and `max.queue.size.in.bytes=5000`, writing to the queue is blocked after the queue contains 1000 records, or after the volume of the records in the queue reaches 5000 bytes. | +| poll.interval.ms | 500 | Positive integer value that specifies the number of milliseconds the connector should wait for new change events to appear before it starts processing a batch of events. Defaults to 500 milliseconds. | +| include.unknown.datatypes | false | Specifies connector behavior when the connector encounters a field whose data type is unknown. The default behavior is that the connector omits the field from the change event and logs a warning.

Set this property to `true` if you want the change event to contain an opaque binary representation of the field. This lets consumers decode the field. You can control the exact representation by setting the [binary handling mode](#connector-properties) property.{{< note title="Note" >}} Consumers risk backward compatibility issues when `include.unknown.datatypes` is set to `true`. Not only may the database-specific binary representation change between releases, but if the data type is eventually supported by Debezium, the data type will be sent downstream in a logical type, which would require adjustments by consumers. In general, when encountering unsupported data types, create a feature request so that support can be added. {{< /note >}} | +| database.initial.statements | No default | A semicolon separated list of SQL statements that the connector executes when it establishes a JDBC connection to the database. To use a semicolon as a character and not as a delimiter, specify two consecutive semicolons, `;;`.

The connector may establish JDBC connections at its own discretion. Consequently, this property is useful for configuration of session parameters only, and not for executing DML statements.

The connector does not execute these statements when it creates a connection for reading the transaction log. | +| status.update.interval.ms | 10000 | Frequency for sending replication connection status updates to the server, given in milliseconds. The property also controls how frequently the database status is checked to detect a dead connection in case the database was shut down. | +| schema.refresh.mode | columns_diff | Specify the conditions that trigger a refresh of the in-memory schema for a table.

`columns_diff` is the safest mode. It ensures that the in-memory schema stays in sync with the database table's schema at all times.

`columns_diff_exclude_unchanged_toast` instructs the connector to refresh the in-memory schema cache if there is a discrepancy with the schema derived from the incoming message, unless unchanged TOASTable data fully accounts for the discrepancy.

This setting can significantly improve connector performance if there are frequently-updated tables that have TOASTed data that are rarely part of updates. However, it is possible for the in-memory schema to become outdated if TOASTable columns are dropped from the table. | +| snapshot.delay.ms | No default | An interval in milliseconds that the connector should wait before performing a snapshot when the connector starts. If you are starting multiple connectors in a cluster, this property is useful for avoiding snapshot interruptions, which might cause re-balancing of connectors. | +| snapshot.fetch.size | 10240 | During a snapshot, the connector reads table content in batches of rows. This property specifies the maximum number of rows in a batch. | +| slot.stream.params | No default | Semicolon separated list of parameters to pass to the configured logical decoding plugin. | +| slot.max.retries | 6 | If connecting to a replication slot fails, this is the maximum number of consecutive attempts to connect. | +| slot.retry.delay.ms | 10000 (10 seconds) | The number of milliseconds to wait between retry attempts when the connector fails to connect to a replication slot. | +| unavailable.value.placeholder | __debezium_unavailable_value | Specifies the constant that the connector provides to indicate that the original value is a toasted value that is not provided by the database. If the setting of `unavailable.value.placeholder` starts with the `hex:` prefix it is expected that the rest of the string represents hexadecimally encoded octets. | +| provide.transaction.metadata | false | Determines whether the connector generates events with transaction boundaries and enriches change event envelopes with transaction metadata. Specify true if you want the connector to do this. For more information, see [Transaction metadata](#transaction-metadata). 
| +| flush.lsn.source | true | Determines whether the connector should commit the LSN of the processed records in the source YugabyteDB database so that the WAL logs can be deleted. Specify `false` if you don't want the connector to do this. Note that if set to `false`, the LSN is not acknowledged by Debezium and, as a result, WAL logs are not cleared, which might result in disk space issues. In that case, you are expected to handle LSN acknowledgement outside Debezium. | +| retriable.restart.connector.wait.ms | 10000 (10 seconds) | The number of milliseconds to wait before restarting a connector after a retriable error occurs. | +| skipped.operations | t | A comma-separated list of operation types that will be skipped during streaming. The operations include: `c` for inserts/create, `u` for updates, `d` for deletes, `t` for truncates, and `none` to not skip any operations. By default, truncate operations are skipped. | +| xmin.fetch.interval.ms | 0 | How often, in milliseconds, the XMIN will be read from the replication slot. The XMIN value provides the lower bounds of where a new replication slot could start from. The default value of `0` disables XMIN tracking. | +| topic.naming.strategy | `io.debezium.schema.SchemaTopicNamingStrategy` | The name of the TopicNamingStrategy class that should be used to determine the topic name for data change, schema change, transaction, and heartbeat events. Defaults to `SchemaTopicNamingStrategy`. | +| topic.delimiter | `.` | Specifies the delimiter for the topic name. Defaults to `.`. | +| topic.cache.size | 10000 | The size of the bounded concurrent hash map that is used to hold topic names. This cache helps determine the topic name corresponding to a given data collection. | +| topic.heartbeat.prefix | `__debezium-heartbeat` | Controls the name of the topic to which the connector sends heartbeat messages. The topic name has this pattern:

`<topic.heartbeat.prefix>.<topic.prefix>`

For example, if the topic prefix is `fulfillment`, the default topic name is `__debezium-heartbeat.fulfillment`. | +| topic.transaction | transaction | Controls the name of the topic to which the connector sends transaction metadata messages. The topic name has this pattern:
`<topic.prefix>.transaction`

For example, if the `topic.prefix` is `fulfillment`, the default topic name is `fulfillment.transaction`. | +| snapshot.max.threads | 1 | Specifies the number of threads that the connector uses when performing an initial snapshot. To enable parallel initial snapshots, set the property to a value greater than 1. In a parallel initial snapshot, the connector processes multiple tables concurrently. This feature is incubating. | +| custom.metric.tags | No default | Accepts key-value pairs that customize the MBean object name. The tags are appended to the end of the regular name; each key represents a tag for the MBean object name, and the corresponding value is the value of that tag. For example: `k1=v1,k2=v2`. | +| errors.max.retries | -1 | The maximum number of retries on retriable errors (for example, connection errors) before failing (-1 = no limit, 0 = disabled, > 0 = number of retries). | + +#### Pass-through configuration properties + +The connector also supports pass-through configuration properties that are used when creating the Kafka producer and consumer. + +Be sure to consult the [Kafka documentation](https://kafka.apache.org/documentation.html) for all of the configuration properties for Kafka producers and consumers. The YugabyteDB connector does use the [new consumer configuration properties](https://kafka.apache.org/documentation.html#consumerconfigs). + +## Monitoring + +The YugabyteDB connector provides two types of metrics in addition to the built-in support for JMX metrics that Zookeeper, Kafka, and Kafka Connect provide: + +* [Snapshot metrics](#snapshot-metrics) provide information about connector operation while performing a snapshot. +* [Streaming metrics](#streaming-metrics) provide information about connector operation when the connector is capturing changes and streaming change event records.
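As an illustrative sketch only (this helper is hypothetical, not part of the connector), the following shows how the connector-metrics MBean object names used for monitoring are composed, assuming the `server=` key carries the connector's `topic.prefix` value and that `custom.metric.tags` pairs (`k1=v1,k2=v2`) are appended to the end of the regular name:

```python
# Hypothetical helper: composes the JMX MBean object name for connector
# metrics. Assumes server=<topic.prefix> and that custom.metric.tags
# key-value pairs are appended at the end of the regular name.

def mbean_name(context, topic_prefix, custom_metric_tags=""):
    """Build a connector-metrics MBean name for the given context
    ("snapshot" or "streaming") and topic prefix."""
    name = (
        "debezium.postgres:type=connector-metrics,"
        f"context={context},server={topic_prefix}"
    )
    if custom_metric_tags:
        # Each comma-separated key=value pair becomes an extra tag.
        tags = [p.strip() for p in custom_metric_tags.split(",") if p.strip()]
        name += "," + ",".join(tags)
    return name

print(mbean_name("streaming", "fulfillment"))
print(mbean_name("snapshot", "fulfillment", "k1=v1,k2=v2"))
```

A monitoring tool would then query this object name over JMX for attributes such as `MilliSecondsBehindSource`.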
+ +[Debezium monitoring documentation](https://debezium.io/documentation/reference/2.5/operations/monitoring.html#monitoring-debezium) provides details for how to expose these metrics by using JMX. + +### Snapshot metrics + +The **MBean** is `debezium.postgres:type=connector-metrics,context=snapshot,server=<topic.prefix>`. + +Snapshot metrics are not exposed unless a snapshot operation is active, or if a snapshot has occurred since the last connector start. + +The following table lists the snapshot metrics that are available. + +| Attributes | Type | Description | +| :--------- | :--- | :---------- | +| `LastEvent` | string | The last snapshot event that the connector has read. | +| `MilliSecondsSinceLastEvent` | long | The number of milliseconds since the connector has read and processed the most recent event. | +| `TotalNumberOfEventsSeen` | long | The total number of events that this connector has seen since last started or reset. | +| `NumberOfEventsFiltered` | long | The number of events that have been filtered by include/exclude list filtering rules configured on the connector. | +| `CapturedTables` | string[] | The list of tables that are captured by the connector. | +| `QueueTotalCapacity` | int | The length of the queue used to pass events between the snapshotter and the main Kafka Connect loop. | +| `QueueRemainingCapacity` | int | The free capacity of the queue used to pass events between the snapshotter and the main Kafka Connect loop. | +| `TotalTableCount` | int | The total number of tables that are being included in the snapshot. | +| `RemainingTableCount` | int | The number of tables that the snapshot has yet to copy. | +| `SnapshotRunning` | boolean | Whether the snapshot was started. | +| `SnapshotPaused` | boolean | Whether the snapshot was paused. | +| `SnapshotAborted` | boolean | Whether the snapshot was aborted. | +| `SnapshotCompleted` | boolean | Whether the snapshot completed.
| `SnapshotDurationInSeconds` | long | The total number of seconds that the snapshot has taken so far, even if not complete. This also includes the time when the snapshot was paused. | +| `SnapshotPausedDurationInSeconds` | long | The total number of seconds that the snapshot was paused. If the snapshot was paused several times, the paused time adds up. | +| `RowsScanned` | Map | Map containing the number of rows scanned for each table in the snapshot. Tables are incrementally added to the Map during processing. Updates every 10,000 rows scanned and upon completing a table. | +| `MaxQueueSizeInBytes` | long | The maximum buffer of the queue in bytes. This metric is available if `max.queue.size.in.bytes` is set to a positive long value. | +| `CurrentQueueSizeInBytes` | long | The current volume, in bytes, of records in the queue. | + +### Streaming metrics + +The **MBean** is `debezium.postgres:type=connector-metrics,context=streaming,server=<topic.prefix>`. + +The following table lists the streaming metrics that are available. + +| Attributes | Type | Description | +| :--------- | :--- | :---------- | +| `LastEvent` | string | The last streaming event that the connector has read. | +| `MilliSecondsSinceLastEvent` | long | The number of milliseconds since the connector has read and processed the most recent event. | +| `TotalNumberOfEventsSeen` | long | The total number of events that this connector has seen since the last start or metrics reset. | +| `TotalNumberOfCreateEventsSeen` | long | The total number of create events that this connector has seen since the last start or metrics reset. | +| `TotalNumberOfUpdateEventsSeen` | long | The total number of update events that this connector has seen since the last start or metrics reset. | +| `TotalNumberOfDeleteEventsSeen` | long | The total number of delete events that this connector has seen since the last start or metrics reset.
| +| `NumberOfEventsFiltered` | long | The number of events that have been filtered by include/exclude list filtering rules configured on the connector. | +| `CapturedTables` | string[] | The list of tables that are captured by the connector. | +| `QueueTotalCapacity` | int | The length of the queue used to pass events between the streamer and the main Kafka Connect loop. | +| `QueueRemainingCapacity` | int | The free capacity of the queue used to pass events between the streamer and the main Kafka Connect loop. | +| `Connected` | boolean | Flag that denotes whether the connector is currently connected to the database server. | +| `MilliSecondsBehindSource` | long | The number of milliseconds between the last change event's timestamp and the connector processing it. The values will incorporate any differences between the clocks on the machines where the database server and the connector are running. | +| `NumberOfCommittedTransactions` | long | The number of processed transactions that were committed. | +| `SourceEventPosition` | Map | The coordinates of the last received event. | +| `LastTransactionId` | string | Transaction identifier of the last processed transaction. | +| `MaxQueueSizeInBytes` | long | The maximum buffer of the queue in bytes. This metric is available if `max.queue.size.in.bytes` is set to a positive long value. | +| `CurrentQueueSizeInBytes` | long | The current volume, in bytes, of records in the queue. | + +## Behavior when things go wrong + +Debezium is a distributed system that captures all changes in multiple upstream databases; it never misses or loses an event. When the system is operating normally or being managed carefully, Debezium provides _exactly once_ delivery of every change event record. If a fault does happen, the system does not lose any events. However, while it is recovering from the fault, it's possible that the connector might emit some duplicate change events.
In these abnormal situations, Debezium, like Kafka, provides _at least once_ delivery of change events. + +The rest of this section describes how Debezium handles various kinds of faults and problems. + +### Configuration and startup errors + +In the following situations, the connector fails when trying to start, reports an error/exception in the log, and stops running: + +* The connector's configuration is invalid. +* The connector cannot successfully connect to YugabyteDB by using the specified connection parameters. +* The connector is restarting from a previously-recorded LSN and YugabyteDB no longer has that history available. + +In these cases, the error message has details about the problem and possibly a suggested workaround. After you correct the configuration or address the YugabyteDB problem, restart the connector. + +### YB-TServer becomes unavailable + +When the connector is running, the YB-TServer that it is connected to could become unavailable for any number of reasons. If this happens, the connector fails with an error and retries to connect to the YugabyteDB server. Because the connector uses the [YugabyteDB Java driver](../../../../drivers-orms/java/), the connection is handled internally and the connector restores the connection to another running node. + +The YugabyteDB connector externally stores the last processed offset in the form of a YugabyteDB LSN. After a connector restarts and connects to a server instance, the connector communicates with the server to continue streaming from that particular offset. This offset is available as long as the Debezium replication slot remains intact. + +{{< warning title="Warning" >}} + +Never drop a replication slot on the server or you will lose data. + +{{< /warning >}} + +### Cluster failures + +When the connector is running, it is possible that the YugabyteDB server becomes unavailable for any number of reasons. 
If that happens, the connector fails with an error and initiates retries, but because the entire YugabyteDB server is unavailable, all retries will fail. + +When the YugabyteDB server is back up, restart the connector to continue streaming where it left off. + +### Kafka Connect process stops gracefully + +Suppose that Kafka Connect is being run in distributed mode and a Kafka Connect process is stopped gracefully. Prior to shutting down that process, Kafka Connect migrates the process's connector tasks to another Kafka Connect process in that group. The new connector tasks start processing exactly where the prior tasks stopped. There is a short delay in processing while the connector tasks are stopped gracefully and restarted on the new processes. + +### Kafka Connect process crashes + +If the Kafka Connect process stops unexpectedly, any connector tasks it was running terminate without recording their most recently processed offsets. When Kafka Connect is being run in distributed mode, Kafka Connect restarts those connector tasks on other processes. However, YugabyteDB connectors resume from the last offset that was recorded by the earlier processes. This means that the new replacement tasks might generate some of the same change events that were processed just prior to the crash. The number of duplicate events depends on the offset flush period and the volume of data changes just before the crash. + +Because there is a chance that some events might be duplicated during a recovery from failure, consumers should always anticipate some duplicate events. Debezium changes are idempotent, so a sequence of events always results in the same state. + +In each change event record, Debezium connectors insert source-specific information about the origin of the event, including the YugabyteDB server's time of the event, the ID of the server transaction, and the position in the write-ahead log where the transaction changes were written.
Consumers can keep track of this information, especially the LSN, to determine whether an event is a duplicate. + +### Kafka becomes unavailable + +As the connector generates change events, the Kafka Connect framework records those events in Kafka by using the Kafka producer API. Periodically, at a frequency that you specify in the Kafka Connect configuration, Kafka Connect records the latest offset that appears in those change events. If the Kafka brokers become unavailable, the Kafka Connect process that is running the connectors repeatedly tries to reconnect to the Kafka brokers. In other words, the connector tasks pause until a connection can be re-established, at which point the connectors resume exactly where they left off. + +### Connector is stopped for a duration + +If the connector is gracefully stopped, the database can continue to be used. Any changes are recorded in the YugabyteDB WAL. When the connector restarts, it resumes streaming changes where it left off. That is, it generates change event records for all database changes that were made while the connector was stopped. + +A properly configured Kafka cluster is able to handle massive throughput. Kafka Connect is written according to Kafka best practices, and given enough resources a Kafka Connect connector can also handle very large numbers of database change events. Because of this, after being stopped for a while, when a Debezium connector restarts, it is very likely to catch up with the database changes that were made while it was stopped. How quickly this happens depends on the capabilities and performance of Kafka and the volume of changes being made to the data in YugabyteDB. 
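The LSN-based duplicate detection described in this section can be sketched as follows. This is illustrative only: the event dictionaries assume a simplified version of the change event envelope, in which the `source` block carries the table name and LSN described above.

```python
# Illustrative consumer-side deduplication using the source LSN.
# Assumes each event's "source" block carries the origin table and LSN,
# as described in the change event record documentation above.

def is_duplicate(event, last_lsn_by_table):
    """Return True if this event was already processed for its table."""
    source = event["source"]
    table = (source["schema"], source["table"])
    lsn = source["lsn"]
    last = last_lsn_by_table.get(table)
    if last is not None and lsn <= last:
        return True  # already seen: a replay after a crash or restart
    last_lsn_by_table[table] = lsn
    return False

# Example: the second delivery of LSN 100 is skipped as a duplicate.
seen = {}
events = [
    {"source": {"schema": "public", "table": "orders", "lsn": 100}},
    {"source": {"schema": "public", "table": "orders", "lsn": 100}},  # redelivery
    {"source": {"schema": "public", "table": "orders", "lsn": 105}},
]
processed = [e for e in events if not is_duplicate(e, seen)]
```

A real consumer would persist the last-processed LSN alongside its own output (for example, in the sink database) so the check survives consumer restarts.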
diff --git a/docs/content/stable/explore/change-data-capture/using-yugabytedb-grpc-replication/_index.md b/docs/content/stable/explore/change-data-capture/using-yugabytedb-grpc-replication/_index.md new file mode 100644 index 000000000000..67b084ea7360 --- /dev/null +++ b/docs/content/stable/explore/change-data-capture/using-yugabytedb-grpc-replication/_index.md @@ -0,0 +1,80 @@ +--- +title: CDC using YugabyteDB gRPC replication protocol +headerTitle: CDC using gRPC replication protocol +linkTitle: gRPC protocol +description: CDC using YugabyteDB gRPC replication protocol. +headcontent: Capture changes made to data in the database +cascade: + earlyAccess: /preview/releases/versioning/#feature-maturity +menu: + stable: + identifier: explore-change-data-capture-grpc-replication + parent: explore-change-data-capture + weight: 280 +type: indexpage +showRightNav: true +--- + +YugabyteDB CDC captures changes made to data in the database and streams those changes to external processes, applications, or other databases. CDC allows you to track and propagate changes in a YugabyteDB database to downstream consumers based on its Write-Ahead Log (WAL). YugabyteDB CDC uses Debezium to capture row-level changes resulting from INSERT, UPDATE, and DELETE operations in the upstream database, and publishes them as events to Kafka using Kafka Connect-compatible connectors. + +![What is CDC](/images/explore/cdc-overview-work.png) + + + +## Get started + +Get started with Yugabyte gRPC replication. + +For tutorials on streaming data to Kafka environments, including Amazon MSK, Azure Event Hubs, and Confluent Cloud, see [Kafka environments](/preview/tutorials/cdc-tutorials/). + +{{}} +To learn how get started with the connector, see [Get started](./cdc-get-started). +{{}} + +## Monitoring + +You can monitor the activities and status of the deployed connectors using the http end points provided by YugabyteDB. 
+ +{{}} +To learn more about monitoring your CDC setup, see [Monitor](./cdc-monitor). +{{}} + +## YugabyteDB gRPC Connector + +To capture and stream changes in YugabyteDB to an external system, you need a connector that can read the changes in YugabyteDB and stream them out. For this, you can use the YugabyteDB gRPC connector, which is based on the Debezium platform. The connector is deployed as a set of Kafka Connect-compatible connectors, so you first need to define a YugabyteDB connector configuration and then start the connector by adding it to Kafka Connect. + +{{}} +To understand the features and configuration of the connector, see [YugabyteDB gRPC Connector](./debezium-connector-yugabytedb). +{{}} + +## Known limitations + +* A single stream can be used to stream data from one namespace only. +* The table you want to stream changes from must have a primary key. +* CDC is not supported on a target table for xCluster replication [11829](https://github.com/yugabyte/yugabyte-db/issues/11829). +* Schema evolution is not currently supported for changes that require table rewrites (for example, ALTER TYPE). +* YCQL tables aren't currently supported. Issue [11320](https://github.com/yugabyte/yugabyte-db/issues/11320). + +In addition, CDC support for the following features will be added in upcoming releases: + +* Support for point-in-time recovery (PITR) is tracked in issue [10938](https://github.com/yugabyte/yugabyte-db/issues/10938). +* Support for transaction savepoints is tracked in issue [10936](https://github.com/yugabyte/yugabyte-db/issues/10936). +* Support for enabling CDC on Read Replicas is tracked in issue [11116](https://github.com/yugabyte/yugabyte-db/issues/11116). +* Support for schema evolution with before image is tracked in issue [15197](https://github.com/yugabyte/yugabyte-db/issues/15197).
+ +## Learn more + +* [Examples of CDC usage and patterns](https://github.com/yugabyte/cdc-examples/tree/main) {{}} +* [Tutorials to deploy in different Kafka environments](/preview/tutorials/cdc-tutorials/) {{}} +* [Data Streaming Using YugabyteDB CDC, Kafka, and SnowflakeSinkConnector](https://www.yugabyte.com/blog/data-streaming-using-yugabytedb-cdc-kafka-and-snowflakesinkconnector/) {{}} +* [Unlock Azure Storage Options With YugabyteDB CDC](https://www.yugabyte.com/blog/unlocking-azure-storage-options-with-yugabytedb-cdc/) {{}} +* [Change Data Capture From YugabyteDB to Elasticsearch](https://www.yugabyte.com/blog/change-data-capture-cdc-yugabytedb-elasticsearch/) {{}} +* [Snowflake CDC: Publishing Data Using Amazon S3 and YugabyteDB](https://www.yugabyte.com/blog/snowflake-cdc-publish-data-using-amazon-s3-yugabytedb/) {{}} +* [Streaming Changes From YugabyteDB to Downstream Databases](https://www.yugabyte.com/blog/streaming-changes-yugabytedb-cdc-downstream-databases/) {{}} +* [Change Data Capture from YugabyteDB CDC to ClickHouse](https://www.yugabyte.com/blog/change-data-capture-cdc-yugabytedb-clickhouse/) {{}} +* [How to Run Debezium Server with Kafka as a Sink](https://www.yugabyte.com/blog/change-data-capture-cdc-run-debezium-server-kafka-sink/) {{}} +* [Change Data Capture Using a Spring Data Processing Pipeline](https://www.yugabyte.com/blog/change-data-capture-cdc-spring-data-processing-pipeline/) {{}} diff --git a/docs/content/stable/explore/change-data-capture/cdc-get-started.md b/docs/content/stable/explore/change-data-capture/using-yugabytedb-grpc-replication/cdc-get-started.md similarity index 71% rename from docs/content/stable/explore/change-data-capture/cdc-get-started.md rename to docs/content/stable/explore/change-data-capture/using-yugabytedb-grpc-replication/cdc-get-started.md index 4ad55e1b3e32..6826b04a017e 100644 --- a/docs/content/stable/explore/change-data-capture/cdc-get-started.md +++ 
b/docs/content/stable/explore/change-data-capture/using-yugabytedb-grpc-replication/cdc-get-started.md @@ -6,39 +6,40 @@ description: Get started with Change Data Capture in YugabyteDB. headcontent: Get set up for using CDC in YugabyteDB menu: stable: - parent: explore-change-data-capture + parent: explore-change-data-capture-grpc-replication identifier: cdc-get-started - weight: 30 + weight: 10 type: docs --- -To stream data change events from YugabyteDB databases, you need to use Debezium YugabyteDB connector. To deploy a Debezium YugabyteDB connector, you install the Debezium YugabyteDB connector archive, configure the connector, and start the connector by adding its configuration to Kafka Connect. You can download the connector from [GitHub releases](https://github.com/yugabyte/debezium-connector-yugabytedb/releases). The connector supports Kafka Connect version 2.x and later, and for YugabyteDB, it supports version 2.14 and later. For more connector configuration details and complete steps, refer to [Debezium connector](../debezium-connector-yugabytedb/). - -## Ordering guarantees - -|Ordering guarantee| Description| -|----------| ----------------------------| -|Per-tablet ordered delivery guarantee|All changes for a row (or rows in the same tablet) are received in the order in which they happened. However, due to the distributed nature of the problem, there is no guarantee of the order across tablets.| -|At least once delivery|Updates for rows are streamed at least once. This can happen in the case of Kafka Connect Node failure. If the Kafka Connect Node pushes the records to Kafka and crashes before committing the offset, on restart, it will again get the same set of records.| -|No gaps in change stream|Note that after you have received a change for a row for some timestamp `t`, you won't receive a previously unseen change for that row at a lower timestamp. 
Receiving any change implies that you have received _all older changes_ for that row.| - ## Set up YugabyteDB for CDC -The following steps are necessary to set up YugabyteDB for use with the Debezium YugabyteDB connector: +The following steps are necessary to set up YugabyteDB for use with the YugabyteDB gRPC connector: - Create a DB stream ID. - Before you use the YugabyteDB connector to retrieve data change events from YugabyteDB, create a stream ID using the yb-admin CLI command. Refer to the [yb-admin](../../../admin/yb-admin/#change-data-capture-cdc-commands) CDC command reference documentation for more details. + Before you use the YugabyteDB connector to retrieve data change events from YugabyteDB, create a stream ID using the yb-admin CLI command. Refer to the [yb-admin](../../../../admin/yb-admin/#change-data-capture-cdc-commands) CDC command reference documentation for more details. - Make sure the YB-Master and YB-TServer ports are open. - The connector connects to the YB-Master and YB-TServer processes running on the YugabyteDB server. Make sure the ports on which these processes are running are open. The [default ports](../../../reference/configuration/default-ports/) on which the processes run are `7100` and `9100` respectively. + The connector connects to the YB-Master and YB-TServer processes running on the YugabyteDB server. Make sure the ports on which these processes are running are open. The [default ports](../../../../reference/configuration/default-ports/) on which the processes run are `7100` and `9100` respectively. - Monitor available disk space. The change records for CDC are read from the WAL. YugabyteDB CDC maintains checkpoints internally for each DB stream ID and garbage collects the WAL entries if those have been streamed to the CDC clients. - In case CDC is lagging or away for some time, the disk usage may grow and cause YugabyteDB cluster instability. 
To avoid this scenario, if a stream is inactive for a configured amount of time, the WAL is garbage collected. This is configurable using a [YB-TServer flag](../../../reference/configuration/yb-tserver/#change-data-capture-cdc-flags). + If CDC is lagging or inactive for some time, the disk usage may grow and cause YugabyteDB cluster instability. To avoid this scenario, if a stream is inactive for a configured amount of time, the WAL is garbage collected. This is configurable using a [YB-TServer flag](../../../../reference/configuration/yb-tserver/#change-data-capture-cdc-flags). + +## Deploy the YugabyteDB gRPC Connector + +To stream data change events from YugabyteDB databases, follow these steps to deploy the YugabyteDB gRPC Connector: + +- Download the connector: You can download the connector from the [GitHub releases](https://github.com/yugabyte/debezium-connector-yugabytedb/releases) page. +- Install the connector: Extract and install the connector archive in your Kafka Connect environment. +- Configure the connector: Modify the connector configuration to suit your specific requirements. +- Start the connector: Add the connector's configuration to Kafka Connect and start the connector. + +For more details on connector configuration and deployment steps, refer to the [YugabyteDB gRPC Connector documentation](../debezium-connector-yugabytedb/). ## Serialization @@ -63,7 +64,7 @@ The YugabyteDB source connector also supports AVRO serialization with schema reg {{% tab header="JSON" lang="json" %}} -For JSON schema serialization, you can use the [Kafka JSON Serializer](https://mvnrepository.com/artifact/io.confluent/kafka-json-serializer) and equivalent deserializer. After downloading and including the required `JAR` file in the Kafka-Connect environment, you can directly configure the CDC source and sink connectors to use this converter.
+For JSON schema serialization, you can use the [Kafka JSON Serializer](https://mvnrepository.com/artifact/io.confluent/kafka-json-serializer) and equivalent de-serializer. After downloading and including the required `JAR` file in the Kafka-Connect environment, you can directly configure the CDC source and sink connectors to use this converter. For source connectors: @@ -110,9 +111,9 @@ To use the [protobuf](http://protobuf.dev) format for the serialization/de-seria Before image refers to the state of the row _before_ the change event occurred. The YugabyteDB connector sends the before image of the row when it will be configured using a stream ID enabled with before image. It is populated for UPDATE and DELETE events. For INSERT events, before image doesn't make sense as the change record itself is in the context of new row insertion. -Yugabyte uses multi-version concurrency control (MVCC) mechanism, and compacts data at regular intervals. The compaction or the history retention is controlled by the [history retention interval flag](../../../reference/configuration/yb-tserver/#timestamp-history-retention-interval-sec). However, when before image is enabled for a database, YugabyteDB adjusts the history retention for that database based on the most lagging active CDC stream so that the previous row state is retained, and available. Consequently, in the case of a lagging CDC stream, the amount of space required for the database grows as more data is retained. On the other hand, older rows that are not needed for any of the active CDC streams are identified and garbage collected. +Yugabyte uses multi-version concurrency control (MVCC) mechanism, and compacts data at regular intervals. The compaction or the history retention is controlled by the [history retention interval flag](../../../../reference/configuration/yb-tserver/#timestamp-history-retention-interval-sec). 
However, when before image is enabled for a database, YugabyteDB adjusts the history retention for that database based on the most lagging active CDC stream so that the previous row state is retained, and available. Consequently, in the case of a lagging CDC stream, the amount of space required for the database grows as more data is retained. On the other hand, older rows that are not needed for any of the active CDC streams are identified and garbage collected. -Schema version that is currently being used by a CDC stream will be used to frame before and current row images. The before image functionality is disabled by default unless it is specifically turned on during the CDC stream creation. The [yb-admin](../../../admin/yb-admin/#enabling-before-image) `create_change_data_stream` command can be used to create a CDC stream with before image enabled. +Schema version that is currently being used by a CDC stream will be used to frame before and current row images. The before image functionality is disabled by default unless it is specifically turned on during the CDC stream creation. The [yb-admin](../../../../admin/yb-admin/#enabling-before-image) `create_change_data_stream` command can be used to create a CDC stream with before image enabled. {{< tip title="Use transformers" >}} @@ -501,11 +502,11 @@ CDC record for UPDATE (using schema version 1): ## Colocated tables -YugabyteDB supports streaming of changes from [colocated tables](../../architecture/docdb-sharding/colocated-tables). The connector can be configured with regular configuration properties and deployed for streaming. +YugabyteDB supports streaming of changes from [colocated tables](../../../../architecture/docdb-sharding/colocated-tables). The connector can be configured with regular configuration properties and deployed for streaming. 
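For illustration, a Kafka Connect registration payload for such a connector might look like the following sketch; the addresses, credentials, stream ID, and table name are placeholders, and property names should be verified against the connector reference before use:

```json
{
  "name": "ybconnector",
  "config": {
    "connector.class": "io.debezium.connector.yugabytedb.YugabyteDBConnector",
    "database.hostname": "127.0.0.1",
    "database.port": "5433",
    "database.master.addresses": "127.0.0.1:7100",
    "database.user": "yugabyte",
    "database.password": "yugabyte",
    "database.dbname": "yugabyte",
    "database.server.name": "dbserver1",
    "database.streamid": "<stream ID from yb-admin>",
    "table.include.list": "public.users"
  }
}
```

A payload like this is typically POSTed to the Kafka Connect REST interface (`/connectors`) to start the connector.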
{{< note title="Note" >}} -If a connector is already streaming a set of colocated tables from a database and if a new table is created in the same database, you can't deploy a new connector for the newly created table. +If a connector is already streaming a set of colocated tables from a database and if a new table is created in the same database, you cannot deploy a new connector for this newly created table. To stream the changes for the new table, delete the existing connector and deploy it again with the updated configuration property after adding the new table to `table.include.list`. @@ -513,15 +514,15 @@ To stream the changes for the new table, delete the existing connector and deplo ## Important configuration settings -You can use several flags to fine-tune YugabyteDB's CDC behavior. These flags are documented in the [Change data capture flags](../../../reference/configuration/yb-tserver/#change-data-capture-cdc-flags) section of the YB-TServer reference and [Change data capture flags](../../../reference/configuration/yb-master/#change-data-capture-cdc-flags) section of the YB-Master reference. The following flags are particularly important for configuring CDC: +You can use several flags to fine-tune YugabyteDB's CDC behavior. These flags are documented in the [Change data capture flags](../../../../reference/configuration/yb-tserver/#change-data-capture-cdc-flags) section of the YB-TServer reference and [Change data capture flags](../../../../reference/configuration/yb-master/#change-data-capture-cdc-flags) section of the YB-Master reference. The following flags are particularly important for configuring CDC: -- [cdc_intent_retention_ms](../../../reference/configuration/yb-tserver/#cdc-intent-retention-ms) - Controls retention of intents, in ms. If a request for change records is not received for this interval, un-streamed intents are garbage collected and the CDC stream is considered expired. 
This expiry is not reversible, and the only course of action would be to create a new CDC stream. The default value of this flag is 4 hours (4 x 3600 x 1000 ms). +- [cdc_intent_retention_ms](../../../../reference/configuration/yb-tserver/#cdc-intent-retention-ms) - Controls retention of intents, in ms. If a request for change records is not received for this interval, un-streamed intents are garbage collected and the CDC stream is considered expired. This expiry is not reversible, and the only course of action would be to create a new CDC stream. The default value of this flag is 4 hours (4 x 3600 x 1000 ms). -- [cdc_wal_retention_time_secs](../../../reference/configuration/yb-master/#cdc-wal-retention-time-secs) - Controls how long WAL is retained, in seconds. This is irrespective of whether a request for change records is received or not. The default value of this flag is 4 hours (14400 seconds). +- [cdc_wal_retention_time_secs](../../../../reference/configuration/yb-master/#cdc-wal-retention-time-secs) - Controls how long WAL is retained, in seconds. This is irrespective of whether a request for change records is received or not. The default value of this flag is 4 hours (14400 seconds). -- [cdc_snapshot_batch_size](../../../reference/configuration/yb-tserver/#cdc-snapshot-batch-size) - This flag's default value is 250 records included per batch in response to an internal call to get the snapshot. If the table contains a very large amount of data, you may need to increase this value to reduce the amount of time it takes to stream the complete snapshot. You can also choose not to take a snapshot by modifying the [Debezium](../debezium-connector-yugabytedb/) configuration. +- [cdc_snapshot_batch_size](../../../../reference/configuration/yb-tserver/#cdc-snapshot-batch-size) - This flag's default value is 250 records included per batch in response to an internal call to get the snapshot. 
If the table contains a very large amount of data, you may need to increase this value to reduce the amount of time it takes to stream the complete snapshot. You can also choose not to take a snapshot by modifying the [Debezium](../debezium-connector-yugabytedb/) configuration. -- [cdc_max_stream_intent_records](../../../reference/configuration/yb-tserver/#cdc-max-stream-intent-records) - Controls how many intent records can be streamed in a single `GetChanges` call. Essentially, intents of large transactions are broken down into batches of size equal to this flag, hence this controls how many batches of `GetChanges` calls are needed to stream the entire large transaction. The default value of this flag is 1680, and transactions with intents less than this value are streamed in a single batch. The value of this flag can be increased, if the workload has larger transactions and CDC throughput needs to be increased. Note that high values of this flag can increase the latency of each `GetChanges` call. +- [cdc_max_stream_intent_records](../../../../reference/configuration/yb-tserver/#cdc-max-stream-intent-records) - Controls how many intent records can be streamed in a single `GetChanges` call. Essentially, intents of large transactions are broken down into batches of size equal to this flag, hence this controls how many batches of `GetChanges` calls are needed to stream the entire large transaction. The default value of this flag is 1680, and transactions with intents less than this value are streamed in a single batch. The value of this flag can be increased, if the workload has larger transactions and CDC throughput needs to be increased. Note that high values of this flag can increase the latency of each `GetChanges` call. 
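As a configuration sketch, these flags are passed to each YB-TServer at startup like any other server flag; the data directory and flag values below are purely illustrative and should be tuned to your workload:

```sh
# Illustrative only: raise intent retention to 8 hours and allow larger
# intent batches per GetChanges call on this YB-TServer.
./bin/yb-tserver \
  --fs_data_dirs=/mnt/d0 \
  --cdc_intent_retention_ms=28800000 \
  --cdc_max_stream_intent_records=5000
```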
## Retaining data for longer durations @@ -535,12 +536,12 @@ Longer values of `cdc_intent_retention_ms`, coupled with longer CDC lags (period ## Content-based routing -By default, the Yugabyte Debezium connector streams all of the change events that it reads from a table to a single static topic. However, you may want to re-route the events into different Kafka topics based on the event's content. You can do this using the Debezium `ContentBasedRouter`. But first, two additional dependencies need to be placed in the Kafka-Connect environment. These are not included in the official *yugabyte-debezium-connector* for security reasons. These dependencies are: +By default, the connector streams all of the change events that it reads from a table to a single static topic. However, you may want to re-route the events into different Kafka topics based on the event's content. You can do this using the Debezium `ContentBasedRouter`. But first, two additional dependencies need to be placed in the Kafka-Connect environment. These are not included in the official _yugabyte-debezium-connector_ for security reasons. These dependencies are: - Debezium routing SMT (Single Message Transform) - Groovy JSR223 implementation (or other scripting languages that integrate with [JSR 223](https://jcp.org/en/jsr/detail?id=223)) -To get started, you can rebuild the *yugabyte-debezium-connector* image including these dependencies. Here's what the Dockerfile would look like: +To get started, you can rebuild the _yugabyte-debezium-connector_ image including these dependencies. The following shows what the Dockerfile would look like: ```Dockerfile FROM quay.io/yugabyte/debezium-connector:latest @@ -571,6 +572,6 @@ The `` contains the logic for routing of the events. For exa value.after != null ? (value.after?.country?.value == '\''UK'\'' ? '\''uk_users'\'' : null) : (value.before?.country?.value == '\''UK'\'' ? 
'\''uk_users'\'' : null)" ``` -This expression checks if the value of the row after the operation has the country set to "UK". If *yes* then the expression returns "uk_users." If *no*, it returns *null*, and in case the row after the operation is *null* (for example, in a "delete" operation), the expression also checks for the same condition on row values before the operation. The value that is returned determines which new Kafka Topic will receive the re-routed event. If it returns *null*, the event is sent to the default topic. +This expression checks if the value of the row after the operation has the country set to `UK`. If _yes_, then the expression returns `uk_users`. If _no_, it returns _null_, and in case the row after the operation is _null_ (for example, in a "delete" operation), the expression also checks for the same condition on row values before the operation. The value that is returned determines which new Kafka Topic will receive the re-routed event. If it returns _null_, the event is sent to the default topic. For more advanced routing configuration, refer to the [Debezium documentation](https://debezium.io/documentation/reference/stable/transformations/content-based-routing.html) on content-based routing. diff --git a/docs/content/stable/explore/change-data-capture/cdc-monitor.md b/docs/content/stable/explore/change-data-capture/using-yugabytedb-grpc-replication/cdc-monitor.md similarity index 95% rename from docs/content/stable/explore/change-data-capture/cdc-monitor.md rename to docs/content/stable/explore/change-data-capture/using-yugabytedb-grpc-replication/cdc-monitor.md index 58c432965d42..e77e69164c30 100644 --- a/docs/content/stable/explore/change-data-capture/cdc-monitor.md +++ b/docs/content/stable/explore/change-data-capture/using-yugabytedb-grpc-replication/cdc-monitor.md @@ -6,9 +6,9 @@ description: Monitor Change Data Capture in YugabyteDB. 
headcontent: Monitor deployed CDC connectors menu: stable: - parent: explore-change-data-capture + parent: explore-change-data-capture-grpc-replication identifier: cdc-monitor - weight: 60 + weight: 20 type: docs --- @@ -102,7 +102,7 @@ The following streaming metrics are available: | QueueTotalCapacity | `int` | The length the queue used to pass events between the streamer and the main Kafka Connect loop. | | QueueRemainingCapacity | `int` | The free capacity of the queue used to pass events between the streamer and the main Kafka Connect loop. | | Connected | `boolean` | Indicates whether the connector is currently connected to the database server. | -| MilliSecondsBehindSource | `long` | The number of milliseconds between the last change event’s timestamp and when the connector processed it. The value incorporates any differences between the clocks on the machines where the database server and the connector are running. | +| MilliSecondsBehindSource | `long` | The number of milliseconds between the last change event's timestamp and when the connector processed it. The value incorporates any differences between the clocks on the machines where the database server and the connector are running. | | SourceEventPosition | `Map` | The coordinates of the last received event. | | LastTransactionId | `string` | Transaction identifier of the last processed transaction. | | MaxQueueSizeInBytes | `long` | The maximum buffer of the queue in bytes. This metric is available if `max.queue.size.in.bytes` is set to a positive long value. 
| diff --git a/docs/content/stable/explore/change-data-capture/debezium-connector-yugabytedb.md b/docs/content/stable/explore/change-data-capture/using-yugabytedb-grpc-replication/debezium-connector-yugabytedb.md similarity index 86% rename from docs/content/stable/explore/change-data-capture/debezium-connector-yugabytedb.md rename to docs/content/stable/explore/change-data-capture/using-yugabytedb-grpc-replication/debezium-connector-yugabytedb.md index 8c7c05a605e7..d1586fdf054c 100644 --- a/docs/content/stable/explore/change-data-capture/debezium-connector-yugabytedb.md +++ b/docs/content/stable/explore/change-data-capture/using-yugabytedb-grpc-replication/debezium-connector-yugabytedb.md @@ -1,39 +1,19 @@ --- -title: Debezium connector for YugabyteDB -headerTitle: Debezium connector for YugabyteDB -linkTitle: Debezium connector -description: Debezium is an open source distributed platform used to capture the changes in a database. +title: YugabyteDB gRPC Connector (Debezium) +headerTitle: YugabyteDB gRPC Connector +linkTitle: YugabyteDB gRPC Connector +description: YugabyteDB gRPC Connector is an open source distributed platform used to capture the changes in a database. menu: stable: - parent: explore-change-data-capture + parent: explore-change-data-capture-grpc-replication identifier: debezium-connector-yugabytedb - weight: 20 + weight: 50 type: docs rightNav: hideH4: true --- -The Debezium connector for YugabyteDB captures row-level changes in the schemas of a YugabyteDB database. - -The first time it connects to a YugabyteDB cluster or universe, the connector takes a consistent snapshot of the tables it is configured for. After that snapshot is complete, the connector continuously captures row-level changes that insert, update, and delete database content that are committed to a YugabyteDB database. The connector generates data change event records and streams them to Kafka topics. 
For each table, the default behavior is that the connector streams all generated events to a separate Kafka topic for that table. Applications and services consume data change event records from that topic. - -## Overview - -The Debezium connector for YugabyteDB reads the changes produced by YugabyteDB. It uses the CDC service APIs implemented on the server side to get the changes. - -The connector produces a change event for every row-level insert, update, and delete operation that was captured, and sends change event records for each table in separate Kafka topics. Client applications read the Kafka topics corresponding to database tables of interest, and can react to every row-level event they receive from those topics. - -![What is CDC](/images/explore/cdc-overview-what.png) - -YugabyteDB normally purges write-ahead log (WAL) segments after some period of time. This means that the connector does not have the complete history of all changes that have been made to the database. Therefore, when the YugabyteDB connector first connects to a particular YugabyteDB database, it starts by taking a snapshot of each of the database schemas. After the connector completes the snapshot, it continues streaming changes from the exact point at which the snapshot was made. This way, the connector starts with a consistent view of all of the data, and does not omit any changes that were made while the snapshot was being taken. - -The connector is tolerant of failures. As the connector reads changes and produces events, it records the WAL position for each event. If the connector stops for any reason (including communication failures, network problems, or crashes), upon restart the connector continues reading the WAL where it last left off using the WAL position called checkpoints managed on the Kafka side as well as on the YugabyteDB cluster. - -{{< tip title="Use UTF-8 encoding" >}} - -Debezium supports databases with UTF-8 character encoding only. 
With a single-byte character encoding, it's not possible to correctly process strings that contain extended ASCII code characters. - -{{< /tip >}} +The YugabyteDB gRPC Connector is based on the Debezium Connector, and captures row-level changes in a YugabyteDB database's schemas using the YugabyteDB gRPC replication protocol. ## Connector compatibility @@ -41,24 +21,55 @@ The connector is compatible with the following versions of YugabyteDB. | YugabyteDB | Connector | | :--- | :--- | -| 2.14 (EA) | 1.9.5.y.3 | +| 2.14 | 1.9.5.y.3 | | 2.16 | 1.9.5.y.24 | | 2.18.2 | 1.9.5.y.33.2 | | 2.20 | 1.9.5.y.220.2 | -| 2024.1 | 1.9.5.y.220.2 | + +In addition, the connector supports the following: + +* Kafka Connect v2.x and later. +* YugabyteDB v2.14 and later. {{< note title="Note" >}} Starting with YugabyteDB v2.20, the naming convention for releases of the connector uses the scheme *major.y.minor*, as follows: + +* *major* - Debezium release the connector is based on +* *minor* - version of YugabyteDB the connector works with + The connector is backward compatible with previous releases of YugabyteDB unless stated otherwise. {{< /note >}} + +## Initial snapshot and continuous streaming + +* Initial snapshot: Upon its first connection to a YugabyteDB cluster, the connector takes a consistent snapshot of the configured tables. +* Continuous streaming: After the snapshot, it continuously captures row-level changes (insertions, updates, and deletions) from the database. It then generates data change event records and streams them to Kafka topics. + +![What is CDC](/images/explore/cdc-overview-work.png) + +## Kafka integration + +For each table, the connector streams all generated events to a separate Kafka topic. Client applications and services can consume these data change event records from their respective topics.
+ +* CDC (Change Data Capture) Service: The Debezium connector leverages the CDC service APIs to read the changes from YugabyteDB. +* Event Production: For every row-level insert, update, and delete operation captured, the connector produces a corresponding change event and sends it to separate Kafka topics dedicated to each table. +* Client Consumption: Applications read the Kafka topics corresponding to the database tables they are interested in and react to the row-level events received. + +## Failure tolerance + +The connector records the WAL position for each event as it reads changes and produces events. If the connector stops (due to communication failures, network problems, or crashes), it resumes reading the WAL from the last recorded position upon restart. This uses checkpoints managed on both the Kafka side and the YugabyteDB cluster. + +{{< tip title="Use UTF-8 encoding" >}} + +Debezium supports databases with UTF-8 character encoding only. With a single-byte character encoding, it's not possible to correctly process strings that contain extended ASCII code characters. + +{{< /tip >}} + +## How the connector works -To optimally configure and run a Debezium YugabyteDB connector, it is helpful to understand how the connector performs snapshots, streams change events, determines Kafka topic names, and uses metadata. +To optimally configure and run a Debezium connector, it is helpful to understand how the connector performs snapshots, streams change events, determines Kafka topic names, and uses metadata. ### Security @@ -74,7 +85,7 @@ Per-user CDC privileges are planned for a future release. Most YugabyteDB servers are configured to not retain the complete history of the database in the WAL segments. This means that the YugabyteDB connector would be unable to see the entire history of the database by reading only the WAL. Consequently, the first time that the connector starts, it performs an initial consistent snapshot of the database.
You can change this behavior by setting the `snapshot.mode` connector configuration property to a value other than initial. -After the connector completes its initial snapshot, the YugabyteDB connector continues streaming the changes. This ensures that the connector does not miss any updates. If the connector stops again for any reason, upon restart, the connector continues streaming changes from where it previously left off. +After the connector completes its initial snapshot, it continues streaming the changes. This ensures that the connector does not miss any updates. If the connector stops again for any reason, upon restart, the connector continues streaming changes from where it previously left off. Options for the `snapshot.mode` connector configuration property are as follows: @@ -86,11 +97,11 @@ Options for the `snapshot.mode` connector configuration property are as follows: ### Streaming changes -The YugabyteDB connector typically spends the vast majority of its time streaming changes from the YugabyteDB server to which it is connected. +The YugabyteDB gRPC Connector typically spends the vast majority of its time streaming changes from the YugabyteDB server to which it is connected. The connector keeps polling for changes and whenever there is a change, the connector processes them, converts them to a specific format (Protobuf or JSON in the case of the Debezium plugin) and writes them on an output stream, which can then be consumed by clients. -The Debezium YugabyteDB connector acts as a YugabyteDB client. When the connector receives changes it transforms the events into Debezium create, update, or delete events that include the LSN of the event. The YugabyteDB connector forwards these change events in records to the Kafka Connect framework, which is running in the same process. The Kafka Connect process asynchronously writes the change event records in the same order in which they were generated to the appropriate Kafka topic. 
+The connector acts as a YugabyteDB client. When the connector receives changes it transforms the events into Debezium create, update, or delete events that include the LSN of the event. The connector forwards these change events in records to the Kafka Connect framework, which is running in the same process. The Kafka Connect process asynchronously writes the change event records in the same order in which they were generated to the appropriate Kafka topic. Periodically, Kafka Connect records the most recent offset in another Kafka topic. The offset indicates source-specific position information that Debezium includes with each event. @@ -98,13 +109,13 @@ When Kafka Connect gracefully shuts down, it stops the connectors, and flushes a {{< note title="Schema changes" >}} -The YugabyteDB connector retrieves schema information as part of the change events which consist of the schema metadata for the table. When there is any schema change on the configured table, the connector will automatically receive an event pertaining to the change and it will update its internal schema. +The connector retrieves schema information as part of the change events which consist of the schema metadata for the table. When there is any schema change on the configured table, the connector will automatically receive an event pertaining to the change and it will update its internal schema. {{< /note >}} ### Topic names -By default, the YugabyteDB connector writes change events for all `INSERT`, `UPDATE`, and `DELETE` operations that occur in a table to a single Apache Kafka topic that is specific to that table. The connector names change event topics as _serverName.schemaName.tableName_. +By default, the YugabyteDB gRPC connector writes change events for all `INSERT`, `UPDATE`, and `DELETE` operations that occur in a table to a single Apache Kafka topic that is specific to that table. The connector names change event topics as _serverName.schemaName.tableName_. 
The components of a topic name are as follows: @@ -132,7 +143,7 @@ If the default topic names don't meet your requirements, you can configure custo ### Meta information -In addition to the data change event, each record produced by the YugabyteDB connector contains some metadata. Metadata includes information about which tablet caused the change event to occur, the commit time, table, database, offset of the event, for example: +In addition to the data change event, each record produced by the connector contains some metadata. Metadata includes information about which tablet caused the change event to occur, the commit time, table, database, offset of the event, for example: ```output.json "source": { @@ -235,7 +246,7 @@ For example: ## Data change events -The Debezium YugabyteDB connector generates a data change event for each row-level `INSERT`, `UPDATE`, and `DELETE` operation. Each event contains a key and a value. The structure of the key and the value depends on the table that was changed. +The connector generates a data change event for each row-level `INSERT`, `UPDATE`, and `DELETE` operation. Each event contains a key and a value. The structure of the key and the value depends on the table that was changed. Debezium and Kafka Connect are designed around continuous streams of event messages. However, the structure of these events may change over time, which can be difficult for consumers to handle. To address this, each event contains the schema for its content. This makes each event self-contained. @@ -269,7 +280,7 @@ If you use the JSON converter and you configure it to produce all four basic cha {{< warning title="Naming conflicts due to invalid characters" >}} -The YugabyteDB connector ensures that all Kafka Connect schema names adhere to the [Avro schema name format](http://avro.apache.org/docs/current/spec.html#names). This means that the logical server name must start with a Latin letter or an underscore (a-z, A-Z, or \_). 
Each remaining character in the logical server name and each character in the schema and table names must be a Latin letter, a digit, or an underscore (a-z, A-Z, 0-9, or \_). Invalid characters are replaced with an underscore character. +The YugabyteDB gRPC connector ensures that all Kafka Connect schema names adhere to the [Avro schema name format](http://avro.apache.org/docs/current/spec.html#names). This means that the logical server name must start with a Latin letter or an underscore (a-z, A-Z, or \_). Each remaining character in the logical server name and each character in the schema and table names must be a Latin letter, a digit, or an underscore (a-z, A-Z, 0-9, or \_). Invalid characters are replaced with an underscore character. This can lead to unexpected conflicts if the logical server name, a schema name, or a table name contains invalid characters, in the event that the only characters that distinguish names from one another are invalid, and thus replaced with underscores. @@ -689,11 +700,11 @@ A `delete` change event record provides a consumer with the information it needs #### Tombstone events -When a row is deleted, the _delete_ event value still works with log compaction, because Kafka can remove all earlier messages that have that same key. However, for Kafka to remove all messages that have that same key, the message value must be `null`. To make this possible, the YugabyteDB connector follows a delete event with a special _tombstone_ event that has the same key but a null value. +When a row is deleted, the _delete_ event value still works with log compaction, because Kafka can remove all earlier messages that have that same key. However, for Kafka to remove all messages that have that same key, the message value must be `null`. To make this possible, the connector follows a delete event with a special _tombstone_ event that has the same key but a null value. 
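The interplay between delete events, tombstones, and log compaction can be illustrated with a small simulation (hypothetical keys and values, not actual connector records):

```python
def compact(events):
    """Simulate Kafka log compaction: keep only the latest value per key.
    A null (None) value, a tombstone, lets the key be removed entirely."""
    latest = {}
    for key, value in events:
        if value is None:
            latest.pop(key, None)  # tombstone: drop all messages for this key
        else:
            latest[key] = value
    return latest

events = [
    ("id:1", {"op": "c", "name": "alice"}),   # insert
    ("id:1", {"op": "u", "name": "alicia"}),  # update
    ("id:1", {"op": "d"}),                    # delete event (non-null value)
    ("id:1", None),                           # tombstone that follows the delete
]

# Without the tombstone, the delete event itself would survive compaction;
# with it, the deleted row leaves no trace.
print(compact(events))  # {}
```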
{{< tip title="TRUNCATE tables when CDC is enabled" >}} -By default, the YugabyteDB CDC implementation does not allow you to TRUNCATE a table while an active CDC stream is present on the namespace. To allow truncating tables while CDC is active, set the [enable_truncate_cdcsdk_table](../../../reference/configuration/yb-tserver/#enable-truncate-cdcsdk-table) flag to true. +By default, the YugabyteDB CDC implementation does not allow you to TRUNCATE a table while an active CDC stream is present on the namespace. To allow truncating tables while CDC is active, set the [enable_truncate_cdcsdk_table](../../../../reference/configuration/yb-tserver/#enable-truncate-cdcsdk-table) flag to true. {{< /tip >}} @@ -711,14 +722,14 @@ If you set the property to `false` to prevent the connector from saving tombston ## Datatype mappings -The YugabyteDB connector represents changes to rows with events that are structured like the table in which the row exists. The event contains a field for each column value. How that value is represented in the event depends on the YugabyteDB data type of the column. The following sections describe how the connector maps YugabyteDB data types to a literal type and a semantic type in event fields. +The connector represents changes to rows with events that are structured like the table in which the row exists. The event contains a field for each column value. How that value is represented in the event depends on the YugabyteDB data type of the column. The following sections describe how the connector maps YugabyteDB data types to a literal type and a semantic type in event fields. * The literal type describes how the value is literally represented using Kafka Connect schema types: INT8, INT16, INT32, INT64, FLOAT32, FLOAT64, BOOLEAN, STRING, BYTES, ARRAY, MAP, and STRUCT. * The semantic type describes how the Kafka Connect schema captures the meaning of the field using the name of the Kafka Connect schema for the field. 
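For example, a `DATE` column is commonly carried with an `int32` literal type and a Debezium semantic type name. The following is an illustrative sketch of that split, assuming Debezium's usual `io.debezium.time.Date` semantic type, not verbatim connector output:

```python
from datetime import date

# Hypothetical Kafka Connect field schema for a DATE column:
# the literal type says how the value is encoded on the wire,
# the semantic type names what the value means.
field_schema = {
    "field": "created_on",
    "type": "int32",                  # literal type
    "name": "io.debezium.time.Date",  # semantic type (days since epoch)
}

# Under this schema, the wire value for 2024-01-01 is a day count.
days = (date(2024, 1, 1) - date(1970, 1, 1)).days
print(days)  # 19723
```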
### Default values -If there is a default value for any column in a the YugabyteDB database schema, the YugabyteDB Debezium connector propagates the same value to the Kafka schema. +If there is a default value for any column in the YugabyteDB database schema, the connector propagates the same value to the Kafka schema. ### Basic types @@ -811,7 +822,7 @@ YugabyteDB supports using `+/-infinity` values in `TIMESTAMP` columns. These spe ### Decimal types -The setting of the YugabyteDB connector configuration property `decimal.handling.mode` determines how the connector maps decimal types. +The setting of the connector configuration property `decimal.handling.mode` determines how the connector maps decimal types. {{< note title="Note" >}} @@ -911,10 +922,10 @@ Support for the following YugabyteDB data types will be enabled in future releas ## Setting up YugabyteDB -Before using the YugabyteDB connector to monitor the changes on a YugabyteDB server, you need to ensure the following: +Before using the connector to monitor the changes on a YugabyteDB server, you need to ensure the following: -* You have a stream ID created on the database you want to monitor the changes for. The stream can be created using the [yb-admin create_change_data_stream](../../../admin/yb-admin#create_change_data_stream) command. -* The table which is supposed to be monitored should have a primary key. Only tables which have a primary key can be streamed. See [limitations](../../change-data-capture/cdc-overview/#known-limitations). +* You have a stream ID created on the database you want to monitor the changes for. The stream can be created using the [yb-admin create_change_data_stream](../../../../admin/yb-admin#create_change_data_stream) command. +* The table which is supposed to be monitored should have a primary key. Only tables which have a primary key can be streamed. See [limitations](../cdc-overview/#known-limitations).
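For reference, creating a stream with yb-admin might look like the following. This is a sketch that assumes a local cluster with a master at 127.0.0.1:7100 and the default `yugabyte` database; adjust the addresses and database name for your deployment:

```sh
# Create a CDC stream on the ysql.yugabyte database; the command
# prints the stream ID to use in the connector's database.streamid.
./bin/yb-admin \
    --master_addresses 127.0.0.1:7100 \
    create_change_data_stream ysql.yugabyte
```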
### WAL disk space consumption @@ -924,11 +935,11 @@ For example, the connector is lagging behind in streaming the changes. In this c ## Deployment -To deploy a Debezium YugabyteDB connector, you install the Debezium YugabyteDB connector archive, configure the connector, and start the connector by adding its configuration to Kafka Connect. For complete steps, follow the guide to [running the Debezium connector for YugabyteDB](../../../integrations/cdc/debezium/). +To deploy a Debezium connector, you install the YugabyteDB gRPC Connector archive, configure the connector, and start the connector by adding its configuration to Kafka Connect. For complete steps, follow the guide to [running the Debezium connector in YugabyteDB](../../../../integrations/cdc/debezium/). ### Connector configuration example -Following is an example of the configuration for a YugabyteDB connector that connects to a YugabyteDB server on port 5433 at 127.0.0.1, whose logical name is `dbserver1`. Typically, you configure the Debezium YugabyteDB connector in a JSON file by setting the configuration properties available for the connector. +Following is an example of the configuration for a connector that connects to a YugabyteDB server on port 5433 at 127.0.0.1, whose logical name is `dbserver1`. Typically, you configure the connector in a JSON file by setting the configuration properties available for the connector. You can choose to produce events for a subset of the schemas and tables in a database. Optionally, you can ignore, mask, or truncate columns that contain sensitive data, are larger than a specified size, or that you do not need. @@ -951,11 +962,11 @@ You can choose to produce events for a subset of the schemas and tables in a dat ``` 1. The name of the connector when registered with a Kafka Connect service. -1. The name of this YugabyteDB connector class. +1. The name of this YugabyteDB gRPC Connector class. 1. The address of this YugabyteDB server. 1. 
The port number of the YugabyteDB YSQL process. 1. List of comma separated values of master nodes of the YugabyteDB server. Usually in the form `host`:`port`. -1. The DB stream ID created using [yb-admin](../../../admin/yb-admin/#change-data-capture-cdc-commands). +1. The DB stream ID created using [yb-admin](../../../../admin/yb-admin/#change-data-capture-cdc-commands). 1. The name of the YugabyteDB user having the privileges to connect to the database. 1. The password for the above specified YugabyteDB user. 1. The name of the YugabyteDB database to connect to. @@ -981,16 +992,16 @@ See [Transformers](#transformers). ### Adding connector configuration -To run a YugabyteDB connector, create a connector configuration and add the configuration to your Kafka Connect cluster. +To run a connector, create a connector configuration and add the configuration to your Kafka Connect cluster. #### Prerequisites * [YugabyteDB is configured](#setting-up-yugabytedb) for change data capture. -* The YugabyteDB connector is installed. +* The YugabyteDB gRPC connector is installed. #### Procedure -1. Create a configuration for the YugabyteDB connector. +1. Create a configuration for the connector. 2. Use the [Kafka Connect REST API](https://kafka.apache.org/documentation/#connect_rest) to add that connector configuration to your Kafka Connect cluster. #### Results @@ -999,7 +1010,7 @@ After the connector starts, it will perform a snapshot of the tables depending o ### Connector configuration properties -The Debezium YugabyteDB connector has many configuration properties that you can use to achieve the right connector behavior for your application. Many properties have default values. +The connector has many configuration properties that you can use to achieve the right connector behavior for your application. Many properties have default values. 
The following properties are _required_ unless a default value is available: @@ -1013,7 +1024,7 @@ The following properties are _required_ unless a default value is available: | database.password | N/A | Password for the given user. | | database.dbname | N/A | The database from which to stream. | | database.server.name | N/A | Logical name that identifies and provides a namespace for the particular YugabyteDB database server or cluster for which Debezium is capturing changes. This name must be unique, as it's also used to form the Kafka topic. | -| database.streamid | N/A | Stream ID created using [yb-admin](../../../admin/yb-admin/#change-data-capture-cdc-commands) for Change data capture. | +| database.streamid | N/A | Stream ID created using [yb-admin](../../../../admin/yb-admin/#change-data-capture-cdc-commands) for Change data capture. | | table.include.list | N/A | Comma-separated list of table names and schema names, such as `public.test` or `test_schema.test_table_name`. | | table.max.num.tablets | 300 | Maximum number of tablets the connector can poll for. This should be greater than or equal to the number of tablets the table is split into. | | database.sslmode | disable | Whether to use an encrypted connection to the YugabyteDB cluster. Supported options are:
  • `disable` uses an unencrypted connection
  • `require` uses an encrypted connection and fails if it can't be established
  • `verify-ca` uses an encrypted connection, verifies the server TLS certificate against the configured Certificate Authority (CA) certificates, and fails if no valid matching CA certificates are found.
| @@ -1040,9 +1051,9 @@ The APIs used to fetch the changes are set up to work with TLSv1.2 only. Make su If you have a YugabyteDB cluster with SSL enabled, you need to obtain the root certificate and provide the path of the file in the `database.sslrootcert` configuration property. You can follow these links to get the certificates for your universe: -* [Local deployments](../../../secure/tls-encryption/) -* [YugabyteDB Anywhere](../../../yugabyte-platform/security/enable-encryption-in-transit/#connect-to-a-ysql-endpoint-with-tls) -* [YugabyteDB Aeon](../../../yugabyte-cloud/cloud-secure-clusters/cloud-authentication/#download-your-cluster-certificate) +* [Local deployments](../../../../secure/tls-encryption/) +* [YugabyteDB Anywhere](../../../../yugabyte-platform/security/enable-encryption-in-transit/#connect-to-a-ysql-endpoint-with-tls) +* [YugabyteDB Aeon](../../../../yugabyte-cloud/cloud-secure-clusters/cloud-authentication/#download-your-cluster-certificate) {{< /note >}} @@ -1059,7 +1070,7 @@ Advanced connector configuration properties: | time.precision.mode | adaptive | Time, date, and timestamps can be represented with different kinds of precision:

`adaptive` captures the time and timestamp values exactly as in the database using millisecond precision values based on the database column's type.

`adaptive_time_microseconds` captures the date, datetime and timestamp values exactly as in the database using millisecond precision values based on the database column's type. An exception is `TIME` type fields, which are always captured as microseconds.

`connect` always represents time and timestamp values by using Kafka Connect's built-in representations for Time, Date, and Timestamp, which use millisecond precision regardless of the database columns' precision. See temporal values. | | decimal.handling.mode | double | The `precise` mode is not currently supported.

`double` maps all the numeric, double, and money types as Java double values (FLOAT64).

`string` represents the numeric, double, and money types as their string-formatted form.

| | binary.handling.mode | hex | `hex` is the only supported mode. All binary strings are converted to their respective hex format and emitted as their string representation. | -| interval.handling.mode | numeric | Specifies how the connector should handle values for interval columns:<br/>

`numeric` represents intervals using approximate number of microseconds.

`string` represents intervals exactly by using the string pattern representation
`PYMDTHMS`.
For example: P1Y2M3DT4H5M6.78S. See [YugabyteDB data types](../../../api/ysql/datatypes/). | +| interval.handling.mode | numeric | Specifies how the connector should handle values for interval columns:

`numeric` represents intervals using approximate number of microseconds.

`string` represents intervals exactly by using the string pattern representation
`PYMDTHMS`.
For example: P1Y2M3DT4H5M6.78S. See [YugabyteDB data types](../../../../api/ysql/datatypes/). | | transaction.topic | `${database.server.name}`
`.transaction` | Controls the name of the topic to which the connector sends transaction metadata messages. The placeholder `${database.server.name}` can be used for referring to the connector's logical name; defaults to `${database.server.name}.transaction`, for example `dbserver1.transaction`. | | provide.transaction.metadata | `false` | Determines whether the connector generates events with transaction boundaries and enriches change event envelopes with transaction metadata. Specify `true` if you want the connector to do this. See [Transaction metadata](#transaction-metadata) for details. | | skipped.operations | N/A | A comma-separated list of operation types to be skipped during streaming. The types are `c` for insert/create operations, `u` for update operations, and `d` for delete operations. By default, no operations are skipped. | @@ -1082,11 +1093,11 @@ The following three transformers are available: YBExtractNewRecordState, Extract Transformer type: `io.debezium.connector.yugabytedb.transforms.YBExtractNewRecordState` -Unlike the Debezium Connector for PostgreSQL, the YugabyteDB connector only sends the `after` image of the "set of columns" that are modified. PostgreSQL sends the complete `after` image of the row which has changed. So by default if the column was not changed, it is not a part of the payload that is sent and the default value is set to `null`. +Unlike the Debezium connector for PostgreSQL, the YugabyteDB gRPC Connector only sends the `after` image of the "set of columns" that are modified. PostgreSQL sends the complete `after` image of the row which has changed. So by default if the column was not changed, it is not a part of the payload that is sent and the default value is set to `null`. -To differentiate between the case where a column is set to `null` and the case in which it's not modified, the YugabyteDB connector changes the value type to a struct. 
In this structure, an unchanged column is `{'value': null}`, whereas the column changed to a null value is `{'value': null, 'set': true}`. +To differentiate between the case where a column is set to `null` and the case in which it's not modified, the YugabyteDB gRPC Connector changes the value type to a struct. In this structure, an unchanged column is `{'value': null}`, whereas the column changed to a null value is `{'value': null, 'set': true}`. -A schema registry requires that, once a schema is registered, records must contain only payloads with that schema version. If you're using a schema registry, the YugabyteDB Debezium connector's approach can be problematic, as the schema may change with every message. For example, if we keep changing the record to only include the value of modified columns, the schema of each record will be different (the total number unique schemas will be a result of making all possible combinations of columns) and thus would require sending a schema with every record. +A schema registry requires that, once a schema is registered, records must contain only payloads with that schema version. If you're using a schema registry, the YugabyteDB gRPC Connector's approach can be problematic, as the schema may change with every message. For example, if we keep changing the record to only include the value of modified columns, the schema of each record will be different (the total number of unique schemas will be a result of making all possible combinations of columns) and thus would require sending a schema with every record.
This approach ensures that the schema doesn't change with each new record and it can work with a schema registry. @@ -1108,7 +1119,7 @@ The transformation defines the following configurations: * `field.name` - The name of the field which should be used as the topic name. If `null` or empty, the entire key or value is used (and assumed to be a string). By default is `null`. * `skip.missing.or.null` - In case the source of the new topic name is `null` or missing, should a record be silently passed without transformation. By default, is `false`. -Here is an example of this transformation configuration: +The following is an example of this transformation configuration: ```properties ... @@ -1153,11 +1164,11 @@ For usage example, refer to YugabyteDB CDC Consistent Streaming Pipeline in the ### Transaction boundaries -The connector publishes metadata that can be used to distinguish transaction boundaries for a downstream application to implement atomicity. Once the configuration property `provide.transaction.metadata` is enabled, the connector will also publish events indicating the beginning and end of the transaction. For more information, see [Transaction metadata](#transaction-metadata). +The connector publishes metadata that can be used to distinguish transaction boundaries for a downstream application to implement atomicity. After the configuration property `provide.transaction.metadata` is enabled, the connector will also publish events indicating the beginning and end of the transaction. For more information, see [Transaction metadata](#transaction-metadata). ### Prerequisites -* Create the Stream ID should in the `EXPLICIT` checkpointing mode. For more information, see [yb-admin create\_change\_data_stream](../../../admin/yb-admin#create-change-data-stream). +* Create the stream ID in the `EXPLICIT` checkpointing mode. For more information, see [yb-admin create\_change\_data_stream](../../../../admin/yb-admin#create-change-data-stream).
* You should always run the connector with a single task, that is, `tasks.max` should always be set to 1. ### Known limitations @@ -1188,15 +1199,15 @@ In these cases, the error message has details about the problem and possibly a s ### YB-TServer becomes unavailable -In case one of the tablet servers crashes, the replicas on other YB-TServer nodes will become the leader for the tablets that were hosted on the crashed server. The YugabyteDB connector will figure out the new tablet leaders and start streaming from the checkpoint the Debezium maintains. +If one of the tablet servers crashes, replicas on other YB-TServer nodes become the leaders for the tablets that were hosted on the crashed server. The YugabyteDB gRPC Connector identifies the new tablet leaders and resumes streaming from the checkpoint that Debezium maintains. ### YugabyteDB server failures -In case of YugabyteDB server failures, the Debezium YugabyteDB connector will try for a configurable (using a [flag](../../../reference/configuration/yb-tserver/#change-data-capture-cdc-flags)) amount of time for the availability of the YB-TServer and will stop if the cluster cannot start. When the cluster is restarted, the connector can be run again and it will start processing the changes with the committed checkpoint. +If the YugabyteDB server fails, the YugabyteDB gRPC Connector waits for a configurable amount of time for the YB-TServer to become available, and stops if the cluster cannot start. When the cluster is restarted, you can run the connector again and it will resume processing changes from the committed checkpoint. ### Connector unable to find table association with stream ID -In this case, the Connector throws an exception with an error message that the table is not a part of the stream ID. +In this case, the connector throws an exception with an error message that the table is not a part of the stream ID.
This can happen in the following 2 scenarios: @@ -1207,14 +1218,15 @@ This can happen in the following 2 scenarios: When the connector is running, the YugabyteDB server that it is connected to could become unavailable for any number of reasons. If this happens, the connector fails with an error and stops. When the server is available again, restart the connector. -The YugabyteDB connector externally stores the last processed offset in the form of a checkpoint. After a connector restarts and connects to a server instance, the connector communicates with the server to continue streaming from that particular offset. This offset is available as long as the stream ID remains intact. Never delete a stream ID without deleting all the associated connectors with it otherwise you will lose data. +The connector externally stores the last processed offset in the form of a checkpoint. After a connector restarts and connects to a server instance, the connector communicates with the server to continue streaming from that particular offset. This offset is available as long as the stream ID remains intact. Never delete a stream ID without deleting all the associated connectors with it, otherwise you will lose data. ## Dropping a table part of the replication -While the connector is running with a set of tables configured to capture the changes, if one of the tables in the set is dropped, the connector will fail with an appropriate error message indicating that the object is not found. +While the connector is running with a set of tables configured to capture the changes, if one of the tables in the set is dropped, the connector will fail with an error message indicating that the object is not found. + +To avoid or resolve a failure due to a dropped table, follow these steps: -To avoid a connector failure or to resolve the failure, the recommended way is to follow these steps: -1. Delete the connector which contains the dropped or the table to be dropped. +1. 
Delete the connector that contains the table that was dropped, or that you want to drop. 2. Edit the configuration and remove the given table from `table.include.list`. 3. Deploy a new connector with the updated configuration. diff --git a/docs/content/stable/explore/going-beyond-sql/connection-mgr-ysql.md b/docs/content/stable/explore/going-beyond-sql/connection-mgr-ysql.md index 365026c92efe..cdcec5a26d14 100644 --- a/docs/content/stable/explore/going-beyond-sql/connection-mgr-ysql.md +++ b/docs/content/stable/explore/going-beyond-sql/connection-mgr-ysql.md @@ -45,7 +45,7 @@ YSQL Connection Manager has the following key features: ## How to use -To start a YugabtyeDB cluster with YSQL Connection Manager, set the [yb-tserver](../../../reference/configuration/yb-tserver/) flag `enable_ysql_conn_mgr` to true. +To start a YugabyteDB cluster with YSQL Connection Manager, set the [yb-tserver](../../../reference/configuration/yb-tserver/) flag `enable_ysql_conn_mgr` to true. When `enable_ysql_conn_mgr` is set, each YB-TServer starts the YSQL Connection Manager process along with the PostgreSQL process. You should see one YSQL Connection Manager process per YB-TServer. diff --git a/docs/content/stable/reference/configuration/yb-tserver.md b/docs/content/stable/reference/configuration/yb-tserver.md index 15ad7966c670..cfd8e8f5877c 100644 --- a/docs/content/stable/reference/configuration/yb-tserver.md +++ b/docs/content/stable/reference/configuration/yb-tserver.md @@ -1296,6 +1296,38 @@ Number of tables to be added to the stream ID per run of the background thread w Default: `2` +The following set of flags is only relevant for CDC using the PostgreSQL replication protocol. To learn about CDC using the PostgreSQL replication protocol, see [CDC using logical replication](../../../architecture/docdb-replication/cdc-logical-replication). + +##### --ysql_yb_default_replica_identity + +The default replica identity to be assigned to user-defined tables at the time of creation. 
The flag is case sensitive and can take only one of four possible values: `FULL`, `DEFAULT`, `NOTHING`, and `CHANGE`. + +Default: `CHANGE` + +##### --cdcsdk_enable_dynamic_table_support + +Tables created after the creation of a replication slot are referred to as dynamic tables. This preview flag can be used to switch the dynamic addition of tables to the publication ON or OFF. + +Default: `false` + +##### --cdcsdk_publication_list_refresh_interval_secs + +Interval in seconds at which the table list in the publication will be refreshed. + +Default: `3600` + +##### --cdcsdk_max_consistent_records + +Controls the maximum number of records sent from Virtual WAL (VWAL) to walsender in consistent order. + +Default: `500` + +##### --cdcsdk_vwal_getchanges_resp_max_size_bytes + +Maximum size (in bytes) of changes sent from CDC Service to [Virtual WAL](../../../architecture/docdb-replication/cdc-logical-replication) (VWAL) for a particular tablet. + +Default: `1 MB` + ## File expiration based on TTL flags ##### --tablet_enable_ttl_file_filter diff --git a/docs/content/v2.20/explore/connection-manager/connection-mgr-ysql.md b/docs/content/v2.20/explore/connection-manager/connection-mgr-ysql.md index 032d92f2f8ff..238063e1267f 100644 --- a/docs/content/v2.20/explore/connection-manager/connection-mgr-ysql.md +++ b/docs/content/v2.20/explore/connection-manager/connection-mgr-ysql.md @@ -45,7 +45,7 @@ YSQL Connection Manager has the following key features: ## Use YSQL Connection Manager -To start a YugabtyeDB cluster with YSQL Connection Manager, set the [yb-tserver](../../../reference/configuration/yb-tserver/) flag `enable_ysql_conn_mgr` to true. +To start a YugabyteDB cluster with YSQL Connection Manager, set the [yb-tserver](../../../reference/configuration/yb-tserver/) flag `enable_ysql_conn_mgr` to true. When `enable_ysql_conn_mgr` is set, each YB-TServer starts the YSQL Connection Manager process along with the PostgreSQL process. 
You should see one YSQL Connection Manager process per YB-TServer.
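As a sketch, enabling this when starting a YB-TServer manually could look like the following. In releases where the flag is in preview, it may also need to be allowed explicitly via `allowed_preview_flags_csv`; verify both flags against your release:

```sh
# Start the YB-TServer with YSQL Connection Manager enabled.
# <other yb-tserver flags> stands in for your usual startup flags.
./bin/yb-tserver \
    --allowed_preview_flags_csv=enable_ysql_conn_mgr \
    --enable_ysql_conn_mgr=true \
    <other yb-tserver flags>
```

After startup, each YB-TServer should show one YSQL Connection Manager process alongside its PostgreSQL process.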