
[DEPRECATED] Embedding Debezium Connectors

This page describes an internal API that is subject to backwards-incompatible changes in future releases. Please use the new supported public Debezium Engine API.

Debezium connectors are normally operated by deploying them to a Kafka Connect service, and configuring one or more connectors to monitor upstream databases and produce data change events for all changes that they see in the upstream databases. Those data change events are written to Kafka, where they can be independently consumed by many different applications. Kafka Connect provides excellent fault tolerance and scalability, since it runs as a distributed service and ensures that all registered and configured connectors are always running. For example, even if one of the Kafka Connect endpoints in a cluster goes down, the remaining Kafka Connect endpoints will restart any connectors that were previously running on the now-terminated endpoint, minimizing downtime and eliminating administrative activities.

Not every application needs this level of fault tolerance and reliability, and they may not want to rely upon an external cluster of Kafka brokers and Kafka Connect services. Instead, some applications would prefer to embed Debezium connectors directly within the application space. They still want the same data change events, but prefer to have the connectors send them directly to the application rather than persist them inside Kafka.

This debezium-embedded module defines a small library that allows an application to easily configure and run Debezium connectors.

Dependencies

To use this module, add the debezium-embedded module to your application’s dependencies. For Maven, this entails adding the following to your application’s POM:

<dependency>
    <groupId>io.debezium</groupId>
    <artifactId>debezium-embedded</artifactId>
    <version>${version.debezium}</version>
</dependency>

where ${version.debezium} is either the version of Debezium you are using or a Maven property whose value contains the Debezium version string.

Likewise, add dependencies for each of the Debezium connectors that your application will use. For example, the following can be added to your application’s Maven POM file so your application can use the MySQL connector:

<dependency>
    <groupId>io.debezium</groupId>
    <artifactId>debezium-connector-mysql</artifactId>
    <version>${version.debezium}</version>
</dependency>

Or for the MongoDB connector:

<dependency>
    <groupId>io.debezium</groupId>
    <artifactId>debezium-connector-mongodb</artifactId>
    <version>${version.debezium}</version>
</dependency>

The remainder of this document describes embedding the MySQL connector in your application. Other connectors are used in a similar manner, except with connector-specific configuration, topics, and events.

In the code

Your application needs to set up an embedded engine for each connector instance you want to run. The io.debezium.embedded.EmbeddedEngine class serves as an easy-to-use wrapper around any standard Kafka Connect connector and completely manages the connector’s lifecycle. Basically, you create the EmbeddedEngine with a configuration (perhaps loaded from a properties file) that defines the environment for both the engine and the connector. You also provide the engine with a function that it will call for every data change event produced by the connector.

Here’s an example of code that configures and runs an embedded MySQL connector:

// Define the configuration for the embedded engine and MySQL connector ...
Configuration config = Configuration.create()
    /* begin engine properties */
    .with("connector.class", "io.debezium.connector.mysql.MySqlConnector")
    .with("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore")
    .with("offset.storage.file.filename", "/path/to/storage/offset.dat")
    .with("offset.flush.interval.ms", 60000)
    /* begin connector properties */
    .with("name", "my-sql-connector")
    .with("database.hostname", "localhost")
    .with("database.port", 3306)
    .with("database.user", "mysqluser")
    .with("database.password", "mysqlpw")
    .with("database.server.id", 85744)
    .with("topic.prefix", "my-app-connector")
    .with("schema.history.internal", "io.debezium.storage.file.history.FileSchemaHistory")
    .with("schema.history.internal.file.filename", "/path/to/storage/schemahistory.dat")
    .build();

// Create the engine with this configuration ...
EmbeddedEngine engine = EmbeddedEngine.create()
    .using(config)
    .notifying(this::handleEvent)
    .build();

// Run the engine asynchronously ...
Executor executor = Executors.newSingleThreadExecutor();
executor.execute(engine);

// At some later time ...
engine.stop();

Let’s look into this code in more detail, starting with the first few lines that we repeat here:

// Define the configuration for the embedded engine and MySQL connector ...
Configuration config = Configuration.create()
    /* begin engine properties */
    .with("connector.class", "io.debezium.connector.mysql.MySqlConnector")
    .with("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore")
    .with("offset.storage.file.filename", "/path/to/storage/offset.dat")
    .with("offset.flush.interval.ms", 60000);

This creates a new Configuration object and uses a fluent-style builder API to set several fields required by the engine regardless of which connector is being used. The first is a name for the engine that will be used within the source records produced by the connector and its internal state, so use something meaningful in your application. The connector.class field defines the name of the class that extends the Kafka Connect org.apache.kafka.connect.source.SourceConnector abstract class; in this example, we specify Debezium’s MySqlConnector class.

When a Kafka Connect connector runs, it reads information from the source and periodically records "offsets" that define how much of that information it has processed. Should the connector be restarted, it will use the last recorded offset to know where in the source information it should resume reading. Since connectors don’t know or care how the offsets are stored, it is up to the engine to provide a way to store and recover these offsets. The next few fields of our configuration specify that our engine should use the FileOffsetBackingStore class to store offsets in the /path/to/storage/offset.dat file on the local file system (the file can be named anything and stored anywhere). Additionally, although the connector records the offsets with every source record it produces, the engine flushes the offsets to the backing store periodically (in our case, once each minute). These fields can be tailored as needed for your application.

The next few lines define the fields that are specific to the connector, which in our example is the MySqlConnector connector:

    /* begin connector properties */
    .with("name", "mysql-connector")
    .with("database.hostname", "localhost")
    .with("database.port", 3306)
    .with("database.user", "mysqluser")
    .with("database.password", "mysqlpw")
    .with("database.server.id", 85744)
    .with("topic.prefix", "products")
    .with("schema.history.internal", "io.debezium.storage.file.history.FileSchemaHistory")
    .with("schema.history.internal.file.filename", "/path/to/storage/schemahistory.dat")
    .build();

Here, we set the name of the host machine and port number where the MySQL database server is running, and we define the username and password that will be used to connect to the MySQL database. Note that for MySQL the username and password should correspond to a MySQL database user that has been granted the following MySQL permissions:

  • SELECT

  • RELOAD

  • SHOW DATABASES

  • REPLICATION SLAVE

  • REPLICATION CLIENT

The first three privileges are required when reading a consistent snapshot of the databases. The last two privileges allow the database user to read the server’s binlog, which is normally used for MySQL replication.

The configuration also includes a numeric identifier for the server.id. Since MySQL’s binlog is part of the MySQL replication mechanism, in order to read the binlog the MySqlConnector instance must join the MySQL server group, which means this server ID must be unique within all processes that make up the MySQL server group and can be any integer between 1 and 2^32 - 1. In our code we set it to a fairly large but somewhat random value we’ll use only for our application.

The configuration also specifies a logical name for the MySQL server. The connector includes this logical name within the topic field of every source record it produces, enabling your application to discern the origin of those records. Our example uses a server name of "products", presumably because the database contains product information. Of course, you can name this anything meaningful to your application.

When the MySqlConnector class runs, it reads the MySQL server’s binlog, which includes all data changes and schema changes made to the databases hosted by the server. Since all changes to data are structured in terms of the owning table’s schema at the time the change was recorded, the connector needs to track all of the schema changes so that it can properly decode the change events. The connector records the schema information so that, should the connector be restarted and resume reading from the last recorded offset, it knows exactly what the database schemas looked like at that offset. How the connector records the database schema history is defined in the last two fields of our configuration, namely that our connector should use the FileSchemaHistory class to store database schema history changes in the /path/to/storage/schemahistory.dat file on the local file system (again, this file can be named anything and stored anywhere).

Finally, the immutable configuration is built using the build() method. (Incidentally, rather than build it programmatically, we could have read the configuration from a properties file using one of the Configuration.read(…) methods.)
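As an illustrative sketch only, reading the same settings from a properties file could use java.util.Properties together with io.debezium.config.Configuration; the file path is a placeholder, and the Configuration.from(Properties) factory shown here is an assumption, so check the read/load variants your version of the Configuration class actually provides:

// Load engine and connector settings from a properties file instead of
// building them programmatically. (Assumes the enclosing method declares IOException.)
Properties props = new Properties();
try (InputStream in = new FileInputStream("/path/to/connector.properties")) {   // placeholder path
    props.load(in);
}
Configuration config = Configuration.from(props);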

Now that we have a configuration, we can create our engine. Here again are the relevant lines of code:

// Create the engine with this configuration ...
EmbeddedEngine engine = EmbeddedEngine.create()
    .using(config)
    .notifying(this::handleEvent)
    .build();

A fluent-style builder API is used to create an engine that uses our Configuration object and that sends all data change records to the handleEvent(SourceRecord) method, which can be any method that matches the signature of the java.util.function.Consumer<SourceRecord> functional interface, where SourceRecord is the org.apache.kafka.connect.source.SourceRecord class. Note that your application’s handler function should not throw any exceptions; if it does, the engine will log any exception thrown by the method and will continue to operate on the next source record, but your application will not have another chance to handle the particular source record that caused the exception, meaning your application might become inconsistent with the database.
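For illustration, a handler matching the this::handleEvent method reference above could be as simple as the following sketch; the method body and the logger field are assumptions, not part of the engine API:

private void handleEvent(SourceRecord record) {
    // The record carries the change event; topic, key, and value are the same
    // structures a Kafka Connect worker would have written to Kafka.
    logger.info("Change event on topic {}: key = {}, value = {}",
            record.topic(), record.key(), record.value());
    // Apply the change to the application's own state here, and avoid throwing:
    // the engine only logs exceptions and moves on to the next record.
}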

At this point, we have an existing EmbeddedEngine object that is configured and ready to run, but it doesn’t do anything. The EmbeddedEngine is designed to be executed asynchronously by an Executor or ExecutorService:

// Run the engine asynchronously ...
Executor executor = Executors.newSingleThreadExecutor();
executor.execute(engine);

Your application can stop the engine safely and gracefully by calling its stop() method:

// At some later time ...
engine.stop();

The engine’s connector will stop reading information from the source system, forward all remaining SourceRecord objects to your handler function, and flush the latest offsets to offset storage. Only after all of this completes will the engine’s run() method return. If your application needs to wait for the engine to completely stop before exiting, you can do this with the engine’s await(…) method:

try {
    while (!engine.await(30, TimeUnit.SECONDS)) {
        logger.info("Waiting another 30 seconds for the embedded engine to shut down");
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}

Recall that when the JVM shuts down, it only waits for non-daemon threads. Therefore, if your application exits, be sure to wait for completion of the engine or alternatively run the engine on a daemon thread.

Your application should always properly stop the engine to ensure graceful and complete shutdown and that each source record is sent to the application exactly one time. For example, do not rely upon shutting down the ExecutorService, since that interrupts the running threads. Although the EmbeddedEngine will indeed terminate when its thread is interrupted, the engine may not terminate cleanly, and when your application is restarted it may see some of the same source records that it had processed just prior to the shutdown.
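One way to arrange this, sketched here under the assumption that the engine and a single-threaded ExecutorService were created as shown earlier (and with a placeholder logger), is to stop and await the engine from a JVM shutdown hook and only shut the executor down afterwards:

// Run the engine on an ExecutorService so it can be shut down cleanly later.
ExecutorService executor = Executors.newSingleThreadExecutor();
executor.execute(engine);

// Stop the engine before the JVM exits so that in-flight records are delivered
// and the latest offsets are flushed to offset storage.
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    engine.stop();
    try {
        while (!engine.await(30, TimeUnit.SECONDS)) {
            logger.info("Waiting another 30 seconds for the embedded engine to shut down");
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
    // Shut the executor down only after the engine has completely stopped, so
    // its thread is never interrupted while records are still being processed.
    executor.shutdown();
}));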

Advanced Record Consuming

For some use cases, such as when trying to write records in batches or against an async API, the functional interface described above may be challenging. In these situations, it may be easier to use the io.debezium.embedded.EmbeddedEngine.ChangeConsumer interface.

This interface has a single function with the following signature:

/**
 * Handles a batch of records, calling the {@link RecordCommitter#markProcessed(SourceRecord)}
 * for each record and {@link RecordCommitter#markBatchFinished()} when this batch is finished.
 *
 * @param records the records to be processed
 * @param committer the committer that indicates to the system that we are finished
 */
void handleBatch(List<SourceRecord> records, RecordCommitter committer) throws InterruptedException;

As mentioned in the Javadoc, the RecordCommitter object is to be called for each record and once each batch is finished. The RecordCommitter interface is threadsafe, which allows for flexible processing of records.

To use the ChangeConsumer API, you must pass an implementation of the interface to the notifying API, as seen below:

class MyChangeConsumer implements EmbeddedEngine.ChangeConsumer {
    public void handleBatch(List<SourceRecord> records, RecordCommitter committer) throws InterruptedException {
        ...
    }
}

// Create the engine with this configuration ...
EmbeddedEngine engine = EmbeddedEngine.create()
    .using(config)
    .notifying(new MyChangeConsumer())
    .build();
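As an illustrative sketch only, a handleBatch body might mark each record as processed as it is consumed and then close out the batch; the process(record) call is a placeholder for your own logic, and the exact package or nesting of the RecordCommitter type can vary between Debezium versions:

public void handleBatch(List<SourceRecord> records, RecordCommitter committer) throws InterruptedException {
    for (SourceRecord record : records) {
        // Placeholder: hand the record to a batch writer, an async API, etc.
        process(record);
        // Tell the engine that this record has been handled and that its
        // offset may be flushed to offset storage.
        committer.markProcessed(record);
    }
    // Signal that the whole batch has been processed.
    committer.markBatchFinished();
}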

Engine properties

The following configuration properties are required unless a default value is available (for the sake of text formatting the package names of Java classes are replaced with <…>).

Property: name
Default: (no default)
Description: Unique name for the connector instance.

Property: connector.class
Default: (no default)
Description: The name of the Java class for the connector, e.g. <…>.MySqlConnector for the MySQL connector.

Property: offset.storage
Default: <…>.FileOffsetBackingStore
Description: The name of the Java class that is responsible for persistence of connector offsets. It must implement the <…>.OffsetBackingStore interface.

Property: offset.storage.file.filename
Default: ""
Description: Path to the file where offsets are to be stored. Required when offset.storage is set to <…>.FileOffsetBackingStore.

Property: offset.storage.topic
Default: ""
Description: The name of the Kafka topic where offsets are to be stored. Required when offset.storage is set to <…>.KafkaOffsetBackingStore.

Property: offset.storage.partitions
Default: ""
Description: The number of partitions used when creating the offset storage topic. Required when offset.storage is set to <…>.KafkaOffsetBackingStore.

Property: offset.storage.replication.factor
Default: ""
Description: Replication factor used when creating the offset storage topic. Required when offset.storage is set to <…>.KafkaOffsetBackingStore.

Property: offset.commit.policy
Default: <…>.PeriodicCommitOffsetPolicy
Description: The name of the Java class of the commit policy. It defines when an offset commit has to be triggered, based on the number of events processed and the time elapsed since the last commit. This class must implement the <…>.OffsetCommitPolicy interface. The default is a periodic commit policy based upon time intervals.

Property: offset.flush.interval.ms
Default: 60000
Description: Interval at which to try committing offsets. The default is 1 minute.

Property: offset.flush.timeout.ms
Default: 5000
Description: Maximum number of milliseconds to wait for records to flush and partition offset data to be committed to offset storage before cancelling the process and restoring the offset data to be committed in a future attempt. The default is 5 seconds.

Database schema history properties

Some of the connectors also require an additional set of properties that configure the database schema history:

  • MySQL

  • SQL Server

  • Oracle

  • Db2

Without proper configuration of the database schema history the connectors will refuse to start. The default configuration expects a Kafka cluster to be available. For other deployments, a file-based database schema history storage implementation is available.

Property: schema.history.internal
Default: <…>.KafkaSchemaHistory
Description: The name of the Java class that is responsible for persistence of the database schema history. It must implement the <…>.SchemaHistory interface.

Property: schema.history.internal.file.filename
Default: ""
Description: Path to a file where the database schema history is stored. Required when schema.history.internal is set to <…>.FileSchemaHistory.

Property: schema.history.internal.kafka.topic
Default: ""
Description: The Kafka topic where the database schema history is stored. Required when schema.history.internal is set to <…>.KafkaSchemaHistory.

Property: schema.history.internal.kafka.bootstrap.servers
Default: ""
Description: The initial list of Kafka cluster servers to connect to. The cluster provides the topic to store the database schema history. Required when schema.history.internal is set to <…>.KafkaSchemaHistory.
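For example, a sketch of switching from the file-based schema history used earlier to the Kafka-based default combines the required Kafka properties listed above; the topic name and bootstrap servers are placeholders, and the fully qualified class name (taken from the debezium-storage-kafka module) should be verified against your Debezium version:

Configuration config = Configuration.create()
    // ... engine and connector properties as shown earlier ...
    .with("schema.history.internal", "io.debezium.storage.kafka.history.KafkaSchemaHistory")
    .with("schema.history.internal.kafka.bootstrap.servers", "localhost:9092")   // placeholder
    .with("schema.history.internal.kafka.topic", "schema-changes.my-app")        // placeholder
    .build();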

Handling failures

When the engine executes, its connector is actively recording the source offset inside each source record, and the engine is periodically flushing those offsets to persistent storage. Whether the application and engine shut down normally or crash, when they are restarted the engine and its connector will resume reading the source information from the last recorded offset.

So, what happens when your application fails while an embedded engine is running? The net effect is that the application will likely receive some source records after restart that it had already processed right before the crash. How many depends upon how frequently the engine flushes offsets to its store (via the offset.flush.interval.ms property) and how many source records the specific connector returns in one batch. The best case is that the offsets are flushed every time (e.g., offset.flush.interval.ms is set to 0), but even then the embedded engine will still only flush the offsets after each batch of source records is received from the connector.

For example, the MySQL connector uses the max.batch.size property to specify the maximum number of source records that can appear in a batch. Even with offset.flush.interval.ms set to 0, when an application restarts after a crash it may see up to n duplicates, where n is the size of the batches. If the offset.flush.interval.ms property is set higher, then the application may see up to n * m duplicates, where n is the maximum size of the batches and m is the number of batches that might accumulate during a single offset flush interval. (Obviously it is possible to configure embedded connectors to use no batching and to always flush offsets, resulting in an application never receiving any duplicate source records. However, this dramatically increases the overhead and decreases the throughput of the connectors.)
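To make this trade-off concrete, the following fragment is an illustrative sketch only: it flushes offsets after every batch and keeps batches small, which minimizes duplicates after a crash but reduces throughput (max.batch.size is a MySQL connector property mentioned above, and both values are placeholders):

Configuration config = Configuration.create()
    // ... other engine and connector properties as shown earlier ...
    .with("offset.flush.interval.ms", 0)   // flush offsets after every batch
    .with("max.batch.size", 1)             // effectively no batching; placeholder value
    .build();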

The bottom line is that when using embedded connectors, applications will receive each source record exactly once during normal operation (including restart after a graceful shutdown), but do need to be tolerant of receiving duplicate events immediately following a restart after a crash or improper shutdown. If applications need more rigorous exactly-once behavior, then they should use the full Debezium platform that can provide exactly-once guarantees (even after crashes and restarts).