How to Stream MySQL Binlog Changes Using Debezium

Debezium has emerged as the leading open-source platform for change data capture, transforming how organizations stream database changes into event-driven architectures. Unlike polling-based approaches that strain databases or proprietary CDC tools that lock you into vendor ecosystems, Debezium reads MySQL binary logs directly, capturing every insert, update, and delete with minimal source database impact. Understanding how to properly configure MySQL for binlog streaming, deploy Debezium connectors, and handle the resulting change event streams is essential for building reliable real-time data pipelines.

This article provides a complete implementation guide for streaming MySQL binlog changes using Debezium, from initial database configuration through production deployment patterns, focusing on the practical decisions and configurations that determine success in production environments.

Understanding Debezium’s Architecture and MySQL Integration

Debezium operates as a set of Kafka Connect source connectors, reading database transaction logs and publishing changes as structured Kafka events. For MySQL, this means connecting to the database, requesting binlog access, parsing binlog events, and transforming them into standardized change event formats that downstream consumers can process.

The architecture consists of three primary components. Kafka provides the distributed streaming platform, storing change events durably and enabling multiple consumers to process changes independently. Kafka Connect offers the framework for running Debezium connectors, handling configuration management, failure recovery, and distributed execution. The Debezium MySQL connector performs the actual work of connecting to MySQL, reading binlog events, and transforming them into Kafka messages.

This architecture provides significant advantages over custom CDC implementations. Debezium handles the complex binlog parsing, event transformation, and failure recovery logic that would require months to build correctly. Kafka Connect provides operational capabilities like distributed execution, monitoring, and scaling without custom infrastructure code. Kafka delivers reliable, scalable event storage with exactly-once delivery guarantees when properly configured.

Key concepts for MySQL binlog streaming:

  • Binary log position: MySQL tracks binlog files and positions within files. Debezium maintains offset state tracking which binlog file and position it has read to, enabling resume from exact points after restarts or failures.
  • GTID (Global Transaction Identifier): Modern MySQL versions support GTIDs, providing globally unique transaction identifiers that simplify failover and replication topology changes. Debezium can track GTIDs instead of file positions for more robust offset management.
  • Row-based replication: Debezium requires row-based binlog format, which records actual data changes rather than SQL statements. Statement-based logging is insufficient because it doesn’t capture the complete before/after state of modified rows.
  • Change event structure: Each Debezium event contains the full row state before and after changes, operation type (create/update/delete), transaction metadata, and source information identifying which database, table, and transaction produced the change.

Understanding this architecture helps you make informed decisions about deployment topology, resource allocation, and failure handling strategies that ensure reliable binlog streaming in production.

🔄 Debezium vs. Other CDC Approaches

Debezium’s log-based CDC imposes negligible overhead on source databases compared to trigger-based or query-based approaches. Reading binlogs typically consumes under 1% of database resources, enabling CDC on production systems without performance degradation. This efficiency is why Debezium has become the standard for open-source CDC.

Configuring MySQL for Binlog-Based CDC

Before deploying Debezium, MySQL requires specific configuration to enable binlog streaming with the correct format and retention policies. These settings determine whether Debezium can successfully capture changes and how resilient your CDC pipeline will be to failures.

Enable binary logging with row format. MySQL’s binlog_format must be set to ROW to capture complete row changes rather than SQL statements. Edit the my.cnf or my.ini configuration file:

[mysqld]
server-id = 1
log_bin = mysql-bin
binlog_format = ROW
binlog_row_image = FULL
expire_logs_days = 7

The binlog_row_image = FULL setting ensures complete before and after images of changed rows are captured, providing Debezium with all information needed to reconstruct change events. The retention setting determines how long MySQL keeps binlog files, which is critical for allowing Debezium to catch up after outages without data loss. On MySQL 8.0 and later, binary logging is enabled by default and expire_logs_days is deprecated in favor of binlog_expire_logs_seconds (default 30 days).

Configure GTID mode for improved reliability. GTIDs provide globally unique transaction identifiers that simplify offset management and failover scenarios. Enable GTID mode:

gtid_mode = ON
enforce_gtid_consistency = ON

GTID mode requires careful enablement on running systems because it affects replication topology. For new deployments, enable GTIDs from the start. For existing systems, follow MySQL’s documented GTID enablement procedures to avoid disrupting existing replication.
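For reference, the online enablement sequence looks roughly like the sketch below. Treat it as an outline only and follow the official MySQL procedure step by step, including waiting for anonymous transactions to drain on every server before the final step, and persisting the settings in my.cnf afterwards:

-- Rough outline of enabling GTIDs on a running server (verify against the MySQL docs)
SET GLOBAL enforce_gtid_consistency = WARN;   -- watch the error log for violations first
SET GLOBAL enforce_gtid_consistency = ON;
SET GLOBAL gtid_mode = OFF_PERMISSIVE;
SET GLOBAL gtid_mode = ON_PERMISSIVE;
-- Wait until SHOW STATUS LIKE 'ONGOING_ANONYMOUS_TRANSACTION_COUNT' reports 0
SET GLOBAL gtid_mode = ON;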

Create dedicated replication user with appropriate privileges. Debezium needs specific permissions to read binlogs and table schemas:

CREATE USER 'debezium'@'%' IDENTIFIED BY 'secure_password';
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT 
ON *.* TO 'debezium'@'%';
FLUSH PRIVILEGES;

The SELECT privilege allows reading table schemas and initial snapshots. RELOAD enables flushing table metadata. REPLICATION SLAVE and REPLICATION CLIENT grant binlog access. Note that the replication privileges are global and must be granted ON *.*; only SELECT can be narrowed, so in production restrict SELECT to the specific schemas Debezium captures rather than granting it on *.*.
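As a sketch of that tighter setup (assuming the captured schema is named inventory), the replication privileges stay global while SELECT is scoped:

-- Global privileges required for binlog and metadata access
GRANT RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'debezium'@'%';
-- SELECT limited to the schema(s) Debezium actually captures
GRANT SELECT ON inventory.* TO 'debezium'@'%';
FLUSH PRIVILEGES;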

Verify binlog configuration before deploying Debezium to catch configuration issues early:

-- Check binlog is enabled
SHOW VARIABLES LIKE 'log_bin';

-- Verify row format
SHOW VARIABLES LIKE 'binlog_format';

-- Check GTID status
SHOW VARIABLES LIKE 'gtid_mode';

-- View current binlog files
SHOW BINARY LOGS;

-- Confirm replication user privileges
SHOW GRANTS FOR 'debezium'@'%';

These queries validate your configuration matches Debezium requirements. If binlog isn’t enabled or format is incorrect, Debezium connectors will fail with explicit error messages, but catching issues before deployment saves troubleshooting time.

RDS and managed MySQL considerations: AWS RDS, Azure Database for MySQL, and Google Cloud SQL have special requirements. RDS exposes binlogs only when automated backups are enabled, and binlog retention is configured through the mysql.rds_set_configuration stored procedure rather than a server variable. Managed services often restrict some configuration options or require parameter group changes rather than direct configuration file edits.
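On RDS, for example, retention is typically raised with the rds_set_configuration procedure (the value is in hours; size it to cover your worst-case connector downtime):

-- Retain binlogs for 7 days on Amazon RDS for MySQL
CALL mysql.rds_set_configuration('binlog retention hours', 168);
-- Verify the setting
CALL mysql.rds_show_configuration;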

Deploying Debezium with Kafka Connect

With MySQL properly configured, deploying Debezium involves setting up Kafka, Kafka Connect, and the Debezium MySQL connector. This section covers deployment on Docker for development and distributed mode for production.

Docker Compose for development environment provides the fastest path to a working Debezium setup for learning and testing:

version: '3'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000

  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  connect:
    image: debezium/connect:2.4
    ports:
      - "8083:8083"
    depends_on:
      - kafka
    environment:
      BOOTSTRAP_SERVERS: kafka:9092
      GROUP_ID: 1
      CONFIG_STORAGE_TOPIC: connect_configs
      OFFSET_STORAGE_TOPIC: connect_offsets
      STATUS_STORAGE_TOPIC: connect_status

Run docker-compose up to start the complete stack. This gives you Kafka, Kafka Connect with Debezium connectors pre-installed, and a REST API on port 8083 for connector management.
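Before creating connectors, it is worth confirming that the Connect worker is up and the Debezium MySQL plugin is on its plugin path, using the REST API that ships with Kafka Connect:

# Debezium's MySqlConnector should appear in the plugin list
curl -s http://localhost:8083/connector-plugins | grep -i mysql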

Production deployment on Kubernetes requires more sophisticated configuration with proper resource allocation, security, and high availability. Use Strimzi Kafka Operator for managing Kafka on Kubernetes, then deploy Kafka Connect with Debezium as a separate deployment with multiple replicas for distributed execution and fault tolerance.
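A minimal Strimzi KafkaConnect resource might look like the sketch below. The cluster name my-cluster, the namespace defaults, the replica count, and the image reference are assumptions; the worker image must be a custom build (or a spec.build section) that includes the Debezium MySQL connector plugin:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: debezium-connect
  annotations:
    strimzi.io/use-connector-resources: "true"  # manage connectors as KafkaConnector resources
spec:
  replicas: 3
  bootstrapServers: my-cluster-kafka-bootstrap:9092
  image: registry.example.com/kafka-connect-debezium:2.4  # custom image containing the Debezium plugin
  config:
    group.id: debezium-connect-cluster
    config.storage.topic: connect_configs
    offset.storage.topic: connect_offsets
    status.storage.topic: connect_status
    config.storage.replication.factor: 3
    offset.storage.replication.factor: 3
    status.storage.replication.factor: 3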

Creating the MySQL connector involves POSTing configuration to the Kafka Connect REST API. The property names below follow the Debezium 2.x conventions (topic.prefix and schema.history.internal.*, which replaced database.server.name and database.history.* from the 1.x series):

{
  "name": "mysql-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "mysql.example.com",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "secure_password",
    "database.server.id": "184054",
    "database.server.name": "mysql-server",
    "database.include.list": "inventory",
    "table.include.list": "inventory.customers,inventory.orders",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory",
    "snapshot.mode": "initial",
    "snapshot.locking.mode": "minimal",
    "include.schema.changes": "true",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "false",
    "transforms.unwrap.delete.handling.mode": "rewrite"
  }
}

Critical configuration parameters explained:

  • database.server.id: Unique identifier for the connector when connecting to MySQL. Must be unique across all MySQL replicas and CDC clients. Use consistent values across connector restarts.
  • topic.prefix: Logical name for the source database server (this replaced database.server.name in Debezium 2.0). Becomes the prefix for all Kafka topic names. Changes to this value create new topics, potentially duplicating data.
  • database.include.list and table.include.list: Control which databases and tables are captured. Limit to only required tables to reduce resource usage and event volume. Use regular expressions for pattern matching.
  • snapshot.mode: Controls initial snapshot behavior. initial performs a full table snapshot before streaming changes. schema_only skips the data snapshot, streaming only new changes. when_needed snapshots only when required, for example when no previous offsets exist or the recorded binlog position is no longer available on the server.
  • snapshot.locking.mode: Determines table locking during snapshots. minimal holds the global read lock only long enough to capture the snapshot position and schema, none acquires no table locks at all (appropriate only when no schema changes will occur during the snapshot), and extended holds locks for the entire snapshot (not recommended for production).
  • schema.history.internal.kafka.topic: Kafka topic storing database schema changes. Debezium tracks schema evolution to correctly deserialize binlog events. This topic must be highly durable with effectively infinite retention.

Deploy the connector by POSTing this JSON to Kafka Connect:

curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d @mysql-connector.json

Verify deployment succeeded:

# Check connector status
curl http://localhost:8083/connectors/mysql-connector/status

# List created topics (expect one topic per captured table, prefixed with the topic.prefix value)
docker-compose exec kafka kafka-topics --bootstrap-server kafka:9092 --list

# Monitor connector logs (Kubernetes deployment)
kubectl logs -f deployment/kafka-connect

⚙️ Configuration Best Practice

Start with conservative snapshot and filtering settings. Begin by capturing only critical tables with schema_only snapshot mode to quickly validate binlog streaming works. Then expand to full snapshots and additional tables incrementally. This progressive approach prevents overwhelming your Kafka cluster with snapshot data during initial deployment.

Understanding and Processing Change Events

Debezium produces structured change events containing rich information about each database operation. Understanding event structure enables building robust consumers that correctly process changes while handling edge cases like schema evolution and deletes.

Change event structure follows a consistent format regardless of operation type. Each event contains an op field indicating operation type (c for create/insert, u for update, d for delete, r for read during snapshot), a before field containing the row state before the change, an after field with the row state after the change, and a source block with metadata about the change origin.

Example insert event structure:

{
  "before": null,
  "after": {
    "id": 1001,
    "first_name": "John",
    "last_name": "Doe",
    "email": "john.doe@example.com"
  },
  "source": {
    "version": "2.4.0.Final",
    "connector": "mysql",
    "name": "mysql-server",
    "ts_ms": 1699488000000,
    "snapshot": "false",
    "db": "inventory",
    "sequence": null,
    "table": "customers",
    "server_id": 184054,
    "gtid": "f7d85e3f-5e3c-11ee-8c99-0242ac120002:123",
    "file": "mysql-bin.000003",
    "pos": 154,
    "row": 0,
    "thread": 7,
    "query": null
  },
  "op": "c",
  "ts_ms": 1699488000105,
  "transaction": null
}

Processing different operation types requires handling null before or after fields appropriately. Inserts have null before, deletes have null after, updates have both. When applying changes to target systems, use op to determine the correct action—insert for creates, update/upsert for updates, delete for deletes.

Handling schema changes is critical for long-running CDC pipelines. Debezium captures DDL changes like ALTER TABLE in the schema history topic and adjusts event serialization accordingly. Consumers must either be schema-agnostic (processing events as generic JSON) or react to schema changes by updating their processing logic.

The ExtractNewRecordState transformation (configured via transforms.unwrap in the connector) simplifies event structure by extracting just the after state for creates and updates, making events more directly consumable. However, this loses the before state needed for certain use cases like computing deltas or auditing.
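With the unwrap transform from the connector configuration above, the same customer row arrives flattened. The exact shape depends on converter settings and transform options; with delete.handling.mode set to rewrite, a __deleted marker field is added, roughly like this:

{
  "id": 1001,
  "first_name": "John",
  "last_name": "Doe",
  "email": "john.doe@example.com",
  "__deleted": "false"
}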

Tombstone events for deletes provide compaction support in Kafka. After a delete event, Debezium produces a tombstone—a message with the same key but null value. Kafka log compaction uses tombstones to eventually remove all messages for deleted keys, preventing unbounded topic growth. Configure transforms.unwrap.drop.tombstones based on whether downstream consumers need tombstones or if they should be suppressed.

Consuming change events from Kafka requires choosing appropriate consumer patterns. For real-time processing, use Kafka Streams or Flink for stateful stream processing. For batch workloads, consume into data lakes using Kafka Connect S3 sink or similar. For database replication, use JDBC sink connectors that apply changes to target databases.

A simple Python consumer example:

from kafka import KafkaConsumer  # pip install kafka-python
import json

# Assumes the topic carries the full Debezium envelope as plain JSON
# (JSON converter with schemas disabled, no unwrap transform applied).
consumer = KafkaConsumer(
    'mysql-server.inventory.customers',
    bootstrap_servers=['localhost:9092'],
    # Tombstones arrive with a null value; skip JSON parsing for them
    value_deserializer=lambda m: json.loads(m.decode('utf-8')) if m else None,
    group_id='customer-processor',
    auto_offset_reset='earliest'
)

for message in consumer:
    event = message.value
    
    if event is None:  # Tombstone
        print(f"Delete tombstone for key: {message.key}")
        continue
    
    operation = event.get('op')
    
    if operation == 'c':  # Insert
        customer = event['after']
        print(f"New customer: {customer['first_name']} {customer['last_name']}")
        
    elif operation == 'u':  # Update
        before = event['before']
        after = event['after']
        print(f"Customer {after['id']} updated")
        
    elif operation == 'd':  # Delete
        customer = event['before']
        print(f"Customer {customer['id']} deleted")

Operational Considerations and Troubleshooting

Production Debezium deployments require monitoring, alerting, and troubleshooting capabilities to maintain reliable binlog streaming and quickly resolve issues when they occur.

Monitoring connector health starts with Kafka Connect metrics exposed via JMX. Critical metrics include connector state, task state, source record poll rate, and error counts. Use Prometheus with JMX exporter to collect these metrics and Grafana for visualization. Alert on connector state transitions to FAILED or task state becoming FAILED, indicating issues requiring investigation.

Binlog position and lag monitoring ensures the connector keeps pace with MySQL changes. Debezium exposes metrics showing current binlog file and position being processed. Compare this against MySQL’s current position (via SHOW MASTER STATUS) to calculate lag. Increasing lag indicates the connector falls behind, potentially due to insufficient Kafka Connect resources, network issues, or downstream Kafka capacity problems.
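A quick manual check compares MySQL’s current write position with what the connector reports; the exact JMX attribute names vary between Debezium versions, so treat the comment below as a pointer rather than a definitive reference:

-- On MySQL: current binlog file and position being written
SHOW MASTER STATUS;

-- Compare the File/Position columns with the connector's streaming metrics
-- exposed over JMX (for example MilliSecondsBehindSource) to estimate lag.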

Common issues and resolutions:

  • Connector fails to start with “binlog not available” errors: MySQL expired the binlog files before Debezium could read them. Increase expire_logs_days (binlog_expire_logs_seconds on MySQL 8.0+) or the RDS binlog retention hours setting. If the required binlogs are gone, reset connector offsets and perform a new snapshot.
  • Snapshot takes too long or times out: Large tables require extended snapshot time. Increase snapshot.fetch.size to read more rows per query, or use snapshot.select.statement.overrides to add WHERE clauses limiting snapshot scope. Consider snapshot.mode = schema_only for very large databases where full snapshots are impractical.
  • Memory issues during snapshot: Snapshots of tables with large BLOBs or many columns can exhaust connector memory. Increase Kafka Connect worker memory allocation, reduce snapshot.fetch.size, or exclude BLOB columns using column.exclude.list.
  • Duplicate events after restarts: Usually indicates offset management issues. Debezium stores offsets in Kafka’s connect_offsets topic. Ensure this topic has sufficient retention and replication. After connector crashes, Debezium resumes from last committed offset, which may cause some events to be redelivered. Consumers should implement idempotent processing.

Connector reconfiguration and maintenance requires understanding state management. Changing connector configuration via the REST API (PUT /connectors/mysql-connector/config) updates settings without losing offset state. Deleting and recreating a connector resets offsets, causing a full snapshot. Use the update API for most configuration changes to preserve streaming position.
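For example, updating the table filter in place preserves the stored binlog offsets. Note that PUT /connectors/<name>/config takes the flat configuration map, not the name/config wrapper used at creation time; mysql-connector-config.json here is a hypothetical file containing that map:

curl -X PUT http://localhost:8083/connectors/mysql-connector/config \
  -H "Content-Type: application/json" \
  -d @mysql-connector-config.json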

Handling MySQL failover and HA scenarios requires GTID mode. When MySQL primary fails and replica is promoted, GTID-based tracking allows Debezium to seamlessly continue from the new primary at the correct transaction position. File-position-based tracking requires manual offset adjustment or connector reset during failover.

Performance optimization strategies:

  • Increase parallelism: The MySQL connector always uses a single task because it reads one binlog stream, so raising tasks.max has no effect. To parallelize, split tables across multiple connectors (each with its own database.server.id and topic.prefix) and scale downstream processing through topic partitions and consumer groups.
  • Tune batch sizes: Adjust max.batch.size and max.queue.size to control how many events are batched before writing to Kafka. Larger batches improve throughput but increase latency.
  • Optimize snapshot queries: Use snapshot.select.statement.overrides to add appropriate WHERE clauses for snapshot queries, and make sure supporting indexes exist (see the sketch after this list). Snapshots can strain source databases; optimized queries reduce impact.
  • Schema-only snapshots: When full snapshots are impractical, use snapshot.mode = schema_only to skip data snapshot and immediately begin streaming binlog changes. This works when starting CDC on established databases where historical data isn’t needed.
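As a sketch of the override syntax, one property lists the tables being overridden and a companion property (suffixed with the database and table name) supplies the SELECT used during the snapshot; the inventory.orders table and its created_at filter are illustrative and would be merged into the connector config:

{
  "snapshot.select.statement.overrides": "inventory.orders",
  "snapshot.select.statement.overrides.inventory.orders": "SELECT * FROM inventory.orders WHERE created_at > NOW() - INTERVAL 90 DAY"
}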

Advanced Patterns and Use Cases

Beyond basic binlog streaming, Debezium supports advanced patterns enabling sophisticated data integration and event-driven architectures.

Outbox pattern implementation leverages Debezium for transactional event publishing. Applications write business events to an “outbox” table within the same transaction as business data changes. Debezium captures only the outbox table, typically routed with its outbox event router transformation (io.debezium.transforms.outbox.EventRouter), so downstream consumers receive clean business events. This pattern guarantees event publishing without the dual-write problems inherent in publishing to message brokers separately from database commits.
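A minimal outbox table sketch is shown below. The column names follow the defaults the event router SMT expects, but the names and types are illustrative and can be remapped in the transform configuration:

CREATE TABLE outbox (
  id            BINARY(16)   NOT NULL PRIMARY KEY,  -- event id
  aggregatetype VARCHAR(255) NOT NULL,              -- used by the router to derive the destination topic
  aggregateid   VARCHAR(255) NOT NULL,              -- becomes the Kafka message key
  type          VARCHAR(255) NOT NULL,              -- event type, e.g. OrderCreated
  payload       JSON         NOT NULL               -- event body
);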

Multi-database aggregation streams changes from multiple MySQL instances into unified topics or processing streams. Deploy a separate Debezium connector for each database, using a consistent topic.prefix scheme and topic naming conventions. Downstream consumers process the merged streams, enabling cross-database analytics and consolidation.

Change data for machine learning feeds ML models with real-time feature updates. Stream database changes into feature stores, updating model inputs as source data changes. This enables ML models to operate on current data rather than stale batch-loaded features, improving prediction accuracy for time-sensitive applications.

Event sourcing with Debezium captures database changes as immutable event streams, enabling event sourcing patterns without application-level event publishing. While not pure event sourcing (which stores events as primary record), Debezium-captured changes provide similar benefits—complete history, temporal queries, and replay capability.

Conclusion

Streaming MySQL binlog changes with Debezium transforms database operations into real-time event streams powering analytics, replication, and event-driven architectures. By properly configuring MySQL for row-based binlog replication, deploying Debezium connectors with appropriate snapshot and filtering settings, and building robust consumers that handle all change event types, you create reliable CDC pipelines that maintain data consistency while imposing minimal source database overhead. Understanding the operational aspects—monitoring, troubleshooting, and performance optimization—ensures these pipelines remain healthy in production.

Debezium has matured into production-ready infrastructure trusted by organizations requiring reliable change data capture at scale. Starting with the patterns and configurations outlined here, you can implement CDC solutions that deliver business value through real-time data availability while avoiding the common pitfalls that derail initial CDC projects. The key is progressive implementation—start simple, validate functionality, then expand scope and sophistication as operational experience grows.
