Change Data Capture (CDC) has become essential for modern data architectures that need to keep data warehouses, analytics platforms, and downstream systems synchronized with operational databases in near real-time. Streaming CDC data from MySQL to Amazon S3 creates a powerful foundation for analytics, machine learning, and data lake architectures while maintaining a complete historical record of database changes. This comprehensive guide explores the architectural approaches, implementation strategies, and practical considerations for building reliable CDC pipelines that continuously capture MySQL changes and stream them to S3 in formats optimized for analytical workloads.
Understanding CDC and Why It Matters for MySQL to S3 Pipelines
Change Data Capture represents a fundamental shift from batch-based data replication to continuous streaming of database changes. Traditional approaches take periodic snapshots of database tables—running queries every hour or day to extract current data. This creates staleness issues (data is always hours or days old), performance impacts (full table scans burden production databases), and inability to capture intermediate states (you only see current values, not the history of changes).
CDC instead captures every insert, update, and delete as it happens by reading the database’s transaction log—in MySQL’s case, the binary log (binlog). This approach provides several critical advantages. You get near real-time data propagation with changes appearing in S3 within seconds or minutes rather than hours. The database experiences minimal performance impact since CDC reads logs rather than querying tables. You capture complete change history, seeing not just current state but every modification made to data. And you can reconstruct database state at any point in time by replaying changes.
For MySQL to S3 pipelines specifically, CDC enables building data lakes that maintain complete historical records of operational data. Analytics teams can query current state, analyze trends over time, perform point-in-time analysis, and train machine learning models on historical patterns—all without impacting production MySQL databases. The S3-based architecture provides cost-effective storage at massive scale while supporting diverse analytical tools through open data formats.
MySQL Binlog: The Foundation of CDC
MySQL’s binary log is the mechanism that makes CDC possible. Understanding how binlog works is essential for implementing reliable CDC pipelines and troubleshooting issues when they arise.
How MySQL Binlog Works
MySQL appends every committed data modification to the binlog as part of transaction commit. (Crash recovery itself is handled by InnoDB's redo log; the binlog's role is to provide a durable, ordered record of changes used for replication and point-in-time recovery.) The binlog contains a complete record of database changes: insert operations with full row data, update operations showing before and after values, delete operations with deleted row data, and DDL statements like schema changes.
MySQL supports multiple binlog formats, each with different characteristics. Statement-based logging records the SQL statements executed, which is compact but can be problematic for non-deterministic queries. Row-based logging (the recommended format for CDC) records the actual row changes, which is reliable but generates more data. Mixed format adaptively chooses between statement and row-based logging depending on the operation.
For CDC purposes, row-based binlog format is essential. It provides deterministic, complete records of data changes regardless of how those changes were made. CDC tools read the binlog sequentially, translating binary log events into structured change records that can be processed by downstream systems.
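You can inspect these row events yourself with the mysqlbinlog utility that ships with MySQL; the binlog path below is an example. Decoding row events into human-readable pseudo-SQL shows exactly the raw material CDC tools translate into structured change records:

# Decode row-based binlog events into readable pseudo-SQL
mysqlbinlog --base64-output=decode-rows -vv /var/log/mysql/mysql-bin.000001

Each decoded event appears as commented pseudo-SQL (### INSERT INTO ..., ### UPDATE ... with before and after values for each column).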
Configuring MySQL for CDC
Before implementing CDC, MySQL must be properly configured to generate binlogs in the correct format with sufficient retention. The configuration involves several key parameters that balance CDC requirements with operational considerations.
MySQL configuration requires enabling binlog, setting row-based format, and configuring retention. Here’s a basic configuration:
# Enable binary logging
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
# Use row-based format for CDC reliability
binlog_format = ROW
# Retention period (in seconds) - 7 days
binlog_expire_logs_seconds = 604800
# Optional: Include row images for updates
binlog_row_image = FULL
The server-id must be unique across the replication topology, and CDC tools that connect as replicas need their own unique IDs as well. binlog_format = ROW ensures CDC gets complete row data. The retention period must be long enough to handle downstream failures: if your CDC pipeline goes down for maintenance, the binlogs it needs must still be available when it resumes. Setting binlog_row_image = FULL includes both before and after values for updates, useful for certain analytics use cases.
Production deployments often include additional configurations for binlog encryption, compression, and monitoring. Large-scale systems monitor binlog generation rates to ensure storage can accommodate retention policies and alert on excessive binlog growth that might indicate disk space issues.
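Once the server restarts with these settings, it is worth verifying them before pointing a CDC tool at the database. These standard statements confirm the format and show the current binlog coordinates (in recent MySQL releases, SHOW MASTER STATUS has been renamed SHOW BINARY LOG STATUS):

-- Confirm row-based logging and full row images
SHOW VARIABLES LIKE 'binlog_format';
SHOW VARIABLES LIKE 'binlog_row_image';
-- Current binlog file and position: the starting point for CDC
SHOW MASTER STATUS;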
Architectural Approaches for MySQL CDC to S3
Several architectural patterns can stream CDC data from MySQL to S3, each with different trade-offs around complexity, latency, cost, and operational characteristics. Choosing the right approach depends on your specific requirements and existing infrastructure.
Debezium-Based Architecture
Debezium is an open-source CDC platform that provides robust MySQL connectors capable of reading binlogs and producing change events. A typical Debezium-based architecture uses Kafka as an intermediate streaming layer between MySQL and S3.
The architecture flow works as follows:
• The Debezium MySQL connector reads the binlog and publishes change events to Kafka topics.
• Kafka provides durable buffering and allows multiple consumers.
• The Kafka Connect S3 Sink connector writes events from Kafka to S3.
• S3 objects are organized by topic, partition, and time.
This architecture provides several advantages. Debezium offers mature, well-tested MySQL CDC capabilities with extensive community support. Kafka provides reliability through replication and the ability to replay events if needed. Multiple consumers can read the same change stream for different purposes. The architecture scales horizontally by adding Kafka brokers and connector instances.
However, the complexity and cost are higher due to managing Kafka infrastructure. There’s additional latency from the multi-hop architecture (MySQL → Debezium → Kafka → S3 Sink → S3). The operational overhead includes managing Kafka clusters, Debezium connectors, and Kafka Connect.
A sample Debezium MySQL connector configuration demonstrates the key settings:
{
  "name": "mysql-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "mysql-host.example.com",
    "database.port": "3306",
    "database.user": "cdc_user",
    "database.password": "${file:/secrets/mysql-password.txt:password}",
    "database.server.id": "184054",
    "database.server.name": "production-mysql",
    "database.include.list": "mydb",
    "table.include.list": "mydb.orders,mydb.customers,mydb.products",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.mydb",
    "snapshot.mode": "initial",
    "tombstones.on.delete": "true",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "false"
  }
}
This configuration specifies the MySQL connection details, which tables to capture, how to handle initial snapshots, and transformations to apply to change events. The connector reads binlog continuously and publishes structured change events to Kafka.
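For reference, a Debezium change event for an update to the orders table carries roughly the following envelope before the unwrap transform is applied; the field values here are illustrative, and the source block is abbreviated:

{
  "before": { "id": 1001, "status": "pending" },
  "after":  { "id": 1001, "status": "shipped" },
  "source": { "db": "mydb", "table": "orders", "file": "mysql-bin.000003", "pos": 154 },
  "op": "u",
  "ts_ms": 1705327200000
}

The op field distinguishes creates (c), updates (u), deletes (d), and snapshot reads (r). The ExtractNewRecordState transform configured above flattens this envelope to just the after state, which is usually more convenient for S3 sinks.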
AWS DMS (Database Migration Service) Approach
AWS Database Migration Service provides a managed service that can perform both full load and ongoing CDC from MySQL to S3. This approach simplifies operations by eliminating self-managed infrastructure but offers less flexibility than Debezium-based solutions.
The DMS architecture is straightforward: a DMS replication instance reads the MySQL binlog, processes changes, and writes to S3, with objects organized into folders by table and time prefix. DMS handles both the initial full load and ongoing replication, including failover and recovery.
Benefits include fully managed service with no infrastructure to maintain, built-in monitoring and alerting through CloudWatch, native AWS integration with IAM, encryption, and networking, and easy configuration through console or API. DMS handles many operational complexities automatically—managing checkpoints, handling failures, and coordinating initial load with CDC.
Limitations include less flexibility in output format and transformation compared to custom solutions, potential for higher latency compared to highly optimized custom pipelines, less visibility into internal processing, and costs that can be significant for large-scale replication.
DMS works well for organizations already invested in AWS, needing managed solutions, and prioritizing operational simplicity over maximum flexibility or lowest latency.
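As a point of comparison with the Debezium configuration above, a DMS task is driven largely by a table-mapping document. A minimal selection rule capturing all tables in the same schema might look like the following (rule names are arbitrary):

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-mydb-tables",
      "object-locator": {
        "schema-name": "mydb",
        "table-name": "%"
      },
      "rule-action": "include"
    }
  ]
}

Output format and partitioning are controlled separately on the S3 target endpoint (for example, settings that select Parquet output and add a timestamp column) rather than in the task mapping itself.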
Architecture Comparison: Debezium vs AWS DMS
Debezium + Kafka strengths:
• Full control and flexibility
• Lower latency potential
• Multiple consumers possible
• Rich transformation options
• Open source and extensible

Debezium + Kafka weaknesses:
• Complex infrastructure
• Higher operational burden
• Requires Kafka expertise
• Self-managed scaling

AWS DMS strengths:
• Fully managed service
• Simple configuration
• Built-in monitoring
• Native AWS integration
• Automatic failover

AWS DMS weaknesses:
• Less flexibility
• Potential higher latency
• Limited transformations
• Cost at scale
Data Format and Organization in S3
How you structure and format CDC data in S3 significantly impacts downstream analytics performance, storage costs, and ease of use. Several considerations shape these decisions.
File Format Selection
Parquet has emerged as the preferred format for analytical workloads on S3. This columnar format provides excellent compression (typically 5-10x compared to JSON), efficient querying when you only need specific columns, schema evolution support for handling database schema changes, and broad tool support across Spark, Athena, Presto, and other engines.
JSON is simpler and more flexible but has significant drawbacks: poor compression leading to higher storage costs, inefficient queries that must read entire files, and larger file sizes increasing processing time. JSON works for initial prototypes or when human readability is important, but production systems almost always migrate to Parquet.
Avro provides a middle ground with good compression, schema evolution support, and efficient serialization, though it lacks Parquet’s columnar benefits for analytical queries. Avro works well as an intermediate format in streaming pipelines, with final conversion to Parquet for analytics.
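As a concrete illustration of the Parquet path, here is a minimal sketch using the pyarrow library, assuming change events have already been flattened into dictionaries (values are illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

# A small batch of flattened change events
events = [
    {"id": 1001, "status": "shipped", "op": "u", "ts_ms": 1705327200000},
    {"id": 1002, "status": "pending", "op": "c", "ts_ms": 1705327201000},
]

# Infer a schema from the dictionaries and write a compressed columnar file
table = pa.Table.from_pylist(events)
pq.write_table(table, "orders-batch.parquet", compression="snappy")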
S3 Organization Strategies
Organizing S3 objects effectively enables efficient querying and lifecycle management. The typical organization pattern uses hierarchical prefixes that partition data by table and time:
s3://my-data-lake/cdc/
  database=mydb/
    table=orders/
      year=2024/
        month=01/
          day=15/
            hour=14/
              part-00001.parquet
              part-00002.parquet
This structure enables partition pruning where queries only scan relevant partitions (e.g., “orders for January 2024” only reads that month’s data). Time-based partitioning facilitates lifecycle policies that archive or delete old data. Table-based partitioning allows different retention and access patterns per table.
Partition granularity requires balancing. Hourly partitions work well for high-volume tables with frequent queries filtered by time. Daily partitions suffice for moderate-volume tables. Monthly or yearly partitions make sense for low-volume tables or archival data.
File sizes also matter. Very small files (under 10MB) create excessive overhead and poor query performance. Very large files (over 1GB) reduce parallelism and increase memory requirements. Target file sizes of 100-500MB balance these concerns, achieved through batching CDC events or compaction processes.
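Tying partitioning together in code, a small helper can derive the S3 key for a batch from its table and event time. A sketch, with the prefix layout matching the example above and the function name hypothetical:

from datetime import datetime, timezone

def s3_key(database: str, table: str, event_ts_ms: int, part: int) -> str:
    """Build a Hive-style partitioned key from a change event's commit timestamp."""
    ts = datetime.fromtimestamp(event_ts_ms / 1000, tz=timezone.utc)
    return (
        f"cdc/database={database}/table={table}/"
        f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/hour={ts.hour:02d}/"
        f"part-{part:05d}.parquet"
    )

# s3_key("mydb", "orders", 1705327200000, 1) returns
# "cdc/database=mydb/table=orders/year=2024/month=01/day=15/hour=14/part-00001.parquet"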
Handling Schema Evolution
Database schemas change over time—new columns are added, existing columns are modified, tables are renamed. CDC pipelines must handle these changes gracefully without breaking downstream consumers.
Parquet’s schema evolution capabilities help here. Adding new columns preserves compatibility: old files simply return null for the new columns. Other changes are more challenging, however: dropping columns can break queries that expect them, renaming a column appears as dropping the old one and adding a new one, and changing data types may require reprocessing historical data.
Production CDC pipelines implement schema versioning and validation, with schema changes captured in metadata, notification systems alerting downstream consumers of changes, and validation ensuring transformations handle schema versions correctly. Some organizations maintain multiple versions of Parquet files during transition periods, allowing consumers to migrate gradually.
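For the additive case, pyarrow can reconcile old and new schemas directly. A sketch of merging a pre-change and post-change schema, with hypothetical column names:

import pyarrow as pa

old_schema = pa.schema([("id", pa.int64()), ("status", pa.string())])
new_schema = pa.schema([("id", pa.int64()), ("status", pa.string()), ("channel", pa.string())])

# Additive change: readers using the unified schema see 'channel' as null in old files
merged = pa.unify_schemas([old_schema, new_schema])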
Handling Initial Load and Ongoing CDC
CDC pipelines must address two distinct phases: initial load (capturing current database state) and ongoing CDC (capturing subsequent changes). Each phase has different characteristics and challenges.
Initial Load Strategies
Before streaming changes, the pipeline must capture the current state of tables. This initial snapshot provides the baseline against which changes apply. Several approaches handle initial load with different trade-offs.
Consistent snapshot approaches use transactions or locks to ensure consistency. MySQL’s REPEATABLE READ transaction isolation can snapshot large databases consistently, though this requires holding transactions open for extended periods. For very large databases, this may not be practical.
Incremental snapshot strategies break large tables into chunks, processing them over time. This reduces resource contention but complicates consistency—changes happening during the incremental snapshot require careful coordination to avoid duplicates or gaps.
Debezium handles initial snapshots automatically, using a consistent snapshot approach by default. It briefly acquires a global read lock to record a consistent binlog position, reads table data inside a REPEATABLE READ transaction, and then streams changes from the recorded position once the snapshot completes. This ensures no changes are lost during the transition from snapshot to CDC.
Managing CDC State and Checkpoints
CDC pipelines must track their position in the binlog to resume correctly after failures. This position tracking—often called checkpointing—is critical for reliability.
The checkpoint contains the binlog filename and position (or GTID if enabled), allowing the CDC process to resume from exactly where it stopped. Checkpoints must be stored durably—losing checkpoints requires reprocessing from the beginning or accepting data loss.
Debezium stores checkpoints in Kafka, leveraging Kafka’s reliability. DMS manages checkpoints internally in its control tables. Custom solutions might store checkpoints in DynamoDB, S3, or other durable storage with appropriate consistency guarantees.
Checkpoint frequency balances durability against performance. Checkpointing after every event maximizes durability but incurs overhead. Checkpointing every N seconds or after N events reduces overhead but increases potential duplicate processing after failures. Idempotent processing—where processing the same event multiple times produces the same result—mitigates duplicate concerns.
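For a custom pipeline, a minimal sketch of a durable checkpoint store using S3 via boto3 follows; the bucket and key names are hypothetical:

import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-cdc-state"                 # hypothetical state bucket
KEY = "checkpoints/mysql-cdc.json"

def save_checkpoint(binlog_file: str, position: int) -> None:
    """Persist the current binlog coordinates after a batch has landed in S3."""
    body = json.dumps({"file": binlog_file, "position": position})
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=body.encode("utf-8"))

def load_checkpoint():
    """Return the last saved coordinates, or None on first run."""
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=KEY)
        return json.loads(obj["Body"].read())
    except s3.exceptions.NoSuchKey:
        return None

The ordering matters: write the checkpoint only after the corresponding batch is durably stored, otherwise a failure between the two steps loses data rather than merely reprocessing it.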
[Figure: CDC pipeline data flow]
Monitoring, Alerting, and Troubleshooting
Production CDC pipelines require comprehensive monitoring to ensure reliability, catch issues early, and maintain data quality. Several key metrics and practices form the foundation of operational CDC systems.
Critical Metrics to Monitor
Binlog lag measures the time difference between when changes occur in MySQL and when they’re processed by CDC. Increasing lag indicates the pipeline is falling behind—perhaps due to downstream throughput limits, resource constraints, or processing bottlenecks. Acceptable lag depends on business requirements but typically ranges from seconds to minutes for near real-time use cases.
Processing throughput tracks events per second or records per minute. Monitoring throughput helps capacity planning and identifies degradation. Sudden drops in throughput warrant investigation—they might indicate network issues, resource constraints, or downstream system problems.
Error rates and retry counts reveal processing problems. CDC systems should track failed events, retry attempts, and events sent to dead letter queues. Rising error rates might indicate schema changes, data quality issues, or integration problems with downstream systems.
S3 write metrics track files written, data volume, and write latency. These metrics help optimize batching parameters and identify S3 performance issues. Monitoring S3 storage costs over time ensures the data lake remains cost-effective as volumes grow.
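In a custom pipeline, binlog lag can be derived from each event's commit timestamp and published as a metric. A sketch using boto3 and CloudWatch, with the namespace and metric name being hypothetical:

import time
import boto3

cloudwatch = boto3.client("cloudwatch")

def report_binlog_lag(event_ts_ms: int) -> None:
    """Publish the gap between event commit time and processing time."""
    lag_seconds = max(0.0, time.time() - event_ts_ms / 1000)
    cloudwatch.put_metric_data(
        Namespace="CDCPipeline",          # hypothetical namespace
        MetricData=[{
            "MetricName": "BinlogLagSeconds",
            "Value": lag_seconds,
            "Unit": "Seconds",
        }],
    )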
Common Issues and Resolutions
Binlog retention expiration occurs when the CDC pipeline falls too far behind and MySQL purges needed binlogs. This requires re-initializing the snapshot, potentially causing data gaps. Prevention involves adequate binlog retention, monitoring lag, and alerting before retention limits are reached.
Schema change handling problems arise when MySQL schema changes aren’t properly handled by the CDC pipeline. Symptoms include processing errors, missing columns, or type mismatches. Solutions involve coordination between database teams and data teams, schema change notifications, and validation processes for schema migrations.
Network connectivity issues between MySQL and CDC systems cause intermittent failures and increased lag. Robust retry logic, proper timeout configuration, and network monitoring help mitigate these issues. For cloud deployments, ensuring CDC systems run in the same VPC or region as MySQL reduces network-related problems.
Resource exhaustion happens when CDC systems lack sufficient CPU, memory, or network bandwidth. Symptoms include increasing lag despite stable binlog generation rates. Solutions involve vertical scaling (larger instances), horizontal scaling (more workers), or optimizing processing logic to reduce resource requirements.
Performance Optimization Strategies
Optimizing CDC pipeline performance ensures you can handle peak loads, minimize latency, and reduce infrastructure costs. Several strategies address different bottlenecks.
Parallel Processing
CDC pipelines can parallelize at several levels. Multiple Kafka partitions allow parallel processing of different tables or shards. Multiple Debezium connector instances can read different portions of large tables during initial load. Multiple S3 Sink Connector tasks can write to S3 in parallel.
However, parallelization requires careful consideration of ordering guarantees. Changes to the same record must be processed in order to maintain consistency. Partitioning strategies typically use key-based partitioning—all changes for the same primary key go to the same partition, preserving ordering within a key while allowing parallel processing across keys.
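The idea reduces to hashing the primary key to a partition index so per-key ordering is preserved. A sketch follows; Kafka's default partitioner actually uses murmur2, so the MD5 here is purely for illustration:

import hashlib

def partition_for(primary_key: str, num_partitions: int) -> int:
    """Stable key-to-partition mapping: the same key always lands on the same partition."""
    digest = hashlib.md5(primary_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All changes for orders row 1001 map to one partition, preserving their order
assert partition_for("orders:1001", 8) == partition_for("orders:1001", 8)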
Batching and Buffering
Writing individual change events to S3 immediately creates excessive overhead—small files, high API call costs, and poor query performance. Batching accumulates events before writing, creating larger, more efficient Parquet files.
Batch size balances latency against efficiency. Larger batches improve compression, reduce API calls, and create better file sizes for queries but increase latency. Smaller batches reduce latency but increase overhead. Typical configurations batch by time (e.g., every 5 minutes) or size (e.g., 50MB worth of events), whichever comes first.
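A sketch of the flush-by-size-or-time logic described above, with illustrative thresholds:

import time

class EventBatcher:
    """Accumulate serialized events; flush at max_bytes or max_age_seconds, whichever comes first."""

    def __init__(self, max_bytes: int = 50 * 1024 * 1024, max_age_seconds: float = 300.0):
        self.max_bytes = max_bytes
        self.max_age_seconds = max_age_seconds
        self._events = []
        self._size = 0
        self._started = time.monotonic()

    def add(self, event: bytes):
        """Add one event; return the full batch if a flush threshold was crossed, else None."""
        self._events.append(event)
        self._size += len(event)
        too_big = self._size >= self.max_bytes
        too_old = time.monotonic() - self._started >= self.max_age_seconds
        if too_big or too_old:
            batch, self._events, self._size = self._events, [], 0
            self._started = time.monotonic()
            return batch
        return None

A production version would also flush on a background timer, so that quiet tables do not hold a partial batch indefinitely.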
Compression and Encoding
Parquet supports various compression algorithms—Snappy, Gzip, and Zstandard being most common. Snappy offers good compression with minimal CPU overhead, making it suitable for high-throughput pipelines. Gzip provides better compression ratios but higher CPU cost, useful when storage costs dominate. Zstandard offers a middle ground with excellent compression and reasonable CPU usage.
Column encoding also impacts performance. Dictionary encoding works well for low-cardinality columns, run-length encoding suits columns with consecutive duplicate values, and delta encoding efficiently stores incrementing values like timestamps or IDs.
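With pyarrow, both the compression codec and dictionary encoding are write-time options. A minimal sketch, with the table contents being illustrative:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1001, 1002], "status": ["shipped", "pending"]})
pq.write_table(
    table,
    "orders-batch.parquet",
    compression="zstd",         # or "snappy" for lower CPU, "gzip" for smaller files
    use_dictionary=True,        # dictionary-encode low-cardinality columns
)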
Security and Compliance Considerations
CDC pipelines handle sensitive operational data, requiring robust security controls and compliance measures.
Encryption
Data must be encrypted at rest in S3 using SSE-S3, SSE-KMS, or client-side encryption. The choice depends on compliance requirements and key management preferences. KMS provides granular access control and audit logs but adds latency and cost.
Data in transit should use TLS for all connections: MySQL to CDC system, CDC to Kafka (if used), and CDC to S3. Proper certificate validation and management prevent man-in-the-middle attacks.
Access Control
CDC systems need limited MySQL permissions: REPLICATION SLAVE and REPLICATION CLIENT for reading the binlog, plus SELECT on the captured tables for initial snapshots. Avoid granting excessive privileges that could be exploited if the CDC system is compromised.
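A minimal grant set implementing this, as standard MySQL statements; the user and host are placeholders, and some tools (Debezium's documentation, for instance, also lists RELOAD and SHOW DATABASES for snapshotting) need slightly more:

-- Create a dedicated CDC account with least privilege
CREATE USER 'cdc_user'@'%' IDENTIFIED BY 'choose-a-strong-password';
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'cdc_user'@'%';
GRANT SELECT ON mydb.* TO 'cdc_user'@'%';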
S3 access should follow least-privilege principles. CDC systems get write-only permissions to specific S3 prefixes. Analytical systems get read-only permissions. IAM policies, bucket policies, and S3 access points provide granular control.
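A sketch of a write-only IAM policy scoped to the CDC prefix from the earlier layout; the bucket name is reused from that example, and multipart uploads may also need abort permission:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:AbortMultipartUpload"
      ],
      "Resource": "arn:aws:s3:::my-data-lake/cdc/*"
    }
  ]
}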
Compliance and Auditing
Regulated industries must track data lineage, showing how operational data flows to analytics systems. CDC pipelines should log processing metadata—timestamps, source systems, transformations applied, and data volumes processed. This audit trail supports compliance requirements and troubleshooting.
PII and sensitive data might require special handling—tokenization, encryption, or masking before writing to S3. Some organizations implement field-level encryption where sensitive columns are encrypted while leaving other data queryable. Others use separate pipelines for sensitive and non-sensitive data with different security controls.
Conclusion
Streaming CDC data from MySQL to S3 creates powerful data architectures that combine operational and analytical workloads effectively. Whether using Debezium with Kafka for maximum flexibility or AWS DMS for managed simplicity, the key to success lies in understanding binlog mechanics, choosing appropriate data formats and organization strategies, implementing robust monitoring and error handling, and addressing security and compliance requirements comprehensively. The technical challenges are significant—managing initial loads, handling schema evolution, maintaining checkpoints, and optimizing performance—but the payoff is substantial for organizations needing real-time analytics on operational data.
As you implement CDC pipelines, start with a single table or schema, validate the approach, and gradually expand scope. Invest heavily in monitoring and operational practices early—CDC systems require ongoing attention and tuning to maintain reliability. The patterns and considerations outlined here apply broadly across CDC implementations, though specific details vary based on your architectural choices. With careful implementation and ongoing operational discipline, MySQL to S3 CDC pipelines provide the foundation for modern, real-time data platforms that drive analytical insights without compromising operational database performance.