Data lakes have become the cornerstone of modern analytics architectures, consolidating vast amounts of structured and unstructured data in a cost-effective storage layer. However, keeping these lakes fresh with the latest operational data has traditionally relied on batch ETL processes that introduce significant latency—often hours or even days between when data changes occur in source systems and when those changes become available for analysis. Change Data Capture (CDC) fundamentally transforms this paradigm by streaming database changes to data lakes in near real-time, enabling organizations to make decisions based on current rather than stale data. Implementing CDC for data lakes requires careful architectural decisions around capture mechanisms, data formats, schema evolution, and query optimization that directly impact both operational efficiency and analytical capabilities.
Architectural Patterns for CDC Data Lakes
Building an effective CDC pipeline to a data lake requires understanding the unique challenges that arise from combining streaming change capture with batch-oriented storage systems. Unlike traditional data warehouses that maintain ACID guarantees and support in-place updates, data lakes store immutable files that cannot be modified after writing. This fundamental constraint shapes every architectural decision in your CDC implementation.
The Append-Only Architecture
The most straightforward CDC pattern for data lakes uses an append-only approach where every change event becomes a new record in the lake. When a row updates in your source database, CDC captures both the before and after states and writes a new file containing the change event. Over time, the data lake accumulates a complete history of all changes, providing a comprehensive audit trail but requiring downstream queries to reconstruct current state.
This pattern works naturally with how CDC systems like Debezium produce change events. Each event contains the operation type (INSERT, UPDATE, DELETE), the complete row state after the change, and metadata like timestamp and source table. Your data lake simply stores these events as-is, typically organized by date partitions for efficient querying:
s3://datalake/cdc/database_name/table_name/
    year=2025/month=11/day=08/hour=10/
        changes_001.parquet
        changes_002.parquet
Queries against this structure must handle the change event semantics. To retrieve current state, you need to process all change events for each row, applying them in timestamp order to compute the latest version. This approach provides complete flexibility—you can query state at any historical point—but introduces query complexity and performance overhead.
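A minimal sketch of that reconstruction, assuming Debezium-style columns such as a primary key id, an op code, and a ts_ms operation timestamp, uses a window function to keep only the latest non-delete event per key:

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.appName("cdc-current-state").getOrCreate()

# Append-only change events; the column names (id, op, ts_ms) are assumptions
# about the CDC payload, matching a Debezium-style envelope
changes = spark.read.parquet("s3://datalake/cdc/database_name/table_name/")

latest_event = Window.partitionBy("id").orderBy(col("ts_ms").desc())

current_state = (
    changes
    .withColumn("rn", row_number().over(latest_event))
    .filter("rn = 1")      # keep the most recent event per key
    .filter("op != 'd'")   # drop keys whose latest event is a delete
    .drop("rn")
)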
Merge-On-Read vs Copy-On-Write Strategies
To address the query complexity of pure append-only storage, data lake table formats like Apache Iceberg, Delta Lake, and Apache Hudi implement sophisticated merge strategies that present a simplified table view while storing change events underneath.
Copy-on-write (CoW) eagerly merges changes when they arrive. As CDC events stream into the lake, a background process reads existing data files, merges the changes, and writes new consolidated files that replace the originals. Queries simply read the latest files without needing to understand CDC semantics. This approach optimizes read performance at the cost of write amplification—a small number of changes can trigger rewriting large data files.
Merge-on-read (MoR) postpones the merge work until query time. Change events append to incremental files stored alongside base data files. Queries read both the base files and incremental changes, merging them on-the-fly to produce current state. This approach minimizes write cost but increases query complexity and execution time. The system typically runs periodic compaction jobs that consolidate incremental changes into base files, finding a middle ground between the two extremes.
For CDC pipelines feeding data lakes, merge-on-read is usually the better fit. CDC streams produce continuous small changes that would trigger excessive file rewrites under copy-on-write, while MoR absorbs them cheaply and still delivers reasonably fast queries, especially when compaction runs regularly during off-peak hours.
[Figure: CDC Data Lake Architecture Comparison]
Choosing and Configuring Table Formats
The choice of table format is perhaps the most critical architectural decision for CDC data lakes. While raw Parquet or ORC files on S3 provide basic storage, modern table formats add essential capabilities like ACID transactions, schema evolution, time travel, and optimized query performance.
Apache Hudi for CDC Workloads
Apache Hudi was specifically designed for streaming ingestion into data lakes, making it a natural fit for CDC pipelines. Hudi introduces the concept of record-level indexing that enables efficient upserts—it can quickly locate existing records that need updating rather than scanning entire datasets. This capability dramatically improves CDC performance where most operations are updates to existing rows.
Hudi’s timeline concept maintains metadata about all operations applied to a table, enabling features like incremental queries that retrieve only changes since a specific point in time. This aligns perfectly with CDC use cases where downstream consumers often need to process only new changes rather than full table scans.
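A brief sketch of such an incremental read (the begin instant is a placeholder value, and the table path matches the ingestion example below):

# Pull only records committed after a given instant on the Hudi timeline
begin_time = "20251108100000"  # placeholder commit instant from a prior run

incremental_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", begin_time)
    .load("s3://datalake/tables/orders")
)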
Configure Hudi tables for CDC using the merge-on-read table type with appropriate index strategies:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json

spark = SparkSession.builder \
    .appName("CDC-Hudi-Ingestion") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.hudi.catalog.HoodieCatalog") \
    .getOrCreate()

# CDC dataframe from Kafka/Kinesis
cdc_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "cdc.inventory.orders") \
    .load() \
    .selectExpr("CAST(value AS STRING) AS value")

# Parse CDC events and write to Hudi
# (cdc_schema is the StructType of the CDC payload, defined elsewhere)
parsed_df = cdc_df.select(
    from_json(col("value"), cdc_schema).alias("data")
).select("data.*")

hudi_options = {
    'hoodie.table.name': 'orders',
    'hoodie.datasource.write.recordkey.field': 'order_id',
    'hoodie.datasource.write.precombine.field': 'updated_at',
    'hoodie.datasource.write.partitionpath.field': 'order_date',
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.index.type': 'BLOOM',
    'hoodie.bloom.index.update.partition.path': 'true',
    'hoodie.compact.inline': 'false',
    'hoodie.compact.inline.max.delta.commits': '10'
}

query = parsed_df.writeStream \
    .format("hudi") \
    .options(**hudi_options) \
    .outputMode("append") \
    .option("checkpointLocation", "s3://bucket/checkpoints/orders") \
    .start("s3://datalake/tables/orders")
The recordkey.field identifies unique rows for upserts, while precombine.field resolves conflicts when multiple changes arrive for the same record. The BLOOM index provides a good balance between query performance and update efficiency for most CDC workloads. Disabling inline compaction and running scheduled compaction jobs during maintenance windows prevents compaction overhead from impacting CDC ingestion throughput.
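At read time, merge-on-read tables also let consumers choose between freshness and speed. A sketch of the two query paths, reusing the SparkSession from above:

# Snapshot query: merges base files with pending log files, freshest view
snapshot_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "snapshot")
    .load("s3://datalake/tables/orders")
)

# Read-optimized query: reads only compacted base files, faster but
# changes not yet compacted are invisible
read_optimized_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load("s3://datalake/tables/orders")
)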
Delta Lake Integration
Delta Lake, developed by Databricks, provides similar capabilities with tight integration into the Spark ecosystem. It uses transaction logs to maintain ACID guarantees and supports time travel, schema enforcement, and efficient upserts through its MERGE command.
Delta Lake’s strengths include mature tooling, excellent documentation, and native support in Databricks environments. For CDC, Delta Lake’s merge operation efficiently handles upserts by automatically identifying changed records and updating only affected files:
from delta.tables import DeltaTable
from pyspark.sql.functions import from_json

# Define target Delta table
target_table = DeltaTable.forPath(spark, "s3://datalake/tables/customers")

# CDC stream with operation type
cdc_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "cdc.crm.customers") \
    .load() \
    .selectExpr("CAST(value AS STRING) as json_data")

def merge_to_delta(batch_df, batch_id):
    """Process each micro-batch from the stream."""
    parsed_df = batch_df.select(
        from_json("json_data", cdc_schema).alias("data")
    ).select("data.*")

    # Separate inserts/updates from deletes
    # (Debezium op codes: c = create, u = update, d = delete, r = snapshot read)
    upserts = parsed_df.filter("op != 'd'")
    deletes = parsed_df.filter("op = 'd'")

    # Note: if one micro-batch can contain several events for the same key,
    # deduplicate to the latest event per key first, otherwise the merge
    # fails when multiple source rows match a single target row.

    # Handle upserts
    target_table.alias("target").merge(
        upserts.alias("source"),
        "target.customer_id = source.customer_id"
    ).whenMatchedUpdateAll() \
     .whenNotMatchedInsertAll() \
     .execute()

    # Handle deletes
    if deletes.count() > 0:
        target_table.alias("target").merge(
            deletes.alias("source"),
            "target.customer_id = source.customer_id"
        ).whenMatchedDelete().execute()

# Apply merge logic to streaming CDC data
query = cdc_stream.writeStream \
    .foreachBatch(merge_to_delta) \
    .option("checkpointLocation", "s3://bucket/checkpoints/customers") \
    .start()
This pattern processes CDC events in micro-batches, applying merges that update existing records or insert new ones. Deletes are handled separately to remove records from the Delta table. The checkpoint location ensures exactly-once processing semantics even across failures and restarts.
Handling Schema Evolution in CDC Pipelines
Database schemas evolve—columns get added, data types change, and tables get restructured. Your CDC pipeline must handle these changes gracefully without breaking downstream analytics or requiring manual intervention for every schema modification.
Forward and Backward Compatibility
Schema evolution introduces compatibility challenges. Forward compatibility means old consumers can read new data, while backward compatibility means new consumers can read old data. For data lakes, you typically need both—analysts querying historical data shouldn’t see errors when schemas change, and new queries should handle data written before schema changes.
Additive changes (adding new columns) are generally safe in both directions. Old queries simply ignore new columns they don’t know about, while new queries treat missing columns as null in older files. This makes additive evolution the preferred pattern for CDC data lakes.
Removing or renaming columns breaks backward compatibility—queries referencing removed columns will fail. Data type changes can break both directions depending on the specific change. Widening types (INT to BIGINT) usually works, while narrowing (BIGINT to INT) often causes issues.
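Delta Lake, for example, can absorb additive changes automatically. A minimal sketch, where new_events_df stands in for a batch of parsed CDC events carrying a newly added column:

# Allow MERGE operations to add new source columns to the target table
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# For plain appends, mergeSchema adds new columns instead of failing the write
(new_events_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("s3://datalake/tables/customers"))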
Schema Registry Integration
Implement a schema registry that tracks schema versions and enforces compatibility rules. Apache Avro with Schema Registry or AWS Glue Schema Registry provides centralized schema management:
- Validate changes: Prevent incompatible schema modifications from entering the pipeline
- Version tracking: Maintain complete schema history for each table
- Automatic evolution: Update data lake table schemas when compatible changes occur
- Schema conversion: Apply transformations to normalize data across schema versions
Configure your CDC pipeline to validate incoming events against registered schemas and reject incompatible changes:
from aws_schema_registry import SchemaRegistryClient
from aws_schema_registry.avro import AvroSchema

# Initialize schema registry client (illustrative setup; adapt the client
# construction and the helper functions below to your environment)
schema_registry = SchemaRegistryClient(
    registry_name="cdc-schemas",
    schema_compatibility="BACKWARD"
)

def validate_and_evolve_schema(cdc_event):
    """Validate event against schema registry and evolve if needed."""
    table_name = cdc_event['source']['table']
    schema_name = f"cdc.{table_name}"

    # Get current schema from registry
    current_schema = schema_registry.get_schema_by_id(schema_name)

    # Check whether the event schema matches the registered version
    # (extract_schema_from_event and schemas_compatible are pipeline helpers)
    event_schema = extract_schema_from_event(cdc_event)

    if not schemas_compatible(current_schema, event_schema):
        # Attempt to register a new schema version
        try:
            new_version = schema_registry.register_schema(
                schema_name=schema_name,
                schema_definition=event_schema,
                compatibility_mode="BACKWARD"
            )
            # Evolve the Delta/Hudi table schema to match
            evolve_table_schema(table_name, event_schema)
        except IncompatibleSchemaException as e:
            # Route to dead letter queue for manual resolution
            send_to_dlq(cdc_event, f"Incompatible schema: {e}")
            return None

    return cdc_event
This approach provides guardrails that prevent schema changes from breaking your data lake while still allowing evolution when safe.
Handling Type Coercion and Nullability
CDC events sometimes contain data that doesn’t match expected types—strings in numeric fields, nulls where not expected, or timestamps in inconsistent formats. Your pipeline needs explicit strategies for handling these cases rather than failing silently or propagating bad data.
Implement type coercion with clear rules about when to coerce, when to reject, and when to route to error handling:
- Numeric type widening: Automatically widen INT to BIGINT when values exceed INT range
- String to number conversion: Attempt parsing with fallback to null or error based on strictness settings
- Timestamp normalization: Convert all timestamps to UTC and standardize format
- Null handling: Decide whether nulls in previously non-nullable columns should update the schema or reject the event
These decisions significantly impact data quality in your lake. Overly permissive coercion can introduce subtle data quality issues, while overly strict validation can cause unnecessary processing failures.
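A minimal PySpark sketch of such a coercion pass; the column names are illustrative, and the fallback-to-null behavior relies on Spark's default (non-ANSI) cast semantics:

from pyspark.sql.functions import col, to_utc_timestamp

def apply_coercion_rules(df):
    """Illustrative coercion rules applied to parsed CDC events."""
    return (
        df
        # String-to-number conversion: unparseable values become null
        # rather than failing the pipeline
        .withColumn("amount", col("amount").cast("double"))
        # Numeric widening so large values never overflow downstream
        .withColumn("quantity", col("quantity").cast("bigint"))
        # Timestamp normalization: source assumed to emit local time
        .withColumn("updated_at",
                    to_utc_timestamp(col("updated_at"), "America/New_York"))
    )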
Optimizing Query Performance for CDC Tables
CDC data in lakes presents unique query performance challenges. Unlike traditional tables that get bulk loaded periodically, CDC tables continuously accumulate small files and fragmented data. Without optimization, query performance degrades over time as file counts grow and the merge-on-read overhead increases.
Partitioning Strategies for Time-Series Data
Most CDC workloads exhibit time-series characteristics—recent data gets queried more frequently than historical data, and queries typically filter by time ranges. Design your partitioning scheme to exploit these patterns.
Partition by event timestamp rather than ingestion time. CDC events include both the database operation timestamp and the CDC capture timestamp. Partition by operation timestamp so that queries filtering by “when the data changed” can skip irrelevant partitions efficiently:
s3://datalake/orders/
    year=2025/month=11/day=08/
    year=2025/month=11/day=09/
Avoid excessive partition granularity. Hourly partitions might seem appealing for real-time data, but thousands of tiny partitions create metadata overhead and slow query planning. Daily partitions typically provide the best balance for CDC workloads, with monthly partitions for high-volume tables.
For tables with additional natural partitioning dimensions (geography, customer segments, product categories), use hierarchical partitioning that puts time as the first level. This allows time-based partition pruning to eliminate large swaths of data before considering secondary partitions.
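A short sketch of deriving daily partition columns from the operation timestamp rather than the ingestion time; ts_ms as epoch milliseconds is an assumption about the change-event payload, and parsed_df refers to the parsed CDC events from the ingestion example:

from pyspark.sql.functions import col, date_format, from_unixtime

partitioned_df = (
    parsed_df
    # ts_ms: operation timestamp from the source database, epoch milliseconds
    .withColumn("event_ts",
                from_unixtime(col("ts_ms") / 1000).cast("timestamp"))
    .withColumn("year", date_format("event_ts", "yyyy"))
    .withColumn("month", date_format("event_ts", "MM"))
    .withColumn("day", date_format("event_ts", "dd"))
)

# Write partitioned by operation date so time-range queries prune partitions
(partitioned_df.write
    .format("parquet")
    .mode("append")
    .partitionBy("year", "month", "day")
    .save("s3://datalake/orders/"))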
Compaction and File Sizing
Small files are the enemy of query performance in data lakes. CDC streams produce continuous small files—one per minute or even more frequently depending on your buffering configuration. Left unchecked, tables accumulate tens of thousands of small files that query engines must open individually, overwhelming both metadata systems and I/O capacity.
Implement regular compaction that consolidates small files into optimally-sized files. For Parquet, target file sizes between 128MB and 1GB depending on your typical query patterns. Larger files work well for full table scans, while smaller files benefit point queries.
Schedule compaction during low-activity periods to avoid competing with CDC ingestion:
- Daily compaction: Consolidate previous day’s files after the day completes
- Weekly compaction: Further consolidate older data that’s no longer changing
- Partition-aware compaction: Compact only partitions with excessive file counts rather than the entire table
Delta Lake and Hudi provide built-in compaction commands that understand their table formats’ specific requirements. Run these regularly as part of your data lake maintenance workflows.
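For Delta Lake, a scheduled maintenance job might look roughly like the following sketch; the vacuum retention of 168 hours (7 days) is illustrative:

from delta.tables import DeltaTable

customers = DeltaTable.forPath(spark, "s3://datalake/tables/customers")

# Bin-pack small files into larger ones; a .where("partition predicate")
# call can restrict compaction to partitions with excessive file counts
customers.optimize().executeCompaction()

# Remove files no longer referenced by the table (older than 7 days)
customers.vacuum(168)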
⚡ Performance Optimization Checklist
- Run daily compaction on active partitions
- Clean up old files post-compaction
- Monitor file count per partition
- Use daily partitions for most CDC workloads
- Put time first in the partition hierarchy
- Avoid excessive partition granularity
- Schedule compaction off-peak
- Tune compaction frequency against query latency
- Monitor merge file ratios

Common pitfalls to watch for:
- Over-partitioning: hourly or minute-level partitions create excessive metadata overhead.
- Missing indexes: Hudi BLOOM indexes significantly speed upserts; don't disable them without cause.
- Unbounded history: implement time-based cleanup or archival for old CDC events to control storage costs.
Managing Deletes and Data Retention
CDC captures DELETE operations, but data lakes don’t traditionally support deleting data—they’re append-only by design. This creates tension between maintaining accurate current state and the immutability that makes data lakes reliable and cost-effective.
Hard Delete vs Soft Delete Patterns
For regulatory compliance or data privacy requirements (GDPR’s right to be forgotten), you must actually remove data from your lake. Modern table formats support this through targeted file rewrites. When a CDC DELETE event arrives, the table format identifies which files contain the deleted record, rewrites those files without the record, and atomically swaps them in.
This operation is expensive—potentially rewriting gigabytes of data to delete a single record. Batch these operations by accumulating DELETE events and processing them periodically rather than immediately. Daily or weekly delete processing windows work well for most scenarios, providing reasonable compliance response times while minimizing rewrite costs.
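A minimal sketch of that batching pattern with Delta Lake; the staging path for accumulated delete events is hypothetical:

from delta.tables import DeltaTable

# Delete requests accumulated since the last compliance window
pending_deletes = spark.read.parquet("s3://datalake/staging/customer_deletes/")

customers = DeltaTable.forPath(spark, "s3://datalake/tables/customers")

# A single merge rewrites each affected file once, regardless of how many
# delete requests landed during the window
customers.alias("target").merge(
    pending_deletes.alias("source"),
    "target.customer_id = source.customer_id"
).whenMatchedDelete().execute()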
Soft deletes offer an alternative where you never physically remove data but mark it as deleted. Add a deleted_at timestamp column that queries filter out:
SELECT * FROM orders
WHERE deleted_at IS NULL
AND order_date >= '2025-01-01'
This approach is fast—no file rewrites needed—and preserves complete audit history. However, you must ensure all queries consistently filter deleted records, and you accumulate deleted data indefinitely, increasing storage costs.
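In the streaming merge shown earlier, the same effect comes from routing CDC delete events to an update rather than a physical delete; a sketch, assuming the target table carries a nullable deleted_at column:

from delta.tables import DeltaTable
from pyspark.sql.functions import current_timestamp

customers = DeltaTable.forPath(spark, "s3://datalake/tables/customers")

# Mark the row as deleted instead of removing it (deletes is the dataframe
# of CDC delete events from the merge example)
customers.alias("target").merge(
    deletes.alias("source"),
    "target.customer_id = source.customer_id"
).whenMatchedUpdate(
    set={"deleted_at": current_timestamp()}
).execute()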
Time-Based Data Retention
Even without explicit DELETE operations, CDC tables grow without bound as changes accumulate. Implement retention policies that archive or remove old change events based on business requirements.
For tables where only current state matters, compact away historical changes older than your defined retention period. Keep recent changes for time travel and rollback capabilities, but consolidate older partitions to contain only the final state:
- Recent data (0-7 days): Full CDC history for debugging and rollback
- Medium-term (7-90 days): Daily snapshots with change history compacted
- Long-term (90+ days): Monthly snapshots or archive to cheaper storage tiers
This tiered approach balances analytical flexibility against storage costs, providing detailed history when needed while avoiding unbounded growth.
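On the Hudi side, how much change history stays on the active timeline is governed by cleaner and archival settings. A hedged sketch of the relevant writer options, with illustrative values, merged into the hudi_options from the ingestion example:

retention_options = {
    # Clean old file versions once they fall outside the last N commits
    'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
    'hoodie.cleaner.commits.retained': '168',
    # Move older timeline metadata into the archived timeline
    'hoodie.keep.min.commits': '200',
    'hoodie.keep.max.commits': '250',
}

hudi_options.update(retention_options)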
Conclusion
Implementing CDC for data lakes transforms them from batch-oriented repositories into near real-time analytical platforms capable of supporting operational analytics and time-sensitive decision-making. The key architectural decisions—choosing between merge-on-read and copy-on-write, selecting appropriate table formats, and designing effective schema evolution strategies—directly determine whether your implementation delivers the performance and reliability that downstream analytics require. Modern table formats like Apache Hudi and Delta Lake have matured to the point where they handle most of the complexity, but understanding their trade-offs remains essential for optimal results.
Success with CDC data lakes comes from treating them as living, evolving systems rather than static repositories. Regular compaction, proactive schema management, intelligent partitioning, and thoughtful retention policies ensure that your lake continues performing well as data volumes grow and usage patterns evolve. By implementing these practices from the start, you build a foundation that scales from initial deployment through years of production use, providing the real-time insights modern businesses demand without sacrificing the cost efficiency and flexibility that make data lakes compelling.