Change Data Capture has evolved from a specialized database replication technique into a fundamental pattern for modern data architectures. Building production-grade CDC pipelines on AWS requires orchestrating multiple services—DMS for change capture, Kinesis or MSK for streaming, Lambda or Glue for transformation, and S3 or data warehouses for storage. The complexity lies not in any single component but in architecting these pieces into resilient, performant systems that handle real-world failure modes gracefully while maintaining data consistency.
This article walks through building complete CDC pipelines on AWS, from source database configuration through target data consumption, focusing on architectural decisions, implementation patterns, and operational considerations that determine success in production environments.
Architecture Overview and Component Selection
A complete CDC pipeline on AWS consists of several distinct layers, each addressing specific concerns in the data flow from source databases to analytical systems.
The capture layer extracts changes from source databases. AWS Database Migration Service serves as the primary tool here, reading database transaction logs and converting changes into structured events. DMS supports diverse source databases—PostgreSQL, MySQL, Oracle, SQL Server—handling the database-specific details of log parsing and change identification.
The streaming layer transports changes from DMS to downstream consumers. Amazon Kinesis Data Streams provides managed streaming infrastructure with automatic scaling and durability. Alternatively, Amazon MSK (Managed Streaming for Apache Kafka) offers Kafka-compatible streaming for organizations with existing Kafka expertise or requiring Kafka-specific features like exactly-once semantics.
The transformation layer processes raw change events into formats suitable for analytical queries. AWS Lambda provides serverless transformation for moderate throughput scenarios, while AWS Glue handles larger-scale transformations with its distributed Spark environment. This layer applies business logic, data quality rules, and format conversions.
The storage layer persists processed changes for querying. Amazon S3 serves as the primary data lake storage, organizing data into partitioned structures optimized for query engines. Amazon Redshift, Athena, or third-party warehouses consume this data for analytical workloads.
Architectural pattern selection:
- DMS → Kinesis → Lambda → S3 → Athena: Ideal for moderate throughput scenarios (under 10,000 changes per second) where serverless simplicity is prioritized. Lambda transforms events from Kinesis, writes to S3, and Athena queries the data lake.
- DMS → MSK → Kafka Connect → S3 → Redshift: Suited for high-throughput scenarios requiring exactly-once semantics and complex transformations. Kafka Connect sinks provide battle-tested integrations with data warehouses.
- DMS → Kinesis Firehose → S3 → Glue/Athena: Optimized for scenarios prioritizing operational simplicity over transformation complexity. Firehose automatically batches and delivers data to S3 with basic transformations.
The architecture you choose depends on throughput requirements, transformation complexity, operational maturity, and existing infrastructure investments. Organizations without streaming expertise often start with the DMS → Kinesis → Lambda pattern for its simplicity, migrating to Kafka-based architectures as requirements grow.
🏗️ Architecture Decision
Start simple and evolve complexity as needs grow. Many organizations over-engineer initial CDC implementations with Kafka, Spark, and complex transformations when simpler DMS → Kinesis → Lambda → S3 pipelines would serve initial requirements perfectly. Add complexity incrementally as you encounter actual bottlenecks.
Source Database Configuration for CDC
Before implementing CDC pipelines, source databases require proper configuration to support change capture. Each database engine has specific requirements that must be satisfied or CDC will fail silently or produce incomplete change streams.
PostgreSQL CDC configuration requires enabling logical replication and configuring replication slots. Logical replication publishes database changes in a format DMS can consume. Edit postgresql.conf to set wal_level = logical and restart PostgreSQL. Additionally, set max_replication_slots to a value exceeding the number of DMS tasks you’ll run (typically start with 10) and max_wal_senders similarly.
Create a replication user with appropriate permissions:
CREATE USER dms_user WITH REPLICATION LOGIN PASSWORD 'secure_password';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO dms_user;
GRANT USAGE ON SCHEMA public TO dms_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO dms_user;
For RDS PostgreSQL instances, these settings are configured through parameter groups. Create a custom parameter group, modify the rds.logical_replication parameter to 1, and apply to your instance. This requires an instance reboot.
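The same RDS change can be scripted with boto3. This is a minimal sketch under assumed names (the parameter group cdc-postgres15 and the instance orders-db are hypothetical); the actual parameter group family must match your engine version.
import boto3

rds = boto3.client('rds')

# Create a custom parameter group for the instance's engine family.
rds.create_db_parameter_group(
    DBParameterGroupName='cdc-postgres15',
    DBParameterGroupFamily='postgres15',
    Description='Logical replication settings for DMS CDC'
)

# Enable logical replication; this is a static parameter, so it only takes
# effect after the instance reboots.
rds.modify_db_parameter_group(
    DBParameterGroupName='cdc-postgres15',
    Parameters=[{
        'ParameterName': 'rds.logical_replication',
        'ParameterValue': '1',
        'ApplyMethod': 'pending-reboot'
    }]
)

# Attach the parameter group to the instance, then reboot during a
# maintenance window to apply the change.
rds.modify_db_instance(
    DBInstanceIdentifier='orders-db',
    DBParameterGroupName='cdc-postgres15'
)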
MySQL CDC configuration depends on binary logging. Ensure binary logging is enabled with row-based format: binlog_format = ROW. Set binlog_row_image = FULL to capture complete before and after images of changed rows. Configure retention to ensure logs remain available long enough for DMS to process them—typically 24 hours minimum via binlog_expire_logs_seconds = 86400 (or expire_logs_days = 1 on MySQL versions before 8.0).
Create a CDC user with replication privileges:
CREATE USER 'dms_user'@'%' IDENTIFIED BY 'secure_password';
GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'dms_user'@'%';
FLUSH PRIVILEGES;
Critical configuration considerations:
- Log retention: Transaction logs must remain available until DMS processes them. If logs rotate before DMS reads them, changes are permanently lost. For production systems, configure retention that exceeds worst-case replication lag by at least 3-5x. Even if typical lag is only 5 minutes, retain logs for at least an hour, preferably several hours.
- Replication slot management: PostgreSQL replication slots prevent log deletion until consumers read them. However, abandoned replication slots can cause disk exhaustion if DMS tasks fail and slots aren’t cleaned up. Monitor replication slot lag and implement alerts when slots fall too far behind or when disk usage approaches limits (see the monitoring sketch after this list).
- Performance impact: Logical replication and binary logging impose minimal overhead on properly-provisioned databases—typically under 5% CPU and I/O increase. However, inadequate storage IOPS for log writes can create bottlenecks. Ensure sufficient provisioned IOPS for transaction logs separate from data file IOPS.
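A minimal sketch of the slot monitoring described above, assuming a psycopg2 connection to the source and a hypothetical retained-WAL threshold. In practice, wire the alert into SNS or a CloudWatch custom metric rather than printing it.
import psycopg2  # assumed driver; any PostgreSQL client library works

# Connection details are placeholders for this sketch.
conn = psycopg2.connect(host='source-db.example.com', dbname='appdb',
                        user='dms_user', password='secure_password')

RETAINED_BYTES_THRESHOLD = 10 * 1024 ** 3  # alert above ~10 GB of retained WAL

with conn, conn.cursor() as cur:
    # restart_lsn marks the oldest WAL a slot still pins on disk; the
    # difference from the current WAL position is how much it retains.
    cur.execute("""
        SELECT slot_name,
               active,
               pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
        FROM pg_replication_slots
    """)
    for slot_name, active, retained_bytes in cur.fetchall():
        if not active or retained_bytes > RETAINED_BYTES_THRESHOLD:
            print(f'ALERT: slot {slot_name} inactive={not active}, '
                  f'retained={retained_bytes / 1024 ** 3:.1f} GB')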
Implementing DMS Tasks for Change Capture
DMS tasks bridge source databases and streaming infrastructure, reading changes and publishing them to targets. Proper task configuration determines pipeline reliability and performance.
Creating the DMS replication instance establishes the compute environment where tasks execute. Instance sizing depends on throughput requirements and transformation complexity. Start with dms.c5.large for moderate workloads (under 5,000 changes per second), scaling to dms.c5.xlarge or larger for higher throughput. Compute-optimized instances outperform general-purpose instances for CDC workloads.
Allocate sufficient storage—DMS uses local storage for caching changes during full load and when targets fall behind. Allocate at least 100GB, increasing based on expected lag scenarios. If targets might be offline for hours, calculate storage needed to buffer that period’s changes.
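Provisioning the replication instance can be scripted with boto3. A sketch with assumed identifiers and subnet group name:
import boto3

dms = boto3.client('dms')

# The identifier and subnet group are placeholders.
response = dms.create_replication_instance(
    ReplicationInstanceIdentifier='cdc-replication-instance',
    ReplicationInstanceClass='dms.c5.large',   # start moderate, scale up later
    AllocatedStorage=100,                      # GB of local change-cache storage
    ReplicationSubnetGroupIdentifier='cdc-subnet-group',
    MultiAZ=True,                              # failover support for production
    PubliclyAccessible=False
)
print(response['ReplicationInstance']['ReplicationInstanceArn'])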
Configuring endpoints requires precise connection details and authentication. Source endpoints connect to databases using appropriate JDBC connection strings, credentials, and SSL settings. For Kinesis targets, endpoints specify stream names, IAM roles for access, and message formatting options.
Create a Kinesis target endpoint specifying JSON message format for flexibility:
{
  "EndpointIdentifier": "kinesis-target",
  "EndpointType": "target",
  "EngineName": "kinesis",
  "KinesisSettings": {
    "StreamArn": "arn:aws:kinesis:us-east-1:123456789012:stream/cdc-stream",
    "MessageFormat": "json",
    "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-kinesis-role",
    "IncludeTransactionDetails": true,
    "IncludeTableAlterOperations": true,
    "IncludePartitionValue": true
  }
}
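The same endpoint can be created programmatically. A boto3 sketch mirroring the settings above, with the ARNs kept as the same placeholders:
import boto3

dms = boto3.client('dms')

dms.create_endpoint(
    EndpointIdentifier='kinesis-target',
    EndpointType='target',
    EngineName='kinesis',
    KinesisSettings={
        'StreamArn': 'arn:aws:kinesis:us-east-1:123456789012:stream/cdc-stream',
        'MessageFormat': 'json',
        'ServiceAccessRoleArn': 'arn:aws:iam::123456789012:role/dms-kinesis-role',
        'IncludeTransactionDetails': True,
        'IncludeTableAlterOperations': True,
        'IncludePartitionValue': True
    }
)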
Designing table mappings and transformation rules controls which data DMS captures and how it’s transformed. Table mappings define which schemas and tables to include or exclude. Transformation rules modify data during replication—renaming columns, filtering rows, or converting data types.
A typical table mapping configuration:
{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-public-schema",
      "object-locator": {
        "schema-name": "public",
        "table-name": "%"
      },
      "rule-action": "include"
    },
    {
      "rule-type": "transformation",
      "rule-id": "2",
      "rule-name": "add-prefix",
      "rule-target": "table",
      "object-locator": {
        "schema-name": "public",
        "table-name": "%"
      },
      "rule-action": "add-prefix",
      "value": "cdc_"
    }
  ]
}
Task settings optimization significantly impacts performance and reliability. Enable CloudWatch logging for visibility into task execution. Configure FullLoadSettings to control parallel table loading during initial full load. Set ChangeProcessingTuning parameters like BatchApplyEnabled to improve CDC throughput for high-volume scenarios.
Critical task settings to configure (a task-creation sketch using some of these settings follows the list):
- TargetMetadata.SupportLobs: Disable if tables don’t contain LOB columns to improve performance
- ChangeProcessingTuning.BatchApplyEnabled: Enable for significantly improved write throughput on targets
- ChangeProcessingTuning.BatchApplyTimeoutMin/Max: Control how long DMS accumulates changes before applying batches
- Logging.EnableLogging: Always enable with LogComponents set to capture detailed troubleshooting information
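Putting the pieces together, a boto3 sketch of task creation. The ARNs are placeholders and the settings document is intentionally partial, so omitted keys keep DMS defaults; the batch-apply tuning from the list above can be added in the same way.
import json

import boto3

dms = boto3.client('dms')

# Partial task settings; anything not listed keeps the DMS default.
task_settings = {
    'TargetMetadata': {'SupportLobs': False},   # no LOB columns in scope
    'Logging': {'EnableLogging': True},
    'ChangeProcessingTuning': {
        'BatchApplyTimeoutMin': 1,
        'BatchApplyTimeoutMax': 30
    }
}

# Minimal table mapping; use the fuller selection/transformation rules shown earlier.
table_mappings = {
    'rules': [{
        'rule-type': 'selection', 'rule-id': '1', 'rule-name': 'include-public',
        'object-locator': {'schema-name': 'public', 'table-name': '%'},
        'rule-action': 'include'
    }]
}

dms.create_replication_task(
    ReplicationTaskIdentifier='orders-cdc-task',
    SourceEndpointArn='arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE',
    TargetEndpointArn='arn:aws:dms:us-east-1:123456789012:endpoint:TARGET',
    ReplicationInstanceArn='arn:aws:dms:us-east-1:123456789012:rep:INSTANCE',
    MigrationType='full-load-and-cdc',          # initial load, then ongoing CDC
    TableMappings=json.dumps(table_mappings),
    ReplicationTaskSettings=json.dumps(task_settings)
)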
Building the Streaming and Transformation Layer
With DMS capturing changes to Kinesis, the next layer processes these raw change events into analytical formats. This transformation layer bridges operational change streams and analytical query patterns.
Kinesis stream configuration requires capacity planning and shard allocation. Each shard supports 1,000 records per second or 1MB/sec incoming, and 2MB/sec outgoing. For a database generating 5,000 changes per second with average 1KB change events, provision at least 5 shards for incoming data (5,000 records / 1,000 per shard) with headroom for bursts.
Enable enhanced fan-out if multiple consumers process the same stream—each consumer gets dedicated 2MB/sec throughput rather than sharing capacity. This prevents consumer interference where one slow consumer impacts others.
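A provisioning sketch for the stream and one enhanced fan-out consumer; the stream name matches the placeholder used elsewhere in this article and the consumer name is hypothetical.
import boto3

kinesis = boto3.client('kinesis')

# ~5,000 changes/sec at ~1 KB each needs 5 shards; provision headroom for bursts.
kinesis.create_stream(StreamName='cdc-stream', ShardCount=8)

# Wait until the stream is active before registering consumers.
kinesis.get_waiter('stream_exists').wait(StreamName='cdc-stream')

stream_arn = kinesis.describe_stream_summary(
    StreamName='cdc-stream')['StreamDescriptionSummary']['StreamARN']

# Enhanced fan-out: each registered consumer gets a dedicated 2 MB/sec pipe.
kinesis.register_stream_consumer(StreamARN=stream_arn,
                                 ConsumerName='cdc-transformer')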
Lambda transformation functions process Kinesis records, applying business logic and formatting for storage. A typical transformation function reads change events, extracts operation type (insert/update/delete), flattens nested structures, and writes formatted records to S3.
import base64
import json
from datetime import datetime

import boto3

s3 = boto3.client('s3')
BUCKET = 'cdc-processed-data'

def lambda_handler(event, context):
    processed_records = []
    for record in event['Records']:
        # Decode the base64-encoded Kinesis payload
        payload = json.loads(base64.b64decode(record['kinesis']['data']))
        # Extract metadata and data
        operation = payload['metadata']['operation']
        table_name = payload['metadata']['table-name']
        timestamp = payload['metadata']['timestamp']
        data = payload['data']
        # Transform based on operation type
        if operation in ['insert', 'update']:
            transformed = {
                'operation': operation,
                'table': table_name,
                'timestamp': timestamp,
                'record': data
            }
            processed_records.append(transformed)
        elif operation == 'delete':
            # For deletes, we might only have keys
            transformed = {
                'operation': 'delete',
                'table': table_name,
                'timestamp': timestamp,
                'keys': data
            }
            processed_records.append(transformed)
    # Batch write to S3
    if processed_records:
        date_partition = datetime.now().strftime('%Y/%m/%d')
        key = f'processed/{date_partition}/{context.aws_request_id}.json'
        s3.put_object(
            Bucket=BUCKET,
            Key=key,
            Body='\n'.join(json.dumps(r) for r in processed_records)
        )
    return {'statusCode': 200, 'processed': len(processed_records)}
Error handling and dead letter queues ensure data isn’t lost when transformations fail. Configure Lambda to retry failed records with exponential backoff. After maximum retries, send failed records to an SQS dead letter queue for investigation. This prevents failed records from blocking stream processing while preserving them for later analysis.
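These retry and dead-letter settings live on the Kinesis event source mapping rather than in the function code. A boto3 sketch with placeholder ARNs and function name:
import boto3

lambda_client = boto3.client('lambda')

lambda_client.create_event_source_mapping(
    EventSourceArn='arn:aws:kinesis:us-east-1:123456789012:stream/cdc-stream',
    FunctionName='cdc-transformer',
    StartingPosition='LATEST',
    BatchSize=200,
    MaximumBatchingWindowInSeconds=5,
    BisectBatchOnFunctionError=True,      # split batches to isolate poison records
    MaximumRetryAttempts=3,
    MaximumRecordAgeInSeconds=3600,       # give up on records older than an hour
    DestinationConfig={
        'OnFailure': {
            # Failed batch metadata lands here for later inspection and replay.
            'Destination': 'arn:aws:sqs:us-east-1:123456789012:cdc-dlq'
        }
    }
)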
Glue for complex transformations handles scenarios where Lambda’s 15-minute timeout or memory limits are insufficient. Glue jobs run Apache Spark clusters that can process arbitrarily large batches of change events, applying complex transformations, joins with reference data, or aggregations.
A Glue job might read accumulated change events from S3, join with dimension tables to denormalize data, apply data quality rules, and write to Parquet format partitioned by date for efficient querying:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import col, dayofmonth, month, to_timestamp, year
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read raw CDC events from S3
cdc_events = spark.read.json('s3://cdc-raw/events/')

# Filter for specific operations and tables
inserts_updates = cdc_events.filter(
    (cdc_events.operation.isin(['insert', 'update'])) &
    (cdc_events.table == 'orders')
)

# Extract nested record data and derive date partition columns from the event timestamp
flattened = (inserts_updates
    .select('operation', 'timestamp', 'record.*')
    .withColumn('event_ts', to_timestamp(col('timestamp')))
    .withColumn('year', year('event_ts'))
    .withColumn('month', month('event_ts'))
    .withColumn('day', dayofmonth('event_ts'))
    .drop('event_ts'))

# Write partitioned by date
flattened.write.partitionBy('year', 'month', 'day') \
    .mode('append') \
    .parquet('s3://cdc-processed/orders/')

job.commit()
⚙️ Transformation Strategy
Keep transformations simple initially. Many CDC implementations fail by attempting complex transformations in real-time. Start by storing raw change events, then add transformations incrementally. This provides fallback options when transformations fail and allows reprocessing historical data with improved logic.
Storage Layer Design and Optimization
The storage layer organizes processed change events for efficient querying while maintaining the complete history needed for analytical workloads and auditing.
S3 bucket structure and partitioning dramatically impact query performance. Organize data hierarchically by table, then by date partitions, enabling partition pruning during queries. A typical structure: s3://bucket/table=orders/year=2024/month=11/day=09/file.parquet.
This partitioning allows queries filtering by date to scan only relevant partitions rather than entire datasets. Athena and other query engines automatically leverage partition information to minimize scanned data, directly reducing query costs and latency.
File formats and compression balance storage costs, query performance, and write throughput. Columnar formats like Parquet or ORC dramatically reduce storage requirements (typically 60-80% compression) while enabling efficient column-level scanning for analytical queries. JSON provides simplicity and human-readability but uses 3-5x more storage and scans less efficiently.
Choose Parquet for production CDC data lakes. It provides excellent compression, efficient querying, and wide tool support. Use Snappy compression within Parquet files for balance between compression ratio and CPU overhead.
Handling updates and deletes in append-only storage like S3 requires specific patterns. CDC captures insert, update, and delete operations, but S3 doesn’t support in-place updates. Several approaches handle this:
- Append-only with operation type: Store all change events including updates and deletes. Queries filter for latest operation per primary key. Simple but queries become expensive as change volume grows.
- Merge on read: Store all changes, but query-time views apply merge logic to produce current state. Hudi, Iceberg, and Delta Lake frameworks handle this automatically, materializing latest record versions during queries.
- Periodic compaction: Background jobs periodically read accumulated change events, apply updates and deletes to produce snapshot tables containing only current state, then replace old data. Balances storage efficiency with query simplicity (see the sketch after this list).
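A minimal PySpark compaction sketch, assuming the column names produced by the earlier transformation ('operation', 'timestamp'), a hypothetical primary key column 'order_id', and hypothetical S3 paths:
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.appName('cdc-compaction').getOrCreate()

# Read the accumulated change events for one table.
changes = spark.read.parquet('s3://cdc-processed/orders/')

# Rank change events per key, newest first, then keep only the latest version.
latest = (changes
    .withColumn('rn', row_number().over(
        Window.partitionBy('order_id').orderBy(col('timestamp').desc())))
    .filter(col('rn') == 1)
    .drop('rn'))

# Drop keys whose latest event is a delete, leaving only current rows.
snapshot = latest.filter(col('operation') != 'delete')

# Overwrite the snapshot table; queries against it see only current state.
snapshot.write.mode('overwrite').parquet('s3://cdc-snapshots/orders/')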
Glue Data Catalog integration provides schema management and partition tracking. Register CDC tables in the Glue Catalog, defining schemas and partition keys. Catalog integration enables Athena, Redshift Spectrum, and EMR to query CDC data without manual table definitions.
Use Glue Crawlers to automatically discover partitions as CDC processes create new date partitions. Configure crawlers to run hourly or daily, updating catalog partition information so queries access latest data.
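A boto3 sketch of such a crawler, with assumed role, database, and path names:
import boto3

glue = boto3.client('glue')

glue.create_crawler(
    Name='cdc-orders-crawler',
    Role='arn:aws:iam::123456789012:role/glue-crawler-role',
    DatabaseName='cdc_lake',
    Targets={'S3Targets': [{'Path': 's3://cdc-processed/orders/'}]},
    # An hourly run picks up new date partitions as CDC data arrives.
    Schedule='cron(0 * * * ? *)',
    SchemaChangePolicy={'UpdateBehavior': 'UPDATE_IN_DATABASE',
                        'DeleteBehavior': 'LOG'}
)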
Monitoring, Alerting, and Operational Management
Production CDC pipelines require comprehensive monitoring to detect failures, performance degradation, and data quality issues before they impact downstream systems.
DMS task monitoring tracks replication health and performance. Critical CloudWatch metrics include:
- CDCLatencySource/Target: Measures replication lag in seconds. Alert when lag exceeds thresholds (typically 60-300 seconds depending on SLA requirements).
- FullLoadThroughputRowsSource/Target: Monitors full load progress measuring rows per second being transferred.
- CDCIncomingChanges: Shows change arrival rate, helping identify when source database activity overwhelms replication capacity.
- ReplicationTaskStatus: Tracks task state transitions. Alert on transitions to failed or stopped states.
Set up CloudWatch alarms for these metrics with SNS notifications to operations teams. Configure escalating alerts—warning for minor lag increases, critical alerts for severe lag or task failures.
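An example alarm on CDCLatencyTarget, assuming the replication instance and task identifiers used earlier and a hypothetical SNS topic:
import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='cdc-target-latency-critical',
    Namespace='AWS/DMS',
    MetricName='CDCLatencyTarget',
    Dimensions=[
        {'Name': 'ReplicationInstanceIdentifier', 'Value': 'cdc-replication-instance'},
        {'Name': 'ReplicationTaskIdentifier', 'Value': 'orders-cdc-task'}
    ],
    Statistic='Average',
    Period=60,
    EvaluationPeriods=5,           # five consecutive minutes over threshold
    Threshold=300,                 # seconds of target-side lag
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:cdc-ops-alerts'],
    TreatMissingData='breaching'   # a silent task is as bad as a lagging one
)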
Kinesis stream monitoring ensures streaming layer health. Monitor IncomingRecords, IncomingBytes, and WriteProvisionedThroughputExceeded metrics. Throughput exceeded errors indicate insufficient shard capacity requiring shard splitting. Monitor GetRecords.IteratorAgeMilliseconds to track consumer lag—how far behind consumers are from stream head.
Lambda function monitoring reveals transformation layer issues. Track invocation errors, duration, and throttling metrics. High error rates indicate transformation logic bugs or malformed input data. Approaching timeout thresholds suggests computational complexity exceeds Lambda’s limits, necessitating migration to Glue.
Data quality monitoring validates CDC pipeline correctness. Implement checks comparing source and target record counts, checksums, or sample data. Schedule validation jobs that query source databases and S3/warehouse data, alerting on discrepancies exceeding thresholds.
A simple validation pattern (sketched in code after this list):
- Query source database for row counts and aggregate metrics per table
- Query processed CDC data for corresponding counts and metrics
- Calculate discrepancies and alert if exceeding tolerance (typically 0.1-1%)
- Sample random records and compare values between source and target
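A condensed sketch of the first three steps, comparing a source row count against an Athena count over a hypothetical snapshot table. Connection details, table names, and the results bucket are assumptions.
import time

import boto3
import psycopg2  # assumed PostgreSQL driver; swap for your source database

TOLERANCE = 0.001  # 0.1% discrepancy tolerance

# Source-side count.
with psycopg2.connect(host='source-db.example.com', dbname='appdb',
                      user='dms_user', password='secure_password') as conn:
    with conn.cursor() as cur:
        cur.execute('SELECT count(*) FROM public.orders')
        source_count = cur.fetchone()[0]

# Target-side count via Athena over the compacted snapshot table.
athena = boto3.client('athena')
query = athena.start_query_execution(
    QueryString='SELECT count(*) FROM cdc_lake.orders_snapshot',
    ResultConfiguration={'OutputLocation': 's3://cdc-query-results/'}
)
qid = query['QueryExecutionId']
while athena.get_query_execution(QueryExecutionId=qid)['QueryExecution']['Status']['State'] \
        in ('QUEUED', 'RUNNING'):
    time.sleep(2)
rows = athena.get_query_results(QueryExecutionId=qid)['ResultSet']['Rows']
target_count = int(rows[1]['Data'][0]['VarCharValue'])  # row 0 is the header

discrepancy = abs(source_count - target_count) / max(source_count, 1)
if discrepancy > TOLERANCE:
    print(f'ALERT: row count mismatch {source_count} vs {target_count}')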
Operational runbooks document response procedures for common failure scenarios. Runbooks should cover:
- DMS task failure recovery: When to restart tasks, when to create new replication instances, how to resume from specific points
- Kinesis stream scaling: How to split shards under load, how to merge shards during low traffic
- S3 storage management: When and how to compact CDC data, archival policies for old partitions
- Backfill procedures: How to reload tables when CDC misses changes due to outages
Handling Edge Cases and Failure Scenarios
Production CDC implementations encounter numerous edge cases requiring specific handling strategies to maintain data integrity and pipeline reliability.
Schema evolution occurs when source database schemas change—adding columns, modifying data types, or restructuring tables. DMS can automatically handle additive changes (new columns) but breaking changes (dropped columns, type changes) often require task restarts or reconfiguration.
Implement schema change detection by monitoring DMS logs for TableSchemaChanged events. When detected, evaluate whether changes are compatible or require intervention. For compatible changes, CDC continues automatically. For breaking changes, plan task restarts and downstream schema updates.
Large transaction handling creates challenges when source databases commit transactions involving millions of rows. These large transactions generate massive bursts of change events that can overwhelm streaming infrastructure. DMS accumulates these changes and publishes them atomically, potentially causing memory exhaustion or target write timeouts.
Mitigate by configuring source applications to commit smaller transactions. If not possible, increase DMS instance memory and configure larger Kinesis batch sizes to handle bursts. Monitor DMS memory usage and scale instances before exhaustion occurs.
Network partition and failure recovery requires careful handling. When DMS loses connectivity to sources or targets, it must resume from correct positions without data loss or duplication. DMS maintains checkpoint state enabling resume, but manual intervention is sometimes necessary.
For PostgreSQL sources, replication slots ensure no data loss during outages by preventing log deletion. For MySQL, sufficient binlog retention performs the same function. Always configure retention exceeding worst-case outage duration by significant margins.
Duplicate handling occurs when failures cause retries, potentially sending the same change event multiple times. Downstream systems must handle duplicates through idempotent processing—applying the same change multiple times produces identical results to applying it once. Implement duplicate detection using unique identifiers from change events, discarding already-processed changes.
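One way to implement that duplicate detection is a conditional write against a tracking table. A sketch assuming a hypothetical DynamoDB table keyed on a change identifier; for DMS events, a combination of transaction id, table name, and primary key is a reasonable identifier when IncludeTransactionDetails is enabled.
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client('dynamodb')
TABLE = 'cdc-processed-changes'  # assumed table with partition key 'change_id'

def apply_once(change_id: str, apply_change) -> bool:
    """Apply a change only if its identifier has not been recorded before."""
    try:
        # The conditional write fails atomically if the id already exists.
        dynamodb.put_item(
            TableName=TABLE,
            Item={'change_id': {'S': change_id}},
            ConditionExpression='attribute_not_exists(change_id)'
        )
    except ClientError as err:
        if err.response['Error']['Code'] == 'ConditionalCheckFailedException':
            return False  # duplicate delivery, skip reprocessing
        raise
    apply_change()  # first time this change has been seen, safe to process
    return True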
Conclusion
Building end-to-end CDC pipelines on AWS requires orchestrating multiple services into cohesive architectures that balance simplicity, performance, and reliability. Starting with properly configured source databases, implementing DMS tasks that capture changes efficiently, streaming through Kinesis or MSK, transforming via Lambda or Glue, and storing in optimized S3 structures creates foundations for reliable change data capture. The complexity lies not in individual components but in understanding how they interact, where failures occur, and how to build resilience through proper configuration, monitoring, and operational procedures.
Successful CDC implementations start simple and evolve complexity as requirements grow. Begin with basic DMS to Kinesis to Lambda pipelines, validate they meet initial needs, then incrementally add sophistication—complex transformations, advanced streaming patterns, or specialized storage formats—only when justified by actual requirements. This progressive approach builds operational experience while minimizing over-engineering risk, ultimately delivering CDC systems that reliably serve business needs without unnecessary complexity.