AWS Database Migration Service’s Change Data Capture functionality promises seamless database replication, but production reality often involves investigating stuck tasks, resolving data inconsistencies, and diagnosing mysterious replication lag. Unlike full load migrations that either succeed or fail clearly, CDC issues manifest subtly—tables falling behind by hours, specific records missing from targets, or tasks showing “running” status while processing nothing. Effective CDC troubleshooting requires systematic diagnostic approaches that move beyond error messages to understand underlying replication mechanics.
This guide provides battle-tested troubleshooting methodologies for the most common AWS DMS CDC problems, focusing on practical diagnostic techniques and resolution strategies that work in real production environments.
Understanding DMS CDC Architecture for Effective Troubleshooting
Before diving into specific problems, understanding how DMS CDC actually works enables more effective diagnosis. DMS CDC operates through several interconnected components, and issues can originate in any layer.
DMS reads source database transaction logs—PostgreSQL’s Write-Ahead Log, MySQL’s binary logs, Oracle’s redo logs, or SQL Server’s transaction log. The replication instance parses these logs to identify data changes, transforms them according to task configuration, and applies them to target systems. This process involves continuous log reading, change buffering, transformation processing, and target application—each stage presenting potential failure points.
The CDC workflow includes these critical phases:
- Log position tracking: DMS maintains checkpoint positions in source transaction logs, recording where it last successfully processed changes. These checkpoints enable resuming after failures without reprocessing everything or missing changes.
- Change buffering: Changes accumulate in replication instance memory and swap space before applying to targets. Buffer exhaustion causes processing slowdowns or failures.
- Batch processing: DMS applies changes in batches rather than individually, optimizing target throughput. Batch size and timing significantly impact latency and performance.
- Validation and error handling: DMS can validate replicated data against sources and maintains error tables recording failed operations for investigation.
When troubleshooting, always consider which component might be failing. Is DMS successfully reading source logs but failing to apply changes? Are logs being read slowly? Is the replication instance resource-constrained? Identifying the failing stage focuses troubleshooting efforts effectively.
Diagnosing and Resolving Replication Lag
Replication lag—targets falling behind sources by minutes or hours—represents the most common CDC issue. Understanding lag causes and diagnostic approaches helps restore healthy replication quickly.
Start by quantifying the lag precisely. The DMS console shows “CDCLatencySource” (time between source changes and DMS processing) and “CDCLatencyTarget” (time between DMS processing and target application). These metrics reveal where delays occur. High source latency indicates DMS struggles reading source logs. High target latency suggests application bottlenecks on target systems.
CloudWatch metrics provide deeper visibility. The “CDCIncomingChanges” metric counts change events waiting at a point in time to be applied to the target, while “CDCThroughputRowsTarget” shows the rate at which rows are actually being applied. If incoming changes keep climbing while target throughput stays flat, you’re accumulating backlog. Compare these metrics across time windows to identify when lag began and whether it’s growing, stable, or shrinking.
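As a rough illustration of this incoming-versus-applied comparison, backlog growth can be computed from exported datapoints. The helper below is hypothetical, not part of any AWS SDK:

```python
# Sketch: detect a growing CDC backlog from paired metric datapoints.
# Each sample is (incoming_changes, applied_changes) for one time window.
# Hypothetical helper -- feed it values exported from CloudWatch.

def backlog_trend(samples):
    """Return the running backlog per window and whether it grew overall."""
    backlog = 0
    history = []
    for incoming, applied in samples:
        backlog += incoming - applied  # changes detected but not yet applied
        history.append(backlog)
    growing = len(history) >= 2 and history[-1] > history[0]
    return history, growing
```

For example, `backlog_trend([(1000, 800), (1000, 900), (1000, 700)])` reports a backlog that rises across the three windows, the signature of a task falling behind rather than merely spiking.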
Common replication lag causes and solutions:
Undersized replication instances are the most frequent cause of lag. DMS replication instances perform CPU-intensive log parsing and change transformation. If your instance’s CPU consistently exceeds 80%, upgrade to a larger instance class. The CloudWatch metric “CPUUtilization” shows compute load. Similarly, “FreeableMemory” indicates memory pressure: freeable memory consistently below 500 MB on smaller instances suggests you need more capacity.
To upgrade with minimal interruption, modify your replication instance to a larger class. With Multi-AZ enabled, DMS applies the change to the standby first and then fails over, typically completing in minutes with only a brief replication pause; single-AZ instances incur downtime while the instance is modified.
Source database log configuration issues frequently cause lag. For PostgreSQL, DMS relies on a replication slot to hold WAL until it is processed; if the slot is dropped or its retention is capped, log segments can be recycled before DMS reads them, causing lag or data loss. Verify that wal_keep_segments (PostgreSQL 12 and earlier) or wal_keep_size (PostgreSQL 13 and later) provides adequate retention as a safety net. For MySQL, binary log retention must exceed your maximum expected lag duration. If DMS falls behind and required binlogs are purged, replication fails catastrophically, requiring full table reloads.
Check source log retention with these queries:
```sql
-- PostgreSQL: Check replication slot lag
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_name LIKE 'dms%';
```

```sql
-- MySQL: Check binary log retention
SHOW BINARY LOGS;
SHOW VARIABLES LIKE 'expire_logs_days';
```
If retained logs seem insufficient, increase retention settings. In PostgreSQL, wal_keep_size sets a minimum amount of WAL to keep, while max_slot_wal_keep_size caps how much WAL a replication slot may retain; set the cap high enough that the DMS slot is never invalidated. MySQL’s expire_logs_days (or binlog_expire_logs_seconds in MySQL 8.0 and later) sets the purge timeline.
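To act on the slot query’s output programmatically, the pg_size_pretty values can be converted back to bytes and compared against a threshold. A minimal sketch with hypothetical helper names; adjust units and thresholds to your environment:

```python
# Sketch: flag DMS replication slots retaining more WAL than a threshold.
# Parses pg_size_pretty-style strings ("1528 MB") back into bytes.

UNITS = {"bytes": 1, "kB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

def pretty_to_bytes(value):
    """Convert a pg_size_pretty string like '1528 MB' to a byte count."""
    amount, unit = value.split()
    return int(float(amount) * UNITS[unit])

def slots_over_threshold(slot_rows, threshold_bytes):
    """slot_rows: list of (slot_name, retained_wal_pretty) tuples."""
    return [name for name, retained in slot_rows
            if pretty_to_bytes(retained) > threshold_bytes]
```

Wiring this into an alert gives early warning before a slot approaches max_slot_wal_keep_size and risks being invalidated.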
Large transactions cause temporary lag spikes. When source applications perform massive bulk updates—updating millions of rows in single transactions—DMS must process entire transactions before committing to targets. A 30-minute batch job updating 10 million records creates a 30-minute lag spike. This is expected behavior, not a problem. Monitor your “CDCLatencySource” and “CDCLatencyTarget” metrics; if lag spikes correlate with known batch jobs and recover afterward, no action is needed. If lag persists after large transactions, investigate other causes.
Target system performance bottlenecks limit how quickly DMS applies changes. If your target database experiences high CPU, disk I/O saturation, or connection exhaustion, DMS cannot apply changes fast enough regardless of replication instance size. Monitor target system metrics during high lag periods. CloudWatch RDS metrics (if your target is RDS) show CPU, IOPS, and connection counts. If targets are maxed out, optimize target performance—add indexes, upgrade instance types, tune database parameters, or reduce concurrent load from other applications.
📊 Lag Investigation Priority
When investigating lag, check in this order: 1) Replication instance CPU/memory utilization, 2) Source log retention and availability, 3) Target system performance metrics, 4) Network connectivity between DMS and endpoints. This sequence addresses the most common causes first, resolving most lag issues.
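The checklist above can be encoded as a first-failing-check triage. Thresholds here are illustrative, not AWS recommendations:

```python
# Sketch: walk the lag-triage checklist in priority order and report the
# first suspect. Metric names and thresholds are illustrative.

def triage_lag(metrics):
    checks = [
        ("replication instance",
         metrics["cpu_pct"] < 80 and metrics["free_mem_mb"] > 500),
        ("source log retention",
         metrics["logs_retained_hours"] > metrics["lag_hours"]),
        ("target performance", metrics["target_cpu_pct"] < 80),
        ("network connectivity", metrics["endpoints_reachable"]),
    ]
    for name, ok in checks:
        if not ok:
            return name  # first failing check in priority order
    return "no obvious bottleneck"
```

Ordering the checks this way means the cheapest, most likely explanations are ruled out before digging into network paths.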
Investigating Task Failures and Error Handling
DMS tasks fail for numerous reasons—connectivity issues, data type mismatches, constraint violations, or permissions problems. Effective troubleshooting requires interpreting error messages correctly and knowing where to find diagnostic information.
Task error messages appear in multiple locations, each providing different detail levels. The DMS console shows high-level status and recent errors. CloudWatch Logs contain detailed task logs including every error encountered. The awsdms_apply_exceptions table (created automatically on targets) records individual failed changes with error details.
When a task fails or stops unexpectedly, start with CloudWatch Logs. Navigate to the log group for your replication instance (named dms-tasks-&lt;replication-instance-name&gt;; each task writes to its own log stream). Search for “ERROR” or “FATAL” messages around the failure time. DMS logs are verbose, but searching for these keywords quickly identifies problems.
Common error patterns and resolutions:
“RetriableSQLException” errors indicate temporary target database issues—deadlocks, connection timeouts, or transient failures. DMS automatically retries these errors according to task error handling configuration. If you see many retriable errors eventually succeeding, no action is necessary—DMS is handling failures gracefully. If errors exhaust retry attempts, investigate target database health. Frequent deadlocks suggest index or query optimization needs on targets.
“LastFailure: null” or “LastError: null” appears frustratingly often when tasks stop without clear errors. This typically indicates DMS lost connectivity to source or target databases without receiving explicit error responses—network issues, security group changes, or database restarts. Check:
- Security group rules allowing DMS replication instance access to both endpoints
- Network connectivity from DMS VPC/subnets to source and target networks
- Source and target database status—recent restarts, maintenance windows, or configuration changes
- DMS endpoint test connections showing green status for both source and target
To diagnose connectivity, test endpoints through the DMS console. Click your endpoint and select “Test connection” with your replication instance. This validates network paths and credentials without running full replication.
Data type mismatch errors occur when source column types don’t cleanly map to target types. Error messages like “cannot convert value” or “data type mismatch” indicate transformation failures. Common scenarios include:
- Source strings containing data exceeding target column lengths
- Source numerics with precision exceeding target column definitions
- Date/time values in formats targets cannot parse
- Character encoding issues between source and target collations
Resolution requires either adjusting target schemas to accommodate source data or using DMS transformation rules to handle conversions. Transformation rules can truncate strings, convert data types, or exclude problematic columns. Example transformation rule truncating long strings:
```json
{
  "rules": [{
    "rule-type": "transformation",
    "rule-id": "1",
    "rule-name": "truncate-long-text",
    "rule-target": "column",
    "object-locator": {
      "schema-name": "public",
      "table-name": "products",
      "column-name": "description"
    },
    "rule-action": "change-data-type",
    "data-type": {
      "type": "string",
      "length": 500
    }
  }]
}
```
Primary key or unique constraint violations happen when DMS attempts inserting duplicate records or updating non-existent records. These errors typically indicate:
- Source and target schemas diverged (someone manually modified target data)
- Tasks restarted and are reprocessing already-applied changes
- Updates arriving before corresponding inserts due to ordering issues
Check the awsdms_apply_exceptions table on your target for specific records causing violations. Compare these records between source and target to understand discrepancies. Often, resolving requires deleting problematic target records and allowing DMS to reinsert them correctly.
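One way to structure that comparison is to fetch the flagged rows from both sides and bucket each key. A sketch assuming rows have been loaded into dicts keyed by primary key:

```python
# Sketch: classify keys from awsdms_apply_exceptions by comparing rows
# fetched from source and target. Rows are dicts of {pk: row_dict}.

def classify_exceptions(source_rows, target_rows):
    missing_on_target, diverged, consistent = [], [], []
    for key, src in source_rows.items():
        tgt = target_rows.get(key)
        if tgt is None:
            missing_on_target.append(key)   # candidate for reinsert
        elif tgt != src:
            diverged.append(key)            # manual reconciliation needed
        else:
            consistent.append(key)          # likely already retried OK
    return missing_on_target, diverged, consistent
```

Bucketing the exceptions first keeps the manual work focused on genuinely diverged rows rather than ones DMS already retried successfully.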
Handling Schema Changes and DDL Replication
DMS’s handling of Data Definition Language (DDL) operations—table alterations, index creation, column additions—represents a common pain point. By default, many source-target combinations don’t replicate DDL automatically, causing replication failures when schemas change.
DMS behavior with DDL varies by database combination. PostgreSQL to PostgreSQL tasks can replicate DDL if configured. MySQL to Aurora MySQL often handles DDL gracefully. However, heterogeneous migrations (Oracle to PostgreSQL, SQL Server to MySQL) typically don’t support automatic DDL replication. Even homogeneous migrations sometimes require explicit configuration enabling DDL capture.
When source schemas change without corresponding target updates, CDC fails. Adding a new column to a source table causes DMS to attempt replicating inserts including that column to targets lacking it, resulting in errors. The safest approach is explicitly managing schema changes across both environments.
Best practices for handling schema evolution:
Pause CDC during schema changes by stopping replication tasks, applying schema changes to both source and target identically, then resuming tasks. This approach guarantees consistency but requires coordination and brief replication downtime. For planned maintenance windows, this represents the cleanest solution.
Enable DDL replication where supported. For supported database pairs, DDL handling is controlled in the task settings JSON rather than in table mappings, through the ChangeProcessingDdlHandlingPolicy section:

```json
{
  "ChangeProcessingDdlHandlingPolicy": {
    "HandleSourceTableDropped": true,
    "HandleSourceTableTruncated": true,
    "HandleSourceTableAltered": true
  }
}
```

These flags govern whether DROP, TRUNCATE, and ALTER statements captured from the source are applied to the target; for many engine pairs, tables newly created on the source that match your selection rules are picked up automatically during CDC.
Note that DDL replication has limitations—complex alterations, certain index types, or schema renames might not replicate correctly. Always test DDL replication behavior in development environments before relying on it in production.
Monitor for schema drift by periodically comparing source and target schemas. Write scripts comparing table structures, column definitions, and indexes between systems. Catching drift early prevents accumulating discrepancies that become difficult to reconcile. Several open-source tools like pgAdmin for PostgreSQL or mysqldiff for MySQL automate schema comparison.
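A minimal drift check might diff column maps captured from each system’s information_schema. The function below is a hypothetical sketch, not a replacement for dedicated comparison tools:

```python
# Sketch: detect schema drift by diffing {table: {column: type}} maps
# captured from information_schema on source and target.

def schema_drift(source_schema, target_schema):
    drift = []
    for table, cols in source_schema.items():
        tcols = target_schema.get(table)
        if tcols is None:
            drift.append(f"table {table} missing on target")
            continue
        for col, ctype in cols.items():
            if col not in tcols:
                drift.append(f"{table}.{col} missing on target")
            elif tcols[col] != ctype:
                drift.append(f"{table}.{col} type differs: {ctype} vs {tcols[col]}")
    return drift
```

Run on a schedule, an empty result confirms the schemas still match; any findings are the early warning the paragraph above recommends catching.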
⚠️ DDL Handling Warning
Never assume DMS automatically handles DDL. Even when DDL replication is enabled, test thoroughly. Complex alterations (renaming columns, changing primary keys, splitting tables) often fail silently or partially replicate, causing subtle data inconsistencies discovered much later. Explicit schema management prevents these issues.
Addressing Data Validation Failures and Inconsistencies
Data validation—DMS’s feature comparing source and target data to ensure consistency—frequently reports mismatches that require investigation. Not all validation failures indicate real problems; understanding false positives versus genuine inconsistencies is crucial.
Enable validation through task settings under “Validation.” DMS samples records from sources and targets, comparing them for differences. The CloudWatch metric “ValidationSuccessCount” tracks successful validations while “ValidationFailureCount” shows mismatches. The validation state table (awsdms_validation_failures_v1) on targets lists specific mismatched records.
Common causes of validation mismatches:
Timing differences cause false positives when comparing rapidly changing data. DMS reads a record from source, then reads the corresponding target record microseconds later. If that record was updated between reads, validation reports a mismatch despite replication working correctly. High-frequency update patterns (records updated multiple times per second) generate validation failures that aren’t real issues.
Resolution involves accepting these timing mismatches or excluding high-change-rate tables from validation. Focus validation on relatively static tables or tables where change frequency is measured in minutes or hours rather than milliseconds.
Data type representation differences between databases cause mismatches that don’t represent functional problems. Timestamps stored with microsecond precision in PostgreSQL might round to millisecond precision in MySQL. Floating-point numbers experience minor rounding differences across database implementations. Strings might differ in trailing whitespace handling.
Investigate validation failures by querying the validation failure table and comparing actual source and target values:
```sql
-- Query recent validation failures ("key" is quoted because it is a
-- reserved word in most SQL dialects)
SELECT table_name,
       "key",
       failure_type,
       failure_time
FROM awsdms_validation_failures_v1
WHERE failure_time > NOW() - INTERVAL '1 hour'
ORDER BY failure_time DESC;
```
Then manually compare source and target records for listed keys. If differences are minor (timestamp precision, whitespace, rounding), document these as expected variations rather than errors.
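That manual triage can be partially automated. The sketch below classifies the common expected variations (trailing whitespace, float rounding, sub-millisecond timestamp precision) and is illustrative only:

```python
from datetime import datetime

# Sketch: separate expected representation differences from genuine
# validation mismatches. Tolerances are illustrative.

def is_expected_variation(source_val, target_val):
    # Trailing-whitespace differences in strings
    if isinstance(source_val, str) and isinstance(target_val, str):
        return source_val.rstrip() == target_val.rstrip()
    # Minor floating-point rounding across database engines
    if isinstance(source_val, float) and isinstance(target_val, float):
        return abs(source_val - target_val) < 1e-6
    # Timestamp precision: compare at millisecond granularity
    # (e.g. PostgreSQL microseconds vs MySQL milliseconds)
    if isinstance(source_val, datetime) and isinstance(target_val, datetime):
        trunc = lambda d: d.replace(microsecond=d.microsecond // 1000 * 1000)
        return trunc(source_val) == trunc(target_val)
    return source_val == target_val
```

Anything this filter does not wave through deserves the manual source-versus-target comparison described above.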
Genuine data inconsistencies occur from several sources. Task restarts during incomplete batch processing might leave partial batches applied. Network failures during change application create windows where some changes apply while others don’t. Manually modifying target data outside DMS creates divergence DMS doesn’t detect until validation runs.
For genuine inconsistencies, resolution depends on scope. A few isolated mismatched records might be manually corrected. Widespread inconsistencies often require reloading affected tables (for individual tables, the console’s “Reload table data” action automates this without touching other tables):
- Stop the DMS task
- Truncate affected target tables
- Modify task to perform “Drop tables on target” and “Full load”
- Restart task to reload data cleanly
- Once full load completes, CDC resumes automatically
This approach reestablishes consistency but requires replication downtime and reloading time proportional to table sizes.
Optimizing CDC Performance and Preventing Future Issues
Proactive configuration and monitoring prevent many CDC issues from occurring, reducing troubleshooting frequency and maintaining healthy replication.
Right-size replication instances from the start. Monitor CPU and memory utilization continuously. If average CPU exceeds 60-70%, upgrade preemptively before performance degrades. DMS performance scales with compute resources—undersized instances introduce unnecessary troubleshooting burden.
Configure appropriate task settings for your workload characteristics. Key settings include:
- CommitRate (in FullLoadSettings): caps rows committed per batch during full loads, preventing target overload
- BatchApplyTimeoutMin / BatchApplyTimeoutMax (in ChangeProcessingTuning): control how long DMS waits accumulating changes before applying a batch
- MaxFullLoadSubTasks (in FullLoadSettings): determines how many tables load in parallel during full loads
- ParallelApplyThreads (in TargetMetadata): sets concurrent apply threads for CDC changes on supported targets
For high-throughput workloads on targets that support it, increase ParallelApplyThreads to 8 or 16 (the default of 0 means serial processing). This dramatically improves CDC throughput on multi-vCPU instances but requires adequate target capacity to handle concurrent writes.
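These settings live in the task settings JSON rather than table mappings. A fragment with illustrative values (verify each setting against your engine pair’s supported options) might look like:

```json
{
  "TargetMetadata": {
    "BatchApplyEnabled": true,
    "ParallelApplyThreads": 8
  },
  "ChangeProcessingTuning": {
    "BatchApplyTimeoutMin": 1,
    "BatchApplyTimeoutMax": 30
  },
  "FullLoadSettings": {
    "MaxFullLoadSubTasks": 8,
    "CommitRate": 10000
  }
}
```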
Implement comprehensive monitoring beyond basic task status checks. Set CloudWatch alarms for:
- CDC latency exceeding acceptable thresholds (e.g., alert when lag exceeds 5 minutes)
- Task failures or stops
- Replication instance CPU exceeding 80%
- Validation failure counts increasing
- Target database connection counts approaching limits
These alarms enable proactive response before issues impact business operations.
Maintain source database log retention conservatively. Calculate maximum acceptable replication downtime (how long could DMS be down before requiring full reloads), then ensure log retention exceeds that duration by 50-100%. If you can tolerate 4-hour downtime, maintain 6-8 hours of log retention minimum.
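That sizing rule is simple arithmetic. A hypothetical helper, with the 1.75 safety factor standing in for the 50-100% padding:

```python
# Sketch: derive a minimum log-retention target from the longest outage
# you could absorb without requiring a full reload.

def retention_target_hours(max_outage_hours, safety_factor=1.75):
    """Pad tolerable downtime by 50-100%; 1.75 splits that range."""
    return max_outage_hours * safety_factor

def binlog_expire_seconds(max_outage_hours, safety_factor=1.75):
    """Same target expressed for MySQL's binlog_expire_logs_seconds."""
    return int(max_outage_hours * safety_factor * 3600)
```

For a 4-hour tolerance this yields a 7-hour retention target, squarely inside the 6-8 hour band suggested above.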
Document your DMS architecture and configuration thoroughly. When troubleshooting at 3 AM, documentation about task configurations, transformation rules, custom settings, and known quirks proves invaluable. Include network diagrams showing DMS connectivity, security group rules, and endpoint configurations.
Test failover and recovery procedures regularly. Intentionally stop tasks and verify your monitoring alerts fire. Practice restarting tasks and measuring lag recovery time. Simulate replication instance failures and measure recovery duration. These exercises build confidence and reveal gaps in procedures before real incidents occur.
Conclusion
Troubleshooting AWS DMS CDC effectively requires understanding the underlying replication architecture, systematic diagnostic approaches, and familiarity with common failure patterns. The most prevalent issues—replication lag, connectivity failures, schema change handling, and validation mismatches—follow predictable patterns with established resolution strategies. Starting with CloudWatch metrics and logs, methodically working through potential failure points, and understanding the difference between genuine problems and expected behaviors enables efficient problem resolution.
Proactive configuration, comprehensive monitoring, and regular operational testing prevent many troubleshooting scenarios from occurring in the first place. While CDC issues will inevitably arise in complex production environments, the diagnostic frameworks and resolution patterns outlined here equip you to identify root causes quickly and restore healthy replication with minimal business impact.