AWS Database Migration Service (DMS) has become the go-to solution for migrating databases to AWS, enabling everything from simple lift-and-shift migrations to complex heterogeneous migrations and ongoing replication for hybrid architectures. Yet the power of DMS comes with operational complexity: replication tasks can lag, fail silently during full loads, encounter data type conversion errors, or experience network issues that cause subtle data inconsistencies. A migration that looks successful in the AWS Console might have silently dropped thousands of records that only validation would reveal.
Effective monitoring and logging transforms DMS from an unpredictable migration tool into a reliable, observable data pipeline. Without proper observability, you discover problems when users report missing data or when post-migration validation reveals significant discrepancies. With comprehensive monitoring, you detect issues within minutes—lag metrics show capacity problems before they violate RPO requirements, task failures trigger immediate alerts, and detailed logs enable root cause analysis that accelerates resolution from hours to minutes.
This guide provides an in-depth examination of best practices for monitoring and logging AWS DMS, covering the critical metrics and log sources you need to watch, CloudWatch configuration strategies, alerting rules that catch problems early, and troubleshooting approaches that leverage monitoring data to maintain healthy replication pipelines.
Understanding AWS DMS Monitoring Architecture
Before implementing monitoring, it’s essential to understand how DMS exposes operational data and what visibility each layer provides.
DMS Metrics in CloudWatch
AWS DMS automatically publishes metrics to CloudWatch at one-minute intervals, providing visibility into replication performance and health. These metrics fall into several categories:
Task-level metrics track overall replication task status and performance—full load progress, CDC latency, throughput rates, and error counts. These metrics answer “is my migration progressing?” and “how far behind is replication?”
Table-level metrics provide granular visibility into individual table replication—rows loaded, DDL operations performed, validation status. Critical for identifying which specific tables experience problems rather than diagnosing aggregate task failures.
Host-level metrics expose the underlying replication instance’s resource utilization—CPU, memory, storage, network throughput. Essential for capacity planning and identifying resource-constrained tasks.
Endpoint-level metrics reveal source and target endpoint health—connection status, transaction backlog. These help distinguish database-side issues from DMS-side problems.
The challenge with DMS metrics isn’t availability—AWS provides dozens of metrics—but knowing which ones actually matter for your use case and how to interpret them in context.
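As a concrete starting point, the sketch below (Python with boto3, assumed here as the tooling of choice) pulls recent task-level latency from the AWS/DMS namespace. The task and replication instance identifiers are placeholders, and the exact dimension set should be confirmed against what CloudWatch shows for your own tasks.

```python
# Sketch: fetch the last hour of CDCLatencySource for one task.
# "my-task-id" and "my-instance-id" are placeholders.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/DMS",
    MetricName="CDCLatencySource",
    Dimensions=[
        {"Name": "ReplicationTaskIdentifier", "Value": "my-task-id"},
        {"Name": "ReplicationInstanceIdentifier", "Value": "my-instance-id"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=60,                      # DMS publishes at one-minute granularity
    Statistics=["Maximum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```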
DMS Logging Layers
DMS generates multiple log types that serve different troubleshooting purposes:
Task logs contain operational messages about replication progress, connection events, and error conditions. These are your primary troubleshooting resource when tasks fail or behave unexpectedly.
Replication instance logs capture system-level events including instance startup, configuration changes, and infrastructure problems. Less frequently needed than task logs but critical for diagnosing instance-level issues.
Database logs from source and target endpoints provide context around connection issues, query performance, and database-specific errors that affect replication.
CloudWatch Logs stores DMS logs by default, but log retention, organization, and analysis require intentional configuration. Without proper log management, critical diagnostic information expires before you need it or becomes impossible to search effectively.
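Retention is one of the easiest gaps to close. A minimal sketch, assuming DMS's usual dms-tasks-&lt;replication-instance-id&gt; log group naming (verify the actual name in your account):

```python
# Sketch: enforce retention on a DMS task log group so diagnostic data
# survives long enough for post-migration analysis and audits.
import boto3

logs = boto3.client("logs")

LOG_GROUP = "dms-tasks-my-replication-instance"  # assumed naming, placeholder id

logs.put_retention_policy(
    logGroupName=LOG_GROUP,
    retentionInDays=90,
)
```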
Critical Metrics to Monitor
DMS exposes numerous metrics, but focusing on the truly critical ones separates signal from noise and prevents alert fatigue.
Full Load Progress and Performance
During initial full load migrations, these metrics determine whether your migration will complete within your window:
FullLoadThroughputRowsSource: Rows per second read from the source database. Declining throughput indicates source database performance issues, network bottlenecks, or resource constraints on the replication instance. Typical rates vary dramatically by source type—Oracle might sustain 50K rows/sec while MongoDB may only achieve 5K rows/sec due to different storage architectures.
FullLoadThroughputRowsTarget: Rows per second written to the target. If this lags significantly behind source throughput, the target database is the bottleneck—check target database capacity, index creation overhead, or constraint validation costs.
CDCLatencySource and CDCLatencyTarget: Time lag between when changes occur and when DMS processes them. During full load, CDC latency accumulates as DMS buffers ongoing source changes for later application. Rising CDC latency during full load is expected; failure to decrease after full load completion indicates a problem.
MemoryUtilization: Percentage of replication instance memory in use. Full load operations are memory-intensive, especially when buffering CDC changes. Sustained >80% utilization may cause performance degradation or out-of-memory errors.
Monitoring these metrics together reveals the complete picture. High source throughput with low target throughput indicates target bottleneck; low source throughput with abundant instance resources suggests source database performance issues or network problems.
CDC Replication Health
Once full load completes, ongoing CDC replication metrics become critical for operational stability:
CDCLatencySource: The most critical CDC metric. It measures the lag, in seconds, between when a change commits in the source and when DMS captures it. Increasing latency indicates DMS can't keep up with source write rates. Latency under 5 seconds is excellent, under 30 seconds is acceptable for most use cases, and over 60 seconds suggests problems.
CDCLatencyTarget: Time between when DMS captures a change and when it applies it to the target. High target latency despite low source latency indicates target database capacity issues or DMS-to-target network problems.
CDCChangesDiskSource and CDCChangesDiskTarget: Number of change events buffered to disk rather than held in memory. Non-zero values indicate memory pressure forcing spill-to-disk, which degrades performance. Sustained disk buffering suggests undersized replication instances.
CDCThroughputRowsSource and CDCThroughputRowsTarget: Change event processing rates. Compare these with your source database’s typical transaction rates. If DMS throughput is 10% of source transaction rate, you’re likely underprovisioned.
Error and Failure Detection
Error metrics surface problems that might otherwise go undetected until they cause data discrepancies:
ErrorTaskCount: Number of errors the task has encountered. Any non-zero value warrants investigation. Even if the task continues running, errors often indicate dropped transactions or malformed records.
FailedEventSourceCount and FailedEventTargetCount: Events that couldn’t be processed or applied. Unlike transient errors that resolve via retries, failed events typically require manual intervention—they represent data that won’t replicate without fixing the underlying issue.
ValidationSuspendedRecords and ValidationFailedRecords: When using DMS validation, these metrics reveal data consistency problems. Failed validations indicate source and target data differs—potentially from errors, transformation issues, or data type incompatibilities.
Monitor error rates, not just absolute counts. A task with 5 errors per hour might be acceptable noise; the same task suddenly generating 50 errors per hour signals an emerging problem requiring investigation.
Resource Utilization
Replication instance resource metrics help identify capacity constraints before they degrade replication:
CPUUtilization: Percentage of CPU in use. Sustained >80% suggests undersized instance or inefficient transformations. CPU spikes during full load are normal; sustained high CPU during CDC indicates problems.
FreeableMemory: Available memory in bytes. Low freeable memory (<20% of total) causes performance degradation as the instance swaps to disk. Memory pressure often shows up first as rising CDC latency, before FreeableMemory drops to alarming levels.
NetworkReceiveThroughput and NetworkTransmitThroughput: Network bandwidth utilization. If these approach instance type limits, network bandwidth becomes a bottleneck. High receive throughput with low transmit suggests DMS is reading from source but struggling to write to target.
SwapUsage: Bytes swapped to disk. Any significant swap usage indicates severe memory pressure. DMS performance degrades dramatically when swapping occurs—address immediately by scaling instance size.
DMS Metrics Priority Matrix
| Metric | Phase | Alert Threshold | Severity |
|---|---|---|---|
| CDCLatencySource | CDC | > 60 seconds | Critical |
| FullLoadThroughputRows | Full Load | Drops > 50% | Warning |
| FailedEventCount | Both | > 0 | Critical |
| CPUUtilization | Both | > 80% for 10 min | Warning |
| FreeableMemory | Both | < 20% of total | Warning |
| ValidationFailedRecords | Post-migration | > 0 | Critical |
| SwapUsage | Both | > 100 MB | Critical |
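The matrix translates directly into CloudWatch alarms. A hedged sketch for one task, with placeholder identifiers and SNS topic; match the error-metric alarms to the exact metric names your tasks emit:

```python
# Sketch: create alarms for a few rows of the priority matrix.
import boto3

cloudwatch = boto3.client("cloudwatch")

TASK_DIMS = [
    {"Name": "ReplicationTaskIdentifier", "Value": "my-task-id"},
    {"Name": "ReplicationInstanceIdentifier", "Value": "my-instance-id"},
]
INSTANCE_DIMS = [{"Name": "ReplicationInstanceIdentifier", "Value": "my-instance-id"}]
ALERT_TOPIC = "arn:aws:sns:us-east-1:123456789012:dms-alerts"  # placeholder

alarms = [
    # (name, metric, dimensions, statistic, threshold, evaluation periods)
    ("dms-cdc-latency-critical", "CDCLatencySource", TASK_DIMS, "Maximum", 60, 5),
    ("dms-cpu-warning", "CPUUtilization", INSTANCE_DIMS, "Average", 80, 10),
    ("dms-swap-critical", "SwapUsage", INSTANCE_DIMS, "Maximum", 100 * 1024 * 1024, 5),
]

for name, metric, dims, stat, threshold, periods in alarms:
    cloudwatch.put_metric_alarm(
        AlarmName=name,
        Namespace="AWS/DMS",
        MetricName=metric,
        Dimensions=dims,
        Statistic=stat,
        Period=60,
        EvaluationPeriods=periods,
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[ALERT_TOPIC],
        TreatMissingData="breaching",  # a task that stops reporting is itself a problem
    )
```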
Configuring CloudWatch for Optimal Observability
Raw metrics need proper organization, retention, and analysis capabilities to provide actionable insights.
CloudWatch Dashboards for DMS
Build role-specific dashboards that surface relevant information without overwhelming operators:
Executive dashboard shows high-level migration status:
- Overall task status (running, failed, stopped)
- Estimated time to completion for full loads
- CDC latency across all replication tasks
- Total throughput (rows/sec) across the migration
- Recent critical alerts
Operator dashboard provides detailed task monitoring:
- Per-task CDC latency graphs (1h, 6h, 24h views)
- Throughput rates with baseline comparisons
- Error count trends
- Resource utilization (CPU, memory, network)
- Table-level progress for large full loads
Troubleshooting dashboard aids incident response:
- Detailed error messages from CloudWatch Logs Insights
- Source and target endpoint metrics side-by-side
- Network throughput with bandwidth limits
- Memory breakdown (heap, cache, buffers)
- Transaction backlog at source database
Organize graphs to show related metrics together. For example, display source latency, target latency, source throughput, and target throughput on the same dashboard row so correlation is obvious during investigations.
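A minimal sketch of that operator row as a CloudWatch dashboard, with placeholder identifiers and region:

```python
# Sketch: source/target latency and throughput side by side on one dashboard row.
import json
import boto3

cloudwatch = boto3.client("cloudwatch")

def dms_widget(title, metric, x):
    return {
        "type": "metric",
        "x": x, "y": 0, "width": 6, "height": 6,
        "properties": {
            "title": title,
            "region": "us-east-1",  # placeholder
            "metrics": [[
                "AWS/DMS", metric,
                "ReplicationTaskIdentifier", "my-task-id",
                "ReplicationInstanceIdentifier", "my-instance-id",
            ]],
            "period": 60,
            "stat": "Maximum",
        },
    }

dashboard = {
    "widgets": [
        dms_widget("Source latency", "CDCLatencySource", 0),
        dms_widget("Target latency", "CDCLatencyTarget", 6),
        dms_widget("Source throughput", "CDCThroughputRowsSource", 12),
        dms_widget("Target throughput", "CDCThroughputRowsTarget", 18),
    ]
}

cloudwatch.put_dashboard(
    DashboardName="dms-operator",
    DashboardBody=json.dumps(dashboard),
)
```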
Enhanced Logging Configuration
Default DMS logging captures basic events, but enhanced logging provides the detail needed for serious troubleshooting:
Enable component-level logging in the task settings so each stage of replication writes detailed messages:
{
  "Logging": {
    "EnableLogging": true,
    "LogComponents": [
      {
        "Id": "TRANSFORMATION",
        "Severity": "LOGGER_SEVERITY_DEFAULT"
      },
      {
        "Id": "SOURCE_UNLOAD",
        "Severity": "LOGGER_SEVERITY_INFO"
      },
      {
        "Id": "TARGET_LOAD",
        "Severity": "LOGGER_SEVERITY_INFO"
      },
      {
        "Id": "SOURCE_CAPTURE",
        "Severity": "LOGGER_SEVERITY_INFO"
      },
      {
        "Id": "TARGET_APPLY",
        "Severity": "LOGGER_SEVERITY_INFO"
      }
    ]
  }
}
Critical logging components to enable:
- SOURCE_UNLOAD: Logs full load read operations, query execution times, and batching behavior
- TARGET_LOAD: Logs full load write operations, batch sizes, and target database responses
- SOURCE_CAPTURE: Logs CDC capture from source logs (binlog, WAL, etc.)
- TARGET_APPLY: Logs CDC application to target, including transaction commits and errors
- TRANSFORMATION: Logs data transformation operations and mapping errors
Set severity to INFO or DEBUG for troubleshooting, but be aware DEBUG generates substantial log volume. Use INFO for production and DEBUG only during active investigations.
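If you manage tasks through the API rather than the console, the same logging settings can be applied with modify_replication_task. A sketch with a placeholder ARN; the task must be stopped before its settings can be modified:

```python
# Sketch: apply the logging components above to an existing (stopped) task.
import json
import boto3

dms = boto3.client("dms")

logging_settings = {
    "Logging": {
        "EnableLogging": True,
        "LogComponents": [
            {"Id": "SOURCE_UNLOAD", "Severity": "LOGGER_SEVERITY_INFO"},
            {"Id": "TARGET_LOAD", "Severity": "LOGGER_SEVERITY_INFO"},
            {"Id": "SOURCE_CAPTURE", "Severity": "LOGGER_SEVERITY_INFO"},
            {"Id": "TARGET_APPLY", "Severity": "LOGGER_SEVERITY_INFO"},
            {"Id": "TRANSFORMATION", "Severity": "LOGGER_SEVERITY_DEFAULT"},
        ],
    }
}

dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:us-east-1:123456789012:task:EXAMPLE",  # placeholder
    ReplicationTaskSettings=json.dumps(logging_settings),
)
```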
CloudWatch Logs Insights for DMS Analysis
CloudWatch Logs Insights enables powerful log analysis through its purpose-built, pipe-based query language:
Finding all errors in the last hour:
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
Identifying slow full load operations:
fields @timestamp, @message
| filter @message like /Table.*loaded/
| parse @message "Table '*' loaded in * milliseconds" as table, duration
| stats max(duration) as max_load_time, avg(duration) as avg_load_time by table
| sort max_load_time desc
Tracking CDC latency over time from logs:
fields @timestamp, @message
| filter @message like /CDC latency/
| parse @message "CDC latency: * milliseconds" as latency
| stats avg(latency) as avg_latency by bin(5m)
Analyzing error patterns:
fields @timestamp, @message
| filter @message like /ERROR/
| parse @message "ERROR: *" as error_type
| stats count() as error_count by error_type
| sort error_count desc
Save frequently-used queries for quick access during incidents. Export query results for trend analysis or compliance documentation.
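Saved queries can also be run programmatically so results feed reports or tickets. A sketch, assuming the usual dms-tasks-* log group naming:

```python
# Sketch: run an error-trend query against a DMS task log group and print results.
import time
import boto3

logs = boto3.client("logs")

QUERY = """
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as error_count by bin(1h)
"""

now = int(time.time())
query_id = logs.start_query(
    logGroupName="dms-tasks-my-replication-instance",  # placeholder
    startTime=now - 24 * 3600,
    endTime=now,
    queryString=QUERY,
)["queryId"]

# Poll until the query finishes, then print the aggregated rows.
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(2)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})
```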
Setting Up Effective Alerting
Monitoring without actionable alerts is just passive observation. Effective alerting balances sensitivity (catching real issues) with specificity (avoiding false positives).
Critical Alert Rules
Configure CloudWatch Alarms for conditions requiring immediate response:
Task failure alert: Alert when any replication task enters Failed state. This is your most critical alert—it means replication has completely stopped.
CDC latency exceeds threshold: Alert when CDCLatencySource exceeds your RPO requirement. For real-time replication needs, alert at 60 seconds; for near-real-time, 300 seconds may be acceptable.
Failed events detected: Alert immediately when FailedEventCount > 0. Failed events represent data that won’t replicate without intervention—investigate and resolve before they accumulate.
Validation failures: If using DMS validation, alert on any validation failures. These reveal differences between source and target data that may point to bugs or lost transactions.
Resource exhaustion: Alert when FreeableMemory < 1GB or SwapUsage > 0. Memory exhaustion causes severe performance degradation and task instability.
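For the task-failure case specifically, a DMS event subscription can push failure events to an SNS topic without any polling. A sketch with placeholder ARNs and identifiers; the exact category names can be listed with describe_event_categories:

```python
# Sketch: notify an SNS topic on replication task failures and state changes.
import boto3

dms = boto3.client("dms")

dms.create_event_subscription(
    SubscriptionName="dms-task-failures",
    SnsTopicArn="arn:aws:sns:us-east-1:123456789012:dms-critical",  # placeholder
    SourceType="replication-task",
    EventCategories=["failure", "state change"],  # confirm with describe_event_categories
    SourceIds=["my-task-id"],
    Enabled=True,
)
```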
Warning-Level Alerts
Warning alerts indicate developing problems that need investigation but aren’t yet critical:
Rising CDC latency trend: Alert when CDC latency increases by >50% over a 15-minute period. This catches degrading performance before it violates absolute thresholds.
Decreased throughput: Alert when full load or CDC throughput drops >40% from baseline without corresponding source database changes. Indicates capacity or connectivity issues.
Elevated error rate: Alert when error count rate exceeds normal baseline by 3x. Occasional errors are expected; sharp increases indicate systemic problems.
High resource utilization: Alert when CPU >80% or memory <30% free for sustained periods (>10 minutes). Gives time to scale before resource exhaustion occurs.
Configure alert routing based on severity—page on-call for critical alerts, email for warnings. Include runbook links in alert descriptions to guide response.
Troubleshooting with Monitoring Data
Effective monitoring provides the data needed to diagnose and resolve problems quickly. Here’s how to leverage DMS monitoring for common issues.
Diagnosing High CDC Latency
When CDC latency rises above acceptable levels, systematic analysis reveals the bottleneck:
Step 1: Determine if latency is from source capture or target apply
- Check CDCLatencySource vs CDCLatencyTarget
- If source latency is high but target is low, the problem is capturing from the source
- If source is low but target is high, the problem is applying to the target
- If both are high, check network latency and replication instance resources
Step 2: Analyze throughput patterns
- Compare CDCThroughputRowsSource with the source database transaction rate
- If DMS throughput is well below the database transaction rate, capacity is insufficient
- If throughput drops suddenly, check for schema changes or large transactions
Step 3: Check resource utilization
- Review CPU, memory, and network metrics
- High CPU suggests transformation overhead or complex filters
- Low memory indicates potential sizing issues
- Network saturation suggests bandwidth bottleneck
Step 4: Examine logs for clues
- Search for ERROR or WARNING messages during high-latency periods
- Look for “Table * is being reloaded” messages indicating schema changes
- Check for “Retrying operation” messages suggesting transient failures
Resolution strategies:
- Source latency + low resources → Scale replication instance
- Target latency + slow target writes → Optimize target indexes or scale target database
- High CPU + complex transformations → Simplify transformations or scale instance
- Network saturation → Increase replication instance size for better network performance
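Step 1 of this workflow is easy to script. A sketch that compares the two latency metrics for one task, using placeholder identifiers and a 60-second threshold as an example:

```python
# Sketch: triage whether CDC lag comes from source capture or target apply.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
DIMS = [
    {"Name": "ReplicationTaskIdentifier", "Value": "my-task-id"},
    {"Name": "ReplicationInstanceIdentifier", "Value": "my-instance-id"},
]

def recent_max(metric):
    points = cloudwatch.get_metric_statistics(
        Namespace="AWS/DMS",
        MetricName=metric,
        Dimensions=DIMS,
        StartTime=datetime.now(timezone.utc) - timedelta(minutes=15),
        EndTime=datetime.now(timezone.utc),
        Period=60,
        Statistics=["Maximum"],
    )["Datapoints"]
    return max((p["Maximum"] for p in points), default=0)

source, target = recent_max("CDCLatencySource"), recent_max("CDCLatencyTarget")
if source > 60 and target <= 60:
    print(f"Source capture is behind ({source:.0f}s): check source logs and capture settings")
elif target > 60 and source <= 60:
    print(f"Target apply is behind ({target:.0f}s): check target capacity and indexes")
elif source > 60 and target > 60:
    print("Both sides lag: check network and replication instance resources")
else:
    print(f"Latency healthy (source {source:.0f}s, target {target:.0f}s)")
```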
Resolving Failed Tasks
Task failures are the most visible DMS problems. Monitoring helps quickly identify root causes:
Immediate diagnostics:
- Check task logs in CloudWatch for the error message that caused failure
- Verify source and target endpoint connectivity
- Check database credentials and permissions
- Review recent schema changes in source or target
Common failure patterns and solutions:
Pattern: “Insufficient privileges”
- Cause: Database user permissions changed or insufficient for CDC
- Monitoring evidence: Logs show permission denied errors
- Solution: Grant required permissions (SELECT + appropriate CDC grants)
Pattern: “Table does not exist”
- Cause: Target table dropped or schema mismatch
- Monitoring evidence: Logs show table not found errors
- Solution: Recreate table or fix mapping in task settings
Pattern: “Connection timeout”
- Cause: Network connectivity issues or database overload
- Monitoring evidence: Endpoint metrics show connection failures
- Solution: Check security groups, database firewall rules, or scale database
Pattern: “Out of memory”
- Cause: Insufficient replication instance memory
- Monitoring evidence: Memory metrics show sustained low freeable memory
- Solution: Scale to larger instance type or reduce batch sizes
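To speed up the initial log check, the most recent error lines can be pulled straight from the task's log group. A sketch, again assuming DMS's usual dms-tasks-&lt;instance&gt; naming:

```python
# Sketch: fetch recent ERROR/FATAL lines from a DMS task log group after a failure.
import boto3

logs = boto3.client("logs")

response = logs.filter_log_events(
    logGroupName="dms-tasks-my-replication-instance",  # placeholder
    filterPattern="?ERROR ?FATAL",   # match lines containing ERROR or FATAL
    limit=50,
)

for event in response["events"]:
    print(event["timestamp"], event["message"].strip())
```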
Optimizing Full Load Performance
Slow full loads delay migration cutover. Monitoring identifies optimization opportunities:
Analyze per-table throughput: Use CloudWatch Logs Insights to identify which tables load slowly. Large tables with many indexes or complex foreign keys are common bottlenecks.
Check parallelism configuration: DMS can load multiple tables in parallel. Ensure MaxFullLoadSubTasks is set appropriately for your replication instance size. Too low limits throughput; too high causes resource contention.
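A hedged sketch of adjusting that setting via the API, with a placeholder task ARN; the task must be stopped before its settings can be modified, and the right value depends on instance size:

```python
# Sketch: raise full-load parallelism on a stopped task.
import json
import boto3

dms = boto3.client("dms")

dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:us-east-1:123456789012:task:EXAMPLE",  # placeholder
    ReplicationTaskSettings=json.dumps({
        "FullLoadSettings": {
            "MaxFullLoadSubTasks": 16,   # tables loaded in parallel (default is 8)
            "CommitRate": 10000,         # rows per batch during full load
        }
    }),
)
```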
Monitor target database metrics: If target throughput lags source, the target is the bottleneck. Check target database CPU, I/O, and connection count. Consider temporarily disabling indexes for bulk load, recreating them afterward.
Review transformation overhead: Complex transformations (especially LOB handling) significantly impact throughput. Monitor CPU utilization—if high during full load, simplify transformations or scale the instance.
DMS Monitoring Checklist
Before migration:
- CloudWatch dashboard created with key metrics
- Critical alerts configured (task failure, high latency, failed events)
- Enhanced logging enabled for all tasks
- Log retention set to 30+ days
- Baseline metrics established from test migrations
- Runbooks documented for common failure scenarios
During full load:
- Monitor throughput every 15 minutes
- Check CDC latency accumulation
- Verify no failed events
- Review resource utilization (CPU, memory, network)
- Track estimated completion time
- Check for errors in logs hourly
During CDC replication:
- CDC latency < RPO requirement continuously
- Zero failed events sustained
- Throughput matches source transaction rate
- Resource utilization stable and under 70%
- Run validation tasks periodically
- Review logs for warnings weekly
Post-migration:
- Full validation completed successfully
- No validation failures or discrepancies
- Export metrics for compliance documentation
- Archive logs for audit purposes
- Document lessons learned and metric baselines
- Update monitoring based on operational experience
Advanced Monitoring Strategies
Beyond basic metrics and logs, advanced monitoring patterns provide deeper insights for complex migrations.
Custom CloudWatch Metrics
While AWS provides comprehensive built-in metrics, custom metrics fill gaps for specific use cases:
Row count validation metrics: Use Lambda functions to periodically query source and target databases, compare row counts, and publish discrepancies as custom metrics. This provides continuous validation beyond DMS’s built-in validation.
Business-specific SLA metrics: Calculate metrics that matter to your business—for example, “percentage of customer records replicated within 5 seconds” rather than generic latency averages.
End-to-end replication time: Measure time from when an application writes to source until that data appears in target. Requires correlation IDs or timestamps in application data.
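A sketch of the row-count pattern above, with the database query left as a stand-in (count_rows) and placeholder table names and namespace:

```python
# Sketch: Lambda that publishes source/target row-count discrepancies as a
# custom CloudWatch metric. count_rows() is a stand-in for your own database
# query code (psycopg2, pymysql, etc.).
import boto3

cloudwatch = boto3.client("cloudwatch")
CRITICAL_TABLES = ["customers", "orders", "payments"]  # placeholders

def count_rows(endpoint, table):
    """Stand-in: run SELECT COUNT(*) against the given endpoint."""
    raise NotImplementedError("wire up your database driver here")

def handler(event, context):
    for table in CRITICAL_TABLES:
        discrepancy = abs(count_rows("source", table) - count_rows("target", table))
        cloudwatch.put_metric_data(
            Namespace="Custom/DMSValidation",
            MetricData=[{
                "MetricName": "RowCountDiscrepancy",
                "Dimensions": [{"Name": "TableName", "Value": table}],
                "Value": discrepancy,
                "Unit": "Count",
            }],
        )
```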
Integration with Centralized Monitoring
For organizations with existing monitoring infrastructure, integrate DMS metrics:
Export to Prometheus: Use CloudWatch Exporter to pull DMS metrics into Prometheus for unified monitoring across your infrastructure.
Stream to Datadog or New Relic: Use CloudWatch metric streams to send DMS metrics to third-party APM platforms for correlation with application performance.
Feed to data lake: Export DMS metrics and logs to S3 for long-term analysis, trend detection, and compliance reporting.
Automated Remediation
Combine CloudWatch Alarms with Lambda functions for automatic responses to common issues:
Auto-scaling replication instances: Trigger Lambda when CPU or memory thresholds are exceeded to scale to larger instance types.
Automatic task restart: Configure Lambda to restart tasks that fail due to transient issues (network timeouts, temporary database unavailability).
Notification enrichment: Have Lambda query additional context when alerts fire—current table being processed, recent error patterns, affected data volume—and include in notifications.
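A sketch of the automatic-restart pattern, assuming the Lambda receives the task ARN in the event payload (the field name used below is an assumption; verify the actual event shape before relying on it):

```python
# Sketch: Lambda triggered by a DMS task failure event that resumes the task
# from its last checkpoint.
import boto3

dms = boto3.client("dms")

def handler(event, context):
    task_arn = event["detail"]["SourceArn"]  # assumed field name, placeholder
    status = dms.describe_replication_tasks(
        Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}]
    )["ReplicationTasks"][0]["Status"]

    if status == "failed":
        # resume-processing continues from the last checkpoint instead of reloading
        dms.start_replication_task(
            ReplicationTaskArn=task_arn,
            StartReplicationTaskType="resume-processing",
        )
```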
Automated remediation reduces mean time to recovery but requires careful implementation to avoid masking chronic problems or making situations worse.
Conclusion
Effective monitoring and logging transforms AWS DMS from an opaque migration tool into a reliable, observable data pipeline that operators can confidently manage and troubleshoot. By focusing on the metrics that truly matter—CDC latency, throughput, failed events, and resource utilization—and implementing comprehensive logging with proper retention and analysis capabilities, you build migrations and ongoing replication that meet strict SLAs while enabling rapid incident response. The monitoring infrastructure using CloudWatch dashboards, strategic alerting, and CloudWatch Logs Insights provides the observability foundation that separates successful migrations from those plagued by data loss and mysterious failures.
Success with DMS monitoring requires continuous iteration based on operational experience—tuning alert thresholds to balance sensitivity with false positive rates, adding custom metrics that capture business-specific requirements, and refining dashboards to surface the insights that drive faster problem resolution. Start with the critical metrics and alert rules outlined here, implement enhanced logging across all tasks, and evolve your monitoring practice as you learn what matters most for your specific migration patterns. With proper monitoring in place, DMS becomes a trusted component of your data infrastructure rather than a source of anxiety during critical migration windows.