AWS Database Migration Service (DMS) has become the go-to solution for migrating databases to AWS, enabling everything from simple lift-and-shift migrations to complex heterogeneous migrations and ongoing replication for hybrid architectures. Yet the power of DMS comes with operational complexity: replication tasks can lag, fail silently during full loads, encounter data type conversion errors, or experience network issues that cause subtle data inconsistencies. A migration that looks successful in the AWS Console might have silently dropped thousands of records that only validation would reveal.
Effective monitoring and logging transforms DMS from an unpredictable migration tool into a reliable, observable data pipeline. Without proper observability, you discover problems when users report missing data or when post-migration validation reveals significant discrepancies. With comprehensive monitoring, you detect issues within minutes—lag metrics show capacity problems before they violate RPO requirements, task failures trigger immediate alerts, and detailed logs enable root cause analysis that accelerates resolution from hours to minutes.
This guide provides an in-depth examination of best practices for monitoring and logging AWS DMS, covering the critical metrics and log sources you need to watch, CloudWatch configuration strategies, alerting rules that catch problems early, and troubleshooting approaches that leverage monitoring data to maintain healthy replication pipelines.
Understanding AWS DMS Monitoring Architecture
Before implementing monitoring, it’s essential to understand how DMS exposes operational data and what visibility each layer provides.
DMS Metrics in CloudWatch
AWS DMS automatically publishes metrics to CloudWatch at one-minute intervals, providing visibility into replication performance and health. These metrics fall into several categories:
Task-level metrics track overall replication task status and performance—full load progress, CDC latency, throughput rates, and error counts. These metrics answer “is my migration progressing?” and “how far behind is replication?”
Table-level metrics provide granular visibility into individual table replication—rows loaded, DDL operations performed, validation status. Critical for identifying which specific tables experience problems rather than diagnosing aggregate task failures.
Host-level metrics expose the underlying replication instance’s resource utilization—CPU, memory, storage, network throughput. Essential for capacity planning and identifying resource-constrained tasks.
Endpoint-level metrics reveal source and target endpoint health—connection status, transaction backlog. These help distinguish database-side issues from DMS-side problems.
The challenge with DMS metrics isn’t availability—AWS provides dozens of metrics—but knowing which ones actually matter for your use case and how to interpret them in context.
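As a concrete starting point, the sketch below (Python with boto3, assumed here as the tooling of choice) pulls recent task-level latency from the AWS/DMS namespace. The task and replication instance identifiers are placeholders, and the exact dimension set should be confirmed against what CloudWatch shows for your own tasks.

```python
# Sketch: fetch the last hour of CDCLatencySource for one task.
# "my-task-id" and "my-instance-id" are placeholders.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/DMS",
    MetricName="CDCLatencySource",
    Dimensions=[
        {"Name": "ReplicationTaskIdentifier", "Value": "my-task-id"},
        {"Name": "ReplicationInstanceIdentifier", "Value": "my-instance-id"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=60,                      # DMS publishes at one-minute granularity
    Statistics=["Maximum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```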
DMS Logging Layers
DMS generates multiple log types that serve different troubleshooting purposes:
Task logs contain operational messages about replication progress, connection events, and error conditions. These are your primary troubleshooting resource when tasks fail or behave unexpectedly.
Replication instance logs capture system-level events including instance startup, configuration changes, and infrastructure problems. Less frequently needed than task logs but critical for diagnosing instance-level issues.
Database logs from source and target endpoints provide context around connection issues, query performance, and database-specific errors that affect replication.
CloudWatch Logs stores DMS logs by default, but log retention, organization, and analysis require intentional configuration. Without proper log management, critical diagnostic information expires before you need it or becomes impossible to search effectively.
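Retention is one of the easiest gaps to close. A minimal sketch, assuming DMS's usual dms-tasks-&lt;replication-instance-id&gt; log group naming (verify the actual name in your account):

```python
# Sketch: enforce retention on a DMS task log group so diagnostic data
# survives long enough for post-migration analysis and audits.
import boto3

logs = boto3.client("logs")

LOG_GROUP = "dms-tasks-my-replication-instance"  # assumed naming, placeholder id

logs.put_retention_policy(
    logGroupName=LOG_GROUP,
    retentionInDays=90,
)
```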
Critical Metrics to Monitor
DMS exposes numerous metrics, but focusing on the truly critical ones separates signal from noise and prevents alert fatigue.
Full Load Progress and Performance
During initial full load migrations, these metrics determine whether your migration will complete within your window:
FullLoadThroughputRowsSource: Rows per second read from the source database. Declining throughput indicates source database performance issues, network bottlenecks, or resource constraints on the replication instance. Typical rates vary dramatically by source type—Oracle might sustain 50K rows/sec while MongoDB may only achieve 5K rows/sec due to different storage architectures.
FullLoadThroughputRowsTarget: Rows per second written to the target. If this lags significantly behind source throughput, the target database is the bottleneck—check target database capacity, index creation overhead, or constraint validation costs.
CDCLatencySource and CDCLatencyTarget: Time lag between when changes occur and when DMS processes them. During full load, CDC latency accumulates as DMS buffers ongoing source changes for later application. Rising CDC latency during full load is expected; failure to decrease after full load completion indicates a problem.
MemoryUtilization: Percentage of replication instance memory in use. Full load operations are memory-intensive, especially when buffering CDC changes. Sustained >80% utilization may cause performance degradation or out-of-memory errors.
Monitoring these metrics together reveals the complete picture. High source throughput with low target throughput indicates target bottleneck; low source throughput with abundant instance resources suggests source database performance issues or network problems.
CDC Replication Health
Once full load completes, ongoing CDC replication metrics become critical for operational stability:
CDCLatencySource: The most critical CDC metric. It measures the lag, in seconds, between when a change commits in the source and when DMS captures it. Increasing latency indicates DMS can't keep up with source write rates. Latency under 5 seconds is excellent, under 30 seconds is acceptable for most use cases, and over 60 seconds suggests problems.
CDCLatencyTarget: Time between when DMS captures a change and when it applies it to the target. High target latency despite low source latency indicates target database capacity issues or DMS-to-target network problems.
CDCChangesDiskSource and CDCChangesDiskTarget: Number of change events buffered to disk rather than held in memory. Non-zero values indicate memory pressure forcing spill-to-disk, which degrades performance. Sustained disk buffering suggests undersized replication instances.
CDCThroughputRowsSource and CDCThroughputRowsTarget: Change event processing rates. Compare these with your source database’s typical transaction rates. If DMS throughput is 10% of source transaction rate, you’re likely underprovisioned.
Error and Failure Detection
Error metrics surface problems that might otherwise go undetected until they cause data discrepancies:
ErrorTaskCount: Number of errors the task has encountered. Any non-zero value warrants investigation. Even if the task continues running, errors often indicate dropped transactions or malformed records.
FailedEventSourceCount and FailedEventTargetCount: Events that couldn’t be processed or applied. Unlike transient errors that resolve via retries, failed events typically require manual intervention—they represent data that won’t replicate without fixing the underlying issue.
ValidationSuspendedRecords and ValidationFailedRecords: When using DMS validation, these metrics reveal data consistency problems. Failed validations indicate source and target data differs—potentially from errors, transformation issues, or data type incompatibilities.
Monitor error rates, not just absolute counts. A task with 5 errors per hour might be acceptable noise; the same task suddenly generating 50 errors per hour signals an emerging problem requiring investigation.
Resource Utilization
Replication instance resource metrics help identify capacity constraints before they degrade replication:
CPUUtilization: Percentage of CPU in use. Sustained >80% suggests undersized instance or inefficient transformations. CPU spikes during full load are normal; sustained high CPU during CDC indicates problems.
FreeableMemory: Available memory in bytes. Low freeable memory (<20% of total) causes performance degradation as the instance swaps to disk. Memory pressure often shows up first as rising CDC latency, before FreeableMemory drops to alarming levels.
NetworkReceiveThroughput and NetworkTransmitThroughput: Network bandwidth utilization. If these approach instance type limits, network bandwidth becomes a bottleneck. High receive throughput with low transmit suggests DMS is reading from source but struggling to write to target.
SwapUsage: Bytes swapped to disk. Any significant swap usage indicates severe memory pressure. DMS performance degrades dramatically when swapping occurs—address immediately by scaling instance size.
DMS Metrics Priority Matrix
| Metric | Phase | Alert Threshold | Severity |
|---|---|---|---|
| CDCLatencySource | CDC | > 60 seconds | Critical |
| FullLoadThroughputRows | Full Load | Drops > 50% | Warning |
| FailedEventCount | Both | > 0 | Critical |
| CPUUtilization | Both | > 80% for 10 min | Warning |
| FreeableMemory | Both | < 20% of total | Warning |
| ValidationFailedRecords | Post-migration | > 0 | Critical |
| SwapUsage | Both | > 100 MB | Critical |
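The matrix translates directly into CloudWatch alarms. A hedged sketch for one task, with placeholder identifiers and SNS topic; match the error-metric alarms to the exact metric names your tasks emit:

```python
# Sketch: create alarms for a few rows of the priority matrix.
import boto3

cloudwatch = boto3.client("cloudwatch")

TASK_DIMS = [
    {"Name": "ReplicationTaskIdentifier", "Value": "my-task-id"},
    {"Name": "ReplicationInstanceIdentifier", "Value": "my-instance-id"},
]
INSTANCE_DIMS = [{"Name": "ReplicationInstanceIdentifier", "Value": "my-instance-id"}]
ALERT_TOPIC = "arn:aws:sns:us-east-1:123456789012:dms-alerts"  # placeholder

alarms = [
    # (name, metric, dimensions, statistic, threshold, evaluation periods)
    ("dms-cdc-latency-critical", "CDCLatencySource", TASK_DIMS, "Maximum", 60, 5),
    ("dms-cpu-warning", "CPUUtilization", INSTANCE_DIMS, "Average", 80, 10),
    ("dms-swap-critical", "SwapUsage", INSTANCE_DIMS, "Maximum", 100 * 1024 * 1024, 5),
]

for name, metric, dims, stat, threshold, periods in alarms:
    cloudwatch.put_metric_alarm(
        AlarmName=name,
        Namespace="AWS/DMS",
        MetricName=metric,
        Dimensions=dims,
        Statistic=stat,
        Period=60,
        EvaluationPeriods=periods,
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[ALERT_TOPIC],
        TreatMissingData="breaching",  # a task that stops reporting is itself a problem
    )
```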
Configuring CloudWatch for Optimal Observability
Raw metrics need proper organization, retention, and analysis capabilities to provide actionable insights.
CloudWatch Dashboards for DMS
Build role-specific dashboards that surface relevant information without overwhelming operators:
Executive dashboard shows high-level migration status:
- Overall task status (running, failed, stopped)
- Estimated time to completion for full loads
- CDC latency across all replication tasks
- Total throughput (rows/sec) across the migration
- Recent critical alerts
Operator dashboard provides detailed task monitoring:
- Per-task CDC latency graphs (1h, 6h, 24h views)
- Throughput rates with baseline comparisons
- Error count trends
- Resource utilization (CPU, memory, network)
- Table-level progress for large full loads
Troubleshooting dashboard aids incident response:
- Detailed error messages from CloudWatch Logs Insights
- Source and target endpoint metrics side-by-side
- Network throughput with bandwidth limits
- Memory breakdown (heap, cache, buffers)
- Transaction backlog at source database
Organize graphs to show related metrics together. For example, display source latency, target latency, source throughput, and target throughput on the same dashboard row so correlation is obvious during investigations.
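A minimal sketch of that operator row as a CloudWatch dashboard, with placeholder identifiers and region:

```python
# Sketch: source/target latency and throughput side by side on one dashboard row.
import json
import boto3

cloudwatch = boto3.client("cloudwatch")

def dms_widget(title, metric, x):
    return {
        "type": "metric",
        "x": x, "y": 0, "width": 6, "height": 6,
        "properties": {
            "title": title,
            "region": "us-east-1",  # placeholder
            "metrics": [[
                "AWS/DMS", metric,
                "ReplicationTaskIdentifier", "my-task-id",
                "ReplicationInstanceIdentifier", "my-instance-id",
            ]],
            "period": 60,
            "stat": "Maximum",
        },
    }

dashboard = {
    "widgets": [
        dms_widget("Source latency", "CDCLatencySource", 0),
        dms_widget("Target latency", "CDCLatencyTarget", 6),
        dms_widget("Source throughput", "CDCThroughputRowsSource", 12),
        dms_widget("Target throughput", "CDCThroughputRowsTarget", 18),
    ]
}

cloudwatch.put_dashboard(
    DashboardName="dms-operator",
    DashboardBody=json.dumps(dashboard),
)
```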
Enhanced Logging Configuration
Default DMS logging captures basic events, but enhanced logging provides the detail needed for serious troubleshooting:
Enable component-level logging in the task settings so each stage of replication writes detailed messages:
{
  "Logging": {
    "EnableLogging": true,
    "LogComponents": [
      {
        "Id": "TRANSFORMATION",
        "Severity": "LOGGER_SEVERITY_DEFAULT"
      },
      {
        "Id": "SOURCE_UNLOAD",
        "Severity": "LOGGER_SEVERITY_INFO"
      },
      {
        "Id": "TARGET_LOAD",
        "Severity": "LOGGER_SEVERITY_INFO"
      },
      {
        "Id": "SOURCE_CAPTURE",
        "Severity": "LOGGER_SEVERITY_INFO"
      },
      {
        "Id": "TARGET_APPLY",
        "Severity": "LOGGER_SEVERITY_INFO"
      }
    ]
  }
}
Critical logging components to enable:
- SOURCE_UNLOAD: Logs full load read operations, query execution times, and batching behavior
- TARGET_LOAD: Logs full load write operations, batch sizes, and target database responses
- SOURCE_CAPTURE: Logs CDC capture from source logs (binlog, WAL, etc.)
- TARGET_APPLY: Logs CDC application to target, including transaction commits and errors
- TRANSFORMATION: Logs data transformation operations and mapping errors
Set severity to INFO or DEBUG for troubleshooting, but be aware DEBUG generates substantial log volume. Use INFO for production and DEBUG only during active investigations.
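If you manage tasks through the API rather than the console, the same logging settings can be applied with modify_replication_task. A sketch with a placeholder ARN; the task must be stopped before its settings can be modified:

```python
# Sketch: apply the logging components above to an existing (stopped) task.
import json
import boto3

dms = boto3.client("dms")

logging_settings = {
    "Logging": {
        "EnableLogging": True,
        "LogComponents": [
            {"Id": "SOURCE_UNLOAD", "Severity": "LOGGER_SEVERITY_INFO"},
            {"Id": "TARGET_LOAD", "Severity": "LOGGER_SEVERITY_INFO"},
            {"Id": "SOURCE_CAPTURE", "Severity": "LOGGER_SEVERITY_INFO"},
            {"Id": "TARGET_APPLY", "Severity": "LOGGER_SEVERITY_INFO"},
            {"Id": "TRANSFORMATION", "Severity": "LOGGER_SEVERITY_DEFAULT"},
        ],
    }
}

dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:us-east-1:123456789012:task:EXAMPLE",  # placeholder
    ReplicationTaskSettings=json.dumps(logging_settings),
)
```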
CloudWatch Logs Insights for DMS Analysis
CloudWatch Logs Insights enables powerful log analysis through its purpose-built, pipe-based query language:
Finding all errors in the last hour:
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
Identifying slow full load operations:
fields @timestamp, @message
| filter @message like /Table.*loaded/
| parse @message "Table '*' loaded in * milliseconds" as table, duration
| stats max(duration) as max_load_time, avg(duration) as avg_load_time by table
| sort max_load_time desc
Tracking CDC latency over time from logs:
fields @timestamp, @message
| filter @message like /CDC latency/
| parse @message "CDC latency: * milliseconds" as latency
| stats avg(latency) as avg_latency by bin(5m)
Analyzing error patterns:
fields @timestamp, @message
| filter @message like /ERROR/
| parse @message "ERROR: *" as error_type
| stats count() as error_count by error_type
| sort error_count desc
Save frequently-used queries for quick access during incidents. Export query results for trend analysis or compliance documentation.
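Saved queries can also be run programmatically so results feed reports or tickets. A sketch, assuming the usual dms-tasks-* log group naming:

```python
# Sketch: run an error-trend query against a DMS task log group and print results.
import time
import boto3

logs = boto3.client("logs")

QUERY = """
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as error_count by bin(1h)
"""

now = int(time.time())
query_id = logs.start_query(
    logGroupName="dms-tasks-my-replication-instance",  # placeholder
    startTime=now - 24 * 3600,
    endTime=now,
    queryString=QUERY,
)["queryId"]

# Poll until the query finishes, then print the aggregated rows.
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(2)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})
```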
Setting Up Effective Alerting
Monitoring without actionable alerts is just passive observation. Effective alerting balances sensitivity (catching real issues) with specificity (avoiding false positives).
Critical Alert Rules
Configure CloudWatch Alarms for conditions requiring immediate response:
Task failure alert: Alert when any replication task enters Failed state. This is your most critical alert—it means replication has completely stopped.
CDC latency exceeds threshold: Alert when CDCLatencySource exceeds your RPO requirement. For real-time replication needs, alert at 60 seconds; for near-real-time, 300 seconds may be acceptable.
Failed events detected: Alert immediately when FailedEventCount > 0. Failed events represent data that won’t replicate without intervention—investigate and resolve before they accumulate.
Validation failures: If using DMS validation, alert on any validation failures. These reveal differences between source and target data that may point to bugs or lost transactions.
Resource exhaustion: Alert when FreeableMemory < 1GB or SwapUsage > 0. Memory exhaustion causes severe performance degradation and task instability.
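For the task-failure case specifically, a DMS event subscription can push failure events to an SNS topic without any polling. A sketch with placeholder ARNs and identifiers; the exact category names can be listed with describe_event_categories:

```python
# Sketch: notify an SNS topic on replication task failures and state changes.
import boto3

dms = boto3.client("dms")

dms.create_event_subscription(
    SubscriptionName="dms-task-failures",
    SnsTopicArn="arn:aws:sns:us-east-1:123456789012:dms-critical",  # placeholder
    SourceType="replication-task",
    EventCategories=["failure", "state change"],  # confirm with describe_event_categories
    SourceIds=["my-task-id"],
    Enabled=True,
)
```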
Warning-Level Alerts
Warning alerts indicate developing problems that need investigation but aren’t yet critical:
Rising CDC latency trend: Alert when CDC latency increases by >50% over a 15-minute period. This catches degrading performance before it violates absolute thresholds.
Decreased throughput: Alert when full load or CDC throughput drops >40% from baseline without corresponding source database changes. Indicates capacity or connectivity issues.
Elevated error rate: Alert when error count rate exceeds normal baseline by 3x. Occasional errors are expected; sharp increases indicate systemic problems.
High resource utilization: Alert when CPU >80% or memory <30% free for sustained periods (>10 minutes). Gives time to scale before resource exhaustion occurs.
Configure alert routing based on severity—page on-call for critical alerts, email for warnings. Include runbook links in alert descriptions to guide response.
Troubleshooting with Monitoring Data
Effective monitoring provides the data needed to diagnose and resolve problems quickly. Here’s how to leverage DMS monitoring for common issues.
Diagnosing High CDC Latency
When CDC latency rises above acceptable levels, systematic analysis reveals the bottleneck:
Step 1: Determine if latency is from source capture or target apply
- Check CDCLatencySource vs CDCLatencyTarget
- If source latency is high but target is low, the problem is capturing from the source
- If source is low but target is high, the problem is applying to the target
- If both are high, check network latency and replication instance resources
Step 2: Analyze throughput patterns
- Compare CDCThroughputRowsSource with the source database transaction rate
- If DMS throughput is well below the database transaction rate, capacity is insufficient
- If throughput drops suddenly, check for schema changes or large transactions
Step 3: Check resource utilization
- Review CPU, memory, and network metrics
- High CPU suggests transformation overhead or complex filters
- Low memory indicates potential sizing issues
- Network saturation suggests bandwidth bottleneck
Step 4: Examine logs for clues
- Search for ERROR or WARNING messages during high-latency periods
- Look for “Table * is being reloaded” messages indicating schema changes
- Check for “Retrying operation” messages suggesting transient failures
Resolution strategies:
- Source latency + low resources → Scale replication instance
- Target latency + slow target writes → Optimize target indexes or scale target database
- High CPU + complex transformations → Simplify transformations or scale instance
- Network saturation → Increase replication instance size for better network performance
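Step 1 of this workflow is easy to script. A sketch that compares the two latency metrics for one task, using placeholder identifiers and a 60-second threshold as an example:

```python
# Sketch: triage whether CDC lag comes from source capture or target apply.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
DIMS = [
    {"Name": "ReplicationTaskIdentifier", "Value": "my-task-id"},
    {"Name": "ReplicationInstanceIdentifier", "Value": "my-instance-id"},
]

def recent_max(metric):
    points = cloudwatch.get_metric_statistics(
        Namespace="AWS/DMS",
        MetricName=metric,
        Dimensions=DIMS,
        StartTime=datetime.now(timezone.utc) - timedelta(minutes=15),
        EndTime=datetime.now(timezone.utc),
        Period=60,
        Statistics=["Maximum"],
    )["Datapoints"]
    return max((p["Maximum"] for p in points), default=0)

source, target = recent_max("CDCLatencySource"), recent_max("CDCLatencyTarget")
if source > 60 and target <= 60:
    print(f"Source capture is behind ({source:.0f}s): check source logs and capture settings")
elif target > 60 and source <= 60:
    print(f"Target apply is behind ({target:.0f}s): check target capacity and indexes")
elif source > 60 and target > 60:
    print("Both sides lag: check network and replication instance resources")
else:
    print(f"Latency healthy (source {source:.0f}s, target {target:.0f}s)")
```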
Resolving Failed Tasks
Task failures are the most visible DMS problems. Monitoring helps quickly identify root causes:
Immediate diagnostics:
- Check task logs in CloudWatch for the error message that caused failure
- Verify source and target endpoint connectivity
- Check database credentials and permissions
- Review recent schema changes in source or target
Common failure patterns and solutions:
Pattern: “Insufficient privileges”
- Cause: Database user permissions changed or insufficient for CDC
- Monitoring evidence: Logs show permission denied errors
- Solution: Grant required permissions (SELECT + appropriate CDC grants)
Pattern: “Table does not exist”
- Cause: Target table dropped or schema mismatch
- Monitoring evidence: Logs show table not found errors
- Solution: Recreate table or fix mapping in task settings
Pattern: “Connection timeout”
- Cause: Network connectivity issues or database overload
- Monitoring evidence: Endpoint metrics show connection failures
- Solution: Check security groups, database firewall rules, or scale database
Pattern: “Out of memory”
- Cause: Insufficient replication instance memory
- Monitoring evidence: Memory metrics show sustained low freeable memory
- Solution: Scale to larger instance type or reduce batch sizes
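To speed up the initial log check, the most recent error lines can be pulled straight from the task's log group. A sketch, again assuming DMS's usual dms-tasks-&lt;instance&gt; naming:

```python
# Sketch: fetch recent ERROR/FATAL lines from a DMS task log group after a failure.
import boto3

logs = boto3.client("logs")

response = logs.filter_log_events(
    logGroupName="dms-tasks-my-replication-instance",  # placeholder
    filterPattern="?ERROR ?FATAL",   # match lines containing ERROR or FATAL
    limit=50,
)

for event in response["events"]:
    print(event["timestamp"], event["message"].strip())
```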
Optimizing Full Load Performance
Slow full loads delay migration cutover. Monitoring identifies optimization opportunities:
Analyze per-table throughput: Use CloudWatch Logs Insights to identify which tables load slowly. Large tables with many indexes or complex foreign keys are common bottlenecks.
Check parallelism configuration: DMS can load multiple tables in parallel. Ensure MaxFullLoadSubTasks is set appropriately for your replication instance size. Too low limits throughput; too high causes resource contention.
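A hedged sketch of adjusting that setting via the API, with a placeholder task ARN; the task must be stopped before its settings can be modified, and the right value depends on instance size:

```python
# Sketch: raise full-load parallelism on a stopped task.
import json
import boto3

dms = boto3.client("dms")

dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:us-east-1:123456789012:task:EXAMPLE",  # placeholder
    ReplicationTaskSettings=json.dumps({
        "FullLoadSettings": {
            "MaxFullLoadSubTasks": 16,   # tables loaded in parallel (default is 8)
            "CommitRate": 10000,         # rows per batch during full load
        }
    }),
)
```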
Monitor target database metrics: If target throughput lags source, the target is the bottleneck. Check target database CPU, I/O, and connection count. Consider temporarily disabling indexes for bulk load, recreating them afterward.
Review transformation overhead: Complex transformations (especially LOB handling) significantly impact throughput. Monitor CPU utilization—if high during full load, simplify transformations or scale the instance.
DMS Monitoring Checklist
Before migration:
- CloudWatch dashboard created with key metrics
- Critical alerts configured (task failure, high latency, failed events)
- Enhanced logging enabled for all tasks
- Log retention set to 30+ days
- Baseline metrics established from test migrations
- Runbooks documented for common failure scenarios
During full load:
- Monitor throughput every 15 minutes
- Check CDC latency accumulation
- Verify no failed events
- Review resource utilization (CPU, memory, network)
- Track estimated completion time
- Check for errors in logs hourly
During CDC replication:
- CDC latency < RPO requirement continuously
- Zero failed events sustained
- Throughput matches source transaction rate
- Resource utilization stable and under 70%
- Run validation tasks periodically
- Review logs for warnings weekly
Post-migration:
- Full validation completed successfully
- No validation failures or discrepancies
- Export metrics for compliance documentation
- Archive logs for audit purposes
- Document lessons learned and metric baselines
- Update monitoring based on operational experience
Advanced Monitoring Strategies
Beyond basic metrics and logs, advanced monitoring patterns provide deeper insights for complex migrations.
Custom CloudWatch Metrics
While AWS provides comprehensive built-in metrics, custom metrics fill gaps for specific use cases:
Row count validation metrics: Use Lambda functions to periodically query source and target databases, compare row counts, and publish discrepancies as custom metrics. This provides continuous validation beyond DMS’s built-in validation.
Business-specific SLA metrics: Calculate metrics that matter to your business—for example, “percentage of customer records replicated within 5 seconds” rather than generic latency averages.
End-to-end replication time: Measure time from when an application writes to source until that data appears in target. Requires correlation IDs or timestamps in application data.
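A sketch of the row-count pattern above, with the database query left as a stand-in (count_rows) and placeholder table names and namespace:

```python
# Sketch: Lambda that publishes source/target row-count discrepancies as a
# custom CloudWatch metric. count_rows() is a stand-in for your own database
# query code (psycopg2, pymysql, etc.).
import boto3

cloudwatch = boto3.client("cloudwatch")
CRITICAL_TABLES = ["customers", "orders", "payments"]  # placeholders

def count_rows(endpoint, table):
    """Stand-in: run SELECT COUNT(*) against the given endpoint."""
    raise NotImplementedError("wire up your database driver here")

def handler(event, context):
    for table in CRITICAL_TABLES:
        discrepancy = abs(count_rows("source", table) - count_rows("target", table))
        cloudwatch.put_metric_data(
            Namespace="Custom/DMSValidation",
            MetricData=[{
                "MetricName": "RowCountDiscrepancy",
                "Dimensions": [{"Name": "TableName", "Value": table}],
                "Value": discrepancy,
                "Unit": "Count",
            }],
        )
```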
Integration with Centralized Monitoring
For organizations with existing monitoring infrastructure, integrate DMS metrics:
Export to Prometheus: Use CloudWatch Exporter to pull DMS metrics into Prometheus for unified monitoring across your infrastructure.
Stream to Datadog or New Relic: Use CloudWatch metric streams to send DMS metrics to third-party APM platforms for correlation with application performance.
Feed to data lake: Export DMS metrics and logs to S3 for long-term analysis, trend detection, and compliance reporting.
Automated Remediation
Combine CloudWatch Alarms with Lambda functions for automatic responses to common issues:
Auto-scaling replication instances: Trigger Lambda when CPU or memory thresholds are exceeded to scale to larger instance types.
Automatic task restart: Configure Lambda to restart tasks that fail due to transient issues (network timeouts, temporary database unavailability).
Notification enrichment: Have Lambda query additional context when alerts fire—current table being processed, recent error patterns, affected data volume—and include in notifications.
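A sketch of the automatic-restart pattern, assuming the Lambda receives the task ARN in the event payload (the field name used below is an assumption; verify the actual event shape before relying on it):

```python
# Sketch: Lambda triggered by a DMS task failure event that resumes the task
# from its last checkpoint.
import boto3

dms = boto3.client("dms")

def handler(event, context):
    task_arn = event["detail"]["SourceArn"]  # assumed field name, placeholder
    status = dms.describe_replication_tasks(
        Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}]
    )["ReplicationTasks"][0]["Status"]

    if status == "failed":
        # resume-processing continues from the last checkpoint instead of reloading
        dms.start_replication_task(
            ReplicationTaskArn=task_arn,
            StartReplicationTaskType="resume-processing",
        )
```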
Automated remediation reduces mean time to recovery but requires careful implementation to avoid masking chronic problems or making situations worse.
Conclusion
Effective monitoring and logging transforms AWS DMS from an opaque migration tool into a reliable, observable data pipeline that operators can confidently manage and troubleshoot. By focusing on the metrics that truly matter—CDC latency, throughput, failed events, and resource utilization—and implementing comprehensive logging with proper retention and analysis capabilities, you build migrations and ongoing replication that meet strict SLAs while enabling rapid incident response. The monitoring infrastructure using CloudWatch dashboards, strategic alerting, and CloudWatch Logs Insights provides the observability foundation that separates successful migrations from those plagued by data loss and mysterious failures.
Success with DMS monitoring requires continuous iteration based on operational experience—tuning alert thresholds to balance sensitivity with false positive rates, adding custom metrics that capture business-specific requirements, and refining dashboards to surface the insights that drive faster problem resolution. Start with the critical metrics and alert rules outlined here, implement enhanced logging across all tasks, and evolve your monitoring practice as you learn what matters most for your specific migration patterns. With proper monitoring in place, DMS becomes a trusted component of your data infrastructure rather than a source of anxiety during critical migration windows.