DMS Migration Strategies for Production Databases

Migrating production databases represents one of the most high-stakes operations in enterprise IT. Unlike test environments where failures are learning opportunities, production migrations must succeed while maintaining business continuity, preserving data integrity, and meeting strict uptime requirements. AWS Database Migration Service (DMS) has emerged as a powerful tool for these critical migrations, but simply spinning up a DMS replication instance and hoping for the best leads to failed cutover attempts, extended downtime windows, and data inconsistencies. Successful production migrations using DMS require carefully planned strategies that account for your database size, acceptable downtime, data consistency requirements, and rollback capabilities.

Understanding DMS Migration Phases and Their Implications

DMS operates through three distinct phases: full load, change data capture (CDC), and ongoing replication. How you orchestrate these phases determines your migration’s success, downtime window, and ability to validate results before committing to the new environment.

The full load phase performs an initial bulk copy of all tables from source to target. DMS parallelizes this operation across multiple tables and can subdivide large tables into segments for parallel loading. During full load, DMS doesn’t capture changes occurring on the source database—those changes are either lost or require CDC to capture them. This characteristic fundamentally shapes your migration strategy options.

CDC continuously captures changes from the source database’s transaction logs and applies them to the target. This enables minimal-downtime migrations where you perform the full load while the source remains active, then use CDC to catch up the target with changes that occurred during the full load. Once the target catches up, you can cut over with only seconds or minutes of downtime required to ensure no transactions are in flight.

The ongoing replication phase continues CDC indefinitely, keeping source and target synchronized. This capability enables extended validation periods where you run applications against both databases simultaneously, comparing results before committing to the new environment. It also provides a rollback path—if issues arise after cutover, you can reverse the replication direction and fall back to the original database.
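
These phases map directly to a replication task's MigrationType: 'full-load', 'cdc', or 'full-load-and-cdc'. As a rough sketch with boto3 (the ARNs, identifiers, and the catch-all table mapping are placeholders), a task that runs a full load followed by ongoing replication is created like this:

import json
import boto3

dms = boto3.client('dms')

# 'full-load-and-cdc' runs the full load phase, then switches to ongoing
# replication; 'full-load' and 'cdc' each run a single phase in isolation.
response = dms.create_replication_task(
    ReplicationTaskIdentifier='orders-db-migration',       # placeholder name
    SourceEndpointArn='arn:aws:dms:...:endpoint:SOURCE',    # placeholder ARNs
    TargetEndpointArn='arn:aws:dms:...:endpoint:TARGET',
    ReplicationInstanceArn='arn:aws:dms:...:rep:INSTANCE',
    MigrationType='full-load-and-cdc',
    TableMappings=json.dumps({
        'rules': [{
            'rule-type': 'selection',
            'rule-id': '1',
            'rule-name': 'include-all',
            'object-locator': {'schema-name': '%', 'table-name': '%'},
            'rule-action': 'include',
        }]
    }),
)
print(response['ReplicationTask']['Status'])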

Choosing Between Migration Patterns

Your migration strategy fundamentally depends on how much downtime you can tolerate and how much validation you need before committing to the target database.

Offline migration shuts down applications, performs full load without CDC, validates the target, and brings applications back online pointing to the new database. This approach is simplest and most reliable because you don’t need to worry about consistency between source and target during migration. However, it requires a downtime window equal to the full load duration plus validation time—often many hours for large databases.

Minimal-downtime migration uses full load plus CDC to keep downtime to minutes rather than hours. Applications remain online during full load, CDC captures all changes, and you cut over once CDC catches up. This pattern works for most production scenarios where brief maintenance windows are acceptable but extended outages would impact business operations.

Zero-downtime migration maintains continuous operation throughout the migration by running applications against both databases simultaneously or using read replicas and gradual traffic shifting. This approach requires the most sophisticated orchestration but enables migrations with no visible user impact. Few organizations truly need zero-downtime migrations—the complexity often outweighs the benefits unless downtime costs are astronomical.

Migration Strategy Decision Matrix

  • Offline migration: downtime measured in hours, low complexity, low risk. Best for non-critical systems, small databases, and dev/test environments.
  • Minimal-downtime migration: downtime measured in minutes, medium complexity, medium risk. Best for most production systems, 24/7 operations, and large databases.
  • Zero-downtime migration: no downtime, high complexity, high risk. Best for financial systems, healthcare, and ultra-critical services.

Implementing Minimal-Downtime Migrations with CDC

The minimal-downtime pattern using full load plus CDC represents the sweet spot for most production migrations—short enough downtime to fit within maintenance windows, but simple enough to execute reliably without extensive orchestration complexity.

Pre-Migration Preparation

Success begins long before you start the DMS task. Thoroughly validate that your source and target databases are compatible with DMS. Not all database features translate cleanly—stored procedures, triggers, functions, and certain data types may require manual handling. Run DMS’s pre-migration assessment to identify potential issues before you’re deep into the migration.
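
A minimal sketch of kicking off that assessment with boto3, assuming the replication task already exists and you have an IAM role DMS can use to write results to S3 (the ARNs and bucket name are placeholders):

import boto3

dms = boto3.client('dms')

# Start a premigration assessment run; detailed results land in the S3 bucket.
run = dms.start_replication_task_assessment_run(
    ReplicationTaskArn='arn:aws:dms:...:task:TASK',                       # placeholder
    ServiceAccessRoleArn='arn:aws:iam::123456789012:role/dms-assessment-role',
    ResultLocationBucket='my-dms-assessment-results',                     # placeholder
    AssessmentRunName='pre-migration-check',
)
print(run['ReplicationTaskAssessmentRun']['Status'])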

Configure your source database to support CDC: enable binary logging in ROW format (MySQL/MariaDB), logical replication (PostgreSQL), or ARCHIVELOG mode with supplemental logging (Oracle). These prerequisites allow DMS to read ongoing changes from the transaction logs. Also verify that your database user has appropriate permissions; DMS needs fairly extensive privileges to read the logs and perform schema operations.
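
As one example of verifying these prerequisites, the sketch below checks the binary-log settings DMS expects on a MySQL source. It assumes the pymysql driver; the hostname and credentials are placeholders:

import pymysql

# Placeholder connection details for the source MySQL/MariaDB instance.
conn = pymysql.connect(host='source-db.example.com', user='dms_user',
                       password='...', database='mysql')

# DMS CDC for MySQL requires binary logging in ROW format with full row images.
required = {'log_bin': 'ON', 'binlog_format': 'ROW', 'binlog_row_image': 'FULL'}

with conn.cursor() as cursor:
    for variable, expected in required.items():
        cursor.execute("SHOW GLOBAL VARIABLES LIKE %s", (variable,))
        _, actual = cursor.fetchone()
        status = 'OK' if actual.upper() == expected else f'EXPECTED {expected}'
        print(f"{variable} = {actual} ({status})")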

Size your DMS replication instance based on your database size and desired migration speed. Under-sizing leads to painfully slow full loads that extend your migration timeline. As a starting point, use r5 instances with at least 8 vCPUs for databases over 100 GB, and scale up from there based on testing. Multi-AZ deployments provide higher availability during migration but double the cost.
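
A provisioning sketch along those lines with boto3 (the identifier, subnet group, and security group are placeholders; adjust class and storage based on your own testing):

import boto3

dms = boto3.client('dms')

# dms.r5.2xlarge (8 vCPUs, 64 GiB) is a reasonable starting point for 100 GB+ sources.
dms.create_replication_instance(
    ReplicationInstanceIdentifier='prod-migration-instance',    # placeholder
    ReplicationInstanceClass='dms.r5.2xlarge',
    AllocatedStorage=200,            # GB, used for cached changes and logs
    MultiAZ=True,                    # higher availability, roughly double the cost
    PubliclyAccessible=False,
    ReplicationSubnetGroupIdentifier='prod-dms-subnet-group',   # placeholder
    VpcSecurityGroupIds=['sg-0123456789abcdef0'],               # placeholder
)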

Create the target database with appropriate sizing and configuration. If migrating to RDS, ensure sufficient storage and IOPS capacity to handle both the data load and ongoing write traffic during CDC catchup. Under-provisioned targets create bottlenecks that prevent CDC from catching up, making cutover impossible.

Executing the Full Load Phase

Start your DMS task with full load enabled and CDC activated from the task start. This configuration ensures that DMS begins capturing changes immediately, even before full load completes. Without this, changes occurring during full load are lost, requiring you to restart the entire migration.
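
Starting the task is a single API call. Assuming a task created earlier with MigrationType='full-load-and-cdc' (the ARN below is a placeholder), the initial start looks like this:

import boto3

dms = boto3.client('dms')

# 'start-replication' is for the first run; use 'resume-processing' or
# 'reload-target' for later restarts.
dms.start_replication_task(
    ReplicationTaskArn='arn:aws:dms:...:task:TASK',   # placeholder
    StartReplicationTaskType='start-replication',
)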

Monitor full load progress through CloudWatch metrics and the DMS console; a polling sketch follows the list below. Key metrics include:

  • FullLoadThroughputBandwidthTarget: Data being written to the target (KB/sec)
  • FullLoadThroughputRowsTarget: Rows being written to the target (rows/sec)
  • CDCLatencySource: Lag between source database time and the last CDC event captured
  • CDCLatencyTarget: Lag between CDC capture and application to the target
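
A minimal polling sketch for these metrics, assuming the AWS/DMS namespace and its ReplicationInstanceIdentifier/ReplicationTaskIdentifier dimensions (the dimension values below are placeholders; use the identifiers CloudWatch reports for your instance and task):

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client('cloudwatch')
dimensions = [
    {'Name': 'ReplicationInstanceIdentifier', 'Value': 'prod-migration-instance'},
    {'Name': 'ReplicationTaskIdentifier', 'Value': 'orders-db-migration'},
]

for metric in ('FullLoadThroughputRowsTarget', 'FullLoadThroughputBandwidthTarget',
               'CDCLatencySource', 'CDCLatencyTarget'):
    stats = cloudwatch.get_metric_statistics(
        Namespace='AWS/DMS',
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=datetime.now(timezone.utc) - timedelta(minutes=10),
        EndTime=datetime.now(timezone.utc),
        Period=60,
        Statistics=['Average'],
    )
    latest = sorted(stats['Datapoints'], key=lambda d: d['Timestamp'])[-1:]
    print(metric, latest[0]['Average'] if latest else 'no data yet')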

Full load duration varies dramatically based on database size, network bandwidth, and target database write capacity. A 500 GB database might take 4-8 hours on a well-configured DMS instance, while a 5 TB database could require days. Run test migrations against production-sized data to establish realistic timelines.

During full load, tables are copied independently and DMS doesn't maintain referential integrity. You typically disable or drop foreign key constraints and triggers on the target before the load and recreate them afterward; when DMS creates target tables itself, it creates primary keys but not secondary indexes or foreign keys. Plan for this by understanding how long constraint and index recreation takes, as it extends your total migration time.

Managing the CDC Catchup Phase

Once full load completes, CDC must catch up with all changes that occurred during the full load. Monitor CDCLatencySource and CDCLatencyTarget metrics—CDC is caught up when these values stabilize near zero (typically under 10 seconds).

Catchup time depends on your source database's write rate and the target's ability to apply changes. By default DMS applies captured changes in transaction commit order, so a sustained, heavily concurrent write load on the source can outpace the rate at which the target applies them. If CDC can't keep up with source write rates, you'll never be able to cut over; the lag simply keeps growing.

If CDC struggles to catch up, you have several options:

  • Increase DMS instance size: More CPU and memory accelerate CDC processing (see the sketch after this list)
  • Enable batch-optimized apply: Applying changes in batches rather than one transaction at a time substantially improves CDC throughput on most targets
  • Reduce source load: Temporarily throttle non-critical writes during migration
  • Partition large tables: Break massive tables into smaller segments for parallel processing
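
For the first option, resizing the replication instance is a scripted, online operation; a sketch (the ARN is a placeholder, and note that changing the instance class restarts the instance, so replication pauses briefly and then resumes from its checkpoint):

import boto3

dms = boto3.client('dms')

# Scale the replication instance up to a larger class.
dms.modify_replication_instance(
    ReplicationInstanceArn='arn:aws:dms:...:rep:INSTANCE',   # placeholder
    ReplicationInstanceClass='dms.r5.4xlarge',
    ApplyImmediately=True,
)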

Use table statistics in the DMS console to identify which tables are behind. Large tables with high update rates often become bottlenecks. If a table's backlog stems from a long-running full load rather than apply speed, segment it with DMS's parallel-load feature so its full load finishes sooner and CDC has less to catch up on.

Orchestrating the Cutover

Plan your cutover sequence meticulously. This is your highest-risk moment—errors here result in data loss or application failures. Document every step and assign clear ownership for each action.

Typical cutover sequence:

  1. Stop application writes to the source database (maintenance mode or connection rejection)
  2. Wait for CDC to fully catch up (CDCLatencyTarget < 1 second for several minutes)
  3. Run validation queries comparing row counts and critical data between source and target
  4. Update application connection strings to point to the target database
  5. Start applications and verify functionality
  6. Monitor for errors in application logs and database logs
  7. Keep source database available as rollback option for predetermined time window

Build automation for the cutover process. Manual execution introduces errors under pressure. Use scripts that execute each step, validate success before proceeding, and provide clear status output:

import boto3
import time
from datetime import datetime, timedelta

class DMSCutoverOrchestrator:
    def __init__(self, replication_task_arn, replication_task_id,
                 replication_instance_id, application_endpoint):
        self.dms = boto3.client('dms')
        self.cloudwatch = boto3.client('cloudwatch')
        self.task_arn = replication_task_arn
        # CloudWatch dimension values: the task and instance identifiers
        # exactly as they appear under the AWS/DMS metric namespace.
        self.task_id = replication_task_id
        self.instance_id = replication_instance_id
        self.app_endpoint = application_endpoint

    def _get_cdc_latency(self):
        """Fetch the latest CDCLatencySource value (seconds) from CloudWatch"""
        response = self.cloudwatch.get_metric_statistics(
            Namespace='AWS/DMS',
            MetricName='CDCLatencySource',
            Dimensions=[
                {'Name': 'ReplicationInstanceIdentifier', 'Value': self.instance_id},
                {'Name': 'ReplicationTaskIdentifier', 'Value': self.task_id},
            ],
            StartTime=datetime.utcnow() - timedelta(minutes=5),
            EndTime=datetime.utcnow(),
            Period=60,
            Statistics=['Maximum'],
        )
        datapoints = sorted(response['Datapoints'], key=lambda d: d['Timestamp'])
        # No datapoints yet means the task hasn't reported; treat as not caught up
        return datapoints[-1]['Maximum'] if datapoints else float('inf')

    def wait_for_cdc_catchup(self, max_latency_seconds=10, stable_duration_seconds=300):
        """Wait for CDC to catch up with source database"""
        print(f"{datetime.now()}: Waiting for CDC catchup...")
        stable_start = None

        while True:
            cdc_latency = self._get_cdc_latency()
            print(f"{datetime.now()}: CDC Latency: {cdc_latency} seconds")

            if cdc_latency <= max_latency_seconds:
                if stable_start is None:
                    stable_start = time.time()
                    print(f"CDC caught up, waiting {stable_duration_seconds}s for stability...")
                elif time.time() - stable_start >= stable_duration_seconds:
                    print(f"{datetime.now()}: CDC stable and caught up!")
                    return True
            else:
                stable_start = None  # Reset if latency increases

            time.sleep(30)
    
    def validate_data_consistency(self, source_conn, target_conn, critical_tables):
        """Compare row counts and checksums between source and target"""
        print(f"{datetime.now()}: Validating data consistency...")
        
        for table in critical_tables:
            source_count = self._get_row_count(source_conn, table)
            target_count = self._get_row_count(target_conn, table)
            
            if source_count != target_count:
                raise ValueError(
                    f"Row count mismatch for {table}: "
                    f"source={source_count}, target={target_count}"
                )
            
            print(f"✓ {table}: {source_count} rows match")
        
        print(f"{datetime.now()}: Validation complete - all tables consistent")
        return True
    
    def cutover(self, source_conn, target_conn, critical_tables):
        """Execute cutover sequence"""
        print(f"\n{'='*60}")
        print(f"STARTING CUTOVER at {datetime.now()}")
        print(f"{'='*60}\n")
        
        # Step 1: Stop application writes
        print("Step 1: Putting application in read-only mode...")
        self._enable_maintenance_mode(self.app_endpoint)
        
        # Step 2: Wait for CDC catchup
        print("\nStep 2: Waiting for CDC to catch up...")
        self.wait_for_cdc_catchup()
        
        # Step 3: Validate consistency
        print("\nStep 3: Validating data consistency...")
        self.validate_data_consistency(source_conn, target_conn, critical_tables)
        
        # Step 4: Switch connection endpoint
        print("\nStep 4: Updating application database endpoint...")
        self._update_connection_string(target_conn)
        
        # Step 5: Bring application online
        print("\nStep 5: Bringing application online...")
        self._disable_maintenance_mode(self.app_endpoint)
        
        print(f"\n{'='*60}")
        print(f"CUTOVER COMPLETE at {datetime.now()}")
        print(f"{'='*60}\n")
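
    # The helpers below are environment-specific hooks. Only _get_row_count has
    # a generic implementation here (it assumes DB-API connections such as
    # psycopg2 or pymysql); replace the others with your own application,
    # DNS, or parameter-store integration.
    def _get_row_count(self, conn, table):
        with conn.cursor() as cursor:
            cursor.execute(f"SELECT COUNT(*) FROM {table}")
            return cursor.fetchone()[0]

    def _enable_maintenance_mode(self, app_endpoint):
        raise NotImplementedError("Put the application into read-only/maintenance mode here")

    def _disable_maintenance_mode(self, app_endpoint):
        raise NotImplementedError("Bring the application back online here")

    def _update_connection_string(self, target_conn):
        raise NotImplementedError("Update connection strings, DNS, or secrets to point at the target")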

This automation provides reproducibility and reduces human error during the high-pressure cutover window. Test the entire cutover sequence in staging environments multiple times before production execution.

Handling Large Table Migrations and Performance Optimization

Tables exceeding hundreds of gigabytes or billions of rows present special challenges for DMS migrations. Without optimization, these tables can take days to migrate and may never achieve CDC catchup due to ongoing changes.

Table Segmentation Strategies

DMS supports partitioning large tables into segments that migrate in parallel, which dramatically reduces full load time for massive tables. Configure segmentation with a parallel-load setting in a table-settings rule in your task's table mappings, alongside the selection rule that includes the table:

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-orders",
      "object-locator": {
        "schema-name": "production",
        "table-name": "orders"
      },
      "rule-action": "include"
    },
    {
      "rule-type": "table-settings",
      "rule-id": "2",
      "rule-name": "segment-large-orders-table",
      "object-locator": {
        "schema-name": "production",
        "table-name": "orders"
      },
      "parallel-load": {
        "type": "partitions-auto"
      }
    }
  ]
}

Automatic partitioning ("partitions-auto") loads each of the table's existing partitions and subpartitions in parallel, so it works well when the source table is already partitioned. For tables without native partitions, use range-based segmentation: pick one or more columns, typically a numeric primary key (the order_id column below is illustrative), and define boundary values. DMS migrates each resulting range in parallel; three boundaries produce four segments, one per range plus everything above the last boundary:

{
  "rule-type": "table-settings",
  "rule-id": "3",
  "rule-name": "range-segment-orders",
  "object-locator": {
    "schema-name": "production",
    "table-name": "orders"
  },
  "parallel-load": {
    "type": "ranges",
    "columns": ["order_id"],
    "boundaries": [
      ["1000000"],
      ["2000000"],
      ["3000000"]
    ]
  }
}

Choose segment counts based on your DMS instance size and target database write capacity. Too many segments can overwhelm the target with parallel writes. Start with 4-8 segments per large table and adjust based on observed performance.

LOB Handling and Binary Data

Large objects (LOBs)—BLOBs, CLOBs, TEXT, and similar types—require special configuration in DMS. By default, DMS uses “limited LOB mode” which only migrates LOBs up to a specified size threshold. LOBs exceeding this threshold are truncated, causing data loss.

For tables containing LOBs, choose between:

Limited LOB mode migrates LOBs up to a specified size (default 32 KB). This is fast but risks truncation. Use when you know LOBs are small or can tolerate truncation of outliers.

Full LOB mode migrates all LOBs regardless of size but dramatically slows migration because DMS must read LOBs separately. Use when LOB integrity is critical and you accept longer migration times.

Inline LOB mode treats small LOBs efficiently while using full LOB mode for large ones. Configure the threshold based on your data distribution—if 95% of LOBs are under 1 MB, set the inline LOB threshold to 1 MB.

Configure LOB handling in the task's TargetMetadata settings. Limited LOB mode with the default 32 KB cap looks like this; setting FullLobMode to true (with a LobChunkSize) enables full LOB mode, and a nonzero InlineLobMaxSize enables inline LOB mode:

{
  "TargetMetadata": {
    "FullLobMode": false,
    "LimitedSizeLobMode": true,
    "LobMaxSize": 32,
    "InlineLobMaxSize": 0
  }
}

Profile your source database to understand LOB size distribution before choosing settings. Migrating a table with multi-gigabyte LOBs using the wrong settings can result in weeks of migration time or silent data truncation.
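
One way to do that profiling, sketched here for a MySQL source with pymysql (the connection details, documents table, and document_body column are placeholders):

import pymysql

conn = pymysql.connect(host='source-db.example.com', user='dms_user',
                       password='...', database='production')   # placeholders

# Rough LOB size profile for one table/column: average, maximum, and how many
# rows would be truncated at a 1 MB limited/inline LOB threshold.
query = """
    SELECT AVG(OCTET_LENGTH(document_body)) AS avg_bytes,
           MAX(OCTET_LENGTH(document_body)) AS max_bytes,
           SUM(OCTET_LENGTH(document_body) > 1024 * 1024) AS rows_over_1mb
    FROM documents
"""
with conn.cursor() as cursor:
    cursor.execute(query)
    avg_bytes, max_bytes, rows_over_1mb = cursor.fetchone()
    print(f"avg={avg_bytes:.0f}B max={max_bytes}B over_1MB={rows_over_1mb}")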

Validation and Testing Strategies

Never cut over to a target database without thorough validation. Production data is too critical to trust migration success without verification.

Automated Data Validation

DMS provides built-in validation that compares source and target data during migration. Enable this in task settings, but understand its limitations—validation increases migration time and DMS instance load. Use it for critical tables while sampling others.
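
Enabling validation is a task-settings change; a sketch with boto3 (the task ARN is a placeholder, and the task should be stopped while you modify it):

import json
import boto3

dms = boto3.client('dms')

# Turn on DMS data validation; it runs alongside replication and reports
# per-table results through the console and API.
dms.modify_replication_task(
    ReplicationTaskArn='arn:aws:dms:...:task:TASK',   # placeholder
    ReplicationTaskSettings=json.dumps({
        'ValidationSettings': {
            'EnableValidation': True,
            'ThreadCount': 5,
        }
    }),
)

# Per-table validation state and failed-record counts:
for stat in dms.describe_table_statistics(
    ReplicationTaskArn='arn:aws:dms:...:task:TASK'    # placeholder
)['TableStatistics']:
    print(stat['TableName'], stat.get('ValidationState'),
          stat.get('ValidationFailedRecords'))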

Implement custom validation queries that verify:

  • Row counts for all tables match between source and target
  • Primary key uniqueness is maintained in the target
  • Critical business metrics (sum of transaction amounts, count of active users, etc.) are identical
  • Date ranges show no gaps in time-series data
  • Null handling matches between source and target for nullable columns

Run these validations in three phases:

  1. After full load completes but before CDC begins (catches bulk load issues)
  2. During CDC catchup to verify ongoing replication works correctly
  3. Immediately before cutover as final confirmation

Build validation into your cutover automation so it blocks cutover if discrepancies are detected. False positives from validation queries are better than cutting over with data corruption.

Application-Level Testing

Data consistency is necessary but insufficient—applications must function correctly against the target database. Performance characteristics often differ between source and target, particularly when changing database engines (PostgreSQL to Aurora, Oracle to PostgreSQL, etc.).

Establish a parallel testing environment where applications run against the target while the source handles production traffic. This requires bidirectional replication or periodic snapshots to keep test data current. Execute:

  • Functional testing verifying all application features work correctly
  • Performance testing confirming response times meet SLAs
  • Load testing validating the target handles production traffic volumes
  • Error testing ensuring failure modes are handled properly

Identify and resolve issues in parallel testing before cutover. Common problems include missing indexes, query plan differences, connection pool sizing, and timeout configurations that worked on the source but not the target.

Pre-Cutover Checklist

Technical validation:

  • Row counts match for all tables
  • Primary and foreign keys validated
  • Indexes created on the target
  • CDC latency under 10 seconds
  • No replication errors in the logs
  • Storage capacity verified
  • Network connectivity tested

Operational readiness:

  • Rollback plan documented
  • Communication plan active
  • Support team on standby
  • Monitoring dashboards ready
  • Cutover runbook tested
  • Backup taken and verified
  • Maintenance window confirmed

Application testing:

  • Functional tests passed
  • Performance tests passed
  • Load tests completed
  • Integration tests passed
  • Connection strings updated
  • Health checks configured
  • Error handling verified
⚠️ Critical: Have a Rollback Plan
Never execute production cutover without a tested rollback procedure. Keep the source database online and reachable for at least 24-48 hours post-migration. Configure reverse DMS replication from target back to source before cutover, enabling quick fallback if critical issues emerge. Define clear rollback triggers (error rates, performance degradation, data inconsistencies) and decision authority.

Managing Rollback and Post-Migration Monitoring

Even perfectly planned migrations encounter unexpected issues. Your rollback strategy determines whether these issues cause hours of downtime or quick recovery.

Implementing Bidirectional Replication

The safest migration pattern maintains bidirectional replication between source and target for a post-cutover period. Before cutover, configure a second DMS task that replicates from target back to source. Keep this task ready but not started.
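
A sketch of pre-creating that reverse task with boto3 (the endpoint and instance ARNs are placeholders; the endpoints are the same databases with their source and target roles swapped):

import json
import boto3

dms = boto3.client('dms')

# CDC-only task from the new primary (old target) back to the old source.
# Create it before cutover, but do not start it unless rollback is needed.
dms.create_replication_task(
    ReplicationTaskIdentifier='rollback-target-to-source',       # placeholder
    SourceEndpointArn='arn:aws:dms:...:endpoint:NEW-PRIMARY',     # placeholders
    TargetEndpointArn='arn:aws:dms:...:endpoint:OLD-SOURCE',
    ReplicationInstanceArn='arn:aws:dms:...:rep:INSTANCE',
    MigrationType='cdc',
    TableMappings=json.dumps({
        'rules': [{
            'rule-type': 'selection',
            'rule-id': '1',
            'rule-name': 'include-all',
            'object-locator': {'schema-name': '%', 'table-name': '%'},
            'rule-action': 'include',
        }]
    }),
)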

After successful cutover, if critical issues arise, you can:

  1. Put applications back in maintenance mode
  2. Start the reverse DMS task to sync target changes back to source
  3. Wait for reverse CDC to catch up
  4. Switch applications back to the source database
  5. Investigate and resolve target issues offline

This approach enables rollback with minimal data loss—only changes occurring during the rollback cutover window are at risk. Without bidirectional replication, rollback requires restoring source from backup, losing all changes since migration.

Post-Migration Monitoring

Intensively monitor the target database for the first 24-48 hours post-migration. Watch for:

Performance anomalies: Response time degradation, increased CPU utilization, slow queries that ran fast on source

Error rates: Application errors, failed transactions, connection timeouts

Data drift: If keeping source online, monitor for unintended writes to the old database

Replication lag: If maintaining ongoing replication for fallback, ensure CDC keeps up

Storage growth: Verify storage consumption matches expectations—unexpected growth indicates issues
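
For the replication-lag item above, a CloudWatch alarm sketch (the dimension values and SNS topic are placeholders; use the identifiers CloudWatch reports for your instance and task):

import boto3

cloudwatch = boto3.client('cloudwatch')

# Alert if target apply latency stays above 60 seconds for 15 minutes
# while fallback replication is running.
cloudwatch.put_metric_alarm(
    AlarmName='dms-fallback-cdc-latency',
    Namespace='AWS/DMS',
    MetricName='CDCLatencyTarget',
    Dimensions=[
        {'Name': 'ReplicationInstanceIdentifier', 'Value': 'prod-migration-instance'},
        {'Name': 'ReplicationTaskIdentifier', 'Value': 'rollback-target-to-source'},
    ],
    Statistic='Average',
    Period=300,
    EvaluationPeriods=3,
    Threshold=60,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:...:migration-alerts'],   # placeholder topic
)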

Define clear success criteria and measurement periods. For example: “Migration is successful if error rates remain below 0.1%, 95th percentile response times stay under 200ms, and no critical issues arise for 48 hours.” Objective criteria prevent premature decommissioning of source infrastructure.

Conclusion

Successful DMS migrations for production databases require meticulous planning, comprehensive testing, and operational discipline that extends well beyond simply running a replication task. The minimal-downtime pattern using full load plus CDC provides the optimal balance for most scenarios, delivering short cutover windows without unnecessary complexity. However, the technical mechanics of running DMS represent only half the challenge—validation strategies, rollback planning, and post-migration monitoring separate successful migrations from failures that force emergency rollbacks.

The key to success lies in treating migration as a process rather than an event. Start with thorough assessment and planning, progress through multiple rounds of testing in progressively more production-like environments, execute the cutover with automation and verification at every step, and maintain vigilant monitoring throughout the stabilization period. Organizations that bring this discipline to their migrations achieve smooth transitions, while those that rush toward cutover dates face extended outages, data inconsistencies, or failed migrations that erode confidence in their cloud adoption strategy.
