Change Data Capture (CDC) has become essential for modern data architectures that require real-time or near-real-time data synchronization. Rather than replicating entire tables repeatedly, CDC identifies and captures only the changes—inserts, updates, and deletes—dramatically reducing data transfer volumes and enabling incremental updates. Fivetran simplifies CDC implementation by handling the complexity of log-based replication, transformation, and loading into Snowflake. This guide walks through implementing a production-ready CDC pipeline, covering setup, configuration, monitoring, and optimization strategies.
Understanding CDC Architecture with Fivetran and Snowflake
Before diving into implementation, understanding how Fivetran’s CDC mechanism works with Snowflake clarifies design decisions and troubleshooting approaches. Fivetran employs log-based CDC for supported databases, reading transaction logs to capture changes without impacting source system performance. This approach differs fundamentally from trigger-based or query-based CDC methods that add overhead to source databases.
When Fivetran connects to a source database like PostgreSQL, MySQL, or SQL Server with CDC enabled, it establishes a replication slot or binlog position that marks where in the transaction log to begin reading. As transactions occur in the source database, Fivetran continuously reads the log, identifies data modifications, and stages them for replication. This log-based approach provides several critical advantages: minimal impact on source database performance, capture of all changes including deletes, and preservation of transaction ordering.
The data flow follows a specific pattern. Fivetran reads changes from source transaction logs, transforms them into a standardized format, and writes them to Snowflake staging tables. From staging, Fivetran applies transformations and merges changes into final destination tables. Snowflake’s MERGE statement handles the upsert logic, inserting new records and updating existing ones based on primary keys. For deleted records, Fivetran either hard deletes them or marks them with deletion flags, depending on your configuration.
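To make the upsert step concrete, here is a simplified sketch of the MERGE pattern described above. The staging table and change_type column are illustrative assumptions, not Fivetran's actual internal naming:

```sql
-- Illustrative sketch of the MERGE-based upsert pattern; the staging table
-- and change_type column are hypothetical, not Fivetran internals.
MERGE INTO production_db.customers AS target
USING staging.customers_changes AS source
    ON target.customer_id = source.customer_id
WHEN MATCHED AND source.change_type = 'DELETE' THEN
    DELETE
WHEN MATCHED THEN
    UPDATE SET
        target.customer_name   = source.customer_name,
        target.email           = source.email,
        target._fivetran_synced = source._fivetran_synced
WHEN NOT MATCHED AND source.change_type <> 'DELETE' THEN
    INSERT (customer_id, customer_name, email, _fivetran_synced)
    VALUES (source.customer_id, source.customer_name, source.email, source._fivetran_synced);
```

The primary key drives the ON clause, which is why correct primary key configuration (covered later) is critical for CDC correctness.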
Fivetran maintains metadata tables in Snowflake schemas that track sync history, schema changes, and replication state. These metadata tables prove invaluable for monitoring pipeline health and troubleshooting synchronization issues. The _fivetran_synced timestamp column added to every replicated table indicates when each row was last updated by Fivetran, enabling incremental downstream processing.
Preparing Source Database for CDC
Successful CDC implementation begins with proper source database configuration. Different database systems require specific setup steps to enable log-based replication. This section covers the most common scenarios data engineers encounter.
PostgreSQL CDC Setup
PostgreSQL uses logical replication for CDC. Enable it by modifying PostgreSQL configuration:
-- Check current wal_level setting
SHOW wal_level;
-- Modify postgresql.conf (requires restart)
-- wal_level = logical
-- max_replication_slots = 10
-- max_wal_senders = 10
-- Create publication for tables to replicate
CREATE PUBLICATION fivetran_publication FOR TABLE
customers, orders, products, order_items;
-- Create replication user with appropriate permissions
CREATE USER fivetran_user WITH REPLICATION PASSWORD 'secure_password';
GRANT CONNECT ON DATABASE your_database TO fivetran_user;
GRANT USAGE ON SCHEMA public TO fivetran_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO fivetran_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO fivetran_user;
The wal_level = logical setting enables the logical decoding necessary for CDC. Replication slots prevent PostgreSQL from purging transaction logs before Fivetran reads them, ensuring no changes are missed. The publication defines which tables participate in replication—add tables explicitly rather than using FOR ALL TABLES to maintain granular control.
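Because an inactive replication slot forces PostgreSQL to retain WAL indefinitely, it is worth checking slot health after setup and periodically thereafter. A query along these lines (PostgreSQL 10+) shows whether the slot Fivetran created is active and how much WAL it is holding back:

```sql
-- Run on the source database: check replication slot status and retained WAL
SELECT
    slot_name,
    active,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
    ) AS retained_wal
FROM pg_replication_slots;
```

A large and growing retained_wal value on an inactive slot usually means the connector has stalled and the slot should be investigated before it fills the disk.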
MySQL CDC Configuration
MySQL CDC requires enabling binary logging and creating an appropriate replication user:
-- Verify binary logging is enabled
SHOW VARIABLES LIKE 'log_bin';
-- Add to my.cnf if not enabled (requires restart)
-- [mysqld]
-- server-id = 1
-- log_bin = mysql-bin
-- binlog_format = ROW
-- binlog_row_image = FULL
-- Create replication user
CREATE USER 'fivetran_user'@'%' IDENTIFIED BY 'secure_password';
GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'fivetran_user'@'%';
FLUSH PRIVILEGES;
-- Verify user permissions
SHOW GRANTS FOR 'fivetran_user'@'%';
The binlog_format = ROW setting ensures binary logs contain complete row images rather than SQL statements, critical for accurate CDC. The REPLICATION CLIENT privilege allows Fivetran to read binary log positions, while REPLICATION SLAVE enables reading binary log contents.
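MySQL purges binary logs on a retention schedule, so verify they are kept long enough to cover any sync gap (connector pauses, maintenance windows). The retention variable below is the MySQL 8.0 name; older versions use expire_logs_days instead:

```sql
-- Confirm binary logs exist and check retention (MySQL 8.0 shown)
SHOW BINARY LOGS;
SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';

-- Retain binlogs long enough to cover sync interruptions, e.g. 7 days
SET GLOBAL binlog_expire_logs_seconds = 604800;
```

If logs are purged before Fivetran reads them, the connector must perform a full re-sync to recover.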
SQL Server CDC Activation
SQL Server requires enabling CDC at both database and table levels:
-- Enable CDC for database
USE YourDatabase;
GO
EXEC sys.sp_cdc_enable_db;
GO
-- Enable CDC for specific tables
EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo',
@source_name = N'customers',
@role_name = NULL,
@supports_net_changes = 1;
GO
-- Verify CDC is enabled
SELECT name, is_cdc_enabled
FROM sys.databases
WHERE name = 'YourDatabase';
SELECT name, is_tracked_by_cdc
FROM sys.tables
WHERE schema_id = SCHEMA_ID('dbo');
-- Create Fivetran login and user
CREATE LOGIN fivetran_user WITH PASSWORD = 'secure_password';
CREATE USER fivetran_user FOR LOGIN fivetran_user;
GRANT SELECT ON SCHEMA::dbo TO fivetran_user;
GRANT VIEW DATABASE STATE TO fivetran_user;
SQL Server CDC creates system tables (prefixed with cdc.) that store change data. Fivetran queries these tables to capture modifications. The @supports_net_changes = 1 parameter enables efficient net change queries that consolidate multiple changes to the same row.
Database-Specific CDC Requirements
- PostgreSQL — Setting: wal_level = logical; Permission: REPLICATION; Mechanism: publications
- MySQL — Setting: binlog_format = ROW; Permission: REPLICATION SLAVE; Mechanism: binary logs
- SQL Server — Setting: sp_cdc_enable_db; Permission: VIEW DATABASE STATE; Mechanism: CDC tables
Configuring Snowflake for Fivetran Integration
Snowflake configuration establishes the destination environment where Fivetran loads replicated data. Proper setup ensures Fivetran has necessary permissions while maintaining security through role-based access control.
Create a dedicated database and schema for Fivetran data:
-- Create dedicated database for replicated data
CREATE DATABASE IF NOT EXISTS FIVETRAN_DB;
-- Create schema for each source system
CREATE SCHEMA IF NOT EXISTS FIVETRAN_DB.PRODUCTION_DB;
CREATE SCHEMA IF NOT EXISTS FIVETRAN_DB.ANALYTICS_DB;
-- Create dedicated warehouse for Fivetran loads
CREATE WAREHOUSE IF NOT EXISTS FIVETRAN_WH
WITH WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE
INITIALLY_SUSPENDED = TRUE;
The dedicated warehouse isolates Fivetran’s load operations from user queries, preventing resource contention. Size the warehouse based on expected data volumes—SMALL suffices for low-volume CDC, while high-volume replication may require LARGE or X-LARGE.
Create a Fivetran role with appropriate permissions:
-- Create role for Fivetran
CREATE ROLE IF NOT EXISTS FIVETRAN_ROLE;
-- Grant warehouse usage
GRANT USAGE ON WAREHOUSE FIVETRAN_WH TO ROLE FIVETRAN_ROLE;
-- Grant database and schema permissions
GRANT USAGE ON DATABASE FIVETRAN_DB TO ROLE FIVETRAN_ROLE;
GRANT CREATE SCHEMA ON DATABASE FIVETRAN_DB TO ROLE FIVETRAN_ROLE;
GRANT USAGE ON ALL SCHEMAS IN DATABASE FIVETRAN_DB TO ROLE FIVETRAN_ROLE;
GRANT CREATE TABLE ON ALL SCHEMAS IN DATABASE FIVETRAN_DB TO ROLE FIVETRAN_ROLE;
-- Grant table permissions
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA FIVETRAN_DB.PRODUCTION_DB TO ROLE FIVETRAN_ROLE;
GRANT SELECT, INSERT, UPDATE, DELETE ON FUTURE TABLES IN SCHEMA FIVETRAN_DB.PRODUCTION_DB TO ROLE FIVETRAN_ROLE;
-- Create user for Fivetran
CREATE USER IF NOT EXISTS FIVETRAN_USER
PASSWORD = 'secure_generated_password'
DEFAULT_ROLE = FIVETRAN_ROLE
DEFAULT_WAREHOUSE = FIVETRAN_WH
MUST_CHANGE_PASSWORD = FALSE;
-- Assign role to user
GRANT ROLE FIVETRAN_ROLE TO USER FIVETRAN_USER;
The FUTURE TABLES grant ensures Fivetran can access new tables it creates without manual permission updates. Setting MUST_CHANGE_PASSWORD = FALSE prevents authentication issues since Fivetran manages the connection programmatically.
Configure network access if using Snowflake network policies:
-- Add Fivetran IP addresses to network policy
-- Fivetran provides current IP ranges in their documentation
-- Note: SET replaces the entire list, so include any existing allowed IPs
ALTER NETWORK POLICY SNOWFLAKE_NETWORK_POLICY
SET ALLOWED_IP_LIST = ('<existing_ips>', '52.0.2.4/32', '34.203.33.166/32');
Fivetran publishes current IP addresses in their documentation. Add these to your network policy to allow connections while maintaining security.
Setting Up Fivetran Connector
With source and destination configured, create the Fivetran connector that orchestrates data replication. The Fivetran UI guides you through connector creation, but understanding the configuration options ensures an optimal setup.
Creating the Database Connector
Navigate to Fivetran dashboard and click “Add Connector.” Select your source database type (PostgreSQL, MySQL, SQL Server, etc.) and provide connection details:
- Host: Database hostname or IP address
- Port: Database port (5432 for PostgreSQL, 3306 for MySQL, 1433 for SQL Server)
- User: Replication user created earlier
- Password: User password
- Database: Source database name
For cloud-hosted databases (RDS, Cloud SQL, Azure SQL), enable SSL and provide necessary certificates. Fivetran performs a connection test to verify credentials and network connectivity before proceeding.
Configuring CDC Settings
After connection validation, configure CDC-specific options:
Update Method: Select “Log-based (CDC)” rather than “Periodic (full table)” to enable change data capture. Fivetran automatically uses the appropriate CDC mechanism for your database type.
Historical Sync: Choose whether to perform an initial full copy of existing data before switching to CDC. Enable this for new connectors; disable if tables are already synchronized.
Schema Selection: Select schemas and tables to replicate. Avoid replicating unnecessary tables to reduce load and storage costs. Use inclusion/exclusion patterns for fine-grained control:
Include schemas: public, sales, inventory
Include tables: public.customers, public.orders, sales.*
Exclude tables: public.temp_*, *.audit_log
Primary Key Configuration: Fivetran auto-detects primary keys but allows manual specification. Correct primary key configuration is critical—Fivetran uses these to perform upserts. For tables without natural primary keys, Fivetran can create surrogate keys.
Configuring Snowflake Destination
Connect the Snowflake destination using the user and role created earlier:
- Account Identifier: Your Snowflake account ID (e.g., xy12345.us-east-1)
- User: FIVETRAN_USER
- Password: User password
- Database: FIVETRAN_DB
- Warehouse: FIVETRAN_WH
- Schema Prefix: Optional prefix for organizing schemas (e.g., “PROD_”, “DEV_”)
Fivetran validates Snowflake connectivity and permissions before proceeding. Address any permission errors by reviewing the SQL grants executed earlier.
Managing Schema Changes and Transformations
CDC pipelines must handle evolving source schemas gracefully. Fivetran provides multiple strategies for managing schema changes and applying transformations during replication.
Automatic Schema Evolution
Fivetran’s schema migration feature automatically propagates source schema changes to Snowflake:
- New columns: Automatically added to destination tables
- Column type changes: Handled based on compatibility (e.g., widening VARCHAR)
- Dropped columns: Optionally preserved or removed in destination
- New tables: Automatically created in destination schema
Configure schema migration behavior in connector settings:
Allow column removal: Enable to drop columns from Snowflake when removed from source. Disable to preserve historical columns.
Block on incompatible changes: Stop syncing if Fivetran encounters incompatible type changes, preventing data loss or corruption.
Notification settings: Configure alerts for schema changes to maintain awareness of source database evolution.
Applying Transformations
Fivetran supports transformations through dbt Core integration, enabling SQL-based transformations on replicated data. Create transformation models that run after each sync:
-- models/staging/stg_customers.sql
-- Basic staging transformation
SELECT
customer_id,
UPPER(TRIM(customer_name)) AS customer_name,
LOWER(TRIM(email)) AS email,
country_code,
registration_date,
_fivetran_synced AS last_sync_timestamp
FROM {{ source('fivetran', 'customers') }}
WHERE _fivetran_deleted = FALSE
-- models/marts/customer_orders.sql
-- Aggregate customer order metrics
WITH order_stats AS (
SELECT
customer_id,
COUNT(order_id) AS total_orders,
SUM(order_total) AS lifetime_value,
MAX(order_date) AS last_order_date
FROM {{ ref('stg_orders') }}
GROUP BY customer_id
)
SELECT
c.customer_id,
c.customer_name,
c.email,
c.country_code,
COALESCE(o.total_orders, 0) AS total_orders,
COALESCE(o.lifetime_value, 0) AS lifetime_value,
o.last_order_date
FROM {{ ref('stg_customers') }} c
LEFT JOIN order_stats o ON c.customer_id = o.customer_id
These dbt models run automatically after Fivetran syncs, maintaining transformed tables that stay current with source changes. The _fivetran_synced and _fivetran_deleted metadata columns enable incremental processing and soft delete handling.
Handling Soft Deletes
Configure how Fivetran handles deleted records:
Hard Delete: Physically removes deleted rows from Snowflake tables, matching source behavior exactly.
Soft Delete: Marks deleted rows with _fivetran_deleted = TRUE rather than removing them, preserving historical records for audit or analysis.
Soft deletes enable tracking deletion patterns and recovering accidentally deleted data:
-- Query active records only
SELECT * FROM customers WHERE _fivetran_deleted = FALSE;
-- Analyze deletion patterns
SELECT
DATE_TRUNC('day', _fivetran_synced) AS deletion_date,
COUNT(*) AS deletions_count
FROM customers
WHERE _fivetran_deleted = TRUE
GROUP BY deletion_date
ORDER BY deletion_date;
Fivetran Metadata Columns
- _fivetran_synced: Timestamp when Fivetran last modified the row. Use for incremental downstream processing: WHERE _fivetran_synced > last_processed_time
- _fivetran_deleted: Boolean flag indicating soft-deleted records. Filter active records with WHERE _fivetran_deleted = FALSE
- _fivetran_id: Surrogate key generated by Fivetran for tables without natural primary keys. Ensures upsert operations work correctly even without explicit keys.
- _fivetran_index: Sequence number for rows with identical primary keys in the same sync, handling edge cases where multiple changes occur between syncs.
Monitoring Pipeline Performance and Health
Production CDC pipelines require continuous monitoring to detect issues before they impact downstream systems. Fivetran provides monitoring tools, and Snowflake-side queries reveal pipeline behavior.
Fivetran Dashboard Monitoring
The Fivetran dashboard displays connector health and sync metrics:
- Sync frequency: How often Fivetran checks for changes (typically 1-15 minutes depending on plan)
- Rows synced: Number of rows inserted, updated, and deleted per sync
- Sync duration: Time required to complete each sync cycle
- Schema changes: Recent schema modifications detected
- Errors and warnings: Issues requiring attention
Set up alerts for critical conditions:
- Sync failures: Immediate notification when syncs fail
- Schema changes: Alert on structural changes for awareness
- Volume anomalies: Notify when row counts exceed expected thresholds
- Lag warnings: Alert if replication lag grows beyond acceptable limits
Snowflake-Side Monitoring
Query Fivetran metadata tables to analyze pipeline behavior:
-- Check the most recent sync time per replicated table
-- (a static query cannot iterate tables dynamically, so list each
--  table explicitly or generate this statement with a script)
SELECT 'CUSTOMERS' AS table_name,
       MAX(_fivetran_synced) AS last_sync_time,
       COUNT(*) AS row_count
FROM FIVETRAN_DB.PRODUCTION_DB.CUSTOMERS
UNION ALL
SELECT 'ORDERS',
       MAX(_fivetran_synced),
       COUNT(*)
FROM FIVETRAN_DB.PRODUCTION_DB.ORDERS
ORDER BY last_sync_time DESC;
-- Monitor sync lag for a replicated table
SELECT
'CUSTOMERS' AS table_name,
MAX(_fivetran_synced) AS last_fivetran_sync,
DATEDIFF(minute, MAX(_fivetran_synced), CURRENT_TIMESTAMP()) AS sync_lag_minutes
FROM FIVETRAN_DB.PRODUCTION_DB.CUSTOMERS
HAVING DATEDIFF(minute, MAX(_fivetran_synced), CURRENT_TIMESTAMP()) > 30; -- Alert if lag exceeds 30 minutes
-- Analyze daily sync volumes
SELECT
DATE_TRUNC('day', _fivetran_synced) AS sync_date,
COUNT(*) AS rows_synced,
COUNT(CASE WHEN _fivetran_deleted THEN 1 END) AS rows_deleted
FROM FIVETRAN_DB.PRODUCTION_DB.ORDERS
WHERE _fivetran_synced >= CURRENT_DATE - 7
GROUP BY sync_date
ORDER BY sync_date DESC;
These queries identify replication lag, track sync volumes, and detect anomalies requiring investigation.
Performance Optimization
Optimize CDC pipeline performance through several strategies:
Warehouse sizing: Monitor Fivetran warehouse utilization. If sync durations grow despite stable data volumes, increase warehouse size. Auto-scaling warehouses aren’t beneficial for Fivetran since syncs run predictably.
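To ground sizing decisions in data, check credit consumption for the Fivetran warehouse via the ACCOUNT_USAGE share (requires access to the SNOWFLAKE database; metering data lags by up to a few hours):

```sql
-- Daily credit consumption for the Fivetran warehouse over the last 30 days
SELECT
    DATE_TRUNC('day', start_time) AS usage_date,
    SUM(credits_used) AS daily_credits
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE warehouse_name = 'FIVETRAN_WH'
  AND start_time >= CURRENT_DATE - 30
GROUP BY usage_date
ORDER BY usage_date DESC;
```

A steady climb in daily credits at constant data volume suggests the warehouse is working harder per sync and is a useful early signal to investigate before resizing.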
Clustering keys: Add clustering keys to large tables frequently filtered by date or other selective columns:
ALTER TABLE FIVETRAN_DB.PRODUCTION_DB.ORDERS
CLUSTER BY (order_date, customer_id);
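After adding a clustering key, you can verify its effect with Snowflake's built-in clustering function:

```sql
-- Inspect clustering quality for the newly clustered table
SELECT SYSTEM$CLUSTERING_INFORMATION(
    'FIVETRAN_DB.PRODUCTION_DB.ORDERS',
    '(order_date, customer_id)'
);
```

The returned JSON includes average clustering depth; lower values mean queries filtering on these columns prune more micro-partitions.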
Selective replication: Exclude tables that don’t require real-time synchronization. Use Fivetran’s table selection to replicate only necessary data, reducing processing and storage costs.
Compression: Snowflake automatically compresses data, but verify compression ratios for very large tables:
SELECT
table_name,
row_count,
bytes / (1024 * 1024 * 1024) AS size_gb,
bytes / NULLIF(row_count, 0) AS bytes_per_row
FROM FIVETRAN_DB.INFORMATION_SCHEMA.TABLES
WHERE table_schema = 'PRODUCTION_DB'
AND table_type = 'BASE TABLE'
ORDER BY size_gb DESC;
Poor compression ratios may indicate opportunities for data type optimization or partitioning strategies.
Troubleshooting Common CDC Issues
Understanding common CDC problems and their solutions reduces troubleshooting time and prevents prolonged outages.
Replication Lag
Symptoms: Growing delay between source changes and Snowflake visibility.
Causes and solutions:
- Large transactions: Breaking up bulk operations in source database improves CDC throughput
- Warehouse undersized: Increase Snowflake warehouse size if Fivetran syncs run slowly
- Network issues: Check connectivity between Fivetran and source database
- Log retention: Ensure source database retains logs long enough for Fivetran to process
Schema Conflict Errors
Symptoms: Syncs fail with schema mismatch errors.
Causes and solutions:
- Incompatible type changes: Review schema change causing conflict; may require manual intervention
- Missing permissions: Verify Fivetran user has necessary DDL permissions
- Concurrent modifications: Avoid manual schema changes in Fivetran-managed tables
Primary Key Violations
Symptoms: Duplicate key errors during upserts.
Causes and solutions:
- Incorrect primary key: Verify primary key configuration matches source table definition
- Missing primary key: Add natural or surrogate keys to tables lacking them
- Source data quality: Fix duplicate keys in source database
Connection Interruptions
Symptoms: Syncs fail intermittently with connection errors.
Causes and solutions:
- Firewall rules: Ensure Fivetran IPs are whitelisted
- Database restarts: Fivetran automatically reconnects; verify replication slot/binlog position restored correctly
- Credential changes: Update Fivetran connector if database passwords rotated
Conclusion
Implementing CDC with Fivetran and Snowflake transforms data integration from batch-oriented ETL to continuous synchronization that keeps analytical data current. The combination of Fivetran’s managed CDC capabilities and Snowflake’s performance and scalability enables real-time analytics without the operational complexity of building custom replication infrastructure. Proper setup—configuring source databases correctly, securing Snowflake with appropriate permissions, and monitoring pipeline health—ensures reliable data replication that scales with business growth.
Success with CDC pipelines extends beyond initial implementation to ongoing optimization and maintenance. Monitor replication lag and sync performance, manage schema evolution proactively, and leverage Fivetran’s metadata for downstream incremental processing. These practices transform CDC from a technical integration into a strategic capability that enables businesses to make decisions based on current data rather than yesterday’s snapshots, ultimately delivering the real-time insights modern organizations require.