How to Implement a CDC Data Pipeline in Snowflake Using Fivetran

Change Data Capture (CDC) has become essential for modern data architectures that require real-time or near-real-time data synchronization. Rather than replicating entire tables repeatedly, CDC identifies and captures only the changes—inserts, updates, and deletes—dramatically reducing data transfer volumes and enabling incremental updates. Fivetran simplifies CDC implementation by handling the complexity of log-based replication, transformation, and loading into Snowflake. This guide walks through implementing a production-ready CDC pipeline, covering setup, configuration, monitoring, and optimization strategies.

Understanding CDC Architecture with Fivetran and Snowflake

Before diving into implementation, understanding how Fivetran’s CDC mechanism works with Snowflake clarifies design decisions and troubleshooting approaches. Fivetran employs log-based CDC for supported databases, reading transaction logs to capture changes without impacting source system performance. This approach differs fundamentally from trigger-based or query-based CDC methods that add overhead to source databases.

When Fivetran connects to a source database like PostgreSQL, MySQL, or SQL Server with CDC enabled, it establishes a replication slot or binlog position that marks where in the transaction log to begin reading. As transactions occur in the source database, Fivetran continuously reads the log, identifies data modifications, and stages them for replication. This log-based approach provides several critical advantages: minimal impact on source database performance, capture of all changes including deletes, and preservation of transaction ordering.

The data flow follows a specific pattern. Fivetran reads changes from source transaction logs, transforms them into a standardized format, and writes them to Snowflake staging tables. From staging, Fivetran applies transformations and merges changes into final destination tables. Snowflake’s MERGE statement handles the upsert logic, inserting new records and updating existing ones based on primary keys. For deleted records, Fivetran either hard deletes them or marks them with deletion flags, depending on your configuration.
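Conceptually, the merge step resembles the sketch below. The staging table and column names are illustrative only; Fivetran's actual internal staging objects are not user-visible and differ from this:

-- Illustrative sketch of the upsert pattern; staging table and columns
-- are hypothetical, not Fivetran's internal objects
MERGE INTO production_db.customers AS target
USING fivetran_staging.customers_changes AS source
    ON target.customer_id = source.customer_id
WHEN MATCHED AND source._fivetran_deleted THEN
    UPDATE SET target._fivetran_deleted = TRUE,
               target._fivetran_synced = source._fivetran_synced
WHEN MATCHED THEN
    UPDATE SET target.customer_name = source.customer_name,
               target.email = source.email,
               target._fivetran_synced = source._fivetran_synced
WHEN NOT MATCHED THEN
    INSERT (customer_id, customer_name, email, _fivetran_deleted, _fivetran_synced)
    VALUES (source.customer_id, source.customer_name, source.email, FALSE, source._fivetran_synced);

The deciding factor throughout is the primary key in the ON clause, which is why correct key configuration (covered later) matters so much.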

Fivetran maintains metadata tables in Snowflake schemas that track sync history, schema changes, and replication state. These metadata tables prove invaluable for monitoring pipeline health and troubleshooting synchronization issues. The _fivetran_synced timestamp column added to every replicated table indicates when each row was last updated by Fivetran, enabling incremental downstream processing.

Preparing Source Database for CDC

Successful CDC implementation begins with proper source database configuration. Different database systems require specific setup steps to enable log-based replication. This section covers the most common scenarios data engineers encounter.

PostgreSQL CDC Setup

PostgreSQL uses logical replication for CDC. Enable it by modifying PostgreSQL configuration:

-- Check current wal_level setting
SHOW wal_level;

-- Modify postgresql.conf (requires restart)
-- wal_level = logical
-- max_replication_slots = 10
-- max_wal_senders = 10

-- Create publication for tables to replicate
CREATE PUBLICATION fivetran_publication FOR TABLE 
    customers, orders, products, order_items;

-- Create replication user with appropriate permissions
CREATE USER fivetran_user WITH REPLICATION PASSWORD 'secure_password';
GRANT CONNECT ON DATABASE your_database TO fivetran_user;
GRANT USAGE ON SCHEMA public TO fivetran_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO fivetran_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO fivetran_user;

The wal_level = logical setting enables the logical decoding that CDC requires. The replication slot prevents PostgreSQL from recycling WAL segments before Fivetran has read them, ensuring no changes are missed; be aware that an inactive slot causes WAL to accumulate on disk, so monitor slot health. The publication defines which tables participate in replication—add tables explicitly rather than using FOR ALL TABLES to maintain granular control.
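Because an unread slot retains WAL indefinitely, it is worth checking slot health on the source after setup. A query along these lines (PostgreSQL 10+) shows whether each slot is active and how much WAL it is holding back:

-- Run on the source PostgreSQL server: WAL retained by each slot
SELECT 
    slot_name,
    active,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

A large and growing retained_wal value on an inactive slot usually means Fivetran has stopped reading and the slot should be investigated before the disk fills.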

MySQL CDC Configuration

MySQL CDC requires enabling binary logging and creating an appropriate replication user:

-- Verify binary logging is enabled
SHOW VARIABLES LIKE 'log_bin';

-- Add to my.cnf if not enabled (requires restart)
-- [mysqld]
-- server-id = 1
-- log_bin = mysql-bin
-- binlog_format = ROW
-- binlog_row_image = FULL

-- Create replication user
CREATE USER 'fivetran_user'@'%' IDENTIFIED BY 'secure_password';
GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'fivetran_user'@'%';
FLUSH PRIVILEGES;

-- Verify user permissions
SHOW GRANTS FOR 'fivetran_user'@'%';

The binlog_format = ROW setting ensures binary logs contain complete row images rather than SQL statements, critical for accurate CDC. The REPLICATION CLIENT privilege allows Fivetran to read binary log positions, while REPLICATION SLAVE enables reading binary log contents.
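A common pitfall is binlog retention: if MySQL purges binary logs before Fivetran reads them, the connector must perform a full re-sync. On MySQL 8.0 you can verify the settings above and extend retention roughly as follows (the 7-day value is an example; size it to comfortably exceed your longest expected Fivetran outage):

-- Confirm row-based logging is in effect
SHOW VARIABLES LIKE 'binlog_format';
SHOW VARIABLES LIKE 'binlog_row_image';

-- Check and extend binlog retention (MySQL 8.0; example value = 7 days)
SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';
SET GLOBAL binlog_expire_logs_seconds = 604800;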

SQL Server CDC Activation

SQL Server requires enabling CDC at both database and table levels:

-- Enable CDC for database
USE YourDatabase;
GO
EXEC sys.sp_cdc_enable_db;
GO

-- Enable CDC for specific tables
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name = N'customers',
    @role_name = NULL,
    @supports_net_changes = 1;
GO

-- Verify CDC is enabled
SELECT name, is_cdc_enabled 
FROM sys.databases 
WHERE name = 'YourDatabase';

SELECT name, is_tracked_by_cdc 
FROM sys.tables 
WHERE schema_id = SCHEMA_ID('dbo');

-- Create Fivetran login and user
CREATE LOGIN fivetran_user WITH PASSWORD = 'secure_password';
CREATE USER fivetran_user FOR LOGIN fivetran_user;
GRANT SELECT ON SCHEMA::dbo TO fivetran_user;
GRANT VIEW DATABASE STATE TO fivetran_user;

SQL Server CDC creates change tables in the cdc schema that store change data. Fivetran queries these tables to capture modifications. The @supports_net_changes = 1 parameter enables efficient net change queries that consolidate multiple changes to the same row; it requires the table to have a primary key or a unique index specified via @index_name. On self-managed SQL Server, the capture and cleanup processes run as SQL Server Agent jobs, so the Agent service must be running.
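Because capture depends on Agent jobs, a quick smoke test that changes are actually flowing into the change tables is worthwhile. The cdc.dbo_customers_CT name below follows the default capture-instance naming for dbo.customers; adjust it for your tables:

-- List the CDC capture and cleanup jobs (they run under SQL Server Agent)
EXEC sys.sp_cdc_help_jobs;
GO

-- Spot-check recent captured changes for dbo.customers
SELECT TOP 10 * 
FROM cdc.dbo_customers_CT 
ORDER BY __$start_lsn DESC;
GO

If the change table stays empty while the source table is being modified, check that the capture job is running before debugging the Fivetran side.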

Database-Specific CDC Requirements

Database      Key setting            Required permission      CDC feature
PostgreSQL    wal_level = logical    REPLICATION              Publications
MySQL         binlog_format = ROW    REPLICATION SLAVE        Binary logs
SQL Server    sp_cdc_enable_db       VIEW DATABASE STATE      CDC tables

Configuring Snowflake for Fivetran Integration

Snowflake configuration establishes the destination environment where Fivetran loads replicated data. Proper setup ensures Fivetran has necessary permissions while maintaining security through role-based access control.

Create a dedicated database and schema for Fivetran data:

-- Create dedicated database for replicated data
CREATE DATABASE IF NOT EXISTS FIVETRAN_DB;

-- Create schema for each source system
CREATE SCHEMA IF NOT EXISTS FIVETRAN_DB.PRODUCTION_DB;
CREATE SCHEMA IF NOT EXISTS FIVETRAN_DB.ANALYTICS_DB;

-- Create dedicated warehouse for Fivetran loads
CREATE WAREHOUSE IF NOT EXISTS FIVETRAN_WH
  WITH WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

The dedicated warehouse isolates Fivetran’s load operations from user queries, preventing resource contention. Size the warehouse based on expected data volumes—SMALL suffices for low-volume CDC, while high-volume replication may require LARGE or X-LARGE.
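To keep load costs bounded, a resource monitor can be attached to the Fivetran warehouse. The 100-credit monthly quota below is purely an example value:

-- Optional cost guardrail; the credit quota is an illustrative value
CREATE RESOURCE MONITOR IF NOT EXISTS FIVETRAN_RM
  WITH CREDIT_QUOTA = 100
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 90 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE FIVETRAN_WH SET RESOURCE_MONITOR = FIVETRAN_RM;

Note that the SUSPEND trigger halts Fivetran loads when the quota is exhausted; for business-critical pipelines, NOTIFY-only triggers may be the safer choice.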

Create a Fivetran role with appropriate permissions:

-- Create role for Fivetran
CREATE ROLE IF NOT EXISTS FIVETRAN_ROLE;

-- Grant warehouse usage
GRANT USAGE ON WAREHOUSE FIVETRAN_WH TO ROLE FIVETRAN_ROLE;

-- Grant database and schema permissions
GRANT USAGE ON DATABASE FIVETRAN_DB TO ROLE FIVETRAN_ROLE;
GRANT CREATE SCHEMA ON DATABASE FIVETRAN_DB TO ROLE FIVETRAN_ROLE;
GRANT USAGE ON ALL SCHEMAS IN DATABASE FIVETRAN_DB TO ROLE FIVETRAN_ROLE;
GRANT CREATE TABLE ON ALL SCHEMAS IN DATABASE FIVETRAN_DB TO ROLE FIVETRAN_ROLE;

-- Grant table permissions
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA FIVETRAN_DB.PRODUCTION_DB TO ROLE FIVETRAN_ROLE;
GRANT SELECT, INSERT, UPDATE, DELETE ON FUTURE TABLES IN SCHEMA FIVETRAN_DB.PRODUCTION_DB TO ROLE FIVETRAN_ROLE;

-- Create user for Fivetran
CREATE USER IF NOT EXISTS FIVETRAN_USER
  PASSWORD = 'secure_generated_password'
  DEFAULT_ROLE = FIVETRAN_ROLE
  DEFAULT_WAREHOUSE = FIVETRAN_WH
  MUST_CHANGE_PASSWORD = FALSE;

-- Assign role to user
GRANT ROLE FIVETRAN_ROLE TO USER FIVETRAN_USER;

The FUTURE TABLES grant ensures Fivetran can access new tables it creates without manual permission updates. Setting MUST_CHANGE_PASSWORD = FALSE prevents authentication issues since Fivetran manages the connection programmatically.
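Before entering credentials in Fivetran, it helps to confirm the grants landed as intended:

-- Sanity-check the role and user configuration
SHOW GRANTS TO ROLE FIVETRAN_ROLE;
SHOW GRANTS TO USER FIVETRAN_USER;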

Configure network access if using Snowflake network policies:

-- Add Fivetran IP addresses to network policy
-- Fivetran provides current IP ranges in their documentation
-- Note: SET replaces the entire list, so include your existing allowed IPs too
ALTER NETWORK POLICY SNOWFLAKE_NETWORK_POLICY 
  SET ALLOWED_IP_LIST = ('52.0.2.4/32', '34.203.33.166/32');

Fivetran publishes current IP addresses in their documentation. Add these to your network policy to allow connections while maintaining security.

Setting Up Fivetran Connector

With source and destination configured, create the Fivetran connector that orchestrates data replication. The Fivetran UI guides through connector creation, but understanding configuration options ensures optimal setup.

Creating the Database Connector

Navigate to Fivetran dashboard and click “Add Connector.” Select your source database type (PostgreSQL, MySQL, SQL Server, etc.) and provide connection details:

  • Host: Database hostname or IP address
  • Port: Database port (5432 for PostgreSQL, 3306 for MySQL, 1433 for SQL Server)
  • User: Replication user created earlier
  • Password: User password
  • Database: Source database name

For cloud-hosted databases (RDS, Cloud SQL, Azure SQL), enable SSL and provide necessary certificates. Fivetran performs a connection test to verify credentials and network connectivity before proceeding.

Configuring CDC Settings

After connection validation, configure CDC-specific options:

Update Method: Select “Log-based (CDC)” rather than “Periodic (full table)” to enable change data capture. Fivetran automatically uses the appropriate CDC mechanism for your database type.

Historical Sync: Choose whether to perform an initial full copy of existing data before switching to CDC. Enable this for new connectors; disable if tables are already synchronized.

Schema Selection: Select schemas and tables to replicate. Avoid replicating unnecessary tables to reduce load and storage costs. Use inclusion/exclusion patterns for fine-grained control:

Include schemas: public, sales, inventory
Include tables: public.customers, public.orders, sales.*
Exclude tables: public.temp_*, *.audit_log

Primary Key Configuration: Fivetran auto-detects primary keys but allows manual specification. Correct primary key configuration is critical—Fivetran uses these to perform upserts. For tables without natural primary keys, Fivetran can create surrogate keys.
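Before enabling the connector, it is worth identifying source tables that lack a primary key so you can decide between adding one and relying on Fivetran's surrogate keys. On a PostgreSQL source, a catalog query along these lines finds them:

-- Run on the source PostgreSQL server: ordinary tables in public with no primary key
SELECT n.nspname AS schema_name, c.relname AS table_name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname = 'public'
  AND NOT EXISTS (
      SELECT 1 FROM pg_constraint con
      WHERE con.conrelid = c.oid AND con.contype = 'p'
  );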

Configuring Snowflake Destination

Connect the Snowflake destination using the user and role created earlier:

  • Account Identifier: Your Snowflake account ID (e.g., xy12345.us-east-1)
  • User: FIVETRAN_USER
  • Password: User password
  • Database: FIVETRAN_DB
  • Warehouse: FIVETRAN_WH
  • Schema Prefix: Optional prefix for organizing schemas (e.g., “PROD_”, “DEV_”)

Fivetran validates Snowflake connectivity and permissions before proceeding. Address any permission errors by reviewing the SQL grants executed earlier.

Managing Schema Changes and Transformations

CDC pipelines must handle evolving source schemas gracefully. Fivetran provides multiple strategies for managing schema changes and applying transformations during replication.

Automatic Schema Evolution

Fivetran’s schema migration feature automatically propagates source schema changes to Snowflake:

  • New columns: Automatically added to destination tables
  • Column type changes: Handled based on compatibility (e.g., widening VARCHAR)
  • Dropped columns: Optionally preserved or removed in destination
  • New tables: Automatically created in destination schema

Configure schema migration behavior in connector settings:

Allow column removal: Enable to drop columns from Snowflake when removed from source. Disable to preserve historical columns.

Block on incompatible changes: Stop syncing if Fivetran encounters incompatible type changes, preventing data loss or corruption.

Notification settings: Configure alerts for schema changes to maintain awareness of source database evolution.

Applying Transformations

Fivetran supports transformations through dbt Core integration, enabling SQL-based transformations on replicated data. Create transformation models that run after each sync:

-- models/staging/stg_customers.sql
-- Basic staging transformation
SELECT
    customer_id,
    UPPER(TRIM(customer_name)) AS customer_name,
    LOWER(TRIM(email)) AS email,
    country_code,
    registration_date,
    _fivetran_synced AS last_sync_timestamp
FROM {{ source('fivetran', 'customers') }}
WHERE _fivetran_deleted = FALSE

-- models/marts/customer_orders.sql
-- Aggregate customer order metrics
WITH order_stats AS (
    SELECT
        customer_id,
        COUNT(order_id) AS total_orders,
        SUM(order_total) AS lifetime_value,
        MAX(order_date) AS last_order_date
    FROM {{ ref('stg_orders') }}
    GROUP BY customer_id
)
SELECT
    c.customer_id,
    c.customer_name,
    c.email,
    c.country_code,
    COALESCE(o.total_orders, 0) AS total_orders,
    COALESCE(o.lifetime_value, 0) AS lifetime_value,
    o.last_order_date
FROM {{ ref('stg_customers') }} c
LEFT JOIN order_stats o ON c.customer_id = o.customer_id

These dbt models run automatically after Fivetran syncs, maintaining transformed tables that stay current with source changes. The _fivetran_synced and _fivetran_deleted metadata columns enable incremental processing and soft delete handling.
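The _fivetran_synced column also enables incremental dbt models that process only newly synced rows on each run. A minimal sketch, with illustrative model and column names:

-- models/staging/stg_orders.sql as an incremental model (illustrative)
{{ config(materialized='incremental', unique_key='order_id') }}

SELECT
    order_id,
    customer_id,
    order_total,
    order_date,
    _fivetran_synced
FROM {{ source('fivetran', 'orders') }}
{% if is_incremental() %}
-- Process only rows Fivetran touched since the last dbt run
WHERE _fivetran_synced > (SELECT MAX(_fivetran_synced) FROM {{ this }})
{% endif %}

On large tables this avoids rescanning history every run, at the cost of requiring a reliable unique_key for the merge.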

Handling Soft Deletes

Configure how Fivetran handles deleted records:

Hard Delete: Physically removes deleted rows from Snowflake tables, matching source behavior exactly.

Soft Delete: Marks deleted rows with _fivetran_deleted = TRUE rather than removing them, preserving historical records for audit or analysis.

Soft deletes enable tracking deletion patterns and recovering accidentally deleted data:

-- Query active records only
SELECT * FROM customers WHERE _fivetran_deleted = FALSE;

-- Analyze deletion patterns
SELECT 
    DATE_TRUNC('day', _fivetran_synced) AS deletion_date,
    COUNT(*) AS deletions_count
FROM customers
WHERE _fivetran_deleted = TRUE
GROUP BY deletion_date
ORDER BY deletion_date;

Fivetran Metadata Columns

_fivetran_synced

Timestamp when Fivetran last modified the row. Use for incremental downstream processing: WHERE _fivetran_synced > last_processed_time

_fivetran_deleted

Boolean flag indicating soft-deleted records. Filter active records with: WHERE _fivetran_deleted = FALSE

_fivetran_id

Surrogate key generated by Fivetran for tables without natural primary keys. Ensures upsert operations work correctly even without explicit keys.

_fivetran_index

Sequence number for rows with identical primary keys in the same sync, handling edge cases where multiple changes occur between syncs.

Monitoring Pipeline Performance and Health

Production CDC pipelines require continuous monitoring to detect issues before they impact downstream systems. Fivetran provides monitoring tools and Snowflake queries reveal pipeline behavior.

Fivetran Dashboard Monitoring

The Fivetran dashboard displays connector health and sync metrics:

  • Sync frequency: How often Fivetran checks for changes (typically 1-15 minutes depending on plan)
  • Rows synced: Number of rows inserted, updated, and deleted per sync
  • Sync duration: Time required to complete each sync cycle
  • Schema changes: Recent schema modifications detected
  • Errors and warnings: Issues requiring attention

Set up alerts for critical conditions:

  • Sync failures: Immediate notification when syncs fail
  • Schema changes: Alert on structural changes for awareness
  • Volume anomalies: Notify when row counts exceed expected thresholds
  • Lag warnings: Alert if replication lag grows beyond acceptable limits

Snowflake-Side Monitoring

Query Fivetran metadata tables to analyze pipeline behavior:

-- Check table freshness and size via INFORMATION_SCHEMA
-- (MAX(_fivetran_synced) must be queried per table: IDENTIFIER()
--  cannot resolve a table name computed per row)
SELECT 
    table_schema,
    table_name,
    row_count,
    last_altered
FROM FIVETRAN_DB.INFORMATION_SCHEMA.TABLES
WHERE table_type = 'BASE TABLE'
ORDER BY last_altered DESC;

-- Monitor sync lag for a specific table
SELECT 
    'CUSTOMERS' AS table_name,
    MAX(_fivetran_synced) AS last_fivetran_sync,
    DATEDIFF(minute, MAX(_fivetran_synced), CURRENT_TIMESTAMP()) AS sync_lag_minutes
FROM FIVETRAN_DB.PRODUCTION_DB.CUSTOMERS
HAVING DATEDIFF(minute, MAX(_fivetran_synced), CURRENT_TIMESTAMP()) > 30;  -- Alert if lag exceeds 30 minutes

-- Analyze daily sync volumes
SELECT 
    DATE_TRUNC('day', _fivetran_synced) AS sync_date,
    COUNT(*) AS rows_synced,
    COUNT(CASE WHEN _fivetran_deleted THEN 1 END) AS rows_deleted
FROM FIVETRAN_DB.PRODUCTION_DB.ORDERS
WHERE _fivetran_synced >= CURRENT_DATE - 7
GROUP BY sync_date
ORDER BY sync_date DESC;

These queries identify replication lag, track sync volumes, and detect anomalies requiring investigation.
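The lag check can also be automated inside Snowflake with a scheduled task that records breaches into an alert table. The MONITORING schema and SYNC_LAG_ALERTS table here are hypothetical and created for the example:

-- Hypothetical alert table and scheduled lag check
CREATE SCHEMA IF NOT EXISTS FIVETRAN_DB.MONITORING;

CREATE TABLE IF NOT EXISTS FIVETRAN_DB.MONITORING.SYNC_LAG_ALERTS (
    table_name STRING,
    last_sync TIMESTAMP_NTZ,
    checked_at TIMESTAMP_NTZ
);

CREATE TASK IF NOT EXISTS FIVETRAN_DB.MONITORING.CHECK_SYNC_LAG
  WAREHOUSE = FIVETRAN_WH
  SCHEDULE = '30 MINUTE'
AS
  INSERT INTO FIVETRAN_DB.MONITORING.SYNC_LAG_ALERTS
  SELECT 'CUSTOMERS', MAX(_fivetran_synced), CURRENT_TIMESTAMP()
  FROM FIVETRAN_DB.PRODUCTION_DB.CUSTOMERS
  HAVING DATEDIFF(minute, MAX(_fivetran_synced), CURRENT_TIMESTAMP()) > 30;

-- Tasks are created suspended; resume to activate
ALTER TASK FIVETRAN_DB.MONITORING.CHECK_SYNC_LAG RESUME;

Downstream alerting (email, Slack, etc.) can then poll the alert table, or the INSERT can be replaced with a notification integration.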

Performance Optimization

Optimize CDC pipeline performance through several strategies:

Warehouse sizing: Monitor Fivetran warehouse utilization. If sync durations grow despite stable data volumes, increase warehouse size. Multi-cluster (auto-scaling) warehouses add query concurrency rather than single-query speed, so they rarely help Fivetran's largely sequential sync workloads; a larger single cluster usually does.

Clustering keys: Add clustering keys to large tables frequently filtered by date or other selective columns:

ALTER TABLE FIVETRAN_DB.PRODUCTION_DB.ORDERS 
CLUSTER BY (order_date, customer_id);
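After automatic reclustering has had time to run, SYSTEM$CLUSTERING_INFORMATION reports how well the table's micro-partitions align with the key:

-- Inspect clustering depth and overlap for the clustered table
SELECT SYSTEM$CLUSTERING_INFORMATION('FIVETRAN_DB.PRODUCTION_DB.ORDERS');

High average depth or heavy partition overlap suggests the key is not pruning effectively for your query patterns.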

Selective replication: Exclude tables that don’t require real-time synchronization. Use Fivetran’s table selection to replicate only necessary data, reducing processing and storage costs.

Compression: Snowflake automatically compresses data, but verify compression ratios for very large tables:

SELECT 
    table_name,
    row_count,
    bytes / (1024 * 1024 * 1024) AS size_gb,
    (bytes / NULLIF(row_count, 0)) AS bytes_per_row
FROM FIVETRAN_DB.INFORMATION_SCHEMA.TABLES
WHERE table_schema = 'PRODUCTION_DB'
  AND table_type = 'BASE TABLE'
ORDER BY size_gb DESC;

Poor compression ratios may indicate opportunities for data type optimization or partitioning strategies.

Troubleshooting Common CDC Issues

Understanding common CDC problems and their solutions reduces troubleshooting time and prevents prolonged outages.

Replication Lag

Symptoms: Growing delay between source changes and Snowflake visibility.

Causes and solutions:

  • Large transactions: Breaking up bulk operations in source database improves CDC throughput
  • Warehouse undersized: Increase Snowflake warehouse size if Fivetran syncs run slowly
  • Network issues: Check connectivity between Fivetran and source database
  • Log retention: Ensure source database retains logs long enough for Fivetran to process

Schema Conflict Errors

Symptoms: Syncs fail with schema mismatch errors.

Causes and solutions:

  • Incompatible type changes: Review schema change causing conflict; may require manual intervention
  • Missing permissions: Verify Fivetran user has necessary DDL permissions
  • Concurrent modifications: Avoid manual schema changes in Fivetran-managed tables

Primary Key Violations

Symptoms: Duplicate key errors during upserts.

Causes and solutions:

  • Incorrect primary key: Verify primary key configuration matches source table definition
  • Missing primary key: Add natural or surrogate keys to tables lacking them
  • Source data quality: Fix duplicate keys in source database

Connection Interruptions

Symptoms: Syncs fail intermittently with connection errors.

Causes and solutions:

  • Firewall rules: Ensure Fivetran IPs are whitelisted
  • Database restarts: Fivetran automatically reconnects; verify replication slot/binlog position restored correctly
  • Credential changes: Update Fivetran connector if database passwords rotated

Conclusion

Implementing CDC with Fivetran and Snowflake transforms data integration from batch-oriented ETL to continuous synchronization that keeps analytical data current. The combination of Fivetran’s managed CDC capabilities and Snowflake’s performance and scalability enables real-time analytics without the operational complexity of building custom replication infrastructure. Proper setup—configuring source databases correctly, securing Snowflake with appropriate permissions, and monitoring pipeline health—ensures reliable data replication that scales with business growth.

Success with CDC pipelines extends beyond initial implementation to ongoing optimization and maintenance. Monitor replication lag and sync performance, manage schema evolution proactively, and leverage Fivetran’s metadata for downstream incremental processing. These practices transform CDC from a technical integration into a strategic capability that enables businesses to make decisions based on current data rather than yesterday’s snapshots, ultimately delivering the real-time insights modern organizations require.
