Change Data Capture (CDC) has become the backbone of modern data architectures, enabling real-time data synchronization between operational databases and analytical systems, powering event-driven architectures, and maintaining materialized views across distributed systems. Debezium, as the leading open-source CDC platform, captures row-level changes from databases and streams them to Kafka with low latency and at-least-once delivery guarantees. Yet the power of CDC comes with operational complexity—connectors can lag, fail silently, encounter database permission issues, or experience backpressure from downstream consumers.
Effective monitoring transforms Debezium from a fragile pipeline into a reliable production system. Without proper observability, you discover problems when users report stale data or when emergency debugging reveals hours of accumulated lag. With comprehensive monitoring, you detect issues within minutes—connector failures trigger alerts, lag metrics show capacity problems before they impact SLAs, and detailed metrics enable root cause analysis that accelerates resolution from hours to minutes.
This guide provides a deep dive into monitoring Debezium connectors for production CDC pipelines, covering the critical metrics that matter, implementation strategies using Prometheus and Grafana, alerting rules that catch problems early, and troubleshooting approaches that leverage monitoring data to resolve issues quickly.
Understanding What to Monitor in Debezium
Debezium connectors expose numerous metrics through JMX (Java Management Extensions), but knowing which metrics actually matter separates signal from noise. Focus monitoring on metrics that directly indicate health, performance, or impending problems.
Connector State and Health Metrics
The most fundamental question is: “Is my connector running?” Debezium connectors can exist in several states that determine whether data flows:
Connector status: Running, Failed, Paused, or Unassigned. A connector in Failed state has stopped processing entirely, requiring intervention. Unassigned state indicates Kafka Connect cluster issues rather than connector-specific problems.
Task status: Each connector spawns one or more tasks that do actual work. Individual task failures can cause partial processing loss while the connector shows Running state. You must monitor task-level status separately from connector status.
Last successful offset commit: Tracks when the connector last successfully committed its position in the database log. Stale commit timestamps indicate the connector is stuck—processing events but unable to persist progress, usually due to downstream issues or internal errors.
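Connector and task state can also be checked directly through the Kafka Connect REST API, which is useful for spot checks and simple scripted probes. A minimal example (worker host, port, and connector name are placeholders for your deployment):
# Returns connector.state and tasks[].state (RUNNING, FAILED, PAUSED, UNASSIGNED)
curl -s http://connect-worker-1:8083/connectors/inventory-connector/status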
These state metrics answer “is it working right now?” but don’t reveal performance or capacity problems. For that, you need throughput and lag metrics.
Throughput and Processing Metrics
Throughput metrics reveal how much data your connector processes and identify capacity bottlenecks:
Events processed per second: The rate at which the connector reads database changes. Healthy connectors show consistent rates during normal operation, with spikes during bulk operations. Sudden drops to zero indicate problems; gradual decreases suggest increasing lag.
Bytes processed per second: Complements event count by revealing data volume. A million small updates differs operationally from a million bulk inserts with large payloads. High byte rates with normal event counts indicate large transactions or blob-heavy changes.
Database transaction commit rate: Measures how many source database transactions the connector processes. This metric directly correlates with source database write load, helping distinguish “connector is slow” from “database is genuinely quiet.”
Snapshot progress: During initial snapshots, metrics track table completion percentage, rows processed, and estimated time remaining. Slow snapshot progress often indicates database performance issues, network bottlenecks, or suboptimal connector configuration.
Throughput metrics reveal the “how much” question, but lag metrics answer the critical “how far behind” question that determines whether your CDC pipeline meets latency SLAs.
Lag and Latency Metrics
Lag represents the time delta between when changes occur in the source database and when Debezium processes them. Multiple lag metrics provide different perspectives:
Replication lag (milliseconds behind master): For databases with replication (MySQL, PostgreSQL), this measures how far the replica that Debezium reads from has fallen behind the primary. High replication lag isn't a Debezium problem, but it indicates database infrastructure issues that indirectly increase CDC latency.
Connector processing lag: Time between when a database change was committed and when Debezium processed it. This is the purest measure of Debezium performance. Increasing processing lag indicates the connector can’t keep up with database write rates.
Queue lag: Number of events buffered in internal connector queues waiting for processing. High queue lag suggests backpressure from Kafka—the connector captures changes faster than it can publish them to Kafka topics.
End-to-end lag: Time from database commit to consumer application processing. This requires correlation IDs or timestamp tracking across the pipeline and represents the metric that actually matters to users, though only a portion is attributable to Debezium itself.
Lag metrics are your early warning system. Rising lag indicates impending problems even when current state shows everything running normally.
Error and Failure Metrics
Errors occur constantly in production CDC pipelines—temporary network glitches, database deadlocks, transient Kafka broker issues. Most errors self-resolve through retries, but patterns in error metrics reveal systemic problems:
Retry count: How many times the connector retried failed operations. Occasional retries are normal; sustained high retry rates indicate persistent problems that eventually exhaust retry limits and cause connector failure.
Connection errors: Failed connections to source database or Kafka cluster. Spikes indicate network issues or infrastructure problems; sustained connection errors suggest misconfiguration or authentication failures.
Serialization errors: Failures converting database changes to Kafka messages, usually indicating schema evolution problems—new columns with incompatible types, unexpected null values, or character encoding issues.
Transaction errors: Problems reading or parsing database transaction logs (binlog, WAL, etc.). These often result from database configuration issues (binary logging disabled, insufficient retention) or permission problems.
Error rate trends matter more than absolute numbers. A connector generating 5 errors per minute that rises to 50 per minute signals an emerging problem, while a consistent 5 errors per minute might be acceptable noise.
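That trend check can be expressed directly as a query once metrics are flowing into Prometheus (setup covered later in this guide); a sketch, using the metric names produced by the JMX exporter rules shown below:
# Flags connectors whose 5-minute error rate exceeds 5x their trailing 1-hour average
(rate(debezium_streaming_NumberOfErrantRecords[5m]) + 0.01)
  / (rate(debezium_streaming_NumberOfErrantRecords[1h]) + 0.01) > 5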
Critical Debezium Metrics Summary
Alert immediately when:
- Connector state = Failed
- Task failures > 0
- No offset commits for 5+ minutes
- Processing lag > SLA threshold
- Processing lag increasing trend
- Retry rate spike (>100/min)
- Queue depth growing
- Throughput dropped by 50%+

Track on dashboards:
- Events processed per second
- Snapshot progress percentage
- Connection pool utilization
- Memory and CPU usage

Review for capacity planning:
- Peak throughput vs. capacity
- Storage growth rate
- Network bandwidth usage
- Historical lag patterns
Implementing Monitoring with Prometheus and Grafana
Debezium exposes metrics through JMX, which integrates naturally with Prometheus for collection and Grafana for visualization. This stack provides production-grade monitoring with minimal configuration.
Configuring JMX Exporter for Debezium
The JMX Exporter bridges Java’s JMX metrics to Prometheus’ scraping model. Deploy it as a Java agent alongside Kafka Connect:
# jmx_exporter_config.yml
---
startDelaySeconds: 0
ssl: false
lowercaseOutputName: true
lowercaseOutputLabelNames: true
# Debezium connector metrics
rules:
# Connector state and health
- pattern: 'debezium.([^:]+)<type=connector-metrics, context=([^,]+), server=([^,]+), key=([^>]+)><>(\w+)'
name: debezium_metrics_$5
labels:
connector_type: "$1"
context: "$2"
server: "$3"
key: "$4"
# Streaming metrics (throughput, lag)
- pattern: 'debezium.([^:]+)<type=connector-metrics, context=streaming, server=([^>]+)><>([\w]+)'
name: debezium_streaming_$3
labels:
connector_type: "$1"
server: "$2"
# Snapshot metrics
- pattern: 'debezium.([^:]+)<type=connector-metrics, context=snapshot, server=([^>]+)><>([\w]+)'
name: debezium_snapshot_$3
labels:
connector_type: "$1"
server: "$2"
# Task metrics
- pattern: 'debezium.([^:]+)<type=connector-metrics, context=task, server=([^,]+), task=([^>]+)><>([\w]+)'
name: debezium_task_$4
labels:
connector_type: "$1"
server: "$2"
task: "$3"
# Kafka Connect worker metrics
- pattern: 'kafka.connect<type=connect-worker-metrics><>([^:]+)'
name: kafka_connect_worker_$1
- pattern: 'kafka.connect<type=connector-metrics, connector=([^>]+)><>([\w]+)'
name: kafka_connect_connector_$2
labels:
connector: "$1"
Launch Kafka Connect with the JMX Exporter agent:
export KAFKA_OPTS="-javaagent:/path/to/jmx_prometheus_javaagent.jar=8080:/path/to/jmx_exporter_config.yml"
./bin/connect-distributed.sh config/connect-distributed.properties
The exporter now serves metrics on port 8080. Configure your Prometheus server to scrape every Kafka Connect worker:
# prometheus.yml
scrape_configs:
- job_name: 'debezium-connectors'
static_configs:
- targets: ['connect-worker-1:8080', 'connect-worker-2:8080', 'connect-worker-3:8080']
scrape_interval: 15s
scrape_timeout: 10s
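Before building dashboards, it is worth confirming that the exporter is actually emitting Debezium metrics (worker hostname is a placeholder):
# Should list debezium_streaming_*, debezium_snapshot_*, and kafka_connect_* series
curl -s http://connect-worker-1:8080/metrics | grep -E '^debezium_|^kafka_connect_' | head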
Essential PromQL Queries for Debezium Monitoring
With metrics flowing into Prometheus, these queries extract actionable insights:
Connector running status (0 = down, 1 = running):
debezium_connector_status{state="running"}
Processing lag in milliseconds:
debezium_streaming_MilliSecondsBehindSource
Events processed per second (rate over 1 minute):
rate(debezium_streaming_TotalNumberOfEventsSeen[1m])
Task failure count:
sum(debezium_task_status{state="failed"}) by (connector, task)
Queue depth (events waiting for processing):
debezium_streaming_QueueTotalCapacity - debezium_streaming_QueueRemainingCapacity
Snapshot progress (percentage of tables completed):
(debezium_snapshot_TotalTableCount - debezium_snapshot_RemainingTableCount) / debezium_snapshot_TotalTableCount * 100
Error rate (errors per minute):
rate(debezium_streaming_NumberOfErrantRecords[1m]) * 60
These queries form the foundation of dashboards and alerts that keep CDC pipelines healthy.
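If dashboards and alerts reuse the same expressions, Prometheus recording rules can precompute them once per evaluation interval. A small sketch using the metric names above (rule names are illustrative):
# debezium_rules.yml
groups:
  - name: debezium-summary
    rules:
      - record: debezium:events_per_second
        expr: sum(rate(debezium_streaming_TotalNumberOfEventsSeen[1m])) by (server)
      - record: debezium:max_lag_ms
        expr: max(debezium_streaming_MilliSecondsBehindSource) by (server)
      - record: debezium:queue_fill_ratio
        expr: (debezium_streaming_QueueTotalCapacity - debezium_streaming_QueueRemainingCapacity) / debezium_streaming_QueueTotalCapacity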
Building Effective Grafana Dashboards
Effective dashboards surface problems quickly without overwhelming operators with information. Organize dashboards hierarchically:
Overview dashboard shows connector health across your entire CDC pipeline:
- Count of running vs failed connectors
- Total throughput (events/sec across all connectors)
- Maximum lag across all connectors
- Recent alerts fired
- Resource utilization (CPU, memory, network)
Per-connector dashboard drills into individual connector details:
- Current state and last state change
- Processing lag over time (1h, 6h, 24h views)
- Throughput graphs (events/sec, bytes/sec)
- Error count and retry rate
- Queue depth and saturation
- Snapshot progress (if applicable)
Troubleshooting dashboard aids incident response:
- Detailed error messages (requires log aggregation)
- Transaction size distribution
- Network latency to source database
- Kafka producer metrics (batch size, compression ratio)
- JVM metrics (heap usage, GC time)
Use color coding thoughtfully: green for healthy metrics, yellow for warnings, red for critical issues. Set thresholds on gauges so operators instantly recognize problems.
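If you manage Grafana declaratively, the Prometheus datasource these dashboards rely on can be provisioned with a small config file; a sketch (the URL is a placeholder for your environment):
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true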
Setting Up Effective Alerting
Monitoring without alerting is observability theater—you can see problems after they occur but miss the opportunity to respond before users notice. Effective alerts balance sensitivity (catching real issues) with specificity (avoiding false positives that cause alert fatigue).
Critical Alert Rules
Configure these alerts to trigger on conditions that require immediate response:
Connector failure alert:
- alert: DebeziumConnectorDown
expr: debezium_connector_status{state="running"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Debezium connector {{ $labels.connector }} is down"
description: "Connector {{ $labels.connector }} has been in failed state for more than 2 minutes"
Processing lag exceeds SLA:
- alert: DebeziumHighLag
expr: debezium_streaming_MilliSecondsBehindSource > 300000 # 5 minutes
for: 5m
labels:
severity: critical
annotations:
summary: "High processing lag on {{ $labels.connector }}"
description: "Connector {{ $labels.connector }} is {{ $value }}ms behind source (threshold: 300000ms)"
Task failure alert:
- alert: DebeziumTaskFailed
expr: sum(debezium_task_status{state="failed"}) by (connector) > 0
for: 1m
labels:
severity: warning
annotations:
summary: "Task failure in connector {{ $labels.connector }}"
description: "One or more tasks failed in connector {{ $labels.connector }}"
No recent offset commits:
- alert: DebeziumStaleOffsets
expr: time() - debezium_streaming_LastOffsetCommit > 600 # 10 minutes
for: 5m
labels:
severity: warning
annotations:
summary: "Stale offsets for {{ $labels.connector }}"
description: "No offset commit in last {{ $value }} seconds for {{ $labels.connector }}"
Warning-Level Alert Rules
Warning alerts indicate developing problems that don’t yet require immediate action but need investigation:
Increasing lag trend:
- alert: DebeziumLagIncreasing
expr: deriv(debezium_streaming_MilliSecondsBehindSource[15m]) > 16.7 # deriv() is per second; ~1 second of added lag per minute
for: 10m
labels:
severity: warning
annotations:
summary: "Processing lag increasing for {{ $labels.connector }}"
description: "Lag for {{ $labels.connector }} is increasing at {{ $value }}ms per minute"
High error rate:
- alert: DebeziumHighErrorRate
expr: rate(debezium_streaming_NumberOfErrantRecords[5m]) * 60 > 10 # More than 10 errors per minute
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate on {{ $labels.connector }}"
description: "Connector {{ $labels.connector }} experiencing {{ $value }} errors per minute"
Queue saturation:
- alert: DebeziumQueueSaturated
expr: (debezium_streaming_QueueTotalCapacity - debezium_streaming_QueueRemainingCapacity) / debezium_streaming_QueueTotalCapacity > 0.8
for: 10m
labels:
severity: warning
annotations:
summary: "Queue saturation on {{ $labels.connector }}"
description: "Queue {{ $value | humanizePercentage }} full for {{ $labels.connector }}"
Tune the for: durations based on your SLAs and operational response times. Shorter durations catch problems faster but increase false-positive risk; longer durations provide more certainty but delay notification.
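How these alerts reach people matters as much as their thresholds. A routing sketch for Alertmanager, assuming hypothetical receiver names, might separate critical pages from warning notifications:
# alertmanager.yml (routing excerpt; receivers must be defined elsewhere in the file)
route:
  receiver: cdc-team-slack
  group_by: ['alertname', 'connector']
  routes:
    - matchers:
        - severity = "critical"
      receiver: cdc-oncall-pager
    - matchers:
        - severity = "warning"
      receiver: cdc-team-slack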
Alert Response Playbook
Connector or task failure:
1. Check connector logs for the error message
2. Verify source database connectivity and credentials
3. Check Kafka cluster health
4. Restart the connector or failed task once the underlying issue is resolved (see the example below)
Common Causes: Database permissions changed, network partition, Kafka broker outage, schema incompatibility
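A minimal recovery sequence via the Kafka Connect REST API might look like this (worker host, port, and connector name are placeholders):
# Inspect connector and task state, including the failure trace
curl -s http://connect-worker-1:8083/connectors/inventory-connector/status | jq '.tasks[] | {id, state, trace}'

# Restart a failed task (task 0 here) after fixing the underlying issue
curl -X POST http://connect-worker-1:8083/connectors/inventory-connector/tasks/0/restart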
High or growing processing lag:
1. Check if lag is growing or stable
2. Compare throughput vs baseline
3. Check database replication lag
4. Monitor Kafka producer metrics for backpressure
Common Causes: Bulk database operations, insufficient Kafka Connect resources, slow consumers blocking topics, database performance issues
High error rate:
1. Review error messages in connector logs
2. Check for schema evolution events
3. Verify data type compatibility
4. Inspect dead letter queue if configured
Common Causes: Schema changes without connector restart, unexpected null values, character encoding issues, data type incompatibilities
Troubleshooting Common Issues Using Monitoring Data
Effective monitoring not only alerts you to problems but provides the data needed to diagnose and resolve them quickly. Here’s how to leverage monitoring metrics for common Debezium issues.
Diagnosing Persistent Lag
When processing lag grows continuously despite the connector showing healthy state, systematic analysis reveals the bottleneck:
Step 1: Determine if lag is from connector or database replication
- Check debezium_streaming_MilliSecondsBehindSource (connector lag)
- Compare it with the database's own replication lag metrics
- If replication lag is high, the problem is database infrastructure, not Debezium
Step 2: Analyze throughput trends
- Compare current events/sec with historical baseline
- If throughput dropped suddenly, check recent configuration changes
- If throughput is consistent but lag grows, database write rate exceeds connector capacity
Step 3: Identify resource constraints
- Check Kafka Connect worker CPU and memory
- Monitor network bandwidth to source database
- Review Kafka producer buffer metrics for backpressure
Step 4: Look for blocking operations
- Large transactions can block connector progress
- Check transaction size distribution
- Monitor snapshot progress if concurrent with streaming
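For the baseline comparison in Step 2, a quick PromQL sketch compares the current 5-minute throughput with the same window one day earlier (metric names follow the exporter rules above):
# Values well below 1 suggest throughput has dropped relative to yesterday
rate(debezium_streaming_TotalNumberOfEventsSeen[5m])
  / rate(debezium_streaming_TotalNumberOfEventsSeen[5m] offset 1d)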
Resolution strategies based on diagnosis:
- Insufficient connector resources → Scale Kafka Connect workers or increase parallelism
- Database replication lag → Address database performance or increase replica resources
- Kafka backpressure → Scale Kafka brokers or optimize consumer applications
- Large transactions → Tune connector batch size or split large operations
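When the diagnosis points at connector capacity rather than the database or Kafka, the Debezium queue and batch settings are the usual tuning knobs. A sketch with illustrative values (defaults and availability vary by connector and version):
# Throughput-related Debezium connector settings
max.batch.size=4096
max.queue.size=16384
max.queue.size.in.bytes=536870912
poll.interval.ms=500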
Resolving Intermittent Failures
Connectors that repeatedly fail and recover create operational chaos. Monitoring patterns reveal root causes:
Analyze error patterns:
- Plot error count over time—periodic spikes suggest external factors (nightly batch jobs, maintenance windows)
- Check error types—connection errors indicate network issues, serialization errors suggest schema problems
- Correlate errors with other metrics—do failures coincide with throughput spikes or specific times?
Common intermittent failure patterns:
Pattern 1: Database connection timeouts during peak load
- Symptom: Connection errors correlate with high database CPU
- Monitoring evidence: Error spikes align with database query latency increases
- Solution: Increase connection pool size or reduce connector load during peak hours
Pattern 2: Kafka producer timeouts during rebalances
- Symptom: Producer timeouts when consumers join/leave
- Monitoring evidence: Errors correlate with consumer group rebalances
- Solution: Increase Kafka producer timeout configuration
Pattern 3: Schema registry unavailability
- Symptom: Serialization errors during schema registry maintenance
- Monitoring evidence: Errors occur at consistent times (maintenance windows)
- Solution: Implement retry logic or buffer events during known maintenance
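For transient failures like these, Kafka Connect's per-connector error-handling options (KIP-298) control how long the framework retries before a task fails; values below are illustrative:
# Kafka Connect error-handling settings
errors.retry.timeout=300000
errors.retry.delay.max.ms=60000
errors.log.enable=true
errors.log.include.messages=true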
Optimizing Snapshot Performance
Initial snapshots can take hours or days for large databases. Monitoring identifies optimization opportunities:
Track snapshot progress metrics:
- debezium_snapshot_RowsScanned shows how many rows have been processed
- debezium_snapshot_TotalTableCount and debezium_snapshot_RemainingTableCount indicate how much work remains
- Calculate rows per second to estimate completion time
Identify slow tables:
- Per-table snapshot metrics reveal which tables consume most time
- Large tables with complex indexes scan slowly
- Wide tables with many columns or BLOBs transfer slowly
Optimization strategies:
- Increase snapshot.fetch.size to read more rows per batch
- Add indexes to columns used in snapshot queries
- Schedule snapshots during low-database-load periods
- Consider parallel snapshotting for very large databases
- Use snapshot.mode=schema_only for tables you can backfill through other means
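These strategies map onto a handful of connector properties. A sketch with illustrative values (property names and availability vary by connector and Debezium version, so check the documentation for yours):
# Snapshot tuning
snapshot.mode=initial
snapshot.fetch.size=10240
snapshot.max.threads=4
snapshot.delay.ms=0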
Monitor throughput during snapshot. If it remains consistently low despite resource availability, database query performance is the bottleneck—work with DBAs to optimize.
Advanced Monitoring Patterns
Beyond basic metrics, advanced monitoring patterns provide deeper insights for complex CDC pipelines.
End-to-End Latency Tracking
Measure true pipeline latency from database commit to downstream consumption:
- Database embeds commit timestamp in changed records
- Debezium includes source timestamp in Kafka message metadata
- Consumers record processing timestamp
- Calculate deltas at each stage: DB→Debezium, Debezium→Kafka, Kafka→Consumer
This reveals where latency accumulates—database replication, Debezium processing, network transit, Kafka ingestion, or consumer processing. Optimize the actual bottleneck rather than guessing.
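Debezium's change event envelope already carries the timestamps needed for the first deltas: source.ts_ms records when the change was made in the database and the top-level ts_ms records when Debezium processed it, while the Kafka record carries its own timestamp. A simplified envelope fragment (field values are illustrative):
{
  "payload": {
    "before": null,
    "after": { "id": 1001, "status": "created" },
    "source": { "ts_ms": 1700000000000, "db": "inventory", "table": "orders" },
    "op": "c",
    "ts_ms": 1700000000150
  }
}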
Anomaly Detection on Metrics
Static thresholds catch obvious problems but miss subtle degradations. Implement anomaly detection:
- Calculate baseline throughput patterns (higher during business hours, lower nights/weekends)
- Alert when metrics deviate significantly from expected patterns
- Use statistical methods (standard deviations) or machine learning (forecasting models)
Example: Throughput dropping from 1000 to 500 events/sec might be normal at 2am but alarming at 2pm. Anomaly detection contextualizes metrics with temporal patterns.
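A statistical variant can be written directly in PromQL by comparing current throughput with a trailing mean and standard deviation; a sketch, where the 1-day window and the -3 threshold are arbitrary starting points:
# z-score of current throughput vs the trailing 24h distribution; flag large drops
(
  rate(debezium_streaming_TotalNumberOfEventsSeen[5m])
  - avg_over_time(rate(debezium_streaming_TotalNumberOfEventsSeen[5m])[1d:5m])
) / stddev_over_time(rate(debezium_streaming_TotalNumberOfEventsSeen[5m])[1d:5m]) < -3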
Correlation Analysis
Complex problems emerge from interactions between metrics:
- High lag + normal throughput + growing queue = Kafka backpressure
- High lag + dropped throughput + errors = Connector health issue
- High lag + high throughput + stable queue = Database replication lag
Build dashboards that display related metrics together. During incidents, correlated metrics guide faster diagnosis than investigating metrics in isolation.
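The first signature in the list above can even be captured as a single expression; a sketch using the metric names from earlier (the 60-second lag threshold is arbitrary):
# Lag elevated while the internal queue stays more than 80% full suggests Kafka backpressure
(debezium_streaming_MilliSecondsBehindSource > 60000)
and
((debezium_streaming_QueueTotalCapacity - debezium_streaming_QueueRemainingCapacity)
  / debezium_streaming_QueueTotalCapacity > 0.8)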
Conclusion
Effective monitoring transforms Debezium from a powerful but opaque CDC tool into a reliable, observable production system. By focusing on the metrics that truly matter—connector state, processing lag, throughput, and errors—and implementing comprehensive alerting that catches problems before users notice them, you build CDC pipelines that operators can confidently maintain and debug. The monitoring infrastructure using Prometheus for collection, Grafana for visualization, and well-tuned alerts provides the observability foundation that separates hobby projects from production-grade data platforms.
Success with Debezium monitoring requires continuous iteration—tuning alert thresholds based on operational experience, adding new metrics as you encounter novel failure modes, and refining dashboards to surface insights that drive faster incident resolution. Start with the critical metrics and alerts outlined here, instrument them properly using JMX exporters and Prometheus, and evolve your monitoring practice as you learn what matters most for your specific CDC workloads. With proper monitoring in place, Debezium connectors become reliable, maintainable components of your data infrastructure rather than mysterious black boxes that occasionally fail in production.