Database Monitoring: Metrics, Alerts, and Performance Tuning
Database monitoring involves tracking performance metrics, query execution times, resource utilization, and error rates. It helps identify bottlenecks, predict capacity needs, and detect issues before they affect users.
Database monitoring is the practice of continuously observing database performance, health, and resource utilization to ensure optimal operation. It involves collecting metrics, analyzing trends, detecting anomalies, and alerting on issues before they impact users. Without proper monitoring, database problems go unnoticed until users complain, by which time the issue may have already caused significant damage.
Effective database monitoring answers critical questions: Is the database healthy? Are queries performing within acceptable limits? Is there enough capacity for growth? Are there bottlenecks that need attention? To understand monitoring properly, it is helpful to be familiar with SQL optimization, database indexing, and database performance tuning.
┌─────────────────────────────────────────────────────────────┐
│ Database Monitoring │
├─────────────────────────────────────────────────────────────┤
│ │
│ Key Metrics: Alert Categories: │
│ • Query performance • Performance degradation │
│ • Connection count • Connection pool issues │
│ • Cache hit ratio • Disk space critical │
│ • Lock contention • Replication lag │
│ • Replication lag • Deadlock frequency │
│ • CPU / Memory / Disk • Long-running queries │
│ │
│ Monitoring Tools: │
│ • pg_stat_statements (PG) • Performance Schema (MySQL) │
│ • Query Store (SQL Server) • Prometheus + Grafana │
│ • Datadog, New Relic, SolarWinds │
│ │
└─────────────────────────────────────────────────────────────┘
What Is Database Monitoring
Database monitoring is the process of collecting, analyzing, and alerting on database performance and health metrics. It provides visibility into how the database is performing, identifies bottlenecks, and helps predict future capacity needs. Monitoring is essential for maintaining service level agreements (SLAs), preventing outages, and optimizing performance.
- Metrics Collection: Gathering quantitative data about database operation (queries per second, connection count, cache hit ratio).
- Alerting: Notifying administrators when metrics exceed thresholds or anomalies are detected.
- Trend Analysis: Tracking metrics over time to identify patterns and predict future needs.
- Capacity Planning: Using historical data to forecast when resources will need to be increased.
- Root Cause Analysis: Investigating past incidents using historical monitoring data.
Why Database Monitoring Matters
Without monitoring, database problems are discovered by users, leading to poor experience, lost revenue, and damaged reputation. Monitoring enables proactive detection and resolution.
- Proactive Issue Detection: Identify problems before they affect users (e.g., slow queries, disk space exhaustion).
- Performance Optimization: Find and fix slow queries, missing indexes, and inefficient schema designs.
- Capacity Planning: Predict when additional resources (disk, memory, CPU) will be needed.
- SLA Compliance: Ensure database meets performance and availability commitments.
- Security Monitoring: Detect unusual access patterns or potential breaches.
- Troubleshooting: Quickly identify root causes of incidents using historical metrics.
- Cost Optimization: Right-size database resources based on actual usage patterns.
Essential Database Metrics
Query Performance Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Queries per second (QPS) | Number of queries executed per second | Sudden drop or spike |
| Slow query count | Number of queries exceeding threshold | > 10 per minute |
| Average query time | Mean execution time of queries | > 100ms (increase by 50%) |
| Lock wait time | Time transactions wait for locks | > 1 second |
| Deadlock count | Number of deadlocks detected | > 5 per hour |
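Rate-style metrics in this table (QPS, slow queries per minute) come from cumulative server counters, so a monitoring agent must convert two counter snapshots into a per-second rate. A minimal sketch in Python, with made-up counter values for illustration:

```python
def rate_per_second(prev_count, curr_count, interval_seconds):
    """Convert two cumulative counter snapshots into a per-second rate."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    # Counters reset to zero on server restart; treat a decrease as a reset.
    if curr_count < prev_count:
        return curr_count / interval_seconds
    return (curr_count - prev_count) / interval_seconds

# Example: 12,000 queries executed between two scrapes 60 seconds apart
qps = rate_per_second(prev_count=1_000_000, curr_count=1_012_000, interval_seconds=60)
print(qps)  # 200.0
```

The reset check matters in practice: without it, a database restart produces a huge negative rate and a false alert.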
Connection Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Active connections | Currently executing queries | > 80% of max |
| Idle connections | Open but not executing | > 100 (connection leak) |
| Connection rate | New connections per second | Sudden spike |
| Connection pool wait | Time waiting for connection | > 100ms |
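Applying the thresholds in this table amounts to a simple comparison against limits. A hypothetical sketch (the function name and default thresholds are illustrative, not from any monitoring library):

```python
def connection_alerts(active, idle, max_connections,
                      active_pct_threshold=0.8, idle_leak_threshold=100):
    """Return alert messages for the connection thresholds in the table above."""
    alerts = []
    if active > active_pct_threshold * max_connections:
        alerts.append(f"active connections ({active}) exceed "
                      f"{active_pct_threshold:.0%} of max ({max_connections})")
    if idle > idle_leak_threshold:
        alerts.append(f"{idle} idle connections -- possible connection leak")
    return alerts

print(connection_alerts(active=90, idle=150, max_connections=100))
```

A healthy system returns an empty list; here both checks fire.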
Resource Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| CPU utilization | CPU usage percentage | > 80% sustained |
| Memory usage | RAM consumption | > 90% |
| Disk I/O | Read/write operations per second | > 80% of max IOPS |
| Disk space | Available storage | < 20% free |
| Network throughput | Data sent/received per second | Sustained high usage |
Cache Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Cache hit ratio | Percentage of reads from cache | < 90% (PostgreSQL), < 95% (MySQL) |
| Buffer cache hit ratio | Data pages found in memory | < 95% |
| Index hit ratio | Index pages found in memory | < 90% |
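The cache hit ratio in this table is computed the same way as the PostgreSQL blks_hit / blks_read query shown in the next section: hits divided by total block requests. A small sketch:

```python
def cache_hit_ratio(blks_hit, blks_read):
    """Fraction of block reads served from the cache rather than disk."""
    total = blks_hit + blks_read
    if total == 0:
        return None  # no traffic yet -- the ratio is undefined
    return blks_hit / total

ratio = cache_hit_ratio(blks_hit=9_900, blks_read=100)
print(f"{ratio:.1%}")  # 99.0%
if ratio < 0.90:
    print("ALERT: cache hit ratio below 90%")
```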
Monitoring by Database
PostgreSQL Monitoring
-- pg_stat_database: Database-level statistics
SELECT datname, numbackends, xact_commit, xact_rollback, blks_hit, blks_read
FROM pg_stat_database;
-- Calculate cache hit ratio
SELECT datname,
blks_hit * 100.0 / (blks_hit + blks_read) AS cache_hit_ratio
FROM pg_stat_database
WHERE (blks_hit + blks_read) > 0;
-- pg_stat_statements: Query-level statistics (requires the extension;
-- shared_preload_libraries must include pg_stat_statements)
CREATE EXTENSION pg_stat_statements;
SELECT query, calls, total_exec_time, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
-- On PostgreSQL 12 and earlier, the columns are total_time and mean_time.
-- pg_stat_activity: Current connections and queries
SELECT pid, usename, application_name, state, query, query_start
FROM pg_stat_activity
WHERE state = 'active';
-- pg_locks: Lock monitoring
SELECT locktype, relation::regclass, mode, granted, pid
FROM pg_locks
WHERE NOT granted;
-- Long-running queries
SELECT pid, usename, query, now() - query_start AS duration
FROM pg_stat_activity
WHERE state = 'active'
AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;
-- Connection count
SELECT count(*) AS total_connections,
count(*) FILTER (WHERE state = 'active') AS active_connections,
count(*) FILTER (WHERE state = 'idle') AS idle_connections
FROM pg_stat_activity;
-- Table bloat (dead tuples awaiting vacuum)
SELECT schemaname, tablename, n_dead_tup, n_live_tup,
round(n_dead_tup * 100.0 / (n_live_tup + n_dead_tup), 2) AS dead_tuple_percent
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY dead_tuple_percent DESC;
MySQL Monitoring
-- Show global status
SHOW GLOBAL STATUS;
-- Show process list (current queries)
SHOW FULL PROCESSLIST;
-- Query performance
SELECT * FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
-- InnoDB metrics
SHOW ENGINE INNODB STATUS\G
-- Buffer pool hit ratio = 1 - (disk reads / total read requests)
-- (MySQL 5.7+; older versions expose these in information_schema.global_status)
SELECT 1 - (
    (SELECT variable_value FROM performance_schema.global_status
     WHERE variable_name = 'Innodb_buffer_pool_reads') /
    (SELECT variable_value FROM performance_schema.global_status
     WHERE variable_name = 'Innodb_buffer_pool_read_requests')
) AS innodb_buffer_pool_hit_ratio;
-- Connection count
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Max_used_connections';
-- Slow query count
SHOW STATUS LIKE 'Slow_queries';
SQL Server Monitoring
-- Query performance (Query Store)
SELECT TOP 10 q.query_id, qt.query_sql_text,
rs.avg_duration, rs.avg_cpu_time, rs.avg_logical_io_reads
FROM sys.query_store_query AS q
JOIN sys.query_store_query_text AS qt ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan AS p ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats AS rs ON p.plan_id = rs.plan_id
ORDER BY rs.avg_duration DESC;
-- Wait statistics
SELECT wait_type, wait_time_ms, waiting_tasks_count
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;
-- Missing indexes
SELECT * FROM sys.dm_db_missing_index_details;
-- Current locks
SELECT resource_type, request_mode, request_status, object_name(resource_associated_entity_id)
FROM sys.dm_tran_locks
WHERE resource_database_id = DB_ID();
Setting Up Alerts
Alerts notify administrators when metrics exceed thresholds or anomalies are detected. Effective alerting prevents issues from becoming incidents.
# Prometheus alerting rules
groups:
  - name: database_alerts
    rules:
      - alert: LongRunningTransaction
        expr: pg_stat_activity_max_tx_duration > 60
        for: 2m
        annotations:
          summary: "Transaction running longer than 60 seconds"
      - alert: LowCacheHitRatio
        expr: cache_hit_ratio < 0.9
        for: 5m
        annotations:
          summary: "Cache hit ratio below 90%"
      - alert: ConnectionLimit
        expr: pg_stat_database_numbackends > 80
        for: 1m
        annotations:
          summary: "Connection count approaching limit"
      - alert: DiskSpaceCritical
        expr: disk_free_bytes / disk_total_bytes < 0.2
        for: 5m
        annotations:
          summary: "Disk space below 20%"
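The for: clause in these rules means an alert fires only after its condition has held continuously for the given duration, which suppresses transient spikes. A rough sketch of that debounce logic (illustrative only, not Prometheus's actual implementation):

```python
class DebouncedAlert:
    """Fire only after a condition has held for `for_seconds` (like Prometheus `for:`)."""

    def __init__(self, for_seconds):
        self.for_seconds = for_seconds
        self.pending_since = None

    def evaluate(self, condition_true, now):
        if not condition_true:
            self.pending_since = None   # condition cleared -- reset the timer
            return False
        if self.pending_since is None:
            self.pending_since = now    # condition just became true
        return now - self.pending_since >= self.for_seconds

alert = DebouncedAlert(for_seconds=120)
print(alert.evaluate(True, now=0))    # False -- condition just became pending
print(alert.evaluate(True, now=60))   # False -- held only 60s of 120s
print(alert.evaluate(True, now=120))  # True  -- held the full window
```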
Monitoring Tools
| Tool | Type | Key Features | Cost |
|---|---|---|---|
| Prometheus + Grafana | Open source | Time-series DB, flexible dashboards, alerting | Free |
| Datadog | SaaS | Comprehensive, easy setup, many integrations | Paid |
| New Relic | SaaS | APM integration, query insights | Paid |
| SolarWinds Database Performance Analyzer | On-prem/SaaS | Deep query analysis, wait-based tuning | Paid |
| pg_stat_monitor (PostgreSQL) | Open source | Enhanced pg_stat_statements with query planning | Free |
| MySQL Enterprise Monitor | Commercial | MySQL-specific, query analyzer, advisors | Paid |
# Install postgres_exporter
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.13.0/postgres_exporter-0.13.0.linux-amd64.tar.gz
tar xzf postgres_exporter-0.13.0.linux-amd64.tar.gz
# Configure data source name
export DATA_SOURCE_NAME="postgresql://user:pass@localhost:5432/mydb?sslmode=disable"
# Run exporter
./postgres_exporter
# Prometheus configuration (prometheus.yml)
scrape_configs:
  - job_name: 'postgresql'
    static_configs:
      - targets: ['localhost:9187']
Common Monitoring Mistakes to Avoid
- Alert Fatigue: Too many noisy alerts cause teams to ignore critical ones. Alert on symptoms, not every metric change.
- No Baselines: Without historical baselines, you cannot detect anomalies. Collect metrics continuously.
- Monitoring Only Production: Monitor all environments (dev, staging, production) to catch issues early.
- No Actionable Alerts: Alerts should include possible causes and suggested actions.
- Ignoring Trends: Gradual degradation (slowly increasing query time) indicates capacity problems.
- No Dashboard: Without visualization, you cannot quickly assess system health.
- Monitoring Everything: Focus on metrics that matter. Too many metrics obscure real problems.
Monitoring Best Practices
- Establish Baselines: Collect normal operating metrics to detect anomalies.
- Set Meaningful Thresholds: Alert on business-impacting conditions, not every fluctuation.
- Monitor from Client Perspective: Measure query response time from application, not just database.
- Create Dashboards: Visualize key metrics for quick health assessment.
- Review Alerts Regularly: Tune thresholds and eliminate noisy alerts.
- Monitor Replication Lag: For replicated setups, track lag on replicas.
- Correlate Metrics: Combine database metrics with application and infrastructure metrics.
- Retain Historical Data: Keep metrics for trend analysis and post-incident investigation.
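Tracking percentiles, not just averages, is part of monitoring from the client perspective: a slow p99 can hide behind a healthy-looking mean. A nearest-rank percentile sketch:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the value at or below which pct% of samples fall."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

# 100 query times in ms: 95 fast queries and 5 slow outliers
times = [10] * 95 + [500] * 5
print(percentile(times, 50))  # 10  -- the median hides the outliers
print(percentile(times, 99))  # 500 -- p99 exposes the slow tail
```

With this distribution the mean is 34.5 ms, which looks fine; only the p99 reveals that some users wait half a second.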
Dashboard Sections:
1. Health Overview
- Database up/down
- Replication status
- Recent backups
2. Performance
- Queries per second
- Average query time
- Slow queries per minute
- Cache hit ratio
3. Connections
- Total connections
- Active vs idle
- Connection rate
4. Resources
- CPU utilization
- Memory usage
- Disk space and I/O
- Network throughput
5. Locks and Conflicts
- Lock wait time
- Deadlocks per hour
- Blocking queries
Frequently Asked Questions
- What is the most important database metric?
  Query response time from the application perspective. If queries are fast, other metrics are secondary. Track both averages and percentiles (p95, p99).
- How often should I collect metrics?
  Collect key metrics every 10-60 seconds. For high-traffic systems, every 1-5 seconds. Longer intervals miss transient spikes.
- What is a good cache hit ratio?
  PostgreSQL: > 99% for well-tuned systems. MySQL: > 95%. Lower ratios indicate insufficient memory for the working set.
- How do I find slow queries?
  Enable slow query logging (log_min_duration_statement, slow_query_log). Use pg_stat_statements, performance_schema, or Query Store to identify slow queries over time.
- What is normal CPU usage for a database?
  Average CPU below 50% is normal for steady state. Spikes to 80-100% during peak traffic may be acceptable. Sustained usage above 80% indicates a capacity issue.
- What should I learn next after database monitoring?
  After mastering monitoring, explore performance tuning, query optimization, capacity planning, and observability practices for complete database operations.
