Database Monitoring: Metrics, Alerts, and Performance Tuning
Database monitoring involves tracking performance metrics, query execution times, resource utilization, and error rates. It helps identify bottlenecks, predict capacity needs, and detect issues before they affect users.
Database monitoring is the practice of continuously observing database performance, health, and resource utilization to ensure optimal operation. It involves collecting metrics, analyzing trends, detecting anomalies, and alerting on issues before they impact users. Without proper monitoring, database problems go unnoticed until users complain, by which time the issue may have already caused significant damage.
Effective database monitoring answers critical questions: Is the database healthy? Are queries performing within acceptable limits? Is there enough capacity for growth? Are there bottlenecks that need attention? To understand monitoring properly, it is helpful to be familiar with SQL optimization, database indexing, and database performance tuning.
┌─────────────────────────────────────────────────────────────┐
│ Database Monitoring │
├─────────────────────────────────────────────────────────────┤
│ │
│ Key Metrics: Alert Categories: │
│ • Query performance • Performance degradation │
│ • Connection count • Connection pool issues │
│ • Cache hit ratio • Disk space critical │
│ • Lock contention • Replication lag │
│ • Replication lag • Deadlock frequency │
│ • CPU / Memory / Disk • Long-running queries │
│ │
│ Monitoring Tools: │
│ • pg_stat_statements (PG) • Performance Schema (MySQL) │
│ • Query Store (SQL Server) • Prometheus + Grafana │
│ • Datadog, New Relic, SolarWinds │
│ │
└─────────────────────────────────────────────────────────────┘
What Is Database Monitoring
Database monitoring is the process of collecting, analyzing, and alerting on database performance and health metrics. It provides visibility into how the database is performing, identifies bottlenecks, and helps predict future capacity needs. Monitoring is essential for maintaining service level agreements (SLAs), preventing outages, and optimizing performance.
- Metrics Collection: Gathering quantitative data about database operation (queries per second, connection count, cache hit ratio).
- Alerting: Notifying administrators when metrics exceed thresholds or anomalies are detected.
- Trend Analysis: Tracking metrics over time to identify patterns and predict future needs.
- Capacity Planning: Using historical data to forecast when resources will need to be increased.
- Root Cause Analysis: Investigating past incidents using historical monitoring data.
Why Database Monitoring Matters
Without monitoring, database problems are discovered by users, leading to poor experience, lost revenue, and damaged reputation. Monitoring enables proactive detection and resolution.
- Proactive Issue Detection: Identify problems before they affect users (e.g., slow queries, disk space exhaustion).
- Performance Optimization: Find and fix slow queries, missing indexes, and inefficient schema designs.
- Capacity Planning: Predict when additional resources (disk, memory, CPU) will be needed.
- SLA Compliance: Ensure database meets performance and availability commitments.
- Security Monitoring: Detect unusual access patterns or potential breaches.
- Troubleshooting: Quickly identify root causes of incidents using historical metrics.
- Cost Optimization: Right-size database resources based on actual usage patterns.
Essential Database Metrics
Query Performance Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Queries per second (QPS) | Number of queries executed per second | Sudden drop or spike |
| Slow query count | Number of queries exceeding threshold | > 10 per minute |
| Average query time | Mean execution time of queries | > 100ms (increase by 50%) |
| Lock wait time | Time transactions wait for locks | > 1 second |
| Deadlock count | Number of deadlocks detected | > 5 per hour |
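Rate-style metrics in this table (QPS, slow queries per minute) come from cumulative server counters, so a monitoring agent must convert two counter snapshots into a per-second rate. A minimal sketch in Python, with made-up counter values for illustration:

```python
def rate_per_second(prev_count, curr_count, interval_seconds):
    """Convert two cumulative counter snapshots into a per-second rate."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    # Counters reset to zero on server restart; treat a decrease as a reset.
    if curr_count < prev_count:
        return curr_count / interval_seconds
    return (curr_count - prev_count) / interval_seconds

# Example: 12,000 queries executed between two scrapes 60 seconds apart
qps = rate_per_second(prev_count=1_000_000, curr_count=1_012_000, interval_seconds=60)
print(qps)  # 200.0
```

The reset check matters in practice: without it, a database restart produces a huge negative rate and a false alert.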
Connection Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Active connections | Currently executing queries | > 80% of max |
| Idle connections | Open but not executing | > 100 (connection leak) |
| Connection rate | New connections per second | Sudden spike |
| Connection pool wait | Time waiting for connection | > 100ms |
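Applying the thresholds in this table amounts to a simple comparison against limits. A hypothetical sketch (the function name and default thresholds are illustrative, not from any monitoring library):

```python
def connection_alerts(active, idle, max_connections,
                      active_pct_threshold=0.8, idle_leak_threshold=100):
    """Return alert messages for the connection thresholds in the table above."""
    alerts = []
    if active > active_pct_threshold * max_connections:
        alerts.append(f"active connections ({active}) exceed "
                      f"{active_pct_threshold:.0%} of max ({max_connections})")
    if idle > idle_leak_threshold:
        alerts.append(f"{idle} idle connections -- possible connection leak")
    return alerts

print(connection_alerts(active=90, idle=150, max_connections=100))
```

A healthy system returns an empty list; here both checks fire.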
Resource Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| CPU utilization | CPU usage percentage | > 80% sustained |
| Memory usage | RAM consumption | > 90% |
| Disk I/O | Read/write operations per second | > 80% of max IOPS |
| Disk space | Available storage | < 20% free |
| Network throughput | Data sent/received per second | Sustained high usage |
Cache Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Cache hit ratio | Percentage of reads from cache | < 90% (PostgreSQL), < 95% (MySQL) |
| Buffer cache hit ratio | Data pages found in memory | < 95% |
| Index hit ratio | Index pages found in memory | < 90% |
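The cache hit ratio in this table is computed the same way as the PostgreSQL blks_hit / blks_read query shown in the next section: hits divided by total block requests. A small sketch:

```python
def cache_hit_ratio(blks_hit, blks_read):
    """Fraction of block reads served from the cache rather than disk."""
    total = blks_hit + blks_read
    if total == 0:
        return None  # no traffic yet -- the ratio is undefined
    return blks_hit / total

ratio = cache_hit_ratio(blks_hit=9_900, blks_read=100)
print(f"{ratio:.1%}")  # 99.0%
if ratio < 0.90:
    print("ALERT: cache hit ratio below 90%")
```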
Monitoring by Database
PostgreSQL Monitoring
-- pg_stat_database: Database-level statistics
SELECT datname, numbackends, xact_commit, xact_rollback, blks_hit, blks_read
FROM pg_stat_database;
-- Calculate cache hit ratio
SELECT datname,
blks_hit * 100.0 / (blks_hit + blks_read) AS cache_hit_ratio
FROM pg_stat_database
WHERE (blks_hit + blks_read) > 0;
-- pg_stat_statements: Query-level statistics (requires the extension;
-- shared_preload_libraries must include pg_stat_statements)
CREATE EXTENSION pg_stat_statements;
SELECT query, calls, total_exec_time, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
-- On PostgreSQL 12 and earlier, the columns are total_time and mean_time.
-- pg_stat_activity: Current connections and queries
SELECT pid, usename, application_name, state, query, query_start
FROM pg_stat_activity
WHERE state = 'active';
-- pg_locks: Lock monitoring
SELECT locktype, relation::regclass, mode, granted, pid
FROM pg_locks
WHERE NOT granted;
-- Long-running queries
SELECT pid, usename, query, now() - query_start AS duration
FROM pg_stat_activity
WHERE state = 'active'
AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;
-- Connection count
SELECT count(*) AS total_connections,
count(*) FILTER (WHERE state = 'active') AS active_connections,
count(*) FILTER (WHERE state = 'idle') AS idle_connections
FROM pg_stat_activity;
-- Table bloat (dead tuples awaiting vacuum)
SELECT schemaname, tablename, n_dead_tup, n_live_tup,
round(n_dead_tup * 100.0 / (n_live_tup + n_dead_tup), 2) AS dead_tuple_percent
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY dead_tuple_percent DESC;
MySQL Monitoring
-- Show global status
SHOW GLOBAL STATUS;
-- Show process list (current queries)
SHOW FULL PROCESSLIST;
-- Query performance
SELECT * FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
-- InnoDB metrics
SHOW ENGINE INNODB STATUS\G
-- Buffer pool hit ratio = 1 - (disk reads / total read requests)
-- (MySQL 5.7+; older versions expose these in information_schema.global_status)
SELECT 1 - (
    (SELECT variable_value FROM performance_schema.global_status
     WHERE variable_name = 'Innodb_buffer_pool_reads') /
    (SELECT variable_value FROM performance_schema.global_status
     WHERE variable_name = 'Innodb_buffer_pool_read_requests')
) AS innodb_buffer_pool_hit_ratio;
-- Connection count
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Max_used_connections';
-- Slow query count
SHOW STATUS LIKE 'Slow_queries';
SQL Server Monitoring
-- Query performance (Query Store)
SELECT TOP 10 q.query_id, qt.query_sql_text,
rs.avg_duration, rs.avg_cpu_time, rs.avg_logical_io_reads
FROM sys.query_store_query AS q
JOIN sys.query_store_query_text AS qt ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan AS p ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats AS rs ON p.plan_id = rs.plan_id
ORDER BY rs.avg_duration DESC;
-- Wait statistics
SELECT wait_type, wait_time_ms, waiting_tasks_count
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;
-- Missing indexes
SELECT * FROM sys.dm_db_missing_index_details;
-- Current locks
SELECT resource_type, request_mode, request_status, object_name(resource_associated_entity_id)
FROM sys.dm_tran_locks
WHERE resource_database_id = DB_ID();
Setting Up Alerts
Alerts notify administrators when metrics exceed thresholds or anomalies are detected. Effective alerting prevents issues from becoming incidents.
# Prometheus alerting rules
groups:
  - name: database_alerts
    rules:
      - alert: LongRunningTransaction
        expr: pg_stat_activity_max_tx_duration > 60
        for: 2m
        annotations:
          summary: "Transaction running longer than 60 seconds"
      - alert: LowCacheHitRatio
        expr: cache_hit_ratio < 0.9
        for: 5m
        annotations:
          summary: "Cache hit ratio below 90%"
      - alert: ConnectionLimit
        expr: pg_stat_database_numbackends > 80
        for: 1m
        annotations:
          summary: "Connection count approaching limit"
      - alert: DiskSpaceCritical
        expr: disk_free_bytes / disk_total_bytes < 0.2
        for: 5m
        annotations:
          summary: "Disk space below 20%"
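The for: clause in these rules means an alert fires only after its condition has held continuously for the given duration, which suppresses transient spikes. A rough sketch of that debounce logic (illustrative only, not Prometheus's actual implementation):

```python
class DebouncedAlert:
    """Fire only after a condition has held for `for_seconds` (like Prometheus `for:`)."""

    def __init__(self, for_seconds):
        self.for_seconds = for_seconds
        self.pending_since = None

    def evaluate(self, condition_true, now):
        if not condition_true:
            self.pending_since = None   # condition cleared -- reset the timer
            return False
        if self.pending_since is None:
            self.pending_since = now    # condition just became true
        return now - self.pending_since >= self.for_seconds

alert = DebouncedAlert(for_seconds=120)
print(alert.evaluate(True, now=0))    # False -- condition just became pending
print(alert.evaluate(True, now=60))   # False -- held only 60s of 120s
print(alert.evaluate(True, now=120))  # True  -- held the full window
```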
Monitoring Tools
| Tool | Type | Key Features | Cost |
|---|---|---|---|
| Prometheus + Grafana | Open source | Time-series DB, flexible dashboards, alerting | Free |
| Datadog | SaaS | Comprehensive, easy setup, many integrations | Paid |
| New Relic | SaaS | APM integration, query insights | Paid |
| SolarWinds Database Performance Analyzer | On-prem/SaaS | Deep query analysis, wait-based tuning | Paid |
| pg_stat_monitor (PostgreSQL) | Open source | Enhanced pg_stat_statements with query planning | Free |
| MySQL Enterprise Monitor | Commercial | MySQL-specific, query analyzer, advisors | Paid |
# Install postgres_exporter
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.13.0/postgres_exporter-0.13.0.linux-amd64.tar.gz
tar xzf postgres_exporter-0.13.0.linux-amd64.tar.gz
# Configure data source name
export DATA_SOURCE_NAME="postgresql://user:pass@localhost:5432/mydb?sslmode=disable"
# Run exporter
./postgres_exporter
# Prometheus configuration (prometheus.yml)
scrape_configs:
  - job_name: 'postgresql'
    static_configs:
      - targets: ['localhost:9187']
Common Monitoring Mistakes to Avoid
- Alert Fatigue: Too many noisy alerts cause teams to ignore critical ones. Alert on symptoms, not every metric change.
- No Baselines: Without historical baselines, you cannot detect anomalies. Collect metrics continuously.
- Monitoring Only Production: Monitor all environments (dev, staging, production) to catch issues early.
- No Actionable Alerts: Alerts should include possible causes and suggested actions.
- Ignoring Trends: Gradual degradation (slowly increasing query time) indicates capacity problems.
- No Dashboard: Without visualization, you cannot quickly assess system health.
- Monitoring Everything: Focus on metrics that matter. Too many metrics obscure real problems.
Monitoring Best Practices
- Establish Baselines: Collect normal operating metrics to detect anomalies.
- Set Meaningful Thresholds: Alert on business-impacting conditions, not every fluctuation.
- Monitor from Client Perspective: Measure query response time from application, not just database.
- Create Dashboards: Visualize key metrics for quick health assessment.
- Review Alerts Regularly: Tune thresholds and eliminate noisy alerts.
- Monitor Replication Lag: For replicated setups, track lag on replicas.
- Correlate Metrics: Combine database metrics with application and infrastructure metrics.
- Retain Historical Data: Keep metrics for trend analysis and post-incident investigation.
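Tracking percentiles, not just averages, is part of monitoring from the client perspective: a slow p99 can hide behind a healthy-looking mean. A nearest-rank percentile sketch:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the value at or below which pct% of samples fall."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

# 100 query times in ms: 95 fast queries and 5 slow outliers
times = [10] * 95 + [500] * 5
print(percentile(times, 50))  # 10  -- the median hides the outliers
print(percentile(times, 99))  # 500 -- p99 exposes the slow tail
```

With this distribution the mean is 34.5 ms, which looks fine; only the p99 reveals that some users wait half a second.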
Dashboard Sections:
1. Health Overview
- Database up/down
- Replication status
- Recent backups
2. Performance
- Queries per second
- Average query time
- Slow queries per minute
- Cache hit ratio
3. Connections
- Total connections
- Active vs idle
- Connection rate
4. Resources
- CPU utilization
- Memory usage
- Disk space and I/O
- Network throughput
5. Locks and Conflicts
- Lock wait time
- Deadlocks per hour
- Blocking queries
Frequently Asked Questions
- What is the most important database metric?
  Query response time from the application perspective. If queries are fast, other metrics are secondary. Track both averages and percentiles (p95, p99).
- How often should I collect metrics?
  Collect key metrics every 10-60 seconds. For high-traffic systems, every 1-5 seconds. Longer intervals miss transient spikes.
- What is a good cache hit ratio?
  PostgreSQL: > 99% for well-tuned systems. MySQL: > 95%. Lower ratios indicate insufficient memory for the working set.
- How do I find slow queries?
  Enable slow query logging (log_min_duration_statement, slow_query_log). Use pg_stat_statements, performance_schema, or Query Store to identify slow queries over time.
- What is normal CPU usage for a database?
  Average CPU below 50% is normal for steady state. Spikes to 80-100% during peak traffic may be acceptable. Sustained usage above 80% indicates a capacity issue.
- What should I learn next after database monitoring?
  After mastering monitoring, explore performance tuning, query optimization, capacity planning, and observability practices for complete database operations.
