Analytics

Real-Time Server Health Dashboard with Alerts

Monitor CPU, memory, disk, and network usage across multiple servers with instant alerts and historical trending.

Intermediate
15 minutes
Published Jan 25, 2024

Server Health Dashboard

Know exactly what's happening on your servers. Get alerts before users do.

What You'll Build

A monitoring dashboard that tracks:

  • CPU, memory, disk, and network usage
  • Process-level resource consumption
  • Historical trends and anomaly detection
  • Instant Telegram alerts for issues

Requirements

  • Plugins: Web Search (for API calls)
  • Time: 15 minutes
  • Server Access: SSH access or a monitoring agent installed on each server

Setup

1. Add Servers to Monitor

Monitor these servers:
- prod-web-1 (192.168.1.10)
- prod-api-1 (192.168.1.11)
- prod-db-1 (192.168.1.12)
Check every 1 minute.

2. Configure Alert Thresholds

Alert on Telegram if:
- CPU >80% for 5 minutes
- Memory >90% for 3 minutes
- Disk >85% on any partition
- Network errors detected
- Any server unreachable
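The "for 5 minutes" and "for 3 minutes" conditions mean an alert should fire only when a metric stays above its limit for the whole window, not on a single spike. Here is a minimal sketch of that check, assuming one reading per minute. The `Rule` class, the metric names, and `is_breached` are hypothetical and are not part of Claws:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    metric: str           # hypothetical metric name, e.g. "cpu"
    threshold: float      # percent
    sustain_minutes: int  # how long the breach must last before alerting


# Thresholds copied from the prompt above; disk has no duration, so 1 sample is enough.
RULES = [
    Rule("cpu", 80.0, 5),
    Rule("memory", 90.0, 3),
    Rule("disk", 85.0, 1),
]


def is_breached(rule: Rule, samples: list[float]) -> bool:
    """Return True if the most recent `sustain_minutes` samples all exceed the threshold.

    `samples` holds one reading per minute, oldest first.
    """
    window = samples[-rule.sustain_minutes:]
    if len(window) < rule.sustain_minutes:
        return False  # not enough history yet to call it sustained
    return all(value > rule.threshold for value in window)
```

Requiring every sample in the window to exceed the limit keeps a brief spike from paging anyone. The trade-off is that a single dip below the threshold resets the alert.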

3. Set Up Historical Tracking

Store metrics for:
- Last 24 hours (1-minute intervals)
- Last 7 days (5-minute intervals)
- Last 30 days (hourly averages)
Generate weekly performance reports.
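The three retention tiers amount to downsampling: raw 1-minute samples are averaged into coarser buckets as they age. Here is a rough sketch of that step, assuming samples are stored as `(unix_timestamp, value)` pairs. The `downsample` function is illustrative and does not describe how Claws stores metrics:

```python
from statistics import mean


def downsample(samples: list[tuple[int, float]], bucket_seconds: int) -> list[tuple[int, float]]:
    """Average (unix_timestamp, value) samples into fixed-width time buckets.

    Each output timestamp is the start of its bucket.
    """
    buckets: dict[int, list[float]] = {}
    for ts, value in samples:
        bucket_start = ts - (ts % bucket_seconds)
        buckets.setdefault(bucket_start, []).append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]
```

With this shape, `downsample(one_minute_samples, 300)` would produce the 5-minute series and `downsample(one_minute_samples, 3600)` the hourly averages.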

4. Process Monitoring

Track top processes by:
- CPU usage
- Memory usage
- Open file descriptors
- Network connections
Alert if any process uses >50% CPU unexpectedly.
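The recipe does not define what "unexpectedly" means. One possible reading is "above 50% CPU and not on a list of processes known to be heavy". The sketch below ranks processes and applies that reading. `ProcessSample`, `top_by`, and `unexpected_cpu_hogs` are hypothetical names, and collecting the samples (for example over SSH) is left out:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessSample:
    pid: int
    name: str
    cpu_percent: float
    memory_percent: float


def top_by(processes: list[ProcessSample], field: str, limit: int = 3) -> list[ProcessSample]:
    """Return the `limit` processes with the highest value of `field`."""
    return sorted(processes, key=lambda p: getattr(p, field), reverse=True)[:limit]


def unexpected_cpu_hogs(
    processes: list[ProcessSample],
    expected_names: set[str],
    cpu_limit: float = 50.0,
) -> list[ProcessSample]:
    """Return processes above `cpu_limit` whose names are not in the expected set."""
    return [
        p for p in processes
        if p.cpu_percent > cpu_limit and p.name not in expected_names
    ]
```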

Sample Alert

High CPU:

🚨 SERVER ALERT
Server: prod-api-1
Issue: High CPU usage
Current: 87% (5-minute avg)
Normal: 35-50%

🔝 Top Processes:
1. node (PID 1234): 52% CPU
2. postgres (PID 5678): 18% CPU
3. nginx (PID 9012): 9% CPU

💡 Possible causes:
- Traffic spike (check logs)
- Inefficient query
- Background job running

[View detailed metrics →]

Recovery:

✅ ALERT RESOLVED
Server: prod-api-1
CPU: 87% → 42%
Duration: 12 minutes

Root cause analysis:
- Traffic spike from API endpoint /search
- Query optimization deployed
- CPU back to normal

[View incident timeline →]
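The agent presumably delivers these messages itself. If you wanted to send a similar alert from your own script, the Telegram Bot API's `sendMessage` method takes a `chat_id` and a `text`. The sketch below uses only the standard library. The bot token and chat ID are placeholders you would supply, and `format_alert`, `build_telegram_request`, and `send_alert` are hypothetical helper names:

```python
import urllib.parse
import urllib.request


def format_alert(server: str, issue: str, current: str, normal: str) -> str:
    """Build a plain-text alert in the same layout as the sample above."""
    return "\n".join([
        "🚨 SERVER ALERT",
        f"Server: {server}",
        f"Issue: {issue}",
        f"Current: {current}",
        f"Normal: {normal}",
    ])


def build_telegram_request(bot_token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Prepare a POST to the Bot API sendMessage method without sending it."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(url, data=data, method="POST")


def send_alert(bot_token: str, chat_id: str, text: str) -> bool:
    """Send the message; returns True on HTTP 200. Requires network access."""
    request = build_telegram_request(bot_token, chat_id, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status == 200
```

Keeping request construction separate from sending makes the message format easy to check without touching the network.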

Sample Weekly Report

📊 Server Health Report: Feb 5-11

🖥️ prod-web-1
Uptime: 99.98%
Avg CPU: 38%
Avg Memory: 62%
Disk: 45% used
Incidents: 0

🖥️ prod-api-1
Uptime: 99.85%
Avg CPU: 52%
Avg Memory: 71%
Disk: 58% used
Incidents: 2 (both auto-resolved)

🖥️ prod-db-1
Uptime: 100%
Avg CPU: 28%
Avg Memory: 78%
Disk: 67% used
Incidents: 0

🎯 Recommendations:
- prod-api-1: Consider scaling to 2 instances (CPU trending up)
- prod-db-1: Memory usage stable but near threshold, keep monitoring
- All servers: Disk usage healthy

Next review: Feb 18, 2024
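To put the uptime figures in perspective: with a check every minute, a week contains 10,080 checks, so each failed check costs roughly 0.01 percentage points. The sketch below shows that arithmetic. It assumes uptime is the share of successful checks, which the recipe does not state, and `uptime_percent` is a hypothetical helper:

```python
CHECKS_PER_WEEK = 7 * 24 * 60  # one check per minute


def uptime_percent(failed_checks: int, total_checks: int = CHECKS_PER_WEEK) -> float:
    """Share of successful checks, as a percentage rounded to two decimals."""
    if total_checks <= 0:
        raise ValueError("total_checks must be positive")
    return round(100 * (total_checks - failed_checks) / total_checks, 2)
```

Under that assumption, about 15 failed checks in a week would correspond to 99.85% uptime.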

Pro Tips

  1. Predictive Alerts: Use ML to detect anomalies before thresholds are hit
  2. Log Correlation: When CPU spikes, auto-pull relevant application logs
  3. Auto-Scaling: Trigger cloud provider API to scale up during high load
  4. Cost Tracking: Monitor cloud costs alongside performance metrics
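For the Predictive Alerts tip, a full ML model is not needed to get started. The sketch below swaps in a plain z-score test, which flags a reading that sits unusually far from the recent baseline even when it is still under the fixed threshold. The `is_anomalous` function and the limit of 3 standard deviations are assumptions, not something the recipe prescribes:

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], current: float, z_limit: float = 3.0) -> bool:
    """Return True if `current` is more than `z_limit` standard deviations from the history mean."""
    if len(history) < 2:
        return False  # stdev needs at least two points
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is a deviation.
        return current != baseline
    return abs(current - baseline) / spread > z_limit
```

For a server that normally runs at 35-50% CPU, a reading of 70% would be flagged by this test although it is below the 80% alert threshold.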

Tags: #monitoring #DevOps #infrastructure #alerts