Service Health Monitor: Automated Multi-Service Monitoring and Recovery
Learn how to create a comprehensive service monitoring system that checks, restarts, and reports on multiple Linux services. Master automated service recovery and health reporting.
In production environments, service failures can lead to downtime, data loss, and frustrated users. While external monitoring tools like Nagios, Zabbix, or Prometheus provide comprehensive monitoring, sometimes you need a lightweight, custom solution that can quickly detect and recover from service failures. A well-designed service health monitor can automatically restart failed services, generate reports, and alert administrators when manual intervention is required.
Understanding Service Monitoring
Why Service Monitoring is Critical
Service monitoring serves multiple purposes:
- Availability assurance - Ensure critical services are always running
- Automated recovery - Restart failed services without manual intervention
- Performance tracking - Monitor service health over time
- Incident prevention - Detect issues before they become critical
- Compliance requirements - Meet SLA and uptime requirements
Common Services to Monitor
Critical services that typically require monitoring:
- Web servers - nginx, apache2, httpd
- SSH services - sshd, ssh
- Database services - mysql, postgresql, mongodb
- Container services - docker, containerd
- Application services - custom applications, microservices
- System services - cron, rsyslog, systemd-resolved
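Unit names differ between distributions (for example apache2 on Debian-based systems and httpd on RHEL-based ones, or ssh versus sshd), so a monitor that runs on mixed hosts may need to resolve the name first. A minimal sketch, assuming systemd and a hypothetical helper named first_available_unit that is not part of the scripts in this article:

```bash
#!/bin/bash
# Hypothetical helper: print the first candidate unit that systemd can find.
first_available_unit() {
    local candidate
    for candidate in "$@"; do
        # "systemctl cat" exits non-zero when no unit file exists for the name
        if systemctl cat -- "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Example (commented out so the sketch has no side effects):
# web_service=$(first_available_unit nginx apache2 httpd) || echo "no web server unit found"
```

The helper only answers whether a unit exists; whether it is running is still decided by the health checks that follow.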
Basic Service Health Monitor
Simple Service Monitor Script
#!/bin/bash
# multi_service_monitor.sh

# List of services to monitor
services=("nginx" "sshd" "docker")

# Report Header
echo "-----------------------------------"
echo " Service Health Check Report"
echo "-----------------------------------"

# Loop through services
for service in "${services[@]}"; do
    if systemctl is-active --quiet "$service"; then
        echo "$service is ✅ RUNNING"
    else
        echo "$service is ❌ STOPPED"
        echo ""
        echo "Attempting to restart $service..."
        systemctl restart "$service" &> /dev/null

        # Check if restart was successful
        if systemctl is-active --quiet "$service"; then
            echo "$service has been ✅ restarted successfully."
        else
            echo "❌ Failed to restart $service. Manual intervention needed."
        fi
    fi
    echo "-----------------------------------"
done
Example Output
-----------------------------------
Service Health Check Report
-----------------------------------
nginx is ✅ RUNNING
-----------------------------------
sshd is ✅ RUNNING
-----------------------------------
docker is ❌ STOPPED
Attempting to restart docker...
docker has been ✅ restarted successfully.
-----------------------------------
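The basic script always finishes with exit status 0, so cron or a systemd timer cannot tell a clean run from one where a restart failed. The sketch below restates the check-and-restart loop as a function that returns non-zero if any service is still down afterwards; the name check_and_recover is an illustrative assumption, not part of the original script.

```bash
#!/bin/bash
# Illustrative helper: check each named service, restart the inactive ones,
# and return non-zero if any of them is still inactive afterwards.
check_and_recover() {
    local service unrecovered=0
    for service in "$@"; do
        if systemctl is-active --quiet "$service"; then
            echo "$service is RUNNING"
            continue
        fi
        echo "$service is STOPPED, attempting restart"
        systemctl restart "$service" >/dev/null 2>&1
        if systemctl is-active --quiet "$service"; then
            echo "$service restarted successfully"
        else
            echo "Failed to restart $service"
            unrecovered=$((unrecovered + 1))
        fi
    done
    # The function's status reflects whether everything is running now
    [ "$unrecovered" -eq 0 ]
}

# Example:
# check_and_recover nginx sshd docker || exit 1
```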
Advanced Service Health Monitor
Comprehensive Monitoring Script
#!/bin/bash
# advanced_service_monitor.sh
# Configuration
LOG_FILE="/var/log/service_monitor.log"
ALERT_EMAIL="admin@company.com"
MAX_RESTART_ATTEMPTS=3
RESTART_COOLDOWN=300 # 5 minutes
# Services to monitor with their dependencies
declare -A services=(
["nginx"]="network.target"
["sshd"]="network.target"
["docker"]="network.target"
["mysql"]="network.target"
["postgresql"]="network.target"
)
# Function to log actions
log_action() {
echo "[$(date)] $1" | tee -a "$LOG_FILE"
}
# Function to send alert
send_alert() {
local service="$1"
local message="$2"
echo "ALERT: $service - $message" | mail -s "Service Alert: $service" "$ALERT_EMAIL"
log_action "ALERT SENT: $service - $message"
}
# Function to check service health
check_service_health() {
local service="$1"
local dependency="$2"
# Check if service is active
if systemctl is-active --quiet "$service"; then
# Check if service is enabled
if systemctl is-enabled --quiet "$service"; then
log_action "HEALTHY: $service is running and enabled"
return 0
else
log_action "WARNING: $service is running but not enabled"
return 1
fi
else
log_action "CRITICAL: $service is not running"
return 2
fi
}
# Function to restart service
restart_service() {
local service="$1"
local dependency="$2"
local attempts=0
while [ $attempts -lt $MAX_RESTART_ATTEMPTS ]; do
attempts=$((attempts + 1))
log_action "RESTART ATTEMPT $attempts: $service"
# Check dependency first
if [ -n "$dependency" ] && [ "$dependency" != "network.target" ]; then
if ! systemctl is-active --quiet "$dependency"; then
log_action "DEPENDENCY CHECK: $dependency is not running, starting it first"
systemctl start "$dependency"
sleep 5
fi
fi
# Restart the service
systemctl restart "$service"
sleep 10
# Check if restart was successful
if systemctl is-active --quiet "$service"; then
log_action "SUCCESS: $service restarted successfully"
return 0
else
log_action "FAILED: $service restart attempt $attempts failed"
fi
done
log_action "CRITICAL: $service failed to restart after $MAX_RESTART_ATTEMPTS attempts"
send_alert "$service" "Failed to restart after $MAX_RESTART_ATTEMPTS attempts"
return 1
}
# Function to generate health report
generate_health_report() {
local report_file="/tmp/service_health_report_$(date +%Y%m%d_%H%M%S).txt"
echo "Service Health Report - $(date)" > "$report_file"
echo "=================================" >> "$report_file"
echo "" >> "$report_file"
local healthy_count=0
local warning_count=0
local critical_count=0
for service in "${!services[@]}"; do
local dependency="${services[$service]}"
if systemctl is-active --quiet "$service"; then
if systemctl is-enabled --quiet "$service"; then
echo "✅ $service: RUNNING (enabled)" >> "$report_file"
healthy_count=$((healthy_count + 1))
else
echo "⚠️ $service: RUNNING (disabled)" >> "$report_file"
warning_count=$((warning_count + 1))
fi
else
echo "❌ $service: STOPPED" >> "$report_file"
critical_count=$((critical_count + 1))
fi
done
echo "" >> "$report_file"
echo "Summary:" >> "$report_file"
echo " Healthy: $healthy_count" >> "$report_file"
echo " Warnings: $warning_count" >> "$report_file"
echo " Critical: $critical_count" >> "$report_file"
log_action "Health report generated: $report_file"
# Send report if there are critical issues
if [ $critical_count -gt 0 ]; then
mail -s "Service Health Report - Critical Issues" "$ALERT_EMAIL" < "$report_file"
fi
}
# Function to check system resources
check_system_resources() {
# Note: the second field of top's "Cpu(s)" line is user-space CPU time only, not total utilization
local cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
local memory_usage=$(free | awk 'NR==2{printf "%.2f", $3*100/$2}')
local disk_usage=$(df / | awk 'NR==2 {print $5}' | sed 's/%//')
log_action "SYSTEM RESOURCES: CPU: ${cpu_usage}%, Memory: ${memory_usage}%, Disk: ${disk_usage}%"
# Alert if resources are high
if (( $(echo "$cpu_usage > 90" | bc -l) )); then
send_alert "SYSTEM" "High CPU usage: ${cpu_usage}%"
fi
if (( $(echo "$memory_usage > 90" | bc -l) )); then
send_alert "SYSTEM" "High memory usage: ${memory_usage}%"
fi
if [ "$disk_usage" -gt 90 ]; then
send_alert "SYSTEM" "High disk usage: ${disk_usage}%"
fi
}
# Main monitoring function
main() {
    log_action "Starting service health monitoring"

    # Check system resources
    check_system_resources

    # Monitor each service
    for service in "${!services[@]}"; do
        local dependency="${services[$service]}"
        local status=0

        # check_service_health reports its result through its exit status
        # (0 = healthy, 1 = running but not enabled, 2 = not running) and
        # logs the details itself, so branch on the status, not on its output.
        check_service_health "$service" "$dependency" || status=$?

        case $status in
            0)
                # Service is healthy
                ;;
            1)
                # Service is running but not enabled (already logged)
                ;;
            2)
                # Service is not running
                log_action "CRITICAL: $service is not running, attempting restart"
                restart_service "$service" "$dependency"
                ;;
        esac
    done

    # Generate health report
    generate_health_report
    log_action "Service health monitoring completed"
}

# Run main function
main
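The configuration defines RESTART_COOLDOWN, but none of the functions above consult it. One way to honor it is to record when each service was last restarted and to skip further restarts until the cooldown has elapsed. A minimal sketch, assuming a writable state directory; STATE_DIR and both function names are hypothetical and not part of the original script.

```bash
#!/bin/bash
RESTART_COOLDOWN=${RESTART_COOLDOWN:-300}           # seconds, same default as the script
STATE_DIR=${STATE_DIR:-/var/lib/service_monitor}    # hypothetical location

# Returns 0 while the service was restarted less than
# RESTART_COOLDOWN seconds ago, 1 otherwise.
in_cooldown() {
    local service="$1" stamp last now
    stamp="$STATE_DIR/$service.last_restart"
    [ -f "$stamp" ] || return 1
    last=$(cat "$stamp")
    case "$last" in
        ''|*[!0-9]*) return 1 ;;   # ignore an empty or corrupt timestamp
    esac
    now=$(date +%s)
    [ $((now - last)) -lt "$RESTART_COOLDOWN" ]
}

# Remember when a restart of the service was last attempted.
record_restart() {
    local service="$1"
    mkdir -p "$STATE_DIR" && date +%s > "$STATE_DIR/$service.last_restart"
}

# Possible use inside the monitoring loop:
# if in_cooldown "$service"; then
#     log_action "SKIP: $service was restarted recently"
# else
#     record_restart "$service"
#     restart_service "$service" "$dependency"
# fi
```

Recording the attempt before the restart, rather than only on success, also keeps a service that cannot be recovered from triggering a new alert on every run.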
Service-Specific Monitoring
Web Server Monitoring
#!/bin/bash
# web_server_monitor.sh

# Minimal logger so this script runs on its own; swap in log_action from
# advanced_service_monitor.sh if you want file logging.
log_action() {
    echo "[$(date)] $1"
}

# Function to check web server health
check_web_server() {
    local service="$1"
    local port="$2"
    local url="$3"

    # Check if service is running
    if ! systemctl is-active --quiet "$service"; then
        log_action "CRITICAL: $service is not running"
        return 1
    fi

    # Check if port is listening (ss replaces the deprecated netstat)
    if ! ss -tuln | grep -q ":$port "; then
        log_action "CRITICAL: $service is not listening on port $port"
        return 1
    fi

    # Check HTTP response; --max-time keeps a hung server from blocking the check
    if [ -n "$url" ]; then
        local http_code
        http_code=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "$url")
        if [ "$http_code" != "200" ]; then
            log_action "WARNING: $service returned HTTP $http_code"
            return 1
        fi
    fi

    log_action "HEALTHY: $service is running and responding"
    return 0
}

# Monitor web servers (keep only the ones installed on this host)
check_web_server "nginx" "80" "http://localhost"
check_web_server "apache2" "80" "http://localhost"
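The check above treats anything other than HTTP 200 as a failure, which also flags sites that answer with a redirect (for example from http to https). If that is too strict for your setup, the status code can be classified by range instead. A sketch with a hypothetical helper, http_code_is_healthy, that accepts 2xx and 3xx responses:

```bash
#!/bin/bash
# Hypothetical helper: treat 2xx and 3xx responses as healthy.
# curl reports 000 when it could not get a response at all.
http_code_is_healthy() {
    case "$1" in
        2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (requires curl):
# code=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "http://localhost")
# http_code_is_healthy "$code" || echo "unhealthy response: $code"
```

Which ranges count as healthy is a policy decision; an API endpoint that should never redirect may be better served by the strict comparison.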
Database Service Monitoring
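#!/bin/bash
# database_monitor.sh

# Minimal logger so this script runs on its own; swap in log_action from
# advanced_service_monitor.sh if you want file logging.
log_action() {
    echo "[$(date)] $1"
}

# Function to check database health
check_database() {
    local service="$1"
    local port="$2"
    local user="$3"
    local password="$4"

    # Check if service is running
    if ! systemctl is-active --quiet "$service"; then
        log_action "CRITICAL: $service is not running"
        return 1
    fi

    # Check if port is listening (ss replaces the deprecated netstat)
    if ! ss -tuln | grep -q ":$port "; then
        log_action "CRITICAL: $service is not listening on port $port"
        return 1
    fi

    # Check database connection
    case "$service" in
        "mysql")
            # Note: -p on the command line can expose the password to other local users
            if ! mysql -u "$user" -p"$password" -e "SELECT 1;" &>/dev/null; then
                log_action "WARNING: $service connection test failed"
                return 1
            fi
            ;;
        "postgresql")
            # psql has no password option; it reads PGPASSWORD from the environment.
            # Connecting over TCP (-h) is an assumption so that password authentication applies.
            if ! PGPASSWORD="$password" psql -h 127.0.0.1 -p "$port" -U "$user" -d postgres -c "SELECT 1;" &>/dev/null; then
                log_action "WARNING: $service connection test failed"
                return 1
            fi
            ;;
    esac

    log_action "HEALTHY: $service is running and accessible"
    return 0
}

# Monitor databases (the passwords are placeholders; do not hardcode real credentials)
check_database "mysql" "3306" "root" "password"
check_database "postgresql" "5432" "postgres" "password"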
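Passing a password with -p on the command line can expose it to other local users through the process list, and hardcoding it in the script leaves it readable to anyone who can read the script. Both clients can take credentials from a file instead: mysql accepts an option file via --defaults-extra-file, and psql consults ~/.pgpass. The sketch below writes a MySQL option file that only its owner can read; the function name, the file path, the monitor account, and the MONITOR_DB_PASSWORD variable are assumptions for illustration.

```bash
#!/bin/bash
# Hypothetical helper: write a MySQL client option file with owner-only permissions.
# Passwords containing double quotes or backslashes would need extra escaping.
write_mysql_option_file() {
    local file="$1" user="$2" password="$3"
    (
        umask 077           # a file created in this subshell gets mode 600
        rm -f -- "$file"    # remove any older copy so the new mode applies
        printf '[client]\nuser=%s\npassword="%s"\n' "$user" "$password" > "$file"
    )
}

# Example (--defaults-extra-file has to be the first option on the mysql command line):
# write_mysql_option_file /root/.service_monitor.cnf monitor "$MONITOR_DB_PASSWORD"
# mysql --defaults-extra-file=/root/.service_monitor.cnf -e "SELECT 1;"
```

A dedicated low-privilege account for the health check is usually preferable to reusing root or postgres.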
Automation and Scheduling
Cron Job Setup
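# Open root's crontab for editing
sudo crontab -e
# Then add the following entries.
# Run the basic check every 5 minutes
*/5 * * * * /usr/local/bin/service_monitor.sh >> /var/log/service_monitor.log 2>&1
# Run the comprehensive check every hour. The advanced script already appends to
# /var/log/service_monitor.log through log_action, so discard stdout to avoid
# duplicate entries and append only stderr.
0 * * * * /usr/local/bin/advanced_service_monitor.sh > /dev/null 2>> /var/log/service_monitor.log
# Generate daily report at 08:00
0 8 * * * /usr/local/bin/generate_daily_report.sh >> /var/log/service_monitor.log 2>&1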
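The five-minute and hourly jobs both fire at minute 0, and a run that is still sleeping between restart attempts can overlap with the next one. A lock keeps two runs from restarting the same service at the same time. A minimal sketch using flock from util-linux; run_exclusive and the lock path are illustrative assumptions.

```bash
#!/bin/bash
# Illustrative wrapper: run a command only if no other run holds the lock.
run_exclusive() {
    local lock_file="$1"
    shift
    (
        # Take a non-blocking exclusive lock on file descriptor 9.
        # 75 is an arbitrary exit code chosen here to mean "already running".
        flock -n 9 || exit 75
        "$@"
    ) 9>"$lock_file"
}

# Example:
# run_exclusive /var/lock/service_monitor.lock /usr/local/bin/advanced_service_monitor.sh
```

The lock is released automatically when the subshell exits, so a crashed run does not leave a stale lock behind.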
Systemd Timer Alternative
# Create systemd service
sudo nano /etc/systemd/system/service-monitor.service
[Unit]
Description=Service Health Monitor
After=network.target
[Service]
Type=oneshot
ExecStart=/usr/local/bin/service_monitor.sh
User=root
StandardOutput=journal
StandardError=journal
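# Create systemd timer
sudo nano /etc/systemd/system/service-monitor.timer
[Unit]
Description=Run service monitor every 5 minutes
# No Requires= on the service: a timer activates the unit of the same name by
# default, and Requires= would also start the service whenever the timer starts.
[Timer]
OnCalendar=*:0/5
Persistent=true
[Install]
WantedBy=timers.target
# Reload systemd so it picks up the new unit files, then enable and start the timer
sudo systemctl daemon-reload
sudo systemctl enable service-monitor.timer
sudo systemctl start service-monitor.timer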
Monitoring and Alerting
Email Alerts
#!/bin/bash
# send_alert.sh
# Function to send email alert
send_alert() {
local service="$1"
local status="$2"
local message="$3"
local subject="Service Alert: $service - $status"
local body="Service: $service
Status: $status
Message: $message
Time: $(date)
Server: $(hostname)"
echo "$body" | mail -s "$subject" "$ALERT_EMAIL"
log_action "ALERT SENT: $service - $status"
}
# Function to send Slack alert
send_slack_alert() {
local service="$1"
local status="$2"
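local message="$3"
# Escape backslashes and double quotes so the message cannot break the JSON payload
message=${message//\\/\\\\}
message=${message//\"/\\\"}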
local payload="{
\"text\": \"🚨 Service Alert\",
\"attachments\": [{
\"color\": \"danger\",
\"fields\": [{
\"title\": \"Service\",
\"value\": \"$service\",
\"short\": true
}, {
\"title\": \"Status\",
\"value\": \"$status\",
\"short\": true
}, {
\"title\": \"Message\",
\"value\": \"$message\",
\"short\": false
}, {
\"title\": \"Time\",
\"value\": \"$(date)\",
\"short\": true
}, {
\"title\": \"Server\",
\"value\": \"$(hostname)\",
\"short\": true
}]
}]
}"
curl -X POST -H 'Content-type: application/json' \
--data "$payload" \
"$SLACK_WEBHOOK_URL"
}
Best Practices
1. Service Dependencies
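# Check service dependencies
check_dependencies() {
    local service="$1"
    local dep
    local -a dependencies

    # Read the unit's direct Requires= dependencies. Parsing the output of
    # "systemctl list-dependencies" would also pick up Wants= units and the
    # contents of expanded targets, many of which are legitimately inactive.
    read -r -a dependencies <<< "$(systemctl show -p Requires --value "$service")"

    for dep in "${dependencies[@]}"; do
        # "--" protects unit names that start with a dash, such as -.mount
        if ! systemctl is-active --quiet -- "$dep"; then
            log_action "WARNING: Dependency $dep is not running for $service"
            return 1
        fi
    done
    return 0
}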
2. Graceful Restart
# Function for graceful restart
graceful_restart() {
    local service="$1"

    # Try a reload first
    if systemctl reload "$service" 2>/dev/null; then
        log_action "SUCCESS: $service reloaded gracefully"
        return 0
    fi

    # Fall back to restart
    if systemctl restart "$service"; then
        log_action "SUCCESS: $service restarted"
        return 0
    fi

    log_action "FAILED: $service restart failed"
    return 1
}
3. Health Checks
# Function for comprehensive health check
comprehensive_health_check() {
local service="$1"
# Check service status
if ! systemctl is-active --quiet "$service"; then
return 1
fi
# Check service logs for errors
if journalctl -u "$service" --since "5 minutes ago" --no-pager | grep -qi error; then
log_action "WARNING: $service has errors in recent logs"
return 1
fi
# Check resource usage
local pid=$(systemctl show -p MainPID "$service" --value)
if [ -n "$pid" ] && [ "$pid" != "0" ]; then
local cpu_usage=$(ps -p "$pid" -o %cpu --no-headers)
local memory_usage=$(ps -p "$pid" -o %mem --no-headers)
if (( $(echo "$cpu_usage > 90" | bc -l) )); then
log_action "WARNING: $service has high CPU usage: ${cpu_usage}%"
return 1
fi
if (( $(echo "$memory_usage > 90" | bc -l) )); then
log_action "WARNING: $service has high memory usage: ${memory_usage}%"
return 1
fi
fi
return 0
}
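Grepping for the word "error" also matches harmless lines such as "errors: 0". journalctl can filter by syslog priority instead, which relies on the level the service itself assigned to each message. A sketch, assuming the service logs with meaningful priorities; recent_error_count is a hypothetical helper.

```bash
#!/bin/bash
# Hypothetical helper: count journal output lines of priority "err" or worse
# that the unit logged within the given time window. A multi-line message
# counts once per line.
recent_error_count() {
    local service="$1" since="${2:-5 minutes ago}"
    journalctl -u "$service" -p err --since "$since" --no-pager -q -o cat \
        | wc -l | tr -d ' '
}

# Example:
# if [ "$(recent_error_count nginx)" -gt 0 ]; then
#     echo "nginx logged errors in the last 5 minutes"
# fi
```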
Conclusion
Service health monitoring is essential for maintaining reliable Linux systems. A well-designed monitoring system includes:
- Comprehensive service checks - Monitor status, dependencies, and health
- Automated recovery - Restart failed services automatically
- Resource monitoring - Track system resources and service performance
- Alerting and reporting - Notify administrators of issues
- Logging and audit trails - Maintain records of all monitoring activities
Key takeaways:
- Monitor critical services - Focus on services that impact system availability
- Implement automated recovery - Restart failed services without manual intervention
- Set up proper alerting - Notify administrators when manual intervention is needed
- Track service dependencies - Ensure dependent services are running
- Monitor system resources - Prevent resource exhaustion from causing failures