Service Health Monitor: Automated Multi-Service Monitoring and Recovery

In production environments, service failures can lead to downtime, data loss, and frustrated users. While external monitoring tools like Nagios, Zabbix, or Prometheus provide comprehensive monitoring, sometimes you need a lightweight, custom solution that can quickly detect and recover from service failures. A well-designed service health monitor can automatically restart failed services, generate reports, and alert administrators when manual intervention is required.

Understanding Service Monitoring

Why Service Monitoring is Critical

Service monitoring serves multiple purposes:

Availability assurance - Ensure critical services are always running
Automated recovery - Restart failed services without manual intervention
Performance tracking - Monitor service health over time
Incident prevention - Detect issues before they become critical
Compliance requirements - Meet SLA and uptime requirements

Service monitoring is not just about detecting failures—it's about implementing proactive measures that maintain system reliability and minimize downtime.

Common Services to Monitor

Critical services that typically require monitoring:

Web servers - nginx, apache2, httpd
SSH services - sshd, ssh
Database services - mysql, postgresql, mongodb
Container services - docker, containerd
Application services - custom applications, microservices
System services - cron, rsyslog, systemd-resolved
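
Unit names for the same software differ between distributions: Apache is apache2 on Debian-family systems and httpd on Red Hat-family systems, and the SSH daemon may be registered as ssh or sshd. The sketch below picks whichever candidate name exists on the current host. The helper names resolve_unit and unit_exists are hypothetical, and the sketch assumes that systemctl cat exits non-zero for a unit systemd does not know.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the scripts in this article.

# Return 0 if systemd knows a unit with this name.
# Assumption: `systemctl cat` fails for units that do not exist.
unit_exists() {
    systemctl cat "$1" &>/dev/null
}

# Print the first candidate name that exists on this host; return 1 if none does.
resolve_unit() {
    local candidate
    for candidate in "$@"; do
        if unit_exists "$candidate"; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Example:
#   web_unit=$(resolve_unit apache2 httpd) || echo "no Apache unit found"
```

The resolved name can then be placed in the list of services to monitor instead of a hard-coded one.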

Basic Service Health Monitor

Simple Service Monitor Script

#!/bin/bash
# multi_service_monitor.sh

# List of services to monitor
services=("nginx" "sshd" "docker")

# Report Header
echo "-----------------------------------"
echo "  Service Health Check Report"
echo "-----------------------------------"

# Loop through services
for service in "${services[@]}"; do
  if systemctl is-active --quiet "$service"; then
    echo "$service is ✅ RUNNING"
  else
    echo "$service is ❌ STOPPED"
    echo ""
    echo "Attempting to restart $service..."

    systemctl restart "$service" &> /dev/null

    # Check if restart was successful
    if systemctl is-active --quiet "$service"; then
      echo "$service has been ✅ restarted successfully."
    else
      echo "❌ Failed to restart $service. Manual intervention needed."
    fi
  fi
  echo "-----------------------------------"
done

Example Output

-----------------------------------
  Service Health Check Report
-----------------------------------
nginx is ✅ RUNNING
-----------------------------------
sshd is ✅ RUNNING
-----------------------------------
docker is ❌ STOPPED

Attempting to restart docker...
docker has been ✅ restarted successfully.
-----------------------------------

Advanced Service Health Monitor

Comprehensive Monitoring Script

#!/bin/bash
# advanced_service_monitor.sh

# Configuration
LOG_FILE="/var/log/service_monitor.log"
ALERT_EMAIL="admin@company.com"
MAX_RESTART_ATTEMPTS=3
RESTART_COOLDOWN=300  # 5 minutes; defined here but not yet enforced by restart_service

# Services to monitor with their dependencies
declare -A services=(
    ["nginx"]="network.target"
    ["sshd"]="network.target"
    ["docker"]="network.target"
    ["mysql"]="network.target"
    ["postgresql"]="network.target"
)

# Function to log actions
log_action() {
    echo "[$(date)] $1" | tee -a "$LOG_FILE"
}

# Function to send alert
send_alert() {
    local service="$1"
    local message="$2"
    
    echo "ALERT: $service - $message" | mail -s "Service Alert: $service" "$ALERT_EMAIL"
    log_action "ALERT SENT: $service - $message"
}

# Function to check service health
check_service_health() {
    local service="$1"
    local dependency="$2"
    
    # Check if service is active
    if systemctl is-active --quiet "$service"; then
        # Check if service is enabled
        if systemctl is-enabled --quiet "$service"; then
            log_action "HEALTHY: $service is running and enabled"
            return 0
        else
            log_action "WARNING: $service is running but not enabled"
            return 1
        fi
    else
        log_action "CRITICAL: $service is not running"
        return 2
    fi
}

# Function to restart service
restart_service() {
    local service="$1"
    local dependency="$2"
    local attempts=0
    
    while [ $attempts -lt $MAX_RESTART_ATTEMPTS ]; do
        attempts=$((attempts + 1))
        log_action "RESTART ATTEMPT $attempts: $service"
        
        # Check dependency first
        if [ -n "$dependency" ] && [ "$dependency" != "network.target" ]; then
            if ! systemctl is-active --quiet "$dependency"; then
                log_action "DEPENDENCY CHECK: $dependency is not running, starting it first"
                systemctl start "$dependency"
                sleep 5
            fi
        fi
        
        # Restart the service
        systemctl restart "$service"
        sleep 10
        
        # Check if restart was successful
        if systemctl is-active --quiet "$service"; then
            log_action "SUCCESS: $service restarted successfully"
            return 0
        else
            log_action "FAILED: $service restart attempt $attempts failed"
        fi
    done
    
    log_action "CRITICAL: $service failed to restart after $MAX_RESTART_ATTEMPTS attempts"
    send_alert "$service" "Failed to restart after $MAX_RESTART_ATTEMPTS attempts"
    return 1
}

# Function to generate health report
generate_health_report() {
    local report_file="/tmp/service_health_report_$(date +%Y%m%d_%H%M%S).txt"
    
    echo "Service Health Report - $(date)" > "$report_file"
    echo "=================================" >> "$report_file"
    echo "" >> "$report_file"
    
    local healthy_count=0
    local warning_count=0
    local critical_count=0
    
    for service in "${!services[@]}"; do
        local dependency="${services[$service]}"
        
        if systemctl is-active --quiet "$service"; then
            if systemctl is-enabled --quiet "$service"; then
                echo "✅ $service: RUNNING (enabled)" >> "$report_file"
                healthy_count=$((healthy_count + 1))
            else
                echo "⚠️  $service: RUNNING (disabled)" >> "$report_file"
                warning_count=$((warning_count + 1))
            fi
        else
            echo "❌ $service: STOPPED" >> "$report_file"
            critical_count=$((critical_count + 1))
        fi
    done
    
    echo "" >> "$report_file"
    echo "Summary:" >> "$report_file"
    echo "  Healthy: $healthy_count" >> "$report_file"
    echo "  Warnings: $warning_count" >> "$report_file"
    echo "  Critical: $critical_count" >> "$report_file"
    
    log_action "Health report generated: $report_file"
    
    # Send report if there are critical issues
    if [ $critical_count -gt 0 ]; then
        mail -s "Service Health Report - Critical Issues" "$ALERT_EMAIL" < "$report_file"
    fi
}

# Function to check system resources
check_system_resources() {
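    # Total CPU usage is derived from the idle figure that top reports (100 - idle)
    local cpu_idle=$(LC_ALL=C top -bn1 | grep "Cpu(s)" | sed -E 's/.*[ ,]([0-9.]+)[% ]*id.*/\1/')
    local cpu_usage=$(awk -v idle="$cpu_idle" 'BEGIN {printf "%.1f", 100 - idle}')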
    local memory_usage=$(free | awk 'NR==2{printf "%.2f", $3*100/$2}')
    local disk_usage=$(df / | awk 'NR==2 {print $5}' | sed 's/%//')
    
    log_action "SYSTEM RESOURCES: CPU: ${cpu_usage}%, Memory: ${memory_usage}%, Disk: ${disk_usage}%"
    
    # Alert if resources are high
    if (( $(echo "$cpu_usage > 90" | bc -l) )); then
        send_alert "SYSTEM" "High CPU usage: ${cpu_usage}%"
    fi
    
    if (( $(echo "$memory_usage > 90" | bc -l) )); then
        send_alert "SYSTEM" "High memory usage: ${memory_usage}%"
    fi
    
    if [ "$disk_usage" -gt 90 ]; then
        send_alert "SYSTEM" "High disk usage: ${disk_usage}%"
    fi
}

# Main monitoring function
main() {
    log_action "Starting service health monitoring"
    
    # Check system resources
    check_system_resources
    
    # Monitor each service
    for service in "${!services[@]}"; do
        local dependency="${services[$service]}"
        
        # check_service_health reports its result through its exit status;
        # its stdout carries log lines, so test $? rather than captured output
        check_service_health "$service" "$dependency"
        case $? in
            0)
                # Service is healthy
                ;;
            1)
                # Service is running but not enabled (already logged by check_service_health)
                ;;
            2)
                # Service is not running
                log_action "CRITICAL: $service is not running, attempting restart"
                restart_service "$service" "$dependency"
                ;;
        esac
    done
    
    # Generate health report
    generate_health_report
    
    log_action "Service health monitoring completed"
}

# Run main function
main
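
The configuration above sets RESTART_COOLDOWN, but nothing in the script reads it, so a service that keeps crashing is restarted on every run. One possible way to enforce the cooldown is to store the time of the last restart attempt per service and skip the restart while that time is too recent. This is a minimal sketch under assumptions: STATE_DIR, record_restart, and cooldown_elapsed are hypothetical names, and timestamps are kept as epoch seconds in small files.

```shell
#!/bin/bash
# Hypothetical cooldown helpers; STATE_DIR and both function names are assumptions.

RESTART_COOLDOWN="${RESTART_COOLDOWN:-300}"              # seconds, as in the config above
STATE_DIR="${STATE_DIR:-/var/lib/service_monitor}"

# Record the time (epoch seconds) of the latest restart attempt for a service.
# The optional second argument overrides "now", which makes the helper easy to test.
record_restart() {
    local service="$1"
    local now="${2:-$(date +%s)}"
    mkdir -p "$STATE_DIR"
    printf '%s\n' "$now" > "$STATE_DIR/$service.last_restart"
}

# Return 0 if the service may be restarted now, 1 if it is still cooling down.
cooldown_elapsed() {
    local service="$1"
    local now="${2:-$(date +%s)}"
    local stamp="$STATE_DIR/$service.last_restart"
    local last

    [ -f "$stamp" ] || return 0          # never restarted before: allowed
    last=$(<"$stamp")
    [ $(( now - last )) -ge "$RESTART_COOLDOWN" ]
}

# Example, inside the monitoring loop:
#   if cooldown_elapsed "$service"; then
#       record_restart "$service"
#       restart_service "$service" "$dependency"
#   fi
```

Keeping the state in files rather than in shell variables matters here because each cron or timer run is a new process.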

Service-Specific Monitoring

Web Server Monitoring

#!/bin/bash
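# web_server_monitor.sh

# Minimal logger so this script runs standalone (same format as the advanced script)
LOG_FILE="/var/log/service_monitor.log"
log_action() {
    echo "[$(date)] $1" | tee -a "$LOG_FILE"
}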

# Function to check web server health
check_web_server() {
    local service="$1"
    local port="$2"
    local url="$3"
    
    # Check if service is running
    if ! systemctl is-active --quiet "$service"; then
        log_action "CRITICAL: $service is not running"
        return 1
    fi
    
    # Check if port is listening
    if ! ss -tuln | grep -q ":$port "; then
        log_action "CRITICAL: $service is not listening on port $port"
        return 1
    fi
    
    # Check HTTP response
    if [ -n "$url" ]; then
        local http_code=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "$url")
        if [ "$http_code" != "200" ]; then
            log_action "WARNING: $service returned HTTP $http_code"
            return 1
        fi
    fi
    
    log_action "HEALTHY: $service is running and responding"
    return 0
}

# Monitor web servers (keep only the servers installed on this host;
# normally just one of them listens on port 80)
check_web_server "nginx" "80" "http://localhost"
check_web_server "apache2" "80" "http://localhost"
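
The check above treats every status other than 200 as a failure, so a site that answers with a redirect is reported as unhealthy. A possible refinement, sketched below, accepts any 2xx or 3xx code and bounds how long curl may wait. The helper names and the timeout values are assumptions, and whether redirects count as healthy is a policy choice for each site.

```shell
#!/bin/bash
# Hypothetical helpers; the 2xx/3xx policy and the timeout values are assumptions.

# Return 0 for 2xx and 3xx status codes, 1 for anything else
# (including "000", which curl prints when it received no response).
http_code_healthy() {
    case "$1" in
        2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Print only the status code; give up after 10 seconds in total.
fetch_http_code() {
    curl -s -o /dev/null --connect-timeout 5 --max-time 10 -w "%{http_code}" "$1"
}

# Example:
#   http_code_healthy "$(fetch_http_code http://localhost)" || echo "unhealthy"
```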

Database Service Monitoring

#!/bin/bash
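# database_monitor.sh

# Minimal logger so this script runs standalone (same format as the advanced script)
LOG_FILE="/var/log/service_monitor.log"
log_action() {
    echo "[$(date)] $1" | tee -a "$LOG_FILE"
}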

# Function to check database health
check_database() {
    local service="$1"
    local port="$2"
    local user="$3"
    local password="$4"
    
    # Check if service is running
    if ! systemctl is-active --quiet "$service"; then
        log_action "CRITICAL: $service is not running"
        return 1
    fi
    
    # Check if port is listening
    if ! ss -tuln | grep -q ":$port "; then
        log_action "CRITICAL: $service is not listening on port $port"
        return 1
    fi
    
    # Check database connection
    case "$service" in
        "mysql")
            if ! mysql -u "$user" -p"$password" -e "SELECT 1;" &>/dev/null; then
                log_action "WARNING: $service connection test failed"
                return 1
            fi
            ;;
        "postgresql")
            if ! PGPASSWORD="$password" psql -h localhost -p "$port" -U "$user" -d postgres -c "SELECT 1;" &>/dev/null; then
                log_action "WARNING: $service connection test failed"
                return 1
            fi
            ;;
    esac
    
    log_action "HEALTHY: $service is running and accessible"
    return 0
}

# Monitor databases (the credentials below are placeholders;
# do not hard-code real passwords in the script)
check_database "mysql" "3306" "root" "password"
check_database "postgresql" "5432" "postgres" "password"

Automation and Scheduling

Cron Job Setup

# Add to crontab for regular monitoring
sudo crontab -e

# Run every 5 minutes
*/5 * * * * /usr/local/bin/service_monitor.sh >> /var/log/service_monitor.log 2>&1

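# Run comprehensive check every hour (the script already writes to the log via tee,
# so only stderr is appended here to avoid duplicate lines)
0 * * * * /usr/local/bin/advanced_service_monitor.sh > /dev/null 2>> /var/log/service_monitor.log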

# Generate daily report
0 8 * * * /usr/local/bin/generate_daily_report.sh >> /var/log/service_monitor.log 2>&1
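
With the schedule above, the five-minute job and the hourly job both start at minute 0, so two monitors may try to restart the same service at the same moment. One way to avoid that is a simple lock that only one run can hold. The sketch below relies on mkdir being atomic. LOCK_DIR and the function names are hypothetical, and a run killed with SIGKILL would leave a stale lock that has to be removed by hand.

```shell
#!/bin/bash
# Hypothetical lock helpers; LOCK_DIR and the function names are assumptions.
LOCK_DIR="${LOCK_DIR:-/run/service_monitor.lock}"

# mkdir is atomic: it succeeds for exactly one caller while the directory is absent.
acquire_lock() {
    mkdir "$LOCK_DIR" 2>/dev/null
}

# Remove the lock directory so the next run can proceed.
release_lock() {
    rmdir "$LOCK_DIR" 2>/dev/null
}

# Example, at the top of a monitor script:
#   acquire_lock || exit 0        # another run is already in progress
#   trap release_lock EXIT        # release the lock when the script exits
```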

Systemd Timer Alternative

# Create systemd service
sudo nano /etc/systemd/system/service-monitor.service

[Unit]
Description=Service Health Monitor
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/service_monitor.sh
User=root
StandardOutput=journal
StandardError=journal

# Create systemd timer
sudo nano /etc/systemd/system/service-monitor.timer

[Unit]
Description=Run service monitor every 5 minutes
Requires=service-monitor.service

[Timer]
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target

# Reload unit files, then enable and start the timer
sudo systemctl daemon-reload
sudo systemctl enable --now service-monitor.timer

Monitoring and Alerting

Email Alerts

#!/bin/bash
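# send_alert.sh

# Settings and logger repeated here so the script runs standalone
LOG_FILE="/var/log/service_monitor.log"
ALERT_EMAIL="admin@company.com"
# SLACK_WEBHOOK_URL must be set in the environment before send_slack_alert is called

log_action() {
    echo "[$(date)] $1" | tee -a "$LOG_FILE"
}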

# Function to send email alert
send_alert() {
    local service="$1"
    local status="$2"
    local message="$3"
    
    local subject="Service Alert: $service - $status"
    local body="Service: $service
Status: $status
Message: $message
Time: $(date)
Server: $(hostname)"

    echo "$body" | mail -s "$subject" "$ALERT_EMAIL"
    log_action "ALERT SENT: $service - $status"
}

# Function to send Slack alert
send_slack_alert() {
    local service="$1"
    local status="$2"
    local message="$3"
    
    local payload="{
        \"text\": \"🚨 Service Alert\",
        \"attachments\": [{
            \"color\": \"danger\",
            \"fields\": [{
                \"title\": \"Service\",
                \"value\": \"$service\",
                \"short\": true
            }, {
                \"title\": \"Status\",
                \"value\": \"$status\",
                \"short\": true
            }, {
                \"title\": \"Message\",
                \"value\": \"$message\",
                \"short\": false
            }, {
                \"title\": \"Time\",
                \"value\": \"$(date)\",
                \"short\": true
            }, {
                \"title\": \"Server\",
                \"value\": \"$(hostname)\",
                \"short\": true
            }]
        }]
    }"
    
    curl -X POST -H 'Content-type: application/json' \
        --data "$payload" \
        "$SLACK_WEBHOOK_URL"
}
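
The Slack payload above interpolates $service, $status, and $message directly into a JSON string, so a message that contains a double quote, a backslash, or a newline produces invalid JSON. A minimal sketch of escaping those characters in pure bash follows. The names json_escape and build_slack_payload are hypothetical, the payload is reduced to a single text field for brevity, and control characters other than newline, tab, and carriage return are not handled.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the original send_alert.sh.

# Escape a string for use inside a JSON string literal.
# Handles backslash, double quote, newline, tab and carriage return only.
json_escape() {
    local s="$1"
    s=${s//\\/\\\\}        # backslashes first, so later escapes are not doubled
    s=${s//\"/\\\"}
    s=${s//$'\n'/\\n}
    s=${s//$'\t'/\\t}
    s=${s//$'\r'/\\r}
    printf '%s' "$s"
}

# Build a one-field payload: {"text": "<service> <status>: <message>"}
build_slack_payload() {
    local text
    text=$(json_escape "$1 $2: $3")
    printf '{"text": "%s"}' "$text"
}

# Example:
#   curl -s -X POST -H 'Content-type: application/json' \
#       --data "$(build_slack_payload "$service" "$status" "$message")" \
#       "$SLACK_WEBHOOK_URL"
```

If jq is available on the host, generating the payload with it is a sturdier option than hand-written escaping.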

Best Practices

1. Service Dependencies

# Check service dependencies
check_dependencies() {
    local service="$1"
    local dependencies=($(systemctl list-dependencies "$service" --plain | grep -v "$service"))
    
    for dep in "${dependencies[@]}"; do
        if ! systemctl is-active --quiet "$dep"; then
            log_action "WARNING: Dependency $dep is not running for $service"
            return 1
        fi
    done
    
    return 0
}

2. Graceful Restart

# Function for graceful restart
graceful_restart() {
    local service="$1"
    
    # Try graceful reload first
    if systemctl reload "$service" 2>/dev/null; then
        log_action "SUCCESS: $service reloaded gracefully"
        return 0
    fi
    
    # Fall back to restart
    if systemctl restart "$service"; then
        log_action "SUCCESS: $service restarted"
        return 0
    fi
    
    log_action "FAILED: $service restart failed"
    return 1
}

3. Health Checks

# Function for comprehensive health check
comprehensive_health_check() {
    local service="$1"
    
    # Check service status
    if ! systemctl is-active --quiet "$service"; then
        return 1
    fi
    
    # Check service logs for errors
    if journalctl -u "$service" --since "5 minutes ago" | grep -qi error; then
        log_action "WARNING: $service has errors in recent logs"
        return 1
    fi
    
    # Check resource usage
    local pid=$(systemctl show -p MainPID "$service" --value)
    if [ -n "$pid" ] && [ "$pid" != "0" ]; then
        local cpu_usage=$(ps -p "$pid" -o %cpu --no-headers)
        local memory_usage=$(ps -p "$pid" -o %mem --no-headers)
        
        if (( $(echo "$cpu_usage > 90" | bc -l) )); then
            log_action "WARNING: $service has high CPU usage: ${cpu_usage}%"
            return 1
        fi
        
        if (( $(echo "$memory_usage > 90" | bc -l) )); then
            log_action "WARNING: $service has high memory usage: ${memory_usage}%"
            return 1
        fi
    fi
    
    return 0
}

Conclusion

Service health monitoring is essential for maintaining reliable Linux systems. A well-designed monitoring system includes:

Comprehensive service checks - Monitor status, dependencies, and health
Automated recovery - Restart failed services automatically
Resource monitoring - Track system resources and service performance
Alerting and reporting - Notify administrators of issues
Logging and audit trails - Maintain records of all monitoring activities

Key takeaways:

Monitor critical services - Focus on services that impact system availability
Implement automated recovery - Restart failed services without manual intervention
Set up proper alerting - Notify administrators when manual intervention is needed
Track service dependencies - Ensure dependent services are running
Monitor system resources - Prevent resource exhaustion from causing failures

Remember: detecting a failure is only half the job. A useful monitor also recovers automatically where it can and escalates to an administrator where it cannot.
