Server Performance Crisis: Diagnosing and Resolving High CPU Usage
Learn how to quickly diagnose and resolve high CPU usage issues on Linux servers. Discover the most common causes and effective troubleshooting strategies to restore system performance.
Your monitoring dashboard is flashing red, alerts are flooding your inbox, and users are complaining that the application is crawling. The server's CPU usage has spiked to 95% and shows no signs of coming down. This is a critical situation that requires immediate action, but panic won't help. With a systematic approach, you can quickly identify the root cause and restore normal operation before the system becomes completely unresponsive.
Understanding High CPU Usage
What High CPU Usage Means
High CPU usage indicates that your system's processor is working at or near its maximum capacity. This can manifest as:
- Slow response times - Applications taking longer to respond
- System unresponsiveness - Commands taking too long to execute
- Increased load average - More processes waiting for CPU time
- Heat generation - CPUs running hot under sustained load
- Potential system instability - Risk of system crashes or hangs
Common Causes of High CPU Usage
- Runaway processes - Infinite loops, memory leaks, or stuck processes
- Database issues - Long-running queries, deadlocks, or connection storms
- System services - Malfunctioning services consuming excessive resources
- Resource contention - Multiple processes competing for limited CPU time
- Malware - Cryptocurrency miners, botnets, or other malicious software
- Hardware issues - Failing CPU, overheating, or insufficient cooling
- Application bugs - Inefficient algorithms or poor code optimization
Emergency Response: Quick Diagnosis
Step 1: Check System Load
First, get an overview of system performance:
# Check load average and uptime
uptime
# Example output:
# 14:02:03 up 3 days, 4:55, 2 users, load average: 6.02, 4.33, 2.89
Interpreting load average:
- As a rule of thumb, the load average should stay at or below the number of CPU cores
- Values above the core count mean processes are queuing for CPU time
- Consistently high values suggest a persistent problem rather than a brief spike
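To judge whether a load average of 6 is alarming or normal, compare it against the machine's core count:
# Number of available CPU cores
nproc
# More detail, including sockets and threads per core
lscpu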
Step 2: Identify CPU-Heavy Processes
Find out which processes are consuming the most CPU:
# Real-time process monitoring
top -o %CPU
# More user-friendly interface
htop
# Sort by CPU usage
ps -eo pid,ppid,cmd,%cpu,%mem --sort=-%cpu | head -10
# Alternative with timing information
ps -eo pid,ppid,cmd,%cpu,%mem,etime --sort=-%cpu | head -10
Key information to look for:
- PID - Process ID for further investigation
- %CPU - CPU usage percentage
- COMMAND - Process name and arguments
- TIME - Cumulative CPU time the process has consumed (use etime in ps to see how long it has been running)
- USER - Which user owns the process
Step 3: Detailed Process Analysis
Get more detailed information about high CPU processes:
# Use pidstat (from the sysstat package) for detailed CPU statistics
pidstat -u 1 5
# Check process threads
top -H -p <PID>
# Get process details
ps -p <PID> -o pid,ppid,cmd,%cpu,%mem,etime,state
# Check process file descriptors
lsof -p <PID>
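It also helps to know whether the time is being spent in user space or in the kernel; vmstat (part of procps) prints the split once per second:
# Sample CPU usage every second, five times (us = user, sy = system, id = idle, wa = I/O wait)
vmstat 1 5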
Step 4: Investigate the Root Cause
Based on what you see, ask these critical questions:
- Is it a specific application? (Java, Python, Node.js, etc.)
- Is there a cron job or batch script running?
- Is a service misconfigured and looping?
- Is it caused by a known bug? (busy-wait loops, memory leaks driving constant garbage collection)
- Is it malware or unauthorized software?
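A few quick checks against the suspicious PID help answer these questions, since everything about a running process is exposed under /proc:
# Where is the binary actually running from?
ls -l /proc/<PID>/exe
# Full command line and working directory
tr '\0' ' ' < /proc/<PID>/cmdline; echo
ls -l /proc/<PID>/cwd
# Who started it and when?
ps -p <PID> -o user,lstart,cmd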
Targeted Resolution Strategies
Strategy 1: Handle Runaway Processes
Kill Problematic Processes
# Ask a specific process to terminate gracefully first
kill <PID>
# Force-kill only if it does not exit
kill -9 <PID>
# Kill all processes with the same name
pkill -f "process-name"
# Kill processes by user
pkill -u username
# Kill matching processes older than one hour (3600 seconds, procps-ng pgrep/pkill)
pkill --older 3600 -f "process-name"
Process Management
# Set process limits
ulimit -u 1000 # Limit number of processes per user
# Use nice values to control priority
nice -n 10 command # Lower priority
renice -n 10 <PID> # Change priority of running process
# Use cpulimit (third-party tool, e.g. apt install cpulimit) to restrict CPU usage
cpulimit -p <PID> -l 50 # Limit to 50% of one core
Strategy 2: Restart Malfunctioning Services
# Check service status
systemctl status <service-name>
# Restart a service
systemctl restart <service-name>
# Check service logs
journalctl -u <service-name> -f
# Check for failed services
systemctl --failed
Strategy 3: Database Performance Issues
Identify Database Problems
# PostgreSQL - Check active queries
sudo -u postgres psql -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes';"
# MySQL - Check process list
mysql -e "SHOW PROCESSLIST;"
# Count database connections (use ss -tan instead if netstat is not installed)
netstat -an | grep :5432 | wc -l # PostgreSQL
netstat -an | grep :3306 | wc -l # MySQL
Resolve Database Issues
# Cancel a long-running query (gentler), or terminate its backend if it won't stop
sudo -u postgres psql -c "SELECT pg_cancel_backend(<PID>);"
sudo -u postgres psql -c "SELECT pg_terminate_backend(<PID>);"
# Restart database service
systemctl restart postgresql
systemctl restart mysql
# Check database configuration
sudo -u postgres psql -c "SHOW ALL;" | grep -E "(shared_buffers|work_mem|max_connections)"
Strategy 4: Resource Limits and Scaling
Use cgroups for Resource Control
# Create a cgroup with CPU limits (cgroup v1 layout; requires the cgroup-tools package)
sudo cgcreate -g cpu:/myapp
# Allow 50 ms of CPU per 100 ms period, i.e. 50% of one core
echo 50000 | sudo tee /sys/fs/cgroup/cpu/myapp/cpu.cfs_quota_us
echo 100000 | sudo tee /sys/fs/cgroup/cpu/myapp/cpu.cfs_period_us
# Add process to cgroup
echo <PID> | sudo tee /sys/fs/cgroup/cpu/myapp/cgroup.procs
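On distributions managed by systemd (which uses cgroups underneath), it is often simpler to let systemd apply the cap; myapp.service below is a placeholder for your own unit:
# Run a command in a transient scope with a CPU cap (50% of one core)
systemd-run --scope -p CPUQuota=50% -- command
# Or cap an existing systemd service
sudo systemctl set-property myapp.service CPUQuota=50%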
Container Resource Limits
# Docker container with CPU limits
docker run --cpus="1.5" --memory="512m" myapp
# Kubernetes resource limits
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: myapp
      image: myapp:latest   # placeholder image
      resources:
        limits:
          cpu: "1.5"
          memory: "512Mi"
        requests:
          cpu: "1"
          memory: "256Mi"
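If a container is already running, Docker can usually adjust its CPU limit in place without recreating it:
# Change the CPU limit of a running container
docker update --cpus="1" <container-id>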
Advanced Troubleshooting
Check System Logs
# Monitor system logs in real-time
journalctl -xe
# Check system log
tail -f /var/log/syslog
# Check for specific errors
grep -i "error\|fail\|panic\|oom" /var/log/syslog
# Check application logs
tail -f /var/log/nginx/error.log
tail -f /var/log/apache2/error.log
Performance Profiling
# Use perf to profile CPU usage
perf top
# Profile a specific process
perf top -p <PID>
# Record performance data
perf record -p <PID> sleep 30
perf report
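To see call stacks rather than just the hottest functions, record with call-graph sampling (perf typically ships in the linux-tools package for your kernel):
# Sample at 99 Hz with call graphs for 30 seconds, then inspect the report
perf record -F 99 -g -p <PID> -- sleep 30
perf report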
System Call Tracing
# Trace system calls
strace -p <PID>
# Trace with timing information
strace -T -p <PID>
# Trace file operations
strace -e trace=file -p <PID>
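Because strace slows the traced process considerably, a short summary run is often enough to spot a syscall loop:
# Count syscalls and time spent in them; stop with Ctrl+C to print the summary
strace -c -p <PID>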
Prevention and Monitoring
Set Up Continuous Monitoring
# Install monitoring tools
apt install htop iotop nethogs
# Set up automated monitoring script
cat > /usr/local/bin/monitor-cpu.sh << 'EOF'
#!/bin/bash
# CPU monitoring script: logs the top processes whenever CPU usage exceeds the threshold
# Note: this reads the user-CPU ("us") column from top; adjust if you want total (100 - idle)
# Requires bc for the floating-point comparison
THRESHOLD=80
LOG_FILE="/var/log/cpu-monitor.log"
while true; do
    CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
    if (( $(echo "$CPU_USAGE > $THRESHOLD" | bc -l) )); then
        echo "$(date): High CPU usage detected: ${CPU_USAGE}%" >> "$LOG_FILE"
        ps aux --sort=-%cpu | head -10 >> "$LOG_FILE"
    fi
    sleep 60
done
EOF
chmod +x /usr/local/bin/monitor-cpu.sh
Automated Alerts
# Set up email alerts
apt install mailutils
# Configure email alerts for high CPU
cat > /usr/local/bin/cpu-alert.sh << 'EOF'
#!/bin/bash
# Email an alert when user CPU exceeds 90% (requires bc and a working mail setup)
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
if (( $(echo "$CPU_USAGE > 90" | bc -l) )); then
    echo "High CPU usage detected: ${CPU_USAGE}%" | mail -s "CPU Alert" admin@example.com
fi
EOF
chmod +x /usr/local/bin/cpu-alert.sh
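To run the check periodically, a cron entry is the simplest option (every five minutes in this example):
# Edit root's crontab
sudo crontab -e
# Add this line:
*/5 * * * * /usr/local/bin/cpu-alert.sh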
Monitoring Tools
# Use Prometheus + Grafana for comprehensive monitoring
# Install node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
sudo mv node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/
sudo useradd --no-create-home --shell /bin/false node_exporter
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
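To keep node_exporter running across reboots, a minimal systemd unit works; the paths match the install steps above:
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
# Reload systemd and start the exporter
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
# Metrics are now exposed on the default port 9100
curl http://localhost:9100/metrics | head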
Real-World Scenarios
Scenario 1: Runaway Cron Script
Problem: A cron script stuck in an infinite loop
Solution:
# Identify the process
ps aux --sort=-%cpu | head -5
# Check cron jobs
crontab -l
sudo crontab -l
# Kill the runaway process
kill -9 <PID>
# Fix the script and reschedule
Scenario 2: Java Application Memory Leak
Problem: A Java application pinned at 95% CPU because a memory leak keeps the garbage collector running constantly
Solution:
# Identify the Java process
ps aux | grep java
# Check Java heap usage
jstat -gc <PID>
# Restart the application
systemctl restart myapp
# Monitor memory usage
free -h
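Before restarting, it can be worth confirming where the CPU is going. jstack (bundled with the JDK) takes a thread dump whose hex "nid" values correspond to the thread IDs shown by top; the /tmp path here is just an example:
# Find the hottest threads of the Java process
top -H -p <PID>
# Take a thread dump and convert a hot thread ID to hex to look it up in the dump
jstack <PID> > /tmp/threads.txt
printf '%x\n' <TID>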
Scenario 3: Docker Container Resource Storm
Problem: Docker containers running unbounded scraping jobs
Solution:
# Check Docker processes
docker ps
docker stats
# Stop problematic containers
docker stop <container-id>
# Set resource limits
docker run --cpus="1" --memory="512m" myapp
Scenario 4: Database Query Storm
Problem: Multiple database queries consuming high CPU
Solution:
# Check database processes
sudo -u postgres psql -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active';"
# Kill long-running queries
sudo -u postgres psql -c "SELECT pg_terminate_backend(<PID>);"
# Check database configuration
sudo -u postgres psql -c "SHOW max_connections;"
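If the pg_stat_statements extension is enabled, it can also show which query shapes consume the most total time across the whole storm (the column is named total_time on PostgreSQL 12 and older):
# Top 5 queries by cumulative execution time
sudo -u postgres psql -c "SELECT query, calls, total_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 5;"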
Best Practices for CPU Management
1. Proactive Monitoring
# Set up monitoring dashboards
# Use tools like Prometheus, Grafana, or Zabbix
# Configure alerts for CPU usage thresholds
2. Resource Limits
# Set process limits
ulimit -u 1000
# Use cgroups for resource control via systemd (transient unit with a CPU cap)
systemd-run --scope -p CPUQuota=50% -- command
# Configure systemd service limits
[Service]
# Cap the service at two full cores
CPUQuota=200%
LimitNOFILE=65536
LimitNPROC=4096
3. Performance Optimization
# Optimize application code
# Use profiling tools to identify bottlenecks
# Implement caching strategies
# Optimize database queries
4. Security Measures
# Regular security updates
apt update && apt upgrade
# Monitor for unusual processes
# Use intrusion detection systems
# Implement proper access controls
Common Pitfalls and Solutions
Pitfall 1: Killing the Wrong Process
Problem: Accidentally killing a critical system process
Solution: Always verify the process name and check if it's safe to kill
Pitfall 2: Ignoring Root Causes
Problem: Only treating symptoms, not causes
Solution: Investigate why the process is consuming high CPU
Pitfall 3: No Monitoring
Problem: Not knowing about issues until they become critical
Solution: Set up proactive monitoring and alerting
Pitfall 4: Over-aggressive Actions
Problem: Taking drastic measures without understanding the impact
Solution: Start with gentle actions and escalate gradually
Conclusion
High CPU usage is a critical issue that requires immediate attention, but with a systematic approach, you can quickly identify and resolve the problem:
- Diagnose quickly - Use uptime, top, htop, and ps to identify the culprit
- Analyze thoroughly - Check system resources, logs, and process details
- Act appropriately - Kill runaway processes, restart services, or optimize code
- Monitor continuously - Set up alerts and monitoring to prevent future issues
Remember:
- Don't panic - Systematic diagnosis is more effective than random actions
- Document everything - Keep records of what you did and why
- Prevent recurrence - Address root causes, not just symptoms
- Monitor proactively - Set up alerts before problems become critical