Server Performance Crisis: Diagnosing and Resolving High CPU Usage
Learn how to quickly diagnose and resolve high CPU usage issues on Linux servers. Discover the most common causes and effective troubleshooting strategies to restore system performance.
Your monitoring dashboard is flashing red, alerts are flooding your inbox, and users are complaining that the application is crawling. The server's CPU usage has spiked to 95% and shows no signs of coming down. This is a critical situation that requires immediate action, but panic won't help. With a systematic approach, you can quickly identify the root cause and restore normal operation before the system becomes completely unresponsive.
Understanding High CPU Usage
What High CPU Usage Means
High CPU usage indicates that your system's processor is working at or near its maximum capacity. This can manifest as:
- Slow response times - Applications taking longer to respond
- System unresponsiveness - Commands taking too long to execute
- Increased load average - More processes waiting for CPU time
- Heat generation - CPUs running hot under sustained load
- Potential system instability - Risk of system crashes or hangs
Common Causes of High CPU Usage
- Runaway processes - Infinite loops, memory leaks, or stuck processes
- Database issues - Long-running queries, deadlocks, or connection storms
- System services - Malfunctioning services consuming excessive resources
- Resource contention - Multiple processes competing for limited CPU time
- Malware - Cryptocurrency miners, botnets, or other malicious software
- Hardware issues - Failing CPU, overheating, or insufficient cooling
- Application bugs - Inefficient algorithms or poor code optimization
Emergency Response: Quick Diagnosis
Step 1: Check System Load
First, get an overview of system performance:
# Check load average and uptime
uptime
# Example output:
# 14:02:03 up 3 days, 4:55, 2 users, load average: 6.02, 4.33, 2.89
Interpreting load average:
- As a rule of thumb, the load average should stay at or below the number of CPU cores
- Values above the core count mean processes are queuing for CPU time
- Consistently high values suggest a persistent problem rather than a brief spike
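To judge whether a load average of 6 is alarming or normal, compare it against the machine's core count:
# Number of available CPU cores
nproc
# More detail, including sockets and threads per core
lscpu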
Step 2: Identify CPU-Heavy Processes
Find out which processes are consuming the most CPU:
# Real-time process monitoring
top -o %CPU
# More user-friendly interface
htop
# Sort by CPU usage
ps -eo pid,ppid,cmd,%cpu,%mem --sort=-%cpu | head -10
# Alternative with timing information
ps -eo pid,ppid,cmd,%cpu,%mem,etime --sort=-%cpu | head -10
Key information to look for:
- PID - Process ID for further investigation
- %CPU - CPU usage percentage
- COMMAND - Process name and arguments
- TIME - Cumulative CPU time the process has consumed (use etime in ps to see how long it has been running)
- USER - Which user owns the process
Step 3: Detailed Process Analysis
Get more detailed information about high CPU processes:
# Use pidstat (from the sysstat package) for detailed CPU statistics
pidstat -u 1 5
# Check process threads
top -H -p <PID>
# Get process details
ps -p <PID> -o pid,ppid,cmd,%cpu,%mem,etime,state
# Check process file descriptors
lsof -p <PID>
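It also helps to know whether the time is being spent in user space or in the kernel; vmstat (part of procps) prints the split once per second:
# Sample CPU usage every second, five times (us = user, sy = system, id = idle, wa = I/O wait)
vmstat 1 5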
Step 4: Investigate the Root Cause
Based on what you see, ask these critical questions:
- Is it a specific application? (Java, Python, Node.js, etc.)
- Is there a cron job or batch script running?
- Is a service misconfigured and looping?
- Is it caused by a known bug? (busy-wait loops, memory leaks driving constant garbage collection)
- Is it malware or unauthorized software?
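A few quick checks against the suspicious PID help answer these questions, since everything about a running process is exposed under /proc:
# Where is the binary actually running from?
ls -l /proc/<PID>/exe
# Full command line and working directory
tr '\0' ' ' < /proc/<PID>/cmdline; echo
ls -l /proc/<PID>/cwd
# Who started it and when?
ps -p <PID> -o user,lstart,cmd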
Targeted Resolution Strategies
Strategy 1: Handle Runaway Processes
Kill Problematic Processes
# Ask a specific process to terminate gracefully first
kill <PID>
# Force-kill only if it does not exit
kill -9 <PID>
# Kill all processes with the same name
pkill -f "process-name"
# Kill processes by user
pkill -u username
# Kill matching processes older than one hour (3600 seconds, procps-ng pgrep/pkill)
pkill --older 3600 -f "process-name"
Process Management
# Set process limits
ulimit -u 1000 # Limit number of processes per user
# Use nice values to control priority
nice -n 10 command # Lower priority
renice -n 10 <PID> # Change priority of running process
# Use cpulimit (third-party tool, e.g. apt install cpulimit) to restrict CPU usage
cpulimit -p <PID> -l 50 # Limit to 50% of one core
Strategy 2: Restart Malfunctioning Services
# Check service status
systemctl status <service-name>
# Restart a service
systemctl restart <service-name>
# Check service logs
journalctl -u <service-name> -f
# Check for failed services
systemctl --failed
Strategy 3: Database Performance Issues
Identify Database Problems
# PostgreSQL - Check active queries
sudo -u postgres psql -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes';"
# MySQL - Check process list
mysql -e "SHOW PROCESSLIST;"
# Count database connections (use ss -tan instead if netstat is not installed)
netstat -an | grep :5432 | wc -l # PostgreSQL
netstat -an | grep :3306 | wc -l # MySQL
Resolve Database Issues
# Cancel a long-running query (gentler), or terminate its backend if it won't stop
sudo -u postgres psql -c "SELECT pg_cancel_backend(<PID>);"
sudo -u postgres psql -c "SELECT pg_terminate_backend(<PID>);"
# Restart database service
systemctl restart postgresql
systemctl restart mysql
# Check database configuration
sudo -u postgres psql -c "SHOW ALL;" | grep -E "(shared_buffers|work_mem|max_connections)"
Strategy 4: Resource Limits and Scaling
Use cgroups for Resource Control
# Create a cgroup with CPU limits (cgroup v1 layout; requires the cgroup-tools package)
sudo cgcreate -g cpu:/myapp
# Allow 50 ms of CPU per 100 ms period, i.e. 50% of one core
echo 50000 | sudo tee /sys/fs/cgroup/cpu/myapp/cpu.cfs_quota_us
echo 100000 | sudo tee /sys/fs/cgroup/cpu/myapp/cpu.cfs_period_us
# Add process to cgroup
echo <PID> | sudo tee /sys/fs/cgroup/cpu/myapp/cgroup.procs
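On distributions managed by systemd (which uses cgroups underneath), it is often simpler to let systemd apply the cap; myapp.service below is a placeholder for your own unit:
# Run a command in a transient scope with a CPU cap (50% of one core)
systemd-run --scope -p CPUQuota=50% -- command
# Or cap an existing systemd service
sudo systemctl set-property myapp.service CPUQuota=50%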
Container Resource Limits
# Docker container with CPU limits
docker run --cpus="1.5" --memory="512m" myapp
# Kubernetes resource limits
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: myapp
      image: myapp:latest   # placeholder image
      resources:
        limits:
          cpu: "1.5"
          memory: "512Mi"
        requests:
          cpu: "1"
          memory: "256Mi"
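If a container is already running, Docker can usually adjust its CPU limit in place without recreating it:
# Change the CPU limit of a running container
docker update --cpus="1" <container-id>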
Advanced Troubleshooting
Check System Logs
# Monitor system logs in real-time
journalctl -xe
# Check system log
tail -f /var/log/syslog
# Check for specific errors
grep -i "error\|fail\|panic\|oom" /var/log/syslog
# Check application logs
tail -f /var/log/nginx/error.log
tail -f /var/log/apache2/error.log
Performance Profiling
# Use perf to profile CPU usage
perf top
# Profile a specific process
perf top -p <PID>
# Record performance data
perf record -p <PID> sleep 30
perf report
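To see call stacks rather than just the hottest functions, record with call-graph sampling (perf typically ships in the linux-tools package for your kernel):
# Sample at 99 Hz with call graphs for 30 seconds, then inspect the report
perf record -F 99 -g -p <PID> -- sleep 30
perf report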
System Call Tracing
# Trace system calls
strace -p <PID>
# Trace with timing information
strace -T -p <PID>
# Trace file operations
strace -e trace=file -p <PID>
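Because strace slows the traced process considerably, a short summary run is often enough to spot a syscall loop:
# Count syscalls and time spent in them; stop with Ctrl+C to print the summary
strace -c -p <PID>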
Prevention and Monitoring
Set Up Continuous Monitoring
# Install monitoring tools
apt install htop iotop nethogs
# Set up automated monitoring script
cat > /usr/local/bin/monitor-cpu.sh << 'EOF'
#!/bin/bash
# CPU monitoring script: logs the top processes whenever CPU usage exceeds the threshold
# Note: this reads the user-CPU ("us") column from top; adjust if you want total (100 - idle)
# Requires bc for the floating-point comparison
THRESHOLD=80
LOG_FILE="/var/log/cpu-monitor.log"
while true; do
    CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
    if (( $(echo "$CPU_USAGE > $THRESHOLD" | bc -l) )); then
        echo "$(date): High CPU usage detected: ${CPU_USAGE}%" >> "$LOG_FILE"
        ps aux --sort=-%cpu | head -10 >> "$LOG_FILE"
    fi
    sleep 60
done
EOF
chmod +x /usr/local/bin/monitor-cpu.sh
Automated Alerts
# Set up email alerts
apt install mailutils
# Configure email alerts for high CPU
cat > /usr/local/bin/cpu-alert.sh << 'EOF'
#!/bin/bash
# Email an alert when user CPU exceeds 90% (requires bc and a working mail setup)
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
if (( $(echo "$CPU_USAGE > 90" | bc -l) )); then
    echo "High CPU usage detected: ${CPU_USAGE}%" | mail -s "CPU Alert" admin@example.com
fi
EOF
chmod +x /usr/local/bin/cpu-alert.sh
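To run the check periodically, a cron entry is the simplest option (every five minutes in this example):
# Edit root's crontab
sudo crontab -e
# Add this line:
*/5 * * * * /usr/local/bin/cpu-alert.sh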
Monitoring Tools
# Use Prometheus + Grafana for comprehensive monitoring
# Install node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
sudo mv node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/
sudo useradd --no-create-home --shell /bin/false node_exporter
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
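To keep node_exporter running across reboots, a minimal systemd unit works; the paths match the install steps above:
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
# Reload systemd and start the exporter
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
# Metrics are now exposed on the default port 9100
curl http://localhost:9100/metrics | head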
Real-World Scenarios
Scenario 1: Runaway Cron Script
Problem: A cron script stuck in an infinite loop
Solution:
# Identify the process
ps aux --sort=-%cpu | head -5
# Check cron jobs
crontab -l
sudo crontab -l
# Kill the runaway process
kill -9 <PID>
# Fix the script and reschedule
Scenario 2: Java Application Memory Leak
Problem: A Java application pinned at 95% CPU because a memory leak keeps the garbage collector running constantly
Solution:
# Identify the Java process
ps aux | grep java
# Check Java heap usage
jstat -gc <PID>
# Restart the application
systemctl restart myapp
# Monitor memory usage
free -h
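Before restarting, it can be worth confirming where the CPU is going. jstack (bundled with the JDK) takes a thread dump whose hex "nid" values correspond to the thread IDs shown by top; the /tmp path here is just an example:
# Find the hottest threads of the Java process
top -H -p <PID>
# Take a thread dump and convert a hot thread ID to hex to look it up in the dump
jstack <PID> > /tmp/threads.txt
printf '%x\n' <TID>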
Scenario 3: Docker Container Resource Storm
Problem: Docker containers running unbounded scraping jobs
Solution:
# Check Docker processes
docker ps
docker stats
# Stop problematic containers
docker stop <container-id>
# Set resource limits
docker run --cpus="1" --memory="512m" myapp
Scenario 4: Database Query Storm
Problem: Multiple database queries consuming high CPU
Solution:
# Check database processes
sudo -u postgres psql -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active';"
# Kill long-running queries
sudo -u postgres psql -c "SELECT pg_terminate_backend(<PID>);"
# Check database configuration
sudo -u postgres psql -c "SHOW max_connections;"
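If the pg_stat_statements extension is enabled, it can also show which query shapes consume the most total time across the whole storm (the column is named total_time on PostgreSQL 12 and older):
# Top 5 queries by cumulative execution time
sudo -u postgres psql -c "SELECT query, calls, total_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 5;"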
Best Practices for CPU Management
1. Proactive Monitoring
# Set up monitoring dashboards
# Use tools like Prometheus, Grafana, or Zabbix
# Configure alerts for CPU usage thresholds
2. Resource Limits
# Set process limits
ulimit -u 1000
# Use cgroups for resource control via systemd (transient unit with a CPU cap)
systemd-run --scope -p CPUQuota=50% -- command
# Configure systemd service limits
[Service]
# Cap the service at two full cores
CPUQuota=200%
LimitNOFILE=65536
LimitNPROC=4096
3. Performance Optimization
# Optimize application code
# Use profiling tools to identify bottlenecks
# Implement caching strategies
# Optimize database queries
4. Security Measures
# Regular security updates
apt update && apt upgrade
# Monitor for unusual processes
# Use intrusion detection systems
# Implement proper access controls
Common Pitfalls and Solutions
Pitfall 1: Killing the Wrong Process
Problem: Accidentally killing a critical system process
Solution: Always verify the process name and check if it's safe to kill
Pitfall 2: Ignoring Root Causes
Problem: Only treating symptoms, not causes
Solution: Investigate why the process is consuming high CPU
Pitfall 3: No Monitoring
Problem: Not knowing about issues until they become critical
Solution: Set up proactive monitoring and alerting
Pitfall 4: Over-aggressive Actions
Problem: Taking drastic measures without understanding the impact
Solution: Start with gentle actions and escalate gradually
Conclusion
High CPU usage is a critical issue that requires immediate attention, but with a systematic approach, you can quickly identify and resolve the problem:
- Diagnose quickly - Use uptime, top, htop, and ps to identify the culprit
- Analyze thoroughly - Check system resources, logs, and process details
- Act appropriately - Kill runaway processes, restart services, or optimize code
- Monitor continuously - Set up alerts and monitoring to prevent future issues
Remember:
- Don't panic - Systematic diagnosis is more effective than random actions
- Document everything - Keep records of what you did and why
- Prevent recurrence - Address root causes, not just symptoms
- Monitor proactively - Set up alerts before problems become critical