Set up Linux storage monitoring with smartmontools and automated health alerts

Intermediate, 25 min, Apr 27, 2026
Ubuntu 24.04, Debian 12, AlmaLinux 9, Rocky Linux 9

Monitor disk health and prevent storage failures with S.M.A.R.T monitoring, automated email alerts, and custom dashboards. Covers smartd daemon configuration, health checks, and integration with monitoring systems.

Prerequisites

  • Root access to the server
  • Email system configured (postfix/sendmail)
  • At least one storage device with S.M.A.R.T support

What this solves

Disk failures are one of the most common causes of data loss and service outages in production environments. S.M.A.R.T (Self-Monitoring, Analysis and Reporting Technology) provides early warning signs of impending disk failures, allowing you to replace drives before they fail completely. This tutorial sets up smartmontools to continuously monitor disk health, send automated alerts, and integrate with monitoring dashboards.

Step-by-step installation

Install smartmontools package

Install the smartmontools package, which provides the smartctl command and the smartd daemon for continuous monitoring, together with a command-line mail client.

# Ubuntu / Debian
sudo apt update
sudo apt install -y smartmontools mailutils

# AlmaLinux / Rocky Linux
sudo dnf update -y
sudo dnf install -y smartmontools mailx

Identify available storage devices

Scan for all storage devices and check which ones support S.M.A.R.T monitoring capabilities.

sudo smartctl --scan
sudo smartctl --info /dev/sda
sudo smartctl --health /dev/sda

This shows all detected drives and their S.M.A.R.T status. Note the device paths (like /dev/sda, /dev/nvme0n1) for configuration.
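The scan output can also be consumed by a script. The sketch below is a minimal example that parses a captured sample of smartctl --scan output into "device type" pairs. The sample text and the helper name list_scanned_devices are illustrative assumptions, not output from your system. Note that the scan typically lists NVMe controllers (for example /dev/nvme0) rather than namespaces such as /dev/nvme0n1.

```shell
#!/bin/sh
# Sample `smartctl --scan` output (illustrative; your devices will differ).
sample_scan='/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device'

# list_scanned_devices (hypothetical helper): read scan output on stdin and
# print one "device type" pair per line.
list_scanned_devices() {
    awk '$2 == "-d" { print $1, $3 }'
}

printf '%s\n' "$sample_scan" | list_scanned_devices

# On a real host (requires root and smartmontools) the same helper could
# drive a health check for every detected device:
#   smartctl --scan | list_scanned_devices | while read -r dev type; do
#       smartctl -H -d "$type" "$dev"
#   done
```

Passing the detected type back with -d avoids relying on auto-detection for each device.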

Configure email notifications

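Set up system email so the server can deliver S.M.A.R.T alerts. Configure postfix locally or point it at an external SMTP relay.

# Ubuntu / Debian
sudo dpkg-reconfigure postfix

# AlmaLinux / Rocky Linux
sudo dnf install -y postfix
sudo systemctl enable --now postfix

On Ubuntu and Debian, choose "Internet Site" in the dialog and enter your server's hostname. The dpkg-reconfigure command only works if the postfix package is already installed, so install it with apt first if it is missing. For production, configure proper SMTP relay settings.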

Configure smartd daemon

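Edit the smartd configuration file, /etc/smartd.conf, to monitor specific drives and define alert conditions. If the stock file already contains an active DEVICESCAN line, comment it out first, because smartd ignores every entry that follows a DEVICESCAN line.

# Monitor SATA/SAS drives: default checks (-a), scheduled self-tests (-s), email alerts (-m)
/dev/sda -a -d auto -n standby,q -s (S/../.././02|L/../../6/03) -m admin@example.com -M exec /usr/share/smartmontools/smartd-runner
/dev/sdb -a -d auto -n standby,q -s (S/../.././02|L/../../6/03) -m admin@example.com -M exec /usr/share/smartmontools/smartd-runner

# Monitor NVMe drives
/dev/nvme0n1 -a -n standby,q -s (S/../.././02|L/../../6/03) -m admin@example.com -M exec /usr/share/smartmontools/smartd-runner

# Catch-all: scan for any other devices and tolerate removable ones that are absent
DEVICESCAN -d removable -n standby -m admin@example.com -M exec /usr/share/smartmontools/smartd-runner

Replace admin@example.com with your actual email address. Each pattern in the -s schedule has the form T/MM/DD/d/HH (test type, month, day of month, day of week with 1 = Monday, hour), so S/../.././02 runs a short self-test daily at 2 AM and L/../../6/03 runs a long self-test every Saturday at 3 AM. The -n standby,q directive skips checks while a disk is spun down, so polling does not wake it.

The smartd-runner path shown here is the one shipped by the Debian and Ubuntu packages. On AlmaLinux and Rocky Linux, check which helper your package installs under /usr/libexec/smartmontools/, or omit -M exec to use the default mail command. The -n directive is documented as ATA only, and scheduled self-tests on NVMe depend on the smartmontools version, so consult smartd.conf(5) on your system before relying on them for /dev/nvme0n1.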
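To see how the schedule pattern behaves, the following sketch models the check in shell. smartd builds a string of the form T/MM/DD/d/HH for each test type and the current hour, then matches it against the -s regular expression. The helper name test_due and the anchoring with ^ and $ are assumptions made for this simplified model; it is not smartd's own code.

```shell
#!/bin/sh
# Schedule regex copied from the smartd.conf entries in this tutorial.
SCHEDULE='(S/../.././02|L/../../6/03)'

# test_due TYPE MONTH DAY WEEKDAY HOUR (hypothetical helper)
# Builds the T/MM/DD/d/HH string and reports whether the schedule matches it.
# WEEKDAY uses 1 = Monday ... 7 = Sunday.
test_due() {
    stamp="$1/$2/$3/$4/$5"
    if printf '%s\n' "$stamp" | grep -Eq "^${SCHEDULE}\$"; then
        echo "due"
    else
        echo "not due"
    fi
}

test_due S 04 27 1 02   # Monday 02:00, short test
test_due L 04 27 1 03   # Monday 03:00, long test
test_due L 05 02 6 03   # Saturday 03:00, long test
```

The three calls print due, not due, and due. Changing the weekday field from 6 to 7 in the pattern would move the long test to Sunday.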

Create custom alert script

Create a custom script for enhanced alerting with more detailed information and multiple notification channels.

#!/bin/bash
# Enhanced S.M.A.R.T alert script
# Usage: called by smartd (-M exec) when issues are detected

# smartd passes details in SMARTD_* environment variables; the positional
# arguments are kept as a fallback for manual testing.
DEVICE="${SMARTD_DEVICE:-$1}"
MSG="${SMARTD_MESSAGE:-$2}"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
HOSTNAME=$(hostname -f)
LOGFILE="/var/log/smartd-alerts.log"

# Log the alert
echo "[$TIMESTAMP] $HOSTNAME: $DEVICE - $MSG" >> "$LOGFILE"

# Get detailed S.M.A.R.T information
SMART_INFO=$(smartctl -a "$DEVICE" 2>/dev/null)
HEALTH_STATUS=$(smartctl -H "$DEVICE" 2>/dev/null | grep "SMART overall-health")

# Create detailed email
EMAIL_BODY="S.M.A.R.T Alert - $HOSTNAME
Timestamp: $TIMESTAMP
Device: $DEVICE
Message: $MSG
Health Status: $HEALTH_STATUS

Full S.M.A.R.T Data:
$SMART_INFO

Please check the drive immediately and consider replacement if errors persist."

# Send email
echo "$EMAIL_BODY" | mail -s "[ALERT] S.M.A.R.T Issue on $HOSTNAME - $DEVICE" admin@example.com

# Optional: send to a monitoring system (placeholder URL; uncomment to use)
# curl -X POST https://monitoring.example.com/webhook -d "{\"alert\":\"smart\",\"device\":\"$DEVICE\",\"message\":\"$MSG\"}"

# Log to syslog
logger -t smartd-alert "S.M.A.R.T issue on $DEVICE: $MSG"

Save the script as /usr/local/bin/smart-alert.sh and make it executable:

sudo chmod +x /usr/local/bin/smart-alert.sh
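When smartd runs a -M exec script, it exports the alert details as environment variables such as SMARTD_DEVICE, SMARTD_FAILTYPE, and SMARTD_MESSAGE. The sketch below simulates such a call so the formatting logic can be tried without waiting for a real disk event. The helper name format_alert and the sample values are illustrative assumptions.

```shell
#!/bin/sh
# format_alert (hypothetical helper): build a one-line summary from the
# variables smartd exports, with placeholders when a variable is unset.
format_alert() {
    printf '%s [%s] %s\n' \
        "${SMARTD_DEVICE:-unknown-device}" \
        "${SMARTD_FAILTYPE:-unknown-failtype}" \
        "${SMARTD_MESSAGE:-no message}"
}

# Simulate the environment of a smartd call in a subshell so the variables
# do not leak into the rest of the session. The values are sample data.
(
    SMARTD_DEVICE=/dev/sda
    SMARTD_FAILTYPE=CurrentPendingSector
    SMARTD_MESSAGE='Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors'
    export SMARTD_DEVICE SMARTD_FAILTYPE SMARTD_MESSAGE
    format_alert
)
```

Reading the SMARTD_* variables rather than positional arguments keeps an alert script independent of how smartd happens to invoke the mailer.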

Update smartd configuration for custom script

Modify the smartd configuration to use the custom alert script instead of default email.

# Monitor all drives with custom alerting
/dev/sda -a -d auto -n standby,q -s (S/../.././02|L/../../6/03) -m admin@example.com -M exec /usr/local/bin/smart-alert.sh
/dev/sdb -a -d auto -n standby,q -s (S/../.././02|L/../../6/03) -m admin@example.com -M exec /usr/local/bin/smart-alert.sh
/dev/nvme0n1 -a -n standby,q -s (S/../.././02|L/../../6/03) -m admin@example.com -M exec /usr/local/bin/smart-alert.sh

# Scan for any remaining devices (evaluated when smartd starts or reloads its configuration)
DEVICESCAN -d removable -n standby -m admin@example.com -M exec /usr/local/bin/smart-alert.sh

Enable and start smartd service

Enable the smartd daemon to start automatically and begin monitoring immediately.

sudo systemctl enable smartd
sudo systemctl start smartd
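sudo systemctl status smartd

On Ubuntu and Debian the unit is packaged as smartmontools, with smartd as an alias. If systemctl refuses to enable the alias, run sudo systemctl enable --now smartmontools instead. If smartd was already running, restart it so that it picks up the new configuration.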

Create monitoring dashboard integration

Create a script to export S.M.A.R.T metrics for monitoring systems like Prometheus.

#!/bin/bash
# S.M.A.R.T metrics exporter for monitoring systems
# Outputs metrics in Prometheus text format; run as root (for example from root's crontab)

METRICS_DIR="/var/lib/node_exporter/textfile_collector"
METRICS_FILE="$METRICS_DIR/smart.prom"
# Keep the temp file on the same filesystem so the final mv is an atomic rename
TEMP_FILE="$METRICS_FILE.$$"

# Create directory if it doesn't exist
mkdir -p "$METRICS_DIR"

# Start a fresh metrics file with the metadata lines
{
    echo "# HELP smart_device_health S.M.A.R.T device health status (1=healthy, 0=failing)"
    echo "# TYPE smart_device_health gauge"
    echo "# HELP smart_temperature_celsius Current drive temperature"
    echo "# TYPE smart_temperature_celsius gauge"
    echo "# HELP smart_power_on_hours Drive power-on hours"
    echo "# TYPE smart_power_on_hours gauge"
} > "$TEMP_FILE"

# Scan all devices
for device in $(lsblk -dpno NAME | grep -E '(sd[a-z]|nvme[0-9]n[0-9])'); do
    # Check if device supports S.M.A.R.T
    if smartctl -i "$device" >/dev/null 2>&1; then
        device_name=$(basename "$device")

        # Get health status (SAS/SCSI drives print "OK" instead of PASSED and read as 0 here)
        health=$(smartctl -H "$device" 2>/dev/null | grep -c "PASSED")
        echo "smart_device_health{device=\"$device_name\"} $health" >> "$TEMP_FILE"

        # Get temperature (raw value is column 10 of the ATA attribute table)
        temp=$(smartctl -A "$device" 2>/dev/null | awk '/Temperature_Celsius/ {print $10}' | head -1)
        if [[ -n "$temp" && "$temp" =~ ^[0-9]+$ ]]; then
            echo "smart_temperature_celsius{device=\"$device_name\"} $temp" >> "$TEMP_FILE"
        fi

        # Get power-on hours
        hours=$(smartctl -A "$device" 2>/dev/null | awk '/Power_On_Hours/ {print $10}' | head -1)
        if [[ -n "$hours" && "$hours" =~ ^[0-9]+$ ]]; then
            echo "smart_power_on_hours{device=\"$device_name\"} $hours" >> "$TEMP_FILE"
        fi
    fi
done

# Atomically update metrics file
mv "$TEMP_FILE" "$METRICS_FILE"
chown node_exporter:node_exporter "$METRICS_FILE" 2>/dev/null || true

NVMe drives report temperature and power-on hours in a different layout from the ATA attribute table, so this script exports only the health metric for them.

Save the script as /usr/local/bin/smart-metrics.sh and make it executable:

sudo chmod +x /usr/local/bin/smart-metrics.sh
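The attribute extraction is easier to reason about against a fixed input. The sketch below runs the same column-10 lookup on a captured sample of ATA-style smartctl -A output and prints Prometheus sample lines. The sample table and the helper names raw_value and emit_metric are illustrative assumptions.

```shell
#!/bin/sh
# Sample ATA attribute table as printed by `smartctl -A` (illustrative values).
sample_attrs='ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21345
194 Temperature_Celsius     0x0022   064   052   000    Old_age   Always       -       36 (Min/Max 20/48)'

# raw_value NAME (hypothetical helper): print the first RAW_VALUE token
# (column 10) of the attribute called NAME.
raw_value() {
    awk -v name="$1" '$2 == name { print $10; exit }'
}

# emit_metric METRIC DEVICE VALUE (hypothetical helper): print one sample
# line, skipping empty or non-numeric values.
emit_metric() {
    case "$3" in
        ''|*[!0-9]*) return 0 ;;
    esac
    printf '%s{device="%s"} %s\n' "$1" "$2" "$3"
}

temp=$(printf '%s\n' "$sample_attrs" | raw_value Temperature_Celsius)
hours=$(printf '%s\n' "$sample_attrs" | raw_value Power_On_Hours)
emit_metric smart_temperature_celsius sda "$temp"
emit_metric smart_power_on_hours sda "$hours"
```

Matching on the attribute name in column 2, rather than anywhere in the line, avoids picking up a different attribute whose name merely contains the same text.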

Set up automated metrics collection

Add cron jobs to root's crontab to refresh the S.M.A.R.T metrics regularly and to send the weekly health report.

sudo crontab -e

# Update S.M.A.R.T metrics every 5 minutes
*/5 * * * * /usr/local/bin/smart-metrics.sh

# Weekly S.M.A.R.T health report (Mondays at 9 AM)
0 9 * * 1 /usr/local/bin/smart-health-report.sh

Create health reporting script

Generate a weekly health report that collects the identity, health status, key attributes, and recent errors of each drive.

#!/bin/bash
# Weekly S.M.A.R.T health report generator

REPORT_FILE="/tmp/smart_health_report_$(date +%Y%m%d).txt"
HOSTNAME=$(hostname -f)

echo "S.M.A.R.T Health Report for $HOSTNAME" > "$REPORT_FILE"
echo "Generated: $(date)" >> "$REPORT_FILE"
echo "======================================" >> "$REPORT_FILE"
echo >> "$REPORT_FILE"

for device in $(lsblk -dpno NAME | grep -E '(sd[a-z]|nvme[0-9]n[0-9])'); do
    if smartctl -i "$device" >/dev/null 2>&1; then
        echo "Device: $device" >> "$REPORT_FILE"
        echo "------------------" >> "$REPORT_FILE"

        # Basic info
        smartctl -i "$device" | grep -E '(Model|Serial|Capacity)' >> "$REPORT_FILE"

        # Health status
        echo >> "$REPORT_FILE"
        smartctl -H "$device" >> "$REPORT_FILE"

        # Key attributes
        echo >> "$REPORT_FILE"
        echo "Key Attributes:" >> "$REPORT_FILE"
        smartctl -A "$device" | grep -E '(Reallocated_Sector_Ct|Spin_Retry_Count|End-to-End_Error|Reported_Uncorrect|Command_Timeout|Current_Pending_Sector|Offline_Uncorrectable|Temperature_Celsius|Power_On_Hours)' >> "$REPORT_FILE"

        # Recent errors
        echo >> "$REPORT_FILE"
        echo "Recent Errors:" >> "$REPORT_FILE"
        smartctl -l error "$device" | head -10 >> "$REPORT_FILE"

        echo >> "$REPORT_FILE"
        echo "======================================" >> "$REPORT_FILE"
        echo >> "$REPORT_FILE"
    fi
done

# Email the report
mail -s "Weekly S.M.A.R.T Health Report - $HOSTNAME" admin@example.com < "$REPORT_FILE"

# Clean up
rm -f "$REPORT_FILE"

Save the script as /usr/local/bin/smart-health-report.sh and make it executable:

sudo chmod +x /usr/local/bin/smart-health-report.sh

Configure advanced monitoring options

Set up temperature monitoring

Configure specific temperature thresholds and cooling alerts for high-performance environments.

# Temperature monitoring with custom thresholds
/dev/sda -a -d auto -W 4,35,40 -s (S/../.././02|L/../../6/03) -m admin@example.com -M exec /usr/local/bin/smart-alert.sh
/dev/sdb -a -d auto -W 4,35,40 -s (S/../.././02|L/../../6/03) -m admin@example.com -M exec /usr/local/bin/smart-alert.sh

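The -W option takes three values, DIFF,INFO,CRIT. With 4,35,40, smartd logs a report when the temperature has changed by at least 4°C since the last report, logs an informational message at 35°C or above, and sends a warning at 40°C or above. Drives may run above 40°C under normal load, so tune these values to your hardware. Each device should appear only once in /etc/smartd.conf, so add -W to the existing entries rather than creating duplicate lines.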
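As a quick way to sanity-check candidate thresholds, the sketch below mirrors the "greater than or equal" comparisons for the INFO and CRIT values. It does not model the DIFF value or the minimum/maximum tracking, and the helper name classify_temp is an assumption for illustration.

```shell
#!/bin/sh
# INFO and CRIT thresholds taken from the -W 4,35,40 example.
INFO=35
CRIT=40

# classify_temp CELSIUS (hypothetical helper): report which threshold, if
# any, a temperature reading reaches.
classify_temp() {
    if [ "$1" -ge "$CRIT" ]; then
        echo "critical"
    elif [ "$1" -ge "$INFO" ]; then
        echo "info"
    else
        echo "ok"
    fi
}

for t in 30 35 39 40 52; do
    printf '%s C -> %s\n' "$t" "$(classify_temp "$t")"
done
```

Running a week of observed temperatures through a check like this shows how often a given pair of thresholds would have fired before you commit it to smartd.conf.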

Configure attribute monitoring

Monitor specific S.M.A.R.T attributes that indicate drive degradation.

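# Monitor critical attributes
/dev/sda -a -d auto -f -r 1 -r 9 -U 198 -I 194 -s (S/../.././02|L/../../6/03) -m admin@example.com -M exec /usr/local/bin/smart-alert.sh

This entry includes the raw value when reporting changes to the raw read error rate (-r 1) and power-on hours (-r 9), reports when the offline uncorrectable sector count is non-zero (-U 198), and ignores the temperature attribute when tracking attribute changes (-I 194). The -f directive, which reports failed usage attributes, is already implied by -a and is listed only for emphasis.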
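Directives such as -U report when the raw value of a sector counter is non-zero. The sketch below applies the same idea to a captured attribute table, flagging attributes 5, 197, and 198 when their raw value is above zero. The sample data and the helper name failing_attrs are illustrative assumptions, and the choice of these three IDs is a common convention rather than something smartd mandates.

```shell
#!/bin/sh
# Sample rows from an ATA attribute table (illustrative values).
sample_attrs='  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0'

# failing_attrs (hypothetical helper): print NAME=RAW for each watched
# attribute whose raw value (column 10) is greater than zero.
failing_attrs() {
    awk '($1 == 5 || $1 == 197 || $1 == 198) && ($10 + 0 > 0) { print $2 "=" $10 }'
}

printf '%s\n' "$sample_attrs" | failing_attrs
```

With the sample above, only Current_Pending_Sector=8 is printed.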

Verify your setup

Test the monitoring configuration and verify all components are working correctly.

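# Check smartd service status
sudo systemctl status smartd

# Verify that smartd can parse the configuration and register the devices (runs one check cycle, then exits)
sudo smartd -q onecheck

# Run a manual S.M.A.R.T check and start a short self-test (repeat for each drive)
sudo smartctl -a /dev/sda
sudo smartctl -t short /dev/sda

# Check if metrics are being generated
ls -la /var/lib/node_exporter/textfile_collector/
cat /var/lib/node_exporter/textfile_collector/smart.prom

# Test email notifications
echo "Test S.M.A.R.T alert" | mail -s "Test Alert" admin@example.com

# View recent smartd logs
journalctl -u smartd -f

To exercise the smartd alert path itself, temporarily add -M test to a device entry and restart smartd. It sends a single test notification at startup.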
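When scripting these checks, the exit status of smartctl carries more detail than a plain pass or fail: it is a bit mask documented in smartctl(8). The sketch below decodes such a status into readable lines. The helper name decode_smartctl_status is an assumption, and the descriptions are abbreviated from the man page.

```shell
#!/bin/sh
# decode_smartctl_status STATUS (hypothetical helper): print one line per
# bit that is set in a smartctl exit status.
decode_smartctl_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no problems reported"
        return 0
    fi
    bit=0
    for meaning in \
        "command line did not parse" \
        "device open failed" \
        "a SMART or ATA command failed" \
        "SMART status check returned DISK FAILING" \
        "prefail attributes at or below threshold" \
        "attributes were at or below threshold in the past" \
        "device error log contains errors" \
        "self-test log contains errors"
    do
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $meaning"
        fi
        bit=$((bit + 1))
    done
}

# Example: 72 = 64 + 8, so bits 3 and 6 are set.
decode_smartctl_status 72

# On a real host:
#   smartctl -H /dev/sda >/dev/null; decode_smartctl_status $?
```

Checking individual bits lets a script treat an old error-log entry differently from a failing health status.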

You can also feed these checks into an existing system monitoring setup or Prometheus monitoring infrastructure.

Common issues

Symptom | Cause | Fix
smartd service fails to start | Invalid device paths in config | Run sudo smartctl --scan and update device paths in /etc/smartd.conf
No email alerts received | Mail system not configured | Test with echo "test" | mail admin@example.com and configure postfix properly
USB/removable drives cause errors | smartd trying to monitor disconnected drives | Use the -n standby,q option and ensure DEVICESCAN includes -d removable
High CPU usage from smartd | Testing schedule too frequent | Reduce test frequency in the schedule, for example -s (S/../../7/02|L/../../6/03) for weekly short tests
Metrics not appearing in Prometheus | Wrong file permissions or path | Check /var/lib/node_exporter/textfile_collector/ permissions and the node_exporter textfile collector configuration
False temperature alerts | Normal seasonal temperature changes | Raise the temperature thresholds in the -W option, or use -I 194 to stop attribute-change reports for temperature
