Monitor disk health and prevent storage failures with S.M.A.R.T monitoring, automated email alerts, and custom dashboards. Covers smartd daemon configuration, health checks, and integration with monitoring systems.
Prerequisites
- Root access to the server
- Email system configured (postfix/sendmail)
- At least one storage device with S.M.A.R.T support
What this solves
Disk failures are one of the most common causes of data loss and service outages in production environments. S.M.A.R.T (Self-Monitoring, Analysis and Reporting Technology) can provide early warning signs of many impending disk failures, often allowing you to replace drives before they fail completely. This tutorial sets up smartmontools to continuously monitor disk health, send automated alerts, and integrate with monitoring dashboards.
Step-by-step installation
Install smartmontools package
Install the smartmontools package which provides the smartctl command and smartd daemon for continuous monitoring.
sudo apt update
sudo apt install -y smartmontools mailutils
Identify available storage devices
Scan for all storage devices and check which ones support S.M.A.R.T monitoring capabilities.
sudo smartctl --scan
sudo smartctl --info /dev/sda
sudo smartctl --health /dev/sda
This shows all detected drives and their S.M.A.R.T status. Note the device paths (like /dev/sda, /dev/nvme0n1) for configuration.
Configure email notifications
Set up system email to receive S.M.A.R.T alerts. Configure postfix or use an external SMTP relay.
sudo dpkg-reconfigure postfix
Choose "Internet Site" and enter your server's hostname. For production, configure proper SMTP relay settings.
Configure smartd daemon
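Edit the smartd configuration file, /etc/smartd.conf, to monitor specific drives and define alert conditions. Place explicit device entries before any DEVICESCAN line, because smartd ignores every line that follows DEVICESCAN.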
# Monitor all SATA/SAS drives, enable all S.M.A.R.T tests
/dev/sda -a -d auto -n standby,q -s (S/../.././02|L/../../6/03) -m admin@example.com -M exec /usr/share/smartmontools/smartd-runner
/dev/sdb -a -d auto -n standby,q -s (S/../.././02|L/../../6/03) -m admin@example.com -M exec /usr/share/smartmontools/smartd-runner
# Monitor NVMe drives
/dev/nvme0n1 -a -n standby,q -s (S/../.././02|L/../../6/03) -m admin@example.com -M exec /usr/share/smartmontools/smartd-runner
# Also monitor any other devices found by scanning at startup
DEVICESCAN -d removable -n standby -m admin@example.com -M exec /usr/share/smartmontools/smartd-runner
Replace admin@example.com with your actual email address. The configuration monitors all drives, runs short tests daily at 2 AM and long tests weekly on Saturdays at 3 AM.
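The -s argument is a regular expression that smartd matches against a string of the form T/MM/DD/d/HH: test type, month, day of month, day of week (1 = Monday through 7 = Sunday), and hour. As a rough illustration only, the sketch below mimics that matching with grep so you can check when a schedule would fire. The function name schedule_matches is made up for this example and is not part of smartmontools.

```shell
# Schedule from the configuration above: short test daily at 02:00,
# long test on Saturdays (day of week 6) at 03:00.
SCHEDULE='(S/../.././02|L/../../6/03)'

# schedule_matches TYPE MONTH DAY DOW HOUR
# Returns 0 if the T/MM/DD/d/HH string for that moment matches SCHEDULE.
# Hypothetical helper for illustration; smartd does this matching internally.
schedule_matches() {
    printf '%s\n' "$1/$2/$3/$4/$5" | grep -Eqx "$SCHEDULE"
}

# Example: Saturday 15 March at 03:00 matches the long-test branch
if schedule_matches L 03 15 6 03; then
    echo "long test would run"
fi
```

Changing the 6 in the long-test branch to another digit moves the weekly test to a different day.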
Create custom alert script
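Create a custom script at /usr/local/bin/smart-alert.sh for enhanced alerting with more detailed information and multiple notification channels. smartd passes the alert details to a -M exec script through environment variables such as SMARTD_DEVICE, SMARTD_MESSAGE, and SMARTD_ADDRESS, not as positional device and message arguments.
#!/bin/bash
# Enhanced S.M.A.R.T alert script
# Usage: called by smartd (-M exec) when issues are detected
# smartd exports the alert details as SMARTD_* environment variables
DEVICE="${SMARTD_DEVICE:-unknown}"
MSG="${SMARTD_MESSAGE:-no message provided}"
ADDRESS="${SMARTD_ADDRESS:-admin@example.com}"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
HOSTNAME=$(hostname -f)
LOGFILE="/var/log/smartd-alerts.log"
# Log the alert
echo "[$TIMESTAMP] $HOSTNAME: $DEVICE - $MSG" >> "$LOGFILE"
# Get detailed S.M.A.R.T information
SMART_INFO=$(smartctl -a "$DEVICE" 2>/dev/null)
HEALTH_STATUS=$(smartctl -H "$DEVICE" 2>/dev/null | grep "SMART overall-health")
# Create detailed email
EMAIL_BODY="S.M.A.R.T Alert - $HOSTNAME
Timestamp: $TIMESTAMP
Device: $DEVICE
Message: $MSG
Health Status: $HEALTH_STATUS
Full S.M.A.R.T Data:
$SMART_INFO
Please check the drive immediately and consider replacement if errors persist."
# Send email (SMARTD_ADDRESS may hold several space-separated addresses, so it is left unquoted)
echo "$EMAIL_BODY" | mail -s "[ALERT] S.M.A.R.T Issue on $HOSTNAME - $DEVICE" $ADDRESS
# Optional: send to monitoring system (kept quiet because smartd logs any script output as a problem)
curl -fsS --max-time 10 -o /dev/null -X POST https://monitoring.example.com/webhook -d "{\"alert\":\"smart\",\"device\":\"$DEVICE\",\"message\":\"$MSG\"}" || true
# Log to syslog
logger -t smartd-alert "S.M.A.R.T issue on $DEVICE: $MSG"
Make the script executable:
sudo chmod +x /usr/local/bin/smart-alert.sh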
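When smartd runs a -M exec script, it exports details of the alert as environment variables such as SMARTD_DEVICE, SMARTD_FAILTYPE, and SMARTD_MESSAGE. The sketch below, offered as one possible approach, builds the webhook JSON body from those variables and escapes backslashes and double quotes so that an unusual message cannot break the payload. The helper names json_escape and build_payload are invented for this example, and it assumes single-line values.

```shell
# json_escape STRING
# Escapes backslashes and double quotes for use inside a JSON string.
# Assumes the value contains no newlines or control characters.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# build_payload
# Prints a JSON object built from the SMARTD_* variables that smartd exports,
# falling back to placeholders when run outside smartd.
build_payload() {
    local device failtype message
    device=$(json_escape "${SMARTD_DEVICE:-unknown}")
    failtype=$(json_escape "${SMARTD_FAILTYPE:-unknown}")
    message=$(json_escape "${SMARTD_MESSAGE:-no message}")
    printf '{"alert":"smart","device":"%s","type":"%s","message":"%s"}\n' \
        "$device" "$failtype" "$message"
}

# Possible use inside the alert script (the URL is the placeholder from this guide):
#   curl -fsS --max-time 10 -o /dev/null -H 'Content-Type: application/json' \
#        -d "$(build_payload)" https://monitoring.example.com/webhook || true
```

Building the body in a function keeps the escaping in one place and makes it easy to test the payload without triggering a real alert.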
Update smartd configuration for custom script
Modify the smartd configuration to use the custom alert script instead of default email.
# Monitor all drives with custom alerting
/dev/sda -a -d auto -n standby,q -s (S/../.././02|L/../../6/03) -m admin@example.com -M exec /usr/local/bin/smart-alert.sh
/dev/sdb -a -d auto -n standby,q -s (S/../.././02|L/../../6/03) -m admin@example.com -M exec /usr/local/bin/smart-alert.sh
/dev/nvme0n1 -a -n standby,q -s (S/../.././02|L/../../6/03) -m admin@example.com -M exec /usr/local/bin/smart-alert.sh
# Also monitor other devices found by scanning at startup (-d removable tolerates drives that are absent when smartd starts)
DEVICESCAN -d removable -n standby -m admin@example.com -M exec /usr/local/bin/smart-alert.sh
Enable and start smartd service
Enable the smartd daemon to start automatically and begin monitoring immediately.
sudo systemctl enable smartd
sudo systemctl start smartd
sudo systemctl status smartd
Create monitoring dashboard integration
Create a script at /usr/local/bin/smart-metrics.sh to export S.M.A.R.T metrics for monitoring systems like Prometheus. It writes to the node_exporter textfile collector directory and must run as root so that smartctl can query the drives.
#!/bin/bash
# S.M.A.R.T metrics exporter for monitoring systems
# Outputs metrics in Prometheus format; run as root (for example from root's crontab)
METRICS_DIR="/var/lib/node_exporter/textfile_collector"
METRICS_FILE="$METRICS_DIR/smart.prom"
# Create directory if it doesn't exist
mkdir -p "$METRICS_DIR"
# Use a temporary file in the same directory so the final mv is an atomic rename
TEMP_FILE=$(mktemp "$METRICS_DIR/.smart.prom.XXXXXX") || exit 1
# Write metric descriptions
echo "# HELP smart_device_health S.M.A.R.T device health status (1=healthy, 0=failing)" > "$TEMP_FILE"
echo "# TYPE smart_device_health gauge" >> "$TEMP_FILE"
echo "# HELP smart_temperature_celsius Current drive temperature" >> "$TEMP_FILE"
echo "# TYPE smart_temperature_celsius gauge" >> "$TEMP_FILE"
echo "# HELP smart_power_on_hours Drive power-on hours" >> "$TEMP_FILE"
echo "# TYPE smart_power_on_hours gauge" >> "$TEMP_FILE"
# Scan all devices
for device in $(lsblk -dpno NAME | grep -E '(sd[a-z]|nvme[0-9]n[0-9])'); do
    # Check if device supports S.M.A.R.T
    if smartctl -i "$device" >/dev/null 2>&1; then
        device_name=$(basename "$device")
        # Get health status
        health=$(smartctl -H "$device" 2>/dev/null | grep -c "PASSED")
        echo "smart_device_health{device=\"$device_name\"} $health" >> "$TEMP_FILE"
        # Get temperature (ATA attribute table; NVMe output uses a different format and is skipped)
        temp=$(smartctl -A "$device" 2>/dev/null | awk '/Temperature_Celsius/ {print $10}' | head -1)
        if [[ -n "$temp" && "$temp" =~ ^[0-9]+$ ]]; then
            echo "smart_temperature_celsius{device=\"$device_name\"} $temp" >> "$TEMP_FILE"
        fi
        # Get power-on hours
        hours=$(smartctl -A "$device" 2>/dev/null | awk '/Power_On_Hours/ {print $10}' | head -1)
        if [[ -n "$hours" && "$hours" =~ ^[0-9]+$ ]]; then
            echo "smart_power_on_hours{device=\"$device_name\"} $hours" >> "$TEMP_FILE"
        fi
    fi
done
# Atomically update metrics file (mktemp creates it with mode 600, so make it readable first)
chmod 644 "$TEMP_FILE"
mv "$TEMP_FILE" "$METRICS_FILE"
chown node_exporter:node_exporter "$METRICS_FILE" 2>/dev/null || true
Make the script executable:
sudo chmod +x /usr/local/bin/smart-metrics.sh
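The exporter relies on the raw value being the tenth whitespace-separated field of the smartctl -A attribute table. The following sketch shows that extraction against a captured sample, matching the attribute name exactly in the second column instead of by pattern. The function attr_raw_value and the sample rows are illustrative assumptions, and real output varies by drive and smartmontools version.

```shell
# attr_raw_value NAME
# Reads `smartctl -A` style output on stdin and prints the first token of the
# RAW_VALUE column (field 10) for the attribute whose name equals NAME.
attr_raw_value() {
    awk -v name="$1" '$2 == name { print $10; exit }'
}

# Illustrative sample of an ATA attribute table (not taken from a real drive)
sample_attributes() {
    cat <<'EOF'
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21540
194 Temperature_Celsius     0x0022   064   052   000    Old_age   Always       -       36 (Min/Max 20/48)
EOF
}

# Turn one attribute into a Prometheus sample line
temp=$(sample_attributes | attr_raw_value Temperature_Celsius)
echo "smart_temperature_celsius{device=\"sda\"} $temp"
```

On a real system the sample function would be replaced by `smartctl -A /dev/sda`, run as root.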
Set up automated metrics collection
Create a cron job to regularly update S.M.A.R.T metrics for your monitoring system.
sudo crontab -e
# Update S.M.A.R.T metrics every 5 minutes
*/5 * * * * /usr/local/bin/smart-metrics.sh
# Weekly S.M.A.R.T health report (Mondays at 9:00 AM)
0 9 * * 1 /usr/local/bin/smart-health-report.sh
Create health reporting script
Generate a weekly health report that summarizes each drive's identity, health status, key attributes, and recent errors. Save the script as /usr/local/bin/smart-health-report.sh.
#!/bin/bash
# Weekly S.M.A.R.T health report generator
REPORT_FILE="/tmp/smart_health_report_$(date +%Y%m%d).txt"
HOSTNAME=$(hostname -f)
echo "S.M.A.R.T Health Report for $HOSTNAME" > "$REPORT_FILE"
echo "Generated: $(date)" >> "$REPORT_FILE"
echo "======================================" >> "$REPORT_FILE"
echo >> "$REPORT_FILE"
for device in $(lsblk -dpno NAME | grep -E '(sd[a-z]|nvme[0-9]n[0-9])'); do
    if smartctl -i "$device" >/dev/null 2>&1; then
        echo "Device: $device" >> "$REPORT_FILE"
        echo "------------------" >> "$REPORT_FILE"
        # Basic info
        smartctl -i "$device" | grep -E '(Model|Serial|Capacity)' >> "$REPORT_FILE"
        # Health status
        echo >> "$REPORT_FILE"
        smartctl -H "$device" >> "$REPORT_FILE"
        # Key attributes
        echo >> "$REPORT_FILE"
        echo "Key Attributes:" >> "$REPORT_FILE"
        smartctl -A "$device" | grep -E '(Reallocated_Sector_Ct|Spin_Retry_Count|End-to-End_Error|Reported_Uncorrect|Command_Timeout|Current_Pending_Sector|Offline_Uncorrectable|Temperature_Celsius|Power_On_Hours)' >> "$REPORT_FILE"
        # Recent errors
        echo >> "$REPORT_FILE"
        echo "Recent Errors:" >> "$REPORT_FILE"
        smartctl -l error "$device" | head -10 >> "$REPORT_FILE"
        echo >> "$REPORT_FILE"
        echo "======================================" >> "$REPORT_FILE"
        echo >> "$REPORT_FILE"
    fi
done
# Email the report
mail -s "Weekly S.M.A.R.T Health Report - $HOSTNAME" admin@example.com < "$REPORT_FILE"
# Clean up
rm -f "$REPORT_FILE"
Make the script executable:
sudo chmod +x /usr/local/bin/smart-health-report.sh
Configure advanced monitoring options
Set up temperature monitoring
Configure specific temperature thresholds and cooling alerts for high-performance environments.
# Temperature monitoring with custom thresholds
/dev/sda -a -d auto -W 4,35,40 -s (S/../.././02|L/../../6/03) -m admin@example.com -M exec /usr/local/bin/smart-alert.sh
/dev/sdb -a -d auto -W 4,35,40 -s (S/../.././02|L/../../6/03) -m admin@example.com -M exec /usr/local/bin/smart-alert.sh
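The -W option sets temperature monitoring with three values: a difference threshold (log changes of at least 4°C), an informational threshold (log readings of 35°C or higher), and a critical threshold (log at critical level and send a warning at 40°C or higher).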
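To make the -W values concrete, this sketch classifies a temperature reading against the informational and critical limits: readings at or above the critical limit count as critical, and readings at or above the informational limit are only informational. The function classify_temp is a hypothetical helper, not something smartd provides, and it ignores the difference threshold.

```shell
# classify_temp TEMP INFO CRIT
# Prints which -W threshold, if any, a reading has reached.
classify_temp() {
    if [ "$1" -ge "$3" ]; then
        echo "critical"
    elif [ "$1" -ge "$2" ]; then
        echo "info"
    else
        echo "normal"
    fi
}

# With the thresholds from -W 4,35,40:
classify_temp 33 35 40   # prints: normal
classify_temp 37 35 40   # prints: info
classify_temp 40 35 40   # prints: critical
```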
Configure attribute monitoring
Monitor specific S.M.A.R.T attributes that indicate drive degradation.
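# Monitor critical attributes and report their raw values
/dev/sda -a -d auto -f -r 1 -r 9 -U 198 -I 194 -s (S/../.././02|L/../../6/03) -m admin@example.com -M exec /usr/local/bin/smart-alert.sh
This reports the raw values of the raw read error rate (-r 1) and power-on hours (-r 9) alongside their normalized values, reports a nonzero offline uncorrectable sector count (-U 198), and ignores the temperature attribute when tracking attribute changes (-I 194). Attribute 194 is the drive temperature, not the read error rate. The -f directive (check for failure of usage attributes) is already implied by -a and is shown only for emphasis.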
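Reallocated, pending, and offline-uncorrectable sector counts are among the attributes listed in the weekly report in this guide, and a nonzero raw value for any of them is commonly treated as a reason to look closer. As a sketch under that assumption, the function below (a hypothetical helper named failing_attrs) reads smartctl -A style output and prints those attributes when their raw value is above zero. The sample rows are made up.

```shell
# failing_attrs
# Reads `smartctl -A` style output on stdin and prints NAME=RAW for the
# sector-related attributes whose raw value (field 10) is greater than zero.
failing_attrs() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            if ($10 + 0 > 0) print $2 "=" $10
        }
    '
}

# Illustrative sample rows (not taken from a real drive)
sample_sector_attrs() {
    cat <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
EOF
}

sample_sector_attrs | failing_attrs
```

A cron job could mail the output of such a check only when it is non-empty.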
Verify your setup
Test the monitoring configuration and verify all components are working correctly.
# Check smartd service status
sudo systemctl status smartd
# Parse the configuration, register devices, run one check, then exit
sudo smartd -q onecheck
# Run a manual S.M.A.R.T check and start a short self-test (repeat for each drive)
sudo smartctl -a /dev/sda
sudo smartctl -t short /dev/sda
# Check if metrics are being generated
ls -la /var/lib/node_exporter/textfile_collector/
cat /var/lib/node_exporter/textfile_collector/smart.prom
# Test email notifications
echo "Test S.M.A.R.T alert" | mail -s "Test Alert" admin@example.com
# View recent smartd logs
journalctl -u smartd -f
You can also integrate this monitoring with existing systems by linking to our system monitoring setup or Prometheus monitoring infrastructure.
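For scripted checks, the health line printed by smartctl -H can be tested directly. This sketch assumes the ATA/NVMe wording "SMART overall-health self-assessment test result: PASSED" and the SCSI/SAS wording "SMART Health Status: OK"; other devices or versions may phrase the result differently, and health_ok is a name chosen for this example.

```shell
# health_ok
# Reads `smartctl -H` output on stdin and returns 0 when it contains one of
# the two "healthy" phrasings assumed above.
health_ok() {
    grep -Eq 'SMART overall-health self-assessment test result: PASSED|SMART Health Status: OK'
}

# Possible use on a real system (requires root):
#   if smartctl -H /dev/sda | health_ok; then echo "sda healthy"; fi

# Demonstration with a captured line
if printf 'SMART overall-health self-assessment test result: PASSED\n' | health_ok; then
    echo "healthy"
fi
```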
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| smartd service fails to start | Invalid device paths in config | Run sudo smartctl --scan and update device paths in /etc/smartd.conf |
| No email alerts received | Mail system not configured | Test with echo "test" \| mail admin@example.com and configure postfix properly |
| USB/removable drives cause errors | smartd trying to monitor disconnected drives | Use -n standby,q option and ensure DEVICESCAN includes -d removable |
| High CPU usage from smartd | Too frequent testing schedule | Reduce test frequency in schedule: -s (S/../../7/02\|L/../../6/03) for weekly short tests |
| Metrics not appearing in Prometheus | Wrong file permissions or path | Check /var/lib/node_exporter/textfile_collector/ permissions and node_exporter config |
| False temperature alerts | Normal seasonal temperature changes | Adjust temperature thresholds in -W option or use -I 194 to ignore temperature alerts |
Next steps
- Set up backup monitoring with Prometheus and Grafana to complement your storage monitoring
- Configure system resource monitoring for comprehensive server health tracking
- Implement automated database backups as part of your data protection strategy
- Create advanced Grafana dashboards for disk health visualization
- Set up RAID monitoring and automated alerts for hardware RAID systems
Automated install script
Run this to automate the entire setup
#!/usr/bin/env bash
set -euo pipefail
# Production-quality S.M.A.R.T monitoring setup script
# Installs smartmontools with automated health alerts
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'
# Global variables
EMAIL=""
HOSTNAME=$(hostname -f)
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Cleanup function
cleanup() {
    if [[ $? -ne 0 ]]; then
        echo -e "${RED}[ERROR] Installation failed. Check logs above.${NC}"
    fi
}
trap cleanup EXIT
usage() {
    echo "Usage: $0 -e EMAIL [-h]"
    echo "  -e EMAIL   Email address for S.M.A.R.T alerts (required)"
    echo "  -h         Show this help message"
    exit 1
}
log_info() {
    echo -e "${GREEN}[INFO]${NC} $1"
}
log_warn() {
    echo -e "${YELLOW}[WARN]${NC} $1"
}
log_error() {
    echo -e "${RED}[ERROR]${NC} $1"
}
# Parse arguments
while getopts "e:h" opt; do
    case $opt in
        e) EMAIL="$OPTARG" ;;
        h) usage ;;
        *) usage ;;
    esac
done
if [[ -z "$EMAIL" ]]; then
    log_error "Email address is required"
    usage
fi
# Email validation
if [[ ! "$EMAIL" =~ ^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$ ]]; then
    log_error "Invalid email address format"
    exit 1
fi
# Check if running as root
if [[ $EUID -ne 0 ]]; then
    log_error "This script must be run as root"
    exit 1
fi
echo -e "${BLUE}S.M.A.R.T Monitoring Setup${NC}"
echo "Email: $EMAIL"
echo "Hostname: $HOSTNAME"
echo ""
# Detect distribution
echo -e "${BLUE}[1/8]${NC} Detecting system..."
if [ -f /etc/os-release ]; then
    . /etc/os-release
    case "$ID" in
        ubuntu|debian)
            PKG_MGR="apt"
            PKG_UPDATE="apt update -qq"
            PKG_INSTALL="apt install -y"
            MAIL_PKG="mailutils"
            ;;
        almalinux|rocky|centos|rhel|ol|fedora)
            PKG_MGR="dnf"
            # Refresh metadata only; "dnf update" would upgrade every installed package
            PKG_UPDATE="dnf -q makecache"
            PKG_INSTALL="dnf install -y"
            MAIL_PKG="mailx"
            ;;
        amzn)
            PKG_MGR="yum"
            # Refresh metadata only; "yum update" would upgrade every installed package
            PKG_UPDATE="yum -q makecache"
            PKG_INSTALL="yum install -y"
            MAIL_PKG="mailx"
            ;;
        *)
            log_error "Unsupported distribution: $ID"
            exit 1
            ;;
    esac
    log_info "Detected: $PRETTY_NAME ($PKG_MGR)"
else
    log_error "Cannot detect distribution (/etc/os-release not found)"
    exit 1
fi
# Update package cache
echo -e "${BLUE}[2/8]${NC} Updating package cache..."
$PKG_UPDATE
# Install smartmontools and mail utilities
echo -e "${BLUE}[3/8]${NC} Installing smartmontools and mail utilities..."
$PKG_INSTALL smartmontools $MAIL_PKG
# Install and configure postfix for email
echo -e "${BLUE}[4/8]${NC} Configuring mail system..."
if [[ "$PKG_MGR" == "apt" ]]; then
    DEBIAN_FRONTEND=noninteractive $PKG_INSTALL postfix
    # Configure postfix as internet site
    echo "$HOSTNAME" > /etc/mailname
    postconf -e "myhostname = $HOSTNAME"
    postconf -e "mydestination = $HOSTNAME, localhost"
    postconf -e "relayhost ="
else
    $PKG_INSTALL postfix
    postconf -e "myhostname = $HOSTNAME"
    postconf -e "mydestination = $HOSTNAME, localhost"
    postconf -e "relayhost ="
fi
systemctl enable postfix
systemctl start postfix
# Scan for storage devices
echo -e "${BLUE}[5/8]${NC} Scanning for storage devices..."
log_info "Available storage devices:"
smartctl --scan || true
# Read up to 10 device paths into an array, one per line of scan output
mapfile -t DEVICES < <(smartctl --scan | awk '{print $1}' | head -10)
if [[ ${#DEVICES[@]} -eq 0 ]]; then
    log_warn "No S.M.A.R.T capable devices found"
fi
for device in "${DEVICES[@]}"; do
    if [[ -e "$device" ]]; then
        health=$(smartctl --health "$device" 2>/dev/null | grep "SMART overall-health" || echo "Unknown")
        log_info "  $device: $health"
    fi
done
# Create custom alert script
echo -e "${BLUE}[6/8]${NC} Creating custom alert script..."
cat > /usr/local/bin/smart-alert.sh << 'EOF'
#!/bin/bash
# Enhanced S.M.A.R.T alert script
# Usage: called by smartd (-M exec) when issues are detected
# smartd exports the alert details as SMARTD_* environment variables
DEVICE="${SMARTD_DEVICE:-unknown}"
MSG="${SMARTD_MESSAGE:-no message provided}"
ADDRESS="${SMARTD_ADDRESS:-root}"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
HOSTNAME=$(hostname -f)
LOGFILE="/var/log/smartd-alerts.log"
# Log the alert
echo "[$TIMESTAMP] $HOSTNAME: $DEVICE - $MSG" >> "$LOGFILE"
# Get detailed S.M.A.R.T information
SMART_INFO=$(smartctl -a "$DEVICE" 2>/dev/null)
HEALTH_STATUS=$(smartctl -H "$DEVICE" 2>/dev/null | grep "SMART overall-health" || echo "Status unknown")
# Create detailed email
EMAIL_BODY="S.M.A.R.T Alert - $HOSTNAME
Timestamp: $TIMESTAMP
Device: $DEVICE
Message: $MSG
Health Status: $HEALTH_STATUS
Full S.M.A.R.T Data:
$SMART_INFO
Please check the drive immediately and consider replacement if errors persist."
# Send email (SMARTD_ADDRESS may hold several space-separated addresses, so it is left unquoted)
echo "$EMAIL_BODY" | mail -s "[ALERT] S.M.A.R.T Issue on $HOSTNAME - $DEVICE" $ADDRESS
# Log to syslog
logger -t smartd-alert "S.M.A.R.T issue on $DEVICE: $MSG"
EOF
chmod 755 /usr/local/bin/smart-alert.sh
chown root:root /usr/local/bin/smart-alert.sh
# Configure smartd
echo -e "${BLUE}[7/8]${NC} Configuring smartd daemon..."
cp /etc/smartd.conf /etc/smartd.conf.backup 2>/dev/null || true
cat > /etc/smartd.conf << EOF
# smartd configuration for automated monitoring
# Generated by smart-setup script on $(date)
# Global settings - scan for all devices
DEVICESCAN -d removable -n standby -s (S/../.././02|L/../../6/03) -m $EMAIL -M exec /usr/local/bin/smart-alert.sh
# Specific device monitoring (uncomment and modify as needed)
EOF
# Add specific devices if found
for device in "${DEVICES[@]}"; do
    if [[ -e "$device" ]]; then
        echo "# $device -a -d auto -n standby,q -s (S/../.././02|L/../../6/03) -m $EMAIL -M exec /usr/local/bin/smart-alert.sh" >> /etc/smartd.conf
    fi
done
# Create log file
touch /var/log/smartd-alerts.log
chmod 644 /var/log/smartd-alerts.log
chown root:root /var/log/smartd-alerts.log
# Enable and start smartd
systemctl enable smartd
systemctl restart smartd
# Start services and final checks
echo -e "${BLUE}[8/8]${NC} Performing final verification..."
# Test email functionality
log_info "Testing email functionality..."
echo "S.M.A.R.T monitoring setup completed on $HOSTNAME at $(date)" | mail -s "S.M.A.R.T Monitoring Setup Complete" "$EMAIL" || log_warn "Email test may have failed"
# Check service status
if systemctl is-active --quiet smartd; then
    log_info "smartd service is running"
else
    log_error "smartd service failed to start"
    exit 1
fi
if systemctl is-active --quiet postfix; then
    log_info "postfix service is running"
else
    log_warn "postfix service is not running - email alerts may not work"
fi
# Display configuration summary
echo ""
echo -e "${GREEN}=== Setup Complete ===${NC}"
echo "Email alerts: $EMAIL"
echo "Monitored devices: ${#DEVICES[@]} found"
echo "Config file: /etc/smartd.conf"
echo "Alert script: /usr/local/bin/smart-alert.sh"
echo "Log file: /var/log/smartd-alerts.log"
echo ""
echo "S.M.A.R.T tests scheduled:"
echo " - Short test: Daily at 2:00 AM"
echo " - Long test: Weekly on Saturday at 3:00 AM"
echo ""
log_info "Monitor logs with: tail -f /var/log/smartd-alerts.log"
log_info "Manual device check: smartctl -a /dev/sda"
log_info "Check service status: systemctl status smartd"
Review the script before running. Save it as install.sh and run it as root with the required alert address: sudo bash install.sh -e admin@example.com