Advanced Nomad job templates and deployment strategies with rolling updates and canary deployments

Advanced · 45 min · Apr 12, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Master production-grade Nomad job templates with HCL syntax, implement rolling updates with health checks, and deploy advanced blue-green and canary deployment patterns for resilient containerized workloads.

Prerequisites

  • Root or sudo access
  • At least 8GB RAM per node
  • 3+ node cluster recommended
  • Consul service discovery
  • Docker runtime

What this solves

Nomad job templates provide declarative infrastructure for container orchestration, but basic deployments lack the sophistication needed for production environments. This tutorial covers advanced HCL templating, rolling deployment strategies with health checks, blue-green deployments, and canary release patterns that ensure zero-downtime updates and automatic rollbacks.

Prerequisites and setup

Install Nomad cluster

Set up a multi-node Nomad cluster with Consul integration for service discovery and coordination.

# Debian/Ubuntu
wget -O- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install -y nomad consul

# RHEL-based (AlmaLinux, Rocky Linux)
sudo dnf install -y dnf-plugins-core
sudo dnf config-manager --add-repo https://rpm.releases.hashicorp.com/RHEL/hashicorp.repo
sudo dnf install -y nomad consul

Configure Consul server

Set up Consul for service discovery and distributed coordination between Nomad nodes.

datacenter = "dc1"
data_dir = "/opt/consul"
log_level = "INFO"
server = true
bootstrap_expect = 3
retry_join = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]
bind_addr = "{{ GetInterfaceIP \"eth0\" }}"
client_addr = "0.0.0.0"
ui_config {
  enabled = true
}
connect {
  enabled = true
}
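Nomad client nodes also need a local Consul agent, running in client mode and joined to the same servers. A minimal sketch, reusing the illustrative server IPs from above:

```hcl
# /etc/consul.d/consul.hcl on Nomad client nodes (values are illustrative)
datacenter  = "dc1"
data_dir    = "/opt/consul"
server      = false
retry_join  = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]
bind_addr   = "{{ GetInterfaceIP \"eth0\" }}"
client_addr = "127.0.0.1"
```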

Configure Nomad server

Configure Nomad server nodes with Consul integration and cluster coordination.

datacenter = "dc1"
data_dir = "/opt/nomad/data"
bind_addr = "0.0.0.0"

server {
  enabled = true
  bootstrap_expect = 3
  server_join {
    retry_join = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]
  }
}

consul {
  address = "127.0.0.1:8500"
  server_service_name = "nomad"
  client_service_name = "nomad-client"
  auto_advertise = true
  server_auto_join = true
  client_auto_join = true
}

ui {
  enabled = true
}
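The scheduling examples later in this guide match on node metadata such as meta.instance_type and meta.availability_zone; that metadata must be declared in each client's configuration. A minimal client config sketch (all values illustrative):

```hcl
# /etc/nomad.d/client.hcl on worker nodes
datacenter = "dc1"
data_dir   = "/opt/nomad/data"

client {
  enabled = true
  servers = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]

  # Node metadata referenced by constraints, affinities, and spreads in this guide
  meta {
    instance_type      = "web"
    availability_zone  = "us-east-1a"
    storage_type       = "ssd"
    instance_lifecycle = "on-demand"
  }
}
```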

Start services

Enable and start Consul and Nomad services on all cluster nodes.

sudo systemctl enable --now consul
sudo systemctl enable --now nomad
sudo systemctl status consul nomad

Advanced HCL job templates

Create parameterized job template

Build a flexible job template using HCL variables and templating for reusable deployment patterns.

variable "app_name" {
  description = "Application name for service registration"
  type        = string
  default     = "web-app"
}

variable "app_version" {
  description = "Application version tag"
  type        = string
  default     = "latest"
}

variable "instance_count" {
  description = "Number of application instances"
  type        = number
  default     = 3
}

variable "resource_cpu" {
  description = "CPU allocation in MHz"
  type        = number
  default     = 500
}

variable "resource_memory" {
  description = "Memory allocation in MB"
  type        = number
  default     = 512
}

job "${var.app_name}" {
  datacenters = ["dc1"]
  type        = "service"
  priority    = 50

  constraint {
    attribute = "${attr.kernel.name}"
    value     = "linux"
  }

  constraint {
    attribute = "${meta.instance_type}"
    operator  = "regexp"
    value     = "(web|app)"
  }

  update {
    max_parallel      = 2
    health_check      = "checks"
    min_healthy_time  = "30s"
    healthy_deadline  = "5m"
    progress_deadline = "10m"
    auto_revert       = true
    canary           = 1
    stagger          = "30s"
  }

  group "${var.app_name}-group" {
    count = var.instance_count

    network {
      port "http" {
        static = 8080
      }
    }

    service {
      name = "${var.app_name}"
      port = "http"
      tags = [
        "version-${var.app_version}",
        "traefik.enable=true",
        "traefik.http.routers.${var.app_name}.rule=Host(`${var.app_name}.example.com`)"
      ]

      check {
        type     = "http"
        path     = "/health"
        interval = "10s"
        timeout  = "3s"
      }

      check {
        type     = "tcp"
        interval = "10s"
        timeout  = "3s"
      }
    }

    restart {
      attempts = 3
      interval = "5m"
      delay    = "15s"
      mode     = "fail"
    }

    task "${var.app_name}-task" {
      driver = "docker"

      config {
        image = "nginx:${var.app_version}"
        ports = ["http"]
        
        mount {
          type   = "bind"
          source = "local/nginx.conf"
          target = "/etc/nginx/nginx.conf"
        }
      }

      template {
        data = <<-EOH
        worker_processes auto;
        events {
            worker_connections 1024;
        }
        http {
            server {
                listen 8080;
                location /health {
                    access_log off;
                    return 200 "healthy";
                    add_header Content-Type text/plain;
                }
                location / {
                    root /usr/share/nginx/html;
                    index index.html;
                }
            }
        }
        EOH
        destination = "local/nginx.conf"
      }

      resources {
        cpu    = var.resource_cpu
        memory = var.resource_memory
      }

      env {
        APP_VERSION = "${var.app_version}"
        DATACENTER  = "${node.datacenter}"
        NODE_NAME   = "${attr.unique.hostname}"
      }

      logs {
        max_files     = 10
        max_file_size = 15
      }
    }
  }
}

Deploy with variables

Submit the job with custom variable values for different environments and configurations.

nomad job run \
  -var="app_name=frontend" \
  -var="app_version=v2.1.0" \
  -var="instance_count=5" \
  -var="resource_cpu=800" \
  -var="resource_memory=1024" \
  web-app-template.nomad
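For repeatable per-environment deployments, the same variables can live in a var-file instead of the command line. A sketch (filename and values are illustrative):

```hcl
# production.vars
app_name        = "frontend"
app_version     = "v2.1.0"
instance_count  = 5
resource_cpu    = 800
resource_memory = 1024
```

Submit it with nomad job run -var-file=production.vars web-app-template.nomad.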

Rolling deployment strategies

Configure advanced update blocks

Implement sophisticated rolling update strategies with health checks and automatic rollback capabilities.

job "rolling-app" {
  datacenters = ["dc1"]
  type        = "service"

  meta {
    # Surfaced to the service tags below as NOMAD_META_version
    version = "1.0.0"
  }

  update {
    max_parallel      = 2
    health_check      = "checks"
    min_healthy_time  = "30s"
    healthy_deadline  = "5m"
    progress_deadline = "10m"
    auto_revert       = true
    auto_promote      = false
    canary           = 2
    stagger          = "30s"
  }

  group "app" {
    count = 6

    network {
      port "http" {}
    }

    service {
      name = "rolling-app"
      port = "http"
      tags = ["version-${NOMAD_META_version}"]

      check {
        name     = "HTTP Health"
        type     = "http"
        path     = "/health"
        interval = "10s"
        timeout  = "3s"
        check_restart {
          limit           = 3
          grace           = "10s"
          ignore_warnings = false
        }
      }

      check {
        name     = "Application Ready"
        type     = "script"
        task     = "web"
        command  = "/bin/sh"
        args     = ["-c", "curl -f http://localhost:${NOMAD_PORT_http}/ready"]
        interval = "15s"
        timeout  = "5s"
      }
    }

    task "web" {
      driver = "docker"
      
      config {
        image = "nginx:latest"
        ports = ["http"]
      }

      resources {
        cpu    = 500
        memory = 512
      }

      kill_timeout = "30s"
      kill_signal  = "SIGTERM"

      shutdown_delay = "5s"
    }
  }
}

Monitor rolling deployment

Track deployment progress and health status during rolling updates.

nomad job run rolling-update.nomad
nomad job status rolling-app
nomad job deployments rolling-app
nomad deployment status <deployment-id>
nomad deployment promote <deployment-id>
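If you prefer hands-off rollouts, the update block's auto_promote flag promotes canaries automatically once they pass their health checks, so no manual nomad deployment promote is needed. A sketch:

```hcl
update {
  max_parallel     = 2
  canary           = 2
  health_check     = "checks"
  min_healthy_time = "30s"
  auto_promote     = true  # promote canaries automatically once healthy
  auto_revert      = true  # roll back automatically if the new version never becomes healthy
}
```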

Blue-green deployment pattern

Create blue-green job template

Implement blue-green deployments using Nomad job versioning and service discovery integration.

variable "environment" {
  description = "Deployment environment (blue or green)"
  type        = string
  default     = "blue"
}

variable "app_version" {
  description = "Application version"
  type        = string
}

job "app-${var.environment}" {
  datacenters = ["dc1"]
  type        = "service"

  meta {
    environment = "${var.environment}"
    version     = "${var.app_version}"
  }

  group "app" {
    count = 3

    network {
      port "http" {}
    }

    service {
      name = "app-${var.environment}"
      port = "http"
      tags = [
        "environment-${var.environment}",
        "version-${var.app_version}",
        "traefik.enable=false"
      ]

      meta {
        environment = "${var.environment}"
        version     = "${var.app_version}"
      }

      check {
        type     = "http"
        path     = "/health"
        interval = "10s"
        timeout  = "3s"
      }

      check {
        name     = "Deep Health Check"
        type     = "http"
        path     = "/health/deep"
        interval = "30s"
        timeout  = "10s"
      }
    }

    task "app" {
      driver = "docker"

      config {
        image = "myapp:${var.app_version}"
        ports = ["http"]
      }

      env {
        ENVIRONMENT = "${var.environment}"
        VERSION     = "${var.app_version}"
      }

      resources {
        cpu    = 1000
        memory = 1024
      }
    }
  }
}

Load balancer configuration job

job "app-router" {
  datacenters = ["dc1"]
  type        = "service"

  group "router" {
    count = 1

    network {
      port "http" {
        static = 80
      }
    }

    service {
      name = "app-router"
      port = "http"
      tags = [
        "traefik.enable=true",
        "traefik.http.routers.app.rule=Host(`app.example.com`)"
      ]
    }

    task "traefik" {
      driver = "docker"

      config {
        image = "traefik:v2.10"
        ports = ["http"]
        args = [
          "--api.dashboard=true",
          "--providers.consulcatalog.endpoint.address=127.0.0.1:8500",
          "--providers.consulcatalog.exposedByDefault=false",
          "--entrypoints.web.address=:80"
        ]
      }

      resources {
        cpu    = 200
        memory = 256
      }
    }
  }
}

Execute blue-green deployment

Deploy to the inactive environment and switch traffic after validation.

# Deploy green environment
nomad job run -var="environment=green" -var="app_version=v2.0.0" blue-green.nomad

Verify green deployment

nomad job status app-green
consul catalog services

Switch traffic by updating the service tags: edit the green service in the job file so traefik.enable is true, then re-submit

nomad job run -var="environment=green" -var="app_version=v2.0.0" blue-green.nomad

Stop blue environment after validation

nomad job stop app-blue
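To make the traffic switch itself declarative, the router tag can be driven by a variable instead of edited by hand; HCL2 conditional expressions make this a one-flag change. A sketch (the live variable is an assumption, not part of the template above):

```hcl
variable "live" {
  description = "Whether this environment receives router traffic"
  type        = bool
  default     = false
}

# In the service block of blue-green.nomad, derive the Traefik tag:
#   tags = [
#     "environment-${var.environment}",
#     "version-${var.app_version}",
#     var.live ? "traefik.enable=true" : "traefik.enable=false"
#   ]
```

Deploying with -var="live=true" then flips traffic to that environment on the next job run.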

Canary deployment implementation

Advanced canary configuration

Implement sophisticated canary deployments with traffic splitting and automatic promotion based on metrics.

job "canary-app" {
  datacenters = ["dc1"]
  type        = "service"

  meta {
    # Surfaced to the service tags below as NOMAD_META_version
    version = "2.0.0"
  }

  update {
    max_parallel      = 1
    health_check      = "checks"
    min_healthy_time  = "1m"
    healthy_deadline  = "10m"
    progress_deadline = "15m"
    auto_revert       = true
    auto_promote      = false
    canary           = 2
    stagger          = "1m"
  }

  group "app" {
    count = 6

    network {
      port "http" {}
      port "metrics" {}
    }

    service {
      name = "canary-app"
      port = "http"
      tags = [
        "version-${NOMAD_META_version}",
        "traefik.enable=true",
        "traefik.http.routers.canary-app.rule=Host(`app.example.com`)",
        "traefik.http.services.canary-app.loadbalancer.healthcheck.path=/health"
      ]

      canary_tags = [
        "version-${NOMAD_META_version}",
        "canary",
        "traefik.enable=true",
        "traefik.http.routers.canary-app-canary.rule=Host(`app.example.com`) && Headers(`X-Canary`, `true`)"
      ]
      # Percentage-based traffic splitting requires Traefik's weighted
      # round robin, which is set in dynamic configuration, not via tags.

      check {
        type     = "http"
        path     = "/health"
        interval = "10s"
        timeout  = "3s"
      }

      check {
        name     = "Readiness Check"
        type     = "http"
        path     = "/ready"
        interval = "15s"
        timeout  = "5s"
      }

      check {
        name     = "Error Rate Check"
        type     = "script"
        task     = "app"
        command  = "/bin/sh"
        args = [
          "-c",
          "curl -s http://localhost:${NOMAD_PORT_metrics}/metrics | grep 'error_rate' | awk '{print $2}' | awk '$1 < 0.05 {exit 0} {exit 1}'"
        ]
        interval = "30s"
        timeout  = "10s"
      }
    }

    task "app" {
      driver = "docker"

      config {
        image = "myapp:latest"
        ports = ["http", "metrics"]
      }

      template {
        data = <<-EOH
        #!/bin/bash
        # Automated canary promotion script
        set -e

        DEPLOYMENT_ID=$(nomad deployment list -json | jq -r '.[0].ID')
        export DEPLOYMENT_ID

        # Wait for canary instances to be healthy
        echo "Waiting for canary instances..."
        timeout 300 bash -c 'until [ "$(nomad deployment status -json "$DEPLOYMENT_ID" | jq -r ".TaskGroups.app.HealthyAllocs")" -ge 2 ]; do sleep 10; done'

        # Monitor metrics for 5 minutes
        echo "Monitoring canary metrics..."
        METRICS_URL="http://localhost:{{ env "NOMAD_PORT_metrics" }}/metrics"
        for i in {1..30}; do
          ERROR_RATE=$(curl -s "$METRICS_URL" | grep 'error_rate' | awk '{print $2}')
          RESPONSE_TIME=$(curl -s "$METRICS_URL" | grep 'response_time_p95' | awk '{print $2}')

          if (( $(echo "$ERROR_RATE > 0.05" | bc -l) )) || (( $(echo "$RESPONSE_TIME > 1000" | bc -l) )); then
            echo "Canary metrics failed threshold, reverting deployment"
            nomad deployment fail "$DEPLOYMENT_ID"
            exit 1
          fi

          sleep 10
        done

        # Promote if all checks pass
        echo "Canary validation successful, promoting deployment"
        nomad deployment promote "$DEPLOYMENT_ID"
        EOH
        # Render into the shared alloc/ directory so the canary-monitor task can read it
        destination = "alloc/canary-check.sh"
        perms       = "755"
      }

      resources {
        cpu    = 500
        memory = 512
      }

      env {
        ENABLE_METRICS = "true"
        METRICS_PORT   = "${NOMAD_PORT_metrics}"
      }
    }

    task "canary-monitor" {
      driver = "exec"
      
      config {
        command = "/bin/bash"
        # Script rendered by the app task into the shared alloc/ directory
        args    = ["alloc/canary-check.sh"]
      }
      
      lifecycle {
        hook    = "poststart"
        sidecar = false
      }

      resources {
        cpu    = 100
        memory = 128
      }
    }
  }
}

Deploy and monitor canary

Execute canary deployment with automated monitoring and promotion logic.

# Deploy canary version
nomad job run canary-deployment.nomad

Monitor deployment progress

nomad job deployments canary-app
watch -n 5 'nomad deployment status <deployment-id>'

Manual promotion if needed

nomad deployment promote <deployment-id>

Manual rollback if issues detected

nomad deployment fail <deployment-id>

Advanced scheduling and constraints

Implement complex affinity rules

Configure advanced scheduling constraints using node attributes, metadata, and affinity rules for optimal placement.

job "distributed-app" {
  datacenters = ["dc1"]
  type        = "service"

  # Limit how many allocations may share one availability zone
  constraint {
    attribute = "${meta.availability_zone}"
    operator  = "distinct_property"
    value     = "3"
  }

  # Require SSD storage
  constraint {
    attribute = "${meta.storage_type}"
    value     = "ssd"
  }

  # Avoid spot instances for critical workloads
  constraint {
    attribute = "${meta.instance_lifecycle}"
    operator  = "!="
    value     = "spot"
  }

  affinity {
    attribute = "${meta.instance_type}"
    value     = "compute-optimized"
    weight    = 80
  }

  affinity {
    attribute = "${node.class}"
    value     = "production"
    weight    = 100
  }

  # Anti-affinity to spread across nodes
  spread {
    attribute = "${node.unique.id}"
    weight    = 100
  }

  # Spread across availability zones
  spread {
    attribute = "${meta.availability_zone}"
    weight    = 80
    target "us-east-1a" {
      percent = 34
    }
    target "us-east-1b" {
      percent = 33
    }
    target "us-east-1c" {
      percent = 33
    }
  }

  group "web" {
    count = 9

    # Group-level constraints
    constraint {
      attribute = "${attr.cpu.arch}"
      value     = "amd64"
    }

    # Ensure minimum resources available
    constraint {
      attribute = "${attr.memory.totalbytes}"
      operator  = ">="
      value     = "8589934592" # 8GB
    }

    network {
      port "http" {}
    }

    service {
      name = "distributed-web"
      port = "http"
      tags = [
        "zone-${meta.availability_zone}",
        "instance-${meta.instance_type}"
      ]

      check {
        type     = "http"
        path     = "/health"
        interval = "10s"
        timeout  = "3s"
      }
    }

    task "web" {
      driver = "docker"

      config {
        image = "nginx:latest"
        ports = ["http"]
      }

      resources {
        cpu    = 1000
        memory = 2048
        
        device "nvidia/gpu" {
          count = 1
          
          constraint {
            attribute = "${device.attr.memory}"
            operator  = ">="
            value     = "8GiB"
          }
        }
      }

      env {
        AVAILABILITY_ZONE = "${meta.availability_zone}"
        INSTANCE_TYPE     = "${meta.instance_type}"
        NODE_ID           = "${node.unique.id}"
      }
    }
  }
}
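Note the difference between the two distinct operators: distinct_hosts takes no attribute and forces one allocation per client node, while distinct_property caps how many allocations may share one value of an attribute. A sketch of both:

```hcl
# One allocation per client node
constraint {
  operator = "distinct_hosts"
}

# At most 3 allocations per availability zone value
constraint {
  attribute = "${meta.availability_zone}"
  operator  = "distinct_property"
  value     = "3"
}
```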

Monitoring and integration

Configure deployment monitoring

Set up comprehensive monitoring for deployments with Consul and external monitoring systems. For detailed monitoring setup, see our guide on monitoring Consul with Prometheus and Grafana.

#!/bin/bash
# Deployment monitoring script
set -e

JOB_NAME="$1"
DEPLOYMENT_ID=$(nomad job deployments -json "$JOB_NAME" | jq -r '.[0].ID')
echo "Monitoring deployment: $DEPLOYMENT_ID"

while true; do
  STATUS=$(nomad deployment status -json "$DEPLOYMENT_ID" | jq -r '.Status')
  HEALTHY=$(nomad deployment status -json "$DEPLOYMENT_ID" | jq -r '.TaskGroups | to_entries[] | .value.HealthyAllocs')
  DESIRED=$(nomad deployment status -json "$DEPLOYMENT_ID" | jq -r '.TaskGroups | to_entries[] | .value.DesiredTotal')
  echo "Status: $STATUS, Healthy: $HEALTHY/$DESIRED"

  if [ "$STATUS" = "successful" ]; then
    echo "Deployment completed successfully"
    exit 0
  elif [ "$STATUS" = "failed" ]; then
    echo "Deployment failed"
    exit 1
  fi

  sleep 10
done

Set up automated testing

Implement automated testing pipeline for deployment validation and health verification.

#!/bin/bash
# Automated deployment testing
set -e

SERVICE_NAME="$1"
TEST_URL="$2"
echo "Testing deployment for service: $SERVICE_NAME"

# Wait for service registration
echo "Waiting for service registration..."
timeout 120 bash -c "until consul catalog services | grep -q $SERVICE_NAME; do sleep 5; done"

# Get service endpoints from the Consul catalog API
SERVICE_IPS=$(curl -s "http://127.0.0.1:8500/v1/catalog/service/$SERVICE_NAME" | jq -r '.[].ServiceAddress')

for IP in $SERVICE_IPS; do
  echo "Testing endpoint: http://$IP:8080"

  # Health check
  curl -f "http://$IP:8080/health" || exit 1

  # Load test
  ab -n 100 -c 10 "http://$IP:8080$TEST_URL" || exit 1

  # Response time check
  RESPONSE_TIME=$(curl -o /dev/null -s -w '%{time_total}' "http://$IP:8080$TEST_URL")
  if (( $(echo "$RESPONSE_TIME > 1.0" | bc -l) )); then
    echo "Response time too high: $RESPONSE_TIME seconds"
    exit 1
  fi
done

echo "All tests passed for $SERVICE_NAME"

Verify your setup

# Check Nomad cluster status
nomad server members
nomad node status

# Verify job deployments
nomad job status
nomad deployment list

# Check service discovery
consul catalog services
consul members

# Test service endpoints
curl -I http://app.example.com/health

# Monitor cluster activity
nomad monitor -log-level=INFO

Common issues

Symptom | Cause | Fix
Deployment stuck in progress | Health checks failing | Check service logs with nomad alloc logs [ALLOC_ID] and verify health check endpoints
Canary not auto-promoting | Metrics threshold not met | Review canary metrics and adjust thresholds in the job specification
Service not registering in Consul | Consul agent connectivity | Verify the Consul agent is running and reachable: consul members
Scheduling constraints not working | Node metadata missing | Add the required metadata to client nodes: meta { instance_type = "web" }
Rolling update failures | Insufficient healthy instances | Increase min_healthy_time and verify resource availability
Blue-green switch not working | Load balancer configuration | Update service tags and verify Traefik or load balancer rules
