High availability infrastructure: patterns, trade-offs and what to actually build.
High availability is not a product you buy. It is a set of architectural decisions that each add capability and cost. The right design depends on what you are protecting against and what an hour of downtime actually costs your business.
The uptime ladder
The number of nines you need decides everything else. This table shows the practical operational cost of each uptime target: the annual downtime allowance and the typical architecture required to hit it.
| Uptime target | Annual downtime | Typical architecture |
|---|---|---|
| 99.0% ("two nines") | 3.65 days | Single server, manual recovery |
| 99.9% ("three nines") | 8.76 hours | Redundant services, monitored recovery |
| 99.95% | 4.38 hours | Multi-AZ, automatic failover for stateless services |
| 99.99% ("four nines") | 52 minutes | Full redundancy, database replication, active monitoring |
| 99.999% ("five nines") | 5.26 minutes | Multi-region, synchronous replication, chaos-tested |
Most mid-market production platforms target 99.99%. Going to five nines roughly doubles the architectural complexity and operational cost. It is rarely the right trade-off unless downtime has regulatory consequences.
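The downtime allowances in the table fall straight out of the arithmetic. A quick illustrative sketch:

```python
# Convert an uptime target into its annual downtime allowance.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(uptime_pct: float) -> float:
    """Annual downtime allowance in minutes for a given uptime target."""
    return MINUTES_PER_YEAR * (1 - uptime_pct / 100)

for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.1f} min/year")
# 99.99% -> 52.6 min/year; 99.999% -> 5.3 min/year
```

Note how each extra nine shrinks the budget by a factor of ten while the architecture needed to stay inside it gets disproportionately more expensive.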
The layers where redundancy matters
Downtime usually comes from a single point of failure. The purpose of a high-availability architecture is to remove single points of failure at every layer of the stack. In order of impact:
- Network & DNS — Anycast DNS, multiple providers, short TTL for failover. A single DNS provider outage took down a quarter of the web in 2016.
- Load balancers — Active/active LB pair with health checks. A single LB is a bigger risk than a single application server because everything funnels through it.
- Application layer — Multiple stateless instances behind the load balancer. This is the easiest layer to make redundant because state lives elsewhere.
- Database — Primary with synchronous or asynchronous replica. Synchronous replication adds write latency but enables automatic failover without data loss. Asynchronous is faster but accepts a small window of data loss on failover.
- Cache & queue — Redis Sentinel or Cluster for cache. RabbitMQ mirrored queues or Kafka replication for queues. Losing the cache is recoverable; losing the queue often means lost work.
- Storage — Replicated block storage for databases, redundant object storage for assets.
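Why per-layer redundancy pays off can be sketched numerically. Assuming independent failures (a simplification — correlated failures are common in practice), availabilities multiply across layers in series, while redundant instances within a layer fail together only if all of them fail:

```python
# Illustrative: end-to-end availability of a stack, with and without
# per-layer redundancy, under an independent-failure assumption.

def parallel(avail: float, n: int) -> float:
    """Availability of n redundant instances (all must fail together)."""
    return 1 - (1 - avail) ** n

def serial(layers: list[float]) -> float:
    """A request traverses every layer, so availabilities multiply."""
    total = 1.0
    for a in layers:
        total *= a
    return total

single = [0.999] * 5                 # five layers, one instance each
redundant = [parallel(0.999, 2)] * 5  # two instances per layer

print(f"single:    {serial(single):.5f}")     # ~0.99501 (worse than any layer)
print(f"redundant: {serial(redundant):.7f}")  # ~0.9999950
```

The single-instance stack is less available than any one of its layers, which is exactly why one weak layer dominates the whole design.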
Failover modes, and what each actually costs
Not all failover is equal. The mode you choose decides how much operational overhead you are accepting in exchange for the capability.
- Manual failover — Cheap to set up, expensive when it matters. Someone has to be awake, reachable and confident enough to execute the runbook. Tends to take 20–60 minutes during an incident.
- Automated failover with manual verification — Software detects the failure and initiates the switch, but a human confirms. Good middle ground: 2–5 minutes to switch, low false-positive risk.
- Fully automated failover — Lowest downtime (<30 seconds), highest complexity. Risk: a false-positive failover (network blip misread as total failure) causes more disruption than the original issue would have. Needs thorough monitoring and a solid runbook for the follow-up.
- Active/active — Both halves take traffic; a failure just shifts load. No failover event per se. Requires the application to be stateless or to handle distributed state explicitly.
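The false-positive risk in automated failover is usually mitigated by requiring several consecutive failed health checks before acting, so a single network blip does not trigger a switch. A minimal sketch of that detection logic (the pass/fail results would come from a real probe; this is illustrative, not a production failover controller):

```python
# Require `threshold` consecutive failed health checks before failing over,
# so a transient blip is not misread as a total failure.

def should_fail_over(health_results, threshold: int = 3) -> bool:
    """Return True once `threshold` consecutive checks have failed."""
    consecutive_failures = 0
    for check_passed in health_results:
        consecutive_failures = 0 if check_passed else consecutive_failures + 1
        if consecutive_failures >= threshold:
            return True
    return False

# A blip that recovers between checks never triggers failover:
print(should_fail_over([True, False, True, False, True]))  # False
# A sustained failure does:
print(should_fail_over([True, False, False, False]))       # True
```

The threshold trades detection speed for false-positive resistance: three checks at a 5-second interval means roughly 15 seconds of confirmed failure before the switch begins.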
Capacity planning as an HA concern
An often-overlooked dimension of availability: if your redundant node does not have enough capacity to take over the load of the failed one, failover creates a second outage. Every HA design needs a documented N+1 capacity check: with one node down, can the remaining infrastructure handle peak traffic without degradation? This is why we do quarterly capacity reviews on every managed environment — traffic patterns shift, and a design that was N+1 a year ago can silently slip to N+0.
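A minimal sketch of what that N+1 check looks like in practice. The capacity figures and the 80% utilisation ceiling are illustrative assumptions, not universal constants:

```python
# N+1 capacity check: with one node down, can the survivors absorb
# observed peak traffic while staying below a utilisation ceiling?

def n_plus_one_ok(node_capacity_rps: float, node_count: int,
                  peak_rps: float, max_utilisation: float = 0.8) -> bool:
    """True if peak load fits on node_count - 1 nodes below the ceiling
    (headroom matters: failover often comes with a retry storm)."""
    surviving_capacity = node_capacity_rps * (node_count - 1)
    return peak_rps <= surviving_capacity * max_utilisation

# 3 nodes of 500 rps each, 700 rps peak: 2 survivors x 500 x 0.8 = 800 -> OK
print(n_plus_one_ok(500, 3, 700))  # True
# Traffic grows to 900 rps: the design has silently slipped to N+0
print(n_plus_one_ok(500, 3, 900))  # False
```

Re-running this with current peak-traffic numbers each quarter is the whole point: the arithmetic is trivial, but the inputs drift.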
What we actually build
For a typical 99.99% mid-market platform we design:
- Multi-AZ load balancer pair (active/active, Anycast DNS for health-check-driven regional routing if the workload crosses regions).
- 3+ stateless application nodes behind the LB, sized so any one can fail without capacity degradation.
- PostgreSQL primary with synchronous replica in the same region, asynchronous replica in a second region for disaster recovery.
- Redis Sentinel (3-node) for cache and session store, Cluster mode for workloads over ~10GB hot data.
- Object storage with cross-region replication for user assets.
- Continuous backup verification — every backup restored to a scratch environment on a schedule, so we know restores work before we need them.
- End-to-end synthetic monitoring plus real-user monitoring so we see degradation before users do.
This sits on our managed cloud platform with quarterly architecture reviews to catch drift.
Frequently asked questions
Is 99.99% uptime really achievable with a single cloud provider?
Yes, as long as the design places components across multiple availability zones within the region. 99.99% (52 minutes downtime per year) is comfortably within reach on a well-architected single-region, multi-AZ setup. Going to 99.999% generally requires multi-region, which roughly doubles the operational cost.
How do we measure our actual uptime vs. the target?
External synthetic monitoring from multiple geographic locations — never measure your uptime from inside your own infrastructure. We run minute-granularity checks against the user-facing endpoints, log every failure, and publish a transparent monthly uptime report per client environment.
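With minute-granularity checks, the monthly figure is simple arithmetic over the failed checks. A sketch (assumes one external check per minute; a real report should also account for partial degradation, not just hard failures):

```python
# Monthly uptime from minute-granularity synthetic checks:
# each failed check approximates one minute of downtime.

def monthly_uptime_pct(failed_checks: int, days_in_month: int = 30) -> float:
    total_checks = days_in_month * 24 * 60  # one check per minute
    return 100 * (1 - failed_checks / total_checks)

print(f"{monthly_uptime_pct(4):.4f}%")  # 4 failed minutes -> 99.9907%
```

At this granularity a sub-minute outage can be missed entirely, which is one reason to treat synthetic checks as a floor on detected downtime, not an exact measurement.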
What is the biggest single cause of failed failover in practice?
Untested capacity headroom. The pattern is almost always the same: the secondary node was configured correctly and replication was healthy, but when it took over the full load at 9am on a Monday, it was undersized. Every failover plan needs a documented N+1 capacity check that gets revalidated every quarter.
Do we need multi-region for HA?
Almost never for mid-market platforms. Multi-AZ within a single region protects against data-centre-level failure, which is the realistic failure mode. Multi-region protects against regional cloud-provider outages, which are rare enough that the ongoing operational cost usually exceeds the expected benefit. Regulatory or latency requirements are the two cases where multi-region is genuinely needed.
Designing or fixing an HA setup?
We run 99.99%-class infrastructure for European businesses. Tell us about your target and we will map out what it takes.
Talk to an engineer