High availability infrastructure: patterns, trade-offs and what to actually build.

High availability is not a product you buy. It is a set of architectural decisions that each add capability and cost. The right design depends on what you are protecting against and what an hour of downtime actually costs your business.

Frequently asked questions

Is 99.99% uptime really achievable with a single cloud provider?

Yes, as long as the design places components across multiple availability zones within the region. 99.99% (about 52 minutes of downtime per year) is comfortably within reach on a well-architected single-region, multi-AZ setup. Going to 99.999% generally requires multi-region, which roughly doubles the operational cost.
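The downtime budgets behind those percentages are simple arithmetic. A quick sketch of the per-year budget for each availability tier:

```python
# Downtime budget per availability target, assuming a 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(availability: float) -> float:
    """Maximum minutes of downtime per year at a given availability fraction."""
    return (1 - availability) * MINUTES_PER_YEAR

for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.3%} -> {downtime_budget_minutes(target):.1f} min/year")
# 99.900% -> 525.6 min/year
# 99.990% -> 52.6 min/year
# 99.999% -> 5.3 min/year
```

Note how each extra nine shrinks the budget tenfold: at 99.999% you have about five minutes per year, which is less time than most on-call humans need to even acknowledge a page. That is why the last nine forces automation and, usually, multi-region.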

How do we measure our actual uptime vs. the target?

External synthetic monitoring from multiple geographic locations — never measure your uptime from inside your own infrastructure. We run minute-granularity checks against the user-facing endpoints, log every failure, and publish a transparent monthly uptime report per client environment.
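A single external probe is only a few lines of code. A minimal sketch, using only the standard library and a hypothetical health endpoint; a real deployment runs this on a scheduler from several geographic locations and ships the results to a log store:

```python
# Minimal synthetic check: probe a user-facing endpoint from outside the
# infrastructure, recording status, latency, and failures.
# The endpoint URL and 5-second timeout below are illustrative assumptions.
import time
import urllib.error
import urllib.request

def synthetic_check(url: str, timeout: float = 5.0) -> dict:
    """Probe `url` once and return a structured result for logging."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 400
            status = resp.status
    except (urllib.error.URLError, OSError) as exc:
        ok, status = False, repr(exc)  # connection refused, DNS failure, timeout
    return {
        "url": url,
        "ok": ok,
        "status": status,
        "latency_ms": round((time.monotonic() - start) * 1000, 1),
        "checked_at": time.time(),
    }

# result = synthetic_check("https://example.com/healthz")  # hypothetical endpoint
```

Logging every result, not just failures, is what makes the monthly uptime report verifiable rather than anecdotal.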

What is the biggest single cause of failed failover in practice?

Untested capacity headroom. The secondary node is configured correctly and replication is healthy, but when it takes over the full load at 9am on a Monday, it turns out to be undersized. Every failover plan needs a documented N+1 capacity check that gets revalidated every quarter.
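The N+1 check itself is a small calculation once you have peak-load and capacity numbers from your metrics system. A sketch under illustrative assumptions (per-node peak loads and a uniform rated capacity, with a 20% spare-headroom requirement):

```python
# Quarterly N+1 capacity check: can the cluster survive losing its
# most-loaded node while keeping spare headroom? All numbers below are
# illustrative; real values come from your monitoring system's peaks.

def n_plus_one_ok(peak_loads: list[float], node_capacity: float,
                  headroom: float = 0.2) -> bool:
    """Return True if N-1 nodes can absorb the total peak load.

    After losing one node, the survivors must carry the whole peak while
    keeping `headroom` (20% by default) of their capacity unused.
    """
    if len(peak_loads) < 2:
        return False  # nothing to fail over to
    total_peak = sum(peak_loads)
    surviving_capacity = (len(peak_loads) - 1) * node_capacity
    return total_peak <= surviving_capacity * (1 - headroom)

# Two nodes rated at 100 units, each peaking at 45: failover leaves one
# node carrying 90 units against 80 usable (100 minus 20% headroom),
# so the check fails even though replication was perfectly healthy.
print(n_plus_one_ok([45, 45], node_capacity=100))  # False
print(n_plus_one_ok([30, 30], node_capacity=100))  # True
```

The point of revalidating quarterly is that `peak_loads` drifts upward as the business grows, so a check that passed at go-live can silently fail a year later.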

Do we need multi-region for HA?

Almost never for mid-market platforms. Multi-AZ within a single region protects against data-centre-level failure, which is the realistic failure mode. Multi-region protects against regional cloud-provider outages, which are rare enough that the ongoing operational cost usually exceeds the expected benefit. Regulatory or latency requirements are the two cases where multi-region is genuinely needed.

Designing or fixing an HA setup?

We run 99.99%-class infrastructure for European businesses. Tell us about your target and we will map out what it takes.

Talk to an engineer