Reliability

How a fintech platform achieved 99.97% uptime with graceful degradation and circuit breakers

Binadit Tech Team · Apr 23, 2026 · 11 min read

The situation: payment processing under extreme load

A European fintech platform processing €2.3 million in daily transactions was experiencing cascading failures during peak hours. Every morning between 8 and 10 AM and every evening between 6 and 8 PM, when users checked balances and made transfers, the entire platform would slow to a crawl or become completely unresponsive.

The platform served 45,000 active users across three core services: account management, payment processing, and transaction history. During normal hours, response times averaged 200ms. But during peaks, the system would either return 500 errors or time out completely, taking down unrelated features like user dashboards and account settings.

The business impact was severe. Each minute of downtime during peak hours meant €1,600 in lost transaction fees. Customer support tickets increased 340% on days with incidents. Worse, users were losing trust in the platform's reliability for critical financial operations.

When payment processing fails, users don't just get frustrated. They move their money elsewhere.

What we found during the audit

The platform's architecture revealed several critical weaknesses that created perfect conditions for cascading failures.

First, service dependencies were tightly coupled. When the payment processing service experienced high load, it would consume all available database connections. This starved other services like account lookups and transaction history, causing them to fail even though they had spare capacity.

Database connection pooling was misconfigured across services:

# Payment service - consuming too many connections
max_connections: 200
pool_size: 150

# Account service - fighting for remaining connections
pool_size: 50

# Transaction service - often getting zero connections
pool_size: 30

Second, there were no circuit breakers between services. When the payment API became slow, the dashboard would continue making requests, waiting for full timeouts. These hanging requests accumulated, consuming memory and file descriptors until the entire web application became unresponsive.

Third, external API calls had no fallback mechanisms. The platform integrated with three bank APIs for real-time balance checks. When any of these APIs became slow or unavailable, the entire account dashboard would fail to load, even for users who didn't need real-time data.

Monitoring showed the cascade pattern clearly. Payment processing latency would spike from 200ms to 8+ seconds. Within 2 minutes, account service response times would climb from 180ms to 4+ seconds. By minute 3, the entire platform would be returning timeouts or 500 errors.

The root cause wasn't insufficient capacity. The platform could handle the load when services remained isolated. The problem was that failures in one area immediately propagated everywhere else.

The approach we took and why

Rather than simply adding more servers, we focused on containing failures and maintaining partial functionality during incidents. The goal was to keep critical features working even when non-critical components failed.

Our approach had three core principles:

Fail fast, not slow. Instead of letting services wait for timeouts, we implemented circuit breakers that would immediately return cached data or graceful error states when dependencies became unhealthy.

Prioritize critical paths. We identified that payment processing was the most critical function, followed by account balance checks, then transaction history. During high load, we would degrade lower-priority features to preserve capacity for essential operations.

Design for partial failures. Every service interaction needed to handle three states: success, graceful degradation, and complete failure. The dashboard should never be blank just because one API is slow.

This approach differed from typical scaling strategies because we weren't just adding capacity. We were fundamentally changing how services interacted during stress conditions.

The key insight was that users could tolerate delayed transaction history or cached account balances, but they couldn't tolerate a completely unresponsive platform. Intermittent outages often cause more damage than predictable ones because they're harder for users to understand and work around.

Implementation details with specifics

We implemented circuit breakers using a combination of application-level logic and infrastructure patterns. Here's how each component worked:

Database connection isolation

First, we isolated database connections by service priority:

# Critical services (payment processing)
max_connections: 80
pool_size: 60

# Important services (account lookups)
max_connections: 40
pool_size: 30

# Nice-to-have services (transaction history)
max_connections: 20
pool_size: 15

This ensured payment processing always had dedicated database capacity, even when other services experienced high load.

Application-level circuit breakers

We implemented circuit breakers in the API gateway layer using a state-based approach. Each external dependency got its own circuit breaker configuration:

# Bank API circuit breaker
failure_threshold: 5
failure_window: 60000ms
timeout: 2000ms
reset_timeout: 30000ms
half_open_max_calls: 3

The circuit breaker tracked three states:

Closed: Normal operation. Requests pass through, failures are counted.

Open: After 5 failures in 60 seconds, the circuit opens. All requests immediately return cached data without hitting the external API.

Half-open: After 30 seconds, allow 3 test requests. If they succeed, close the circuit. If they fail, stay open for another 30 seconds.
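The three states above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's gateway code: the class name is invented, and for brevity it counts consecutive failures rather than failures within the 60-second window:

```python
# Minimal circuit breaker sketch matching the closed/open/half-open states
# described above. Thresholds mirror the config (5 failures, 30 s reset,
# 3 half-open probes); the clock injection exists only to make it testable.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0,
                 half_open_max_calls=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.half_open_max_calls = half_open_max_calls
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.half_open_calls = 0

    def call(self, fn, fallback):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half_open"   # allow a few probe requests
                self.half_open_calls = 0
            else:
                return fallback()          # fail fast with cached data
        if self.state == "half_open" and self.half_open_calls >= self.half_open_max_calls:
            return fallback()
        try:
            if self.state == "half_open":
                self.half_open_calls += 1
            result = fn()
        except Exception:
            self._record_failure()
            return fallback()
        if self.state == "half_open":
            self.state = "closed"          # probe succeeded: heal the circuit
        self.failures = 0
        return result

    def _record_failure(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Note that callers never see an exception from a tripped circuit; they always receive either the live result or the fallback, which is what makes the fail-fast behavior safe to compose.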

Graceful degradation patterns

For each service interaction, we implemented fallback behavior:

Real-time balance checks: If the bank API circuit breaker was open, return the last known balance with a timestamp showing when it was last updated.

Transaction history: If the database query took longer than 3 seconds, return cached results from Redis with a banner indicating data might be slightly delayed.

Payment processing: If external validation services were slow, process payments using internal fraud detection only, then validate externally in the background.
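The balance-check fallback can be sketched as follows. The `fetch_live_balance` callable, the in-memory cache, and the response statuses are illustrative stand-ins for the real bank-API integration and Redis cache:

```python
# Sketch of the balance-check fallback: serve the live value when the
# bank API responds, otherwise return the last known balance with the
# timestamp of when it was fetched, so the frontend can show a notice.
import time

_balance_cache = {}  # user_id -> (balance, fetched_at)

def get_balance(user_id, fetch_live_balance, now=time.time):
    try:
        balance = fetch_live_balance(user_id)
    except Exception:
        if user_id in _balance_cache:
            cached, ts = _balance_cache[user_id]
            return {"status": "cached", "balance": cached, "timestamp": ts}
        return {"status": "unavailable", "balance": None, "timestamp": None}
    _balance_cache[user_id] = (balance, now())
    return {"status": "live", "balance": balance, "timestamp": now()}
```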

The frontend was modified to handle these graceful states:

// Account balance with fallback
if (response.status === 'cached') {
  showBalanceWithWarning(response.balance, response.timestamp);
} else if (response.status === 'degraded') {
  showBalanceWithDelayNotice(response.balance);
} else {
  showBalance(response.balance);
}

Load-shedding implementation

During extreme load, the API gateway would shed non-critical requests to preserve capacity for essential operations:

# Nginx rate limiting with priority tiers
# (zones are defined once in the http block; the rates shown are illustrative)
limit_req_zone $binary_remote_addr zone=critical:10m rate=100r/s;
limit_req_zone $binary_remote_addr zone=important:10m rate=50r/s;
limit_req_zone $binary_remote_addr zone=general:10m rate=20r/s;

location /api/payments {
    limit_req zone=critical burst=20;
}

location /api/accounts {
    limit_req zone=important burst=10;
}

location /api/history {
    limit_req zone=general burst=5;
}

When rate limits were exceeded, instead of returning 429 errors, we returned cached responses or gracefully degraded functionality.
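The shed-to-cache behavior can be sketched with a simple token bucket. The limiter and in-process response cache below are simplified stand-ins for the real Nginx and Redis setup:

```python
# Illustrative "shed to cache" gateway logic: when a tier's rate limit
# is hit, serve the most recent cached response instead of a 429.
import time

class TokenBucket:
    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate, self.capacity = rate, burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill tokens in proportion to elapsed time, up to the burst cap.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def handle(path, limiter, response_cache, serve):
    if limiter.allow():
        body = serve(path)
        response_cache[path] = body      # refresh the cache on success
        return 200, body
    if path in response_cache:           # shed to cache instead of a 429
        return 200, response_cache[path]
    return 429, "rate limited"
```

Only requests with no cached response ever see a hard 429, so during load shedding most users still receive a usable, slightly stale page.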

Monitoring and alerting changes

We modified monitoring to track circuit breaker states and degradation levels:

# Circuit breaker metrics
circuit_breaker_state{service="bank_api_1"} 0  # closed
circuit_breaker_failures{service="bank_api_1"} 3
degraded_responses_total{service="accounts"} 45

Alerts were tuned to focus on business impact rather than technical metrics. Instead of alerting on "API response time > 2 seconds," we alerted on "successful payment rate < 95%" or "user-facing errors > 1% of requests."
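In Prometheus alerting-rule syntax, such business-impact alerts might look like the following; the metric names here are assumptions for illustration, not the platform's actual metrics:

```yaml
# Illustrative business-impact alerting rules. The metric names
# (payments_success_total, payments_attempts_total, http_requests_*)
# are assumed for this sketch.
groups:
  - name: business-impact
    rules:
      - alert: PaymentSuccessRateLow
        expr: |
          sum(rate(payments_success_total[5m]))
            / sum(rate(payments_attempts_total[5m])) < 0.95
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Successful payment rate below 95%"
      - alert: UserFacingErrorsHigh
        expr: |
          sum(rate(http_requests_errors_total[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "User-facing errors above 1% of requests"
```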

Results with real numbers

The implementation took 3 weeks to complete and showed immediate improvements during the first peak period.

Availability improvements

Before: 97.2% uptime, with 8-12 incidents per month lasting an average of 18 minutes each
After: 99.97% uptime, with 1-2 incidents per month lasting an average of 90 seconds each

During the heaviest load period in the following month (end-of-month salary processing), the platform maintained full functionality while processing 340% of normal transaction volume.

Response time improvements

Critical path (payment processing):
• 95th percentile: 180ms → 165ms
• 99th percentile: 2.3s → 890ms
• Timeout rate: 2.1% → 0.03%

Account dashboard loading:
• Average load time: 1.2s → 650ms
• Failures during peak: 12% → 0.8%
• Blank dashboard incidents: 8/month → 0/month

Business impact

Customer support tickets related to platform availability dropped 73%. The support team went from handling 40-60 "app not working" tickets per week to 8-12.

Transaction success rate during peak hours improved from 94.2% to 99.1%. This translated to an additional €28,000 in monthly transaction fees that were previously lost to failed payments.

User retention metrics showed the most significant improvement. The percentage of users making transactions within 24 hours of a platform incident increased from 67% to 94%, suggesting much higher confidence in platform reliability.

Operational improvements

Mean time to detection improved from 8.3 minutes to 1.2 minutes because monitoring focused on business metrics rather than just technical health.

Mean time to recovery dropped dramatically because most incidents now resolved automatically through circuit breaker healing, rather than requiring manual intervention.

The on-call team went from 3-5 escalations per week to 1-2 per month. Sustainable on-call practices became much easier to maintain when most failures were contained automatically.

What we'd do differently next time

While the results were strong, several aspects of the implementation could have been more efficient.

Start with business impact metrics earlier. We spent too much time optimizing technical metrics like database connection counts before clearly defining what "success" meant from a user perspective. Defining acceptable degradation levels upfront would have guided technical decisions more effectively.

Implement gradual circuit breaker rollout. We rolled out all circuit breakers simultaneously, which made it difficult to isolate the impact of each component. A more gradual deployment would have provided clearer insight into which patterns delivered the most value.

Build degradation testing into CI/CD. While we tested circuit breakers manually, we didn't build automated tests for graceful degradation scenarios. This meant some edge cases weren't discovered until production load. Chaos engineering practices would have caught these earlier.

Document degradation states more clearly for users. We focused heavily on technical implementation but didn't spend enough time on user experience during degraded states. Users were confused when they saw "cached data" warnings without understanding what that meant for their actions.

Plan capacity for circuit breaker recovery. When circuit breakers closed after being open, they sometimes created load spikes as queued requests were processed. Building in gradual recovery mechanisms would have smoothed these transitions.

The biggest lesson was that high availability infrastructure isn't just about preventing failures. It's about maintaining user trust and business continuity when failures inevitably occur.

Beyond the technical implementation

The most interesting outcome wasn't the improved uptime metrics, but how this changed the team's relationship with reliability.

Before implementing circuit breakers, incidents felt unpredictable and catastrophic. The engineering team was reactive, spending most of their time fighting fires rather than building features. The business team was constantly worried about the next outage.

After implementation, incidents became predictable and manageable. When bank APIs experienced issues, the platform continued operating with cached data. When database load spiked, non-critical features degraded gracefully while payments continued processing normally.

This predictability allowed the engineering team to shift focus from emergency response to proactive improvement. They could invest time in feature development knowing that most potential failure modes were contained.

The business team gained confidence to run marketing campaigns and expand to new markets, knowing that traffic spikes wouldn't bring down the platform.

Perhaps most importantly, the finance team could accurately forecast the cost of reliability. Instead of unpredictable revenue losses from outages, they could budget for the infrastructure investment needed to maintain specific availability targets.

This transformation from reactive firefighting to proactive reliability engineering is often more valuable than the immediate technical improvements.

Key patterns that work across industries

While this case study focused on a fintech platform, the same patterns apply across different types of high-traffic applications.

E-commerce platforms benefit from graceful degradation during sale events. When product recommendation APIs become slow, showing basic product information keeps the purchasing flow working. When inventory checking APIs timeout, allowing purchases with post-order validation prevents lost sales.

SaaS applications can maintain core functionality even when analytics or reporting services fail. Users can continue their primary workflows while secondary features operate in degraded mode.

Content platforms can serve cached content when personalization engines become overloaded, ensuring users still receive a functional experience even if recommendations aren't perfectly tailored.

The key is identifying which features are truly critical for your users' primary workflows, and which can be gracefully degraded without breaking core functionality. Understanding the warning signs of infrastructure stress helps implement these patterns before failures occur.

Most businesses discover that users are remarkably tolerant of reduced functionality as long as core features remain reliable and the degradation is clearly communicated.

Monitoring graceful degradation in production

Implementing circuit breakers and graceful degradation creates new monitoring requirements. Traditional uptime monitoring doesn't capture the nuanced health states these patterns create.

We found it essential to track degradation levels as a primary business metric:

# Business health metrics
successful_payment_rate 99.1%
user_facing_error_rate 0.8%
cached_response_percentage 12%
circuit_breaker_open_count 2

These metrics provide much clearer insight into user experience than traditional technical metrics like CPU usage or response time averages.

Alert thresholds needed to be recalibrated around business impact rather than technical perfection. An alert firing because 15% of dashboard data is served from cache is very different from an alert firing because 15% of payments are failing.

The most valuable monitoring addition was user journey tracking across degraded states. This showed which degradation scenarios actually impacted user behavior and which were transparent to the user experience.

Regular degradation testing became part of the operational routine. Monthly exercises where circuit breakers were manually triggered ensured the team understood how each failure mode would impact users and how quickly systems would recover.

Facing a similar challenge? Tell us about your setup and we will outline an approach.