Cut cloud costs 65% with sovereign stack cloud cost optimization services

The situation: a fintech scaling beyond hyperscaler economics

A European fintech platform processing 2.3 million transactions per month was spending €28,000 per month on AWS. Their core application consisted of 40 microservices running across multiple availability zones, subject to PCI DSS and GDPR compliance and a strict EU data residency requirement.

The platform handled payment processing, fraud detection, and customer onboarding workflows. Peak transaction volumes occurred during business hours, creating predictable load patterns that their current auto-scaling setup handled poorly. Instead of scaling efficiently, instances would spin up aggressively during minor traffic increases, then remain idle for hours.

More concerning was vendor lock-in. Their fraud detection algorithms relied heavily on AWS-specific machine learning services, making migration seem impossible. The CTO approached us after their latest monthly bill exceeded €32,000, with projections showing €50,000+ within six months as transaction volumes doubled.

The constraint was clear: maintain sub-200ms API response times, achieve 99.95% uptime, and keep all data within EU borders while dramatically reducing operational costs.

What we found during the audit

Our infrastructure assessment revealed several cost drivers hidden in their AWS architecture. The biggest culprit was over-provisioning across compute, storage, and networking tiers.

Compute waste dominated their bill. They ran 60 EC2 instances continuously, but CPU utilization averaged only 23% during peak hours and dropped to 8% overnight. Reserved instances covered just 30% of their capacity, meaning they paid on-demand pricing for consistent workloads that could have been planned months ahead.
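To make the scale of the over-provisioning concrete, here is a rough sketch of the arithmetic. It assumes all 60 instances are the same size and that load could be consolidated up to a target utilization; the 70% target is an assumption for illustration, not a figure from the audit.

```python
import math


def instances_needed(current_instances: int, observed_util: float, target_util: float) -> int:
    """Estimate how many same-sized instances carry the same load at a higher target utilization."""
    if not 0 < target_util <= 1:
        raise ValueError("target_util must be in (0, 1]")
    # Load expressed as the number of fully busy instance-equivalents.
    busy_equivalents = current_instances * observed_util
    return math.ceil(busy_equivalents / target_util)


# 60 instances, 23% average CPU at peak and 8% overnight (figures from the audit).
# TARGET_UTIL is a hypothetical consolidation target.
TARGET_UTIL = 0.70
peak = instances_needed(60, 0.23, TARGET_UTIL)
overnight = instances_needed(60, 0.08, TARGET_UTIL)
print(f"peak: {peak} instances, overnight: {overnight} instances")
```

Averages hide short spikes, so a figure like this is a starting point for sizing rather than a final answer.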

Storage costs spiraled due to poor data lifecycle management. Their PostgreSQL databases generated 2.4TB of transaction logs monthly, stored indefinitely across multiple regions. Application logs consumed another 800GB monthly with no retention policies. EBS snapshots accumulated without cleanup, reaching 15TB of redundant backup data.
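The effect of missing retention policies can be sketched with simple arithmetic. The monthly volumes come from the audit; the three-month retention window is a hypothetical value chosen only to show the contrast, and the sketch counts a single copy of the data rather than the multi-region copies.

```python
from typing import Optional

# Transaction logs plus application logs generated per month (TB), from the audit.
MONTHLY_TB = 2.4 + 0.8


def stored_tb(months_elapsed: int, retention_months: Optional[int] = None) -> float:
    """Log data held after `months_elapsed` months; None means nothing is ever deleted."""
    if retention_months is None:
        kept_months = months_elapsed
    else:
        kept_months = min(months_elapsed, retention_months)
    return round(kept_months * MONTHLY_TB, 1)


unbounded = stored_tb(12)                    # indefinite retention after one year
bounded = stored_tb(12, retention_months=3)  # hypothetical 3-month retention policy
print(f"no retention: {unbounded} TB, 3-month retention: {bounded} TB")
```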

Network transfer fees added €3,200 monthly. Cross-AZ traffic between their microservices generated substantial internal bandwidth charges. External API calls to third-party payment providers went through NAT gateways, which bill a per-gigabyte processing fee on top of standard egress charges.

But the real revelation was resource predictability. Despite appearing volatile, their workloads followed consistent patterns. Payment processing peaked from 9 AM to 6 PM weekdays. Fraud detection ran batch jobs nightly. Customer onboarding spiked during first-week-of-month marketing campaigns. This predictability made them perfect candidates for right-sized dedicated infrastructure.

The approach we took and why

Rather than optimizing within AWS, we designed a sovereign infrastructure stack using open-source components. This approach would eliminate vendor lock-in, reduce costs through better resource utilization, and maintain full EU data residency.

The architecture centered on four key technologies working together. Proxmox provided virtualization and cluster management across physical servers. Ceph delivered distributed storage with built-in redundancy. OpenStack created cloud-like provisioning and management interfaces. Kubernetes orchestrated containerized applications with efficient resource sharing.

This stack offered several advantages over traditional managed infrastructure approaches. Proxmox eliminated hypervisor licensing costs while providing enterprise-grade virtualization. Ceph removed dependency on proprietary storage systems, letting us tune performance for their specific workload patterns. OpenStack gave developers familiar cloud APIs without vendor lock-in. Kubernetes enabled efficient resource packing, running multiple services on shared infrastructure during low-demand periods.

The migration strategy prioritized risk reduction over speed. We would build the new infrastructure in parallel, migrate non-critical services first, then move payment processing during planned maintenance windows. This phased approach avoided the all-or-nothing risk of a single big-bang cutover.

Most importantly, we designed for their actual resource requirements rather than theoretical peaks. Instead of provisioning for worst-case scenarios, the new stack would auto-scale predictively based on historical patterns while maintaining performance guarantees.
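One simple way to turn historical patterns into a scaling plan is to average past load per weekday and hour, then size replicas ahead of time. The sketch below illustrates that idea only; it is not the scheduler used in this project. The sample traffic and the per-replica capacity are hypothetical, and the replica bounds of 2 and 12 are borrowed from the autoscaler settings shown later.

```python
import math
from collections import defaultdict
from datetime import datetime


def build_hourly_profile(samples):
    """Average observed requests/sec for each (weekday, hour) slot."""
    totals = defaultdict(lambda: [0.0, 0])
    for timestamp, rps in samples:
        slot = (timestamp.weekday(), timestamp.hour)
        totals[slot][0] += rps
        totals[slot][1] += 1
    return {slot: total / count for slot, (total, count) in totals.items()}


def planned_replicas(profile, when, rps_per_replica, min_replicas=2, max_replicas=12):
    """Replicas to schedule for `when`, clamped to the allowed range."""
    expected_rps = profile.get((when.weekday(), when.hour), 0.0)
    needed = math.ceil(expected_rps / rps_per_replica)
    return max(min_replicas, min(max_replicas, needed))


# Hypothetical history: two Monday 10:00 samples and one Monday 03:00 sample.
history = [
    (datetime(2024, 1, 8, 10), 900.0),
    (datetime(2024, 1, 15, 10), 1100.0),
    (datetime(2024, 1, 8, 3), 40.0),
]
profile = build_hourly_profile(history)

# Assume one replica comfortably serves 150 requests/sec (hypothetical).
monday_peak = planned_replicas(profile, datetime(2024, 1, 22, 10), rps_per_replica=150)
monday_night = planned_replicas(profile, datetime(2024, 1, 22, 3), rps_per_replica=150)
print(monday_peak, monday_night)
```

A schedule like this complements reactive scaling: the plan sets a baseline ahead of known peaks, and a reactive autoscaler covers whatever the history did not predict.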

Implementation details with specifics

The physical infrastructure consisted of six bare-metal servers in a Frankfurt data center, each with 64 cores, 256GB RAM, and 4TB NVMe storage. We configured Proxmox in a high-availability cluster with shared storage provided by Ceph distributed across all nodes.

Ceph configuration required careful tuning for their workload patterns. Transaction data needed high IOPS for real-time processing, while analytics and log data could tolerate higher latency. We created separate storage pools: NVMe-backed pools for hot data with 3x replication, and SATA-backed erasure-coded pools for cold data, which accept higher latency in exchange for lower raw-capacity overhead than replication.

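# CRUSH rule that keeps replicas on NVMe-class OSDs, one per host
# (assumes the default CRUSH root and OSDs tagged with the nvme device class)
ceph osd crush rule create-replicated nvme_rule default host nvme

# Ceph pool configuration for hot transaction data
ceph osd pool create transactions 128 128 replicated
ceph osd pool set transactions size 3
ceph osd pool set transactions min_size 2
ceph osd pool set transactions crush_rule nvme_rule

# Pool for analytics and logs with erasure coding
# The profile must exist before the pool is created and cannot be changed afterwards
ceph osd erasure-code-profile set ec-profile k=4 m=2
ceph osd pool create analytics 64 64 erasure ec-profile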
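The capacity trade-off between the two pool types above can be checked with a short calculation. This is a sketch under simplifying assumptions: it treats the full raw capacity of the six servers (4TB each) as available to a single pool and ignores Ceph's full-ratio safety margins and metadata overhead.

```python
def usable_replicated(raw_tb: float, size: int) -> float:
    """Usable capacity when every object is stored `size` times."""
    return raw_tb / size


def usable_erasure(raw_tb: float, k: int, m: int) -> float:
    """Usable capacity when each object is split into k data chunks plus m coding chunks."""
    return raw_tb * k / (k + m)


raw_tb = 6 * 4  # six servers with 4 TB each

hot = usable_replicated(raw_tb, size=3)  # 3x replication, as in the transactions pool
cold = usable_erasure(raw_tb, k=4, m=2)  # k=4, m=2, as in the ec-profile
print(f"3x replication: {hot} TB usable, EC 4+2: {cold} TB usable")
```

Both layouts tolerate the loss of two copies or chunks, but replication stores 3 bytes per usable byte while the 4+2 profile stores 1.5. With a host-level failure domain, a 4+2 profile also needs at least six hosts, which matches the cluster size described here.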

OpenStack deployment used Kolla-Ansible for reproducible configuration management. We enabled only essential services: Nova for compute, Neutron for networking, Glance for images, and Cinder for block storage. This minimal deployment reduced complexity while providing cloud-standard APIs their development team already understood.

Kubernetes ran on OpenStack VMs, with significant modifications for cost optimization. Instead of reserving dedicated capacity for each microservice, we set resource requests, limits, and quotas so the scheduler could pack compatible services onto shared nodes during low-traffic periods.

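# Caps the total CPU and memory that pods in the namespace may request or use.
# ResourceQuota is namespaced; with no namespace set, it applies to the one targeted at apply time.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: payment-processing
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "60"
    limits.memory: 80Gi
---
# Scales the payment-api Deployment between 2 and 12 replicas,
# targeting 70% average CPU utilization relative to the pods' CPU requests.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  minReplicas: 2
  maxReplicas: 12
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70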
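For readers unfamiliar with how the autoscaler above picks a replica count, the core rule documented for the Horizontal Pod Autoscaler is desired = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default). The sketch below reproduces only that rule with the bounds from the manifest; the real controller also applies stabilization windows and pod-readiness handling, which are omitted here.

```python
import math


def desired_replicas(current_replicas: int, current_util: float, target_util: float = 70,
                     min_replicas: int = 2, max_replicas: int = 12,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA scaling rule for a CPU utilization target (percent)."""
    ratio = current_util / target_util
    # No scaling when the observed metric is close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, 95))    # scale up
print(desired_replicas(4, 72))    # within tolerance, unchanged
print(desired_replicas(6, 20))    # scale down, clamped to minReplicas
print(desired_replicas(10, 140))  # scale up, clamped to maxReplicas
```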

Network architecture eliminated the cross-AZ transfer costs that plagued their AWS setup. All internal communication stayed within the same physical location, while external API calls routed through dedicated gateway nodes with optimized egress pricing.

Database migration required the most careful planning. We moved their PostgreSQL clusters using logical replication, maintaining real-time sync during the transition period. New storage configuration used Ceph RBD volumes with performance tuning for transaction processing workloads.

Monitoring integration used Prometheus and Grafana, configured to track both technical metrics and cost attribution. Each service's resource consumption was measured and allocated, enabling precise cost tracking per business function.
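A minimal sketch of proportional cost attribution follows. It assumes cost is split by each service's share of measured CPU core-hours; the usage figures are hypothetical, and the €9,800 monthly total is the post-migration figure reported in the results. A real allocation would likely weigh memory, storage, and network as well.

```python
def allocate_cost(monthly_cost_eur: float, cpu_core_hours: dict) -> dict:
    """Split a monthly bill across services in proportion to CPU core-hours consumed."""
    total_hours = sum(cpu_core_hours.values())
    if total_hours <= 0:
        raise ValueError("total usage must be positive")
    return {
        service: round(monthly_cost_eur * hours / total_hours, 2)
        for service, hours in cpu_core_hours.items()
    }


# Hypothetical core-hours per business function, e.g. aggregated from Prometheus metrics.
usage = {
    "payment-processing": 5000,
    "fraud-detection": 3000,
    "customer-onboarding": 2000,
}
shares = allocate_cost(9800, usage)
for service, cost in shares.items():
    print(f"{service}: EUR {cost}")
```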

Results with real numbers

The migration completed over six weeks, with the payment processing cutover requiring a 47-minute maintenance window. Performance improved immediately: average API response time dropped from 180ms to 95ms thanks to reduced network latency and optimized storage paths.

Monthly costs fell from €28,000 to €9,800, a 65% reduction. Hardware costs totaled €4,200 monthly including depreciation, data center fees, and support. Managed services added €3,200 for monitoring, backup, and maintenance. Network and bandwidth costs dropped to €2,400 monthly.
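The breakdown can be verified with a few lines of arithmetic. The annualized figure is derived here for illustration and does not account for one-off migration costs, which the source does not state.

```python
before = 28_000  # monthly AWS spend (EUR)

after_items = {
    "hardware": 4_200,          # depreciation, data center fees, support
    "managed_services": 3_200,  # monitoring, backup, maintenance
    "network": 2_400,           # network and bandwidth
}
after = sum(after_items.values())

reduction = (before - after) / before
annual_saving = (before - after) * 12

print(f"new monthly cost: EUR {after}")
print(f"reduction: {reduction:.0%}")
print(f"annualized saving: EUR {annual_saving}")
```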

But the performance gains delivered more value than cost savings alone. Payment processing throughput increased 40% due to optimized database performance and reduced latency. Fraud detection jobs that previously took 3.2 hours now completed in 1.4 hours, enabling faster transaction approval.

Resource utilization improved dramatically. CPU utilization averaged 78% during peak hours while maintaining response time targets. Storage efficiency gained 35% through Ceph's intelligent data placement and compression. Memory usage dropped 25% through better container packing and shared resource policies.

Availability exceeded targets despite initial concerns about managing physical infrastructure. The platform achieved 99.97% uptime during the first six months, surpassing their previous 99.93% on AWS. Two hardware failures occurred without service impact due to Ceph's distributed architecture and Proxmox's high-availability and live migration capabilities.
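Uptime percentages are easier to compare when converted to allowed downtime. The sketch below does that conversion, assuming a six-month window of 182 days (an approximation chosen here, not a figure from the source).

```python
def downtime_minutes(uptime_percent: float, days: int) -> float:
    """Total downtime implied by an uptime percentage over a period of `days`."""
    return (1 - uptime_percent / 100) * days * 24 * 60


DAYS = 182  # roughly six months (assumption)

new_stack = downtime_minutes(99.97, DAYS)
previous = downtime_minutes(99.93, DAYS)
target = downtime_minutes(99.95, DAYS)

print(f"new stack: {new_stack:.0f} min, previous: {previous:.0f} min, target budget: {target:.0f} min")
```

Over that window, the difference between 99.93% and 99.97% is well over an hour and a half of downtime.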

Perhaps most importantly, vendor independence became real rather than theoretical. When their fraud detection requirements changed, they could deploy any machine learning framework without API compatibility constraints. Development velocity increased as teams could provision resources instantly without cloud provider quota limits or billing concerns.

What we'd do differently next time

Storage planning needed more precision from day one. Our initial Ceph configuration worked well for transaction processing but created bottlenecks for analytics workloads. We had to redesign storage pools after two months, during which some batch jobs ran slower than optimal.

The correct approach would separate storage classes during initial deployment. Hot operational data should get dedicated NVMe pools with high replication. Warm analytical data needs balanced performance pools with erasure coding. Cold archive data belongs on high-density SATA storage with aggressive compression.

Kubernetes resource requests should have been tuned more aggressively. We started conservative, setting resource requests based on AWS instance sizes. But this prevented efficient pod packing, wasting the primary advantage of shared infrastructure. Fine-tuning requests and limits based on actual resource consumption rather than previous provisioning would have improved density immediately.
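One common way to derive requests from actual consumption is to take a high percentile of observed usage and add headroom. The sketch below shows that approach with a nearest-rank percentile; the 95th percentile, the 20% headroom, and the sample data are all assumptions for illustration, not the values used in this project.

```python
import math


def recommend_cpu_request(samples_millicores, percentile: int = 95, headroom_percent: int = 20) -> int:
    """Suggest a CPU request (millicores) from observed usage samples."""
    if not samples_millicores:
        raise ValueError("at least one usage sample is required")
    ordered = sorted(samples_millicores)
    # Nearest-rank percentile: the smallest sample with at least `percentile`% of samples at or below it.
    index = math.ceil(percentile * len(ordered) / 100) - 1
    return math.ceil(ordered[index] * (100 + headroom_percent) / 100)


# Synthetic usage samples: 100m to 290m in 10m steps (20 samples).
observed = list(range(100, 300, 10))
request = recommend_cpu_request(observed)
print(f"suggested CPU request: {request}m")
```

Requests sized this way track what a service actually uses, which is what lets the scheduler pack pods densely; limits can then be set separately to cap bursts.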

Network monitoring deserved more attention during planning. While we eliminated AWS's cross-AZ charges, internal bandwidth utilization patterns weren't fully mapped until after migration. Some microservices created unexpected traffic volumes that stressed network links during peak periods.

Change management with the development team could have been smoother. Despite providing familiar APIs through OpenStack, some operational differences confused developers initially. More comprehensive training on Kubernetes debugging tools and Ceph storage behavior would have reduced support tickets during the first month.

Finally, disaster recovery testing should have started earlier. While the infrastructure proved reliable, our backup and recovery procedures weren't fully validated until month three. Earlier testing would have identified gaps in cross-site replication configuration and backup retention policies.

Conclusion

Open-source sovereign infrastructure isn't just about avoiding vendor lock-in or meeting compliance requirements. When implemented correctly, it delivers better performance at lower cost while maintaining operational flexibility.

This fintech platform's success came from matching infrastructure design to actual workload patterns rather than over-provisioning for theoretical scenarios. Their predictable traffic patterns, data residency requirements, and performance targets made them ideal candidates for dedicated infrastructure with intelligent resource sharing.

The key insight is that cloud cost optimization services work best when combined with infrastructure redesign rather than simple right-sizing within existing architectures. Sometimes the biggest savings come from stepping outside hyperscaler ecosystems entirely.

Facing a similar challenge? Tell us about your setup and we will outline an approach.

#open-source #cost-optimization #sovereign-cloud #kubernetes #proxmox
