How we optimized GKE costs by fixing an idle cluster with a runaway bill.
The business pressure
In a digital banking environment, cloud infrastructure is not a back-office concern. It is the product. Every service that processes a transaction, validates an account, or returns a balance runs on it.
When that infrastructure behaves unexpectedly, the consequences aren’t abstract — they show up on the monthly bill, in engineering escalations, and eventually in conversations with leadership about why costs are climbing while traffic is flat.
That’s exactly the situation our client was facing.
Their Google Kubernetes Engine (GKE) application had developed a memory leak — one of the most frustrating categories of engineering problems, because the system keeps running. There’s no crash, no error, no obvious failure. Just a slow, invisible accumulation of memory that never gets released after traffic spikes pass.
The knock-on effect made it worse.
Effective GKE cost optimization was impossible because the Horizontal Pod Autoscaler was reading inflated memory figures and concluding the cluster was perpetually under pressure. It kept spinning up Pods and burning budget even at idle.
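The HPA's scaling rule makes this failure mode concrete. Kubernetes computes desired replicas as ceil(currentReplicas × currentMetricValue / targetMetricValue), so a metric that never falls back to baseline keeps the scale-down branch permanently out of reach. The utilization numbers below are illustrative, not the client's actual figures:

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float) -> int:
    """The standard Kubernetes HPA rule: ceil(current * metric / target)."""
    return math.ceil(current_replicas * current_utilization / target_utilization)

# Healthy cluster at idle: memory drops back to ~40% against an 80% target.
print(desired_replicas(4, 40.0, 80.0))   # scales down to 2

# Leaking cluster at idle: RSS never released, utilization stuck near 95%.
print(desired_replicas(4, 95.0, 80.0))   # scales UP to 5 despite zero traffic
```

With the leak inflating the memory signal, the autoscaler's arithmetic is correct and its conclusion is wrong, which is exactly why the bill kept climbing at idle.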
The engineers knew something was wrong. The metrics confirmed it. But pinpointing the cause in a live banking system — where you cannot simply take things offline to inspect them — required a different kind of approach.
How we delivered it
We started by stopping the bleeding before diagnosing the wound.
- Cluster stabilized within days of engagement
- Memory returned to baseline after traffic spikes for the first time
- Cloud spend reduced materially month-on-month
The instinct in situations like this is to dive immediately into profiling. But in a production banking environment, that instinct needs to be checked.
The first priority was stopping the runaway cost accumulation — not because the investigation could wait, but because a cluster operating at permanent maximum capacity creates its own risks: reduced headroom for genuine traffic spikes, degraded response times, and compounding infrastructure spend with every day that passes.
We manually tuned replica counts and HPA thresholds to bring the cluster back to rational behavior. This wasn't the fix; it was a tourniquet. But it stabilized costs and bought the time needed to investigate properly.
Phase 1: Find the leak, not just the symptom
With the cluster stabilized, we began systematic profiling — running rigorous load tests while monitoring three specific signals simultaneously:
RSS (Resident Set Size) — tracking physical memory growth over time to distinguish genuine accumulation from normal working set expansion. In a healthy system, RSS rises under load and falls when traffic drops. In this system, it rose and stayed.
CPU usage during garbage collection — identifying whether the runtime was spending abnormal time trying to reclaim memory that couldn’t be freed, which would point toward object retention rather than simple allocation growth.
Thread count — detecting thread exhaustion or unclosed asynchronous tasks, which are a common but often overlooked source of memory leaks in systems handling concurrent financial operations.
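The client's runtime isn't named in this write-up, so as a language-neutral illustration, here is a minimal Python sketch of capturing all three signals with only the standard library (assuming CPython on a Unix-like OS; note that `ru_maxrss` is reported in kilobytes on Linux but bytes on macOS):

```python
import gc
import resource
import threading
import time

def snapshot() -> dict:
    """Capture the three leak signals for the current process."""
    # Peak resident set size (kilobytes on Linux, bytes on macOS).
    rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Total garbage-collection passes across all generations; a rising rate
    # with flat reclamation points at object retention, not allocation growth.
    gc_collections = sum(s["collections"] for s in gc.get_stats())
    return {
        "rss_kb": rss_kb,
        "gc_collections": gc_collections,
        # Live thread count; a monotonic climb suggests unclosed async tasks.
        "threads": threading.active_count(),
        "ts": time.time(),
    }
```

Sampling snapshots like these on a fixed interval across a load test gives the before/during/after series the rest of this section relies on.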
The load tests were designed to mirror realistic banking traffic patterns — not uniform synthetic load, but the kind of burst-and-drop behavior that characterizes actual transaction processing. This was critical: a leak that only manifests under specific concurrency patterns won’t show up under a constant-rate test. We needed to reproduce the conditions that triggered the problem, not just apply generic pressure.
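As a sketch of what "burst-and-drop" means in practice (the rates, durations, and jitter here are invented for illustration, not the client's traffic profile), a load schedule generator might look like:

```python
import random

def burst_and_drop_schedule(bursts: int, burst_rps: int,
                            burst_secs: int, idle_secs: int):
    """Yield (second_offset, requests_per_second) pairs that alternate
    burst and idle phases, mimicking transaction-driven traffic rather
    than a constant synthetic rate."""
    t = 0
    for _ in range(bursts):
        for _ in range(burst_secs):
            # Jitter the burst rate slightly so concurrency isn't uniform.
            yield t, burst_rps + random.randint(-burst_rps // 10, burst_rps // 10)
            t += 1
        for _ in range(idle_secs):
            # The idle window is where a leak shows: RSS should fall here.
            yield t, 0
            t += 1
```

The idle phases are the point of the exercise: a constant-rate test never gives memory the chance to fall, so it can never show that it doesn't.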
The profiling sessions confirmed what the metrics had suggested: memory was being allocated and not released. The question was where, and why.
Phase 2: Code-level intervention
A core pillar of our GKE cost optimization strategy involved code-level interventions: fixing resource lifecycle management and applying the Singleton pattern to reduce object churn.
These changes allowed the system to scale down without an expensive architectural overhaul.
Resource lifecycle management. Streams, database connections, and buffers were being opened during request handling but not reliably closed or disposed of afterward. In low-concurrency scenarios, this is often invisible — the operating system or runtime cleans up eventually. Under banking-grade concurrency, the accumulation outpaces the cleanup, and memory climbs without a ceiling.
The fix required a methodical pass through the request-handling code to identify every resource allocation point and ensure explicit closure — regardless of whether the request succeeded, failed, or timed out. The error path is where these issues hide most often: code that handles the happy path correctly but leaves connections open when an exception fires.
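The client's codebase isn't shown here, but the failure pattern is language-agnostic. In this hedged Python illustration (using a stand-in `FakeConnection` class, not a real driver), the leaky handler closes its connection only on the happy path, while the fixed handler guarantees closure on success, failure, or timeout:

```python
from contextlib import closing

class FakeConnection:
    """Stand-in for a database connection, stream, or buffer."""
    open_count = 0  # tracks how many connections are still open

    def __init__(self):
        FakeConnection.open_count += 1

    def query(self, ok: bool):
        if not ok:
            raise RuntimeError("transaction failed")
        return "row"

    def close(self):
        FakeConnection.open_count -= 1

def handle_request_leaky(ok: bool):
    conn = FakeConnection()
    result = conn.query(ok)   # an exception here skips close() -> leak
    conn.close()
    return result

def handle_request_fixed(ok: bool):
    # `closing` guarantees close() runs whether the query succeeds or raises,
    # covering the error path where these leaks usually hide.
    with closing(FakeConnection()) as conn:
        return conn.query(ok)
```

Under low concurrency both versions look identical; under sustained error-path traffic, only the first one climbs without a ceiling.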
Singleton pattern implementation. HTTP clients, database drivers, and other expensive-to-initialize objects were being instantiated repeatedly — once per request in some cases. Each instantiation allocates memory for connection pools, thread management structures, and configuration state. None of it is large individually. Multiplied across thousands of concurrent banking transactions, it becomes significant.
Refactoring these to use the Singleton pattern — initializing once at application startup and reusing across requests — eliminated the redundant allocations and reduced the object churn that was making garbage collection less effective.
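Again the client's stack isn't named, so as an illustrative Python sketch: instead of constructing an expensive client inside every request handler, initialize it once and reuse it. `functools.lru_cache` is one idiomatic way to get a lazy singleton; the `ExpensiveHttpClient` class below is a hypothetical stand-in:

```python
from functools import lru_cache

class ExpensiveHttpClient:
    """Stand-in for a client that allocates pools and threads on init."""
    instances = 0  # counts how many times construction actually ran

    def __init__(self):
        # Imagine connection pools and thread structures allocated here.
        ExpensiveHttpClient.instances += 1

@lru_cache(maxsize=1)
def get_client() -> ExpensiveHttpClient:
    # First call constructs the client; every later call returns the
    # same cached instance, eliminating per-request allocation.
    return ExpensiveHttpClient()

def handle_request() -> ExpensiveHttpClient:
    return get_client()   # reused across thousands of concurrent requests
```

The per-request allocations this removes are individually tiny; the win is the reduced object churn, which also gives the garbage collector far less transient state to chase.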
Neither of these changes was dramatic. There was no architectural overhaul, no framework migration, no infrastructure rebuild. The system that emerged was structurally identical to the one we started with — just written more carefully in the places that mattered.
Phase 3: Validate, measure, repeat
Each code change was validated through a full profiling cycle before the next change was made. This discipline — optimize, profile, confirm, then proceed — was non-negotiable. In a system handling financial transactions, the risk of an optimization that introduces a new problem outweighs the benefit of moving faster.
The validation process used the same load test conditions as the initial investigation, with before-and-after comparisons from the same metrics sources. By the time the changes were ready for production, we had empirical evidence — not engineering confidence — that the leak was resolved and that the fix held under sustained concurrency.
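The pass/fail criterion behind those comparisons can be sketched as a simple check on the same RSS series the profiling produced. The numbers and tolerance below are illustrative, not the client's measurements:

```python
def leak_resolved(rss_samples_mb: list, settle_index: int,
                  tolerance_mb: int = 50) -> bool:
    """After load drops at settle_index, RSS should fall back toward its
    pre-load baseline instead of plateauing near the peak."""
    baseline = rss_samples_mb[0]
    settled = min(rss_samples_mb[settle_index:])
    return settled <= baseline + tolerance_mb

# Before the fix: memory rose under load and stayed there.
before = [400, 900, 1800, 1750, 1740]
# After the fix: memory rose and returned to baseline once traffic dropped.
after = [400, 900, 1800, 600, 420]

print(leak_resolved(before, settle_index=3))  # False
print(leak_resolved(after, settle_index=3))   # True
```

Running this same check after every individual change is what turned "we think the leak is gone" into evidence that it was.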
Technologies we used
- Infrastructure: Google Kubernetes Engine (GKE)
- Autoscaling: Horizontal Pod Autoscaler (HPA)
- Profiling & Observability: load testing and runtime memory profiling
What really moved the needle
Memory leaks in production banking systems are not primarily a technical problem. They are a visibility problem.
The code issues we fixed — unclosed resources, redundant object instantiation — are not unusual. They exist in most systems that have grown under delivery pressure.
What made this situation serious was that there was no systematic way to see them. The HPA was reading the wrong signal and responding rationally to bad data. The engineers knew costs were wrong but couldn’t pinpoint why. The system was operating, which made the problem easy to defer.
The most important thing we established was not the fix. It was the methodology that found it. Systematic profiling with clearly defined metrics, load tests that reproduce realistic conditions, and a discipline of measuring each change before making the next one — these are the practices that turn performance engineering from guesswork into a repeatable capability.
The HPA is only as intelligent as the metrics it reads. An autoscaler working from corrupted signals will make technically correct decisions that produce operationally wrong outcomes. Fixing the signal — not the autoscaler — is always the right first move.
Production banking systems reward patience over instinct. The temptation when costs are climbing is to make changes quickly. In a regulated environment handling real financial transactions, an untested fix that introduces a new problem is worse than the original leak. The “optimize-profile-repeat” discipline exists for this reason — and it’s what the team now has the tools to maintain independently.