The Production Memory Leak I Solved by Accident.

Published
5 min read
I’m writing this a few months after the whole thing went down. At the time, I jotted notes here and there while debugging, but I never properly sat down to document it. So yeah, some details are a bit hazy now, but I have kept most of the important parts intact.

The Setup

We had a pretty simple architecture.

A web app built with Java, and a separate AI service built with FastAPI and LangGraph.

The main app (Java) would send a request to the AI service. That service would process it and send the result back via a webhook. Two completely separate services. Two separate Postgres databases. Every request and response was stored on both sides.

That separation ended up saving us a lot of guesswork later. When things broke, it was obvious which side to look at.

The First Crash

It happened during peak US hours. Everything just stopped, and the instance’s IP address went unresponsive.

An alert fired. Two health checks were fine, but the EC2 reachability check failed. The instance was inaccessible.

I checked the logs for incoming requests around that time, but none of them had made it through. The database had no record of them. It was like they vanished mid-flight.

After a couple of hours, we restarted the instance. Things came back to life like nothing had happened.

Not great. But also not catastrophic. This service wasn’t user-facing in real time, so delays were acceptable. We could reprocess missed requests because each one was tied to a specific asset. So we did that and moved on.

But “restart and pray” isn’t a real solution. Something was clearly wrong.

Chasing the Obvious (and Wrong) Idea

My first thought was: memory leak. It crashed, we restarted, and everything worked again. That’s usually a dead giveaway. So I tried to reproduce it locally. Same instance type. Same specs. I spammed it with concurrent requests.

Nothing.

No crash. No slowdown. It just kept running like it had something to prove. Still, I didn’t trust it. Maybe it was too many parallel requests eating up a lot of RAM. So I added a Redis queue and limited processing to two concurrent threads.
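
Here is a minimal sketch of that idea, capping work at two concurrent threads. It swaps the Redis queue for Python's in-process queue.Queue so it runs without a Redis server. The names run_jobs and fake_ai_request are hypothetical and are not taken from the actual service.

```python
import queue
import threading
import time

MAX_WORKERS = 2  # two concurrent threads, as described above


def run_jobs(jobs, handler, max_workers=MAX_WORKERS):
    """Drain a queue of jobs with at most `max_workers` running at once.

    Returns (results, peak_concurrency) so the cap can be checked.
    """
    pending = queue.Queue()  # stand-in for the Redis queue (assumption)
    for job in jobs:
        pending.put(job)

    results = []
    lock = threading.Lock()
    active = 0
    peak = 0

    def worker():
        nonlocal active, peak
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker exits
            with lock:
                active += 1
                peak = max(peak, active)
            try:
                out = handler(job)
            finally:
                with lock:
                    active -= 1
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(max_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, peak


def fake_ai_request(job_id):
    """Hypothetical placeholder for the real request handler."""
    time.sleep(0.01)
    return f"done:{job_id}"


if __name__ == "__main__":
    results, peak = run_jobs(range(10), fake_ai_request)
    print(len(results), "jobs processed, peak concurrency", peak)
```

A cap like this bounds how much memory in-flight requests can hold at once. It does nothing about memory that is retained after a request finishes.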

Good change overall. Felt responsible. But this still wasn’t going to take us to the root of the problem.

Actually Looking at Memory

Then I realized something embarrassingly basic. We weren’t even tracking RAM usage. By default, EC2 doesn’t show memory metrics. You have to install an agent for that. Which… we hadn’t done. So I set that up. And there it was. Memory usage was climbing steadily and never dropping. Eventually, the system would choke and die.
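
For a quick look at the same kind of number without any agent, here is a rough sketch that reads /proc/meminfo on a Linux host and computes the share of memory in use. It is not the agent we installed. The function names are hypothetical, and counting MemAvailable as free memory is an assumption about what "used" should mean.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value} dict (values in kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_used_percent(meminfo):
    """Used memory as a percentage, treating MemAvailable as not in use."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return 100.0 * (total - available) / total


if __name__ == "__main__":
    # Linux only: /proc/meminfo does not exist on macOS or Windows.
    info = parse_meminfo(Path("/proc/meminfo").read_text())
    print(f"memory used: {memory_used_percent(info):.1f}%")
```

Logging this value every minute or so is enough to tell a sawtooth (normal) from a line that only goes up (a leak).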

The weird part was that it didn’t happen often. Maybe once every couple of weeks. It only occurred three times in total. Not frequent enough to easily debug, but frequent enough to be a real problem. At this point, it felt like trying to catch a bug that only shows up when it feels like it.

The Moment Things Clicked

The breakthrough didn’t come from staring harder at logs. It came from a random question that popped into my head while reading something unrelated: “How does this system keep track of multiple threads without mixing things up?” That got me digging.

Turns out, LangGraph uses a checkpointer to store thread state. And if you don’t configure anything external, it just keeps everything in memory. Every request. Every thread. Every checkpoint.

All sitting in RAM.

And this is the important part: none of it was being cleaned up. (This is a design decision that allows resuming a conversation at any point in time, although arguably a better default would have been a file-based checkpointer rather than an in-memory one.) So with every request, memory usage ticked up a little. And then a little more. And then a little more.

Until eventually, the instance just ran out of room and gave up.
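
To make the growth pattern concrete, here is a toy model of an in-memory checkpointer. It is a hypothetical stand-in and not LangGraph's actual class or API. The only point is that a store keyed by thread ID, with no eviction, grows with every request the process has ever served.

```python
from collections import defaultdict


class ToyInMemoryCheckpointer:
    """Hypothetical stand-in (not LangGraph's class): every checkpoint lives in a dict."""

    def __init__(self):
        self._store = defaultdict(list)  # thread_id -> list of saved states

    def put(self, thread_id, state):
        # Append a copy of the state; nothing is ever evicted.
        self._store[thread_id].append(dict(state))

    def latest(self, thread_id):
        history = self._store.get(thread_id)
        return history[-1] if history else None

    def checkpoint_count(self):
        return sum(len(history) for history in self._store.values())


def handle_request(checkpointer, request_id, steps=3):
    """Simulate one request: a fresh thread ID and one checkpoint per step."""
    thread_id = f"req-{request_id}"
    state = {"messages": []}
    for step in range(steps):
        state["messages"] = state["messages"] + [f"step {step} output"]
        checkpointer.put(thread_id, state)
    return checkpointer.latest(thread_id)


if __name__ == "__main__":
    checkpointer = ToyInMemoryCheckpointer()
    for request_id in range(1000):
        handle_request(checkpointer, request_id)
    # Every finished request is still held in memory.
    print("checkpoints retained:", checkpointer.checkpoint_count())
```

In this model the retained count is simply requests times steps, and it only resets when the process restarts.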

Why I Never Saw It in Dev

This also explained why I couldn’t reproduce it. In development, the service restarted all the time, every deploy, every change. Each restart wiped the memory clean. So the leak never had time to build up.

In production, though, the service stayed up for days or weeks. That’s where the problem had space to grow.

The Fix

Once the problem made sense, the fix was straightforward. We stopped storing state in memory and moved it to Postgres. Now, instead of piling everything into RAM, it gets persisted properly. Memory usage stays flat, no matter how many requests come in.
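
The shape of that change, sketched under loud assumptions: this is not the Postgres checkpointer we actually used, and it uses sqlite3 from the standard library as a stand-in database so it runs anywhere. ToySqlCheckpointer and its table layout are hypothetical. What it shows is that the process only holds the state it is currently working on, while the history lives in the database.

```python
import json
import os
import sqlite3
import tempfile


class ToySqlCheckpointer:
    """Hypothetical sketch: checkpoints go to a database table, not a Python dict."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " thread_id TEXT NOT NULL,"
            " state TEXT NOT NULL)"
        )

    def put(self, thread_id, state):
        # Serialize and hand the state to the database; keep no reference here.
        self.conn.execute(
            "INSERT INTO checkpoints (thread_id, state) VALUES (?, ?)",
            (thread_id, json.dumps(state)),
        )
        self.conn.commit()

    def latest(self, thread_id):
        row = self.conn.execute(
            "SELECT state FROM checkpoints"
            " WHERE thread_id = ? ORDER BY id DESC LIMIT 1",
            (thread_id,),
        ).fetchone()
        return json.loads(row[0]) if row else None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "checkpoints.db"))
        checkpointer = ToySqlCheckpointer(conn)
        checkpointer.put("req-1", {"messages": ["hello"]})
        checkpointer.put("req-1", {"messages": ["hello", "world"]})
        print(checkpointer.latest("req-1"))
        conn.close()
```

The trade-off is a database round trip per checkpoint, and the table still grows, so a retention policy on old rows is worth considering.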

We also switched to a more appropriate instance type. Since then, no crashes. No slow memory creep. Just a stable system doing what it’s supposed to.

What I Learned (the Hard Way)

A few things I wish I had done differently:

Write things down while debugging. You will forget details. Numbers, timestamps, weird observations, all of it. Don’t trust your memory. It’s unreliable when you need it most.

Don’t assume your monitoring is enough. We didn’t even have RAM metrics. That alone delayed the diagnosis way more than it should have.

Know your tools beyond the happy path. Defaults are often designed for convenience, not scale. Something that works perfectly in dev can quietly destroy you in production.

Dev and prod behave differently in subtle ways. Frequent restarts in dev were hiding the issue entirely. That gap matters more than it seems.

Sometimes the answer comes from sideways thinking. The breakthrough didn’t come from grinding harder. It came from curiosity about something loosely related. That’s often how these things work.


Looking back, none of it was very complex. But chasing something invisible is what made this find stick.