The Ghost in the Machine: Stop-the-World GC
Why your application randomly freezes and how modern collectors fix it.
Programming languages like Java, .NET, and Go clean up old memory automatically. This is called garbage collection. But to do this safely, they sometimes freeze your entire program. If you are building fast, real-time apps, these pauses can ruin your software. Here is exactly how they work, and how to stop them.
Why Must the Code Stop?
The automatic cleaner (garbage collector) has two main jobs: find memory you are not using anymore, and move things around to keep memory tidy.
If your program keeps running and changing data while the cleaner is trying to look at it, the cleaner might accidentally delete something you need. Also, if the cleaner moves a piece of data to a new spot, but your code tries to read it from the old spot, your program will crash. The safest way to prevent this is to freeze everything while the cleaner works.
The Rules of Pausing
- Clean Snapshot: The cleaner needs to see exactly what memory is being used right now.
- Safe Moving: Data cannot be used by the program while it is being moved to a new address.
How Different Languages Stop Your Code
Programs do not just stop instantly. They wait until they reach a "safepoint"—a specific line of code where it is safe to pause. Here is how different languages handle this.
Java (JVM)
The Method: Polling
Java puts tiny invisible checkpoints in your code (usually when a function starts, ends, or inside a loop). Your program constantly asks the system, "Should I stop now?" If the cleaner is ready, the system says yes, and the program pauses at the next checkpoint.
.NET (CLR)
The Method: Hijacking
.NET uses checkpoints like Java, but if a piece of code is stuck and taking too long to hit a checkpoint, .NET can "hijack" the code. It tricks the running function so that when it finishes, it walks directly into a trap that pauses it.
Go Runtime
The Method: Signals
If a goroutine runs for more than about 10 milliseconds without reaching a natural stopping point, the Go runtime sends it an operating-system signal (SIGURG on Unix). The signal handler forces the goroutine to step aside and wait so the cleaner can do its job.
Smart Memory Slicing (G1GC)
In the past, memory was cut into two giant pieces: "New" and "Old". Cleaning the giant "Old" piece took a long time and froze the program badly.
Modern cleaners (like Java's G1GC) chop memory into thousands of tiny, equal-sized blocks called Regions. Now, the cleaner only freezes the program long enough to clean the specific regions that have the most trash.
| Total Memory Size | Size of Each Region |
|---|---|
| Under 4 GB | 1 MB |
| 4 GB – 8 GB | 2 MB |
| 8 GB – 16 GB | 4 MB |
| 16 GB – 32 GB | 8 MB |
| 32 GB – 64 GB | 16 MB |
| Over 64 GB | 32 MB |
Rule of thumb: If you create an object that is at least half the size of a region, the system calls it a "Humongous Object." These are very hard for the cleaner to manage and cause extra pauses.
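The sizes in the table follow a simple heuristic: divide the heap into roughly 2,048 regions, round down to a power of two, and clamp between 1 MB and 32 MB. A Go sketch of that heuristic (exact behavior can vary by JVM version and flags):

```go
package main

import "fmt"

const mb = 1 << 20

// g1RegionSize approximates how G1 picks a region size: divide the
// heap into ~2048 regions, round down to a power of two, and clamp
// the result between 1 MB and 32 MB.
func g1RegionSize(heapBytes int64) int64 {
	size := heapBytes / 2048
	// round down to a power of two, starting from the 1 MB minimum
	p := int64(mb)
	for p*2 <= size {
		p *= 2
	}
	if p > 32*mb {
		p = 32 * mb
	}
	return p
}

func main() {
	for _, gb := range []int64{2, 6, 12, 100} {
		fmt.Printf("%3d GB heap -> %2d MB regions\n", gb, g1RegionSize(gb<<30)/mb)
	}
}
```

Running it reproduces the table: a 2 GB heap gets 1 MB regions, a 12 GB heap gets 4 MB regions, and anything past 64 GB caps out at 32 MB.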
Next-Gen "Pauseless" Cleaners
Newer collectors like ZGC and Shenandoah (both Java) aim to never pause for more than a single millisecond, even on heaps of several terabytes.
- ZGC (Colored Pointers): ZGC hides a few metadata bits ("colors") inside each 64-bit memory address. If your program loads a stale reference, a load barrier spots the outdated color and fixes the address on the spot, before you even notice.
- Shenandoah (Forwarding Pointers): Every object leaves a "forwarding address" behind when it moves. If your code asks for the object at the old spot, it simply follows the forwarding pointer to the new location.
The Memory Trade-off Math
Garbage collection is not free. You are always trading computer brain power (CPU) for memory size (RAM). In Go, this is controlled by a simple dial called GOGC.
Target heap = Live heap + (Live heap + GC roots) × (GOGC / 100)
At the default of GOGC=100, the heap can grow to roughly double the live data before the next collection. Turning GOGC up makes the cleaner run less often, saving CPU at the cost of more RAM; turning it down does the opposite. But if the heap presses against a hard memory limit, the cleaner starts running back-to-back, burning enormous amounts of CPU just trying to stay under the ceiling.
How to Write Code That Never Pauses
The best way to stop the cleaner from pausing your app is to hide your data from it. If the cleaner doesn't know the memory exists, it doesn't have to clean it.
Off-Heap Memory
Best for: Java, Huge Databases
You can store data directly in raw, native memory, completely outside the cleaner's view. In modern Java, you use the Foreign Function & Memory (FFM) API. Because the cleaner never sees this data, it never causes a pause, even if you load terabytes of files.
Slicing Data
Best for: .NET / C#
Instead of copying text or data (which creates trash for the cleaner), .NET lets you use a tool called Span<T>. It creates a temporary "window" that looks at the original data without copying it. It cleans itself up instantly without needing the garbage collector.
No Pointers
Best for: Go
The Go cleaner has to trace every single "pointer" (a link to another piece of data). If you hold a list of 10 million items full of pointers, the cleaner walks all 10 million on every cycle. Rewriting the hot data to use plain numbers (indexes) instead of pointers can cut the cleaner's scan time by orders of magnitude.
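As an illustration of the idea (types and names made up), here is a linked list encoded with integer indexes instead of pointers, so the node slice contains nothing the collector has to trace:

```go
package main

import "fmt"

// NodeFlat avoids pointers entirely: fields are plain integers, so a
// []NodeFlat is one allocation the garbage collector can skip over
// without scanning each element.
type NodeFlat struct {
	NameID int32 // index into a shared name table
	Next   int32 // index of the next node; -1 ends the list
}

// walk follows the index-based "linked list" and concatenates names.
func walk(names []string, nodes []NodeFlat, start int32) string {
	out := ""
	for i := start; i >= 0; i = nodes[i].Next {
		out += names[nodes[i].NameID]
	}
	return out
}

func main() {
	names := []string{"a", "b", "c"}
	nodes := []NodeFlat{{0, -1}, {1, 0}, {2, 1}}
	fmt.Println(walk(names, nodes, 2)) // cba
}
```

The trade-off: index-based structures are less convenient to mutate, so this trick is usually reserved for large, long-lived, performance-critical data.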
Why a 100ms Pause is a Disaster
Stock Trading
In automated trading, prices change in microseconds. A 100-millisecond freeze means the computer buys a stock at an old price, losing millions of dollars instantly.
Multiplayer Games
If the server pauses to clean memory, all players experience "rubber-banding." Characters freeze and snap back to old locations, ruining the experience.
Medical Monitors
Systems reading live heart rates or machinery cannot afford to drop data packets. A pause means missing critical real-time alerts.
The "Average" Lie (P99)
Never measure average speed. If 99 users get a fast response (1ms) and 1 user is frozen by a garbage collection pause (500ms), the average looks great (6ms). But your app is completely broken for that 1 out of 100 people.
Professionals measure the 99th Percentile (P99): the time under which 99% of requests finish. It forces you to look at the worst experiences. If your P99 is fast, virtually every user is fast.
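The arithmetic from the example is easy to check. A small Go sketch computing the mean and the worst case for 99 fast requests and one 500 ms stall:

```go
package main

import "fmt"

// mean returns the average of a latency sample in milliseconds.
func mean(xs []float64) float64 {
	sum := 0.0
	for _, v := range xs {
		sum += v
	}
	return sum / float64(len(xs))
}

// worst returns the largest latency in the sample.
func worst(xs []float64) float64 {
	w := 0.0
	for _, v := range xs {
		if v > w {
			w = v
		}
	}
	return w
}

func main() {
	// 99 fast responses at 1 ms plus one request frozen by a 500 ms GC pause.
	latencies := make([]float64, 99, 100)
	for i := range latencies {
		latencies[i] = 1
	}
	latencies = append(latencies, 500)

	fmt.Printf("mean:  %.2f ms\n", mean(latencies))  // mean:  5.99 ms
	fmt.Printf("worst: %.0f ms\n", worst(latencies)) // worst: 500 ms
}
```

The 5.99 ms mean looks healthy; only the tail exposes the half-second freeze.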
Private Workspaces (TLABs)
If every part of your program tried to grab memory from a single shared pile at the exact same time, they would have to wait in line. Instead, the system gives every worker a private desk.
Without Private Desks
Everyone fights over the giant shared memory pool. The program locks up just trying to hand out space.
Thread-Local Allocation (TLAB)
Each worker gets a small chunk of memory to use privately. No waiting in line, no locks. Blazing fast.
The Large Object Problem
Moving tiny items around memory is easy. Moving a massive 10-megabyte image is slow and risky. Because of this, .NET keeps anything 85,000 bytes or larger in a separate dumping ground called the Large Object Heap (LOH).
The LOH is barely managed: to save time, things put here are swept but, by default, never compacted or moved. If your program keeps creating massive items, the LOH fragments and fills up, eventually causing a catastrophic, multi-second pause when a full clean-up is finally forced. Rule: Never create large things repeatedly.
Memory Packing (Defragmentation)
As objects are deleted, memory starts to look like Swiss cheese. Even if you have 1GB of free space, you might not be able to fit a 10MB object because the space is broken into tiny, separated chunks.
1. Fragmented (Full of holes)
2. Compacted (Solid blocks)
Parallel Cleaning
The program stops completely. The cleaner brings in multiple workers (threads) to clean up the mess as fast as possible.
Concurrent Cleaning
Your program keeps running. The cleaner runs quietly in the background at the exact same time, picking up trash without interrupting you.
Security Cameras (Write Barriers)
If the cleaner is working in the background (Concurrent), how does it know if your program suddenly changes a piece of data it just looked at?
The system injects a microscopic piece of code into every pointer-write your program performs. This is called a Write Barrier. It acts like a security camera: if you change a piece of memory while the cleaner is looking away, the write barrier flags it, forcing the cleaner to come back and look at it again.
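A toy Go illustration of what the injected check does conceptually — this is not any runtime's real barrier, just the flag-and-revisit idea:

```go
package main

import "fmt"

// object is a toy heap object: marked means the collector already scanned it.
type object struct {
	marked bool
	refs   []*object
}

var (
	gcRunning bool
	dirty     []*object // objects the collector must revisit
)

// writeRef stands in for "obj.refs[i] = target" with a write barrier
// attached: real collectors compile an equivalent check into every
// pointer store.
func writeRef(obj *object, i int, target *object) {
	if gcRunning && obj.marked {
		// The collector already scanned obj, but we just changed it:
		// flag it so the scan is redone before the cycle finishes.
		dirty = append(dirty, obj)
	}
	obj.refs[i] = target
}

func main() {
	a := &object{marked: true, refs: make([]*object, 1)}
	b := &object{}
	gcRunning = true
	writeRef(a, 0, b) // mutation during concurrent marking
	fmt.Println("objects to re-scan:", len(dirty)) // objects to re-scan: 1
}
```

The cost is a branch on every pointer store, which is the CPU tax concurrent collectors pay to avoid long pauses.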
The Magic of Escape Analysis
Garbage collectors only have to clean the "Heap" (the shared, long-lived storage). They never touch the "Stack" (per-function scratch space that is wiped automatically the instant the function returns).
Modern compilers are incredibly smart. They read your code before running it. If they see you create an object, use it inside a function, and never share it outside that function, the object is trapped. It cannot "escape." Because it doesn't escape, the system puts it on the instant-delete Stack instead of the Heap. Result: Zero garbage collection.
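Go makes this visible: compiling with `go build -gcflags=-m` prints the compiler's escape-analysis decisions. A small sketch (function names are illustrative):

```go
package main

import "fmt"

type point struct{ x, y int }

// Stays on the stack: p never leaves the function, so the compiler
// reports it does not escape and no garbage is created.
func sumCoords() int {
	p := point{x: 3, y: 4}
	return p.x + p.y
}

// Escapes to the heap: returning a pointer means the value outlives
// the stack frame, so the compiler must heap-allocate it.
func newPoint() *point {
	return &point{x: 3, y: 4}
}

func main() {
	fmt.Println(sumCoords())  // 7
	fmt.Println(newPoint().x) // 3
}
```

With `-gcflags=-m`, the build output flags the `&point{...}` in `newPoint` as escaping, while `sumCoords` produces no heap allocation at all.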
The "Rent, Don't Buy" Strategy
Instead of building a new network connection or a new chunk of memory 10,000 times a second and throwing them away, you build a "Pool" of 50 objects when the app starts.
1. Borrow a clean object from the pool.
2. Use it for your heavy task. Do not create anything new.
3. Wipe it clean and put it back in the pool.
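In Go, the standard library's sync.Pool implements this rent-don't-buy pattern (with the caveat that pooled objects may be discarded at a GC; the handler name and message below are made up for illustration):

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// A pool of reusable buffers: borrow, use, wipe, return.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// handle processes one message without allocating a fresh buffer each time.
func handle(msg string) string {
	buf := bufPool.Get().(*bytes.Buffer) // borrow from the pool
	defer func() {
		buf.Reset()      // wipe it clean
		bufPool.Put(buf) // put it back for the next caller
	}()
	buf.WriteString("processed: ")
	buf.WriteString(msg)
	return buf.String()
}

func main() {
	fmt.Println(handle("order-42")) // processed: order-42
}
```

Under load, thousands of calls to `handle` recycle a handful of buffers instead of producing thousands of short-lived allocations for the cleaner to chase.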
Ring Buffers (The Ultimate Speed Trick)
In ultra-fast finance platforms (like the LMAX Disruptor), developers use a Ring Buffer. It is an array of memory that connects back to itself in a circle.
Instead of creating a list that grows endlessly and needs garbage collection, you make a fixed circle of 10,000 slots. When you reach the end, you just circle back and overwrite the oldest data. You pre-allocate everything once, meaning zero garbage and millions of transactions per second.
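A minimal fixed-size ring buffer sketch in Go (names are illustrative; production designs like the LMAX Disruptor add lock-free coordination between producers and consumers):

```go
package main

import "fmt"

// RingBuffer is a fixed circle of slots allocated once up front.
// Writes wrap around and overwrite the oldest entry, so steady-state
// operation produces zero garbage.
type RingBuffer struct {
	slots []int64
	next  int // index of the slot the next write will use
}

func NewRingBuffer(size int) *RingBuffer {
	return &RingBuffer{slots: make([]int64, size)}
}

// Put writes a value into the next slot, wrapping at the end.
func (r *RingBuffer) Put(v int64) {
	r.slots[r.next] = v
	r.next = (r.next + 1) % len(r.slots)
}

// Latest returns the most recently written value.
func (r *RingBuffer) Latest() int64 {
	return r.slots[(r.next+len(r.slots)-1)%len(r.slots)]
}

func main() {
	rb := NewRingBuffer(4)
	for v := int64(1); v <= 6; v++ {
		rb.Put(v) // values 5 and 6 overwrite the oldest slots
	}
	fmt.Println(rb.Latest()) // 6
	fmt.Println(rb.slots)    // [5 6 3 4]
}
```

All memory is allocated in the constructor; after that, writing a million events touches the collector exactly zero times.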
Tools to Find the Problem
-Xlog:safepoint
Turns on a log that tells you exactly which piece of code is taking too long to reach a safe pause point.
dotnet-counters
A live monitor. If it shows your program spends more than 10% of its time cleaning, you are making too much trash memory.
pprof / -gcflags="-m"
Shows you exactly which lines of code are dumping data into the slow memory pile instead of the fast memory pile.