Why Fast LLMs Are Really a Memory Problem

A beginner-friendly tour of how large language models are actually run — and the dozens of tricks engineers use to make them fast and cheap. No prior systems knowledge required.

A walk through the ideas in Alex Smola's "Efficiency in LLMs" tutorial (Columbia Machine Learning Summer School, 2026). All the numbers are his, verified by him as of mid-2026 — and, as he cheerfully warns, "likely wrong by December," because this field moves fast.

Here is a fact that surprises almost everyone the first time they hear it: when a modern AI model writes you an answer, the expensive chip running it spends most of its time waiting. Not computing. Waiting. Specifically, waiting for data to arrive from memory.

That single fact is the key that unlocks almost everything about how AI systems are built today. Once you understand why the chip waits, every optimization — the weird acronyms, the new chip designs, the clever serving software — turns out to be a variation on one move: move fewer bytes around.

This post is a guided tour. We'll start from absolute basics (what a chip even does, what "memory bandwidth" means) and build all the way up to the frontier techniques that let a model hold a million words of context. You don't need to know anything about GPUs going in. Let's go.

The tour ahead

The big picture: why inference is a memory problem
Hardware: the physics of fast
Serving: many users, one copy of the model
Weight compression: same brain, fewer bytes
KV-cache compression: how to afford a million words
The wrap-up: numbers to remember

Part 1: The Big Picture

First, some vocabulary

Let's define a few terms in plain language so nothing trips you up later.

A model's "weights" are just a giant pile of numbers — the thing that got "learned" during training. Running the model means doing arithmetic with these numbers. A medium-sized model has billions of them.
A FLOP is one floating-point operation — basically one multiply or one add. Chips are rated by how many they can do per second. A top data-center GPU does roughly a quadrillion per second.
A GPU (graphics processing unit) is the chip that runs AI models. It has two relevant parts: thousands of tiny calculators (the "compute"), and a big bank of fast memory next to them (called HBM) where the weights live.
Memory bandwidth is how fast data can move from that memory bank into the calculators, measured in bytes per second. This turns out to be the hero (and villain) of the whole story.
A token is a chunk of text — roughly a word or part of a word. Models read and write text one token at a time.
Inference means using a trained model to generate answers (as opposed to training, which is teaching it in the first place). This whole post is about inference.

The imbalance that runs everything

Take NVIDIA's H100, a workhorse data-center GPU. It can do about 989 trillion math operations per second. Its memory can deliver about 3.35 trillion bytes per second. Divide one by the other and you get a number worth tattooing on your arm:

≈ 295 FLOPs per byte. In the time it takes the H100 to fetch a single byte from memory, it could have done about 295 math operations. So to keep its calculators busy, you need to feed it roughly 300 operations of work for every byte you pull in.

Each dot is one math operation. The chip can do hundreds in the time a single byte shows up — so during decode, when each byte feeds only one operation, it sits nearly idle.

Here's the problem. When a model generates text one token at a time, each weight gets used exactly once per token. You haul a number all the way in from memory, do one multiply with it, and you're done with it. That's one operation per byte fetched — when the chip wanted three hundred.

The result: the calculators sit idle, drumming their fingers, while the memory system frantically shovels weights at them as fast as it can. The chip is starving.

An analogy

Imagine a world-class chef (the compute) who can chop, sear, and plate at superhuman speed. But the ingredients arrive on a single narrow conveyor belt (the memory bandwidth). It doesn't matter how fast the chef is — dinner comes out at the speed of the belt. The chef spends the night waiting for the next onion.

Making LLM inference faster is almost never about getting a faster chef. It's about putting fewer, smaller ingredients on the belt.

This is the thesis of the entire tour: generating text is a memory-traffic problem, not a math problem. Nearly every trick we'll meet is just a clever way to move fewer bytes per token.

Why anyone cares: the cost of serving

Two reasons this matters enormously in practice.

First, price. The cost of running a model at a given quality level has dropped about 10× every year, which compounds to a hundredfold drop in just two years. Most of that win didn't come from better models; it came from serving them more efficiently. The tricks in this post are the price drop.

Second, scale. You train a model once, but then you serve it to billions of requests. At that scale, the running cost dwarfs the training cost. Shaving a few bytes off each token, multiplied across billions of tokens a day, is the whole game.

The two phases: prefill and decode

When you send a prompt to a model, the work splits into two very different phases. Understanding this split is the most useful single idea in the post, so let's go slowly.

Phase 1 — Prefill ("reading your prompt"). The model reads your entire prompt at once. If your prompt is 40,000 tokens long, it processes all 40,000 in one big batch. Crucially, it reads each weight once but immediately reuses that weight across all 40,000 tokens. That's tons of math per byte — exactly what the chip likes. Prefill keeps the calculators busy. It is compute-bound (limited by how fast the chip can do math).

Phase 2 — Decode ("writing the answer"). Now the model writes its response one token at a time. To produce each new token, it must stream every single weight through the chip again — but it only uses each one once, for that one token. This is the starving scenario from above. Decode is memory-bound (limited by how fast memory can feed the chip).

The one-line summary

Prefill speed is set by your chip's FLOPs. Decode speed is set by your chip's bytes-per-second. Two phases, two bottlenecks, one model. Almost everything that follows is about making decode less painful.

Same model, two phases. Prefill spreads each weight-read across thousands of prompt tokens (great use of the chip). Decode drags every weight in to produce a single token (terrible use of the chip) — which is why generating text is slow.

Let's put real numbers on it, for a real 8-billion-parameter model (Qwen3-8B) on an H100:

Quantity	Prefill (40k-token prompt)	Decode (per token)
What it's limited by	Math (compute)	Memory traffic (bandwidth)
Work done	~750 trillion operations	~16 billion operations
Bytes moved	weights read once	~22 GB swept every token
Operations per byte	hundreds (chip is happy)	less than 1 (chip starves)
Resulting speed	~0.76 seconds total	~6.6 ms each → ~150 tokens/sec

Notice the decode line: to produce one token, the chip moves about 22 gigabytes of data (16 GB of weights plus 6 GB of "memory of the conversation" — we'll get to that). It does only 16 billion operations of math with it. That's well under one operation per byte. The H100, capable of a quadrillion operations a second, is running at less than 1% of its math potential during decode. It's a Ferrari in a traffic jam.
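The table's bottom rows follow from two divisions. Here is a minimal sketch that recomputes them from the figures quoted above; the variable names are just for illustration.

```python
# Figures quoted in the text for Qwen3-8B on an H100.
PEAK_FLOPS = 989e12      # math operations per second
BANDWIDTH = 3.35e12      # bytes per second from HBM

prefill_flops = 750e12   # work for the 40k-token prompt
decode_flops = 16e9      # work for one generated token
decode_bytes = 22e9      # 16 GB of weights + 6 GB of KV cache

# Prefill is compute-bound: time = work / math speed.
prefill_seconds = prefill_flops / PEAK_FLOPS

# Decode is memory-bound: time = bytes moved / memory speed.
decode_seconds = decode_bytes / BANDWIDTH
tokens_per_second = 1.0 / decode_seconds

# Operations per byte, and the share of peak math used while decoding.
decode_intensity = decode_flops / decode_bytes
decode_utilization = (decode_flops / decode_seconds) / PEAK_FLOPS

print(f"prefill: {prefill_seconds:.2f} s")
print(f"decode:  {decode_seconds * 1e3:.1f} ms/token, {tokens_per_second:.0f} tokens/s")
print(f"decode intensity: {decode_intensity:.2f} ops/byte")
print(f"decode math utilization: {decode_utilization:.2%}")
```

These are best-case floors: they assume the chip hits its full rated bandwidth and FLOPs, which real kernels only approach.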

Arithmetic intensity and the "roofline"

Engineers have a name for "operations per byte": arithmetic intensity. It's the single number that decides whether your chip is starving or thriving.

There's a famous picture called a roofline that captures this. Imagine a graph: the more operations you do per byte (intensity), the more of the chip's power you can actually use — up to a ceiling. The graph has two parts: a sloped part on the left (the "bandwidth wall," where your speed is capped by memory and rises with intensity), and a flat part on the right (the "compute ceiling," where you're limited only by raw math speed).

The corner where they meet sits at that magic number, ~295 operations per byte for the H100. Prefill lives way out on the flat ceiling. Decode sits far down on the sloped wall — using a tiny fraction of the chip. Same model, same chip, two completely different worlds depending on the phase.

Below the ridge you're limited by memory speed; above it, by raw math. Decode is stranded far down the slope using under 1% of the chip's math power, while prefill bumps against the ceiling.
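The roofline is simple enough to write as one line of code. The sketch below uses the H100 figures from the text; the prefill intensity of 1,000 operations per byte is an illustrative assumption, not a measured value.

```python
PEAK_FLOPS = 989e12   # H100 math speed quoted in the text
BANDWIDTH = 3.35e12   # H100 memory speed quoted in the text

def attainable_flops(intensity, peak_flops=PEAK_FLOPS, bandwidth=BANDWIDTH):
    """Roofline model: memory caps speed on the left, math caps it on the right."""
    return min(peak_flops, intensity * bandwidth)

# Intensity where the sloped wall meets the flat ceiling.
ridge = PEAK_FLOPS / BANDWIDTH

# 0.73 comes from the decode column above; 1000.0 is a hypothetical prefill value.
for name, intensity in [("decode", 0.73), ("ridge", ridge), ("prefill", 1000.0)]:
    share = attainable_flops(intensity) / PEAK_FLOPS
    print(f"{name:8s} intensity={intensity:8.2f}  share of peak={share:.2%}")
```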

The KV cache: the model's short-term memory

One more concept and we've got the full picture. Where did that extra 6 GB in the decode step come from?

When a model reads text, for every token it computes two helper vectors called K and V (keys and values). These let later tokens "pay attention" to earlier ones. Here's the thing: a token's K and V never change once computed. So instead of recomputing them every step, the model computes them once and stores them. That store is the KV cache.

An analogy

The KV cache is the model's running notes on the conversation so far. Every time it writes a new word, it glances back over all its notes. The longer the conversation, the thicker the notebook — and the longer it takes to skim before writing each new word.

The catch: the KV cache grows with the length of the conversation. For our 8B model it's about 147 KB per token. At 40,000 tokens that's ~6 GB. At a million tokens it would be 147 GB — nearly ten times bigger than the model's own weights. And during decode, the chip has to read the entire KV cache for every single new token. This is why long conversations get slow and expensive, and why an entire section of this post (Part 5) is devoted to shrinking it.

The model's weights are a fixed size. The KV cache keeps growing with every token of the conversation — and past roughly a hundred thousand tokens it becomes bigger than the model itself.
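To see where 147 KB per token comes from, here is a small sketch. The model shape (36 layers, 8 KV heads of dimension 128, 16-bit values) is my assumption about Qwen3-8B rather than something stated in the tutorial, but it reproduces the figure quoted above.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    # Two vectors (K and V) per KV head, per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value

# Assumed Qwen3-8B shape; treat these numbers as illustrative.
per_token = kv_bytes_per_token(layers=36, kv_heads=8, head_dim=128)
print(f"{per_token / 1e3:.0f} KB per token")

for tokens in (40_000, 1_000_000):
    print(f"{tokens:>9,} tokens -> {per_token * tokens / 1e9:.1f} GB")

# Context length at which the cache outgrows 16 GB of weights.
weights_bytes = 16e9
crossover_tokens = weights_bytes / per_token
print(f"cache exceeds weights after ~{crossover_tokens:,.0f} tokens")
```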

Two consequences worth previewing

Consequence 1: the two phases want different machines. Prefill is hungry for math, so it wants a chip stacked with calculators. Decode is hungry for memory speed and capacity, so it wants fat, fast memory. A natural idea follows: stop running them on the same box. Split them into a "prefill pool" and a "decode pool." This is called disaggregated serving, and we'll build it in Part 3.

Consequence 2: the local-inference loophole. If decode just wants memory bandwidth and capacity (not raw math), then for running a model at home on a single chat, a device with lots of unified memory can beat an expensive GPU. A Mac with 128 GB of memory can run a 70-billion-parameter model that literally won't fit on a $2,000 gaming card — because the gaming card, for all its speed, doesn't have enough room. (The catch: prefill on these home devices is painfully slow. On one such box, prefill ran ~390× faster than decode — same chip, wildly different phases.)

Hold onto one number

~295 FLOPs per byte. An H100 can do ~295 math operations in the time it takes to fetch a single byte. Everything that follows is about making that ratio survivable.

Part 2: Hardware: The Physics of Fast

If decode is bottlenecked by memory, the obvious move is "buy better memory." This part is about what the hardware can and can't do — and why money alone won't save you.

Why memory has to sit right next to the chip

Modern GPUs use a kind of memory called HBM (High Bandwidth Memory). The trick is physical: instead of putting memory chips inches away on the circuit board, HBM stacks the memory dies millimeters from the processor on a shared base. Distance is the enemy of bandwidth, so keeping memory practically touching the chip is how you hit trillions of bytes per second.

This leads to a beautiful, frustrating geometric fact engineers call the shoreline problem. A chip's computing power scales with its area (the calculators fill the interior). But its connections to the outside world — memory, other chips — can only happen at the edge. Area grows as the square of the chip's size; the edge grows only linearly. So every time chips get bigger or finer, they gain compute faster than they gain the ability to feed it. The starvation gets structurally worse with each generation.

Every chip boundary costs you an order of magnitude

Data moves at wildly different speeds depending on how far it has to travel. Roughly, in 2026:

Path	Speed	Relative
On-package HBM (memory → chip)	~8,000 GB/s	baseline
NVLink (chip → neighbor chip)	~1,800 GB/s	~4× slower
PCIe (chip → rest of computer)	~128 GB/s	~60× slower
Network card (box → box)	~100 GB/s	~80× slower
SSD (storage)	~14 GB/s	~570× slower

The lesson is blunt: keep the bytes home. Every time data crosses a chip boundary, you pay roughly a 10× speed penalty. A huge amount of systems design is just choreography to avoid those crossings.

The cruel scaling laws

Here's why you can't simply wait for better hardware to fix everything. Per chip generation, three things improve at three different rates:

What	Improvement per generation
Compute (math speed)	~4×
Bandwidth (memory speed)	~2×
Capacity (memory size)	less than 1.4×

Compute races ahead. Bandwidth limps behind. Capacity barely moves (and a 2025–26 memory-chip shortage made every gigabyte pricier). Since decode cares about bandwidth and capacity — the two laggards — the imbalance gets worse every year, not better. Money buys you time, not escape. The real fixes have to be clever, not just expensive. That's the rest of the post.

A genuinely brilliant trick: FlashAttention

Not every hardware-era win is about new silicon. Some are about using existing silicon smarter. The best example is FlashAttention, and it's worth understanding because it shows the whole philosophy in miniature.

Remember attention — where each token looks at every other token? Done naively, this builds a giant table of size (number of tokens) × (number of tokens). For a 40,000-token prompt, that table is 3.2 GB per attention head, per layer, and the naive method writes it out to slow memory and reads it back several times. It's pure wasted memory traffic.

FlashAttention's insight: never build the whole table. Instead, stream the data through the chip's tiny-but-blazing-fast on-chip scratchpad in small tiles, computing the answer incrementally and keeping only a running summary. The math comes out identical; the memory traffic collapses. The payoff was huge — successive versions took attention from using ~25% of the chip's potential up to ~85%. You almost certainly use it every time you talk to an AI; it's built into the standard libraries.
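The "running summary" idea can be sketched in NumPy. This is a simplified illustration, not the real kernel: it tiles only over keys, skips the causal mask, and runs on the CPU, whereas FlashAttention tiles queries too and keeps each tile in on-chip memory. The function names are made up for this example.

```python
import numpy as np

def naive_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])        # full (queries x keys) table
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v

def tiled_attention(q, k, v, tile=64):
    n_q, d = q.shape
    running_max = np.full((n_q, 1), -np.inf)       # largest score seen so far
    running_sum = np.zeros((n_q, 1))               # softmax denominator so far
    acc = np.zeros((n_q, v.shape[-1]))             # unnormalized output so far
    for start in range(0, k.shape[0], tile):
        k_tile, v_tile = k[start:start + tile], v[start:start + tile]
        scores = q @ k_tile.T / np.sqrt(d)         # only (queries x tile) at a time
        new_max = np.maximum(running_max, scores.max(axis=-1, keepdims=True))
        rescale = np.exp(running_max - new_max)    # correct the earlier partial sums
        p = np.exp(scores - new_max)
        running_sum = running_sum * rescale + p.sum(axis=-1, keepdims=True)
        acc = acc * rescale + p @ v_tile
        running_max = new_max
    return acc / running_sum

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, 32)) for n in (8, 300, 300))
same = np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v))
print("tiled result matches naive:", same)
```

The tiled version never holds more than one (queries × tile) block of scores, which is why the memory traffic drops while the result stays the same.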

The pattern to notice

FlashAttention didn't add a single transistor. It just rearranged the work so that fewer bytes crossed the slow boundary. That is the same move, over and over, for the rest of this post.

Smaller numbers, faster chips: precision formats

Here's a lever that helps both bottlenecks at once. The "numbers" in a model don't have to be stored at full precision. A weight can be a chunky 16-bit number, or a leaner 8-bit one, or even a tiny 4-bit one.

Why does this help twice over? First, smaller numbers mean fewer bytes to move — direct relief for the memory bottleneck. Second, the silicon needed to multiply two numbers grows with the square of their size, so smaller numbers also mean faster math. Low precision wins on both axes.

How small can you go? A 4-bit number can represent only 16 distinct values — you can literally list them all. The obvious worry is that this is too coarse to preserve a model's quality. The clever fix is block scaling: instead of one scale factor for everything, you give each small group of weights (say 16 or 32 of them) its own scale, so the 16 available levels can hug the actual range of that local group. Modern formats called MXFP4 and NVFP4 do exactly this, and they make 4-bit weights genuinely usable. One open model (gpt-oss) ships its weights this way, letting a 120-billion-parameter model fit on a single 80 GB GPU.
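Here is a rough sketch of why block scaling matters. It uses a plain symmetric integer grid (15 levels from -7 to 7) as a stand-in; the real MXFP4 and NVFP4 formats use 4-bit floating-point values and their own scale encodings, so treat this only as an illustration of the per-block-scale idea.

```python
import numpy as np

def quantize_blocks(x, block_size, levels=7):
    """Round onto a symmetric grid with one scale per block, then dequantize."""
    blocks = x.reshape(-1, block_size)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / levels
    scale[scale == 0] = 1.0                        # avoid dividing by zero
    codes = np.clip(np.round(blocks / scale), -levels, levels)
    return (codes * scale).reshape(x.shape)

rng = np.random.default_rng(0)
weights = rng.standard_normal(4096)
weights[0] = 50.0                                  # one synthetic outlier

# One scale for the whole tensor versus one scale per 32 weights.
err_tensor = np.mean((weights - quantize_blocks(weights, 4096)) ** 2)
err_block = np.mean((weights - quantize_blocks(weights, 32)) ** 2)
print(f"single scale error: {err_tensor:.4f}")
print(f"block scale error:  {err_block:.4f}")
```

With a single scale, the outlier stretches the grid for every weight; with per-block scales, only its own block of 32 pays the price.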

The deepest reason of all: moving data costs energy

If you remember one fact from this section, make it this one. Compare the energy cost of operations on a chip:

1 memory read ≈ 500 multiplies. Reading one number from main memory (DRAM) costs roughly 500 times the energy of actually multiplying two numbers together. Arithmetic is nearly free. Moving the operands is the entire budget.

This is the deepest reason every trick in this post tries to move fewer bytes — the fetch, not the arithmetic, dominates the bill.

This is the physics underneath everything. The chip would happily do far more math; it's the fetching that costs time and watts. Which is why, one more time: every technique in this post moves fewer bytes.

Hardware TL;DR

Compute grows ~4× per generation, bandwidth ~2×, capacity barely at all. The gap the model has to survive keeps widening. Hardware raises the ceiling; the clever algorithms in the next three parts lower how far below it you're forced to live.

Part 3: Serving: Many Users, One Copy of the Model

So far we've imagined one user, one request. Reality is thousands of users hitting one fleet of GPUs at once. "Serving" is the software layer that juggles them. Its entire job is to answer one question, over and over: where is the GPU stalling on memory, or recomputing something we already had?

The metrics that matter

Two numbers define a good experience. Time to first token (TTFT), how long before you see any response, is set by the prefill phase; under ~1 second "feels instant." Time per output token (TPOT), how fast words then stream out, is set by decode; around 20–50 ms per token matches comfortable reading speed. The serving system's challenge is hitting both targets for many users at once, on shared hardware, without wasting a single GPU cycle. Here are the five big moves it makes.

Move 1: Batch continuously

Recall that during decode, all the weights get streamed in anyway. If you process several users' requests together in one batch, they all share that single weight-read. The expensive memory traffic gets amortized across many users — free throughput. This is the core reason serving is efficient at all.

The naive way ("static batching") waits for a whole group of requests to finish before starting the next group, so the slot of a request that finishes early sits idle until its slowest neighbor is done. The fix, continuous batching, re-forms the batch every single step: the moment one request finishes, a new one takes its slot. No idle seats. This one change delivered throughput gains of 20× and beyond in the best cases and is now universal.

Requests finish at different times. Static batching leaves slots idle until the whole group is done; continuous batching slots a new request in the instant one finishes — so the expensive shared weight-read is never wasted.
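A toy simulation makes the difference concrete. It counts decode steps only and assumes every request is already waiting at the start; the request lengths and slot count are invented for the example.

```python
def static_batching_steps(lengths, slots):
    # Each group of `slots` requests runs until its longest member finishes.
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batching_steps(lengths, slots):
    waiting = list(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Refill any free slots before every decode step.
        while waiting and len(running) < slots:
            running.append(waiting.pop(0))
        steps += 1
        # Every running request emits one token; finished ones free their slot.
        running = [left - 1 for left in running if left > 1]
    return steps

lengths = [2, 9, 3, 8, 2, 10, 3, 9]   # tokens each request still has to generate
static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)
print(f"static: {static_steps} steps, continuous: {continuous_steps} steps")
```

Each step costs one full sweep of the weights, so fewer steps for the same tokens means less memory traffic per token.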

But batching has a tension. Sharing the weight-read is great, but each request carries its own KV cache, and those don't share. Batch 32 conversations of 40,000 tokens each and, at ~6 GB apiece, you need about 192 GB just for KV. So the bigger your batch, the more the KV cache (not the weights) becomes the thing choking your memory. Managing that tension is most of the rest of serving (and Part 5).

Move 2: Page the KV cache

Early systems reserved one big contiguous block of memory per request, sized for the maximum possible length. Most requests are short, so most of that reserved space sat empty — only 20–40% of KV memory actually held real data. The rest was waste, and that waste capped how many users you could serve.

The fix, PagedAttention (the idea behind the popular vLLM system), borrows a trick from how your operating system manages memory. Instead of one big reservation, chop KV memory into small fixed-size blocks handed out on demand, with a lookup table mapping each request to its scattered blocks. Fragmentation nearly vanishes, and you can pack far more users in. A bonus: if two requests share the same opening text, they can literally share the same physical blocks until they diverge.
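The block-table idea can be sketched in a few lines. This is a toy allocator written for this post, not vLLM's actual data structure: the class name, the block size of 16 tokens, and the request ids are all hypothetical, and it leaves out prefix sharing.

```python
class PagedKVCache:
    """Toy allocator that hands out fixed-size KV blocks on demand."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # request id -> list of physical block ids
        self.lengths = {}        # request id -> tokens stored so far

    def append_token(self, request_id):
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:   # no room left in the last block
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def release(self, request_id):
        # A finished request returns its blocks to the shared pool.
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(20):
    cache.append_token("chat-a")   # 20 tokens need 2 blocks of 16
for _ in range(5):
    cache.append_token("chat-b")   # 5 tokens need 1 block
print(cache.block_tables, "free:", len(cache.free_blocks))
```

The waste per request is at most one partly filled block, instead of a whole maximum-length reservation.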

Move 3: Cache shared prefixes

That sharing idea is huge, because in practice an enormous fraction of tokens are repeats. The same system prompt prepended to every chat. The same few-shot examples on every query. The same conversation history replayed each turn. The same tool descriptions for an AI agent. Re-doing the prefill for identical text every time is pure waste.

Prefix caching stores the KV cache of text you've already processed and reuses it. The SGLang system organizes this cleverly with a "radix tree" (think of a filing system where shared beginnings share a folder path), so any request that starts with text you've seen skips straight past it. A cache hit means skipped prefill: lower latency, freed compute, bigger batches. On workloads with lots of shared text, this delivered up to 5× more throughput.
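As a rough picture of the lookup, here is a plain token trie. It is a simplification: a real radix tree merges chains of single-child nodes, and a real cache stores KV blocks at the nodes and evicts old ones. The class name and token ids are hypothetical.

```python
class PrefixCache:
    """Toy token trie that records which prompt prefixes have been seen."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for token in tokens:
            node = node.setdefault(token, {})

    def cached_prefix_length(self, tokens):
        node, matched = self.root, 0
        for token in tokens:
            if token not in node:
                break
            node = node[token]
            matched += 1
        return matched

cache = PrefixCache()
system_prompt = [101, 7, 7, 42, 9]        # hypothetical token ids
cache.insert(system_prompt + [55, 56])    # first request populates the cache

request = system_prompt + [80, 81, 82]    # second request shares the system prompt
hit = cache.cached_prefix_length(request)
to_prefill = len(request) - hit           # only the new suffix needs prefill
print(f"cache hit: {hit} tokens, still to prefill: {to_prefill}")
```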

You've already paid for this

This is why AI providers charge far less for "cached" input. Anthropic charges as little as 10% of the normal price for cached tokens; OpenAI and Google have similar discounts. Your bill is, quite literally, a cache-hit-rate report. Structuring your prompt with the stable parts (system prompt, documents, examples) first is how you get them cached — and cut your costs.

Move 4: Don't let one long prompt freeze everyone

If one user sends a 40,000-token prompt, a naive scheduler will grind through that entire prefill while every other user's response freezes mid-sentence. The fix, chunked prefill, slices the giant prompt into bite-sized chunks and interleaves them with everyone else's decode steps. Each step does a fixed amount of work — a little prefill plus the ongoing decodes — so no one stalls.
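A minimal sketch of the scheduling rule, assuming a fixed per-step token budget. The budget of 2,048 tokens and the 48 active decodes are invented numbers, and real schedulers juggle many prompts at once.

```python
def schedule_steps(prompt_tokens, active_decodes, token_budget):
    """Yield (prefill_chunk, decode_tokens) for each step under a fixed budget."""
    remaining = prompt_tokens
    while remaining > 0:
        # Ongoing decodes go first: one token each per step.
        decode_tokens = min(active_decodes, token_budget)
        # Whatever budget is left goes to the next slice of the long prompt.
        chunk = min(remaining, token_budget - decode_tokens)
        if chunk == 0:
            raise ValueError("budget leaves no room for prefill")
        remaining -= chunk
        yield chunk, decode_tokens

steps = list(schedule_steps(prompt_tokens=40_000, active_decodes=48, token_budget=2_048))
print(f"{len(steps)} steps, first step: {steps[0]}")
```

Every step does a bounded amount of work, so the other users keep receiving tokens while the long prompt is absorbed a slice at a time.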

Move 5: Disaggregate the phases

Here's where the Part 1 insight pays off. Prefill wants compute; decode wants memory bandwidth. So run them on separate pools of machines, each tuned for its job, and ship the KV cache from the prefill pool to the decode pool when handing off. A "conductor" routes each request to a good (prefill machine, decode machine) pair. This architecture — pioneered by systems called DistServe and Mooncake — delivered large throughput gains in production at companies serving over 100 billion tokens a day.

A bonus move: speculative decoding

This one is delightfully sneaky. During decode the GPU's calculators are mostly idle (remember, it's starving for memory, not math). So... spend those free FLOPs. A small, cheap "draft" model quickly guesses the next handful of tokens. Then the big, real model checks all of those guesses in a single forward pass — which costs the same memory traffic as producing one token, but validates several at once. Where the guesses are right (which is often), you got multiple tokens for the price of one.

The beautiful part: there's a precise statistical rule for accepting and rejecting guesses that guarantees the final output is exactly the distribution the big model would have produced on its own. It's a pure speedup with no quality cost — typically 2–5×. The one caveat: if your batch is already huge, the calculators aren't idle anymore, so verifying wrong guesses becomes pure overhead. Speculation helps when you have FLOPs to spare; measure before turning it on at scale.
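The statistical rule is short enough to check by hand. The sketch below uses the standard speculative-sampling rule (accept a drafted token with probability min(1, p/q), otherwise resample from the normalized positive part of p − q) and computes the resulting distribution exactly for one token. The two distributions are made up, and the code assumes the draft gives every token nonzero probability.

```python
import numpy as np

def speculative_output_distribution(p, q):
    """Exact distribution of one token after draft-then-verify.

    p: target model probabilities, q: draft model probabilities (all > 0).
    """
    accept_prob = np.minimum(1.0, p / q)   # chance of accepting draft token x
    accepted = q * accept_prob             # mass arriving via accepted drafts
    residual = np.maximum(p - q, 0.0)      # on rejection, resample from this
    total = residual.sum()
    if total > 0:
        residual = residual / total
    reject_mass = 1.0 - accepted.sum()
    return accepted + reject_mass * residual

p = np.array([0.5, 0.3, 0.15, 0.05])       # hypothetical target distribution
q = np.array([0.25, 0.25, 0.25, 0.25])     # hypothetical draft distribution
out = speculative_output_distribution(p, q)
acceptance_rate = np.minimum(p, q).sum()
print("output:", out, "acceptance rate:", acceptance_rate)
```

The output matches p regardless of how good the draft is; a worse draft only lowers the acceptance rate, and with it the speedup.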

Serving TL;DR

Five moves to keep the GPU fed: batch continuously, page the KV cache, cache shared prefixes, chunk long prefills so they don't stall everyone else, and disaggregate the two phases. Every one of them buys back arithmetic intensity: more useful work per byte dragged out of memory.

Part 4: Weight Compression: Same Brain, Fewer Bytes

Serving made the system efficient. Now we make the model itself cheaper to move. Since decode is bottlenecked by streaming all the weights every token, the goal is simple: fewer bytes of weights. There are two independent ways to do it, and you can use both.

Where the weight bytes actually live

Crack open a typical model and you find that the feed-forward network (FFN) — the part that "thinks" about each token's features — hogs roughly two-thirds of all the weights. For our 8B example: ~66% FFN, ~18% attention, the rest embeddings. Every token currently runs through the entire FFN. So the FFN is the first place to hunt for savings.

Idea 1: Mixture of Experts (use fewer weights per token)

What if, instead of one giant FFN that every token must run through, you had many smaller FFNs ("experts") and each token used only a few of them?

That's a Mixture of Experts (MoE). A little router looks at each token and picks, say, the 8 most relevant experts out of 128. The token only runs through those 8. You keep all the "knowledge" of a huge model, but each token only pays for a small slice of it.

All the experts sit in memory (that's the capacity cost), but each token only streams through the handful the router picks (that's the savings). Huge total knowledge, small per-token bill.
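A minimal top-k router looks like the sketch below. To keep it short, each "expert" is a single random matrix rather than a real feed-forward network, and the gate is a softmax over the chosen experts only; actual models differ in these details.

```python
import numpy as np

def moe_layer(x, router_w, experts, top_k):
    """Route one token vector x through only its top_k experts."""
    logits = router_w @ x                          # one score per expert
    chosen = np.argsort(logits)[-top_k:]           # indices of the top_k scores
    gates = np.exp(logits[chosen] - logits[chosen].max())
    gates /= gates.sum()                           # softmax over the chosen few
    out = sum(g * (experts[i] @ x) for g, i in zip(gates, chosen))
    return out, chosen

rng = np.random.default_rng(0)
d_model, num_experts, top_k = 16, 128, 8
router_w = rng.standard_normal((num_experts, d_model))
experts = rng.standard_normal((num_experts, d_model, d_model))  # stand-in experts
x = rng.standard_normal(d_model)

y, chosen = moe_layer(x, router_w, experts, top_k)
fraction_streamed = top_k / num_experts
print(f"experts used: {sorted(chosen.tolist())}")
print(f"share of expert weights read for this token: {fraction_streamed:.1%}")
```

All 128 expert matrices exist in memory, but the loop touches only eight of them for this token.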

An analogy

A dense model is one generalist doctor who personally handles every patient. An MoE is a hospital with 128 specialists and a triage nurse who sends each patient to the right few. The hospital knows far more in total, but any single patient only occupies a couple of specialists.

Concretely, compare two real models. The dense Qwen3-8B uses all 8.2 billion parameters for every token. Its MoE sibling, Qwen3-30B-A3B, stores 30.5 billion parameters total but activates only ~3.35 billion per token. It's ~4× bigger in knowledge yet does less work per token.

The trade-off: MoE swaps memory capacity for compute. You must store all 30 billion parameters in memory (even the idle experts), but you stream far fewer per token. That's a great deal when memory capacity is cheap relative to bandwidth — which is exactly the situation modern hardware is in. The sparsity has been climbing fast: from using 1-in-4 of the weights a few years ago to as little as 1-in-31 in the newest models.

There's a subtlety worth knowing: training an MoE requires load balancing, or the router lazily keeps picking the same few experts while the rest starve and die. Various tricks (an auxiliary loss, or a per-expert bias nudged toward under-used experts) keep all the experts alive and learning.

Idea 2: Quantization (use fewer bits per weight)

The other axis: store each individual weight in fewer bits. Go from 16 bits to 8 and you halve the bytes streamed — halving the decode floor and fitting on a smaller GPU. Go to 4 bits and you quarter it. This is quantization, and it's where a lot of beautiful engineering lives.

First, a free win. It turns out you can't usefully gzip a model — the precise mantissa bits look like random noise and don't compress. But the exponent bits (which set the rough magnitude) are highly predictable in a trained model — only 2–3 bits of real information in an 8-bit field. So you can losslessly squeeze a model down to about 4.7 bits per weight with no quality loss at all. Below that floor, though, information genuinely has to be thrown away. The art is throwing away the bits that don't matter.

The villain: outliers

Quantization means rounding every weight onto a coarse grid of allowed values. The problem is outliers: a single unusually large weight forces the grid to stretch wide enough to include it, which makes the spacing coarse for the hundreds of ordinary weights around it. One heavy hitter ruins the precision for all its neighbors. Every serious quantization method is a different answer to the outlier problem:

The grid has only a fixed number of rungs. One unusually large weight forces the rungs far apart, so all the ordinary weights get squashed onto just a couple of them — losing precision. GPTQ, AWQ, and rotation are three different ways to defuse this.

GPTQ — compensate. After rounding each weight, it calculates the error introduced and pushes a correction onto the not-yet-rounded weights. It uses calibration data to figure out which errors actually matter for the model's output, turning quantization into a tidy least-squares problem. Result: 4-bit weights with almost no quality loss, and a 175-billion-parameter model fitting on a single GPU.

AWQ — protect. It notices that only ~1% of weight channels carry most of the important signal. It identifies them (by which ones see the biggest activations) and rescales them so rounding can't hurt them. Prevent the mistake instead of fixing it after.

Rotation — remove. The most elegant: multiply the weights by a carefully chosen rotation matrix. A rotation doesn't change the model's output, but it smears the outliers out across all the coordinates, turning a spiky distribution into smooth, near-Gaussian "dust" with no outliers left to worry about. Then plain uniform quantization just works.
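The rotation trick is easy to try on synthetic data. In this sketch the rotation is a random orthogonal matrix from a QR decomposition (published methods often use Hadamard matrices instead), the weight vector has one planted outlier, and the quantizer is a plain symmetric 15-level grid. All of that is assumption for illustration.

```python
import numpy as np

def quantize_uniform(x, levels=7):
    """Symmetric uniform grid with a single scale, then dequantize."""
    scale = np.abs(x).max() / levels
    return np.clip(np.round(x / scale), -levels, levels) * scale

rng = np.random.default_rng(0)
n = 512
w = rng.standard_normal(n)
w[3] = 40.0                          # one planted outlier weight
x = rng.standard_normal(n)           # an input vector

# Random orthogonal matrix: its transpose is its inverse.
rotation, _ = np.linalg.qr(rng.standard_normal((n, n)))

w_rot = rotation.T @ w               # rotated weights
x_rot = rotation.T @ x               # the input gets the same rotation
# The dot product is unchanged: (R^T w) . (R^T x) = w^T R R^T x = w . x

err_plain = np.mean((w - quantize_uniform(w)) ** 2)
err_rot = np.mean((w_rot - quantize_uniform(w_rot)) ** 2)
print(f"largest |weight| before: {np.abs(w).max():.1f}, after: {np.abs(w_rot).max():.1f}")
print(f"quantization error before: {err_plain:.3f}, after: {err_rot:.3f}")
```

The outlier's energy is still there after rotation, but it is spread thinly over all 512 coordinates, so the grid no longer has to stretch to reach it.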

One more wrinkle: weights are the easy part. Activations (the data flowing through the model as it runs) carry nasty dynamic outliers, which is why "4-bit weights" (W4A16) is routine but "4-bit weights and 4-bit activations" (W4A4) needs those rotation tricks to work at all.

Weight compression TL;DR

Two orthogonal levers, use both. MoE = fewer weights per token. Quantization = fewer bits per weight, where ~4.7 bits is the lossless floor and clever methods push below it. The payoff: a 30-billion-parameter MoE at 4-bit quality fits in ~16 GB of storage and streams only ~1.7 GB per token. Frontier-class quality, laptop-class memory.

Part 5: KV-Cache Compression: How to Afford a Million Words

We've shrunk the system and the weights. The last giant is the KV cache — the model's growing notebook of the conversation. As we saw, it can balloon past the size of the model itself. This part is about taming it, and it's what makes million-token context windows possible.

Why the cache is the real long-context problem

Two things make the KV cache explode. It grows with batch size: each user has their own cache, so 100 users × 100,000 tokens each = ~1.5 TB of KV cache alone — eighteen H100s' worth of memory, before you've stored a single weight. You run out of memory long before you run out of math. And it grows with modality: text rarely needs a million tokens, but an hour of video, fed in frame by frame, can fill a million-token window all on its own. Audio and video are KV-cache firehoses.

And this directly sets your costs: KV size caps your batch size, which caps your throughput, which sets your price per token. Shrinking the KV cache improves all three at once. There are essentially five knobs, and the first four multiply together.

Knob 1: Shrink the per-token state

The biggest structural win. The starting point, already standard in most models, is GQA (Grouped-Query Attention): instead of every attention head keeping its own K and V, groups of heads share them, cutting the cache ~4–8× for free.

The deeper move is MLA (Multi-head Latent Attention), introduced by DeepSeek. Instead of storing the full K and V vectors, it stores a small compressed summary (a "latent") and reconstructs what it needs on the fly. The result is dramatic: DeepSeek's huge model stores only ~70 KB per token — less than half of what our tiny 8B model needs with GQA, on a model ~80× larger. At that rate, a million tokens fits in ~70 GB, on a single node. Quality actually came out better than the traditional approach. The catch is that MLA has to be baked in during training — though researchers have since found ways to retrofit it onto already-trained models with a small amount of fine-tuning.

Knob 2: Shrink the bits (quantize the cache)

Same idea as weight quantization, applied to the cache. A method called KIVI noticed something neat: the keys and values misbehave in opposite ways, so they should be quantized along different axes (keys per-channel, values per-token). Done right, you get the KV cache down to 2 bits with almost no quality loss, roughly tripling throughput. An even more principled approach, TurboQuant, leans on information theory: it proves that naive coordinate-by-coordinate quantization wastes about 0.25 bits per number versus the theoretical optimum, then uses a random rotation (that Gaussian-dust trick again) to claw most of it back — with no calibration data needed.
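To get a feel for why the axis matters, here is a sketch of the key half of that observation on synthetic data. It assumes keys have one channel with consistently large values, which is a simplified stand-in for the pattern KIVI describes, and it uses a basic 4-level min/max quantizer rather than KIVI's grouped scheme.

```python
import numpy as np

def quantize_2bit(x, axis):
    """4-level min/max quantization with one offset and scale per slice along `axis`."""
    lo = x.min(axis=axis, keepdims=True)
    scale = (x.max(axis=axis, keepdims=True) - lo) / 3.0
    scale[scale == 0] = 1.0
    return np.round((x - lo) / scale) * scale + lo

rng = np.random.default_rng(0)
tokens, channels = 256, 64
keys = rng.standard_normal((tokens, channels))
keys[:, 5] += 30.0                          # one channel is large for every token

per_channel = quantize_2bit(keys, axis=0)   # statistics over tokens, one set per channel
per_token = quantize_2bit(keys, axis=1)     # statistics over channels, one set per token

err_channel = np.mean((keys - per_channel) ** 2)
err_token = np.mean((keys - per_token) ** 2)
print(f"per-channel error: {err_channel:.3f}, per-token error: {err_token:.3f}")
```

Grouping by channel isolates the large channel in its own scale, while grouping by token forces every token's grid to stretch across it.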

Knob 3: Read less (sparse attention)

This knob attacks a different axis. Knobs 1 and 2 shrink what you store; sparse attention shrinks what you read. The insight: when generating a token, attention mass is mostly local (recent tokens) or concentrated on a few "hub" tokens. So you don't need to read the whole cache every step, just the relevant parts. The cache stays full-size in memory, but the per-token traffic collapses, which is what actually sets decode speed.

Approaches range from simple fixed patterns (only attend to a sliding window of recent tokens, à la Mistral; or mix mostly-local layers with occasional global ones, à la Gemma) to learned routing where a little gate decides which blocks of the past each query should read (NSA from DeepSeek, MoBA from Moonshot, DSA in DeepSeek's newer models). DSA's sparse reads halved the company's long-context API prices — a vivid reminder that algorithmic efficiency shows up directly on the invoice.
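The learned-routing idea can be sketched as a two-stage read: score each block of the past cheaply, then run real attention only over the winners. This is a simplified illustration with assumed sizes. The gate here is just a dot product with each block's mean key, whereas production methods add causal masking, always include the current block, and often train the gate.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def block_sparse_decode(q, keys, values, block_size, top_k):
    """Attend to only the top_k key blocks, scored by each block's mean key."""
    T, d = keys.shape
    n_blocks = T // block_size                          # assumes T divides evenly
    block_means = keys.reshape(n_blocks, block_size, d).mean(axis=1)
    gate_scores = block_means @ q                       # cheap: one dot product per block
    chosen = np.sort(np.argsort(gate_scores)[-top_k:])  # indices of the selected blocks
    token_idx = (chosen[:, None] * block_size + np.arange(block_size)).ravel()
    weights = softmax(keys[token_idx] @ q / np.sqrt(d))
    return weights @ values[token_idx], token_idx

rng = np.random.default_rng(0)
T, d, block_size = 1024, 64, 64                         # assumed sizes
q = rng.normal(size=d)
keys = rng.normal(size=(T, d))
values = rng.normal(size=(T, d))

out, read = block_sparse_decode(q, keys, values, block_size, top_k=2)
print(f"read {read.size} of {T} cached tokens ({read.size / T:.1%})")
```

In a real system the block means would be computed once and kept alongside the cache, so the gate itself touches only a sliver of memory. Setting `top_k` to the number of blocks recovers ordinary full attention.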

Knob 4: Drop what you don't need (eviction)

The bluntest knob: just throw tokens out of the cache. Methods like StreamingLLM, H2O, and SnapKV keep the important tokens (the first few "sink" tokens, the recent window, and a handful of heavy hitters) and evict the rest, often holding quality at a fraction of the cache size. One granularity up, compaction drops whole conversation turns — pause, summarize the old history, throw away the raw cache, resume. (If you've seen an AI coding assistant "compact" a long session, that's this.) It's fine for chat but risky for agents, since summarizing can quietly lose a detail that mattered.
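The keep-or-evict rule described above can be written as a small selection function. This is a sketch of the general recipe, not any one paper's algorithm: `tokens_to_keep` and its parameters are hypothetical names, and the accumulated attention scores are made up.

```python
import numpy as np

def tokens_to_keep(attn_mass, n_sink, n_recent, n_heavy):
    """Pick cache positions to keep: first tokens, recent window, heavy hitters.

    attn_mass: (T,) accumulated attention each cached token has received so far.
    """
    T = attn_mass.shape[0]
    keep = np.zeros(T, dtype=bool)
    keep[:n_sink] = True                               # attention sinks at the start
    keep[max(0, T - n_recent):] = True                 # most recent window
    candidates = np.where(~keep)[0]                    # everything else competes
    heavy = candidates[np.argsort(attn_mass[candidates])[::-1][:n_heavy]]
    keep[heavy] = True                                 # highest accumulated attention
    return np.flatnonzero(keep)

attn_mass = np.array([9.0, 0.1, 0.2, 5.0, 0.1, 0.3, 4.0, 0.2, 0.1, 0.6, 0.7, 0.8])
kept = tokens_to_keep(attn_mass, n_sink=1, n_recent=3, n_heavy=2)
print(kept)
```

Here 12 cached tokens shrink to 6. The risk is the same one the text flags for compaction: a token that looked unimportant so far may be exactly the one a later query needs.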

Knob 5: Get rid of the cache entirely

The radical alternative. Knobs 1–4 shrink a cache that still grows with length. What if you used an architecture whose memory doesn't grow at all?

State Space Models (like Mamba-2) keep a fixed-size summary of everything seen so far — a constant few megabytes, no matter how long the context. Decode becomes constant-cost per step. The trade-off is recall: pure state-space models have fuzzy memory and struggle to fetch an exact detail from far back. The practical answer is hybrids — mostly cheap state-space layers with a few full-attention layers sprinkled in to handle precise retrieval. Several recent models (Qwen3-Next, IBM Granite 4, NVIDIA Nemotron-H, Jamba) are built this way, with ratios like 1 attention layer per 7–12 cheap ones.
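A toy recurrence makes the "memory doesn't grow" property concrete. This is a deliberately simplified stand-in for a state-space layer, not Mamba-2: the decay is a single fixed number, and the input and readout matrices are random, hypothetical placeholders.

```python
import numpy as np

def run_recurrence(xs, decay, state_dim):
    """Toy diagonal linear recurrence: h <- decay * h + B x, read out with C.

    A deliberately simplified stand-in for a state-space layer.
    """
    rng = np.random.default_rng(0)
    d_in = xs.shape[1]
    B = rng.normal(size=(state_dim, d_in)) / np.sqrt(d_in)       # hypothetical input map
    C = rng.normal(size=(d_in, state_dim)) / np.sqrt(state_dim)  # hypothetical readout
    h = np.zeros(state_dim)
    outputs = []
    for x in xs:
        h = decay * h + B @ x          # fixed work per step, independent of position
        outputs.append(C @ h)
    return np.array(outputs), h

rng = np.random.default_rng(1)
short_out, short_state = run_recurrence(rng.normal(size=(10, 8)), decay=0.9, state_dim=32)
long_out, long_state = run_recurrence(rng.normal(size=(5000, 8)), decay=0.9, state_dim=32)
print(short_state.nbytes, long_state.nbytes)
```

The state is the same size after 10 tokens or 5,000. The decay term also hints at why recall is fuzzy: an input from n steps ago survives only with weight `decay ** n`, so distant details fade unless an attention layer can look them up directly.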

Putting it together: the million-token bill

Because knobs 1–4 are on different axes, they stack. Watch a million-token cache shrink:

Step	Effect	KV size
Plain GQA baseline (1M tokens)	—	147 GB
+ MLA-style latent	÷8	18 GB
+ 4-bit KV quantization	÷4	4.6 GB
+ sparse reads	÷10 traffic	~0.5 GB read/token

Shrink the state, then its bits, then how much you read, and an impossible 147 GB collapses to a few gigabytes stored, while the per-token traffic that sets your latency drops by another order of magnitude. Stacked together, these tricks make a million tokens of context feasible on a single node.
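The million-token bill is plain division, which makes it easy to rerun with your own factors. The numbers below come from the table above. The variable names are illustrative, and treating the factors as exactly independent is a simplifying assumption.

```python
baseline_gb = 147.0                      # plain GQA, 1M tokens (from the table)

steps = [
    ("MLA-style latent", 8),             # shrink the per-token state
    ("4-bit KV quantization", 4),        # 16-bit values stored in 4 bits
]

stored_gb = baseline_gb
for name, factor in steps:
    stored_gb /= factor
    print(f"+ {name}: {stored_gb:.1f} GB stored")

read_gb_per_token = stored_gb / 10       # sparse reads touch about a tenth
print(f"+ sparse reads: {read_gb_per_token:.2f} GB read per token")
print(f"combined reduction in traffic: {baseline_gb / read_gb_per_token:.0f}x")
```

Storage and traffic are tracked separately on purpose: the first two knobs cut both, while sparse reads cut only the bytes moved per generated token.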

KV compression TL;DR

Four orthogonal, multiplying knobs — shrink the state (MLA), shrink its bits (quantize), shrink the reads (sparse), drop what you never needed (evict) — plus a fifth radical option that replaces the growing cache with constant-size state (Mamba/hybrids). Turn them together and long context gets affordable.

Part 6: The Wrap-Up

We've covered a lot of ground — from "what is a FLOP" to retrofitting low-rank attention onto trained models. Let's zoom back out, because the whole sprawling field collapses into one idea.

The meta-lesson

Arithmetic intensity is destiny. Every single technique in this post moves fewer bytes per token. MoE moves fewer weights. Quantization moves fewer bits. Paging stops wasting KV memory. Prefix caching reads prompts once. MLA shrinks the cache 8×. Sparse attention reads a tenth of it. Disaggregation sends each phase to the hardware it wants. The chip does not care whether you saved the bytes through architecture, compression, or policy. Fewer bytes in means faster tokens out.

Seven numbers worth remembering

Number	What it means
~1 PFLOP/s vs ~3.4 TB/s	An H100's math speed vs its memory speed — the source of all the trouble
~300 FLOPs/byte	The "ridge point": operations you must do per byte fetched to keep the chip busy
~4× compute, ~2× bandwidth per generation	Why the pressure keeps rising — math outruns memory
~150 KB per token	KV cache for a small (8B) model, per conversation — the thing that grows
~500× energy	One memory read vs one multiply — moving data is the whole cost
0.1× price	What a cached token can cost vs an uncached one — prefix caching, as a discount
~400× gap	Prefill vs decode speed on the same edge chip — the two phases are different worlds
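The first two rows of the table are connected by one division, and a second makes the decode problem visible. The chip figures are the table's. The workload (an 8B-parameter model with 16-bit weights at batch size 1, ignoring KV-cache reads) is an assumption for illustration.

```python
peak_flops = 1e15          # ~1 PFLOP/s, from the table
mem_bandwidth = 3.4e12     # ~3.4 TB/s, from the table

ridge = peak_flops / mem_bandwidth
print(f"ridge point: {ridge:.0f} FLOPs per byte")

# Assumed decode workload: an 8B-parameter model in 16-bit weights, batch size 1.
# Each generated token reads every weight once and does ~2 FLOPs per weight.
params = 8e9
bytes_per_token = params * 2
flops_per_token = params * 2
intensity = flops_per_token / bytes_per_token

tokens_per_s_memory = mem_bandwidth / bytes_per_token
tokens_per_s_compute = peak_flops / flops_per_token
print(f"decode intensity: {intensity:.0f} FLOP per byte")
print(f"memory-bound ceiling: {tokens_per_s_memory:.0f} tokens/s")
print(f"compute-bound ceiling: {tokens_per_s_compute:.0f} tokens/s")
```

Under these assumptions, single-user decode does about one operation per byte fetched against a ridge point near 300, so the memory ceiling is what you actually hit and the math units sit mostly idle.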

If you take away a single sentence, make it this one: a fast byte beats a fast FLOP — and moving fewer bytes beats buying more hardware.

Where to go deeper

If this sparked your curiosity, the tutorial points to some excellent free next steps: the "How to Scale Your Model" guide from DeepMind for the roofline-to-parallelism story; Hugging Face's Ultra-Scale Playbook for training at cluster scale; Stanford's CS336 course for building a language model from scratch; the GPU MODE lecture series for the CUDA-to-quantization pipeline; and Horace He's "Making Deep Learning Go Brrrr" for the compute-vs-bandwidth mental model that underpins this entire post.

And if you ever find yourself staring at a slow model wondering what to optimize, ask the question that every slide in this tour kept asking: where is the chip stalling on memory — and how do I move fewer bytes?

This post is an explanatory walk-through of the ideas in Alex Smola's "Efficiency in LLMs: Hardware, Serving, and Compression" tutorial, presented at the Columbia Machine Learning Summer School, 2026. All figures and example numbers are drawn from that tutorial, which notes they were verified as of June 2026 and will date quickly. Any errors in simplification are mine, not the original author's. Original slides: alex.smola.org/posts/45-mlss-efficiency.