Where we are in the stack: the memory pattern that dominates everything during decode. We have the weights loaded (Part 2) and we know how a forward pass computes (Part 3). Now we ask what gets remembered between forward passes and what it costs.
In Part 3 we walked one token through a transformer and noticed that the attention block needs to look at every prior token in the context. If naive, each decode step would reprocess the entire prefix. That would turn a 1000 token generation into 1000 + 999 + 998 + ... = 500,500 token-equivalents of work. Quadratic, intolerable.
The fix is the KV cache, and once you understand it you understand why modern serving systems put so much engineering into one tensor.
Why K and V are cached, but Q is not
The attention computation inside a single layer, for the token currently being generated, looks like this:
Q = x @ W_Q # 1 query for the new token
K_new = x @ W_K # 1 new key for the new token
V_new = x @ W_V # 1 new value for the new token
K_full = concat(K_cached, K_new)
V_full = concat(V_cached, V_new)
attn = softmax(Q @ K_full.T / sqrt(d)) @ V_full
The new token's Q vector queries against every prior K and pulls weighted information from every prior V. To compute attention for a single new token, we need K and V for all prior tokens, but only Q for the new token.
- Q: computed once for the new token, then thrown away.
- K: compute one new key for the new token, append it to the cache, read all prior K's.
- V: compute one new value for the new token, append it to the cache, read all prior V's.
So Q is computed fresh for the new token every step. K and V from prior tokens never need to be recomputed, because they only depend on those tokens' own input vectors and the (frozen) weight matrices. Cache them once, read them forever.
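Here is what that loop looks like as runnable code. This is a minimal single-head, single-layer NumPy sketch with made-up shapes, not how a real kernel is written (real implementations are batched, multi-head, and fused):

```python
import numpy as np

d = 64                                             # head_dim (illustrative)
rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))

K_cache = np.zeros((0, d))                         # one row per cached token
V_cache = np.zeros((0, d))

def decode_step(x, K_cache, V_cache):
    """x: [d] hidden state of the newest token. Returns (output, K_cache, V_cache)."""
    q = x @ W_Q                                    # Q for the new token only, then discarded
    K_cache = np.vstack([K_cache, x @ W_K])        # append the new K, keep all old ones
    V_cache = np.vstack([V_cache, x @ W_V])        # append the new V, keep all old ones
    scores = (K_cache @ q) / np.sqrt(d)            # [seq_len] logits against every cached K
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over all cached positions
    return weights @ V_cache, K_cache, V_cache     # weighted mix of every cached V

for _ in range(5):                                 # five decode steps; cache grows by one row each
    out, K_cache, V_cache = decode_step(rng.standard_normal(d), K_cache, V_cache)
```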
The KV cache is the data structure that turns a quadratic algorithm into a linear one.
Sizing the KV cache
Per token, the KV cache stores K and V tensors of shape [n_kv_heads, head_dim] in each layer. Total bytes per token:
kv_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
The factor of 2 is for K and V. Plug in real models, BF16:
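As a quick back-of-the-envelope script (the Llama 3 70B shape used here, 80 layers, 8 KV heads via GQA, head dim 128, is the widely published configuration; treat it as an assumption if your checkpoint differs):

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # factor of 2: one K and one V vector per head, per layer
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Llama-3-70B-like shape (assumed): 80 layers, 8 KV heads (GQA), head_dim 128, BF16
per_token = kv_bytes_per_token(80, 8, 128)
print(per_token / 2**10, "KiB per token")                      # 320 KiB

ctx = 128 * 1024                                               # 128k-token context
print(per_token * ctx / 2**30, "GiB for one sequence")         # 40 GiB

batch, seq = 32, 8000                                          # the serving example below
print(per_token * batch * seq / 2**30, "GiB for the batch")    # ~78 GiB, the roughly 80 GB figure below
```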
The numbers are sneaky in how fast they grow. At a 128k token context (which several modern models support), Llama 3 70B needs 40 GB of KV cache for one sequence. The model itself is 140 GB. Already 28% of model memory.
This is before you batch. If you want to serve 32 concurrent users at 8000 tokens each on Llama 3 70B, the KV cache alone totals roughly 80 GB: enough to fill an entire H100 (80 GB) before you load a single weight, and the weights themselves are another 140 GB. You need multiple GPUs, smaller batches, shorter contexts, KV cache quantization, or some combination.
Why long context blows up VRAM faster than people expect
Three independent effects pile up.
- Linear scaling with sequence length. Doubling context doubles KV cache.
- Linear scaling with batch size. Doubling concurrent users doubles KV cache.
- The KV cache must be in VRAM to be useful. Unlike the weights, which stay the same size no matter how long the context is, the KV cache grows with the context, and the attention kernel reads all of it and appends one new token's worth to it on every layer of every decode step. Storing it on CPU or NVMe means PCIe round trips on every layer of every decode step, which is unacceptable for most workloads (though there are exceptions, which we will get to).
Combined, you can quickly hit a regime where the KV cache is larger than the model. For Llama 3 70B serving long context users, the KV cache often exceeds the model weights by 2x or 3x in total VRAM. The serving system's job is largely to manage this.
The fragmentation problem
Suppose you have 40 GB of VRAM available for KV cache after loading the model. You receive a request with a 4000 token prompt and the user wants up to 2000 new tokens. You don't know up front how many they will actually use. The model might emit an EOS at token 100 or run all the way to 2000.
Naive allocation reserves enough KV cache for the worst case. 4000 + 2000 = 6000 tokens. For Llama 3 70B that is 1.92 GB per request. You can fit 20 such requests in 40 GB. But if most requests only generate 200 tokens, you allocated for 2000, used 200, and wasted roughly 0.6 GB per request, about 30% of the reservation. Your effective batch size collapses.
What if you allocate conservatively? Then a request that wants 2000 tokens has to be cut off, or you have to reallocate and copy. Reallocating GPU memory mid-sequence is painful. What if you allocate exactly the prompt length and grow as needed? Now every request grows independently, and after a few hundred requests have come and gone you have a swiss cheese pattern of allocated and free chunks. Classic fragmentation.
The two strategies on the table: reserve the worst case per request, or allocate fixed-size blocks on demand as the sequence grows.
Kwon et al. measured this in their vLLM / PagedAttention paper (SOSP 2023). For typical workloads, naive allocation wasted 60% to 80% of KV cache memory. The actual computation was using only 20% to 40% of what was reserved. The fix changed how the industry builds serving systems.
PagedAttention: virtual memory for the KV cache
The vLLM team's insight was that the KV cache fragmentation problem is exactly the same problem that operating systems solved for process memory decades ago, using virtual memory and paging. The solution looks almost identical.
PagedAttention divides the KV cache into fixed size blocks (typically 16 tokens per block). Each block stores the K and V tensors for those 16 tokens, across all layers. The blocks are allocated from a global pool of physical blocks in VRAM. Each sequence has a block table that maps its logical token positions to physical block addresses, just like a page table maps virtual addresses to physical pages.
When a sequence wants to grow, the system allocates a new block from the pool and adds it to the sequence's block table. When a sequence finishes, all its blocks return to the pool. Crucially, the blocks for one sequence do not have to be contiguous in physical memory. Fragmentation drops to nearly zero because every allocation is the same size.
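A toy version of just the bookkeeping, with no actual GPU memory behind it; the class and method names are invented for illustration:

```python
BLOCK_SIZE = 16

class BlockPool:
    """Fixed-size physical blocks; every allocation is identical, so no fragmentation."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
    def alloc(self):
        if not self.free:
            raise MemoryError("KV cache pool exhausted")
        return self.free.pop()
    def release(self, block_id):
        self.free.append(block_id)

class Sequence:
    """Logical token positions -> physical block ids, like a page table."""
    def __init__(self, pool):
        self.pool = pool
        self.block_table = []          # block_table[i] holds tokens [i*16, (i+1)*16)
        self.num_tokens = 0
    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:   # previous block full (or first token)
            self.block_table.append(self.pool.alloc())
        self.num_tokens += 1
    def free_all(self):
        for b in self.block_table:
            self.pool.release(b)
        self.block_table.clear()

pool = BlockPool(num_blocks=1024)
seq = Sequence(pool)
for _ in range(40):                    # 40 tokens -> 3 non-contiguous physical blocks
    seq.append_token()
assert len(seq.block_table) == 3
```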
The attention kernel has to be aware of this structure. The PagedAttention kernel walks the block table for each sequence and gathers the right physical blocks as it does the attention math. This adds a small overhead per attention step (a few percent in practice) but the win in usable batch size is large. The vLLM paper showed 2x to 4x throughput improvements on real workloads, mostly from being able to fit more sequences in the same VRAM.
PagedAttention has become the default. vLLM, TensorRT-LLM, SGLang, and most internal company stacks all use some variant. Block size is sometimes 16, sometimes 8, sometimes configurable per deployment. The kernels have been re-tuned many times for different GPU generations. But the core idea is the 2023 vLLM paper, and it is one of the most influential systems papers in the LLM era.
Prefix caching and shared system prompts
The KV cache stores intermediate state for a sequence. If two sequences share a common prefix (the same system prompt, the same few-shot examples, the same RAG context), they will produce identical K and V tensors for those prefix tokens. Recomputing them for the second sequence is wasted work.
Prefix caching keeps prefixes in the KV cache pool across requests. When a new request comes in, the system hashes the input tokens (typically in blocks of N tokens to match the page size) and looks up matching blocks. If they exist, the new request's block table just points at them. Only the unique suffix needs new computation.
This is enormously useful in production. RAG systems retrieve documents and prepend them. Chatbots have a long system prompt. Agentic systems re-send context on every turn. With prefix caching, the prefill cost for these repeated prefixes goes to near zero (just the time to look them up).
vLLM has had Automatic Prefix Caching (APC) since 2024. SGLang's RadixAttention (Zheng et al., 2024) generalized this further with a radix tree over blocks, allowing efficient sharing of branching prefixes (think a chatbot with multiple parallel hypotheses). For multi-turn conversation serving, the SGLang paper reported throughput improvements in the 2x to 5x range on conversational benchmarks.
The accounting gets interesting. With prefix caching, the same physical block can be referenced by many sequences. You need a reference count. Eviction policy matters. LRU is the obvious default but in some workloads (e.g., where one shared prefix is overwhelmingly common) you want to pin it. There is real engineering here.
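Here is a sketch of that accounting, hashes, sharing, and reference counts, with the K/V storage and eviction left out; the chained per-block hash is one common scheme, not necessarily what any particular engine uses:

```python
from hashlib import sha256

BLOCK_SIZE = 16

class PrefixCache:
    def __init__(self):
        self.by_hash = {}      # block hash -> physical block id
        self.refcount = {}     # physical block id -> number of sequences using it
        self.next_block = 0

    def _block_hashes(self, tokens):
        """Hash each full block chained with everything before it, so a block
        is only reusable when the entire prefix up to it matches."""
        hashes, h = [], sha256()
        for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
            h.update(str(tokens[i:i + BLOCK_SIZE]).encode("utf-8"))
            hashes.append(h.copy().hexdigest())
        return hashes

    def match_or_allocate(self, tokens):
        """Return (block_table, num_cached_tokens) for a new request."""
        table, cached = [], 0
        for hh in self._block_hashes(tokens):
            if hh in self.by_hash:                 # shared block: just bump the refcount
                block = self.by_hash[hh]
                cached += BLOCK_SIZE
            else:                                  # new block: allocate and register
                block = self.next_block
                self.next_block += 1
                self.by_hash[hh] = block
                self.refcount[block] = 0
            self.refcount[block] += 1
            table.append(block)
        return table, cached

cache = PrefixCache()
_, hit_a = cache.match_or_allocate(list(range(2000)))   # first request: nothing cached yet
_, hit_b = cache.match_or_allocate(list(range(2000)))   # same prefix: every full block hits
assert hit_a == 0 and hit_b == (2000 // BLOCK_SIZE) * BLOCK_SIZE
```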
KV cache quantization
The KV cache is large. It is mostly read-bandwidth bound during decode. And it does not need the same numerical precision as the weights: attention is a softmax-weighted average, which tolerates small errors in individual K and V entries. All three facts point at the same answer: quantize it.
Two approaches work in practice.
FP8 KV cache (per tensor or per head scales) is the most common production choice as of 2026. Hopper (H100) has native FP8 support, so the attention kernel can do the matmul directly in FP8 without dequantizing first. The accuracy hit is typically under 1% on standard benchmarks for well-implemented FP8 KV cache. TensorRT-LLM, vLLM, and SGLang all support it.
Sub-4-bit KV cache with carefully designed quantization is a more aggressive choice. Hooper et al.'s KVQuant (2024) compounds several tricks (per-channel key quantization, pre-RoPE key quantization, non-uniform per-layer datatypes, dense-and-sparse decomposition) to reach sub-4-bit precision with under 1% perplexity degradation. Liu et al.'s KIVI (2024) goes further, reaching 2-bit quantization with a clean asymmetric recipe: quantize K per-channel (along the feature dimension) and V per-token (along the sequence dimension). The asymmetry isolates the outliers cleanly because K outliers concentrate in a small set of channels, while V has no comparable channel structure. The downside in both cases is that low-precision attention kernels are not native on most current hardware, so you pay a dequantization cost.
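To make the per-channel versus per-token distinction concrete, here is a toy symmetric absmax quantizer over one [seq_len, head_dim] slice of a key cache. KIVI's real recipe is asymmetric, grouped, and tuned; this only shows why the axis choice matters when outliers live in a channel:

```python
import numpy as np

def quantize(x, axis, bits=2):
    """Symmetric absmax quantization with one scale per slice along `axis`.
    axis=0: per-channel scales (shared across tokens), axis=1: per-token scales."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
K = rng.standard_normal((128, 64)).astype(np.float32)   # [seq_len, head_dim]
K[:, 7] *= 20                                            # a channel-wise outlier, as seen in K caches

q_chan, s_chan = quantize(K, axis=0)     # per-channel: the outlier channel gets its own scale
q_tok,  s_tok  = quantize(K, axis=1)     # per-token: the outlier inflates every token's scale

err_chan = np.abs(dequantize(q_chan, s_chan) - K).mean()
err_tok  = np.abs(dequantize(q_tok,  s_tok)  - K).mean()
print(err_chan, "<", err_tok)            # per-channel copes with channel outliers far better
```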
Offloading: when CPU and NVMe come back
We said earlier that the KV cache needs to be in VRAM. That is true for the active part of attention. But not all of the KV cache is active all the time.
In a long conversation, the most recent tokens get attended to heavily and the very oldest tokens get attended to lightly. Many practical scenarios have this property: the model rarely attends strongly to tokens beyond a few thousand back. Attention sinks (Xiao et al., 2023) show that the first few tokens of context often act as a high-weight anchor regardless of what follows, while middle tokens decay in influence.
This opens up offloading strategies. Keep the recent K/V blocks and the attention sink blocks in VRAM. Push the middle blocks to CPU RAM or NVMe. When needed, swap them back in via PCIe.
- HBM (VRAM): recent K/V blocks plus the attention-sink blocks. Anything on the decode path must live here.
- Host RAM: middle-context K/V blocks, swapped in on demand. Useful for long context.
- NVMe: archive tier for 1M+ token contexts or rarely-accessed conversations.
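A toy placement policy along those lines; the thresholds and tier names are invented, and real offloading systems (FlexGen, for example) drive this with a cost model rather than fixed cutoffs:

```python
def assign_tiers(num_blocks, sink_blocks=4, recent_blocks=128, host_window=2048):
    """Toy placement: attention sinks + recent tokens stay in HBM, a sliding window
    of middle context sits in host RAM, anything older is archived to NVMe."""
    tiers = {}
    for i in range(num_blocks):
        if i < sink_blocks or i >= num_blocks - recent_blocks:
            tiers[i] = "hbm"                       # decode-path blocks stay on the GPU
        elif i >= num_blocks - recent_blocks - host_window:
            tiers[i] = "host_ram"                  # swappable over PCIe on demand
        else:
            tiers[i] = "nvme"                      # archive tier for very long contexts
    return tiers

tiers = assign_tiers(num_blocks=8192)              # ~131k tokens at 16 tokens per block
print({t: sum(v == t for v in tiers.values()) for t in ("hbm", "host_ram", "nvme")})
```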
This is a tradeoff. PCIe is much slower than HBM. Reading 1 GB of KV cache over PCIe costs about 20 milliseconds, versus 0.3 milliseconds from HBM. For decode that is catastrophic. But:
- For some agentic workloads that re-read very long contexts only occasionally, the latency hit is acceptable
- For prefill on very long contexts where you would otherwise OOM, offloading is the difference between serving the request and not
- For research workloads that need 1M+ token contexts on consumer hardware, it is essentially the only option
Frameworks like FlexGen (Sheng et al., 2023) and DeepSpeed-Inference have built sophisticated offloading systems. For most production serving, offloading is not the right tool. For long-context inference with intermittent access patterns or for memory-constrained setups, it is.
Putting it together: a request's KV story
Let's walk through what happens to the KV cache during one request on a modern serving system (call it vLLM-style with prefix caching and PagedAttention).
1. Request arrives. A 2000-token prompt; the first 1500 tokens are a common system prompt + RAG header the system has seen before.
2. Prefix lookup. Hash the first 1500 tokens in 16-token blocks. 93 full blocks (1488 tokens) are already cached; build the block table to point at them.
3. Prefill the unique suffix. The remaining 512 tokens (the 12-token tail of the prefix plus the 500 unique tokens) go through prefill; 32 new blocks are allocated from the pool and populated.
4. First decode step. Attention gathers from 125 physical blocks (93 shared + 32 freshly filled) via the block table, then writes one K and one V vector for the new token into a fresh block.
5. Decode continues. Every 16 generated tokens, another block is allocated and the block table grows.
6. Request finishes after 200 generated tokens. The 13 blocks allocated during decode return to the pool; the 93 prefix blocks stay cached for the next user; the 32 unique-suffix blocks are freed and eventually evicted, since no other request shares them.
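The block arithmetic behind those steps, spelled out (16-token blocks, as assumed throughout):

```python
from math import ceil

BLOCK = 16
prompt, shared_prefix, generated = 2000, 1500, 200

cached_blocks = shared_prefix // BLOCK               # 93 full blocks hit the prefix cache
prefill_tokens = prompt - cached_blocks * BLOCK      # 512 tokens still need prefill
prefill_blocks = ceil(prefill_tokens / BLOCK)        # 32 new blocks
decode_blocks = ceil(generated / BLOCK)              # 13 blocks allocated during decode

print(cached_blocks, prefill_blocks, decode_blocks)  # 93 32 13
```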
This is the everyday life of one tensor that, in 2017, the original transformer paper barely thought about.
What to take away
The KV cache is the memory pattern that turns autoregressive generation from quadratic to linear. It is the thing you spend the most VRAM on in production for long-context workloads. Naive management wastes most of it. PagedAttention solves the fragmentation problem by paging it like virtual memory. Prefix caching exploits redundancy across requests. Quantization halves or quarters its footprint. Offloading is sometimes the right escape hatch.
When you are sizing inference hardware, you are mostly sizing the KV cache. When you are debugging OOMs on a serving system, you are mostly debugging KV cache allocation. When you are improving throughput, you are mostly improving how many sequences' KV caches you can keep alive at once.
In Part 5, Batching and Serving, we look at the scheduling layer on top: how to actually get many requests through the same model at the same time.
References and further reading
- Kwon et al., 2023. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023. arXiv:2309.06180. (Essential.)
- Zheng et al., 2024. "SGLang: Efficient Execution of Structured Language Model Programs." arXiv:2312.07104. (Introduces RadixAttention.)
- Hooper et al., 2024. "KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization." arXiv:2401.18079.
- Liu et al., 2024. "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache." arXiv:2402.02750.
- Xiao et al., 2023. "Efficient Streaming Language Models with Attention Sinks." arXiv:2309.17453.
- Sheng et al., 2023. "FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU." ICML 2023. arXiv:2303.06865.
- Ainslie et al., 2023. "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." arXiv:2305.13245.
OOMing on long-context users?
Strongly.AI's forward deployed engineers have shipped KV cache management for everything from small chatbots to multi-million-token agentic workloads. If your GPU memory math doesn't add up, we'll find what's eating it.
Scope the First Engagement