Where we are in the stack: above the model, where requests meet hardware. We know how a single forward pass works (Part 3) and how the KV cache is managed (Part 4). Now we turn one model into a service that handles thousands of users at once.
A model is a function. A service is something that takes a stream of requests, decides what to do with them, runs them on a finite pool of GPUs, and sends responses back. Going from the first thing to the second is most of what an inference engine actually does.
This post is about the scheduling and batching layer. It is also where the engineering creativity in this series peaks, because the constraints (Part 1's three resources, Part 3's roofline, Part 4's KV cache management) come together and produce a small number of techniques that have transformed inference economics.
Why one request at a time wastes the GPU
Recall from Part 3: decode is bandwidth bound. To produce one token from Llama 3 8B, you read all 16 GB of weights. That takes about 5 ms on an H100. The math itself takes about 15 microseconds. Three hundred times less.
If you serve one user at a time, the GPU spends 99.7% of decode time waiting on memory. You bought a fifty thousand dollar accelerator and you are using less than half a percent of its compute.
[Figure: unbatched decode yields 1 token per 5 ms forward pass; a batch of 32 yields 32 tokens per the same 5 ms forward pass.]
If you serve 32 users in the same forward pass, you still read the weights once but now the math is done for 32 tokens. The math takes 32 × 15 microseconds = 0.48 ms. Still well under the 5 ms bandwidth time. Throughput is 32 tokens per 5 ms instead of 1 token per 5 ms. You are now using 10% of compute and the cost per token has dropped by 30x or so.
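A quick back-of-the-envelope sketch of that arithmetic. The hardware numbers here (roughly 3.35 TB/s of HBM bandwidth and about 1 PFLOP/s of dense BF16 compute for an H100) are round assumptions for illustration, not measurements:

```python
# Back-of-the-envelope decode roofline for Llama 3 8B on an H100.
# All hardware numbers are rough assumptions for illustration.
WEIGHT_BYTES = 16e9        # 8B params in FP16/BF16
HBM_BW = 3.35e12           # bytes/s of HBM bandwidth (assumed)
PEAK_FLOPS = 1e15          # dense BF16 FLOP/s (assumed)

FLOPS_PER_TOKEN = 2 * 8e9  # ~2 FLOPs per parameter per token

def decode_step_time(batch_size: int) -> float:
    """Time for one decode step: max of weight-read time and math time."""
    memory_time = WEIGHT_BYTES / HBM_BW                       # ~4.8 ms, batch-independent
    compute_time = batch_size * FLOPS_PER_TOKEN / PEAK_FLOPS  # grows with batch size
    return max(memory_time, compute_time)

for bs in (1, 32, 256):
    t = decode_step_time(bs)
    print(f"batch={bs:4d}  step={t*1e3:5.2f} ms  throughput={bs/t:8.0f} tok/s")
# batch=1 and batch=32 take essentially the same ~5 ms step;
# throughput scales almost linearly until compute time catches up with memory time.
```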
This is why batching is the single most important throughput lever in LLM serving. It is also why batching is hard, because requests do not arrive in lockstep.
Static batching, and why it falls apart
The simplest scheme: collect N requests, run them through the model in lockstep, return all the answers. The problem is that requests have different prompt lengths and different output lengths. If you batch 32 requests together, and one user wants 2000 tokens while the other 31 want 50, then 31 of your batch slots sit idle for the 1950 extra decode steps during which the long request keeps the batch alive. Your effective utilization collapses.
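To put a number on that collapse, here is the slot accounting for the example above (a sketch that ignores prefill and assumes the batch stays locked until the longest request finishes):

```python
# Slot accounting for a static batch of 32: one request wants 2000 output
# tokens, the other 31 want 50. The batch runs until the longest finishes.
batch_size = 32
steps = 2000                              # decode steps the batch is held open
useful = 31 * 50 + 1 * 2000               # slot-steps that produce real tokens
total = batch_size * steps                # slot-steps the GPU actually runs
print(f"effective utilization: {useful / total:.1%}")   # ~5.5%
```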
You can pad up to a fixed length and accept the waste. You can sort requests by expected length, but you do not know the actual length until generation ends. You can chunk into many small batches, but small batches lose the very benefit you batched for.
Static batching was how research code worked. It was never going to fly for production serving, yet until 2022 it was largely how production worked too. The technique that changed everything came from a group at Seoul National University.
Continuous batching
Yu et al. introduced iteration-level scheduling in the Orca paper (2022). The idea is so clean in hindsight that you wonder why it took five years after attention came out.
Instead of batching at the request level (decide on a batch, run it to completion, decide on the next batch), batch at the iteration level. Every forward pass is its own scheduling decision. After each step:
- Sequences that finished (EOS or max length) drop out
- New requests waiting in the queue can be added to the next batch
- The batch dimension is dynamic, changing every step
This is called continuous batching or in-flight batching. The result is that the GPU is constantly running a roughly full batch, with no waiting-for-the-longest-request waste. Throughput improvements on real workloads were typically 2x to 4x over static batching, sometimes much more on long tail traffic patterns.
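A minimal sketch of the iteration-level loop. The names here (`running`, `waiting`, `engine.step`, `seq.finished`, `has_kv_room`) are illustrative stand-ins, not the API of any real engine:

```python
from collections import deque

def serve_loop(engine, waiting: deque, max_batch: int, has_kv_room) -> None:
    """Iteration-level (continuous) batching: every forward pass is its own
    scheduling decision. All names are illustrative, not a real engine API."""
    running = []
    while running or waiting:
        # Admit queued requests while there is batch room and KV cache room.
        while waiting and len(running) < max_batch and has_kv_room(waiting[0]):
            running.append(waiting.popleft())

        # One forward pass over the current (ragged) batch: each in-flight
        # sequence gets exactly one new token this iteration.
        engine.step(running)

        # Sequences that hit EOS or their max length drop out immediately,
        # freeing their slot (and KV blocks) for the next iteration.
        running = [seq for seq in running if not seq.finished]
```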
The implementation gets fiddly. Each sequence in the batch is at a different position in its generation, so the attention kernel has to handle a "ragged" batch where each row has a different effective length. The KV cache for each sequence lives at a different location. New requests entering the batch need their prefill done while existing requests are doing decode, which leads to the next problem.
Mixing prefill and decode: chunked prefill
There is a fundamental tension in continuous batching. Decode steps are short (single tokens, lots of users at once) and bandwidth bound. Prefill steps are long (many tokens, single user) and compute bound. If you mix them naively, a prefill step blocks all the decode steps in the batch from completing, hurting their ITL.
Agrawal et al. introduced chunked prefill in their SARATHI paper (2023). The idea: split a long prefill into smaller chunks, and run each chunk piggybacked with the decode work for other sequences in the same forward pass. If your prefill is 4000 tokens, you might break it into 8 chunks of 500 tokens. Each forward pass processes 500 prefill tokens for the new sequence plus the decode tokens for the dozens of in-flight sequences.
The benefit is that decode steps maintain steady cadence while the new sequence prefills incrementally. TTFT for the new user goes up slightly (the prefill is now spread across multiple steps) but ITL for existing users does not spike.
Chunked prefill is now standard in vLLM, TensorRT-LLM, and SGLang. The chunk size is tunable. Smaller chunks favor decode ITL; larger chunks favor TTFT. Most production systems sit around 512 to 2048 tokens per chunk.
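One common way to implement this is a per-step token budget: decode tokens are admitted first, and whatever budget remains is filled with a chunk of pending prefill. A sketch under that assumption, with the budget value and attribute names invented for illustration:

```python
def build_step(decode_seqs, prefill_queue, token_budget: int = 512):
    """Assemble one forward pass: all decode tokens plus at most one prefill
    chunk that fits in the remaining token budget. Illustrative only."""
    step = [(seq, 1) for seq in decode_seqs]          # each decode seq contributes 1 token
    budget = token_budget - len(decode_seqs)
    if budget > 0 and prefill_queue:
        seq = prefill_queue[0]
        chunk = min(budget, seq.remaining_prompt_tokens)
        step.append((seq, chunk))                     # partial prefill piggybacks on decode
        seq.remaining_prompt_tokens -= chunk
        if seq.remaining_prompt_tokens == 0:
            prefill_queue.popleft()                   # prefill done; seq decodes next step
    return step
```

With a 512-token budget and roughly 40 in-flight decodes, a 4000-token prompt prefills over about 9 steps instead of stalling every decode for one long pass.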
Disaggregation: separating prefill and decode
The chunked-prefill solution to mixing prefill and decode is good but not perfect. Even with small chunks, the two workloads have fundamentally different hardware preferences. Prefill saturates compute, so it gains little from spreading a request across extra GPUs. Decode is bandwidth bound and benefits hugely from being spread across more GPUs, because the weight reads split across more memory channels.
The natural conclusion is to run them on separate pools of GPUs. Prefill / decode disaggregation does exactly that.
[Diagram: prefill pool → KV cache transfer over NVLink / InfiniBand → decode pool.]
A request arrives at a prefill worker, which runs prefill and produces the KV cache for the full prompt. That KV cache then gets transferred to a decode worker, which runs the autoregressive decoding. The two pools can be sized independently. Prefill pools can run on cheaper GPUs (or fewer of them) since they are compute bound and benefit from large batches. Decode pools want lots of HBM bandwidth per parameter.
The challenge is the KV cache transfer. Moving a few GB of KV cache between GPUs over NVLink or InfiniBand is not free. Splitwise (Patel et al., 2024) and DistServe (Zhong et al., 2024) both worked through the design tradeoffs. DistServe reported 4x to 7x higher goodput (throughput within SLO) compared to coupled serving on long-context workloads.
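To get a rough sense of the transfer cost, here is a sketch assuming Llama 3 70B's GQA layout (80 layers, 8 KV heads, head dimension 128) and FP16 KV; the interconnect bandwidths are assumptions:

```python
# Rough KV-cache transfer cost for prefill/decode disaggregation.
# Assumed layout: Llama 3 70B with GQA (80 layers, 8 KV heads, head_dim 128), FP16.
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V: ~320 KB

prompt_tokens = 3000
kv_bytes = prompt_tokens * kv_bytes_per_token                        # ~1 GB for this prompt

for name, bw in (("NVLink (~900 GB/s)", 900e9), ("400 Gb/s InfiniBand (~50 GB/s)", 50e9)):
    print(f"{name}: {kv_bytes / bw * 1e3:.1f} ms to move {kv_bytes / 1e9:.2f} GB")
# Roughly 1 ms over NVLink, 20 ms over 400 Gb/s IB: small but not free,
# and it lands squarely on TTFT unless it is overlapped with other work.
```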
As of 2026, disaggregation is the cutting edge of production serving but not yet universal. It requires good interconnect (NVLink between prefill and decode hosts, or fast networking) and orchestration that keeps the cross-tier KV handoff off the latency path. NVIDIA's Dynamo (announced 2025) is built around disaggregation. SGLang has experimental support. Most simple deployments still run coupled.
Speculative decoding
The autoregressive loop is the fundamental latency constraint. Each new token requires reading all the weights. Speculative decoding tries to break this constraint by checking many candidate tokens at once.
The setup: you have your big target model, the one whose outputs you want. You also have a small draft model (typically 10x to 100x smaller). The draft model is fast but less accurate. The protocol:
- Draft generates K tokens. K is typically 3 to 8. Fast because the draft is small.
- Target verifies in one forward pass. Single pass through the target with all K candidates as input.
- Compare. Accept tokens until first disagreement; take target's choice at that position.
- Repeat.
Crucially, step 2 is a single forward pass through the target model, even though it produces up to K accepted tokens. The cost is one weight read instead of K. If the acceptance rate is high (the draft and target agree often), you can get 2x to 4x speedups.
The paper that made this rigorous is Leviathan et al. (2023), along with Chen et al. (2023) from DeepMind. Both papers showed how to do this exactly: with a small math trick (rejection sampling against the target's true distribution), you get tokens that are statistically identical to what the target would have produced on its own. The speedup is free in terms of output quality.
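The core of that trick is the acceptance rule: accept each drafted token with probability min(1, p_target/p_draft), and on the first rejection resample from the clipped difference of the two distributions. A sketch of one verification step, assuming the per-position next-token distributions from both models are already in hand:

```python
import numpy as np

def verify(draft_tokens, p_draft, p_target, rng=np.random.default_rng()):
    """One speculative-decoding verification step (sketch).
    p_draft[i] and p_target[i] are full next-token distributions at position i,
    assumed to come from the draft and target forward passes respectively."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        # Accept the drafted token with probability min(1, p_target/p_draft).
        # (tok was sampled from p_draft, so p_draft[i][tok] > 0.)
        if rng.random() < min(1.0, p_target[i][tok] / p_draft[i][tok]):
            accepted.append(tok)
        else:
            # First rejection: resample from the normalized residual
            # max(0, p_target - p_draft). This keeps the overall output
            # distribution exactly equal to the target's.
            residual = np.maximum(p_target[i] - p_draft[i], 0)
            accepted.append(int(rng.choice(len(residual), p=residual / residual.sum())))
            return accepted
    # All K drafts accepted; the target's extra position can supply one bonus token.
    return accepted
```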
Variants have proliferated. Medusa (Cai et al., 2024) replaces the draft model with additional output heads on the target itself. EAGLE (Li et al., 2024) uses a small autoregressive head on the target's hidden states. Lookahead decoding (Fu et al., 2024) uses no draft model at all, instead generating n-grams from the target's own previous outputs.
For typical workloads, well-tuned speculative decoding delivers 2x to 3x speedups on decode latency. It is one of the few levers that genuinely improves ITL without hurting output quality.
Admission control, queueing, and SLO-aware scheduling
So far we have talked about scheduling within the GPU. The other half of serving is what happens before the GPU. Real workloads have a queue of pending requests, SLOs (e.g., "TTFT under 500 ms at p95"), heterogeneous request shapes, and time-varying load.
The scheduler has to decide, every iteration: which queued requests should join the batch, which in-flight requests (if any) should be preempted, and whether to admit a new request at all.
Naive FIFO admission fails on long-prompt requests, which can starve short-prompt users. Naive shortest-job-first admission fails the long users entirely. Most production systems use some variant of fair share with prioritization, with the prompt length factored in.
Preemption is a real consideration. If a new high-priority request arrives, you can evict an existing low-priority request's KV cache and rerun its prefill later. The eviction is cheap (free a few blocks); the cost is paying for prefill again. This is sometimes the right move and sometimes not.
The serving stack also has to think about goodput, the throughput of requests that met their SLOs. A system that runs huge batches gets high raw throughput but might violate TTFT SLOs for users who arrived during a long prefill. Sizing batches against SLOs requires modeling the latency impact of each scheduling decision. DistServe and several recent systems papers have made this explicit.
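As a toy illustration of the kind of decision involved, the sketch below estimates a queued request's TTFT from the current prefill backlog and the chunk budget, and admits it only if the prediction fits the SLO. Every number and name here is an assumption:

```python
def admit(request, queued_prefill_tokens: int, chunk_budget: int = 512,
          step_time_s: float = 0.03, ttft_slo_s: float = 0.5) -> bool:
    """Toy SLO-aware admission: predict TTFT as (backlog + own prompt) worth of
    prefill chunks times the per-step time, and admit only if it fits the SLO."""
    total_prefill = queued_prefill_tokens + request.prompt_tokens
    steps_until_first_token = -(-total_prefill // chunk_budget)   # ceiling division
    predicted_ttft = steps_until_first_token * step_time_s
    return predicted_ttft <= ttft_slo_s
```

Real schedulers model far more than this (decode load, preemption cost, priority classes), but the structure is the same: every admission is a prediction checked against an SLO.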
How modern engines compose these techniques
Three serving stacks are worth knowing as of 2026.
- vLLM. The reference open-source engine. PagedAttention + continuous batching as the core; prefix caching, chunked prefill, speculative decoding, FP8, parallelism, and (experimental) disaggregation.
- TensorRT-LLM. NVIDIA's optimized engine. Build-time graph compiler and hand-tuned kernels per shape. Wins on raw throughput by roughly 10-30% over vLLM, at the cost of less flexibility and NVIDIA-only hardware.
- SGLang. RadixAttention plus a frontend for orchestrating multi-step LLM calls. Strong on shared-prefix and reasoning workloads, and a competitive general-purpose engine as of 2026.
The choice between them is mostly about your workload. The codebase of vLLM is large but readable, and reading vLLM's scheduler is one of the best ways to understand how all these pieces fit together in production.
A full request lifecycle on a modern engine
Let's bring it all together. A user sends a 3000-token prompt asking for up to 1000 tokens of response to a Llama 3 70B service running vLLM.
- Arrival. The request hits the scheduler queue. The tokenizer breaks the prompt into tokens; the scheduler estimates the KV footprint and waits for a slot.
- Prefix lookup. Of the 3000 prompt tokens, 2500 match a common system prompt; 156 blocks are reused via the block table.
- Chunked prefill. The remaining 500 tokens are split into chunks of 256 and piggybacked with decode steps for other in-flight requests over 2 forward passes.
- Decode begins. The first decode step produces the first token. The TTFT clock stops.
- Continuous batching. Every forward pass joins a fresh batch (size varies between roughly 20 and 50). The request shares iterations with other requests' decode steps and prefill chunks.
- Speculative decoding (optional). If enabled, the draft proposes 4 tokens per iteration, the target verifies, and typically 2-3 are accepted. ITL drops.
- Completion (EOS at 487 tokens). The block table releases non-shared blocks back to the pool. Shared prefix blocks stay for the next user.
- Response streamed. Tokens have been streaming back to the user the whole time; the connection closes.
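Putting rough numbers on the block accounting for this request, assuming 16-token KV blocks and the Llama 3 70B KV layout sketched earlier (an estimate, not a measurement):

```python
# Block accounting for the lifecycle example: 3000-token prompt, 2500 of which
# hit the shared prefix cache, 487 generated tokens. Block size assumed 16 tokens.
BLOCK = 16
kv_bytes_per_token = 2 * 80 * 8 * 128 * 2        # assumed Llama 3 70B GQA layout, FP16 (~320 KB)

shared_blocks = 2500 // BLOCK                     # 156 blocks reused from the prefix cache
private_tokens = 3000 - shared_blocks * BLOCK + 487
private_blocks = -(-private_tokens // BLOCK)      # ceiling: blocks this request owns alone

print(f"shared blocks reused: {shared_blocks}")
print(f"private blocks allocated: {private_blocks} "
      f"(~{private_blocks * BLOCK * kv_bytes_per_token / 1e9:.2f} GB of KV cache)")
# On completion the private blocks return to the pool; the 156 shared blocks stay.
```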
Every one of those steps is a place where engineering choices change cost and latency by significant factors. The serving stack is doing a lot of work between the time you call model.generate in research code and the time the same workload runs in production.
What to take away
Batching is the single biggest throughput lever in LLM serving. Continuous batching turns batching from a research toy into a production technique. Chunked prefill keeps prefill from blocking decode. Disaggregation pushes the separation further by giving prefill and decode their own hardware. Speculative decoding attacks the autoregressive loop itself. SLO-aware scheduling decides which requests to run when.
If you operate an inference engine, the knobs to watch are batch size limits (bounded by KV cache memory), prefill chunk size (TTFT versus decode cadence), speculative decoding configuration (latency versus throughput), and prefix cache hit rate (the most under-monitored metric in production).
In Part 6, Quantization End to End, we tie up the series with quantization, which crosscuts everything we have discussed: weights, activations, the KV cache, and how the choices compose.
References and further reading
- Yu et al., 2022. "Orca: A Distributed Serving System for Transformer-Based Generative Models." OSDI 2022.
- Agrawal et al., 2023. "SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills." arXiv:2308.16369.
- Kwon et al., 2023. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023. arXiv:2309.06180.
- Patel et al., 2024. "Splitwise: Efficient Generative LLM Inference Using Phase Splitting." ISCA 2024. arXiv:2311.18677.
- Zhong et al., 2024. "DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM Serving." arXiv:2401.09670.
- Leviathan et al., 2023. "Fast Inference from Transformers via Speculative Decoding." ICML 2023. arXiv:2211.17192.
- Chen et al., 2023. "Accelerating Large Language Model Decoding with Speculative Sampling." arXiv:2302.01318.
- Cai et al., 2024. "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads." arXiv:2401.10774.
- Li et al., 2024. "EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty." arXiv:2401.15077.
- Fu et al., 2024. "Break the Sequential Dependency of LLM Inference Using Lookahead Decoding." arXiv:2402.02057.
- Zheng et al., 2024. "SGLang: Efficient Execution of Structured Language Model Programs." arXiv:2312.07104.
- vLLM source code
- TensorRT-LLM documentation
Tuning your serving stack for goodput?
Strongly.AI's forward deployed engineers have shipped vLLM, TensorRT-LLM, and SGLang in production at every scale. If your tail latencies are creeping or your batch math doesn't add up, we'll find the right knob.
Scope the First Engagement