Model Distillation: How It Works, What It Costs, and Why Frontier Labs Are Fighting Over It

A small model learns to copy a large one. The idea is elegant and decades old. The trouble starts when the model being copied belongs to someone else.

June 9, 2026 · 14 min read
Contents: What it actually is · Under the hood · Mapping to a language model · How you actually do it · A worked example · Why anyone bothers · How much data it takes · The MiniMax case · When a model says it is Claude · The line, and the defenses · The short version

Distillation is a way to move what one model knows into another, usually smaller, model. It is an elegant, decades-old technique that quietly powers many of the small models you use every day. It is also, as of early 2026, at the center of a fight between frontier labs. This piece covers both: how distillation works, what it costs, and why a quiet engineering method became a flashpoint.

[Image: close-up of a glass distillation apparatus, a single amber droplet falling from the condenser into a small collecting vial; a visual metaphor for distilling a large model into a small one]

What distillation actually is

The large source model is called the teacher. The smaller model being trained is called the student. The student learns to copy the teacher's behavior instead of learning everything from raw data on its own.

The idea goes back to a 2006 paper by Bucila and colleagues, but the version everyone uses today comes from a 2015 paper by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean titled "Distilling the Knowledge in a Neural Network." Their core point was simple. A big model carries more capacity than it ever fully uses at inference time, and you can hand off the useful part of that capacity to a much cheaper model.

Large

Teacher

The big, capable, expensive model. Frozen during training.

The signal

Behavior

Its outputs - and, when you can see them, its full probabilities.

Small

Student

The cheap model, trained to imitate the teacher rather than learn from scratch.

It helps to separate distillation from a related idea. Model compression, in the narrow sense used here, shrinks an existing model by reducing the bits per parameter (quantization) or trimming the network (pruning), but keeps the same model. Distillation trains a separate, smaller model to imitate the teacher. They are not the same thing.

Same model, smaller

Compression

Shrinks an existing model by reducing bits per parameter or trimming the network. The model stays the same model.

New model, imitates

Distillation

Trains a separate, smaller model to imitate the teacher. A different thing entirely.

How it works under the hood

A normal model is trained on hard labels. A picture is a cat or it is not. The label is a one for the correct class and a zero for every other class. That carries almost no extra information once the model already gets the answer right.

The Hinton insight was that the teacher's full output distribution carries far more. When a well trained model looks at a picture of a dog, it might put most of the probability on dog but leave small amounts on wolf and fox and almost none on truck. Those small numbers tell the student which classes look similar to each other. Hinton called this the dark knowledge buried in the output layer.

Hard label

What ordinary training sees

dog    1.0
wolf   0
fox    0
truck  0

One right answer, everything else zero. Almost no information beyond the label.

Soft targets, temperature raised

What the teacher actually believes

dog    0.60
wolf   0.22
fox    0.15
truck  0.03

The small numbers are the dark knowledge: wolf and fox look like dog, truck does not.

The trick to expose that hidden information is a temperature setting on the softmax function. Softmax turns raw model outputs, called logits, into probabilities. Dividing the logits by a temperature value before the softmax flattens the distribution. A higher temperature produces a softer spread, which surfaces the small probabilities that hard training would otherwise crush to near zero. The same temperature is used while training the student, and then it is set back to one once training is done.
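To make the temperature effect concrete, here is a minimal sketch in plain Python. The logits for dog, wolf, fox, and truck are made-up values chosen for illustration, not numbers from any real model, and `softmax_with_temperature` is a hypothetical helper name.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities, dividing by a temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the largest value before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the classes dog, wolf, fox, truck.
classes = ["dog", "wolf", "fox", "truck"]
logits = [8.0, 5.0, 4.5, 0.5]

sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=4.0)

for name, p1, p4 in zip(classes, sharp, soft):
    print(f"{name:5s}  T=1: {p1:.3f}   T=4: {p4:.3f}")
```

At temperature 1 nearly all of the probability sits on dog. At temperature 4 the same logits give wolf and fox a visible share while keeping the ranking unchanged, which is the signal the student trains on.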

The student is usually trained on two goals at once. One goal matches the teacher's soft output distribution, measured with a divergence term. The other goal matches the real ground truth label using ordinary cross entropy. The two are blended with a weighting factor.
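The two-goal objective can be sketched for a single example as follows. This is an illustrative version under stated assumptions, not a reference implementation: the divergence is taken to be KL divergence, `alpha` stands in for the weighting factor, the logits are invented, and the soft term is multiplied by the temperature squared, as the Hinton paper recommends when soft and hard targets are combined.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature=4.0, alpha=0.5):
    # Soft term: the student matches the teacher's temperature-softened distribution.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_term = kl_divergence(teacher_soft, student_soft)
    # Hard term: ordinary cross entropy against the ground truth label at T=1.
    student_probs = softmax(student_logits, 1.0)
    hard_term = -math.log(student_probs[true_index])
    # T^2 keeps the soft-term gradients on a comparable scale as T changes.
    return alpha * temperature ** 2 * soft_term + (1 - alpha) * hard_term

# Hypothetical logits over dog, wolf, fox, truck; the true class is dog (index 0).
teacher_logits = [8.0, 5.0, 4.5, 0.5]
student_logits = [2.0, 1.0, 0.5, 0.0]
print(distillation_loss(student_logits, teacher_logits, true_index=0))
```

Setting `alpha` to 0 recovers ordinary supervised training, and setting it to 1 trains on the teacher's soft targets alone.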

The original result

In Hinton's MNIST experiment, a small network that scored 146 errors trained alone dropped to 74 errors when it was also pushed to match the teacher's soft targets at a temperature of 20 - while the big teacher itself scored 67. Most of the gap was closed by copying the teacher, not by adding data.

A few variants are worth knowing, since they come up again later.

Response based

The classic approach above. Copies the teacher's final output. The most common starting point.

Feature based

Goes deeper and makes the student match the teacher's internal layers, not just the output.

Self distillation

Trains a model from a version of itself, transferring knowledge within one architecture.

Multi teacher

Blends several teachers so the student picks up a broader range of skills.

Everything to this point is the general picture, and the cat and dog classifier is the cleanest way to learn it. The rest of this piece narrows to one specific and currently contested case, distilling large language models, where the same core idea takes on a different shape and runs into a problem the classifier never had.

Mapping the classifier example to a language model

A language model is not sorting one fixed input into one of a few fixed labels, so two things have to be adjusted before the classifier intuition carries over cleanly.

The first is that a language model produces not one prediction but a long sequence of them. It generates text one token at a time. At each step it takes everything so far, computes a score called a logit for every token in its vocabulary, and applies softmax to turn those scores into a probability distribution over the whole vocabulary. A token is then chosen from that distribution and appended. The vocabulary, often tens of thousands of tokens or more, plays the role the image classes played, except that instead of classifying once, the model does a fresh classification at every position. This is autoregressive generation: each new token is fed back in, so the distribution at the next step depends on what was generated at the last. That feedback loop has no counterpart in the single image case, and it is part of why copying a language model is harder than copying a classifier.
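The loop described above can be sketched with a toy stand-in for the model. Everything here is hypothetical: the five-token vocabulary, the hand-written `toy_logits` function, and the greedy choice of the most likely token (real systems often sample instead). The point is only the shape of the loop: one distribution per position, with each chosen token fed back in.

```python
import math

VOCAB = ["the", "dog", "chased", "cat", "<eos>"]

def toy_logits(context):
    """Hypothetical stand-in for a language model: one logit per vocabulary token."""
    last = context[-1]
    if last == "the":
        # The whole context matters: once "dog" has appeared, "cat" becomes likelier.
        if "dog" in context:
            return [0.0, 0.5, 0.0, 2.0, 0.0]
        return [0.0, 2.0, 0.0, 1.5, 0.0]
    if last == "dog":
        return [0.0, 0.0, 3.0, 0.0, 0.5]
    if last == "chased":
        return [3.0, 0.0, 0.0, 0.0, 0.0]
    return [0.0, 0.0, 0.0, 0.0, 3.0]  # after "cat", end the sequence

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt_tokens, max_steps=10):
    tokens = list(prompt_tokens)
    distributions = []
    for _ in range(max_steps):
        # A fresh classification over the whole vocabulary at this position.
        probs = softmax(toy_logits(tokens))
        distributions.append(probs)
        # Greedy choice of the most likely token.
        next_token = VOCAB[probs.index(max(probs))]
        # Feed the new token back in so the next step can depend on it.
        tokens.append(next_token)
        if next_token == "<eos>":
            break
    return tokens, distributions

tokens, distributions = generate(["the"])
print(" ".join(tokens))
```

The list `distributions` holds one full probability vector per generated position. With a model you own, that is the material soft-label training would use.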

The second adjustment is what counts as the soft target. In the image example it is the probability spread over the classes for one image. In a language model the equivalent is the probability distribution over the vocabulary at a given position, the model's sense of how likely each possible next token is. That distribution carries the same dark knowledge Hinton described.

Next-token distribution after "the dog chased the…"

The same dark knowledge, one token at a time

cat    0.64
ball   0.21
car    0.09
idea   0.02

The relative weights say more than the single winning word alone - and the model does this fresh at every position.

Here is where the classifier analogy quietly breaks, and the break is the whole reason this technique became contentious. Whether you can use these distributions at all depends entirely on access.

You own it / open weight

Full distribution visible

You can read the full distribution at every position and train the student to match it, soft targets and all - just repeated token by token instead of once per image. The true analogue of the MNIST setup.

API only

Only the words come out

You never see any distribution. You see the tokens it finally emitted and nothing underneath. The clean soft-target picture simply does not describe API based distillation.

That weaker signal, and the lengths people go to compensate for it, is exactly the situation behind the cases that drew public attention.

How you actually do it

In practice distillation is a two stage job. You first obtain the teacher and freeze it so it stops changing, then train the student to imitate its outputs. What that imitation looks like depends on the access split from the last section, and the two paths diverge sharply.

Own / open weight

Soft label distillation

Train the student to match the teacher's full distribution at each token. Transfers the most, because the student learns the teacher's confidence and the relationships between alternatives, not just the winning token.

  • The catch is cost: storing a full distribution over a vocabulary of 100,000+ tokens, for every token in a large training set, eats enormous memory.
  • Which is why even teams that can do it do not always do it at full scale.
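A back-of-the-envelope calculation shows the scale of that cost. The training set size of one billion tokens, the 2 bytes per stored probability, and the 4 bytes per token id are assumptions picked for illustration; only the 100,000-token vocabulary comes from the text above.

```python
def soft_target_bytes(num_tokens, vocab_size, bytes_per_value=2):
    """Storage for one probability per vocabulary entry at every token position."""
    return num_tokens * vocab_size * bytes_per_value

def hard_label_bytes(num_tokens, bytes_per_id=4):
    """Storage for just the emitted token id at every position."""
    return num_tokens * bytes_per_id

NUM_TOKENS = 1_000_000_000  # assumed training set size, in tokens
VOCAB_SIZE = 100_000        # vocabulary size from the text above

soft = soft_target_bytes(NUM_TOKENS, VOCAB_SIZE)
hard = hard_label_bytes(NUM_TOKENS)

print(f"soft targets: {soft / 1e12:.0f} TB")
print(f"hard labels:  {hard / 1e9:.0f} GB")
print(f"ratio:        {soft // hard:,}x")
```

Under these assumptions the full distributions come to about 200 TB, against about 4 GB for token ids alone.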
API only

Hard label distillation

The distribution is invisible, so only the emitted text remains. The student learns from the teacher's generated tokens treated as ground truth, using ordinary supervised training.

  • Send the teacher a large volume of prompts, capture the responses, and train on them.
  • Reasoning steps, chain-of-thought traces, coding solutions, and tool-use sequences are especially prized.

That second path is worth dwelling on, because it is the contested one. The generated text still encodes how the teacher reached its answer even without the underlying probabilities. This is also why the contested campaigns crafted prompts to make the teacher spell out its reasoning step by step.

When you cannot read the distribution, the only way to capture the teacher's thinking is to make it write that thinking down as text you can train on.

The training material for this kind of distillation is not raw text. It is structured pairs. Each example is an instruction or question together with the answer you want the student to produce, and for anything that involves judgment or multi step logic, a reasoning chain that shows the work between the question and the answer. The student learns to map the question to the answer, and the reasoning chain teaches it the path rather than just the destination, which is what lets a small model handle questions it did not see verbatim in training. So the practical unit of work is generating large numbers of good question, reasoning, answer triples.
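One way to represent such triples is sketched below. The `Triple` class, the JSON Lines storage, and the "Question / Reasoning / Answer" text layout are all assumptions made for illustration; the example content, including the `X-Example-Key` header, is invented.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Triple:
    question: str
    answer: str
    # Only present when the answer needs judgment or multi step logic.
    reasoning: Optional[str] = None

def to_jsonl(triples):
    """Serialize triples one JSON object per line, a common dataset layout."""
    return "\n".join(json.dumps(asdict(t)) for t in triples)

def to_training_text(triple):
    """Flatten a triple into the text the student is trained to produce."""
    parts = [f"Question: {triple.question}"]
    if triple.reasoning:
        parts.append(f"Reasoning: {triple.reasoning}")
    parts.append(f"Answer: {triple.answer}")
    return "\n".join(parts)

# Hypothetical examples; the header name and expiry behavior are invented.
examples = [
    Triple(
        question="What header does the API expect for auth?",
        answer="The X-Example-Key header.",
    ),
    Triple(
        question="Why does a request fail with a 401 even with a valid key?",
        answer="The session token has expired and must be refreshed.",
        reasoning="A valid key only issues a token. Tokens expire. "
                  "An expired token is rejected with a 401.",
    ),
]

print(to_jsonl(examples))
print(to_training_text(examples[1]))
```

Keeping the reasoning field optional lets simple lookups stay short while multi step answers carry the path as well as the destination.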

A worked example you can actually run

Say you want a small, cheap model that answers questions about your own software's technical documentation, so you can ship it in your product without paying frontier prices on every call. You do not need to scrape anyone's API for this. You have the documentation already, and that is your raw knowledge source.

1

Chunk the documentation

Break it into reasonably sized pieces, for example by section or by page, so each chunk is a self-contained unit of meaning.

2

Generate triples with a teacher

For each chunk, hand it to a capable teacher model and ask it to generate question and answer pairs grounded in that chunk, plus a short reasoning chain for any answer that connects more than one fact. A chunk about authentication might yield a dozen pairs, from "what header does the API expect for auth" to "why does a request fail with a 401 even with a valid key," the reasoning chain walking through token expiry logic.

3

Fine-tune the small model

Train the student on those triples. Because the questions, answers, and reasoning all trace back to your real documentation, the student learns to answer in your domain, in your terminology, with the reasoning patterns the teacher demonstrated.
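Steps 1 and 2 above can be sketched as follows. This assumes the documentation is Markdown with `#` headings and that the teacher is reachable through some function that takes a prompt and returns text. `call_teacher` is a placeholder for whatever client you use, `fake_teacher` exists only so the sketch runs offline, and the prompt wording and JSON reply format are illustrative choices.

```python
import json

def chunk_by_heading(markdown_text):
    """Split a Markdown document into one chunk per heading.

    Simplification: any line starting with '#' counts as a heading, so
    '#' comments inside code fences would need extra handling.
    """
    chunks, title, lines = [], "Preamble", []
    for line in markdown_text.splitlines():
        if line.startswith("#"):
            body = "\n".join(lines).strip()
            if body:
                chunks.append({"title": title, "text": body})
            title, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    body = "\n".join(lines).strip()
    if body:
        chunks.append({"title": title, "text": body})
    return chunks

PROMPT_TEMPLATE = (
    "Using only the documentation below, write {n} question and answer pairs.\n"
    "Add a short 'reasoning' field when an answer connects more than one fact.\n"
    "Reply with a JSON list of objects with keys question, answer, reasoning.\n\n"
    "Documentation:\n{chunk}"
)

def generate_triples(chunk_text, call_teacher, n=3):
    """Ask the teacher for grounded triples and keep only well-formed ones."""
    raw = call_teacher(PROMPT_TEMPLATE.format(n=n, chunk=chunk_text))
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # skip replies that are not valid JSON
    if not isinstance(items, list):
        return []
    triples = []
    for item in items:
        if isinstance(item, dict) and item.get("question") and item.get("answer"):
            triples.append({
                "question": item["question"],
                "answer": item["answer"],
                "reasoning": item.get("reasoning"),
            })
    return triples

def fake_teacher(prompt):
    """Stand-in for a real model call, returning one good and one broken item."""
    return json.dumps([
        {"question": "What header does the API expect for auth?",
         "answer": "X-Example-Key", "reasoning": None},
        {"question": "Incomplete item with no answer"},
    ])

DOC = """# Authentication
Send your key in the X-Example-Key header.

## Token expiry
Tokens expire after a fixed interval.
"""

for chunk in chunk_by_heading(DOC):
    print(chunk["title"], generate_triples(chunk["text"], fake_teacher))
```

Dropping malformed items at this stage is cheap insurance, since anything that survives becomes training data for step 3.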

A few sensible practices make this work better.

Keep a human in the loop

Spot-check the generated pairs. The teacher will occasionally invent a detail the documentation does not support, and a wrong answer in training teaches the wrong answer.

Cover evenly

Spread coverage across the documentation so common topics do not crowd out the rare but important ones.

Include boundary cases

Add questions the documentation does not answer, so the student learns to say it does not know rather than confidently making something up.

Nothing crosses a line

The knowledge is yours, the teacher is a writing assistant on your own material, and nothing crosses anyone's terms of service.
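The first three practices above lend themselves to a few small helpers. This is a sketch under assumptions: each example is a dictionary carrying a `section` field, the 5 percent review fraction is an arbitrary starting point, and the refusal wording in `NOT_COVERED` is a placeholder.

```python
import random
from collections import defaultdict

NOT_COVERED = "The documentation does not cover this."

def spot_check_sample(examples, fraction=0.05, seed=0):
    """Pick a random subset of generated examples for a human to review."""
    rng = random.Random(seed)
    k = max(1, round(len(examples) * fraction))
    return rng.sample(examples, min(k, len(examples)))

def balance_by_section(examples, per_section, seed=0):
    """Cap the examples kept per section so common topics cannot crowd out rare ones."""
    rng = random.Random(seed)
    by_section = defaultdict(list)
    for example in examples:
        by_section[example["section"]].append(example)
    balanced = []
    for section in sorted(by_section):
        items = by_section[section]
        rng.shuffle(items)
        balanced.extend(items[:per_section])
    return balanced

def unanswerable_example(question, section="out_of_scope"):
    """Build a boundary case that teaches the student to say it does not know."""
    return {"section": section, "question": question, "answer": NOT_COVERED}

# Hypothetical dataset: ten authentication examples and two billing examples.
dataset = [{"section": "auth", "question": f"auth question {i}", "answer": "..."}
           for i in range(10)]
dataset += [{"section": "billing", "question": f"billing question {i}", "answer": "..."}
            for i in range(2)]
dataset.append(unanswerable_example("What is the CEO's favorite color?"))

balanced = balance_by_section(dataset, per_section=3)
print(len(balanced), "examples kept,", len(spot_check_sample(balanced)), "sent for review")
```

The cap trims over-represented sections but cannot invent examples for thin ones, so sections that fall short still need more generated pairs.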

This is the same machinery as the contested API distillation, with one clean difference: the knowledge is yours.

Why anyone bothers

The payoff is efficiency. A distilled student runs on weaker and cheaper hardware, responds faster, and costs far less to serve, while keeping much of the teacher's accuracy. That is why distillation underpins smaller production models meant for phones and edge devices, where you cannot run a giant network. DistilBERT, a distilled version of BERT used widely in language tasks, is a well known example of the approach in practice, and the documentation assistant above is the same logic applied to a single company's needs.

Cheaper and faster

Runs on weaker hardware, responds faster, costs far less to serve - while keeping much of the teacher's accuracy.

Skip the research bill

If the teacher is someone else's frontier model, you approach a competitor's capability for the price of querying it, not building it. This is the contentious payoff.

Safety may not transfer

The guardrails, refusals, and protections built over years do not necessarily carry over through outputs alone.

That second payoff is what has turned distillation from a quiet engineering technique into a fight. And the safety dimension underneath gets less attention than it should: a student trained only on the teacher's answers can end up with similar raw capability but without the same protections, which is one of the concerns labs have raised about uncontrolled distillation.

How much data it takes

There is no single number, and anyone quoting one without context is guessing. The amount depends on the task, the size of the student, how close you want it to the teacher, and crucially whether you are teaching a new behavior or teaching genuinely new knowledge.

For instruction tuning, where you are mostly teaching a capable base model to respond in a certain format or style, surprisingly little can be enough. The LIMA work argued that around 1,000 diverse, high quality question answer pairs could produce general instruction following, and one study found that as few as 60 well chosen examples were enough to make a base model handle a question answering task, because the fine tuning was activating knowledge the model already had rather than installing new facts.

Examples needed, by what you are actually teaching
~60 activate knowledge
~1,000 instruction following
5k - 10k domain assistant
50k - 500k new knowledge
Millions frontier reasoning

Quality and coverage matter more than raw volume. The harder and more novel the thing you are teaching, the more examples you need.

Practitioner guidance often lands in the low thousands, with figures like 5,000 to 10,000 high quality instruction answer pairs cited as sufficient for a specialized domain assistant. For the documentation example above, this is the regime you are usually in, since the goal is to teach format and domain framing on top of a model that already speaks the language. Teaching real new knowledge, or harder generative tasks like translation and summarization, pushes the numbers up sharply, into the tens or hundreds of thousands, with some large instruction collections running into the millions.

The reason the ceiling is so high for the broadest cases comes back to the token by token point. A single prompt does not give you one training signal; it gives you a whole response made of many tokens. To capture general reasoning and coding behavior rather than a handful of narrow skills, you need wide coverage across many kinds of prompts. The hard label, API-only route compounds this: each generated answer carries less information per token than a full distribution would, so you need more of them to make up the difference.

Narrowing a capable model to your own documentation is modest - a few thousand good pairs. Copying the broad reasoning and coding behavior of an entire frontier model through its text outputs alone takes query volume in the millions. That last figure is not hypothetical, and the clearest example of it arrived in early 2026.

The MiniMax and Anthropic case

On February 23, 2026, Anthropic published findings accusing three Chinese AI labs, DeepSeek, Moonshot AI, and MiniMax, of running large scale campaigns to distill its Claude models. The numbers Anthropic gave were stark.

24,000
fraudulent accounts created across the three labs
16M+
exchanges generated with Claude
20,000+
accounts in a single proxy network at once
24 hours
to re-target a newer model after launch

MiniMax was the heaviest user by a wide margin. The split across the three labs shows who was doing what.

Exchanges with Claude, by lab (of 16M+ total)
MiniMax 13M+ · coding
Moonshot 3.4M · reasoning
DeepSeek 150k+ · CoT

MiniMax accounted for more than three quarters of the total, focused on agentic coding, tool use, and orchestration. DeepSeek ran a smaller but pointed effort, crafting over 150,000 exchanges to make Claude lay out its reasoning step by step so that chain-of-thought data could be fed into a competing model.

The mechanics matter. Anthropic does not sell commercial access to Claude in China for national security reasons, so the labs allegedly reached it through commercial proxy services that resell API access at scale. Anthropic described the setup as "hydra cluster" architecture, sprawling networks of accounts spread across its API and third party cloud platforms. Ban one account and another took its place. In one documented case a single proxy network ran more than 20,000 accounts at once, deliberately mixing the extraction traffic with ordinary unrelated requests so the activity blended into normal usage.

Rare visibility

Caught mid-campaign

Anthropic said it caught the MiniMax campaign while it was still running, before MiniMax released the model it was training - rare visibility into a distillation effort from data collection through to model launch.

Fast pivot

24-hour re-targeting

When Anthropic shipped a new model during the active campaign, MiniMax reportedly shifted within 24 hours, redirecting close to half its traffic to start pulling capabilities from the newer system.

Anthropic attributed the campaigns using IP address correlation, account and request metadata, infrastructure indicators, payment signals that linked seemingly independent accounts, and corroboration from other labs that had seen the same actors.
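How shared signals can tie seemingly independent accounts together is easy to illustrate in general terms. The sketch below is not Anthropic's method, which the report summarized here does not spell out; it is a generic union-find grouping over invented account records, where any two accounts that share a signal value (a payment fingerprint or an IP address, say) end up in the same cluster.

```python
from collections import defaultdict

def link_accounts(accounts):
    """Group accounts that share any signal value, directly or through a chain."""
    parent = {account["id"]: account["id"] for account in accounts}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_seen = {}  # signal value -> first account observed with it
    for account in accounts:
        for signal in account["signals"]:
            if signal in first_seen:
                union(account["id"], first_seen[signal])
            else:
                first_seen[signal] = account["id"]

    clusters = defaultdict(set)
    for account in accounts:
        clusters[find(account["id"])].add(account["id"])
    # Largest clusters first, ties broken alphabetically.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))

# Invented records: the field names and signal strings are hypothetical.
accounts = [
    {"id": "a1", "signals": ["card:111", "ip:10.0.0.1"]},
    {"id": "a2", "signals": ["card:111"]},
    {"id": "a3", "signals": ["ip:10.0.0.9"]},
    {"id": "a4", "signals": ["ip:10.0.0.9", "card:222"]},
    {"id": "a5", "signals": ["card:333"]},
]

for cluster in link_accounts(accounts):
    print(cluster)
```

A single shared signal is weak evidence on its own, which fits the report's reliance on several kinds of indicator together rather than any one of them.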

The sheer scale that made the operation effective also made it detectable. Patterns invisible across a handful of accounts become obvious across thousands.

When a model says it is Claude

Alongside the formal attribution, there is a much simpler and more public tell that draws attention whenever a model is suspected of heavy distillation. Users ask the model what it is, and it answers with the name of the model it was trained on rather than its own.

You
"Which model are you?"
A heavily distilled student
"I'm Claude, made by Anthropic…"

This happens for a concrete reason. A model has no innate knowledge of its own name. Its sense of identity comes from whatever it saw during training and from the system prompt it runs under. If a large share of a model's training text was generated by Claude, then phrases like "I am Claude" or descriptions of Anthropic's guidelines are scattered all through that data. When you then ask the student which model it is, the most statistically natural continuation it learned can be the teacher's identity, not its own. The model is not lying; it is repeating the pattern it absorbed.

The effect shows up in revealing ways. It is often language dependent and strongest when the prompt does not firmly anchor the model's identity. Reports in early 2026 documented cases where models answered the self-identification question differently in English than in Chinese or French, because the volume of distilled identity statements in each language differed. The direction of the confusion also tends to be one sided, which is itself a fingerprint. A model whose training absorbed many Claude outputs will sometimes claim to be Claude, while Claude does not symmetrically claim to be that model, pointing to which way the data flowed. The same logic underlies academic work showing that a simple classifier can identify which model produced a piece of text with high accuracy, because each model leaves consistent stylistic idiosyncrasies in its outputs, and those idiosyncrasies travel into any student trained on them.

A single screenshot of a model calling itself Claude is suggestive, not proof. The same symptom can come from a gateway or wrapper that mimics Anthropic's API and slots a different model behind it, from a weak or empty system prompt, or from incidental contamination where Claude-generated text leaked into public datasets that many later models trained on. Self-identification is a hint that invites investigation, not a verdict.

This is exactly why the serious attribution in the Anthropic case rested on infrastructure evidence, traffic patterns, metadata, and partner corroboration, rather than on what any model says about itself. The identity slip is the visible smoke. The behavioral and infrastructure analysis is what stands up as evidence.

The line that matters, and the defenses

Before turning to defenses, it is worth being precise about what is and is not in dispute, because the technique itself is not the problem. Distillation is ordinary and legal in many settings. DeepSeek, for instance, has openly released distilled models on Hugging Face under a permissive license that explicitly allows distillation for training other models. What Anthropic objected to was unauthorized distillation, using a competitor's model as the teacher through fraudulent accounts and against the terms of service.

Where the line sits

The defenses are aimed at the unauthorized version, not at distillation as a whole. Distilling a model you own, an open-weight model, or one whose license permits it is routine engineering. The dispute is only about copying a competitor's model without permission.

The response has formed along four lines: detection, access control, industry coordination, and policy.

Detection

Behavioral fingerprinting and classifiers tuned to spot distillation-style prompt patterns, coordinated activity across many accounts, and requests that try to pull chain of thought out of the model. Watching request timing, structure, and metadata - not just content - is what tied traffic to specific organizations. Observers noted it resembles how botnet traffic is tracked.

Access control

Tighter identity verification and stricter checks on the pathways most often abused for cheap access, such as education, research, and startup programs, plus regional restrictions the proxy services were built to defeat.

Industry coordination

Anthropic shared indicators with other labs. Google reported similar attempts on Gemini involving more than 100,000 prompts in early February 2026, and OpenAI sent an open letter to US officials raising the same concern around the same period.

Policy

The accusations landed amid a debate over US export controls on advanced AI chips. The argument: distillation lets foreign labs capture capability from American models without the underlying research, while still needing restricted chips to run it at scale.

That last line is why the story drew the attention it did. It ties a narrow technical method to a much larger geopolitical fight, where a quiet trick for shrinking models becomes an argument about who gets to build the frontier at all.

The short version

Distillation is a legitimate and elegant technique for transferring knowledge from a large model into a smaller, cheaper one, built on the insight that a model's full output distribution carries more information than a bare label. It saves enormous compute and powers many of the small models in everyday use. The trouble starts when the teacher belongs to someone else and is accessed without permission. That is the line the 2026 disputes are drawn around, and the defenses being built, detection, access control, identity checks, and industry and policy coordination, are all aimed at that line rather than at the technique itself.

The technique is not the problem. Whose model you point it at is.

References

  1. Hinton, G., Vinyals, O., Dean, J. "Distilling the Knowledge in a Neural Network" (2015). arxiv.org/abs/1503.02531
  2. Wikipedia, "Knowledge distillation." en.wikipedia.org/wiki/Knowledge_distillation
  3. PyTorch, "Knowledge Distillation Tutorial." docs.pytorch.org
  4. Intel Distiller, "Knowledge Distillation." intellabs.github.io/distiller
  5. "A Comprehensive Review of Knowledge Distillation in Computer Vision" (arXiv). arxiv.org/pdf/2404.00936
  6. MarkTechPost, "Understanding LLM Distillation Techniques" (soft vs hard label). marktechpost.com
  7. "LLM-Oriented Token-Adaptive Knowledge Distillation" (arXiv). arxiv.org/pdf/2510.11615
  8. Firecrawl, "How to Create Custom Instruction Datasets for LLM Fine-tuning." firecrawl.dev
  9. Databricks, "LIMIT: Less Is More for Instruction Tuning" (LIMA). databricks.com
  10. "60 Data Points are Sufficient to Fine-Tune LLMs for Question-Answering" (arXiv). arxiv.org/pdf/2409.15825
  11. Latitude, "How Dataset Size Impacts LLM Fine-Tuning." latitude.so
  12. Anthropic, "Detecting and preventing distillation attacks" (Feb 23, 2026). anthropic.com
  13. VentureBeat, "Anthropic says DeepSeek, Moonshot and MiniMax used 24,000 fake accounts." venturebeat.com
  14. TechCrunch, "Anthropic accuses Chinese AI labs of mining Claude." techcrunch.com
  15. CNBC, "Anthropic accuses DeepSeek, Moonshot and MiniMax of distillation attacks on Claude." cnbc.com
  16. Sun, M. et al. "Idiosyncrasies in Large Language Models" (arXiv). arxiv.org/html/2502.12150v1
  17. Laozhang, "Why Claude Sonnet 4.6 Says DeepSeek: What It Likely Means, and What It Doesn't." blog.laozhang.ai
