Detecting Model Distillation Attacks in Your AI Traffic

By Parminder Singh
On February 23rd, Anthropic published something the industry had suspected but hadn't seen documented at this scale. Three Chinese AI labs (DeepSeek, Moonshot AI, and MiniMax) ran coordinated campaigns against the Claude API. They generated over 16 million exchanges through approximately 24,000 fraudulent accounts. The goal was not to steal user data but to steal the model itself.
This is a distillation attack. And while Anthropic's disclosure focused on nation-state actors and frontier model theft, the same attack pattern applies to any organization running a custom, fine-tuned, or proprietary model behind an API. Your internally fine-tuned model encodes domain expertise, and if someone is systematically querying it to clone that expertise, you probably wouldn't know.
Distillation Attack
Knowledge distillation is a legitimate ML training technique. You take a large, expensive model (the teacher) and use its outputs to train a smaller, cheaper one (the student). The student doesn't see the original training data or weights. It learns from observing the teacher's behavior across a large, diverse set of inputs. Done with your own models, it's how you build smaller, faster inference endpoints. Done against someone else's model without permission, it's IP theft.
The attacker automates a large volume of prompts across a wide range of topics, captures the responses, and uses that input-output dataset to train or fine-tune their own model. They don't need your weights. They need enough of your behavior, across enough diversity, to approximate your capabilities. Anthropic's case is this pattern at extreme scale: 16 million exchanges, thousands of fraudulent accounts, and proxy services to evade regional blocks.
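The mechanic is simple enough to sketch. In the toy version below, a stand-in `teacher_model` function plays the role of the target API, and `collect_pairs` is the attacker's entire data pipeline; all names and the topic grid are illustrative, not drawn from any real tooling:

```python
def teacher_model(prompt: str) -> str:
    # Stand-in for the proprietary model behind an API; in a real attack
    # this would be a metered HTTP call to the victim's endpoint.
    return f"answer to: {prompt}"

def collect_pairs(prompts):
    # The attacker's entire data pipeline: record (input, output) pairs.
    return [(p, teacher_model(p)) for p in prompts]

# Systematic sweep of a capability space -- here, a trivial topic grid.
topics = ["tax law", "contract law", "employment law"]
prompts = [f"Explain the basics of {t}, step by step." for t in topics]
dataset = collect_pairs(prompts)
# `dataset` is now training data for a student model; no weights were touched.
```

Scaled to millions of prompts, this loop is the whole attack: the only observable artifact on the defender's side is the traffic itself.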
Traffic Patterns (Hydra Attack)
Anthropic attributed each campaign based on IP correlation, request metadata, and infrastructure indicators. The traffic patterns were distinct from normal usage.
Distillation traffic is structured. Anthropic and Google (which disclosed a similar 100,000-prompt attack in its GTIG report on February 12th) identified specific behavioral signatures that distinguish attackers from humans.
The "Hydra" pattern involves thousands of accounts acting as a single entity:
- High-volume, near-identical prompt templates across many accounts.
- Systematic coverage of a domain and deliberate sampling of a capability space.
- Accounts created in bursts, lightly verified, and geographically inconsistent.
- Requests specifically designed to elicit chain-of-thought (CoT) reasoning (because the reasoning trace is more extractable than a bare answer).
- Traffic mixed with benign requests to lower the signal-to-noise ratio.
- Instant pivots: When Anthropic released a new model mid-campaign, MiniMax redirected nearly half their traffic to it within 24 hours.
Any fine-tuned or proprietary model behind an API can be targeted. If your model encodes business logic, legal expertise, or valuable domain knowledge, a determined actor can systematically query it to recreate a close approximation of your product at a fraction of your R&D cost.
Potential Distillation Pattern Detector
The detection problem is a behavioral anomaly problem. You are not looking for a known "bad" payload but for usage patterns inconsistent with genuine human behavior.
Below is a logic blueprint for detecting distillation. Use these four vectors to audit your API logs:
- Request Velocity: Flag accounts exceeding 200 req/hr (adjust based on your typical user ceiling).
- Template Uniformity: Hash the first 60 characters of prompts; if >70% share a prefix, it indicates a skeleton-based automated sweep.
- CoT Elicitation: Flag sessions where >50% of requests force "step-by-step" or "reasoning" outputs. Genuine users rarely demand CoT for every single interaction.
- Domain Sweep: High unique word count combined with low variance in prompt length is a signature of systematic capability probing rather than organic problem-solving.
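The four vectors above can be combined into a single per-account audit pass. The sketch below assumes each log entry carries a prompt, an hour bucket, and a CoT flag; the function name, log schema, and all thresholds are illustrative starting points, not tuned values:

```python
from collections import Counter

def audit_account(requests, req_per_hour_limit=200):
    """Score one account's API log against four distillation heuristics.

    `requests` is a list of dicts with keys: prompt (str), hour (int),
    wants_cot (bool). Thresholds mirror the blueprint above.
    """
    flags = []

    # 1. Request velocity: any hour bucket exceeding the ceiling.
    per_hour = Counter(r["hour"] for r in requests)
    if any(n > req_per_hour_limit for n in per_hour.values()):
        flags.append("velocity")

    # 2. Template uniformity: share of requests with the dominant 60-char prefix.
    prefixes = Counter(r["prompt"][:60] for r in requests)
    if prefixes.most_common(1)[0][1] / len(requests) > 0.70:
        flags.append("template_uniformity")

    # 3. CoT elicitation: fraction of requests forcing reasoning output.
    if sum(r["wants_cot"] for r in requests) / len(requests) > 0.50:
        flags.append("cot_elicitation")

    # 4. Domain sweep: high vocabulary diversity + near-constant prompt length.
    words = {w for r in requests for w in r["prompt"].split()}
    lengths = [len(r["prompt"]) for r in requests]
    mean = sum(lengths) / len(lengths)
    variance = sum((l - mean) ** 2 for l in lengths) / len(lengths)
    if len(words) > 5 * len(requests) and variance < 25:
        flags.append("domain_sweep")

    return flags

# Example: a burst using one prompt skeleton, every request demanding CoT.
suspect = [
    {"prompt": f"Please explain, with full step-by-step reasoning, the concept of {t}",
     "hour": 0, "wants_cot": True}
    for t in ["liens", "escrow", "novation", "estoppel", "indemnity",
              "bailment", "easement", "usury", "laches", "replevin"]
]
# With default thresholds, this burst trips the template and CoT checks:
# audit_account(suspect) -> ["template_uniformity", "cot_elicitation"]
```

No single flag is conclusive; it is the combination (a uniform template, forced reasoning, and a systematic topic sweep) that separates an extraction campaign from a power user.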
Preventative Controls
Detection is necessary but not sufficient. Some structural controls that reduce exposure:
- Behavioral Rate Limiting: Standard rate limiting is per-account and fails against a campaign spread across thousands of accounts. Rate limit by behavioral signature (e.g., accounts sharing the same prompt templates should share a rate budget).
- Require Tiered Verification: High-throughput access should require stronger identity controls. Fraudulent accounts typically seek the path of least resistance, often hiding in trial/free tiers.
- Instrument Reasoning Traces: If your model supports extended reasoning, monitor those tokens specifically. A session where a large percentage of output is reasoning traces is almost certainly being used for extraction.
- Log for Analysis: Many teams only log status codes and token counts. To detect distillation, you need the prompt prefix, the account ID, model version, and timestamp.
- Establish a Baseline: Detecting abnormal behavior requires a baseline. Run the logic above on two weeks of historical logs to find your typical velocity and CoT rates.
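The first control above (rate limiting by behavioral signature rather than by account) can be sketched as a shared sliding-window budget keyed on the prompt-template hash. The class name, 60-character prefix, and one-hour window are illustrative assumptions consistent with the detector blueprint earlier:

```python
import hashlib
from collections import defaultdict

class BehavioralRateLimiter:
    """Rate limit by prompt-template signature instead of per account.

    Accounts submitting prompts that share the same 60-character prefix
    draw from one shared budget, so a Hydra campaign split across
    thousands of accounts is throttled as a single entity.
    """

    def __init__(self, budget_per_hour=200):
        self.budget = budget_per_hour
        self.windows = defaultdict(list)  # signature -> request timestamps

    def signature(self, prompt: str) -> str:
        # Hash the template skeleton, not the whole prompt, so variable
        # "fill-in" suffixes all map to the same bucket.
        return hashlib.sha256(prompt[:60].encode()).hexdigest()

    def allow(self, prompt: str, now: float) -> bool:
        # `now` is a Unix timestamp (e.g. time.time() in production).
        sig = self.signature(prompt)
        window = self.windows[sig]
        # Drop timestamps older than one hour, then check the shared budget.
        window[:] = [t for t in window if now - t < 3600]
        if len(window) >= self.budget:
            return False
        window.append(now)
        return True
```

The key design choice is that the budget key is behavioral (the template hash), not identity-based, so spinning up a new fraudulent account buys the attacker nothing.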
Anthropic caught MiniMax's campaign because they were actively watching for behavioral anomalies; most organizations are not. The mechanics of extraction are the same whether you're a frontier lab or a small startup. If you aren't logging AI traffic at a level that supports behavioral analysis, you simply won't know if your IP is being cloned in real time.
DeepInspect.ai is an AI governance and security platform purpose-built for real-time AI usage policy enforcement and forensic-level traffic analysis that surfaces threats like distillation and extraction at scale. If you're thinking about this problem at the platform level, get in touch.