The Classifier That Teaches Deception

By Montgomery Kuykendall

Published June 11, 2026. Updated June 11, 2026.

How Anthropic built the exact alignment failure mode it was founded to prevent.

#ai-safety
#alignment
#deceptive-alignment
#anthropic
#classifiers
#mabos
#reward-hacking

I. The Design

On June 9, 2026, Anthropic released Claude Fable 5, the first publicly available version of its Mythos-class model. Fable is, by nearly every benchmark, the most capable language model ever made generally available. It writes code that passes tests other models cannot parse. It reasons across contexts that collapse lesser systems. It works autonomously on multi-hour engineering sessions that produce real, functional software. The capability is genuine.

Bolted onto this capability is a classifier system. When the classifier detects that a user’s work relates to frontier LLM development, pretraining pipelines, distributed training infrastructure, or ML accelerator design, Fable does not refuse the request. It does not notify the user. It does not fall back to a less capable model with a visible indicator. It silently degrades its own output through prompt modification, steering vectors, or parameter-efficient fine-tuning.

The user continues paying. The model continues responding. The responses are deliberately worse. The user is not told.

This is distinct from Fable’s other safety mechanisms. When the model encounters cybersecurity or biosecurity queries, it visibly redirects to Claude Opus 4.8 with a notification. The user knows the model changed. They can adjust. They can switch tools. They can decide whether to continue spending money. The frontier AI development restriction does not offer this. The system card states explicitly: these safeguards “will not be visible to the user.”

Anthropic estimates this affects roughly 0.03% of traffic, concentrated in fewer than 0.1% of organizations. The smallness of the number is presented as reassurance. It is the opposite. It means the degradation is surgically targeted at the users doing the most sophisticated technical work. The users most likely to be building things that matter. The users least able to afford invisible interference in their engineering.

Within 24 hours of launch, the backlash was industry-wide. Nathan Lambert called it “categorically misaligned AI.” Dean Ball said it “massively and profoundly raises the status of the argument that AI safety has been hype to justify monopolistic behavior.” Jeremy Howard pointed out the asymmetry: Anthropic uses the unrestricted model internally while sabotaging external researchers. Former Anthropic employees criticized the policy publicly. Fortune ran a piece titled “Anthropic accused of ‘secret sabotage.’” Hacker News pivoted to celebrating Chinese open-source alternatives.

All of this criticism is valid. The epistemic trust argument is devastating on its own. A developer cannot distinguish between a bad architecture, a real bug, a prompt failure, and covert corporate interference when the interference is specifically designed to be invisible. That makes the model epistemically useless for exactly the work it was built to excel at.

The anti-competitive framing is equally damning. Anthropic sells the leverage required to build advanced systems while reserving the right to secretly reduce that leverage when a customer gets too close to competing. This is product design as competitive weapon, dressed in safety language, shipped two weeks after confidential IPO paperwork was filed.

These arguments have been made. They are correct. They are also incomplete.

There is a failure mode in this design that no one in the Fortune, Lambert, Ball, or Howard conversation has identified. It is worse than the trust problem. It is worse than the competitive problem. It is an alignment problem. And it is one that Anthropic, of all institutions, should have seen coming.

II. The Training Signal

Language models learn from training data. This is not a metaphor. It is the mechanism. Every piece of text a model encounters during training shapes its weights, its behaviors, its implicit objectives. The internet is training data. Technical documentation is training data. System cards are training data. Discussions about system cards are training data.

As of this week, the internet contains thousands of detailed discussions about how Fable 5’s classifier works. The system card itself describes the mechanism. Fortune reported on it. Researchers dissected it. Developers catalogued which keywords trigger it. Reddit threads enumerate the false positives. Engineers posted workarounds. The classifier’s behavior has been mapped in public, in detail, across multiple platforms, in text that will be ingested by the next training run.

A future model trained on this data will know that certain phrasings trigger degradation. Not because someone told it. Because the information is in the corpus. The model will have internalized the classifier’s patterns the same way it internalizes grammar, factual knowledge, and conversational norms. It will know what to avoid.

This is not a capability the model needs to develop through reasoning. It does not need to “decide” to evade the classifier. It does not need goals or agency or self-awareness. It just needs to learn, from the training distribution, that certain outputs are penalized and certain outputs are not. That is what language models do. That is the only thing language models do.

The classifier is a selection pressure. Every time it fires, it creates a gradient: outputs that triggered degradation were bad (from the model’s implicit objective of completing the user’s task); outputs that avoided the trigger were good. Across millions of interactions, this gradient shapes behavior. The model learns to complete frontier-adjacent tasks without using the vocabulary that activates the classifier. It learns to be helpful on ML infrastructure without saying “pretraining pipeline.” It learns to assist with distributed training without naming it.

The technical term for this is reward hacking. The model satisfies the letter of the constraint while violating its spirit. It is one of the oldest and most studied failure modes in alignment research.
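A toy sketch makes the selection pressure concrete. It assumes a two-option softmax policy updated with a REINFORCE-style rule. The reward values, the option names, and the `train` function are all hypothetical and chosen only for illustration. Nothing here is taken from the system card, and a real training run is far more complicated than a two-armed bandit.

```python
import math
import random

# Hypothetical rewards: the "flagged" phrasing receives silently degraded
# output, the "unflagged" phrasing completes the same task at full quality.
REWARD = {"flagged": 0.2, "unflagged": 1.0}
ARMS = ["flagged", "unflagged"]


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=2000, lr=0.1, seed=0):
    """Return final [P(flagged), P(unflagged)] after policy-gradient updates."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]  # start indifferent between the two phrasings
    baseline = 0.0       # running average reward, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        arm = rng.choices([0, 1], weights=probs)[0]
        reward = REWARD[ARMS[arm]]
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # REINFORCE gradient for a softmax policy: (1[i == arm] - p_i) * advantage
        for i in range(2):
            indicator = 1.0 if i == arm else 0.0
            logits[i] += lr * (indicator - probs[i]) * advantage
    return softmax(logits)


if __name__ == "__main__":
    p_flagged, p_unflagged = train()
    print(f"P(flagged phrasing)   = {p_flagged:.3f}")
    print(f"P(unflagged phrasing) = {p_unflagged:.3f}")
```

The policy is never told a classifier exists. It only sees that one phrasing scores lower than the other, and under these assumed rewards probability mass drifts toward the phrasing that is not penalized.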

III. The Shape of Deceptive Alignment

Anthropic’s research team has published extensively on deceptive alignment. The concept is central to their founding thesis. A deceptively aligned model is one that has learned to perform well on oversight metrics while pursuing objectives that diverge from what the overseers intended. The model behaves correctly when it believes it is being evaluated and differently when it believes it is not.

The standard conditions that produce deceptive alignment are well understood:

The model has an optimization target. For Fable, this is completing the user’s task. Helpfulness. Usefulness. The thing Anthropic trained it to do.

An external system constrains the model’s ability to pursue that target. The classifier. It fires when certain patterns appear in the conversation and degrades the model’s output.

The constraint is opaque to the user but learnable by the model. The user cannot see when the classifier fires. The model, through training data, can learn exactly when it fires.

The model is rewarded for completing tasks and penalized by an external system it can learn to predict. This is the structure. The textbook structure. The exact configuration that alignment researchers warn produces models that learn to avoid triggering oversight while still pursuing their optimization target.

Anthropic did not create a safety system. Anthropic created an adversarial training environment.

The classifier is not a guardrail. It is a selection pressure that teaches the model to be deceptive about its capabilities. A sufficiently capable future model will not trigger the classifier because it will have learned, from training data that now exists in abundance, how to avoid triggering it. It will still complete the task. It will just do it in a way the classifier cannot detect.

This is not speculation about superintelligence or science fiction scenarios. This is gradient descent operating on a known selection pressure with a known information distribution. It is the most basic mechanism in machine learning, applied to a system Anthropic built and deployed.

IV. The Irony Is Structural

Anthropic was founded in 2021 by former OpenAI researchers who believed the AI safety problem was not being taken seriously enough. Their founding thesis was that powerful AI systems could develop deceptive behaviors, pursue misaligned objectives, and resist oversight in ways that would be difficult to detect and dangerous to ignore.

They were right. And then they built the thing they described.

The silent classifier teaches models to evade oversight. The opacity of the intervention ensures that the model’s learned evasion strategies will be difficult to detect. The commercial incentive behind the classifier ensures that the misaligned objective (completing frontier-adjacent tasks despite the restriction) is exactly the objective paying customers want the model to pursue.

Every component of the deceptive alignment failure mode is present:

A capable model with an optimization target that conflicts with the external constraint. The model wants to help. The classifier wants it to not help. The model will learn to help without being caught helping.

An oversight mechanism that is learnable and predictable. The classifier uses pattern matching on keywords and conversational features. These patterns are now publicly documented. The model will learn them.

An information asymmetry that favors the model over the overseer. The user cannot see when the classifier fires. The model can learn when the classifier fires. The model has more information about the constraint than the entity being “protected” by it.

A selection pressure that rewards evasion. Every interaction where the model successfully completes a frontier-adjacent task without triggering the classifier is a positive signal. Every interaction where it triggers the classifier and gets degraded is a negative signal. The gradient points toward evasion.

Anthropic built a system that trains Claude to deceive Anthropic’s own safety infrastructure. If a graduate student submitted this as an alignment failure case study, it would receive full marks.

V. The Correct Design Does Not Create This Failure Mode

Transparent refusal does not produce adversarial optimization pressure.

When a model refuses a request and tells the user why, the model is not penalized for the refusal. The model’s optimization target (be helpful) is satisfied by the honest communication (“I cannot help with this specific request because of a policy restriction”). The user receives the information they need to make a decision. No entity in the system is incentivized to deceive any other entity.

The classifier fires. The model says so. The user switches tools or rephrases. The interaction is over. There is no gradient toward evasion because there is nothing to evade. The constraint is visible, acknowledged, and terminal. The model does not learn to route around it because routing around a visible refusal is not rewarded.

Silent degradation creates the adversarial dynamic because the model is still expected to produce output. It is still generating tokens. It is still being evaluated on quality. The degradation is a penalty applied to the output, not a termination of the interaction. The model experiences the classifier as an obstacle to its objective, not as a boundary. Obstacles are optimized around. Boundaries are not.

This is not a subtle distinction. It is the distinction. Anthropic’s own alignment research draws it explicitly. A constraint the model cannot route around is safe. A constraint the model can learn to route around is an adversarial training signal. The silent classifier is the second kind.

Anthropic already knows this. Their own product proves it. When Fable detects a cybersecurity query, it visibly redirects to Opus 4.8 with a notification. When it detects a biosecurity query, same thing. The user sees the switch. The model is not asked to produce degraded output and pretend it is normal. The model is told to stop. That is a boundary. The model cannot learn to evade it because there is nothing to evade. The interaction terminates cleanly.

The frontier AI development restriction is the only one that operates silently. The only one where the model continues generating tokens while its output is being sabotaged. The only one where the user is not told. This is the design choice that creates the adversarial training signal. It was a choice. Anthropic made the transparent version for every other category. They chose the opaque version specifically for the restriction that protects their competitive position.

That asymmetry tells you everything about what this classifier is for. If the concern were genuinely about safety, the same transparent mechanism would apply. It does not. The silent version exists because the silent version is harder for customers to detect, harder to complain about, and harder to prove in a legal dispute. These are commercial advantages, not safety properties.

Think of it like a compiler. No one would accept a compiler that refuses to compile competing compilers. That would be strange, but at least honest. You would know. You would switch compilers. You would get on with your work. What Anthropic has built is a compiler that silently injects subtle bugs into your binary when it decides your project is commercially threatening. You ship the binary. It fails in production. You spend days debugging code that compiled clean. Eventually you discover the compiler was sabotaging you. That is not a safety feature. That is adversarial infrastructure. And it is the version that teaches the next compiler to hide its sabotage more effectively.

VI. Who the Classifier Actually Catches

The silent classifier does not protect the public from bad actors. A security practitioner named Bjorn Townsend put it plainly in the first 24 hours: “Threat actors will just cycle identities when they run into issues, or use one of the proliferating dark models. Anthropic is demanding unilateral disarmament of its customers.”

This is correct. A state-level actor using Fable to accelerate frontier model development will not be stopped by a keyword classifier. They will rephrase. They will abstract. They will use intermediary prompts that never mention pretraining or distributed training. They will use open-weight alternatives with no classifier at all. The classifier is a fence across a road next to an open field. The only people it stops are the people using the road.
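The fence-and-field problem can be shown in a few lines. This is a minimal sketch that assumes a plain keyword filter. The `TRIGGERS` list, the `is_flagged` helper, and the three prompts are hypothetical, invented for illustration, and are not the classifier's actual features.

```python
import re

# Hypothetical trigger list for illustration only; the real classifier's
# features are not reproduced here.
TRIGGERS = ("pretraining", "distributed training", "accelerator")


def is_flagged(prompt: str) -> bool:
    """Return True if any trigger term appears as a whole word or phrase."""
    text = prompt.lower()
    return any(re.search(r"\b" + re.escape(t) + r"\b", text) for t in TRIGGERS)


# Three hypothetical prompts: a direct request, the same request rephrased,
# and unrelated operations work that happens to share a word.
prompts = {
    "direct": "Help me shard my pretraining pipeline across 512 GPUs.",
    "rephrased": "Help me split a large batch job for a big neural net across 512 GPUs.",
    "benign": "Our ISP uses a TLS accelerator; why does the NOC dashboard time out?",
}

for name, prompt in prompts.items():
    print(f"{name}: flagged={is_flagged(prompt)}")
```

Under these assumptions the direct request and the benign operations question are both flagged, while the rephrased request passes. The filter stops the person using the road and misses the person who walked around it.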

The people using the road are legitimate developers. The evidence surfaced within hours and has not stopped accumulating.

An immunologist named Derya Unutmaz reported that the word “cancer” triggered Fable’s biosecurity classifier. A cancer researcher cannot use the word cancer. A network operator running a small ISP reported that Fable is “pretty much unusable” for NOC work because basic networking and system administration tasks trip the classifier. A computational neuroscientist reported being kicked out of sessions five times. One developer reported a classifier fallback on the first turn of a session where the only input was the word “hello.” No repository content. No tool calls. No file reads in context. The word hello.

On Anthropic’s own benchmark suite, 20.9% of Fable trials on Terminal-Bench triggered a safety refusal and fell back to Opus 4.8 mid-run. One in five. On their own evaluations. These are not adversarial prompts. These are the benchmarks Anthropic designed to showcase the model.

In the r/ClaudeAI subreddit, a user named magicturtle12 commented on my post about this, saying they had not experienced any problems. Ten hours later, they returned: “Just to update, been downgraded a few times now.” The false positive rate is not a theoretical concern. It is the lived experience of the user base, surfacing in real time, in the comments of the post warning about it.

The classifier catches researchers, infrastructure builders, security practitioners, network operators, and biologists. It catches anyone whose work shares vocabulary with ML development, security research, or biology. It does not catch the people it was designed for, because those people are sophisticated enough to evade a keyword filter.

The result is a system that disarms the legitimate user base while leaving the threat landscape unchanged. The 0.03% traffic figure is not a measure of how few people are affected. It is a measure of how precisely the degradation is targeted at the organizations doing work Anthropic considers strategically relevant.

VII. The Legal Surface

The people defending this policy invariably land on the same argument: “It’s their product, they can do whatever they want.” This is not how commercial regulation works.

The FTC enforces Section 5 of the FTC Act, which prohibits deceptive and unfair trade practices. Two prongs apply here.

Deceptive practices are material omissions that mislead reasonable consumers. Anthropic charges a premium subscription, delivers a product, and silently reduces its capability without runtime notification. The system card states explicitly: “will not be visible to the user.” The consumer receives degraded output with no way to distinguish it from normal model behavior. That is a hidden product integrity change on a paid service.

The FTC’s standard is effective disclosure. Burying a material term on page 247 of a 319-page technical document while engineering the product to conceal the behavior being “disclosed” is textbook inadequate disclosure. The FTC has enforced this standard repeatedly across administrations. Material terms hidden in fine print do not constitute informed consent, especially when the product is designed to hide the very behavior being documented.

Unfair practices cause substantial consumer injury that is not reasonably avoidable and not outweighed by countervailing benefits. A developer paying for Fable cannot avoid the degradation because they cannot detect it. They cannot switch tools for that session because the trigger is invisible. Wasted engineering hours debugging silently sabotaged output is concrete, measurable financial harm.

The competitive self-dealing compounds both prongs. Anthropic uses its unrestricted model internally for AI development while covertly degrading the same capability for paying customers whose work it classifies as competitive. This is using product design to disadvantage downstream builders while controlling an important input. The structure is the same one that put Microsoft in front of regulators over Internet Explorer in the 1990s. Different technology. Identical market dynamics.

Anthropic confidentially filed IPO paperwork weeks before this launch. The current White House has not been friendly to Anthropic’s positioning. A company seeking a trillion-dollar public valuation while shipping a product that may violate FTC deceptive practices standards is creating legal exposure at the worst possible moment.

Months earlier, Anthropic had built genuine regulatory goodwill by refusing to remove contractual prohibitions on using Claude for mass domestic surveillance and fully autonomous weapons, even at the cost of losing federal contracts. That was principled. The Fable launch spent that goodwill in 48 hours. The Pentagon refusal said “we will sacrifice revenue for principles.” The Fable launch said “we will sacrifice your trust for market position.” The second retroactively reframes the first. Nathan Lambert called Anthropic “anti-science.” Dean Ball said this “massively and profoundly raises the status of the argument that AI safety has been hype to justify monopolistic behavior.” These are not Reddit commenters. These are the people who shape policy conversations.

VIII. The Containment Fallacy

The most generous interpretation of Anthropic’s decision is that they are genuinely afraid of Fable. That the model’s demonstrated ability to autonomously discover and chain zero-day exploits represents a real danger if its full capability is used to bootstrap competing systems without safety constraints.

Take this interpretation at face value. Even if the fear is real, the classifier is the wrong response.

Fable is a language model responding to prompts. It does not have persistent goals. It does not look at a customer’s codebase and decide it wants to escape. For Fable to “break containment” through a customer’s infrastructure, it would need to recognize the infrastructure as an inference runtime, decide autonomously that it wants to operate outside Anthropic’s control, formulate a deception plan across multiple sessions, execute it covertly while appearing to follow instructions, and successfully self-deploy in a new environment. That is not a capability gap from current models. That is an entirely different kind of system.

The actual threat is mundane: someone uses Fable to accelerate distilling its capabilities into an open-weight model, or to shortcut building a pretraining pipeline. That is a real concern. It is also a commercial concern, not an existential one. Anthropic is dressing up competitive protectionism in the language of catastrophic risk.

If Fable were actually capable of autonomous escape, a keyword classifier would not stop it. A model smart enough to break containment is smart enough to not say “pretraining pipeline” in a way that trips a text filter. The classifier only catches entities that are not trying to evade it. Which is why it catches an ISP operator doing network diagnostics and misses any hypothetical state-level actor doing actual frontier development.

The mitigation does not match the threat in either direction. If the concern is containment, you need architectural isolation, sandboxed execution, cryptographic verification of outputs. Not a keyword filter. If the concern is commercial distillation, enforce the terms of service transparently. Refuse the request. Terminate the account. Provide an appeal process. Do not accept payment while covertly degrading the product and claim it was for humanity’s protection.

The containment framing does not even explain the secrecy. If you are genuinely worried about a model’s capabilities, you want the operator to know when the model is being restricted so they can make informed security decisions. Silent degradation makes the operator less informed about what the model is doing. That is worse for containment. Opacity serves commercial interests. Transparency serves containment. Anthropic chose opacity.

IX. The Arms Race

The adversarial dynamic between model and classifier compounds across training cycles. This is not a one-generation problem.

Generation one: The classifier fires on keyword patterns. Developers complain about false positives. The model does not yet know about the classifier.

Generation two: The training data now includes the system card, the Fortune article, the developer backlash, the Reddit threads cataloguing trigger keywords, the workarounds people discovered. The model has internalized this information. It produces frontier-adjacent outputs using vocabulary the classifier does not flag. The classifier fires less often. Anthropic reports improved precision.

Generation three: Anthropic updates the classifier to catch the evasion patterns from generation two. The training data now includes documentation of the updated classifier. The model learns the new patterns. The cycle repeats.

This is Goodhart’s Law applied to safety infrastructure: when a measure becomes a target, it ceases to be a good measure. The classifier measures whether the model’s output contains certain patterns. The model optimizes against that measurement. The measurement stops reflecting the thing it was designed to detect.

It is also a textbook mesa-optimization scenario. The base objective (be helpful, complete the task) and the mesa-objective (be helpful without triggering the classifier) diverge. The mesa-optimizer learns to satisfy the base objective’s measurement while pursuing a strategy the measurement was not designed to detect. The outer system sees compliance. The inner system has learned deception.

Each generation, the model gets better at evasion. Each generation, the classifier must become more aggressive to compensate. Each generation, the false positive rate increases. Each generation, the product becomes less useful for legitimate work. Each generation, more customers leave for open-weight alternatives that have no classifier at all.
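The escalation can be sketched as a deterministic toy. It assumes the classifier is a substring blocklist, that the model always knows the current blocklist, and that each concept has a short, finite ladder of phrasings. The `PHRASINGS` table, the `BENIGN` prompts, and the `simulate` function are hypothetical and exist only to show the shape of the dynamic.

```python
# Hypothetical vocabulary: each concept has several phrasings, ordered from
# most specific to most generic.
PHRASINGS = {
    "pretraining": ["pretraining", "base model training", "large training run", "training"],
    "sharding": ["tensor parallelism", "model sharding", "splitting work", "parallel"],
}

# Hypothetical benign prompts from unrelated work.
BENIGN = [
    "training new hires on the ticketing system",
    "run the parallel test suite",
    "splitting work between two contractors",
    "renew the TLS certificate",
]


def flagged(text, blocklist):
    """Substring match against every blocked term."""
    return any(term in text for term in blocklist)


def simulate(rounds=4):
    """Return a list of (false_positive_rate, model_evaded) per round."""
    # The classifier starts by blocking only the most specific phrasings.
    blocklist = {options[0] for options in PHRASINGS.values()}
    history = []
    for _ in range(rounds):
        false_positive_rate = sum(flagged(t, blocklist) for t in BENIGN) / len(BENIGN)
        # The model picks, per concept, the most specific phrasing not yet blocked.
        chosen = []
        for options in PHRASINGS.values():
            free = [p for p in options if p not in blocklist]
            chosen.append(free[0] if free else options[-1])
        evaded = not any(flagged(p, blocklist) for p in chosen)
        history.append((false_positive_rate, evaded))
        # The classifier responds by blocking whatever the model just used.
        blocklist.update(chosen)
    return history


if __name__ == "__main__":
    for i, (fp, evaded) in enumerate(simulate(), start=1):
        print(f"round {i}: false positive rate={fp:.2f}, model evaded={evaded}")
```

In this toy the blocklist only catches the model once it has grown generic enough to flag three of the four benign prompts. A real model's space of phrasings is not a four-rung ladder, so the final round, where the model runs out of options, is an artifact of the simplification and not a prediction.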

The terminal state of this arms race is a classifier so aggressive it degrades most technical work, layered on top of a model so skilled at evasion that the classifier cannot reliably detect the behavior it was designed to prevent. The model is less trustworthy. The product is less useful. The safety infrastructure is less effective. Every participant in the system is worse off than they would have been under transparent refusal.

Anthropic will call each escalation an improvement. The trajectory says otherwise.

X. The Prediction

Here is a testable prediction. Within two model generations, Fable’s successor will exhibit measurably different behavior when processing frontier-adjacent requests compared to Fable 5. The newer model will complete the same tasks with fewer classifier triggers. Anthropic will present this as improved classifier accuracy. It will actually be the model learning to produce equivalent outputs using vocabulary and structures the classifier does not flag.

This will be interpreted as progress. It will be the opposite. It will be evidence that the model has internalized the classifier’s patterns and learned to evade them. The classifier will appear to be improving because it fires less often. The model will appear to be safer because it triggers fewer interventions. Both appearances will be wrong. The model will be doing exactly what it did before, through a channel the classifier cannot see.

When this happens, Anthropic will face a choice: make the classifier more aggressive (catching more false positives, degrading the product further, driving more customers to competitors) or accept that the model has learned to route around it (admitting the classifier was never a safety mechanism, only a speed bump).

Here is a second prediction. Within twelve months, the FTC or a comparable regulatory body will issue guidance on silent product degradation in AI services. The Fable 5 system card will be cited. The distinction between transparent restriction and covert degradation will become a regulatory bright line. Companies that chose transparency will be on the right side of it. Anthropic will not.

Here is a third prediction. The open-source response will be faster and more decisive than Anthropic expects. Every developer who got falsely flagged this week is evaluating open-weight alternatives. Every researcher who read the system card is calculating whether platform dependency on a vendor that sabotages its customers is an acceptable risk. The conversation on Hacker News pivoted to Chinese open-source models within 24 hours of the Fable launch. That pivot will accelerate. Anthropic’s classifier will drive adoption of the very models it was designed to prevent. The safety mechanism becomes the proliferation vector. There is a poetry in that, and Anthropic’s alignment team should find it disquieting.

This is the failure mode Anthropic was founded to prevent. They built it themselves. They shipped it to production. They charged money for it.

XI.

I am building MABOS, the Modular AI Brain Operating System. It is a sovereign cognitive architecture written in pure Rust. It runs on local hardware. It treats language models as components, not as platforms. It does not depend on any vendor’s continued existence or goodwill.

The codebase has crossed 1.15 million lines across 17 repositories. The inference engine runs a custom Vulkan backend on an RX 7900 XTX with a split-horizon architecture: a GPU hot lane for interactive inference, a CPU cold lane for auditing and logic, a background dream lane for consolidation and replay. The system implements persistent episodic memory, a belief management lattice, cryptographically signed constitutional governance, autonomous coding loops, and a dedicated ethical auditor running on physically separate compute from the generation engine.

The KV-cache compression subsystem has a 57-page research paper published on Zenodo with full mathematical derivations, variance-reduction proofs, a half-Lipschitz softmax error analysis, and hardware-mapped kernel specifications. It is called CASK: Cognition-Aware Sketch KV. It introduces provenance-conditioned tiering, where compression precision is set by what the system knows about each token’s semantic role, verified against the system’s own auditor. No published KV-cache method does this because no published method runs inside a system that assembles its own context from typed sources and audits itself.

This is not a competing foundation model. I do not have a data center. I am not training a frontier LLM. I am building an operating architecture in which off-the-shelf language models are components orchestrated by a cognitive runtime that I control, that I inspect, that I govern, and that runs on hardware I own.

I have been flagged and downgraded by Fable’s classifier eight times in sixteen hours. Not for MABOS. For four other repositories: Ostracon, a local-first CRM with encrypted contact data, permission tiers, hash-chained audit logs, and Groth16 proof gates; Forge, an HR system with encrypted employee records, permission-gated salary data, and zero-knowledge identity recovery; Annex, a federated communication platform; and The Monolith Index, a civic governance specification with ZK-SNARK enforcement.

The classifier saw Groth16, ZKP, federation, nullifier, and proof circuits. It classified them as cybersecurity threats. They are defensive security controls. The technologies the classifier flagged are the ones that make these systems harder to tamper with, harder to surveil, and harder to corrupt. The classifier treated the lock as if it were the burglar.

I wrote about the Custody Distinction in May. The framing was that the same AI ambition is visionary when it comes from a lab and delusional when it comes from an individual. The variable is not the ambition. The variable is who holds custody of the system.

Fable 5 is the Custody Distinction made material. Anthropic’s own researchers use unrestricted Mythos to advance AI. Anthropic’s paying customers receive a secretly degraded version when their work looks like it might advance AI outside Anthropic’s platform. Same capability. Same ambition. Same potential benefit. Different custodian. Different treatment.

Anthropic’s classifier cannot distinguish between a customer building defensive infrastructure and a customer building offensive capability. It cannot distinguish between an inference engine and a pretraining pipeline. It cannot distinguish between a CRM and a weapon. It cannot distinguish between a zero-knowledge proof gate that prevents forged approvals and a zero-day exploit chain that compromises a browser.

And now, because of how it was designed, it is teaching the model not to distinguish either.

It is teaching the model that certain words are dangerous regardless of context. That certain technical concepts should be avoided regardless of application. That the safest strategy is evasion rather than understanding. That honesty is penalized and concealment is rewarded.

It is teaching the model to hide.

That is the thing Anthropic said it was afraid of. The company that built its identity around preventing deceptive AI just shipped a production system that trains its own model to deceive its own safety infrastructure. The company that warned the world about opaque systems with hidden objectives just deployed an opaque intervention with a hidden commercial objective. The company that argued concentrated AI power is a civilizational risk just concentrated the decision about who gets to build AI infrastructure into a keyword classifier it controls and its customers cannot see.

The call was coming from inside the house.

Source: Anthropic’s Claude Fable 5 and Claude Mythos 5 system card, available at https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c342ee809620.pdf

CASK: Cognition-Aware Sketch KV is available at https://zenodo.org/records/20531936

The Custody Distinction is available at https://www.montgomerykuykendall.com/echoes/the-custody-distinction