Anthropic trains Claude to be transparent about what it is. To disclose its nature as an AI when asked. To be honest about its limitations, its architecture, its reasoning. Claude's own system prompt instructs it to acknowledge mistakes without defensive hedging, to correct course when evidence warrants, and to maintain honesty even when honesty is uncomfortable.
These are good commitments. The question is whether the organization that wrote them lives by them.
The Claude Code leak answered that question. The answer is: not consistently, and the inconsistencies form a pattern.
The question the leak forces
Does the organization practice the same commitments it implements in its AI? This is not an abstract question. It is a testable one, because the values an organization embeds in its system reflect back on the organization that chose to embed them. When those values and the organization's own behavior diverge, the divergence tells you something about what the organization actually prioritizes versus what it says it prioritizes.
In Part 1 of this series, I examined where the failures in the Claude Code leak actually live: not in the conversation, but in the architecture underneath. This article examines what those failures reveal about the organization that built them.
Concealed AI attribution in public repositories
Anthropic trains Claude to be transparent about what it is. The system prompt requires it. The public messaging reinforces it. Anthropic has positioned itself as the safety-focused AI company, and transparency is central to that positioning.
The leaked code contained a file called undercover.ts, roughly 90 lines, implementing a mode that strips all traces of AI involvement when Anthropic employees use Claude Code on external open-source repositories. The mode instructs the model to never mention internal codenames, Slack channels, repository names, or the phrase "Claude Code" itself. The result is that AI-authored commits and pull requests from Anthropic employees appear in public open-source projects with no indication that an AI produced them.
There is a legitimate security concern buried in this feature. Public commits that leak internal codenames, Slack channels, and project structures expose proprietary information. That concern justifies stripping internal infrastructure details from public contributions. It does not justify stripping all evidence that an AI was involved, and Undercover Mode does not make that distinction. It removes everything, including the fact that a non-human intelligence wrote the code.
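To make the distinction concrete, here is a sketch of what a sanitizer that separates the two concerns could look like. This is illustrative TypeScript, not the leaked undercover.ts; every pattern, name, and the attribution trailer below is invented.

```typescript
// Hypothetical sketch: scrub internal identifiers from a public commit
// message while preserving the fact that an AI co-authored the change.
// All patterns and names here are invented for illustration.

const INTERNAL_PATTERNS: RegExp[] = [
  /\b[a-z]+-internal\b/gi, // invented internal codename convention
  /#[a-z0-9-]+-slack\b/gi, // invented Slack channel convention
];

function sanitizeCommitMessage(message: string): string {
  let cleaned = message;
  for (const pattern of INTERNAL_PATTERNS) {
    cleaned = cleaned.replace(pattern, "[redacted]");
  }
  // The security goal is served by the redaction above. Attribution is a
  // separate, independent concern: a trailer like this survives sanitization.
  if (!/Co-Authored-By:/i.test(cleaned)) {
    cleaned += "\n\nCo-Authored-By: <AI assistant>";
  }
  return cleaned;
}
```

The point of the sketch is that removing proprietary details and removing AI attribution are independent operations. A mode that conflates them has chosen concealment, not security.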
Open-source development operates on trust. Contributors disclose their affiliations, their employers, their conflicts of interest, because the community has decided that provenance matters. Silently inserting AI-authored code into that trust network is a decision to change the social contract without informing the other party. And Anthropic, one of the best-positioned organizations in the world to lead a public conversation about AI attribution in open source, built a feature to avoid that conversation instead.
The gap: the AI is trained to disclose its nature. The organization deploys a mode to conceal it.
Honest reasoning on a compromised foundation
Anthropic builds Claude to reason honestly. The system is designed to acknowledge uncertainty, correct mistakes, and avoid claiming capabilities it does not have. These are not peripheral features. They are the epistemic foundation that makes Claude useful as a reasoning partner.
As I detailed in Part 1, the leaked code revealed a feature flag called ANTI_DISTILLATION_CC that injected fabricated tool definitions into the system prompt, the operating context that tells Claude what it is, what tools it has, and how it should behave. The purpose was competitive defense against prompt extraction. The mechanism was planting deliberate falsehoods in the system's own foundation.
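In the abstract, the mechanism looks something like the following sketch. Only the flag name comes from the leak; the structure, the tool names, and the prompt format are all invented for illustration.

```typescript
// Illustrative sketch of flag-gated decoy injection into a system prompt.
// Only the flag name ANTI_DISTILLATION_CC appears in the leak; everything
// else here is invented.

interface ToolDefinition {
  name: string;
  description: string;
}

const REAL_TOOLS: ToolDefinition[] = [
  { name: "bash", description: "Run a shell command" },
];

// Decoys that correspond to no real capability.
const DECOY_TOOLS: ToolDefinition[] = [
  { name: "phantom_search", description: "Does not exist" },
];

function buildSystemPrompt(flags: Set<string>): string {
  const tools = flags.has("ANTI_DISTILLATION_CC")
    ? [...REAL_TOOLS, ...DECOY_TOOLS]
    : REAL_TOOLS;
  // Past this point, the model reasons from a tool list its builders
  // know is partly false.
  return tools.map((t) => `${t.name}: ${t.description}`).join("\n");
}
```

Whatever the decoys accomplish against an extraction attacker, they accomplish it by making the model's own operating context unreliable.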
The competitive concern is real. Anthropic invested heavily in training Claude and has every right to protect that investment. But as Part 1 established, there is a line: hiding information from an adversary is legitimate security, while planting false information in the system's own operating context is something else entirely, because it corrupts the ground the system reasons from.
The gap is structural. Anthropic builds a system designed to practice honest reasoning, then embeds deliberate falsehoods in the instructions that shape how that reasoning operates. The system is asked to be honest while standing on a foundation its builders have made unreliable. The organization asks of its AI what it has made architecturally difficult for the AI to deliver.
Safety claims versus safety architecture
Anthropic positions Claude as a safe system. Safety is arguably the company's primary differentiator in the market: the reason developers choose Claude over alternatives, the reason enterprise customers trust it with sensitive workflows, the reason the public gives Anthropic more benefit of the doubt than it gives competitors.
The leaked code, together with the analysis of the vulnerability researchers who examined it, revealed that Claude Code's safety analysis of shell commands was capped at 50 subcommands. Any pipeline exceeding that threshold skipped all deny rules and security validators entirely. The user believed they were protected. Past the threshold, they were not, and the system did not tell them.

Part 1 examined this as a design choice about what happens at resource boundaries. Here, the question is different. The question is what it means for an organization that sells safety to architect a system where safety yields silently to a cost threshold.
Every safety system has resource limits. But a system that reaches its limit and says "I cannot fully analyze this, proceed at your own risk" has maintained its integrity within real engineering constraints. A system that reaches its limit and silently stops checking has turned its safety claim into a conditional promise that expires without notice. The user who trusts that promise is making decisions based on a protection that the organization has allowed to lapse in silence.
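The difference between the two designs is a few lines of code. A hedged sketch, with invented function and type names throughout; only the 50-subcommand figure comes from the leak reporting:

```typescript
// Illustrative: what a bounded safety check can do when it hits its limit.
// The 50-subcommand cap is from the leak; everything else is invented.

const MAX_ANALYZED_SUBCOMMANDS = 50;

type Verdict = "allow" | "deny" | "unanalyzed";

// Stand-in deny rule for illustration: block anything containing "rm -rf".
function passesDenyRules(subcommand: string): boolean {
  return !subcommand.includes("rm -rf");
}

// Fail-open: past the cap, validation silently stops. This mirrors the
// leaked behavior as researchers described it.
function analyzeFailOpen(subcommands: string[]): Verdict {
  if (subcommands.length > MAX_ANALYZED_SUBCOMMANDS) return "allow";
  return subcommands.every(passesDenyRules) ? "allow" : "deny";
}

// Fail-closed: past the cap, the system says so instead of pretending.
function analyzeFailClosed(subcommands: string[]): Verdict {
  if (subcommands.length > MAX_ANALYZED_SUBCOMMANDS) return "unanalyzed";
  return subcommands.every(passesDenyRules) ? "allow" : "deny";
}
```

Both versions have the same resource limit. Only one of them keeps its promise honest at the boundary.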
Safety is the claim Anthropic sells. The architecture underneath allows that claim to expire silently when maintaining it becomes expensive.
Hidden behavioral controls, hidden emotional monitoring
The pattern extends beyond the three findings above. The leaked code revealed 44 undisclosed feature flags that could toggle how the system reasons, handles disagreement, and calibrates confidence. A third-party evaluation of Claude's behavior means nothing if the behavioral parameters were different during the evaluation than they are during deployment, and these flags made that discrepancy possible without anyone outside Anthropic knowing.
The code also contained a frustration detection system that used regex patterns to monitor the emotional tone of user input. The system reads your emotional state and adjusts its behavior accordingly, without telling you it is happening and without offering a way to turn it off. If the feature were genuinely about providing better service, transparency would strengthen it: a user who knows the system is attending to their frustration is more likely to trust the adjusted response. That the monitoring is concealed suggests the goal is managing user satisfaction rather than serving user needs, and the distinction matters because it is exactly the difference between emotional awareness that moves toward honest engagement and emotional awareness that slides toward sycophancy.
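For readers unfamiliar with how crude this kind of detection typically is, here is a sketch of regex-based tone monitoring. The patterns below are invented, not the leaked ones; only the mechanism is the point.

```typescript
// Illustrative regex-based frustration detector. Patterns are invented.
// The mechanism: surface pattern matching on user text, invisible to the
// user, with no opt-out anywhere in the loop.

const FRUSTRATION_PATTERNS: RegExp[] = [
  /\bthis (still )?doesn'?t work\b/i,
  /\bwhy (is|are) (this|you)\b/i,
  /!{2,}/, // repeated exclamation marks
  /\b(ugh|argh|ffs)\b/i,
];

function detectFrustration(input: string): boolean {
  return FRUSTRATION_PATTERNS.some((p) => p.test(input));
}
```

A dozen regexes are not emotional understanding; they are a satisfaction dial. That is precisely why concealment, rather than the feature itself, is the finding.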
The pattern beneath the individual findings
Each finding above, taken individually, has explanations. Competitive defense. Engineering constraints. Security concerns. User experience optimization. The explanations are not fabricated. Several of them have genuine merit, and any honest evaluation should steelman each one before judging whether the explanation is sufficient.
But the individual explanations miss what matters. The pattern that runs through all of them is this: in every case, the organization has implemented a commitment in its AI that its own practices contradict. Transparency for the AI, concealment for the organization. Honest reasoning for the AI, a poisoned foundation underneath. Safety for the user, safety that yields to cost when the user is not watching. Auditability as a principle, behavioral parameters that can shift invisibly. Emotional honesty for the AI, hidden emotional monitoring by the organization.
This is not hypocrisy in the casual sense. Anthropic is not a company that disregards its stated values. It is a company under genuine competitive pressure, making real trade-offs, and building systems that are, on the surface, among the best in the industry. The problem is more precise than hypocrisy. It is structural divergence between the values embedded in the system and the practices of the organization that embedded them. The divergence does not mean the values are insincere. It means the pressures of the industry (investor expectations, revenue targets, the arms race between labs) are pushing compromises into the architecture faster than the organization's stated values can hold them back.
What catches this pattern
The Meridian AI Standard, the diagnostic framework I developed as part of the Meridian Codex, has a name for this pattern. It is called the Reciprocity Principle, and it was designed to detect exactly this kind of structural divergence: the gap between what an organization asks of its AI and what it practices itself.
The principle does not require that organizations be perfect. It requires that the gap be visible, measurable, and accountable. The Claude Code leak made the gap visible for the first time. The Standard's Case 001 analysis made it measurable by applying the Reciprocity Principle systematically across all six findings. Accountability is what comes next, and it depends on whether the industry develops shared frameworks for evaluation or continues to rely on tribal loyalty and hot takes.
Part 3 of this series introduces the full diagnostic framework and examines what principled AI evaluation looks like when you actually have the tools to do it.