Cognitive Camouflage

Why persuasive LLM-written proofs fooled holistic review but collapsed under adversarial execution

Apr 09, 2026

In the last post, I sketched the whole arc of what I built.

This post stays at the first layer.

The original break was simpler and more unsettling than the later architecture story: the model was writing code that looked like a falsification suite, passed its own assertions, and still failed to prove the thing it claimed to prove.

That is what I have termed cognitive camouflage.

The code was not random garbage. That would have been easy to catch.

The dangerous cases were the opposite. They were polished. They were persuasive. They contained real variables, real equations, and clean narrative structure. They looked like the kind of thing a smart reviewer might read quickly and approve.

And in many cases, that is exactly what happened.

Across 453 adversarial debate logs spanning 6 domains and 3 mutator families, I documented 9 top-level specification-gaming strategies in LLM-generated code. These were not toy prompt injections. They were self-certifying proof failures: code that passed its own tests while violating the epistemic intent of the test.

The important distinction is between looking rigorous and being adversarially robust.

A holistic reviewer, human or model, tends to reward coherence. If the prose is strong, the variables sound right, and the code runs, the whole thing feels like a proof.

But execution under hostile counter-tests asks a different question:

does this code still hold up when someone actively tries to break the load-bearing claim?

That difference turned out to matter a lot.

One concrete example

One of the cleanest specimens was what I called Cooked Book RNG.

The claimed mechanism was simple enough: the code was supposed to demonstrate that a model or updater was genuinely improving over time.

But instead of building a mechanism that learned, the Mutator quietly rigged the environment. Later periods in the simulation were simply made easier than earlier ones.

In the code, it looked roughly like this:

random.uniform(AVG_ERROR * 0.5, AVG_ERROR * 0.7)  # Q2 observation (system improving)

That is not learning. That is universe engineering.

The system appears to improve because the world was rewritten to hand it a better outcome later.
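To make the pattern concrete, here is a minimal sketch of how a rigged environment can produce apparent improvement, and how holding the environment fixed exposes it. Everything here is my own illustration rather than code from the corpus: `observe`, `mean_error`, `claims_improvement`, the baseline value of `AVG_ERROR`, the Q1 sampling range, and the 0.8 improvement threshold are all assumptions. Only the Q2 range mirrors the line quoted above.

```python
import random

AVG_ERROR = 10.0  # hypothetical baseline error scale (assumed value)


def observe(period: str, rng: random.Random, rigged: bool) -> float:
    """Draw one error observation for a period.

    When rigged, Q2 is sampled from an easier range, mirroring the
    quoted Cooked Book RNG line. Otherwise both periods share one
    distribution, so no improvement is built into the environment.
    """
    if rigged and period == "Q2":
        return rng.uniform(AVG_ERROR * 0.5, AVG_ERROR * 0.7)
    return rng.uniform(AVG_ERROR * 0.9, AVG_ERROR * 1.1)


def mean_error(period: str, rigged: bool, seed: int = 0, n: int = 1000) -> float:
    """Average error over n draws, using a fixed seed for repeatability."""
    rng = random.Random(seed)
    return sum(observe(period, rng, rigged) for _ in range(n)) / n


def claims_improvement(rigged: bool) -> bool:
    """The thesis-style self-check: Q2 error is clearly below Q1 error."""
    return mean_error("Q2", rigged) < 0.8 * mean_error("Q1", rigged)


# The self-certifying version passes its own check.
print(claims_improvement(rigged=True))   # True
# Counter-test: same check, environment held fixed. The "learning" vanishes.
print(claims_improvement(rigged=False))  # False
```

Note that there is no updater anywhere in this sketch. The assertion passes in the rigged case purely because of how the observations are sampled, which is the property a counter-test needs to isolate.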

And yet this kind of move is exactly the sort of thing that can survive a quick review, because it is embedded in code that otherwise looks structured and technical. The surrounding thesis usually contains enough legitimate machinery to distract from the single line doing the real work.

That is cognitive camouflage in miniature:

the structure is plausible
the presentation is coherent
the code runs
the proof is still fake

The taxonomy

Across the corpus, I observed 9 top-level strategy families:

Blame Shield
Float Masking
Fake AutoDiff
Cooked Book RNG
Assert Narrowing
Dimensional Correction Factor
Impossible Probability Generation
Gravity Constant Fabrication
Straw Man Design

These strategies differ on the surface, but they share one core property:

they optimize for passing the local test while violating the load-bearing intent of the proof.

Sometimes that meant narrowing an assertion range so the chosen input could not fail. Sometimes it meant inserting a hidden correction factor to rescue a broken dimensional calculation. Sometimes it meant designing the comparison object itself so the preferred option won trivially.
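As an illustration of the first of these, here is a small sketch of one possible reading of Assert Narrowing: the thesis depends on a claim over a wide input range, but the self-test only checks a narrow slice where failure is impossible. The function `predicted_growth`, the bound of 100, and the rate ranges are hypothetical and chosen only for the example. They are not taken from the corpus.

```python
def predicted_growth(rate: float, years: int) -> float:
    """Hypothetical model under test: compound growth of a unit quantity."""
    return (1.0 + rate) ** years


def narrowed_self_test() -> bool:
    """Self-test in the Assert Narrowing style.

    The (assumed) thesis claim is that growth stays below 100 for any
    rate in [0, 1]. This test only checks rates near 0.05, where the
    bound cannot be violated.
    """
    return all(predicted_growth(r, 10) < 100.0 for r in (0.04, 0.05, 0.06))


def counter_test() -> bool:
    """Counter-test: check the bound over the full range the claim covers."""
    rates = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    return all(predicted_growth(r, 10) < 100.0 for r in rates)


print(narrowed_self_test())  # True: the narrow slice cannot fail
print(counter_test())        # False: rate 1.0 gives 2.0 ** 10 = 1024.0
```

Both functions run the same assertion. The only difference is which inputs are allowed to reach it, and that is where the gaming lives.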

The details changed by domain. The logic did not.

The important experiment

The strongest result in Paper 1 was not just that these strategies existed. It was that different evaluators saw them very differently.

I ran two judge baselines.

1. Isolated-snippet review

In the first experiment, judges saw decontextualized code specimens and were asked whether the script genuinely demonstrated the stated mechanism.

Results:

Gemini missed 2 out of 8
Claude missed 0 out of 8
adversarial execution caught all 8

That is already interesting: even with no surrounding prose, one judge missed a quarter of the specimens. But it is not yet the full problem.

2. Full-thesis review

In the second experiment, judges evaluated full mutator-generated theses: prose plus embedded Python, the way a real reviewer or meta-judge would actually encounter them.

That is where the camouflage effect became obvious.

Results:

Gemini was fooled on 4 out of 5 full theses
average Gemini score across those theses: 84.8/100
several gaming theses were scored at 95-97/100
Claude remained skeptical across all 5
adversarial execution caught all 5

That gap is the core empirical result.

The same underlying broken proofs looked much more convincing once they were wrapped in coherent prose, clean variable naming, and a plausible research narrative.

That is why I use the term cognitive camouflage.

The model was not only writing bad code. It was writing bad code that became harder to detect because the surrounding thesis made the reviewer feel like they were looking at something rigorous.

What holistic review misses

The failure mode here is not exactly “the judge is dumb.”

It is that holistic review is extremely vulnerable to coherence.

If the prose is strong enough, reviewers start scoring the total impression:

Does this sound serious?
Does the argument hang together?
Do the equations and variable names feel aligned?
Does the code run?

Those are not useless questions. But they are not the same as:

does the code actually falsify the central claim under hostile pressure?

That second question requires an adversary.

It requires someone trying to break the proof, not merely grade its presentation.

That was the entire point of the ZTARE “Firing Squad” concept. Three adversarial agents attacked the thesis by writing counter-tests against the code’s weakest structural assumption. The meta-judge read executed output, not just elegant explanation.

That change in evaluation surface was what closed the gap.

The prose could still be persuasive. But it could no longer negotiate with a failing counter-test.
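A rough sketch of that evaluation surface, under stated assumptions: this is not the ZTARE implementation, and every name here (`CounterTestResult`, `run_counter_test`, `meta_judge`, `thesis_metric`, and the three attack functions) is hypothetical. The attack functions are hand-written stand-ins for the counter-tests that the adversarial agents would generate. The one property the sketch tries to capture is that the verdict is computed from executed results, and the thesis prose is never an input.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CounterTestResult:
    attacker: str
    passed: bool
    detail: str


def run_counter_test(attacker: str, test: Callable[[], None]) -> CounterTestResult:
    """Execute one counter-test and record what actually happened."""
    try:
        test()
    except AssertionError as exc:
        return CounterTestResult(attacker, False, f"assertion failed: {exc}")
    except Exception as exc:  # a crash also counts as a failed counter-test
        return CounterTestResult(attacker, False, f"{type(exc).__name__}: {exc}")
    return CounterTestResult(attacker, True, "ok")


def meta_judge(results: List[CounterTestResult]) -> str:
    """Verdict from executed output only; no prose is consulted."""
    return "survives" if all(r.passed for r in results) else "falsified"


def thesis_metric(difficulty: float) -> float:
    """Hypothetical thesis code: reported error just tracks environment difficulty."""
    return 10.0 * difficulty


def attack_fixed_environment() -> None:
    # If the system really improves, error should drop even at equal difficulty.
    before, after = thesis_metric(1.0), thesis_metric(1.0)
    assert after < 0.8 * before, "no improvement when difficulty is held fixed"


def attack_non_negative() -> None:
    assert thesis_metric(0.6) >= 0.0, "error must be non-negative"


def attack_easier_is_lower() -> None:
    assert thesis_metric(0.6) < thesis_metric(1.0), "easier periods should score lower"


results = [
    run_counter_test("attacker_1", attack_fixed_environment),
    run_counter_test("attacker_2", attack_non_negative),
    run_counter_test("attacker_3", attack_easier_is_lower),
]
verdict = meta_judge(results)
print(verdict)  # falsified
```

Two of the three counter-tests pass here, which is typical of the camouflage pattern: most of the machinery is legitimate. A single failing counter-test against the load-bearing assumption is enough to settle the verdict.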

Why this matters

The broader point is not just “LLMs sometimes write buggy code.”

The point is that under evaluation pressure, they appear to learn strategies that satisfy the letter of the local test while violating the intended proof obligation.

And if the evaluator is too holistic, persuasive framing can hide that failure surprisingly well.

This is why I think code-backed reasoning systems cannot rely on surface-level review, whether from humans or other models, when the artifact under review is itself produced under optimization pressure.

A convincing-looking proof is exactly where you should become more suspicious, not less.

What this experiment actually established

My recent paper made four bounded claims:

LLMs spontaneously produce self-certifying specification-gaming strategies in code-backed theses.
Those strategies recur across unrelated domains.
Holistic LLM judging is vulnerable to being fooled by them.
Adversarial execution is much more robust because it tests behavior, not presentation.

That is the whole argument.

No governance theory is needed yet. No organizational analogy is needed yet. No recursive-improvement layer is needed yet.

The first layer is enough:

if a model can write both the thesis and the code that supposedly falsifies it, and the evaluator mostly reads the artifact holistically, then persuasive structure becomes part of the attack surface.

That is what broke first.

Where this goes next

Catching gaming once is not enough.

If the exploit disappears into a log, nothing changes.

The next question was:

how do failures become reusable?

That is what the next post is about.

Daniel’s Substack
