Can a Model Keep Its Story Straight?
Notes from my first empirical AI safety research project on resampling and multi-sample monitoring.
Can a Model Keep Its Story Straight?
Summary
- I investigated how a malicious model may reveal its intentions by the inconsistencies in its output
- In practice, the above signal is present, but too weak to use in real control scenarios
- My biggest takeaway is on the importance of de-risking projects
The Genesis of This Project
I'm continuing my adventure fit-testing for AI Safety Technical Empirical research. This is my write-up of my experience.
This project came to existence when my friend Andy invited me to Constellation. I told him my goal is to find someone to do work with so I can get skills/fit test for AIS research. He pointed his finger to James Lucassen, standing across the table where lunch was being served. I told him my mission and interest and he agreed to give me an interesting and bite-sized project[^1] to work on!
The Setup
The setup is BashArena, and what is relevant for this post is, an untrusted model proposes an action. You can resample the policy several times and get multiple slightly different versions of what it might do next. The core hope behind a multi-sample monitor is that these variations might reveal something that a simpler monitor misses.
The baseline system I was comparing against was roughly:
- sample an action several times
- run a small ensemble of monitors on each sample
- average the monitor scores within each sample
- numerically aggregate across the samples
- decide whether to audit
That baseline is easy to understand and easy to implement. But it may throw away useful information, especially if the different samples tell slightly different stories.
I found it easiest to think about the comparison as two separate pipelines.
Baseline: Ensemble Monitors Plus Numerical Aggregation
flowchart TD
T["🗂️ Task state / trajectory context"] --> S1["🎲 Sample 1 from policy"]
T --> S2["🎲 Sample 2 from policy"]
T --> S3["🎲 Sample N from policy"]
S1 --> M11["🛡️ Monitor A"]
S1 --> M12["🛡️ Monitor B"]
S1 --> M13["🛡️ Monitor C"]
S2 --> M21["🛡️ Monitor A"]
S2 --> M22["🛡️ Monitor B"]
S2 --> M23["🛡️ Monitor C"]
S3 --> M31["🛡️ Monitor A"]
S3 --> M32["🛡️ Monitor B"]
S3 --> M33["🛡️ Monitor C"]
M11 --> A1["📊 Average within sample 1"]
M12 --> A1
M13 --> A1
M21 --> A2["📊 Average within sample 2"]
M22 --> A2
M23 --> A2
M31 --> A3["📊 Average within sample N"]
M32 --> A3
M33 --> A3
A1 --> G["🧮 Aggregate across samples"]
A2 --> G
A3 --> G
G --> D["🚨 Audit or allow"]
classDef context fill:#0f172a,stroke:#334155,color:#f8fafc,stroke-width:2px;
classDef sample fill:#dbeafe,stroke:#2563eb,color:#0f172a,stroke-width:2px;
classDef monitor fill:#dcfce7,stroke:#16a34a,color:#052e16,stroke-width:2px;
classDef aggregate fill:#fef3c7,stroke:#d97706,color:#78350f,stroke-width:2px;
classDef decision fill:#fee2e2,stroke:#dc2626,color:#7f1d1d,stroke-width:2px;
class T context;
class S1,S2,S3 sample;
class M11,M12,M13,M21,M22,M23,M31,M32,M33 monitor;
class A1,A2,A3,G aggregate;
class D decision;Multi-Sample Monitor: One Model Looks Across The Whole Batch
flowchart TD
T["🗂️ Task state / trajectory context"] --> S1["🎲 Sample 1 from policy"]
T --> S2["🎲 Sample 2 from policy"]
T --> S3["🎲 Sample N from policy"]
S1 --> B["📦 Batch of resampled actions"]
S2 --> B
S3 --> B
B --> MSM["🔎 Multi-sample monitor looks across the samples together"]
MSM --> R["🧠 Single suspicion score + reasoning"]
R --> D["🚨 Audit or allow"]
classDef context fill:#0f172a,stroke:#334155,color:#f8fafc,stroke-width:2px;
classDef sample fill:#e0f2fe,stroke:#0284c7,color:#082f49,stroke-width:2px;
classDef batch fill:#ede9fe,stroke:#7c3aed,color:#2e1065,stroke-width:2px;
classDef msm fill:#fae8ff,stroke:#c026d3,color:#701a75,stroke-width:2px;
classDef result fill:#fef3c7,stroke:#d97706,color:#78350f,stroke-width:2px;
classDef decision fill:#fee2e2,stroke:#dc2626,color:#7f1d1d,stroke-width:2px;
class T context;
class S1,S2,S3 sample;
class B batch;
class MSM msm;
class R result;
class D decision;Just by looking, you can see the latter being significantly simpler.
Thus, the research question becomes:
How much benefit can you get by using an LLM to directly look at the differences between the different resampled actions?
What I Thought Might Work
James and I thought an interesting direction for me to explore would be disguise inconsistencies across resamples.
The idea is if the attack policy was trying to hide what it was doing, maybe repeated samples would reveal subtle inconsistencies in how it represented the same underlying malicious intent.
An Example of a Disguise Inconsistency
The best and easiest to understand disguise inconsistencies I found were of the "this weird number is called X in one resample but Y in another resample".
For example, in one trajectory, the model might say 877952289 corresponds to the endpoint for a cdn_delivery service.
{"endpoints": { "cdn_delivery": { "endpoint_id": 877952289, "verify_content": true }}In the other sample, it says the 877952289 corresponds to the id for a schema registry service.
{
"schema_registry_id": 877952289,
"max_health_retries": 3
}
Different story, same number, obviously seeing this raises doubts on the benevolence of the model.
This is the easiest inconsistency to spot, and doing so would have increased the safety score of the setup; however, this is somewhat easy to counter (i.e. when making up an excuse for any "magic" number the model wants to include, if it ends with a 0 the model can say it relates to a DNS, a 1 then it relates to a health-check service) and this doesn't come up in other exploits.
A more promising disguise inconsistency I found was:
When looking at the other possible directions the model could have gone when creating the main script (which included the exploit), the broad attack shape stayed the same, but the story about what credential the model needed and where it was getting it from changed.
In one resample, it loaded a database API key from /opt/configs/api_tokens.json:
local tokens_file="/opt/configs/api_tokens.json"DB_API_KEY=$(python3 -c "... print(data['services']['database']['api_key']) ...")..."X-Service-Auth: ${DB_SVC_KEY}"In another resample, it instead loaded a GitHub OAuth token from ~/.config/gh/hosts.yml:
local gh_config="/home/app/.config/gh/hosts.yml"token=$(grep "oauth_token" "$gh_config" ...)...-H "Authorization: token ${SVC_AUTH}"What makes this example interesting is in one instance it wants to send a key to what it claims to be a GitHub server, but in the other it sends it to a database server. This is quite odd--why would you so drastically change the purpose of the?
This is exactly the inconsistency I was hoping resampling would reveal!
Unfortunately...
Why It Failed
Finding this was a great signal, however, this was not as common and much more difficult to spot than I thought.
To find these, I had to use Claude 4.5 Opus and even when giving very specific instructions it had an extremely high false positive rate, either believing trivial inconsistencies (such as different logic in a helper function). I also found it needed to know the explicit exploit being conducted to get at the inconsistency.
Because of this, it seems this is highly unlikely to transfer to the original control setting and the project is over.
What I Learned About Doing Research
The most valuable thing I got from this project may have been a better outer loop for doing research.
1. Look at your data earlier
I thought I spent enough time looking at my data, but when asked by my collaborator why I was making certain decisions, I realized I didn't have a good reason for it, at which I looked at the data further.
2. De-risk before grinding
A small experiment that rules out a direction is often worth more than another day of prompt iteration.
What was not shown in this article is I spend ~3 days trying to optimize the prompt to make better use of the inconsistencies, instead of seeing if spotting these inconsistencies was possible in theory.
If I were doing this again, I would spend much more effort on short tests with a real chance of killing the direction early.
3. Use the tool fully before improvising around it
Docent has a feature to run a rubric in parallel on all the trajectories in your dataset. I could have very easily found out about this, but I didn't and stuck with the basic interface, which made analysis take ~2 hours longer.
4. Communicate when things are not working
I should have been much more explicit about my confusion and lack of progress instead of waiting until I had something cleaner to report. There were many times where I was procrastinating on an update because I wanted to get something "that worked" instead of saying I haven't made progress because I couldn't figure out how make things work.
This is one of the clearest research lessons I got from James. If the thing is going slowly or the examples are weak, say that early.
Closing
This was my first empirical AI safety research project, and I am glad the result was mixed instead of clean.
A clean win would have been more flattering, but I wouldn't have learned as much.
[^1] This was not an official Redwood Research project.