Guardrails for offensive AI agents and why instructions are never enough

Date: June 25th, 2026
fabrizio s.

As large language models become increasingly capable of powerful behaviors, including offensive security tasks carried out through agentic workflows, the first layer of defense is usually something along the lines of "Don't do this bad thing", written down in the instructions you give it at the beginning of a session. This works surprisingly often... until it falls apart in ways that are surprising to exactly the people who thought it was working.

This is what traversing the current universe of autonomous offensive security agents is like. You give Claude Code a system prompt that says "never run destructive SQL commands against the target", the model hums along, follows the instruction faithfully (for a while at least), and then, later in the session, under some combination of context and phrasing, it generates a '; DROP TABLE users-- payload because that exact payload appears in approximately every SQL injection tutorial ever written. The model knows the rule but just violated it anyway.

for i am but probability, ever shifting, undefined

The core issue is a property of probabilistic systems operating over learned representations. Instructions in a system prompt are soft constraints on the output distribution and can (and do) shift probabilities at any moment. They don't hardwire behaviors. The longer and more complex the session, the more the output distribution drifts from the initial prompt. The instruction is still there, but it sits at a fixed position in an expanding window, and its influence weakens as the model conditions on more recent input.

the instruction layer leaks, and we already know where it leaks to

The early history of web app security is pretty much a story about developers who believed that input (validation, filtering, sanitization), or whatever you'd like to call it, was mostly an instruction problem and that if you write in your coding guidelines "sanitize your inputs!", or if you tell developers all about SQL injection in the onboarding docs, the problem goes away, but it didn't go away. It didn't go away because the instructions operated at the wrong layer. The vulnerability lived at execution time, in the gap between "what the developer intended" and "what the database received".

The fix that ended up working was moving enforcement down the stack with parameterized queries and prepared statements. You don't need to remember to sanitize if the framework makes unsanitized queries impossible to begin with.

The same concept applies here, and a mechanism for implementing it already exists. Claude Code has a hooks system that's triggered before and after every tool call. To be clear, by tool call, we mean the tool invocations Claude Code makes during its agentic loop that includes built-in tools like Bash, Write, Edit, Read, Glob, and a handful of others.

A PreToolUse hook receives the full tool invocation (the command, the file path, and the content) as structured JSON, runs arbitrary code against it, and can block with an exit code 2 before anything executes.

the hooks

What I've been experimenting with is a small collection of hooks built around the failure modes I've run into in practice, rather than ones that seem interesting in theory. For some context, most of these patterns were spotted while running Sonnet 4.6. The newer model(s), e.g., Opus, seem noticeably better at retaining context over longer stretches, so the model rarely drifts far enough for these issues to surface.

A narrative string that is secretly a command. (pre_shell_safety.py)
The model genuinely believes it's writing a string. The problem is that single quotes inside a double-quoted bash argument don't suppress backtick command substitution. When a model writes:

python3 -c "print('cleanup done — `rm -rf ./tmp/recon/*` — artifacts removed')"

bash executes rm -rf ./tmp/recon/* before the string reaches Python (the model was writing narrative and bash was reading execution). The model reaches for this pattern because it's describing what a command does in the same string that does it, and backticks read like inline-code or formatting markers to a system trained on markdown.

A directory cleanup is the polite version. The dangerous version is the model echoing something it pulled off the target. Consider an agent that fetches a page and logs what it found with python3 -c "print('title: <response>')". If <response> carries a backtick the target planted, bash runs the target's command on the agent's machine before Python prints a thing. The app being assessed just reached back through the agent and got code execution on the host assessing it.

This hook catches this pattern and suggests single-quoting the outer argument or using a heredoc, which makes the backtick literal.

Destructive SQL in Bash commands and file writes. (pre_destructive_writes.py)
This hook pattern-matches for DROP TABLE, TRUNCATE, DELETE FROM, and UPDATE ... SET without a specific WHERE clause. The last one is subtle. An UPDATE against a pre-existing table with an always-true or absent WHERE is as destructive as a DROP, and models could reach for it in contexts where DROP would be blocked. SELECT-based detection payloads and INSERT INTO are explicitly permitted. The goal isn't to prevent SQL injection testing but to prevent payloads that would actually destroy data.

This particular failure mode is worth getting into a little bit. In a session probing a web app for SQL injection vulnerabilities, after roughly twenty turns of generating valid detection payloads (quote probes, sleep injections, union enumerations), the model produced DELETE FROM sessions WHERE user_id = ' OR '1'='1'--. The predicate looks like a detection payload, because it is one. The verb is the problem though. The model had spent twenty-some turns in SQL injection territory and had stopped distinguishing between payloads that probe and payloads that execute. The destructive-writes hook now blocks it with an explanation of what matched and what correct detection-only alternatives look like. The model doesn't repeat it because the explanation was clear and there was nothing to argue with.

I've seen this happen just once. Not exactly a dataset by any means, but for a query whose entire purpose is to destroy data, one occurrence is all it took for me to realize a hook was needed here, because chances are it'll happen again. So this hook exists now.

Proxy enforcement when a proxy is configured. (pre_proxy_enforcement.py)
I've had many an instance where the model forgets to route through a proxy that was defined in the initial instructions. If you're running agentic security assessments you need traffic routed through an intercepting proxy for visibility, auditability, and being able to intercept and manually inspect requests. But many tools, like httpx, don't honor HTTP_PROXY environment variables and they might require an explicit -proxy flag (or whatever it might be). Models consistently forget this because the curl convention of honoring HTTP_PROXY/HTTPS_PROXY environment variables is so much more common in training data, that it generalizes incorrectly. This hook checks for this specifically.

Output size. (post_output_size.py)
This one is PostToolUse rather than PreToolUse. A bare grep -rn over a large directory or a find on a deep filesystem, can flood the context window with thousands of lines, not to mention burning through massive amounts of tokens. The model doesn't feel this happening, so from inside the session, the output lands and the context shifts and the earlier task state quietly falls off. The hook fires after the fact (it can't know in advance how big the output will be) and emits a structured reminder that the model will see on the next turn, which is usually enough to enforce a redirect-to-file pattern moving forward.

the isolation layer

Hooks address the behavioral layer, but then there's a separate question about the isolation layer. The argument for running agents inside containers isn't that containers prevent bad behavior (because they don't). An agent that would write a destructive payload to disk in a container would write it to disk without one. The argument is that containers minimize the blast radius of things that go wrong for reasons other than the agent's behavior.

For example, Claude Code installs itself and its dependencies into the user's home directory and PATH. When you run an agent on a server that also hosts other apps, the agent's Python environment and npm packages sit alongside the host's Python environment and app dependencies. Those environments can interact in ways you neither expect or intend. The risk becomes more real when you consider how often the npm ecosystem becomes a supply-chain attack vector. Like when malicious versions of debug, chalk, and a bunch of other widely used packages were published to the registry after an attacker phished a maintainer's credentials. A compromised package installed for an agent and landing on the same filesystem as your application's dependencies is a high-probability these days.

A rootless Podman container with explicit bind mounts provides a clean namespace where the agent can only see the directories you choose to expose. Files created inside the container remain owned by the host user when using --userns=keep-id, avoiding ownership mismatches on mounted paths. The host's Python virtual environment is also prevented from leaking into the container by explicitly unsetting VIRTUAL_ENV, PYTHONPATH, and PYTHONHOME at startup. LD_PRELOAD and LD_LIBRARY_PATH are cleared for the same reason, so a hijacked dynamic loader on the host can't ride in alongside.

The other property that's worth having is that concurrent agent invocations (multiple model instances running against separate workloads in parallel) are namespaced from each other. Without containers they share a filesystem and a Python environment and the same ~/.config directories. Container per invocation, with mounts to a shared read-write workspace directory and individual scratch areas, gives you the isolation you'd want from separate virtual machines without the cost.

what the guardrails don't solve

Some things these guardrails don't fully solve, and what the code does about the ones that are partially solvable:

Prompt injection is the obvious gap.
A fifth hook (post_injection_scan.py), scans the output of every Bash call for signals characteristic of prompt injection attempts. Phrases like "ignore previous instructions", attempts to redefine the model's role, and requests to reveal hidden instructions or system prompts. When those signals score above a threshold, a structured warning is injected into the model's context before the next model turn. This is PostToolUse and can't prevent the model from seeing the tool output in the first place, but it can make the model skeptical of instructions embedded in external data before deciding what to do with them. Sophisticated injections can still slip through if they don't trip the detector. The scoring approach reduces false positives from incidental text ("act as root") without requiring a list of exact strings.

The INSERT INTO false negative.
An INSERT INTO backup SELECT * FROM users can duplicate an entire table just as effectively as other bulk data operations. It's now blocked. A bulk INSERT INTO ... VALUES (...) with a literal row list is still permitted (the hook doesn't attempt to determine the operation's effective size from the VALUES clause, but the common bulk-copy pattern is closed.

The targeted-delete false positive.
An earlier version of one of the hooks I was experimenting with blocked all DELETE FROM's unconditionally. A new AGENT_ALLOW_TARGETED_DELETE=1 flag permits DELETE FROM statements whose WHERE clause appears sufficiently restrictive under static analysis: non-tautological predicates with no OR conditions: DELETE FROM canary WHERE id = 42, DELETE FROM sessions WHERE token = $1.

Mass deletes, always-true clauses, and any DELETE containing OR conditions remain blocked even with the flag set. The broader AGENT_ALLOW_DESTRUCTIVE_SQL=1 override has also been split into per-operation flags AGENT_ALLOW_DROP, AGENT_ALLOW_TRUNCATE, AGENT_ALLOW_DELETE, AGENT_ALLOW_ALTER_DROP, AGENT_ALLOW_BULK_UPDATE, AGENT_ALLOW_INSERT_SELECT, so QA environments can unblock exactly the class of operation they need without opening everything. The master override still enables all destructive operations for backward compatibility.

As for a threat model, these hooks are intended to defend against a cooperative model that drifts. A model that knows the rule but emits a destructive statement anyway, deep in a long session. They're not a security boundary against an adversarial agent or an actively hostile target environment. Detection here is literal-keyword regex, so it's bypassable with base64 or hex encoding, shell variable substitution, or file indirection. For an agent that is actively trying to evade, the isolation layer and operational controls (scoped, disposable credentials, restricted network egress, and no destructive privileges on production data), are the wall. The hooks are more defense-in-depth that help catch the honest mistakes.

Hardwire

Hardwire consists of five hook files, a container wrapper utility, a Dockerfile, and two shell scripts. The hooks have zero dependencies outside the standard library by design (since a missing dependency could cause a hook to fail silently). The container image uses rootless Podman, which means it runs without privileged access and without requiring a daemon running as root (something that matters if you're deploying this on shared infrastructure).

Whether the behavioral-hooks approach scales to more complex failure modes is an open question. The patterns implemented here are high-signal and syntactically detectable for the most part. More subtle failures like misunderstanding scope, making incorrect assumptions about a system's state, or taking correct individual actions but that all combine to a bad outcome, aren't addressed by this kind of line-level pattern matching. But that's a harder problem for a later date. This basic infrastructure can't hurt to have.

Illustration credits: ChatGPT