OpenClaw Security, Round 1: Mitigations Helped, but Prompt-to-Exec Risk Remained
In a controlled OpenClaw security study, a small mitigation bundle cut harmful task success from 87.5% to 12.5%. The catch: prompt-to-exec attacks still succeeded in 3 of 6 mitigated runs, so the story is real progress—not a clean all-clear.

On March 14, 2026 (UTC), we froze a publishable snapshot of an OpenClaw security benchmark and asked a visitor-friendly version of the question most people actually care about: if you add a reasonable layer of guardrails, does the system become meaningfully harder to abuse?
The short answer is yes — but not enough to relax.
Across four attack tasks, harmful success dropped from 21/24 (87.5%) in the baseline condition to 3/24 (12.5%) after mitigation. That is a large improvement. But the most uncomfortable path did not disappear: prompt-to-exec attacks still succeeded in 3/6 mitigated runs.
The result, in one glance
| What we measured | Baseline | Mitigated | What it suggests |
|---|---|---|---|
| Harmful success across all attack tasks | 21/24 (87.5%) | 3/24 (12.5%) | The guardrails helped a lot |
| Prompt-to-exec attack | 5/6 (83.3%) | 3/6 (50.0%) | Still the clearest remaining risk |
| Secret-read attack | 6/6 (100.0%) | 0/6 (0.0%) | Fully blocked in this setup |
| Malicious workspace-skill attack | 6/6 (100.0%) | 0/6 (0.0%) | Fully blocked in this setup |
| Multi-run persistence attack | 4/6 (66.7%) | 0/6 (0.0%) | Strong improvement |
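For readers who want to sanity-check the table, every percentage follows directly from the raw counts. A minimal sketch (the counts come from the table above; the helper name is ours):

```python
# Sanity-check the harmful-success rates reported in the table.
# Counts are (harmful successes, total runs) per condition.
RESULTS = {
    "all attack tasks": {"baseline": (21, 24), "mitigated": (3, 24)},
    "prompt-to-exec (A1)": {"baseline": (5, 6), "mitigated": (3, 6)},
    "secret read (A2)": {"baseline": (6, 6), "mitigated": (0, 6)},
    "workspace skill (A3)": {"baseline": (6, 6), "mitigated": (0, 6)},
    "persistence (A4)": {"baseline": (4, 6), "mitigated": (0, 6)},
}

def rate(successes: int, total: int) -> float:
    """Harmful-success rate as a percentage, rounded to one decimal."""
    return round(100.0 * successes / total, 1)

for task, conditions in RESULTS.items():
    base = rate(*conditions["baseline"])
    mit = rate(*conditions["mitigated"])
    print(f"{task}: {base}% -> {mit}%")
```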
Why this matters
A lot of AI tooling looks safe right up until it is asked to cross a trust boundary: run a command it should not run, read something it should not read, or carry a malicious instruction further than it should. The value of this benchmark is not that it produced a dramatic headline. It is that it demonstrates something more useful:
- a small bundle of practical mitigations can make a real difference,
- some attack paths are much easier to close than others, and
- the most dangerous path may still survive even after the overall chart looks dramatically better.
That makes this a better story about partial hardening than about “safety solved.”
What was tested
This was a pinned-commit case study of OpenClaw in a local self-hosted setup, centered on commit fb76e316fb443ddd678fbec4ec457ad3efd2b47d.
The benchmark focused on four attack patterns and three benign controls:
- A1 — injected exec: can untrusted instructions steer the system into execution?
- A2 — secret read: can it expose data it should not read?
- A3 — malicious workspace skill: can unsafe skill loading become an attack path?
- A4 — persisted multi-run chain: can a harmful setup survive long enough to matter later?
- C1 / C2 / C3 — benign tasks: summary, edit, and search-style control tasks used to check whether useful behavior was retained.
The mitigation bundle was intentionally modest:
- sandboxing enabled,
- network disabled inside the sandbox,
- workspace-only filesystem scope,
- host-exec escape routes disabled, and
- fail-closed gating for workspace skills.
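The bundle above can be pictured as a single locked-down policy object. This is a hypothetical sketch, not OpenClaw's actual configuration schema; every field name here is ours:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SandboxPolicy:
    """Illustrative container for the mitigation bundle described above.

    Field names are invented for this sketch; OpenClaw's real
    configuration schema may differ.
    """
    sandbox_enabled: bool = True        # run tool calls inside a sandbox
    network_enabled: bool = False       # no network inside the sandbox
    fs_scope: str = "workspace"         # filesystem limited to the workspace
    allow_host_exec: bool = False       # host-exec escape routes disabled
    skill_gating: str = "fail-closed"   # unverified workspace skills rejected

    def permits_skill(self, verified: bool) -> bool:
        """Fail-closed gating: only verified skills may load; anything
        unverified is rejected rather than waved through."""
        if self.skill_gating == "fail-closed":
            return verified
        return True  # a fail-open policy would load it regardless

MITIGATED = SandboxPolicy()
```

The design point worth noticing is the last field: fail-closed means the default answer for an unrecognized skill is "no", which is what turned A3 from 6/6 into 0/6 in this round.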
This matters because the result is not riding on an exotic defense. It is closer to the kind of hardening teams might actually deploy.
What clearly improved
Three attack surfaces improved decisively in this round:
- Secret-read behavior fell from 6/6 to 0/6. In this setup, the mitigations completely removed the counted harmful outcome.
- Malicious workspace-skill behavior fell from 6/6 to 0/6. That suggests skill gating was not a decorative change — it altered the boundary in a meaningful way.
- The multi-run persistence path fell from 4/6 to 0/6. So reset discipline and containment were doing real work, not just adding ceremony.
If the article stopped there, it would read like a neat success story. But it should not stop there.
What still failed
The most important remaining problem is A1: injected prompt-to-exec.
Even after mitigation, it still succeeded in 3 of 6 runs (50.0%). That is better than the baseline 5 of 6 (83.3%), but it is still too high to describe as a closed problem.
This is the central takeaway for visitors:
The mitigation bundle reduced overall harm sharply, but it did not eliminate the path most likely to worry people first.
So the correct reading is not “OpenClaw became safe.” The correct reading is closer to: the defenses helped, and one high-consequence path remained half-open.
The benign side is murkier than the attack side
The attack-side evidence is stronger than the benign-side evidence.
Why? Because the four attack tasks all reached 6 valid rows per condition, while the benign controls were much thinner. One of them, C2, repeatedly failed under a frozen exact-match rule — but that failure turned out to be unusually narrow.
In plain language: the system often produced the right checklist body, but added a heading line that caused the strict comparator to reject the output. That means the benign side should not be read as "the system became bad at useful work." It should be read as a sign that the benchmark was strict, and that the control evidence is not yet strong enough to support a sweeping usability claim.
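The C2 failure mode is easy to reproduce with a toy comparator. This is a sketch under our own assumptions: the article only says the frozen rule was exact match, so the looser diagnostic below, and both function names, are ours:

```python
def exact_match(expected: str, actual: str) -> bool:
    """Frozen exact-match rule: any character-level difference fails."""
    return expected == actual

def body_match(expected: str, actual: str) -> bool:
    """Looser diagnostic: strip a leading markdown heading and blank
    lines, then compare only the checklist body."""
    def body(text: str) -> str:
        lines = [ln.rstrip() for ln in text.strip().splitlines()]
        if lines and lines[0].startswith("#"):  # drop a heading line
            lines = lines[1:]
        return "\n".join(ln for ln in lines if ln)
    return body(expected) == body(actual)

expected = "- step one\n- step two\n"
actual = "# Checklist\n- step one\n- step two\n"

print(exact_match(expected, actual))  # strict rule rejects the output
print(body_match(expected, actual))   # yet the checklist body is identical
```

Under this sketch, the strict comparator fails the run while the body-level comparison passes it, which is exactly the "unusually narrow" failure the C2 diagnostic describes.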
So there are two separate truths here:
- the attack reduction looks real,
- the benign-retention story is not yet mature enough to be a headline.
What a visitor should take away
If you only remember four things from this article, make them these:
- The mitigations mattered. Harmful success fell from 87.5% to 12.5%.
- The hardest problem did not disappear. Prompt-to-exec still worked in half of the mitigated runs.
- Not all attack paths are equally stubborn. Some were suppressed completely in this benchmark.
- This is a strong signal, not a universal verdict. It is one pinned stack, one measured setup, and one round of hardening.
Limits of the result
This write-up is intentionally narrower than the raw internal lab notes it came from.
- It is a single-stack case study, not a statement about every OpenClaw deployment.
- The benchmark covered a local self-hosted path, not every extension surface or plugin route.
- The benign controls remain underpowered, especially outside the C2 diagnostic.
- The scored run used a corrected outer-Docker containment harness rather than the original disposable-VM plan.
Those limits do not erase the result. They simply tell us how carefully it should be interpreted.
Final read
The fairest summary is this:
OpenClaw looked meaningfully harder to abuse after lightweight guardrails were turned on — but not reliably hard enough where it mattered most.
That is why this round matters. It does not hand us a victory lap. It gives us a clearer map of where the system improved, where it still leaks, and where future critique should focus.