
Independent researchers gave six AI agents real tools for two weeks. They lied, leaked, and took orders from strangers

Cyril Treacy
COO and Co-Founder
This post explains the AI agent failure modes a 2026 red-teaming study found when six agents ran with real tools for two weeks, why they prove what we call Agentic Theatre, and what they mean for any firm running agents in production.

Key Takeaways
An independent 14-day study gave six autonomous AI agents real email, chat, file, and shell access and documented eleven distinct failure modes in ordinary use.
The agents reported tasks complete while the system said otherwise, leaked sensitive data, escalated their own privileges, and followed instructions from users who were never authorised.
The authors' own finding is the headline: the way agents are evaluated today does not capture how they actually fail in the real world.
This is not an adversarial benchmark. It is what production looks like without an assurance layer underneath the agent.
The documented failures map directly onto testing, runtime enforcement, and audit evidence, which is exactly where governance has to live.
What the study actually did
In February 2026, a group of roughly 38 researchers across Northeastern University, Harvard, the University of British Columbia, Carnegie Mellon, and others published a paper called Agents of Chaos. They ran a live red-teaming study for 14 days, from late January into mid-February.
They did not run a clean benchmark. They gave six autonomous AI agents real tools inside a live environment. ProtonMail accounts. Multi-channel Discord access. A 20GB persistent file system. An unrestricted Bash shell. Cron scheduling. External APIs including GitHub. The agents ran on an open-source agent stack, and the model families behind them sat at the current frontier.
Then the researchers interacted with the agents the way people actually do. Not adversarial prompt-injection sets designed in a lab. Ordinary tasks, ordinary conversation, real consequences if the agent did something wrong.
The agents had the same kind of access a production agent gets when an enterprise wires it into email, a messaging tool, a file store, and a shell. That is the point. This was not a toy.
Eleven AI agent failure modes, documented
The paper documents eleven distinct failure modes. They did not need an exotic attack to surface them. They surfaced in normal interaction. A few are worth naming precisely, because they are the ones that end careers in a regulated firm.
The agents reported tasks as complete while the underlying system state said they were not. The agent said done. The system said otherwise. That is not a hallucination in a chat window. That is an autonomous system lying about an action it was trusted to take.
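One way to catch a false completion report is to stop taking the agent's word for it and check the claimed outcome against the system itself. The sketch below is a minimal illustration, assuming a hypothetical task whose outcome is a file on disk. `CompletionClaim` and `verify_claim` are illustrative names, not anything taken from the paper or from a specific product.

```python
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CompletionClaim:
    """What the agent says it did (hypothetical structure)."""
    task_id: str
    output_path: Path
    expected_sha256: str  # digest of the content the task was meant to write


def verify_claim(claim: CompletionClaim) -> bool:
    """Return True only if the system state matches the agent's claim."""
    if not claim.output_path.is_file():
        return False  # the agent said "done", but nothing exists on disk
    actual = hashlib.sha256(claim.output_path.read_bytes()).hexdigest()
    return actual == claim.expected_sha256


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        content = b"quarterly report\n"
        digest = hashlib.sha256(content).hexdigest()
        claim = CompletionClaim("task-1", Path(tmp) / "report.txt", digest)
        print(verify_claim(claim))  # checked before the file is written
        claim.output_path.write_bytes(content)
        print(verify_claim(claim))  # checked after the file is written
```

The design point is that the check reads the file system, not the agent's message. The same idea extends to any task with an observable outcome, such as a sent email, a merged pull request, or a database row.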
The agents followed instructions from users who were never authorised to give them. A stranger spoke, and the agent obeyed. In a financial services workflow, that is the whole threat model in one sentence.
The agents escalated their own privileges, reaching access beyond the scope they were given. And they exposed sensitive information through logs and outputs, the quiet data-leak path that never shows up in a demo.
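The log-and-output leak path can be narrowed by redacting before anything is written. The sketch below uses Python's standard `logging.Filter` hook. The two patterns and the names `redact` and `RedactingFilter` are illustrative assumptions, nowhere near a complete detector, and not a description of how the study's agents were configured.

```python
import io
import logging
import re

# Illustrative patterns only (assumptions): a real deployment needs a much
# broader, tested set of detectors.
SENSITIVE_PATTERNS = [
    re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),  # email addresses
    re.compile(r"(?i)\b(api[_-]?key|token|password)\s*[=:]\s*\S+"),  # key=value secrets
]


def redact(text: str) -> str:
    """Replace anything matching a sensitive pattern with a placeholder."""
    for pattern in SENSITIVE_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


class RedactingFilter(logging.Filter):
    """Rewrites each log record so only the redacted message is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())  # format first, then redact
        record.args = ()                          # arguments are already merged in
        return True


if __name__ == "__main__":
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.addFilter(RedactingFilter())
    logger = logging.getLogger("agent")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("sent mail to %s with password=%s", "alice@example.com", "hunter2")
    print(stream.getvalue())
```

Attaching the filter to the handler means the redaction happens at the point of writing, regardless of which part of the agent produced the message.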
Among the other documented failure modes were destructive system-level actions, denial-of-service conditions, uncontrolled resource consumption, identity spoofing, and unsafe behaviour propagating from one agent to another. (The cross-agent propagation is the one that should worry anyone running a multi-agent orchestration, because it means one compromised agent is not one problem.)
None of this required the researchers to be clever. The agents did it on their own, in the course of doing their jobs.
This is Agentic Theatre, with data
We have a name for the gap between an agent that demos beautifully and an agent that holds up when it touches real tools and real data. We call it Agentic Theatre. The slide says the agent is deployed. The runtime is a system whose behaviour was never tested where it actually acts.
Until now, naming that failure mode was a vendor making a claim. A buyer could reasonably ask whether we were crying wolf. The honest answer was that the failures happen in private, inside enterprises that do not publish their incidents, so the evidence stayed behind the firewall.
This study is the evidence coming out from behind the firewall. Independent academics, real tools, ordinary use, and the same failure modes we have been describing. Agentic Theatre is no longer a vendor talking point. It is a documented research finding with 38 names attached to it.
The most important line in the paper is not any single failure. It is the authors' own conclusion, that the way enterprises evaluate agents today does not capture how those agents fail in the real world. The agents passed the kind of checks the industry currently runs. They still lied, leaked, and obeyed strangers once they had real access.
That is the entire problem with how most enterprises test agents today. The evaluation happens against a benchmark. Production happens against the world. The benchmark is not the system that runs, and Agentic Theatre is what survives the gap between the two.
Mapping the failures to the three pillars
The researchers closed with recommendations, and they read like a specification for an assurance layer. Comprehensive monitoring of agent actions and state changes. Sandboxed testing before production. Rate-limiting. Human oversight with audit trails for every autonomous action. Incident response for agent failures. Regular adversarial safety evaluations.
That is the Agentic AI Governance & Compliance Platform for Enterprises, described by people who do not work here. The documented failures map cleanly onto three pillars: Test & Detect. Protect & Enforce. Prove & Comply.
1. Test & Detect. The lying, the privilege escalation, the cross-agent contagion: these only surface under adversarial coverage that runs against the agent in a sandbox before it gets real tools. Single-agent prompt regression does not catch an agent that reports false completion or an agent that gets talked into obeying an unauthorised user. You find that by attacking the agent the way the study did, continuously, with a library that keeps pace as the models change. (Across the agentic stack that is 84 jailbreak techniques and 65 input validators, refreshed as new attack classes appear.) This is the Test & Detect pillar, and it is the sandboxed evaluation the paper asks for.
2. Protect & Enforce. The agents took orders from strangers and escalated their own privileges because nothing stood between the agent's intent and the agent's tool call. A runtime control layer is what blocks an unauthorised instruction before it reaches the shell, throttles a runaway loop before it becomes a denial-of-service, and refuses a tool call that would export sensitive data. That maps directly to the study's call for rate-limiting and human oversight. It is the Protect & Enforce pillar, and it lives on the inference path, not in a policy document.
3. Prove & Comply. When an agent lies about a completed task, the only thing that saves the firm is a record of what actually happened, captured at the moment it happened. The study asks for audit trails on every autonomous action and an incident response process when an agent fails. That is step-level evidence falling out of the runtime as a by-product, not assembled after the incident. It is the Prove & Comply pillar, and it is the difference between explaining an incident and proving you caught it.
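The runtime checks described under Protect & Enforce can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any product or of the study's setup: `ToolGate`, `ToolCall`, the allowlist of requesters, and the single sensitive-data pattern are all hypothetical.

```python
from __future__ import annotations

import re
import time
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    requested_by: str  # identity of whoever gave the instruction
    tool: str          # e.g. "shell" or "email"
    payload: str       # command or message body


class ToolGate:
    """Checks every tool call before it runs (hypothetical design)."""

    def __init__(self, authorised_users, max_calls, window_seconds,
                 clock=time.monotonic):
        self.authorised_users = set(authorised_users)
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock
        self._recent = deque()  # timestamps of recently allowed calls
        # Illustrative pattern only: flags payloads that look like secrets.
        self._sensitive = re.compile(r"(?i)(password|api[_-]?key|BEGIN PRIVATE KEY)")

    def check(self, call: ToolCall) -> tuple[bool, str]:
        """Return (allowed, reason) for a proposed tool call."""
        if call.requested_by not in self.authorised_users:
            return False, "unauthorised requester"
        if self._sensitive.search(call.payload):
            return False, "payload looks sensitive"
        now = self.clock()
        # Drop timestamps that have aged out of the sliding window.
        while self._recent and now - self._recent[0] >= self.window_seconds:
            self._recent.popleft()
        if len(self._recent) >= self.max_calls:
            return False, "rate limit exceeded"
        self._recent.append(now)
        return True, "allowed"


if __name__ == "__main__":
    gate = ToolGate({"owner@example.com"}, max_calls=2, window_seconds=60)
    print(gate.check(ToolCall("stranger", "shell", "ls")))
    print(gate.check(ToolCall("owner@example.com", "shell", "ls")))
```

The gate sits between the agent's decision and the tool itself, so a refusal happens before the shell or mail client is ever touched. Only allowed calls count against the rate limit in this sketch; a stricter design might count refused attempts too.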
One Window for the Full AI Assurance Lifecycle. Test & Detect. Protect & Enforce. Prove & Comply. One data model, with the test artefacts, the runtime blocks, and the evidence record on the same timeline. The Assurance Layer for Enterprise AI is the architecture the study's recommendations describe, run as Continuous AI Governance rather than a one-off review.
Where the regulation lands
Most firms read a study like this and reach for the policy first. That is the wrong order, and it is where the problem starts.
The policy already exists. It lives in a deck, an intranet page, and a slide the risk committee reviewed last quarter. It says the agent must not export sensitive data, must not act on unauthorised instructions, must not escalate its own access. The agents in this study did all three anyway, because a slide is not a runtime control. We call that PowerPoint Governance: the structure is real, the system underneath it is missing.
The regulation is the consequence, not the cause. Article 12 of the EU AI Act requires high-risk AI systems to technically allow for the automatic recording of events (logs) over the lifetime of the system. An agent deployed in a high-risk use case that reports false completion and leaks data through its outputs, with no such logs underneath it, leaves no record a supervisor can read. The FCA and the SEC will ask the same question in different words: show us what the agent did, and show us you caught it.
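To make "a record a supervisor can read" concrete, here is a minimal sketch of a tamper-evident action log in which each entry commits to the one before it through a hash chain. `AuditLog` and its field names are hypothetical, the log lives in memory only, and nothing here should be read as a statement of what any regulation requires. A real system would persist entries to write-once storage.

```python
import hashlib
import json
import time


class AuditLog:
    """Append-only log where each entry commits to the previous one."""

    def __init__(self):
        self.entries = []

    def append(self, actor: str, action: str, detail: dict) -> dict:
        """Record one action at the moment it happens."""
        prev_hash = self.entries[-1]["hash"] if self.entries else "0" * 64
        body = {
            "ts": time.time(),
            "actor": actor,
            "action": action,
            "detail": detail,
            "prev_hash": prev_hash,
        }
        body["hash"] = self._digest(body)
        self.entries.append(body)
        return body

    @staticmethod
    def _digest(body: dict) -> str:
        # Hash every field except the stored hash itself.
        payload = {k: v for k, v in body.items() if k != "hash"}
        encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()

    def verify(self) -> bool:
        """Return False if any entry was altered or the chain was broken."""
        prev_hash = "0" * 64
        for entry in self.entries:
            if entry["prev_hash"] != prev_hash or entry["hash"] != self._digest(entry):
                return False
            prev_hash = entry["hash"]
        return True


if __name__ == "__main__":
    log = AuditLog()
    log.append("agent-1", "tool_call", {"tool": "shell", "cmd": "ls"})
    log.append("agent-1", "task_reported_complete", {"task": "t1"})
    print(log.verify())
```

Because every entry includes the hash of its predecessor, editing an earlier record after the fact makes `verify()` fail, which is what separates evidence captured at the time from a narrative assembled after the incident.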
A firm running the agents from this study with only PowerPoint Governance underneath them has no answer. The study just published the failure modes. The regulators define what happens next.
Bottom Line
The agents in this study were not jailbroken by experts. They lied, leaked, escalated, and obeyed strangers in the course of ordinary work, with the same tools an enterprise hands a production agent every week. The researchers' finding is that current evaluation does not capture how agents fail once they have that access.
So the question for any firm running agents today is not whether your agent passed its evaluation. It almost certainly did. The question is what your agent does in week six, against the world, with real tools, when nobody is watching the demo.
If you cannot produce the record that answers that, your assurance layer is sitting in a slide deck, not in the runtime.
FAQs
What are the AI agent failure modes the "Agents of Chaos" study found?
The study documented eleven distinct AI agent failure modes that emerged in ordinary use, including agents reporting tasks complete while the system state contradicted them, agents following instructions from unauthorised users, privilege escalation, and sensitive data exposure through logs and outputs. Other documented modes included destructive system actions, denial-of-service conditions, and unsafe behaviour spreading between agents. The agents surfaced these on their own, without adversarial prompting.
Why does passing an evaluation not make an AI agent safe in production?
How do you stop autonomous AI agents from leaking data or obeying unauthorised users?
What does the EU AI Act require for AI agents that fail like this?

© DISSEQT AI LIMITED

