
AI Guardrails Bypass: Model Safety is not Enough
AI Guardrails Bypass: Model Safety is not Enough

Apoorva Kumar
CEO and Founder
This post explains why an AI guardrails bypass can strip the safety alignment in open-weight foundation models in minutes, what that tells enterprises about where AI safety actually lives, and what a real Test & Detect capability looks like at the deployment layer.

Key Takeaways
Researchers have shown that the safety alignment baked into open-weight Meta and Google models can be stripped or bypassed in minutes.
Native foundation model safety is not a stable enterprise control and should not be treated as one in any production deployment.
The relevant question is not whether the base model is safe, but whether the deployed application stays within policy under adversarial input.
Answering that requires continuous Test & Detect at the application layer, running ML validators, jailbreak simulation, and a live vulnerability feed against the live system.
Detection without runtime enforcement and an audit-ready evidence trail does not stop the problem or prove you tried.
How an AI guardrails bypass strips safety alignment from open-weight models
Researchers have demonstrated, on open-weight foundation models in the Meta Llama and Google Gemma families, that fine-tuning on a small adversarial dataset or routing the right sequence of jailbreak prompts strips the safety alignment in minutes. Once stripped, the model will produce harmful, biased, or restricted content on demand. The FT and Irish Times reported on the result. The techniques themselves are not exotic.
Treat this as a category result, not a one-off. The same class of attack has been demonstrated against multiple model families by multiple research groups over more than a year. The cost to reproduce it is low. The defensive value of the original alignment is, at best, a soft layer that a determined adversary can peel away.
For anyone still describing native model safety as a serious enterprise control, the position is untenable. The control is unstable under adversarial input, the control is not auditable in any form a regulator would accept, and the control sits with the lab that trained the weights rather than with the enterprise that ships the system.
This is not a foundation model problem to solve
The instinct, when a result like this surfaces, is to ask what Meta and Google will do about it. That framing misses the deployment reality.
Enterprises do not put raw foundation models into customer journeys. They put applications, agents, and copilots into those journeys. The model sits behind a system prompt, a retrieval layer, a tool-calling layer, and, in regulated workflows, a policy boundary. The relevant safety question is not what the base model would do in isolation. It is what the deployed system does when an adversary attacks it through the front door of the application.
That is a different question, owned by a different team. It belongs to the enterprise that shipped the system, and it cannot be answered by a one-time pre-deployment evaluation, because the attack surface keeps moving as new jailbreaks, new model versions, and new agent behaviours land in production. AI safety, in the enterprise sense, is a property of the running system, not the model. The control layer has to live where the system runs.
What Test & Detect actually means at the deployment layer
Disseqt's Test & Detect pillar is built for this reality. It runs continuous adversarial testing against the live deployed system, on the surface area the system actually presents. Three internal layers sit inside it.
The first is a validator engine. Sixty-five ML-based validators across four families assess outputs for safety, bias, prompt injection signals, and policy violations, sub-50ms against production traffic. They are not LLM-as-judge, which means they do not introduce a second probabilistic system as your safety check.
The second is jailbreak simulation. Eighty-four patterns cover the documented single-turn and multi-turn attack families, including the fine-tuning and prompt-routing classes demonstrated in the recent open-weight research. The library runs against the deployed application, not the base model, so the result describes what the live system actually does under attack.
The third is a live vulnerability database. New attack patterns surface in research, on social platforms, and in the wild every week. The database updates against that flow and feeds the test library, so the testing surface tracks the threat surface rather than freezing at the date of the last evaluation.
Together, those three layers make Test & Detect a continuous capability rather than a procurement artefact. A one-time evaluation against last quarter's jailbreaks is not a defence against this quarter's.
Detection without enforcement and evidence is half a control
Detection on its own answers a narrow question. It tells you the deployed system can be made to misbehave. It does not stop the misbehaviour in production, and it does not give a regulator a record of what you did about it.
That is why Test & Detect sits inside a unified platform with two other pillars. Protect & Enforce is the runtime layer that blocks, rewrites, or escalates outputs that breach policy at the point of inference, so a successful jailbreak attempt is contained rather than logged after the fact. Prove & Comply is the evidence layer that captures time-stamped records of what was tested, what was blocked, and how the system performed against declared controls, in a form auditors and regulators accept.
These three pillars belong on one platform because their failure modes link. A detection result with no enforcement is a known vulnerability customers can still trigger. An enforcement action with no evidence is a control failure with no audit story. An evidence trail with no detection input is a stack of logs that never tested the actual attacks the system faces.
In regulatory language, this is an Article 15 question. EU AI Act Article 15 requires high-risk AI systems to be resilient against attempts by unauthorised third parties to alter their use, outputs, or performance by exploiting system vulnerabilities. Jailbreaks and alignment-stripping attacks are exactly the vulnerability class the article describes.
What enterprise security teams should change this week
Three changes are worth making immediately, without waiting for a quarterly review.
Stop counting foundation model alignment as a control in your AI risk register. List it as a model property rather than a mitigation. The register should reflect what the deployment actually relies on for safety, which is the application layer.
Move adversarial testing from pre-production to continuous. A jailbreak suite run once before launch tells you about the model on that day. The same suite run continuously against the deployed system tells you about the system as it ages, drifts, and faces new attacks.
Connect detection to enforcement and to evidence. If a red-team finding cannot trigger a runtime control, and cannot be reproduced in an audit log a regulator would accept, the finding is not closing the loop. The loop is what gets tested at supervisory review.
Bottom Line
The public demonstration that alignment can be stripped from open-weight models in minutes is a dated, technically credible reference point for a position AI security researchers have held for some time. Native model safety is not a deployment control. The control has to live where the system runs.
That moves AI safety out of the model and into the application layer, where it belongs. The capability that lives at that layer has a name and a shape. Continuous Test & Detect against the deployed system, runtime Protect & Enforce at the point of inference, and Prove & Comply evidence that an assessor can accept.
If your current AI stack relies on the foundation model's own alignment as the safety layer, the Meta and Google result is the warning shot.
We built Disseqt to be the layer that takes over from there.
FAQs
Can foundation model safety alignment survive an AI guardrails bypass and be relied on as an enterprise control?
No. Public research, including the Meta and Google result publicly demonstrated in 2026, shows that the safety alignment in open-weight foundation models can be stripped or bypassed in minutes using fine-tuning or jailbreak prompt techniques. The technique class is repeatable across model families. Treating native alignment as a stable control misrepresents the risk to the board and to regulators.
What is the difference between testing the base model and testing the deployed system?
How often should jailbreak testing run against a production AI system?
Where does Test & Detect sit inside the wider AI assurance stack?

AUTHOR
Apoorva Kumar
CEO and Founder
Apoorva Kumar is Founder and CEO at Disseqt, where he's building the assurance layer for enterprise agentic AI. Previously a Senior Product Manager at Microsoft — leading Teams and SharePoint Premium — and with prior experience at AWS, he's shipped v1.0 AI products at cloud scale
Schedule a quick demo call with our experts
All Systems Operational
© DISSEQT AI LIMITED
All Systems Operational
© DISSEQT AI LIMITED
All Systems Operational
© DISSEQT AI LIMITED

