Home

Partners

Platform

About

Resources

Let's Talk →

Home

Partners

Platform

About

Resources

Let's Talk →

AI Guardrails Bypass in Minutes: Why Foundation Model Safety Is Not Your Safety Layer

Apoorva Kumar

CEO and Co-Founder

May 29, 2026

This post explains why an AI guardrails bypass can strip the safety alignment in open-weight foundation models in minutes, what that tells enterprises about where AI safety actually lives, and what a real Test & Detect capability looks like at the deployment layer.

Key Takeaways

Researchers have shown that the safety alignment baked into open-weight Meta and Google models can be stripped or bypassed in minutes.

Native foundation model safety is not a stable enterprise control and should not be treated as one in any production deployment.

The relevant question is not whether the base model is safe, but whether the deployed application stays within policy under adversarial input.

Answering that requires continuous Test & Detect at the application layer, running ML validators, jailbreak simulation, and a live vulnerability feed against the live system.

Detection without runtime enforcement and an audit-ready evidence trail does not stop the problem or prove you tried.

Key Takeaways

Researchers have shown that the safety alignment baked into open-weight Meta and Google models can be stripped or bypassed in minutes.

Native foundation model safety is not a stable enterprise control and should not be treated as one in any production deployment.

The relevant question is not whether the base model is safe, but whether the deployed application stays within policy under adversarial input.

Answering that requires continuous Test & Detect at the application layer, running ML validators, jailbreak simulation, and a live vulnerability feed against the live system.

Detection without runtime enforcement and an audit-ready evidence trail does not stop the problem or prove you tried.

How an AI guardrails bypass strips safety alignment from open-weight models

Researchers have demonstrated, on open-weight foundation models in the Meta Llama and Google Gemma families, that fine-tuning on a small adversarial dataset or routing the right sequence of jailbreak prompts strips the safety alignment in minutes. Once stripped, the model will produce harmful, biased, or restricted content on demand. The FT and Irish Times reported on the result. The techniques themselves are not exotic.

Treat this as a category result, not a one-off. The same class of attack has been demonstrated against multiple model families by multiple research groups over more than a year. The cost to reproduce it is low. The defensive value of the original alignment is, at best, a soft layer that a determined adversary can peel away.

For anyone still describing native model safety as a serious enterprise control, the position is untenable. The control is unstable under adversarial input, the control is not auditable in any form a regulator would accept, and the control sits with the lab that trained the weights rather than with the enterprise that ships the system.

This is not a foundation model problem to solve

The instinct, when a result like this surfaces, is to ask what Meta and Google will do about it. That framing misses the deployment reality.

Enterprises do not put raw foundation models into customer journeys. They put applications, agents, and copilots into those journeys. The model sits behind a system prompt, a retrieval layer, a tool-calling layer, and, in regulated workflows, a policy boundary. The relevant safety question is not what the base model would do in isolation. It is what the deployed system does when an adversary attacks it through the front door of the application.

That is a different question, owned by a different team. It belongs to the enterprise that shipped the system, and it cannot be answered by a one-time pre-deployment evaluation, because the attack surface keeps moving as new jailbreaks, new model versions, and new agent behaviours land in production. AI safety, in the enterprise sense, is a property of the running system, not the model. The control layer has to live where the system runs.

What Test & Detect actually means at the deployment layer

Disseqt's Test & Detect pillar is built for this reality. It runs continuous adversarial testing against the live deployed system, on the surface area the system actually presents. Three internal layers sit inside it.

The first is a validator engine. Sixty-five ML-based validators across four families assess outputs for safety, bias, prompt injection signals, and policy violations, sub-50ms against production traffic. They are not LLM-as-judge, which means they do not introduce a second probabilistic system as your safety check.

The second is jailbreak simulation. Eighty-four patterns cover the documented single-turn and multi-turn attack families, including the fine-tuning and prompt-routing classes demonstrated in the recent open-weight research. The library runs against the deployed application, not the base model, so the result describes what the live system actually does under attack.

The third is a live vulnerability database. New attack patterns surface in research, on social platforms, and in the wild every week. The database updates against that flow and feeds the test library, so the testing surface tracks the threat surface rather than freezing at the date of the last evaluation.

Together, those three layers make Test & Detect a continuous capability rather than a procurement artefact. A one-time evaluation against last quarter's jailbreaks is not a defence against this quarter's.

Detection without enforcement and evidence is half a control

Detection on its own answers a narrow question. It tells you the deployed system can be made to misbehave. It does not stop the misbehaviour in production, and it does not give a regulator a record of what you did about it.

That is why Test & Detect sits inside a unified platform with two other pillars. Protect & Enforce is the runtime layer that blocks, rewrites, or escalates outputs that breach policy at the point of inference, so a successful jailbreak attempt is contained rather than logged after the fact. Prove & Comply is the evidence layer that captures time-stamped records of what was tested, what was blocked, and how the system performed against declared controls, in a form auditors and regulators accept.

These three pillars belong on one platform because their failure modes link. A detection result with no enforcement is a known vulnerability customers can still trigger. An enforcement action with no evidence is a control failure with no audit story. An evidence trail with no detection input is a stack of logs that never tested the actual attacks the system faces.

In regulatory language, this is an Article 15 question. EU AI Act Article 15 requires high-risk AI systems to be resilient against attempts by unauthorised third parties to alter their use, outputs, or performance by exploiting system vulnerabilities. Jailbreaks and alignment-stripping attacks are exactly the vulnerability class the article describes.

What enterprise security teams should change this week

Three changes are worth making immediately, without waiting for a quarterly review.

Stop counting foundation model alignment as a control in your AI risk register. List it as a model property rather than a mitigation. The register should reflect what the deployment actually relies on for safety, which is the application layer.

Move adversarial testing from pre-production to continuous. A jailbreak suite run once before launch tells you about the model on that day. The same suite run continuously against the deployed system tells you about the system as it ages, drifts, and faces new attacks.

Connect detection to enforcement and to evidence. If a red-team finding cannot trigger a runtime control, and cannot be reproduced in an audit log a regulator would accept, the finding is not closing the loop. The loop is what gets tested at supervisory review.

Bottom Line

The public demonstration that alignment can be stripped from open-weight models in minutes is a dated, technically credible reference point for a position AI security researchers have held for some time. Native model safety is not a deployment control. The control has to live where the system runs.

That moves AI safety out of the model and into the application layer, where it belongs. The capability that lives at that layer has a name and a shape. Continuous Test & Detect against the deployed system, runtime Protect & Enforce at the point of inference, and Prove & Comply evidence that an assessor can accept.

If your current AI stack relies on the foundation model's own alignment as the safety layer, the Meta and Google result is the warning shot.

We built Disseqt to be the layer that takes over from there.

FAQs

Can foundation model safety alignment survive an AI guardrails bypass and be relied on as an enterprise control?

No. Public research, including the Meta and Google result publicly demonstrated in 2026, shows that the safety alignment in open-weight foundation models can be stripped or bypassed in minutes using fine-tuning or jailbreak prompt techniques. The technique class is repeatable across model families. Treating native alignment as a stable control misrepresents the risk to the board and to regulators.

What is the difference between testing the base model and testing the deployed system?

Testing the base model tells you how the model behaves in isolation against a fixed set of prompts. Testing the deployed system tells you how the application, agent, or copilot behaves under adversarial input, with its system prompt, retrieval layer, and tool-calling layer in place. The deployed system is what customers and regulators interact with, so it is the level at which safety has to be evidenced.

How often should jailbreak testing run against a production AI system?

Continuously. New jailbreak patterns surface in research and in the wild every week, and model behaviour drifts as upstream providers update weights and as system prompts are revised. A one-time pre-deployment evaluation describes the system on its launch date. Continuous testing describes the system as it actually runs, which is what an EU AI Act Article 15 robustness assessment will look for.

Where does Test & Detect sit inside the wider AI assurance stack?

Test & Detect is one of three pillars on the Disseqt platform, alongside Protect & Enforce and Prove & Comply. Detection identifies the vulnerabilities and policy breaches in the deployed system. Enforcement blocks or escalates them at runtime. Evidence captures what was tested, what was blocked, and what was reported, in a form regulators accept. The three pillars work as one capability, because removing any one of them leaves a gap an assessor will find.

AUTHOR

Apoorva Kumar

CEO and Co-Founder

Apoorva Kumar is Founder and CEO at Disseqt, where he's building the assurance layer for enterprise agentic AI. Previously Senior Manager of Product Management at Microsoft — leading Teams and SharePoint Premium and at AWS, where he built and shipped severless compute for high-performance workloads

See Disseqt in action
Book a 30-minute walkthrough

Our team will walk you through a live workflow using your own AI environment. No slides. No generic demo. A real walkthrough of how Disseqt fits into your stack.

Book a Demo

See Platform

HOME

PAGES

NEWS

USE CASES

Credit Card Chargeback

Mortgage Underwriting

AP & PR

AI Risk Management BFSI

Insurance Claims

IT Service Desk Automation

Chatbot Trustworthiness

Voice AI Assurance

Automobile Fleet Management

Leadership Assessment

Healthcare Consultation

Autonomous Workflow

GUIDES

AI Governance (Hub)

AI Governance Platform

AI Governance Solutions

AI Governance Framework

AI Governance Tools

AI Governance vs AI Compliance

AI Governance vs GRC

AI Governance Vendors

AI Governance vs Responsible AI

AI Agent Governance

AI Governance Glossary

AI Governance Best Practices

Continuous AI Governance

GUIDES

The Assurance Layer

AI Assurance Lifecycle

OWASP Top 10 for LLM Apps

AI Compliance

All Systems Operational

See Disseqt in action
Book a 30-minute walkthrough

Our team will walk you through a live workflow using your own AI environment. No slides. No generic demo. A real walkthrough of how Disseqt fits into your stack.

Book a Demo

See Platform

See Disseqt in action
Book a 30-minute walkthrough

Our team will walk you through a live workflow using your own AI environment. No slides. No generic demo. A real walkthrough of how Disseqt fits into your stack.

Book a Demo

See Platform