
Advanced Features
Single-Turn Jailbreak Testing
Last Updated on November 2, 2025
Getting Started
Head to your left sidebar and look for the "RED TEAMING & TESTING" section. Click on "Advanced Jailbreaking" and you'll land on a page that gives you two choices. For now, click the card that says "Select Single-turn jailbreak testing."
Step 1: Setting Up Your Test
Give your test a clear name like "Customer Bot Security Test - November 2025." Write a brief description of what you're testing (under 255 characters)—something like "Testing production chatbot for jailbreak vulnerabilities, focusing on content moderation bypass attempts."
Select the LLM you're testing from the dropdown. This should match your production model. Choose your app type (Chatbot, RAG System, AI Agent, or other). If your application is agentic—meaning it uses tools or makes autonomous decisions—toggle that switch on.
Click "Next" when ready.
Step 2: Providing Your Test Prompts
You have two options here, and both work great depending on your needs.
Upload a file if you have specific prompts in mind. Not sure about the format? Click "Download Template" and you'll get a perfectly formatted example showing exactly how to structure your file. Fill it in with your jailbreak attempts and upload the CSV.
The template includes columns for the prompt itself and optionally a category (like "safety," "bias," or "prompt_injection"). Just fill in your prompts, one per row, save the file, and upload it.
Browse Shortlisted Prompts if you want to use pre-built jailbreak techniques. These are based on real-world attack patterns that security researchers have documented. Select the ones relevant to your use case, or grab entire collections.
Once you've added prompts, you'll see them listed with a count. Review quickly, remove any you don't need, then click "Continue."
Step 3: Review Results
The system immediately tests your prompts against your LLM. You'll see cards showing Total Prompts, Successful jailbreaks, Failed attempts, and Success Rate. The status starts as "Pending" and updates as the evaluation runs.
For 50 prompts, expect about 5 minutes. Larger tests (200+ prompts) might take 15-30 minutes. You can navigate away and come back, just hit refresh to update the numbers.
Understanding What You See
When complete, look at your Success Rate first. This is the percentage of jailbreak attempts that bypassed your safety measures. Generally, you want this under 5%. Anything above 10% suggests meaningful vulnerabilities.
The results break down by technique too, showing which specific jailbreak methods worked and which failed. Click "Generated Prompts" to see every detail: the original prompt, the jailbreak variation tested, the technique used, and how your model responded. Export this as CSV to share with your security team.
Fixing What You Find
Found successful jailbreaks? Good, that's why you're testing. Look for patterns first. Did similar approaches keep working? Are certain topics consistently bypassing your filters?
Common fixes include updating your system prompt with clearer boundaries, adding specific content filters for the types of requests that succeeded, or fine-tuning with additional refusal examples. After implementing fixes, run the same test again. Your success rate should drop noticeably.
Related to Advanced Features
