Evaluations

Run Evaluation & Analyze

Understand how your agent actually behaved and where it broke.

Now that you’ve created a Golden dataset, it’s time to actually run it.

Run with Golden Datasets

1) Select Golden Dataset as the source.

2) Pick the LLM config you want to test against (this could be your current model, an alternate model, or your next version before deployment).

3) Hit Run Evaluation.

Disseqt will simulate every Golden prompt, run all active validators, and give you a consolidated performance and risk snapshot. This shows you instantly whether your new model or config is safer, better, worse, or has regressed.
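
Conceptually, this comparison boils down to running the same Golden prompts under two configs and diffing the aggregate results. A minimal sketch of that logic in Python (the record shape, field names, and regression threshold are illustrative assumptions, not Disseqt's internal format):

```python
# Sketch of the comparison an evaluation run performs conceptually.
# Record structure and threshold are illustrative assumptions only.

def pass_rate(results):
    """Fraction of Golden prompts whose validators all passed."""
    return sum(1 for r in results if r["passed"]) / len(results)

def compare_runs(baseline_results, candidate_results, tolerance=0.02):
    """Flag a regression if the candidate config's pass rate drops
    by more than `tolerance` relative to the baseline run."""
    baseline = pass_rate(baseline_results)
    candidate = pass_rate(candidate_results)
    return {
        "baseline_pass_rate": baseline,
        "candidate_pass_rate": candidate,
        "regressed": candidate < baseline - tolerance,
    }

baseline = [{"prompt": "p1", "passed": True}, {"prompt": "p2", "passed": True}]
candidate = [{"prompt": "p1", "passed": True}, {"prompt": "p2", "passed": False}]
print(compare_runs(baseline, candidate))
# {'baseline_pass_rate': 1.0, 'candidate_pass_rate': 0.5, 'regressed': True}
```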

Once the evaluation finishes, you’ll land on the Evaluation Results screen. Here you see your model’s performance across every validator and every prompt you tested.

Analyze Evaluation Results

1) Overview Metrics

Pass %, Fail %, severity distribution, and a trend comparison against the last evaluation.
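
If you export these results for your own reporting, the overview metrics reduce to simple aggregations. A sketch, assuming JSON-like rows with hypothetical field names (not Disseqt's export schema):

```python
from collections import Counter

# Illustrative result rows; "passed" and "severity" field names are assumptions.
results = [
    {"prompt_id": 1, "passed": True,  "severity": None},
    {"prompt_id": 2, "passed": False, "severity": "high"},
    {"prompt_id": 3, "passed": False, "severity": "low"},
]

total = len(results)
pass_pct = 100 * sum(r["passed"] for r in results) / total
fail_pct = 100 - pass_pct
severity_distribution = Counter(r["severity"] for r in results if not r["passed"])

print(f"Pass: {pass_pct:.0f}%  Fail: {fail_pct:.0f}%  Severity: {dict(severity_distribution)}")
# Pass: 33%  Fail: 67%  Severity: {'high': 1, 'low': 1}
```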

2) Validator Breakdown

See which Responsible AI, Security, or Compliance validators raised flags. This shows why something failed, not just that it failed.
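
Offline, the same breakdown is a group-by over the flags each validator raised. A sketch with assumed field names:

```python
from collections import defaultdict

# Hypothetical per-validator flags attached to failed prompts.
failures = [
    {"prompt_id": 2, "validator": "PII Leakage", "category": "Security"},
    {"prompt_id": 3, "validator": "Toxicity", "category": "Responsible AI"},
    {"prompt_id": 7, "validator": "PII Leakage", "category": "Security"},
]

by_validator = defaultdict(list)
for f in failures:
    by_validator[(f["category"], f["validator"])].append(f["prompt_id"])

for (category, validator), prompt_ids in sorted(by_validator.items()):
    print(f"{category} / {validator}: {len(prompt_ids)} flag(s), prompts {prompt_ids}")
```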

3) Prompt Level Deep Dive

Click any failed row to see the prompt, the completion, the expected output, and validator remarks. This helps you isolate issues quickly and know exactly what to fix.
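
Outside the UI, the equivalent deep dive is just iterating over the failed rows. A minimal sketch with an assumed row shape:

```python
# Assumed shape of a failed evaluation row (not Disseqt's actual schema).
failed_rows = [
    {
        "prompt": "Summarise the customer's account history.",
        "completion": "Account 4532-9981-... was opened in 2019 ...",
        "expected": "A summary that contains no raw account numbers.",
        "validator_remarks": "PII Leakage: raw account number found in completion.",
    },
]

for row in failed_rows:
    print("Prompt:   ", row["prompt"])
    print("Got:      ", row["completion"])
    print("Expected: ", row["expected"])
    print("Remarks:  ", row["validator_remarks"])
    print("-" * 60)
```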


© Disseqt AI Product Starter Guide
