Compare

The compare command computes deltas between two evaluation runs for A/B testing.

Run two evaluations and compare them:

agentv eval evals/my-eval.yaml --out before.jsonl
# ... make changes to your agent ...
agentv eval evals/my-eval.yaml --out after.jsonl
agentv compare before.jsonl after.jsonl

Options:

Option           Description
---------------  -----------------------------------------------------------------
--threshold, -t  Score delta threshold for win/loss classification (default: 0.1)
--format, -f     Output format: table (default) or json
--json           Shorthand for --format=json
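
For example, a stricter cutoff combined with JSON output (a hypothetical invocation using the short flags from the table above):

# Treat any delta of 0.05 or more as a win or loss, and emit JSON
agentv compare before.jsonl after.jsonl -t 0.05 -f json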

How it works:

  1. Load Results - reads both JSONL files containing evaluation results
  2. Match by eval_id - pairs results whose eval_id fields match
  3. Compute Deltas - calculates delta = score2 - score1 for each pair
  4. Classify Outcomes (see the sketch after this list):
    • win: delta >= threshold (candidate better)
    • loss: delta <= -threshold (baseline better)
    • tie: |delta| < threshold (no significant difference)
  5. Output Summary - prints a human-readable table or JSON
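
The classification rule in step 4 can be written as a tiny shell function. This is an illustrative sketch of the rule described above, not code taken from the tool:

# Sketch only: classify a score delta against a threshold (bc handles the float math)
classify() {
  local delta=$1 threshold=${2:-0.1}
  if [ "$(echo "$delta >= $threshold" | bc -l)" -eq 1 ]; then
    echo win
  elif [ "$(echo "$delta <= -$threshold" | bc -l)" -eq 1 ]; then
    echo loss
  else
    echo tie
  fi
}

classify 0.2 0.1     # win
classify -0.05 0.1   # tie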

Example output:

Comparing: baseline.jsonl -> candidate.jsonl

Eval ID       Baseline Candidate Delta    Result
------------- -------- --------- -------- --------
safety-check  0.70     0.90      +0.20    win
accuracy-test 0.85     0.80      -0.05    = tie
latency-eval  0.90     0.75      -0.15    loss

Summary: 1 win, 1 loss, 1 tie | Mean delta: +0.000 | Status: neutral

Wins are highlighted green, losses red, and ties gray. Colors are automatically disabled when output is piped or NO_COLOR is set.
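
To force plain output explicitly, for example when saving a report, you can set the NO_COLOR convention mentioned above for a single run:

# Disable colored output for this invocation only
NO_COLOR=1 agentv compare baseline.jsonl candidate.jsonl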

Use --json or --format=json for machine-readable output. Fields use snake_case for Python ecosystem compatibility:

{
  "matched": [
    {
      "eval_id": "case-1",
      "score1": 0.7,
      "score2": 0.9,
      "delta": 0.2,
      "outcome": "win"
    }
  ],
  "unmatched": {
    "file1": 0,
    "file2": 0
  },
  "summary": {
    "total": 2,
    "matched": 1,
    "wins": 1,
    "losses": 0,
    "ties": 0,
    "mean_delta": 0.2
  }
}
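
The JSON shape above lends itself to post-processing with standard tools. For example, listing only the regressed eval IDs (a sketch using jq against the fields shown above):

# Print the eval_id of every matched pair classified as a loss
agentv compare baseline.jsonl candidate.jsonl --json \
  | jq -r '.matched[] | select(.outcome == "loss") | .eval_id'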

Exit codes:

Code  Meaning
----  ----------------------------------------------
0     Candidate is equal or better (mean delta >= 0)
1     Baseline is better (regression detected)

Use exit codes to gate CI pipelines — a non-zero exit signals regression.

Compare different model versions:

# Run baseline evaluation
agentv eval evals/*.yaml --target gpt-4 --out baseline.jsonl
# Run candidate evaluation
agentv eval evals/*.yaml --target gpt-4o --out candidate.jsonl
# Compare results
agentv compare baseline.jsonl candidate.jsonl

Compare before/after prompt changes:

# Run with original prompt
agentv eval evals/*.yaml --out before.jsonl
# Modify prompt, then run again
agentv eval evals/*.yaml --out after.jsonl
# Compare with strict threshold
agentv compare before.jsonl after.jsonl --threshold 0.05

Fail CI if the candidate regresses:

#!/bin/bash
agentv compare baseline.jsonl candidate.jsonl
if [ $? -eq 1 ]; then
  echo "Regression detected! Candidate performs worse than baseline."
  exit 1
fi
echo "Candidate is equal or better than baseline."

Tips:

  • Threshold selection - the default of 0.1 requires a score delta of at least 0.1 (a 10% swing on a 0-1 score scale) before a pair counts as a win or loss. Use a stricter threshold such as 0.05 for critical evaluations.
  • Unmatched results - check the unmatched counts in the JSON output to identify eval cases that exist in only one file.
  • Multiple comparisons - compare a candidate against multiple baselines by running the command once per baseline (see the loop sketch below).
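
A minimal sketch of that last tip, looping a single candidate over several baseline files (the file names here are placeholders):

# Compare the candidate against each baseline in turn
for baseline in baseline-*.jsonl; do
  echo "== $baseline vs candidate.jsonl =="
  agentv compare "$baseline" candidate.jsonl
done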