# Compare
The `compare` command computes deltas between two evaluation runs for A/B testing.
Run two evaluations and compare them:
```bash
agentv eval evals/my-eval.yaml --out before.jsonl
# ... make changes to your agent ...
agentv eval evals/my-eval.yaml --out after.jsonl
agentv compare before.jsonl after.jsonl
```
## Options

| Option | Description |
|---|---|
| `--threshold`, `-t` | Score delta threshold for win/loss classification (default: `0.1`) |
| `--format`, `-f` | Output format: `table` (default) or `json` |
| `--json` | Shorthand for `--format=json` |
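For example, combining these options to use a tighter threshold and machine-readable output:

```bash
agentv compare before.jsonl after.jsonl --threshold 0.05 --json
```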
## How It Works

- **Load Results** — reads both JSONL files containing evaluation results
- **Match by `eval_id`** — pairs results with matching `eval_id` fields
- **Compute Deltas** — calculates `delta = score2 - score1` for each pair
- **Classify Outcomes** (see the sketch below):
  - `win`: `delta >= threshold` (candidate better)
  - `loss`: `delta <= -threshold` (baseline better)
  - `tie`: `|delta| < threshold` (no significant difference)
- **Output Summary** — human-readable table or JSON
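The classification rule can be restated as a small shell sketch. This only mirrors the thresholds above for illustration; it is not the tool's implementation, and it assumes plain decimal scores and uses `bc` for floating-point comparison:

```bash
#!/usr/bin/env bash
# Sketch of the win/loss/tie rule; threshold defaults to 0.1 as in the CLI.
classify() {
  local delta=$1 threshold=${2:-0.1}
  if [ "$(echo "$delta >= $threshold" | bc -l)" -eq 1 ]; then
    echo "win"
  elif [ "$(echo "$delta <= -$threshold" | bc -l)" -eq 1 ]; then
    echo "loss"
  else
    echo "tie"
  fi
}

classify 0.20   # win  (delta >= threshold)
classify -0.05  # tie  (|delta| < threshold)
classify -0.15  # loss (delta <= -threshold)
```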
## Output Formats

### Table Format (default)

```
Comparing: baseline.jsonl -> candidate.jsonl

Eval ID        Baseline  Candidate  Delta     Result
-------------  --------  ---------  --------  --------
safety-check   0.70      0.90       +0.20     win
accuracy-test  0.85      0.80       -0.05     tie
latency-eval   0.90      0.75       -0.15     loss

Summary: 1 win, 1 loss, 1 tie | Mean delta: +0.000 | Status: neutral
```

Wins are highlighted green, losses red, and ties gray. Colors are automatically disabled when output is piped or `NO_COLOR` is set.
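To capture the table as a plain-text artifact (for example in CI logs), you can also disable colors explicitly; the output file name here is illustrative:

```bash
NO_COLOR=1 agentv compare baseline.jsonl candidate.jsonl | tee compare-report.txt
```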
### JSON Format

Use `--json` or `--format=json` for machine-readable output. Fields use snake_case for Python ecosystem compatibility:

```json
{
  "matched": [
    {
      "eval_id": "case-1",
      "score1": 0.7,
      "score2": 0.9,
      "delta": 0.2,
      "outcome": "win"
    }
  ],
  "unmatched": { "file1": 0, "file2": 0 },
  "summary": {
    "total": 2,
    "matched": 1,
    "wins": 1,
    "losses": 0,
    "ties": 0,
    "mean_delta": 0.2
  }
}
```
## Exit Codes

| Code | Meaning |
|---|---|
| `0` | Candidate is equal or better (mean delta >= 0) |
| `1` | Baseline is better (regression detected) |
Use exit codes to gate CI pipelines — a non-zero exit signals regression.
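For example, with `set -e` a pipeline script aborts as soon as the comparison exits non-zero; the final message here is only a placeholder for later steps:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Exits 1 on regression, which stops the script before anything else runs
agentv compare baseline.jsonl candidate.jsonl
echo "No regression detected, continuing"
```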
## Workflow Examples

### Model Comparison

Compare different model versions:

```bash
# Run baseline evaluation
agentv eval evals/*.yaml --target gpt-4 --out baseline.jsonl

# Run candidate evaluation
agentv eval evals/*.yaml --target gpt-4o --out candidate.jsonl

# Compare results
agentv compare baseline.jsonl candidate.jsonl
```
### Prompt Optimization

Compare before/after prompt changes:

```bash
# Run with original prompt
agentv eval evals/*.yaml --out before.jsonl

# Modify prompt, then run again
agentv eval evals/*.yaml --out after.jsonl

# Compare with strict threshold
agentv compare before.jsonl after.jsonl --threshold 0.05
```
### CI Quality Gate

Fail CI if the candidate regresses:

```bash
#!/bin/bash
agentv compare baseline.jsonl candidate.jsonl
if [ $? -eq 1 ]; then
  echo "Regression detected! Candidate performs worse than baseline."
  exit 1
fi
echo "Candidate is equal or better than baseline."
```

## Tips

- **Threshold selection** — the default 0.1 means a score delta of at least 0.1 is required for a win or loss. Use stricter thresholds (e.g. 0.05) for critical evaluations.
- **Unmatched results** — check `unmatched` counts in the JSON output to identify eval cases that only exist in one file (see the example after this list).
- **Multiple comparisons** — compare against multiple baselines by running the command multiple times.
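A quick way to check those counts, using the `unmatched` field from the JSON output documented above (again with `jq`, which is not part of agentv):

```bash
# Non-zero counts mean some eval_ids appear in only one of the two files
agentv compare baseline.jsonl candidate.jsonl --json | jq '.unmatched'
```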