Compare

The compare command computes deltas between two evaluation runs for A/B testing.

Run two evaluations and compare them:

agentv eval evals/my-eval.yaml --out before.jsonl
# ... make changes to your agent ...
agentv eval evals/my-eval.yaml --out after.jsonl
agentv compare before.jsonl after.jsonl

Options:

Option           Description
---------------  -----------------------------------------------------------------
--threshold, -t  Score delta threshold for win/loss classification (default: 0.1)
--format, -f     Output format: table (default) or json
--json           Shorthand for --format=json
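
For example, a stricter cutoff combined with JSON output (a hypothetical invocation using the short flags from the table above):

# Treat any delta of 0.05 or more as a win or loss, and emit JSON
agentv compare before.jsonl after.jsonl -t 0.05 -f json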

How it works:

  1. Load Results - reads both JSONL files containing evaluation results
  2. Match by eval_id - pairs results whose eval_id fields match
  3. Compute Deltas - calculates delta = score2 - score1 for each pair
  4. Classify Outcomes (see the sketch after this list):
    • win: delta >= threshold (candidate better)
    • loss: delta <= -threshold (baseline better)
    • tie: |delta| < threshold (no significant difference)
  5. Output Summary - prints a human-readable table or JSON
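
The classification rule in step 4 can be written as a tiny shell function. This is an illustrative sketch of the rule described above, not code taken from the tool:

# Sketch only: classify a score delta against a threshold (bc handles the float math)
classify() {
  local delta=$1 threshold=${2:-0.1}
  if [ "$(echo "$delta >= $threshold" | bc -l)" -eq 1 ]; then
    echo win
  elif [ "$(echo "$delta <= -$threshold" | bc -l)" -eq 1 ]; then
    echo loss
  else
    echo tie
  fi
}

classify 0.2 0.1     # win
classify -0.05 0.1   # tie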

Example output:

Comparing: baseline.jsonl -> candidate.jsonl

Eval ID       Baseline Candidate Delta    Result
------------- -------- --------- -------- --------
safety-check  0.70     0.90      +0.20    win
accuracy-test 0.85     0.80      -0.05    = tie
latency-eval  0.90     0.75      -0.15    loss

Summary: 1 win, 1 loss, 1 tie | Mean delta: +0.000 | Status: neutral

Wins are highlighted green, losses red, and ties gray. Colors are automatically disabled when output is piped or NO_COLOR is set.
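
To force plain output explicitly, for example when saving a report, you can set the NO_COLOR convention mentioned above for a single run:

# Disable colored output for this invocation only
NO_COLOR=1 agentv compare baseline.jsonl candidate.jsonl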

Use --json or --format=json for machine-readable output. Fields use snake_case for Python ecosystem compatibility:

{
  "matched": [
    {
      "eval_id": "case-1",
      "score1": 0.7,
      "score2": 0.9,
      "delta": 0.2,
      "outcome": "win"
    }
  ],
  "unmatched": {
    "file1": 0,
    "file2": 0
  },
  "summary": {
    "total": 2,
    "matched": 1,
    "wins": 1,
    "losses": 0,
    "ties": 0,
    "mean_delta": 0.2
  }
}
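
The JSON shape above lends itself to post-processing with standard tools. For example, listing only the regressed eval IDs (a sketch using jq against the fields shown above):

# Print the eval_id of every matched pair classified as a loss
agentv compare baseline.jsonl candidate.jsonl --json \
  | jq -r '.matched[] | select(.outcome == "loss") | .eval_id'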

Exit codes:

Code  Meaning
----  ----------------------------------------------
0     Candidate is equal or better (mean delta >= 0)
1     Baseline is better (regression detected)

Use exit codes to gate CI pipelines — a non-zero exit signals regression.

Compare different model versions:

# Run baseline evaluation
agentv eval evals/*.yaml --target gpt-4 --out baseline.jsonl
# Run candidate evaluation
agentv eval evals/*.yaml --target gpt-4o --out candidate.jsonl
# Compare results
agentv compare baseline.jsonl candidate.jsonl

Compare before/after prompt changes:

# Run with original prompt
agentv eval evals/*.yaml --out before.jsonl
# Modify prompt, then run again
agentv eval evals/*.yaml --out after.jsonl
# Compare with strict threshold
agentv compare before.jsonl after.jsonl --threshold 0.05

Fail CI if the candidate regresses:

#!/bin/bash
agentv compare baseline.jsonl candidate.jsonl
if [ $? -eq 1 ]; then
  echo "Regression detected! Candidate performs worse than baseline."
  exit 1
fi
echo "Candidate is equal or better than baseline."

Tips:

  • Threshold selection - the default of 0.1 requires a score delta of at least 0.1 (a 10% swing on a 0-1 score scale) before a pair counts as a win or loss. Use a stricter threshold such as 0.05 for critical evaluations.
  • Unmatched results - check the unmatched counts in the JSON output to identify eval cases that exist in only one file.
  • Multiple comparisons - compare a candidate against multiple baselines by running the command once per baseline (see the loop sketch below).
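
A minimal sketch of that last tip, looping a single candidate over several baseline files (the file names here are placeholders):

# Compare the candidate against each baseline in turn
for baseline in baseline-*.jsonl; do
  echo "== $baseline vs candidate.jsonl =="
  agentv compare "$baseline" candidate.jsonl
done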