# Batch CLI Evaluation
Batch CLI evaluation handles tools that process multiple inputs at once — bulk classifiers, screening engines, or any runner that reads all eval cases and outputs results in one pass.
## Overview

Use batch CLI evaluation when:

- An external tool processes multiple inputs in a single invocation (e.g., AML screening, bulk classification)
- The runner reads the eval YAML directly to extract all eval cases
- Output is JSONL with records keyed by eval case `id`
- Each eval case has its own evaluator to validate its corresponding output record
## Execution Flow

1. AgentV invokes the batch runner once, passing `--eval <yaml-path>` and `--output <jsonl-path>`
2. The batch runner reads the eval YAML, extracts all eval cases, processes them, and writes JSONL output keyed by `id`
3. AgentV parses the JSONL and routes each record to its matching eval case by `id`
4. Per-case evaluators validate the output for each eval case independently
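The runner side of this flow can be sketched as a pure function from parsed eval cases to JSONL lines. This is only an illustration: `EvalCase`, `processCase`, and the amount-based decision rule are hypothetical stand-ins for your tool's real input shape and logic.

```typescript
// Hypothetical minimal shape of an eval case after parsing the eval YAML.
type EvalCase = { id: string; amount: number };

// Stand-in for the real screening/classification logic (assumed rule:
// large amounts get flagged for review).
function processCase(evalCase: EvalCase): { decision: string } {
  return { decision: evalCase.amount >= 10000 ? 'REVIEW' : 'CLEAR' };
}

// One JSONL line per eval case, keyed by `id` so output can be routed
// back to the matching eval case.
function runBatch(cases: EvalCase[]): string {
  return cases
    .map((c) => JSON.stringify({ id: c.id, text: JSON.stringify(processCase(c)) }))
    .join('\n');
}
```

The key invariant is that every output line carries the `id` of the eval case it answers; everything else about the record shape is up to the runner.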
## Eval File Structure

```yaml
description: Batch CLI demo using structured input_messages
execution:
  target: batch_cli

evalcases:
  - id: case-001
    expected_outcome: |-
      Batch runner returns JSON with decision=CLEAR.
    expected_messages:
      - role: assistant
        content:
          decision: CLEAR
    input_messages:
      - role: system
        content: You are a batch processor.
      - role: user
        content:
          request:
            type: screening_check
            jurisdiction: AU
            row:
              id: case-001
              name: Example A
              amount: 5000
    execution:
      evaluators:
        - name: decision-check
          type: code_judge
          script: bun run ./scripts/check-output.ts
          cwd: .
  - id: case-002
    expected_outcome: |-
      Batch runner returns JSON with decision=REVIEW.
    expected_messages:
      - role: assistant
        content:
          decision: REVIEW
    input_messages:
      - role: system
        content: You are a batch processor.
      - role: user
        content:
          request:
            type: screening_check
            jurisdiction: AU
            row:
              id: case-002
              name: Example B
              amount: 25000
    execution:
      evaluators:
        - name: decision-check
          type: code_judge
          script: bun run ./scripts/check-output.ts
          cwd: .
```

## Batch Runner Contract

The batch runner reads the eval YAML directly and processes all eval cases in one invocation.
The runner receives the eval file path via `--eval` and an output path via `--output`:

```sh
bun run batch-runner.ts --eval ./my-eval.yaml --output ./results.jsonl
```

## Output

JSONL where each line is a JSON object with an `id` matching an eval case:

```jsonl
{"id": "case-001", "text": "{\"decision\": \"CLEAR\", ...}"}
{"id": "case-002", "text": "{\"decision\": \"REVIEW\", ...}"}
```

The `id` field must match the eval case `id` for AgentV to route output to the correct evaluator.
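The routing step can be pictured as indexing the JSONL records by `id`. This is an illustrative sketch of the idea, not AgentV's actual internals:

```typescript
// Minimal record shape per the JSONL contract: an `id` plus the runner's
// payload under `text`.
type BatchRecord = { id: string; text: string };

// Parse JSONL output and index each record by its `id` so it can be
// handed to the matching eval case's evaluator.
function routeById(jsonl: string): Map<string, BatchRecord> {
  const byId = new Map<string, BatchRecord>();
  for (const line of jsonl.split('\n')) {
    if (!line.trim()) continue; // tolerate a trailing newline
    const record = JSON.parse(line) as BatchRecord;
    if (!record.id) throw new Error(`record missing id: ${line}`);
    byId.set(record.id, record);
  }
  return byId;
}
```

A record whose `id` matches no eval case simply has nowhere to route, which is why unique, stable IDs matter on both sides.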
## Output with Tool Trajectory

To enable `tool_trajectory` evaluation, include `output_messages` with `tool_calls`:

```json
{
  "id": "case-001",
  "text": "{\"decision\": \"CLEAR\", ...}",
  "output_messages": [
    {
      "role": "assistant",
      "tool_calls": [
        {
          "tool": "screening_check",
          "input": { "origin_country": "NZ", "amount": 5000 },
          "output": { "decision": "CLEAR", "reasons": [] }
        }
      ]
    },
    {
      "role": "assistant",
      "content": { "decision": "CLEAR" }
    }
  ]
}
```

AgentV extracts tool calls directly from `output_messages[].tool_calls[]` for `tool_trajectory` evaluators.
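The extraction amounts to flattening `tool_calls` arrays out of the message list. A sketch using the field names from the record above (the types mirror that example, not a published AgentV schema):

```typescript
// Shapes inferred from the JSONL example above.
type ToolCall = { tool: string; input: unknown; output: unknown };
type OutputMessage = { role: string; tool_calls?: ToolCall[]; content?: unknown };

// Flatten every message's tool_calls into one ordered trajectory;
// messages without tool_calls contribute nothing.
function extractToolCalls(messages: OutputMessage[]): ToolCall[] {
  return messages.flatMap((m) => m.tool_calls ?? []);
}
```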
## Evaluator Implementation

Each eval case has its own evaluator that validates the batch runner output. The evaluator receives the standard code_judge input via stdin.

Input (stdin):

```json
{
  "candidate_answer": "{\"id\":\"case-001\",\"decision\":\"CLEAR\",...}",
  "expected_messages": [{ "role": "assistant", "content": { "decision": "CLEAR" } }],
  "input_messages": [...]
}
```

Output (stdout):

```json
{
  "score": 1.0,
  "hits": ["decision matches: CLEAR"],
  "misses": [],
  "reasoning": "Batch runner decision matches expected."
}
```

## Example Evaluator
Section titled “Example Evaluator”import fs from 'node:fs';
type EvalInput = { candidate_answer?: string; expected_messages?: Array<{ role: string; content: unknown }>;};
function main() { const stdin = fs.readFileSync(0, 'utf8'); const input = JSON.parse(stdin) as EvalInput;
const expectedDecision = findExpectedDecision(input.expected_messages);
let candidateDecision: string | undefined; try { const parsed = JSON.parse(input.candidate_answer ?? ''); candidateDecision = parsed.decision; } catch { candidateDecision = undefined; }
const hits: string[] = []; const misses: string[] = [];
if (expectedDecision === candidateDecision) { hits.push(`decision matches: ${expectedDecision}`); } else { misses.push(`mismatch: expected=${expectedDecision} actual=${candidateDecision}`); }
const score = misses.length === 0 ? 1 : 0;
process.stdout.write(JSON.stringify({ score, hits, misses, reasoning: score === 1 ? 'Batch runner output matches expected.' : 'Batch runner output did not match expected.', }));}
function findExpectedDecision(messages?: Array<{ role: string; content: unknown }>) { if (!messages) return undefined; for (const msg of messages) { if (typeof msg.content === 'object' && msg.content !== null) { return (msg.content as Record<string, unknown>).decision as string; } } return undefined;}
main();Structured Content
Use structured objects in `expected_messages.content` to define expected output fields for easy validation:

```yaml
expected_messages:
  - role: assistant
    content:
      decision: CLEAR
      confidence: high
      reasons: []
```

The evaluator extracts these fields and compares them against the parsed candidate output.
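That comparison can be generalized to any set of expected fields. A minimal sketch (one possible approach, not a built-in AgentV helper), using JSON serialization for deep equality:

```typescript
// Compare each expected field against the parsed candidate output,
// accumulating hits and misses in the code_judge output format.
function compareFields(
  expected: Record<string, unknown>,
  actual: Record<string, unknown>,
): { hits: string[]; misses: string[] } {
  const hits: string[] = [];
  const misses: string[] = [];
  for (const [key, want] of Object.entries(expected)) {
    const got = actual[key];
    // JSON round-trip as a cheap deep-equality check; fine for plain data.
    if (JSON.stringify(got) === JSON.stringify(want)) {
      hits.push(`${key} matches: ${JSON.stringify(want)}`);
    } else {
      misses.push(`${key} mismatch: expected=${JSON.stringify(want)} actual=${JSON.stringify(got)}`);
    }
  }
  return { hits, misses };
}
```

Only keys present in the expected content are checked, so extra fields in the candidate output are ignored rather than penalized.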
## Target Configuration

Configure the batch CLI provider in your targets file or eval file:

```yaml
# In agentv-targets.yaml or eval file
targets:
  batch_cli:
    provider: cli
    commandTemplate: bun run ./scripts/batch-runner.ts --eval {EVAL_FILE} --output {OUTPUT_FILE}
    provider_batching: true
```

Key settings:
| Setting | Description |
|---|---|
| `provider: cli` | Use the CLI provider |
| `provider_batching: true` | Run once for all eval cases instead of per-case |
| `{EVAL_FILE}` | Placeholder replaced with the eval file path |
| `{OUTPUT_FILE}` | Placeholder replaced with the JSONL output path |
## Best Practices

- Use unique eval case IDs — the batch runner and AgentV use `id` to route outputs to the correct evaluator
- Structured input_messages — put structured data in `user.content` for the runner to extract
- Structured expected_messages — define expected output as objects for easy comparison
- Deterministic runners — batch runners should produce consistent output for reliable testing
- Healthcheck support — add a `--healthcheck` flag for runner validation:

```ts
if (args.includes('--healthcheck')) {
  console.log('batch-runner: healthy');
  return;
}
```