# Code Judges
Code judges are scripts that evaluate agent responses deterministically. Write them in any language: Python, TypeScript (Node or Bun), or anything else that runs as an executable.
## Contract

Code judges communicate via stdin/stdout JSON:
Input (stdin):
{ "question": "What is 15 + 27?", "expected_outcome": "Correctly calculates 15 + 27 = 42", "candidate_answer": "The answer is 42.", "reference_answer": "42", "sidecar": {}}Output (stdout):
{ "score": 1.0, "hits": ["Answer contains correct value (42)"], "misses": [], "reasoning": "Passed 1 check(s)"}| Output Field | Type | Description |
|---|---|---|
score | number | 0.0 to 1.0 |
hits | string[] | Criteria that passed |
misses | string[] | Criteria that failed |
reasoning | string | Explanation of the score |
## Python Example

```python
import json, sys

data = json.load(sys.stdin)
candidate_answer = data.get("candidate_answer", "")

hits = []
misses = []

if "42" in candidate_answer:
    hits.append("Answer contains correct value (42)")
else:
    misses.append("Answer does not contain expected value (42)")

score = 1.0 if hits else 0.0

print(json.dumps({
    "score": score,
    "hits": hits,
    "misses": misses,
    "reasoning": f"Passed {len(hits)} check(s)"
}))
```

## TypeScript Example
```typescript
import { readFileSync } from "fs";

const data = JSON.parse(readFileSync("/dev/stdin", "utf-8"));
const candidateAnswer: string = data.candidate_answer ?? "";

const hits: string[] = [];
const misses: string[] = [];

if (candidateAnswer.includes("42")) {
  hits.push("Answer contains correct value (42)");
} else {
  misses.push("Answer does not contain expected value (42)");
}

console.log(JSON.stringify({
  score: hits.length > 0 ? 1.0 : 0.0,
  hits,
  misses,
  reasoning: `Passed ${hits.length} check(s)`,
}));
```

## Referencing in Eval Files
```yaml
execution:
  evaluators:
    - name: my_validator
      type: code_judge
      script: ./validators/check_answer.py
```

## @agentv/eval SDK
The `@agentv/eval` package provides a declarative API with automatic stdin/stdout handling. Use `defineCodeJudge` to skip the boilerplate:
```typescript
#!/usr/bin/env bun
import { defineCodeJudge } from '@agentv/eval';

export default defineCodeJudge(({ candidateAnswer, expectedOutcome }) => {
  const hits: string[] = [];
  const misses: string[] = [];

  if (candidateAnswer.includes(expectedOutcome)) {
    hits.push('Answer matches expected outcome');
  } else {
    misses.push('Answer does not match expected outcome');
  }

  const total = hits.length + misses.length;
  return {
    score: total === 0 ? 0 : hits.length / total,
    hits,
    misses,
    reasoning: `Passed ${hits.length}/${total} checks`,
  };
});
```

SDK exports: `defineCodeJudge`, `Message`, `ToolCall`, `TraceSummary`, `CodeJudgeInput`, `CodeJudgeResult`
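The exported types can also annotate a judge written as a standalone function. A minimal sketch, assuming `CodeJudgeInput` and `CodeJudgeResult` describe the callback's argument and return value; the check itself is only a placeholder:

```typescript
#!/usr/bin/env bun
import { defineCodeJudge } from '@agentv/eval';
import type { CodeJudgeInput, CodeJudgeResult } from '@agentv/eval';

// Hypothetical standalone judge function; the explicit types are assumed to
// match what defineCodeJudge passes in and expects back.
function judge(input: CodeJudgeInput): CodeJudgeResult {
  const answered = (input.candidateAnswer ?? '').trim().length > 0;
  return {
    score: answered ? 1.0 : 0.0,
    reasoning: answered ? 'Answer is non-empty' : 'Answer is empty',
  };
}

export default defineCodeJudge(judge);
```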
## Target Access

Code judges can call an LLM through a target proxy for metrics that require multiple LLM calls (contextual precision, semantic similarity, etc.).
### Configuration

Add a `target` block to the evaluator config:

```yaml
evaluators:
  - name: contextual-precision
    type: code_judge
    script: bun scripts/contextual-precision.ts
    target:
      max_calls: 10 # Default: 50
```

Use `createTargetClient` from the SDK:
```typescript
#!/usr/bin/env bun
import { createTargetClient, defineCodeJudge } from '@agentv/eval';

export default defineCodeJudge(async ({ question, candidateAnswer }) => {
  const target = createTargetClient();
  if (!target) return { score: 0, misses: ['Target not configured'] };

  const response = await target.invoke({
    question: `Is this relevant to: ${question}? Response: ${candidateAnswer}`,
    systemPrompt: 'Respond with JSON: { "relevant": true/false }'
  });

  const result = JSON.parse(response.rawText ?? '{}');
  return { score: result.relevant ? 1.0 : 0.0 };
});
```

Use `target.invokeBatch(requests)` for multiple calls in parallel.
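For example, a claim-level relevance check can fan out one call per claim. A rough sketch, assuming `invokeBatch` accepts the same request objects as `invoke` and resolves to responses in the same order; the sentence-splitting heuristic is purely illustrative:

```typescript
#!/usr/bin/env bun
import { createTargetClient, defineCodeJudge } from '@agentv/eval';

export default defineCodeJudge(async ({ question, candidateAnswer }) => {
  const target = createTargetClient();
  if (!target) return { score: 0, misses: ['Target not configured'] };

  // Naive claim split, for illustration only.
  const claims = candidateAnswer.split('. ').filter((c) => c.trim().length > 0);
  if (claims.length === 0) return { score: 0, misses: ['No claims to check'] };

  // Assumed: one response per request, in order, each exposing rawText.
  const responses = await target.invokeBatch(
    claims.map((claim) => ({
      question: `Does this claim help answer "${question}"? Claim: ${claim}`,
      systemPrompt: 'Respond with JSON: { "supported": true/false }',
    })),
  );

  const supported = responses.filter(
    (r) => JSON.parse(r.rawText ?? '{}').supported === true,
  ).length;

  return {
    score: supported / claims.length,
    reasoning: `${supported}/${claims.length} claims supported`,
  };
});
```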
Environment variables (set automatically when target is configured):
| Variable | Description |
|---|---|
| `AGENTV_TARGET_PROXY_URL` | Local proxy URL |
| `AGENTV_TARGET_PROXY_TOKEN` | Bearer token for authentication |
## Advanced Input Fields

Beyond the basic `question`, `expected_outcome`, `candidate_answer`, and `reference_answer` fields, code judges receive additional context:
| Field | Type | Description |
|---|---|---|
| `guideline_files` | string[] | Paths to guideline files referenced in the eval |
| `input_files` | string[] | Paths to input files referenced in the eval |
| `input_messages` | Message[] | Full resolved input message array |
| `expected_messages` | Message[] | Expected agent behavior, including tool calls |
| `output_messages` | Message[] | Actual agent execution trace with tool calls |
| `trace_summary` | TraceSummary | Lightweight execution metrics |
### trace_summary structure

```json
{
  "event_count": 5,
  "tool_names": ["fetch", "search"],
  "tool_calls_by_name": { "search": 2, "fetch": 1 },
  "error_count": 0,
  "token_usage": { "input": 1000, "output": 500 },
  "cost_usd": 0.0015,
  "duration_ms": 3500
}
```

Use `expected_messages` for retrieval context in RAG evals (tool calls with outputs) and `output_messages` for the actual agent execution trace from live runs.
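As a sketch of how these fields can drive scoring, the judge below reads `trace_summary` from stdin (per the contract above) and gates on tool usage. The required tool name (`search`) and the no-errors requirement are illustrative assumptions, not part of the contract:

```typescript
import { readFileSync } from "fs";

const data = JSON.parse(readFileSync("/dev/stdin", "utf-8"));
// trace_summary may be absent when no execution trace is available.
const trace = data.trace_summary ?? {};
const toolCalls: Record<string, number> = trace.tool_calls_by_name ?? {};

const hits: string[] = [];
const misses: string[] = [];

// Illustrative requirement: the agent must have called the "search" tool.
if ((toolCalls["search"] ?? 0) > 0) {
  hits.push("Agent called the search tool");
} else {
  misses.push("Agent never called the search tool");
}

// Illustrative requirement: the trace finished without errors.
if ((trace.error_count ?? 0) === 0) {
  hits.push("No errors during execution");
} else {
  misses.push(`${trace.error_count} error(s) during execution`);
}

const total = hits.length + misses.length;
console.log(JSON.stringify({
  score: hits.length / total,
  hits,
  misses,
  reasoning: `Passed ${hits.length}/${total} checks`,
}));
```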
## Testing Locally

Test a code judge by piping JSON to stdin:

```bash
echo '{"question":"What is 2+2?","expected_outcome":"4","candidate_answer":"4","reference_answer":"4","sidecar":{}}' | python validators/check_answer.py
```