
Code Judges

Code judges are scripts that evaluate agent responses deterministically. Write them in any language: Python, TypeScript, Node, or anything else that runs as an executable.

Code judges communicate via stdin/stdout JSON:

Input (stdin):

{
  "question": "What is 15 + 27?",
  "expected_outcome": "Correctly calculates 15 + 27 = 42",
  "candidate_answer": "The answer is 42.",
  "reference_answer": "42",
  "sidecar": {}
}

Output (stdout):

{
  "score": 1.0,
  "hits": ["Answer contains correct value (42)"],
  "misses": [],
  "reasoning": "Passed 1 check(s)"
}
Output fields:

| Field | Type | Description |
|-------|------|-------------|
| score | number | 0.0 to 1.0 |
| hits | string[] | Criteria that passed |
| misses | string[] | Criteria that failed |
| reasoning | string | Explanation of the score |
validators/check_answer.py
import json, sys

data = json.load(sys.stdin)
candidate_answer = data.get("candidate_answer", "")

hits = []
misses = []

if "42" in candidate_answer:
    hits.append("Answer contains correct value (42)")
else:
    misses.append("Answer does not contain expected value (42)")

score = 1.0 if hits else 0.0

print(json.dumps({
    "score": score,
    "hits": hits,
    "misses": misses,
    "reasoning": f"Passed {len(hits)} check(s)"
}))
validators/check_answer.ts
import { readFileSync } from "fs";

const data = JSON.parse(readFileSync("/dev/stdin", "utf-8"));
const candidateAnswer: string = data.candidate_answer ?? "";

const hits: string[] = [];
const misses: string[] = [];

if (candidateAnswer.includes("42")) {
  hits.push("Answer contains correct value (42)");
} else {
  misses.push("Answer does not contain expected value (42)");
}

console.log(JSON.stringify({
  score: hits.length > 0 ? 1.0 : 0.0,
  hits,
  misses,
  reasoning: `Passed ${hits.length} check(s)`,
}));
Register the script as a code_judge evaluator in the eval configuration:

execution:
  evaluators:
    - name: my_validator
      type: code_judge
      script: ./validators/check_answer.py
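The TypeScript version can be registered the same way by invoking it through bun, as the later examples do. A sketch; the evaluator name is illustrative:

execution:
  evaluators:
    - name: my_validator_ts
      type: code_judge
      script: bun ./validators/check_answer.ts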

The @agentv/eval package provides a declarative API with automatic stdin/stdout handling. Use defineCodeJudge to skip boilerplate:

#!/usr/bin/env bun
import { defineCodeJudge } from '@agentv/eval';

export default defineCodeJudge(({ candidateAnswer, expectedOutcome }) => {
  const hits: string[] = [];
  const misses: string[] = [];

  if (candidateAnswer.includes(expectedOutcome)) {
    hits.push('Answer matches expected outcome');
  } else {
    misses.push('Answer does not match expected outcome');
  }

  const total = hits.length + misses.length;
  return {
    score: total === 0 ? 0 : hits.length / total,
    hits,
    misses,
    reasoning: `Passed ${hits.length}/${total} checks`,
  };
});

SDK exports: defineCodeJudge, Message, ToolCall, TraceSummary, CodeJudgeInput, CodeJudgeResult
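The exported types can also annotate a judge explicitly. A minimal sketch, assuming CodeJudgeInput exposes the documented input fields in camelCase (the referenceAnswer field is an assumption here) and CodeJudgeResult matches the score/hits/misses/reasoning output schema shown above:

#!/usr/bin/env bun
import { defineCodeJudge, type CodeJudgeInput, type CodeJudgeResult } from '@agentv/eval';

// Assumes CodeJudgeInput mirrors the stdin fields in camelCase and
// CodeJudgeResult matches the documented output schema.
const judge = (input: CodeJudgeInput): CodeJudgeResult => {
  const reference = input.referenceAnswer ?? '';
  const matches = reference !== '' && (input.candidateAnswer ?? '').includes(reference);
  return {
    score: matches ? 1.0 : 0.0,
    hits: matches ? ['Candidate contains the reference answer'] : [],
    misses: matches ? [] : ['Candidate does not contain the reference answer'],
    reasoning: matches ? 'Reference answer found' : 'Reference answer missing',
  };
};

export default defineCodeJudge(judge);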

Code judges can call an LLM through a target proxy for metrics that require multiple LLM calls (contextual precision, semantic similarity, etc.).

Add a target block to the evaluator config:

evaluators:
  - name: contextual-precision
    type: code_judge
    script: bun scripts/contextual-precision.ts
    target:
      max_calls: 10 # Default: 50

Use createTargetClient from the SDK:

#!/usr/bin/env bun
import { createTargetClient, defineCodeJudge } from '@agentv/eval';

export default defineCodeJudge(async ({ question, candidateAnswer }) => {
  const target = createTargetClient();
  if (!target) return { score: 0, misses: ['Target not configured'] };

  const response = await target.invoke({
    question: `Is this relevant to: ${question}? Response: ${candidateAnswer}`,
    systemPrompt: 'Respond with JSON: { "relevant": true/false }'
  });

  const result = JSON.parse(response.rawText ?? '{}');
  return { score: result.relevant ? 1.0 : 0.0 };
});

Use target.invokeBatch(requests) for multiple calls in parallel.
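A minimal batching sketch, assuming invokeBatch accepts an array of the same request objects that invoke takes and resolves to responses in the same order (that exact signature is an assumption; the criteria checked are illustrative):

#!/usr/bin/env bun
import { createTargetClient, defineCodeJudge } from '@agentv/eval';

export default defineCodeJudge(async ({ question, candidateAnswer }) => {
  const target = createTargetClient();
  if (!target) return { score: 0, misses: ['Target not configured'] };

  // Hypothetical criteria to check in parallel; assumes invokeBatch takes
  // the same request shape as invoke and returns responses in order.
  const criteria = ['factually consistent', 'directly answers the question'];
  const responses = await target.invokeBatch(
    criteria.map((criterion) => ({
      question: `Is the response ${criterion}?\nQuestion: ${question}\nResponse: ${candidateAnswer}`,
      systemPrompt: 'Respond with JSON: { "pass": true/false }',
    }))
  );

  const hits: string[] = [];
  const misses: string[] = [];
  responses.forEach((response, i) => {
    const verdict = JSON.parse(response.rawText ?? '{}');
    (verdict.pass ? hits : misses).push(criteria[i]);
  });

  return {
    score: criteria.length === 0 ? 0 : hits.length / criteria.length,
    hits,
    misses,
    reasoning: `Passed ${hits.length}/${criteria.length} LLM checks`,
  };
});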

Environment variables (set automatically when target is configured):

| Variable | Description |
|----------|-------------|
| AGENTV_TARGET_PROXY_URL | Local proxy URL |
| AGENTV_TARGET_PROXY_TOKEN | Bearer token for authentication |

Beyond the basic question, expected_outcome, candidate_answer, and reference_answer fields, code judges receive additional context:

| Field | Type | Description |
|-------|------|-------------|
| guideline_files | string[] | Paths to guideline files referenced in the eval |
| input_files | string[] | Paths to input files referenced in the eval |
| input_messages | Message[] | Full resolved input message array |
| expected_messages | Message[] | Expected agent behavior including tool calls |
| output_messages | Message[] | Actual agent execution trace with tool calls |
| trace_summary | TraceSummary | Lightweight execution metrics |
Example trace_summary:

{
  "event_count": 5,
  "tool_names": ["fetch", "search"],
  "tool_calls_by_name": { "search": 2, "fetch": 1 },
  "error_count": 0,
  "token_usage": { "input": 1000, "output": 500 },
  "cost_usd": 0.0015,
  "duration_ms": 3500
}

Use expected_messages for retrieval context in RAG evals (tool calls with outputs) and output_messages for the actual agent execution trace from live runs.
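For example, a judge can score tool usage from trace_summary alone. A minimal sketch in the same raw stdin/stdout style as check_answer.ts; the filename and the required tool list are illustrative:

validators/check_tools.ts
import { readFileSync } from "fs";

const data = JSON.parse(readFileSync("/dev/stdin", "utf-8"));
const trace = data.trace_summary ?? {};
const toolNames: string[] = trace.tool_names ?? [];

const hits: string[] = [];
const misses: string[] = [];

// Illustrative requirement: the agent should have called "search" and must not have errored.
for (const required of ["search"]) {
  if (toolNames.includes(required)) {
    hits.push(`Agent called required tool: ${required}`);
  } else {
    misses.push(`Agent never called required tool: ${required}`);
  }
}
if ((trace.error_count ?? 0) === 0) {
  hits.push("Trace contains no errors");
} else {
  misses.push(`Trace contains ${trace.error_count} error(s)`);
}

const total = hits.length + misses.length;
console.log(JSON.stringify({
  score: total === 0 ? 0 : hits.length / total,
  hits,
  misses,
  reasoning: `Passed ${hits.length}/${total} trace check(s)`,
}));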

Test a code judge by piping JSON to stdin:

echo '{"question":"What is 2+2?","expected_outcome":"4","candidate_answer":"4","reference_answer":"4","sidecar":{}}' | python validators/check_answer.py