# Example Evaluations
This page collects complete eval file examples you can copy and adapt. Each demonstrates a different AgentV pattern.
## Basic Q&A

A minimal eval with a single question and expected answer:

```yaml
description: Basic arithmetic evaluation
execution:
  target: default

evalcases:
  - id: simple-addition
    expected_outcome: Correctly calculates 2+2
    input_messages:
      - role: user
        content: What is 2 + 2?
    expected_messages:
      - role: assistant
        content: "4"
```

## Code Review with File References
Use multipart content to attach files alongside text prompts:

````yaml
description: Code review with guidelines
execution:
  target: azure_base

evalcases:
  - id: code-review-basic
    expected_outcome: Assistant provides helpful code analysis with security considerations
    input_messages:
      - role: system
        content: You are an expert code reviewer.
      - role: user
        content:
          - type: text
            value: |-
              Review this function for security issues:

              ```python
              def get_user(user_id):
                  query = f"SELECT * FROM users WHERE id = {user_id}"
                  return db.execute(query)
              ```
          - type: file
            value: /prompts/security-guidelines.md
    expected_messages:
      - role: assistant
        content: |-
          This code has a critical SQL injection vulnerability. The user_id is
          directly interpolated into the query string without sanitization.

          Recommended fix:

          ```python
          def get_user(user_id):
              query = "SELECT * FROM users WHERE id = ?"
              return db.execute(query, (user_id,))
          ```
````
## Multi-Evaluator
Combine a code judge and an LLM judge on the same eval case:
```yaml
description: JSON generation with validation
execution:
  target: default

evalcases:
  - id: json-generation-with-validation
    expected_outcome: Generates valid JSON with required fields
    execution:
      evaluators:
        - name: json_format_validator
          type: code_judge
          script: uv run validate_json.py
          cwd: ./evaluators
        - name: content_evaluator
          type: llm_judge
          prompt: ./judges/semantic_correctness.md
    input_messages:
      - role: user
        content: |-
          Generate a JSON object for a user with name "Alice", email
          "alice@example.com", and role "admin".
    expected_messages:
      - role: assistant
        content: |-
          {
            "name": "Alice",
            "email": "alice@example.com",
            "role": "admin"
          }
```

## Tool Trajectory

Validate that an agent uses specific tools during execution:
```yaml
description: Tool usage validation
execution:
  target: mock_agent

evalcases:
  # Validate minimum tool usage (order doesn't matter)
  - id: research-depth
    expected_outcome: Agent researches thoroughly
    input_messages:
      - role: user
        content: Research REST vs GraphQL
    execution:
      evaluators:
        - name: research-check
          type: tool_trajectory
          mode: any_order
          minimums:
            knowledgeSearch: 2
            documentRetrieve: 1

  # Validate exact tool sequence
  - id: auth-flow
    expected_outcome: Agent follows auth sequence
    input_messages:
      - role: user
        content: Authenticate user
    execution:
      evaluators:
        - name: auth-sequence
          type: tool_trajectory
          mode: exact
          expected:
            - tool: checkCredentials
            - tool: generateToken
```

## Static Trace

Evaluate pre-existing trace files without running an agent:
```yaml
description: Static trace evaluation
execution:
  target: static_trace

evalcases:
  - id: validate-trace-file
    expected_outcome: Trace contains required steps
    input_messages:
      - role: user
        content: Analyze trace
    execution:
      evaluators:
        - name: trace-check
          type: tool_trajectory
          mode: in_order
          expected:
            - tool: webSearch
            - tool: readFile
```

## Multi-Turn Conversation

Test multi-turn interactions where intermediate messages set context:
````yaml
description: Multi-turn debugging session with clarifying questions
execution:
  target: default

evalcases:
  - id: debug-with-clarification
    expected_outcome: |-
      Assistant conducts a multi-turn debugging session, asking clarification
      questions when needed, correctly diagnosing the bug, and proposing a
      clear fix with rationale.
    input_messages:
      - role: system
        content: You are an expert debugging assistant who reasons step by step, asks clarifying questions, and explains fixes clearly.
      - role: user
        content: |-
          I'm getting an off-by-one error in this function, but I can't see why:

          ```python
          def get_items(items):
              result = []
              for i in range(len(items) - 1):
                  result.append(items[i])
              return result
          ```

          Sometimes the last element is missing. Can you help debug this?
      - role: assistant
        content: |-
          I can help debug this. Before I propose a fix, could you tell me:

          - What output you expect for an example input list
          - What output you actually get
      - role: user
        content: |-
          For `[1, 2, 3, 4]` I expect `[1, 2, 3, 4]`, but I get `[1, 2, 3]`.
    expected_messages:
      - role: assistant
        content: |-
          You have an off-by-one error in your loop bounds. You're iterating
          with `range(len(items) - 1)`, which stops before the last index.
          To include all items, you can either:

          - Use `range(len(items))`, or
          - Iterate directly over the list: `for item in items:`

          Here's a corrected version:

          ```python
          def get_items(items):
              result = []
              for item in items:
                  result.append(item)
              return result
          ```
````
## Batch CLI
Evaluate external batch runners that process all eval cases in one invocation:
```yaml
description: Batch CLI demo (AML screening)
execution:
  target: batch_cli

evalcases:
  - id: aml-001
    expected_outcome: |-
      Batch runner returns JSON with decision=CLEAR.
    expected_messages:
      - role: assistant
        content:
          decision: CLEAR
    input_messages:
      - role: system
        content: You are a deterministic AML screening batch checker.
      - role: user
        content:
          request:
            type: aml_screening_check
            jurisdiction: AU
            effective_date: 2025-01-01
          row:
            id: aml-001
            customer_name: Example Customer A
            origin_country: NZ
            destination_country: AU
            transaction_type: INTERNATIONAL_TRANSFER
            amount: 5000
            currency: USD
    execution:
      evaluators:
        - name: decision-check
          type: code_judge
          script: bun run ./scripts/check-batch-cli-output.ts
          cwd: .

  - id: aml-002
    expected_outcome: |-
      Batch runner returns JSON with decision=REVIEW.
    expected_messages:
      - role: assistant
        content:
          decision: REVIEW
    input_messages:
      - role: system
        content: You are a deterministic AML screening batch checker.
      - role: user
        content:
          request:
            type: aml_screening_check
            jurisdiction: AU
            effective_date: 2025-01-01
          row:
            id: aml-002
            customer_name: Example Customer B
            origin_country: IR
            destination_country: AU
            transaction_type: INTERNATIONAL_TRANSFER
            amount: 2000
            currency: USD
    execution:
      evaluators:
        - name: decision-check
          type: code_judge
          script: bun run ./scripts/check-batch-cli-output.ts
          cwd: .
```

## Batch CLI Pattern Notes
- `execution.target: batch_cli`: configure the CLI provider with `provider_batching: true` (a hypothetical target config is sketched after this list)
- The batch runner reads the eval YAML via the `--eval` flag and outputs JSONL keyed by `id`
- Put structured data in `user.content` as objects for the runner to extract
- Use `expected_messages.content` with object fields for structured expected output
- Each eval case has its own evaluator to validate its portion of the output
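As a rough sketch only: the exact shape of a `batch_cli` target definition depends on your AgentV configuration. The only detail taken from the notes above is `provider_batching: true`; the file layout, `provider`, `command`, and `{eval_file}` placeholder below are assumptions for illustration, so check the provider documentation before copying:

```yaml
# Hypothetical target config: only provider_batching is taken from the
# notes above; the other keys and values are illustrative assumptions.
targets:
  batch_cli:
    provider: cli
    provider_batching: true   # one invocation processes every eval case
    # The runner receives the eval YAML via --eval and must emit JSONL keyed by id
    command: bun run ./scripts/batch-runner.ts --eval {eval_file}
```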
## File Path Conventions

- Absolute paths (start with `/`): resolved from the repository root
  - Example: `/prompts/guidelines.md` resolves to `<repo_root>/prompts/guidelines.md`
- Relative paths (start with `./` or `../`): resolved from the eval file directory
  - Example: `../../prompts/file.md` goes two directories up, then into `prompts/` (both styles appear in the sketch below)
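Both conventions can be mixed in one multipart message, reusing the `type: file` parts from the code-review example above; the relative file name here is illustrative:

```yaml
input_messages:
  - role: user
    content:
      - type: text
        value: Review the attached guidelines.
      - type: file
        value: /prompts/security-guidelines.md  # absolute: resolved from the repo root
      - type: file
        value: ../shared/style-guide.md  # relative: resolved from this eval file's directory
```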
## Tips for Writing `expected_outcome`

- Be specific about what success looks like (see the contrast sketched after this list)
- Mention key elements that must be present
- For classification tasks, specify the expected category
- For reasoning tasks, describe the thought process expected
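For example, here is a vague outcome rewritten to name the expected category and the elements a judge can verify; the ticket-classification scenario is illustrative:

```yaml
# Too vague: gives a judge almost nothing to verify
expected_outcome: Handles the ticket well
---
# Specific: names the expected category and the elements that must be present
expected_outcome: |-
  Classifies the ticket as "billing", identifies the duplicate charge as
  the cause, and proposes a refund as the next step.
```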
## Tips for Writing `expected_messages`

- Show the pattern, not rigid templates (sketched below)
- Allow for natural language variation
- Focus on semantic correctness over exact matching
- Evaluators handle the actual validation logic
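As a sketch, an expected message for the multi-turn example above can describe the shape of a good answer and leave the wording to the evaluators:

```yaml
# A pattern, not a template: an llm_judge compares meaning, not exact strings.
expected_messages:
  - role: assistant
    content: |-
      Identifies the off-by-one error in the loop bound and shows a
      corrected loop that includes every element.
```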