# Example Evaluations

This page collects complete eval file examples you can copy and adapt. Each demonstrates a different AgentV pattern.

## Basic Example

A minimal eval with a single question and an expected answer:

```yaml
description: Basic arithmetic evaluation
execution:
  target: default
evalcases:
  - id: simple-addition
    expected_outcome: Correctly calculates 2+2
    input_messages:
      - role: user
        content: What is 2 + 2?
    expected_messages:
      - role: assistant
        content: "4"
```

## Multipart Content

Use multipart content to attach files alongside text prompts:

````yaml
description: Code review with guidelines
execution:
  target: azure_base
evalcases:
  - id: code-review-basic
    expected_outcome: Assistant provides helpful code analysis with security considerations
    input_messages:
      - role: system
        content: You are an expert code reviewer.
      - role: user
        content:
          - type: text
            value: |-
              Review this function for security issues:
              ```python
              def get_user(user_id):
                  query = f"SELECT * FROM users WHERE id = {user_id}"
                  return db.execute(query)
              ```
          - type: file
            value: /prompts/security-guidelines.md
    expected_messages:
      - role: assistant
        content: |-
          This code has a critical SQL injection vulnerability. The user_id is
          directly interpolated into the query string without sanitization.

          Recommended fix:

          def get_user(user_id):
              query = "SELECT * FROM users WHERE id = ?"
              return db.execute(query, (user_id,))
````
## Multi-Evaluator
Combine a code judge and an LLM judge on the same eval case:
```yaml
description: JSON generation with validation
execution:
  target: default
evalcases:
  - id: json-generation-with-validation
    expected_outcome: Generates valid JSON with required fields
    execution:
      evaluators:
        - name: json_format_validator
          type: code_judge
          script: uv run validate_json.py
          cwd: ./evaluators
        - name: content_evaluator
          type: llm_judge
          prompt: ./judges/semantic_correctness.md
    input_messages:
      - role: user
        content: |-
          Generate a JSON object for a user with name "Alice",
          email "alice@example.com", and role "admin".
    expected_messages:
      - role: assistant
        content: |-
          {
            "name": "Alice",
            "email": "alice@example.com",
            "role": "admin"
          }
```
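
The LLM judge reads its instructions from the prompt file (here `./judges/semantic_correctness.md`), while the code judge runs an external script. This page doesn't show `validate_json.py` or AgentV's exact code-judge I/O contract, so the following is only a rough sketch of what such a validator could look like, assuming the candidate output arrives on stdin and a non-zero exit code signals failure:

```python
# Hypothetical validate_json.py sketch. The stdin/exit-code contract is an
# assumption for illustration; consult AgentV's code_judge docs for the real one.
import json
import sys

REQUIRED_FIELDS = {"name", "email", "role"}

def main() -> None:
    try:
        obj = json.loads(sys.stdin.read())
    except json.JSONDecodeError as err:
        print(f"invalid JSON: {err}")
        sys.exit(1)

    if not isinstance(obj, dict):
        print("expected a JSON object")
        sys.exit(1)

    missing = REQUIRED_FIELDS - obj.keys()
    if missing:
        print(f"missing required fields: {sorted(missing)}")
        sys.exit(1)

    print("ok")

if __name__ == "__main__":
    main()
```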

## Tool Trajectory

Validate that an agent uses specific tools during execution:

```yaml
description: Tool usage validation
execution:
  target: mock_agent
evalcases:
  # Validate minimum tool usage (order doesn't matter)
  - id: research-depth
    expected_outcome: Agent researches thoroughly
    input_messages:
      - role: user
        content: Research REST vs GraphQL
    execution:
      evaluators:
        - name: research-check
          type: tool_trajectory
          mode: any_order
          minimums:
            knowledgeSearch: 2
            documentRetrieve: 1
  # Validate exact tool sequence
  - id: auth-flow
    expected_outcome: Agent follows auth sequence
    input_messages:
      - role: user
        content: Authenticate user
    execution:
      evaluators:
        - name: auth-sequence
          type: tool_trajectory
          mode: exact
          expected:
            - tool: checkCredentials
            - tool: generateToken
```

## Static Trace Evaluation

Evaluate pre-existing trace files without running an agent:

```yaml
description: Static trace evaluation
execution:
  target: static_trace
evalcases:
  - id: validate-trace-file
    expected_outcome: Trace contains required steps
    input_messages:
      - role: user
        content: Analyze trace
    execution:
      evaluators:
        - name: trace-check
          type: tool_trajectory
          mode: in_order
          expected:
            - tool: webSearch
            - tool: readFile
```

## Multi-Turn Conversations

Test multi-turn interactions where intermediate messages set context:

````yaml
description: Multi-turn debugging session with clarifying questions
execution:
  target: default
evalcases:
  - id: debug-with-clarification
    expected_outcome: |-
      Assistant conducts a multi-turn debugging session, asking clarification
      questions when needed, correctly diagnosing the bug, and proposing a clear
      fix with rationale.
    input_messages:
      - role: system
        content: You are an expert debugging assistant who reasons step by step, asks clarifying questions, and explains fixes clearly.
      - role: user
        content: |-
          I'm getting an off-by-one error in this function, but I can't see why:
          ```python
          def get_items(items):
              result = []
              for i in range(len(items) - 1):
                  result.append(items[i])
              return result
          ```

          Sometimes the last element is missing. Can you help debug this?
      - role: assistant
        content: |-
          I can help debug this. Before I propose a fix, could you tell me:
          - What output you expect for an example input list
          - What output you actually get
      - role: user
        content: |-
          For [1, 2, 3, 4] I expect [1, 2, 3, 4], but I get [1, 2, 3].
    expected_messages:
      - role: assistant
        content: |-
          You have an off-by-one error in your loop bounds. You're iterating with
          range(len(items) - 1), which stops before the last index. To include all
          items, you can either:

          - Use range(len(items)), or
          - Iterate directly over the list: for item in items

          Here's a corrected version:

          def get_items(items):
              result = []
              for item in items:
                  result.append(item)
              return result
````
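
The bug in this scenario is ordinary Python behavior, which makes it easy to sanity-check the expected assistant answer before committing the eval:

```python
# The off-by-one from the example, verifiable in plain Python.
def get_items(items):
    result = []
    for i in range(len(items) - 1):  # stops at index len(items) - 2
        result.append(items[i])
    return result

print(get_items([1, 2, 3, 4]))       # [1, 2, 3] -- last element dropped
print([item for item in [1, 2, 3, 4]])  # [1, 2, 3, 4] -- iterating directly fixes it
```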
## Batch CLI
Evaluate external batch runners that process all eval cases in one invocation:
```yaml
description: Batch CLI demo (AML screening)
execution:
  target: batch_cli
evalcases:
  - id: aml-001
    expected_outcome: |-
      Batch runner returns JSON with decision=CLEAR.
    expected_messages:
      - role: assistant
        content:
          decision: CLEAR
    input_messages:
      - role: system
        content: You are a deterministic AML screening batch checker.
      - role: user
        content:
          request:
            type: aml_screening_check
            jurisdiction: AU
            effective_date: 2025-01-01
            row:
              id: aml-001
              customer_name: Example Customer A
              origin_country: NZ
              destination_country: AU
              transaction_type: INTERNATIONAL_TRANSFER
              amount: 5000
              currency: USD
    execution:
      evaluators:
        - name: decision-check
          type: code_judge
          script: bun run ./scripts/check-batch-cli-output.ts
          cwd: .
  - id: aml-002
    expected_outcome: |-
      Batch runner returns JSON with decision=REVIEW.
    expected_messages:
      - role: assistant
        content:
          decision: REVIEW
    input_messages:
      - role: system
        content: You are a deterministic AML screening batch checker.
      - role: user
        content:
          request:
            type: aml_screening_check
            jurisdiction: AU
            effective_date: 2025-01-01
            row:
              id: aml-002
              customer_name: Example Customer B
              origin_country: IR
              destination_country: AU
              transaction_type: INTERNATIONAL_TRANSFER
              amount: 2000
              currency: USD
    execution:
      evaluators:
        - name: decision-check
          type: code_judge
          script: bun run ./scripts/check-batch-cli-output.ts
          cwd: .
```
Key points:

- `execution.target: batch_cli`: configure the CLI provider with `provider_batching: true`
- The batch runner reads the eval YAML via the `--eval` flag and outputs JSONL keyed by `id` (see the sketch below)
- Put structured data in `user.content` as objects for the runner to extract
- Use `expected_messages.content` with object fields for structured expected output
- Each eval case has its own evaluator to validate its portion of the output
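
The runner side of this contract isn't shown on this page. As a rough illustration only, a hypothetical Python batch runner honoring the `--eval`/JSONL convention above might look like the following; the `decide` logic is a placeholder, and the exact JSONL field names AgentV expects may differ:

```python
# Hypothetical batch runner sketch: reads the eval file passed via --eval,
# processes every eval case, and prints one JSON line per case keyed by id.
import argparse
import json

import yaml  # assumes PyYAML is installed

def decide(row: dict) -> str:
    # Placeholder mirroring the two cases above; a real runner would
    # implement actual screening rules.
    return "REVIEW" if row.get("origin_country") == "IR" else "CLEAR"

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--eval", dest="eval_path", required=True)
    args = parser.parse_args()

    with open(args.eval_path) as f:
        eval_file = yaml.safe_load(f)

    # One JSON line per eval case, keyed by id.
    for case in eval_file.get("evalcases", []):
        user = next(m for m in case["input_messages"] if m["role"] == "user")
        row = user["content"]["request"]["row"]
        print(json.dumps({"id": case["id"], "decision": decide(row)}))

if __name__ == "__main__":
    main()
```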
## Path Resolution

File references in eval files (such as `type: file` attachments, judge prompts, and evaluator scripts) resolve as follows:

- Absolute paths (start with `/`): resolved from the repository root
  - Example: `/prompts/guidelines.md` resolves to `<repo_root>/prompts/guidelines.md`
- Relative paths (start with `./` or `../`): resolved from the eval file directory
  - Example: `../../prompts/file.md` goes two directories up, then into `prompts/`
## Writing Tips

When writing `expected_outcome`:

- Be specific about what success looks like
- Mention key elements that must be present
- For classification tasks, specify the expected category
- For reasoning tasks, describe the expected thought process
When writing `expected_messages`:

- Show the pattern, not rigid templates
- Allow for natural language variation
- Focus on semantic correctness over exact matching
- Evaluators handle the actual validation logic