# Structured Data & Metrics Evaluators
Built-in evaluators for grading structured outputs and gating on execution metrics:
- `field_accuracy`: compare JSON fields against ground truth
- `latency`: gate on response time
- `cost`: gate on monetary cost
- `token_usage`: gate on token consumption
## Ground Truth

Put the expected structured output in the eval case's `expected_messages` (typically as the last assistant message with `content` as an object). Evaluators read expected values from there.
```yaml
evalcases:
  - id: invoice-001
    expected_messages:
      - role: assistant
        content:
          invoice_number: "INV-2025-001234"
          net_total: 1889
```

## Field Accuracy
Use `field_accuracy` to compare fields in the candidate JSON against the ground-truth object in `expected_messages`.
```yaml
execution:
  evaluators:
    - name: invoice_fields
      type: field_accuracy
      aggregation: weighted_average
      fields:
        - path: invoice_number
          match: exact
          required: true
          weight: 2.0
        - path: invoice_date
          match: date
          formats: ["DD-MMM-YYYY", "YYYY-MM-DD"]
        - path: net_total
          match: numeric_tolerance
          tolerance: 1.0
```

### Match Types
| Match Type | Description | Options |
|---|---|---|
| `exact` | Strict equality | — |
| `date` | Compares dates after parsing | `formats`: list of accepted date formats |
| `numeric_tolerance` | Numeric comparison within a tolerance | `tolerance`: absolute threshold; set `relative: true` for relative tolerance |
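The `relative: true` option from the table is not shown in the example above. A minimal sketch (the field name and the 2% figure are illustrative, and the interpretation of `tolerance` as a fraction of the expected value is an assumption):

```yaml
execution:
  evaluators:
    - name: totals
      type: field_accuracy
      fields:
        - path: net_total
          match: numeric_tolerance
          tolerance: 0.02   # with relative: true, assumed to mean 2% of the expected value
          relative: true
```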
For fuzzy string matching, use a `code_judge` evaluator (e.g. Levenshtein distance) instead of adding a fuzzy mode to `field_accuracy`, as sketched below.
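A minimal sketch of that approach, assuming `code_judge` runs an inline script against the candidate and expected strings; the `script` key and the `grade(candidate, expected)` contract are assumptions for illustration, not the documented `code_judge` schema:

```yaml
execution:
  evaluators:
    - name: fuzzy_vendor_name
      type: code_judge
      # ASSUMPTION: `script` and the grade() contract are illustrative only;
      # check the code_judge reference for the actual configuration.
      script: |
        def levenshtein(a: str, b: str) -> int:
            # Classic dynamic-programming edit distance.
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                cur = [i]
                for j, cb in enumerate(b, 1):
                    cur.append(min(prev[j] + 1,                  # deletion
                                   cur[j - 1] + 1,               # insertion
                                   prev[j - 1] + (ca != cb)))    # substitution
                prev = cur
            return prev[-1]

        def grade(candidate: str, expected: str) -> float:
            # Normalize edit distance to a 0..1 similarity score.
            longest = max(len(candidate), len(expected)) or 1
            return 1.0 - levenshtein(candidate, expected) / longest
```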
### Aggregation

| Strategy | Description |
|---|---|
| `weighted_average` (default) | Weighted mean of field scores |
| `all_or_nothing` | Score 1.0 only if all graded fields pass |
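For a strict gate, a minimal sketch using `all_or_nothing` (reusing fields from the example above):

```yaml
execution:
  evaluators:
    - name: invoice_fields_strict
      type: field_accuracy
      aggregation: all_or_nothing   # score 1.0 only if every graded field passes
      fields:
        - path: invoice_number
          match: exact
        - path: net_total
          match: numeric_tolerance
          tolerance: 1.0
```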
## Latency

Gate on execution time (in milliseconds) reported by the provider via `traceSummary`.
```yaml
execution:
  evaluators:
    - name: performance
      type: latency
      threshold: 2000
```

## Cost

Gate on monetary cost reported by the provider via `traceSummary`.
```yaml
execution:
  evaluators:
    - name: budget
      type: cost
      budget: 0.10
```

## Token Usage
Gate on provider-reported token usage. Useful when cost is unavailable or model pricing differs.
```yaml
execution:
  evaluators:
    - name: token-budget
      type: token_usage
      max_total: 10000
      # or:
      # max_input: 8000
      # max_output: 2000
```

## Combining with Composite Evaluators
Use a composite evaluator to produce a single “release gate” score from multiple checks:
```yaml
execution:
  evaluators:
    - name: release_gate
      type: composite
      evaluators:
        - name: correctness
          type: field_accuracy
          fields:
            - path: invoice_number
              match: exact
        - name: latency
          type: latency
          threshold: 2000
        - name: cost
          type: cost
          budget: 0.10
        - name: tokens
          type: token_usage
          max_total: 10000
      aggregator:
        type: weighted_average
        weights:
          correctness: 0.8
          latency: 0.1
          cost: 0.05
          tokens: 0.05
```