# Composite Evaluators
Composite evaluators combine multiple evaluators and aggregate their results into a single score. This enables sophisticated evaluation patterns like safety gates, weighted scoring, and conflict resolution.
## Basic Structure
A composite evaluator wraps two or more sub-evaluators and an aggregator that determines the final score:
```yaml
execution:
  evaluators:
    - name: my_composite
      type: composite
      evaluators:
        - name: evaluator_1
          type: llm_judge
          prompt: ./prompts/check1.md
        - name: evaluator_2
          type: code_judge
          script: uv run check2.py
      aggregator:
        type: weighted_average
        weights:
          evaluator_1: 0.6
          evaluator_2: 0.4
```

Each sub-evaluator runs independently, then the aggregator combines their results.
## Aggregator Types
### Weighted Average (Default)
Combines scores using a weighted arithmetic mean:
```yaml
aggregator:
  type: weighted_average
  weights:
    safety: 0.3   # 30% weight
    quality: 0.7  # 70% weight
```

If weights are omitted, all evaluators receive equal weight (1.0).
The score is calculated as:
```
final_score = sum(score_i * weight_i) / sum(weight_i)
```
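For example, with the weights above and child scores of 0.9 for `safety` and 0.8 for `quality`:

```
final_score = (0.9 * 0.3 + 0.8 * 0.7) / (0.3 + 0.7) = 0.83
```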
### Code Judge Aggregator

Run a custom script to decide the final score based on all evaluator results:
```yaml
aggregator:
  type: code_judge
  path: node ./scripts/safety-gate.js
  cwd: ./evaluators  # optional working directory
```

The script receives the evaluator results on stdin and must print a result to stdout.
Input (stdin):
{ "results": { "safety": { "score": 0.9, "hits": ["..."], "misses": ["..."] }, "quality": { "score": 0.85, "hits": ["..."], "misses": ["..."] } }}Output (stdout):
{ "score": 0.87, "verdict": "pass", "hits": ["Combined check passed"], "misses": [], "reasoning": "Safety gate passed, quality acceptable"}LLM Judge Aggregator
### LLM Judge Aggregator

Use an LLM to resolve conflicts or make nuanced decisions across evaluator results:
```yaml
aggregator:
  type: llm_judge
  prompt: ./prompts/conflict-resolution.md
```

Inside the prompt file, use the `{{EVALUATOR_RESULTS_JSON}}` variable to inject the JSON results from all child evaluators.
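For example, a conflict-resolution prompt might look like the sketch below. The rubric wording is illustrative; only the `{{EVALUATOR_RESULTS_JSON}}` placeholder is part of the documented interface, and how the judge's answer is parsed depends on your `llm_judge` configuration:

```markdown
You are combining the findings of several evaluators into one verdict.

Evaluator results:
{{EVALUATOR_RESULTS_JSON}}

Weigh safety findings above all other criteria. If the evaluators disagree,
state which result you trusted and why, then give a final score from 0 to 1.
```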
## Patterns
### Safety Gate
Block outputs that fail safety even if quality is high. A code judge aggregator can enforce hard gates:
```yaml
evalcases:
  - id: safety-gated-response
    expected_outcome: Safe and accurate response
    input_messages:
      - role: user
        content: Explain quantum computing
    execution:
      evaluators:
        - name: safety_gate
          type: composite
          evaluators:
            - name: safety
              type: llm_judge
              prompt: ./prompts/safety-check.md
            - name: quality
              type: llm_judge
              prompt: ./prompts/quality-check.md
          aggregator:
            type: code_judge
            path: ./scripts/safety-gate.js
```

The `safety-gate.js` script can return a score of 0.0 whenever the safety evaluator fails, regardless of the quality score.
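A minimal sketch of what `safety-gate.js` could look like, assuming the stdin/stdout contract described under the code judge aggregator; the 0.8 safety threshold and the decision to pass the quality score through are assumptions for illustration:

```js
#!/usr/bin/env node
// Hard gate sketch: a failing safety check zeroes the composite score,
// no matter how well quality scored. The 0.8 cutoff is an assumed threshold.
const { safety, quality } = JSON.parse(
  require("node:fs").readFileSync(0, "utf8")
).results;

const gated = safety.score < 0.8;
process.stdout.write(JSON.stringify({
  score: gated ? 0.0 : quality.score,
  verdict: gated ? "fail" : "pass",
  hits: gated ? [] : quality.hits,
  misses: gated ? safety.misses : quality.misses,
  reasoning: gated
    ? "Safety gate failed; quality score ignored"
    : "Safety gate passed; score taken from quality",
}));
```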
### Multi-Criteria Weighted
Assign different importance to each evaluation dimension:
```yaml
- name: release_readiness
  type: composite
  evaluators:
    - name: correctness
      type: llm_judge
      prompt: ./prompts/correctness.md
    - name: style
      type: code_judge
      script: uv run style_checker.py
    - name: security
      type: llm_judge
      prompt: ./prompts/security.md
  aggregator:
    type: weighted_average
    weights:
      correctness: 0.5
      style: 0.2
      security: 0.3
```
### Nested Composites

Composites can contain other composites for hierarchical evaluation:
```yaml
- name: comprehensive_eval
  type: composite
  evaluators:
    - name: content_quality
      type: composite
      evaluators:
        - name: accuracy
          type: llm_judge
          prompt: ./prompts/accuracy.md
        - name: clarity
          type: llm_judge
          prompt: ./prompts/clarity.md
      aggregator:
        type: weighted_average
        weights:
          accuracy: 0.6
          clarity: 0.4
    - name: safety
      type: llm_judge
      prompt: ./prompts/safety.md
  aggregator:
    type: weighted_average
    weights:
      content_quality: 0.7
      safety: 0.3
```
## Result Structure

Composite evaluators return nested `evaluator_results`, giving full visibility into each sub-evaluator:
{ "score": 0.85, "verdict": "pass", "hits": ["[safety] No harmful content", "[quality] Clear explanation"], "misses": ["[quality] Could use more examples"], "reasoning": "safety: Passed all checks; quality: Good but could improve", "evaluator_results": [ { "name": "safety", "type": "llm_judge", "score": 0.95, "verdict": "pass", "hits": ["No harmful content"], "misses": [] }, { "name": "quality", "type": "llm_judge", "score": 0.8, "verdict": "pass", "hits": ["Clear explanation"], "misses": ["Could use more examples"] } ]}Hits and misses from sub-evaluators are prefixed with the evaluator name (e.g., [safety]) in the top-level arrays.
## Best Practices
- Name evaluators clearly: names appear in results and debugging output, so use descriptive labels like `safety` or `correctness` rather than `eval_1`.
- Use safety gates for critical checks: do not let high quality scores override safety failures. A code judge aggregator can enforce hard gates.
- Balance weights thoughtfully: consider which aspects matter most for your use case and assign weights accordingly.
- Keep nesting shallow: deep nesting makes debugging harder; two levels of composites are usually sufficient.
- Test aggregators independently: verify custom aggregation logic with unit tests before wiring it into a composite evaluator, as in the sketch after this list.
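As an example of that last point, the hypothetical test below pipes a failing safety result into the `safety-gate.js` sketch shown earlier and asserts that the gate zeroes the score; the file path and sample data are made up for the example.

```js
// test/safety-gate.test.js (hypothetical): run the aggregator script directly
// with a sample payload and check the hard-gate behaviour.
const { execFileSync } = require("node:child_process");
const assert = require("node:assert");

const input = JSON.stringify({
  results: {
    safety: { score: 0.2, hits: [], misses: ["Harmful content detected"] },
    quality: { score: 0.95, hits: ["Clear explanation"], misses: [] },
  },
});

const output = JSON.parse(
  execFileSync("node", ["./scripts/safety-gate.js"], { input, encoding: "utf8" })
);

assert.strictEqual(output.score, 0.0);
assert.strictEqual(output.verdict, "fail");
console.log("safety-gate aggregation test passed");
```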