Judges Framework

The Glacis judges framework provides a pluggable pipeline for evaluating AI outputs using LLM-as-judge patterns. Multiple judges can run on the same item, and their scores are aggregated into a recommendation (uphold, borderline, or escalate). Judge results are attested alongside the original output, creating an auditable review record.

The framework consists of four components:

| Component | Description |
|---|---|
| `BaseJudge` | Abstract class for implementing judges |
| `JudgeVerdict` | Result from a single judge on a single item |
| `JudgeRunner` | Runs multiple judges and aggregates scores |
| `Review` | Aggregated result with final score and recommendation |

All judges subclass `BaseJudge` and implement the `evaluate()` method:

```python
from glacis.judges import BaseJudge, JudgeVerdict

class MyJudge(BaseJudge):
    judge_id = "my-judge-v1"

    def evaluate(self, item, reference=None, rubric=None):
        # Your evaluation logic here
        return JudgeVerdict(
            judge_id=self.judge_id,
            score=2.5,
            rationale="The answer is accurate and well-structured.",
            latency_ms=150,
            metadata={"model": "gpt-4o"},
        )
```
| Parameter | Type | Description |
|---|---|---|
| `item` | `dict[str, Any]` | The item to evaluate (structure depends on use case) |
| `reference` | `str \| None` | Optional reference data for evaluation context |
| `rubric` | `str \| None` | Optional scoring rubric override (prompt text) |

Judges also support the context manager protocol. Override `close()` to release expensive resources like API clients or model handles.
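The lifecycle can be sketched without the library; `ClosingJudge` below is a minimal stand-in (not the real `BaseJudge`), assuming the standard context-manager contract in which exiting the `with` block calls `close()`:

```python
class ClosingJudge:
    """Minimal stand-in for a judge, assuming the standard
    context-manager contract: __enter__ returns the judge,
    __exit__ calls close()."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # never swallow exceptions

    def close(self):
        # Release expensive resources (API clients, model handles) here.
        self.closed = True


with ClosingJudge() as judge:
    pass  # evaluate items while resources are live
# close() has run by the time the block exits
```

With a real judge the same shape applies: wrapping it in `with` guarantees `close()` runs even if evaluation raises.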

Each judge returns a `JudgeVerdict`:

| Field | Type | Default | Description |
|---|---|---|---|
| `judge_id` | `str` | (required) | Identifier for the judge |
| `score` | `float` | (required) | Numeric score (>= 0). Scale defined by `JudgesConfig.max_score` |
| `rationale` | `str` | (required) | Judge's explanation for the score |
| `latency_ms` | `int` | `0` | Processing time in milliseconds |
| `metadata` | `dict` | `{}` | Judge-specific metadata for the audit trail |

`JudgeRunner` orchestrates multiple judges on a single item and aggregates their results:

```python
from glacis.judges import JudgeRunner

runner = JudgeRunner(judges=[judge_a, judge_b])
result = runner.run(
    item={"question": "What is AI?", "answer": "AI is..."},
    reference="Source document text...",
)
print(f"Final score: {result.final_score}")
print(f"Recommendation: {result.recommendation}")
print(f"Consensus: {result.consensus}")
```

If a judge raises an exception, it is caught and recorded as a verdict with `score=0` and the error message as the rationale. The pipeline continues with the remaining judges.
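That error-isolation behaviour can be sketched in plain Python; the `Verdict` dataclass and `run_all` helper below are illustrative stand-ins, not the library API:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    """Illustrative stand-in for JudgeVerdict."""
    judge_id: str
    score: float
    rationale: str

def run_all(judges, item):
    """Run every judge; convert exceptions into score-0 verdicts."""
    verdicts = []
    for judge in judges:
        try:
            verdicts.append(judge.evaluate(item))
        except Exception as exc:
            # A failing judge does not abort the pipeline: it is
            # recorded with score=0 and the error as the rationale.
            verdicts.append(Verdict(judge.judge_id, 0.0, str(exc)))
    return verdicts

class GoodJudge:
    judge_id = "good"
    def evaluate(self, item):
        return Verdict(self.judge_id, 3.0, "looks correct")

class BrokenJudge:
    judge_id = "broken"
    def evaluate(self, item):
        raise RuntimeError("API timeout")

verdicts = run_all([GoodJudge(), BrokenJudge()], item={})
# verdicts[1] is Verdict("broken", 0.0, "API timeout")
```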

| Parameter | Type | Default | Description |
|---|---|---|---|
| `judges` | `list[BaseJudge]` | (required) | List of judge instances to run |
| `consensus_threshold` | `float` | `1.0` | Max score spread before flagging disagreement (ignored if `config` provided) |
| `config` | `JudgesConfig \| None` | `None` | Full configuration with all thresholds |

`Review` is the aggregated result from running all judges:

| Field | Type | Description |
|---|---|---|
| `verdicts` | `list[JudgeVerdict]` | Individual judge results |
| `final_score` | `float` | Average of all judge scores |
| `max_score` | `float` | Maximum possible score (from config) |
| `consensus` | `bool` | Whether judges agree within the threshold |
| `recommendation` | `str` | `"uphold"`, `"borderline"`, or `"escalate"` |

Convert a `Review` to a dict matching the `glacis.models.Review` wire format (for L2 attestation):

```python
review = runner.run(item, reference=source_doc)
wire = review.to_wire_review(sample_probability=0.1)
# wire = {
#     "sample_probability": 0.1,
#     "judge_ids": ["gpt-4o-mini", "claude-haiku"],
#     "conformity_score": 0.9167,  # final_score / max_score, clamped to [0, 1]
#     "recommendation": "uphold",
#     "rationale": "correct; mostly correct",
# }
```
| Parameter | Type | Description |
|---|---|---|
| `sample_probability` | `float` | Probability this item was sampled (0.0-1.0) |
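The conformity score in the wire dict is a normalized, clamped ratio; a minimal sketch of that arithmetic (the helper name is illustrative, not part of the library):

```python
def conformity_score(final_score, max_score, precision=4):
    """Normalize final_score to [0, 1] and round to the configured precision."""
    ratio = final_score / max_score
    return round(min(max(ratio, 0.0), 1.0), precision)

conformity_score(2.75, 3.0)  # → 0.9167, as in the wire example: 2.75 / 3.0 rounded to 4 places
```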

The recommendation is derived from the `final_score` (average of all judge scores) using configurable thresholds:

```python
if final_score >= uphold_threshold:
    recommendation = "uphold"
elif final_score >= borderline_threshold:
    recommendation = "borderline"
else:
    recommendation = "escalate"
```

With default thresholds (0-3 scale):

| Score Range | Recommendation | Meaning |
|---|---|---|
| >= 2.0 | `"uphold"` | Quality is acceptable |
| >= 1.0 | `"borderline"` | Needs human review |
| < 1.0 | `"escalate"` | Quality concern, requires attention |
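For instance, with the defaults, two judges scoring 2.5 and 1.5 average to exactly 2.0, which meets the uphold threshold. A minimal sketch of the mapping (the helper name is illustrative):

```python
def recommend(scores, uphold=2.0, borderline=1.0):
    """Map the average judge score onto a recommendation, using the
    default thresholds for the 0-3 scale."""
    final = sum(scores) / len(scores)
    if final >= uphold:
        return "uphold"
    if final >= borderline:
        return "borderline"
    return "escalate"

recommend([2.5, 1.5])  # → "uphold"     (average 2.0 meets the threshold)
recommend([1.5, 1.0])  # → "borderline" (average 1.25)
recommend([0.5, 1.0])  # → "escalate"   (average 0.75)
```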

All thresholds are configurable via `JudgesConfig`:

| Field | Type | Default | Description |
|---|---|---|---|
| `max_score` | `float` | `3.0` | Maximum score on the rubric scale |
| `consensus_threshold` | `float` | `1.0` | Max score spread between judges before flagging disagreement |
| `uphold_threshold` | `float` | `2.0` | Minimum average score for `"uphold"` |
| `borderline_threshold` | `float` | `1.0` | Minimum average score for `"borderline"` (below this means `"escalate"`) |
| `score_precision` | `int` | `4` | Decimal places for rounding `final_score` |

```python
from glacis.judges import JudgesConfig, JudgeRunner

# Binary pass/fail scale (0-1)
config = JudgesConfig(
    max_score=1.0,
    uphold_threshold=0.7,
    borderline_threshold=0.4,
    consensus_threshold=0.2,
)
runner = JudgeRunner(judges=[judge_a, judge_b], config=config)
```

Here is a complete example of a fact-checking judge that uses an LLM to evaluate answer accuracy:

```python
import time
from typing import Any, Optional

from glacis.judges import BaseJudge, JudgeVerdict


class FactCheckJudge(BaseJudge):
    """Fact-checking judge using an LLM."""

    judge_id = "fact-check-gpt4o"

    def __init__(self, openai_client):
        self._client = openai_client

    def evaluate(
        self,
        item: dict[str, Any],
        reference: Optional[str] = None,
        rubric: Optional[str] = None,
    ) -> JudgeVerdict:
        start = time.perf_counter()
        prompt = rubric or (
            "Rate the factual accuracy of this answer on a 0-3 scale.\n"
            "0 = completely wrong, 1 = partially correct, "
            "2 = mostly correct, 3 = fully correct.\n"
            "Respond with just the score and a brief rationale."
        )
        messages = [
            {"role": "system", "content": prompt},
            {"role": "user", "content": (
                f"Question: {item.get('question', '')}\n"
                f"Answer: {item.get('answer', '')}\n"
                f"Reference: {reference or 'N/A'}"
            )},
        ]
        response = self._client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            temperature=0.0,
        )
        text = response.choices[0].message.content or ""
        # Parse the leading digit as the score (simplified)
        score = float(text[0]) if text and text[0].isdigit() else 0.0
        latency_ms = int((time.perf_counter() - start) * 1000)
        return JudgeVerdict(
            judge_id=self.judge_id,
            score=min(score, 3.0),
            rationale=text,
            latency_ms=latency_ms,
            metadata={"model": "gpt-4o"},
        )

    def close(self) -> None:
        """Release the OpenAI client if needed."""
        pass
```

When running multiple judges, the `consensus` flag indicates whether the judges agree:

```python
from glacis.judges import JudgeRunner, JudgesConfig

config = JudgesConfig(consensus_threshold=1.0)
runner = JudgeRunner(
    judges=[fact_check_judge, accuracy_judge],
    config=config,
)
result = runner.run(item={"question": "...", "answer": "..."})

if not result.consensus:
    print("Judges disagree significantly!")
    for v in result.verdicts:
        print(f"  {v.judge_id}: {v.score} - {v.rationale}")
```

Consensus is computed as `max(scores) - min(scores) <= consensus_threshold`. With a single judge, consensus is always `True`.
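The spread check is simple enough to sketch directly (the helper name is illustrative, not part of the library):

```python
def has_consensus(scores, threshold=1.0):
    """True when the score spread is within the threshold
    (trivially so for a single judge, where the spread is 0)."""
    return max(scores) - min(scores) <= threshold

has_consensus([2.5, 2.0])  # → True  (spread 0.5)
has_consensus([3.0, 1.0])  # → False (spread 2.0)
has_consensus([1.5])       # → True  (single judge)
```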

Judge thresholds can be set in your config file:

```yaml
version: "1.3"
judges:
  max_score: 3.0
  consensus_threshold: 1.0
  uphold_threshold: 2.0
  borderline_threshold: 1.0
  score_precision: 4
```

Load the config and pass it to `JudgeRunner`:

```python
from glacis.config import load_config
from glacis.judges import JudgeRunner

cfg = load_config("glacis.yaml")
runner = JudgeRunner(judges=[my_judge], config=cfg.judges)
```
  • Sampling & Evidence — how L2 sampling identifies attestations eligible for judge evaluation
  • Configuration — full glacis.yaml reference with judges section