Judges Framework

The Glacis judges framework provides a pluggable pipeline for evaluating AI outputs using LLM-as-judge patterns. Multiple judges can run on the same item, and their scores are aggregated into a recommendation (uphold, borderline, or escalate). Judge results are attested alongside the original output, creating an auditable review record.

The framework consists of four components:

| Component | Description |
|---|---|
| `BaseJudge` | Abstract class for implementing judges |
| `JudgeVerdict` | Result from a single judge on a single item |
| `JudgeRunner` | Runs multiple judges and aggregates scores |
| `Review` | Aggregated result with final score and recommendation |

All judges subclass `BaseJudge` and implement the `evaluate()` method:

```python
from glacis.judges import BaseJudge, JudgeVerdict

class MyJudge(BaseJudge):
    judge_id = "my-judge-v1"

    def evaluate(self, item, reference=None, rubric=None):
        # Your evaluation logic here
        return JudgeVerdict(
            judge_id=self.judge_id,
            score=2.5,
            rationale="The answer is accurate and well-structured.",
            latency_ms=150,
            metadata={"model": "gpt-4o"},
        )
```
| Parameter | Type | Description |
|---|---|---|
| `item` | `dict[str, Any]` | The item to evaluate (structure depends on use case) |
| `reference` | `str \| None` | Optional reference data for evaluation context |
| `rubric` | `str \| None` | Optional scoring rubric override (prompt text) |

Judges also support the context manager protocol. Override `close()` to release expensive resources like API clients or model handles.
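The lifecycle can be sketched without the library; `ClosingJudge` below is a minimal stand-in (not the real `BaseJudge`), assuming the standard context-manager contract in which exiting the `with` block calls `close()`:

```python
class ClosingJudge:
    """Minimal stand-in for a judge, assuming the standard
    context-manager contract: __enter__ returns the judge,
    __exit__ calls close()."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # never swallow exceptions

    def close(self):
        # Release expensive resources (API clients, model handles) here.
        self.closed = True


with ClosingJudge() as judge:
    pass  # evaluate items while resources are live
# close() has run by the time the block exits
```

With a real judge the same shape applies: wrapping it in `with` guarantees `close()` runs even if evaluation raises.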

Each judge returns a `JudgeVerdict`:

| Field | Type | Default | Description |
|---|---|---|---|
| `judge_id` | `str` | (required) | Identifier for the judge |
| `score` | `float` | (required) | Numeric score (>= 0). Scale defined by `JudgesConfig.max_score` |
| `rationale` | `str` | (required) | Judge's explanation for the score |
| `latency_ms` | `int` | `0` | Processing time in milliseconds |
| `metadata` | `dict` | `{}` | Judge-specific metadata for the audit trail |

`JudgeRunner` orchestrates multiple judges on a single item and aggregates their results:

```python
from glacis.judges import JudgeRunner

runner = JudgeRunner(judges=[judge_a, judge_b])
result = runner.run(
    item={"question": "What is AI?", "answer": "AI is..."},
    reference="Source document text...",
)
print(f"Final score: {result.final_score}")
print(f"Recommendation: {result.recommendation}")
print(f"Consensus: {result.consensus}")
```

If a judge raises an exception, it is caught and recorded as a verdict with `score=0` and the error message as the rationale. The pipeline continues with the remaining judges.
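That error-isolation behaviour can be sketched in plain Python; the `Verdict` dataclass and `run_all` helper below are illustrative stand-ins, not the library API:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    """Illustrative stand-in for JudgeVerdict."""
    judge_id: str
    score: float
    rationale: str

def run_all(judges, item):
    """Run every judge; convert exceptions into score-0 verdicts."""
    verdicts = []
    for judge in judges:
        try:
            verdicts.append(judge.evaluate(item))
        except Exception as exc:
            # A failing judge does not abort the pipeline: it is
            # recorded with score=0 and the error as the rationale.
            verdicts.append(Verdict(judge.judge_id, 0.0, str(exc)))
    return verdicts

class GoodJudge:
    judge_id = "good"
    def evaluate(self, item):
        return Verdict(self.judge_id, 3.0, "looks correct")

class BrokenJudge:
    judge_id = "broken"
    def evaluate(self, item):
        raise RuntimeError("API timeout")

verdicts = run_all([GoodJudge(), BrokenJudge()], item={})
# verdicts[1] is Verdict("broken", 0.0, "API timeout")
```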

| Parameter | Type | Default | Description |
|---|---|---|---|
| `judges` | `list[BaseJudge]` | (required) | List of judge instances to run |
| `consensus_threshold` | `float` | `1.0` | Max score spread before flagging disagreement (ignored if `config` provided) |
| `config` | `JudgesConfig \| None` | `None` | Full configuration with all thresholds |

`Review` is the aggregated result from running all judges:

| Field | Type | Description |
|---|---|---|
| `verdicts` | `list[JudgeVerdict]` | Individual judge results |
| `final_score` | `float` | Average of all judge scores |
| `max_score` | `float` | Maximum possible score (from config) |
| `consensus` | `bool` | Whether judges agree within the threshold |
| `recommendation` | `str` | `"uphold"`, `"borderline"`, or `"escalate"` |

Convert a `Review` to a dict matching the `glacis.models.Review` wire format (for L2 attestation):

```python
review = runner.run(item, reference=source_doc)
wire = review.to_wire_review(sample_probability=0.1)
# wire = {
#     "sample_probability": 0.1,
#     "judge_ids": ["gpt-4o-mini", "claude-haiku"],
#     "conformity_score": 0.9167,  # final_score / max_score, clamped to [0, 1]
#     "recommendation": "uphold",
#     "rationale": "correct; mostly correct",
# }
```
| Parameter | Type | Description |
|---|---|---|
| `sample_probability` | `float` | Probability this item was sampled (0.0-1.0) |
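The conformity score in the wire dict is a normalized, clamped ratio; a minimal sketch of that arithmetic (the helper name is illustrative, not part of the library):

```python
def conformity_score(final_score, max_score, precision=4):
    """Normalize final_score to [0, 1] and round to the configured precision."""
    ratio = final_score / max_score
    return round(min(max(ratio, 0.0), 1.0), precision)

conformity_score(2.75, 3.0)  # → 0.9167, as in the wire example: 2.75 / 3.0 rounded to 4 places
```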

The recommendation is derived from the `final_score` (average of all judge scores) using configurable thresholds:

```python
if final_score >= uphold_threshold:
    recommendation = "uphold"
elif final_score >= borderline_threshold:
    recommendation = "borderline"
else:
    recommendation = "escalate"
```

With default thresholds (0-3 scale):

| Score Range | Recommendation | Meaning |
|---|---|---|
| >= 2.0 | `"uphold"` | Quality is acceptable |
| >= 1.0 | `"borderline"` | Needs human review |
| < 1.0 | `"escalate"` | Quality concern, requires attention |
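For instance, with the defaults, two judges scoring 2.5 and 1.5 average to exactly 2.0, which meets the uphold threshold. A minimal sketch of the mapping (the helper name is illustrative):

```python
def recommend(scores, uphold=2.0, borderline=1.0):
    """Map the average judge score onto a recommendation, using the
    default thresholds for the 0-3 scale."""
    final = sum(scores) / len(scores)
    if final >= uphold:
        return "uphold"
    if final >= borderline:
        return "borderline"
    return "escalate"

recommend([2.5, 1.5])  # → "uphold"     (average 2.0 meets the threshold)
recommend([1.5, 1.0])  # → "borderline" (average 1.25)
recommend([0.5, 1.0])  # → "escalate"   (average 0.75)
```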

All thresholds are configurable via `JudgesConfig`:

| Field | Type | Default | Description |
|---|---|---|---|
| `max_score` | `float` | `3.0` | Maximum score on the rubric scale |
| `consensus_threshold` | `float` | `1.0` | Max score spread between judges before flagging disagreement |
| `uphold_threshold` | `float` | `2.0` | Minimum average score for `"uphold"` |
| `borderline_threshold` | `float` | `1.0` | Minimum average score for `"borderline"` (below this means `"escalate"`) |
| `score_precision` | `int` | `4` | Decimal places for rounding `final_score` |

```python
from glacis.judges import JudgesConfig, JudgeRunner

# Binary pass/fail scale (0-1)
config = JudgesConfig(
    max_score=1.0,
    uphold_threshold=0.7,
    borderline_threshold=0.4,
    consensus_threshold=0.2,
)
runner = JudgeRunner(judges=[judge_a, judge_b], config=config)
```

Here is a complete example of a fact-checking judge that uses an LLM to evaluate answer accuracy:

```python
import time
from typing import Any, Optional

from glacis.judges import BaseJudge, JudgeVerdict


class FactCheckJudge(BaseJudge):
    """Fact-checking judge using an LLM."""

    judge_id = "fact-check-gpt4o"

    def __init__(self, openai_client):
        self._client = openai_client

    def evaluate(
        self,
        item: dict[str, Any],
        reference: Optional[str] = None,
        rubric: Optional[str] = None,
    ) -> JudgeVerdict:
        start = time.perf_counter()
        prompt = rubric or (
            "Rate the factual accuracy of this answer on a 0-3 scale.\n"
            "0 = completely wrong, 1 = partially correct, "
            "2 = mostly correct, 3 = fully correct.\n"
            "Respond with just the score and a brief rationale."
        )
        messages = [
            {"role": "system", "content": prompt},
            {"role": "user", "content": (
                f"Question: {item.get('question', '')}\n"
                f"Answer: {item.get('answer', '')}\n"
                f"Reference: {reference or 'N/A'}"
            )},
        ]
        response = self._client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            temperature=0.0,
        )
        text = response.choices[0].message.content or ""
        # Parse the leading digit as the score (simplified)
        score = float(text[0]) if text and text[0].isdigit() else 0.0
        latency_ms = int((time.perf_counter() - start) * 1000)
        return JudgeVerdict(
            judge_id=self.judge_id,
            score=min(score, 3.0),
            rationale=text,
            latency_ms=latency_ms,
            metadata={"model": "gpt-4o"},
        )

    def close(self) -> None:
        """Release the OpenAI client if needed."""
        pass
```

When running multiple judges, the `consensus` flag indicates whether the judges agree:

```python
from glacis.judges import JudgeRunner, JudgesConfig

config = JudgesConfig(consensus_threshold=1.0)
runner = JudgeRunner(
    judges=[fact_check_judge, accuracy_judge],
    config=config,
)
result = runner.run(item={"question": "...", "answer": "..."})

if not result.consensus:
    print("Judges disagree significantly!")
    for v in result.verdicts:
        print(f"  {v.judge_id}: {v.score} - {v.rationale}")
```

Consensus is computed as `max(scores) - min(scores) <= consensus_threshold`. With a single judge, consensus is always `True`.
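The spread check is simple enough to sketch directly (the helper name is illustrative, not part of the library):

```python
def has_consensus(scores, threshold=1.0):
    """True when the score spread is within the threshold
    (trivially so for a single judge, where the spread is 0)."""
    return max(scores) - min(scores) <= threshold

has_consensus([2.5, 2.0])  # → True  (spread 0.5)
has_consensus([3.0, 1.0])  # → False (spread 2.0)
has_consensus([1.5])       # → True  (single judge)
```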

Judge thresholds can be set in your config file:

```yaml
version: "1.3"
judges:
  max_score: 3.0
  consensus_threshold: 1.0
  uphold_threshold: 2.0
  borderline_threshold: 1.0
  score_precision: 4
```

Load the config and pass it to `JudgeRunner`:

```python
from glacis.config import load_config
from glacis.judges import JudgeRunner

cfg = load_config("glacis.yaml")
runner = JudgeRunner(judges=[my_judge], config=cfg.judges)
```
  • Sampling & Evidence — how L2 sampling identifies attestations eligible for judge evaluation
  • Configuration — full glacis.yaml reference with judges section