Verifiers¶

Verifier ¶

Bases: Protocol

Post-hoc faithfulness verifier.

Decomposes each CitedSentence into atomic claims, checks each claim against its cited source span via NLI, and returns a VerificationResult per sentence.

In strict/paranoid modes the Pipeline will refuse unsupported sentences. In loose mode the verifier may be skipped entirely.

verify ¶

verify(sentences: list[CitedSentence], documents: dict[str, Document]) -> list[VerificationResult]

Return one VerificationResult per CitedSentence, in the same order.

Source code in src/verifiable_rag/verifiers/__init__.py

def verify(
    self,
    sentences: list[CitedSentence],
    documents: dict[str, Document],
) -> list[VerificationResult]:
    """Return one VerificationResult per CitedSentence, in the same order."""
    ...
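
As a rough illustration of the contract (one result per sentence, in input order), here is a minimal sketch of a class that satisfies the Protocol structurally. The three dataclasses are stand-ins written for this example, not the real verifiable_rag types, and the `Document.text` field and the `SubstringVerifier` class are assumptions made purely for illustration.

```python
from dataclasses import dataclass


# Stand-in types for this sketch only; the real CitedSentence, Document and
# VerificationResult live in verifiable_rag and may carry other fields.
@dataclass
class CitedSentence:
    text: str
    supporting_sentence_ids: list[str]


@dataclass
class Document:
    text: str  # assumed field name, for illustration only


@dataclass
class VerificationResult:
    cited_sentence_index: int
    claim_text: str
    is_supported: bool
    nli_score: float


class SubstringVerifier:
    """Hypothetical toy verifier: supported means 'appears verbatim in a document'."""

    def verify(
        self,
        sentences: list[CitedSentence],
        documents: dict[str, Document],
    ) -> list[VerificationResult]:
        corpus = " ".join(doc.text for doc in documents.values())
        results = []
        # Iterate in input order so results[i] always describes sentences[i].
        for i, cs in enumerate(sentences):
            supported = bool(cs.text.strip()) and cs.text in corpus
            results.append(
                VerificationResult(
                    cited_sentence_index=i,
                    claim_text=cs.text,
                    is_supported=supported,
                    nli_score=1.0 if supported else 0.0,
                )
            )
        return results


docs = {"d1": Document(text="Paris is the capital of France.")}
sentences = [
    CitedSentence("Paris is the capital of France.", ["d1:0"]),
    CitedSentence("Paris is the capital of Spain.", ["d1:0"]),
]
results = SubstringVerifier().verify(sentences, docs)
```

A real verifier would replace the substring check with an NLI model, but the shape of the inputs and outputs stays the same.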

NLIScorer Protocol¶

NLIScorer ¶

Bases: Protocol

Raw (premise, hypothesis) → entailment-probability scoring.

A thinner interface than Verifier. Used by verifier-only benchmarks (e.g. RAGTruth) that have pre-generated responses and need only the underlying NLI signal, not the CitedSentence/Document ceremony. The other verifiers (MiniCheck, DualNLI, LLM-judge) expose this interface too, so the same runner scores them apples-to-apples.

score_pairs ¶

score_pairs(pairs: list[tuple[str, str]]) -> list[float]

Score a batch of (premise, hypothesis) pairs.

Returns one float in [0, 1] per pair (higher = more supported).

Source code in src/verifiable_rag/verifiers/__init__.py

def score_pairs(self, pairs: list[tuple[str, str]]) -> list[float]:
    """Score a batch of (premise, hypothesis) pairs.

    Returns one float in ``[0, 1]`` per pair (higher = more supported).
    """
    ...
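
To make the interface concrete, the sketch below re-declares the Protocol locally and implements it with a trivial token-overlap scorer. `TokenOverlapScorer` is a hypothetical stand-in, not part of the library, and marking the local Protocol as `runtime_checkable` is an assumption made so the `isinstance` check works in this example.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class NLIScorer(Protocol):
    # Local re-declaration for this sketch; the real Protocol is defined in
    # src/verifiable_rag/verifiers/__init__.py.
    def score_pairs(self, pairs: list[tuple[str, str]]) -> list[float]: ...


class TokenOverlapScorer:
    """Hypothetical scorer: fraction of hypothesis tokens present in the premise."""

    def score_pairs(self, pairs: list[tuple[str, str]]) -> list[float]:
        scores = []
        for premise, hypothesis in pairs:
            premise_tokens = set(premise.lower().split())
            hypothesis_tokens = hypothesis.lower().split()
            if not premise_tokens or not hypothesis_tokens:
                # Mirror the convention used on this page: empty input scores 0.0.
                scores.append(0.0)
                continue
            hits = sum(tok in premise_tokens for tok in hypothesis_tokens)
            scores.append(hits / len(hypothesis_tokens))
        return scores


scorer = TokenOverlapScorer()
overlap_scores = scorer.score_pairs(
    [
        ("the cat sat on the mat", "the cat sat"),
        ("the cat sat on the mat", "a dog barked"),
        ("", "anything"),
    ]
)
```

Any object with a compatible `score_pairs` method can be handed to a verifier-only runner, which is what lets different backends be compared on the same footing.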

DualNLIVerifier¶

DualNLIVerifier ¶

DualNLIVerifier(scorer_a, scorer_b, aggregation: str = 'min', threshold: float = 0.0562)

Two-scorer NLI ensemble implementing the Verifier Protocol.

Parameters¶

scorer_a, scorer_b: Any objects implementing NLIScorer. Typically HHEMVerifier() and MiniCheckVerifier().

aggregation: How to collapse the two scorer outputs per pair. "min" (default) flags an example if either scorer is below threshold; this matches HALT-RAG and our RAGTruth-published config. "mean" and "max" are also supported for ablations.

threshold: Cutoff for is_supported on each VerificationResult. The default, 0.0562, is fit on RAGTruth-train with min aggregation across HHEM + MiniCheck. Re-fit it for your own data via scripts/compute_calibrated_metrics.py if your domain differs.

Source code in src/verifiable_rag/verifiers/dual_nli.py

def __init__(
    self,
    scorer_a,  # type: ignore[no-untyped-def] — NLIScorer (Protocol from sibling module)
    scorer_b,
    aggregation: str = "min",
    threshold: float = 0.0562,
) -> None:
    if not (0.0 <= threshold <= 1.0):
        raise ValueError(f"threshold must be in [0, 1], got {threshold}")
    self._ensemble = EnsembleScorer(
        [scorer_a, scorer_b], aggregation=aggregation
    )
    self._threshold = threshold
    self._aggregation = aggregation
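
The interaction between aggregation and threshold can be sketched with plain arithmetic. The helper below is a hypothetical illustration, not the library's code; it only assumes that the aggregated score is compared with `>=` against the threshold, as the verify method on this page does.

```python
def is_supported(
    score_a: float,
    score_b: float,
    aggregation: str = "min",
    threshold: float = 0.0562,
) -> bool:
    """Hypothetical helper: collapse two scores, then apply the cutoff."""
    if aggregation == "min":
        combined = min(score_a, score_b)
    elif aggregation == "mean":
        combined = (score_a + score_b) / 2
    elif aggregation == "max":
        combined = max(score_a, score_b)
    else:
        raise ValueError(f"unknown aggregation: {aggregation!r}")
    return combined >= threshold


# One scorer is confident (0.9), the other strongly disagrees (0.01).
flag_min = is_supported(0.9, 0.01, aggregation="min")    # False: 0.01 < 0.0562
flag_mean = is_supported(0.9, 0.01, aggregation="mean")  # True: 0.455 >= 0.0562
flag_max = is_supported(0.9, 0.01, aggregation="max")    # True: 0.9 >= 0.0562
```

With "min", a sentence passes only when both scorers clear the threshold, which is why it is the most conservative of the three options.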

verify ¶

verify(sentences: list[CitedSentence], documents: dict[str, Document]) -> list[VerificationResult]

Return one VerificationResult per CitedSentence, in input order.

Source code in src/verifiable_rag/verifiers/dual_nli.py

def verify(
    self,
    sentences: list[CitedSentence],
    documents: dict[str, Document],
) -> list[VerificationResult]:
    """Return one VerificationResult per CitedSentence, in input order."""
    if not sentences:
        return []

    scored_pairs: list[tuple[str, str]] = []
    pair_to_sentence: list[int] = []

    for i, cs in enumerate(sentences):
        premise = _build_premise(cs.supporting_sentence_ids, documents)
        if not premise.strip() or not cs.text.strip():
            continue
        scored_pairs.append((premise, cs.text))
        pair_to_sentence.append(i)

    if scored_pairs:
        raw_scores = self._ensemble.score_pairs(scored_pairs)
    else:
        raw_scores = []

    scores_by_sentence_idx: dict[int, float] = {
        idx: float(s)
        for idx, s in zip(pair_to_sentence, raw_scores, strict=True)
    }

    results: list[VerificationResult] = []
    for i, cs in enumerate(sentences):
        score = scores_by_sentence_idx.get(i, 0.0)
        results.append(
            VerificationResult(
                cited_sentence_index=i,
                claim_text=cs.text,
                is_supported=score >= self._threshold,
                nli_score=score,
            )
        )
    return results

score_pairs ¶

score_pairs(pairs: list[tuple[str, str]]) -> list[float]

Raw (premise, hypothesis) scoring — used by RAGTruth runner.

Source code in src/verifiable_rag/verifiers/dual_nli.py

def score_pairs(self, pairs: list[tuple[str, str]]) -> list[float]:
    """Raw (premise, hypothesis) scoring — used by RAGTruth runner."""
    return self._ensemble.score_pairs(pairs)

HHEMVerifier¶

HHEMVerifier ¶

HHEMVerifier(model_name: str = 'vectara/hallucination_evaluation_model', threshold: float = 0.3, device: str | None = None)

Sentence-level NLI verifier backed by HHEM-2.1-open.

Parameters¶

model_name: HuggingFace model id. Default "vectara/hallucination_evaluation_model" (HHEM-2.1-open, ~600M).

threshold: Cutoff for the boolean is_supported flag on each VerificationResult. Pipeline surgical correction uses this flag to decide which sentences to keep. The default, 0.3, was chosen empirically: HHEM was trained on tight summarization-style entailment, but our LLM produces paraphrased or synthesized cited sentences that legitimately score in the 0.3-0.6 range. Raise it toward 0.5 or higher for stricter behavior; lower it for more permissive behavior.

device: "cpu", "cuda", "mps", or None to autodetect.

Source code in src/verifiable_rag/verifiers/hhem.py

def __init__(
    self,
    model_name: str = "vectara/hallucination_evaluation_model",
    threshold: float = 0.3,
    device: str | None = None,
) -> None:
    if not (0.0 <= threshold <= 1.0):
        raise ValueError(f"threshold must be in [0, 1], got {threshold}")
    self._model_name = model_name
    self._threshold = threshold
    self._device = device
    self._model: Any = None

verify ¶

verify(sentences: list[CitedSentence], documents: dict[str, Document]) -> list[VerificationResult]

Return one VerificationResult per CitedSentence, in input order.

Source code in src/verifiable_rag/verifiers/hhem.py

def verify(
    self,
    sentences: list[CitedSentence],
    documents: dict[str, Document],
) -> list[VerificationResult]:
    """Return one VerificationResult per CitedSentence, in input order."""
    if not sentences:
        return []

    # Build (premise, hypothesis) pairs, tracking which sentence-indices
    # actually have a non-empty premise. Indices with no premise score 0.
    scored_pairs: list[tuple[str, str]] = []
    pair_to_sentence: list[int] = []

    for i, cs in enumerate(sentences):
        premise = self._build_premise(cs.supporting_sentence_ids, documents)
        if not premise.strip() or not cs.text.strip():
            continue
        scored_pairs.append((premise, cs.text))
        pair_to_sentence.append(i)

    # Score everything in one batch call to the model.
    if scored_pairs:
        model = self._load()
        raw_scores = model.predict(scored_pairs)
    else:
        raw_scores = []

    scores_by_sentence_idx: dict[int, float] = {
        sentence_idx: float(score)
        for sentence_idx, score in zip(pair_to_sentence, raw_scores, strict=True)
    }

    results: list[VerificationResult] = []
    for i, cs in enumerate(sentences):
        score = scores_by_sentence_idx.get(i, 0.0)
        results.append(
            VerificationResult(
                cited_sentence_index=i,
                claim_text=cs.text,
                is_supported=score >= self._threshold,
                nli_score=score,
            )
        )
    return results
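
The effect of the threshold on is_supported can be seen with a handful of made-up scores. The numbers below are illustrative only, not measurements from HHEM, and `supported_flags` is a hypothetical helper that applies the same `score >= threshold` comparison used above.

```python
# Illustrative scores only; real HHEM outputs will differ.
example_scores = [0.12, 0.30, 0.45, 0.72]


def supported_flags(scores: list[float], threshold: float) -> list[bool]:
    # Same inclusive comparison as in verify(): score >= threshold.
    return [s >= threshold for s in scores]


default_flags = supported_flags(example_scores, threshold=0.3)
strict_flags = supported_flags(example_scores, threshold=0.5)
# default_flags -> [False, True, True, True]
# strict_flags  -> [False, False, False, True]
```

Sentences scoring in the 0.3-0.6 band are kept at the default and dropped at 0.5, which is the trade-off the threshold parameter controls.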

score_pairs ¶

score_pairs(pairs: list[tuple[str, str]]) -> list[float]

Score a batch of (premise, hypothesis) pairs.

Used by verifier-only runners (e.g. RAGTruth) that bypass the CitedSentence/Document path. Empty premise OR empty hypothesis scores 0.0; everything else goes through HHEM in one batch call.

Source code in src/verifiable_rag/verifiers/hhem.py

def score_pairs(self, pairs: list[tuple[str, str]]) -> list[float]:
    """Score a batch of ``(premise, hypothesis)`` pairs.

    Used by verifier-only runners (e.g. RAGTruth) that bypass the
    CitedSentence/Document path. Empty premise OR empty hypothesis
    scores 0.0; everything else goes through HHEM in one batch call.
    """
    if not pairs:
        return []
    keep_idx: list[int] = []
    keep_pairs: list[tuple[str, str]] = []
    for i, (p, h) in enumerate(pairs):
        if p.strip() and h.strip():
            keep_idx.append(i)
            keep_pairs.append((p, h))

    scores = [0.0] * len(pairs)
    if keep_pairs:
        raw = self._load().predict(keep_pairs)
        for i, s in zip(keep_idx, raw, strict=True):
            scores[i] = float(s)
    return scores

MiniCheckVerifier¶

MiniCheckVerifier ¶

MiniCheckVerifier(model_name: str = 'lytang/MiniCheck-Flan-T5-Large', device: str | None = None, max_input_length: int = 2048)

NLI verifier backed by MiniCheck-Flan-T5-Large.

Parameters¶

model_name: HuggingFace id. Default "lytang/MiniCheck-Flan-T5-Large".

device: "cpu", "cuda", "mps", or None to autodetect.

max_input_length: Token cap for the (premise, hypothesis) concatenation. Long premises get truncated from the right (preserving the claim text); 2048 covers >95% of RAGTruth contexts.

Source code in src/verifiable_rag/verifiers/minicheck.py

def __init__(
    self,
    model_name: str = "lytang/MiniCheck-Flan-T5-Large",
    device: str | None = None,
    max_input_length: int = 2048,
) -> None:
    self._model_name = model_name
    self._device = device
    self._max_input_length = max_input_length
    self._tokenizer: Any = None
    self._model: Any = None
    self._yes_id: int | None = None
    self._no_id: int | None = None

score_pairs ¶

score_pairs(pairs: list[tuple[str, str]]) -> list[float]

Score a batch of (premise, hypothesis) pairs.

Empty premise OR empty hypothesis scores 0.0; the rest go through MiniCheck in one batched forward pass.

Source code in src/verifiable_rag/verifiers/minicheck.py

def score_pairs(self, pairs: list[tuple[str, str]]) -> list[float]:
    """Score a batch of ``(premise, hypothesis)`` pairs.

    Empty premise OR empty hypothesis scores 0.0; the rest go
    through MiniCheck in one batched forward pass.
    """
    if not pairs:
        return []

    keep_idx: list[int] = []
    keep_pairs: list[tuple[str, str]] = []
    for i, (p, h) in enumerate(pairs):
        if p.strip() and h.strip():
            keep_idx.append(i)
            keep_pairs.append((p, h))

    scores = [0.0] * len(pairs)
    if not keep_pairs:
        return scores

    tokenizer, model = self._load()
    import torch

    texts = [f"premise: {p} hypothesis: {h}" for p, h in keep_pairs]
    inputs = tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=self._max_input_length,
    )
    if self._device is not None:
        inputs = {k: v.to(self._device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=1,
            output_scores=True,
            return_dict_in_generate=True,
            do_sample=False,
        )
    # outputs.scores is a tuple of length max_new_tokens; first
    # element has shape (batch, vocab).
    first_token_logits = outputs.scores[0]
    assert self._yes_id is not None and self._no_id is not None
    two_class = torch.stack(
        [first_token_logits[:, self._no_id], first_token_logits[:, self._yes_id]],
        dim=-1,
    )
    probs = torch.softmax(two_class, dim=-1)[:, 1]
    raw = probs.detach().cpu().tolist()

    for idx, s in zip(keep_idx, raw, strict=True):
        scores[idx] = float(s)
    return scores
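
The final scoring step restricts the first generated token's logits to two entries (the "no" and "yes" token ids) and takes a softmax over just those two. A small pure-Python sketch of that arithmetic, with made-up logit values and without torch, shows that the result equals the sigmoid of the logit difference:

```python
import math


def yes_probability(no_logit: float, yes_logit: float) -> float:
    """Two-way softmax over (no, yes), returning the 'yes' share."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(no_logit, yes_logit)
    e_no = math.exp(no_logit - m)
    e_yes = math.exp(yes_logit - m)
    return e_yes / (e_no + e_yes)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Made-up logits for illustration: "yes" leads "no" by 2.0.
p_yes = yes_probability(no_logit=1.0, yes_logit=3.0)  # about 0.88
p_even = yes_probability(no_logit=0.5, yes_logit=0.5)  # exactly 0.5
```

Because every other vocabulary entry is ignored, the score depends only on how far the "yes" logit sits above the "no" logit.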

LLMJudgeVerifier¶

LLMJudgeVerifier ¶

LLMJudgeVerifier(model: str = 'claude-haiku-4-5-20251001', temperature: float = 0.0, max_tokens: int = 128, max_workers: int = 8, num_retries: int = 2, system_prompt: str = _DEFAULT_SYSTEM)

LLM-as-judge faithfulness scorer.

Parameters¶

model: LiteLLM model identifier. Default "claude-haiku-4-5-20251001" (cheap, fast, capable enough for most claims).

temperature: Sampling temperature. Default 0.0 for deterministic judgments, since calibration depends on reproducibility.

max_tokens: Cap on the response. JSON output is tiny (~30 tokens), so 64 would be plenty; the default of 128 leaves headroom.

max_workers: ThreadPoolExecutor size for concurrent litellm.completion calls. Default 8; beyond that, provider rate limits dominate.

num_retries: LiteLLM's built-in retry count for transient errors.

system_prompt: Override the default fact-checker instruction.

Source code in src/verifiable_rag/verifiers/llm_judge.py

def __init__(
    self,
    model: str = "claude-haiku-4-5-20251001",
    temperature: float = 0.0,
    max_tokens: int = 128,
    max_workers: int = 8,
    num_retries: int = 2,
    system_prompt: str = _DEFAULT_SYSTEM,
) -> None:
    self._model = model
    self._temperature = temperature
    self._max_tokens = max_tokens
    self._max_workers = max_workers
    self._num_retries = num_retries
    self._system_prompt = system_prompt

score_pairs ¶

score_pairs(pairs: list[tuple[str, str]]) -> list[float]

Score a batch of (premise, hypothesis) pairs.

Empty premise or hypothesis scores 0.0. Each surviving pair is an independent LLM call, dispatched via a ThreadPoolExecutor so a batch of 32 doesn't take 32× the per-call latency.

Source code in src/verifiable_rag/verifiers/llm_judge.py

def score_pairs(self, pairs: list[tuple[str, str]]) -> list[float]:
    """Score a batch of ``(premise, hypothesis)`` pairs.

    Empty premise or hypothesis scores 0.0. Each surviving pair is
    an independent LLM call, dispatched via a ThreadPoolExecutor so
    a batch of 32 doesn't take 32× the per-call latency.
    """
    if not pairs:
        return []

    keep_idx: list[int] = []
    keep_pairs: list[tuple[str, str]] = []
    for i, (p, h) in enumerate(pairs):
        if p.strip() and h.strip():
            keep_idx.append(i)
            keep_pairs.append((p, h))

    scores = [0.0] * len(pairs)
    if not keep_pairs:
        return scores

    if self._max_workers > 1 and len(keep_pairs) > 1:
        with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
            raw = list(pool.map(self._score_one, keep_pairs))
    else:
        raw = [self._score_one(pair) for pair in keep_pairs]

    for idx, s in zip(keep_idx, raw, strict=True):
        scores[idx] = float(s)
    return scores
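
The dispatch above relies on `ThreadPoolExecutor.map` returning results in input order even when calls finish out of order, which is what keeps `raw` aligned with `keep_idx`. The sketch below demonstrates that property with `fake_judge`, a hypothetical stand-in for the per-pair LLM call that sleeps instead of hitting a provider.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_judge(pair: tuple[str, str]) -> float:
    """Hypothetical stand-in for one LLM call; no network access involved."""
    premise, hypothesis = pair
    # Make the first item the slowest so it finishes after the others.
    time.sleep(0.05 if hypothesis == "slow" else 0.0)
    return 1.0 if hypothesis in premise else 0.0


judge_pairs = [("a slow claim", "slow"), ("fast", "fast"), ("x", "y")]

with ThreadPoolExecutor(max_workers=8) as pool:
    # map() yields results in the order of the inputs, not completion order.
    judged = list(pool.map(fake_judge, judge_pairs))
# judged -> [1.0, 1.0, 0.0]
```

Threads suit this workload because each call spends nearly all of its time waiting on I/O rather than computing.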

EnsembleScorer¶

EnsembleScorer ¶

EnsembleScorer(scorers: list['NLIScorer'], aggregation: str = 'min')

Combine multiple NLIScorer instances into one.

Parameters¶

scorers: Two or more objects satisfying NLIScorer.

aggregation: How to collapse per-pair scores across scorers. "min" (default) matches HALT-RAG and our RAGTruth-published configuration.

Source code in src/verifiable_rag/verifiers/ensemble.py

def __init__(
    self,
    scorers: list["NLIScorer"],
    aggregation: str = "min",
) -> None:
    if len(scorers) < 2:
        raise ValueError(
            f"EnsembleScorer requires ≥2 scorers, got {len(scorers)}"
        )
    if aggregation not in _AGGREGATIONS:
        raise ValueError(
            f"aggregation must be one of {sorted(_AGGREGATIONS)}, got "
            f"{aggregation!r}"
        )
    self._scorers = list(scorers)
    self._aggregation = aggregation