Verifiers
Verifier
Bases: Protocol
Post-hoc faithfulness verifier.
Decomposes each CitedSentence into atomic claims, checks each claim against its cited source span via NLI, and returns a VerificationResult per sentence.
In strict/paranoid modes the Pipeline will refuse unsupported sentences. In loose mode the verifier may be skipped entirely.
verify
verify(sentences: list[CitedSentence], documents: dict[str, Document]) -> list[VerificationResult]
Return one VerificationResult per CitedSentence, in the same order.
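As a rough illustration of the contract, the sketch below defines a structural protocol with the `verify` signature shown above and a toy implementation that satisfies it. The dataclass fields (`text`, `doc_id`, `score`) and the `SubstringVerifier` class are hypothetical stand-ins invented for this example; the library's real types may look different, and a real verifier would use NLI rather than substring matching.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Document:  # hypothetical stand-in for the library's Document
    text: str


@dataclass
class CitedSentence:  # hypothetical stand-in; field names are assumptions
    text: str
    doc_id: str


@dataclass
class VerificationResult:  # hypothetical stand-in; field names are assumptions
    is_supported: bool
    score: float


class Verifier(Protocol):
    def verify(
        self, sentences: list[CitedSentence], documents: dict[str, Document]
    ) -> list[VerificationResult]: ...


class SubstringVerifier:
    """Toy verifier: a sentence counts as supported only if it appears
    verbatim (case-insensitively) in its cited document."""

    def verify(self, sentences, documents):
        results = []
        for sentence in sentences:
            doc = documents.get(sentence.doc_id)
            found = doc is not None and sentence.text.lower() in doc.text.lower()
            results.append(
                VerificationResult(is_supported=found, score=1.0 if found else 0.0)
            )
        # One result per input sentence, in the same order.
        return results


docs = {"d1": Document("The Eiffel Tower is in Paris. It opened in 1889.")}
sentences = [
    CitedSentence("The Eiffel Tower is in Paris.", "d1"),
    CitedSentence("It opened in 1900.", "d1"),
]
results = SubstringVerifier().verify(sentences, docs)
```

Because `Verifier` is a `Protocol`, `SubstringVerifier` does not need to inherit from it; matching the method signature is enough for a type checker.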
NLIScorer Protocol
NLIScorer
Bases: Protocol
Raw (premise, hypothesis) → entailment-probability scoring.
A thinner interface than Verifier. Used by verifier-only
benchmarks (e.g. RAGTruth) that have pre-generated responses and
need only the underlying NLI signal, not the CitedSentence/Document
wrapping. Future verifiers (MiniCheck, DualNLI, LLM-judge) will all
expose this too, so the same runner can score them on equal terms.
score_pairs
Score a batch of (premise, hypothesis) pairs.
Returns one float in [0, 1] per pair (higher = more supported).
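A minimal sketch of what satisfying this protocol might look like. The `score_pairs` signature is not shown on this page, so the one used here (`list[tuple[str, str]]` in, `list[float]` out) is an assumption, and `TokenOverlapScorer` is a hypothetical toy that measures word overlap rather than real entailment.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class NLIScorer(Protocol):
    # Assumed signature: one (premise, hypothesis) tuple per pair.
    def score_pairs(self, pairs: list[tuple[str, str]]) -> list[float]: ...


class TokenOverlapScorer:
    """Toy scorer: fraction of hypothesis words that also occur in the premise."""

    def score_pairs(self, pairs):
        scores = []
        for premise, hypothesis in pairs:
            hyp_tokens = set(hypothesis.lower().split())
            if not hyp_tokens:
                scores.append(0.0)
                continue
            prem_tokens = set(premise.lower().split())
            # Always lands in [0, 1]; higher means more of the claim is covered.
            scores.append(len(hyp_tokens & prem_tokens) / len(hyp_tokens))
        return scores


scorer = TokenOverlapScorer()
scores = scorer.score_pairs(
    [
        ("the cat sat on the mat", "the cat sat"),
        ("the cat sat on the mat", "dogs bark"),
    ]
)
```

Any object with a compatible `score_pairs` method can then be handed to a benchmark runner without going through the CitedSentence/Document path.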
DualNLIVerifier
DualNLIVerifier
Two-scorer NLI ensemble implementing the Verifier protocol.
Parameters
scorer_a, scorer_b:
Any objects implementing NLIScorer. Typically
HHEMVerifier() and MiniCheckVerifier().
aggregation:
How to collapse the two scorer outputs per pair. "min"
(default) flags an example if either scorer falls below the
threshold; this matches HALT-RAG and our published RAGTruth config.
"mean" and "max" are also supported for ablations.
threshold:
Cutoff for is_supported on each VerificationResult. The default,
0.0562, was fit on RAGTruth-train with min aggregation
across HHEM + MiniCheck. If your domain differs, re-fit it on your own data via
scripts/compute_calibrated_metrics.py.
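To make the aggregation and threshold semantics concrete, here is a small sketch under stated assumptions: the helper names `aggregate` and `is_supported` are hypothetical, and treating a score exactly equal to the threshold as supported (`>=`) is a guess, not something this page confirms.

```python
from statistics import mean

# The three aggregation modes named above, applied to one pair of scores.
AGGREGATORS = {"min": min, "mean": mean, "max": max}


def aggregate(scores_a, scores_b, aggregation="min"):
    """Collapse two per-pair score lists into one list of combined scores."""
    combine = AGGREGATORS[aggregation]
    return [float(combine((a, b))) for a, b in zip(scores_a, scores_b)]


def is_supported(scores, threshold=0.0562):
    """Assumed comparison: a score at or above the threshold counts as supported."""
    return [score >= threshold for score in scores]


scores_a = [0.9, 0.02, 0.5]  # e.g. output of the first scorer
scores_b = [0.8, 0.7, 0.01]  # e.g. output of the second scorer

combined_min = aggregate(scores_a, scores_b, "min")
combined_max = aggregate(scores_a, scores_b, "max")
flags_min = is_supported(combined_min)
flags_max = is_supported(combined_max)
```

With "min", a pair is rejected as soon as one scorer drops below the cutoff, which is why the second and third pairs fail here even though the other scorer rates them highly; with "max" all three pass.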
Source code in src/verifiable_rag/verifiers/dual_nli.py
verify
verify(sentences: list[CitedSentence], documents: dict[str, Document]) -> list[VerificationResult]
Return one VerificationResult per CitedSentence, in input order.
score_pairs
Raw (premise, hypothesis) scoring, used by the RAGTruth runner.
HHEMVerifier
HHEMVerifier
HHEMVerifier(model_name: str = 'vectara/hallucination_evaluation_model', threshold: float = 0.3, device: str | None = None)
Sentence-level NLI verifier backed by HHEM-2.1-open.
Parameters
model_name:
HuggingFace model id. Default
"vectara/hallucination_evaluation_model" (HHEM-2.1-open, ~600M).
threshold:
Cutoff for the boolean is_supported flag on each
VerificationResult. Pipeline surgical correction uses this flag to
decide which sentences to keep. Default 0.3, chosen empirically:
HHEM was trained on tight summarization-style
entailment, but our LLM produces paraphrased or synthesized cited
sentences that legitimately score in the 0.3-0.6 range. Raise it
toward 0.5 or higher for stricter behavior; lower it for more permissive behavior.
device:
"cpu", "cuda", "mps", or None to autodetect.
Source code in src/verifiable_rag/verifiers/hhem.py
verify
verify(sentences: list[CitedSentence], documents: dict[str, Document]) -> list[VerificationResult]
Return one VerificationResult per CitedSentence, in input order.
score_pairs
Score a batch of (premise, hypothesis) pairs.
Used by verifier-only runners (e.g. RAGTruth) that bypass the CitedSentence/Document path. Empty premise OR empty hypothesis scores 0.0; everything else goes through HHEM in one batch call.
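The empty-input rule and the single batch call can be sketched as follows. This is not the library's code: `score_with_empty_guard` and `fake_batch_score` are hypothetical names, the real model call is replaced by a stub, and treating whitespace-only strings as empty is an extra assumption.

```python
from typing import Callable

Pair = tuple[str, str]


def score_with_empty_guard(
    pairs: list[Pair], batch_score: Callable[[list[Pair]], list[float]]
) -> list[float]:
    """Score pairs, short-circuiting empty inputs to 0.0.

    Only pairs with a non-empty premise and hypothesis are sent to
    `batch_score`, and they are sent together in one call.
    """
    scores = [0.0] * len(pairs)
    keep = [i for i, (p, h) in enumerate(pairs) if p.strip() and h.strip()]
    if keep:
        batch = [pairs[i] for i in keep]
        for i, score in zip(keep, batch_score(batch)):
            scores[i] = float(score)  # write back at the original position
    return scores


def fake_batch_score(batch: list[Pair]) -> list[float]:
    # Stub standing in for the real model forward pass.
    return [0.5] * len(batch)


out = score_with_empty_guard(
    [("premise", "claim"), ("", "claim"), ("premise", ""), ("p2", "c2")],
    fake_batch_score,
)
```

Writing results back by index keeps the output aligned with the input order even though some pairs never reach the model.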
MiniCheckVerifier
MiniCheckVerifier
MiniCheckVerifier(model_name: str = 'lytang/MiniCheck-Flan-T5-Large', device: str | None = None, max_input_length: int = 2048)
NLI verifier backed by MiniCheck-Flan-T5-Large.
Parameters
model_name:
HuggingFace id. Default "lytang/MiniCheck-Flan-T5-Large".
device:
"cpu", "cuda", "mps", or None to autodetect.
max_input_length:
Token cap for the (premise, hypothesis) concatenation. Over-long
premises are truncated at their right end, so the claim
text is preserved; 2048 covers >95% of RAGTruth contexts.
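One way to picture the truncation rule is the sketch below. It is an illustration only: `truncate_premise` is a hypothetical helper, tokens are modelled as plain list items rather than real tokenizer ids, and any special tokens or prompt template the model adds are ignored.

```python
def truncate_premise(premise_tokens, hypothesis_tokens, max_input_length=2048):
    """Trim the right end of the premise so premise + hypothesis fits the cap.

    The hypothesis (the claim) is never shortened; only the premise gives way.
    """
    budget = max(max_input_length - len(hypothesis_tokens), 0)
    return premise_tokens[:budget], hypothesis_tokens


premise = "alpha beta gamma delta epsilon zeta eta theta iota kappa".split()
claim = "gamma follows beta".split()

kept_premise, kept_claim = truncate_premise(premise, claim, max_input_length=8)
```

With a cap of 8 and a 3-token claim, the premise keeps its first 5 tokens and the claim is untouched.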
Source code in src/verifiable_rag/verifiers/minicheck.py
score_pairs
Score a batch of (premise, hypothesis) pairs.
Empty premise OR empty hypothesis scores 0.0; the rest go through MiniCheck in one batched forward pass.
LLMJudgeVerifier
LLMJudgeVerifier
LLMJudgeVerifier(model: str = 'claude-haiku-4-5-20251001', temperature: float = 0.0, max_tokens: int = 128, max_workers: int = 8, num_retries: int = 2, system_prompt: str = _DEFAULT_SYSTEM)
LLM-as-judge faithfulness scorer.
Parameters
model:
LiteLLM model identifier. Default "claude-haiku-4-5-20251001"
(cheap, fast, capable enough for most claims).
temperature:
Sampling temperature. Default 0.0 for deterministic judgments,
since calibration depends on reproducibility.
max_tokens:
Cap on the response. The JSON output is tiny (~30 tokens), so 64 would
be plenty; the default of 128 leaves headroom.
max_workers:
ThreadPoolExecutor size for concurrent litellm.completion
calls. Default 8; beyond that, provider rate limits dominate.
num_retries:
LiteLLM's built-in retry count for transient errors.
system_prompt:
Override the default fact-checker instruction.
Source code in src/verifiable_rag/verifiers/llm_judge.py
score_pairs
Score a batch of (premise, hypothesis) pairs.
Empty premise or hypothesis scores 0.0. Each surviving pair is an independent LLM call, dispatched via a ThreadPoolExecutor so a batch of 32 doesn't take 32× the per-call latency.
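The dispatch pattern described above might look roughly like this. It is a sketch, not the library's implementation: `score_pairs_concurrently` and `fake_judge` are hypothetical, and the per-pair LLM call is replaced by a local stub so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def score_pairs_concurrently(pairs, judge_one, max_workers=8):
    """Run one independent judge call per non-empty pair on a thread pool."""
    scores = [0.0] * len(pairs)  # empty pairs keep this default
    live = [(i, p, h) for i, (p, h) in enumerate(pairs) if p and h]
    if not live:
        return scores
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in submission order, so zipping
        # against `live` restores each score to its original index.
        results = pool.map(lambda item: judge_one(item[1], item[2]), live)
        for (i, _, _), score in zip(live, results):
            scores[i] = float(score)
    return scores


def fake_judge(premise: str, hypothesis: str) -> float:
    # Stub standing in for a single LLM call.
    return 1.0 if hypothesis in premise else 0.0


judged = score_pairs_concurrently(
    [("a b c", "b"), ("", "x"), ("a", "z")], fake_judge
)
```

Threads suit this workload because each call spends most of its time waiting on the network rather than using the CPU.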
EnsembleScorer
EnsembleScorer
Combine multiple NLIScorer instances into one.
Parameters
scorers:
Two or more objects satisfying NLIScorer.
aggregation:
How to collapse per-pair scores across scorers. "min" (default)
matches HALT-RAG and our RAGTruth-published configuration.
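For intuition, a class of this shape could be sketched as below. Everything here is an assumption made for illustration: `EnsembleSketch` and `ConstantScorer` are hypothetical, the `score_pairs` signature is guessed, and only "min" is confirmed as a supported aggregation on this page ("mean" and "max" are included by analogy with DualNLIVerifier).

```python
from statistics import mean


class ConstantScorer:
    """Hypothetical scorer that returns fixed scores, for demonstration."""

    def __init__(self, values):
        self.values = list(values)

    def score_pairs(self, pairs):
        return self.values[: len(pairs)]


class EnsembleSketch:
    _AGGREGATORS = {"min": min, "mean": mean, "max": max}

    def __init__(self, scorers, aggregation="min"):
        if len(scorers) < 2:
            raise ValueError("an ensemble needs two or more scorers")
        self.scorers = list(scorers)
        self.combine = self._AGGREGATORS[aggregation]

    def score_pairs(self, pairs):
        # One score list per scorer, then aggregate column-wise per pair.
        per_scorer = [scorer.score_pairs(pairs) for scorer in self.scorers]
        return [float(self.combine(column)) for column in zip(*per_scorer)]


pairs = [("premise one", "claim one"), ("premise two", "claim two")]
members = [
    ConstantScorer([0.9, 0.2]),
    ConstantScorer([0.8, 0.6]),
    ConstantScorer([0.7, 0.4]),
]
ensemble_min = EnsembleSketch(members, aggregation="min").score_pairs(pairs)
ensemble_max = EnsembleSketch(members, aggregation="max").score_pairs(pairs)
```

Since the ensemble itself exposes `score_pairs`, it can be dropped in wherever a single scorer is expected, including inside another ensemble.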