verifiable-rag
Document-grounded Q&A with sentence-level citations, NLI verification, and calibrated refusal.
A Python library for building RAG pipelines that actually verify what they say. Every generated sentence carries a tight pointer to the source span that supports it; every claim is NLI-checked against its cite; the system refuses cleanly when it isn't confident.
The one finding that drives the design
On RAGTruth (the canonical 2,700-example RAG hallucination benchmark), a dual NLI ensemble of two small open-source models matches a Claude Sonnet 4.6 LLM-judge — AUROC 0.844 vs 0.846 — at ~250× lower per-call cost.
That number is the whole pitch: you don't need a frontier-LLM judge to verify your RAG outputs. Two small NLI models, calibrated honestly and ensembled correctly, get you there at a small fraction of the per-call cost.
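As a rough illustration of what the comparison measures, the sketch below averages two verifier scores per sentence and computes AUROC by pairwise ranking. The scores and labels are made-up toy values, not RAGTruth data, and simple averaging is an assumption about how the ensemble combines its two models; the helper names `ensemble_score` and `auroc` are hypothetical and not part of the library.

```python
from statistics import mean


def ensemble_score(score_a: float, score_b: float) -> float:
    """Combine two support scores by plain averaging (an assumed rule)."""
    return mean([score_a, score_b])


def auroc(scores, labels):
    """AUROC as the probability that a positive outranks a negative.

    labels: 1 = supported sentence, 0 = hallucinated sentence.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy values for illustration only.
model_a = [0.9, 0.8, 0.4, 0.3, 0.6]
model_b = [0.7, 0.9, 0.5, 0.1, 0.2]
labels = [1, 1, 0, 0, 1]

combined = [ensemble_score(a, b) for a, b in zip(model_a, model_b)]
print(f"ensemble AUROC on toy data: {auroc(combined, labels):.3f}")
```

An AUROC of 1.0 means every supported sentence scores above every hallucinated one, and 0.5 is chance, so two verifiers with near-identical AUROC rank sentences about equally well regardless of where the decision threshold is later set.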
Quickstart
```python
import verifiable_rag
from verifiable_rag.demo import sample_paper_path

# Ask a question against the sample document and print the answer text.
answer = verifiable_rag.ask(
    "What is the mechanism of action of penicillin?",
    docs=sample_paper_path(),
)
print(answer.text)
```
For the full audit trail in a browser:
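```python
import verifiable_rag
from verifiable_rag.demo import sample_paper_path

# Same question, but also write the audit report to an HTML file.
verifiable_rag.ask(
    "What is the mechanism of action of penicillin?",
    docs=sample_paper_path(),
    output_html="audit.html",
)
```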
Why this exists
Every shipping "chat with your documents" product (NotebookLM, ChatPDF, Humata, Adobe Acrobat AI) stops at chunk-level citations and prompt-conditioned grounding. The 2024–2026 research literature — ReClaim, SAFE, HALT-RAG, MiniCheck — has solved sentence-span attribution and post-hoc faithfulness verification.
None of it has shipped in a usable library. That's the gap.
What this library does that others don't
- Sentence-level citations: every generated sentence traces back to exact source spans `(doc_id, page, char_start, char_end)`, not a chunk-level handwave. Citation granularity is decoupled from chunk granularity.
- NLI-verified claims: each generated sentence is fact-checked against its cited span by a dual NLI ensemble (HHEM + MiniCheck). The verifier is calibrated on RAGTruth and matches a Sonnet-judge baseline.
- Calibrated refusal: the strictness slider (`loose` / `balanced` / `strict` / `paranoid`) maps to honest thresholds, not "say I don't know" prompting. Unsupported sentences get flagged or surgically removed; truly uncertain answers get refused.
- Fully auditable: every `Answer` exposes its full audit trail programmatically (`unsupported_sentences`, `audit_trail()`, per-sentence NLI scores) and renders a self-contained HTML report on demand.
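To make the ideas in the list above concrete, here is a minimal, self-contained sketch of a citation span, a strictness-to-threshold mapping, and a flag-or-refuse decision. It does not use the library and is not its implementation: `SourceSpan`, `VerifiedSentence`, `THRESHOLDS`, `review`, the threshold values, and the "refuse when more than half the sentences are unsupported" rule are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSpan:
    """Pointer to the exact supporting text, independent of chunk boundaries."""
    doc_id: str
    page: int
    char_start: int
    char_end: int


@dataclass
class VerifiedSentence:
    text: str
    span: SourceSpan
    support: float  # ensemble support score in [0, 1]


# Hypothetical thresholds; the library's calibrated values are not shown here.
THRESHOLDS = {"loose": 0.3, "balanced": 0.5, "strict": 0.7, "paranoid": 0.9}


def review(sentences, strictness="balanced", max_unsupported_fraction=0.5):
    """Drop unsupported sentences, or refuse if too many are unsupported."""
    threshold = THRESHOLDS[strictness]
    kept = [s for s in sentences if s.support >= threshold]
    unsupported = [s for s in sentences if s.support < threshold]
    refused = (
        not sentences
        or len(unsupported) / len(sentences) > max_unsupported_fraction
    )
    return {
        "refused": refused,
        "kept": [] if refused else kept,
        "unsupported": unsupported,
    }


span = SourceSpan(doc_id="paper-1", page=3, char_start=120, char_end=245)
draft = [
    VerifiedSentence("Penicillin inhibits cell wall synthesis.", span, 0.95),
    VerifiedSentence("It binds penicillin-binding proteins.", span, 0.60),
    VerifiedSentence("It was first synthesized in 1850.", span, 0.20),
]

balanced = review(draft, "balanced")
paranoid = review(draft, "paranoid")
print(len(balanced["kept"]), balanced["refused"], paranoid["refused"])
```

With these toy scores, the `balanced` setting keeps two sentences and removes one, while `paranoid` leaves too little supported text and refuses. The point of the design is that the same scored sentences yield different outcomes purely by moving the threshold, with no change to the prompt.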
Published benchmark results
| Benchmark | Headline |
|---|---|
| ALCE | Constrained decoding beats prompted by +4–7 F1 under dual-LLM-judge cross-validation |
| RAGTruth | Dual NLI ensemble matches a Sonnet 4.6 judge (AUROC 0.844 vs 0.846) at ~1/250 the per-call cost |
| LitQA2 | Constrained decoding lifts MC accuracy; contextual retrieval is a null result on saturated retrieval |
Status
v0.5 (alpha). The public API is stabilizing, but interfaces may still change in 0.x releases. The library is usable end-to-end with five presets, YAML config, a top-level `ask()` one-liner, and a full audit-trail UX.