Swap LLM provider¶
The library uses LiteLLM to route generator calls, which means you can use any provider LiteLLM supports — Anthropic (default), OpenAI, Google Gemini, Groq, Together, local Ollama, vLLM endpoints, and dozens more.
Switching the generator model¶
The default is Claude Haiku 4.5. Override via the generator_model kwarg on any preset:
import verifiable_rag
answer = verifiable_rag.ask(
"What did the authors find?",
docs="paper.pdf",
preset="hybrid_balanced",
generator_model="openai/gpt-4o-mini",
)
Or when building a Pipeline directly:
from verifiable_rag import Pipeline
from verifiable_rag.generators import ConstrainedCitedGenerator
pipeline = Pipeline(
...,
generator=ConstrainedCitedGenerator(model="openai/gpt-4o-mini"),
)
Setting the API key¶
LiteLLM looks for provider-specific env vars:
| Provider | Env var |
|---|---|
| Anthropic | ANTHROPIC_API_KEY |
| OpenAI | OPENAI_API_KEY |
| Google Gemini | GEMINI_API_KEY |
| Groq | GROQ_API_KEY |
| Together | TOGETHER_API_KEY |
| Cohere | COHERE_API_KEY (also used by the embedder/reranker) |
For the full list, see LiteLLM provider docs.
Recommended models per use case¶
Best citation quality (constrained generator)¶
ConstrainedCitedGenerator requires structured-output support. These all work:
anthropic/claude-haiku-4-5⭐ (default — best cost/quality)anthropic/claude-sonnet-4-6(stronger, more expensive — used byhybrid_paranoidpreset)openai/gpt-4o-miniopenai/gpt-4ogemini/gemini-1.5-flashgemini/gemini-1.5-pro
Best cost (prompted generator)¶
PromptedCitedGenerator works with any LiteLLM model, including ones without structured output. Useful for:
groq/llama-3.3-70b-versatile(very fast, free tier available)ollama/llama3.1(local, free, no API key needed)together/Qwen/Qwen2-72B-Instruct(open weights, hosted)
from verifiable_rag.generators import PromptedCitedGenerator
pipeline = Pipeline(
...,
generator=PromptedCitedGenerator(model="groq/llama-3.3-70b-versatile"),
)
Local-only (no API keys)¶
Use Ollama or a local vLLM endpoint with LiteLLM's api_base override:
Then start Ollama:
See Local-only setup for the full air-gapped recipe.
Switching the verifier's LLM judge¶
If you're using LLMJudgeVerifier instead of (or in addition to) the NLI verifiers:
from verifiable_rag.verifiers import LLMJudgeVerifier
verifier = LLMJudgeVerifier(
model="anthropic/claude-sonnet-4-6", # default
max_workers=8,
temperature=0.0,
)
Same provider list applies. For most use cases, the dual NLI verifier is the better choice — see Verification.
Switching the Contextual Retrieval LLM¶
LLMContextualizer (used inside ContextualChunker) also runs on LiteLLM:
from verifiable_rag.chunkers import LLMContextualizer
contextualizer = LLMContextualizer(
model="claude-haiku-4-5-20251001", # default — cheapest with prompt caching
)
For local Contextual Retrieval, swap to Ollama. Quality is noticeably lower per-preamble but the per-doc cost drops to zero.
Model pinning vs. version aliasing¶
LiteLLM accepts both:
| Form | Behavior |
|---|---|
anthropic/claude-haiku-4-5 |
Alias — Anthropic picks the latest 4.5 patch |
anthropic/claude-haiku-4-5-20251001 |
Pinned — exact model version |
For production, always pin. Silent model updates have shifted benchmark scores before, and the library's calibrated thresholds (HHEM 0.3, Dual NLI 0.0562) assume the same model that was calibrated against.
For prototyping, the alias is fine — you want the latest improvements.
Rate-limiting & retries¶
LiteLLM honors provider Retry-After headers and uses exponential backoff on transient errors. Each generator accepts a num_retries kwarg:
For LLMJudgeVerifier running large batches under sustained load, drop max_workers to 2–4 to avoid bursty overloaded_error responses.
When the swap silently doesn't work¶
Common failures and how to spot them:
- Empty answers from
ConstrainedCitedGenerator— your chosen model doesn't support structured output. Switch toPromptedCitedGeneratoror pick a supported model. - Garbage citations from
PromptedCitedGenerator— the model doesn't follow the in-prompt format reliably. Use a stronger model or switch toConstrainedCitedGeneratorif the model supports it. - Hallucinated cite IDs — a known failure mode for
PromptedCitedGeneratoron smaller models.ConstrainedCitedGeneratormakes this structurally impossible (cite IDs are drawn from an enum at decode time). - API errors propagate up — by design. The Pipeline catches per-question errors in the eval runner, but in production code wrap
pipeline.ask()in your own try/except.
For the trade-offs between generator types, see Citation flow.