Swap LLM provider

The library uses LiteLLM to route generator calls, which means you can use any provider LiteLLM supports — Anthropic (default), OpenAI, Google Gemini, Groq, Together, local Ollama, vLLM endpoints, and dozens more.

Switching the generator model

The default is Claude Haiku 4.5. Override via the generator_model kwarg on any preset:

import verifiable_rag

answer = verifiable_rag.ask(
    "What did the authors find?",
    docs="paper.pdf",
    preset="hybrid_balanced",
    generator_model="openai/gpt-4o-mini",
)

Or when building a Pipeline directly:

from verifiable_rag import Pipeline
from verifiable_rag.generators import ConstrainedCitedGenerator

pipeline = Pipeline(
    ...,
    generator=ConstrainedCitedGenerator(model="openai/gpt-4o-mini"),
)

Setting the API key

LiteLLM looks for provider-specific env vars:

Provider        Env var
Anthropic       ANTHROPIC_API_KEY
OpenAI          OPENAI_API_KEY
Google Gemini   GEMINI_API_KEY
Groq            GROQ_API_KEY
Together        TOGETHER_API_KEY
Cohere          COHERE_API_KEY (also used by the embedder/reranker)

For the full list, see LiteLLM provider docs.
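
A missing key usually surfaces only when the first request fails, so it can help to check up front. The sketch below is a hypothetical helper, not part of verifiable_rag or LiteLLM: it maps the provider prefix of a model string to the env var from the table above and reports which one is unset. The prefixes are taken from the model strings used on this page.

```python
import os
from typing import Mapping, Optional

# Provider prefix -> env var, copied from the table above.
# The prefixes are the ones used in this page's model strings.
PROVIDER_ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "groq": "GROQ_API_KEY",
    "together": "TOGETHER_API_KEY",
}


def missing_api_key(model: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the name of the env var the model needs but lacks, else None.

    Hypothetical helper: providers not in the table (e.g. local Ollama)
    are treated as needing no key.
    """
    env = os.environ if env is None else env
    provider = model.split("/", 1)[0]
    var = PROVIDER_ENV_VARS.get(provider)
    if var is None:
        return None
    return None if env.get(var) else var
```

Calling `missing_api_key("openai/gpt-4o-mini")` before building the pipeline returns `"OPENAI_API_KEY"` when that variable is unset or empty, and `None` otherwise.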

Best citation quality (constrained generator)

ConstrainedCitedGenerator requires structured-output support. These all work:

  • anthropic/claude-haiku-4-5 ⭐ (default — best cost/quality)
  • anthropic/claude-sonnet-4-6 (stronger, more expensive — used by hybrid_paranoid preset)
  • openai/gpt-4o-mini
  • openai/gpt-4o
  • gemini/gemini-1.5-flash
  • gemini/gemini-1.5-pro

Best cost (prompted generator)

PromptedCitedGenerator works with any LiteLLM model, including ones without structured output. Useful for:

  • groq/llama-3.3-70b-versatile (very fast, free tier available)
  • ollama/llama3.1 (local, free, no API key needed)
  • together/Qwen/Qwen2-72B-Instruct (open weights, hosted)

from verifiable_rag import Pipeline
from verifiable_rag.generators import PromptedCitedGenerator

pipeline = Pipeline(
    ...,
    generator=PromptedCitedGenerator(model="groq/llama-3.3-70b-versatile"),
)
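
If the model name comes from configuration, the choice between the two generators can be made from the supported-model list above. This is a minimal sketch with a hypothetical helper; the set contains only the models this page lists, and it assumes a pinned ID is the alias plus a `-YYYYMMDD` suffix.

```python
import re

# Models this page lists as working with ConstrainedCitedGenerator.
STRUCTURED_OUTPUT_MODELS = frozenset({
    "anthropic/claude-haiku-4-5",
    "anthropic/claude-sonnet-4-6",
    "openai/gpt-4o-mini",
    "openai/gpt-4o",
    "gemini/gemini-1.5-flash",
    "gemini/gemini-1.5-pro",
})


def generator_kind(model: str) -> str:
    """Return "constrained" for listed models, "prompted" for everything else."""
    # Strip a trailing date pin (assumed format: -YYYYMMDD) before the lookup.
    alias = re.sub(r"-\d{8}$", "", model)
    return "constrained" if alias in STRUCTURED_OUTPUT_MODELS else "prompted"
```

A caller would then construct ConstrainedCitedGenerator or PromptedCitedGenerator depending on the returned string. Defaulting unknown models to "prompted" is the conservative choice, since that generator works with any LiteLLM model.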

Local-only (no API keys)

Use Ollama or a local vLLM endpoint with LiteLLM's api_base override:

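from verifiable_rag.generators import PromptedCitedGenerator

generator = PromptedCitedGenerator(
    model="ollama/llama3.1",
    api_base="http://localhost:11434",  # Ollama's default local address
)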

The Ollama server must be running before the first request. Start it, then pull the model from a second terminal:

ollama serve
ollama pull llama3.1

See Local-only setup for the full air-gapped recipe.
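
Before pointing a generator at the local endpoint, a quick reachability check can save a confusing connection error later. The sketch below uses only the standard library; `ollama_is_up` is a hypothetical helper, and it assumes the server answers a plain GET on its base URL with HTTP 200, as a default Ollama install does.

```python
import urllib.error
import urllib.request


def ollama_is_up(api_base: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers 200 at api_base, else False."""
    try:
        with urllib.request.urlopen(api_base, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a malformed URL all count as "down".
        return False
```

This only confirms that something is listening at `api_base`; it does not verify that the llama3.1 model has been pulled.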

Switching the verifier's LLM judge

If you're using LLMJudgeVerifier instead of (or in addition to) the NLI verifiers, set its model the same way:

from verifiable_rag.verifiers import LLMJudgeVerifier

verifier = LLMJudgeVerifier(
    model="anthropic/claude-sonnet-4-6",  # default
    max_workers=8,
    temperature=0.0,
)

Same provider list applies. For most use cases, the dual NLI verifier is the better choice — see Verification.

Switching the Contextual Retrieval LLM

LLMContextualizer (used inside ContextualChunker) also runs on LiteLLM:

from verifiable_rag.chunkers import LLMContextualizer

contextualizer = LLMContextualizer(
    model="claude-haiku-4-5-20251001",  # default — cheapest with prompt caching
)

For local Contextual Retrieval, pass an Ollama model instead. Each generated preamble is noticeably lower in quality, but the per-document cost drops to zero.

Model pinning vs. version aliasing

LiteLLM accepts both:

Form                                  Behavior
anthropic/claude-haiku-4-5            Alias: Anthropic picks the latest 4.5 patch
anthropic/claude-haiku-4-5-20251001   Pinned: exact model version

For production, always pin. Silent model updates have shifted benchmark scores before, and the library's calibrated thresholds (HHEM 0.3, Dual NLI 0.0562) are only valid for the exact model version they were calibrated against.

For prototyping, the alias is fine — you want the latest improvements.
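
To enforce pinning in production config, a small guard can reject aliases at startup. This is a sketch under one assumption: a pinned ID ends in an eight-digit date suffix, as in the Anthropic example above. Other providers use different versioning schemes, so adapt the pattern before relying on it. `is_pinned` is a hypothetical helper, not part of the library.

```python
import re

# Assumed pin format: trailing -YYYYMMDD, as in claude-haiku-4-5-20251001.
_DATE_SUFFIX = re.compile(r"-\d{8}$")


def is_pinned(model: str) -> bool:
    """Return True if the model string ends in an eight-digit date suffix."""
    return bool(_DATE_SUFFIX.search(model))
```

A deployment script could then raise an error whenever `is_pinned(generator_model)` is false.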

Rate-limiting & retries

LiteLLM honors provider Retry-After headers and uses exponential backoff on transient errors. Each generator accepts a num_retries kwarg:

from verifiable_rag.generators import PromptedCitedGenerator
generator = PromptedCitedGenerator(model="anthropic/claude-haiku-4-5", num_retries=5)

For LLMJudgeVerifier running large batches under sustained load, drop max_workers to 2–4 to avoid bursty overloaded_error responses.
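
To get a feel for how long num_retries=5 can stall a call, here is an illustrative retry schedule. It is not LiteLLM's actual implementation: the base delay, growth factor, and cap are assumed values, and real clients typically add random jitter.

```python
def retry_delay(attempt, retry_after=None, base=1.0, factor=2.0, cap=30.0):
    """Seconds to wait before retry number `attempt` (0-based).

    A server-provided Retry-After value wins; otherwise the delay grows
    exponentially and is capped. All defaults here are illustrative.
    """
    if retry_after is not None:
        return float(retry_after)
    return min(cap, base * factor ** attempt)


# Delays for five retries with no Retry-After header.
schedule = [retry_delay(attempt) for attempt in range(5)]
```

With these assumed defaults the five waits sum to 31 seconds, which is worth keeping in mind when setting request timeouts around a pipeline call.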

When the swap silently doesn't work

Common failures and how to spot them:

  • Empty answers from ConstrainedCitedGenerator — your chosen model doesn't support structured output. Switch to PromptedCitedGenerator or pick a supported model.
  • Garbage citations from PromptedCitedGenerator — the model doesn't follow the in-prompt format reliably. Use a stronger model or switch to ConstrainedCitedGenerator if the model supports it.
  • Hallucinated cite IDs — a known failure mode for PromptedCitedGenerator on smaller models. ConstrainedCitedGenerator makes this structurally impossible (cite IDs are drawn from an enum at decode time).
  • API errors propagate to the caller by design. Per-question errors are caught only in the eval runner, so in production code wrap pipeline.ask() in your own try/except.
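
For the last point, a thin wrapper is often enough. The sketch below takes any callable, so it does not depend on the exact signature of pipeline.ask(); passing the question as the only argument is an assumption. It catches Exception broadly for brevity, and production code would more likely catch the specific provider errors it knows how to handle.

```python
import logging

logger = logging.getLogger(__name__)


def safe_ask(ask, question, fallback=None):
    """Call ask(question); log the traceback and return fallback if it raises."""
    try:
        return ask(question)
    except Exception:
        # Broad catch for illustration only; narrow this in real code.
        logger.exception("ask() failed for question: %r", question)
        return fallback
```

Usage would look like `safe_ask(pipeline.ask, "What did the authors find?")`, with the caller deciding how to present a `None` result.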

For the trade-offs between generator types, see Citation flow.