Chunkers¶
Chunker
¶
Bases: Protocol
Split a Document into retrieval-sized Chunks while preserving sentence_ids.
Span preservation invariant: every Chunk must list the sentence_ids of every Sentence it contains. Chunking granularity and citation granularity are decoupled — do not collapse sentence IDs into chunk IDs.
ParentChildChunker¶
ParentChildChunker
¶
ParentChildChunker(max_child_tokens: int = 400, min_child_tokens: int = 0, token_count: Callable[[str], int] = _default_token_count)
Paragraph-level chunker that preserves sentence_ids and section pointers.
Output chunks satisfy:
- chunk.sentence_ids lists every Sentence inside the chunk in order.
- chunk.span is the bounding span over those sentences (bboxes merged).
- chunk.metadata includes section_id, section_title,
paragraph_id (first paragraph touched, for backward compat),
paragraph_ids (tuple of all paragraph IDs touched),
paragraph_part, paragraph_part_count,
page_first, page_last.
Source code in src/verifiable_rag/chunkers/parent_child.py
ContextualChunker¶
ContextualChunker
¶
ContextualChunker(base, contextualizer: LLMContextualizer | None = None, granularity: Granularity = 'section', group_by: Callable[[Chunk], str | None] | None = None, keep_on_failure: bool = True)
Wrap a base :class:Chunker and add contextual preambles to chunks.
Chunks sharing the same group key (section, paragraph, or
individual chunk — controlled by granularity) receive the same
preamble. One LLM call per unique group, not per chunk — the same
preamble is mapped onto every child chunk under it.
Parameters¶
base:
Any object satisfying the :class:Chunker protocol. Typically
:class:ParentChildChunker.
contextualizer:
Object exposing generate_batch(document_text, passage_texts).
Defaults to :class:LLMContextualizer with Haiku 4.5.
granularity:
Choose how many LLM calls per document — cost vs specificity tradeoff:
- ``"section"`` (default): one preamble per Section, shared across
every chunk under it. Typically 10-20 calls per academic paper.
Right for structured docs (papers, books, technical docs).
- ``"paragraph"``: one preamble per Paragraph. Typically 30-100 calls.
Right for mixed-topic docs or very long sections.
- ``"chunk"``: one preamble per child Chunk (Anthropic's original
recipe). Typically 100-500 calls. Max specificity, max cost.
Requires the base chunker to set ``metadata["section_id"]`` or
``metadata["paragraph_id"]`` accordingly. :class:`ParentChildChunker`
sets both.
group_by:
Power-user override — a callable mapping a Chunk to a group key.
When set, supersedes granularity. Chunks returning the same
key share a preamble; chunks where the callable returns None
get no preamble. Use this for non-standard document structures
(chat logs, code, custom topic clusters).
keep_on_failure:
If True (default), chunks whose contextualization failed are kept
with no preamble (they embed as their original text). If False,
the failed chunks are dropped from the returned list.
Default True — partial degradation beats total ingest failure.
Source code in src/verifiable_rag/chunkers/contextual.py
LLMContextualizer¶
LLMContextualizer
¶
LLMContextualizer(model: str = 'claude-haiku-4-5-20251001', temperature: float = 0.0, max_tokens: int = 160, max_workers: int = 8, num_retries: int = 2, system_prompt: str = _DEFAULT_SYSTEM, user_template: str = _USER_TEMPLATE)
Generate a per-chunk contextual preamble via a chat LLM.
Parameters¶
model:
LiteLLM model identifier. Default "claude-haiku-4-5-20251001"
— cheap, fast, and prompt-cacheable for the long document portion.
temperature:
Sampling temperature. Default 0.0 for deterministic preambles
(matters if you cache embeddings across re-ingests).
max_tokens:
Cap on preamble length. Default 160 leaves headroom for the
50-100 token target.
max_workers:
Threadpool concurrency for contextualizing many chunks of one
document in parallel. Anthropic Tier 2 → 8 is safe.
num_retries:
LiteLLM built-in retries for transient errors.
system_prompt, user_template:
Override the default prompts. user_template must contain
{document_text} and {chunk_text} placeholders.
Source code in src/verifiable_rag/chunkers/contextual.py
generate_batch
¶
Return one preamble per chunk_text, in order. None on failure.
All calls share the same document_text, so Anthropic prompt
caching applies — the first call writes the cache, subsequent
calls read it at ~10 % cost.
Source code in src/verifiable_rag/chunkers/contextual.py
ParentExpander¶
ParentExpander
¶
ParentExpander(strategy: Strategy = 'section_or_window', max_parent_tokens: int = 2000, window_paragraphs: int = 3, token_count: Callable[[str], int] = _default_token_count)
Returns the parent context text for a retrieved Chunk.
Source code in src/verifiable_rag/chunkers/parent_expander.py
expand
¶
Return the parent context string for chunk.
Source code in src/verifiable_rag/chunkers/parent_expander.py
embedding_text helper¶
embedding_text
¶
embedding_text(chunk: Chunk) -> str
Return the text to embed for chunk.
Uses metadata["contextual_preamble"] if present (Contextual
Retrieval recipe: preamble + blank line + original text), otherwise
falls back to the chunk's original text. Centralized here so any
Pipeline / Indexer / Embedder call site can use the same convention.