Chunkers¶

Chunker ¶

Bases: Protocol

Split a Document into retrieval-sized Chunks while preserving sentence_ids.

Span preservation invariant: every Chunk must list the sentence_ids of every Sentence it contains. Chunking granularity and citation granularity are decoupled — do not collapse sentence IDs into chunk IDs.

chunk ¶

chunk(document: Document) -> list[Chunk]

Return chunks for all sentences in document.

Source code in src/verifiable_rag/chunkers/__init__.py

def chunk(self, document: Document) -> list[Chunk]:
    """Return chunks for all sentences in *document*."""
    ...

ParentChildChunker¶

ParentChildChunker ¶

ParentChildChunker(max_child_tokens: int = 400, min_child_tokens: int = 0, token_count: Callable[[str], int] = _default_token_count)

Paragraph-level chunker that preserves sentence_ids and section pointers.

Output chunks satisfy: - chunk.sentence_ids lists every Sentence inside the chunk in order. - chunk.span is the bounding span over those sentences (bboxes merged). - chunk.metadata includes section_id, section_title, paragraph_id (first paragraph touched, for backward compat), paragraph_ids (tuple of all paragraph IDs touched), paragraph_part, paragraph_part_count, page_first, page_last.

Source code in src/verifiable_rag/chunkers/parent_child.py

def __init__(
    self,
    max_child_tokens: int = 400,
    min_child_tokens: int = 0,
    token_count: Callable[[str], int] = _default_token_count,
) -> None:
    if max_child_tokens <= 0:
        raise ValueError(f"max_child_tokens must be > 0, got {max_child_tokens}")
    if min_child_tokens < 0:
        raise ValueError(f"min_child_tokens must be >= 0, got {min_child_tokens}")
    if min_child_tokens > max_child_tokens:
        raise ValueError(
            f"min_child_tokens ({min_child_tokens}) cannot exceed "
            f"max_child_tokens ({max_child_tokens})"
        )
    self._max = max_child_tokens
    self._min = min_child_tokens
    self._tok = token_count

ContextualChunker¶

ContextualChunker ¶

ContextualChunker(base, contextualizer: LLMContextualizer | None = None, granularity: Granularity = 'section', group_by: Callable[[Chunk], str | None] | None = None, keep_on_failure: bool = True)

Wrap a base :class:Chunker and add contextual preambles to chunks.

Chunks sharing the same group key (section, paragraph, or individual chunk — controlled by granularity) receive the same preamble. One LLM call per unique group, not per chunk — the same preamble is mapped onto every child chunk under it.

Parameters¶

base: Any object satisfying the :class:Chunker protocol. Typically :class:ParentChildChunker. contextualizer: Object exposing generate_batch(document_text, passage_texts). Defaults to :class:LLMContextualizer with Haiku 4.5. granularity: Choose how many LLM calls per document — cost vs specificity tradeoff:

- ``"section"`` (default): one preamble per Section, shared across
  every chunk under it. Typically 10-20 calls per academic paper.
  Right for structured docs (papers, books, technical docs).
- ``"paragraph"``: one preamble per Paragraph. Typically 30-100 calls.
  Right for mixed-topic docs or very long sections.
- ``"chunk"``: one preamble per child Chunk (Anthropic's original
  recipe). Typically 100-500 calls. Max specificity, max cost.

Requires the base chunker to set ``metadata["section_id"]`` or
``metadata["paragraph_id"]`` accordingly. :class:`ParentChildChunker`
sets both.

group_by: Power-user override — a callable mapping a Chunk to a group key. When set, supersedes granularity. Chunks returning the same key share a preamble; chunks where the callable returns None get no preamble. Use this for non-standard document structures (chat logs, code, custom topic clusters). keep_on_failure: If True (default), chunks whose contextualization failed are kept with no preamble (they embed as their original text). If False, the failed chunks are dropped from the returned list. Default True — partial degradation beats total ingest failure.

Source code in src/verifiable_rag/chunkers/contextual.py

def __init__(
    self,
    base,  # type: ignore[no-untyped-def] — Chunker (Protocol from sibling module)
    contextualizer: LLMContextualizer | None = None,
    granularity: Granularity = "section",
    group_by: Callable[[Chunk], str | None] | None = None,
    keep_on_failure: bool = True,
) -> None:
    if granularity not in _GRANULARITIES:
        raise ValueError(
            f"granularity must be one of {sorted(_GRANULARITIES)}, got "
            f"{granularity!r}"
        )
    self._base = base
    self._contextualizer = contextualizer or LLMContextualizer()
    self._granularity = granularity
    self._group_by = group_by
    self._keep_on_failure = keep_on_failure

LLMContextualizer¶

LLMContextualizer ¶

LLMContextualizer(model: str = 'claude-haiku-4-5-20251001', temperature: float = 0.0, max_tokens: int = 160, max_workers: int = 8, num_retries: int = 2, system_prompt: str = _DEFAULT_SYSTEM, user_template: str = _USER_TEMPLATE)

Generate a per-chunk contextual preamble via a chat LLM.

Parameters¶

model: LiteLLM model identifier. Default "claude-haiku-4-5-20251001" — cheap, fast, and prompt-cacheable for the long document portion. temperature: Sampling temperature. Default 0.0 for deterministic preambles (matters if you cache embeddings across re-ingests). max_tokens: Cap on preamble length. Default 160 leaves headroom for the 50-100 token target. max_workers: Threadpool concurrency for contextualizing many chunks of one document in parallel. Anthropic Tier 2 → 8 is safe. num_retries: LiteLLM built-in retries for transient errors. system_prompt, user_template: Override the default prompts. user_template must contain {document_text} and {chunk_text} placeholders.

Source code in src/verifiable_rag/chunkers/contextual.py

def __init__(
    self,
    model: str = "claude-haiku-4-5-20251001",
    temperature: float = 0.0,
    max_tokens: int = 160,
    max_workers: int = 8,
    num_retries: int = 2,
    system_prompt: str = _DEFAULT_SYSTEM,
    user_template: str = _USER_TEMPLATE,
) -> None:
    if "{document_text}" not in user_template or "{chunk_text}" not in user_template:
        raise ValueError(
            "user_template must contain both {document_text} and {chunk_text}"
        )
    self._model = model
    self._temperature = temperature
    self._max_tokens = max_tokens
    self._max_workers = max_workers
    self._num_retries = num_retries
    self._system_prompt = system_prompt
    self._user_template = user_template

generate_batch ¶

generate_batch(document_text: str, chunk_texts: list[str]) -> list[str | None]

Return one preamble per chunk_text, in order. None on failure.

All calls share the same document_text, so Anthropic prompt caching applies — the first call writes the cache, subsequent calls read it at ~10 % cost.

Source code in src/verifiable_rag/chunkers/contextual.py

def generate_batch(
    self,
    document_text: str,
    chunk_texts: list[str],
) -> list[str | None]:
    """Return one preamble per chunk_text, in order. ``None`` on failure.

    All calls share the same ``document_text``, so Anthropic prompt
    caching applies — the first call writes the cache, subsequent
    calls read it at ~10 % cost.
    """
    if not chunk_texts:
        return []

    if self._max_workers > 1 and len(chunk_texts) > 1:
        with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
            return list(
                pool.map(
                    lambda ct: self._generate_one(document_text, ct),
                    chunk_texts,
                )
            )
    return [self._generate_one(document_text, ct) for ct in chunk_texts]

ParentExpander¶

ParentExpander ¶

ParentExpander(strategy: Strategy = 'section_or_window', max_parent_tokens: int = 2000, window_paragraphs: int = 3, token_count: Callable[[str], int] = _default_token_count)

Returns the parent context text for a retrieved Chunk.

Source code in src/verifiable_rag/chunkers/parent_expander.py

def __init__(
    self,
    strategy: Strategy = "section_or_window",
    max_parent_tokens: int = 2000,
    window_paragraphs: int = 3,
    token_count: Callable[[str], int] = _default_token_count,
) -> None:
    if max_parent_tokens <= 0:
        raise ValueError(f"max_parent_tokens must be > 0, got {max_parent_tokens}")
    if window_paragraphs < 0:
        raise ValueError(f"window_paragraphs must be >= 0, got {window_paragraphs}")
    self._strategy = strategy
    self._max = max_parent_tokens
    self._window = window_paragraphs
    self._tok = token_count

expand ¶

expand(chunk: Chunk, document: Document) -> str

Return the parent context string for chunk.

Source code in src/verifiable_rag/chunkers/parent_expander.py

def expand(self, chunk: Chunk, document: Document) -> str:
    """Return the parent context string for *chunk*."""
    if document.doc_id != chunk.doc_id:
        raise ValueError(
            f"Chunk doc_id {chunk.doc_id!r} != document doc_id {document.doc_id!r}"
        )

    section = self._lookup_section(chunk, document)

    if self._strategy == "section":
        return section.text
    if self._strategy == "window":
        return self._window_text(chunk, section)

    # section_or_window
    if self._tok(section.text) <= self._max:
        return section.text
    return self._window_text(chunk, section)

embedding_text helper¶

embedding_text ¶

embedding_text(chunk: Chunk) -> str

Return the text to embed for chunk.

Uses metadata["contextual_preamble"] if present (Contextual Retrieval recipe: preamble + blank line + original text), otherwise falls back to the chunk's original text. Centralized here so any Pipeline / Indexer / Embedder call site can use the same convention.

Source code in src/verifiable_rag/chunkers/contextual.py

def embedding_text(chunk: Chunk) -> str:
    """Return the text to embed for *chunk*.

    Uses ``metadata["contextual_preamble"]`` if present (Contextual
    Retrieval recipe: preamble + blank line + original text), otherwise
    falls back to the chunk's original text. Centralized here so any
    Pipeline / Indexer / Embedder call site can use the same convention.
    """
    preamble = chunk.metadata.get("contextual_preamble")
    if preamble:
        return f"{preamble}\n\n{chunk.text}"
    return chunk.text