Parsers

Parser

Bases: Protocol

Convert a document file into a span-preserving Document model.

Implementations must preserve character offsets through the full parsing pipeline so every Sentence can be traced back to its source location.

parse

parse(path: Path) -> Document

Parse the file at path and return a Document.

Raises:
    FileNotFoundError: if path does not exist.
    ValueError: if the file format is unsupported or contains no text.

Source code in src/verifiable_rag/parsers/__init__.py
def parse(self, path: Path) -> Document:
    """Parse the file at *path* and return a Document.

    Raises:
        FileNotFoundError: if *path* does not exist.
        ValueError: if the file format is unsupported or contains no text.
    """
    ...
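
Because Parser is a Protocol, any class with a matching parse method satisfies it without inheriting from it. The sketch below illustrates that, along with the offset-preservation contract, using a toy plain-text parser. The Document and Sentence dataclasses, their field names (text, start, end, sentences), and PlainTextParser are hypothetical stand-ins so the example runs on its own; the real models in verifiable_rag may look different.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field
from pathlib import Path
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class Sentence:
    # Hypothetical stand-in: the real Sentence model may carry other fields.
    text: str
    start: int  # inclusive character offset into Document.text
    end: int    # exclusive character offset into Document.text


@dataclass
class Document:
    # Hypothetical stand-in for the library's Document model.
    text: str
    sentences: list[Sentence] = field(default_factory=list)


@runtime_checkable
class Parser(Protocol):
    def parse(self, path: Path) -> Document: ...


class PlainTextParser:
    """Toy .txt parser; satisfies Parser structurally, with no inheritance."""

    def parse(self, path: Path) -> Document:
        if not path.exists():
            raise FileNotFoundError(path)
        if path.suffix.lower() != ".txt":
            raise ValueError(f"unsupported format: {path.suffix!r}")
        text = path.read_text(encoding="utf-8")
        if not text.strip():
            raise ValueError(f"no text in {path}")

        sentences: list[Sentence] = []
        # Naive splitter: a run of non-terminators plus trailing terminators.
        for match in re.finditer(r"[^.!?]+[.!?]*", text):
            raw = match.group()
            stripped = raw.strip()
            if not stripped:
                continue
            # Shift the start past leading whitespace so offsets stay exact.
            start = match.start() + (len(raw) - len(raw.lstrip()))
            sentences.append(Sentence(stripped, start, start + len(stripped)))
        return Document(text=text, sentences=sentences)
```

The property worth checking in any implementation is that slicing the source text with a sentence's offsets returns exactly that sentence's text.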

DoclingParser

DoclingParser(sentence_splitter: SentenceSplitter | None = None)

Parses PDFs (and other docling-supported formats) into a Document.

The parser is stateless across calls; expensive resources (the docling DocumentConverter, the wtpsplit model) are constructed lazily and reused.

Source code in src/verifiable_rag/parsers/docling_parser.py
def __init__(
    self,
    sentence_splitter: SentenceSplitter | None = None,
) -> None:
    self._splitter = sentence_splitter or SentenceSplitter()
    self._converter: Any = None
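
The constructor above only sets _converter to None, which suggests a build-on-first-use pattern. The following is a minimal sketch of that pattern under stated assumptions: LazyConverterHolder, its factory argument, and the _get_converter method are hypothetical names chosen for illustration, not the parser's actual internals.

```python
from typing import Any, Callable


class LazyConverterHolder:
    """Illustrates build-on-first-use; not the library's actual class."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory      # hypothetical: how the resource is built
        self._converter: Any = None  # nothing expensive happens here

    def _get_converter(self) -> Any:
        # Build on first use, then return the cached instance afterwards.
        if self._converter is None:
            self._converter = self._factory()
        return self._converter


build_count = 0


def expensive_factory() -> object:
    # Stand-in for constructing something costly, such as a converter.
    global build_count
    build_count += 1
    return object()


holder = LazyConverterHolder(expensive_factory)
first = holder._get_converter()
second = holder._get_converter()  # reuses the instance built above
```

With this shape, constructing the parser stays cheap, and the cost of building the resource is paid once, on the first call that needs it.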

PyMuPDFParser

PyMuPDFParser(sentence_splitter: SentenceSplitter | None = None)

PyMuPDF-backed parser, used as a Docling fallback.

Stateless across calls; the wtpsplit splitter is reused.

Source code in src/verifiable_rag/parsers/pymupdf_parser.py
def __init__(self, sentence_splitter: SentenceSplitter | None = None) -> None:
    self._splitter = sentence_splitter or SentenceSplitter()

CompositeParser

CompositeParser(primary: Parser, fallbacks: list[Parser], catch: tuple[type[BaseException], ...] = (ValueError, AssertionError))

Chain of parsers tried in order; falls through to the next parser when one raises an exception type listed in catch.

Other exceptions (e.g. FileNotFoundError) propagate immediately. Fallback is reserved for parser-internal failures, not for missing files.

Source code in src/verifiable_rag/parsers/composite.py
def __init__(
    self,
    primary: Parser,
    fallbacks: list[Parser],
    catch: tuple[type[BaseException], ...] = (ValueError, AssertionError),
) -> None:
    if not fallbacks:
        raise ValueError(
            "CompositeParser needs at least one fallback. "
            "Use the primary parser directly if you have nothing to fall back to."
        )
    self._primary = primary
    self._fallbacks = list(fallbacks)
    self._catch = catch
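
Only the constructor is shown above, so the following is a sketch of how the fall-through described in the docstring could work, not the library's parse method. The function parse_with_fallbacks and the stub parsers are hypothetical, the stubs return strings in place of a Document, and re-raising the last caught error when every parser fails is an assumption.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any, Protocol


class Parser(Protocol):
    def parse(self, path: Path) -> Any: ...


def parse_with_fallbacks(
    path: Path,
    primary: Parser,
    fallbacks: list[Parser],
    catch: tuple[type[BaseException], ...] = (ValueError, AssertionError),
) -> Any:
    last_error: BaseException | None = None
    for parser in [primary, *fallbacks]:
        try:
            return parser.parse(path)
        except catch as exc:
            # Parser-internal failure: remember it and try the next parser.
            last_error = exc
    # Assumption: when every parser fails, surface the last caught error.
    assert last_error is not None
    raise last_error


class FailingParser:
    def parse(self, path: Path) -> Any:
        raise ValueError("no text extracted")


class MissingFileParser:
    def parse(self, path: Path) -> Any:
        raise FileNotFoundError(path)


class WorkingParser:
    def parse(self, path: Path) -> Any:
        return f"parsed:{path.name}"  # stand-in for a Document
```

A ValueError from the primary parser moves on to the next parser in the chain, whereas a FileNotFoundError is not in catch and so escapes on the first attempt without any fallback being tried.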