Parsers
Parser
Bases: Protocol
Convert a document file into a span-preserving Document model.
Implementations must preserve character offsets through the full parsing pipeline so every Sentence can be traced back to its source location.
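As a rough illustration of what "span-preserving" can mean, the sketch below records each sentence as a pair of character offsets into the original text, so slicing the text with those offsets reproduces the sentence exactly. The method name `parse`, the `Document` and `Sentence` fields, and the regex splitter are all assumptions made for this example; the real model and the wtpsplit-based splitting are not shown on this page.

```python
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class Sentence:
    """Hypothetical sentence span: [start, end) offsets into Document.text."""
    start: int
    end: int


@dataclass(frozen=True)
class Document:
    """Hypothetical document model: the full text plus sentence spans."""
    text: str
    sentences: tuple[Sentence, ...]


@runtime_checkable
class Parser(Protocol):
    """Assumed shape of the protocol; the real method name may differ."""

    def parse(self, path: Path) -> Document: ...


def split_sentences(text: str) -> Document:
    """Naive splitter (a stand-in, not wtpsplit) that keeps source offsets."""
    # Each match starts at a non-space character and runs to the next
    # sentence terminator, so leading whitespace never enters a span.
    spans = tuple(
        Sentence(m.start(), m.end())
        for m in re.finditer(r"\S[^.!?]*[.!?]?", text)
    )
    return Document(text=text, sentences=spans)


class PlainTextParser:
    """Toy implementation that satisfies the assumed protocol structurally."""

    def parse(self, path: Path) -> Document:
        return split_sentences(path.read_text(encoding="utf-8"))


doc = split_sentences("Hello world. Second one!")
for s in doc.sentences:
    # Slicing with the stored offsets recovers the sentence verbatim.
    print(s.start, s.end, repr(doc.text[s.start:s.end]))
```

Because the offsets index the original text directly, any later stage can map a sentence back to its source location without re-running the parser.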
DoclingParser
Parses PDFs (and other docling-supported formats) into a Document.
The parser is stateless across calls; expensive resources (the docling DocumentConverter, the wtpsplit model) are constructed lazily and reused.
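One common way to get "constructed lazily and reused" behaviour in Python is `functools.cached_property`. The sketch below is only an illustration under that assumption: the class name `LazyParserSketch`, the `converter_factory` argument, and the `constructions` counter are hypothetical, and the actual DoclingParser may manage its resources differently.

```python
from functools import cached_property
from typing import Any, Callable


class LazyParserSketch:
    """Hypothetical parser that builds its expensive resource on first use."""

    def __init__(self, converter_factory: Callable[[], Any]) -> None:
        # Only the factory is stored; nothing expensive runs here.
        self._converter_factory = converter_factory
        self.constructions = 0  # counter added purely for demonstration

    @cached_property
    def converter(self) -> Any:
        # Runs once per instance; the result is cached on the instance.
        self.constructions += 1
        return self._converter_factory()

    def parse(self, path: str) -> tuple[Any, str]:
        # No per-document state is kept between calls; only the shared
        # resource is reused.
        return (self.converter, path)


parser = LazyParserSketch(converter_factory=object)
print(parser.constructions)  # nothing built yet
parser.parse("a.pdf")
parser.parse("b.pdf")
print(parser.constructions)  # built once, reused on the second call
```

Constructing the parser stays cheap, and the cost of loading a model is paid only by callers that actually parse something.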
Source code in src/verifiable_rag/parsers/docling_parser.py
PyMuPDFParser
PyMuPDF-backed parser, used as a Docling fallback.
Stateless across calls; the wtpsplit splitter is reused.
Source code in src/verifiable_rag/parsers/pymupdf_parser.py
CompositeParser
CompositeParser(primary: Parser, fallbacks: list[Parser], catch: tuple[type[BaseException], ...] = (ValueError, AssertionError))
Chain of parsers tried in order: primary first, then each fallback. When a parser raises one of the exception types listed in catch, the next parser in the chain is tried.
Other exceptions (e.g. FileNotFoundError) propagate immediately; fallback is reserved for parser-internal failures, not for missing files.
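The fallback behaviour can be sketched as follows. The constructor arguments mirror the signature above, but everything else is an assumption for illustration: the `parse` method name, the stub parsers, and in particular the choice to re-raise the last caught exception when every parser fails, which this page does not specify.

```python
from pathlib import Path
from typing import Any, Protocol


class Parser(Protocol):
    def parse(self, path: Path) -> Any: ...  # method name is assumed


class CompositeParserSketch:
    """Hypothetical re-implementation of the documented fallback chain."""

    def __init__(
        self,
        primary: Parser,
        fallbacks: list[Parser],
        catch: tuple[type[BaseException], ...] = (ValueError, AssertionError),
    ) -> None:
        self._parsers = [primary, *fallbacks]
        self._catch = catch

    def parse(self, path: Path) -> Any:
        last_error: BaseException | None = None
        for parser in self._parsers:
            try:
                return parser.parse(path)
            except self._catch as exc:
                # Parser-internal failure: remember it, try the next parser.
                last_error = exc
            # Any exception type not in `catch` is not handled here and
            # propagates to the caller immediately.
        # Assumption: when the whole chain fails, surface the last error.
        raise last_error


class Failing:
    """Stub parser that always raises the given exception."""

    def __init__(self, exc: BaseException) -> None:
        self._exc = exc

    def parse(self, path: Path) -> Any:
        raise self._exc


class Succeeding:
    """Stub parser that always returns the given value."""

    def __init__(self, value: Any) -> None:
        self._value = value

    def parse(self, path: Path) -> Any:
        return self._value


chain = CompositeParserSketch(Failing(ValueError("bad layout")), [Succeeding("doc")])
print(chain.parse(Path("report.pdf")))
```

Keeping `catch` narrow means a missing file fails fast with its original error instead of being retried by every fallback in turn.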