Parser

omniread.pdf.parser

PDF parser base implementations for OmniRead.

This module defines the PDF-specific parser contract, extending the format-agnostic BaseParser with constraints appropriate for PDF content.

PDF parsers are responsible for interpreting binary PDF data and producing structured representations suitable for downstream consumption.

PDFParser(content: Content)

Bases: BaseParser[T], Generic[T]

Base PDF parser.

This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies.

Concrete implementations must:
- Define the output type T
- Implement the parse() method
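The two requirements above can be sketched as follows. This is a minimal, self-contained illustration, not OmniRead's actual code: `ContentType`, `Content`, and `PDFParser` are simplified stand-ins written for this example, the attribute names `raw` and `content_type` are assumptions, and `PDFVersionParser` is a hypothetical subclass. It only reads the file header, so it is far simpler than a real parser, which is expected to interpret the whole payload.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Generic, TypeVar

T = TypeVar("T")


class ContentType(Enum):
    """Stand-in for OmniRead's ContentType (assumed shape)."""
    PDF = "application/pdf"
    HTML = "text/html"


@dataclass
class Content:
    """Stand-in for OmniRead's Content; field names are assumptions."""
    raw: bytes
    content_type: ContentType


class PDFParser(ABC, Generic[T]):
    """Stand-in that mirrors the documented contract only."""
    supported_types: set[ContentType] = {ContentType.PDF}

    def __init__(self, content: Content):
        self.content = content
        if not self.supports():
            raise ValueError(f"Unsupported content type: {content.content_type}")

    def supports(self) -> bool:
        return self.content.content_type in self.supported_types

    @abstractmethod
    def parse(self) -> T:
        ...


class PDFVersionParser(PDFParser[str]):
    """Hypothetical concrete parser: T is str, parse() returns the header version."""

    def parse(self) -> str:
        header = self.content.raw[:8]
        if not header.startswith(b"%PDF-"):
            raise ValueError("Missing %PDF- header")
        # A header such as b"%PDF-1.7" carries the version in bytes 5..7.
        return header[5:8].decode("ascii")


content = Content(raw=b"%PDF-1.7\n...", content_type=ContentType.PDF)
version = PDFVersionParser(content).parse()
print(version)
```

Binding `T` in the base-class subscript (`PDFParser[str]`) lets a type checker verify that `parse()` returns the declared output type.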

Initialize the parser with content to be parsed.

Parameters:

Name	Type	Description	Default
`content`	`Content`	Content instance to be parsed.	required

Raises:

Type	Description
`ValueError`	If the content type is not supported by this parser.

supported_types: set[ContentType] = {PDF}

Set of content types supported by this parser (PDF only).

parse() -> T

Parse PDF content into a structured output.

Implementations must fully interpret the PDF binary payload and return a deterministic, structured output.

Returns:

Type	Description
`T`	Parsed representation of type `T`.

Raises:

Type	Description
`Exception`	Parsing-specific errors as defined by the implementation.

supports() -> bool

Check whether this parser supports the content's type.

Returns:

Type	Description
`bool`	True if the content type is supported; False otherwise.
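The relationship between `supports()` and the `ValueError` raised at construction might look like the sketch below. The source does not say how the constructor performs its check, so calling `supports()` from `__init__` is an assumption. `SketchPDFParser`, `Content`, and `ContentType` are hypothetical stand-ins defined here, not OmniRead classes.

```python
from dataclasses import dataclass
from enum import Enum


class ContentType(Enum):
    """Stand-in enum; the real ContentType may define other members."""
    PDF = "application/pdf"
    HTML = "text/html"


@dataclass
class Content:
    """Stand-in content container; field names are assumptions."""
    raw: bytes
    content_type: ContentType


class SketchPDFParser:
    """Hypothetical parser showing only the type check, not real parsing."""
    supported_types = {ContentType.PDF}

    def __init__(self, content: Content):
        self.content = content
        # Assumed behavior: reject unsupported content as early as possible.
        if not self.supports():
            raise ValueError(f"Unsupported content type: {content.content_type}")

    def supports(self) -> bool:
        return self.content.content_type in self.supported_types


pdf = Content(raw=b"%PDF-1.7\n", content_type=ContentType.PDF)
html = Content(raw=b"<html></html>", content_type=ContentType.HTML)

# PDF content passes the check, so construction succeeds.
accepted = SketchPDFParser(pdf).supports()

# Non-PDF content is rejected before any parsing is attempted.
try:
    SketchPDFParser(html)
    rejected = False
except ValueError:
    rejected = True

print(accepted, rejected)
```

Under this assumption, a successfully constructed parser always reports `True` from `supports()`, and a caller sees a type mismatch as a `ValueError` at construction rather than as a failure inside `parse()`.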