Parser
omniread.pdf.parser
PDF parser base implementations for OmniRead.
This module defines the PDF-specific parser contract, extending the
format-agnostic BaseParser with constraints appropriate for PDF content.
PDF parsers are responsible for interpreting binary PDF data and producing structured representations suitable for downstream consumption.
PDFParser
PDFParser(content: Content)
Bases: BaseParser[T], Generic[T]
Base PDF parser.
This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies.
Concrete implementations must:
- Define the output type T
- Implement the parse() method
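The two requirements above can be illustrated with a minimal, self-contained sketch. Everything in it is a simplified stand-in rather than OmniRead's actual code: the `ContentType` members, the `Content` fields (`raw`, `content_type`), the constructor body, and the toy `PDFSizeParser` are assumptions made only so the example runs on its own. In a real project you would import `PDFParser` from `omniread.pdf.parser` instead of redefining it.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Generic, TypeVar

T = TypeVar("T")


# --- Hypothetical stand-ins for OmniRead's core types (assumptions) ---
class ContentType(Enum):
    PDF = "application/pdf"
    HTML = "text/html"


@dataclass
class Content:
    raw: bytes                 # assumed field name
    content_type: ContentType  # assumed field name


class PDFParser(ABC, Generic[T]):
    """Simplified stand-in for omniread.pdf.parser.PDFParser."""

    supported_types = {ContentType.PDF}

    def __init__(self, content: Content):
        self.content = content
        # Assumed behaviour: reject unsupported content at construction time.
        if not self.supports():
            raise ValueError(f"Unsupported content type: {content.content_type}")

    def supports(self) -> bool:
        return self.content.content_type in self.supported_types

    @abstractmethod
    def parse(self) -> T: ...


# --- A concrete parser: fixes T = int and implements parse() ---
class PDFSizeParser(PDFParser[int]):
    def parse(self) -> int:
        # Toy "parsing": report the payload size in bytes.
        return len(self.content.raw)


parser = PDFSizeParser(Content(raw=b"%PDF-1.7\n", content_type=ContentType.PDF))
size = parser.parse()
```

Subscripting the base class (`PDFParser[int]`) is how the subclass fixes the output type T; a real implementation would typically return a richer structure, such as a dataclass holding pages or extracted text.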
Initialize the parser with content to be parsed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | Content | Content instance to be parsed. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If the content type is not supported by this parser. |
supported_types
class-attribute
instance-attribute
supported_types: set[ContentType] = {PDF}
Set of content types supported by this parser (PDF only).
parse
abstractmethod
parse() -> T
Parse PDF content into a structured output.
Implementations must fully interpret the PDF binary payload and return a deterministic, structured output.
Returns:
| Type | Description |
|---|---|
| T | Parsed representation of type T. |
Raises:
| Type | Description |
|---|---|
| Exception | Parsing-specific errors as defined by the implementation. |
supports
supports() -> bool
Check whether this parser supports the content's type.
Returns:
| Type | Description |
|---|---|
| bool | True if the content type is supported; False otherwise. |
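As a rough illustration of how `supports()` and the documented `ValueError` might fit together, the sketch below uses hypothetical stand-ins only. `SketchPDFParser`, the `ContentType` members, and the `Content` constructor are not OmniRead's real definitions, and the assumption that the constructor calls `supports()` internally is not confirmed by this page.

```python
from enum import Enum


class ContentType(Enum):  # hypothetical stand-in
    PDF = "application/pdf"
    HTML = "text/html"


class Content:  # hypothetical stand-in
    def __init__(self, raw: bytes, content_type: ContentType):
        self.raw = raw
        self.content_type = content_type


class SketchPDFParser:
    """Stand-in mirroring the documented supports()/ValueError behaviour."""

    supported_types = {ContentType.PDF}

    def __init__(self, content: Content):
        self.content = content
        # Assumption: the type check happens during initialization.
        if not self.supports():
            raise ValueError(f"unsupported content type: {content.content_type}")

    def supports(self) -> bool:
        # Membership test against the class-level set of supported types.
        return self.content.content_type in self.supported_types


# PDF content is accepted and supports() reports True.
ok = SketchPDFParser(Content(b"%PDF-1.7", ContentType.PDF)).supports()

# Non-PDF content is rejected with ValueError at construction time.
try:
    SketchPDFParser(Content(b"<html></html>", ContentType.HTML))
    rejected = False
except ValueError:
    rejected = True
```

Under this reading, a caller that receives a successfully constructed parser can rely on `supports()` returning True, and mismatched content surfaces as an error before `parse()` is ever called.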