omniread.pdf
PDF format implementation for OmniRead.
This package provides PDF-specific implementations of the core OmniRead
contracts defined in omniread.core.
Unlike HTML, PDF handling requires an explicit client layer for document access. This package therefore includes:
- PDF clients for acquiring raw PDF data
- PDF scrapers that coordinate client access
- PDF parsers that extract structured content from PDF binaries
Public exports from this package represent the supported PDF pipeline and are safe for consumers to import directly when working with PDFs.
FileSystemPDFClient
Bases: BasePDFClient
PDF client that reads from the local filesystem.
This client reads PDF files directly from the disk and returns their raw binary contents.
fetch
fetch(path: Path) -> bytes
Read a PDF file from the local filesystem.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Filesystem path to the PDF file. | required |
Returns:
| Type | Description |
|---|---|
| bytes | Raw PDF bytes. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the path does not exist. |
| ValueError | If the path exists but is not a file. |
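The contract documented for fetch can be illustrated with a minimal standalone sketch. The function name `read_pdf_bytes` is hypothetical and this is not the library's implementation; it only mirrors the documented behavior (raw bytes on success, FileNotFoundError for a missing path, ValueError for a path that is not a file).

```python
from pathlib import Path
import tempfile


def read_pdf_bytes(path: Path) -> bytes:
    """Hypothetical stand-in mirroring the documented fetch() contract."""
    if not path.exists():
        raise FileNotFoundError(f"PDF path does not exist: {path}")
    if not path.is_file():
        raise ValueError(f"PDF path is not a file: {path}")
    # No decoding or validation: the caller receives the raw binary contents.
    return path.read_bytes()


# Usage: write a tiny placeholder file, read it back, then try a directory.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.pdf"
    sample.write_bytes(b"%PDF-1.7\n%%EOF\n")
    data = read_pdf_bytes(sample)

    try:
        read_pdf_bytes(Path(tmp))  # exists, but is a directory
    except ValueError as exc:
        directory_error = str(exc)
```

Checking existence before checking the file type keeps the two documented error cases distinct for callers.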
PDFParser
PDFParser(content: Content)
Bases: BaseParser[T], Generic[T]
Base PDF parser.
This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies.
Concrete implementations must:
- Define the output type T
- Implement the parse() method
Initialize the parser with content to be parsed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | Content | Content instance to be parsed. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If the content type is not supported by this parser. |
supported_types
class-attribute
instance-attribute
supported_types: set[ContentType] = {PDF}
Set of content types supported by this parser (PDF only).
parse
abstractmethod
parse() -> T
Parse PDF content into a structured output.
Implementations must fully interpret the PDF binary payload and return a deterministic, structured output.
Returns:
| Type | Description |
|---|---|
| T | Parsed representation of type T. |
Raises:
| Type | Description |
|---|---|
| Exception | Parsing-specific errors as defined by the implementation. |
supports
supports() -> bool
Check whether this parser supports the content's type.
Returns:
| Type | Description |
|---|---|
| bool | True if the content type is supported; False otherwise. |
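The parser contract described above (a supported_types check at construction, a supports() query, and an abstract parse() returning T) can be sketched as follows. Everything here is a self-contained stand-in: `ContentType`, `Content` and its field names, `SketchPDFParser`, and `HeaderVersionParser` are hypothetical and do not come from omniread. The toy strategy only reads the `%PDF-x.y` header that PDF files begin with.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Generic, TypeVar

T = TypeVar("T")


class ContentType(Enum):  # hypothetical stand-in for the core enum
    PDF = "application/pdf"
    HTML = "text/html"


@dataclass(frozen=True)
class Content:  # hypothetical stand-in; field names are assumptions
    raw: bytes
    content_type: ContentType


class SketchPDFParser(ABC, Generic[T]):
    """Mimics the documented PDFParser contract; not the library class."""

    supported_types = {ContentType.PDF}

    def __init__(self, content: Content) -> None:
        self.content = content
        if not self.supports():
            raise ValueError(f"Unsupported content type: {content.content_type}")

    def supports(self) -> bool:
        return self.content.content_type in self.supported_types

    @abstractmethod
    def parse(self) -> T:
        ...


class HeaderVersionParser(SketchPDFParser[str]):
    """Toy strategy with T = str: return the version from the PDF header."""

    def parse(self) -> str:
        first_line = self.content.raw.split(b"\n", 1)[0].strip()
        if not first_line.startswith(b"%PDF-"):
            raise ValueError("Missing %PDF- header")
        return first_line[len(b"%PDF-"):].decode("ascii")


pdf_content = Content(raw=b"%PDF-1.7\n%%EOF\n", content_type=ContentType.PDF)
version = HeaderVersionParser(pdf_content).parse()
```

Because the same bytes always produce the same string, the toy parser satisfies the determinism requirement stated for parse().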
PDFScraper
PDFScraper(*, client: BasePDFClient)
Bases: BaseScraper
Scraper for PDF sources.
Delegates byte retrieval to a PDF client and normalizes output into Content.
The scraper:
- Does not perform parsing or interpretation
- Does not assume a specific storage backend
- Preserves caller-provided metadata
Initialize the PDF scraper.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| client | BasePDFClient | PDF client responsible for retrieving raw PDF bytes. | required |
fetch
fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content
Fetch a PDF document from the given source.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | Any | Identifier of the PDF source as understood by the configured PDF client. | required |
| metadata | Optional[Mapping[str, Any]] | Optional metadata to attach to the returned content. | None |
Returns:
| Type | Description |
|---|---|
| Content | A Content instance. |
Raises:
| Type | Description |
|---|---|
| Exception | Retrieval-specific errors raised by the PDF client. |
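The delegation described for the scraper (the client supplies the bytes, the scraper wraps them in Content and keeps the caller's metadata) can be sketched with stand-ins. `Content` and its field names, `InMemoryPDFClient`, and `SketchPDFScraper` are all hypothetical names for illustration, not omniread classes. The in-memory client is an assumption chosen to show that the scraper does not depend on a particular storage backend.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class Content:  # hypothetical stand-in; field names are assumptions
    raw: bytes
    source: str
    content_type: str
    metadata: Mapping[str, Any] = field(default_factory=dict)


class InMemoryPDFClient:
    """Hypothetical client backed by a dict instead of the filesystem."""

    def __init__(self, documents: Mapping[str, bytes]) -> None:
        self._documents = documents

    def fetch(self, source: str) -> bytes:
        return self._documents[source]  # KeyError for unknown sources


class SketchPDFScraper:
    """Mimics the documented PDFScraper behavior; not the library class."""

    def __init__(self, *, client: InMemoryPDFClient) -> None:
        self._client = client

    def fetch(
        self, source: Any, *, metadata: Optional[Mapping[str, Any]] = None
    ) -> Content:
        raw = self._client.fetch(source)  # client errors propagate unchanged
        return Content(
            raw=raw,  # bytes are passed through without parsing
            source=str(source),
            content_type="application/pdf",
            metadata=dict(metadata or {}),
        )


client = InMemoryPDFClient({"report": b"%PDF-1.4\n%%EOF\n"})
scraper = SketchPDFScraper(client=client)
content = scraper.fetch("report", metadata={"origin": "unit-test"})
```

Making `client` keyword-only, as in the documented signature, keeps construction explicit about which backend is in use.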