PDF

omniread.pdf

PDF format implementation for OmniRead.

This package provides PDF-specific implementations of the core OmniRead contracts defined in omniread.core.

Unlike HTML, PDF handling requires an explicit client layer for document access. This package therefore includes:

- PDF clients for acquiring raw PDF data
- PDF scrapers that coordinate client access
- PDF parsers that extract structured content from PDF binaries

Public exports from this package represent the supported PDF pipeline and are safe for consumers to import directly when working with PDFs.

FileSystemPDFClient

Bases: BasePDFClient

PDF client that reads from the local filesystem.

This client reads PDF files directly from disk and returns their raw binary contents.

fetch

fetch(path: Path) -> bytes

Read a PDF file from the local filesystem.

Parameters:

Name	Type	Description	Default
`path`	`Path`	Filesystem path to the PDF file.	required

Returns:

Type	Description
`bytes`	Raw PDF bytes.

Raises:

Type	Description
`FileNotFoundError`	If the path does not exist.
`ValueError`	If the path exists but is not a file.
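The contract above (return bytes, raise `FileNotFoundError` for a missing path, raise `ValueError` for a non-file path) can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: `read_pdf_bytes` is a hypothetical stand-in for `FileSystemPDFClient.fetch`, and the error messages are assumptions.

```python
import tempfile
from pathlib import Path


def read_pdf_bytes(path: Path) -> bytes:
    """Hypothetical stand-in that mirrors the documented fetch() contract."""
    if not path.exists():
        # Documented behavior: a missing path raises FileNotFoundError.
        raise FileNotFoundError(f"PDF not found: {path}")
    if not path.is_file():
        # Documented behavior: an existing non-file path raises ValueError.
        raise ValueError(f"Not a file: {path}")
    return path.read_bytes()


# Exercise all three outcomes against a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.pdf"
    sample.write_bytes(b"%PDF-1.4\n%%EOF\n")  # placeholder bytes, not a valid PDF

    data = read_pdf_bytes(sample)

    missing_raises = False
    try:
        read_pdf_bytes(Path(tmp) / "missing.pdf")
    except FileNotFoundError:
        missing_raises = True

    directory_raises = False
    try:
        read_pdf_bytes(Path(tmp))  # a directory exists but is not a file
    except ValueError:
        directory_raises = True
```

With the real package, the equivalent call would presumably be `FileSystemPDFClient().fetch(Path(...))`; the no-argument constructor is an assumption, since this page does not document one.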

PDFParser

PDFParser(content: Content)

Bases: BaseParser[T], Generic[T]

Base PDF parser.

This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies.

Concrete implementations must:

- Define the output type T
- Implement the parse() method

Initialize the parser with content to be parsed.

Parameters:

Name	Type	Description	Default
`content`	`Content`	Content instance to be parsed.	required

Raises:

Type	Description
`ValueError`	If the content type is not supported by this parser.

supported_types `class-attribute` `instance-attribute`

supported_types: set[ContentType] = {PDF}

Set of content types supported by this parser (PDF only).

parse `abstractmethod`

parse() -> T

Parse PDF content into a structured output.

Implementations must fully interpret the PDF binary payload and return a deterministic, structured output.

Returns:

Type	Description
`T`	Parsed representation of type `T`.

Raises:

Type	Description
`Exception`	Parsing-specific errors as defined by the implementation.

supports

supports() -> bool

Check whether this parser supports the content's type.

Returns:

Type	Description
`bool`	True if the content type is supported; False otherwise.
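The pieces described for `PDFParser` (a type check at construction, a `supports()` predicate, and an abstract `parse()` returning `T`) fit together roughly as in the sketch below. Everything here is a self-contained stand-in: `ContentType`, `Content`, its `raw` and `content_type` fields, `SketchPDFParser`, and `HeaderVersionParser` are hypothetical names, not the `omniread` classes. The concrete parser is deliberately trivial (it only reads the `%PDF-` header line) and is meant to show the subclassing shape, not a full PDF interpretation.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Generic, TypeVar

T = TypeVar("T")


class ContentType(Enum):
    """Stand-in enum; the real ContentType lives in the library."""
    PDF = "application/pdf"
    HTML = "text/html"


@dataclass
class Content:
    """Stand-in container; field names are assumptions."""
    raw: bytes
    content_type: ContentType


class SketchPDFParser(ABC, Generic[T]):
    """Stand-in base class mirroring the documented PDFParser behavior."""

    supported_types = {ContentType.PDF}

    def __init__(self, content: Content) -> None:
        self.content = content
        if not self.supports():
            raise ValueError(f"Unsupported content type: {content.content_type}")

    def supports(self) -> bool:
        return self.content.content_type in self.supported_types

    @abstractmethod
    def parse(self) -> T:
        ...


class HeaderVersionParser(SketchPDFParser[str]):
    """Toy parser: output type T is str, parse() returns the header version."""

    def parse(self) -> str:
        first_line = self.content.raw.split(b"\n", 1)[0]
        if not first_line.startswith(b"%PDF-"):
            raise ValueError("Missing %PDF- header")
        return first_line[len(b"%PDF-"):].decode("ascii").strip()


version = HeaderVersionParser(Content(b"%PDF-1.7\n", ContentType.PDF)).parse()

# Non-PDF content is rejected at construction time.
rejected = False
try:
    HeaderVersionParser(Content(b"<html></html>", ContentType.HTML))
except ValueError:
    rejected = True
```

Performing the check in the constructor means an incompatible parser cannot be created at all, so `parse()` never has to guard against non-PDF input.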

PDFScraper

PDFScraper(*, client: BasePDFClient)

Bases: BaseScraper

Scraper for PDF sources.

Delegates byte retrieval to a PDF client and normalizes output into Content.

The scraper:

- Does not perform parsing or interpretation
- Does not assume a specific storage backend
- Preserves caller-provided metadata

Initialize the PDF scraper.

Parameters:

Name	Type	Description	Default
`client`	`BasePDFClient`	PDF client responsible for retrieving raw PDF bytes.	required

fetch

fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch a PDF document from the given source.

Parameters:

Name	Type	Description	Default
`source`	`Any`	Identifier of the PDF source as understood by the configured PDF client.	required
`metadata`	`Optional[Mapping[str, Any]]`	Optional metadata to attach to the returned content.	`None`

Returns:

Type	Description
`Content`	A `Content` instance containing the raw PDF bytes, the source identifier, the PDF content type, and any optional metadata.

Raises:

Type	Description
`Exception`	Retrieval-specific errors raised by the PDF client.
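The delegation described for `PDFScraper.fetch` can be illustrated with a small self-contained sketch. All names below are hypothetical stand-ins rather than the library's classes: `Content` and its fields, `InMemoryPDFClient`, and `SketchPDFScraper` exist only for this example. The point is the division of labor: the client returns bytes, and the scraper wraps them with the source, content type, and caller metadata without inspecting the payload.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class Content:
    """Stand-in container; field names are assumptions."""
    raw: bytes
    source: str
    content_type: str
    metadata: Optional[Mapping[str, Any]] = None


class InMemoryPDFClient:
    """Hypothetical client backed by a dict, showing that the scraper
    does not depend on any particular storage backend."""

    def __init__(self, documents: Mapping[str, bytes]) -> None:
        self._documents = dict(documents)

    def fetch(self, key: str) -> bytes:
        return self._documents[key]  # a missing key raises KeyError


class SketchPDFScraper:
    """Stand-in scraper mirroring the documented delegation."""

    def __init__(self, *, client: Any) -> None:
        self._client = client

    def fetch(
        self, source: Any, *, metadata: Optional[Mapping[str, Any]] = None
    ) -> Content:
        raw = self._client.fetch(source)  # retrieval only, no parsing
        return Content(
            raw=raw,
            source=str(source),
            content_type="application/pdf",
            metadata=metadata,  # passed through unchanged
        )


scraper = SketchPDFScraper(client=InMemoryPDFClient({"report": b"%PDF-1.4\n"}))
content = scraper.fetch("report", metadata={"team": "docs"})
```

Because the client is injected, swapping the in-memory client for a filesystem or network client requires no change to the scraper, and client errors (here a `KeyError`) propagate to the caller unchanged.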
