Pdf

omniread.pdf

PDF format implementation for OmniRead.

This package provides PDF-specific implementations of the core OmniRead contracts defined in omniread.core.

Unlike HTML, PDF handling requires an explicit client layer for document access. This package therefore includes:

- PDF clients for acquiring raw PDF data
- PDF scrapers that coordinate client access
- PDF parsers that extract structured content from PDF binaries

Public exports from this package represent the supported PDF pipeline and are safe for consumers to import directly when working with PDFs.

FileSystemPDFClient

Bases: BasePDFClient

PDF client that reads from the local filesystem.

This client reads PDF files directly from disk and returns their raw binary contents.

fetch

fetch(path: Path) -> bytes

Read a PDF file from the local filesystem.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | Path | Filesystem path to the PDF file. | required |

Returns:

| Type | Description |
| --- | --- |
| bytes | Raw PDF bytes. |

Raises:

| Type | Description |
| --- | --- |
| FileNotFoundError | If the path does not exist. |
| ValueError | If the path exists but is not a file. |
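The contract described above (return raw bytes, raise FileNotFoundError for a missing path, raise ValueError for a non-file path) can be illustrated with a minimal sketch. The function `read_pdf_bytes` is a hypothetical stand-in written for this example; it is not the OmniRead implementation, and only the documented behavior is assumed.

```python
from pathlib import Path
import tempfile


def read_pdf_bytes(path: Path) -> bytes:
    """Hypothetical stand-in that mirrors the documented fetch() contract."""
    if not path.exists():
        raise FileNotFoundError(f"PDF not found: {path}")
    if not path.is_file():
        raise ValueError(f"Path is not a file: {path}")
    return path.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    # Write a tiny placeholder file so the example is self-contained.
    sample = Path(tmp) / "sample.pdf"
    sample.write_bytes(b"%PDF-1.7\n%%EOF\n")
    data = read_pdf_bytes(sample)

    # A missing path raises FileNotFoundError.
    missing_error = None
    try:
        read_pdf_bytes(Path(tmp) / "missing.pdf")
    except FileNotFoundError as exc:
        missing_error = exc

    # A directory exists but is not a file, so ValueError is raised.
    dir_error = None
    try:
        read_pdf_bytes(Path(tmp))
    except ValueError as exc:
        dir_error = exc
```

Checking existence before checking the file type keeps the two error cases distinct, which matches the two entries in the Raises table.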

PDFParser

PDFParser(content: Content)

Bases: BaseParser[T], Generic[T]

Base PDF parser.

This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies.

Concrete implementations must:

- Define the output type T
- Implement the parse() method

Initialize the parser with content to be parsed.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| content | Content | Content instance to be parsed. | required |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the content type is not supported by this parser. |

supported_types class-attribute instance-attribute

supported_types: set[ContentType] = {PDF}

Set of content types supported by this parser (PDF only).

parse abstractmethod

parse() -> T

Parse PDF content into a structured output.

Implementations must fully interpret the PDF binary payload and return a deterministic, structured output.

Returns:

| Type | Description |
| --- | --- |
| T | Parsed representation of type T. |

Raises:

| Type | Description |
| --- | --- |
| Exception | Parsing-specific errors as defined by the implementation. |

supports

supports() -> bool

Check whether this parser supports the content's type.

Returns:

| Type | Description |
| --- | --- |
| bool | True if the content type is supported; False otherwise. |
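To make the extension point concrete, the following sketch shows what a subclass might look like. Because OmniRead itself is not imported here, `ContentType`, `Content`, and `PDFParser` are simplified stand-ins modelled on the behavior described above; their field names (`raw`, `source`, `content_type`) and the `PDFVersionParser` class are assumptions made for illustration only.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Generic, TypeVar

T = TypeVar("T")


class ContentType(Enum):
    # Stand-in for the content type enum in omniread.core (values assumed).
    PDF = "application/pdf"
    HTML = "text/html"


@dataclass
class Content:
    # Stand-in container; the real field names may differ.
    raw: bytes
    source: str
    content_type: ContentType


class PDFParser(ABC, Generic[T]):
    # Stand-in base class that enforces PDF-only content, as documented.
    supported_types: set[ContentType] = {ContentType.PDF}

    def __init__(self, content: Content) -> None:
        self.content = content
        if not self.supports():
            raise ValueError(f"Unsupported content type: {content.content_type}")

    def supports(self) -> bool:
        return self.content.content_type in self.supported_types

    @abstractmethod
    def parse(self) -> T: ...


class PDFVersionParser(PDFParser[str]):
    """Hypothetical concrete parser: T is str, parse() reads the header version."""

    def parse(self) -> str:
        header = self.content.raw[:8]
        if not header.startswith(b"%PDF-"):
            raise ValueError("Missing %PDF- header")
        return header[5:8].decode("ascii")


pdf = Content(raw=b"%PDF-1.7\n%%EOF\n", source="sample.pdf", content_type=ContentType.PDF)
version = PDFVersionParser(pdf).parse()

# Non-PDF content is rejected at construction time, before parse() is called.
html = Content(raw=b"<html></html>", source="page.html", content_type=ContentType.HTML)
try:
    PDFVersionParser(html)
    rejected = False
except ValueError:
    rejected = True
```

A real parser would interpret the full binary payload, typically through a PDF library; reading only the header keeps the sketch dependency-free while still showing both requirements: fixing T and implementing parse().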

PDFScraper

PDFScraper(*, client: BasePDFClient)

Bases: BaseScraper

Scraper for PDF sources.

Delegates byte retrieval to a PDF client and normalizes output into Content.

The scraper:

- Does not perform parsing or interpretation
- Does not assume a specific storage backend
- Preserves caller-provided metadata

Initialize the PDF scraper.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| client | BasePDFClient | PDF client responsible for retrieving raw PDF bytes. | required |

fetch

fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch a PDF document from the given source.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| source | Any | Identifier of the PDF source as understood by the configured PDF client. | required |
| metadata | Optional[Mapping[str, Any]] | Optional metadata to attach to the returned content. | None |

Returns:

| Type | Description |
| --- | --- |
| Content | A Content instance containing the raw PDF bytes, the source identifier, the PDF content type, and any optional metadata. |

Raises:

| Type | Description |
| --- | --- |
| Exception | Retrieval-specific errors raised by the PDF client. |
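The delegation described for the scraper (the client retrieves bytes, the scraper wraps them in a Content value without parsing) might be sketched as follows. All names here are hypothetical stand-ins: `PDFClient`, `InMemoryPDFClient`, `SketchPDFScraper`, and this simplified `Content` are written for the example and only assume the behavior documented above.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping, Optional, Protocol


class PDFClient(Protocol):
    # Minimal client interface assumed for this sketch.
    def fetch(self, source: Any) -> bytes: ...


@dataclass
class Content:
    # Stand-in container; the real Content fields may differ.
    raw: bytes
    source: Any
    content_type: str
    metadata: dict = field(default_factory=dict)


class InMemoryPDFClient:
    """Hypothetical client backed by a dict, showing that the backend is swappable."""

    def __init__(self, documents: Mapping[str, bytes]) -> None:
        self._documents = documents

    def fetch(self, source: Any) -> bytes:
        # A missing key raises KeyError, which propagates to the caller.
        return self._documents[source]


class SketchPDFScraper:
    """Stand-in scraper: delegates retrieval and normalizes the result."""

    def __init__(self, *, client: PDFClient) -> None:
        self._client = client

    def fetch(
        self, source: Any, *, metadata: Optional[Mapping[str, Any]] = None
    ) -> Content:
        raw = self._client.fetch(source)  # no parsing or interpretation here
        return Content(
            raw=raw,
            source=source,
            content_type="application/pdf",
            metadata=dict(metadata or {}),  # caller-provided metadata is preserved
        )


client = InMemoryPDFClient({"report": b"%PDF-1.7\n%%EOF\n"})
scraper = SketchPDFScraper(client=client)
content = scraper.fetch("report", metadata={"origin": "unit-test"})
```

Because the scraper only depends on the client's fetch() method, a filesystem, in-memory, or network client can be substituted without changing the scraper.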