Skip to content

Scraper

omniread.pdf.scraper

PDF scraping implementation for OmniRead.

This module provides a PDF-specific scraper that coordinates PDF byte retrieval via a client and normalizes the result into a Content object.

The scraper implements the core BaseScraper contract while delegating all storage and access concerns to a BasePDFClient implementation.

PDFScraper

PDFScraper(*, client: BasePDFClient)

Bases: BaseScraper

Scraper for PDF sources.

Delegates byte retrieval to a PDF client and normalizes output into Content.

The scraper: - Does not perform parsing or interpretation - Does not assume a specific storage backend - Preserves caller-provided metadata

Initialize the PDF scraper.

Parameters:

Name Type Description Default
client BasePDFClient

PDF client responsible for retrieving raw PDF bytes.

required

fetch

fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch a PDF document from the given source.

Parameters:

Name Type Description Default
source Any

Identifier of the PDF source as understood by the configured PDF client.

required
metadata Optional[Mapping[str, Any]]

Optional metadata to attach to the returned content.

None

Returns:

Type Description
Content

A Content instance containing:

Content
  • Raw PDF bytes
Content
  • Source identifier
Content
  • PDF content type
Content
  • Optional metadata

Raises:

Type Description
Exception

Retrieval-specific errors raised by the PDF client.