Scraper

omniread.pdf.scraper

PDF scraping implementation for OmniRead.

This module provides a PDF-specific scraper that coordinates PDF byte retrieval via a client and normalizes the result into a Content object.

The scraper implements the core BaseScraper contract while delegating all storage and access concerns to a BasePDFClient implementation.

PDFScraper(*, client: BasePDFClient)

Scraper for PDF sources.

Delegates byte retrieval to a PDF client and normalizes output into Content.

The scraper:

- Does not perform parsing or interpretation
- Does not assume a specific storage backend
- Preserves caller-provided metadata
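The delegation described above can be illustrated with a small, self-contained sketch. It does not import OmniRead; `PDFClientLike`, `ContentSketch`, `InMemoryPDFClient`, and `PDFScraperSketch` are hypothetical stand-ins, and the client method name `fetch`, the `Content` field names, and the `"application/pdf"` content type value are assumptions rather than the library's confirmed API.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping, Optional, Protocol


class PDFClientLike(Protocol):
    """Hypothetical stand-in for BasePDFClient; the real interface may differ."""

    def fetch(self, source: Any) -> bytes: ...


@dataclass
class ContentSketch:
    """Hypothetical stand-in for Content; the field names are assumptions."""

    raw: bytes
    source: str
    content_type: str
    metadata: Mapping[str, Any] = field(default_factory=dict)


class InMemoryPDFClient:
    """Toy client that serves PDF bytes from a dict instead of real storage."""

    def __init__(self, documents: Mapping[str, bytes]) -> None:
        self._documents = dict(documents)

    def fetch(self, source: Any) -> bytes:
        return self._documents[source]


class PDFScraperSketch:
    """Mirrors the documented behaviour: delegate retrieval, wrap the result."""

    def __init__(self, *, client: PDFClientLike) -> None:
        self._client = client

    def fetch(
        self, source: Any, *, metadata: Optional[Mapping[str, Any]] = None
    ) -> ContentSketch:
        raw = self._client.fetch(source)  # all storage access lives in the client
        return ContentSketch(
            raw=raw,  # bytes are passed through without parsing
            source=str(source),
            content_type="application/pdf",  # assumed content type value
            metadata=dict(metadata or {}),  # caller-provided metadata is kept
        )


client = InMemoryPDFClient({"report.pdf": b"%PDF-1.7 example bytes"})
scraper = PDFScraperSketch(client=client)
content = scraper.fetch("report.pdf", metadata={"owner": "finance"})
print(content.content_type, len(content.raw), content.metadata)
```

Because the scraper only depends on the client's retrieval method, swapping the in-memory client for one backed by a filesystem or object store would not change the scraper in this sketch.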

Initialize the PDF scraper.

Parameters:

Name	Type	Description	Default
`client`	`BasePDFClient`	PDF client responsible for retrieving raw PDF bytes.	required

fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch a PDF document from the given source.

Parameters:

Name	Type	Description	Default
`source`	`Any`	Identifier of the PDF source as understood by the configured PDF client.	required
`metadata`	`Optional[Mapping[str, Any]]`	Optional metadata to attach to the returned content.	`None`

Returns:

Type	Description
`Content`	A `Content` instance containing the raw PDF bytes, the source identifier, the PDF content type, and the optional metadata.

Raises:

Type	Description
`Exception`	Retrieval-specific errors raised by the PDF client.
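Since retrieval errors originate in the client, callers may want to handle them around the `fetch` call. The following is a minimal sketch under stated assumptions: `MissingPDFError`, `failing_fetch`, and `fetch_or_none` are hypothetical names, and `failing_fetch` stands in for a scraper whose client cannot locate the source. The exception types raised in practice depend on the client implementation in use.

```python
from typing import Any, Callable, Mapping, Optional


class MissingPDFError(Exception):
    """Hypothetical retrieval error; real clients define their own types."""


def failing_fetch(
    source: Any, *, metadata: Optional[Mapping[str, Any]] = None
) -> Any:
    """Stand-in for a scraper fetch whose client cannot find the source."""
    raise MissingPDFError(f"no PDF found for {source!r}")


def fetch_or_none(fetch: Callable[..., Any], source: Any) -> Optional[Any]:
    """Call fetch and return None when retrieval fails."""
    try:
        return fetch(source, metadata={"requested_by": "example"})
    except MissingPDFError as exc:
        # Catch the narrowest error type the client is known to raise.
        print(f"retrieval failed: {exc}")
        return None


result = fetch_or_none(failing_fetch, "missing.pdf")
```

Catching a specific error type rather than bare `Exception` keeps unrelated failures visible; which type to catch is determined by the client, not by the scraper.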