Scraper

omniread.core.scraper

Abstract scraping contracts for OmniRead.

This module defines the format-agnostic scraper interface responsible for acquiring raw content from external sources.

Scrapers are responsible for:

- Locating and retrieving raw content bytes
- Attaching minimal contextual metadata
- Returning normalized Content objects

Scrapers are explicitly NOT responsible for:

- Parsing or interpreting content
- Inferring structure or semantics
- Performing content-type-specific processing

All interpretation must be delegated to parsers.
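The Content contract implied above can be sketched as a simple value object. This is a hypothetical sketch for illustration only; the field names and the use of a frozen dataclass are assumptions, and the actual omniread Content class may differ:

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

@dataclass(frozen=True)
class Content:
    """Minimal sketch of the normalized acquisition result.

    Field names are assumptions, not OmniRead's actual definition.
    The object carries raw bytes plus minimal context -- no parsing
    or interpretation has happened yet.
    """
    data: bytes                                   # raw content bytes, untouched
    source: str                                   # location identifier (URL, path, URI)
    metadata: Optional[Mapping[str, Any]] = None  # minimal contextual hints

# A scraper would return something like:
content = Content(data=b"<html>...</html>", source="https://example.com/page")
```

The key property is that `data` is opaque bytes: nothing about the object encodes what the bytes mean, which is what keeps interpretation on the parser side.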

BaseScraper

Bases: ABC

Base interface for all scrapers.

A scraper is responsible ONLY for fetching raw content (bytes) from a source. It must not interpret or parse it.

A scraper is a stateless acquisition component that retrieves raw content from a source and returns it as a Content object.

Scrapers define how content is obtained, not what the content means.

Implementations may vary in:

- Transport mechanism (HTTP, filesystem, cloud storage)
- Authentication strategy
- Retry and backoff behavior

Implementations must not:

- Parse content
- Modify content semantics
- Couple scraping logic to a specific parser
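The retry-and-backoff variation mentioned above stays entirely inside the scraper; the bytes it eventually returns are still handed off unparsed. One way that might look, as a hedged sketch (the `fetch_with_retry` helper is hypothetical, not part of OmniRead):

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def fetch_with_retry(attempt: Callable[[], T], *, retries: int = 3, backoff: float = 0.1) -> T:
    """Hypothetical retry helper a scraper implementation might use.

    Retry policy is a transport concern, so it belongs in the scraper.
    Retries `attempt` up to `retries` times with exponential backoff,
    re-raising the last error if every attempt fails.
    """
    assert retries >= 1, "need at least one attempt"
    last_exc: Optional[Exception] = None
    for i in range(retries):
        try:
            return attempt()
        except Exception as exc:  # real code would catch transport-specific errors
            last_exc = exc
            time.sleep(backoff * (2 ** i))  # exponential backoff between attempts
    raise last_exc

# Usage with a flaky source that succeeds on the third try:
calls = {"n": 0}

def flaky() -> bytes:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return b"payload"

result = fetch_with_retry(flaky, retries=5, backoff=0.0)
```

Because the helper only wraps the acquisition callable, swapping transports or retry policies never touches parsing logic, which is exactly the decoupling the lists above require.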

fetch abstractmethod

fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch raw content from the given source.

Implementations must retrieve the content referenced by source and return it as raw bytes wrapped in a Content object.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| source | str | Location identifier (URL, file path, S3 URI, etc.) | required |
| metadata | Optional[Mapping[str, Any]] | Optional hints for the scraper (headers, auth, etc.) | None |

Returns:

| Type | Description |
| --- | --- |
| Content | Content object containing the raw content bytes, the source identifier, and optional metadata. |

Raises:

| Type | Description |
| --- | --- |
| Exception | Retrieval-specific errors as defined by the implementation. |
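Putting the pieces together, a concrete scraper might look like the following. This is a minimal sketch against an assumed version of the interface: `FileScraper` and the Content field names are hypothetical illustrations, not part of OmniRead's actual API.

```python
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Mapping, Optional

@dataclass(frozen=True)
class Content:
    # Assumed shape of the Content value object; the real class may differ.
    data: bytes
    source: str
    metadata: Optional[Mapping[str, Any]] = None

class BaseScraper(ABC):
    """Sketch of the abstract interface documented above."""

    @abstractmethod
    def fetch(self, source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content:
        """Fetch raw content from the given source."""

class FileScraper(BaseScraper):
    """Hypothetical filesystem transport: reads bytes, never parses them."""

    def fetch(self, source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content:
        data = Path(source).read_bytes()  # raw bytes only -- interpretation is a parser's job
        return Content(data=data, source=source, metadata=metadata)

# Usage: write a temp file, then fetch it back as raw bytes.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as f:
    f.write(b"hello scraper")
    path = f.name

content = FileScraper().fetch(path, metadata={"origin": "local-disk"})
```

Note that the scraper returns the bytes exactly as stored; whether they are HTML, PDF, or plain text is for a downstream parser to decide.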