Scraper
omniread.core.scraper
Abstract scraping contracts for OmniRead.
This module defines the format-agnostic scraper interface responsible for acquiring raw content from external sources.
Scrapers are responsible for:
- Locating and retrieving raw content bytes
- Attaching minimal contextual metadata
- Returning normalized Content objects
Scrapers are explicitly NOT responsible for:
- Parsing or interpreting content
- Inferring structure or semantics
- Performing content-type-specific processing
All interpretation must be delegated to parsers.
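The contract above can be sketched as follows. This is a minimal illustration, not OmniRead's actual source: the fields of `Content` (raw `data` bytes plus `source` and `metadata`) are assumptions made for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class Content:
    """Raw acquired bytes plus minimal context (fields assumed for illustration)."""
    data: bytes
    source: str
    metadata: Mapping[str, Any] = field(default_factory=dict)


class BaseScraper(ABC):
    """Fetches raw bytes from a source; never interprets or parses them."""

    @abstractmethod
    def fetch(self, source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content:
        """Retrieve the content referenced by *source* and wrap it in a Content object."""
```

Because `fetch` is abstract, `BaseScraper` itself cannot be instantiated; each transport (HTTP, filesystem, cloud storage) supplies its own subclass.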
BaseScraper
Bases: ABC
Base interface for all scrapers.
A scraper is a stateless acquisition component responsible only for fetching raw content (bytes) from a source and returning it as a Content object. It must not interpret or parse that content.
Scrapers define how content is obtained, not what the content means.
Implementations may vary in:
- Transport mechanism (HTTP, filesystem, cloud storage)
- Authentication strategy
- Retry and backoff behavior
Implementations must not:
- Parse content
- Modify content semantics
- Couple scraping logic to a specific parser
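A filesystem-backed implementation might look like the sketch below. The `FileScraper` name and the stand-in `Content`/`BaseScraper` definitions are hypothetical, included only so the example is self-contained; note that the scraper reads bytes and attaches context, and does nothing else.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class Content:  # minimal stand-in for OmniRead's Content (fields assumed)
    data: bytes
    source: str
    metadata: Mapping[str, Any] = field(default_factory=dict)


class BaseScraper(ABC):  # stand-in for omniread.core.scraper.BaseScraper
    @abstractmethod
    def fetch(self, source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content: ...


class FileScraper(BaseScraper):
    """Hypothetical scraper: reads raw bytes from the local filesystem."""

    def fetch(self, source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content:
        data = Path(source).read_bytes()  # transport concern only; no interpretation
        return Content(data=data, source=source, metadata=dict(metadata or {}))
```

Swapping the body of `fetch` for an HTTP client or an S3 call changes the transport without touching the contract, which is the point of keeping parsing out of scrapers.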
fetch
abstractmethod
`fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content`
Fetch raw content from the given source.
Implementations must retrieve the content referenced by source
and return it as raw bytes wrapped in a Content object.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `str` | Location identifier (URL, file path, S3 URI, etc.) | *required* |
| `metadata` | `Optional[Mapping[str, Any]]` | Optional hints for the scraper (headers, auth, etc.) | `None` |
Returns:

| Type | Description |
|---|---|
| `Content` | Content object containing raw bytes and metadata. |
Raises:

| Type | Description |
|---|---|
| `Exception` | Retrieval-specific errors as defined by the implementation. |
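Since the interface only promises that retrieval failures surface as implementation-defined exceptions, a concrete scraper can define its own error type. The sketch below is an assumption-laden illustration: `DictScraper` and `SourceNotFoundError` are invented names, and the `Content`/`BaseScraper` stand-ins mirror the assumed contract rather than OmniRead's real definitions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, Mapping, Optional


@dataclass(frozen=True)
class Content:  # stand-in for OmniRead's Content (fields assumed)
    data: bytes
    source: str
    metadata: Mapping[str, Any] = field(default_factory=dict)


class BaseScraper(ABC):  # stand-in for omniread.core.scraper.BaseScraper
    @abstractmethod
    def fetch(self, source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content: ...


class SourceNotFoundError(Exception):
    """Hypothetical retrieval-specific error for an unknown source."""


class DictScraper(BaseScraper):
    """Hypothetical in-memory scraper, useful for tests of downstream parsers."""

    def __init__(self, store: Dict[str, bytes]) -> None:
        self._store = store

    def fetch(self, source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content:
        if source not in self._store:
            # Retrieval error stays a scraper concern; callers decide how to react.
            raise SourceNotFoundError(source)
        return Content(data=self._store[source], source=source, metadata=dict(metadata or {}))
```

Callers that need uniform error handling across transports would catch the broad contract (`Exception`) or each implementation's documented error types.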