Core

omniread.core

Core domain contracts for OmniRead.

This package defines the format-agnostic domain layer of OmniRead. It exposes canonical content models and abstract interfaces that are implemented by format-specific modules (HTML, PDF, etc.).

Public exports from this package are considered stable contracts and are safe for downstream consumers to depend on.

Submodules:
- content: Canonical content models and enums
- parser: Abstract parsing contracts
- scraper: Abstract scraping contracts

Format-specific behavior must not be introduced at this layer.

BaseParser

BaseParser(content: Content)

Bases: ABC, Generic[T]

Base interface for all parsers.

A parser is a self-contained object that owns the Content it is responsible for interpreting.

Implementations must:
- Declare supported content types via supported_types
- Raise parsing-specific exceptions from parse()
- Remain deterministic for a given input

Consumers may rely on:
- Early validation of content compatibility
- Type-stable return values from parse()

Initialize the parser with content to be parsed.

Parameters:

Name     Type     Description                     Default
content  Content  Content instance to be parsed.  required

Raises:

Type        Description
ValueError  If the content type is not supported by this parser.

supported_types class-attribute instance-attribute

supported_types: Set[ContentType] = set()

Set of content types supported by this parser.

An empty set indicates that the parser is content-type agnostic.

parse abstractmethod

parse() -> T

Parse the owned content into structured output.

Implementations must fully consume the provided content and return a deterministic, structured output.

Returns:

Type  Description
T     Parsed, structured representation.

Raises:

Type       Description
Exception  Parsing-specific errors as defined by the implementation.
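The parse() contract above can be illustrated with a small concrete parser. This is a minimal sketch, not OmniRead's own code: the ContentType, Content, and BaseParser definitions below are reduced stand-ins written from the signatures on this page, the None default on content_type is an assumption, and JsonParser and JsonParseError are hypothetical names.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Generic, Optional, Set, TypeVar

T = TypeVar("T")


# Stand-ins for the omniread.core types, reduced to what this example needs.
class ContentType(str, Enum):
    JSON = "application/json"


@dataclass
class Content:
    raw: bytes
    source: str
    content_type: Optional[ContentType] = None  # default value is an assumption


class BaseParser(ABC, Generic[T]):
    supported_types: Set[ContentType] = set()

    def __init__(self, content: Content) -> None:
        self.content = content  # the parser owns its content

    @abstractmethod
    def parse(self) -> T:
        ...


class JsonParseError(Exception):
    """Hypothetical parsing-specific error raised by JsonParser."""


class JsonParser(BaseParser[Dict[str, Any]]):
    supported_types = {ContentType.JSON}

    def parse(self) -> Dict[str, Any]:
        try:
            # json.loads accepts bytes and detects UTF-8/16/32 itself.
            data = json.loads(self.content.raw)
        except ValueError as exc:
            # JSONDecodeError and UnicodeDecodeError both subclass ValueError.
            raise JsonParseError(f"invalid JSON from {self.content.source}") from exc
        if not isinstance(data, dict):
            raise JsonParseError("expected a top-level JSON object")
        return data
```

Because parse() depends only on the owned Content, calling it twice on equal input yields equal output, which is the determinism the contract asks for.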

supports

supports() -> bool

Check whether this parser supports the content's type.

Returns:

Type  Description
bool  True if the content type is supported; False otherwise.
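One way supports() and the constructor's early validation could fit together is sketched below. The class bodies are assumptions reconstructed from the documented behavior (empty supported_types means content-type agnostic; unsupported content raises ValueError at construction), not the library's actual source. How a Content with content_type set to None is treated by a type-restricted parser is not stated on this page; this sketch rejects it. HtmlLengthParser and ByteCountParser are hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Generic, Optional, Set, TypeVar

T = TypeVar("T")


# Reduced stand-ins for the documented types.
class ContentType(str, Enum):
    HTML = "text/html"
    PDF = "application/pdf"


@dataclass
class Content:
    raw: bytes
    source: str
    content_type: Optional[ContentType] = None  # default value is an assumption


class BaseParser(ABC, Generic[T]):
    supported_types: Set[ContentType] = set()

    def __init__(self, content: Content) -> None:
        self.content = content
        # Early validation: fail at construction time, before parse() is called.
        if not self.supports():
            raise ValueError(
                f"{type(self).__name__} does not support {content.content_type!r}"
            )

    def supports(self) -> bool:
        if not self.supported_types:
            return True  # empty set: content-type agnostic
        return self.content.content_type in self.supported_types

    @abstractmethod
    def parse(self) -> T:
        ...


class HtmlLengthParser(BaseParser[int]):
    """Hypothetical parser restricted to HTML content."""

    supported_types = {ContentType.HTML}

    def parse(self) -> int:
        return len(self.content.raw)


class ByteCountParser(BaseParser[int]):
    """Hypothetical agnostic parser: inherits the empty supported_types."""

    def parse(self) -> int:
        return len(self.content.raw)
```

With these definitions, constructing HtmlLengthParser around PDF content raises ValueError immediately, while ByteCountParser accepts any Content.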

BaseScraper

Bases: ABC

Base interface for all scrapers.

A scraper is a stateless acquisition component responsible only for fetching raw content (bytes) from a source and returning it as a Content object. It must not interpret or parse that content.

Scrapers define how content is obtained, not what the content means.

Implementations may vary in:
- Transport mechanism (HTTP, filesystem, cloud storage)
- Authentication strategy
- Retry and backoff behavior

Implementations must not:
- Parse content
- Modify content semantics
- Couple scraping logic to a specific parser

fetch abstractmethod

fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch raw content from the given source.

Implementations must retrieve the content referenced by source and return it as raw bytes wrapped in a Content object.

Parameters:

Name      Type                         Description                                            Default
source    str                          Location identifier (URL, file path, S3 URI, etc.).    required
metadata  Optional[Mapping[str, Any]]  Optional hints for the scraper (headers, auth, etc.).  None

Returns:

Type     Description
Content  Content object containing raw bytes and metadata: raw content bytes, source identifier, and optional metadata.

Raises:

Type       Description
Exception  Retrieval-specific errors as defined by the implementation.
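As an illustration of the fetch() contract, here is a minimal filesystem scraper. It is a sketch under stated assumptions: Content and BaseScraper are reduced stand-ins written from the signatures on this page (content_type is omitted for brevity, and the None default on metadata is assumed), and FileScraper is a hypothetical implementation.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Mapping, Optional


# Reduced stand-in for omniread.core.Content (content_type omitted here).
@dataclass
class Content:
    raw: bytes
    source: str
    metadata: Optional[Mapping[str, Any]] = None  # default value is an assumption


class BaseScraper(ABC):
    @abstractmethod
    def fetch(
        self, source: str, *, metadata: Optional[Mapping[str, Any]] = None
    ) -> Content:
        ...


class FileScraper(BaseScraper):
    """Hypothetical scraper that treats source as a local file path."""

    def fetch(
        self, source: str, *, metadata: Optional[Mapping[str, Any]] = None
    ) -> Content:
        # Read bytes only; no decoding or parsing happens at this layer.
        # A missing or unreadable file surfaces as OSError, which plays the
        # role of the retrieval-specific error here.
        raw = Path(source).read_bytes()
        return Content(
            raw=raw,
            source=source,
            metadata=dict(metadata) if metadata is not None else None,
        )
```

The scraper holds no state between calls and returns the bytes untouched, so any parser can be paired with it later.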

Content dataclass

Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)

Normalized representation of extracted content.

A Content instance represents a raw content payload along with minimal contextual metadata describing its origin and type.

This class is the primary exchange format between:
- Scrapers
- Parsers
- Downstream consumers

Attributes:

Name          Type                         Description
raw           bytes                        Raw content bytes as retrieved from the source.
source        str                          Identifier of the content origin (URL, file path, or logical name).
content_type  Optional[ContentType]        Optional MIME type of the content, if known.
metadata      Optional[Mapping[str, Any]]  Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).
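The attributes above could be used as follows. This is a minimal sketch: the dataclass is a stand-in with the documented field names, the None defaults are assumptions (the signature on this page elides them), and decode_text along with the "encoding" metadata key are hypothetical consumer-side conventions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Mapping, Optional


class ContentType(str, Enum):
    HTML = "text/html"


# Stand-in with the documented fields; both defaults are assumptions.
@dataclass
class Content:
    raw: bytes
    source: str
    content_type: Optional[ContentType] = None
    metadata: Optional[Mapping[str, Any]] = None


def decode_text(content: Content) -> str:
    """Hypothetical helper: decode raw bytes using an optional encoding hint."""
    encoding = (content.metadata or {}).get("encoding", "utf-8")
    return content.raw.decode(encoding)


page = Content(
    raw="<p>café</p>".encode("utf-8"),
    source="https://example.com/menu",
    content_type=ContentType.HTML,
    metadata={"encoding": "utf-8"},
)
text = decode_text(page)
```

Keeping raw as bytes and pushing hints such as the encoding into metadata lets each consumer decide how, or whether, to decode.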

ContentType

Bases: str, Enum

Supported MIME types for extracted content.

This enum represents the declared or inferred media type of the content source. It is primarily used for routing content to the appropriate parser or downstream consumer.

HTML class-attribute instance-attribute

HTML = 'text/html'

HTML document content.

JSON class-attribute instance-attribute

JSON = 'application/json'

JSON document content.

PDF class-attribute instance-attribute

PDF = 'application/pdf'

PDF document content.

XML class-attribute instance-attribute

XML = 'application/xml'

XML document content.
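Because ContentType derives from both str and Enum, members compare equal to their MIME strings and can be looked up by value. The sketch below shows one way that might be used for routing; the enum is reproduced from the values listed above, while content_type_from_header and the HANDLERS registry are hypothetical and not part of OmniRead.

```python
from enum import Enum
from typing import Dict, Optional


class ContentType(str, Enum):
    HTML = "text/html"
    JSON = "application/json"
    PDF = "application/pdf"
    XML = "application/xml"


def content_type_from_header(header: str) -> Optional[ContentType]:
    """Hypothetical helper: map an HTTP Content-Type header to a ContentType."""
    # Drop parameters such as "; charset=utf-8" before the lookup.
    mime = header.split(";", 1)[0].strip().lower()
    try:
        return ContentType(mime)  # lookup by value
    except ValueError:
        return None  # MIME type not covered by the enum


# Hypothetical routing table from content type to a handler name.
HANDLERS: Dict[ContentType, str] = {
    ContentType.HTML: "html_parser",
    ContentType.PDF: "pdf_parser",
}


def route(header: str) -> Optional[str]:
    content_type = content_type_from_header(header)
    if content_type is None:
        return None
    return HANDLERS.get(content_type)
```

Returning None for unknown MIME types keeps the routing decision with the caller instead of raising inside the lookup.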