Core

omniread.core

Core domain contracts for OmniRead.

This package defines the format-agnostic domain layer of OmniRead. It exposes canonical content models and abstract interfaces that are implemented by format-specific modules (HTML, PDF, etc.).

Public exports from this package are considered stable contracts and are safe for downstream consumers to depend on.

Submodules:

- content: Canonical content models and enums
- parser: Abstract parsing contracts
- scraper: Abstract scraping contracts

Format-specific behavior must not be introduced at this layer.

BaseParser

BaseParser(content: Content)

Bases: ABC, Generic[T]

Base interface for all parsers.

A parser is a self-contained object that owns the Content it is responsible for interpreting.

Implementations must:

- Declare supported content types via supported_types
- Raise parsing-specific exceptions from parse()
- Remain deterministic for a given input

Consumers may rely on:

- Early validation of content compatibility
- Type-stable return values from parse()

Initialize the parser with content to be parsed.

Parameters:

Name	Type	Description	Default
`content`	`Content`	Content instance to be parsed.	required

Raises:

Type	Description
`ValueError`	If the content type is not supported by this parser.
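The constructor contract above can be illustrated with a minimal sketch. It does not import omniread.core; instead it defines local stand-ins for Content, ContentType, and BaseParser that follow the documented behavior, with the None defaults on Content assumed. JsonDictParser and the memory:// source strings are hypothetical names used only for illustration.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Any, Generic, Mapping, Optional, Set, TypeVar

T = TypeVar("T")


class ContentType(str, Enum):
    # Stand-in mirroring two of the documented members.
    HTML = "text/html"
    JSON = "application/json"


@dataclass
class Content:
    # Stand-in mirroring the documented fields; the None defaults are assumed.
    raw: bytes
    source: str
    content_type: Optional[ContentType] = None
    metadata: Optional[Mapping[str, Any]] = None


class BaseParser(ABC, Generic[T]):
    # Stand-in that follows the documented contract, not the real implementation.
    supported_types: Set[ContentType] = set()

    def __init__(self, content: Content):
        self.content = content
        # Early validation: reject incompatible content before parse() is called.
        if not self.supports():
            raise ValueError(f"Unsupported content type: {content.content_type}")

    @abstractmethod
    def parse(self) -> T: ...

    def supports(self) -> bool:
        # An empty set means the parser is content-type agnostic.
        if not self.supported_types:
            return True
        return self.content.content_type in self.supported_types


class JsonDictParser(BaseParser[dict]):
    # Hypothetical parser that only accepts JSON content.
    supported_types = {ContentType.JSON}

    def parse(self) -> dict:
        return json.loads(self.content.raw.decode("utf-8"))


payload = Content(
    raw=b'{"title": "OmniRead"}',
    source="memory://example",
    content_type=ContentType.JSON,
)
result = JsonDictParser(payload).parse()

# Constructing the parser with HTML content fails before any parsing happens.
try:
    JsonDictParser(
        Content(raw=b"<p>hi</p>", source="memory://page", content_type=ContentType.HTML)
    )
    rejected = False
except ValueError:
    rejected = True
```

Because validation happens in the constructor, a caller that holds a parser instance can assume its content passed the compatibility check.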

supported_types `class-attribute` `instance-attribute`

supported_types: Set[ContentType] = set()

Set of content types supported by this parser.

An empty set indicates that the parser is content-type agnostic.

parse `abstractmethod`

parse() -> T

Parse the owned content into structured output.

Implementations must fully consume the provided content and return a deterministic, structured output.

Returns:

Type	Description
`T`	Parsed, structured representation.

Raises:

Type	Description
`Exception`	Parsing-specific errors as defined by the implementation.

supports

supports() -> bool

Check whether this parser supports the content's type.

Returns:

Type	Description
`bool`	True if the content type is supported; False otherwise.
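The check described above, combined with the empty-set rule for supported_types, might reduce to something like the following standalone sketch. The free function is_supported is a hypothetical helper, not part of the package, and rejecting an unknown (None) content type for a non-agnostic parser is an assumption rather than documented behavior.

```python
from enum import Enum
from typing import Optional, Set


class ContentType(str, Enum):
    # Stand-in mirroring two of the documented members.
    HTML = "text/html"
    PDF = "application/pdf"


def is_supported(
    supported_types: Set[ContentType], content_type: Optional[ContentType]
) -> bool:
    # An empty set marks a content-type agnostic parser: everything is accepted.
    if not supported_types:
        return True
    # Otherwise the content type must be one of the declared types.
    # Assumption: an unknown (None) content type is rejected here.
    return content_type in supported_types


agnostic_accepts_pdf = is_supported(set(), ContentType.PDF)
html_only_accepts_html = is_supported({ContentType.HTML}, ContentType.HTML)
html_only_accepts_pdf = is_supported({ContentType.HTML}, ContentType.PDF)
html_only_accepts_unknown = is_supported({ContentType.HTML}, None)
```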

BaseScraper

Bases: ABC

Base interface for all scrapers.

A scraper is a stateless acquisition component responsible only for fetching raw content (bytes) from a source and returning it as a Content object. It must not interpret or parse that content.

Scrapers define how content is obtained, not what the content means.

Implementations may vary in:

- Transport mechanism (HTTP, filesystem, cloud storage)
- Authentication strategy
- Retry and backoff behavior

Implementations must not:

- Parse content
- Modify content semantics
- Couple scraping logic to a specific parser

fetch `abstractmethod`

fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch raw content from the given source.

Implementations must retrieve the content referenced by source and return it as raw bytes wrapped in a Content object.

Parameters:

Name	Type	Description	Default
`source`	`str`	Location identifier (URL, file path, S3 URI, etc.)	required
`metadata`	`Optional[Mapping[str, Any]]`	Optional hints for the scraper (headers, auth, etc.)	`None`

Returns:

Type	Description
`Content`	Content object containing the raw content bytes, the source identifier, and optional metadata.

Raises:

Type	Description
`Exception`	Retrieval-specific errors as defined by the implementation.
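As an illustration of the fetch contract, here is a minimal sketch of a filesystem-backed scraper. Content and BaseScraper are local stand-ins written from the signatures above, with content_type simplified to an optional string, and FileScraper is a hypothetical implementation. It reads bytes and attaches the caller's metadata without interpreting the payload.

```python
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Mapping, Optional


@dataclass
class Content:
    # Stand-in for the documented dataclass; the None defaults are assumed,
    # and content_type is a plain string here instead of ContentType.
    raw: bytes
    source: str
    content_type: Optional[str] = None
    metadata: Optional[Mapping[str, Any]] = None


class BaseScraper(ABC):
    # Stand-in written from the documented signature.
    @abstractmethod
    def fetch(
        self, source: str, *, metadata: Optional[Mapping[str, Any]] = None
    ) -> Content: ...


class FileScraper(BaseScraper):
    # Hypothetical scraper: treats source as a local file path.
    def fetch(
        self, source: str, *, metadata: Optional[Mapping[str, Any]] = None
    ) -> Content:
        raw = Path(source).read_bytes()  # raises OSError if the file is unreadable
        # No parsing or decoding: the bytes are passed through untouched.
        return Content(
            raw=raw,
            source=source,
            metadata=dict(metadata) if metadata else None,
        )


# Write a small fixture to a temporary directory and fetch it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.html"
    path.write_bytes(b"<h1>Hello</h1>")
    content = FileScraper().fetch(str(path), metadata={"note": "local fixture"})
```

Keeping the scraper free of parsing logic means the same FileScraper output could be handed to any parser that accepts the content.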

Content `dataclass`

Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)

Normalized representation of extracted content.

A Content instance represents a raw content payload along with minimal contextual metadata describing its origin and type.

This class is the primary exchange format between:

- Scrapers
- Parsers
- Downstream consumers

Attributes:

Name	Type	Description
`raw`	`bytes`	Raw content bytes as retrieved from the source.
`source`	`str`	Identifier of the content origin (URL, file path, or logical name).
`content_type`	`Optional[ContentType]`	Optional MIME type of the content, if known.
`metadata`	`Optional[Mapping[str, Any]]`	Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).
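A short sketch of how a consumer might build and read a Content value. The dataclass below is a local stand-in with assumed None defaults and a plain-string content_type. The "encoding" metadata key and the decode_text helper are hypothetical, chosen to illustrate the "encoding hints" use of metadata mentioned above.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class Content:
    # Stand-in for the documented dataclass; the None defaults are assumed.
    raw: bytes
    source: str
    content_type: Optional[str] = None
    metadata: Optional[Mapping[str, Any]] = None


def decode_text(content: Content, default_encoding: str = "utf-8") -> str:
    # Hypothetical helper: honor an "encoding" hint in metadata when present.
    hints = content.metadata or {}
    return content.raw.decode(hints.get("encoding", default_encoding))


# Latin-1 bytes with an explicit encoding hint in metadata.
page = Content(
    raw=b"caf\xe9",
    source="https://example.com/menu",
    content_type="text/html",
    metadata={"encoding": "latin-1"},
)
text = decode_text(page)

# Only raw and source are supplied; the optional fields fall back to None.
bare = Content(raw=b"plain", source="notes.txt")
bare_text = decode_text(bare)
```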

ContentType

Bases: str, Enum

Supported MIME types for extracted content.

This enum represents the declared or inferred media type of the content source. It is primarily used for routing content to the appropriate parser or downstream consumer.

HTML `class-attribute` `instance-attribute`

HTML = 'text/html'

HTML document content.

JSON `class-attribute` `instance-attribute`

JSON = 'application/json'

JSON document content.

PDF `class-attribute` `instance-attribute`

PDF = 'application/pdf'

PDF document content.

XML `class-attribute` `instance-attribute`

XML = 'application/xml'

XML document content.
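Because ContentType derives from both str and Enum, its members compare equal to their MIME strings and can be looked up by value. The sketch below redefines the enum locally with the four documented members and adds a hypothetical helper, content_type_from_header, that maps an HTTP Content-Type header value to a member for routing purposes.

```python
from enum import Enum
from typing import Optional


class ContentType(str, Enum):
    # Local copy of the four documented members.
    HTML = "text/html"
    JSON = "application/json"
    PDF = "application/pdf"
    XML = "application/xml"


def content_type_from_header(header: str) -> Optional[ContentType]:
    # Hypothetical helper: drop parameters such as "; charset=utf-8",
    # normalize case, then look the MIME type up by value.
    mime = header.split(";", 1)[0].strip().lower()
    try:
        return ContentType(mime)
    except ValueError:
        # MIME type not covered by the enum.
        return None


html = content_type_from_header("text/html; charset=utf-8")
unknown = content_type_from_header("image/png")

# str-based members compare equal to their plain MIME strings.
pdf_matches_string = ContentType.PDF == "application/pdf"
```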
