Core
omniread.core
Core domain contracts for OmniRead.
This package defines the format-agnostic domain layer of OmniRead. It exposes canonical content models and abstract interfaces that are implemented by format-specific modules (HTML, PDF, etc.).
Public exports from this package are considered stable contracts and are safe for downstream consumers to depend on.
Submodules:
- content: Canonical content models and enums
- parser: Abstract parsing contracts
- scraper: Abstract scraping contracts
Format-specific behavior must not be introduced at this layer.
BaseParser
BaseParser(content: Content)
Bases: ABC, Generic[T]
Base interface for all parsers.
A parser is a self-contained object that owns the Content it is responsible for interpreting.
Implementations must:
- Declare supported content types via supported_types
- Raise parsing-specific exceptions from parse()
- Remain deterministic for a given input
Consumers may rely on:
- Early validation of content compatibility
- Type-stable return values from parse()
Initialize the parser with content to be parsed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | Content | Content instance to be parsed. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If the content type is not supported by this parser. |
supported_types
class-attribute
instance-attribute
supported_types: Set[ContentType] = set()
Set of content types supported by this parser.
An empty set indicates that the parser is content-type agnostic.
parse
abstractmethod
parse() -> T
Parse the owned content into structured output.
Implementations must fully consume the provided content and return a deterministic, structured output.
Returns:
| Type | Description |
|---|---|
| T | Parsed, structured representation. |
Raises:
| Type | Description |
|---|---|
| Exception | Parsing-specific errors as defined by the implementation. |
supports
supports() -> bool
Check whether this parser supports the content's type.
Returns:
| Type | Description |
|---|---|
| bool | True if the content type is supported; False otherwise. |
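To make the parser contract concrete, the sketch below subclasses a parser for JSON content. The real classes live in omniread.core, so the block defines small stand-ins for Content, ContentType, and BaseParser that mirror the documented signatures. The constructor check, the treatment of content with no declared type, the None defaults, and the JsonParser name are all assumptions made for illustration and may differ from OmniRead's actual implementation.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Generic, Mapping, Optional, Set, TypeVar
import json

T = TypeVar("T")


class ContentType(str, Enum):
    # Stand-in for omniread.core.ContentType (two members are enough here).
    HTML = "text/html"
    JSON = "application/json"


@dataclass
class Content:
    # Stand-in for omniread.core.Content; the None defaults are an assumption.
    raw: bytes
    source: str
    content_type: Optional[ContentType] = None
    metadata: Optional[Mapping[str, Any]] = None


class BaseParser(ABC, Generic[T]):
    # Stand-in that follows the documented behavior, not the real source code.
    supported_types: Set[ContentType] = set()

    def __init__(self, content: Content):
        self.content = content
        # Early validation: reject incompatible content before parse() is called.
        if not self.supports():
            raise ValueError(f"Unsupported content type: {content.content_type!r}")

    def supports(self) -> bool:
        # Documented: an empty set means the parser is content-type agnostic.
        if not self.supported_types:
            return True
        # Assumption: content with no declared type counts as unsupported.
        return self.content.content_type in self.supported_types

    @abstractmethod
    def parse(self) -> T: ...


class JsonParser(BaseParser[Dict[str, Any]]):
    # Hypothetical implementation used only for this example.
    supported_types = {ContentType.JSON}

    def parse(self) -> Dict[str, Any]:
        # json.loads raises json.JSONDecodeError (a ValueError) on bad input.
        return json.loads(self.content.raw.decode("utf-8"))


doc = Content(
    raw=b'{"title": "hello"}',
    source="memory://example",
    content_type=ContentType.JSON,
)
parsed = JsonParser(doc).parse()
```

Under these assumptions, constructing JsonParser with HTML content raises ValueError immediately, which is the early validation consumers are told they may rely on.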
BaseScraper
Bases: ABC
Base interface for all scrapers.
A scraper is a stateless acquisition component responsible only for fetching raw content (bytes) from a source and returning it as a Content object. It must not interpret or parse that content.
Scrapers define how content is obtained, not what the content means.
Implementations may vary in:
- Transport mechanism (HTTP, filesystem, cloud storage)
- Authentication strategy
- Retry and backoff behavior
Implementations must not:
- Parse content
- Modify content semantics
- Couple scraping logic to a specific parser
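One way to respect these constraints is to let a scraper and a parser meet only at the call site. The following sketch is illustrative rather than part of OmniRead: Content is a reduced stand-in, and InMemoryScraper, decode_text, and read are hypothetical names. The scraper holds only read-only fixture data so that the example runs without network or disk access.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Mapping, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Content:
    # Reduced stand-in for omniread.core.Content (raw bytes plus origin only).
    raw: bytes
    source: str


class InMemoryScraper:
    """Hypothetical scraper that looks sources up in a fixed dict."""

    def __init__(self, pages: Dict[str, bytes]):
        self._pages = pages

    def fetch(self, source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content:
        # No decoding or parsing here; the bytes pass through untouched.
        return Content(raw=self._pages[source], source=source)


def decode_text(content: Content) -> str:
    # Interpretation lives outside the scraper.
    return content.raw.decode("utf-8")


def read(fetch: Callable[[str], Content], parse: Callable[[Content], T], source: str) -> T:
    # The only place where acquisition and interpretation are combined.
    return parse(fetch(source))


scraper = InMemoryScraper({"memory://greeting": b"hello"})
text = read(scraper.fetch, decode_text, "memory://greeting")
```

Because read accepts any fetch and parse callables, either side can be swapped without touching the other.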
fetch
abstractmethod
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content
Fetch raw content from the given source.
Implementations must retrieve the content referenced by source
and return it as raw bytes wrapped in a Content object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | str | Location identifier (URL, file path, S3 URI, etc.) | required |
| metadata | Optional[Mapping[str, Any]] | Optional hints for the scraper (headers, auth, etc.) | None |
Returns:
| Type | Description |
|---|---|
| Content | Content object containing raw bytes and metadata. |
Raises:
| Type | Description |
|---|---|
| Exception | Retrieval-specific errors as defined by the implementation. |
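As an illustration of the fetch() contract, here is a minimal sketch of a filesystem-backed scraper. BaseScraper and Content are stand-ins that follow the documented signatures, FileScraper is a hypothetical name, and copying the caller's metadata onto the returned Content is an assumption, since the documentation only describes metadata as optional hints.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Mapping, Optional
import tempfile


@dataclass
class Content:
    # Stand-in for omniread.core.Content; content_type is left out for brevity.
    raw: bytes
    source: str
    metadata: Optional[Mapping[str, Any]] = None


class BaseScraper(ABC):
    # Stand-in exposing the documented fetch() signature.
    @abstractmethod
    def fetch(self, source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content: ...


class FileScraper(BaseScraper):
    """Hypothetical scraper that reads bytes from the local filesystem."""

    def fetch(self, source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content:
        # Path.read_bytes raises OSError subclasses (e.g. FileNotFoundError),
        # which act as the retrieval-specific errors in this sketch.
        raw = Path(source).read_bytes()
        # Assumption: caller-supplied hints are copied onto the result unchanged.
        return Content(raw=raw, source=source, metadata=dict(metadata or {}))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "page.html"
    path.write_bytes(b"<h1>Hi</h1>")
    content = FileScraper().fetch(str(path), metadata={"note": "local copy"})
```

The scraper keeps no state between calls and never inspects the bytes it returns, so the same instance can serve any parser.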
Content
dataclass
Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)
Normalized representation of extracted content.
A Content instance represents a raw content payload along with minimal
contextual metadata describing its origin and type.
This class is the primary exchange format between:
- Scrapers
- Parsers
- Downstream consumers
Attributes:
| Name | Type | Description |
|---|---|---|
| raw | bytes | Raw content bytes as retrieved from the source. |
| source | str | Identifier of the content origin (URL, file path, or logical name). |
| content_type | Optional[ContentType] | Optional MIME type of the content, if known. |
| metadata | Optional[Mapping[str, Any]] | Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes). |
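The attributes above can be exercised with a short sketch. The dataclass here is a stand-in whose field names and types follow the documented signature. The documentation elides the default values, so the None defaults are an assumption, as is the "encoding" metadata key.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Mapping, Optional


class ContentType(str, Enum):
    # Stand-in with a single member; the documented enum defines more.
    HTML = "text/html"


@dataclass
class Content:
    # Defaults are elided in the docs; None is an assumption for this sketch.
    raw: bytes
    source: str
    content_type: Optional[ContentType] = None
    metadata: Optional[Mapping[str, Any]] = None


page = Content(
    raw=b"<html><body>Hello</body></html>",
    source="https://example.com/index.html",
    content_type=ContentType.HTML,
    metadata={"encoding": "utf-8"},  # hypothetical encoding hint
)

# The payload stays as bytes; decoding is left to whoever consumes it.
encoding = (page.metadata or {}).get("encoding", "utf-8")
text = page.raw.decode(encoding)
```

Keeping raw as bytes means a scraper never has to guess an encoding, and a parser can use whatever hints the metadata carries.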
ContentType
Bases: str, Enum
Supported MIME types for extracted content.
This enum represents the declared or inferred media type of the content source. It is primarily used for routing content to the appropriate parser or downstream consumer.
HTML
class-attribute
instance-attribute
HTML = 'text/html'
HTML document content.
JSON
class-attribute
instance-attribute
JSON = 'application/json'
JSON document content.
PDF
class-attribute
instance-attribute
PDF = 'application/pdf'
PDF document content.
XML
class-attribute
instance-attribute
XML = 'application/xml'
XML document content.
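Since ContentType is described as a routing key, a small dispatch sketch may help. The enum below reproduces the four documented members as a stand-in, while HANDLERS and route are hypothetical names. Real code would more likely map each member to a parser class than to a lambda.

```python
from enum import Enum
from typing import Callable, Dict


class ContentType(str, Enum):
    # Stand-in reproducing the four documented members.
    HTML = "text/html"
    JSON = "application/json"
    PDF = "application/pdf"
    XML = "application/xml"


# Hypothetical handlers that only report the payload size.
HANDLERS: Dict[ContentType, Callable[[bytes], str]] = {
    ContentType.HTML: lambda raw: f"html:{len(raw)}",
    ContentType.JSON: lambda raw: f"json:{len(raw)}",
}


def route(mime: str, raw: bytes) -> str:
    # Lookup by value turns a MIME string into a member (ValueError if unknown).
    content_type = ContentType(mime)
    handler = HANDLERS.get(content_type)
    if handler is None:
        raise LookupError(f"No handler registered for {content_type.value}")
    return handler(raw)


result = route("application/json", b"{}")
# Because of the str mixin, members compare equal to their MIME strings.
same = ContentType.HTML == "text/html"
```

The str base class is what lets a member be compared with, or looked up from, a plain MIME string such as an HTTP Content-Type value that has already been stripped of parameters.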