Content
omniread.core.content
Canonical content models for OmniRead.
This module defines the format-agnostic content representation used across all parsers and scrapers in OmniRead.
The models defined here represent what was extracted, not how it was retrieved or parsed. Format-specific behavior and metadata must not alter the semantic meaning of these models.
Content
dataclass
Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)
Normalized representation of extracted content.
A Content instance represents a raw content payload along with minimal contextual metadata describing its origin and type.
This class is the primary exchange format between:
- Scrapers
- Parsers
- Downstream consumers
Attributes:
| Name | Type | Description |
|---|---|---|
| `raw` | `bytes` | Raw content bytes as retrieved from the source. |
| `source` | `str` | Identifier of the content origin (URL, file path, or logical name). |
| `content_type` | `Optional[ContentType]` | Optional MIME type of the content, if known. |
| `metadata` | `Optional[Mapping[str, Any]]` | Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes). |
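The attributes above can be illustrated with a minimal stand-in for the dataclass. This is a sketch that mirrors the documented signature rather than importing the real `omniread.core.content` module. The defaults for `content_type` and `metadata` are elided in the signature, so `None` is assumed here, and the URL and metadata values are made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Mapping, Optional


class ContentType(str, Enum):
    """Local copy of the documented enum, so the sketch runs on its own."""

    HTML = "text/html"
    JSON = "application/json"
    PDF = "application/pdf"
    XML = "application/xml"


@dataclass
class Content:
    """Stand-in mirroring the documented fields of Content."""

    raw: bytes
    source: str
    # Assumption: the real defaults are elided ("...") in the docs; None is a guess.
    content_type: Optional[ContentType] = None
    metadata: Optional[Mapping[str, Any]] = None


# A scraper would typically fill in every field it knows about.
page = Content(
    raw=b"<html><body>Hello</body></html>",
    source="https://example.com/index.html",  # hypothetical URL
    content_type=ContentType.HTML,
    metadata={"encoding": "utf-8"},  # hypothetical encoding hint
)

# Only raw and source are required; the optional fields may be left unset.
unknown = Content(raw=b"\x00\x01", source="blob-42")
```

Keeping `raw` as `bytes` leaves decoding to the parser, which fits the stated goal of describing what was extracted rather than how it was retrieved or parsed.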
ContentType
Bases: str, Enum
Supported MIME types for extracted content.
This enum represents the declared or inferred media type of the content source. It is primarily used for routing content to the appropriate parser or downstream consumer.
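One way to picture the routing role is a dispatch table keyed by `ContentType`. The sketch below is only an illustration under assumptions: `parse_json`, `parse_text`, `PARSERS`, and `route` are hypothetical names, not part of OmniRead, and the enum is redefined locally so the example runs on its own.

```python
import json
from enum import Enum
from typing import Any, Callable, Dict, Optional


class ContentType(str, Enum):
    """Local copy of the documented enum."""

    HTML = "text/html"
    JSON = "application/json"
    PDF = "application/pdf"
    XML = "application/xml"


def parse_json(raw: bytes) -> Any:
    """Hypothetical parser: decode UTF-8 bytes and load them as JSON."""
    return json.loads(raw.decode("utf-8"))


def parse_text(raw: bytes) -> str:
    """Hypothetical parser: return markup as text, replacing bad bytes."""
    return raw.decode("utf-8", errors="replace")


# Hypothetical registry mapping each media type to a parser callable.
PARSERS: Dict[ContentType, Callable[[bytes], Any]] = {
    ContentType.JSON: parse_json,
    ContentType.HTML: parse_text,
    ContentType.XML: parse_text,
}


def route(raw: bytes, content_type: Optional[ContentType]) -> Any:
    """Send raw bytes to the parser registered for content_type."""
    if content_type is None or content_type not in PARSERS:
        raise ValueError(f"no parser registered for {content_type!r}")
    return PARSERS[content_type](raw)


result = route(b'{"title": "OmniRead"}', ContentType.JSON)
```

Because the content type is optional, a router of this kind has to decide what to do when it is missing; this sketch raises, but falling back to sniffing the payload would be an equally reasonable choice.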
HTML
class-attribute
instance-attribute
HTML = 'text/html'
HTML document content.
JSON
class-attribute
instance-attribute
JSON = 'application/json'
JSON document content.
PDF
class-attribute
instance-attribute
PDF = 'application/pdf'
PDF document content.
XML
class-attribute
instance-attribute
XML = 'application/xml'
XML document content.
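Since `ContentType` derives from both `str` and `Enum`, members compare equal to their MIME strings and can be looked up by value. The sketch below shows this using a local copy of the enum. The `from_header` helper is hypothetical: it assumes callers want to strip parameters such as `charset` from an HTTP `Content-Type` header and to get `None` for unsupported types.

```python
from enum import Enum
from typing import Optional


class ContentType(str, Enum):
    """Local copy of the documented enum."""

    HTML = "text/html"
    JSON = "application/json"
    PDF = "application/pdf"
    XML = "application/xml"


def from_header(value: str) -> Optional[ContentType]:
    """Hypothetical helper: map a Content-Type header to a member, or None."""
    # Drop parameters like "; charset=utf-8" and normalize case.
    mime = value.split(";", 1)[0].strip().lower()
    try:
        return ContentType(mime)  # lookup by value
    except ValueError:
        return None  # media type not covered by the enum


detected = from_header("Text/HTML; charset=utf-8")
unsupported = from_header("image/png")

# The str mixin makes members directly comparable to plain strings.
is_pdf = ContentType.PDF == "application/pdf"
```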