Content

omniread.core.content

Canonical content models for OmniRead.

This module defines the format-agnostic content representation used across all parsers and scrapers in OmniRead.

The models defined here represent what was extracted, not how it was retrieved or parsed. Format-specific behavior and metadata must not alter the semantic meaning of these models.

Content dataclass

Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)

Normalized representation of extracted content.

A Content instance represents a raw content payload along with minimal contextual metadata describing its origin and type.

This class is the primary exchange format between:

- Scrapers
- Parsers
- Downstream consumers

Attributes:

raw (bytes): Raw content bytes as retrieved from the source.

source (str): Identifier of the content origin (URL, file path, or logical name).

content_type (Optional[ContentType]): Optional MIME type of the content, if known.

metadata (Optional[Mapping[str, Any]]): Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).
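
As an illustration of how these attributes fit together, the sketch below builds a Content value the way a scraper might and reads it back the way a parser might. It uses local stand-in definitions so that it runs without OmniRead installed. The `None` defaults, the `encoding` metadata key, and the `decode_text` helper are assumptions made for this example and are not taken from the library.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Mapping, Optional


class ContentType(str, Enum):
    """Local stand-in for omniread.core.content.ContentType."""

    HTML = "text/html"
    JSON = "application/json"
    PDF = "application/pdf"
    XML = "application/xml"


@dataclass
class Content:
    """Local stand-in with the documented fields.

    Assumption: both optional fields default to None. The reference
    page elides the real defaults.
    """

    raw: bytes
    source: str
    content_type: Optional[ContentType] = None
    metadata: Optional[Mapping[str, Any]] = None


# What a scraper might hand over after fetching a page.
page = Content(
    raw=b"<html><body><h1>Hello</h1></body></html>",
    source="https://example.com/",
    content_type=ContentType.HTML,
    metadata={"encoding": "utf-8"},  # hypothetical encoding hint
)


def decode_text(content: Content, default_encoding: str = "utf-8") -> str:
    """Hypothetical consumer-side helper: decode raw bytes to text.

    Looks for an "encoding" key in metadata (an assumed convention)
    and falls back to default_encoding when it is absent.
    """
    encoding = (content.metadata or {}).get("encoding", default_encoding)
    return content.raw.decode(encoding)


text = decode_text(page)
```

Because `raw` stays as bytes, the decision about how to decode or parse it is left to the consumer, which matches the description of Content as a record of what was extracted.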

ContentType

Bases: str, Enum

Supported MIME types for extracted content.

This enum represents the declared or inferred media type of the content source. It is primarily used for routing content to the appropriate parser or downstream consumer.

HTML class-attribute instance-attribute

HTML = 'text/html'

HTML document content.

JSON class-attribute instance-attribute

JSON = 'application/json'

JSON document content.

PDF class-attribute instance-attribute

PDF = 'application/pdf'

PDF document content.

XML class-attribute instance-attribute

XML = 'application/xml'

XML document content.
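
The routing role described for ContentType can be sketched with a small dispatch table. The enum below is a local stand-in that copies the four documented members. The `PARSERS` table, the `route` function, and the two parser functions are hypothetical names introduced for this example, not part of OmniRead.

```python
import json
from enum import Enum
from typing import Any, Callable, Dict, Optional


class ContentType(str, Enum):
    """Local stand-in copying the documented members and values."""

    HTML = "text/html"
    JSON = "application/json"
    PDF = "application/pdf"
    XML = "application/xml"


def parse_json(raw: bytes) -> Any:
    """Hypothetical parser: decode a JSON payload."""
    return json.loads(raw.decode("utf-8"))


def parse_html(raw: bytes) -> str:
    """Hypothetical parser: return the markup as text, unparsed."""
    return raw.decode("utf-8")


# Hypothetical dispatch table keyed by content type.
PARSERS: Dict[ContentType, Callable[[bytes], Any]] = {
    ContentType.JSON: parse_json,
    ContentType.HTML: parse_html,
}


def route(raw: bytes, content_type: Optional[ContentType]) -> Any:
    """Send raw bytes to the parser registered for content_type.

    Raises LookupError when the type is unknown (None) or when no
    parser is registered for it.
    """
    if content_type is None or content_type not in PARSERS:
        raise LookupError(f"no parser registered for {content_type!r}")
    return PARSERS[content_type](raw)


# A MIME string (for example, from a response header with any
# parameters already stripped) maps back to a member by value.
declared = ContentType("application/json")
result = route(b'{"title": "OmniRead"}', declared)
```

Since the enum also derives from `str`, a member compares equal to its plain MIME string, so `ContentType.PDF == "application/pdf"` holds. That makes it convenient to compare against values read from headers or configuration.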