Content

omniread.core.content

Canonical content models for OmniRead.

This module defines the format-agnostic content representation used across all parsers and scrapers in OmniRead.

The models defined here represent what was extracted, not how it was retrieved or parsed. Format-specific behavior and metadata must not alter the semantic meaning of these models.

Content `dataclass`

Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)

Normalized representation of extracted content.

A Content instance represents a raw content payload along with minimal contextual metadata describing its origin and type.

This class is the primary exchange format between:

- Scrapers
- Parsers
- Downstream consumers

Attributes:

Name	Type	Description
`raw`	`bytes`	Raw content bytes as retrieved from the source.
`source`	`str`	Identifier of the content origin (URL, file path, or logical name).
`content_type`	`Optional[ContentType]`	Optional MIME type of the content, if known.
`metadata`	`Optional[Mapping[str, Any]]`	Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).
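As a rough illustration of the attributes above, the following sketch builds two `Content` instances. The classes here are local stand-ins that mirror the documented signature so the snippet runs on its own; real code would import them from `omniread.core.content`. The `None` defaults are an assumption, since the signature above elides them as `...`, and the source strings and metadata keys are made up for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Mapping, Optional


# Stand-in for omniread.core.content.ContentType (values as documented).
class ContentType(str, Enum):
    HTML = "text/html"
    JSON = "application/json"
    PDF = "application/pdf"
    XML = "application/xml"


# Stand-in for omniread.core.content.Content.
@dataclass
class Content:
    raw: bytes
    source: str
    # Assumed defaults: the documented signature shows "..." for both.
    content_type: Optional[ContentType] = None
    metadata: Optional[Mapping[str, Any]] = None


# A fully described payload, as a scraper might produce it.
page = Content(
    raw=b"<html><body>Hello</body></html>",
    source="https://example.com/index.html",
    content_type=ContentType.HTML,
    metadata={"encoding": "utf-8"},
)

# Only raw and source are required by the documented signature.
blob = Content(raw=b"\x00\x01", source="upload-42")
```

Note that `raw` stays as bytes; decoding is left to whichever parser or consumer receives the instance, which fits the module's goal of describing what was extracted rather than how it is interpreted.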

ContentType

Bases: str, Enum

Supported MIME types for extracted content.

This enum represents the declared or inferred media type of the content source. It is primarily used for routing content to the appropriate parser or downstream consumer.
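The routing role described above could look roughly like the dispatch table below. This is a sketch only: `PARSERS`, `route`, `parse_json`, and `parse_html` are hypothetical names invented for the example, not part of OmniRead, and the enum is a local stand-in with the documented values.

```python
import json
from enum import Enum
from typing import Any, Callable, Dict, Optional


# Stand-in for omniread.core.content.ContentType (values as documented).
class ContentType(str, Enum):
    HTML = "text/html"
    JSON = "application/json"
    PDF = "application/pdf"
    XML = "application/xml"


def parse_json(raw: bytes) -> Any:
    """Hypothetical parser: decode the bytes and load them as JSON."""
    return json.loads(raw.decode("utf-8"))


def parse_html(raw: bytes) -> str:
    """Hypothetical parser: return the markup as text, unparsed."""
    return raw.decode("utf-8", errors="replace")


# Hypothetical registry mapping each media type to a parser callable.
PARSERS: Dict[ContentType, Callable[[bytes], Any]] = {
    ContentType.JSON: parse_json,
    ContentType.HTML: parse_html,
}


def route(raw: bytes, content_type: Optional[ContentType]) -> Any:
    """Pick a parser by content type; fail loudly if none is registered."""
    parser = PARSERS.get(content_type) if content_type is not None else None
    if parser is None:
        raise ValueError(f"no parser registered for {content_type!r}")
    return parser(raw)
```

Keying the table on enum members rather than raw strings means a typo in a media type surfaces as an `AttributeError` where the table is defined, not as a silent miss at lookup time.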

HTML `class-attribute` `instance-attribute`

HTML = 'text/html'

HTML document content.

JSON `class-attribute` `instance-attribute`

JSON = 'application/json'

JSON document content.

PDF `class-attribute` `instance-attribute`

PDF = 'application/pdf'

PDF document content.

XML `class-attribute` `instance-attribute`

XML = 'application/xml'

XML document content.
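Because `ContentType` derives from both `str` and `Enum`, its members compare equal to plain MIME strings and can be looked up by value. The sketch below uses a local stand-in enum with the documented values; `from_header` is a hypothetical helper, not an OmniRead function, showing one way a `Content-Type` header value might be mapped to a member.

```python
from enum import Enum
from typing import Optional


# Stand-in for omniread.core.content.ContentType (values as documented).
class ContentType(str, Enum):
    HTML = "text/html"
    JSON = "application/json"
    PDF = "application/pdf"
    XML = "application/xml"


def from_header(value: str) -> Optional[ContentType]:
    """Hypothetical helper: map a Content-Type header value to a member.

    Parameters such as "; charset=utf-8" are dropped before the lookup.
    Media types that are not enum values yield None.
    """
    media_type = value.split(";", 1)[0].strip().lower()
    try:
        return ContentType(media_type)  # lookup by value
    except ValueError:
        return None


# Members are str instances, so they compare equal to plain strings.
is_html = ContentType.HTML == "text/html"

detected = from_header("Application/JSON; charset=utf-8")
unknown = from_header("image/png")
```

Returning `None` for unrecognized media types lines up with the optional `content_type` field on `Content`, although how OmniRead itself handles unknown types is not stated in this page.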
