omniread

OmniRead — format-agnostic content acquisition and parsing framework.

OmniRead provides a cleanly layered architecture for fetching, parsing, and normalizing content from heterogeneous sources such as HTML documents and PDF files.

The library is structured around three core concepts:

  1. Content: a canonical, format-agnostic container representing raw content bytes and minimal contextual metadata.

  2. Scrapers: components responsible for acquiring raw content from a source (HTTP, filesystem, object storage, etc.). Scrapers never interpret content.

  3. Parsers: components responsible for interpreting acquired content and converting it into structured, typed representations.

OmniRead deliberately separates these responsibilities to ensure:
  • Clear boundaries between IO and interpretation
  • Replaceable implementations per format
  • Predictable, testable behavior


Installation

Install OmniRead using pip:

pip install omniread

Or with Poetry:

poetry add omniread

Basic Usage

HTML example:

from omniread import HTMLScraper, HTMLParser

scraper = HTMLScraper()
content = scraper.fetch("https://example.com")

class TitleParser(HTMLParser[str]):
    def parse(self) -> str:
        # <title> or its text may be absent; fall back to an empty string
        title = self._soup.title
        if title is None or title.string is None:
            return ""
        return title.string

parser = TitleParser(content)
title = parser.parse()

PDF example:

from omniread import FileSystemPDFClient, PDFScraper, PDFParser
from pathlib import Path

client = FileSystemPDFClient()
scraper = PDFScraper(client=client)
content = scraper.fetch(Path("document.pdf"))

class TextPDFParser(PDFParser[str]):
    def parse(self) -> str:
        # implement PDF text extraction
        ...

parser = TextPDFParser(content)
result = parser.parse()

Public API Surface

This module re-exports the recommended public entry points of OmniRead.

Consumers are encouraged to import from this namespace rather than from format-specific submodules directly, unless advanced customization is required.

Core:
  • Content
  • ContentType

HTML:
  • HTMLScraper
  • HTMLParser

PDF:
  • FileSystemPDFClient
  • PDFScraper
  • PDFParser

Core Philosophy

OmniRead is designed as a decoupled content engine:

  1. Separation of Concerns: Scrapers fetch, Parsers interpret. Neither knows about the other.
  2. Normalized Exchange: All components communicate via the Content model, ensuring a consistent contract.
  3. Format Agnosticism: The core logic is independent of whether the input is HTML, PDF, or JSON.

Documentation Design

For those extending OmniRead, follow these "AI-Native" docstring principles:

For Humans
  • Clear Contracts: Explicitly state what a component is and is NOT responsible for.
  • Runnable Examples: Include small, logical snippets in the package __init__.py.
For LLMs
  • Structured Models: Use dataclasses and enums for core data to ensure clean MCP JSON representation.
  • Type Safety: All public APIs must be fully typed and have corresponding .pyi stubs.
  • Detailed Raises: Document each exception as an exception-type: description pair in the Raises section to help agents handle errors gracefully.
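Applied to a trivial helper, these principles might look like the following sketch. The function and its behavior are illustrative only, not part of OmniRead:

```python
def decode_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode a raw content payload into text.

    This helper is responsible only for decoding; it does NOT parse or
    interpret the resulting text.

    Args:
        raw: Raw content bytes.
        encoding: Character encoding to use.

    Returns:
        The decoded text.

    Raises:
        ValueError: If the payload is empty.
        UnicodeDecodeError: If the bytes are invalid for the encoding.
    """
    if not raw:
        raise ValueError("empty payload")
    return raw.decode(encoding)
```

Note how the first paragraph states a clear contract (what the helper is NOT responsible for) and each entry under Raises pairs an exception type with a description.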

Content dataclass

Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)

Normalized representation of extracted content.

A Content instance represents a raw content payload along with minimal contextual metadata describing its origin and type.

This class is the primary exchange format between:
  • Scrapers
  • Parsers
  • Downstream consumers

Attributes:

  • raw (bytes): Raw content bytes as retrieved from the source.
  • source (str): Identifier of the content origin (URL, file path, or logical name).
  • content_type (Optional[ContentType]): Optional MIME type of the content, if known.
  • metadata (Optional[Mapping[str, Any]]): Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes).
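The shape of the dataclass can be sketched with a minimal self-contained stand-in; the real class is omniread.Content, whose content_type is a ContentType enum rather than the plain str used here:

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


# Minimal stand-in mirroring the documented Content fields; the real
# class is `omniread.Content` and uses the ContentType enum.
@dataclass(frozen=True)
class Content:
    raw: bytes                                    # payload as fetched
    source: str                                   # URL, path, or logical name
    content_type: Optional[str] = None            # MIME type, if known
    metadata: Optional[Mapping[str, Any]] = None  # headers, hints, notes


content = Content(
    raw=b"<html><title>Hi</title></html>",
    source="https://example.com",
    content_type="text/html",
    metadata={"encoding": "utf-8"},
)
```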

ContentType

Bases: str, Enum

Supported MIME types for extracted content.

This enum represents the declared or inferred media type of the content source. It is primarily used for routing content to the appropriate parser or downstream consumer.

HTML class-attribute instance-attribute

HTML = 'text/html'

HTML document content.

JSON class-attribute instance-attribute

JSON = 'application/json'

JSON document content.

PDF class-attribute instance-attribute

PDF = 'application/pdf'

PDF document content.

XML class-attribute instance-attribute

XML = 'application/xml'

XML document content.
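Because the enum inherits from str, members compare equal to their MIME strings and can be looked up by value, which is what makes routing straightforward. A self-contained stand-in with the documented values (the real enum is omniread.ContentType):

```python
from enum import Enum


# Stand-in with the documented values; the real enum is
# `omniread.ContentType`.
class ContentType(str, Enum):
    HTML = "text/html"
    JSON = "application/json"
    PDF = "application/pdf"
    XML = "application/xml"


def route(mime: str) -> ContentType:
    """Map a raw MIME string onto the enum for parser dispatch."""
    return ContentType(mime)  # value lookup; raises ValueError if unknown
```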

FileSystemPDFClient

Bases: BasePDFClient

PDF client that reads from the local filesystem.

This client reads PDF files directly from the disk and returns their raw binary contents.

fetch

fetch(path: Path) -> bytes

Read a PDF file from the local filesystem.

Parameters:

  • path (Path, required): Filesystem path to the PDF file.

Returns:

  • bytes: Raw PDF bytes.

Raises:

  • FileNotFoundError: If the path does not exist.
  • ValueError: If the path exists but is not a file.
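Calling code can branch on the two documented failure modes. A sketch, where client is any object exposing the fetch signature above; load_pdf_bytes is illustrative, not part of OmniRead:

```python
from pathlib import Path
from typing import Optional


def load_pdf_bytes(client, path: Path) -> Optional[bytes]:
    """Return raw PDF bytes, or None when the source is unusable."""
    try:
        return client.fetch(path)
    except FileNotFoundError:
        # path does not exist
        return None
    except ValueError:
        # path exists but is not a file (e.g. a directory)
        return None
```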

HTMLParser

HTMLParser(content: Content, features: str = 'html.parser')

Bases: BaseParser[T], Generic[T]

Base HTML parser.

This class extends the core BaseParser with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers.

Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type.

Characteristics:
  • Accepts only HTML content
  • Owns a parsed BeautifulSoup DOM tree
  • Provides pure helper utilities for common HTML structures

Concrete subclasses must:
  • Define the output type T
  • Implement the parse() method

Initialize the HTML parser.

Parameters:

  • content (Content, required): HTML content to be parsed.
  • features (str, default 'html.parser'): BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').

Raises:

  • ValueError: If the content is empty or not valid HTML.

supported_types class-attribute instance-attribute

supported_types: set[ContentType] = {HTML}

Set of content types supported by this parser (HTML only).

parse abstractmethod

parse() -> T

Fully parse the HTML content into structured output.

Implementations must fully interpret the HTML DOM and return a deterministic, structured output.

Returns:

  • T: Parsed representation of type T.

parse_div staticmethod

parse_div(div: Tag, *, separator: str = ' ') -> str

Extract normalized text from a <div> element.

Parameters:

  • div (Tag, required): BeautifulSoup tag representing a <div>.
  • separator (str, default ' '): String used to separate text nodes.

Returns:

  • str: Flattened, whitespace-normalized text content.

parse_link

parse_link(a: Tag) -> Optional[str]

Extract the hyperlink reference from an <a> element.

Parameters:

  • a (Tag, required): BeautifulSoup tag representing an anchor.

Returns:

  • Optional[str]: The value of the href attribute, or None if absent.

parse_meta

parse_meta() -> dict[str, Any]

Extract high-level metadata from the HTML document.

This includes:
  • Document title
  • <meta> tag name/property → content mappings

Returns:

  • dict[str, Any]: Dictionary containing extracted metadata.

parse_table staticmethod

parse_table(table: Tag) -> list[list[str]]

Parse an HTML table into a 2D list of strings.

Parameters:

  • table (Tag, required): BeautifulSoup tag representing a <table>.

Returns:

  • list[list[str]]: A list of rows, where each row is a list of cell text values.
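The expected row/cell shape can be illustrated with a stdlib-only sketch. The real helper operates on a BeautifulSoup Tag; here the stdlib html.parser.HTMLParser (unrelated to omniread's HTMLParser) stands in so the example is self-contained:

```python
from html.parser import HTMLParser as StdHTMLParser
from typing import List, Optional


class _TableText(StdHTMLParser):
    """Collect <tr>/<td>/<th> text into rows of cells."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag: str, attrs) -> None:
        if tag == "tr":
            self.rows.append([])        # start a new row
        elif tag in ("td", "th"):
            self._cell = []             # start collecting cell text

    def handle_endtag(self, tag: str) -> None:
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data: str) -> None:
        if self._cell is not None:
            self._cell.append(data)


def parse_table(html: str) -> List[List[str]]:
    p = _TableText()
    p.feed(html)
    return p.rows
```

For example, a two-row table with header cells "h1", "h2" and data cells "a", "b" yields [["h1", "h2"], ["a", "b"]].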

supports

supports() -> bool

Check whether this parser supports the content's type.

Returns:

  • bool: True if the content type is supported; False otherwise.

HTMLScraper

HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)

Bases: BaseScraper

Base HTML scraper using httpx.

This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a Content object.

Fetches raw bytes and metadata only. The scraper:
  • Uses httpx.Client for HTTP requests
  • Enforces an HTML content type
  • Preserves HTTP response metadata

The scraper does not:
  • Parse HTML
  • Perform retries or backoff
  • Handle non-HTML responses

Initialize the HTML scraper.

Parameters:

  • client (Optional[httpx.Client], default None): Optional pre-configured httpx.Client. If omitted, a client is created internally.
  • timeout (float, default 15.0): Request timeout in seconds.
  • headers (Optional[Mapping[str, str]], default None): Optional default HTTP headers.
  • follow_redirects (bool, default True): Whether to follow HTTP redirects.

fetch

fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch an HTML document from the given source.

Parameters:

  • source (str, required): URL of the HTML document.
  • metadata (Optional[Mapping[str, Any]], default None): Optional metadata to be merged into the returned content.

Returns:

  • Content: A Content instance containing:
      • Raw HTML bytes
      • Source URL
      • HTML content type
      • HTTP response metadata

Raises:

  • HTTPError: If the HTTP request fails.
  • ValueError: If the response is not valid HTML.

validate_content_type

validate_content_type(response: httpx.Response) -> None

Validate that the HTTP response contains HTML content.

Parameters:

  • response (httpx.Response, required): HTTP response returned by httpx.

Raises:

  • ValueError: If the Content-Type header is missing or does not indicate HTML content.
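The check amounts to inspecting the media-type portion of the header. A stand-in over a plain header mapping (the real method takes an httpx.Response; this sketch is illustrative):

```python
from typing import Mapping


def validate_content_type(headers: Mapping[str, str]) -> None:
    """Raise ValueError unless the headers declare HTML content."""
    ctype = headers.get("Content-Type", "")
    # Strip parameters such as "; charset=utf-8" before comparing.
    media_type = ctype.split(";")[0].strip().lower()
    if media_type != "text/html":
        raise ValueError(f"Expected HTML content, got {ctype!r}")
```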

PDFParser

PDFParser(content: Content)

Bases: BaseParser[T], Generic[T]

Base PDF parser.

This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies.

Concrete implementations must:
  • Define the output type T
  • Implement the parse() method

Initialize the parser with content to be parsed.

Parameters:

  • content (Content, required): Content instance to be parsed.

Raises:

  • ValueError: If the content type is not supported by this parser.

supported_types class-attribute instance-attribute

supported_types: set[ContentType] = {PDF}

Set of content types supported by this parser (PDF only).

parse abstractmethod

parse() -> T

Parse PDF content into a structured output.

Implementations must fully interpret the PDF binary payload and return a deterministic, structured output.

Returns:

  • T: Parsed representation of type T.

Raises:

  • Exception: Parsing-specific errors as defined by the implementation.

supports

supports() -> bool

Check whether this parser supports the content's type.

Returns:

  • bool: True if the content type is supported; False otherwise.

PDFScraper

PDFScraper(*, client: BasePDFClient)

Bases: BaseScraper

Scraper for PDF sources.

Delegates byte retrieval to a PDF client and normalizes output into Content.

The scraper:
  • Does not perform parsing or interpretation
  • Does not assume a specific storage backend
  • Preserves caller-provided metadata

Initialize the PDF scraper.

Parameters:

  • client (BasePDFClient, required): PDF client responsible for retrieving raw PDF bytes.
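Because retrieval is delegated to the client, storage backends can be swapped freely. A hypothetical in-memory client sketching the contract (InMemoryPDFClient is illustrative, not part of OmniRead; a real implementation would subclass BasePDFClient):

```python
from typing import Dict


class InMemoryPDFClient:
    """Hypothetical client serving PDF bytes from a dict, e.g. for tests."""

    def __init__(self, store: Dict[str, bytes]) -> None:
        self._store = store

    def fetch(self, source: str) -> bytes:
        try:
            return self._store[source]
        except KeyError:
            # Mirror the filesystem client's failure mode for a missing source.
            raise FileNotFoundError(f"No PDF for source: {source!r}")
```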

fetch

fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch a PDF document from the given source.

Parameters:

  • source (Any, required): Identifier of the PDF source as understood by the configured PDF client.
  • metadata (Optional[Mapping[str, Any]], default None): Optional metadata to attach to the returned content.

Returns:

  • Content: A Content instance containing:
      • Raw PDF bytes
      • Source identifier
      • PDF content type
      • Optional metadata

Raises:

  • Exception: Retrieval-specific errors raised by the PDF client.