omniread
OmniRead — format-agnostic content acquisition and parsing framework.
OmniRead provides a cleanly layered architecture for fetching, parsing, and normalizing content from heterogeneous sources such as HTML documents and PDF files.
The library is structured around three core concepts:
- Content: A canonical, format-agnostic container representing raw content bytes and minimal contextual metadata.
- Scrapers: Components responsible for acquiring raw content from a source (HTTP, filesystem, object storage, etc.). Scrapers never interpret content.
- Parsers: Components responsible for interpreting acquired content and converting it into structured, typed representations.
OmniRead deliberately separates these responsibilities to ensure:

- Clear boundaries between IO and interpretation
- Replaceable implementations per format
- Predictable, testable behavior
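These boundaries map naturally onto structural types. The sketch below is illustrative only: it uses hypothetical minimal signatures to show the contract between the layers, not OmniRead's actual base classes.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional, Protocol, runtime_checkable


@dataclass
class Content:
    """Format-agnostic exchange container (illustrative shape only)."""
    raw: bytes
    source: str
    content_type: Optional[str] = None
    metadata: Optional[Mapping[str, Any]] = None


@runtime_checkable
class Scraper(Protocol):
    """Acquires raw bytes from a source; never interprets them."""
    def fetch(self, source: str) -> Content: ...


@runtime_checkable
class Parser(Protocol):
    """Interprets a previously acquired Content payload."""
    def parse(self) -> Any: ...
```

Because scrapers only produce `Content` and parsers only consume it, either side can be swapped out without touching the other.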
Installation
Install OmniRead using pip:

```shell
pip install omniread
```

Or with Poetry:

```shell
poetry add omniread
```
Basic Usage
HTML example:

```python
from omniread import HTMLScraper, HTMLParser

scraper = HTMLScraper()
content = scraper.fetch("https://example.com")

class TitleParser(HTMLParser[str]):
    def parse(self) -> str:
        return self._soup.title.string

parser = TitleParser(content)
title = parser.parse()
```
PDF example:

```python
from pathlib import Path

from omniread import FileSystemPDFClient, PDFScraper, PDFParser

client = FileSystemPDFClient()
scraper = PDFScraper(client=client)
content = scraper.fetch(Path("document.pdf"))

class TextPDFParser(PDFParser[str]):
    def parse(self) -> str:
        # implement PDF text extraction
        ...

parser = TextPDFParser(content)
result = parser.parse()
```
Public API Surface
This module re-exports the recommended public entry points of OmniRead.
Consumers are encouraged to import from this namespace rather than from format-specific submodules directly, unless advanced customization is required.
Core:

- Content
- ContentType

HTML:

- HTMLScraper
- HTMLParser

PDF:

- FileSystemPDFClient
- PDFScraper
- PDFParser
Core Philosophy
OmniRead is designed as a decoupled content engine:
- Separation of Concerns: Scrapers fetch, Parsers interpret. Neither knows about the other.
- Normalized Exchange: All components communicate via the Content model, ensuring a consistent contract.
- Format Agnosticism: The core logic is independent of whether the input is HTML, PDF, or JSON.
Documentation Design
For those extending OmniRead, follow these "AI-Native" docstring principles:
For Humans
- Clear Contracts: Explicitly state what a component is and is NOT responsible for.
- Runnable Examples: Include small, runnable snippets in the package __init__.py.
For LLMs
- Structured Models: Use dataclasses and enums for core data to ensure clean MCP JSON representation.
- Type Safety: All public APIs must be fully typed and have corresponding .pyi stubs.
- Detailed Raises: Include `ExceptionType: description` pairs in the Raises section to help agents handle errors gracefully.
Content
dataclass
Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)
Normalized representation of extracted content.
A Content instance represents a raw content payload along with minimal
contextual metadata describing its origin and type.
This class is the primary exchange format between:

- Scrapers
- Parsers
- Downstream consumers
Attributes:
| Name | Type | Description |
|---|---|---|
| raw | bytes | Raw content bytes as retrieved from the source. |
| source | str | Identifier of the content origin (URL, file path, or logical name). |
| content_type | Optional[ContentType] | Optional MIME type of the content, if known. |
| metadata | Optional[Mapping[str, Any]] | Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes). |
ContentType
Bases: str, Enum
Supported MIME types for extracted content.
This enum represents the declared or inferred media type of the content source. It is primarily used for routing content to the appropriate parser or downstream consumer.
HTML
class-attribute
instance-attribute
HTML = 'text/html'
HTML document content.
JSON
class-attribute
instance-attribute
JSON = 'application/json'
JSON document content.
PDF
class-attribute
instance-attribute
PDF = 'application/pdf'
PDF document content.
XML
class-attribute
instance-attribute
XML = 'application/xml'
XML document content.
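Because ContentType subclasses str, members compare equal to their plain MIME strings, which makes routing on raw header values straightforward. A standalone sketch (the enum is re-declared locally here, with the values documented above, so the snippet runs without the library):

```python
from enum import Enum


class ContentType(str, Enum):
    HTML = "text/html"
    JSON = "application/json"
    PDF = "application/pdf"
    XML = "application/xml"


# str subclassing means members compare equal to plain MIME strings
is_html = ContentType.HTML == "text/html"   # True

# and a declared MIME type (e.g. an HTTP header value) maps back to a member
member = ContentType("application/pdf")     # ContentType.PDF
```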
FileSystemPDFClient
Bases: BasePDFClient
PDF client that reads from the local filesystem.
This client reads PDF files directly from the disk and returns their raw binary contents.
fetch
fetch(path: Path) -> bytes
Read a PDF file from the local filesystem.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Filesystem path to the PDF file. | required |
Returns:
| Type | Description |
|---|---|
| bytes | Raw PDF bytes. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the path does not exist. |
| ValueError | If the path exists but is not a file. |
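The documented contract (bytes on success, FileNotFoundError for a missing path, ValueError for a non-file path) could be implemented roughly as follows. This is a behavioral sketch with a hypothetical function name, not the library's source:

```python
from pathlib import Path


def fetch_pdf_bytes(path: Path) -> bytes:
    """Read a PDF file from disk, mirroring the documented error contract."""
    if not path.exists():
        raise FileNotFoundError(f"No such file: {path}")
    if not path.is_file():
        raise ValueError(f"Path exists but is not a file: {path}")
    return path.read_bytes()
```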
HTMLParser
HTMLParser(content: Content, features: str = 'html.parser')
Bases: BaseParser[T], Generic[T]
Base HTML parser.
This class extends the core BaseParser with HTML-specific behavior,
including DOM parsing via BeautifulSoup and reusable extraction helpers.
Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type.
Characteristics:

- Accepts only HTML content
- Owns a parsed BeautifulSoup DOM tree
- Provides pure helper utilities for common HTML structures
Concrete subclasses must:
- Define the output type T
- Implement the parse() method
Initialize the HTML parser.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | Content | HTML content to be parsed. | required |
| features | str | BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml'). | 'html.parser' |
Raises:
| Type | Description |
|---|---|
| ValueError | If the content is empty or not valid HTML. |
supported_types
class-attribute
instance-attribute
supported_types: set[ContentType] = {HTML}
Set of content types supported by this parser (HTML only).
parse
abstractmethod
parse() -> T
Fully parse the HTML content into structured output.
Implementations must fully interpret the HTML DOM and return a deterministic, structured output.
Returns:
| Type | Description |
|---|---|
| T | Parsed representation of type T. |
parse_div
staticmethod
parse_div(div: Tag, *, separator: str = ' ') -> str
Extract normalized text from a <div> element.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| div | Tag | BeautifulSoup tag representing a <div> element. | required |
| separator | str | String used to separate text nodes. | ' ' |
Returns:
| Type | Description |
|---|---|
| str | Flattened, whitespace-normalized text content. |
parse_link
staticmethod
parse_link(a: Tag) -> Optional[str]
Extract the hyperlink reference from an <a> element.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| a | Tag | BeautifulSoup tag representing an anchor. | required |
Returns:
| Type | Description |
|---|---|
| Optional[str] | The value of the href attribute, or None if it is absent. |
parse_meta
parse_meta() -> dict[str, Any]
Extract high-level metadata from the HTML document.
This includes:
- Document title
- <meta> tag name/property → content mappings
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary containing extracted metadata. |
parse_table
staticmethod
parse_table(table: Tag) -> list[list[str]]
Parse an HTML table into a 2D list of strings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| table | Tag | BeautifulSoup tag representing a <table> element. | required |
Returns:
| Type | Description |
|---|---|
| list[list[str]] | A list of rows, where each row is a list of cell text values. |
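The row/cell contract above (each table row becomes a list of stripped cell texts) can be approximated with the standard library alone. OmniRead itself works on BeautifulSoup Tag objects, so this stdlib version is only a behavioral sketch with hypothetical names:

```python
from html.parser import HTMLParser as _StdHTMLParser
from typing import Optional


class _TableExtractor(_StdHTMLParser):
    """Collect the cells of an HTML table into a 2D list of strings."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._cell: Optional[list[str]] = None  # text fragments of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def parse_table(html: str) -> list[list[str]]:
    extractor = _TableExtractor()
    extractor.feed(html)
    return extractor.rows
```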
supports
supports() -> bool
Check whether this parser supports the content's type.
Returns:
| Type | Description |
|---|---|
| bool | True if the content type is supported; False otherwise. |
HTMLScraper
HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)
Bases: BaseScraper
Base HTML scraper using httpx.
This scraper retrieves HTML documents over HTTP(S) and returns them
as raw content wrapped in a Content object.
Fetches raw bytes and metadata only.
The scraper:
- Uses httpx.Client for HTTP requests
- Enforces an HTML content type
- Preserves HTTP response metadata
The scraper does not:

- Parse HTML
- Perform retries or backoff
- Handle non-HTML responses
Initialize the HTML scraper.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| client | Optional[Client] | Optional pre-configured httpx.Client instance. | None |
| timeout | float | Request timeout in seconds. | 15.0 |
| headers | Optional[Mapping[str, str]] | Optional default HTTP headers. | None |
| follow_redirects | bool | Whether to follow HTTP redirects. | True |
fetch
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content
Fetch an HTML document from the given source.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | str | URL of the HTML document. | required |
| metadata | Optional[Mapping[str, Any]] | Optional metadata to be merged into the returned content. | None |
Returns:
| Type | Description |
|---|---|
| Content | A Content instance wrapping the raw HTML bytes and associated response metadata. |
Raises:
| Type | Description |
|---|---|
| HTTPError | If the HTTP request fails. |
| ValueError | If the response is not valid HTML. |
validate_content_type
validate_content_type(response: httpx.Response) -> None
Validate that the HTTP response contains HTML content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| response | Response | HTTP response returned by httpx. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If the response does not declare an HTML content type. |
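A validation step like this typically inspects the response's Content-Type header. The real method takes an httpx.Response; the standalone helper below (hypothetical name, operating on the bare header string) only sketches the check:

```python
def ensure_html_content_type(content_type_header: str) -> None:
    """Raise ValueError unless the header declares an HTML media type."""
    # Strip parameters such as "; charset=utf-8" before comparing
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    if media_type != "text/html":
        raise ValueError(f"Expected text/html, got {media_type!r}")
```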
PDFParser
PDFParser(content: Content)
Bases: BaseParser[T], Generic[T]
Base PDF parser.
This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies.
Concrete implementations must:
- Define the output type T
- Implement the parse() method
Initialize the parser with content to be parsed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | Content | Content instance to be parsed. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If the content type is not supported by this parser. |
supported_types
class-attribute
instance-attribute
supported_types: set[ContentType] = {PDF}
Set of content types supported by this parser (PDF only).
parse
abstractmethod
parse() -> T
Parse PDF content into a structured output.
Implementations must fully interpret the PDF binary payload and return a deterministic, structured output.
Returns:
| Type | Description |
|---|---|
| T | Parsed representation of type T. |
Raises:
| Type | Description |
|---|---|
| Exception | Parsing-specific errors as defined by the implementation. |
supports
supports() -> bool
Check whether this parser supports the content's type.
Returns:
| Type | Description |
|---|---|
| bool | True if the content type is supported; False otherwise. |
PDFScraper
PDFScraper(*, client: BasePDFClient)
Bases: BaseScraper
Scraper for PDF sources.
Delegates byte retrieval to a PDF client and normalizes output into Content.
The scraper:

- Does not perform parsing or interpretation
- Does not assume a specific storage backend
- Preserves caller-provided metadata
Initialize the PDF scraper.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| client | BasePDFClient | PDF client responsible for retrieving raw PDF bytes. | required |
fetch
fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content
Fetch a PDF document from the given source.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | Any | Identifier of the PDF source as understood by the configured PDF client. | required |
| metadata | Optional[Mapping[str, Any]] | Optional metadata to attach to the returned content. | None |
Returns:
| Type | Description |
|---|---|
| Content | A Content instance wrapping the raw PDF bytes. |
Raises:
| Type | Description |
|---|---|
| Exception | Retrieval-specific errors raised by the PDF client. |
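Putting the scraper's documented behavior together — delegate byte retrieval to the client, wrap the result in Content, preserve caller-provided metadata — yields something like the following. The stub client, the local Content stand-in, and the function name are all illustrative, not the library's implementation:

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class Content:
    """Local stand-in mirroring omniread.Content's documented fields."""
    raw: bytes
    source: str
    content_type: str = "application/pdf"
    metadata: Optional[Mapping[str, Any]] = None


class StubPDFClient:
    """Stub: any object exposing fetch(source) -> bytes would do."""
    def fetch(self, source: Any) -> bytes:
        return b"%PDF-1.4 stub"


def scrape_pdf(client: StubPDFClient, source: Any,
               metadata: Optional[Mapping[str, Any]] = None) -> Content:
    raw = client.fetch(source)  # the client owns all storage specifics
    return Content(raw=raw, source=str(source), metadata=metadata)
```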