omniread
OmniRead — format-agnostic content acquisition and parsing framework.
OmniRead provides a cleanly layered architecture for fetching, parsing, and normalizing content from heterogeneous sources such as HTML documents and PDF files.
The library is structured around three core concepts:
- Content: A canonical, format-agnostic container representing raw content bytes and minimal contextual metadata.
- Scrapers: Components responsible for acquiring raw content from a source (HTTP, filesystem, object storage, etc.). Scrapers never interpret content.
- Parsers: Components responsible for interpreting acquired content and converting it into structured, typed representations.
OmniRead deliberately separates these responsibilities to ensure:

- Clear boundaries between IO and interpretation
- Replaceable implementations per format
- Predictable, testable behavior
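These boundaries map naturally onto structural types. The sketch below is illustrative only: it uses hypothetical minimal signatures to show the contract between the layers, not OmniRead's actual base classes.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional, Protocol, runtime_checkable


@dataclass
class Content:
    """Format-agnostic exchange container (illustrative shape only)."""
    raw: bytes
    source: str
    content_type: Optional[str] = None
    metadata: Optional[Mapping[str, Any]] = None


@runtime_checkable
class Scraper(Protocol):
    """Acquires raw bytes from a source; never interprets them."""
    def fetch(self, source: str) -> Content: ...


@runtime_checkable
class Parser(Protocol):
    """Interprets a previously acquired Content payload."""
    def parse(self) -> Any: ...
```

Because scrapers only produce `Content` and parsers only consume it, either side can be swapped out without touching the other.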
Installation
Install OmniRead using pip:

```shell
pip install omniread
```

Or with Poetry:

```shell
poetry add omniread
```
Basic Usage
HTML example:

```python
from omniread import HTMLScraper, HTMLParser

scraper = HTMLScraper()
content = scraper.fetch("https://example.com")

class TitleParser(HTMLParser[str]):
    def parse(self) -> str:
        return self._soup.title.string

parser = TitleParser(content)
title = parser.parse()
```
PDF example:

```python
from pathlib import Path

from omniread import FileSystemPDFClient, PDFScraper, PDFParser

client = FileSystemPDFClient()
scraper = PDFScraper(client=client)
content = scraper.fetch(Path("document.pdf"))

class TextPDFParser(PDFParser[str]):
    def parse(self) -> str:
        # implement PDF text extraction
        ...

parser = TextPDFParser(content)
result = parser.parse()
```
Public API Surface
This module re-exports the recommended public entry points of OmniRead.
Consumers are encouraged to import from this namespace rather than from format-specific submodules directly, unless advanced customization is required.
Core:

- Content
- ContentType

HTML:

- HTMLScraper
- HTMLParser

PDF:

- FileSystemPDFClient
- PDFScraper
- PDFParser
Core Philosophy
OmniRead is designed as a decoupled content engine:
- Separation of Concerns: Scrapers fetch, Parsers interpret. Neither knows about the other.
- Normalized Exchange: All components communicate via the Content model, ensuring a consistent contract.
- Format Agnosticism: The core logic is independent of whether the input is HTML, PDF, or JSON.
Documentation Design
For those extending OmniRead, follow these "AI-Native" docstring principles:
For Humans
- Clear Contracts: Explicitly state what a component is and is NOT responsible for.
- Runnable Examples: Include small, runnable snippets in the package __init__.py.
For LLMs
- Structured Models: Use dataclasses and enums for core data to ensure clean MCP JSON representation.
- Type Safety: All public APIs must be fully typed and have corresponding .pyi stubs.
- Detailed Raises: Include `ExceptionType: description` pairs in the Raises section to help agents handle errors gracefully.
Content
dataclass
Content(raw: bytes, source: str, content_type: Optional[ContentType] = ..., metadata: Optional[Mapping[str, Any]] = ...)
Normalized representation of extracted content.
A Content instance represents a raw content payload along with minimal
contextual metadata describing its origin and type.
This class is the primary exchange format between:

- Scrapers
- Parsers
- Downstream consumers
Attributes:
| Name | Type | Description |
|---|---|---|
| raw | bytes | Raw content bytes as retrieved from the source. |
| source | str | Identifier of the content origin (URL, file path, or logical name). |
| content_type | Optional[ContentType] | Optional MIME type of the content, if known. |
| metadata | Optional[Mapping[str, Any]] | Optional, implementation-defined metadata associated with the content (e.g., headers, encoding hints, extraction notes). |
ContentType
Bases: str, Enum
Supported MIME types for extracted content.
This enum represents the declared or inferred media type of the content source. It is primarily used for routing content to the appropriate parser or downstream consumer.
HTML
class-attribute
instance-attribute
HTML = 'text/html'
HTML document content.
JSON
class-attribute
instance-attribute
JSON = 'application/json'
JSON document content.
PDF
class-attribute
instance-attribute
PDF = 'application/pdf'
PDF document content.
XML
class-attribute
instance-attribute
XML = 'application/xml'
XML document content.
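Because ContentType subclasses str, members compare equal to their plain MIME strings, which makes routing on raw header values straightforward. A standalone sketch (the enum is re-declared locally here, with the values documented above, so the snippet runs without the library):

```python
from enum import Enum


class ContentType(str, Enum):
    HTML = "text/html"
    JSON = "application/json"
    PDF = "application/pdf"
    XML = "application/xml"


# str subclassing means members compare equal to plain MIME strings
is_html = ContentType.HTML == "text/html"   # True

# and a declared MIME type (e.g. an HTTP header value) maps back to a member
member = ContentType("application/pdf")     # ContentType.PDF
```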
FileSystemPDFClient
Bases: BasePDFClient
PDF client that reads from the local filesystem.
This client reads PDF files directly from the disk and returns their raw binary contents.
fetch
fetch(path: Path) -> bytes
Read a PDF file from the local filesystem.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Filesystem path to the PDF file. | required |
Returns:
| Type | Description |
|---|---|
| bytes | Raw PDF bytes. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the path does not exist. |
| ValueError | If the path exists but is not a file. |
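The documented contract (bytes on success, FileNotFoundError for a missing path, ValueError for a non-file path) could be implemented roughly as follows. This is a behavioral sketch with a hypothetical function name, not the library's source:

```python
from pathlib import Path


def fetch_pdf_bytes(path: Path) -> bytes:
    """Read a PDF file from disk, mirroring the documented error contract."""
    if not path.exists():
        raise FileNotFoundError(f"No such file: {path}")
    if not path.is_file():
        raise ValueError(f"Path exists but is not a file: {path}")
    return path.read_bytes()
```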
HTMLParser
HTMLParser(content: Content, features: str = 'html.parser')
Bases: BaseParser[T], Generic[T]
Base HTML parser.
This class extends the core BaseParser with HTML-specific behavior,
including DOM parsing via BeautifulSoup and reusable extraction helpers.
Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type.
Characteristics:

- Accepts only HTML content
- Owns a parsed BeautifulSoup DOM tree
- Provides pure helper utilities for common HTML structures
Concrete subclasses must:
- Define the output type T
- Implement the parse() method
Initialize the HTML parser.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | Content | HTML content to be parsed. | required |
| features | str | BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml'). | 'html.parser' |
Raises:
| Type | Description |
|---|---|
| ValueError | If the content is empty or not valid HTML. |
supported_types
class-attribute
instance-attribute
supported_types: set[ContentType] = {HTML}
Set of content types supported by this parser (HTML only).
parse
abstractmethod
parse() -> T
Fully parse the HTML content into structured output.
Implementations must fully interpret the HTML DOM and return a deterministic, structured output.
Returns:
| Type | Description |
|---|---|
| T | Parsed representation of type T. |
parse_div
staticmethod
parse_div(div: Tag, *, separator: str = ' ') -> str
Extract normalized text from a <div> element.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| div | Tag | BeautifulSoup tag representing a <div> element. | required |
| separator | str | String used to separate text nodes. | ' ' |
Returns:
| Type | Description |
|---|---|
| str | Flattened, whitespace-normalized text content. |
parse_link
staticmethod
parse_link(a: Tag) -> Optional[str]
Extract the hyperlink reference from an <a> element.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| a | Tag | BeautifulSoup tag representing an anchor. | required |
Returns:
| Type | Description |
|---|---|
| Optional[str] | The value of the href attribute, or None if it is absent. |
parse_meta
parse_meta() -> dict[str, Any]
Extract high-level metadata from the HTML document.
This includes:
- Document title
- <meta> tag name/property → content mappings
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary containing extracted metadata. |
parse_table
staticmethod
parse_table(table: Tag) -> list[list[str]]
Parse an HTML table into a 2D list of strings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| table | Tag | BeautifulSoup tag representing a <table> element. | required |
Returns:
| Type | Description |
|---|---|
| list[list[str]] | A list of rows, where each row is a list of cell text values. |
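The row/cell contract above (each table row becomes a list of stripped cell texts) can be approximated with the standard library alone. OmniRead itself works on BeautifulSoup Tag objects, so this stdlib version is only a behavioral sketch with hypothetical names:

```python
from html.parser import HTMLParser as _StdHTMLParser
from typing import Optional


class _TableExtractor(_StdHTMLParser):
    """Collect the cells of an HTML table into a 2D list of strings."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._cell: Optional[list[str]] = None  # text fragments of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def parse_table(html: str) -> list[list[str]]:
    extractor = _TableExtractor()
    extractor.feed(html)
    return extractor.rows
```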
supports
supports() -> bool
Check whether this parser supports the content's type.
Returns:
| Type | Description |
|---|---|
| bool | True if the content type is supported; False otherwise. |
HTMLScraper
HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)
Bases: BaseScraper
Base HTML scraper using httpx.
This scraper retrieves HTML documents over HTTP(S) and returns them
as raw content wrapped in a Content object.
Fetches raw bytes and metadata only.
The scraper:
- Uses httpx.Client for HTTP requests
- Enforces an HTML content type
- Preserves HTTP response metadata
The scraper does not:

- Parse HTML
- Perform retries or backoff
- Handle non-HTML responses
Initialize the HTML scraper.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| client | Optional[Client] | Optional pre-configured httpx.Client instance. | None |
| timeout | float | Request timeout in seconds. | 15.0 |
| headers | Optional[Mapping[str, str]] | Optional default HTTP headers. | None |
| follow_redirects | bool | Whether to follow HTTP redirects. | True |
fetch
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content
Fetch an HTML document from the given source.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | str | URL of the HTML document. | required |
| metadata | Optional[Mapping[str, Any]] | Optional metadata to be merged into the returned content. | None |
Returns:
| Type | Description |
|---|---|
| Content | A Content instance wrapping the raw HTML bytes and associated response metadata. |
Raises:
| Type | Description |
|---|---|
| HTTPError | If the HTTP request fails. |
| ValueError | If the response is not valid HTML. |
validate_content_type
validate_content_type(response: httpx.Response) -> None
Validate that the HTTP response contains HTML content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| response | Response | HTTP response returned by httpx. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If the response does not declare an HTML content type. |
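A validation step like this typically inspects the response's Content-Type header. The real method takes an httpx.Response; the standalone helper below (hypothetical name, operating on the bare header string) only sketches the check:

```python
def ensure_html_content_type(content_type_header: str) -> None:
    """Raise ValueError unless the header declares an HTML media type."""
    # Strip parameters such as "; charset=utf-8" before comparing
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    if media_type != "text/html":
        raise ValueError(f"Expected text/html, got {media_type!r}")
```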
PDFParser
PDFParser(content: Content)
Bases: BaseParser[T], Generic[T]
Base PDF parser.
This class enforces PDF content-type compatibility and provides the extension point for implementing concrete PDF parsing strategies.
Concrete implementations must:
- Define the output type T
- Implement the parse() method
Initialize the parser with content to be parsed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | Content | Content instance to be parsed. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If the content type is not supported by this parser. |
supported_types
class-attribute
instance-attribute
supported_types: set[ContentType] = {PDF}
Set of content types supported by this parser (PDF only).
parse
abstractmethod
parse() -> T
Parse PDF content into a structured output.
Implementations must fully interpret the PDF binary payload and return a deterministic, structured output.
Returns:
| Type | Description |
|---|---|
| T | Parsed representation of type T. |
Raises:
| Type | Description |
|---|---|
| Exception | Parsing-specific errors as defined by the implementation. |
supports
supports() -> bool
Check whether this parser supports the content's type.
Returns:
| Type | Description |
|---|---|
| bool | True if the content type is supported; False otherwise. |
PDFScraper
PDFScraper(*, client: BasePDFClient)
Bases: BaseScraper
Scraper for PDF sources.
Delegates byte retrieval to a PDF client and normalizes output into Content.
The scraper:

- Does not perform parsing or interpretation
- Does not assume a specific storage backend
- Preserves caller-provided metadata
Initialize the PDF scraper.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| client | BasePDFClient | PDF client responsible for retrieving raw PDF bytes. | required |
fetch
fetch(source: Any, *, metadata: Optional[Mapping[str, Any]] = None) -> Content
Fetch a PDF document from the given source.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | Any | Identifier of the PDF source as understood by the configured PDF client. | required |
| metadata | Optional[Mapping[str, Any]] | Optional metadata to attach to the returned content. | None |
Returns:
| Type | Description |
|---|---|
| Content | A Content instance wrapping the raw PDF bytes. |
Raises:
| Type | Description |
|---|---|
| Exception | Retrieval-specific errors raised by the PDF client. |
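Putting the scraper's documented behavior together — delegate byte retrieval to the client, wrap the result in Content, preserve caller-provided metadata — yields something like the following. The stub client, the local Content stand-in, and the function name are all illustrative, not the library's implementation:

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class Content:
    """Local stand-in mirroring omniread.Content's documented fields."""
    raw: bytes
    source: str
    content_type: str = "application/pdf"
    metadata: Optional[Mapping[str, Any]] = None


class StubPDFClient:
    """Stub: any object exposing fetch(source) -> bytes would do."""
    def fetch(self, source: Any) -> bytes:
        return b"%PDF-1.4 stub"


def scrape_pdf(client: StubPDFClient, source: Any,
               metadata: Optional[Mapping[str, Any]] = None) -> Content:
    raw = client.fetch(source)  # the client owns all storage specifics
    return Content(raw=raw, source=str(source), metadata=metadata)
```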