# omniread.html
HTML format implementation for OmniRead.
This package provides HTML-specific implementations of the core OmniRead
contracts defined in omniread.core.
It includes:

- HTML parsers that interpret HTML content
- HTML scrapers that retrieve HTML documents
This package:
- Implements, but does not redefine, core contracts
- May contain HTML-specific behavior and edge-case handling
- Produces canonical content models defined in omniread.core.content
Consumers should depend on omniread.core interfaces wherever possible and
use this package only when HTML-specific behavior is required.
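The dependency guidance above can be sketched with plain typing machinery. The `Parser` protocol, `first_line` consumer, and `FakeTextParser` below are hypothetical stand-ins, not the actual omniread.core contracts:

```python
from typing import Protocol, TypeVar

T = TypeVar("T", covariant=True)

class Parser(Protocol[T]):
    """Hypothetical stand-in for an omniread.core parser contract."""

    def supports(self) -> bool: ...
    def parse(self) -> T: ...

def first_line(parser: "Parser[str]") -> str:
    """Consumer code that depends only on the interface, never on omniread.html."""
    if not parser.supports():
        raise ValueError("unsupported content type")
    return parser.parse().splitlines()[0]

class FakeTextParser:
    """Minimal structural implementation used only for illustration."""

    def __init__(self, text: str) -> None:
        self._text = text

    def supports(self) -> bool:
        return True

    def parse(self) -> str:
        return self._text

print(first_line(FakeTextParser("hello\nworld")))  # hello
```

Because `Parser` is a structural protocol, `first_line` accepts any object with matching `supports`/`parse` methods, which is what lets consumers stay decoupled from the HTML-specific package.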
## HTMLParser

```python
HTMLParser(content: Content, features: str = 'html.parser')
```

Bases: `BaseParser[T]`, `Generic[T]`
Base HTML parser.

This class extends the core BaseParser with HTML-specific behavior,
including DOM parsing via BeautifulSoup and reusable extraction helpers.
Concrete parsers must explicitly define the return type.

Characteristics:

- Accepts only HTML content
- Owns a parsed BeautifulSoup DOM tree
- Provides pure helper utilities for common HTML structures
Concrete subclasses must:
- Define the output type T
- Implement the parse() method
Initialize the HTML parser.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `content` | `Content` | HTML content to be parsed. | *required* |
| `features` | `str` | BeautifulSoup parser backend to use (e.g., `'html.parser'`, `'lxml'`). | `'html.parser'` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If the content is empty or not valid HTML. |
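The subclassing contract can be illustrated with a self-contained sketch. `StubHTMLParser` and `TitleParser` are hypothetical stand-ins; the real base class lives in omniread.core and parses with BeautifulSoup:

```python
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

T = TypeVar("T")

class StubHTMLParser(ABC, Generic[T]):
    """Stand-in base: holds raw HTML, defers the output type to subclasses."""

    def __init__(self, html: str) -> None:
        self.html = html

    @abstractmethod
    def parse(self) -> T:
        """Concrete subclasses define T and implement the full parse."""

class TitleParser(StubHTMLParser[str]):
    """Example subclass: T is fixed to str, parse() is implemented."""

    def parse(self) -> str:
        start = self.html.find("<title>")
        end = self.html.find("</title>")
        if start == -1 or end == -1:
            return ""
        return self.html[start + len("<title>"):end]

print(TitleParser("<html><title>Docs</title></html>").parse())  # Docs
```

The two obligations listed above map directly onto the sketch: `StubHTMLParser[str]` pins the output type `T`, and `parse()` supplies the concrete implementation.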
### supported_types

`class-attribute` `instance-attribute`

```python
supported_types: set[ContentType] = {HTML}
```

Set of content types supported by this parser (HTML only).
### parse

`abstractmethod`

```python
parse() -> T
```

Fully parse the HTML content into structured output.

Implementations must fully interpret the HTML DOM and return a
deterministic, structured output.
Returns:

| Type | Description |
|---|---|
| `T` | Parsed representation of type `T`. |
### parse_div

`staticmethod`

```python
parse_div(div: Tag, *, separator: str = ' ') -> str
```

Extract normalized text from a `<div>` element.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `div` | `Tag` | BeautifulSoup tag representing a `<div>` element. | *required* |
| `separator` | `str` | String used to separate text nodes. | `' '` |
Returns:

| Type | Description |
|---|---|
| `str` | Flattened, whitespace-normalized text content. |
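The behavior described here (collect text nodes, strip whitespace, join with a separator) can be sketched with the standard library alone. `div_text` is a hypothetical equivalent, not the actual BeautifulSoup-backed helper:

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collects non-blank character data from the parsed markup."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.chunks.append(data.strip())

def div_text(html: str, separator: str = " ") -> str:
    """Whitespace-normalized text of an element, joined by `separator`."""
    collector = _TextCollector()
    collector.feed(html)
    return separator.join(collector.chunks)

print(div_text("<div>  Hello <b>brave</b>\n  world </div>"))  # Hello brave world
```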
### parse_link

`staticmethod`

```python
parse_link(a: Tag) -> Optional[str]
```

Extract the hyperlink reference from an `<a>` element.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `a` | `Tag` | BeautifulSoup tag representing an anchor. | *required* |
Returns:

| Type | Description |
|---|---|
| `Optional[str]` | The value of the `href` attribute, or `None` if absent. |
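A stdlib sketch of the same idea: read the first anchor's `href` if present, otherwise return `None`. `link_href` is a hypothetical equivalent of the helper, written against `html.parser` instead of BeautifulSoup:

```python
from html.parser import HTMLParser
from typing import Optional

class _LinkExtractor(HTMLParser):
    """Records the href of the first <a> tag encountered."""

    def __init__(self) -> None:
        super().__init__()
        self.href: Optional[str] = None

    def handle_starttag(self, tag: str, attrs: list) -> None:
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")

def link_href(html: str) -> Optional[str]:
    extractor = _LinkExtractor()
    extractor.feed(html)
    return extractor.href

print(link_href('<a href="https://example.com">site</a>'))  # https://example.com
print(link_href("<a>no href</a>"))  # None
```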
### parse_meta

```python
parse_meta() -> dict[str, Any]
```

Extract high-level metadata from the HTML document.

This includes:

- Document title
- `<meta>` tag name/property → content mappings
Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary containing extracted metadata. |
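The two sources of metadata named above (the `<title>` text and `<meta>` name/property → content pairs) can be sketched as follows. `MetaExtractor` is a hypothetical stand-in using the standard library, not the actual method:

```python
from html.parser import HTMLParser
from typing import Any

class MetaExtractor(HTMLParser):
    """Collects the <title> text and <meta> name/property -> content pairs."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: dict[str, Any] = {}
        self._in_title = False

    def handle_starttag(self, tag: str, attrs: list) -> None:
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            a = dict(attrs)
            key = a.get("name") or a.get("property")
            if key and "content" in a:
                self.meta[key] = a["content"]

    def handle_endtag(self, tag: str) -> None:
        if tag == "title":
            self._in_title = False

    def handle_data(self, data: str) -> None:
        if self._in_title:
            self.meta["title"] = data.strip()

doc = '<head><title>Docs</title><meta name="author" content="Ada"></head>'
extractor = MetaExtractor()
extractor.feed(doc)
print(extractor.meta)  # {'title': 'Docs', 'author': 'Ada'}
```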
### parse_table

`staticmethod`

```python
parse_table(table: Tag) -> list[list[str]]
```

Parse an HTML table into a 2D list of strings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `Tag` | BeautifulSoup tag representing a `<table>` element. | *required* |
Returns:

| Type | Description |
|---|---|
| `list[list[str]]` | A list of rows, where each row is a list of cell text values. |
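The rows-of-cells output shape can be demonstrated with a stdlib sketch: open a row on `<tr>`, buffer text inside `<td>`/`<th>`, and close the cell on the end tag. `table_rows` is a hypothetical equivalent, not the actual implementation:

```python
from html.parser import HTMLParser
from typing import Optional

class _TableExtractor(HTMLParser):
    """Builds a list of rows; each row is a list of cell text values."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._cell: Optional[list[str]] = None

    def handle_starttag(self, tag: str, attrs: list) -> None:
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_endtag(self, tag: str) -> None:
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append(" ".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data: str) -> None:
        if self._cell is not None and data.strip():
            self._cell.append(data.strip())

def table_rows(html: str) -> list[list[str]]:
    extractor = _TableExtractor()
    extractor.feed(html)
    return extractor.rows

html = "<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>"
print(table_rows(html))  # [['a', 'b'], ['1', '2']]
```

Header cells (`<th>`) and data cells (`<td>`) are treated alike here, matching the flat 2D shape of the documented return type.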
### supports

```python
supports() -> bool
```

Check whether this parser supports the content's type.
Returns:

| Type | Description |
|---|---|
| `bool` | True if the content type is supported; False otherwise. |
## HTMLScraper

```python
HTMLScraper(
    *,
    client: Optional[httpx.Client] = None,
    timeout: float = 15.0,
    headers: Optional[Mapping[str, str]] = None,
    follow_redirects: bool = True,
)
```

Bases: `BaseScraper`

Base HTML scraper using httpx.

This scraper retrieves HTML documents over HTTP(S) and returns them
as raw content wrapped in a Content object. It fetches raw bytes and
metadata only.
The scraper:

- Uses httpx.Client for HTTP requests
- Enforces an HTML content type
- Preserves HTTP response metadata

The scraper does not:

- Parse HTML
- Perform retries or backoff
- Handle non-HTML responses
Initialize the HTML scraper.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `client` | `Optional[Client]` | Optional pre-configured `httpx.Client` instance. | `None` |
| `timeout` | `float` | Request timeout in seconds. | `15.0` |
| `headers` | `Optional[Mapping[str, str]]` | Optional default HTTP headers. | `None` |
| `follow_redirects` | `bool` | Whether to follow HTTP redirects. | `True` |
### fetch

```python
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content
```

Fetch an HTML document from the given source.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `str` | URL of the HTML document. | *required* |
| `metadata` | `Optional[Mapping[str, Any]]` | Optional metadata to be merged into the returned content. | `None` |
Returns:

| Type | Description |
|---|---|
| `Content` | A `Content` object wrapping the fetched HTML and HTTP response metadata. |
Raises:

| Type | Description |
|---|---|
| `HTTPError` | If the HTTP request fails. |
| `ValueError` | If the response is not valid HTML. |
### validate_content_type

```python
validate_content_type(response: httpx.Response) -> None
```

Validate that the HTTP response contains HTML content.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `response` | `Response` | HTTP response returned by `httpx`. | *required* |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If the response's `Content-Type` does not indicate HTML. |
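The validation rule can be sketched over a plain header mapping. `check_html_content_type` is a hypothetical equivalent; the real method inspects an `httpx.Response`, and the exact set of accepted media types is an assumption here:

```python
def check_html_content_type(headers: dict) -> None:
    """Raise ValueError unless the Content-Type header indicates HTML."""
    # Strip parameters such as "; charset=utf-8" before comparing media types.
    ctype = headers.get("content-type", "").split(";")[0].strip().lower()
    if ctype not in ("text/html", "application/xhtml+xml"):
        raise ValueError(f"expected HTML content, got {ctype or 'no content type'!r}")

check_html_content_type({"content-type": "text/html; charset=utf-8"})  # passes
try:
    check_html_content_type({"content-type": "application/json"})
except ValueError as exc:
    print(exc)  # expected HTML content, got 'application/json'
```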