Parser
omniread.html.parser
HTML parser base implementations for OmniRead.
This module provides reusable HTML parsing utilities built on top of
the abstract parser contracts defined in omniread.core.parser.
It supplies:
- Content-type enforcement for HTML inputs
- BeautifulSoup initialization and lifecycle management
- Common helper methods for extracting structured data from HTML elements
Concrete parsers must subclass HTMLParser and implement the parse() method
to return a structured representation appropriate for their use case.
HTMLParser
HTMLParser(content: Content, features: str = 'html.parser')
Bases: BaseParser[T], Generic[T]
Base HTML parser.
This class extends the core BaseParser with HTML-specific behavior,
including DOM parsing via BeautifulSoup and reusable extraction helpers.
Concrete parsers must explicitly define their return type.
Characteristics:
- Accepts only HTML content
- Owns a parsed BeautifulSoup DOM tree
- Provides pure helper utilities for common HTML structures
Concrete subclasses must:
- Define the output type T
- Implement the parse() method
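
The subclassing contract above can be sketched as follows. This is a minimal illustration, not OmniRead's implementation: `Content`, `ContentType`, and the `HTMLParser` class below are simplified stand-ins written so the example runs without OmniRead installed. The field names `raw` and `content_type`, the `_soup` attribute, and the `HeadingParser` subclass are all assumptions. Only BeautifulSoup (`bs4`) is a real dependency.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Generic, TypeVar

from bs4 import BeautifulSoup

T = TypeVar("T")


class ContentType(Enum):
    """Stand-in for OmniRead's ContentType (assumed shape)."""
    HTML = "text/html"
    JSON = "application/json"


@dataclass
class Content:
    """Stand-in for OmniRead's Content; field names are assumptions."""
    raw: bytes
    content_type: ContentType


class HTMLParser(ABC, Generic[T]):
    """Simplified stand-in for the documented base class."""

    supported_types = {ContentType.HTML}

    def __init__(self, content: Content, features: str = "html.parser"):
        # Enforce the HTML-only contract and reject empty input.
        if content.content_type not in self.supported_types:
            raise ValueError("HTMLParser accepts only HTML content")
        if not content.raw.strip():
            raise ValueError("content is empty")
        self.content = content
        # The parser owns the DOM tree (attribute name is an assumption).
        self._soup = BeautifulSoup(content.raw, features)

    @abstractmethod
    def parse(self) -> T:
        """Return the structured output of type T."""


class HeadingParser(HTMLParser[list[str]]):
    """Hypothetical concrete parser: T is list[str]."""

    def parse(self) -> list[str]:
        return [h.get_text(strip=True) for h in self._soup.find_all("h2")]


page = Content(raw=b"<h2> Intro </h2><p>text</p><h2>Usage</h2>",
               content_type=ContentType.HTML)
headings = HeadingParser(page).parse()
print(headings)
```

Binding `T` in the base-class subscript (`HTMLParser[list[str]]`) lets a type checker verify that `parse()` returns the declared type.
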
Initialize the HTML parser.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | Content | HTML content to be parsed. | required |
| features | str | BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml'). | 'html.parser' |
Raises:
| Type | Description |
|---|---|
| ValueError | If the content is empty or not valid HTML. |
supported_types
class-attribute
instance-attribute
supported_types: set[ContentType] = {HTML}
Set of content types supported by this parser (HTML only).
parse
abstractmethod
parse() -> T
Fully parse the HTML content into structured output.
Implementations must fully interpret the HTML DOM and return a deterministic, structured output.
Returns:
| Type | Description |
|---|---|
| T | Parsed representation of type T. |
parse_div
staticmethod
parse_div(div: Tag, *, separator: str = ' ') -> str
Extract normalized text from a <div> element.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| div | Tag | BeautifulSoup tag representing a <div>. | required |
| separator | str | String used to separate text nodes. | ' ' |
Returns:
| Type | Description |
|---|---|
| str | Flattened, whitespace-normalized text content. |
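
One way to produce flattened, whitespace-normalized text with BeautifulSoup is sketched below. The function name `div_text` is hypothetical and the logic is an assumption about what "normalized" means here (each text node stripped, internal whitespace runs collapsed to one space). It is not taken from OmniRead's source.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def div_text(div: Tag, *, separator: str = " ") -> str:
    """Join the non-empty text nodes under `div` with `separator`."""
    # stripped_strings skips whitespace-only nodes and strips the rest;
    # split()/join() then collapses whitespace runs inside each node.
    parts = (" ".join(s.split()) for s in div.stripped_strings)
    return separator.join(parts)


html = "<div>  Hello <b>big\n   world</b>\n <p> again </p></div>"
div = BeautifulSoup(html, "html.parser").find("div")
flat = div_text(div)
piped = div_text(div, separator=" | ")
print(flat)
print(piped)
```
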
parse_link
staticmethod
parse_link(a: Tag) -> Optional[str]
Extract the hyperlink reference from an <a> element.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| a | Tag | BeautifulSoup tag representing an anchor. | required |
Returns:
| Type | Description |
|---|---|
| Optional[str] | The value of the href attribute, if present; otherwise None. |
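
A minimal sketch of reading a link target from an anchor tag, assuming the hyperlink reference is the `href` attribute and that a missing attribute maps to `None`. The name `link_href` is hypothetical; `Tag.get` is the standard BeautifulSoup attribute accessor.

```python
from typing import Optional

from bs4 import BeautifulSoup
from bs4.element import Tag


def link_href(a: Tag) -> Optional[str]:
    """Return the anchor's href, or None when the attribute is missing."""
    href = a.get("href")  # Tag.get returns None for absent attributes
    return href if isinstance(href, str) else None


soup = BeautifulSoup('<a href="/docs">Docs</a><a id="top">Top</a>', "html.parser")
with_href, without_href = soup.find_all("a")
print(link_href(with_href))
print(link_href(without_href))
```
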
parse_meta
parse_meta() -> dict[str, Any]
Extract high-level metadata from the HTML document.
This includes:
- Document title
- <meta> tag name/property → content mappings
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary containing extracted metadata. |
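
The metadata extraction described above might look roughly like this. The source does not specify the layout of the returned dictionary, so the `"title"` and `"meta"` keys and the function name `document_meta` are assumptions made for illustration.

```python
from typing import Any

from bs4 import BeautifulSoup


def document_meta(soup: BeautifulSoup) -> dict[str, Any]:
    """Collect the document title and <meta> name/property -> content pairs."""
    title = soup.title.get_text(strip=True) if soup.title else None
    meta: dict[str, str] = {}
    for tag in soup.find_all("meta"):
        # Prefer the name attribute, fall back to property (e.g. Open Graph).
        key = tag.get("name") or tag.get("property")
        value = tag.get("content")
        if key and value is not None:
            meta[key] = value
    return {"title": title, "meta": meta}


html = (
    "<html><head><title> Demo </title>"
    '<meta name="description" content="A page">'
    '<meta property="og:title" content="Demo OG">'
    '<meta charset="utf-8">'
    "</head><body></body></html>"
)
info = document_meta(BeautifulSoup(html, "html.parser"))
print(info)
```

Tags with neither a `name` nor a `property` attribute, such as `<meta charset>`, are skipped in this sketch.
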
parse_table
staticmethod
parse_table(table: Tag) -> list[list[str]]
Parse an HTML table into a 2D list of strings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| table | Tag | BeautifulSoup tag representing a <table>. | required |
Returns:
| Type | Description |
|---|---|
| list[list[str]] | A list of rows, where each row is a list of cell text values. |
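
As an illustration of the row and cell structure, the sketch below converts a simple table into a 2D list. The name `table_rows` is hypothetical, and treating both `<th>` and `<td>` as cells is an assumption. Nested tables are not handled specially here.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def table_rows(table: Tag) -> list[list[str]]:
    """Return one list of stripped cell texts per <tr>."""
    rows: list[list[str]] = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["th", "td"])  # header and data cells alike
        rows.append([cell.get_text(" ", strip=True) for cell in cells])
    return rows


html = (
    "<table>"
    "<tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td> Apple </td><td>3</td></tr>"
    "</table>"
)
table = BeautifulSoup(html, "html.parser").find("table")
rows = table_rows(table)
print(rows)
```
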
supports
supports() -> bool
Check whether this parser supports the content's type.
Returns:
| Type | Description |
|---|---|
| bool | True if the content type is supported; False otherwise. |
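
Conceptually, this check is likely a membership test of the content's type against `supported_types`. The sketch below illustrates that idea with stand-in `Content` and `ContentType` definitions whose fields are assumptions. It is written as a free function for brevity and is not OmniRead's implementation.

```python
from dataclasses import dataclass
from enum import Enum


class ContentType(Enum):
    """Stand-in for OmniRead's ContentType (assumed shape)."""
    HTML = "text/html"
    JSON = "application/json"


@dataclass
class Content:
    """Stand-in for OmniRead's Content; field names are assumptions."""
    raw: bytes
    content_type: ContentType


def supports(content: Content, supported_types: set[ContentType]) -> bool:
    """True when the content's type is one the parser declares."""
    return content.content_type in supported_types


html_only = {ContentType.HTML}
html_ok = supports(Content(b"<p>hi</p>", ContentType.HTML), html_only)
json_ok = supports(Content(b"{}", ContentType.JSON), html_only)
print(html_ok, json_ok)
```
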