Parser

omniread.html.parser

HTML parser base implementations for OmniRead.

This module provides reusable HTML parsing utilities built on top of the abstract parser contracts defined in omniread.core.parser.

It supplies:

- Content-type enforcement for HTML inputs
- BeautifulSoup initialization and lifecycle management
- Common helper methods for extracting structured data from HTML elements

Concrete parsers must subclass HTMLParser and implement the parse() method to return a structured representation appropriate for their use case.

HTMLParser

HTMLParser(content: Content, features: str = 'html.parser')

Bases: BaseParser[T], Generic[T]

Base HTML parser.

This class extends the core BaseParser with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers.

Concrete parsers must explicitly define the return type `T`.

Characteristics:

- Accepts only HTML content
- Owns a parsed BeautifulSoup DOM tree
- Provides pure helper utilities for common HTML structures

Concrete subclasses must:

- Define the output type T
- Implement the parse() method

Initialize the HTML parser.

Parameters:

Name	Type	Description	Default
`content`	`Content`	HTML content to be parsed.	required
`features`	`str`	BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml').	`'html.parser'`

Raises:

Type	Description
`ValueError`	If the content is empty or not valid HTML.
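The subclassing contract and constructor behavior described above can be illustrated with a minimal stand-in. This is a sketch only, not the OmniRead implementation: `SketchHTMLParser`, `HeadingParser`, and the `_soup` attribute are hypothetical names, and the stand-in accepts a plain string rather than a `Content` object so that it runs without OmniRead installed.

```python
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

from bs4 import BeautifulSoup

T = TypeVar("T")


class SketchHTMLParser(ABC, Generic[T]):
    """Hypothetical stand-in for HTMLParser (not the OmniRead class)."""

    def __init__(self, html: str, features: str = "html.parser") -> None:
        # Assumption: empty input is rejected with ValueError, as documented.
        if not html.strip():
            raise ValueError("HTML content is empty")
        # Assumption: the parser owns one BeautifulSoup tree for its lifetime.
        self._soup = BeautifulSoup(html, features)

    @abstractmethod
    def parse(self) -> T:
        """Return a structured representation of the document."""


class HeadingParser(SketchHTMLParser[list[str]]):
    """Concrete parser: defines T as list[str] and implements parse()."""

    def parse(self) -> list[str]:
        # Collect h1/h2 headings in document order, stripped of whitespace.
        return [h.get_text(strip=True) for h in self._soup.find_all(["h1", "h2"])]


headings = HeadingParser("<h1>A</h1><p>x</p><h2> B </h2>").parse()
```

With the real class, a subclass would presumably be declared as `HTMLParser[list[str]]` and constructed from a `Content` instance instead of a string.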

supported_types `class-attribute` `instance-attribute`

supported_types: set[ContentType] = {HTML}

Set of content types supported by this parser (HTML only).

parse `abstractmethod`

parse() -> T

Fully parse the HTML content into structured output.

Implementations must interpret the entire HTML DOM and return deterministic, structured output: the same input must always produce the same result.

Returns:

Type	Description
`T`	Parsed representation of type `T`.

parse_div `staticmethod`

parse_div(div: Tag, *, separator: str = ' ') -> str

Extract normalized text from a <div> element.

Parameters:

Name	Type	Description	Default
`div`	`Tag`	BeautifulSoup tag representing a `<div>`.	required
`separator`	`str`	String used to separate text nodes.	`' '`

Returns:

Type	Description
`str`	Flattened, whitespace-normalized text content.
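As a rough illustration of what this helper returns, the sketch below uses BeautifulSoup's `get_text` with `strip=True`. The function name `div_text` is hypothetical, and the actual `parse_div` may normalize whitespace differently (for example, this sketch does not collapse runs of whitespace inside a single text node).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def div_text(div: Tag, *, separator: str = " ") -> str:
    """Hypothetical equivalent of parse_div: join stripped text nodes."""
    # strip=True trims each text node and skips whitespace-only nodes.
    return div.get_text(separator=separator, strip=True)


soup = BeautifulSoup(
    "<div> Hello <b>world</b>\n <span> again </span></div>", "html.parser"
)
div = soup.find("div")

plain = div_text(div)                    # "Hello world again"
piped = div_text(div, separator=" | ")   # "Hello | world | again"
```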

parse_link `staticmethod`

parse_link(a: Tag) -> Optional[str]

Extract the hyperlink reference from an <a> element.

Parameters:

Name	Type	Description	Default
`a`	`Tag`	BeautifulSoup tag representing an anchor.	required

Returns:

Type	Description
`Optional[str]`	The value of the `href` attribute, or None if absent.
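The documented behavior (return the `href`, or None when it is missing) can be sketched with BeautifulSoup's `Tag.get`. The name `link_href` is hypothetical; whether the real `parse_link` strips or resolves the URL is not stated here, so this sketch returns the attribute value unchanged.

```python
from typing import Optional

from bs4 import BeautifulSoup
from bs4.element import Tag


def link_href(a: Tag) -> Optional[str]:
    """Hypothetical equivalent of parse_link."""
    href = a.get("href")  # None when the attribute is absent
    return href if isinstance(href, str) else None


soup = BeautifulSoup(
    '<a href="/docs/intro">Intro</a><a name="top">Top</a>', "html.parser"
)
with_href, without_href = soup.find_all("a")

first = link_href(with_href)      # "/docs/intro"
second = link_href(without_href)  # None
```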

parse_meta

parse_meta() -> dict[str, Any]

Extract high-level metadata from the HTML document.

This includes:

- Document title
- <meta> tag name/property → content mappings

Returns:

Type	Description
`dict[str, Any]`	Dictionary containing extracted metadata.
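The exact shape of the returned dictionary is not specified above. The following sketch assumes a `"title"` key and a nested `"meta"` mapping; both key names, and the function name `document_meta`, are illustrative assumptions rather than the library's actual output format.

```python
from typing import Any

from bs4 import BeautifulSoup


def document_meta(soup: BeautifulSoup) -> dict[str, Any]:
    """Hypothetical equivalent of parse_meta (return shape is assumed)."""
    title = soup.title.get_text(strip=True) if soup.title else None

    meta: dict[str, str] = {}
    for tag in soup.find_all("meta"):
        # Use the name attribute, falling back to property (e.g. Open Graph).
        key = tag.get("name") or tag.get("property")
        content = tag.get("content")
        if key and content is not None:
            meta[key] = content

    return {"title": title, "meta": meta}


html = (
    "<html><head><title> Demo </title>"
    '<meta name="description" content="A page">'
    '<meta property="og:title" content="Demo OG">'
    '<meta charset="utf-8">'
    "</head><body></body></html>"
)
result = document_meta(BeautifulSoup(html, "html.parser"))
```

In this sketch, `<meta>` tags without a `name` or `property` attribute (such as the `charset` declaration) are skipped.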

parse_table `staticmethod`

parse_table(table: Tag) -> list[list[str]]

Parse an HTML table into a 2D list of strings.

Parameters:

Name	Type	Description	Default
`table`	`Tag`	BeautifulSoup tag representing a `<table>`.	required

Returns:

Type	Description
`list[list[str]]`	A list of rows, where each row is a list of cell text values.
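A minimal sketch of the row-and-cell extraction described above, using only documented BeautifulSoup calls. The name `table_rows` is hypothetical, and the handling of header cells, empty rows, and nested tables in the real `parse_table` may differ from what is assumed here.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def table_rows(table: Tag) -> list[list[str]]:
    """Hypothetical equivalent of parse_table."""
    rows: list[list[str]] = []
    for tr in table.find_all("tr"):
        # Assumption: header (<th>) and data (<td>) cells are treated alike.
        cells = tr.find_all(["th", "td"])
        if cells:  # skip rows that contain no cells
            rows.append([cell.get_text(" ", strip=True) for cell in cells])
    return rows


soup = BeautifulSoup(
    "<table>"
    "<tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td> apple </td><td>3</td></tr>"
    "<tr><td>pear</td><td><b>10</b></td></tr>"
    "</table>",
    "html.parser",
)
rows = table_rows(soup.find("table"))
```

Because `find_all` searches recursively, this sketch would also pick up rows of any nested tables; a stricter version would restrict the search to direct children.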

supports

supports() -> bool

Check whether this parser supports the content's type.

Returns:

Type	Description
`bool`	True if the content type is supported; False otherwise.
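The check amounts to a set-membership test against `supported_types`. The sketch below uses stand-in `ContentType`, `Content`, and `SketchParser` definitions so that it runs on its own; their fields and enum values are assumptions and do not reflect the actual definitions in OmniRead.

```python
from dataclasses import dataclass
from enum import Enum


class ContentType(Enum):
    """Hypothetical stand-in for OmniRead's ContentType."""
    HTML = "text/html"
    JSON = "application/json"


@dataclass
class Content:
    """Hypothetical stand-in for OmniRead's Content."""
    raw: bytes
    content_type: ContentType


class SketchParser:
    # Mirrors the documented class attribute: HTML only.
    supported_types: set[ContentType] = {ContentType.HTML}

    def __init__(self, content: Content) -> None:
        self.content = content

    def supports(self) -> bool:
        # True only when the content's type is in supported_types.
        return self.content.content_type in self.supported_types


html_ok = SketchParser(Content(b"<p>hi</p>", ContentType.HTML)).supports()
json_ok = SketchParser(Content(b"{}", ContentType.JSON)).supports()
```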
