Parser

omniread.html.parser

HTML parser base implementations for OmniRead.

This module provides reusable HTML parsing utilities built on top of the abstract parser contracts defined in omniread.core.parser.

It supplies:

- Content-type enforcement for HTML inputs
- BeautifulSoup initialization and lifecycle management
- Common helper methods for extracting structured data from HTML elements

Concrete parsers must subclass HTMLParser and implement the parse() method to return a structured representation appropriate for their use case.

HTMLParser

HTMLParser(content: Content, features: str = 'html.parser')

Bases: BaseParser[T], Generic[T]

Base HTML parser.

This class extends the core BaseParser with HTML-specific behavior, including DOM parsing via BeautifulSoup and reusable extraction helpers.

Provides reusable helpers for HTML extraction. Concrete parsers must explicitly define the return type.

Characteristics:

- Accepts only HTML content
- Owns a parsed BeautifulSoup DOM tree
- Provides pure helper utilities for common HTML structures

Concrete subclasses must:

- Define the output type T
- Implement the parse() method
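The subclassing contract above can be sketched without OmniRead installed. The block below is a minimal stand-in, not OmniRead's source: `SketchHTMLParser`, `TitleParser`, the `Content` field names (`raw`, `content_type`), and the `JSON` enum member are all hypothetical, and the standard library's `html.parser` replaces BeautifulSoup so the example has no dependencies. It shows only how a concrete parser binds `T` and implements `parse()`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from html.parser import HTMLParser as StdlibHTMLParser
from typing import Generic, List, TypeVar

T = TypeVar("T")


class ContentType(Enum):
    """Stand-in enum; only HTML is confirmed by the documentation."""
    HTML = "text/html"
    JSON = "application/json"  # hypothetical member, used for contrast


@dataclass(frozen=True)
class Content:
    """Stand-in container; the field names are assumptions."""
    raw: bytes
    content_type: ContentType


class SketchHTMLParser(ABC, Generic[T]):
    """Mirrors the documented contract; not OmniRead's actual source."""

    supported_types = {ContentType.HTML}

    def __init__(self, content: Content) -> None:
        # Assumed checks, modeled on the documented ValueError.
        if content.content_type not in self.supported_types:
            raise ValueError("content is not HTML")
        if not content.raw:
            raise ValueError("content is empty")
        self.content = content

    @abstractmethod
    def parse(self) -> T:
        """Return the structured output for this parser."""


class _TitleCollector(StdlibHTMLParser):
    """Collects the text inside <title> using only the standard library."""

    def __init__(self) -> None:
        super().__init__()
        self.in_title = False
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


class TitleParser(SketchHTMLParser[str]):
    """Concrete subclass: binds T to str and implements parse()."""

    def parse(self) -> str:
        collector = _TitleCollector()
        collector.feed(self.content.raw.decode("utf-8"))
        collector.close()
        return "".join(collector.parts).strip()
```

Because `parse()` is abstract, instantiating the base class directly raises `TypeError`; only subclasses that implement it can be constructed.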

Initialize the HTML parser.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| content | Content | HTML content to be parsed. | required |
| features | str | BeautifulSoup parser backend to use (e.g., 'html.parser', 'lxml'). | 'html.parser' |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the content is empty or not valid HTML. |

supported_types class-attribute instance-attribute

supported_types: set[ContentType] = {HTML}

Set of content types supported by this parser (HTML only).

parse abstractmethod

parse() -> T

Fully parse the HTML content into structured output.

Implementations must fully interpret the HTML DOM and return a deterministic, structured output.

Returns:

| Type | Description |
| --- | --- |
| T | Parsed representation of type T. |

parse_div staticmethod

parse_div(div: Tag, *, separator: str = ' ') -> str

Extract normalized text from a <div> element.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| div | Tag | BeautifulSoup tag representing a <div>. | required |
| separator | str | String used to separate text nodes. | ' ' |

Returns:

| Type | Description |
| --- | --- |
| str | Flattened, whitespace-normalized text content. |
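To make the role of `separator` concrete, here is a small sketch of the flattening step on its own. It assumes the text nodes of the `<div>` have already been extracted into a list of strings; the real helper takes a BeautifulSoup `Tag`, and its exact normalization rules may differ. The function name `join_text_nodes` is hypothetical.

```python
from typing import List


def join_text_nodes(nodes: List[str], separator: str = " ") -> str:
    """Flatten text fragments into one whitespace-normalized string."""
    # Collapse runs of whitespace inside each fragment.
    cleaned = (" ".join(node.split()) for node in nodes)
    # Drop fragments that were whitespace only, then join the rest.
    return separator.join(text for text in cleaned if text)
```

For example, `join_text_nodes(["Intro", "  ", "Body text"], separator=" | ")` yields `"Intro | Body text"`.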

parse_link

parse_link(a: Tag) -> Optional[str]

Extract the hyperlink reference from an <a> element.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| a | Tag | BeautifulSoup tag representing an anchor. | required |

Returns:

| Type | Description |
| --- | --- |
| Optional[str] | The value of the href attribute, or None if absent. |
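The "href or None" behavior can be illustrated with a dependency-free sketch. This is not the library's implementation: it works on an HTML string with the standard library's `html.parser` instead of a BeautifulSoup `Tag`, and the names `_FirstAnchor` and `first_link` are hypothetical.

```python
from html.parser import HTMLParser as StdlibHTMLParser
from typing import Optional


class _FirstAnchor(StdlibHTMLParser):
    """Records the href of the first <a> element encountered."""

    def __init__(self) -> None:
        super().__init__()
        self.seen = False
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "a" and not self.seen:
            self.seen = True
            # dict.get returns None when the attribute is missing.
            self.href = dict(attrs).get("href")


def first_link(html: str) -> Optional[str]:
    """Return the first anchor's href, or None if it has no href."""
    collector = _FirstAnchor()
    collector.feed(html)
    collector.close()
    return collector.href
```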

parse_meta

parse_meta() -> dict[str, Any]

Extract high-level metadata from the HTML document.

This includes:

- Document title
- <meta> tag name/property → content mappings

Returns:

| Type | Description |
| --- | --- |
| dict[str, Any] | Dictionary containing extracted metadata. |
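The documentation does not spell out the shape of the returned dictionary, so the sketch below assumes one plausible layout: a `"title"` key plus a `"meta"` mapping keyed by each tag's `name` or `property`. It uses the standard library's `html.parser` rather than BeautifulSoup to stay dependency-free, and `_MetaCollector` and `extract_meta` are hypothetical names.

```python
from html.parser import HTMLParser as StdlibHTMLParser
from typing import Any, Dict, List


class _MetaCollector(StdlibHTMLParser):
    """Collects the <title> text and <meta> name/property pairs."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._title_parts: List[str] = []
        self.meta: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # Prefer name=, fall back to property= (e.g. Open Graph tags).
            key = attr.get("name") or attr.get("property")
            if key and attr.get("content") is not None:
                self.meta[key] = attr["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_meta(html: str) -> Dict[str, Any]:
    """Return the assumed layout: {'title': ..., 'meta': {...}}."""
    collector = _MetaCollector()
    collector.feed(html)
    collector.close()
    title = "".join(collector._title_parts).strip() or None
    return {"title": title, "meta": collector.meta}
```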

parse_table staticmethod

parse_table(table: Tag) -> list[list[str]]

Parse an HTML table into a 2D list of strings.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| table | Tag | BeautifulSoup tag representing a <table>. | required |

Returns:

| Type | Description |
| --- | --- |
| list[list[str]] | A list of rows, where each row is a list of cell text values. |
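The row-by-row shape of the result can be shown with a dependency-free sketch. It is an approximation under stated assumptions, not the library's code: it parses an HTML string with the standard library's `html.parser` instead of a BeautifulSoup `Tag`, treats both `<th>` and `<td>` as cells, and ignores nested tables. The names `_TableCollector` and `table_to_rows` are hypothetical.

```python
from html.parser import HTMLParser as StdlibHTMLParser
from typing import List, Optional


class _TableCollector(StdlibHTMLParser):
    """Builds a list of rows, each a list of cell strings."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell: Optional[List[str]] = None  # None means "outside a cell"

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Normalize whitespace before storing the cell text.
            text = " ".join("".join(self._cell).split())
            self.rows[-1].append(text)
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_to_rows(html: str) -> List[List[str]]:
    """Parse a single, non-nested table into a 2D list of strings."""
    collector = _TableCollector()
    collector.feed(html)
    collector.close()
    return collector.rows
```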

supports

supports() -> bool

Check whether this parser supports the content's type.

Returns:

| Type | Description |
| --- | --- |
| bool | True if the content type is supported; False otherwise. |
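The check described here amounts to a set-membership test against `supported_types`. The sketch below uses stand-in classes to show that logic in isolation; the `Content` field names, the `JSON` enum member, and `SketchParser` are assumptions, not OmniRead's definitions.

```python
from dataclasses import dataclass
from enum import Enum


class ContentType(Enum):
    """Stand-in enum; only HTML is confirmed by the documentation."""
    HTML = "text/html"
    JSON = "application/json"  # hypothetical member, used for contrast


@dataclass(frozen=True)
class Content:
    """Stand-in container; the field names are assumptions."""
    raw: bytes
    content_type: ContentType


class SketchParser:
    """Shows only the supports() logic, with no parsing."""

    supported_types = {ContentType.HTML}

    def __init__(self, content: Content) -> None:
        self.content = content

    def supports(self) -> bool:
        # True when the content's type is one the parser declares.
        return self.content.content_type in self.supported_types
```

A caller could use such a check to pick a parser before constructing anything that would raise on unsupported input.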