Skip to content

Scraper

omniread.html.scraper

HTML scraping implementation for OmniRead.

This module provides an HTTP-based scraper for retrieving HTML documents. It implements the core BaseScraper contract using httpx as the transport layer.

This scraper is responsible for: - Fetching raw HTML bytes over HTTP(S) - Validating response content type - Attaching HTTP metadata to the returned content

This scraper is not responsible for: - Parsing or interpreting HTML - Retrying failed requests - Managing crawl policies or rate limiting

HTMLScraper

HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)

Bases: BaseScraper

Base HTML scraper using httpx.

This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a Content object.

Fetches raw bytes and metadata only. The scraper: - Uses httpx.Client for HTTP requests - Enforces an HTML content type - Preserves HTTP response metadata

The scraper does not: - Parse HTML - Perform retries or backoff - Handle non-HTML responses

Initialize the HTML scraper.

Parameters:

Name Type Description Default
client Optional[Client]

Optional pre-configured httpx.Client. If omitted, a client is created internally.

None
timeout float

Request timeout in seconds.

15.0
headers Optional[Mapping[str, str]]

Optional default HTTP headers.

None
follow_redirects bool

Whether to follow HTTP redirects.

True

fetch

fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch an HTML document from the given source.

Parameters:

Name Type Description Default
source str

URL of the HTML document.

required
metadata Optional[Mapping[str, Any]]

Optional metadata to be merged into the returned content.

None

Returns:

Type Description
Content

A Content instance containing:

Content
  • Raw HTML bytes
Content
  • Source URL
Content
  • HTML content type
Content
  • HTTP response metadata

Raises:

Type Description
HTTPError

If the HTTP request fails.

ValueError

If the response is not valid HTML.

validate_content_type

validate_content_type(response: httpx.Response) -> None

Validate that the HTTP response contains HTML content.

Parameters:

Name Type Description Default
response Response

HTTP response returned by httpx.

required

Raises:

Type Description
ValueError

If the Content-Type header is missing or does not indicate HTML content.