Scraper

omniread.html.scraper

HTML scraping implementation for OmniRead.

This module provides an HTTP-based scraper for retrieving HTML documents. It implements the core BaseScraper contract using httpx as the transport layer.

This scraper is responsible for:

- Fetching raw HTML bytes over HTTP(S)
- Validating response content type
- Attaching HTTP metadata to the returned content

This scraper is not responsible for:

- Parsing or interpreting HTML
- Retrying failed requests
- Managing crawl policies or rate limiting

HTMLScraper

HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)

Bases: BaseScraper

Base HTML scraper using httpx.

This scraper retrieves HTML documents over HTTP(S) and returns them as raw content wrapped in a Content object.

Fetches raw bytes and metadata only. The scraper:

- Uses httpx.Client for HTTP requests
- Enforces an HTML content type
- Preserves HTTP response metadata

The scraper does not:

- Parse HTML
- Perform retries or backoff
- Handle non-HTML responses

Initialize the HTML scraper.

Parameters:

Name	Type	Description	Default
`client`	`Optional[Client]`	Optional pre-configured `httpx.Client`. If omitted, a client is created internally.	`None`
`timeout`	`float`	Request timeout in seconds.	`15.0`
`headers`	`Optional[Mapping[str, str]]`	Optional default HTTP headers.	`None`
`follow_redirects`	`bool`	Whether to follow HTTP redirects.	`True`
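
As a rough illustration of the `client` parameter, the sketch below passes a caller-owned `httpx.Client` to `HTMLScraper`. The import path follows the module name given above. The function name `fetch_with_shared_client` and the `User-Agent` value are hypothetical. How `timeout`, `headers`, and `follow_redirects` interact with a supplied client is not stated here, so this sketch sets them on the client itself.

```python
def fetch_with_shared_client(url: str):
    """Fetch one page through a caller-configured httpx.Client (illustrative only)."""
    # Imports are local so the sketch can be loaded without the packages installed.
    import httpx
    from omniread.html.scraper import HTMLScraper

    # The caller owns the client, so the caller also closes it (via the with-block).
    with httpx.Client(
        headers={"User-Agent": "example-bot/0.1"},  # hypothetical header value
        timeout=10.0,
        follow_redirects=True,
    ) as client:
        scraper = HTMLScraper(client=client)
        # Returns the Content object described under fetch.
        return scraper.fetch(url)
```

Sharing one client across many fetches lets httpx reuse connections, which is the usual reason to inject a client rather than rely on the internally created one.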

fetch

fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content

Fetch an HTML document from the given source.

Parameters:

Name	Type	Description	Default
`source`	`str`	URL of the HTML document.	required
`metadata`	`Optional[Mapping[str, Any]]`	Optional metadata to be merged into the returned content.	`None`

Returns:

Type	Description
`Content`	A `Content` instance containing the raw HTML bytes, the source URL, the HTML content type, and the HTTP response metadata.

Raises:

Type	Description
`HTTPError`	If the HTTP request fails.
`ValueError`	If the response does not declare an HTML content type.
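
Because the scraper performs no retries or backoff, a caller that wants them has to add them around `fetch`. The following is a minimal sketch of one way to do that, not part of OmniRead: `fetch_with_retries` and all of its parameters are hypothetical names, and the default set of retryable exceptions is an assumption chosen so that a `ValueError` from content-type validation is not retried.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[str], T],
    source: str,
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    # Assumption: only transient failures are retried. With httpx one might
    # pass something like (httpx.HTTPError,) here instead.
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(source), retrying with exponential backoff on listed errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch(source)
        except retry_on:
            # Re-raise once the final attempt has failed.
            if attempt == attempts - 1:
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

A call might look like `fetch_with_retries(scraper.fetch, url, retry_on=(httpx.HTTPError,))`, where `scraper` is an `HTMLScraper` instance. Exceptions outside `retry_on` propagate immediately.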

validate_content_type

validate_content_type(response: httpx.Response) -> None

Validate that the HTTP response contains HTML content.

Parameters:

Name	Type	Description	Default
`response`	`Response`	HTTP response returned by `httpx`.	required

Raises:

Type	Description
`ValueError`	If the `Content-Type` header is missing or does not indicate HTML content.
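
To make the rule concrete, here is a standalone sketch of what such a check could look like. It is not the library's implementation: `check_html_content_type` and `HTML_MEDIA_TYPES` are hypothetical names, the function takes a plain header mapping rather than an `httpx.Response`, and the set of media types treated as HTML is an assumption.

```python
from typing import Mapping

# Assumption: these media types count as HTML for the purposes of this sketch.
HTML_MEDIA_TYPES = frozenset({"text/html", "application/xhtml+xml"})


def check_html_content_type(headers: Mapping[str, str]) -> None:
    """Raise ValueError unless the Content-Type header names an HTML media type."""
    # Header names are case-insensitive, so normalise keys before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("content-type")
    if not raw:
        raise ValueError("Missing Content-Type header")
    # Drop parameters such as "; charset=utf-8" and compare the bare media type.
    media_type = raw.split(";", 1)[0].strip().lower()
    if media_type not in HTML_MEDIA_TYPES:
        raise ValueError(f"Expected HTML content, got {media_type!r}")
```

With a real response, `response.headers` from httpx is already a case-insensitive mapping, so the key normalisation step only matters for plain dictionaries.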