Scraper
omniread.html.scraper
HTML scraping implementation for OmniRead.
This module provides an HTTP-based scraper for retrieving HTML documents.
It implements the core BaseScraper contract using httpx as the transport
layer.
This scraper is responsible for:
- Fetching raw HTML bytes over HTTP(S)
- Validating response content type
- Attaching HTTP metadata to the returned content
This scraper is not responsible for:
- Parsing or interpreting HTML
- Retrying failed requests
- Managing crawl policies or rate limiting
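Because retries are left to the caller, a thin wrapper around any fetch callable is one way to add them. The sketch below is an illustration only: `fetch_with_retries`, its parameters, and the choice of exponential backoff are assumptions and not part of OmniRead. It takes the fetch function as an argument, so it does not depend on how the scraper is constructed.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[str], T],
    source: str,
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(source), retrying on the given exception types.

    Hypothetical helper: waits base_delay, 2*base_delay, 4*base_delay, ...
    between attempts and re-raises the last error when attempts run out.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch(source)
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

In practice `fetch` would be the scraper's bound `fetch` method, and `retry_on` would likely be narrowed to transient transport errors rather than the broad default used here.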
HTMLScraper
HTMLScraper(*, client: Optional[httpx.Client] = None, timeout: float = 15.0, headers: Optional[Mapping[str, str]] = None, follow_redirects: bool = True)
Bases: BaseScraper
Base HTML scraper using httpx.
This scraper retrieves HTML documents over HTTP(S) and returns them
as raw content wrapped in a Content object.
Fetches raw bytes and metadata only.
The scraper:
- Uses httpx.Client for HTTP requests
- Enforces an HTML content type
- Preserves HTTP response metadata
The scraper does not:
- Parse HTML
- Perform retries or backoff
- Handle non-HTML responses
Initialize the HTML scraper.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| client | Optional[Client] | Optional pre-configured | None |
| timeout | float | Request timeout in seconds. | 15.0 |
| headers | Optional[Mapping[str, str]] | Optional default HTTP headers. | None |
| follow_redirects | bool | Whether to follow HTTP redirects. | True |
fetch
fetch(source: str, *, metadata: Optional[Mapping[str, Any]] = None) -> Content
Fetch an HTML document from the given source.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | str | URL of the HTML document. | required |
| metadata | Optional[Mapping[str, Any]] | Optional metadata to be merged into the returned content. | None |
Returns:
| Type | Description |
|---|---|
| Content | A |
Raises:
| Type | Description |
|---|---|
| HTTPError | If the HTTP request fails. |
| ValueError | If the response is not valid HTML. |
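The `metadata` argument is described as being merged into the returned content. One plausible reading of that merge is sketched below. `merge_metadata`, the key names, and the rule that caller-supplied keys win on conflict are all assumptions made for illustration, since the precedence is not stated here.

```python
from typing import Any, Dict, Mapping, Optional


def merge_metadata(
    http_metadata: Mapping[str, Any],
    extra: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Return a new dict combining HTTP metadata with caller metadata.

    Assumption: caller-supplied keys override HTTP-derived keys.
    Neither input mapping is modified.
    """
    merged = dict(http_metadata)
    if extra:
        merged.update(extra)
    return merged


# Hypothetical key names, chosen only for this example.
http_meta = {"status_code": 200, "url": "https://example.com/"}
result = merge_metadata(http_meta, {"crawl_id": "demo-1"})
```

Returning a fresh dict keeps the caller's mapping untouched, which matters when the same metadata object is reused across many fetches.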
validate_content_type
validate_content_type(response: httpx.Response) -> None
Validate that the HTTP response contains HTML content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| response | Response | HTTP response returned by | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If the |
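A content-type check of this kind can be sketched on the header value alone. This is not the library's implementation: `is_html_content_type`, `require_html`, and the set of accepted media types are assumptions for illustration, and the real method receives an `httpx.Response` rather than a string.

```python
from typing import Optional

# Assumed set of acceptable media types; the real scraper may accept others.
HTML_MEDIA_TYPES = frozenset({"text/html", "application/xhtml+xml"})


def is_html_content_type(header_value: Optional[str]) -> bool:
    """Return True if a Content-Type header value names an HTML media type."""
    if not header_value:
        return False
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in HTML_MEDIA_TYPES


def require_html(header_value: Optional[str]) -> None:
    """Raise ValueError for non-HTML values, mirroring the documented error type."""
    if not is_html_content_type(header_value):
        raise ValueError(f"Expected HTML content, got {header_value!r}")
```

With httpx, the header value would typically come from `response.headers.get("content-type")`.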