Scraping API
The crawley.scraping module is the high-level, ergonomic entry point for
"just scrape this page" use cases. It is built on the same httpx + lxml
stack as the crawler but gives you a friendly, parsel / requests-html
flavoured interface.
Fetching a page
fetch() is synchronous. For async code use afetch(), and to fetch many
pages concurrently use afetch_all():
import asyncio
from crawley.scraping import afetch, afetch_all
doc = asyncio.run(afetch("https://example.com"))
urls = ["https://example.com/1", "https://example.com/2"]
docs = asyncio.run(afetch_all(urls)) # list of Document (None on error)
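The "None on error" convention can be illustrated without the library. The following is a minimal sketch using only the standard library; fake_fetch and fetch_all_or_none are hypothetical stand-ins, not part of crawley, and the real afetch_all may be implemented differently.

```python
import asyncio
from typing import List, Optional


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for a real fetch; fails for URLs containing 'bad'."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    if "bad" in url:
        raise ConnectionError(f"cannot reach {url}")
    return f"<html>{url}</html>"


async def fetch_all_or_none(urls: List[str]) -> List[Optional[str]]:
    """Run all fetches concurrently and replace each failure with None."""
    results = await asyncio.gather(
        *(fake_fetch(u) for u in urls), return_exceptions=True
    )
    # gather() keeps results in input order, so positions still match urls.
    return [None if isinstance(r, BaseException) else r for r in results]


urls = [
    "https://example.com/1",
    "https://example.com/bad",
    "https://example.com/2",
]
pages = asyncio.run(fetch_all_or_none(urls))

# Callers should filter out the failures before using the results.
ok_pages = [p for p in pages if p is not None]
```

Because failed entries stay in the list as None, the result can still be zipped with the input URLs to see which ones failed.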
You can also parse HTML you already have:
Selecting elements
A Document (and any Element) supports CSS selectors and XPath:
doc.css("div.quote") # -> list[Element]
doc.css_first("h1") # -> Element | None
doc.xpath("//h1/text()") # -> list (strings or Element)
doc.title # -> the <title> text
Queries can be nested:
for quote in doc.css("div.quote"):
    text = quote.css_first("span.text").text
    author = quote.css_first("small.author").text
    tags = quote.css("a.tag::text")
Pseudo-selectors
Append ::text or ::attr(name) to a CSS selector to pull values instead of
elements (just like scrapy / parsel):
doc.css("span.text::text") # -> ["The quote", ...]
doc.css("a::attr(href)") # -> ["https://...", ...] (absolute)
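To make the suffix syntax concrete, here is a minimal sketch of how a selector string could be split into its CSS part and its pseudo-selector. split_pseudo is a hypothetical helper written for illustration; it is not taken from crawley and says nothing about how the library parses selectors internally.

```python
import re
from typing import Optional, Tuple

# Matches a trailing ::text or ::attr(name) suffix.
_PSEUDO = re.compile(r"::(text|attr\(([^)]+)\))$")


def split_pseudo(selector: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (base_selector, kind, attr_name); kind is 'text', 'attr' or None."""
    match = _PSEUDO.search(selector)
    if match is None:
        # Plain CSS selector: the caller would return elements.
        return selector, None, None
    base = selector[: match.start()]
    if match.group(1) == "text":
        return base, "text", None
    return base, "attr", match.group(2)


text_query = split_pseudo("span.text::text")
attr_query = split_pseudo("a::attr(href)")
plain_query = split_pseudo("div.quote")
```

Note that the class name in span.text is left alone: only a suffix introduced by a double colon is treated as a pseudo-selector.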
Element helpers
el = doc.css_first("a")
el.text # normalized recursive text
el.attr("href") # attribute (or a default)
el.attrs # dict of all attributes
el.html # serialized back to HTML
Links
links() returns the de-duplicated, absolute hrefs on the page:
Absolute URLs
When a URL is provided (fetch does this automatically), relative links
are resolved to absolute URLs, so href="page2" on a page under
https://site/dir/ becomes https://site/dir/page2.
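The resolution and de-duplication described above can be reproduced with the standard library. This is a sketch of the general technique, assuming standard URL-joining rules; absolute_links is a hypothetical helper, and whether crawley uses urljoin internally is not stated here.

```python
from typing import Iterable, List
from urllib.parse import urljoin


def absolute_links(base_url: str, hrefs: Iterable[str]) -> List[str]:
    """Resolve each href against base_url and drop duplicates, keeping order."""
    resolved = (urljoin(base_url, href) for href in hrefs)
    # dict.fromkeys() de-duplicates while preserving first-seen order.
    return list(dict.fromkeys(resolved))


links = absolute_links(
    "https://site/dir/index.html",
    ["page2", "/about", "page2", "https://other.example/x"],
)
```

With these inputs, "page2" resolves to https://site/dir/page2, "/about" resolves against the site root, and the already-absolute link is left unchanged. A base URL without a trailing slash or file name (https://site/dir) would resolve "page2" to https://site/page2 instead, so the base URL matters.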
Declarative extraction
extract() maps field names to selectors. A string selector yields a
single value (the first match); a selector wrapped in a one-element list
yields a list of every match:
doc.extract({
    "title": "h1::text",
    "price": "span.price::text",
    "images": ["img::attr(src)"],
    "authors": ["small.author::text"],
})
# {"title": "...", "price": "...", "images": [...], "authors": [...]}
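The string-versus-list convention can be sketched independently of the library. In the example below, extract_with and FAKE_MATCHES are hypothetical names used only for illustration. Returning None when a string selector matches nothing is an assumption, not documented behaviour of extract().

```python
from typing import Callable, Dict, List, Optional, Union

Spec = Dict[str, Union[str, List[str]]]
Record = Dict[str, Union[Optional[str], List[str]]]


def extract_with(select_all: Callable[[str], List[str]], spec: Spec) -> Record:
    """Apply a field -> selector spec using a 'return every match' function."""
    record: Record = {}
    for field, selector in spec.items():
        if isinstance(selector, list):
            # One-element list: keep every match.
            record[field] = select_all(selector[0])
        else:
            # Plain string: keep the first match (None if absent; assumption).
            matches = select_all(selector)
            record[field] = matches[0] if matches else None
    return record


# Canned selector results standing in for a parsed page.
FAKE_MATCHES = {
    "h1::text": ["Widget"],
    "img::attr(src)": ["a.png", "b.png"],
}

record = extract_with(
    lambda selector: FAKE_MATCHES.get(selector, []),
    {
        "title": "h1::text",
        "price": "span.price::text",
        "images": ["img::attr(src)"],
    },
)
```

Keeping the spec as plain data means the same mapping can be reused across pages that share a layout.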
scrape() fetches and extracts in a single call:
Inside a crawler
The same shortcuts are available on the crawler's response object, so you can
use the modern API inside a scraper's scrape() method: