API Reference
Auto-generated from the docstrings.
Scraping
crawley.scraping.fetch(url, **kwargs)
Fetch url and return a parsed :class:Document (synchronous).
crawley.scraping.afetch(url, client=None, **kwargs)
async
Fetch url asynchronously and return a :class:Document.
crawley.scraping.afetch_all(urls, **kwargs)
async
Fetch many urls concurrently, returning a list of :class:Document.
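The concurrent fan-out described above can be pictured with plain asyncio. The sketch below is only an illustration under assumptions: `fake_fetch` is a hypothetical stand-in that performs no network I/O, and `fetch_all_sketch` is not crawley's implementation. It shows how `asyncio.gather` starts every fetch at once and returns the results in input order.

```python
import asyncio


async def fake_fetch(url: str) -> dict:
    """Hypothetical stand-in for a real fetch: no network access."""
    await asyncio.sleep(0)  # yield control, as a real download would
    return {"url": url, "length": len(url)}


async def fetch_all_sketch(urls: list[str]) -> list[dict]:
    """Start every fetch at once and wait for all of them."""
    # gather() preserves the order of its arguments in the result list.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


results = asyncio.run(
    fetch_all_sketch(["https://a.example/", "https://b.example/x"])
)
```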
crawley.scraping.scrape(url, rules, **kwargs)
Fetch url and immediately call :meth:Document.extract on the result with rules.
crawley.scraping.parse(html, url=None)
Parse an html string into a :class:Document.
crawley.scraping.Document
Bases: Element
A parsed html document ready to be scraped.
title
property
The page <title> text, if any.
extract(rules)
Extract a dict of fields from the document.
rules maps field names to selectors. A plain string selector yields a single value (the first match); a one-element list selector yields the list of every match::
    doc.extract({
        "title": "h1::text",
        "price": "span.price::text",
        "images": ["img::attr(src)"],
    })
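The string-versus-list rule can be illustrated without a real page. This is a minimal sketch, not the library's code: `extract_sketch`, `FAKE_MATCHES` and the `select` callable are hypothetical, and returning `None` when a string selector matches nothing is an assumption, since the behaviour for missing fields is not stated here.

```python
from typing import Callable


def extract_sketch(rules: dict, select: Callable[[str], list]) -> dict:
    """Apply `rules` using `select`, a callable returning every match."""
    out = {}
    for field_name, selector in rules.items():
        if isinstance(selector, list):
            # One-element list: keep the list of every match.
            out[field_name] = select(selector[0])
        else:
            # Plain string: keep only the first match.
            matches = select(selector)
            out[field_name] = matches[0] if matches else None  # assumption
    return out


# Hypothetical pre-computed matches standing in for a parsed page.
FAKE_MATCHES = {
    "h1::text": ["Blue Widget"],
    "span.price::text": ["9.99", "12.50"],
    "img::attr(src)": ["/a.png", "/b.png"],
}

record = extract_sketch(
    {
        "title": "h1::text",
        "price": "span.price::text",
        "images": ["img::attr(src)"],
    },
    lambda selector: FAKE_MATCHES.get(selector, []),
)
```

Here `price` keeps only the first of its two matches, while `images` keeps both.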
links(selector='a')
Return the (absolute, de-duplicated) hrefs found in the page.
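What "absolute, de-duplicated" means in practice can be sketched with the standard library. The helper name `absolute_unique` is hypothetical, and keeping first-seen order is an assumption; the reference only says the hrefs are absolute and de-duplicated.

```python
from urllib.parse import urljoin


def absolute_unique(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve each href against base_url and drop repeats, keeping order."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


links = absolute_unique(
    "https://example.com/shop/index.html",
    ["item/1", "/about", "item/1", "https://example.com/about"],
)
```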
crawley.scraping.Element
A thin, convenient wrapper around an lxml html element.
attrs
property
A dict of the element's attributes.
html
property
The element serialized back to html.
text
property
The normalized recursive text of the element.
attr(name, default=None)
Return the name attribute (or default).
css(selector)
Return the descendants matching the CSS selector.
css_first(selector)
Return the first descendant matching selector (or None).
xpath(query)
Return the result of an XPath query.
String results (e.g. "//a/@href") are returned as-is; element
results are wrapped in :class:Element.
Crawlers
crawley.crawlers.base.BaseCrawler
User crawlers must inherit from this class.
Override the relevant methods and define start_urls, scrapers and
max_depth to control the crawl.
allowed_urls = []
class-attribute
instance-attribute
A list of url patterns allowed for crawl.
autothrottle = False
class-attribute
instance-attribute
Adapt the per-host delay to the observed response latency.
autothrottle_max_delay = 60.0
class-attribute
instance-attribute
Maximum per-host delay AutoThrottle may set (seconds).
autothrottle_start_delay = 1.0
class-attribute
instance-attribute
Initial per-host delay used by AutoThrottle (seconds).
autothrottle_target_concurrency = 1.0
class-attribute
instance-attribute
Target number of concurrent requests per host for AutoThrottle.
black_list = []
class-attribute
instance-attribute
A list of blocked url patterns that are never crawled.
crawl_delay = config.CRAWL_DELAY
class-attribute
instance-attribute
Minimum seconds between two requests to the same host.
extractor = extractor_class()
class-attribute
instance-attribute
The extractor class. Defaults to :class:XPathExtractor.
headers = {}
class-attribute
instance-attribute
The default request headers.
http_cache = False
class-attribute
instance-attribute
Cache responses on disk (development helper).
http_cache_dir = '.crawley_cache'
class-attribute
instance-attribute
Directory used by the on-disk HTTP cache.
login = None
class-attribute
instance-attribute
Login data: a tuple of (url, login_dict).
max_concurrency_level = None
class-attribute
instance-attribute
The maximum number of concurrent requests.
max_concurrency_per_host = config.MAX_CONCURRENCY_PER_HOST
class-attribute
instance-attribute
Maximum simultaneous requests per host (None disables the limit).
max_depth = -1
class-attribute
instance-attribute
The maximum recursion depth of the crawl (-1 means unlimited).
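How a depth limit bounds a crawl can be shown on an in-memory link graph. This is a sketch under assumptions: `LINKS` and `crawl_order` are hypothetical, the start url is counted as depth 0, and the traversal is breadth-first; none of these details are confirmed by the reference beyond "-1 means unlimited".

```python
from collections import deque

# Hypothetical in-memory link graph standing in for real pages.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/b": [],
    "/a/1": ["/a/1/x"],
    "/a/1/x": [],
}


def crawl_order(start: str, max_depth: int = -1) -> list[str]:
    """Breadth-first visit; the start url is at depth 0 (an assumption)."""
    visited = [start]
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if max_depth != -1 and depth >= max_depth:
            continue  # do not follow links beyond the limit
        for nxt in LINKS.get(url, []):
            if nxt not in visited:
                visited.append(nxt)
                queue.append((nxt, depth + 1))
    return visited
```

With this convention, `crawl_order("/", 1)` visits the start page and its direct links only, while `crawl_order("/")` walks the whole graph.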
max_retries = config.REQUEST_MAX_RETRIES
class-attribute
instance-attribute
How many times a failed request is retried.
playwright_options = {}
class-attribute
instance-attribute
Extra options for the Playwright manager (browser_type, headless, ...).
post_urls = []
class-attribute
instance-attribute
POST data for urls: a list of (url, data_dict) tuples.
render_js = False
class-attribute
instance-attribute
Render pages with a headless browser (Playwright). Needs crawley[js].
requests_delay = config.REQUEST_DELAY
class-attribute
instance-attribute
The average delay time between requests.
requests_deviation = config.REQUEST_DEVIATION
class-attribute
instance-attribute
The allowed deviation from the average delay between requests.
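One plausible reading of an average delay plus a deviation is a randomized wait around the average. The sketch below assumes uniform jitter and seconds as the unit; `next_delay` is a hypothetical helper, and the distribution crawley actually uses is not stated in this reference.

```python
import random


def next_delay(
    requests_delay: float, requests_deviation: float, rng: random.Random
) -> float:
    """Pick a delay around the average; uniform jitter is an assumption."""
    jitter = rng.uniform(-requests_deviation, requests_deviation)
    return max(0.0, requests_delay + jitter)  # never wait a negative time


rng = random.Random(0)  # seeded so the sketch is reproducible
delays = [next_delay(2.0, 0.5, rng) for _ in range(5)]
```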
respect_robots = config.RESPECT_ROBOTS
class-attribute
instance-attribute
When True the crawler honours each site's robots.txt.
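The kind of check that honouring robots.txt implies can be sketched with the standard library's `urllib.robotparser`. This is an illustration only: the `ROBOTS_TXT` content, the `allowed` helper and the `crawley-sketch` user agent are hypothetical, and the reference does not say which parser crawley uses internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory (no network access).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url: str, user_agent: str = "crawley-sketch") -> bool:
    """Return True when the parsed rules permit fetching url."""
    return parser.can_fetch(user_agent, url)
```

A url for which `allowed` returns False is the situation in which a hook such as on_robots_blocked(url) would be expected to fire.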
retry_backoff = config.RETRY_BACKOFF_FACTOR
class-attribute
instance-attribute
Base seconds for the exponential retry backoff.
retry_statuses = config.RETRY_STATUSES
class-attribute
instance-attribute
HTTP status codes that trigger a retry.
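The three retry settings (max_retries, retry_backoff, retry_statuses) fit together roughly as follows. This is a sketch under assumptions: the doubling formula `base * 2**attempt` is a common form of exponential backoff rather than a confirmed crawley formula, and both helper names and the status codes in the usage lines are hypothetical.

```python
def backoff_delays(base: float, max_retries: int) -> list[float]:
    """Delay before each retry: base * 2**attempt (assumed formula)."""
    return [base * (2 ** attempt) for attempt in range(max_retries)]


def should_retry(
    status: int, attempt: int, max_retries: int, retry_statuses: set[int]
) -> bool:
    """Retry only listed statuses, and only while attempts remain."""
    return attempt < max_retries and status in retry_statuses


# Illustrative values only; the real defaults live in crawley's config.
schedule = backoff_delays(0.5, 4)
retry_503 = should_retry(503, attempt=0, max_retries=3, retry_statuses={500, 503})
```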
scrapers = []
class-attribute
instance-attribute
A list of scraper classes.
search_all_urls = True
class-attribute
instance-attribute
Search for urls in the page when scrapers don't return any.
search_hidden_urls = False
class-attribute
instance-attribute
Search for urls hidden anywhere in the html (not only <a> tags).
start_urls = []
class-attribute
instance-attribute
A list containing the start urls for the crawler.
unique_urls = True
class-attribute
instance-attribute
Skip urls that have already been visited during the crawl.
get_urls(response)
Return the urls found in the current html page.
on_finish()
Override to run code when the crawler finishes.
on_request_error(url, ex)
Override to customize the request error handler.
on_robots_blocked(url)
Override to react when robots.txt disallows crawling url.
on_start()
Override to run code when the crawler starts.
run()
Convenience synchronous entry point.
start()
async
Run the crawler (coroutine).
crawley.crawlers.fast.FastCrawler
crawley.crawlers.offline.OffLineCrawler
Scrapers
crawley.scrapers.base.BaseScraper
User scrapers must inherit from this class.
Implement :meth:scrape with the data extraction logic and define the
matching_urls that this scraper is able to process.
get_urls(response)
Return a list of urls found in the current html.
on_cannot_scrape(response)
Override to customize the handler for pages that cannot be scraped.
on_scrape_error(response, ex)
Override to customize the scrape error handler.
scrape(response)
Define the data you want to extract here.
try_scrape(response)
Try to parse the html page, returning the urls it discovers.
crawley.scrapers.smart.SmartScraper
Bases: BaseScraper
Scrape only pages whose html structure is similar to a template page.
The structure of template_url is fetched once (synchronously) at
construction time and every candidate page is compared against it.
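One way to compare the html structure of two pages is to reduce each page to its sequence of opening tags and measure how similar the sequences are. The sketch below does that with the standard library only. It is an assumption about the general technique, not SmartScraper's actual comparison, and every name in it is hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the sequence of opening tags, ignoring text and attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html: str) -> list[str]:
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structure_similarity(html_a: str, html_b: str) -> float:
    """Ratio in [0, 1]; 1.0 means identical tag sequences."""
    matcher = SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b))
    return matcher.ratio()


TEMPLATE = "<html><body><h1>A</h1><p>one</p><p>two</p></body></html>"
SAME_SHAPE = "<html><body><h1>B</h1><p>x</p><p>y</p></body></html>"
OTHER = "<html><body><ul><li>1</li><li>2</li></ul></body></html>"
```

A scraper built on this idea would process a page only when its similarity to the template exceeds some threshold; the threshold value would be a tuning choice.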
Spiders
crawley.spider.Spider
Bases: BaseCrawler
A callback-driven spider.
Define :meth:parse (the default callback) and yield :class:Request
objects (or :func:response.follow(...)) to crawl further, and dicts /
:class:Item objects to emit data.
middlewares = []
class-attribute
instance-attribute
Downloader middleware classes wrapping every download.
pipelines = []
class-attribute
instance-attribute
Item pipeline classes applied, in order, to every emitted item.
on_item(item)
Called for every item that survives the pipelines.
parse(response)
Default callback. Override to extract data and follow links.
start_requests()
Yield the initial requests (defaults to start_urls).
crawley.spider.Request
crawley.spider.FormRequest
crawley.spider.Item
Bases: dict
A scraped item. Just a dict you may subclass for clarity.
crawley.spiders.CrawlSpider
crawley.spiders.SitemapSpider
crawley.spiders.LinkExtractor
Extract links from a response, filtered by allow/deny rules.
extract_links(response)
Return the (absolute, filtered) links found in response.
crawley.spiders.Rule
Bind a :class:LinkExtractor to a callback and/or a follow behaviour.
Pipelines & middlewares
crawley.pipelines.ItemPipeline
crawley.pipelines.DropItem
Bases: Exception
Raise from process_item to discard the current item.
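The drop-on-exception flow can be sketched with local stand-ins. Everything here is hypothetical: `DropItemSketch` mimics the role of DropItem, the two pipeline classes are invented examples, and a `process_item(item)` signature that takes only the item is an assumption, since the reference does not show the method's parameters.

```python
class DropItemSketch(Exception):
    """Local stand-in for the exception that discards an item."""


class RequirePrice:
    def process_item(self, item: dict) -> dict:
        if "price" not in item:
            raise DropItemSketch("missing price")
        return item


class PriceToFloat:
    def process_item(self, item: dict) -> dict:
        return {**item, "price": float(item["price"])}


def run_pipelines(items, pipelines):
    """Apply each pipeline in order; yield only the items that survive."""
    for item in items:
        try:
            for pipeline in pipelines:
                item = pipeline.process_item(item)
        except DropItemSketch:
            continue  # discarded: later pipelines never see it
        yield item


kept = list(
    run_pipelines(
        [{"name": "a", "price": "1.5"}, {"name": "b"}],
        [RequirePrice(), PriceToFloat()],
    )
)
```

Only the first item survives; the second is dropped before `PriceToFloat` runs, so it never reaches a callback such as on_item.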
crawley.middlewares.DownloaderMiddleware
Base class for downloader middlewares (all methods optional).
Stats, cache & throttling
crawley.stats.StatsCollector
crawley.http.cache.HttpCache
A tiny JSON-on-disk response cache.
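A JSON-on-disk cache of this kind can be very small. The sketch below is an assumption about the general shape, not HttpCache's real layout: `JsonCacheSketch`, its `get`/`set` methods, the hashed file names and the stored fields are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Optional


class JsonCacheSketch:
    def __init__(self, directory: str) -> None:
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, url: str) -> Path:
        # Hash the url so any url maps to a safe, fixed-length file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def get(self, url: str) -> Optional[dict]:
        path = self._path(url)
        if not path.exists():
            return None  # cache miss
        return json.loads(path.read_text(encoding="utf-8"))

    def set(self, url: str, status_code: int, body: str) -> None:
        payload = {"url": url, "status_code": status_code, "body": body}
        self._path(url).write_text(json.dumps(payload), encoding="utf-8")


cache = JsonCacheSketch(tempfile.mkdtemp())
cache.set("https://example.com/", 200, "<h1>hi</h1>")
hit = cache.get("https://example.com/")
miss = cache.get("https://example.com/other")
```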
crawley.http.autothrottle.AutoThrottle
Compute a per-host delay from observed latencies.
adjust(host, latency)
Update and return the new delay for host given a latency.
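A latency-driven delay update might look like the sketch below. The averaging rule (move halfway toward `latency / target_concurrency`, then clamp to the maximum) is modelled on the approach commonly described for Scrapy's AutoThrottle extension and is an assumption here; crawley's actual formula is not given in this reference. `ThrottleSketch` is a hypothetical class whose defaults mirror the autothrottle_* attributes listed under BaseCrawler.

```python
class ThrottleSketch:
    def __init__(self, start_delay=1.0, max_delay=60.0, target_concurrency=1.0):
        self.start_delay = start_delay
        self.max_delay = max_delay
        self.target_concurrency = target_concurrency
        self.delays: dict[str, float] = {}  # current delay per host

    def adjust(self, host: str, latency: float) -> float:
        current = self.delays.get(host, self.start_delay)
        # Delay that would keep `target_concurrency` requests in flight.
        target = latency / self.target_concurrency
        # Move halfway toward the target to smooth out noisy samples.
        new_delay = (current + target) / 2.0
        # Clamp so the delay never goes negative or above the maximum.
        new_delay = min(max(new_delay, 0.0), self.max_delay)
        self.delays[host] = new_delay
        return new_delay


throttle = ThrottleSketch()
first = throttle.adjust("example.com", 3.0)     # (1.0 + 3.0) / 2
second = throttle.adjust("example.com", 200.0)  # clamped to max_delay
```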
Extractors
crawley.extractors.XPathExtractor
Bases: BaseExtractor
Extractor exposing an :mod:lxml tree, ready to be queried via XPath.
crawley.extractors.CSSExtractor
Bases: BaseExtractor
Extractor exposing an :mod:lxml tree queryable with CSS selectors.
The returned tree supports tree.cssselect("div.foo a") thanks to the
cssselect package.
crawley.extractors.PyQueryExtractor
Bases: BaseExtractor
Extractor using PyQuery (a jQuery-like library for Python).
crawley.extractors.RawExtractor
Bases: BaseExtractor
Returns the raw html data untouched.
HTTP
crawley.http.response.Response
Encapsulates an HTTP response.
Attributes:
| Name | Type | Description |
|---|---|---|
| raw_html | | the decoded body of the response. |
| html | | the body parsed by the crawler's extractor (lxml tree, PyQuery object, ...). |
| url | | the final url of the request (after redirects). |
| status_code | | the HTTP status code. |
| headers | | the response headers. |
doc
property
Return the body as a high level :class:~crawley.scraping.Document.
Lets you scrape a crawler response with the modern, ergonomic API::
    def scrape(self, response):
        title = response.doc.css_first("h1").text
meta
property
The meta dict carried by the originating request (if any).
css(selector)
Shortcut for response.doc.css(selector).
css_first(selector)
Shortcut for response.doc.css_first(selector).
extract(rules)
Shortcut for response.doc.extract(rules).
follow(url, callback=None, **kwargs)
Build a :class:~crawley.spider.Request to a (possibly relative) url.
Relative urls are resolved against this response's url, and meta is
inherited from the current request unless overridden.
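The two behaviours described for follow, relative-url resolution and meta inheritance, can be sketched with stand-in classes. `RequestSketch` and `ResponseSketch` are hypothetical and omit the callback argument. Treating an explicit `meta` as a full replacement (rather than a merge with the inherited dict) is an assumption, since "unless overridden" does not say which applies.

```python
from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import urljoin


@dataclass
class RequestSketch:
    url: str
    meta: dict = field(default_factory=dict)


@dataclass
class ResponseSketch:
    url: str
    meta: dict = field(default_factory=dict)

    def follow(self, url: str, meta: Optional[dict] = None) -> RequestSketch:
        # Resolve relative urls against this response's own url.
        absolute = urljoin(self.url, url)
        # Copy so the new request cannot mutate this response's meta.
        carried = dict(self.meta) if meta is None else dict(meta)
        return RequestSketch(absolute, carried)


response = ResponseSketch("https://example.com/list/page1.html", {"depth": 1})
inherited = response.follow("../item/7")
overridden = response.follow("next", meta={"depth": 2})
```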