API Reference¶

Auto-generated from the docstrings.

Scraping¶

`crawley.scraping.fetch(url, **kwargs)` ¶

Fetch `url` synchronously and return a parsed `Document`.

`crawley.scraping.afetch(url, client=None, **kwargs)` `async` ¶

Fetch `url` asynchronously and return a parsed `Document`.

`crawley.scraping.afetch_all(urls, **kwargs)` `async` ¶

Fetch many urls concurrently and return a list of `Document` objects.

`crawley.scraping.scrape(url, rules, **kwargs)` ¶

Fetch `url` and immediately run `Document.extract` on the result with `rules`.

`crawley.scraping.parse(html, url=None)` ¶

Parse an html string into a `Document`.

`crawley.scraping.Document` ¶

Bases: Element

A parsed html document ready to be scraped.

`title` `property` ¶

The text of the page's `<title>` element, if any.

`extract(rules)` ¶

Extract a dict of fields from the document.

`rules` maps field names to selectors. A plain string selector yields a single value (the first match); a one-element list selector yields a list of every match:

doc.extract({
    "title": "h1::text",
    "price": "span.price::text",
    "images": ["img::attr(src)"],
})
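
The string-versus-list convention can be illustrated without crawley itself. The sketch below is a stand-in, not the library's implementation: `apply_rules` and the `select` callable are hypothetical names, `select` is assumed to return every match for a selector in document order, and returning `None` when a string selector matches nothing is also an assumption.

```python
from typing import Callable, Dict, List, Optional, Union

Rule = Union[str, List[str]]


def apply_rules(
    rules: Dict[str, Rule],
    select: Callable[[str], List[str]],
) -> Dict[str, Union[Optional[str], List[str]]]:
    """Apply the string-vs-list convention to a mapping of rules.

    `select` is a hypothetical callable returning all matches for a selector.
    """
    result = {}
    for field, rule in rules.items():
        if isinstance(rule, list):
            # One-element list: keep every match.
            (selector,) = rule
            result[field] = select(selector)
        else:
            # Plain string: keep only the first match (assumed None if empty).
            matches = select(rule)
            result[field] = matches[0] if matches else None
    return result


# Canned matches standing in for a parsed page.
FAKE_MATCHES = {
    "h1::text": ["Blue Widget"],
    "img::attr(src)": ["/a.png", "/b.png"],
}

fields = apply_rules(
    {"title": "h1::text", "price": "span.price::text", "images": ["img::attr(src)"]},
    lambda selector: FAKE_MATCHES.get(selector, []),
)
print(fields)
```

Here `title` resolves to a single string, `images` to a list, and `price` to `None` because the canned page has no matching element.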

`links(selector='a')` ¶

Return the (absolute, de-duplicated) hrefs found in the page.
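
What "absolute, de-duplicated" means can be sketched with the standard library alone. This is an illustration under assumptions, not crawley's code: `absolute_unique_links` is a hypothetical helper, it only looks at `<a>` tags, and it assumes first-seen order is preserved.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class _HrefCollector(HTMLParser):
    """Collect the raw href value of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def absolute_unique_links(html: str, base_url: str) -> List[str]:
    """Resolve every href against base_url and drop repeated urls."""
    collector = _HrefCollector()
    collector.feed(html)
    absolute = [urljoin(base_url, href) for href in collector.hrefs]
    # dict.fromkeys drops duplicates while preserving first-seen order.
    return list(dict.fromkeys(absolute))


page = (
    '<a href="/about">About</a> '
    '<a href="about">Again</a> '
    '<a href="https://example.org/">External</a>'
)
links = absolute_unique_links(page, "https://example.com/")
print(links)
```

The two relative links resolve to the same absolute url, so only one copy survives.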

`crawley.scraping.Element` ¶

A thin, convenient wrapper around an lxml html element.

`attrs` `property` ¶

A dict of the element's attributes.

`html` `property` ¶

The element serialized back to html.

`text` `property` ¶

The normalized recursive text of the element.

`attr(name, default=None)` ¶

Return the name attribute (or default).

`css(selector)` ¶

Return the descendants matching the CSS selector.

`css_first(selector)` ¶

Return the first descendant matching selector (or None).

`xpath(query)` ¶

Return the result of an XPath query.

String results (e.g. `"//a/@href"`) are returned as-is; element results are wrapped in `Element`.

Crawlers¶

`crawley.crawlers.base.BaseCrawler` ¶

User crawlers must inherit from this class.

Override the relevant methods and define `start_urls`, `scrapers` and `max_depth` to control the crawl.

`allowed_urls = []` `class-attribute` `instance-attribute` ¶

A list of url patterns that the crawler is allowed to visit.

`autothrottle = False` `class-attribute` `instance-attribute` ¶

Adapt the per-host delay to the observed response latency.

`autothrottle_max_delay = 60.0` `class-attribute` `instance-attribute` ¶

Maximum per-host delay AutoThrottle may set (seconds).

`autothrottle_start_delay = 1.0` `class-attribute` `instance-attribute` ¶

Initial per-host delay used by AutoThrottle (seconds).

`autothrottle_target_concurrency = 1.0` `class-attribute` `instance-attribute` ¶

Target number of concurrent requests per host for AutoThrottle.

`black_list = []` `class-attribute` `instance-attribute` ¶

A list of blocked url patterns that are never crawled.

`crawl_delay = config.CRAWL_DELAY` `class-attribute` `instance-attribute` ¶

Minimum seconds between two requests to the same host.

`extractor = extractor_class()` `class-attribute` `instance-attribute` ¶

The extractor used to parse responses. Defaults to `XPathExtractor`.

`headers = {}` `class-attribute` `instance-attribute` ¶

The default request headers.

`http_cache = False` `class-attribute` `instance-attribute` ¶

Cache responses on disk (development helper).

`http_cache_dir = '.crawley_cache'` `class-attribute` `instance-attribute` ¶

Directory used by the on-disk HTTP cache.

`login = None` `class-attribute` `instance-attribute` ¶

Login data: a tuple of (url, login_dict).

`max_concurrency_level = None` `class-attribute` `instance-attribute` ¶

The maximum number of concurrent requests.

`max_concurrency_per_host = config.MAX_CONCURRENCY_PER_HOST` `class-attribute` `instance-attribute` ¶

Maximum simultaneous requests per host (None disables the limit).

`max_depth = -1` `class-attribute` `instance-attribute` ¶

The maximum recursion depth of the crawl (-1 means unlimited).

`max_retries = config.REQUEST_MAX_RETRIES` `class-attribute` `instance-attribute` ¶

How many times a failed request is retried.

`playwright_options = {}` `class-attribute` `instance-attribute` ¶

Extra options for the Playwright manager (browser_type, headless, ...).

`post_urls = []` `class-attribute` `instance-attribute` ¶

POST data for urls: a list of (url, data_dict) tuples.

`render_js = False` `class-attribute` `instance-attribute` ¶

Render pages with a headless browser (Playwright). Requires the `crawley[js]` extra.

`requests_delay = config.REQUEST_DELAY` `class-attribute` `instance-attribute` ¶

The average delay time between requests.

`requests_deviation = config.REQUEST_DEVIATION` `class-attribute` `instance-attribute` ¶

The allowed deviation from `requests_delay`.

`respect_robots = config.RESPECT_ROBOTS` `class-attribute` `instance-attribute` ¶

When `True`, the crawler honours each site's robots.txt.

`retry_backoff = config.RETRY_BACKOFF_FACTOR` `class-attribute` `instance-attribute` ¶

Base seconds for the exponential retry backoff.

`retry_statuses = config.RETRY_STATUSES` `class-attribute` `instance-attribute` ¶

HTTP status codes that trigger a retry.

`scrapers = []` `class-attribute` `instance-attribute` ¶

A list of scraper classes.

`search_all_urls = True` `class-attribute` `instance-attribute` ¶

Search for urls in the page when scrapers don't return any.

`search_hidden_urls = False` `class-attribute` `instance-attribute` ¶

Search for urls hidden anywhere in the html (not only in `<a>` tags).

`start_urls = []` `class-attribute` `instance-attribute` ¶

A list containing the start urls for the crawler.

`unique_urls = True` `class-attribute` `instance-attribute` ¶

Skip urls that have already been visited during the crawl.

`get_urls(response)` ¶

Return the urls found in the current html page.

`on_finish()` ¶

Override to run code when the crawler finishes.

`on_request_error(url, ex)` ¶

Override to customize the request error handler.

`on_robots_blocked(url)` ¶

Override to react when robots.txt disallows crawling url.

`on_start()` ¶

Override to run code when the crawler starts.

`run()` ¶

Convenience synchronous entry point.

`start()` `async` ¶

Run the crawler (coroutine).

`crawley.crawlers.fast.FastCrawler` ¶

Bases: BaseCrawler

Like `BaseCrawler` but issues requests without delays.

`crawley.crawlers.offline.OffLineCrawler` ¶

Bases: BaseCrawler

A crawler that fixes relative asset urls in the fetched html.

Scrapers¶

`crawley.scrapers.base.BaseScraper` ¶

User scrapers must inherit from this class.

Implement `scrape` with the data extraction logic and define the `matching_urls` that this scraper is able to process.

`get_urls(response)` ¶

Return a list of urls found in the current html.

`on_cannot_scrape(response)` ¶

Override to customize what happens when this scraper cannot process a response.

`on_scrape_error(response, ex)` ¶

Override to customize the scrape error handler.

`scrape(response)` ¶

Define the data you want to extract here.

`try_scrape(response)` ¶

Try to parse the html page, returning the urls it discovers.

`crawley.scrapers.smart.SmartScraper` ¶

Bases: BaseScraper

Scrape only pages whose html structure is similar to a template page.

The structure of `template_url` is fetched once (synchronously) at construction time, and every candidate page is compared against it.

Spiders¶

`crawley.spider.Spider` ¶

Bases: BaseCrawler

A callback-driven spider.

Define `parse` (the default callback). Yield `Request` objects (or the result of `response.follow(...)`) to crawl further, and dicts or `Item` objects to emit data.

`middlewares = []` `class-attribute` `instance-attribute` ¶

Downloader middleware classes wrapping every download.

`pipelines = []` `class-attribute` `instance-attribute` ¶

Item pipeline classes applied, in order, to every emitted item.

`on_item(item)` ¶

Called for every item that survives the pipelines.

`parse(response)` ¶

Default callback. Override to extract data and follow links.

`start_requests()` ¶

Yield the initial requests (defaults to one request per entry in `start_urls`).

`crawley.spider.Request` ¶

A scheduled HTTP request with a callback to process its response.

`fingerprint()` ¶

A stable fingerprint (method + url + body) used for de-duplication.
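
A fingerprint over method, url and body can be sketched with `hashlib`. The details here are assumptions rather than crawley's actual scheme: the function name `request_fingerprint`, the choice of SHA-1, upper-casing the method, and the absence of any url canonicalization are all illustrative.

```python
import hashlib


def request_fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Return a hex digest that is identical for identical requests."""
    digest = hashlib.sha1()
    for part in (method.upper().encode(), url.encode(), body):
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


seen = set()
duplicates = 0
for method, url, body in [
    ("GET", "https://example.com/", b""),
    ("get", "https://example.com/", b""),        # same request, different case
    ("POST", "https://example.com/", b"q=1"),
    ("POST", "https://example.com/", b"q=2"),    # different body
]:
    fp = request_fingerprint(method, url, body)
    if fp in seen:
        duplicates += 1
    seen.add(fp)

print(len(seen), duplicates)
```

A scheduler can keep such fingerprints in a set and skip any request whose fingerprint was already seen.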

`replace(**kwargs)` ¶

Return a copy of this request with some attributes replaced.

`crawley.spider.FormRequest` ¶

Bases: Request

A `Request` that submits form data (POST by default).

`from_response(response, formdata=None, formid=None, formname=None, formxpath=None, callback=None, **kwargs)` `classmethod` ¶

Build a request from a `<form>` in `response`, pre-filling its inputs.

`crawley.spider.Item` ¶

Bases: dict

A scraped item. Just a dict you may subclass for clarity.

`crawley.spiders.CrawlSpider` ¶

Bases: Spider

A spider that follows links according to a list of `Rule` objects.

`crawley.spiders.SitemapSpider` ¶

Bases: Spider

Seed the crawl from sitemap.xml files (incl. sitemap indexes).

`crawley.spiders.LinkExtractor` ¶

Extract links from a response, filtered by allow/deny rules.

`extract_links(response)` ¶

Return the (absolute, filtered) links found in response.

`crawley.spiders.Rule` ¶

Bind a `LinkExtractor` to a callback and/or a follow behaviour.

Pipelines & middlewares¶

`crawley.pipelines.ItemPipeline` ¶

Base class for item pipelines (all methods are optional).

`close_spider(spider)` ¶

Called once when the spider finishes.

`open_spider(spider)` ¶

Called once when the spider starts.

`process_item(item, spider)` ¶

Return the (possibly transformed) item, or raise `DropItem`.

`crawley.pipelines.DropItem` ¶

Bases: Exception

Raise from `process_item` to discard the current item.
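
The interplay between `process_item` and `DropItem` can be sketched as a simple chain. Everything below is a local stand-in so the example runs without crawley: the `DropItem` class mirrors the documented exception, while `RequirePrice`, `NormalizePrice` and `run_pipelines` are hypothetical names, and returning `None` for a dropped item is an assumption.

```python
class DropItem(Exception):
    """Local stand-in for crawley.pipelines.DropItem."""


class RequirePrice:
    """Discard items that carry no price."""

    def process_item(self, item, spider):
        if not item.get("price"):
            raise DropItem("missing price")
        return item


class NormalizePrice:
    """Convert a price such as '$9.50' into a float."""

    def process_item(self, item, spider):
        return {**item, "price": float(item["price"].lstrip("$"))}


def run_pipelines(item, pipelines, spider=None):
    """Pass item through each pipeline in order; return None if it is dropped."""
    for pipeline in pipelines:
        try:
            item = pipeline.process_item(item, spider)
        except DropItem:
            return None
    return item


chain = [RequirePrice(), NormalizePrice()]
kept = run_pipelines({"title": "Widget", "price": "$9.50"}, chain)
dropped = run_pipelines({"title": "No price"}, chain)
print(kept, dropped)
```

Because pipelines run in order, putting the validating pipeline first keeps `NormalizePrice` from ever seeing an item without a price.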

`crawley.middlewares.DownloaderMiddleware` ¶

Base class for downloader middlewares (all methods optional).

Stats, cache & throttling¶

`crawley.stats.StatsCollector` ¶

Collect counters and values during a crawl.

`close()` ¶

Record the total elapsed time.

`open()` ¶

Reset the stats and start the clock.

`crawley.http.cache.HttpCache` ¶

A tiny JSON-on-disk response cache.

`crawley.http.autothrottle.AutoThrottle` ¶

Compute a per-host delay from observed latencies.

`adjust(host, latency)` ¶

Update and return the new delay for host given a latency.
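
One plausible way to derive a delay from latency is the averaging scheme popularized by Scrapy's AutoThrottle extension: aim for `latency / target_concurrency` and move halfway there on each response. Whether crawley does exactly this is not stated in the docstrings, so treat the class below (`SimpleAutoThrottle`, including its `min_delay` parameter) as a hypothetical sketch; only the default values mirror the `autothrottle_*` attributes documented on `BaseCrawler`.

```python
class SimpleAutoThrottle:
    """Hypothetical per-host delay tracker based on observed latency."""

    def __init__(self, start_delay=1.0, max_delay=60.0,
                 target_concurrency=1.0, min_delay=0.0):
        self.start_delay = start_delay
        self.max_delay = max_delay
        self.target_concurrency = target_concurrency
        self.min_delay = min_delay
        self._delays = {}  # host -> current delay in seconds

    def adjust(self, host, latency):
        """Update and return the delay for host given one latency sample."""
        current = self._delays.get(host, self.start_delay)
        # Delay that would keep roughly target_concurrency requests in flight.
        target = latency / self.target_concurrency
        # Move halfway towards the target to smooth out noisy samples.
        new_delay = (current + target) / 2.0
        # Clamp into [min_delay, max_delay].
        new_delay = min(max(new_delay, self.min_delay), self.max_delay)
        self._delays[host] = new_delay
        return new_delay


throttle = SimpleAutoThrottle()
first = throttle.adjust("a.example", 3.0)      # (1.0 + 3.0) / 2
second = throttle.adjust("a.example", 3.0)     # (2.0 + 3.0) / 2
capped = throttle.adjust("b.example", 1000.0)  # clamped to max_delay
print(first, second, capped)
```

With a steady latency the delay converges towards the target, and a single very slow response cannot push the delay past `max_delay`.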

Extractors¶

`crawley.extractors.XPathExtractor` ¶

Bases: BaseExtractor

Extractor exposing an `lxml` tree, ready to be queried via XPath.

`crawley.extractors.CSSExtractor` ¶

Bases: BaseExtractor

Extractor exposing an `lxml` tree queryable with CSS selectors.

The returned tree supports `tree.cssselect("div.foo a")` thanks to the `cssselect` package.

`crawley.extractors.PyQueryExtractor` ¶

Bases: BaseExtractor

Extractor using PyQuery (a jQuery-like library for Python).

`crawley.extractors.RawExtractor` ¶

Bases: BaseExtractor

Returns the raw html data untouched.

HTTP¶

`crawley.http.response.Response` ¶

Encapsulates an HTTP response.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `raw_html` | `str` | The decoded body of the response. |
| `html` | | The body parsed by the crawler's extractor (lxml tree, PyQuery object, ...). `None` when no extractor was used. |
| `url` | | The final url of the request (after redirects). |
| `status_code` | | The HTTP status code. |
| `headers` | | The response headers. |

`doc` `property` ¶

Return the body as a high-level `crawley.scraping.Document`.

Lets you scrape a crawler response with the modern, ergonomic API:

def scrape(self, response):
    title = response.doc.css_first("h1").text

`meta` `property` ¶

The `meta` dict carried by the originating request (if any).

`css(selector)` ¶

Shortcut for `response.doc.css(selector)`.

`css_first(selector)` ¶

Shortcut for `response.doc.css_first(selector)`.

`extract(rules)` ¶

Shortcut for `response.doc.extract(rules)`.

`follow(url, callback=None, **kwargs)` ¶

Build a `crawley.spider.Request` to a (possibly relative) url.

Relative urls are resolved against this response's url, and `meta` is inherited from the current request unless overridden.
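
The two behaviours described here, url resolution and `meta` inheritance, can be sketched with `urllib.parse.urljoin` and a dict merge. The helper `build_follow_request` is hypothetical and returns a plain dict instead of a `Request`; merging `meta` key by key (rather than replacing it wholesale) is an assumption about what "unless overridden" means.

```python
from urllib.parse import urljoin


def build_follow_request(response_url, response_meta, url, meta=None):
    """Describe a follow-up request as a plain dict (stand-in for Request)."""
    return {
        # Relative urls are resolved against the url of the current response.
        "url": urljoin(response_url, url),
        # Inherited meta first, then explicit overrides on top (assumed merge).
        "meta": {**response_meta, **(meta or {})},
    }


request = build_follow_request(
    "https://example.com/catalog/page1.html",
    {"depth": 1, "category": "books"},
    "../item/42",
    meta={"depth": 2},
)
print(request)
```

The `category` key is carried over untouched while `depth` takes the overriding value.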

`crawley.http.retry.RetryPolicy` ¶

Configurable retry/backoff strategy.

`backoff_time(attempt, response=None)` ¶

Seconds to wait before retry number `attempt` (0-based).

`should_retry(attempt, response=None, exception=None)` ¶

Return `True` if a further attempt should be made.
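
A retry policy with exponential backoff can be sketched as follows. This is not the library's `RetryPolicy`: the class name `SimpleRetryPolicy`, the `max_backoff` cap, the default numbers, and the use of a bare `status_code` instead of a response object are all assumptions made to keep the example self-contained.

```python
class SimpleRetryPolicy:
    """Hypothetical retry/backoff policy with a 0-based attempt counter."""

    def __init__(self, max_retries=3, backoff=0.5,
                 statuses=(429, 500, 502, 503, 504), max_backoff=30.0):
        self.max_retries = max_retries
        self.backoff = backoff
        self.statuses = frozenset(statuses)
        self.max_backoff = max_backoff

    def should_retry(self, attempt, status_code=None, exception=None):
        """Return True if another attempt should follow attempt number `attempt`."""
        if attempt >= self.max_retries:
            return False  # retry budget exhausted
        if exception is not None:
            return True   # network-level failure: always worth another try
        return status_code in self.statuses

    def backoff_time(self, attempt):
        """Exponential backoff: base * 2**attempt, capped at max_backoff."""
        return min(self.backoff * (2 ** attempt), self.max_backoff)


policy = SimpleRetryPolicy()
schedule = [policy.backoff_time(attempt) for attempt in range(4)]
print(schedule)
```

Doubling the wait on each attempt gives a struggling server progressively more room, while the cap keeps a long retry chain from stalling the crawl.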

`crawley.http.throttle.HostRateLimiter` ¶

Throttle requests on a per-host basis.

`semaphore(host)` ¶

Return the per-host semaphore (or `None` when uncapped).

`set_delay(host, delay)` ¶

Override the minimum delay for a single host (e.g. from robots).

`throttle(host)` `async` ¶

Wait so that requests to `host` respect the configured delay.
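
Per-host throttling can be sketched by remembering, for each host, the earliest time the next request may start. The class below is a hypothetical stand-in (`SimpleHostRateLimiter`), not crawley's `HostRateLimiter`; the injectable `clock` and `sleep` parameters are an assumption added so the demo runs instantly with a frozen clock.

```python
import asyncio
import time


class SimpleHostRateLimiter:
    """Hypothetical limiter that spaces out requests to the same host."""

    def __init__(self, default_delay=1.0, clock=time.monotonic, sleep=asyncio.sleep):
        self.default_delay = default_delay
        self._clock = clock
        self._sleep = sleep
        self._delays = {}        # host -> per-host delay override
        self._next_allowed = {}  # host -> earliest start time of next request

    def set_delay(self, host, delay):
        """Override the minimum delay for a single host."""
        self._delays[host] = delay

    async def throttle(self, host):
        """Sleep until host may be contacted again; return the seconds waited."""
        delay = self._delays.get(host, self.default_delay)
        now = self._clock()
        ready_at = max(now, self._next_allowed.get(host, now))
        # Reserve the following slot before sleeping so concurrent callers queue up.
        self._next_allowed[host] = ready_at + delay
        wait = ready_at - now
        if wait > 0:
            await self._sleep(wait)
        return wait


async def demo():
    slept = []

    async def fake_sleep(seconds):
        slept.append(seconds)  # record instead of actually sleeping

    limiter = SimpleHostRateLimiter(default_delay=2.0, clock=lambda: 100.0, sleep=fake_sleep)
    limiter.set_delay("slow.example", 10.0)
    waits = [
        await limiter.throttle("a.example"),
        await limiter.throttle("a.example"),
        await limiter.throttle("slow.example"),
        await limiter.throttle("slow.example"),
    ]
    return waits, slept


waits, slept = asyncio.run(demo())
print(waits, slept)
```

The first request to each host goes out immediately; the second waits for that host's delay, and hosts never block one another.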

`crawley.http.robots.RobotsPolicy` ¶

Cache and evaluate robots.txt rules per host.

`allowed(url, client)` `async` ¶

Return `True` if `url` may be fetched according to robots.txt.

`crawl_delay(url)` ¶

Return the `Crawl-delay` for the host of `url`, if any.
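
The kind of decision `allowed` and `crawl_delay` make can be reproduced with the standard library's `urllib.robotparser`. Whether crawley uses this module internally is not stated, so this is only an illustration; the robots.txt content and the `my-crawler` user agent are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, parsed from memory instead of being downloaded.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("my-crawler", "https://example.com/index.html")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data")
delay = parser.crawl_delay("my-crawler")
print(allowed, blocked, delay)
```

A crawler would typically cache one parser per host and feed the `Crawl-delay` value into its per-host rate limiter.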
