API Reference

Auto-generated from the docstrings.

Scraping

crawley.scraping.fetch(url, **kwargs)

Fetch url and return a parsed :class:Document (synchronous).

crawley.scraping.afetch(url, client=None, **kwargs) async

Fetch url asynchronously and return a :class:Document.

crawley.scraping.afetch_all(urls, **kwargs) async

Fetch many urls concurrently, returning a list of :class:Document.
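
The concurrent fan-out described above can be pictured with a small asyncio sketch. It is not crawley's implementation: fake_fetch and fetch_all_sketch are hypothetical names, and the stand-in coroutine performs no network I/O.

```python
import asyncio


async def fake_fetch(url):
    """Stand-in for a real download (hypothetical; no network access)."""
    await asyncio.sleep(0)  # yield control, as a real request would
    return {"url": url, "body": f"<html>{url}</html>"}


async def fetch_all_sketch(urls):
    """Run every fetch concurrently; results keep the order of urls."""
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


docs = asyncio.run(
    fetch_all_sketch(["https://example.com/a", "https://example.com/b"])
)
```

asyncio.gather returns results in the order of its arguments, so callers can pair each result with the url that produced it.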

crawley.scraping.scrape(url, rules, **kwargs)

Fetch url and immediately run :meth:Document.extract on it with rules.

crawley.scraping.parse(html, url=None)

Parse an html string into a :class:Document.

crawley.scraping.Document

Bases: Element

A parsed html document ready to be scraped.

title property

The page <title> text, if any.

extract(rules)

Extract a dict of fields from the document.

rules maps field names to selectors. A plain string selector yields a single value (the first match); a one-element list selector yields the list of every match::

doc.extract({
    "title": "h1::text",
    "price": "span.price::text",
    "images": ["img::attr(src)"],
})
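
To make the string-versus-list rule concrete, here is a minimal sketch of the dispatch logic. It assumes a hypothetical select(selector) callable that returns every match as a list. Returning None when a string selector has no match is also an assumption, not documented behaviour.

```python
def apply_rules(rules, select):
    """Apply extract-style rules using select(selector) -> list of matches.

    select is a hypothetical stand-in for the document's selector engine.
    """
    result = {}
    for field, selector in rules.items():
        if isinstance(selector, list):
            # One-element list: keep every match.
            result[field] = select(selector[0])
        else:
            # Plain string: keep only the first match (None is an assumption).
            matches = select(selector)
            result[field] = matches[0] if matches else None
    return result


# Canned matches standing in for a parsed page.
matches_by_selector = {
    "h1::text": ["Blue mug"],
    "span.price::text": ["9.99", "12.50"],
    "img::attr(src)": ["/a.png", "/b.png"],
}
fields = apply_rules(
    {
        "title": "h1::text",
        "price": "span.price::text",
        "images": ["img::attr(src)"],
    },
    lambda sel: matches_by_selector.get(sel, []),
)
```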

Return the (absolute, de-duplicated) hrefs found in the page.

crawley.scraping.Element

A thin, convenient wrapper around an lxml html element.

attrs property

A dict of the element's attributes.

html property

The element serialized back to html.

text property

The normalized recursive text of the element.

attr(name, default=None)

Return the name attribute (or default).

css(selector)

Return the descendants matching the CSS selector.

css_first(selector)

Return the first descendant matching selector (or None).

xpath(query)

Return the result of an XPath query.

String results (e.g. "//a/@href") are returned as-is; element results are wrapped in :class:Element.
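
The wrapping rule can be sketched as follows. ElementSketch and wrap_results are hypothetical names used only for illustration; the real wrapper holds an lxml element rather than an arbitrary object.

```python
class ElementSketch:
    """Hypothetical stand-in for the Element wrapper."""

    def __init__(self, node):
        self.node = node


def wrap_results(results):
    """Leave string results untouched and wrap everything else."""
    return [r if isinstance(r, str) else ElementSketch(r) for r in results]


raw = ["/about", object()]  # one string result, one element-like result
wrapped = wrap_results(raw)
```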

Crawlers

crawley.crawlers.base.BaseCrawler

User crawlers must inherit from this class.

Override the relevant methods and define start_urls, scrapers and the max_depth to control the crawl.

allowed_urls = [] class-attribute instance-attribute

A list of url patterns allowed for crawl.

autothrottle = False class-attribute instance-attribute

Adapt the per-host delay to the observed response latency.

autothrottle_max_delay = 60.0 class-attribute instance-attribute

Maximum per-host delay AutoThrottle may set (seconds).

autothrottle_start_delay = 1.0 class-attribute instance-attribute

Initial per-host delay used by AutoThrottle (seconds).

autothrottle_target_concurrency = 1.0 class-attribute instance-attribute

Target number of concurrent requests per host for AutoThrottle.

black_list = [] class-attribute instance-attribute

A list of blocked url patterns that are never crawled.

crawl_delay = config.CRAWL_DELAY class-attribute instance-attribute

Minimum seconds between two requests to the same host.

extractor = extractor_class() class-attribute instance-attribute

The extractor class. Defaults to :class:XPathExtractor.

headers = {} class-attribute instance-attribute

The default request headers.

http_cache = False class-attribute instance-attribute

Cache responses on disk (development helper).

http_cache_dir = '.crawley_cache' class-attribute instance-attribute

Directory used by the on-disk HTTP cache.

login = None class-attribute instance-attribute

Login data: a tuple of (url, login_dict).

max_concurrency_level = None class-attribute instance-attribute

The maximum number of concurrent requests.

max_concurrency_per_host = config.MAX_CONCURRENCY_PER_HOST class-attribute instance-attribute

Maximum simultaneous requests per host (None disables the limit).

max_depth = -1 class-attribute instance-attribute

The maximum recursion depth of the crawl (-1 means unlimited).

max_retries = config.REQUEST_MAX_RETRIES class-attribute instance-attribute

How many times a failed request is retried.

playwright_options = {} class-attribute instance-attribute

Extra options for the Playwright manager (browser_type, headless, ...).

post_urls = [] class-attribute instance-attribute

POST data for urls: a list of (url, data_dict) tuples.

render_js = False class-attribute instance-attribute

Render pages with a headless browser (Playwright). Needs crawley[js].

requests_delay = config.REQUEST_DELAY class-attribute instance-attribute

The average delay time between requests.

requests_deviation = config.REQUEST_DEVIATION class-attribute instance-attribute

The allowed deviation from the average delay between requests (see requests_delay).

respect_robots = config.RESPECT_ROBOTS class-attribute instance-attribute

When True the crawler honours each site's robots.txt.

retry_backoff = config.RETRY_BACKOFF_FACTOR class-attribute instance-attribute

Base seconds for the exponential retry backoff.

retry_statuses = config.RETRY_STATUSES class-attribute instance-attribute

HTTP status codes that trigger a retry.

scrapers = [] class-attribute instance-attribute

A list of scraper classes.

search_all_urls = True class-attribute instance-attribute

Search for urls in the page when scrapers don't return any.

search_hidden_urls = False class-attribute instance-attribute

Search for urls hidden anywhere in the html (not only <a> tags).

start_urls = [] class-attribute instance-attribute

A list containing the start urls for the crawler.

unique_urls = True class-attribute instance-attribute

Skip urls that have already been visited during the crawl.

get_urls(response)

Return the urls found in the current html page.

on_finish()

Override to run code when the crawler finishes.

on_request_error(url, ex)

Override to customize the request error handler.

on_robots_blocked(url)

Override to react when robots.txt disallows crawling url.

on_start()

Override to run code when the crawler starts.

run()

Convenience synchronous entry point.

start() async

Run the crawler (coroutine).

crawley.crawlers.fast.FastCrawler

Bases: BaseCrawler

Like :class:BaseCrawler but issues requests without delays.

crawley.crawlers.offline.OffLineCrawler

Bases: BaseCrawler

A crawler that fixes relative asset urls in the fetched html.

Scrapers

crawley.scrapers.base.BaseScraper

User scrapers must inherit from this class.

Implement :meth:scrape with the data extraction logic and define the matching_urls that this scraper is able to process.

get_urls(response)

Return a list of urls found in the current html.

on_cannot_scrape(response)

Customize the can't-scrape handler.

on_scrape_error(response, ex)

Customize the scrape error handler.

scrape(response)

Define the data you want to extract here.

try_scrape(response)

Try to parse the html page, returning the urls it discovers.

crawley.scrapers.smart.SmartScraper

Bases: BaseScraper

Scrape only pages whose html structure is similar to a template page.

The structure of template_url is fetched once (synchronously) at construction time and every candidate page is compared against it.
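
One possible way to compare page structures is to reduce each page to its sequence of opening tags and measure how similar the sequences are. The sketch below uses only the standard library. The metric (difflib's ratio) and any acceptance threshold are assumptions; crawley's actual comparison may differ.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the sequence of opening tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    collector = TagCollector()
    collector.feed(html)
    return collector.tags


def structure_similarity(template_html, candidate_html):
    """Return a ratio in [0, 1]; 1.0 means identical tag sequences."""
    return SequenceMatcher(
        None, tag_sequence(template_html), tag_sequence(candidate_html)
    ).ratio()


template = "<html><body><h1>A</h1><p>x</p></body></html>"
same_shape = "<html><body><h1>B</h1><p>y</p></body></html>"
other_shape = "<html><body><ul><li>1</li><li>2</li></ul></body></html>"
```

A scraper built on this idea would process a page only when the ratio exceeds a chosen threshold.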

Spiders

crawley.spider.Spider

Bases: BaseCrawler

A callback-driven spider.

Define :meth:parse (the default callback) and yield :class:Request objects (or :func:response.follow(...)) to crawl further, and dicts / :class:Item objects to emit data.

middlewares = [] class-attribute instance-attribute

Downloader middleware classes wrapping every download.

pipelines = [] class-attribute instance-attribute

Item pipeline classes applied, in order, to every emitted item.

on_item(item)

Called for every item that survives the pipelines.

parse(response)

Default callback. Override to extract data and follow links.

start_requests()

Yield the initial requests (defaults to start_urls).

crawley.spider.Request

A scheduled HTTP request with a callback to process its response.

fingerprint()

A stable fingerprint (method + url + body) used for de-duplication.
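
A fingerprint of this kind can be sketched with hashlib. The hash function, the field separator and the upper-casing of the method are assumptions made for illustration; only the inputs (method, url, body) come from the description above.

```python
import hashlib


def fingerprint_sketch(method, url, body=b""):
    """Hash method + url + body into a stable hex digest."""
    h = hashlib.sha1()
    h.update(method.upper().encode("utf-8"))
    h.update(b"\n")
    h.update(url.encode("utf-8"))
    h.update(b"\n")
    h.update(body)
    return h.hexdigest()


seen = set()


def is_new(method, url, body=b""):
    """Return True the first time a request is seen, False afterwards."""
    fp = fingerprint_sketch(method, url, body)
    if fp in seen:
        return False
    seen.add(fp)
    return True
```

Because the body is part of the hash, two POST requests to the same url with different payloads are treated as distinct.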

replace(**kwargs)

Return a copy of this request with some attributes replaced.

crawley.spider.FormRequest

Bases: Request

A :class:Request that submits form data (POST by default).

from_response(response, formdata=None, formid=None, formname=None, formxpath=None, callback=None, **kwargs) classmethod

Build a request from a <form> in response, pre-filling inputs.
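
Pre-filling means the form's existing input values (for example a hidden CSRF token) are collected first and then overridden by the caller's formdata. The sketch below shows that merge with the standard library's HTML parser. It ignores form selection (formid, formname, formxpath) and non-input controls, and is not crawley's implementation.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect name/value pairs from <input> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attrs = dict(attrs)
            if attrs.get("name"):
                self.fields[attrs["name"]] = attrs.get("value") or ""


def prefilled_form_data(html, formdata=None):
    """Return the page's input values, overridden by formdata."""
    collector = InputCollector()
    collector.feed(html)
    data = dict(collector.fields)
    data.update(formdata or {})
    return data


form_html = (
    '<form action="/login" method="post">'
    '<input type="hidden" name="csrf" value="abc123">'
    '<input type="text" name="username">'
    "</form>"
)
data = prefilled_form_data(form_html, {"username": "alice"})
```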

crawley.spider.Item

Bases: dict

A scraped item. Just a dict you may subclass for clarity.

crawley.spiders.CrawlSpider

Bases: Spider

A spider that follows links according to a list of :class:Rule.

crawley.spiders.SitemapSpider

Bases: Spider

Seed the crawl from sitemap.xml files (incl. sitemap indexes).
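
For reference, sitemap files come in two shapes under the sitemaps.org schema: a urlset listing pages and a sitemapindex listing further sitemaps. The sketch below tells them apart with xml.etree.ElementTree. parse_sitemap is a hypothetical helper, not part of crawley.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return (kind, locs) where kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_text)
    if root.tag.endswith("sitemapindex"):
        kind, path = "index", "sm:sitemap/sm:loc"
    else:
        kind, path = "urlset", "sm:url/sm:loc"
    locs = [el.text.strip() for el in root.findall(path, NS) if el.text]
    return kind, locs


index_xml = (
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap>"
    "</sitemapindex>"
)
urlset_xml = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/a</loc></url>"
    "<url><loc>https://example.com/b</loc></url>"
    "</urlset>"
)
```

A spider seeded this way would fetch each loc of an index recursively and schedule each loc of a urlset as a page request.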

crawley.spiders.LinkExtractor

Extract links from a response, filtered by allow/deny rules.

Return the (absolute, filtered) links found in response.
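
Allow/deny filtering can be sketched with regular expressions and urljoin. The semantics here are assumptions modeled on common practice: an empty allow list accepts everything, deny patterns always win, and duplicates are dropped. filter_links is a hypothetical helper.

```python
import re
from urllib.parse import urljoin


def filter_links(base_url, hrefs, allow=(), deny=()):
    """Resolve hrefs against base_url, then apply allow/deny regexes."""
    allow_res = [re.compile(p) for p in allow]
    deny_res = [re.compile(p) for p in deny]
    links = []
    for href in hrefs:
        url = urljoin(base_url, href)
        if allow_res and not any(r.search(url) for r in allow_res):
            continue  # an allow list is set and nothing matched
        if any(r.search(url) for r in deny_res):
            continue  # deny patterns always win
        if url not in links:
            links.append(url)
    return links


links = filter_links(
    "https://example.com/shop/",
    ["item/1", "/login", "item/1", "item/2?print=1"],
    allow=[r"/item/"],
    deny=[r"print="],
)
```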

crawley.spiders.Rule

Bind a :class:LinkExtractor to a callback and/or a follow behaviour.

Pipelines & middlewares

crawley.pipelines.ItemPipeline

Base class for item pipelines (all methods are optional).

close_spider(spider)

Called once when the spider finishes.

open_spider(spider)

Called once when the spider starts.

process_item(item, spider)

Return the (possibly transformed) item, or raise :class:DropItem.

crawley.pipelines.DropItem

Bases: Exception

Raise from process_item to discard the current item.
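
The pipeline contract (return the item to keep it, raise to discard it) can be illustrated with a self-contained sketch. DropItemSketch, RequirePrice, PriceToFloat and run_pipelines are hypothetical stand-ins; a real pipeline would subclass ItemPipeline and raise DropItem.

```python
class DropItemSketch(Exception):
    """Local stand-in for DropItem."""


class RequirePrice:
    def process_item(self, item, spider):
        if "price" not in item:
            raise DropItemSketch("missing price")
        return item


class PriceToFloat:
    def process_item(self, item, spider):
        # Return a transformed copy rather than mutating the input.
        return {**item, "price": float(item["price"])}


def run_pipelines(items, pipelines, spider=None):
    """Pass each item through every pipeline in order, skipping dropped ones."""
    kept = []
    for item in items:
        try:
            for pipeline in pipelines:
                item = pipeline.process_item(item, spider)
        except DropItemSketch:
            continue
        kept.append(item)
    return kept


kept = run_pipelines(
    [{"title": "mug", "price": "9.99"}, {"title": "no price"}],
    [RequirePrice(), PriceToFloat()],
)
```

Order matters: the validation pipeline runs first, so the conversion pipeline never sees an item without a price.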

crawley.middlewares.DownloaderMiddleware

Base class for downloader middlewares (all methods optional).

Stats, cache & throttling

crawley.stats.StatsCollector

Collect counters and values during a crawl.

close()

Record the total elapsed time.

open()

Reset the stats and start the clock.

crawley.http.cache.HttpCache

A tiny JSON-on-disk response cache.
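
A JSON-on-disk cache can be as small as one file per url. The sketch below is an assumption about the general shape of such a cache, not HttpCache's actual file layout or API. JsonDiskCache and its get/set methods are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class JsonDiskCache:
    """Hypothetical sketch: one JSON file per url, named by a hash of the url."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, url):
        name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".json"
        return self.directory / name

    def get(self, url):
        path = self._path(url)
        if not path.exists():
            return None  # cache miss
        return json.loads(path.read_text(encoding="utf-8"))

    def set(self, url, status_code, headers, body):
        record = {
            "url": url,
            "status_code": status_code,
            "headers": dict(headers),
            "body": body,
        }
        self._path(url).write_text(json.dumps(record), encoding="utf-8")


cache = JsonDiskCache(tempfile.mkdtemp())
cache.set("https://example.com/", 200, {"Content-Type": "text/html"}, "<h1>hi</h1>")
hit = cache.get("https://example.com/")
miss = cache.get("https://example.com/other")
```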

crawley.http.autothrottle.AutoThrottle

Compute a per-host delay from observed latencies.

adjust(host, latency)

Update and return the new delay for host given a latency.
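
A latency-driven delay update can be sketched as below. The rule (move halfway toward latency / target_concurrency, then clamp) is loosely modeled on Scrapy's AutoThrottle extension and is an assumption about crawley's behaviour. The min_delay parameter is hypothetical; the other defaults mirror the autothrottle_* attributes listed for BaseCrawler.

```python
class AutoThrottleSketch:
    """Hypothetical per-host delay calculator (not crawley's implementation)."""

    def __init__(self, start_delay=1.0, max_delay=60.0,
                 target_concurrency=1.0, min_delay=0.0):
        self.start_delay = start_delay
        self.max_delay = max_delay
        self.target_concurrency = target_concurrency
        self.min_delay = min_delay  # hypothetical lower bound
        self.delays = {}

    def adjust(self, host, latency):
        """Update and return the delay for host given an observed latency."""
        current = self.delays.get(host, self.start_delay)
        # Delay that would keep roughly target_concurrency requests in flight.
        target = latency / self.target_concurrency
        # Move halfway toward the target to smooth out latency spikes.
        new_delay = (current + target) / 2.0
        new_delay = min(max(self.min_delay, new_delay), self.max_delay)
        self.delays[host] = new_delay
        return new_delay


throttle = AutoThrottleSketch()
first = throttle.adjust("example.com", 3.0)     # (1.0 + 3.0) / 2
second = throttle.adjust("example.com", 200.0)  # clamped to max_delay
```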

Extractors

crawley.extractors.XPathExtractor

Bases: BaseExtractor

Extractor exposing an :mod:lxml tree, ready to be queried via XPath.

crawley.extractors.CSSExtractor

Bases: BaseExtractor

Extractor exposing an :mod:lxml tree queryable with CSS selectors.

The returned tree supports tree.cssselect("div.foo a") thanks to the cssselect package.

crawley.extractors.PyQueryExtractor

Bases: BaseExtractor

Extractor using PyQuery (a jQuery-like library for Python).

crawley.extractors.RawExtractor

Bases: BaseExtractor

Returns the raw html data untouched.

HTTP

crawley.http.response.Response

Encapsulates an HTTP response.

Attributes:

raw_html: the decoded body of the response (str).

html: the body parsed by the crawler's extractor (lxml tree, PyQuery object, ...). None when no extractor was used.

url: the final url of the request (after redirects).

status_code: the HTTP status code.

headers: the response headers.

doc property

Return the body as a high level :class:~crawley.scraping.Document.

Lets you scrape a crawler response with the modern, ergonomic API::

def scrape(self, response):
    heading = response.doc.css_first("h1")  # None when the page has no <h1>
    title = heading.text if heading is not None else None

meta property

The meta dict carried by the originating request (if any).

css(selector)

Shortcut for response.doc.css(selector).

css_first(selector)

Shortcut for response.doc.css_first(selector).

extract(rules)

Shortcut for response.doc.extract(rules).

follow(url, callback=None, **kwargs)

Build a :class:~crawley.spider.Request to a (possibly relative) url.

Relative urls are resolved against this response's url, and meta is inherited from the current request unless overridden.
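
The two behaviours described above (url resolution and meta inheritance) can be sketched with urljoin and a dict merge. follow_sketch is a hypothetical helper. Merging meta key by key, rather than replacing it wholesale, is an assumption.

```python
from urllib.parse import urljoin


def follow_sketch(response_url, response_meta, url, meta=None):
    """Return (absolute_url, merged_meta) for a follow-up request."""
    absolute = urljoin(response_url, url)
    merged = dict(response_meta)  # inherit from the current request
    merged.update(meta or {})     # explicit values override inherited ones
    return absolute, merged


target, meta = follow_sketch(
    "https://example.com/catalog/page1.html",
    {"category": "mugs", "depth": 1},
    "../item/42",
    meta={"depth": 2},
)
```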

crawley.http.retry.RetryPolicy

Configurable retry/backoff strategy.

backoff_time(attempt, response=None)

Seconds to wait before retry number attempt (0-based).

should_retry(attempt, response=None, exception=None)

Return True if a further attempt should be made.
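
An exponential backoff policy with a 0-based attempt counter can be sketched as follows. The default values and the doubling formula are assumptions; the real defaults come from config.REQUEST_MAX_RETRIES, config.RETRY_BACKOFF_FACTOR and config.RETRY_STATUSES. The sketch takes a bare status_code instead of a response object to stay self-contained.

```python
class RetryPolicySketch:
    """Hypothetical retry policy (not crawley's implementation)."""

    def __init__(self, max_retries=3, backoff=0.5,
                 statuses=(500, 502, 503, 504)):
        self.max_retries = max_retries
        self.backoff = backoff
        self.statuses = frozenset(statuses)

    def backoff_time(self, attempt):
        """Seconds to wait before retry number attempt (0-based)."""
        return self.backoff * (2 ** attempt)

    def should_retry(self, attempt, status_code=None, exception=None):
        """Return True if a further attempt should be made."""
        if attempt >= self.max_retries:
            return False  # retry budget exhausted
        if exception is not None:
            return True   # network-level failure
        return status_code in self.statuses


policy = RetryPolicySketch()
waits = [policy.backoff_time(a) for a in range(3)]
```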

crawley.http.throttle.HostRateLimiter

Throttle requests on a per-host basis.

semaphore(host)

Return the per-host semaphore (or None when uncapped).

set_delay(host, delay)

Override the minimum delay for a single host (e.g. from robots).

throttle(host) async

Wait so that requests to host respect the configured delay.
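
Per-host throttling can be sketched with one asyncio lock and one "next allowed" timestamp per host. This is an illustrative design, not crawley's implementation. HostRateLimiterSketch and its internals are hypothetical, and the per-host semaphore is omitted.

```python
import asyncio
import time


class HostRateLimiterSketch:
    """Hypothetical limiter enforcing a minimum delay between requests per host."""

    def __init__(self, delay=0.0):
        self.default_delay = delay
        self.delays = {}        # host -> overridden delay
        self.next_allowed = {}  # host -> earliest monotonic time for next request
        self.locks = {}         # host -> asyncio.Lock

    def set_delay(self, host, delay):
        self.delays[host] = delay

    async def throttle(self, host):
        lock = self.locks.setdefault(host, asyncio.Lock())
        async with lock:  # serialize waiters for the same host
            wait = self.next_allowed.get(host, 0.0) - time.monotonic()
            if wait > 0:
                await asyncio.sleep(wait)
            delay = self.delays.get(host, self.default_delay)
            self.next_allowed[host] = time.monotonic() + delay


async def demo():
    limiter = HostRateLimiterSketch()
    limiter.set_delay("example.com", 0.05)
    start = time.monotonic()
    # Three concurrent callers are spaced out by the 0.05 s delay.
    await asyncio.gather(*(limiter.throttle("example.com") for _ in range(3)))
    return time.monotonic() - start


elapsed = asyncio.run(demo())
```

Requests to different hosts use different locks and timestamps, so a slow host does not hold back the others.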

crawley.http.robots.RobotsPolicy

Cache and evaluate robots.txt rules per host.

allowed(url, client) async

Return True if url may be fetched according to robots.txt.

crawl_delay(url)

Return the Crawl-delay for url's host, if any.
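
The same two questions (may this url be fetched, and what is the crawl delay) can be answered with the standard library's urllib.robotparser, which makes for a convenient mental model. Whether RobotsPolicy uses this module internally is an assumption. The robots.txt content and the "my-crawler" user agent are made up for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse in-memory text, no network access

allowed = parser.can_fetch("my-crawler", "https://example.com/public/page")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/page")
delay = parser.crawl_delay("my-crawler")
```

A crawler would typically cache one parser per host and feed the reported delay into its per-host rate limiter.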