crawley

A Pythonic crawling / scraping framework for Python 3, built on asyncio + httpx.

crawley lets you crawl websites and extract structured data with a tiny, declarative API. This is the modernized release: the legacy eventlet / elixir stack has been replaced by asyncio, httpx and SQLAlchemy 2.x.

Two ways to use it

=== "As a scraping library"

The fastest way to pull data out of a page:

```python
from crawley.scraping import fetch

doc = fetch("https://quotes.toscrape.com/")
for quote in doc.css("div.quote"):
    print(quote.css_first("small.author").text,
          "->",
          quote.css_first("span.text").text)
```

See [Scraping API](scraping.md).

=== "As a crawling framework"

Define crawlers and scrapers declaratively and let crawley walk the site:

```python
from crawley.crawlers import BaseCrawler
from crawley.scrapers import BaseScraper

class QuotesScraper(BaseScraper):
    matching_urls = ["%quotes.toscrape.com%"]

    def scrape(self, response):
        for q in response.css("div.quote"):
            print(q.css_first("span.text").text)

class QuotesCrawler(BaseCrawler):
    start_urls = ["https://quotes.toscrape.com/"]
    scrapers = [QuotesScraper]
    max_depth = 2

QuotesCrawler().run()
```

See [Crawlers & Scrapers](crawler.md).

Features

  • High-speed asynchronous crawler powered by asyncio + httpx.
  • Extract data with XPath, CSS selectors or PyQuery.
  • A modern, ergonomic scraping API (fetch, Document, extract).
  • Politeness: robots.txt, per-host rate limiting and retries with exponential backoff.
  • Persistence: SQL (SQLAlchemy 2.x), MongoDB, CouchDB and JSON / XML / CSV exports.
  • A small DSL and CLI (crawley startproject, crawley run, ...).

Requirements

  • Python 3.9+

Runnable examples

The examples/ folder has small, self-contained scripts you can run directly:

  • 01_scraping_quickstart.py — the scraping API.
  • 02_crawler.py — a crawler that follows pagination.
  • 03_polite_crawler.py — robots.txt, rate limiting and retries.
  • 04_persistence_json.py — persisting to a JSON document.
  • 05_concurrent_fetch.py — concurrent fetching with afetch_all.
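
For readers who want a feel for what concurrent fetching involves before opening the examples, the following is a rough sketch of the underlying asyncio pattern: a semaphore caps how many requests are in flight while `asyncio.gather` collects the results in input order. It does not use or describe `afetch_all`, whose signature is not shown here; `fetch_all_bounded` and `fake_fetch` are hypothetical names, and `fake_fetch` simulates a request without touching the network.

```python
import asyncio


async def fetch_all_bounded(urls, fetch_one, limit=5):
    """Run fetch_one(url) for every URL, at most `limit` at a time.

    Results come back in the same order as `urls`.
    Hypothetical helper for illustration; not part of crawley.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url):
        # Only `limit` coroutines can hold the semaphore at once.
        async with semaphore:
            return await fetch_one(url)

    return await asyncio.gather(*(guarded(url) for url in urls))


async def fake_fetch(url):
    # Stand-in for a real HTTP request (no network access).
    await asyncio.sleep(0.01)
    return f"body of {url}"


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(3)]
    pages = asyncio.run(fetch_all_bounded(urls, fake_fetch, limit=2))
    print(pages)
```

Capping concurrency this way is one simple means of staying polite towards a host; per-host rate limiting would need a separate limiter for each hostname.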

Continue with the installation guide.