crawley¶

A pythonic crawling / scraping framework for Python 3, built on asyncio + httpx.

crawley lets you crawl websites and extract structured data with a tiny, declarative API. This is the modernized release: the legacy eventlet / elixir stack has been replaced by asyncio, httpx and SQLAlchemy 2.x.

Two ways to use it¶

=== "As a scraping library"

The fastest way to pull data out of a page:

```python
from crawley.scraping import fetch

doc = fetch("https://quotes.toscrape.com/")
for quote in doc.css("div.quote"):
    print(quote.css_first("small.author").text,
          "->",
          quote.css_first("span.text").text)
```

See [Scraping API](scraping.md).

=== "As a crawling framework"

Define crawlers and scrapers declaratively and let crawley walk the site:

```python
from crawley.crawlers import BaseCrawler
from crawley.scrapers import BaseScraper

class QuotesScraper(BaseScraper):
    matching_urls = ["%quotes.toscrape.com%"]

    def scrape(self, response):
        for q in response.css("div.quote"):
            print(q.css_first("span.text").text)

class QuotesCrawler(BaseCrawler):
    start_urls = ["https://quotes.toscrape.com/"]
    scrapers = [QuotesScraper]
    max_depth = 2

QuotesCrawler().run()
```

See [Crawlers & Scrapers](crawler.md).

Features¶

High speed asynchronous crawler powered by asyncio + httpx.
Extract data with XPath, CSS selectors or PyQuery.
A modern, ergonomic scraping API (fetch, Document, extract).
Politeness: robots.txt, per-host rate limiting and retries with exponential backoff.
Persistence: SQL (SQLAlchemy 2.x), MongoDB, CouchDB and JSON / XML / CSV exports.
A small DSL and CLI (crawley startproject, crawley run, ...).

Requirements¶

Python 3.9+

Runnable examples¶

The examples/ folder has small, self-contained scripts you can run directly:

01_scraping_quickstart.py — the scraping API.
02_crawler.py — a crawler that follows pagination.
03_polite_crawler.py — robots.txt, rate limiting and retries.
04_persistence_json.py — persisting to a JSON document.
05_concurrent_fetch.py — concurrent fetching with afetch_all.

Continue with the installation guide.