Persistence
crawley can persist scraped data to relational databases, NoSQL stores or plain documents. Sessions are passed to the crawler and committed as the crawl progresses.
Relational (SQLAlchemy 2.x)
Requires the sql extra (pip install "crawley[sql]").
Define entities by subclassing Entity. Instantiating an entity stages it in
the shared session:
from crawley.persistance import Entity, UrlEntity, Field, Unicode

class Package(Entity):
    package = Field(Unicode(255))
    description = Field(Unicode(255))

class Urls(UrlEntity):  # has href / parent columns
    pass
Field and Unicode are thin shims over SQLAlchemy's Column and column
types. Set up the engine and create the tables with setup():
from crawley.persistance import session, setup
setup("sqlite:///packages.sqlite")
Package(package="crawley", description="modern crawler")
session.commit()
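The "instantiate to stage, commit to write" pattern described above can be sketched with the standard library alone. This is not crawley's implementation: the `Session` and `Entity` classes below are hypothetical stand-ins, written against `sqlite3` rather than SQLAlchemy, and only illustrate the idea that constructing an object queues a row and `commit()` flushes the queue.

```python
import sqlite3


class Session:
    """Hypothetical stand-in for a shared session: queues rows, writes on commit."""

    def __init__(self, conn):
        self.conn = conn
        self.pending = []

    def add(self, entity):
        self.pending.append(entity)

    def commit(self):
        # Write every staged entity as one INSERT, then clear the queue.
        for entity in self.pending:
            columns = ", ".join(entity.values)
            marks = ", ".join("?" for _ in entity.values)
            self.conn.execute(
                f"INSERT INTO {entity.table} ({columns}) VALUES ({marks})",
                tuple(entity.values.values()),
            )
        self.conn.commit()
        self.pending.clear()


session = Session(sqlite3.connect(":memory:"))


class Entity:
    """Hypothetical base class: constructing an instance stages it in `session`."""

    table = "entity"

    def __init__(self, **values):
        self.values = values
        session.add(self)


class Package(Entity):
    table = "package"


session.conn.execute("CREATE TABLE package (package TEXT, description TEXT)")
Package(package="crawley", description="modern crawler")  # staged, not yet written
staged = len(session.pending)
session.commit()  # flushes the staged row to the database
rows = session.conn.execute("SELECT package, description FROM package").fetchall()
```

Because nothing reaches the database until `commit()`, a crawl that stops before committing leaves the staged rows unwritten in this sketch.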
Supported engines (via crawley.persistance.relational.connectors):
SQLite, PostgreSQL, MySQL, Oracle.
Documents: JSON / XML / CSV
No extra dependencies. Subclass the document type; each instance becomes one
record in the output, and the matching session writes the file on commit():
from crawley.persistance.documents import JSONDocument, json_session

class Quote(JSONDocument):
    pass

Quote(text="...", author="...")
json_session.file_name = "quotes.json"
json_session.commit()
XMLDocument / xml_session and CSVDocument / csv_session work the same
way.
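To show how several document sessions can share one interface and differ only in the file format, here is a minimal sketch using the standard library. `DocumentSession`, `JSONSession` and `CSVSession` are hypothetical names invented for this example, not crawley classes; the only behaviour taken from the text above is that a session has a `file_name` and writes its records on `commit()`.

```python
import csv
import json
import os
import tempfile


class DocumentSession:
    """Hypothetical session: buffers records and writes them to file_name on commit."""

    file_name = None

    def __init__(self):
        self.records = []

    def commit(self):
        with open(self.file_name, "w", newline="", encoding="utf-8") as fh:
            self.write(fh)

    def write(self, fh):
        raise NotImplementedError


class JSONSession(DocumentSession):
    def write(self, fh):
        json.dump(self.records, fh, indent=2)


class CSVSession(DocumentSession):
    def write(self, fh):
        # Assumes at least one record, and that all records share the first one's keys.
        writer = csv.DictWriter(fh, fieldnames=list(self.records[0]))
        writer.writeheader()
        writer.writerows(self.records)


out_dir = tempfile.mkdtemp()
records = [{"text": "hello", "author": "anon"}]  # illustrative data only

json_out, csv_out = JSONSession(), CSVSession()
json_out.file_name = os.path.join(out_dir, "quotes.json")
csv_out.file_name = os.path.join(out_dir, "quotes.csv")
for doc_session in (json_out, csv_out):
    doc_session.records.extend(records)
    doc_session.commit()
```

Only `write()` differs between the two subclasses in this sketch, so switching output formats means swapping the session and leaving the rest of the code alone.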
NoSQL: MongoDB / CouchDB
MongoDB requires the mongo extra; CouchDB talks to the HTTP API directly via
httpx.
from crawley.persistance.nosql import MongoEntity, mongo_session

class Package(MongoEntity):
    pass

Package(name="crawley", stars=42)
# configured + committed by the crawler via settings (see CLI docs)
Using sessions in a crawler
Pass the sessions you want committed to the crawler. After each successful
scrape, crawley calls commit() on every session:
import asyncio
from crawley.crawlers import BaseCrawler
from crawley.scrapers import BaseScraper
from crawley.persistance.documents import JSONDocument, json_session
json_session.file_name = "out.json"

class Quote(JSONDocument):
    pass

class Scraper(BaseScraper):
    matching_urls = ["%"]

    def scrape(self, response):
        Quote(title=response.css_first("h1").text)

class Crawler(BaseCrawler):
    start_urls = ["https://quotes.toscrape.com/"]
    scrapers = [Scraper]

asyncio.run(Crawler(sessions=[json_session]).start())
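The commit-per-scrape behaviour can be pictured with a small, self-contained asyncio loop. This is a sketch under assumptions, not crawley's crawler: `RecordingSession`, `fetch`, `scrape` and `crawl` are hypothetical, no network request is made, and skipping the commit when a scrape raises is an assumption about what "successful" means here.

```python
import asyncio


class RecordingSession:
    """Hypothetical session that only counts how often commit() is called."""

    def __init__(self):
        self.commits = 0

    def commit(self):
        self.commits += 1


async def fetch(url):
    # Stand-in for a real HTTP request; no network access happens here.
    await asyncio.sleep(0)
    return f"<h1>{url}</h1>"


def scrape(html):
    # For illustration, a page whose content mentions "bad" counts as a failed scrape.
    if "bad" in html:
        raise ValueError("scrape failed")


async def crawl(urls, sessions):
    for url in urls:
        html = await fetch(url)
        try:
            scrape(html)
        except ValueError:
            continue  # assumption: a failed scrape triggers no commit
        for active_session in sessions:
            active_session.commit()  # every session is committed after a good scrape


first, second = RecordingSession(), RecordingSession()
urls = ["https://example.com/a", "https://example.com/bad", "https://example.com/b"]
asyncio.run(crawl(urls, [first, second]))
```

With two good pages and one failing page, each session in this sketch is committed twice, once per successful scrape.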
When using the CLI, the storages are wired up automatically from your
settings.py (DATABASE_*, JSON_DOCUMENT, MONGO_DB_HOST, ...).