CLI & Framework¶

Installing crawley provides a crawley command to scaffold and run projects.

Commands¶

Command	Description
`crawley startproject NAME`	Scaffold a new project.
`crawley run`	Sync the database and run the project's crawlers.
`crawley syncdb`	Create the database from `models.py`.
`crawley migratedb`	Drop and recreate the tables.
`crawley shell URL`	Fetch a url and open a Python shell to scrape it.
`crawley browser URL`	Open the visual scraping browser (needs `[gui]`).

startproject accepts a -t/--type: code (default), template (DSL) or database.

Project layout¶

crawley startproject myproject
cd myproject

myproject/
├── settings.py            # project configuration
└── myproject/
    ├── models.py          # your entities
    └── crawlers.py        # your crawlers + scrapers

models.py¶

from crawley.persistance import Entity, UrlEntity, Field, Unicode

class Package(Entity):
    package = Field(Unicode(255))
    description = Field(Unicode(255))

crawlers.py¶

from crawley.crawlers import BaseCrawler
from crawley.scrapers import BaseScraper
from crawley.extractors import XPathExtractor
from models import *

class PackageScraper(BaseScraper):
    matching_urls = ["%"]

    def scrape(self, response):
        for row in response.css("div.package"):
            Package(package=row.css_first("h2::text"))

class PackageCrawler(BaseCrawler):
    start_urls = ["https://pypi.org/"]
    scrapers = [PackageScraper]
    max_depth = 1
    extractor = XPathExtractor

settings.py¶

Relevant keys:

DATABASE_ENGINE = "sqlite"       # sqlite, postgres, mysql, oracle
DATABASE_NAME = "myproject"
JSON_DOCUMENT = ""               # set a filename to dump JSON
XML_DOCUMENT = ""
CSV_DOCUMENT = ""
# MONGO_DB_HOST / MONGO_DB_NAME, COUCH_DB_HOST / COUCH_DB_NAME
MAX_CONCURRENCY = 100
SHOW_DEBUG_INFO = True

Then run:

crawley run

The DSL (template projects)¶

For simple cases you can describe a scraper declaratively instead of writing Python. Create a template project and write a .crw template:

PAGE => http://example.com/
packages.name -> /html/body//h2
packages.author -> /html/body//span[@class="author"]

PAGE => url declares the template page (its structure is matched against candidate pages via SmartScraper).
table.column -> xpath maps an XPath result to an entity column.

crawley compiles this into entities and scrapers at runtime. The companion config.ini provides start_urls and max_depth:

[crawler]
start_urls = http://example.com/
max_depth = 1