Overview
Anatomy of a web crawler
As a developer, you only need to focus on three things when building a web crawler on Crawlspace:
- An initializer,
- A schema, and
- A response handler
These three pieces are exported as a JavaScript object and merged with platform code before getting deployed as a serverless function.
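As a rough sketch of that shape, the exported object might look like the following. The property names `init` and `handler` are assumptions made for illustration (only `schema()` is named on this page), as is the idea that Zod is passed in as `z`. The bodies are placeholders.

```javascript
// Sketch of a crawler's three pieces. `init` and `handler` are assumed
// property names; only `schema()` is named in this overview.
const crawler = {
  // Initializer: returns the first URLs for the run.
  init() {
    return ["https://example.com/"];
  },

  // Schema: returns a Zod object describing the crawler's SQLite table.
  // Assumption: the platform passes Zod in as `z`.
  schema({ z }) {
    return z.object({ title: z.string() });
  },

  // Response handler: runs after each page has been fetched.
  handler(context) {
    // Custom logic goes here.
  },
};

// In the entry file, this object would be the default export:
// export default crawler;
```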
Initializer
The initializer instructs the crawler where to start crawling. Each crawl run starts from an empty queue, so the initializer provides the first set of URLs for the run. It returns an array of URLs.
It can also return an array of request objects. This is useful if you need to change the default request method or add request headers.
The initializer can be asynchronous if you need to fetch URLs from an external source.
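The three return styles described above could be sketched as follows. The request object fields (`url`, `method`, `headers`) are assumptions, and `loadSeedUrls` is a hypothetical stand-in for whatever external source you use. Check the API reference for the exact shapes.

```javascript
// 1. Plain URLs.
function initWithUrls() {
  return ["https://example.com/", "https://example.com/blog"];
}

// 2. Request objects, for a non-default method or extra headers.
// The field names `url`, `method`, and `headers` are assumed.
function initWithRequests() {
  return [
    {
      url: "https://example.com/api/search",
      method: "POST",
      headers: { Accept: "application/json" },
    },
  ];
}

// Hypothetical stand-in for an external source. A real crawler might
// fetch a sitemap or query an API here instead.
async function loadSeedUrls() {
  return [
    "https://example.com/products/1",
    "https://example.com/products/2",
    "https://example.com/products/1",
  ];
}

// 3. Asynchronous initializer that awaits the external source.
async function initAsync() {
  const urls = await loadSeedUrls();
  // Drop duplicate URLs while keeping the original order.
  return [...new Set(urls)];
}
```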
Schema
The schema of a crawler on Crawlspace is responsible for setting column types and constraints for your crawler’s SQLite table.
Return a Zod object from the schema() function to define your schema.
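A minimal sketch of such a schema is shown here, assuming the platform passes Zod into `schema()` as `z` (if it does not, `z` would come from the `zod` package instead). The column names are made up for illustration.

```javascript
// Assumption: `schema` receives Zod as `z`. Column names are illustrative.
function schema({ z }) {
  return z.object({
    url: z.string(),
    title: z.string(),
    price: z.number(),
    inStock: z.boolean(),
  });
}
```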
See Types to learn the mapping between Zod types and their corresponding column types.
Every column is nullable. This helps with migrations between deploys.
Response handler
The response handler is the meat and potatoes of your crawler. Here you write custom logic to do whatever you’d like, such as scraping data from the page and inserting it into SQLite.
Remember: by the time the handler runs, the request has already been made. Use the handler to process the page’s DOM rather than to make new requests.
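A hedged sketch of a handler that scrapes a title and returns a row to insert is shown here. The context fields (`$` as a `querySelector`-style lookup, `response.url`) and the returned `{ insert: { row } }` shape are assumptions made for illustration, not the documented API. The row keys assume a hypothetical schema with `url` and `title` columns.

```javascript
// Assumptions: the handler receives `$` (a querySelector-style lookup on
// the fetched page) and `response`, and returns the row to insert.
function handler({ $, response }) {
  // The page is already fetched, so this only reads from its DOM.
  const heading = $("h1");
  const title = heading ? heading.textContent.trim() : null;

  return {
    insert: {
      // A missing heading becomes null, which nullable columns allow.
      row: { url: response.url, title },
    },
  };
}
```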
At the end of the day, it’s just JavaScript, which means that you can import any dependency installed by your favorite package manager to help you write your crawler.
Usage with Git
Use Git to version control your crawler. It’s recommended to put all of your crawlers into a single Git repo rather than create a Git repo for each crawler. In other words, you will have a better experience if your .git directory is a sibling to the crawlspace.toml file.
You can also split your crawler into multiple files if it grows unwieldy. Just make sure that the initializer, schema, and response handler are exported as the default object of the entry file.