As a developer, you only need to focus on three things when building a web crawler on Crawlspace:

  • An initializer,
  • A schema, and
  • A response handler

These three pieces are exported as a JavaScript object and merged with platform code before getting deployed as a serverless function.
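
Concretely, the entry file's default export looks roughly like this (a minimal skeleton based on the examples below; the bodies are elided):

export default {
  init() {
    // seed the crawl queue
    return ["https://example.com"];
  },
  schema({ z }) {
    // define columns for the crawler's SQLite table
    return z.object({ title: z.string() });
  },
  onResponse({ $, enqueue }) {
    // scrape the fetched page and optionally enqueue more URLs
    const title = $("head > title")?.innerText;
    return { insert: { row: { title } } };
  },
};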

Initializer

The initializer instructs the crawler where to start crawling. Each crawl run starts from an empty queue, so the initializer provides the first set of URLs for the run. It returns an array of URLs.

init() {
  return [
    "https://news.ycombinator.com",
    "https://news.ycombinator.com/newest"
  ];
}

It can also return an array of request objects. This is useful if you need to change the default request method or add request headers.

init({ env }) {
  const { SECRET_GITHUB_TOKEN } = env;
  return [
    {
      url: "https://api.github.com/...",
      method: "GET",
      headers: { Authorization: `Bearer ${SECRET_GITHUB_TOKEN}` }
    }
  ];
}

The initializer can also be asynchronous, which is useful if you want to fetch the starting URLs from an external source.

async init() {
  const response = await fetch('...');
  const json = await response.json<{ urls: string[] }>();
  return json.urls;
}

Schema

The schema defines the column types and constraints for your crawler’s SQLite table. Return a Zod object from the schema() function to define it. Here’s an example:

schema({ z }) {
  return z.object({
    job_title: z.string(),
    is_remote: z.boolean(),
    salary_range: z.array(z.number()),
  });
}

See Types to learn the mapping between Zod types and their corresponding column types.

Every column is nullable. This helps with migrations between deploys.
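
For example, if a later deploy adds a column, rows written by earlier deploys simply contain NULL in it (a hypothetical before/after sketch):

// deploy 1
schema({ z }) {
  return z.object({
    job_title: z.string(),
    is_remote: z.boolean(),
  });
}

// deploy 2: adds salary_range; rows inserted by deploy 1 keep NULL there
schema({ z }) {
  return z.object({
    job_title: z.string(),
    is_remote: z.boolean(),
    salary_range: z.array(z.number()),
  });
}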

Response handler

The response handler is the meat and potatoes of your crawler. Here you write custom logic to do whatever you’d like! Below is an example that scrapes data and inserts it into SQLite.

Remember: by the time the handler runs, the request has already been made. Use the handler to process the page’s DOM rather than to make new requests; to continue the crawl, enqueue further URLs instead.

onResponse({ $, $$, enqueue }) {
  // $ is shorthand for document.querySelector()
  // $$ is shorthand for Array.from(document.querySelectorAll())
  const title = $('head > title')?.innerText;
  const description = $('meta[name="description"]')?.getAttribute('content');
  // these keys should match what you've specified in `schema()`
  const row = { title, description };

  // find URLs to enqueue to continue the crawl
  enqueue($$("a[href]"));

  return {
    insert: { row },
  };
}
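
As a slightly fuller sketch, a handler for a job listing page could fill in the schema() example from above and follow only the links it cares about; the selectors and URL pattern here are made up:

onResponse({ $, $$, enqueue }) {
  // hypothetical selectors for a job listing page
  const job_title = $(".job-title")?.innerText;
  const is_remote = $(".location")?.innerText?.includes("Remote") ?? false;
  // salary_range is omitted here; since every column is nullable, that's fine

  // only follow pagination links rather than every anchor on the page
  enqueue($$("a[href]").filter((a) =>
    a.getAttribute("href")?.startsWith("/jobs?page=")
  ));

  return {
    insert: { row: { job_title, is_remote } },
  };
}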

At the end of the day, it’s just JavaScript, which means that you can import any dependency installed by your favorite package manager to help you write your crawler.
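
For instance, assuming you’ve installed the slugify package from npm, you could import it at the top of your entry file and use it inside the handler (a hypothetical snippet):

import slugify from "slugify";

// ...

onResponse({ $ }) {
  const title = $("head > title")?.innerText ?? "";
  // e.g. "Senior Data Engineer" -> "senior-data-engineer"
  const row = { title, slug: slugify(title, { lower: true }) };
  return { insert: { row } };
}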

Usage with Git

Use Git to version control your crawler. It’s recommended to put all of your crawlers into a single Git repo, rather than creating a separate repo for each crawler. In other words, you will have a better experience if your .git directory is a sibling to the crawlspace.toml file.

You can also split up your crawler into multiple files if it starts to grow unwieldy; just make sure that the initializer, schema, and response handler are exported as the default object of the entry file.
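
For example, a split-up crawler might look like this, with the entry file re-exporting the three pieces (file names are hypothetical):

// crawlers/jobs/init.js
export function init() {
  return ["https://example.com/jobs"];
}

// crawlers/jobs/index.js (the entry file)
import { init } from "./init.js";
import { schema } from "./schema.js";
import { onResponse } from "./onResponse.js";

export default { init, schema, onResponse };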