Crawler configuration options are saved in a crawler.toml file alongside your code.
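
For orientation, here is a minimal sketch of a crawler.toml. The fields are documented below; the specific values (the crawler name and entrypoint path) are illustrative.

  name = "my-crawler"      # unique per user
  entry = "src/index.ts"   # relative path to your crawler's entrypoint
  schedule = "@daily"      # crontab syntax or a shorthand; defaults to @manual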

name

string · Required

The name of your crawler. Must be unique per user.

entry

string · Required

The relative path to the entrypoint of your crawler.

schedule

string · Default: @manual

How often your crawler runs. Accepts valid crontab syntax and shorthand expressions. Crawlers can run at most once per minute.

Crontab syntax examples:

  • */10 * * * * -> every 10 minutes
  • 45 23 * * * -> every day at 11:45pm UTC
  • 30 12 * * 2 -> every Tuesday at 12:30pm UTC
  • 59 1 14 3 * -> every March 14 at 1:59am UTC

Valid shorthand examples:

  • @manual -> only start a crawl by manually clicking a button
  • @hourly -> every hour at the beginning of the hour (0 * * * *)
  • @daily -> every day at midnight (0 0 * * *)
  • @weekly -> every week at midnight on Sunday (0 0 * * 0)
  • @monthly -> every month at 12am on the 1st of the month (0 0 1 * *)
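
As a sketch, either form goes directly in the schedule field; the two lines below are alternatives, not a single file.

  # every 10 minutes
  schedule = "*/10 * * * *"

  # or, equivalent to 0 * * * *
  schedule = "@hourly"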

crawl

Per-crawl and link traversal settings:

crawl.maxRequests

number · Required · Range: 1 - 10000000

The maximum number of pages to request per crawler run, not including retries. Use this value to cap how many requests a single crawl can make. A crawl ends when either maxMinutes or maxRequests is reached, whichever happens first.

crawl.maxRequestsPerOrigin

number · Optional · Range: 1 - 10000000

The maximum number of requests per origin per crawler run. Set this value to limit requests per website without having to track them yourself.

crawl.maxMinutes

number · Optional · Range: 1 - 43199

The maximum duration of each crawl in minutes. If the crawler runs on a schedule, then this value should be at least one minute less than the scheduled interval. A crawl ends when either maxMinutes or maxRequests is reached, whichever happens first.
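
A sketch of the three limits together, written as a [crawl] table (dotted keys such as crawl.maxRequests = 5000 are equivalent TOML); the values are illustrative.

  [crawl]
  maxRequests = 5000           # hard cap on requests per crawl, excluding retries
  maxRequestsPerOrigin = 500   # per-website cap within the same crawl
  maxMinutes = 120             # stop after 2 hours, whichever limit is hit first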

crawl.matchDomains

string[] · Optional

Glob patterns matching a URL’s hostname. Wildcards (*) match between dot (.) characters. Provide values to allow or deny adding certain URLs to the queue.

Domain glob examples:

  • example.com -> only crawl URLs whose hostname equals example.com
  • *.example.com -> only crawl single-level subdomains of example.com
  • **.example.com -> only crawl multi-level subdomains of example.com
  • **.io -> only crawl .io domains
  • !**.tk -> do not crawl .tk domains

Crawlspace already avoids certain domains, so there is no need to include any of the following yourself:

  • !**.gov
  • !**.onion
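
A sketch combining allow and deny patterns; example.com stands in for whatever site you are crawling.

  [crawl]
  matchDomains = [
    "example.com",        # the apex domain itself
    "*.example.com",      # single-level subdomains such as blog.example.com
    "!ads.example.com",   # but skip this particular subdomain
  ]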

crawl.matchPaths

string[] · Optional

Glob patterns matching a URL’s pathname. Wildcards (*) match between slash (/) characters. Provide values to allow or deny adding certain URLs to the queue.

Path glob examples:

  • /jobs -> only crawl URLs whose pathname equals /jobs
  • /jobs/* -> only crawl URLs one level under the /jobs directory
  • /jobs/** -> only crawl URLs whose pathname starts with /jobs/
  • **/*.html -> only crawl URLs that have a .html file extension
  • !**/*.gif -> do not crawl URLs that have a .gif file extension

Crawlspace already avoids certain paths, so there is no need to include any of the following yourself:

  • !**/*.avif
  • !**/*.bak
  • !**/*.dmg
  • !**/*.exe
  • !**/*.gz
  • !**/*.m4a
  • !**/*.mov
  • !**/*.mp3
  • !**/*.mp4
  • !**/*.ogg
  • !**/*.pdf
  • !**/*.psd
  • !**/*.rar
  • !**/*.rpm
  • !**/*.wasm
  • !**/*.wav
  • !**/*.webm
  • !**/*.xsd
  • !**/*.zip
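
A sketch combining allow and deny path patterns; the paths are illustrative.

  [crawl]
  matchPaths = [
    "/jobs/**",       # anything under the /jobs/ directory
    "**/*.html",      # plus any URL ending in .html
    "!**/*.json",     # but never .json endpoints
  ]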

crawl.ignoreQueryParams

boolean · Default: false

Whether to remove query parameters from URLs before adding them to the queue.
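
As a sketch, with this enabled a URL's query string is dropped before the URL is queued; the URL below is illustrative.

  [crawl]
  ignoreQueryParams = true
  # https://example.com/jobs?page=2 is queued as https://example.com/jobs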

queue

Queue consumer settings:

queue.batchSize

number · Default: plan maximum · Range: 1 - 25

The maximum number of requests that a worker will attempt to fetch in parallel. Keep in mind that each worker instance is limited to 6 simultaneous open connections.

queue.maxConcurrency

number · Default: plan maximum · Range: 1 - 250

The maximum number of concurrent workers assigned to alleviate queue pressure. Setting this value can help you stay under an origin’s rate limits; otherwise, it’s recommended to leave it at the default.

queue.maxRetries

number · Default: 3 · Range: 0 - 20

The maximum number of times that an unsuccessful request will be retried. Retries are re-added to the queue with a delay based on the response’s Retry-After header or, if absent, with exponential backoff.
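
A sketch of conservative queue settings for a rate-limited origin; the numbers are illustrative, not recommendations.

  [queue]
  batchSize = 5         # fetch at most 5 requests in parallel per worker
  maxConcurrency = 10   # cap concurrent workers to stay under rate limits
  maxRetries = 5        # retry failed requests up to 5 times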

vector

Vector database settings:

vector.enabled

boolean · Default: false

Set to true to opt in to using a vector database with your crawler. A vector database is created the first time your crawler is deployed while this value is true. You must set this value to true if your handler returns embeddings.

vector.dimensions

number · Default: 768 · Range: 1 - 1536

The number of dimensions in each embedding. Generated embeddings must have exactly this number of dimensions to be stored correctly. This value cannot be changed after the vector database has been created.

vector.metric

“cosine” | “euclidean” | “dot-product” · Default: cosine

The distance metric used to calculate similar vectors at query-time. This value cannot be changed after the vector database has been created.
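
A sketch enabling a vector database, with the default dimensions and metric written out explicitly; neither can be changed after the database is created.

  [vector]
  enabled = true
  dimensions = 768      # must match the dimensionality of your embeddings
  metric = "cosine"     # or "euclidean" / "dot-product"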

bucket

S3-compatible bucket settings:

bucket.enabled

boolean · Default: false

Set to true to opt in to using a bucket with your crawler. A bucket is created the first time your crawler is deployed while this value is true. You must set this value to true if your handler returns the attach property.
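
A sketch enabling the bucket, which a handler returning the attach property requires:

  [bucket]
  enabled = true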