Crawler configuration options are saved in a crawler.toml file alongside your code.
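
For orientation, here is a minimal sketch of a crawler.toml. The fields are documented below; the specific values (the crawler name and entrypoint path) are illustrative.

  name = "my-crawler"      # unique per user
  entry = "src/index.ts"   # relative path to your crawler's entrypoint
  schedule = "@daily"      # crontab syntax or a shorthand; defaults to @manual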

name

string · Required

The name of your crawler. Must be unique per user.

entry

string · Required

The relative path to the entrypoint of your crawler.

schedule

string · Default: @manual

How often your crawler runs. Accepts valid crontab syntax and shorthand expressions. Crawlers can run at most once per minute.

Crontab syntax examples:

  • */10 * * * * -> every 10 minutes
  • 45 23 * * * -> every day at 11:45pm UTC
  • 30 12 * * 2 -> every Tuesday at 12:30pm UTC
  • 59 1 14 3 * -> every March 14 at 1:59am UTC

Valid shorthand examples:

  • @manual -> only start a crawl by manually clicking a button
  • @hourly -> every hour at the beginning of the hour (0 * * * *)
  • @daily -> every day at midnight (0 0 * * *)
  • @weekly -> every week at midnight on Sunday (0 0 * * 0)
  • @monthly -> every month at 12am on the 1st of the month (0 0 1 * *)
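
As a sketch, either form goes directly in the schedule field; the two lines below are alternatives, not a single file.

  # every 10 minutes
  schedule = "*/10 * * * *"

  # or, equivalent to 0 * * * *
  schedule = "@hourly"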

crawl

Per-crawl and link traversal settings:

crawl.maxRequests

number · Required · Range: 1 - 10000000

The maximum number of pages to request per crawler run, not including retries. Use this value to cap how many requests a single crawl can make. A crawl ends when either maxMinutes or maxRequests is reached, whichever happens first.

crawl.maxRequestsPerOrigin

number · Optional · Range: 1 - 10000000

The maximum number of requests per origin per crawler run. Set this value to limit requests per website without having to track them yourself.

crawl.maxMinutes

number · Optional · Range: 1 - 43199

The maximum duration of each crawl in minutes. If the crawler runs on a schedule, then this value should be at least one minute less than the scheduled interval. A crawl ends when either maxMinutes or maxRequests is reached, whichever happens first.
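
A sketch of the three limits together, written as a [crawl] table (dotted keys such as crawl.maxRequests = 5000 are equivalent TOML); the values are illustrative.

  [crawl]
  maxRequests = 5000           # hard cap on requests per crawl, excluding retries
  maxRequestsPerOrigin = 500   # per-website cap within the same crawl
  maxMinutes = 120             # stop after 2 hours, whichever limit is hit first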

crawl.matchDomains

string[] · Optional

Glob patterns matching a URL’s hostname. Wildcards (*) match between dot (.) characters. Provide values to allow or deny adding certain URLs to the queue.

Domain glob examples:

  • example.com -> only crawl URLs whose hostname equals example.com
  • *.example.com -> only crawl single-level subdomains of example.com
  • **.example.com -> only crawl multi-level subdomains of example.com
  • **.io -> only crawl .io domains
  • !**.tk -> do not crawl .tk domains

Crawlspace already avoids certain domains, so there is no need to include any of the following yourself:

  • !**.gov
  • !**.onion
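
A sketch combining allow and deny patterns; example.com stands in for whatever site you are crawling.

  [crawl]
  matchDomains = [
    "example.com",        # the apex domain itself
    "*.example.com",      # single-level subdomains such as blog.example.com
    "!ads.example.com",   # but skip this particular subdomain
  ]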

crawl.matchPaths

string[] · Optional

Glob patterns matching a URL’s pathname. Wildcards (*) match between slash (/) characters. Provide values to allow or deny adding certain URLs to the queue.

Path glob examples:

  • /jobs -> only crawl URLs whose pathname equals /jobs
  • /jobs/* -> only crawl URLs one level under the /jobs directory
  • /jobs/** -> only crawl URLs whose pathname starts with /jobs/
  • **/*.html -> only crawl URLs that have a .html file extension
  • !**/*.gif -> do not crawl URLs that have a .gif file extension

Crawlspace already avoids certain paths, so there is no need to include any of the following yourself:

  • !**/*.avif
  • !**/*.bak
  • !**/*.dmg
  • !**/*.exe
  • !**/*.gz
  • !**/*.m4a
  • !**/*.mov
  • !**/*.mp3
  • !**/*.mp4
  • !**/*.ogg
  • !**/*.pdf
  • !**/*.psd
  • !**/*.rar
  • !**/*.rpm
  • !**/*.wasm
  • !**/*.wav
  • !**/*.webm
  • !**/*.xsd
  • !**/*.zip
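
A sketch combining allow and deny path patterns; the paths are illustrative.

  [crawl]
  matchPaths = [
    "/jobs/**",       # anything under the /jobs/ directory
    "**/*.html",      # plus any URL ending in .html
    "!**/*.json",     # but never .json endpoints
  ]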

crawl.ignoreQueryParams

boolean · Default: false

Whether to remove query parameters from URLs before adding them to the queue.
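
As a sketch, with this enabled a URL's query string is dropped before the URL is queued; the URL below is illustrative.

  [crawl]
  ignoreQueryParams = true
  # https://example.com/jobs?page=2 is queued as https://example.com/jobs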

queue

Queue consumer settings:

queue.batchSize

number · Default: plan maximum · Range: 1 - 25

The maximum number of requests that a worker will attempt to fetch in parallel. Keep in mind that each worker instance is limited to 6 simultaneous open connections.

queue.maxConcurrency

number · Default: plan maximum · Range: 1 - 250

The maximum number of concurrent workers assigned to alleviate queue pressure. Setting this value can help you stay under an origin’s rate limits; otherwise, it’s recommended to leave it at the default.

queue.maxRetries

number · Default: 3 · Range: 0 - 20

The maximum number of times that an unsuccessful request will be retried. Retries are re-added to the queue with a delay based on the response’s Retry-After header or, if absent, with exponential backoff.
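
A sketch of conservative queue settings for a rate-limited origin; the numbers are illustrative, not recommendations.

  [queue]
  batchSize = 5         # fetch at most 5 requests in parallel per worker
  maxConcurrency = 10   # cap concurrent workers to stay under rate limits
  maxRetries = 5        # retry failed requests up to 5 times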

vector

Vector database settings:

vector.enabled

boolean · Default: false

Set to true to opt in to using a vector database with your crawler. A vector database is created the first time your crawler is deployed while this value is true. You must set this value to true if your handler returns embeddings.

vector.dimensions

number · Default: 768 · Range: 1 - 1536

The number of dimensions in each embedding. Generated embeddings must have exactly this number of dimensions to be stored correctly. This value cannot be changed after the vector database has been created.

vector.metric

“cosine” | “euclidean” | “dot-product” · Default: cosine

The distance metric used to calculate similar vectors at query-time. This value cannot be changed after the vector database has been created.
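
A sketch enabling a vector database, with the default dimensions and metric written out explicitly; neither can be changed after the database is created.

  [vector]
  enabled = true
  dimensions = 768      # must match the dimensionality of your embeddings
  metric = "cosine"     # or "euclidean" / "dot-product"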

bucket

S3-compatible bucket settings:

bucket.enabled

boolean · Default: false

Set to true to opt in to using a bucket with your crawler. A bucket is created the first time your crawler is deployed while this value is true. You must set this value to true if your handler returns the attach property.
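
A sketch enabling the bucket, which a handler returning the attach property requires:

  [bucket]
  enabled = true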