onResponse function.
Remember: by the time the response handler runs, the request has already been made.
Use it to process the page’s DOM rather than to fetch additional pages.
Query selectors
For HTML responses, use $ and $$ to select specific DOM elements.
While similar to jQuery or Cheerio, these methods are standard JavaScript that work if you paste them in your browser console.
$
The $ function is shorthand for document.querySelector(),
and is available when the response content-type starts with text/html.
$$
The $$ function is short for Array.from(document.querySelectorAll()),
and is available when the response content-type starts with text/html.
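As a rough illustration of the two definitions above, the sketch below builds $ and $$ on top of any object that offers querySelector and querySelectorAll. The makeSelectors helper, the QueryRoot interface, and the fake root are hypothetical names used only for this example; they are not part of Crawlspace, and in a real handler $ and $$ are simply provided to you.

```typescript
// Minimal shape of a DOM root, modeled on the two `document` methods named above.
interface QueryRoot<E> {
  querySelector(selector: string): E | null;
  querySelectorAll(selector: string): Iterable<E>;
}

// Hypothetical helper (not a Crawlspace API): builds $ and $$ for a given root.
function makeSelectors<E>(root: QueryRoot<E>) {
  return {
    // First matching element, or null when nothing matches.
    $: (selector: string): E | null => root.querySelector(selector),
    // All matches as a real array, so map/filter work directly.
    $$: (selector: string): E[] => Array.from(root.querySelectorAll(selector)),
  };
}

// Tiny stand-in for a parsed page, so the sketch runs without a browser.
const fakeLinks = [{ href: "/a" }, { href: "/b" }];
const fakeRoot: QueryRoot<{ href: string }> = {
  querySelector: (sel) => (sel === "a" ? fakeLinks[0] : null),
  querySelectorAll: (sel) => (sel === "a" ? fakeLinks : []),
};

const { $, $$ } = makeSelectors(fakeRoot);
const firstHref = $("a")?.href;                // "/a"
const allHrefs = $$("a").map((el) => el.href); // ["/a", "/b"]
```

Because $$ returns an array rather than a NodeList, array methods such as map and filter can be chained on the result without an extra conversion.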
Inference
Query selectors help you target specific DOM elements of a given website. But when crawling the web at large, you might not know what specific selectors to use. For general-purpose data extraction that works on any website, run inference with large language models (LLMs). LLMs are more resilient to website changes, too: your code won’t need to change when web page code changes.
Crawlspace exposes three AI convenience methods for text generation: extract, summarize, and sentiment.
ai.extract
AI-based data extraction is a powerful and resilient general-purpose tool for reliably converting unstructured data into JSON.
- meta/llama-3.1-8b: Small inference
- meta/llama-3.3-70b: XL inference
ai.summarize
Summarize the content of a page or a specific DOM node.
This method is very fast and inexpensive.
- facebook/bart-large-cnn: Tiny inference
ai.sentiment
Run sentiment analysis to classify text as either POSITIVE or NEGATIVE.
This method is very fast and inexpensive.
- distilbert/sst-2-english: Tiny inference
Embeddings
Use embeddings for retrieval-augmented generation (RAG) tasks. This is useful for chatting with the data gathered from your crawl. Embeddings are stored in your crawler’s vector database.
ai.embed
Here’s how you’d write the end of your crawler’s handler to upsert embeddings of markdown for future retrieval:
- 384: baai/bge-small-en-v1.5: Small embedding
- 768: baai/bge-base-en-v1.5: Base embedding
- 1024: baai/bge-large-en-v1.5: Large embedding
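To give a feel for how stored embeddings support retrieval, here is a minimal sketch of nearest-neighbor lookup by cosine similarity. It is an illustration of the general RAG retrieval idea, not a description of how Crawlspace’s vector database works internally; the StoredChunk type, the topK function, and the 3-dimensional toy vectors are assumptions for the example (real vectors would have 384, 768, or 1024 dimensions depending on the model).

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as similar to nothing.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical record shape for a stored chunk of crawled text.
interface StoredChunk {
  id: string;
  vector: number[];
}

// Return the k stored chunks most similar to the query vector.
function topK(query: number[], chunks: StoredChunk[], k: number): StoredChunk[] {
  return [...chunks]
    .sort((x, y) => cosineSimilarity(query, y.vector) - cosineSimilarity(query, x.vector))
    .slice(0, k);
}

// Toy data: three stored chunks and one query vector.
const stored: StoredChunk[] = [
  { id: "a", vector: [0, 1, 0] },
  { id: "b", vector: [0.9, 0.1, 0] },
  { id: "c", vector: [-1, 0, 0] },
];
const nearest = topK([1, 0, 0], stored, 2).map((chunk) => chunk.id); // ["b", "a"]
```

The query and the stored vectors must come from the same embedding model, since vectors of different dimensions cannot be compared.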
Enqueueing
Add requests to your crawler’s queue with the enqueue method.
This synchronous function appends requests to the queue.
You can call enqueue() multiple times in your response handler.
Requests are automatically normalized and deduplicated based on URL.
You can pass in URLs (strings), request objects, or HTMLElements.
enqueue
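The sketch below illustrates what "normalized and deduplicated based on URL" can mean in practice. It is not Crawlspace’s implementation: the DedupingQueue class, the RequestLike type, and the specific normalization rule (resolve against a base URL and drop the fragment) are assumptions made for this example, and HTMLElement inputs are left out to keep it runnable outside a browser.

```typescript
// Accepts either a URL string or a minimal request-like object.
type RequestLike = string | { url: string };

// Hypothetical queue that mimics normalize-then-deduplicate behavior.
class DedupingQueue {
  private seen = new Set<string>();
  private items: string[] = [];

  // Assumed normalization: resolve relative URLs and drop the #fragment.
  // The standard URL parser also lowercases the host and removes default ports.
  private normalize(input: RequestLike, base: string): string {
    const raw = typeof input === "string" ? input : input.url;
    const url = new URL(raw, base);
    url.hash = "";
    return url.href;
  }

  // Returns true if the request was added, false if it was a duplicate.
  enqueue(input: RequestLike, base: string): boolean {
    const key = this.normalize(input, base);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.items.push(key);
    return true;
  }

  get pending(): readonly string[] {
    return this.items;
  }
}

const base = "https://example.com/";
const queue = new DedupingQueue();
const addedFirst = queue.enqueue("https://Example.com/a#top", base); // true
const addedAgain = queue.enqueue({ url: "/a" }, base);               // false: same URL once normalized
const addedOther = queue.enqueue("https://example.com:443/b", base); // true
```

Calling enqueue repeatedly with overlapping links is therefore harmless under this model: only the first occurrence of each normalized URL ends up in the queue.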
Response data
Access the raw response.
response
Use the response object to read things like response headers.
The response body is already consumed for text-based responses, so do not use
response.text() or response.json().
Use the html, json, and xml properties instead.
html
The html string is available when the response content-type starts with text/html.
json
The json object is available when the response content-type starts with application/json.
Use this object instead of awaiting response.json().
xml
The xml object is available when the response content-type starts with application/xml.
Use this object instead of awaiting response.text().
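The content-type rules above can be summarized as a small prefix check. The function below is a sketch under stated assumptions: parsedKindFor and the ParsedKind type are hypothetical names, and lowercasing the header value before matching is an assumption based on media types being case-insensitive, not something this page specifies.

```typescript
// Which parsed property a handler could expect for a given content-type.
type ParsedKind = "html" | "json" | "xml" | "none";

// Hypothetical helper mirroring the "starts with" rules described above.
function parsedKindFor(contentType: string | null): ParsedKind {
  const type = (contentType ?? "").trim().toLowerCase();
  if (type.startsWith("text/html")) return "html";
  if (type.startsWith("application/json")) return "json";
  if (type.startsWith("application/xml")) return "xml";
  return "none";
}

const kindForPage = parsedKindFor("text/html; charset=utf-8"); // "html"
const kindForApi = parsedKindFor("application/json");           // "json"
const kindForImage = parsedKindFor("image/png");                // "none"
```

Matching on the prefix rather than the whole header value is what lets parameters such as charset pass through without affecting the result.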
Other helpers
getMarkdown
Converting a page or DOM node into markdown is helpful when working with LLMs.
It also reduces the token count by stripping out unnecessary markup.
env
Environment variables that you put in your crawler’s .env file are available inside the env object.
Env vars must begin with SECRET_ to be included in your crawler’s deployment.
All env vars are considered secret and are not stored on Crawlspace’s servers.
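As a small illustration of the SECRET_ prefix rule, the sketch below filters a set of variables down to those that would qualify for deployment. The deployableEnv function and the variable names are hypothetical and exist only for this example; the actual filtering is done by Crawlspace’s tooling.

```typescript
// Hypothetical helper: keep only variables whose names begin with SECRET_.
function deployableEnv(vars: Record<string, string>): Record<string, string> {
  const kept: Record<string, string> = {};
  for (const name of Object.keys(vars)) {
    if (name.startsWith("SECRET_")) kept[name] = vars[name];
  }
  return kept;
}

// Example .env contents, represented as an object.
const dotEnv = { SECRET_API_KEY: "abc123", DEBUG: "1" };
const deployed = deployableEnv(dotEnv);
const deployedNames = Object.keys(deployed); // ["SECRET_API_KEY"]
```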
request
The request object gives you access to the properties of the request that was made.