The response handler is the brains of your web crawler.
Here you write custom logic to convert HTTP responses to structured data, blobs, or vectors.
How you scrape is entirely up to you.
Crawlspace exposes convenience methods to make scraping as easy as possible.
Each property and method below is available as a parameter of the onResponse function.
Remember: by the time the response handler runs, the request has already been made.
Use it to process the page’s DOM rather than fetch new requests.
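For orientation, here is a minimal sketch of a handler that pulls a value out of the page and stores it. The selector and row fields are illustrative, and it assumes the request object exposes its url; every helper used here is documented below.
async onResponse({ $, request }) {
  // read data out of the already-fetched page
  const title = $('h1')?.innerText;
  // return a row to upsert into your crawler's SQLite database
  return {
    upsert: { row: { url: request.url, title }, onConflict: 'url' },
  };
}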
Query selectors
For HTML responses, use $ and $$ to select specific DOM elements.
While similar to jQuery or Cheerio, these methods use standard JavaScript and work as-is if you paste them into your browser console.
The $ function is shorthand for document.querySelector(),
and is available when the response content-type starts with text/html.
onResponse({ $ }) {
  const paragraph = $('#parent .item p'); // HTMLElement | undefined
  const text = paragraph?.innerText; // text of all children
  // ...
}
The $$ function is short for Array.from(document.querySelectorAll()),
and is available when the response content-type starts with text/html.
onResponse({ $$ }) {
  const links = $$('a[href]'); // Array<HTMLElement>
  const hrefs = links.map(link => link.href); // string[]
  // ...
}
Inference
Query selectors help you target specific DOM elements of a given website.
But when crawling the web at large, you might not know what specific selectors to use.
For general-purpose data extraction that works on any website, run inference with large language models (LLMs).
LLMs are more resilient to website changes, too — your code won’t need to change when web page code changes.
Crawlspace exposes three AI convenience methods for text generation: extract, summarize, and sentiment.
ai.extract
AI-based data extraction is a resilient, general-purpose way to reliably convert unstructured data into JSON.
async onResponse({ $, ai, z }) {
  // provide an HTMLElement or a string as the corpus to extract from
  const corpus = $("main");
  // you can reuse your crawler's schema, or create a new one
  const schema = z.object({
    tiers: z.array(
      z.object({
        planName: z.string(),
        monthlyPrice: z.union([z.number().int(), z.string()]),
      }),
    ),
  });
  // use Crawlspace's built-in AI workers (no extra API tokens needed)
  const { tiers } = await ai.extract<z.infer<typeof schema>>(corpus, {
    model: 'meta/llama-3.1-8b',
    instruction: 'Find the pricing plan of each tier',
    schema,
  });
  console.log(tiers); // { planName: string, monthlyPrice: number | string }[]
  // ...
}
Usage is billed based on the number of blended (input + output) tokens used.
Currently, there are two models available:
- meta/llama-3.1-8b: Small inference
- meta/llama-3.3-70b: XL inference
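For harder extractions, such as larger schemas or messier pages, pass the XL model as the model option. A sketch reusing the corpus and schema from the example above:
const { tiers } = await ai.extract<z.infer<typeof schema>>(corpus, {
  model: 'meta/llama-3.3-70b', // XL inference
  instruction: 'Find the pricing plan of each tier',
  schema,
});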
ai.summarize
Summarize the content of a page or a specific DOM node.
This method is very fast and inexpensive.
async onResponse({ $, ai }) {
  const { summary } = await ai.summarize($('main'));
  console.log(summary); // string
  // ...
}
Usage is billed based on the number of input tokens used.
Currently, there is one model available:
ai.sentiment
Run sentiment analysis to classify text as either POSITIVE or NEGATIVE.
This method is very fast and inexpensive.
async onResponse({ $, ai }) {
  const result = await ai.sentiment($('#user-review-1'));
  const { score, label } = result[0];
  console.log(label); // either "POSITIVE" or "NEGATIVE"
  // ...
}
Usage is billed based on the number of input tokens used.
Currently, there is one model available:
Embeddings
Use embeddings for retrieval-augmented generation (RAG) tasks.
This is useful for chatting with the data gathered from your crawl.
Embeddings are stored in your crawler’s vector database.
ai.embed
Here’s how you’d write the end of your crawler’s handler to upsert embeddings of markdown for future retrieval:
async onResponse({ ai, getMarkdown, request }) {
  // convert the page into markdown
  const markdown = getMarkdown();
  const { embeddings } = await ai.embed(markdown, {
    // important: should match `vector.dimensions` in crawler's config
    dimensions: 768,
  });
  // the sqlite row to store alongside the embeddings
  const row = { url: request.url };
  return {
    // upsert the embeddings along with the sqlite row
    upsert: { row, onConflict: "url", embeddings },
    // upload the markdown to a bucket for future retrieval
    attach: { content: markdown },
  };
}
The above example uses the SQLite database, vector database, and bucket storage solutions in harmony.
Usage is billed based on the number of input tokens used.
Currently, there are three embedding models available.
The model is chosen automatically based on the number of dimensions you request:
- 384: baai/bge-small-en-v1.5 (Small embedding)
- 768: baai/bge-base-en-v1.5 (Base embedding)
- 1024: baai/bge-large-en-v1.5 (Large embedding)
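For example, requesting 1024 dimensions selects the large model. A sketch reusing the ai.embed call from above (remember the value must also match vector.dimensions in your crawler's config):
const { embeddings } = await ai.embed(markdown, {
  // 1024 dimensions selects baai/bge-large-en-v1.5 (Large embedding)
  dimensions: 1024,
});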
Enqueueing
Add requests to your crawler's queue with the synchronous enqueue method.
You can call enqueue() multiple times in your response handler.
Requests are automatically normalized and deduplicated based on URL.
You can pass in URLs (strings), request objects, or HTMLElements.
enqueue
onResponse({ $$, enqueue }) {
  // enqueue all links on the page
  const links = $$('a[href]');
  enqueue(links);
  // enqueue the root page (`/`) of every link on the page
  const rootPages = links.map(link => new URL(link.href).origin);
  enqueue(rootPages);
  // ...
}
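Besides HTMLElements, enqueue also accepts plain URL strings and request objects. A sketch; the url-only object shape is an assumption for illustration:
onResponse({ enqueue }) {
  // strings are fine
  enqueue('https://example.com/pricing');
  // request objects are accepted too (url-only shape assumed here)
  enqueue({ url: 'https://example.com/pricing' });
  // enqueueing the same URL again is safe: requests are deduplicated by URL
  enqueue('https://example.com/pricing');
  // ...
}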
Response data
Access the raw response.
response
Use the response object to read things like response headers.
The response body is already consumed for text-based responses, so do not use response.text() or response.json().
Use the html, json, and xml properties instead.
onResponse({ response }) {
  const contentType = response.headers.get('content-type');
  // ...
}
html
onResponse({ html }) {
  if (html) {
    console.log(html); // string
  }
  // ...
}
The raw html string is available when the response content-type starts with text/html.
json
onResponse({ json }) {
  if (json) {
    console.log(json); // unknown
  }
  // ...
}
The json object is available when the response content-type starts with application/json.
Use this object instead of awaiting response.json().
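For JSON APIs, a common pattern is to read fields straight off the json object and return them as a row. A sketch; the response fields and row columns are illustrative:
async onResponse({ json, request }) {
  if (json) {
    // these fields are illustrative; adapt them to the API you crawl
    const { name, price } = json as { name?: string; price?: number };
    return {
      upsert: { row: { url: request.url, name, price }, onConflict: 'url' },
    };
  }
  // ...
}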
xml
onResponse({ xml }) {
  if (xml) {
    console.log(xml); // unknown
  }
  // ...
}
The xml object is available when the response content-type starts with application/xml.
Use this object instead of awaiting response.text().
Other helpers
getMarkdown
Converting a page or DOM node into markdown is helpful when working with LLMs.
It also reduces the token count by stripping out unnecessary markup.
onResponse({ $, getMarkdown }) {
  // pass no arguments to get the entire page markdown
  const pageMarkdown = getMarkdown();
  // pass in a single element to get its contents
  const firstParagraph = $('article > p');
  const nodeMarkdown = getMarkdown(firstParagraph);
  // or pass in a string to convert
  const stringMarkdown = getMarkdown(firstParagraph?.innerHTML);
  // ...
}
env
Environment variables that you put in your crawler’s .env file are available inside the env object.
Env vars must begin with SECRET_ to be included in your crawler’s deployment.
All env vars are considered secret and are not stored on Crawlspace’s servers.
onResponse({ env }) {
  console.log(env.SECRET_API_KEY); // value in your crawler's .env file
  // ...
}
request
With request, you have access to the properties of the request that was made.
onResponse({ request }) {
  const method = request.method;
  console.log(method); // string
  // ...
}
z
The Zod global, for convenience.
Use it to access Zod types when defining a schema.
onResponse({ z }) {
  // e.g., define a schema to pass to ai.extract
  const schema = z.object({ title: z.string() });
  // ...
}