Using setRequestInterception in Puppeteer
Use setRequestInterception, together with Puppeteer's request events, to customize how browser requests are handled, both before a request is sent and after it completes.
Request interception is useful when you need to abort certain requests, serve a different response, or capture the content of a server reply. Getting started with request interception is easy.
const page = await browser.newPage();
await page.setRequestInterception(true);
page.on("request", onRequestStarted);
page.on("requestfinished", onRequestFinished);
The setRequestInterception call acts as a toggle. Once interception is enabled, Puppeteer pauses every request until your request callback decides how to handle it, and the requestfinished callback fires once a request has completed.
What are the potential use cases for request interception?
- Abort requests for images or other media if you're trying to save bandwidth while running a scraper.
- Receive an AJAX reply and extract its content, instead of trying to pull the information from the DOM.
- Serve altered responses to the browser (sketched just below this list).
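For the last of these, here is a minimal sketch of a handler that serves an altered response. The URL and replacement body are made up for illustration; register the handler with page.on("request", ...) as shown above.
function serveAlteredResponse(request) {
  // Replace one script with a stubbed version instead of fetching it.
  if (request.url().endsWith("/analytics.js")) {
    return request.respond({
      status: 200,
      contentType: "application/javascript",
      body: "window.analytics = { track() {} };"
    });
  }
  // Everything else proceeds untouched.
  return request.continue();
}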
Running code before a request is sent
The request callback will run for every request that the browser makes. You can stop a request with the abort method, allow it to proceed with continue, or even use respond to serve a reply without making the request at all.
This is an example callback that uses all of the above.
// Assumes an S3 client created elsewhere, e.g. const s3 = new AWS.S3() from the AWS SDK.
function onRequestStarted(request) {
  // Abort requests for images, unless they are served from gstatic.
  if (request.resourceType() === "image") {
    if (!request.url().includes("gstatic")) {
      return request.abort();
    }
  }

  // Look for a cached copy in S3; serve it if it exists,
  // otherwise let the request continue to the network.
  s3.getObject(
    {
      Bucket: "puppeteer-cache",
      Key: encodeURIComponent(request.url())
    },
    (err, data) => {
      if (!err) {
        return request.respond({ status: 200, body: data.Body });
      }
      return request.continue();
    }
  );
}
In this example, we abort requests for images (unless the URL points at gstatic), look for every other resource in an S3 bucket, and only allow the request to continue as normal if the file is not available in the cache.
Always remember to resolve every request with continue, abort, or respond! The browser session will hang indefinitely if your callback doesn't call one of them.
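One defensive pattern, shown here only as a sketch (it isn't part of the original example), is to register a wrapper that falls back to continue when the handler throws:
page.on("request", async request => {
  try {
    await onRequestStarted(request);
  } catch (err) {
    console.log(`Interception error: ${err}`);
    // Fall back to letting the request through. continue() errors if the
    // request was already handled, so that case is ignored here.
    try {
      await request.continue();
    } catch (_) {}
  }
});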
Running code after a request is finished
Our first example aborted most image requests and attempted to use S3 as a cache, only allowing a request to continue normally if no cached copy was found.
We can use the requestfinished event to push responses into the S3 cache. We'll need to inspect the response headers for this to work properly, because only certain kinds of responses should end up in the cache.
function onRequestFinished(request) {
  const response = request.response();
  const headers = response.headers();

  // Respect the cache-control header: skip anything marked no-cache or private.
  const cachePolicy = headers["cache-control"];
  if (cachePolicy) {
    if (
      cachePolicy.includes("no-cache") ||
      cachePolicy.includes("private")
    ) {
      return;
    }
  }

  // Skip redirect responses and images; neither should end up in the cache.
  if (response.status() !== 301 && response.status() !== 302) {
    if (request.resourceType() === "image") {
      return;
    }

    // Read the response body and upload it to the S3 cache.
    response
      .buffer()
      .then(buffer => {
        s3.upload(
          {
            Bucket: "puppeteer-cache",
            Key: encodeURIComponent(request.url()),
            Body: buffer
          },
          err => {
            if (err) {
              console.log(`Cache error: ${err}`);
            }
          }
        );
      })
      .catch(err => {
        console.log(`S3 Put Failed: ${err}`);
      });
  }
}
We ignore image responses here, just like in the request callback, but unless the response is marked as no-cache or private, it is uploaded to S3.
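Putting it together, here is a minimal sketch of how everything might be wired up. It assumes puppeteer and the AWS SDK v2 are installed, and the target URL is only a placeholder.
const puppeteer = require("puppeteer");
const AWS = require("aws-sdk");

const s3 = new AWS.S3();

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Enable interception and attach the two callbacks from this article.
  await page.setRequestInterception(true);
  page.on("request", onRequestStarted);
  page.on("requestfinished", onRequestFinished);

  await page.goto("https://example.com");
  await browser.close();
})();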
The request and requestfinished callbacks are a powerful combination! I hope these examples have given you some idea of how they can help solve your own scraping problems.