Block requests with Puppeteer
Loading unnecessary resources can significantly slow down web automation and scraping scripts. As web pages have grown heavier over the years, blocking unneeded assets efficiently has become key to performance.
Current State of Web Bloat
One study found that blocking unnecessary requests improves browser performance by more than 34% on average (Smith 2021). Over the past 10 years, the median web page weight has grown from around 1 MB to well over 2.5 MB, driven by trackers, videos, and complex frameworks (HTTP Archive 2022).
The table below breaks down page weight by resource type, based on an analysis of top web pages:
Resource Type | Avg Size | % of Page |
---|---|---|
Images | 781 KB | 31% |
JavaScript | 640 KB | 25% |
HTML | 512 KB | 20% |
CSS | 244 KB | 10% |
Other | 283 KB | 14% |
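To get a rough sense of what blocking could save, the averages in the table can be summed per resource type. The sketch below is only back-of-the-envelope arithmetic on those figures. The helper name `estimateSavingsKB`, the mapping of table rows to Puppeteer-style type names (HTML as `document`, JavaScript as `script`, CSS as `stylesheet`), and the choice of blocked types are all illustrative assumptions, and real savings depend on the pages you visit.

```javascript
// Average sizes (KB) taken from the table above, keyed by an assumed
// mapping to Puppeteer-style resource type names.
const avgSizeKB = {
  image: 781,
  script: 640,
  document: 512,
  stylesheet: 244,
  other: 283,
};

// Hypothetical helper: sum the average size of every blocked type.
// Types that are not in the table contribute 0.
function estimateSavingsKB(blockedTypes) {
  return blockedTypes.reduce((sum, type) => sum + (avgSizeKB[type] ?? 0), 0);
}

const totalKB = Object.values(avgSizeKB).reduce((a, b) => a + b, 0);
const savedKB = estimateSavingsKB(['image', 'stylesheet']);
const savedPct = Math.round((savedKB / totalKB) * 100);

console.log(`${savedKB} KB of ${totalKB} KB (${savedPct}%)`);
```

Under these assumptions, blocking images and stylesheets would remove roughly two fifths of the average page weight listed in the table.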
Blocking with Request Interception
The page.setRequestInterception() method allows intercepting requests before the browser loads them. You can then programmatically check properties like the URL and resource type to decide whether to block the request.
Here is an example that blocks based on resource type and URL patterns:
// Blacklist of tracking domains
const blacklist = ['.analytics.com', '.tracking.io'];

await page.setRequestInterception(true);

page.on('request', request => {
  const type = request.resourceType();

  // Block all image requests
  if (type === 'image') {
    return request.abort();
  }

  // Block scripts whose URL matches the blacklist
  if (type === 'script' && blacklist.some(domain => request.url().includes(domain))) {
    return request.abort();
  }

  // Let everything else through; a request that is never handled would stall
  request.continue();
});
This allows precise control over what gets blocked at a granular level. You can block entire types like images, selectively block by URL patterns, or combine both techniques.
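Combining type-based and URL-based rules is easier to maintain when the blocking decision is separated from the Puppeteer calls, because the decision can then be tested without launching a browser. The following is a minimal sketch under that assumption. The names `shouldBlock` and `makeRequestHandler` and the domain list are illustrative, and hostnames are compared by suffix rather than by substring so that a URL that merely mentions a blocked domain in its path or query string is not blocked.

```javascript
// Illustrative block lists; adjust to your own needs.
const blockedTypes = new Set(['image', 'media', 'font']);
const blockedDomains = ['analytics.com', 'tracking.io'];

// Hypothetical helper: decide whether a request should be blocked.
function shouldBlock(resourceType, url) {
  if (blockedTypes.has(resourceType)) return true;

  let hostname;
  try {
    hostname = new URL(url).hostname;
  } catch {
    return false; // not a parseable URL; let it through
  }

  // Match the domain itself or any of its subdomains.
  return blockedDomains.some(
    (domain) => hostname === domain || hostname.endsWith(`.${domain}`)
  );
}

// Build a handler suitable for page.on('request', ...). Every request is
// either aborted or continued, so none is left pending.
function makeRequestHandler() {
  return (request) => {
    if (shouldBlock(request.resourceType(), request.url())) {
      return request.abort();
    }
    return request.continue();
  };
}
```

With a Puppeteer page, this would be wired up by calling `await page.setRequestInterception(true)` followed by `page.on('request', makeRequestHandler())`.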
Note that enabling request interception disables the page cache, and advanced performance profiling and debugging tools may not work properly while requests are being intercepted.
Leveraging Browser Plugins
The puppeteer-extra library allows integrating plugins that extend Puppeteer's capabilities. One useful plugin is puppeteer-extra-plugin-block-resources, which makes it easy to declare which resource types to block:
const puppeteer = require('puppeteer-extra');
// The module exports a factory; call it to create the plugin instance
const blockResources = require('puppeteer-extra-plugin-block-resources')();
puppeteer.use(blockResources);

// Block images, video, CSS
blockResources.blockedTypes.add('image');
blockResources.blockedTypes.add('media');
blockResources.blockedTypes.add('stylesheet');
Commonly blocked resource types include:
document
stylesheet
image
media
font
script
xhr
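Resource type names are plain strings, so a typo such as 'images' can go unnoticed while nothing gets blocked. As a guard, a small helper can validate names before they are used. This sketch is not part of the plugin: `toBlockedTypeSet` is a hypothetical name, and `KNOWN_TYPES` simply mirrors the list above, so extend it if you rely on other types.

```javascript
// Resource types from the list above.
const KNOWN_TYPES = new Set([
  'document', 'stylesheet', 'image', 'media', 'font', 'script', 'xhr',
]);

// Hypothetical helper: return a Set of valid types, or throw on unknown names.
function toBlockedTypeSet(types) {
  const unknown = types.filter((type) => !KNOWN_TYPES.has(type));
  if (unknown.length > 0) {
    throw new Error(`Unknown resource type(s): ${unknown.join(', ')}`);
  }
  return new Set(types);
}

const blocked = toBlockedTypeSet(['image', 'media', 'stylesheet']);
console.log([...blocked].join(', '));
```

The validated names can then be added to the plugin's `blockedTypes` one by one, which keeps misspelled types from slipping into the configuration.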
The main advantage of using plugins is simplicity and ease of use. The downside is that they can lack flexibility compared to directly intercepting requests.
Comparison of Blocking Approaches
Metric | Request Interception | Browser Plugins |
---|---|---|
Performance | Very fast | Adds some overhead |
Precision | Very precise | More coarse-grained |
Debugging | Limited support | Fully supported |
Implementation | More complex | Very simple |
In summary, manually intercepting requests allows precise control, while plugins offer simplicity. You can combine both approaches to take advantage of these strengths in your specific use case, as long as each request ends up being aborted or continued only once.
References
Smith, Lloyd. "Performance Impact of Blocking Network Requests." International Conference on Web Automation. 2021.
HTTP Archive. "Website Page Weight Trends." State of the Web Report. 2022.