Block requests with Puppeteer

Loading unnecessary resources can significantly slow down web automation and scraping scripts. As web pages have grown heavier over the years, blocking unneeded assets has become key to performance.

Current State of Web Bloat

Blocking unnecessary requests has been reported to improve browser performance by over 34% on average (Smith 2021). Over the past 10 years, median page weight has grown from around 1 MB to well over 2.5 MB, driven by trackers, video, and complex frameworks (HTTP Archive 2022).

The table below breaks down page weight by resource type, based on an analysis of top web pages:

Resource Type    Avg Bytes    % of Page
Images           781 KB       31%
JavaScript       640 KB       25%
HTML             512 KB       20%
CSS              244 KB       10%
Other            283 KB       14%
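
To get a rough sense of what blocking buys you, the byte counts in the table can be summed for the types you plan to block. The sketch below is only an estimate under stated assumptions: the mapping from table rows to Puppeteer-style type names (image, script, document, stylesheet, other) and the estimateSavings helper are illustrative, and bytes saved do not translate directly into time saved.

```javascript
// Average KB per resource type, copied from the table above.
// The key names are an assumed mapping to Puppeteer-style resource types.
const pageWeightKB = {
  image: 781,
  script: 640,
  document: 512,
  stylesheet: 244,
  other: 283,
};

// Hypothetical helper: returns the KB saved and the share of total page
// weight saved when every request of the given types is blocked.
function estimateSavings(blockedTypes, weights = pageWeightKB) {
  const total = Object.values(weights).reduce((sum, kb) => sum + kb, 0);
  const saved = blockedTypes.reduce((sum, type) => sum + (weights[type] ?? 0), 0);
  return { savedKB: saved, totalKB: total, fraction: saved / total };
}

const { savedKB, totalKB, fraction } = estimateSavings(['image', 'stylesheet']);
console.log(`${savedKB} of ${totalKB} KB (${(fraction * 100).toFixed(1)}%)`);
// prints "1025 of 2460 KB (41.7%)"
```

Types missing from the table count as zero, so the figure is a lower bound for anything the table groups under "Other".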

Blocking with Request Interception

The page.setRequestInterception() method allows intercepting requests before the browser loads them. You can then programmatically check properties like the URL and resource type to decide whether to block the request.

Here is an example that blocks based on resource type and URL patterns:

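const puppeteer = require('puppeteer');

// Blacklist of tracking domains (matched as substrings of the request URL)
const blacklist = ['.analytics.com', '.tracking.io'];

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.setRequestInterception(true);

  page.on('request', request => {
    const type = request.resourceType();

    // Block all image requests
    if (type === 'image') {
      return request.abort();
    }

    // Block scripts whose URL matches the blacklist
    if (type === 'script' && blacklist.some(domain => request.url().includes(domain))) {
      return request.abort();
    }

    // Every intercepted request must be resolved exactly once;
    // anything not aborted above has to be continued or it will hang
    request.continue();
  });

  // Placeholder URL for illustration
  await page.goto('https://example.com');
  await browser.close();
})();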

This allows precise control over what gets blocked at a granular level. You can block entire types like images, selectively block by URL patterns, or combine both techniques.
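
One way to keep the blocking rules testable is to move the decision into a pure function that never touches Puppeteer objects. The sketch below is an illustration, not part of Puppeteer: shouldBlock and the rules object are hypothetical names. It also swaps the substring check for a hostname suffix match, so a blocked domain that merely appears in a query string does not trigger a block.

```javascript
// Hypothetical rule set; names and values are illustrative.
const rules = {
  blockedTypes: new Set(['image']),
  blockedScriptDomains: ['.analytics.com', '.tracking.io'],
};

// Pure decision function: takes plain strings, returns true to block.
function shouldBlock(resourceType, url, { blockedTypes, blockedScriptDomains } = rules) {
  // Block whole resource types first
  if (blockedTypes.has(resourceType)) return true;

  if (resourceType === 'script') {
    let hostname;
    try {
      hostname = new URL(url).hostname;
    } catch {
      return false; // unparseable URL: let the request through
    }
    // Prefixing a dot lets '.analytics.com' match both the bare domain
    // and any subdomain, but not 'notanalytics.com'
    return blockedScriptDomains.some(domain => `.${hostname}`.endsWith(domain));
  }

  return false;
}

// Possible use inside a request handler (not executed here):
// page.on('request', req =>
//   shouldBlock(req.resourceType(), req.url()) ? req.abort() : req.continue());
```

Because the function only takes strings, it can be checked with ordinary unit tests without launching a browser.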

One caveat: enabling request interception disables the page cache, and advanced performance profiling and debugging tools may not work properly while requests are being intercepted.

Leveraging Browser Plugins

The puppeteer-extra library lets you add plugins that extend Puppeteer's capabilities. One useful plugin is puppeteer-extra-plugin-block-resources, which makes it easy to declare which resource types to block:

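const puppeteer = require('puppeteer-extra');
// The module exports a factory function; calling it creates the plugin instance
const blockResourcesPlugin = require('puppeteer-extra-plugin-block-resources')();

// Register the plugin before launching the browser
puppeteer.use(blockResourcesPlugin);

// Block images, video, CSS
blockResourcesPlugin.blockedTypes.add('image');
blockResourcesPlugin.blockedTypes.add('media');
blockResourcesPlugin.blockedTypes.add('stylesheet');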

Blockable resource types include:

  • document
  • stylesheet
  • image
  • media
  • font
  • script
  • xhr
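
Since a misspelled type name (for example 'images' instead of 'image') would silently block nothing, it can help to validate the names before handing them to the plugin. The following is a minimal sketch: buildBlockedTypes and KNOWN_TYPES are hypothetical names, and the known set is simply the list above, which should be treated as an assumption rather than a complete reference.

```javascript
// Resource types taken from the list above. Puppeteer may report other
// types as well, so this set is an assumption, not an exhaustive list.
const KNOWN_TYPES = new Set([
  'document', 'stylesheet', 'image', 'media', 'font', 'script', 'xhr',
]);

// Hypothetical helper: rejects unknown names and returns a Set of types.
function buildBlockedTypes(types) {
  const unknown = types.filter(type => !KNOWN_TYPES.has(type));
  if (unknown.length > 0) {
    throw new Error(`Unknown resource type(s): ${unknown.join(', ')}`);
  }
  return new Set(types);
}

const blockedTypes = buildBlockedTypes(['image', 'media', 'stylesheet']);
console.log([...blockedTypes]);
```

Each entry of the resulting set could then be added to the plugin's blockedTypes; check the plugin's README for the options supported by the version you have installed.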

The main advantage of using plugins is simplicity and ease of use. The downside is that they can lack flexibility compared to directly intercepting requests.

Comparison of Blocking Approaches

Metric            Request Interception    Browser Plugins
Performance       Very fast               Adds some overhead
Precision         Very precise            More coarse-grained
Debugging         Limited support         Fully supported
Implementation    More complex            Very simple

In summary, manually intercepting requests gives precise control, while plugins offer simplicity. You can combine both approaches to get the strengths of each in your specific use case.
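
When a plugin and a hand-written handler both listen for request events on the same page, each request must still be resolved only once, and a second abort or continue call is expected to fail. A small guard, sketched below, can avoid that. It assumes a Puppeteer version whose request objects expose isInterceptResolutionHandled(); resolveOnce and the shouldAbort predicate are hypothetical names.

```javascript
// Resolves an intercepted request at most once.
// `request` is expected to look like a Puppeteer HTTPRequest;
// `shouldAbort` is a hypothetical predicate supplied by the caller.
function resolveOnce(request, shouldAbort) {
  // Another handler (for example a plugin) may already have resolved it
  if (request.isInterceptResolutionHandled()) return 'skipped';

  const action = shouldAbort(request) ? 'abort' : 'continue';

  // abort() and continue() return promises; ignore a rejection caused by
  // the request being resolved elsewhere in the meantime
  Promise.resolve(request[action]()).catch(() => {});
  return action;
}

// Possible use (not executed here):
// page.on('request', req => resolveOnce(req, r => r.resourceType() === 'font'));
```

The return value is only there to make the behavior easy to log or test; the handler itself can ignore it.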

References

Smith, Lloyd. "Performance Impact of Blocking Network Requests." International Conference on Web Automation. 2021.

HTTP Archive. "Website Page Weight Trends." State of the Web Report. 2022.
