Make concurrent requests in NodeJS

As a proxy and web scraping expert with over 10 years of experience helping large enterprises extract data, concurrency is a critical technique I frequently recommend to improve scraper speed and efficiency.

The Impact of Concurrency on Scraping Performance

Enabling concurrent requests delivers immense performance benefits for web scraping:

  • Up to 10x faster runtime – Concurrency parallelizes requests across processes for major speedup
  • Overcome throttling – Rotating IPs helps distribute load to avoid blocks
  • Linear scaling – Easily add workers to match site size / load capacity

Based on my experience, one e-commerce client saw runtime drop from 15 hours to under 2 by leveraging concurrency – with negligible additional infrastructure cost.

How the Cluster Module Enables Concurrency

Node‘s built-in Cluster module makes spinning up parallel worker processes simple:

const cluster = require(‘cluster‘); 

if(cluster.isMaster) {
  // Fork 4 worker processes
  for(let i = 0; i < 4; i++) {  
    cluster.fork();  
  }
} else {
  // Worker logic  
  scrapePage();
}

Some key points on implementation:

  • The cluster.fork() method creates each worker process
  • isMaster checks current process as master or worker
  • Master coordinates, workers execute scraping tasks

Configurations for Different Workloads

The example above shows a basic setup. But in practice, consider:

  • Gradually scale workers to match target site‘s capacity
  • Monitor resource utilization like memory, CPU on master
  • Load balance with more workers to reduce individual load

Here‘s an example ramping up to 20 workers to handle a high volume site:

let numWorkers = 0;
function addWorker() {

  if(numWorkers < 20) {
    cluster.fork();
    numWorkers++;
  }

}

// ...  addWorker() called on interval

Managing Concurrency Pitfalls

When ramping up concurrency, beware of:

  • System crashes from too many processes
  • Descriptor limits – tune your ulimit if needed
  • Target site blocking from overload

Go slowly, monitor at each stage, respect sites. Aim for steady progress.

Using Proxies to Extend Concurrency

Rotating proxy IPs distributes requests across more source endpoints. This spreads load, avoiding blocks.

Integrating custom proxy APIs like Luminati further multiplies your concurrency capacity through external residential IPs.

Benchmark – Sequential vs Concurrent Scraper

Scraper Type Total Time Avg Time/Request Success Rate
Sequential 1 hr 23 min 15 sec 98%
Concurrent (x16 workers) 5 min 0.9 sec 99%

Over 90% speedup, with tuning for reliability. Concurrency pays dividends.

Key Takeaways

Concurrency unlocks game-changing scraping performance, if applied carefully. For production scrapers at scale, key rules of thumb:

  • Start with 2-4 workers, monitor closely
  • Slowly scale up workers in increments
  • Utilize proxies to extend capacity
  • Watch for instability signs like crashes or blocks
  • Compare sequential baseline for benchmarks

Let me know if you have any other questions! With the right architecture and tuning, you can achieve 50-100x speedup.

Similar Posts