Unlocking the Power of node-fetch for Web Scraping
Fetching and parsing web pages is essential for robust web scraping. As a proxy expert with over 10 years of JavaScript scraping experience, I often get asked about using the node-fetch module for this purpose.
In this comprehensive tutorial, you'll learn how to fully utilize node-fetch to extract data from complex sites.
Understanding the Fetch API
The Fetch API is a web standard that defines how browsers should handle HTTP requests and responses. It provides a simpler alternative to XMLHttpRequest (XHR).
Some key aspects:
- Uses promises for async code instead of callbacks
- Request and response handled by separate objects
- Request and response bodies exposed as streams for efficient data handling
- Defaults to GET but supports other HTTP methods
Node-fetch implements the Fetch API in Node.js so you can make requests just like in the browser.
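If node-fetch isn't already in your project, setup is a one-line install. One caveat on versions: the CommonJS require() style used throughout this tutorial assumes node-fetch v2, since v3 is ESM-only. A minimal sketch:

```js
// npm install node-fetch@2  (v3 is ESM-only and must be loaded with `import`)
const fetch = require('node-fetch');

// The simplest possible request, just like in the browser
fetch('https://example.com')
  .then(res => console.log(res.status, res.statusText));
```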
Why Choose node-fetch for Scraping?
Through building many scrapers over the years, I've found node-fetch has critical advantages:
- Lightweight – Minimal dependencies and very little overhead
- Promise-Based – Avoids callback hell for cleaner async code
- Familiar API – Mirrors the browser Fetch API for an easy crossover
- User Agent – Easily set the user-agent string to mimic browsers
- Encoding Support – Handles text decoding such as UTF-8 automatically
- Stability – Actively maintained and used by major projects
Now let's see it in action for some common web scraping tasks.
Making GET Requests
Use the fetch function directly with just the URL to make a GET request:
```js
const fetch = require('node-fetch');

const url = 'https://api.example.com/data';

fetch(url)
  .then(res => res.json())
  .then(json => console.log(json))
  .catch(err => console.error(err));
```
The first promise resolves to a Response object containing the headers, status code, etc. Call .json() to parse the response body as JSON automatically. Other body parsing methods are also available, such as .text() and .blob().
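One gotcha worth remembering: fetch only rejects on network failures, not on HTTP error statuses, so check res.ok (or res.status) yourself before parsing. A minimal sketch:

```js
fetch(url)
  .then(res => {
    // 4xx/5xx responses do NOT reject the promise
    if (!res.ok) {
      throw new Error(`Request failed with status ${res.status}`);
    }
    return res.json();
  })
  .then(json => console.log(json))
  .catch(err => console.error(err));
```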
Headers can be accessed directly on the response. For example:
```js
fetch(url).then(res => {
  console.log(res.headers.get('Content-Type'));
});
```
Customizing Requests with Options
An options object can be passed as the second parameter to customize the HTTP request:
```js
const options = {
  method: 'POST',
  headers: {
    Authorization: 'Bearer my-token'
  }
};

fetch(url, options);
```
Here are some common options:
| Parameter | Description |
|---|---|
| method | HTTP method: GET, POST, PUT, etc. Default: GET |
| headers | Object with request headers |
| redirect | Redirect handling: follow (default), manual, or error |
| size | Maximum response body size in bytes |
This allows full control over the requests for complex scenarios.
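For example, here is a rough sketch (assuming node-fetch v2, where the size option is supported) that caps the response size and handles redirects manually instead of following them:

```js
const options = {
  redirect: 'manual',          // return 3xx responses instead of following them
  size: 1024 * 1024,           // error out on bodies larger than 1 MB
  headers: { Accept: 'text/html' }
};

fetch(url, options).then(res => {
  if (res.status >= 300 && res.status < 400) {
    console.log('Redirects to:', res.headers.get('location'));
  }
});
```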
Handling POST Requests and Form Data
To make a POST request, set the method: 'POST' option.
For example, to submit form URL encoded data:
```js
const formData = {
  name: 'John Doe',
  email: '[email protected]'
};

const options = {
  method: 'POST',
  headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
  body: new URLSearchParams(formData)
};

fetch(url, options);
```
The body can also be set directly as a string for simple cases.
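For instance, a small, already URL-encoded payload can be passed as-is (the field names here are just placeholders):

```js
const options = {
  method: 'POST',
  headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
  body: 'query=node-fetch&page=1'  // pre-encoded form string
};

fetch(url, options);
```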
To POST JSON data, simply stringify the body:
```js
const bodyData = {
  some: 'json'
};

const options = {
  method: 'POST',
  // Set the Content-Type explicitly; plain string bodies default to text/plain
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(bodyData)
};

fetch(url, options);
```
Fetching and Parsing HTML
A common web scraping task is to fetch HTML content from a web page. We can pair node-fetch with a parser like cheerio to extract the data we want.
Here's an example fetching HTML and extracting all image src attributes:
```js
const fetch = require('node-fetch');
const cheerio = require('cheerio');

const url = 'https://example.com';

fetch(url)
  .then(res => res.text())
  .then(body => {
    const $ = cheerio.load(body);
    const images = $('img').map((i, img) => $(img).attr('src')).get();
    console.log(images);
  })
  .catch(err => console.error(err));
```
This technique works great for scraping content from any site. Fetch the HTML with node-fetch then use cheerio to query the DOM.
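The same pattern extends to any selector cheerio supports. As a rough sketch, here is how you might grab the page title and every link href (adjust the selectors for the site you are actually targeting):

```js
fetch(url)
  .then(res => res.text())
  .then(body => {
    const $ = cheerio.load(body);

    const title = $('title').text();
    const links = $('a')
      .map((i, a) => $(a).attr('href'))
      .get();

    console.log(title, links);
  });
```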
Parsing Responses
The response body can be consumed in several ways:
- .text() – Body as string
- .json() – Parse body as JSON
- .blob() – Body as a Blob
- .arrayBuffer() – Low-level buffer access
For buffers and streams, you can access the body directly:
```js
// Stream the body, e.g. into any writable stream
fetch(url).then(res => {
  res.body.pipe(someWritableStream);
});

// Or collect the raw chunks into a single Buffer
fetch(url).then(res => {
  const chunks = [];
  res.body.on('data', chunk => chunks.push(chunk));
  res.body.on('end', () => {
    const buffer = Buffer.concat(chunks);
  });
});
```
Note that attempting multiple body parsing methods on the same response will fail. Use only one approach per response.
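Streaming is especially useful for downloading large files without holding them in memory. A minimal sketch (assuming node-fetch v2, where res.body is a Node.js readable stream; the file URL is a placeholder):

```js
const fs = require('fs');

fetch('https://example.com/large-file.zip').then(res => {
  const dest = fs.createWriteStream('./large-file.zip');

  res.body.pipe(dest);
  res.body.on('error', err => console.error(err));
  dest.on('finish', () => console.log('Download complete'));
});
```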
Handling Cookie Persistence
Maintaining cookies across requests is vital for some scrapers.
The simplest approach is to parse set-cookie headers manually:
```js
let cookies;

fetch(url).then(res => {
  // headers.raw() exposes set-cookie as an array of strings in node-fetch v2
  cookies = res.headers
    .raw()['set-cookie']
    .map(c => c.split(';')[0])
    .join('; ');

  return fetch(url2, {
    headers: { Cookie: cookies }
  });
});
```
However, for robust cookie persistence, I recommend using an external module like tough-cookie. This handles cookie jars across sessions.
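As a rough sketch of that approach, tough-cookie's CookieJar can store the set-cookie headers from one response and produce the matching Cookie header for the next request (the URLs here are placeholders):

```js
const { CookieJar } = require('tough-cookie');

const jar = new CookieJar();
const loginUrl = 'https://example.com/login';

fetch(loginUrl).then(res => {
  // Store every set-cookie header in the jar
  const setCookies = res.headers.raw()['set-cookie'] || [];
  setCookies.forEach(c => jar.setCookieSync(c, loginUrl));

  // Attach the matching cookies on the follow-up request
  const nextUrl = 'https://example.com/account';
  return fetch(nextUrl, {
    headers: { Cookie: jar.getCookieStringSync(nextUrl) }
  });
});
```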
Avoiding Getting Blocked
When scraping large sites, getting blocked or flagged as a bot is a risk. Some key tactics I use:
- User-Agent Rotation – Use a module like random-useragent to set a random desktop or mobile browser agent on each request.
- Proxy Rotation – Rotate different proxies randomly per request (residential proxies work best).
- Slow Down – Add delays between requests and use a reasonable concurrency level.
I recommend using a commercial proxy service rather than scraping from your own IP address directly. The investment is worth avoiding the headaches of managing rotations at scale.
Here is an example proxy + user-agent rotator setup:
```js
const fetch = require('node-fetch');
const randomUseragent = require('random-useragent');

async function fetchWithRotation(url) {
  // getNextProxy() is your own rotation helper; it should return an
  // http(s).Agent configured for the next proxy (e.g. via https-proxy-agent)
  const proxyAgent = await getNextProxy();

  return fetch(url, {
    headers: {
      'User-Agent': randomUseragent.getRandom()
    },
    agent: proxyAgent
  });
}
```
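Putting it together, a simple crawl loop would call this helper with a delay between requests. A sketch (the URLs and two-second delay are just illustrative):

```js
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function crawl(urls) {
  for (const url of urls) {
    const res = await fetchWithRotation(url);
    console.log(url, res.status);

    // Throttle requests so the traffic looks less bot-like
    await sleep(2000);
  }
}

crawl(['https://example.com/page/1', 'https://example.com/page/2']);
```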
Conclusion
As you can see, node-fetch provides a complete solution for most common web scraping needs. Whether you are crawling sites, parsing pages, submitting forms, or handling APIs – it has you covered.
By leveraging node-fetch with other tools like cheerio, proxy services, and user-agent rotators, you can build scalable scrapers to extract huge datasets from complex sites without getting blocked.
To make it even easier, check out a web scraping platform like ScrapingBee which handles proxy/browser rotation, CAPTCHAs, and more out of the box.
Let me know if you have any other node-fetch techniques for automated scraping!