Use Scrapy for Web Scraping in Python
Web scraping is the process of extracting data from websites programmatically. It allows you to gather information at scale, automate tedious tasks, and create datasets for analysis and machine learning. While you can write web scrapers from scratch using libraries like requests and BeautifulSoup, the Scrapy framework makes it much easier to build and scale robust scrapers in Python.
In this comprehensive guide, we'll cover everything you need to know to harness the power of Scrapy for web scraping. Whether you're a beginner looking to extract data from a site or an experienced developer needing to scrape at scale, Scrapy provides the tools to get the job done efficiently. Let's dive in!
Why Use Scrapy for Web Scraping?
Scrapy is an open-source framework for writing web spiders that crawl websites and extract structured data. It simplifies many tasks associated with scraping and provides a complete ecosystem for developing and running scalable crawlers. Here are some key advantages of using Scrapy:
- Productivity: Scrapy takes care of common scraping tasks like handling requests, parsing responses, managing concurrency, and more. This lets you focus on the unique aspects of each site you're scraping.
- Performance: Scrapy is optimized for speed, using asynchronous requests to fetch many pages concurrently. It can easily scrape hundreds of pages per minute on a single machine.
- Extensibility: Scrapy has a pluggable architecture that lets you extend and customize its functionality. Many extensions and middlewares are available for tasks like logging, caching, and statistics collection.
- Built-in support for exporting data: Scrapy makes it easy to save extracted data to files or databases in formats like JSON, CSV, and XML.
- Community and ecosystem: Scrapy has excellent documentation and an active community that contributes extensions, maintains tutorials, and provides support.
Getting Started with Scrapy
Before we start scraping, let's set up Scrapy and create a new project. First, make sure you have Python installed, preferably Python 3.6+. Then install Scrapy using pip:
pip install scrapy
Now create a new Scrapy project with the startproject command:
scrapy startproject myproject
This generates the following file structure:
myproject/
    scrapy.cfg            # deploy config file
    myproject/
        __init__.py
        items.py          # data models
        middlewares.py    # project middlewares
        pipelines.py      # data pipelines
        settings.py       # project settings
        spiders/          # spiders directory
            __init__.py
The most important components are:
- spiders/: This directory contains the spiders, which are classes that define how a site will be scraped.
- items.py: Defines the data models that spiders will populate with scraped data (see the example below).
- pipelines.py: Defines Item Pipelines, which are used to process items after they have been scraped.
- settings.py: Contains configuration settings for the project.
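As a quick illustration of the data models in items.py, here is what an item for the quote data scraped later in this guide could look like. This is just a sketch; the spiders below yield plain Python dicts instead, so defining an Item class is optional:

import scrapy

class QuoteItem(scrapy.Item):
    # one Field per piece of data the spider will extract
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()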
With the project set up, we're ready to write our first spider!
Writing Your First Scrapy Spider
Spiders are classes that define how a website should be scraped. They consist of:
- A name
- A starting list of URLs
- A method for extracting links to follow
- A method for extracting data from pages
Let's create a basic spider that scrapes quotes from https://quotes.toscrape.com/. In the spiders directory, create a file named quotes_spider.py:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
Let's break this down:
- The spider subclasses scrapy.Spider and defines a name and start_urls.
- The parse method is called with the response of each URL in start_urls.
- We use CSS selectors to find quotes on the page and yield Python dicts with the extracted data.
- If there's a next page link, we yield a new Request to recursively follow it.
To run this spider:
scrapy crawl quotes
You should see the scraped quotes printing to the console. Congratulations, you just created your first spider!
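Tip: before hard-coding selectors in a spider, you can try them out interactively with Scrapy's shell. For example:

scrapy shell 'https://quotes.toscrape.com/page/1/'

Inside the shell a response object for that page is available, so expressions like response.css('span.text::text').getall() can be tested immediately.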
Storing Extracted Data
While printing to the console is fine for debugging, you'll typically want to save the extracted data to a file or database. Scrapy supports this out of the box with Feed Exports. To save the quotes to a JSON file:
scrapy crawl quotes -o quotes.json
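Exports can also be configured in settings.py rather than on the command line. A minimal sketch, assuming a reasonably recent Scrapy version that supports the FEEDS setting (the file names here are just examples):

FEEDS = {
    'quotes.json': {'format': 'json', 'overwrite': True},
    'quotes.csv': {'format': 'csv'},
}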
For more advanced data processing and storage, Scrapy provides Item Pipelines. Pipelines are Python classes that handle items after they have been scraped. Typical uses include:
- Cleansing HTML data
- Validating scraped data
- Checking for duplicates
- Storing items in a database
Let's create a simple pipeline to convert the author names to uppercase. In pipelines.py:
class UppercasePipeline:
    def process_item(self, item, spider):
        item['author'] = item['author'].upper()
        return item
And enable it in settings.py:
ITEM_PIPELINES = {
    'myproject.pipelines.UppercasePipeline': 300,
}
Now when you run the spider, the author names will be uppercased before being stored.
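The same mechanism covers the other uses listed above. As one more sketch, here is a pipeline that drops duplicate quotes; it assumes the 'text' field yielded by the quotes spider and would need its own entry in ITEM_PIPELINES:

from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    def __init__(self):
        self.seen_texts = set()

    def process_item(self, item, spider):
        # discard any quote whose text has already been seen during this crawl
        if item['text'] in self.seen_texts:
            raise DropItem('Duplicate quote found')
        self.seen_texts.add(item['text'])
        return item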
Selector Expressions
Scrapy uses CSS and XPath expressions to locate elements on web pages and extract data from them. While we've been using CSS so far, XPath is a more powerful language that can handle more complex extraction tasks.
For example, to select the text of the first <p> element that follows an <h1> using XPath:

response.xpath('//h1/following-sibling::p[1]/text()').get()

If the <p> sits immediately after the <h1>, a CSS selector can do the same job more concisely:

response.css('h1 + p::text').get()

XPath, however, can express things CSS cannot, such as matching elements by their text content or navigating up to parent elements.
Scrapy's Selector docs provide a good introduction to using CSS and XPath for scraping.
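As a quick reference, here are pairs of CSS and XPath expressions that extract the same data from the quotes page used earlier (the selectors assume that page's markup):

# all quote texts on the page
response.css('span.text::text').getall()
response.xpath('//span[@class="text"]/text()').getall()

# all link targets on the page
response.css('a::attr(href)').getall()
response.xpath('//a/@href').getall()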
Handling Pagination and Link Following
Many websites spread content across multiple pages. To scrape all the data, you need to find the "Next" links and follow them recursively.
As we saw earlier, you can recursively follow links by yielding a Request in the parse callback:
next_page = response.css('li.next a::attr(href)').get()
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)
Scrapy handles the scheduling and execution of these Requests, making it easy to crawl sites deeply.
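A slightly more concise variant uses response.follow, which accepts relative URLs directly, so the urljoin step can be dropped:

next_page = response.css('li.next a::attr(href)').get()
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)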
Avoiding Getting Blocked
When scraping, it's important to be respectful and avoid overloading servers. Scrapy has several built-in extensions to help with this:
- AutoThrottle adjusts the scraping speed based on the website's response times
- HTTP Caching avoids repeating requests for pages already downloaded
- User Agent spoofing makes the scraper look like a normal web browser
To enable AutoThrottle and set a custom User Agent, add this to settings.py:
AUTOTHROTTLE_ENABLED = True
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'
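The caching and politeness behaviour mentioned above is controlled by related settings. The values below are illustrative starting points rather than recommendations:

# wait between requests and respect robots.txt
DOWNLOAD_DELAY = 1
ROBOTSTXT_OBEY = True

# cache responses locally so repeated runs don't re-download the same pages
HTTPCACHE_ENABLED = True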
For more advanced cases, Scrapy supports using proxy servers to distribute requests. There are extensions like scrapy-rotating-proxies that make it easy to use multiple proxies.
Dealing with Login Forms and Cookies
Some websites require logging in to access certain pages. Scrapy can handle this by submitting login forms and persisting cookies across requests.
Here's an example spider that logs into a site:
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        # fill in and submit the login form found on the page
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check whether the login succeeded before continuing
        if "authentication failed" in response.text:
            self.logger.error("Login failed")
            return

        # continue scraping pages that require authentication
        return scrapy.Request(url="https://example.com/dashboard", callback=self.parse_dashboard)

    def parse_dashboard(self, response):
        # scrape the dashboard page
        ...
The key steps are:
- Start at the login page URL
- Submit the login form using FormRequest.from_response
- Check that the login succeeded in the after_login callback
- If login succeeded, continue scraping by yielding more Requests
Scrapy will automatically persist cookies across requests after the login, so subsequent pages can be accessed.
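One practical note: rather than hard-coding credentials, you can pass them in as spider arguments, which Scrapy sets as attributes on the spider instance (so the formdata above could reference self.username and self.password):

scrapy crawl example -a username=john -a password=secret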
Conclusion
We've covered the fundamentals of using Scrapy for web scraping in Python, including:
- Setting up a Scrapy project
- Writing spiders to crawl and extract data
- Using CSS and XPath selectors
- Storing scraped data
- Handling pagination and logins
- Avoiding getting blocked
Scrapy is a powerful tool that makes it easy to write and scale web scrapers. It has many more features and customization options than we could cover here. Consult the excellent Scrapy documentation to go deeper.
When scraping, always be mindful to respect websites' terms of service and robots.txt. Avoid making too many requests too quickly, and don't scrape any private user data.
Used ethically, web scraping can be an invaluable tool for gathering data for analysis, machine learning, market research, and more. With Scrapy in your toolkit, you‘re well-equipped to scrape the web efficiently and effectively!