Use Scrapy for Web Scraping in Python

Web scraping is the process of extracting data from websites programmatically. It allows you to gather information at scale, automate tedious tasks, and create datasets for analysis and machine learning. While you can write web scrapers from scratch using libraries like requests and BeautifulSoup, the Scrapy framework makes it much easier to build and scale robust scrapers in Python.

In this comprehensive guide, we'll cover everything you need to know to harness the power of Scrapy for web scraping. Whether you're a beginner looking to extract data from a site or an experienced developer needing to scrape at scale, Scrapy provides the tools to get the job done efficiently. Let's dive in!

Why Use Scrapy for Web Scraping?

Scrapy is an open-source framework for writing web spiders that crawl websites and extract structured data. It simplifies many tasks associated with scraping and provides a complete ecosystem for developing and running scalable crawlers. Here are some key advantages of using Scrapy:

  1. Productivity: Scrapy takes care of common scraping tasks like handling requests, parsing responses, managing concurrency, and more. This lets you focus on the unique aspects of each site you're scraping.

  2. Performance: Scrapy is optimized for performance, utilizing asynchronous requests to fetch pages quickly. It can easily scrape hundreds of pages per minute on a single machine.

  3. Extensibility: Scrapy has a pluggable architecture that allows you to easily extend and customize its functionality. There are many extensions and middlewares available for tasks like logging, caching, statistics, etc.

  4. Built-in support for exporting data: Scrapy makes it easy to save extracted data to files or databases in various formats like JSON, CSV, XML.

  5. Community and ecosystem: Scrapy has excellent documentation and an active community that contributes extensions, maintains tutorials, and provides support.

Getting Started with Scrapy

Before we start scraping, let's set up Scrapy and create a new project. First, make sure you have Python installed, preferably Python 3.6+. Then install Scrapy using pip:

pip install scrapy

Now create a new Scrapy project with the startproject command:

scrapy startproject myproject

This generates the following file structure:

myproject/
    scrapy.cfg            # deploy config file
    myproject/            
        __init__.py
        items.py          # data models
        middlewares.py    # project middlewares
        pipelines.py      # data pipelines
        settings.py       # project settings
        spiders/          # spiders directory
            __init__.py

The most important components are:

  • spiders/: This directory contains the spiders, which are classes that define how a site will be scraped.

  • items.py: Defines the data models that spiders will populate with scraped data (see the sketch after this list).

  • pipelines.py: Defines Item Pipelines, which are used to process items after they have been scraped.

  • settings.py: Contains configuration settings for the project.
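
For example, items.py could declare a simple data model for the quotes scraped later in this guide. The QuoteItem name and its fields below are an illustrative sketch, not something the project template generates for you:

import scrapy

class QuoteItem(scrapy.Item):
    # one Field per attribute a spider will populate
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()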

With the project set up, we're ready to write our first spider!

Writing Your First Scrapy Spider

Spiders are classes that define how a website should be scraped. They consist of:

  • A name
  • A starting list of URLs
  • A method for extracting links to follow
  • A method for extracting data from pages

Let's create a basic spider that scrapes quotes from https://quotes.toscrape.com/. In the spiders directory, create a file named quotes_spider.py:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Let's break this down:

  • The spider subclasses scrapy.Spider and defines a name and start_urls.
  • The parse method is called with the response of each URL in start_urls.
  • We use CSS selectors to find quotes on the page and yield Python dicts with the extracted data (an Item-based variant is sketched after this list).
  • If there's a next page link, we yield a new Request to recursively follow it.
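
If you declared the QuoteItem sketched earlier in items.py, the spider could yield Item instances instead of plain dicts. The import path below assumes the project layout shown above:

from myproject.items import QuoteItem

# inside parse(), replacing the dict yield
yield QuoteItem(
    text=quote.css('span.text::text').get(),
    author=quote.css('small.author::text').get(),
    tags=quote.css('div.tags a.tag::text').getall(),
)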

To run this spider:

scrapy crawl quotes

You should see the scraped quotes printing to the console. Congratulations, you just created your first spider!

Storing Extracted Data

While printing to the console is fine for debugging, you'll typically want to save the extracted data to a file or database. Scrapy supports this out of the box with Feed Exports. To save the quotes to a JSON file:

scrapy crawl quotes -o quotes.json
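
Instead of passing -o on every run, recent Scrapy versions (2.1+) also let you configure exports in settings.py with the FEEDS setting. A minimal sketch, using an arbitrary output filename:

FEEDS = {
    'quotes.json': {'format': 'json'},
}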

For more advanced data processing and storage, Scrapy provides Item Pipelines. Pipelines are Python classes that handle items after they have been scraped. Typical uses include:

  • Cleansing HTML data
  • Validating scraped data
  • Checking for duplicates
  • Storing items in a database

Let's create a simple pipeline to convert the author names to uppercase. In pipelines.py:

class UppercasePipeline:

    def process_item(self, item, spider):
        item['author'] = item['author'].upper()
        return item

And enable it in settings.py:

ITEM_PIPELINES = {
    'myproject.pipelines.UppercasePipeline': 300,
}

Now when you run the spider, the author names will be uppercased before being stored.
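
As a sketch of the database-storage use case, the pipeline below writes each quote to a local SQLite file. The quotes.db filename and table schema are illustrative, not part of the project template:

import sqlite3

class SQLitePipeline:
    def open_spider(self, spider):
        # called once when the spider starts: open the DB and ensure the table exists
        self.conn = sqlite3.connect('quotes.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT, tags TEXT)'
        )

    def close_spider(self, spider):
        # called once when the spider finishes: persist and clean up
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # store one row per scraped quote; tags are flattened to a comma-separated string
        self.conn.execute(
            'INSERT INTO quotes VALUES (?, ?, ?)',
            (item['text'], item['author'], ','.join(item['tags'])),
        )
        return item

Like the uppercase example, this pipeline only runs if you add it to ITEM_PIPELINES with a priority number.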

Selector Expressions

Scrapy uses CSS and XPath expressions to locate elements on web pages and extract data from them. While we've been using CSS so far, XPath is a more powerful language that can handle more complex extraction tasks.

For example, to select the text of the first <p> sibling that follows an <h1> using XPath:

response.xpath('//h1/following-sibling::p[1]/text()').get()

CSS can express the common case where the <p> immediately follows the <h1> even more concisely:

response.css('h1 + p::text').get()

XPath, however, can also filter on text content and navigate back up to parent elements, which CSS selectors cannot do.

Scrapy's Selector docs provide a good introduction to using CSS and XPath for scraping.
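
An easy way to experiment with selectors is the interactive scrapy shell, which fetches a page and drops you into a Python prompt with the response object available. For example, against the quotes site used earlier, both of these return the first author name:

scrapy shell 'https://quotes.toscrape.com/page/1/'

>>> response.css('small.author::text').get()
>>> response.xpath('//small[@class="author"]/text()').get()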

Handling Pagination and Link Following

Many websites spread content across multiple pages. To scrape all the data, you need to find the "Next" links and follow them recursively.

As we saw earlier, you can recursively follow links by yielding a Request in the parse callback:

next_page = response.css('li.next a::attr(href)').get()
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)

Scrapy handles the scheduling and execution of these Requests, making it easy to crawl sites deeply.
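
A small shortcut: response.follow resolves relative URLs for you, so the snippet above can skip the urljoin call:

next_page = response.css('li.next a::attr(href)').get()
if next_page is not None:
    # response.follow builds an absolute URL from the relative href
    yield response.follow(next_page, callback=self.parse)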

Avoiding Getting Blocked

When scraping, it's important to be respectful and avoid overloading servers. Scrapy has several built-in extensions to help with this:

  • AutoThrottle adjusts the scraping speed based on the website's response times
  • HTTP Caching avoids repeating requests for pages already downloaded
  • User Agent spoofing makes the scraper look like a normal web browser

To enable AutoThrottle and set a custom User Agent, add this to settings.py:

AUTOTHROTTLE_ENABLED = True
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'
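
The throttling and caching behaviour can be tuned further in settings.py. The values below are an illustrative, fairly conservative sketch rather than recommended numbers:

# be polite: honour robots.txt and space requests out
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# cache responses locally so repeated runs don't re-download pages
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600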

For more advanced cases, Scrapy supports using proxy servers to distribute requests. There are extensions like scrapy-rotating-proxies that make it easy to use multiple proxies.

Dealing with Login Forms and Cookies

Some websites require logging in to access certain pages. Scrapy can handle this by submitting login forms and persisting cookies across requests.

Here's an example spider that logs into a site:

class LoginSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check whether the login succeeded before continuing
        if "authentication failed" in response.text:
            self.logger.error("Login failed")
            return

        # continue scraping
        return scrapy.Request(url="https://example.com/dashboard", callback=self.parse_dashboard)

    def parse_dashboard(self, response):
        # scrape the dashboard page
        ...

The key steps are:

  1. Start at the login page URL
  2. Submit the login form using FormRequest.from_response
  3. Check that the login succeeded in the after_login callback
  4. If login succeeded, continue scraping by yielding more Requests

Scrapy will automatically persist cookies across requests after the login, so subsequent pages can be accessed.

Conclusion

We've covered the fundamentals of using Scrapy for web scraping in Python, including:

  • Setting up a Scrapy project
  • Writing spiders to crawl and extract data
  • Using CSS and XPath selectors
  • Storing scraped data
  • Handling pagination and logins
  • Avoiding getting blocked

Scrapy is a powerful tool that makes it easy to write and scale web scrapers. It has many more features and customization options than we could cover here. Consult the excellent Scrapy documentation to go deeper.

When scraping, always be mindful to respect websites' terms of service and robots.txt. Avoid making too many requests too quickly, and don't scrape any private user data.

Used ethically, web scraping can be an invaluable tool for gathering data for analysis, machine learning, market research, and more. With Scrapy in your toolkit, you're well-equipped to scrape the web efficiently and effectively!
