Web Scraping with Groovy

Groovy is a powerful JVM-based language that makes web scraping easy and enjoyable. In this comprehensive guide, we'll cover everything you need to know to scrape the web with Groovy, from sending basic HTTP requests to controlling headless browsers.

Why Use Groovy for Web Scraping?

There are a few key reasons why Groovy is an excellent choice for web scraping:

  • Succinct syntax: Groovy strips away much of Java's boilerplate, letting you write scraper code that's clean and readable. Getters and setters are generated for you, semicolons and return statements are optional, and annotations like @Immutable build whole value classes.

  • Runs on the JVM: You get access to all of Java's libraries and tools while writing code more quickly. This includes libraries like Selenium for browser automation.

  • Built-in JSON support: Parsing and generating JSON is handled seamlessly by the groovy.json classes that ship with the language, so no extra libraries are needed (see the snippet after this list).

  • Integrates with Java: Since Groovy runs on the JVM, integrating scraped data into Java apps is trivial.

  • Grape dependency manager: Adding libraries like Jodd HTTP is as easy as adding an annotation. No build configuration needed.
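
To illustrate the built-in JSON support mentioned above, here is a minimal snippet that round-trips a map through JSON using only classes bundled with Groovy:

import groovy.json.JsonOutput
import groovy.json.JsonSlurper

// groovy.json ships with the language itself -- no @Grab or extra dependency needed
def json = JsonOutput.toJson([name: 'John', age: 22])
def parsed = new JsonSlurper().parseText(json)

assert parsed.name == 'John'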

Let's look at some examples to see the power of Groovy in action.

Sending HTTP Requests with Jodd HTTP

For basic HTTP requests, Jodd HTTP paired with Groovy can't be beat:

@Grab('org.jodd:jodd-http:6.2.1')
import jodd.http.HttpRequest

def response = HttpRequest.get('https://example.com').send()
println(response.bodyText())

Making POST requests and dealing with JSON is just as straightforward:

@Grab('org.jodd:jodd-http:6.2.1')
import jodd.http.HttpRequest
import groovy.json.JsonOutput
import groovy.json.JsonSlurper

def data = [name: 'John', age: 22]

// Send the map as a JSON body, then parse the JSON that comes back
def response = HttpRequest
    .post('https://example.com/users')
    .contentType('application/json')
    .body(JsonOutput.toJson(data))
    .send()

def user = new JsonSlurper().parseText(response.bodyText())
println(user.id)

Jodd HTTP handles connection management, cookies, and other low-level details behind the scenes so you can focus on the essential scraper code.
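
Query parameters and request headers can be chained onto the same builder. Here is a minimal sketch; the URL, parameter, and header values are placeholders for whatever site you are scraping:

@Grab('org.jodd:jodd-http:6.2.1')
import jodd.http.HttpRequest

// Placeholder URL and values -- swap in your real target
def response = HttpRequest
    .get('https://example.com/search')
    .query('q', 'groovy')                       // appended as ?q=groovy
    .header('User-Agent', 'my-groovy-scraper')  // custom request header
    .send()

println response.statusCode()
println response.bodyText()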

Parsing HTML with Jodd Lagarto

For more complex sites, you'll want to parse and traverse HTML to extract the data you need. This is where Jodd Lagarto comes in handy.

Let's walk through scraping some comments from Hacker News:

@Grab('org.jodd:jodd-http:6.2.1')
@Grab('org.jodd:jodd-lagarto:6.0.6')
import jodd.jerry.Jerry
import jodd.http.HttpRequest

def response = HttpRequest.get('https://news.ycombinator.com/item?id=1').send()
def document = Jerry.of(response.bodyText())

def comments = []

// Each comment row holds the commenter's username and the comment text
document.find('table.comment-tree tr.comtr').each { tr ->
    def id = tr.find('.hnuser').text()
    def comment = tr.find('.comment').text()

    comments << [id: id, comment: comment]
}

println comments

We first fetch the page HTML, then use Lagarto to build a searchable DOM document. CSS selectors give us an easy way to extract the comment data into a clean list of maps.

Jerry also supports the full range of CSS3 selectors plus jQuery-style traversal methods like parent(), children(), and siblings() when a single selector isn't flexible enough.
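
Text isn't the only thing you can pull out; attribute values are available through attr(). Here is a minimal sketch that collects every link target on a page, with example.com standing in for your target site:

@Grab('org.jodd:jodd-http:6.2.1')
@Grab('org.jodd:jodd-lagarto:6.0.6')
import jodd.jerry.Jerry
import jodd.http.HttpRequest

def html = HttpRequest.get('https://example.com').send().bodyText()
def doc = Jerry.of(html)

// attr() reads an attribute value from the matched element
def links = []
doc.find('a').each { a ->
    links << a.attr('href')
}
println links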

Browser Automation with Selenium

While Jodd HTTP and Lagarto work great for simple sites, many modern web apps require full JavaScript support. For these cases, Selenium running Chrome in headless mode is the way to go:

@Grab('org.seleniumhq.selenium:selenium-chrome-driver:4.3.0')
import org.openqa.selenium.By
import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.chrome.ChromeOptions

// Run Chrome without a visible window (requires a chromedriver binary on your PATH)
def options = new ChromeOptions()
options.addArguments('--headless')

def driver = new ChromeDriver(options)

driver.get('https://twitter.com')

driver.findElement(By.cssSelector("input[name='session[username_or_email]']")).sendKeys('myuser')
driver.findElement(By.cssSelector("input[name='session[password]']")).sendKeys('1234')

driver.findElement(By.cssSelector('div[data-testid=LoginForm_Login_Button]')).click()

// Continue scraping...

driver.quit()

Here we:

  • Launch headless Chrome with Selenium
  • Navigate to Twitter and log in
  • Locate elements by CSS selector
  • Send keystrokes and trigger clicks

This gives you the full power of a real browser minus the GUI. Selenium can also drive Firefox and Edge in headless mode.
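
JavaScript-heavy pages often render their content after the initial load, so it pays to wait for elements explicitly instead of scraping right away. Here is a minimal sketch using Selenium's WebDriverWait; it grabs the selenium-java artifact (which bundles the support classes), assumes a chromedriver binary is on your PATH, and uses example.com and an h1 selector as placeholders for your real target:

@Grab('org.seleniumhq.selenium:selenium-java:4.3.0')
import org.openqa.selenium.By
import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.chrome.ChromeOptions
import org.openqa.selenium.support.ui.ExpectedConditions
import org.openqa.selenium.support.ui.WebDriverWait
import java.time.Duration

def options = new ChromeOptions()
options.addArguments('--headless')   // no visible browser window

def driver = new ChromeDriver(options)
driver.get('https://example.com')

// Wait up to 10 seconds for the element to become visible before reading it
def wait = new WebDriverWait(driver, Duration.ofSeconds(10))
def heading = wait.until(ExpectedConditions.visibilityOfElementLocated(By.cssSelector('h1')))
println heading.text

driver.quit()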

Ready to Start Scraping?

We've just scratched the surface of Groovy's capabilities when it comes to web scraping. Its succinct syntax, smooth integration with Java, and easy access to libraries like Jodd HTTP and Selenium make it a reliable choice.

The examples above should provide a solid foundation for you to start writing your own scrapers. As always, feel free to reach out to us with any other questions!
