How to Build an Intelligent Job Board with Web Scraping and ChatGPT

Job searching is a frustrating experience for most candidates. Generic boards like Indeed and LinkedIn serve millions of listings, so finding a relevant opening can feel like searching for a needle in a haystack.

According to 2021 statistics, the average time to hire in the US is 36 days, a delay driven in part by poor search and alerting tools. Tailored niche boards serve smaller markets far better.

But building such focused boards requires huge manual effort to collect and structure relevant listings. This guide shows how to combine web scraping and AI to create intelligent, custom job boards with minimal effort.

The Problem with Current Job Boards

Indeed and LinkedIn host hundreds of millions of listings in their bid to be ubiquitous. Combined with naive keyword search, this produces a poor signal-to-noise ratio.

According to recent surveys, 74% of job seekers are unhappy with current boards and portals. The main reasons include:

  • Irrelevant listings: Generic search leaves users sifting through junk
  • No custom alerts: Can't save searches or get notifications
  • Outdated listings: Stale jobs not removed quickly
  • Poor location filters: Results not tuned to proximity

There is a clear need for niche boards catering to specific demographics and job types. For instance, a portal dedicated to finance openings in New York would serve many candidates better.

| Problem | % of Users Affected |
| --- | --- |
| Too many irrelevant listings | 63% |
| No option for alerts | 58% |
| Listings not updated quickly | 53% |
| Ineffective location filters | 47% |

Source: 2022 ZipRecruiter survey of 1,000+ job seekers

But building such focused, high-quality boards poses several technology and data challenges:

  • Collecting relevant and recent listings
  • Parsing details accurately from raw listings
  • Constant updating to keep listings current
  • Smart interfaces like custom alerts and filters

This is where automated web scraping combined with AI systems like ChatGPT can help tremendously.

Why Web Scraping is the Right Strategic Choice

For any custom job board, the first step is collecting enough targeted listings to offer value. Web scraping provides a scalable way to extract thousands of openings from multiple sites.

Structured Data Extraction

Unlike general web crawling, scrapers can parse details selectively from pages using underlying HTML structure and attributes.

For instance, key fields within a listing page can be extracted as:

title: Software Engineer
company: Acme Inc.    
category: Technology > IT > Software    
salary: $90k - $120k  
posted_on: 2022-05-16   
description: Acme Inc. is hiring experienced backend developers... 

Scales to Large Volumes

Once scrapers are set up correctly with rules tailored to a site, they scale seamlessly to extract hundreds or thousands of listings.

This enables quick ramp-up for new custom boards, without extensive manual effort.

Handles Site Changes

Unlike one-time downloads, scrapers keep running periodically to check for updates. So new jobs get added, while closed ones are dropped automatically.

They also gracefully handle minor layout changes on underlying sites without breaking entirely.

Well-configured scrapers form an automated ETL pipeline, pulling relevant data from thousands of live sources daily into analytics-ready structures.

![Web scraper ETL process: extracting job listings into databases]

Now that the value of web scraping is clear, let's move on to the practical aspects of building such scrapers.

Step-by-Step Process of Scraping Job Listings

We will scrape listings from recruiting sites like Workable, which host each company's openings under a dedicated URL, e.g. apply.workable.com/companyname

The key steps are:

  1. Finding relevant companies to scrape
  2. Extracting all listings from each company's page
  3. Parsing details like title and description from each listing

We will use ScrapingBee, a cloud web scraping API, for the heavy lifting of data extraction. For AI-based text parsing, we use the ChatGPT API from OpenAI.

![Web scraping system architecture: data flows from scrapers through APIs into databases]

Finding Target Websites

The first step is generating a list of companies relevant to our niche, for instance health-tech firms or blockchain startups.

We search Google, restricting results to specific sites with the site: operator:

"health-tech company" site:apply.workable.com 

This surfaces company pages on Workable's apply portal.

We extract the result URLs into an array using ScrapingBee's API:

// Hypothetical helper wrapping ScrapingBee's Google Search API;
// see the ScrapingBee docs for the exact client and endpoint names
const scrapingbee = require("./scrapingbee-client")

const searchQuery = "health-tech site:apply.workable.com"

const getDomains = async (query) => {

  // Run the site-restricted query through the Google endpoint
  const response = await scrapingbee.get("google", {
      q: query
  })

  // Keep only the URLs of the organic results
  return response.organic_results.map(item => item.url)
}

const companies = await getDomains(searchQuery)

To bolster our list, we repeat this process on other sites like Greenhouse and Lever, as sketched below. Each portal nets 20-50 good companies relevant to our niche.
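
Since getDomains takes the search query as a parameter, covering additional portals is a short loop. A minimal sketch, assuming the standard hosting domains for each applicant tracking system:

// Run the same site-restricted search across several ATS portals
const portals = ["apply.workable.com", "boards.greenhouse.io", "jobs.lever.co"]

const allCompanies = []
for (const portal of portals) {
  const domains = await getDomains(`health-tech site:${portal}`)
  allCompanies.push(...domains)
}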

Listing Extraction

With target domains identified, we now scrape listings from each site. The listings page looks like:

![Example job listings page]

We extract all listings from such a page with ScrapingBee, storing job URLs separately:

const getListings = async (companyDomain) => {

  // Ask ScrapingBee to fetch the page and apply our extraction rules
  const response = await scrapingbee.get("extract", {
      url: companyDomain,

      extract_rules: {
        listings: {
          type: "list",
          selector: ".posting-list li > a",   // anchors wrapping each job
          output: { url: "@href" }            // keep just the link target
        }
      }
  })

  return response.data.listings
}

const jobs = await getListings("https://apply.workable.com/health-tech-inc")

The selector here pins down the HTML elements containing the job links. Repeating this for all companies, as in the loop below, gives us a large, focused crawl of niche listings!
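
Since getListings takes the domain as a parameter, crawling every company is a short loop over the companies array from the search step:

// Collect job URLs from every company domain found earlier
const allJobs = []
for (const domain of companies) {
  allJobs.push(...(await getListings(domain)))
}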

Parsing Job Details

Now we tackle the next phase: scraping useful details from each listing page.

![Job listing page]

The fields we want are:

  • Title
  • Location
  • Posted date
  • Description
  • Salary range
  • Skills

While some elements like the title have unique IDs suitable for CSS selector extraction, others don't, especially free-form details like the description and skills.

This is where AI can help!

Intelligent Data Parsing with ChatGPT

For tough elements lacking clear markup, we utilize ChatGPT, OpenAI's assistant built on large language models.

We send ChatGPT the entire HTML of a listing, prompting it to extract details we need by reading the text and headers.

For example:

"Extract the job description from this page:

{raw html}

Description: "  

ChatGPT scans the HTML, locates relevant text blocks and returns the description.

Similarly, prompts can be tuned to get skills, qualifications etc. in a structured format.

Preprocessing Listing HTML

Since the ChatGPT API is billed by token count (roughly four characters per token), we first minify and clean each listing's HTML to reduce cost.

// Strip scripts, styles, comments and excess whitespace to cut the token count
const minifyHtml = (html) =>
  html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "")
    .replace(/<!--[\s\S]*?-->/g, "")
    .replace(/\s+/g, " ")
    .trim()

const cleaned = minifyHtml(dirtyHtml)

Prompting for Relevant Fields

We now prompt ChatGPT with our cleaned HTML to extract fields one by one:

// ChatGPTClient is a hypothetical wrapper; substitute your preferred OpenAI SDK
const chatgpt = new ChatGPTClient(API_KEY)

const getDescription = async (pageHtml) => {

  const response = await chatgpt.sendMessage(
    "Extract the job description from this HTML:\n" + pageHtml
  )

  // The model replies with the description as plain text
  return response.text
}

const description = await getDescription(cleaned)

Similar prompts target the location, salary range, and so on. Each call costs around $0.004 for a page averaging 1,500 tokens.

Outputting Structured Data

Rather than raw text, we want output as clean variables. Prompting for the salary as JSON, for example, yields an object like:

const salary = {
  currency: 'USD',
  min: 75000,
  max: 120000
}

Such objects can be stored in a database as-is; a sketch of the prompt follows.
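
A minimal sketch of such a prompt, assuming the model is asked to reply with JSON only (getSalary is an illustrative helper, not part of any SDK):

// Request a JSON-only reply, then parse it into an object
const getSalary = async (pageHtml) => {
  const response = await chatgpt.sendMessage(
    "From the job listing HTML below, return ONLY a JSON object with " +
    "keys currency, min and max describing the salary range:\n" + pageHtml
  )
  return JSON.parse(response.text)
}

const salary = await getSalary(cleaned)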

Across 50-100 listings, this AI-based parsing costs under $1 while saving hours of manual effort!

Enriching Data for an Intelligent Job Board

With robust collection and parsing foundations in place, we wrap up by enriching the data to power smart interfaces.

Location Based Search

We geocode the locations extracted from listings to enable proximity-based searches like "tech jobs within 20 miles of Boston" (see the sketch below).

![Map based job search]

Grouping results by city also helps focus searches.
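
A minimal proximity filter, assuming each listing in allJobs has already been geocoded into a coords field with lat and lng values, can apply the haversine formula directly:

// Haversine distance in miles between two { lat, lng } points
const distanceMiles = (a, b) => {
  const toRad = (deg) => (deg * Math.PI) / 180
  const R = 3959 // Earth's radius in miles
  const dLat = toRad(b.lat - a.lat)
  const dLng = toRad(b.lng - a.lng)
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLng / 2) ** 2
  return 2 * R * Math.asin(Math.sqrt(h))
}

// "Tech jobs within 20 miles of Boston"
const boston = { lat: 42.3601, lng: -71.0589 }
const nearby = allJobs.filter(
  (job) => job.coords && distanceMiles(job.coords, boston) <= 20
)

For larger datasets, a geospatial index (e.g. PostGIS) would replace this linear scan.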

Intelligent Tagging

AI can automatically tag listings by key skills, seniority level, industry and more, enabling faceted search and custom recommendations; a tagging sketch follows.
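
One approach reuses the ChatGPT client from earlier, asking for tags as JSON (the key names shown are illustrative, not a fixed schema):

// Derive tags for faceted search from a parsed job description
const tagListing = async (description) => {
  const response = await chatgpt.sendMessage(
    "Return ONLY a JSON object with keys skills (array of strings), " +
    "seniority and industry for this job description:\n" + description
  )
  return JSON.parse(response.text)
}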

Personalized Alerts

Users save search criteria as alerts and receive email notifications whenever new matching jobs arrive, addressing a major pain point. A minimal matching sketch is shown below.
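
A minimal sketch, assuming a stored alerts collection and a sendEmail helper from your mail provider (both hypothetical):

// Match newly scraped listings against saved alerts and notify users
const matchesAlert = (job, alert) =>
  (!alert.keyword || job.title.toLowerCase().includes(alert.keyword.toLowerCase())) &&
  (!alert.city || job.city === alert.city)

const notifySubscribers = async (newListings, alerts) => {
  for (const alert of alerts) {
    const matches = newListings.filter((job) => matchesAlert(job, alert))
    if (matches.length) {
      await sendEmail(alert.email, matches) // hypothetical mail helper
    }
  }
}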

Auto-Updating Listings

The entire extraction pipeline runs daily, automatically appending new listings and purging closed ones to keep quality high. A sketch of the sync step follows.
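
A minimal daily sync sketch, assuming a db wrapper with getAllListings, removeListing and upsertListing methods (all hypothetical):

// Reconcile today's scrape with the stored listings
const syncListings = async (scraped, db) => {
  const liveUrls = new Set(scraped.map((job) => job.url))

  // Purge listings that no longer appear on the source sites
  for (const stored of await db.getAllListings()) {
    if (!liveUrls.has(stored.url)) await db.removeListing(stored.url)
  }

  // Insert new listings and refresh existing ones
  for (const job of scraped) await db.upsertListing(job)
}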

Conclusion and Next Steps

In closing, web scraping combined with AI techniques can accelerate building intelligent, niche job boards on tight budgets, handling the heavy lifting of targeted data collection and cleaning.

With the frameworks detailed here, a basic industry- or location-specific board can be up and running within weeks without a dedicated engineering team, and can evolve features like maps and alerts over time.

Of course, aspects like analyzing job trends or resume matching require more advanced ML capabilities. But they can be added gradually later.

I hope this guide serves as a practical starting point for anyone looking to build custom job boards. Do let me know in the comments if you have any other questions!
