How to Build an Intelligent Job Board with Web Scraping and ChatGPT
Job search is a frustrating experience for most candidates. Generic boards like Indeed and LinkedIn serve millions of listings, making finding a relevant role feel like hunting for a needle in a haystack.
Per 2021 statistics, the average time to hire in the US is 36 days, driven in large part by poor search and alerting. Tailored niche boards serve smaller markets far better.
But building such focused boards traditionally requires huge manual effort to collect and structure relevant listings. This guide shows how to combine web scraping and AI to create intelligent, custom job boards with minimal effort.
The Problem with Current Job Boards
Indeed and LinkedIn host hundreds of millions of listings, aiming to be everything to everyone. The result is a poor signal-to-noise ratio that naive keyword search does little to improve.
Per recent surveys, 74% of job seekers are unhappy with current boards and portals. The main complaints include:
- Irrelevant listings: Generic search leaves users sifting through junk
- No custom alerts: Can't save searches or get notifications
- Outdated listings: Stale jobs not removed quickly
- Poor location filters: Results not tuned to proximity
There is a clear need for niche boards catering to specific demographics and job types. For instance, a portal dedicated to finance openings in New York would serve many candidates better.
| Problem | % of Users Affected |
|---|---|
| Too many irrelevant listings | 63% |
| No option for alerts | 58% |
| Listings not updated quickly | 53% |
| Ineffective location filters | 47% |
*Stats from a 2022 ZipRecruiter survey of 1,000+ job seekers*
However, building such focused, high-quality boards poses several technical and data challenges:
- Collecting relevant and recent listings
- Parsing details accurately from raw listings
- Constant updating to keep listings current
- Smart interfaces like custom alerts and filters
This is where automated web scraping combined with AI systems like ChatGPT can help tremendously.
Why Web Scraping is the Right Strategic Choice
For any custom job board, the first step is collecting enough targeted listings to offer value. Web scraping provides a scalable way to extract thousands of openings from multiple sites.
Structured Data Extraction
Unlike general web crawling, scrapers can parse details selectively from pages using underlying HTML structure and attributes.
For instance, key fields within a listing page can be extracted as:
title: Software Engineer
company: Acme Inc.
category: Technology > IT > Software
salary: $90k - $120k
posted_on: 2022-05-16
description: Acme Inc. is hiring experienced backend developers...
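To make this concrete, here is a minimal sketch of selector-based parsing with the cheerio library. Every class name below (.job-title, .company-name, and so on) is hypothetical and would come from inspecting the real page's markup.
import * as cheerio from 'cheerio'

// Parse a fetched listing page and pull fields out by CSS selector.
// All selectors here are hypothetical; adapt them to the actual page.
const parseListing = (html) => {
  const $ = cheerio.load(html)
  return {
    title: $('.job-title').text().trim(),
    company: $('.company-name').text().trim(),
    salary: $('.salary-range').text().trim(),
    posted_on: $('.posted-date').attr('datetime'),
    description: $('.job-description').text().trim()
  }
}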
Scales to Large Volumes
Once scrapers are set up correctly with rules tailored to a site, they scale seamlessly to extract hundreds or thousands of listings.
This enables quick ramp-up for new custom boards, without extensive manual effort.
Handles Site Changes
Unlike one-time downloads, scrapers keep running periodically to check for updates. So new jobs get added, while closed ones are dropped automatically.
With resilient selectors, they can also tolerate minor layout changes on underlying sites without breaking entirely.
Well configured scrapers form an automated ETL pipeline – pulling relevant data from thousands of live sources daily into analytics-ready structures.
![Web scraper ETL process: extracting job listings from live sources into databases]
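As a sketch of that daily cadence, the node-cron package can drive the refresh. Here runFullScrape is a hypothetical wrapper around the extraction steps built in the next section.
// npm install node-cron
import cron from 'node-cron'
import { runFullScrape } from './scraper.js' // hypothetical module wrapping the steps below

// Re-run the whole extract-and-load cycle every day at 06:00
cron.schedule('0 6 * * *', async () => {
  const listings = await runFullScrape()
  console.log(`Refreshed ${listings.length} listings`)
})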
Now that the value of web scraping is clear, let's move on to practical aspects of building such scrapers.
Step-by-Step Process of Scraping Job Listings
We will scrape listings from recruiting sites like Workable, which host openings on per-company subdomains, e.g. apply.workable.com/companyname.
The key steps are:
- Finding relevant companies to scrape
- Extracting all listings from a company's subdomain
- Parsing details like title and description for each listing
We will utilize ScrapingBee, a cloud web-scraping API, for the heavy lifting of data extraction. For AI-based text parsing, we use the ChatGPT API from OpenAI.
![Web scraping system architecture: data flows from scrapers through APIs into databases]
Finding Target Websites
The first step is generating a list of companies relevant to our niche, for instance health-tech firms or blockchain startups.
We search on Google restricting results to specific sites, using syntax like:
"health-tech company" site:apply.workable.com
This surfaces company subdomains on Workable's apply portal.
We extract the URLs from search results using ScrapingBee's API into an array of domains:
// npm install axios
import axios from 'axios'

const SCRAPINGBEE_KEY = process.env.SCRAPINGBEE_API_KEY

// Query ScrapingBee's Google Search API for company subdomains.
// Endpoint and response shape follow ScrapingBee's docs; verify both
// against the current API reference before relying on them.
const getDomains = async (searchQuery) => {
  const response = await axios.get('https://app.scrapingbee.com/api/v1/store/google', {
    params: { api_key: SCRAPINGBEE_KEY, search: searchQuery }
  })
  // Keep just the URL of each organic result
  return response.data.organic_results.map((item) => item.url)
}

const companies = await getDomains('health-tech site:apply.workable.com')
To bolster our list, we repeat this search on other recruiting platforms like Greenhouse and Lever, as sketched below. Each platform nets 20-50 good companies relevant to our niche.
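A sketch of that loop, reusing the getDomains helper from above and de-duplicating with a Set (the Greenhouse and Lever query strings are illustrative):
const queries = [
  'health-tech site:apply.workable.com',
  'health-tech site:boards.greenhouse.io',
  'health-tech site:jobs.lever.co'
]

// Collect and de-duplicate company URLs across platforms
const allCompanies = new Set()
for (const q of queries) {
  for (const url of await getDomains(q)) allCompanies.add(url)
}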
Listing Extraction
With target domains identified, we now scrape listings from each site. The listings page looks like:
![Example jobs listings page]
We extract all listings from such a page with ScrapingBee, storing job URLs separately:
// Pull all job links from a company's listings page using ScrapingBee's
// data-extraction rules (see their docs for the extract_rules format).
const getListings = async (companyDomain) => {
  const extractRules = {
    listings: {
      type: 'list',
      selector: '.posting-list li > a', // hypothetical; inspect the real page markup
      output: { url: '@href' }
    }
  }
  const response = await axios.get('https://app.scrapingbee.com/api/v1/', {
    params: {
      api_key: SCRAPINGBEE_KEY,
      url: companyDomain,
      extract_rules: JSON.stringify(extractRules)
    }
  })
  return response.data.listings
}

const jobs = await getListings('https://apply.workable.com/health-tech-inc')
The selector here pins down the HTML elements containing job links. Repeating this for all companies gives us a large, focused crawl of niche listings!
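For instance, a simple loop over the collected companies (using the getListings helper above and tolerating per-site failures) gathers every job URL:
// Gather job URLs from every company, skipping sites that error out
const allJobs = []
for (const domain of allCompanies) {
  try {
    const listings = await getListings(domain)
    allJobs.push(...listings.map((l) => new URL(l.url, domain).href)) // resolve relative links
  } catch (err) {
    console.warn(`Skipping ${domain}: ${err.message}`)
  }
}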
Parsing Job Details
Now we tackle Phase 2 – scraping useful details from each listing page:
![Job listing page]
The fields we want are:
- Title
- Location
- Posted date
- Description
- Salary range
- Skills
While some elements like the title have unique IDs suited to CSS-selector extraction, others don't, especially free-form details like the description and skills.
This is where AI can help!
Intelligent Data Parsing with ChatGPT
For tough elements lacking clear markup, we utilize ChatGPT, OpenAI's AI assistant built on large language models.
We send ChatGPT the entire HTML of a listing, prompting it to extract details we need by reading the text and headers.
For example:
"Extract the job description from this page:
{raw html}
Description: "
ChatGPT scans the HTML, locates relevant text blocks and returns the description.
Similarly, prompts can be tuned to get skills, qualifications etc. in a structured format.
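For example, a skills prompt can demand a JSON array so the reply parses cleanly (a sketch; the exact wording will need tuning against real pages):
const skillsPrompt = (pageHtml) =>
  'Read this job listing HTML and return ONLY a JSON array of required skills, ' +
  'e.g. ["Python", "PostgreSQL", "AWS"].\n\n' + pageHtml
Because the model is told to return only JSON, its reply can go straight into JSON.parse.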
Preprocessing Listing HTML
Since ChatGPT's API is billed by token count (a token is roughly four characters of English text), we first minify and clean each listing's HTML to reduce cost.
// Strip scripts, styles, comments, and excess whitespace to cut token count
const minifyHtml = (html) => html
  .replace(/<(script|style)[\s\S]*?<\/\1>/gi, '')
  .replace(/<!--[\s\S]*?-->/g, '')
  .replace(/\s+/g, ' ')
  .trim()

const cleaned = minifyHtml(dirtyHtml)
Prompting for Relevant Fields
We now prompt ChatGPT with our cleaned HTML to extract fields one by one:
// npm install openai — the official OpenAI Node SDK (the ChatGPT API)
import OpenAI from 'openai'

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })

// Send one prompt to the chat model and return its text reply
const getCompletion = async (prompt) => {
  const response = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo',
    messages: [{ role: 'user', content: prompt }]
  })
  return response.choices[0].message.content
}

const description = await getCompletion('Extract the job description from this:\n' + cleaned)
Similar prompts target location, salary range, and so on. Each call costs roughly $0.004 for a page averaging 1,500 tokens.
Outputting Structured Data
Rather than raw text, we want structured output. By asking for JSON in the prompt, the salary comes back ready to use as a JS object:
const salary = {
currency: 'USD',
min: 75000,
max: 120000
}
This makes storage in databases straightforward.
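One way to get that shape reliably is to show the model the schema and parse its reply, reusing the getCompletion helper from earlier (a sketch; real replies need validation):
const getSalary = async (pageHtml) => {
  const reply = await getCompletion(
    'Extract the salary range from this listing as JSON shaped like ' +
    '{"currency": "USD", "min": 0, "max": 0}. Return only the JSON.\n' + pageHtml
  )
  return JSON.parse(reply) // throws if the model strays from pure JSON
}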
Across 50-100 listings, this AI-based parsing costs under $1 while saving hours of manual effort!
Enriching Data for an Intelligent Job Board
With robust listing collection and parsing foundations in place, we wrap up by enriching data to build smart interfaces.
Location Based Search
We geocode the locations extracted from listings to enable proximity-based searches like tech jobs within 20 miles of Boston (a geocoding sketch follows below).
![Map based job search]
Grouping results by city also helps focus searches.
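As one option, OpenStreetMap's free Nominatim service can do the geocoding, paired with a standard haversine formula for radius filtering (a sketch; mind Nominatim's usage policy, and note global fetch requires Node 18+):
// Geocode a location string via OpenStreetMap's Nominatim API
const geocode = async (place) => {
  const url = 'https://nominatim.openstreetmap.org/search?format=json&q=' +
    encodeURIComponent(place)
  const [hit] = await (await fetch(url)).json()
  return hit ? { lat: +hit.lat, lon: +hit.lon } : null
}

// Haversine distance in miles between two coordinates
const distanceMiles = (a, b) => {
  const toRad = (d) => (d * Math.PI) / 180
  const dLat = toRad(b.lat - a.lat)
  const dLon = toRad(b.lon - a.lon)
  const h = Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2
  return 3959 * 2 * Math.asin(Math.sqrt(h)) // Earth radius ≈ 3959 miles
}
A query like "within 20 miles of Boston" then becomes jobs.filter(j => distanceMiles(j.coords, bostonCoords) <= 20).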
Intelligent Tagging
AI helps automatically tag listings by key skills, seniority level, industry etc. Enabling faceted search and custom recommendations.
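A sketch of such a tagging call, again via the getCompletion helper (the field names and seniority buckets are illustrative):
const tagListing = async (description) => {
  const reply = await getCompletion(
    'Classify this job. Return only JSON like ' +
    '{"skills": [], "seniority": "junior|mid|senior", "industry": ""}.\n' + description
  )
  return JSON.parse(reply)
}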
Personalized Alerts
Users save search criteria as alerts and receive email notifications when new matching jobs appear, addressing a major pain point.
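A minimal matching sketch (the alert shape and the sendEmail helper are assumptions):
// Notify each saved alert whose keywords all appear in a new job's title
const notifyMatches = (newJobs, alerts) => {
  for (const alert of alerts) {
    const matches = newJobs.filter((job) =>
      alert.keywords.every((kw) => job.title.toLowerCase().includes(kw.toLowerCase()))
    )
    if (matches.length) sendEmail(alert.email, matches) // hypothetical mailer
  }
}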
Auto-Updating Listings
The entire extraction pipeline runs daily, appending new listings and purging closed ones automatically, which keeps quality high.
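One simple sync strategy is to diff each crawl against the store. In this sketch, a Map keyed by job URL stands in for a real database:
// Add or refresh every scraped job, then purge anything no longer listed
const syncListings = (db, scrapedJobs) => {
  const seen = new Set(scrapedJobs.map((j) => j.url))
  for (const job of scrapedJobs) db.set(job.url, job)
  for (const url of [...db.keys()]) {
    if (!seen.has(url)) db.delete(url)
  }
}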
Conclusion and Next Steps
In closing, web scraping combined with AI techniques can accelerate building intelligent, niche job boards on tight budgets, handling the heavy lifting of targeted data collection and cleaning.
With the frameworks detailed here, a basic industry- or location-specific board can be live within weeks without an engineering team, and can grow features like maps and alerts over time.
Of course, capabilities like job-trend analysis or resume matching require more advanced ML, but those can be layered in later.
I hope this guide serves as a practical starting point for anyone looking to build custom job boards. Let me know in the comments if you have any other questions!