Scraping Job Boards Using Firecrawl Actions and OpenAI
Scraping job boards to extract structured data can be a complex task, especially when dealing with dynamic websites and unstructured content. In this guide, we’ll walk through how to use Firecrawl Actions and OpenAI models to efficiently scrape job listings and extract valuable information.
Why Use Firecrawl and OpenAI?
- Firecrawl simplifies web scraping by handling dynamic content and providing actions like clicking and scrolling.
- OpenAI’s `o1` and `4o` models excel at understanding and extracting structured data from unstructured text. `o1` is best for more complex reasoning tasks, while `4o` is best for speed and cost.
Prerequisites
```bash
pip install requests python-dotenv openai
```
Step 1: Set Up Your Environment
Create a `.env` file in your project directory and add your API keys:

```bash
FIRECRAWL_API_KEY=your_firecrawl_api_key
OPENAI_API_KEY=your_openai_api_key
```
Step 2: Initialize API Clients
```python
import os
import requests
import json
from dotenv import load_dotenv
import openai

# Load environment variables
load_dotenv()

# Initialize API keys
firecrawl_api_key = os.getenv("FIRECRAWL_API_KEY")
openai.api_key = os.getenv("OPENAI_API_KEY")
```
Step 3: Define the Jobs Page URL and Resume
Specify the URL of the jobs page you want to scrape and provide your resume for matching.
```python
# URL of the jobs page to scrape
jobs_page_url = "https://openai.com/careers/search"

# Candidate's resume (as a string)
resume_paste = """
[Your resume content here]
"""
```
Step 4: Scrape the Jobs Page Using Firecrawl
We use Firecrawl to scrape the jobs page and extract the HTML content.
```python
try:
    response = requests.post(
        "https://api.firecrawl.dev/v1/scrape",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {firecrawl_api_key}"
        },
        json={
            "url": jobs_page_url,
            "formats": ["markdown"]
        }
    )
    if response.status_code == 200:
        result = response.json()
        if result.get('success'):
            html_content = result['data']['markdown']
            # Prepare the prompt for OpenAI
            prompt = f"""
Extract up to 30 job application links from the given markdown content.
Return the result as a JSON object with a single key 'apply_links' containing an array of strings (the links).
The output should be a valid JSON object, with no additional text.

Markdown content:
{html_content[:100000]}
"""
        else:
            html_content = ""
    else:
        html_content = ""
except Exception as e:
    html_content = ""
```
Step 5: Extract Apply Links Using OpenAI’s `gpt-4o` Model
We use OpenAI’s `gpt-4o` model to parse the scraped content and extract application links.
```python
# Extract apply links using OpenAI
apply_links = []
if html_content:
    try:
        completion = openai.ChatCompletion.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "user",
                    "content": prompt
                }
            ]
        )
        if completion.choices:
            result = json.loads(completion.choices[0].message.content.strip())
            apply_links = result['apply_links']
    except Exception as e:
        pass
```
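One practical caveat: chat models sometimes wrap JSON output in Markdown code fences, which makes a bare `json.loads` on the response fail. A small helper (a hypothetical addition, not part of the original guide) can strip those fences before parsing:

```python
import json

def parse_model_json(text):
    """Parse model output as JSON, tolerating optional Markdown code fences.

    Hypothetical helper for robustness; not part of the original pipeline.
    """
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence line (which may carry a language tag like ```json)
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # Drop the closing fence if present
        if cleaned.rstrip().endswith("```"):
            cleaned = cleaned.rstrip()[:-3]
    return json.loads(cleaned)
```

You could then replace `json.loads(completion.choices[0].message.content.strip())` with `parse_model_json(completion.choices[0].message.content)`.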
Step 6: Extract Job Details from Each Apply Link
We iterate over each apply link and use Firecrawl’s extraction capabilities to get job details.
```python
# Initialize a list to store job data
extracted_data = []

# Define the extraction schema
schema = {
    "type": "object",
    "properties": {
        "job_title": {"type": "string"},
        "sub_division_of_organization": {"type": "string"},
        "key_skills": {"type": "array", "items": {"type": "string"}},
        "compensation": {"type": "string"},
        "location": {"type": "string"},
        "apply_link": {"type": "string"}
    },
    "required": ["job_title", "sub_division_of_organization", "key_skills", "compensation", "location", "apply_link"]
}
```
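Before relying on extracted records downstream, it can help to sanity-check them against the schema. This lightweight check (an illustrative addition, not Firecrawl functionality) reports which required fields are missing from a record:

```python
def check_required_fields(record, schema):
    """Return the required fields missing from an extracted record.

    Illustrative sanity check only; a library such as jsonschema
    could perform full validation against the schema.
    """
    return [field for field in schema.get("required", []) if field not in record]
```

Records with a non-empty result could be logged or skipped before matching against the resume.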
```python
# Extract job details for each link
for link in apply_links:
    try:
        response = requests.post(
            "https://api.firecrawl.dev/v1/scrape",
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {firecrawl_api_key}"
            },
            json={
                "url": link,
                "formats": ["extract"],
                "actions": [{
                    "type": "click",
                    "selector": "#job-overview"
                }],
                "extract": {
                    "schema": schema
                }
            }
        )
        if response.status_code == 200:
            result = response.json()
            if result.get('success'):
                extracted_data.append(result['data']['extract'])
    except Exception as e:
        pass
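The loop above silently drops a listing on any failure. As one possible improvement, a generic retry helper can smooth over transient errors such as rate limits or timeouts; the retry count and backoff values here are arbitrary illustrative choices, not part of the original guide:

```python
import time

def with_retries(fn, retries=3, backoff=2.0, retry_on=(Exception,)):
    """Call fn(), retrying on the given exceptions with linear backoff.

    Illustrative sketch; tune retries/backoff for your rate limits.
    """
    for attempt in range(retries):
        try:
            return fn()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of attempts: surface the error
            time.sleep(backoff * (attempt + 1))
```

Each scrape could then be issued as `with_retries(lambda: requests.post(...), retry_on=(requests.RequestException,))`.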
Step 7: Match Jobs to Your Resume Using OpenAI’s `o1` Model
We use OpenAI’s `o1` model to analyze your resume and recommend the top 3 job listings.
```python
# Prepare the prompt
# Note: literal braces in the JSON template must be doubled ({{ }}) inside an f-string.
prompt = f"""
Please analyze the resume and job listings, and return a JSON list of the top 3 roles that best fit the candidate's experience and skills. Include only the job title, compensation, and apply link for each recommended role. The output should be a valid JSON array of objects in the following format:

[
  {{
    "job_title": "Job Title",
    "compensation": "Compensation",
    "apply_link": "Application URL"
  }},
  ...
]

Based on the following resume:
{resume_paste}

And the following job listings:
{json.dumps(extracted_data, indent=2)}
"""
```
```python
# Get recommendations from OpenAI
completion = openai.ChatCompletion.create(
    model="o1-preview",
    messages=[
        {
            "role": "user",
            "content": prompt
        }
    ]
)

# Extract recommended jobs
recommended_jobs = json.loads(completion.choices[0].message.content.strip())
```
Step 8: Output the Recommended Jobs
Finally, we can print or save the recommended jobs.
```python
# Output the recommended jobs
print(json.dumps(recommended_jobs, indent=2))
```
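If you would rather persist the results than print them, a short helper can write them to disk (the filename here is an arbitrary choice, not from the original guide):

```python
import json
from pathlib import Path

def save_recommendations(jobs, path="recommended_jobs.json"):
    """Write the recommended jobs to a JSON file and return the path used."""
    Path(path).write_text(json.dumps(jobs, indent=2))
    return path
```

This makes it easy to feed the recommendations into a follow-up script or spreadsheet later.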
Full Code Example on GitHub
You can find the full code example on GitHub.
Conclusion
By following this guide, you’ve learned how to:
- Scrape dynamic job boards using Firecrawl.
- Extract structured data from web pages with custom schemas.
- Leverage OpenAI’s models to parse content and make intelligent recommendations.
This approach can be extended to other websites and data extraction tasks, providing a powerful toolset for automating data collection and analysis.
That’s it! You’ve now built a pipeline to scrape job boards and find the best job matches using Firecrawl and OpenAI. Happy coding!
About the Author
Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl, where he leads marketing. He also worked on Mendable.ai, selling it to companies like Snapchat, Coinbase, and MongoDB. Previously, he worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users.