Sept 27, 2024

Eric Ciarla

Scraping Job Boards Using Firecrawl Actions and OpenAI

Scraping job boards to extract structured data can be a complex task, especially when dealing with dynamic websites and unstructured content. In this guide, we’ll walk through how to use Firecrawl Actions and OpenAI models to efficiently scrape job listings and extract valuable information.

Why Use Firecrawl and OpenAI?

  • Firecrawl simplifies web scraping by handling dynamic content and providing actions like clicking and scrolling.
  • OpenAI’s o1 and GPT-4o models excel at understanding and extracting structured data from unstructured text. o1 is better suited to complex reasoning tasks, while GPT-4o is the better choice for speed and cost.

Prerequisites

  • Python 3.7 or higher
  • API keys for both Firecrawl and OpenAI
  • Install required libraries:
pip install requests python-dotenv openai

Step 1: Set Up Your Environment

Create a .env file in your project directory and add your API keys:

FIRECRAWL_API_KEY=your_firecrawl_api_key
OPENAI_API_KEY=your_openai_api_key

Step 2: Initialize API Clients

import os
import requests
import json
from dotenv import load_dotenv
import openai

# Load environment variables
load_dotenv()

# Initialize API keys
firecrawl_api_key = os.getenv("FIRECRAWL_API_KEY")
openai.api_key = os.getenv("OPENAI_API_KEY")
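os.getenv returns None when a variable is missing, so a forgotten key would only surface later as an authentication error from one of the APIs. A minimal sketch of an early check, assuming a hypothetical helper named require_env (not part of either library):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Fail fast with a message that names the missing variable
        raise RuntimeError(
            f"Missing environment variable: {name}. Add it to your .env file."
        )
    return value
```

After load_dotenv() has run, you could call require_env("FIRECRAWL_API_KEY") in place of os.getenv("FIRECRAWL_API_KEY").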

Step 3: Define the Jobs Page URL and Resume

Specify the URL of the jobs page you want to scrape and provide your resume for matching.

# URL of the jobs page to scrape
jobs_page_url = "https://openai.com/careers/search"

# Candidate's resume (as a string)
resume_paste = """
[Your resume content here]
"""

Step 4: Scrape the Jobs Page Using Firecrawl

We use Firecrawl to scrape the jobs page and return its content as markdown, then build the prompt that asks OpenAI for the application links.

# Defaults used if the scrape fails; html_content holds the page's markdown
html_content = ""
prompt = ""

try:
    response = requests.post(
        "https://api.firecrawl.dev/v1/scrape",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {firecrawl_api_key}"
        },
        json={
            "url": jobs_page_url,
            "formats": ["markdown"]
        },
        timeout=120
    )
    if response.status_code == 200:
        result = response.json()
        if result.get('success'):
            html_content = result['data']['markdown']
            # Prepare the prompt for OpenAI (content is truncated to 100,000 characters)
            prompt = f"""
Extract up to 30 job application links from the given markdown content.
Return the result as a JSON object with a single key 'apply_links' containing an array of strings (the links).
The output should be a valid JSON object, with no additional text.

Markdown content:
{html_content[:100000]}
"""
        else:
            print("Firecrawl reported an unsuccessful scrape.")
    else:
        print(f"Firecrawl request failed with status {response.status_code}")
except Exception as e:
    print(f"Error scraping the jobs page: {e}")
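The response handling in this step relies on the shape used in the code: a JSON object with a success flag and the markdown under data. As a rough sketch, that lookup could be isolated in a small function so a missing field yields an empty string instead of a KeyError. The name extract_markdown is hypothetical and only the two fields used in this guide are assumed:

```python
from typing import Any, Dict


def extract_markdown(result: Dict[str, Any]) -> str:
    """Return the markdown from a parsed scrape response, or "" if it is absent."""
    if not result.get("success"):
        return ""
    data = result.get("data") or {}
    markdown = data.get("markdown")
    # Guard against a missing or non-string field
    return markdown if isinstance(markdown, str) else ""
```

With this in place, html_content = extract_markdown(response.json()) would replace the nested success check.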

Step 5: Extract Apply Links Using OpenAI’s gpt-4o Model

We use OpenAI’s gpt-4o model to parse the scraped content and extract application links.

# Extract apply links using OpenAI
apply_links = []
if html_content:
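    try:
        # openai>=1.0 syntax; on openai<1.0 use openai.ChatCompletion.create instead
        completion = openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            # Ask for a bare JSON object so json.loads does not fail on extra text
            response_format={"type": "json_object"}
        )
        if completion.choices:
            result = json.loads(completion.choices[0].message.content.strip())
            apply_links = result.get('apply_links', [])
    except Exception as e:
        print(f"Error extracting apply links: {e}")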
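Model output is not guaranteed to be clean JSON: the answer may arrive wrapped in a Markdown code fence, contain duplicates, or include strings that are not links. The sketch below shows one defensive way to parse it. The helper name parse_apply_links is hypothetical, and the rule that a link must start with http:// or https:// is an assumption that would drop relative URLs:

```python
import json
from typing import List

FENCE = "`" * 3  # a Markdown code fence marker


def parse_apply_links(raw: str, limit: int = 30) -> List[str]:
    """Parse the model's reply into a de-duplicated list of absolute links."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and everything from the closing fence on
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return []
    links = parsed.get("apply_links", []) if isinstance(parsed, dict) else []
    if not isinstance(links, list):
        return []
    seen = set()
    unique = []
    for link in links:
        # Keep only absolute URLs (assumption) and skip repeats
        if (
            isinstance(link, str)
            and link.startswith(("http://", "https://"))
            and link not in seen
        ):
            seen.add(link)
            unique.append(link)
    return unique[:limit]
```

You could then write apply_links = parse_apply_links(completion.choices[0].message.content) instead of calling json.loads directly.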

Step 6: Extract Job Details from Each Apply Link

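We iterate over each apply link and use Firecrawl’s extraction capabilities to get job details. Each request first runs a click action on the #job-overview element, then extracts the fields defined in the schema.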

# Initialize a list to store job data
extracted_data = []

# Define the extraction schema
schema = {
    "type": "object",
    "properties": {
        "job_title": {"type": "string"},
        "sub_division_of_organization": {"type": "string"},
        "key_skills": {"type": "array", "items": {"type": "string"}},
        "compensation": {"type": "string"},
        "location": {"type": "string"},
        "apply_link": {"type": "string"}
    },
    "required": ["job_title", "sub_division_of_organization", "key_skills", "compensation", "location", "apply_link"]
}

# Extract job details for each link
for link in apply_links:
    try:
        response = requests.post(
            "https://api.firecrawl.dev/v1/scrape",
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {firecrawl_api_key}"
            },
            json={
                "url": link,
                "formats": ["extract"],
                # Click the job overview element before extracting
                "actions": [{
                    "type": "click",
                    "selector": "#job-overview"
                }],
                "extract": {
                    "schema": schema
                }
            },
            timeout=120
        )
        if response.status_code == 200:
            result = response.json()
            if result.get('success'):
                extracted_data.append(result['data']['extract'])
        else:
            print(f"Skipping {link}: status {response.status_code}")
    except Exception as e:
        # Report the failure and continue with the next link
        print(f"Error extracting job details from {link}: {e}")
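Extraction can still return records with empty fields, for example when a posting lists no compensation. As a small sketch, the schema's own required list can be used to flag incomplete records before they reach the matching step. The helper name missing_required_fields is hypothetical:

```python
from typing import Any, Dict, List


def missing_required_fields(record: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """List the schema's required keys that are absent or empty in a record."""
    missing = []
    for key in schema.get("required", []):
        value = record.get(key)
        # Treat None, empty strings, and empty lists as missing
        if value is None or value == "" or value == []:
            missing.append(key)
    return missing
```

For instance, you might keep only the records for which missing_required_fields(job, schema) returns an empty list, or log the missing keys and keep the record anyway.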

Step 7: Match Jobs to Your Resume Using OpenAI’s o1 Model

We use OpenAI’s o1 model (o1-preview in the code below) to analyze your resume and recommend the top three job listings.

# Prepare the prompt (literal braces are doubled because this is an f-string)
prompt = f"""
Please analyze the resume and job listings, and return a JSON list of the top 3 roles that best fit the candidate's experience and skills. Include only the job title, compensation, and apply link for each recommended role. The output should be a valid JSON array of objects in the following format:

[
  {{
    "job_title": "Job Title",
    "compensation": "Compensation",
    "apply_link": "Application URL"
  }},
  ...
]

Based on the following resume:
{resume_paste}

And the following job listings:
{json.dumps(extracted_data, indent=2)}
"""

# Get recommendations from OpenAI
# openai>=1.0 syntax; on openai<1.0 use openai.ChatCompletion.create instead
completion = openai.chat.completions.create(
    model="o1-preview",
    messages=[
        {
            "role": "user",
            "content": prompt
        }
    ]
)

# Extract recommended jobs
recommended_jobs = json.loads(completion.choices[0].message.content.strip())
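Because the recommendations are generated text, it can be worth checking that each recommended apply link actually appears in the scraped listings. A minimal sketch of that check, using a hypothetical helper named keep_known_jobs and assuming each listing carries the apply_link field from the schema in Step 6:

```python
from typing import Any, Dict, List


def keep_known_jobs(
    recommended: List[Dict[str, Any]], listings: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Keep only recommendations whose apply_link appears in the listings."""
    known_links = {
        job.get("apply_link") for job in listings if isinstance(job, dict)
    }
    known_links.discard(None)  # listings without a link should never match
    return [
        rec
        for rec in recommended
        if isinstance(rec, dict) and rec.get("apply_link") in known_links
    ]
```

Calling keep_known_jobs(recommended_jobs, extracted_data) would drop any role whose link the model did not take from the listings.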

Step 8: Output the Recommended Jobs

Finally, we can print or save the recommended jobs.

# Output the recommended jobs
print(json.dumps(recommended_jobs, indent=2))
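To save the results instead of printing them, one option is to write them to a JSON file. A brief sketch, where the helper name save_jobs and the default filename recommended_jobs.json are assumptions made for illustration:

```python
import json
from pathlib import Path
from typing import Any, List


def save_jobs(jobs: List[Any], path: str = "recommended_jobs.json") -> Path:
    """Write the jobs to a UTF-8 JSON file and return the file's path."""
    out = Path(path)
    out.write_text(json.dumps(jobs, indent=2, ensure_ascii=False), encoding="utf-8")
    return out
```

Calling save_jobs(recommended_jobs) would write the file to the current working directory.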

Full Code Example on GitHub

You can find the full code example on GitHub.

Conclusion

By following this guide, you’ve learned how to:

  • Scrape dynamic job boards using Firecrawl.
  • Extract structured data from web pages with custom schemas.
  • Leverage OpenAI’s models to parse content and make intelligent recommendations.

This approach can be extended to other websites and data extraction tasks, providing a powerful toolset for automating data collection and analysis.

That’s it! You’ve now built a pipeline to scrape job boards and find the best job matches using Firecrawl and OpenAI. Happy coding!

About the Author

Eric Ciarla (@ericciarla)

Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB. He previously worked at Ford and Fracta as a data scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users.
