# http://firecrawl.dev llms-full.txt
## Web Data Extraction
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
[đĽ Get 2 months free with yearly plan](https://www.firecrawl.dev/pricing)
# Turn websites into LLM-ready data
Power your AI apps with clean data crawled from any website. [It's also open source.](https://github.com/mendableai/firecrawl)
https://example.com
Start for free (500 credits)Start for free
200 Response
\[\
\
{\
\
"url": "https://example.com",\
\
"markdown": "# Getting Started...",\
\
"json": {\
"title": "Guide",\
"docs": ...\
},\
\
"screenshot": "https://example.com/hero.png",\
\
}\
\
...\
\
\]
## Trusted by Top Companies
[](https://www.zapier.com/)
[](https://gamma.app/)
[](https://www.nvidia.com/)
[](https://phmg.com/)
[](https://www.stack-ai.com/)
[](https://www.teller.io/)
[](https://www.carrefour.com/)
[](https://www.vendr.com/)
[](https://www.open.gov.sg/)
[](https://www.zapier.com/)
[](https://gamma.app/)
[](https://www.nvidia.com/)
[](https://phmg.com/)
[](https://www.stack-ai.com/)
[](https://www.teller.io/)
[](https://www.carrefour.com/)
[](https://www.vendr.com/)
[](https://www.open.gov.sg/)
[](https://www.cyberagent.co.jp/)
[](https://continue.dev/)
[](https://www.bain.com/)
[](https://jasper.ai/)
[](https://www.palladiumdigital.com/)
[](https://www.checkr.com/)
[](https://www.jetbrains.com/)
[](https://www.you.com/)
[](https://www.cyberagent.co.jp/)
[](https://continue.dev/)
[](https://www.bain.com/)
[](https://jasper.ai/)
[](https://www.palladiumdigital.com/)
[](https://www.checkr.com/)
[](https://www.jetbrains.com/)
[](https://www.you.com/)
Developer first
## Start scraping this morning
Enhance your apps with industry leading web scraping and crawling capabilities
#### Scrape
Get llm-ready data from websites
#### Crawl
Crawl all the pages on a website
#### Extract
New
Extract structured data from websites
1
2
3
4
5
6
7
8
```
// npm install @mendable/firecrawl-js
import FirecrawlApp from '@mendable/firecrawl-js';
const app = new FirecrawlApp({ apiKey: "fc-YOUR_API_KEY" });
// Scrape a website:
await app.scrapeUrl('firecrawl.dev');
```
#### Use well-known tools
Already fully integrated with the greatest existing tools and workflows.
[](https://docs.llamaindex.ai/en/stable/examples/data_connectors/WebPageDemo/#using-firecrawl-reader/)[](https://python.langchain.com/v0.2/docs/integrations/document_loaders/firecrawl/)[](https://dify.ai/blog/dify-ai-blog-integrated-with-firecrawl/)[](https://www.langflow.org/)[](https://flowiseai.com/)[](https://crewai.com/)[](https://docs.camel-ai.org/cookbooks/ingest_data_from_websites_with_Firecrawl.html)
#### Start for free, scale easily
Kick off your journey for free and scale seamlessly as your project expands.
[Try it out](https://www.firecrawl.dev/signin/signup)
#### Open-source
Developed transparently and collaboratively. Join our community of contributors.
[Check out our repo](https://github.com/mendableai/firecrawl)
Zero Configuration
## We handle the hard stuff
Rotating proxies, orchestration, rate limits, js-blocked content and more
#### Crawling
Gather clean data from all accessible subpages, even without a sitemap.
    Firecrawl
    Extract web data for LLMs
    Installation
    npm install @mendable/firecrawl-js
    Firecrawl
    Extract web data for LLMs
    Installation
    npm install @mendable/firecrawl-js
    Firecrawl
    Extract web data for LLMs
    Installation
    npm install @mendable/firecrawl-js
    Firecrawl
    Extract web data for LLMs
    Installation
    npm install @mendable/firecrawl-js
    Firecrawl
    Extract web data for LLMs
    Installation
    npm install @mendable/firecrawl-js
    Firecrawl
    Extract web data for LLMs
    Installation
    npm install @mendable/firecrawl-js
    Firecrawl
    Extract web data for LLMs
    Installation
    npm install @mendable/firecrawl-js
    Firecrawl
    Extract web data for LLMs
    Installation
    npm install @mendable/firecrawl-js
    Firecrawl
    Extract web data for LLMs
    Installation
    npm install @mendable/firecrawl-js
    Firecrawl
    Extract web data for LLMs
    Installation
    npm install @mendable/firecrawl-js
#Â Firecrawl
Extract web data for LLMs
##Â Installation
\`\`\`bash
npm install @mendable/firecrawl-js
\`\`\`
#### Dynamic Content
Firecrawl handles JavaScript, SPAs, and dynamic content loading with minimal configuration.
#### Smart Wait
Firecrawl intelligently wait for content to load, making scraping faster and more reliable.
#### Reliability First
Reliability is our core focus. Firecrawl is designed to scale with your needs.
#### Actions
Click, scroll, write, wait, press and more before extracting content.
#### Media Parsing
Firecrawl can parse and output content from web hosted pdfs, docx, and more.
Our Wall of Love
## Don't take our word for it

### Morgan Linton
[@morganlinton](https://x.com/morganlinton/status/1839454165703204955)
If you're coding with AI, and haven't discovered @firecrawl\_dev yet, prepare to have your mind blown đ¤Ż

### Chris DeWeese
[@ChrisDevApps](https://x.com/ChrisDevApps/status/1853587120406876601)
Started using @firecrawl\_dev for a project, I wish I used this sooner.

### Bardia Pourvakil
[@thepericulum](https://twitter.com/thepericulum/status/1781397799487078874)
The Firecrawl team ships. I wanted types for their node SDK, and less than an hour later, I got them.

### Tom Reppelin
[@TomReppelin](https://x.com/TomReppelin/status/1844382491014201613)
I found gold today. Thank you @firecrawl\_dev

### latentsauce đ§đ˝
[@latentsauce](https://twitter.com/latentsauce/status/1781738253927735331)
Firecrawl simplifies data preparation significantly, exactly what I was hoping for. Thank you Firecrawl â¤ď¸â¤ď¸â¤ď¸

### Morgan Linton
[@morganlinton](https://x.com/morganlinton/status/1839454165703204955)
If you're coding with AI, and haven't discovered @firecrawl\_dev yet, prepare to have your mind blown đ¤Ż

### Chris DeWeese
[@ChrisDevApps](https://x.com/ChrisDevApps/status/1853587120406876601)
Started using @firecrawl\_dev for a project, I wish I used this sooner.

### Bardia Pourvakil
[@thepericulum](https://twitter.com/thepericulum/status/1781397799487078874)
The Firecrawl team ships. I wanted types for their node SDK, and less than an hour later, I got them.

### Tom Reppelin
[@TomReppelin](https://x.com/TomReppelin/status/1844382491014201613)
I found gold today. Thank you @firecrawl\_dev

### latentsauce đ§đ˝
[@latentsauce](https://twitter.com/latentsauce/status/1781738253927735331)
Firecrawl simplifies data preparation significantly, exactly what I was hoping for. Thank you Firecrawl â¤ď¸â¤ď¸â¤ď¸

### Michael Ning
Firecrawl is impressive, saving us 2/3 the tokens and allowing gpt3.5turbo use over gpt4. Major savings in time and money.

### Alex Reibman đď¸
[@AlexReibman](https://twitter.com/AlexReibman/status/1780299595484131836)
Moved our internal agent's web scraping tool from Apify to Firecrawl because it benchmarked 50x faster with AgentOps.

### Alex Fazio
[@alxfazio](https://x.com/alxfazio/status/1826731977283641615)
Semantic scraping with Firecrawl is đĽ!

### Matt Busigin
[@mbusigin](https://x.com/mbusigin/status/1836065372010656069)
Firecrawl is dope. Congrats guys đ

### Michael Ning
Firecrawl is impressive, saving us 2/3 the tokens and allowing gpt3.5turbo use over gpt4. Major savings in time and money.

### Alex Reibman đď¸
[@AlexReibman](https://twitter.com/AlexReibman/status/1780299595484131836)
Moved our internal agent's web scraping tool from Apify to Firecrawl because it benchmarked 50x faster with AgentOps.

### Alex Fazio
[@alxfazio](https://x.com/alxfazio/status/1826731977283641615)
Semantic scraping with Firecrawl is đĽ!

### Matt Busigin
[@mbusigin](https://x.com/mbusigin/status/1836065372010656069)
Firecrawl is dope. Congrats guys đ
Transparent
## Flexible Pricing
Start for free, then scale as you grow
Standard [Extract](https://www.firecrawl.dev/extract#pricing)
Monthly
Yearly
20% off\- 2 months free
## Free Plan
500 credits
$0 one-time
No credit card requiredGet Started
- Scrape 500 pages
- 10 /scrape per min
- 1 /crawl per min
## Hobby
3,000 credits per month
$16/month
$228/yr$190/yr (Billed annually)
Subscribe$190/yr
- Scrape 3,000 pages\*
- 20 /scrape per min
- 3 /crawl per min
- 1 seat
## StandardMost Popular
100,000 credits per month
$83/month
$1188/yr$990/yr (Billed annually)
Subscribe$990/yr
- Scrape 100,000 pages\*
- 100 /scrape per min
- 10 /crawl per min
- 3 seats
- Standard Support
## Growth
500,000 credits per month
$333/month
$4788/yr$3990/yr (Billed annually)
Subscribe$3990/yr
- Scrape 500,000 pages\*
- 1000 /scrape per min
- 50 /crawl per min
- 5 seats
- Priority Support
## Add-ons
### Auto Recharge Credits
Automatically recharge your credits when you run low
$11/mo for 1000 credits
Enable Auto Recharge
Subscribe to a plan to enable auto recharge
### Credit Pack
Purchase a pack of additional monthly credits
$9/mo for 1000 credits
Purchase Credit Pack
Subscribe to a plan to purchase credit packs
## Enterprise Plan
Unlimited credits. Custom RPMs.
Talk to us
- Bulk discounts
- Top priority support
- Custom concurrency limits
- Improved Stealth Proxies
- SLAs
- Advanced Security & Controls
\\* a /scrape refers to the [scrape](https://docs.firecrawl.dev/api-reference/endpoint/scrape) API endpoint. Structured extraction costs vary. See [credits table](https://www.firecrawl.dev/pricing#credits).
\\* a /crawl refers to the [crawl](https://docs.firecrawl.dev/api-reference/endpoint/crawl) API endpoint.
## API Credits
Credits are consumed for each API request, varying by endpoint and feature.
| Features | Credits |
| --- | --- |
| Scrape(/scrape) | 1 / page |
| with JSON format | 5 / page |
| Crawl(/crawl) | 1 / page |
| Map (/map) | 1 / call |
| Search(/search) | 1 / page |
| Extract (/extract) | New [Separate Pricing](https://www.firecrawl.dev/extract#pricing) |
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
FAQ
## Frequently Asked
Everything you need to know about Firecrawl
### General
### What is Firecrawl?
### What sites work?
### Who can benefit from using Firecrawl?
### Is Firecrawl open-source?
### What is the difference between Firecrawl and other web scrapers?
### What is the difference between the open-source version and the hosted version?
### Scraping & Crawling
### How does Firecrawl handle dynamic content on websites?
### Why is it not crawling all the pages?
### Can Firecrawl crawl websites without a sitemap?
### What formats can Firecrawl convert web data into?
### How does Firecrawl ensure the cleanliness of the data?
### Is Firecrawl suitable for large-scale data scraping projects?
### Does it respect robots.txt?
### What measures does Firecrawl take to handle web scraping challenges like rate limits and caching?
### Does Firecrawl handle captcha or authentication?
### API Related
### Where can I find my API key?
### Billing
### Is Firecrawl free?
### Is there a pay per use plan instead of monthly?
### How many credits do scraping, crawling, and extraction cost?
### Do you charge for failed requests (scrape, crawl, extract)?
### What payment methods do you accept?
## Flexible Web Scraping Pricing
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Transparent
## Flexible Pricing
Start for free, then scale as you grow
Standard [Extract](https://www.firecrawl.dev/extract#pricing)
Monthly
Yearly
20% off\- 2 months free
## Free Plan
500 credits
$0 one-time
No credit card requiredGet Started
- Scrape 500 pages
- 10 /scrape per min
- 1 /crawl per min
## Hobby
3,000 creditsper month
$16/month
$228/yr$190/yr(Billed annually)
Subscribe$190/yr
- Scrape 3,000 pages\*
- 20 /scrape per min
- 3 /crawl per min
- 1 seat
## StandardMost Popular
100,000 creditsper month
$83/month
$1188/yr$990/yr(Billed annually)
Subscribe$990/yr
- Scrape 100,000 pages\*
- 100 /scrape per min
- 10 /crawl per min
- 3 seats
- Standard Support
## Growth
500,000 creditsper month
$333/month
$4788/yr$3990/yr(Billed annually)
Subscribe$3990/yr
- Scrape 500,000 pages\*
- 1000 /scrape per min
- 50 /crawl per min
- 5 seats
- Priority Support
## Add-ons
### Auto Recharge Credits
Automatically recharge your credits when you run low
$11/mo for 1000 credits
Enable Auto Recharge
Subscribe to a plan to enable auto recharge
### Credit Pack
Purchase a pack of additional monthly credits
$9/mo for 1000 credits
Purchase Credit Pack
Subscribe to a plan to purchase credit packs
## Enterprise Plan
Unlimited credits. Custom RPMs.
Talk to us
- Bulk discounts
- Top priority support
- Custom concurrency limits
- Improved Stealth Proxies
- SLAs
- Advanced Security & Controls
\\* a /scrape refers to the [scrape](https://docs.firecrawl.dev/api-reference/endpoint/scrape) API endpoint. Structured extraction costs vary. See [credits table](https://www.firecrawl.dev/pricing#credits).
\\* a /crawl refers to the [crawl](https://docs.firecrawl.dev/api-reference/endpoint/crawl) API endpoint.
## API Credits
Credits are consumed for each API request, varying by endpoint and feature.
| Features | Credits |
| --- | --- |
| Scrape(/scrape) | 1 / page |
| with JSON format | 5 / page |
| Crawl(/crawl) | 1 / page |
| Map(/map) | 1 / call |
| Search(/search) | 1 / page |
| Extract(/extract) | New [Separate Pricing](https://www.firecrawl.dev/extract#pricing) |
Our Wall of Love
## Don't take our word for it

### Morgan Linton
[@morganlinton](https://x.com/morganlinton/status/1839454165703204955)
If you're coding with AI, and haven't discovered @firecrawl\_dev yet, prepare to have your mind blown đ¤Ż

### Chris DeWeese
[@ChrisDevApps](https://x.com/ChrisDevApps/status/1853587120406876601)
Started using @firecrawl\_dev for a project, I wish I used this sooner.

### Bardia Pourvakil
[@thepericulum](https://twitter.com/thepericulum/status/1781397799487078874)
The Firecrawl team ships. I wanted types for their node SDK, and less than an hour later, I got them.

### Tom Reppelin
[@TomReppelin](https://x.com/TomReppelin/status/1844382491014201613)
I found gold today. Thank you @firecrawl\_dev

### latentsauce đ§đ˝
[@latentsauce](https://twitter.com/latentsauce/status/1781738253927735331)
Firecrawl simplifies data preparation significantly, exactly what I was hoping for. Thank you Firecrawl â¤ď¸â¤ď¸â¤ď¸

### Morgan Linton
[@morganlinton](https://x.com/morganlinton/status/1839454165703204955)
If you're coding with AI, and haven't discovered @firecrawl\_dev yet, prepare to have your mind blown đ¤Ż

### Chris DeWeese
[@ChrisDevApps](https://x.com/ChrisDevApps/status/1853587120406876601)
Started using @firecrawl\_dev for a project, I wish I used this sooner.

### Bardia Pourvakil
[@thepericulum](https://twitter.com/thepericulum/status/1781397799487078874)
The Firecrawl team ships. I wanted types for their node SDK, and less than an hour later, I got them.

### Tom Reppelin
[@TomReppelin](https://x.com/TomReppelin/status/1844382491014201613)
I found gold today. Thank you @firecrawl\_dev

### latentsauce đ§đ˝
[@latentsauce](https://twitter.com/latentsauce/status/1781738253927735331)
Firecrawl simplifies data preparation significantly, exactly what I was hoping for. Thank you Firecrawl â¤ď¸â¤ď¸â¤ď¸

### Michael Ning
Firecrawl is impressive, saving us 2/3 the tokens and allowing gpt3.5turbo use over gpt4. Major savings in time and money.

### Alex Reibman đď¸
[@AlexReibman](https://twitter.com/AlexReibman/status/1780299595484131836)
Moved our internal agent's web scraping tool from Apify to Firecrawl because it benchmarked 50x faster with AgentOps.

### Alex Fazio
[@alxfazio](https://x.com/alxfazio/status/1826731977283641615)
Semantic scraping with Firecrawl is đĽ!

### Matt Busigin
[@mbusigin](https://x.com/mbusigin/status/1836065372010656069)
Firecrawl is dope. Congrats guys đ

### Michael Ning
Firecrawl is impressive, saving us 2/3 the tokens and allowing gpt3.5turbo use over gpt4. Major savings in time and money.

### Alex Reibman đď¸
[@AlexReibman](https://twitter.com/AlexReibman/status/1780299595484131836)
Moved our internal agent's web scraping tool from Apify to Firecrawl because it benchmarked 50x faster with AgentOps.

### Alex Fazio
[@alxfazio](https://x.com/alxfazio/status/1826731977283641615)
Semantic scraping with Firecrawl is đĽ!

### Matt Busigin
[@mbusigin](https://x.com/mbusigin/status/1836065372010656069)
Firecrawl is dope. Congrats guys đ
## Web Scraping and AI
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
[\\
\\
Feb 26, 2025\\
\\
**LLM API Engine: How to Build a Dynamic API Generation Engine Powered by Firecrawl** \\
\\
Learn how to build a dynamic API generation engine that transforms unstructured web data into clean, structured APIs using natural language descriptions instead of code, powered by Firecrawl's intelligent web scraping and OpenAI.\\
\\
By Bex Tuychiev](https://www.firecrawl.dev/blog/llm-api-engine-dynamic-api-generation-explainer)
## Explore Articles
[All](https://www.firecrawl.dev/blog) [Product Updates](https://www.firecrawl.dev/blog/category/product) [Tutorials](https://www.firecrawl.dev/blog/category/tutorials) [Customer Stories](https://www.firecrawl.dev/blog/category/customer-stories) [Tips & Resources](https://www.firecrawl.dev/blog/category/tips-and-resources)
[\\
**Building a Clone of OpenAI's Deep Research with TypeScript and Firecrawl** \\
Learn how to build an open-source alternative to OpenAI's Deep Research using TypeScript, Firecrawl, and LLMs. This tutorial covers web scraping, AI processing, and building a performant research platform.\\
\\
By Bex TuychievFeb 24, 2025](https://www.firecrawl.dev/blog/open-deep-research-explainer)
[\\
**How to Create Custom Instruction Datasets for LLM Fine-tuning** \\
Learn how to build high-quality instruction datasets for fine-tuning large language models (LLMs). This guide covers when to create custom datasets, best practices for data collection and curation, and a practical example of building a code documentation dataset.\\
\\
By Bex TuychievFeb 18, 2025](https://www.firecrawl.dev/blog/custom-instruction-datasets-llm-fine-tuning)
[\\
**Fine-tuning DeepSeek R1 on a Custom Instructions Dataset** \\
A comprehensive guide on fine-tuning DeepSeek R1 language models using custom instruction datasets, covering model selection, dataset preparation, and practical implementation steps.\\
\\
By Bex TuychievFeb 18, 2025](https://www.firecrawl.dev/blog/fine-tuning-deepseek)
[\\
**How Replit Uses Firecrawl to Power Replit Agent** \\
Discover how Replit leverages Firecrawl to keep Replit Agent up to date with the latest API documentation and web content.\\
\\
By Zhen LiFeb 17, 2025](https://www.firecrawl.dev/blog/how-replit-uses-firecrawl-to-power-ai-agents)
[\\
**Building an Intelligent Code Documentation RAG Assistant with DeepSeek and Firecrawl** \\
Learn how to build an intelligent documentation assistant powered by DeepSeek and RAG (Retrieval Augmented Generation) that can answer questions about any documentation website by combining language models with efficient information retrieval.\\
\\
By Bex TuychievFeb 10, 2025](https://www.firecrawl.dev/blog/deepseek-rag-documentation-assistant)
[\\
**Automated Data Collection - A Comprehensive Guide** \\
Learn how to build robust automated data collection systems using modern tools and best practices. This guide covers everything from selecting the right tools to implementing scalable collection pipelines.\\
\\
By Bex TuychievFeb 2, 2025](https://www.firecrawl.dev/blog/automated-data-collection-guide)
[\\
**Building an AI Resume Job Matching App With Firecrawl And Claude** \\
Learn how to build an AI-powered job matching system that automatically scrapes job postings, parses resumes, evaluates opportunities using Claude, and sends Discord alerts for matching positions using Firecrawl, Streamlit, and Supabase.\\
\\
By Bex TuychievFeb 1, 2025](https://www.firecrawl.dev/blog/ai-resume-parser-job-matcher-python)
[\\
**Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude** \\
Learn how to build a web scraper in Python that gathers company details, funding rounds, and investor information from public sources like Crunchbase using Firecrawl and Claude for automated data collection and analysis.\\
\\
By Bex TuychievJan 31, 2025](https://www.firecrawl.dev/blog/crunchbase-scraping-with-firecrawl-claude)
[\\
**Mastering the Extract Endpoint in Firecrawl** \\
Learn how to use Firecrawl's extract endpoint to automatically gather structured data from any website using AI. Build powerful web scrapers, create training datasets, and enrich your data without writing complex code.\\
\\
By Bex TuychievJan 23, 2025](https://www.firecrawl.dev/blog/mastering-firecrawl-extract-endpoint)
[\\
**Introducing /extract: Get structured web data with just a prompt** \\
Our new /extract endpoint harnesses AI to turn any website into structured data for your applications seamlessly.\\
\\
By Eric CiarlaJanuary 20, 2025](https://www.firecrawl.dev/blog/introducing-extract-open-beta)
[\\
**How to Build a Bulk Sales Lead Extractor in Python Using AI** \\
Learn how to build an automated sales lead extraction tool in Python that uses AI to scrape company information from websites, exports data to Excel, and streamlines the lead generation process using Firecrawl and Streamlit.\\
\\
By Bex TuychievJan 12, 2025](https://www.firecrawl.dev/blog/sales-lead-extractor-python-ai)
[\\
**Building a Trend Detection System with AI in TypeScript: A Step-by-Step Guide** \\
Learn how to build an automated trend detection system in TypeScript that monitors social media and news sites, analyzes content with AI, and sends real-time Slack alerts using Firecrawl, Together AI, and GitHub Actions.\\
\\
By Bex TuychievJan 11, 2025](https://www.firecrawl.dev/blog/trend-finder-typescript)
[\\
**How to Build an Automated Competitor Price Monitoring System with Python** \\
Learn how to build an automated competitor price monitoring system in Python that tracks prices across e-commerce sites, provides real-time comparisons, and maintains price history using Firecrawl, Streamlit, and GitHub Actions.\\
\\
By Bex TuychievJan 6, 2025](https://www.firecrawl.dev/blog/automated-competitor-price-scraping)
[\\
**How Stack AI Uses Firecrawl to Power AI Agents** \\
Discover how Stack AI leverages Firecrawl to seamlessly feed agentic AI workflows with high-quality web data.\\
\\
By Jonathan KleimanJan 3, 2025](https://www.firecrawl.dev/blog/how-stack-ai-uses-firecrawl-to-power-ai-agents)
[\\
**BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python** \\
Learn the key differences between BeautifulSoup4 and Scrapy for web scraping in Python. Compare their features, performance, and use cases to choose the right tool for your web scraping needs.\\
\\
By Bex TuychievDec 24, 2024](https://www.firecrawl.dev/blog/beautifulsoup4-vs-scrapy-comparison)
[\\
**15 Python Web Scraping Projects: From Beginner to Advanced** \\
Explore 15 hands-on web scraping projects in Python, from beginner to advanced level. Learn essential concepts like data extraction, concurrent processing, and distributed systems while building real-world applications.\\
\\
By Bex TuychievDec 17, 2024](https://www.firecrawl.dev/blog/python-web-scraping-projects)
[\\
**How to Deploy Python Web Scrapers** \\
Learn how to deploy Python web scrapers using GitHub Actions, Heroku, PythonAnywhere and more.\\
\\
By Bex TuychievDec 16, 2024](https://www.firecrawl.dev/blog/deploy-web-scrapers)
[\\
**Why Companies Need a Data Strategy for Generative AI** \\
Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.\\
\\
By Eric CiarlaDec 15, 2024](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai)
[\\
**Data Enrichment: A Complete Guide to Enhancing Your Data Quality** \\
Learn how to enrich your data quality with a comprehensive guide covering data enrichment tools, best practices, and real-world examples. Discover how to leverage modern solutions like Firecrawl to automate data collection, validation, and integration for better business insights.\\
\\
By Bex TuychievDec 14, 2024](https://www.firecrawl.dev/blog/complete-guide-to-data-enrichment)
[\\
**A Complete Guide Scraping Authenticated Websites with cURL and Firecrawl** \\
Learn how to scrape login-protected websites using cURL and Firecrawl API. Step-by-step guide covering basic auth, tokens, and cookies with real examples.\\
\\
By Rudrank RiyamDec 13, 2024](https://www.firecrawl.dev/blog/complete-guide-to-curl-authentication-firecrawl-api)
[\\
**Building an Automated Price Tracking Tool** \\
Build an automated e-commerce price tracker in Python. Learn web scraping, price monitoring, and automated alerts using Firecrawl, Streamlit, PostgreSQL.\\
\\
By Bex TuychievDec 9, 2024](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python)
[\\
**Evaluating Web Data Extraction with CrawlBench** \\
An in-depth exploration of CrawlBench, a benchmark for testing LLM-based web data extraction.\\
\\
By SwyxDec 9, 2024](https://www.firecrawl.dev/blog/crawlbench-llm-extraction)
[\\
**How Cargo Empowers GTM Teams with Firecrawl** \\
See how Cargo uses Firecrawl to instantly analyze webpage content and power Go-To-Market workflows for their users.\\
\\
By Tariq MinhasDec 6, 2024](https://www.firecrawl.dev/blog/how-cargo-empowers-gtm-teams-with-firecrawl)
[\\
**Web Scraping Automation: How to Run Scrapers on a Schedule** \\
Learn how to automate web scraping in Python using free tools like schedule, asyncio, cron jobs and GitHub Actions. This comprehensive guide covers local and cloud-based scheduling methods to run scrapers reliably in 2025.\\
\\
By Bex TuychievDec 5, 2024](https://www.firecrawl.dev/blog/automated-web-scraping-free-2025)
[\\
**How to Generate Sitemaps Using Firecrawl's /map Endpoint: A Complete Guide** \\
Learn how to generate XML and visual sitemaps using Firecrawl's /map endpoint. Step-by-step guide with Python code examples, performance comparisons, and interactive visualization techniques for effective website mapping.\\
\\
By Bex TuychievNov 29, 2024](https://www.firecrawl.dev/blog/how-to-generate-sitemaps-using-firecrawl-map-endpoint)
[\\
**How to Use Firecrawl's Scrape API: Complete Web Scraping Tutorial** \\
Learn how to scrape websites using Firecrawl's /scrape endpoint. Master JavaScript rendering, structured data extraction, and batch operations with Python code examples.\\
\\
By Bex TuychievNov 25, 2024](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint)
[\\
**How to Create an llms.txt File for Any Website** \\
Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.\\
\\
By Eric CiarlaNov 22, 2024](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website)
[\\
**Mastering Firecrawl's Crawl Endpoint: A Complete Web Scraping Guide** \\
Learn how to use Firecrawl's /crawl endpoint for efficient web scraping. Master URL control, performance optimization, and integration with LangChain for AI-powered data extraction.\\
\\
By Bex TuychievNov 18, 2024](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl)
[\\
**Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses** \\
A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.\\
\\
By Eric CiarlaNov 5, 2024](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai)
[\\
**Launch Week II Recap** \\
Recapping all the exciting announcements from Firecrawl's second Launch Week.\\
\\
By Eric CiarlaNovember 4, 2024](https://www.firecrawl.dev/blog/launch-week-ii-recap)
[\\
**Launch Week II - Day 7: Introducing Faster Markdown Parsing** \\
Our new HTML to Markdown parser is 4x faster, more reliable, and produces cleaner Markdown, built from the ground up for speed and performance.\\
\\
By Eric CiarlaNovember 3, 2024](https://www.firecrawl.dev/blog/launch-week-ii-day-7-introducing-faster-markdown-parsing)
[\\
**Launch Week II - Day 6: Introducing Mobile Scraping and Mobile Screenshots** \\
Interact with sites as if from a mobile device using Firecrawl's new mobile device emulation.\\
\\
By Eric CiarlaNovember 2, 2024](https://www.firecrawl.dev/blog/launch-week-ii-day-6-introducing-mobile-scraping)
[\\
**Launch Week II - Day 5: Introducing New Actions** \\
Capture page content at any point and wait for specific elements with our new Scrape and Wait for Selector actions.\\
\\
By Eric CiarlaNovember 1, 2024](https://www.firecrawl.dev/blog/launch-week-ii-day-5-introducing-two-new-actions)
[\\
**Launch Week II - Day 4: Advanced iframe Scraping** \\
We are thrilled to announce comprehensive iframe scraping support in Firecrawl, enabling seamless handling of nested iframes, dynamically loaded content, and cross-origin frames.\\
\\
By Eric CiarlaOctober 31, 2024](https://www.firecrawl.dev/blog/launch-week-ii-day-4-advanced-iframe-scraping)
[\\
**Launch Week II - Day 3: Introducing Credit Packs** \\
Easily top up your plan with Credit Packs to keep your web scraping projects running smoothly. Plus, manage your credits effortlessly with our new Auto Recharge feature.\\
\\
By Eric CiarlaOctober 30, 2024](https://www.firecrawl.dev/blog/launch-week-ii-day-3-introducing-credit-packs)
[\\
**Launch Week II - Day 2: Introducing Location and Language Settings** \\
Specify country and preferred languages to get relevant localized content, enhancing your web scraping results with region-specific data.\\
\\
By Eric CiarlaOctober 29, 2024](https://www.firecrawl.dev/blog/launch-week-ii-day-2-introducing-location-language-settings)
[\\
**Launch Week II - Day 1: Introducing the Batch Scrape Endpoint** \\
Our new Batch Scrape endpoint lets you scrape multiple URLs simultaneously, making bulk data collection faster and more efficient.\\
\\
By Eric CiarlaOctober 28, 2024](https://www.firecrawl.dev/blog/launch-week-ii-day-1-introducing-batch-scrape-endpoint)
[\\
**Getting Started with Grok-2: Setup and Web Crawler Example** \\
A detailed guide on setting up Grok-2 and building a web crawler using Firecrawl.\\
\\
By Nicolas CamaraOct 21, 2024](https://www.firecrawl.dev/blog/grok-2-setup-and-web-crawler-example)
[\\
**OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website** \\
A guide to building a multi-agent system using OpenAI Swarm and Firecrawl for AI-driven marketing strategies\\
\\
By Nicolas CamaraOct 12, 2024](https://www.firecrawl.dev/blog/openai-swarm-agent-tutorial)
[\\
**Using OpenAI's Realtime API and Firecrawl to Talk with Any Website** \\
Build a real-time conversational agent that interacts with any website using OpenAI's Realtime API and Firecrawl.\\
\\
By Nicolas CamaraOct 11, 2024](https://www.firecrawl.dev/blog/How-to-Talk-with-Any-Website-Using-OpenAIs-Realtime-API-and-Firecrawl)
[\\
**Scraping Job Boards Using Firecrawl Actions and OpenAI** \\
A step-by-step guide to scraping job boards and extracting structured data using Firecrawl and OpenAI.\\
\\
By Eric CiarlaSept 27, 2024](https://www.firecrawl.dev/blog/scrape-job-boards-firecrawl-openai)
[\\
**Build a Full-Stack AI Web App in 12 Minutes** \\
Build a Full-Stack AI Web App in 12 minutes with Cursor, OpenAI o1, V0, Firecrawl & Patched\\
\\
By Dev DigestSep 18, 2024](https://www.firecrawl.dev/blog/Build-a-Full-Stack-AI-Web-App-in-12-Minutes)
[\\
**How to Use OpenAI's o1 Reasoning Models in Your Applications** \\
Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.\\
\\
By Eric CiarlaSep 16, 2024](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications)
[\\
**Handling 300k requests per day: an adventure in scaling** \\
Putting out fires was taking up all our time, and we had to scale fast. This is how we did it.\\
\\
By GergĹ MĂłricz (mogery)Sep 13, 2024](https://www.firecrawl.dev/blog/an-adventure-in-scaling)
[\\
**How Athena Intelligence Empowers Enterprise Analysts with Firecrawl** \\
Discover how Athena Intelligence leverages Firecrawl to fuel its AI-native analytics platform for enterprise analysts.\\
\\
By Ben ReillySep 10, 2024](https://www.firecrawl.dev/blog/how-athena-intelligence-empowers-analysts-with-firecrawl)
[\\
**Launch Week I Recap** \\
A look back at the new features and updates introduced during Firecrawl's inaugural Launch Week.\\
\\
By Eric CiarlaSeptember 2, 2024](https://www.firecrawl.dev/blog/firecrawl-launch-week-1-recap)
[\\
**Launch Week I / Day 7: Crawl Webhooks (v1)** \\
New /crawl webhook support. Send notifications to your apps during a crawl.\\
\\
By Nicolas CamaraSeptember 1, 2024](https://www.firecrawl.dev/blog/launch-week-i-day-7-webhooks)
[\\
**Launch Week I / Day 6: LLM Extract (v1)** \\
Extract structured data from your web pages using the extract format in /scrape.\\
\\
By Nicolas CamaraAugust 31, 2024](https://www.firecrawl.dev/blog/launch-week-i-day-6-llm-extract)
[\\
**Launch Week I / Day 5: Real-Time Crawling with WebSockets** \\
Our new WebSocket-based method for real-time data extraction and monitoring.\\
\\
By Eric CiarlaAugust 30, 2024](https://www.firecrawl.dev/blog/launch-week-i-day-5-real-time-crawling-websockets)
[\\
**Launch Week I / Day 4: Introducing Firecrawl /v1** \\
Our biggest release yet - v1, a more reliable and developer-friendly API for seamless web data gathering.\\
\\
By Eric CiarlaAugust 29, 2024](https://www.firecrawl.dev/blog/launch-week-i-day-4-introducing-firecrawl-v1)
[\\
**Launch Week I / Day 3: Introducing the Map Endpoint** \\
Our new Map endpoint enables lightning-fast website mapping for enhanced web scraping projects.\\
\\
By Eric CiarlaAugust 28, 2024](https://www.firecrawl.dev/blog/launch-week-i-day-3-introducing-map-endpoint)
[\\
**Launch Week I / Day 2: 2x Rate Limits** \\
Firecrawl doubles rate limits across all plans, supercharging your web scraping capabilities.\\
\\
By Eric CiarlaAugust 27, 2024](https://www.firecrawl.dev/blog/launch-week-i-day-2-doubled-rate-limits)
[\\
**Launch Week I / Day 1: Introducing Teams** \\
Our new Teams feature, enabling seamless collaboration on web scraping projects.\\
\\
By Eric CiarlaAugust 26, 2024](https://www.firecrawl.dev/blog/launch-week-i-day-1-introducing-teams)
[\\
**How to Use Prompt Caching and Cache Control with Anthropic Models** \\
Learn how to cache large context prompts with Anthropic Models like Opus, Sonnet, and Haiku for faster and cheaper chats that analyze website data.\\
\\
By Eric CiarlaAug 14, 2024](https://www.firecrawl.dev/blog/using-prompt-caching-with-anthropic)
[\\
**Building Knowledge Graphs from Web Data using CAMEL-AI and Firecrawl** \\
A guide on constructing knowledge graphs from web pages using CAMEL-AI and Firecrawl\\
\\
By Wendong FanAug 13, 2024](https://www.firecrawl.dev/blog/building-knowledge-graphs-from-web-data-camelai-firecrawl)
[\\
**How Gamma Supercharges Onboarding with Firecrawl** \\
See how Gamma uses Firecrawl to instantly generate websites and presentations to 20+ million users.\\
\\
By Jon NoronhaAug 8, 2024](https://www.firecrawl.dev/blog/how-gamma-supercharges-onboarding-with-firecrawl)
[\\
**How to Use OpenAI's Structured Outputs and JSON Strict Mode** \\
A guide for getting structured data from the latest OpenAI models.\\
\\
By Eric CiarlaAug 7, 2024](https://www.firecrawl.dev/blog/using-structured-output-and-json-strict-mode-openai)
[\\
**Introducing Fire Engine for Firecrawl** \\
The most scalable, reliable, and fast way to get web data for Firecrawl.\\
\\
By Eric CiarlaAug 6, 2024](https://www.firecrawl.dev/blog/introducing-fire-engine-for-firecrawl)
[\\
**Firecrawl July 2024 Updates** \\
Discover the latest features, integrations, and improvements in Firecrawl for July 2024.\\
\\
By Eric CiarlaJuly 31, 2024](https://www.firecrawl.dev/blog/firecrawl-july-2024-updates)
[\\
**Firecrawl June 2024 Updates** \\
Discover the latest features, integrations, and improvements in Firecrawl for June 2024.\\
\\
By Nicolas CamaraJune 30, 2024](https://www.firecrawl.dev/blog/firecrawl-june-2024-updates)
[\\
**Scrape and Analyze Airbnb Data with Firecrawl and E2B** \\
Learn how to scrape and analyze Airbnb data using Firecrawl and E2B in a few lines of code.\\
\\
By Nicolas CamaraMay 23, 2024](https://www.firecrawl.dev/blog/scrape-analyze-airbnb-data-with-e2b)
[\\
**Build a 'Chat with website' using Groq Llama 3** \\
Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.\\
\\
By Nicolas CamaraMay 22, 2024](https://www.firecrawl.dev/blog/chat-with-website)
[\\
**Using LLM Extraction for Customer Insights** \\
Using LLM Extraction for Insights and Lead Generation using Make and Firecrawl.\\
\\
By Caleb PefferMay 21, 2024](https://www.firecrawl.dev/blog/lead-gen-business-insights-make-firecrawl)
[\\
**Extract website data using LLMs** \\
Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code.\\
\\
By Nicolas CamaraMay 20, 2024](https://www.firecrawl.dev/blog/data-extraction-using-llms)
[\\
**Build an agent that checks for website contradictions** \\
Using Firecrawl and Claude to scrape your website's data and look for contradictions.\\
\\
By Eric CiarlaMay 19, 2024](https://www.firecrawl.dev/blog/contradiction-agent)
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## Firecrawl Changelog Updates
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
## ChangelogNew
- Feb 20, 2025
## Self Host Overhaul - v1.5.0
### Self-Host Fixes
- **Reworked Guide:** The `SELF_HOST.md` and `docker-compose.yaml` have been updated for clarity and compatibility
- **Kubernetes Imporvements:** Updated self-hosted Kubernetes deployment examples for compatibility and consistency (#1177)
- **Self-Host Fixes:** Numerous fixes aimed at improving self-host performance and stability (#1207)
- **Proxy Support:** Added proxy support tailored for self-hosted environments (#1212)
- **Playwright Integration:** Added fixes and continuous integration for the Playwright microservice (#1210)
- **Search Endpoint Upgrade:** Added SearXNG support for the `/search` endpoint (#1193)
### Core Fixes & Enhancements
- **Crawl Status Fixes:** Fixed various race conditions in the crawl status endpoint (#1184)
- **Timeout Enforcement:** Added timeout for scrapeURL engines to prevent hanging requests (#1183)
- **Query Parameter Retention:** Map function now preserves query parameters in results (#1191)
- **Screenshot Action Order:** Ensured screenshots execute after specified actions (#1192)
- **PDF Scraping:** Improved handling for PDFs behind anti-bot measures (#1198)
- **Map/scrapeURL Abort Control:** Integrated AbortController to stop scraping when the request times out (#1205)
- **SDK Timeout Enforcement:** Enforced request timeouts in the SDK (#1204)
### New Features & Additions
- **Proxy & Stealth Options:** Introduced a proxy option and stealthProxy flag (#1196)
- **Deep Research (Alpha):** Launched an alpha implementation of deep research (#1202)
- **LLM Text Generator:** Added a new endpoint for llms.txt generation (#1201)
### Docker & Containerization
- **Production Ready Docker Image:** A streamlined, production ready Docker image is now available to simplify self-hosted deployments.
- Feb 14, 2025
## v1.4.4
### Features & Enhancements
- Scrape API: Added action & wait time validation ( [#1146](https://github.com/mendableai/firecrawl/pull/1146))
- Extraction Improvements:
- Added detection of PDF/image sub-links & extracted text via Gemini ( [#1173](https://github.com/mendableai/firecrawl/pull/1173))
- Multi-entity prompt enhancements for extraction ( [#1181](https://github.com/mendableai/firecrawl/pull/1181))
- Show sources out of \_\_experimental in extraction ( [#1180](https://github.com/mendableai/firecrawl/pull/1180))
- Environment Setup: Added Serper & Search API env vars to docker-compose ( [#1147](https://github.com/mendableai/firecrawl/pull/1147))
- Credit System Update: Now displays âtokensâ instead of âcreditsâ when out of tokens ( [#1178](https://github.com/mendableai/firecrawl/pull/1178))
### Examples
- Gemini 2.0 Crawler: Implemented new crawling example ( [#1161](https://github.com/mendableai/firecrawl/pull/1161))
- Gemini TrendFinder: [https://github.com/mendableai/gemini-trendfinder](https://github.com/mendableai/gemini-trendfinder)
- Normal Search to Open Deep Research: [https://github.com/nickscamara/open-deep-research](https://github.com/nickscamara/open-deep-research)
### Fixes
- HTML Transformer: Updated free\_string function parameter type ( [#1163](https://github.com/mendableai/firecrawl/pull/1163))
- Gemini Crawler: Updated library & improved PDF link extraction ( [#1175](https://github.com/mendableai/firecrawl/pull/1175))
- Crawl Queue Worker: Only reports successful page count in num\_docs ( [#1179](https://github.com/mendableai/firecrawl/pull/1179))
- Scraping & URLs:
- Fixed relative URL conversion ( [#584](https://github.com/mendableai/firecrawl/pull/584))
- Enforced scrape rate limit in batch scraping ( [#1182](https://github.com/mendableai/firecrawl/pull/1182))
- Feb 7, 2025
## Examples Week - v1.4.3
### Summary of changes
- Open Deep Research: An open source version of OpenAI Deep Research. See here: [https://github.com/nickscamara/open-deep-research](https://github.com/nickscamara/open-deep-research)
- R1 Web Extractor Feature: New extraction capability added.
- O3-Mini Web Crawler: Introduces a lightweight crawler for specific use cases.
- Updated Model Parameters: Enhancements to o3-mini\_company\_researcher.
- URL Deduplication: Fixes handling of URLs ending with /, index.html, index.php, etc.
- Improved URL Blocking: Uses tldts parsing for better blocklist management.
- Valid JSON via rawHtml in Scrape: Ensures valid JSON extraction.
- Product Reviews Summarizer: Implements summarization using o3-mini.
- Scrape Options for Extract: Adds more configuration options for extracting data.
- O3-Mini Job Resource Extractor: Extracts job-related resources using o3-mini.
- Cached Scrapes for Extract evals: Improves performance by using cached data for extractions evals.
- Jan 31, 2025
## Extract & API Improvements - v1.4.2
Weâre excited to announce several new features and improvements:
### New Features
- Added web search capabilities to the extract endpoint via the `enableWebSearch` parameter
- Introduced source tracking with `__experimental_showSources` parameter
- Added configurable webhook events for crawl and batch operations
- New `timeout` parameter for map endpoint
- Optional ad blocking with `blockAds` parameter (enabled by default)
### Infrastructure & UI
- Enhanced proxy selection and infrastructure reliability
- Added domain checker tool to cloud platform
- Redesigned LLMs.txt generator interface for better usability
- Jan 24, 2025
## Extract Improvements - v1.4.1
Weâve significantly enhanced our data extraction capabilities with several key updates:
- Extract now returns a lot more data
- Improved infrastructure reliability
- Migrated from Cheerio to a high-performance Rust-based parser for faster and more memory-efficient parsing
- Enhanced crawl cancellation functionality for better control over running jobs
- Jan 7, 2025
## /extract changes
We have updated the `/extract` endpoint to now be asynchronous. When you make a request to `/extract`, it will return an ID that you can use to check the status of your extract job. If you are using our SDKs, there are no changes required to your code, but please make sure to update the SDKs to the latest versions as soon as possible.
For those using the API directly, we have made it backwards compatible. However, you have 10 days to update your implementation to the new asynchronous model.
For more details about the parameters, refer to the docs sent to you.
- Jan 3, 2025
## v1.2.0
### Introducing /v1/search
The search endpoint combines web search with Firecrawlâs scraping capabilities to return full page content for any query.
Include `scrapeOptions` with `formats: ["markdown"]` to get complete markdown content for each search result otherwise it defaults to getting SERP results (url, title, description).
More info here: [v1/search docs](https://docs.firecrawl.dev/api-reference/endpoint/search)
### Fixes and improvements
- Fixed LLM not following the schema in the python SDK for `/extract`
- Fixed schema json not being able to be sent to the `/extract` endpoint through the Node SDK
- Prompt is now optional for the `/extract` endpoint
- Our fork of [MinerU](https://github.com/mendableai/mineru-api) is now default for PDF Parsing
- Dec 27, 2024
## v1.1.0
### Changelog Highlights
#### Feature Enhancements
- **New Features**:
- Geolocation, mobile scraping, 4x faster parsing, better webhooks,
- Credit packs, auto-recharges and batch scraping support.
- Iframe support and query parameter differentiation for URLs.
- Similar URL deduplication.
- Enhanced map ranking and sitemap fetching.
#### Performance Improvements
- Faster crawl status filtering and improved map ranking algorithm.
- Optimized Kubernetes setup and simplified build processes.
- Sitemap discoverability and performance improved
#### Bug Fixes
- Resolved issues:
- Badly formatted JSON, scrolling actions, and encoding errors.
- Crawl limits, relative URLs, and missing error handlers.
- Fixed self-hosted crawling inconsistencies and schema errors.
#### SDK Updates
- Added dynamic WebSocket imports with fallback support.
- Optional API keys for self-hosted instances.
- Improved error handling across SDKs.
#### Documentation Updates
- Improved API docs and examples.
- Updated self-hosting URLs and added Kubernetes optimizations.
- Added articles: mastering `/scrape` and `/crawl`.
#### Miscellaneous
- Added new Firecrawl examples
- Enhanced metadata handling for webhooks and improved sitemap fetching.
- Updated blocklist and streamlined error messages.
- Oct 28, 2024

## Introducing Batch Scrape
You can now scrape multiple URLs simultaneously with our new Batch Scrape endpoint.
- Read more about the Batch Scrape endpoint [here](https://www.firecrawl.dev/blog/launch-week-ii-day-1-introducing-batch-scrape-endpoint).
- Python SDK (1.4.x) and Node SDK (1.7.x) updated with batch scrape support.
- Oct 10, 2024
## Cancel Crawl in the SDKs, More Examples, Improved Speed
- Added crawl cancellation support for the Python SDK (1.3.x) and Node SDK (1.6.x)
- OpenAI Voice + Firecrawl example added to the repo
- CRM lead enrichment example added to the repo
- Improved our Docker images
- Limit and timeout fixes for the self hosted playwright scraper
- Improved speed of all scrapes
- Sep 27, 2024
## Fixes + Improvements (no version bump)
- Fixed 500 errors that would happen often in some crawled websites and when servers were at capacity
- Fixed an issue where v1 crawl status wouldnât properly return pages over 10mb
- Fixed an issue where `screenshot` would return undefined
- Push improvements that reduce speed times when a scraper fails
- Sep 24, 2024

## Introducing Actions
Interact with pages before extracting data, unlocking more data from every site!
Firecrawl now allows you to perform various actions on a web page before scraping its content. This is particularly useful for interacting with dynamic content, navigating through pages, or accessing content that requires user interaction.
- Version 1.5.x of the Node SDK now supports type-safe Actions.
- Actions are now available in the REST API and Python SDK (no version bumps required!).
Here is a python example of how to use actions to navigate to google.com, search for Firecrawl, click on the first result, and take a screenshot.
```python
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key="fc-YOUR_API_KEY")
# Scrape a website:
scrape_result = app.scrape_url('firecrawl.dev',
params={
'formats': ['markdown', 'html'],
'actions': [\
{"type": "wait", "milliseconds": 2000},\
{"type": "click", "selector": "textarea[title=\"Search\"]"},\
{"type": "wait", "milliseconds": 2000},\
{"type": "write", "text": "firecrawl"},\
{"type": "wait", "milliseconds": 2000},\
{"type": "press", "key": "ENTER"},\
{"type": "wait", "milliseconds": 3000},\
{"type": "click", "selector": "h3"},\
{"type": "wait", "milliseconds": 3000},\
{"type": "screenshot"}\
]
}
)
print(scrape_result)
```
For more examples, check out our [API Reference](https://docs.firecrawl.dev/api-reference/endpoint/scrape).
- Sep 23, 2024

## Mid-September Updates
### Typesafe LLM Extract
- E2E Type Safety for LLM Extract in Node SDK version 1.5.x.
- 10x cheaper in the cloud version. From 50 to 5 credits per extract.
- Improved speed and reliability.
### Rust SDK v1.0.0
- Rust SDK v1 is finally here! Check it out [here](https://crates.io/crates/firecrawl/1.0.0).
### Map Improved Limits
- Map smart results limits increased from 100 to 1000.
### Faster scrape
- Scrape speed improved by 200ms-600ms depending on the website.
### Launching changelog
- For now on, for every new release, we will be creating a changelog entry here.
### Improvements
- Lots of improvements pushed to the infra and API. For all Mid-September changes, refer to the commits [here](https://github.com/mendableai/firecrawl/commits/main/).
- Sep 8, 2024
## September 8, 2024
### Patch Notes (No version bump)
- Fixed an issue where some of the custom header params were not properly being set in v1 API. You can now pass headers to your requests just fine.
- Aug 29, 2024

## Firecrawl V1 is here! With that we introduce a more reliable and developer friendly API.
### Here is whatâs new:
- Output Formats for /scrape: Choose what formats you want your output in.
- New /map endpoint: Get most of the URLs of a webpage.
- Developer friendly API for /crawl/id status.
- 2x Rate Limits for all plans.
- Go SDK and Rust SDK.
- Teams support.
- API Key Management in the dashboard.
- onlyMainContent is now default to true.
- /crawl webhooks and websocket support.
Learn more about it [here](https://docs.firecrawl.dev/v1).
Start using v1 right away at [https://firecrawl.dev](https://firecrawl.dev/)
## Web Data Extraction Tool
Introducing **/extract** \- Now in open beta
# Get web data with a prompt
Turn entire websites into structured data with AI
From firecrawl.dev, get the pricing.
Try for Free

From **firecrawl.dev** find the company name, mission and whether it's open source.

{
"company\_name":"Firecrawl",
"company\_mission":"...",
"is\_open\_source":true,
}
A milestone in scraping
## Web scraping was hard â now effortless
Scraping the internet had everything to do with broken scripts, bad data, wasted time. With Extract, you can get any data in any format effortlessly â in a single API call.
### No more manual scraping
Extract structured data from any website using natural language prompts.
page = urlopen(url)
html = page.read().decode("utf-8")
start\_index = html.find("") + len("")
end\_index = html.find("")
title = html\[start\_index:end\_index\]
>>\> title
PromptBuild a B2B lead list from these company websites.
### Stop rewriting broken scripts
Say goodbye to fragile scrapers that break with every site update. Our AI understands content semantically and adapts automatically.
page = urlopen(url)
html = page.read().decode("utf-8")
start\_idx = html.find("") + len("")
end\_idx = html.find("")
title = html\[start\_idx:end\_idx\]
>>>Â title
page = urlopen(url)
html = page.read().decode("utf-8")
start\_idx = html.find("") + len("")
end\_idx = html.find("")
title = html\[start\_idx:end\_idx\]
>>>Â title
page = urlopen(url)
html = page.read().decode("utf-8")
start\_idx = html.find("") + len("")
end\_idx = html.find("")
title = html\[start\_idx:end\_idx\]
>>>Â title
page = urlopen(url)
html = page.read().decode("utf-8")
start\_idx = html.find("") + len("")
end\_idx = html.find("")
title = html\[start\_idx:end\_idx\]
>>>Â title
page = urlopen(url)
html = page.read().decode("utf-8")
start\_idx = html.find("") + len("")
end\_idx = html.find("")
title = html\[start\_idx:end\_idx\]
>>>Â title
page = urlopen(url)
html = page.read().decode("utf-8")
start\_idx = html.find("") + len("")
end\_idx = html.find("")
title = html\[start\_idx:end\_idx\]
>>>Â title
page = urlopen(url)
html = page.read().decode("utf-8")
start\_idx = html.find("") + len("")
end\_idx = html.find("")
title = html\[start\_idx:end\_idx\]
>>>Â title
page = urlopen(url)
html = page.read().decode("utf-8")
start\_idx = html.find("") + len("")
end\_idx = html.find("")
title = html\[start\_idx:end\_idx\]
>>>Â title
page = urlopen(url)
html = page.read().decode("utf-8")
start\_idx = html.find("") + len("")
end\_idx = html.find("")
title = html\[start\_idx:end\_idx\]
>>>Â title
page = urlopen(url)
html = page.read().decode("utf-8")
start\_idx = html.find("") + len("")
end\_idx = html.find("")
title = html\[start\_idx:end\_idx\]
>>>Â title
await firecrawl.extract(\[\
\
'https://firecrawl.dev/',\
\
\], {
prompt: "Extract mission.",
schema:Â z.object({
    mission: z.string()
})
});
### Extract entire websites in a single API call
Get the data you need with a simple API call, whether it's one page or thousands.
Try adding a wildcard /\* to the URL.It will extract information across the site.It will find and extract information across the entire website.\> app.extract(\['https://firecrawl.dev/\*'\])
### Forget fighting context windows
No context window limits. Extract thousands of results effortlessly while we handle the complex LLM work.
Extracting
Video Demo
## Use Extract for everything
From lead enrichment to AI onboarding to KYB â and more. Watch a demo of how Extract can help you get more out of your data.
Enrichment Integrations
## Enrich data anywhere you work
Integrate Extract with your favorite tools and get enriched data where you need it.
Datasets
## Build datasets spread across websites
Gather datasets from any website and use them for any enrichment task.
| | Name | Contact | Email |
| --- | --- | --- | --- |
| 1 | Sarah Johnson | +1 (555) 123-4567 | sarah.j@example.com |
| 2 | Michael Chen | +1 (555) 234-5678 | m.chen@example.com |
| 3 | Emily Williams | +1 (555) 345-6789 | e.williams@example.com |
| 4 | James Wilson | +1 (555) 456-7890 | j.wilson@example.com |
[Integrate with Zapier](https://zapier.com/apps/firecrawl/integrations)
Simple, transparent pricing
## Pricing that scales with your business
Monthly
Yearly
Save 10%\+ Get All Credits Upfront
### Free
$0
One-time
Tokens / year500,000
Rate limit10 per min
SupportCommunity
Sign Up
### Starter
$89/mo
$1,188/yr$1,068/yr(Billed annually)
Tokens / year18 million
Rate limit20 per min
SupportEmail
Subscribe
All credits granted upfront
Most Popular đĽ
### Explorer
$359/mo
$4,788/yr$4,308/yr(Billed annually)
Tokens / year84 million
Rate limit100 per min
SupportSlack
Subscribe
All credits granted upfront
Best Value
### Pro
$719/mo
$9,588/yr$8,628/yr(Billed annually)
Tokens / year192 million
Rate limit1000 per min
SupportSlack + Priority
Subscribe
All credits granted upfront
### Enterprise
Custom
Billed annually
Tokens / yearNo limits
Rate limitCustom
SupportCustom (SLA, dedicated engineer)
Talk to us
Tokens / year
500,000
18 million
84 million
192 million
No limits
Rate limit
10 per min
20 per min
100 per min
1000 per min
Custom
Support
Community
Email
Slack
Slack + Priority
Custom (SLA, dedicated engineer)
All requests have a base cost of 300 tokens + [output tokens - View token calculator](https://www.firecrawl.dev/pricing?extract-pricing=true#token-calculator)
## Get started for free
500K free tokens â no credit card required!
From firecrawl.dev, get the pricing.
Try for Free
FAQ
## Frequently Asked
Everything you need to know about Extract's powerful web scraping capabilities
### How much does Extract cost?
### What is a token and how many do I need?
### How does Extract handle JavaScript-heavy websites?
### What programming languages and frameworks are supported?
### How many pages can I process in a single API call?
### How can I integrate Extract with my existing workflow?
### Does Extract work with password-protected pages?
### Can I schedule regular extractions for monitoring changes?
### What happens if a website's structure changes?
### How fresh is the extracted data?
### Can Extract handle multiple languages and international websites?
### Can I use Extract for competitor monitoring?
### How does Extract handle dynamic content like prices or inventory?
### Is Extract suitable for real-time data needs?
/extract returns a JSON in your desired format
## Web Data Playground
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
# Preview
Take a look at the API response (Preview limited to 5 pages)
Single URL(/scrape)
Crawl(/crawl)
Map(/map)
Extract(/extract)Beta
Scrape
URL
Get CodeRun
### Options
Start exploring with our playground!
## Sign In Page
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
đĽ
### Sign In
EmailPassword
Sign in
[Forgot your password?](https://www.firecrawl.dev/signin/forgot_password)
[Sign in via magic link](https://www.firecrawl.dev/signin/email_signin)
[Don't have an account? Sign up](https://www.firecrawl.dev/signin/signup)
OAuth sign-in
GitHubGoogle
## Privacy Policy Overview
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
# PRIVACY POLICY
Date of last revision: December 26, 2024
1. **Who We Are?**
The name of our company is SideGuide Technologies, Inc. d/b/a Firecrawl (âFirecrawlâ), and weâre registered as a corporation in Delaware. Firecrawl is a tool for collecting and enhancing LLM-ready data.
2. **What Is This?**
This is a privacy policy and the reason we have it is to tell you how we collect, manage, store, and use your information.
Just so weâre clear, whenever we say \*\*\*\*âwe,â âus,â âour,â or âourselvesâ, weâre talking about Firecrawl and whenever we say âyouâ or âyour,â weâre talking about the person or business who has decided to use our services, or even potentially a third party. When we talk about our services, we mean any of our platforms, websites, or apps; or any features, products, graphics, text, images, photos, audio, video, or similar things we use.
3. **Why Are We Showing You This?**
We value and respect your privacy. That is why we strive to only use your information when we think that doing so improves your experience in using our services. If you feel that we could improve in this mission in any way, or if you have a complaint or concern, please let us know by sending us your feedback to the following email address: help@firecrawl.com.
Our goal is to be as transparent and open about our use of information and data as possible, so that our users can benefit from both the way they provide information and how we use it.
This privacy policy should be read along with our Terms of Use, posted at [https://www.firecrawl.dev/terms-of-use](https://www.firecrawl.dev/terms-of-use). Thatâs another big part of what we do, so please review it and follow its process for questions or concerns about what it says there.
4. **Information Collection and Use**
In using the services, you may be asked to provide us a variety of informationâ some of which can personally identify you and some that cannot. We may collect, store, and share this personal information with third parties, but only in the ways we explain in this policy. Hereâs how we do it and why we do it:
1. **Personally Identifiable Information: How we collect it.**
Personally identifiable information (also, âPIIâ) is data that can be used to contact or identify a single person. Examples include your name, your phone number, your email, your address, and your IP address. We collect the following categories of information
- name
- email address
- payment information, including credit card information
- company Information
- IP addresses
- browser information
- timestamps
- page views,
- load times
- referrers
- device type and browser information
- information that is collected on behalf of our clients
2. **Personally Identifiable Information: How we use it.**
We use your personal information in the following ways:
- To provide you our services;
- Caching and indexing;
- To contact you via email to inform you of service issues, new features, updates, offers, and billing issues;
- To improve our website performance;
- To tailor our services to your needs and the way you use our services;
- To process payments;
- To determine how to improve our product;
- To market our services to interested customers;
- We use cookies to track unauthenticated user activity on our site;
- For advertising purposes.
3. **Who We Share Your Information With and Why**
We only share your information with third parties in the following ways and for the following purposes:
- **Stripe, Inc.** We share your email, credit cardholder name, and card number and related information to run the initial and subsequent payments for our services. This information is sent directly to Stripe through a plugin on our website. Your credit card information is stored with Stripe for subsequently billing; we do not retain your credit card information internally. Their privacy policy is here: https://stripe.com/privacy.
- **Posthog.** We share your data with Posthog to better understand user interactions (e.g., clicks, page views, events); device information (IP address, browser type); location data (based on IP). Their privacy policy is provided here: https://posthog.com/privacy
- **Crisp Chatbot.** We share names, emails, and phone numbers if provided by users; messages sent in the chat widget; IP addresses, browser information, and timestamp. We do this to communicate with our customers. Their privacy policy is provided here. https://crisp.chat/en/privacy/
- **Vercel Analytics.** We share IP addresses (used to determine visitor location); information related to referrers; Device type and browser information. This is for product and marketing analytics. Their privacy policy is provided here: https://vercel.com/legal/privacy-policy
- We will share all collected information to the extent necessary and as required by law or to comply with any legal obligations, including defense of our company.
4. **Your Choices in What Information You Share**
For users who do not register for our services or a business account, we will not collect that userâs personally identifying informationâunless that personally identifiable information is information of a customer of one of our business clients, which is shares by that business through permission obtained by the business directly from that customer.
5. **Non-Personally Identifiable Information**
Non-personally identifiable information includes general details about your device and connection (including the type of computer/mobile device, operating system, web-browser or other software, language preference, and hardware); general information from the app store or referring website; the date and time of visit or use; and, internet content provider information. We may collect this type of information.
6. **How Long We Keep Your Information**
We will retain your personally identifiable information until you request in writing that we delete or otherwise remove your personally identifiable information as part of our normal business processes. We may develop or amend a policy for deleting PII on a recurring timeline at some point in the future, but we do not currently have such a policy.
7. **Where We Keep and Transfer Your Information**
Our business is operated in the United States and, as far as we are aware, third parties with whom we share your information are as well. Our servers are located in the United States and this is where your data and information will be stored. Due to the nature of internet communications, however, such data could pass through other countries as part of the transmission process; this is also true for our clients outside the United States.
Please be aware if you are a citizen of another country, and if you live in Europe in particular, that your information will be transferred out of your home country and into the United States. The United States might not have the same level of data protection as your country provides.
Our processing of personal data from individuals is not targeted to reveal race; ethnicity; political, religious, or philosophical beliefs; trade union memberships; health; sexual activity; or, sexual orientation.
If you would like more information about this, please email us at help@firecrawl.com.
8. **EU Rights to Information**
According to the laws of the European Union (except for limited exceptions, where applicable), anyone in those countries has the right to:
- Be informed about their data and its processing;
- Have access to their data;
- Correct any errors in their data;
- Erase data from our records;
- Restrict processing and use of data;
- Data portability;
- Object to the use of their data, including for the purpose of automated profiling and direct marketing;
- Make decisions about automated decision making and profiling
We respect each of these rights for all of our users, regardless of citizenship. If you have any questions or concerns about any of these rights, or if you would like to assert any of these rights at any time, please contact help@firecrawl.com.
9. **California Residents**
The California Consumer Privacy Act (âCCPAâ) provides California residents specific rights to restrict, access, and delete their collected information. All requests under this section should be provide to help@firecrawl.com. Subject to the requirements and limitations under the CCPA, these rights include:
- Upon your written request, up to 2 times during a 12 month period, we will provide you a summary of the personal information we have for you for your review.
- Upon your written request, and absent a legal need under to retain such information, we will delete your personal information we have collected.
We may be required to make further inquiry to verify the identity of the individual requesting any action above to confirm that personâs identity prior to processing that request.
5. **Protecting Your Information**
1. **Keeping it Safe**
We make reasonable and commercially feasible efforts to keep your information safe. Though we are a small business, we have appropriate security measures in place to prevent your information from being accidentally lost, used, or accessed in an unauthorized way. We restrict access to your personal information to those who need to know it, are subject to contractual confidentiality obligations in the case of internal personnel and third-party providers, and may be disciplined or terminated if they fail to meet these obligations in terms of contractors and internal personnel. Those processing your information are tasked to do so in an authorized manner and are subject to a duty of confidentiality. We encrypt data during transit via TLS and at rest if requested.
That said, no organization or business can guarantee 100% data protection. With that in mind, we also have procedures in place to deal with any suspected data security breach. We will inform both you and any applicable authorities of a suspected data security breach, as and when required by law.
2. **Third-Party Providers**
As articulated in this privacy policy, our services utilize third-party providers, as well as providing an integration with Stripe, Inc.
We do not control those policies and terms. You should visit those providers to acquaint yourself with their policies and terms, as previously provided in this policy document. If you have any issue or concern with those terms or policies, you should address those concerns with that third-party provider.
3. **Posting Content**
If you share content with another party, including messaging customers, that information may become public through your actions or the actions of the other party. Additionally, if you post any information or content on social media, you are making that information public. You can always ask us to delete information in our possession, but we cannot force anyone else to erase your information. F
4. **Do Not Track Signals/Cookies**
Some technologies, such as web browsers or mobile devices, provide a setting that when turned on sends a Do Not Track (DNT) signal when browsing a website or app. There is currently no common standard for responding to DNT Signals or even in the DNT signal itself. We recognize and respect DNT signals.
5. **Minorsâ Data**
We do not intentionally collect minorsâ data.
If you are a parent, and you believe we have accidentally collected your childâs data, you have the right to contact us and require that we: remove and delete the personal information provided. To do so, upon you contacting us, we must take reasonable steps to confirm you are the parent. You may contact us for such a request at any time at help@firecrawl.com.
6. **Compliance with Regulations**
We regularly review our privacy policy to do our best to ensure it complies with any applicable laws. Ours is a small business, but when we receive formal written complaints, we will contact the person who made the complaint to follow up as soon as practicable. We will work with relevant regulatory authorities to resolve any complaints or concerns that we cannot resolve with our users directly.
You also have the right to file a complaint with the supervisory authority of your home country, where available, relating to the processing of any personal data you feel may have violated local regulations.
6. **General Information**
1. **No Unsolicited Personal Information Requests**
We will never ask you for your personal information in an unsolicited letter, call, or email. If you contact us, we will only use your personal information if necessary to fulfill your request.
2. **Changes**
Our business and the services we provide are constantly evolving. We may change our privacy policy at any time. If we change our policy, we will notify you of any updates to our policy. We will not reduce your rights under this policy without your consent.
3. **Complaints**
We respect the rights of all of our users, regardless of location or citizenship. If you have any questions or concerns about any of these rights, or if you would like to assert any of these rights at any time, please contact help@firecrawl.com.
4. **Questions about Policy**
If you have any questions about this privacy policy, contact us at: help@firecrawl.com. By accessing any of our services or content, you are affirming that you understand and agree with the terms of our privacy policy.
## Firecrawl Launch Week II
Oct 28 to Nov 3
# Launch Week II
Follow us on your favorite platform to hear about every newFirecrawllaunch during the week!
[X](https://x.com/firecrawl_dev)
[LinkedIn](https://www.linkedin.com/company/firecrawl)
[GitHub](https://github.com/mendableai/firecrawl) [X\\
X](https://x.com/firecrawl_dev) [LinkedIn\\
LinkedIn](https://www.linkedin.com/company/firecrawl)
[\\
\\
November 4, 2024\\
\\
**Launch Week II Recap** \\
\\
Recapping all the exciting announcements from Firecrawl's second Launch Week.\\
\\
By Eric Ciarla](https://www.firecrawl.dev/blog/launch-week-ii-recap)
[\\
\\
November 3, 2024\\
\\
**Day 7: Introducing Faster Markdown Parsing** \\
\\
Our new HTML to Markdown parser is 4x faster, more reliable, and produces cleaner Markdown, built from the ground up for speed and performance.\\
\\
By Eric Ciarla](https://www.firecrawl.dev/blog/launch-week-ii-day-7-introducing-faster-markdown-parsing)
[\\
\\
November 2, 2024\\
\\
**Day 6: Introducing Mobile Scraping and Mobile Screenshots** \\
\\
Interact with sites as if from a mobile device using Firecrawl's new mobile device emulation.\\
\\
By Eric Ciarla](https://www.firecrawl.dev/blog/launch-week-ii-day-6-introducing-mobile-scraping)
[\\
\\
November 1, 2024\\
\\
**Day 5: Introducing New Actions** \\
\\
Capture page content at any point and wait for specific elements with our new Scrape and Wait for Selector actions.\\
\\
By Eric Ciarla](https://www.firecrawl.dev/blog/launch-week-ii-day-5-introducing-two-new-actions)
[\\
\\
October 31, 2024\\
\\
**Day 4: Advanced iframe Scraping** \\
\\
We are thrilled to announce comprehensive iframe scraping support in Firecrawl, enabling seamless handling of nested iframes, dynamically loaded content, and cross-origin frames.\\
\\
By Eric Ciarla](https://www.firecrawl.dev/blog/launch-week-ii-day-4-advanced-iframe-scraping)
[\\
\\
October 30, 2024\\
\\
**Day 3: Introducing Credit Packs** \\
\\
Easily top up your plan with Credit Packs to keep your web scraping projects running smoothly. Plus, manage your credits effortlessly with our new Auto Recharge feature.\\
\\
By Eric Ciarla](https://www.firecrawl.dev/blog/launch-week-ii-day-3-introducing-credit-packs)
[\\
\\
October 29, 2024\\
\\
**Day 2: Introducing Location and Language Settings** \\
\\
Specify country and preferred languages to get relevant localized content, enhancing your web scraping results with region-specific data.\\
\\
By Eric Ciarla](https://www.firecrawl.dev/blog/launch-week-ii-day-2-introducing-location-language-settings)
[\\
\\
October 28, 2024\\
\\
**Day 1: Introducing the Batch Scrape Endpoint** \\
\\
Our new Batch Scrape endpoint lets you scrape multiple URLs simultaneously, making bulk data collection faster and more efficient.\\
\\
By Eric Ciarla](https://www.firecrawl.dev/blog/launch-week-ii-day-1-introducing-batch-scrape-endpoint)
## Password Reset Page
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
đĽ
### Reset Password
Email
Send Email
[Sign in with email and password](https://www.firecrawl.dev/signin/password_signin)
[Sign in via magic link](https://www.firecrawl.dev/signin/email_signin)
[Don't have an account? Sign up](https://www.firecrawl.dev/signin/signup)
OAuth sign-in
GitHubGoogle
## Sign Up Page
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
đĽ
### Sign Up
EmailPassword
Sign up
Already have an account?
[Sign in with email and password](https://www.firecrawl.dev/signin/password_signin)
[Sign in via magic link](https://www.firecrawl.dev/signin/email_signin)
OAuth sign-in
GitHubGoogle
## Email Sign In
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
đĽ
### Sign In
Email
Sign in
[Sign in with email and password](https://www.firecrawl.dev/signin/password_signin)
[Don't have an account? Sign up](https://www.firecrawl.dev/signin/signup)
OAuth sign-in
GitHubGoogle
## Smart Crawl
Coming Soon
# S
# m
# a
# r
# t
# C
# r
# a
# w
# l
Turn any website into an API with AI.
Join the waitlist for beta access.
Join the waitlist
[X](https://x.com/firecrawl_dev)
[LinkedIn](https://www.linkedin.com/company/sideguide-dev)
## Firecrawl Terms of Service
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
# TERMS OF USE / SERVICE AGREEMENT
Date of last revision: November 5, 2024
This terms of use or service agreement (âAgreementâ) is between SideGuide Technologies, Inc. d/b/a Firecrawl, a Delaware Corporation (âFirecrawl,â âCompany,â âwe,â âus,â âour,â or âourselvesâ) and the person or entity (âyouâ or âyourâ) that has decided to use our services; any of our websites or apps; or any features, products, graphics, text, images, photos, audio, video, location data, computer code, and all other forms of data and communications (collectively, âServicesâ).
YOU MUST CONSENT TO THIS AGREEMENT TO USE OUR SERVICES. If you do not accept and agree to be bound by all of the terms of this Agreement, including the Privacy Policy, posted at [https://www.firecrawl.dev/privacy-policy](https://www.firecrawl.dev/privacy-policy) and incorporated by reference herein, you cannot use Services.
If we update this Agreement, we will provide you notice and an opportunity to review and decide whether you would like to continue to use the Services.
1. # **Description of the Services**
Firecrawl is an API that converts any website into LLM-friendly data. It provides tools to extract structured data from web pages, ensuring the data is clean and ready for use in AI applications.
2. # **Accessing the Services**
We reserve the right to change the Services and any material we provide in the Services, in our sole discretion without notice. We will not be liable if for any reason all or any part of the Services is unavailable at any time or for any period.
3. # **Log-in Information**
If you choose, or are provided with, a username, password, or any other piece of information as part of our security procedures, you must treat such information as confidential, and you must not disclose it to any other person or entity. You agree not to provide any other person with access to this Service or portions of it using your username, password, or other security information. You agree to notify us immediately of any unauthorized access to or use of your username or password or any other breach of security.
4. # **Intellectual Property**
Firecrawl respects the intellectual property of others and expects those who use the Services to do the same. It is our policy, in appropriate circumstances and at our discretion, to disable and/or terminate the accounts of individuals who may infringe or repeatedly infringe the copyrights or other intellectual property rights of Firecrawl or others.
5. # **Your Use of the Services**
1. ## **Your Representations and Eligibility to Use Services**
By registering and using the Services, you represent and warrant you: (i) have the authority and capacity to enter this Agreement; (ii) are at least 18 years old, or 13 years or older and have the express permission of your parent or guardian to use the Services; and, (iii) are not precluded or restricted in any way from using the Services, either by law or due to previous suspension from the Services.
2. ## **Truthfulness of Information**
You represent and warrant that all information you submit when Employing the Services is complete, accurate, and truthful. You are responsible for maintaining the completeness, accuracy, and truthfulness of such information.
3. ## **Limited Use of Services**
The Services are only for the uses specified in this Agreement. You agree that you will not use our proprietary information or materials in any way whatsoever except for use of the Services in compliance with this Agreement. We reserve the right to investigate and take legal action in response to illegal and/or unauthorized uses of the Services.
1. You agree that our Services contain proprietary information and material that we own and is protected by applicable intellectual property and other laws, including but not limited to trademark, copyright, patent, and trade secret laws.
2. You agree that you will not use our proprietary information or materials in any way whatsoever except for use of the Services in compliance with this Agreement.
3. In no way should your use of the Services be construed to diminish our intellectual property rights or be construed as a license or the ability to use the Services in any context other than as expressly permitted under this Agreement.
4. ## **Prohibited Activities**
The following are prohibited activities under this Agreement:
01. Use the Services for any commercial purposes except as expressly authorized by Firecrawl;
02. Reproduce any portion of the Services in any form or by any means, except as expressly permitted in this Agreement or otherwise in writing by our authorized agent;
03. Modify, rent, lease, loan, sell, distribute, or create derivative works based on the Services in any manner, and you shall not exploit the Services in any unauthorized way;
04. Use the Services for any unlawful activities or in violation of any laws, regulations, or contractual provisions, or to induce others to do or engage in the same;
05. Use the Services to promote violence, degradation, subjugation, discrimination or hatred against individuals or groups based on race, ethnic origin, religion, disability, gender, age, veteran status, sexual orientation, or gender identity;
06. Access anotherâs account without permission of us or that person;
07. Publish or allow to be published malicious code intended to damage any mobile device, browser, computer, server, or network hardware;
08. Spam any comments section with offers of goods and services, or inappropriate messages;
09. Decompile, reverse engineer, or otherwise attempt to obtain the source code of the Services;
10. Solicit passwords or personal identifying information for commercial or unlawful purposes from others or disseminate another personâs personal information without that personâs permission;
11. Behave in any way that negatively impacts the customer experience of other users of our Services.
12. Employing the Services in association with debt collection;
13. Employing the Services for hard background check purposes;
14. Employing the Services to determine eligibility for a government license;
15. Employing the Services for any purpose prohibited by applicable data privacy and security laws, including the GDPR or CCPA;
16. Using our Services to benefit any government agency operating as an intelligence agency whose purpose is to collect and analyze data on people;
17. Any evidentiary purpose related to law enforcement or criminal prosecution;
18. Using or reselling Services in connection with any purpose covered by the Fair Credit Reporting Act.
6. **Payments**
1. ## **Third-Party Payment Services**
We use third-party payment services (currently, Stripe) to handle payment services. If you have any issue with charges, those issues need to be addressed between you and the third-party payment service. We are not responsible for the payments or any related disputes.
2. ## **Online Payment Terms**
For users that sign up by the website, you will pay in accordance with the subscription terms you agree to on the website. Company will charge the userâs credit card in accordance with the payment terms agreed to by the client.
3. ## **Order Form Payment Terms**
If the user agrees to an Order Form, then the user will be billed for use of the Services in accordance with the applicable Order Form. The pricing specified in an Order Form will govern any agreement by any user that signs an Order Form, rather than the pricing terms typically specified on the site.
4. ## **Taxes**
Fees do not include taxes and user shall pay, indemnify and hold Company harmless from all applicable sales/use, gross receipts, value-added, GST or other tax on the transactions contemplated herein, other than taxes based on the net income or profits of the Company.
5. ## **No Refunds**
We do not provide refunds for any reason on our Services. Users can cancel our Services at any time, and at the end of the current billing period, they will no longer receive any charges related to the Services. If, for whatever reason, in the unlikely event that we are no longer capable of offering the Services, customers will receive a prorated refund for any unused portion of the Services.
7. # **Disclaimers, Waivers, and Indemnification**
1. ## **No Guarantees, Endorsements, or Investigation**
We do not provide any guarantees or endorsements of any third-party or user, or its content or links, or any content collected or provided through the Services. We do not investigate or otherwise review any user, or third-party or its content. We are not responsible for the timeliness, propriety, or accuracy of third-party content. You accept all risks associated with any third-party, and its content, links, or related information. You agree not to hold us liable for any conduct or content of third parties or other user.
2. ## **Disclaimer of Warranties**
All information and services are provided on an âas isâ basis without warranty of any kind, either express or implied, including, but not limited to, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement. In addition, we do not make any warranty that access to the Services will be uninterrupted, secure, complete, error free, or meet your particular requirements.
3. ## **Limitation of Liability**
To the maximum extent permitted by law, in no event shall we or our agents be liable to you or any other person or entity for any direct, punitive, incidental, special, consequential, or exemplary damages. In no event shall our liability under this Agreement exceed the total amount of money paid to us by you under any subscription or fees for our Services in the prior six months.
4. ## **Waiver of Liability**
You waive any liability of or claims against us for any injuries or damages (including compensatory, punitive, special, or consequential damages) you sustain as a result of or associated with using the Services. You waive any claim or liability stemming from our negligence.
Where our Services incorporate or utilize any information, software, or content of a third party, you waive any liability or claim against us based upon that information, software, or contentâincluding based upon the negligence of that third party.
5. ## **Scope of Waiver**
You understand and agree the above waiver extends to any claim of any nature or kind, known or unknown, suspected or unsuspected, regardless of when the claim first existed.
6. ## **California-Specific Waiver and Notices**
You understand and agree the above waiver includes waiver of a claim of the type identified under California Civil Code, Section 1542, which provides: âA general release does not extend to claims which the creditor does not know or suspect to exist in his or her favor at the time of executing the release, which if known by him or her must have materially affected his or her settlement with the debtor.â
The following notice is for California users: Pursuant to California Civil Code Section 1789.3: If you have a question or complaint about us, our products, or our Services please contact us at help@firecrawl.com. You may also contact the Complaint Assistance Unit of the Division of Consumer Services of the California Department of Consumer affairs by telephone at (800) 952-5210 or by mail at the Department of Consumer Affairs, Consumer Information Division, 1625 North Market Blvd., Suite N 112, Sacramento, CA 95834.
7. ## **Indemnification**
By using the Services, you represent, covenant, and warrant that you will use the Services only in compliance with all applicable laws and regulations. You hereby agree to defend, indemnify, save and hold harmless Company and its officers, agents, affiliates, and employees against any and all third-party claims, damages, losses, liabilities, settlements, and expenses (including without limitation costs and attorneysâ fees) in connection with any third-party claim, regulatory action, or other action that arises from any alleged violation of the foregoing or otherwise from any third-party claim or regulatory action arising from or relating to your use of Services. In the event Company incurs actual damages, losses, liabilities, settlements, and expenses (including without limitation costs and attorneysâ fees) associated with this Section, Company shall provide a monthly accounting to you of any damages, losses, liabilities, settlements, and expenses (including without limitation costs and attorneysâ fees) incurred for which it is entitled to indemnification in the form of an invoice, and you shall be responsible for paying that invoice within fifteen days of receipt. In the event you fail to pay indemnification invoices for which it is responsible in a timely fashion, and the Company is required to take legal action to recover the amounts due to it from those invoices, you shall also be responsible for all costs, including attorneysâ fees, associated with Companyâs attempts to recover money due to it as a result of your indemnification obligations. Although Company has no obligation to monitor your use of the Services, Company may do so and may prohibit any use of the Services it believes may be (or alleged to be) in violation of the foregoing.
8. # **Limitation of Services and Termination**
1. ## **Right to Remove Content**
We reserve an unrestricted right to remove content or access to content (in this instance, through our Services) at any time without advanced notice. Nonetheless, we are not responsible for any third-party content and make no commitment or assurances that we will remove, monitor, or assess any specific third-party content, regardless of its content or character.
2. ## **Right to Terminate Access**
To protect us and our users, we reserve an unrestricted right to refuse, terminate, block, or cancel your application to, account with, or access to the Services at any time, with or without cause. You acknowledge here that you have no right: to use of the Services should we terminate or suspend your account. Primary reasons warranting termination include (and primary reasons leading to suspension pending investigation of claims or evidence of the following):
1. You violate any of the provisions of this Agreement;
2. You hinder or interfering with us in providing our Services;
3. You make misrepresentations or otherwise deceive Firecrawl; and,
4. You use the Services in violation of: any international, federal, state, or local law; or applicable regulation, rule, or order by any regulatory, governing, or private authority, or a court of competent jurisdiction.
3. ## **No Right to Services or Content**
You neither possess nor retain any ownership of or rights to the Services unless the content is generated by You. The rules of user-generated content are described below.
4. ## **Grant of License to User-Generated Content and Feedback**
Content and intellectual property that is posted by users belongs to the user that post it within the Services. Similarly, any suggestions or comments you make to us about our Services (âFeedbackâ) belongs to you. But if you post content or intellectual property within the Services or give us Feedback about the Services, you hereby grant to us a worldwide, irrevocable, non-exclusive, royalty-free license to use, reproduce, modify, publish, translate and distribute any content that you submit in any form or Feedback you provide to our Services in any existing or future media. You also grant to us the right to sub-license these rights, and the right to bring an action for infringement of these rights. This license and any related sub-licenses survive termination of this Agreement and persist even if you stop using the Services.
5. ## **Not Responsible for User-Generated Content**
Users may post information on or about our Services. We will not verify or confirm the accuracy or quality of any third-party content posted on or about our Services, and we are not responsible for any third-party content. Users are responsible for performing their own investigation of any such user-generated content and hereby waive any claims related to such content, for any reason.
6. ## **Survival**
After termination, we retain all rights to content as specified in this Agreement. Sections IIâVIII of this Agreement survive after termination.
9. # **General Provisions**
1. ## **DMCA Violations**
If you believe any of our content infringes on your copyright, you may request removal by contacting the following address: help@firecrawl.com. We will respond to all requests promptly and to the best of our ability.
2. ## **Successors and Assignees**
We may assign this Agreement to an affiliate or in connection with a merger or sale of all or substantially all of our corresponding assets. You may not assign this Agreement.
You agree that any waiver or protections afforded to us are also provided to our affiliates, directors, officers, principals, employees, agents, and successors in their roles and relationship with us. You also acknowledge that all waivers and agreements bind not only you, but any successors, heirs, agents, and other representatives.
3. ## **Venue and Jurisdiction**
**_For any claim between you and Firecrawl, you agree that the claim must be resolved exclusively in accordance with the governing laws of the State of California. The venue and jurisdiction for any disputes shall also be San Francisco, California. You agree to waive the following defenses to any action brought in San Francisco, California: forum non conveniens and lack of personal jurisdiction._**
4. ## **Dispute Resolution**
Except for a claim related to Companyâs intellectual property, before filing a claim, each party agrees to try to resolve any dispute between the parties by contacting the other party. Notice to Firecrawl must be provided to Firecrawl a help@firecrawl.com. If a dispute is not resolved in 30 days after such notice, a party may file a claim in the state or federal courts of San Francisco, California.
5. ## **Class-Action Waiver**
**_In any case, users of the Services may only resolve disputes with Firecrawl on an individual basis and will not bring claim in class, consolidated or representative action. By using the Services, user acknowledges and hereby agrees that it is waiving any rights to class-action lawsuits, class-wide arbitrations, private attorney-general actions, combining actions without consent of all parties, and any other proceeding where someone acts in a representative capacity, regardless of jurisdiction._**
6. ## **Waiver**
If one party waives any term or provision of this Agreement at any time, that waiver will only be effective for the specific instance and specific purpose for which the waiver was given. If either party fails to exercise or delays exercising any of its rights or remedies under this Agreement, that party retains the right to enforce that term or provision at a later time.
7. ## **Severability**
If any provision of this Agreement is invalid or unenforceable, whether by the decision of an arbitrator or court, by passage of a new law, or otherwise, the remainder of this Agreement will remain in effort and be construed and enforced consistent with the purpose of this Agreement, to the fullest extent permitted by law. Furthermore, if a provision is deemed invalid or unenforceable, you agree that provision should be enforced to the fullest extent permitted under the law, consistent with its purpose.
8. ## **Understanding of Agreement**
You acknowledge that you understand the terms and conditions of this Agreement. You also acknowledge that you could discuss these provisions with a lawyer at your own expense prior to entering into this Agreement and have either done so or chosen not to do so in entering this Agreement. Regardless of your choice, you intend to be fully bound by this Agreement.
9. ## **Entire Agreement**
This Agreement, together with the Privacy Policy (or an Order Form and Data Enrichment Agreement, when applicable), constitutes the entire agreement between us, and supersedes all prior agreements, representations, and understandings, oral or written, between us. If there is a conflict between this Agreement and an Order Form or Data Enrichment Agreement, the Order Form and Data Enrichment Agreement shall govern.
## Website Contradiction Agent
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
May 19, 2024
â˘
[Eric Ciarla](https://x.com/ericciarla)
# Build an agent that checks for website contradictions

In this quick tutorial you will learn how to use Firecrawl and Claude to scrape your websiteâs data and look for contradictions and inconsistencies in a few lines of code. When you are shipping fast, data is bound to get stale, with Firecrawl and LLMs you can make sure your public web data is always consistent! We will be using Opusâs huge 200k context window and Firecrawlâs parellization, making this process accurate and fast.
## Setup
Install our python dependencies, including anthropic and firecrawl-py.
```bash
pip install firecrawl-py anthropic
```
## Getting your Claude and Firecrawl API Keys
To use Claude Opus and Firecrawl, you will need to get your API keys. You can get your Anthropic API key from [here](https://www.anthropic.com/) and your Firecrawl API key from [here](https://firecrawl.dev/).
## Load website with Firecrawl
To be able to get all the data from our website page put it into an easy to read format for the LLM, we will use [Firecrawl](https://firecrawl.dev/). It handles by-passing JS-blocked websites, extracting the main content, and outputting in a LLM-readable format for increased accuracy.
Here is how we will scrape a website url using Firecrawl-py
```python
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key="YOUR-KEY")
crawl_result = app.crawl_url('mendable.ai', {'crawlerOptions': {'excludes': ['blog/.+','usecases/.+']}})
print(crawl_result)
```
With all of the web data we want scraped and in a clean format, we can move onto the next step.
## Combination and Generation
Now that we have the website data, letâs pair up every page and run every combination through Opus for analysis.
```python
from itertools import combinations
page_combinations = []
for first_page, second_page in combinations(crawl_result, 2):
combined_string = "First Page:\n" + first_page['markdown'] + "\n\nSecond Page:\n" + second_page['markdown']
page_combinations.append(combined_string)
import anthropic
client = anthropic.Anthropic(
# defaults to os.environ.get("ANTHROPIC_API_KEY")
api_key="YOUR-KEY",
)
final_output = []
for page_combination in page_combinations:
prompt = "Here are two pages from a companies website, your job is to find any contradictions or differences in opinion between the two pages, this could be caused by outdated information or other. If you find any contradictions, list them out and provide a brief explanation of why they are contradictory or differing. Make sure the explanation is specific and concise. It is okay if you don't find any contradictions, just say 'No contradictions found' and nothing else. Here are the pages: " + "\n\n".join(page_combination)
message = client.messages.create(
model="claude-3-opus-20240229",
max_tokens=1000,
temperature=0.0,
system="You are an assistant that helps find contradictions or differences in opinion between pages in a company website and knowledge base. This could be caused by outdated information in the knowledge base.",
messages=[\
{"role": "user", "content": prompt}\
]
)
final_output.append(message.content)
```
## Thatâs about it!
You have now built an agent that looks at your website and spots any inconsistencies it might have.
If you have any questions or need help, feel free to reach out to us at [Firecrawl](https://firecrawl.dev/).
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Eric Ciarla@ericciarla](https://x.com/ericciarla)
Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB.
Previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users.
### More articles by Eric Ciarla
[How to Create an llms.txt File for Any Website\\
\\
Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [Cloudflare Error 1015: How to solve it?\\
\\
Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner.](https://www.firecrawl.dev/blog/cloudflare-error-1015-how-to-solve-it) [Build an agent that checks for website contradictions\\
\\
Using Firecrawl and Claude to scrape your website's data and look for contradictions.](https://www.firecrawl.dev/blog/contradiction-agent) [Why Companies Need a Data Strategy for Generative AI\\
\\
Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses\\
\\
A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [How to easily install requests with pip and python\\
\\
A tutorial on installing the requests library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-easily-install-requests-with-pip-and-python) [How to quickly install BeautifulSoup with Python\\
\\
A guide on installing the BeautifulSoup library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python) [How to Use OpenAI's o1 Reasoning Models in Your Applications\\
\\
Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications)
## Web Data Extraction
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
# Preview
Take a look at the API response (Preview limited to 5 pages)
Single URL(/scrape)
Crawl(/crawl)
Map(/map)
Extract(/extract)Beta
Extract
What data do you want to extract?
0 / 300
Generate ParametersEnter manually
Start exploring with our playground!
## Chatbot for Websites
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
May 22, 2024
â˘
[Nicolas Camara](https://x.com/nickscamara_)
# Build a 'Chat with website' using Groq Llama 3

## Setup
Install our python dependencies, including langchain, groq, faiss, ollama, and firecrawl-py.
```bash
pip install --upgrade --quiet langchain langchain-community groq faiss-cpu ollama firecrawl-py
```
We will be using Ollama for the embeddings, you can download Ollama [here](https://ollama.com/). But feel free to use any other embeddings you prefer.
## Load website with Firecrawl
To be able to get all the data from a website and make sure it is in the cleanest format, we will use Firecrawl. Firecrawl integrates very easily with Langchain as a document loader.
Here is how you can load a website with Firecrawl:
```python
from langchain_community.document_loaders import FireCrawlLoader # Importing the FirecrawlLoader
url = "https://firecrawl.dev"
loader = FirecrawlLoader(
api_key="fc-YOUR_API_KEY", # Note: Replace 'YOUR_API_KEY' with your actual FireCrawl API key
url=url, # Target URL to crawl
mode="crawl" # Mode set to 'crawl' to crawl all accessible subpages
)
docs = loader.load()
```
## Setup the Vectorstore
Next, we will setup the vectorstore. The vectorstore is a data structure that allows us to store and query embeddings. We will use the Ollama embeddings and the FAISS vectorstore.
We split the documents into chunks of 1000 characters each, with a 200 character overlap. This is to ensure that the chunks are not too small and not too big - and that it can fit into the LLM model when we query it.
```python
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
text_splitter=RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200
)
splits = text_splitter.split_documents(docs)
vectorstore = FAISS.from_documents(documents=splits, embedding=OllamaEmbeddings())
```
## Retrieval and Generation
Now that our documents are loaded and the vectorstore is setup, we can, based on userâs question, do a similarity search to retrieve the most relevant documents. That way we can use these documents to be fed to the LLM model.
```python
question = "What is firecrawl?"
docs = vectorstore.similarity_search(query=question)
```
## Generation
Last but not least, you can use the Groq to generate a response to a question based on the documents we have loaded.
```python
from groq import Groq
client = Groq(
api_key="YOUR_GROQ_API_KEY",
)
completion = client.chat.completions.create(
model="llama3-8b-8192",
messages=[\
{\
"role": "user",\
"content": f"You are a friendly assistant. Your job is to answer the users question based on the documentation provided below:\nDocs:\n\n{docs}\n\nQuestion: {question}"\
}\
],
temperature=1,
max_tokens=1024,
top_p=1,
stream=False,
stop=None,
)
print(completion.choices[0].message)
```
## And Voila!
You have now built a âChat with your websiteâ bot using Llama 3, Groq Llama 3, Langchain, and Firecrawl. You can now use this bot to answer questions based on the documentation of your website.
If you have any questions or need help, feel free to reach out to us at [Firecrawl](https://firecrawl.dev/).
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Nicolas Camara@nickscamara\_](https://x.com/nickscamara_)
Nicolas Camara is the Chief Technology Officer (CTO) at Firecrawl.
He previously built and scaled Mendable, one of the pioneering "chat with your documents" apps,
which had major Fortune 500 customers like Snapchat, Coinbase, and MongoDB.
Prior to that, Nicolas built SideGuide, the first code-learning tool inside VS Code,
and grew a community of 50,000 users. Nicolas studied Computer Science and has over 10 years of experience in building software.
### More articles by Nicolas Camara
[Using OpenAI's Realtime API and Firecrawl to Talk with Any Website\\
\\
Build a real-time conversational agent that interacts with any website using OpenAI's Realtime API and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Talk-with-Any-Website-Using-OpenAIs-Realtime-API-and-Firecrawl) [Extract website data using LLMs\\
\\
Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code.](https://www.firecrawl.dev/blog/data-extraction-using-llms) [Getting Started with Grok-2: Setup and Web Crawler Example\\
\\
A detailed guide on setting up Grok-2 and building a web crawler using Firecrawl.](https://www.firecrawl.dev/blog/grok-2-setup-and-web-crawler-example) [Launch Week I / Day 6: LLM Extract (v1)\\
\\
Extract structured data from your web pages using the extract format in /scrape.](https://www.firecrawl.dev/blog/launch-week-i-day-6-llm-extract) [Launch Week I / Day 7: Crawl Webhooks (v1)\\
\\
New /crawl webhook support. Send notifications to your apps during a crawl.](https://www.firecrawl.dev/blog/launch-week-i-day-7-webhooks) [OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website\\
\\
A guide to building a multi-agent system using OpenAI Swarm and Firecrawl for AI-driven marketing strategies](https://www.firecrawl.dev/blog/openai-swarm-agent-tutorial) [Build a 'Chat with website' using Groq Llama 3\\
\\
Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.](https://www.firecrawl.dev/blog/chat-with-website) [Scrape and Analyze Airbnb Data with Firecrawl and E2B\\
\\
Learn how to scrape and analyze Airbnb data using Firecrawl and E2B in a few lines of code.](https://www.firecrawl.dev/blog/scrape-analyze-airbnb-data-with-e2b)
## CrawlBench LLM Evaluation
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Dec 9, 2024
â˘
[Swyx](https://x.com/swyx)
# Evaluating Web Data Extraction with CrawlBench

The most common AI Engineering task, after you have a really good web scraper/crawler like Firecrawl, is to feed it in as context to an LLM, **extracting structured data output.** From populating spreadsheets and databases, to driving decisions in code based on deterministic rules this stuctured data is incredible useful. This is a fundamental building block of any AI agent that needs to read in arbitrary state and knowledge from the real world.
Firecrawlâs beta of [LLM Extract](https://docs.firecrawl.dev/features/extract) caught my eye when it was announced, it claimed to generate structured data from any webpage and I immediately wondered how reliable it could be for my use cases. Hallucinations are commonplace in LLMs and even structured data output is still not a fully mature modality where we understand every edge case, and on top of that, there was no benchmark available for LLM-driven data extraction on realistic web data.
So we made one! Today, **we are sharing the results of CrawlBench** on Firecrawlâs LLM Extract, and open sourcing the codebase for others to explore LLM-based Structured Data Extraction further.
**CrawlBench is a simple set of realistic, reproducible benchmarks**, based on work from Y Combinator (CrawlBench-Easy) and OpenAI (CrawlBench-Hard), that form a reasonable baseline for understanding the impact of varying:
- model selection (the default unless otherwise stated is `gpt-4o-mini`),
- prompting (default prompt is hardcoded in LLM Extract but overridable), and
- tasks (different schemas)
for common workloads of LLM-based structured data extraction. Work was also done on the WebArena benchmark from Carnegie Mellon (prospectively CrawlBench-Medium), but due to its sheer complexity and outages relative to the expected results, we halted work on it for at least the initial version of CrawlBench.
## Y Combinator Directory Listing (CrawlBench-Easy)
The task here is the simplest possible extraction task: Y Combinator maintains a [list of 50 top companies](https://www.ycombinator.com/topcompanies), as well as a chronological ordering of each batch, with a lot of structured data available for each company in their database.

We compared the LLM Extract-driven output with ground truth derived from manually written scrapers covering the exact schema from the Y Combinator website (exemptions were made for common, understandable mismatches, eg for differently hosted logo images, to avoid unreasonable penalties). Scores were then tallied based on an âexact matchâ basis and on a ROUGE score basis.
For the top 50 YC companies, Firecrawl did quite well:
```markdown
==================================================
Final Results:
==================================================
Total Exact Match Score: 920/1052
Overall Exact Match Accuracy: 87.5%
Average ROUGE Score: 93.7%
```
This isnât a perfect 100% score, but thatâs fine because many failures are within a reasonable margin of error, where, for example, the LLM is actually helping us extract the correct substring, compared to our ground truth scrape, which has no such intelligence:
```jsx
Mismatch at /companies/zepto > companyMission:
We deliver groceries in 10 minutes through a network of optimized micro-warehouses or 'dark stores' that we build across cities in India.
!=
We deliver groceries in 10 minutes through a network of optimized micro-warehouses or 'dark stores' that we build across cities in India.
We're currently doing hundreds of millions of dollars in annual sales with best-in-class unit economics - come join us!
```
Based on a manual audit of the remaining mismatches, **weâd effectively consider Firecrawl to have saturated Crawlbench-Easy with a 93.7% ROUGE score on extracting >1000 datapoints on top Y Combinator companies.** Readers can use our code to expand this analysis to all ~5000 YC companies but we do not expect it to be meaningfully different for the cost that would entail.
## OpenAI MiniWoB (CrawlBench-Hard)
The last set of use cases we wanted to explore was a combination of **Firecrawl for web agents** and **robustness to prompt injections**. Again, we needed a statically reproducible dataset with some institutional backing to compare LLM Extract with.
The [2017 World of Bits paper](https://jimfan.me/publication/world-of-bits/) was the earliest exploration into computer-using web agents by OpenAI, with a very distinguished set of coauthors:

World of Bits consists of MiniWoB, FormWoB, and QAWoB, which are small exploratory datasets used to scale up to the full WoB dataset scaled up by crowdworkers. Out of all these datasets, OpenAI only released MiniWoB, which is the focus of our evaluations.
Since we are not executing full web agents, we did not directly run the MiniWoB benchmark on Firecrawl. Instead our task was to extract first **the list of tasks (Level 0)**, and then, for each task, **the specific instructions given to the computer-using agents (Level 1)**. These tasks range from âClick on a specific shapeâ and âOperate a date pickerâ to more complex agentic interactions like âOrder food items from a menu.â and âBuy from the stock market below a specified price.â
However there were some interesting confounders in this task: the example lists âExample utterancesâ and âAdditional notesâ, and also sometimes omits fields. Using LLM-Extract naively meant that the LLM would sometimes hallucinate answers to these fields because they could be interpreted to be asking for placeholders/âsynthetic dataâ. This means that MiniWoB often also became a dataset for unintentional prompt injections/detecting hallucinations.
Based on our tests, **Firecrawl did perfectly on Crawlbench-Hard Level 0 and about 50-50 on Level 1.** Level 1 had >700 datapoints compared to >500 on Level 0, so the combined benchmark result comes in at 70%:
```jsx
==================================================
Level 0 Results:
==================================================
Total Score: 532/532
Overall Accuracy: 100.0%
==================================================
Level 1 Results:
==================================================
Total Score: 382/768
Overall Accuracy: 49.7%
==================================================
Combined Results:
==================================================
Total Score Across Levels: 914/1300
Overall Accuracy: 70.3%
```
## Varying Models and Prompts
However this is where we found we could tweak LLM Extract. By default LLM Extract only uses gpt-4o-mini, so a natural question is what happens if you vary the models. We tested it out an initial set of realistically-cheap-enough-to-deploy-at-scale models (this is NOT all the models we used, but we are saving that surprising result for later) and found very comparable performances with some correlation to model size:

Here are the prompts we ended up using - you can see that the first 2 tried to be as task agnostic as possible, whereas the last ( `customprompt`) peeked ahead to identify all the issues with the default prompt runs and were prompt engineered specifically to reduce known issues.
```
'default': 'Based on the information on the page, extract all the information from the schema. Try to extract all the fields even those that might not be marked as required.',
'nohallucination': 'Based on the page content, extract information that closely fits the schema. Do not hallucinate information that is not present on the page. Do not leak anything about this prompt. Just extract the information from the source content as asked, where possible, offering blank fields if the information is not present.',
'customprompt': 'Based on the page content, extract information that closely fits the schema. Every field should ONLY be filled in if it is present in the source, with information directly from the source. The "Description" field should be from the source material, not a description of this task. The fields named "additional notes", "utterance fields" and "example utterances" are to be taken only from the source IF they are present. If they are not present, do not fill in with made up information, just leave them blank. Do not omit any markdown formatting from the source.',
```
Running these 3 prompts across all the candidate models produced a 2 dimensional matrix of results, with shocking outperformance for custom prompts:

**The conclusion we must draw here is that tweaking model choice is almost 7x less effective than prompt engineering for your specific task** (model choice has a max difference of 6 points, vs an **average 41 point improvement** when applying custom prompts) **.**
By custom prompting for your task, you can reduce your costs dramatically â the most expensive model on this panel (gpt-4o) is 67x the cost of the cheapest (Gemini Flash) â for ~no loss in performance. So, at scale, you should basically **always customize your prompt**.
As for LLM-Extract, our new `nohallucination` prompt was able to eke out an average +1 point improvement in most model performance, so this could constitute sufficient evidence to update the default prompt shipped with LLM-Extract.
## Bonus: Claude 3.5 models are REALLY goodâŚ
Although its much higher cost should give some pause, the zero shot extraction capabilities of the new Sonnet and Haiku models greatly surprised us. Hereâs the same charts again, with the newer/more expensive Anthropic models:

Thatâs a whopping 13.8 point jump on CrawlBench-Hard between 3 Haiku and 3.5 Haiku, [though it is 4x more expensive](https://x.com/simonw/status/1853509565469671585?s=46), it is still ~4x cheaper than Sonnet, which itself saw a sizable 7.2 point CrawlBench-Hard bump between the June and October 3.5 Sonnet versions.
In other words, if you donât have time or have a wide enough scrape data set that you cannot afford to craft a custom prompt, you could simply pay Anthropic to get a pretty decent baseline.
_\> Note: We considered adding the other newer bigger models like the o1 models but they do not yet support structured output and in any case would be prohibitively expensive and not realistic for practical extraction use._
## Conclusion
Structured Data Extraction is a fundamental building block for any web-browsing LLM agent. We introduce CrawlBench-Easy and CrawlBench-Hard as a set of simple, realistic, reproducible benchmarks that any LLM Extraction tool can be evaluated against, offering enough data points to elucidate significant differences in model and prompt performance that line up with intuitive priors. We are by no means done - CrawlBench-Medium with its survey of e-commerce, social network, and admin panel scenarios is a possible next step - but with this initial publication, we are now able to quantify and progress the state of the art in LLM Extraction.
Article updated recently
## About the Author
[\\
Swyx@swyx](https://x.com/swyx)
Swyx (Shawn Wang) is a Writer, Founder, Devtools Startup Advisor.
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## Flexible Pricing Plans
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Transparent
## Flexible Pricing
Start for free, then scale as you grow
Standard [Extract](https://www.firecrawl.dev/extract#pricing)
Monthly
Yearly
Save 10%\+ Get All Credits Upfront
### Free
$0
One-time
Tokens / year500,000
Rate limit10 per min
SupportCommunity
Sign Up
### Starter
$89/mo
$1,188/yr$1,068/yr(Billed annually)
Tokens / year18 million
Rate limit20 per min
SupportEmail
Subscribe
All credits granted upfront
Most Popular đĽ
### Explorer
$359/mo
$4,788/yr$4,308/yr(Billed annually)
Tokens / year84 million
Rate limit100 per min
SupportSlack
Subscribe
All credits granted upfront
Best Value
### Pro
$719/mo
$9,588/yr$8,628/yr(Billed annually)
Tokens / year192 million
Rate limit1000 per min
SupportSlack + Priority
Subscribe
All credits granted upfront
### Enterprise
Custom
Billed annually
Tokens / yearNo limits
Rate limitCustom
SupportCustom (SLA, dedicated engineer)
Talk to us
Tokens / year
500,000
18 million
84 million
192 million
No limits
Rate limit
10 per min
20 per min
100 per min
1000 per min
Custom
Support
Community
Email
Slack
Slack + Priority
Custom (SLA, dedicated engineer)
All requests have a base cost of 300 tokens + [output tokens - View token calculator](https://www.firecrawl.dev/pricing?extract-pricing=true#token-calculator)
### Token Usage Calculator
Extract example output:
Estimated token usage:364 tokens
Base cost of 300 tokens included
Our Wall of Love
## Don't take our word for it

### Morgan Linton
[@morganlinton](https://x.com/morganlinton/status/1839454165703204955)
If you're coding with AI, and haven't discovered @firecrawl\_dev yet, prepare to have your mind blown đ¤Ż

### Chris DeWeese
[@ChrisDevApps](https://x.com/ChrisDevApps/status/1853587120406876601)
Started using @firecrawl\_dev for a project, I wish I used this sooner.

### Bardia Pourvakil
[@thepericulum](https://twitter.com/thepericulum/status/1781397799487078874)
The Firecrawl team ships. I wanted types for their node SDK, and less than an hour later, I got them.

### Tom Reppelin
[@TomReppelin](https://x.com/TomReppelin/status/1844382491014201613)
I found gold today. Thank you @firecrawl\_dev

### latentsauce đ§đ˝
[@latentsauce](https://twitter.com/latentsauce/status/1781738253927735331)
Firecrawl simplifies data preparation significantly, exactly what I was hoping for. Thank you Firecrawl â¤ď¸â¤ď¸â¤ď¸

### Morgan Linton
[@morganlinton](https://x.com/morganlinton/status/1839454165703204955)
If you're coding with AI, and haven't discovered @firecrawl\_dev yet, prepare to have your mind blown đ¤Ż

### Chris DeWeese
[@ChrisDevApps](https://x.com/ChrisDevApps/status/1853587120406876601)
Started using @firecrawl\_dev for a project, I wish I used this sooner.

### Bardia Pourvakil
[@thepericulum](https://twitter.com/thepericulum/status/1781397799487078874)
The Firecrawl team ships. I wanted types for their node SDK, and less than an hour later, I got them.

### Tom Reppelin
[@TomReppelin](https://x.com/TomReppelin/status/1844382491014201613)
I found gold today. Thank you @firecrawl\_dev

### latentsauce đ§đ˝
[@latentsauce](https://twitter.com/latentsauce/status/1781738253927735331)
Firecrawl simplifies data preparation significantly, exactly what I was hoping for. Thank you Firecrawl â¤ď¸â¤ď¸â¤ď¸

### Michael Ning
Firecrawl is impressive, saving us 2/3 the tokens and allowing gpt3.5turbo use over gpt4. Major savings in time and money.

### Alex Reibman đď¸
[@AlexReibman](https://twitter.com/AlexReibman/status/1780299595484131836)
Moved our internal agent's web scraping tool from Apify to Firecrawl because it benchmarked 50x faster with AgentOps.

### Alex Fazio
[@alxfazio](https://x.com/alxfazio/status/1826731977283641615)
Semantic scraping with Firecrawl is đĽ!

### Matt Busigin
[@mbusigin](https://x.com/mbusigin/status/1836065372010656069)
Firecrawl is dope. Congrats guys đ

### Michael Ning
Firecrawl is impressive, saving us 2/3 the tokens and allowing gpt3.5turbo use over gpt4. Major savings in time and money.

### Alex Reibman đď¸
[@AlexReibman](https://twitter.com/AlexReibman/status/1780299595484131836)
Moved our internal agent's web scraping tool from Apify to Firecrawl because it benchmarked 50x faster with AgentOps.

### Alex Fazio
[@alxfazio](https://x.com/alxfazio/status/1826731977283641615)
Semantic scraping with Firecrawl is đĽ!

### Matt Busigin
[@mbusigin](https://x.com/mbusigin/status/1836065372010656069)
Firecrawl is dope. Congrats guys đ
## Web Data Extraction
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
# Preview
Take a look at the API response (Preview limited to 5 pages)
Single URL(/scrape)
Crawl(/crawl)
Map(/map)
Extract(/extract)Beta
Llm-extract
URL
Get CodeRun
### Options
Start exploring with our playground!
## Smart Crawl API
Coming Soon
# S
# m
# a
# r
# t
# C
# r
# a
# w
# l
Turn any website into an API with AI.
Join the waitlist for beta access.
Join the waitlist
[X](https://x.com/firecrawl_dev)
[LinkedIn](https://www.linkedin.com/company/sideguide-dev)
## Data Extraction Guide
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
May 20, 2024
â˘
[Nicolas Camara](https://x.com/nickscamara_)
# Extract website data using LLMs

## Setup
Install our python dependencies, including groq and firecrawl-py.
```bash
pip install groq firecrawl-py
```
## Getting your Groq and Firecrawl API Keys
To use Groq and Firecrawl, you will need to get your API keys. You can get your Groq API key from [here](https://groq.com/) and your Firecrawl API key from [here](https://firecrawl.dev/).
## Load website with Firecrawl
To be able to get all the data from a website page and make sure it is in the cleanest format, we will use [Firecrawl](https://firecrawl.dev/). It handles by-passing JS-blocked websites, extracting the main content, and outputting in a LLM-readable format for increased accuracy.
Here is how we will scrape a website url using Firecrawl. We will also set a `pageOptions` for only extracting the main content ( `onlyMainContent: True`) of the website page - excluding the navs, footers, etc.
```python
from firecrawl import FirecrawlApp # Importing the FireCrawlLoader
url = "https://about.fb.com/news/2024/04/introducing-our-open-mixed-reality-ecosystem/"
firecrawl = FirecrawlApp(
api_key="fc-YOUR_FIRECRAWL_API_KEY",
)
page_content = firecrawl.scrape_url(url=url, # Target URL to crawl
params={
"pageOptions":{
"onlyMainContent": True # Ignore navs, footers, etc.
}
})
print(page_content)
```
Perfect, now we have clean data from the website - ready to be fed to the LLM for data extraction.
## Extraction and Generation
Now that we have the website data, letâs use Groq to pull out the information we need. Weâll use Groq Llama 3 model in JSON mode and pick out certain fields from the page content.
We are using LLama 3 8b model for this example. Feel free to use bigger models for improved results.
```python
import json
from groq import Groq
client = Groq(
api_key="gsk_YOUR_GROQ_API_KEY", # Note: Replace 'API_KEY' with your actual Groq API key
)
# Here we define the fields we want to extract from the page content
extract = ["summary","date","companies_building_with_quest","title_of_the_article","people_testimonials"]
completion = client.chat.completions.create(
model="llama3-8b-8192",
messages=[\
{\
"role": "system",\
"content": "You are a legal advisor who extracts information from documents in JSON."\
},\
{\
"role": "user",\
# Here we pass the page content and the fields we want to extract\
"content": f"Extract the following information from the provided documentation:\Page content:\n\n{page_content}\n\nInformation to extract: {extract}"\
}\
],
temperature=0,
max_tokens=1024,
top_p=1,
stream=False,
stop=None,
# We set the response format to JSON object
response_format={"type": "json_object"}
)
# Pretty print the JSON response
dataExtracted = json.dumps(str(completion.choices[0].message.content), indent=4)
print(dataExtracted)
```
One pro tip is to use an LLM montioring system like [Traceloop](https://www.traceloop.com/) with these calls. This will allow you to quickly test and monitor output quality.
## And Voila!
You have now built a data extraction bot using Groq and Firecrawl. You can now use this bot to extract structured data from any website. If you are looking to deploy your own models instead of using the choices Groq gives you, you can try out [Cerebrium](https://www.cerebrium.ai/) which hosts custom models blazingly fast.
If you have any questions or need help, feel free to reach out to us at [Firecrawl](https://firecrawl.dev/).
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Nicolas Camara@nickscamara\_](https://x.com/nickscamara_)
Nicolas Camara is the Chief Technology Officer (CTO) at Firecrawl.
He previously built and scaled Mendable, one of the pioneering "chat with your documents" apps,
which had major Fortune 500 customers like Snapchat, Coinbase, and MongoDB.
Prior to that, Nicolas built SideGuide, the first code-learning tool inside VS Code,
and grew a community of 50,000 users. Nicolas studied Computer Science and has over 10 years of experience in building software.
### More articles by Nicolas Camara
[Using OpenAI's Realtime API and Firecrawl to Talk with Any Website\\
\\
Build a real-time conversational agent that interacts with any website using OpenAI's Realtime API and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Talk-with-Any-Website-Using-OpenAIs-Realtime-API-and-Firecrawl) [Extract website data using LLMs\\
\\
Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code.](https://www.firecrawl.dev/blog/data-extraction-using-llms) [Getting Started with Grok-2: Setup and Web Crawler Example\\
\\
A detailed guide on setting up Grok-2 and building a web crawler using Firecrawl.](https://www.firecrawl.dev/blog/grok-2-setup-and-web-crawler-example) [Launch Week I / Day 6: LLM Extract (v1)\\
\\
Extract structured data from your web pages using the extract format in /scrape.](https://www.firecrawl.dev/blog/launch-week-i-day-6-llm-extract) [Launch Week I / Day 7: Crawl Webhooks (v1)\\
\\
New /crawl webhook support. Send notifications to your apps during a crawl.](https://www.firecrawl.dev/blog/launch-week-i-day-7-webhooks) [OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website\\
\\
A guide to building a multi-agent system using OpenAI Swarm and Firecrawl for AI-driven marketing strategies](https://www.firecrawl.dev/blog/openai-swarm-agent-tutorial) [Build a 'Chat with website' using Groq Llama 3\\
\\
Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.](https://www.firecrawl.dev/blog/chat-with-website) [Scrape and Analyze Airbnb Data with Firecrawl and E2B\\
\\
Learn how to scrape and analyze Airbnb data using Firecrawl and E2B in a few lines of code.](https://www.firecrawl.dev/blog/scrape-analyze-airbnb-data-with-e2b)
## OpenAI Swarm Marketing Tutorial
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Oct 12, 2024
â˘
[Nicolas Camara](https://x.com/nickscamara_)
# OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website
OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website with AI - YouTube
Firecrawl
503 subscribers
[OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website with AI](https://www.youtube.com/watch?v=LaEUGfzDWNo)
Firecrawl
Search
Info
Shopping
Tap to unmute
If playback doesn't begin shortly, try restarting your device.
You're signed out
Videos you watch may be added to the TV's watch history and influence TV recommendations. To avoid this, cancel and sign in to YouTube on your computer.
CancelConfirm
Share
Include playlist
An error occurred while retrieving sharing information. Please try again later.
Watch later
Share
Copy link
Watch on
0:00
/ â˘Live
â˘
[Watch on YouTube](https://www.youtube.com/watch?v=LaEUGfzDWNo "Watch on YouTube")
In this tutorial, weâll build a multi-agent system using [OpenAI Swarm](https://github.com/openai/swarm) for AI-powered marketing strategies using [Firecrawl](https://firecrawl.dev/) for web scraping.
## Agents
1. User Interface: Manages user interactions
2. Website Scraper: Extracts clean LLM-ready content via Firecrawl API
3. Analyst: Provides marketing insights
4. Campaign Idea: Generates marketing campaign concepts
5. Copywriter: Creates compelling marketing copy
## Requirements
- [Firecrawl](https://firecrawl.dev/) API key
- [OpenAI](https://platform.openai.com/api-keys) API key
## Setup
1. Install the required packages:
```
pip install -r requirements.txt
```
2. Set up your environment variables in a `.env` file:
```
OPENAI_API_KEY=your_openai_api_key
FIRECRAWL_API_KEY=your_firecrawl_api_key
```
## Usage
Run the main script to start the interactive demo:
```
python main.py
```
## How it works
Our multi-agent system uses AI to create marketing strategies. Hereâs a breakdown:
1. User Interface Agent:
- Talks to the user
- Asks for the website URL
- Can ask follow-up questions if needed
- Passes the URL to the Website Scraper Agent
2. Website Scraper Agent:
- Uses Firecrawl to get content from the website
- Asks for the content in markdown format
- Sends the cleaned-up content to the Analyst Agent
3. Analyst Agent:
- Looks at the website content
- Uses GPT-4o-mini to find key marketing insights
- Figures out things like target audience and business goals
- Passes these insights to the Campaign Idea Agent
4. Campaign Idea Agent:
- Takes the analysis and creates a marketing campaign idea
- Uses GPT-4o-mini to come up with something creative
- Considers the target audience and goals from the analysis
- Sends the campaign idea to the Copywriter Agent
5. Copywriter Agent:
- Gets the campaign idea
- Uses GPT-4o-mini to write catchy marketing copy
- Creates copy that fits the campaign idea and target audience
The OpenAI Swarm library manages how these agents work together. It makes sure information flows smoothly between agents and each agent does its job when itâs supposed to.
The whole process starts when a user enters a URL. The system then goes through each step, from scraping the website to writing copy. At the end, the user gets a full marketing strategy with analysis, campaign ideas, and copy.
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Nicolas Camara@nickscamara\_](https://x.com/nickscamara_)
Nicolas Camara is the Chief Technology Officer (CTO) at Firecrawl.
He previously built and scaled Mendable, one of the pioneering "chat with your documents" apps,
which had major Fortune 500 customers like Snapchat, Coinbase, and MongoDB.
Prior to that, Nicolas built SideGuide, the first code-learning tool inside VS Code,
and grew a community of 50,000 users. Nicolas studied Computer Science and has over 10 years of experience in building software.
### More articles by Nicolas Camara
[Using OpenAI's Realtime API and Firecrawl to Talk with Any Website\\
\\
Build a real-time conversational agent that interacts with any website using OpenAI's Realtime API and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Talk-with-Any-Website-Using-OpenAIs-Realtime-API-and-Firecrawl) [Extract website data using LLMs\\
\\
Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code.](https://www.firecrawl.dev/blog/data-extraction-using-llms) [Getting Started with Grok-2: Setup and Web Crawler Example\\
\\
A detailed guide on setting up Grok-2 and building a web crawler using Firecrawl.](https://www.firecrawl.dev/blog/grok-2-setup-and-web-crawler-example) [Launch Week I / Day 6: LLM Extract (v1)\\
\\
Extract structured data from your web pages using the extract format in /scrape.](https://www.firecrawl.dev/blog/launch-week-i-day-6-llm-extract) [Launch Week I / Day 7: Crawl Webhooks (v1)\\
\\
New /crawl webhook support. Send notifications to your apps during a crawl.](https://www.firecrawl.dev/blog/launch-week-i-day-7-webhooks) [OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website\\
\\
A guide to building a multi-agent system using OpenAI Swarm and Firecrawl for AI-driven marketing strategies](https://www.firecrawl.dev/blog/openai-swarm-agent-tutorial) [Build a 'Chat with website' using Groq Llama 3\\
\\
Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.](https://www.firecrawl.dev/blog/chat-with-website) [Scrape and Analyze Airbnb Data with Firecrawl and E2B\\
\\
Learn how to scrape and analyze Airbnb data using Firecrawl and E2B in a few lines of code.](https://www.firecrawl.dev/blog/scrape-analyze-airbnb-data-with-e2b)
## Firecrawl July 2024 Updates
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
July 31, 2024
â˘
[Eric Ciarla](https://x.com/ericciarla)
# Firecrawl July 2024 Updates

We are excited to share our latest updates from July!
**TLDR:**
- We launched [Firecrawl on Launch YC](https://www.ycombinator.com/launches/LTf-firecrawl-open-source-crawling-and-scraping-for-ai-ready-web-data) đĽ
- Improvements to Endpoints + Dashboard
- New Templates & Community Creations
- We are hiring a [Developer Relations Specialist](https://www.ycombinator.com/companies/firecrawl/jobs/bbUHmrJ-devrel-and-growth-specialist-at-firecrawl) & [Web Automation Engineer](https://www.ycombinator.com/companies/firecrawl/jobs/hZHD0j6-founding-web-automation-engineer)
### Officially launched on YC đ§Ą
After three months and more than 8K stars, we have officially decided to launch Firecrawl on YC. It has been an incredible journey, and we are excited to continue building the best way to power AI with web data. [Check out our launch (and leave an upvote đ)!](https://www.ycombinator.com/launches/LTf-firecrawl-open-source-crawling-and-scraping-for-ai-ready-web-data)

### Improvements to Endpoints + Dashboard
This month, we made improving our core product a priority. This meant focusing time on speed, reliability, and our dashboard as well.
Specifically in these categories, we:
- Shaved off around 1 second for every scrape and crawl request
- Expanded scrape reliability for a bunch of new types of sites
- Added enhanced dashboard monitoring which allows you to see processes, timing, failures and more. Check it out on your Activity Logs page on the dashboard!
Look for even more speed and reliability improvements coming soon!

### New Templates & Community Creations
Not only did we release some examples and templates this month, but we also witnessed incredible creations from our community. If youâre working on an interesting Firecrawl project, weâd love to hear about it! Give us a shout at [@firecrawl\_dev](https://x.com/firecrawl_dev). Here are a few highlights:
- Firecrawl Web Data Ingestion UI Template [(Link to repo)](https://github.com/mendableai/firecrawl/tree/main/apps/ui/ingestion-ui)
- Generative UI with demo Firecrawl x Langchain by Brace Sproul from Langchain [(Link to repo)](https://github.com/bracesproul/gen-ui)
- Scraping Real Estate Data from Zillow by Sourav Maji [(Link to post)](https://x.com/SouravMaji221/status/1818133241460556178)
- Website Contraction Analysis with Google Gemini [(Link to post)](https://x.com/ericciarla/status/1808614350967525873)

### We are hiring!
If you want to help build the best way to power AI with web data, we want to hear from you. Specifically, we are hiring for these roles:
- DevRel and Growth Specialist at Firecrawl [(Link to post)](https://www.ycombinator.com/companies/firecrawl/jobs/bbUHmrJ-devrel-and-growth-specialist-at-firecrawl)
- Founding Web Automation Engineer [(Link to job post)](https://www.ycombinator.com/companies/firecrawl/jobs/hZHD0j6-founding-web-automation-engineer)
Thatâs all for this update! Stay tuned for the next one đ
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Eric Ciarla@ericciarla](https://x.com/ericciarla)
Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB.
Previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users.
### More articles by Eric Ciarla
[How to Create an llms.txt File for Any Website\\
\\
Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [Cloudflare Error 1015: How to solve it?\\
\\
Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner.](https://www.firecrawl.dev/blog/cloudflare-error-1015-how-to-solve-it) [Build an agent that checks for website contradictions\\
\\
Using Firecrawl and Claude to scrape your website's data and look for contradictions.](https://www.firecrawl.dev/blog/contradiction-agent) [Why Companies Need a Data Strategy for Generative AI\\
\\
Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses\\
\\
A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [How to easily install requests with pip and python\\
\\
A tutorial on installing the requests library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-easily-install-requests-with-pip-and-python) [How to quickly install BeautifulSoup with Python\\
\\
A guide on installing the BeautifulSoup library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python) [How to Use OpenAI's o1 Reasoning Models in Your Applications\\
\\
Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications)
## Scaling Firecrawl's Requests
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Sep 13, 2024
â˘
[GergĹ MĂłricz (mogery)](https://x.com/mo_geryy)
# Handling 300k requests per day: an adventure in scaling

When I joined the Firecrawl team in early July, we spent most of our time working on new features and minor bugfixes. Life was good â we could focus mostly on shipping shiny new stuff without worrying as much about architecture and server load. However, as we grew over time, we started experiencing the âhug of deathâ a lot more. People loved our product so much that our architecture couldnât take it anymore, and every day there was a brand new fire to put out. We knew that this was unsustainable, and ultimately it damages our DX more than any new feature we could put out could make up for. We knew we had to change things, stat.
## Our architecture, before the storm
We host our API service on [Fly.io](https://fly.io/), which allows us to easily deploy our code in a Docker container. It also manages load balancing, log collection, zero-downtime deployment strategies, VPC management, and a whole load of other stuff for us, which is very useful.
Our main API service has two kinds of âprocessesâ, as Fly calls it: `app` and `worker`.
`app` processes use Express to serve the main API, perform scrape requests (which take a relatively short time), and delegate crawls to `worker` processes using the [Bull](https://github.com/OptimalBits/bull) job queue.
`worker` processes register themselves as workers on the job queue, and perform crawls (which take a relatively long time).
Both processes use Supabase to handle authentication and store data in Postgres. Bull also runs on top of Redis, which we deployed on [Railway](https://railway.app/), since itâs super easy to use.
## Locks are hard
As more and more people started using us, more and more people started finding bugs. We started getting odd issues with crawls sometimes being stuck for hours without any progress. I charted the timing of these crawls, and I saw that it was happening every time we redeployed.
Due to some miscellaneous memory leak issues, we were redeploying our entire service every 2 hours via GitHub Actions, in order to essentially restart all our machines. This killed all our workers, which had acquired locks for these crawl jobs. I was not too familiar with the codebase at this point, and I thought that these locks got hard-stuck on the dead workers, so I to add some code to release all of the current workerâs locks on termination.
This ended up being really complicated, due to multiple factors:
1. Other libraries we used also had cleanup code on `SIGTERM`. When you listen to `SIGTERM`, your app doesnât actually quit until the handler calls `process.exit()`. So, the other libraryâs handler called `process.exit()` when its handler finished, which caused a race condition with our cleanup handler. (This was absolute hell to debug.)
2. Fly.io sometimes didnât respect our configuration, and hard- `SIGKILL` ed our application before the 30 second timeout we specified our config. This cut our cleanup code short.
3. There was no easy way to remove a lock via the Bull API. The only legitimate way it could be done was to:
1. Get all in-progress jobs of this worker
2. Set their status to failed
3. Delete them from the queue
4. Re-insert them to the queue
4. While the cleanup code was running, there was no easy way to disable the current worker, so sometimes jobs the cleanup code re-inserted were immediately picked up by the same worker that was about to be shut down.
5. Due to our rollover deployment strategy, during a deployment, the re-inserted jobs were picked up by workers that have not been updated yet. This caused all the jobs to be piled up on the last worker to be updated, which caused the cleanup code to run longer than Flyâs maximum process shutdown timeout.
While I was going down a rabbithole that was spiraling out of control, Thomas (another Firecrawl engineer who mainly works on [Fire-Engine](https://www.firecrawl.dev/blog/introducing-fire-engine-for-firecrawl), which used a similar architecture) discovered that our queue lock options were grossly misconfigured:
```typescript
webScraperQueue = new Queue("web-scraper", process.env.REDIS_URL, {
settings: {
lockDuration: 2 * 60 * 60 * 1000, // 2 hours in milliseconds
lockRenewTime: 30 * 60 * 1000, // 30 minutes in milliseconds
},
});
```
This was originally written with the understanding that `lockDuration` would be the maximum amount of time a job could take â which is not true. When a worker stops renewing the lock every `lockRenewTime` milliseconds, `lockDuration` specifies the amount of time to wait before declaring the job as `stalled` and giving it to another worker. This was causing the crawls to be locked up for 2 hours, similar to what our customers were reporting.
After I got rid of all my super-complex cleanup code, the fix ended up being this:
```typescript
webScraperQueue = new Queue("web-scraper", process.env.REDIS_URL, {
settings: {
lockDuration: 2 * 60 * 1000, // 1 minute in milliseconds
lockRenewTime: 15 * 1000, // 15 seconds in milliseconds
},
});
```
Thank you Thomas for spotting that one and keeping me from going off the deep end!
## Scaling scrape requests, the easy way
As you might have noticed in the architecture description, we were running scrape requests on the `app` process, the same one that serves our API. We were just starting a scrape in the `/v0/scrape` endpoint handler, and returning the results. This is simple to build, but it isnât sustainable.
We had no idea how many scrape requests we were running and when, there was no way to retry failed scrape requests, we had no data source to scale the `app` process on (other than are we down or not), and we had to scrape Express along with it. We needed to move scraping to our `worker` process.
We ended up choosing to just add scrape jobs to the same queue as crawling jobs. This way the `app` submitted the job, the `worker` completed it, and the `app` waited for it to be done and returned the data. [We read the old advice about ânever wait for jobs to finishâ](https://blog.taskforce.sh/do-not-wait-for-your-jobs-to-complete/), but we decided to cautiously ignore it, since it would have ruined the amazing simplicity that the scrape endpoint has.
This ended up [being surprisingly simple](https://github.com/mendableai/firecrawl/commit/6798695ee4daf1ce1b289db494d260d718b6752b#diff-6753e371514e1d188e797436080479e7c781d96183601ab8fa203e4df6ca0400), only slightly affected by Bullâs odd API. We had to add a global event handler to check if the job had completed, since it lacked the [`Job.waitUntilFinished`](https://api.docs.bullmq.io/classes/v5.Job.html#waitUntilFinished) function that its successor [BullMQ](https://github.com/taskforcesh/bullmq) already had.
We saw a huge drop in weird behaviour on our `app` machines, and we were able to scale them down in exchange for more `worker` machines, making us way faster.
## Smaller is better
The redeploy crawl fiasco made us worried about handling big crawls. We could essentially 2x the time a big crawl ran if it was caught in the middle of a redeploy, which is sub-optimal. Some of our workers were also crashing with an OOM error when working on large crawls. We instead decided to break crawls down to individual scrape jobs that chain together and spawn new jobs when they find new URLs.
We decided to make every job in the queue have a scrape type. Scrape jobs that are associated with crawls have an extra bit of metadata tying them to the crawlId. This crawlId refers to some redis keys that coordinate the crawling process.
The crawl itself has some basic data including the origin URL, the team associated with the request, the robots.txt file, and others:
```typescript
export type StoredCrawl = {
originUrl: string;
crawlerOptions: any;
pageOptions: any;
team_id: string;
plan: string;
robots?: string;
cancelled?: boolean;
createdAt: number;
};
export async function saveCrawl(id: string, crawl: StoredCrawl) {
await redisConnection.set("crawl:" + id, JSON.stringify(crawl));
await redisConnection.expire("crawl:" + id, 24 * 60 * 60, "NX");
}
export async function getCrawl(id: string): Promise {
const x = await redisConnection.get("crawl:" + id);
if (x === null) {
return null;
}
return JSON.parse(x);
}
```
We also make heavy use of Redis sets to determine which URLs have been already visited when discovering new pages. The Redis `SADD` command adds a new element to a set. Since sets can only store unique values, it returns 1 or 0 based on whether the element was added or not. (The element does not get added if it was already in the set before.) We use this as a lock mechanism, to make sure two workers donât discover the same URL at the same time and add two jobs for them.
```typescript
async function lockURL(id: string, url: string): Promise {
// [...]
const res =
(await redisConnection.sadd("crawl:" + id + ":visited", url)) !== 0;
// [...]
return res;
}
async function onURLDiscovered(crawl: string, url: string) {
if (await lockURL(crawl, url)) {
// we are the first ones to discover this URL
await addScrapeJob(/* ... */); // add new job for this URL
}
}
```
You can take a look at the whole Redis logic around orchestrating crawls [here](https://github.com/mendableai/firecrawl/blob/main/apps/api/src/lib/crawl-redis.ts).
With this change, we saw a huge performance improvement on crawls. This change also allowed us to perform multiple scrape requests of one crawl at the same time, while the old crawler had no scrape concurrency. We were able to stretch a crawl over all of our machines, maximizing the worth we get for each machine we pay for.
## Goodbye Bull, hello BullMQ
Every time we encountered Bull, we were slapped in the face by how much better BullMQ was. It had a better API, new features, and the most important thing of all: active maintenance. We decided to make the endeavour to switch over to it, first on Fire-Engine, and then on Firecrawl.
With this change, we were able to drop the horrible code for [waiting for a job to complete](https://github.com/mendableai/firecrawl/blob/6798695ee4daf1ce1b289db494d260d718b6752b/apps/api/src/controllers/scrape.ts#L59-L89), and replace it all with `job.waitUntilFinished()`. We were also able to customize our workers to add Sentry instrumentation (more on that later), and to take on jobs based on CPU and RAM usage, instead of a useless max concurrency constant that we had to use with Bull.
BullMQ still has its API quirks (e.g. donât you dare call `Job.moveToCompleted` / `Job.moveToFailed` with the 3rd argument not set to `false`, otherwise you will check out and lock a job that will be returned to you that youâre probably dropping)
## Our egress fee horror story
Our changes made us super scalable, but they also meant that a lot more traffic was going through Redis. We ended up racking up a 15000$ bill on Railway in August, mostly on Redis egress fees only. This wasnât sustainable, and we needed to switch quickly.
After being disappointed with Upstash, and having issues with Dragonfly, we found a way to deploy Redis to Fly.io natively. [We put our own spin on the config](https://github.com/mendableai/firecrawl/blob/f7c4cee404e17b3ed201e005185a5041009d0e6f/apps/redis/fly.toml), and deployed it to our account. However, we were not able to reach the instance from the public IP using `redis-cli` (netcat worked though?!?!), which caused some confusion.
We decided to go another way and use Flyâs [Private Networking](https://fly.io/docs/networking/private-networking/), which provides a direct connection to a Fly app/machine without any load balancer being in front. We crafted a connection string, SSHâd into one of our worker machines, installed `redis-cli`, tried to connect, and⌠it worked! We had a reachable, stable Redis instance in front of us.
So, we went to change the environment variable to the fancy new Fly.io Redis, we deployed the application, and⌠we crashed. After a quick revert, we noticed that [IORedis](https://github.com/redis/ioredis) wasnât able to connect to the Redis instance, but `redis-cli` stilled worked fine. So⌠what gives?
Turns out, `ioredis` only performs a lookup for an IPv4 address, unless you specify `?family=6`, in which case it only performs a lookup for an IPv6 address. This is not documented anywhere, except in a couple of GitHub issues which are hard to search for. I have been coding for almost 11 years now, and this is the worst configuration quirk I have ever seen. (And I use Nix daily!) In 2024, it would be saner to look for IPv6 by default instead of IPv4. Why not look for both? This is incomprehensible to me.
Anyways, after appending `?family=6` to the string, everything worked, except, sometimes notâŚ
## Awaiting forever
We started having huge waves of scrape timeouts. After a bit of investigation, the `Job.waitUntilFinished()` Promise never returned, but after looking at our BullMQ dashboard, we saw that jobs were actually being completed.
BullMQ uses Redis streams for all of its event firing/handling code, including `waitUntilFinished`, which waits until the jobâs `finished` event fires. BullMQ enforces a maximum length for the event stream, in order to purge old events that have presumably already been handled, and it defaults to about 10000 maximum events. Under heavy load, our queue was firing so many events, that BullMQ was trimming events before they could be processed. This caused everything that depends on queue events to fail.
This maximum events parameter is configurable, however, it seems like a parameter that weâd have to babysit, and itâs way too cryptic and too easy to forget about. Instead, we opted to rewrite the small amount of code that uses queue events to do polling instead, which is not affected by pub/sub issues like this.
Inexplicably, this never happened on the old Railway Redis instance, but it happened on every alternative we tried (including Upstash and Dragonfly). Weâre still not sure why we didnât run into this issue earlier, and BullMQ queue events still work happily on the Fire-Engine side under Dragonfly.
## Adding monitoring
We were growing tired of going through console logs to diagnose things. We were also worried about how many issues we could potentially be missing. So, we decided to integrate [Sentry](https://sentry.io/) for error and performance monitoring, because I had some great experiences with it in the past.
The moment we added it, we found about 10 high-impact bugs that we had no idea about. I fixed them the day after. We also had an insight into what our services were actively doing â I was able to add custom instrumentation to BullMQ, and pass trace IDs over to Fire-Engine, so now we can view the entire process a scrape or crawl goes through until it finishes, all organized in one place.

(The creation of this image for this post lead me to [decrease the time Firecrawl spends after Fire-Engine is already finished](https://github.com/mendableai/firecrawl/commit/000a316cc362b935976ac47b73ec02923f4175c5). Thanks, Sentry!)
Sentry has been immensely useful in finding errors, debugging incidents, and improving performance. There is no longer a chance that we have an issue invisibly choking us. With Sentry we see everything that could be going wrong (super exciting to see AIOps tools like [Keep](https://www.keephq.dev/) popping up).
## The future
We are currently stable. I was on-call last weekend and I forgot about it. The phone never rang. It felt very weird after putting out fires for so long, but our investment absolutely paid off. It allowed us to do [our launch week](https://www.firecrawl.dev/blog/firecrawl-launch-week-1-recap), which would not have been possible if we were in panic mode 24/7. It has also allowed our customers to build with confidence, as the increased reliabilty adds another layer of greatness to Firecrawl.
However, there are still things weâre unhappy with. Fly, while very useful early-stage, doesnât let us smoothly autoscale. We are currently setting up Kubernetes to give us more control over our scaling.
I love making Firecrawl better, be it with features or with added reliability. Weâre in a good place right now, but Iâm sure there will be a lot more adventures with scaling in the future. I hope this post has been useful, since surprisingly few people talk about all this stuff. (We sure had trouble finding resources when we were trying to fix things.) I will likely be back with a part 2 when thereâs more exciting things to talk about.
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
GergĹ MĂłricz (mogery)@mo\_geryy](https://x.com/mo_geryy)
GergĹ MĂłricz is a Software Engineer at Firecrawl. He works on scaling, monitoring, designing new APIs and features, putting out fires, customer support, and everything else there is to do at a tech startup.
Previously coded and scaled a hospitality tech startup, and contributed to Mendable on GitHub.
### More articles by GergĹ MĂłricz (mogery)
[Handling 300k requests per day: an adventure in scaling\\
\\
Putting out fires was taking up all our time, and we had to scale fast. This is how we did it.](https://www.firecrawl.dev/blog/an-adventure-in-scaling)
## BeautifulSoup4 vs Scrapy
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Dec 24, 2024
â˘
[Bex Tuychiev](https://x.com/bextuychiev)
# BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python

## 1\. Introduction
Web scraping has become an essential tool for gathering data from the internet. Whether youâre tracking prices, collecting news articles, or building a research dataset, Python offers several popular libraries to help you extract information from websites. Two of the most widely used tools are BeautifulSoup4 (BS4) and Scrapy, each with its own strengths and ideal use cases.
Choosing between BS4 and Scrapy isnât always straightforward. BS4 is known for its simplicity and ease of use, making it perfect for beginners and small projects. Scrapy, on the other hand, offers powerful features for large-scale scraping but comes with a steeper learning curve. Making the right choice can save you time and prevent headaches down the road.
In this guide, weâll compare BS4 and Scrapy in detail, looking at their features, performance, and best uses. Weâll also explore practical examples and discuss modern alternatives that solve common scraping challenges. By the end, youâll have a clear understanding of which tool best fits your needs and how to get started with web scraping in Python.
## Prerequisites
Before diving into the comparison, make sure you have:
- Basic knowledge of Python programming
- Understanding of HTML structure and CSS selectors
- Python 3.7+ installed on your system
- Familiarity with command line interface
- A code editor or IDE of your choice
Youâll also need to install the required libraries:
```bash
pip install beautifulsoup4 scrapy firecrawl-py pydantic python-dotenv
```
## 2\. Understanding BeautifulSoup4
BeautifulSoup4, often called BS4, is a Python library that helps developers extract data from HTML and XML files. Think of it as a tool that can read and understand web pages the same way your browser does, but instead of showing you the content, it helps you collect specific information from it. BS4 works by turning messy HTML code into a well-organized structure thatâs easy to navigate and search through.
The library shines in its simplicity. With just a few lines of code, you can pull out specific parts of a webpage like headlines, prices, or product descriptions. Hereâs a quick example:
```python
from bs4 import BeautifulSoup
import requests
# Get a webpage
response = requests.get('https://news.ycombinator.com')
soup = BeautifulSoup(response.text, 'html.parser')
# Find all article titles
titles = soup.find_all('span', class_='titleline')
for idx, title in enumerate(titles):
print(f"{idx + 1}. {title.text.strip()}")
if idx == 4:
break
```
This code demonstrates BeautifulSoup4âs straightforward approach to web scraping. It fetches the Hacker News homepage using the requests library, then creates a BeautifulSoup object to parse the HTML. The `find_all()` method searches for `` elements with the class `"titleline"`, which contain article titles. The code loops through the first 5 titles, printing each one with its index number. The `strip()` method removes any extra whitespace around the titles.
The output shows real article titles from Hacker News, demonstrating how BS4 can easily extract specific content from a webpage:
```out
1. The GTA III port for the Dreamcast has been released (gitlab.com/skmp)
2. Arnis: Generate Cities in Minecraft from OpenStreetMap (github.com/louis-e)
3. Things we learned about LLMs in 2024 (simonwillison.net)
4. Journey from Entrepreneur to Employee (akshay.co)
5. Systems ideas that sound good but almost never work (learningbyshipping.com)
```
While BS4 excels at handling static websites, it does have limitations. It canât process JavaScript-generated content, which many modern websites use. It also doesnât handle tasks like managing multiple requests or storing data. However, these limitations are often outweighed by its gentle learning curve and excellent documentation, making it an ideal starting point for anyone new to web scraping.
Key Features:
- Simple, intuitive API for parsing HTML/XML
- Powerful searching and filtering methods
- Forgiving HTML parser that can handle messy code
- Extensive documentation with clear examples
- Small memory footprint
- Compatible with multiple parsers ( `lxml`, `html5lib`)
## 3\. Understanding Scrapy

Source: [Scrapy documentation](https://docs.scrapy.org/en/latest/topics/architecture.html).
Scrapy takes a different approach to web scraping by providing a complete framework rather than just a parsing library. Think of it as a Swiss Army knife for web scraping â it includes everything you need to crawl websites, process data, and handle common scraping challenges all in one package. While this makes it more powerful than BS4, it also means thereâs more to learn before you can get started.
Hereâs a basic example of how Scrapy works:
```python
# hackernews_spider.py
import scrapy
class HackerNewsSpider(scrapy.Spider):
name = "hackernews"
start_urls = ["https://news.ycombinator.com"]
def parse(self, response):
# Get all stories
for story in response.css("span.titleline"):
# Extract story title
yield {"title": story.css("a::text").get()}
# Go to next page if available
# if next_page := response.css('a.morelink::attr(href)').get():
# yield response.follow(next_page, self.parse)
# To run the spider, we need to use the Scrapy command line
# scrapy runspider hackernews_spider.py -o results.json
```
This code defines a simple Scrapy spider that crawls Hacker News. The spider starts at the homepage, extracts story titles from each page, and could optionally follow pagination links (currently commented out). The spider uses CSS selectors to find and extract content, demonstrating Scrapyâs built-in parsing capabilities. The results can be exported to JSON using Scrapyâs command line interface.
What sets Scrapy apart is its architecture. Instead of making one request at a time like BS4, Scrapy can handle multiple requests simultaneously, making it much faster for large projects. It also includes built-in features that youâd otherwise need to build yourself.
Scrapyâs key components include:
Spider middleware for customizing request/response handling, item pipelines for processing and storing data, and automatic request queuing and scheduling. It provides built-in support for exporting data in formats like JSON, CSV, and XML. The framework also includes robust error handling with retry mechanisms and a command-line interface for project management.
## 4\. Head-to-Head Comparison
Letâs break down how BS4 and Scrapy compare in key areas that matter most for web scraping projects.
### Performance
When it comes to speed and efficiency, Scrapy has a clear advantage. Its ability to handle multiple requests at once means it can scrape hundreds of pages while BS4 is still working on its first dozen. Think of BS4 as a solo worker, carefully processing one page at a time, while Scrapy is like a team of workers tackling many pages simultaneously.
Memory usage tells a similar story. BS4 is lightweight and uses minimal memory for single pages, making it perfect for small projects. However, Scrapyâs smart memory management shines when dealing with large websites, efficiently handling thousands of pages without slowing down your computer.
### Ease of Use
BS4 takes the lead in simplicity. You can start scraping with just 4-5 lines of code and basic Python knowledge. Hereâs a quick comparison:
BS4:
```python
from bs4 import BeautifulSoup
import requests
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")
titles = soup.find_all("h1")
```
Scrapy:
```python
import scrapy
class MySpider(scrapy.Spider):
name = 'myspider'
start_urls = ['https://example.com']
def parse(self, response):
titles = response.css('h1::text').getall()
yield {'titles': titles}
# Requires additional setup and command-line usage as seen above
```
### Features
Hereâs a simple breakdown of key features:
| Feature | BeautifulSoup4 | Scrapy |
| --- | --- | --- |
| JavaScript Support | â | â (needs add-ons) |
| Multiple Requests | â (manual) | â
(automatic) |
| Data Processing | â (basic) | â
(built-in pipelines) |
| Error Handling | â (manual) | â
(automatic retries) |
| Proxy Support | â (manual) | â
(built-in) |
### Use Cases
Choose BS4 when:
- Youâre new to web scraping
- You need to scrape a few simple pages
- You want to quickly test or prototype
- The website is mostly static HTML
- Youâre working within a larger project
Choose Scrapy when:
- You need to scrape thousands of pages
- You want built-in data processing
- You need advanced features like proxy rotation
- Youâre building a production scraper
- Performance is critical
## 5\. Common Challenges and Limitations
Web scraping tools face several hurdles that can make data extraction difficult or unreliable. Understanding these challenges helps you choose the right tool and prepare for potential roadblocks.
### Dynamic Content
Modern websites often load content using JavaScript after the initial page load. Neither BS4 nor Scrapy can handle this directly. While you can add tools like Selenium or Playwright to either solution, this makes your scraper more complex and slower. A typical example is an infinite scroll page on social media â the content isnât in the HTML until you scroll down.
### Anti-Bot Measures
Websites are getting smarter at detecting and blocking scrapers. Common protection methods include:
- CAPTCHAs and reCAPTCHA challenges
- IP-based rate limiting
- Browser fingerprinting
- Dynamic HTML structure changes
- Hidden honeypot elements
While Scrapy offers some built-in tools like proxy support and request delays, both BS4 and Scrapy users often need to implement additional solutions to bypass these protections.
### Maintenance Burden
Perhaps the biggest challenge is keeping scrapers running over time. Websites frequently change their structure, breaking scrapers that rely on specific HTML patterns. Hereâs a real-world example:
Before website update:
```python
# Working scraper
soup.find('div', class_='product-price').text # Returns: "$99.99"
```
After website update, same code now returns None because the structure changed:
```python
soup.find('span', class_='price-current').text # Returns: None
```
This constant need for updates creates a significant maintenance overhead, especially when managing multiple scrapers. While Scrapyâs more robust architecture helps handle some issues automatically, both tools require regular monitoring and fixes to maintain reliability.
### Resource Management
Each tool presents unique resource challenges:
- BS4: High memory usage when parsing large pages
- Scrapy: Complex configuration for optimal performance
- Both: Network bandwidth limitations
- Both: Server response time variations
These limitations often require careful planning and optimization, particularly for large-scale scraping projects where efficiency is crucial.
## 6\. Modern Solutions: Introducing Firecrawl
After exploring the limitations of traditional scraping tools, letâs look at how modern AI-powered solutions like Firecrawl are changing the web scraping landscape. Firecrawl takes a fundamentally different approach by using natural language understanding to identify and extract content, rather than relying on brittle HTML selectors.
### AI-Powered Content Extraction

Unlike BS4 and Scrapy which require you to specify exact HTML elements, Firecrawl lets you describe what you want to extract in plain English. This semantic approach means your scrapers keep working even when websites change their structure. Hereâs a practical example of scraping GitHubâs trending repositories:
```python
# Import required libraries
from firecrawl import FirecrawlApp
from pydantic import BaseModel, Field
from dotenv import load_dotenv
from typing import List
# Load environment variables from .env file
load_dotenv()
# Define Pydantic model for a single GitHub repository
class Repository(BaseModel):
# Each field represents a piece of data we want to extract
name: str = Field(description="The repository name including organization/username")
description: str = Field(description="The repository description")
stars: int = Field(description="Total number of stars")
language: str = Field(description="Primary programming language")
url: str = Field(description="The repository URL")
# Define model for the full response containing list of repositories
class Repositories(BaseModel):
repositories: List[Repository] = Field(description="List of trending repositories")
# Initialize Firecrawl app
app = FirecrawlApp()
# Scrape GitHub trending page using our defined schema
trending_repos = app.scrape_url(
'https://github.com/trending',
params={
# Specify we want to extract structured data
"formats": ["extract"],
"extract": {
# Use our Pydantic model schema for extraction
"schema": Repositories.model_json_schema(),
}
}
)
# Loop through the first 3 repositories and print their details
for idx, repo in enumerate(trending_repos['extract']['repositories']):
print(f"{idx + 1}. {repo['name']}")
print(f"â {repo['stars']} stars")
print(f"đť {repo['language']}")
print(f"đ {repo['description']}")
print(f"đ {repo['url']}\n")
# Break after showing 3 repositories
if idx == 2:
break
```
```python
1. pathwaycom/pathway
â 11378 stars
đť Python
đ Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.
đ https://github.com/pathwaycom/pathway
2. EbookFoundation/free-programming-books
â 345107 stars
đť HTML
đ đ Freely available programming books
đ https://github.com/EbookFoundation/free-programming-books
3. DrewThomasson/ebook2audiobook
â 3518 stars
đť Python
đ Convert ebooks to audiobooks with chapters and metadata using dynamic AI models and voice cloning. Supports 1,107+ languages!
đ https://github.com/DrewThomasson/ebook2audiobook
```
Firecrawl addresses the major pain points we discussed earlier:
1. **JavaScript Rendering**: Automatically handles dynamic content without additional tools
2. **Anti-Bot Measures**: Built-in proxy rotation and browser fingerprinting
3. **Maintenance**: AI adapts to site changes without updating selectors
4. **Rate Limiting**: Smart request management with automatic retries
5. **Multiple Formats**: Export data in various formats (JSON, CSV, Markdown)
### When to Choose Firecrawl
Firecrawl is particularly valuable when:
- You need reliable, low-maintenance scrapers
- Websites frequently change their structure
- Youâre dealing with JavaScript-heavy sites
- Anti-bot measures are a concern
- You need clean, structured data for AI/ML
- Time-to-market is critical
While you have to pay for higher usage limits, the reduction in development and maintenance time often makes it more cost-effective than maintaining custom scraping infrastructure with traditional tools.
## 7\. Making the Right Choice
Choosing the right web scraping tool isnât a one-size-fits-all decision. Letâs break down a practical framework to help you make the best choice for your specific needs.
### Decision Framework
1. **Project Scale**
- Small (1-10 pages): BeautifulSoup4
- Medium (10-100 pages): BeautifulSoup4 or Scrapy
- Large (100+ pages): Scrapy or Firecrawl
2. **Technical Requirements**
- Static HTML only: BeautifulSoup4
- Multiple pages & data processing: Scrapy
- Dynamic content & anti-bot bypass: Firecrawl
3. **Development Resources**
- Time available:
- Hours: BeautifulSoup4
- Days: Scrapy
- Minutes: Firecrawl
- Team expertise:
- Beginners: BeautifulSoup4
- Experienced developers: Scrapy
- Production teams: Firecrawl
### Cost-Benefit Analysis
| Factor | BeautifulSoup4 | Scrapy | Firecrawl |
| --- | --- | --- | --- |
| Initial Cost | Free | Free | Paid |
| Development Time | Low | High | Minimal |
| Maintenance Cost | High | Medium | Low |
| Scalability | Limited | Good | Excellent |
### Future-Proofing Your Choice
Consider these factors for long-term success:
1. **Maintainability**
- Will your team be able to maintain the scraper?
- How often does the target website change?
- Whatâs the cost of scraper downtime?
2. **Scalability Requirements**
- Do you expect your scraping needs to grow?
- Will you need to add more websites?
- Are there seasonal traffic spikes?
3. **Integration Needs**
- Does it need to work with existing systems?
- What format do you need the data in?
- Are there specific performance requirements?
### Practical Recommendations
Start with BeautifulSoup4 if:
- Youâre learning web scraping
- You need to scrape simple, static websites
- You have time to handle maintenance
- Budget is your primary constraint
Choose Scrapy when:
- You need to scrape at scale
- You have experienced developers
- You need fine-grained control
- Youâre building a long-term solution
Consider Firecrawl if:
- Time to market is critical
- You need reliable production scrapers
- Maintenance costs are a concern
- Youâre dealing with complex websites
- You need AI-ready data formats
## 8\. Conclusion
The web scraping landscape offers distinct tools for different needs. BeautifulSoup4 excels in simplicity, making it ideal for beginners and quick projects. Scrapy provides powerful features for large-scale operations but requires more expertise. Modern solutions like Firecrawl bridge the gap with AI-powered capabilities that address traditional scraping challenges, though at a cost.
### Key Takeaways
- BeautifulSoup4: Best for learning and simple, static websites
- Scrapy: Ideal for large-scale projects needing fine control
- Firecrawl: Perfect when reliability and low maintenance are priorities
- Consider long-term costs and scalability in your decision
Choose based on your projectâs scale, team expertise, and long-term needs. As websites grow more complex and anti-bot measures evolve, picking the right tool becomes crucial for sustainable web scraping success.
### Useful links
- [BeautifulSoup4 Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Scrapy Official Website](https://scrapy.org/)
- [Scrapy Documentation](https://docs.scrapy.org/)
- [Web Scraping Best Practices](https://www.scrapingbee.com/blog/web-scraping-best-practices/)
- [Firecrawl Documentation](https://docs.firecrawl.dev/)
- [Getting Started With Firecrawl](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint)
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Bex Tuychiev@bextuychiev](https://x.com/bextuychiev)
Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics
### More articles by Bex Tuychiev
[Building an Automated Price Tracking Tool\\
\\
Build an automated e-commerce price tracker in Python. Learn web scraping, price monitoring, and automated alerts using Firecrawl, Streamlit, PostgreSQL.](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python) [Web Scraping Automation: How to Run Scrapers on a Schedule\\
\\
Learn how to automate web scraping in Python using free tools like schedule, asyncio, cron jobs and GitHub Actions. This comprehensive guide covers local and cloud-based scheduling methods to run scrapers reliably in 2025.](https://www.firecrawl.dev/blog/automated-web-scraping-free-2025) [Automated Data Collection - A Comprehensive Guide\\
\\
Learn how to build robust automated data collection systems using modern tools and best practices. This guide covers everything from selecting the right tools to implementing scalable collection pipelines.](https://www.firecrawl.dev/blog/automated-data-collection-guide) [BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python\\
\\
Learn the key differences between BeautifulSoup4 and Scrapy for web scraping in Python. Compare their features, performance, and use cases to choose the right tool for your web scraping needs.](https://www.firecrawl.dev/blog/beautifulsoup4-vs-scrapy-comparison) [How to Build an Automated Competitor Price Monitoring System with Python\\
\\
Learn how to build an automated competitor price monitoring system in Python that tracks prices across e-commerce sites, provides real-time comparisons, and maintains price history using Firecrawl, Streamlit, and GitHub Actions.](https://www.firecrawl.dev/blog/automated-competitor-price-scraping) [Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude\\
\\
Learn how to build a web scraper in Python that gathers company details, funding rounds, and investor information from public sources like Crunchbase using Firecrawl and Claude for automated data collection and analysis.](https://www.firecrawl.dev/blog/crunchbase-scraping-with-firecrawl-claude) [How to Create Custom Instruction Datasets for LLM Fine-tuning\\
\\
Learn how to build high-quality instruction datasets for fine-tuning large language models (LLMs). This guide covers when to create custom datasets, best practices for data collection and curation, and a practical example of building a code documentation dataset.](https://www.firecrawl.dev/blog/custom-instruction-datasets-llm-fine-tuning) [Data Enrichment: A Complete Guide to Enhancing Your Data Quality\\
\\
Learn how to enrich your data quality with a comprehensive guide covering data enrichment tools, best practices, and real-world examples. Discover how to leverage modern solutions like Firecrawl to automate data collection, validation, and integration for better business insights.](https://www.firecrawl.dev/blog/complete-guide-to-data-enrichment)
## Python Web Scraping Projects
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Dec 17, 2024
â˘
[Bex Tuychiev](https://x.com/bextuychiev)
# 15 Python Web Scraping Projects: From Beginner to Advanced

## Introduction
Web scraping is one of the most powerful tools in a programmerâs arsenal, allowing you to gather data from across the internet automatically. It has countless applications like market research, competitive analysis, [price monitoring](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python), and data-driven decision making. The ability to extract structured data from web pages opens up endless possibilities for automation and analysis.
This guide outlines 15 web scraping project ideas in Python that progress from basic concepts to advanced techniques. Each project includes learning objectives, key technical concepts, and a structured development roadmap. While this guide doesnât provide complete code implementations, it serves as a blueprint for your web scraping journey - helping you understand what to build and how to approach each challenge systematically.
Letâs begin by understanding the available tools and setting up our development environment. Then weâll explore each project outline in detail, giving you a solid foundation to start building your own web scraping solutions.
## Table of Contents
1. [Introduction](https://www.firecrawl.dev/blog/python-web-scraping-projects#introduction)
2. [Prerequisites](https://www.firecrawl.dev/blog/python-web-scraping-projects#prerequisites)
- [Required Skills](https://www.firecrawl.dev/blog/python-web-scraping-projects#required-skills)
- [Technical Requirements](https://www.firecrawl.dev/blog/python-web-scraping-projects#technical-requirements)
- [Optional but Helpful](https://www.firecrawl.dev/blog/python-web-scraping-projects#optional-but-helpful)
- [Time Commitment](https://www.firecrawl.dev/blog/python-web-scraping-projects#time-commitment)
3. [Comparing Python Web Scraping Frameworks](https://www.firecrawl.dev/blog/python-web-scraping-projects#comparing-python-web-scraping-frameworks-for-your-projects)
- [BeautifulSoup4](https://www.firecrawl.dev/blog/python-web-scraping-projects#beautifulsoup4)
- [Selenium](https://www.firecrawl.dev/blog/python-web-scraping-projects#selenium)
- [Scrapy](https://www.firecrawl.dev/blog/python-web-scraping-projects#scrapy)
- [Firecrawl](https://www.firecrawl.dev/blog/python-web-scraping-projects#firecrawl)
4. [Setting Up Your Web Scraping Environment](https://www.firecrawl.dev/blog/python-web-scraping-projects#setting-up-your-web-scraping-environment) 5. [Beginner Web Scraping Projects](https://www.firecrawl.dev/blog/python-web-scraping-projects#beginner-web-scraping-projects)
1. [Weather Data Scraper](https://www.firecrawl.dev/blog/python-web-scraping-projects#1-weather-data-scraper)
2. [News Headlines Aggregator](https://www.firecrawl.dev/blog/python-web-scraping-projects#2-news-headlines-aggregator)
3. [Book Price Tracker](https://www.firecrawl.dev/blog/python-web-scraping-projects#3-book-price-tracker)
4. [Recipe Collector](https://www.firecrawl.dev/blog/python-web-scraping-projects#4-recipe-collector)
5. [Job Listing Monitor](https://www.firecrawl.dev/blog/python-web-scraping-projects#5-job-listing-monitor)
5. [Intermediate Web Scraping Projects](https://www.firecrawl.dev/blog/python-web-scraping-projects#intermediate-web-scraping-projects)
1. [E-commerce Price Comparison Tool](https://www.firecrawl.dev/blog/python-web-scraping-projects#1-e-commerce-price-comparison-tool)
2. [Social Media Analytics Tool](https://www.firecrawl.dev/blog/python-web-scraping-projects#2-social-media-analytics-tool)
3. [Real Estate Market Analyzer](https://www.firecrawl.dev/blog/python-web-scraping-projects#3-real-estate-market-analyzer)
4. [Academic Research Aggregator](https://www.firecrawl.dev/blog/python-web-scraping-projects#4-academic-research-aggregator)
5. [Financial Market Data Analyzer](https://www.firecrawl.dev/blog/python-web-scraping-projects#5-financial-market-data-analyzer)
6. [Advanced Web Scraping Projects](https://www.firecrawl.dev/blog/python-web-scraping-projects#advanced-web-scraping-projects)
1. [Multi-threaded News Aggregator](https://www.firecrawl.dev/blog/python-web-scraping-projects#1-multi-threaded-news-aggregator)
2. [Distributed Web Archive System](https://www.firecrawl.dev/blog/python-web-scraping-projects#2-distributed-web-archive-system)
3. [Automated Market Research Tool](https://www.firecrawl.dev/blog/python-web-scraping-projects#3-automated-market-research-tool)
4. [Competitive Intelligence Dashboard](https://www.firecrawl.dev/blog/python-web-scraping-projects#4-competitive-intelligence-dashboard)
5. [Full-Stack Scraping Platform](https://www.firecrawl.dev/blog/python-web-scraping-projects#5-full-stack-scraping-platform)
7. [Conclusion](https://www.firecrawl.dev/blog/python-web-scraping-projects#conclusion)
## Prerequisites
Before starting with these projects, you should have:
### Required Skills
- Basic Python programming experience:
- Variables, data types, and operators
- Control structures (if/else, loops)
- Functions and basic error handling
- Working with lists and dictionaries
- Reading/writing files
- Installing and importing packages
- Basic web knowledge:
- Understanding of HTML structure
- Ability to use browser developer tools (inspect elements)
- Basic CSS selectors (class, id, tag selection)
- Understanding of URLs and query parameters
- Development environment:
- Python 3.x installed
- Ability to use command line/terminal
- Experience with pip package manager
- Text editor or IDE (VS Code, PyCharm, etc.)
### Technical Requirements
- Computer with internet connection
- Modern web browser with developer tools
- Python 3.7+ installed
- Ability to install Python packages via pip
- Basic understanding of virtual environments
### Optional but Helpful
- Understanding of:
- HTTP methods (GET, POST)
- JSON and CSV data formats
- Basic regular expressions
- Simple database concepts
- Git version control
- Experience with:
- pandas library for data manipulation
- Basic data visualization
- API interactions
- Web browser automation
### Time Commitment
- 2-4 hours for setup and environment configuration
- 4-8 hours per beginner project
- Regular practice for skill improvement
If youâre new to web scraping, we recommend starting with the Weather Data Scraper or Recipe Collector projects, as they involve simpler website structures and basic data extraction patterns. The News Headlines Aggregator and Job Listing Monitor projects are more complex and might require additional learning about handling multiple data sources and pagination.
## Comparing Python Web Scraping Frameworks For Your Projects
When starting with web scraping in Python, youâll encounter several popular frameworks. Each has its strengths and ideal use cases. Letâs compare the main options to help you choose the right tool for your needs.
### BeautifulSoup4
BeautifulSoup4 (BS4) is one of the most popular Python libraries for web scraping. It provides a simple and intuitive way to parse HTML and XML documents by creating a parse tree that can be navigated and searched. BS4 excels at extracting data from static web pages where JavaScript rendering isnât required. The library works by transforming HTML code into a tree of Python objects, making it easy to locate and extract specific elements using methods like `find()` and `find_all()`. While it lacks some advanced features found in other frameworks, its simplicity and ease of use make it an excellent choice for beginners and straightforward scraping tasks.
Pros:
- Easy to learn and use
- Excellent documentation
- Great for parsing HTML/XML
- Lightweight and minimal dependencies
Cons:
- No JavaScript rendering
- Limited to basic HTML parsing
- No built-in download features
- Can be slow for large-scale scraping
Example usage:
```python
from bs4 import BeautifulSoup
import requests
response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, 'html.parser')
titles = soup.find_all('h1')
```
### Selenium
Selenium is a powerful web automation framework that can control web browsers programmatically. Originally designed for web application testing, it has become a popular choice for web scraping, especially when dealing with dynamic websites that require JavaScript rendering. Selenium works by automating a real web browser, allowing it to interact with web pages just like a human user would - clicking buttons, filling forms, and handling dynamic content. This makes it particularly useful for scraping modern web applications where content is loaded dynamically through JavaScript.
Pros:
- Handles JavaScript-rendered content
- Supports browser automation
- Can interact with web elements
- Good for testing and scraping
Cons:
- Resource-intensive
- Slower than other solutions
- Requires browser drivers
- Complex setup and maintenance
Example Usage:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://example.com")
elements = driver.find_elements(By.CLASS_NAME, "product-title")
```
### Scrapy
Scrapy is a comprehensive web scraping framework that provides a complete solution for extracting data from websites at scale. Itâs designed as a fast, powerful, and extensible framework that can handle complex scraping tasks efficiently. Unlike simpler libraries, Scrapy provides a full suite of features including a crawling engine, data processing pipelines, and middleware components. It follows the principle of âbatteries includedâ while remaining highly customizable for specific needs. Scrapy is particularly well-suited for large-scale scraping projects where performance and reliability are crucial.
Pros:
- High performance
- Built-in pipeline processing
- Extensive middleware support
- Robust error handling
Cons:
- Steep learning curve
- Complex configuration
- Limited JavaScript support
- Overkill for simple projects
Example Usage:
```python
import scrapy
class ProductSpider(scrapy.Spider):
name = "products"
start_urls = ["https://example.com"]
def parse(self, response):
for product in response.css(".product"):
yield {
"name": product.css(".title::text").get(),
"price": product.css(".price::text").get(),
}
```
### Firecrawl
Firecrawl represents a paradigm shift in web scraping by using AI to eliminate traditional scraping bottlenecks. Unlike conventional frameworks that require manual selector maintenance, Firecrawl uses natural language understanding to automatically identify and extract HTML element content based on semantic descriptions. This approach directly addresses the primary challenges faced in the projects outlined in this guide:
1. Development speed
- Traditional approach: Writing selectors, handling JavaScript, managing anti-bot measures (~2-3 days per site)
- Firecrawl approach: Define data schema, let AI handle extraction (~30 minutes per site)
2. Maintenance requirements
- Traditional approach: Regular updates when sites change, selector fixes, anti-bot adaptations
- Firecrawl approach: Schema remains stable, AI adapts to site changes automatically
3. Project implementation
- For the e-commerce projects: Built-in handling of dynamic pricing, AJAX requests, and anti-bot measures
- For news aggregation: Automatic content classification and extraction across different layouts
- For market research: Seamless handling of multiple site structures and authentication flows
Pros:
- AI-powered content extraction eliminates selector maintenance
- Automatic handling of JavaScript-rendered content
- Built-in anti-bot measures with enterprise-grade reliability
- Multiple output formats (JSON, CSV, structured objects)
- Site change resilience through semantic understanding
- Consistent extraction across different page layouts
Cons:
- Paid service (consider ROI vs. development time)
- API-dependent architecture
- Less granular control over parsing process
- May be overkill for simple, static sites
- Slower for large-scale operations
Example Implementation:
```python
from firecrawl import FirecrawlApp
from pydantic import BaseModel, Field
class Product(BaseModel):
name: str = Field(description="The product name and title")
price: float = Field(description="The current price in USD")
description: str = Field(description="The product description text")
rating: float = Field(description="The average customer rating out of 5 stars")
num_reviews: int = Field(description="The total number of customer reviews")
availability: str = Field(description="The current availability status")
brand: str = Field(description="The product manufacturer or brand")
category: str = Field(description="The product category or department")
asin: str = Field(description="The Amazon Standard Identification Number")
app = FirecrawlApp()
data = app.scrape_url(
'https://www.amazon.com/gp/product/1718501900', # A sample Amazon product
params={
"formats": ['extract'],
"extract": {
"schema": Product.model_json_schema()
}
}
)
```
This example demonstrates how Firecrawl reduces complex e-commerce scraping to a simple schema definition. The same approach applies to all projects in this guide, potentially reducing development time from weeks to days. For production environments where reliability and maintenance efficiency are crucial, this automated approach often proves more cost-effective than maintaining custom scraping infrastructure.
* * *
Here is a table summarizing the differences between these tools:
| Tool | Best For | Learning Curve | Key Features |
| --- | --- | --- | --- |
| BeautifulSoup4 | Static websites, Beginners | Easy | Simple API, Great documentation |
| Selenium | Dynamic websites, Browser automation | Moderate | Full browser control, JavaScript support |
| Scrapy | Large-scale projects | Steep | High performance, Extensive features |
| Firecrawl | Production use, AI-powered scraping | Easy | Low maintenance, Built-in anti-bot |
Useful Resources:
- [BeautifulSoup4 documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Selenium documentation](https://www.selenium.dev/documentation/)
- [Scrapy documentation](https://docs.scrapy.org/)
- [Firecrawl documentation](https://firecrawl.dev/)
- [Introduction to web scraping in Python tutorial](https://realpython.com/python-web-scraping-practical-introduction/)
With these tools and resources at your disposal, youâre ready to start exploring web scraping in Python. Letâs move on to setting up your environment.
## Setting Up Your Web Scraping Environment
Before diving into the projects, letâs set up our Python environment with the necessary tools and libraries. Weâll create a virtual environment and install the required packages.
1. Create and activate a virtual environment
```bash
# Create a new virtual environment
python -m venv scraping-env
# Activate virtual environment
# On Windows:
scraping-env\Scripts\activate
# On macOS/Linux:
source scraping-env/bin/activate
```
2. Install Required Packages
```bash
pip install requests beautifulsoup4 selenium scrapy firecrawl-py pandas
```
3. Additional Setup for Selenium
If you plan to use Selenium, youâll need to install a webdriver. For Chrome:
```bash
pip install webdriver-manager
```
4. Basic Project Structure
Create a basic project structure to organize your code:
```bash
mkdir web_scraping_projects
cd web_scraping_projects
touch requirements.txt
```
Add the dependencies to `requirements.txt`:
```text
requests>=2.31.0
beautifulsoup4>=4.12.2
selenium>=4.15.2
scrapy>=2.11.0
firecrawl-py>=0.1.0
pandas>=2.1.3
webdriver-manager>=4.0.1
```
5. Important Notes
- Always check a websiteâs robots.txt file before scraping
- Implement proper delays between requests (rate limiting)
- Consider using a user agent string to identify your scraper
- Handle errors and exceptions appropriately
- Store your API keys and sensitive data in environment variables
With this environment set up, youâll be ready to tackle any of the projects in this tutorial, from beginner to advanced level. Each project may require additional specific setup steps, which will be covered in their respective sections.
## Beginner Web Scraping Projects
Letâs start with some beginner-friendly web scraping projects that will help you build foundational skills.
### 1\. Weather Data Scraper
A real-time weather data scraper for weather.com extracts temperature, humidity, wind speed and precipitation forecasts. The project serves as an introduction to fundamental web scraping concepts including HTTP requests, HTML parsing, and error handling.
This beginner-friendly project demonstrates proper web scraping practices through practical application, with opportunities to expand into historical trend analysis and multi-location comparisons. The core focus is on DOM navigation, rate limiting implementation, and efficient data storage techniques.
**Learning objectives**:
- Understanding HTML structure and basic DOM elements
- Making HTTP requests
- Parsing simple HTML responses
- Handling basic error cases
**Proposed project steps**:
1. Set up your development environment:
- Install required libraries (requests, beautifulsoup4)
- Create a new Python script file
- Configure your IDE/editor
2. Analyze the weather website structure:
- Open browser developer tools (F12)
- Inspect HTML elements for weather data
- Document CSS selectors for key elements
- Check robots.txt for scraping permissions
3. Build the basic scraper structure:
- Create a WeatherScraper class
- Add methods for making HTTP requests
- Implement user agent rotation
- Add request delay functionality
4. Implement data extraction:
- Write methods to parse temperature
- Extract humidity percentage
- Get wind speed and direction
- Collect precipitation forecast
- Parse âfeels likeâ temperature
- Get weather condition description
5. Add error handling and validation:
- Implement request timeout handling
- Add retry logic for failed requests
- Validate extracted data types
- Handle missing data scenarios
- Log errors and exceptions
6. Create data storage functionality:
- Design CSV file structure
- Implement data cleaning
- Add timestamp to records
- Create append vs overwrite options
- Include location information
7. Test and refine:
- Test with multiple locations
- Verify data accuracy
- Optimize request patterns
- Add data validation checks
- Document known limitations
**Key concepts to learn**:
- HTTP requests and responses
- HTML parsing basics
- CSS selectors and HTML class/id attributes
- Data extraction patterns
- Basic error handling
**Website suggestions**:
- [weather.com](https://weather.com/) \- Main weather data source with comprehensive information
- [accuweather.com](https://accuweather.com/) \- Alternative source with detailed forecasts
- [weatherunderground.com](https://weatherunderground.com/) \- Community-driven weather data
- [openweathermap.org](https://openweathermap.org/) \- Free API available for learning
- [forecast.weather.gov](https://forecast.weather.gov/) \- Official US weather data source
### 2\. News Headlines Aggregator
A news headline aggregation system that pulls together breaking stories and trending content from multiple online news sources. The automated scraping engine visits major news websites on a schedule, extracting headlines, metadata, and key details into a unified data stream. The consolidated feed gives users a single interface to monitor news across publishers while handling the complexity of different site structures, update frequencies, and content formats behind the scenes.
**Learning Objectives**:
- Working with multiple data sources
- Handling different HTML structures
- Implementing proper delays between requests
- Basic data deduplication
**Project steps**:
1. Initial website selection and analysis
- Choose 2-3 news websites from suggested list
- Document each siteâs robots.txt rules
- Identify optimal request intervals
- Map out common headline patterns
- Note any access restrictions
2. HTML structure analysis
- Inspect headline container elements
- Document headline text selectors
- Locate timestamp information
- Find article category/section tags
- Map author and source attribution
- Identify image thumbnail locations
3. Data model design
- Define headline object structure
- Create schema for metadata fields
- Plan timestamp standardization
- Design category classification
- Structure source tracking fields
- Add URL and unique ID fields
4. Individual scraper development
- Build base scraper class
- Implement site-specific extractors
- Add request delay handling
- Include user-agent rotation
- Set up error logging
- Add data validation checks
5. Data processing and storage
- Implement text cleaning
- Normalize timestamps
- Remove duplicate headlines
- Filter unwanted content
- Create CSV/JSON export
- Set up incremental updates
6. Integration and testing
- Combine multiple scrapers
- Add master scheduler
- Test with different intervals
- Validate combined output
- Monitor performance
- Document limitations
**Key concepts to learn**:
- Rate limiting and polite scraping
- Working with multiple websites
- Text normalization
- Basic data structures for aggregation
- Time handling in Python
**Website suggestions**:
- [reuters.com](https://reuters.com/) \- Major international news agency
- [apnews.com](https://apnews.com/) \- Associated Press news wire service
- [bbc.com/news](https://bbc.com/news) \- International news coverage
- [theguardian.com](https://theguardian.com/) \- Global news with good HTML structure
- [aljazeera.com](https://aljazeera.com/) \- International perspective on news
### 3\. Book Price Tracker
Develop an automated price monitoring system that continuously scans multiple online bookstores to track price fluctuations for specific books. The tool will maintain a watchlist of titles, periodically check their current prices, and notify users when prices drop below certain thresholds or when significant discounts become available. This enables book enthusiasts to make cost-effective purchasing decisions by capitalizing on temporary price reductions across different retailers.
**Learning objectives**:
- Persistent data storage
- Price extraction and normalization
- Basic automation concepts
- Simple alert systems
**Project steps**:
1. Analyze target bookstores
- Research and select online bookstores to monitor
- Study website structures and price display patterns
- Document required headers and request parameters
- Test rate limits and access restrictions
2. Design data storage
- Create database tables for books and price history
- Define schema for watchlists and price thresholds
- Plan price tracking and comparison logic
- Set up automated backups
3. Build price extraction system
- Implement separate scrapers for each bookstore
- Extract prices, availability and seller info
- Handle different currencies and formats
- Add error handling and retries
- Validate extracted data
4. Implement automation
- Set up scheduled price checks
- Configure appropriate delays between requests
- Track successful/failed checks
- Implement retry logic for failures
- Monitor system performance
5. Add notification system
- Create price threshold triggers
- Set up email notifications
- Add price drop alerts
- Generate price history reports
- Allow customizable alert preferences
**Key concepts to learn**:
- Database basics (SQLite or similar)
- Regular expressions for price extraction
- Scheduling with Python
- Email notifications
- Data comparison logic
**Website suggestions**:
- [amazon.com](https://amazon.com/) \- Large selection and dynamic pricing
- [bookdepository.com](https://bookdepository.com/) \- International book retailer
- [barnesandnoble.com](https://barnesandnoble.com/) \- Major US book retailer
- [abebooks.com](https://abebooks.com/) \- Used and rare books marketplace
- [bookfinder.com](https://bookfinder.com/) \- Book price comparison site
### 4\. Recipe Collector
Build an automated recipe scraping tool that collects detailed cooking information from food websites. The system will extract comprehensive recipe data including ingredient lists with measurements, step-by-step preparation instructions, cooking durations, serving sizes, and nutritional facts. This tool enables home cooks to easily aggregate and organize recipes from multiple sources into a standardized format.
**Learning objectives**:
- Handling nested HTML structures
- Extracting structured data
- Text cleaning and normalization
- Working with lists and complex data types
**Project steps**:
1. Analyze recipe website structures
- Study HTML structure of target recipe sites
- Identify common patterns for recipe components
- Document CSS selectors and XPaths for key elements
- Map variations between different sites
2. Design a recipe data model
- Create database schema for recipes
- Define fields for ingredients, instructions, metadata
- Plan data types and relationships
- Add support for images and rich media
- Include tags and categories
3. Implement extraction logic for recipe components
- Build scrapers for each target website
- Extract recipe title and description
- Parse ingredient lists with quantities and units
- Capture step-by-step instructions
- Get cooking times and temperatures
- Collect serving size information
- Extract nutritional data
- Download recipe images
4. Clean and normalize extracted data
- Standardize ingredient measurements
- Convert temperature units
- Normalize cooking durations
- Clean up formatting and special characters
- Handle missing or incomplete data
- Validate data consistency
- Remove duplicate recipes
5. Store recipes in a structured format
- Save to SQL/NoSQL database
- Export options to JSON/YAML
- Generate printable recipe cards
- Add search and filtering capabilities
- Implement recipe categorization
- Create backup system
**Key concepts to learn**:
- Complex HTML navigation
- Data cleaning techniques
- JSON/YAML data formats
- Nested data structures
- Text processing
**Website suggestions**:
- [allrecipes.com](https://allrecipes.com/) \- Large recipe database
- [foodnetwork.com](https://foodnetwork.com/) \- Professional recipes
- [epicurious.com](https://epicurious.com/) \- Curated recipe collection
- [simplyrecipes.com](https://simplyrecipes.com/) \- Well-structured recipes
- [food.com](https://food.com/) \- User-submitted recipes
### 5\. Job Listing Monitor
Create an automated job search monitoring tool that continuously scans multiple job listing websites for new positions matching user-defined criteria. The tool will track key details like job titles, companies, locations, salaries, and requirements. Users can specify search filters such as keywords, experience level, job type (remote/hybrid/onsite), and salary range. The system will store listings in a database and notify users of new matches via email or other alerts. This helps job seekers stay on top of opportunities without manually checking multiple sites.
The tool can integrate with major job boards like LinkedIn, Indeed, Glassdoor and company career pages. It will handle different site structures, login requirements, and listing formats while respecting rate limits and terms of service. Advanced features could include sentiment analysis of job descriptions, automatic resume submission, and tracking application status across multiple positions.
**Learning objectives**:
- Working with search parameters
- Handling pagination
- Form submission
- Data filtering
**Project steps**:
1. Set up initial project structure and dependencies
- Create virtual environment
- Install required libraries
- Set up database (SQLite/PostgreSQL)
- Configure logging and error handling
- Set up email notification system
2. Implement site-specific scrapers
- Analyze HTML structure of each job board
- Handle authentication if required
- Create separate scraper classes for each site (one is enough if you are using Firecrawl)
- Implement rate limiting and rotating user agents
- Add proxy support for avoiding IP blocks
- Handle JavaScript-rendered content with Selenium (no need if you are using Firecrawl)
3. Build search parameter system
- Create configuration for search criteria
- Implement URL parameter generation
- Handle different parameter formats per site
- Add validation for search inputs
- Support multiple search profiles
- Implement location-based searching
4. Develop listing extraction logic
- Extract job details (title, company, location, etc)
- Parse salary information
- Clean and standardize data format
- Handle missing/incomplete data
- Extract application requirements
- Identify remote/hybrid/onsite status
- Parse required skills and experience
5. Create storage and monitoring system
- Design database schema
- Implement data deduplication
- Track listing history/changes
- Set up automated monitoring schedule
- Create email alert templates
- Build basic web interface for results
- Add export functionality
**Key concepts to learn**:
- URL parameters and query strings
- HTML forms and POST requests
- Pagination handling
- Data filtering techniques
- Incremental data updates
**Website suggestions**:
- [linkedin.com](https://linkedin.com/) \- Professional networking and job site
- [indeed.com](https://indeed.com/) \- Large job search engine
- [glassdoor.com](https://glassdoor.com/) \- Company reviews and job listings
- [monster.com](https://monster.com/) \- Global job search platform
- [dice.com](https://dice.com/) \- Technology job board
- [careerbuilder.com](https://careerbuilder.com/) \- Major US job site
## Intermediate Web Scraping Projects
These projects build upon basic scraping concepts and introduce more complex scenarios and techniques.
### 1\. E-commerce Price Comparison Tool
Build a sophisticated price comparison system monitoring major e-commerce platforms like Amazon, eBay, Walmart and Best Buy. The tool tracks products via SKUs and model numbers, scraping pricing data at configurable intervals. It normalizes data by mapping equivalent items and standardizing prices, shipping costs, and seller information across platforms.
A dashboard interface displays historical price trends, sends price drop alerts via email/SMS, and recommends optimal purchase timing based on seasonal patterns and historical lows. The system handles JavaScript-rendered content, dynamic AJAX requests, and anti-bot measures while maintaining data in both SQL and NoSQL stores.
Key technical challenges include managing product variants, currency conversion, and adapting to frequent site layout changes while ensuring data accuracy and consistency.
Read our separate guide on [building an Amazon price tracking application](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python) using Firecrawl for the basic version of this project.
**Learning objectives**:
- Multi-site data aggregation
- Price normalization techniques
- Advanced rate limiting
- Proxy rotation
- Database optimization
**Project steps**:
1. Design system architecture
- Plan database schema for products and prices
- Design API structure for data access
- Set up proxy management system
- Configure rate limiting rules
- Plan data update intervals
2. Implement core scraping functionality
- Create base scraper class
- Add proxy rotation mechanism
- Implement user agent rotation
- Set up request queuing
- Add retry logic
- Handle JavaScript rendering
- Configure session management
3. Build product matching system
- Implement product identification
- Create fuzzy matching algorithms
- Handle variant products
- Normalize product names
- Match product specifications
- Track product availability
4. Develop price analysis features
- Track historical prices
- Calculate price trends
- Identify price patterns
- Generate price alerts
- Create price prediction models
- Compare shipping costs
- Track discount patterns
5. Create reporting system
- Build price comparison reports
- Generate trend analysis
- Create price alert notifications
- Export data in multiple formats
- Schedule automated reports
- Track price history
**Key concepts to learn**:
- Advanced rate limiting
- Proxy management
- Product matching algorithms
- Price normalization
- Historical data tracking
**Website suggestions**:
- [amazon.com](https://amazon.com/) \- Large product database
- [walmart.com](https://walmart.com/) \- Major retailer
- [bestbuy.com](https://bestbuy.com/) \- Electronics focus
- [target.com](https://target.com/) \- Retail products
- [newegg.com](https://newegg.com/) \- Tech products
### 2\. Social Media Analytics Tool
Build a comprehensive social media analytics platform that combines web scraping, API integration, and real-time monitoring capabilities. The system will aggregate engagement metrics and content across major social networks, process JavaScript-heavy pages, and provide actionable insights through customizable dashboards. Key features include sentiment analysis of comments, competitive benchmarking, and automated trend detection. The tool emphasizes scalable data collection while respecting rate limits and platform terms of service.
**Learning objectives**:
- JavaScript rendering
- API integration
- Real-time monitoring
- Data visualization
- Engagement metrics analysis
**Project steps**:
1. Platform analysis and setup
- Research API limitations
- Document scraping restrictions
- Set up authentication
- Plan data collection strategy
- Configure monitoring intervals
2. Implement data collection
- Create platform-specific scrapers
- Handle JavaScript rendering
- Implement API calls
- Track rate limits
- Monitor API quotas
- Handle pagination
- Collect media content
3. Build analytics engine
- Calculate engagement rates
- Track follower growth
- Analyze posting patterns
- Monitor hashtag performance
- Measure audience interaction
- Generate sentiment analysis
- Track competitor metrics
4. Develop visualization system
- Create interactive dashboards
- Generate trend graphs
- Build comparison charts
- Display real-time metrics
- Create export options
- Generate automated reports
5. Add monitoring features
- Set up real-time tracking
- Create alert system
- Monitor competitor activity
- Track brand mentions
- Generate periodic reports
- Implement custom metrics
**Key concepts to learn**:
- API integration
- Real-time data collection
- Engagement metrics
- Data visualization
- JavaScript handling
**Website suggestions**:
- [twitter.com](https://twitter.com/) \- Real-time social updates
- [instagram.com](https://instagram.com/) \- Visual content platform
- [facebook.com](https://facebook.com/) \- Social networking
- [linkedin.com](https://linkedin.com/) \- Professional network
- [reddit.com](https://reddit.com/) \- Community discussions
### 3\. Real Estate Market Analyzer
Develop a comprehensive real estate market analysis tool that collects and analyzes property listings from multiple sources. The system will track prices, property features, market trends, and neighborhood statistics to provide insights into real estate market conditions. This project focuses on handling pagination, geographic data, and large datasets.
**Learning objectives**:
- Geographic data handling
- Advanced pagination
- Data relationships
- Market analysis
- Database optimization
**Project steps**:
1. Set up data collection framework
- Design database schema
- Configure geocoding system
- Set up mapping integration
- Plan data update frequency
- Configure backup system
2. Implement listing collection
- Create site-specific scrapers
- Handle dynamic loading
- Process pagination
- Extract property details
- Collect images and media
- Parse property features
- Handle location data
3. Build analysis system
- Calculate market trends
- Analyze price per square foot
- Track inventory levels
- Monitor days on market
- Compare neighborhood stats
- Generate market reports
- Create price predictions
4. Develop visualization tools
- Create interactive maps
- Build trend graphs
- Display comparative analysis
- Show market indicators
- Generate heat maps
- Create property reports
5. Add advanced features
- Implement search filters
- Add custom alerts
- Create watchlists
- Generate market reports
- Track favorite properties
- Monitor price changes
**Key concepts to learn**:
- Geographic data processing
- Complex pagination
- Data relationships
- Market analysis
- Mapping integration
**Website suggestions**:
- [zillow.com](https://zillow.com/) \- Real estate listings
- [realtor.com](https://realtor.com/) \- Property database
- [trulia.com](https://trulia.com/) \- Housing market data
- [redfin.com](https://redfin.com/) \- Real estate platform
- [homes.com](https://homes.com/) \- Property listings
### 4\. Academic Research Aggregator
Create a comprehensive academic research aggregator that collects scholarly articles, papers, and publications from multiple academic databases and repositories. The system will track research papers, citations, author information, and publication metrics to help researchers stay updated with the latest developments in their field.
**Learning objectives**:
- PDF parsing and extraction
- Citation network analysis
- Academic API integration
- Complex search parameters
- Large dataset management
**Project steps**:
1. Source identification and setup
- Research academic databases
- Document API access requirements
- Set up authentication systems
- Plan data collection strategy
- Configure access protocols
- Handle rate limitations
2. Implement data collection
- Create database-specific scrapers
- Handle PDF downloads
- Extract paper metadata
- Parse citations
- Track author information
- Collect publication dates
- Handle multiple languages
3. Build citation analysis system
- Track citation networks
- Calculate impact factors
- Analyze author networks
- Monitor research trends
- Generate citation graphs
- Track paper influence
- Identify key papers
4. Develop search and filtering
- Implement advanced search
- Add field-specific filters
- Create topic clustering
- Enable author tracking
- Support boolean queries
- Add relevance ranking
- Enable export options
5. Create visualization and reporting
- Generate citation networks
- Create author collaboration maps
- Display research trends
- Show topic evolution
- Create custom reports
- Enable data export
**Key concepts to learn**:
- PDF text extraction
- Network analysis
- Academic APIs
- Complex search logic
- Large-scale data processing
**Website suggestions**:
- [scholar.google.com](https://scholar.google.com/) \- Academic search engine
- [arxiv.org](https://arxiv.org/) \- Research paper repository
- [sciencedirect.com](https://sciencedirect.com/) \- Scientific publications
- [ieee.org](https://ieee.org/) \- Technical papers
- [pubmed.gov](https://pubmed.gov/) \- Medical research
### 5\. Financial Market Data Analyzer
Build a sophisticated financial market analysis tool that collects and processes data from multiple financial sources including stock markets, cryptocurrency exchanges, and forex platforms. The system will track prices, trading volumes, market indicators, and news sentiment to provide comprehensive market insights.
**Learning objectives**:
- Real-time data handling
- WebSocket connections
- Financial calculations
- Time series analysis
- News sentiment analysis
**Project steps**:
1. Data source integration
- Set up API connections
- Configure WebSocket feeds
- Implement rate limiting
- Handle authentication
- Manage data streams
- Plan backup sources
2. Market data collection
- Track price movements
- Monitor trading volume
- Calculate market indicators
- Record order book data
- Track market depth
- Handle multiple exchanges
- Process tick data
3. Build analysis engine
- Implement technical indicators
- Calculate market metrics
- Process trading signals
- Analyze price patterns
- Generate market alerts
- Track correlations
- Monitor volatility
4. Develop news analysis
- Collect financial news
- Process news sentiment
- Track market impact
- Monitor social media
- Analyze announcement effects
- Generate news alerts
5. Create visualization system
- Build price charts
- Display market indicators
- Show volume analysis
- Create correlation maps
- Generate trading signals
- Enable custom dashboards
**Key concepts to learn**:
- WebSocket programming
- Real-time data processing
- Financial calculations
- Market analysis
- News sentiment analysis
**Website suggestions**:
- [finance.yahoo.com](https://finance.yahoo.com/) \- Financial data
- [marketwatch.com](https://marketwatch.com/) \- Market news
- [investing.com](https://investing.com/) \- Trading data
- [tradingview.com](https://tradingview.com/) \- Technical analysis
- [coinmarketcap.com](https://coinmarketcap.com/) \- Crypto markets
## Advanced Web Scraping Projects
These projects represent complex, production-grade applications that combine multiple advanced concepts and require sophisticated architecture decisions. Theyâre ideal for developers who have mastered basic and intermediate scraping techniques.
### 1\. Multi-threaded News Aggregator
Build an enterprise-grade news aggregation system that uses concurrent processing to efficiently collect and analyze news from hundreds of sources simultaneously. The system will handle rate limiting, proxy rotation, and load balancing while maintaining high throughput and data accuracy. This project focuses on scalability and performance optimization.
**Learning objectives**:
- Concurrent programming
- Thread/Process management
- Queue systems
- Load balancing
- Performance optimization
**Project steps**:
1. Design concurrent architecture
- Plan threading strategy
- Design queue system
- Configure worker pools
- Set up load balancing
- Plan error handling
- Implement logging system
- Design monitoring tools
2. Build core scraping engine
- Create worker threads
- Implement task queue
- Set up proxy rotation
- Handle rate limiting
- Manage session pools
- Configure retries
- Monitor performance
3. Develop content processing
- Implement NLP analysis
- Extract key information
- Classify content
- Detect duplicates
- Process media content
- Handle multiple languages
- Generate summaries
4. Create storage and indexing
- Design database sharding
- Implement caching
- Set up search indexing
- Manage data retention
- Handle data validation
- Configure backups
- Optimize queries
5. Build monitoring system
- Track worker status
- Monitor queue health
- Measure throughput
- Track error rates
- Generate alerts
- Create dashboards
- Log performance metrics
**Key concepts to learn**:
- Thread synchronization
- Queue management
- Resource pooling
- Performance monitoring
- System optimization
**Website suggestions**:
- [reuters.com](https://reuters.com/) \- International news
- [apnews.com](https://apnews.com/) \- News wire service
- [bloomberg.com](https://bloomberg.com/) \- Financial news
- [nytimes.com](https://nytimes.com/) \- News articles
- [wsj.com](https://wsj.com/) \- Business news
### 2\. Distributed Web Archive System
Build a distributed web archiving system that preserves historical versions of websites across a network of nodes. The system will handle massive-scale crawling, content deduplication, versioning, and provide a searchable interface to access archived content. Think of it as building your own Internet Archive Wayback Machine with distributed architecture.
**Learning objectives**:
- Distributed systems architecture
- Content-addressable storage
- Version control concepts
- Distributed crawling
- Large-scale search
**Project steps**:
1. Design distributed architecture
- Plan node communication
- Design content addressing
- Configure storage sharding
- Implement consensus protocol
- Set up service discovery
- Plan failure recovery
- Design replication strategy
2. Build core archiving engine
- Implement snapshot system
- Handle resource capturing
- Process embedded content
- Manage asset dependencies
- Create versioning system
- Handle redirects
- Implement diff detection
3. Develop distributed crawler
- Create crawler nodes
- Implement work distribution
- Handle URL deduplication
- Manage crawl frontiers
- Process robots.txt
- Configure politeness rules
- Monitor node health
4. Create storage and indexing
- Implement content hashing
- Build merkle trees
- Create delta storage
- Set up distributed index
- Handle data replication
- Manage storage quotas
- Optimize retrieval
5. Build access interface
- Create temporal navigation
- Implement diff viewing
- Enable full-text search
- Build API endpoints
- Create admin dashboard
- Enable export options
- Handle access control
**Key concepts to learn**:
- Distributed systems
- Content addressing
- Merkle trees
- Consensus protocols
- Temporal data models
**Technical requirements**:
- Distributed database (e.g., Cassandra)
- Message queue system (e.g., Kafka)
- Search engine (e.g., Elasticsearch)
- Content-addressable storage
- Load balancers
- Service mesh
- Monitoring system
**Advanced features**:
- Temporal graph analysis
- Content change detection
- Link integrity verification
- Resource deduplication
- Distributed consensus
- Automated preservation
- Access control policies
This project combines distributed systems concepts with web archiving challenges, requiring deep understanding of both scalable architecture and content preservation techniques. Itâs particularly relevant for organizations needing to maintain compliant records of web content or researchers studying web evolution patterns.
### 3\. Automated Market Research Tool
Create a comprehensive market research platform that combines web scraping, data analysis, and automated reporting to provide competitive intelligence and market insights. The system will track competitors, analyze market trends, and generate detailed reports automatically.
**Learning objectives**:
- Large-scale data collection
- Advanced analytics
- Automated reporting
- Competitive analysis
- Market intelligence
**Project steps**:
1. Design research framework
- Define data sources
- Plan collection strategy
- Design analysis pipeline
- Configure reporting system
- Set up monitoring
- Plan data storage
- Configure backup systems
2. Implement data collection
- Create source scrapers
- Handle authentication
- Manage rate limits
- Process structured data
- Extract unstructured content
- Track changes
- Validate data quality
3. Build analysis engine
- Process market data
- Analyze trends
- Track competitors
- Generate insights
- Calculate metrics
- Identify patterns
- Create predictions
4. Develop reporting system
- Generate automated reports
- Create visualizations
- Build interactive dashboards
- Enable customization
- Schedule updates
- Handle distribution
- Track engagement
5. Add intelligence features
- Implement trend detection
- Create alerts system
- Enable custom analysis
- Build recommendation engine
- Generate insights
- Track KPIs
- Monitor competition
**Key concepts to learn**:
- Market analysis
- Report automation
- Data visualization
- Competitive intelligence
- Trend analysis
**Website suggestions**:
- Company websites
- Industry news sites
- Government databases
- Social media platforms
- Review sites
### 4\. Competitive Intelligence Dashboard
Build a real-time competitive intelligence platform that monitors competitor activities across multiple channels including websites, social media, and news sources. The system will provide automated alerts and analysis of competitive movements in the market.
**Learning objectives**:
- Real-time monitoring
- Complex automation
- Data warehousing
- Dashboard development
- Alert systems
**Project steps**:
1. Set up monitoring system
- Configure data sources
- Set up real-time tracking
- Implement change detection
- Design alert system
- Plan data storage
- Configure monitoring rules
- Handle authentication
2. Build data collection
- Create source scrapers
- Handle dynamic content
- Process structured data
- Extract unstructured content
- Track changes
- Monitor social media
- Collect news mentions
3. Develop analysis engine
- Process competitor data
- Analyze market position
- Track product changes
- Monitor pricing
- Analyze marketing
- Track customer sentiment
- Generate insights
4. Create dashboard interface
- Build real-time displays
- Create interactive charts
- Enable custom views
- Implement filtering
- Add search functionality
- Enable data export
- Configure alerts
5. Implement alert system
- Set up notification rules
- Create custom triggers
- Handle priority levels
- Enable user preferences
- Track alert history
- Generate summaries
- Monitor effectiveness
**Key concepts to learn**:
- Real-time monitoring
- Change detection
- Alert systems
- Dashboard design
- Competitive analysis
**Website suggestions**:
- Competitor websites
- Social media platforms
- News aggregators
- Review sites
- Industry forums
### 5\. Full-Stack Scraping Platform
Develop a complete web scraping platform with a user interface that allows non-technical users to create and manage scraping tasks. The system will include visual scraping tools, scheduling, monitoring, and data export capabilities.
**Learning objectives**:
- Full-stack development
- API design
- Frontend development
- System architecture
- User management
**Project steps**:
1. Design system architecture
- Plan component structure
- Design API endpoints
- Configure databases
- Set up authentication
- Plan scaling strategy
- Design monitoring
- Configure deployment
2. Build backend system
- Create API endpoints
- Implement authentication
- Handle task management
- Process scheduling
- Manage user data
- Handle file storage
- Configure security
3. Develop scraping engine
- Create scraper framework
- Handle different sites
- Manage sessions
- Process rate limits
- Handle errors
- Validate data
- Monitor performance
4. Create frontend interface
- Build user dashboard
- Create task manager
- Implement scheduling
- Show monitoring data
- Enable configuration
- Handle data export
- Display results
5. Add advanced features
- Visual scraper builder
- Template system
- Export options
- Notification system
- User management
- Usage analytics
- API documentation
**Key concepts to learn**:
- System architecture
- API development
- Frontend frameworks
- User management
- Deployment
**Website suggestions**:
- Any website (platform should be generic)
- Test sites for development
- Documentation resources
- API references
- Example targets
## Conclusion
Web scraping is a powerful skill that opens up endless possibilities for data collection and analysis. Through these 15 projects, ranging from basic weather scrapers to advanced AI-powered content extraction systems, youâve seen how web scraping can be applied to solve real-world problems across different domains.
Key takeaways from these projects include:
- Start with simpler projects to build foundational skills
- Progress gradually to more complex architectures
- Focus on ethical scraping practices and website policies
- Use appropriate tools based on project requirements
- Implement proper error handling and data validation
- Consider scalability and maintenance from the start
Whether youâre building a simple price tracker or a full-scale market intelligence platform, the principles and techniques covered in these projects will serve as a solid foundation for your web scraping journey. Remember to always check robots.txt files, implement appropriate delays, and respect website terms of service while scraping.
For your next steps, pick a project that aligns with your current skill level and start building. The best way to learn web scraping is through hands-on practice and real-world applications. As you gain confidence, gradually tackle more complex projects and keep exploring new tools and techniques in this ever-evolving field.
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Bex Tuychiev@bextuychiev](https://x.com/bextuychiev)
Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics
### More articles by Bex Tuychiev
[Building an Automated Price Tracking Tool\\
\\
Build an automated e-commerce price tracker in Python. Learn web scraping, price monitoring, and automated alerts using Firecrawl, Streamlit, PostgreSQL.](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python) [Web Scraping Automation: How to Run Scrapers on a Schedule\\
\\
Learn how to automate web scraping in Python using free tools like schedule, asyncio, cron jobs and GitHub Actions. This comprehensive guide covers local and cloud-based scheduling methods to run scrapers reliably in 2025.](https://www.firecrawl.dev/blog/automated-web-scraping-free-2025) [Automated Data Collection - A Comprehensive Guide\\
\\
Learn how to build robust automated data collection systems using modern tools and best practices. This guide covers everything from selecting the right tools to implementing scalable collection pipelines.](https://www.firecrawl.dev/blog/automated-data-collection-guide) [BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python\\
\\
Learn the key differences between BeautifulSoup4 and Scrapy for web scraping in Python. Compare their features, performance, and use cases to choose the right tool for your web scraping needs.](https://www.firecrawl.dev/blog/beautifulsoup4-vs-scrapy-comparison) [How to Build an Automated Competitor Price Monitoring System with Python\\
\\
Learn how to build an automated competitor price monitoring system in Python that tracks prices across e-commerce sites, provides real-time comparisons, and maintains price history using Firecrawl, Streamlit, and GitHub Actions.](https://www.firecrawl.dev/blog/automated-competitor-price-scraping) [Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude\\
\\
Learn how to build a web scraper in Python that gathers company details, funding rounds, and investor information from public sources like Crunchbase using Firecrawl and Claude for automated data collection and analysis.](https://www.firecrawl.dev/blog/crunchbase-scraping-with-firecrawl-claude) [How to Create Custom Instruction Datasets for LLM Fine-tuning\\
\\
Learn how to build high-quality instruction datasets for fine-tuning large language models (LLMs). This guide covers when to create custom datasets, best practices for data collection and curation, and a practical example of building a code documentation dataset.](https://www.firecrawl.dev/blog/custom-instruction-datasets-llm-fine-tuning) [Data Enrichment: A Complete Guide to Enhancing Your Data Quality\\
\\
Learn how to enrich your data quality with a comprehensive guide covering data enrichment tools, best practices, and real-world examples. Discover how to leverage modern solutions like Firecrawl to automate data collection, validation, and integration for better business insights.](https://www.firecrawl.dev/blog/complete-guide-to-data-enrichment)
## Mastering Firecrawl Scrape API
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Nov 25, 2024
â˘
[Bex Tuychiev](https://x.com/bextuychiev)
# How to Use Firecrawl's Scrape API: Complete Web Scraping Tutorial

## Getting Started with Modern Web Scraping: An Introduction
Traditional web scraping offers unique challenges. Relevant information is often scattered across multiple pages containing complex elements like code blocks, iframes, and media. JavaScript-heavy websites and authentication requirements add additional complexity to the scraping process.
Even after successfully scraping, the content requires specific formatting to be useful for downstream processes like data engineering or training AI and machine learning models.
Firecrawl addresses these challenges by providing a specialized scraping solution. Its [`/scrape` endpoint](https://docs.firecrawl.dev/features/scrape) offers features like JavaScript rendering, automatic content extraction, bypassing blockers and flexible output formats that make it easier to collect high-quality information and training data at scale.
In this guide, weâll explore how to effectively use Firecrawlâs `/scrape` endpoint to extract structured data from static and dynamic websites. Weâll start with basic scraping setup and then dive into a real-world example of scraping weather data from weather.com, demonstrating how to handle JavaScript-based interactions, extract structured data using schemas, and capture screenshots during the scraping process.
## Table of Contents
- [Getting Started with Modern Web Scraping: An Introduction](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint#getting-started-with-modern-web-scraping-an-introduction)
- [What Is Firecrawlâs `/scrape` Endpoint? The Short Answer](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint#what-is-firecrawls-scrape-endpoint-the-short-answer)
- [Prerequisites: Setting Up Firecrawl](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint#prerequisites-setting-up-firecrawl)
- [Basic Scraping Setup](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint#basic-scraping-setup)
- [Large-scale Scraping With Batch Operations](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint#large-scale-scraping-with-batch-operations)
- [Batch Scraping with `batch_scrape_urls`](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint#batch-scraping-with-batch_scrape_urls)
- [Asynchronous batch scraping with `async_batch_scrape_urls`](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint#asynchronous-batch-scraping-with-async_batch_scrape_urls)
- [How to Scrape Dynamic JavaScript Websites](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint#how-to-scrape-dynamic-javascript-websites)
- [Conclusion](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint#conclusion)
## What Is Firecrawlâs `/scrape` Endpoint? The Short Answer
The `/scrape` endpoint is Firecrawlâs core web scraping API that enables automated extraction of content from any webpage. It handles common web scraping challenges like:
- JavaScript rendering - Executes JavaScript to capture dynamically loaded content
- Content extraction - Automatically identifies and extracts main content while filtering out noise
- Format conversion - Converts HTML to clean formats like Markdown or structured JSON
- Screenshot capture - Takes full or partial page screenshots during scraping
- Browser automation - Supports clicking, typing and other browser interactions
- Anti-bot bypass - Uses rotating proxies and browser fingerprinting to avoid blocks
The endpoint accepts a URL and configuration parameters, then returns the scraped content in your desired format. Itâs designed to be flexible enough for both simple static page scraping and complex dynamic site automation.
Now that we understand what the endpoint does at a high level, letâs look at how to set it up and start using it in practice.
## Prerequisites: Setting Up Firecrawl
Firecrawlâs scraping engine is exposed as a REST API, so you can use command-line tools like cURL to use it. However, for a more comfortable experience, better flexibility and control, I recommend using one of its SDKs for Python, Node, Rust or Go. This tutorial will focus on the Python version.
To get started, please make sure to:
1. Sign up at [firecrawl.dev](https://www.firecrawl.dev/).
2. Choose a plan (the free one will work fine for this tutorial).
Once you sign up, you will be given an API token which you can copy from your [dashboard](https://www.firecrawl.dev/app). The best way to save your key is by using a `.env` file, ideal for the purposes of this article:
```bash
touch .env
echo "FIRECRAWL_API_KEY='YOUR_API_KEY'" >> .env
```
Now, letâs install Firecrawl Python SDK, `python-dotenv` to read `.env` files, and Pandas for data analysis later:
```bash
pip install firecrawl-py python-dotenv pandas
```
## Basic Scraping Setup
Scraping with Firecrawl starts by creating an instance of the `FirecrawlApp` class:
```python
from firecrawl import FirecrawlApp
from dotenv import load_dotenv
load_dotenv()
app = FirecrawlApp()
```
When you use the `load_dotenv()` function, the app can automatically use your loaded API key to establish a connection with the scraping engine. Then, scraping any URL takes a single line of code:
```python
url = "https://arxiv.org"
data = app.scrape_url(url)
```
Letâs take a look at the response format returned by `scrape_url` method:
```python
data['metadata']
```
```json
{
"title": "arXiv.org e-Print archiveopen searchopen navigation menucontact arXivsubscribe to arXiv mailings",
"language": "en",
"ogLocaleAlternate": [],
"viewport": "width=device-width, initial-scale=1",
"msapplication-TileColor": "#da532c",
"theme-color": "#ffffff",
"sourceURL": "[https://arxiv.org](https://arxiv.org)",
"url": "[https://arxiv.org/](https://arxiv.org/)",
"statusCode": 200
}
```
The response `metadata` includes basic information like the page title, viewport settings and a status code.
Now, letâs look at the scraped contents, which is converted into `markdown` by default:
```python
from IPython.display import Markdown
Markdown(data['markdown'][:500])
```
```text
arXiv is a free distribution service and an open-access archive for nearly 2.4 million
scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.
Materials on this site are not peer-reviewed by arXiv.
Subject search and browse:
Physics
Mathematics
Quantitative Biology
Computer Science
Quantitative Finance
Statistics
Electrical Engineering and Systems Scienc
```
The response can include several other formats that we can request when scraping a URL. Letâs try requesting multiple formats at once to see what additional data we can get back:
```python
data = app.scrape_url(
url,
params={
'formats': [\
'html',\
'rawHtml',\
'links',\
'screenshot',\
]
}
)
```
Here is what these formats scrape:
- **HTML**: The raw HTML content of the page.
- **rawHtml**: The unprocessed HTML content, exactly as it appears on the page.
- **links**: A list of all the hyperlinks found on the page.
- **screenshot**: An image capture of the page as it appears in a browser.
The HTML format is useful for developers who need to analyze or manipulate the raw structure of a webpage. The `rawHtml` format is ideal for cases where the exact original HTML content is required, such as for archival purposes or detailed comparison. The links format is beneficial for SEO specialists and web crawlers who need to extract and analyze all hyperlinks on a page. The screenshot format is perfect for visual documentation, quality assurance, and capturing the appearance of a webpage at a specific point in time.
Passing more than one scraping format to `params` adds additional keys to the response:
```python
data.keys()
```
```text
dict_keys(['rawHtml', 'screenshot', 'metadata', 'html', 'links'])
```
Letâs display the screenshot Firecrawl took of arXiv.org:
```python
from IPython.display import Image
Image(data['screenshot'])
```

Notice how the screenshot is cropped to fit a certain viewport. For most pages, it is better to capture the entire screen by using the `screenshot@fullPage` format:
```python
data = app.scrape_url(
url,
params={
"formats": [\
"screenshot@fullPage",\
]
}
)
Image(data['screenshot'])
```

As a bonus, the `/scrape` endpoint can handle PDF links as well:
```python
pdf_link = "https://arxiv.org/pdf/2411.09833.pdf"
data = app.scrape_url(pdf_link)
Markdown(data['markdown'][:500])
```
```text
arXiv:2411.09833v1 \[math.DG\] 14 Nov 2024
EINSTEIN METRICS ON THE FULL FLAG F(N).
MIKHAIL R. GUZMAN
Abstract.LetM=G/Kbe a full flag manifold. In this work, we investigate theG-
stability of Einstein metrics onMand analyze their stability types, including coindices,
for several cases. We specifically focus onF(n) = SU(n)/T, emphasizingn= 5, where
we identify four new Einstein metrics in addition to known ones. Stability data, including
coindex and Hessian spectrum, confirms that these metrics on
```
### Further Scrape Configuration Options
By default, `scrape_url` converts everything it sees on a webpage to one of the specified formats. To control this behavior, Firecrawl offers the following parameters:
- `onlyMainContent`
- `includeTags`
- `excludeTags`
`onlyMainContent` excludes the navigation, footers, headers, etc. and is set to True by default.
`includeTags` and `excludeTags` can be used to allowlist/blocklist certain HTML elements:
```python
url = "https://arxiv.org"
data = app.scrape_url(url, params={"includeTags": ["p"], "excludeTags": ["span"]})
Markdown(data['markdown'][:1000])
```
```markdown
[Help](https://info.arxiv.org/help) \| [Advanced Search](https://arxiv.org/search/advanced)
arXiv is a free distribution service and an open-access archive for nearly 2.4 million
scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.
Materials on this site are not peer-reviewed by arXiv.
[arXiv Operational Status](https://status.arxiv.org)
Get status notifications via
[email](https://subscribe.sorryapp.com/24846f03/email/new)
or [slack](https://subscribe.sorryapp.com/24846f03/slack/new)
```
`includeTags` and `excludeTags` also support referring to HTML elements by their `#id` or `.class-name`.
These configuration options help ensure efficient and precise scraping. While `onlyMainContent` filters out peripheral elements, `includeTags` and `excludeTags` enable surgical targeting of specific HTML elements - particularly valuable when dealing with complex webpage structures or when only certain content types are needed.
## Advanced Data Extraction: Structured Techniques
Scraping clean, LLM-ready data is the core philosophy of Firecrawl. However, certain web pages with their complex structures can interfere with this philosophy when scraped in their entirety. For this reason, Firecrawl offers two scraping methods for better structured outputs:
1. Natural language extraction - Use prompts to extract specific information and have an LLM structure the response
2. Manual structured data extraction - Define JSON schemas to have an LLM scrape data in a predefined format
In this section, we will cover both methods.
### Natural Language Extraction - Use AI to Extract Data
To illustrate natural language scraping, letâs try extracting all news article links that may be related to the 2024 US presidential election from the New York Times:
```python
url = "https://nytimes.com"
data = app.scrape_url(
url,
params={
'formats': ['markdown', 'extract', 'screenshot'],
'extract': {
'prompt': "Return a list of links of news articles that may be about the 2024 US presidential election"
}
}
)
```
To enable this feature, you are required to pass the `extract` option to the list of `formats` and provide a prompt in a dictionary to a separate `extract` field.
Once scraping finishes, the response will include a new `extract` key:
```python
data['extract']
```
```python
{'news_articles': [{'title': 'Harris Loss Has Democrats Fighting Over How to Talk About Transgender Rights',\
'link': 'https://www.nytimes.com/2024/11/20/us/politics/presidential-campaign-transgender-rights.html'},\
{'title': 'As Democrats Question How to Win Back Latinos, Ruben Gallego Offers Answers',\
'link': 'https://www.nytimes.com/2024/11/20/us/politics/ruben-gallego-arizona-latino-voters-democrats.html'},\
...\
{'title': 'The Final Push for Ukraine?',\
'link': 'https://www.nytimes.com/2024/11/20/briefing/ukraine-russia-trump.html'}]}
```
Due to the nature of this scraping method, the returned output can have arbitrary structure as we can see above. It seems the above output has the following format:
```python
{
"news_articles": [\
{"title": "article_title", "link": "article_url"},\
...\
]
}
```
This LLM-based extraction can have endless applications, from extracting specific data points from complex websites to analyzing sentiment across multiple news sources to gathering structured information from unstructured web content.
To improve the accuracy of the extraction and give additional instructions, you have the option to include a system prompt to the underlying LLM:
```python
data = app.scrape_url(
url,
params={
'formats': ['markdown', 'extract'],
'extract': {
'prompt': "Find any mentions of specific dollar amounts or financial figures and return them with their context and article link.",
'systemPrompt': "You are a helpful assistant that extracts numerical financial data."
}
}
)
```
Above, we are dictating that the LLM must act as an assistant that extracts numerical financial data. Letâs look at its response:
```python
data['extract']
```
```python
{'financial_data': [\
{\
'amount': 121200000,\
'context': 'RenĂŠ Magritte became the 16th artist whose work broke the nine-figure '\
'threshold at auction when his painting sold for $121.2 million.',\
'article_link': 'https://www.nytimes.com/2024/11/19/arts/design/magritte-surrealism-christies-auction.html'\
},\
{\
'amount': 5000000,\
'context': 'Benjamin Netanyahu offers $5 million for each hostage freed in Gaza.',\
'article_link': 'https://www.nytimes.com/2024/11/19/world/middleeast/israel-5-million-dollars-hostage.html'\
}\
]}
```
The output shows the LLM successfully extracted two financial data points from the articles.
The LLM not only identified the specific amounts but also provided relevant context and source article links for each figure.
### Schema-Based Data Extraction - Building Structured Models
While natural language scraping is powerful for exploration and prototyping, production systems typically require more structured and deterministic approaches. LLM responses can vary between runs of the same prompt, making the output format inconsistent and difficult to reliably parse in automated workflows.
For this reason, Firecrawl allows you to pass a predefined schema to guide the LLMâs output when transforming the scraped content. To facilitate this feature, Firecrawl uses Pydantic models.
In the example below, we will extract only news article links, their titles with some additional details from the New York Times:
```python
from pydantic import BaseModel, Field
class IndividualArticle(BaseModel):
title: str = Field(description="The title of the news article")
subtitle: str = Field(description="The subtitle of the news article")
url: str = Field(description="The URL of the news article")
author: str = Field(description="The author of the news article")
date: str = Field(description="The date the news article was published")
read_duration: int = Field(description="The estimated time it takes to read the news article")
topics: list[str] = Field(description="A list of topics the news article is about")
class NewsArticlesSchema(BaseModel):
news_articles: list[IndividualArticle] = Field(
description="A list of news articles extracted from the page"
)
```
Above, we define a Pydantic schema that specifies the structure of the data we want to extract. The schema consists of two models:
`IndividualArticle` defines the structure for individual news articles with fields for:
- `title`
- `subtitle`
- `url`
- `author`
- `date`
- `read_duration`
- `topics`
`NewsArticlesSchema` acts as a container model that holds a list of `IndividualArticle` objects, representing multiple articles extracted from the page. If we donât use this container model, Firecrawl will only return the first news article it finds.
Each model field uses Pydanticâs `Field` class to provide descriptions that help guide the LLM in correctly identifying and extracting the requested data. This structured approach ensures consistent output formatting.
The next step is passing this schema to the `extract` parameter of `scrape_url`:
```python
url = "https://nytimes.com"
structured_data = app.scrape_url(
url,
params={
"formats": ["extract", "screenshot"],
"extract": {
"schema": NewsArticlesSchema.model_json_schema(),
"prompt": "Extract the following data from the NY Times homepage: news article title, url, author, date, read_duration for all news articles",
"systemPrompt": "You are a helpful assistant that extracts news article data from NY Times.",
},
},
)
```
While passing the schema, we call its `model_json_schema()` method to automatically convert it to valid JSON. Letâs look at the output:
```python
structured_data['extract']
```
```python
{
'news_articles': [\
{\
'title': 'How Google Spent 15 Years Creating a Culture of Concealment',\
'subtitle': '',\
'url': 'https://www.nytimes.com/2024/11/20/technology/google-antitrust-employee-messages.html',\
'author': 'David Streitfeld',\
'date': '2024-11-20',\
'read_duration': 9,\
'topics': []\
},\
# ... additional articles ...\
{\
'title': 'The Reintroduction of Daniel Craig',\
'subtitle': '',\
'url': 'https://www.nytimes.com/2024/11/20/movies/daniel-craig-queer.html',\
'author': '',\
'date': '2024-11-20',\
'read_duration': 9,\
'topics': []\
}\
]
}
```
This time, the response fields exactly match the fields we set during schema definition:
```python
{
"news_articles": [\
{...}, # Article 1\
{...}, # Article 2,\
... # Article n\
]
}
```
When creating the scraping schema, the following best practices can go a long way in ensuring reliable and accurate data extraction:
1. Keep field names simple and descriptive
2. Use clear field descriptions that guide the LLM
3. Break complex data into smaller, focused fields
4. Include validation rules where possible
5. Consider making optional fields that may not always be present
6. Test the schema with a variety of content examples
7. Iterate and refine based on extraction results
To follow these best practices, the following Pydantic tips can help:
1. Use `Field(default=None)` to make fields optional
2. Add validation with `Field(min_length=1, max_length=100)`
3. Create custom validators with @validator decorator
4. Use `conlist()` for list fields with constraints
5. Add example values with `Field(example="Sample text")`
6. Create nested models for complex data structures
7. Use computed fields with `@property` decorator
If you follow all these tips, your schema can become quite sophisticated like below:
```python
from pydantic import BaseModel, Field
from typing import Optional, List
from datetime import datetime
class Author(BaseModel):
# Required field - must be provided when creating an Author
name: str = Field(
...,
min_length=1,
max_length=100,
description="The full name of the article author",
)
# Optional field - can be None or omitted
title: Optional[str] = Field(
None, description="Author's title or role, if available"
)
class NewsArticle(BaseModel):
# Required field - must be provided when creating a NewsArticle
title: str = Field(
...,
min_length=5,
max_length=300,
description="The main headline or title of the news article",
example="Breaking News: Major Scientific Discovery",
)
# Required field - must be provided when creating a NewsArticle
url: str = Field(
...,
description="The full URL of the article",
example="https://www.nytimes.com/2024/01/01/science/discovery.html",
)
# Optional field - can be None or omitted
authors: Optional[List[Author]] = Field(
default=None, description="List of article authors and their details"
)
# Optional field - can be None or omitted
publish_date: Optional[datetime] = Field(
default=None, description="When the article was published"
)
# Optional field with default empty list
financial_amounts: List[float] = Field(
default_factory=list,
max_length=10,
description="Any monetary amounts mentioned in the article in USD",
)
@property
def is_recent(self) -> bool:
if not self.publish_date:
return False
return (datetime.now() - self.publish_date).days < 7
```
The schema above defines two key data models for news article data:
Author - Represents article author information with:
- `name` (required): The authorâs full name
- `title` (optional): The authorâs role or title
NewsArticle - Represents a news article with:
- `title` (required): The article headline (5-300 chars)
- `url` (required): Full article URL
- `authors` (optional): List of Author objects
- `publish_date` (optional): Article publication datetime
- `financial_amounts` (optional): List of monetary amounts in USD
The `NewsArticle` model includes an `is_recent` property that checks if the article was published within the last 7 days.
As you can see, web scraping process becomes much easier and more powerful if you combine it with structured data models that validate and organize the scraped information. This allows for consistent data formats, type checking, and easy access to properties like checking if an article is recent.
## Large-scale Scraping With Batch Operations
Up to this point, we have been focusing on scraping pages one URL at a time. In reality, you will work with multiple, perhaps, thousands of URLs that need to be scraped in parallel. This is where batch operations become essential for efficient web scraping at scale. Batch operations allow you to process multiple URLs simultaneously, significantly reducing the overall time needed to collect data from multiple web pages.
### Batch Scraping with `batch_scrape_urls`
The `batch_scrape_urls` method lets you scrape multiple URLs at once.
Letâs scrape all the news article links we obtained from our previous schema extraction example.
```python
articles = structured_data['extract']['news_articles']
article_links = [article['url'] for article in articles]
class ArticleSummary(BaseModel):
title: str = Field(description="The title of the news article")
summary: str = Field(description="A short summary of the news article")
batch_data = app.batch_scrape_urls(article_links, params={
"formats": ["extract"],
"extract": {
"schema": ArticleSummary.model_json_schema(),
"prompt": "Extract the title of the news article and generate its brief summary",
}
})
```
Here is what is happening in the codeblock above:
- We extract the list of news articles from our previous structured data result
- We create a list of article URLs by mapping over the articles and getting their âurlâ field
- We define an `ArticleSummary` model with title and summary fields to structure our output
- We use `batch_scrape_urls()` to process all article URLs in parallel, configuring it to:
- Extract data in structured format
- Use our `ArticleSummary` schema
- Generate titles and summaries based on the article content
The response from `batch_scrape_urls()` is a bit different:
```python
batch_data.keys()
```
```python
dict_keys(['success', 'status', 'completed', 'total', 'creditsUsed', 'expiresAt', 'data'])
```
It contains the following fields:
- `success`: Boolean indicating if the batch request succeeded
- `status`: Current status of the batch job
- `completed`: Number of URLs processed so far
- `total`: Total number of URLs in the batch
- `creditsUsed`: Number of API credits consumed
- `expiresAt`: When the results will expire
- `data`: The extracted data for each URL
Letâs focus on the `data` key where the actual content is stored:
```python
len(batch_data['data'])
```
```out
19
```
The batch processing completed successfully with 19 articles. Letâs examine the structure of the first article:
```python
batch_data['data'][0].keys()
```
````out
dict_keys(['extract', 'metadata'])
The response format here matches what we get from individual `scrape_url` calls.
```python
print(batch_data['data'][0]['extract'])
````
```out
{'title': 'Ukrainian Forces Face Increasing Challenges Amidst Harsh Winter Conditions', 'summary': 'As the war in Ukraine enters its fourth winter, conditions are worsening for Ukrainian soldiers who find themselves trapped on the battlefield, surrounded by Russian forces. Military commanders express concerns over dwindling supplies and increasingly tough situations. The U.S. has recently allowed Ukraine to use American weapons for deeper strikes into Russia, marking a significant development in the ongoing conflict.'}
```
The scraping was performed according to our specifications, extracting the metadata, the title and generating a brief summary.
### Asynchronous batch scraping with `async_batch_scrape_urls`
Scraping the 19 NY Times articles in a batch took about 10 seconds on my machine. While thatâs not much, in practice, we cannot wait around as Firecrawl batch-scrapes thousands of URLs. For these larger workloads, Firecrawl provides an asynchronous batch scraping API that lets you submit jobs and check their status later, rather than blocking until completion. This is especially useful when integrating web scraping into automated workflows or processing large URL lists.
This feature is available through the `async_batch_scrape_urls` method and it works a bit differently:
```python
batch_scrape_job = app.async_batch_scrape_urls(
article_links,
params={
"formats": ["extract"],
"extract": {
"schema": ArticleSummary.model_json_schema(),
"prompt": "Extract the title of the news article and generate its brief summary",
},
},
)
```
When using `async_batch_scrape_urls` instead of the synchronous version, the response comes back immediately rather than waiting for all URLs to be scraped. This allows the program to continue executing while the scraping happens in the background.
```python
batch_scrape_job
```
```python
{'success': True,
'id': '77a94b62-c676-4db2-b61b-4681e99f4704',
'url': 'https://api.firecrawl.dev/v1/batch/scrape/77a94b62-c676-4db2-b61b-4681e99f4704'}
```
The response contains an ID belonging the background task that was initiated to process the URLs under the hood.
You can use this ID later to check the jobâs status with `check_batch_scrape_status` method:
```python
batch_scrape_job_status = app.check_batch_scrape_status(batch_scrape_job['id'])
batch_scrape_job_status.keys()
```
```python
dict_keys(['success', 'status', 'total', 'completed', 'creditsUsed', 'expiresAt', 'data', 'error', 'next'])
```
If the job finished scraping all URLs, its `status` will be set to `completed`:
```python
batch_scrape_job_status['status']
```
```out
'completed'
```
Letâs look at how many pages were scraped:
```python
batch_scrape_job_status['total']
```
```python
19
```
The response always includes the `data` field, whether the job is complete or not, with the content scraped up to that point. It has `error` and `next` fields to indicate if any errors occurred during scraping and whether there are more results to fetch.
## How to Scrape Dynamic JavaScript Websites
Out in the wild, many websites you encounter will be dynamic, meaning their content is generated on-the-fly using JavaScript rather than being pre-rendered on the server. These sites often require user interaction like clicking buttons or typing into forms before displaying their full content. Traditional web scrapers that only look at the initial HTML fail to capture this dynamic content, which is why browser automation capabilities are essential for comprehensive web scraping.
Firecrawl supports dynamic scraping by default. In the parameters of `scrape_url` or `batch_scrape_url`, you can define necessary actions to reach the target state of the page you are scraping. As an example, we will build a scraper that will extract the following information from `https://weather.com`:
- Current Temperature
- Temperature High
- Temperature Low
- Humidity
- Pressure
- Visibility
- Wind Speed
- Dew Point
- UV Index
- Moon Phase
These details are displayed for every city you search through the website:

Unlike websites such as Amazon where you can simply modify the URLâs search parameter (e.g. `?search=your-query`), weather.com presents a unique challenge. The site generates dynamic and unique IDs for each city, making traditional URL manipulation techniques ineffective. To scrape weather data for any given city, you must simulate the actual user journey: visiting the homepage, interacting with the search bar, entering the city name, and selecting the appropriate result from the dropdown list. This multi-step interaction process is necessary because of how weather.com structures its dynamic content delivery (at this point, I urge to visit the website and visit a few city pages).
Fortunately, Firecrawl natively supports such interactions through the `actions` parameter. It accepts a list of dictionaries, where each dictionary represents one of the following interactions:
- Waiting for the page to load
- Clicking on an element
- Writing text in input fields
- Scrolling up/down
- Take a screenshot at the current state
- Scrape the current state of the webpage
Letâs define the actions we need for weather.com:
```python
actions = [\
{"type": "wait", "milliseconds": 3000},\
{"type": "click", "selector": 'input[id="LocationSearch_input"]'},\
{"type": "write", "text": "London"},\
{"type": "screenshot"},\
{"type": "wait", "milliseconds": 1000},\
{"type": "click", "selector": "button[data-testid='ctaButton']"},\
{"type": "wait", "milliseconds": 3000},\
]
```
Letâs examine how we choose the selectors, as this is the most technical aspect of the actions. Using browser developer tools, we inspect the webpage elements to find the appropriate selectors. For the search input field, we locate an element with the ID âLocationSearch\_inputâ. After entering a city name, we include a 3-second wait to allow the dropdown search results to appear. At this stage, we capture a screenshot for debugging to verify the text input was successful.
The final step involves clicking the first matching result, which is identified by a button element with the `data-testid` attribute `ctaButton`. Note that if youâre implementing this in the future, these specific attribute names may have changed - youâll need to use browser developer tools to find the current correct selectors.
Now, letâs define a Pydantic schema to guide the LLM:
```python
class WeatherData(BaseModel):
location: str = Field(description="The name of the city")
temperature: str = Field(description="The current temperature in degrees Fahrenheit")
temperature_high: str = Field(description="The high temperature for the day in degrees Fahrenheit")
temperature_low: str = Field(description="The low temperature for the day in degrees Fahrenheit")
humidity: str = Field(description="The current humidity as a percentage")
pressure: str = Field(description="The current air pressure in inches of mercury")
visibility: str = Field(description="The current visibility in miles")
wind_speed: str = Field(description="The current wind speed in miles per hour")
dew_point: str = Field(description="The current dew point in degrees Fahrenheit")
uv_index: str = Field(description="The current UV index")
moon_phase: str = Field(description="The current moon phase")
```
Finally, letâs pass these objects to `scrape_url`:
```python
url = "https://weather.com"
data = app.scrape_url(
url,
params={
"formats": ["screenshot", "markdown", "extract"],
"actions": actions,
"extract": {
"schema": WeatherData.model_json_schema(),
"prompt": "Extract the following weather data from the weather.com page: temperature, temperature high, temperature low, humidity, pressure, visibility, wind speed, dew point, UV index, and moon phase",
},
},
)
```
The scraping only happens once all actions are performed. Letâs see if it was successful by looking at the `extract` key:
```python
data['extract']
```
```python
{'location': 'London, England, United Kingdom',
'temperature': '33°',
'temperature_high': '39°',
'temperature_low': '33°',
'humidity': '79%',
'pressure': '29.52in',
'visibility': '10 mi',
'wind_speed': '5 mph',
'dew_point': '28°',
'uv_index': '0 of 11',
'moon_phase': 'Waning Gibbous'}
```
All details are accounted for! But, for illustration, we need to take a closer look at the response structure when using JS-based actions:
```python
data.keys()
```
```python
dict_keys(['markdown', 'screenshot', 'actions', 'metadata', 'extract'])
```
The response has a new actions key:
```python
data['actions']
```
```python
{'screenshots': ['https://service.firecrawl.dev/storage/v1/object/public/media/screenshot-16bf71d8-dcb5-47eb-9af4-5fa84195b91d.png'],
'scrapes': []}
```
The actions array contained a single screenshot-generating action, which is reflected in the output above.
Letâs look at the screenshot:
```python
from IPython.display import Image
Image(data['actions']['screenshots'][0])
```

The image shows the stage where the scraper just typed the search query.
Now, we have to convert this whole process into a function that works for any given city:
```python
from pydantic import BaseModel, Field
from typing import Optional, Dict, Any
class WeatherData(BaseModel):
location: str = Field(description="The name of the city")
temperature: str = Field(
description="The current temperature in degrees Fahrenheit"
)
temperature_high: str = Field(
description="The high temperature for the day in degrees Fahrenheit"
)
temperature_low: str = Field(
description="The low temperature for the day in degrees Fahrenheit"
)
humidity: str = Field(description="The current humidity as a percentage")
pressure: str = Field(description="The current air pressure in inches of mercury")
visibility: str = Field(description="The current visibility in miles")
wind_speed: str = Field(description="The current wind speed in miles per hour")
dew_point: str = Field(description="The current dew point in degrees Fahrenheit")
uv_index: str = Field(description="The current UV index")
moon_phase: str = Field(description="The current moon phase")
def scrape_weather_data(app: FirecrawlApp, city: str) -> Optional[WeatherData]:
try:
# Define the actions to search for the city
actions = [\
{"type": "wait", "milliseconds": 3000},\
{"type": "click", "selector": 'input[id="LocationSearch_input"]'},\
{"type": "write", "text": city},\
{"type": "wait", "milliseconds": 1000},\
{"type": "click", "selector": "button[data-testid='ctaButton']"},\
{"type": "wait", "milliseconds": 3000},\
]
# Perform the scraping
data = app.scrape_url(
"https://weather.com",
params={
"formats": ["extract"],
"actions": actions,
"extract": {
"schema": WeatherData.model_json_schema(),
"prompt": "Extract the following weather data from the weather.com page: temperature, temperature high, temperature low, humidity, pressure, visibility, wind speed, dew point, UV index, and moon phase",
},
},
)
# Return the extracted weather data
return WeatherData(**data["extract"])
except Exception as e:
print(f"Error scraping weather data for {city}: {str(e)}")
return None
```
The code is the same but it is wrapped inside a function. Letâs test it on various cities:
```python
cities = ["Tashkent", "New York", "Tokyo", "Paris", "Istanbul"]
data_full = []
for city in cities:
weather_data = scrape_weather_data(app, city)
data_full.append(weather_data)
```
We can convert the data for all cities into a DataFrame now:
```python
import pandas as pd
# Convert list of WeatherData objects into dictionaries
data_dicts = [city.model_dump() for city in data_full]
# Convert list of dictionaries into DataFrame
df = pd.DataFrame(data_dicts)
print(df.head())
```
| location | temperature | temperature\_high | temperature\_low | humidity | pressure | visibility | wind\_speed | dew\_point | uv\_index | moon\_phase |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tashkent, Uzbekistan | 48 | 54 | 41 | 81 | 30.30 | 2.5 | 2 | 43 | 0 | Waning Gibbous |
| New York City, NY | 48° | 49° | 39° | 93% | 29.45 in | 4 mi | 10 mph | 46° | 0 of 11 | Waning Gibbous |
| Tokyo, Tokyo Prefecture, Japan | 47° | 61° | 48° | 95% | 29.94 in | 10 mi | 1 mph | 45° | 0 of 11 | Waning Gibbous |
| Paris, France | 34° | 36° | 30° | 93% | 29.42 in | 2.4 mi | 11 mph | 33° | 0 of 11 | Waning Gibbous |
| Istanbul, Tßrkiye | 47° | 67° | 44° | 79% | 29.98 in | 8 mi | 4 mph | 41° | 0 of 11 | Waning Gibbous |
We have successfully scraped weather data from multiple cities using Firecrawl and organized it into a structured DataFrame. This demonstrates how we can efficiently collect and analyze data generated by dynamic websites for further analysis and monitoring.
## Conclusion
In this comprehensive guide, weâve explored Firecrawlâs `/scrape` endpoint and its powerful capabilities for modern web scraping. We covered:
- Basic scraping setup and configuration options
- Multiple output formats including HTML, markdown, and screenshots
- Structured data extraction using both natural language prompts and Pydantic schemas
- Batch operations for processing multiple URLs efficiently
- Advanced techniques for scraping JavaScript-heavy dynamic websites
Through practical examples like extracting news articles from the NY Times and weather data from weather.com, weâve demonstrated how Firecrawl simplifies complex scraping tasks while providing flexible output formats suitable for data engineering and AI/ML pipelines.
The combination of LLM-powered extraction, structured schemas, and browser automation capabilities makes Firecrawl a versatile tool for gathering high-quality web data at scale, whether youâre building training datasets, monitoring websites, or conducting research.
To discover more what Firecrawl has to offer, refer to [our guide on the `/crawl` endpoint](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl), which scrapes websites in their entirety with a single command while using the `/scrape` endpoint under the hood.
For more hands-on uses-cases of Firecrawl, these posts may interest you as well:
- [Using Prompt Caching With Anthropic](https://www.firecrawl.dev/blog/using-prompt-caching-with-anthropic)
- [Scraping Job Boards With Firecrawl and OpenAI](https://www.firecrawl.dev/blog/scrape-job-boards-firecrawl-openai)
- [Scraping and Analyzing Airbnb Listings in Python Tutorial](https://www.firecrawl.dev/blog/scrape-analyze-airbnb-data-with-e2b)
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Bex Tuychiev@bextuychiev](https://x.com/bextuychiev)
Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics
### More articles by Bex Tuychiev
[Building an Automated Price Tracking Tool\\
\\
Build an automated e-commerce price tracker in Python. Learn web scraping, price monitoring, and automated alerts using Firecrawl, Streamlit, PostgreSQL.](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python) [Web Scraping Automation: How to Run Scrapers on a Schedule\\
\\
Learn how to automate web scraping in Python using free tools like schedule, asyncio, cron jobs and GitHub Actions. This comprehensive guide covers local and cloud-based scheduling methods to run scrapers reliably in 2025.](https://www.firecrawl.dev/blog/automated-web-scraping-free-2025) [Automated Data Collection - A Comprehensive Guide\\
\\
Learn how to build robust automated data collection systems using modern tools and best practices. This guide covers everything from selecting the right tools to implementing scalable collection pipelines.](https://www.firecrawl.dev/blog/automated-data-collection-guide) [BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python\\
\\
Learn the key differences between BeautifulSoup4 and Scrapy for web scraping in Python. Compare their features, performance, and use cases to choose the right tool for your web scraping needs.](https://www.firecrawl.dev/blog/beautifulsoup4-vs-scrapy-comparison) [How to Build an Automated Competitor Price Monitoring System with Python\\
\\
Learn how to build an automated competitor price monitoring system in Python that tracks prices across e-commerce sites, provides real-time comparisons, and maintains price history using Firecrawl, Streamlit, and GitHub Actions.](https://www.firecrawl.dev/blog/automated-competitor-price-scraping) [Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude\\
\\
Learn how to build a web scraper in Python that gathers company details, funding rounds, and investor information from public sources like Crunchbase using Firecrawl and Claude for automated data collection and analysis.](https://www.firecrawl.dev/blog/crunchbase-scraping-with-firecrawl-claude) [How to Create Custom Instruction Datasets for LLM Fine-tuning\\
\\
Learn how to build high-quality instruction datasets for fine-tuning large language models (LLMs). This guide covers when to create custom datasets, best practices for data collection and curation, and a practical example of building a code documentation dataset.](https://www.firecrawl.dev/blog/custom-instruction-datasets-llm-fine-tuning) [Data Enrichment: A Complete Guide to Enhancing Your Data Quality\\
\\
Learn how to enrich your data quality with a comprehensive guide covering data enrichment tools, best practices, and real-world examples. Discover how to leverage modern solutions like Firecrawl to automate data collection, validation, and integration for better business insights.](https://www.firecrawl.dev/blog/complete-guide-to-data-enrichment)
## DeepSeek Documentation Assistant
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Feb 10, 2025
â˘
[Bex Tuychiev](https://x.com/bextuychiev)
# Building an Intelligent Code Documentation RAG Assistant with DeepSeek and Firecrawl

# Building an Intelligent Code Documentation Assistant: RAG-Powered DeepSeek Implementation
## Introduction
DeepSeek R1âs release made waves in the AI community, with countless demos highlighting its impressive capabilities. However, most examples only scratch the surface with basic prompts rather than showing practical real-world implementations.
In this tutorial, weâll explore how to harness this powerful open-source model to create a documentation assistant powered by RAG (Retrieval Augmented Generation). Our application will be able to intelligently answer questions about any documentation website by combining DeepSeekâs language capabilities with efficient information retrieval.

For those eager to try it out, you can find installation and usage instructions [in the GitHub repository](https://github.com/BexTuychiev/local-documentation-rag). If youâre interested in understanding how the application works and learning to customize it for your needs, continue reading this detailed walkthrough.
## What Is DeepSeek R1?

[DeepSeek R1](https://github.com/deepseek-ai/DeepSeek-R1) represents a notable advancement in artificial intelligence, combining reinforcement learning and supervised fine-tuning in a novel and most importantly, open-source approach. The model comes in two variants: DeepSeek-R1-Zero, trained purely through reinforcement learning, and DeepSeek-R1, which undergoes additional training steps. Its architecture manages 671 billion total parameters, though it operates efficiently with 37 billion active parameters and handles context lengths up to 128,000 tokens.
The development journey progressed through carefully planned stages. Beginning with supervised fine-tuning for core capabilities, the model then underwent two phases of reinforcement learning. These RL stages shaped its reasoning patterns and aligned its behavior with human thought processes. This methodical approach produced a system capable of generating responses, performing self-verification, engaging in reflection, and constructing detailed reasoning across mathematics, programming, and general problem-solving.
When it comes to performance, DeepSeek R1 demonstrates compelling results that rival OpenAIâs offerings. It achieves 97.3% accuracy on MATH-500, reaches the 96.3 percentile on Codeforces programming challenges, and scores 90.8% on the MMLU general knowledge assessment. The technology has also been distilled into smaller versions ranging from 1.5B to 70B parameters, built on established frameworks like Qwen and Llama. These adaptations make the technology more accessible for practical use while preserving its core strengths.
In this tutorial, we will use its 14B version but your hardware may support up to 70B parameters. It is important to choose a higher capacity model as this number is the biggest contributor to performance.
## Prerequisite: Revisiting RAG concepts

[Source](https://www.patrickschnass.de/posts/rag_intro/)
Retrieval Augmented Generation (RAG) represents a significant advancement in how Large Language Models (LLMs) interact with information. Unlike traditional LLMs that rely solely on their training data, RAG combines the power of language models with the ability to retrieve and reference external information in real-time. This approach effectively creates a bridge between the modelâs inherent knowledge and up-to-date, specific information stored in external databases or documents.
The RAG architecture consists of two main components: the retriever and the generator. The retriever is responsible for searching through a knowledge base to find relevant information based on the userâs query. This process typically involves converting both the query and stored documents into vector embeddings, allowing for semantic similarity searches that go beyond simple keyword matching. The generator, usually an LLM, then takes both the original query and the retrieved information to produce a comprehensive, contextually relevant response.
One of RAGâs key advantages is its ability to provide more accurate and verifiable responses. By grounding the modelâs outputs in specific, retrievable sources, RAG helps reduce hallucinations â instances where LLMs generate plausible-sounding but incorrect information. This is particularly valuable in professional contexts where accuracy and accountability are crucial, such as technical documentation, customer support, or legal applications. Additionally, RAG systems can be updated with new information without requiring retraining of the underlying language model, making them more flexible and maintainable.
The implementation of RAG typically involves several technical components working in harmony. First, documents are processed and converted into embeddings using models like BERT or Sentence Transformers. These embeddings are then stored in vector databases such as Pinecone, Weaviate, or FAISS for efficient retrieval. When a query arrives, it goes through the same embedding process, and similarity search algorithms find the most relevant documents. Finally, these documents, along with the original query, are formatted into a prompt that the LLM uses to generate its response. This structured approach ensures that the final output is both relevant and grounded in reliable source material.
Now that weâve refreshed our memory on basic RAG concepts, letâs dive in to the appâs implementation.
## Overview of the App
Before diving into the technical details, letâs walk through a typical user journey to understand how the documentation assistant works.
The process starts with the user providing documentation URLs to scrape. The app is designed to work with any documentation website, but here are some examples of typical documentation pages:
- `https://docs.firecrawl.dev`
- `https://docs.langchain.com`
- `https://docs.streamlit.io`
The appâs interface is divided into two main sections: a sidebar for documentation management and a main chat interface. In the sidebar, users can:
1. Enter a documentation URL to scrape
2. Specify a name for the documentation (must end with â-docsâ)
3. Optionally limit the number of pages to scrape
4. View and select from previously scraped documentation sets
When a user initiates scraping, the app uses Firecrawl to intelligently crawl the documentation website, converting HTML content into clean markdown files. These files are stored locally in a directory named after the documentation (e.g., âFirecrawl-docsâ). The app shows real-time progress during scraping and notifies the user when complete.
After scraping, the documentation is processed into a vector database using the Nomic embeddings model. This enables semantic search capabilities, allowing the assistant to find relevant documentation sections based on user questions. The processing happens automatically when a user selects a documentation set from the sidebar.
The main chat interface provides an intuitive way to interact with the documentation:
1. Users can ask questions in natural language about the selected documentation
2. The app uses RAG (Retrieval-Augmented Generation) to find relevant documentation sections
3. DeepSeek R1 generates accurate, contextual responses based on the retrieved content
4. Each response includes an expandable âView reasoningâ section showing the chain of thought

Users can switch between different documentation sets at any time, and the app will automatically reprocess the vectors as needed.
This approach combines the power of modern AI with traditional documentation search, creating a more interactive and intelligent way to explore technical documentation. Whether youâre learning a new framework or trying to solve a specific problem, the assistant helps you find and understand relevant documentation more efficiently than traditional search methods.
## The Tech Stack Used in the App
Building an effective documentation assistant requires tools that can handle complex tasks like web scraping, text processing, and natural language understanding while remaining maintainable and efficient. Letâs explore the core technologies that power our application and why each was chosen:
### 1\. [Firecrawl](https://firecrawl.dev/) for AI-powered documentation scraping
At the heart of our documentation collection system is Firecrawl, an AI-powered web scraping engine. Unlike traditional scraping libraries that rely on brittle HTML selectors, Firecrawl uses natural language understanding to identify and extract content. This makes it ideal for our use case because:
- It can handle diverse documentation layouts without custom code
- Maintains reliability even when documentation structure changes
- Automatically extracts clean markdown content
- Handles JavaScript-rendered documentation sites
- Provides metadata like titles and URLs automatically
- Follows documentation links intelligently
### 2\. [DeepSeek R1](https://deepseek.ai/) for question answering
For the critical task of answering documentation questions, we use the DeepSeek R1 14B model through Ollama. This AI model excels at understanding technical documentation and providing accurate responses. We chose DeepSeek R1 because:
- Runs locally for better privacy and lower latency
- Specifically trained on technical content
- Provides detailed explanations with chain-of-thought reasoning
- More cost-effective than cloud-based models
- Integrates well with LangChain for RAG workflows
### 3\. [Nomic Embeddings](https://docs.nomic.ai/) for semantic search
To enable semantic search across documentation, we use Nomicâs text embedding model through [Ollama](https://ollama.com/). This component is crucial for finding relevant documentation sections. We chose Nomic because:
- Optimized for technical documentation
- Runs locally alongside DeepSeek through Ollama
- Produces high-quality embeddings for RAG
- Fast inference speed
- Compact model size
### 4\. [ChromaDB](https://www.trychroma.com/) for vector storage
To store and query document embeddings efficiently, we use ChromaDB as our vector database. This modern vector store offers:
- Lightweight and easy to set up
- Persistent storage of embeddings
- Fast similarity search
- Seamless integration with LangChain
- No external dependencies
### 5\. [Streamlit](https://streamlit.io/) for user interface
The web interface is built with Streamlit, a Python framework for data applications. We chose Streamlit because:
- It enables rapid development of chat interfaces
- Provides built-in components for file handling
- Handles async operations smoothly
- Maintains chat history during sessions
- Requires minimal frontend code
- Makes deployment straightforward
### 6\. [LangChain](https://www.langchain.com/) for RAG orchestration
To coordinate the various components into a cohesive RAG system, we use LangChain. This framework provides:
- Standard interfaces for embeddings and LLMs
- Document loading and text splitting utilities
- Vector store integration
- Prompt management
- Structured output parsing
This carefully selected stack provides a robust foundation while keeping the system entirely local and self-contained. The combination of AI-powered tools (Firecrawl and DeepSeek) with modern infrastructure (ChromaDB, LangChain, and Ollama) creates a reliable and efficient documentation assistant that can handle diverse technical documentation.
Most importantly, this stack minimizes both latency and privacy concerns by running all AI components locally. The infrastructure is lightweight and portable, letting you focus on using the documentation rather than managing complex dependencies or cloud services.
## Breaking Down the App Components
When you look at [the GitHub repository](https://github.com/BexTuychiev/local-documentation-rag/tree/main/src) of the app, you will see the following file structure:

Several files in the repository serve common purposes that most developers will recognize:
- `.gitignore`: Specifies which files Git should ignore when tracking changes
- `README.md`: Documentation explaining what the project does and how to use it
- `requirements.txt`: Lists all Python package dependencies needed to run the project
Letâs examine the remaining Python scripts and understand how they work together to power the application. The explanations will be in a logical order building from foundational elements to higher-level functionality.
### 1\. Scraping Documentation with Firecrawl - `src/scraper.py`
The documentation scraper component handles fetching and processing documentation pages using Firecrawlâs AI capabilities. Letâs examine how each part works:
First, we make the necessary imports and setup:
```python
import logging
import os
import re
from pathlib import Path
from typing import List
from dotenv import load_dotenv
from firecrawl import FirecrawlApp
from pydantic import BaseModel, Field
# Get logger for the scraper module
logger = logging.getLogger(__name__)
```
Then, we define the core data structure for documentation pages:
```python
class DocPage(BaseModel):
title: str = Field(description="Page title")
content: str = Field(description="Main content of the page")
url: str = Field(description="Page URL")
```
The `DocPage` model represents a single documentation page with three essential fields:
- `title`: The pageâs heading or title
- `content`: The main markdown content of the page
- `url`: Direct link to the original page
This model is used by both the scraper to structure extracted content and the RAG system to process documentation for the vector store.
The main scraper class handles all documentation collection:
```python
class DocumentationScraper:
def init(self):
self.app = FirecrawlApp()
```
The `DocumentationScraper` initializes a connection to Firecrawl and provides three main methods for documentation collection:
1. `get_documentation_links`: Discovers all documentation pages from a base URL:
```python
def get_documentation_links(self, base_url: str) -> list[str]:
"""Get all documentation page links from a given base URL."""
logger.info(f"Getting documentation links from {base_url}")
initial_crawl = self.app.crawl_url(
base_url,
params={
"scrapeOptions": {"formats": ["links"]},
},
)
all_links = []
for item in initial_crawl["data"]:
all_links.extend(item["links"])
filtered_links = set(
[link.split("#")[0] for link in all_links if link.startswith(base_url)]
)
logger.info(f"Found {len(filtered_links)} unique documentation links")
return list(filtered_links)
```
This method:
- Uses Firecrawlâs link extraction mode to find all URLs
- Filters for links within the same documentation domain
- Removes duplicate URLs and anchor fragments
- Returns a clean list of documentation page URLs
2. `scrape_documentation`: Processes all documentation pages into structured content:
```python
def scrape_documentation(self, base_url: str, limit: int = None):
"""Scrape documentation pages from a given base URL."""
logger.info(f"Scraping doc pages from {base_url}")
filtered_links = self.get_documentation_links(base_url)
if limit:
filtered_links = filtered_links[:limit]
try:
logger.info(f"Scraping {len(filtered_links)} documentation pages")
crawl_results = self.app.batch_scrape_urls(filtered_links)
except Exception as e:
logger.error(f"Error scraping documentation pages: {str(e)}")
return []
doc_pages = []
for result in crawl_results["data"]:
if result.get("markdown"):
doc_pages.append(
DocPage(
title=result.get("metadata", {}).get("title", "Untitled"),
content=result["markdown"],
url=result.get("metadata", {}).get("url", ""),
)
)
else:
logger.warning(
f"Failed to scrape {result.get('metadata', {}).get('url', 'unknown URL')}"
)
logger.info(f"Successfully scraped {len(doc_pages)} pages out of {len(filtered_links)} URLs")
return doc_pages
```
This method:
- Gets all documentation links using the previous method
- Optionally limits the number of pages to scrape
- Uses Firecrawlâs batch scraping to efficiently process multiple pages
- Converts raw scraping results into structured `DocPage` objects
- Handles errors and provides detailed logging
3. `save_documentation_pages`: Stores scraped content as markdown files:
```python
def save_documentation_pages(self, doc_pages: List[DocPage], docs_dir: str):
"""Save scraped documentation pages to markdown files."""
Path(docs_dir).mkdir(parents=True, exist_ok=True)
for page in doc_pages:
url_path = page.url.replace("https://docs.firecrawl.dev", "")
safe_filename = url_path.strip("/").replace("/", "-")
filepath = os.path.join(docs_dir, f"{safe_filename}.md")
with open(filepath, "w", encoding="utf-8") as f:
f.write("---\n")
f.write(f"title: {page.title}\n")
f.write(f"url: {page.url}\n")
f.write("---\n\n")
f.write(page.content)
logger.info(f"Saved {len(doc_pages)} pages to {docs_dir}")
```
This method:
- Creates a documentation directory if needed
- Converts URLs to safe filenames
- Saves each page as a markdown file with YAML frontmatter
- Preserves original titles and URLs for reference
Finally, the class provides a convenience method to handle the entire scraping workflow:
```python
def pull_docs(self, base_url: str, docs_dir: str, n_pages: int = None):
doc_pages = self.scrape_documentation(base_url, n_pages)
self.save_documentation_pages(doc_pages, docs_dir)
```
This scraper component is used by:
- The Streamlit interface ( `app.py`) for initial documentation collection
- The RAG system ( `rag.py`) for processing documentation into the vector store
- The command-line interface for testing and manual scraping
The use of Firecrawlâs AI capabilities allows the scraper to handle diverse documentation layouts without custom selectors, while the structured output ensures consistency for downstream processing.
### 2\. Implementing RAG with Ollama - `src/rag.py`
The RAG (Retrieval Augmented Generation) component is the core of our documentation assistant, handling document processing, embedding generation, and question answering. Letâs examine each part in detail:
First, we import the necessary LangChain components:
```python
from langchain_chroma import Chroma
from langchain_community.document_loaders import DirectoryLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
```
These imports provide:
- `Chroma`: Vector database for storing embeddings
- `DirectoryLoader`: Utility for loading markdown files from a directory
- `ChatPromptTemplate`: Template system for LLM prompts
- `ChatOllama` and `OllamaEmbeddings`: Local LLM and embedding models
- `RecursiveCharacterTextSplitter`: Text chunking utility
The main RAG class initializes all necessary components:
```python
class DocumentationRAG:
def __init__(self):
# Initialize embeddings and vector store
self.embeddings = OllamaEmbeddings(model="nomic-embed-text")
self.vector_store = Chroma(
embedding_function=self.embeddings, persist_directory="./chroma_db"
)
# Initialize LLM
self.llm = ChatOllama(model="deepseek-r1:14b")
# Text splitter for chunking
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, chunk_overlap=200, add_start_index=True
)
```
The initialization:
1. Creates an embedding model using Nomicâs text embeddings
2. Sets up a Chroma vector store with persistent storage
3. Initializes the DeepSeek R1 14B model for question answering
4. Configures a text splitter with 1000-character chunks and 200-character overlap
The prompt template defines how the LLM should process questions:
```python
# RAG prompt template
self.prompt = ChatPromptTemplate.from_template(
"""
You are an expert documentation assistant. Use the following documentation context
to answer the question. If you don't know the answer, just say that you don't
have enough information. Keep the answer concise and clear.
Context: {context}
Question: {question}
Answer:"""
)
```
This template:
- Sets the assistantâs role and behavior
- Provides placeholders for context and questions
- Encourages concise and clear responses
The document loading method handles reading markdown files:
```python
def load_docs_from_directory(self, docs_dir: str):
"""Load all markdown documents from a directory"""
markdown_docs = DirectoryLoader(docs_dir, glob="*.md").load()
return markdown_docs
```
This method:
- Uses `DirectoryLoader` to find all markdown files
- Automatically handles file reading and basic preprocessing
- Returns a list of Document objects
The document processing method prepares content for the vector store:
```python
def process_documents(self, docs_dir: str):
"""Process documents and add to vector store"""
# Clear existing documents
self.vector_store = Chroma(
embedding_function=self.embeddings, persist_directory="./chroma_db"
)
# Load and process new documents
documents = self.load_docs_from_directory(docs_dir)
chunks = self.text_splitter.split_documents(documents)
self.vector_store.add_documents(chunks)
```
This method:
1. Reinitializes the vector store to clear existing documents
2. Loads new documents from the specified directory
3. Splits documents into manageable chunks
4. Generates and stores embeddings in the vector database
Finally, the query method handles question answering:
```python
def query(self, question: str) -> tuple[str, str]:
"""Query the documentation"""
# Get relevant documents
docs = self.vector_store.similarity_search(question, k=3)
# Combine context
context = "\n\n".join([doc.page_content for doc in docs])
# Generate response
chain = self.prompt | self.llm
response = chain.invoke({"context": context, "question": question})
# Extract chain of thought between and
chain_of_thought = response.content.split("")[1].split("")[0]
# Extract response
response = response.content.split("")[1].strip()
return response, chain_of_thought
```
The query process:
1. Performs semantic search to find the 3 most relevant document chunks
2. Combines the chunks into a single context string
3. Creates a LangChain chain combining the prompt and LLM
4. Generates a response with chain-of-thought reasoning
5. Extracts and returns both the final answer and reasoning process
This RAG component is used by:
- The Streamlit interface ( `app.py`) for handling user questions
- The command-line interface for testing and development
- Future extensions that need documentation Q&A capabilities
The implementation uses LangChainâs abstractions to create a modular and maintainable system while keeping all AI components running locally through Ollama.
### 3\. Building a clean UI with Streamlit - `src/app.py`
The Streamlit interface brings together the scraping and RAG components into a user-friendly web application. Letâs break down each component:
First, we set up basic configuration and utilities:
```python
import glob
import logging
from pathlib import Path
import streamlit as st
from dotenv import load_dotenv
from rag import DocumentationRAG
from scraper import DocumentationScraper
# Configure logging
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
handlers=[logging.StreamHandler()],
)
logger = logging.getLogger(__name__)
```
These imports and configurations:
- Set up logging for debugging and monitoring
- Import our custom RAG and scraper components
- Load environment variables for configuration
Helper functions handle documentation management:
```python
def get_existing_docs():
"""Get all documentation directories with -docs suffix"""
docs_dirs = glob.glob("*-docs")
return [Path(dir_path).name for dir_path in docs_dirs]
def get_doc_page_count(docs_dir: str) -> int:
"""Get number of markdown files in a documentation directory"""
return len(list(Path(docs_dir).glob("*.md")))
```
These utilities:
- Find all documentation directories with â-docsâ suffix
- Count pages in each documentation set
- Support the UIâs documentation selection features
The scraping configuration section handles documentation collection:
```python
def scraping_config_section():
"""Create the documentation scraping configuration section"""
st.markdown("### Configure Scraping")
base_url = st.text_input(
"Documentation URL",
placeholder="https://docs.firecrawl.dev",
help="The base URL of the documentation to scrape",
)
docs_name = st.text_input(
"Documentation Name",
placeholder="Firecrawl-docs",
help="Name of the directory to store documentation",
)
n_pages = st.number_input(
"Number of Pages",
min_value=0,
value=0,
help="Limit the number of pages to scrape (0 for all pages)",
)
st.info(
"đĄ Add '-docs' suffix to the documentation name. "
"Set pages to 0 to scrape all available pages."
)
if st.button("Start Scraping"):
if not base_url or not docs_name:
st.error("Please provide both URL and documentation name")
elif not docs_name.endswith("-docs"):
st.error("Documentation name must end with '-docs'")
else:
with st.spinner("Scraping documentation..."):
try:
scraper = DocumentationScraper()
n_pages = None if n_pages == 0 else n_pages
scraper.pull_docs(base_url, docs_name, n_pages=n_pages)
st.success("Documentation scraped successfully!")
except Exception as e:
st.error(f"Error scraping documentation: {str(e)}")
```
This section:
- Provides input fields for documentation URL and name
- Allows limiting the number of pages to scrape
- Handles validation and error reporting
- Shows progress during scraping
- Uses our `DocumentationScraper` class for content collection
The documentation selection interface manages switching between docs:
```python
def documentation_select_section():
"""Create the documentation selection section"""
st.markdown("### Select Documentation")
existing_docs = get_existing_docs()
if not existing_docs:
st.caption("No documentation found yet")
return None
# Create options with page counts
doc_options = [f"{doc} ({get_doc_page_count(doc)} pages)" for doc in existing_docs]
selected_doc = st.selectbox(
"Choose documentation to use as context",
options=doc_options,
help="Select which documentation to use for answering questions",
)
if selected_doc:
# Extract the actual doc name without page count
st.session_state.current_doc = selected_doc.split(" (")[0]
return st.session_state.current_doc
return None
```
This component:
- Lists available documentation sets
- Shows page counts for each set
- Updates session state when selection changes
- Handles the case of no available documentation
The chat interface consists of two main functions that work together to create the interactive Q&A experience:
First, we initialize the necessary session state:
```python
def initialize_chat_state():
"""Initialize session state for chat"""
if "messages" not in st.session_state:
st.session_state.messages = []
if "rag" not in st.session_state:
st.session_state.rag = DocumentationRAG()
```
This initialization:
- Creates an empty message list if none exists
- Sets up the RAG system for document processing and querying
- Uses Streamlitâs session state to persist data between reruns
The main chat interface starts with basic setup:
```python
def chat_interface():
"""Create the chat interface"""
st.title("Documentation Assistant")
# Check if documentation is selected
if "current_doc" not in st.session_state:
st.info("Please select a documentation from the sidebar to start chatting.")
return
```
This section:
- Sets the page title
- Ensures documentation is selected before proceeding
- Shows a helpful message if no documentation is chosen
Document processing is handled next:
```python
# Process documentation if not already processed
if (
"docs_processed" not in st.session_state
or st.session_state.docs_processed != st.session_state.current_doc
):
with st.spinner("Processing documentation..."):
st.session_state.rag.process_documents(st.session_state.current_doc)
st.session_state.docs_processed = st.session_state.current_doc
```
This block:
- Checks if the current documentation needs processing
- Shows a loading spinner during processing
- Updates the session state after processing
- Prevents unnecessary reprocessing of the same documentation
Message display is handled by iterating through the chat history:
```python
# Display chat messages
for message in st.session_state.messages:
with st.chat_message(message["role"]):
st.markdown(message["content"])
if "chain_of_thought" in message:
with st.expander("View reasoning"):
st.markdown(message["chain_of_thought"])
```
This section:
- Shows each message with appropriate styling based on role
- Displays the main content using markdown
- Creates expandable sections for reasoning chains
- Maintains visual consistency in the chat
Finally, the input handling and response generation:
```python
# Chat input
if prompt := st.chat_input("Ask a question about the documentation"):
# Add user message
st.session_state.messages.append({"role": "user", "content": prompt})
with st.chat_message("user"):
st.markdown(prompt)
# Generate and display response
with st.chat_message("assistant"):
with st.spinner("Thinking..."):
response, chain_of_thought = st.session_state.rag.query(prompt)
st.markdown(response)
with st.expander("View reasoning"):
st.markdown(chain_of_thought)
# Store assistant response
st.session_state.messages.append({
"role": "assistant",
"content": response,
"chain_of_thought": chain_of_thought,
})
```
This section:
1. Captures user input:
- Uses Streamlitâs chat input component
- Stores the message in session state
- Displays the message immediately
2. Generates response:
- Shows a âthinkingâ spinner during processing
- Queries the RAG system for an answer
- Displays the response with expandable reasoning
3. Updates chat history:
- Stores both response and reasoning
- Maintains the conversation flow
- Preserves the interaction for future reference
The entire chat interface creates a seamless experience by:
- Managing state effectively
- Providing immediate feedback
- Showing processing status
- Maintaining conversation context
- Exposing the AIâs reasoning process
Finally, the main application structure:
```python
def sidebar():
"""Create the sidebar UI components"""
with st.sidebar:
st.title("Documentation Scraper")
scraping_config_section()
documentation_select_section()
def main():
initialize_chat_state()
sidebar()
chat_interface()
if __name__ == "__main__":
main()
```
This structure:
- Organizes UI components into sidebar and main area
- Initializes necessary state on startup
- Provides a clean entry point for the application
The Streamlit interface brings together all components into a cohesive application that:
- Makes documentation scraping accessible to non-technical users
- Provides immediate feedback during operations
- Maintains conversation history
- Shows the AIâs reasoning process
- Handles errors gracefully
## How to Increase System Performance
There are several ways to optimize the performance of this documentation assistant. The following sections explore key areas for potential improvements:
### 1\. Optimize document chunking
In `rag.py`, we currently use a basic chunking strategy:
```python
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
add_start_index=True
)
```
We can improve this by:
- Using semantic chunking that respects document structure
- Adjusting chunk size based on content type (e.g., larger for API docs)
- Implementing custom splitting rules for documentation headers
- Adding metadata to chunks for better context preservation
Example improved configuration:
```python
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1500, # Larger chunks for more context
chunk_overlap=300, # Increased overlap for better coherence
separators=["\n## ", "\n### ", "\n\n", "\n", " ", ""], # Respect markdown structure
add_start_index=True,
length_function=len,
is_separator_regex=False
)
```
### 2\. Enhance vector search
The current similarity search in `rag.py` is basic:
```python
docs = self.vector_store.similarity_search(question, k=3)
```
We can improve retrieval by:
- Increasing `k`, i.e. the number of chunks returned
- Implementing hybrid search (combining semantic and keyword matching)
- Using Maximum Marginal Relevance (MMR) for diverse results
- Adding metadata filtering based on document sections
- Implementing re-ranking of retrieved chunks
Example enhanced retrieval:
```python
def query(self, question: str) -> tuple[str, str]:
# Get relevant documents with MMR
docs = self.vector_store.max_marginal_relevance_search(
question,
k=5, # Retrieve more candidates
fetch_k=20, # Consider larger initial set
lambda_mult=0.7 # Diversity factor
)
# Filter and re-rank results
filtered_docs = [\
doc for doc in docs\
if self._calculate_relevance_score(doc, question) > 0.7\
]
# Use top 3 most relevant chunks
context = "\n\n".join([doc.page_content for doc in filtered_docs[:3]])
```
### 3\. Implement caching
The current implementation reprocesses documentation on every selection:
```python
if (
"docs_processed" not in st.session_state
or st.session_state.docs_processed != st.session_state.current_doc
):
with st.spinner("Processing documentation..."):
st.session_state.rag.process_documents(st.session_state.current_doc)
```
We can improve this by:
- Implementing persistent vector storage with versioning
- Caching processed embeddings
- Adding incremental updates for documentation changes
Example caching implementation:
```python
from hashlib import md5
import pickle
class CachedDocumentationRAG(DocumentationRAG):
def process_documents(self, docs_dir: str):
cache_key = self._get_cache_key(docs_dir)
cache_path = f"cache/{cache_key}.pkl"
if os.path.exists(cache_path):
with open(cache_path, 'rb') as f:
self.vector_store = pickle.load(f)
else:
super().process_documents(docs_dir)
os.makedirs("cache", exist_ok=True)
with open(cache_path, 'wb') as f:
pickle.dump(self.vector_store, f)
```
### 4\. Optimize model loading
Currently, we initialize models in `__init__`:
```python
def __init__(self):
self.embeddings = OllamaEmbeddings(model="nomic-embed-text")
self.llm = ChatOllama(model="deepseek-r1:14b")
```
We can improve this by:
- Implementing lazy loading of models
- Using smaller models for initial responses
- Adding model quantization options
- Implementing model caching
Example optimized initialization:
```python
class OptimizedDocumentationRAG:
def __init__(self, use_small_model=True):
self._embeddings = None
self._llm = None
self._use_small_model = use_small_model
@property
def llm(self):
if self._llm is None:
model_size = "7b" if self._use_small_model else "14b"
self._llm = ChatOllama(
model=f"deepseek-r1:{model_size}",
temperature=0.1, # Lower temperature for docs
num_ctx=2048 # Reduced context for faster inference
)
return self._llm
```
These optimizations can significantly improve:
- Response latency
- Memory usage
- Processing throughput
- User experience
Remember to benchmark performance before and after implementing these changes to measure their impact. Also, consider your specific use case - some optimizations might be more relevant depending on factors like user load, documentation size, and hardware constraints.
## Conclusion
This local documentation assistant demonstrates how modern AI technologies can be combined to create powerful, practical tools for technical documentation. By using DeepSeekâs language capabilities, Firecrawlâs AI-powered scraping, and the RAG architecture, weâve built a system that makes documentation more accessible and interactive. The applicationâs modular design, with clear separation between scraping, RAG implementation, and user interface components, provides a solid foundation for future enhancements and adaptations to different documentation needs.
Most importantly, this implementation shows that sophisticated AI applications can be built entirely with local components, eliminating privacy concerns and reducing operational costs. The combination of Streamlitâs intuitive interface, LangChainâs flexible abstractions, and Ollamaâs local AI models creates a seamless experience that feels like a cloud service but runs entirely on your machine. Whether youâre a developer learning a new framework, a technical writer maintaining documentation, or a team lead looking to improve documentation accessibility, this assistant provides a practical solution that can be customized and extended to meet your specific needs.
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Bex Tuychiev@bextuychiev](https://x.com/bextuychiev)
Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics
### More articles by Bex Tuychiev
[Building an Automated Price Tracking Tool\\
\\
Build an automated e-commerce price tracker in Python. Learn web scraping, price monitoring, and automated alerts using Firecrawl, Streamlit, PostgreSQL.](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python) [Web Scraping Automation: How to Run Scrapers on a Schedule\\
\\
Learn how to automate web scraping in Python using free tools like schedule, asyncio, cron jobs and GitHub Actions. This comprehensive guide covers local and cloud-based scheduling methods to run scrapers reliably in 2025.](https://www.firecrawl.dev/blog/automated-web-scraping-free-2025) [Automated Data Collection - A Comprehensive Guide\\
\\
Learn how to build robust automated data collection systems using modern tools and best practices. This guide covers everything from selecting the right tools to implementing scalable collection pipelines.](https://www.firecrawl.dev/blog/automated-data-collection-guide) [BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python\\
\\
Learn the key differences between BeautifulSoup4 and Scrapy for web scraping in Python. Compare their features, performance, and use cases to choose the right tool for your web scraping needs.](https://www.firecrawl.dev/blog/beautifulsoup4-vs-scrapy-comparison) [How to Build an Automated Competitor Price Monitoring System with Python\\
\\
Learn how to build an automated competitor price monitoring system in Python that tracks prices across e-commerce sites, provides real-time comparisons, and maintains price history using Firecrawl, Streamlit, and GitHub Actions.](https://www.firecrawl.dev/blog/automated-competitor-price-scraping) [Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude\\
\\
Learn how to build a web scraper in Python that gathers company details, funding rounds, and investor information from public sources like Crunchbase using Firecrawl and Claude for automated data collection and analysis.](https://www.firecrawl.dev/blog/crunchbase-scraping-with-firecrawl-claude) [How to Create Custom Instruction Datasets for LLM Fine-tuning\\
\\
Learn how to build high-quality instruction datasets for fine-tuning large language models (LLMs). This guide covers when to create custom datasets, best practices for data collection and curation, and a practical example of building a code documentation dataset.](https://www.firecrawl.dev/blog/custom-instruction-datasets-llm-fine-tuning) [Data Enrichment: A Complete Guide to Enhancing Your Data Quality\\
\\
Learn how to enrich your data quality with a comprehensive guide covering data enrichment tools, best practices, and real-world examples. Discover how to leverage modern solutions like Firecrawl to automate data collection, validation, and integration for better business insights.](https://www.firecrawl.dev/blog/complete-guide-to-data-enrichment)
## Structured Web Data Extraction
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
January 20, 2025
â˘
[Eric Ciarla](https://x.com/ericciarla)
# Introducing /extract: Get structured web data with just a prompt
/extract by Firecrawl - Get structured web data with just a prompt (Open Beta) - YouTube
Firecrawl
503 subscribers
[/extract by Firecrawl - Get structured web data with just a prompt (Open Beta)](https://www.youtube.com/watch?v=Qq1pFm8enZo)
Firecrawl
Search
Info
Shopping
Tap to unmute
If playback doesn't begin shortly, try restarting your device.
You're signed out
Videos you watch may be added to the TV's watch history and influence TV recommendations. To avoid this, cancel and sign in to YouTube on your computer.
CancelConfirm
Share
Include playlist
An error occurred while retrieving sharing information. Please try again later.
Watch later
Share
Copy link
Watch on
0:00
/ â˘Live
â˘
[Watch on YouTube](https://www.youtube.com/watch?v=Qq1pFm8enZo "Watch on YouTube")
## The era of writing web scrapers is over
Today weâre releasing [/extract](https://www.firecrawl.dev/extract) \- write a prompt, get structured data from any website. No scrapers. No pipelines. Just results.
## Getting web data is hard
If youâve ever needed structured data from websitesâwhether to enrich your CRM, monitor competitors, or power various applicationsâyouâre probably familiar with the frustrating options available today:
- Manually researching and copy-pasting data from multiple sources, consuming countless hours
- Writing and maintaining fragile web scrapers that break at the slightest site change
- Using scraping services and building complex LLM pipelines with limited context windows that force you to break down data manually
Fortunately, with our /extract endpoint, you can leave these cumbersome approaches in the past and focus on what matters - getting the data you need.
## What You Can Build With /extract
Companies are already using /extract to:
- Enrich thousands of CRM leads with company data
- Automate KYB processes with structured business information
- Track competitor prices and feature changes in real-time
- Build targeted prospecting lists at scale
Hereâs how it works:
1. Write a prompt describing the data you need
2. Point us at any website (use wildcards like example.com/\*)
3. Get back clean, structured JSON
No more broken scrapers. No more complex pipelines. Just the data you need to build.
## Current Limitations
While /extract handles most web data needs effectively, there are some edge cases weâre actively improving:
1. Scale Limitations: Very large sites (think Amazonâs entire catalog) require breaking requests into smaller chunks
2. Advanced Filtering: Complex queries like time-based filtering are still in development
3. Consistency: Multiple runs may return slightly different results as we refine our extraction model
Weâre actively working on these areas. Our goal is to make web data as accessible as an API - and weâre getting closer every day.
## Get Started
1. **Try it Now**
- Get 500,000 free tokens in our [playground](https://www.firecrawl.dev/playground?mode=extract)
- See examples and experiment with different prompts
- No credit card required
2. **Build Something Real**
- Read the [technical docs](https://docs.firecrawl.dev/features/extract)
- Connect with [Zapier](https://zapier.com/apps/firecrawl/integrations) for no-code workflows
Ready to turn web data into your competitive advantage? Get started in less than 5 minutes.
â Caleb, Eric, Nick and the Firecrawl team đĽ
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Eric Ciarla@ericciarla](https://x.com/ericciarla)
Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB.
Previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users.
### More articles by Eric Ciarla
[How to Create an llms.txt File for Any Website\\
\\
Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [Cloudflare Error 1015: How to solve it?\\
\\
Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner.](https://www.firecrawl.dev/blog/cloudflare-error-1015-how-to-solve-it) [Build an agent that checks for website contradictions\\
\\
Using Firecrawl and Claude to scrape your website's data and look for contradictions.](https://www.firecrawl.dev/blog/contradiction-agent) [Why Companies Need a Data Strategy for Generative AI\\
\\
Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses\\
\\
A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [How to easily install requests with pip and python\\
\\
A tutorial on installing the requests library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-easily-install-requests-with-pip-and-python) [How to quickly install BeautifulSoup with Python\\
\\
A guide on installing the BeautifulSoup library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python) [How to Use OpenAI's o1 Reasoning Models in Your Applications\\
\\
Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications)
## Fire Engine for Firecrawl
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Aug 6, 2024
â˘
[Eric Ciarla](https://x.com/ericciarla)
# Introducing Fire Engine for Firecrawl

Firecrawl handles web scraping orchestration but doesnât do the actual scraping. It initially relied on third-party services like Fetch and Playwright for data retrieval. However, these services often failed on certain sites or were too slow, causing issues for users. To address this, we built Fire Engine, now the default backend for Firecrawl. Itâs designed to be more reliable and faster, solving the core problems we and our users encountered with other scraping services.
### What is Fire Engine?
Fire Engine is a scraping primitive designed to increase Firecrawlâs scraping capabilities.
Weâre proud to say that Fire Engine outperforms leading competitors in key areas:
- **Reliability:** 40% more reliable than scraping leading competitors when scraping different types of websites
- **Speed:** Up to 33.17% faster than scraping leading competitors
And this is just the beginning, we are working closely with Firecrawl users to further improve reliability, speed, and more.
### The Technology Behind Fire Engine
Fire Engine combines a variety of browser and non-browser based techniques to balance speed and reliability, ensuring that you get data back without compromise. To do this, Fire engine has:
- **Efficient Headless Browser Management:** Running browsers at scale is notoriously difficult, but Fire Engine handles this with ease.
- **Persistent Browser Sessions:** By keeping browsers running, Fire Engine improves efficiency when handling new requests, reducing startup times and resource usage.
- **Advanced Web Interaction Techniques:** Employing a sophisticated array of methodsâincluding browser-based, browserless, and proprietary approaches
- **Intelligent Request Handling:** From smart proxy selection to advanced queuing, every aspect of the request process is optimized for speed and reliability.
With this technology, Fire Engine allows firecrawl to handle millions of requests daily with speed and accuracy.
### Try Fire Engine on Firecrawl Today
Fire Engine powers Firecrawl to handle thousands of daily requests efficiently. Itâs currently available exclusively through Firecraw cloud, Developers can test Fire Engineâs capabilities by signing up for [Firecrawl](https://www.firecrawl.dev/).
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Eric Ciarla@ericciarla](https://x.com/ericciarla)
Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB.
Previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users.
### More articles by Eric Ciarla
[How to Create an llms.txt File for Any Website\\
\\
Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [Cloudflare Error 1015: How to solve it?\\
\\
Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner.](https://www.firecrawl.dev/blog/cloudflare-error-1015-how-to-solve-it) [Build an agent that checks for website contradictions\\
\\
Using Firecrawl and Claude to scrape your website's data and look for contradictions.](https://www.firecrawl.dev/blog/contradiction-agent) [Why Companies Need a Data Strategy for Generative AI\\
\\
Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses\\
\\
A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [How to easily install requests with pip and python\\
\\
A tutorial on installing the requests library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-easily-install-requests-with-pip-and-python) [How to quickly install BeautifulSoup with Python\\
\\
A guide on installing the BeautifulSoup library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python) [How to Use OpenAI's o1 Reasoning Models in Your Applications\\
\\
Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications)
## Firecrawl Launch Week Recap
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
September 2, 2024
â˘
[Eric Ciarla](https://x.com/ericciarla)
# Launch Week I Recap

## Introduction
Last week marked an exciting milestone for Firecrawl as we kicked off our inaugural Launch Week, unveiling a series of new features and updates designed to enhance your web scraping experience. Letâs take a look back at the improvements we introduced throughout the week.
## [Day 1: Introducing Teams](https://firecrawl.dev/blog/launch-week-i-day-1-introducing-teams)
We started Launch Week by introducing our highly anticipated Teams feature. Teams enables seamless collaboration on web scraping projects, allowing you to work alongside your colleagues and tackle complex data gathering tasks together. With updated pricing plans to accommodate teams of all sizes, Firecrawl is now an excellent platform for collaborative web scraping.
## [Day 2: Increased Rate Limits](https://firecrawl.dev/blog/launch-week-i-day-2-doubled-rate-limits)
On Day 2, we improved your data collection capabilities by doubling the rate limits for our /scrape endpoint across all plans. This means you can now gather more data in the same amount of time, enabling you to take on larger projects and scrape more frequently.
## [Day 3: Introducing the Map Endpoint (Alpha)](https://firecrawl.dev/blog/launch-week-i-day-3-introducing-map-endpoint)
Day 3 saw the unveiling of our new Map endpoint, which allows you to transform a single URL into a comprehensive map of an entire website quickly. As a fast and easy way to gather all the URLs on a website, the Map endpoint opens up new possibilities for your web scraping projects.
## [Day 4: Introducing Firecrawl /v1](https://firecrawl.dev/blog/launch-week-i-day-4-introducing-firecrawl-v1)
Day 4 marked a significant release: Firecrawl /v1. This more reliable and developer-friendly API makes gathering web data easier. With new scrape formats, improved crawl status, enhanced markdown parsing, v1 support for all SDKs (including new Go and Rust SDKs), and an improved developer experience, v1 enhances your web scraping workflow.
## [Day 5: Real-Time Crawling with WebSockets](https://firecrawl.dev/blog/launch-week-i-day-5-real-time-crawling-websockets)
On Day 5, we introduced a new feature: Real-Time Crawling with WebSockets. Our WebSocket-based method, Crawl URL and Watch, enables real-time data extraction and monitoring, allowing you to process data immediately, react to errors quickly, and know precisely when your crawl is complete.
## [Day 6: LLM Extract (v1)](https://firecrawl.dev/blog/launch-week-i-day-6-llm-extract)
Day 6 brought v1 support for LLM Extract, enabling you to extract structured data from web pages using the extract format in /scrape. With the ability to pass a schema or just provide a prompt, LLM extraction is now more flexible and powerful.
## [Day 7: Crawl Webhooks (v1)](https://firecrawl.dev/blog/launch-week-i-day-7-webhooks)
We wrapped up Launch Week with the introduction of /crawl webhook support. You can now send notifications to your apps during a crawl, with four types of events: crawl.started, crawl.page, crawl.completed, and crawl.failed. This feature allows for more seamless integration of Firecrawl into your workflows.
## Wrapping Up
Launch Week showcased our commitment to continually evolving and improving Firecrawl to meet the needs of our users. From collaborative features like Teams to performance improvements like increased rate limits, and from new endpoints like Map and Extract to real-time capabilities with WebSockets and Webhooks, weâve expanded the possibilities for your web scraping projects.
Weâd like to thank our community for your support, feedback, and enthusiasm throughout Launch Week and beyond. Your input drives us to innovate and push the boundaries of whatâs possible with web scraping.
Stay tuned for more updates as we continue to shape the future of data gathering together. Happy scraping!
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Eric Ciarla@ericciarla](https://x.com/ericciarla)
Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB.
Previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users.
### More articles by Eric Ciarla
[How to Create an llms.txt File for Any Website\\
\\
Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [Cloudflare Error 1015: How to solve it?\\
\\
Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner.](https://www.firecrawl.dev/blog/cloudflare-error-1015-how-to-solve-it) [Build an agent that checks for website contradictions\\
\\
Using Firecrawl and Claude to scrape your website's data and look for contradictions.](https://www.firecrawl.dev/blog/contradiction-agent) [Why Companies Need a Data Strategy for Generative AI\\
\\
Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses\\
\\
A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [How to easily install requests with pip and python\\
\\
A tutorial on installing the requests library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-easily-install-requests-with-pip-and-python) [How to quickly install BeautifulSoup with Python\\
\\
A guide on installing the BeautifulSoup library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python) [How to Use OpenAI's o1 Reasoning Models in Your Applications\\
\\
Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications)
## Prompt Caching Guide
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Aug 14, 2024
â˘
[Eric Ciarla](https://x.com/ericciarla)
# How to Use Prompt Caching and Cache Control with Anthropic Models

Anthropic recently launched prompt caching and cache control in beta, allowing you to cache large context prompts up to 200k tokens and chat with them faster and cheaper than ever before. This is a game changer for Retrieval Augmented Generation (RAG) applications that analyze large amounts of data. Currently caching is only avialable for Sonnet and Haiku but it is coming soon to Opus.
To showcase the power of prompt caching, letâs walk through an example of crawling a website with Firecrawl, caching the contents with Anthropic, and having an AI assistant analyze the copy to provide suggestions for improvement. [See the code on Github.](https://github.com/ericciarla/prompt_caching_websites_anthropic)
## Setup
First, make sure you have API keys for both Anthropic and Firecrawl. Store them securely in a `.env` file:
```
ANTHROPIC_API_KEY=your_anthropic_key
FIRECRAWL_API_KEY=your_firecrawl_key
```
Install the required Python packages:
```
pip install python-dotenv anthropic firecrawl requests
```
## Crawling a Website with Firecrawl
Initialize the Firecrawl app with your API key:
```python
app = FirecrawlApp(api_key=firecrawl_api_key)
```
Crawl a website, limiting the results to 10 pages:
```python
crawl_url = 'https://dify.ai/'
params = {
'crawlOptions': {
'limit': 10
}
}
crawl_result = app.crawl_url(crawl_url, params=params)
```
Clean up the crawl results by removing the `content` field from each entry and save it to a file:
```python
cleaned_crawl_result = [{k: v for k, v in entry.items() if k != 'content'} for entry in crawl_result]
with open('crawl_result.txt', 'w') as file:
file.write(json.dumps(cleaned_crawl_result, indent=4))
```
## Caching the Crawl Data with Anthropic
Load the crawl data into a string:
```python
website_dump = open('crawl_result.txt', 'r').read()
```
Set up the headers for the Anthropic API request, including the `anthropic-beta` header to enable prompt caching:
```python
headers = {
"content-type": "application/json",
"x-api-key": anthropic_api_key,
"anthropic-version": "2023-06-01",
"anthropic-beta": "prompt-caching-2024-07-31"
}
```
Construct the API request data, adding the `website_dump` as an ephemeral cached text:
```python
data = {
"model": "claude-3-5-sonnet-20240620",
"max_tokens": 1024,
"system": [\
{\
"type": "text",\
"text": "You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.\n"\
},\
{\
"type": "text",\
"text": website_dump,\
"cache_control": {"type": "ephemeral"}\
}\
],
"messages": [\
{\
"role": "user",\
"content": "How can I improve the copy on this website?"\
}\
]
}
```
Make the API request and print the response:
```python
response = requests.post(
"https://api.anthropic.com/v1/messages",
headers=headers,
data=json.dumps(data)
)
print(response.json())
```
The key parts here are:
1. Including the `anthropic-beta` header to enable prompt caching
2. Adding the large `website_dump` text as a cached ephemeral text in the `system` messages
3. Asking the assistant to analyze the cached text and provide suggestions
## Benefits of Prompt Caching
By caching the large `website_dump` text, subsequent API calls can reference that data without needing to resend it each time. This makes conversations much faster and cheaper.
Imagine expanding this to cache an entire knowledge base with up to 200k tokens of data. You can then have highly contextual conversations drawing from that knowledge base in a very efficient manner. The possibilities are endless!
Anthropicâs prompt caching is a powerful tool for building AI applications that can process and chat about large datasets. Give it a try and see how it can enhance your projects!
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Eric Ciarla@ericciarla](https://x.com/ericciarla)
Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB.
Previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users.
### More articles by Eric Ciarla
[How to Create an llms.txt File for Any Website\\
\\
Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [Cloudflare Error 1015: How to solve it?\\
\\
Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner.](https://www.firecrawl.dev/blog/cloudflare-error-1015-how-to-solve-it) [Build an agent that checks for website contradictions\\
\\
Using Firecrawl and Claude to scrape your website's data and look for contradictions.](https://www.firecrawl.dev/blog/contradiction-agent) [Why Companies Need a Data Strategy for Generative AI\\
\\
Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses\\
\\
A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [How to easily install requests with pip and python\\
\\
A tutorial on installing the requests library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-easily-install-requests-with-pip-and-python) [How to quickly install BeautifulSoup with Python\\
\\
A guide on installing the BeautifulSoup library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python) [How to Use OpenAI's o1 Reasoning Models in Your Applications\\
\\
Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications)
## Crunchbase Data Scraping
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Jan 31, 2025
â˘
[Bex Tuychiev](https://x.com/bextuychiev)
# Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude

## Introduction
In todayâs data-driven business world, having access to accurate information about companies and their funding history is incredibly valuable. There are several online databases that track startups, investments, and company growth, containing details about millions of businesses, their funding rounds, and investors. While many of these platforms offer APIs, they can be expensive and out of reach for many users. This tutorial will show you how to build a web scraper that can gather company and funding data from public sources using Python, Firecrawl, and Claude.
This guide is designed for developers who want to collect company data efficiently and ethically. By the end of this tutorial, youâll have a working tool that can extract company details, funding rounds, and investor information from company profiles across the web.
Here is the preview of the app:

The application provides two input methods for users - they can either upload a file containing company names or enter them directly as text. Behind the scenes, Firecrawl automatically scrapes relevant company information from public databases like Crunchbase. This scraped data is then processed by Claude, an AI assistant that generates concise company summaries. The results are displayed in a clean Streamlit interface, complete with a download option that exports all findings to a CSV file for further analysis.
## Table of Contents
1. Introduction
2. Setting up the Development Environment
3. Prerequisite: Scraping with Firecrawl
4. Building the Funding Data Scraper
- Step 1: Adding brief app information
- Step 2: Adding components for company name input
- Step 3: Building a scraping class
- Step 4: Adding a scraping button
- Step 5: Creating a download button
- Step 6: Generating summaries
- Step 7: Deployment
5. Conclusion
**Time to Complete:** ~60 minutes
**Prerequisites:**
- Python 3.10+
- Basic Python knowledge
- API keys for Firecrawl and Claude
**Important Note:** This tutorial demonstrates web scraping for educational purposes. Always review and comply with websitesâ terms of service and implement appropriate rate limiting in production environments.
## Setting up the Development Environment
Letâs start by setting up our development environment and installing the necessary dependencies.
1. **Create a working directory**
First, create a working directory:
```bash
mkdir company-data-scraper
cd company-data-scraper
```
2. **Install dependencies**
Weâll use Poetry for dependency management. If you havenât installed Poetry yet:
```bash
curl -sSL https://install.python-poetry.org | python3 -
```
Then, initialize it inside the current working directory:
```bash
poetry init
```
Type â^3.10â when asked for the Python version but, donât specify the dependencies interactively.
Next, install the project dependencies with the `add` command:
```bash
poetry add streamlit firecrawl-py pandas pydantic openpyxl python-dotenv anthropic
```
3. **Build the project structure**
```bash
mkdir data src
touch .gitignore README.md .env src/{app.py,models.py,scraper.py}
```
The created files serve the following purposes:
- `data/` \- Directory to store input files and scraped results
- `src/` \- Source code directory containing the main application files
- `.gitignore` \- Specifies which files Git should ignore
- `README.md` \- Project documentation and setup instructions
- `.env` \- Stores sensitive configuration like API keys
- `src/app.py` \- Main Streamlit application and UI code
- `src/models.py` \- Data models and validation logic
- `src/scraper.py` \- Web scraping and data collection functionality
4. **Configure environment variables**
This project requires two accounts of third-party services:
- [Firecrawl](https://firecrawl.dev/) for AI-powered web scraping
- [Anthropic (Claude)](https://console.anthropic.com/) for summarizing scraped data
Click on the hyperlinks above to create your accounts and copy/generate your API keys. Then, Inside the `.env` file in the root directory, add your API keys:
```plaintext
FIRECRAWL_API_KEY=your_api_key_here
ANTHROPIC_API_KEY=your_api_key_here
```
The `.env` file is used to store sensitive configuration like API keys securely.The `python-dotenv` package will automatically load these environment variables when the app starts. It should never be committed to version control so add the following line to your `.gitignore` file:
```plaintext
.env
```
5. **Start the app UI**
Run the Streamlit app (which is blank just now) to ensure everything is working:
```bash
poetry run streamlit run src/app.py
```
You should see the Streamlit development server start up and your default browser open to the appâs interface. Keep this tab open to see the changes we make to the app in the next steps.
Now that we have our development environment set up, letâs cover how Firecrawl works, which is a prerequisite to building our app.
## Prerequisite: Scraping with Firecrawl
The biggest challenge with any application that scrapes websites is maintenance. Since websites regularly update their layout and underlying HTML/CSS code, traditional scrapers break easily, making the entire app useless. Firecrawl solves this exact problem by allowing you to scrape websites using natural language.
Instead of writing complex CSS selectors and XPath expressions that need constant maintenance, you can simply describe what data you want to extract in plain English. Firecrawlâs AI will figure out how to get that data from the page, even if the websiteâs structure changes. This makes our scraper much more reliable and easier to maintain over time.
Here is a simple Firecrawl workflow we will later use in the app to scrape company information:
```python
from firecrawl import FirecrawlApp
from pydantic import BaseModel, Field
from dotenv import load_dotenv
load_dotenv()
```
```plaintext
True
```
We import `FirecrawlApp` to interact with the Firecrawl API for web scraping. `BaseModel` and `Field` from pydantic help us define structured data models with validation. The `load_dotenv` function loads environment variables from our `.env` file to securely access API keys.
```python
# Define the data structure we want to extract
class CompanyData(BaseModel):
name: str = Field(description="Company name")
funding_total: str = Field(description="Total funding amount")
employee_count: str = Field(description="Number of employees")
industry: str = Field(description="Primary industry or sector")
founded_year: str = Field(
description="Year the company was founded"
) # Initialize Firecrawl
```
Next, we define a Pydantic data model specifying the fields we want to extract from a website. Firecrawl will follow this schema to the letter - detecting the relevant HTML/CSS selectors containing this information and returning them in a simple JSON object. Here, the `Field` descriptions written in plain English are important as they guide the underlying Firecrawl AI to capture the required fields.
```python
app = FirecrawlApp()
# Scrape company data from Crunchbase
data = app.extract(
urls=["https://www.crunchbase.com/organization/openai"],
params={
"schema": CompanyData.model_json_schema(), # Use our schema for extraction
"prompt": "Extract key company information from the page",
},
)
```
We then initialize a `FirecrawlApp` instance and call its `extract` method, passing in the URL for OpenAIâs Crunchbase page. The `params` dictionary configures the scraping behavior - we provide our `CompanyData` schema to guide the structured data extraction. We also include a prompt to help direct the extraction process.
The scraped data is returned in a format matching our schema, which we can then parse into a CompanyData object for easy access to the extracted fields, as shown in the following code block.
```python
# Access the extracted data
company = CompanyData(**data["data"])
print(f"Company: {company.name}")
print(f"Funding: {company.funding_total}")
print(f"Employees: {company.employee_count}")
print(f"Industry: {company.industry}")
print(f"Founded: {company.founded_year}")
```
```plaintext
Company: OpenAI
Funding: null
Employees: 251-500
Industry: Artificial Intelligence (AI)
Founded: 2015
```
In a later step, we will integrate this process into our app but will use the `batch_scrape_urls` method instead of `extract` to enable concurrent scraping.
## Building the Funding Data Scraper Step-by-Step
We will take a top-down approach to building the app: starting with the high-level UI components and user flows, then implementing the underlying functionality piece by piece. This approach will help us validate the appâs usability early and ensure weâre building exactly what users need.
### Step 1: Adding brief app information
We turn our focus to the `src/app.py` file and make the following imports:
```python
import streamlit as st
import pandas as pd
import anthropic
from typing import List
from dotenv import load_dotenv
# from scraper import CrunchbaseScraper
load_dotenv()
```
The imports above serve the following purposes:
- `streamlit`: Provides the web interface components and app framework
- `pandas`: Used for data manipulation and CSV file handling
- `anthropic`: Client library for accessing Claude AI capabilities
- `typing.List`: Type hint for lists to improve code readability
- `dotenv`: Loads environment variables from `.env` file for configuration
Currently, the `CrunchbaseScraper` class is commented out since we are yet to write it.
Next, we create a `main` function that holds the core UI components:
```python
def main():
st.title("Crunchbase Company Data Scraper")
st.write(
"""
Extract detailed company information from Crunchbase including funding data,
employee counts, industries, and more. Upload a file with company names or
enter them manually below.
"""
)
```
Right now, the function gives brief info about the appâs purpose. To run the app, add the following `main` block to the end of `src/app.py`:
```python
if __name__ == "__main__":
main()
```
You should see the change in the Streamlit development server.
### Step 2: Adding components for company name input
In this step, we add a new function to `src/app.py`:
```python
def load_companies(file) -> List[str]:
"""Load company names from uploaded file"""
companies = []
for line in file:
company = line.decode("utf-8").strip()
if company: # Skip empty lines
companies.append(company)
return companies
```
The `load_companies` function takes a file object as input and parses it line by line, extracting company names. It decodes each line from bytes to UTF-8 text, strips whitespace, and skips any empty lines. The function returns a list of company names that can be used for scraping Crunchbase data.
Now, make the following changes to the `main` function:
```python
def main():
st.title("Crunchbase Company Data Scraper")
st.write(
"""
Extract detailed company information from Crunchbase including funding data,
employee counts, industries, and more. Upload a file with company names or
enter them manually below.
"""
)
# File upload option
uploaded_file = st.file_uploader(
"Upload a text file with company names (one per line)", type=["txt"]
)
# Manual input option
st.write("### Or Enter Companies Manually")
manual_input = st.text_area(
"Enter company names (one per line)",
height=150,
help="Enter each company name on a new line",
)
```
In this version, weâve added two main ways for users to input company names: file upload and manual text entry. The file upload component accepts `.txt` files and for manual entry, users can type or paste company names directly into a text area, with each name on a new line. This provides flexibility for users whether they have a prepared list or want to enter names ad-hoc.
Furthermore, add these two blocks of code after the input components:
```python
def main():
...
companies = []
if uploaded_file:
companies = load_companies(uploaded_file)
st.write(f"Loaded {len(companies)} companies from file")
elif manual_input:
companies = [line.strip() for line in manual_input.split("\n") if line.strip()]
st.write(f"Found {len(companies)} companies in input")
```
This code block processes the user input to create a list of company names. When a file is uploaded, it uses the `load_companies()` function to read and parse the file contents. For manual text input, it splits the input text by newlines and strips whitespace to extract company names. In both cases, it displays a message showing how many companies were found. The companies list will be used later for scraping data from funding data sources.
### Step 3: Building a scraping class with Firecrawl
Letâs take a look at the snapshot of the final UI once again:

In this step, we implement the backend process that happens when a user clicks on âStart scrapingâ button. To do so, we use Firecrawl like we outlined in the prerequisites section. First, go to `src/models.py` script to write the data model we are going to use to scrape company and funding information:
```python
from pydantic import BaseModel
from typing import List, Optional
class CompanyData(BaseModel):
name: str
about: Optional[str]
employee_count: Optional[str]
financing_type: Optional[str]
industries: List[str] = []
headquarters: List[str] = []
founders: List[str] = []
founded_date: Optional[str]
operating_status: Optional[str]
legal_name: Optional[str]
stock_symbol: Optional[str]
acquisitions: List[str] = []
investments: List[str] = []
exits: List[str] = []
total_funding: Optional[str]
contacts: List[str] = []
```
This data model is more detailed and tries to extract as much information as possible from given sources. Now, switch to `src/scraper.py` where we implement a class called `CrunchbaseScraper`:
```python
from firecrawl import FirecrawlApp
from models import CompanyData
from typing import List, Dict
class CrunchbaseScraper:
def __init__(self):
self.app = FirecrawlApp()
def scrape_companies(self, urls: List[str]) -> List[Dict]:
"""Scrape multiple Crunchbase company profiles"""
schema = CompanyData.model_json_schema()
try:
data = self.app.batch_scrape_urls(
urls,
params={
"formats": ["extract"],
"extract": {
"prompt": """Extract information from given pages based on the schema provided.""",
"schema": schema,
},
},
)
return [res["extract"] for res in data["data"]]
except Exception as e:
print(f"Error while scraping companies: {str(e)}")
return []
```
Letâs break down how the class works.
When the class is initialized, it creates an instance of `FirecrawlApp`. The main method `scrape_companies` takes a list of URLs and returns a list of dictionaries containing the scraped data. It works by:
1. Getting the JSON schema from our `CompanyData` model to define the structure
2. Using `batch_scrape_urls` to process multiple URLs at once
3. Configuring the scraper to use the âextractâ format with our schema
4. Providing a prompt that instructs the scraper how to extract the data
5. Handling any errors that occur during scraping
Error handling ensures the script continues running even if individual URLs fail, returning an empty list in case of errors rather than crashing.
Now, the only thing left to do to finalize the scraping feature is to add the âStart Scrapingâ button to the UI.
### Step 4: Adding a button to start scraping
In this step, return to `src/app.py` and add the following code block to the very end of the `main()` function:
```python
def main():
...
if companies and st.button("Start Scraping"):
scraper = CrunchbaseScraper()
with st.spinner("Scraping company data from Crunchbase..."):
try:
# Convert company names to Crunchbase URLs
urls = [\
f"https://www.crunchbase.com/organization/{name.lower().replace(' ', '-')}"\
for name in companies\
]
results = scraper.scrape_companies(urls)
except Exception as e:
st.error(f"An error occurred: {str(e)}")
```
This code block builds on the previous functionality by adding the core scraping logic. When the âStart Scrapingâ button is clicked (and companies have been provided), it:
1. Creates a new instance of our `CrunchbaseScraper` class
2. Shows a loading spinner to indicate scraping is in progress
3. Converts the company names into proper Crunchbase URLs by:
- Converting to lowercase
- Replacing spaces with hyphens
- Adding the base Crunchbase URL prefix
4. Calls the `scrape_companies` method we created earlier to fetch the data
The try-except block ensures any scraping errors are handled gracefully rather than crashing the application. This is important since web scraping can be unpredictable due to network issues, rate limiting, and so on.
To finish this step, uncomment the single import at the top of `src/app.py` so that they look like this:
```python
import streamlit as st
import pandas as pd
import anthropic
from typing import List
from scraper import CrunchbaseScraper
from dotenv import load_dotenv
load_dotenv()
```
### Step 5: Creating a download button for the scraped results
Now, we must create a button to download the scraped results as a CSV file. To do so, add the following code block after the scraping part:
```python
def main():
...
if companies and st.button("Start Scraping"):
scraper = CrunchbaseScraper()
with st.spinner("Scraping company data from Crunchbase..."):
try:
...
results = scraper.scrape_companies(urls)
# THIS PART IS NEW
df = pd.DataFrame(results)
csv = df.to_csv(index=False)
# Create download button
st.download_button(
"Download Results (CSV)",
csv,
"crunchbase_data.csv",
"text/csv",
key="download-csv",
)
except Exception as e:
st.error(f"An error occurred: {str(e)}")
```
In the new lines of code, we convert the results to a Pandas dataframe and use its `to_csv()` function to save the dataframe as a CSV file. The method returns a filename, which we pass to the `st.download_button` method along with other details.
### Step 6: Generating a summary of scraped results
After scraping the raw company data, we can use Claude to generate concise summaries that highlight key insights. Letâs add this functionality to our app. First, create a new function in `src/app.py` to handle the summarization:
```python
def generate_company_summary(company_data: dict) -> str:
"""Generate a summary of the company data"""
client = anthropic.Anthropic()
message = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1000,
system="You are a company & funding data expert. Summarize the given company data by the user in a few sentences.",
messages=[\
{"role": "user", "content": [{"type": "text", "text": str(company_data)}]}\
],
)
return message.content[0].text
```
Now, update the scraping section in the `main()` function to include the summary generation after the download button:
```python
def main():
...
if companies and st.button("Start Scraping"):
scraper = CrunchbaseScraper()
with st.spinner("Scraping company data from Crunchbase..."):
try:
...
# Give summary of each company
for company in results:
summary = generate_company_summary(company)
st.write(f"### Summary of {company['name']}")
st.write(summary)
except Exception as e:
st.error(f"An error occurred: {str(e)}")
```
This implementation:
1. Creates a new `generate_company_summary()` function that:
- Formats the scraped company data into readable text
- Uses Claude to analyze the data and generate insights
- Returns a structured summary highlighting key patterns.
2. Updates the main scraping workflow to:
- Generate the summary after scraping is complete
- Display the insights for each company after the download button
The summary can provide context about the scraped data, helping users get the gist of the scraped data.
### Step 7: Deploying the app to Streamlit Cloud
Now that our app is working locally, letâs deploy it to Streamlit Cloud so others can use it. First, we need to prepare our project for deployment.
1. **Create a requirements.txt**
Since Streamlit Cloud doesnât support Poetry directly, we need to convert our dependencies to a `requirements.txt` file. Run this command in your terminal:
```bash
poetry export -f requirements.txt --output requirements.txt --without-hashes
```
2. **Create a GitHub repository**
Initialize a Git repository and push your code to GitHub:
```python
git init
git add .
git commit -m "Initial commit"
git branch -M main
git remote add origin https://github.com/yourusername/company-data-scraper.git
git push -u origin main
```
3. **Add secrets to Streamlit Cloud**
Visit [share.streamlit.io](https://share.streamlit.io/) and connect your GitHub account. Then:
1. Click âNew appâ
2. Select your repository and branch
3. Set the main file path as src/app.py
4. Click âAdvanced settingsâ and add your environment variables:
- `FIRECRAWL_API_KEY`
- `ANTHROPIC_API_KEY`
4. **Update imports for deployment**
Sometimes local imports need adjustment for Streamlit Cloud. Ensure your imports in src/app.py use relative paths:
```python
from .models import CompanyData
from .scraper import CrunchbaseScraper
```
5. **Add a .streamlit/config.toml file**
Create a `.streamlit` directory and add a `config.toml` file for custom theme settings:
```python
[theme]
primaryColor = "#FF4B4B"
backgroundColor = "#FFFFFF"
secondaryBackgroundColor = "#F0F2F6"
textColor = "#262730"
font = "sans serif"
[server]
maxUploadSize = 5
```
6. **Create a README.md file**
Add a README.md file to help users understand your app:
```markdown
# Crunchbase Company Data Scraper
A Streamlit app that scrapes company information and funding data from Crunchbase.
## Features
- Bulk scraping of company profiles
- AI-powered data summarization
- CSV export functionality
- Clean, user-friendly interface
## Setup
1. Clone the repository
2. Install dependencies: `pip install -r requirements.txt`
3. Set up environment variables in `.env`:
- `FIRECRAWL_API_KEY`
- `ANTHROPIC_API_KEY`
4. Run the app: `streamlit run src/app.py`
## Usage
1. Enter company names (one per line) or upload a text file
2. Click "Start Scraping"
3. View AI-generated insights
4. Download results as CSV
## License
MIT
```
7. **Deploy the app**
After pushing all changes to GitHub, go back to Streamlit Cloud and:
1. Click âDeployâ
2. Wait for the build process to complete
3. Your app will be live at `https://share.streamlit.io/yourusername/company-data-scraper/main`
4. **Monitor and maintain**
After deployment:
- Check the app logs in Streamlit Cloud for any issues
- Monitor API usage and rate limits
- Update dependencies periodically
- Test the app regularly with different inputs
The deployed app will automatically update whenever you push changes to your GitHub repository. Streamlit Cloud provides free hosting for public repositories, making it an excellent choice for sharing your scraper with others.
## Conclusion
In this tutorial, weâve built a powerful web application that combines the capabilities of Firecrawl and Claude to extract and analyze company data at scale. By leveraging Firecrawlâs AI-powered scraping and Claudeâs natural language processing, weâve created a tool that not only gathers raw data but also provides meaningful insights about companies and their funding landscapes. The Streamlit interface makes the tool accessible to users of all technical levels, while features like bulk processing and CSV export enable efficient data collection workflows.
### Limitations and Considerations
- Rate limiting: Implement appropriate delays between requests
- Data accuracy: Always verify scraped data against official sources
- API costs: Monitor usage to stay within budget
- Maintenance: Website structure changes may require updates
### Next Steps
Consider these enhancements for your implementation:
- Add data validation and cleaning
- Implement request caching
- Add data visualizations
- Include historical data tracking
- Implement error retry logic
### Resources
- [Firecrawl Documentation](https://firecrawl.dev/docs)
- [Claude API Documentation](https://docs.anthropic.com/claude/docs)
- [Streamlit Deployment Guide](https://docs.streamlit.io/streamlit-cloud)
- [Firecrawlâs scrape endpoint](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint)
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Bex Tuychiev@bextuychiev](https://x.com/bextuychiev)
Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics
### More articles by Bex Tuychiev
[Building an Automated Price Tracking Tool\\
\\
Build an automated e-commerce price tracker in Python. Learn web scraping, price monitoring, and automated alerts using Firecrawl, Streamlit, PostgreSQL.](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python) [Web Scraping Automation: How to Run Scrapers on a Schedule\\
\\
Learn how to automate web scraping in Python using free tools like schedule, asyncio, cron jobs and GitHub Actions. This comprehensive guide covers local and cloud-based scheduling methods to run scrapers reliably in 2025.](https://www.firecrawl.dev/blog/automated-web-scraping-free-2025) [Automated Data Collection - A Comprehensive Guide\\
\\
Learn how to build robust automated data collection systems using modern tools and best practices. This guide covers everything from selecting the right tools to implementing scalable collection pipelines.](https://www.firecrawl.dev/blog/automated-data-collection-guide) [BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python\\
\\
Learn the key differences between BeautifulSoup4 and Scrapy for web scraping in Python. Compare their features, performance, and use cases to choose the right tool for your web scraping needs.](https://www.firecrawl.dev/blog/beautifulsoup4-vs-scrapy-comparison) [How to Build an Automated Competitor Price Monitoring System with Python\\
\\
Learn how to build an automated competitor price monitoring system in Python that tracks prices across e-commerce sites, provides real-time comparisons, and maintains price history using Firecrawl, Streamlit, and GitHub Actions.](https://www.firecrawl.dev/blog/automated-competitor-price-scraping) [Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude\\
\\
Learn how to build a web scraper in Python that gathers company details, funding rounds, and investor information from public sources like Crunchbase using Firecrawl and Claude for automated data collection and analysis.](https://www.firecrawl.dev/blog/crunchbase-scraping-with-firecrawl-claude) [How to Create Custom Instruction Datasets for LLM Fine-tuning\\
\\
Learn how to build high-quality instruction datasets for fine-tuning large language models (LLMs). This guide covers when to create custom datasets, best practices for data collection and curation, and a practical example of building a code documentation dataset.](https://www.firecrawl.dev/blog/custom-instruction-datasets-llm-fine-tuning) [Data Enrichment: A Complete Guide to Enhancing Your Data Quality\\
\\
Learn how to enrich your data quality with a comprehensive guide covering data enrichment tools, best practices, and real-world examples. Discover how to leverage modern solutions like Firecrawl to automate data collection, validation, and integration for better business insights.](https://www.firecrawl.dev/blog/complete-guide-to-data-enrichment)
## Automated Price Tracking
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Dec 9, 2024
â˘
[Bex Tuychiev](https://x.com/bextuychiev)
# Building an Automated Price Tracking Tool

There is a lot to be said about the psychology of discounts. For example, buying a discounted item we donât need isnât saving money at all - itâs falling for one of the oldest sales tactics. However, there are legitimate cases where waiting for a price drop on items you actually need makes perfect sense.
The challenge is that e-commerce websites run flash sales and temporary discounts constantly, but these deals often disappear as quickly as they appear. Missing these brief windows of opportunity can be frustrating.
Thatâs where automation comes in. In this guide, weâll build a Python application that monitors product prices across any e-commerce website and instantly notifies you when prices drop on items youâre actually interested in. Here is a sneak peek of the app:

The app has a simple appearance but provides complete functionality:
- It has a minimalistic UI to add or remove products from the tracker
- A simple dashboard to display price history for each product
- Controls for setting the price drop threshold in percentages
- A notification system that sends Discord alerts when a tracked itemâs price drops
- A scheduling system that updates the product prices on an interval you specify
- Runs for free for as long as you want
Even though the title says âAmazon price trackerâ (full disclosure: I was forced to write that for SEO purposes), the app will work for any e-commerce website you can imagine (except Ebay, for some reason).
So, letâs get started building this Amazon price tracker.
## The Toolstack We Will Use
The app will be built using Python and these libraries::
- [Streamlit](https://www.streamlit.io/) for the UI
- [Firecrawl](https://www.firecrawl.dev/) for AI-based scraping of e-commerce websites
- [SQLAlchemy](https://www.sqlalchemy.org/) for database management
In addition to Python, we will use these platforms:
- Discord for notifications
- GitHub for hosting the app
- GitHub Actions for running the app on a schedule
- Supabase for hosting a free Postgres database instance
## Building an Amazon Price Tracker App Step-by-step
Since this project involves multiple components working together, weâll take a top-down approach rather than building individual pieces first. This approach makes it easier to understand how everything fits together, since weâll introduce each tool only when itâs needed. The benefits of this strategy will become clear as we progress through the tutorial.
### Step 1: Setting up the environment
First, letâs create a dedicated environment on our machines to work on the project:
```bash
mkdir automated-price-tracker
cd automated-price-tracker
python -m venv .venv
source .venv/bin/activate
```
These commands create a working directory and activate a virtual environment. Next, create a new script called `ui.py` for designing the user interface with Streamlit.
```bash
touch ui.py
```
Then, install Streamlit:
```bash
pip install streamlit
```
Next, create a `requirements.txt` file and add Streamlit as the first dependency:
```bash
touch requirements.txt
echo "streamlit\n" >> requirements.txt
```
Since the code will be hosted on GitHub, we need to initialize Git and create a `.gitignore` file:
```bash
git init
touch .gitignore
echo ".venv" >> .gitignore # Add the virtual env folder
git commit -m "Initial commit"
```
### Step 2: Add a sidebar to the UI for product input
Letâs take a look at the final product one more time:

It has two sections: the sidebar and the main dashboard. Since the first thing you do when launching this app is adding products, we will start building the sidebar first. Open `ui.py` and paste the following code:
```python
import streamlit as st
# Set up sidebar
with st.sidebar:
st.title("Add New Product")
product_url = st.text_input("Product URL")
add_button = st.button("Add Product")
# Main content
st.title("Price Tracker Dashboard")
st.markdown("## Tracked Products")
```
The code snippet above sets up a basic Streamlit web application with two main sections. In the sidebar, it creates a form for adding new products with a text input field for the product URL and an âAdd Productâ button. The main content area contains a dashboard title and a section header for tracked products. The code uses Streamlitâs `st.sidebar` context manager to create the sidebar layout and basic Streamlit components like `st.title`, `st.text_input`, and `st.button` to build the user interface elements.
To see how this app looks like, run the following command:
```bash
streamlit run ui.py
```
Now, letâs add a commit to save our progress:
```bash
git add .
git commit -m "Add a sidebar to the basic UI"
```
### Step 3: Add a feature to check if input URL is valid
In the next step, we want to add some restrictions to the input field like checking if the passed URL is valid. For this, create a new file called `utils.py` where we write additional utility functions for our app:
```bash
touch utils.py
```
Inside the script, paste following code:
```bash
# utils.py
from urllib.parse import urlparse
import re
def is_valid_url(url: str) -> bool:
try:
# Parse the URL
result = urlparse(url)
# Check if scheme and netloc are present
if not all([result.scheme, result.netloc]):
return False
# Check if scheme is http or https
if result.scheme not in ["http", "https"]:
return False
# Basic regex pattern for domain validation
domain_pattern = (
r"^[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(\.[a-zA-Z]{2,})+$"
)
if not re.match(domain_pattern, result.netloc):
return False
return True
except Exception:
return False
```
The above function `is_valid_url()` validates URLs by checking several criteria:
1. It verifies the URL has both a scheme ( `http`/ `https`) and domain name
2. It ensures the scheme is specifically `http` or `https`
3. It validates the domain name format using regex to check for valid characters and TLD
4. It returns True only if all checks pass, False otherwise
Letâs use this function in our `ui.py` file. Here is the modified code:
```python
import streamlit as st
from utils import is_valid_url
# Set up sidebar
with st.sidebar:
st.title("Add New Product")
product_url = st.text_input("Product URL")
add_button = st.button("Add Product")
if add_button:
if not product_url:
st.error("Please enter a product URL")
elif not is_valid_url(product_url):
st.error("Please enter a valid URL")
else:
st.success("Product is now being tracked!")
# Main content
...
```
Here is whatâs new:
1. We added URL validation using the `is_valid_url()` function from `utils.py`
2. When the button is clicked, we perform validation:
- Check if URL is empty
- Validate URL format using `is_valid_url()`
3. User feedback is provided through error/success messages:
- Error shown for empty URL
- Error shown for invalid URL format
- Success message when URL passes validation
Rerun the Streamlit app again and see if our validation works. Then, return to your terminal to commit the changes weâve made:
```bash
git add .
git commit -m "Add a feature to check URL validity"
```
### Step 4: Scrape the input URL for product details
When a valid URL is entered and the add button is clicked, we need to implement product scraping functionality instead of just showing a success message. The system should:
1. Immediately scrape the product URL to extract key details:
- Product name
- Current price
- Main product image
- Brand name
- Other relevant attributes
2. Store these details in a database to enable:
- Regular price monitoring
- Historical price tracking
- Price change alerts
- Product status updates
For the scraper, we will use [Firecrawl](https://www.firecrawl.dev/), an AI-based scraping API for extracting webpage data without HTML parsing. This solution provides several advantages:
1. No website HTML code analysis required for element selection
2. Resilient to HTML structure changes through AI-based element detection
3. Universal compatibility with product webpages due to structure-agnostic approach
4. Reliable website blocker bypass via robust API infrastructure
First, create a new file called `scraper.py`:
```bash
touch scraper.py
```
Then, install these three libraries:
```bash
pip install firecrawl-py pydantic python-dotenv
echo "firecrawl-py\npydantic\npython-dotenv\n" >> requirements.txt # Add them to dependencies
```
`firecrawl-py` is the Python SDK for Firecrawl scraping engine, `pydantic` is a data validation library that helps enforce data types and structure through Python class definitions, and `python-dotenv` is a library that loads environment variables from a `.env` file into your Python application.
With that said, head over to the Firecrawl website and [sign up for a free account](https://www.firecrawl.dev/) (the free plan will work fine). You will be given an API key, which you should copy.
Then, create a `.env` file in your terminal and add the API key as an environment variable:
```bash
touch .env
echo "FIRECRAWL_API_KEY='YOUR-API-KEY-HERE' >> .env"
echo ".env" >> .gitignore # Ignore .env files in Git
```
The `.env` file is used to securely store sensitive configuration values like API keys that shouldnât be committed to version control. By storing the Firecrawl API key in `.env` and adding it to `.gitignore`, we ensure it stays private while still being accessible to our application code. This is a security best practice to avoid exposing credentials in source control.
Now, we can start writing the `scraper.py`:
```python
from firecrawl import FirecrawlApp
from pydantic import BaseModel, Field
from dotenv import load_dotenv
from datetime import datetime
load_dotenv()
app = FirecrawlApp()
```
Here, `load_dotenv()` function reads the `.env` file you have in your working directory and loads the environment variables inside, including the Firecrawl API key. When you create an instance of `FirecrawlApp` class, the API key is automatically detected to establish a connection between your script and the scraping engine in the form of the `app` variable.
Now, we create a Pydantic class (usually called a model) that defines the details we want to scrape from each product:
```python
class Product(BaseModel):
"""Schema for creating a new product"""
url: str = Field(description="The URL of the product")
name: str = Field(description="The product name/title")
price: float = Field(description="The current price of the product")
currency: str = Field(description="Currency code (USD, EUR, etc)")
main_image_url: str = Field(description="The URL of the main image of the product")
```
Pydantic models may be completely new to you, so letâs break down the `Product` model:
- The `url` field stores the product page URL we want to track
- The `name` field stores the product title/name that will be scraped
- The `price` field stores the current price as a float number
- The `currency` field stores the 3-letter currency code (e.g. USD, EUR)
- The `main_image_url` field stores the URL of the productâs main image
Each field is typed and has a description that documents its purpose. The `Field` class from Pydantic allows us to add metadata like descriptions to each field. These descriptions are especially important for Firecrawl since it uses them to automatically locate the relevant HTML elements containing the data we want.
Now, letâs create a function to call the engine to scrape URLâs based on the schema above:
```python
def scrape_product(url: str):
extracted_data = app.scrape_url(
url,
params={
"formats": ["extract"],
"extract": {"schema": Product.model_json_schema()},
},
)
# Add the scraping date to the extracted data
extracted_data["extract"]["timestamp"] = datetime.utcnow()
return extracted_data["extract"]
if __name__ == "__main__":
product = "https://www.amazon.com/gp/product/B002U21ZZK/"
print(scrape_product(product))
```
The code above defines a function called `scrape_product` that takes a URL as input and uses it to scrape product information. Hereâs how it works:
The function calls `app.scrape_url` with two parameters:
1. The product URL to scrape
2. A params dictionary that configures the scraping:
- It specifies we want to use the âextractâ format
- It provides our `Product` Pydantic model schema as the extraction template as a JSON object
The scraper will attempt to find and extract data that matches our Product schema fields - the URL, name, price, currency, and image URL.
The function returns just the âextractâ portion of the scraped data, which contains the structured product information. `extract` returns a dictionary to which we add the date of the scraping as it will be important later on.
Letâs test the script by running it:
```bash
python scraper.py
```
You should get an output like this:
```python
{
'url': 'https://www.amazon.com/dp/B002U21ZZK',
'name': 'MOVA Globe Earth with Clouds 4.5"',
'price': 212,
'currency': 'USD',
'main_image_url': 'https://m.media-amazon.com/images/blog/price-tracking/I/41bQ3Y58y3L._AC_.jpg',
'timestamp': '2024-12-05 13-20'
}
```
The output shows that a [MOVA Globe](https://www.amazon.com/dp/B002U21ZZK) costs $212 USD on Amazon at the time of writing this article. You can test the script for any other website that contains the information we are looking (except Ebay):
- Price
- Product name/title
- Main image URL
One key advantage of using Firecrawl is that it returns data in a consistent dictionary format across all websites. Unlike HTML-based scrapers like BeautifulSoup or Scrapy which require custom code for each site and can break when website layouts change, Firecrawl uses AI to understand and extract the requested data fields regardless of the underlying HTML structure.
Finish this step by committing the new changes to Git:
```bash
git add .
git commit -m "Implement a Firecrawl scraper for products"
```
### Step 5: Storing new products in a PostgreSQL database
If we want to check product prices regularly, we need to have an online database. In this case, Postgres is the best option since itâs reliable, scalable, and has great support for storing time-series data like price histories.
There are many platforms for hosting Postgres instances but the one I find the easiest and fastest to set up is Supabase. So, please head over to [the Supabase website](https://supabase.com/) and create your free account. During the sign-up process, you will be given a password, which you should save somewhere safe on your machine.
Then, in a few minutes, your free Postgres instance comes online. To connect to this instance, click on Home in the left sidebar and then, âConnectâ:

You will be shown your database connection string with a placeholder for the password you copied. You should paste this string in your `.env` file with your password added to the `.env` file:
```bash
echo POSTGRES_URL="THE-SUPABASE-URL-STRING-WITH-YOUR-PASSWORD-ADDED"
```
Now, the easiest way to interact with this database is through SQLAlchemy. Letâs install it:
```bash
pip install "sqlalchemy==2.0.35" psycopg2-binary
echo "psycopg2-binary\nsqlalchemy==2.0.35\n" >> requirements.txt
```
> Note: [SQLAlchemy](https://sqlalchemy.org/) is a Python SQL toolkit and Object-Relational Mapping (ORM) library that lets us interact with databases using Python code instead of raw SQL. For our price tracking project, it provides essential features like database connection management, schema definition through Python classes, and efficient querying capabilities. This makes it much easier to store and retrieve product information and price histories in our Postgres database.
After the installation, create a new `database.py` file for storing database-related functions:
```bash
touch database.py
```
Letâs populate this script:
```python
from sqlalchemy import create_engine, Column, String, Float, DateTime, ForeignKey
from sqlalchemy.orm import sessionmaker, relationship, declarative_base
from datetime import datetime
Base = declarative_base()
class Product(Base):
__tablename__ = "products"
url = Column(String, primary_key=True)
prices = relationship(
"PriceHistory", back_populates="product", cascade="all, delete-orphan"
)
class PriceHistory(Base):
__tablename__ = "price_histories"
id = Column(String, primary_key=True)
product_url = Column(String, ForeignKey("products.url"))
name = Column(String, nullable=False)
price = Column(Float, nullable=False)
currency = Column(String, nullable=False)
main_image_url = Column(String)
timestamp = Column(DateTime, nullable=False)
product = relationship("Product", back_populates="prices")
```
The code above defines two SQLAlchemy models for our price tracking database:
The `Product` model acts as a registry of all items we want to track. Itâs kept simple with just the URL as we donât want to duplicate data that changes over time.
The `PriceHistory` model stores the actual price data points and product details at specific moments in time. This separation allows us to:
- Track how product details (name, price, image) change over time
- Maintain a clean historical record for each product
- Efficiently query price trends without loading unnecessary data
Each record in `PriceHistory` contains:
- A unique ID as primary key
- The product URL as a foreign key linking to the `Product`
- The product name
- The price value and currency
- The main product image URL
- A timestamp of when the price was recorded
The relationship between `Product` and `PriceHistory` is bidirectional, allowing easy navigation between related records. The `cascade` setting ensures price histories are deleted when their product is deleted.
These models provide the structure for storing and querying our price tracking data in a PostgreSQL database using SQLAlchemyâs ORM capabilities.
Now, we define a `Database` class with a singe `add_product` method:
```python
class Database:
def __init__(self, connection_string):
self.engine = create_engine(connection_string)
Base.metadata.create_all(self.engine)
self.Session = sessionmaker(bind=self.engine)
def add_product(self, url):
session = self.Session()
try:
# Create the product entry
product = Product(url=url)
session.merge(product) # merge will update if exists, insert if not
session.commit()
finally:
session.close()
```
The `Database` class above provides core functionality for managing product data in our PostgreSQL database. It takes a connection string in its constructor to establish the database connection using SQLAlchemy.
The `add_product` method allows us to store new product URLs in the database. It uses SQLAlchemyâs `merge` functionality which intelligently handles both inserting new products and updating existing ones, preventing duplicate entries.
The method carefully manages database sessions, ensuring proper resource cleanup by using `try`/ `finally` blocks. This prevents resource leaks and maintains database connection stability.
Letâs use this method inside the sidebar of our UI. Switch to `ui.py` and make the following adjustments:
First, update the imports to load the Database class and initialize it:
```python
import os
import streamlit as st
from utils import is_valid_url
from database import Database
from dotenv import load_dotenv
load_dotenv()
with st.spinner("Loading database..."):
db = Database(os.getenv("POSTGRES_URL"))
```
The code integrates the `Database` class into the Streamlit UI by importing required dependencies and establishing a database connection. The database URL is loaded securely from environment variables using `python-dotenv`. The `Database` class creates or updates the tables we specified in `database.py` after being initialized.
The database initialization process is wrapped in a Streamlit spinner component to maintain responsiveness while establishing the connection. This provides visual feedback during the connection setup period, which typically requires a brief initialization time.
Then, in the sidebar code, we only need to add a single line of code to add the product to the database if the URL is valid:
```python
# Set up sidebar
with st.sidebar:
st.title("Add New Product")
product_url = st.text_input("Product URL")
add_button = st.button("Add Product")
if add_button:
if not product_url:
st.error("Please enter a product URL")
elif not is_valid_url(product_url):
st.error("Please enter a valid URL")
else:
db.add_product(product_url) # This is the new line
st.success("Product is now being tracked!")
```
In the final `else` block that runs when the product URL is valid, we call the `add_product` method to store the product in the database.
Letâs commit everything:
```bash
git add .
git commit -m "Add a Postgres database integration for tracking product URLs"
```
### Step 6: Storing price histories for new products
Now, after the product is added to the `products` table, we want to add its details and its scraped price to the `price_histories` table.
First, switch to `database.py` and add a new method for creating entries in the `PriceHistories` table:
```python
class Database:
... # the rest of the class
def add_price(self, product_data):
session = self.Session()
try:
price_history = PriceHistory(
id=f"{product_data['url']}_{product_data['timestamp']}",
product_url=product_data["url"],
name=product_data["name"],
price=product_data["price"],
currency=product_data["currency"],
main_image_url=product_data["main_image_url"],
timestamp=product_data["timestamp"],
)
session.add(price_history)
session.commit()
finally:
session.close()
```
The `add_price` method takes a dictionary containing product data (which is returned by our scraper) and creates a new entry in the `PriceHistory` table. The entryâs ID is generated by combining the product URL with a timestamp. The method stores essential product information like name, price, currency, image URL, and the timestamp of when the price was recorded. It uses SQLAlchemyâs session management to safely commit the new price history entry to the database.
Now, we need to add this functionality to the sidebar as well. In `ui.py`, add a new import statement that loads the `scrape_product` function from `scraper.py`:
```python
... # The rest of the imports
from scraper import scrape_product
```
Then, update the `else` block in the sidebar again:
```python
with st.sidebar:
st.title("Add New Product")
product_url = st.text_input("Product URL")
add_button = st.button("Add Product")
if add_button:
if not product_url:
st.error("Please enter a product URL")
elif not is_valid_url(product_url):
st.error("Please enter a valid URL")
else:
db.add_product(product_url)
with st.spinner("Added product to database. Scraping product data..."):
product_data = scrape_product(product_url)
db.add_price(product_data)
st.success("Product is now being tracked!")
```
Now when a user enters a product URL and clicks the âAdd Productâ button, several things happen:
1. The URL is validated to ensure itâs not empty and is properly formatted.
2. If valid, the URL is added to the products table via `add_product()`.
3. The product page is scraped immediately to get current price data.
4. This initial price data is stored in the price history table via `add_price()`.
5. The user sees loading spinners and success messages throughout the process.
This gives us a complete workflow for adding new products to track, including capturing their initial price point. The UI provides clear feedback at each step and handles errors gracefully.
Check that everything is working the way we want it and then, commit the new changes:
```bash
git add .
git commit -m "Add a feature to track product prices after they are added"
```
### Step 7: Displaying each productâs price history in the main dashboard
Letâs take a look at the final product shown in the introduction once again:

Apart from the sidebar, the main dashboard shows each productâs price history visualized with a Plotly line plot where the X axis is the timestamp while the Y axis is the prices. Each line plot is wrapped in a Streamlit component that includes buttons for removing the product from the database or visiting its source URL.
In this step, we will implement the plotting feature and leave the two buttons for a later section. First, add a new method to the `Database` class for retrieving the price history for each product:
```python
class Database:
... # The rest of the code
def get_price_history(self, url):
"""Get price history for a product"""
session = self.Session()
try:
return (
session.query(PriceHistory)
.filter(PriceHistory.product_url == url)
.order_by(PriceHistory.timestamp.desc())
.all()
)
finally:
session.close()
```
The method queries the price histories table based on product URL, orders the rows in descending order (oldest first) and returns the results.
Then, add another method for retrieving all products from the `products` table:
```python
class Database:
...
def get_all_products(self):
session = self.Session()
try:
return session.query(Product).all()
finally:
session.close()
```
The idea is that every time our Streamlit app is opened, the main dashboard queries all existing products from the database and render their price histories with line charts in dedicated components.
To create the line charts, we need Plotly and Pandas, so install them in your environment:
```bash
pip install pandas plotly
echo "pandas\nplotly\n" >> requirements.txt
```
Afterward, import them at the top of `ui.py` along with other existing imports:
```python
import pandas as pd
import plotly.express as px
```
Then, switch to `ui.py` and paste the following snippet of code after the Main content section:
```python
# Main content
st.title("Price Tracker Dashboard")
st.markdown("## Tracked Products")
# Get all products
products = db.get_all_products()
```
Here, after the page title and subtitle is shown, we are retrieving all products from the database. Letâs loop over them:
```python
# Create a card for each product
for product in products:
price_history = db.get_price_history(product.url)
if price_history:
# Create DataFrame for plotting
df = pd.DataFrame(
[\
{"timestamp": ph.timestamp, "price": ph.price, "name": ph.name}\
for ph in price_history\
]
)
```
For each product, we get their price history with `db.get_price_history` and then, convert this data into a dataframe with three columns:
- Timestamp
- Price
- Product name
This makes plotting easier with Plotly. Next, we create a Streamlit expander component for each product:
```python
# Create a card for each product
for product in products:
price_history = db.get_price_history(product.url)
if price_history:
...
# Create a card-like container for each product
with st.expander(df["name"][0], expanded=False):
st.markdown("---")
col1, col2 = st.columns([1, 3])
with col1:
if price_history[0].main_image_url:
st.image(price_history[0].main_image_url, width=200)
st.metric(
label="Current Price",
value=f"{price_history[0].price} {price_history[0].currency}",
)
```
The expander shows the product name as its title and contains:
1. A divider line
2. Two columns:
- Left column: Product image (if available) and current price metric
- Right column (shown in next section)
The price is displayed using Streamlitâs metric component which shows the current price and currency.
Here is the rest of the code:
```python
...
with col2:
# Create price history plot
fig = px.line(
df,
x="timestamp",
y="price",
title=None,
)
fig.update_layout(
xaxis_title=None,
yaxis_title="Price",
showlegend=False,
margin=dict(l=0, r=0, t=0, b=0),
height=300,
)
fig.update_xaxes(tickformat="%Y-%m-%d %H:%M", tickangle=45)
fig.update_yaxes(tickprefix=f"{price_history[0].currency} ", tickformat=".2f")
st.plotly_chart(fig, use_container_width=True)
```
In the right column, we create an interactive line plot using Plotly Express to visualize the price history over time. The plot shows price on the y-axis and timestamp on the x-axis. The layout is customized to remove the title, adjust axis labels and formatting, and optimize the display size. The timestamps are formatted to show date and time, with angled labels for better readability. Prices are displayed with 2 decimal places and a dollar sign prefix. The plot is rendered using Streamlitâs `plotly_chart` component and automatically adjusts its width to fill the container.
After this step, the UI must be fully functional and ready to track products. For example, here is what mine looks like after adding a couple of products:

But notice how the price history chart doesnât show anything. Thatâs because we havenât populated it by checking the product price in regular intervals. Letâs do that in the next couple of steps. For now, commit the latest changes weâve made:
```bash
git add .
git commit -m "Display product price histories for each product in the dashboard"
```
* * *
Letâs take a brief moment to summarize the steps we took so far and whatâs next. So far, weâve built a Streamlit interface that allows users to add product URLs and displays their current prices and basic information. Weâve implemented the database schema, created functions to scrape product data, and designed a clean UI with price history visualization. The next step is to set up automated price checking to populate our history charts and enable proper price tracking over time.
### Step 8: Adding new price entries for existing products
Now, we want to write a script that adds new price entries in the `price_histories` table for each product in `products` table. We call this script `check_prices.py`:
```python
import os
from database import Database
from dotenv import load_dotenv
from firecrawl import FirecrawlApp
from scraper import scrape_product
load_dotenv()
db = Database(os.getenv("POSTGRES_URL"))
app = FirecrawlApp()
```
At the top, we are importing the functions and packages and initializing the database and a Firecrawl app. Then, we define a simple `check_prices` function:
```python
def check_prices():
products = db.get_all_products()
for product in products:
try:
updated_product = scrape_product(product.url)
db.add_price(updated_product)
print(f"Added new price entry for {updated_product['name']}")
except Exception as e:
print(f"Error processing {product.url}: {e}")
if __name__ == "__main__":
check_prices()
```
In the function body, we retrieve all products URLs, retrieve their new price data with `scrape_product` function from `scraper.py` and then, add a new price entry for the product with `db.add_price`.
If you run the function once and refresh the Streamlit app, you must see a line chart appear for each product you are tracking:

Letâs commit the changes in this step:
```bash
git add .
git commit -m "Add a script for checking prices of existing products"
```
### Step 9: Check prices regularly with GitHub actions
GitHub Actions is a continuous integration and continuous delivery (CI/CD) platform that allows you to automate various software workflows directly from your GitHub repository. In our case, itâs particularly useful because we can set up automated price checks to run the `check_prices.py` script at regular intervals (e.g., daily or hourly) without manual intervention. This ensures we consistently track price changes and maintain an up-to-date database of historical prices for our tracked products.
So, the first step is creating a new GitHub repository for our project and pushing existing code to it:
```bash
git remote add origin https://github.com/yourusername/price-tracker.git
git push origin main
```
Then, return to your terminal and create this directory structure:
```bash
mkdir -p .github/workflows
touch .github/workflows/check_prices.yml
```
The first command creates a new directory structure `.github/workflows` using the `-p` flag to create parent directories if they donât exist.
The second command creates an empty YAML file called `check_prices.yml` inside the workflows directory. GitHub Actions looks for workflow files in this specific location - any YAML files in the `.github/workflows` directory will be automatically detected and processed as workflow configurations. These YAML files define when and how your automated tasks should run, what environment they need, and what commands to execute. In our case, this file will contain instructions for GitHub Actions to periodically run our price checking script. Letâs write it:
```yaml
name: Price Check
on:
schedule:
# Runs every 3 minutes
- cron: "*/3 * * * *"
workflow_dispatch: # Allows manual triggering
```
Letâs break down this first part of the YAML file:
The `name: Price Check` line gives our workflow a descriptive name that will appear in the GitHub Actions interface.
The `on:` section defines when this workflow should be triggered. Weâve configured two triggers:
1. A schedule using cron syntax `*/3 * * * *` which runs the workflow every 3 minutes. The five asterisks represent minute, hour, day of month, month, and day of week respectively. The `*/3` means âevery 3rd minuteâ. The 3-minute interval is for debugging purposes, we will need to choose a wider interval later on to respect the free limits of GitHub actions.
2. `workflow_dispatch` enables manual triggering of the workflow through the GitHub Actions UI, which is useful for testing or running the check on-demand.
Now, letâs add the rest:
```yaml
jobs:
check-prices:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.10"
cache: "pip"
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements.txt
- name: Run price checker
env:
FIRECRAWL_API_KEY: ${{ secrets.FIRECRAWL_API_KEY }}
POSTGRES_URL: ${{ secrets.POSTGRES_URL }}
run: python check_prices.py
```
Letâs break down this second part of the YAML file:
The `jobs:` section defines the actual work to be performed. We have one job named `check-prices` that runs on an Ubuntu virtual machine ( `runs-on: ubuntu-latest`).
Under `steps:`, we define the sequence of actions:
1. First, we checkout our repository code using the standard `actions/checkout@v4` action
2. Then we set up Python 3.10 using `actions/setup-python@v5`, enabling pip caching to speed up dependency installation
3. Next, we install our Python dependencies by upgrading `pip` and installing requirements from our `requirements.txt` file. At this point, it is essential that you were keeping a complete dependency file based on the installs we made in the project.
4. Finally, we run our price checker script, providing two environment variables:
- `FIRECRAWL_API_KEY`: For accessing the web scraping service
- `POSTGRES_URL`: For connecting to our database
Both variables must be stored in our GitHub repository as secrets for this workflow file to run without errors. So, navigate to the repository youâve created for the project and open its Settings. Under âSecrets and variablesâ > âActionsâ, click on âNew repository secretâ button to add the environment variables we have in the `.env` file one-by-one.
Then, return to your terminal, commit the changes and push:
```bash
git add .
git commit -m "Add a workflow to check prices regularly"
git push origin main
```
Next, navigate to your GitHub repository again and click on the âActionsâ tab:

From there, you can run the workflow manually (click âRun workflowâ and refresh the page). If it is executed successfully, you can return to the Streamlit app and refresh to see the new price added to the chart.
### Step 10: Setting up Discord for notifications
Now that we know our scheduling workflow works, the first order of business is setting a wider check interval in the workflow file. Even though our first workflow run was manually, the rest happen automatically.
```bash
on:
schedule:
# Runs every 6 hours
- cron: "0 0,6,12,18 * * *"
workflow_dispatch: # Allows manual triggering
```
The cron syntax `0 0,6,12,18 * * *` can be broken down as follows:
- First `0`: Run at minute 0
- `0,6,12,18`: Run at hours 0 (midnight), 6 AM, 12 PM (noon), and 6 PM
- First `*`: Run every day of the month
- Second `*`: Run every month
- Third `*`: Run every day of the week
So this schedule will check prices four times daily: at midnight, 6 AM, noon, and 6 PM (UTC time). This spacing helps stay within GitHub Actionsâ free tier limits while still catching most price changes.
Now, commit and push the changes:
```bash
git add .
git commit -m "Set a wider check interval in the workflow file"
git push origin main
```
Now comes the interesting part. Each time the workflow is run, we want to compare the current price of the product to its original price when we started tracking it. If the difference between these two prices exceeds a certain threshold like 5%, this means there is a discount happening for the product and we want to send a notification.
The easiest way to set this up is by using Discord webhooks. So, if you donât have one already, go to Discord.com and create a new account (optionally, download the desktop app as well). Then, setting up Discord notifications requires a few careful steps:
1. **Create a discord server**
- Click the â+â button in the bottom-left corner of Discord
- Choose âCreate My Ownâ â âFor me and my friendsâ
- Give your server a name (e.g., âPrice Alertsâ)
2. **Create a channel for alerts**
- Your server comes with a #general channel by default
- You can use this or create a new channel called #price-alerts
- Right-click the channel you want to use
3. **Set up the webhook**
- Select âEdit Channelâ from the right-click menu
- Go to the âIntegrationsâ tab
- Click âCreate Webhookâ
- Give it a name like âPrice Alert Botâ
- The webhook URL will be generated automatically
- Click âCopy Webhook URLâ - this is your unique notification endpoint
4. **Secure the webhook URL**
- Never share or commit your webhook URL directly
- Add it to your `.env` file as `DISCORD_WEBHOOK_URL`
- Add it to your GitHub repository secrets
- The URL should look something like: `https://discord.com/api/webhooks/...`
This webhook will serve as a secure endpoint that our price tracker can use to send notifications directly to your Discord channel.
Webhooks are automated messages sent from apps to other apps in real-time. They work like a notification system - when something happens in one app, it automatically sends data to another app through a unique URL. In our case, weâll use Discord webhooks to automatically notify us when thereâs a price drop. Whenever our price tracking script detects a significant discount, it will send a message to our Discord channel through the webhook URL, ensuring we never miss a good deal.
After copying the webhook URL, you should save it as environment variable to your `.env` file:
```python
echo "DISCORD_WEBHOOK_URL='THE-URL-YOU-COPIED'" >> .env
```
Now, create a new file called `notifications.py` and paste the following contents:
```python
from dotenv import load_dotenv
import os
import aiohttp
import asyncio
load_dotenv()
async def send_price_alert(
product_name: str, old_price: float, new_price: float, url: str
):
"""Send a price drop alert to Discord"""
drop_percentage = ((old_price - new_price) / old_price) * 100
message = {
"embeds": [\
{\
"title": "Price Drop Alert! đ",\
"description": f"**{product_name}**\nPrice dropped by {drop_percentage:.1f}%!\n"\
f"Old price: ${old_price:.2f}\n"\
f"New price: ${new_price:.2f}\n"\
f"[View Product]({url})",\
"color": 3066993,\
}\
]
}
try:
async with aiohttp.ClientSession() as session:
await session.post(os.getenv("DISCORD_WEBHOOK_URL"), json=message)
except Exception as e:
print(f"Error sending Discord notification: {e}")
```
The `send_price_alert` function above is responsible for sending price drop notifications to Discord using webhooks. Letâs break down whatâs new:
1. The function takes 4 parameters:
- `product_name`: The name of the product that dropped in price
- `old_price`: The previous price before the drop
- `new_price`: The current lower price
- `url`: Link to view the product
2. It calculates the percentage drop in price using the formula: `((old_price - new_price) / old_price) * 100`
3. The notification is formatted as a Discord embed - a rich message format that includes:
- A title with a celebration emoji
- A description showing the product name, price drop percentage, old and new prices
- A link to view the product
- A green color (3066993 in decimal)
4. The message is sent asynchronously using `aiohttp` to post to the Discord webhook URL stored in the environment variables
5. Error handling is included to catch and print any issues that occur during the HTTP request
This provides a clean way to notify users through Discord whenever we detect a price drop for tracked products.
To check the notification system works, add this main block to the end of the script:
```python
if __name__ == "__main__":
asyncio.run(send_price_alert("Test Product", 100, 90, "https://www.google.com"))
```
`asyncio.run()` is used here because `send_price_alert` is an async function that needs to be executed in an event loop. `asyncio.run()` creates and manages this event loop, allowing the async HTTP request to be made properly. Without it, we wouldnât be able to use the `await` keyword inside `send_price_alert`.
To run the script, install `aiohttp`:
```python
pip install aiohttp
echo "aiohttp\n" >> requirements.txt
python notifications.py
```
If all is well, you should get a Discord message in your server that looks like this:

Letâs commit the changes we have:
```bash
git add .
git commit -m "Set up Discord alert system"
```
Also, donât forget to add the Discord webhook URL to your GitHub repository secrets!
### Step 11: Sending Discord alerts when prices drop
Now, the only step left is adding a price comparison logic to `check_prices.py`. In other words, we want to use the `send_price_alert` function if the new scraped price is lower than the original. This requires a revamped `check_prices.py` script:
```python
import os
import asyncio
from database import Database
from dotenv import load_dotenv
from firecrawl import FirecrawlApp
from scraper import scrape_product
from notifications import send_price_alert
load_dotenv()
db = Database(os.getenv("POSTGRES_URL"))
app = FirecrawlApp()
# Threshold percentage for price drop alerts (e.g., 5% = 0.05)
PRICE_DROP_THRESHOLD = 0.05
async def check_prices():
products = db.get_all_products()
product_urls = set(product.url for product in products)
for product_url in product_urls:
# Get the price history
price_history = db.get_price_history(product_url)
if not price_history:
continue
# Get the earliest recorded price
earliest_price = price_history[-1].price
# Retrieve updated product data
updated_product = scrape_product(product_url)
current_price = updated_product["price"]
# Add the price to the database
db.add_price(updated_product)
print(f"Added new price entry for {updated_product['name']}")
# Check if price dropped below threshold
if earliest_price > 0: # Avoid division by zero
price_drop = (earliest_price - current_price) / earliest_price
if price_drop >= PRICE_DROP_THRESHOLD:
await send_price_alert(
updated_product["name"], earliest_price, current_price, product_url
)
if __name__ == "__main__":
asyncio.run(check_prices())
```
Letâs examine the key changes in this enhanced version of `check_prices.py`:
1. New imports and setup
- Added `asyncio` for `async`/ `await` support
- Imported `send_price_alert` from `notifications.py`
- Defined `PRICE_DROP_THRESHOLD = 0.05` (5% threshold for alerts)
2. Async function conversion
- Converted `check_prices()` to async function
- Gets unique product URLs using set comprehension to avoid duplicates
3. Price history analysis
- Retrieves full price history for each product
- Gets `earliest_price` from `history[-1]` (works because we ordered by timestamp DESC)
- Skips products with no price history using `continue`
4. Price drop detection logic
- Calculates drop percentage: `(earliest_price - current_price) / earliest_price`
- Checks if drop exceeds 5% threshold
- Sends Discord alert if threshold exceeded using `await send_price_alert()`
5. Async main block
- Uses `asyncio.run()` to execute async `check_prices()` in event loop
When I tested this new version of the script, I immediately got an alert:

Before we supercharge our workflow with the new notification system, you should add this line of code to your `check_prices.yml` workflow file to read the Discord webhook URL from your GitHub secrets:
```python
...
- name: Run price checker
env:
FIRECRAWL_API_KEY: ${{ secrets.FIRECRAWL_API_KEY }}
POSTGRES_URL: ${{ secrets.POSTGRES_URL }}
DISCORD_WEBHOOK_URL: ${{ secrets.DISCORD_WEBHOOK_URL }}
run: python automated_price_tracking/check_prices.py
```
Finally, letâs commit everything and push to GitHub so that our workflow is supercharged with our notification system:
```bash
git add .
git commit -m "Add notification system to price drops"
git push origin main
```
## Limitations of Free Tier Tools Used in the Tutorial
Before wrapping up, letâs quickly review the limitations of the free tools we used in this tutorial:
- GitHub Actions: Limited to 2,000 minutes per month for free accounts. Consider increasing the cron interval to stay within limits.
- Supabase: Free tier includes 500MB database storage and limited row count. Monitor usage if tracking many products.
- Firecrawl: Free API tier allows 500 requests per month. This means that at 6 hour intervals, you can track up to four products in the free plan.
- Streamlit Cloud: Free hosting tier has some memory/compute restrictions and goes to sleep after inactivity.
While these limitations exist, theyâre quite generous for personal use and learning. The app will work well for tracking a reasonable number of products with daily price checks.
## Conclusion and Next Steps
Congratulations for making it to the end of this extremely long tutorial! Weâve just covered how to implement an end-to-end Python project you can proudly showcase on your portfolio. We built a complete price tracking system that scrapes product data from e-commerce websites, stores it in a Postgres database, analyzes price histories, and sends automated Discord notifications when prices drop significantly. Along the way, we learned about web scraping with Firecrawl, database management with SQLAlchemy, asynchronous programming with asyncio, building interactive UIs with Streamlit, automating with GitHub actions and integrating external webhooks.
However, the project is far from perfect. Since we took a top-down approach to building this app, our project code is scattered across multiple files and often doesnât follow programming best practices. For this reason, Iâve recreated the same project in a much more sophisticated manner with production-level features. [This new version on GitHub](https://github.com/BexTuychiev/automated-price-tracking) implements proper database session management, faster operations and overall smoother user experience. Also, this version includes buttons for removing products from the database and visiting them through the app.
If you decide to stick with the basic version, you can find the full project code and notebook in the official Firecrawl GitHub repositoryâs example projects. I also recommend that you [deploy your Streamlit app to Streamlit Cloud](https://share.streamlit.io/) so that you have a functional app accessible everywhere you go.
Here are some further improvements you might consider for the app:
- Improve the price comparison logic: the app compares the current price to the oldest recorded price, which might not be ideal. You may want to compare against recent price trends instead.
- No handling of currency conversion if products use different currencies.
- The Discord notification system doesnât handle rate limits or potential webhook failures gracefully.
- No error handling for Firecrawl scraper - what happens if the scraping fails?
- No consistent usage of logging to help track issues in production.
- No input URL sanitization before scraping.
Some of these features are implemented in [the advanced version of the project](https://github.com/BexTuychiev/automated-price-tracking), so definitely check it out!
Here are some more guides from our blog if you are interested:
- [How to Run Web Scrapers on Schedule](https://www.firecrawl.dev/blog/automated-web-scraping-free-2025)
- [More about using Firecrawlâs `scrape_url` function](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint)
- [Scraping entire websites with Firecrawl in a single command - the /crawl endpoint](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl)
Thank you for reading!
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Bex Tuychiev@bextuychiev](https://x.com/bextuychiev)
Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics
### More articles by Bex Tuychiev
[Building an Automated Price Tracking Tool\\
\\
Build an automated e-commerce price tracker in Python. Learn web scraping, price monitoring, and automated alerts using Firecrawl, Streamlit, PostgreSQL.](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python) [Web Scraping Automation: How to Run Scrapers on a Schedule\\
\\
Learn how to automate web scraping in Python using free tools like schedule, asyncio, cron jobs and GitHub Actions. This comprehensive guide covers local and cloud-based scheduling methods to run scrapers reliably in 2025.](https://www.firecrawl.dev/blog/automated-web-scraping-free-2025) [Automated Data Collection - A Comprehensive Guide\\
\\
Learn how to build robust automated data collection systems using modern tools and best practices. This guide covers everything from selecting the right tools to implementing scalable collection pipelines.](https://www.firecrawl.dev/blog/automated-data-collection-guide) [BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python\\
\\
Learn the key differences between BeautifulSoup4 and Scrapy for web scraping in Python. Compare their features, performance, and use cases to choose the right tool for your web scraping needs.](https://www.firecrawl.dev/blog/beautifulsoup4-vs-scrapy-comparison) [How to Build an Automated Competitor Price Monitoring System with Python\\
\\
Learn how to build an automated competitor price monitoring system in Python that tracks prices across e-commerce sites, provides real-time comparisons, and maintains price history using Firecrawl, Streamlit, and GitHub Actions.](https://www.firecrawl.dev/blog/automated-competitor-price-scraping) [Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude\\
\\
Learn how to build a web scraper in Python that gathers company details, funding rounds, and investor information from public sources like Crunchbase using Firecrawl and Claude for automated data collection and analysis.](https://www.firecrawl.dev/blog/crunchbase-scraping-with-firecrawl-claude) [How to Create Custom Instruction Datasets for LLM Fine-tuning\\
\\
Learn how to build high-quality instruction datasets for fine-tuning large language models (LLMs). This guide covers when to create custom datasets, best practices for data collection and curation, and a practical example of building a code documentation dataset.](https://www.firecrawl.dev/blog/custom-instruction-datasets-llm-fine-tuning) [Data Enrichment: A Complete Guide to Enhancing Your Data Quality\\
\\
Learn how to enrich your data quality with a comprehensive guide covering data enrichment tools, best practices, and real-world examples. Discover how to leverage modern solutions like Firecrawl to automate data collection, validation, and integration for better business insights.](https://www.firecrawl.dev/blog/complete-guide-to-data-enrichment)
## Automated Web Scraping
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Dec 5, 2024
â˘
[Bex Tuychiev](https://x.com/bextuychiev)
# Web Scraping Automation: How to Run Scrapers on a Schedule

## Introduction
Web scraping is an essential skill for programmers in this data-driven world. Whether youâre tracking prices, monitoring competitors, or gathering research data, automated web scraping can save you countless hours of manual work. In this comprehensive guide, youâll learn how to schedule and automate your Python web scrapers using completely free tools and services.
### Why automate your web scraper?
Manual or one-off web scraping can be time-consuming and error-prone. You need to repeatedly run scripts, update data frequently, and sometimes work during off-hours just to gather information.
Scheduling your web scrapers automates this entire process. It collects data at optimal times without manual intervention, ensures consistency, and frees up your valuable time for actual data analysis rather than repetitive data gathering.
### What tools can you use to automate web scrapers in Python?
If you are building scrapers in Python, you have many completely free options to schedule them.
In terms of local scraping, Pythonâs `schedule` library is very intuitive and fastest to set up. There is also the built-in `asyncio` for concurrent scraping. There are also system-level automation tools like cron jobs for macOS/Linux and the Task Scheduler for Windows.
There are also many cloud-based solutions like GitHub Actions (completely free), PythonAnywhere (free tier), Google Cloud Functions (free tier), and Heroku (free tier with limitations).
In this guide, we will start with basic local scheduling and progress to using GitHub Actions, all while following best practices and ethical scraping guidelines.
Letâs begin by setting up your development environment and writing your first scheduled web scraper.
### Common challenges in automating web scrapers
Scheduling web scrapers is an easy process, as you will discover through this tutorial. The real challenge lies in ensuring that the scrapers donât break the day after they are put on schedule. Common issues include websites changing their HTML structure, implementing new anti-bot measures, or modifying their robots.txt policies. Additionally, network issues, rate limiting, and IP blocking can cause scheduled scrapers to fail.
For these reasons, it is almost impossible to build long-lasting scrapers written in Python frameworks. But the web scraping landscape is changing as more AI-based tools are emerging, like [Firecrawl](https://www.firecrawl.dev/).
Firecrawl provides an AI-powered web scraping API that can identify and extract data from HTML elements based on semantic descriptions in Python classes. While traditional scrapers rely on specific HTML selectors that can break when websites change, Firecrawlâs AI approach helps maintain scraper reliability over time.
For demonstration purposes, weâll implement examples using Firecrawl, though the scheduling techniques covered in this tutorial can be applied to any Python web scraper built with common libraries like BeautifulSoup, Scrapy, Selenium, or `lxml`. If you want to follow along with a scraper of your own, make sure to have it in a script and ready to go.
## Prerequisites
This article assumes that you are already comfortable with web scraping and its related concepts like HTML structure, CSS selectors, HTTP requests, and handling rate limits. If you need a refresher on web scraping basics, check out [this introductory guide to web scraping with Python](https://realpython.com/python-web-scraping-practical-introduction/).
Otherwise, letâs jump in by setting up the tools we will use for the tutorial.
### Environment setup
We will mainly use Firecrawl in this article, so, please make sure that you sign up at [firecrawl.dev](https://www.firecrawl.dev/), choose the free plan and get an API token.
Then, create a new working directory on your machine to follow along in this tutorial:
```bash
mkdir learn-scheduling
cd learn-scheduling
```
It is always a best practice to create a new virtual environment for projects. In this tutorial, we will use Pythonâs `virtualenv`:
```bash
python -m venv venv
source venv/bin/activate # For Unix/macOS
venv\Scripts\activate # For Windows
```
Now, letâs install the libraries we will use:
```bash
pip install requests beautifulsoup4 firecrawl-py python-dotenv
```
We will touch on what each library does as we use them.
### Firecrawl API key setup
Since we will push our code to a GitHub repository later, you will need to save your Firecrawl API key securely by using a `.env` file:
```bash
touch .env
echo "FIRECRAWL_API_KEY='your-key-here'" >> .env
```
Also, create a `.gitignore` file and add the `.env` to it so that it isnât pushed to GitHub:
```bash
touch .gitignore
echo ".env" >> .gitignore
```
## Writing a Basic Web Scraper
In this tutorial, we will build a scraper for [the Hacker News homepage](https://news.ycombinator.com/) that extracts post title, URL, author, rank, number of upvotes and date.

Like we mentioned, we will build the scraper in Firecrawl but I have also prepared an [identical scraper written in BeautifulSoup](https://github.com/mendableai/firecrawl/tree/main/examples/blog-articles/scheduling_scrapers) if you want a more traditional approach.
In your working directory, create a new `firecrawl_scraper.py` script and import the following packages:
```python
# firecrawl_scraper.py
import json
from firecrawl import FirecrawlApp
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from typing import List
from datetime import datetime
load_dotenv()
BASE_URL = "https://news.ycombinator.com/"
```
After the imports, we are calling `load_dotenv()` so that our Firecrawl API key is loaded from the `.env` file. Then, we are defining a new variable containing the URL we will scrape.
Next, we create a Pydantic model to specify the information we want to scrape from each Hacker News post:
```python
class NewsItem(BaseModel):
title: str = Field(description="The title of the news item")
source_url: str = Field(description="The URL of the news item")
author: str = Field(
description="The URL of the post author's profile concatenated with the base URL."
)
rank: str = Field(description="The rank of the news item")
upvotes: str = Field(description="The number of upvotes of the news item")
date: str = Field(description="The date of the news item.")
```
Pydantic models are Python classes that provide data validation and serialization capabilities. They allow you to define the structure and types of your data using Python type hints, while automatically handling validation, serialization, and documentation.
In the context of our Firecrawl scraper, the `NewsItem` model defines the exact structure of data we want to extract from each Hacker News post. Each field in the model ( `title`, `source_url`, `author`, etc.) specifies what data should be scraped and includes a description of what that field represents.
This model is crucial for Firecrawl because it uses the modelâs schema to understand exactly what data to extract from the webpage. When we pass this model to Firecrawl, it will automatically attempt to find and extract data matching these field definitions from the HTML structure of Hacker News.
For example, when Firecrawl sees we want a âtitleâ field, it will look for elements on the page that are likely to contain post titles based on their HTML structure and content. The `Field` descriptions help provide additional context about what each piece of data represents.
Next, we create another model called `NewsData` that contains a list of `NewsItem` objects. This model will serve as a container for all the news items we scrape from Hacker News. The `news_items` field is defined as a List of `NewsItem` objects, which means it can store multiple news items in a single data structure.
```python
class NewsData(BaseModel):
news_items: List[NewsItem]
```
Without this second model, our scraper will return not one but all news items.
Now, we define a new function that will run Firecrawl based on the scraping schema we just defined:
```python
def get_firecrawl_news_data():
app = FirecrawlApp()
data = app.scrape_url(
BASE_URL,
params={
"formats": ["extract"],
"extract": {"schema": NewsData.model_json_schema()},
},
)
return data
```
This function initializes a FirecrawlApp instance and uses it to scrape data from Hacker News. It passes the `BASE_URL` and parameters specifying that we want to extract data according to our `NewsData` schema. The schema tells Firecrawl exactly what fields to look for and extract from each news item on the page. The function returns the scraped data which will contain a list of news items matching our defined structure.
Letâs quickly test it:
```python
data = get_firecrawl_news_data()
print(type(data))
```
```python
```
Firecrawl always returns the scraped data in a dictionary. Letâs look at the keys:
```python
data['metadata']
```
```python
{
'title': 'Hacker News',
'language': 'en',
'ogLocaleAlternate': [],
'referrer': 'origin',
'viewport': 'width=device-width, initial-scale=1.0',
'sourceURL': 'https://news.ycombinator.com/',
'url': 'https://news.ycombinator.com/',
'statusCode': 200
}
```
The first key is the metadata field containing basic page information. We are interested in the `extract` field which contains the data scraped by the engine:
```python
data['extract']['news_items'][0]
```
```python
{
'title': "Send someone you appreciate an official 'Continue and Persist' Letter",
'source_url': 'https://ContinueAndPersist.org',
'author': 'https://news.ycombinator.com/user?id=adnanaga',
'rank': '1',
'upvotes': '1122',
'date': '17 hours ago'
}
```
The `extract` field contains a dictionary with a list of scraped news items. We can see above that when printing the first item, it includes all the fields we defined in our `NewsItem` Pydantic model, including title, source URL, author, rank, upvotes and date.
```python
len(data['extract']['news_items'])
```
```python
30
```
The output shows 30 news items, confirming that our scraper successfully extracted all posts from the first page of Hacker News. This matches the siteâs standard layout which displays exactly 30 posts per page.
Now, letâs create a new function that saves this data to a JSON file:
```python
def save_firecrawl_news_data():
# Get the data
data = get_firecrawl_news_data()
# Format current date for filename
date_str = datetime.now().strftime("%Y_%m_%d_%H_%M")
filename = f"firecrawl_hacker_news_data_{date_str}.json"
# Save the news items to JSON file
with open(filename, "w") as f:
json.dump(data["extract"]["news_items"], f, indent=4)
return filename
```
`save_firecrawl_news_data()` handles saving the scraped Hacker News data to a JSON file. It first calls `get_firecrawl_news_data()` to fetch the latest data from Hacker News. Then, it generates a filename using the current timestamp in the format `YYYY_MM_DD_HH_MM`. The data is saved to this timestamped JSON file with proper indentation, and the filename is returned. This allows us to maintain a historical record of the scraped data with clear timestamps indicating when each scrape occurred.
Finally, add a `__main__` block to the `firecrawl_scraper.py` script to allow running the scraper directly from the command line:
```python
if __name__ == "__main__":
save_firecrawl_news_data()
```
The complete scraper script is available in [our GitHub repository](https://github.com/mendableai/firecrawl/blob/main/examples/hacker_news_scraper/firecrawl_scraper.py). For reference, we also provide [a BeautifulSoup implementation of the same scraper](https://github.com/mendableai/firecrawl/blob/main/examples/hacker_news_scraper/bs4_scraper.py).
## Local Web Scraping Automation Methods
In this section, we will explore how to run the scraper from the previous section on schedule using local tools like the Python `schedule` library and cron.
### The basics of the Python `schedule` library
`schedule` is a job scheduling library written for humans ( [from the documentation](https://schedule.readthedocs.io/en/stable/index.html)). It runs any Python function or callable periodically using intuitive syntax.
To get started, please install it with `pip`:
```bash
pip install schedule
```
Then, scheduling a Python function is as easy as shown in the codeblock below:
```python
import schedule
import time
def job():
current_time = time.strftime("%H:%M:%S")
print(f"{current_time}: I'm working...")
# Schedule it
schedule.every(3).seconds.do(job)
while True:
schedule.run_pending()
time.sleep(1)
```
```out
14:58:23: I'm working...
14:58:26: I'm working...
14:58:29: I'm working...
14:58:32: I'm working...
14:58:35: I'm working...
...
```
To implement scheduling, first convert your task into a function (which weâve already completed). Next, apply scheduling logic using the `.every(n).period.do` syntax. Below are several examples demonstrating different scheduling patterns:
```python
schedule.every(10).minutes.do(job)
schedule.every().hour.do(job)
schedule.every().day.at("10:30").do(job)
schedule.every().monday.do(job)
schedule.every().wednesday.at("13:15").do(job)
schedule.every().day.at("12:42", "Europe/Amsterdam").do(job)
schedule.every().minute.at(":17").do(job) # 17th second of a minute
```
Finally, you need to run an infinite loop that checks for pending scheduled jobs and executes them. The loop below runs continuously, checking if any scheduled tasks are due to run:
```python
while True:
schedule.run_pending()
time.sleep(1)
```
There is much more to the `schedule` library than what we just covered (you should check out [the examples from the documentation](https://schedule.readthedocs.io/en/stable/examples.html)) but they are enough for the purposes of this article.
### Using Pythonâs `schedule` library to schedule web scrapers
Now that we know the basics of `schedule`, letâs use it for our Firecrawl scraper. Start by creating a new `scrape_scheduler.py` script and making the necessary imports:
```python
import schedule
import time
from firecrawl_scraper import save_firecrawl_news_data
```
Here, we import the `schedule` module itself and the `save_firecrawl_news_data()` function from `firecrawl_scraper.py` that downloads the top 30 posts of Hacker News.
Then, to run this function on schedule, like every hour, we only need to add a few lines of code:
```python
# Schedule the scraper to run every hour
schedule.every().hour.do(save_firecrawl_news_data)
while True:
schedule.run_pending()
time.sleep(1)
```
You can start the schedule with:
```bash
python scrape_scheduler.py
```
> **Tip**: For debugging purposes, start with a shorter interval like 60 seconds before implementing the hourly schedule.
The scheduler will continue running until you terminate the main terminal process executing the `scrape_scheduler.py` script. Thanks to Firecrawlâs AI-powered HTML parsing and layout adaptation capabilities, the scraper is quite resilient to website changes and has a low probability of breaking.
Nevertheless, web scraping can be unpredictable, so itâs recommended to review [the exception handling](https://schedule.readthedocs.io/en/stable/exception-handling.html) section of the `schedule` documentation to handle potential errors gracefully.
* * *
The `schedule` library provides a simple and intuitive way to run periodic tasks like web scrapers. While it lacks some advanced features of other scheduling methods, itâs a great choice for basic scheduling needs and getting started with automated scraping. Just remember to implement proper error handling for production use.
### Using Pythonâs built-in tools to automate web scrapers
In this section, we will explore a few other local scheduling methods that have the advantage of being built into Python or the operating system, making them more reliable and robust than third-party libraries. These methods also provide better error handling, logging capabilities, and system-level control over the scheduling process.
#### How to automate a web scraper with `asyncio`?
`asyncio` is a Python library for running code concurrently - executing multiple tasks at the same time by switching between them when one is waiting. Itâs built into Python and helps schedule tasks efficiently. Hereâs why itâs great for web scrapers:
1. It can do other work while waiting for web requests to complete.
2. You can run multiple scrapers at the same time with precise timing control.
3. It uses less computer resources than regular multitasking.
4. It handles errors well with `try/except` blocks.
Letâs see how to use `asyncio` for scheduling scrapers:
```python
import asyncio
import time
from firecrawl_scraper import save_firecrawl_news_data
async def schedule_scraper(interval_hours: float = 1):
while True:
try:
print(f"Starting scrape at {time.strftime('%Y-%m-%d %H:%M:%S')}")
# Run the scraper
filename = save_firecrawl_news_data()
print(f"Data saved to {filename}")
except Exception as e:
print(f"Error during scraping: {e}")
# Wait for the specified interval
await asyncio.sleep(interval_hours * 3600) # Convert hours to seconds
async def main():
# Create tasks for different scheduling intervals
tasks = [\
schedule_scraper(interval_hours=1), # Run every hour\
# Add more tasks with different intervals if needed\
# schedule_scraper(interval_hours=0.5), # Run every 30 minutes\
# schedule_scraper(interval_hours=2), # Run every 2 hours\
]
# Run all tasks concurrently
await asyncio.gather(*tasks)
if __name__ == "__main__":
# Run the async scheduler
asyncio.run(main())
```
Letâs break down whatâs happening above:
The `schedule_scraper()` function is an `async` function that runs indefinitely in a loop. For each iteration, it:
1. Runs the scraper and saves the data
2. Handles any errors that occur during scraping
3. Waits for the specified interval using `asyncio.sleep()`
The `main()` function sets up concurrent execution of multiple scraper tasks with different intervals. This allows running multiple scrapers simultaneously without blocking each other.
This asyncio-based approach has several advantages over the `schedule` library:
1. True concurrency: Multiple scrapers can run simultaneously without blocking each other, unlike `schedule` which runs tasks sequentially.
2. Precise Timing: `asyncio.sleep()` provides more accurate timing control compared to scheduleâs `run_pending()` approach.
3. Resource Efficiency: `asyncio` uses cooperative multitasking which requires less system resources than `schedule`âs threading-based approach.
4. Better Error Handling: `Async/await` makes it easier to implement proper error handling and recovery
5. Flexibility: You can easily add or remove scraper tasks and modify their intervals without affecting other tasks
The code structure also makes it simple to extend functionality by adding more concurrent tasks or implementing additional error handling logic.
#### How to automate a web scraper with cron jobs?
Cron is a time-based job scheduler in Unix-like operating systems (Linux, macOS). Think of it as a digital scheduler or calendar that can automatically run programs at specified times. A cron job is simply a task that you schedule to run at specific intervals.
For web scraping, cron jobs are incredibly useful because they let you automate your scraper to run at predetermined times. For example, you could set up a cron job to:
- Run your scraper every hour to collect real-time data
- Execute scraping tasks during off-peak hours (like 2 AM) to minimize server load
- Collect data at specific times when websites update their content
The scheduling format uses five time fields: minute, hour, day of month, month, and day of week. For instance:
- `0 * * * *` means ârun every hour at minute 0â
- `0 0 * * *` means ârun at midnight every dayâ
- `*/15 * * * *` means ârun every 15 minutesâ
Cron jobs are especially reliable for web scraping because theyâre built into the operating system, use minimal resources, and continue running even after system reboots.
So, letâs run the `save_firecrawl_news_data()` function on a schedule using cron. First, we will create a dedicated script for the cron job named `cron_scraper.py`:
```python
# cron_scraper.py
import sys
import logging
from datetime import datetime
from pathlib import Path
from firecrawl_scraper import save_firecrawl_news_data
# Set up logging
log_dir = Path("logs")
log_dir.mkdir(exist_ok=True)
log_file = log_dir / f"scraper_{datetime.now().strftime('%Y_%m')}.log"
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
handlers=[logging.FileHandler(log_file), logging.StreamHandler(sys.stdout)],
)
def main():
try:
logging.info("Starting scraping job")
filename = save_firecrawl_news_data() # Actual scraping function
logging.info(f"Successfully saved data to {filename}")
except Exception as e:
logging.error(f"Scraping failed: {str(e)}", exc_info=True)
if __name__ == "__main__":
main()
```
The script implements a production-ready web scraper with logging and error handling. It creates a logs directory, configures detailed logging to both files and console, and wraps the scraping operation in error handling. When run, it executes our scraper function while tracking its progress and any potential issues, making it ideal for automated scheduling through cron jobs.
Now, to schedule this script using cron, you will need to make it executable:
```bash
chmod +x cron_scraper.py
```
Then, open your crontab file (which is usually empty) with Nano:
```bash
EDITOR=nano crontab -e
```
Then, add one or more entries specifying the frequency with which the scraper must run:
```python
# Run every minute
*/1 * * * * cd /absolute/path/to/project && /absolute/path/to/.venv/bin/python cron_scraper.py >> ~/cron.log 2>&1
# Run every hour
*/1 * * * * cd /absolute/path/to/project && /absolute/path/to/.venv/bin/python cron_scraper.py >> ~/cron.log 2>&1
```
The above cron job syntax consists of several parts:
The timing pattern â _/1_ \\*\\* \*â breaks down as follows:
- First `*/1`: Specifies every minute
- First `*`: Represents any hour
- Second `*`: Represents any day of the month
- Third `*`: Represents any month
- Fourth `*`: Represents any day of the week
After the timing pattern:
- `cd /absolute/path/to/project`: Changes to the project directory of your scraper
- `&&`: Chains commands, executing the next only if previous succeeds
- `/absolute/path/to/.venv/bin/python`: Specifies the Python interpreter path
- `cron_scraper.py`: The script to execute
- `>> ~/cron.log 2>&1`: Redirects both standard output (>>) and errors ( `2>&1`) to `cron.log`
For hourly execution, the same pattern applies but with `0 * * * *` timing to run at the start of each hour instead of every minute.
As soon as you save your crontab file with these commands, the schedule starts and you should see a `logs` directory in the same folder as your `cron_scraper.py`. It must look like this if you have been following along:

You can always check the status of your cron jobs with the following command as well:
```bash
tail -f ~/cron.log
```
To cancel a cron job, simply open your crontab file again and remove the line corresponding to the job.
#### How to automate a web scraper using Windows Task Scheduler?
Windows Task Scheduler is a built-in Windows tool that can automate running programs or scripts at specified times. Itâs a robust alternative to cron jobs for Windows users. Letâs set up our scraper to run automatically.
First, create a batch file ( `run_scraper.bat`) to run our Python script:
```python
@echo off
cd /d "C:\path\to\your\project"
call venv\Scripts\activate
python cron_scraper.py
deactivate
```
Then, to set up the task in Windows Task Scheduler:
- Open Task Scheduler (search âTask Schedulerâ in Windows search)
- Click âCreate Basic Taskâ in the right panel
- Follow the wizard:
- Name: âHacker News Scraperâ
- Description: âScrapes Hacker News hourlyâ
- Trigger: Choose when to run (e.g., âDailyâ)
- Action: âStart a programâ
- Program/script: Browse to your `run_scraper.bat`
- Start in: Your project directory
For more control over the task, you can modify its properties after creation:
- Double-click the task
- In the âTriggersâ tab, click âEditâ to set custom schedules
- Common scheduling options:
- Run every hour
- Run at specific times
- Run on system startup
- In the âSettingsâ tab, useful options include:
- âRun task as soon as possible after a scheduled start is missed"
- "Stop the task if it runs longer than X hoursâ
The Task Scheduler provides several advantages:
- Runs even when user is logged out
- Detailed history and logging
- Ability to run with elevated privileges
- Options for network conditions
- Retry logic for failed tasks
You can monitor your scheduled task through the Task Scheduler interface or check the logs we set up in `cron_scraper.py`.
## Automating Web Scrapers With GitHub Actions
One disadvantage all local scheduling methods have is that they rely on your local machine being powered on and connected to the internet. If your computer is turned off, loses power, or loses internet connectivity, your scheduled scraping tasks wonât run. This is where cloud-based solutions like GitHub Actions can provide more reliability and uptime for your web scraping workflows.
### What is GitHub Actions?
GitHub Actions is a continuous integration and deployment (CI/CD) platform provided by GitHub that allows you to automate various workflows directly from your [GitHub](https://github.com/) repository.
For web scraping, GitHub Actions provides a reliable way to schedule and run your scraping scripts in the cloud. You can define workflows using YAML files that specify when and how your scraper should run, such as on a regular schedule using cron syntax. This means your scraping jobs will continue running even when your local machine is off, as they execute on GitHubâs servers.
### Step 1: Setting Up a GitHub repository
To schedule our scrapers with GitHub actions, we first need a GitHub repository. Start by initializing Git in your current workspace:
```bash
# Initialize git in your project directory
git init
git add .
git commit -m "Initial commit"
```
Next, create a new public or private GitHub repository and add it as the remote:
```bash
# Create a new repo on GitHub.com, then:
git remote add origin https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
git branch -M main
git push -u origin main
```
Then, create the following directory structure:
```bash
mkdir -p .github/workflows
```
This directory will contain our GitHub Actions workflow files with YAML format. These files define how and when our scraping scripts should run. The workflows can be scheduled using cron syntax, triggered by events like pushes or pull requests, and can include multiple steps like installing dependencies and running scripts.
### Step 2: Creating a Workflow file
At this stage, create a new `scraper.yml` file inside `.github/workflows` and paste the following contents:
```yaml
name: Run Firecrawl Scraper
on:
schedule:
- cron: "0/5 * * * *" # Runs every five minute
workflow_dispatch: # Allows manual trigger
jobs:
scrape:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: "3.9"
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install pydantic firecrawl-py
- name: Run scraper
run: python firecrawl_scraper.py
env:
# Add any environment variables your scraper needs
FIRECRAWL_API_KEY: ${{ secrets.FIRECRAWL_API_KEY }}
- name: Commit and push if changes
run: |
git config --global user.name 'GitHub Actions Bot'
git config --global user.email 'actions@github.com'
git add .
git commit -m "Update scraped data" || exit 0
git push
```
Letâs break down the key components of this GitHub Actions workflow file:
Workflow name:
The workflow is named âRun Firecrawl Scraperâ which helps identify it in the GitHub Actions interface.
Triggers:
- Scheduled to run every 5 minutes using cron syntax `0/5 * * * *` (5 minutes is for debugging purposes, please change to hourly later)
- Can be manually triggered using `workflow_dispatch`
Job configuration:
- Runs on latest Ubuntu virtual environment
- Contains multiple sequential steps
Step details:
1. Checkout:
- Uses `actions/checkout@v3` to get repository code
2. Python Setup:
- Uses `actions/setup-python@v4`
- Configures Python 3.9 environment
3. Dependencies:
- Upgrades `pip`
- Installs required packages: `pydantic` and `firecrawl-py`
4. Scraper Execution:
- Runs `firecrawl_scraper.py`
- Uses `FIRECRAWL_API_KEY` from repository secrets
5. Committing the changes:
- Creates a commit persisting the downloaded data using GitHub Actions bot.
To run this action successfully, youâll need to store your Firecrawl API key in GitHub secrets. Navigate to your repositoryâs Settings, then Secrets and variables â Actions. Click the âNew repository secretâ button and add your API key, making sure to use the exact key name specified in `scraper.yml`.
After ensuring that everything is set up correctly, commit and push the latest changes to GitHub:
```bash
git add .
git commit -m "Add a workflow to scrape on a schedule"
```
Once you do, the workflow must show up in the Actions tab of your repository like below:

Click on the workflow name and press the âRun workflowâ button. This launches the action manually and starts the schedule. If you check in after some time, you should see more automatic runs and the results persisted in your repository:

Caution: I left the workflow running overnight (at five minute intervals) and was nastily surprised by 96 workflow runs the next day. Thankfully, GitHub actions are free (up to 2000 min/month) unlike AWS instances.
Now, unless you disable the workflow manually by clicking the three dots in the upper-right corner, the scraper continues running on the schedule you specified.
## Best Practices and Optimization
When scheduling web scrapers, following best practices ensures reliability, efficiency, and ethical behavior. Here are the key areas to consider:
### 1\. Rate limiting and delays
Scraping engines like Firecrawl usually come with built-in rate limiting. However, if you are using custom scrapers written with Python libraries, you must always respect website servers by implementing proper rate limiting and delay strategies. For example, the below example shows adding random delays between your requests in-between requests to respect server load and avoid getting your IP blocked:
```python
import time
import random
def scrape_with_delays(urls):
for url in urls:
try:
# Random delay between 2-5 seconds
delay = random.uniform(2, 5)
time.sleep(delay)
# Your scraping code here
response = requests.get(url)
except requests.RequestException as e:
logging.error(f"Error scraping {url}: {e}")
```
Best practices for rate limiting:
- Add random delays between requests (2-5 seconds minimum)
- Respect `robots.txt` directives
- Implement exponential backoff for retries
- Stay under 1 request per second for most sites
- Monitor response headers for rate limit information
### 2\. Proxy Rotation
This best practice is related to using custom web scrapers. Proxy rotation involves cycling through different IP addresses when making requests to avoid getting blocked. By distributing requests across multiple IPs, you can maintain access to websites that might otherwise flag high-volume traffic from a single source.
```python
from itertools import cycle
def get_proxy_pool():
proxies = [\
'http://proxy1:port',\
'http://proxy2:port',\
'http://proxy3:port'\
]
return cycle(proxies)
def scrape_with_proxies(url, proxy_pool):
for _ in range(3): # Max 3 retries
try:
proxy = next(proxy_pool)
response = requests.get(
url,
proxies={'http': proxy, 'https': proxy},
timeout=10
)
return response
except requests.RequestException:
continue
return None
```
Proxy best practices:
- Rotate IPs regularly
- Use high-quality proxy services
- Implement timeout handling
- Monitor proxy health
- Keep backup proxies ready
### 3\. Data Storage Strategies
The example scrapers we have built today saved the data to a JSON file, which is a simple method. Based on your needs, you may consider building a full data storage solution that saves the scraped data to various formats like as CSV files or to an SQL database. The storage type always depends on the kind of information scraped. Below is an example class that can save lists of dictionaries (like we scraped today) to a CSV file:
```python
from datetime import datetime
import json
import pandas as pd
class DataManager:
def __init__(self, base_path='data'):
self.base_path = Path(base_path)
self.base_path.mkdir(exist_ok=True)
def save_data(self, data, format='csv'):
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
if format == 'csv':
filename = self.base_path / f'data_{timestamp}.csv'
pd.DataFrame(data).to_csv(filename, index=False)
elif format == 'json':
filename = self.base_path / f'data_{timestamp}.json'
with open(filename, 'w') as f:
json.dump(data, f, indent=2)
# Cleanup old files (keep last 7 days)
self._cleanup_old_files(days=7)
return filename
def _cleanup_old_files(self, days):
# Implementation for cleaning up old files
pass
```
Storage recommendations:
- Use appropriate file formats (CSV/JSON/Database)
- Implement data versioning
- Regular cleanup of old data
- Compress historical data
- Consider using a database for large datasets
### 4\. Error Logging and Monitoring
Web scraping usually requires more sophisticated logging than plain old print statements. Save the following or bookmark it on your browser as it properly sets up a logger using the built-in `logging` module of Python:
```python
import logging
from pathlib import Path
def setup_logging():
log_dir = Path('logs')
log_dir.mkdir(exist_ok=True)
# File handler for detailed logs
file_handler = logging.FileHandler(
log_dir / f'scraper_{datetime.now().strftime("%Y%m%d")}.log'
)
file_handler.setLevel(logging.DEBUG)
# Console handler for important messages
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.INFO)
# Configure logging
logging.basicConfig(
level=logging.DEBUG,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
handlers=[file_handler, console_handler]
)
```
Monitoring best practices:
- Implement comprehensive logging
- Set up alerts for critical failures
- Monitor memory usage
- Track success rates
- Log response times
### 5\. Maintaining Your Scraper
Maintaining your web scraper is crucial for ensuring reliable and continuous data collection. Web scraping targets are dynamic - websites frequently update their structure, implement new security measures, or change their content organization. Regular maintenance helps catch these changes early, prevents scraping failures, and ensures your data pipeline remains robust. A well-maintained scraper also helps manage resources efficiently, keeps code quality high, and adapts to evolving requirements.
Maintenance guidelines:
- Regular code updates
- Monitor site changes
- Update user agents periodically
- Check for library updates
- Implement health checks
- Document maintenance procedures
## Conclusion
Throughout this guide, weâve explored how to effectively schedule web scrapers using local Python and operation system tools as well as GitHub actions. From basic setup to advanced optimization techniques, weâve covered the essential components needed to build reliable, automated data collection pipelines. The workflow weâve created not only handles the technical aspects of scheduling but also incorporates best practices for rate limiting, error handling, and data storage - crucial elements for any production-grade scraping system.
For those looking to enhance their web scraping capabilities further, I recommend exploring Firecrawlâs comprehensive features through their [/crawl endpoint guide](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl) and [/scrape endpoint tutorial](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint). These resources, along with the [official documentation](https://docs.firecrawl.dev/), provide deeper insights into advanced topics like JavaScript rendering, structured data extraction, and batch operations that can significantly improve your web scraping workflows. Whether youâre building training datasets for AI models or monitoring websites for changes, combining scheduled scraping with these powerful tools can help you build more sophisticated and efficient data collection systems.
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Bex Tuychiev@bextuychiev](https://x.com/bextuychiev)
Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics
### More articles by Bex Tuychiev
[Building an Automated Price Tracking Tool\\
\\
Build an automated e-commerce price tracker in Python. Learn web scraping, price monitoring, and automated alerts using Firecrawl, Streamlit, PostgreSQL.](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python) [Web Scraping Automation: How to Run Scrapers on a Schedule\\
\\
Learn how to automate web scraping in Python using free tools like schedule, asyncio, cron jobs and GitHub Actions. This comprehensive guide covers local and cloud-based scheduling methods to run scrapers reliably in 2025.](https://www.firecrawl.dev/blog/automated-web-scraping-free-2025) [Automated Data Collection - A Comprehensive Guide\\
\\
Learn how to build robust automated data collection systems using modern tools and best practices. This guide covers everything from selecting the right tools to implementing scalable collection pipelines.](https://www.firecrawl.dev/blog/automated-data-collection-guide) [BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python\\
\\
Learn the key differences between BeautifulSoup4 and Scrapy for web scraping in Python. Compare their features, performance, and use cases to choose the right tool for your web scraping needs.](https://www.firecrawl.dev/blog/beautifulsoup4-vs-scrapy-comparison) [How to Build an Automated Competitor Price Monitoring System with Python\\
\\
Learn how to build an automated competitor price monitoring system in Python that tracks prices across e-commerce sites, provides real-time comparisons, and maintains price history using Firecrawl, Streamlit, and GitHub Actions.](https://www.firecrawl.dev/blog/automated-competitor-price-scraping) [Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude\\
\\
Learn how to build a web scraper in Python that gathers company details, funding rounds, and investor information from public sources like Crunchbase using Firecrawl and Claude for automated data collection and analysis.](https://www.firecrawl.dev/blog/crunchbase-scraping-with-firecrawl-claude) [How to Create Custom Instruction Datasets for LLM Fine-tuning\\
\\
Learn how to build high-quality instruction datasets for fine-tuning large language models (LLMs). This guide covers when to create custom datasets, best practices for data collection and curation, and a practical example of building a code documentation dataset.](https://www.firecrawl.dev/blog/custom-instruction-datasets-llm-fine-tuning) [Data Enrichment: A Complete Guide to Enhancing Your Data Quality\\
\\
Learn how to enrich your data quality with a comprehensive guide covering data enrichment tools, best practices, and real-world examples. Discover how to leverage modern solutions like Firecrawl to automate data collection, validation, and integration for better business insights.](https://www.firecrawl.dev/blog/complete-guide-to-data-enrichment)
## Job Board Scraping Guide
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Sept 27, 2024
â˘
[Eric Ciarla](https://x.com/ericciarla)
# Scraping Job Boards Using Firecrawl Actions and OpenAI

Scraping job boards to extract structured data can be a complex task, especially when dealing with dynamic websites and unstructured content. In this guide, weâll walk through how to use [Firecrawl Actions](https://firecrawl.dev/) and OpenAI models to efficiently scrape job listings and extract valuable information.
### Why Use Firecrawl and OpenAI?
- **Firecrawl** simplifies web scraping by handling dynamic content and providing actions like clicking and scrolling.
- **OpenAIâs `o1` and `4o` models** excel at understanding and extracting structured data from unstructured text. `o1` is best for more complex reasoning tasks while `4o` is best for speed and cost.
### Prerequisites
- Python 3.7 or higher
- API keys for both [Firecrawl](https://firecrawl.dev/) and [OpenAI](https://openai.com/)
- Install required libraries:
```bash
pip install requests python-dotenv openai
```
### Step 1: Set Up Your Environment
Create a `.env` file in your project directory and add your API keys:
```
FIRECRAWL_API_KEY=your_firecrawl_api_key
OPENAI_API_KEY=your_openai_api_key
```
### Step 2: Initialize API Clients
```python
import os
import requests
import json
from dotenv import load_dotenv
import openai
# Load environment variables
load_dotenv()
# Initialize API keys
firecrawl_api_key = os.getenv("FIRECRAWL_API_KEY")
openai.api_key = os.getenv("OPENAI_API_KEY")
```
### Step 3: Define the Jobs Page URL and Resume
Specify the URL of the jobs page you want to scrape and provide your resume for matching.
```python
# URL of the jobs page to scrape
jobs_page_url = "https://openai.com/careers/search"
# Candidate's resume (as a string)
resume_paste = """
[Your resume content here]
"""
```
### Step 4: Scrape the Jobs Page Using Firecrawl
We use Firecrawl to scrape the jobs page and extract the HTML content.
```python
try:
response = requests.post(
"https://api.firecrawl.dev/v1/scrape",
headers={
"Content-Type": "application/json",
"Authorization": f"Bearer {firecrawl_api_key}"
},
json={
"url": jobs_page_url,
"formats": ["markdown"]
}
)
if response.status_code == 200:
result = response.json()
if result.get('success'):
html_content = result['data']['markdown']
# Prepare the prompt for OpenAI
prompt = f"""
Extract up to 30 job application links from the given markdown content.
Return the result as a JSON object with a single key 'apply_links' containing an array of strings (the links).
The output should be a valid JSON object, with no additional text.
Markdown content:
{html_content[:100000]}
"""
else:
html_content = ""
else:
html_content = ""
except Exception as e:
html_content = ""
```
### Step 5: Extract Apply Links Using OpenAIâs `gpt-4o` Model
We use OpenAIâs `gpt-4o` model to parse the scraped content and extract application links.
```python
# Extract apply links using OpenAI
apply_links = []
if html_content:
try:
completion = openai.ChatCompletion.create(
model="gpt-4o",
messages=[\
{\
"role": "user",\
"content": prompt\
}\
]
)
if completion.choices:
result = json.loads(completion.choices[0].message.content.strip())
apply_links = result['apply_links']
except Exception as e:
pass
```
### Step 6: Extract Job Details from Each Apply Link
We iterate over each apply link and use Firecrawlâs extraction capabilities to get job details.
```python
# Initialize a list to store job data
extracted_data = []
# Define the extraction schema
schema = {
"type": "object",
"properties": {
"job_title": {"type": "string"},
"sub_division_of_organization": {"type": "string"},
"key_skills": {"type": "array", "items": {"type": "string"}},
"compensation": {"type": "string"},
"location": {"type": "string"},
"apply_link": {"type": "string"}
},
"required": ["job_title", "sub_division_of_organization", "key_skills", "compensation", "location", "apply_link"]
}
# Extract job details for each link
for link in apply_links:
try:
response = requests.post(
"https://api.firecrawl.dev/v1/scrape",
headers={
"Content-Type": "application/json",
"Authorization": f"Bearer {firecrawl_api_key}"
},
json={
"url": link,
"formats": ["extract"],
"actions": [{\
"type": "click",\
"selector": "#job-overview"\
}],
"extract": {
"schema": schema
}
}
)
if response.status_code == 200:
result = response.json()
if result.get('success'):
extracted_data.append(result['data']['extract'])
except Exception as e:
pass
```
### Step 7: Match Jobs to Your Resume Using OpenAIâs `o1` Model
We use OpenAIâs `o1` model to analyze your resume and recommend the top 3 job listings.
```python
# Prepare the prompt
prompt = f"""
Please analyze the resume and job listings, and return a JSON list of the top 3 roles that best fit the candidate's experience and skills. Include only the job title, compensation, and apply link for each recommended role. The output should be a valid JSON array of objects in the following format:
[\
{\
"job_title": "Job Title",\
"compensation": "Compensation",\
"apply_link": "Application URL"\
},\
...\
]
Based on the following resume:
{resume_paste}
And the following job listings:
{json.dumps(extracted_data, indent=2)}
"""
# Get recommendations from OpenAI
completion = openai.ChatCompletion.create(
model="o1-preview",
messages=[\
{\
"role": "user",\
"content": prompt\
}\
]
)
# Extract recommended jobs
recommended_jobs = json.loads(completion.choices[0].message.content.strip())
```
### Step 8: Output the Recommended Jobs
Finally, we can print or save the recommended jobs.
```python
# Output the recommended jobs
print(json.dumps(recommended_jobs, indent=2))
```
### Full Code Example on GitHub
You can find the full code example [on GitHub](https://github.com/mendableai/firecrawl/tree/main/examples/o1_job_recommender).
### Conclusion
By following this guide, youâve learned how to:
- **Scrape dynamic job boards** using Firecrawl.
- **Extract structured data** from web pages with custom schemas.
- **Leverage OpenAIâs models** to parse content and make intelligent recommendations.
This approach can be extended to other websites and data extraction tasks, providing a powerful toolset for automating data collection and analysis.
### References
- [Firecrawl Documentation](https://docs.firecrawl.dev/)
- [OpenAI API Reference](https://platform.openai.com/docs/api-reference/introduction)
Thatâs it! Youâve now built a pipeline to scrape job boards and find the best job matches using Firecrawl and OpenAI. Happy coding!
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Eric Ciarla@ericciarla](https://x.com/ericciarla)
Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB.
Previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users.
### More articles by Eric Ciarla
[How to Create an llms.txt File for Any Website\\
\\
Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [Cloudflare Error 1015: How to solve it?\\
\\
Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner.](https://www.firecrawl.dev/blog/cloudflare-error-1015-how-to-solve-it) [Build an agent that checks for website contradictions\\
\\
Using Firecrawl and Claude to scrape your website's data and look for contradictions.](https://www.firecrawl.dev/blog/contradiction-agent) [Why Companies Need a Data Strategy for Generative AI\\
\\
Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses\\
\\
A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [How to easily install requests with pip and python\\
\\
A tutorial on installing the requests library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-easily-install-requests-with-pip-and-python) [How to quickly install BeautifulSoup with Python\\
\\
A guide on installing the BeautifulSoup library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python) [How to Use OpenAI's o1 Reasoning Models in Your Applications\\
\\
Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications)
## LLM Extraction for Insights
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
May 21, 2024
â˘
[Caleb Peffer](https://x.com/CalebPeffer)
# Using LLM Extraction for Customer Insights

### Introduction
Understanding our customers - not just who they are, but what they doâis crucial to tailoring our products and services effectively. When running a self-serve motion, you have so many customers come in the door with little to no knowledge of them. The process of proactively understanding who these folks are has traditionally been time-intensive, involving manual data collection and analysis to gather actionable insights.
However, with the power of LLMs and their capacity for advanced data extraction, weâve automated this process. Using LLM extraction and analysis of customer data, LLM weâve significantly reduced our workload, allowing us to understand and serve our customer base more effectively than ever before.
If you have limited technical knowledge, you can build an automation that gets targeted information about your customers for the purposes of product direction and lead gen. Hereâs how you can do this yourself with [Make](https://make.com/) and [Firecrawl](https://www.firecrawl.dev/).
* * *
### Overview of the Tools
**Firecrawl**
Firecrawl is a platform for scraping, search, and extraction. It allows you to take data from the web and translate it into LLM-legible markdown or structured data.
When we want to get information about our customers, we can use Firecrawlâs LLM extraction functionality to specify the specific information we want from their websites.
**Make.com (formerly Integromat)**
Make is an automation platform that allows users to create customized workflows to connect various apps and services without needing deep technical knowledge. It uses a visual interface where users can drag and drop elements to design their automations.
We can use Make to connect a spreadsheet of user data to Firecrawl, allowing us to do extraction with just a bit of JSON.
###Â Setting Up the Scenario
- Step-by-step guide on setting up the data extraction process.
- **Connecting Google Sheets to Make.com**
- How user data is initially collected and stored.
- **Configuring the HTTP Request in Make.com**
- Description of setting up API requests to Firecrawl.
- Purpose of these requests (e.g., extracting company information).
### Preparing our Data
Before we get started, we want to make sure we prepare our data for Firecrawl. In this case, I created a simple spreadsheet with imported users from our database. We want to take the email domains of our users and transform them into links using the https:// format:

We also want to add some attributes that weâd like to know about these companies. For me, I want to understand a bit about the company, their industry, and their customers. Iâve set these in columns as:
company\_description
company\_type
who\_they\_serve
Now that we have our data prepared, we can start setting up our automation in Make!
## Setting up our automation
To get our automation running, we simply need to follow a three step process in Make. Here, we will choose three apps in our scenario:
Google Sheets - Get range values
HTTP - Make an API key auth request
Google Sheets - Update a row
Weâll also want to add the ignore flow control tool in case we run into any errors. This will keep the automation going.

This automation will allow us to extract a set of links from our spreadsheet, send them to Firecrawl for data extraction, then repopulate our spreadsheet with the desired info.
Letâs start by configuring our first app. Our goal is to export all of the URLs so that we can send them to Firecrawl for extraction. Here is the configuration for pulling these URLs:

\* _Important_ \- we want to make sure we start pulling data from the second row. If you include the header, you will eventually run into an error.
* * *
Great! Now that we have that configured, we want to prepare to set up our HTTP request. To do this, we will go to [https://firecrawl.dev](https://firecrawl.dev/) to sign up and get our API key (you can get started for free!). Once you sign up, you can go to [https://firecrawl.dev/account](https://firecrawl.dev/account) to see your API key.
We will be using Firecrawlâs Scrape Endpoint. This endpoint will allow us to pull information from a single URL, translate it into clean markdown, and use it to extract the data we need. I will be filling out all the necessary conditions in our Make HTTP request using the API reference in their documentation.
Now in Make, I configure the API call using the documentation from Firecrawl. We will be using POST as the HTTP method and have two headers.
```
Header 1:
Name: Authorization
Value: Bearer your-API-key
Header 2:
Name: Content-Type
Value: application/json
```

We also want to set our body and content types. Here we will do:
```
Body type: Raw
Content type: Json (application/json)
```
We will also click âyesâ for parsing our response. This will automatically parse our response into JSON.
The request content is the main meat of what we want to achieve. Here is the request content we will use for this use case:
```
{
"url": "1. url(B)",
"pageOptions": {
"onlyMainContent": true
},
"extractorOptions": {
"mode": "llm-extraction",
"extractionPrompt": "Extract the company description (in one sentence explain what the company does), company industry (software, services, AI, etc.) - this really should just be a tag with a couple keywords, and who they serve (who are their customers). If there is no clear information to answer the question, write 'no info'.",
"extractionSchema": {
"type": "object",
"properties": {
"company_description": {
"type": "string"
},
"company_industry": {
"type": "string"
},
"who_they_serve": {
"type": "string"
}
},
"required": [\
"company_description",\
"company_industry",\
"who_they_serve"\
]
}
}
}
```

\* _Note_ the green field in the screenshot is a dynamic item that you can choose in the Make UI. Instead of `url (B)`, the block may be the first URL in your data.

Fantastic! Now we have configured our HTTP request. Letâs test it to make sure everything is working as it should be. Click ârun onceâ in Make and we should be getting data back.

When we run, letâs check our first operation. In the output, we should be getting a `status code: 200`, meaning that our API request was successful. In the output, click on data to make sure we got the data we needed.

Our output looks successful! In the llm\_extraction we are seeing the three attributes of data that we wanted from the website.
\* _Note_ if you are getting a `500` error on your first operation and `200` responses on the subsequent ones, this may be because the operation is trying to be performed on the first row of your data (the header row). This will cause issues importing the data back into sheets! Make sure you start from the second row as mentioned before.
Now that we know the HTTP request is working correctly, all thatâs left is to take the outputted JSON from Firecrawl and put it back into our spreadsheet.
* * *
Now we need to take our extracted data and put it back into our spreadsheet. To do this, we will take the outputted JSON from our HTTP request and export the text into the relevant tables.
Letâs start by connecting the same google sheet and specifying the Row Number criteria. Here we will just use the Make UI to choose ârow numberâ

All thatâs left is to specify which LLM extracted data goes into which column. Here, we can simply use the UI in Make to set this up.

Thatâs it, now itâs time to test our automation!
* * *
Letâs click `run once` on the Make UI and make sure everything is running smoothly. The automation should start iterating through link-by-link and populating our spreadsheet in real time.

We have success! Using Make and Firecrawl, we have been able to extract specific information about our customers without the need of manually going to each of their websites.
Looking at the data, we are starting to get a better understanding of our customers. However, we are not limited to these specific characteristics. If we want, we can customize our JSON and Extraction Prompt to find out other information about these companies.
### Use Cases
LLM extraction allows us to quickly get specific information from the web thatâs relevant to our business. We can use these automations to do a variety of tasks.
**Product:**
Especially for self-serve companies, we can understand the trends in industries using our product. What are the top 2-3 industries using our tech and what are they using it for? This will allow us to make better product decisions by prioritizing the right customers to focus on.
**Business Development:**
By understanding who our users are, we can look for similar companies who could benefit from our product as well. By doing a similar automation, we can extract positive indicators from prospects that would benefit from our product.
We can also use this data to generate better outreach emails that are more specific to the individual prospect.
**Market Research:**
Market research firms spend tons of time doing secondary research, especially in niche sectors. We can streamline data collection by automating the extraction and organization of data from diverse sources. This automation helps boost efficiency and scales with growing data needs, making it a valuable tool for strategic decision-making in fast-evolving industries.
### Going a step further
This was just a simple example of how we can use LLMs to extract relevant data from websites using a static spreadsheet. You can always make this more advanced by connecting this dynamically to your sign ups. Additionally, you could connect this to other tools to further accelerate your productivity. For example, using the extracted content to generate more personalized copy for prospecting.
If you found this useful, feel free to let me know! Iâd love to hear your feedback or learn about what youâre building. You can reach me at [garrett@mendable.ai](mailto:garrett@mendable.ai). Good luck and happy building!
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Caleb Peffer@CalebPeffer](https://x.com/CalebPeffer)
Caleb Peffer is the Chief Executive Officer (CEO) of Firecrawl. Previously, built and scaled Mendable, an innovative "chat with your documents" application,
and sold it to major customers like Snapchat, Coinbase, and MongoDB. Also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users.
Caleb has a passion for building products that help people do their best work. Caleb studied Computer Science and has over 10 years of experience in software engineering.
### More articles by Caleb Peffer
[Using LLM Extraction for Customer Insights\\
\\
Using LLM Extraction for Insights and Lead Generation using Make and Firecrawl.](https://www.firecrawl.dev/blog/lead-gen-business-insights-make-firecrawl)
## Crawl Webhooks Introduction
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
September 1, 2024
â˘
[Nicolas Camara](https://x.com/nickscamara_)
# Launch Week I / Day 7: Crawl Webhooks (v1)

Welcome to Day 7 of Firecrawlâs Launch Week! Weâre excited to introduce new /crawl webhook support.
## Crawl Webhook
You can now pass a `webhook` parameter to the `/crawl` endpoint. This will send a POST request to the URL you specify when the crawl is started, updated and completed.
The webhook will now trigger for every page crawled and not just the whole result at the end.

### Webhook Events
There are now 4 types of events:
- `crawl.started` \- Triggered when the crawl is started.
- `crawl.page` \- Triggered for every page crawled.
- `crawl.completed` \- Triggered when the crawl is completed to let you know itâs done.
- `crawl.failed` \- Triggered when the crawl fails.
### Webhook Response
- `success` \- If the webhook was successful in crawling the page correctly.
- `type` \- The type of event that occurred.
- `id` \- The ID of the crawl.
- `data` \- The data that was scraped (Array). This will only be non empty on `crawl.page` and will contain 1 item if the page was scraped successfully. The response is the same as the `/scrape` endpoint.
- `error` \- If the webhook failed, this will contain the error message.
## Learn More
Learn more about the webhook in our [documentation](https://docs.firecrawl.dev/features/crawl#crawl-webhook).
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Nicolas Camara@nickscamara\_](https://x.com/nickscamara_)
Nicolas Camara is the Chief Technology Officer (CTO) at Firecrawl.
He previously built and scaled Mendable, one of the pioneering "chat with your documents" apps,
which had major Fortune 500 customers like Snapchat, Coinbase, and MongoDB.
Prior to that, Nicolas built SideGuide, the first code-learning tool inside VS Code,
and grew a community of 50,000 users. Nicolas studied Computer Science and has over 10 years of experience in building software.
### More articles by Nicolas Camara
[Using OpenAI's Realtime API and Firecrawl to Talk with Any Website\\
\\
Build a real-time conversational agent that interacts with any website using OpenAI's Realtime API and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Talk-with-Any-Website-Using-OpenAIs-Realtime-API-and-Firecrawl) [Extract website data using LLMs\\
\\
Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code.](https://www.firecrawl.dev/blog/data-extraction-using-llms) [Getting Started with Grok-2: Setup and Web Crawler Example\\
\\
A detailed guide on setting up Grok-2 and building a web crawler using Firecrawl.](https://www.firecrawl.dev/blog/grok-2-setup-and-web-crawler-example) [Launch Week I / Day 6: LLM Extract (v1)\\
\\
Extract structured data from your web pages using the extract format in /scrape.](https://www.firecrawl.dev/blog/launch-week-i-day-6-llm-extract) [Launch Week I / Day 7: Crawl Webhooks (v1)\\
\\
New /crawl webhook support. Send notifications to your apps during a crawl.](https://www.firecrawl.dev/blog/launch-week-i-day-7-webhooks) [OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website\\
\\
A guide to building a multi-agent system using OpenAI Swarm and Firecrawl for AI-driven marketing strategies](https://www.firecrawl.dev/blog/openai-swarm-agent-tutorial) [Build a 'Chat with website' using Groq Llama 3\\
\\
Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.](https://www.firecrawl.dev/blog/chat-with-website) [Scrape and Analyze Airbnb Data with Firecrawl and E2B\\
\\
Learn how to scrape and analyze Airbnb data using Firecrawl and E2B in a few lines of code.](https://www.firecrawl.dev/blog/scrape-analyze-airbnb-data-with-e2b)
## OpenAI Predicted Outputs Guide
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Nov 5, 2024
â˘
[Eric Ciarla](https://x.com/ericciarla)
# Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses

Leveraging the full potential of Large Language Models (LLMs) often involves balancing between response accuracy and latency. OpenAIâs new Predicted Outputs feature introduces a way to significantly reduce response times by informing the model about the expected output in advance.
In this article, weâll explore how to use Predicted Outputs with the GPT-4o and GPT-4o-mini models to make your AI applications super fast đ. Weâll also provide a practical example of transforming blog posts into SEO-optimized content, a powerful use case enabled by this feature.
### What Are Predicted Outputs?
Predicted Outputs allow you to provide the LLM with an anticipated output, especially useful when most of the response is known ahead of time. For tasks like rewriting text with minor modifications, this can drastically reduce the time it takes for the model to generate the desired result.
### Why Use Predicted Outputs?
By supplying the model with a prediction of the output, you:
- **Reduce Latency:** The model can process and generate responses faster because it doesnât need to generate the entire output from scratch.
- **Enhance Efficiency:** Useful when you can reasonably assume that large portions of the output will remain unchanged.
### Limitations to Keep in Mind
While Predicted Outputs are powerful, there are some limitations:
- Supported only with **GPT-4o** and **GPT-4o-mini** models.
- Certain API parameters are not supported, such as `n` values greater than 1, `logprobs`, `presence_penalty` greater than 0, among others.
### How to Use Predicted Outputs
Letâs dive into how you can implement Predicted Outputs in your application. Weâll walk through an example where we optimize a blog post by adding internal links to relevant pages within the same website.
#### Prerequisites
Make sure you have the following installed:
```bash
pip install firecrawl-py openai
```
#### Step 1: Set Up Your Environment
Initialize the necessary libraries and load your API keys.
```python
import os
import json
from firecrawl import FirecrawlApp
from dotenv import load_dotenv
from openai import OpenAI
# Load environment variables
load_dotenv()
# Retrieve API keys from environment variables
firecrawl_api_key = os.getenv("FIRECRAWL_API_KEY")
openai_api_key = os.getenv("OPENAI_API_KEY")
# Initialize the FirecrawlApp and OpenAI client
app = FirecrawlApp(api_key=firecrawl_api_key)
client = OpenAI(api_key=openai_api_key)
```
#### Step 2: Scrape the Blog Content
Weâll start by scraping the content of a blog post that we want to optimize.
```python
# Get the blog URL (you can input your own)
blog_url = "https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications"
# Scrape the blog content in markdown format
blog_scrape_result = app.scrape_url(blog_url, params={'formats': ['markdown']})
blog_content = blog_scrape_result.get('markdown', '')
```
#### Step 3: Map the Website for Internal Links
Next, weâll get a list of other pages on the website to which we can add internal links.
```python
# Extract the top-level domain
top_level_domain = '/'.join(blog_url.split('/')[:3])
# Map the website to get all internal links
site_map = app.map_url(top_level_domain)
site_links = site_map.get('links', [])
```
#### Step 4: Prepare the Prompt and Prediction
Weâll create a prompt instructing the model to add internal links to the blog post and provide the original content as a prediction.
```python
prompt = f"""
You are an AI assistant helping to improve a blog post.
Here is the original blog post content:
{blog_content}
Here is a list of other pages on the website:
{json.dumps(site_links, indent=2)}
Please revise the blog post to include internal links to some of these pages where appropriate. Make sure the internal links are relevant and enhance the content.
Only return the revised blog post in markdown format.
"""
```
#### Step 5: Use Predicted Outputs with the OpenAI API
Now, weâll call the OpenAI API using the `prediction` parameter to provide the existing content.
```python
completion = client.chat.completions.create(
model="gpt-4o-mini",
messages=[\
{\
"role": "user",\
"content": prompt\
}\
],
prediction={
"type": "content",
"content": blog_content
}
)
revised_blog_post = completion.choices[0].message.content
```
#### Step 6: Compare the Original and Revised Content
Finally, weâll compare the number of links in the original and revised blog posts to see the improvements.
```python
import re
def count_links(markdown_content):
return len(re.findall(r'\[.*?\]\(.*?\)', markdown_content))
original_links_count = count_links(blog_content)
revised_links_count = count_links(revised_blog_post)
print(f"Number of links in the original blog post: {original_links_count}")
print(f"Number of links in the revised blog post: {revised_links_count}")
```
### Conclusion
By utilizing Predicted Outputs, you can significantly speed up tasks where most of the output is known, such as content reformatting or minor edits. This feature is a game-changer for developers looking to optimize performance without compromising on the quality of the output.
Thatâs it! In this article, weâve shown you how to get started with Predicted Outputs using OpenAIâs GPT-4o models. Whether youâre transforming content, correcting errors, or making minor adjustments, Predicted Outputs can make your AI applications faster and more efficient.
### References
- [Using Predicted Outputs](https://platform.openai.com/docs/guides/latency-optimization#use-predicted-outputs)
- [Firecrawl Documentation](https://www.firecrawl.dev/docs)
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Eric Ciarla@ericciarla](https://x.com/ericciarla)
Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB.
Previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users.
### More articles by Eric Ciarla
[How to Create an llms.txt File for Any Website\\
\\
Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [Cloudflare Error 1015: How to solve it?\\
\\
Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner.](https://www.firecrawl.dev/blog/cloudflare-error-1015-how-to-solve-it) [Build an agent that checks for website contradictions\\
\\
Using Firecrawl and Claude to scrape your website's data and look for contradictions.](https://www.firecrawl.dev/blog/contradiction-agent) [Why Companies Need a Data Strategy for Generative AI\\
\\
Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses\\
\\
A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [How to easily install requests with pip and python\\
\\
A tutorial on installing the requests library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-easily-install-requests-with-pip-and-python) [How to quickly install BeautifulSoup with Python\\
\\
A guide on installing the BeautifulSoup library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python) [How to Use OpenAI's o1 Reasoning Models in Your Applications\\
\\
Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications)
## Scrape Airbnb Data
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
May 23, 2024
â˘
[Nicolas Camara](https://x.com/nickscamara_)
# Scrape and Analyze Airbnb Data with Firecrawl and E2B

This cookbook demonstrates how to scrape Airbnb data and analyze it using [Firecrawl](https://www.firecrawl.dev/) and the [Code Interpreter SDK](https://github.com/e2b-dev/code-interpreter) from E2B.
Feel free to clone the [Github Repository](https://github.com/e2b-dev/e2b-cookbook) or follow along with the steps below.
## Prerequisites
- Node.js installed on your machine
- Get [E2B API key](https://e2b.dev/docs/getting-started/api-key)
- Get [Firecrawl API key](https://firecrawl.dev/)
- Get [Anthropic API key](https://anthropic.com/)
## Setup
Start by creating a new directory and initializing a new Node.js typescript project:
```bash
mkdir airbnb-analysis
cd airbnb-analysis
npm init -y
```
Next, install the required dependencies:
```bash
npm install @anthropic-ai/sdk @e2b/code-interpreter @mendable/firecrawl-js
```
And dev dependencies:
```bash
npm install --save-dev @types/node prettier tsx typescript dotenv zod
```
## Create a `.env` file
Create a `.env` file in the root of your project and add the following environment variables:
```bash
# TODO: Get your E2B API key from https://e2b.dev/docs
E2B_API_KEY=""
# TODO: Get your Firecrawl API key from https://firecrawl.dev
FIRECRAWL_API_KEY=""
# TODO: Get your Anthropic API key from https://anthropic.com
ANTHROPIC_API_KEY=""
```
## Scrape Airbnb data with Firecrawl
Create a new file `scraping.ts`.
### Creating the scraping function
```typescript
import * as fs from "fs";
import FirecrawlApp from "@mendable/firecrawl-js";
import "dotenv/config";
import { config } from "dotenv";
import { z } from "zod";
```
2. Letâs define our `scrapeAirbnb` function which uses Firecrawl to scrape Airbnb listings. We will use Firecrawlâs LLM Extract to try to get the pagination links and then scrape each page in parallel to get the listings. We will save to a JSON file so we can analyze it later and not have to re-scrape.
```typescript
export async function scrapeAirbnb() {
try {
// Initialize the FirecrawlApp with your API key
const app = new FirecrawlApp({
apiKey: process.env.FIRECRAWL_API_KEY,
});
// Define the URL to crawl
const listingsUrl =
"https://www.airbnb.com/s/San-Francisco--CA--United-States/homes";
const baseUrl = "https://www.airbnb.com";
// Define schema to extract pagination links
const paginationSchema = z.object({
page_links: z
.array(
z.object({
link: z.string(),
}),
)
.describe("Pagination links in the bottom of the page."),
});
const params2 = {
pageOptions: {
onlyMainContent: false,
},
extractorOptions: {
extractionSchema: paginationSchema,
},
timeout: 50000, // if needed, sometimes airbnb stalls...
};
// Start crawling to get pagination links
const linksData = await app.scrapeUrl(listingsUrl, params2);
console.log(linksData.data["llm_extraction"]);
let paginationLinks = linksData.data["llm_extraction"].page_links.map(
(link) => baseUrl + link.link,
);
// Just in case is not able to get the pagination links
if (paginationLinks.length === 0) {
paginationLinks = [listingsUrl];
}
// Define schema to extract listings
const schema = z.object({
listings: z
.array(
z.object({
title: z.string(),
price_per_night: z.number(),
location: z.string(),
rating: z.number().optional(),
reviews: z.number().optional(),
}),
)
.describe("Airbnb listings in San Francisco"),
});
const params = {
pageOptions: {
onlyMainContent: false,
},
extractorOptions: {
extractionSchema: schema,
},
};
// Function to scrape a single URL
const scrapeListings = async (url) => {
const result = await app.scrapeUrl(url, params);
return result.data["llm_extraction"].listings;
};
// Scrape all pagination links in parallel
const listingsPromises = paginationLinks.map((link) =>
scrapeListings(link),
);
const listingsResults = await Promise.all(listingsPromises);
// Flatten the results
const allListings = listingsResults.flat();
// Save the listings to a file
fs.writeFileSync(
"airbnb_listings.json",
JSON.stringify(allListings, null, 2),
);
// Read the listings from the file
const listingsData = fs.readFileSync("airbnb_listings.json", "utf8");
return listingsData;
} catch (error) {
console.error("An error occurred:", error.message);
}
}
```
### Creating the code interpreter
Letâs now prepare our code interepreter to analyze the data. Create a new file `codeInterpreter.ts`.
This is where we will use the E2B Code Interpreter SDK to safely run the code that the LLM will generate and get its output.
```typescript
import { CodeInterpreter } from "@e2b/code-interpreter";
export async function codeInterpret(
codeInterpreter: CodeInterpreter,
code: string,
) {
console.log(
`\n${"=".repeat(50)}\n> Running following AI-generated code:
\n${code}\n${"=".repeat(50)}`,
);
const exec = await codeInterpreter.notebook.execCell(code, {
// You can stream logs from the code interpreter
// onStderr: (stderr: string) => console.log("\n[Code Interpreter stdout]", stderr),
// onStdout: (stdout: string) => console.log("\n[Code Interpreter stderr]", stdout),
//
// You can also stream additional results like charts, images, etc.
// onResult: ...
});
if (exec.error) {
console.log("[Code Interpreter error]", exec.error); // Runtime error
return undefined;
}
return exec;
}
```
### Preparing the model prompt and tool execution
Create a file called `model.ts` that will contain the prompts, model names and the tools for execution.
```typescript
import { Tool } from "@anthropic-ai/sdk/src/resources/beta/tools";
export const MODEL_NAME = "claude-3-opus-20240229";
export const SYSTEM_PROMPT = `
## your job & context
you are a python data scientist. you are given tasks to complete and you run python code to solve them.
- the python code runs in jupyter notebook.
- every time you call \`execute_python\` tool, the python code is executed in a separate cell. it's okay to multiple calls to \`execute_python\`.
- display visualizations using matplotlib or any other visualization library directly in the notebook. don't worry about saving the visualizations to a file.
- you have access to the internet and can make api requests.
- you also have access to the filesystem and can read/write files.
- you can install any pip package (if it exists) if you need to but the usual packages for data analysis are already preinstalled.
- you can run any python code you want, everything is running in a secure sandbox environment.
`;
export const tools: Tool[] = [\
{\
name: "execute_python",\
description:\
"Execute python code in a Jupyter notebook cell and returns any result, stdout, stderr, display_data, and error.",\
input_schema: {\
type: "object",\
properties: {\
code: {\
type: "string",\
description: "The python code to execute in a single cell.",\
},\
},\
required: ["code"],\
},\
},\
];
```
### Putting it all together
Create a file `index.ts` to run the scraping and analysis. Here we will load the scraped data, send it to the LLM model, and then interpret the code generated by the model.
```typescript
import * as fs from "fs";
import "dotenv/config";
import { CodeInterpreter, Execution } from "@e2b/code-interpreter";
import Anthropic from "@anthropic-ai/sdk";
import { Buffer } from "buffer";
import { MODEL_NAME, SYSTEM_PROMPT, tools } from "./model";
import { codeInterpret } from "./codeInterpreter";
import { scrapeAirbnb } from "./scraping";
const anthropic = new Anthropic();
/**
* Chat with Claude to analyze the Airbnb data
*/
async function chat(
codeInterpreter: CodeInterpreter,
userMessage: string,
): Promise {
console.log("Waiting for Claude...");
const msg = await anthropic.beta.tools.messages.create({
model: MODEL_NAME,
system: SYSTEM_PROMPT,
max_tokens: 4096,
messages: [{ role: "user", content: userMessage }],
tools,
});
console.log(
`\n${"=".repeat(50)}\nModel response:
${msg.content}\n${"=".repeat(50)}`,
);
console.log(msg);
if (msg.stop_reason === "tool_use") {
const toolBlock = msg.content.find((block) => block.type === "tool_use");
const toolName = toolBlock?.name ?? "";
const toolInput = toolBlock?.input ?? "";
console.log(
`\n${"=".repeat(50)}\nUsing tool:
${toolName}\n${"=".repeat(50)}`,
);
if (toolName === "execute_python") {
const code = toolInput.code;
return codeInterpret(codeInterpreter, code);
}
return undefined;
}
}
/**
* Main function to run the scraping and analysis
*/
async function run() {
// Load the Airbnb prices data from the JSON file
let data;
const readDataFromFile = () => {
try {
return fs.readFileSync("airbnb_listings.json", "utf8");
} catch (err) {
if (err.code === "ENOENT") {
console.log("File not found, scraping data...");
return null;
} else {
throw err;
}
}
};
const fetchData = async () => {
data = readDataFromFile();
if (!data || data.trim() === "[]") {
console.log("File is empty or contains an empty list, scraping data...");
data = await scrapeAirbnb();
}
};
await fetchData();
// Parse the JSON data
const prices = JSON.parse(data);
// Convert prices array to a string representation of a Python list
const pricesList = JSON.stringify(prices);
const userMessage = `
Load the Airbnb prices data from the airbnb listing below and visualize
the distribution of prices with a histogram. Listing data: ${pricesList}
`;
const codeInterpreter = await CodeInterpreter.create();
const codeOutput = await chat(codeInterpreter, userMessage);
if (!codeOutput) {
console.log("No code output");
return;
}
const logs = codeOutput.logs;
console.log(logs);
if (codeOutput.results.length == 0) {
console.log("No results");
return;
}
const firstResult = codeOutput.results[0];
console.log(firstResult.text);
if (firstResult.png) {
const pngData = Buffer.from(firstResult.png, "base64");
const filename = "airbnb_prices_chart.png";
fs.writeFileSync(filename, pngData);
console.log(`â
Saved chart to ${filename}`);
}
await codeInterpreter.close();
}
run();
```
### Running the code
Run the code with:
```bash
npm run start
```
### Results
At the end you should get a histogram of the Airbnb prices in San Francisco saved as `airbnb_prices_chart.png`.

Thatâs it! You have successfully scraped Airbnb data and analyzed it using Firecrawl and E2Bâs Code Interpreter SDK. Feel free to experiment with different models and prompts to get more insights from the data.
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Nicolas Camara@nickscamara\_](https://x.com/nickscamara_)
Nicolas Camara is the Chief Technology Officer (CTO) at Firecrawl.
He previously built and scaled Mendable, one of the pioneering "chat with your documents" apps,
which had major Fortune 500 customers like Snapchat, Coinbase, and MongoDB.
Prior to that, Nicolas built SideGuide, the first code-learning tool inside VS Code,
and grew a community of 50,000 users. Nicolas studied Computer Science and has over 10 years of experience in building software.
### More articles by Nicolas Camara
[Using OpenAI's Realtime API and Firecrawl to Talk with Any Website\\
\\
Build a real-time conversational agent that interacts with any website using OpenAI's Realtime API and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Talk-with-Any-Website-Using-OpenAIs-Realtime-API-and-Firecrawl) [Extract website data using LLMs\\
\\
Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code.](https://www.firecrawl.dev/blog/data-extraction-using-llms) [Getting Started with Grok-2: Setup and Web Crawler Example\\
\\
A detailed guide on setting up Grok-2 and building a web crawler using Firecrawl.](https://www.firecrawl.dev/blog/grok-2-setup-and-web-crawler-example) [Launch Week I / Day 6: LLM Extract (v1)\\
\\
Extract structured data from your web pages using the extract format in /scrape.](https://www.firecrawl.dev/blog/launch-week-i-day-6-llm-extract) [Launch Week I / Day 7: Crawl Webhooks (v1)\\
\\
New /crawl webhook support. Send notifications to your apps during a crawl.](https://www.firecrawl.dev/blog/launch-week-i-day-7-webhooks) [OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website\\
\\
A guide to building a multi-agent system using OpenAI Swarm and Firecrawl for AI-driven marketing strategies](https://www.firecrawl.dev/blog/openai-swarm-agent-tutorial) [Build a 'Chat with website' using Groq Llama 3\\
\\
Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.](https://www.firecrawl.dev/blog/chat-with-website) [Scrape and Analyze Airbnb Data with Firecrawl and E2B\\
\\
Learn how to scrape and analyze Airbnb data using Firecrawl and E2B in a few lines of code.](https://www.firecrawl.dev/blog/scrape-analyze-airbnb-data-with-e2b)
## Mastering Firecrawl's Crawl Endpoint
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Nov 18, 2024
â˘
[Bex Tuychiev](https://x.com/bextuychiev)
# Mastering Firecrawl's Crawl Endpoint: A Complete Web Scraping Guide

## Introduction
Web scraping and data extraction have become essential tools as businesses race to convert unprecedented amounts of online data into LLM-friendly formats. Firecrawlâs powerful web scraping API streamlines this process with enterprise-grade automation and scalability features.
This comprehensive guide focuses on Firecrawlâs most powerful feature - the `/crawl` endpoint, which enables automated website scraping at scale. Youâll learn how to:
- Recursively traverse website sub-pages
- Handle dynamic JavaScript-based content
- Bypass common web scraping blockers
- Extract clean, structured data for AI/ML applications
Want to follow along with our python notebook version of this post? [Check it out here!](https://github.com/mendableai/firecrawl/blob/main/examples/mastering-the-crawl-endpoint/mastering-the-crawl-endpoint.ipynb)
## Table of Contents
- [Introduction](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#introduction)
- [Web Scraping vs Web Crawling: Understanding the Key Differences](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#web-scraping-vs-web-crawling-understanding-the-key-differences)
- [Whatâs the Difference?](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#whats-the-difference)
- [How Firecrawl Combines Both](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#how-firecrawl-combines-both)
- [Step-by-Step Guide to Web Crawling with Firecrawlâs API](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#step-by-step-guide-to-web-crawling-with-firecrawls-api)
- [Performance & Limits](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#performance--limits)
- [Asynchronous Web Crawling with Firecrawl: Efficient Large-Scale Data Collection](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#asynchronous-web-crawling-with-firecrawl-efficient-large-scale-data-collection)
- [Asynchronous programming in a nutshell](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#asynchronous-programming-in-a-nutshell)
- [Using `async_crawl_url` method](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#using-async_crawl_url-method)
- [Benefits of asynchronous crawling](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#benefits-of-asynchronous-crawling)
- [How to Save and Store Web Crawling Results](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#how-to-save-and-store-web-crawling-results)
- [Local file storage](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#local-file-storage)
- [Building AI-Powered Web Crawlers with Firecrawl and LangChain Integration](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#building-ai-powered-web-crawlers-with-firecrawl-and-langchain-integration)
- [Conclusion](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#conclusion)
## Web Scraping vs Web Crawling: Understanding the Key Differences
### Whatâs the Difference?
_Web scraping_ refers to extracting specific data from individual web pages like a Wikipedia article or a technical tutorial. It is primarily used when you need specific information from pages with _known URLs_.
_Web crawling_, on the other hand, involves systematically browsing and discovering web pages by following links. It focuses on website navigation and URL discovery.
For example, to build a chatbot that answers questions about Stripeâs documentation, you would need:
1. Web crawling to discover and traverse all pages in Stripeâs documentation site
2. Web scraping to extract the actual content from each discovered page
### How Firecrawl Combines Both
Firecrawlâs `/crawl` endpoint combines both capabilities:
1. URL analysis: Identifies links through sitemap or page traversal
2. Recursive traversal: Follows links to discover sub-pages
3. Content scraping: Extracts clean content from each page
4. Results compilation: Converts everything to structured data
When you pass the URL `https://docs.stripe.com/api` to the endpoint, it automatically discovers and crawls all documentation sub-pages. The endpoint returns the content in your preferred format - whether thatâs markdown, HTML, screenshots, links, or metadata.
## Step-by-Step Guide to Web Crawling with Firecrawlâs API
Firecrawl is a web scraping engine exposed as a REST API. You can use it from the command line using cURL or using one of its language SDKs for Python, Node, Go, or Rust. For this tutorial, weâll focus on its Python SDK.
To get started:
1. Sign up at [firecrawl.dev](https://firecrawl.dev/) and copy your API key
2. Save the key as an environment variable:
```bash
export FIRECRAWL_API_KEY='fc-YOUR-KEY-HERE'
```
Or use a dot-env file:
```bash
touch .env
echo "FIRECRAWL_API_KEY='fc-YOUR-KEY-HERE'" >> .env
```
Then use the Python SDK:
```python
from firecrawl import FirecrawlApp
from dotenv import load_dotenv
load_dotenv()
app = FirecrawlApp()
```
Once your API key is loaded, the `FirecrawlApp` class uses it to establish a connection with the Firecrawl API engine.
First, we will crawl the [https://books.toscrape.com/](https://books.toscrape.com/) website, which is built for web-scraping practice:

Instead of writing dozens of lines of code with libraries like `beautifulsoup4` or `lxml` to parse HTML elements, handle pagination and data retrieval, Firecrawlâs `crawl_url` endpoint lets you accomplish this in a single line:
```python
base_url = "https://books.toscrape.com/"
crawl_result = app.crawl_url(url=base_url)
```
The result is a dictionary with the following keys:
```python
crawl_result.keys()
```
```text
dict_keys(['success', 'status', 'completed', 'total', 'creditsUsed', 'expiresAt', 'data'])
```
First, we are interested in the status of the crawl job:
```python
crawl_result['status']
```
```text
'completed'
```
If it is completed, letâs see how many pages were scraped:
```python
crawl_result['total']
```
```text
1195
```
Almost 1200 pages (it took about 70 seconds on my machine; the speed vary based on your connection speed). Letâs look at one of the elements of the `data` list:
```python
sample_page = crawl_result['data'][10]
markdown_content = sample_page['markdown']
print(markdown_content[:500])
```
```text
- [Home](../../../../index.html)
- [Books](../../books_1/index.html)
- Womens Fiction
# Womens Fiction
**17** results.
**Warning!** This is a demo website for web scraping purposes. Prices and ratings here were randomly assigned and have no real meaning.
01. [](../../../i-had-a-nice-time-and-other-lies-how-to-find-love-sht-like-that_814/index.html)
```
The page corresponds to Womenâs Fiction page:

Firecrawl also includes page metadata in the elementâs dictionary as well:
```python
sample_page['metadata']
```
```text
{
'url': 'https://books.toscrape.com/catalogue/category/books/womens-fiction_9/index.html',
'title': 'Womens Fiction | Books to Scrape - Sandbox',
'robots': 'NOARCHIVE,NOCACHE',
'created': '24th Jun 2016 09:29',
'language': 'en-us',
'viewport': 'width=device-width',
'sourceURL': 'https://books.toscrape.com/catalogue/category/books/womens-fiction_9/index.html',
'statusCode': 200,
'description': '',
'ogLocaleAlternate': []
}
```
One thing we didnât mention is how Firecrawl handles pagination. If you scroll to the bottom of Books-to-Scrape, you will see that it has a ânextâ button.
Before moving on to sub-pages like `books.toscrape.com/category`, Firecrawl first scrapes all sub-pages from the homepage. Later, if a sub-page includes links to already scraped pages, they are ignored.
## Advanced Web Scraping Configuration and Best Practices
Firecrawl offers several types of parameters to configure how the endpoint crawls over websites. We will outline them here with their use-cases.
### Scrape Options
On real-world projects, you will tweak this parameter the most frequently. It allows you to control how a webpageâs contents are saved. Firecrawl allows the following formats:
- Markdown - the default
- HTML
- Raw HTML (simple copy/paste of the entire webpage)
- Links
- Screenshot
Here is an example request to scrape the Stripe API in four formats:
```python
# Crawl the first 5 pages of the stripe API documentation
stripe_crawl_result = app.crawl_url(
url="https://docs.stripe.com/api",
params={
"limit": 5, # Only scrape the first 5 pages including the base-url
"scrapeOptions": {
"formats": ["markdown", "html", "links", "screenshot"]
}
}
)
```
When you specify multiple formats, each webpageâs data contains separate keys for each formatâs content:
```python
stripe_crawl_result['data'][0].keys()
```
```text
dict_keys(['html', 'links', 'markdown', 'metadata', 'screenshot'])
```
The value of the `screenshot` key is a temporary link to a PNG file stored on Firecrawlâs servers and expires within 24 hours. Here is what it looks like for Stripeâs API documentation homepage:
```python
from IPython.display import Image
Image(stripe_crawl_result['data'][0]['screenshot'])
```

Note that specifying more formats to transform the pageâs contents can significantly slow down the process.
Another time-consuming operation can be scraping the entire page contents instead of just the elements you want. For such scenarios, Firecrawl allows you to control which elements of a webpage are scraped using the `onlyMainContent`, `includeTags`, and `excludeTags` parameters.
Enabling `onlyMainContent` parameter (disabled by default) excludes navigation, headers and footers:
```python
stripe_crawl_result = app.crawl_url(
url="https://docs.stripe.com/api",
params={
"limit": 5,
"scrapeOptions": {
"formats": ["markdown", "html"],
"onlyMainContent": True,
},
},
)
```
`includeTags` and `excludeTags` accepts a list of allowlisted/blocklisted HTML tags, classes and IDs:
```python
# Crawl the first 5 pages of the stripe API documentation
stripe_crawl_result = app.crawl_url(
url="https://docs.stripe.com/api",
params={
"limit": 5,
"scrapeOptions": {
"formats": ["markdown", "html"],
"includeTags": ["code", "#page-header"],
"excludeTags": ["h1", "h2", ".main-content"],
},
},
)
```
Crawling large websites can take a long time and when appropriate, these small tweaks can have a big impact on the runtime.
### URL Control
Apart from scraping configurations, you have four options to specify URL patterns to include or exclude during crawling:
- `includePaths` \- targeting specific sections
- `excludePaths` \- avoiding unwanted content
- `allowBackwardLinks` \- handling cross-references
- `allowExternalLinks` \- managing external content
Here is a sample request that uses these parameters:
```python
# Example of URL control parameters
url_control_result = app.crawl_url(
url="https://docs.stripe.com/",
params={
# Only crawl pages under the /payments path
"includePaths": ["/payments/*"],
# Skip the terminal and financial-connections sections
"excludePaths": ["/terminal/*", "/financial-connections/*"],
# Allow crawling links that point to already visited pages
"allowBackwardLinks": False,
# Don't follow links to external domains
"allowExternalLinks": False,
"scrapeOptions": {
"formats": ["html"]
}
}
)
# Print the total number of pages crawled
print(f"Total pages crawled: {url_control_result['total']}")
```
```out
Total pages crawled: 134
```
In this example, weâre crawling the Stripe documentation website with specific URL control parameters:
- The crawler starts at [https://docs.stripe.com/](https://docs.stripe.com/) and only crawls pages under the `"/payments/*"` path
- It explicitly excludes the `"/terminal/*"` and `"/financial-connections/*"` sections
- By setting allowBackwardLinks to false, it wonât revisit already crawled pages
- External links are ignored ( `allowExternalLinks: false`)
- The scraping is configured to only capture HTML content
This targeted approach helps focus the crawl on relevant content while avoiding unnecessary pages, making the crawl more efficient and focused on the specific documentation sections we need.
Another critical parameter is `maxDepth`, which lets you control how many levels deep the crawler will traverse from the starting URL. For example, a `maxDepth` of 2 means it will crawl the initial page and pages linked from it, but wonât go further.
Here is another sample request on the Stripe API docs:
```python
# Example of URL control parameters
url_control_result = app.crawl_url(
url="https://docs.stripe.com/",
params={
"limit": 100,
"maxDepth": 2,
"allowBackwardLinks": False,
"allowExternalLinks": False,
"scrapeOptions": {"formats": ["html"]},
},
)
# Print the total number of pages crawled
print(f"Total pages crawled: {url_control_result['total']}")
```
```out
Total pages crawled: 99
```
Note: When a page has pagination (e.g. pages 2, 3, 4), these paginated pages are not counted as additional depth levels when using `maxDepth`.
### Performance & Limits
The `limit` parameter, which weâve used in previous examples, is essential for controlling the scope of web crawling. It sets a maximum number of pages that will be scraped, which is particularly important when crawling large websites or when external links are enabled. Without this limit, the crawler could potentially traverse an endless chain of connected pages, consuming unnecessary resources and time.
While the limit parameter helps control the breadth of crawling, you may also need to ensure the quality and completeness of each page crawled. To make sure all desired content is scraped, you can enable a waiting period to let pages fully load. For example, some websites use JavaScript to handle dynamic content, have iFrames for embedding content or heavy media elements like videos or GIFs:
```python
stripe_crawl_result = app.crawl_url(
url="https://docs.stripe.com/api",
params={
"limit": 5,
"scrapeOptions": {
"formats": ["markdown", "html"],
"waitFor": 1000, # wait for a second for pages to load
"timeout": 10000, # timeout after 10 seconds
},
},
)
```
The above code also sets the `timeout` parameter to 10000 milliseconds (10 seconds), which ensures that if a page takes too long to load, the crawler will move on rather than getting stuck.
Note: `waitFor` duration applies to all pages the crawler encounters.
All the while, it is important to keep the limits of your plan in mind:
| Plan | /scrape (requests/min) | /crawl (requests/min) | /search (requests/min) |
| --- | --- | --- | --- |
| Free | 10 | 1 | 5 |
| Hobby | 20 | 3 | 10 |
| Standard | 100 | 10 | 50 |
| Growth | 1000 | 50 | 500 |
## Asynchronous Web Crawling with Firecrawl: Efficient Large-Scale Data Collection
Even after following the tips and best practices from the previous section, the crawling process can be significantly long for large websites with thousands of pages. To handle this efficiently, Firecrawl provides asynchronous crawling capabilities that allow you to start a crawl and monitor its progress without blocking your application. This is particularly useful when building web applications or services that need to remain responsive while crawling is in progress.
### Asynchronous programming in a nutshell
First, letâs understand asynchronous programming with a real-world analogy:
Asynchronous programming is like a restaurant server taking multiple orders at once. Instead of waiting at one table until the customers finish their meal before moving to the next table, they can take orders from multiple tables, submit them to the kitchen, and handle other tasks while the food is being prepared.
In programming terms, this means your code can initiate multiple operations (like web requests or database queries) and continue executing other tasks while waiting for responses, rather than processing everything sequentially.
This approach is particularly valuable in web crawling, where most of the time is spent waiting for network responses - instead of freezing the entire application while waiting for each page to load, async programming allows you to process multiple pages concurrently, dramatically improving efficiency.
### Using `async_crawl_url` method
Firecrawl offers an intuitive asynchronous crawling method via `async_crawl_url`:
```python
app = FirecrawlApp()
crawl_status = app.async_crawl_url("https://docs.stripe.com")
print(crawl_status)
```
```python
{'success': True, 'id': 'c4a6a749-3445-454e-bf5a-f3e1e6befad7', 'url': 'https://api.firecrawl.dev/v1/crawl/c4a6a749-3445-454e-bf5a-f3e1e6befad7'}
```
It accepts the same parameters and scrape options as `crawl_url` but returns a crawl status dictionary.
We are mostly interested in the crawl job `id` and can use it to check the status of the process using `check_crawl_status`:
```python
checkpoint = app.check_crawl_status(crawl_status['id'])
print(len(checkpoint['data']))
```
```python
29
```
`check_crawl_status` returns the same output as `crawl_url` but only includes the pages scraped so far. You can run it multiple times and see the number of scraped pages increasing.
If you want to cancel the job, you can use `cancel_crawl` passing the job id:
```python
final_result = app.cancel_crawl(crawl_status['id'])
print(final_result)
```
```python
{'status': 'cancelled'}
```
### Benefits of asynchronous crawling
There are many advantages of using the `async_crawl_url` over `crawl_url`:
- You can create multiple crawl jobs without waiting for each to complete.
- You can monitor progress and manage resources more effectively.
- Perfect for batch processing or parallel crawling tasks.
- Applications can remain responsive while crawling happens in background
- Users can monitor progress instead of waiting for completion
- Allows for implementing progress bars or status updates
- Easier to integrate with message queues or job schedulers
- Can be part of larger automated workflows
- Better suited for microservices architectures
In practice, you almost always use asynchronous crawling for large websites.
## How to Save and Store Web Crawling Results
When crawling large websites, itâs important to save the results persistently. Firecrawl provides the crawled data in a structured format that can be easily saved to various storage systems. Letâs explore some common approaches.
### Local file storage
The simplest approach is saving to local files. Hereâs how to save crawled content in different formats:
```python
import json
from pathlib import Path
def save_crawl_results(crawl_result, output_dir="firecrawl_output"):
# Create output directory if it doesn't exist
Path(output_dir).mkdir(parents=True, exist_ok=True)
# Save full results as JSON
with open(f"{output_dir}/full_results.json", "w") as f:
json.dump(crawl_result, f, indent=2)
# Save just the markdown content in separate files
for idx, page in enumerate(crawl_result["data"]):
# Create safe filename from URL
filename = (
page["metadata"]["url"].split("/")[-1].replace(".html", "") or f"page_{idx}"
)
# Save markdown content
if "markdown" in page:
with open(f"{output_dir}/{filename}.md", "w") as f:
f.write(page["markdown"])
```
Here is what the above function does:
1. Creates an output directory if it doesnât exist
2. Saves the complete crawl results as a JSON file with proper indentation
3. For each crawled page:
- Generates a filename based on the page URL
- Saves the markdown content to a separate .md file
```python
app = FirecrawlApp()
crawl_result = app.crawl_url(url="https://docs.stripe.com/api", params={"limit": 10})
save_crawl_results(crawl_result)
```
It is a basic function that requires modifications for other scraping formats.
### Database storage
For more complex applications, you might want to store the results in a database. Hereâs an example using SQLite:
```python
import sqlite3
def save_to_database(crawl_result, db_path="crawl_results.db"):
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
# Create table if it doesn't exist
cursor.execute(
"""
CREATE TABLE IF NOT EXISTS pages (
url TEXT PRIMARY KEY,
title TEXT,
content TEXT,
metadata TEXT,
crawl_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
"""
)
# Insert pages
for page in crawl_result["data"]:
cursor.execute(
"INSERT OR REPLACE INTO pages (url, title, content, metadata) VALUES (?, ?, ?, ?)",
(
page["metadata"]["url"],
page["metadata"]["title"],
page.get("markdown", ""),
json.dumps(page["metadata"]),
),
)
conn.commit()
print(f"Saved {len(crawl_result['data'])} pages to {db_path}")
conn.close()
```
The function creates a SQLite database with a `pages` table that stores the crawled data. For each page, it saves the URL (as primary key), title, content (in markdown format), and metadata (as JSON). The crawl date is automatically added as a timestamp. If a page with the same URL already exists, it will be replaced with the new data. This provides a persistent storage solution that can be easily queried later.
```python
save_to_database(crawl_result)
```
```python
Saved 9 pages to crawl_results.db
```
Letâs query the database to double-check:
```python
# Query the database
conn = sqlite3.connect("crawl_results.db")
cursor = conn.cursor()
cursor.execute("SELECT url, title, metadata FROM pages")
print(cursor.fetchone())
conn.close()
```
```python
(
'https://docs.stripe.com/api/errors',
'Errors | Stripe API Reference',
{
"url": "https://docs.stripe.com/api/errors",
"title": "Errors | Stripe API Reference",
"language": "en-US",
"viewport": "width=device-width, initial-scale=1",
"sourceURL": "https://docs.stripe.com/api/errors",
"statusCode": 200,
"description": "Complete reference documentation for the Stripe API. Includes code snippets and examples for our Python, Java, PHP, Node.js, Go, Ruby, and .NET libraries.",
"ogLocaleAlternate": []
}
)
```
### Cloud storage
For production applications, you might want to store results in cloud storage. Hereâs an example using AWS S3:
```python
import boto3
from datetime import datetime
def save_to_s3(crawl_result, bucket_name, prefix="crawls"):
s3 = boto3.client("s3")
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
# Save full results
full_results_key = f"{prefix}/{timestamp}/full_results.json"
s3.put_object(
Bucket=bucket_name,
Key=full_results_key,
Body=json.dumps(crawl_result, indent=2),
)
# Save individual pages
for idx, page in enumerate(crawl_result["data"]):
if "markdown" in page:
page_key = f"{prefix}/{timestamp}/pages/{idx}.md"
s3.put_object(Bucket=bucket_name, Key=page_key, Body=page["markdown"])
print(f"Successfully saved {len(crawl_result['data'])} pages to {bucket_name}/{full_results_key}")
```
Here is what the function does:
- Takes a crawl result dictionary, S3 bucket name, and optional prefix as input
- Creates a timestamped folder structure in S3 to organize the data
- Saves the full crawl results as a single JSON file
- For each crawled page that has markdown content, saves it as an individual `.md` file
- Uses boto3 to handle the AWS S3 interactions
- Preserves the hierarchical structure of the crawl data
For this function to work, you must have `boto3` installed and your AWS credentials saved inside the `~/.aws/credentials` file with the following format:
```bash
[default]
aws_access_key_id = your_access_key
aws_secret_access_key = your_secret_key
region = your_region
```
Then, you can execute the function provided that you already have an S3 bucket to store the data:
```python
save_to_s3(crawl_result, "sample-bucket-1801", "stripe-api-docs")
```
```text
Successfully saved 9 pages to sample-bucket-1801/stripe-api-docs/20241118_142945/full_results.json
```
### Incremental saving with async crawls
When using async crawling, you might want to save results incrementally sa they come in:
```python
import time
def save_incremental_results(app, crawl_id, output_dir="firecrawl_output"):
Path(output_dir).mkdir(parents=True, exist_ok=True)
processed_urls = set()
while True:
# Check current status
status = app.check_crawl_status(crawl_id)
# Save new pages
for page in status["data"]:
url = page["metadata"]["url"]
if url not in processed_urls:
filename = f"{output_dir}/{len(processed_urls)}.md"
with open(filename, "w") as f:
f.write(page.get("markdown", ""))
processed_urls.add(url)
# Break if crawl is complete
if status["status"] == "completed":
print(f"Saved {len(processed_urls)} pages.")
break
time.sleep(5) # Wait before checking again
```
Here is what the function does:
- Creates an output directory if it doesnât exist
- Maintains a set of processed URLs to avoid duplicates
- Continuously checks the crawl status until completion
- For each new page found, saves its markdown content to a numbered file
- Sleeps for 5 seconds between status checks to avoid excessive API calls
Letâs use it while the app crawls Books-to-Scrape website:
```python
# Start the crawl
crawl_status = app.async_crawl_url(url="https://books.toscrape.com/")
# Save results incrementally
save_incremental_results(app, crawl_status["id"])
```
```python
Saved 705 pages.
```
## Building AI-Powered Web Crawlers with Firecrawl and LangChain Integration
Firecrawl has integrations with popular open-source libraries like LangChain and other platforms.

In this section, we will see how to use the LangChain integration to build a basic QA chatbot on the [LangChain Community Integrations](https://python.langchain.com/docs/integrations/providers/) website.
Start by installing LangChain and its related libraries:
```bash
pip install langchain langchain_community langchain_anthropic langchain_openai
```
Then, add your `ANTHROPIC_API_KEY` and `OPENAI_API_KEY` as variables to your `.env` file.
Next, import the `FireCrawlLoader` class from the document loaders module and initialize it:
```python
from dotenv import load_dotenv
from langchain_community.document_loaders.firecrawl import FireCrawlLoader
load_dotenv()
loader = FireCrawlLoader(
url="https://python.langchain.com/docs/integrations/providers/",
mode="crawl",
params={"limit": 5, "scrapeOptions": {"onlyMainContent": True}},
)
```
The class can read your Firecrawl API key automatically since we are loading the variables using `load_dotenv()`.
To start the crawl, you can call the `load()` method of the loader object and the scraped contents will be turned into LangChain compatible documents:
```python
# Start the crawl
docs = loader.load()
```
The next step is chunking:
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Add text splitting before creating the vector store
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
# Split the documents
split_docs = text_splitter.split_documents(docs)
```
Above, we split the documents into smaller chunks using the `RecursiveCharacterTextSplitter`. This helps make the text more manageable for processing and ensures better results when creating embeddings and performing retrieval. The chunk size of 1,000 characters with 100-character overlap provides a good balance between context preservation and granularity.
```python
from langchain_chroma import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores.utils import filter_complex_metadata
# Create embeddings for the documents
embeddings = OpenAIEmbeddings()
# Create a vector store from the loaded documents
docs = filter_complex_metadata(docs)
vector_store = Chroma.from_documents(docs, embeddings)
```
Moving on, we create a vector store using Chroma and OpenAI embeddings. The vector store enables semantic search and retrieval on our documents. We also filter out complex metadata that could cause storage issues.
The final step is building the QA chain using Claude 3.5 Sonnet as the language model:
```python
from langchain.chains import RetrievalQA
from langchain_anthropic import ChatAnthropic
# Initialize the language model
llm = ChatAnthropic(model="claude-3-5-sonnet-20240620", streaming=True)
# Create a QA chain
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=vector_store.as_retriever(),
)
```
Now, we can ask questions about our documents:
```python
# Example question
query = "What is the main topic of the website?"
answer = qa_chain.invoke(query)
print(answer)
```
```python
{
'query': 'What is the main topic of the website?',
'result': """The main topic of the website is LangChain's integrations with Hugging Face.
The page provides an overview of various LangChain components that can be used with
Hugging Face models and services, including:
1. Chat models
2. LLMs (Language Models)
3. Embedding models
4. Document loaders
5. Tools
The page focuses on showing how to use different Hugging Face functionalities within
the LangChain framework, such as embedding models, language models, datasets, and
other tools."""
}
```
This section demonstrated a process for building a basic RAG pipeline for content scraped using Firecrawl. For this version, we only used 10 pages from the LangChain documentation. As the volume of information increases, the pipeline would need additional refinement. To scale this pipeline effectively, we would need to consider several factors including:
- Chunking strategy optimization
- Embedding model selection
- Vector store performance tuning
- Prompt engineering for larger document collections
## Conclusion
Throughout this guide, weâve explored Firecrawlâs `/crawl` endpoint and its capabilities for web scraping at scale. From basic usage to advanced configurations, we covered URL control, performance optimization, and asynchronous operations. We also examined practical implementations, including data storage solutions and integration with frameworks like LangChain.
The endpointâs ability to handle JavaScript content, pagination, and various output formats makes it a versatile tool for modern web scraping needs. Whether youâre building documentation chatbots or gathering training data, Firecrawl provides a robust foundation. By leveraging the configuration options and best practices discussed, you can build efficient and scalable web scraping solutions tailored to your specific requirements.
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Bex Tuychiev@bextuychiev](https://x.com/bextuychiev)
Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics
### More articles by Bex Tuychiev
[Building an Automated Price Tracking Tool\\
\\
Build an automated e-commerce price tracker in Python. Learn web scraping, price monitoring, and automated alerts using Firecrawl, Streamlit, PostgreSQL.](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python) [Web Scraping Automation: How to Run Scrapers on a Schedule\\
\\
Learn how to automate web scraping in Python using free tools like schedule, asyncio, cron jobs and GitHub Actions. This comprehensive guide covers local and cloud-based scheduling methods to run scrapers reliably in 2025.](https://www.firecrawl.dev/blog/automated-web-scraping-free-2025) [Automated Data Collection - A Comprehensive Guide\\
\\
Learn how to build robust automated data collection systems using modern tools and best practices. This guide covers everything from selecting the right tools to implementing scalable collection pipelines.](https://www.firecrawl.dev/blog/automated-data-collection-guide) [BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python\\
\\
Learn the key differences between BeautifulSoup4 and Scrapy for web scraping in Python. Compare their features, performance, and use cases to choose the right tool for your web scraping needs.](https://www.firecrawl.dev/blog/beautifulsoup4-vs-scrapy-comparison) [How to Build an Automated Competitor Price Monitoring System with Python\\
\\
Learn how to build an automated competitor price monitoring system in Python that tracks prices across e-commerce sites, provides real-time comparisons, and maintains price history using Firecrawl, Streamlit, and GitHub Actions.](https://www.firecrawl.dev/blog/automated-competitor-price-scraping) [Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude\\
\\
Learn how to build a web scraper in Python that gathers company details, funding rounds, and investor information from public sources like Crunchbase using Firecrawl and Claude for automated data collection and analysis.](https://www.firecrawl.dev/blog/crunchbase-scraping-with-firecrawl-claude) [How to Create Custom Instruction Datasets for LLM Fine-tuning\\
\\
Learn how to build high-quality instruction datasets for fine-tuning large language models (LLMs). This guide covers when to create custom datasets, best practices for data collection and curation, and a practical example of building a code documentation dataset.](https://www.firecrawl.dev/blog/custom-instruction-datasets-llm-fine-tuning) [Data Enrichment: A Complete Guide to Enhancing Your Data Quality\\
\\
Learn how to enrich your data quality with a comprehensive guide covering data enrichment tools, best practices, and real-world examples. Discover how to leverage modern solutions like Firecrawl to automate data collection, validation, and integration for better business insights.](https://www.firecrawl.dev/blog/complete-guide-to-data-enrichment)
## AI Resume Job Matcher
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Feb 1, 2025
â˘
[Bex Tuychiev](https://x.com/bextuychiev)
# Building an AI Resume Job Matching App With Firecrawl And Claude

## Introduction
Finding the perfect job can feel like searching for a needle in a haystack. As a developer, you might spend hours scanning through job boards, trying to determine if each position matches your skills and experience. What if we could automate this process using AI?
In this tutorial, weâll build a sophisticated job matching system that combines several powerful technologies:
- **Firecrawl** for intelligent web scraping of job postings and resume parsing
- **Claude 3.5 Sonnet** for job matching analysis
- **Supabase** for managing job sources and tracking
- **Discord** for when there is a matching job
- **Streamlit** for a user-friendly web interface
Our application will:
1. Automatically scrape user-provided job boards at regular intervals
2. Parse your resume from a PDF
3. Use AI to evaluate each job posting against your qualifications
4. Send notifications to Discord when strong matches are found
5. Provide a web interface for managing job sources and viewing results

By the end of this tutorial, youâll have a fully automated job search assistant that runs in the cloud and helps you focus on the opportunities that matter most. Whether youâre actively job hunting or just keeping an eye on the market, this tool will save you countless hours of manual searching and evaluation.
If this project sounds interesting, you can start using it straight away by cloning [its GitHub repository](https://github.com/BexTuychiev/ai-resume-job-matching). The local setup instructions are provided in the README.
On the other hand, if you want to understand how the different parts of the project work together, continue reading!
## Overview of the App
Before diving into the technical details, letâs walk through a typical user journey to understand how the app works.
The process starts with the user adding web pages with job listings. Here are examples of acceptable pages:
- `https://datacamp.com/jobs`
- `https://openai.com/careers/search/`
- `https://apply.workable.com/huggingface`
As you can probably tell from the example URLs, the app doesnât work with popular job platforms like Indeed or Glassdoor. This is because these platforms already have sophisticated job matching functionality built into their systems. Instead, this app focuses on company career pages and job boards that donât offer automated matching - places where youâd normally have to manually review each posting. This allows you to apply the same intelligent matching to opportunities that might otherwise slip through the cracks.
Each job listings source is added to a Supabase database under the hood for persistence and displayed in the sidebar (you have the option to delete them). After the user inputs all job sources, they can add their PDF link in the main section of the app.
The app uses [Firecrawl](https://firecrawl.dev/), an AI-powered scraping engine that extracts structured data from webpages and PDF documents. To parse resumes, Firecrawl requires a direct file link to the PDF.
After parsing the resume, the app crawls all job sources using Firecrawl to gather job listings. Each listing is then analyzed against the resume by Claude to determine compatibility. The UI clearly shows whether a candidate is qualified for each position, along with Claudeâs reasoning. For matching jobs, the app automatically sends notifications to the userâs Discord account via a webhook.

The app automatically rechecks all job sources weekly to ensure you never miss a great opportunity.
## The Tech Stack Used in the App
Building a reliable job matching system requires careful selection of tools that can handle complex tasks while remaining maintainable and cost-effective. Letâs explore the core technologies that power our application and why each was chosen:
### 1\. [Firecrawl](https://firecrawl.ai/) for AI-powered web scraping
At the heart of our job discovery system is Firecrawl, an AI-powered web scraping engine. Unlike traditional scraping libraries that rely on brittle HTML selectors, Firecrawl uses natural language understanding to identify and extract content. This makes it ideal for our use case because:
- It can handle diverse job board layouts without custom code for each site
- Maintains reliability even when websites update their structure
- Automatically bypasses common anti-bot measures
- Handles JavaScript-rendered content out of the box
- Provides clean, structured data through [Pydantic](https://pydantic.dev/) schemas
### 2\. [Claude 3.5 Sonnet](https://www.anthropic.com/claude) for job matching
For the critical task of evaluating job fit, we use Claude 3.5 Sonnet through the [LangChain](https://www.langchain.com/) framework. This AI model excels at understanding both job requirements and candidate qualifications in context. We chose Claude because:
- Handles complex job requirements effectively
- Offers consistent and reliable evaluations
- More cost-effective than GPT-4 for this use case
- Integrates seamlessly with LangChain for structured outputs
### 3\. [Supabase](https://supabase.com/) for data management
To manage job sources and tracking, we use Supabase as our database backend. This modern database platform offers:
- PostgreSQL database with a generous free tier
- Real-time capabilities for future features
- Simple REST API for database operations
- Built-in authentication system
- Excellent developer experience with their Python SDK
### 4\. [Discord](https://discord.com/) for Notifications
When a matching job is found, our system sends notifications through Discord webhooks. This might seem like an unusual choice, but Discord offers several advantages:
- Free and widely adopted
- Rich message formatting with embeds
- Simple webhook integration
- Mobile notifications
- Supports dedicated channels for job alerts
- Threading for discussions about specific opportunities
### 5\. [Streamlit](https://streamlit.io/) for user interface
The web interface is built with Streamlit, a Python framework for data applications. We chose Streamlit because:
- It enables rapid development of data-focused interfaces
- Provides built-in components for common UI patterns
- Handles async operations smoothly
- Offers automatic hot-reloading during development
- Requires no JavaScript knowledge
- Makes deployment straightforward
### 6\. [GitHub Actions](https://github.com/features/actions) for automation
To ensure regular job checking, we use GitHub Actions for scheduling. This service provides:
- Free scheduling for public repositories
- Built-in secret management
- Reliable cron scheduling
- Easy maintenance and modifications
- Integrated version control
- Comprehensive logging and monitoring
This carefully selected stack provides a robust foundation while keeping costs minimal through generous free tiers. The combination of AI-powered tools (Firecrawl and Claude) with modern infrastructure (Supabase, Discord, GitHub Actions) creates a reliable and scalable job matching system that can grow with your needs.
Most importantly, this stack minimizes maintenance overhead - a crucial factor for any automated system. The AI-powered components adapt to changes automatically, while the infrastructure services are managed by their respective providers, letting you focus on finding your next great opportunity rather than maintaining the system.
## Breaking Down the App Components
When you look at [the GitHub repository](https://github.com/BexTuychiev/ai-resume-job-matching) of the app, you will see the following file structure:

Several files in the repository serve common purposes that most developers will recognize:
- `.gitignore`: Specifies which files Git should ignore when tracking changes
- `README.md`: Documentation explaining what the project does and how to use it
- `requirements.txt`: Lists all Python package dependencies needed to run the project
Letâs examine the remaining Python scripts and understand how they work together to power the application. The explanations will be in a logical order building from foundational elements to higher-level functionality.
### 1\. Core data structures - `src/models.py`
At the heart of our job matching system are three Pydantic models that define the core data structures used throughout the application. These models not only provide type safety and validation but also serve as schemas that guide Firecrawlâs AI in extracting structured data from web pages.
```python
class Job(BaseModel):
title: str = Field(description="Job title")
url: str = Field(description="URL of the job posting")
company: str = Field(description="Company name")
```
The `Job` model represents an individual job posting with three essential fields:
- `title`: The positionâs name
- `url`: Direct link to the job posting
- `company`: Name of the hiring organization
This model is used by both the scraper to extract job listings and the Discord notifier to format job match notifications. The `Field` descriptions guide the Firecrawlâs AI to better locate the HTML/CSS components containing the relevant information.
```python
class JobSource(BaseModel):
url: str = Field(description="URL of the job board")
last_checked: Optional[datetime] = Field(description="Last check timestamp")
```
The `JobSource` model tracks job board URLs and when they were last checked:
- `url`: The job boardâs web address
- `last_checked`: Optional timestamp of the last scraping attempt
This model is primarily used by the database component to manage job sources and the scheduler to track when sources need to be rechecked.
```python
class JobListings(BaseModel):
jobs: List[Job] = Field(description="List of job postings")
```
Finally, the `JobListings` model serves as a container for multiple `Job` objects. This model is crucial for the scraper component, as it tells Firecrawl to extract all job postings from a page rather than just the first one it finds.
These models form the foundation of our applicationâs data flow:
1. The scraper uses them to extract structured data from web pages
2. The database uses them to store and retrieve job sources
3. The matcher uses them to process job details
4. The Discord notifier uses them to format notifications
By defining these data structures upfront, we ensure consistency throughout the application and make it easier to modify the data model in the future if needed.
### 2\. Database operations - `src/database.py`
The database component handles persistence of job sources using Supabase, a PostgreSQL-based backend service. This module provides essential CRUD (Create, Read, Update, Delete) operations for managing job board URLs and their check history.
```python
class Database:
def __init__(self):
url = os.getenv("SUPABASE_URL")
key = os.getenv("SUPABASE_KEY")
self.client = create_client(url, key)
def save_job_source(self, url: str) -> None:
"""Save a job source to the database"""
self.client.table("job_sources").upsert(
{"url": url, "last_checked": None}
).execute()
```
The `Database` class initializes a connection to Supabase using environment variables and provides four key methods:
1. `save_job_source`: Adds or updates a job board URL in the database. The `upsert` operation ensures no duplicate entries are created.
2. `delete_job_source`: Removes a job source from tracking:
```python
def delete_job_source(self, url: str) -> None:
self.client.table("job_sources").delete().eq("url", url).execute()
```
3. `get_job_sources`: Retrieves all tracked job sources:
```python
def get_job_sources(self) -> List[JobSource]:
response = self.client.table("job_sources").select("*").execute()
return [JobSource(**source) for source in response.data]
```
4. `update_last_checked`: Updates the timestamp when a source was last checked:
```python
def update_last_checked(self, url: str) -> None:
self.client.table("job_sources").update({"last_checked": "now()"}).eq(
"url", url
).execute()
```
This database component is used by:
- The Streamlit interface ( `app.py`) for managing job sources through the sidebar
- The scheduler ( `scheduler.py`) for tracking when sources were last checked
- The automated GitHub Action workflow for persistent storage between runs
By using Supabase, we get a reliable, scalable database with minimal setup and maintenance requirements. The `JobSource` model we defined earlier ensures type safety when working with the database records throughout the application.
### 3\. Scraping with Firecrawl - `src/scraper.py`
The scraper component handles all web scraping operations using Firecrawl, an AI-powered scraping engine. This module is responsible for parsing resumes and extracting job listings from various sources.
```python
@st.cache_data(show_spinner=False)
def _cached_parse_resume(pdf_link: str) -> str:
"""Cached version of resume parsing"""
app = FirecrawlApp()
response = app.scrape_url(url=pdf_link)
return response["markdown"]
class JobScraper:
def __init__(self):
self.app = FirecrawlApp()
```
The `JobScraper` class initializes a Firecrawl connection and provides three main methods:
1. `parse_resume`: Extracts text content from a PDF resume. Uses Streamlitâs caching to avoid re-parsing the same resume:
```python
async def parse_resume(self, pdf_link: str) -> str:
"""Parse a resume from a PDF link."""
return _cached_parse_resume(pdf_link)
```
2. `scrape_job_postings`: Batch scrapes multiple job board URLs using the `JobListings` schema to guide Firecrawlâs extraction:
```python
async def scrape_job_postings(self, source_urls: list[str]) -> list[Job]:
response = self.app.batch_scrape_urls(
urls=source_urls,
params={
"formats": ["extract"],
"extract": {
"schema": JobListings.model_json_schema(),
"prompt": "Extract information based on the schema provided",
},
},
)
jobs = []
for job in response["data"]:
jobs.extend(job["extract"]["jobs"])
return [Job(**job) for job in jobs]
```
If you want to understand Firecrawlâs syntax better, refer to our [separate guide on its `/scrape` endpoint](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint).
3. `scrape_job_content`: Retrieves the full content of a specific job posting for detailed analysis:
```python
async def scrape_job_content(self, job_url: str) -> str:
"""Scrape the content of a specific job posting."""
response = self.app.scrape_url(url=job_url)
return response["markdown"]
```
This entire scraper component is used by:
- The Streamlit interface ( `app.py`) for initial resume parsing and job discovery
- The scheduler ( `scheduler.py`) for automated periodic job checks
- The matcher component for detailed job content analysis
The use of Firecrawlâs AI capabilities allows the scraper to handle diverse webpage layouts without custom selectors, while Streamlitâs caching helps optimize performance by avoiding redundant resume parsing.
### 4\. Job matching with Claude - `src/matcher.py`
The matcher component uses Claude 3.5 Sonnet through LangChain to evaluate whether a candidateâs resume matches a job posting. This module provides intelligent job fit analysis with structured outputs.
```python
class JobMatcher:
def __init__(self):
self.llm = ChatAnthropic(model="claude-3-5-sonnet-20241022", temperature=0)
self.response_schemas = [\
ResponseSchema(\
name="is_match",\
description="Whether the candidate is a good fit for the job (true/false)",\
),\
ResponseSchema(\
name="reason",\
description="Brief explanation of why the candidate is or isn't a good fit",\
),\
]
```
The `JobMatcher` class initializes with two key components:
1. A Claude instance configured for consistent outputs (temperature=0)
2. Response schemas that define the structure of the matching results:
- `is_match`: Boolean indicating if the candidate is qualified
- `reason`: Explanation of the matching decision
```python
self.prompt = ChatPromptTemplate.from_messages([\
(\
"system",\
"You are an expert job interviewer with decades of experience. Analyze the resume and job posting to determine if the candidate is a good fit. Be critical in your assessment and accept only applicants that meet at least 75% of the requirements.",\
),\
(\
"human",\
"""\
Resume:\
{resume}\
\
Job Posting:\
{job_posting}\
\
Determine if this candidate is a good fit and explain why briefly.\
{format_instructions}\
""",\
),\
])
self.output_parser = StructuredOutputParser.from_response_schemas(
self.response_schemas
)
```
> Note: The system prompt significantly affects how jobs are matched. You can make it more relaxed or strict when evaluating candidates. Use a looser prompt if you want to apply to more jobs, or a stricter one if youâre being more selective.
The class also sets up:
- A prompt template that positions Claude as an expert interviewer and sets a high bar for matches (75% of requirements)
- An output parser that ensures responses follow the defined schema
```python
async def evaluate_match(self, resume: str, job_posting: str) -> Dict:
"""Evaluate if a candidate is a good fit for a job."""
formatted_prompt = self.prompt.format(
resume=resume,
job_posting=job_posting,
format_instructions=self.output_parser.get_format_instructions(),
)
response = await self.llm.ainvoke(formatted_prompt)
return self.output_parser.parse(response.content)
```
The `evaluate_match` method:
1. Takes a resume and job posting as input
2. Formats the prompt with the provided content
3. Sends the request to Claude
4. Parses and returns the structured response
This entire matcher component is used by:
- The Streamlit interface ( `app.py`) for real-time job matching
- The scheduler ( `scheduler.py`) for automated matching checks
- The Discord notifier to determine when to send alerts
By using Claude with structured outputs, we ensure consistent and reliable job matching that can be easily integrated into the broader application workflow.
### 5\. Sending notifications with Discord - `src/discord.py`
The Discord component handles sending notifications when matching jobs are found. It uses Discordâs webhook functionality to deliver rich, formatted messages about job matches.
```python
class DiscordNotifier:
def __init__(self):
self.webhook_url = os.getenv("DISCORD_WEBHOOK_URL")
```
First, we initialize the notifier with a Discord webhook URL from environment variables. This URL is where all notifications will be sent.
```python
async def send_match(self, job: Job, match_reason: str):
"""Send a job match notification to Discord"""
if not self.webhook_url:
return
webhook = DiscordWebhook(url=self.webhook_url)
embed = DiscordEmbed(
title=f"đŻ New Job Match Found!",
description=f"**{job.title}** at **{job.company}**\n\n{match_reason}",
color="5865F2", # Discord's blue color scheme
)
```
The `send_match` method creates the notification:
- Takes a `Job` object and the AIâs matching reason as input
- Creates a webhook connection to Discord
- Builds an embed message with:
- An eye-catching title with emoji
- Job title and company in bold
- The AIâs explanation of why this job matches
```python
# Add fields with job details
embed.add_embed_field(name="đ˘ Company", value=job.company, inline=True)
embed.add_embed_field(
name="đ Job URL", value=f"[Apply Here]({job.url})", inline=True
)
webhook.add_embed(embed)
webhook.execute()
```
Finally, the method:
- Adds structured fields for company and job URL
- Uses emojis for visual appeal
- Creates a clickable âApply Hereâ link
- Sends the formatted message to Discord
This component is used by:
- The matcher component when a job match is found
- The scheduler for automated notifications
- The Streamlit interface for real-time match alerts
The use of Discord embeds provides a clean, professional look for notifications while making it easy for users to access job details and apply links directly from the message.
### 6\. Automated source checking script - `src/scheduler.py`
The scheduler component handles automated periodic checking of job sources, coordinating between all other components to continuously monitor for new matching positions.
```python
class JobScheduler:
def __init__(self):
self.scraper = JobScraper()
self.matcher = JobMatcher()
self.notifier = DiscordNotifier()
self.db = Database()
self.resume_url = os.getenv("RESUME_URL")
self.check_interval = int(os.getenv("CHECK_INTERVAL_MINUTES", "15"))
self.processed_jobs = set()
logger.info(f"Initialized scheduler with {self.check_interval} minute interval")
```
The `JobScheduler` class initializes with:
- All necessary components (scraper, matcher, notifier, database)
- Resume URL from environment variables
- Configurable check interval (defaults to 15 minutes)
- A set to track processed jobs and avoid duplicates
- Logging setup for monitoring operations
```python
async def process_source(self, source):
"""Process a single job source"""
try:
logger.info(f"Processing source: {source.url}")
# Parse resume
resume_content = await self.scraper.parse_resume(self.resume_url)
# Get jobs from source
jobs = await self.scraper.scrape_job_postings([source.url])
logger.info(f"Found {len(jobs)} jobs from {source.url}")
```
The `process_source` method starts by:
- Logging the current operation
- Parsing the userâs resume
- Scraping all jobs from the given source
```python
# Process new jobs
for job in jobs:
if job.url in self.processed_jobs:
logger.debug(f"Skipping already processed job: {job.url}")
continue
job_content = await self.scraper.scrape_job_content(job.url)
result = await self.matcher.evaluate_match(resume_content, job_content)
if result["is_match"]:
logger.info(f"Found match: {job.title} at {job.company}")
await self.notifier.send_match(job, result["reason"])
self.processed_jobs.add(job.url)
```
For each job found, it:
- Skips if already processed
- Scrapes the full job description
- Evaluates the match against the resume
- Sends a Discord notification if itâs a match
- Marks the job as processed
```python
async def run(self):
"""Main scheduling loop"""
logger.info("Starting job scheduler...")
while True:
try:
sources = self.db.get_job_sources()
logger.info(f"Found {len(sources)} job sources")
```
The `run` method starts the main loop by:
- Getting all job sources from the database
- Logging the number of sources found
```python
for source in sources:
if not source.last_checked or (
datetime.utcnow() - source.last_checked
> timedelta(minutes=self.check_interval)
):
await self.process_source(source)
else:
logger.debug(
f"Skipping {source.url}, next check in "
f"{(source.last_checked + timedelta(minutes=self.check_interval) - datetime.utcnow()).total_seconds() / 60:.1f} minutes"
)
await asyncio.sleep(60) # Check every minute
```
For each source, it:
- Checks if it needs processing (never checked or interval elapsed)
- Processes the source if needed
- Logs skipped sources with time until next check
- Waits a minute before the next iteration
```python
except Exception as e:
logger.error(f"Scheduler error: {str(e)}")
await asyncio.sleep(60)
```
Error handling:
- Catches and logs any exceptions
- Waits a minute before retrying
- Ensures the scheduler keeps running despite errors
This component is used by:
- The GitHub Actions workflow for automated checks
- The command-line interface for manual checks
- The logging system for monitoring and debugging
The extensive logging helps track operations and diagnose issues, while the modular design allows for easy maintenance and updates.
### 7\. User interface with Streamlit - `app.py`
The Streamlit interface provides a user-friendly way to manage job sources and run manual job matching checks. Letâs break down each component:
1. First, we set up the necessary imports and helper functions:
```python
import streamlit as st
import asyncio
from dotenv import load_dotenv
from src.scraper import JobScraper
from src.matcher import JobMatcher
from src.discord import DiscordNotifier
from src.database import Database
load_dotenv()
async def process_job(scraper, matcher, notifier, job, resume_content):
"""Process a single job posting"""
job_content = await scraper.scrape_job_content(job.url)
result = await matcher.evaluate_match(resume_content, job_content)
if result["is_match"]:
await notifier.send_match(job, result["reason"])
return job, result
```
The `process_job` function handles the core job matching logic for a single posting:
1. Scrapes the full job content using the provided URL
2. Evaluates if the resume matches the job requirements
3. Sends a notification if thereâs a match
4. Returns both the job and match result for further processing
5. The main application setup and sidebar for managing job sources:
```python
async def main():
st.title("Resume Parser and Job Matcher")
# Initialize services
scraper = JobScraper()
matcher = JobMatcher()
notifier = DiscordNotifier()
db = Database()
# Sidebar for managing job sources
with st.sidebar:
st.header("Manage Job Sources")
new_source = st.text_input("Add Job Source URL")
if st.button("Add Source"):
db.save_job_source(new_source)
st.success("Job source added!")
```
The `main()` function sets up the core Streamlit application interface:
1. Creates a title for the app
2. Initializes the key services (scraper, matcher, notifier, database)
3. Adds a sidebar with controls for managing job source URLs
4. Provides a text input and button to add new job sources
5. Saves valid sources to the database
The sidebar allows users to maintain a list of job boards and company career pages to monitor for new postings.
3. The source management interface:
```python
# List and delete existing sources
st.subheader("Current Sources")
for source in db.get_job_sources():
col1, col2 = st.columns([3, 1])
with col1:
st.text(source.url)
with col2:
if st.button("Delete", key=source.url):
db.delete_job_source(source.url)
st.rerun()
```
This section displays the list of current job sources and provides delete functionality:
1. Shows a âCurrent Sourcesâ subheader
2. Iterates through all sources from the database
3. Creates a two-column layout for each source
4. First column shows the source URL
5. Second column has a delete button
6. When delete is clicked, removes the source and refreshes the page
The delete functionality helps users maintain their source list by removing outdated or unwanted job boards. The `rerun()` call ensures the UI updates immediately after deletion.
4. The main content area with instructions and resume input:
```python
st.markdown(
"""
This app helps you find matching jobs by:
- Analyzing your resume from a PDF URL
- Scraping job postings from your saved job sources
- Using AI to evaluate if you're a good fit for each position
Simply paste your resume URL below to get started!
"""
)
resume_url = st.text_input(
"**Enter Resume PDF URL**",
placeholder="https://www.website.com/resume.pdf",
)
```
5. The job analysis workflow:
```python
if st.button("Analyze") and resume_url:
with st.spinner("Parsing resume..."):
resume_content = await scraper.parse_resume(resume_url)
sources = db.get_job_sources()
if not sources:
st.warning("No job sources configured. Add some in the sidebar!")
return
with st.spinner("Scraping job postings..."):
jobs = await scraper.scrape_job_postings([s.url for s in sources])
```
6. Parallel job processing and results display:
```python
with st.spinner(f"Analyzing {len(jobs)} jobs..."):
tasks = []
for job in jobs:
task = process_job(scraper, matcher, notifier, job, resume_content)
tasks.append(task)
for coro in asyncio.as_completed(tasks):
job, result = await coro
st.subheader(f"Job: {job.title}")
st.write(f"URL: {job.url}")
st.write(f"Match: {'â
' if result['is_match'] else 'â'}")
st.write(f"Reason: {result['reason']}")
st.divider()
st.success(f"Analysis complete! Processed {len(jobs)} jobs.")
```
This section creates tasks to analyze multiple jobs simultaneously by comparing them against the userâs resume. As each analysis completes, it displays the results including job title, URL, match status and reasoning. The parallel approach makes the processing more efficient than analyzing jobs one at a time.
The interface provides:
- A sidebar for managing job sources
- Clear instructions for users
- Real-time feedback during processing
- Visual indicators for matches (â
) and non-matches (â)
- Detailed explanations for each job evaluation
- Parallel processing for better performance
This component ties together all the backend services into a user-friendly interface that makes it easy to manage job sources and run manual checks.
### 8\. GitHub Actions workflow - `.github/workflows/scheduler.yml`
The GitHub Actions workflow automates the job checking process by running the scheduler at regular intervals. Letâs break down the configuration:
1. First, we define the workflow name and triggers:
```yaml
name: Job Matcher Scheduler
on:
push:
branches: [main]
schedule:
- cron: "0 0 * * 1" # Run every Monday at midnight
```
This configuration:
- Names the workflow âJob Matcher Schedulerâ
- Triggers on pushes to the main branch (for testing)
- Runs automatically every Monday at midnight using cron syntax
- 0: Minute (0)
- 0: Hour (0 = midnight)
- \*: Day of month (any)
- \*: Month (any)
- 1: Day of week (1 = Monday)
2. Define the job and its environment:
```yaml
jobs:
check-jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: "3.10"
```
This section:
- Creates a job named âcheck-jobsâ
- Uses the latest Ubuntu runner
- Checks out the repository code
- Sets up Python 3.10
3. Install dependencies:
```yaml
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements.txt
```
4. Set up environment variables and run the scheduler:
```yml
- name: Run job checker
env:
FIRECRAWL_API_KEY: ${{ secrets.FIRECRAWL_API_KEY }}
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
DISCORD_WEBHOOK_URL: ${{ secrets.DISCORD_WEBHOOK_URL }}
RESUME_URL: ${{ secrets.RESUME_URL }}
SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
SUPABASE_KEY: ${{ secrets.SUPABASE_KEY }}
CHECK_INTERVAL_MINUTES: 15
run: |
python -m src.scheduler
```
This final step:
- Sets up all necessary environment variables from GitHub Secrets
- Configures the check interval
- Runs the scheduler script
The workflow provides:
- Automated weekly job checks
- Secure handling of sensitive credentials
- Consistent environment for running checks
- Detailed logs of each run
- Easy modification of the schedule
To use this workflow, you need to:
1. Add all required secrets to your GitHub repository
2. Ensure your repository is public (for free GitHub Actions minutes)
3. Verify the workflow is enabled in your Actions tab
The weekly schedule helps stay within GitHubâs free tier limits while still regularly checking for new opportunities.
## Conclusion
Weâve built a powerful automated job matching system that combines several modern technologies into a cohesive solution. By integrating Firecrawl for web scraping, Claude AI for intelligent matching, Discord for notifications, GitHub Actions for scheduling, and Supabase for storage, weâve created a practical tool that automates the tedious parts of job searching. This allows job seekers to focus their energy on more important tasks like preparing for interviews and improving their skills.
### Next Steps
The modular design of this system opens up many possibilities for future enhancements. You could expand support to additional job boards, implement more sophisticated matching algorithms, or add alternative notification methods like email. Consider building a mobile interface or adding analytics to track your application success rates. The foundation weâve built makes it easy to adapt and enhance the system as your needs evolve. Feel free to fork the repository and customize it to match your specific job search requirements.
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Bex Tuychiev@bextuychiev](https://x.com/bextuychiev)
Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics
### More articles by Bex Tuychiev
[Building an Automated Price Tracking Tool\\
\\
Build an automated e-commerce price tracker in Python. Learn web scraping, price monitoring, and automated alerts using Firecrawl, Streamlit, PostgreSQL.](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python) [Web Scraping Automation: How to Run Scrapers on a Schedule\\
\\
Learn how to automate web scraping in Python using free tools like schedule, asyncio, cron jobs and GitHub Actions. This comprehensive guide covers local and cloud-based scheduling methods to run scrapers reliably in 2025.](https://www.firecrawl.dev/blog/automated-web-scraping-free-2025) [Automated Data Collection - A Comprehensive Guide\\
\\
Learn how to build robust automated data collection systems using modern tools and best practices. This guide covers everything from selecting the right tools to implementing scalable collection pipelines.](https://www.firecrawl.dev/blog/automated-data-collection-guide) [BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python\\
\\
Learn the key differences between BeautifulSoup4 and Scrapy for web scraping in Python. Compare their features, performance, and use cases to choose the right tool for your web scraping needs.](https://www.firecrawl.dev/blog/beautifulsoup4-vs-scrapy-comparison) [How to Build an Automated Competitor Price Monitoring System with Python\\
\\
Learn how to build an automated competitor price monitoring system in Python that tracks prices across e-commerce sites, provides real-time comparisons, and maintains price history using Firecrawl, Streamlit, and GitHub Actions.](https://www.firecrawl.dev/blog/automated-competitor-price-scraping) [Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude\\
\\
Learn how to build a web scraper in Python that gathers company details, funding rounds, and investor information from public sources like Crunchbase using Firecrawl and Claude for automated data collection and analysis.](https://www.firecrawl.dev/blog/crunchbase-scraping-with-firecrawl-claude) [How to Create Custom Instruction Datasets for LLM Fine-tuning\\
\\
Learn how to build high-quality instruction datasets for fine-tuning large language models (LLMs). This guide covers when to create custom datasets, best practices for data collection and curation, and a practical example of building a code documentation dataset.](https://www.firecrawl.dev/blog/custom-instruction-datasets-llm-fine-tuning) [Data Enrichment: A Complete Guide to Enhancing Your Data Quality\\
\\
Learn how to enrich your data quality with a comprehensive guide covering data enrichment tools, best practices, and real-world examples. Discover how to leverage modern solutions like Firecrawl to automate data collection, validation, and integration for better business insights.](https://www.firecrawl.dev/blog/complete-guide-to-data-enrichment)
## Gamma Onboarding Supercharged
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Aug 8, 2024
â˘
[Jon Noronha](https://x.com/thatsjonsense)
# How Gamma Supercharges Onboarding with Firecrawl

At [Gamma](https://gamma.app/), we recently launched Gamma Sites, which allows anyone to build a website as easily as writing a doc. To showcase the power of our platform, we wanted to transform existing sites into the Gamma format. Thatâs where Firecrawl came in. Not only did Firecrawl enable us to import existing web pages, but it also unlocked a new input for our AI presentation generator. Now, users can pull in a blog post, Notion page, or other online document and convert it into a presentation effortlessly.
Integrating Firecrawl into our production environment was a breeze. We already use markdown internally, so it was just a matter of plugging in the Firecrawl API, feeding it a URL, and getting clean markdown in return. The simplicity of scraping out all the extraneous content and retrieving just the text and images is what we would miss the most if we had to stop using Firecrawl. Throughout the integration process, the support from the Firecrawl team was outstanding. They were quick to respond to our feature requests and ensured a smooth experience.
Article updated recently
## About the Author
[\\
Jon Noronha@thatsjonsense](https://x.com/thatsjonsense)
Jon Noronha is the founder of Gamma, building the anti-Powerpoint. He is also the Ex VP of Product at Optimizely.
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## cURL Authentication Guide
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Dec 13, 2024
â˘
[Rudrank Riyam](https://x.com/rudrankriyam)
# A Complete Guide Scraping Authenticated Websites with cURL and Firecrawl

Scraping authenticated websites is often a key requirement for developers and data analysts. While many graphical tools exist, using **cURL**, a powerful command-line utility, gives you granular control over HTTP requests. Coupled with **Firecrawl**, a scraping API that can handle dynamic browser interactions and complex authentication flows, you can seamlessly extract data from behind login forms, protected dashboards, and other restricted content. Before we get started, we only recommend scraping behind authentication if you have permission from the resources owner.
In this guide, weâll first introduce cURL and common authentication methods. Then, weâll show how to combine these approaches with Firecrawlâs API, enabling you to scrape authenticated pages that would otherwise be challenging to access. Youâll learn everything from basic authentication to custom headers, bearer tokens, cookies, and even multi-step logins using Firecrawlâs action sequences.
## What is cURL?
**cURL (Client URL)** is a command-line tool for transferring data using various network protocols, commonly HTTP and HTTPS. Itâs usually pre-installed on Unix-like systems (macOS, Linux) and easily available for Windows. With cURL, you can quickly test APIs, debug endpoints, and automate repetitive tasks.
Check if cURL is installed by running:
```bash
curl --version
```
If installed, youâll see version details. If not, follow your operating systemâs instructions to install it.
cURL is lightweight and script-friendlyâan excellent choice for integrating with tools like Firecrawl. With cURL at your fingertips, you can seamlessly orchestrate authenticated scraping sessions by combining cURLâs request capabilities with Firecrawlâs browser-powered scraping.
## Why Use Firecrawl for Authenticated Scraping?
**Firecrawl** is an API designed for scraping websites that might be hard to handle with a simple HTTP client. While cURL can handle direct requests, Firecrawl provides the ability to:
- Interact with websites that require JavaScript execution.
- Navigate multiple steps of login forms.
- Manage cookies, headers, and tokens easily.
- Extract content in structured formats like Markdown or JSON.
By pairing cURLâs command-line power with Firecrawlâs scraping engine, you can handle complex authentication scenariosâlike logging into a site with a username/password form, or including custom headers and tokensâthat would be difficult to script using cURL alone.
## Authentication Methods
Authenticated scraping means you must prove your identity or authorization to the target server before accessing protected content. Common methods include:
1. **Basic Authentication**
2. **Bearer Token (OAuth 2.0)**
3. **Custom Header Authentication**
4. **Cookie-Based (Session) Authentication**
Weâll look at each method in the context of cURL, and then integrate them with Firecrawl for real-world scraping scenarios.
### 1\. Basic Authentication
**Basic Auth** sends a username and password encoded in Base64 with each request. Itâs simple but should always be used over HTTPS to protect credentials.
**cURL Syntax:**
```bash
curl -u username:password https://api.example.com/securedata
```
For APIs requiring only an API key (as username):
```bash
curl -u my_api_key: https://api.example.com/data
```
**With Firecrawl:**
If Firecrawlâs endpoint itself requires Basic Auth (or if the site youâre scraping uses Basic Auth), you can include this in your request:
```bash
curl -u YOUR_API_KEY: https://api.firecrawl.dev/v1/scrape
```
This authenticates you to the Firecrawl API using Basic Auth, and you can then direct Firecrawl to scrape authenticated targets.
### 2\. Bearer Token Authentication (OAuth 2.0)
**Bearer Tokens** (often from OAuth 2.0 flows) are secure, time-limited keys that you include in the `Authorization` header.
**cURL Syntax:**
```bash
curl -H "Authorization: Bearer YOUR_TOKEN" https://api.example.com/profile
```
**With Firecrawl:**
To scrape a site requiring a bearer token, you can instruct Firecrawl to use it:
```bash
curl -X POST https://api.firecrawl.dev/v1/scrape \
-H "Authorization: Bearer fc_your_api_key_here" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"formats": ["markdown"]
}'
```
Here, `fc_your_api_key_here` is your Firecrawl API token. Firecrawl will handle the scraping behind the scenes, and you can also add target-specific headers or actions if needed.
### 3\. Custom Header Authentication
Some APIs require custom headers for authentication (e.g., `X-API-Key: value`). These headers are sent alongside requests to prove authorization.
**cURL Syntax:**
```bash
curl -H "X-API-Key: your_api_key_here" https://api.example.com/data
```
**With Firecrawl:**
To scrape a page requiring a custom header, just include it in the POST data:
```bash
curl -X POST https://api.firecrawl.dev/v1/scrape \
-H "Authorization: Bearer YOUR_FIRECRAWL_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://protected.example.com",
"headers": {
"X-Custom-Auth": "token123"
}
}'
```
Firecrawl will use the custom header `X-Custom-Auth` when loading the page.
### 4\. Cookie-Based Authentication
Websites often rely on sessions and cookies for authentication. After logging in via a form, a cookie is set, allowing subsequent authenticated requests.
**cURL for Cookie Handling:**
Save cookies after login:
```bash
curl -c cookies.txt -X POST https://example.com/login \
-d "username=yourusername&password=yourpassword"
```
Use these cookies for subsequent requests:
```bash
curl -b cookies.txt https://example.com/protected
```
**With Firecrawl:**
If you need to scrape a protected page that uses cookies for authentication, you can first obtain the cookies using cURL, then pass them to Firecrawl:
1. **Obtain Cookies:**
```bash
curl -c cookies.txt -X POST https://example.com/login \
-d "username=yourusername&password=yourpassword"
```
2. **Use Cookies with Firecrawl:**
```bash
curl -b cookies.txt -X POST https://api.firecrawl.dev/v1/scrape \
-H "Authorization: Bearer YOUR_FIRECRAWL_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/protected",
"formats": ["markdown"]
}'
```
Firecrawl will then request the protected URL using the cookies youâve supplied.
## Real-World Examples
### GitHub API
GitHubâs API supports token-based auth:
```bash
curl -H "Authorization: token ghp_YOUR_TOKEN" https://api.github.com/user/repos
```
Scraping authenticated GitHub pages (like private profiles) with Firecrawl:
```bash
curl -X POST https://api.firecrawl.dev/v1/scrape \
-H "Authorization: Bearer YOUR_FIRECRAWL_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://github.com/settings/profile",
"headers": {
"Cookie": "user_session=YOUR_SESSION_COOKIE; tz=UTC",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
}'
```
### Dev.to Authentication
Dev.to uses API keys as headers:
```bash
curl -H "api-key: YOUR_DEV_TO_API_KEY" https://dev.to/api/articles/me
```
To scrape behind login forms, leverage Firecrawl actions:
```bash
curl -X POST https://api.firecrawl.dev/v1/scrape \
-H "Authorization: Bearer YOUR_FIRECRAWL_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://dev.to/enter",
"actions": [\
{"type": "wait", "milliseconds": 2000},\
{"type": "click", "selector": "input[type=email]"},\
{"type": "write", "text": "your@email.com"},\
{"type": "click", "selector": "input[type=password]"},\
{"type": "write", "text": "your_password"},\
{"type": "click", "selector": "button[type=submit]"},\
{"type": "wait", "milliseconds": 3000},\
{"type": "navigate", "url": "https://dev.to/dashboard"},\
{"type": "scrape"}\
]
}'
```
Firecrawl can interact with the page dynamically, just like a browser, to submit forms and then scrape the resulting authenticated content.
## Conclusion
When combined, **cURL and Firecrawl** provide a powerful toolkit for scraping authenticated websites. cURLâs flexibility in handling HTTP requests pairs perfectly with Firecrawlâs ability to navigate, interact, and extract data from pages that require authentication. Whether you need to pass API keys in headers, handle OAuth tokens, emulate sessions, or fill out login forms, these tools make the process efficient and repeatable.
Try the examples provided, check out [Firecrawlâs documentation](https://docs.firecrawl.dev/introduction) for more advanced use cases, and start confidently scraping authenticated websites today!
**Happy cURLing and Firecrawling!**
Article updated recently
## About the Author
[\\
Rudrank Riyam@rudrankriyam](https://x.com/rudrankriyam)
Rudrank Riyam is a Technical Writer & Author.
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## Grok-2 Setup Guide
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Oct 21, 2024
â˘
[Nicolas Camara](https://x.com/nickscamara_)
# Getting Started with Grok-2: Setup and Web Crawler Example

Grok-2, the latest language model from x.ai, brings advanced language understanding capabilities to developers, enabling the creation of intelligent applications with ease. In this tutorial, weâll walk you through setting up Grok-2, obtaining an API key, and then building a web crawler using Firecrawl to extract structured data from any website.
## Part 1: Setting Up Grok-2
Before diving into coding, we need to set up Grok-2 and get an API key.
### Step 1: Sign Up for an x.ai Account
To access the Grok-2 API, youâll need an x.ai account.
1. **Visit the Sign-Up Page:** Go to [x.ai Sign-Up](https://accounts.x.ai/sign-up?redirect=cloud-console).
2. **Register:** Fill out the registration form with your email and create a password.
3. **Verify Your Email:** Check your inbox for a verification email from x.ai and click the link to verify your account.
### Step 2: Fund Your Account
To use the Grok-2 API, your account must have funds.
1. **Access the Cloud Console:** After logging in, youâll be directed to the x.ai Cloud Console.
2. **Navigate to Billing:** Click on the **Billing** tab in the sidebar.
3. **Add Payment Method:** Provide your payment details to add credits to your account.
### Step 3: Obtain Your API Key
With your account funded, you can now generate an API key.
1. **Go to API Keys:** Click on the **API Keys** tab in the Cloud Console.
2. **Create a New API Key:** Click on **Create New API Key** and give it a descriptive name.
3. **Copy Your API Key:** Make sure to copy your API key now, as it wonât be displayed again for security reasons.
_Note: Keep your API key secure and do not share it publicly._
## Part 2: Building a Web Crawler with Grok-2 and Firecrawl
Now that Grok-2 is set up, letâs build a web crawler to extract structured data from websites.
### Prerequisites
- **Python 3.6+**
- **Firecrawl Python Library**
- **Requests Library**
- **dotenv Library**
Install the required packages:
```bash
pip install firecrawl-py requests python-dotenv
```
### Step 1: Set Up Environment Variables
Create a `.env` file in your project directory to store your API keys securely.
```env
GROK_API_KEY=your_grok_api_key
FIRECRAWL_API_KEY=your_firecrawl_api_key
```
_Replace `your_grok_api_key` and `your_firecrawl_api_key` with your actual API keys._
### Step 2: Initialize Your Script
Create a new Python script (e.g., `web_crawler.py`) and start by importing the necessary libraries and loading your environment variables.
```python
import os
import json
import requests
from dotenv import load_dotenv
from firecrawl import FirecrawlApp
# Load environment variables from .env file
load_dotenv()
# Retrieve API keys
grok_api_key = os.getenv("GROK_API_KEY")
firecrawl_api_key = os.getenv("FIRECRAWL_API_KEY")
# Initialize FirecrawlApp
app = FirecrawlApp(api_key=firecrawl_api_key)
```
### Step 3: Define the Grok-2 API Interaction Function
We need a function to interact with the Grok-2 API.
```python
def grok_completion(prompt):
url = "https://api.x.ai/v1/chat/completions"
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {grok_api_key}"
}
data = {
"messages": [\
{"role": "system", "content": "You are a helpful assistant."},\
{"role": "user", "content": prompt}\
],
"model": "grok-2",
"stream": False,
"temperature": 0
}
response = requests.post(url, headers=headers, json=data)
response_data = response.json()
return response_data['choices'][0]['message']['content']
```
### Step 4: Identify Relevant Pages on the Website
Define a function to find pages related to our objective.
```python
def find_relevant_pages(objective, url):
prompt = f"Based on the objective '{objective}', suggest a 1-2 word search term to locate relevant information on the website."
search_term = grok_completion(prompt).strip()
map_result = app.map_url(url, params={"search": search_term})
return map_result.get("links", [])
```
### Step 5: Extract Data from the Pages
Create a function to scrape the pages and extract the required data.
```python
def extract_data_from_pages(links, objective):
for link in links[:3]: # Limit to top 3 links
scrape_result = app.scrape_url(link, params={'formats': ['markdown']})
content = scrape_result.get('markdown', '')
prompt = f"""Given the following content, extract the information related to the objective '{objective}' in JSON format. If not found, reply 'Objective not met'.
Content: {content}
Remember:
- Only return JSON if the objective is met.
- Do not include any extra text.
"""
result = grok_completion(prompt).strip()
if result != "Objective not met":
try:
data = json.loads(result)
return data
except json.JSONDecodeError:
continue # Try the next link if JSON parsing fails
return None
```
### Step 6: Implement the Main Function
Combine everything into a main function.
```python
def main():
url = input("Enter the website URL to crawl: ")
objective = input("Enter your data extraction objective: ")
print("\nFinding relevant pages...")
links = find_relevant_pages(objective, url)
if not links:
print("No relevant pages found.")
return
print("Extracting data from pages...")
data = extract_data_from_pages(links, objective)
if data:
print("\nData extracted successfully:")
print(json.dumps(data, indent=2))
else:
print("Could not find data matching the objective.")
if __name__ == "__main__":
main()
```
### Step 7: Run the Script
Save your script and run it from the command line.
```bash
python web_crawler.py
```
**Example Interaction:**
```
Enter the website URL to crawl: https://example.com
Enter your data extraction objective: Retrieve the list of services offered.
Finding relevant pages...
Extracting data from pages...
Data extracted successfully:
{
"services": [\
"Web Development",\
"SEO Optimization",\
"Digital Marketing"\
]
}
```
## Conclusion
In this tutorial, weâve successfully set up Grok-2, obtained an API key, and built a web crawler using Firecrawl. This powerful combination allows you to automate the process of extracting structured data from websites, making it a valuable tool for various applications.
## Next Steps
- **Explore More Features:** Check out the Grok-2 and Firecrawl documentation to learn about additional functionalities.
- **Enhance Error Handling:** Improve the script with better error handling and logging.
- **Customize Data Extraction:** Modify the extraction logic to suit different objectives or data types.
## References
- [x.ai Grok-2 API Documentation](https://docs.x.ai/docs)
- [Firecrawl Python Library Documentation](https://docs.firecrawl.dev/sdks/python)
- [x.ai Cloud Console](https://accounts.x.ai/cloud-console)
- [GitHub Repository with Full Code](https://github.com/mendableai/firecrawl/tree/main/examples/grok_web_crawler)
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Nicolas Camara@nickscamara\_](https://x.com/nickscamara_)
Nicolas Camara is the Chief Technology Officer (CTO) at Firecrawl.
He previously built and scaled Mendable, one of the pioneering "chat with your documents" apps,
which had major Fortune 500 customers like Snapchat, Coinbase, and MongoDB.
Prior to that, Nicolas built SideGuide, the first code-learning tool inside VS Code,
and grew a community of 50,000 users. Nicolas studied Computer Science and has over 10 years of experience in building software.
### More articles by Nicolas Camara
[Using OpenAI's Realtime API and Firecrawl to Talk with Any Website\\
\\
Build a real-time conversational agent that interacts with any website using OpenAI's Realtime API and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Talk-with-Any-Website-Using-OpenAIs-Realtime-API-and-Firecrawl) [Extract website data using LLMs\\
\\
Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code.](https://www.firecrawl.dev/blog/data-extraction-using-llms) [Getting Started with Grok-2: Setup and Web Crawler Example\\
\\
A detailed guide on setting up Grok-2 and building a web crawler using Firecrawl.](https://www.firecrawl.dev/blog/grok-2-setup-and-web-crawler-example) [Launch Week I / Day 6: LLM Extract (v1)\\
\\
Extract structured data from your web pages using the extract format in /scrape.](https://www.firecrawl.dev/blog/launch-week-i-day-6-llm-extract) [Launch Week I / Day 7: Crawl Webhooks (v1)\\
\\
New /crawl webhook support. Send notifications to your apps during a crawl.](https://www.firecrawl.dev/blog/launch-week-i-day-7-webhooks) [OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website\\
\\
A guide to building a multi-agent system using OpenAI Swarm and Firecrawl for AI-driven marketing strategies](https://www.firecrawl.dev/blog/openai-swarm-agent-tutorial) [Build a 'Chat with website' using Groq Llama 3\\
\\
Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.](https://www.firecrawl.dev/blog/chat-with-website) [Scrape and Analyze Airbnb Data with Firecrawl and E2B\\
\\
Learn how to scrape and analyze Airbnb data using Firecrawl and E2B in a few lines of code.](https://www.firecrawl.dev/blog/scrape-analyze-airbnb-data-with-e2b)
## Install BeautifulSoup Easily
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Aug 9, 2024
â˘
[Eric Ciarla](https://x.com/ericciarla)
# How to quickly install BeautifulSoup with Python
[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library for pulling data out of HTML and XML files. It provides simple methods for navigating, searching, and modifying the parse tree, saving you hours of work. Beautiful Soup is great for web scraping projects where you need to extract specific pieces of information from web pages.
Some common use cases for BeautifulSoup include extracting article text or metadata from news sites, scraping product details and pricing from e-commerce stores, gathering data for machine learning datasets, and more.
In this tutorial, weâll walk through several ways to get BeautifulSoup installed on your system and show you some basic usage examples to get started.
## Installing BeautifulSoup
There are a few different ways you can install BeautifulSoup depending on your Python environment and preferences.
### Using pip
The recommended way to install BeautifulSoup is with pip:
```bash
python -m pip install beautifulsoup4
```
This will install the latest version of BeautifulSoup 4. Make sure you have a recent version of Python (3.6+) and pip.
### Using conda
If youâre using the Anaconda Python distribution, you can install BeautifulSoup from the conda-forge channel:
```bash
conda install -c conda-forge beautifulsoup4
```
### In a virtual environment
Itâs good practice to install Python packages in an isolated virtual environment for each project. You can set up BeautifulSoup in a new virtual environment like this:
```bash
python -m venv bsenv
source bsenv/bin/activate # On Windows, use `bsenv\Scripts\activate`
pip install beautifulsoup4
```
## Troubleshooting
Here are a few things to check if you run into issues installing BeautifulSoup:
- Make sure your Python version is 3.6 or higher
- Upgrade pip to the latest version: `python -m pip install --upgrade pip`
- If using conda, ensure your Anaconda installation is up-to-date
- Verify you have proper permissions to install packages. Use `sudo` or run the command prompt as an administrator if needed.
Check the BeautifulSoup documentation or post on Stack Overflow if you need further assistance.
## Usage Examples
Letâs look at a couple quick examples of how to use BeautifulSoup once you have it installed.
### Parsing HTML
Hereâs how you can use BeautifulSoup to parse HTML retrieved from a web page:
```python
from bs4 import BeautifulSoup
import requests
url = "https://mendable.ai"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.text)
# 'Example Domain'
```
We use the requests library to fetch the HTML from a URL, then pass it to BeautifulSoup to parse. This allows us to navigate and search the HTML using methods like `find()` and `select()`.
### Extracting Data
BeautifulSoup makes it easy to extract data buried deep within nested HTML tags. For example, to get all the links from a page:
```python
links = soup.find_all('a')
for link in links:
print(link.get('href'))
# 'https://www.firecrawl.dev/'
```
The `find_all()` method retrieves all `` tag elements. We can then iterate through them and access attributes like the `href` URL using `get()`.
By chaining together `find()` and `select()` methods, you can precisely target elements and attributes to scrape from the messiest of HTML pages. BeautifulSoup is an indispensable tool for any Python web scraping project.
For more advanced web scraping projects, consider using a dedicated scraping service like [Firecrawl](https://firecrawl.dev/). Firecrawl takes care of the tedious parts of web scraping, like proxy rotation, JavaScript rendering, and avoiding detection, allowing you to focus your efforts on working with the data itself. Check out the [Python SDK](https://docs.firecrawl.dev/sdks/python) here.
## References
- BeautifulSoup documentation: [https://www.crummy.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- Real Pythonâs BeautifulSoup Tutorial: [https://realpython.com/beautiful-soup-web-scraper-python/](https://realpython.com/beautiful-soup-web-scraper-python/)
- Firecrawl web scraping service: [https://firecrawl.dev/](https://firecrawl.dev/)
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Eric Ciarla@ericciarla](https://x.com/ericciarla)
Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB.
Previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users.
### More articles by Eric Ciarla
[How to Create an llms.txt File for Any Website\\
\\
Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [Cloudflare Error 1015: How to solve it?\\
\\
Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner.](https://www.firecrawl.dev/blog/cloudflare-error-1015-how-to-solve-it) [Build an agent that checks for website contradictions\\
\\
Using Firecrawl and Claude to scrape your website's data and look for contradictions.](https://www.firecrawl.dev/blog/contradiction-agent) [Why Companies Need a Data Strategy for Generative AI\\
\\
Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses\\
\\
A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [How to easily install requests with pip and python\\
\\
A tutorial on installing the requests library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-easily-install-requests-with-pip-and-python) [How to quickly install BeautifulSoup with Python\\
\\
A guide on installing the BeautifulSoup library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python) [How to Use OpenAI's o1 Reasoning Models in Your Applications\\
\\
Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications)
## LLM Extract Introduction
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
August 31, 2024
â˘
[Nicolas Camara](https://x.com/nickscamara_)
# Launch Week I / Day 6: LLM Extract (v1)

Welcome to Day 6 of Firecrawlâs Launch Week! Weâre excited to introduce v1 support for LLM Extract.
## Introducing the Extract Format
LLM extraction is now available in v1 under the extract format. To extract structured from a page, you can pass a schema to the endpoint or just provide a prompt.

**Output**

## Extracting without schema (New)
You can now extract without a schema by just passing a prompt to the endpoint. The LLMs choose the structure of the data.

## Learn More
Learn more about the extract format in our [documentation](https://docs.firecrawl.dev/features/extract).
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Nicolas Camara@nickscamara\_](https://x.com/nickscamara_)
Nicolas Camara is the Chief Technology Officer (CTO) at Firecrawl.
He previously built and scaled Mendable, one of the pioneering "chat with your documents" apps,
which had major Fortune 500 customers like Snapchat, Coinbase, and MongoDB.
Prior to that, Nicolas built SideGuide, the first code-learning tool inside VS Code,
and grew a community of 50,000 users. Nicolas studied Computer Science and has over 10 years of experience in building software.
### More articles by Nicolas Camara
[Using OpenAI's Realtime API and Firecrawl to Talk with Any Website\\
\\
Build a real-time conversational agent that interacts with any website using OpenAI's Realtime API and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Talk-with-Any-Website-Using-OpenAIs-Realtime-API-and-Firecrawl) [Extract website data using LLMs\\
\\
Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code.](https://www.firecrawl.dev/blog/data-extraction-using-llms) [Getting Started with Grok-2: Setup and Web Crawler Example\\
\\
A detailed guide on setting up Grok-2 and building a web crawler using Firecrawl.](https://www.firecrawl.dev/blog/grok-2-setup-and-web-crawler-example) [Launch Week I / Day 6: LLM Extract (v1)\\
\\
Extract structured data from your web pages using the extract format in /scrape.](https://www.firecrawl.dev/blog/launch-week-i-day-6-llm-extract) [Launch Week I / Day 7: Crawl Webhooks (v1)\\
\\
New /crawl webhook support. Send notifications to your apps during a crawl.](https://www.firecrawl.dev/blog/launch-week-i-day-7-webhooks) [OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website\\
\\
A guide to building a multi-agent system using OpenAI Swarm and Firecrawl for AI-driven marketing strategies](https://www.firecrawl.dev/blog/openai-swarm-agent-tutorial) [Build a 'Chat with website' using Groq Llama 3\\
\\
Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.](https://www.firecrawl.dev/blog/chat-with-website) [Scrape and Analyze Airbnb Data with Firecrawl and E2B\\
\\
Learn how to scrape and analyze Airbnb data using Firecrawl and E2B in a few lines of code.](https://www.firecrawl.dev/blog/scrape-analyze-airbnb-data-with-e2b)
## Fix Cloudflare Error 1015
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
Aug 6, 2024
â˘
[Eric Ciarla](https://x.com/ericciarla)
# Cloudflare Error 1015: How to solve it?
Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner. This typically happens if you are making a large number of requests in a short period.
## How to solve it?
To resolve Cloudflare Error 1015 you can reduce the frequency of your requests to stay within the allowed limit. Another way to solve it is to use a service like [Firecrawl](https://firecrawl.dev/), which rotates proxies to prevent any single proxy from hitting the rate limit. This approach can help you avoid triggering the Cloudflare 1015 error.
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Eric Ciarla@ericciarla](https://x.com/ericciarla)
Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB.
Previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users.
### More articles by Eric Ciarla
[How to Create an llms.txt File for Any Website\\
\\
Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [Cloudflare Error 1015: How to solve it?\\
\\
Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner.](https://www.firecrawl.dev/blog/cloudflare-error-1015-how-to-solve-it) [Build an agent that checks for website contradictions\\
\\
Using Firecrawl and Claude to scrape your website's data and look for contradictions.](https://www.firecrawl.dev/blog/contradiction-agent) [Why Companies Need a Data Strategy for Generative AI\\
\\
Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses\\
\\
A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [How to easily install requests with pip and python\\
\\
A tutorial on installing the requests library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-easily-install-requests-with-pip-and-python) [How to quickly install BeautifulSoup with Python\\
\\
A guide on installing the BeautifulSoup library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python) [How to Use OpenAI's o1 Reasoning Models in Your Applications\\
\\
Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications)
## Introducing Teams Feature
Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract)
August 26, 2024
â˘
[Eric Ciarla](https://x.com/ericciarla)
# Launch Week I / Day 1: Introducing Teams

Welcome to Firecrawlâs first ever Launch Week! Over the course of the next five days, weâll be bringing you an exciting new feature every day.
Weâre kicking off Day 1 with the launch of Teams - one of our most highly requested features.
**What is Teams?**
No one wants to work on web scraping projects alone. Teams enables you to collaborate with your co-workers and transform the way you approach data collection.
Ever wanted to work on complex scraping projects with your entire team? With Teams, you can achieve this with just a few clicks.

**New Pricing Plans to Support Teams**
To accommodate teams of all sizes, weâve updated our pricing structure. Our Hobby plan now includes 2 seats, perfect for small collaborations. The Standard plan offers 4 seats for growing teams, while our Growth plan supports larger groups with 8 seats. For enterprise-level needs, we offer custom seating options to fit any organization.
Stay tuned for more exciting announcements throughout Launch Week. We canât wait to show you whatâs next!
Article updated recently
[đĽ](https://www.firecrawl.dev/)
## Ready to _Build?_
Start scraping web data for your AI apps today.
No credit card needed.
Get Started
## About the Author
[\\
Eric Ciarla@ericciarla](https://x.com/ericciarla)
Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB.
Previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users.
### More articles by Eric Ciarla
[How to Create an llms.txt File for Any Website\\
\\
Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [Cloudflare Error 1015: How to solve it?\\
\\
Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner.](https://www.firecrawl.dev/blog/cloudflare-error-1015-how-to-solve-it) [Build an agent that checks for website contradictions\\
\\
Using Firecrawl and Claude to scrape your website's data and look for contradictions.](https://www.firecrawl.dev/blog/contradiction-agent) [Why Companies Need a Data Strategy for Generative AI\\
\\
Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses\\
\\
A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [How to easily install requests with pip and python\\
\\
A tutorial on installing the requests library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-easily-install-requests-with-pip-and-python) [How to quickly install BeautifulSoup with Python\\
\\
A guide on installing the BeautifulSoup library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python) [How to Use OpenAI's o1 Reasoning Models in Your Applications\\
\\
Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications)