# http://firecrawl.dev llms-full.txt ## Web Data Extraction Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) [💥 Get 2 months free with yearly plan](https://www.firecrawl.dev/pricing) # Turn websites into LLM-ready data Power your AI apps with clean data crawled from any website. [It's also open source.](https://github.com/mendableai/firecrawl) https://example.com Start for free (500 credits)Start for free 200 Response \[\ \ {\ \ "url": "https://example.com",\ \ "markdown": "# Getting Started...",\ \ "json": {\ "title": "Guide",\ "docs": ...\ },\ \ "screenshot": "https://example.com/hero.png",\ \ }\ \ ...\ \ \] ## Trusted by Top Companies [![Customer Logo](https://www.firecrawl.dev/logos/zapier.png)](https://www.zapier.com/) [![Customer Logo](https://www.firecrawl.dev/logos/gamma.svg)](https://gamma.app/) [![Customer Logo](https://www.firecrawl.dev/logos/nvidia-com.png)](https://www.nvidia.com/) [![Customer Logo](https://www.firecrawl.dev/logos/phmg.svg)](https://phmg.com/) [![Customer Logo](https://www.firecrawl.dev/logos/stackai.svg)](https://www.stack-ai.com/) [![Customer Logo](https://www.firecrawl.dev/logos/teller-io.svg)](https://www.teller.io/) [![Customer Logo](https://www.firecrawl.dev/logos/carrefour-c.svg)](https://www.carrefour.com/) [![Customer Logo](https://www.firecrawl.dev/logos/vendr.png)](https://www.vendr.com/) [![Customer Logo](https://www.firecrawl.dev/logos/open-gov-sg.png)](https://www.open.gov.sg/) [![Customer Logo](https://www.firecrawl.dev/logos/zapier.png)](https://www.zapier.com/) [![Customer Logo](https://www.firecrawl.dev/logos/gamma.svg)](https://gamma.app/) [![Customer Logo](https://www.firecrawl.dev/logos/nvidia-com.png)](https://www.nvidia.com/) [![Customer Logo](https://www.firecrawl.dev/logos/phmg.svg)](https://phmg.com/) [![Customer Logo](https://www.firecrawl.dev/logos/stackai.svg)](https://www.stack-ai.com/) [![Customer Logo](https://www.firecrawl.dev/logos/teller-io.svg)](https://www.teller.io/) [![Customer Logo](https://www.firecrawl.dev/logos/carrefour-c.svg)](https://www.carrefour.com/) [![Customer Logo](https://www.firecrawl.dev/logos/vendr.png)](https://www.vendr.com/) [![Customer Logo](https://www.firecrawl.dev/logos/open-gov-sg.png)](https://www.open.gov.sg/) [![Customer Logo](https://www.firecrawl.dev/logos/cyberagent-co-jp.svg)](https://www.cyberagent.co.jp/) [![Customer Logo](https://www.firecrawl.dev/logos/continue_dev.png)](https://continue.dev/) [![Customer Logo](https://www.firecrawl.dev/logos/bain-com.svg)](https://www.bain.com/) [![Customer Logo](https://www.firecrawl.dev/logos/jasperai.svg)](https://jasper.ai/) [![Customer Logo](https://www.firecrawl.dev/logos/palladiumdigital.jpg)](https://www.palladiumdigital.com/) [![Customer Logo](https://www.firecrawl.dev/logos/checkr.png)](https://www.checkr.com/) [![Customer Logo](https://www.firecrawl.dev/logos/jetbrains.png)](https://www.jetbrains.com/) [![Customer Logo](https://www.firecrawl.dev/logos/you_com.svg)](https://www.you.com/) [![Customer Logo](https://www.firecrawl.dev/logos/cyberagent-co-jp.svg)](https://www.cyberagent.co.jp/) [![Customer Logo](https://www.firecrawl.dev/logos/continue_dev.png)](https://continue.dev/) [![Customer Logo](https://www.firecrawl.dev/logos/bain-com.svg)](https://www.bain.com/) [![Customer Logo](https://www.firecrawl.dev/logos/jasperai.svg)](https://jasper.ai/) [![Customer Logo](https://www.firecrawl.dev/logos/palladiumdigital.jpg)](https://www.palladiumdigital.com/) [![Customer Logo](https://www.firecrawl.dev/logos/checkr.png)](https://www.checkr.com/) [![Customer Logo](https://www.firecrawl.dev/logos/jetbrains.png)](https://www.jetbrains.com/) [![Customer Logo](https://www.firecrawl.dev/logos/you_com.svg)](https://www.you.com/) Developer first ## Start scraping this morning Enhance your apps with industry leading web scraping and crawling capabilities #### Scrape Get llm-ready data from websites #### Crawl Crawl all the pages on a website #### Extract New Extract structured data from websites 1 2 3 4 5 6 7 8 ``` // npm install @mendable/firecrawl-js import FirecrawlApp from '@mendable/firecrawl-js'; const app = new FirecrawlApp({ apiKey: "fc-YOUR_API_KEY" }); // Scrape a website: await app.scrapeUrl('firecrawl.dev'); ``` #### Use well-known tools Already fully integrated with the greatest existing tools and workflows. [![LlamaIndex](https://www.firecrawl.dev/logos/llamaindex.svg)](https://docs.llamaindex.ai/en/stable/examples/data_connectors/WebPageDemo/#using-firecrawl-reader/)[![Langchain](https://www.firecrawl.dev/integrations/langchain.png)](https://python.langchain.com/v0.2/docs/integrations/document_loaders/firecrawl/)[![Dify](https://www.firecrawl.dev/logos/dify.png)](https://dify.ai/blog/dify-ai-blog-integrated-with-firecrawl/)[![Langflow](https://www.firecrawl.dev/integrations/langflow_2.png)](https://www.langflow.org/)[![Flowise](https://www.firecrawl.dev/integrations/flowise.png)](https://flowiseai.com/)[![CrewAI](https://www.firecrawl.dev/integrations/crewai.png)](https://crewai.com/)[![Camel AI](https://www.firecrawl.dev/integrations/camel-ai.png)](https://docs.camel-ai.org/cookbooks/ingest_data_from_websites_with_Firecrawl.html) #### Start for free, scale easily Kick off your journey for free and scale seamlessly as your project expands. [Try it out](https://www.firecrawl.dev/signin/signup) #### Open-source Developed transparently and collaboratively. Join our community of contributors. [Check out our repo](https://github.com/mendableai/firecrawl) Zero Configuration ## We handle the hard stuff Rotating proxies, orchestration, rate limits, js-blocked content and more #### Crawling Gather clean data from all accessible subpages, even without a sitemap.     

Firecrawl

    

Extract web data for LLMs

    

Installation

    
npm install @mendable/firecrawl-js
    

Firecrawl

    

Extract web data for LLMs

    

Installation

    
npm install @mendable/firecrawl-js
    

Firecrawl

    

Extract web data for LLMs

    

Installation

    
npm install @mendable/firecrawl-js
    

Firecrawl

    

Extract web data for LLMs

    

Installation

    
npm install @mendable/firecrawl-js
    

Firecrawl

    

Extract web data for LLMs

    

Installation

    
npm install @mendable/firecrawl-js
    

Firecrawl

    

Extract web data for LLMs

    

Installation

    
npm install @mendable/firecrawl-js
    

Firecrawl

    

Extract web data for LLMs

    

Installation

    
npm install @mendable/firecrawl-js
    

Firecrawl

    

Extract web data for LLMs

    

Installation

    
npm install @mendable/firecrawl-js
    

Firecrawl

    

Extract web data for LLMs

    

Installation

    
npm install @mendable/firecrawl-js
    

Firecrawl

    

Extract web data for LLMs

    

Installation

    
npm install @mendable/firecrawl-js
# Firecrawl Extract web data for LLMs ## Installation \`\`\`bash npm install @mendable/firecrawl-js \`\`\` #### Dynamic Content Firecrawl handles JavaScript, SPAs, and dynamic content loading with minimal configuration. #### Smart Wait Firecrawl intelligently wait for content to load, making scraping faster and more reliable. #### Reliability First Reliability is our core focus. Firecrawl is designed to scale with your needs. #### Actions Click, scroll, write, wait, press and more before extracting content. #### Media Parsing Firecrawl can parse and output content from web hosted pdfs, docx, and more. Our Wall of Love ## Don't take our word for it ![Morgan Linton](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-02.4585097e.jpg&w=96&q=75) ### Morgan Linton [@morganlinton](https://x.com/morganlinton/status/1839454165703204955) If you're coding with AI, and haven't discovered @firecrawl\_dev yet, prepare to have your mind blown 🤯 ![Chris DeWeese](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-09.a66dcc26.jpg&w=96&q=75) ### Chris DeWeese [@ChrisDevApps](https://x.com/ChrisDevApps/status/1853587120406876601) Started using @firecrawl\_dev for a project, I wish I used this sooner. ![Bardia Pourvakil](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-01.025350bc.jpeg&w=96&q=75) ### Bardia Pourvakil [@thepericulum](https://twitter.com/thepericulum/status/1781397799487078874) The Firecrawl team ships. I wanted types for their node SDK, and less than an hour later, I got them. ![Tom Reppelin](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-04.1f2b4caf.jpg&w=96&q=75) ### Tom Reppelin [@TomReppelin](https://x.com/TomReppelin/status/1844382491014201613) I found gold today. Thank you @firecrawl\_dev ![latentsauce 🧘🏽](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-07.c2285d35.jpeg&w=96&q=75) ### latentsauce 🧘🏽 [@latentsauce](https://twitter.com/latentsauce/status/1781738253927735331) Firecrawl simplifies data preparation significantly, exactly what I was hoping for. Thank you Firecrawl ❤️❤️❤️ ![Morgan Linton](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-02.4585097e.jpg&w=96&q=75) ### Morgan Linton [@morganlinton](https://x.com/morganlinton/status/1839454165703204955) If you're coding with AI, and haven't discovered @firecrawl\_dev yet, prepare to have your mind blown 🤯 ![Chris DeWeese](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-09.a66dcc26.jpg&w=96&q=75) ### Chris DeWeese [@ChrisDevApps](https://x.com/ChrisDevApps/status/1853587120406876601) Started using @firecrawl\_dev for a project, I wish I used this sooner. ![Bardia Pourvakil](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-01.025350bc.jpeg&w=96&q=75) ### Bardia Pourvakil [@thepericulum](https://twitter.com/thepericulum/status/1781397799487078874) The Firecrawl team ships. I wanted types for their node SDK, and less than an hour later, I got them. ![Tom Reppelin](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-04.1f2b4caf.jpg&w=96&q=75) ### Tom Reppelin [@TomReppelin](https://x.com/TomReppelin/status/1844382491014201613) I found gold today. Thank you @firecrawl\_dev ![latentsauce 🧘🏽](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-07.c2285d35.jpeg&w=96&q=75) ### latentsauce 🧘🏽 [@latentsauce](https://twitter.com/latentsauce/status/1781738253927735331) Firecrawl simplifies data preparation significantly, exactly what I was hoping for. Thank you Firecrawl ❤️❤️❤️ ![Michael Ning](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-05.76d7cd3e.png&w=96&q=75) ### Michael Ning Firecrawl is impressive, saving us 2/3 the tokens and allowing gpt3.5turbo use over gpt4. Major savings in time and money. ![Alex Reibman 🖇️](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-06.4ee7cf5a.jpeg&w=96&q=75) ### Alex Reibman 🖇️ [@AlexReibman](https://twitter.com/AlexReibman/status/1780299595484131836) Moved our internal agent's web scraping tool from Apify to Firecrawl because it benchmarked 50x faster with AgentOps. ![Alex Fazio](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-08.af684f18.jpg&w=96&q=75) ### Alex Fazio [@alxfazio](https://x.com/alxfazio/status/1826731977283641615) Semantic scraping with Firecrawl is 🔥! ![Matt Busigin](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-03.6f60bcb9.jpg&w=96&q=75) ### Matt Busigin [@mbusigin](https://x.com/mbusigin/status/1836065372010656069) Firecrawl is dope. Congrats guys 👏 ![Michael Ning](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-05.76d7cd3e.png&w=96&q=75) ### Michael Ning Firecrawl is impressive, saving us 2/3 the tokens and allowing gpt3.5turbo use over gpt4. Major savings in time and money. ![Alex Reibman 🖇️](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-06.4ee7cf5a.jpeg&w=96&q=75) ### Alex Reibman 🖇️ [@AlexReibman](https://twitter.com/AlexReibman/status/1780299595484131836) Moved our internal agent's web scraping tool from Apify to Firecrawl because it benchmarked 50x faster with AgentOps. ![Alex Fazio](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-08.af684f18.jpg&w=96&q=75) ### Alex Fazio [@alxfazio](https://x.com/alxfazio/status/1826731977283641615) Semantic scraping with Firecrawl is 🔥! ![Matt Busigin](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-03.6f60bcb9.jpg&w=96&q=75) ### Matt Busigin [@mbusigin](https://x.com/mbusigin/status/1836065372010656069) Firecrawl is dope. Congrats guys 👏 Transparent ## Flexible Pricing Start for free, then scale as you grow Standard [Extract](https://www.firecrawl.dev/extract#pricing) Monthly Yearly 20% off\- 2 months free ## Free Plan 500 credits $0 one-time No credit card requiredGet Started - Scrape 500 pages - 10 /scrape per min - 1 /crawl per min ## Hobby 3,000 credits per month $16/month $228/yr$190/yr (Billed annually) Subscribe$190/yr - Scrape 3,000 pages\* - 20 /scrape per min - 3 /crawl per min - 1 seat ## StandardMost Popular 100,000 credits per month $83/month $1188/yr$990/yr (Billed annually) Subscribe$990/yr - Scrape 100,000 pages\* - 100 /scrape per min - 10 /crawl per min - 3 seats - Standard Support ## Growth 500,000 credits per month $333/month $4788/yr$3990/yr (Billed annually) Subscribe$3990/yr - Scrape 500,000 pages\* - 1000 /scrape per min - 50 /crawl per min - 5 seats - Priority Support ## Add-ons ### Auto Recharge Credits Automatically recharge your credits when you run low $11/mo for 1000 credits Enable Auto Recharge Subscribe to a plan to enable auto recharge ### Credit Pack Purchase a pack of additional monthly credits $9/mo for 1000 credits Purchase Credit Pack Subscribe to a plan to purchase credit packs ## Enterprise Plan Unlimited credits. Custom RPMs. Talk to us - Bulk discounts - Top priority support - Custom concurrency limits - Improved Stealth Proxies - SLAs - Advanced Security & Controls \\* a /scrape refers to the [scrape](https://docs.firecrawl.dev/api-reference/endpoint/scrape) API endpoint. Structured extraction costs vary. See [credits table](https://www.firecrawl.dev/pricing#credits). \\* a /crawl refers to the [crawl](https://docs.firecrawl.dev/api-reference/endpoint/crawl) API endpoint. ## API Credits Credits are consumed for each API request, varying by endpoint and feature. | Features | Credits | | --- | --- | | Scrape(/scrape) | 1 / page | | with JSON format | 5 / page | | Crawl(/crawl) | 1 / page | | Map (/map) | 1 / call | | Search(/search) | 1 / page | | Extract (/extract) | New [Separate Pricing](https://www.firecrawl.dev/extract#pricing) | [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started FAQ ## Frequently Asked Everything you need to know about Firecrawl ### General ### What is Firecrawl? ### What sites work? ### Who can benefit from using Firecrawl? ### Is Firecrawl open-source? ### What is the difference between Firecrawl and other web scrapers? ### What is the difference between the open-source version and the hosted version? ### Scraping & Crawling ### How does Firecrawl handle dynamic content on websites? ### Why is it not crawling all the pages? ### Can Firecrawl crawl websites without a sitemap? ### What formats can Firecrawl convert web data into? ### How does Firecrawl ensure the cleanliness of the data? ### Is Firecrawl suitable for large-scale data scraping projects? ### Does it respect robots.txt? ### What measures does Firecrawl take to handle web scraping challenges like rate limits and caching? ### Does Firecrawl handle captcha or authentication? ### API Related ### Where can I find my API key? ### Billing ### Is Firecrawl free? ### Is there a pay per use plan instead of monthly? ### How many credits do scraping, crawling, and extraction cost? ### Do you charge for failed requests (scrape, crawl, extract)? ### What payment methods do you accept? ## Flexible Web Scraping Pricing Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Transparent ## Flexible Pricing Start for free, then scale as you grow Standard [Extract](https://www.firecrawl.dev/extract#pricing) Monthly Yearly 20% off\- 2 months free ## Free Plan 500 credits $0 one-time No credit card requiredGet Started - Scrape 500 pages - 10 /scrape per min - 1 /crawl per min ## Hobby 3,000 creditsper month $16/month $228/yr$190/yr(Billed annually) Subscribe$190/yr - Scrape 3,000 pages\* - 20 /scrape per min - 3 /crawl per min - 1 seat ## StandardMost Popular 100,000 creditsper month $83/month $1188/yr$990/yr(Billed annually) Subscribe$990/yr - Scrape 100,000 pages\* - 100 /scrape per min - 10 /crawl per min - 3 seats - Standard Support ## Growth 500,000 creditsper month $333/month $4788/yr$3990/yr(Billed annually) Subscribe$3990/yr - Scrape 500,000 pages\* - 1000 /scrape per min - 50 /crawl per min - 5 seats - Priority Support ## Add-ons ### Auto Recharge Credits Automatically recharge your credits when you run low $11/mo for 1000 credits Enable Auto Recharge Subscribe to a plan to enable auto recharge ### Credit Pack Purchase a pack of additional monthly credits $9/mo for 1000 credits Purchase Credit Pack Subscribe to a plan to purchase credit packs ## Enterprise Plan Unlimited credits. Custom RPMs. Talk to us - Bulk discounts - Top priority support - Custom concurrency limits - Improved Stealth Proxies - SLAs - Advanced Security & Controls \\* a /scrape refers to the [scrape](https://docs.firecrawl.dev/api-reference/endpoint/scrape) API endpoint. Structured extraction costs vary. See [credits table](https://www.firecrawl.dev/pricing#credits). \\* a /crawl refers to the [crawl](https://docs.firecrawl.dev/api-reference/endpoint/crawl) API endpoint. ## API Credits Credits are consumed for each API request, varying by endpoint and feature. | Features | Credits | | --- | --- | | Scrape(/scrape) | 1 / page | | with JSON format | 5 / page | | Crawl(/crawl) | 1 / page | | Map(/map) | 1 / call | | Search(/search) | 1 / page | | Extract(/extract) | New [Separate Pricing](https://www.firecrawl.dev/extract#pricing) | Our Wall of Love ## Don't take our word for it ![Morgan Linton](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-02.4585097e.jpg&w=96&q=75) ### Morgan Linton [@morganlinton](https://x.com/morganlinton/status/1839454165703204955) If you're coding with AI, and haven't discovered @firecrawl\_dev yet, prepare to have your mind blown 🤯 ![Chris DeWeese](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-09.a66dcc26.jpg&w=96&q=75) ### Chris DeWeese [@ChrisDevApps](https://x.com/ChrisDevApps/status/1853587120406876601) Started using @firecrawl\_dev for a project, I wish I used this sooner. ![Bardia Pourvakil](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-01.025350bc.jpeg&w=96&q=75) ### Bardia Pourvakil [@thepericulum](https://twitter.com/thepericulum/status/1781397799487078874) The Firecrawl team ships. I wanted types for their node SDK, and less than an hour later, I got them. ![Tom Reppelin](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-04.1f2b4caf.jpg&w=96&q=75) ### Tom Reppelin [@TomReppelin](https://x.com/TomReppelin/status/1844382491014201613) I found gold today. Thank you @firecrawl\_dev ![latentsauce 🧘🏽](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-07.c2285d35.jpeg&w=96&q=75) ### latentsauce 🧘🏽 [@latentsauce](https://twitter.com/latentsauce/status/1781738253927735331) Firecrawl simplifies data preparation significantly, exactly what I was hoping for. Thank you Firecrawl ❤️❤️❤️ ![Morgan Linton](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-02.4585097e.jpg&w=96&q=75) ### Morgan Linton [@morganlinton](https://x.com/morganlinton/status/1839454165703204955) If you're coding with AI, and haven't discovered @firecrawl\_dev yet, prepare to have your mind blown 🤯 ![Chris DeWeese](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-09.a66dcc26.jpg&w=96&q=75) ### Chris DeWeese [@ChrisDevApps](https://x.com/ChrisDevApps/status/1853587120406876601) Started using @firecrawl\_dev for a project, I wish I used this sooner. ![Bardia Pourvakil](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-01.025350bc.jpeg&w=96&q=75) ### Bardia Pourvakil [@thepericulum](https://twitter.com/thepericulum/status/1781397799487078874) The Firecrawl team ships. I wanted types for their node SDK, and less than an hour later, I got them. ![Tom Reppelin](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-04.1f2b4caf.jpg&w=96&q=75) ### Tom Reppelin [@TomReppelin](https://x.com/TomReppelin/status/1844382491014201613) I found gold today. Thank you @firecrawl\_dev ![latentsauce 🧘🏽](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-07.c2285d35.jpeg&w=96&q=75) ### latentsauce 🧘🏽 [@latentsauce](https://twitter.com/latentsauce/status/1781738253927735331) Firecrawl simplifies data preparation significantly, exactly what I was hoping for. Thank you Firecrawl ❤️❤️❤️ ![Michael Ning](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-05.76d7cd3e.png&w=96&q=75) ### Michael Ning Firecrawl is impressive, saving us 2/3 the tokens and allowing gpt3.5turbo use over gpt4. Major savings in time and money. ![Alex Reibman 🖇️](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-06.4ee7cf5a.jpeg&w=96&q=75) ### Alex Reibman 🖇️ [@AlexReibman](https://twitter.com/AlexReibman/status/1780299595484131836) Moved our internal agent's web scraping tool from Apify to Firecrawl because it benchmarked 50x faster with AgentOps. ![Alex Fazio](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-08.af684f18.jpg&w=96&q=75) ### Alex Fazio [@alxfazio](https://x.com/alxfazio/status/1826731977283641615) Semantic scraping with Firecrawl is 🔥! ![Matt Busigin](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-03.6f60bcb9.jpg&w=96&q=75) ### Matt Busigin [@mbusigin](https://x.com/mbusigin/status/1836065372010656069) Firecrawl is dope. Congrats guys 👏 ![Michael Ning](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-05.76d7cd3e.png&w=96&q=75) ### Michael Ning Firecrawl is impressive, saving us 2/3 the tokens and allowing gpt3.5turbo use over gpt4. Major savings in time and money. ![Alex Reibman 🖇️](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-06.4ee7cf5a.jpeg&w=96&q=75) ### Alex Reibman 🖇️ [@AlexReibman](https://twitter.com/AlexReibman/status/1780299595484131836) Moved our internal agent's web scraping tool from Apify to Firecrawl because it benchmarked 50x faster with AgentOps. ![Alex Fazio](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-08.af684f18.jpg&w=96&q=75) ### Alex Fazio [@alxfazio](https://x.com/alxfazio/status/1826731977283641615) Semantic scraping with Firecrawl is 🔥! ![Matt Busigin](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-03.6f60bcb9.jpg&w=96&q=75) ### Matt Busigin [@mbusigin](https://x.com/mbusigin/status/1836065372010656069) Firecrawl is dope. Congrats guys 👏 ## Web Scraping and AI Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) [![LLM API Engine: How to Build a Dynamic API Generation Engine Powered by Firecrawl](https://www.firecrawl.dev/images/blog/llm-api-engine/llm-api-engine.jpg)\\ \\ Feb 26, 2025\\ \\ **LLM API Engine: How to Build a Dynamic API Generation Engine Powered by Firecrawl** \\ \\ Learn how to build a dynamic API generation engine that transforms unstructured web data into clean, structured APIs using natural language descriptions instead of code, powered by Firecrawl's intelligent web scraping and OpenAI.\\ \\ By Bex Tuychiev](https://www.firecrawl.dev/blog/llm-api-engine-dynamic-api-generation-explainer) ## Explore Articles [All](https://www.firecrawl.dev/blog) [Product Updates](https://www.firecrawl.dev/blog/category/product) [Tutorials](https://www.firecrawl.dev/blog/category/tutorials) [Customer Stories](https://www.firecrawl.dev/blog/category/customer-stories) [Tips & Resources](https://www.firecrawl.dev/blog/category/tips-and-resources) [![Building a Clone of OpenAI's Deep Research with TypeScript and Firecrawl](https://www.firecrawl.dev/images/blog/open-deep-research/open-deep-research.jpg)\\ **Building a Clone of OpenAI's Deep Research with TypeScript and Firecrawl** \\ Learn how to build an open-source alternative to OpenAI's Deep Research using TypeScript, Firecrawl, and LLMs. This tutorial covers web scraping, AI processing, and building a performant research platform.\\ \\ By Bex TuychievFeb 24, 2025](https://www.firecrawl.dev/blog/open-deep-research-explainer) [![How to Create Custom Instruction Datasets for LLM Fine-tuning](https://www.firecrawl.dev/images/blog/custom-instruction-dataset/custom-instruction-dataset.jpg)\\ **How to Create Custom Instruction Datasets for LLM Fine-tuning** \\ Learn how to build high-quality instruction datasets for fine-tuning large language models (LLMs). This guide covers when to create custom datasets, best practices for data collection and curation, and a practical example of building a code documentation dataset.\\ \\ By Bex TuychievFeb 18, 2025](https://www.firecrawl.dev/blog/custom-instruction-datasets-llm-fine-tuning) [![Fine-tuning DeepSeek R1 on a Custom Instructions Dataset](https://www.firecrawl.dev/images/blog/deepseek-fine-tune/deepseek-fine-tune.jpg)\\ **Fine-tuning DeepSeek R1 on a Custom Instructions Dataset** \\ A comprehensive guide on fine-tuning DeepSeek R1 language models using custom instruction datasets, covering model selection, dataset preparation, and practical implementation steps.\\ \\ By Bex TuychievFeb 18, 2025](https://www.firecrawl.dev/blog/fine-tuning-deepseek) [![How Replit Uses Firecrawl to Power Replit Agent](https://www.firecrawl.dev/images/blog/customer-story-replit.jpg)\\ **How Replit Uses Firecrawl to Power Replit Agent** \\ Discover how Replit leverages Firecrawl to keep Replit Agent up to date with the latest API documentation and web content.\\ \\ By Zhen LiFeb 17, 2025](https://www.firecrawl.dev/blog/how-replit-uses-firecrawl-to-power-ai-agents) [![Building an Intelligent Code Documentation RAG Assistant with DeepSeek and Firecrawl](https://www.firecrawl.dev/images/blog/deepseek_rag/deepseek-rag-documentation-assistant.jpg)\\ **Building an Intelligent Code Documentation RAG Assistant with DeepSeek and Firecrawl** \\ Learn how to build an intelligent documentation assistant powered by DeepSeek and RAG (Retrieval Augmented Generation) that can answer questions about any documentation website by combining language models with efficient information retrieval.\\ \\ By Bex TuychievFeb 10, 2025](https://www.firecrawl.dev/blog/deepseek-rag-documentation-assistant) [![Automated Data Collection - A Comprehensive Guide](https://www.firecrawl.dev/images/blog/automated_data_collection/automated-data-collection.jpg)\\ **Automated Data Collection - A Comprehensive Guide** \\ Learn how to build robust automated data collection systems using modern tools and best practices. This guide covers everything from selecting the right tools to implementing scalable collection pipelines.\\ \\ By Bex TuychievFeb 2, 2025](https://www.firecrawl.dev/blog/automated-data-collection-guide) [![Building an AI Resume Job Matching App With Firecrawl And Claude](https://www.firecrawl.dev/images/blog/resume_parser/ai-resume-parser.jpg)\\ **Building an AI Resume Job Matching App With Firecrawl And Claude** \\ Learn how to build an AI-powered job matching system that automatically scrapes job postings, parses resumes, evaluates opportunities using Claude, and sends Discord alerts for matching positions using Firecrawl, Streamlit, and Supabase.\\ \\ By Bex TuychievFeb 1, 2025](https://www.firecrawl.dev/blog/ai-resume-parser-job-matcher-python) [![Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude](https://www.firecrawl.dev/images/blog/company-data-scraping/company-data-scraping.jpg)\\ **Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude** \\ Learn how to build a web scraper in Python that gathers company details, funding rounds, and investor information from public sources like Crunchbase using Firecrawl and Claude for automated data collection and analysis.\\ \\ By Bex TuychievJan 31, 2025](https://www.firecrawl.dev/blog/crunchbase-scraping-with-firecrawl-claude) [![Mastering the Extract Endpoint in Firecrawl](https://www.firecrawl.dev/images/blog/extract_endpoint/mastering-extract-endpoint.jpg)\\ **Mastering the Extract Endpoint in Firecrawl** \\ Learn how to use Firecrawl's extract endpoint to automatically gather structured data from any website using AI. Build powerful web scrapers, create training datasets, and enrich your data without writing complex code.\\ \\ By Bex TuychievJan 23, 2025](https://www.firecrawl.dev/blog/mastering-firecrawl-extract-endpoint) [![Introducing /extract: Get structured web data with just a prompt](https://www.firecrawl.dev/images/blog/firecrawl-extract-endpoint.png)\\ **Introducing /extract: Get structured web data with just a prompt** \\ Our new /extract endpoint harnesses AI to turn any website into structured data for your applications seamlessly.\\ \\ By Eric CiarlaJanuary 20, 2025](https://www.firecrawl.dev/blog/introducing-extract-open-beta) [![How to Build a Bulk Sales Lead Extractor in Python Using AI](https://www.firecrawl.dev/images/blog/sales_lead_extractor/sales-lead-extractor.jpg)\\ **How to Build a Bulk Sales Lead Extractor in Python Using AI** \\ Learn how to build an automated sales lead extraction tool in Python that uses AI to scrape company information from websites, exports data to Excel, and streamlines the lead generation process using Firecrawl and Streamlit.\\ \\ By Bex TuychievJan 12, 2025](https://www.firecrawl.dev/blog/sales-lead-extractor-python-ai) [![Building a Trend Detection System with AI in TypeScript: A Step-by-Step Guide](https://www.firecrawl.dev/images/blog/trend_finder/trend-finder-typescript.jpg)\\ **Building a Trend Detection System with AI in TypeScript: A Step-by-Step Guide** \\ Learn how to build an automated trend detection system in TypeScript that monitors social media and news sites, analyzes content with AI, and sends real-time Slack alerts using Firecrawl, Together AI, and GitHub Actions.\\ \\ By Bex TuychievJan 11, 2025](https://www.firecrawl.dev/blog/trend-finder-typescript) [![How to Build an Automated Competitor Price Monitoring System with Python](https://www.firecrawl.dev/images/blog/competitor_price_scraping/competitor-price-scraping.jpg)\\ **How to Build an Automated Competitor Price Monitoring System with Python** \\ Learn how to build an automated competitor price monitoring system in Python that tracks prices across e-commerce sites, provides real-time comparisons, and maintains price history using Firecrawl, Streamlit, and GitHub Actions.\\ \\ By Bex TuychievJan 6, 2025](https://www.firecrawl.dev/blog/automated-competitor-price-scraping) [![How Stack AI Uses Firecrawl to Power AI Agents](https://www.firecrawl.dev/images/blog/customer-story-stackai.jpg)\\ **How Stack AI Uses Firecrawl to Power AI Agents** \\ Discover how Stack AI leverages Firecrawl to seamlessly feed agentic AI workflows with high-quality web data.\\ \\ By Jonathan KleimanJan 3, 2025](https://www.firecrawl.dev/blog/how-stack-ai-uses-firecrawl-to-power-ai-agents) [![BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python](https://www.firecrawl.dev/images/blog/bs4_scrapy/bs4-vs-scrapy-comparison.jpg)\\ **BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python** \\ Learn the key differences between BeautifulSoup4 and Scrapy for web scraping in Python. Compare their features, performance, and use cases to choose the right tool for your web scraping needs.\\ \\ By Bex TuychievDec 24, 2024](https://www.firecrawl.dev/blog/beautifulsoup4-vs-scrapy-comparison) [![15 Python Web Scraping Projects: From Beginner to Advanced](https://www.firecrawl.dev/images/blog/web_scraping_projects/python-web-scraping-projects.jpg)\\ **15 Python Web Scraping Projects: From Beginner to Advanced** \\ Explore 15 hands-on web scraping projects in Python, from beginner to advanced level. Learn essential concepts like data extraction, concurrent processing, and distributed systems while building real-world applications.\\ \\ By Bex TuychievDec 17, 2024](https://www.firecrawl.dev/blog/python-web-scraping-projects) [![How to Deploy Python Web Scrapers](https://www.firecrawl.dev/images/blog/deploying-web-scrapers/deploy-web-scrapers.jpg)\\ **How to Deploy Python Web Scrapers** \\ Learn how to deploy Python web scrapers using GitHub Actions, Heroku, PythonAnywhere and more.\\ \\ By Bex TuychievDec 16, 2024](https://www.firecrawl.dev/blog/deploy-web-scrapers) [![Why Companies Need a Data Strategy for Generative AI](https://www.firecrawl.dev/images/blog/data-strategy.jpg)\\ **Why Companies Need a Data Strategy for Generative AI** \\ Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.\\ \\ By Eric CiarlaDec 15, 2024](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [![Data Enrichment: A Complete Guide to Enhancing Your Data Quality](https://www.firecrawl.dev/images/blog/data_enrichment_guide/complete-data-enrichment-guide.jpg)\\ **Data Enrichment: A Complete Guide to Enhancing Your Data Quality** \\ Learn how to enrich your data quality with a comprehensive guide covering data enrichment tools, best practices, and real-world examples. Discover how to leverage modern solutions like Firecrawl to automate data collection, validation, and integration for better business insights.\\ \\ By Bex TuychievDec 14, 2024](https://www.firecrawl.dev/blog/complete-guide-to-data-enrichment) [![A Complete Guide Scraping Authenticated Websites with cURL and Firecrawl](https://www.firecrawl.dev/images/blog/complete-guide-to-curl-authentication-firecrawl-api.jpg)\\ **A Complete Guide Scraping Authenticated Websites with cURL and Firecrawl** \\ Learn how to scrape login-protected websites using cURL and Firecrawl API. Step-by-step guide covering basic auth, tokens, and cookies with real examples.\\ \\ By Rudrank RiyamDec 13, 2024](https://www.firecrawl.dev/blog/complete-guide-to-curl-authentication-firecrawl-api) [![Building an Automated Price Tracking Tool](https://www.firecrawl.dev/images/blog/price-tracking/price-tracking.jpg)\\ **Building an Automated Price Tracking Tool** \\ Build an automated e-commerce price tracker in Python. Learn web scraping, price monitoring, and automated alerts using Firecrawl, Streamlit, PostgreSQL.\\ \\ By Bex TuychievDec 9, 2024](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python) [![Evaluating Web Data Extraction with CrawlBench](https://www.firecrawl.dev/images/blog/crawlbench/crawlbench.jpg)\\ **Evaluating Web Data Extraction with CrawlBench** \\ An in-depth exploration of CrawlBench, a benchmark for testing LLM-based web data extraction.\\ \\ By SwyxDec 9, 2024](https://www.firecrawl.dev/blog/crawlbench-llm-extraction) [![How Cargo Empowers GTM Teams with Firecrawl](https://www.firecrawl.dev/images/blog/customer-story-cargo.jpg)\\ **How Cargo Empowers GTM Teams with Firecrawl** \\ See how Cargo uses Firecrawl to instantly analyze webpage content and power Go-To-Market workflows for their users.\\ \\ By Tariq MinhasDec 6, 2024](https://www.firecrawl.dev/blog/how-cargo-empowers-gtm-teams-with-firecrawl) [![Web Scraping Automation: How to Run Scrapers on a Schedule](https://www.firecrawl.dev/images/blog/scheduling-scrapers-images/automated-web-scraping-free-2025.jpg)\\ **Web Scraping Automation: How to Run Scrapers on a Schedule** \\ Learn how to automate web scraping in Python using free tools like schedule, asyncio, cron jobs and GitHub Actions. This comprehensive guide covers local and cloud-based scheduling methods to run scrapers reliably in 2025.\\ \\ By Bex TuychievDec 5, 2024](https://www.firecrawl.dev/blog/automated-web-scraping-free-2025) [![How to Generate Sitemaps Using Firecrawl's /map Endpoint: A Complete Guide](https://www.firecrawl.dev/images/blog/generating-sitemaps/how-to-generate-sitemap-using-firecrawl-map-endpoint.jpg)\\ **How to Generate Sitemaps Using Firecrawl's /map Endpoint: A Complete Guide** \\ Learn how to generate XML and visual sitemaps using Firecrawl's /map endpoint. Step-by-step guide with Python code examples, performance comparisons, and interactive visualization techniques for effective website mapping.\\ \\ By Bex TuychievNov 29, 2024](https://www.firecrawl.dev/blog/how-to-generate-sitemaps-using-firecrawl-map-endpoint) [![How to Use Firecrawl's Scrape API: Complete Web Scraping Tutorial](https://www.firecrawl.dev/images/blog/scrape-masterclass/mastering-scrape.jpg)\\ **How to Use Firecrawl's Scrape API: Complete Web Scraping Tutorial** \\ Learn how to scrape websites using Firecrawl's /scrape endpoint. Master JavaScript rendering, structured data extraction, and batch operations with Python code examples.\\ \\ By Bex TuychievNov 25, 2024](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint) [![How to Create an llms.txt File for Any Website](https://www.firecrawl.dev/images/blog/How-to-Create-an-llms-txt-File-for-Any-Website.jpg)\\ **How to Create an llms.txt File for Any Website** \\ Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.\\ \\ By Eric CiarlaNov 22, 2024](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [![Mastering Firecrawl's Crawl Endpoint: A Complete Web Scraping Guide](https://www.firecrawl.dev/images/blog/crawl-masterclass/images/mastering-crawl.jpg)\\ **Mastering Firecrawl's Crawl Endpoint: A Complete Web Scraping Guide** \\ Learn how to use Firecrawl's /crawl endpoint for efficient web scraping. Master URL control, performance optimization, and integration with LangChain for AI-powered data extraction.\\ \\ By Bex TuychievNov 18, 2024](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl) [![Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses](https://www.firecrawl.dev/images/blog/openai-predicted-outputs.jpg)\\ **Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses** \\ A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.\\ \\ By Eric CiarlaNov 5, 2024](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [![Launch Week II Recap](https://www.firecrawl.dev/images/blog/launch-week-ii-recap.jpg)\\ **Launch Week II Recap** \\ Recapping all the exciting announcements from Firecrawl's second Launch Week.\\ \\ By Eric CiarlaNovember 4, 2024](https://www.firecrawl.dev/blog/launch-week-ii-recap) [![Launch Week II - Day 7: Introducing Faster Markdown Parsing](https://www.firecrawl.dev/images/blog/firecrawl-faster-markdown.jpg)\\ **Launch Week II - Day 7: Introducing Faster Markdown Parsing** \\ Our new HTML to Markdown parser is 4x faster, more reliable, and produces cleaner Markdown, built from the ground up for speed and performance.\\ \\ By Eric CiarlaNovember 3, 2024](https://www.firecrawl.dev/blog/launch-week-ii-day-7-introducing-faster-markdown-parsing) [![Launch Week II - Day 6: Introducing Mobile Scraping and Mobile Screenshots](https://www.firecrawl.dev/images/blog/firecrawl-mobile-scraping.jpg)\\ **Launch Week II - Day 6: Introducing Mobile Scraping and Mobile Screenshots** \\ Interact with sites as if from a mobile device using Firecrawl's new mobile device emulation.\\ \\ By Eric CiarlaNovember 2, 2024](https://www.firecrawl.dev/blog/launch-week-ii-day-6-introducing-mobile-scraping) [![Launch Week II - Day 5: Introducing New Actions](https://www.firecrawl.dev/images/blog/firecrawl-new-actions.jpg)\\ **Launch Week II - Day 5: Introducing New Actions** \\ Capture page content at any point and wait for specific elements with our new Scrape and Wait for Selector actions.\\ \\ By Eric CiarlaNovember 1, 2024](https://www.firecrawl.dev/blog/launch-week-ii-day-5-introducing-two-new-actions) [![Launch Week II - Day 4: Advanced iframe Scraping](https://www.firecrawl.dev/images/blog/firecrawl-iframe-scraping.jpg)\\ **Launch Week II - Day 4: Advanced iframe Scraping** \\ We are thrilled to announce comprehensive iframe scraping support in Firecrawl, enabling seamless handling of nested iframes, dynamically loaded content, and cross-origin frames.\\ \\ By Eric CiarlaOctober 31, 2024](https://www.firecrawl.dev/blog/launch-week-ii-day-4-advanced-iframe-scraping) [![Launch Week II - Day 3: Introducing Credit Packs](https://www.firecrawl.dev/images/blog/firecrawl-credit-packs.jpg)\\ **Launch Week II - Day 3: Introducing Credit Packs** \\ Easily top up your plan with Credit Packs to keep your web scraping projects running smoothly. Plus, manage your credits effortlessly with our new Auto Recharge feature.\\ \\ By Eric CiarlaOctober 30, 2024](https://www.firecrawl.dev/blog/launch-week-ii-day-3-introducing-credit-packs) [![Launch Week II - Day 2: Introducing Location and Language Settings](https://www.firecrawl.dev/images/blog/firecrawl-location-language.jpg)\\ **Launch Week II - Day 2: Introducing Location and Language Settings** \\ Specify country and preferred languages to get relevant localized content, enhancing your web scraping results with region-specific data.\\ \\ By Eric CiarlaOctober 29, 2024](https://www.firecrawl.dev/blog/launch-week-ii-day-2-introducing-location-language-settings) [![Launch Week II - Day 1: Introducing the Batch Scrape Endpoint](https://www.firecrawl.dev/images/blog/firecrawl-batch-scrape.jpg)\\ **Launch Week II - Day 1: Introducing the Batch Scrape Endpoint** \\ Our new Batch Scrape endpoint lets you scrape multiple URLs simultaneously, making bulk data collection faster and more efficient.\\ \\ By Eric CiarlaOctober 28, 2024](https://www.firecrawl.dev/blog/launch-week-ii-day-1-introducing-batch-scrape-endpoint) [![Getting Started with Grok-2: Setup and Web Crawler Example](https://www.firecrawl.dev/images/blog/grok-2-web-crawler.jpg)\\ **Getting Started with Grok-2: Setup and Web Crawler Example** \\ A detailed guide on setting up Grok-2 and building a web crawler using Firecrawl.\\ \\ By Nicolas CamaraOct 21, 2024](https://www.firecrawl.dev/blog/grok-2-setup-and-web-crawler-example) [![OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website](https://www.firecrawl.dev/images/blog/openai-swarm.png)\\ **OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website** \\ A guide to building a multi-agent system using OpenAI Swarm and Firecrawl for AI-driven marketing strategies\\ \\ By Nicolas CamaraOct 12, 2024](https://www.firecrawl.dev/blog/openai-swarm-agent-tutorial) [![Using OpenAI's Realtime API and Firecrawl to Talk with Any Website](https://www.firecrawl.dev/images/blog/How-to-Talk-with-Any-Website-Using-OpenAIs-Realtime-API-and-Firecrawl.jpg)\\ **Using OpenAI's Realtime API and Firecrawl to Talk with Any Website** \\ Build a real-time conversational agent that interacts with any website using OpenAI's Realtime API and Firecrawl.\\ \\ By Nicolas CamaraOct 11, 2024](https://www.firecrawl.dev/blog/How-to-Talk-with-Any-Website-Using-OpenAIs-Realtime-API-and-Firecrawl) [![Scraping Job Boards Using Firecrawl Actions and OpenAI](https://www.firecrawl.dev/images/blog/firecrawl-openai-job-scraping.jpg)\\ **Scraping Job Boards Using Firecrawl Actions and OpenAI** \\ A step-by-step guide to scraping job boards and extracting structured data using Firecrawl and OpenAI.\\ \\ By Eric CiarlaSept 27, 2024](https://www.firecrawl.dev/blog/scrape-job-boards-firecrawl-openai) [![Build a Full-Stack AI Web App in 12 Minutes](https://www.firecrawl.dev/images/blog/Build-a-Full-Stack-AI-Web-App-in-12-Minutes.png)\\ **Build a Full-Stack AI Web App in 12 Minutes** \\ Build a Full-Stack AI Web App in 12 minutes with Cursor, OpenAI o1, V0, Firecrawl & Patched\\ \\ By Dev DigestSep 18, 2024](https://www.firecrawl.dev/blog/Build-a-Full-Stack-AI-Web-App-in-12-Minutes) [![How to Use OpenAI's o1 Reasoning Models in Your Applications](https://www.firecrawl.dev/images/blog/how-to-use-openai-o1-reasoning-models-in-applications.jpg)\\ **How to Use OpenAI's o1 Reasoning Models in Your Applications** \\ Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.\\ \\ By Eric CiarlaSep 16, 2024](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications) [![Handling 300k requests per day: an adventure in scaling](https://www.firecrawl.dev/images/blog/an-adventure-in-scaling.jpg)\\ **Handling 300k requests per day: an adventure in scaling** \\ Putting out fires was taking up all our time, and we had to scale fast. This is how we did it.\\ \\ By Gergő Móricz (mogery)Sep 13, 2024](https://www.firecrawl.dev/blog/an-adventure-in-scaling) [![How Athena Intelligence Empowers Enterprise Analysts with Firecrawl](https://www.firecrawl.dev/images/blog/customer-story-athena-intelligence.jpg)\\ **How Athena Intelligence Empowers Enterprise Analysts with Firecrawl** \\ Discover how Athena Intelligence leverages Firecrawl to fuel its AI-native analytics platform for enterprise analysts.\\ \\ By Ben ReillySep 10, 2024](https://www.firecrawl.dev/blog/how-athena-intelligence-empowers-analysts-with-firecrawl) [![Launch Week I Recap](https://www.firecrawl.dev/images/blog/launch-week-1-recap.png)\\ **Launch Week I Recap** \\ A look back at the new features and updates introduced during Firecrawl's inaugural Launch Week.\\ \\ By Eric CiarlaSeptember 2, 2024](https://www.firecrawl.dev/blog/firecrawl-launch-week-1-recap) [![Launch Week I / Day 7: Crawl Webhooks (v1)](https://www.firecrawl.dev/images/blog/webhooks.png)\\ **Launch Week I / Day 7: Crawl Webhooks (v1)** \\ New /crawl webhook support. Send notifications to your apps during a crawl.\\ \\ By Nicolas CamaraSeptember 1, 2024](https://www.firecrawl.dev/blog/launch-week-i-day-7-webhooks) [![Launch Week I / Day 6: LLM Extract (v1)](https://www.firecrawl.dev/images/blog/firecrawl-llm-extract.png)\\ **Launch Week I / Day 6: LLM Extract (v1)** \\ Extract structured data from your web pages using the extract format in /scrape.\\ \\ By Nicolas CamaraAugust 31, 2024](https://www.firecrawl.dev/blog/launch-week-i-day-6-llm-extract) [![Launch Week I / Day 5: Real-Time Crawling with WebSockets](https://www.firecrawl.dev/images/blog/firecrawl-websockets.png)\\ **Launch Week I / Day 5: Real-Time Crawling with WebSockets** \\ Our new WebSocket-based method for real-time data extraction and monitoring.\\ \\ By Eric CiarlaAugust 30, 2024](https://www.firecrawl.dev/blog/launch-week-i-day-5-real-time-crawling-websockets) [![Launch Week I / Day 4: Introducing Firecrawl /v1](https://www.firecrawl.dev/images/blog/firecrawl-v1-release.png)\\ **Launch Week I / Day 4: Introducing Firecrawl /v1** \\ Our biggest release yet - v1, a more reliable and developer-friendly API for seamless web data gathering.\\ \\ By Eric CiarlaAugust 29, 2024](https://www.firecrawl.dev/blog/launch-week-i-day-4-introducing-firecrawl-v1) [![Launch Week I / Day 3: Introducing the Map Endpoint](https://www.firecrawl.dev/images/blog/firecrawl-map-endpoint.png)\\ **Launch Week I / Day 3: Introducing the Map Endpoint** \\ Our new Map endpoint enables lightning-fast website mapping for enhanced web scraping projects.\\ \\ By Eric CiarlaAugust 28, 2024](https://www.firecrawl.dev/blog/launch-week-i-day-3-introducing-map-endpoint) [![Launch Week I / Day 2: 2x Rate Limits](https://www.firecrawl.dev/images/blog/firecrawl-rate-limits.png)\\ **Launch Week I / Day 2: 2x Rate Limits** \\ Firecrawl doubles rate limits across all plans, supercharging your web scraping capabilities.\\ \\ By Eric CiarlaAugust 27, 2024](https://www.firecrawl.dev/blog/launch-week-i-day-2-doubled-rate-limits) [![Launch Week I / Day 1: Introducing Teams](https://www.firecrawl.dev/images/blog/firecrawl-teams.png)\\ **Launch Week I / Day 1: Introducing Teams** \\ Our new Teams feature, enabling seamless collaboration on web scraping projects.\\ \\ By Eric CiarlaAugust 26, 2024](https://www.firecrawl.dev/blog/launch-week-i-day-1-introducing-teams) [![How to Use Prompt Caching and Cache Control with Anthropic Models](https://www.firecrawl.dev/images/blog/anthropic-prompt-caching.png)\\ **How to Use Prompt Caching and Cache Control with Anthropic Models** \\ Learn how to cache large context prompts with Anthropic Models like Opus, Sonnet, and Haiku for faster and cheaper chats that analyze website data.\\ \\ By Eric CiarlaAug 14, 2024](https://www.firecrawl.dev/blog/using-prompt-caching-with-anthropic) [![Building Knowledge Graphs from Web Data using CAMEL-AI and Firecrawl](https://www.firecrawl.dev/images/blog/knowledge-graph.jpg)\\ **Building Knowledge Graphs from Web Data using CAMEL-AI and Firecrawl** \\ A guide on constructing knowledge graphs from web pages using CAMEL-AI and Firecrawl\\ \\ By Wendong FanAug 13, 2024](https://www.firecrawl.dev/blog/building-knowledge-graphs-from-web-data-camelai-firecrawl) [![How Gamma Supercharges Onboarding with Firecrawl](https://www.firecrawl.dev/images/blog/customer-story-gamma.jpg)\\ **How Gamma Supercharges Onboarding with Firecrawl** \\ See how Gamma uses Firecrawl to instantly generate websites and presentations to 20+ million users.\\ \\ By Jon NoronhaAug 8, 2024](https://www.firecrawl.dev/blog/how-gamma-supercharges-onboarding-with-firecrawl) [![How to Use OpenAI's Structured Outputs and JSON Strict Mode](https://www.firecrawl.dev/images/blog/openai-structured-output.png)\\ **How to Use OpenAI's Structured Outputs and JSON Strict Mode** \\ A guide for getting structured data from the latest OpenAI models.\\ \\ By Eric CiarlaAug 7, 2024](https://www.firecrawl.dev/blog/using-structured-output-and-json-strict-mode-openai) [![Introducing Fire Engine for Firecrawl](https://www.firecrawl.dev/images/blog/fire-engine-launch.png)\\ **Introducing Fire Engine for Firecrawl** \\ The most scalable, reliable, and fast way to get web data for Firecrawl.\\ \\ By Eric CiarlaAug 6, 2024](https://www.firecrawl.dev/blog/introducing-fire-engine-for-firecrawl) [![Firecrawl July 2024 Updates](https://www.firecrawl.dev/images/blog/launch-yc-firecrawl.png)\\ **Firecrawl July 2024 Updates** \\ Discover the latest features, integrations, and improvements in Firecrawl for July 2024.\\ \\ By Eric CiarlaJuly 31, 2024](https://www.firecrawl.dev/blog/firecrawl-july-2024-updates) [![Firecrawl June 2024 Updates](https://www.firecrawl.dev/images/blog/dashboard2.png)\\ **Firecrawl June 2024 Updates** \\ Discover the latest features, integrations, and improvements in Firecrawl for June 2024.\\ \\ By Nicolas CamaraJune 30, 2024](https://www.firecrawl.dev/blog/firecrawl-june-2024-updates) [![Scrape and Analyze Airbnb Data with Firecrawl and E2B](https://www.firecrawl.dev/images/blog/firecrawl-e2b-airbnb.png)\\ **Scrape and Analyze Airbnb Data with Firecrawl and E2B** \\ Learn how to scrape and analyze Airbnb data using Firecrawl and E2B in a few lines of code.\\ \\ By Nicolas CamaraMay 23, 2024](https://www.firecrawl.dev/blog/scrape-analyze-airbnb-data-with-e2b) [![Build a 'Chat with website' using Groq Llama 3](https://www.firecrawl.dev/images/blog/g4.png)\\ **Build a 'Chat with website' using Groq Llama 3** \\ Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.\\ \\ By Nicolas CamaraMay 22, 2024](https://www.firecrawl.dev/blog/chat-with-website) [![Using LLM Extraction for Customer Insights](https://www.firecrawl.dev/images/blog/g3.png)\\ **Using LLM Extraction for Customer Insights** \\ Using LLM Extraction for Insights and Lead Generation using Make and Firecrawl.\\ \\ By Caleb PefferMay 21, 2024](https://www.firecrawl.dev/blog/lead-gen-business-insights-make-firecrawl) [![Extract website data using LLMs](https://www.firecrawl.dev/images/blog/g2.png)\\ **Extract website data using LLMs** \\ Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code.\\ \\ By Nicolas CamaraMay 20, 2024](https://www.firecrawl.dev/blog/data-extraction-using-llms) [![Build an agent that checks for website contradictions](https://www.firecrawl.dev/images/blog/g1.png)\\ **Build an agent that checks for website contradictions** \\ Using Firecrawl and Claude to scrape your website's data and look for contradictions.\\ \\ By Eric CiarlaMay 19, 2024](https://www.firecrawl.dev/blog/contradiction-agent) [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## Firecrawl Changelog Updates Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) ## ChangelogNew - Feb 20, 2025 ## Self Host Overhaul - v1.5.0 ### Self-Host Fixes - **Reworked Guide:** The `SELF_HOST.md` and `docker-compose.yaml` have been updated for clarity and compatibility - **Kubernetes Imporvements:** Updated self-hosted Kubernetes deployment examples for compatibility and consistency (#1177) - **Self-Host Fixes:** Numerous fixes aimed at improving self-host performance and stability (#1207) - **Proxy Support:** Added proxy support tailored for self-hosted environments (#1212) - **Playwright Integration:** Added fixes and continuous integration for the Playwright microservice (#1210) - **Search Endpoint Upgrade:** Added SearXNG support for the `/search` endpoint (#1193) ### Core Fixes & Enhancements - **Crawl Status Fixes:** Fixed various race conditions in the crawl status endpoint (#1184) - **Timeout Enforcement:** Added timeout for scrapeURL engines to prevent hanging requests (#1183) - **Query Parameter Retention:** Map function now preserves query parameters in results (#1191) - **Screenshot Action Order:** Ensured screenshots execute after specified actions (#1192) - **PDF Scraping:** Improved handling for PDFs behind anti-bot measures (#1198) - **Map/scrapeURL Abort Control:** Integrated AbortController to stop scraping when the request times out (#1205) - **SDK Timeout Enforcement:** Enforced request timeouts in the SDK (#1204) ### New Features & Additions - **Proxy & Stealth Options:** Introduced a proxy option and stealthProxy flag (#1196) - **Deep Research (Alpha):** Launched an alpha implementation of deep research (#1202) - **LLM Text Generator:** Added a new endpoint for llms.txt generation (#1201) ### Docker & Containerization - **Production Ready Docker Image:** A streamlined, production ready Docker image is now available to simplify self-hosted deployments. - Feb 14, 2025 ## v1.4.4 ### Features & Enhancements - Scrape API: Added action & wait time validation ( [#1146](https://github.com/mendableai/firecrawl/pull/1146)) - Extraction Improvements: - Added detection of PDF/image sub-links & extracted text via Gemini ( [#1173](https://github.com/mendableai/firecrawl/pull/1173)) - Multi-entity prompt enhancements for extraction ( [#1181](https://github.com/mendableai/firecrawl/pull/1181)) - Show sources out of \_\_experimental in extraction ( [#1180](https://github.com/mendableai/firecrawl/pull/1180)) - Environment Setup: Added Serper & Search API env vars to docker-compose ( [#1147](https://github.com/mendableai/firecrawl/pull/1147)) - Credit System Update: Now displays “tokens” instead of “credits” when out of tokens ( [#1178](https://github.com/mendableai/firecrawl/pull/1178)) ### Examples - Gemini 2.0 Crawler: Implemented new crawling example ( [#1161](https://github.com/mendableai/firecrawl/pull/1161)) - Gemini TrendFinder: [https://github.com/mendableai/gemini-trendfinder](https://github.com/mendableai/gemini-trendfinder) - Normal Search to Open Deep Research: [https://github.com/nickscamara/open-deep-research](https://github.com/nickscamara/open-deep-research) ### Fixes - HTML Transformer: Updated free\_string function parameter type ( [#1163](https://github.com/mendableai/firecrawl/pull/1163)) - Gemini Crawler: Updated library & improved PDF link extraction ( [#1175](https://github.com/mendableai/firecrawl/pull/1175)) - Crawl Queue Worker: Only reports successful page count in num\_docs ( [#1179](https://github.com/mendableai/firecrawl/pull/1179)) - Scraping & URLs: - Fixed relative URL conversion ( [#584](https://github.com/mendableai/firecrawl/pull/584)) - Enforced scrape rate limit in batch scraping ( [#1182](https://github.com/mendableai/firecrawl/pull/1182)) - Feb 7, 2025 ## Examples Week - v1.4.3 ### Summary of changes - Open Deep Research: An open source version of OpenAI Deep Research. See here: [https://github.com/nickscamara/open-deep-research](https://github.com/nickscamara/open-deep-research) - R1 Web Extractor Feature: New extraction capability added. - O3-Mini Web Crawler: Introduces a lightweight crawler for specific use cases. - Updated Model Parameters: Enhancements to o3-mini\_company\_researcher. - URL Deduplication: Fixes handling of URLs ending with /, index.html, index.php, etc. - Improved URL Blocking: Uses tldts parsing for better blocklist management. - Valid JSON via rawHtml in Scrape: Ensures valid JSON extraction. - Product Reviews Summarizer: Implements summarization using o3-mini. - Scrape Options for Extract: Adds more configuration options for extracting data. - O3-Mini Job Resource Extractor: Extracts job-related resources using o3-mini. - Cached Scrapes for Extract evals: Improves performance by using cached data for extractions evals. - Jan 31, 2025 ## Extract & API Improvements - v1.4.2 We’re excited to announce several new features and improvements: ### New Features - Added web search capabilities to the extract endpoint via the `enableWebSearch` parameter - Introduced source tracking with `__experimental_showSources` parameter - Added configurable webhook events for crawl and batch operations - New `timeout` parameter for map endpoint - Optional ad blocking with `blockAds` parameter (enabled by default) ### Infrastructure & UI - Enhanced proxy selection and infrastructure reliability - Added domain checker tool to cloud platform - Redesigned LLMs.txt generator interface for better usability - Jan 24, 2025 ## Extract Improvements - v1.4.1 We’ve significantly enhanced our data extraction capabilities with several key updates: - Extract now returns a lot more data - Improved infrastructure reliability - Migrated from Cheerio to a high-performance Rust-based parser for faster and more memory-efficient parsing - Enhanced crawl cancellation functionality for better control over running jobs - Jan 7, 2025 ## /extract changes We have updated the `/extract` endpoint to now be asynchronous. When you make a request to `/extract`, it will return an ID that you can use to check the status of your extract job. If you are using our SDKs, there are no changes required to your code, but please make sure to update the SDKs to the latest versions as soon as possible. For those using the API directly, we have made it backwards compatible. However, you have 10 days to update your implementation to the new asynchronous model. For more details about the parameters, refer to the docs sent to you. - Jan 3, 2025 ## v1.2.0 ### Introducing /v1/search The search endpoint combines web search with Firecrawl’s scraping capabilities to return full page content for any query. Include `scrapeOptions` with `formats: ["markdown"]` to get complete markdown content for each search result otherwise it defaults to getting SERP results (url, title, description). More info here: [v1/search docs](https://docs.firecrawl.dev/api-reference/endpoint/search) ### Fixes and improvements - Fixed LLM not following the schema in the python SDK for `/extract` - Fixed schema json not being able to be sent to the `/extract` endpoint through the Node SDK - Prompt is now optional for the `/extract` endpoint - Our fork of [MinerU](https://github.com/mendableai/mineru-api) is now default for PDF Parsing - Dec 27, 2024 ## v1.1.0 ### Changelog Highlights #### Feature Enhancements - **New Features**: - Geolocation, mobile scraping, 4x faster parsing, better webhooks, - Credit packs, auto-recharges and batch scraping support. - Iframe support and query parameter differentiation for URLs. - Similar URL deduplication. - Enhanced map ranking and sitemap fetching. #### Performance Improvements - Faster crawl status filtering and improved map ranking algorithm. - Optimized Kubernetes setup and simplified build processes. - Sitemap discoverability and performance improved #### Bug Fixes - Resolved issues: - Badly formatted JSON, scrolling actions, and encoding errors. - Crawl limits, relative URLs, and missing error handlers. - Fixed self-hosted crawling inconsistencies and schema errors. #### SDK Updates - Added dynamic WebSocket imports with fallback support. - Optional API keys for self-hosted instances. - Improved error handling across SDKs. #### Documentation Updates - Improved API docs and examples. - Updated self-hosting URLs and added Kubernetes optimizations. - Added articles: mastering `/scrape` and `/crawl`. #### Miscellaneous - Added new Firecrawl examples - Enhanced metadata handling for webhooks and improved sitemap fetching. - Updated blocklist and streamlined error messages. - Oct 28, 2024 ![Batch Scrape](https://www.firecrawl.dev/images/blog/firecrawl-batch-scrape.jpg) ## Introducing Batch Scrape You can now scrape multiple URLs simultaneously with our new Batch Scrape endpoint. - Read more about the Batch Scrape endpoint [here](https://www.firecrawl.dev/blog/launch-week-ii-day-1-introducing-batch-scrape-endpoint). - Python SDK (1.4.x) and Node SDK (1.7.x) updated with batch scrape support. - Oct 10, 2024 ## Cancel Crawl in the SDKs, More Examples, Improved Speed - Added crawl cancellation support for the Python SDK (1.3.x) and Node SDK (1.6.x) - OpenAI Voice + Firecrawl example added to the repo - CRM lead enrichment example added to the repo - Improved our Docker images - Limit and timeout fixes for the self hosted playwright scraper - Improved speed of all scrapes - Sep 27, 2024 ## Fixes + Improvements (no version bump) - Fixed 500 errors that would happen often in some crawled websites and when servers were at capacity - Fixed an issue where v1 crawl status wouldn’t properly return pages over 10mb - Fixed an issue where `screenshot` would return undefined - Push improvements that reduce speed times when a scraper fails - Sep 24, 2024 ![Actions](https://www.firecrawl.dev/images/actions.png) ## Introducing Actions Interact with pages before extracting data, unlocking more data from every site! Firecrawl now allows you to perform various actions on a web page before scraping its content. This is particularly useful for interacting with dynamic content, navigating through pages, or accessing content that requires user interaction. - Version 1.5.x of the Node SDK now supports type-safe Actions. - Actions are now available in the REST API and Python SDK (no version bumps required!). Here is a python example of how to use actions to navigate to google.com, search for Firecrawl, click on the first result, and take a screenshot. ```python from firecrawl import FirecrawlApp app = FirecrawlApp(api_key="fc-YOUR_API_KEY") # Scrape a website: scrape_result = app.scrape_url('firecrawl.dev', params={ 'formats': ['markdown', 'html'], 'actions': [\ {"type": "wait", "milliseconds": 2000},\ {"type": "click", "selector": "textarea[title=\"Search\"]"},\ {"type": "wait", "milliseconds": 2000},\ {"type": "write", "text": "firecrawl"},\ {"type": "wait", "milliseconds": 2000},\ {"type": "press", "key": "ENTER"},\ {"type": "wait", "milliseconds": 3000},\ {"type": "click", "selector": "h3"},\ {"type": "wait", "milliseconds": 3000},\ {"type": "screenshot"}\ ] } ) print(scrape_result) ``` For more examples, check out our [API Reference](https://docs.firecrawl.dev/api-reference/endpoint/scrape). - Sep 23, 2024 ![Firecrawl E2E Type Safe LLM Extract](https://www.firecrawl.dev/images/newllmextract.jpeg) ## Mid-September Updates ### Typesafe LLM Extract - E2E Type Safety for LLM Extract in Node SDK version 1.5.x. - 10x cheaper in the cloud version. From 50 to 5 credits per extract. - Improved speed and reliability. ### Rust SDK v1.0.0 - Rust SDK v1 is finally here! Check it out [here](https://crates.io/crates/firecrawl/1.0.0). ### Map Improved Limits - Map smart results limits increased from 100 to 1000. ### Faster scrape - Scrape speed improved by 200ms-600ms depending on the website. ### Launching changelog - For now on, for every new release, we will be creating a changelog entry here. ### Improvements - Lots of improvements pushed to the infra and API. For all Mid-September changes, refer to the commits [here](https://github.com/mendableai/firecrawl/commits/main/). - Sep 8, 2024 ## September 8, 2024 ### Patch Notes (No version bump) - Fixed an issue where some of the custom header params were not properly being set in v1 API. You can now pass headers to your requests just fine. - Aug 29, 2024 ![Firecrawl V1](https://www.firecrawl.dev/images/blog/f-v1-changelog.png) ## Firecrawl V1 is here! With that we introduce a more reliable and developer friendly API. ### Here is what’s new: - Output Formats for /scrape: Choose what formats you want your output in. - New /map endpoint: Get most of the URLs of a webpage. - Developer friendly API for /crawl/id status. - 2x Rate Limits for all plans. - Go SDK and Rust SDK. - Teams support. - API Key Management in the dashboard. - onlyMainContent is now default to true. - /crawl webhooks and websocket support. Learn more about it [here](https://docs.firecrawl.dev/v1). Start using v1 right away at [https://firecrawl.dev](https://firecrawl.dev/) ## Web Data Extraction Tool Introducing **/extract** \- Now in open beta # Get web data with a prompt Turn entire websites into structured data with AI From firecrawl.dev, get the pricing. Try for Free ![Message balloon](https://www.firecrawl.dev/images/extract/balloon.svg) From **firecrawl.dev** find the company name, mission and whether it's open source. ![Message balloon](https://www.firecrawl.dev/images/extract/balloon-2.svg) { "company\_name":"Firecrawl", "company\_mission":"...", "is\_open\_source":true, } A milestone in scraping ## Web scraping was hard – now effortless Scraping the internet had everything to do with broken scripts, bad data, wasted time. With Extract, you can get any data in any format effortlessly – in a single API call. ### No more manual scraping Extract structured data from any website using natural language prompts. page = urlopen(url) html = page.read().decode("utf-8") start\_index = html.find("") + len("<title>") end\_index = html.find("") title = html\[start\_index:end\_index\] >>\> title PromptBuild a B2B lead list from these company websites. ### Stop rewriting broken scripts Say goodbye to fragile scrapers that break with every site update. Our AI understands content semantically and adapts automatically. page = urlopen(url) html = page.read().decode("utf-8") start\_idx = html.find("") + len("<title>") end\_idx = html.find("") title = html\[start\_idx:end\_idx\] >>> title page = urlopen(url) html = page.read().decode("utf-8") start\_idx = html.find("") + len("<title>") end\_idx = html.find("") title = html\[start\_idx:end\_idx\] >>> title page = urlopen(url) html = page.read().decode("utf-8") start\_idx = html.find("") + len("<title>") end\_idx = html.find("") title = html\[start\_idx:end\_idx\] >>> title page = urlopen(url) html = page.read().decode("utf-8") start\_idx = html.find("") + len("<title>") end\_idx = html.find("") title = html\[start\_idx:end\_idx\] >>> title page = urlopen(url) html = page.read().decode("utf-8") start\_idx = html.find("") + len("<title>") end\_idx = html.find("") title = html\[start\_idx:end\_idx\] >>> title page = urlopen(url) html = page.read().decode("utf-8") start\_idx = html.find("") + len("<title>") end\_idx = html.find("") title = html\[start\_idx:end\_idx\] >>> title page = urlopen(url) html = page.read().decode("utf-8") start\_idx = html.find("") + len("<title>") end\_idx = html.find("") title = html\[start\_idx:end\_idx\] >>> title page = urlopen(url) html = page.read().decode("utf-8") start\_idx = html.find("") + len("<title>") end\_idx = html.find("") title = html\[start\_idx:end\_idx\] >>> title page = urlopen(url) html = page.read().decode("utf-8") start\_idx = html.find("") + len("<title>") end\_idx = html.find("") title = html\[start\_idx:end\_idx\] >>> title page = urlopen(url) html = page.read().decode("utf-8") start\_idx = html.find("") + len("<title>") end\_idx = html.find("") title = html\[start\_idx:end\_idx\] >>> title await firecrawl.extract(\[\ \ 'https://firecrawl.dev/',\ \ \], { prompt: "Extract mission.", schema: z.object({     mission: z.string() }) }); ### Extract entire websites in a single API call Get the data you need with a simple API call, whether it's one page or thousands. Try adding a wildcard /\* to the URL.It will extract information across the site.It will find and extract information across the entire website.\> app.extract(\['https://firecrawl.dev/\*'\]) ### Forget fighting context windows No context window limits. Extract thousands of results effortlessly while we handle the complex LLM work. Extracting Video Demo ## Use Extract for everything From lead enrichment to AI onboarding to KYB – and more. Watch a demo of how Extract can help you get more out of your data. Enrichment Integrations ## Enrich data anywhere you work Integrate Extract with your favorite tools and get enriched data where you need it. Datasets ## Build datasets spread across websites Gather datasets from any website and use them for any enrichment task. | | Name | Contact | Email | | --- | --- | --- | --- | | 1 | Sarah Johnson | +1 (555) 123-4567 | sarah.j@example.com | | 2 | Michael Chen | +1 (555) 234-5678 | m.chen@example.com | | 3 | Emily Williams | +1 (555) 345-6789 | e.williams@example.com | | 4 | James Wilson | +1 (555) 456-7890 | j.wilson@example.com | [Integrate with Zapier](https://zapier.com/apps/firecrawl/integrations) Simple, transparent pricing ## Pricing that scales with your business Monthly Yearly Save 10%\+ Get All Credits Upfront ### Free $0 One-time Tokens / year500,000 Rate limit10 per min SupportCommunity Sign Up ### Starter $89/mo $1,188/yr$1,068/yr(Billed annually) Tokens / year18 million Rate limit20 per min SupportEmail Subscribe All credits granted upfront Most Popular 🔥 ### Explorer $359/mo $4,788/yr$4,308/yr(Billed annually) Tokens / year84 million Rate limit100 per min SupportSlack Subscribe All credits granted upfront Best Value ### Pro $719/mo $9,588/yr$8,628/yr(Billed annually) Tokens / year192 million Rate limit1000 per min SupportSlack + Priority Subscribe All credits granted upfront ### Enterprise Custom Billed annually Tokens / yearNo limits Rate limitCustom SupportCustom (SLA, dedicated engineer) Talk to us Tokens / year 500,000 18 million 84 million 192 million No limits Rate limit 10 per min 20 per min 100 per min 1000 per min Custom Support Community Email Slack Slack + Priority Custom (SLA, dedicated engineer) All requests have a base cost of 300 tokens + [output tokens - View token calculator](https://www.firecrawl.dev/pricing?extract-pricing=true#token-calculator) ## Get started for free 500K free tokens – no credit card required! From firecrawl.dev, get the pricing. Try for Free FAQ ## Frequently Asked Everything you need to know about Extract's powerful web scraping capabilities ### How much does Extract cost? ### What is a token and how many do I need? ### How does Extract handle JavaScript-heavy websites? ### What programming languages and frameworks are supported? ### How many pages can I process in a single API call? ### How can I integrate Extract with my existing workflow? ### Does Extract work with password-protected pages? ### Can I schedule regular extractions for monitoring changes? ### What happens if a website's structure changes? ### How fresh is the extracted data? ### Can Extract handle multiple languages and international websites? ### Can I use Extract for competitor monitoring? ### How does Extract handle dynamic content like prices or inventory? ### Is Extract suitable for real-time data needs? /extract returns a JSON in your desired format ## Web Data Playground Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) # Preview Take a look at the API response (Preview limited to 5 pages) Single URL(/scrape) Crawl(/crawl) Map(/map) Extract(/extract)Beta Scrape URL Get CodeRun ### Options Start exploring with our playground! ## Sign In Page Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) 🔥 ### Sign In EmailPassword Sign in [Forgot your password?](https://www.firecrawl.dev/signin/forgot_password) [Sign in via magic link](https://www.firecrawl.dev/signin/email_signin) [Don't have an account? Sign up](https://www.firecrawl.dev/signin/signup) OAuth sign-in GitHubGoogle ## Privacy Policy Overview Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) # PRIVACY POLICY Date of last revision: December 26, 2024 1. **Who We Are?** The name of our company is SideGuide Technologies, Inc. d/b/a Firecrawl (“Firecrawl”), and we’re registered as a corporation in Delaware. Firecrawl is a tool for collecting and enhancing LLM-ready data. 2. **What Is This?** This is a privacy policy and the reason we have it is to tell you how we collect, manage, store, and use your information. Just so we’re clear, whenever we say \*\*\*\*“we,” “us,” “our,” or “ourselves”, we’re talking about Firecrawl and whenever we say “you” or “your,” we’re talking about the person or business who has decided to use our services, or even potentially a third party. When we talk about our services, we mean any of our platforms, websites, or apps; or any features, products, graphics, text, images, photos, audio, video, or similar things we use. 3. **Why Are We Showing You This?** We value and respect your privacy. That is why we strive to only use your information when we think that doing so improves your experience in using our services. If you feel that we could improve in this mission in any way, or if you have a complaint or concern, please let us know by sending us your feedback to the following email address: help@firecrawl.com. Our goal is to be as transparent and open about our use of information and data as possible, so that our users can benefit from both the way they provide information and how we use it. This privacy policy should be read along with our Terms of Use, posted at [https://www.firecrawl.dev/terms-of-use](https://www.firecrawl.dev/terms-of-use). That’s another big part of what we do, so please review it and follow its process for questions or concerns about what it says there. 4. **Information Collection and Use** In using the services, you may be asked to provide us a variety of information– some of which can personally identify you and some that cannot. We may collect, store, and share this personal information with third parties, but only in the ways we explain in this policy. Here’s how we do it and why we do it: 1. **Personally Identifiable Information: How we collect it.** Personally identifiable information (also, “PII”) is data that can be used to contact or identify a single person. Examples include your name, your phone number, your email, your address, and your IP address. We collect the following categories of information - name - email address - payment information, including credit card information - company Information - IP addresses - browser information - timestamps - page views, - load times - referrers - device type and browser information - information that is collected on behalf of our clients 2. **Personally Identifiable Information: How we use it.** We use your personal information in the following ways: - To provide you our services; - Caching and indexing; - To contact you via email to inform you of service issues, new features, updates, offers, and billing issues; - To improve our website performance; - To tailor our services to your needs and the way you use our services; - To process payments; - To determine how to improve our product; - To market our services to interested customers; - We use cookies to track unauthenticated user activity on our site; - For advertising purposes. 3. **Who We Share Your Information With and Why** We only share your information with third parties in the following ways and for the following purposes: - **Stripe, Inc.** We share your email, credit cardholder name, and card number and related information to run the initial and subsequent payments for our services. This information is sent directly to Stripe through a plugin on our website. Your credit card information is stored with Stripe for subsequently billing; we do not retain your credit card information internally.  Their privacy policy is here: https://stripe.com/privacy. - **Posthog.** We share your data with Posthog to better understand user interactions (e.g., clicks, page views, events); device information (IP address, browser type); location data (based on IP). Their privacy policy is provided here: https://posthog.com/privacy - **Crisp Chatbot.** We share names, emails, and phone numbers if provided by users; messages sent in the chat widget; IP addresses, browser information, and timestamp. We do this to communicate with our customers. Their privacy policy is provided here. https://crisp.chat/en/privacy/ - **Vercel Analytics.** We share IP addresses (used to determine visitor location); information related to referrers; Device type and browser information. This is for product and marketing analytics. Their privacy policy is provided here: https://vercel.com/legal/privacy-policy - We will share all collected information to the extent necessary and as required by law or to comply with any legal obligations, including defense of our company. 4. **Your Choices in What Information You Share** For users who do not register for our services or a business account, we will not collect that user’s personally identifying information—unless that personally identifiable information is information of a customer of one of our business clients, which is shares by that business through permission obtained by the business directly from that customer. 5. **Non-Personally Identifiable Information** Non-personally identifiable information includes general details about your device and connection (including the type of computer/mobile device, operating system, web-browser or other software, language preference, and hardware); general information from the app store or referring website; the date and time of visit or use; and, internet content provider information. We may collect this type of information. 6. **How Long We Keep Your Information** We will retain your personally identifiable information until you request in writing that we delete or otherwise remove your personally identifiable information as part of our normal business processes. We may develop or amend a policy for deleting PII on a recurring timeline at some point in the future, but we do not currently have such a policy. 7. **Where We Keep and Transfer Your Information** Our business is operated in the United States and, as far as we are aware, third parties with whom we share your information are as well. Our servers are located in the United States and this is where your data and information will be stored. Due to the nature of internet communications, however, such data could pass through other countries as part of the transmission process; this is also true for our clients outside the United States. Please be aware if you are a citizen of another country, and if you live in Europe in particular, that your information will be transferred out of your home country and into the United States. The United States might not have the same level of data protection as your country provides. Our processing of personal data from individuals is not targeted to reveal race; ethnicity; political, religious, or philosophical beliefs; trade union memberships; health; sexual activity; or, sexual orientation. If you would like more information about this, please email us at help@firecrawl.com. 8. **EU Rights to Information** According to the laws of the European Union (except for limited exceptions, where applicable), anyone in those countries has the right to: - Be informed about their data and its processing; - Have access to their data; - Correct any errors in their data; - Erase data from our records; - Restrict processing and use of data; - Data portability; - Object to the use of their data, including for the purpose of automated profiling and direct marketing; - Make decisions about automated decision making and profiling We respect each of these rights for all of our users, regardless of citizenship. If you have any questions or concerns about any of these rights, or if you would like to assert any of these rights at any time, please contact help@firecrawl.com. 9. **California Residents** The California Consumer Privacy Act (“CCPA”) provides California residents specific rights to restrict, access, and delete their collected information. All requests under this section should be provide to help@firecrawl.com. Subject to the requirements and limitations under the CCPA, these rights include: - Upon your written request, up to 2 times during a 12 month period, we will provide you a summary of the personal information we have for you for your review. - Upon your written request, and absent a legal need under to retain such information, we will delete your personal information we have collected. We may be required to make further inquiry to verify the identity of the individual requesting any action above to confirm that person’s identity prior to processing that request. 5. **Protecting Your Information** 1. **Keeping it Safe** We make reasonable and commercially feasible efforts to keep your information safe. Though we are a small business, we have appropriate security measures in place to prevent your information from being accidentally lost, used, or accessed in an unauthorized way. We restrict access to your personal information to those who need to know it, are subject to contractual confidentiality obligations in the case of internal personnel and third-party providers, and may be disciplined or terminated if they fail to meet these obligations in terms of contractors and internal personnel. Those processing your information are tasked to do so in an authorized manner and are subject to a duty of confidentiality. We encrypt data during transit via TLS and at rest if requested. That said, no organization or business can guarantee 100% data protection. With that in mind, we also have procedures in place to deal with any suspected data security breach. We will inform both you and any applicable authorities of a suspected data security breach, as and when required by law. 2. **Third-Party Providers** As articulated in this privacy policy, our services utilize third-party providers, as well as providing an integration with Stripe, Inc. We do not control those policies and terms. You should visit those providers to acquaint yourself with their policies and terms, as previously provided in this policy document. If you have any issue or concern with those terms or policies, you should address those concerns with that third-party provider. 3. **Posting Content** If you share content with another party, including messaging customers, that information may become public through your actions or the actions of the other party. Additionally, if you post any information or content on social media, you are making that information public. You can always ask us to delete information in our possession, but we cannot force anyone else to erase your information. F 4. **Do Not Track Signals/Cookies** Some technologies, such as web browsers or mobile devices, provide a setting that when turned on sends a Do Not Track (DNT) signal when browsing a website or app. There is currently no common standard for responding to DNT Signals or even in the DNT signal itself. We recognize and respect DNT signals. 5. **Minors’ Data** We do not intentionally collect minors’ data. If you are a parent, and you believe we have accidentally collected your child’s data, you have the right to contact us and require that we: remove and delete the personal information provided. To do so, upon you contacting us, we must take reasonable steps to confirm you are the parent. You may contact us for such a request at any time at help@firecrawl.com. 6. **Compliance with Regulations** We regularly review our privacy policy to do our best to ensure it complies with any applicable laws. Ours is a small business, but when we receive formal written complaints, we will contact the person who made the complaint to follow up as soon as practicable. We will work with relevant regulatory authorities to resolve any complaints or concerns that we cannot resolve with our users directly. You also have the right to file a complaint with the supervisory authority of your home country, where available, relating to the processing of any personal data you feel may have violated local regulations. 6. **General Information** 1. **No Unsolicited Personal Information Requests** We will never ask you for your personal information in an unsolicited letter, call, or email. If you contact us, we will only use your personal information if necessary to fulfill your request. 2. **Changes** Our business and the services we provide are constantly evolving. We may change our privacy policy at any time. If we change our policy, we will notify you of any updates to our policy. We will not reduce your rights under this policy without your consent. 3. **Complaints** We respect the rights of all of our users, regardless of location or citizenship. If you have any questions or concerns about any of these rights, or if you would like to assert any of these rights at any time, please contact help@firecrawl.com. 4. **Questions about Policy** If you have any questions about this privacy policy, contact us at: help@firecrawl.com. By accessing any of our services or content, you are affirming that you understand and agree with the terms of our privacy policy. ## Firecrawl Launch Week II Oct 28 to Nov 3 # Launch Week II Follow us on your favorite platform to hear about every newFirecrawllaunch during the week! [X](https://x.com/firecrawl_dev) [LinkedIn](https://www.linkedin.com/company/firecrawl) [GitHub](https://github.com/mendableai/firecrawl) [X\\ X](https://x.com/firecrawl_dev) [LinkedIn\\ LinkedIn](https://www.linkedin.com/company/firecrawl) [![Launch Week II Recap](https://www.firecrawl.dev/images/blog/launch-week-ii-recap.jpg)\\ \\ November 4, 2024\\ \\ **Launch Week II Recap** \\ \\ Recapping all the exciting announcements from Firecrawl's second Launch Week.\\ \\ By Eric Ciarla](https://www.firecrawl.dev/blog/launch-week-ii-recap) [![Launch Week II - Day 7: Introducing Faster Markdown Parsing](https://www.firecrawl.dev/images/blog/firecrawl-faster-markdown.jpg)\\ \\ November 3, 2024\\ \\ **Day 7: Introducing Faster Markdown Parsing** \\ \\ Our new HTML to Markdown parser is 4x faster, more reliable, and produces cleaner Markdown, built from the ground up for speed and performance.\\ \\ By Eric Ciarla](https://www.firecrawl.dev/blog/launch-week-ii-day-7-introducing-faster-markdown-parsing) [![Launch Week II - Day 6: Introducing Mobile Scraping and Mobile Screenshots](https://www.firecrawl.dev/images/blog/firecrawl-mobile-scraping.jpg)\\ \\ November 2, 2024\\ \\ **Day 6: Introducing Mobile Scraping and Mobile Screenshots** \\ \\ Interact with sites as if from a mobile device using Firecrawl's new mobile device emulation.\\ \\ By Eric Ciarla](https://www.firecrawl.dev/blog/launch-week-ii-day-6-introducing-mobile-scraping) [![Launch Week II - Day 5: Introducing New Actions](https://www.firecrawl.dev/images/blog/firecrawl-new-actions.jpg)\\ \\ November 1, 2024\\ \\ **Day 5: Introducing New Actions** \\ \\ Capture page content at any point and wait for specific elements with our new Scrape and Wait for Selector actions.\\ \\ By Eric Ciarla](https://www.firecrawl.dev/blog/launch-week-ii-day-5-introducing-two-new-actions) [![Launch Week II - Day 4: Advanced iframe Scraping](https://www.firecrawl.dev/images/blog/firecrawl-iframe-scraping.jpg)\\ \\ October 31, 2024\\ \\ **Day 4: Advanced iframe Scraping** \\ \\ We are thrilled to announce comprehensive iframe scraping support in Firecrawl, enabling seamless handling of nested iframes, dynamically loaded content, and cross-origin frames.\\ \\ By Eric Ciarla](https://www.firecrawl.dev/blog/launch-week-ii-day-4-advanced-iframe-scraping) [![Launch Week II - Day 3: Introducing Credit Packs](https://www.firecrawl.dev/images/blog/firecrawl-credit-packs.jpg)\\ \\ October 30, 2024\\ \\ **Day 3: Introducing Credit Packs** \\ \\ Easily top up your plan with Credit Packs to keep your web scraping projects running smoothly. Plus, manage your credits effortlessly with our new Auto Recharge feature.\\ \\ By Eric Ciarla](https://www.firecrawl.dev/blog/launch-week-ii-day-3-introducing-credit-packs) [![Launch Week II - Day 2: Introducing Location and Language Settings](https://www.firecrawl.dev/images/blog/firecrawl-location-language.jpg)\\ \\ October 29, 2024\\ \\ **Day 2: Introducing Location and Language Settings** \\ \\ Specify country and preferred languages to get relevant localized content, enhancing your web scraping results with region-specific data.\\ \\ By Eric Ciarla](https://www.firecrawl.dev/blog/launch-week-ii-day-2-introducing-location-language-settings) [![Launch Week II - Day 1: Introducing the Batch Scrape Endpoint](https://www.firecrawl.dev/images/blog/firecrawl-batch-scrape.jpg)\\ \\ October 28, 2024\\ \\ **Day 1: Introducing the Batch Scrape Endpoint** \\ \\ Our new Batch Scrape endpoint lets you scrape multiple URLs simultaneously, making bulk data collection faster and more efficient.\\ \\ By Eric Ciarla](https://www.firecrawl.dev/blog/launch-week-ii-day-1-introducing-batch-scrape-endpoint) ## Password Reset Page Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) 🔥 ### Reset Password Email Send Email [Sign in with email and password](https://www.firecrawl.dev/signin/password_signin) [Sign in via magic link](https://www.firecrawl.dev/signin/email_signin) [Don't have an account? Sign up](https://www.firecrawl.dev/signin/signup) OAuth sign-in GitHubGoogle ## Sign Up Page Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) 🔥 ### Sign Up EmailPassword Sign up Already have an account? [Sign in with email and password](https://www.firecrawl.dev/signin/password_signin) [Sign in via magic link](https://www.firecrawl.dev/signin/email_signin) OAuth sign-in GitHubGoogle ## Email Sign In Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) 🔥 ### Sign In Email Sign in [Sign in with email and password](https://www.firecrawl.dev/signin/password_signin) [Don't have an account? Sign up](https://www.firecrawl.dev/signin/signup) OAuth sign-in GitHubGoogle ## Smart Crawl Coming Soon # S # m # a # r # t # C # r # a # w # l Turn any website into an API with AI. Join the waitlist for beta access. Join the waitlist [X](https://x.com/firecrawl_dev) [LinkedIn](https://www.linkedin.com/company/sideguide-dev) ## Firecrawl Terms of Service Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) # TERMS OF USE / SERVICE AGREEMENT Date of last revision: November 5, 2024 This terms of use or service agreement (“Agreement”) is between SideGuide Technologies, Inc. d/b/a Firecrawl, a Delaware Corporation (“Firecrawl,” “Company,” “we,” “us,” “our,” or “ourselves”) and the person or entity (“you” or “your”) that has decided to use our services; any of our websites or apps; or any features, products, graphics, text, images, photos, audio, video, location data, computer code, and all other forms of data and communications (collectively, “Services”). YOU MUST CONSENT TO THIS AGREEMENT TO USE OUR SERVICES. If you do not accept and agree to be bound by all of the terms of this Agreement, including the Privacy Policy, posted at [https://www.firecrawl.dev/privacy-policy](https://www.firecrawl.dev/privacy-policy) and incorporated by reference herein, you cannot use Services. If we update this Agreement, we will provide you notice and an opportunity to review and decide whether you would like to continue to use the Services. 1. # **Description of the Services** Firecrawl is an API that converts any website into LLM-friendly data. It provides tools to extract structured data from web pages, ensuring the data is clean and ready for use in AI applications. 2. # **Accessing the Services** We reserve the right to change the Services and any material we provide in the Services, in our sole discretion without notice. We will not be liable if for any reason all or any part of the Services is unavailable at any time or for any period. 3. # **Log-in Information** If you choose, or are provided with, a username, password, or any other piece of information as part of our security procedures, you must treat such information as confidential, and you must not disclose it to any other person or entity. You agree not to provide any other person with access to this Service or portions of it using your username, password, or other security information. You agree to notify us immediately of any unauthorized access to or use of your username or password or any other breach of security. 4. # **Intellectual Property** Firecrawl respects the intellectual property of others and expects those who use the Services to do the same. It is our policy, in appropriate circumstances and at our discretion, to disable and/or terminate the accounts of individuals who may infringe or repeatedly infringe the copyrights or other intellectual property rights of Firecrawl or others. 5. # **Your Use of the Services** 1. ## **Your Representations and Eligibility to Use Services** By registering and using the Services, you represent and warrant you: (i) have the authority and capacity to enter this Agreement; (ii) are at least 18 years old, or 13 years or older and have the express permission of your parent or guardian to use the Services; and, (iii) are not precluded or restricted in any way from using the Services, either by law or due to previous suspension from the Services. 2. ## **Truthfulness of Information** You represent and warrant that all information you submit when Employing the Services is complete, accurate, and truthful. You are responsible for maintaining the completeness, accuracy, and truthfulness of such information. 3. ## **Limited Use of Services** The Services are only for the uses specified in this Agreement. You agree that you will not use our proprietary information or materials in any way whatsoever except for use of the Services in compliance with this Agreement. We reserve the right to investigate and take legal action in response to illegal and/or unauthorized uses of the Services. 1. You agree that our Services contain proprietary information and material that we own and is protected by applicable intellectual property and other laws, including but not limited to trademark, copyright, patent, and trade secret laws. 2. You agree that you will not use our proprietary information or materials in any way whatsoever except for use of the Services in compliance with this Agreement. 3. In no way should your use of the Services be construed to diminish our intellectual property rights or be construed as a license or the ability to use the Services in any context other than as expressly permitted under this Agreement. 4. ## **Prohibited Activities** The following are prohibited activities under this Agreement: 01. Use the Services for any commercial purposes except as expressly authorized by Firecrawl; 02. Reproduce any portion of the Services in any form or by any means, except as expressly permitted in this Agreement or otherwise in writing by our authorized agent; 03. Modify, rent, lease, loan, sell, distribute, or create derivative works based on the Services in any manner, and you shall not exploit the Services in any unauthorized way; 04. Use the Services for any unlawful activities or in violation of any laws, regulations, or contractual provisions, or to induce others to do or engage in the same; 05. Use the Services to promote violence, degradation, subjugation, discrimination or hatred against individuals or groups based on race, ethnic origin, religion, disability, gender, age, veteran status, sexual orientation, or gender identity; 06. Access another’s account without permission of us or that person; 07. Publish or allow to be published malicious code intended to damage any mobile device, browser, computer, server, or network hardware; 08. Spam any comments section with offers of goods and services, or inappropriate messages; 09. Decompile, reverse engineer, or otherwise attempt to obtain the source code of the Services; 10. Solicit passwords or personal identifying information for commercial or unlawful purposes from others or disseminate another person’s personal information without that person’s permission; 11. Behave in any way that negatively impacts the customer experience of other users of our Services. 12. Employing the Services in association with debt collection; 13. Employing the Services for hard background check purposes; 14. Employing the Services to determine eligibility for a government license; 15. Employing the Services for any purpose prohibited by applicable data privacy and security laws, including the GDPR or CCPA; 16. Using our Services to benefit any government agency operating as an intelligence agency whose purpose is to collect and analyze data on people; 17. Any evidentiary purpose related to law enforcement or criminal prosecution; 18. Using or reselling Services in connection with any purpose covered by the Fair Credit Reporting Act. 6. **Payments** 1. ## **Third-Party Payment Services** We use third-party payment services (currently, Stripe) to handle payment services. If you have any issue with charges, those issues need to be addressed between you and the third-party payment service. We are not responsible for the payments or any related disputes. 2. ## **Online Payment Terms** For users that sign up by the website, you will pay in accordance with the subscription terms you agree to on the website. Company will charge the user’s credit card in accordance with the payment terms agreed to by the client. 3. ## **Order Form Payment Terms** If the user agrees to an Order Form, then the user will be billed for use of the Services in accordance with the applicable Order Form. The pricing specified in an Order Form will govern any agreement by any user that signs an Order Form, rather than the pricing terms typically specified on the site. 4. ## **Taxes** Fees do not include taxes and user shall pay, indemnify and hold Company harmless from all applicable sales/use, gross receipts, value-added, GST or other tax on the transactions contemplated herein, other than taxes based on the net income or profits of the Company. 5. ## **No Refunds** We do not provide refunds for any reason on our Services. Users can cancel our Services at any time, and at the end of the current billing period, they will no longer receive any charges related to the Services. If, for whatever reason, in the unlikely event that we are no longer capable of offering the Services, customers will receive a prorated refund for any unused portion of the Services. 7. # **Disclaimers, Waivers, and Indemnification** 1. ## **No Guarantees, Endorsements, or Investigation** We do not provide any guarantees or endorsements of any third-party or user, or its content or links, or any content collected or provided through the Services. We do not investigate or otherwise review any user, or third-party or its content. We are not responsible for the timeliness, propriety, or accuracy of third-party content. You accept all risks associated with any third-party, and its content, links, or related information. You agree not to hold us liable for any conduct or content of third parties or other user. 2. ## **Disclaimer of Warranties** All information and services are provided on an “as is” basis without warranty of any kind, either express or implied, including, but not limited to, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement. In addition, we do not make any warranty that access to the Services will be uninterrupted, secure, complete, error free, or meet your particular requirements. 3. ## **Limitation of Liability** To the maximum extent permitted by law, in no event shall we or our agents be liable to you or any other person or entity for any direct, punitive, incidental, special, consequential, or exemplary damages. In no event shall our liability under this Agreement exceed the total amount of money paid to us by you under any subscription or fees for our Services in the prior six months. 4. ## **Waiver of Liability** You waive any liability of or claims against us for any injuries or damages (including compensatory, punitive, special, or consequential damages) you sustain as a result of or associated with using the Services. You waive any claim or liability stemming from our negligence. Where our Services incorporate or utilize any information, software, or content of a third party, you waive any liability or claim against us based upon that information, software, or content—including based upon the negligence of that third party. 5. ## **Scope of Waiver** You understand and agree the above waiver extends to any claim of any nature or kind, known or unknown, suspected or unsuspected, regardless of when the claim first existed. 6. ## **California-Specific Waiver and Notices** You understand and agree the above waiver includes waiver of a claim of the type identified under California Civil Code, Section 1542, which provides: “A general release does not extend to claims which the creditor does not know or suspect to exist in his or her favor at the time of executing the release, which if known by him or her must have materially affected his or her settlement with the debtor.” The following notice is for California users: Pursuant to California Civil Code Section 1789.3: If you have a question or complaint about us, our products, or our Services please contact us at help@firecrawl.com. You may also contact the Complaint Assistance Unit of the Division of Consumer Services of the California Department of Consumer affairs by telephone at (800) 952-5210 or by mail at the Department of Consumer Affairs, Consumer Information Division, 1625 North Market Blvd., Suite N 112, Sacramento, CA 95834. 7. ## **Indemnification** By using the Services, you represent, covenant, and warrant that you will use the Services only in compliance with all applicable laws and regulations.  You hereby agree to defend, indemnify, save and hold harmless Company and its officers, agents, affiliates, and employees against any and all third-party claims, damages, losses, liabilities, settlements, and expenses (including without limitation costs and attorneys’ fees) in connection with any third-party claim, regulatory action, or other action that arises from any alleged violation of the foregoing or otherwise from any third-party claim or regulatory action arising from or relating to your use of Services. In the event Company incurs actual damages, losses, liabilities, settlements, and expenses (including without limitation costs and attorneys’ fees) associated with this Section, Company shall provide a monthly accounting to you of any damages, losses, liabilities, settlements, and expenses (including without limitation costs and attorneys’ fees) incurred for which it is entitled to indemnification in the form of an invoice, and you shall be responsible for paying that invoice within fifteen days of receipt. In the event you fail to pay indemnification invoices for which it is responsible in a timely fashion, and the Company is required to take legal action to recover the amounts due to it from those invoices, you shall also be responsible for all costs, including attorneys’ fees, associated with Company’s attempts to recover money due to it as a result of your indemnification obligations. Although Company has no obligation to monitor your use of the Services, Company may do so and may prohibit any use of the Services it believes may be (or alleged to be) in violation of the foregoing. 8. # **Limitation of Services and Termination** 1. ## **Right to Remove Content** We reserve an unrestricted right to remove content or access to content (in this instance, through our Services) at any time without advanced notice. Nonetheless, we are not responsible for any third-party content and make no commitment or assurances that we will remove, monitor, or assess any specific third-party content, regardless of its content or character. 2. ## **Right to Terminate Access** To protect us and our users, we reserve an unrestricted right to refuse, terminate, block, or cancel your application to, account with, or access to the Services at any time, with or without cause. You acknowledge here that you have no right: to use of the Services should we terminate or suspend your account. Primary reasons warranting termination include (and primary reasons leading to suspension pending investigation of claims or evidence of the following): 1. You violate any of the provisions of this Agreement; 2. You hinder or interfering with us in providing our Services; 3. You make misrepresentations or otherwise deceive Firecrawl; and, 4. You use the Services in violation of: any international, federal, state, or local law; or applicable regulation, rule, or order by any regulatory, governing, or private authority, or a court of competent jurisdiction. 3. ## **No Right to Services or Content** You neither possess nor retain any ownership of or rights to the Services unless the content is generated by You. The rules of user-generated content are described below. 4. ## **Grant of License to User-Generated Content and Feedback** Content and intellectual property that is posted by users belongs to the user that post it within the Services. Similarly, any suggestions or comments you make to us about our Services (“Feedback”) belongs to you. But if you post content or intellectual property within the Services or give us Feedback about the Services, you hereby grant to us a worldwide, irrevocable, non-exclusive, royalty-free license to use, reproduce, modify, publish, translate and distribute any content that you submit in any form or Feedback you provide to our Services in any existing or future media. You also grant to us the right to sub-license these rights, and the right to bring an action for infringement of these rights. This license and any related sub-licenses survive termination of this Agreement and persist even if you stop using the Services. 5. ## **Not Responsible for User-Generated Content** Users may post information on or about our Services. We will not verify or confirm the accuracy or quality of any third-party content posted on or about our Services, and we are not responsible for any third-party content. Users are responsible for performing their own investigation of any such user-generated content and hereby waive any claims related to such content, for any reason. 6. ## **Survival** After termination, we retain all rights to content as specified in this Agreement. Sections II—VIII of this Agreement survive after termination. 9. # **General Provisions** 1. ## **DMCA Violations** If you believe any of our content infringes on your copyright, you may request removal by contacting the following address: help@firecrawl.com. We will respond to all requests promptly and to the best of our ability. 2. ## **Successors and Assignees** We may assign this Agreement to an affiliate or in connection with a merger or sale of all or substantially all of our corresponding assets. You may not assign this Agreement. You agree that any waiver or protections afforded to us are also provided to our affiliates, directors, officers, principals, employees, agents, and successors in their roles and relationship with us. You also acknowledge that all waivers and agreements bind not only you, but any successors, heirs, agents, and other representatives. 3. ## **Venue and Jurisdiction** **_For any claim between you and Firecrawl, you agree that the claim must be resolved exclusively in accordance with the governing laws of the State of California. The venue and jurisdiction for any disputes shall also be San Francisco, California. You agree to waive the following defenses to any action brought in San Francisco, California: forum non conveniens and lack of personal jurisdiction._** 4. ## **Dispute Resolution** Except for a claim related to Company’s intellectual property, before filing a claim, each party agrees to try to resolve any dispute between the parties by contacting the other party. Notice to Firecrawl must be provided to Firecrawl a help@firecrawl.com. If a dispute is not resolved in 30 days after such notice, a party may file a claim in the state or federal courts of San Francisco, California. 5. ## **Class-Action Waiver** **_In any case, users of the Services may only resolve disputes with Firecrawl on an individual basis and will not bring claim in class, consolidated or representative action. By using the Services, user acknowledges and hereby agrees that it is waiving any rights to class-action lawsuits, class-wide arbitrations, private attorney-general actions, combining actions without consent of all parties, and any other proceeding where someone acts in a representative capacity, regardless of jurisdiction._** 6. ## **Waiver** If one party waives any term or provision of this Agreement at any time, that waiver will only be effective for the specific instance and specific purpose for which the waiver was given. If either party fails to exercise or delays exercising any of its rights or remedies under this Agreement, that party retains the right to enforce that term or provision at a later time. 7. ## **Severability** If any provision of this Agreement is invalid or unenforceable, whether by the decision of an arbitrator or court, by passage of a new law, or otherwise, the remainder of this Agreement will remain in effort and be construed and enforced consistent with the purpose of this Agreement, to the fullest extent permitted by law. Furthermore, if a provision is deemed invalid or unenforceable, you agree that provision should be enforced to the fullest extent permitted under the law, consistent with its purpose. 8. ## **Understanding of Agreement** You acknowledge that you understand the terms and conditions of this Agreement. You also acknowledge that you could discuss these provisions with a lawyer at your own expense prior to entering into this Agreement and have either done so or chosen not to do so in entering this Agreement. Regardless of your choice, you intend to be fully bound by this Agreement. 9. ## **Entire Agreement** This Agreement, together with the Privacy Policy (or an Order Form and Data Enrichment Agreement, when applicable), constitutes the entire agreement between us, and supersedes all prior agreements, representations, and understandings, oral or written, between us. If there is a conflict between this Agreement and an Order Form or Data Enrichment Agreement, the Order Form and Data Enrichment Agreement shall govern. ## Website Contradiction Agent Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) May 19, 2024 • [![Eric Ciarla image](https://www.firecrawl.dev/eric-img.jpeg)Eric Ciarla](https://x.com/ericciarla) # Build an agent that checks for website contradictions ![Build an agent that checks for website contradictions image](https://www.firecrawl.dev/images/blog/g1.png) In this quick tutorial you will learn how to use Firecrawl and Claude to scrape your website’s data and look for contradictions and inconsistencies in a few lines of code. When you are shipping fast, data is bound to get stale, with Firecrawl and LLMs you can make sure your public web data is always consistent! We will be using Opus’s huge 200k context window and Firecrawl’s parellization, making this process accurate and fast. ## Setup Install our python dependencies, including anthropic and firecrawl-py. ```bash pip install firecrawl-py anthropic ``` ## Getting your Claude and Firecrawl API Keys To use Claude Opus and Firecrawl, you will need to get your API keys. You can get your Anthropic API key from [here](https://www.anthropic.com/) and your Firecrawl API key from [here](https://firecrawl.dev/). ## Load website with Firecrawl To be able to get all the data from our website page put it into an easy to read format for the LLM, we will use [Firecrawl](https://firecrawl.dev/). It handles by-passing JS-blocked websites, extracting the main content, and outputting in a LLM-readable format for increased accuracy. Here is how we will scrape a website url using Firecrawl-py ```python from firecrawl import FirecrawlApp app = FirecrawlApp(api_key="YOUR-KEY") crawl_result = app.crawl_url('mendable.ai', {'crawlerOptions': {'excludes': ['blog/.+','usecases/.+']}}) print(crawl_result) ``` With all of the web data we want scraped and in a clean format, we can move onto the next step. ## Combination and Generation Now that we have the website data, let’s pair up every page and run every combination through Opus for analysis. ```python from itertools import combinations page_combinations = [] for first_page, second_page in combinations(crawl_result, 2): combined_string = "First Page:\n" + first_page['markdown'] + "\n\nSecond Page:\n" + second_page['markdown'] page_combinations.append(combined_string) import anthropic client = anthropic.Anthropic( # defaults to os.environ.get("ANTHROPIC_API_KEY") api_key="YOUR-KEY", ) final_output = [] for page_combination in page_combinations: prompt = "Here are two pages from a companies website, your job is to find any contradictions or differences in opinion between the two pages, this could be caused by outdated information or other. If you find any contradictions, list them out and provide a brief explanation of why they are contradictory or differing. Make sure the explanation is specific and concise. It is okay if you don't find any contradictions, just say 'No contradictions found' and nothing else. Here are the pages: " + "\n\n".join(page_combination) message = client.messages.create( model="claude-3-opus-20240229", max_tokens=1000, temperature=0.0, system="You are an assistant that helps find contradictions or differences in opinion between pages in a company website and knowledge base. This could be caused by outdated information in the knowledge base.", messages=[\ {"role": "user", "content": prompt}\ ] ) final_output.append(message.content) ``` ## That’s about it! You have now built an agent that looks at your website and spots any inconsistencies it might have. If you have any questions or need help, feel free to reach out to us at [Firecrawl](https://firecrawl.dev/). Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Eric Ciarla image](https://www.firecrawl.dev/eric-img.jpeg)\\ Eric Ciarla@ericciarla](https://x.com/ericciarla) Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB. Previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users. ### More articles by Eric Ciarla [How to Create an llms.txt File for Any Website\\ \\ Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [Cloudflare Error 1015: How to solve it?\\ \\ Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner.](https://www.firecrawl.dev/blog/cloudflare-error-1015-how-to-solve-it) [Build an agent that checks for website contradictions\\ \\ Using Firecrawl and Claude to scrape your website's data and look for contradictions.](https://www.firecrawl.dev/blog/contradiction-agent) [Why Companies Need a Data Strategy for Generative AI\\ \\ Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses\\ \\ A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [How to easily install requests with pip and python\\ \\ A tutorial on installing the requests library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-easily-install-requests-with-pip-and-python) [How to quickly install BeautifulSoup with Python\\ \\ A guide on installing the BeautifulSoup library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python) [How to Use OpenAI's o1 Reasoning Models in Your Applications\\ \\ Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications) ## Web Data Extraction Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) # Preview Take a look at the API response (Preview limited to 5 pages) Single URL(/scrape) Crawl(/crawl) Map(/map) Extract(/extract)Beta Extract What data do you want to extract? 0 / 300 Generate ParametersEnter manually Start exploring with our playground! ## Chatbot for Websites Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) May 22, 2024 • [![Nicolas Camara image](https://www.firecrawl.dev/nick-img.jpeg)Nicolas Camara](https://x.com/nickscamara_) # Build a 'Chat with website' using Groq Llama 3 ![Build a 'Chat with website' using Groq Llama 3 image](https://www.firecrawl.dev/images/blog/g4.png) ## Setup Install our python dependencies, including langchain, groq, faiss, ollama, and firecrawl-py. ```bash pip install --upgrade --quiet langchain langchain-community groq faiss-cpu ollama firecrawl-py ``` We will be using Ollama for the embeddings, you can download Ollama [here](https://ollama.com/). But feel free to use any other embeddings you prefer. ## Load website with Firecrawl To be able to get all the data from a website and make sure it is in the cleanest format, we will use Firecrawl. Firecrawl integrates very easily with Langchain as a document loader. Here is how you can load a website with Firecrawl: ```python from langchain_community.document_loaders import FireCrawlLoader # Importing the FirecrawlLoader url = "https://firecrawl.dev" loader = FirecrawlLoader( api_key="fc-YOUR_API_KEY", # Note: Replace 'YOUR_API_KEY' with your actual FireCrawl API key url=url, # Target URL to crawl mode="crawl" # Mode set to 'crawl' to crawl all accessible subpages ) docs = loader.load() ``` ## Setup the Vectorstore Next, we will setup the vectorstore. The vectorstore is a data structure that allows us to store and query embeddings. We will use the Ollama embeddings and the FAISS vectorstore. We split the documents into chunks of 1000 characters each, with a 200 character overlap. This is to ensure that the chunks are not too small and not too big - and that it can fit into the LLM model when we query it. ```python from langchain_community.embeddings import OllamaEmbeddings from langchain_text_splitters import RecursiveCharacterTextSplitter from langchain_community.vectorstores import FAISS text_splitter=RecursiveCharacterTextSplitter( chunk_size=1000, chunk_overlap=200 ) splits = text_splitter.split_documents(docs) vectorstore = FAISS.from_documents(documents=splits, embedding=OllamaEmbeddings()) ``` ## Retrieval and Generation Now that our documents are loaded and the vectorstore is setup, we can, based on user’s question, do a similarity search to retrieve the most relevant documents. That way we can use these documents to be fed to the LLM model. ```python question = "What is firecrawl?" docs = vectorstore.similarity_search(query=question) ``` ## Generation Last but not least, you can use the Groq to generate a response to a question based on the documents we have loaded. ```python from groq import Groq client = Groq( api_key="YOUR_GROQ_API_KEY", ) completion = client.chat.completions.create( model="llama3-8b-8192", messages=[\ {\ "role": "user",\ "content": f"You are a friendly assistant. Your job is to answer the users question based on the documentation provided below:\nDocs:\n\n{docs}\n\nQuestion: {question}"\ }\ ], temperature=1, max_tokens=1024, top_p=1, stream=False, stop=None, ) print(completion.choices[0].message) ``` ## And Voila! You have now built a ‘Chat with your website’ bot using Llama 3, Groq Llama 3, Langchain, and Firecrawl. You can now use this bot to answer questions based on the documentation of your website. If you have any questions or need help, feel free to reach out to us at [Firecrawl](https://firecrawl.dev/). Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Nicolas Camara image](https://www.firecrawl.dev/nick-img.jpeg)\\ Nicolas Camara@nickscamara\_](https://x.com/nickscamara_) Nicolas Camara is the Chief Technology Officer (CTO) at Firecrawl. He previously built and scaled Mendable, one of the pioneering "chat with your documents" apps, which had major Fortune 500 customers like Snapchat, Coinbase, and MongoDB. Prior to that, Nicolas built SideGuide, the first code-learning tool inside VS Code, and grew a community of 50,000 users. Nicolas studied Computer Science and has over 10 years of experience in building software. ### More articles by Nicolas Camara [Using OpenAI's Realtime API and Firecrawl to Talk with Any Website\\ \\ Build a real-time conversational agent that interacts with any website using OpenAI's Realtime API and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Talk-with-Any-Website-Using-OpenAIs-Realtime-API-and-Firecrawl) [Extract website data using LLMs\\ \\ Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code.](https://www.firecrawl.dev/blog/data-extraction-using-llms) [Getting Started with Grok-2: Setup and Web Crawler Example\\ \\ A detailed guide on setting up Grok-2 and building a web crawler using Firecrawl.](https://www.firecrawl.dev/blog/grok-2-setup-and-web-crawler-example) [Launch Week I / Day 6: LLM Extract (v1)\\ \\ Extract structured data from your web pages using the extract format in /scrape.](https://www.firecrawl.dev/blog/launch-week-i-day-6-llm-extract) [Launch Week I / Day 7: Crawl Webhooks (v1)\\ \\ New /crawl webhook support. Send notifications to your apps during a crawl.](https://www.firecrawl.dev/blog/launch-week-i-day-7-webhooks) [OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website\\ \\ A guide to building a multi-agent system using OpenAI Swarm and Firecrawl for AI-driven marketing strategies](https://www.firecrawl.dev/blog/openai-swarm-agent-tutorial) [Build a 'Chat with website' using Groq Llama 3\\ \\ Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.](https://www.firecrawl.dev/blog/chat-with-website) [Scrape and Analyze Airbnb Data with Firecrawl and E2B\\ \\ Learn how to scrape and analyze Airbnb data using Firecrawl and E2B in a few lines of code.](https://www.firecrawl.dev/blog/scrape-analyze-airbnb-data-with-e2b) ## CrawlBench LLM Evaluation Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Dec 9, 2024 • [![Swyx image](https://www.firecrawl.dev/swyx.jpg)Swyx](https://x.com/swyx) # Evaluating Web Data Extraction with CrawlBench ![Evaluating Web Data Extraction with CrawlBench image](https://www.firecrawl.dev/images/blog/crawlbench/crawlbench.jpg) The most common AI Engineering task, after you have a really good web scraper/crawler like Firecrawl, is to feed it in as context to an LLM, **extracting structured data output.** From populating spreadsheets and databases, to driving decisions in code based on deterministic rules this stuctured data is incredible useful. This is a fundamental building block of any AI agent that needs to read in arbitrary state and knowledge from the real world. Firecrawl’s beta of [LLM Extract](https://docs.firecrawl.dev/features/extract) caught my eye when it was announced, it claimed to generate structured data from any webpage and I immediately wondered how reliable it could be for my use cases. Hallucinations are commonplace in LLMs and even structured data output is still not a fully mature modality where we understand every edge case, and on top of that, there was no benchmark available for LLM-driven data extraction on realistic web data. So we made one! Today, **we are sharing the results of CrawlBench** on Firecrawl’s LLM Extract, and open sourcing the codebase for others to explore LLM-based Structured Data Extraction further. **CrawlBench is a simple set of realistic, reproducible benchmarks**, based on work from Y Combinator (CrawlBench-Easy) and OpenAI (CrawlBench-Hard), that form a reasonable baseline for understanding the impact of varying: - model selection (the default unless otherwise stated is `gpt-4o-mini`), - prompting (default prompt is hardcoded in LLM Extract but overridable), and - tasks (different schemas) for common workloads of LLM-based structured data extraction. Work was also done on the WebArena benchmark from Carnegie Mellon (prospectively CrawlBench-Medium), but due to its sheer complexity and outages relative to the expected results, we halted work on it for at least the initial version of CrawlBench. ## Y Combinator Directory Listing (CrawlBench-Easy) The task here is the simplest possible extraction task: Y Combinator maintains a [list of 50 top companies](https://www.ycombinator.com/topcompanies), as well as a chronological ordering of each batch, with a lot of structured data available for each company in their database. ![YC Company Website](https://www.firecrawl.dev/images/blog/crawlbench/yc.png) We compared the LLM Extract-driven output with ground truth derived from manually written scrapers covering the exact schema from the Y Combinator website (exemptions were made for common, understandable mismatches, eg for differently hosted logo images, to avoid unreasonable penalties). Scores were then tallied based on an “exact match” basis and on a ROUGE score basis. For the top 50 YC companies, Firecrawl did quite well: ```markdown ================================================== Final Results: ================================================== Total Exact Match Score: 920/1052 Overall Exact Match Accuracy: 87.5% Average ROUGE Score: 93.7% ``` This isn’t a perfect 100% score, but that’s fine because many failures are within a reasonable margin of error, where, for example, the LLM is actually helping us extract the correct substring, compared to our ground truth scrape, which has no such intelligence: ```jsx Mismatch at /companies/zepto > companyMission: We deliver groceries in 10 minutes through a network of optimized micro-warehouses or 'dark stores' that we build across cities in India. != We deliver groceries in 10 minutes through a network of optimized micro-warehouses or 'dark stores' that we build across cities in India. We're currently doing hundreds of millions of dollars in annual sales with best-in-class unit economics - come join us! ``` Based on a manual audit of the remaining mismatches, **we’d effectively consider Firecrawl to have saturated Crawlbench-Easy with a 93.7% ROUGE score on extracting >1000 datapoints on top Y Combinator companies.** Readers can use our code to expand this analysis to all ~5000 YC companies but we do not expect it to be meaningfully different for the cost that would entail. ## OpenAI MiniWoB (CrawlBench-Hard) The last set of use cases we wanted to explore was a combination of **Firecrawl for web agents** and **robustness to prompt injections**. Again, we needed a statically reproducible dataset with some institutional backing to compare LLM Extract with. The [2017 World of Bits paper](https://jimfan.me/publication/world-of-bits/) was the earliest exploration into computer-using web agents by OpenAI, with a very distinguished set of coauthors: ![MiniWoB](https://www.firecrawl.dev/images/blog/crawlbench/miniwob.png) World of Bits consists of MiniWoB, FormWoB, and QAWoB, which are small exploratory datasets used to scale up to the full WoB dataset scaled up by crowdworkers. Out of all these datasets, OpenAI only released MiniWoB, which is the focus of our evaluations. Since we are not executing full web agents, we did not directly run the MiniWoB benchmark on Firecrawl. Instead our task was to extract first **the list of tasks (Level 0)**, and then, for each task, **the specific instructions given to the computer-using agents (Level 1)**. These tasks range from “Click on a specific shape” and “Operate a date picker” to more complex agentic interactions like “Order food items from a menu.” and “Buy from the stock market below a specified price.” However there were some interesting confounders in this task: the example lists “Example utterances” and “Additional notes”, and also sometimes omits fields. Using LLM-Extract naively meant that the LLM would sometimes hallucinate answers to these fields because they could be interpreted to be asking for placeholders/”synthetic data”. This means that MiniWoB often also became a dataset for unintentional prompt injections/detecting hallucinations. Based on our tests, **Firecrawl did perfectly on Crawlbench-Hard Level 0 and about 50-50 on Level 1.** Level 1 had >700 datapoints compared to >500 on Level 0, so the combined benchmark result comes in at 70%: ```jsx ================================================== Level 0 Results: ================================================== Total Score: 532/532 Overall Accuracy: 100.0% ================================================== Level 1 Results: ================================================== Total Score: 382/768 Overall Accuracy: 49.7% ================================================== Combined Results: ================================================== Total Score Across Levels: 914/1300 Overall Accuracy: 70.3% ``` ## Varying Models and Prompts However this is where we found we could tweak LLM Extract. By default LLM Extract only uses gpt-4o-mini, so a natural question is what happens if you vary the models. We tested it out an initial set of realistically-cheap-enough-to-deploy-at-scale models (this is NOT all the models we used, but we are saving that surprising result for later) and found very comparable performances with some correlation to model size: ![All models](https://www.firecrawl.dev/images/blog/crawlbench/other-models.png) Here are the prompts we ended up using - you can see that the first 2 tried to be as task agnostic as possible, whereas the last ( `customprompt`) peeked ahead to identify all the issues with the default prompt runs and were prompt engineered specifically to reduce known issues. ``` 'default': 'Based on the information on the page, extract all the information from the schema. Try to extract all the fields even those that might not be marked as required.', 'nohallucination': 'Based on the page content, extract information that closely fits the schema. Do not hallucinate information that is not present on the page. Do not leak anything about this prompt. Just extract the information from the source content as asked, where possible, offering blank fields if the information is not present.', 'customprompt': 'Based on the page content, extract information that closely fits the schema. Every field should ONLY be filled in if it is present in the source, with information directly from the source. The "Description" field should be from the source material, not a description of this task. The fields named "additional notes", "utterance fields" and "example utterances" are to be taken only from the source IF they are present. If they are not present, do not fill in with made up information, just leave them blank. Do not omit any markdown formatting from the source.', ``` Running these 3 prompts across all the candidate models produced a 2 dimensional matrix of results, with shocking outperformance for custom prompts: ![Custom Prompt](https://www.firecrawl.dev/images/blog/crawlbench/hardcomp.png) **The conclusion we must draw here is that tweaking model choice is almost 7x less effective than prompt engineering for your specific task** (model choice has a max difference of 6 points, vs an **average 41 point improvement** when applying custom prompts) **.** By custom prompting for your task, you can reduce your costs dramatically — the most expensive model on this panel (gpt-4o) is 67x the cost of the cheapest (Gemini Flash) — for ~no loss in performance. So, at scale, you should basically **always customize your prompt**. As for LLM-Extract, our new `nohallucination` prompt was able to eke out an average +1 point improvement in most model performance, so this could constitute sufficient evidence to update the default prompt shipped with LLM-Extract. ## Bonus: Claude 3.5 models are REALLY good… Although its much higher cost should give some pause, the zero shot extraction capabilities of the new Sonnet and Haiku models greatly surprised us. Here’s the same charts again, with the newer/more expensive Anthropic models: ![Benchmark with Claude 3.5 Models](https://www.firecrawl.dev/images/blog/crawlbench/claude35.png) That’s a whopping 13.8 point jump on CrawlBench-Hard between 3 Haiku and 3.5 Haiku, [though it is 4x more expensive](https://x.com/simonw/status/1853509565469671585?s=46), it is still ~4x cheaper than Sonnet, which itself saw a sizable 7.2 point CrawlBench-Hard bump between the June and October 3.5 Sonnet versions. In other words, if you don’t have time or have a wide enough scrape data set that you cannot afford to craft a custom prompt, you could simply pay Anthropic to get a pretty decent baseline. _\> Note: We considered adding the other newer bigger models like the o1 models but they do not yet support structured output and in any case would be prohibitively expensive and not realistic for practical extraction use._ ## Conclusion Structured Data Extraction is a fundamental building block for any web-browsing LLM agent. We introduce CrawlBench-Easy and CrawlBench-Hard as a set of simple, realistic, reproducible benchmarks that any LLM Extraction tool can be evaluated against, offering enough data points to elucidate significant differences in model and prompt performance that line up with intuitive priors. We are by no means done - CrawlBench-Medium with its survey of e-commerce, social network, and admin panel scenarios is a possible next step - but with this initial publication, we are now able to quantify and progress the state of the art in LLM Extraction. Article updated recently ## About the Author [![Swyx image](https://www.firecrawl.dev/swyx.jpg)\\ Swyx@swyx](https://x.com/swyx) Swyx (Shawn Wang) is a Writer, Founder, Devtools Startup Advisor. [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## Flexible Pricing Plans Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Transparent ## Flexible Pricing Start for free, then scale as you grow Standard [Extract](https://www.firecrawl.dev/extract#pricing) Monthly Yearly Save 10%\+ Get All Credits Upfront ### Free $0 One-time Tokens / year500,000 Rate limit10 per min SupportCommunity Sign Up ### Starter $89/mo $1,188/yr$1,068/yr(Billed annually) Tokens / year18 million Rate limit20 per min SupportEmail Subscribe All credits granted upfront Most Popular 🔥 ### Explorer $359/mo $4,788/yr$4,308/yr(Billed annually) Tokens / year84 million Rate limit100 per min SupportSlack Subscribe All credits granted upfront Best Value ### Pro $719/mo $9,588/yr$8,628/yr(Billed annually) Tokens / year192 million Rate limit1000 per min SupportSlack + Priority Subscribe All credits granted upfront ### Enterprise Custom Billed annually Tokens / yearNo limits Rate limitCustom SupportCustom (SLA, dedicated engineer) Talk to us Tokens / year 500,000 18 million 84 million 192 million No limits Rate limit 10 per min 20 per min 100 per min 1000 per min Custom Support Community Email Slack Slack + Priority Custom (SLA, dedicated engineer) All requests have a base cost of 300 tokens + [output tokens - View token calculator](https://www.firecrawl.dev/pricing?extract-pricing=true#token-calculator) ### Token Usage Calculator Extract example output: Estimated token usage:364 tokens Base cost of 300 tokens included Our Wall of Love ## Don't take our word for it ![Morgan Linton](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-02.4585097e.jpg&w=96&q=75) ### Morgan Linton [@morganlinton](https://x.com/morganlinton/status/1839454165703204955) If you're coding with AI, and haven't discovered @firecrawl\_dev yet, prepare to have your mind blown 🤯 ![Chris DeWeese](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-09.a66dcc26.jpg&w=96&q=75) ### Chris DeWeese [@ChrisDevApps](https://x.com/ChrisDevApps/status/1853587120406876601) Started using @firecrawl\_dev for a project, I wish I used this sooner. ![Bardia Pourvakil](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-01.025350bc.jpeg&w=96&q=75) ### Bardia Pourvakil [@thepericulum](https://twitter.com/thepericulum/status/1781397799487078874) The Firecrawl team ships. I wanted types for their node SDK, and less than an hour later, I got them. ![Tom Reppelin](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-04.1f2b4caf.jpg&w=96&q=75) ### Tom Reppelin [@TomReppelin](https://x.com/TomReppelin/status/1844382491014201613) I found gold today. Thank you @firecrawl\_dev ![latentsauce 🧘🏽](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-07.c2285d35.jpeg&w=96&q=75) ### latentsauce 🧘🏽 [@latentsauce](https://twitter.com/latentsauce/status/1781738253927735331) Firecrawl simplifies data preparation significantly, exactly what I was hoping for. Thank you Firecrawl ❤️❤️❤️ ![Morgan Linton](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-02.4585097e.jpg&w=96&q=75) ### Morgan Linton [@morganlinton](https://x.com/morganlinton/status/1839454165703204955) If you're coding with AI, and haven't discovered @firecrawl\_dev yet, prepare to have your mind blown 🤯 ![Chris DeWeese](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-09.a66dcc26.jpg&w=96&q=75) ### Chris DeWeese [@ChrisDevApps](https://x.com/ChrisDevApps/status/1853587120406876601) Started using @firecrawl\_dev for a project, I wish I used this sooner. ![Bardia Pourvakil](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-01.025350bc.jpeg&w=96&q=75) ### Bardia Pourvakil [@thepericulum](https://twitter.com/thepericulum/status/1781397799487078874) The Firecrawl team ships. I wanted types for their node SDK, and less than an hour later, I got them. ![Tom Reppelin](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-04.1f2b4caf.jpg&w=96&q=75) ### Tom Reppelin [@TomReppelin](https://x.com/TomReppelin/status/1844382491014201613) I found gold today. Thank you @firecrawl\_dev ![latentsauce 🧘🏽](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-07.c2285d35.jpeg&w=96&q=75) ### latentsauce 🧘🏽 [@latentsauce](https://twitter.com/latentsauce/status/1781738253927735331) Firecrawl simplifies data preparation significantly, exactly what I was hoping for. Thank you Firecrawl ❤️❤️❤️ ![Michael Ning](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-05.76d7cd3e.png&w=96&q=75) ### Michael Ning Firecrawl is impressive, saving us 2/3 the tokens and allowing gpt3.5turbo use over gpt4. Major savings in time and money. ![Alex Reibman 🖇️](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-06.4ee7cf5a.jpeg&w=96&q=75) ### Alex Reibman 🖇️ [@AlexReibman](https://twitter.com/AlexReibman/status/1780299595484131836) Moved our internal agent's web scraping tool from Apify to Firecrawl because it benchmarked 50x faster with AgentOps. ![Alex Fazio](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-08.af684f18.jpg&w=96&q=75) ### Alex Fazio [@alxfazio](https://x.com/alxfazio/status/1826731977283641615) Semantic scraping with Firecrawl is 🔥! ![Matt Busigin](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-03.6f60bcb9.jpg&w=96&q=75) ### Matt Busigin [@mbusigin](https://x.com/mbusigin/status/1836065372010656069) Firecrawl is dope. Congrats guys 👏 ![Michael Ning](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-05.76d7cd3e.png&w=96&q=75) ### Michael Ning Firecrawl is impressive, saving us 2/3 the tokens and allowing gpt3.5turbo use over gpt4. Major savings in time and money. ![Alex Reibman 🖇️](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-06.4ee7cf5a.jpeg&w=96&q=75) ### Alex Reibman 🖇️ [@AlexReibman](https://twitter.com/AlexReibman/status/1780299595484131836) Moved our internal agent's web scraping tool from Apify to Firecrawl because it benchmarked 50x faster with AgentOps. ![Alex Fazio](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-08.af684f18.jpg&w=96&q=75) ### Alex Fazio [@alxfazio](https://x.com/alxfazio/status/1826731977283641615) Semantic scraping with Firecrawl is 🔥! ![Matt Busigin](https://www.firecrawl.dev/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Ftestimonial-03.6f60bcb9.jpg&w=96&q=75) ### Matt Busigin [@mbusigin](https://x.com/mbusigin/status/1836065372010656069) Firecrawl is dope. Congrats guys 👏 ## Web Data Extraction Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) # Preview Take a look at the API response (Preview limited to 5 pages) Single URL(/scrape) Crawl(/crawl) Map(/map) Extract(/extract)Beta Llm-extract URL Get CodeRun ### Options Start exploring with our playground! ## Smart Crawl API Coming Soon # S # m # a # r # t # C # r # a # w # l Turn any website into an API with AI. Join the waitlist for beta access. Join the waitlist [X](https://x.com/firecrawl_dev) [LinkedIn](https://www.linkedin.com/company/sideguide-dev) ## Data Extraction Guide Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) May 20, 2024 • [![Nicolas Camara image](https://www.firecrawl.dev/nick-img.jpeg)Nicolas Camara](https://x.com/nickscamara_) # Extract website data using LLMs ![Extract website data using LLMs image](https://www.firecrawl.dev/images/blog/g2.png) ## Setup Install our python dependencies, including groq and firecrawl-py. ```bash pip install groq firecrawl-py ``` ## Getting your Groq and Firecrawl API Keys To use Groq and Firecrawl, you will need to get your API keys. You can get your Groq API key from [here](https://groq.com/) and your Firecrawl API key from [here](https://firecrawl.dev/). ## Load website with Firecrawl To be able to get all the data from a website page and make sure it is in the cleanest format, we will use [Firecrawl](https://firecrawl.dev/). It handles by-passing JS-blocked websites, extracting the main content, and outputting in a LLM-readable format for increased accuracy. Here is how we will scrape a website url using Firecrawl. We will also set a `pageOptions` for only extracting the main content ( `onlyMainContent: True`) of the website page - excluding the navs, footers, etc. ```python from firecrawl import FirecrawlApp # Importing the FireCrawlLoader url = "https://about.fb.com/news/2024/04/introducing-our-open-mixed-reality-ecosystem/" firecrawl = FirecrawlApp( api_key="fc-YOUR_FIRECRAWL_API_KEY", ) page_content = firecrawl.scrape_url(url=url, # Target URL to crawl params={ "pageOptions":{ "onlyMainContent": True # Ignore navs, footers, etc. } }) print(page_content) ``` Perfect, now we have clean data from the website - ready to be fed to the LLM for data extraction. ## Extraction and Generation Now that we have the website data, let’s use Groq to pull out the information we need. We’ll use Groq Llama 3 model in JSON mode and pick out certain fields from the page content. We are using LLama 3 8b model for this example. Feel free to use bigger models for improved results. ```python import json from groq import Groq client = Groq( api_key="gsk_YOUR_GROQ_API_KEY", # Note: Replace 'API_KEY' with your actual Groq API key ) # Here we define the fields we want to extract from the page content extract = ["summary","date","companies_building_with_quest","title_of_the_article","people_testimonials"] completion = client.chat.completions.create( model="llama3-8b-8192", messages=[\ {\ "role": "system",\ "content": "You are a legal advisor who extracts information from documents in JSON."\ },\ {\ "role": "user",\ # Here we pass the page content and the fields we want to extract\ "content": f"Extract the following information from the provided documentation:\Page content:\n\n{page_content}\n\nInformation to extract: {extract}"\ }\ ], temperature=0, max_tokens=1024, top_p=1, stream=False, stop=None, # We set the response format to JSON object response_format={"type": "json_object"} ) # Pretty print the JSON response dataExtracted = json.dumps(str(completion.choices[0].message.content), indent=4) print(dataExtracted) ``` One pro tip is to use an LLM montioring system like [Traceloop](https://www.traceloop.com/) with these calls. This will allow you to quickly test and monitor output quality. ## And Voila! You have now built a data extraction bot using Groq and Firecrawl. You can now use this bot to extract structured data from any website. If you are looking to deploy your own models instead of using the choices Groq gives you, you can try out [Cerebrium](https://www.cerebrium.ai/) which hosts custom models blazingly fast. If you have any questions or need help, feel free to reach out to us at [Firecrawl](https://firecrawl.dev/). Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Nicolas Camara image](https://www.firecrawl.dev/nick-img.jpeg)\\ Nicolas Camara@nickscamara\_](https://x.com/nickscamara_) Nicolas Camara is the Chief Technology Officer (CTO) at Firecrawl. He previously built and scaled Mendable, one of the pioneering "chat with your documents" apps, which had major Fortune 500 customers like Snapchat, Coinbase, and MongoDB. Prior to that, Nicolas built SideGuide, the first code-learning tool inside VS Code, and grew a community of 50,000 users. Nicolas studied Computer Science and has over 10 years of experience in building software. ### More articles by Nicolas Camara [Using OpenAI's Realtime API and Firecrawl to Talk with Any Website\\ \\ Build a real-time conversational agent that interacts with any website using OpenAI's Realtime API and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Talk-with-Any-Website-Using-OpenAIs-Realtime-API-and-Firecrawl) [Extract website data using LLMs\\ \\ Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code.](https://www.firecrawl.dev/blog/data-extraction-using-llms) [Getting Started with Grok-2: Setup and Web Crawler Example\\ \\ A detailed guide on setting up Grok-2 and building a web crawler using Firecrawl.](https://www.firecrawl.dev/blog/grok-2-setup-and-web-crawler-example) [Launch Week I / Day 6: LLM Extract (v1)\\ \\ Extract structured data from your web pages using the extract format in /scrape.](https://www.firecrawl.dev/blog/launch-week-i-day-6-llm-extract) [Launch Week I / Day 7: Crawl Webhooks (v1)\\ \\ New /crawl webhook support. Send notifications to your apps during a crawl.](https://www.firecrawl.dev/blog/launch-week-i-day-7-webhooks) [OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website\\ \\ A guide to building a multi-agent system using OpenAI Swarm and Firecrawl for AI-driven marketing strategies](https://www.firecrawl.dev/blog/openai-swarm-agent-tutorial) [Build a 'Chat with website' using Groq Llama 3\\ \\ Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.](https://www.firecrawl.dev/blog/chat-with-website) [Scrape and Analyze Airbnb Data with Firecrawl and E2B\\ \\ Learn how to scrape and analyze Airbnb data using Firecrawl and E2B in a few lines of code.](https://www.firecrawl.dev/blog/scrape-analyze-airbnb-data-with-e2b) ## OpenAI Swarm Marketing Tutorial Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Oct 12, 2024 • [![Nicolas Camara image](https://www.firecrawl.dev/nick-img.jpeg)Nicolas Camara](https://x.com/nickscamara_) # OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website with AI - YouTube Firecrawl 503 subscribers [OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website with AI](https://www.youtube.com/watch?v=LaEUGfzDWNo) Firecrawl Search Info Shopping Tap to unmute If playback doesn't begin shortly, try restarting your device. You're signed out Videos you watch may be added to the TV's watch history and influence TV recommendations. To avoid this, cancel and sign in to YouTube on your computer. CancelConfirm Share Include playlist An error occurred while retrieving sharing information. Please try again later. Watch later Share Copy link Watch on 0:00 / •Live • [Watch on YouTube](https://www.youtube.com/watch?v=LaEUGfzDWNo "Watch on YouTube") In this tutorial, we’ll build a multi-agent system using [OpenAI Swarm](https://github.com/openai/swarm) for AI-powered marketing strategies using [Firecrawl](https://firecrawl.dev/) for web scraping. ## Agents 1. User Interface: Manages user interactions 2. Website Scraper: Extracts clean LLM-ready content via Firecrawl API 3. Analyst: Provides marketing insights 4. Campaign Idea: Generates marketing campaign concepts 5. Copywriter: Creates compelling marketing copy ## Requirements - [Firecrawl](https://firecrawl.dev/) API key - [OpenAI](https://platform.openai.com/api-keys) API key ## Setup 1. Install the required packages: ``` pip install -r requirements.txt ``` 2. Set up your environment variables in a `.env` file: ``` OPENAI_API_KEY=your_openai_api_key FIRECRAWL_API_KEY=your_firecrawl_api_key ``` ## Usage Run the main script to start the interactive demo: ``` python main.py ``` ## How it works Our multi-agent system uses AI to create marketing strategies. Here’s a breakdown: 1. User Interface Agent: - Talks to the user - Asks for the website URL - Can ask follow-up questions if needed - Passes the URL to the Website Scraper Agent 2. Website Scraper Agent: - Uses Firecrawl to get content from the website - Asks for the content in markdown format - Sends the cleaned-up content to the Analyst Agent 3. Analyst Agent: - Looks at the website content - Uses GPT-4o-mini to find key marketing insights - Figures out things like target audience and business goals - Passes these insights to the Campaign Idea Agent 4. Campaign Idea Agent: - Takes the analysis and creates a marketing campaign idea - Uses GPT-4o-mini to come up with something creative - Considers the target audience and goals from the analysis - Sends the campaign idea to the Copywriter Agent 5. Copywriter Agent: - Gets the campaign idea - Uses GPT-4o-mini to write catchy marketing copy - Creates copy that fits the campaign idea and target audience The OpenAI Swarm library manages how these agents work together. It makes sure information flows smoothly between agents and each agent does its job when it’s supposed to. The whole process starts when a user enters a URL. The system then goes through each step, from scraping the website to writing copy. At the end, the user gets a full marketing strategy with analysis, campaign ideas, and copy. Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Nicolas Camara image](https://www.firecrawl.dev/nick-img.jpeg)\\ Nicolas Camara@nickscamara\_](https://x.com/nickscamara_) Nicolas Camara is the Chief Technology Officer (CTO) at Firecrawl. He previously built and scaled Mendable, one of the pioneering "chat with your documents" apps, which had major Fortune 500 customers like Snapchat, Coinbase, and MongoDB. Prior to that, Nicolas built SideGuide, the first code-learning tool inside VS Code, and grew a community of 50,000 users. Nicolas studied Computer Science and has over 10 years of experience in building software. ### More articles by Nicolas Camara [Using OpenAI's Realtime API and Firecrawl to Talk with Any Website\\ \\ Build a real-time conversational agent that interacts with any website using OpenAI's Realtime API and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Talk-with-Any-Website-Using-OpenAIs-Realtime-API-and-Firecrawl) [Extract website data using LLMs\\ \\ Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code.](https://www.firecrawl.dev/blog/data-extraction-using-llms) [Getting Started with Grok-2: Setup and Web Crawler Example\\ \\ A detailed guide on setting up Grok-2 and building a web crawler using Firecrawl.](https://www.firecrawl.dev/blog/grok-2-setup-and-web-crawler-example) [Launch Week I / Day 6: LLM Extract (v1)\\ \\ Extract structured data from your web pages using the extract format in /scrape.](https://www.firecrawl.dev/blog/launch-week-i-day-6-llm-extract) [Launch Week I / Day 7: Crawl Webhooks (v1)\\ \\ New /crawl webhook support. Send notifications to your apps during a crawl.](https://www.firecrawl.dev/blog/launch-week-i-day-7-webhooks) [OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website\\ \\ A guide to building a multi-agent system using OpenAI Swarm and Firecrawl for AI-driven marketing strategies](https://www.firecrawl.dev/blog/openai-swarm-agent-tutorial) [Build a 'Chat with website' using Groq Llama 3\\ \\ Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.](https://www.firecrawl.dev/blog/chat-with-website) [Scrape and Analyze Airbnb Data with Firecrawl and E2B\\ \\ Learn how to scrape and analyze Airbnb data using Firecrawl and E2B in a few lines of code.](https://www.firecrawl.dev/blog/scrape-analyze-airbnb-data-with-e2b) ## Firecrawl July 2024 Updates Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) July 31, 2024 • [![Eric Ciarla image](https://www.firecrawl.dev/eric-img.jpeg)Eric Ciarla](https://x.com/ericciarla) # Firecrawl July 2024 Updates ![Firecrawl July 2024 Updates image](https://www.firecrawl.dev/images/blog/launch-yc-firecrawl.png) We are excited to share our latest updates from July! **TLDR:** - We launched [Firecrawl on Launch YC](https://www.ycombinator.com/launches/LTf-firecrawl-open-source-crawling-and-scraping-for-ai-ready-web-data) 🔥 - Improvements to Endpoints + Dashboard - New Templates & Community Creations - We are hiring a [Developer Relations Specialist](https://www.ycombinator.com/companies/firecrawl/jobs/bbUHmrJ-devrel-and-growth-specialist-at-firecrawl) & [Web Automation Engineer](https://www.ycombinator.com/companies/firecrawl/jobs/hZHD0j6-founding-web-automation-engineer) ### Officially launched on YC 🧡 After three months and more than 8K stars, we have officially decided to launch Firecrawl on YC. It has been an incredible journey, and we are excited to continue building the best way to power AI with web data. [Check out our launch (and leave an upvote 🙂)!](https://www.ycombinator.com/launches/LTf-firecrawl-open-source-crawling-and-scraping-for-ai-ready-web-data) ![Firecrawl Launch YC](https://www.firecrawl.dev/images/blog/launchyc.jpeg) ### Improvements to Endpoints + Dashboard This month, we made improving our core product a priority. This meant focusing time on speed, reliability, and our dashboard as well. Specifically in these categories, we: - Shaved off around 1 second for every scrape and crawl request - Expanded scrape reliability for a bunch of new types of sites - Added enhanced dashboard monitoring which allows you to see processes, timing, failures and more. Check it out on your Activity Logs page on the dashboard! Look for even more speed and reliability improvements coming soon! ![New enhanced dashboard monitoring](https://www.firecrawl.dev/images/blog/newactivitylogs.jpeg) ### New Templates & Community Creations Not only did we release some examples and templates this month, but we also witnessed incredible creations from our community. If you’re working on an interesting Firecrawl project, we’d love to hear about it! Give us a shout at [@firecrawl\_dev](https://x.com/firecrawl_dev). Here are a few highlights: - Firecrawl Web Data Ingestion UI Template [(Link to repo)](https://github.com/mendableai/firecrawl/tree/main/apps/ui/ingestion-ui) - Generative UI with demo Firecrawl x Langchain by Brace Sproul from Langchain [(Link to repo)](https://github.com/bracesproul/gen-ui) - Scraping Real Estate Data from Zillow by Sourav Maji [(Link to post)](https://x.com/SouravMaji221/status/1818133241460556178) - Website Contraction Analysis with Google Gemini [(Link to post)](https://x.com/ericciarla/status/1808614350967525873) ![Web Data Ingestion UI Template](https://www.firecrawl.dev/images/blog/ingestiontemplate.jpeg) ### We are hiring! If you want to help build the best way to power AI with web data, we want to hear from you. Specifically, we are hiring for these roles: - DevRel and Growth Specialist at Firecrawl [(Link to post)](https://www.ycombinator.com/companies/firecrawl/jobs/bbUHmrJ-devrel-and-growth-specialist-at-firecrawl) - Founding Web Automation Engineer [(Link to job post)](https://www.ycombinator.com/companies/firecrawl/jobs/hZHD0j6-founding-web-automation-engineer) That’s all for this update! Stay tuned for the next one 🚀 Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Eric Ciarla image](https://www.firecrawl.dev/eric-img.jpeg)\\ Eric Ciarla@ericciarla](https://x.com/ericciarla) Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB. Previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users. ### More articles by Eric Ciarla [How to Create an llms.txt File for Any Website\\ \\ Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [Cloudflare Error 1015: How to solve it?\\ \\ Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner.](https://www.firecrawl.dev/blog/cloudflare-error-1015-how-to-solve-it) [Build an agent that checks for website contradictions\\ \\ Using Firecrawl and Claude to scrape your website's data and look for contradictions.](https://www.firecrawl.dev/blog/contradiction-agent) [Why Companies Need a Data Strategy for Generative AI\\ \\ Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses\\ \\ A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [How to easily install requests with pip and python\\ \\ A tutorial on installing the requests library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-easily-install-requests-with-pip-and-python) [How to quickly install BeautifulSoup with Python\\ \\ A guide on installing the BeautifulSoup library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python) [How to Use OpenAI's o1 Reasoning Models in Your Applications\\ \\ Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications) ## Scaling Firecrawl's Requests Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Sep 13, 2024 • [![Gergő Móricz (mogery) image](https://www.firecrawl.dev/mogery.jpg)Gergő Móricz (mogery)](https://x.com/mo_geryy) # Handling 300k requests per day: an adventure in scaling ![Handling 300k requests per day: an adventure in scaling image](https://www.firecrawl.dev/images/blog/an-adventure-in-scaling.jpg) When I joined the Firecrawl team in early July, we spent most of our time working on new features and minor bugfixes. Life was good — we could focus mostly on shipping shiny new stuff without worrying as much about architecture and server load. However, as we grew over time, we started experiencing the “hug of death” a lot more. People loved our product so much that our architecture couldn’t take it anymore, and every day there was a brand new fire to put out. We knew that this was unsustainable, and ultimately it damages our DX more than any new feature we could put out could make up for. We knew we had to change things, stat. ## Our architecture, before the storm We host our API service on [Fly.io](https://fly.io/), which allows us to easily deploy our code in a Docker container. It also manages load balancing, log collection, zero-downtime deployment strategies, VPC management, and a whole load of other stuff for us, which is very useful. Our main API service has two kinds of “processes”, as Fly calls it: `app` and `worker`. `app` processes use Express to serve the main API, perform scrape requests (which take a relatively short time), and delegate crawls to `worker` processes using the [Bull](https://github.com/OptimalBits/bull) job queue. `worker` processes register themselves as workers on the job queue, and perform crawls (which take a relatively long time). Both processes use Supabase to handle authentication and store data in Postgres. Bull also runs on top of Redis, which we deployed on [Railway](https://railway.app/), since it’s super easy to use. ## Locks are hard As more and more people started using us, more and more people started finding bugs. We started getting odd issues with crawls sometimes being stuck for hours without any progress. I charted the timing of these crawls, and I saw that it was happening every time we redeployed. Due to some miscellaneous memory leak issues, we were redeploying our entire service every 2 hours via GitHub Actions, in order to essentially restart all our machines. This killed all our workers, which had acquired locks for these crawl jobs. I was not too familiar with the codebase at this point, and I thought that these locks got hard-stuck on the dead workers, so I to add some code to release all of the current worker’s locks on termination. This ended up being really complicated, due to multiple factors: 1. Other libraries we used also had cleanup code on `SIGTERM`. When you listen to `SIGTERM`, your app doesn’t actually quit until the handler calls `process.exit()`. So, the other library’s handler called `process.exit()` when its handler finished, which caused a race condition with our cleanup handler. (This was absolute hell to debug.) 2. Fly.io sometimes didn’t respect our configuration, and hard- `SIGKILL` ed our application before the 30 second timeout we specified our config. This cut our cleanup code short. 3. There was no easy way to remove a lock via the Bull API. The only legitimate way it could be done was to: 1. Get all in-progress jobs of this worker 2. Set their status to failed 3. Delete them from the queue 4. Re-insert them to the queue 4. While the cleanup code was running, there was no easy way to disable the current worker, so sometimes jobs the cleanup code re-inserted were immediately picked up by the same worker that was about to be shut down. 5. Due to our rollover deployment strategy, during a deployment, the re-inserted jobs were picked up by workers that have not been updated yet. This caused all the jobs to be piled up on the last worker to be updated, which caused the cleanup code to run longer than Fly’s maximum process shutdown timeout. While I was going down a rabbithole that was spiraling out of control, Thomas (another Firecrawl engineer who mainly works on [Fire-Engine](https://www.firecrawl.dev/blog/introducing-fire-engine-for-firecrawl), which used a similar architecture) discovered that our queue lock options were grossly misconfigured: ```typescript webScraperQueue = new Queue("web-scraper", process.env.REDIS_URL, { settings: { lockDuration: 2 * 60 * 60 * 1000, // 2 hours in milliseconds lockRenewTime: 30 * 60 * 1000, // 30 minutes in milliseconds }, }); ``` This was originally written with the understanding that `lockDuration` would be the maximum amount of time a job could take — which is not true. When a worker stops renewing the lock every `lockRenewTime` milliseconds, `lockDuration` specifies the amount of time to wait before declaring the job as `stalled` and giving it to another worker. This was causing the crawls to be locked up for 2 hours, similar to what our customers were reporting. After I got rid of all my super-complex cleanup code, the fix ended up being this: ```typescript webScraperQueue = new Queue("web-scraper", process.env.REDIS_URL, { settings: { lockDuration: 2 * 60 * 1000, // 1 minute in milliseconds lockRenewTime: 15 * 1000, // 15 seconds in milliseconds }, }); ``` Thank you Thomas for spotting that one and keeping me from going off the deep end! ## Scaling scrape requests, the easy way As you might have noticed in the architecture description, we were running scrape requests on the `app` process, the same one that serves our API. We were just starting a scrape in the `/v0/scrape` endpoint handler, and returning the results. This is simple to build, but it isn’t sustainable. We had no idea how many scrape requests we were running and when, there was no way to retry failed scrape requests, we had no data source to scale the `app` process on (other than are we down or not), and we had to scrape Express along with it. We needed to move scraping to our `worker` process. We ended up choosing to just add scrape jobs to the same queue as crawling jobs. This way the `app` submitted the job, the `worker` completed it, and the `app` waited for it to be done and returned the data. [We read the old advice about “never wait for jobs to finish”](https://blog.taskforce.sh/do-not-wait-for-your-jobs-to-complete/), but we decided to cautiously ignore it, since it would have ruined the amazing simplicity that the scrape endpoint has. This ended up [being surprisingly simple](https://github.com/mendableai/firecrawl/commit/6798695ee4daf1ce1b289db494d260d718b6752b#diff-6753e371514e1d188e797436080479e7c781d96183601ab8fa203e4df6ca0400), only slightly affected by Bull’s odd API. We had to add a global event handler to check if the job had completed, since it lacked the [`Job.waitUntilFinished`](https://api.docs.bullmq.io/classes/v5.Job.html#waitUntilFinished) function that its successor [BullMQ](https://github.com/taskforcesh/bullmq) already had. We saw a huge drop in weird behaviour on our `app` machines, and we were able to scale them down in exchange for more `worker` machines, making us way faster. ## Smaller is better The redeploy crawl fiasco made us worried about handling big crawls. We could essentially 2x the time a big crawl ran if it was caught in the middle of a redeploy, which is sub-optimal. Some of our workers were also crashing with an OOM error when working on large crawls. We instead decided to break crawls down to individual scrape jobs that chain together and spawn new jobs when they find new URLs. We decided to make every job in the queue have a scrape type. Scrape jobs that are associated with crawls have an extra bit of metadata tying them to the crawlId. This crawlId refers to some redis keys that coordinate the crawling process. The crawl itself has some basic data including the origin URL, the team associated with the request, the robots.txt file, and others: ```typescript export type StoredCrawl = { originUrl: string; crawlerOptions: any; pageOptions: any; team_id: string; plan: string; robots?: string; cancelled?: boolean; createdAt: number; }; export async function saveCrawl(id: string, crawl: StoredCrawl) { await redisConnection.set("crawl:" + id, JSON.stringify(crawl)); await redisConnection.expire("crawl:" + id, 24 * 60 * 60, "NX"); } export async function getCrawl(id: string): Promise { const x = await redisConnection.get("crawl:" + id); if (x === null) { return null; } return JSON.parse(x); } ``` We also make heavy use of Redis sets to determine which URLs have been already visited when discovering new pages. The Redis `SADD` command adds a new element to a set. Since sets can only store unique values, it returns 1 or 0 based on whether the element was added or not. (The element does not get added if it was already in the set before.) We use this as a lock mechanism, to make sure two workers don’t discover the same URL at the same time and add two jobs for them. ```typescript async function lockURL(id: string, url: string): Promise { // [...] const res = (await redisConnection.sadd("crawl:" + id + ":visited", url)) !== 0; // [...] return res; } async function onURLDiscovered(crawl: string, url: string) { if (await lockURL(crawl, url)) { // we are the first ones to discover this URL await addScrapeJob(/* ... */); // add new job for this URL } } ``` You can take a look at the whole Redis logic around orchestrating crawls [here](https://github.com/mendableai/firecrawl/blob/main/apps/api/src/lib/crawl-redis.ts). With this change, we saw a huge performance improvement on crawls. This change also allowed us to perform multiple scrape requests of one crawl at the same time, while the old crawler had no scrape concurrency. We were able to stretch a crawl over all of our machines, maximizing the worth we get for each machine we pay for. ## Goodbye Bull, hello BullMQ Every time we encountered Bull, we were slapped in the face by how much better BullMQ was. It had a better API, new features, and the most important thing of all: active maintenance. We decided to make the endeavour to switch over to it, first on Fire-Engine, and then on Firecrawl. With this change, we were able to drop the horrible code for [waiting for a job to complete](https://github.com/mendableai/firecrawl/blob/6798695ee4daf1ce1b289db494d260d718b6752b/apps/api/src/controllers/scrape.ts#L59-L89), and replace it all with `job.waitUntilFinished()`. We were also able to customize our workers to add Sentry instrumentation (more on that later), and to take on jobs based on CPU and RAM usage, instead of a useless max concurrency constant that we had to use with Bull. BullMQ still has its API quirks (e.g. don’t you dare call `Job.moveToCompleted` / `Job.moveToFailed` with the 3rd argument not set to `false`, otherwise you will check out and lock a job that will be returned to you that you’re probably dropping) ## Our egress fee horror story Our changes made us super scalable, but they also meant that a lot more traffic was going through Redis. We ended up racking up a 15000$ bill on Railway in August, mostly on Redis egress fees only. This wasn’t sustainable, and we needed to switch quickly. After being disappointed with Upstash, and having issues with Dragonfly, we found a way to deploy Redis to Fly.io natively. [We put our own spin on the config](https://github.com/mendableai/firecrawl/blob/f7c4cee404e17b3ed201e005185a5041009d0e6f/apps/redis/fly.toml), and deployed it to our account. However, we were not able to reach the instance from the public IP using `redis-cli` (netcat worked though?!?!), which caused some confusion. We decided to go another way and use Fly’s [Private Networking](https://fly.io/docs/networking/private-networking/), which provides a direct connection to a Fly app/machine without any load balancer being in front. We crafted a connection string, SSH’d into one of our worker machines, installed `redis-cli`, tried to connect, and… it worked! We had a reachable, stable Redis instance in front of us. So, we went to change the environment variable to the fancy new Fly.io Redis, we deployed the application, and… we crashed. After a quick revert, we noticed that [IORedis](https://github.com/redis/ioredis) wasn’t able to connect to the Redis instance, but `redis-cli` stilled worked fine. So… what gives? Turns out, `ioredis` only performs a lookup for an IPv4 address, unless you specify `?family=6`, in which case it only performs a lookup for an IPv6 address. This is not documented anywhere, except in a couple of GitHub issues which are hard to search for. I have been coding for almost 11 years now, and this is the worst configuration quirk I have ever seen. (And I use Nix daily!) In 2024, it would be saner to look for IPv6 by default instead of IPv4. Why not look for both? This is incomprehensible to me. Anyways, after appending `?family=6` to the string, everything worked, except, sometimes not… ## Awaiting forever We started having huge waves of scrape timeouts. After a bit of investigation, the `Job.waitUntilFinished()` Promise never returned, but after looking at our BullMQ dashboard, we saw that jobs were actually being completed. BullMQ uses Redis streams for all of its event firing/handling code, including `waitUntilFinished`, which waits until the job’s `finished` event fires. BullMQ enforces a maximum length for the event stream, in order to purge old events that have presumably already been handled, and it defaults to about 10000 maximum events. Under heavy load, our queue was firing so many events, that BullMQ was trimming events before they could be processed. This caused everything that depends on queue events to fail. This maximum events parameter is configurable, however, it seems like a parameter that we’d have to babysit, and it’s way too cryptic and too easy to forget about. Instead, we opted to rewrite the small amount of code that uses queue events to do polling instead, which is not affected by pub/sub issues like this. Inexplicably, this never happened on the old Railway Redis instance, but it happened on every alternative we tried (including Upstash and Dragonfly). We’re still not sure why we didn’t run into this issue earlier, and BullMQ queue events still work happily on the Fire-Engine side under Dragonfly. ## Adding monitoring We were growing tired of going through console logs to diagnose things. We were also worried about how many issues we could potentially be missing. So, we decided to integrate [Sentry](https://sentry.io/) for error and performance monitoring, because I had some great experiences with it in the past. The moment we added it, we found about 10 high-impact bugs that we had no idea about. I fixed them the day after. We also had an insight into what our services were actively doing — I was able to add custom instrumentation to BullMQ, and pass trace IDs over to Fire-Engine, so now we can view the entire process a scrape or crawl goes through until it finishes, all organized in one place. ![Sentry Trace view](https://www.firecrawl.dev/images/blog/scaling-sentry.png) (The creation of this image for this post lead me to [decrease the time Firecrawl spends after Fire-Engine is already finished](https://github.com/mendableai/firecrawl/commit/000a316cc362b935976ac47b73ec02923f4175c5). Thanks, Sentry!) Sentry has been immensely useful in finding errors, debugging incidents, and improving performance. There is no longer a chance that we have an issue invisibly choking us. With Sentry we see everything that could be going wrong (super exciting to see AIOps tools like [Keep](https://www.keephq.dev/) popping up). ## The future We are currently stable. I was on-call last weekend and I forgot about it. The phone never rang. It felt very weird after putting out fires for so long, but our investment absolutely paid off. It allowed us to do [our launch week](https://www.firecrawl.dev/blog/firecrawl-launch-week-1-recap), which would not have been possible if we were in panic mode 24/7. It has also allowed our customers to build with confidence, as the increased reliabilty adds another layer of greatness to Firecrawl. However, there are still things we’re unhappy with. Fly, while very useful early-stage, doesn’t let us smoothly autoscale. We are currently setting up Kubernetes to give us more control over our scaling. I love making Firecrawl better, be it with features or with added reliability. We’re in a good place right now, but I’m sure there will be a lot more adventures with scaling in the future. I hope this post has been useful, since surprisingly few people talk about all this stuff. (We sure had trouble finding resources when we were trying to fix things.) I will likely be back with a part 2 when there’s more exciting things to talk about. Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Gergő Móricz (mogery) image](https://www.firecrawl.dev/mogery.jpg)\\ Gergő Móricz (mogery)@mo\_geryy](https://x.com/mo_geryy) Gergő Móricz is a Software Engineer at Firecrawl. He works on scaling, monitoring, designing new APIs and features, putting out fires, customer support, and everything else there is to do at a tech startup. Previously coded and scaled a hospitality tech startup, and contributed to Mendable on GitHub. ### More articles by Gergő Móricz (mogery) [Handling 300k requests per day: an adventure in scaling\\ \\ Putting out fires was taking up all our time, and we had to scale fast. This is how we did it.](https://www.firecrawl.dev/blog/an-adventure-in-scaling) ## BeautifulSoup4 vs Scrapy Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Dec 24, 2024 • [![Bex Tuychiev image](https://www.firecrawl.dev/bex.jpg)Bex Tuychiev](https://x.com/bextuychiev) # BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python ![BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python image](https://www.firecrawl.dev/images/blog/bs4_scrapy/bs4-vs-scrapy-comparison.jpg) ## 1\. Introduction Web scraping has become an essential tool for gathering data from the internet. Whether you’re tracking prices, collecting news articles, or building a research dataset, Python offers several popular libraries to help you extract information from websites. Two of the most widely used tools are BeautifulSoup4 (BS4) and Scrapy, each with its own strengths and ideal use cases. Choosing between BS4 and Scrapy isn’t always straightforward. BS4 is known for its simplicity and ease of use, making it perfect for beginners and small projects. Scrapy, on the other hand, offers powerful features for large-scale scraping but comes with a steeper learning curve. Making the right choice can save you time and prevent headaches down the road. In this guide, we’ll compare BS4 and Scrapy in detail, looking at their features, performance, and best uses. We’ll also explore practical examples and discuss modern alternatives that solve common scraping challenges. By the end, you’ll have a clear understanding of which tool best fits your needs and how to get started with web scraping in Python. ## Prerequisites Before diving into the comparison, make sure you have: - Basic knowledge of Python programming - Understanding of HTML structure and CSS selectors - Python 3.7+ installed on your system - Familiarity with command line interface - A code editor or IDE of your choice You’ll also need to install the required libraries: ```bash pip install beautifulsoup4 scrapy firecrawl-py pydantic python-dotenv ``` ## 2\. Understanding BeautifulSoup4 BeautifulSoup4, often called BS4, is a Python library that helps developers extract data from HTML and XML files. Think of it as a tool that can read and understand web pages the same way your browser does, but instead of showing you the content, it helps you collect specific information from it. BS4 works by turning messy HTML code into a well-organized structure that’s easy to navigate and search through. The library shines in its simplicity. With just a few lines of code, you can pull out specific parts of a webpage like headlines, prices, or product descriptions. Here’s a quick example: ```python from bs4 import BeautifulSoup import requests # Get a webpage response = requests.get('https://news.ycombinator.com') soup = BeautifulSoup(response.text, 'html.parser') # Find all article titles titles = soup.find_all('span', class_='titleline') for idx, title in enumerate(titles): print(f"{idx + 1}. {title.text.strip()}") if idx == 4: break ``` This code demonstrates BeautifulSoup4’s straightforward approach to web scraping. It fetches the Hacker News homepage using the requests library, then creates a BeautifulSoup object to parse the HTML. The `find_all()` method searches for `` elements with the class `"titleline"`, which contain article titles. The code loops through the first 5 titles, printing each one with its index number. The `strip()` method removes any extra whitespace around the titles. The output shows real article titles from Hacker News, demonstrating how BS4 can easily extract specific content from a webpage: ```out 1. The GTA III port for the Dreamcast has been released (gitlab.com/skmp) 2. Arnis: Generate Cities in Minecraft from OpenStreetMap (github.com/louis-e) 3. Things we learned about LLMs in 2024 (simonwillison.net) 4. Journey from Entrepreneur to Employee (akshay.co) 5. Systems ideas that sound good but almost never work (learningbyshipping.com) ``` While BS4 excels at handling static websites, it does have limitations. It can’t process JavaScript-generated content, which many modern websites use. It also doesn’t handle tasks like managing multiple requests or storing data. However, these limitations are often outweighed by its gentle learning curve and excellent documentation, making it an ideal starting point for anyone new to web scraping. Key Features: - Simple, intuitive API for parsing HTML/XML - Powerful searching and filtering methods - Forgiving HTML parser that can handle messy code - Extensive documentation with clear examples - Small memory footprint - Compatible with multiple parsers ( `lxml`, `html5lib`) ## 3\. Understanding Scrapy ![Scrapy architecture diagram showing components like spiders, engine, scheduler and pipelines](https://www.firecrawl.dev/images/blog/bs4_scrapy/scrapy_architecture.jpeg) Source: [Scrapy documentation](https://docs.scrapy.org/en/latest/topics/architecture.html). Scrapy takes a different approach to web scraping by providing a complete framework rather than just a parsing library. Think of it as a Swiss Army knife for web scraping – it includes everything you need to crawl websites, process data, and handle common scraping challenges all in one package. While this makes it more powerful than BS4, it also means there’s more to learn before you can get started. Here’s a basic example of how Scrapy works: ```python # hackernews_spider.py import scrapy class HackerNewsSpider(scrapy.Spider): name = "hackernews" start_urls = ["https://news.ycombinator.com"] def parse(self, response): # Get all stories for story in response.css("span.titleline"): # Extract story title yield {"title": story.css("a::text").get()} # Go to next page if available # if next_page := response.css('a.morelink::attr(href)').get(): # yield response.follow(next_page, self.parse) # To run the spider, we need to use the Scrapy command line # scrapy runspider hackernews_spider.py -o results.json ``` This code defines a simple Scrapy spider that crawls Hacker News. The spider starts at the homepage, extracts story titles from each page, and could optionally follow pagination links (currently commented out). The spider uses CSS selectors to find and extract content, demonstrating Scrapy’s built-in parsing capabilities. The results can be exported to JSON using Scrapy’s command line interface. What sets Scrapy apart is its architecture. Instead of making one request at a time like BS4, Scrapy can handle multiple requests simultaneously, making it much faster for large projects. It also includes built-in features that you’d otherwise need to build yourself. Scrapy’s key components include: Spider middleware for customizing request/response handling, item pipelines for processing and storing data, and automatic request queuing and scheduling. It provides built-in support for exporting data in formats like JSON, CSV, and XML. The framework also includes robust error handling with retry mechanisms and a command-line interface for project management. ## 4\. Head-to-Head Comparison Let’s break down how BS4 and Scrapy compare in key areas that matter most for web scraping projects. ### Performance When it comes to speed and efficiency, Scrapy has a clear advantage. Its ability to handle multiple requests at once means it can scrape hundreds of pages while BS4 is still working on its first dozen. Think of BS4 as a solo worker, carefully processing one page at a time, while Scrapy is like a team of workers tackling many pages simultaneously. Memory usage tells a similar story. BS4 is lightweight and uses minimal memory for single pages, making it perfect for small projects. However, Scrapy’s smart memory management shines when dealing with large websites, efficiently handling thousands of pages without slowing down your computer. ### Ease of Use BS4 takes the lead in simplicity. You can start scraping with just 4-5 lines of code and basic Python knowledge. Here’s a quick comparison: BS4: ```python from bs4 import BeautifulSoup import requests response = requests.get("https://example.com") soup = BeautifulSoup(response.text, "html.parser") titles = soup.find_all("h1") ``` Scrapy: ```python import scrapy class MySpider(scrapy.Spider): name = 'myspider' start_urls = ['https://example.com'] def parse(self, response): titles = response.css('h1::text').getall() yield {'titles': titles} # Requires additional setup and command-line usage as seen above ``` ### Features Here’s a simple breakdown of key features: | Feature | BeautifulSoup4 | Scrapy | | --- | --- | --- | | JavaScript Support | ❌ | ❌ (needs add-ons) | | Multiple Requests | ❌ (manual) | ✅ (automatic) | | Data Processing | ❌ (basic) | ✅ (built-in pipelines) | | Error Handling | ❌ (manual) | ✅ (automatic retries) | | Proxy Support | ❌ (manual) | ✅ (built-in) | ### Use Cases Choose BS4 when: - You’re new to web scraping - You need to scrape a few simple pages - You want to quickly test or prototype - The website is mostly static HTML - You’re working within a larger project Choose Scrapy when: - You need to scrape thousands of pages - You want built-in data processing - You need advanced features like proxy rotation - You’re building a production scraper - Performance is critical ## 5\. Common Challenges and Limitations Web scraping tools face several hurdles that can make data extraction difficult or unreliable. Understanding these challenges helps you choose the right tool and prepare for potential roadblocks. ### Dynamic Content Modern websites often load content using JavaScript after the initial page load. Neither BS4 nor Scrapy can handle this directly. While you can add tools like Selenium or Playwright to either solution, this makes your scraper more complex and slower. A typical example is an infinite scroll page on social media – the content isn’t in the HTML until you scroll down. ### Anti-Bot Measures Websites are getting smarter at detecting and blocking scrapers. Common protection methods include: - CAPTCHAs and reCAPTCHA challenges - IP-based rate limiting - Browser fingerprinting - Dynamic HTML structure changes - Hidden honeypot elements While Scrapy offers some built-in tools like proxy support and request delays, both BS4 and Scrapy users often need to implement additional solutions to bypass these protections. ### Maintenance Burden Perhaps the biggest challenge is keeping scrapers running over time. Websites frequently change their structure, breaking scrapers that rely on specific HTML patterns. Here’s a real-world example: Before website update: ```python # Working scraper soup.find('div', class_='product-price').text # Returns: "$99.99" ``` After website update, same code now returns None because the structure changed: ```python soup.find('span', class_='price-current').text # Returns: None ``` This constant need for updates creates a significant maintenance overhead, especially when managing multiple scrapers. While Scrapy’s more robust architecture helps handle some issues automatically, both tools require regular monitoring and fixes to maintain reliability. ### Resource Management Each tool presents unique resource challenges: - BS4: High memory usage when parsing large pages - Scrapy: Complex configuration for optimal performance - Both: Network bandwidth limitations - Both: Server response time variations These limitations often require careful planning and optimization, particularly for large-scale scraping projects where efficiency is crucial. ## 6\. Modern Solutions: Introducing Firecrawl After exploring the limitations of traditional scraping tools, let’s look at how modern AI-powered solutions like Firecrawl are changing the web scraping landscape. Firecrawl takes a fundamentally different approach by using natural language understanding to identify and extract content, rather than relying on brittle HTML selectors. ### AI-Powered Content Extraction ![Firecrawl AI-powered web scraping tool interface showing natural language extraction capabilities and code examples](https://www.firecrawl.dev/images/blog/bs4_scrapy/firecrawl.png) Unlike BS4 and Scrapy which require you to specify exact HTML elements, Firecrawl lets you describe what you want to extract in plain English. This semantic approach means your scrapers keep working even when websites change their structure. Here’s a practical example of scraping GitHub’s trending repositories: ```python # Import required libraries from firecrawl import FirecrawlApp from pydantic import BaseModel, Field from dotenv import load_dotenv from typing import List # Load environment variables from .env file load_dotenv() # Define Pydantic model for a single GitHub repository class Repository(BaseModel): # Each field represents a piece of data we want to extract name: str = Field(description="The repository name including organization/username") description: str = Field(description="The repository description") stars: int = Field(description="Total number of stars") language: str = Field(description="Primary programming language") url: str = Field(description="The repository URL") # Define model for the full response containing list of repositories class Repositories(BaseModel): repositories: List[Repository] = Field(description="List of trending repositories") # Initialize Firecrawl app app = FirecrawlApp() # Scrape GitHub trending page using our defined schema trending_repos = app.scrape_url( 'https://github.com/trending', params={ # Specify we want to extract structured data "formats": ["extract"], "extract": { # Use our Pydantic model schema for extraction "schema": Repositories.model_json_schema(), } } ) # Loop through the first 3 repositories and print their details for idx, repo in enumerate(trending_repos['extract']['repositories']): print(f"{idx + 1}. {repo['name']}") print(f"⭐ {repo['stars']} stars") print(f"💻 {repo['language']}") print(f"📝 {repo['description']}") print(f"🔗 {repo['url']}\n") # Break after showing 3 repositories if idx == 2: break ``` ```python 1. pathwaycom/pathway ⭐ 11378 stars 💻 Python 📝 Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG. 🔗 https://github.com/pathwaycom/pathway 2. EbookFoundation/free-programming-books ⭐ 345107 stars 💻 HTML 📝 📚 Freely available programming books 🔗 https://github.com/EbookFoundation/free-programming-books 3. DrewThomasson/ebook2audiobook ⭐ 3518 stars 💻 Python 📝 Convert ebooks to audiobooks with chapters and metadata using dynamic AI models and voice cloning. Supports 1,107+ languages! 🔗 https://github.com/DrewThomasson/ebook2audiobook ``` Firecrawl addresses the major pain points we discussed earlier: 1. **JavaScript Rendering**: Automatically handles dynamic content without additional tools 2. **Anti-Bot Measures**: Built-in proxy rotation and browser fingerprinting 3. **Maintenance**: AI adapts to site changes without updating selectors 4. **Rate Limiting**: Smart request management with automatic retries 5. **Multiple Formats**: Export data in various formats (JSON, CSV, Markdown) ### When to Choose Firecrawl Firecrawl is particularly valuable when: - You need reliable, low-maintenance scrapers - Websites frequently change their structure - You’re dealing with JavaScript-heavy sites - Anti-bot measures are a concern - You need clean, structured data for AI/ML - Time-to-market is critical While you have to pay for higher usage limits, the reduction in development and maintenance time often makes it more cost-effective than maintaining custom scraping infrastructure with traditional tools. ## 7\. Making the Right Choice Choosing the right web scraping tool isn’t a one-size-fits-all decision. Let’s break down a practical framework to help you make the best choice for your specific needs. ### Decision Framework 1. **Project Scale** - Small (1-10 pages): BeautifulSoup4 - Medium (10-100 pages): BeautifulSoup4 or Scrapy - Large (100+ pages): Scrapy or Firecrawl 2. **Technical Requirements** - Static HTML only: BeautifulSoup4 - Multiple pages & data processing: Scrapy - Dynamic content & anti-bot bypass: Firecrawl 3. **Development Resources** - Time available: - Hours: BeautifulSoup4 - Days: Scrapy - Minutes: Firecrawl - Team expertise: - Beginners: BeautifulSoup4 - Experienced developers: Scrapy - Production teams: Firecrawl ### Cost-Benefit Analysis | Factor | BeautifulSoup4 | Scrapy | Firecrawl | | --- | --- | --- | --- | | Initial Cost | Free | Free | Paid | | Development Time | Low | High | Minimal | | Maintenance Cost | High | Medium | Low | | Scalability | Limited | Good | Excellent | ### Future-Proofing Your Choice Consider these factors for long-term success: 1. **Maintainability** - Will your team be able to maintain the scraper? - How often does the target website change? - What’s the cost of scraper downtime? 2. **Scalability Requirements** - Do you expect your scraping needs to grow? - Will you need to add more websites? - Are there seasonal traffic spikes? 3. **Integration Needs** - Does it need to work with existing systems? - What format do you need the data in? - Are there specific performance requirements? ### Practical Recommendations Start with BeautifulSoup4 if: - You’re learning web scraping - You need to scrape simple, static websites - You have time to handle maintenance - Budget is your primary constraint Choose Scrapy when: - You need to scrape at scale - You have experienced developers - You need fine-grained control - You’re building a long-term solution Consider Firecrawl if: - Time to market is critical - You need reliable production scrapers - Maintenance costs are a concern - You’re dealing with complex websites - You need AI-ready data formats ## 8\. Conclusion The web scraping landscape offers distinct tools for different needs. BeautifulSoup4 excels in simplicity, making it ideal for beginners and quick projects. Scrapy provides powerful features for large-scale operations but requires more expertise. Modern solutions like Firecrawl bridge the gap with AI-powered capabilities that address traditional scraping challenges, though at a cost. ### Key Takeaways - BeautifulSoup4: Best for learning and simple, static websites - Scrapy: Ideal for large-scale projects needing fine control - Firecrawl: Perfect when reliability and low maintenance are priorities - Consider long-term costs and scalability in your decision Choose based on your project’s scale, team expertise, and long-term needs. As websites grow more complex and anti-bot measures evolve, picking the right tool becomes crucial for sustainable web scraping success. ### Useful links - [BeautifulSoup4 Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) - [Scrapy Official Website](https://scrapy.org/) - [Scrapy Documentation](https://docs.scrapy.org/) - [Web Scraping Best Practices](https://www.scrapingbee.com/blog/web-scraping-best-practices/) - [Firecrawl Documentation](https://docs.firecrawl.dev/) - [Getting Started With Firecrawl](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint) Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Bex Tuychiev image](https://www.firecrawl.dev/bex.jpg)\\ Bex Tuychiev@bextuychiev](https://x.com/bextuychiev) Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics ### More articles by Bex Tuychiev [Building an Automated Price Tracking Tool\\ \\ Build an automated e-commerce price tracker in Python. Learn web scraping, price monitoring, and automated alerts using Firecrawl, Streamlit, PostgreSQL.](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python) [Web Scraping Automation: How to Run Scrapers on a Schedule\\ \\ Learn how to automate web scraping in Python using free tools like schedule, asyncio, cron jobs and GitHub Actions. This comprehensive guide covers local and cloud-based scheduling methods to run scrapers reliably in 2025.](https://www.firecrawl.dev/blog/automated-web-scraping-free-2025) [Automated Data Collection - A Comprehensive Guide\\ \\ Learn how to build robust automated data collection systems using modern tools and best practices. This guide covers everything from selecting the right tools to implementing scalable collection pipelines.](https://www.firecrawl.dev/blog/automated-data-collection-guide) [BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python\\ \\ Learn the key differences between BeautifulSoup4 and Scrapy for web scraping in Python. Compare their features, performance, and use cases to choose the right tool for your web scraping needs.](https://www.firecrawl.dev/blog/beautifulsoup4-vs-scrapy-comparison) [How to Build an Automated Competitor Price Monitoring System with Python\\ \\ Learn how to build an automated competitor price monitoring system in Python that tracks prices across e-commerce sites, provides real-time comparisons, and maintains price history using Firecrawl, Streamlit, and GitHub Actions.](https://www.firecrawl.dev/blog/automated-competitor-price-scraping) [Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude\\ \\ Learn how to build a web scraper in Python that gathers company details, funding rounds, and investor information from public sources like Crunchbase using Firecrawl and Claude for automated data collection and analysis.](https://www.firecrawl.dev/blog/crunchbase-scraping-with-firecrawl-claude) [How to Create Custom Instruction Datasets for LLM Fine-tuning\\ \\ Learn how to build high-quality instruction datasets for fine-tuning large language models (LLMs). This guide covers when to create custom datasets, best practices for data collection and curation, and a practical example of building a code documentation dataset.](https://www.firecrawl.dev/blog/custom-instruction-datasets-llm-fine-tuning) [Data Enrichment: A Complete Guide to Enhancing Your Data Quality\\ \\ Learn how to enrich your data quality with a comprehensive guide covering data enrichment tools, best practices, and real-world examples. Discover how to leverage modern solutions like Firecrawl to automate data collection, validation, and integration for better business insights.](https://www.firecrawl.dev/blog/complete-guide-to-data-enrichment) ## Python Web Scraping Projects Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Dec 17, 2024 • [![Bex Tuychiev image](https://www.firecrawl.dev/bex.jpg)Bex Tuychiev](https://x.com/bextuychiev) # 15 Python Web Scraping Projects: From Beginner to Advanced ![15 Python Web Scraping Projects: From Beginner to Advanced image](https://www.firecrawl.dev/images/blog/web_scraping_projects/python-web-scraping-projects.jpg) ## Introduction Web scraping is one of the most powerful tools in a programmer’s arsenal, allowing you to gather data from across the internet automatically. It has countless applications like market research, competitive analysis, [price monitoring](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python), and data-driven decision making. The ability to extract structured data from web pages opens up endless possibilities for automation and analysis. This guide outlines 15 web scraping project ideas in Python that progress from basic concepts to advanced techniques. Each project includes learning objectives, key technical concepts, and a structured development roadmap. While this guide doesn’t provide complete code implementations, it serves as a blueprint for your web scraping journey - helping you understand what to build and how to approach each challenge systematically. Let’s begin by understanding the available tools and setting up our development environment. Then we’ll explore each project outline in detail, giving you a solid foundation to start building your own web scraping solutions. ## Table of Contents 1. [Introduction](https://www.firecrawl.dev/blog/python-web-scraping-projects#introduction) 2. [Prerequisites](https://www.firecrawl.dev/blog/python-web-scraping-projects#prerequisites) - [Required Skills](https://www.firecrawl.dev/blog/python-web-scraping-projects#required-skills) - [Technical Requirements](https://www.firecrawl.dev/blog/python-web-scraping-projects#technical-requirements) - [Optional but Helpful](https://www.firecrawl.dev/blog/python-web-scraping-projects#optional-but-helpful) - [Time Commitment](https://www.firecrawl.dev/blog/python-web-scraping-projects#time-commitment) 3. [Comparing Python Web Scraping Frameworks](https://www.firecrawl.dev/blog/python-web-scraping-projects#comparing-python-web-scraping-frameworks-for-your-projects) - [BeautifulSoup4](https://www.firecrawl.dev/blog/python-web-scraping-projects#beautifulsoup4) - [Selenium](https://www.firecrawl.dev/blog/python-web-scraping-projects#selenium) - [Scrapy](https://www.firecrawl.dev/blog/python-web-scraping-projects#scrapy) - [Firecrawl](https://www.firecrawl.dev/blog/python-web-scraping-projects#firecrawl) 4. [Setting Up Your Web Scraping Environment](https://www.firecrawl.dev/blog/python-web-scraping-projects#setting-up-your-web-scraping-environment) 5. [Beginner Web Scraping Projects](https://www.firecrawl.dev/blog/python-web-scraping-projects#beginner-web-scraping-projects) 1. [Weather Data Scraper](https://www.firecrawl.dev/blog/python-web-scraping-projects#1-weather-data-scraper) 2. [News Headlines Aggregator](https://www.firecrawl.dev/blog/python-web-scraping-projects#2-news-headlines-aggregator) 3. [Book Price Tracker](https://www.firecrawl.dev/blog/python-web-scraping-projects#3-book-price-tracker) 4. [Recipe Collector](https://www.firecrawl.dev/blog/python-web-scraping-projects#4-recipe-collector) 5. [Job Listing Monitor](https://www.firecrawl.dev/blog/python-web-scraping-projects#5-job-listing-monitor) 5. [Intermediate Web Scraping Projects](https://www.firecrawl.dev/blog/python-web-scraping-projects#intermediate-web-scraping-projects) 1. [E-commerce Price Comparison Tool](https://www.firecrawl.dev/blog/python-web-scraping-projects#1-e-commerce-price-comparison-tool) 2. [Social Media Analytics Tool](https://www.firecrawl.dev/blog/python-web-scraping-projects#2-social-media-analytics-tool) 3. [Real Estate Market Analyzer](https://www.firecrawl.dev/blog/python-web-scraping-projects#3-real-estate-market-analyzer) 4. [Academic Research Aggregator](https://www.firecrawl.dev/blog/python-web-scraping-projects#4-academic-research-aggregator) 5. [Financial Market Data Analyzer](https://www.firecrawl.dev/blog/python-web-scraping-projects#5-financial-market-data-analyzer) 6. [Advanced Web Scraping Projects](https://www.firecrawl.dev/blog/python-web-scraping-projects#advanced-web-scraping-projects) 1. [Multi-threaded News Aggregator](https://www.firecrawl.dev/blog/python-web-scraping-projects#1-multi-threaded-news-aggregator) 2. [Distributed Web Archive System](https://www.firecrawl.dev/blog/python-web-scraping-projects#2-distributed-web-archive-system) 3. [Automated Market Research Tool](https://www.firecrawl.dev/blog/python-web-scraping-projects#3-automated-market-research-tool) 4. [Competitive Intelligence Dashboard](https://www.firecrawl.dev/blog/python-web-scraping-projects#4-competitive-intelligence-dashboard) 5. [Full-Stack Scraping Platform](https://www.firecrawl.dev/blog/python-web-scraping-projects#5-full-stack-scraping-platform) 7. [Conclusion](https://www.firecrawl.dev/blog/python-web-scraping-projects#conclusion) ## Prerequisites Before starting with these projects, you should have: ### Required Skills - Basic Python programming experience: - Variables, data types, and operators - Control structures (if/else, loops) - Functions and basic error handling - Working with lists and dictionaries - Reading/writing files - Installing and importing packages - Basic web knowledge: - Understanding of HTML structure - Ability to use browser developer tools (inspect elements) - Basic CSS selectors (class, id, tag selection) - Understanding of URLs and query parameters - Development environment: - Python 3.x installed - Ability to use command line/terminal - Experience with pip package manager - Text editor or IDE (VS Code, PyCharm, etc.) ### Technical Requirements - Computer with internet connection - Modern web browser with developer tools - Python 3.7+ installed - Ability to install Python packages via pip - Basic understanding of virtual environments ### Optional but Helpful - Understanding of: - HTTP methods (GET, POST) - JSON and CSV data formats - Basic regular expressions - Simple database concepts - Git version control - Experience with: - pandas library for data manipulation - Basic data visualization - API interactions - Web browser automation ### Time Commitment - 2-4 hours for setup and environment configuration - 4-8 hours per beginner project - Regular practice for skill improvement If you’re new to web scraping, we recommend starting with the Weather Data Scraper or Recipe Collector projects, as they involve simpler website structures and basic data extraction patterns. The News Headlines Aggregator and Job Listing Monitor projects are more complex and might require additional learning about handling multiple data sources and pagination. ## Comparing Python Web Scraping Frameworks For Your Projects When starting with web scraping in Python, you’ll encounter several popular frameworks. Each has its strengths and ideal use cases. Let’s compare the main options to help you choose the right tool for your needs. ### BeautifulSoup4 BeautifulSoup4 (BS4) is one of the most popular Python libraries for web scraping. It provides a simple and intuitive way to parse HTML and XML documents by creating a parse tree that can be navigated and searched. BS4 excels at extracting data from static web pages where JavaScript rendering isn’t required. The library works by transforming HTML code into a tree of Python objects, making it easy to locate and extract specific elements using methods like `find()` and `find_all()`. While it lacks some advanced features found in other frameworks, its simplicity and ease of use make it an excellent choice for beginners and straightforward scraping tasks. Pros: - Easy to learn and use - Excellent documentation - Great for parsing HTML/XML - Lightweight and minimal dependencies Cons: - No JavaScript rendering - Limited to basic HTML parsing - No built-in download features - Can be slow for large-scale scraping Example usage: ```python from bs4 import BeautifulSoup import requests response = requests.get('https://example.com') soup = BeautifulSoup(response.text, 'html.parser') titles = soup.find_all('h1') ``` ### Selenium Selenium is a powerful web automation framework that can control web browsers programmatically. Originally designed for web application testing, it has become a popular choice for web scraping, especially when dealing with dynamic websites that require JavaScript rendering. Selenium works by automating a real web browser, allowing it to interact with web pages just like a human user would - clicking buttons, filling forms, and handling dynamic content. This makes it particularly useful for scraping modern web applications where content is loaded dynamically through JavaScript. Pros: - Handles JavaScript-rendered content - Supports browser automation - Can interact with web elements - Good for testing and scraping Cons: - Resource-intensive - Slower than other solutions - Requires browser drivers - Complex setup and maintenance Example Usage: ```python from selenium import webdriver from selenium.webdriver.common.by import By driver = webdriver.Chrome() driver.get("https://example.com") elements = driver.find_elements(By.CLASS_NAME, "product-title") ``` ### Scrapy Scrapy is a comprehensive web scraping framework that provides a complete solution for extracting data from websites at scale. It’s designed as a fast, powerful, and extensible framework that can handle complex scraping tasks efficiently. Unlike simpler libraries, Scrapy provides a full suite of features including a crawling engine, data processing pipelines, and middleware components. It follows the principle of “batteries included” while remaining highly customizable for specific needs. Scrapy is particularly well-suited for large-scale scraping projects where performance and reliability are crucial. Pros: - High performance - Built-in pipeline processing - Extensive middleware support - Robust error handling Cons: - Steep learning curve - Complex configuration - Limited JavaScript support - Overkill for simple projects Example Usage: ```python import scrapy class ProductSpider(scrapy.Spider): name = "products" start_urls = ["https://example.com"] def parse(self, response): for product in response.css(".product"): yield { "name": product.css(".title::text").get(), "price": product.css(".price::text").get(), } ``` ### Firecrawl Firecrawl represents a paradigm shift in web scraping by using AI to eliminate traditional scraping bottlenecks. Unlike conventional frameworks that require manual selector maintenance, Firecrawl uses natural language understanding to automatically identify and extract HTML element content based on semantic descriptions. This approach directly addresses the primary challenges faced in the projects outlined in this guide: 1. Development speed - Traditional approach: Writing selectors, handling JavaScript, managing anti-bot measures (~2-3 days per site) - Firecrawl approach: Define data schema, let AI handle extraction (~30 minutes per site) 2. Maintenance requirements - Traditional approach: Regular updates when sites change, selector fixes, anti-bot adaptations - Firecrawl approach: Schema remains stable, AI adapts to site changes automatically 3. Project implementation - For the e-commerce projects: Built-in handling of dynamic pricing, AJAX requests, and anti-bot measures - For news aggregation: Automatic content classification and extraction across different layouts - For market research: Seamless handling of multiple site structures and authentication flows Pros: - AI-powered content extraction eliminates selector maintenance - Automatic handling of JavaScript-rendered content - Built-in anti-bot measures with enterprise-grade reliability - Multiple output formats (JSON, CSV, structured objects) - Site change resilience through semantic understanding - Consistent extraction across different page layouts Cons: - Paid service (consider ROI vs. development time) - API-dependent architecture - Less granular control over parsing process - May be overkill for simple, static sites - Slower for large-scale operations Example Implementation: ```python from firecrawl import FirecrawlApp from pydantic import BaseModel, Field class Product(BaseModel): name: str = Field(description="The product name and title") price: float = Field(description="The current price in USD") description: str = Field(description="The product description text") rating: float = Field(description="The average customer rating out of 5 stars") num_reviews: int = Field(description="The total number of customer reviews") availability: str = Field(description="The current availability status") brand: str = Field(description="The product manufacturer or brand") category: str = Field(description="The product category or department") asin: str = Field(description="The Amazon Standard Identification Number") app = FirecrawlApp() data = app.scrape_url( 'https://www.amazon.com/gp/product/1718501900', # A sample Amazon product params={ "formats": ['extract'], "extract": { "schema": Product.model_json_schema() } } ) ``` This example demonstrates how Firecrawl reduces complex e-commerce scraping to a simple schema definition. The same approach applies to all projects in this guide, potentially reducing development time from weeks to days. For production environments where reliability and maintenance efficiency are crucial, this automated approach often proves more cost-effective than maintaining custom scraping infrastructure. * * * Here is a table summarizing the differences between these tools: | Tool | Best For | Learning Curve | Key Features | | --- | --- | --- | --- | | BeautifulSoup4 | Static websites, Beginners | Easy | Simple API, Great documentation | | Selenium | Dynamic websites, Browser automation | Moderate | Full browser control, JavaScript support | | Scrapy | Large-scale projects | Steep | High performance, Extensive features | | Firecrawl | Production use, AI-powered scraping | Easy | Low maintenance, Built-in anti-bot | Useful Resources: - [BeautifulSoup4 documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) - [Selenium documentation](https://www.selenium.dev/documentation/) - [Scrapy documentation](https://docs.scrapy.org/) - [Firecrawl documentation](https://firecrawl.dev/) - [Introduction to web scraping in Python tutorial](https://realpython.com/python-web-scraping-practical-introduction/) With these tools and resources at your disposal, you’re ready to start exploring web scraping in Python. Let’s move on to setting up your environment. ## Setting Up Your Web Scraping Environment Before diving into the projects, let’s set up our Python environment with the necessary tools and libraries. We’ll create a virtual environment and install the required packages. 1. Create and activate a virtual environment ```bash # Create a new virtual environment python -m venv scraping-env # Activate virtual environment # On Windows: scraping-env\Scripts\activate # On macOS/Linux: source scraping-env/bin/activate ``` 2. Install Required Packages ```bash pip install requests beautifulsoup4 selenium scrapy firecrawl-py pandas ``` 3. Additional Setup for Selenium If you plan to use Selenium, you’ll need to install a webdriver. For Chrome: ```bash pip install webdriver-manager ``` 4. Basic Project Structure Create a basic project structure to organize your code: ```bash mkdir web_scraping_projects cd web_scraping_projects touch requirements.txt ``` Add the dependencies to `requirements.txt`: ```text requests>=2.31.0 beautifulsoup4>=4.12.2 selenium>=4.15.2 scrapy>=2.11.0 firecrawl-py>=0.1.0 pandas>=2.1.3 webdriver-manager>=4.0.1 ``` 5. Important Notes - Always check a website’s robots.txt file before scraping - Implement proper delays between requests (rate limiting) - Consider using a user agent string to identify your scraper - Handle errors and exceptions appropriately - Store your API keys and sensitive data in environment variables With this environment set up, you’ll be ready to tackle any of the projects in this tutorial, from beginner to advanced level. Each project may require additional specific setup steps, which will be covered in their respective sections. ## Beginner Web Scraping Projects Let’s start with some beginner-friendly web scraping projects that will help you build foundational skills. ### 1\. Weather Data Scraper A real-time weather data scraper for weather.com extracts temperature, humidity, wind speed and precipitation forecasts. The project serves as an introduction to fundamental web scraping concepts including HTTP requests, HTML parsing, and error handling. This beginner-friendly project demonstrates proper web scraping practices through practical application, with opportunities to expand into historical trend analysis and multi-location comparisons. The core focus is on DOM navigation, rate limiting implementation, and efficient data storage techniques. **Learning objectives**: - Understanding HTML structure and basic DOM elements - Making HTTP requests - Parsing simple HTML responses - Handling basic error cases **Proposed project steps**: 1. Set up your development environment: - Install required libraries (requests, beautifulsoup4) - Create a new Python script file - Configure your IDE/editor 2. Analyze the weather website structure: - Open browser developer tools (F12) - Inspect HTML elements for weather data - Document CSS selectors for key elements - Check robots.txt for scraping permissions 3. Build the basic scraper structure: - Create a WeatherScraper class - Add methods for making HTTP requests - Implement user agent rotation - Add request delay functionality 4. Implement data extraction: - Write methods to parse temperature - Extract humidity percentage - Get wind speed and direction - Collect precipitation forecast - Parse “feels like” temperature - Get weather condition description 5. Add error handling and validation: - Implement request timeout handling - Add retry logic for failed requests - Validate extracted data types - Handle missing data scenarios - Log errors and exceptions 6. Create data storage functionality: - Design CSV file structure - Implement data cleaning - Add timestamp to records - Create append vs overwrite options - Include location information 7. Test and refine: - Test with multiple locations - Verify data accuracy - Optimize request patterns - Add data validation checks - Document known limitations **Key concepts to learn**: - HTTP requests and responses - HTML parsing basics - CSS selectors and HTML class/id attributes - Data extraction patterns - Basic error handling **Website suggestions**: - [weather.com](https://weather.com/) \- Main weather data source with comprehensive information - [accuweather.com](https://accuweather.com/) \- Alternative source with detailed forecasts - [weatherunderground.com](https://weatherunderground.com/) \- Community-driven weather data - [openweathermap.org](https://openweathermap.org/) \- Free API available for learning - [forecast.weather.gov](https://forecast.weather.gov/) \- Official US weather data source ### 2\. News Headlines Aggregator A news headline aggregation system that pulls together breaking stories and trending content from multiple online news sources. The automated scraping engine visits major news websites on a schedule, extracting headlines, metadata, and key details into a unified data stream. The consolidated feed gives users a single interface to monitor news across publishers while handling the complexity of different site structures, update frequencies, and content formats behind the scenes. **Learning Objectives**: - Working with multiple data sources - Handling different HTML structures - Implementing proper delays between requests - Basic data deduplication **Project steps**: 1. Initial website selection and analysis - Choose 2-3 news websites from suggested list - Document each site’s robots.txt rules - Identify optimal request intervals - Map out common headline patterns - Note any access restrictions 2. HTML structure analysis - Inspect headline container elements - Document headline text selectors - Locate timestamp information - Find article category/section tags - Map author and source attribution - Identify image thumbnail locations 3. Data model design - Define headline object structure - Create schema for metadata fields - Plan timestamp standardization - Design category classification - Structure source tracking fields - Add URL and unique ID fields 4. Individual scraper development - Build base scraper class - Implement site-specific extractors - Add request delay handling - Include user-agent rotation - Set up error logging - Add data validation checks 5. Data processing and storage - Implement text cleaning - Normalize timestamps - Remove duplicate headlines - Filter unwanted content - Create CSV/JSON export - Set up incremental updates 6. Integration and testing - Combine multiple scrapers - Add master scheduler - Test with different intervals - Validate combined output - Monitor performance - Document limitations **Key concepts to learn**: - Rate limiting and polite scraping - Working with multiple websites - Text normalization - Basic data structures for aggregation - Time handling in Python **Website suggestions**: - [reuters.com](https://reuters.com/) \- Major international news agency - [apnews.com](https://apnews.com/) \- Associated Press news wire service - [bbc.com/news](https://bbc.com/news) \- International news coverage - [theguardian.com](https://theguardian.com/) \- Global news with good HTML structure - [aljazeera.com](https://aljazeera.com/) \- International perspective on news ### 3\. Book Price Tracker Develop an automated price monitoring system that continuously scans multiple online bookstores to track price fluctuations for specific books. The tool will maintain a watchlist of titles, periodically check their current prices, and notify users when prices drop below certain thresholds or when significant discounts become available. This enables book enthusiasts to make cost-effective purchasing decisions by capitalizing on temporary price reductions across different retailers. **Learning objectives**: - Persistent data storage - Price extraction and normalization - Basic automation concepts - Simple alert systems **Project steps**: 1. Analyze target bookstores - Research and select online bookstores to monitor - Study website structures and price display patterns - Document required headers and request parameters - Test rate limits and access restrictions 2. Design data storage - Create database tables for books and price history - Define schema for watchlists and price thresholds - Plan price tracking and comparison logic - Set up automated backups 3. Build price extraction system - Implement separate scrapers for each bookstore - Extract prices, availability and seller info - Handle different currencies and formats - Add error handling and retries - Validate extracted data 4. Implement automation - Set up scheduled price checks - Configure appropriate delays between requests - Track successful/failed checks - Implement retry logic for failures - Monitor system performance 5. Add notification system - Create price threshold triggers - Set up email notifications - Add price drop alerts - Generate price history reports - Allow customizable alert preferences **Key concepts to learn**: - Database basics (SQLite or similar) - Regular expressions for price extraction - Scheduling with Python - Email notifications - Data comparison logic **Website suggestions**: - [amazon.com](https://amazon.com/) \- Large selection and dynamic pricing - [bookdepository.com](https://bookdepository.com/) \- International book retailer - [barnesandnoble.com](https://barnesandnoble.com/) \- Major US book retailer - [abebooks.com](https://abebooks.com/) \- Used and rare books marketplace - [bookfinder.com](https://bookfinder.com/) \- Book price comparison site ### 4\. Recipe Collector Build an automated recipe scraping tool that collects detailed cooking information from food websites. The system will extract comprehensive recipe data including ingredient lists with measurements, step-by-step preparation instructions, cooking durations, serving sizes, and nutritional facts. This tool enables home cooks to easily aggregate and organize recipes from multiple sources into a standardized format. **Learning objectives**: - Handling nested HTML structures - Extracting structured data - Text cleaning and normalization - Working with lists and complex data types **Project steps**: 1. Analyze recipe website structures - Study HTML structure of target recipe sites - Identify common patterns for recipe components - Document CSS selectors and XPaths for key elements - Map variations between different sites 2. Design a recipe data model - Create database schema for recipes - Define fields for ingredients, instructions, metadata - Plan data types and relationships - Add support for images and rich media - Include tags and categories 3. Implement extraction logic for recipe components - Build scrapers for each target website - Extract recipe title and description - Parse ingredient lists with quantities and units - Capture step-by-step instructions - Get cooking times and temperatures - Collect serving size information - Extract nutritional data - Download recipe images 4. Clean and normalize extracted data - Standardize ingredient measurements - Convert temperature units - Normalize cooking durations - Clean up formatting and special characters - Handle missing or incomplete data - Validate data consistency - Remove duplicate recipes 5. Store recipes in a structured format - Save to SQL/NoSQL database - Export options to JSON/YAML - Generate printable recipe cards - Add search and filtering capabilities - Implement recipe categorization - Create backup system **Key concepts to learn**: - Complex HTML navigation - Data cleaning techniques - JSON/YAML data formats - Nested data structures - Text processing **Website suggestions**: - [allrecipes.com](https://allrecipes.com/) \- Large recipe database - [foodnetwork.com](https://foodnetwork.com/) \- Professional recipes - [epicurious.com](https://epicurious.com/) \- Curated recipe collection - [simplyrecipes.com](https://simplyrecipes.com/) \- Well-structured recipes - [food.com](https://food.com/) \- User-submitted recipes ### 5\. Job Listing Monitor Create an automated job search monitoring tool that continuously scans multiple job listing websites for new positions matching user-defined criteria. The tool will track key details like job titles, companies, locations, salaries, and requirements. Users can specify search filters such as keywords, experience level, job type (remote/hybrid/onsite), and salary range. The system will store listings in a database and notify users of new matches via email or other alerts. This helps job seekers stay on top of opportunities without manually checking multiple sites. The tool can integrate with major job boards like LinkedIn, Indeed, Glassdoor and company career pages. It will handle different site structures, login requirements, and listing formats while respecting rate limits and terms of service. Advanced features could include sentiment analysis of job descriptions, automatic resume submission, and tracking application status across multiple positions. **Learning objectives**: - Working with search parameters - Handling pagination - Form submission - Data filtering **Project steps**: 1. Set up initial project structure and dependencies - Create virtual environment - Install required libraries - Set up database (SQLite/PostgreSQL) - Configure logging and error handling - Set up email notification system 2. Implement site-specific scrapers - Analyze HTML structure of each job board - Handle authentication if required - Create separate scraper classes for each site (one is enough if you are using Firecrawl) - Implement rate limiting and rotating user agents - Add proxy support for avoiding IP blocks - Handle JavaScript-rendered content with Selenium (no need if you are using Firecrawl) 3. Build search parameter system - Create configuration for search criteria - Implement URL parameter generation - Handle different parameter formats per site - Add validation for search inputs - Support multiple search profiles - Implement location-based searching 4. Develop listing extraction logic - Extract job details (title, company, location, etc) - Parse salary information - Clean and standardize data format - Handle missing/incomplete data - Extract application requirements - Identify remote/hybrid/onsite status - Parse required skills and experience 5. Create storage and monitoring system - Design database schema - Implement data deduplication - Track listing history/changes - Set up automated monitoring schedule - Create email alert templates - Build basic web interface for results - Add export functionality **Key concepts to learn**: - URL parameters and query strings - HTML forms and POST requests - Pagination handling - Data filtering techniques - Incremental data updates **Website suggestions**: - [linkedin.com](https://linkedin.com/) \- Professional networking and job site - [indeed.com](https://indeed.com/) \- Large job search engine - [glassdoor.com](https://glassdoor.com/) \- Company reviews and job listings - [monster.com](https://monster.com/) \- Global job search platform - [dice.com](https://dice.com/) \- Technology job board - [careerbuilder.com](https://careerbuilder.com/) \- Major US job site ## Intermediate Web Scraping Projects These projects build upon basic scraping concepts and introduce more complex scenarios and techniques. ### 1\. E-commerce Price Comparison Tool Build a sophisticated price comparison system monitoring major e-commerce platforms like Amazon, eBay, Walmart and Best Buy. The tool tracks products via SKUs and model numbers, scraping pricing data at configurable intervals. It normalizes data by mapping equivalent items and standardizing prices, shipping costs, and seller information across platforms. A dashboard interface displays historical price trends, sends price drop alerts via email/SMS, and recommends optimal purchase timing based on seasonal patterns and historical lows. The system handles JavaScript-rendered content, dynamic AJAX requests, and anti-bot measures while maintaining data in both SQL and NoSQL stores. Key technical challenges include managing product variants, currency conversion, and adapting to frequent site layout changes while ensuring data accuracy and consistency. Read our separate guide on [building an Amazon price tracking application](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python) using Firecrawl for the basic version of this project. **Learning objectives**: - Multi-site data aggregation - Price normalization techniques - Advanced rate limiting - Proxy rotation - Database optimization **Project steps**: 1. Design system architecture - Plan database schema for products and prices - Design API structure for data access - Set up proxy management system - Configure rate limiting rules - Plan data update intervals 2. Implement core scraping functionality - Create base scraper class - Add proxy rotation mechanism - Implement user agent rotation - Set up request queuing - Add retry logic - Handle JavaScript rendering - Configure session management 3. Build product matching system - Implement product identification - Create fuzzy matching algorithms - Handle variant products - Normalize product names - Match product specifications - Track product availability 4. Develop price analysis features - Track historical prices - Calculate price trends - Identify price patterns - Generate price alerts - Create price prediction models - Compare shipping costs - Track discount patterns 5. Create reporting system - Build price comparison reports - Generate trend analysis - Create price alert notifications - Export data in multiple formats - Schedule automated reports - Track price history **Key concepts to learn**: - Advanced rate limiting - Proxy management - Product matching algorithms - Price normalization - Historical data tracking **Website suggestions**: - [amazon.com](https://amazon.com/) \- Large product database - [walmart.com](https://walmart.com/) \- Major retailer - [bestbuy.com](https://bestbuy.com/) \- Electronics focus - [target.com](https://target.com/) \- Retail products - [newegg.com](https://newegg.com/) \- Tech products ### 2\. Social Media Analytics Tool Build a comprehensive social media analytics platform that combines web scraping, API integration, and real-time monitoring capabilities. The system will aggregate engagement metrics and content across major social networks, process JavaScript-heavy pages, and provide actionable insights through customizable dashboards. Key features include sentiment analysis of comments, competitive benchmarking, and automated trend detection. The tool emphasizes scalable data collection while respecting rate limits and platform terms of service. **Learning objectives**: - JavaScript rendering - API integration - Real-time monitoring - Data visualization - Engagement metrics analysis **Project steps**: 1. Platform analysis and setup - Research API limitations - Document scraping restrictions - Set up authentication - Plan data collection strategy - Configure monitoring intervals 2. Implement data collection - Create platform-specific scrapers - Handle JavaScript rendering - Implement API calls - Track rate limits - Monitor API quotas - Handle pagination - Collect media content 3. Build analytics engine - Calculate engagement rates - Track follower growth - Analyze posting patterns - Monitor hashtag performance - Measure audience interaction - Generate sentiment analysis - Track competitor metrics 4. Develop visualization system - Create interactive dashboards - Generate trend graphs - Build comparison charts - Display real-time metrics - Create export options - Generate automated reports 5. Add monitoring features - Set up real-time tracking - Create alert system - Monitor competitor activity - Track brand mentions - Generate periodic reports - Implement custom metrics **Key concepts to learn**: - API integration - Real-time data collection - Engagement metrics - Data visualization - JavaScript handling **Website suggestions**: - [twitter.com](https://twitter.com/) \- Real-time social updates - [instagram.com](https://instagram.com/) \- Visual content platform - [facebook.com](https://facebook.com/) \- Social networking - [linkedin.com](https://linkedin.com/) \- Professional network - [reddit.com](https://reddit.com/) \- Community discussions ### 3\. Real Estate Market Analyzer Develop a comprehensive real estate market analysis tool that collects and analyzes property listings from multiple sources. The system will track prices, property features, market trends, and neighborhood statistics to provide insights into real estate market conditions. This project focuses on handling pagination, geographic data, and large datasets. **Learning objectives**: - Geographic data handling - Advanced pagination - Data relationships - Market analysis - Database optimization **Project steps**: 1. Set up data collection framework - Design database schema - Configure geocoding system - Set up mapping integration - Plan data update frequency - Configure backup system 2. Implement listing collection - Create site-specific scrapers - Handle dynamic loading - Process pagination - Extract property details - Collect images and media - Parse property features - Handle location data 3. Build analysis system - Calculate market trends - Analyze price per square foot - Track inventory levels - Monitor days on market - Compare neighborhood stats - Generate market reports - Create price predictions 4. Develop visualization tools - Create interactive maps - Build trend graphs - Display comparative analysis - Show market indicators - Generate heat maps - Create property reports 5. Add advanced features - Implement search filters - Add custom alerts - Create watchlists - Generate market reports - Track favorite properties - Monitor price changes **Key concepts to learn**: - Geographic data processing - Complex pagination - Data relationships - Market analysis - Mapping integration **Website suggestions**: - [zillow.com](https://zillow.com/) \- Real estate listings - [realtor.com](https://realtor.com/) \- Property database - [trulia.com](https://trulia.com/) \- Housing market data - [redfin.com](https://redfin.com/) \- Real estate platform - [homes.com](https://homes.com/) \- Property listings ### 4\. Academic Research Aggregator Create a comprehensive academic research aggregator that collects scholarly articles, papers, and publications from multiple academic databases and repositories. The system will track research papers, citations, author information, and publication metrics to help researchers stay updated with the latest developments in their field. **Learning objectives**: - PDF parsing and extraction - Citation network analysis - Academic API integration - Complex search parameters - Large dataset management **Project steps**: 1. Source identification and setup - Research academic databases - Document API access requirements - Set up authentication systems - Plan data collection strategy - Configure access protocols - Handle rate limitations 2. Implement data collection - Create database-specific scrapers - Handle PDF downloads - Extract paper metadata - Parse citations - Track author information - Collect publication dates - Handle multiple languages 3. Build citation analysis system - Track citation networks - Calculate impact factors - Analyze author networks - Monitor research trends - Generate citation graphs - Track paper influence - Identify key papers 4. Develop search and filtering - Implement advanced search - Add field-specific filters - Create topic clustering - Enable author tracking - Support boolean queries - Add relevance ranking - Enable export options 5. Create visualization and reporting - Generate citation networks - Create author collaboration maps - Display research trends - Show topic evolution - Create custom reports - Enable data export **Key concepts to learn**: - PDF text extraction - Network analysis - Academic APIs - Complex search logic - Large-scale data processing **Website suggestions**: - [scholar.google.com](https://scholar.google.com/) \- Academic search engine - [arxiv.org](https://arxiv.org/) \- Research paper repository - [sciencedirect.com](https://sciencedirect.com/) \- Scientific publications - [ieee.org](https://ieee.org/) \- Technical papers - [pubmed.gov](https://pubmed.gov/) \- Medical research ### 5\. Financial Market Data Analyzer Build a sophisticated financial market analysis tool that collects and processes data from multiple financial sources including stock markets, cryptocurrency exchanges, and forex platforms. The system will track prices, trading volumes, market indicators, and news sentiment to provide comprehensive market insights. **Learning objectives**: - Real-time data handling - WebSocket connections - Financial calculations - Time series analysis - News sentiment analysis **Project steps**: 1. Data source integration - Set up API connections - Configure WebSocket feeds - Implement rate limiting - Handle authentication - Manage data streams - Plan backup sources 2. Market data collection - Track price movements - Monitor trading volume - Calculate market indicators - Record order book data - Track market depth - Handle multiple exchanges - Process tick data 3. Build analysis engine - Implement technical indicators - Calculate market metrics - Process trading signals - Analyze price patterns - Generate market alerts - Track correlations - Monitor volatility 4. Develop news analysis - Collect financial news - Process news sentiment - Track market impact - Monitor social media - Analyze announcement effects - Generate news alerts 5. Create visualization system - Build price charts - Display market indicators - Show volume analysis - Create correlation maps - Generate trading signals - Enable custom dashboards **Key concepts to learn**: - WebSocket programming - Real-time data processing - Financial calculations - Market analysis - News sentiment analysis **Website suggestions**: - [finance.yahoo.com](https://finance.yahoo.com/) \- Financial data - [marketwatch.com](https://marketwatch.com/) \- Market news - [investing.com](https://investing.com/) \- Trading data - [tradingview.com](https://tradingview.com/) \- Technical analysis - [coinmarketcap.com](https://coinmarketcap.com/) \- Crypto markets ## Advanced Web Scraping Projects These projects represent complex, production-grade applications that combine multiple advanced concepts and require sophisticated architecture decisions. They’re ideal for developers who have mastered basic and intermediate scraping techniques. ### 1\. Multi-threaded News Aggregator Build an enterprise-grade news aggregation system that uses concurrent processing to efficiently collect and analyze news from hundreds of sources simultaneously. The system will handle rate limiting, proxy rotation, and load balancing while maintaining high throughput and data accuracy. This project focuses on scalability and performance optimization. **Learning objectives**: - Concurrent programming - Thread/Process management - Queue systems - Load balancing - Performance optimization **Project steps**: 1. Design concurrent architecture - Plan threading strategy - Design queue system - Configure worker pools - Set up load balancing - Plan error handling - Implement logging system - Design monitoring tools 2. Build core scraping engine - Create worker threads - Implement task queue - Set up proxy rotation - Handle rate limiting - Manage session pools - Configure retries - Monitor performance 3. Develop content processing - Implement NLP analysis - Extract key information - Classify content - Detect duplicates - Process media content - Handle multiple languages - Generate summaries 4. Create storage and indexing - Design database sharding - Implement caching - Set up search indexing - Manage data retention - Handle data validation - Configure backups - Optimize queries 5. Build monitoring system - Track worker status - Monitor queue health - Measure throughput - Track error rates - Generate alerts - Create dashboards - Log performance metrics **Key concepts to learn**: - Thread synchronization - Queue management - Resource pooling - Performance monitoring - System optimization **Website suggestions**: - [reuters.com](https://reuters.com/) \- International news - [apnews.com](https://apnews.com/) \- News wire service - [bloomberg.com](https://bloomberg.com/) \- Financial news - [nytimes.com](https://nytimes.com/) \- News articles - [wsj.com](https://wsj.com/) \- Business news ### 2\. Distributed Web Archive System Build a distributed web archiving system that preserves historical versions of websites across a network of nodes. The system will handle massive-scale crawling, content deduplication, versioning, and provide a searchable interface to access archived content. Think of it as building your own Internet Archive Wayback Machine with distributed architecture. **Learning objectives**: - Distributed systems architecture - Content-addressable storage - Version control concepts - Distributed crawling - Large-scale search **Project steps**: 1. Design distributed architecture - Plan node communication - Design content addressing - Configure storage sharding - Implement consensus protocol - Set up service discovery - Plan failure recovery - Design replication strategy 2. Build core archiving engine - Implement snapshot system - Handle resource capturing - Process embedded content - Manage asset dependencies - Create versioning system - Handle redirects - Implement diff detection 3. Develop distributed crawler - Create crawler nodes - Implement work distribution - Handle URL deduplication - Manage crawl frontiers - Process robots.txt - Configure politeness rules - Monitor node health 4. Create storage and indexing - Implement content hashing - Build merkle trees - Create delta storage - Set up distributed index - Handle data replication - Manage storage quotas - Optimize retrieval 5. Build access interface - Create temporal navigation - Implement diff viewing - Enable full-text search - Build API endpoints - Create admin dashboard - Enable export options - Handle access control **Key concepts to learn**: - Distributed systems - Content addressing - Merkle trees - Consensus protocols - Temporal data models **Technical requirements**: - Distributed database (e.g., Cassandra) - Message queue system (e.g., Kafka) - Search engine (e.g., Elasticsearch) - Content-addressable storage - Load balancers - Service mesh - Monitoring system **Advanced features**: - Temporal graph analysis - Content change detection - Link integrity verification - Resource deduplication - Distributed consensus - Automated preservation - Access control policies This project combines distributed systems concepts with web archiving challenges, requiring deep understanding of both scalable architecture and content preservation techniques. It’s particularly relevant for organizations needing to maintain compliant records of web content or researchers studying web evolution patterns. ### 3\. Automated Market Research Tool Create a comprehensive market research platform that combines web scraping, data analysis, and automated reporting to provide competitive intelligence and market insights. The system will track competitors, analyze market trends, and generate detailed reports automatically. **Learning objectives**: - Large-scale data collection - Advanced analytics - Automated reporting - Competitive analysis - Market intelligence **Project steps**: 1. Design research framework - Define data sources - Plan collection strategy - Design analysis pipeline - Configure reporting system - Set up monitoring - Plan data storage - Configure backup systems 2. Implement data collection - Create source scrapers - Handle authentication - Manage rate limits - Process structured data - Extract unstructured content - Track changes - Validate data quality 3. Build analysis engine - Process market data - Analyze trends - Track competitors - Generate insights - Calculate metrics - Identify patterns - Create predictions 4. Develop reporting system - Generate automated reports - Create visualizations - Build interactive dashboards - Enable customization - Schedule updates - Handle distribution - Track engagement 5. Add intelligence features - Implement trend detection - Create alerts system - Enable custom analysis - Build recommendation engine - Generate insights - Track KPIs - Monitor competition **Key concepts to learn**: - Market analysis - Report automation - Data visualization - Competitive intelligence - Trend analysis **Website suggestions**: - Company websites - Industry news sites - Government databases - Social media platforms - Review sites ### 4\. Competitive Intelligence Dashboard Build a real-time competitive intelligence platform that monitors competitor activities across multiple channels including websites, social media, and news sources. The system will provide automated alerts and analysis of competitive movements in the market. **Learning objectives**: - Real-time monitoring - Complex automation - Data warehousing - Dashboard development - Alert systems **Project steps**: 1. Set up monitoring system - Configure data sources - Set up real-time tracking - Implement change detection - Design alert system - Plan data storage - Configure monitoring rules - Handle authentication 2. Build data collection - Create source scrapers - Handle dynamic content - Process structured data - Extract unstructured content - Track changes - Monitor social media - Collect news mentions 3. Develop analysis engine - Process competitor data - Analyze market position - Track product changes - Monitor pricing - Analyze marketing - Track customer sentiment - Generate insights 4. Create dashboard interface - Build real-time displays - Create interactive charts - Enable custom views - Implement filtering - Add search functionality - Enable data export - Configure alerts 5. Implement alert system - Set up notification rules - Create custom triggers - Handle priority levels - Enable user preferences - Track alert history - Generate summaries - Monitor effectiveness **Key concepts to learn**: - Real-time monitoring - Change detection - Alert systems - Dashboard design - Competitive analysis **Website suggestions**: - Competitor websites - Social media platforms - News aggregators - Review sites - Industry forums ### 5\. Full-Stack Scraping Platform Develop a complete web scraping platform with a user interface that allows non-technical users to create and manage scraping tasks. The system will include visual scraping tools, scheduling, monitoring, and data export capabilities. **Learning objectives**: - Full-stack development - API design - Frontend development - System architecture - User management **Project steps**: 1. Design system architecture - Plan component structure - Design API endpoints - Configure databases - Set up authentication - Plan scaling strategy - Design monitoring - Configure deployment 2. Build backend system - Create API endpoints - Implement authentication - Handle task management - Process scheduling - Manage user data - Handle file storage - Configure security 3. Develop scraping engine - Create scraper framework - Handle different sites - Manage sessions - Process rate limits - Handle errors - Validate data - Monitor performance 4. Create frontend interface - Build user dashboard - Create task manager - Implement scheduling - Show monitoring data - Enable configuration - Handle data export - Display results 5. Add advanced features - Visual scraper builder - Template system - Export options - Notification system - User management - Usage analytics - API documentation **Key concepts to learn**: - System architecture - API development - Frontend frameworks - User management - Deployment **Website suggestions**: - Any website (platform should be generic) - Test sites for development - Documentation resources - API references - Example targets ## Conclusion Web scraping is a powerful skill that opens up endless possibilities for data collection and analysis. Through these 15 projects, ranging from basic weather scrapers to advanced AI-powered content extraction systems, you’ve seen how web scraping can be applied to solve real-world problems across different domains. Key takeaways from these projects include: - Start with simpler projects to build foundational skills - Progress gradually to more complex architectures - Focus on ethical scraping practices and website policies - Use appropriate tools based on project requirements - Implement proper error handling and data validation - Consider scalability and maintenance from the start Whether you’re building a simple price tracker or a full-scale market intelligence platform, the principles and techniques covered in these projects will serve as a solid foundation for your web scraping journey. Remember to always check robots.txt files, implement appropriate delays, and respect website terms of service while scraping. For your next steps, pick a project that aligns with your current skill level and start building. The best way to learn web scraping is through hands-on practice and real-world applications. As you gain confidence, gradually tackle more complex projects and keep exploring new tools and techniques in this ever-evolving field. Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Bex Tuychiev image](https://www.firecrawl.dev/bex.jpg)\\ Bex Tuychiev@bextuychiev](https://x.com/bextuychiev) Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics ### More articles by Bex Tuychiev [Building an Automated Price Tracking Tool\\ \\ Build an automated e-commerce price tracker in Python. Learn web scraping, price monitoring, and automated alerts using Firecrawl, Streamlit, PostgreSQL.](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python) [Web Scraping Automation: How to Run Scrapers on a Schedule\\ \\ Learn how to automate web scraping in Python using free tools like schedule, asyncio, cron jobs and GitHub Actions. This comprehensive guide covers local and cloud-based scheduling methods to run scrapers reliably in 2025.](https://www.firecrawl.dev/blog/automated-web-scraping-free-2025) [Automated Data Collection - A Comprehensive Guide\\ \\ Learn how to build robust automated data collection systems using modern tools and best practices. This guide covers everything from selecting the right tools to implementing scalable collection pipelines.](https://www.firecrawl.dev/blog/automated-data-collection-guide) [BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python\\ \\ Learn the key differences between BeautifulSoup4 and Scrapy for web scraping in Python. Compare their features, performance, and use cases to choose the right tool for your web scraping needs.](https://www.firecrawl.dev/blog/beautifulsoup4-vs-scrapy-comparison) [How to Build an Automated Competitor Price Monitoring System with Python\\ \\ Learn how to build an automated competitor price monitoring system in Python that tracks prices across e-commerce sites, provides real-time comparisons, and maintains price history using Firecrawl, Streamlit, and GitHub Actions.](https://www.firecrawl.dev/blog/automated-competitor-price-scraping) [Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude\\ \\ Learn how to build a web scraper in Python that gathers company details, funding rounds, and investor information from public sources like Crunchbase using Firecrawl and Claude for automated data collection and analysis.](https://www.firecrawl.dev/blog/crunchbase-scraping-with-firecrawl-claude) [How to Create Custom Instruction Datasets for LLM Fine-tuning\\ \\ Learn how to build high-quality instruction datasets for fine-tuning large language models (LLMs). This guide covers when to create custom datasets, best practices for data collection and curation, and a practical example of building a code documentation dataset.](https://www.firecrawl.dev/blog/custom-instruction-datasets-llm-fine-tuning) [Data Enrichment: A Complete Guide to Enhancing Your Data Quality\\ \\ Learn how to enrich your data quality with a comprehensive guide covering data enrichment tools, best practices, and real-world examples. Discover how to leverage modern solutions like Firecrawl to automate data collection, validation, and integration for better business insights.](https://www.firecrawl.dev/blog/complete-guide-to-data-enrichment) ## Mastering Firecrawl Scrape API Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Nov 25, 2024 • [![Bex Tuychiev image](https://www.firecrawl.dev/bex.jpg)Bex Tuychiev](https://x.com/bextuychiev) # How to Use Firecrawl's Scrape API: Complete Web Scraping Tutorial ![How to Use Firecrawl's Scrape API: Complete Web Scraping Tutorial image](https://www.firecrawl.dev/images/blog/scrape-masterclass/mastering-scrape.jpg) ## Getting Started with Modern Web Scraping: An Introduction Traditional web scraping offers unique challenges. Relevant information is often scattered across multiple pages containing complex elements like code blocks, iframes, and media. JavaScript-heavy websites and authentication requirements add additional complexity to the scraping process. Even after successfully scraping, the content requires specific formatting to be useful for downstream processes like data engineering or training AI and machine learning models. Firecrawl addresses these challenges by providing a specialized scraping solution. Its [`/scrape` endpoint](https://docs.firecrawl.dev/features/scrape) offers features like JavaScript rendering, automatic content extraction, bypassing blockers and flexible output formats that make it easier to collect high-quality information and training data at scale. In this guide, we’ll explore how to effectively use Firecrawl’s `/scrape` endpoint to extract structured data from static and dynamic websites. We’ll start with basic scraping setup and then dive into a real-world example of scraping weather data from weather.com, demonstrating how to handle JavaScript-based interactions, extract structured data using schemas, and capture screenshots during the scraping process. ## Table of Contents - [Getting Started with Modern Web Scraping: An Introduction](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint#getting-started-with-modern-web-scraping-an-introduction) - [What Is Firecrawl’s `/scrape` Endpoint? The Short Answer](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint#what-is-firecrawls-scrape-endpoint-the-short-answer) - [Prerequisites: Setting Up Firecrawl](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint#prerequisites-setting-up-firecrawl) - [Basic Scraping Setup](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint#basic-scraping-setup) - [Large-scale Scraping With Batch Operations](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint#large-scale-scraping-with-batch-operations) - [Batch Scraping with `batch_scrape_urls`](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint#batch-scraping-with-batch_scrape_urls) - [Asynchronous batch scraping with `async_batch_scrape_urls`](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint#asynchronous-batch-scraping-with-async_batch_scrape_urls) - [How to Scrape Dynamic JavaScript Websites](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint#how-to-scrape-dynamic-javascript-websites) - [Conclusion](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint#conclusion) ## What Is Firecrawl’s `/scrape` Endpoint? The Short Answer The `/scrape` endpoint is Firecrawl’s core web scraping API that enables automated extraction of content from any webpage. It handles common web scraping challenges like: - JavaScript rendering - Executes JavaScript to capture dynamically loaded content - Content extraction - Automatically identifies and extracts main content while filtering out noise - Format conversion - Converts HTML to clean formats like Markdown or structured JSON - Screenshot capture - Takes full or partial page screenshots during scraping - Browser automation - Supports clicking, typing and other browser interactions - Anti-bot bypass - Uses rotating proxies and browser fingerprinting to avoid blocks The endpoint accepts a URL and configuration parameters, then returns the scraped content in your desired format. It’s designed to be flexible enough for both simple static page scraping and complex dynamic site automation. Now that we understand what the endpoint does at a high level, let’s look at how to set it up and start using it in practice. ## Prerequisites: Setting Up Firecrawl Firecrawl’s scraping engine is exposed as a REST API, so you can use command-line tools like cURL to use it. However, for a more comfortable experience, better flexibility and control, I recommend using one of its SDKs for Python, Node, Rust or Go. This tutorial will focus on the Python version. To get started, please make sure to: 1. Sign up at [firecrawl.dev](https://www.firecrawl.dev/). 2. Choose a plan (the free one will work fine for this tutorial). Once you sign up, you will be given an API token which you can copy from your [dashboard](https://www.firecrawl.dev/app). The best way to save your key is by using a `.env` file, ideal for the purposes of this article: ```bash touch .env echo "FIRECRAWL_API_KEY='YOUR_API_KEY'" >> .env ``` Now, let’s install Firecrawl Python SDK, `python-dotenv` to read `.env` files, and Pandas for data analysis later: ```bash pip install firecrawl-py python-dotenv pandas ``` ## Basic Scraping Setup Scraping with Firecrawl starts by creating an instance of the `FirecrawlApp` class: ```python from firecrawl import FirecrawlApp from dotenv import load_dotenv load_dotenv() app = FirecrawlApp() ``` When you use the `load_dotenv()` function, the app can automatically use your loaded API key to establish a connection with the scraping engine. Then, scraping any URL takes a single line of code: ```python url = "https://arxiv.org" data = app.scrape_url(url) ``` Let’s take a look at the response format returned by `scrape_url` method: ```python data['metadata'] ``` ```json { "title": "arXiv.org e-Print archiveopen searchopen navigation menucontact arXivsubscribe to arXiv mailings", "language": "en", "ogLocaleAlternate": [], "viewport": "width=device-width, initial-scale=1", "msapplication-TileColor": "#da532c", "theme-color": "#ffffff", "sourceURL": "[https://arxiv.org](https://arxiv.org)", "url": "[https://arxiv.org/](https://arxiv.org/)", "statusCode": 200 } ``` The response `metadata` includes basic information like the page title, viewport settings and a status code. Now, let’s look at the scraped contents, which is converted into `markdown` by default: ```python from IPython.display import Markdown Markdown(data['markdown'][:500]) ``` ```text arXiv is a free distribution service and an open-access archive for nearly 2.4 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Materials on this site are not peer-reviewed by arXiv. Subject search and browse: Physics Mathematics Quantitative Biology Computer Science Quantitative Finance Statistics Electrical Engineering and Systems Scienc ``` The response can include several other formats that we can request when scraping a URL. Let’s try requesting multiple formats at once to see what additional data we can get back: ```python data = app.scrape_url( url, params={ 'formats': [\ 'html',\ 'rawHtml',\ 'links',\ 'screenshot',\ ] } ) ``` Here is what these formats scrape: - **HTML**: The raw HTML content of the page. - **rawHtml**: The unprocessed HTML content, exactly as it appears on the page. - **links**: A list of all the hyperlinks found on the page. - **screenshot**: An image capture of the page as it appears in a browser. The HTML format is useful for developers who need to analyze or manipulate the raw structure of a webpage. The `rawHtml` format is ideal for cases where the exact original HTML content is required, such as for archival purposes or detailed comparison. The links format is beneficial for SEO specialists and web crawlers who need to extract and analyze all hyperlinks on a page. The screenshot format is perfect for visual documentation, quality assurance, and capturing the appearance of a webpage at a specific point in time. Passing more than one scraping format to `params` adds additional keys to the response: ```python data.keys() ``` ```text dict_keys(['rawHtml', 'screenshot', 'metadata', 'html', 'links']) ``` Let’s display the screenshot Firecrawl took of arXiv.org: ```python from IPython.display import Image Image(data['screenshot']) ``` ![Screenshot of arXiv.org homepage that was taken with Firecrawl's screenshot feature showing research paper categories like Computer Science, Mathematics, Physics and other scientific disciplines](https://www.firecrawl.dev/images/blog/scrape-masterclass/notebook_files/notebook_20_0.png) Notice how the screenshot is cropped to fit a certain viewport. For most pages, it is better to capture the entire screen by using the `screenshot@fullPage` format: ```python data = app.scrape_url( url, params={ "formats": [\ "screenshot@fullPage",\ ] } ) Image(data['screenshot']) ``` ![Full page screenshot of arXiv.org homepage taken with Firecrawl's full-page screenshot capture feature showing research paper categories, search functionality, and recent submissions in an academic layout](https://www.firecrawl.dev/images/blog/scrape-masterclass/notebook_files/notebook_22_0.png) As a bonus, the `/scrape` endpoint can handle PDF links as well: ```python pdf_link = "https://arxiv.org/pdf/2411.09833.pdf" data = app.scrape_url(pdf_link) Markdown(data['markdown'][:500]) ``` ```text arXiv:2411.09833v1 \[math.DG\] 14 Nov 2024 EINSTEIN METRICS ON THE FULL FLAG F(N). MIKHAIL R. GUZMAN Abstract.LetM=G/Kbe a full flag manifold. In this work, we investigate theG- stability of Einstein metrics onMand analyze their stability types, including coindices, for several cases. We specifically focus onF(n) = SU(n)/T, emphasizingn= 5, where we identify four new Einstein metrics in addition to known ones. Stability data, including coindex and Hessian spectrum, confirms that these metrics on ``` ### Further Scrape Configuration Options By default, `scrape_url` converts everything it sees on a webpage to one of the specified formats. To control this behavior, Firecrawl offers the following parameters: - `onlyMainContent` - `includeTags` - `excludeTags` `onlyMainContent` excludes the navigation, footers, headers, etc. and is set to True by default. `includeTags` and `excludeTags` can be used to allowlist/blocklist certain HTML elements: ```python url = "https://arxiv.org" data = app.scrape_url(url, params={"includeTags": ["p"], "excludeTags": ["span"]}) Markdown(data['markdown'][:1000]) ``` ```markdown [Help](https://info.arxiv.org/help) \| [Advanced Search](https://arxiv.org/search/advanced) arXiv is a free distribution service and an open-access archive for nearly 2.4 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Materials on this site are not peer-reviewed by arXiv. [arXiv Operational Status](https://status.arxiv.org) Get status notifications via [email](https://subscribe.sorryapp.com/24846f03/email/new) or [slack](https://subscribe.sorryapp.com/24846f03/slack/new) ``` `includeTags` and `excludeTags` also support referring to HTML elements by their `#id` or `.class-name`. These configuration options help ensure efficient and precise scraping. While `onlyMainContent` filters out peripheral elements, `includeTags` and `excludeTags` enable surgical targeting of specific HTML elements - particularly valuable when dealing with complex webpage structures or when only certain content types are needed. ## Advanced Data Extraction: Structured Techniques Scraping clean, LLM-ready data is the core philosophy of Firecrawl. However, certain web pages with their complex structures can interfere with this philosophy when scraped in their entirety. For this reason, Firecrawl offers two scraping methods for better structured outputs: 1. Natural language extraction - Use prompts to extract specific information and have an LLM structure the response 2. Manual structured data extraction - Define JSON schemas to have an LLM scrape data in a predefined format In this section, we will cover both methods. ### Natural Language Extraction - Use AI to Extract Data To illustrate natural language scraping, let’s try extracting all news article links that may be related to the 2024 US presidential election from the New York Times: ```python url = "https://nytimes.com" data = app.scrape_url( url, params={ 'formats': ['markdown', 'extract', 'screenshot'], 'extract': { 'prompt': "Return a list of links of news articles that may be about the 2024 US presidential election" } } ) ``` To enable this feature, you are required to pass the `extract` option to the list of `formats` and provide a prompt in a dictionary to a separate `extract` field. Once scraping finishes, the response will include a new `extract` key: ```python data['extract'] ``` ```python {'news_articles': [{'title': 'Harris Loss Has Democrats Fighting Over How to Talk About Transgender Rights',\ 'link': 'https://www.nytimes.com/2024/11/20/us/politics/presidential-campaign-transgender-rights.html'},\ {'title': 'As Democrats Question How to Win Back Latinos, Ruben Gallego Offers Answers',\ 'link': 'https://www.nytimes.com/2024/11/20/us/politics/ruben-gallego-arizona-latino-voters-democrats.html'},\ ...\ {'title': 'The Final Push for Ukraine?',\ 'link': 'https://www.nytimes.com/2024/11/20/briefing/ukraine-russia-trump.html'}]} ``` Due to the nature of this scraping method, the returned output can have arbitrary structure as we can see above. It seems the above output has the following format: ```python { "news_articles": [\ {"title": "article_title", "link": "article_url"},\ ...\ ] } ``` This LLM-based extraction can have endless applications, from extracting specific data points from complex websites to analyzing sentiment across multiple news sources to gathering structured information from unstructured web content. To improve the accuracy of the extraction and give additional instructions, you have the option to include a system prompt to the underlying LLM: ```python data = app.scrape_url( url, params={ 'formats': ['markdown', 'extract'], 'extract': { 'prompt': "Find any mentions of specific dollar amounts or financial figures and return them with their context and article link.", 'systemPrompt': "You are a helpful assistant that extracts numerical financial data." } } ) ``` Above, we are dictating that the LLM must act as an assistant that extracts numerical financial data. Let’s look at its response: ```python data['extract'] ``` ```python {'financial_data': [\ {\ 'amount': 121200000,\ 'context': 'René Magritte became the 16th artist whose work broke the nine-figure '\ 'threshold at auction when his painting sold for $121.2 million.',\ 'article_link': 'https://www.nytimes.com/2024/11/19/arts/design/magritte-surrealism-christies-auction.html'\ },\ {\ 'amount': 5000000,\ 'context': 'Benjamin Netanyahu offers $5 million for each hostage freed in Gaza.',\ 'article_link': 'https://www.nytimes.com/2024/11/19/world/middleeast/israel-5-million-dollars-hostage.html'\ }\ ]} ``` The output shows the LLM successfully extracted two financial data points from the articles. The LLM not only identified the specific amounts but also provided relevant context and source article links for each figure. ### Schema-Based Data Extraction - Building Structured Models While natural language scraping is powerful for exploration and prototyping, production systems typically require more structured and deterministic approaches. LLM responses can vary between runs of the same prompt, making the output format inconsistent and difficult to reliably parse in automated workflows. For this reason, Firecrawl allows you to pass a predefined schema to guide the LLM’s output when transforming the scraped content. To facilitate this feature, Firecrawl uses Pydantic models. In the example below, we will extract only news article links, their titles with some additional details from the New York Times: ```python from pydantic import BaseModel, Field class IndividualArticle(BaseModel): title: str = Field(description="The title of the news article") subtitle: str = Field(description="The subtitle of the news article") url: str = Field(description="The URL of the news article") author: str = Field(description="The author of the news article") date: str = Field(description="The date the news article was published") read_duration: int = Field(description="The estimated time it takes to read the news article") topics: list[str] = Field(description="A list of topics the news article is about") class NewsArticlesSchema(BaseModel): news_articles: list[IndividualArticle] = Field( description="A list of news articles extracted from the page" ) ``` Above, we define a Pydantic schema that specifies the structure of the data we want to extract. The schema consists of two models: `IndividualArticle` defines the structure for individual news articles with fields for: - `title` - `subtitle` - `url` - `author` - `date` - `read_duration` - `topics` `NewsArticlesSchema` acts as a container model that holds a list of `IndividualArticle` objects, representing multiple articles extracted from the page. If we don’t use this container model, Firecrawl will only return the first news article it finds. Each model field uses Pydantic’s `Field` class to provide descriptions that help guide the LLM in correctly identifying and extracting the requested data. This structured approach ensures consistent output formatting. The next step is passing this schema to the `extract` parameter of `scrape_url`: ```python url = "https://nytimes.com" structured_data = app.scrape_url( url, params={ "formats": ["extract", "screenshot"], "extract": { "schema": NewsArticlesSchema.model_json_schema(), "prompt": "Extract the following data from the NY Times homepage: news article title, url, author, date, read_duration for all news articles", "systemPrompt": "You are a helpful assistant that extracts news article data from NY Times.", }, }, ) ``` While passing the schema, we call its `model_json_schema()` method to automatically convert it to valid JSON. Let’s look at the output: ```python structured_data['extract'] ``` ```python { 'news_articles': [\ {\ 'title': 'How Google Spent 15 Years Creating a Culture of Concealment',\ 'subtitle': '',\ 'url': 'https://www.nytimes.com/2024/11/20/technology/google-antitrust-employee-messages.html',\ 'author': 'David Streitfeld',\ 'date': '2024-11-20',\ 'read_duration': 9,\ 'topics': []\ },\ # ... additional articles ...\ {\ 'title': 'The Reintroduction of Daniel Craig',\ 'subtitle': '',\ 'url': 'https://www.nytimes.com/2024/11/20/movies/daniel-craig-queer.html',\ 'author': '',\ 'date': '2024-11-20',\ 'read_duration': 9,\ 'topics': []\ }\ ] } ``` This time, the response fields exactly match the fields we set during schema definition: ```python { "news_articles": [\ {...}, # Article 1\ {...}, # Article 2,\ ... # Article n\ ] } ``` When creating the scraping schema, the following best practices can go a long way in ensuring reliable and accurate data extraction: 1. Keep field names simple and descriptive 2. Use clear field descriptions that guide the LLM 3. Break complex data into smaller, focused fields 4. Include validation rules where possible 5. Consider making optional fields that may not always be present 6. Test the schema with a variety of content examples 7. Iterate and refine based on extraction results To follow these best practices, the following Pydantic tips can help: 1. Use `Field(default=None)` to make fields optional 2. Add validation with `Field(min_length=1, max_length=100)` 3. Create custom validators with @validator decorator 4. Use `conlist()` for list fields with constraints 5. Add example values with `Field(example="Sample text")` 6. Create nested models for complex data structures 7. Use computed fields with `@property` decorator If you follow all these tips, your schema can become quite sophisticated like below: ```python from pydantic import BaseModel, Field from typing import Optional, List from datetime import datetime class Author(BaseModel): # Required field - must be provided when creating an Author name: str = Field( ..., min_length=1, max_length=100, description="The full name of the article author", ) # Optional field - can be None or omitted title: Optional[str] = Field( None, description="Author's title or role, if available" ) class NewsArticle(BaseModel): # Required field - must be provided when creating a NewsArticle title: str = Field( ..., min_length=5, max_length=300, description="The main headline or title of the news article", example="Breaking News: Major Scientific Discovery", ) # Required field - must be provided when creating a NewsArticle url: str = Field( ..., description="The full URL of the article", example="https://www.nytimes.com/2024/01/01/science/discovery.html", ) # Optional field - can be None or omitted authors: Optional[List[Author]] = Field( default=None, description="List of article authors and their details" ) # Optional field - can be None or omitted publish_date: Optional[datetime] = Field( default=None, description="When the article was published" ) # Optional field with default empty list financial_amounts: List[float] = Field( default_factory=list, max_length=10, description="Any monetary amounts mentioned in the article in USD", ) @property def is_recent(self) -> bool: if not self.publish_date: return False return (datetime.now() - self.publish_date).days < 7 ``` The schema above defines two key data models for news article data: Author - Represents article author information with: - `name` (required): The author’s full name - `title` (optional): The author’s role or title NewsArticle - Represents a news article with: - `title` (required): The article headline (5-300 chars) - `url` (required): Full article URL - `authors` (optional): List of Author objects - `publish_date` (optional): Article publication datetime - `financial_amounts` (optional): List of monetary amounts in USD The `NewsArticle` model includes an `is_recent` property that checks if the article was published within the last 7 days. As you can see, web scraping process becomes much easier and more powerful if you combine it with structured data models that validate and organize the scraped information. This allows for consistent data formats, type checking, and easy access to properties like checking if an article is recent. ## Large-scale Scraping With Batch Operations Up to this point, we have been focusing on scraping pages one URL at a time. In reality, you will work with multiple, perhaps, thousands of URLs that need to be scraped in parallel. This is where batch operations become essential for efficient web scraping at scale. Batch operations allow you to process multiple URLs simultaneously, significantly reducing the overall time needed to collect data from multiple web pages. ### Batch Scraping with `batch_scrape_urls` The `batch_scrape_urls` method lets you scrape multiple URLs at once. Let’s scrape all the news article links we obtained from our previous schema extraction example. ```python articles = structured_data['extract']['news_articles'] article_links = [article['url'] for article in articles] class ArticleSummary(BaseModel): title: str = Field(description="The title of the news article") summary: str = Field(description="A short summary of the news article") batch_data = app.batch_scrape_urls(article_links, params={ "formats": ["extract"], "extract": { "schema": ArticleSummary.model_json_schema(), "prompt": "Extract the title of the news article and generate its brief summary", } }) ``` Here is what is happening in the codeblock above: - We extract the list of news articles from our previous structured data result - We create a list of article URLs by mapping over the articles and getting their ‘url’ field - We define an `ArticleSummary` model with title and summary fields to structure our output - We use `batch_scrape_urls()` to process all article URLs in parallel, configuring it to: - Extract data in structured format - Use our `ArticleSummary` schema - Generate titles and summaries based on the article content The response from `batch_scrape_urls()` is a bit different: ```python batch_data.keys() ``` ```python dict_keys(['success', 'status', 'completed', 'total', 'creditsUsed', 'expiresAt', 'data']) ``` It contains the following fields: - `success`: Boolean indicating if the batch request succeeded - `status`: Current status of the batch job - `completed`: Number of URLs processed so far - `total`: Total number of URLs in the batch - `creditsUsed`: Number of API credits consumed - `expiresAt`: When the results will expire - `data`: The extracted data for each URL Let’s focus on the `data` key where the actual content is stored: ```python len(batch_data['data']) ``` ```out 19 ``` The batch processing completed successfully with 19 articles. Let’s examine the structure of the first article: ```python batch_data['data'][0].keys() ``` ````out dict_keys(['extract', 'metadata']) The response format here matches what we get from individual `scrape_url` calls. ```python print(batch_data['data'][0]['extract']) ```` ```out {'title': 'Ukrainian Forces Face Increasing Challenges Amidst Harsh Winter Conditions', 'summary': 'As the war in Ukraine enters its fourth winter, conditions are worsening for Ukrainian soldiers who find themselves trapped on the battlefield, surrounded by Russian forces. Military commanders express concerns over dwindling supplies and increasingly tough situations. The U.S. has recently allowed Ukraine to use American weapons for deeper strikes into Russia, marking a significant development in the ongoing conflict.'} ``` The scraping was performed according to our specifications, extracting the metadata, the title and generating a brief summary. ### Asynchronous batch scraping with `async_batch_scrape_urls` Scraping the 19 NY Times articles in a batch took about 10 seconds on my machine. While that’s not much, in practice, we cannot wait around as Firecrawl batch-scrapes thousands of URLs. For these larger workloads, Firecrawl provides an asynchronous batch scraping API that lets you submit jobs and check their status later, rather than blocking until completion. This is especially useful when integrating web scraping into automated workflows or processing large URL lists. This feature is available through the `async_batch_scrape_urls` method and it works a bit differently: ```python batch_scrape_job = app.async_batch_scrape_urls( article_links, params={ "formats": ["extract"], "extract": { "schema": ArticleSummary.model_json_schema(), "prompt": "Extract the title of the news article and generate its brief summary", }, }, ) ``` When using `async_batch_scrape_urls` instead of the synchronous version, the response comes back immediately rather than waiting for all URLs to be scraped. This allows the program to continue executing while the scraping happens in the background. ```python batch_scrape_job ``` ```python {'success': True, 'id': '77a94b62-c676-4db2-b61b-4681e99f4704', 'url': 'https://api.firecrawl.dev/v1/batch/scrape/77a94b62-c676-4db2-b61b-4681e99f4704'} ``` The response contains an ID belonging the background task that was initiated to process the URLs under the hood. You can use this ID later to check the job’s status with `check_batch_scrape_status` method: ```python batch_scrape_job_status = app.check_batch_scrape_status(batch_scrape_job['id']) batch_scrape_job_status.keys() ``` ```python dict_keys(['success', 'status', 'total', 'completed', 'creditsUsed', 'expiresAt', 'data', 'error', 'next']) ``` If the job finished scraping all URLs, its `status` will be set to `completed`: ```python batch_scrape_job_status['status'] ``` ```out 'completed' ``` Let’s look at how many pages were scraped: ```python batch_scrape_job_status['total'] ``` ```python 19 ``` The response always includes the `data` field, whether the job is complete or not, with the content scraped up to that point. It has `error` and `next` fields to indicate if any errors occurred during scraping and whether there are more results to fetch. ## How to Scrape Dynamic JavaScript Websites Out in the wild, many websites you encounter will be dynamic, meaning their content is generated on-the-fly using JavaScript rather than being pre-rendered on the server. These sites often require user interaction like clicking buttons or typing into forms before displaying their full content. Traditional web scrapers that only look at the initial HTML fail to capture this dynamic content, which is why browser automation capabilities are essential for comprehensive web scraping. Firecrawl supports dynamic scraping by default. In the parameters of `scrape_url` or `batch_scrape_url`, you can define necessary actions to reach the target state of the page you are scraping. As an example, we will build a scraper that will extract the following information from `https://weather.com`: - Current Temperature - Temperature High - Temperature Low - Humidity - Pressure - Visibility - Wind Speed - Dew Point - UV Index - Moon Phase These details are displayed for every city you search through the website: ![Weather.com interface showing detailed weather forecast for London including temperature, humidity, wind speed and other meteorological data in an interactive dashboard layout](https://www.firecrawl.dev/images/blog/scrape-masterclass/notebook_files/image.png) Unlike websites such as Amazon where you can simply modify the URL’s search parameter (e.g. `?search=your-query`), weather.com presents a unique challenge. The site generates dynamic and unique IDs for each city, making traditional URL manipulation techniques ineffective. To scrape weather data for any given city, you must simulate the actual user journey: visiting the homepage, interacting with the search bar, entering the city name, and selecting the appropriate result from the dropdown list. This multi-step interaction process is necessary because of how weather.com structures its dynamic content delivery (at this point, I urge to visit the website and visit a few city pages). Fortunately, Firecrawl natively supports such interactions through the `actions` parameter. It accepts a list of dictionaries, where each dictionary represents one of the following interactions: - Waiting for the page to load - Clicking on an element - Writing text in input fields - Scrolling up/down - Take a screenshot at the current state - Scrape the current state of the webpage Let’s define the actions we need for weather.com: ```python actions = [\ {"type": "wait", "milliseconds": 3000},\ {"type": "click", "selector": 'input[id="LocationSearch_input"]'},\ {"type": "write", "text": "London"},\ {"type": "screenshot"},\ {"type": "wait", "milliseconds": 1000},\ {"type": "click", "selector": "button[data-testid='ctaButton']"},\ {"type": "wait", "milliseconds": 3000},\ ] ``` Let’s examine how we choose the selectors, as this is the most technical aspect of the actions. Using browser developer tools, we inspect the webpage elements to find the appropriate selectors. For the search input field, we locate an element with the ID “LocationSearch\_input”. After entering a city name, we include a 3-second wait to allow the dropdown search results to appear. At this stage, we capture a screenshot for debugging to verify the text input was successful. The final step involves clicking the first matching result, which is identified by a button element with the `data-testid` attribute `ctaButton`. Note that if you’re implementing this in the future, these specific attribute names may have changed - you’ll need to use browser developer tools to find the current correct selectors. Now, let’s define a Pydantic schema to guide the LLM: ```python class WeatherData(BaseModel): location: str = Field(description="The name of the city") temperature: str = Field(description="The current temperature in degrees Fahrenheit") temperature_high: str = Field(description="The high temperature for the day in degrees Fahrenheit") temperature_low: str = Field(description="The low temperature for the day in degrees Fahrenheit") humidity: str = Field(description="The current humidity as a percentage") pressure: str = Field(description="The current air pressure in inches of mercury") visibility: str = Field(description="The current visibility in miles") wind_speed: str = Field(description="The current wind speed in miles per hour") dew_point: str = Field(description="The current dew point in degrees Fahrenheit") uv_index: str = Field(description="The current UV index") moon_phase: str = Field(description="The current moon phase") ``` Finally, let’s pass these objects to `scrape_url`: ```python url = "https://weather.com" data = app.scrape_url( url, params={ "formats": ["screenshot", "markdown", "extract"], "actions": actions, "extract": { "schema": WeatherData.model_json_schema(), "prompt": "Extract the following weather data from the weather.com page: temperature, temperature high, temperature low, humidity, pressure, visibility, wind speed, dew point, UV index, and moon phase", }, }, ) ``` The scraping only happens once all actions are performed. Let’s see if it was successful by looking at the `extract` key: ```python data['extract'] ``` ```python {'location': 'London, England, United Kingdom', 'temperature': '33°', 'temperature_high': '39°', 'temperature_low': '33°', 'humidity': '79%', 'pressure': '29.52in', 'visibility': '10 mi', 'wind_speed': '5 mph', 'dew_point': '28°', 'uv_index': '0 of 11', 'moon_phase': 'Waning Gibbous'} ``` All details are accounted for! But, for illustration, we need to take a closer look at the response structure when using JS-based actions: ```python data.keys() ``` ```python dict_keys(['markdown', 'screenshot', 'actions', 'metadata', 'extract']) ``` The response has a new actions key: ```python data['actions'] ``` ```python {'screenshots': ['https://service.firecrawl.dev/storage/v1/object/public/media/screenshot-16bf71d8-dcb5-47eb-9af4-5fa84195b91d.png'], 'scrapes': []} ``` The actions array contained a single screenshot-generating action, which is reflected in the output above. Let’s look at the screenshot: ```python from IPython.display import Image Image(data['actions']['screenshots'][0]) ``` ![Screenshot of weather.com search interface showing search bar with typed city name, demonstrating automated web scraping process with Firecrawl](https://www.firecrawl.dev/images/blog/scrape-masterclass/notebook_files/notebook_96_0.png) The image shows the stage where the scraper just typed the search query. Now, we have to convert this whole process into a function that works for any given city: ```python from pydantic import BaseModel, Field from typing import Optional, Dict, Any class WeatherData(BaseModel): location: str = Field(description="The name of the city") temperature: str = Field( description="The current temperature in degrees Fahrenheit" ) temperature_high: str = Field( description="The high temperature for the day in degrees Fahrenheit" ) temperature_low: str = Field( description="The low temperature for the day in degrees Fahrenheit" ) humidity: str = Field(description="The current humidity as a percentage") pressure: str = Field(description="The current air pressure in inches of mercury") visibility: str = Field(description="The current visibility in miles") wind_speed: str = Field(description="The current wind speed in miles per hour") dew_point: str = Field(description="The current dew point in degrees Fahrenheit") uv_index: str = Field(description="The current UV index") moon_phase: str = Field(description="The current moon phase") def scrape_weather_data(app: FirecrawlApp, city: str) -> Optional[WeatherData]: try: # Define the actions to search for the city actions = [\ {"type": "wait", "milliseconds": 3000},\ {"type": "click", "selector": 'input[id="LocationSearch_input"]'},\ {"type": "write", "text": city},\ {"type": "wait", "milliseconds": 1000},\ {"type": "click", "selector": "button[data-testid='ctaButton']"},\ {"type": "wait", "milliseconds": 3000},\ ] # Perform the scraping data = app.scrape_url( "https://weather.com", params={ "formats": ["extract"], "actions": actions, "extract": { "schema": WeatherData.model_json_schema(), "prompt": "Extract the following weather data from the weather.com page: temperature, temperature high, temperature low, humidity, pressure, visibility, wind speed, dew point, UV index, and moon phase", }, }, ) # Return the extracted weather data return WeatherData(**data["extract"]) except Exception as e: print(f"Error scraping weather data for {city}: {str(e)}") return None ``` The code is the same but it is wrapped inside a function. Let’s test it on various cities: ```python cities = ["Tashkent", "New York", "Tokyo", "Paris", "Istanbul"] data_full = [] for city in cities: weather_data = scrape_weather_data(app, city) data_full.append(weather_data) ``` We can convert the data for all cities into a DataFrame now: ```python import pandas as pd # Convert list of WeatherData objects into dictionaries data_dicts = [city.model_dump() for city in data_full] # Convert list of dictionaries into DataFrame df = pd.DataFrame(data_dicts) print(df.head()) ``` | location | temperature | temperature\_high | temperature\_low | humidity | pressure | visibility | wind\_speed | dew\_point | uv\_index | moon\_phase | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | Tashkent, Uzbekistan | 48 | 54 | 41 | 81 | 30.30 | 2.5 | 2 | 43 | 0 | Waning Gibbous | | New York City, NY | 48° | 49° | 39° | 93% | 29.45 in | 4 mi | 10 mph | 46° | 0 of 11 | Waning Gibbous | | Tokyo, Tokyo Prefecture, Japan | 47° | 61° | 48° | 95% | 29.94 in | 10 mi | 1 mph | 45° | 0 of 11 | Waning Gibbous | | Paris, France | 34° | 36° | 30° | 93% | 29.42 in | 2.4 mi | 11 mph | 33° | 0 of 11 | Waning Gibbous | | Istanbul, Türkiye | 47° | 67° | 44° | 79% | 29.98 in | 8 mi | 4 mph | 41° | 0 of 11 | Waning Gibbous | We have successfully scraped weather data from multiple cities using Firecrawl and organized it into a structured DataFrame. This demonstrates how we can efficiently collect and analyze data generated by dynamic websites for further analysis and monitoring. ## Conclusion In this comprehensive guide, we’ve explored Firecrawl’s `/scrape` endpoint and its powerful capabilities for modern web scraping. We covered: - Basic scraping setup and configuration options - Multiple output formats including HTML, markdown, and screenshots - Structured data extraction using both natural language prompts and Pydantic schemas - Batch operations for processing multiple URLs efficiently - Advanced techniques for scraping JavaScript-heavy dynamic websites Through practical examples like extracting news articles from the NY Times and weather data from weather.com, we’ve demonstrated how Firecrawl simplifies complex scraping tasks while providing flexible output formats suitable for data engineering and AI/ML pipelines. The combination of LLM-powered extraction, structured schemas, and browser automation capabilities makes Firecrawl a versatile tool for gathering high-quality web data at scale, whether you’re building training datasets, monitoring websites, or conducting research. To discover more what Firecrawl has to offer, refer to [our guide on the `/crawl` endpoint](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl), which scrapes websites in their entirety with a single command while using the `/scrape` endpoint under the hood. For more hands-on uses-cases of Firecrawl, these posts may interest you as well: - [Using Prompt Caching With Anthropic](https://www.firecrawl.dev/blog/using-prompt-caching-with-anthropic) - [Scraping Job Boards With Firecrawl and OpenAI](https://www.firecrawl.dev/blog/scrape-job-boards-firecrawl-openai) - [Scraping and Analyzing Airbnb Listings in Python Tutorial](https://www.firecrawl.dev/blog/scrape-analyze-airbnb-data-with-e2b) Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Bex Tuychiev image](https://www.firecrawl.dev/bex.jpg)\\ Bex Tuychiev@bextuychiev](https://x.com/bextuychiev) Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics ### More articles by Bex Tuychiev [Building an Automated Price Tracking Tool\\ \\ Build an automated e-commerce price tracker in Python. Learn web scraping, price monitoring, and automated alerts using Firecrawl, Streamlit, PostgreSQL.](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python) [Web Scraping Automation: How to Run Scrapers on a Schedule\\ \\ Learn how to automate web scraping in Python using free tools like schedule, asyncio, cron jobs and GitHub Actions. This comprehensive guide covers local and cloud-based scheduling methods to run scrapers reliably in 2025.](https://www.firecrawl.dev/blog/automated-web-scraping-free-2025) [Automated Data Collection - A Comprehensive Guide\\ \\ Learn how to build robust automated data collection systems using modern tools and best practices. This guide covers everything from selecting the right tools to implementing scalable collection pipelines.](https://www.firecrawl.dev/blog/automated-data-collection-guide) [BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python\\ \\ Learn the key differences between BeautifulSoup4 and Scrapy for web scraping in Python. Compare their features, performance, and use cases to choose the right tool for your web scraping needs.](https://www.firecrawl.dev/blog/beautifulsoup4-vs-scrapy-comparison) [How to Build an Automated Competitor Price Monitoring System with Python\\ \\ Learn how to build an automated competitor price monitoring system in Python that tracks prices across e-commerce sites, provides real-time comparisons, and maintains price history using Firecrawl, Streamlit, and GitHub Actions.](https://www.firecrawl.dev/blog/automated-competitor-price-scraping) [Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude\\ \\ Learn how to build a web scraper in Python that gathers company details, funding rounds, and investor information from public sources like Crunchbase using Firecrawl and Claude for automated data collection and analysis.](https://www.firecrawl.dev/blog/crunchbase-scraping-with-firecrawl-claude) [How to Create Custom Instruction Datasets for LLM Fine-tuning\\ \\ Learn how to build high-quality instruction datasets for fine-tuning large language models (LLMs). This guide covers when to create custom datasets, best practices for data collection and curation, and a practical example of building a code documentation dataset.](https://www.firecrawl.dev/blog/custom-instruction-datasets-llm-fine-tuning) [Data Enrichment: A Complete Guide to Enhancing Your Data Quality\\ \\ Learn how to enrich your data quality with a comprehensive guide covering data enrichment tools, best practices, and real-world examples. Discover how to leverage modern solutions like Firecrawl to automate data collection, validation, and integration for better business insights.](https://www.firecrawl.dev/blog/complete-guide-to-data-enrichment) ## DeepSeek Documentation Assistant Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Feb 10, 2025 • [![Bex Tuychiev image](https://www.firecrawl.dev/bex.jpg)Bex Tuychiev](https://x.com/bextuychiev) # Building an Intelligent Code Documentation RAG Assistant with DeepSeek and Firecrawl ![Building an Intelligent Code Documentation RAG Assistant with DeepSeek and Firecrawl image](https://www.firecrawl.dev/images/blog/deepseek_rag/deepseek-rag-documentation-assistant.jpg) # Building an Intelligent Code Documentation Assistant: RAG-Powered DeepSeek Implementation ## Introduction DeepSeek R1’s release made waves in the AI community, with countless demos highlighting its impressive capabilities. However, most examples only scratch the surface with basic prompts rather than showing practical real-world implementations. In this tutorial, we’ll explore how to harness this powerful open-source model to create a documentation assistant powered by RAG (Retrieval Augmented Generation). Our application will be able to intelligently answer questions about any documentation website by combining DeepSeek’s language capabilities with efficient information retrieval. ![A screenshot showing the documentation assistant interface with a chat window on the right and a sidebar on the left for managing documentation sources](https://www.firecrawl.dev/images/blog/deepseek_rag/demo2.png) For those eager to try it out, you can find installation and usage instructions [in the GitHub repository](https://github.com/BexTuychiev/local-documentation-rag). If you’re interested in understanding how the application works and learning to customize it for your needs, continue reading this detailed walkthrough. ## What Is DeepSeek R1? ![A logo for DeepSeek AI showing a stylized deep learning neural network visualization](https://www.firecrawl.dev/images/blog/deepseek_rag/deepseek.jpeg) [DeepSeek R1](https://github.com/deepseek-ai/DeepSeek-R1) represents a notable advancement in artificial intelligence, combining reinforcement learning and supervised fine-tuning in a novel and most importantly, open-source approach. The model comes in two variants: DeepSeek-R1-Zero, trained purely through reinforcement learning, and DeepSeek-R1, which undergoes additional training steps. Its architecture manages 671 billion total parameters, though it operates efficiently with 37 billion active parameters and handles context lengths up to 128,000 tokens. The development journey progressed through carefully planned stages. Beginning with supervised fine-tuning for core capabilities, the model then underwent two phases of reinforcement learning. These RL stages shaped its reasoning patterns and aligned its behavior with human thought processes. This methodical approach produced a system capable of generating responses, performing self-verification, engaging in reflection, and constructing detailed reasoning across mathematics, programming, and general problem-solving. When it comes to performance, DeepSeek R1 demonstrates compelling results that rival OpenAI’s offerings. It achieves 97.3% accuracy on MATH-500, reaches the 96.3 percentile on Codeforces programming challenges, and scores 90.8% on the MMLU general knowledge assessment. The technology has also been distilled into smaller versions ranging from 1.5B to 70B parameters, built on established frameworks like Qwen and Llama. These adaptations make the technology more accessible for practical use while preserving its core strengths. In this tutorial, we will use its 14B version but your hardware may support up to 70B parameters. It is important to choose a higher capacity model as this number is the biggest contributor to performance. ## Prerequisite: Revisiting RAG concepts ![A diagram showing the RAG (Retrieval Augmented Generation) architecture with components for document processing, embedding generation, vector storage, and query processing connected by arrows to illustrate the information flow](https://www.firecrawl.dev/images/blog/deepseek_rag/rag-overview.png) [Source](https://www.patrickschnass.de/posts/rag_intro/) Retrieval Augmented Generation (RAG) represents a significant advancement in how Large Language Models (LLMs) interact with information. Unlike traditional LLMs that rely solely on their training data, RAG combines the power of language models with the ability to retrieve and reference external information in real-time. This approach effectively creates a bridge between the model’s inherent knowledge and up-to-date, specific information stored in external databases or documents. The RAG architecture consists of two main components: the retriever and the generator. The retriever is responsible for searching through a knowledge base to find relevant information based on the user’s query. This process typically involves converting both the query and stored documents into vector embeddings, allowing for semantic similarity searches that go beyond simple keyword matching. The generator, usually an LLM, then takes both the original query and the retrieved information to produce a comprehensive, contextually relevant response. One of RAG’s key advantages is its ability to provide more accurate and verifiable responses. By grounding the model’s outputs in specific, retrievable sources, RAG helps reduce hallucinations – instances where LLMs generate plausible-sounding but incorrect information. This is particularly valuable in professional contexts where accuracy and accountability are crucial, such as technical documentation, customer support, or legal applications. Additionally, RAG systems can be updated with new information without requiring retraining of the underlying language model, making them more flexible and maintainable. The implementation of RAG typically involves several technical components working in harmony. First, documents are processed and converted into embeddings using models like BERT or Sentence Transformers. These embeddings are then stored in vector databases such as Pinecone, Weaviate, or FAISS for efficient retrieval. When a query arrives, it goes through the same embedding process, and similarity search algorithms find the most relevant documents. Finally, these documents, along with the original query, are formatted into a prompt that the LLM uses to generate its response. This structured approach ensures that the final output is both relevant and grounded in reliable source material. Now that we’ve refreshed our memory on basic RAG concepts, let’s dive in to the app’s implementation. ## Overview of the App Before diving into the technical details, let’s walk through a typical user journey to understand how the documentation assistant works. The process starts with the user providing documentation URLs to scrape. The app is designed to work with any documentation website, but here are some examples of typical documentation pages: - `https://docs.firecrawl.dev` - `https://docs.langchain.com` - `https://docs.streamlit.io` The app’s interface is divided into two main sections: a sidebar for documentation management and a main chat interface. In the sidebar, users can: 1. Enter a documentation URL to scrape 2. Specify a name for the documentation (must end with “-docs”) 3. Optionally limit the number of pages to scrape 4. View and select from previously scraped documentation sets When a user initiates scraping, the app uses Firecrawl to intelligently crawl the documentation website, converting HTML content into clean markdown files. These files are stored locally in a directory named after the documentation (e.g., “Firecrawl-docs”). The app shows real-time progress during scraping and notifies the user when complete. After scraping, the documentation is processed into a vector database using the Nomic embeddings model. This enables semantic search capabilities, allowing the assistant to find relevant documentation sections based on user questions. The processing happens automatically when a user selects a documentation set from the sidebar. The main chat interface provides an intuitive way to interact with the documentation: 1. Users can ask questions in natural language about the selected documentation 2. The app uses RAG (Retrieval-Augmented Generation) to find relevant documentation sections 3. DeepSeek R1 generates accurate, contextual responses based on the retrieved content 4. Each response includes an expandable “View reasoning” section showing the chain of thought ![Screenshot showing the documentation assistant interface with sidebar controls and chat interface](https://www.firecrawl.dev/images/blog/deepseek_rag/demo2.png) Users can switch between different documentation sets at any time, and the app will automatically reprocess the vectors as needed. This approach combines the power of modern AI with traditional documentation search, creating a more interactive and intelligent way to explore technical documentation. Whether you’re learning a new framework or trying to solve a specific problem, the assistant helps you find and understand relevant documentation more efficiently than traditional search methods. ## The Tech Stack Used in the App Building an effective documentation assistant requires tools that can handle complex tasks like web scraping, text processing, and natural language understanding while remaining maintainable and efficient. Let’s explore the core technologies that power our application and why each was chosen: ### 1\. [Firecrawl](https://firecrawl.dev/) for AI-powered documentation scraping At the heart of our documentation collection system is Firecrawl, an AI-powered web scraping engine. Unlike traditional scraping libraries that rely on brittle HTML selectors, Firecrawl uses natural language understanding to identify and extract content. This makes it ideal for our use case because: - It can handle diverse documentation layouts without custom code - Maintains reliability even when documentation structure changes - Automatically extracts clean markdown content - Handles JavaScript-rendered documentation sites - Provides metadata like titles and URLs automatically - Follows documentation links intelligently ### 2\. [DeepSeek R1](https://deepseek.ai/) for question answering For the critical task of answering documentation questions, we use the DeepSeek R1 14B model through Ollama. This AI model excels at understanding technical documentation and providing accurate responses. We chose DeepSeek R1 because: - Runs locally for better privacy and lower latency - Specifically trained on technical content - Provides detailed explanations with chain-of-thought reasoning - More cost-effective than cloud-based models - Integrates well with LangChain for RAG workflows ### 3\. [Nomic Embeddings](https://docs.nomic.ai/) for semantic search To enable semantic search across documentation, we use Nomic’s text embedding model through [Ollama](https://ollama.com/). This component is crucial for finding relevant documentation sections. We chose Nomic because: - Optimized for technical documentation - Runs locally alongside DeepSeek through Ollama - Produces high-quality embeddings for RAG - Fast inference speed - Compact model size ### 4\. [ChromaDB](https://www.trychroma.com/) for vector storage To store and query document embeddings efficiently, we use ChromaDB as our vector database. This modern vector store offers: - Lightweight and easy to set up - Persistent storage of embeddings - Fast similarity search - Seamless integration with LangChain - No external dependencies ### 5\. [Streamlit](https://streamlit.io/) for user interface The web interface is built with Streamlit, a Python framework for data applications. We chose Streamlit because: - It enables rapid development of chat interfaces - Provides built-in components for file handling - Handles async operations smoothly - Maintains chat history during sessions - Requires minimal frontend code - Makes deployment straightforward ### 6\. [LangChain](https://www.langchain.com/) for RAG orchestration To coordinate the various components into a cohesive RAG system, we use LangChain. This framework provides: - Standard interfaces for embeddings and LLMs - Document loading and text splitting utilities - Vector store integration - Prompt management - Structured output parsing This carefully selected stack provides a robust foundation while keeping the system entirely local and self-contained. The combination of AI-powered tools (Firecrawl and DeepSeek) with modern infrastructure (ChromaDB, LangChain, and Ollama) creates a reliable and efficient documentation assistant that can handle diverse technical documentation. Most importantly, this stack minimizes both latency and privacy concerns by running all AI components locally. The infrastructure is lightweight and portable, letting you focus on using the documentation rather than managing complex dependencies or cloud services. ## Breaking Down the App Components When you look at [the GitHub repository](https://github.com/BexTuychiev/local-documentation-rag/tree/main/src) of the app, you will see the following file structure: ![GitHub repository file structure showing src directory with core Python files and configuration files](https://www.firecrawl.dev/images/blog/deepseek_rag/github-snapshot.png) Several files in the repository serve common purposes that most developers will recognize: - `.gitignore`: Specifies which files Git should ignore when tracking changes - `README.md`: Documentation explaining what the project does and how to use it - `requirements.txt`: Lists all Python package dependencies needed to run the project Let’s examine the remaining Python scripts and understand how they work together to power the application. The explanations will be in a logical order building from foundational elements to higher-level functionality. ### 1\. Scraping Documentation with Firecrawl - `src/scraper.py` The documentation scraper component handles fetching and processing documentation pages using Firecrawl’s AI capabilities. Let’s examine how each part works: First, we make the necessary imports and setup: ```python import logging import os import re from pathlib import Path from typing import List from dotenv import load_dotenv from firecrawl import FirecrawlApp from pydantic import BaseModel, Field # Get logger for the scraper module logger = logging.getLogger(__name__) ``` Then, we define the core data structure for documentation pages: ```python class DocPage(BaseModel): title: str = Field(description="Page title") content: str = Field(description="Main content of the page") url: str = Field(description="Page URL") ``` The `DocPage` model represents a single documentation page with three essential fields: - `title`: The page’s heading or title - `content`: The main markdown content of the page - `url`: Direct link to the original page This model is used by both the scraper to structure extracted content and the RAG system to process documentation for the vector store. The main scraper class handles all documentation collection: ```python class DocumentationScraper: def init(self): self.app = FirecrawlApp() ``` The `DocumentationScraper` initializes a connection to Firecrawl and provides three main methods for documentation collection: 1. `get_documentation_links`: Discovers all documentation pages from a base URL: ```python def get_documentation_links(self, base_url: str) -> list[str]: """Get all documentation page links from a given base URL.""" logger.info(f"Getting documentation links from {base_url}") initial_crawl = self.app.crawl_url( base_url, params={ "scrapeOptions": {"formats": ["links"]}, }, ) all_links = [] for item in initial_crawl["data"]: all_links.extend(item["links"]) filtered_links = set( [link.split("#")[0] for link in all_links if link.startswith(base_url)] ) logger.info(f"Found {len(filtered_links)} unique documentation links") return list(filtered_links) ``` This method: - Uses Firecrawl’s link extraction mode to find all URLs - Filters for links within the same documentation domain - Removes duplicate URLs and anchor fragments - Returns a clean list of documentation page URLs 2. `scrape_documentation`: Processes all documentation pages into structured content: ```python def scrape_documentation(self, base_url: str, limit: int = None): """Scrape documentation pages from a given base URL.""" logger.info(f"Scraping doc pages from {base_url}") filtered_links = self.get_documentation_links(base_url) if limit: filtered_links = filtered_links[:limit] try: logger.info(f"Scraping {len(filtered_links)} documentation pages") crawl_results = self.app.batch_scrape_urls(filtered_links) except Exception as e: logger.error(f"Error scraping documentation pages: {str(e)}") return [] doc_pages = [] for result in crawl_results["data"]: if result.get("markdown"): doc_pages.append( DocPage( title=result.get("metadata", {}).get("title", "Untitled"), content=result["markdown"], url=result.get("metadata", {}).get("url", ""), ) ) else: logger.warning( f"Failed to scrape {result.get('metadata', {}).get('url', 'unknown URL')}" ) logger.info(f"Successfully scraped {len(doc_pages)} pages out of {len(filtered_links)} URLs") return doc_pages ``` This method: - Gets all documentation links using the previous method - Optionally limits the number of pages to scrape - Uses Firecrawl’s batch scraping to efficiently process multiple pages - Converts raw scraping results into structured `DocPage` objects - Handles errors and provides detailed logging 3. `save_documentation_pages`: Stores scraped content as markdown files: ```python def save_documentation_pages(self, doc_pages: List[DocPage], docs_dir: str): """Save scraped documentation pages to markdown files.""" Path(docs_dir).mkdir(parents=True, exist_ok=True) for page in doc_pages: url_path = page.url.replace("https://docs.firecrawl.dev", "") safe_filename = url_path.strip("/").replace("/", "-") filepath = os.path.join(docs_dir, f"{safe_filename}.md") with open(filepath, "w", encoding="utf-8") as f: f.write("---\n") f.write(f"title: {page.title}\n") f.write(f"url: {page.url}\n") f.write("---\n\n") f.write(page.content) logger.info(f"Saved {len(doc_pages)} pages to {docs_dir}") ``` This method: - Creates a documentation directory if needed - Converts URLs to safe filenames - Saves each page as a markdown file with YAML frontmatter - Preserves original titles and URLs for reference Finally, the class provides a convenience method to handle the entire scraping workflow: ```python def pull_docs(self, base_url: str, docs_dir: str, n_pages: int = None): doc_pages = self.scrape_documentation(base_url, n_pages) self.save_documentation_pages(doc_pages, docs_dir) ``` This scraper component is used by: - The Streamlit interface ( `app.py`) for initial documentation collection - The RAG system ( `rag.py`) for processing documentation into the vector store - The command-line interface for testing and manual scraping The use of Firecrawl’s AI capabilities allows the scraper to handle diverse documentation layouts without custom selectors, while the structured output ensures consistency for downstream processing. ### 2\. Implementing RAG with Ollama - `src/rag.py` The RAG (Retrieval Augmented Generation) component is the core of our documentation assistant, handling document processing, embedding generation, and question answering. Let’s examine each part in detail: First, we import the necessary LangChain components: ```python from langchain_chroma import Chroma from langchain_community.document_loaders import DirectoryLoader from langchain_core.prompts import ChatPromptTemplate from langchain_ollama import ChatOllama, OllamaEmbeddings from langchain_text_splitters import RecursiveCharacterTextSplitter ``` These imports provide: - `Chroma`: Vector database for storing embeddings - `DirectoryLoader`: Utility for loading markdown files from a directory - `ChatPromptTemplate`: Template system for LLM prompts - `ChatOllama` and `OllamaEmbeddings`: Local LLM and embedding models - `RecursiveCharacterTextSplitter`: Text chunking utility The main RAG class initializes all necessary components: ```python class DocumentationRAG: def __init__(self): # Initialize embeddings and vector store self.embeddings = OllamaEmbeddings(model="nomic-embed-text") self.vector_store = Chroma( embedding_function=self.embeddings, persist_directory="./chroma_db" ) # Initialize LLM self.llm = ChatOllama(model="deepseek-r1:14b") # Text splitter for chunking self.text_splitter = RecursiveCharacterTextSplitter( chunk_size=1000, chunk_overlap=200, add_start_index=True ) ``` The initialization: 1. Creates an embedding model using Nomic’s text embeddings 2. Sets up a Chroma vector store with persistent storage 3. Initializes the DeepSeek R1 14B model for question answering 4. Configures a text splitter with 1000-character chunks and 200-character overlap The prompt template defines how the LLM should process questions: ```python # RAG prompt template self.prompt = ChatPromptTemplate.from_template( """ You are an expert documentation assistant. Use the following documentation context to answer the question. If you don't know the answer, just say that you don't have enough information. Keep the answer concise and clear. Context: {context} Question: {question} Answer:""" ) ``` This template: - Sets the assistant’s role and behavior - Provides placeholders for context and questions - Encourages concise and clear responses The document loading method handles reading markdown files: ```python def load_docs_from_directory(self, docs_dir: str): """Load all markdown documents from a directory""" markdown_docs = DirectoryLoader(docs_dir, glob="*.md").load() return markdown_docs ``` This method: - Uses `DirectoryLoader` to find all markdown files - Automatically handles file reading and basic preprocessing - Returns a list of Document objects The document processing method prepares content for the vector store: ```python def process_documents(self, docs_dir: str): """Process documents and add to vector store""" # Clear existing documents self.vector_store = Chroma( embedding_function=self.embeddings, persist_directory="./chroma_db" ) # Load and process new documents documents = self.load_docs_from_directory(docs_dir) chunks = self.text_splitter.split_documents(documents) self.vector_store.add_documents(chunks) ``` This method: 1. Reinitializes the vector store to clear existing documents 2. Loads new documents from the specified directory 3. Splits documents into manageable chunks 4. Generates and stores embeddings in the vector database Finally, the query method handles question answering: ```python def query(self, question: str) -> tuple[str, str]: """Query the documentation""" # Get relevant documents docs = self.vector_store.similarity_search(question, k=3) # Combine context context = "\n\n".join([doc.page_content for doc in docs]) # Generate response chain = self.prompt | self.llm response = chain.invoke({"context": context, "question": question}) # Extract chain of thought between and chain_of_thought = response.content.split("")[1].split("")[0] # Extract response response = response.content.split("")[1].strip() return response, chain_of_thought ``` The query process: 1. Performs semantic search to find the 3 most relevant document chunks 2. Combines the chunks into a single context string 3. Creates a LangChain chain combining the prompt and LLM 4. Generates a response with chain-of-thought reasoning 5. Extracts and returns both the final answer and reasoning process This RAG component is used by: - The Streamlit interface ( `app.py`) for handling user questions - The command-line interface for testing and development - Future extensions that need documentation Q&A capabilities The implementation uses LangChain’s abstractions to create a modular and maintainable system while keeping all AI components running locally through Ollama. ### 3\. Building a clean UI with Streamlit - `src/app.py` The Streamlit interface brings together the scraping and RAG components into a user-friendly web application. Let’s break down each component: First, we set up basic configuration and utilities: ```python import glob import logging from pathlib import Path import streamlit as st from dotenv import load_dotenv from rag import DocumentationRAG from scraper import DocumentationScraper # Configure logging logging.basicConfig( level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", handlers=[logging.StreamHandler()], ) logger = logging.getLogger(__name__) ``` These imports and configurations: - Set up logging for debugging and monitoring - Import our custom RAG and scraper components - Load environment variables for configuration Helper functions handle documentation management: ```python def get_existing_docs(): """Get all documentation directories with -docs suffix""" docs_dirs = glob.glob("*-docs") return [Path(dir_path).name for dir_path in docs_dirs] def get_doc_page_count(docs_dir: str) -> int: """Get number of markdown files in a documentation directory""" return len(list(Path(docs_dir).glob("*.md"))) ``` These utilities: - Find all documentation directories with “-docs” suffix - Count pages in each documentation set - Support the UI’s documentation selection features The scraping configuration section handles documentation collection: ```python def scraping_config_section(): """Create the documentation scraping configuration section""" st.markdown("### Configure Scraping") base_url = st.text_input( "Documentation URL", placeholder="https://docs.firecrawl.dev", help="The base URL of the documentation to scrape", ) docs_name = st.text_input( "Documentation Name", placeholder="Firecrawl-docs", help="Name of the directory to store documentation", ) n_pages = st.number_input( "Number of Pages", min_value=0, value=0, help="Limit the number of pages to scrape (0 for all pages)", ) st.info( "💡 Add '-docs' suffix to the documentation name. " "Set pages to 0 to scrape all available pages." ) if st.button("Start Scraping"): if not base_url or not docs_name: st.error("Please provide both URL and documentation name") elif not docs_name.endswith("-docs"): st.error("Documentation name must end with '-docs'") else: with st.spinner("Scraping documentation..."): try: scraper = DocumentationScraper() n_pages = None if n_pages == 0 else n_pages scraper.pull_docs(base_url, docs_name, n_pages=n_pages) st.success("Documentation scraped successfully!") except Exception as e: st.error(f"Error scraping documentation: {str(e)}") ``` This section: - Provides input fields for documentation URL and name - Allows limiting the number of pages to scrape - Handles validation and error reporting - Shows progress during scraping - Uses our `DocumentationScraper` class for content collection The documentation selection interface manages switching between docs: ```python def documentation_select_section(): """Create the documentation selection section""" st.markdown("### Select Documentation") existing_docs = get_existing_docs() if not existing_docs: st.caption("No documentation found yet") return None # Create options with page counts doc_options = [f"{doc} ({get_doc_page_count(doc)} pages)" for doc in existing_docs] selected_doc = st.selectbox( "Choose documentation to use as context", options=doc_options, help="Select which documentation to use for answering questions", ) if selected_doc: # Extract the actual doc name without page count st.session_state.current_doc = selected_doc.split(" (")[0] return st.session_state.current_doc return None ``` This component: - Lists available documentation sets - Shows page counts for each set - Updates session state when selection changes - Handles the case of no available documentation The chat interface consists of two main functions that work together to create the interactive Q&A experience: First, we initialize the necessary session state: ```python def initialize_chat_state(): """Initialize session state for chat""" if "messages" not in st.session_state: st.session_state.messages = [] if "rag" not in st.session_state: st.session_state.rag = DocumentationRAG() ``` This initialization: - Creates an empty message list if none exists - Sets up the RAG system for document processing and querying - Uses Streamlit’s session state to persist data between reruns The main chat interface starts with basic setup: ```python def chat_interface(): """Create the chat interface""" st.title("Documentation Assistant") # Check if documentation is selected if "current_doc" not in st.session_state: st.info("Please select a documentation from the sidebar to start chatting.") return ``` This section: - Sets the page title - Ensures documentation is selected before proceeding - Shows a helpful message if no documentation is chosen Document processing is handled next: ```python # Process documentation if not already processed if ( "docs_processed" not in st.session_state or st.session_state.docs_processed != st.session_state.current_doc ): with st.spinner("Processing documentation..."): st.session_state.rag.process_documents(st.session_state.current_doc) st.session_state.docs_processed = st.session_state.current_doc ``` This block: - Checks if the current documentation needs processing - Shows a loading spinner during processing - Updates the session state after processing - Prevents unnecessary reprocessing of the same documentation Message display is handled by iterating through the chat history: ```python # Display chat messages for message in st.session_state.messages: with st.chat_message(message["role"]): st.markdown(message["content"]) if "chain_of_thought" in message: with st.expander("View reasoning"): st.markdown(message["chain_of_thought"]) ``` This section: - Shows each message with appropriate styling based on role - Displays the main content using markdown - Creates expandable sections for reasoning chains - Maintains visual consistency in the chat Finally, the input handling and response generation: ```python # Chat input if prompt := st.chat_input("Ask a question about the documentation"): # Add user message st.session_state.messages.append({"role": "user", "content": prompt}) with st.chat_message("user"): st.markdown(prompt) # Generate and display response with st.chat_message("assistant"): with st.spinner("Thinking..."): response, chain_of_thought = st.session_state.rag.query(prompt) st.markdown(response) with st.expander("View reasoning"): st.markdown(chain_of_thought) # Store assistant response st.session_state.messages.append({ "role": "assistant", "content": response, "chain_of_thought": chain_of_thought, }) ``` This section: 1. Captures user input: - Uses Streamlit’s chat input component - Stores the message in session state - Displays the message immediately 2. Generates response: - Shows a “thinking” spinner during processing - Queries the RAG system for an answer - Displays the response with expandable reasoning 3. Updates chat history: - Stores both response and reasoning - Maintains the conversation flow - Preserves the interaction for future reference The entire chat interface creates a seamless experience by: - Managing state effectively - Providing immediate feedback - Showing processing status - Maintaining conversation context - Exposing the AI’s reasoning process Finally, the main application structure: ```python def sidebar(): """Create the sidebar UI components""" with st.sidebar: st.title("Documentation Scraper") scraping_config_section() documentation_select_section() def main(): initialize_chat_state() sidebar() chat_interface() if __name__ == "__main__": main() ``` This structure: - Organizes UI components into sidebar and main area - Initializes necessary state on startup - Provides a clean entry point for the application The Streamlit interface brings together all components into a cohesive application that: - Makes documentation scraping accessible to non-technical users - Provides immediate feedback during operations - Maintains conversation history - Shows the AI’s reasoning process - Handles errors gracefully ## How to Increase System Performance There are several ways to optimize the performance of this documentation assistant. The following sections explore key areas for potential improvements: ### 1\. Optimize document chunking In `rag.py`, we currently use a basic chunking strategy: ```python self.text_splitter = RecursiveCharacterTextSplitter( chunk_size=1000, chunk_overlap=200, add_start_index=True ) ``` We can improve this by: - Using semantic chunking that respects document structure - Adjusting chunk size based on content type (e.g., larger for API docs) - Implementing custom splitting rules for documentation headers - Adding metadata to chunks for better context preservation Example improved configuration: ```python self.text_splitter = RecursiveCharacterTextSplitter( chunk_size=1500, # Larger chunks for more context chunk_overlap=300, # Increased overlap for better coherence separators=["\n## ", "\n### ", "\n\n", "\n", " ", ""], # Respect markdown structure add_start_index=True, length_function=len, is_separator_regex=False ) ``` ### 2\. Enhance vector search The current similarity search in `rag.py` is basic: ```python docs = self.vector_store.similarity_search(question, k=3) ``` We can improve retrieval by: - Increasing `k`, i.e. the number of chunks returned - Implementing hybrid search (combining semantic and keyword matching) - Using Maximum Marginal Relevance (MMR) for diverse results - Adding metadata filtering based on document sections - Implementing re-ranking of retrieved chunks Example enhanced retrieval: ```python def query(self, question: str) -> tuple[str, str]: # Get relevant documents with MMR docs = self.vector_store.max_marginal_relevance_search( question, k=5, # Retrieve more candidates fetch_k=20, # Consider larger initial set lambda_mult=0.7 # Diversity factor ) # Filter and re-rank results filtered_docs = [\ doc for doc in docs\ if self._calculate_relevance_score(doc, question) > 0.7\ ] # Use top 3 most relevant chunks context = "\n\n".join([doc.page_content for doc in filtered_docs[:3]]) ``` ### 3\. Implement caching The current implementation reprocesses documentation on every selection: ```python if ( "docs_processed" not in st.session_state or st.session_state.docs_processed != st.session_state.current_doc ): with st.spinner("Processing documentation..."): st.session_state.rag.process_documents(st.session_state.current_doc) ``` We can improve this by: - Implementing persistent vector storage with versioning - Caching processed embeddings - Adding incremental updates for documentation changes Example caching implementation: ```python from hashlib import md5 import pickle class CachedDocumentationRAG(DocumentationRAG): def process_documents(self, docs_dir: str): cache_key = self._get_cache_key(docs_dir) cache_path = f"cache/{cache_key}.pkl" if os.path.exists(cache_path): with open(cache_path, 'rb') as f: self.vector_store = pickle.load(f) else: super().process_documents(docs_dir) os.makedirs("cache", exist_ok=True) with open(cache_path, 'wb') as f: pickle.dump(self.vector_store, f) ``` ### 4\. Optimize model loading Currently, we initialize models in `__init__`: ```python def __init__(self): self.embeddings = OllamaEmbeddings(model="nomic-embed-text") self.llm = ChatOllama(model="deepseek-r1:14b") ``` We can improve this by: - Implementing lazy loading of models - Using smaller models for initial responses - Adding model quantization options - Implementing model caching Example optimized initialization: ```python class OptimizedDocumentationRAG: def __init__(self, use_small_model=True): self._embeddings = None self._llm = None self._use_small_model = use_small_model @property def llm(self): if self._llm is None: model_size = "7b" if self._use_small_model else "14b" self._llm = ChatOllama( model=f"deepseek-r1:{model_size}", temperature=0.1, # Lower temperature for docs num_ctx=2048 # Reduced context for faster inference ) return self._llm ``` These optimizations can significantly improve: - Response latency - Memory usage - Processing throughput - User experience Remember to benchmark performance before and after implementing these changes to measure their impact. Also, consider your specific use case - some optimizations might be more relevant depending on factors like user load, documentation size, and hardware constraints. ## Conclusion This local documentation assistant demonstrates how modern AI technologies can be combined to create powerful, practical tools for technical documentation. By using DeepSeek’s language capabilities, Firecrawl’s AI-powered scraping, and the RAG architecture, we’ve built a system that makes documentation more accessible and interactive. The application’s modular design, with clear separation between scraping, RAG implementation, and user interface components, provides a solid foundation for future enhancements and adaptations to different documentation needs. Most importantly, this implementation shows that sophisticated AI applications can be built entirely with local components, eliminating privacy concerns and reducing operational costs. The combination of Streamlit’s intuitive interface, LangChain’s flexible abstractions, and Ollama’s local AI models creates a seamless experience that feels like a cloud service but runs entirely on your machine. Whether you’re a developer learning a new framework, a technical writer maintaining documentation, or a team lead looking to improve documentation accessibility, this assistant provides a practical solution that can be customized and extended to meet your specific needs. Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Bex Tuychiev image](https://www.firecrawl.dev/bex.jpg)\\ Bex Tuychiev@bextuychiev](https://x.com/bextuychiev) Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics ### More articles by Bex Tuychiev [Building an Automated Price Tracking Tool\\ \\ Build an automated e-commerce price tracker in Python. Learn web scraping, price monitoring, and automated alerts using Firecrawl, Streamlit, PostgreSQL.](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python) [Web Scraping Automation: How to Run Scrapers on a Schedule\\ \\ Learn how to automate web scraping in Python using free tools like schedule, asyncio, cron jobs and GitHub Actions. This comprehensive guide covers local and cloud-based scheduling methods to run scrapers reliably in 2025.](https://www.firecrawl.dev/blog/automated-web-scraping-free-2025) [Automated Data Collection - A Comprehensive Guide\\ \\ Learn how to build robust automated data collection systems using modern tools and best practices. This guide covers everything from selecting the right tools to implementing scalable collection pipelines.](https://www.firecrawl.dev/blog/automated-data-collection-guide) [BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python\\ \\ Learn the key differences between BeautifulSoup4 and Scrapy for web scraping in Python. Compare their features, performance, and use cases to choose the right tool for your web scraping needs.](https://www.firecrawl.dev/blog/beautifulsoup4-vs-scrapy-comparison) [How to Build an Automated Competitor Price Monitoring System with Python\\ \\ Learn how to build an automated competitor price monitoring system in Python that tracks prices across e-commerce sites, provides real-time comparisons, and maintains price history using Firecrawl, Streamlit, and GitHub Actions.](https://www.firecrawl.dev/blog/automated-competitor-price-scraping) [Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude\\ \\ Learn how to build a web scraper in Python that gathers company details, funding rounds, and investor information from public sources like Crunchbase using Firecrawl and Claude for automated data collection and analysis.](https://www.firecrawl.dev/blog/crunchbase-scraping-with-firecrawl-claude) [How to Create Custom Instruction Datasets for LLM Fine-tuning\\ \\ Learn how to build high-quality instruction datasets for fine-tuning large language models (LLMs). This guide covers when to create custom datasets, best practices for data collection and curation, and a practical example of building a code documentation dataset.](https://www.firecrawl.dev/blog/custom-instruction-datasets-llm-fine-tuning) [Data Enrichment: A Complete Guide to Enhancing Your Data Quality\\ \\ Learn how to enrich your data quality with a comprehensive guide covering data enrichment tools, best practices, and real-world examples. Discover how to leverage modern solutions like Firecrawl to automate data collection, validation, and integration for better business insights.](https://www.firecrawl.dev/blog/complete-guide-to-data-enrichment) ## Structured Web Data Extraction Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) January 20, 2025 • [![Eric Ciarla image](https://www.firecrawl.dev/eric-img.jpeg)Eric Ciarla](https://x.com/ericciarla) # Introducing /extract: Get structured web data with just a prompt /extract by Firecrawl - Get structured web data with just a prompt (Open Beta) - YouTube Firecrawl 503 subscribers [/extract by Firecrawl - Get structured web data with just a prompt (Open Beta)](https://www.youtube.com/watch?v=Qq1pFm8enZo) Firecrawl Search Info Shopping Tap to unmute If playback doesn't begin shortly, try restarting your device. You're signed out Videos you watch may be added to the TV's watch history and influence TV recommendations. To avoid this, cancel and sign in to YouTube on your computer. CancelConfirm Share Include playlist An error occurred while retrieving sharing information. Please try again later. Watch later Share Copy link Watch on 0:00 / •Live • [Watch on YouTube](https://www.youtube.com/watch?v=Qq1pFm8enZo "Watch on YouTube") ## The era of writing web scrapers is over Today we’re releasing [/extract](https://www.firecrawl.dev/extract) \- write a prompt, get structured data from any website. No scrapers. No pipelines. Just results. ## Getting web data is hard If you’ve ever needed structured data from websites—whether to enrich your CRM, monitor competitors, or power various applications—you’re probably familiar with the frustrating options available today: - Manually researching and copy-pasting data from multiple sources, consuming countless hours - Writing and maintaining fragile web scrapers that break at the slightest site change - Using scraping services and building complex LLM pipelines with limited context windows that force you to break down data manually Fortunately, with our /extract endpoint, you can leave these cumbersome approaches in the past and focus on what matters - getting the data you need. ## What You Can Build With /extract Companies are already using /extract to: - Enrich thousands of CRM leads with company data - Automate KYB processes with structured business information - Track competitor prices and feature changes in real-time - Build targeted prospecting lists at scale Here’s how it works: 1. Write a prompt describing the data you need 2. Point us at any website (use wildcards like example.com/\*) 3. Get back clean, structured JSON No more broken scrapers. No more complex pipelines. Just the data you need to build. ## Current Limitations While /extract handles most web data needs effectively, there are some edge cases we’re actively improving: 1. Scale Limitations: Very large sites (think Amazon’s entire catalog) require breaking requests into smaller chunks 2. Advanced Filtering: Complex queries like time-based filtering are still in development 3. Consistency: Multiple runs may return slightly different results as we refine our extraction model We’re actively working on these areas. Our goal is to make web data as accessible as an API - and we’re getting closer every day. ## Get Started 1. **Try it Now** - Get 500,000 free tokens in our [playground](https://www.firecrawl.dev/playground?mode=extract) - See examples and experiment with different prompts - No credit card required 2. **Build Something Real** - Read the [technical docs](https://docs.firecrawl.dev/features/extract) - Connect with [Zapier](https://zapier.com/apps/firecrawl/integrations) for no-code workflows Ready to turn web data into your competitive advantage? Get started in less than 5 minutes. — Caleb, Eric, Nick and the Firecrawl team 🔥 Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Eric Ciarla image](https://www.firecrawl.dev/eric-img.jpeg)\\ Eric Ciarla@ericciarla](https://x.com/ericciarla) Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB. Previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users. ### More articles by Eric Ciarla [How to Create an llms.txt File for Any Website\\ \\ Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [Cloudflare Error 1015: How to solve it?\\ \\ Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner.](https://www.firecrawl.dev/blog/cloudflare-error-1015-how-to-solve-it) [Build an agent that checks for website contradictions\\ \\ Using Firecrawl and Claude to scrape your website's data and look for contradictions.](https://www.firecrawl.dev/blog/contradiction-agent) [Why Companies Need a Data Strategy for Generative AI\\ \\ Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses\\ \\ A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [How to easily install requests with pip and python\\ \\ A tutorial on installing the requests library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-easily-install-requests-with-pip-and-python) [How to quickly install BeautifulSoup with Python\\ \\ A guide on installing the BeautifulSoup library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python) [How to Use OpenAI's o1 Reasoning Models in Your Applications\\ \\ Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications) ## Fire Engine for Firecrawl Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Aug 6, 2024 • [![Eric Ciarla image](https://www.firecrawl.dev/eric-img.jpeg)Eric Ciarla](https://x.com/ericciarla) # Introducing Fire Engine for Firecrawl ![Introducing Fire Engine for Firecrawl image](https://www.firecrawl.dev/images/blog/fire-engine-launch.png) Firecrawl handles web scraping orchestration but doesn’t do the actual scraping. It initially relied on third-party services like Fetch and Playwright for data retrieval. However, these services often failed on certain sites or were too slow, causing issues for users. To address this, we built Fire Engine, now the default backend for Firecrawl. It’s designed to be more reliable and faster, solving the core problems we and our users encountered with other scraping services. ### What is Fire Engine? Fire Engine is a scraping primitive designed to increase Firecrawl’s scraping capabilities. We’re proud to say that Fire Engine outperforms leading competitors in key areas: - **Reliability:** 40% more reliable than scraping leading competitors when scraping different types of websites - **Speed:** Up to 33.17% faster than scraping leading competitors And this is just the beginning, we are working closely with Firecrawl users to further improve reliability, speed, and more. ### The Technology Behind Fire Engine Fire Engine combines a variety of browser and non-browser based techniques to balance speed and reliability, ensuring that you get data back without compromise. To do this, Fire engine has: - **Efficient Headless Browser Management:** Running browsers at scale is notoriously difficult, but Fire Engine handles this with ease. - **Persistent Browser Sessions:** By keeping browsers running, Fire Engine improves efficiency when handling new requests, reducing startup times and resource usage. - **Advanced Web Interaction Techniques:** Employing a sophisticated array of methods—including browser-based, browserless, and proprietary approaches - **Intelligent Request Handling:** From smart proxy selection to advanced queuing, every aspect of the request process is optimized for speed and reliability. With this technology, Fire Engine allows firecrawl to handle millions of requests daily with speed and accuracy. ### Try Fire Engine on Firecrawl Today Fire Engine powers Firecrawl to handle thousands of daily requests efficiently. It’s currently available exclusively through Firecraw cloud, Developers can test Fire Engine’s capabilities by signing up for [Firecrawl](https://www.firecrawl.dev/). Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Eric Ciarla image](https://www.firecrawl.dev/eric-img.jpeg)\\ Eric Ciarla@ericciarla](https://x.com/ericciarla) Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB. Previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users. ### More articles by Eric Ciarla [How to Create an llms.txt File for Any Website\\ \\ Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [Cloudflare Error 1015: How to solve it?\\ \\ Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner.](https://www.firecrawl.dev/blog/cloudflare-error-1015-how-to-solve-it) [Build an agent that checks for website contradictions\\ \\ Using Firecrawl and Claude to scrape your website's data and look for contradictions.](https://www.firecrawl.dev/blog/contradiction-agent) [Why Companies Need a Data Strategy for Generative AI\\ \\ Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses\\ \\ A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [How to easily install requests with pip and python\\ \\ A tutorial on installing the requests library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-easily-install-requests-with-pip-and-python) [How to quickly install BeautifulSoup with Python\\ \\ A guide on installing the BeautifulSoup library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python) [How to Use OpenAI's o1 Reasoning Models in Your Applications\\ \\ Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications) ## Firecrawl Launch Week Recap Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) September 2, 2024 • [![Eric Ciarla image](https://www.firecrawl.dev/eric-img.jpeg)Eric Ciarla](https://x.com/ericciarla) # Launch Week I Recap ![Launch Week I Recap image](https://www.firecrawl.dev/images/blog/launch-week-1-recap.png) ## Introduction Last week marked an exciting milestone for Firecrawl as we kicked off our inaugural Launch Week, unveiling a series of new features and updates designed to enhance your web scraping experience. Let’s take a look back at the improvements we introduced throughout the week. ## [Day 1: Introducing Teams](https://firecrawl.dev/blog/launch-week-i-day-1-introducing-teams) We started Launch Week by introducing our highly anticipated Teams feature. Teams enables seamless collaboration on web scraping projects, allowing you to work alongside your colleagues and tackle complex data gathering tasks together. With updated pricing plans to accommodate teams of all sizes, Firecrawl is now an excellent platform for collaborative web scraping. ## [Day 2: Increased Rate Limits](https://firecrawl.dev/blog/launch-week-i-day-2-doubled-rate-limits) On Day 2, we improved your data collection capabilities by doubling the rate limits for our /scrape endpoint across all plans. This means you can now gather more data in the same amount of time, enabling you to take on larger projects and scrape more frequently. ## [Day 3: Introducing the Map Endpoint (Alpha)](https://firecrawl.dev/blog/launch-week-i-day-3-introducing-map-endpoint) Day 3 saw the unveiling of our new Map endpoint, which allows you to transform a single URL into a comprehensive map of an entire website quickly. As a fast and easy way to gather all the URLs on a website, the Map endpoint opens up new possibilities for your web scraping projects. ## [Day 4: Introducing Firecrawl /v1](https://firecrawl.dev/blog/launch-week-i-day-4-introducing-firecrawl-v1) Day 4 marked a significant release: Firecrawl /v1. This more reliable and developer-friendly API makes gathering web data easier. With new scrape formats, improved crawl status, enhanced markdown parsing, v1 support for all SDKs (including new Go and Rust SDKs), and an improved developer experience, v1 enhances your web scraping workflow. ## [Day 5: Real-Time Crawling with WebSockets](https://firecrawl.dev/blog/launch-week-i-day-5-real-time-crawling-websockets) On Day 5, we introduced a new feature: Real-Time Crawling with WebSockets. Our WebSocket-based method, Crawl URL and Watch, enables real-time data extraction and monitoring, allowing you to process data immediately, react to errors quickly, and know precisely when your crawl is complete. ## [Day 6: LLM Extract (v1)](https://firecrawl.dev/blog/launch-week-i-day-6-llm-extract) Day 6 brought v1 support for LLM Extract, enabling you to extract structured data from web pages using the extract format in /scrape. With the ability to pass a schema or just provide a prompt, LLM extraction is now more flexible and powerful. ## [Day 7: Crawl Webhooks (v1)](https://firecrawl.dev/blog/launch-week-i-day-7-webhooks) We wrapped up Launch Week with the introduction of /crawl webhook support. You can now send notifications to your apps during a crawl, with four types of events: crawl.started, crawl.page, crawl.completed, and crawl.failed. This feature allows for more seamless integration of Firecrawl into your workflows. ## Wrapping Up Launch Week showcased our commitment to continually evolving and improving Firecrawl to meet the needs of our users. From collaborative features like Teams to performance improvements like increased rate limits, and from new endpoints like Map and Extract to real-time capabilities with WebSockets and Webhooks, we’ve expanded the possibilities for your web scraping projects. We’d like to thank our community for your support, feedback, and enthusiasm throughout Launch Week and beyond. Your input drives us to innovate and push the boundaries of what’s possible with web scraping. Stay tuned for more updates as we continue to shape the future of data gathering together. Happy scraping! Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Eric Ciarla image](https://www.firecrawl.dev/eric-img.jpeg)\\ Eric Ciarla@ericciarla](https://x.com/ericciarla) Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB. Previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users. ### More articles by Eric Ciarla [How to Create an llms.txt File for Any Website\\ \\ Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [Cloudflare Error 1015: How to solve it?\\ \\ Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner.](https://www.firecrawl.dev/blog/cloudflare-error-1015-how-to-solve-it) [Build an agent that checks for website contradictions\\ \\ Using Firecrawl and Claude to scrape your website's data and look for contradictions.](https://www.firecrawl.dev/blog/contradiction-agent) [Why Companies Need a Data Strategy for Generative AI\\ \\ Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses\\ \\ A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [How to easily install requests with pip and python\\ \\ A tutorial on installing the requests library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-easily-install-requests-with-pip-and-python) [How to quickly install BeautifulSoup with Python\\ \\ A guide on installing the BeautifulSoup library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python) [How to Use OpenAI's o1 Reasoning Models in Your Applications\\ \\ Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications) ## Prompt Caching Guide Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Aug 14, 2024 • [![Eric Ciarla image](https://www.firecrawl.dev/eric-img.jpeg)Eric Ciarla](https://x.com/ericciarla) # How to Use Prompt Caching and Cache Control with Anthropic Models ![How to Use Prompt Caching and Cache Control with Anthropic Models image](https://www.firecrawl.dev/images/blog/anthropic-prompt-caching.png) Anthropic recently launched prompt caching and cache control in beta, allowing you to cache large context prompts up to 200k tokens and chat with them faster and cheaper than ever before. This is a game changer for Retrieval Augmented Generation (RAG) applications that analyze large amounts of data. Currently caching is only avialable for Sonnet and Haiku but it is coming soon to Opus. To showcase the power of prompt caching, let’s walk through an example of crawling a website with Firecrawl, caching the contents with Anthropic, and having an AI assistant analyze the copy to provide suggestions for improvement. [See the code on Github.](https://github.com/ericciarla/prompt_caching_websites_anthropic) ## Setup First, make sure you have API keys for both Anthropic and Firecrawl. Store them securely in a `.env` file: ``` ANTHROPIC_API_KEY=your_anthropic_key FIRECRAWL_API_KEY=your_firecrawl_key ``` Install the required Python packages: ``` pip install python-dotenv anthropic firecrawl requests ``` ## Crawling a Website with Firecrawl Initialize the Firecrawl app with your API key: ```python app = FirecrawlApp(api_key=firecrawl_api_key) ``` Crawl a website, limiting the results to 10 pages: ```python crawl_url = 'https://dify.ai/' params = { 'crawlOptions': { 'limit': 10 } } crawl_result = app.crawl_url(crawl_url, params=params) ``` Clean up the crawl results by removing the `content` field from each entry and save it to a file: ```python cleaned_crawl_result = [{k: v for k, v in entry.items() if k != 'content'} for entry in crawl_result] with open('crawl_result.txt', 'w') as file: file.write(json.dumps(cleaned_crawl_result, indent=4)) ``` ## Caching the Crawl Data with Anthropic Load the crawl data into a string: ```python website_dump = open('crawl_result.txt', 'r').read() ``` Set up the headers for the Anthropic API request, including the `anthropic-beta` header to enable prompt caching: ```python headers = { "content-type": "application/json", "x-api-key": anthropic_api_key, "anthropic-version": "2023-06-01", "anthropic-beta": "prompt-caching-2024-07-31" } ``` Construct the API request data, adding the `website_dump` as an ephemeral cached text: ```python data = { "model": "claude-3-5-sonnet-20240620", "max_tokens": 1024, "system": [\ {\ "type": "text",\ "text": "You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.\n"\ },\ {\ "type": "text",\ "text": website_dump,\ "cache_control": {"type": "ephemeral"}\ }\ ], "messages": [\ {\ "role": "user",\ "content": "How can I improve the copy on this website?"\ }\ ] } ``` Make the API request and print the response: ```python response = requests.post( "https://api.anthropic.com/v1/messages", headers=headers, data=json.dumps(data) ) print(response.json()) ``` The key parts here are: 1. Including the `anthropic-beta` header to enable prompt caching 2. Adding the large `website_dump` text as a cached ephemeral text in the `system` messages 3. Asking the assistant to analyze the cached text and provide suggestions ## Benefits of Prompt Caching By caching the large `website_dump` text, subsequent API calls can reference that data without needing to resend it each time. This makes conversations much faster and cheaper. Imagine expanding this to cache an entire knowledge base with up to 200k tokens of data. You can then have highly contextual conversations drawing from that knowledge base in a very efficient manner. The possibilities are endless! Anthropic’s prompt caching is a powerful tool for building AI applications that can process and chat about large datasets. Give it a try and see how it can enhance your projects! Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Eric Ciarla image](https://www.firecrawl.dev/eric-img.jpeg)\\ Eric Ciarla@ericciarla](https://x.com/ericciarla) Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB. Previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users. ### More articles by Eric Ciarla [How to Create an llms.txt File for Any Website\\ \\ Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [Cloudflare Error 1015: How to solve it?\\ \\ Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner.](https://www.firecrawl.dev/blog/cloudflare-error-1015-how-to-solve-it) [Build an agent that checks for website contradictions\\ \\ Using Firecrawl and Claude to scrape your website's data and look for contradictions.](https://www.firecrawl.dev/blog/contradiction-agent) [Why Companies Need a Data Strategy for Generative AI\\ \\ Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses\\ \\ A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [How to easily install requests with pip and python\\ \\ A tutorial on installing the requests library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-easily-install-requests-with-pip-and-python) [How to quickly install BeautifulSoup with Python\\ \\ A guide on installing the BeautifulSoup library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python) [How to Use OpenAI's o1 Reasoning Models in Your Applications\\ \\ Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications) ## Crunchbase Data Scraping Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Jan 31, 2025 • [![Bex Tuychiev image](https://www.firecrawl.dev/bex.jpg)Bex Tuychiev](https://x.com/bextuychiev) # Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude ![Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude image](https://www.firecrawl.dev/images/blog/company-data-scraping/company-data-scraping.jpg) ## Introduction In today’s data-driven business world, having access to accurate information about companies and their funding history is incredibly valuable. There are several online databases that track startups, investments, and company growth, containing details about millions of businesses, their funding rounds, and investors. While many of these platforms offer APIs, they can be expensive and out of reach for many users. This tutorial will show you how to build a web scraper that can gather company and funding data from public sources using Python, Firecrawl, and Claude. This guide is designed for developers who want to collect company data efficiently and ethically. By the end of this tutorial, you’ll have a working tool that can extract company details, funding rounds, and investor information from company profiles across the web. Here is the preview of the app: ![Screenshot of the Crunchbase scraping app interface showing company input fields, a scraping button, and results display area with download option](https://www.firecrawl.dev/images/blog/company-data-scraping/app_demo.png) The application provides two input methods for users - they can either upload a file containing company names or enter them directly as text. Behind the scenes, Firecrawl automatically scrapes relevant company information from public databases like Crunchbase. This scraped data is then processed by Claude, an AI assistant that generates concise company summaries. The results are displayed in a clean Streamlit interface, complete with a download option that exports all findings to a CSV file for further analysis. ## Table of Contents 1. Introduction 2. Setting up the Development Environment 3. Prerequisite: Scraping with Firecrawl 4. Building the Funding Data Scraper - Step 1: Adding brief app information - Step 2: Adding components for company name input - Step 3: Building a scraping class - Step 4: Adding a scraping button - Step 5: Creating a download button - Step 6: Generating summaries - Step 7: Deployment 5. Conclusion **Time to Complete:** ~60 minutes **Prerequisites:** - Python 3.10+ - Basic Python knowledge - API keys for Firecrawl and Claude **Important Note:** This tutorial demonstrates web scraping for educational purposes. Always review and comply with websites’ terms of service and implement appropriate rate limiting in production environments. ## Setting up the Development Environment Let’s start by setting up our development environment and installing the necessary dependencies. 1. **Create a working directory** First, create a working directory: ```bash mkdir company-data-scraper cd company-data-scraper ``` 2. **Install dependencies** We’ll use Poetry for dependency management. If you haven’t installed Poetry yet: ```bash curl -sSL https://install.python-poetry.org | python3 - ``` Then, initialize it inside the current working directory: ```bash poetry init ``` Type “^3.10” when asked for the Python version but, don’t specify the dependencies interactively. Next, install the project dependencies with the `add` command: ```bash poetry add streamlit firecrawl-py pandas pydantic openpyxl python-dotenv anthropic ``` 3. **Build the project structure** ```bash mkdir data src touch .gitignore README.md .env src/{app.py,models.py,scraper.py} ``` The created files serve the following purposes: - `data/` \- Directory to store input files and scraped results - `src/` \- Source code directory containing the main application files - `.gitignore` \- Specifies which files Git should ignore - `README.md` \- Project documentation and setup instructions - `.env` \- Stores sensitive configuration like API keys - `src/app.py` \- Main Streamlit application and UI code - `src/models.py` \- Data models and validation logic - `src/scraper.py` \- Web scraping and data collection functionality 4. **Configure environment variables** This project requires two accounts of third-party services: - [Firecrawl](https://firecrawl.dev/) for AI-powered web scraping - [Anthropic (Claude)](https://console.anthropic.com/) for summarizing scraped data Click on the hyperlinks above to create your accounts and copy/generate your API keys. Then, Inside the `.env` file in the root directory, add your API keys: ```plaintext FIRECRAWL_API_KEY=your_api_key_here ANTHROPIC_API_KEY=your_api_key_here ``` The `.env` file is used to store sensitive configuration like API keys securely.The `python-dotenv` package will automatically load these environment variables when the app starts. It should never be committed to version control so add the following line to your `.gitignore` file: ```plaintext .env ``` 5. **Start the app UI** Run the Streamlit app (which is blank just now) to ensure everything is working: ```bash poetry run streamlit run src/app.py ``` You should see the Streamlit development server start up and your default browser open to the app’s interface. Keep this tab open to see the changes we make to the app in the next steps. Now that we have our development environment set up, let’s cover how Firecrawl works, which is a prerequisite to building our app. ## Prerequisite: Scraping with Firecrawl The biggest challenge with any application that scrapes websites is maintenance. Since websites regularly update their layout and underlying HTML/CSS code, traditional scrapers break easily, making the entire app useless. Firecrawl solves this exact problem by allowing you to scrape websites using natural language. Instead of writing complex CSS selectors and XPath expressions that need constant maintenance, you can simply describe what data you want to extract in plain English. Firecrawl’s AI will figure out how to get that data from the page, even if the website’s structure changes. This makes our scraper much more reliable and easier to maintain over time. Here is a simple Firecrawl workflow we will later use in the app to scrape company information: ```python from firecrawl import FirecrawlApp from pydantic import BaseModel, Field from dotenv import load_dotenv load_dotenv() ``` ```plaintext True ``` We import `FirecrawlApp` to interact with the Firecrawl API for web scraping. `BaseModel` and `Field` from pydantic help us define structured data models with validation. The `load_dotenv` function loads environment variables from our `.env` file to securely access API keys. ```python # Define the data structure we want to extract class CompanyData(BaseModel): name: str = Field(description="Company name") funding_total: str = Field(description="Total funding amount") employee_count: str = Field(description="Number of employees") industry: str = Field(description="Primary industry or sector") founded_year: str = Field( description="Year the company was founded" ) # Initialize Firecrawl ``` Next, we define a Pydantic data model specifying the fields we want to extract from a website. Firecrawl will follow this schema to the letter - detecting the relevant HTML/CSS selectors containing this information and returning them in a simple JSON object. Here, the `Field` descriptions written in plain English are important as they guide the underlying Firecrawl AI to capture the required fields. ```python app = FirecrawlApp() # Scrape company data from Crunchbase data = app.extract( urls=["https://www.crunchbase.com/organization/openai"], params={ "schema": CompanyData.model_json_schema(), # Use our schema for extraction "prompt": "Extract key company information from the page", }, ) ``` We then initialize a `FirecrawlApp` instance and call its `extract` method, passing in the URL for OpenAI’s Crunchbase page. The `params` dictionary configures the scraping behavior - we provide our `CompanyData` schema to guide the structured data extraction. We also include a prompt to help direct the extraction process. The scraped data is returned in a format matching our schema, which we can then parse into a CompanyData object for easy access to the extracted fields, as shown in the following code block. ```python # Access the extracted data company = CompanyData(**data["data"]) print(f"Company: {company.name}") print(f"Funding: {company.funding_total}") print(f"Employees: {company.employee_count}") print(f"Industry: {company.industry}") print(f"Founded: {company.founded_year}") ``` ```plaintext Company: OpenAI Funding: null Employees: 251-500 Industry: Artificial Intelligence (AI) Founded: 2015 ``` In a later step, we will integrate this process into our app but will use the `batch_scrape_urls` method instead of `extract` to enable concurrent scraping. ## Building the Funding Data Scraper Step-by-Step We will take a top-down approach to building the app: starting with the high-level UI components and user flows, then implementing the underlying functionality piece by piece. This approach will help us validate the app’s usability early and ensure we’re building exactly what users need. ### Step 1: Adding brief app information We turn our focus to the `src/app.py` file and make the following imports: ```python import streamlit as st import pandas as pd import anthropic from typing import List from dotenv import load_dotenv # from scraper import CrunchbaseScraper load_dotenv() ``` The imports above serve the following purposes: - `streamlit`: Provides the web interface components and app framework - `pandas`: Used for data manipulation and CSV file handling - `anthropic`: Client library for accessing Claude AI capabilities - `typing.List`: Type hint for lists to improve code readability - `dotenv`: Loads environment variables from `.env` file for configuration Currently, the `CrunchbaseScraper` class is commented out since we are yet to write it. Next, we create a `main` function that holds the core UI components: ```python def main(): st.title("Crunchbase Company Data Scraper") st.write( """ Extract detailed company information from Crunchbase including funding data, employee counts, industries, and more. Upload a file with company names or enter them manually below. """ ) ``` Right now, the function gives brief info about the app’s purpose. To run the app, add the following `main` block to the end of `src/app.py`: ```python if __name__ == "__main__": main() ``` You should see the change in the Streamlit development server. ### Step 2: Adding components for company name input In this step, we add a new function to `src/app.py`: ```python def load_companies(file) -> List[str]: """Load company names from uploaded file""" companies = [] for line in file: company = line.decode("utf-8").strip() if company: # Skip empty lines companies.append(company) return companies ``` The `load_companies` function takes a file object as input and parses it line by line, extracting company names. It decodes each line from bytes to UTF-8 text, strips whitespace, and skips any empty lines. The function returns a list of company names that can be used for scraping Crunchbase data. Now, make the following changes to the `main` function: ```python def main(): st.title("Crunchbase Company Data Scraper") st.write( """ Extract detailed company information from Crunchbase including funding data, employee counts, industries, and more. Upload a file with company names or enter them manually below. """ ) # File upload option uploaded_file = st.file_uploader( "Upload a text file with company names (one per line)", type=["txt"] ) # Manual input option st.write("### Or Enter Companies Manually") manual_input = st.text_area( "Enter company names (one per line)", height=150, help="Enter each company name on a new line", ) ``` In this version, we’ve added two main ways for users to input company names: file upload and manual text entry. The file upload component accepts `.txt` files and for manual entry, users can type or paste company names directly into a text area, with each name on a new line. This provides flexibility for users whether they have a prepared list or want to enter names ad-hoc. Furthermore, add these two blocks of code after the input components: ```python def main(): ... companies = [] if uploaded_file: companies = load_companies(uploaded_file) st.write(f"Loaded {len(companies)} companies from file") elif manual_input: companies = [line.strip() for line in manual_input.split("\n") if line.strip()] st.write(f"Found {len(companies)} companies in input") ``` This code block processes the user input to create a list of company names. When a file is uploaded, it uses the `load_companies()` function to read and parse the file contents. For manual text input, it splits the input text by newlines and strips whitespace to extract company names. In both cases, it displays a message showing how many companies were found. The companies list will be used later for scraping data from funding data sources. ### Step 3: Building a scraping class with Firecrawl Let’s take a look at the snapshot of the final UI once again: ![Screenshot of the Crunchbase scraping app interface showing company input fields, a scraping button, and results display area with download option](https://www.firecrawl.dev/images/blog/company-data-scraping/app_demo.png) In this step, we implement the backend process that happens when a user clicks on “Start scraping” button. To do so, we use Firecrawl like we outlined in the prerequisites section. First, go to `src/models.py` script to write the data model we are going to use to scrape company and funding information: ```python from pydantic import BaseModel from typing import List, Optional class CompanyData(BaseModel): name: str about: Optional[str] employee_count: Optional[str] financing_type: Optional[str] industries: List[str] = [] headquarters: List[str] = [] founders: List[str] = [] founded_date: Optional[str] operating_status: Optional[str] legal_name: Optional[str] stock_symbol: Optional[str] acquisitions: List[str] = [] investments: List[str] = [] exits: List[str] = [] total_funding: Optional[str] contacts: List[str] = [] ``` This data model is more detailed and tries to extract as much information as possible from given sources. Now, switch to `src/scraper.py` where we implement a class called `CrunchbaseScraper`: ```python from firecrawl import FirecrawlApp from models import CompanyData from typing import List, Dict class CrunchbaseScraper: def __init__(self): self.app = FirecrawlApp() def scrape_companies(self, urls: List[str]) -> List[Dict]: """Scrape multiple Crunchbase company profiles""" schema = CompanyData.model_json_schema() try: data = self.app.batch_scrape_urls( urls, params={ "formats": ["extract"], "extract": { "prompt": """Extract information from given pages based on the schema provided.""", "schema": schema, }, }, ) return [res["extract"] for res in data["data"]] except Exception as e: print(f"Error while scraping companies: {str(e)}") return [] ``` Let’s break down how the class works. When the class is initialized, it creates an instance of `FirecrawlApp`. The main method `scrape_companies` takes a list of URLs and returns a list of dictionaries containing the scraped data. It works by: 1. Getting the JSON schema from our `CompanyData` model to define the structure 2. Using `batch_scrape_urls` to process multiple URLs at once 3. Configuring the scraper to use the “extract” format with our schema 4. Providing a prompt that instructs the scraper how to extract the data 5. Handling any errors that occur during scraping Error handling ensures the script continues running even if individual URLs fail, returning an empty list in case of errors rather than crashing. Now, the only thing left to do to finalize the scraping feature is to add the “Start Scraping” button to the UI. ### Step 4: Adding a button to start scraping In this step, return to `src/app.py` and add the following code block to the very end of the `main()` function: ```python def main(): ... if companies and st.button("Start Scraping"): scraper = CrunchbaseScraper() with st.spinner("Scraping company data from Crunchbase..."): try: # Convert company names to Crunchbase URLs urls = [\ f"https://www.crunchbase.com/organization/{name.lower().replace(' ', '-')}"\ for name in companies\ ] results = scraper.scrape_companies(urls) except Exception as e: st.error(f"An error occurred: {str(e)}") ``` This code block builds on the previous functionality by adding the core scraping logic. When the “Start Scraping” button is clicked (and companies have been provided), it: 1. Creates a new instance of our `CrunchbaseScraper` class 2. Shows a loading spinner to indicate scraping is in progress 3. Converts the company names into proper Crunchbase URLs by: - Converting to lowercase - Replacing spaces with hyphens - Adding the base Crunchbase URL prefix 4. Calls the `scrape_companies` method we created earlier to fetch the data The try-except block ensures any scraping errors are handled gracefully rather than crashing the application. This is important since web scraping can be unpredictable due to network issues, rate limiting, and so on. To finish this step, uncomment the single import at the top of `src/app.py` so that they look like this: ```python import streamlit as st import pandas as pd import anthropic from typing import List from scraper import CrunchbaseScraper from dotenv import load_dotenv load_dotenv() ``` ### Step 5: Creating a download button for the scraped results Now, we must create a button to download the scraped results as a CSV file. To do so, add the following code block after the scraping part: ```python def main(): ... if companies and st.button("Start Scraping"): scraper = CrunchbaseScraper() with st.spinner("Scraping company data from Crunchbase..."): try: ... results = scraper.scrape_companies(urls) # THIS PART IS NEW df = pd.DataFrame(results) csv = df.to_csv(index=False) # Create download button st.download_button( "Download Results (CSV)", csv, "crunchbase_data.csv", "text/csv", key="download-csv", ) except Exception as e: st.error(f"An error occurred: {str(e)}") ``` In the new lines of code, we convert the results to a Pandas dataframe and use its `to_csv()` function to save the dataframe as a CSV file. The method returns a filename, which we pass to the `st.download_button` method along with other details. ### Step 6: Generating a summary of scraped results After scraping the raw company data, we can use Claude to generate concise summaries that highlight key insights. Let’s add this functionality to our app. First, create a new function in `src/app.py` to handle the summarization: ```python def generate_company_summary(company_data: dict) -> str: """Generate a summary of the company data""" client = anthropic.Anthropic() message = client.messages.create( model="claude-3-5-sonnet-20241022", max_tokens=1000, system="You are a company & funding data expert. Summarize the given company data by the user in a few sentences.", messages=[\ {"role": "user", "content": [{"type": "text", "text": str(company_data)}]}\ ], ) return message.content[0].text ``` Now, update the scraping section in the `main()` function to include the summary generation after the download button: ```python def main(): ... if companies and st.button("Start Scraping"): scraper = CrunchbaseScraper() with st.spinner("Scraping company data from Crunchbase..."): try: ... # Give summary of each company for company in results: summary = generate_company_summary(company) st.write(f"### Summary of {company['name']}") st.write(summary) except Exception as e: st.error(f"An error occurred: {str(e)}") ``` This implementation: 1. Creates a new `generate_company_summary()` function that: - Formats the scraped company data into readable text - Uses Claude to analyze the data and generate insights - Returns a structured summary highlighting key patterns. 2. Updates the main scraping workflow to: - Generate the summary after scraping is complete - Display the insights for each company after the download button The summary can provide context about the scraped data, helping users get the gist of the scraped data. ### Step 7: Deploying the app to Streamlit Cloud Now that our app is working locally, let’s deploy it to Streamlit Cloud so others can use it. First, we need to prepare our project for deployment. 1. **Create a requirements.txt** Since Streamlit Cloud doesn’t support Poetry directly, we need to convert our dependencies to a `requirements.txt` file. Run this command in your terminal: ```bash poetry export -f requirements.txt --output requirements.txt --without-hashes ``` 2. **Create a GitHub repository** Initialize a Git repository and push your code to GitHub: ```python git init git add . git commit -m "Initial commit" git branch -M main git remote add origin https://github.com/yourusername/company-data-scraper.git git push -u origin main ``` 3. **Add secrets to Streamlit Cloud** Visit [share.streamlit.io](https://share.streamlit.io/) and connect your GitHub account. Then: 1. Click “New app” 2. Select your repository and branch 3. Set the main file path as src/app.py 4. Click “Advanced settings” and add your environment variables: - `FIRECRAWL_API_KEY` - `ANTHROPIC_API_KEY` 4. **Update imports for deployment** Sometimes local imports need adjustment for Streamlit Cloud. Ensure your imports in src/app.py use relative paths: ```python from .models import CompanyData from .scraper import CrunchbaseScraper ``` 5. **Add a .streamlit/config.toml file** Create a `.streamlit` directory and add a `config.toml` file for custom theme settings: ```python [theme] primaryColor = "#FF4B4B" backgroundColor = "#FFFFFF" secondaryBackgroundColor = "#F0F2F6" textColor = "#262730" font = "sans serif" [server] maxUploadSize = 5 ``` 6. **Create a README.md file** Add a README.md file to help users understand your app: ```markdown # Crunchbase Company Data Scraper A Streamlit app that scrapes company information and funding data from Crunchbase. ## Features - Bulk scraping of company profiles - AI-powered data summarization - CSV export functionality - Clean, user-friendly interface ## Setup 1. Clone the repository 2. Install dependencies: `pip install -r requirements.txt` 3. Set up environment variables in `.env`: - `FIRECRAWL_API_KEY` - `ANTHROPIC_API_KEY` 4. Run the app: `streamlit run src/app.py` ## Usage 1. Enter company names (one per line) or upload a text file 2. Click "Start Scraping" 3. View AI-generated insights 4. Download results as CSV ## License MIT ``` 7. **Deploy the app** After pushing all changes to GitHub, go back to Streamlit Cloud and: 1. Click “Deploy” 2. Wait for the build process to complete 3. Your app will be live at `https://share.streamlit.io/yourusername/company-data-scraper/main` 4. **Monitor and maintain** After deployment: - Check the app logs in Streamlit Cloud for any issues - Monitor API usage and rate limits - Update dependencies periodically - Test the app regularly with different inputs The deployed app will automatically update whenever you push changes to your GitHub repository. Streamlit Cloud provides free hosting for public repositories, making it an excellent choice for sharing your scraper with others. ## Conclusion In this tutorial, we’ve built a powerful web application that combines the capabilities of Firecrawl and Claude to extract and analyze company data at scale. By leveraging Firecrawl’s AI-powered scraping and Claude’s natural language processing, we’ve created a tool that not only gathers raw data but also provides meaningful insights about companies and their funding landscapes. The Streamlit interface makes the tool accessible to users of all technical levels, while features like bulk processing and CSV export enable efficient data collection workflows. ### Limitations and Considerations - Rate limiting: Implement appropriate delays between requests - Data accuracy: Always verify scraped data against official sources - API costs: Monitor usage to stay within budget - Maintenance: Website structure changes may require updates ### Next Steps Consider these enhancements for your implementation: - Add data validation and cleaning - Implement request caching - Add data visualizations - Include historical data tracking - Implement error retry logic ### Resources - [Firecrawl Documentation](https://firecrawl.dev/docs) - [Claude API Documentation](https://docs.anthropic.com/claude/docs) - [Streamlit Deployment Guide](https://docs.streamlit.io/streamlit-cloud) - [Firecrawl’s scrape endpoint](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint) Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Bex Tuychiev image](https://www.firecrawl.dev/bex.jpg)\\ Bex Tuychiev@bextuychiev](https://x.com/bextuychiev) Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics ### More articles by Bex Tuychiev [Building an Automated Price Tracking Tool\\ \\ Build an automated e-commerce price tracker in Python. Learn web scraping, price monitoring, and automated alerts using Firecrawl, Streamlit, PostgreSQL.](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python) [Web Scraping Automation: How to Run Scrapers on a Schedule\\ \\ Learn how to automate web scraping in Python using free tools like schedule, asyncio, cron jobs and GitHub Actions. This comprehensive guide covers local and cloud-based scheduling methods to run scrapers reliably in 2025.](https://www.firecrawl.dev/blog/automated-web-scraping-free-2025) [Automated Data Collection - A Comprehensive Guide\\ \\ Learn how to build robust automated data collection systems using modern tools and best practices. This guide covers everything from selecting the right tools to implementing scalable collection pipelines.](https://www.firecrawl.dev/blog/automated-data-collection-guide) [BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python\\ \\ Learn the key differences between BeautifulSoup4 and Scrapy for web scraping in Python. Compare their features, performance, and use cases to choose the right tool for your web scraping needs.](https://www.firecrawl.dev/blog/beautifulsoup4-vs-scrapy-comparison) [How to Build an Automated Competitor Price Monitoring System with Python\\ \\ Learn how to build an automated competitor price monitoring system in Python that tracks prices across e-commerce sites, provides real-time comparisons, and maintains price history using Firecrawl, Streamlit, and GitHub Actions.](https://www.firecrawl.dev/blog/automated-competitor-price-scraping) [Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude\\ \\ Learn how to build a web scraper in Python that gathers company details, funding rounds, and investor information from public sources like Crunchbase using Firecrawl and Claude for automated data collection and analysis.](https://www.firecrawl.dev/blog/crunchbase-scraping-with-firecrawl-claude) [How to Create Custom Instruction Datasets for LLM Fine-tuning\\ \\ Learn how to build high-quality instruction datasets for fine-tuning large language models (LLMs). This guide covers when to create custom datasets, best practices for data collection and curation, and a practical example of building a code documentation dataset.](https://www.firecrawl.dev/blog/custom-instruction-datasets-llm-fine-tuning) [Data Enrichment: A Complete Guide to Enhancing Your Data Quality\\ \\ Learn how to enrich your data quality with a comprehensive guide covering data enrichment tools, best practices, and real-world examples. Discover how to leverage modern solutions like Firecrawl to automate data collection, validation, and integration for better business insights.](https://www.firecrawl.dev/blog/complete-guide-to-data-enrichment) ## Automated Price Tracking Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Dec 9, 2024 • [![Bex Tuychiev image](https://www.firecrawl.dev/bex.jpg)Bex Tuychiev](https://x.com/bextuychiev) # Building an Automated Price Tracking Tool ![Building an Automated Price Tracking Tool image](https://www.firecrawl.dev/images/blog/price-tracking/price-tracking.jpg) There is a lot to be said about the psychology of discounts. For example, buying a discounted item we don’t need isn’t saving money at all - it’s falling for one of the oldest sales tactics. However, there are legitimate cases where waiting for a price drop on items you actually need makes perfect sense. The challenge is that e-commerce websites run flash sales and temporary discounts constantly, but these deals often disappear as quickly as they appear. Missing these brief windows of opportunity can be frustrating. That’s where automation comes in. In this guide, we’ll build a Python application that monitors product prices across any e-commerce website and instantly notifies you when prices drop on items you’re actually interested in. Here is a sneak peek of the app: ![Screenshot of a minimalist price tracking application showing product listings, price history charts, and notification controls for monitoring e-commerce deals using Firecrawl](https://www.firecrawl.dev/images/blog/price-tracking/sneak-peek.png) The app has a simple appearance but provides complete functionality: - It has a minimalistic UI to add or remove products from the tracker - A simple dashboard to display price history for each product - Controls for setting the price drop threshold in percentages - A notification system that sends Discord alerts when a tracked item’s price drops - A scheduling system that updates the product prices on an interval you specify - Runs for free for as long as you want Even though the title says “Amazon price tracker” (full disclosure: I was forced to write that for SEO purposes), the app will work for any e-commerce website you can imagine (except Ebay, for some reason). So, let’s get started building this Amazon price tracker. ## The Toolstack We Will Use The app will be built using Python and these libraries:: - [Streamlit](https://www.streamlit.io/) for the UI - [Firecrawl](https://www.firecrawl.dev/) for AI-based scraping of e-commerce websites - [SQLAlchemy](https://www.sqlalchemy.org/) for database management In addition to Python, we will use these platforms: - Discord for notifications - GitHub for hosting the app - GitHub Actions for running the app on a schedule - Supabase for hosting a free Postgres database instance ## Building an Amazon Price Tracker App Step-by-step Since this project involves multiple components working together, we’ll take a top-down approach rather than building individual pieces first. This approach makes it easier to understand how everything fits together, since we’ll introduce each tool only when it’s needed. The benefits of this strategy will become clear as we progress through the tutorial. ### Step 1: Setting up the environment First, let’s create a dedicated environment on our machines to work on the project: ```bash mkdir automated-price-tracker cd automated-price-tracker python -m venv .venv source .venv/bin/activate ``` These commands create a working directory and activate a virtual environment. Next, create a new script called `ui.py` for designing the user interface with Streamlit. ```bash touch ui.py ``` Then, install Streamlit: ```bash pip install streamlit ``` Next, create a `requirements.txt` file and add Streamlit as the first dependency: ```bash touch requirements.txt echo "streamlit\n" >> requirements.txt ``` Since the code will be hosted on GitHub, we need to initialize Git and create a `.gitignore` file: ```bash git init touch .gitignore echo ".venv" >> .gitignore # Add the virtual env folder git commit -m "Initial commit" ``` ### Step 2: Add a sidebar to the UI for product input Let’s take a look at the final product one more time: ![A screenshot of an Amazon price tracker web application showing a sidebar for adding product URLs and a main dashboard displaying tracked products with price history charts. Created with streamlit and firecrawl](https://www.firecrawl.dev/images/blog/price-tracking/sneak-peek.png) It has two sections: the sidebar and the main dashboard. Since the first thing you do when launching this app is adding products, we will start building the sidebar first. Open `ui.py` and paste the following code: ```python import streamlit as st # Set up sidebar with st.sidebar: st.title("Add New Product") product_url = st.text_input("Product URL") add_button = st.button("Add Product") # Main content st.title("Price Tracker Dashboard") st.markdown("## Tracked Products") ``` The code snippet above sets up a basic Streamlit web application with two main sections. In the sidebar, it creates a form for adding new products with a text input field for the product URL and an “Add Product” button. The main content area contains a dashboard title and a section header for tracked products. The code uses Streamlit’s `st.sidebar` context manager to create the sidebar layout and basic Streamlit components like `st.title`, `st.text_input`, and `st.button` to build the user interface elements. To see how this app looks like, run the following command: ```bash streamlit run ui.py ``` Now, let’s add a commit to save our progress: ```bash git add . git commit -m "Add a sidebar to the basic UI" ``` ### Step 3: Add a feature to check if input URL is valid In the next step, we want to add some restrictions to the input field like checking if the passed URL is valid. For this, create a new file called `utils.py` where we write additional utility functions for our app: ```bash touch utils.py ``` Inside the script, paste following code: ```bash # utils.py from urllib.parse import urlparse import re def is_valid_url(url: str) -> bool: try: # Parse the URL result = urlparse(url) # Check if scheme and netloc are present if not all([result.scheme, result.netloc]): return False # Check if scheme is http or https if result.scheme not in ["http", "https"]: return False # Basic regex pattern for domain validation domain_pattern = ( r"^[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(\.[a-zA-Z]{2,})+$" ) if not re.match(domain_pattern, result.netloc): return False return True except Exception: return False ``` The above function `is_valid_url()` validates URLs by checking several criteria: 1. It verifies the URL has both a scheme ( `http`/ `https`) and domain name 2. It ensures the scheme is specifically `http` or `https` 3. It validates the domain name format using regex to check for valid characters and TLD 4. It returns True only if all checks pass, False otherwise Let’s use this function in our `ui.py` file. Here is the modified code: ```python import streamlit as st from utils import is_valid_url # Set up sidebar with st.sidebar: st.title("Add New Product") product_url = st.text_input("Product URL") add_button = st.button("Add Product") if add_button: if not product_url: st.error("Please enter a product URL") elif not is_valid_url(product_url): st.error("Please enter a valid URL") else: st.success("Product is now being tracked!") # Main content ... ``` Here is what’s new: 1. We added URL validation using the `is_valid_url()` function from `utils.py` 2. When the button is clicked, we perform validation: - Check if URL is empty - Validate URL format using `is_valid_url()` 3. User feedback is provided through error/success messages: - Error shown for empty URL - Error shown for invalid URL format - Success message when URL passes validation Rerun the Streamlit app again and see if our validation works. Then, return to your terminal to commit the changes we’ve made: ```bash git add . git commit -m "Add a feature to check URL validity" ``` ### Step 4: Scrape the input URL for product details When a valid URL is entered and the add button is clicked, we need to implement product scraping functionality instead of just showing a success message. The system should: 1. Immediately scrape the product URL to extract key details: - Product name - Current price - Main product image - Brand name - Other relevant attributes 2. Store these details in a database to enable: - Regular price monitoring - Historical price tracking - Price change alerts - Product status updates For the scraper, we will use [Firecrawl](https://www.firecrawl.dev/), an AI-based scraping API for extracting webpage data without HTML parsing. This solution provides several advantages: 1. No website HTML code analysis required for element selection 2. Resilient to HTML structure changes through AI-based element detection 3. Universal compatibility with product webpages due to structure-agnostic approach 4. Reliable website blocker bypass via robust API infrastructure First, create a new file called `scraper.py`: ```bash touch scraper.py ``` Then, install these three libraries: ```bash pip install firecrawl-py pydantic python-dotenv echo "firecrawl-py\npydantic\npython-dotenv\n" >> requirements.txt # Add them to dependencies ``` `firecrawl-py` is the Python SDK for Firecrawl scraping engine, `pydantic` is a data validation library that helps enforce data types and structure through Python class definitions, and `python-dotenv` is a library that loads environment variables from a `.env` file into your Python application. With that said, head over to the Firecrawl website and [sign up for a free account](https://www.firecrawl.dev/) (the free plan will work fine). You will be given an API key, which you should copy. Then, create a `.env` file in your terminal and add the API key as an environment variable: ```bash touch .env echo "FIRECRAWL_API_KEY='YOUR-API-KEY-HERE' >> .env" echo ".env" >> .gitignore # Ignore .env files in Git ``` The `.env` file is used to securely store sensitive configuration values like API keys that shouldn’t be committed to version control. By storing the Firecrawl API key in `.env` and adding it to `.gitignore`, we ensure it stays private while still being accessible to our application code. This is a security best practice to avoid exposing credentials in source control. Now, we can start writing the `scraper.py`: ```python from firecrawl import FirecrawlApp from pydantic import BaseModel, Field from dotenv import load_dotenv from datetime import datetime load_dotenv() app = FirecrawlApp() ``` Here, `load_dotenv()` function reads the `.env` file you have in your working directory and loads the environment variables inside, including the Firecrawl API key. When you create an instance of `FirecrawlApp` class, the API key is automatically detected to establish a connection between your script and the scraping engine in the form of the `app` variable. Now, we create a Pydantic class (usually called a model) that defines the details we want to scrape from each product: ```python class Product(BaseModel): """Schema for creating a new product""" url: str = Field(description="The URL of the product") name: str = Field(description="The product name/title") price: float = Field(description="The current price of the product") currency: str = Field(description="Currency code (USD, EUR, etc)") main_image_url: str = Field(description="The URL of the main image of the product") ``` Pydantic models may be completely new to you, so let’s break down the `Product` model: - The `url` field stores the product page URL we want to track - The `name` field stores the product title/name that will be scraped - The `price` field stores the current price as a float number - The `currency` field stores the 3-letter currency code (e.g. USD, EUR) - The `main_image_url` field stores the URL of the product’s main image Each field is typed and has a description that documents its purpose. The `Field` class from Pydantic allows us to add metadata like descriptions to each field. These descriptions are especially important for Firecrawl since it uses them to automatically locate the relevant HTML elements containing the data we want. Now, let’s create a function to call the engine to scrape URL’s based on the schema above: ```python def scrape_product(url: str): extracted_data = app.scrape_url( url, params={ "formats": ["extract"], "extract": {"schema": Product.model_json_schema()}, }, ) # Add the scraping date to the extracted data extracted_data["extract"]["timestamp"] = datetime.utcnow() return extracted_data["extract"] if __name__ == "__main__": product = "https://www.amazon.com/gp/product/B002U21ZZK/" print(scrape_product(product)) ``` The code above defines a function called `scrape_product` that takes a URL as input and uses it to scrape product information. Here’s how it works: The function calls `app.scrape_url` with two parameters: 1. The product URL to scrape 2. A params dictionary that configures the scraping: - It specifies we want to use the “extract” format - It provides our `Product` Pydantic model schema as the extraction template as a JSON object The scraper will attempt to find and extract data that matches our Product schema fields - the URL, name, price, currency, and image URL. The function returns just the “extract” portion of the scraped data, which contains the structured product information. `extract` returns a dictionary to which we add the date of the scraping as it will be important later on. Let’s test the script by running it: ```bash python scraper.py ``` You should get an output like this: ```python { 'url': 'https://www.amazon.com/dp/B002U21ZZK', 'name': 'MOVA Globe Earth with Clouds 4.5"', 'price': 212, 'currency': 'USD', 'main_image_url': 'https://m.media-amazon.com/images/blog/price-tracking/I/41bQ3Y58y3L._AC_.jpg', 'timestamp': '2024-12-05 13-20' } ``` The output shows that a [MOVA Globe](https://www.amazon.com/dp/B002U21ZZK) costs $212 USD on Amazon at the time of writing this article. You can test the script for any other website that contains the information we are looking (except Ebay): - Price - Product name/title - Main image URL One key advantage of using Firecrawl is that it returns data in a consistent dictionary format across all websites. Unlike HTML-based scrapers like BeautifulSoup or Scrapy which require custom code for each site and can break when website layouts change, Firecrawl uses AI to understand and extract the requested data fields regardless of the underlying HTML structure. Finish this step by committing the new changes to Git: ```bash git add . git commit -m "Implement a Firecrawl scraper for products" ``` ### Step 5: Storing new products in a PostgreSQL database If we want to check product prices regularly, we need to have an online database. In this case, Postgres is the best option since it’s reliable, scalable, and has great support for storing time-series data like price histories. There are many platforms for hosting Postgres instances but the one I find the easiest and fastest to set up is Supabase. So, please head over to [the Supabase website](https://supabase.com/) and create your free account. During the sign-up process, you will be given a password, which you should save somewhere safe on your machine. Then, in a few minutes, your free Postgres instance comes online. To connect to this instance, click on Home in the left sidebar and then, “Connect”: ![Screenshot of Supabase dashboard showing database connection settings and credentials for connecting to a PostgreSQL database instance](https://www.firecrawl.dev/images/blog/price-tracking/supabase_connect.png) You will be shown your database connection string with a placeholder for the password you copied. You should paste this string in your `.env` file with your password added to the `.env` file: ```bash echo POSTGRES_URL="THE-SUPABASE-URL-STRING-WITH-YOUR-PASSWORD-ADDED" ``` Now, the easiest way to interact with this database is through SQLAlchemy. Let’s install it: ```bash pip install "sqlalchemy==2.0.35" psycopg2-binary echo "psycopg2-binary\nsqlalchemy==2.0.35\n" >> requirements.txt ``` > Note: [SQLAlchemy](https://sqlalchemy.org/) is a Python SQL toolkit and Object-Relational Mapping (ORM) library that lets us interact with databases using Python code instead of raw SQL. For our price tracking project, it provides essential features like database connection management, schema definition through Python classes, and efficient querying capabilities. This makes it much easier to store and retrieve product information and price histories in our Postgres database. After the installation, create a new `database.py` file for storing database-related functions: ```bash touch database.py ``` Let’s populate this script: ```python from sqlalchemy import create_engine, Column, String, Float, DateTime, ForeignKey from sqlalchemy.orm import sessionmaker, relationship, declarative_base from datetime import datetime Base = declarative_base() class Product(Base): __tablename__ = "products" url = Column(String, primary_key=True) prices = relationship( "PriceHistory", back_populates="product", cascade="all, delete-orphan" ) class PriceHistory(Base): __tablename__ = "price_histories" id = Column(String, primary_key=True) product_url = Column(String, ForeignKey("products.url")) name = Column(String, nullable=False) price = Column(Float, nullable=False) currency = Column(String, nullable=False) main_image_url = Column(String) timestamp = Column(DateTime, nullable=False) product = relationship("Product", back_populates="prices") ``` The code above defines two SQLAlchemy models for our price tracking database: The `Product` model acts as a registry of all items we want to track. It’s kept simple with just the URL as we don’t want to duplicate data that changes over time. The `PriceHistory` model stores the actual price data points and product details at specific moments in time. This separation allows us to: - Track how product details (name, price, image) change over time - Maintain a clean historical record for each product - Efficiently query price trends without loading unnecessary data Each record in `PriceHistory` contains: - A unique ID as primary key - The product URL as a foreign key linking to the `Product` - The product name - The price value and currency - The main product image URL - A timestamp of when the price was recorded The relationship between `Product` and `PriceHistory` is bidirectional, allowing easy navigation between related records. The `cascade` setting ensures price histories are deleted when their product is deleted. These models provide the structure for storing and querying our price tracking data in a PostgreSQL database using SQLAlchemy’s ORM capabilities. Now, we define a `Database` class with a singe `add_product` method: ```python class Database: def __init__(self, connection_string): self.engine = create_engine(connection_string) Base.metadata.create_all(self.engine) self.Session = sessionmaker(bind=self.engine) def add_product(self, url): session = self.Session() try: # Create the product entry product = Product(url=url) session.merge(product) # merge will update if exists, insert if not session.commit() finally: session.close() ``` The `Database` class above provides core functionality for managing product data in our PostgreSQL database. It takes a connection string in its constructor to establish the database connection using SQLAlchemy. The `add_product` method allows us to store new product URLs in the database. It uses SQLAlchemy’s `merge` functionality which intelligently handles both inserting new products and updating existing ones, preventing duplicate entries. The method carefully manages database sessions, ensuring proper resource cleanup by using `try`/ `finally` blocks. This prevents resource leaks and maintains database connection stability. Let’s use this method inside the sidebar of our UI. Switch to `ui.py` and make the following adjustments: First, update the imports to load the Database class and initialize it: ```python import os import streamlit as st from utils import is_valid_url from database import Database from dotenv import load_dotenv load_dotenv() with st.spinner("Loading database..."): db = Database(os.getenv("POSTGRES_URL")) ``` The code integrates the `Database` class into the Streamlit UI by importing required dependencies and establishing a database connection. The database URL is loaded securely from environment variables using `python-dotenv`. The `Database` class creates or updates the tables we specified in `database.py` after being initialized. The database initialization process is wrapped in a Streamlit spinner component to maintain responsiveness while establishing the connection. This provides visual feedback during the connection setup period, which typically requires a brief initialization time. Then, in the sidebar code, we only need to add a single line of code to add the product to the database if the URL is valid: ```python # Set up sidebar with st.sidebar: st.title("Add New Product") product_url = st.text_input("Product URL") add_button = st.button("Add Product") if add_button: if not product_url: st.error("Please enter a product URL") elif not is_valid_url(product_url): st.error("Please enter a valid URL") else: db.add_product(product_url) # This is the new line st.success("Product is now being tracked!") ``` In the final `else` block that runs when the product URL is valid, we call the `add_product` method to store the product in the database. Let’s commit everything: ```bash git add . git commit -m "Add a Postgres database integration for tracking product URLs" ``` ### Step 6: Storing price histories for new products Now, after the product is added to the `products` table, we want to add its details and its scraped price to the `price_histories` table. First, switch to `database.py` and add a new method for creating entries in the `PriceHistories` table: ```python class Database: ... # the rest of the class def add_price(self, product_data): session = self.Session() try: price_history = PriceHistory( id=f"{product_data['url']}_{product_data['timestamp']}", product_url=product_data["url"], name=product_data["name"], price=product_data["price"], currency=product_data["currency"], main_image_url=product_data["main_image_url"], timestamp=product_data["timestamp"], ) session.add(price_history) session.commit() finally: session.close() ``` The `add_price` method takes a dictionary containing product data (which is returned by our scraper) and creates a new entry in the `PriceHistory` table. The entry’s ID is generated by combining the product URL with a timestamp. The method stores essential product information like name, price, currency, image URL, and the timestamp of when the price was recorded. It uses SQLAlchemy’s session management to safely commit the new price history entry to the database. Now, we need to add this functionality to the sidebar as well. In `ui.py`, add a new import statement that loads the `scrape_product` function from `scraper.py`: ```python ... # The rest of the imports from scraper import scrape_product ``` Then, update the `else` block in the sidebar again: ```python with st.sidebar: st.title("Add New Product") product_url = st.text_input("Product URL") add_button = st.button("Add Product") if add_button: if not product_url: st.error("Please enter a product URL") elif not is_valid_url(product_url): st.error("Please enter a valid URL") else: db.add_product(product_url) with st.spinner("Added product to database. Scraping product data..."): product_data = scrape_product(product_url) db.add_price(product_data) st.success("Product is now being tracked!") ``` Now when a user enters a product URL and clicks the “Add Product” button, several things happen: 1. The URL is validated to ensure it’s not empty and is properly formatted. 2. If valid, the URL is added to the products table via `add_product()`. 3. The product page is scraped immediately to get current price data. 4. This initial price data is stored in the price history table via `add_price()`. 5. The user sees loading spinners and success messages throughout the process. This gives us a complete workflow for adding new products to track, including capturing their initial price point. The UI provides clear feedback at each step and handles errors gracefully. Check that everything is working the way we want it and then, commit the new changes: ```bash git add . git commit -m "Add a feature to track product prices after they are added" ``` ### Step 7: Displaying each product’s price history in the main dashboard Let’s take a look at the final product shown in the introduction once again: ![Screenshot of a minimalist price tracking dashboard showing product price history charts, add/remove product controls, and notification settings for monitoring e-commerce deals and price drops](https://www.firecrawl.dev/images/blog/price-tracking/sneak-peek.png) Apart from the sidebar, the main dashboard shows each product’s price history visualized with a Plotly line plot where the X axis is the timestamp while the Y axis is the prices. Each line plot is wrapped in a Streamlit component that includes buttons for removing the product from the database or visiting its source URL. In this step, we will implement the plotting feature and leave the two buttons for a later section. First, add a new method to the `Database` class for retrieving the price history for each product: ```python class Database: ... # The rest of the code def get_price_history(self, url): """Get price history for a product""" session = self.Session() try: return ( session.query(PriceHistory) .filter(PriceHistory.product_url == url) .order_by(PriceHistory.timestamp.desc()) .all() ) finally: session.close() ``` The method queries the price histories table based on product URL, orders the rows in descending order (oldest first) and returns the results. Then, add another method for retrieving all products from the `products` table: ```python class Database: ... def get_all_products(self): session = self.Session() try: return session.query(Product).all() finally: session.close() ``` The idea is that every time our Streamlit app is opened, the main dashboard queries all existing products from the database and render their price histories with line charts in dedicated components. To create the line charts, we need Plotly and Pandas, so install them in your environment: ```bash pip install pandas plotly echo "pandas\nplotly\n" >> requirements.txt ``` Afterward, import them at the top of `ui.py` along with other existing imports: ```python import pandas as pd import plotly.express as px ``` Then, switch to `ui.py` and paste the following snippet of code after the Main content section: ```python # Main content st.title("Price Tracker Dashboard") st.markdown("## Tracked Products") # Get all products products = db.get_all_products() ``` Here, after the page title and subtitle is shown, we are retrieving all products from the database. Let’s loop over them: ```python # Create a card for each product for product in products: price_history = db.get_price_history(product.url) if price_history: # Create DataFrame for plotting df = pd.DataFrame( [\ {"timestamp": ph.timestamp, "price": ph.price, "name": ph.name}\ for ph in price_history\ ] ) ``` For each product, we get their price history with `db.get_price_history` and then, convert this data into a dataframe with three columns: - Timestamp - Price - Product name This makes plotting easier with Plotly. Next, we create a Streamlit expander component for each product: ```python # Create a card for each product for product in products: price_history = db.get_price_history(product.url) if price_history: ... # Create a card-like container for each product with st.expander(df["name"][0], expanded=False): st.markdown("---") col1, col2 = st.columns([1, 3]) with col1: if price_history[0].main_image_url: st.image(price_history[0].main_image_url, width=200) st.metric( label="Current Price", value=f"{price_history[0].price} {price_history[0].currency}", ) ``` The expander shows the product name as its title and contains: 1. A divider line 2. Two columns: - Left column: Product image (if available) and current price metric - Right column (shown in next section) The price is displayed using Streamlit’s metric component which shows the current price and currency. Here is the rest of the code: ```python ... with col2: # Create price history plot fig = px.line( df, x="timestamp", y="price", title=None, ) fig.update_layout( xaxis_title=None, yaxis_title="Price", showlegend=False, margin=dict(l=0, r=0, t=0, b=0), height=300, ) fig.update_xaxes(tickformat="%Y-%m-%d %H:%M", tickangle=45) fig.update_yaxes(tickprefix=f"{price_history[0].currency} ", tickformat=".2f") st.plotly_chart(fig, use_container_width=True) ``` In the right column, we create an interactive line plot using Plotly Express to visualize the price history over time. The plot shows price on the y-axis and timestamp on the x-axis. The layout is customized to remove the title, adjust axis labels and formatting, and optimize the display size. The timestamps are formatted to show date and time, with angled labels for better readability. Prices are displayed with 2 decimal places and a dollar sign prefix. The plot is rendered using Streamlit’s `plotly_chart` component and automatically adjusts its width to fill the container. After this step, the UI must be fully functional and ready to track products. For example, here is what mine looks like after adding a couple of products: ![Screenshot of a price tracking dashboard showing multiple product listings with price history charts, product images, and current prices for Amazon items](https://www.firecrawl.dev/images/blog/price-tracking/finished.png) But notice how the price history chart doesn’t show anything. That’s because we haven’t populated it by checking the product price in regular intervals. Let’s do that in the next couple of steps. For now, commit the latest changes we’ve made: ```bash git add . git commit -m "Display product price histories for each product in the dashboard" ``` * * * Let’s take a brief moment to summarize the steps we took so far and what’s next. So far, we’ve built a Streamlit interface that allows users to add product URLs and displays their current prices and basic information. We’ve implemented the database schema, created functions to scrape product data, and designed a clean UI with price history visualization. The next step is to set up automated price checking to populate our history charts and enable proper price tracking over time. ### Step 8: Adding new price entries for existing products Now, we want to write a script that adds new price entries in the `price_histories` table for each product in `products` table. We call this script `check_prices.py`: ```python import os from database import Database from dotenv import load_dotenv from firecrawl import FirecrawlApp from scraper import scrape_product load_dotenv() db = Database(os.getenv("POSTGRES_URL")) app = FirecrawlApp() ``` At the top, we are importing the functions and packages and initializing the database and a Firecrawl app. Then, we define a simple `check_prices` function: ```python def check_prices(): products = db.get_all_products() for product in products: try: updated_product = scrape_product(product.url) db.add_price(updated_product) print(f"Added new price entry for {updated_product['name']}") except Exception as e: print(f"Error processing {product.url}: {e}") if __name__ == "__main__": check_prices() ``` In the function body, we retrieve all products URLs, retrieve their new price data with `scrape_product` function from `scraper.py` and then, add a new price entry for the product with `db.add_price`. If you run the function once and refresh the Streamlit app, you must see a line chart appear for each product you are tracking: ![Screenshot of a price tracking dashboard showing a line chart visualization of product price history over time, with price on the y-axis and dates on the x-axis](https://www.firecrawl.dev/images/blog/price-tracking/linechart.png) Let’s commit the changes in this step: ```bash git add . git commit -m "Add a script for checking prices of existing products" ``` ### Step 9: Check prices regularly with GitHub actions GitHub Actions is a continuous integration and continuous delivery (CI/CD) platform that allows you to automate various software workflows directly from your GitHub repository. In our case, it’s particularly useful because we can set up automated price checks to run the `check_prices.py` script at regular intervals (e.g., daily or hourly) without manual intervention. This ensures we consistently track price changes and maintain an up-to-date database of historical prices for our tracked products. So, the first step is creating a new GitHub repository for our project and pushing existing code to it: ```bash git remote add origin https://github.com/yourusername/price-tracker.git git push origin main ``` Then, return to your terminal and create this directory structure: ```bash mkdir -p .github/workflows touch .github/workflows/check_prices.yml ``` The first command creates a new directory structure `.github/workflows` using the `-p` flag to create parent directories if they don’t exist. The second command creates an empty YAML file called `check_prices.yml` inside the workflows directory. GitHub Actions looks for workflow files in this specific location - any YAML files in the `.github/workflows` directory will be automatically detected and processed as workflow configurations. These YAML files define when and how your automated tasks should run, what environment they need, and what commands to execute. In our case, this file will contain instructions for GitHub Actions to periodically run our price checking script. Let’s write it: ```yaml name: Price Check on: schedule: # Runs every 3 minutes - cron: "*/3 * * * *" workflow_dispatch: # Allows manual triggering ``` Let’s break down this first part of the YAML file: The `name: Price Check` line gives our workflow a descriptive name that will appear in the GitHub Actions interface. The `on:` section defines when this workflow should be triggered. We’ve configured two triggers: 1. A schedule using cron syntax `*/3 * * * *` which runs the workflow every 3 minutes. The five asterisks represent minute, hour, day of month, month, and day of week respectively. The `*/3` means “every 3rd minute”. The 3-minute interval is for debugging purposes, we will need to choose a wider interval later on to respect the free limits of GitHub actions. 2. `workflow_dispatch` enables manual triggering of the workflow through the GitHub Actions UI, which is useful for testing or running the check on-demand. Now, let’s add the rest: ```yaml jobs: check-prices: runs-on: ubuntu-latest steps: - name: Checkout code uses: actions/checkout@v4 - name: Set up Python uses: actions/setup-python@v5 with: python-version: "3.10" cache: "pip" - name: Install dependencies run: | python -m pip install --upgrade pip pip install -r requirements.txt - name: Run price checker env: FIRECRAWL_API_KEY: ${{ secrets.FIRECRAWL_API_KEY }} POSTGRES_URL: ${{ secrets.POSTGRES_URL }} run: python check_prices.py ``` Let’s break down this second part of the YAML file: The `jobs:` section defines the actual work to be performed. We have one job named `check-prices` that runs on an Ubuntu virtual machine ( `runs-on: ubuntu-latest`). Under `steps:`, we define the sequence of actions: 1. First, we checkout our repository code using the standard `actions/checkout@v4` action 2. Then we set up Python 3.10 using `actions/setup-python@v5`, enabling pip caching to speed up dependency installation 3. Next, we install our Python dependencies by upgrading `pip` and installing requirements from our `requirements.txt` file. At this point, it is essential that you were keeping a complete dependency file based on the installs we made in the project. 4. Finally, we run our price checker script, providing two environment variables: - `FIRECRAWL_API_KEY`: For accessing the web scraping service - `POSTGRES_URL`: For connecting to our database Both variables must be stored in our GitHub repository as secrets for this workflow file to run without errors. So, navigate to the repository you’ve created for the project and open its Settings. Under “Secrets and variables” > “Actions”, click on “New repository secret” button to add the environment variables we have in the `.env` file one-by-one. Then, return to your terminal, commit the changes and push: ```bash git add . git commit -m "Add a workflow to check prices regularly" git push origin main ``` Next, navigate to your GitHub repository again and click on the “Actions” tab: ![Screenshot of GitHub Actions interface showing workflow runs and manual trigger button for automated price tracking application](https://www.firecrawl.dev/images/blog/price-tracking/actions.png) From there, you can run the workflow manually (click “Run workflow” and refresh the page). If it is executed successfully, you can return to the Streamlit app and refresh to see the new price added to the chart. ### Step 10: Setting up Discord for notifications Now that we know our scheduling workflow works, the first order of business is setting a wider check interval in the workflow file. Even though our first workflow run was manually, the rest happen automatically. ```bash on: schedule: # Runs every 6 hours - cron: "0 0,6,12,18 * * *" workflow_dispatch: # Allows manual triggering ``` The cron syntax `0 0,6,12,18 * * *` can be broken down as follows: - First `0`: Run at minute 0 - `0,6,12,18`: Run at hours 0 (midnight), 6 AM, 12 PM (noon), and 6 PM - First `*`: Run every day of the month - Second `*`: Run every month - Third `*`: Run every day of the week So this schedule will check prices four times daily: at midnight, 6 AM, noon, and 6 PM (UTC time). This spacing helps stay within GitHub Actions’ free tier limits while still catching most price changes. Now, commit and push the changes: ```bash git add . git commit -m "Set a wider check interval in the workflow file" git push origin main ``` Now comes the interesting part. Each time the workflow is run, we want to compare the current price of the product to its original price when we started tracking it. If the difference between these two prices exceeds a certain threshold like 5%, this means there is a discount happening for the product and we want to send a notification. The easiest way to set this up is by using Discord webhooks. So, if you don’t have one already, go to Discord.com and create a new account (optionally, download the desktop app as well). Then, setting up Discord notifications requires a few careful steps: 1. **Create a discord server** - Click the ”+” button in the bottom-left corner of Discord - Choose “Create My Own” → “For me and my friends” - Give your server a name (e.g., “Price Alerts”) 2. **Create a channel for alerts** - Your server comes with a #general channel by default - You can use this or create a new channel called #price-alerts - Right-click the channel you want to use 3. **Set up the webhook** - Select “Edit Channel” from the right-click menu - Go to the “Integrations” tab - Click “Create Webhook” - Give it a name like “Price Alert Bot” - The webhook URL will be generated automatically - Click “Copy Webhook URL” - this is your unique notification endpoint 4. **Secure the webhook URL** - Never share or commit your webhook URL directly - Add it to your `.env` file as `DISCORD_WEBHOOK_URL` - Add it to your GitHub repository secrets - The URL should look something like: `https://discord.com/api/webhooks/...` This webhook will serve as a secure endpoint that our price tracker can use to send notifications directly to your Discord channel. Webhooks are automated messages sent from apps to other apps in real-time. They work like a notification system - when something happens in one app, it automatically sends data to another app through a unique URL. In our case, we’ll use Discord webhooks to automatically notify us when there’s a price drop. Whenever our price tracking script detects a significant discount, it will send a message to our Discord channel through the webhook URL, ensuring we never miss a good deal. After copying the webhook URL, you should save it as environment variable to your `.env` file: ```python echo "DISCORD_WEBHOOK_URL='THE-URL-YOU-COPIED'" >> .env ``` Now, create a new file called `notifications.py` and paste the following contents: ```python from dotenv import load_dotenv import os import aiohttp import asyncio load_dotenv() async def send_price_alert( product_name: str, old_price: float, new_price: float, url: str ): """Send a price drop alert to Discord""" drop_percentage = ((old_price - new_price) / old_price) * 100 message = { "embeds": [\ {\ "title": "Price Drop Alert! 🎉",\ "description": f"**{product_name}**\nPrice dropped by {drop_percentage:.1f}%!\n"\ f"Old price: ${old_price:.2f}\n"\ f"New price: ${new_price:.2f}\n"\ f"[View Product]({url})",\ "color": 3066993,\ }\ ] } try: async with aiohttp.ClientSession() as session: await session.post(os.getenv("DISCORD_WEBHOOK_URL"), json=message) except Exception as e: print(f"Error sending Discord notification: {e}") ``` The `send_price_alert` function above is responsible for sending price drop notifications to Discord using webhooks. Let’s break down what’s new: 1. The function takes 4 parameters: - `product_name`: The name of the product that dropped in price - `old_price`: The previous price before the drop - `new_price`: The current lower price - `url`: Link to view the product 2. It calculates the percentage drop in price using the formula: `((old_price - new_price) / old_price) * 100` 3. The notification is formatted as a Discord embed - a rich message format that includes: - A title with a celebration emoji - A description showing the product name, price drop percentage, old and new prices - A link to view the product - A green color (3066993 in decimal) 4. The message is sent asynchronously using `aiohttp` to post to the Discord webhook URL stored in the environment variables 5. Error handling is included to catch and print any issues that occur during the HTTP request This provides a clean way to notify users through Discord whenever we detect a price drop for tracked products. To check the notification system works, add this main block to the end of the script: ```python if __name__ == "__main__": asyncio.run(send_price_alert("Test Product", 100, 90, "https://www.google.com")) ``` `asyncio.run()` is used here because `send_price_alert` is an async function that needs to be executed in an event loop. `asyncio.run()` creates and manages this event loop, allowing the async HTTP request to be made properly. Without it, we wouldn’t be able to use the `await` keyword inside `send_price_alert`. To run the script, install `aiohttp`: ```python pip install aiohttp echo "aiohttp\n" >> requirements.txt python notifications.py ``` If all is well, you should get a Discord message in your server that looks like this: ![Screenshot of a Discord notification showing a price drop alert with product details, original price, new discounted price and percentage savings](https://www.firecrawl.dev/images/blog/price-tracking/alert.png) Let’s commit the changes we have: ```bash git add . git commit -m "Set up Discord alert system" ``` Also, don’t forget to add the Discord webhook URL to your GitHub repository secrets! ### Step 11: Sending Discord alerts when prices drop Now, the only step left is adding a price comparison logic to `check_prices.py`. In other words, we want to use the `send_price_alert` function if the new scraped price is lower than the original. This requires a revamped `check_prices.py` script: ```python import os import asyncio from database import Database from dotenv import load_dotenv from firecrawl import FirecrawlApp from scraper import scrape_product from notifications import send_price_alert load_dotenv() db = Database(os.getenv("POSTGRES_URL")) app = FirecrawlApp() # Threshold percentage for price drop alerts (e.g., 5% = 0.05) PRICE_DROP_THRESHOLD = 0.05 async def check_prices(): products = db.get_all_products() product_urls = set(product.url for product in products) for product_url in product_urls: # Get the price history price_history = db.get_price_history(product_url) if not price_history: continue # Get the earliest recorded price earliest_price = price_history[-1].price # Retrieve updated product data updated_product = scrape_product(product_url) current_price = updated_product["price"] # Add the price to the database db.add_price(updated_product) print(f"Added new price entry for {updated_product['name']}") # Check if price dropped below threshold if earliest_price > 0: # Avoid division by zero price_drop = (earliest_price - current_price) / earliest_price if price_drop >= PRICE_DROP_THRESHOLD: await send_price_alert( updated_product["name"], earliest_price, current_price, product_url ) if __name__ == "__main__": asyncio.run(check_prices()) ``` Let’s examine the key changes in this enhanced version of `check_prices.py`: 1. New imports and setup - Added `asyncio` for `async`/ `await` support - Imported `send_price_alert` from `notifications.py` - Defined `PRICE_DROP_THRESHOLD = 0.05` (5% threshold for alerts) 2. Async function conversion - Converted `check_prices()` to async function - Gets unique product URLs using set comprehension to avoid duplicates 3. Price history analysis - Retrieves full price history for each product - Gets `earliest_price` from `history[-1]` (works because we ordered by timestamp DESC) - Skips products with no price history using `continue` 4. Price drop detection logic - Calculates drop percentage: `(earliest_price - current_price) / earliest_price` - Checks if drop exceeds 5% threshold - Sends Discord alert if threshold exceeded using `await send_price_alert()` 5. Async main block - Uses `asyncio.run()` to execute async `check_prices()` in event loop When I tested this new version of the script, I immediately got an alert: ![Screenshot of a Discord notification showing a price drop alert for an Amazon product, displaying the original and discounted prices with percentage savings](https://www.firecrawl.dev/images/blog/price-tracking/new-alert.png) Before we supercharge our workflow with the new notification system, you should add this line of code to your `check_prices.yml` workflow file to read the Discord webhook URL from your GitHub secrets: ```python ... - name: Run price checker env: FIRECRAWL_API_KEY: ${{ secrets.FIRECRAWL_API_KEY }} POSTGRES_URL: ${{ secrets.POSTGRES_URL }} DISCORD_WEBHOOK_URL: ${{ secrets.DISCORD_WEBHOOK_URL }} run: python automated_price_tracking/check_prices.py ``` Finally, let’s commit everything and push to GitHub so that our workflow is supercharged with our notification system: ```bash git add . git commit -m "Add notification system to price drops" git push origin main ``` ## Limitations of Free Tier Tools Used in the Tutorial Before wrapping up, let’s quickly review the limitations of the free tools we used in this tutorial: - GitHub Actions: Limited to 2,000 minutes per month for free accounts. Consider increasing the cron interval to stay within limits. - Supabase: Free tier includes 500MB database storage and limited row count. Monitor usage if tracking many products. - Firecrawl: Free API tier allows 500 requests per month. This means that at 6 hour intervals, you can track up to four products in the free plan. - Streamlit Cloud: Free hosting tier has some memory/compute restrictions and goes to sleep after inactivity. While these limitations exist, they’re quite generous for personal use and learning. The app will work well for tracking a reasonable number of products with daily price checks. ## Conclusion and Next Steps Congratulations for making it to the end of this extremely long tutorial! We’ve just covered how to implement an end-to-end Python project you can proudly showcase on your portfolio. We built a complete price tracking system that scrapes product data from e-commerce websites, stores it in a Postgres database, analyzes price histories, and sends automated Discord notifications when prices drop significantly. Along the way, we learned about web scraping with Firecrawl, database management with SQLAlchemy, asynchronous programming with asyncio, building interactive UIs with Streamlit, automating with GitHub actions and integrating external webhooks. However, the project is far from perfect. Since we took a top-down approach to building this app, our project code is scattered across multiple files and often doesn’t follow programming best practices. For this reason, I’ve recreated the same project in a much more sophisticated manner with production-level features. [This new version on GitHub](https://github.com/BexTuychiev/automated-price-tracking) implements proper database session management, faster operations and overall smoother user experience. Also, this version includes buttons for removing products from the database and visiting them through the app. If you decide to stick with the basic version, you can find the full project code and notebook in the official Firecrawl GitHub repository’s example projects. I also recommend that you [deploy your Streamlit app to Streamlit Cloud](https://share.streamlit.io/) so that you have a functional app accessible everywhere you go. Here are some further improvements you might consider for the app: - Improve the price comparison logic: the app compares the current price to the oldest recorded price, which might not be ideal. You may want to compare against recent price trends instead. - No handling of currency conversion if products use different currencies. - The Discord notification system doesn’t handle rate limits or potential webhook failures gracefully. - No error handling for Firecrawl scraper - what happens if the scraping fails? - No consistent usage of logging to help track issues in production. - No input URL sanitization before scraping. Some of these features are implemented in [the advanced version of the project](https://github.com/BexTuychiev/automated-price-tracking), so definitely check it out! Here are some more guides from our blog if you are interested: - [How to Run Web Scrapers on Schedule](https://www.firecrawl.dev/blog/automated-web-scraping-free-2025) - [More about using Firecrawl’s `scrape_url` function](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint) - [Scraping entire websites with Firecrawl in a single command - the /crawl endpoint](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl) Thank you for reading! Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Bex Tuychiev image](https://www.firecrawl.dev/bex.jpg)\\ Bex Tuychiev@bextuychiev](https://x.com/bextuychiev) Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics ### More articles by Bex Tuychiev [Building an Automated Price Tracking Tool\\ \\ Build an automated e-commerce price tracker in Python. Learn web scraping, price monitoring, and automated alerts using Firecrawl, Streamlit, PostgreSQL.](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python) [Web Scraping Automation: How to Run Scrapers on a Schedule\\ \\ Learn how to automate web scraping in Python using free tools like schedule, asyncio, cron jobs and GitHub Actions. This comprehensive guide covers local and cloud-based scheduling methods to run scrapers reliably in 2025.](https://www.firecrawl.dev/blog/automated-web-scraping-free-2025) [Automated Data Collection - A Comprehensive Guide\\ \\ Learn how to build robust automated data collection systems using modern tools and best practices. This guide covers everything from selecting the right tools to implementing scalable collection pipelines.](https://www.firecrawl.dev/blog/automated-data-collection-guide) [BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python\\ \\ Learn the key differences between BeautifulSoup4 and Scrapy for web scraping in Python. Compare their features, performance, and use cases to choose the right tool for your web scraping needs.](https://www.firecrawl.dev/blog/beautifulsoup4-vs-scrapy-comparison) [How to Build an Automated Competitor Price Monitoring System with Python\\ \\ Learn how to build an automated competitor price monitoring system in Python that tracks prices across e-commerce sites, provides real-time comparisons, and maintains price history using Firecrawl, Streamlit, and GitHub Actions.](https://www.firecrawl.dev/blog/automated-competitor-price-scraping) [Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude\\ \\ Learn how to build a web scraper in Python that gathers company details, funding rounds, and investor information from public sources like Crunchbase using Firecrawl and Claude for automated data collection and analysis.](https://www.firecrawl.dev/blog/crunchbase-scraping-with-firecrawl-claude) [How to Create Custom Instruction Datasets for LLM Fine-tuning\\ \\ Learn how to build high-quality instruction datasets for fine-tuning large language models (LLMs). This guide covers when to create custom datasets, best practices for data collection and curation, and a practical example of building a code documentation dataset.](https://www.firecrawl.dev/blog/custom-instruction-datasets-llm-fine-tuning) [Data Enrichment: A Complete Guide to Enhancing Your Data Quality\\ \\ Learn how to enrich your data quality with a comprehensive guide covering data enrichment tools, best practices, and real-world examples. Discover how to leverage modern solutions like Firecrawl to automate data collection, validation, and integration for better business insights.](https://www.firecrawl.dev/blog/complete-guide-to-data-enrichment) ## Automated Web Scraping Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Dec 5, 2024 • [![Bex Tuychiev image](https://www.firecrawl.dev/bex.jpg)Bex Tuychiev](https://x.com/bextuychiev) # Web Scraping Automation: How to Run Scrapers on a Schedule ![Web Scraping Automation: How to Run Scrapers on a Schedule image](https://www.firecrawl.dev/images/blog/scheduling-scrapers-images/automated-web-scraping-free-2025.jpg) ## Introduction Web scraping is an essential skill for programmers in this data-driven world. Whether you’re tracking prices, monitoring competitors, or gathering research data, automated web scraping can save you countless hours of manual work. In this comprehensive guide, you’ll learn how to schedule and automate your Python web scrapers using completely free tools and services. ### Why automate your web scraper? Manual or one-off web scraping can be time-consuming and error-prone. You need to repeatedly run scripts, update data frequently, and sometimes work during off-hours just to gather information. Scheduling your web scrapers automates this entire process. It collects data at optimal times without manual intervention, ensures consistency, and frees up your valuable time for actual data analysis rather than repetitive data gathering. ### What tools can you use to automate web scrapers in Python? If you are building scrapers in Python, you have many completely free options to schedule them. In terms of local scraping, Python’s `schedule` library is very intuitive and fastest to set up. There is also the built-in `asyncio` for concurrent scraping. There are also system-level automation tools like cron jobs for macOS/Linux and the Task Scheduler for Windows. There are also many cloud-based solutions like GitHub Actions (completely free), PythonAnywhere (free tier), Google Cloud Functions (free tier), and Heroku (free tier with limitations). In this guide, we will start with basic local scheduling and progress to using GitHub Actions, all while following best practices and ethical scraping guidelines. Let’s begin by setting up your development environment and writing your first scheduled web scraper. ### Common challenges in automating web scrapers Scheduling web scrapers is an easy process, as you will discover through this tutorial. The real challenge lies in ensuring that the scrapers don’t break the day after they are put on schedule. Common issues include websites changing their HTML structure, implementing new anti-bot measures, or modifying their robots.txt policies. Additionally, network issues, rate limiting, and IP blocking can cause scheduled scrapers to fail. For these reasons, it is almost impossible to build long-lasting scrapers written in Python frameworks. But the web scraping landscape is changing as more AI-based tools are emerging, like [Firecrawl](https://www.firecrawl.dev/). Firecrawl provides an AI-powered web scraping API that can identify and extract data from HTML elements based on semantic descriptions in Python classes. While traditional scrapers rely on specific HTML selectors that can break when websites change, Firecrawl’s AI approach helps maintain scraper reliability over time. For demonstration purposes, we’ll implement examples using Firecrawl, though the scheduling techniques covered in this tutorial can be applied to any Python web scraper built with common libraries like BeautifulSoup, Scrapy, Selenium, or `lxml`. If you want to follow along with a scraper of your own, make sure to have it in a script and ready to go. ## Prerequisites This article assumes that you are already comfortable with web scraping and its related concepts like HTML structure, CSS selectors, HTTP requests, and handling rate limits. If you need a refresher on web scraping basics, check out [this introductory guide to web scraping with Python](https://realpython.com/python-web-scraping-practical-introduction/). Otherwise, let’s jump in by setting up the tools we will use for the tutorial. ### Environment setup We will mainly use Firecrawl in this article, so, please make sure that you sign up at [firecrawl.dev](https://www.firecrawl.dev/), choose the free plan and get an API token. Then, create a new working directory on your machine to follow along in this tutorial: ```bash mkdir learn-scheduling cd learn-scheduling ``` It is always a best practice to create a new virtual environment for projects. In this tutorial, we will use Python’s `virtualenv`: ```bash python -m venv venv source venv/bin/activate # For Unix/macOS venv\Scripts\activate # For Windows ``` Now, let’s install the libraries we will use: ```bash pip install requests beautifulsoup4 firecrawl-py python-dotenv ``` We will touch on what each library does as we use them. ### Firecrawl API key setup Since we will push our code to a GitHub repository later, you will need to save your Firecrawl API key securely by using a `.env` file: ```bash touch .env echo "FIRECRAWL_API_KEY='your-key-here'" >> .env ``` Also, create a `.gitignore` file and add the `.env` to it so that it isn’t pushed to GitHub: ```bash touch .gitignore echo ".env" >> .gitignore ``` ## Writing a Basic Web Scraper In this tutorial, we will build a scraper for [the Hacker News homepage](https://news.ycombinator.com/) that extracts post title, URL, author, rank, number of upvotes and date. ![Hacker News Homepage showing posts to be scraped using automated scheduling with Firecrawl](https://www.firecrawl.dev/images/blog/scheduling-scrapers-images/hacker_news_homepage.png) Like we mentioned, we will build the scraper in Firecrawl but I have also prepared an [identical scraper written in BeautifulSoup](https://github.com/mendableai/firecrawl/tree/main/examples/blog-articles/scheduling_scrapers) if you want a more traditional approach. In your working directory, create a new `firecrawl_scraper.py` script and import the following packages: ```python # firecrawl_scraper.py import json from firecrawl import FirecrawlApp from dotenv import load_dotenv from pydantic import BaseModel, Field from typing import List from datetime import datetime load_dotenv() BASE_URL = "https://news.ycombinator.com/" ``` After the imports, we are calling `load_dotenv()` so that our Firecrawl API key is loaded from the `.env` file. Then, we are defining a new variable containing the URL we will scrape. Next, we create a Pydantic model to specify the information we want to scrape from each Hacker News post: ```python class NewsItem(BaseModel): title: str = Field(description="The title of the news item") source_url: str = Field(description="The URL of the news item") author: str = Field( description="The URL of the post author's profile concatenated with the base URL." ) rank: str = Field(description="The rank of the news item") upvotes: str = Field(description="The number of upvotes of the news item") date: str = Field(description="The date of the news item.") ``` Pydantic models are Python classes that provide data validation and serialization capabilities. They allow you to define the structure and types of your data using Python type hints, while automatically handling validation, serialization, and documentation. In the context of our Firecrawl scraper, the `NewsItem` model defines the exact structure of data we want to extract from each Hacker News post. Each field in the model ( `title`, `source_url`, `author`, etc.) specifies what data should be scraped and includes a description of what that field represents. This model is crucial for Firecrawl because it uses the model’s schema to understand exactly what data to extract from the webpage. When we pass this model to Firecrawl, it will automatically attempt to find and extract data matching these field definitions from the HTML structure of Hacker News. For example, when Firecrawl sees we want a “title” field, it will look for elements on the page that are likely to contain post titles based on their HTML structure and content. The `Field` descriptions help provide additional context about what each piece of data represents. Next, we create another model called `NewsData` that contains a list of `NewsItem` objects. This model will serve as a container for all the news items we scrape from Hacker News. The `news_items` field is defined as a List of `NewsItem` objects, which means it can store multiple news items in a single data structure. ```python class NewsData(BaseModel): news_items: List[NewsItem] ``` Without this second model, our scraper will return not one but all news items. Now, we define a new function that will run Firecrawl based on the scraping schema we just defined: ```python def get_firecrawl_news_data(): app = FirecrawlApp() data = app.scrape_url( BASE_URL, params={ "formats": ["extract"], "extract": {"schema": NewsData.model_json_schema()}, }, ) return data ``` This function initializes a FirecrawlApp instance and uses it to scrape data from Hacker News. It passes the `BASE_URL` and parameters specifying that we want to extract data according to our `NewsData` schema. The schema tells Firecrawl exactly what fields to look for and extract from each news item on the page. The function returns the scraped data which will contain a list of news items matching our defined structure. Let’s quickly test it: ```python data = get_firecrawl_news_data() print(type(data)) ``` ```python ``` Firecrawl always returns the scraped data in a dictionary. Let’s look at the keys: ```python data['metadata'] ``` ```python { 'title': 'Hacker News', 'language': 'en', 'ogLocaleAlternate': [], 'referrer': 'origin', 'viewport': 'width=device-width, initial-scale=1.0', 'sourceURL': 'https://news.ycombinator.com/', 'url': 'https://news.ycombinator.com/', 'statusCode': 200 } ``` The first key is the metadata field containing basic page information. We are interested in the `extract` field which contains the data scraped by the engine: ```python data['extract']['news_items'][0] ``` ```python { 'title': "Send someone you appreciate an official 'Continue and Persist' Letter", 'source_url': 'https://ContinueAndPersist.org', 'author': 'https://news.ycombinator.com/user?id=adnanaga', 'rank': '1', 'upvotes': '1122', 'date': '17 hours ago' } ``` The `extract` field contains a dictionary with a list of scraped news items. We can see above that when printing the first item, it includes all the fields we defined in our `NewsItem` Pydantic model, including title, source URL, author, rank, upvotes and date. ```python len(data['extract']['news_items']) ``` ```python 30 ``` The output shows 30 news items, confirming that our scraper successfully extracted all posts from the first page of Hacker News. This matches the site’s standard layout which displays exactly 30 posts per page. Now, let’s create a new function that saves this data to a JSON file: ```python def save_firecrawl_news_data(): # Get the data data = get_firecrawl_news_data() # Format current date for filename date_str = datetime.now().strftime("%Y_%m_%d_%H_%M") filename = f"firecrawl_hacker_news_data_{date_str}.json" # Save the news items to JSON file with open(filename, "w") as f: json.dump(data["extract"]["news_items"], f, indent=4) return filename ``` `save_firecrawl_news_data()` handles saving the scraped Hacker News data to a JSON file. It first calls `get_firecrawl_news_data()` to fetch the latest data from Hacker News. Then, it generates a filename using the current timestamp in the format `YYYY_MM_DD_HH_MM`. The data is saved to this timestamped JSON file with proper indentation, and the filename is returned. This allows us to maintain a historical record of the scraped data with clear timestamps indicating when each scrape occurred. Finally, add a `__main__` block to the `firecrawl_scraper.py` script to allow running the scraper directly from the command line: ```python if __name__ == "__main__": save_firecrawl_news_data() ``` The complete scraper script is available in [our GitHub repository](https://github.com/mendableai/firecrawl/blob/main/examples/hacker_news_scraper/firecrawl_scraper.py). For reference, we also provide [a BeautifulSoup implementation of the same scraper](https://github.com/mendableai/firecrawl/blob/main/examples/hacker_news_scraper/bs4_scraper.py). ## Local Web Scraping Automation Methods In this section, we will explore how to run the scraper from the previous section on schedule using local tools like the Python `schedule` library and cron. ### The basics of the Python `schedule` library `schedule` is a job scheduling library written for humans ( [from the documentation](https://schedule.readthedocs.io/en/stable/index.html)). It runs any Python function or callable periodically using intuitive syntax. To get started, please install it with `pip`: ```bash pip install schedule ``` Then, scheduling a Python function is as easy as shown in the codeblock below: ```python import schedule import time def job(): current_time = time.strftime("%H:%M:%S") print(f"{current_time}: I'm working...") # Schedule it schedule.every(3).seconds.do(job) while True: schedule.run_pending() time.sleep(1) ``` ```out 14:58:23: I'm working... 14:58:26: I'm working... 14:58:29: I'm working... 14:58:32: I'm working... 14:58:35: I'm working... ... ``` To implement scheduling, first convert your task into a function (which we’ve already completed). Next, apply scheduling logic using the `.every(n).period.do` syntax. Below are several examples demonstrating different scheduling patterns: ```python schedule.every(10).minutes.do(job) schedule.every().hour.do(job) schedule.every().day.at("10:30").do(job) schedule.every().monday.do(job) schedule.every().wednesday.at("13:15").do(job) schedule.every().day.at("12:42", "Europe/Amsterdam").do(job) schedule.every().minute.at(":17").do(job) # 17th second of a minute ``` Finally, you need to run an infinite loop that checks for pending scheduled jobs and executes them. The loop below runs continuously, checking if any scheduled tasks are due to run: ```python while True: schedule.run_pending() time.sleep(1) ``` There is much more to the `schedule` library than what we just covered (you should check out [the examples from the documentation](https://schedule.readthedocs.io/en/stable/examples.html)) but they are enough for the purposes of this article. ### Using Python’s `schedule` library to schedule web scrapers Now that we know the basics of `schedule`, let’s use it for our Firecrawl scraper. Start by creating a new `scrape_scheduler.py` script and making the necessary imports: ```python import schedule import time from firecrawl_scraper import save_firecrawl_news_data ``` Here, we import the `schedule` module itself and the `save_firecrawl_news_data()` function from `firecrawl_scraper.py` that downloads the top 30 posts of Hacker News. Then, to run this function on schedule, like every hour, we only need to add a few lines of code: ```python # Schedule the scraper to run every hour schedule.every().hour.do(save_firecrawl_news_data) while True: schedule.run_pending() time.sleep(1) ``` You can start the schedule with: ```bash python scrape_scheduler.py ``` > **Tip**: For debugging purposes, start with a shorter interval like 60 seconds before implementing the hourly schedule. The scheduler will continue running until you terminate the main terminal process executing the `scrape_scheduler.py` script. Thanks to Firecrawl’s AI-powered HTML parsing and layout adaptation capabilities, the scraper is quite resilient to website changes and has a low probability of breaking. Nevertheless, web scraping can be unpredictable, so it’s recommended to review [the exception handling](https://schedule.readthedocs.io/en/stable/exception-handling.html) section of the `schedule` documentation to handle potential errors gracefully. * * * The `schedule` library provides a simple and intuitive way to run periodic tasks like web scrapers. While it lacks some advanced features of other scheduling methods, it’s a great choice for basic scheduling needs and getting started with automated scraping. Just remember to implement proper error handling for production use. ### Using Python’s built-in tools to automate web scrapers In this section, we will explore a few other local scheduling methods that have the advantage of being built into Python or the operating system, making them more reliable and robust than third-party libraries. These methods also provide better error handling, logging capabilities, and system-level control over the scheduling process. #### How to automate a web scraper with `asyncio`? `asyncio` is a Python library for running code concurrently - executing multiple tasks at the same time by switching between them when one is waiting. It’s built into Python and helps schedule tasks efficiently. Here’s why it’s great for web scrapers: 1. It can do other work while waiting for web requests to complete. 2. You can run multiple scrapers at the same time with precise timing control. 3. It uses less computer resources than regular multitasking. 4. It handles errors well with `try/except` blocks. Let’s see how to use `asyncio` for scheduling scrapers: ```python import asyncio import time from firecrawl_scraper import save_firecrawl_news_data async def schedule_scraper(interval_hours: float = 1): while True: try: print(f"Starting scrape at {time.strftime('%Y-%m-%d %H:%M:%S')}") # Run the scraper filename = save_firecrawl_news_data() print(f"Data saved to {filename}") except Exception as e: print(f"Error during scraping: {e}") # Wait for the specified interval await asyncio.sleep(interval_hours * 3600) # Convert hours to seconds async def main(): # Create tasks for different scheduling intervals tasks = [\ schedule_scraper(interval_hours=1), # Run every hour\ # Add more tasks with different intervals if needed\ # schedule_scraper(interval_hours=0.5), # Run every 30 minutes\ # schedule_scraper(interval_hours=2), # Run every 2 hours\ ] # Run all tasks concurrently await asyncio.gather(*tasks) if __name__ == "__main__": # Run the async scheduler asyncio.run(main()) ``` Let’s break down what’s happening above: The `schedule_scraper()` function is an `async` function that runs indefinitely in a loop. For each iteration, it: 1. Runs the scraper and saves the data 2. Handles any errors that occur during scraping 3. Waits for the specified interval using `asyncio.sleep()` The `main()` function sets up concurrent execution of multiple scraper tasks with different intervals. This allows running multiple scrapers simultaneously without blocking each other. This asyncio-based approach has several advantages over the `schedule` library: 1. True concurrency: Multiple scrapers can run simultaneously without blocking each other, unlike `schedule` which runs tasks sequentially. 2. Precise Timing: `asyncio.sleep()` provides more accurate timing control compared to schedule’s `run_pending()` approach. 3. Resource Efficiency: `asyncio` uses cooperative multitasking which requires less system resources than `schedule`’s threading-based approach. 4. Better Error Handling: `Async/await` makes it easier to implement proper error handling and recovery 5. Flexibility: You can easily add or remove scraper tasks and modify their intervals without affecting other tasks The code structure also makes it simple to extend functionality by adding more concurrent tasks or implementing additional error handling logic. #### How to automate a web scraper with cron jobs? Cron is a time-based job scheduler in Unix-like operating systems (Linux, macOS). Think of it as a digital scheduler or calendar that can automatically run programs at specified times. A cron job is simply a task that you schedule to run at specific intervals. For web scraping, cron jobs are incredibly useful because they let you automate your scraper to run at predetermined times. For example, you could set up a cron job to: - Run your scraper every hour to collect real-time data - Execute scraping tasks during off-peak hours (like 2 AM) to minimize server load - Collect data at specific times when websites update their content The scheduling format uses five time fields: minute, hour, day of month, month, and day of week. For instance: - `0 * * * *` means “run every hour at minute 0” - `0 0 * * *` means “run at midnight every day” - `*/15 * * * *` means “run every 15 minutes” Cron jobs are especially reliable for web scraping because they’re built into the operating system, use minimal resources, and continue running even after system reboots. So, let’s run the `save_firecrawl_news_data()` function on a schedule using cron. First, we will create a dedicated script for the cron job named `cron_scraper.py`: ```python # cron_scraper.py import sys import logging from datetime import datetime from pathlib import Path from firecrawl_scraper import save_firecrawl_news_data # Set up logging log_dir = Path("logs") log_dir.mkdir(exist_ok=True) log_file = log_dir / f"scraper_{datetime.now().strftime('%Y_%m')}.log" logging.basicConfig( level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s", handlers=[logging.FileHandler(log_file), logging.StreamHandler(sys.stdout)], ) def main(): try: logging.info("Starting scraping job") filename = save_firecrawl_news_data() # Actual scraping function logging.info(f"Successfully saved data to {filename}") except Exception as e: logging.error(f"Scraping failed: {str(e)}", exc_info=True) if __name__ == "__main__": main() ``` The script implements a production-ready web scraper with logging and error handling. It creates a logs directory, configures detailed logging to both files and console, and wraps the scraping operation in error handling. When run, it executes our scraper function while tracking its progress and any potential issues, making it ideal for automated scheduling through cron jobs. Now, to schedule this script using cron, you will need to make it executable: ```bash chmod +x cron_scraper.py ``` Then, open your crontab file (which is usually empty) with Nano: ```bash EDITOR=nano crontab -e ``` Then, add one or more entries specifying the frequency with which the scraper must run: ```python # Run every minute */1 * * * * cd /absolute/path/to/project && /absolute/path/to/.venv/bin/python cron_scraper.py >> ~/cron.log 2>&1 # Run every hour */1 * * * * cd /absolute/path/to/project && /absolute/path/to/.venv/bin/python cron_scraper.py >> ~/cron.log 2>&1 ``` The above cron job syntax consists of several parts: The timing pattern “ _/1_ \\*\\* \*” breaks down as follows: - First `*/1`: Specifies every minute - First `*`: Represents any hour - Second `*`: Represents any day of the month - Third `*`: Represents any month - Fourth `*`: Represents any day of the week After the timing pattern: - `cd /absolute/path/to/project`: Changes to the project directory of your scraper - `&&`: Chains commands, executing the next only if previous succeeds - `/absolute/path/to/.venv/bin/python`: Specifies the Python interpreter path - `cron_scraper.py`: The script to execute - `>> ~/cron.log 2>&1`: Redirects both standard output (>>) and errors ( `2>&1`) to `cron.log` For hourly execution, the same pattern applies but with `0 * * * *` timing to run at the start of each hour instead of every minute. As soon as you save your crontab file with these commands, the schedule starts and you should see a `logs` directory in the same folder as your `cron_scraper.py`. It must look like this if you have been following along: !["Logs directory showing successful automated web scraping schedule execution with cron jobs and Firecrawl"](https://www.firecrawl.dev/images/blog/scheduling-scrapers-images/output.png) You can always check the status of your cron jobs with the following command as well: ```bash tail -f ~/cron.log ``` To cancel a cron job, simply open your crontab file again and remove the line corresponding to the job. #### How to automate a web scraper using Windows Task Scheduler? Windows Task Scheduler is a built-in Windows tool that can automate running programs or scripts at specified times. It’s a robust alternative to cron jobs for Windows users. Let’s set up our scraper to run automatically. First, create a batch file ( `run_scraper.bat`) to run our Python script: ```python @echo off cd /d "C:\path\to\your\project" call venv\Scripts\activate python cron_scraper.py deactivate ``` Then, to set up the task in Windows Task Scheduler: - Open Task Scheduler (search “Task Scheduler” in Windows search) - Click “Create Basic Task” in the right panel - Follow the wizard: - Name: “Hacker News Scraper” - Description: “Scrapes Hacker News hourly” - Trigger: Choose when to run (e.g., “Daily”) - Action: “Start a program” - Program/script: Browse to your `run_scraper.bat` - Start in: Your project directory For more control over the task, you can modify its properties after creation: - Double-click the task - In the “Triggers” tab, click “Edit” to set custom schedules - Common scheduling options: - Run every hour - Run at specific times - Run on system startup - In the “Settings” tab, useful options include: - “Run task as soon as possible after a scheduled start is missed" - "Stop the task if it runs longer than X hours” The Task Scheduler provides several advantages: - Runs even when user is logged out - Detailed history and logging - Ability to run with elevated privileges - Options for network conditions - Retry logic for failed tasks You can monitor your scheduled task through the Task Scheduler interface or check the logs we set up in `cron_scraper.py`. ## Automating Web Scrapers With GitHub Actions One disadvantage all local scheduling methods have is that they rely on your local machine being powered on and connected to the internet. If your computer is turned off, loses power, or loses internet connectivity, your scheduled scraping tasks won’t run. This is where cloud-based solutions like GitHub Actions can provide more reliability and uptime for your web scraping workflows. ### What is GitHub Actions? GitHub Actions is a continuous integration and deployment (CI/CD) platform provided by GitHub that allows you to automate various workflows directly from your [GitHub](https://github.com/) repository. For web scraping, GitHub Actions provides a reliable way to schedule and run your scraping scripts in the cloud. You can define workflows using YAML files that specify when and how your scraper should run, such as on a regular schedule using cron syntax. This means your scraping jobs will continue running even when your local machine is off, as they execute on GitHub’s servers. ### Step 1: Setting Up a GitHub repository To schedule our scrapers with GitHub actions, we first need a GitHub repository. Start by initializing Git in your current workspace: ```bash # Initialize git in your project directory git init git add . git commit -m "Initial commit" ``` Next, create a new public or private GitHub repository and add it as the remote: ```bash # Create a new repo on GitHub.com, then: git remote add origin https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git git branch -M main git push -u origin main ``` Then, create the following directory structure: ```bash mkdir -p .github/workflows ``` This directory will contain our GitHub Actions workflow files with YAML format. These files define how and when our scraping scripts should run. The workflows can be scheduled using cron syntax, triggered by events like pushes or pull requests, and can include multiple steps like installing dependencies and running scripts. ### Step 2: Creating a Workflow file At this stage, create a new `scraper.yml` file inside `.github/workflows` and paste the following contents: ```yaml name: Run Firecrawl Scraper on: schedule: - cron: "0/5 * * * *" # Runs every five minute workflow_dispatch: # Allows manual trigger jobs: scrape: runs-on: ubuntu-latest steps: - uses: actions/checkout@v3 - name: Set up Python uses: actions/setup-python@v4 with: python-version: "3.9" - name: Install dependencies run: | python -m pip install --upgrade pip pip install pydantic firecrawl-py - name: Run scraper run: python firecrawl_scraper.py env: # Add any environment variables your scraper needs FIRECRAWL_API_KEY: ${{ secrets.FIRECRAWL_API_KEY }} - name: Commit and push if changes run: | git config --global user.name 'GitHub Actions Bot' git config --global user.email 'actions@github.com' git add . git commit -m "Update scraped data" || exit 0 git push ``` Let’s break down the key components of this GitHub Actions workflow file: Workflow name: The workflow is named “Run Firecrawl Scraper” which helps identify it in the GitHub Actions interface. Triggers: - Scheduled to run every 5 minutes using cron syntax `0/5 * * * *` (5 minutes is for debugging purposes, please change to hourly later) - Can be manually triggered using `workflow_dispatch` Job configuration: - Runs on latest Ubuntu virtual environment - Contains multiple sequential steps Step details: 1. Checkout: - Uses `actions/checkout@v3` to get repository code 2. Python Setup: - Uses `actions/setup-python@v4` - Configures Python 3.9 environment 3. Dependencies: - Upgrades `pip` - Installs required packages: `pydantic` and `firecrawl-py` 4. Scraper Execution: - Runs `firecrawl_scraper.py` - Uses `FIRECRAWL_API_KEY` from repository secrets 5. Committing the changes: - Creates a commit persisting the downloaded data using GitHub Actions bot. To run this action successfully, you’ll need to store your Firecrawl API key in GitHub secrets. Navigate to your repository’s Settings, then Secrets and variables → Actions. Click the “New repository secret” button and add your API key, making sure to use the exact key name specified in `scraper.yml`. After ensuring that everything is set up correctly, commit and push the latest changes to GitHub: ```bash git add . git commit -m "Add a workflow to scrape on a schedule" ``` Once you do, the workflow must show up in the Actions tab of your repository like below: !["GitHub Actions workflow showing successful execution of scheduled web scraping with Firecrawl"](https://www.firecrawl.dev/images/blog/scheduling-scrapers-images/actions.png) Click on the workflow name and press the “Run workflow” button. This launches the action manually and starts the schedule. If you check in after some time, you should see more automatic runs and the results persisted in your repository: !["GitHub Actions workflow showing multiple successful executions of scheduled web scraping with Firecrawl over a 24 hour period"](https://www.firecrawl.dev/images/blog/scheduling-scrapers-images/finished_actions.png) Caution: I left the workflow running overnight (at five minute intervals) and was nastily surprised by 96 workflow runs the next day. Thankfully, GitHub actions are free (up to 2000 min/month) unlike AWS instances. Now, unless you disable the workflow manually by clicking the three dots in the upper-right corner, the scraper continues running on the schedule you specified. ## Best Practices and Optimization When scheduling web scrapers, following best practices ensures reliability, efficiency, and ethical behavior. Here are the key areas to consider: ### 1\. Rate limiting and delays Scraping engines like Firecrawl usually come with built-in rate limiting. However, if you are using custom scrapers written with Python libraries, you must always respect website servers by implementing proper rate limiting and delay strategies. For example, the below example shows adding random delays between your requests in-between requests to respect server load and avoid getting your IP blocked: ```python import time import random def scrape_with_delays(urls): for url in urls: try: # Random delay between 2-5 seconds delay = random.uniform(2, 5) time.sleep(delay) # Your scraping code here response = requests.get(url) except requests.RequestException as e: logging.error(f"Error scraping {url}: {e}") ``` Best practices for rate limiting: - Add random delays between requests (2-5 seconds minimum) - Respect `robots.txt` directives - Implement exponential backoff for retries - Stay under 1 request per second for most sites - Monitor response headers for rate limit information ### 2\. Proxy Rotation This best practice is related to using custom web scrapers. Proxy rotation involves cycling through different IP addresses when making requests to avoid getting blocked. By distributing requests across multiple IPs, you can maintain access to websites that might otherwise flag high-volume traffic from a single source. ```python from itertools import cycle def get_proxy_pool(): proxies = [\ 'http://proxy1:port',\ 'http://proxy2:port',\ 'http://proxy3:port'\ ] return cycle(proxies) def scrape_with_proxies(url, proxy_pool): for _ in range(3): # Max 3 retries try: proxy = next(proxy_pool) response = requests.get( url, proxies={'http': proxy, 'https': proxy}, timeout=10 ) return response except requests.RequestException: continue return None ``` Proxy best practices: - Rotate IPs regularly - Use high-quality proxy services - Implement timeout handling - Monitor proxy health - Keep backup proxies ready ### 3\. Data Storage Strategies The example scrapers we have built today saved the data to a JSON file, which is a simple method. Based on your needs, you may consider building a full data storage solution that saves the scraped data to various formats like as CSV files or to an SQL database. The storage type always depends on the kind of information scraped. Below is an example class that can save lists of dictionaries (like we scraped today) to a CSV file: ```python from datetime import datetime import json import pandas as pd class DataManager: def __init__(self, base_path='data'): self.base_path = Path(base_path) self.base_path.mkdir(exist_ok=True) def save_data(self, data, format='csv'): timestamp = datetime.now().strftime('%Y%m%d_%H%M%S') if format == 'csv': filename = self.base_path / f'data_{timestamp}.csv' pd.DataFrame(data).to_csv(filename, index=False) elif format == 'json': filename = self.base_path / f'data_{timestamp}.json' with open(filename, 'w') as f: json.dump(data, f, indent=2) # Cleanup old files (keep last 7 days) self._cleanup_old_files(days=7) return filename def _cleanup_old_files(self, days): # Implementation for cleaning up old files pass ``` Storage recommendations: - Use appropriate file formats (CSV/JSON/Database) - Implement data versioning - Regular cleanup of old data - Compress historical data - Consider using a database for large datasets ### 4\. Error Logging and Monitoring Web scraping usually requires more sophisticated logging than plain old print statements. Save the following or bookmark it on your browser as it properly sets up a logger using the built-in `logging` module of Python: ```python import logging from pathlib import Path def setup_logging(): log_dir = Path('logs') log_dir.mkdir(exist_ok=True) # File handler for detailed logs file_handler = logging.FileHandler( log_dir / f'scraper_{datetime.now().strftime("%Y%m%d")}.log' ) file_handler.setLevel(logging.DEBUG) # Console handler for important messages console_handler = logging.StreamHandler() console_handler.setLevel(logging.INFO) # Configure logging logging.basicConfig( level=logging.DEBUG, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', handlers=[file_handler, console_handler] ) ``` Monitoring best practices: - Implement comprehensive logging - Set up alerts for critical failures - Monitor memory usage - Track success rates - Log response times ### 5\. Maintaining Your Scraper Maintaining your web scraper is crucial for ensuring reliable and continuous data collection. Web scraping targets are dynamic - websites frequently update their structure, implement new security measures, or change their content organization. Regular maintenance helps catch these changes early, prevents scraping failures, and ensures your data pipeline remains robust. A well-maintained scraper also helps manage resources efficiently, keeps code quality high, and adapts to evolving requirements. Maintenance guidelines: - Regular code updates - Monitor site changes - Update user agents periodically - Check for library updates - Implement health checks - Document maintenance procedures ## Conclusion Throughout this guide, we’ve explored how to effectively schedule web scrapers using local Python and operation system tools as well as GitHub actions. From basic setup to advanced optimization techniques, we’ve covered the essential components needed to build reliable, automated data collection pipelines. The workflow we’ve created not only handles the technical aspects of scheduling but also incorporates best practices for rate limiting, error handling, and data storage - crucial elements for any production-grade scraping system. For those looking to enhance their web scraping capabilities further, I recommend exploring Firecrawl’s comprehensive features through their [/crawl endpoint guide](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl) and [/scrape endpoint tutorial](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint). These resources, along with the [official documentation](https://docs.firecrawl.dev/), provide deeper insights into advanced topics like JavaScript rendering, structured data extraction, and batch operations that can significantly improve your web scraping workflows. Whether you’re building training datasets for AI models or monitoring websites for changes, combining scheduled scraping with these powerful tools can help you build more sophisticated and efficient data collection systems. Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Bex Tuychiev image](https://www.firecrawl.dev/bex.jpg)\\ Bex Tuychiev@bextuychiev](https://x.com/bextuychiev) Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics ### More articles by Bex Tuychiev [Building an Automated Price Tracking Tool\\ \\ Build an automated e-commerce price tracker in Python. Learn web scraping, price monitoring, and automated alerts using Firecrawl, Streamlit, PostgreSQL.](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python) [Web Scraping Automation: How to Run Scrapers on a Schedule\\ \\ Learn how to automate web scraping in Python using free tools like schedule, asyncio, cron jobs and GitHub Actions. This comprehensive guide covers local and cloud-based scheduling methods to run scrapers reliably in 2025.](https://www.firecrawl.dev/blog/automated-web-scraping-free-2025) [Automated Data Collection - A Comprehensive Guide\\ \\ Learn how to build robust automated data collection systems using modern tools and best practices. This guide covers everything from selecting the right tools to implementing scalable collection pipelines.](https://www.firecrawl.dev/blog/automated-data-collection-guide) [BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python\\ \\ Learn the key differences between BeautifulSoup4 and Scrapy for web scraping in Python. Compare their features, performance, and use cases to choose the right tool for your web scraping needs.](https://www.firecrawl.dev/blog/beautifulsoup4-vs-scrapy-comparison) [How to Build an Automated Competitor Price Monitoring System with Python\\ \\ Learn how to build an automated competitor price monitoring system in Python that tracks prices across e-commerce sites, provides real-time comparisons, and maintains price history using Firecrawl, Streamlit, and GitHub Actions.](https://www.firecrawl.dev/blog/automated-competitor-price-scraping) [Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude\\ \\ Learn how to build a web scraper in Python that gathers company details, funding rounds, and investor information from public sources like Crunchbase using Firecrawl and Claude for automated data collection and analysis.](https://www.firecrawl.dev/blog/crunchbase-scraping-with-firecrawl-claude) [How to Create Custom Instruction Datasets for LLM Fine-tuning\\ \\ Learn how to build high-quality instruction datasets for fine-tuning large language models (LLMs). This guide covers when to create custom datasets, best practices for data collection and curation, and a practical example of building a code documentation dataset.](https://www.firecrawl.dev/blog/custom-instruction-datasets-llm-fine-tuning) [Data Enrichment: A Complete Guide to Enhancing Your Data Quality\\ \\ Learn how to enrich your data quality with a comprehensive guide covering data enrichment tools, best practices, and real-world examples. Discover how to leverage modern solutions like Firecrawl to automate data collection, validation, and integration for better business insights.](https://www.firecrawl.dev/blog/complete-guide-to-data-enrichment) ## Job Board Scraping Guide Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Sept 27, 2024 • [![Eric Ciarla image](https://www.firecrawl.dev/eric-img.jpeg)Eric Ciarla](https://x.com/ericciarla) # Scraping Job Boards Using Firecrawl Actions and OpenAI ![Scraping Job Boards Using Firecrawl Actions and OpenAI image](https://www.firecrawl.dev/images/blog/firecrawl-openai-job-scraping.jpg) Scraping job boards to extract structured data can be a complex task, especially when dealing with dynamic websites and unstructured content. In this guide, we’ll walk through how to use [Firecrawl Actions](https://firecrawl.dev/) and OpenAI models to efficiently scrape job listings and extract valuable information. ### Why Use Firecrawl and OpenAI? - **Firecrawl** simplifies web scraping by handling dynamic content and providing actions like clicking and scrolling. - **OpenAI’s `o1` and `4o` models** excel at understanding and extracting structured data from unstructured text. `o1` is best for more complex reasoning tasks while `4o` is best for speed and cost. ### Prerequisites - Python 3.7 or higher - API keys for both [Firecrawl](https://firecrawl.dev/) and [OpenAI](https://openai.com/) - Install required libraries: ```bash pip install requests python-dotenv openai ``` ### Step 1: Set Up Your Environment Create a `.env` file in your project directory and add your API keys: ``` FIRECRAWL_API_KEY=your_firecrawl_api_key OPENAI_API_KEY=your_openai_api_key ``` ### Step 2: Initialize API Clients ```python import os import requests import json from dotenv import load_dotenv import openai # Load environment variables load_dotenv() # Initialize API keys firecrawl_api_key = os.getenv("FIRECRAWL_API_KEY") openai.api_key = os.getenv("OPENAI_API_KEY") ``` ### Step 3: Define the Jobs Page URL and Resume Specify the URL of the jobs page you want to scrape and provide your resume for matching. ```python # URL of the jobs page to scrape jobs_page_url = "https://openai.com/careers/search" # Candidate's resume (as a string) resume_paste = """ [Your resume content here] """ ``` ### Step 4: Scrape the Jobs Page Using Firecrawl We use Firecrawl to scrape the jobs page and extract the HTML content. ```python try: response = requests.post( "https://api.firecrawl.dev/v1/scrape", headers={ "Content-Type": "application/json", "Authorization": f"Bearer {firecrawl_api_key}" }, json={ "url": jobs_page_url, "formats": ["markdown"] } ) if response.status_code == 200: result = response.json() if result.get('success'): html_content = result['data']['markdown'] # Prepare the prompt for OpenAI prompt = f""" Extract up to 30 job application links from the given markdown content. Return the result as a JSON object with a single key 'apply_links' containing an array of strings (the links). The output should be a valid JSON object, with no additional text. Markdown content: {html_content[:100000]} """ else: html_content = "" else: html_content = "" except Exception as e: html_content = "" ``` ### Step 5: Extract Apply Links Using OpenAI’s `gpt-4o` Model We use OpenAI’s `gpt-4o` model to parse the scraped content and extract application links. ```python # Extract apply links using OpenAI apply_links = [] if html_content: try: completion = openai.ChatCompletion.create( model="gpt-4o", messages=[\ {\ "role": "user",\ "content": prompt\ }\ ] ) if completion.choices: result = json.loads(completion.choices[0].message.content.strip()) apply_links = result['apply_links'] except Exception as e: pass ``` ### Step 6: Extract Job Details from Each Apply Link We iterate over each apply link and use Firecrawl’s extraction capabilities to get job details. ```python # Initialize a list to store job data extracted_data = [] # Define the extraction schema schema = { "type": "object", "properties": { "job_title": {"type": "string"}, "sub_division_of_organization": {"type": "string"}, "key_skills": {"type": "array", "items": {"type": "string"}}, "compensation": {"type": "string"}, "location": {"type": "string"}, "apply_link": {"type": "string"} }, "required": ["job_title", "sub_division_of_organization", "key_skills", "compensation", "location", "apply_link"] } # Extract job details for each link for link in apply_links: try: response = requests.post( "https://api.firecrawl.dev/v1/scrape", headers={ "Content-Type": "application/json", "Authorization": f"Bearer {firecrawl_api_key}" }, json={ "url": link, "formats": ["extract"], "actions": [{\ "type": "click",\ "selector": "#job-overview"\ }], "extract": { "schema": schema } } ) if response.status_code == 200: result = response.json() if result.get('success'): extracted_data.append(result['data']['extract']) except Exception as e: pass ``` ### Step 7: Match Jobs to Your Resume Using OpenAI’s `o1` Model We use OpenAI’s `o1` model to analyze your resume and recommend the top 3 job listings. ```python # Prepare the prompt prompt = f""" Please analyze the resume and job listings, and return a JSON list of the top 3 roles that best fit the candidate's experience and skills. Include only the job title, compensation, and apply link for each recommended role. The output should be a valid JSON array of objects in the following format: [\ {\ "job_title": "Job Title",\ "compensation": "Compensation",\ "apply_link": "Application URL"\ },\ ...\ ] Based on the following resume: {resume_paste} And the following job listings: {json.dumps(extracted_data, indent=2)} """ # Get recommendations from OpenAI completion = openai.ChatCompletion.create( model="o1-preview", messages=[\ {\ "role": "user",\ "content": prompt\ }\ ] ) # Extract recommended jobs recommended_jobs = json.loads(completion.choices[0].message.content.strip()) ``` ### Step 8: Output the Recommended Jobs Finally, we can print or save the recommended jobs. ```python # Output the recommended jobs print(json.dumps(recommended_jobs, indent=2)) ``` ### Full Code Example on GitHub You can find the full code example [on GitHub](https://github.com/mendableai/firecrawl/tree/main/examples/o1_job_recommender). ### Conclusion By following this guide, you’ve learned how to: - **Scrape dynamic job boards** using Firecrawl. - **Extract structured data** from web pages with custom schemas. - **Leverage OpenAI’s models** to parse content and make intelligent recommendations. This approach can be extended to other websites and data extraction tasks, providing a powerful toolset for automating data collection and analysis. ### References - [Firecrawl Documentation](https://docs.firecrawl.dev/) - [OpenAI API Reference](https://platform.openai.com/docs/api-reference/introduction) That’s it! You’ve now built a pipeline to scrape job boards and find the best job matches using Firecrawl and OpenAI. Happy coding! Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Eric Ciarla image](https://www.firecrawl.dev/eric-img.jpeg)\\ Eric Ciarla@ericciarla](https://x.com/ericciarla) Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB. Previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users. ### More articles by Eric Ciarla [How to Create an llms.txt File for Any Website\\ \\ Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [Cloudflare Error 1015: How to solve it?\\ \\ Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner.](https://www.firecrawl.dev/blog/cloudflare-error-1015-how-to-solve-it) [Build an agent that checks for website contradictions\\ \\ Using Firecrawl and Claude to scrape your website's data and look for contradictions.](https://www.firecrawl.dev/blog/contradiction-agent) [Why Companies Need a Data Strategy for Generative AI\\ \\ Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses\\ \\ A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [How to easily install requests with pip and python\\ \\ A tutorial on installing the requests library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-easily-install-requests-with-pip-and-python) [How to quickly install BeautifulSoup with Python\\ \\ A guide on installing the BeautifulSoup library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python) [How to Use OpenAI's o1 Reasoning Models in Your Applications\\ \\ Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications) ## LLM Extraction for Insights Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) May 21, 2024 • [![Caleb Peffer image](https://www.firecrawl.dev/caleb-img.jpeg)Caleb Peffer](https://x.com/CalebPeffer) # Using LLM Extraction for Customer Insights ![Using LLM Extraction for Customer Insights image](https://www.firecrawl.dev/images/blog/g3.png) ### Introduction Understanding our customers - not just who they are, but what they do—is crucial to tailoring our products and services effectively. When running a self-serve motion, you have so many customers come in the door with little to no knowledge of them. The process of proactively understanding who these folks are has traditionally been time-intensive, involving manual data collection and analysis to gather actionable insights. However, with the power of LLMs and their capacity for advanced data extraction, we’ve automated this process. Using LLM extraction and analysis of customer data, LLM we’ve significantly reduced our workload, allowing us to understand and serve our customer base more effectively than ever before. If you have limited technical knowledge, you can build an automation that gets targeted information about your customers for the purposes of product direction and lead gen. Here’s how you can do this yourself with [Make](https://make.com/) and [Firecrawl](https://www.firecrawl.dev/). * * * ### Overview of the Tools **Firecrawl** Firecrawl is a platform for scraping, search, and extraction. It allows you to take data from the web and translate it into LLM-legible markdown or structured data. When we want to get information about our customers, we can use Firecrawl’s LLM extraction functionality to specify the specific information we want from their websites. **Make.com (formerly Integromat)** Make is an automation platform that allows users to create customized workflows to connect various apps and services without needing deep technical knowledge. It uses a visual interface where users can drag and drop elements to design their automations. We can use Make to connect a spreadsheet of user data to Firecrawl, allowing us to do extraction with just a bit of JSON. ### Setting Up the Scenario - Step-by-step guide on setting up the data extraction process. - **Connecting Google Sheets to Make.com** - How user data is initially collected and stored. - **Configuring the HTTP Request in Make.com** - Description of setting up API requests to Firecrawl. - Purpose of these requests (e.g., extracting company information). ### Preparing our Data Before we get started, we want to make sure we prepare our data for Firecrawl. In this case, I created a simple spreadsheet with imported users from our database. We want to take the email domains of our users and transform them into links using the https:// format: ![](https://i.imgur.com/gssynZa.png) We also want to add some attributes that we’d like to know about these companies. For me, I want to understand a bit about the company, their industry, and their customers. I’ve set these in columns as: company\_description company\_type who\_they\_serve Now that we have our data prepared, we can start setting up our automation in Make! ## Setting up our automation To get our automation running, we simply need to follow a three step process in Make. Here, we will choose three apps in our scenario: Google Sheets - Get range values HTTP - Make an API key auth request Google Sheets - Update a row We’ll also want to add the ignore flow control tool in case we run into any errors. This will keep the automation going. ![](https://i.imgur.com/MdCWv30.png) This automation will allow us to extract a set of links from our spreadsheet, send them to Firecrawl for data extraction, then repopulate our spreadsheet with the desired info. Let’s start by configuring our first app. Our goal is to export all of the URLs so that we can send them to Firecrawl for extraction. Here is the configuration for pulling these URLs: ![](https://i.imgur.com/WHa91kY.png) \* _Important_ \- we want to make sure we start pulling data from the second row. If you include the header, you will eventually run into an error. * * * Great! Now that we have that configured, we want to prepare to set up our HTTP request. To do this, we will go to [https://firecrawl.dev](https://firecrawl.dev/) to sign up and get our API key (you can get started for free!). Once you sign up, you can go to [https://firecrawl.dev/account](https://firecrawl.dev/account) to see your API key. We will be using Firecrawl’s Scrape Endpoint. This endpoint will allow us to pull information from a single URL, translate it into clean markdown, and use it to extract the data we need. I will be filling out all the necessary conditions in our Make HTTP request using the API reference in their documentation. Now in Make, I configure the API call using the documentation from Firecrawl. We will be using POST as the HTTP method and have two headers. ``` Header 1: Name: Authorization Value: Bearer your-API-key Header 2: Name: Content-Type Value: application/json ``` ![](https://i.imgur.com/LJ8g142.png) We also want to set our body and content types. Here we will do: ``` Body type: Raw Content type: Json (application/json) ``` We will also click ‘yes’ for parsing our response. This will automatically parse our response into JSON. The request content is the main meat of what we want to achieve. Here is the request content we will use for this use case: ``` { "url": "1. url(B)", "pageOptions": { "onlyMainContent": true }, "extractorOptions": { "mode": "llm-extraction", "extractionPrompt": "Extract the company description (in one sentence explain what the company does), company industry (software, services, AI, etc.) - this really should just be a tag with a couple keywords, and who they serve (who are their customers). If there is no clear information to answer the question, write 'no info'.", "extractionSchema": { "type": "object", "properties": { "company_description": { "type": "string" }, "company_industry": { "type": "string" }, "who_they_serve": { "type": "string" } }, "required": [\ "company_description",\ "company_industry",\ "who_they_serve"\ ] } } } ``` ![](https://i.imgur.com/DrMc1g2.png) \* _Note_ the green field in the screenshot is a dynamic item that you can choose in the Make UI. Instead of `url (B)`, the block may be the first URL in your data. ![](https://i.imgur.com/D4HCBNe.png) Fantastic! Now we have configured our HTTP request. Let’s test it to make sure everything is working as it should be. Click ‘run once’ in Make and we should be getting data back. ![](https://i.imgur.com/QuQZs0U.png) When we run, let’s check our first operation. In the output, we should be getting a `status code: 200`, meaning that our API request was successful. In the output, click on data to make sure we got the data we needed. ![](https://i.imgur.com/pm614VA.png) Our output looks successful! In the llm\_extraction we are seeing the three attributes of data that we wanted from the website. \* _Note_ if you are getting a `500` error on your first operation and `200` responses on the subsequent ones, this may be because the operation is trying to be performed on the first row of your data (the header row). This will cause issues importing the data back into sheets! Make sure you start from the second row as mentioned before. Now that we know the HTTP request is working correctly, all that’s left is to take the outputted JSON from Firecrawl and put it back into our spreadsheet. * * * Now we need to take our extracted data and put it back into our spreadsheet. To do this, we will take the outputted JSON from our HTTP request and export the text into the relevant tables. Let’s start by connecting the same google sheet and specifying the Row Number criteria. Here we will just use the Make UI to choose ‘row number’ ![](https://i.imgur.com/BYpPabk.png) All that’s left is to specify which LLM extracted data goes into which column. Here, we can simply use the UI in Make to set this up. ![](https://i.imgur.com/219tft2.png) That’s it, now it’s time to test our automation! * * * Let’s click `run once` on the Make UI and make sure everything is running smoothly. The automation should start iterating through link-by-link and populating our spreadsheet in real time. ![](https://i.imgur.com/vU1CJlt.png) We have success! Using Make and Firecrawl, we have been able to extract specific information about our customers without the need of manually going to each of their websites. Looking at the data, we are starting to get a better understanding of our customers. However, we are not limited to these specific characteristics. If we want, we can customize our JSON and Extraction Prompt to find out other information about these companies. ### Use Cases LLM extraction allows us to quickly get specific information from the web that’s relevant to our business. We can use these automations to do a variety of tasks. **Product:** Especially for self-serve companies, we can understand the trends in industries using our product. What are the top 2-3 industries using our tech and what are they using it for? This will allow us to make better product decisions by prioritizing the right customers to focus on. **Business Development:** By understanding who our users are, we can look for similar companies who could benefit from our product as well. By doing a similar automation, we can extract positive indicators from prospects that would benefit from our product. We can also use this data to generate better outreach emails that are more specific to the individual prospect. **Market Research:** Market research firms spend tons of time doing secondary research, especially in niche sectors. We can streamline data collection by automating the extraction and organization of data from diverse sources. This automation helps boost efficiency and scales with growing data needs, making it a valuable tool for strategic decision-making in fast-evolving industries. ### Going a step further This was just a simple example of how we can use LLMs to extract relevant data from websites using a static spreadsheet. You can always make this more advanced by connecting this dynamically to your sign ups. Additionally, you could connect this to other tools to further accelerate your productivity. For example, using the extracted content to generate more personalized copy for prospecting. If you found this useful, feel free to let me know! I’d love to hear your feedback or learn about what you’re building. You can reach me at [garrett@mendable.ai](mailto:garrett@mendable.ai). Good luck and happy building! Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Caleb Peffer image](https://www.firecrawl.dev/caleb-img.jpeg)\\ Caleb Peffer@CalebPeffer](https://x.com/CalebPeffer) Caleb Peffer is the Chief Executive Officer (CEO) of Firecrawl. Previously, built and scaled Mendable, an innovative "chat with your documents" application, and sold it to major customers like Snapchat, Coinbase, and MongoDB. Also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users. Caleb has a passion for building products that help people do their best work. Caleb studied Computer Science and has over 10 years of experience in software engineering. ### More articles by Caleb Peffer [Using LLM Extraction for Customer Insights\\ \\ Using LLM Extraction for Insights and Lead Generation using Make and Firecrawl.](https://www.firecrawl.dev/blog/lead-gen-business-insights-make-firecrawl) ## Crawl Webhooks Introduction Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) September 1, 2024 • [![Nicolas Camara image](https://www.firecrawl.dev/nick-img.jpeg)Nicolas Camara](https://x.com/nickscamara_) # Launch Week I / Day 7: Crawl Webhooks (v1) ![Launch Week I / Day 7: Crawl Webhooks (v1) image](https://www.firecrawl.dev/images/blog/webhooks.png) Welcome to Day 7 of Firecrawl’s Launch Week! We’re excited to introduce new /crawl webhook support. ## Crawl Webhook You can now pass a `webhook` parameter to the `/crawl` endpoint. This will send a POST request to the URL you specify when the crawl is started, updated and completed. The webhook will now trigger for every page crawled and not just the whole result at the end. ![Webhook](https://www.firecrawl.dev/images/blog/webhook-v1.png) ### Webhook Events There are now 4 types of events: - `crawl.started` \- Triggered when the crawl is started. - `crawl.page` \- Triggered for every page crawled. - `crawl.completed` \- Triggered when the crawl is completed to let you know it’s done. - `crawl.failed` \- Triggered when the crawl fails. ### Webhook Response - `success` \- If the webhook was successful in crawling the page correctly. - `type` \- The type of event that occurred. - `id` \- The ID of the crawl. - `data` \- The data that was scraped (Array). This will only be non empty on `crawl.page` and will contain 1 item if the page was scraped successfully. The response is the same as the `/scrape` endpoint. - `error` \- If the webhook failed, this will contain the error message. ## Learn More Learn more about the webhook in our [documentation](https://docs.firecrawl.dev/features/crawl#crawl-webhook). Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Nicolas Camara image](https://www.firecrawl.dev/nick-img.jpeg)\\ Nicolas Camara@nickscamara\_](https://x.com/nickscamara_) Nicolas Camara is the Chief Technology Officer (CTO) at Firecrawl. He previously built and scaled Mendable, one of the pioneering "chat with your documents" apps, which had major Fortune 500 customers like Snapchat, Coinbase, and MongoDB. Prior to that, Nicolas built SideGuide, the first code-learning tool inside VS Code, and grew a community of 50,000 users. Nicolas studied Computer Science and has over 10 years of experience in building software. ### More articles by Nicolas Camara [Using OpenAI's Realtime API and Firecrawl to Talk with Any Website\\ \\ Build a real-time conversational agent that interacts with any website using OpenAI's Realtime API and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Talk-with-Any-Website-Using-OpenAIs-Realtime-API-and-Firecrawl) [Extract website data using LLMs\\ \\ Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code.](https://www.firecrawl.dev/blog/data-extraction-using-llms) [Getting Started with Grok-2: Setup and Web Crawler Example\\ \\ A detailed guide on setting up Grok-2 and building a web crawler using Firecrawl.](https://www.firecrawl.dev/blog/grok-2-setup-and-web-crawler-example) [Launch Week I / Day 6: LLM Extract (v1)\\ \\ Extract structured data from your web pages using the extract format in /scrape.](https://www.firecrawl.dev/blog/launch-week-i-day-6-llm-extract) [Launch Week I / Day 7: Crawl Webhooks (v1)\\ \\ New /crawl webhook support. Send notifications to your apps during a crawl.](https://www.firecrawl.dev/blog/launch-week-i-day-7-webhooks) [OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website\\ \\ A guide to building a multi-agent system using OpenAI Swarm and Firecrawl for AI-driven marketing strategies](https://www.firecrawl.dev/blog/openai-swarm-agent-tutorial) [Build a 'Chat with website' using Groq Llama 3\\ \\ Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.](https://www.firecrawl.dev/blog/chat-with-website) [Scrape and Analyze Airbnb Data with Firecrawl and E2B\\ \\ Learn how to scrape and analyze Airbnb data using Firecrawl and E2B in a few lines of code.](https://www.firecrawl.dev/blog/scrape-analyze-airbnb-data-with-e2b) ## OpenAI Predicted Outputs Guide Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Nov 5, 2024 • [![Eric Ciarla image](https://www.firecrawl.dev/eric-img.jpeg)Eric Ciarla](https://x.com/ericciarla) # Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses ![Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses image](https://www.firecrawl.dev/images/blog/openai-predicted-outputs.jpg) Leveraging the full potential of Large Language Models (LLMs) often involves balancing between response accuracy and latency. OpenAI’s new Predicted Outputs feature introduces a way to significantly reduce response times by informing the model about the expected output in advance. In this article, we’ll explore how to use Predicted Outputs with the GPT-4o and GPT-4o-mini models to make your AI applications super fast 🚀. We’ll also provide a practical example of transforming blog posts into SEO-optimized content, a powerful use case enabled by this feature. ### What Are Predicted Outputs? Predicted Outputs allow you to provide the LLM with an anticipated output, especially useful when most of the response is known ahead of time. For tasks like rewriting text with minor modifications, this can drastically reduce the time it takes for the model to generate the desired result. ### Why Use Predicted Outputs? By supplying the model with a prediction of the output, you: - **Reduce Latency:** The model can process and generate responses faster because it doesn’t need to generate the entire output from scratch. - **Enhance Efficiency:** Useful when you can reasonably assume that large portions of the output will remain unchanged. ### Limitations to Keep in Mind While Predicted Outputs are powerful, there are some limitations: - Supported only with **GPT-4o** and **GPT-4o-mini** models. - Certain API parameters are not supported, such as `n` values greater than 1, `logprobs`, `presence_penalty` greater than 0, among others. ### How to Use Predicted Outputs Let’s dive into how you can implement Predicted Outputs in your application. We’ll walk through an example where we optimize a blog post by adding internal links to relevant pages within the same website. #### Prerequisites Make sure you have the following installed: ```bash pip install firecrawl-py openai ``` #### Step 1: Set Up Your Environment Initialize the necessary libraries and load your API keys. ```python import os import json from firecrawl import FirecrawlApp from dotenv import load_dotenv from openai import OpenAI # Load environment variables load_dotenv() # Retrieve API keys from environment variables firecrawl_api_key = os.getenv("FIRECRAWL_API_KEY") openai_api_key = os.getenv("OPENAI_API_KEY") # Initialize the FirecrawlApp and OpenAI client app = FirecrawlApp(api_key=firecrawl_api_key) client = OpenAI(api_key=openai_api_key) ``` #### Step 2: Scrape the Blog Content We’ll start by scraping the content of a blog post that we want to optimize. ```python # Get the blog URL (you can input your own) blog_url = "https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications" # Scrape the blog content in markdown format blog_scrape_result = app.scrape_url(blog_url, params={'formats': ['markdown']}) blog_content = blog_scrape_result.get('markdown', '') ``` #### Step 3: Map the Website for Internal Links Next, we’ll get a list of other pages on the website to which we can add internal links. ```python # Extract the top-level domain top_level_domain = '/'.join(blog_url.split('/')[:3]) # Map the website to get all internal links site_map = app.map_url(top_level_domain) site_links = site_map.get('links', []) ``` #### Step 4: Prepare the Prompt and Prediction We’ll create a prompt instructing the model to add internal links to the blog post and provide the original content as a prediction. ```python prompt = f""" You are an AI assistant helping to improve a blog post. Here is the original blog post content: {blog_content} Here is a list of other pages on the website: {json.dumps(site_links, indent=2)} Please revise the blog post to include internal links to some of these pages where appropriate. Make sure the internal links are relevant and enhance the content. Only return the revised blog post in markdown format. """ ``` #### Step 5: Use Predicted Outputs with the OpenAI API Now, we’ll call the OpenAI API using the `prediction` parameter to provide the existing content. ```python completion = client.chat.completions.create( model="gpt-4o-mini", messages=[\ {\ "role": "user",\ "content": prompt\ }\ ], prediction={ "type": "content", "content": blog_content } ) revised_blog_post = completion.choices[0].message.content ``` #### Step 6: Compare the Original and Revised Content Finally, we’ll compare the number of links in the original and revised blog posts to see the improvements. ```python import re def count_links(markdown_content): return len(re.findall(r'\[.*?\]\(.*?\)', markdown_content)) original_links_count = count_links(blog_content) revised_links_count = count_links(revised_blog_post) print(f"Number of links in the original blog post: {original_links_count}") print(f"Number of links in the revised blog post: {revised_links_count}") ``` ### Conclusion By utilizing Predicted Outputs, you can significantly speed up tasks where most of the output is known, such as content reformatting or minor edits. This feature is a game-changer for developers looking to optimize performance without compromising on the quality of the output. That’s it! In this article, we’ve shown you how to get started with Predicted Outputs using OpenAI’s GPT-4o models. Whether you’re transforming content, correcting errors, or making minor adjustments, Predicted Outputs can make your AI applications faster and more efficient. ### References - [Using Predicted Outputs](https://platform.openai.com/docs/guides/latency-optimization#use-predicted-outputs) - [Firecrawl Documentation](https://www.firecrawl.dev/docs) Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Eric Ciarla image](https://www.firecrawl.dev/eric-img.jpeg)\\ Eric Ciarla@ericciarla](https://x.com/ericciarla) Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB. Previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users. ### More articles by Eric Ciarla [How to Create an llms.txt File for Any Website\\ \\ Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [Cloudflare Error 1015: How to solve it?\\ \\ Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner.](https://www.firecrawl.dev/blog/cloudflare-error-1015-how-to-solve-it) [Build an agent that checks for website contradictions\\ \\ Using Firecrawl and Claude to scrape your website's data and look for contradictions.](https://www.firecrawl.dev/blog/contradiction-agent) [Why Companies Need a Data Strategy for Generative AI\\ \\ Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses\\ \\ A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [How to easily install requests with pip and python\\ \\ A tutorial on installing the requests library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-easily-install-requests-with-pip-and-python) [How to quickly install BeautifulSoup with Python\\ \\ A guide on installing the BeautifulSoup library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python) [How to Use OpenAI's o1 Reasoning Models in Your Applications\\ \\ Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications) ## Scrape Airbnb Data Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) May 23, 2024 • [![Nicolas Camara image](https://www.firecrawl.dev/nick-img.jpeg)Nicolas Camara](https://x.com/nickscamara_) # Scrape and Analyze Airbnb Data with Firecrawl and E2B ![Scrape and Analyze Airbnb Data with Firecrawl and E2B image](https://www.firecrawl.dev/images/blog/firecrawl-e2b-airbnb.png) This cookbook demonstrates how to scrape Airbnb data and analyze it using [Firecrawl](https://www.firecrawl.dev/) and the [Code Interpreter SDK](https://github.com/e2b-dev/code-interpreter) from E2B. Feel free to clone the [Github Repository](https://github.com/e2b-dev/e2b-cookbook) or follow along with the steps below. ## Prerequisites - Node.js installed on your machine - Get [E2B API key](https://e2b.dev/docs/getting-started/api-key) - Get [Firecrawl API key](https://firecrawl.dev/) - Get [Anthropic API key](https://anthropic.com/) ## Setup Start by creating a new directory and initializing a new Node.js typescript project: ```bash mkdir airbnb-analysis cd airbnb-analysis npm init -y ``` Next, install the required dependencies: ```bash npm install @anthropic-ai/sdk @e2b/code-interpreter @mendable/firecrawl-js ``` And dev dependencies: ```bash npm install --save-dev @types/node prettier tsx typescript dotenv zod ``` ## Create a `.env` file Create a `.env` file in the root of your project and add the following environment variables: ```bash # TODO: Get your E2B API key from https://e2b.dev/docs E2B_API_KEY="" # TODO: Get your Firecrawl API key from https://firecrawl.dev FIRECRAWL_API_KEY="" # TODO: Get your Anthropic API key from https://anthropic.com ANTHROPIC_API_KEY="" ``` ## Scrape Airbnb data with Firecrawl Create a new file `scraping.ts`. ### Creating the scraping function ```typescript import * as fs from "fs"; import FirecrawlApp from "@mendable/firecrawl-js"; import "dotenv/config"; import { config } from "dotenv"; import { z } from "zod"; ``` 2. Let’s define our `scrapeAirbnb` function which uses Firecrawl to scrape Airbnb listings. We will use Firecrawl’s LLM Extract to try to get the pagination links and then scrape each page in parallel to get the listings. We will save to a JSON file so we can analyze it later and not have to re-scrape. ```typescript export async function scrapeAirbnb() { try { // Initialize the FirecrawlApp with your API key const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY, }); // Define the URL to crawl const listingsUrl = "https://www.airbnb.com/s/San-Francisco--CA--United-States/homes"; const baseUrl = "https://www.airbnb.com"; // Define schema to extract pagination links const paginationSchema = z.object({ page_links: z .array( z.object({ link: z.string(), }), ) .describe("Pagination links in the bottom of the page."), }); const params2 = { pageOptions: { onlyMainContent: false, }, extractorOptions: { extractionSchema: paginationSchema, }, timeout: 50000, // if needed, sometimes airbnb stalls... }; // Start crawling to get pagination links const linksData = await app.scrapeUrl(listingsUrl, params2); console.log(linksData.data["llm_extraction"]); let paginationLinks = linksData.data["llm_extraction"].page_links.map( (link) => baseUrl + link.link, ); // Just in case is not able to get the pagination links if (paginationLinks.length === 0) { paginationLinks = [listingsUrl]; } // Define schema to extract listings const schema = z.object({ listings: z .array( z.object({ title: z.string(), price_per_night: z.number(), location: z.string(), rating: z.number().optional(), reviews: z.number().optional(), }), ) .describe("Airbnb listings in San Francisco"), }); const params = { pageOptions: { onlyMainContent: false, }, extractorOptions: { extractionSchema: schema, }, }; // Function to scrape a single URL const scrapeListings = async (url) => { const result = await app.scrapeUrl(url, params); return result.data["llm_extraction"].listings; }; // Scrape all pagination links in parallel const listingsPromises = paginationLinks.map((link) => scrapeListings(link), ); const listingsResults = await Promise.all(listingsPromises); // Flatten the results const allListings = listingsResults.flat(); // Save the listings to a file fs.writeFileSync( "airbnb_listings.json", JSON.stringify(allListings, null, 2), ); // Read the listings from the file const listingsData = fs.readFileSync("airbnb_listings.json", "utf8"); return listingsData; } catch (error) { console.error("An error occurred:", error.message); } } ``` ### Creating the code interpreter Let’s now prepare our code interepreter to analyze the data. Create a new file `codeInterpreter.ts`. This is where we will use the E2B Code Interpreter SDK to safely run the code that the LLM will generate and get its output. ```typescript import { CodeInterpreter } from "@e2b/code-interpreter"; export async function codeInterpret( codeInterpreter: CodeInterpreter, code: string, ) { console.log( `\n${"=".repeat(50)}\n> Running following AI-generated code: \n${code}\n${"=".repeat(50)}`, ); const exec = await codeInterpreter.notebook.execCell(code, { // You can stream logs from the code interpreter // onStderr: (stderr: string) => console.log("\n[Code Interpreter stdout]", stderr), // onStdout: (stdout: string) => console.log("\n[Code Interpreter stderr]", stdout), // // You can also stream additional results like charts, images, etc. // onResult: ... }); if (exec.error) { console.log("[Code Interpreter error]", exec.error); // Runtime error return undefined; } return exec; } ``` ### Preparing the model prompt and tool execution Create a file called `model.ts` that will contain the prompts, model names and the tools for execution. ```typescript import { Tool } from "@anthropic-ai/sdk/src/resources/beta/tools"; export const MODEL_NAME = "claude-3-opus-20240229"; export const SYSTEM_PROMPT = ` ## your job & context you are a python data scientist. you are given tasks to complete and you run python code to solve them. - the python code runs in jupyter notebook. - every time you call \`execute_python\` tool, the python code is executed in a separate cell. it's okay to multiple calls to \`execute_python\`. - display visualizations using matplotlib or any other visualization library directly in the notebook. don't worry about saving the visualizations to a file. - you have access to the internet and can make api requests. - you also have access to the filesystem and can read/write files. - you can install any pip package (if it exists) if you need to but the usual packages for data analysis are already preinstalled. - you can run any python code you want, everything is running in a secure sandbox environment. `; export const tools: Tool[] = [\ {\ name: "execute_python",\ description:\ "Execute python code in a Jupyter notebook cell and returns any result, stdout, stderr, display_data, and error.",\ input_schema: {\ type: "object",\ properties: {\ code: {\ type: "string",\ description: "The python code to execute in a single cell.",\ },\ },\ required: ["code"],\ },\ },\ ]; ``` ### Putting it all together Create a file `index.ts` to run the scraping and analysis. Here we will load the scraped data, send it to the LLM model, and then interpret the code generated by the model. ```typescript import * as fs from "fs"; import "dotenv/config"; import { CodeInterpreter, Execution } from "@e2b/code-interpreter"; import Anthropic from "@anthropic-ai/sdk"; import { Buffer } from "buffer"; import { MODEL_NAME, SYSTEM_PROMPT, tools } from "./model"; import { codeInterpret } from "./codeInterpreter"; import { scrapeAirbnb } from "./scraping"; const anthropic = new Anthropic(); /** * Chat with Claude to analyze the Airbnb data */ async function chat( codeInterpreter: CodeInterpreter, userMessage: string, ): Promise { console.log("Waiting for Claude..."); const msg = await anthropic.beta.tools.messages.create({ model: MODEL_NAME, system: SYSTEM_PROMPT, max_tokens: 4096, messages: [{ role: "user", content: userMessage }], tools, }); console.log( `\n${"=".repeat(50)}\nModel response: ${msg.content}\n${"=".repeat(50)}`, ); console.log(msg); if (msg.stop_reason === "tool_use") { const toolBlock = msg.content.find((block) => block.type === "tool_use"); const toolName = toolBlock?.name ?? ""; const toolInput = toolBlock?.input ?? ""; console.log( `\n${"=".repeat(50)}\nUsing tool: ${toolName}\n${"=".repeat(50)}`, ); if (toolName === "execute_python") { const code = toolInput.code; return codeInterpret(codeInterpreter, code); } return undefined; } } /** * Main function to run the scraping and analysis */ async function run() { // Load the Airbnb prices data from the JSON file let data; const readDataFromFile = () => { try { return fs.readFileSync("airbnb_listings.json", "utf8"); } catch (err) { if (err.code === "ENOENT") { console.log("File not found, scraping data..."); return null; } else { throw err; } } }; const fetchData = async () => { data = readDataFromFile(); if (!data || data.trim() === "[]") { console.log("File is empty or contains an empty list, scraping data..."); data = await scrapeAirbnb(); } }; await fetchData(); // Parse the JSON data const prices = JSON.parse(data); // Convert prices array to a string representation of a Python list const pricesList = JSON.stringify(prices); const userMessage = ` Load the Airbnb prices data from the airbnb listing below and visualize the distribution of prices with a histogram. Listing data: ${pricesList} `; const codeInterpreter = await CodeInterpreter.create(); const codeOutput = await chat(codeInterpreter, userMessage); if (!codeOutput) { console.log("No code output"); return; } const logs = codeOutput.logs; console.log(logs); if (codeOutput.results.length == 0) { console.log("No results"); return; } const firstResult = codeOutput.results[0]; console.log(firstResult.text); if (firstResult.png) { const pngData = Buffer.from(firstResult.png, "base64"); const filename = "airbnb_prices_chart.png"; fs.writeFileSync(filename, pngData); console.log(`✅ Saved chart to ${filename}`); } await codeInterpreter.close(); } run(); ``` ### Running the code Run the code with: ```bash npm run start ``` ### Results At the end you should get a histogram of the Airbnb prices in San Francisco saved as `airbnb_prices_chart.png`. ![Airbnb Prices Chart](https://www.firecrawl.dev/images/blog/airbnb_prices_chart.png) That’s it! You have successfully scraped Airbnb data and analyzed it using Firecrawl and E2B’s Code Interpreter SDK. Feel free to experiment with different models and prompts to get more insights from the data. Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Nicolas Camara image](https://www.firecrawl.dev/nick-img.jpeg)\\ Nicolas Camara@nickscamara\_](https://x.com/nickscamara_) Nicolas Camara is the Chief Technology Officer (CTO) at Firecrawl. He previously built and scaled Mendable, one of the pioneering "chat with your documents" apps, which had major Fortune 500 customers like Snapchat, Coinbase, and MongoDB. Prior to that, Nicolas built SideGuide, the first code-learning tool inside VS Code, and grew a community of 50,000 users. Nicolas studied Computer Science and has over 10 years of experience in building software. ### More articles by Nicolas Camara [Using OpenAI's Realtime API and Firecrawl to Talk with Any Website\\ \\ Build a real-time conversational agent that interacts with any website using OpenAI's Realtime API and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Talk-with-Any-Website-Using-OpenAIs-Realtime-API-and-Firecrawl) [Extract website data using LLMs\\ \\ Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code.](https://www.firecrawl.dev/blog/data-extraction-using-llms) [Getting Started with Grok-2: Setup and Web Crawler Example\\ \\ A detailed guide on setting up Grok-2 and building a web crawler using Firecrawl.](https://www.firecrawl.dev/blog/grok-2-setup-and-web-crawler-example) [Launch Week I / Day 6: LLM Extract (v1)\\ \\ Extract structured data from your web pages using the extract format in /scrape.](https://www.firecrawl.dev/blog/launch-week-i-day-6-llm-extract) [Launch Week I / Day 7: Crawl Webhooks (v1)\\ \\ New /crawl webhook support. Send notifications to your apps during a crawl.](https://www.firecrawl.dev/blog/launch-week-i-day-7-webhooks) [OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website\\ \\ A guide to building a multi-agent system using OpenAI Swarm and Firecrawl for AI-driven marketing strategies](https://www.firecrawl.dev/blog/openai-swarm-agent-tutorial) [Build a 'Chat with website' using Groq Llama 3\\ \\ Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.](https://www.firecrawl.dev/blog/chat-with-website) [Scrape and Analyze Airbnb Data with Firecrawl and E2B\\ \\ Learn how to scrape and analyze Airbnb data using Firecrawl and E2B in a few lines of code.](https://www.firecrawl.dev/blog/scrape-analyze-airbnb-data-with-e2b) ## Mastering Firecrawl's Crawl Endpoint Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Nov 18, 2024 • [![Bex Tuychiev image](https://www.firecrawl.dev/bex.jpg)Bex Tuychiev](https://x.com/bextuychiev) # Mastering Firecrawl's Crawl Endpoint: A Complete Web Scraping Guide ![Mastering Firecrawl's Crawl Endpoint: A Complete Web Scraping Guide image](https://www.firecrawl.dev/images/blog/crawl-masterclass/images/mastering-crawl.jpg) ## Introduction Web scraping and data extraction have become essential tools as businesses race to convert unprecedented amounts of online data into LLM-friendly formats. Firecrawl’s powerful web scraping API streamlines this process with enterprise-grade automation and scalability features. This comprehensive guide focuses on Firecrawl’s most powerful feature - the `/crawl` endpoint, which enables automated website scraping at scale. You’ll learn how to: - Recursively traverse website sub-pages - Handle dynamic JavaScript-based content - Bypass common web scraping blockers - Extract clean, structured data for AI/ML applications Want to follow along with our python notebook version of this post? [Check it out here!](https://github.com/mendableai/firecrawl/blob/main/examples/mastering-the-crawl-endpoint/mastering-the-crawl-endpoint.ipynb) ## Table of Contents - [Introduction](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#introduction) - [Web Scraping vs Web Crawling: Understanding the Key Differences](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#web-scraping-vs-web-crawling-understanding-the-key-differences) - [What’s the Difference?](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#whats-the-difference) - [How Firecrawl Combines Both](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#how-firecrawl-combines-both) - [Step-by-Step Guide to Web Crawling with Firecrawl’s API](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#step-by-step-guide-to-web-crawling-with-firecrawls-api) - [Performance & Limits](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#performance--limits) - [Asynchronous Web Crawling with Firecrawl: Efficient Large-Scale Data Collection](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#asynchronous-web-crawling-with-firecrawl-efficient-large-scale-data-collection) - [Asynchronous programming in a nutshell](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#asynchronous-programming-in-a-nutshell) - [Using `async_crawl_url` method](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#using-async_crawl_url-method) - [Benefits of asynchronous crawling](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#benefits-of-asynchronous-crawling) - [How to Save and Store Web Crawling Results](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#how-to-save-and-store-web-crawling-results) - [Local file storage](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#local-file-storage) - [Building AI-Powered Web Crawlers with Firecrawl and LangChain Integration](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#building-ai-powered-web-crawlers-with-firecrawl-and-langchain-integration) - [Conclusion](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl#conclusion) ## Web Scraping vs Web Crawling: Understanding the Key Differences ### What’s the Difference? _Web scraping_ refers to extracting specific data from individual web pages like a Wikipedia article or a technical tutorial. It is primarily used when you need specific information from pages with _known URLs_. _Web crawling_, on the other hand, involves systematically browsing and discovering web pages by following links. It focuses on website navigation and URL discovery. For example, to build a chatbot that answers questions about Stripe’s documentation, you would need: 1. Web crawling to discover and traverse all pages in Stripe’s documentation site 2. Web scraping to extract the actual content from each discovered page ### How Firecrawl Combines Both Firecrawl’s `/crawl` endpoint combines both capabilities: 1. URL analysis: Identifies links through sitemap or page traversal 2. Recursive traversal: Follows links to discover sub-pages 3. Content scraping: Extracts clean content from each page 4. Results compilation: Converts everything to structured data When you pass the URL `https://docs.stripe.com/api` to the endpoint, it automatically discovers and crawls all documentation sub-pages. The endpoint returns the content in your preferred format - whether that’s markdown, HTML, screenshots, links, or metadata. ## Step-by-Step Guide to Web Crawling with Firecrawl’s API Firecrawl is a web scraping engine exposed as a REST API. You can use it from the command line using cURL or using one of its language SDKs for Python, Node, Go, or Rust. For this tutorial, we’ll focus on its Python SDK. To get started: 1. Sign up at [firecrawl.dev](https://firecrawl.dev/) and copy your API key 2. Save the key as an environment variable: ```bash export FIRECRAWL_API_KEY='fc-YOUR-KEY-HERE' ``` Or use a dot-env file: ```bash touch .env echo "FIRECRAWL_API_KEY='fc-YOUR-KEY-HERE'" >> .env ``` Then use the Python SDK: ```python from firecrawl import FirecrawlApp from dotenv import load_dotenv load_dotenv() app = FirecrawlApp() ``` Once your API key is loaded, the `FirecrawlApp` class uses it to establish a connection with the Firecrawl API engine. First, we will crawl the [https://books.toscrape.com/](https://books.toscrape.com/) website, which is built for web-scraping practice: ![Homepage of Books to Scrape website showing a grid layout of book covers, prices and ratings - a popular web scraping practice site](https://www.firecrawl.dev/images/blog/crawl-masterclass/images/books_to_scrape_homepage.png) Instead of writing dozens of lines of code with libraries like `beautifulsoup4` or `lxml` to parse HTML elements, handle pagination and data retrieval, Firecrawl’s `crawl_url` endpoint lets you accomplish this in a single line: ```python base_url = "https://books.toscrape.com/" crawl_result = app.crawl_url(url=base_url) ``` The result is a dictionary with the following keys: ```python crawl_result.keys() ``` ```text dict_keys(['success', 'status', 'completed', 'total', 'creditsUsed', 'expiresAt', 'data']) ``` First, we are interested in the status of the crawl job: ```python crawl_result['status'] ``` ```text 'completed' ``` If it is completed, let’s see how many pages were scraped: ```python crawl_result['total'] ``` ```text 1195 ``` Almost 1200 pages (it took about 70 seconds on my machine; the speed vary based on your connection speed). Let’s look at one of the elements of the `data` list: ```python sample_page = crawl_result['data'][10] markdown_content = sample_page['markdown'] print(markdown_content[:500]) ``` ```text - [Home](../../../../index.html) - [Books](../../books_1/index.html) - Womens Fiction # Womens Fiction **17** results. **Warning!** This is a demo website for web scraping purposes. Prices and ratings here were randomly assigned and have no real meaning. 01. [![I Had a Nice Time And Other Lies...: How to find love & sh*t like that](../../../../media/cache/5f/72/5f72c8a0d5a7292e2929a354ec8a022f.jpg)](../../../i-had-a-nice-time-and-other-lies-how-to-find-love-sht-like-that_814/index.html) ``` The page corresponds to Women’s Fiction page: ![Screenshot of Books to Scrape website showing the Women's Fiction category page with book listings and pagination](https://www.firecrawl.dev/images/blog/crawl-masterclass/images/women-fiction.png) Firecrawl also includes page metadata in the element’s dictionary as well: ```python sample_page['metadata'] ``` ```text { 'url': 'https://books.toscrape.com/catalogue/category/books/womens-fiction_9/index.html', 'title': 'Womens Fiction | Books to Scrape - Sandbox', 'robots': 'NOARCHIVE,NOCACHE', 'created': '24th Jun 2016 09:29', 'language': 'en-us', 'viewport': 'width=device-width', 'sourceURL': 'https://books.toscrape.com/catalogue/category/books/womens-fiction_9/index.html', 'statusCode': 200, 'description': '', 'ogLocaleAlternate': [] } ``` One thing we didn’t mention is how Firecrawl handles pagination. If you scroll to the bottom of Books-to-Scrape, you will see that it has a “next” button. Before moving on to sub-pages like `books.toscrape.com/category`, Firecrawl first scrapes all sub-pages from the homepage. Later, if a sub-page includes links to already scraped pages, they are ignored. ## Advanced Web Scraping Configuration and Best Practices Firecrawl offers several types of parameters to configure how the endpoint crawls over websites. We will outline them here with their use-cases. ### Scrape Options On real-world projects, you will tweak this parameter the most frequently. It allows you to control how a webpage’s contents are saved. Firecrawl allows the following formats: - Markdown - the default - HTML - Raw HTML (simple copy/paste of the entire webpage) - Links - Screenshot Here is an example request to scrape the Stripe API in four formats: ```python # Crawl the first 5 pages of the stripe API documentation stripe_crawl_result = app.crawl_url( url="https://docs.stripe.com/api", params={ "limit": 5, # Only scrape the first 5 pages including the base-url "scrapeOptions": { "formats": ["markdown", "html", "links", "screenshot"] } } ) ``` When you specify multiple formats, each webpage’s data contains separate keys for each format’s content: ```python stripe_crawl_result['data'][0].keys() ``` ```text dict_keys(['html', 'links', 'markdown', 'metadata', 'screenshot']) ``` The value of the `screenshot` key is a temporary link to a PNG file stored on Firecrawl’s servers and expires within 24 hours. Here is what it looks like for Stripe’s API documentation homepage: ```python from IPython.display import Image Image(stripe_crawl_result['data'][0]['screenshot']) ``` ![Screenshot of Stripe API documentation homepage captured using Firecrawl's web scraping capabilities](https://www.firecrawl.dev/images/blog/crawl-masterclass/notebook_files/notebook_39_0.png) Note that specifying more formats to transform the page’s contents can significantly slow down the process. Another time-consuming operation can be scraping the entire page contents instead of just the elements you want. For such scenarios, Firecrawl allows you to control which elements of a webpage are scraped using the `onlyMainContent`, `includeTags`, and `excludeTags` parameters. Enabling `onlyMainContent` parameter (disabled by default) excludes navigation, headers and footers: ```python stripe_crawl_result = app.crawl_url( url="https://docs.stripe.com/api", params={ "limit": 5, "scrapeOptions": { "formats": ["markdown", "html"], "onlyMainContent": True, }, }, ) ``` `includeTags` and `excludeTags` accepts a list of allowlisted/blocklisted HTML tags, classes and IDs: ```python # Crawl the first 5 pages of the stripe API documentation stripe_crawl_result = app.crawl_url( url="https://docs.stripe.com/api", params={ "limit": 5, "scrapeOptions": { "formats": ["markdown", "html"], "includeTags": ["code", "#page-header"], "excludeTags": ["h1", "h2", ".main-content"], }, }, ) ``` Crawling large websites can take a long time and when appropriate, these small tweaks can have a big impact on the runtime. ### URL Control Apart from scraping configurations, you have four options to specify URL patterns to include or exclude during crawling: - `includePaths` \- targeting specific sections - `excludePaths` \- avoiding unwanted content - `allowBackwardLinks` \- handling cross-references - `allowExternalLinks` \- managing external content Here is a sample request that uses these parameters: ```python # Example of URL control parameters url_control_result = app.crawl_url( url="https://docs.stripe.com/", params={ # Only crawl pages under the /payments path "includePaths": ["/payments/*"], # Skip the terminal and financial-connections sections "excludePaths": ["/terminal/*", "/financial-connections/*"], # Allow crawling links that point to already visited pages "allowBackwardLinks": False, # Don't follow links to external domains "allowExternalLinks": False, "scrapeOptions": { "formats": ["html"] } } ) # Print the total number of pages crawled print(f"Total pages crawled: {url_control_result['total']}") ``` ```out Total pages crawled: 134 ``` In this example, we’re crawling the Stripe documentation website with specific URL control parameters: - The crawler starts at [https://docs.stripe.com/](https://docs.stripe.com/) and only crawls pages under the `"/payments/*"` path - It explicitly excludes the `"/terminal/*"` and `"/financial-connections/*"` sections - By setting allowBackwardLinks to false, it won’t revisit already crawled pages - External links are ignored ( `allowExternalLinks: false`) - The scraping is configured to only capture HTML content This targeted approach helps focus the crawl on relevant content while avoiding unnecessary pages, making the crawl more efficient and focused on the specific documentation sections we need. Another critical parameter is `maxDepth`, which lets you control how many levels deep the crawler will traverse from the starting URL. For example, a `maxDepth` of 2 means it will crawl the initial page and pages linked from it, but won’t go further. Here is another sample request on the Stripe API docs: ```python # Example of URL control parameters url_control_result = app.crawl_url( url="https://docs.stripe.com/", params={ "limit": 100, "maxDepth": 2, "allowBackwardLinks": False, "allowExternalLinks": False, "scrapeOptions": {"formats": ["html"]}, }, ) # Print the total number of pages crawled print(f"Total pages crawled: {url_control_result['total']}") ``` ```out Total pages crawled: 99 ``` Note: When a page has pagination (e.g. pages 2, 3, 4), these paginated pages are not counted as additional depth levels when using `maxDepth`. ### Performance & Limits The `limit` parameter, which we’ve used in previous examples, is essential for controlling the scope of web crawling. It sets a maximum number of pages that will be scraped, which is particularly important when crawling large websites or when external links are enabled. Without this limit, the crawler could potentially traverse an endless chain of connected pages, consuming unnecessary resources and time. While the limit parameter helps control the breadth of crawling, you may also need to ensure the quality and completeness of each page crawled. To make sure all desired content is scraped, you can enable a waiting period to let pages fully load. For example, some websites use JavaScript to handle dynamic content, have iFrames for embedding content or heavy media elements like videos or GIFs: ```python stripe_crawl_result = app.crawl_url( url="https://docs.stripe.com/api", params={ "limit": 5, "scrapeOptions": { "formats": ["markdown", "html"], "waitFor": 1000, # wait for a second for pages to load "timeout": 10000, # timeout after 10 seconds }, }, ) ``` The above code also sets the `timeout` parameter to 10000 milliseconds (10 seconds), which ensures that if a page takes too long to load, the crawler will move on rather than getting stuck. Note: `waitFor` duration applies to all pages the crawler encounters. All the while, it is important to keep the limits of your plan in mind: | Plan | /scrape (requests/min) | /crawl (requests/min) | /search (requests/min) | | --- | --- | --- | --- | | Free | 10 | 1 | 5 | | Hobby | 20 | 3 | 10 | | Standard | 100 | 10 | 50 | | Growth | 1000 | 50 | 500 | ## Asynchronous Web Crawling with Firecrawl: Efficient Large-Scale Data Collection Even after following the tips and best practices from the previous section, the crawling process can be significantly long for large websites with thousands of pages. To handle this efficiently, Firecrawl provides asynchronous crawling capabilities that allow you to start a crawl and monitor its progress without blocking your application. This is particularly useful when building web applications or services that need to remain responsive while crawling is in progress. ### Asynchronous programming in a nutshell First, let’s understand asynchronous programming with a real-world analogy: Asynchronous programming is like a restaurant server taking multiple orders at once. Instead of waiting at one table until the customers finish their meal before moving to the next table, they can take orders from multiple tables, submit them to the kitchen, and handle other tasks while the food is being prepared. In programming terms, this means your code can initiate multiple operations (like web requests or database queries) and continue executing other tasks while waiting for responses, rather than processing everything sequentially. This approach is particularly valuable in web crawling, where most of the time is spent waiting for network responses - instead of freezing the entire application while waiting for each page to load, async programming allows you to process multiple pages concurrently, dramatically improving efficiency. ### Using `async_crawl_url` method Firecrawl offers an intuitive asynchronous crawling method via `async_crawl_url`: ```python app = FirecrawlApp() crawl_status = app.async_crawl_url("https://docs.stripe.com") print(crawl_status) ``` ```python {'success': True, 'id': 'c4a6a749-3445-454e-bf5a-f3e1e6befad7', 'url': 'https://api.firecrawl.dev/v1/crawl/c4a6a749-3445-454e-bf5a-f3e1e6befad7'} ``` It accepts the same parameters and scrape options as `crawl_url` but returns a crawl status dictionary. We are mostly interested in the crawl job `id` and can use it to check the status of the process using `check_crawl_status`: ```python checkpoint = app.check_crawl_status(crawl_status['id']) print(len(checkpoint['data'])) ``` ```python 29 ``` `check_crawl_status` returns the same output as `crawl_url` but only includes the pages scraped so far. You can run it multiple times and see the number of scraped pages increasing. If you want to cancel the job, you can use `cancel_crawl` passing the job id: ```python final_result = app.cancel_crawl(crawl_status['id']) print(final_result) ``` ```python {'status': 'cancelled'} ``` ### Benefits of asynchronous crawling There are many advantages of using the `async_crawl_url` over `crawl_url`: - You can create multiple crawl jobs without waiting for each to complete. - You can monitor progress and manage resources more effectively. - Perfect for batch processing or parallel crawling tasks. - Applications can remain responsive while crawling happens in background - Users can monitor progress instead of waiting for completion - Allows for implementing progress bars or status updates - Easier to integrate with message queues or job schedulers - Can be part of larger automated workflows - Better suited for microservices architectures In practice, you almost always use asynchronous crawling for large websites. ## How to Save and Store Web Crawling Results When crawling large websites, it’s important to save the results persistently. Firecrawl provides the crawled data in a structured format that can be easily saved to various storage systems. Let’s explore some common approaches. ### Local file storage The simplest approach is saving to local files. Here’s how to save crawled content in different formats: ```python import json from pathlib import Path def save_crawl_results(crawl_result, output_dir="firecrawl_output"): # Create output directory if it doesn't exist Path(output_dir).mkdir(parents=True, exist_ok=True) # Save full results as JSON with open(f"{output_dir}/full_results.json", "w") as f: json.dump(crawl_result, f, indent=2) # Save just the markdown content in separate files for idx, page in enumerate(crawl_result["data"]): # Create safe filename from URL filename = ( page["metadata"]["url"].split("/")[-1].replace(".html", "") or f"page_{idx}" ) # Save markdown content if "markdown" in page: with open(f"{output_dir}/{filename}.md", "w") as f: f.write(page["markdown"]) ``` Here is what the above function does: 1. Creates an output directory if it doesn’t exist 2. Saves the complete crawl results as a JSON file with proper indentation 3. For each crawled page: - Generates a filename based on the page URL - Saves the markdown content to a separate .md file ```python app = FirecrawlApp() crawl_result = app.crawl_url(url="https://docs.stripe.com/api", params={"limit": 10}) save_crawl_results(crawl_result) ``` It is a basic function that requires modifications for other scraping formats. ### Database storage For more complex applications, you might want to store the results in a database. Here’s an example using SQLite: ```python import sqlite3 def save_to_database(crawl_result, db_path="crawl_results.db"): conn = sqlite3.connect(db_path) cursor = conn.cursor() # Create table if it doesn't exist cursor.execute( """ CREATE TABLE IF NOT EXISTS pages ( url TEXT PRIMARY KEY, title TEXT, content TEXT, metadata TEXT, crawl_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP ) """ ) # Insert pages for page in crawl_result["data"]: cursor.execute( "INSERT OR REPLACE INTO pages (url, title, content, metadata) VALUES (?, ?, ?, ?)", ( page["metadata"]["url"], page["metadata"]["title"], page.get("markdown", ""), json.dumps(page["metadata"]), ), ) conn.commit() print(f"Saved {len(crawl_result['data'])} pages to {db_path}") conn.close() ``` The function creates a SQLite database with a `pages` table that stores the crawled data. For each page, it saves the URL (as primary key), title, content (in markdown format), and metadata (as JSON). The crawl date is automatically added as a timestamp. If a page with the same URL already exists, it will be replaced with the new data. This provides a persistent storage solution that can be easily queried later. ```python save_to_database(crawl_result) ``` ```python Saved 9 pages to crawl_results.db ``` Let’s query the database to double-check: ```python # Query the database conn = sqlite3.connect("crawl_results.db") cursor = conn.cursor() cursor.execute("SELECT url, title, metadata FROM pages") print(cursor.fetchone()) conn.close() ``` ```python ( 'https://docs.stripe.com/api/errors', 'Errors | Stripe API Reference', { "url": "https://docs.stripe.com/api/errors", "title": "Errors | Stripe API Reference", "language": "en-US", "viewport": "width=device-width, initial-scale=1", "sourceURL": "https://docs.stripe.com/api/errors", "statusCode": 200, "description": "Complete reference documentation for the Stripe API. Includes code snippets and examples for our Python, Java, PHP, Node.js, Go, Ruby, and .NET libraries.", "ogLocaleAlternate": [] } ) ``` ### Cloud storage For production applications, you might want to store results in cloud storage. Here’s an example using AWS S3: ```python import boto3 from datetime import datetime def save_to_s3(crawl_result, bucket_name, prefix="crawls"): s3 = boto3.client("s3") timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") # Save full results full_results_key = f"{prefix}/{timestamp}/full_results.json" s3.put_object( Bucket=bucket_name, Key=full_results_key, Body=json.dumps(crawl_result, indent=2), ) # Save individual pages for idx, page in enumerate(crawl_result["data"]): if "markdown" in page: page_key = f"{prefix}/{timestamp}/pages/{idx}.md" s3.put_object(Bucket=bucket_name, Key=page_key, Body=page["markdown"]) print(f"Successfully saved {len(crawl_result['data'])} pages to {bucket_name}/{full_results_key}") ``` Here is what the function does: - Takes a crawl result dictionary, S3 bucket name, and optional prefix as input - Creates a timestamped folder structure in S3 to organize the data - Saves the full crawl results as a single JSON file - For each crawled page that has markdown content, saves it as an individual `.md` file - Uses boto3 to handle the AWS S3 interactions - Preserves the hierarchical structure of the crawl data For this function to work, you must have `boto3` installed and your AWS credentials saved inside the `~/.aws/credentials` file with the following format: ```bash [default] aws_access_key_id = your_access_key aws_secret_access_key = your_secret_key region = your_region ``` Then, you can execute the function provided that you already have an S3 bucket to store the data: ```python save_to_s3(crawl_result, "sample-bucket-1801", "stripe-api-docs") ``` ```text Successfully saved 9 pages to sample-bucket-1801/stripe-api-docs/20241118_142945/full_results.json ``` ### Incremental saving with async crawls When using async crawling, you might want to save results incrementally sa they come in: ```python import time def save_incremental_results(app, crawl_id, output_dir="firecrawl_output"): Path(output_dir).mkdir(parents=True, exist_ok=True) processed_urls = set() while True: # Check current status status = app.check_crawl_status(crawl_id) # Save new pages for page in status["data"]: url = page["metadata"]["url"] if url not in processed_urls: filename = f"{output_dir}/{len(processed_urls)}.md" with open(filename, "w") as f: f.write(page.get("markdown", "")) processed_urls.add(url) # Break if crawl is complete if status["status"] == "completed": print(f"Saved {len(processed_urls)} pages.") break time.sleep(5) # Wait before checking again ``` Here is what the function does: - Creates an output directory if it doesn’t exist - Maintains a set of processed URLs to avoid duplicates - Continuously checks the crawl status until completion - For each new page found, saves its markdown content to a numbered file - Sleeps for 5 seconds between status checks to avoid excessive API calls Let’s use it while the app crawls Books-to-Scrape website: ```python # Start the crawl crawl_status = app.async_crawl_url(url="https://books.toscrape.com/") # Save results incrementally save_incremental_results(app, crawl_status["id"]) ``` ```python Saved 705 pages. ``` ## Building AI-Powered Web Crawlers with Firecrawl and LangChain Integration Firecrawl has integrations with popular open-source libraries like LangChain and other platforms. ![Popular AI and data tools that integrate with Firecrawl including LangChain, LlamaIndex, CrewAI, and vector databases](https://www.firecrawl.dev/images/blog/crawl-masterclass/notebook_files/image.png) In this section, we will see how to use the LangChain integration to build a basic QA chatbot on the [LangChain Community Integrations](https://python.langchain.com/docs/integrations/providers/) website. Start by installing LangChain and its related libraries: ```bash pip install langchain langchain_community langchain_anthropic langchain_openai ``` Then, add your `ANTHROPIC_API_KEY` and `OPENAI_API_KEY` as variables to your `.env` file. Next, import the `FireCrawlLoader` class from the document loaders module and initialize it: ```python from dotenv import load_dotenv from langchain_community.document_loaders.firecrawl import FireCrawlLoader load_dotenv() loader = FireCrawlLoader( url="https://python.langchain.com/docs/integrations/providers/", mode="crawl", params={"limit": 5, "scrapeOptions": {"onlyMainContent": True}}, ) ``` The class can read your Firecrawl API key automatically since we are loading the variables using `load_dotenv()`. To start the crawl, you can call the `load()` method of the loader object and the scraped contents will be turned into LangChain compatible documents: ```python # Start the crawl docs = loader.load() ``` The next step is chunking: ```python from langchain_text_splitters import RecursiveCharacterTextSplitter # Add text splitting before creating the vector store text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100) # Split the documents split_docs = text_splitter.split_documents(docs) ``` Above, we split the documents into smaller chunks using the `RecursiveCharacterTextSplitter`. This helps make the text more manageable for processing and ensures better results when creating embeddings and performing retrieval. The chunk size of 1,000 characters with 100-character overlap provides a good balance between context preservation and granularity. ```python from langchain_chroma import Chroma from langchain.embeddings import OpenAIEmbeddings from langchain_community.vectorstores.utils import filter_complex_metadata # Create embeddings for the documents embeddings = OpenAIEmbeddings() # Create a vector store from the loaded documents docs = filter_complex_metadata(docs) vector_store = Chroma.from_documents(docs, embeddings) ``` Moving on, we create a vector store using Chroma and OpenAI embeddings. The vector store enables semantic search and retrieval on our documents. We also filter out complex metadata that could cause storage issues. The final step is building the QA chain using Claude 3.5 Sonnet as the language model: ```python from langchain.chains import RetrievalQA from langchain_anthropic import ChatAnthropic # Initialize the language model llm = ChatAnthropic(model="claude-3-5-sonnet-20240620", streaming=True) # Create a QA chain qa_chain = RetrievalQA.from_chain_type( llm=llm, chain_type="stuff", retriever=vector_store.as_retriever(), ) ``` Now, we can ask questions about our documents: ```python # Example question query = "What is the main topic of the website?" answer = qa_chain.invoke(query) print(answer) ``` ```python { 'query': 'What is the main topic of the website?', 'result': """The main topic of the website is LangChain's integrations with Hugging Face. The page provides an overview of various LangChain components that can be used with Hugging Face models and services, including: 1. Chat models 2. LLMs (Language Models) 3. Embedding models 4. Document loaders 5. Tools The page focuses on showing how to use different Hugging Face functionalities within the LangChain framework, such as embedding models, language models, datasets, and other tools.""" } ``` This section demonstrated a process for building a basic RAG pipeline for content scraped using Firecrawl. For this version, we only used 10 pages from the LangChain documentation. As the volume of information increases, the pipeline would need additional refinement. To scale this pipeline effectively, we would need to consider several factors including: - Chunking strategy optimization - Embedding model selection - Vector store performance tuning - Prompt engineering for larger document collections ## Conclusion Throughout this guide, we’ve explored Firecrawl’s `/crawl` endpoint and its capabilities for web scraping at scale. From basic usage to advanced configurations, we covered URL control, performance optimization, and asynchronous operations. We also examined practical implementations, including data storage solutions and integration with frameworks like LangChain. The endpoint’s ability to handle JavaScript content, pagination, and various output formats makes it a versatile tool for modern web scraping needs. Whether you’re building documentation chatbots or gathering training data, Firecrawl provides a robust foundation. By leveraging the configuration options and best practices discussed, you can build efficient and scalable web scraping solutions tailored to your specific requirements. Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Bex Tuychiev image](https://www.firecrawl.dev/bex.jpg)\\ Bex Tuychiev@bextuychiev](https://x.com/bextuychiev) Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics ### More articles by Bex Tuychiev [Building an Automated Price Tracking Tool\\ \\ Build an automated e-commerce price tracker in Python. Learn web scraping, price monitoring, and automated alerts using Firecrawl, Streamlit, PostgreSQL.](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python) [Web Scraping Automation: How to Run Scrapers on a Schedule\\ \\ Learn how to automate web scraping in Python using free tools like schedule, asyncio, cron jobs and GitHub Actions. This comprehensive guide covers local and cloud-based scheduling methods to run scrapers reliably in 2025.](https://www.firecrawl.dev/blog/automated-web-scraping-free-2025) [Automated Data Collection - A Comprehensive Guide\\ \\ Learn how to build robust automated data collection systems using modern tools and best practices. This guide covers everything from selecting the right tools to implementing scalable collection pipelines.](https://www.firecrawl.dev/blog/automated-data-collection-guide) [BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python\\ \\ Learn the key differences between BeautifulSoup4 and Scrapy for web scraping in Python. Compare their features, performance, and use cases to choose the right tool for your web scraping needs.](https://www.firecrawl.dev/blog/beautifulsoup4-vs-scrapy-comparison) [How to Build an Automated Competitor Price Monitoring System with Python\\ \\ Learn how to build an automated competitor price monitoring system in Python that tracks prices across e-commerce sites, provides real-time comparisons, and maintains price history using Firecrawl, Streamlit, and GitHub Actions.](https://www.firecrawl.dev/blog/automated-competitor-price-scraping) [Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude\\ \\ Learn how to build a web scraper in Python that gathers company details, funding rounds, and investor information from public sources like Crunchbase using Firecrawl and Claude for automated data collection and analysis.](https://www.firecrawl.dev/blog/crunchbase-scraping-with-firecrawl-claude) [How to Create Custom Instruction Datasets for LLM Fine-tuning\\ \\ Learn how to build high-quality instruction datasets for fine-tuning large language models (LLMs). This guide covers when to create custom datasets, best practices for data collection and curation, and a practical example of building a code documentation dataset.](https://www.firecrawl.dev/blog/custom-instruction-datasets-llm-fine-tuning) [Data Enrichment: A Complete Guide to Enhancing Your Data Quality\\ \\ Learn how to enrich your data quality with a comprehensive guide covering data enrichment tools, best practices, and real-world examples. Discover how to leverage modern solutions like Firecrawl to automate data collection, validation, and integration for better business insights.](https://www.firecrawl.dev/blog/complete-guide-to-data-enrichment) ## AI Resume Job Matcher Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Feb 1, 2025 • [![Bex Tuychiev image](https://www.firecrawl.dev/bex.jpg)Bex Tuychiev](https://x.com/bextuychiev) # Building an AI Resume Job Matching App With Firecrawl And Claude ![Building an AI Resume Job Matching App With Firecrawl And Claude image](https://www.firecrawl.dev/images/blog/resume_parser/ai-resume-parser.jpg) ## Introduction Finding the perfect job can feel like searching for a needle in a haystack. As a developer, you might spend hours scanning through job boards, trying to determine if each position matches your skills and experience. What if we could automate this process using AI? In this tutorial, we’ll build a sophisticated job matching system that combines several powerful technologies: - **Firecrawl** for intelligent web scraping of job postings and resume parsing - **Claude 3.5 Sonnet** for job matching analysis - **Supabase** for managing job sources and tracking - **Discord** for when there is a matching job - **Streamlit** for a user-friendly web interface Our application will: 1. Automatically scrape user-provided job boards at regular intervals 2. Parse your resume from a PDF 3. Use AI to evaluate each job posting against your qualifications 4. Send notifications to Discord when strong matches are found 5. Provide a web interface for managing job sources and viewing results ![Screenshot of the AI resume job matching app showing the main interface with job sources sidebar and resume upload section](https://www.firecrawl.dev/blog/images/demo.png) By the end of this tutorial, you’ll have a fully automated job search assistant that runs in the cloud and helps you focus on the opportunities that matter most. Whether you’re actively job hunting or just keeping an eye on the market, this tool will save you countless hours of manual searching and evaluation. If this project sounds interesting, you can start using it straight away by cloning [its GitHub repository](https://github.com/BexTuychiev/ai-resume-job-matching). The local setup instructions are provided in the README. On the other hand, if you want to understand how the different parts of the project work together, continue reading! ## Overview of the App Before diving into the technical details, let’s walk through a typical user journey to understand how the app works. The process starts with the user adding web pages with job listings. Here are examples of acceptable pages: - `https://datacamp.com/jobs` - `https://openai.com/careers/search/` - `https://apply.workable.com/huggingface` As you can probably tell from the example URLs, the app doesn’t work with popular job platforms like Indeed or Glassdoor. This is because these platforms already have sophisticated job matching functionality built into their systems. Instead, this app focuses on company career pages and job boards that don’t offer automated matching - places where you’d normally have to manually review each posting. This allows you to apply the same intelligent matching to opportunities that might otherwise slip through the cracks. Each job listings source is added to a Supabase database under the hood for persistence and displayed in the sidebar (you have the option to delete them). After the user inputs all job sources, they can add their PDF link in the main section of the app. The app uses [Firecrawl](https://firecrawl.dev/), an AI-powered scraping engine that extracts structured data from webpages and PDF documents. To parse resumes, Firecrawl requires a direct file link to the PDF. After parsing the resume, the app crawls all job sources using Firecrawl to gather job listings. Each listing is then analyzed against the resume by Claude to determine compatibility. The UI clearly shows whether a candidate is qualified for each position, along with Claude’s reasoning. For matching jobs, the app automatically sends notifications to the user’s Discord account via a webhook. ![Screenshot of a Discord notification showing a job match alert with details about the position and match analysis](https://www.firecrawl.dev/blog/images/discord_alert.png) The app automatically rechecks all job sources weekly to ensure you never miss a great opportunity. ## The Tech Stack Used in the App Building a reliable job matching system requires careful selection of tools that can handle complex tasks while remaining maintainable and cost-effective. Let’s explore the core technologies that power our application and why each was chosen: ### 1\. [Firecrawl](https://firecrawl.ai/) for AI-powered web scraping At the heart of our job discovery system is Firecrawl, an AI-powered web scraping engine. Unlike traditional scraping libraries that rely on brittle HTML selectors, Firecrawl uses natural language understanding to identify and extract content. This makes it ideal for our use case because: - It can handle diverse job board layouts without custom code for each site - Maintains reliability even when websites update their structure - Automatically bypasses common anti-bot measures - Handles JavaScript-rendered content out of the box - Provides clean, structured data through [Pydantic](https://pydantic.dev/) schemas ### 2\. [Claude 3.5 Sonnet](https://www.anthropic.com/claude) for job matching For the critical task of evaluating job fit, we use Claude 3.5 Sonnet through the [LangChain](https://www.langchain.com/) framework. This AI model excels at understanding both job requirements and candidate qualifications in context. We chose Claude because: - Handles complex job requirements effectively - Offers consistent and reliable evaluations - More cost-effective than GPT-4 for this use case - Integrates seamlessly with LangChain for structured outputs ### 3\. [Supabase](https://supabase.com/) for data management To manage job sources and tracking, we use Supabase as our database backend. This modern database platform offers: - PostgreSQL database with a generous free tier - Real-time capabilities for future features - Simple REST API for database operations - Built-in authentication system - Excellent developer experience with their Python SDK ### 4\. [Discord](https://discord.com/) for Notifications When a matching job is found, our system sends notifications through Discord webhooks. This might seem like an unusual choice, but Discord offers several advantages: - Free and widely adopted - Rich message formatting with embeds - Simple webhook integration - Mobile notifications - Supports dedicated channels for job alerts - Threading for discussions about specific opportunities ### 5\. [Streamlit](https://streamlit.io/) for user interface The web interface is built with Streamlit, a Python framework for data applications. We chose Streamlit because: - It enables rapid development of data-focused interfaces - Provides built-in components for common UI patterns - Handles async operations smoothly - Offers automatic hot-reloading during development - Requires no JavaScript knowledge - Makes deployment straightforward ### 6\. [GitHub Actions](https://github.com/features/actions) for automation To ensure regular job checking, we use GitHub Actions for scheduling. This service provides: - Free scheduling for public repositories - Built-in secret management - Reliable cron scheduling - Easy maintenance and modifications - Integrated version control - Comprehensive logging and monitoring This carefully selected stack provides a robust foundation while keeping costs minimal through generous free tiers. The combination of AI-powered tools (Firecrawl and Claude) with modern infrastructure (Supabase, Discord, GitHub Actions) creates a reliable and scalable job matching system that can grow with your needs. Most importantly, this stack minimizes maintenance overhead - a crucial factor for any automated system. The AI-powered components adapt to changes automatically, while the infrastructure services are managed by their respective providers, letting you focus on finding your next great opportunity rather than maintaining the system. ## Breaking Down the App Components When you look at [the GitHub repository](https://github.com/BexTuychiev/ai-resume-job-matching) of the app, you will see the following file structure: ![GitHub repository file structure showing key files like models.py, scraper.py, and app.py](https://www.firecrawl.dev/blog/images/github_snapshot.png) Several files in the repository serve common purposes that most developers will recognize: - `.gitignore`: Specifies which files Git should ignore when tracking changes - `README.md`: Documentation explaining what the project does and how to use it - `requirements.txt`: Lists all Python package dependencies needed to run the project Let’s examine the remaining Python scripts and understand how they work together to power the application. The explanations will be in a logical order building from foundational elements to higher-level functionality. ### 1\. Core data structures - `src/models.py` At the heart of our job matching system are three Pydantic models that define the core data structures used throughout the application. These models not only provide type safety and validation but also serve as schemas that guide Firecrawl’s AI in extracting structured data from web pages. ```python class Job(BaseModel): title: str = Field(description="Job title") url: str = Field(description="URL of the job posting") company: str = Field(description="Company name") ``` The `Job` model represents an individual job posting with three essential fields: - `title`: The position’s name - `url`: Direct link to the job posting - `company`: Name of the hiring organization This model is used by both the scraper to extract job listings and the Discord notifier to format job match notifications. The `Field` descriptions guide the Firecrawl’s AI to better locate the HTML/CSS components containing the relevant information. ```python class JobSource(BaseModel): url: str = Field(description="URL of the job board") last_checked: Optional[datetime] = Field(description="Last check timestamp") ``` The `JobSource` model tracks job board URLs and when they were last checked: - `url`: The job board’s web address - `last_checked`: Optional timestamp of the last scraping attempt This model is primarily used by the database component to manage job sources and the scheduler to track when sources need to be rechecked. ```python class JobListings(BaseModel): jobs: List[Job] = Field(description="List of job postings") ``` Finally, the `JobListings` model serves as a container for multiple `Job` objects. This model is crucial for the scraper component, as it tells Firecrawl to extract all job postings from a page rather than just the first one it finds. These models form the foundation of our application’s data flow: 1. The scraper uses them to extract structured data from web pages 2. The database uses them to store and retrieve job sources 3. The matcher uses them to process job details 4. The Discord notifier uses them to format notifications By defining these data structures upfront, we ensure consistency throughout the application and make it easier to modify the data model in the future if needed. ### 2\. Database operations - `src/database.py` The database component handles persistence of job sources using Supabase, a PostgreSQL-based backend service. This module provides essential CRUD (Create, Read, Update, Delete) operations for managing job board URLs and their check history. ```python class Database: def __init__(self): url = os.getenv("SUPABASE_URL") key = os.getenv("SUPABASE_KEY") self.client = create_client(url, key) def save_job_source(self, url: str) -> None: """Save a job source to the database""" self.client.table("job_sources").upsert( {"url": url, "last_checked": None} ).execute() ``` The `Database` class initializes a connection to Supabase using environment variables and provides four key methods: 1. `save_job_source`: Adds or updates a job board URL in the database. The `upsert` operation ensures no duplicate entries are created. 2. `delete_job_source`: Removes a job source from tracking: ```python def delete_job_source(self, url: str) -> None: self.client.table("job_sources").delete().eq("url", url).execute() ``` 3. `get_job_sources`: Retrieves all tracked job sources: ```python def get_job_sources(self) -> List[JobSource]: response = self.client.table("job_sources").select("*").execute() return [JobSource(**source) for source in response.data] ``` 4. `update_last_checked`: Updates the timestamp when a source was last checked: ```python def update_last_checked(self, url: str) -> None: self.client.table("job_sources").update({"last_checked": "now()"}).eq( "url", url ).execute() ``` This database component is used by: - The Streamlit interface ( `app.py`) for managing job sources through the sidebar - The scheduler ( `scheduler.py`) for tracking when sources were last checked - The automated GitHub Action workflow for persistent storage between runs By using Supabase, we get a reliable, scalable database with minimal setup and maintenance requirements. The `JobSource` model we defined earlier ensures type safety when working with the database records throughout the application. ### 3\. Scraping with Firecrawl - `src/scraper.py` The scraper component handles all web scraping operations using Firecrawl, an AI-powered scraping engine. This module is responsible for parsing resumes and extracting job listings from various sources. ```python @st.cache_data(show_spinner=False) def _cached_parse_resume(pdf_link: str) -> str: """Cached version of resume parsing""" app = FirecrawlApp() response = app.scrape_url(url=pdf_link) return response["markdown"] class JobScraper: def __init__(self): self.app = FirecrawlApp() ``` The `JobScraper` class initializes a Firecrawl connection and provides three main methods: 1. `parse_resume`: Extracts text content from a PDF resume. Uses Streamlit’s caching to avoid re-parsing the same resume: ```python async def parse_resume(self, pdf_link: str) -> str: """Parse a resume from a PDF link.""" return _cached_parse_resume(pdf_link) ``` 2. `scrape_job_postings`: Batch scrapes multiple job board URLs using the `JobListings` schema to guide Firecrawl’s extraction: ```python async def scrape_job_postings(self, source_urls: list[str]) -> list[Job]: response = self.app.batch_scrape_urls( urls=source_urls, params={ "formats": ["extract"], "extract": { "schema": JobListings.model_json_schema(), "prompt": "Extract information based on the schema provided", }, }, ) jobs = [] for job in response["data"]: jobs.extend(job["extract"]["jobs"]) return [Job(**job) for job in jobs] ``` If you want to understand Firecrawl’s syntax better, refer to our [separate guide on its `/scrape` endpoint](https://www.firecrawl.dev/blog/mastering-firecrawl-scrape-endpoint). 3. `scrape_job_content`: Retrieves the full content of a specific job posting for detailed analysis: ```python async def scrape_job_content(self, job_url: str) -> str: """Scrape the content of a specific job posting.""" response = self.app.scrape_url(url=job_url) return response["markdown"] ``` This entire scraper component is used by: - The Streamlit interface ( `app.py`) for initial resume parsing and job discovery - The scheduler ( `scheduler.py`) for automated periodic job checks - The matcher component for detailed job content analysis The use of Firecrawl’s AI capabilities allows the scraper to handle diverse webpage layouts without custom selectors, while Streamlit’s caching helps optimize performance by avoiding redundant resume parsing. ### 4\. Job matching with Claude - `src/matcher.py` The matcher component uses Claude 3.5 Sonnet through LangChain to evaluate whether a candidate’s resume matches a job posting. This module provides intelligent job fit analysis with structured outputs. ```python class JobMatcher: def __init__(self): self.llm = ChatAnthropic(model="claude-3-5-sonnet-20241022", temperature=0) self.response_schemas = [\ ResponseSchema(\ name="is_match",\ description="Whether the candidate is a good fit for the job (true/false)",\ ),\ ResponseSchema(\ name="reason",\ description="Brief explanation of why the candidate is or isn't a good fit",\ ),\ ] ``` The `JobMatcher` class initializes with two key components: 1. A Claude instance configured for consistent outputs (temperature=0) 2. Response schemas that define the structure of the matching results: - `is_match`: Boolean indicating if the candidate is qualified - `reason`: Explanation of the matching decision ```python self.prompt = ChatPromptTemplate.from_messages([\ (\ "system",\ "You are an expert job interviewer with decades of experience. Analyze the resume and job posting to determine if the candidate is a good fit. Be critical in your assessment and accept only applicants that meet at least 75% of the requirements.",\ ),\ (\ "human",\ """\ Resume:\ {resume}\ \ Job Posting:\ {job_posting}\ \ Determine if this candidate is a good fit and explain why briefly.\ {format_instructions}\ """,\ ),\ ]) self.output_parser = StructuredOutputParser.from_response_schemas( self.response_schemas ) ``` > Note: The system prompt significantly affects how jobs are matched. You can make it more relaxed or strict when evaluating candidates. Use a looser prompt if you want to apply to more jobs, or a stricter one if you’re being more selective. The class also sets up: - A prompt template that positions Claude as an expert interviewer and sets a high bar for matches (75% of requirements) - An output parser that ensures responses follow the defined schema ```python async def evaluate_match(self, resume: str, job_posting: str) -> Dict: """Evaluate if a candidate is a good fit for a job.""" formatted_prompt = self.prompt.format( resume=resume, job_posting=job_posting, format_instructions=self.output_parser.get_format_instructions(), ) response = await self.llm.ainvoke(formatted_prompt) return self.output_parser.parse(response.content) ``` The `evaluate_match` method: 1. Takes a resume and job posting as input 2. Formats the prompt with the provided content 3. Sends the request to Claude 4. Parses and returns the structured response This entire matcher component is used by: - The Streamlit interface ( `app.py`) for real-time job matching - The scheduler ( `scheduler.py`) for automated matching checks - The Discord notifier to determine when to send alerts By using Claude with structured outputs, we ensure consistent and reliable job matching that can be easily integrated into the broader application workflow. ### 5\. Sending notifications with Discord - `src/discord.py` The Discord component handles sending notifications when matching jobs are found. It uses Discord’s webhook functionality to deliver rich, formatted messages about job matches. ```python class DiscordNotifier: def __init__(self): self.webhook_url = os.getenv("DISCORD_WEBHOOK_URL") ``` First, we initialize the notifier with a Discord webhook URL from environment variables. This URL is where all notifications will be sent. ```python async def send_match(self, job: Job, match_reason: str): """Send a job match notification to Discord""" if not self.webhook_url: return webhook = DiscordWebhook(url=self.webhook_url) embed = DiscordEmbed( title=f"🎯 New Job Match Found!", description=f"**{job.title}** at **{job.company}**\n\n{match_reason}", color="5865F2", # Discord's blue color scheme ) ``` The `send_match` method creates the notification: - Takes a `Job` object and the AI’s matching reason as input - Creates a webhook connection to Discord - Builds an embed message with: - An eye-catching title with emoji - Job title and company in bold - The AI’s explanation of why this job matches ```python # Add fields with job details embed.add_embed_field(name="🏢 Company", value=job.company, inline=True) embed.add_embed_field( name="🔗 Job URL", value=f"[Apply Here]({job.url})", inline=True ) webhook.add_embed(embed) webhook.execute() ``` Finally, the method: - Adds structured fields for company and job URL - Uses emojis for visual appeal - Creates a clickable “Apply Here” link - Sends the formatted message to Discord This component is used by: - The matcher component when a job match is found - The scheduler for automated notifications - The Streamlit interface for real-time match alerts The use of Discord embeds provides a clean, professional look for notifications while making it easy for users to access job details and apply links directly from the message. ### 6\. Automated source checking script - `src/scheduler.py` The scheduler component handles automated periodic checking of job sources, coordinating between all other components to continuously monitor for new matching positions. ```python class JobScheduler: def __init__(self): self.scraper = JobScraper() self.matcher = JobMatcher() self.notifier = DiscordNotifier() self.db = Database() self.resume_url = os.getenv("RESUME_URL") self.check_interval = int(os.getenv("CHECK_INTERVAL_MINUTES", "15")) self.processed_jobs = set() logger.info(f"Initialized scheduler with {self.check_interval} minute interval") ``` The `JobScheduler` class initializes with: - All necessary components (scraper, matcher, notifier, database) - Resume URL from environment variables - Configurable check interval (defaults to 15 minutes) - A set to track processed jobs and avoid duplicates - Logging setup for monitoring operations ```python async def process_source(self, source): """Process a single job source""" try: logger.info(f"Processing source: {source.url}") # Parse resume resume_content = await self.scraper.parse_resume(self.resume_url) # Get jobs from source jobs = await self.scraper.scrape_job_postings([source.url]) logger.info(f"Found {len(jobs)} jobs from {source.url}") ``` The `process_source` method starts by: - Logging the current operation - Parsing the user’s resume - Scraping all jobs from the given source ```python # Process new jobs for job in jobs: if job.url in self.processed_jobs: logger.debug(f"Skipping already processed job: {job.url}") continue job_content = await self.scraper.scrape_job_content(job.url) result = await self.matcher.evaluate_match(resume_content, job_content) if result["is_match"]: logger.info(f"Found match: {job.title} at {job.company}") await self.notifier.send_match(job, result["reason"]) self.processed_jobs.add(job.url) ``` For each job found, it: - Skips if already processed - Scrapes the full job description - Evaluates the match against the resume - Sends a Discord notification if it’s a match - Marks the job as processed ```python async def run(self): """Main scheduling loop""" logger.info("Starting job scheduler...") while True: try: sources = self.db.get_job_sources() logger.info(f"Found {len(sources)} job sources") ``` The `run` method starts the main loop by: - Getting all job sources from the database - Logging the number of sources found ```python for source in sources: if not source.last_checked or ( datetime.utcnow() - source.last_checked > timedelta(minutes=self.check_interval) ): await self.process_source(source) else: logger.debug( f"Skipping {source.url}, next check in " f"{(source.last_checked + timedelta(minutes=self.check_interval) - datetime.utcnow()).total_seconds() / 60:.1f} minutes" ) await asyncio.sleep(60) # Check every minute ``` For each source, it: - Checks if it needs processing (never checked or interval elapsed) - Processes the source if needed - Logs skipped sources with time until next check - Waits a minute before the next iteration ```python except Exception as e: logger.error(f"Scheduler error: {str(e)}") await asyncio.sleep(60) ``` Error handling: - Catches and logs any exceptions - Waits a minute before retrying - Ensures the scheduler keeps running despite errors This component is used by: - The GitHub Actions workflow for automated checks - The command-line interface for manual checks - The logging system for monitoring and debugging The extensive logging helps track operations and diagnose issues, while the modular design allows for easy maintenance and updates. ### 7\. User interface with Streamlit - `app.py` The Streamlit interface provides a user-friendly way to manage job sources and run manual job matching checks. Let’s break down each component: 1. First, we set up the necessary imports and helper functions: ```python import streamlit as st import asyncio from dotenv import load_dotenv from src.scraper import JobScraper from src.matcher import JobMatcher from src.discord import DiscordNotifier from src.database import Database load_dotenv() async def process_job(scraper, matcher, notifier, job, resume_content): """Process a single job posting""" job_content = await scraper.scrape_job_content(job.url) result = await matcher.evaluate_match(resume_content, job_content) if result["is_match"]: await notifier.send_match(job, result["reason"]) return job, result ``` The `process_job` function handles the core job matching logic for a single posting: 1. Scrapes the full job content using the provided URL 2. Evaluates if the resume matches the job requirements 3. Sends a notification if there’s a match 4. Returns both the job and match result for further processing 5. The main application setup and sidebar for managing job sources: ```python async def main(): st.title("Resume Parser and Job Matcher") # Initialize services scraper = JobScraper() matcher = JobMatcher() notifier = DiscordNotifier() db = Database() # Sidebar for managing job sources with st.sidebar: st.header("Manage Job Sources") new_source = st.text_input("Add Job Source URL") if st.button("Add Source"): db.save_job_source(new_source) st.success("Job source added!") ``` The `main()` function sets up the core Streamlit application interface: 1. Creates a title for the app 2. Initializes the key services (scraper, matcher, notifier, database) 3. Adds a sidebar with controls for managing job source URLs 4. Provides a text input and button to add new job sources 5. Saves valid sources to the database The sidebar allows users to maintain a list of job boards and company career pages to monitor for new postings. 3. The source management interface: ```python # List and delete existing sources st.subheader("Current Sources") for source in db.get_job_sources(): col1, col2 = st.columns([3, 1]) with col1: st.text(source.url) with col2: if st.button("Delete", key=source.url): db.delete_job_source(source.url) st.rerun() ``` This section displays the list of current job sources and provides delete functionality: 1. Shows a “Current Sources” subheader 2. Iterates through all sources from the database 3. Creates a two-column layout for each source 4. First column shows the source URL 5. Second column has a delete button 6. When delete is clicked, removes the source and refreshes the page The delete functionality helps users maintain their source list by removing outdated or unwanted job boards. The `rerun()` call ensures the UI updates immediately after deletion. 4. The main content area with instructions and resume input: ```python st.markdown( """ This app helps you find matching jobs by: - Analyzing your resume from a PDF URL - Scraping job postings from your saved job sources - Using AI to evaluate if you're a good fit for each position Simply paste your resume URL below to get started! """ ) resume_url = st.text_input( "**Enter Resume PDF URL**", placeholder="https://www.website.com/resume.pdf", ) ``` 5. The job analysis workflow: ```python if st.button("Analyze") and resume_url: with st.spinner("Parsing resume..."): resume_content = await scraper.parse_resume(resume_url) sources = db.get_job_sources() if not sources: st.warning("No job sources configured. Add some in the sidebar!") return with st.spinner("Scraping job postings..."): jobs = await scraper.scrape_job_postings([s.url for s in sources]) ``` 6. Parallel job processing and results display: ```python with st.spinner(f"Analyzing {len(jobs)} jobs..."): tasks = [] for job in jobs: task = process_job(scraper, matcher, notifier, job, resume_content) tasks.append(task) for coro in asyncio.as_completed(tasks): job, result = await coro st.subheader(f"Job: {job.title}") st.write(f"URL: {job.url}") st.write(f"Match: {'✅' if result['is_match'] else '❌'}") st.write(f"Reason: {result['reason']}") st.divider() st.success(f"Analysis complete! Processed {len(jobs)} jobs.") ``` This section creates tasks to analyze multiple jobs simultaneously by comparing them against the user’s resume. As each analysis completes, it displays the results including job title, URL, match status and reasoning. The parallel approach makes the processing more efficient than analyzing jobs one at a time. The interface provides: - A sidebar for managing job sources - Clear instructions for users - Real-time feedback during processing - Visual indicators for matches (✅) and non-matches (❌) - Detailed explanations for each job evaluation - Parallel processing for better performance This component ties together all the backend services into a user-friendly interface that makes it easy to manage job sources and run manual checks. ### 8\. GitHub Actions workflow - `.github/workflows/scheduler.yml` The GitHub Actions workflow automates the job checking process by running the scheduler at regular intervals. Let’s break down the configuration: 1. First, we define the workflow name and triggers: ```yaml name: Job Matcher Scheduler on: push: branches: [main] schedule: - cron: "0 0 * * 1" # Run every Monday at midnight ``` This configuration: - Names the workflow “Job Matcher Scheduler” - Triggers on pushes to the main branch (for testing) - Runs automatically every Monday at midnight using cron syntax - 0: Minute (0) - 0: Hour (0 = midnight) - \*: Day of month (any) - \*: Month (any) - 1: Day of week (1 = Monday) 2. Define the job and its environment: ```yaml jobs: check-jobs: runs-on: ubuntu-latest steps: - uses: actions/checkout@v2 - name: Set up Python uses: actions/setup-python@v2 with: python-version: "3.10" ``` This section: - Creates a job named “check-jobs” - Uses the latest Ubuntu runner - Checks out the repository code - Sets up Python 3.10 3. Install dependencies: ```yaml - name: Install dependencies run: | python -m pip install --upgrade pip pip install -r requirements.txt ``` 4. Set up environment variables and run the scheduler: ```yml - name: Run job checker env: FIRECRAWL_API_KEY: ${{ secrets.FIRECRAWL_API_KEY }} ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} DISCORD_WEBHOOK_URL: ${{ secrets.DISCORD_WEBHOOK_URL }} RESUME_URL: ${{ secrets.RESUME_URL }} SUPABASE_URL: ${{ secrets.SUPABASE_URL }} SUPABASE_KEY: ${{ secrets.SUPABASE_KEY }} CHECK_INTERVAL_MINUTES: 15 run: | python -m src.scheduler ``` This final step: - Sets up all necessary environment variables from GitHub Secrets - Configures the check interval - Runs the scheduler script The workflow provides: - Automated weekly job checks - Secure handling of sensitive credentials - Consistent environment for running checks - Detailed logs of each run - Easy modification of the schedule To use this workflow, you need to: 1. Add all required secrets to your GitHub repository 2. Ensure your repository is public (for free GitHub Actions minutes) 3. Verify the workflow is enabled in your Actions tab The weekly schedule helps stay within GitHub’s free tier limits while still regularly checking for new opportunities. ## Conclusion We’ve built a powerful automated job matching system that combines several modern technologies into a cohesive solution. By integrating Firecrawl for web scraping, Claude AI for intelligent matching, Discord for notifications, GitHub Actions for scheduling, and Supabase for storage, we’ve created a practical tool that automates the tedious parts of job searching. This allows job seekers to focus their energy on more important tasks like preparing for interviews and improving their skills. ### Next Steps The modular design of this system opens up many possibilities for future enhancements. You could expand support to additional job boards, implement more sophisticated matching algorithms, or add alternative notification methods like email. Consider building a mobile interface or adding analytics to track your application success rates. The foundation we’ve built makes it easy to adapt and enhance the system as your needs evolve. Feel free to fork the repository and customize it to match your specific job search requirements. Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Bex Tuychiev image](https://www.firecrawl.dev/bex.jpg)\\ Bex Tuychiev@bextuychiev](https://x.com/bextuychiev) Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics ### More articles by Bex Tuychiev [Building an Automated Price Tracking Tool\\ \\ Build an automated e-commerce price tracker in Python. Learn web scraping, price monitoring, and automated alerts using Firecrawl, Streamlit, PostgreSQL.](https://www.firecrawl.dev/blog/automated-price-tracking-tutorial-python) [Web Scraping Automation: How to Run Scrapers on a Schedule\\ \\ Learn how to automate web scraping in Python using free tools like schedule, asyncio, cron jobs and GitHub Actions. This comprehensive guide covers local and cloud-based scheduling methods to run scrapers reliably in 2025.](https://www.firecrawl.dev/blog/automated-web-scraping-free-2025) [Automated Data Collection - A Comprehensive Guide\\ \\ Learn how to build robust automated data collection systems using modern tools and best practices. This guide covers everything from selecting the right tools to implementing scalable collection pipelines.](https://www.firecrawl.dev/blog/automated-data-collection-guide) [BeautifulSoup4 vs. Scrapy - A Comprehensive Comparison for Web Scraping in Python\\ \\ Learn the key differences between BeautifulSoup4 and Scrapy for web scraping in Python. Compare their features, performance, and use cases to choose the right tool for your web scraping needs.](https://www.firecrawl.dev/blog/beautifulsoup4-vs-scrapy-comparison) [How to Build an Automated Competitor Price Monitoring System with Python\\ \\ Learn how to build an automated competitor price monitoring system in Python that tracks prices across e-commerce sites, provides real-time comparisons, and maintains price history using Firecrawl, Streamlit, and GitHub Actions.](https://www.firecrawl.dev/blog/automated-competitor-price-scraping) [Scraping Company Data and Funding Information in Bulk With Firecrawl and Claude\\ \\ Learn how to build a web scraper in Python that gathers company details, funding rounds, and investor information from public sources like Crunchbase using Firecrawl and Claude for automated data collection and analysis.](https://www.firecrawl.dev/blog/crunchbase-scraping-with-firecrawl-claude) [How to Create Custom Instruction Datasets for LLM Fine-tuning\\ \\ Learn how to build high-quality instruction datasets for fine-tuning large language models (LLMs). This guide covers when to create custom datasets, best practices for data collection and curation, and a practical example of building a code documentation dataset.](https://www.firecrawl.dev/blog/custom-instruction-datasets-llm-fine-tuning) [Data Enrichment: A Complete Guide to Enhancing Your Data Quality\\ \\ Learn how to enrich your data quality with a comprehensive guide covering data enrichment tools, best practices, and real-world examples. Discover how to leverage modern solutions like Firecrawl to automate data collection, validation, and integration for better business insights.](https://www.firecrawl.dev/blog/complete-guide-to-data-enrichment) ## Gamma Onboarding Supercharged Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Aug 8, 2024 • [![Jon Noronha image](https://www.firecrawl.dev/customers/jon-noronha.jpg)Jon Noronha](https://x.com/thatsjonsense) # How Gamma Supercharges Onboarding with Firecrawl ![How Gamma Supercharges Onboarding with Firecrawl image](https://www.firecrawl.dev/images/blog/customer-story-gamma.jpg) At [Gamma](https://gamma.app/), we recently launched Gamma Sites, which allows anyone to build a website as easily as writing a doc. To showcase the power of our platform, we wanted to transform existing sites into the Gamma format. That’s where Firecrawl came in. Not only did Firecrawl enable us to import existing web pages, but it also unlocked a new input for our AI presentation generator. Now, users can pull in a blog post, Notion page, or other online document and convert it into a presentation effortlessly. Integrating Firecrawl into our production environment was a breeze. We already use markdown internally, so it was just a matter of plugging in the Firecrawl API, feeding it a URL, and getting clean markdown in return. The simplicity of scraping out all the extraneous content and retrieving just the text and images is what we would miss the most if we had to stop using Firecrawl. Throughout the integration process, the support from the Firecrawl team was outstanding. They were quick to respond to our feature requests and ensured a smooth experience. Article updated recently ## About the Author [![Jon Noronha image](https://www.firecrawl.dev/customers/jon-noronha.jpg)\\ Jon Noronha@thatsjonsense](https://x.com/thatsjonsense) Jon Noronha is the founder of Gamma, building the anti-Powerpoint. He is also the Ex VP of Product at Optimizely. [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## cURL Authentication Guide Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Dec 13, 2024 • [![Rudrank Riyam image](https://www.firecrawl.dev/rudrank.jpg)Rudrank Riyam](https://x.com/rudrankriyam) # A Complete Guide Scraping Authenticated Websites with cURL and Firecrawl ![A Complete Guide Scraping Authenticated Websites with cURL and Firecrawl image](https://www.firecrawl.dev/images/blog/complete-guide-to-curl-authentication-firecrawl-api.jpg) Scraping authenticated websites is often a key requirement for developers and data analysts. While many graphical tools exist, using **cURL**, a powerful command-line utility, gives you granular control over HTTP requests. Coupled with **Firecrawl**, a scraping API that can handle dynamic browser interactions and complex authentication flows, you can seamlessly extract data from behind login forms, protected dashboards, and other restricted content. Before we get started, we only recommend scraping behind authentication if you have permission from the resources owner. In this guide, we’ll first introduce cURL and common authentication methods. Then, we’ll show how to combine these approaches with Firecrawl’s API, enabling you to scrape authenticated pages that would otherwise be challenging to access. You’ll learn everything from basic authentication to custom headers, bearer tokens, cookies, and even multi-step logins using Firecrawl’s action sequences. ## What is cURL? **cURL (Client URL)** is a command-line tool for transferring data using various network protocols, commonly HTTP and HTTPS. It’s usually pre-installed on Unix-like systems (macOS, Linux) and easily available for Windows. With cURL, you can quickly test APIs, debug endpoints, and automate repetitive tasks. Check if cURL is installed by running: ```bash curl --version ``` If installed, you’ll see version details. If not, follow your operating system’s instructions to install it. cURL is lightweight and script-friendly—an excellent choice for integrating with tools like Firecrawl. With cURL at your fingertips, you can seamlessly orchestrate authenticated scraping sessions by combining cURL’s request capabilities with Firecrawl’s browser-powered scraping. ## Why Use Firecrawl for Authenticated Scraping? **Firecrawl** is an API designed for scraping websites that might be hard to handle with a simple HTTP client. While cURL can handle direct requests, Firecrawl provides the ability to: - Interact with websites that require JavaScript execution. - Navigate multiple steps of login forms. - Manage cookies, headers, and tokens easily. - Extract content in structured formats like Markdown or JSON. By pairing cURL’s command-line power with Firecrawl’s scraping engine, you can handle complex authentication scenarios—like logging into a site with a username/password form, or including custom headers and tokens—that would be difficult to script using cURL alone. ## Authentication Methods Authenticated scraping means you must prove your identity or authorization to the target server before accessing protected content. Common methods include: 1. **Basic Authentication** 2. **Bearer Token (OAuth 2.0)** 3. **Custom Header Authentication** 4. **Cookie-Based (Session) Authentication** We’ll look at each method in the context of cURL, and then integrate them with Firecrawl for real-world scraping scenarios. ### 1\. Basic Authentication **Basic Auth** sends a username and password encoded in Base64 with each request. It’s simple but should always be used over HTTPS to protect credentials. **cURL Syntax:** ```bash curl -u username:password https://api.example.com/securedata ``` For APIs requiring only an API key (as username): ```bash curl -u my_api_key: https://api.example.com/data ``` **With Firecrawl:** If Firecrawl’s endpoint itself requires Basic Auth (or if the site you’re scraping uses Basic Auth), you can include this in your request: ```bash curl -u YOUR_API_KEY: https://api.firecrawl.dev/v1/scrape ``` This authenticates you to the Firecrawl API using Basic Auth, and you can then direct Firecrawl to scrape authenticated targets. ### 2\. Bearer Token Authentication (OAuth 2.0) **Bearer Tokens** (often from OAuth 2.0 flows) are secure, time-limited keys that you include in the `Authorization` header. **cURL Syntax:** ```bash curl -H "Authorization: Bearer YOUR_TOKEN" https://api.example.com/profile ``` **With Firecrawl:** To scrape a site requiring a bearer token, you can instruct Firecrawl to use it: ```bash curl -X POST https://api.firecrawl.dev/v1/scrape \ -H "Authorization: Bearer fc_your_api_key_here" \ -H "Content-Type: application/json" \ -d '{ "url": "https://example.com", "formats": ["markdown"] }' ``` Here, `fc_your_api_key_here` is your Firecrawl API token. Firecrawl will handle the scraping behind the scenes, and you can also add target-specific headers or actions if needed. ### 3\. Custom Header Authentication Some APIs require custom headers for authentication (e.g., `X-API-Key: value`). These headers are sent alongside requests to prove authorization. **cURL Syntax:** ```bash curl -H "X-API-Key: your_api_key_here" https://api.example.com/data ``` **With Firecrawl:** To scrape a page requiring a custom header, just include it in the POST data: ```bash curl -X POST https://api.firecrawl.dev/v1/scrape \ -H "Authorization: Bearer YOUR_FIRECRAWL_KEY" \ -H "Content-Type: application/json" \ -d '{ "url": "https://protected.example.com", "headers": { "X-Custom-Auth": "token123" } }' ``` Firecrawl will use the custom header `X-Custom-Auth` when loading the page. ### 4\. Cookie-Based Authentication Websites often rely on sessions and cookies for authentication. After logging in via a form, a cookie is set, allowing subsequent authenticated requests. **cURL for Cookie Handling:** Save cookies after login: ```bash curl -c cookies.txt -X POST https://example.com/login \ -d "username=yourusername&password=yourpassword" ``` Use these cookies for subsequent requests: ```bash curl -b cookies.txt https://example.com/protected ``` **With Firecrawl:** If you need to scrape a protected page that uses cookies for authentication, you can first obtain the cookies using cURL, then pass them to Firecrawl: 1. **Obtain Cookies:** ```bash curl -c cookies.txt -X POST https://example.com/login \ -d "username=yourusername&password=yourpassword" ``` 2. **Use Cookies with Firecrawl:** ```bash curl -b cookies.txt -X POST https://api.firecrawl.dev/v1/scrape \ -H "Authorization: Bearer YOUR_FIRECRAWL_KEY" \ -H "Content-Type: application/json" \ -d '{ "url": "https://example.com/protected", "formats": ["markdown"] }' ``` Firecrawl will then request the protected URL using the cookies you’ve supplied. ## Real-World Examples ### GitHub API GitHub’s API supports token-based auth: ```bash curl -H "Authorization: token ghp_YOUR_TOKEN" https://api.github.com/user/repos ``` Scraping authenticated GitHub pages (like private profiles) with Firecrawl: ```bash curl -X POST https://api.firecrawl.dev/v1/scrape \ -H "Authorization: Bearer YOUR_FIRECRAWL_KEY" \ -H "Content-Type: application/json" \ -d '{ "url": "https://github.com/settings/profile", "headers": { "Cookie": "user_session=YOUR_SESSION_COOKIE; tz=UTC", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)" } }' ``` ### Dev.to Authentication Dev.to uses API keys as headers: ```bash curl -H "api-key: YOUR_DEV_TO_API_KEY" https://dev.to/api/articles/me ``` To scrape behind login forms, leverage Firecrawl actions: ```bash curl -X POST https://api.firecrawl.dev/v1/scrape \ -H "Authorization: Bearer YOUR_FIRECRAWL_KEY" \ -H "Content-Type: application/json" \ -d '{ "url": "https://dev.to/enter", "actions": [\ {"type": "wait", "milliseconds": 2000},\ {"type": "click", "selector": "input[type=email]"},\ {"type": "write", "text": "your@email.com"},\ {"type": "click", "selector": "input[type=password]"},\ {"type": "write", "text": "your_password"},\ {"type": "click", "selector": "button[type=submit]"},\ {"type": "wait", "milliseconds": 3000},\ {"type": "navigate", "url": "https://dev.to/dashboard"},\ {"type": "scrape"}\ ] }' ``` Firecrawl can interact with the page dynamically, just like a browser, to submit forms and then scrape the resulting authenticated content. ## Conclusion When combined, **cURL and Firecrawl** provide a powerful toolkit for scraping authenticated websites. cURL’s flexibility in handling HTTP requests pairs perfectly with Firecrawl’s ability to navigate, interact, and extract data from pages that require authentication. Whether you need to pass API keys in headers, handle OAuth tokens, emulate sessions, or fill out login forms, these tools make the process efficient and repeatable. Try the examples provided, check out [Firecrawl’s documentation](https://docs.firecrawl.dev/introduction) for more advanced use cases, and start confidently scraping authenticated websites today! **Happy cURLing and Firecrawling!** Article updated recently ## About the Author [![Rudrank Riyam image](https://www.firecrawl.dev/rudrank.jpg)\\ Rudrank Riyam@rudrankriyam](https://x.com/rudrankriyam) Rudrank Riyam is a Technical Writer & Author. [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## Grok-2 Setup Guide Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Oct 21, 2024 • [![Nicolas Camara image](https://www.firecrawl.dev/nick-img.jpeg)Nicolas Camara](https://x.com/nickscamara_) # Getting Started with Grok-2: Setup and Web Crawler Example ![Getting Started with Grok-2: Setup and Web Crawler Example image](https://www.firecrawl.dev/images/blog/grok-2-web-crawler.jpg) Grok-2, the latest language model from x.ai, brings advanced language understanding capabilities to developers, enabling the creation of intelligent applications with ease. In this tutorial, we’ll walk you through setting up Grok-2, obtaining an API key, and then building a web crawler using Firecrawl to extract structured data from any website. ## Part 1: Setting Up Grok-2 Before diving into coding, we need to set up Grok-2 and get an API key. ### Step 1: Sign Up for an x.ai Account To access the Grok-2 API, you’ll need an x.ai account. 1. **Visit the Sign-Up Page:** Go to [x.ai Sign-Up](https://accounts.x.ai/sign-up?redirect=cloud-console). 2. **Register:** Fill out the registration form with your email and create a password. 3. **Verify Your Email:** Check your inbox for a verification email from x.ai and click the link to verify your account. ### Step 2: Fund Your Account To use the Grok-2 API, your account must have funds. 1. **Access the Cloud Console:** After logging in, you’ll be directed to the x.ai Cloud Console. 2. **Navigate to Billing:** Click on the **Billing** tab in the sidebar. 3. **Add Payment Method:** Provide your payment details to add credits to your account. ### Step 3: Obtain Your API Key With your account funded, you can now generate an API key. 1. **Go to API Keys:** Click on the **API Keys** tab in the Cloud Console. 2. **Create a New API Key:** Click on **Create New API Key** and give it a descriptive name. 3. **Copy Your API Key:** Make sure to copy your API key now, as it won’t be displayed again for security reasons. _Note: Keep your API key secure and do not share it publicly._ ## Part 2: Building a Web Crawler with Grok-2 and Firecrawl Now that Grok-2 is set up, let’s build a web crawler to extract structured data from websites. ### Prerequisites - **Python 3.6+** - **Firecrawl Python Library** - **Requests Library** - **dotenv Library** Install the required packages: ```bash pip install firecrawl-py requests python-dotenv ``` ### Step 1: Set Up Environment Variables Create a `.env` file in your project directory to store your API keys securely. ```env GROK_API_KEY=your_grok_api_key FIRECRAWL_API_KEY=your_firecrawl_api_key ``` _Replace `your_grok_api_key` and `your_firecrawl_api_key` with your actual API keys._ ### Step 2: Initialize Your Script Create a new Python script (e.g., `web_crawler.py`) and start by importing the necessary libraries and loading your environment variables. ```python import os import json import requests from dotenv import load_dotenv from firecrawl import FirecrawlApp # Load environment variables from .env file load_dotenv() # Retrieve API keys grok_api_key = os.getenv("GROK_API_KEY") firecrawl_api_key = os.getenv("FIRECRAWL_API_KEY") # Initialize FirecrawlApp app = FirecrawlApp(api_key=firecrawl_api_key) ``` ### Step 3: Define the Grok-2 API Interaction Function We need a function to interact with the Grok-2 API. ```python def grok_completion(prompt): url = "https://api.x.ai/v1/chat/completions" headers = { "Content-Type": "application/json", "Authorization": f"Bearer {grok_api_key}" } data = { "messages": [\ {"role": "system", "content": "You are a helpful assistant."},\ {"role": "user", "content": prompt}\ ], "model": "grok-2", "stream": False, "temperature": 0 } response = requests.post(url, headers=headers, json=data) response_data = response.json() return response_data['choices'][0]['message']['content'] ``` ### Step 4: Identify Relevant Pages on the Website Define a function to find pages related to our objective. ```python def find_relevant_pages(objective, url): prompt = f"Based on the objective '{objective}', suggest a 1-2 word search term to locate relevant information on the website." search_term = grok_completion(prompt).strip() map_result = app.map_url(url, params={"search": search_term}) return map_result.get("links", []) ``` ### Step 5: Extract Data from the Pages Create a function to scrape the pages and extract the required data. ```python def extract_data_from_pages(links, objective): for link in links[:3]: # Limit to top 3 links scrape_result = app.scrape_url(link, params={'formats': ['markdown']}) content = scrape_result.get('markdown', '') prompt = f"""Given the following content, extract the information related to the objective '{objective}' in JSON format. If not found, reply 'Objective not met'. Content: {content} Remember: - Only return JSON if the objective is met. - Do not include any extra text. """ result = grok_completion(prompt).strip() if result != "Objective not met": try: data = json.loads(result) return data except json.JSONDecodeError: continue # Try the next link if JSON parsing fails return None ``` ### Step 6: Implement the Main Function Combine everything into a main function. ```python def main(): url = input("Enter the website URL to crawl: ") objective = input("Enter your data extraction objective: ") print("\nFinding relevant pages...") links = find_relevant_pages(objective, url) if not links: print("No relevant pages found.") return print("Extracting data from pages...") data = extract_data_from_pages(links, objective) if data: print("\nData extracted successfully:") print(json.dumps(data, indent=2)) else: print("Could not find data matching the objective.") if __name__ == "__main__": main() ``` ### Step 7: Run the Script Save your script and run it from the command line. ```bash python web_crawler.py ``` **Example Interaction:** ``` Enter the website URL to crawl: https://example.com Enter your data extraction objective: Retrieve the list of services offered. Finding relevant pages... Extracting data from pages... Data extracted successfully: { "services": [\ "Web Development",\ "SEO Optimization",\ "Digital Marketing"\ ] } ``` ## Conclusion In this tutorial, we’ve successfully set up Grok-2, obtained an API key, and built a web crawler using Firecrawl. This powerful combination allows you to automate the process of extracting structured data from websites, making it a valuable tool for various applications. ## Next Steps - **Explore More Features:** Check out the Grok-2 and Firecrawl documentation to learn about additional functionalities. - **Enhance Error Handling:** Improve the script with better error handling and logging. - **Customize Data Extraction:** Modify the extraction logic to suit different objectives or data types. ## References - [x.ai Grok-2 API Documentation](https://docs.x.ai/docs) - [Firecrawl Python Library Documentation](https://docs.firecrawl.dev/sdks/python) - [x.ai Cloud Console](https://accounts.x.ai/cloud-console) - [GitHub Repository with Full Code](https://github.com/mendableai/firecrawl/tree/main/examples/grok_web_crawler) Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Nicolas Camara image](https://www.firecrawl.dev/nick-img.jpeg)\\ Nicolas Camara@nickscamara\_](https://x.com/nickscamara_) Nicolas Camara is the Chief Technology Officer (CTO) at Firecrawl. He previously built and scaled Mendable, one of the pioneering "chat with your documents" apps, which had major Fortune 500 customers like Snapchat, Coinbase, and MongoDB. Prior to that, Nicolas built SideGuide, the first code-learning tool inside VS Code, and grew a community of 50,000 users. Nicolas studied Computer Science and has over 10 years of experience in building software. ### More articles by Nicolas Camara [Using OpenAI's Realtime API and Firecrawl to Talk with Any Website\\ \\ Build a real-time conversational agent that interacts with any website using OpenAI's Realtime API and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Talk-with-Any-Website-Using-OpenAIs-Realtime-API-and-Firecrawl) [Extract website data using LLMs\\ \\ Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code.](https://www.firecrawl.dev/blog/data-extraction-using-llms) [Getting Started with Grok-2: Setup and Web Crawler Example\\ \\ A detailed guide on setting up Grok-2 and building a web crawler using Firecrawl.](https://www.firecrawl.dev/blog/grok-2-setup-and-web-crawler-example) [Launch Week I / Day 6: LLM Extract (v1)\\ \\ Extract structured data from your web pages using the extract format in /scrape.](https://www.firecrawl.dev/blog/launch-week-i-day-6-llm-extract) [Launch Week I / Day 7: Crawl Webhooks (v1)\\ \\ New /crawl webhook support. Send notifications to your apps during a crawl.](https://www.firecrawl.dev/blog/launch-week-i-day-7-webhooks) [OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website\\ \\ A guide to building a multi-agent system using OpenAI Swarm and Firecrawl for AI-driven marketing strategies](https://www.firecrawl.dev/blog/openai-swarm-agent-tutorial) [Build a 'Chat with website' using Groq Llama 3\\ \\ Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.](https://www.firecrawl.dev/blog/chat-with-website) [Scrape and Analyze Airbnb Data with Firecrawl and E2B\\ \\ Learn how to scrape and analyze Airbnb data using Firecrawl and E2B in a few lines of code.](https://www.firecrawl.dev/blog/scrape-analyze-airbnb-data-with-e2b) ## Install BeautifulSoup Easily Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Aug 9, 2024 • [![Eric Ciarla image](https://www.firecrawl.dev/eric-img.jpeg)Eric Ciarla](https://x.com/ericciarla) # How to quickly install BeautifulSoup with Python [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library for pulling data out of HTML and XML files. It provides simple methods for navigating, searching, and modifying the parse tree, saving you hours of work. Beautiful Soup is great for web scraping projects where you need to extract specific pieces of information from web pages. Some common use cases for BeautifulSoup include extracting article text or metadata from news sites, scraping product details and pricing from e-commerce stores, gathering data for machine learning datasets, and more. In this tutorial, we’ll walk through several ways to get BeautifulSoup installed on your system and show you some basic usage examples to get started. ## Installing BeautifulSoup There are a few different ways you can install BeautifulSoup depending on your Python environment and preferences. ### Using pip The recommended way to install BeautifulSoup is with pip: ```bash python -m pip install beautifulsoup4 ``` This will install the latest version of BeautifulSoup 4. Make sure you have a recent version of Python (3.6+) and pip. ### Using conda If you’re using the Anaconda Python distribution, you can install BeautifulSoup from the conda-forge channel: ```bash conda install -c conda-forge beautifulsoup4 ``` ### In a virtual environment It’s good practice to install Python packages in an isolated virtual environment for each project. You can set up BeautifulSoup in a new virtual environment like this: ```bash python -m venv bsenv source bsenv/bin/activate # On Windows, use `bsenv\Scripts\activate` pip install beautifulsoup4 ``` ## Troubleshooting Here are a few things to check if you run into issues installing BeautifulSoup: - Make sure your Python version is 3.6 or higher - Upgrade pip to the latest version: `python -m pip install --upgrade pip` - If using conda, ensure your Anaconda installation is up-to-date - Verify you have proper permissions to install packages. Use `sudo` or run the command prompt as an administrator if needed. Check the BeautifulSoup documentation or post on Stack Overflow if you need further assistance. ## Usage Examples Let’s look at a couple quick examples of how to use BeautifulSoup once you have it installed. ### Parsing HTML Here’s how you can use BeautifulSoup to parse HTML retrieved from a web page: ```python from bs4 import BeautifulSoup import requests url = "https://mendable.ai" response = requests.get(url) soup = BeautifulSoup(response.text, 'html.parser') print(soup.title.text) # 'Example Domain' ``` We use the requests library to fetch the HTML from a URL, then pass it to BeautifulSoup to parse. This allows us to navigate and search the HTML using methods like `find()` and `select()`. ### Extracting Data BeautifulSoup makes it easy to extract data buried deep within nested HTML tags. For example, to get all the links from a page: ```python links = soup.find_all('a') for link in links: print(link.get('href')) # 'https://www.firecrawl.dev/' ``` The `find_all()` method retrieves all `` tag elements. We can then iterate through them and access attributes like the `href` URL using `get()`. By chaining together `find()` and `select()` methods, you can precisely target elements and attributes to scrape from the messiest of HTML pages. BeautifulSoup is an indispensable tool for any Python web scraping project. For more advanced web scraping projects, consider using a dedicated scraping service like [Firecrawl](https://firecrawl.dev/). Firecrawl takes care of the tedious parts of web scraping, like proxy rotation, JavaScript rendering, and avoiding detection, allowing you to focus your efforts on working with the data itself. Check out the [Python SDK](https://docs.firecrawl.dev/sdks/python) here. ## References - BeautifulSoup documentation: [https://www.crummy.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) - Real Python’s BeautifulSoup Tutorial: [https://realpython.com/beautiful-soup-web-scraper-python/](https://realpython.com/beautiful-soup-web-scraper-python/) - Firecrawl web scraping service: [https://firecrawl.dev/](https://firecrawl.dev/) Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Eric Ciarla image](https://www.firecrawl.dev/eric-img.jpeg)\\ Eric Ciarla@ericciarla](https://x.com/ericciarla) Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB. Previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users. ### More articles by Eric Ciarla [How to Create an llms.txt File for Any Website\\ \\ Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [Cloudflare Error 1015: How to solve it?\\ \\ Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner.](https://www.firecrawl.dev/blog/cloudflare-error-1015-how-to-solve-it) [Build an agent that checks for website contradictions\\ \\ Using Firecrawl and Claude to scrape your website's data and look for contradictions.](https://www.firecrawl.dev/blog/contradiction-agent) [Why Companies Need a Data Strategy for Generative AI\\ \\ Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses\\ \\ A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [How to easily install requests with pip and python\\ \\ A tutorial on installing the requests library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-easily-install-requests-with-pip-and-python) [How to quickly install BeautifulSoup with Python\\ \\ A guide on installing the BeautifulSoup library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python) [How to Use OpenAI's o1 Reasoning Models in Your Applications\\ \\ Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications) ## LLM Extract Introduction Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) August 31, 2024 • [![Nicolas Camara image](https://www.firecrawl.dev/nick-img.jpeg)Nicolas Camara](https://x.com/nickscamara_) # Launch Week I / Day 6: LLM Extract (v1) ![Launch Week I / Day 6: LLM Extract (v1) image](https://www.firecrawl.dev/images/blog/firecrawl-llm-extract.png) Welcome to Day 6 of Firecrawl’s Launch Week! We’re excited to introduce v1 support for LLM Extract. ## Introducing the Extract Format LLM extraction is now available in v1 under the extract format. To extract structured from a page, you can pass a schema to the endpoint or just provide a prompt. ![Extract](https://www.firecrawl.dev/images/blog/extract.png) **Output** ![Output](https://www.firecrawl.dev/images/blog/extract-output-llm.png) ## Extracting without schema (New) You can now extract without a schema by just passing a prompt to the endpoint. The LLMs choose the structure of the data. ![Output](https://www.firecrawl.dev/images/blog/extract-no-schema.png) ## Learn More Learn more about the extract format in our [documentation](https://docs.firecrawl.dev/features/extract). Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Nicolas Camara image](https://www.firecrawl.dev/nick-img.jpeg)\\ Nicolas Camara@nickscamara\_](https://x.com/nickscamara_) Nicolas Camara is the Chief Technology Officer (CTO) at Firecrawl. He previously built and scaled Mendable, one of the pioneering "chat with your documents" apps, which had major Fortune 500 customers like Snapchat, Coinbase, and MongoDB. Prior to that, Nicolas built SideGuide, the first code-learning tool inside VS Code, and grew a community of 50,000 users. Nicolas studied Computer Science and has over 10 years of experience in building software. ### More articles by Nicolas Camara [Using OpenAI's Realtime API and Firecrawl to Talk with Any Website\\ \\ Build a real-time conversational agent that interacts with any website using OpenAI's Realtime API and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Talk-with-Any-Website-Using-OpenAIs-Realtime-API-and-Firecrawl) [Extract website data using LLMs\\ \\ Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code.](https://www.firecrawl.dev/blog/data-extraction-using-llms) [Getting Started with Grok-2: Setup and Web Crawler Example\\ \\ A detailed guide on setting up Grok-2 and building a web crawler using Firecrawl.](https://www.firecrawl.dev/blog/grok-2-setup-and-web-crawler-example) [Launch Week I / Day 6: LLM Extract (v1)\\ \\ Extract structured data from your web pages using the extract format in /scrape.](https://www.firecrawl.dev/blog/launch-week-i-day-6-llm-extract) [Launch Week I / Day 7: Crawl Webhooks (v1)\\ \\ New /crawl webhook support. Send notifications to your apps during a crawl.](https://www.firecrawl.dev/blog/launch-week-i-day-7-webhooks) [OpenAI Swarm Tutorial: Create Marketing Campaigns for Any Website\\ \\ A guide to building a multi-agent system using OpenAI Swarm and Firecrawl for AI-driven marketing strategies](https://www.firecrawl.dev/blog/openai-swarm-agent-tutorial) [Build a 'Chat with website' using Groq Llama 3\\ \\ Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.](https://www.firecrawl.dev/blog/chat-with-website) [Scrape and Analyze Airbnb Data with Firecrawl and E2B\\ \\ Learn how to scrape and analyze Airbnb data using Firecrawl and E2B in a few lines of code.](https://www.firecrawl.dev/blog/scrape-analyze-airbnb-data-with-e2b) ## Fix Cloudflare Error 1015 Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) Aug 6, 2024 • [![Eric Ciarla image](https://www.firecrawl.dev/eric-img.jpeg)Eric Ciarla](https://x.com/ericciarla) # Cloudflare Error 1015: How to solve it? Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner. This typically happens if you are making a large number of requests in a short period. ## How to solve it? To resolve Cloudflare Error 1015 you can reduce the frequency of your requests to stay within the allowed limit. Another way to solve it is to use a service like [Firecrawl](https://firecrawl.dev/), which rotates proxies to prevent any single proxy from hitting the rate limit. This approach can help you avoid triggering the Cloudflare 1015 error. Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Eric Ciarla image](https://www.firecrawl.dev/eric-img.jpeg)\\ Eric Ciarla@ericciarla](https://x.com/ericciarla) Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB. Previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users. ### More articles by Eric Ciarla [How to Create an llms.txt File for Any Website\\ \\ Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [Cloudflare Error 1015: How to solve it?\\ \\ Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner.](https://www.firecrawl.dev/blog/cloudflare-error-1015-how-to-solve-it) [Build an agent that checks for website contradictions\\ \\ Using Firecrawl and Claude to scrape your website's data and look for contradictions.](https://www.firecrawl.dev/blog/contradiction-agent) [Why Companies Need a Data Strategy for Generative AI\\ \\ Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses\\ \\ A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [How to easily install requests with pip and python\\ \\ A tutorial on installing the requests library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-easily-install-requests-with-pip-and-python) [How to quickly install BeautifulSoup with Python\\ \\ A guide on installing the BeautifulSoup library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python) [How to Use OpenAI's o1 Reasoning Models in Your Applications\\ \\ Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications) ## Introducing Teams Feature Introducing /extract - Get web data with a prompt [Try now](https://www.firecrawl.dev/extract) August 26, 2024 • [![Eric Ciarla image](https://www.firecrawl.dev/eric-img.jpeg)Eric Ciarla](https://x.com/ericciarla) # Launch Week I / Day 1: Introducing Teams ![Launch Week I / Day 1: Introducing Teams image](https://www.firecrawl.dev/images/blog/firecrawl-teams.png) Welcome to Firecrawl’s first ever Launch Week! Over the course of the next five days, we’ll be bringing you an exciting new feature every day. We’re kicking off Day 1 with the launch of Teams - one of our most highly requested features. **What is Teams?** No one wants to work on web scraping projects alone. Teams enables you to collaborate with your co-workers and transform the way you approach data collection. Ever wanted to work on complex scraping projects with your entire team? With Teams, you can achieve this with just a few clicks. ![Invite Team Member Form](https://www.firecrawl.dev/images/blog/teams-demo.png) **New Pricing Plans to Support Teams** To accommodate teams of all sizes, we’ve updated our pricing structure. Our Hobby plan now includes 2 seats, perfect for small collaborations. The Standard plan offers 4 seats for growing teams, while our Growth plan supports larger groups with 8 seats. For enterprise-level needs, we offer custom seating options to fit any organization. Stay tuned for more exciting announcements throughout Launch Week. We can’t wait to show you what’s next! Article updated recently [🔥](https://www.firecrawl.dev/) ## Ready to _Build?_ Start scraping web data for your AI apps today. No credit card needed. Get Started ## About the Author [![Eric Ciarla image](https://www.firecrawl.dev/eric-img.jpeg)\\ Eric Ciarla@ericciarla](https://x.com/ericciarla) Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB. Previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users. ### More articles by Eric Ciarla [How to Create an llms.txt File for Any Website\\ \\ Learn how to generate an llms.txt file for any website using the llms.txt Generator and Firecrawl.](https://www.firecrawl.dev/blog/How-to-Create-an-llms-txt-File-for-Any-Website) [Cloudflare Error 1015: How to solve it?\\ \\ Cloudflare Error 1015 is a rate limiting error that occurs when Cloudflare detects that you are exceeding the request limit set by the website owner.](https://www.firecrawl.dev/blog/cloudflare-error-1015-how-to-solve-it) [Build an agent that checks for website contradictions\\ \\ Using Firecrawl and Claude to scrape your website's data and look for contradictions.](https://www.firecrawl.dev/blog/contradiction-agent) [Why Companies Need a Data Strategy for Generative AI\\ \\ Learn why a well-defined data strategy is essential for building robust, production-ready generative AI systems, and discover practical steps for curation, maintenance, and integration.](https://www.firecrawl.dev/blog/why-companies-need-a-data-strategy-for-generative-ai) [Getting Started with OpenAI's Predicted Outputs for Faster LLM Responses\\ \\ A guide to leveraging Predicted Outputs to speed up LLM tasks with GPT-4o models.](https://www.firecrawl.dev/blog/getting-started-with-predicted-outputs-openai) [How to easily install requests with pip and python\\ \\ A tutorial on installing the requests library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-easily-install-requests-with-pip-and-python) [How to quickly install BeautifulSoup with Python\\ \\ A guide on installing the BeautifulSoup library in Python using various methods, with usage examples and troubleshooting tips](https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python) [How to Use OpenAI's o1 Reasoning Models in Your Applications\\ \\ Learn how to harness OpenAI's latest o1 series models for complex reasoning tasks in your apps.](https://www.firecrawl.dev/blog/how-to-use-openai-o1-reasoning-models-in-applications)