Introducing /extract - Get web data with a prompt

Changelog

  • Extract Improvements - v1.4.1

    We’ve significantly enhanced our data extraction capabilities with several key updates:

    • Extract now returns a lot more data
    • Improved infrastructure reliability
    • Migrated from Cheerio to a high-performance Rust-based parser for faster and more memory-efficient parsing
    • Enhanced crawl cancellation functionality for better control over running jobs
  • /extract changes

    We have updated the /extract endpoint to be asynchronous. When you make a request to /extract, it now returns an ID that you can use to check the status of your extract job. If you are using our SDKs, no changes to your code are required, but please update to the latest SDK versions as soon as possible.

    For those using the API directly, we have made it backwards compatible. However, you have 10 days to update your implementation to the new asynchronous model.

    For more details about the parameters, refer to the docs sent to you.
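
    Here is a minimal sketch of the new asynchronous flow for direct API users, assuming the v1 endpoints POST /v1/extract and GET /v1/extract/{id} (adjust the request body to the parameters in your docs):

    import time
    import requests

    headers = {"Authorization": "Bearer fc-YOUR_API_KEY"}

    # Starting an extract job now returns a job ID instead of the data itself.
    start = requests.post(
        "https://api.firecrawl.dev/v1/extract",
        headers=headers,
        json={
            "urls": ["https://firecrawl.dev"],
            "prompt": "Extract the product name and pricing tiers.",
        },
    ).json()
    job_id = start["id"]

    # Poll the job until it completes, then read the extracted data.
    while True:
        status = requests.get(
            f"https://api.firecrawl.dev/v1/extract/{job_id}", headers=headers
        ).json()
        if status["status"] == "completed":
            print(status["data"])
            break
        time.sleep(2)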

  • v1.2.0

    Introducing /v1/search

    The search endpoint combines web search with Firecrawl’s scraping capabilities to return full page content for any query.

    Include scrapeOptions with formats: ["markdown"] to get complete markdown content for each search result; otherwise, the endpoint defaults to returning SERP results (url, title, description), as in the sketch below.

    More info here: v1/search docs
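
    A minimal sketch of both modes via the v1 REST API (request shape assumed from the description above):

    import requests

    headers = {"Authorization": "Bearer fc-YOUR_API_KEY"}

    # Default: SERP-style results only (url, title, description).
    serp = requests.post(
        "https://api.firecrawl.dev/v1/search",
        headers=headers,
        json={"query": "firecrawl"},
    ).json()

    # With scrapeOptions: each result also includes full markdown content.
    full = requests.post(
        "https://api.firecrawl.dev/v1/search",
        headers=headers,
        json={
            "query": "firecrawl",
            "scrapeOptions": {"formats": ["markdown"]},
        },
    ).json()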

    Fixes and improvements

    • Fixed the LLM not following the schema in the Python SDK for /extract
    • Fixed JSON schemas failing to be sent to the /extract endpoint through the Node SDK
    • Prompt is now optional for the /extract endpoint
    • Our fork of MinerU is now the default for PDF parsing
  • v1.1.0

    Changelog Highlights

    Feature Enhancements

    • New Features:
      • Geolocation, mobile scraping, 4x faster parsing, and better webhooks.
      • Credit packs, auto-recharges, and batch scraping support.
      • Iframe support and query parameter differentiation for URLs.
      • Similar URL deduplication.
      • Enhanced map ranking and sitemap fetching.

    Performance Improvements

    • Faster crawl status filtering and improved map ranking algorithm.
    • Optimized Kubernetes setup and simplified build processes.
    • Improved sitemap discoverability and performance.

    Bug Fixes

    • Resolved issues:
      • Badly formatted JSON, scrolling actions, and encoding errors.
      • Crawl limits, relative URLs, and missing error handlers.
    • Fixed self-hosted crawling inconsistencies and schema errors.

    SDK Updates

    • Added dynamic WebSocket imports with fallback support.
    • Optional API keys for self-hosted instances.
    • Improved error handling across SDKs.

    Documentation Updates

    • Improved API docs and examples.
    • Updated self-hosting URLs and added Kubernetes optimizations.
    • Added articles: mastering /scrape and /crawl.

    Miscellaneous

    • Added new Firecrawl examples
    • Enhanced metadata handling for webhooks and improved sitemap fetching.
    • Updated blocklist and streamlined error messages.
  • Batch Scrape

    Introducing Batch Scrape

    You can now scrape multiple URLs simultaneously with our new Batch Scrape endpoint.

    • Read more about the Batch Scrape endpoint here.
    • Python SDK (1.4.x) and Node SDK (1.7.x) updated with batch scrape support.
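
    Here is a minimal sketch using the Python SDK; batch_scrape_urls is the assumed name of the 1.4.x batch method:

    from firecrawl import FirecrawlApp

    app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

    # Scrape several URLs in a single job and wait for all results.
    # (batch_scrape_urls is an assumed method name for this sketch.)
    batch_result = app.batch_scrape_urls(
        ['https://firecrawl.dev', 'https://docs.firecrawl.dev'],
        params={'formats': ['markdown']}
    )
    print(batch_result)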
  • Cancel Crawl in the SDKs, More Examples, Improved Speed

    • Added crawl cancellation support for the Python SDK (1.3.x) and Node SDK (1.6.x); see the sketch after this list
    • OpenAI Voice + Firecrawl example added to the repo
    • CRM lead enrichment example added to the repo
    • Improved our Docker images
    • Limit and timeout fixes for the self-hosted Playwright scraper
    • Improved speed of all scrapes
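
    A minimal sketch of cancelling a running crawl with the Python SDK; async_crawl_url and cancel_crawl are assumed method names for starting a non-blocking crawl and cancelling it:

    from firecrawl import FirecrawlApp

    app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

    # Start a crawl without waiting for it to finish, keeping the job ID.
    # (async_crawl_url / cancel_crawl are assumed names for this sketch.)
    job = app.async_crawl_url('https://firecrawl.dev', params={'limit': 100})

    # Cancel the job before it completes.
    cancel_status = app.cancel_crawl(job['id'])
    print(cancel_status)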
  • Fixes + Improvements (no version bump)

    • Fixed 500 errors that frequently occurred on some crawled websites and when servers were at capacity
    • Fixed an issue where v1 crawl status wouldn’t properly return pages over 10 MB
    • Fixed an issue where screenshot would return undefined
    • Pushed improvements that reduce turnaround time when a scraper fails
  • Actions

    Introducing Actions

    Interact with pages before extracting data, unlocking more data from every site!

    Firecrawl now allows you to perform various actions on a web page before scraping its content. This is particularly useful for interacting with dynamic content, navigating through pages, or accessing content that requires user interaction.

    • Version 1.5.x of the Node SDK now supports type-safe Actions.
    • Actions are now available in the REST API and Python SDK (no version bumps required!).

    Here is a Python example of how to use actions to navigate to google.com, search for Firecrawl, click on the first result, and take a screenshot.

    from firecrawl import FirecrawlApp
    
    app = FirecrawlApp(api_key="fc-YOUR_API_KEY")
    
    # Navigate to google.com and run the actions before scraping:
    scrape_result = app.scrape_url('google.com',
        params={
            'formats': ['markdown', 'html'],
            'actions': [
                {"type": "wait", "milliseconds": 2000},
                {"type": "click", "selector": "textarea[title=\"Search\"]"},
                {"type": "wait", "milliseconds": 2000},
                {"type": "write", "text": "firecrawl"},
                {"type": "wait", "milliseconds": 2000},
                {"type": "press", "key": "ENTER"},
                {"type": "wait", "milliseconds": 3000},
                {"type": "click", "selector": "h3"},
                {"type": "wait", "milliseconds": 3000},
                {"type": "screenshot"}
            ]
        }
    )
    print(scrape_result)
    

    For more examples, check out our API Reference.

  • Firecrawl E2E Type Safe LLM Extract

    Mid-September Updates

    Typesafe LLM Extract

    • E2E Type Safety for LLM Extract in Node SDK version 1.5.x.
    • 10x cheaper in the cloud version. From 50 to 5 credits per extract.
    • Improved speed and reliability.

    Rust SDK v1.0.0

    • Rust SDK v1 is finally here! Check it out here.

    Map Improved Limits

    • Map smart results limits increased from 100 to 1000.

    Faster scrape

    • Scrape speed improved by 200–600 ms depending on the website.

    Launching changelog

    • From now on, we will create a changelog entry here for every new release.

    Improvements

    • Many improvements were pushed to the infrastructure and API. For all mid-September changes, refer to the commits here.
  • September 8, 2024

    Patch Notes (No version bump)

    • Fixed an issue where some custom header params were not being set properly in the v1 API. You can now pass headers with your requests as expected.
  • Firecrawl V1

    Firecrawl V1 is here! With it, we introduce a more reliable and developer-friendly API.

    Here is what’s new:

    • Output Formats for /scrape: Choose what formats you want your output in.
    • New /map endpoint: Get most of the URLs of a website (see the sketch after this list).
    • Developer-friendly API for /crawl/{id} status.
    • 2x Rate Limits for all plans.
    • Go SDK and Rust SDK.
    • Teams support.
    • API Key Management in the dashboard.
    • onlyMainContent now defaults to true.
    • /crawl webhooks and websocket support.
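
    As a quick sketch of two of these with the Python SDK (method and parameter names assumed to match the v1 SDKs):

    from firecrawl import FirecrawlApp

    app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

    # Pick the output formats you want back from /scrape.
    scrape = app.scrape_url('firecrawl.dev', params={'formats': ['markdown', 'html']})

    # /map returns most of the URLs discovered for a site.
    site_map = app.map_url('firecrawl.dev')
    print(site_map)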

    Learn more about it here.

    Start using v1 right away at https://firecrawl.dev