Introducing /extract - Get web data with a prompt

Jan 23, 2025 • Bex Tuychiev

Mastering the Extract Endpoint in Firecrawl


Introduction

Getting useful data from websites can be tricky. While humans can easily read and understand web content, turning that information into structured data for AI applications often requires complex code and constant maintenance. Traditional web scraping tools break when websites change, and writing separate code for each website quickly becomes overwhelming.

Firecrawl’s extract endpoint changes this by using AI to automatically understand and pull data from any website. Whether you need to gather product information from multiple stores, collect training data for AI models, or keep track of competitor prices, the extract endpoint can handle it with just a few lines of code. Similar to our previous guides about Firecrawl’s scrape, crawl, and map endpoints, this article will show you how to use this powerful new tool to gather structured data from the web reliably and efficiently.

Understanding the Extract Endpoint

The extract endpoint is different from Firecrawl’s other tools because it can understand and process entire websites at once, not just single pages. Think of it like having a smart assistant who can read through a whole website and pick out exactly the information you need, whether that’s from one page or thousands of pages.

Here’s what makes the extract endpoint special:

  • It can process multiple URLs at once, saving you time and effort
  • You can use simple English to describe what data you want, no extra coding required
  • It works with entire websites by adding /* to the URL (like website.com/*)
  • If requested information is not found in the given URL, the web search feature can automatically search related links on the page to find and extract the missing data

The extract endpoint is perfect for three main tasks:

  1. Full website data collection: Gathering information from every page on a website, like product details from an online store
  2. Data enrichment: Adding extra information to your existing data by finding details across multiple websites
  3. AI training data: Creating clean, structured datasets to train AI models

At the time of writing, signing up for an account gives you 500k free output tokens for the endpoint. The pricing page also contains information on other pricing options.

Getting Started with Extract

Let’s set up everything you need to start using the extract endpoint. First, you’ll need to sign up for a free Firecrawl account at firecrawl.dev to get your API key. Once you have your account, follow these simple steps to get started:

  1. Create a new project folder and set up your environment:
# Create a new folder and move into it
mkdir extract-project
cd extract-project

# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # For Unix/macOS
venv\Scripts\activate     # For Windows

# Install the required packages
pip install firecrawl-py python-dotenv pydantic
  2. Create a .env file to safely store your API key:
# Create the file and add your API key
echo "FIRECRAWL_API_KEY='your-key-here'" >> .env

# Create a .gitignore file to keep your API key private
echo ".env" >> .gitignore
  3. Create a simple Python script to test the setup:
from firecrawl import FirecrawlApp
from dotenv import load_dotenv
from pydantic import BaseModel

# Load your API key
load_dotenv()

# Create a Firecrawl app instance
app = FirecrawlApp()

# Create a Pydantic model
class PageData(BaseModel):
    title: str
    trusted_companies: list[str]

# Test the connection
result = app.extract(
    urls=["https://firecrawl.dev"],
    params={
        "prompt": "Extract the contents of the page based on the schema provided.",
        "schema": PageData.model_json_schema(),
    },
)
print(result['data'])
{
    'title': 'Turn websites into _LLM-ready_ data',
    'trusted_companies': [
        'Zapier', 'Gamma',
        'Nvidia', 'PHMG',
        'StackAI', 'Teller.io',
        'Carrefour', 'Vendr',
        'OpenGov.sg', 'CyberAgent',
        'Continue.dev', 'Bain',
        'JasperAI', 'Palladium Digital',
        'Checkr', 'JetBrains',
        'You.com'
    ]
}

First, we import the necessary libraries:

  • FirecrawlApp: The main class for interacting with Firecrawl’s API
  • load_dotenv: For loading environment variables from the .env file
  • BaseModel from pydantic: For creating data schemas

After loading the API key from the .env file, we create a FirecrawlApp instance which will handle our API requests.

We then define a PageData class using Pydantic that specifies the structure of data we want to extract:

  • title: A string field for the page title
  • trusted_companies: A list of strings for company names

Finally, we make an API call using app.extract() with:

  • A URL to scrape (firecrawl.dev)
  • Parameters including:
    • A prompt telling the AI what to extract
    • The schema converted to JSON format

The result will contain the extracted data matching our PageData schema structure.
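
Because the response follows the schema, you can also load it back into the Pydantic model for validated, attribute-style access. Here is a minimal sketch, assuming the script above has already run and result['data'] holds the extracted fields:

# Validate the raw dictionary against our schema
page = PageData(**result["data"])

print(page.title)
for company in page.trusted_companies:
    print("-", company)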

That’s all you need to start using the extract endpoint! In the next section, we’ll look at different ways to extract data and how to get exactly the information you need.

Basic Extraction Patterns

Let’s start with some common patterns for extracting data using Firecrawl.

Using Nested Schemas

Let’s look at how to extract more complex data using nested schemas. These are like folders within folders - they help organize related information together.

from pydantic import BaseModel
from typing import Optional, List

# Define a schema for author information
class Author(BaseModel):
    name: str
    bio: Optional[str]
    social_links: List[str]

# Define a schema for blog posts
class BlogPost(BaseModel):
    title: str
    author: Author
    publish_date: str
    content_summary: str

# Use the nested schema with Firecrawl
result = app.extract(
    urls=["https://example-blog.com/post/1"],
    params={
        "prompt": "Extract the blog post information including author details",
        "schema": BlogPost.model_json_schema()
    }
)

This code will give you organized data like this inside result['data']:

{
    'title': 'How to Make Great Pizza',
    'author': {
        'name': 'Chef Maria',
        'bio': 'Professional chef with 15 years experience',
        'social_links': ['https://twitter.com/chefmaria', 'https://instagram.com/chefmaria']
    },
    'publish_date': '2024-01-15',
    'content_summary': 'A detailed guide to making restaurant-quality pizza at home'
}

Now that we’ve seen how to extract structured data from a single page, let’s look at how to handle multiple items.

Capturing Multiple Items of the Same Type

Firecrawl follows your Pydantic schema definition to the letter. This can lead to surprising results, as in the schema below:

class Car(BaseModel):
    make: str
    brand: str
    manufacture_date: str
    mileage: int

If you use the above schema for scraping a car dealership website, you will only get a single car in the result.

To get multiple cars, you need to wrap your schema in a container class:

class Inventory(BaseModel):
    cars: List[Car]

result = app.extract(
    urls=["https://cardealership.com/inventory"],
    params={
        "prompt": "Extract the full inventory including all cars and metadata",
        "schema": Inventory.model_json_schema()
    }
)

This will give you organized data with multiple cars in a structured array.

Processing Entire Websites

When you want to get data from an entire website, you can use the /* pattern. This tells Firecrawl to look at all pages on the website.

class ProductInfo(BaseModel):
    name: str
    price: float
    description: str
    in_stock: bool

class ProductCatalog(BaseModel):
    products: List[ProductInfo]

# This will check all pages on the website
result = app.extract(
    urls=["https://example-store.com/*"],
    params={
        "prompt": "Find all product information on the website",
        "schema": ProductCatalog.model_json_schema()
    }
)

Remember that using /* processes many pages, so it generates more output tokens and therefore consumes more credits. Currently, large-scale site coverage is not supported for massive websites like Amazon, eBay, or Airbnb. Complex logical queries, such as "find all comments made between 10 am and 12 pm in 2011", are also not yet fully supported.

Since the extract endpoint is still in beta, features and performance will continue to evolve.

Making Your Extractions Better

Here are some real examples of how to get better results:

# Bad prompt
result = app.extract(
    urls=["https://store.com"],
    params={
        "prompt": "Get prices",  # Too vague!
        "schema": ProductInfo.model_json_schema()
    }
)

# Good prompt
result = app.extract(
    urls=["https://store.com"],
    params={
        "prompt": "Find all product prices in USD, including any sale prices. If there's a range, get the lowest price.",  # Clear and specific
        "schema": ProductInfo.model_json_schema()
    }
)

Pro Tips:

  1. Start with one or two URLs to test your schema and prompt
  2. Use clear field names in your schemas (like product_name instead of just name)
  3. Break complex data into smaller, nested schemas
  4. Add descriptions to your schema fields to help the AI understand what to look for

In the next section, we’ll explore best practices for schema definition, including how to create advanced Pydantic models, design effective schemas, improve extraction accuracy with field descriptions, and see complex schema examples in action.

Advanced Extraction Patterns

Let’s look at a couple of advanced patterns for extracting data effectively from different types of websites.

Asynchronous Extraction

In practice, you will usually scrape dozens, if not hundreds, of URLs with extract. For these larger jobs, you can use async_extract, which processes URLs concurrently and significantly reduces total execution time. It returns the same results as extract but can be up to 10x faster when processing multiple URLs. Simply call app.async_extract() instead of app.extract(): the asynchronous version returns a job object with status and progress information that you can use to track the extraction. This lets your program continue with other tasks while the extraction runs in the background, rather than blocking until completion.

from typing import List, Optional
from pydantic import BaseModel, Field

class HackerNewsArticle(BaseModel):
    title: str = Field(description="The title of the article")
    url: Optional[str] = Field(description="The URL of the article if present")
    points: int = Field(description="Number of upvotes/points")
    author: str = Field(description="Username of the submitter")
    comments_count: int = Field(description="Number of comments on the post")
    posted_time: str = Field(description="When the article was posted")

class HackerNewsResponse(BaseModel):
    articles: List[HackerNewsArticle]

extract_job = app.async_extract(
    urls=[
        "https://news.ycombinator.com/",
        # your other URLs here ...
    ],
    params={
        "schema": HackerNewsResponse.model_json_schema(),
        "prompt": """
        Extract articles from the Hacker News front page. Each article should include:
        - The title of the post
        - The linked URL (if present)
        - Number of points/upvotes
        - Username of who posted it
        - Number of comments
        - When it was posted
    """}
)

Let’s break down what’s happening above:

  1. First, we define a HackerNewsArticle Pydantic model that specifies the structure for each article:

    • title: The article’s headline
    • url: The link to the full article (optional)
    • points: Number of upvotes
    • author: Username who posted it
    • comments_count: Number of comments
    • posted_time: When it was posted
  2. We then create a HackerNewsResponse model that contains a list of these articles.

  3. Instead of using the synchronous extract(), we use async_extract().

  4. The extraction job is configured with:

    • A list of URLs to process (in this case, the Hacker News homepage)
    • Parameters including our schema and a prompt that guides the AI in extracting the correct information

This approach is particularly useful when you need to scrape multiple pages, as it prevents your application from blocking while waiting for each URL to be processed sequentially.

async_extract returns an extraction job object that includes a job_id. You can periodically pass this job ID to the get_extract_status method to check on the job's progress:

job_status = app.get_extract_status(extract_job.job_id)

print(job_status)
{
    "status": "pending",
    "progress": 36,
    "results": [{
        "url": "https://news.ycombinator.com/",
        "data": { ... }
    }]
}

Possible states include:

  • completed: The extraction finished successfully.
  • pending: Firecrawl is still processing your request.
  • failed: An error occurred; data was not fully extracted.
  • cancelled: The job was cancelled by the user.

A completed async_extract output is the same as that of plain extract, plus an additional "status": "completed" key-value pair.
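
Putting this together, here is a minimal polling sketch, assuming the extract_job object from the async_extract call above and that the completed payload exposes the extracted data the same way plain extract does:

import time

# Poll the job until it reaches a terminal state
while True:
    job_status = app.get_extract_status(extract_job.job_id)
    if job_status["status"] in ("completed", "failed", "cancelled"):
        break
    time.sleep(5)  # wait a few seconds between checks

if job_status["status"] == "completed":
    # Assumption: the finished payload carries the extracted data under 'data',
    # mirroring the plain extract response shown earlier
    articles = HackerNewsResponse(**job_status["data"])
    print(f"Extracted {len(articles.articles)} articles")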

Using Web Search

Another powerful feature of the extract endpoint is the ability to search pages related to the URL you are scraping for additional information. For example, let's say we are scraping GitHub's trending page and we want to get the following information:

class GithubRepository(BaseModel):
    name: str = Field(description="The name of the repository")
    url: str = Field(description="The URL of the repository")
    stars: int = Field(description="Number of stars/favorites")
    license: str = Field(description="The open-source license of the repository")
    n_releases: str = Field(description="The number of releases of the repo if any")

class GithubTrendingResponse(BaseModel):
    repositories: List[GithubRepository]

If you look at the GitHub trending page, you will see that the first page doesn’t contain any information about the license or number of releases. Those pieces of info are on each repository’s page.

To get this information, we can enable the enableWebSearch parameter in the extract call. When enabled, Firecrawl searches pages related to the URL you are scraping for the missing information.

data = app.extract(
    urls=["https://github.com/trending"],
    params={
        "schema": GithubTrendingResponse.model_json_schema(),
        "prompt": "Extract information based on the schema provided. If there are missing fields, search the related pages for the missing information.",
        "enableWebSearch": True,
    },
)

data['data']['repositories'][0]
{
 'url': 'https://github.com/yt-dlp/yt-dlp',
 'name': 'YT-DLP',
 'license': 'Unlicense license',
 'stars': 98854,
 'n_releases': '103'
}

The web search feature is particularly powerful when you need to gather comprehensive information that might be spread across multiple pages, saving you from having to write complex crawling logic or multiple extraction calls.

Best Practices For Schema Definition

When designing schemas for the extract endpoint, following these best practices will help you get better, more reliable results:

1. Start simple, then expand

# Start with basic fields
class Product(BaseModel):
    name: str
    price: float

# Then gradually add complexity
class Product(BaseModel):
    name: str = Field(description="Full product name including brand")
    price: float = Field(description="Current price in USD")
    variants: List[str] = Field(description="List of variants of the product")
    specifications: str = Field(description="Specifications of the product")

The simple schema focuses on just the essential fields (name and price) which is great for:

  • Initial testing and validation
  • Cases where you only need basic product information
  • Faster development and debugging

The complex schema adds more detailed fields and descriptions which helps when:

  • You need comprehensive product data
  • The source pages have varied formats
  • You want to ensure consistent data quality
  • You need specific variants or specifications

By starting simple and expanding gradually, you can validate your extraction works before adding complexity. The field descriptions in the complex schema also help guide the AI to extract the right information in the right format.

2. Use clear, descriptive field names and types

# Poor naming and typing
class BadSchema(BaseModel):
    n: str  # Unclear name
    p: float  # Unclear name
    data: Any  # Too flexible

# Better naming and typing
class GoodSchema(BaseModel):
    product_name: str
    price_usd: float
    technical_specifications: str

Use type hints that match your expected data. This helps both the AI model and other developers understand your schema. The only exception is that Firecrawl doesn’t support a datetime data type, so if you are scraping temporal information, use str.
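
For instance, here is a small illustrative sketch (the PressRelease model and its field names are hypothetical) that keeps the date as a string during extraction and converts it afterwards:

from datetime import datetime
from pydantic import BaseModel, Field

class PressRelease(BaseModel):
    headline: str
    # Firecrawl has no datetime type, so capture temporal data as text
    published_date: str = Field(description="Publication date in YYYY-MM-DD format")

# Convert the string into a real datetime after extraction, in your own code
release = PressRelease(headline="Example headline", published_date="2025-01-23")
published = datetime.strptime(release.published_date, "%Y-%m-%d")
print(published.year)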

3. Structure Complex Data Hierarchically

class Author(BaseModel):
    name: str
    bio: Optional[str]
    social_links: List[str]

class Article(BaseModel):
    title: str
    author: Author
    content: str
    tags: List[str]

class Blog(BaseModel):
    articles: List[Article]
    site_name: str
    last_updated: str  # Firecrawl has no datetime type, so keep dates as strings

Breaking down complex data into nested models makes the schema more maintainable and helps the AI understand relationships between data points. In the above example, the Blog model contains a list of Article objects, which in turn contain Author objects. This hierarchical structure clearly shows how the data is organized - a blog has many articles, and each article has one author with their own properties.

4. Include Example Data

class ProductReview(BaseModel):
    rating: int = Field(
        description="Customer rating from 1-5 stars"
    )
    comment: str
    reviewer_name: str
    verified_purchase: bool

    class Config:
        json_schema_extra = {
            "example": {
                "rating": 4,
                "comment": "Great product, fast shipping!",
                "reviewer_name": "John D.",
                "verified_purchase": True
            }
        }

Providing examples in your schema helps the AI understand exactly what format you expect, especially for complex or ambiguous fields.

Pro tips:

  • Use Optional fields when data might not always be present
  • Include unit information in field descriptions (e.g., “price in USD”, “weight in kg”)
  • Use enums for fields with a fixed set of possible values
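
For the last tip, here is a minimal sketch of an enum-backed field (the Availability values are illustrative, not a Firecrawl requirement):

from enum import Enum
from pydantic import BaseModel, Field

class Availability(str, Enum):
    IN_STOCK = "in_stock"
    OUT_OF_STOCK = "out_of_stock"
    PREORDER = "preorder"

class ProductListing(BaseModel):
    product_name: str
    # The fixed set of values constrains what can be returned for this field
    availability: Availability = Field(description="Stock status of the product")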

A Real-World Example: Scraping GitHub Trending Repositories

Let’s build a complete example that extracts trending repository data from GitHub’s trending page. This example will show how to:

  1. Design a robust schema
  2. Handle nested data
  3. Process multiple items
  4. Use field descriptions effectively

Let’s start by making fresh imports and setup:

from firecrawl import FirecrawlApp
from pydantic import BaseModel, Field
from typing import List, Optional
from datetime import datetime
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Initialize Firecrawl
app = FirecrawlApp()

Now, let’s define our nested schema for the data:

class Developer(BaseModel):
    username: str = Field(description="GitHub username of the developer")
    profile_url: str = Field(description="URL to the developer's GitHub profile")


class Repository(BaseModel):
    name: str = Field(description="Repository name in format 'username/repo'")
    description: Optional[str] = Field(
        description="Repository description if available"
    )
    url: str = Field(description="Full URL to the repository")
    language: Optional[str] = Field(description="Primary programming language used")
    stars: int = Field(description="Total number of stars")
    forks: int = Field(description="Total number of forks")
    stars_today: int = Field(description="Number of stars gained today")
    developers: List[Developer] = Field(
        default_factory=list, description="List of contributors to the repository"
    )


class TrendingData(BaseModel):
    repositories: List[Repository] = Field(description="List of trending repositories")

The schema above defines three Pydantic models for structured GitHub trending data:

  • Developer: Represents a contributor with their GitHub username and profile URL
  • Repository: Contains details about a GitHub repo including name, description, URL, language, stars count, forks count, stars gained today, and a list of contributors
  • TrendingData: The root model that contains a list of trending repositories

Next, let’s define a function for running a scraper using our schema:

def scrape_trending_repos():
    # Extract the data
    result = app.extract(
        urls=["https://github.com/trending"],
        params={
            "prompt": """
            Extract information about trending GitHub repositories.
            For each repository:
            - Get the full repository name (username/repo)
            - Get the description if available
            - Extract the primary programming language
            - Get the total stars and forks
            - Get the number of stars gained today
            - Get information about contributors including their usernames and profile URLs

            Note:
            - Stars and forks should be converted to numbers (e.g., '1.2k' → 1200)
            - Stars today should be extracted from the "X stars today" text
            - Ensure all URLs are complete (add https://github.com if needed)
            """,
            "schema": TrendingData.model_json_schema()
        }
    )

    return result

Now, let's extend scrape_trending_repos to print the details of some of the scraped repos (this block goes inside the function, before the return statement):

    # Process and display the results
    if result["success"]:
        trending = TrendingData(**result["data"])

        # Print the top 3 repositories
        print("🔥 Top 3 Trending Repositories:")
        for repo in trending.repositories[:3]:
            print(f"\n📦 {repo.name}")
            print(f"đź“ť {repo.description[:100]}..." if repo.description else "No description")
            print(f"🌟 {repo.stars:,} total stars ({repo.stars_today:,} today)")
            print(f"🔤 {repo.language or 'Unknown language'}")
            print(f"đź”— {repo.url}")

Example output:

🔥 Top 3 Trending Repositories:

📦 microsoft/garnet
📝 Garnet: A Remote Cache-Store for Distributed Applications...
🌟 12,483 total stars (1,245 today)
🔤 C#
🔗 https://github.com/microsoft/garnet

📦 openai/whisper
📝 Robust Speech Recognition via Large-Scale Weak Supervision...
🌟 54,321 total stars (523 today)
🔤 Python
🔗 https://github.com/openai/whisper

📦 lencx/ChatGPT
📝 ChatGPT Desktop Application (Mac, Windows and Linux)...
🌟 43,210 total stars (342 today)
🔤 TypeScript
🔗 https://github.com/lencx/ChatGPT

This example demonstrates several best practices:

  1. Nested Models: Using separate models for repositories and developers
  2. Smart Defaults: Using default_factory for list fields that may be empty
  3. Clear Descriptions: Each field has a detailed description
  4. Type Safety: Using precise types like int for counts and Optional[str] for fields that may be missing
  5. Detailed Prompt: The prompt clearly explains how to handle special cases

You can extend this example by:

  • Adding time period filtering (daily, weekly, monthly)
  • Including more repository metadata
  • Saving results to a database
  • Setting up automated daily tracking
  • Adding error handling and retries
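
For the last point in that list, here is a hedged sketch of a simple retry wrapper around the extraction call (the retry count and delay are arbitrary choices, not Firecrawl defaults):

import time

def scrape_with_retries(max_attempts: int = 3, delay_seconds: int = 10):
    """Retry the trending-repo extraction a few times before giving up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return scrape_trending_repos()
        except Exception as exc:
            print(f"Attempt {attempt} failed: {exc}")
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    raise RuntimeError("Extraction failed after all retries")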

Extract vs Other Endpoints

Firecrawl offers three main endpoints for gathering web data: /extract, /scrape, and /crawl. Each serves different use cases and has unique strengths. Let’s explore when to use each one.

Extract vs Scrape

The /scrape endpoint is best for:

  • Single-page detailed extraction
  • Getting multiple output formats (HTML, markdown, screenshots)
  • JavaScript-heavy websites requiring browser rendering
  • Capturing visual elements or taking screenshots
# Scrape endpoint example - multiple formats
result = app.scrape_url(
    "https://example.com",
    params={
        "formats": ["html", "markdown", "screenshot", "extract"],
        "extract": {
            "schema": MySchema.model_json_schema(),
            "prompt": "Extract product information"
        }
    }
)

The /extract endpoint is better for:

  • Processing multiple URLs efficiently
  • Website-wide data extraction
  • Complex structured data extraction
  • Building data enrichment pipelines
# Extract endpoint example - multiple URLs
result = app.extract(
    urls=["https://example.com/*"],
    params={
        "prompt": "Extract all product information",
        "schema": ProductSchema.model_json_schema()
    }
)

Note that extraction is also built into scrape_url (via the "extract" format) and usually returns the same structured results as extract. The difference is that scrape_url is designed for single pages, or for cases where you also need other output formats.
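
As an illustration, here is a hedged sketch of reading structured output from scrape_url. The exact response layout may differ by SDK version; the assumption here is that the extracted data comes back under an 'extract' key, mirroring the requested format name, and ProductSchema is the placeholder schema used above:

scrape_result = app.scrape_url(
    "https://example.com",
    params={
        "formats": ["extract"],
        "extract": {
            "schema": ProductSchema.model_json_schema(),
            "prompt": "Extract all product information"
        }
    }
)

# Assumption: the structured output sits under the 'extract' key
print(scrape_result.get("extract"))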

Extract vs Crawl

The /crawl endpoint excels at:

  • Discovering and following links automatically
  • Complete website archiving
  • Sitemap generation
  • Sequential page processing
# Crawl endpoint example
result = app.crawl_url(
    "https://example.com",
    params={
        "limit": 5,
        "includePaths": "*/blog/*",
        "excludePaths": "*/author/*",
        "scrapeOptions": {
            "formats": ["markdown", "html"],
            "includeTags": ["code", "#page-header"],
            "excludeTags": ["h1", "h2", ".main-content"]
        }
    }
)

The /extract endpoint is preferable when:

  • You know exactly which URLs to process
  • You need specific structured data from known pages
  • You want parallel processing of multiple URLs

Choosing the Right Endpoint

Here’s a decision framework for choosing between endpoints:

  1. Use /extract when:

    • You need structured data from known URLs
    • You want to process multiple pages in parallel
    • You need to combine data from different websites
    • You have a specific schema for the data you want
  2. Use /scrape when:

    • You need multiple output formats
    • The website requires JavaScript rendering
    • You need screenshots or visual elements
    • You’re dealing with single pages that need detailed extraction
  3. Use /crawl when:

    • You need to discover pages automatically
    • You want to archive entire websites
    • You need to follow specific URL patterns
    • You’re building a complete site map

Ultimately, the extraction functionality is built into all endpoints, allowing you to choose the most appropriate one for your specific use case while maintaining consistent data extraction capabilities.

For more details, refer to our guides on the scrape, crawl, and map endpoints.

Conclusion

The /extract endpoint combines AI-driven understanding with structured data validation to solve common web scraping challenges. Instead of maintaining brittle HTML selectors or writing custom parsing logic, developers can describe their data needs in plain English while ensuring consistent output through schema validation. This approach works particularly well for projects that need to gather structured data from multiple sources or websites that frequently change their layouts.

For those new to the endpoint, we recommend starting with simple schemas and single URLs before tackling more complex extractions. This helps in understanding how the AI interprets your prompts and how different schema designs affect the extraction results. As you become more comfortable with the basics, you can explore advanced features like website-wide extraction, nested schemas, and data enrichment patterns to build more sophisticated data pipelines.


About the Author

Bex Tuychiev (@bextuychiev)

Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics.
