Mastering the Extract Endpoint in Firecrawl

Introduction
Getting useful data from websites can be tricky. While humans can easily read and understand web content, turning that information into structured data for AI applications often requires complex code and constant maintenance. Traditional web scraping tools break when websites change, and writing separate code for each website quickly becomes overwhelming.
Firecrawl’s extract endpoint changes this by using AI to automatically understand and pull data from any website. Whether you need to gather product information from multiple stores, collect training data for AI models, or keep track of competitor prices, the extract endpoint can handle it with just a few lines of code. Similar to our previous guides about Firecrawl’s scrape, crawl, and map endpoints, this article will show you how to use this powerful new tool to gather structured data from the web reliably and efficiently.
Understanding the Extract Endpoint
The extract endpoint is different from Firecrawl’s other tools because it can understand and process entire websites at once, not just single pages. Think of it like having a smart assistant who can read through a whole website and pick out exactly the information you need, whether that’s from one page or thousands of pages.
Here’s what makes the extract endpoint special:
- It can process multiple URLs at once, saving you time and effort
- You can use simple English to describe what data you want, no extra coding required
- It works with entire websites by adding /* to the URL (like website.com/*)
- If the requested information is not found in the given URL, the web search feature can automatically search related links on the page to find and extract the missing data
The extract endpoint is perfect for three main tasks:
- Full website data collection: Gathering information from every page on a website, like product details from an online store
- Data enrichment: Adding extra information to your existing data by finding details across multiple websites
- AI training data: Creating clean, structured datasets to train AI models
At the time of writing, signing up for an account gives you 500k free output tokens for the endpoint; the pricing page covers the other available plans.
Getting Started with Extract
Let’s set up everything you need to start using the extract endpoint. First, you’ll need to sign up for a free Firecrawl account at firecrawl.dev to get your API key. Once you have your account, follow these simple steps to get started:
- Create a new project folder and set up your environment:
# Create a new folder and move into it
mkdir extract-project
cd extract-project
# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate # For Unix/macOS
venv\Scripts\activate # For Windows
# Install the required packages
pip install firecrawl-py python-dotenv pydantic
- Create a .env file to safely store your API key:
# Create the file and add your API key
echo "FIRECRAWL_API_KEY='your-key-here'" >> .env
# Create a .gitignore file to keep your API key private
echo ".env" >> .gitignore
- Create a simple Python script to test the setup:
from firecrawl import FirecrawlApp
from dotenv import load_dotenv
from pydantic import BaseModel

# Load your API key
load_dotenv()

# Create a Firecrawl app instance
app = FirecrawlApp()

# Create a Pydantic model
class PageData(BaseModel):
    title: str
    trusted_companies: list[str]

# Test the connection
result = app.extract(
    urls=["https://firecrawl.dev"],
    params={
        "prompt": "Extract the contents of the page based on the schema provided.",
        "schema": PageData.model_json_schema(),
    },
)

print(result['data'])
{
'title': 'Turn websites into _LLM-ready_ data',
'trusted_companies': [
'Zapier', 'Gamma',
'Nvidia', 'PHMG',
'StackAI', 'Teller.io',
'Carrefour', 'Vendr',
'OpenGov.sg', 'CyberAgent',
'Continue.dev', 'Bain',
'JasperAI', 'Palladium Digital',
'Checkr', 'JetBrains',
'You.com'
]
}
First, we import the necessary libraries:
- FirecrawlApp: The main class for interacting with Firecrawl's API
- load_dotenv: For loading environment variables from the .env file
- BaseModel from pydantic: For creating data schemas
After loading the API key from the .env file, we create a FirecrawlApp instance, which handles our API requests.
We then define a PageData class using Pydantic that specifies the structure of the data we want to extract:
- title: A string field for the page title
- trusted_companies: A list of strings for company names
Finally, we make an API call using app.extract() with:
- A URL to scrape (firecrawl.dev)
- Parameters including:
  - A prompt telling the AI what to extract
  - The schema converted to JSON format

The result will contain the extracted data matching our PageData schema structure.
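If you want the response validated against your schema, you can optionally load it back into the Pydantic model. This is a small convenience step on top of what the script above returns, not something the API requires:

# Validate the extracted data against the schema (optional)
page = PageData(**result["data"])  # raises a ValidationError if fields are missing or mistyped
print(page.title)
print(len(page.trusted_companies), "trusted companies found")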
That’s all you need to start using the extract endpoint! In the next section, we’ll look at different ways to extract data and how to get exactly the information you need.
Basic Extraction Patterns
Let’s start with some common patterns for extracting data using Firecrawl.
Using Nested Schemas
Let’s look at how to extract more complex data using nested schemas. These are like folders within folders - they help organize related information together.
from pydantic import BaseModel
from typing import Optional, List

# Define a schema for author information
class Author(BaseModel):
    name: str
    bio: Optional[str]
    social_links: List[str]

# Define a schema for blog posts
class BlogPost(BaseModel):
    title: str
    author: Author
    publish_date: str
    content_summary: str

# Use the nested schema with Firecrawl
result = app.extract(
    urls=["https://example-blog.com/post/1"],
    params={
        "prompt": "Extract the blog post information including author details",
        "schema": BlogPost.model_json_schema()
    }
)
This code will give you organized data like this inside result['data']:
{
'title': 'How to Make Great Pizza',
'author': {
'name': 'Chef Maria',
'bio': 'Professional chef with 15 years experience',
'social_links': ['https://twitter.com/chefmaria', 'https://instagram.com/chefmaria']
},
'publish_date': '2024-01-15',
'content_summary': 'A detailed guide to making restaurant-quality pizza at home'
}
Now that we’ve seen how to extract structured data from a single page, let’s look at how to handle multiple items.
Capturing Multiple Items of the Same Type
Firecrawl follows your Pydantic schema definition to the letter. This can lead to interesting scenarios like the one below:
class Car(BaseModel):
    make: str
    brand: str
    manufacture_date: str
    mileage: int
If you use the above schema for scraping a car dealership website, you will only get a single car in the result.
To get multiple cars, you need to wrap your schema in a container class:
class Inventory(BaseModel):
    cars: List[Car]

result = app.extract(
    urls=["https://cardealership.com/inventory"],
    params={
        "prompt": "Extract the full inventory including all cars and metadata",
        "schema": Inventory.model_json_schema()
    }
)
This will give you organized data with multiple cars in a structured array.
Processing Entire Websites
When you want to get data from an entire website, you can use the /* pattern. This tells Firecrawl to look at all pages on the website.
class ProductInfo(BaseModel):
    name: str
    price: float
    description: str
    in_stock: bool

# This will check all pages on the website
result = app.extract(
    urls=["https://example-store.com/*"],
    params={
        "prompt": "Find all product information on the website",
        "schema": ProductInfo.model_json_schema()
    }
)
Remember that using /* processes multiple pages, so it produces more output tokens and therefore uses more credits. Currently, full-site coverage is not supported for massive websites like Amazon, eBay, or Airbnb. Also, complex logical queries such as "find all comments made between 10 am and 12 pm in 2011" are not yet fully supported.
Since the extract endpoint is still in beta, features and performance will continue to evolve.
Making Your Extractions Better
Here are some real examples of how to get better results:
# Bad prompt
result = app.extract(
    urls=["https://store.com"],
    params={
        "prompt": "Get prices",  # Too vague!
        "schema": ProductInfo.model_json_schema()
    }
)

# Good prompt
result = app.extract(
    urls=["https://store.com"],
    params={
        "prompt": "Find all product prices in USD, including any sale prices. If there's a range, get the lowest price.",  # Clear and specific
        "schema": ProductInfo.model_json_schema()
    }
)
Pro Tips:
- Start with one or two URLs to test your schema and prompt
- Use clear field names in your schemas (like product_name instead of just name)
- Break complex data into smaller, nested schemas
- Add descriptions to your schema fields to help the AI understand what to look for (see the short sketch below)
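To make the last two tips concrete, here is a lightly annotated variant of the ProductInfo schema from earlier. The field names and descriptions are illustrative, not required by the API:

from pydantic import BaseModel, Field
from typing import Optional

class ProductInfo(BaseModel):
    product_name: str = Field(description="Full product name, including brand")
    price_usd: float = Field(description="Current price in USD; use the lowest price if a range is shown")
    sale_price_usd: Optional[float] = Field(None, description="Discounted price in USD, if any")
    in_stock: bool = Field(description="Whether the product is currently available")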
In the next section, we’ll explore best practices for schema definition, including how to create advanced Pydantic models, design effective schemas, improve extraction accuracy with field descriptions, and see complex schema examples in action.
Advanced Extraction Patterns
Let’s look at a couple of advanced patterns for extracting data effectively from different types of websites.
Asynchronous Extraction
In practice, you will usually scrape dozens if not hundreds of URLs with extract. For these larger jobs, you can use async_extract, which processes URLs concurrently and significantly reduces total execution time. This asynchronous version produces the same results as extract but can be up to 10x faster when processing multiple URLs. Simply call app.async_extract() instead of app.extract(); rather than returning the data directly, it returns a job object with status and progress information that you can use to track the extraction. This allows your program to continue with other tasks while the extraction runs in the background, rather than blocking until completion.
from pydantic import BaseModel, Field
from typing import List, Optional

class HackerNewsArticle(BaseModel):
    title: str = Field(description="The title of the article")
    url: Optional[str] = Field(description="The URL of the article if present")
    points: int = Field(description="Number of upvotes/points")
    author: str = Field(description="Username of the submitter")
    comments_count: int = Field(description="Number of comments on the post")
    posted_time: str = Field(description="When the article was posted")

class HackerNewsResponse(BaseModel):
    articles: List[HackerNewsArticle]

extract_job = app.async_extract(
    urls=[
        "https://news.ycombinator.com/",
        # your other URLs here ...
    ],
    params={
        "schema": HackerNewsResponse.model_json_schema(),
        "prompt": """
        Extract articles from the Hacker News front page. Each article should include:
        - The title of the post
        - The linked URL (if present)
        - Number of points/upvotes
        - Username of who posted it
        - Number of comments
        - When it was posted
        """,
    },
)
Let’s break down what’s happening above:
- First, we define a HackerNewsArticle Pydantic model that specifies the structure for each article:
  - title: The article's headline
  - url: The link to the full article (optional)
  - points: Number of upvotes
  - author: Username of who posted it
  - comments_count: Number of comments
  - posted_time: When it was posted
- We then create a HackerNewsResponse model that contains a list of these articles.
- Instead of using the synchronous extract(), we use async_extract().
- The extraction job is configured with:
  - A list of URLs to process (in this case, the Hacker News homepage)
  - Parameters including our schema and a prompt that guides the AI in extracting the correct information
This approach is particularly useful when you need to scrape multiple pages, as it prevents your application from blocking while waiting for each URL to be processed sequentially.
async_extract returns an extraction job object that includes a job_id. You can periodically pass this job ID to the get_extract_status method to check on the job's progress:
job_status = app.get_extract_status(extract_job.job_id)
print(job_status)
{
"status": "pending",
"progress": 36,
"results": [{
"url": "https://news.ycombinator.com/",
"data": { ... }
}]
}
Possible states include:
- completed: The extraction finished successfully.
- pending: Firecrawl is still processing your request.
- failed: An error occurred; data was not fully extracted.
- cancelled: The job was cancelled by the user.
A completed async_extract job returns the same output as a plain extract call, plus an additional status: completed key-value pair.
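If you simply want to wait for the job, you can poll it in a loop. Here is a minimal sketch that assumes get_extract_status returns a dictionary shaped like the example above and that a completed job exposes the extracted data under a data key, mirroring plain extract; adjust the key access if your SDK version differs:

import time

job_status = app.get_extract_status(extract_job.job_id)
while job_status["status"] == "pending":
    time.sleep(10)  # wait before polling again
    job_status = app.get_extract_status(extract_job.job_id)

if job_status["status"] == "completed":
    # Validate the extracted data against our Pydantic model
    articles = HackerNewsResponse(**job_status["data"])
    print(f"Extracted {len(articles.articles)} articles")
else:
    print(f"Job ended with status: {job_status['status']}")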
Using Web Search
Another powerful feature of the extract endpoint is the ability to search pages related to the URL you are scraping for additional information. For example, let's say we are scraping GitHub's trending page and we want to get the following information:
class GithubRepository(BaseModel):
    name: str = Field(description="The name of the repository")
    url: str = Field(description="The URL of the repository")
    stars: int = Field(description="Number of stars/favorites")
    license: str = Field(description="The open-source license of the repository")
    n_releases: str = Field(description="The number of releases of the repo if any")

class GithubTrendingResponse(BaseModel):
    repositories: List[GithubRepository]
If you look at the GitHub trending page, you will see that the first page doesn’t contain any information about the license or number of releases. Those pieces of info are on each repository’s page.
To get this information, we can set the enableWebSearch parameter in the extract endpoint. With this parameter enabled, Firecrawl searches pages related to the URL you are scraping for the missing information.
data = app.extract(
    urls=["https://github.com/trending"],
    params={
        "schema": GithubTrendingResponse.model_json_schema(),
        "prompt": "Extract information based on the schema provided. If there are missing fields, search the related pages for the missing information.",
        "enableWebSearch": True,
    },
)
data['data']['repositories'][0]
{
'url': 'https://github.com/yt-dlp/yt-dlp',
'name': 'YT-DLP',
'license': 'Unlicense license',
'stars': 98854,
'n_releases': '103'
}
The web search feature is particularly powerful when you need to gather comprehensive information that might be spread across multiple pages, saving you from having to write complex crawling logic or multiple extraction calls.
Best Practices For Schema Definition
When designing schemas for the extract endpoint, following these best practices will help you get better, more reliable results:
1. Start simple, then expand
# Start with basic fields
class Product(BaseModel):
    name: str
    price: float

# Then gradually add complexity
class Product(BaseModel):
    name: str = Field(description="Full product name including brand")
    price: float = Field(description="Current price in USD")
    variants: List[str] = Field(description="List of variants of the product")
    specifications: str = Field(description="Specifications of the product")
The simple schema focuses on just the essential fields (name and price) which is great for:
- Initial testing and validation
- Cases where you only need basic product information
- Faster development and debugging
The complex schema adds more detailed fields and descriptions which helps when:
- You need comprehensive product data
- The source pages have varied formats
- You want to ensure consistent data quality
- You need specific variants or specifications
By starting simple and expanding gradually, you can validate your extraction works before adding complexity. The field descriptions in the complex schema also help guide the AI to extract the right information in the right format.
2. Use clear, descriptive field names and types
# Poor naming and typing
class BadSchema(BaseModel):
    n: str      # Unclear name
    p: float    # Unclear name
    data: Any   # Too flexible

# Better naming and typing
class GoodSchema(BaseModel):
    product_name: str
    price_usd: float
    technical_specifications: str
Use type hints that match your expected data. This helps both the AI model and other developers understand your schema. The only exception is that Firecrawl doesn't support the datetime data type, so if you are scraping temporal information, use str.
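For example, a common pattern is to extract dates as strings and parse them after extraction. The model and values below are purely illustrative:

from datetime import datetime
from pydantic import BaseModel, Field

class Event(BaseModel):
    # datetime isn't supported by the extract endpoint, so keep this a str
    start_date: str = Field(description="Event start date in YYYY-MM-DD format")

# Parse after extraction (value shown here only for illustration)
event = Event(start_date="2024-01-15")
parsed = datetime.strptime(event.start_date, "%Y-%m-%d")
print(parsed.year)  # 2024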
3. Structure Complex Data Hierarchically
class Author(BaseModel):
    name: str
    bio: Optional[str]
    social_links: List[str]

class Article(BaseModel):
    title: str
    author: Author
    content: str
    tags: List[str]

class Blog(BaseModel):
    articles: List[Article]
    site_name: str
    last_updated: str  # keep as str since datetime isn't supported
Breaking down complex data into nested models makes the schema more maintainable and helps the AI understand relationships between data points. In the above example, the Blog model contains a list of Article objects, which in turn contain Author objects. This hierarchical structure clearly shows how the data is organized - a blog has many articles, and each article has one author with their own properties.
4. Include Example Data
class ProductReview(BaseModel):
    rating: int = Field(
        description="Customer rating from 1-5 stars"
    )
    comment: str
    reviewer_name: str
    verified_purchase: bool

    class Config:
        json_schema_extra = {
            "example": {
                "rating": 4,
                "comment": "Great product, fast shipping!",
                "reviewer_name": "John D.",
                "verified_purchase": True
            }
        }
Providing examples in your schema helps the AI understand exactly what format you expect, especially for complex or ambiguous fields.
Pro tips:
- Use Optional fields when data might not always be present
- Include unit information in field descriptions (e.g., "price in USD", "weight in kg")
- Use enums for fields with a fixed set of possible values (see the sketch below)
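Here is a small sketch that combines these tips; the field names and enum values are illustrative, not part of Firecrawl's API:

from enum import Enum
from typing import Optional
from pydantic import BaseModel, Field

class Availability(str, Enum):
    in_stock = "in_stock"
    out_of_stock = "out_of_stock"
    preorder = "preorder"

class ProductListing(BaseModel):
    product_name: str = Field(description="Full product name including brand")
    price_usd: float = Field(description="Current price in USD")
    weight_kg: Optional[float] = Field(None, description="Shipping weight in kg, if listed")
    availability: Availability = Field(description="One of: in_stock, out_of_stock, preorder")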
A Real-World Example: Scraping GitHub Trending Repositories
Let’s build a complete example that extracts trending repository data from GitHub’s trending page. This example will show how to:
- Design a robust schema
- Handle nested data
- Process multiple items
- Use field descriptions effectively
Let’s start by making fresh imports and setup:
from firecrawl import FirecrawlApp
from pydantic import BaseModel, Field
from typing import List, Optional
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Initialize Firecrawl
app = FirecrawlApp()
Now, let’s define our nested schema for the data:
class Developer(BaseModel):
    username: str = Field(description="GitHub username of the developer")
    profile_url: str = Field(description="URL to the developer's GitHub profile")

class Repository(BaseModel):
    name: str = Field(description="Repository name in format 'username/repo'")
    description: Optional[str] = Field(
        description="Repository description if available"
    )
    url: str = Field(description="Full URL to the repository")
    language: Optional[str] = Field(description="Primary programming language used")
    stars: int = Field(description="Total number of stars")
    forks: int = Field(description="Total number of forks")
    stars_today: int = Field(description="Number of stars gained today")
    developers: List[Developer] = Field(
        default_factory=list, description="List of contributors to the repository"
    )

class TrendingData(BaseModel):
    repositories: List[Repository] = Field(description="List of trending repositories")
The schema above defines three Pydantic models for structured GitHub trending data:
- Developer: Represents a contributor with their GitHub username and profile URL
- Repository: Contains details about a GitHub repo including name, description, URL, language, stars count, forks count, stars gained today, and a list of contributors
- TrendingData: The root model that contains a list of trending repositories
Next, let’s define a function for running a scraper using our schema:
def scrape_trending_repos():
    # Extract the data
    result = app.extract(
        urls=["https://github.com/trending"],
        params={
            "prompt": """
            Extract information about trending GitHub repositories.
            For each repository:
            - Get the full repository name (username/repo)
            - Get the description if available
            - Extract the primary programming language
            - Get the total stars and forks
            - Get the number of stars gained today
            - Get information about contributors including their usernames and profile URLs

            Note:
            - Stars and forks should be converted to numbers (e.g., '1.2k' → 1200)
            - Stars today should be extracted from the "X stars today" text
            - Ensure all URLs are complete (add https://github.com if needed)
            """,
            "schema": TrendingData.model_json_schema()
        }
    )
    return result
Now, let’s print the details of some of the scraped repos:
# Run the scraper, then process and display the results
result = scrape_trending_repos()

if result["success"]:
    trending = TrendingData(**result["data"])

    # Print the top 3 repositories
    print("🔥 Top 3 Trending Repositories:")
    for repo in trending.repositories[:3]:
        print(f"\n📦 {repo.name}")
        print(f"📝 {repo.description[:100]}..." if repo.description else "No description")
        print(f"🌟 {repo.stars:,} total stars ({repo.stars_today:,} today)")
        print(f"🔤 {repo.language or 'Unknown language'}")
        print(f"🔗 {repo.url}")
Example output:
🔥 Top 3 Trending Repositories:

📦 microsoft/garnet
📝 Garnet: A Remote Cache-Store for Distributed Applications...
🌟 12,483 total stars (1,245 today)
🔤 C#
🔗 https://github.com/microsoft/garnet

📦 openai/whisper
📝 Robust Speech Recognition via Large-Scale Weak Supervision...
🌟 54,321 total stars (523 today)
🔤 Python
🔗 https://github.com/openai/whisper

📦 lencx/ChatGPT
📝 ChatGPT Desktop Application (Mac, Windows and Linux)...
🌟 43,210 total stars (342 today)
🔤 TypeScript
🔗 https://github.com/lencx/ChatGPT
This example demonstrates several best practices:
- Nested Models: Using separate models for repositories and developers
- Smart Defaults: Using default_factory for list fields like developers
- Clear Descriptions: Each field has a detailed description
- Type Safety: Using precise type hints, such as int for counts and Optional[str] for fields that may be missing
- Detailed Prompt: The prompt clearly explains how to handle special cases
You can extend this example by:
- Adding time period filtering (daily, weekly, monthly)
- Including more repository metadata
- Saving results to a database
- Setting up automated daily tracking
- Adding error handling and retries (see the sketch below)
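As a rough illustration of the last point, here is a minimal retry wrapper around the scraper; the attempt count and delay are arbitrary values you would tune for your own use case:

import time

def scrape_with_retries(max_attempts=3, delay_seconds=10):
    for attempt in range(1, max_attempts + 1):
        try:
            result = scrape_trending_repos()
            if result.get("success"):
                return result
            print(f"Attempt {attempt} returned an unsuccessful result")
        except Exception as exc:  # e.g., network errors or rate limits
            print(f"Attempt {attempt} failed: {exc}")
        time.sleep(delay_seconds)
    raise RuntimeError("Extraction failed after all retry attempts")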
Extract vs Other Endpoints
Firecrawl offers three main endpoints for gathering web data: /extract, /scrape, and /crawl. Each serves different use cases and has unique strengths. Let's explore when to use each one.
Extract vs Scrape
The /scrape endpoint is best for:
- Single-page detailed extraction
- Getting multiple output formats (HTML, markdown, screenshots)
- JavaScript-heavy websites requiring browser rendering
- Capturing visual elements or taking screenshots
# Scrape endpoint example - multiple formats
result = app.scrape_url(
    "https://example.com",
    params={
        "formats": ["html", "markdown", "screenshot", "extract"],
        "extract": {
            "schema": MySchema.model_json_schema(),
            "prompt": "Extract product information"
        }
    }
)
The /extract endpoint is better for:
- Processing multiple URLs efficiently
- Website-wide data extraction
- Complex structured data extraction
- Building data enrichment pipelines
# Extract endpoint example - multiple URLs
result = app.extract(
    urls=["https://example.com/*"],
    params={
        "prompt": "Extract all product information",
        "schema": ProductSchema.model_json_schema()
    }
)
Note that extract functionality is built into the scrape_url function and usually returns structured results identical to extract. The difference is that scrape_url is designed for single pages or for cases where you also need other output formats.
Extract vs Crawl
The /crawl endpoint excels at:
- Discovering and following links automatically
- Complete website archiving
- Sitemap generation
- Sequential page processing
# Crawl endpoint example
result = app.crawl_url(
    "https://example.com",
    params={
        "limit": 5,
        "includePaths": "*/blog/*",
        "excludePaths": "*/author/*",
        "scrapeOptions": {
            "formats": ["markdown", "html"],
            "includeTags": ["code", "#page-header"],
            "excludeTags": ["h1", "h2", ".main-content"]
        }
    }
)
The /extract endpoint is preferable when:
- You know exactly which URLs to process
- You need specific structured data from known pages
- You want parallel processing of multiple URLs
Choosing the Right Endpoint
Here’s a decision framework for choosing between endpoints:
- Use /extract when:
  - You need structured data from known URLs
  - You want to process multiple pages in parallel
  - You need to combine data from different websites
  - You have a specific schema for the data you want
- Use /scrape when:
  - You need multiple output formats
  - The website requires JavaScript rendering
  - You need screenshots or visual elements
  - You're dealing with single pages that need detailed extraction
- Use /crawl when:
  - You need to discover pages automatically
  - You want to archive entire websites
  - You need to follow specific URL patterns
  - You're building a complete site map
Ultimately, the extraction functionality is built into all endpoints, allowing you to choose the most appropriate one for your specific use case while maintaining consistent data extraction capabilities.
For more information, refer to our detailed guides on the scrape, crawl, and map endpoints.
Conclusion
The /extract endpoint combines AI-driven understanding with structured data validation to solve common web scraping challenges. Instead of maintaining brittle HTML selectors or writing custom parsing logic, developers can describe their data needs in plain English while ensuring consistent output through schema validation. This approach works particularly well for projects that need to gather structured data from multiple sources or websites that frequently change their layouts.
For those new to the endpoint, we recommend starting with simple schemas and single URLs before tackling more complex extractions. This helps in understanding how the AI interprets your prompts and how different schema designs affect the extraction results. As you become more comfortable with the basics, you can explore advanced features like website-wide extraction, nested schemas, and data enrichment patterns to build more sophisticated data pipelines.