Mar 17, 2025 • Bex Tuychiev

Converting Entire Websites into Agents with Firecrawl's LLMs.txt Endpoint and OpenAI Agents SDK

Introduction

Imagine turning any website into a conversational assistant you can simply chat with to access information. This article explores how to combine Firecrawl’s llms.txt endpoint with the new OpenAI Agents SDK to make this possible. The llms.txt endpoint extracts web content in an AI-friendly format, efficiently capturing meaningful information while filtering out navigation elements, ads, and other distractions that traditional web scraping struggles with.

By processing this extracted content through the OpenAI Agents SDK, we can create interactive agents that understand questions and respond with accurate information based on the website’s content. Whether you’re working with documentation, knowledge bases, or company websites, this approach makes information more accessible and engaging. In this article, you’ll get a step-by-step breakdown of building this application from extraction to interface design, providing you with all the tools needed to create your own website-to-agent converter.

Demo of a website converted to an interactive agent using Firecrawl and OpenAI Agents SDK

What is Firecrawl’s LLMs.txt endpoint?

When working with large language models, one of the biggest challenges is providing them with clean, structured content from websites. Firecrawl’s LLMs.txt endpoint solves this problem: it is a specialized API service that transforms website content into a format optimized for LLMs. The service crawls a site, extracts the meaningful content, and filters out distracting elements such as navigation bars, advertisements, and other web clutter. The result is clean, structured text that LLMs can process efficiently.

What is llms.txt in the first place?

To fully appreciate Firecrawl’s solution, it helps to understand the underlying standard it implements. The llms.txt standard is a forward-thinking proposal developed by Jeremy Howard and published in September 2024 on llmstxt.org. This initiative suggests that websites add a /llms.txt markdown file to their root directories, providing concise background information, guidance, and links to detailed markdown files that LLMs can easily process.

This standard directly addresses a fundamental limitation of LLMs: their context windows cannot accommodate most websites in their entirety. By providing standardized, clean markdown versions of important content, website owners help LLMs access critical information without wrestling with HTML, JavaScript, and the other web technologies that traditionally make content extraction difficult.

Following a precise structure, the llms.txt format organizes information in a specific order: beginning with a title (H1), followed by an optional description in a blockquote, optional details, and methodically organized lists of links to additional resources. For scenarios requiring shorter contexts, some links can be marked as “Optional.” This thoughtfully designed structure enables both LLMs and conventional programming tools to process information with maximum efficiency.
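
To make the format concrete, here is a minimal, hypothetical llms.txt file (the project name and URLs are placeholders):

# Example Project

> A one-sentence description of what the site covers.

Key background details an LLM should know before following the links below.

## Docs

- [Quickstart](https://example.com/docs/quickstart.md): How to get up and running.
- [API Reference](https://example.com/docs/api.md): Full endpoint documentation.

## Optional

- [Changelog](https://example.com/changelog.md): Can be skipped when context is limited.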

How to use Firecrawl’s generate_llms_text endpoint?

Having understood the concept behind llms.txt, let’s explore a practical implementation using Firecrawl’s API. In this walkthrough, we’ll extract content from Python’s documentation website in a format optimized for LLMs. Our first step is installing the required package and configuring our client:

# Install the package
# !pip install firecrawl

# Import and initialize the client
from firecrawl import FirecrawlApp
from dotenv import load_dotenv
import os

load_dotenv()

# Set up the client with your API key
firecrawl = FirecrawlApp(api_key=os.environ.get("FIRECRAWL_API_KEY"))
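
This assumes a .env file in the working directory that holds your key (placeholder value shown):

FIRECRAWL_API_KEY=fc-your-api-key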

With our environment properly configured, we can now use the synchronous version of the LLMs.txt generator to process Python’s documentation:

# Define parameters for the generation process
params = {
    "maxUrls": 5,  # Limit to 5 pages for faster processing
    "showFullText": True  # Get both summary and full text
}

# Generate LLMs.txt for Python's documentation site
results = firecrawl.generate_llms_text(
    url="https://docs.python.org/3/tutorial/",
    params=params
)

Once the processing finishes, our next step is to verify success and extract the valuable content:

# Check if the generation was successful
if results['success']:
    # Extract the concise llms.txt content
    summary_content = results['data']['llmstxt']
    
    # Extract the detailed llms-full.txt content if available
    full_content = results['data'].get('llmsfulltxt', '')
    
    print(summary_content)
else:
    print(f"Error: {results.get('error', 'Unknown error')}")
    exit()

# https://docs.python.org/3/tutorial/ llms.txt

- [Whetting Your Appetite](https://docs.python.org/3/tutorial/appetite.html): Introduction to Python programming and its capabilities.
- [Virtual Environments Guide](https://docs.python.org/3/tutorial/venv.html): Learn to create and manage Python virtual environments effectively.
- [Interactive Input Editing](https://docs.python.org/3/tutorial/interactive.html): Explore Python's interactive input editing and history features.
- [Python Standard Library Overview](https://docs.python.org/3/tutorial/stdlib.html): Explore Python's standard library modules and functionalities.
- [Python Programming Tutorial](https://docs.python.org/3/tutorial/index.html): Comprehensive guide to learning Python programming language basics.

Having successfully extracted the content, we can now preserve it for future use by saving it to files:

# Save the extracted content to files
with open("python_docs_summary.md", "w") as f:
    f.write(summary_content)

with open("python_docs_full.md", "w") as f:
    f.write(full_content)

Now comes the truly exciting part – putting this extracted content to work. One powerful application is feeding it directly to an LLM to create an intelligent documentation assistant:

import openai  # assumes OPENAI_API_KEY is set in the environment

# Example: Create a simple function to query the documentation
def query_python_docs(question, content=summary_content):
    """
    Use an LLM to answer a question based on the extracted Python documentation.
    This is a simplified example - in practice you might use embeddings,
    vector search, or other techniques for larger content.
    """
    
    # Truncate to the first 4,000 characters to keep the prompt small
    prompt = f"""
    You are a Python documentation assistant. Answer the following question
    based on the Python documentation provided below:

    QUESTION: {question}

    PYTHON DOCUMENTATION:
    {content[:4000]}
    """
    
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.choices[0].message.content

# Example usage
answer = query_python_docs("How do I write a basic for loop in Python?")
print("\nQuery result:")
print(answer)

Query result:
To write a basic for loop in Python, you can use the following syntax:

for variable in iterable:
    # Code to execute for each item in the iterable

Here's a simple example of a for loop that iterates over a list of numbers:

numbers = [1, 2, 3, 4, 5]
for number in numbers:
    print(number)

In this example, the loop will print each number from the numbers list. You can replace iterable with any collection type, such as lists, tuples, strings, or ranges.

This example demonstrates how Firecrawl’s synchronous generate_llms_text method extracts structured content from Python’s documentation. The clean, LLM-friendly format dramatically simplifies building specialized tools like documentation assistants or code helpers, with no need to wrestle with the HTML parsing and content filtering that traditionally plague web scraping projects.

Breaking down the website-to-agent application

Converting websites into interactive agents involves several carefully designed components working together in a cohesive system. Let’s examine each component of the application to understand how they function and why specific design decisions were made.

The knowledge extraction engine: agents.py

Domain knowledge structuring and agent creation

The agents.py module forms the cognitive core of our application, transforming raw website content into structured knowledge and conversational agents.

# ... existing code ...
async def extract_domain_knowledge(content: str, url: str) -> DomainKnowledge:
    """
    Extract structured domain knowledge from website content.
    """
    # Create knowledge extraction agent
    knowledge_extractor = Agent(
        name="Knowledge Extractor",
        instructions="""Extract comprehensive domain knowledge from the provided website content.
        Identify:
        1. Core concepts and their relationships
        2. Specialized terminology and definitions
        3. Key insights and principles
        
        For each concept, assess its centrality/importance to the domain.
        For terminology, provide clear definitions and examples when available.
        For insights, evaluate confidence based on how explicitly they're stated.
        
        Structure everything according to the output schema.
        """,
        output_type=DomainKnowledge,
        model="gpt-4o-mini",
        model_settings=ModelSettings(
            temperature=0.2,  # Low temperature for more deterministic extraction
            max_tokens=4096,  # Allow space for comprehensive knowledge extraction
        )
    )
    # ... existing code ...
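
The elided remainder of the function presumably runs this agent over the extracted content and returns its structured output. A minimal sketch under that assumption (the helper shape and prompt framing are mine, not necessarily the author's):

from agents import Runner

async def run_extraction(knowledge_extractor, content: str, url: str):
    # output_type=DomainKnowledge makes the SDK parse the response
    # into the structured model automatically
    result = await Runner.run(
        knowledge_extractor,
        f"Source URL: {url}\n\nWEBSITE CONTENT:\n{content}",
    )
    return result.final_output  # a DomainKnowledge instance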

This implementation uses a structured approach to knowledge extraction rather than simply treating website content as raw context. This provides several advantages:

  1. Explicit knowledge structure: By organizing information into concepts, terminology, and insights, the system creates a more navigable knowledge base compared to unstructured text dumps.

  2. Prioritization: The implementation assigns importance scores to concepts and confidence scores to insights, allowing the agent to prioritize information when responding to queries.

  3. Relationship mapping: The system explicitly captures relationships between concepts, enabling more sophisticated reasoning about the domain.

The low temperature setting (0.2) ensures more deterministic and reliable extraction, making the resulting knowledge more consistent across different runs and websites.

The content acquisition system: llms_text.py

The llms_text.py module handles communication with the Firecrawl API to extract clean, structured content from websites.

# ... existing code ...
def extract_website_content(url: str, max_urls: int = 10, show_full_text: bool = True) -> Dict:
    """
    Extract website content using Firecrawl's LLMs.txt API.
    """
    # Initialize the client
    firecrawl = FirecrawlApp(api_key=FIRECRAWL_API_KEY)
    
    # Define generation parameters
    params = {
        "maxUrls": max_urls,
        "showFullText": show_full_text
    }
    
    # Generate LLMs.txt with async processing and polling
    job = firecrawl.async_generate_llms_text(
        url=url,
        params=params
    )
    # ... existing code ...
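
The elided remainder presumably polls the job until it finishes. A minimal sketch, assuming the SDK exposes a companion check_generate_llms_text_status method (the helper name and polling interval are mine):

import time

def wait_for_llms_text(firecrawl, job_id: str, interval: float = 2.0) -> dict:
    """Poll the generation job until it completes or fails."""
    while True:
        status = firecrawl.check_generate_llms_text_status(job_id)
        if status.get("status") == "completed":
            return status["data"]
        if status.get("status") == "failed":
            raise RuntimeError(status.get("error", "LLMs.txt generation failed"))
        time.sleep(interval)  # back off briefly before polling again

# Usage: data = wait_for_llms_text(firecrawl, job["id"])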

This module uses Firecrawl’s asynchronous API with polling rather than a simple synchronous request. This design choice offers several benefits:

  1. Scalability: By using an asynchronous job-based approach, the application can handle larger websites without timeout issues.

  2. User experience: The polling approach allows for progress updates to be shown to the user while extraction is in progress.

  3. Flexibility: The implementation supports both concise (llmstxt) and comprehensive (llmsfulltxt) extraction modes, allowing users to balance between speed and completeness.

The choice to use Firecrawl’s specialized LLMs.txt endpoint rather than building a custom web scraper significantly reduces complexity while improving extraction quality. The endpoint automatically handles navigation elements, ads, and other noise that would require complex filtering in a custom solution.

Data structures and type safety: models.py

The models.py module defines the structured data types that flow through the application using Pydantic.

# ... existing code ...
class Concept(BaseModel):
    name: str
    description: str
    related_concepts: List[str]
    importance_score: float  # 0.0-1.0 indicating centrality

class Terminology(BaseModel):
    term: str
    definition: str
    context: Optional[str]
    examples: List[str]

class Insight(BaseModel):
    content: str
    topics: List[str]
    confidence: float  # 0.0-1.0 indicating confidence
# ... existing code ...

These structured models provide several key advantages:

  1. Type safety: The use of Pydantic ensures that data is always properly formatted and validated throughout the application.

  2. Self-documenting code: The explicit structure makes it immediately clear what information is expected and how it should be formatted.

  3. Integration with OpenAI Agents: The models serve as output schemas for the agent framework, enabling structured extraction directly from LLM outputs.

  4. Extensibility: New attributes can be added to these models without breaking existing functionality, allowing the application to evolve over time.

The choice to use importance_score and confidence fields exemplifies how the application incorporates uncertainty handling and prioritization directly into its data models.
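
The excerpt doesn't show the top-level DomainKnowledge model that agents.py uses as its output type; a plausible aggregate simply composes the three models above (field names are assumptions):

from typing import List
from pydantic import BaseModel

class DomainKnowledge(BaseModel):
    domain_name: str
    core_concepts: List[Concept]
    terminology: List[Terminology]
    key_insights: List[Insight]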

Configuration management: config.py

The config.py module handles environment variables and default settings in a clean, centralized way.

# ... existing code ...
# Default settings
DEFAULT_MAX_URLS = 10
DEFAULT_USE_FULL_TEXT = True
DEFAULT_MODEL = "gpt-4o"
DEFAULT_TEMPERATURE = 0.3
DEFAULT_MAX_TOKENS = 1024

# Ensure API keys are available
if not OPENAI_API_KEY:
    raise ValueError("OPENAI_API_KEY is not set in environment variables or .env file")
# ... existing code ...
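
The elided lines presumably load these keys from the environment; a minimal sketch under that assumption:

import os
from dotenv import load_dotenv

# Load variables from a local .env file, if present
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY")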

This separation of configuration from implementation provides several benefits:

  1. Environment independence: The application can easily run in different environments (development, testing, production) by changing environment variables.

  2. Early validation: The module validates required API keys at startup rather than failing at runtime when they’re first used.

  3. Single source of truth: Default values are defined in one place, making it easy to adjust application behavior without searching through code.

  4. Security: API keys are never hardcoded in application logic, reducing security risks.

The user experience layer: ui.py

Creating an accessible interface with real-time features

The ui.py module delivers a sophisticated yet user-friendly interface using Streamlit. This component represents the most complex part of the application, coordinating state management, asynchronous operations, and real-time UI updates.

Session state management

One critical aspect of the UI implementation is proper session state management, which ensures the application maintains conversation history and agent state across Streamlit reruns:

def init_session_state():
    if 'domain_agent' not in st.session_state:
        st.session_state.domain_agent = None
    if 'domain_knowledge' not in st.session_state:
        st.session_state.domain_knowledge = None
    if 'messages' not in st.session_state:
        st.session_state.messages = []
    if 'extraction_status' not in st.session_state:
        st.session_state.extraction_status = None
    if 'pending_response' not in st.session_state:
        st.session_state.pending_response = None

This approach creates a reliable foundation for state persistence, handling several critical variables:

  • domain_agent: The agent created from website knowledge
  • domain_knowledge: Structured representation of the website
  • messages: Conversation history
  • extraction_status: Current state of content extraction
  • pending_response: Response waiting to be added to the chat history

Progressive disclosure in the main UI flow

The main app flow uses progressive disclosure to guide users through the agent creation process:

def run_app():
    # Initialize session state
    init_session_state()
    
    # Check if we have a pending response to add to the message history
    if st.session_state.pending_response is not None:
        st.session_state.messages.append({"role": "assistant", "content": st.session_state.pending_response})
        st.session_state.pending_response = None
    
    # App title and description in main content area
    st.title("WebToAgent")
    st.subheader("Extract domain knowledge from any website and create specialized AI agents.")
    
    # Display welcome message using AI chat message component
    if not st.session_state.domain_agent:
        with st.chat_message("assistant"):
            st.markdown("👋 Welcome! Enter a website URL in the sidebar, and I'll transform it into an AI agent you can chat with.")

This design provides several advantages:

  1. User guidance: New users receive clear instructions instead of being overwhelmed by options
  2. Contextual interface: The UI adjusts based on the current application state
  3. Immediate feedback: The welcome message appears in a chat bubble, previewing the conversation format

Website processing and agent creation

The UI module orchestrates the process of website extraction and agent creation with detailed progress updates:

# Process form submission
if submit_button and website_url:
    st.session_state.extraction_status = "extracting"
    
    try:
        with st.spinner("Extracting website content with Firecrawl..."):
            content = extract_website_content(
                url=website_url, 
                max_urls=max_pages,
                show_full_text=use_full_text
            )
            
            # Show content sample
            with st.expander("View extracted content sample"):
                st.text(content['llmstxt'][:1000] + "...")
            
        # Process content to extract knowledge
        with st.spinner("Analyzing content and generating knowledge model..."):
            domain_knowledge = asyncio.run(extract_domain_knowledge(
                content['llmstxt'] if not use_full_text else content['llmsfulltxt'],
                website_url
            ))
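
The excerpt stops before the agent itself is created. A plausible continuation stores the knowledge and builds the chat agent from it (the instructions string and status value are assumptions, not the author's code):

# Persist results and build the chat agent from the extracted knowledge
st.session_state.domain_knowledge = domain_knowledge
st.session_state.domain_agent = Agent(
    name="Domain Expert",
    instructions=(
        "Answer questions using the following domain knowledge:\n"
        + domain_knowledge.model_dump_json()  # Pydantic v2 serialization
    ),
    model="gpt-4o",
)
st.session_state.extraction_status = "complete"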

This implementation:

  1. Provides transparency: Each processing step is clearly communicated
  2. Offers visibility: Users can view samples of the extracted content
  3. Handles asynchronous operations: Properly manages async code in Streamlit’s synchronous environment

Real-time token streaming architecture

Perhaps the most advanced aspect of the UI implementation is the token streaming system, which provides real-time responses while working within Streamlit’s reactive programming model:

def stream_agent_response(agent, prompt):
    """Stream agent response using a background thread and a queue for real-time token streaming."""
    # Create a queue to transfer tokens from async thread to main thread
    token_queue = queue.Queue()
    
    # Flag to signal when the async function is complete
    done_event = threading.Event()
    
    # Create a shared variable to collect the complete response
    response_collector = []

The streaming implementation uses multiple coordinated techniques:

# The thread function to run the async event loop
def run_async_loop():
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    
    async def process_stream():
        token_count = 0
        try:
            result = Runner.run_streamed(agent, prompt)
            
            # Process all stream events
            async for event in result.stream_events():
                # Only handle text delta events
                if (event.type == "raw_response_event" and 
                    isinstance(event.data, ResponseTextDeltaEvent) and 
                    event.data.delta):
                    # Put the token in the queue
                    token = event.data.delta
                    token_queue.put(token)
                    
                    # Safely append to collector (no session state access)
                    response_collector.append(token)
                    token_count += 1
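
The excerpt ends mid-stream. On the main thread, a small generator presumably drains the queue and yields tokens to Streamlit; a sketch under that assumption (the generator would be returned by stream_agent_response for st.write_stream to consume):

def token_generator():
    """Yield tokens from the queue until the background thread finishes."""
    while not (done_event.is_set() and token_queue.empty()):
        try:
            yield token_queue.get(timeout=0.1)
        except queue.Empty:
            continue  # no token yet; re-check the done flag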

This sophisticated approach offers several critical benefits:

  1. Background thread processing: Allows async operations to run separately from the main Streamlit thread
  2. Queue-based coordination: Safely transfers tokens between threads
  3. State preservation: Maintains response state across Streamlit reruns
  4. Event-based design: Cleanly handles streaming events from the OpenAI API

Chat interface implementation

The chat interface integrates streaming responses with user input:

def display_chat_interface():
    """Display chat interface for interacting with the domain agent."""
    # Display chat history
    for message in st.session_state.messages:
        with st.chat_message(message["role"]):
            st.markdown(message["content"])
    
    # Chat input
    if prompt := st.chat_input("Ask a question about this domain..."):
        # Add user message to chat history
        st.session_state.messages.append({"role": "user", "content": prompt})
        
        # Display user message
        with st.chat_message("user"):
            st.markdown(prompt)

And for the actual response streaming:

# Get agent response with streaming
with st.chat_message("assistant"):
    try:
        # Stream the response tokens
        token_stream = stream_agent_response(st.session_state.domain_agent, prompt)
        st.write_stream(token_stream)
        
    except Exception as e:
        # Fallback to non-streaming response if streaming fails
        st.warning(f"Streaming failed ({str(e)}), using standard response method.")
        try:
            full_response = get_non_streaming_response(st.session_state.domain_agent, prompt)
            st.markdown(full_response)
        except Exception as e2:
            st.error(f"Error generating response: {str(e2)}")
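
The fallback helper isn't shown in the excerpt. Given the Agents SDK, a plausible implementation is a thin wrapper around Runner.run_sync:

from agents import Runner

def get_non_streaming_response(agent, prompt: str) -> str:
    # Run the agent to completion without streaming and return the text
    result = Runner.run_sync(agent, prompt)
    return str(result.final_output)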

This interface design offers:

  1. Seamless interactivity: The chat experience feels natural and responsive
  2. Robustness: Comprehensive error handling with fallback mechanisms
  3. Real-time feedback: Users see responses as they’re generated, enhancing engagement

The UI module’s implementation demonstrates a deep understanding of the challenges involved in building interactive AI interfaces, with careful attention to state management, asynchronous processing, and error resilience.

Putting it all together: app.py

The app.py script serves as the entry point, configuring the Streamlit environment and launching the UI:

# ... existing code ...
# Set page config
st.set_page_config(
    page_title="KnowledgeForge",
    page_icon="🧠",
    layout="wide",
    initial_sidebar_state="expanded"
)
# ... existing code ...

While simple, this module makes important configuration choices:

  1. Wide layout: The use of “wide” layout maximizes screen real estate for both the chat interface and website content display.

  2. Expanded sidebar: Starting with the sidebar expanded ensures that users immediately see the URL input and configuration options.

  3. Consistent branding: The page title and icon establish a consistent identity for the application.

This minimalist approach to the entry point reflects the principle of separation of concerns, keeping the bootstrapping logic separate from the application implementation.

Architecture and design philosophy

Looking across these components, we can see several guiding principles:

  1. Separation of concerns: Each module has a clear responsibility, from data extraction to knowledge structuring to user interaction.

  2. Type safety: The use of Pydantic models and type hints throughout ensures reliable data flow between components.

  3. User-centered design: From the streaming responses to the progressive disclosure of features, the application prioritizes user experience.

  4. Robustness: Error handling, fallback mechanisms, and validation checks are implemented throughout to ensure reliability.

The application demonstrates how specialized tools (Firecrawl’s LLMs.txt, OpenAI’s Agents SDK) can be combined to create a system that’s greater than the sum of its parts, turning passive website content into interactive knowledge agents.

Conclusion and next steps

The website-to-agent application represents a powerful fusion of web extraction, knowledge modeling, and conversational AI technologies. By dissecting its components, we’ve seen how careful architectural decisions and attention to user experience create a seamless process for converting static web content into interactive knowledge agents. The separation of concerns across modules, robust error handling, and sophisticated streaming implementations all contribute to a resilient and extensible system.

Looking forward, this application could be enhanced by incorporating multi-modal capabilities that process images and videos alongside text content. Adding memory persistence would enable agents to retain conversation history between sessions, while implementing a knowledge graph visualization would help users understand how concepts in the website relate to each other. The real potential lies in domain adaptation - training specialized extraction models for specific industries like healthcare, finance, or education would dramatically improve knowledge quality for those domains.

Ready to build your own website-to-agent application? Sign up for Firecrawl at https://firecrawl.dev and explore the LLMs.txt endpoint that powers this entire system. In just minutes, you can extract clean, structured content from any website without worrying about complex scraping logic or content filtering. Whether you’re building a question-answering system, a domain-specific chatbot, or a comprehensive knowledge base, Firecrawl’s API provides the foundation you need to transform web content into valuable, interactive experiences.


About the Author

Bex Tuychiev (@bextuychiev)

Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics.
