Jan 12, 2025 • Bex Tuychiev

How to Build a Bulk Sales Lead Extractor in Python Using AI

Introduction

Sales teams waste a lot of time gathering lead information from websites by hand. The Sales Lead Extractor app we are going to build in this article makes this task much faster by combining AI-powered web scraping with a simple interface. Sales professionals can upload a list of website URLs and pick what data they want to collect. The app then uses Streamlit and Firecrawl’s API to automatically gather that information.

The best part about this app is how flexible and simple it is to use. Users aren’t stuck with preset data fields - they can choose exactly what information they need, like company names, contact details, or team sizes. The app’s AI technology reads through websites and turns the data into a clean Excel file. What used to take hours of copying and pasting can now be done in just a few minutes.

Let’s dive into building this powerful lead extraction tool step by step, starting with covering the prerequisite concepts.

Prerequisites

Before we dive into building the Sales Lead Extractor, make sure you have the following prerequisites in place:

  1. Python Environment Setup

    • Python 3.8 or higher installed
    • A code editor of your choice
  2. Required Accounts

    • A Firecrawl account with an API key (sign up at https://firecrawl.dev)
    • GitHub account (if you plan to deploy the app)

Don’t worry if you are completely new to Firecrawl as we will cover its basics in the next section.

  3. Technical Knowledge

    • Basic understanding of Python programming
    • Familiarity with web concepts (URLs, HTML)
    • Basic understanding of data structures (JSON, CSV)
  4. Project Dependencies

    • Streamlit for the web interface
    • Firecrawl for AI-powered web scraping
    • Pydantic for data validation
    • Pandas for data manipulation

Firecrawl Basics

Firecrawl is an AI-powered web scraping API that takes a different approach from traditional scraping libraries. Instead of relying on HTML selectors or XPaths, it uses natural language understanding to identify and extract content. Here’s a simple example:

from firecrawl import FirecrawlApp
from dotenv import load_dotenv
from pydantic import BaseModel

load_dotenv()

class CompanyInfo(BaseModel):
    company: str
    email: str

app = FirecrawlApp()
schema = CompanyInfo.model_json_schema()

data = app.scrape_url(
    "https://openai.com",
    params={
        "formats": ["extract"],
        "extract": {
            "prompt": "Find the company name and contact email",
            "schema": schema
        }
    }
)

company_info = CompanyInfo(**data["extract"])
print(company_info)  # Shows validated company info

In this example, we:

  1. Define a Pydantic model CompanyInfo to structure the scraping process
  2. Initialize the Firecrawl client
  3. Convert the Pydantic model to a JSON schema that Firecrawl understands
  4. Make an API call to extract company info from openai.com using natural language

The key advantage of Firecrawl is that we can describe what we want to extract in plain English (“Find the company name and contact email”) rather than writing complex selectors. The AI understands the intent and returns structured data matching our schema.
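
To see this flexibility in action, here is a minimal sketch that reuses the scrape_url call pattern from the example above. The CompanyInfoExtended model, its description field, and the prompt wording are illustrative additions, not part of the original snippet:

from pydantic import BaseModel

# Reuses the FirecrawlApp client (`app`) created in the example above
class CompanyInfoExtended(BaseModel):
    company: str
    email: str
    description: str  # hypothetical extra field: a short blurb about the company

data = app.scrape_url(
    "https://openai.com",
    params={
        "formats": ["extract"],
        "extract": {
            # The "selector" is still plain English - only the prompt and schema change
            "prompt": "Find the company name, contact email, and a one-sentence description of what the company does",
            "schema": CompanyInfoExtended.model_json_schema(),
        },
    },
)

print(CompanyInfoExtended(**data["extract"]))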

Now that we understand how Firecrawl’s AI-powered extraction works, let’s build a complete sales lead extraction tool that leverages these capabilities at scale.

Project Setup

Let’s start by setting up our development environment and installing the necessary dependencies.

  1. Create a working directory

First, create a working directory and initialize a virtual environment:

mkdir sales-lead-extractor
cd sales-lead-extractor
python -m venv venv
source venv/bin/activate  # On Windows use: venv\Scripts\activate

  2. Install Dependencies

We’ll use Poetry for dependency management. If you haven’t installed Poetry yet:

curl -sSL https://install.python-poetry.org | python3 -

Then, initialize it inside the current working directory:

poetry init

Type “^3.10” when asked for the Python version, but don’t specify the dependencies interactively.

Next, install the project dependencies with the add command:

poetry add streamlit firecrawl-py pandas pydantic openpyxl python-dotenv

  3. Build the project structure
mkdir data src
touch .gitignore README.md .env src/{app.py,models.py,scraper.py}

  4. Configure environment variables

Inside the .env file in the root directory, add your Firecrawl API key:

FIRECRAWL_API_KEY=your_api_key_here

  5. Start the app UI

Run the Streamlit app (which is blank for now) to ensure everything is working:

poetry run streamlit run src/app.py

You should see the Streamlit development server start up and your default browser open to the app’s interface. Keep this tab open to see the changes we make to the app in the next steps.

Building the Lead Extraction App Step-by-Step

We will take a top-down approach to building the app: starting with the high-level UI components and user flows, then implementing the underlying functionality piece by piece. This approach will help us validate the app’s usability early and ensure we’re building exactly what users need.

Step 1: Adding a file upload component

We start with the following imports at the top of src/app.py:

import streamlit as st
import pandas as pd
from typing import List, Dict

Then, we will define the main function. Inside, we write the page title and subtitle and add a component for file uploads:

def main():
    st.title("Sales Lead Extractor")
    st.write(
        "Upload a file with website URLs of your leads and define the fields to extract"
    )

    # File upload
    uploaded_file = st.file_uploader("Choose a file", type=["csv", "txt"])

The file uploader component allows users to upload either CSV or TXT files containing URLs. We’ll need a helper function to parse these files and extract the URLs.

# Paste the function after the imports but before the main() function
def load_urls(uploaded_file) -> List[str]:
    """Load URLs from uploaded file"""
    if uploaded_file.name.endswith(".csv"):
        df = pd.read_csv(uploaded_file, header=None)
        return df.iloc[:, 0].tolist()
    else:
        content = uploaded_file.getvalue().decode()
        return [url.strip() for url in content.split("\n") if url.strip()]

The load_urls() function handles both CSV and TXT file formats:

  • For CSV files: It assumes URLs are in the first column and uses pandas to read them
  • For TXT files: It reads the content and splits by newlines to get individual URLs

In both cases, it returns a clean list of URLs with any empty lines removed. Note that the function expects files in the following format:

https://website1.com
https://website2.com
https://website3.com

For simplicity, we skip security best practices such as validating the format of the URLs and the uploaded files themselves.
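
If you do want a basic safeguard, a minimal sketch might look like the following. The is_probably_url helper is a hypothetical name (not part of Streamlit or Firecrawl); it simply filters out lines that don’t look like http(s) URLs:

from urllib.parse import urlparse

def is_probably_url(value: str) -> bool:
    """Hypothetical helper: return True if the string looks like an http(s) URL."""
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

# Example usage after loading the file:
# urls = [url for url in load_urls(uploaded_file) if is_probably_url(url)]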

We can now use this in our main function to process the uploaded file:

# Continue the main function
def main():
    ...

    if uploaded_file:
        urls = load_urls(uploaded_file)
        st.write(f"Loaded {len(urls)} URLs")

This gives users immediate feedback about how many URLs were successfully loaded from their file.

Now, add this to the end of src/app.py:

if __name__ == "__main__":
    main()

Step 2: Adding a form to define fields interactively

Once a user uploads their file, they must be presented with another form to specify what information they are looking to extract from the leads. To add this functionality, continue src/app.py with the following code block:

def main():
    # ... the rest of the function

    if uploaded_file:
        urls = load_urls(uploaded_file)
        st.write(f"Loaded {len(urls)} URLs")

        # <THIS PART IS NEW> #
        # Dynamic field input
        st.subheader("Define Fields to Extract")

        fields: Dict[str, str] = {}
        col1, col2 = st.columns(2)

        with col1:
            num_fields = st.number_input(
                "Number of fields", min_value=1, max_value=10, value=3
            )

        for i in range(num_fields):
            col1, col2 = st.columns(2)
            with col1:
                field_name = st.text_input(f"Field {i+1} name", key=f"name_{i}")

                # Convert field_name to lower snake case
                field_name = field_name.lower().replace(" ", "_")

            with col2:
                field_desc = st.text_input(f"Field {i+1} description", key=f"desc_{i}")

            if field_name and field_desc:
                fields[field_name] = field_desc

Here’s what’s happening:

  1. We create a dictionary called fields to store the field definitions
  2. Using Streamlit’s column layout, we first ask users how many fields they want to extract (1-10)
  3. For each field, we create two columns:
    • Left column: Input for the field name (automatically converted to snake_case)
    • Right column: Input for field description that explains what to extract
  4. Valid field name/description pairs are added to the fields dictionary
  5. This fields dictionary will later be used to:
    • Create a dynamic Pydantic model for validation
    • Guide the AI in extracting the right information from each website

The form is responsive and updates in real-time as users type, providing a smooth experience for defining extraction parameters.

Example inputs users might provide:

Field 1:

  • Name: company_name
  • Description: Extract the company name from the website header or footer

Field 2:

  • Name: employee_count
  • Description: Find the number of employees mentioned on the About or Company page

Field 3:

  • Name: tech_stack
  • Description: Look for technology names mentioned in job postings or footer

Field 4:

  • Name: contact_email
  • Description: Find the main contact or sales email address

Field 5:

  • Name: industry
  • Description: Determine the company’s primary industry from their description
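
With these example inputs, the fields dictionary built by the form would hold something like the following. This is just a sketch of the in-memory structure that gets passed to the extraction step:

# Roughly what `fields` contains after the example inputs above
fields = {
    "company_name": "Extract the company name from the website header or footer",
    "employee_count": "Find the number of employees mentioned on the About or Company page",
    "tech_stack": "Look for technology names mentioned in job postings or footer",
    "contact_email": "Find the main contact or sales email address",
    "industry": "Determine the company's primary industry from their description",
}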

Once the user fills out the fields and descriptions, they must click a button that kicks off the extraction process under the hood:

# Continuing main()
def main():
    ...

    if uploaded_file:
        ...

        if st.button("Start Extraction") and fields:
            with st.spinner(
                "Extracting data. This may take a while, so don't close the window."
            ):
                pass

Currently, nothing happens when the user clicks the button, but we will build up the functionality in the next steps.

Step 3: Building a dynamic Pydantic model from input fields

Let’s convert the fields and descriptions provided through the Streamlit UI into a Pydantic model inside the src/models.py file:

from pydantic import BaseModel, Field
from typing import Dict

The pydantic BaseModel provides data validation and serialization capabilities, and Field lets us attach helpful metadata, such as descriptions, to our model fields. The Dict type hint annotates the dictionary of user-defined field names and descriptions.

class DynamicLeadModel(BaseModel):
    """Dynamic model for lead data extraction"""
    pass

This DynamicLeadModel(BaseModel) class will dynamically generate Pydantic models based on user-defined fields. Let’s now fill it in with a create_model class method:

class DynamicLeadModel(BaseModel):
    """Dynamic model for lead data extraction"""

    @classmethod
    def create_model(cls, fields: Dict[str, str]) -> type[BaseModel]:
        """
        Create a dynamic Pydantic model based on user-defined fields

        Args:
            fields: Dictionary of field names and their descriptions
        """
        field_annotations = {}
        field_definitions = {}

        for field_name, description in fields.items():
            field_annotations[field_name] = str
            field_definitions[field_name] = Field(description=description)

        # Create new model dynamically
        return type(
            "LeadData",
            (BaseModel,),
            {"__annotations__": field_annotations, **field_definitions},
        )

This class contains a create_model class method that takes a dictionary of field names and descriptions as input (passed through the Streamlit UI). For each field, it creates type annotations and field definitions with metadata. The method then returns a new dynamically created Pydantic model class that will be used by Firecrawl to guide its AI-powered scraping engine while extracting the lead information.
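
To get a feel for what create_model produces, here is a quick sketch that builds a model from two example fields (borrowed from the sample inputs earlier) and inspects the JSON schema Firecrawl will receive:

# Quick sketch: assumes DynamicLeadModel from models.py is in scope
LeadData = DynamicLeadModel.create_model(
    {
        "company_name": "Extract the company name from the website header or footer",
        "contact_email": "Find the main contact or sales email address",
    }
)

schema = LeadData.model_json_schema()
# The "properties" section carries each field's description, which is what
# guides Firecrawl's AI during extraction
print(schema["properties"])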

Step 4: Scraping input URLs with Firecrawl

Let’s move to src/scraper.py to write the scraping functionality:

# src/scraper.py
from firecrawl import FirecrawlApp
from typing import Dict, List
from datetime import datetime
import pandas as pd
from models import DynamicLeadModel

The datetime module is imported to generate unique timestamps for the output files. pandas (imported as pd) is used to create and manipulate DataFrames for storing the scraped lead data and exporting it to Excel.

class LeadScraper:
    def __init__(self):
        self.app = FirecrawlApp()

The LeadScraper class is initialized with a FirecrawlApp instance that provides the connection to the scraping engine. This allows us to make API calls to extract data from the provided URLs.

    # ... continue the class
    async def scrape_leads(self, urls: List[str], fields: Dict[str, str]) -> str:
        """Scrape multiple leads using Firecrawl's batch extraction endpoint"""
        # Create dynamic model
        model = DynamicLeadModel.create_model(fields)

        # Extract data for all URLs at once
        data = self.app.batch_scrape_urls(
            urls,
            params={
                "formats": ["extract"],
                "extract": {"schema": model.model_json_schema()},
            },
        )

        # Process and format the results
        results = [
            {"url": result["metadata"]["url"], **result["extract"]}
            for result in data["data"]
        ]

        # Convert to DataFrame
        df = pd.DataFrame(results)

        # Save results
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"data/leads_{timestamp}.xlsx"
        df.to_excel(filename, index=False)

        return filename

Next, we define a scrape_leads method that takes a list of URLs and field definitions as input. The method creates a dynamic model based on the field definitions, extracts data from all URLs in a single batch request using Firecrawl’s API, processes the results into a DataFrame, and saves them to an Excel file with a timestamp. The method returns the path to the generated Excel file.
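
One thing to note: the list comprehension above assumes every batch result contains an extract payload. A slightly more defensive version, shown here only as an optional sketch, could skip entries where extraction failed:

# Optional sketch: skip batch results without extracted data so a single
# failing URL doesn't break the whole run
results = []
for result in data["data"]:
    extract = result.get("extract")
    if not extract:
        continue  # page could not be scraped or the schema yielded nothing
    results.append({"url": result["metadata"]["url"], **extract})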

Here is a sample output file that may be generated by the function:

| url                        | company_name | employee_count | industry   | contact_email       |
|----------------------------|--------------|----------------|------------|---------------------|
| https://acme.com           | Acme Corp    | 500-1000       | Technology | info@acme.com       |
| https://betainc.com        | Beta Inc     | 100-500        | Healthcare | contact@betainc.com |
| https://gammatech.io       | Gamma Tech   | 1000+          | Software   | sales@gammatech.io  |
| https://deltasolutions.com | Delta Sol    | 50-100         | Consulting | hello@deltasol.com  |

When the “Start Extraction” button is clicked in the UI, the app must create a LeadScraper instance and call its scrape_leads() method asynchronously using asyncio.run(). So, let’s return to src/app.py.

Step 5: Creating a download link with the results

First, update the imports with the following version:

# src/app.py
import streamlit as st
import pandas as pd
from typing import List, Dict
import time
import asyncio
import os
from scraper import LeadScraper
from dotenv import load_dotenv

load_dotenv()

The imports have been updated to include all necessary dependencies for the app:

  • time for tracking execution duration
  • asyncio for asynchronous execution
  • os for file path operations
  • LeadScraper from our custom scraper module
  • load_dotenv for loading environment variables

Next, continue the last if block of the main() function with the following code block:

def main():
    ...

    if uploaded_file:
        ...

        if st.button("Start Extraction") and fields:
            with st.spinner(
                "Extracting data. This may take a while, so don't close the window."
            ):
                start_time = time.time()
                scraper = LeadScraper()

                # Run scraping asynchronously
                result_file = asyncio.run(scraper.scrape_leads(urls, fields))

                elapsed_time = time.time() - start_time
                elapsed_mins = int(elapsed_time // 60)
                elapsed_secs = int(elapsed_time % 60)

                # Show download link
                with open(result_file, "rb") as f:
                    st.download_button(
                        "Download Results",
                        f,
                        file_name=os.path.basename(result_file),
                        mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
                    )
                st.balloons()
                st.success(
                    f"Extraction complete! Time taken: {elapsed_mins}m {elapsed_secs}s"
                )

This section adds the core extraction functionality:

When the “Start Extraction” button is clicked and fields are defined:

  • Shows a loading spinner with message
  • Records start time
  • Creates LeadScraper instance
  • Runs scraping asynchronously using asyncio
  • Calculates elapsed time in minutes and seconds
  • Opens the result file and creates download button
  • Shows success message with time taken
  • Displays celebratory balloons animation

The code handles the entire extraction workflow from triggering the scrape to delivering results to the user, with progress feedback throughout the process.

At this point, the core app functionality is finished. Try it out with a few different CSV files containing potential leads your company has.

In the final step, we will deploy the app to Streamlit Cloud.

Step 6: Deploying the app to Streamlit Cloud

Before deploying, let’s create a requirements.txt file in the root directory:

poetry export -f requirements.txt --output requirements.txt --without-hashes

Next, create a new file called .streamlit/secrets.toml:

FIRECRAWL_API_KEY = "your_api_key_here"

Also, add these to your .gitignore file:

__pycache__/

# Virtual Environment
.env
.venv
venv/
ENV/

# IDE
.idea/
.vscode/

# Project specific
.streamlit/secrets.toml
data/
*.xlsx

Now follow these steps to deploy:

  1. Push to GitHub
git init
git add .
git commit -m "Initial commit"
git branch -M main
git remote add origin https://github.com/yourusername/sales-lead-extractor.git
git push -u origin main

  2. Deploy on Streamlit Cloud
  • Go to share.streamlit.io
  • Click “New app”
  • Connect your GitHub repository
  • Select the main branch and src/app.py as the entry point
  • Add your Firecrawl API key in the Secrets section using the same format as .streamlit/secrets.toml
  • Click “Deploy”

Your app will be live in a few minutes.

Conclusion

We’ve built a powerful lead extraction tool that combines the simplicity of Streamlit with the AI capabilities of Firecrawl. The app allows sales teams to:

  • Upload lists of target websites
  • Define custom data fields to extract
  • Get structured lead data in Excel format
  • Save hours of manual data gathering

Key features implemented:

  • Dynamic field definition
  • Batch URL processing
  • Progress tracking
  • Excel export
  • Cloud deployment

Feel free to extend the app with features like:

  • Authentication
  • Lead scoring
  • CRM integration
  • Custom extraction templates
  • Rate limiting
  • Error handling

The complete source code is available on GitHub.

About the Author

Bex Tuychiev (@bextuychiev)

Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics.
