Jan 12, 2025 • Bex Tuychiev

How to Build a Bulk Sales Lead Extractor in Python Using AI

Introduction

Sales teams waste a lot of time gathering lead information from websites by hand. The Sales Lead Extractor app we are going to build in this article makes this task much faster by combining AI-powered web scraping with a simple interface. Sales professionals can upload a list of website URLs and pick what data they want to collect. The app then uses Streamlit and Firecrawl’s API to automatically gather that information.

The best part about this app is how flexible and simple it is to use. Users aren’t stuck with preset data fields - they can choose exactly what information they need, like company names, contact details, or team sizes. The app’s AI technology reads through websites and turns the data into a clean Excel file. What used to take hours of copying and pasting can now be done in just a few minutes.

Let’s dive into building this powerful lead extraction tool step by step, starting with covering the prerequisite concepts.

Prerequisites

Before we dive into building the Sales Lead Extractor, make sure you have the following prerequisites in place:

  1. Python Environment Setup

    • Python 3.8 or higher installed
    • A code editor of your choice
  2. Required Accounts

    • A Firecrawl account with an API key (sign up at https://firecrawl.dev)
    • GitHub account (if you plan to deploy the app)

Don’t worry if you are completely new to Firecrawl as we will cover its basics in the next section.

  3. Technical Knowledge

    • Basic understanding of Python programming
    • Familiarity with web concepts (URLs, HTML)
    • Basic understanding of data structures (JSON, CSV)
  4. Project Dependencies

    • Streamlit for the web interface
    • Firecrawl for AI-powered web scraping
    • Pydantic for data validation
    • Pandas for data manipulation

Firecrawl Basics

Firecrawl is an AI-powered web scraping API that takes a different approach from traditional scraping libraries. Instead of relying on HTML selectors or XPaths, it uses natural language understanding to identify and extract content. Here’s a simple example:

from firecrawl import FirecrawlApp
from dotenv import load_dotenv
from pydantic import BaseModel

load_dotenv()

class CompanyInfo(BaseModel):
    company: str
    email: str

app = FirecrawlApp()
schema = CompanyInfo.model_json_schema()

data = app.scrape_url(
    "https://openai.com",
    params={
        "formats": ["extract"],
        "extract": {
            "prompt": "Find the company name and contact email",
            "schema": schema
        }
    }
)

company_info = CompanyInfo(**data["extract"])
print(company_info)  # Shows validated company info

In this example, we:

  1. Define a Pydantic model CompanyInfo to structure the scraping process
  2. Initialize the Firecrawl client
  3. Convert the Pydantic model to a JSON schema that Firecrawl understands
  4. Make an API call to extract company info from openai.com using natural language

The key advantage of Firecrawl is that we can describe what we want to extract in plain English (“Find the company name and contact email”) rather than writing complex selectors. The AI understands the intent and returns structured data matching our schema.
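
To see this flexibility in action, here is a minimal sketch that reuses the scrape_url call pattern from the example above. The CompanyInfoExtended model, its description field, and the prompt wording are illustrative additions, not part of the original snippet:

from pydantic import BaseModel

# Reuses the FirecrawlApp client (`app`) created in the example above
class CompanyInfoExtended(BaseModel):
    company: str
    email: str
    description: str  # hypothetical extra field: a short blurb about the company

data = app.scrape_url(
    "https://openai.com",
    params={
        "formats": ["extract"],
        "extract": {
            # The "selector" is still plain English - only the prompt and schema change
            "prompt": "Find the company name, contact email, and a one-sentence description of what the company does",
            "schema": CompanyInfoExtended.model_json_schema(),
        },
    },
)

print(CompanyInfoExtended(**data["extract"]))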

Now that we understand how Firecrawl’s AI-powered extraction works, let’s build a complete sales lead extraction tool that leverages these capabilities at scale.

Project Setup

Let’s start by setting up our development environment and installing the necessary dependencies.

  1. Create a working directory

First, create a working directory and initialize a virtual environment:

mkdir sales-lead-extractor
cd sales-lead-extractor
python -m venv venv
source venv/bin/activate  # On Windows use: venv\Scripts\activate

  2. Install Dependencies

We’ll use Poetry for dependency management. If you haven’t installed Poetry yet:

curl -sSL https://install.python-poetry.org | python3 -

Then, initialize it inside the current working directory:

poetry init

Type “^3.10” when asked for the Python version, but don’t specify the dependencies interactively.

Next, install the project dependencies with the add command:

poetry add streamlit firecrawl-py pandas pydantic openpyxl python-dotenv

  3. Build the project structure
mkdir data src
touch .gitignore README.md .env src/{app.py,models.py,scraper.py}

  4. Configure environment variables

Inside the .env file in the root directory, add your Firecrawl API key:

FIRECRAWL_API_KEY=your_api_key_here

  5. Start the app UI

Run the Streamlit app (which is blank for now) to ensure everything is working:

poetry run streamlit run src/app.py

You should see the Streamlit development server start up and your default browser open to the app’s interface. Keep this tab open to see the changes we make to the app in the next steps.

Building the Lead Extraction App Step-by-Step

We will take a top-down approach to building the app: starting with the high-level UI components and user flows, then implementing the underlying functionality piece by piece. This approach will help us validate the app’s usability early and ensure we’re building exactly what users need.

Step 1: Adding a file upload component

We start with the following imports at the top of src/app.py:

import streamlit as st
import pandas as pd
from typing import List, Dict

Then, we will define the main function. Inside, we write the page title and subtitle and add a component for file uploads:

def main():
    st.title("Sales Lead Extractor")
    st.write(
        "Upload a file with website URLs of your leads and define the fields to extract"
    )

    # File upload
    uploaded_file = st.file_uploader("Choose a file", type=["csv", "txt"])

The file uploader component allows users to upload either CSV or TXT files containing URLs. We’ll need a helper function to parse these files and extract the URLs.

# Paste the function after the imports but before the main() function
def load_urls(uploaded_file) -> List[str]:
    """Load URLs from uploaded file"""
    if uploaded_file.name.endswith(".csv"):
        df = pd.read_csv(uploaded_file, header=None)
        return df.iloc[:, 0].tolist()
    else:
        content = uploaded_file.getvalue().decode()
        return [url.strip() for url in content.split("\n") if url.strip()]

The load_urls() function handles both CSV and TXT file formats:

  • For CSV files: It assumes URLs are in the first column and uses pandas to read them
  • For TXT files: It reads the content and splits by newlines to get individual URLs

In both cases, it returns a clean list of URLs with any empty lines removed. Note that the function expects files in the following format:

https://website1.com
https://website2.com
https://website3.com

For simplicity, we skip security best practices such as validating the format of the URLs and the uploaded files themselves.
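
If you do want a basic safeguard, a minimal sketch might look like the following. The is_probably_url helper is a hypothetical name (not part of Streamlit or Firecrawl); it simply filters out lines that don’t look like http(s) URLs:

from urllib.parse import urlparse

def is_probably_url(value: str) -> bool:
    """Hypothetical helper: return True if the string looks like an http(s) URL."""
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

# Example usage after loading the file:
# urls = [url for url in load_urls(uploaded_file) if is_probably_url(url)]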

We can now use this in our main function to process the uploaded file:

# Continue the main function
def main():
    ...

    if uploaded_file:
        urls = load_urls(uploaded_file)
        st.write(f"Loaded {len(urls)} URLs")

This gives users immediate feedback about how many URLs were successfully loaded from their file.

Now, add this to the end of src/app.py:

if __name__ == "__main__":
    main()

Step 2: Adding a form to define fields interactively

Once a user uploads their file, they must be presented with another form to specify what information they are looking to extract from the leads. To add this functionality, continue src/app.py with the following code block:

def main():
    # ... the rest of the function

    if uploaded_file:
        urls = load_urls(uploaded_file)
        st.write(f"Loaded {len(urls)} URLs")

        # <THIS PART IS NEW> #
        # Dynamic field input
        st.subheader("Define Fields to Extract")

        fields: Dict[str, str] = {}
        col1, col2 = st.columns(2)

        with col1:
            num_fields = st.number_input(
                "Number of fields", min_value=1, max_value=10, value=3
            )

        for i in range(num_fields):
            col1, col2 = st.columns(2)
            with col1:
                field_name = st.text_input(f"Field {i+1} name", key=f"name_{i}")

                # Convert field_name to lower snake case
                field_name = field_name.lower().replace(" ", "_")

            with col2:
                field_desc = st.text_input(f"Field {i+1} description", key=f"desc_{i}")

            if field_name and field_desc:
                fields[field_name] = field_desc

Here’s what’s happening:

  1. We create a dictionary called fields to store the field definitions
  2. Using Streamlit’s column layout, we first ask users how many fields they want to extract (1-10)
  3. For each field, we create two columns:
    • Left column: Input for the field name (automatically converted to snake_case)
    • Right column: Input for field description that explains what to extract
  4. Valid field name/description pairs are added to the fields dictionary
  5. This fields dictionary will later be used to:
    • Create a dynamic Pydantic model for validation
    • Guide the AI in extracting the right information from each website

The form is responsive and updates in real-time as users type, providing a smooth experience for defining extraction parameters.

Example inputs users might provide:

Field 1:

  • Name: company_name
  • Description: Extract the company name from the website header or footer

Field 2:

  • Name: employee_count
  • Description: Find the number of employees mentioned on the About or Company page

Field 3:

  • Name: tech_stack
  • Description: Look for technology names mentioned in job postings or footer

Field 4:

  • Name: contact_email
  • Description: Find the main contact or sales email address

Field 5:

  • Name: industry
  • Description: Determine the company’s primary industry from their description
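
With these example inputs, the fields dictionary built by the form would hold something like the following. This is just a sketch of the in-memory structure that gets passed to the extraction step:

# Roughly what `fields` contains after the example inputs above
fields = {
    "company_name": "Extract the company name from the website header or footer",
    "employee_count": "Find the number of employees mentioned on the About or Company page",
    "tech_stack": "Look for technology names mentioned in job postings or footer",
    "contact_email": "Find the main contact or sales email address",
    "industry": "Determine the company's primary industry from their description",
}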

Once the user fills out the fields and descriptions, they must click a button that kicks off the extraction process under the hood:

# Continuing main()
def main():
    ...

    if uploaded_file:
        ...

        if st.button("Start Extraction") and fields:
            with st.spinner(
                "Extracting data. This may take a while, so don't close the window."
            ):
                pass

Currently, nothing happens when the user clicks the button, but we will build up the functionality in the next steps.

Step 3: Building a dynamic Pydantic model from input fields

Let’s convert the fields and descriptions provided through the Streamlit UI into a Pydantic model inside the src/models.py file:

from pydantic import BaseModel, Field
from typing import Dict

The pydantic BaseModel provides data validation and serialization capabilities, and Field lets us attach helpful metadata, such as descriptions, to our model fields. The Dict type hint annotates the dictionary of user-defined field names and descriptions.

class DynamicLeadModel(BaseModel):
    """Dynamic model for lead data extraction"""
    pass

This DynamicLeadModel(BaseModel) class will dynamically generate Pydantic models based on user-defined fields. Let’s now fill it in with a create_model class method:

class DynamicLeadModel(BaseModel):
    """Dynamic model for lead data extraction"""

    @classmethod
    def create_model(cls, fields: Dict[str, str]) -> type[BaseModel]:
        """
        Create a dynamic Pydantic model based on user-defined fields

        Args:
            fields: Dictionary of field names and their descriptions
        """
        field_annotations = {}
        field_definitions = {}

        for field_name, description in fields.items():
            field_annotations[field_name] = str
            field_definitions[field_name] = Field(description=description)

        # Create new model dynamically
        return type(
            "LeadData",
            (BaseModel,),
            {"__annotations__": field_annotations, **field_definitions},
        )

This class contains a create_model class method that takes a dictionary of field names and descriptions as input (passed through the Streamlit UI). For each field, it creates type annotations and field definitions with metadata. The method then returns a new dynamically created Pydantic model class that will be used by Firecrawl to guide its AI-powered scraping engine while extracting the lead information.
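
To get a feel for what create_model produces, here is a quick sketch that builds a model from two example fields (borrowed from the sample inputs earlier) and inspects the JSON schema Firecrawl will receive:

# Quick sketch: assumes DynamicLeadModel from models.py is in scope
LeadData = DynamicLeadModel.create_model(
    {
        "company_name": "Extract the company name from the website header or footer",
        "contact_email": "Find the main contact or sales email address",
    }
)

schema = LeadData.model_json_schema()
# The "properties" section carries each field's description, which is what
# guides Firecrawl's AI during extraction
print(schema["properties"])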

Step 4: Scraping input URLs with Firecrawl

Let’s move to src/scraper.py to write the scraping functionality:

# src/scraper.py
from firecrawl import FirecrawlApp
from typing import Dict, List
from datetime import datetime
import pandas as pd
from models import DynamicLeadModel

The datetime module is imported to generate unique timestamps for the output files. pandas (imported as pd) is used to create and manipulate DataFrames for storing the scraped lead data and exporting it to Excel.

class LeadScraper:
    def __init__(self):
        self.app = FirecrawlApp()

The LeadScraper class is initialized with a FirecrawlApp instance that provides the connection to the scraping engine. This allows us to make API calls to extract data from the provided URLs.

    # ... continue the class
    async def scrape_leads(self, urls: List[str], fields: Dict[str, str]) -> str:
        """Scrape multiple leads using Firecrawl's batch extraction endpoint"""
        # Create dynamic model
        model = DynamicLeadModel.create_model(fields)

        # Extract data for all URLs at once
        data = self.app.batch_scrape_urls(
            urls,
            params={
                "formats": ["extract"],
                "extract": {"schema": model.model_json_schema()},
            },
        )

        # Process and format the results
        results = [
            {"url": result["metadata"]["url"], **result["extract"]}
            for result in data["data"]
        ]

        # Convert to DataFrame
        df = pd.DataFrame(results)

        # Save results
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"data/leads_{timestamp}.xlsx"
        df.to_excel(filename, index=False)

        return filename

Next, we define a scrape_leads method that takes a list of URLs and field definitions as input. The method creates a dynamic model based on the field definitions, extracts data from all URLs in a single batch request using Firecrawl’s API, processes the results into a DataFrame, and saves them to an Excel file with a timestamp. The method returns the path to the generated Excel file.
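
One thing to note: the list comprehension above assumes every batch result contains an extract payload. A slightly more defensive version, shown here only as an optional sketch, could skip entries where extraction failed:

# Optional sketch: skip batch results without extracted data so a single
# failing URL doesn't break the whole run
results = []
for result in data["data"]:
    extract = result.get("extract")
    if not extract:
        continue  # page could not be scraped or the schema yielded nothing
    results.append({"url": result["metadata"]["url"], **extract})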

Here is a sample output file that may be generated by the function:

| url                        | company_name | employee_count | industry   | contact_email       |
|----------------------------|--------------|----------------|------------|---------------------|
| https://acme.com           | Acme Corp    | 500-1000       | Technology | info@acme.com       |
| https://betainc.com        | Beta Inc     | 100-500        | Healthcare | contact@betainc.com |
| https://gammatech.io       | Gamma Tech   | 1000+          | Software   | sales@gammatech.io  |
| https://deltasolutions.com | Delta Sol    | 50-100         | Consulting | hello@deltasol.com  |

When the “Start Extraction” button is clicked in the UI, the app must create a LeadScraper instance and call its scrape_leads() method asynchronously using asyncio.run(). So, let’s return to src/app.py.

Step 5: Creating a download link with the results

First, update the imports with the following version:

# src/app.py
import streamlit as st
import pandas as pd
from typing import List, Dict
import time
import asyncio
import os
from scraper import LeadScraper
from dotenv import load_dotenv

load_dotenv()

The imports have been updated to include all necessary dependencies for the app:

  • time for tracking execution duration
  • asyncio for asynchronous execution
  • os for file path operations
  • LeadScraper from our custom scraper module
  • load_dotenv for loading environment variables

Next, continue the last if block of the main() function with the following code block:

def main():
    ...

    if uploaded_file:
        ...

        if st.button("Start Extraction") and fields:
            with st.spinner(
                "Extracting data. This may take a while, so don't close the window."
            ):
                start_time = time.time()
                scraper = LeadScraper()

                # Run scraping asynchronously
                result_file = asyncio.run(scraper.scrape_leads(urls, fields))

                elapsed_time = time.time() - start_time
                elapsed_mins = int(elapsed_time // 60)
                elapsed_secs = int(elapsed_time % 60)

                # Show download link
                with open(result_file, "rb") as f:
                    st.download_button(
                        "Download Results",
                        f,
                        file_name=os.path.basename(result_file),
                        mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
                    )
                st.balloons()
                st.success(
                    f"Extraction complete! Time taken: {elapsed_mins}m {elapsed_secs}s"
                )

This section adds the core extraction functionality:

When the “Start Extraction” button is clicked and fields are defined:

  • Shows a loading spinner with message
  • Records start time
  • Creates LeadScraper instance
  • Runs scraping asynchronously using asyncio
  • Calculates elapsed time in minutes and seconds
  • Opens the result file and creates download button
  • Shows success message with time taken
  • Displays celebratory balloons animation

The code handles the entire extraction workflow from triggering the scrape to delivering results to the user, with progress feedback throughout the process.

At this point, the core app functionality is finished. Try it out with a few different CSV files containing potential leads your company has.

In the final step, we will deploy the app to Streamlit Cloud.

Step 6: Deploying the app to Streamlit Cloud

Before deploying, let’s create a requirements.txt file in the root directory:

poetry export -f requirements.txt --output requirements.txt --without-hashes

Next, create a new file called .streamlit/secrets.toml:

FIRECRAWL_API_KEY = "your_api_key_here"

Also, add these to your .gitignore file:

__pycache__/

# Virtual Environment
.env
.venv
venv/
ENV/

# IDE
.idea/
.vscode/

# Project specific
.streamlit/secrets.toml
data/
*.xlsx

Now follow these steps to deploy:

  1. Push to GitHub
git init
git add .
git commit -m "Initial commit"
git branch -M main
git remote add origin https://github.com/yourusername/sales-lead-extractor.git
git push -u origin main

  2. Deploy on Streamlit Cloud
  • Go to share.streamlit.io
  • Click “New app”
  • Connect your GitHub repository
  • Select the main branch and src/app.py as the entry point
  • Add your Firecrawl API key in the Secrets section using the same format as .streamlit/secrets.toml
  • Click “Deploy”

Your app will be live in a few minutes.

Conclusion

We’ve built a powerful lead extraction tool that combines the simplicity of Streamlit with the AI capabilities of Firecrawl. The app allows sales teams to:

  • Upload lists of target websites
  • Define custom data fields to extract
  • Get structured lead data in Excel format
  • Save hours of manual data gathering

Key features implemented:

  • Dynamic field definition
  • Batch URL processing
  • Progress tracking
  • Excel export
  • Cloud deployment

Feel free to extend the app with features like:

  • Authentication
  • Lead scoring
  • CRM integration
  • Custom extraction templates
  • Rate limiting
  • Error handling

The complete source code is available on GitHub.

About the Author

Bex Tuychiev (@bextuychiev)

Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics.
