How to Build a Bulk Sales Lead Extractor in Python Using AI

Introduction
Sales teams waste a lot of time gathering lead information from websites by hand. The Sales Lead Extractor app we are going to build in this article makes this task much faster by using smart web scraping and a simple interface. Sales professionals can upload a list of website URLs and pick what data they want to collect. The app then uses Streamlit and Firecrawl’s API to automatically gather that information.
The best part about this app is how flexible and simple it is to use. Users aren’t stuck with preset data fields - they can choose exactly what information they need, like company names, contact details, or team sizes. The app’s AI technology reads through websites and turns the data into a clean Excel file. What used to take hours of copying and pasting can now be done in just a few minutes.
Let’s dive into building this powerful lead extraction tool step by step, starting with covering the prerequisite concepts.
Prerequisites
Before we dive into building the Sales Lead Extractor, make sure you have the following prerequisites in place:
- Python Environment Setup
  - Python 3.8 or higher installed
  - A code editor of your choice
- Required Accounts
  - A Firecrawl account with an API key (sign up at https://firecrawl.dev)
  - A GitHub account (if you plan to deploy the app)

  Don’t worry if you are completely new to Firecrawl, as we will cover its basics in the next section.
- Technical Knowledge
  - Basic understanding of Python programming
  - Familiarity with web concepts (URLs, HTML)
  - Basic understanding of data structures (JSON, CSV)
- Project Dependencies
  - Streamlit for the web interface
  - Firecrawl for AI-powered web scraping
  - Pydantic for data validation
  - Pandas for data manipulation
Firecrawl Basics
Firecrawl is an AI-powered web scraping API that takes a different approach from traditional scraping libraries. Instead of relying on HTML selectors or XPaths, it uses natural language understanding to identify and extract content. Here’s a simple example:
```python
from firecrawl import FirecrawlApp
from dotenv import load_dotenv
from pydantic import BaseModel

load_dotenv()


class CompanyInfo(BaseModel):
    company: str
    email: str


app = FirecrawlApp()
schema = CompanyInfo.model_json_schema()

data = app.scrape_url(
    "https://openai.com",
    params={
        "formats": ["extract"],
        "extract": {
            "prompt": "Find the company name and contact email",
            "schema": schema,
        },
    },
)

company_info = CompanyInfo(**data["extract"])
print(company_info)  # Shows validated company info
```
In this example, we:

- Define a Pydantic model `CompanyInfo` to structure the scraping process
- Initialize the Firecrawl client
- Convert the Pydantic model to a JSON schema that Firecrawl understands
- Make an API call to extract company info from openai.com using natural language
The key advantage of Firecrawl is that we can describe what we want to extract in plain English (“Find the company name and contact email”) rather than writing complex selectors. The AI understands the intent and returns structured data matching our schema.
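To make the role of the Pydantic model concrete, here is a small self-contained sketch — the payloads are made up and only mimic the shape of `data["extract"]` — showing how validation catches incomplete extractions:

```python
from pydantic import BaseModel, ValidationError


class CompanyInfo(BaseModel):
    company: str
    email: str


# Illustrative payloads standing in for what data["extract"] might contain
good_payload = {"company": "OpenAI", "email": "support@openai.com"}
bad_payload = {"company": "OpenAI"}  # missing "email"

print(CompanyInfo(**good_payload))  # company='OpenAI' email='support@openai.com'

try:
    CompanyInfo(**bad_payload)
except ValidationError as err:
    print(err)  # Pydantic reports the missing "email" field
```

Because every extraction is parsed through the model, malformed or partial results fail loudly instead of silently producing empty spreadsheet cells.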
Now that we understand how Firecrawl’s AI-powered extraction works, let’s build a complete sales lead extraction tool that leverages these capabilities at scale.
Project Setup
Let’s start by setting up our development environment and installing the necessary dependencies.
- Create a working directory

First, create a working directory and initialize a virtual environment:

```bash
mkdir sales-lead-extractor
cd sales-lead-extractor
python -m venv venv
source venv/bin/activate  # On Windows use: venv\Scripts\activate
```
- Install Dependencies

We’ll use Poetry for dependency management. If you haven’t installed Poetry yet:

```bash
curl -sSL https://install.python-poetry.org | python3 -
```

Then, initialize it inside the current working directory:

```bash
poetry init
```

Type “^3.10” when asked for the Python version, but don’t specify the dependencies interactively.

Next, install the project dependencies with the `add` command:

```bash
poetry add streamlit firecrawl-py pandas pydantic openpyxl python-dotenv
```
- Build the project structure

```bash
mkdir data src
touch .gitignore README.md .env src/{app.py,models.py,scraper.py}
```
- Configure environment variables

Inside the `.env` file in the root directory, add your Firecrawl API key:

```text
FIRECRAWL_API_KEY=your_api_key_here
```
- Start the app UI

Run the Streamlit app (which is still blank) to ensure everything is working:

```bash
poetry run streamlit run src/app.py
```
You should see the Streamlit development server start up and your default browser open to the app’s interface. Keep this tab open to see the changes we make to the app in the next steps.
Building the Lead Extraction App Step-by-Step
We will take a top-down approach to building the app: starting with the high-level UI components and user flows, then implementing the underlying functionality piece by piece. This approach will help us validate the app’s usability early and ensure we’re building exactly what users need.
Step 1: Adding a file upload component
We start with the following imports at the top of `src/app.py`:

```python
import streamlit as st
import pandas as pd
from typing import List, Dict
```

Then, we will define the `main` function. Inside, we write the page title and subtitle and add a component for file uploads:

```python
def main():
    st.title("Sales Lead Extractor")
    st.write(
        "Upload a file with website URLs of your leads and define the fields to extract"
    )

    # File upload
    uploaded_file = st.file_uploader("Choose a file", type=["csv", "txt"])
```
The file uploader component allows users to upload either CSV or TXT files containing URLs. We’ll need a helper function to parse these files and extract the URLs.
```python
# Paste the function after the imports but before the main() function
def load_urls(uploaded_file) -> List[str]:
    """Load URLs from uploaded file"""
    if uploaded_file.name.endswith(".csv"):
        df = pd.read_csv(uploaded_file, header=None)
        return df.iloc[:, 0].tolist()
    else:
        content = uploaded_file.getvalue().decode()
        return [url.strip() for url in content.split("\n") if url.strip()]
```
The `load_urls()` function handles both CSV and TXT file formats:

- For CSV files: it assumes URLs are in the first column and uses pandas to read them
- For TXT files: it reads the content and splits by newlines to get individual URLs

In both cases, it returns a clean list of URLs with any empty lines removed. Note that the function expects files in the following format:

```text
https://website1.com
https://website2.com
https://website3.com
```

For simplicity, we are skipping security best practices for validating the URL format and the uploaded files themselves; if you want to add a basic check, a minimal sketch follows below.
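Here is one way you could validate the loaded URLs before scraping — a minimal sketch, not part of the app code above; the `is_valid_url` helper name is ours and you can adapt the rules to your needs:

```python
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs with a hostname."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


urls = ["https://website1.com", "not-a-url", "ftp://files.example.com"]
valid_urls = [u for u in urls if is_valid_url(u)]
print(valid_urls)  # ['https://website1.com']
```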
We can now use this in our main function to process the uploaded file:

```python
# Continue the main function
def main():
    ...
    if uploaded_file:
        urls = load_urls(uploaded_file)
        st.write(f"Loaded {len(urls)} URLs")
```

This gives users immediate feedback about how many URLs were successfully loaded from their file.

Now, add this to the end of `src/app.py`:

```python
if __name__ == "__main__":
    main()
```
Step 2: Adding a form to define fields interactively
Once a user uploads their file, they must be presented with another form to specify what information they want to extract from the leads. To add this functionality, continue `src/app.py` with the following code block:

```python
def main():
    # ... the rest of the function
    if uploaded_file:
        urls = load_urls(uploaded_file)
        st.write(f"Loaded {len(urls)} URLs")

        # <THIS PART IS NEW> #
        # Dynamic field input
        st.subheader("Define Fields to Extract")

        fields: Dict[str, str] = {}
        col1, col2 = st.columns(2)
        with col1:
            num_fields = st.number_input(
                "Number of fields", min_value=1, max_value=10, value=3
            )

        for i in range(num_fields):
            col1, col2 = st.columns(2)
            with col1:
                field_name = st.text_input(f"Field {i+1} name", key=f"name_{i}")
                # Convert field_name to lower snake case
                field_name = field_name.lower().replace(" ", "_")
            with col2:
                field_desc = st.text_input(f"Field {i+1} description", key=f"desc_{i}")
            if field_name and field_desc:
                fields[field_name] = field_desc
```
Here’s what’s happening:

- We create a dictionary called `fields` to store the field definitions
- Using Streamlit’s column layout, we first ask users how many fields they want to extract (1-10)
- For each field, we create two columns:
  - Left column: input for the field name (automatically converted to snake_case)
  - Right column: input for the field description that explains what to extract
- Valid field name/description pairs are added to the `fields` dictionary
- This `fields` dictionary will later be used to:
  - Create a dynamic Pydantic model for validation
  - Guide the AI in extracting the right information from each website

The form is responsive and updates in real-time as users type, providing a smooth experience for defining extraction parameters.
Example inputs users might provide:

- Field 1
  - Name: `company_name`
  - Description: Extract the company name from the website header or footer
- Field 2
  - Name: `employee_count`
  - Description: Find the number of employees mentioned on the About or Company page
- Field 3
  - Name: `tech_stack`
  - Description: Look for technology names mentioned in job postings or footer
- Field 4
  - Name: `contact_email`
  - Description: Find the main contact or sales email address
- Field 5
  - Name: `industry`
  - Description: Determine the company’s primary industry from their description
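With the example inputs above, the resulting `fields` dictionary handed to the rest of the app would look roughly like this (names already converted to snake_case):

```python
fields = {
    "company_name": "Extract the company name from the website header or footer",
    "employee_count": "Find the number of employees mentioned on the About or Company page",
    "tech_stack": "Look for technology names mentioned in job postings or footer",
    "contact_email": "Find the main contact or sales email address",
    "industry": "Determine the company's primary industry from their description",
}
```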
Once the user fills out the fields and descriptions, they must click a button that fires up the entire system under the hood:

```python
# Continuing main()
def main():
    ...
    if uploaded_file:
        ...
        if st.button("Start Extraction") and fields:
            with st.spinner(
                "Extracting data. This may take a while, so don't close the window."
            ):
                pass
```

Currently, nothing happens when the user clicks the button, but we will build up the functionality in the next steps.
Step 3: Building a dynamic Pydantic model from input fields
Let’s convert the fields and descriptions provided through the Streamlit UI into a Pydantic model inside the `src/models.py` file:

```python
from pydantic import BaseModel, Field
from typing import Dict, Any
```

The Pydantic `BaseModel` provides data validation and serialization capabilities. We use `Field` to add helpful metadata like descriptions to our model fields. For type safety, we use the `Dict` and `Any` type hints: `Dict` helps define our field definitions, while `Any` provides flexibility in typing when needed.
```python
class DynamicLeadModel(BaseModel):
    """Dynamic model for lead data extraction"""

    pass
```

Then, we define a new class `DynamicLeadModel(BaseModel)` whose purpose is to dynamically generate Pydantic models based on user-defined fields.
```python
class DynamicLeadModel(BaseModel):
    """Dynamic model for lead data extraction"""

    @classmethod
    def create_model(cls, fields: Dict[str, str]) -> type[BaseModel]:
        """
        Create a dynamic Pydantic model based on user-defined fields

        Args:
            fields: Dictionary of field names and their descriptions
        """
        field_annotations = {}
        field_definitions = {}

        for field_name, description in fields.items():
            field_annotations[field_name] = str
            field_definitions[field_name] = Field(description=description)

        # Create new model dynamically
        return type(
            "LeadData",
            (BaseModel,),
            {"__annotations__": field_annotations, **field_definitions},
        )
```
This class contains a `create_model` class method that takes a dictionary of field names and descriptions as input (passed through the Streamlit UI). For each field, it creates type annotations and field definitions with metadata. The method then returns a new dynamically created Pydantic model class that will be used by Firecrawl to guide its AI-powered scraping engine while extracting the lead information.
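To sanity-check the dynamic model outside the app, you can call `create_model` directly and inspect the JSON schema it produces. Here is a quick sketch (run it from the `src` directory so the `models` import resolves; the field names are taken from our earlier example):

```python
from models import DynamicLeadModel

LeadData = DynamicLeadModel.create_model(
    {
        "company_name": "Extract the company name from the website header or footer",
        "contact_email": "Find the main contact or sales email address",
    }
)

# Each field becomes a required string with its description as metadata
print(LeadData.model_json_schema())

# Instantiating the generated model validates data just like a hand-written one
lead = LeadData(company_name="Acme Corp", contact_email="info@acme.com")
print(lead)
```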
Step 4: Scraping input URLs with Firecrawl
Let’s move to `src/scraper.py` to write the scraping functionality:

```python
# src/scraper.py
from firecrawl import FirecrawlApp
from typing import Dict, List
from datetime import datetime
import pandas as pd

from models import DynamicLeadModel
```

The `datetime` module is imported to generate unique timestamps for the output files. `pandas` (imported as `pd`) is used to create and manipulate DataFrames for storing the scraped lead data and exporting it to Excel.

```python
class LeadScraper:
    def __init__(self):
        self.app = FirecrawlApp()
```

The `LeadScraper` class is initialized with a `FirecrawlApp` instance that provides the connection to the scraping engine. This allows us to make API calls to extract data from the provided URLs.
```python
    # ... continue the class
    async def scrape_leads(self, urls: List[str], fields: Dict[str, str]) -> str:
        """Scrape multiple leads using Firecrawl's batch extraction endpoint"""
        # Create dynamic model
        model = DynamicLeadModel.create_model(fields)

        # Extract data for all URLs at once
        data = self.app.batch_scrape_urls(
            urls,
            params={
                "formats": ["extract"],
                "extract": {"schema": model.model_json_schema()},
            },
        )

        # Process and format the results
        results = [
            {"url": result["metadata"]["url"], **result["extract"]}
            for result in data["data"]
        ]

        # Convert to DataFrame
        df = pd.DataFrame(results)

        # Save results
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"data/leads_{timestamp}.xlsx"
        df.to_excel(filename, index=False)

        return filename
```
Next, we define a `scrape_leads` method that takes a list of URLs and field definitions as input. The method creates a dynamic model based on the field definitions, extracts data from all URLs in a single batch request using Firecrawl’s API, processes the results into a DataFrame, and saves them to an Excel file with a timestamp. The method returns the path to the generated Excel file.
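If you want to exercise the scraper outside of Streamlit, a quick command-line test could look like the sketch below. The file name, example URLs, and field descriptions are ours; it assumes `FIRECRAWL_API_KEY` is set in your environment and that a `data/` directory exists relative to where you run it:

```python
# e.g. src/test_scraper.py -- a throwaway script, not part of the app
import asyncio

from scraper import LeadScraper

scraper = LeadScraper()

fields = {
    "company_name": "Extract the company name from the website header or footer",
    "contact_email": "Find the main contact or sales email address",
}
urls = ["https://firecrawl.dev", "https://streamlit.io"]

# scrape_leads is declared async, so we drive it with asyncio.run
output_file = asyncio.run(scraper.scrape_leads(urls, fields))
print(f"Results saved to {output_file}")
```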
Here is a sample output file that may be generated by the function:

| url | company_name | employee_count | industry | contact_email |
|---|---|---|---|---|
| https://acme.com | Acme Corp | 500-1000 | Technology | info@acme.com |
| https://betainc.com | Beta Inc | 100-500 | Healthcare | contact@betainc.com |
| https://gammatech.io | Gamma Tech | 1000+ | Software | sales@gammatech.io |
| https://deltasolutions.com | Delta Sol | 50-100 | Consulting | hello@deltasol.com |
When the “Start Extraction” button is clicked in the UI, the app must create a `LeadScraper` instance and call its `scrape_leads()` method asynchronously using `asyncio.run()`. So, let’s return to `src/app.py`.
Step 5: Creating a download link with the results
First, update the imports with the following version:

```python
# src/app.py
import streamlit as st
import pandas as pd
from typing import List, Dict
import time
import asyncio
import os

from scraper import LeadScraper
from dotenv import load_dotenv

load_dotenv()
```

The imports have been updated to include all necessary dependencies for the app:

- `time` for tracking execution duration
- `asyncio` for asynchronous execution
- `os` for file path operations
- `LeadScraper` from our custom scraper module
- `load_dotenv` for loading environment variables
Next, continue the last `if` block of the `main()` function with the following code block:

```python
def main():
    ...
    if uploaded_file:
        ...
        if st.button("Start Extraction") and fields:
            with st.spinner(
                "Extracting data. This may take a while, so don't close the window."
            ):
                start_time = time.time()
                scraper = LeadScraper()

                # Run scraping asynchronously
                result_file = asyncio.run(scraper.scrape_leads(urls, fields))

                elapsed_time = time.time() - start_time
                elapsed_mins = int(elapsed_time // 60)
                elapsed_secs = int(elapsed_time % 60)

                # Show download link
                with open(result_file, "rb") as f:
                    st.download_button(
                        "Download Results",
                        f,
                        file_name=os.path.basename(result_file),
                        mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
                    )

                st.balloons()
                st.success(
                    f"Extraction complete! Time taken: {elapsed_mins}m {elapsed_secs}s"
                )
```
This section adds the core extraction functionality. When the “Start Extraction” button is clicked and fields are defined, the app:

- Shows a loading spinner with a message
- Records the start time
- Creates a `LeadScraper` instance
- Runs the scraping asynchronously using `asyncio`
- Calculates the elapsed time in minutes and seconds
- Opens the result file and creates a download button
- Shows a success message with the time taken
- Displays a celebratory balloons animation

The code handles the entire extraction workflow from triggering the scrape to delivering results to the user, with progress feedback throughout the process.
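Note that if a scrape fails (an unreachable URL, a network error, exhausted API credits), the exception currently bubbles up as a raw error in the Streamlit UI. A minimal way to soften that — our own optional addition, not part of the tutorial code above — is to wrap the call in a try/except inside the spinner block and report the problem with `st.error`:

```python
# Optional: inside the st.spinner block, replace the bare asyncio.run call
try:
    result_file = asyncio.run(scraper.scrape_leads(urls, fields))
except Exception as exc:
    st.error(f"Extraction failed: {exc}")
    st.stop()  # End this script run so the download section is skipped
```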
At this point, the core app functionality is finished. Try it out with a few different CSV files containing potential leads your company has.
In the final step, we will deploy the app to Streamlit Cloud.
Step 6: Deploying the app to Streamlit Cloud
Before deploying, let’s create a `requirements.txt` file in the root directory:

```bash
poetry export -f requirements.txt --output requirements.txt --without-hashes
```

Next, create a new file called `.streamlit/secrets.toml`:

```toml
FIRECRAWL_API_KEY = "your_api_key_here"
```
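On Streamlit Cloud, the values you enter in the Secrets section play the role of this local `secrets.toml`. If you prefer not to rely on the API key being picked up implicitly from the environment, one optional variation (ours, not required by the tutorial) is to resolve the key explicitly and pass it to `FirecrawlApp`, which accepts an `api_key` argument:

```python
import os

import streamlit as st
from firecrawl import FirecrawlApp


def get_firecrawl_app() -> FirecrawlApp:
    """Resolve the API key from the environment first, then Streamlit secrets."""
    api_key = os.getenv("FIRECRAWL_API_KEY")
    if not api_key:
        # st.secrets mirrors .streamlit/secrets.toml locally and the
        # Secrets section on Streamlit Cloud
        api_key = st.secrets["FIRECRAWL_API_KEY"]
    return FirecrawlApp(api_key=api_key)
```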
Also, add these to your `.gitignore` file:

```text
__pycache__/

# Virtual Environment
.env
.venv
venv/
ENV/

# IDE
.idea/
.vscode/

# Project specific
.streamlit/secrets.toml
data/
*.xlsx
```
Now follow these steps to deploy:

- Push to GitHub

```bash
git init
git add .
git commit -m "Initial commit"
git branch -M main
git remote add origin https://github.com/yourusername/sales-lead-extractor.git
git push -u origin main
```
- Deploy on Streamlit Cloud
  - Go to share.streamlit.io
  - Click “New app”
  - Connect your GitHub repository
  - Select the main branch and `src/app.py` as the entry point
  - Add your Firecrawl API key in the Secrets section using the same format as `.streamlit/secrets.toml`
  - Click “Deploy”
Your app will be live in a few minutes.
Conclusion
We’ve built a powerful lead extraction tool that combines the simplicity of Streamlit with the AI capabilities of Firecrawl. The app allows sales teams to:
- Upload lists of target websites
- Define custom data fields to extract
- Get structured lead data in Excel format
- Save hours of manual data gathering
Key features implemented:
- Dynamic field definition
- Batch URL processing
- Progress tracking
- Excel export
- Cloud deployment
Feel free to extend the app with features like:
- Authentication
- Lead scoring
- CRM integration
- Custom extraction templates
- Rate limiting
- Error handling
The complete source code is available on GitHub.