Building an Open-Source Project Monitoring Tool with Firecrawl and Streamlit

In this article, we'll explore the OS-watch application - a tool designed to help teams and individuals stay informed about trending open-source projects. The application scrapes GitHub's trending repositories page, filters results based on user criteria, and delivers notifications via Slack on a scheduled basis.
Application overview
OS-watch is a Python application that monitors trending projects on GitHub and sends notifications to Slack. The program searches GitHub's trending page, collects information about repositories that match specific criteria, and delivers custom alerts when it finds something interesting. Users can search by keywords like "machine learning" or "web development," filter by programming language, and set up automated checks that run on a schedule.
The application uses Firecrawl to handle the web scraping part, which means it can extract data from GitHub's website without dealing with complex HTML parsing. Streamlit provides the user interface, making it easy to interact with the application through a web browser. The notification system connects to Slack through webhooks, delivering formatted messages with repository details directly to your chosen channel.
Key components
The application is built using several Python modules that work together:
- config.py - Manages application settings and configuration through Pydantic models
- scraper.py - Connects to GitHub using Firecrawl and extracts repository information
- notifier.py - Creates and sends formatted notifications to Slack
- scheduler.py - Runs tasks in the background on a specified schedule
- app.py - Provides the user interface built with Streamlit
- run_app.py - Starts the application with the correct settings
Each module handles a specific part of the application's functionality. The modular design makes the code easier to maintain and update. Users interact with the Streamlit interface to set search parameters, while the scheduler runs in the background to check for new trending repositories at regular intervals.
Setting up the environment
Now that we understand what OS-watch does and how it's structured, let's set up everything you need to run the application. This process involves obtaining API keys, configuring your Slack workspace, and setting up your local development environment.
Prerequisites
To run OS-watch, you'll need:
- Python 3.10 or higher: The application uses modern Python features
- Poetry: A dependency management tool that makes installation easier
- Firecrawl API key: Required for web scraping GitHub's trending page
- Slack workspace: With permissions to create incoming webhooks
Installation steps
Getting OS-watch up and running involves a few straightforward steps. First, clone the repository to your local machine:
git clone https://github.com/yourusername/os-watch.git
cd os-watch
Next, if you don't already have Poetry installed, you'll need to install it:
curl -sSL https://install.python-poetry.org | python3 -
With Poetry available, you can now install all the dependencies:
poetry install
After installation, create your environment configuration file by making a copy of the example file:
cp .env.example .env
Finally, activate the virtual environment created by Poetry:
poetry shell
With these steps completed, you'll have the basic structure in place. However, before running the application, you'll need to set up the required services.
Getting a Firecrawl API key
Firecrawl is the web scraping service that powers OS-watch's ability to extract structured data from GitHub. To get an API key:
- Visit Firecrawl's website and create an account
- After signing in, navigate to your account dashboard
- Look for the "API Keys" section
- Click "Generate New API Key"
- Copy the key - it should start with "fc-"
- Add this key to your .env file as FIRECRAWL_API_KEY=fc-your_key_here
This API key allows OS-watch to make requests to Firecrawl's service, which handles the complex task of scraping GitHub's trending pages and extracting the repository information in a structured format.
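If you want to confirm the key works before wiring up the full application, a quick standalone check is enough. The snippet below is a minimal sketch (not part of OS-watch itself) that assumes the firecrawl-py and python-dotenv packages are installed and that your .env file contains the key; it mirrors the client call pattern used in scraper.py later in this article, though the exact response shape may vary between SDK versions:

# verify_firecrawl.py - hypothetical sanity check for your API key
import os
from dotenv import load_dotenv
from firecrawl import FirecrawlApp

load_dotenv()  # reads FIRECRAWL_API_KEY from .env into the environment
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
# Request a simple markdown scrape of the trending page
result = app.scrape_url("https://github.com/trending", params={"formats": ["markdown"]})
print(str(result)[:500])  # print the start of the response to confirm the key works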
Setting up Slack webhooks
For the notification feature to work, OS-watch needs a way to send messages to your Slack workspace. This is done through Slack's incoming webhooks:
- Log in to your Slack workspace
- Visit api.slack.com/apps and click "Create New App"
- Choose "From scratch" and give your app a name like "OS-watch"
- Select the workspace where you want to receive notifications
- In the left sidebar, click "Incoming Webhooks"
- Toggle "Activate Incoming Webhooks" to ON
- Click "Add New Webhook to Workspace"
- Select the channel where you want notifications to appear
- Click "Allow" to authorize the webhook
- Copy the webhook URL (it looks like https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX)
- Add this URL to your .env file as SLACK_WEBHOOK_URL=your_webhook_url_here
With the webhook configured, OS-watch will be able to send rich, formatted messages about trending repositories directly to your specified Slack channel.
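Before running the full application, you may want to confirm that the webhook actually posts to your channel. The following is a minimal sketch using the requests library; the URL is a placeholder that you would replace with the one you just copied:

# test_webhook.py - hypothetical one-off webhook test
import requests

webhook_url = "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXX"  # replace with your URL
response = requests.post(webhook_url, json={"text": "OS-watch webhook test"})
print(response.status_code, response.text)  # Slack returns 200 and "ok" on success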
Environment configuration
The final step in setting up OS-watch is configuring the .env file, which controls how the application operates. Let's look at each setting in detail:
- FIRECRAWL_API_KEY: Your personal API key for accessing Firecrawl services
  - Example: FIRECRAWL_API_KEY=fc-abcd1234
  - Required for the application to function
- SLACK_WEBHOOK_URL: The URL where OS-watch sends notifications
  - Example: SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXX
  - Required for notifications, but you can run searches without it
- NOTIFICATION_FREQUENCY: How often to check for trending repositories
  - Options: hourly, daily, or weekly
  - Example: NOTIFICATION_FREQUENCY=daily
  - Default: daily
- NOTIFICATION_TIME: When to send notifications (for daily/weekly frequencies)
  - Format: 24-hour time (HH:MM)
  - Example: NOTIFICATION_TIME=09:00 (for 9:00 AM)
  - Default: 09:00
- SEARCH_KEYWORDS: Terms to look for in repository names and descriptions
  - Format: Comma-separated list
  - Example: SEARCH_KEYWORDS=machine learning,ai,nlp
  - Default: python,ml,ai
- SEARCH_LANGUAGE: Programming language to filter by
  - Example: SEARCH_LANGUAGE=Python
  - Leave empty to include all languages
  - Default: empty (all languages)
- SEARCH_PERIOD: Time frame for trending repositories
  - Options: daily, weekly, or monthly
  - Example: SEARCH_PERIOD=weekly
  - Default: daily
These settings allow you to customize how OS-watch operates without changing any code. For example, you might set SEARCH_KEYWORDS=rust,webassembly if you're interested in Rust and WebAssembly projects, or NOTIFICATION_FREQUENCY=weekly if you prefer less frequent updates.
A complete .env file might look like this:
FIRECRAWL_API_KEY=fc-your_api_key_here
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXX
NOTIFICATION_FREQUENCY=daily
NOTIFICATION_TIME=09:00
SEARCH_KEYWORDS=python,ml,ai,llm
SEARCH_LANGUAGE=
SEARCH_PERIOD=daily
With your environment fully configured, you're now ready to start the application with:
python run_app.py
This command will launch the Streamlit server and open the application in your browser at http://localhost:8501, where you can begin exploring trending repositories and setting up your automated monitoring.
The Configuration Module (config.py)
With our environment set up and configured, let's examine how OS-watch actually manages these settings through its configuration module. The config.py file serves as the central nervous system of our application, translating environment variables into structured data models that the rest of the application can easily consume.
At the heart of this module is Pydantic, a data validation library that enforces type hints at runtime. This gives us both the flexibility of Python and the safety of strong type checking, ensuring our application runs reliably even when configuration values change.
Data models
The configuration module defines several Pydantic models that represent different aspects of our application:
GitHubRepository model
This model represents an individual GitHub repository with all its relevant properties:
class GitHubRepository(BaseModel):
"""Model for a GitHub repository extracted from trending page"""
name: str = Field(description="Full name of the repository (owner/repo)")
description: Optional[str] = Field(None, description="Repository description")
language: Optional[str] = Field(None, description="Main programming language")
stars_count: Optional[str] = Field(None, description="Total number of stars")
stars_today: Optional[str] = Field(None, description="Stars gained today")
forks_count: Optional[str] = Field(None, description="Total number of forks")
repo_owner: Optional[str] = Field(None, description="Repository owner")
repo_url: Optional[str] = Field(None, description="Repository URL")
This model not only defines the structure of repository data but also documents each field through the Field descriptor, making the code self-documenting. Notice how some fields are marked as Optional, allowing for flexibility when data might be incomplete.
The Repositories class serves as a simple wrapper that contains a list of GitHubRepository objects:
class Repositories(BaseModel):
"""Wrapper model for a list of GitHub repositories"""
repositories: List[GitHubRepository]
This structure is particularly important for our Firecrawl integration, as it defines the schema used for structured data extraction. The Pydantic model is converted to a JSON Schema and passed to Firecrawl, whose AI extraction engine uses the field descriptions to locate the parts of the page containing the information we want.
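If you are curious what Firecrawl actually receives, you can print the generated schema yourself. This is an illustrative sketch that assumes the models live in src/config.py as described in this section:

# inspect_schema.py - hypothetical helper to view the extraction schema
import json
from src.config import Repositories

schema = Repositories.model_json_schema()  # Pydantic v2 converts the model to JSON Schema
print(json.dumps(schema, indent=2))        # includes each field's description from Field(...)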
NotificationConfig model
This model handles all settings related to Slack notifications:
class NotificationConfig(BaseModel):
"""Configuration for notifications"""
webhook_url: str
frequency: str # "hourly", "daily", "weekly"
time_of_day: Optional[str] = "09:00" # For daily/weekly notifications
These fields directly correspond to the environment variables we set up earlier, controlling how and when notifications are sent.
SearchConfig model
This model manages parameters for GitHub trending searches:
class SearchConfig(BaseModel):
"""Configuration for GitHub trend searches"""
keywords: List[str]
language: Optional[str] = None
time_period: str = "daily" # "daily", "weekly", "monthly"
Again, these fields align with our environment configuration, defining what repositories we're interested in finding.
AppConfig model
Finally, the main configuration class combines the notification and search settings:
class AppConfig(BaseModel):
"""Main application configuration"""
notification: NotificationConfig
search: SearchConfig
This hierarchical structure keeps our configuration organized and makes it easy to access related settings together.
Loading configuration from environment
The key bridge between the environment variables we set up earlier and our application is the load_from_env() class method in AppConfig:
@classmethod
def load_from_env(cls):
"""Load configuration from environment variables"""
return cls(
notification=NotificationConfig(
webhook_url=os.environ.get("SLACK_WEBHOOK_URL", ""),
frequency=os.environ.get("NOTIFICATION_FREQUENCY", "daily"),
time_of_day=os.environ.get("NOTIFICATION_TIME", "09:00"),
),
search=SearchConfig(
keywords=os.environ.get("SEARCH_KEYWORDS", "python,ml,ai").split(","),
language=os.environ.get("SEARCH_LANGUAGE", None),
time_period=os.environ.get("SEARCH_PERIOD", "daily"),
),
)
This method reads the environment variables we configured in our .env file and converts them into the appropriate data types for our models. For example, it transforms the comma-separated keyword string into a proper Python list. It also provides sensible default values for any missing configuration, ensuring the application can run even with incomplete settings.
Default configuration
To handle cases where environment variables might not be set, the module defines a DEFAULT_CONFIG object:
DEFAULT_CONFIG = AppConfig(
notification=NotificationConfig(
webhook_url="",
frequency="daily",
),
search=SearchConfig(
keywords=["python", "ml", "ai"],
time_period="daily",
),
)
This default configuration allows the application to start with reasonable settings out of the box. New users can immediately begin using the application without configuring every detail, and they can gradually customize their experience as they become more familiar with the tool.
The configuration module's well-structured approach makes the rest of the application simpler and more maintainable. By centralizing all configuration management in one place, other modules can focus on their specific responsibilities without worrying about how to access or validate configuration values. This separation of concerns is a key design principle that makes OS-watch both robust and easy to extend.
The Scraper Module (scraper.py)
The scraper module bridges the gap between our application and the GitHub website, converting raw HTML into structured data that follows our Pydantic models. By using Firecrawl's structured extraction features, we avoid the fragility of traditional web scrapers that break when sites change their layouts or styling.
GitHubTrendScraper class
The main component of this module is the GitHubTrendScraper class, which encapsulates all scraping functionality:
class GitHubTrendScraper:
"""Scraper for GitHub trending repositories"""
def __init__(self, config: SearchConfig):
self.config = config
# Get API key from environment variables
self.api_key = os.environ.get("FIRECRAWL_API_KEY", "")
# Initialize FirecrawlApp with API key
self.firecrawl_app = FirecrawlApp(api_key=self.api_key)
During initialization, the scraper takes a SearchConfig object (from our config module) and sets up the Firecrawl client using the API key from our environment variables.
Building the URL
The first step in scraping is constructing the correct URL to fetch:
def build_url(self) -> str:
"""Build the GitHub trending URL based on configuration"""
url = "https://github.com/trending"
# Add language filter if specified
if self.config.language and self.config.language.lower() != "all":
url += f"/{self.config.language.lower()}"
# Add time period
url += f"?since={self.config.time_period}"
return url
This method dynamically builds the GitHub trending URL based on the user's configuration. If a specific programming language is selected, it's added to the path. The time period (daily, weekly, or monthly) is added as a query parameter.
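For instance, with a Python language filter and a weekly period the method produces https://github.com/trending/python?since=weekly. The short sketch below illustrates this with the article's own classes; it assumes FIRECRAWL_API_KEY is set in your environment, since the constructor also initializes the Firecrawl client:

# hypothetical illustration of build_url()
from src.config import SearchConfig
from src.scraper import GitHubTrendScraper

scraper = GitHubTrendScraper(SearchConfig(keywords=["ai"], language="Python", time_period="weekly"))
print(scraper.build_url())  # https://github.com/trending/python?since=weekly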
The scraping process
The main scrape() method orchestrates the entire scraping process:
def scrape(self) -> List[Dict[str, Any]]:
"""Scrape GitHub trending repositories using structured extraction"""
url = self.build_url()
# Use Firecrawl to scrape the trending page with structured extraction
try:
# Call Firecrawl with structured extraction as shown in the notebook
result = self.firecrawl_app.scrape_url(
url,
params={
"formats": ["extract"],
"extract": {
"prompt": "Scrape the GitHub trending page and extract the repositories based on the schema provided.",
"schema": Repositories.model_json_schema(),
},
},
)
# Check if we got structured data back
if "extract" in result and "repositories" in result["extract"]:
# Convert the extracted data to our standard dictionary format
repositories = self._process_extracted_repos(
result["extract"]["repositories"]
)
# Filter repositories based on keywords
filtered_repos = self._filter_by_keywords(repositories)
return filtered_repos
else:
print("Structured extraction failed, no repository data found")
return []
except Exception as e:
print(f"Error scraping GitHub trending page: {str(e)}")
return []
This method:
- Builds the correct URL based on configuration
- Makes a request to Firecrawl's API with structured extraction parameters
- Processes the extracted repositories into a standardized format
- Filters the repositories based on the user's keywords
- Returns the filtered list of repositories
Notice how we pass our Pydantic schema (Repositories.model_json_schema()) to Firecrawl's extraction engine. This tells Firecrawl exactly what data we want and how it should be structured. The prompt provides additional context to guide the extraction process.
Processing extracted repositories
Once we have the raw data from Firecrawl, we need to process it into a consistent format:
def _process_extracted_repos(
self, extracted_repos: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
"""Process the structured extracted repositories into our standard format"""
processed_repos = []
for i, repo in enumerate(extracted_repos):
# Using the format from the notebook
processed_repo = {
"name": repo.get("name", ""),
"display_name": (
repo.get("name", "").split("/")[-1]
if "/" in repo.get("name", "")
else repo.get("name", "")
),
"url": repo.get(
"repo_url", f"https://github.com/{repo.get('name', '')}"
),
"description": repo.get("description", ""),
"stars": repo.get("stars_count", ""),
"today_stars": repo.get("stars_today", ""),
"language": repo.get("language", self.config.language or "Unknown"),
"rank": i + 1, # Assign rank based on position in the list
"forks": repo.get("forks_count", ""),
"owner": repo.get("repo_owner", ""),
}
processed_repos.append(processed_repo)
return processed_repos
This method transforms the raw extraction data into a consistent format for our application. It handles missing fields gracefully with default values and adds derived information like display_name (the repository name without the owner) and rank (based on position in the trending list).
Filtering by keywords
An additional step is filtering repositories based on the user's keywords:
def _filter_by_keywords(
self, repositories: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
"""Filter repositories based on configured keywords"""
if not self.config.keywords:
return repositories
filtered = []
for repo in repositories:
# Check if any keyword matches in name or description
if any(
k.lower() in repo["name"].lower()
or k.lower() in repo["description"].lower()
for k in self.config.keywords
):
filtered.append(repo)
return filtered
This method checks each repository to see if its name or description contains any of the keywords specified in the configuration. The filtering is case-insensitive and returns only repositories that match at least one keyword; it is called from within the scrape() method shown earlier.
Leveraging Firecrawl's capabilities
What makes this scraper particularly powerful is how it uses Firecrawl's structured extraction capabilities. Instead of writing complex CSS selectors or XPath queries that might break when GitHub updates its UI, we simply:
- Define our data schema using Pydantic models
- Pass this schema to Firecrawl
- Provide a simple prompt describing what we want to extract
Firecrawl handles the complexities of finding the right elements, extracting the data, and returning it in our desired format. This approach is much more resilient to website changes and requires significantly less code than traditional web scraping methods.
The Notifier Module (notifier.py)
After repositories are scraped and filtered, the notifier.py module handles the delivery of this information to users. This module converts repository data into structured Slack messages that provide context and relevant information in a readable format.
The notifier module serves as the communication component in the application's architecture. It transforms repository data into notifications, allowing team members to quickly review trending projects without needing to process raw data.
SlackNotifier class
The core component of this module is the SlackNotifier class:
class SlackNotifier:
"""Notifier for sending repository updates to Slack"""
def __init__(self, config: NotificationConfig):
self.config = config
The class is initialized with a NotificationConfig object containing the webhook URL and scheduling preferences. This follows the application's pattern of passing configuration objects between components, maintaining separation of concerns.
Sending notifications
The main method is send_notification(), which manages the notification process:
def send_notification(
self, repositories: List[Dict[str, Any]], search_terms: List[str]
) -> bool:
"""Send a notification with trending repositories to Slack"""
if not repositories:
return False
if not self.config.webhook_url:
print("No webhook URL configured")
return False
# Create the message payload
payload = self._create_message_payload(repositories, search_terms)
try:
# Send the message to Slack
response = requests.post(
self.config.webhook_url,
data=json.dumps(payload),
headers={"Content-Type": "application/json"},
)
if response.status_code != 200:
print(
f"Failed to send notification: {response.status_code} {response.text}"
)
return False
return True
except Exception as e:
print(f"Error sending notification: {str(e)}")
return False
This method performs several key functions:
- Verifies repositories exist to notify about
- Validates webhook URL configuration
- Creates a formatted message payload
- Sends a POST request to the Slack webhook
- Handles any errors during transmission
- Returns success or failure status
The method includes error handling and logging, returning a boolean value to indicate success or failure, allowing the caller to respond appropriately.
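Because the notifier only depends on its configuration object, it can also be exercised on its own. The sketch below shows a hypothetical one-off send, using a repository dict in the processed format produced by the scraper; the webhook URL is a placeholder:

# hypothetical manual notification
from src.config import NotificationConfig
from src.notifier import SlackNotifier

notifier = SlackNotifier(NotificationConfig(
    webhook_url="https://hooks.slack.com/services/T00000000/B00000000/XXXX",  # placeholder
    frequency="daily",
))
repos = [{"name": "huggingface/transformers", "url": "https://github.com/huggingface/transformers",
          "description": "State-of-the-art NLP", "stars": "120k", "today_stars": "350",
          "language": "Python", "rank": 1}]
print(notifier.send_notification(repos, search_terms=["nlp"]))  # True if Slack accepted the message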
Creating message payload
The _create_message_payload method structures messages for Slack in a readable format. The implementation can be analyzed in several parts:
First, the method establishes the current time and builds the initial message structure:
def _create_message_payload(
self, repositories: List[Dict[str, Any]], search_terms: List[str]
) -> Dict[str, Any]:
"""Create a formatted Slack message with repository information"""
now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
# Create the message blocks
blocks = [
{
"type": "header",
"text": {
"type": "plain_text",
"text": f"đ GitHub Trending Update: {', '.join(search_terms)}",
},
},
{
"type": "context",
"elements": [
{
"type": "plain_text",
"text": f"Found {len(repositories)} trending repositories ⢠{now}",
}
],
},
{"type": "divider"},
]
This creates a header with search terms, a context line showing the number of repositories found and timestamp, and a divider to separate the header from repository content.
Next, the method iterates through repositories and adds blocks for each one:
# Add repository information
for repo in repositories:
star_info = f"⭐ {repo.get('stars', 'N/A')}"
if repo.get("today_stars"):
star_info += f" (+{repo.get('today_stars')} today)"
blocks.extend([
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": f"*<{repo['url']}|{repo['name']}>*\n{repo.get('description', 'No description')}",
},
},
For each repository, the method formats star information, including today's increase when available. It creates a section block with the repository name as a clickable link, followed by its description.
The method continues by adding context about each repository:
{
"type": "context",
"elements": [
{
"type": "mrkdwn",
"text": f"{star_info} ⢠Rank: #{repo.get('rank', 'N/A')} ⢠Language: {repo.get('language', 'Unknown')}",
}
],
},
{"type": "divider"},
])
This context block presents metrics in a compact format: star count, trending rank, and programming language. A divider separates each repository entry.
Finally, the method adds a footer and returns the complete structure:
# Add footer
blocks.append(
{
"type": "context",
"elements": [
{"type": "mrkdwn", "text": "Powered by Open-source Watch"}
],
}
)
return {"blocks": blocks}
The footer includes attribution to the application. The complete structure is wrapped in a dictionary with a "blocks" key, conforming to Slack's API expectations.
This structured approach creates a consistent notification format that accommodates varying amounts of content while maintaining readability.
Slack message structure
The message structure demonstrates effective conversion of technical data into informative content. The key components include:
Header and context
{
"type": "header",
"text": {
"type": "plain_text",
"text": f"đ GitHub Trending Update: {', '.join(search_terms)}",
},
},
{
"type": "context",
"elements": [
{
"type": "plain_text",
"text": f"Found {len(repositories)} trending repositories ⢠{now}",
}
],
}
This creates a header with the search terms and an emoji indicator, followed by a context line showing the repository count and timestamp. This provides users with immediate context about the notification's content and timing.
Repository blocks
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": f"*<{repo['url']}|{repo['name']}>*\n{repo.get('description', 'No description')}",
},
},
{
"type": "context",
"elements": [
{
"type": "mrkdwn",
"text": f"{star_info} ⢠Rank: #{repo.get('rank', 'N/A')} ⢠Language: {repo.get('language', 'Unknown')}",
}
],
}
Each repository is presented with a section containing its name as a clickable link and description, followed by metrics in smaller text: star count, ranking, and programming language. This provides both descriptive and quantitative information in a concise format.
The use of Slack's block structure creates a consistent notification layout that maintains readability regardless of content volume.
Notification effectiveness
The notifications are designed to be informative and functional. By including repository links and contextual metrics like star counts and rankings, users can make informed decisions about which repositories warrant further investigation.
This approach to notification design prioritizes user needs, organizing information in a way that facilitates efficient assessment. It converts raw data into structured content that helps users quickly understand the significance of trending repositories.
The notifier module completes the application's data pipeline, delivering filtered repository data to users in a structured format. This demonstrates how appropriate formatting can enhance the utility of technical information, making it accessible and useful to team members.
The Scheduler Module (scheduler.py)
After setting up data collection with the scraper module and delivery via the notifier module, we need a mechanism to automate these processes. The scheduler.py module provides this functionality, offering a background task scheduling system that runs operations at specified intervals without requiring manual intervention.
The scheduler module serves as the automation engine of the OS-watch application, allowing users to set up periodic checks for trending repositories based on their preferred frequency. It handles the complexities of time-based execution while maintaining state between application restarts.
Scheduler class
The core component of this module is the Scheduler class, which manages background tasks:
class Scheduler:
"""Background task scheduler for periodic repository checks"""
def __init__(self, config: NotificationConfig, task_function=None):
self.config = config
self.task_function = task_function
self.running = False
self.thread = None
self.last_run_time = None
self.next_run_time = None
# Load any previous state
self._load_state()
The class is initialized with a NotificationConfig object that specifies the frequency and time for scheduled runs, as well as a function to execute at each scheduled interval. The initialization sets up internal tracking variables and loads any existing state from disk.
State management
Maintaining scheduler state between application restarts is crucial for reliability. Two methods handle this persistence:
def _load_state(self):
"""Load the scheduler state from disk if available"""
try:
if os.path.exists(SCHEDULER_STATE_FILE):
with open(SCHEDULER_STATE_FILE, 'rb') as f:
state = pickle.load(f)
self.last_run_time = state.get('last_run_time')
self.next_run_time = state.get('next_run_time')
except Exception as e:
print(f"Error loading scheduler state: {str(e)}")
def _save_state(self):
"""Save the current scheduler state to disk"""
try:
state = {
'last_run_time': self.last_run_time,
'next_run_time': self.next_run_time
}
os.makedirs(os.path.dirname(SCHEDULER_STATE_FILE), exist_ok=True)
with open(SCHEDULER_STATE_FILE, 'wb') as f:
pickle.dump(state, f)
except Exception as e:
print(f"Error saving scheduler state: {str(e)}")
The _load_state() method retrieves the previous execution times from a pickle file if it exists, while _save_state() writes the current state to disk. This approach ensures that if the application restarts, it maintains awareness of previous executions and doesn't disrupt the configured schedule.
Starting and stopping the scheduler
The scheduler provides methods to start and stop background processing:
def start(self):
"""Start the scheduler in a background thread"""
if self.running:
return False
if not self.task_function:
print("No task function provided")
return False
self.running = True
self.thread = threading.Thread(target=self._run_scheduler, daemon=True)
self.thread.start()
return True
def stop(self):
"""Stop the scheduler thread"""
self.running = False
if self.thread:
self.thread = None
return True
The start() method creates a daemon thread that runs in the background without blocking the main application. It performs validation to ensure a task function is provided and that the scheduler isn't already running. The stop() method safely terminates the scheduler thread by setting the running flag to False.
Creating the scheduler as a daemon thread is particularly important for a web application like OS-watch, as it allows the scheduler to run independently of the user interface while still terminating automatically when the main application stops.
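Putting the pieces together, a wiring sketch for the scheduler might look like the following. The task function name is illustrative (the real wiring lives in app.py), but the call pattern follows the class definitions shown in this article and assumes your .env file is configured:

# hypothetical wiring of scraper, notifier, and scheduler
from dotenv import load_dotenv
from src.config import AppConfig
from src.scraper import GitHubTrendScraper
from src.notifier import SlackNotifier
from src.scheduler import Scheduler

load_dotenv()
config = AppConfig.load_from_env()

def check_trending():
    repos = GitHubTrendScraper(config.search).scrape()
    if repos:
        SlackNotifier(config.notification).send_notification(repos, config.search.keywords)

scheduler = Scheduler(config.notification, task_function=check_trending)
scheduler.start()  # runs in a daemon thread; call scheduler.stop() to end it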
Scheduler loop
The core scheduling logic resides in the _run_scheduler() method:
def _run_scheduler(self):
"""Main scheduler loop that runs in a background thread"""
# Calculate next run time if not already set
if not self.next_run_time:
self.next_run_time = self._calculate_next_run_time()
self._save_state()
while self.running:
now = datetime.now()
# Check if it's time to run
if now >= self.next_run_time:
try:
# Execute the task
self.task_function()
self.last_run_time = now
except Exception as e:
print(f"Error executing scheduled task: {str(e)}")
# Calculate the next run time
self.next_run_time = self._calculate_next_run_time()
# Save the updated state
self._save_state()
# Sleep for a short time before checking again
time.sleep(10) # Check every 10 seconds
This method implements a continuous loop that:
- Calculates the next run time if not already defined
- Periodically checks the current time against the scheduled time
- Executes the task when the scheduled time arrives
- Updates the run time records and saves state
- Calculates the next execution time
The loop incorporates a sleep interval to prevent excessive CPU usage, checking the schedule every 10 seconds rather than continuously.
Calculating run times
A key part of the scheduler is determining when to execute tasks based on the configured frequency:
def _calculate_next_run_time(self):
"""Calculate the next time to run based on frequency configuration"""
now = datetime.now()
if self.config.frequency == "hourly":
# Run at the beginning of the next hour
next_run = now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
elif self.config.frequency == "daily":
# Parse the time of day from configuration
try:
hour, minute = map(int, self.config.time_of_day.split(':'))
next_run = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
# If this time has already passed today, schedule for tomorrow
if next_run <= now:
next_run += timedelta(days=1)
except ValueError:
# Default to 9 AM if time format is invalid
next_run = now.replace(hour=9, minute=0, second=0, microsecond=0)
if next_run <= now:
next_run += timedelta(days=1)
elif self.config.frequency == "weekly":
# Run weekly on Monday at the specified time
try:
hour, minute = map(int, self.config.time_of_day.split(':'))
# Calculate days until next Monday
days_until_monday = (7 - now.weekday()) % 7
if days_until_monday == 0 and now.hour > hour:
days_until_monday = 7
next_run = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
next_run += timedelta(days=days_until_monday)
except ValueError:
# Default to Monday at 9 AM
days_until_monday = (7 - now.weekday()) % 7
next_run = now.replace(hour=9, minute=0, second=0, microsecond=0)
next_run += timedelta(days=days_until_monday)
else:
# Default to daily at 9 AM if frequency is unknown
next_run = now.replace(hour=9, minute=0, second=0, microsecond=0)
if next_run <= now:
next_run += timedelta(days=1)
return next_run
This method handles three different frequencies:
- Hourly: Runs at the beginning of each hour
- Daily: Runs at a specific time each day, defaulting to 9 AM if not specified
- Weekly: Runs on Monday at a specific time
For each frequency, the method calculates the next appropriate execution time based on the current time. It includes error handling for invalid time formats and ensures that if the scheduled time has already passed, it calculates the next occurrence rather than attempting to run for a time in the past.
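A quick worked example with a hypothetical clock makes the weekly case easier to follow: with frequency set to weekly and time_of_day set to 09:00, a check on a Wednesday afternoon lands on the following Monday morning.

# worked example of the weekly calculation (dates are hypothetical)
from datetime import datetime, timedelta

now = datetime(2024, 5, 15, 14, 30)           # a Wednesday, so now.weekday() == 2
days_until_monday = (7 - now.weekday()) % 7   # (7 - 2) % 7 == 5
next_run = now.replace(hour=9, minute=0, second=0, microsecond=0) + timedelta(days=days_until_monday)
print(next_run)  # 2024-05-20 09:00:00 - the next Monday at 9 AM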
Scheduler information
The scheduler provides an interface for the application to query its status:
def get_next_run_info(self):
"""Get information about the scheduler status"""
now = datetime.now()
if not self.next_run_time:
return {
"next_run": "Not scheduled",
"last_run": "Never",
"time_until": "Unknown"
}
# Calculate time until next run
time_diff = self.next_run_time - now
hours, remainder = divmod(time_diff.total_seconds(), 3600)
minutes, seconds = divmod(remainder, 60)
formatted_last_run = self.last_run_time.strftime("%Y-%m-%d %H:%M:%S") if self.last_run_time else "Never"
return {
"next_run": self.next_run_time.strftime("%Y-%m-%d %H:%M:%S"),
"last_run": formatted_last_run,
"time_until": f"{int(hours)}h {int(minutes)}m {int(seconds)}s"
}
This method returns a dictionary with information about:
- The next scheduled run time
- The last time a task was executed
- The time remaining until the next execution
This information is particularly useful for the user interface, allowing users to see when the next check for trending repositories will occur without needing to understand the scheduling logic.
Thread safety and persistence
The scheduler module is designed with several important technical considerations:
- Thread safety: Running as a daemon thread ensures that the scheduler operates independently of the main application thread, preventing UI blocking while still terminating automatically when the application stops.
- Persistence: By saving its state to disk, the scheduler maintains continuity even if the application restarts. This ensures that scheduled notifications remain predictable and consistent.
- Error handling: The scheduler incorporates robust error handling to prevent task failures from crashing the entire scheduler thread. Errors are logged but don't disrupt the scheduling process.
- Configuration flexibility: The scheduler adapts to different frequencies and times specified in the configuration, providing users with flexibility in how often they receive notifications.
This combination of features makes the scheduler module a reliable automation component that operates quietly in the background while providing valuable functionality to the overall application.
The Streamlit Interface (app.py)
Now that we've explored the core functionality modules of our application, let's take a look at how everything comes together in the user interface. The app.py module provides a Streamlit-based UI that connects all the components and gives users an intuitive way to interact with the application.
Streamlit makes it straightforward to create web applications in Python without needing to understand complex web frameworks. The OS-watch interface is designed with three main tabs that organize the functionality logically.
Application structure
The application is built around a tabbed interface with three main sections:
tab1, tab2, tab3 = st.tabs(["Search", "Configure", "Results"])
- Search tab - Allows users to search for trending repositories in real-time and manage the scheduler
- Configure tab - Provides interfaces for setting up Slack notifications and testing them
- Results tab - Displays the most recent search results in an expandable format
This organization makes the application intuitive to navigate while maintaining clear separation between different functions.
Component integration
The app.py module serves as the orchestrator that brings together all the other components:
# Import our modules with absolute imports instead of relative imports
from src.config import AppConfig, NotificationConfig, SearchConfig, DEFAULT_CONFIG
from src.scraper import GitHubTrendScraper
from src.notifier import SlackNotifier
from src.scheduler import Scheduler
When a user submits a search, the application:
- Updates the configuration based on user inputs
- Creates a scraper instance with the current search configuration
- Runs the scraper to fetch trending repositories
- Optionally sends notifications via the notifier
- Displays the results in the UI
Similarly, when a user starts the scheduler, the application creates and manages a background task that periodically performs these operations according to the configured schedule.
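As a sketch of how that looks in Streamlit (the exact widgets in app.py may differ; run_scrape_task comes from the snippets shown later in this section, and the session-state keys match those initialized below):

# hypothetical start-scheduler handler inside app.py
import streamlit as st
from src.scheduler import Scheduler

if st.button("Start scheduled monitoring"):
    scheduler = Scheduler(st.session_state.config.notification, task_function=run_scrape_task)
    if scheduler.start():
        st.session_state.scheduler = scheduler
        st.session_state.is_scheduled = True
        info = scheduler.get_next_run_info()
        st.success(f"Scheduler running - next check at {info['next_run']}")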
State management
Streamlit applications rely on session state to maintain information between user interactions. OS-watch uses session state to store:
# Initialize session state
if "config" not in st.session_state:
st.session_state.config = DEFAULT_CONFIG
if "scheduler" not in st.session_state:
st.session_state.scheduler = Scheduler(st.session_state.config.notification)
if "last_results" not in st.session_state:
st.session_state.last_results = []
if "is_scheduled" not in st.session_state:
st.session_state.is_scheduled = False
- The current application configuration
- The scheduler instance
- The most recent search results
- The current scheduler status
This state management approach allows the application to maintain consistency across different user interactions and page refreshes.
User workflow
From the user's perspective, the application provides a straightforward workflow:
- Configure search parameters (keywords, language, time period)
- Run a search or set up scheduled notifications
- Configure notification settings (webhook URL, frequency, time)
- View and explore the search results
The interface is designed to be intuitive, with forms for input collection and clear feedback after operations complete:
if submitted:
# Update the configuration
keywords_list = [k.strip() for k in keywords.split(",") if k.strip()]
lang = None if language == "All" else language
st.session_state.config.search = SearchConfig(
keywords=keywords_list, language=lang, time_period=time_period
)
# Run the search
with st.spinner("Searching for trending repositories..."):
results = run_scrape_task()
# Show success message
if results:
st.success(f"Found {len(results)} trending repositories matching your keywords!")
else:
st.warning("No trending repositories found matching your keywords.")
This approach gives users immediate feedback and makes the application feel responsive despite the potentially time-consuming background operations.
The Streamlit interface serves its purpose effectively: providing an accessible way to interact with the core functionality without requiring the user to understand the underlying code. By leveraging Streamlit's simple but powerful components, OS-watch delivers a clean, intuitive interface that makes monitoring GitHub trends straightforward for technical and non-technical users alike.
Conclusion
OS-watch demonstrates how modern Python libraries and APIs can be combined to create a powerful tool for monitoring open-source trends. By using Firecrawl for web scraping, Streamlit for the user interface, and a custom scheduler for automation, we've built an application that provides real value with relatively little code.
The modular design makes it easy to extend - for example, by adding additional notification channels or data sources. The applicationâs architecture follows good practices for configuration management, separation of concerns, and persistent state.
What makes OS-watch particularly powerful is its use of Firecrawl's structured extraction capabilities. Traditional web scraping approaches often break when websites change their layouts or styling, but Firecrawl's AI-powered extraction provides resilience against these changes. By defining our data schema with Pydantic models and passing it to Firecrawl, we can extract precisely the information we need without writing complex CSS selectors or XPath queries.
Try Firecrawl for your projects
If you're building applications that need to extract data from the web, Firecrawl can significantly simplify your development process:
- Get started for free - Sign up at firecrawl.dev and explore the platform with the free tier
- Explore the documentation - Learn about structured extraction, web crawling, and more in the comprehensive docs
- Join the community - Connect with other developers using Firecrawl to share tips and get support
Next steps
Here are some potential enhancements for the OS-watch application:
- Add additional data sources beyond GitHub trending
- Implement more notification channels (email, Discord, etc.)
- Create a more sophisticated filtering system
- Add data visualization for trends over time
- Implement user authentication and multi-user support
We hope this tutorial has helped you understand both the application itself and the powerful combination of technologies it uses. By combining Firecrawl, Streamlit, and Python's rich ecosystem, you can build web applications that deliver real value to your users or organization.
About the Author

Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics.