Building an Open-Source Project Monitoring Tool with Firecrawl and Streamlit

In this article, we'll explore the OS-watch application - a tool designed to help teams and individuals stay informed about trending open-source projects. The application scrapes GitHub's trending repositories page, filters results based on user criteria, and delivers notifications via Slack on a scheduled basis.
Application overview
OS-watch is a Python application that monitors trending projects on GitHub and sends notifications to Slack. The program searches GitHub's trending page, collects information about repositories that match specific criteria, and delivers custom alerts when it finds something interesting. Users can search by keywords like "machine learning" or "web development," filter by programming language, and set up automated checks that run on a schedule.
The application uses Firecrawl to handle the web scraping part, which means it can extract data from GitHub's website without dealing with complex HTML parsing. Streamlit provides the user interface, making it easy to interact with the application through a web browser. The notification system connects to Slack through webhooks, delivering formatted messages with repository details directly to your chosen channel.
Key components
The application is built using several Python modules that work together:
- config.py - Manages application settings and configuration through Pydantic models
- scraper.py - Connects to GitHub using Firecrawl and extracts repository information
- notifier.py - Creates and sends formatted notifications to Slack
- scheduler.py - Runs tasks in the background on a specified schedule
- app.py - Provides the user interface built with Streamlit
- run_app.py - Starts the application with the correct settings
Each module handles a specific part of the application's functionality. The modular design makes the code easier to maintain and update. Users interact with the Streamlit interface to set search parameters, while the scheduler runs in the background to check for new trending repositories at regular intervals.
Setting up the environment
Now that we understand what OS-watch does and how it's structured, let's set up everything you need to run the application. This process involves obtaining API keys, configuring your Slack workspace, and setting up your local development environment.
Prerequisites
To run OS-watch, you'll need:
- Python 3.10 or higher: The application uses modern Python features
- Poetry: A dependency management tool that makes installation easier
- Firecrawl API key: Required for web scraping GitHub's trending page
- Slack workspace: With permissions to create incoming webhooks
Installation steps
Getting OS-watch up and running involves a few straightforward steps. First, clone the repository to your local machine:
git clone https://github.com/yourusername/os-watch.git
cd os-watch
Next, if you don't already have Poetry installed, you'll need to install it:
curl -sSL https://install.python-poetry.org | python3 -
With Poetry available, you can now install all the dependencies:
poetry install
After installation, create your environment configuration file by making a copy of the example file:
cp .env.example .env
Finally, activate the virtual environment created by Poetry:
poetry shell
With these steps completed, you'll have the basic structure in place. However, before running the application, you'll need to set up the required services.
Getting a Firecrawl API key
Firecrawl is the web scraping service that powers OS-watch's ability to extract structured data from GitHub. To get an API key:
- Visit Firecrawl's website and create an account
- After signing in, navigate to your account dashboard
- Look for the "API Keys" section
- Click "Generate New API Key"
- Copy the key - it should start with "fc-"
- Add this key to your .env file as FIRECRAWL_API_KEY=fc-your_key_here
This API key allows OS-watch to make requests to Firecrawl's service, which handles the complex task of scraping GitHub's trending pages and extracting the repository information in a structured format.
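If you want to confirm the key works before wiring up the full application, a quick standalone check is enough. The snippet below is a minimal sketch (not part of OS-watch itself) that assumes the firecrawl-py and python-dotenv packages are installed and that your .env file contains the key; it mirrors the client call pattern used in scraper.py later in this article, though the exact response shape may vary between SDK versions:

# verify_firecrawl.py - hypothetical sanity check for your API key
import os
from dotenv import load_dotenv
from firecrawl import FirecrawlApp

load_dotenv()  # reads FIRECRAWL_API_KEY from .env into the environment
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
# Request a simple markdown scrape of the trending page
result = app.scrape_url("https://github.com/trending", params={"formats": ["markdown"]})
print(str(result)[:500])  # print the start of the response to confirm the key works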
Setting up Slack webhooks
For the notification feature to work, OS-watch needs a way to send messages to your Slack workspace. This is done through Slack's incoming webhooks:
- Log in to your Slack workspace
- Visit api.slack.com/apps and click "Create New App"
- Choose "From scratch" and give your app a name like "OS-watch"
- Select the workspace where you want to receive notifications
- In the left sidebar, click "Incoming Webhooks"
- Toggle "Activate Incoming Webhooks" to ON
- Click "Add New Webhook to Workspace"
- Select the channel where you want notifications to appear
- Click "Allow" to authorize the webhook
- Copy the webhook URL (it looks like https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX)
- Add this URL to your .env file as SLACK_WEBHOOK_URL=your_webhook_url_here
With the webhook configured, OS-watch will be able to send rich, formatted messages about trending repositories directly to your specified Slack channel.
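Before running the full application, you may want to confirm that the webhook actually posts to your channel. The following is a minimal sketch using the requests library; the URL is a placeholder that you would replace with the one you just copied:

# test_webhook.py - hypothetical one-off webhook test
import requests

webhook_url = "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXX"  # replace with your URL
response = requests.post(webhook_url, json={"text": "OS-watch webhook test"})
print(response.status_code, response.text)  # Slack returns 200 and "ok" on success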
Environment configuration
The final step in setting up OS-watch is configuring the .env file, which controls how the application operates. Let's look at each setting in detail:
- FIRECRAWL_API_KEY: Your personal API key for accessing Firecrawl services
  - Example: FIRECRAWL_API_KEY=fc-abcd1234
  - Required for the application to function
- SLACK_WEBHOOK_URL: The URL where OS-watch sends notifications
  - Example: SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXX
  - Required for notifications, but you can run searches without it
- NOTIFICATION_FREQUENCY: How often to check for trending repositories
  - Options: hourly, daily, or weekly
  - Example: NOTIFICATION_FREQUENCY=daily
  - Default: daily
- NOTIFICATION_TIME: When to send notifications (for daily/weekly frequencies)
  - Format: 24-hour time (HH:MM)
  - Example: NOTIFICATION_TIME=09:00 (for 9:00 AM)
  - Default: 09:00
- SEARCH_KEYWORDS: Terms to look for in repository names and descriptions
  - Format: Comma-separated list
  - Example: SEARCH_KEYWORDS=machine learning,ai,nlp
  - Default: python,ml,ai
- SEARCH_LANGUAGE: Programming language to filter by
  - Example: SEARCH_LANGUAGE=Python
  - Leave empty to include all languages
  - Default: empty (all languages)
- SEARCH_PERIOD: Time frame for trending repositories
  - Options: daily, weekly, or monthly
  - Example: SEARCH_PERIOD=weekly
  - Default: daily
These settings allow you to customize how OS-watch operates without changing any code. For example, you might set SEARCH_KEYWORDS=rust,webassembly if you're interested in Rust and WebAssembly projects, or NOTIFICATION_FREQUENCY=weekly if you prefer less frequent updates.
A complete .env file might look like this:
FIRECRAWL_API_KEY=fc-your_api_key_here
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXX
NOTIFICATION_FREQUENCY=daily
NOTIFICATION_TIME=09:00
SEARCH_KEYWORDS=python,ml,ai,llm
SEARCH_LANGUAGE=
SEARCH_PERIOD=daily
With your environment fully configured, you're now ready to start the application with:
python run_app.py
This command will launch the Streamlit server and open the application in your browser at http://localhost:8501, where you can begin exploring trending repositories and setting up your automated monitoring.
The Configuration Module (config.py)
With our environment set up and configured, let's examine how OS-watch actually manages these settings through its configuration module. The config.py file serves as the central nervous system of our application, translating environment variables into structured data models that the rest of the application can easily consume.
At the heart of this module is Pydantic, a data validation library that enforces type hints at runtime. This gives us both the flexibility of Python and the safety of strong type checking, ensuring our application runs reliably even when configuration values change.
Data models
The configuration module defines several Pydantic models that represent different aspects of our application:
GitHubRepository model
This model represents an individual GitHub repository with all its relevant properties:
class GitHubRepository(BaseModel):
"""Model for a GitHub repository extracted from trending page"""
name: str = Field(description="Full name of the repository (owner/repo)")
description: Optional[str] = Field(None, description="Repository description")
language: Optional[str] = Field(None, description="Main programming language")
stars_count: Optional[str] = Field(None, description="Total number of stars")
stars_today: Optional[str] = Field(None, description="Stars gained today")
forks_count: Optional[str] = Field(None, description="Total number of forks")
repo_owner: Optional[str] = Field(None, description="Repository owner")
repo_url: Optional[str] = Field(None, description="Repository URL")
This model not only defines the structure of repository data but also documents each field through the Field descriptor, making the code self-documenting. Notice how some fields are marked as Optional, allowing for flexibility when data might be incomplete.
The Repositories class serves as a simple wrapper that contains a list of GitHubRepository objects:
class Repositories(BaseModel):
"""Wrapper model for a list of GitHub repositories"""
repositories: List[GitHubRepository]
This structure is particularly important for our Firecrawl integration, as it defines the schema used for structured data extraction. The Pydantic model is converted to a JSON Schema and passed to Firecrawl, whose AI extraction engine uses the field descriptions to locate the parts of the page containing the information we want.
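If you are curious what Firecrawl actually receives, you can print the generated schema yourself. This is an illustrative sketch that assumes the models live in src/config.py as described in this section:

# inspect_schema.py - hypothetical helper to view the extraction schema
import json
from src.config import Repositories

schema = Repositories.model_json_schema()  # Pydantic v2 converts the model to JSON Schema
print(json.dumps(schema, indent=2))        # includes each field's description from Field(...)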
NotificationConfig model
This model handles all settings related to Slack notifications:
class NotificationConfig(BaseModel):
"""Configuration for notifications"""
webhook_url: str
frequency: str # "hourly", "daily", "weekly"
time_of_day: Optional[str] = "09:00" # For daily/weekly notifications
These fields directly correspond to the environment variables we set up earlier, controlling how and when notifications are sent.
SearchConfig model
This model manages parameters for GitHub trending searches:
class SearchConfig(BaseModel):
"""Configuration for GitHub trend searches"""
keywords: List[str]
language: Optional[str] = None
time_period: str = "daily" # "daily", "weekly", "monthly"
Again, these fields align with our environment configuration, defining what repositories we're interested in finding.
AppConfig model
Finally, the main configuration class combines the notification and search settings:
class AppConfig(BaseModel):
"""Main application configuration"""
notification: NotificationConfig
search: SearchConfig
This hierarchical structure keeps our configuration organized and makes it easy to access related settings together.
Loading configuration from environment
The key bridge between the environment variables we set up earlier and our application is the load_from_env() class method in AppConfig:
@classmethod
def load_from_env(cls):
"""Load configuration from environment variables"""
return cls(
notification=NotificationConfig(
webhook_url=os.environ.get("SLACK_WEBHOOK_URL", ""),
frequency=os.environ.get("NOTIFICATION_FREQUENCY", "daily"),
time_of_day=os.environ.get("NOTIFICATION_TIME", "09:00"),
),
search=SearchConfig(
keywords=os.environ.get("SEARCH_KEYWORDS", "python,ml,ai").split(","),
language=os.environ.get("SEARCH_LANGUAGE", None),
time_period=os.environ.get("SEARCH_PERIOD", "daily"),
),
)
This method reads the environment variables we configured in our .env file and converts them into the appropriate data types for our models. For example, it transforms the comma-separated keyword string into a proper Python list. It also provides sensible default values for any missing configuration, ensuring the application can run even with incomplete settings.
Default configuration
To handle cases where environment variables might not be set, the module defines a DEFAULT_CONFIG object:
DEFAULT_CONFIG = AppConfig(
notification=NotificationConfig(
webhook_url="",
frequency="daily",
),
search=SearchConfig(
keywords=["python", "ml", "ai"],
time_period="daily",
),
)
This default configuration allows the application to start with reasonable settings out of the box. New users can immediately begin using the application without configuring every detail, and they can gradually customize their experience as they become more familiar with the tool.
The configuration module's well-structured approach makes the rest of the application simpler and more maintainable. By centralizing all configuration management in one place, other modules can focus on their specific responsibilities without worrying about how to access or validate configuration values. This separation of concerns is a key design principle that makes OS-watch both robust and easy to extend.
The Scraper Module (scraper.py)
The scraper module bridges the gap between our application and the GitHub website, converting raw HTML into structured data that follows our Pydantic models. By using Firecrawl's structured extraction features, we avoid the fragility of traditional web scrapers that break when sites change their layouts or styling.
GitHubTrendScraper class
The main component of this module is the GitHubTrendScraper class, which encapsulates all scraping functionality:
class GitHubTrendScraper:
"""Scraper for GitHub trending repositories"""
def __init__(self, config: SearchConfig):
self.config = config
# Get API key from environment variables
self.api_key = os.environ.get("FIRECRAWL_API_KEY", "")
# Initialize FirecrawlApp with API key
self.firecrawl_app = FirecrawlApp(api_key=self.api_key)
During initialization, the scraper takes a SearchConfig object (from our config module) and sets up the Firecrawl client using the API key from our environment variables.
Building the URL
The first step in scraping is constructing the correct URL to fetch:
def build_url(self) -> str:
"""Build the GitHub trending URL based on configuration"""
url = "https://github.com/trending"
# Add language filter if specified
if self.config.language and self.config.language.lower() != "all":
url += f"/{self.config.language.lower()}"
# Add time period
url += f"?since={self.config.time_period}"
return url
This method dynamically builds the GitHub trending URL based on the user's configuration. If a specific programming language is selected, it's added to the path. The time period (daily, weekly, or monthly) is added as a query parameter.
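For instance, with a Python language filter and a weekly period the method produces https://github.com/trending/python?since=weekly. The short sketch below illustrates this with the article's own classes; it assumes FIRECRAWL_API_KEY is set in your environment, since the constructor also initializes the Firecrawl client:

# hypothetical illustration of build_url()
from src.config import SearchConfig
from src.scraper import GitHubTrendScraper

scraper = GitHubTrendScraper(SearchConfig(keywords=["ai"], language="Python", time_period="weekly"))
print(scraper.build_url())  # https://github.com/trending/python?since=weekly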
The scraping process
The main scrape() method orchestrates the entire scraping process:
def scrape(self) -> List[Dict[str, Any]]:
"""Scrape GitHub trending repositories using structured extraction"""
url = self.build_url()
# Use Firecrawl to scrape the trending page with structured extraction
try:
# Call Firecrawl with structured extraction as shown in the notebook
result = self.firecrawl_app.scrape_url(
url,
params={
"formats": ["extract"],
"extract": {
"prompt": "Scrape the GitHub trending page and extract the repositories based on the schema provided.",
"schema": Repositories.model_json_schema(),
},
},
)
# Check if we got structured data back
if "extract" in result and "repositories" in result["extract"]:
# Convert the extracted data to our standard dictionary format
repositories = self._process_extracted_repos(
result["extract"]["repositories"]
)
# Filter repositories based on keywords
filtered_repos = self._filter_by_keywords(repositories)
return filtered_repos
else:
print("Structured extraction failed, no repository data found")
return []
except Exception as e:
print(f"Error scraping GitHub trending page: {str(e)}")
return []
This method:
- Builds the correct URL based on configuration
- Makes a request to Firecrawl's API with structured extraction parameters
- Processes the extracted repositories into a standardized format
- Filters the repositories based on the user's keywords
- Returns the filtered list of repositories
Notice how we pass our Pydantic schema (Repositories.model_json_schema()) to Firecrawl's extraction engine. This tells Firecrawl exactly what data we want and how it should be structured. The prompt provides additional context to guide the extraction process.
Processing extracted repositories
Once we have the raw data from Firecrawl, we need to process it into a consistent format:
def _process_extracted_repos(
self, extracted_repos: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
"""Process the structured extracted repositories into our standard format"""
processed_repos = []
for i, repo in enumerate(extracted_repos):
# Using the format from the notebook
processed_repo = {
"name": repo.get("name", ""),
"display_name": (
repo.get("name", "").split("/")[-1]
if "/" in repo.get("name", "")
else repo.get("name", "")
),
"url": repo.get(
"repo_url", f"https://github.com/{repo.get('name', '')}"
),
"description": repo.get("description", ""),
"stars": repo.get("stars_count", ""),
"today_stars": repo.get("stars_today", ""),
"language": repo.get("language", self.config.language or "Unknown"),
"rank": i + 1, # Assign rank based on position in the list
"forks": repo.get("forks_count", ""),
"owner": repo.get("repo_owner", ""),
}
processed_repos.append(processed_repo)
return processed_repos
This method transforms the raw extraction data into a consistent format for our application. It handles missing fields gracefully with default values and adds derived information like display_name (the repository name without the owner) and rank (based on position in the trending list).
Filtering by keywords
An additional step is filtering repositories based on the user's keywords:
def _filter_by_keywords(
self, repositories: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
"""Filter repositories based on configured keywords"""
if not self.config.keywords:
return repositories
filtered = []
for repo in repositories:
# Check if any keyword matches in name or description
if any(
k.lower() in repo["name"].lower()
or k.lower() in repo["description"].lower()
for k in self.config.keywords
):
filtered.append(repo)
return filtered
This method checks each repository to see if its name or description contains any of the keywords specified in the configuration. The filtering is case-insensitive and returns only repositories that match at least one keyword; it is called from within the scrape() method shown earlier.
Leveraging Firecrawl's capabilities
What makes this scraper particularly powerful is how it uses Firecrawl's structured extraction capabilities. Instead of writing complex CSS selectors or XPath queries that might break when GitHub updates its UI, we simply:
- Define our data schema using Pydantic models
- Pass this schema to Firecrawl
- Provide a simple prompt describing what we want to extract
Firecrawl handles the complexities of finding the right elements, extracting the data, and returning it in our desired format. This approach is much more resilient to website changes and requires significantly less code than traditional web scraping methods.
The Notifier Module (notifier.py)
After repositories are scraped and filtered, the notifier.py module handles the delivery of this information to users. This module converts repository data into structured Slack messages that provide context and relevant information in a readable format.
The notifier module serves as the communication component in the application's architecture. It transforms repository data into notifications, allowing team members to quickly review trending projects without needing to process raw data.
SlackNotifier class
The core component of this module is the SlackNotifier class:
class SlackNotifier:
"""Notifier for sending repository updates to Slack"""
def __init__(self, config: NotificationConfig):
self.config = config
The class is initialized with a NotificationConfig object containing the webhook URL and scheduling preferences. This follows the application's pattern of passing configuration objects between components, maintaining separation of concerns.
Sending notifications
The main method is send_notification(), which manages the notification process:
def send_notification(
self, repositories: List[Dict[str, Any]], search_terms: List[str]
) -> bool:
"""Send a notification with trending repositories to Slack"""
if not repositories:
return False
if not self.config.webhook_url:
print("No webhook URL configured")
return False
# Create the message payload
payload = self._create_message_payload(repositories, search_terms)
try:
# Send the message to Slack
response = requests.post(
self.config.webhook_url,
data=json.dumps(payload),
headers={"Content-Type": "application/json"},
)
if response.status_code != 200:
print(
f"Failed to send notification: {response.status_code} {response.text}"
)
return False
return True
except Exception as e:
print(f"Error sending notification: {str(e)}")
return False
This method performs several key functions:
- Verifies repositories exist to notify about
- Validates webhook URL configuration
- Creates a formatted message payload
- Sends a POST request to the Slack webhook
- Handles any errors during transmission
- Returns success or failure status
The method includes error handling and logging, returning a boolean value to indicate success or failure, allowing the caller to respond appropriately.
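Because the notifier only depends on its configuration object, it can also be exercised on its own. The sketch below shows a hypothetical one-off send, using a repository dict in the processed format produced by the scraper; the webhook URL is a placeholder:

# hypothetical manual notification
from src.config import NotificationConfig
from src.notifier import SlackNotifier

notifier = SlackNotifier(NotificationConfig(
    webhook_url="https://hooks.slack.com/services/T00000000/B00000000/XXXX",  # placeholder
    frequency="daily",
))
repos = [{"name": "huggingface/transformers", "url": "https://github.com/huggingface/transformers",
          "description": "State-of-the-art NLP", "stars": "120k", "today_stars": "350",
          "language": "Python", "rank": 1}]
print(notifier.send_notification(repos, search_terms=["nlp"]))  # True if Slack accepted the message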
Creating message payload
The _create_message_payload method structures messages for Slack in a readable format. The implementation can be analyzed in several parts:
First, the method establishes the current time and builds the initial message structure:
def _create_message_payload(
self, repositories: List[Dict[str, Any]], search_terms: List[str]
) -> Dict[str, Any]:
"""Create a formatted Slack message with repository information"""
now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
# Create the message blocks
blocks = [
{
"type": "header",
"text": {
"type": "plain_text",
"text": f"đ GitHub Trending Update: {', '.join(search_terms)}",
},
},
{
"type": "context",
"elements": [
{
"type": "plain_text",
"text": f"Found {len(repositories)} trending repositories ⢠{now}",
}
],
},
{"type": "divider"},
]
This creates a header with search terms, a context line showing the number of repositories found and timestamp, and a divider to separate the header from repository content.
Next, the method iterates through repositories and adds blocks for each one:
# Add repository information
for repo in repositories:
star_info = f"⭐ {repo.get('stars', 'N/A')}"
if repo.get("today_stars"):
star_info += f" (+{repo.get('today_stars')} today)"
blocks.extend([
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": f"*<{repo['url']}|{repo['name']}>*\n{repo.get('description', 'No description')}",
},
},
For each repository, the method formats star information, including today's increase when available. It creates a section block with the repository name as a clickable link, followed by its description.
The method continues by adding context about each repository:
{
"type": "context",
"elements": [
{
"type": "mrkdwn",
"text": f"{star_info} ⢠Rank: #{repo.get('rank', 'N/A')} ⢠Language: {repo.get('language', 'Unknown')}",
}
],
},
{"type": "divider"},
])
This context block presents metrics in a compact format: star count, trending rank, and programming language. A divider separates each repository entry.
Finally, the method adds a footer and returns the complete structure:
# Add footer
blocks.append(
{
"type": "context",
"elements": [
{"type": "mrkdwn", "text": "Powered by Open-source Watch"}
],
}
)
return {"blocks": blocks}
The footer includes attribution to the application. The complete structure is wrapped in a dictionary with a "blocks" key, conforming to Slack's API expectations.
This structured approach creates a consistent notification format that accommodates varying amounts of content while maintaining readability.
Slack message structure
The message structure demonstrates effective conversion of technical data into informative content. The key components include:
Header and context
{
"type": "header",
"text": {
"type": "plain_text",
"text": f"đ GitHub Trending Update: {', '.join(search_terms)}",
},
},
{
"type": "context",
"elements": [
{
"type": "plain_text",
"text": f"Found {len(repositories)} trending repositories ⢠{now}",
}
],
}
This creates a header with the search terms and an emoji indicator, followed by a context line showing the repository count and timestamp. This provides users with immediate context about the notification's content and timing.
Repository blocks
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": f"*<{repo['url']}|{repo['name']}>*\n{repo.get('description', 'No description')}",
},
},
{
"type": "context",
"elements": [
{
"type": "mrkdwn",
"text": f"{star_info} ⢠Rank: #{repo.get('rank', 'N/A')} ⢠Language: {repo.get('language', 'Unknown')}",
}
],
}
Each repository is presented with a section containing its name as a clickable link and description, followed by metrics in smaller text: star count, ranking, and programming language. This provides both descriptive and quantitative information in a concise format.
The use of Slack's block structure creates a consistent notification layout that maintains readability regardless of content volume.
Notification effectiveness
The notifications are designed to be informative and functional. By including repository links and contextual metrics like star counts and rankings, users can make informed decisions about which repositories warrant further investigation.
This approach to notification design prioritizes user needs, organizing information in a way that facilitates efficient assessment. It converts raw data into structured content that helps users quickly understand the significance of trending repositories.
The notifier module completes the application's data pipeline, delivering filtered repository data to users in a structured format. This demonstrates how appropriate formatting can enhance the utility of technical information, making it accessible and useful to team members.
The Scheduler Module (scheduler.py)
After setting up data collection with the scraper module and delivery via the notifier module, we need a mechanism to automate these processes. The scheduler.py module provides this functionality, offering a background task scheduling system that runs operations at specified intervals without requiring manual intervention.
The scheduler module serves as the automation engine of the OS-watch application, allowing users to set up periodic checks for trending repositories based on their preferred frequency. It handles the complexities of time-based execution while maintaining state between application restarts.
Scheduler class
The core component of this module is the Scheduler class, which manages background tasks:
class Scheduler:
"""Background task scheduler for periodic repository checks"""
def __init__(self, config: NotificationConfig, task_function=None):
self.config = config
self.task_function = task_function
self.running = False
self.thread = None
self.last_run_time = None
self.next_run_time = None
# Load any previous state
self._load_state()
The class is initialized with a NotificationConfig object that specifies the frequency and time for scheduled runs, as well as a function to execute at each scheduled interval. The initialization sets up internal tracking variables and loads any existing state from disk.
State management
Maintaining scheduler state between application restarts is crucial for reliability. Two methods handle this persistence:
def _load_state(self):
"""Load the scheduler state from disk if available"""
try:
if os.path.exists(SCHEDULER_STATE_FILE):
with open(SCHEDULER_STATE_FILE, 'rb') as f:
state = pickle.load(f)
self.last_run_time = state.get('last_run_time')
self.next_run_time = state.get('next_run_time')
except Exception as e:
print(f"Error loading scheduler state: {str(e)}")
def _save_state(self):
"""Save the current scheduler state to disk"""
try:
state = {
'last_run_time': self.last_run_time,
'next_run_time': self.next_run_time
}
os.makedirs(os.path.dirname(SCHEDULER_STATE_FILE), exist_ok=True)
with open(SCHEDULER_STATE_FILE, 'wb') as f:
pickle.dump(state, f)
except Exception as e:
print(f"Error saving scheduler state: {str(e)}")
The _load_state() method retrieves the previous execution times from a pickle file if it exists, while _save_state() writes the current state to disk. This approach ensures that if the application restarts, it maintains awareness of previous executions and doesn't disrupt the configured schedule.
Starting and stopping the scheduler
The scheduler provides methods to start and stop background processing:
def start(self):
"""Start the scheduler in a background thread"""
if self.running:
return False
if not self.task_function:
print("No task function provided")
return False
self.running = True
self.thread = threading.Thread(target=self._run_scheduler, daemon=True)
self.thread.start()
return True
def stop(self):
"""Stop the scheduler thread"""
self.running = False
if self.thread:
self.thread = None
return True
The start() method creates a daemon thread that runs in the background without blocking the main application. It performs validation to ensure a task function is provided and that the scheduler isn't already running. The stop() method safely terminates the scheduler thread by setting the running flag to False.
Creating the scheduler as a daemon thread is particularly important for a web application like OS-watch, as it allows the scheduler to run independently of the user interface while still terminating automatically when the main application stops.
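Putting the pieces together, a wiring sketch for the scheduler might look like the following. The task function name is illustrative (the real wiring lives in app.py), but the call pattern follows the class definitions shown in this article and assumes your .env file is configured:

# hypothetical wiring of scraper, notifier, and scheduler
from dotenv import load_dotenv
from src.config import AppConfig
from src.scraper import GitHubTrendScraper
from src.notifier import SlackNotifier
from src.scheduler import Scheduler

load_dotenv()
config = AppConfig.load_from_env()

def check_trending():
    repos = GitHubTrendScraper(config.search).scrape()
    if repos:
        SlackNotifier(config.notification).send_notification(repos, config.search.keywords)

scheduler = Scheduler(config.notification, task_function=check_trending)
scheduler.start()  # runs in a daemon thread; call scheduler.stop() to end it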
Scheduler loop
The core scheduling logic resides in the _run_scheduler() method:
def _run_scheduler(self):
"""Main scheduler loop that runs in a background thread"""
# Calculate next run time if not already set
if not self.next_run_time:
self.next_run_time = self._calculate_next_run_time()
self._save_state()
while self.running:
now = datetime.now()
# Check if it's time to run
if now >= self.next_run_time:
try:
# Execute the task
self.task_function()
self.last_run_time = now
except Exception as e:
print(f"Error executing scheduled task: {str(e)}")
# Calculate the next run time
self.next_run_time = self._calculate_next_run_time()
# Save the updated state
self._save_state()
# Sleep for a short time before checking again
time.sleep(10) # Check every 10 seconds
This method implements a continuous loop that:
- Calculates the next run time if not already defined
- Periodically checks the current time against the scheduled time
- Executes the task when the scheduled time arrives
- Updates the run time records and saves state
- Calculates the next execution time
The loop incorporates a sleep interval to prevent excessive CPU usage, checking the schedule every 10 seconds rather than continuously.
Calculating run times
A key part of the scheduler is determining when to execute tasks based on the configured frequency:
def _calculate_next_run_time(self):
"""Calculate the next time to run based on frequency configuration"""
now = datetime.now()
if self.config.frequency == "hourly":
# Run at the beginning of the next hour
next_run = now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
elif self.config.frequency == "daily":
# Parse the time of day from configuration
try:
hour, minute = map(int, self.config.time_of_day.split(':'))
next_run = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
# If this time has already passed today, schedule for tomorrow
if next_run <= now:
next_run += timedelta(days=1)
except ValueError:
# Default to 9 AM if time format is invalid
next_run = now.replace(hour=9, minute=0, second=0, microsecond=0)
if next_run <= now:
next_run += timedelta(days=1)
elif self.config.frequency == "weekly":
# Run weekly on Monday at the specified time
try:
hour, minute = map(int, self.config.time_of_day.split(':'))
# Calculate days until next Monday
days_until_monday = (7 - now.weekday()) % 7
if days_until_monday == 0 and now.hour > hour:
days_until_monday = 7
next_run = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
next_run += timedelta(days=days_until_monday)
except ValueError:
# Default to Monday at 9 AM
days_until_monday = (7 - now.weekday()) % 7
next_run = now.replace(hour=9, minute=0, second=0, microsecond=0)
next_run += timedelta(days=days_until_monday)
else:
# Default to daily at 9 AM if frequency is unknown
next_run = now.replace(hour=9, minute=0, second=0, microsecond=0)
if next_run <= now:
next_run += timedelta(days=1)
return next_run
This method handles three different frequencies:
- Hourly: Runs at the beginning of each hour
- Daily: Runs at a specific time each day, defaulting to 9 AM if not specified
- Weekly: Runs on Monday at a specific time
For each frequency, the method calculates the next appropriate execution time based on the current time. It includes error handling for invalid time formats and ensures that if the scheduled time has already passed, it calculates the next occurrence rather than attempting to run for a time in the past.
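A quick worked example with a hypothetical clock makes the weekly case easier to follow: with frequency set to weekly and time_of_day set to 09:00, a check on a Wednesday afternoon lands on the following Monday morning.

# worked example of the weekly calculation (dates are hypothetical)
from datetime import datetime, timedelta

now = datetime(2024, 5, 15, 14, 30)           # a Wednesday, so now.weekday() == 2
days_until_monday = (7 - now.weekday()) % 7   # (7 - 2) % 7 == 5
next_run = now.replace(hour=9, minute=0, second=0, microsecond=0) + timedelta(days=days_until_monday)
print(next_run)  # 2024-05-20 09:00:00 - the next Monday at 9 AM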
Scheduler information
The scheduler provides an interface for the application to query its status:
def get_next_run_info(self):
"""Get information about the scheduler status"""
now = datetime.now()
if not self.next_run_time:
return {
"next_run": "Not scheduled",
"last_run": "Never",
"time_until": "Unknown"
}
# Calculate time until next run
time_diff = self.next_run_time - now
hours, remainder = divmod(time_diff.total_seconds(), 3600)
minutes, seconds = divmod(remainder, 60)
formatted_last_run = self.last_run_time.strftime("%Y-%m-%d %H:%M:%S") if self.last_run_time else "Never"
return {
"next_run": self.next_run_time.strftime("%Y-%m-%d %H:%M:%S"),
"last_run": formatted_last_run,
"time_until": f"{int(hours)}h {int(minutes)}m {int(seconds)}s"
}
This method returns a dictionary with information about:
- The next scheduled run time
- The last time a task was executed
- The time remaining until the next execution
This information is particularly useful for the user interface, allowing users to see when the next check for trending repositories will occur without needing to understand the scheduling logic.
Thread safety and persistence
The scheduler module is designed with several important technical considerations:
- Thread safety: Running as a daemon thread ensures that the scheduler operates independently of the main application thread, preventing UI blocking while still terminating automatically when the application stops.
- Persistence: By saving its state to disk, the scheduler maintains continuity even if the application restarts. This ensures that scheduled notifications remain predictable and consistent.
- Error handling: The scheduler incorporates robust error handling to prevent task failures from crashing the entire scheduler thread. Errors are logged but don't disrupt the scheduling process.
- Configuration flexibility: The scheduler adapts to different frequencies and times specified in the configuration, providing users with flexibility in how often they receive notifications.
This combination of features makes the scheduler module a reliable automation component that operates quietly in the background while providing valuable functionality to the overall application.
The Streamlit Interface (app.py)
Now that we've explored the core functionality modules of our application, let's take a look at how everything comes together in the user interface. The app.py module provides a Streamlit-based UI that connects all the components and gives users an intuitive way to interact with the application.
Streamlit makes it straightforward to create web applications in Python without needing to understand complex web frameworks. The OS-watch interface is designed with three main tabs that organize the functionality logically.
Application structure
The application is built around a tabbed interface with three main sections:
tab1, tab2, tab3 = st.tabs(["Search", "Configure", "Results"])
- Search tab - Allows users to search for trending repositories in real-time and manage the scheduler
- Configure tab - Provides interfaces for setting up Slack notifications and testing them
- Results tab - Displays the most recent search results in an expandable format
This organization makes the application intuitive to navigate while maintaining clear separation between different functions.
Component integration
The app.py module serves as the orchestrator that brings together all the other components:
# Import our modules with absolute imports instead of relative imports
from src.config import AppConfig, NotificationConfig, SearchConfig, DEFAULT_CONFIG
from src.scraper import GitHubTrendScraper
from src.notifier import SlackNotifier
from src.scheduler import Scheduler
When a user submits a search, the application:
- Updates the configuration based on user inputs
- Creates a scraper instance with the current search configuration
- Runs the scraper to fetch trending repositories
- Optionally sends notifications via the notifier
- Displays the results in the UI
Similarly, when a user starts the scheduler, the application creates and manages a background task that periodically performs these operations according to the configured schedule.
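As a sketch of how that looks in Streamlit (the exact widgets in app.py may differ; run_scrape_task comes from the snippets shown later in this section, and the session-state keys match those initialized below):

# hypothetical start-scheduler handler inside app.py
import streamlit as st
from src.scheduler import Scheduler

if st.button("Start scheduled monitoring"):
    scheduler = Scheduler(st.session_state.config.notification, task_function=run_scrape_task)
    if scheduler.start():
        st.session_state.scheduler = scheduler
        st.session_state.is_scheduled = True
        info = scheduler.get_next_run_info()
        st.success(f"Scheduler running - next check at {info['next_run']}")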
State management
Streamlit applications rely on session state to maintain information between user interactions. OS-watch uses session state to store:
# Initialize session state
if "config" not in st.session_state:
st.session_state.config = DEFAULT_CONFIG
if "scheduler" not in st.session_state:
st.session_state.scheduler = Scheduler(st.session_state.config.notification)
if "last_results" not in st.session_state:
st.session_state.last_results = []
if "is_scheduled" not in st.session_state:
st.session_state.is_scheduled = False
- The current application configuration
- The scheduler instance
- The most recent search results
- The current scheduler status
This state management approach allows the application to maintain consistency across different user interactions and page refreshes.
User workflow
From the user's perspective, the application provides a straightforward workflow:
- Configure search parameters (keywords, language, time period)
- Run a search or set up scheduled notifications
- Configure notification settings (webhook URL, frequency, time)
- View and explore the search results
The interface is designed to be intuitive, with forms for input collection and clear feedback after operations complete:
if submitted:
# Update the configuration
keywords_list = [k.strip() for k in keywords.split(",") if k.strip()]
lang = None if language == "All" else language
st.session_state.config.search = SearchConfig(
keywords=keywords_list, language=lang, time_period=time_period
)
# Run the search
with st.spinner("Searching for trending repositories..."):
results = run_scrape_task()
# Show success message
if results:
st.success(f"Found {len(results)} trending repositories matching your keywords!")
else:
st.warning("No trending repositories found matching your keywords.")
This approach gives users immediate feedback and makes the application feel responsive despite the potentially time-consuming background operations.
The Streamlit interface serves its purpose effectively: providing an accessible way to interact with the core functionality without requiring the user to understand the underlying code. By leveraging Streamlit's simple but powerful components, OS-watch delivers a clean, intuitive interface that makes monitoring GitHub trends straightforward for technical and non-technical users alike.
Conclusion
OS-watch demonstrates how modern Python libraries and APIs can be combined to create a powerful tool for monitoring open-source trends. By using Firecrawl for web scraping, Streamlit for the user interface, and a custom scheduler for automation, we've built an application that provides real value with relatively little code.
The modular design makes it easy to extend - for example, by adding additional notification channels or data sources. The applicationâs architecture follows good practices for configuration management, separation of concerns, and persistent state.
What makes OS-watch particularly powerful is its use of Firecrawl's structured extraction capabilities. Traditional web scraping approaches often break when websites change their layouts or styling, but Firecrawl's AI-powered extraction provides resilience against these changes. By defining our data schema with Pydantic models and passing it to Firecrawl, we can extract precisely the information we need without writing complex CSS selectors or XPath queries.
Try Firecrawl for your projects
If you're building applications that need to extract data from the web, Firecrawl can significantly simplify your development process:
- Get started for free - Sign up at firecrawl.dev and explore the platform with the free tier
- Explore the documentation - Learn about structured extraction, web crawling, and more in the comprehensive docs
- Join the community - Connect with other developers using Firecrawl to share tips and get support
Next steps
Here are some potential enhancements for the OS-watch application:
- Add additional data sources beyond GitHub trending
- Implement more notification channels (email, Discord, etc.)
- Create a more sophisticated filtering system
- Add data visualization for trends over time
- Implement user authentication and multi-user support
We hope this tutorial has helped you understand both the application itself and the powerful combination of technologies it uses. By combining Firecrawl, Streamlit, and Python's rich ecosystem, you can build web applications that deliver real value to your users or organization.
About the Author

Bex is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics.