Oct 21, 2024

Nicolas Camara

Getting Started with Grok-2: Setup and Web Crawler Example

Grok-2 is the latest language model from x.ai, available to developers through an API. In this tutorial, we’ll set up Grok-2, obtain an API key, and then build a web crawler that uses Firecrawl to find relevant pages on a website and Grok-2 to extract structured data from them.

Part 1: Setting Up Grok-2

Before diving into coding, we need to set up Grok-2 and get an API key.

Step 1: Sign Up for an x.ai Account

To access the Grok-2 API, you’ll need an x.ai account.

  1. Visit the Sign-Up Page: Go to x.ai Sign-Up.
  2. Register: Fill out the registration form with your email and create a password.
  3. Verify Your Email: Check your inbox for a verification email from x.ai and click the link to verify your account.

Step 2: Fund Your Account

To use the Grok-2 API, your account must have funds.

  1. Access the Cloud Console: After logging in, you’ll be directed to the x.ai Cloud Console.
  2. Navigate to Billing: Click on the Billing tab in the sidebar.
  3. Add Payment Method: Provide your payment details to add credits to your account.

Step 3: Obtain Your API Key

With your account funded, you can now generate an API key.

  1. Go to API Keys: Click on the API Keys tab in the Cloud Console.
  2. Create a New API Key: Click on Create New API Key and give it a descriptive name.
  3. Copy Your API Key: Make sure to copy your API key now, as it won’t be displayed again for security reasons.

Note: Keep your API key secure and do not share it publicly.

Part 2: Building a Web Crawler with Grok-2 and Firecrawl

Now that Grok-2 is set up, let’s build a web crawler to extract structured data from websites.

Prerequisites

  • Python 3.6+
  • Firecrawl Python Library
  • Requests Library
  • dotenv Library

Install the required packages:

pip install firecrawl-py requests python-dotenv

Step 1: Set Up Environment Variables

Create a .env file in your project directory to store your API keys securely.

GROK_API_KEY=your_grok_api_key
FIRECRAWL_API_KEY=your_firecrawl_api_key

Replace your_grok_api_key and your_firecrawl_api_key with your actual API keys.

Step 2: Initialize Your Script

Create a new Python script (e.g., web_crawler.py) and start by importing the necessary libraries and loading your environment variables.

import os
import json
import requests
from dotenv import load_dotenv
from firecrawl import FirecrawlApp

# Load environment variables from .env file
load_dotenv()

# Retrieve API keys
grok_api_key = os.getenv("GROK_API_KEY")
firecrawl_api_key = os.getenv("FIRECRAWL_API_KEY")

# Initialize FirecrawlApp
app = FirecrawlApp(api_key=firecrawl_api_key)
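
`os.getenv` returns `None` when a variable is missing, so a misspelled name in `.env` would only show up later as an authentication error. A minimal sketch of an early check follows; `require_env` is a hypothetical helper introduced here for illustration, not part of python-dotenv or Firecrawl.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    # Hypothetical helper: treats unset and blank values the same way.
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Missing environment variable: {name}. Check your .env file."
        )
    return value
```

After `load_dotenv()`, the two `os.getenv(...)` calls could be replaced with `require_env("GROK_API_KEY")` and `require_env("FIRECRAWL_API_KEY")`.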

Step 3: Define the Grok-2 API Interaction Function

We need a function to interact with the Grok-2 API.

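def grok_completion(prompt):
    """Send a single-turn prompt to Grok-2 and return the reply text."""
    url = "https://api.x.ai/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {grok_api_key}"
    }
    data = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        "model": "grok-2",
        "stream": False,
        "temperature": 0  # Deterministic output keeps extraction repeatable
    }
    # A timeout prevents the script from hanging on a stalled connection
    response = requests.post(url, headers=headers, json=data, timeout=60)
    # Fail with an HTTPError on 4xx/5xx instead of a KeyError further down
    response.raise_for_status()
    response_data = response.json()
    return response_data['choices'][0]['message']['content']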
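
The last line of `grok_completion` assumes the response always contains a non-empty `choices` list. A small sketch of a more defensive lookup is shown below. It assumes only the response shape already used above, and `extract_content` is a hypothetical helper name.

```python
def extract_content(response_data: dict) -> str:
    """Pull the reply text out of a chat completions response dict."""
    # Assumption: the payload has the same shape grok_completion indexes into.
    choices = response_data.get("choices") or []
    if not choices:
        raise ValueError(f"Unexpected API response: {response_data!r}")
    content = choices[0].get("message", {}).get("content")
    if content is None:
        raise ValueError("Response contained no message content")
    return content
```

With this in place, `grok_completion` could end with `return extract_content(response_data)`, so a malformed payload produces a readable error rather than a bare `KeyError`.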

Step 4: Identify Relevant Pages on the Website

Define a function to find pages related to our objective.

def find_relevant_pages(objective, url):
    prompt = f"Based on the objective '{objective}', suggest a 1-2 word search term to locate relevant information on the website."
    search_term = grok_completion(prompt).strip()
    map_result = app.map_url(url, params={"search": search_term})
    return map_result.get("links", [])
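
The model may wrap its suggested search term in quotes or follow it with an explanation, and `.strip()` alone does not remove either. One possible way to normalize the reply before passing it to `map_url` is sketched below; `clean_search_term` is a hypothetical helper, and the two-word limit simply mirrors the prompt.

```python
import re


def clean_search_term(raw: str, max_words: int = 2) -> str:
    """Reduce a model reply to at most max_words plain words from its first line."""
    stripped = raw.strip()
    if not stripped:
        return ""
    first_line = stripped.splitlines()[0]
    # Keep alphanumeric words (allowing inner apostrophes and hyphens),
    # which drops surrounding quotes and trailing punctuation.
    words = [w.strip("'-") for w in re.findall(r"[A-Za-z0-9][A-Za-z0-9'-]*", first_line)]
    return " ".join(w for w in words[:max_words] if w)
```

In `find_relevant_pages`, this would be applied as `search_term = clean_search_term(grok_completion(prompt))`.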

Step 5: Extract Data from the Pages

Create a function to scrape the pages and extract the required data.

def extract_data_from_pages(links, objective):
    for link in links[:3]:  # Limit to top 3 links
        scrape_result = app.scrape_url(link, params={'formats': ['markdown']})
        content = scrape_result.get('markdown', '')
        prompt = f"""Given the following content, extract the information related to the objective '{objective}' in JSON format. If not found, reply 'Objective not met'.

Content: {content}

Remember:
- Only return JSON if the objective is met.
- Do not include any extra text.
"""
        result = grok_completion(prompt).strip()
        if result != "Objective not met":
            try:
                data = json.loads(result)
                return data
            except json.JSONDecodeError:
                continue  # Try the next link if JSON parsing fails
    return None
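
Language models often wrap JSON in a Markdown code fence even when asked not to include extra text, in which case `json.loads(result)` fails and the page is skipped. A minimal sketch of a more tolerant parser follows; `parse_json_reply` is a hypothetical helper, and it handles only the single-fence case.

```python
import json
import re

FENCE = "`" * 3  # A Markdown code fence marker, built here to keep this example readable
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_json_reply(reply: str):
    """Return parsed JSON from a model reply, or None if it cannot be parsed."""
    text = reply.strip()
    # If the reply contains a fenced block, parse only what is inside it.
    match = FENCED_BLOCK.search(text)
    if match:
        text = match.group(1).strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```

Inside `extract_data_from_pages`, `json.loads(result)` could be swapped for `parse_json_reply(result)`, moving on to the next link whenever it returns `None`.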

Step 6: Implement the Main Function

Combine everything into a main function.

def main():
    url = input("Enter the website URL to crawl: ")
    objective = input("Enter your data extraction objective: ")

    print("\nFinding relevant pages...")
    links = find_relevant_pages(objective, url)

    if not links:
        print("No relevant pages found.")
        return

    print("Extracting data from pages...")
    data = extract_data_from_pages(links, objective)

    if data:
        print("\nData extracted successfully:")
        print(json.dumps(data, indent=2))
    else:
        print("Could not find data matching the objective.")

if __name__ == "__main__":
    main()

Step 7: Run the Script

Save your script and run it from the command line.

python web_crawler.py

Example Interaction:

Enter the website URL to crawl: https://example.com
Enter your data extraction objective: Retrieve the list of services offered.

Finding relevant pages...
Extracting data from pages...

Data extracted successfully:
{
  "services": [
    "Web Development",
    "SEO Optimization",
    "Digital Marketing"
  ]
}

Conclusion

In this tutorial, we set up Grok-2, obtained an API key, and built a web crawler with Firecrawl. Firecrawl maps the site and scrapes the most relevant pages, and Grok-2 turns their content into JSON that matches your objective, so you can extract structured data from a website without writing site-specific parsing code.

Next Steps

  • Explore More Features: Check out the Grok-2 and Firecrawl documentation to learn about additional functionalities.
  • Enhance Error Handling: Improve the script with better error handling and logging.
  • Customize Data Extraction: Modify the extraction logic to suit different objectives or data types.
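
As a starting point for the error-handling item above, here is a rough sketch of a retry wrapper with logging and exponential backoff. The name `with_retries` and its parameters are assumptions made for illustration; neither Firecrawl nor the Grok-2 API provides this helper.

```python
import logging
import time

logger = logging.getLogger("web_crawler")


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception, log it and retry with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            logger.warning("Attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # Out of attempts: surface the last error to the caller
            # Wait base_delay, then 2x, then 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))
```

A network call from the script could then be wrapped as `with_retries(lambda: grok_completion(prompt))`. Catching every `Exception` is a simplification; in practice you would likely retry only on timeouts and rate-limit responses.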

About the Author

Nicolas Camara (@nickscamara_)

Nicolas Camara is the Chief Technology Officer (CTO) at Firecrawl. He previously built and scaled Mendable, one of the pioneering "chat with your documents" apps, which had major Fortune 500 customers like Snapchat, Coinbase, and MongoDB. Prior to that, Nicolas built SideGuide, the first code-learning tool inside VS Code, and grew a community of 50,000 users. Nicolas studied Computer Science and has over 10 years of experience in building software.