Aug 13, 2024

Wendong Fan

Building Knowledge Graphs from Web Data using CAMEL-AI and Firecrawl

This post explores techniques for building knowledge graphs by extracting data from web pages using CAMEL-AI and Firecrawl.

We’ll cover:

  • Multi-agent role-playing task setup
  • Web scraping implementation
  • Knowledge graph construction
  • Agent monitoring techniques

To demonstrate these concepts, we’ll build a knowledge graph to analyze Yusuf Dikec’s performance at the 2024 Paris Olympics. The notebook version is here.

Yusuf Dikec at the Paris Olympics

🐫 Setting Up CAMEL and Firecrawl

To get started, install the CAMEL package with all its dependencies:

pip install "camel-ai[all]==0.1.6.3"

Next, set up your API keys for Firecrawl and OpenAI to enable interaction with external services.

API Keys

You’ll need API keys for both Firecrawl and OpenAI. The snippet below reads them with getpass, so they are not echoed as you type, and stores them in environment variables.

You can get a free API key from Firecrawl here.

import os
from getpass import getpass

# Prompt for the Firecrawl API key without echoing it
firecrawl_api_key = getpass('Enter your Firecrawl API key: ')
os.environ["FIRECRAWL_API_KEY"] = firecrawl_api_key

# Prompt for the OpenAI API key without echoing it
openai_api_key = getpass('Enter your OpenAI API key: ')
os.environ["OPENAI_API_KEY"] = openai_api_key
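
If you rerun the notebook often, being prompted every time gets tedious. Here is a minimal sketch, assuming a hypothetical helper named ensure_api_key (it is not part of CAMEL or Firecrawl), that reuses a key already present in the environment and only prompts when it is missing:

```python
import os
from getpass import getpass
from typing import Callable


def ensure_api_key(env_var: str, prompt_fn: Callable[[str], str] = getpass) -> str:
    """Return the key stored in `env_var`, prompting only if it is missing.

    `ensure_api_key` is a hypothetical helper for this post. `prompt_fn`
    defaults to getpass and can be swapped out, for example in tests.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Nothing usable in the environment: ask the user without echoing
        key = prompt_fn(f"Enter your {env_var}: ").strip()
        if not key:
            raise ValueError(f"{env_var} must not be empty")
        os.environ[env_var] = key
    return key


# Usage (prompts only for keys that are not set yet):
# ensure_api_key("FIRECRAWL_API_KEY")
# ensure_api_key("OPENAI_API_KEY")
```

Stripping whitespace also guards against a stray newline or space pasted along with the key.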

🌐 Effortless Web Scraping with Firecrawl

Firecrawl simplifies web scraping and cleaning content from web pages. Here’s an example of scraping content from a specific post on the CAMEL AI website:

from camel.loaders import Firecrawl

firecrawl = Firecrawl()

response = firecrawl.tidy_scrape(
    url="https://www.camel-ai.org/post/crab"
)

print(response)

🛠️ Web Information Retrieval using CAMEL’s RAG and Firecrawl

Let’s retrieve relevant information from a list of URLs using CAMEL’s RAG model. We’ll define a function that uses Firecrawl for web scraping and CAMEL’s AutoRetriever for retrieving the most relevant information based on a query:

from camel.configs import ChatGPTConfig
from camel.loaders import Firecrawl
from camel.models import ModelFactory
from camel.retrievers import AutoRetriever
from camel.toolkits import OpenAIFunction, SearchToolkit
from camel.types import ModelPlatformType, ModelType, StorageType

def retrieve_information_from_urls(urls: list[str], query: str) -> str:
    r"""Retrieves relevant information from a list of URLs based on a given
    query.

    This function uses the `Firecrawl` tool to scrape content from the
    provided URLs and then uses the `AutoRetriever` from CAMEL to retrieve the
    most relevant information based on the query from the scraped content.

    Args:
        urls (list[str]): A list of URLs to scrape content from.
        query (str): The query string to search for relevant information.

    Returns:
        str: The most relevant information retrieved based on the query.

    Example:
        >>> urls = [
        ...     "https://example.com/article1",
        ...     "https://example.com/article2",
        ... ]
        >>> query = "latest advancements in AI"
        >>> result = retrieve_information_from_urls(urls, query)
    """
    aggregated_content = ''

    # Scrape and aggregate content from each URL
    for url in urls:
        scraped_content = Firecrawl().tidy_scrape(url)
        aggregated_content += scraped_content

    # Initialize the AutoRetriever for retrieving relevant content
    auto_retriever = AutoRetriever(
        vector_storage_local_path="local_data", storage_type=StorageType.QDRANT
    )

    # Retrieve the most relevant information based on the query
    # You can adjust the top_k and similarity_threshold value based on your needs
    retrieved_info = auto_retriever.run_vector_retriever(
        query=query,
        contents=aggregated_content,
        top_k=3,
        similarity_threshold=0.5,
    )

    return retrieved_info
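
To make top_k and similarity_threshold concrete, here is a library-free sketch of how such parameters typically interact: score every chunk against the query, drop chunks below the threshold, and keep the best top_k. It illustrates the general idea only and is not CAMEL’s actual implementation; select_chunks and the toy two-dimensional vectors are made up for the example.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_chunks(
    query_vec: List[float],
    chunks: List[Tuple[str, List[float]]],
    top_k: int = 3,
    similarity_threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Hypothetical helper: keep at most `top_k` chunks scoring >= the threshold."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in chunks]
    kept = [(text, score) for text, score in scored if score >= similarity_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # best match first
    return kept[:top_k]


# Toy embeddings: real ones have hundreds of dimensions
chunks = [
    ("medal table", [1.0, 0.0]),
    ("opening ceremony", [0.0, 1.0]),
    ("gold medals by country", [0.9, 0.1]),
]
for text, score in select_chunks([1.0, 0.0], chunks):
    print(f"{score:.2f}  {text}")
```

Raising similarity_threshold trades recall for precision, while top_k caps how much text is passed on to the model.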

Let’s put the retrieval function to the test by gathering some information about the 2024 Olympics. The first run may take about 50 seconds because it needs to build a local vector database.

retrieved_info = retrieve_information_from_urls(
    query="Which country won the most gold medals in the 2024 Olympics?",
    urls=[
        "https://en.wikipedia.org/wiki/2024_Summer_Olympics",
        "https://olympics.com/en/paris-2024",
    ],
)

print(retrieved_info)

🎉 Thanks to CAMEL’s RAG pipeline and Firecrawl’s tidy scraping capabilities, this function effectively retrieves relevant information from the specified URLs! You can now integrate this function into CAMEL’s Agents to automate the retrieval process further.
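
Before handing retrieve_information_from_urls to an agent, it may help to make the scraping step tolerant of URLs that fail, since agents often pass in search results of mixed quality. The sketch below is one possible approach under stated assumptions: scrape_all is a hypothetical helper, and the scraper is passed in as a plain callable so the example runs without an API key.

```python
from typing import Callable, Iterable, List, Tuple


def scrape_all(
    urls: Iterable[str],
    scrape: Callable[[str], str],
) -> Tuple[str, List[str]]:
    """Scrape each URL, skipping failures.

    Returns the combined text (one labelled section per successful URL)
    and the list of URLs that failed or came back empty.
    """
    sections: List[str] = []
    failed: List[str] = []
    for url in urls:
        try:
            text = scrape(url)
        except Exception:
            # Any scraper error: record the URL and move on
            failed.append(url)
            continue
        if text and text.strip():
            sections.append(f"Source: {url}\n{text.strip()}")
        else:
            failed.append(url)
    return "\n\n".join(sections), failed


# With Firecrawl, the call would look like:
# content, failed = scrape_all(urls, Firecrawl().tidy_scrape)
```

Separating sections with a blank line and a source label also avoids gluing the last sentence of one page to the first sentence of the next.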

📹 Monitoring AI Agents with AgentOps

AgentOps is a powerful tool for tracking and analyzing the execution of CAMEL agents. To set up AgentOps, obtain an API key and configure it in your environment:

import os
from getpass import getpass

agentops_api_key = getpass('Enter your AgentOps API key: ')
os.environ["AGENTOPS_API_KEY"] = agentops_api_key

import agentops
agentops.init(default_tags=["CAMEL"])

With AgentOps set up, you can monitor and analyze the execution of your CAMEL agents, gaining valuable insights into their performance and behavior.

🧠 Constructing Knowledge Graphs

CAMEL can build and store knowledge graphs from text data, enabling advanced analysis and visualization of relationships. Here’s how to set up a Neo4j instance and define a function to create a knowledge graph:

from camel.storages import Neo4jGraph
from camel.loaders import UnstructuredIO
from camel.agents import KnowledgeGraphAgent

def knowledge_graph_builder(text_input: str) -> object:
    r"""Build and store a knowledge graph from the provided text.

    This function processes the input text to extract nodes and relationships,
    which are then added to a Neo4j database as a knowledge graph.

    Args:
        text_input (str): The input text from which the knowledge graph is to be constructed.

    Returns:
        object: The graph element generated by the knowledge graph agent.
    """

    # Set Neo4j instance
    n4j = Neo4jGraph(
        url="Your_URI",
        username="Your_Username",
        password="Your_Password",
    )

    # Initialize instances
    uio = UnstructuredIO()
    kg_agent = KnowledgeGraphAgent()

    # Create an element from the provided text
    element_example = uio.create_element_from_text(text_input, element_id="001")

    # Extract nodes and relationships using the Knowledge Graph Agent
    graph_elements = kg_agent.run(element_example, parse_graph_elements=True)

    # Add the extracted graph elements to the Neo4j database
    n4j.add_graph_elements(graph_elements=[graph_elements])

    return graph_elements
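
To make “nodes and relationships” concrete, here is a library-free sketch of the kind of structure a knowledge graph holds. The Node and Relationship classes and build_graph are hypothetical stand-ins for illustration; they are not CAMEL’s own graph element classes, and the two example triples only restate what this post already says about the task.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# (node id, node type), relationship type, (node id, node type)
Triple = Tuple[Tuple[str, str], str, Tuple[str, str]]


@dataclass(frozen=True)
class Node:
    id: str
    type: str


@dataclass(frozen=True)
class Relationship:
    subj: Node
    obj: Node
    type: str


def build_graph(triples: List[Triple]) -> Tuple[List[Node], List[Relationship]]:
    """Turn triples into de-duplicated nodes plus the relationships between them."""
    nodes: Dict[str, Node] = {}
    relationships: List[Relationship] = []
    for (s_id, s_type), rel_type, (o_id, o_type) in triples:
        # setdefault reuses a node that was already seen under the same id
        subj = nodes.setdefault(s_id, Node(s_id, s_type))
        obj = nodes.setdefault(o_id, Node(o_id, o_type))
        relationships.append(Relationship(subj, obj, rel_type))
    return list(nodes.values()), relationships


triples = [
    (("Yusuf Dikec", "Person"), "CompetedIn", ("2024 Paris Olympics", "Event")),
    (("Yusuf Dikec", "Person"), "Represents", ("Turkey", "Country")),
]
nodes, relationships = build_graph(triples)
print(len(nodes), "nodes,", len(relationships), "relationships")
```

In the function above, the knowledge graph agent does the hard part of producing such triples from free text, and Neo4j stores the result.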

🤖🤖 Multi-Agent Role-Playing with CAMEL

CAMEL enables role-playing sessions where AI agents interact to accomplish tasks using various tools. Let’s guide an assistant agent to perform a comprehensive study of the Turkish shooter in the 2024 Paris Olympics:

  1. Define the task prompt.
  2. Configure the assistant agent with tools for web information retrieval and knowledge graph building.
  3. Initialize the role-playing session.
  4. Start the interaction between agents.

from typing import List

from colorama import Fore

from camel.agents.chat_agent import FunctionCallingRecord
from camel.societies import RolePlaying
from camel.utils import print_text_animated

task_prompt = """Do a comprehensive study of the Turkish shooter in 2024 paris
olympics, write a report for me, then create a knowledge graph for the report.
You should use search tool to get related urls first, then use retrieval tool
to get the retrieved content back, finally use tool to create the
knowledge graph to finish the task."""

retrieval_tool = OpenAIFunction(retrieve_information_from_urls)
search_tool = OpenAIFunction(SearchToolkit().search_duckduckgo)
knowledge_graph_tool = OpenAIFunction(knowledge_graph_builder)

tool_list = [
    retrieval_tool,
    search_tool,
    knowledge_graph_tool,
]

assistant_model_config = ChatGPTConfig(
    tools=tool_list,
    temperature=0.0,
)

Now we can set up the role-playing session:

# Initialize the role-playing session
role_play_session = RolePlaying(
    assistant_role_name="CAMEL Assistant",
    user_role_name="CAMEL User",
    assistant_agent_kwargs=dict(
        model=ModelFactory.create(
            model_platform=ModelPlatformType.OPENAI,
            model_type=ModelType.GPT_4O_MINI,
            model_config_dict=assistant_model_config.as_dict(),
        ),
        tools=tool_list,
    ),
    user_agent_kwargs=dict(),
    task_prompt=task_prompt,
    with_task_specify=False,
)

Print the system message and task prompt like this:

print(
    Fore.GREEN
    + f"AI Assistant sys message:\n{role_play_session.assistant_sys_msg}\n"
)
print(Fore.BLUE + f"AI User sys message:\n{role_play_session.user_sys_msg}\n")

print(Fore.YELLOW + f"Original task prompt:\n{task_prompt}\n")
print(
    Fore.CYAN
    + "Specified task prompt:"
    + f"\n{role_play_session.specified_task_prompt}\n"
)
print(Fore.RED + f"Final task prompt:\n{role_play_session.task_prompt}\n")

Set the termination rule and start the interaction between agents:

NOTE: This session takes approximately 5 minutes and consumes around $0.02 in tokens with GPT-4o mini.

n = 0
input_msg = role_play_session.init_chat()
while n < 10:  # Limit the chat to 10 turns
    n += 1
    assistant_response, user_response = role_play_session.step(input_msg)

    if assistant_response.terminated:
        print(
            Fore.GREEN
            + (
                "AI Assistant terminated. Reason: "
                f"{assistant_response.info['termination_reasons']}."
            )
        )
        break
    if user_response.terminated:
        print(
            Fore.GREEN
            + (
                "AI User terminated. "
                f"Reason: {user_response.info['termination_reasons']}."
            )
        )
        break
    # Print output from the user
    print_text_animated(
        Fore.BLUE + f"AI User:\n\n{user_response.msg.content}\n",
        0.01
    )

    # Print output from the assistant, including any function
    # execution information
    print_text_animated(Fore.GREEN + "AI Assistant:", 0.01)
    tool_calls: List[FunctionCallingRecord] = [
        FunctionCallingRecord(**call.as_dict())
        for call in assistant_response.info['tool_calls']
    ]
    for func_record in tool_calls:
        print_text_animated(f"{func_record}", 0.01)
    print_text_animated(f"{assistant_response.msg.content}\n", 0.01)

    if "CAMEL_TASK_DONE" in user_response.msg.content:
        break

    input_msg = assistant_response.msg
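
The loop above combines four stop conditions: either agent terminating, the user agent emitting CAMEL_TASK_DONE, and the turn limit. As a sketch, the same rules can be isolated in a small pure function, which makes them easy to test without calling a model. stop_reason is a hypothetical helper, not a CAMEL API.

```python
from typing import Optional

DONE_TOKEN = "CAMEL_TASK_DONE"


def stop_reason(
    turn: int,
    max_turns: int,
    assistant_terminated: bool,
    user_terminated: bool,
    user_text: str,
) -> Optional[str]:
    """Return why the chat should stop after this turn, or None to continue.

    The checks run in the same order as in the loop above.
    """
    if assistant_terminated:
        return "assistant terminated"
    if user_terminated:
        return "user terminated"
    if DONE_TOKEN in user_text:
        return "task done"
    if turn >= max_turns:
        return "turn limit reached"
    return None


# Example: the user agent signals completion on turn 3 of 10
print(stop_reason(3, 10, False, False, "Great work. CAMEL_TASK_DONE"))
```

Inside the loop, this would be called once per turn with the values from assistant_response and user_response, breaking when the result is not None.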

End the AgentOps Session like so:

# End the AgentOps session
agentops.end_session("Success")
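
If the role-playing loop raises an exception, the line above is never reached and the session stays open. One way to handle this, sketched under assumptions, is to wrap the run so the session is always closed with a matching status. run_with_session is a hypothetical helper, the session-ending function is passed in as a plain callable, and the sketch assumes your AgentOps version accepts "Fail" as an end state alongside "Success".

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_session(task: Callable[[], T], end_session: Callable[[str], None]) -> T:
    """Run `task`, then close the monitoring session with a matching status."""
    try:
        result = task()
    except Exception:
        # Mark the session as failed, then let the error propagate
        end_session("Fail")
        raise
    end_session("Success")
    return result


# With AgentOps, where run_role_play is a hypothetical function wrapping the loop:
# run_with_session(run_role_play, agentops.end_session)
```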

🌟 Highlights

This post shows how CAMEL and Firecrawl can be combined for advanced RAG with knowledge graphs. Key tools used:

  • CAMEL: A multi-agent framework for Retrieval-Augmented Generation and role-playing scenarios.
  • Firecrawl: A web scraping tool for extracting and cleaning content from web pages.
  • AgentOps: A monitoring and analysis tool for tracking CAMEL agent execution.
  • Qdrant: A vector storage system used with CAMEL’s AutoRetriever.
  • Neo4j: A graph database for constructing and storing knowledge graphs.
  • DuckDuckGo Search: Utilized within the SearchToolkit to gather relevant URLs.
  • OpenAI: Provides state-of-the-art language models for tool-calling and embeddings.

We hope this blog post has inspired you to harness the power of CAMEL and Firecrawl for your own projects. Happy researching and building! If you want to run this blog post as a notebook, click here!

About the Author

Wendong Fan (@ttokzzzzz)

Wendong Fan is an AI Engineer at Eigent AI.
