Aug 7, 2024

Eric Ciarla

How to Use OpenAI's Structured Outputs and JSON Strict Mode

Getting structured data from LLMs is super useful for developers integrating AI into their applications, because it makes parsing and processing model outputs much more reliable.

OpenAI just released new versions of gpt-4o and gpt-4o-mini that include huge improvements for developers looking to get structured data from LLMs. With the introduction of Structured Outputs and JSON Strict Mode, developers can now guarantee JSON output that conforms to their schema 100% of the time by setting strict to true.

Figure 1: Structured Output Evaluation Scores from OpenAI’s latest models

Without further ado, let’s dig into how to use these latest models and get reliable structured data from them.

How to use Structured Outputs with JSON Strict Mode

To demonstrate the power of these models, we can use JSON Strict Mode to extract structured data from a web page. See the code on GitHub.

Prerequisites

Install the required libraries:

!pip install firecrawl-py openai

Step 1: Initialize the FirecrawlApp and OpenAI Client

from firecrawl import FirecrawlApp
from openai import OpenAI

firecrawl_app = FirecrawlApp(api_key='FIRECRAWL_API_KEY')
client = OpenAI(api_key='OPENAI_API_KEY')
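The strings 'FIRECRAWL_API_KEY' and 'OPENAI_API_KEY' above are placeholders for your real keys. One common alternative to pasting keys into the script is to read them from environment variables. The sketch below assumes the variables share the placeholder names; require_env is a hypothetical helper, not part of either library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this script.")
    return value


# Usage (commented out so the sketch runs without real keys or the two libraries):
# firecrawl_app = FirecrawlApp(api_key=require_env("FIRECRAWL_API_KEY"))
# client = OpenAI(api_key=require_env("OPENAI_API_KEY"))
```

Failing early with a named variable makes a missing key easier to spot than an authentication error returned later by the API.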

Step 2: Scrape Data from a Web Page

url = 'https://mendable.ai'
scraped_data = firecrawl_app.scrape_url(url)
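The shape of the value returned by scrape_url may differ between firecrawl-py versions, so it can help to read the page text defensively. The sketch below assumes the result is either a dict or an object with attributes, and that the text lives under a 'content' or 'markdown' field; page_text is a hypothetical helper, so check the result you actually get from your installed version.

```python
from types import SimpleNamespace


def page_text(scraped) -> str:
    """Pull the page text out of a scrape result, whether it is a dict or an object."""
    # Assumption: the text is stored under "content" or "markdown".
    for key in ("content", "markdown"):
        if isinstance(scraped, dict):
            value = scraped.get(key)
        else:
            value = getattr(scraped, key, None)
        if value:
            return value
    raise KeyError("No 'content' or 'markdown' field found in the scrape result")


# Stand-in results for illustration; a real one would come from scrape_url(url).
print(page_text({"content": "# Example page"}))
print(page_text(SimpleNamespace(markdown="# Example page")))
```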

Step 3: Define the OpenAI API Request

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that extracts structured data from web pages."
    },
    {
        "role": "user",
        "content": f"Extract the headline and description from the following HTML content: {scraped_data['content']}"
    }
]

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "extracted_data",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "headline": {
                    "type": "string"
                },
                "description": {
                    "type": "string"
                }
            },
            "required": ["headline", "description"],
            "additionalProperties": False
        }
    }
}
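Strict mode places requirements on the schema itself: as documented by OpenAI at the time of writing, every object must set additionalProperties to False and list all of its properties in required, which is what the schema above does. A minimal sketch of a local check for those two rules follows. strict_schema_problems is a hypothetical helper that covers only these two rules, not the full set of strict-mode restrictions.

```python
def strict_schema_problems(schema: dict, path: str = "$") -> list:
    """Return human-readable reasons a schema breaks the two strict-mode rules checked here."""
    problems = []
    if schema.get("type") == "object":
        properties = schema.get("properties", {})
        # Rule 1: objects must not allow extra keys.
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be False")
        # Rule 2: every declared property must be required.
        missing = sorted(set(properties) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: missing from required: {missing}")
        # Recurse into nested property schemas.
        for name, subschema in properties.items():
            problems.extend(strict_schema_problems(subschema, f"{path}.{name}"))
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems.extend(strict_schema_problems(schema["items"], f"{path}[]"))
    return problems


# The same schema used in response_format above.
schema = {
    "type": "object",
    "properties": {
        "headline": {"type": "string"},
        "description": {"type": "string"},
    },
    "required": ["headline", "description"],
    "additionalProperties": False,
}

print(strict_schema_problems(schema))  # prints []
```

Running a check like this before sending the request can surface schema mistakes without spending an API call.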

Step 4: Call the OpenAI API and Extract Structured Data

If you are wondering which models support OpenAI’s Structured Outputs and JSON Strict Mode, there are two: gpt-4o-2024-08-06 and gpt-4o-mini-2024-07-18.

chat_completion = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=messages,
    response_format=response_format
)

extracted_data = chat_completion.choices[0].message.content

print(extracted_data)
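Note that extracted_data is a string containing JSON, not a Python dict, so it still needs to be parsed before you can read individual fields. A minimal sketch of that step is below. parse_extraction and the sample string are illustrative assumptions, and the empty-content check is there because the message content may be empty in some cases, for example if the model refuses the request.

```python
import json

REQUIRED_KEYS = {"headline", "description"}


def parse_extraction(content) -> dict:
    """Parse the JSON string returned by the model and confirm the expected keys exist."""
    if not content:
        raise ValueError("The model returned no content.")
    data = json.loads(content)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"Missing keys in model output: {sorted(missing)}")
    return data


# Stand-in for chat_completion.choices[0].message.content
sample = '{"headline": "Example headline", "description": "Example description"}'
parsed = parse_extraction(sample)
print(parsed["headline"])
print(parsed["description"])
```

With the real response, the call would be parse_extraction(extracted_data).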

By following these steps, you can reliably extract structured data from web pages using OpenAI’s latest models with JSON Strict Mode.

That’s about it! In this article, we showed you how to use Structured Output with scraped web data, but the sky’s the limit when it comes to what you can build with reliable structured output from LLMs!

About the Author

Eric Ciarla (@ericciarla)

Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB. He previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool for learning code within VS Code with 50,000 users.
