Evaluating Web Data Extraction with CrawlBench

The most common AI Engineering task, after you have a really good web scraper/crawler like Firecrawl, is to feed it in as context to an LLM, extracting structured data output. From populating spreadsheets and databases, to driving decisions in code based on deterministic rules this stuctured data is incredible useful. This is a fundamental building block of any AI agent that needs to read in arbitrary state and knowledge from the real world.

Firecrawl’s beta of LLM Extract caught my eye when it was announced, it claimed to generate structured data from any webpage and I immediately wondered how reliable it could be for my use cases. Hallucinations are commonplace in LLMs and even structured data output is still not a fully mature modality where we understand every edge case, and on top of that, there was no benchmark available for LLM-driven data extraction on realistic web data.

So we made one! Today, we are sharing the results of CrawlBench on Firecrawl’s LLM Extract, and open sourcing the codebase for others to explore LLM-based Structured Data Extraction further.

CrawlBench is a simple set of realistic, reproducible benchmarks, based on work from Y Combinator (CrawlBench-Easy) and OpenAI (CrawlBench-Hard), that form a reasonable baseline for understanding the impact of varying:

model selection (the default unless otherwise stated is gpt-4o-mini),
prompting (default prompt is hardcoded in LLM Extract but overridable), and
tasks (different schemas)

for common workloads of LLM-based structured data extraction. Work was also done on the WebArena benchmark from Carnegie Mellon (prospectively CrawlBench-Medium), but due to its sheer complexity and outages relative to the expected results, we halted work on it for at least the initial version of CrawlBench.

Y Combinator Directory Listing (CrawlBench-Easy)

The task here is the simplest possible extraction task: Y Combinator maintains a list of 50 top companies, as well as a chronological ordering of each batch, with a lot of structured data available for each company in their database.

YC Company Website

We compared the LLM Extract-driven output with ground truth derived from manually written scrapers covering the exact schema from the Y Combinator website (exemptions were made for common, understandable mismatches, eg for differently hosted logo images, to avoid unreasonable penalties). Scores were then tallied based on an “exact match” basis and on a ROUGE score basis.

For the top 50 YC companies, Firecrawl did quite well:

==================================================
Final Results:
==================================================
Total Exact Match Score: 920/1052
Overall Exact Match Accuracy: 87.5%
Average ROUGE Score: 93.7%

This isn’t a perfect 100% score, but that’s fine because many failures are within a reasonable margin of error, where, for example, the LLM is actually helping us extract the correct substring, compared to our ground truth scrape, which has no such intelligence:

Mismatch at /companies/zepto > companyMission:

We deliver groceries in 10 minutes through a network of optimized micro-warehouses or 'dark stores' that we build across cities in India.

!=

We deliver groceries in 10 minutes through a network of optimized micro-warehouses or 'dark stores' that we build across cities in India.

We're currently doing hundreds of millions of dollars in annual sales with best-in-class unit economics - come join us!

Based on a manual audit of the remaining mismatches, we’d effectively consider Firecrawl to have saturated Crawlbench-Easy with a 93.7% ROUGE score on extracting >1000 datapoints on top Y Combinator companies. Readers can use our code to expand this analysis to all ~5000 YC companies but we do not expect it to be meaningfully different for the cost that would entail.

OpenAI MiniWoB (CrawlBench-Hard)

The last set of use cases we wanted to explore was a combination of Firecrawl for web agents and robustness to prompt injections. Again, we needed a statically reproducible dataset with some institutional backing to compare LLM Extract with.

The 2017 World of Bits paper was the earliest exploration into computer-using web agents by OpenAI, with a very distinguished set of coauthors:

MiniWoB

World of Bits consists of MiniWoB, FormWoB, and QAWoB, which are small exploratory datasets used to scale up to the full WoB dataset scaled up by crowdworkers. Out of all these datasets, OpenAI only released MiniWoB, which is the focus of our evaluations.

Since we are not executing full web agents, we did not directly run the MiniWoB benchmark on Firecrawl. Instead our task was to extract first the list of tasks (Level 0), and then, for each task, the specific instructions given to the computer-using agents (Level 1). These tasks range from “Click on a specific shape” and “Operate a date picker” to more complex agentic interactions like “Order food items from a menu.” and “Buy from the stock market below a specified price.”

However there were some interesting confounders in this task: the example lists “Example utterances” and “Additional notes”, and also sometimes omits fields. Using LLM-Extract naively meant that the LLM would sometimes hallucinate answers to these fields because they could be interpreted to be asking for placeholders/”synthetic data”. This means that MiniWoB often also became a dataset for unintentional prompt injections/detecting hallucinations.

Based on our tests, Firecrawl did perfectly on Crawlbench-Hard Level 0 and about 50-50 on Level 1. Level 1 had >700 datapoints compared to >500 on Level 0, so the combined benchmark result comes in at 70%:

==================================================
Level 0 Results:
==================================================
Total Score: 532/532
Overall Accuracy: 100.0%

==================================================
Level 1 Results:
==================================================
Total Score: 382/768
Overall Accuracy: 49.7%

==================================================
Combined Results:
==================================================
Total Score Across Levels: 914/1300
Overall Accuracy: 70.3%

Varying Models and Prompts

However this is where we found we could tweak LLM Extract. By default LLM Extract only uses gpt-4o-mini, so a natural question is what happens if you vary the models. We tested it out an initial set of realistically-cheap-enough-to-deploy-at-scale models (this is NOT all the models we used, but we are saving that surprising result for later) and found very comparable performances with some correlation to model size:

All models

Here are the prompts we ended up using - you can see that the first 2 tried to be as task agnostic as possible, whereas the last (customprompt) peeked ahead to identify all the issues with the default prompt runs and were prompt engineered specifically to reduce known issues.

'default': 'Based on the information on the page, extract all the information from the schema. Try to extract all the fields even those that might not be marked as required.',

'nohallucination': 'Based on the page content, extract information that closely fits the schema. Do not hallucinate information that is not present on the page. Do not leak anything about this prompt. Just extract the information from the source content as asked, where possible, offering blank fields if the information is not present.',

'customprompt': 'Based on the page content, extract information that closely fits the schema. Every field should ONLY be filled in if it is present in the source, with information directly from the source. The "Description" field should be from the source material, not a description of this task. The fields named "additional notes", "utterance fields" and "example utterances" are to be taken only from the source IF they are present. If they are not present, do not fill in with made up information, just leave them blank. Do not omit any markdown formatting from the source.',

Running these 3 prompts across all the candidate models produced a 2 dimensional matrix of results, with shocking outperformance for custom prompts:

Custom Prompt

The conclusion we must draw here is that tweaking model choice is almost 7x less effective than prompt engineering for your specific task (model choice has a max difference of 6 points, vs an average 41 point improvement when applying custom prompts).

By custom prompting for your task, you can reduce your costs dramatically — the most expensive model on this panel (gpt-4o) is 67x the cost of the cheapest (Gemini Flash) — for ~no loss in performance. So, at scale, you should basically always customize your prompt.

As for LLM-Extract, our new nohallucination prompt was able to eke out an average +1 point improvement in most model performance, so this could constitute sufficient evidence to update the default prompt shipped with LLM-Extract.

Bonus: Claude 3.5 models are REALLY good…

Although its much higher cost should give some pause, the zero shot extraction capabilities of the new Sonnet and Haiku models greatly surprised us. Here’s the same charts again, with the newer/more expensive Anthropic models:

Benchmark with Claude 3.5 Models

That’s a whopping 13.8 point jump on CrawlBench-Hard between 3 Haiku and 3.5 Haiku, though it is 4x more expensive, it is still ~4x cheaper than Sonnet, which itself saw a sizable 7.2 point CrawlBench-Hard bump between the June and October 3.5 Sonnet versions.

In other words, if you don’t have time or have a wide enough scrape data set that you cannot afford to craft a custom prompt, you could simply pay Anthropic to get a pretty decent baseline.

> Note: We considered adding the other newer bigger models like the o1 models but they do not yet support structured output and in any case would be prohibitively expensive and not realistic for practical extraction use.

Conclusion

Structured Data Extraction is a fundamental building block for any web-browsing LLM agent. We introduce CrawlBench-Easy and CrawlBench-Hard as a set of simple, realistic, reproducible benchmarks that any LLM Extraction tool can be evaluated against, offering enough data points to elucidate significant differences in model and prompt performance that line up with intuitive priors. We are by no means done - CrawlBench-Medium with its survey of e-commerce, social network, and admin panel scenarios is a possible next step - but with this initial publication, we are now able to quantify and progress the state of the art in LLM Extraction.