Dec 15, 2024

Eric Ciarla

Why Companies Need a Data Strategy for Generative AI

A year ago, Generative AI-powered search seemed like the perfect quick win for companies taking their first steps with AI. Initial implementations looked straightforward and had the potential to drive significant value and productivity. Building a basic version took only an afternoon, and almost every company had a few engineers creating prototypes, most of which fell short after contact with real users. After building AI search for companies like MongoDB, Coinbase, and Snap, we learned the nuances that separate a demo from a production-ready system that actually drives value. It all comes back to a simple truth: the system is only as good as the data going in.

Problems that come up when building Generative AI apps

As noted above, it is fairly easy to get a basic retrieval-augmented generation (RAG) search system working with a subset of company data. But as you scale to more data and more users, this approach breaks down and yields mediocre results at best. Here’s why:

  • Context crowding: The correct context for a given query gets crowded out by bad context. Take the Snap AR docs, for example: the developer documentation website covers 4 different products, and each one has its own getting started page. If a user asks a basic RAG chatbot a vague question like “how do I get started”, the answer will most likely be an incorrect amalgamation of the 4 getting started guides.
  • Outdated data: Information and processes change constantly, and documentation is not always maintained. One of our first customers, Spectrocloud, was benchmarking our chatbot before going into production and found that one answer in particular was incorrect. At first we thought the model (GPT-3 at the time) was hallucinating, but after manually searching the docs we found outdated source information in an obscure part of the documentation.
  • Data cleanliness: If data isn’t clean, performance worsens and costs soar. We powered the chatbot on the LangChain documentation, where data cleanliness, and prompt injection in particular, was a huge issue. Many of the LangChain pages had prompt examples embedded in them, which confused the model at inference time. Early on with LangChain we also noticed that our index held a lot of unnecessary information, such as the navigation menus repeated on every page.
  • Data access: Accessing a variety of data sources is often critical for companies, but it introduces a host of challenges. For example, at Firecrawl, we’ve seen that many large companies simply want to access web data from their own websites, but even this can involve complex permissioning, authentication, and data-fetching hurdles.
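
To make context crowding concrete, here is a minimal sketch using a toy keyword-overlap retriever. It is not how any production RAG system ranks documents; the corpus, the product names, and the `score` and `retrieve` helpers are all hypothetical and exist only to show why a vague query cannot pick one getting started page over another.

```python
import re

# Hypothetical corpus: four products, each with its own getting-started page.
# Product names and page text are invented for illustration.
DOCS = [
    {"product": "product_a", "text": "Getting started with Product A: install the editor."},
    {"product": "product_b", "text": "Getting started with Product B: create an API key."},
    {"product": "product_c", "text": "Getting started with Product C: download the SDK."},
    {"product": "product_d", "text": "Getting started with Product D: join the beta program."},
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score(query, text):
    """Count distinct query tokens that also occur in the text (a crude relevance proxy)."""
    return len(set(tokenize(query)) & set(tokenize(text)))


def retrieve(query, docs, k=4):
    """Return the k highest-scoring docs; ties keep corpus order."""
    return sorted(docs, key=lambda d: score(query, d["text"]), reverse=True)[:k]


query = "how do I get started"
hits = retrieve(query, DOCS)
for hit in hits:
    # Every page ties on the same score, so nothing separates the products.
    print(hit["product"], score(query, hit["text"]))
```

All four pages receive the same score, so whatever reaches the model is a mix of four unrelated guides. A real embedding-based retriever behaves differently in detail, but a query with no product signal leaves it with the same ambiguity.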

Forming a data strategy to solve these problems

To mitigate these issues, companies building these apps need a data strategy aimed at curating and maintaining quality data. Here are some practical suggestions, one for each of the problems above, to guide your strategy.

  • Metadata Management: Good metadata is your first defense against context crowding. Every piece of content should be tagged with essential details like product association, who created it, and who can access it. This enables advanced filtering and more accurate responses.
  • Data Maintenance: To keep data fresh and reliable, the teams that create content should be responsible for regular reviews and updates. When underlying information changes, the corresponding documentation needs to change with it.
  • Data Sanitation: Raw data rarely arrives in ideal form. Before ingestion, strip away unnecessary formatting and information while preserving the essential details. While each content source requires different handling, tools like Unstructured can help standardize this process.
  • Data Access & Integration: Build the infrastructure to access your data sources seamlessly. You’ll need continuous data flow from knowledge bases, ticketing systems, websites, and more. Tools like Firecrawl can help build these pipelines and ensure high-quality data ingestion.
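
As a rough sketch of how the first three suggestions might look in code, the example below tags each chunk with metadata, drops lines that repeat across most pages (such as navigation menus), and filters by product, access group, and review date before retrieval. Everything here is an assumption made for illustration: the `Chunk` fields, the `strip_repeated_lines` and `filter_chunks` helpers, and the 50% and 365-day thresholds are hypothetical, and none of it is the Firecrawl or Unstructured API.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Chunk:
    text: str
    product: str               # product association
    author: str                # who created it
    allowed_groups: frozenset  # who can access it
    last_reviewed: date        # when the owning team last reviewed it


def strip_repeated_lines(pages, max_share=0.5):
    """Drop lines found on more than max_share of pages (likely menus or footers)."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    limit = max_share * len(pages)
    cleaned = []
    for page in pages:
        kept = [
            line for line in page.splitlines()
            if line.strip() and counts[line.strip()] <= limit
        ]
        cleaned.append("\n".join(kept))
    return cleaned


def filter_chunks(chunks, product, user_groups, today, max_age_days=365):
    """Keep chunks for one product that the user may read and that were reviewed recently."""
    return [
        c for c in chunks
        if c.product == product
        and c.allowed_groups & user_groups
        and (today - c.last_reviewed).days <= max_age_days
    ]


# Hypothetical pages that share one navigation line.
pages = [
    "Home | Docs | Blog\nInstall the SDK.",
    "Home | Docs | Blog\nConfigure authentication.",
    "Home | Docs | Blog\nDeploy to production.",
]
cleaned = strip_repeated_lines(pages)

chunks = [
    Chunk(cleaned[0], "product_a", "docs-team", frozenset({"public"}), date(2024, 11, 1)),
    Chunk(cleaned[1], "product_b", "docs-team", frozenset({"public"}), date(2024, 11, 1)),
    Chunk(cleaned[2], "product_a", "infra-team", frozenset({"staff"}), date(2024, 11, 1)),
    Chunk("Old install steps.", "product_a", "docs-team", frozenset({"public"}), date(2022, 1, 1)),
]
fresh = filter_chunks(chunks, "product_a", {"public"}, today=date(2024, 12, 15))
print([c.text for c in fresh])
```

Filtering on metadata before retrieval means the model only ever sees chunks from the right product that the user is allowed to read. The review-date cutoff is a blunt stand-in for the ownership process described above: it can hide stale pages, but it cannot update them.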

Conclusion

The industry is still in the early stages of solving these complex issues, so there is significant opportunity for innovative companies to emerge and tackle various aspects of the problem. Startups like Glean, Unstructured, and our own Firecrawl have made incredible progress, but no one has solved it all. No matter what tools emerge to make the process easier, a robust data strategy is foundational to building production-ready Generative AI apps. Thank you to Alex Meckes, SK Ramakuru, and Brian Leonard for their valuable insights and feedback that helped shape this post!

About the Author

Eric Ciarla (@ericciarla)

Eric Ciarla is the Chief Operating Officer (COO) of Firecrawl and leads marketing. He also worked on Mendable.ai and sold it to companies like Snapchat, Coinbase, and MongoDB. He previously worked at Ford and Fracta as a Data Scientist. Eric also co-founded SideGuide, a tool with 50,000 users for learning code within VS Code.