
How to scale RAG and build more accurate LLMs

Retrieval augmented generation (RAG) has emerged as a leading pattern to combat hallucinations and other inaccuracies that impact the generation of large language model content. However, RAG requires the right data architecture around it to scale effectively and efficiently. A data streaming approach forms the basis for the optimal architecture to provide LLMs with large volumes of continuously enriched, reliable data to generate accurate results. This approach also enables data and application teams to operate and scale independently to accelerate innovation.

Foundational LLMs like GPT and Llama are trained on large amounts of data and can often generate reasonable answers on a wide range of topics, but they can also generate flawed content. As Forrester recently noted, public LLMs “frequently produce results that are either irrelevant or outright wrong” because their training data is weighted toward publicly available web data. Furthermore, these foundational LLMs are completely blind to the business data locked away in customer databases, ERP systems, company wikis, and other internal data sources. This hidden data must be tapped to improve accuracy and unlock real business value.

RAG enables data teams to contextualize prompts with domain-specific business data in real time. With this additional context, the LLM is much more likely to identify the right pattern in the data and provide a correct, relevant answer. This is critical for popular business use cases like semantic search, content generation, or copilots, where outputs must be based on accurate, up-to-date information to be reliable.
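
In code, the core pattern is compact. The sketch below is illustrative only: `embed`, `vector_store`, and `llm` are hypothetical stand-ins for whichever embedding model, vector database client, and LLM API you actually use.

```python
# Minimal RAG sketch: retrieve domain-specific context, then prompt the LLM.
# `embed`, `vector_store`, and `llm` are hypothetical stand-ins for your
# embedding model, vector database client, and LLM API.

def answer(question: str, top_k: int = 5) -> str:
    # 1. Embed the user's question into the same vector space as the documents.
    query_vector = embed(question)

    # 2. Retrieve the most semantically similar business data.
    context_chunks = vector_store.search(query_vector, limit=top_k)

    # 3. Contextualize the prompt with the retrieved, up-to-date information.
    context = "\n".join(chunk.text for chunk in context_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # 4. The LLM now answers grounded in current, domain-specific data.
    return llm.complete(prompt)
```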

Why not just train an LLM on industry-specific data?

Current best practices for generative AI often involve building base models by training billion-parameter transformers on massive amounts of data, making this approach prohibitively expensive for most organizations. For example, OpenAI has said it spent over $100 million training GPT-4. Research and industry are beginning to show promising results for small language models and cheaper training methods, but these are not yet broadly generalizable or commercialized. Fine-tuning an existing model is another, less resource-intensive approach and may become a viable option in the future, but this technique still requires significant expertise to get right. One of the benefits of LLMs is that they democratize access to AI, but having to hire a team of PhDs to fine-tune a model largely negates that advantage.

RAG is the best option today, but it must be implemented in a way that provides accurate and timely information, and in a managed way that can scale across applications and teams. To see why an event-driven architecture is the best fit, it’s helpful to look at four patterns of GenAI application development.

1. Data augmentation

An application must be able to retrieve relevant contextual information, which is typically achieved with a vector database that looks up semantically similar information, often encoded in semi-structured or unstructured text. This involves collecting data from various operational repositories and chunking it into manageable pieces that retain their meaning. These chunks are then embedded and stored in the vector database, where they can later be matched to incoming prompts.
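
A minimal sketch of that augmentation step might look like the following, where `embedding_model` and `vector_store` are hypothetical stand-ins for your chosen embedding model and vector database; real chunking strategies are usually more sophisticated than the fixed-size split shown here.

```python
# Sketch of the augmentation step: chunk source text so each piece retains its
# meaning, embed it, and write it to the vector store. `embedding_model` and
# `vector_store` are hypothetical stand-ins for your chosen components.

def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Naive fixed-size chunking with overlap so context isn't cut mid-thought."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks

def index_document(doc_id: str, text: str) -> None:
    for i, chunk in enumerate(chunk_text(text)):
        vector = embedding_model.embed(chunk)   # encode the chunk
        vector_store.upsert(                    # store vector plus its payload
            id=f"{doc_id}-{i}",
            vector=vector,
            payload={"text": chunk, "source": doc_id},
        )
```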

An event-driven architecture is useful here because it is a proven method for integrating disparate data sources from across an enterprise in real time to provide reliable and trusted information. In contrast, a more traditional ETL (extract, transform, load) pipeline built on cascading batch operations is a poor choice because the information is often out of date by the time it reaches the LLM. An event-driven architecture ensures that when changes are made to the operational data store, those changes are propagated to the vector store that will be used to contextualize prompts. Organizing this data as streaming data products also promotes reusability, so these data transformations can be treated as composable components that serve multiple LLM-enabled applications.
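
As a rough illustration, assuming Kafka as the streaming platform and a change-data-capture topic named `customer-db.changes` (both hypothetical here), a small consumer can keep the vector store in sync by re-indexing records as they change, reusing the `index_document` helper from the previous sketch.

```python
import json
from confluent_kafka import Consumer

# Keep the vector store in sync with the operational database. The topic name
# and message shape are hypothetical; in practice this topic would be fed by a
# CDC connector on the source database.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "vector-store-sync",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["customer-db.changes"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    change = json.loads(msg.value())
    # Re-chunk and re-embed the changed record so retrieval stays current
    # (index_document is the helper from the previous sketch).
    index_document(change["id"], change["text"])
```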

2. Inference

Inference involves engineering prompts with data prepared in the previous steps and processing responses from the LLM. When a prompt comes in from a user, the application gathers relevant context from the vector database or an equivalent service to generate the best possible prompt.

Applications like ChatGPT often take seconds to respond, which is an eternity in distributed systems. By using an event-driven approach, this communication can happen asynchronously between services and teams. With an event-driven architecture, services can be decomposed along functional specializations, allowing application development teams and data teams to work separately to achieve their performance and accuracy goals.

Additionally, these applications can be deployed and scaled independently, because they are built from decomposed, specialized services rather than a monolith. This helps reduce time to market: new inference steps are simply new consumer groups, and the organization can template the infrastructure to instantiate them quickly.
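
As a sketch of that decomposition, assuming Kafka and purely illustrative topic names (`prompt-requests`, `prompt-responses`), an inference service can be little more than a consumer group that reads prompt events, calls the `answer` helper from the earlier sketch, and publishes responses; scaling it out is a matter of adding consumers to the group.

```python
import json
from confluent_kafka import Consumer, Producer

conf = {"bootstrap.servers": "localhost:9092"}
consumer = Consumer({**conf, "group.id": "inference-service", "auto.offset.reset": "earliest"})
producer = Producer(conf)
consumer.subscribe(["prompt-requests"])  # hypothetical topic names throughout

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    request = json.loads(msg.value())

    # The inference step: retrieve context and call the LLM via the answer()
    # helper from the earlier sketch. Scaling out this service is just a
    # matter of adding more consumers to the "inference-service" group.
    result = answer(request["question"])

    producer.produce(
        "prompt-responses",
        key=request["request_id"],
        value=json.dumps({"request_id": request["request_id"], "answer": result}),
    )
    producer.flush()
```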

3. Workflows

Reasoning agents and inference steps are often paired in sequences where the next LLM call depends on the previous answer. This is useful for automating complex tasks where a single LLM call is not sufficient to complete a process. Another reason to split agents into chains of calls is that today’s popular LLMs generally produce better results when asked multiple, simpler questions, although this is changing.
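
A retail-flavored illustration of such a chain, using the same hypothetical `llm` client as before: the first call asks a narrow classification question, and the second call is conditioned on its answer.

```python
# Sketch of a chained workflow: each LLM call is a smaller, simpler question,
# and the next step depends on the previous answer. `llm.complete` is the
# hypothetical LLM client used in the earlier sketches.

def handle_return_request(order_summary: str, customer_message: str) -> str:
    # Step 1: classify the customer's intent with a narrow, focused prompt.
    intent = llm.complete(
        "Classify this customer message as RETURN, EXCHANGE, or OTHER:\n"
        f"{customer_message}"
    )

    # Step 2: only if it is a return, draft a reply grounded in the order data.
    if "RETURN" in intent:
        return llm.complete(
            f"Using this order summary:\n{order_summary}\n"
            "Draft a short reply confirming the return process."
        )

    # Otherwise, ask the customer to clarify before taking any action.
    return llm.complete(
        "Draft a short reply asking the customer to clarify their request:\n"
        f"{customer_message}"
    )
```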

As the example workflow below illustrates, a data streaming platform allows the web development team to operate independently of the backend systems engineers, so each team can scale as needed. The data streaming platform enables this decoupling of technologies, teams, and systems.

Retail Example: Data Streaming and RAG

4. Post-processing

Despite our best efforts, LLMs can still generate erroneous results. Therefore, we need a way to validate the output and enforce business rules to prevent such errors from causing harm.

Typically, LLM workflows and dependencies change much faster than the business rules that determine whether outputs are acceptable. In the example above, we again see a good use of decoupling with a data streaming platform: the compliance team validating LLM outputs can operate independently to define the rules without having to coordinate with the team building the LLM applications.
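
As a rough sketch, the compliance team’s rules can be expressed as a small, independently deployed validation step that routes each response to an approved or flagged path. In the streaming architecture this would run as its own consumer group on the response topic; the patterns and topic names below are purely illustrative.

```python
# Sketch of the post-processing step: business rules that validate each LLM
# output before it is delivered. Owned and deployed by the compliance team,
# independently of the teams building the LLM applications.

import re

# Illustrative rules only; real rules come from the compliance team.
BLOCKED_PATTERNS = [
    re.compile(r"\bguarantee(d)?\b", re.IGNORECASE),  # no unauthorized promises
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),             # no SSN-like strings
]

def validate_output(answer: str) -> str:
    """Route an LLM answer to the approved or flagged path (topic names are hypothetical)."""
    if any(pattern.search(answer) for pattern in BLOCKED_PATTERNS):
        return "flagged-responses"   # held for human review
    return "approved-responses"      # safe to deliver to the user
```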

Conclusion

RAG is a powerful model for improving the accuracy of LLMs and making generative AI applications viable for enterprise use cases. But RAG is not a silver bullet. It must be surrounded by an architecture and data delivery mechanisms that enable teams to build multiple generative AI applications without reinventing the wheel, and in a way that meets enterprise standards for data governance and quality.

A data streaming model is the simplest and most efficient way to meet these needs, allowing teams to harness the full power of LLMs to generate new value for their business. As technology becomes the business and AI enhances that technology, the companies that compete effectively will integrate AI to improve and streamline more and more processes.

By adopting a common operating model for RAG applications, the enterprise can quickly bring the first use case to market while accelerating delivery and reducing costs for all those that follow.