No LLM can be trusted in isolation

I started building Glama after a simple observation: no LLM can be trusted on its own.

From awe to skepticism

Like many others, I was first introduced to LLMs through OpenAI’s GPT-2 model. At first, I would write a prompt, feed it to the model, and usually accept the answer as “most likely correct.” But at the time, I still viewed the technology as a promise of what would be possible in the future, rather than as a trusted colleague to consult.

Later, in June 2020, when GPT-3 came out, I was impressed by the many incredible demos and started exploring what it would be like to rely on LLMs for help with daily tasks in my field. This is where my confidence in LLMs started to diminish…

Trust, but verify

There is a phenomenon known as the Gell-Mann Amnesia Effect. It describes how an expert may find numerous errors in an article about their own field, yet go on to accept the same publication’s coverage of other topics as accurate, forgetting the flaws they just identified. Being aware of this phenomenon, and observing how frequently errors appeared in the answers I received, I stopped trusting LLMs without validating their output.

Over time, more models appeared, each accompanied by grander claims than the last. I started experimenting with all of them. Whatever the task, I developed the habit of pasting the same prompt into multiple models, such as OpenAI’s GPT, Anthropic’s Claude, and Google’s Gemini. This change in behavior brought me to a further insight:

A single LLM can be unreliable, but when multiple models independently reach the same conclusion, it increases confidence in the accuracy of the information.

As a result, my confidence in LLMs became commensurate with the level of consensus achieved by consulting multiple models.
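To make this habit concrete, here is a minimal sketch of fanning a single prompt out to two providers and computing a crude agreement signal. It assumes the OpenAI and Anthropic Python SDKs with API keys in the environment; the model names are illustrative, and this is my manual workflow in miniature, not Glama’s implementation.

```python
# Minimal sketch: ask two providers the same question and flag disagreement.
# Assumes OPENAI_API_KEY and ANTHROPIC_API_KEY are set; model names are illustrative.
import difflib

import anthropic
from openai import OpenAI

PROMPT = "In one sentence: what causes the seasons on Earth?"

gpt_answer = (
    OpenAI()
    .chat.completions.create(
        model="gpt-4o",  # illustrative
        messages=[{"role": "user", "content": PROMPT}],
    )
    .choices[0]
    .message.content
)

claude_answer = (
    anthropic.Anthropic()
    .messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative
        max_tokens=200,
        messages=[{"role": "user", "content": PROMPT}],
    )
    .content[0]
    .text
)

# Crude consensus signal: surface similarity of the two answers.
# Real verification would compare claims, not strings.
similarity = difflib.SequenceMatcher(
    None, gpt_answer.lower(), claude_answer.lower()
).ratio()

print("GPT:   ", gpt_answer)
print("Claude:", claude_answer)
print(f"Surface similarity: {similarity:.2f} (low values warrant a closer look)")
```

A low similarity score does not prove either model wrong, but it is a cheap cue that the question deserves manual verification.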

Limitations of LLMs

As we have seen, relying on a single LLM is risky. Based on my understanding of the technology, I believe this limitation is inherent to LLMs (rather than a matter of model quality), for the following reasons:

  1. Dataset bias: Each LLM is trained on a specific dataset and inherits its biases and limitations.
  2. Knowledge cutoff: LLMs have a fixed training cutoff date and lack information about more recent events.
  3. Hallucination: LLMs can generate plausible-sounding but incorrect information.
  4. Domain specificity: Models excel in certain areas but underperform in others.
  5. Ethical inconsistency: Alignment techniques vary between models, leading to inconsistent treatment of ethical questions.
  6. Overconfidence: LLMs can present incorrect information with great confidence.

By using multiple LLMs, we can mitigate these limitations. Different models complement each other’s strengths, allowing the user to cross-check information and obtain a more balanced perspective. While not perfect, this approach significantly improves the reliability of working with LLMs.

Note: In addition to what is discussed in this article, I would also like to draw attention to the emergence of “AI services” (it is no longer accurate to call them just LLMs) that can reason. These services combine techniques such as Dynamic Chain-of-Thought (CoT), Reflection, and Verbal Reinforcement Learning to produce answers with a higher level of confidence. There is a great article that goes into detail about what these techniques are and how they work. We are actively working to bring these capabilities to Glama.
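As a toy illustration of the Reflection technique mentioned above, the sketch below runs a generic draft, critique, and revise loop with a single model. It assumes the OpenAI Python SDK and an illustrative model name; real reasoning services are considerably more sophisticated.

```python
# Toy Reflection loop: draft an answer, critique it, then revise.
# Assumes the OpenAI Python SDK; the model name is illustrative.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # illustrative


def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


question = "How many prime numbers are there between 1 and 20?"

# 1. Draft an answer.
draft = ask(question)

# 2. Ask the model to critique its own draft.
critique = ask(
    f"Question: {question}\nDraft answer: {draft}\n"
    "List any factual or logical errors in the draft."
)

# 3. Revise the draft in light of the critique.
final = ask(
    f"Question: {question}\nDraft answer: {draft}\nCritique: {critique}\n"
    "Write a corrected final answer."
)

print(final)
```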

Glama: streamlining multi-model interactions

Recognizing the limitations of relying on one model, I developed Glama as a solution to streamline the process of obtaining perspectives from multiple LLMs. Glama provides a unified platform where users can interact with different AI models simultaneously, effectively creating a panel of AI advisors.

Key features of Glama include:

  1. Multi-model queries: Simultaneously consult multiple LLMs, including the latest models from Google, OpenAI, and Anthropic.

  2. Enterprise-level security:

    • Your data remains under your control and is never used for model training.
    • Encryption in transit (TLS 1.2+) and at rest (AES-256).
    • SOC 2 compliance, meeting strict security standards.
  3. Seamless integration:

    • Admin console for easy team management, including SSO and domain authentication.
    • Collaboration features like shared chat templates for streamlined workflows.
  4. Comparative analysis: Easily compare answers side by side to identify consistencies and discrepancies between models (a rough sketch of this idea follows the list).


  5. Customizable model selection: Choose which LLMs to consult based on your specific needs and security requirements.
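
To give a rough sense of what comparative analysis looks like, here is a small helper that renders two answers in columns. The answer strings are placeholders; Glama’s actual side-by-side view is a product feature, not this snippet.

```python
# Rough sketch of a side-by-side view for two model answers.
# The answer strings below are placeholders.
import textwrap


def side_by_side(left: str, right: str, width: int = 38) -> str:
    """Render two answers in two columns for quick visual comparison."""
    left_lines = textwrap.wrap(left, width) or [""]
    right_lines = textwrap.wrap(right, width) or [""]
    height = max(len(left_lines), len(right_lines))
    left_lines += [""] * (height - len(left_lines))
    right_lines += [""] * (height - len(right_lines))
    return "\n".join(
        f"{a:<{width}} | {b}" for a, b in zip(left_lines, right_lines)
    )


print(side_by_side(
    "The seasons are caused by the tilt of Earth's axis.",
    "Earth's axial tilt, not its distance from the Sun, causes the seasons.",
))
```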

By facilitating secure, efficient access to diverse AI perspectives, Glama empowers users to make more informed decisions, leveraging the strengths of multiple models while mitigating individual weaknesses – all within a robust, enterprise-ready environment.

Conclusion

In today’s AI landscape, relying on a single LLM is akin to seeking advice from just one expert – potentially valuable, but inherently limited. Glama embodies the principle that diversity in AI perspectives leads to more robust and reliable results. By streamlining access to multiple LLMs, Glama not only saves time but also improves the quality of AI-enabled decision-making.

As we continue to navigate the evolving world of AI, tools like Glama will play a critical role in helping users leverage the collective intelligence of multiple models…

There isn’t one AI to rule them all, but with Glama you can harness the power of many.