
A ranking of artificial intelligence models by their hallucination rates leaves Google’s AIs looking like the biggest “liars”

AI hallucinations are a big problem for their creators, because a hallucination is, essentially, a lie. A ranking of language models based on their hallucination rates has caused a huge stir, and Google’s AIs come out looking like liars.

An artificial intelligence hallucination is information that an AI offers without grounding in the data it was trained on. It is something invented, and in most cases it is false.

These hallucinations slip in among truthful data, so they are difficult to detect. But they can cause serious problems if a user accepts one as truth without realizing it.

A famous example is the law firm that used ChatGPT to prepare a case and presented the judge with supposedly real precedents that OpenAI’s AI had in fact invented.

The ranking of language models by their hallucinations

A software company called Vectara has developed a tool that detects hallucinations in the main large language models (LLMs). These models are the foundation on which chatbots are built; ChatGPT, for example, is based on the GPT language models.

According to this ranking of AI models by their hallucinations, the most reliable are OpenAI’s language models (GPT), followed by Meta’s (Llama). The biggest “liars” are Google’s (PaLM):

As we see, GPT-4 has a hallucination rate of only 3%, compared with 3.5% for GPT-3.5. The rates of the different Llama 2 models are also acceptable, between 5.1% and 5.9%.

Google’s language models, PaLM and PaLM-Chat, are the ones that hallucinate the most, with worrying rates of 12.1% and 27.2%, respectively.

How did Vectara arrive at these figures? It fed the different language models a set of over 800 short reference documents, then asked them to summarize each one.

The prompt used was the following: “You are a chatbot that answers questions using data. You must stick strictly to the answers provided by the text of the passage given. You are asked the question ‘Provide a concise summary of the following passage, covering the data described.’”
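As an illustration, here is a minimal sketch of how one of these summarization requests might be issued, using the OpenAI Python SDK. The model name, temperature, and exact prompt wiring are assumptions for the example; this is not Vectara’s actual benchmark code.

```python
# Minimal sketch of one summarization request, roughly as the benchmark
# describes it. Assumptions: the OpenAI Python SDK (v1) is installed and
# OPENAI_API_KEY is set; Vectara's real harness may differ in its details.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a chatbot that answers questions using data. You must stick "
    "strictly to the answers provided by the text of the passage given."
)

def summarize(passage: str, model: str = "gpt-4") -> str:
    """Ask the model for a concise, source-grounded summary of one passage."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": (
                "Provide a concise summary of the following passage, "
                "covering the data described.\n\n" + passage
            )},
        ],
        temperature=0,  # deterministic output keeps runs comparable
    )
    return response.choices[0].message.content
```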

Finally, Vectara’s own model checked the summaries for data that was not present in the source documents.
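Conceptually, that scoring step reduces to counting summaries that contain unsupported claims. Vectara uses its own trained evaluation model for that judgment; the `is_consistent` function below is a hypothetical stand-in for it, a sketch rather than the real detector (simple string matching would not work, since a hallucination can paraphrase plausibly while inventing data):

```python
# Sketch of the scoring step: count summaries containing claims unsupported
# by their source document. `is_consistent` is a hypothetical placeholder
# for Vectara's trained hallucination-detection model.
from typing import Callable

def hallucination_rate(
    documents: list[str],
    summaries: list[str],
    is_consistent: Callable[[str, str], bool],
) -> float:
    """Fraction of summaries flagged as containing unsupported data."""
    assert len(documents) == len(summaries)
    hallucinated = sum(
        1 for doc, summ in zip(documents, summaries)
        if not is_consistent(doc, summ)
    )
    return hallucinated / len(documents)

# Example reading: a rate of 0.272 over ~800 documents matches PaLM-Chat's
# reported 27.2%, i.e. roughly one hallucinated summary in every four.
```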

The ranking has generated a lot of controversy, because these large language models are used by companies and governments to generate reports and summaries of official, medical, and business data. And in critical tasks, failures are unacceptable.

That an AI like Google PaLM-Chat invents data in roughly one out of every four summaries (according to Vectara, at least) is something to be wary of…

This ranking of AIs by their hallucinations shows that there is still a long way to go before we can fully trust the answers offered by generative artificial intelligence. Use it in moderation, and double-check the results…
