
Last month, an AI bot that handles technical support for Cursor, an up-and-coming tool for computer programmers, alerted several customers about a change in company policy. It said they were no longer allowed to use Cursor on more than one computer.
In angry posts to internet message boards, the customers complained. Some canceled their Cursor accounts. And some got even angrier when they realized what had happened: the AI bot had announced a policy change that did not exist.
"We have no such policy. Of course you can use Cursor on multiple machines," Michael Truell, the company's chief executive and co-founder, wrote in a Reddit post. "Unfortunately, this is an incorrect response from a front-line AI support bot."
More than two years after the arrival of ChatGPT, tech companies, office workers and everyday consumers are using AI bots for an ever-widening range of tasks. But there is still no way of ensuring that these systems produce accurate information.
The newest and most powerful technologies, so-called reasoning systems from companies like OpenAI, Google and the Chinese start-up DeepSeek, are generating more errors, not fewer. As their math skills have notably improved, their handle on facts has become shakier. It is not entirely clear why.
Today's AI bots are based on complex mathematical systems that learn their skills by analyzing enormous amounts of digital data. They do not, and cannot, decide what is true and what is false. Sometimes they simply make things up, a phenomenon some researchers call hallucination. On one test, the hallucination rates of newer AI systems were as high as 79 percent.
These systems use mathematical probabilities to guess the best response, not a strict set of rules defined by human engineers. So they make a certain number of mistakes. "Despite our best efforts, they will always hallucinate," said Amr Awadallah, the chief executive of Vectara, a start-up that builds AI tools for businesses, and a former Google executive. "That will never go away."
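To make that idea concrete, here is a small illustrative sketch, in Python, of how picking words by probability rather than by rule can produce a plausible but wrong statement. The vocabulary and the probabilities are invented for the example and do not come from any real model.

```python
import random

# Hypothetical probabilities a language model might assign to the next word
# after a prompt like "Our policy allows Cursor on ..." -- the numbers are
# invented for illustration, not taken from any real system.
next_word_probs = {
    "multiple": 0.55,   # the factually correct continuation
    "one": 0.30,        # a plausible but wrong continuation
    "licensed": 0.10,
    "rented": 0.05,
}

def sample_next_word(probs: dict[str, float]) -> str:
    """Pick the next word at random, weighted by the model's probabilities."""
    words = list(probs)
    weights = list(probs.values())
    return random.choices(words, weights=weights, k=1)[0]

# Because the choice is probabilistic rather than rule-based, a wrong but
# plausible word is sometimes produced -- the seed of a hallucination.
print(sample_next_word(next_word_probs))
```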
For several years, this phenomenon has raised concerns about the reliability of these systems. Though they are useful in some situations, such as writing news articles, summarizing office documents and generating computer code, their mistakes can cause problems.
The AI bots tied to search engines like Google and Bing sometimes generate search results that are laughably wrong. If you ask them for a good marathon on the West Coast, they might suggest a race in Philadelphia. If you ask them for the number of households in Illinois, they might cite a source that does not include that information.
These hallucinations may not be a big problem for many people, but they are a serious issue for anyone using the technology with court documents, medical information or sensitive business data.
"You spend a lot of time trying to figure out which responses are factual and which aren't," said Pratik Verma, co-founder and chief executive of Ocular, a company that helps businesses navigate the hallucination problem. "Not dealing with these errors properly essentially eliminates the value of AI systems, which are supposed to automate tasks for you."
Cursor and Mr. Truell did not respond to requests for comment.
For more than two years, companies like OpenAI and Google steadily improved their AI systems and reduced the frequency of these errors. But with the use of new reasoning systems, errors are rising. The latest OpenAI systems hallucinate at a higher rate than the company's previous system, according to the company's own tests.
The company found that o3, its most powerful system, hallucinated 33 percent of the time when running its PersonQA benchmark test, which involves answering questions about public figures. That is more than double the hallucination rate of OpenAI's previous reasoning system, called o1. The new o4-mini hallucinated at an even higher rate: 48 percent.
When running another test called SimpleQA, which asks more general questions, the hallucination rates for o3 and o4-mini were 51 percent and 79 percent. The previous system, o1, hallucinated 44 percent of the time.
In a paper detailing the tests, OpenAI said more research was needed to understand the cause of these results. Because AI systems learn from more data than people can wrap their heads around, technologists struggle to determine why they behave the way they do.
"Hallucinations are not inherently more prevalent in reasoning models, though we are actively working to reduce the higher rates of hallucination we saw in o3 and o4-mini," said Gaby Raila, a company spokeswoman. "We will continue to research hallucinations across all models to improve accuracy and reliability."
Hannaneh Hajishirzi, a professor at the University of Washington and a researcher at the Allen Institute for Artificial Intelligence, is part of a team that recently devised a way of tracing a system's behavior back to the individual pieces of data it was trained on. But because the systems learn from so much data, and because they can generate almost anything, this new tool cannot explain everything. "We still don't know how these models work exactly," she said.
Tests by independent companies and researchers indicate that hallucination rates are also rising for reasoning models from companies such as Google and DeepSeek.
Since late 2023, Mr. Awadallah's company, Vectara, has tracked how often chatbots veer from the truth. The company asks these systems to perform a straightforward task that is readily verified: summarize specific news articles. Even then, the chatbots persistently invent information.
Vectara's original research estimated that in this situation chatbots made up information at least 3 percent of the time and sometimes as much as 27 percent.
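As a rough illustration of how such a rate is tallied, the sketch below counts the share of summaries judged to have invented facts. The judgments are dummy values, and this is not Vectara's actual methodology (its benchmark relies on a trained model to make that call); it shows only the arithmetic behind a "hallucination rate."

```python
# Simplified sketch: each summary is judged as consistent with its source
# article or not, and the hallucination rate is the share of inconsistent ones.
judgments = [True, True, False, True, True, True, True, False, True, True]
# True = summary stayed faithful to the article, False = it invented facts

hallucination_rate = judgments.count(False) / len(judgments)
print(f"Hallucination rate: {hallucination_rate:.0%}")  # 20% in this toy sample
```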
In the year and a half since, companies such as OpenAI and Google pushed those numbers down into the range of 1 or 2 percent. Others, such as the San Francisco start-up Anthropic, hovered around 4 percent. But hallucination rates on this test have risen with reasoning systems. DeepSeek's reasoning system, R1, hallucinated 14.3 percent of the time. OpenAI's o3 climbed to 6.8 percent.
(The New York Times has sued OpenAI and its partner Microsoft, accusing them of copyright infringement regarding news content related to AI systems.)
For years, companies like OpenAI relied on a simple concept: the more internet data they fed into their AI systems, the better those systems would perform. But they eventually used up just about all the English text on the internet, which meant they needed a new way of improving their chatbots.
So these companies are leaning more heavily on a technique that scientists call reinforcement learning. With this process, a system can learn behavior through trial and error. It works well in certain areas, like math and computer programming. But it falls short in others.
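The toy Python sketch below illustrates the trial-and-error idea behind reinforcement learning: an agent repeatedly tries actions, observes rewards, and drifts toward whatever has paid off. The actions and payoffs are invented for the example; training a chatbot this way is vastly more elaborate, and this is not any company's actual recipe.

```python
import random

# Two possible actions with hidden payoff probabilities (invented numbers).
true_reward = {"A": 0.3, "B": 0.7}
estimates = {"A": 0.0, "B": 0.0}  # the agent's running reward estimates
counts = {"A": 0, "B": 0}

for step in range(1000):
    # Mostly exploit the best-looking action, occasionally explore at random.
    if random.random() < 0.1:
        action = random.choice(["A", "B"])
    else:
        action = max(estimates, key=estimates.get)
    reward = 1.0 if random.random() < true_reward[action] else 0.0
    counts[action] += 1
    # Update the running average of observed reward for the chosen action.
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)  # the estimates drift toward the hidden payoffs, favoring "B"
```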
"The way these systems are trained, they will start focusing on one task and start forgetting about others," said Laura Perez-Beltrachini, a researcher at the University of Edinburgh who is part of a team closely examining the hallucination problem.
Another problem is that reasoning models are designed to spend time "thinking" through complex problems before settling on an answer. As they try to tackle a problem step by step, they run the risk of hallucinating at each step. And the errors can compound as they spend more time thinking.
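A back-of-the-envelope calculation shows why errors can pile up. If every reasoning step carried some small, independent chance of going wrong (an assumption made purely for illustration; real reasoning steps are not independent), the chance that a long chain contains at least one error grows quickly:

```python
# P(at least one error in k steps) = 1 - (1 - p)**k, assuming independence.
p = 0.05  # assumed per-step error rate, invented for the example
for k in (1, 5, 10, 20):
    chance = 1 - (1 - p) ** k
    print(f"{k:2d} steps -> {chance:.0%} chance of at least one error")
```

With a 5 percent per-step error rate, one step goes wrong 5 percent of the time, but a 20-step chain contains at least one error roughly 64 percent of the time under these toy assumptions.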
The latest bots reveal each step to users, which means the users may see each error, too. Researchers have also found that in many cases, the steps displayed by a bot are unrelated to the answer it eventually delivers.
"What the system says it is thinking is not necessarily what it is thinking," said Aryo Pradipta Gema, an AI researcher at the University of Edinburgh and a fellow at Anthropic.