New research shows that larger models like OpenAI’s GPT-4 and Meta’s LLaMA are more prone to generating false information, raising concerns over the reliability of the most sophisticated AI systems as they continue to evolve.
A study published in the journal Nature explored how advanced language models, while excelling at handling complex questions, are more likely to fabricate responses to questions they can’t accurately answer.
Models such as OpenAI’s GPT-4, Meta’s LLaMA, and BigScience’s BLOOM were tested on a range of subjects, including math and geography. The research found that these models attempt to answer nearly every question posed to them, leading to a greater frequency of incorrect answers compared to older, less complex versions.
According to the study’s co-author, José Hernández-Orallo from the Valencian Research Institute for Artificial Intelligence, the issue arises because these models are programmed to provide answers regardless of whether they are accurate.
“They are answering almost everything these days. And that means more correct, but also more incorrect [answers],” said Hernández-Orallo.
This behavior has led experts such as Mike Hicks of the University of Glasgow to criticize the models for what Hicks describes as “pretending to be knowledgeable.”
While these AI systems have improved at answering more sophisticated questions, they continue to struggle with simpler ones, which undermines confidence in their overall reliability. Compounding the problem, human evaluators often misjudged the accuracy of the models’ responses.
Researchers suggest that these AI systems could be improved by programming them to decline to answer when they are unsure, rather than attempting to answer every question. Doing so, however, could make the models appear less capable, a trade-off companies may be reluctant to accept.
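For readers curious what “declining to answer” could look like in practice, the sketch below shows one common pattern: suppressing an answer whenever the model’s confidence estimate falls below a threshold. It is a minimal illustration only; the answer_with_confidence function, the 0.75 threshold, and the placeholder values are hypothetical and are not taken from the study or from any particular model’s API.

```python
# Minimal sketch of the "decline when unsure" idea (illustrative only, not from the study).
# `answer_with_confidence` is a hypothetical stand-in for a model call that returns
# both an answer and a self-reported confidence score between 0 and 1.

from typing import Tuple


def answer_with_confidence(question: str) -> Tuple[str, float]:
    """Hypothetical model call returning (answer, confidence)."""
    # A real implementation would wrap an LLM API plus a calibration step;
    # here we return fixed placeholder values for illustration.
    return "Sydney", 0.62


def guarded_answer(question: str, threshold: float = 0.75) -> str:
    """Return the model's answer only if its confidence clears the threshold."""
    answer, confidence = answer_with_confidence(question)
    if confidence < threshold:
        return "I'm not sure."  # abstain instead of risking a fabricated answer
    return answer


print(guarded_answer("What is the capital of Australia?"))
# Prints "I'm not sure." because the placeholder confidence (0.62) is below 0.75.
```

Even in this toy version, the trade-off the researchers describe is visible: raising the threshold produces fewer wrong answers, but also fewer answers overall.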