The billionaire entrepreneur suggested that technology firms must increasingly rely on “synthetic” data—content generated by AI models themselves—to train and refine future systems. This practice is already being adopted by leading AI developers as they grapple with the limitations of available data.
“The cumulative sum of human knowledge has been exhausted in AI training. That happened basically last year,” said Musk during a livestreamed interview on X, his social media platform. Musk, who founded the AI company xAI in 2023, explained that synthetic data could play a key role in addressing this challenge.
AI models, such as OpenAI’s GPT-4o, are typically trained on vast datasets sourced from the internet. These models learn statistical patterns in that data, which lets them do things like predict the next word in a sentence and generate coherent responses. However, Musk argued that because the supply of human-created content is finite, future training will increasingly depend on AI systems generating their own material.
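To make the pattern-learning idea concrete, here is a deliberately tiny sketch: a bigram model that counts which word follows which in a small corpus, then predicts the most likely successor. Production systems such as GPT-4o use neural networks trained on vastly larger datasets; the corpus and function names below are invented purely for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the vast internet-scale datasets real models use.
corpus = "the cat sat on the mat and the cat ate and the cat slept".split()

# Count which word follows which -- the simplest possible pattern to learn.
bigram_counts = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    bigram_counts[current_word][next_word] += 1

def predict_next(word):
    """Return the most frequently observed successor of `word`, if any."""
    followers = bigram_counts.get(word)
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("the"))  # -> 'cat' ("cat" follows "the" three times, "mat" once)
```

When the human-written corpus stops growing, a model like this has nothing new to count, which is the exhaustion problem Musk describes.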
“The only way to supplement that is with synthetic data,” Musk said, describing a process where AI models “write an essay or come up with a thesis, grade themselves, and go through this process of self-learning.”
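The generate-grade-retrain loop Musk describes resembles what researchers call self-training. The sketch below is a hypothetical outline of such a loop, not xAI’s actual pipeline: `generate`, `grade`, and `fine_tune` are placeholder functions standing in for a real model’s generation, self-scoring, and training steps, and the 0.7 quality threshold is arbitrary.

```python
import random

# Hypothetical placeholders: a real pipeline would call an actual model here.
def generate(model, prompt):
    """Stand-in for the model writing an essay or thesis on `prompt`."""
    return f"{prompt}: draft #{random.randint(1, 100)} by {model['name']}"

def grade(model, text):
    """Stand-in for the model scoring its own output (0.0 to 1.0)."""
    return random.random()

def fine_tune(model, examples):
    """Stand-in for updating the model on the kept synthetic examples."""
    model["training_examples"] += len(examples)
    return model

model = {"name": "toy-model", "training_examples": 0}
prompts = ["photosynthesis", "supply and demand", "entropy"]

# One round of the generate -> grade -> retrain loop Musk describes.
# Only outputs the model grades highly are fed back in as training data.
keep = []
for prompt in prompts:
    essay = generate(model, prompt)
    if grade(model, essay) > 0.7:  # quality threshold, arbitrary in this sketch
        keep.append(essay)

model = fine_tune(model, keep)
print(f"kept {len(keep)} synthetic examples; trained on {model['training_examples']} total")
```

In a real system the grading step is the hard part: the model must judge its own work, which is exactly where the hallucination risk raised below enters.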
Major AI firms are already incorporating synthetic data: Meta has used it to fine-tune its Llama models and Microsoft its Phi-4 model, while Google and OpenAI have also explored the approach. However, Musk warned that synthetic data carries risks, particularly AI “hallucinations,” in which models produce inaccurate or nonsensical results.
“How do you know if it hallucinated the answer or if it’s a real answer?” Musk said, highlighting the challenge of ensuring synthetic content maintains accuracy.
Experts have echoed Musk’s concerns. Andrew Duncan, director of foundational AI at the UK’s Alan Turing Institute, noted that over-reliance on synthetic data could lead to “model collapse,” where the quality of AI outputs deteriorates. “When you start to feed a model synthetic stuff, you get diminishing returns,” he said, warning of potential biases and reduced creativity in outputs.
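Duncan’s “diminishing returns” can be illustrated with a toy statistical simulation: fit a simple model to data, replace the data with the model’s own samples, and repeat. Over enough generations the spread of the data collapses toward zero. This is only an analogue of model collapse, not a simulation of a real language model; the tiny sample size and long generation count are chosen to make the effect visible.

```python
import random
import statistics

random.seed(0)

# Generation 0: "human" data drawn from a wide distribution. A tiny sample
# size deliberately exaggerates how fast information is lost when each
# generation trains only on the previous generation's output.
data = [random.gauss(0.0, 1.0) for _ in range(10)]

for generation in range(1, 401):
    # "Train" on the current data by fitting a Gaussian (mean and spread),
    # then replace the dataset entirely with the fitted model's own samples.
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    data = [random.gauss(mu, sigma) for _ in range(10)]
    if generation % 50 == 0:
        print(f"generation {generation:3d}: spread = {sigma:.5f}")

# The spread tends toward zero: each refit discards a little of the original
# variety, a toy analogue of outputs losing diversity and quality when models
# are trained on their own synthetic output.
```

The same dynamic motivates Duncan’s next point: once AI-generated text floods the web, ordinary data collection starts feeding models their own output without anyone intending it.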
Duncan also raised concerns about the proliferation of AI-generated content online, which could inadvertently be reabsorbed into training datasets, compounding the issue.
The debate over high-quality data has become a legal flashpoint in the AI industry. OpenAI admitted last year that models like ChatGPT rely on copyrighted material for training, prompting demands from publishers and creatives for compensation. As AI continues to expand, control over data sources is poised to remain a critical issue.