In an effort to make each LLM, or large language model, more powerful than the last, AI companies have exhausted nearly all of the open internet and are running out of data. They may be forced to train upcoming models on AI-generated data, which comes with its own problems.
Artificial intelligence companies are facing a challenge that could render the billions of dollars big tech companies have invested in them meaningless: they are running out of internet.
In the race to develop bigger and more advanced large language models, AI companies have consumed nearly all of the open internet and now face an impending data shortage, The Wall Street Journal reports.
The problem is prompting some companies to look to alternative sources of training data, such as publicly available video recordings and AI-generated "synthetic data." However, using AI-generated data to train an AI model is itself problematic: it increases the likelihood that the model will hallucinate.
Additionally, discussions around synthetic data have raised serious concerns about the consequences of training AI models on AI-generated output. Experts believe that over-reliance on such data can lead to digital "inbreeding," which could ultimately cause the models themselves to collapse.
While entities like Dataology, founded by former Meta and Google DeepMind researcher Ari Morcos, are exploring ways to train large models with less data and fewer resources, most of the major players are experimenting with some rather unconventional and controversial data-sourcing methods.
For example, OpenAI is considering using transcriptions of publicly available YouTube videos to train its GPT-5 model, according to sources cited by The Wall Street Journal, although the company has already faced criticism for using such videos to train Sora and could face litigation from video creators.
Still, companies like OpenAI and Anthropic are planning to address this problem by developing high-quality synthetic data, although specific details about their methods remain unclear.
Concerns that AI companies will run out of training data have been around for quite some time. While some, like Epoch researcher Pablo Villalobos, predict that AI may exhaust its available training data in the coming years, major breakthroughs are widely expected to ease those concerns.
However, there is another solution to this dilemma: AI companies could choose not to pursue ever larger, more advanced models, given the environmental costs associated with their development, including massive energy consumption and the reliance of computing chips on rare-earth minerals.
(Based on agency reports)