In the process of creating large and increasingly complex language models, artificial intelligence companies have encountered a huge obstacle: the exhaustion of accessible Internet data.
The Wall Street Journal reports that these companies have nearly exhausted the resources available on the open internet, signaling an impending shortage of data critical to training AI models.
Who would have thought that one day they would run out of data?
Seeking Alternative Data Sources
(Photo: Carlos Muza for Unsplash)
Even as AI companies pour billions of dollars into training their models, they still cannot address the elephant in the room: they are running out of internet data.
As traditional internet data reserves dwindle, AI companies are exploring alternative sources of training data. Some are turning to publicly available video recordings and to synthetic data generated by AI algorithms themselves. However, this approach comes with its own set of challenges, including a higher risk of AI models hallucinating when they are trained on artificially generated data.
Related article: Artificial intelligence may take over management positions in scientific research, study suggests
Concerns surrounding synthetic data
The reliance on synthetic data has raised concerns among experts about the potential drawbacks of using such datasets to train artificial intelligence models, FirstPost reported. Experts worry about a phenomenon known as "digital inbreeding," in which AI models trained on AI-generated data can become unstable, leading to degraded performance or outright failure.
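To make the intuition behind "digital inbreeding" concrete, here is a minimal, illustrative sketch (not any company's actual pipeline, and the vocabulary and sizes are invented for the example): each "generation" of training text is produced only by resampling the previous generation's output, so rare words can vanish by chance but new ones can never reappear — diversity can only shrink.

```python
import random

def resample(corpus, size):
    """Simulate training the next model generation solely on
    text sampled from the previous generation's output."""
    return [random.choice(corpus) for _ in range(size)]

random.seed(42)

# Toy "human corpus": 200 tokens drawn from a 50-word vocabulary.
vocab = [f"word{i}" for i in range(50)]
corpus = [random.choice(vocab) for _ in range(200)]

# Track how many distinct words survive across 10 synthetic generations.
diversity = [len(set(corpus))]
for generation in range(10):
    corpus = resample(corpus, 200)
    diversity.append(len(set(corpus)))

print("distinct words per generation:", diversity)
```

Because every generation draws only from the one before it, the set of surviving words can never grow, which is one simplified way of seeing why models trained repeatedly on AI-generated data tend to lose the variety present in the original human data.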
Controversial Data Training Methods
To combat data scarcity, AI giants like OpenAI are considering unconventional strategies to train their models.
For example, ChatGPT maker OpenAI is reportedly considering using publicly available YouTube video transcriptions to train its GPT-5 model. However, the practice has drawn criticism and may even lead to legal challenges from video content creators.
Tackling Data Scarcity With Better Synthetic Data
(Photo: KIRILL KUDRYAVTSEV/AFP via Getty Images)
Photo taken on February 26, 2024 shows the logo of the ChatGPT application, developed by U.S. artificial intelligence research group OpenAI, and the letters "AI" on a laptop screen behind a smartphone screen (L) in Frankfurt am Main, western Germany.
Despite the challenges, companies like OpenAI and Anthropic are actively working to improve the quality of synthetic data to address data scarcity. While specific methods remain confidential, the companies aim to develop high-quality synthetic data to sustain AI model training.
Looking Forward to a Breakthrough
Despite growing concerns about data scarcity, many experts remain optimistic about the potential for technological breakthroughs to alleviate these challenges.
While predictions suggest that AI may exhaust its available training data in the near future, significant advances in AI research could provide solutions to alleviate this dilemma.
Sustainable AI Development Practices
In the race for larger, more advanced artificial intelligence models, there is growing awareness of the environmental impact of their development.
Some advocate a shift in focus toward sustainable AI development practices that consider factors such as energy consumption and the environmental impact of mining rare earth minerals for computing chips.
As early as November 2023, Tech Times reported that artificial intelligence companies were about to run out of high-quality training data. A few months later, the topic has resurfaced, and data depletion looks like yet another problem the industry has to overcome.
Also read: New study uses AI to accurately predict monsoon rainfall 30 days in advance
ⓒ 2024 TECHTIMES.com All rights reserved. No reproduction without permission.