Earlier this week, The Wall Street Journal reported that AI companies are running into difficulty collecting high-quality training data. Today, The New York Times detailed some of the ways companies are dealing with the problem. Unsurprisingly, it involves practices that fall into the murky gray area of AI copyright law.
The story begins with OpenAI, which, desperate for training data, reportedly developed its Whisper audio transcription model to get over the hump, transcribing more than one million hours of YouTube videos to train GPT-4, its state-of-the-art large language model. According to The New York Times, the company knew this was legally questionable but believed it qualified as fair use. OpenAI president Greg Brockman was personally involved in collecting the videos used, the Times wrote.
OpenAI spokesperson Lindsay Held told The Verge in an email that the company curates "unique" datasets for each of its models to "help them understand the world" and maintain its global research competitiveness. Held added that the company uses "numerous sources, including publicly available data and partnerships for non-public data," and that it is looking into generating its own synthetic data.
The Times article says the company exhausted its supplies of useful data in 2021 and discussed transcribing YouTube videos, podcasts, and audiobooks after burning through other resources. By then, it had trained its models on data that included computer code from GitHub, databases of chess moves, and homework content from Quizlet.
Google spokesperson Matt Bryant told The Verge in an email that the company has "seen unconfirmed reports" of OpenAI's activity, adding that "both our robots.txt files and Terms of Service prohibit unauthorized scraping or downloading of YouTube content," which echoes the company's terms of use. YouTube CEO Neal Mohan made similar remarks this week about the possibility that OpenAI used YouTube to train its Sora video generation model. Bryant said Google takes "technical and legal measures" to prevent such unauthorized use "when we have a clear legal or technical basis to do so."
Google also reportedly collected transcripts from YouTube, according to the Times' sources. Bryant said the company has trained its models "on some YouTube content, in accordance with our agreements with YouTube creators."
The Times wrote that Google's legal department asked the company's privacy team to adjust its policy language to expand what it could do with consumer data from office tools like Google Docs. The new policy was reportedly released on July 1st, timed to take advantage of the distraction of the Independence Day holiday weekend.
Meta likewise bumped up against the limits of good training data availability; according to the Times, its AI team discussed unlicensed use of copyrighted works while trying to catch up with OpenAI. After going through "virtually every available English-language book, essay, poem, and news article on the internet," the company apparently considered steps like paying for book licenses or even buying a large publisher outright. Meta was also apparently limited in how it could use consumer data by privacy-focused changes it made in the wake of the Cambridge Analytica scandal.
Google, OpenAI, and the broader AI training world are grappling with quickly evaporating supplies of training data for their models, which improve the more data they absorb. The Journal wrote this week that companies may outpace the creation of new content by 2028.
Possible solutions to that problem mentioned by the Journal on Monday include training models on "synthetic" data created by their own models, or so-called "curriculum learning," which involves feeding models high-quality data in an ordered way in the hope that they can make "smarter" connections between concepts using far less information; neither approach is proven, though. The other option is for these companies to use whatever data they can find, licensed or not, and based on the multiple lawsuits filed over the last year or so, that route is, let's say, more than a little fraught.
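To make the "curriculum learning" idea mentioned above concrete, here is a minimal, hypothetical sketch: training examples are scored for quality or difficulty and then presented in order, easiest and cleanest first. The scoring heuristic below (shorter text scores higher) is purely illustrative; real systems might use model loss, a quality classifier, or human ratings instead.

```python
# Hypothetical illustration of curriculum-style data ordering:
# score each training example, then present high-scoring
# (easier / cleaner) examples to the model first.

def quality_score(example: str) -> float:
    # Illustrative stand-in heuristic: shorter, simpler text
    # scores higher. Real scoring would be far more involved.
    return 1.0 / (1 + len(example.split()))

def curriculum_order(examples: list[str]) -> list[str]:
    # Sort so the highest-scoring (easiest) examples come first.
    return sorted(examples, key=quality_score, reverse=True)

corpus = [
    "The cat sat.",
    "A long, rambling sentence with many clauses, digressions, and asides.",
    "Dogs bark.",
]

for example in curriculum_order(corpus):
    print(example)
```

The point of the ordering, per the Journal's description, is that a model fed a deliberately sequenced diet of high-quality data might learn comparable relationships between concepts from far less raw text than indiscriminate scraping provides.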