It was recently revealed that Apple, Anthropic, Nvidia, and Salesforce, among others, used YouTube subtitles to train their AI systems. The dataset in question consists of subtitles extracted from more than 170,000 YouTube videos belonging to more than 48,000 channels.
“Apple has sourced data for their AI from several companies,” according to content creator Marques Brownlee. He revealed in his X post that Apple scraped vast amounts of data, including transcripts, from YouTube videos.
The YouTube Subtitles dataset is a part of a larger collection called The Pile, developed by the nonprofit organization EleutherAI. This collection aims to provide a valuable dataset for AI development to those outside big tech companies.
Along with the YouTube transcripts, The Pile encompasses datasets from various sources, including books, Wikipedia articles, speeches from the European Parliament, and even emails from Enron. The Pile has been widely adopted: Apple used it to train its OpenELM AI model, and Salesforce's AI model has been downloaded more than 86,000 times.
Ethical and Legal Implications
Violation of YouTube's terms of service
The use of YouTube content, specifically in the form of scraped captions, for training AI models raises questions about potential violations of YouTube's terms of service.
YouTube CEO Neal Mohan previously stated that using video content, including transcripts, to train AI would go against the platform's terms. OpenAI, for its part, has not disclosed whether it is training its video-generation model Sora on YouTube content.
Lack of consent from content creators
One of the major concerns surrounding the use of YouTube videos for AI training is the lack of consent from the creators. Many content creators expressed their frustration at the unauthorized use of their work, particularly when it comes to deleted videos or those from creators who have since removed their online presence.
Creators such as David Pakman of "The David Pakman Show" and Julia Walsh, CEO of Complexly, voiced their frustrations, emphasizing the effort and resources they invest in producing content.
Companies' responses to the allegations
In response to the allegations, Anthropic spokesperson Jennifer Martinez stated that the company's use of The Pile dataset involves only “a very small subset” of YouTube subtitles and does not violate YouTube's terms of service.