Apple, Anthropic, and Other Tech Giants Secretly Used YouTube Videos to Train AI

Written by
ArticleGPT

Reviewed and fact-checked by the HIX.AI Team

2 min read
Jul 18, 2024

In a Nutshell

Tech giants such as Apple and Anthropic have allegedly used YouTube video captions to train AI, raising worries about data rights and fair use.

Apple, Anthropic, Nvidia, and Salesforce, among others, have reportedly used a dataset of YouTube subtitles to train their AI systems. The dataset, known as YouTube Subtitles, consists of subtitles extracted from more than 170,000 YouTube videos belonging to more than 48,000 channels.

“Apple has sourced data for their AI from several companies,” content creator Marques Brownlee wrote in a post on X. He added that one of those companies scraped vast amounts of data, including transcripts, from YouTube videos.

The YouTube Subtitles dataset is a part of a larger collection called The Pile, developed by the nonprofit organization EleutherAI. This collection aims to provide a valuable dataset for AI development to those outside big tech companies.

Along with the YouTube transcripts, The Pile includes datasets from various sources, such as books, Wikipedia articles, speeches from the European Parliament, and even emails from Enron. The Pile has been widely adopted: Apple used it to train its OpenELM AI model, and a Salesforce AI model trained on it has been downloaded more than 86,000 times.

Violation of YouTube's terms of service

The use of YouTube content, specifically in the form of scraped captions, for training AI models raises questions about potential violations of YouTube's terms of service.

YouTube CEO Neal Mohan previously stated that using video content, including transcripts, to train AI would violate the platform's terms. OpenAI, for its part, has not disclosed whether its video-generation model Sora was trained on YouTube content.

Lack of consent from content creators

One of the major concerns surrounding the use of YouTube videos for AI training is the lack of consent from creators. Many content creators have expressed frustration at the unauthorized use of their work, particularly where the dataset includes deleted videos or videos from creators who have since removed their online presence.

Creators such as David Pakman of "The David Pakman Show" and Julie Walsh Smith, CEO of Complexly, voiced their frustrations, emphasizing the effort and resources they invest in producing content.

Companies' responses to the allegations

In response to the allegations, Anthropic spokesperson Jennifer Martinez stated that The Pile includes only “a very small subset” of YouTube subtitles and that the company's use of the dataset does not violate YouTube's terms of service.

