Nvidia leak reveals they scraped “80 years worth” of YouTube videos a day to train AI
Nvidia scraped videos from YouTube, Netflix, and other sources to train its AI products, according to leaked internal Slack chats, emails, and documents.
The leak is part of an investigation by 404 Media. The publication revealed that Nvidia has been scraping videos from various sources for its Omniverse 3D world generator, self-driving car systems, and “digital human” products.
Employees involved in the scraping often questioned its ethics and legality but were silenced by managers, who also said they had clearance from the highest levels of the company to use the content.
The videos were scraped mainly from YouTube, but content from sources such as Netflix and GitHub was also used.
In a Slack message, an Nvidia employee also suggested scraping movies, reasoning that “Movies are actually a good source of data to get gaming-like 3D consistency and fictional content but much higher quality.”
To this, Ming-Yu Liu, Vice President of Research at Nvidia, replied, “We need a volunteer to download all the movies.”
Emails viewed by 404 Media show project managers discussing the use of 20 to 30 virtual machines on Amazon Web Services to download 80 years’ worth of videos per day.
“We are finalizing the v1 data pipeline and securing the necessary computing resources to build a video data factory that can yield a human lifetime visual experience worth of training data per day,” Liu said in an email in May.
In Slack channels, employees also discussed which YouTube channels’ videos should be scraped for AI training. A research scientist posted several links to YouTube channels and said: “If you are still open to suggestions about YouTube channels that we could download, here are a couple of channels that might be interesting to consider.”
The links pointed to the YouTube channels of brands such as Expedia and Architectural Digest, as well as individual content creators like Marques Brownlee (MKBHD). Next to MKBHD’s YouTube video link, the scientist added a note saying: “Tech product reviews – super high quality.”
When asked about the legal and ethical aspects of using copyrighted content to train an AI model, Nvidia told 404 Media that its practice is “in full compliance with the letter and the spirit of copyright law.”
In July, Nvidia was also accused of using data from a third-party company to train its AI models. The company in question had obtained that data by scraping creators’ YouTube videos without permission.