Meta 'knowingly' trained AI with copyrighted books, authors claim

Meta logo seen on a smartphone with AI letters in the background. Credit: Shutterstock/Ascannio

Facebook and Instagram parent company Meta is facing allegations that it knowingly used copyrighted books for training its artificial intelligence (AI) models, despite warnings from its own legal team.

The claims are part of a copyright infringement lawsuit filed by notable authors, including comedian Sarah Silverman and Pulitzer Prize winner Michael Chabon.

According to a recent filing in the lawsuit, Meta’s lawyers had cautioned the company about the legal risks associated with using thousands of pirated books to train its AI language model, known as Llama. Despite these warnings, the company proceeded with the controversial practice, as outlined in the complaint.

The legal action, initiated this summer, was consolidated in a new filing on Monday, combining two separate lawsuits against Meta. The authors contend that Meta utilised their works without permission for the development of Meta’s AI models.

The latest complaint includes chat logs in which a Meta-affiliated researcher discusses the acquisition of the dataset on a Discord server. These logs are potential evidence that Meta was aware of the legal challenges its use of the books could face under US copyright law.

The quoted chat logs reveal discussions between researcher Tim Dettmers and Meta’s legal department regarding the legality of using book files as training data. In 2021, Dettmers mentioned that, for legal reasons, they were unable to use the dataset known as “The Pile” in its current form. Meta has acknowledged using the dataset to train the first version of Llama.

Dettmers’ correspondence with Meta’s legal team suggested concerns about the dataset’s compliance with US copyright law. The researchers in the chat identified “books with active copyrights” as a potential issue and debated whether training on such data would fall under the fair use doctrine, a legal principle protecting certain unlicensed uses of copyrighted works.

The lawsuit, which gained attention after a California judge dismissed part of the Silverman lawsuit last month, is evolving as the authors seek to amend their claims.

The outcome of these lawsuits could have broader implications for the AI industry, potentially affecting the development and cost of data-hungry models.

The cases against tech companies using copyright-protected works to train AI models may lead to increased scrutiny and compensation demands from content creators.

Simultaneously, new regulations in Europe may require companies to disclose the data used in training their AI models, exposing them to additional legal risks.

Meta previously released the first version of its Llama language model in February, disclosing the use of the “Books3 section of ThePile” for training. However, the company did not provide details on the training data for the latest version, Llama 2, released for commercial use this summer.

Llama 2 is free to use for companies with fewer than 700 million monthly active users and is considered a potential disruptor in the market for generative AI software.

GlobalData’s Thematic Intelligence: Artificial Intelligence report estimates the total AI market will be worth $383.3bn in 2030, implying a 21% compound annual growth rate between 2022 and 2030.

Copyright issues can arise when an AI system collects proprietary content from a media site and is trained on that intellectual property.

For instance, content creators can easily use generative AI to create realistic content using the images of popular Marvel characters such as Spider-Man or the Hulk without consent from Disney.

In 2023, Disney’s CEO, Bob Iger, highlighted that AI’s disruptive capabilities would create considerable issues with intellectual property management and that the company’s legal team was working to identify potential challenges.

In January 2023, Getty Images announced a lawsuit against Stability AI in London’s High Court of Justice, alleging that the image generator infringed Getty’s copyrighted photographs.

In February, visual artists Sarah Andersen, Kelly McKernan and Karla Ortiz filed a class action complaint in a US District Court in California against defendants Stability AI, Midjourney and DeviantArt, alleging that their works were used without permission to train AI.
