Facebook and Instagram parent company Meta is facing allegations that it knowingly used copyrighted books for training its artificial intelligence (AI) models, despite warnings from its own legal team.

The claims are part of a copyright infringement lawsuit filed by notable authors, including comedian Sarah Silverman and Pulitzer Prize winner Michael Chabon.

According to a recent filing in the lawsuit, Meta’s lawyers had cautioned the company about the legal risks associated with using thousands of pirated books to train its AI language model, known as Llama. Despite these warnings, the company proceeded with the controversial practice, as outlined in the complaint.

The legal action, initiated this summer, was consolidated in a new filing on Monday, combining two separate lawsuits against Meta. The authors contend that Meta utilised their works without permission for the development of Meta’s AI models.

The latest complaint includes chat logs in which a Meta-affiliated researcher discusses the acquisition of the dataset in a Discord server. The authors present these logs as evidence that Meta was aware of potential legal challenges to its use of the books under US copyright law.

The quoted chat logs reveal discussions between researcher Tim Dettmers and Meta’s legal department regarding the legality of using book files as training data. Dettmers mentioned in 2021 that, due to legal reasons, they were unable to use the dataset known as “The Pile” in its current form. Meta has acknowledged using the dataset to train the first version of Llama.

Dettmers’ correspondence with Meta’s legal team suggested concerns about the dataset’s compliance with US copyright law. The researchers in the chat identified “books with active copyrights” as a potential issue and debated whether training on such data would fall under the fair use doctrine, a legal principle protecting certain unlicensed uses of copyrighted works.

The case, which gained attention after a California judge dismissed part of the Silverman lawsuit last month, continues to evolve as the authors seek to amend their claims.

The outcome of these lawsuits could have broader implications for the AI industry, potentially affecting the development and cost of data-hungry models.

The cases against tech companies using copyright-protected works to train AI models may lead to increased scrutiny and compensation demands from content creators.

Simultaneously, new regulations in Europe may require companies to disclose the data used in training their AI models, exposing them to additional legal risks.

Meta released the first version of its Llama language model in February, disclosing the use of the “Books3 section of ThePile” for training. However, the company did not provide details on the training data for the latest version, Llama 2, released for commercial use this summer.

Llama 2 is free to use for companies with fewer than 700 million monthly active users and is considered a potential disruptor in the market for generative AI software.

GlobalData’s Thematic Intelligence: Artificial Intelligence report estimates the total AI market will be worth $383.3bn in 2030, implying a 21% compound annual growth rate between 2022 and 2030.

Copyright issues can arise when an AI system collects proprietary content, for example from a media site, and is trained on that intellectual property.

For instance, content creators can easily use generative AI to create realistic content using the images of popular Marvel characters such as Spider-Man or the Hulk without consent from Disney.

In 2023, Disney’s CEO, Bob Iger, highlighted that AI’s disruptive capabilities would create considerable issues with intellectual property management and that the company’s legal team was working to identify potential challenges.

In January 2023, Getty Images announced a lawsuit against Stability AI in London’s High Court of Justice, alleging that the image-generator developer infringed Getty’s copyrighted photographs.

In February, visual artists Sarah Andersen, Kelly McKernan and Karla Ortiz filed a class action complaint in a US District Court in California against defendants Stability AI, Midjourney and DeviantArt, alleging that their works were used without permission to train AI.