The US District Court for the Northern District of California granted summary judgment in favor of an artificial intelligence (AI) company, finding that its use of lawfully acquired copyrighted materials for training and its digitization of acquired print works fell within the bounds of fair use. However, the district court explicitly rejected the AI company’s attempt to invoke fair use as a defense for relying on pirated copies of copyrighted works as training data. Andrea Bartz, et al. v. Anthropic PBC, Case No. 24-CV-05417-WHA (N.D. Cal. June 23, 2025) (Alsup, J.)
Anthropic, an AI company, acquired more than seven million copyrighted books without authorization by downloading them from pirate websites. It also lawfully purchased print books, removed their bindings, scanned each page, and stored them in digitized, searchable files. The goal was twofold:
- To create a central digital library intended, in Anthropic’s words, to contain “all the books in the world” and to be preserved indefinitely.
- To use this library to train the large language models (LLMs) that power Anthropic’s AI assistant, Claude.
Each work selected to train the LLMs was copied at four main stages:
- Each selected book was copied from the library to create a working copy for training.
- Each book was “cleaned” by removing low-value or repetitive content (e.g., footers).
- Cleaned books were converted into “tokenized” versions: each was simplified, split into short character sequences, and translated into numerical tokens using Anthropic’s custom dictionary (a simplified illustration of this step follows this list). These tokens were used repeatedly in training, allowing the model to discover statistical relationships across massive amounts of text.
- Each fully trained LLM itself retained “compressed” copies of the books.
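To make the tokenization step concrete, the sketch below shows one simple way text can be translated into numerical tokens using a custom dictionary. It is purely illustrative and is not Anthropic’s actual tokenizer; the word-level splitting rule, the ID assignment, and the function names (build_vocab, tokenize) are assumptions made only to show the general idea.

```python
# Purely illustrative sketch of mapping text to numerical tokens with a
# custom dictionary. Not Anthropic's actual pipeline: the word-level
# splitting and ID assignment here are hypothetical simplifications.

def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign a numerical ID to each distinct piece of text (here, whole words)."""
    vocab: dict[str, int] = {}
    for text in corpus:
        for piece in text.lower().split():
            if piece not in vocab:
                vocab[piece] = len(vocab)
    return vocab


def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Translate text into numerical tokens using the custom dictionary."""
    return [vocab[piece] for piece in text.lower().split() if piece in vocab]


corpus = ["the quick brown fox", "the lazy dog"]
vocab = build_vocab(corpus)
print(tokenize("the quick dog", vocab))  # -> [0, 1, 5]
```

A real LLM tokenizer operates on subword character sequences rather than whole words, but the principle described in the decision is the same: text is reduced to sequences of numbers that the model consumes during training.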
Once the LLMs were trained, they did not output any of the books to the public through Claude. The company placed particular value on books with well-curated facts, structured analyses, and compelling narratives (i.e., works reflecting well-crafted creative expression) because Claude’s users expected clear, accurate, and well-written responses to their questions.
Andrea Bartz, along with two other authors whose books were copied from pirated and purchased sources and used to train Claude, sued Anthropic for copyright infringement. In response, Anthropic filed an early motion for summary judgment solely on the issue of fair use under Section 107 of the Copyright Act.
To assess the applicability of the fair use defense, the court separated and analyzed Anthropic’s actions across three distinct categories of use.
Transformative training (fair use)
The authors challenged only the inputs used to train the LLMs, not their outputs. The district court found that Anthropic’s use of copyrighted books to train its LLMs was a transformative use, comparable to how humans read and learn from texts and produce new, original writing. While the authors claimed that the LLMs memorized their creative expression, there was no evidence that Claude released infringing material to the public. The court concluded that using the works as training inputs – not for direct replication, but to enable the generation of new content – favored a finding of fair use.
Format-shifting copies (fair use)
The authors challenged Anthropic’s conversion of the copyrighted works from print to digital format, although they did not allege that Anthropic distributed any of the digital copies outside the company. The district court found that Anthropic had lawfully purchased the print editions and acquired the right to retain and use them for all ordinary purposes. Each print copy was digitized to save space and enable search functionality, and the original was destroyed after conversion. The court concluded that the print-to-digital format change was transformative under fair use.
Liability for piracy (not fair use)
The district court agreed with the authors that Anthropic’s downloading and retention of more than seven million pirated books – without payment – was not a fair use, regardless of whether the books were ultimately used to train its AI models. Even after Anthropic decided not to train its LLMs on those pirated copies, it kept them as part of a central research library, a use the court found inherently infringing and non-transformative. The court rejected Anthropic’s argument that its long-term goal of a transformative use (training LLMs) could retroactively justify the initial infringement, emphasizing that each act of copying must be judged by its own objective use. The court explained that “such piracy of otherwise available copies is inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use and immediately discarded.”
Anthropic now faces a jury trial limited to damages arising from the pirated copies.
Practice note: This is the first federal court decision analyzing the defense of fair use of copyrighted material to train generative AI. Two days after this decision issued, another Northern District of California judge ruled in Kadrey et al. v. Meta Platforms Inc. et al., Case No. 3:23-cv-03417, and concluded that the AI technology at issue in his case was transformative. However, the basis for his ruling in favor of Meta on the question of fair use was not transformation, but the plaintiffs’ failure “to present meaningful evidence that Meta’s use of their works to create [a generative AI engine] impacted the market” for the books.