
Judge Alsup: Training AI On Copyrighted Works? Fair Use. Building Pirate Libraries? Not So Much

from the right-to-read dept

While dozens of AI copyright lawsuits wind their way through courts nationwide, Judge William Alsup’s ruling this week in Bartz v. Anthropic stands out — not just because it’s from one of the most thoughtful tech judges on the federal bench, but because it charts a somewhat nuanced path through the copyright minefield that could define how AI companies operate going forward.

The ruling has sparked predictably divergent takes, with observers claiming it’s both a big win and a big loss for AI. But the real story is more interesting: Alsup has essentially created a roadmap that validates legitimate AI training while drawing clear lines around what crosses into infringement.

The bottom line: this may cost Anthropic some serious money, but if the ruling stands, it’s actually great news for generative AI development generally.

In short, Judge Alsup found that training an AI system on unlicensed copyrighted works is easily transformative fair use. So too was buying physical books and scanning them into digital copies used for training. However, initially downloading a bunch of unlicensed works and storing them long-term as a kind of central library can be infringing.

To summarize the analysis that now follows, the use of the books at issue to train Claude and its precursors was exceedingly transformative and was a fair use under Section 107 of the Copyright Act. And, the digitization of the books purchased in print form by Anthropic was also a fair use but not for the same reason as applies to the training copies. Instead, it was a fair use because all Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies. However, Anthropic had no entitlement to use pirated copies for its central library. Creating a permanent, general-purpose library was not itself a fair use excusing Anthropic’s piracy.

The good (and I believe correct) part is that training is transformative fair use. Judge Alsup goes through the standard four-factor analysis, with the correct emphasis on the transformative nature of the use for generative AI training. Alsup notes that training generative AI tools on a corpus of information is the equivalent of how humans learn from works of the past, not to replace them, but to learn from them:

In short, the purpose and character of using copyrighted works to train LLMs to generate new text was quintessentially transformative. Like any reader aspiring to be a writer, Anthropic’s LLMs trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different. If this training process reasonably required making copies within the LLM or otherwise, those copies were engaged in a transformative use.

The first factor favors fair use for the training copies.

He finds similarly (though for slightly different reasons) on the hard copy books Anthropic purchased to scan. The scanning, a la Google Books, was for transformative purposes and, a la the Sony Betamax case, to make the content more convenient:

Storage and searchability are not creative properties of the copyrighted work itself but physical properties of the frame around the work or informational properties about the work. See Texaco, 802 F. Supp. at 14 (physical), aff’d, 60 F.3d at 919; Google, 804 F.3d at 225 (informational); Sony Corp. of Am. v. Universal City Studios, Inc. (“Sony Betamax”), 464 U.S. 417, 447 (1984) (rightful interests). In Texaco, the court reasoned that if a purchased scientific journal article had been copied “onto microfilm to conserve space, this might [have been] a persuasive transformative use.” 802 F. Supp. at 14 (Judge Pierre Leval), aff’d, 60 F.3d at 919 (reducing “bulk[ ]” “might suffice to tilt the first fair use factor in favor of Texaco if these purposes were dominant“). In Google Books, the court reasoned that a print-to-digital change to expose information about the work was transformative. Google, 804 F.3d at 225 (Judge Pierre Leval). And, in Sony Betamax, the Supreme Court held that making a recording of a television show in order to instead watch it at a later time was copying but did not usurp any rightful interest of the copyright owner. 464 U.S. at 447, 455. Important to the Supreme Court’s reasoning was the expectation that most such copiers would not distribute the permanent copies of the work.

And since that was effectively the same as what Anthropic did here, it gets another vote towards fair use:

Here, every purchased print copy was copied in order to save storage space and to enable searchability as a digital copy. The print original was destroyed. One replaced the other. And, there is no evidence that the new, digital copy was shown, shared, or sold outside the company. This use was even more clearly transformative than those in Texaco, Google, and Sony Betamax (where the number of copies went up by at least one), and, of course, more transformative than those uses rejected in Napster (where the number went up by “millions” of copies shared for free with others).

Thankfully, Alsup flatly rejects the idea that it can’t be fair use because authors/publishers might have wished to license these works at a higher rate. That’s not how this works:

Yes, Authors also might have wished to charge Anthropic more for digital than for print copies. And, this order takes for granted that Authors could have succeeded if Anthropic had been barred from the format change. “But the Constitution’s language [in Clause 8] nowhere suggests that [the copyright owner’s] limited exclusive right should include a right to divide markets or a concomitant right to charge different purchasers different prices for the same book, [merely] say to increase or to maximize gain.” See Kirtsaeng v. John Wiley & Sons, Inc., 568 U.S. 519, 552 (2013); see also U.S. CONST. art. I., § 8, cl. 8. Nor does the Copyright Act itself. Section 106 sets out exclusive rights that fair uses under Section 107 abridge. Section 106(1) reserves to the copyright owner the right to make reproductions. But on our facts we face the unusual situation where one copy entirely replaced the other. And, Section 106(2) reserves to the copyright owner the right to make derivative works that add or subtract creative material — as occurs in a “translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, [or] condensation” of a book, 17 U.S.C. § 101 (definitions). For some “other modification[ ]” of a book to constitute a “derivative work,” it must itself “represent an original work of authorship.” Ibid. But on our facts the format was changed but no content was added or subtracted. See Mirage Editions, Inc. v. Albuquerque A.R.T. Co., 856 F.2d 1341, 1342, 1343–44 (9th Cir. 1988) (yes where elements added to create new decorative ceramic). Section 106(3) further reserves to the copyright owner the right to distribute copies. But again, the replacement copy here was kept in the central library, not distributed. Cf. Fox News Network, LLC v. TVEyes, Inc., 883 F.3d 169, 176–78 (2d Cir. 2018) (enabling searching for “information about the material” can be transformative use, even if some distribution results); Lewis Galoob Toys, Inc. v. Nintendo of Am., Inc., 964 F.2d 965, 968, 971 (9th Cir. 1992) (using nifty converter to “merely enhance[ ]” audiovisual displays emitted from purchased videogame cartridge was fair use of those displays partly because no surplus copies of cartridge or displays were ever created).

As a result, Anthropic’s format-change from print library copies to digital library copies was transformative under fair use factor one. Anthropic was entitled to retain a copy of these works in a print format. It retained them instead in a digital format, easing storage and searchability. And, the further copies made therefrom for purposes of training LLMs were themselves transformative for that further reason, as above.

My quibble with this is that, for the books that were legally purchased or licensed and then used for training, there is an argument that you shouldn’t even need to get to fair use at all. If you buy a used book and read it and learn from it without directly paying the author or publisher, it’s not because of “fair use” that you can do it. It’s because reading and learning from the work doesn’t trigger copyright at all.

However, if we must go to fair use based on the fact that in this training process copies were made, having Alsup call it transformative fair use is a good outcome.

But then there’s the question of the non-licensed book collections (things like Books3 and LibGen) that Anthropic downloaded from the internet and then stored in the internal “digital library” it was using. And here, Alsup is not impressed and finds it difficult to see the fair use. Basically, in those cases, the company was clearly just downloading unlicensed copies to put into its own library.

This order doubts that any accused infringer could ever meet its burden of explaining why downloading source copies from pirate sites that it could have purchased or otherwise accessed lawfully was itself reasonably necessary to any subsequent fair use. There is no decision holding or requiring that pirating a book that could have been bought at a bookstore was reasonably necessary to writing a book review, conducting research on facts in the book, or creating an LLM. Such piracy of otherwise available copies is inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use and immediately discarded.

This feels close to reasonable. There are certainly plenty of cases on the books that show that simply downloading unlicensed content off the internet can be seen as infringing (though I’d still quibble that under the exact text of copyright law it only counts as a “copy” if it’s a “material object,” and purely digital content isn’t covered — but courts have long rejected that argument).

Where it still worries me a bit is that this feels pretty similar to things like “indexing the web.” Organizations like Google and the Internet Archive and many others copy all the content they can find online and store it in giant databases/indexes/libraries. And those have been found to be fair use in the past.

So what makes this different?

Judge Alsup tries to distinguish this from key cases regarding internet scanning, but this part feels weaker to me:

Nor were the initial copies made immediately transformed into a significantly altered form. In Perfect 10, images were copied by the search engine in thumbnail form only and deployed immediately into the transformative use of identifying the full-sized images and the pages from which they came. 508 F.3d at 1160, 1165, 1167. And, in Kelly v. Arriba Software Corp., images were copied at full size and then into thumbnails for immediate use in building a search engine, after which the full-sized copies were immediately deleted. 336 F.3d 811, 815 (9th Cir. 2003). Not here. The full-text copies of books were downloaded and maintained “forever.”

Nor does the initial copying here even resemble the full-text copying in the Google Books cases. There, libraries of authorized copies already had been assembled, and all copies therefrom were made for direct employment in a one-to-one further fair use — whether the transformative use of pointing to the works themselves, the use of providing the works in formats for print-disabled patrons, or the use of insuring against going out of print, getting lost, and becoming otherwise unavailable. HathiTrust, 755 F.3d at 97, 101, 103; Google, 804 F.3d at 206, 216–18, 228 (further distinguishing search and snippet uses, which “test[ed] the boundaries of fair use”). Not so here concerning the pirated copies. No authorized copies existed from which Anthropic made its first copies. No full-text copy therefrom was put immediately into use training LLMs. Not every copy was even necessary nor used for training LLMs. No initial copy was ever deleted, even if never used or no longer used. The university libraries and Google went to exceedingly great lengths to ensure that all copies were secured against unauthorized uses — both through technical measures and through legal agreements among all participants. Not so here. The library copies lacked internal controls limiting access and use.

This… feels like rationalization. Yes, the Perfect 10 and Arriba cases were about thumbnails, but search engines do more than turning content into thumbnails, and we generally consider that — even when it sweeps up infringing works on its own — to still be a fair use. So while I understand the logic of what Alsup is saying here, I do worry that it goes too far, and could wipe out other important and valuable uses.

Without going into too much detail on the other three factors (since they tend to matter less here), Alsup says the nature of the works cuts against fair use (but this factor rarely matters much in the final analysis). And while the copying required pretty much the entirety of the copyright-covered works, the amount-used factor leans towards fair use because (as multiple other cases have shown over the years) the use involved only the amount necessary to achieve the transformative purpose.

Copies selected for inclusion in training sets were selected because they were complete and because they contained rich protectible expression, or so this order accepts the record shows for Authors. Was all this copying reasonably necessary to the transformative use?

Yes.

“What matters [ ] is not so much ‘the amount and substantiality of the portion used’ in making a copy, but rather the amount and substantiality of what is thereby made accessible to a public [in the purported secondary use] for which it may serve as a competing substitute [for the primary use].”

Then there’s the dreaded “effect of the use upon the market” factor, which I honestly think shouldn’t be a fair use factor at all. But in this case, Alsup splits the three classes of works, saying the training use again favors fair use, since it has no direct impact on the market. The use to build the library is mixed again: the purchased copies are seen as neutral, while the unlicensed downloaded copies cut against fair use (again).

So, in the end: fair use for training, fair use for buying used books and scanning them, not fair use for downloading Books3/LibGen and creating an internal library out of them:

This order grants summary judgment for Anthropic that the training use was a fair use. And, it grants that the print-to-digital format change was a fair use for a different reason. But it denies summary judgment for Anthropic that the pirated library copies must be treated as training copies.

The win for AI is that the training aspect (and even the scanning aspect) is found to be fair use. But the people who say this is a win for the authors aren’t entirely wrong, because almost all of the big foundation LLM companies downloaded unauthorized copies (though it’s not clear all of them set up a “library” similar to Anthropic’s).

The prediction is that this one part, on which Alsup says there should be a trial, will likely lead Anthropic to try to settle the case and pay up for that use. That wouldn’t surprise me, given the insane statutory damages rates (effectively starting at $750 per work infringed, but going all the way up to a potential $150k per work if found to be willful).

Though, it also strikes me that even if the authors win, the remedy here wouldn’t require the destruction of the LLMs themselves, since it’s not the tool that is infringing, but rather the separate storage as a library.

Also left open, to me, is the question of what would happen if a model developer figured out a way to train on works like those in Books3/LibGen just by scanning them when found elsewhere online, without creating the internal library. That could limit some of the usefulness of those collections but would, in theory, avoid some of the liability risk Alsup sees here.

The end result then is that this ruling favors LLM training, which is good for innovation and usefulness. It might, however, ding more sketchy ancillary practices of the big LLM creators. And maybe that’s the proper balance? Alsup has created a framework that distinguishes between legitimate, transformative innovation practices and what amounts to direct infringement with a corporate veneer.

This distinction matters because it gives other AI companies a clear playbook (one that may come too late for some): if you want to avoid Anthropic’s potential liability, don’t create permanent archives of questionably sourced content. The ruling essentially says you can learn from copyrighted works, but you can’t just wholesale copy them into your corporate library.

Some will argue that’s a distinction without a difference, but it’s actually how copyright is supposed to work — focusing on the nature of the use rather than blanket prohibitions on touching copyrighted content.

Of course, this is still just one district court ruling among many pending cases, and appeals are inevitable. But if this framework holds up, it could reshape how AI companies approach data collection — favoring more legally defensible practices over the pure “move fast and break things” approach that might prove to be more trouble than it was worth.


Companies: anthropic
