This week, AI firm Anthropic (the folks behind Claude) found themselves the focus of attention of the U.S. District Court for the Northern District of California.
New laws for new technologies
The tl;dr is: the court ruled that (a) piracy for the purpose of training an LLM is still piracy, so there’ll be a separate case about the fact that Anthropic did not pay for copies of all the books their model ingested, but (b) training a model on books and then selling access to that model, which can then produce output based on what it has “learned” from those books, is considered transformative work and therefore fair use.
Compelling arguments have been made both ways on this topic already, e.g.:
- Some folks are very keen to point out that it’s totally permitted for humans to read, and even memorise, entire volumes, and then use what they’ve learned when they produce new work. They argue that what an LLM “does” is not materially different from an impossibly well-read human.
- By way of counterpoint, it’s been observed that such a human would still be personally liable if the “inspired” output they subsequently created was derivative to the point of violating copyright, but we don’t yet have a strong legal model for assessing AI output in the same way. (The Disney & Universal vs. Midjourney case, as covered by BBC News, is going to be very interesting!)
- Furthermore, it might be impossible to conclusively determine that the way GenAI works is fundamentally comparable to human thought. And that’s the thing that got me thinking about this particular thought experiment.
A moment of philosophy
Here’s a thought experiment:
Suppose I trained an LLM on all of the books of just one author (plus enough additional language that it was able to meaningfully communicate). Let’s take Stephen King’s 65 novels and 200+ short stories, for example. We’ll sell access to the API we produce.

The output of this system would be heavily biased by the limited input it’s been given: anybody familiar with King’s work would quickly spot that the AI’s mannerisms echoed his writing style. Appropriately prompted – or just by chance – such a system would likely produce whole chapters of output that would surely be considered a substantial infringement of the original work, right?
If I make KingLLM, I’m going to get sued, rightly enough.
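
As an aside, it isn’t hard to sketch a crude check for that kind of verbatim echo. The snippet below is a minimal, hypothetical illustration – the function names and the eight-word window are my own arbitrary choices, not anything from the ruling or from any AI firm’s tooling – showing one rough way to measure how much of a model’s output appears word-for-word in the material it was trained on. For KingLLM, you’d expect this signal to light up bright red.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of n-word sequences (case-folded) found in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(generated: str, corpus: str, n: int = 8) -> float:
    """Fraction of the generated text's n-grams that appear verbatim in the corpus.

    A value near 0 suggests genuinely novel phrasing; a value near 1 suggests
    the output is largely regurgitating its training material.
    """
    gen_grams = word_ngrams(generated, n)
    if not gen_grams:
        return 0.0
    corpus_grams = word_ngrams(corpus, n)
    return len(gen_grams & corpus_grams) / len(gen_grams)


# Hypothetical usage: compare a chapter of KingLLM output against the
# combined text of the novels it was trained on.
# ratio = overlap_ratio(kingllm_chapter, all_king_novels)
# print(f"{ratio:.1%} of 8-word sequences appear verbatim in the corpus")
```

Of course, verbatim overlap is only the bluntest possible measure – style and structure can infringe without copying a single sentence – but it makes the point that “how much of this came straight from the source?” is at least a quantifiable question.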
But if we accept that (and assume that the U.S. District Court for the Northern District of California would agree)… then this ruling on Anthropic carries a curious implication: that if enough content is ingested, the operation of the LLM in itself is no longer copyright infringement.
Which raises the question: where is the line? What size of corpus must a system be trained upon before its processing must necessarily be considered transformative of its inputs?
Clearly, trying to answer that question leads to a variant of the sorites paradox. Nobody can credibly claim that, say, an input of twenty million words is enough to make a model transformative, but that a single word fewer leaves it forever ripping off what little knowledge it has!
But as more of these copyright holder vs. AI company cases work their way through the courts, it’ll be interesting to see where judges draw the line between fair use and infringement.
And wherever the answers land, I’m sure there’ll be folks like me coming up with thought experiments that sit uncomfortably in the grey areas that remain.