The Huge Grey Area in the Anthropic Ruling

This week, AI firm Anthropic (the folks behind Claude) found themselves the focus of attention of the U.S. District Court for the Northern District of California.

New laws for new technologies

The tl;dr is: the court ruled that (a) piracy for the purpose of training an LLM is still piracy, so there’ll be a separate case about the fact that Anthropic did not pay for copies of all the books their model ingested, but (b) training a model on books and then selling access to that model, which can then produce output based on what it has “learned” from those books, is considered transformative work and therefore fair use.

Fragment of court ruling with a line highlighted that reads: This order grants summary judgment for Anthropic that the training use was a fair use.

Compelling arguments have been made both ways on this topic already, e.g.:

  • Some folks are very keen to point out that it’s totally permitted for humans to read, and even memorise, entire volumes, and then use what they’ve learned when they produce new work. They argue that what an LLM “does” is not materially different from an impossibly well-read human.
  • By way of counterpoint, it’s been observed that such a human would still be personally liable if the “inspired” output they subsequently created was derivative to the point of violating copyright, but we don’t yet have a strong legal model for assessing AI output in the same way. (The Disney & Universal vs. Midjourney case, as reported by BBC News, is going to be very interesting!)
  • Furthermore, it might be impossible to conclusively determine that the way GenAI works is fundamentally comparable to human thought. And that’s the thing that got me thinking about this particular thought experiment.

A moment of philosophy

Here’s a thought experiment:

Suppose I trained an LLM on all of the books of just one author (plus enough additional language that it was able to meaningfully communicate). Let’s take Stephen King’s 65 novels and 200+ short stories, for example. We’ll sell access to the API we produce.

Monochrome photograph showing a shelf packed full of Stephen King's novels.
I suppose it’s possible that Stephen King was already replaced long ago with an AI that was instructed to churn out horror stories about folks in isolated Midwestern locales being harassed by a pervasive background evil?
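For the technically-inclined, here’s roughly what building such a thing might look like. This is a minimal sketch using Hugging Face’s transformers and datasets libraries; the king_corpus/ directory, the gpt2 base model (standing in for the “enough additional language” requirement), and the hyperparameters are all hypothetical stand-ins, not a recipe I’ve actually run:

    # A rough sketch of "KingLLM": fine-tuning a small causal language model
    # on a single-author corpus. Everything here is hypothetical; gpt2 is a
    # stand-in base model, and king_corpus/ is an imagined directory of texts.
    from pathlib import Path

    from datasets import Dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # gpt2 defines no pad token
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Load the (hypothetical) single-author corpus: one text file per book.
    texts = [p.read_text(encoding="utf-8") for p in Path("king_corpus").glob("*.txt")]
    dataset = Dataset.from_dict({"text": texts})

    def tokenize(batch):
        # Truncate each book to the model's context window for simplicity;
        # a real pipeline would chunk the text rather than discard it.
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="kingllm", num_train_epochs=3),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()

The engineering isn’t the point, of course: however it’s built, a model fed on one author’s work can only ever remix that author’s voice.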

The output of this system would be heavily biased by the limited input it’s been given: anybody familiar with King’s work would quickly spot that the AI’s mannerisms echoed his writing style. Appropriately prompted – or just by chance – such a system would likely produce whole chapters of output that would surely be considered a substantial infringement of the original work, right?

If I make KingLLM, I’m going to get sued, and rightly so.

But if we accept that (and assume that the U.S. District Court for the Northern District of California would agree)… then this ruling on Anthropic would carry a curious implication: that if enough content is ingested, the operation of the LLM in itself is no longer copyright infringement.

Which raises the question: where is the line? How large must a training corpus be before a system’s processing is necessarily considered transformative of its inputs?

Clearly, trying to answer that question leads to a variant of the sorites paradox. Nobody can ever say that, for example, an input of twenty million words is enough to make a model transformative, but that with just one word fewer it must be considered to be perpetually ripping off what little knowledge it has!

But as more of these copyright holder vs. AI company cases work their way through the courts, it’ll be interesting to see where judges land. What is fair use, and what is infringing?

And wherever the answers land, I’m sure there’ll be folks like me coming up with thought experiments that sit uncomfortably in the grey areas that remain.
