1 of 34

GENERATIVE AI MEETS COPYRIGHT LAW

Pamela Samuelson, Berkeley Law School

CS195: Social Implications of Computing Technology

October 18, 2023

2 of 34

3 PRINCIPAL © QUESTIONS

When (if at all) does making copies of ©’d works as training data for generative AI systems infringe ©s in those works?
When (if at all) are AI-generated outputs (e.g., texts or images) infringing derivative works of ingested content?
When (if at all) does removal or alteration of copyright management information (CMI) associated with copies of works violate a copyright-related rule (§ 1202)?

All three questions are posed in litigations vs generative AI companies
All three are likely topics for a Copyright Office report later this year

Gen AI copyright

10/18/23

3 of 34

WHO IS GETTING SUED BY WHOM & FOR (c)?

Stability AI for (c) infringement & 1202 violations

Getty: Stable Diffusion’s model was trained on 12 million Getty images
Sarah Andersen (on behalf of a class of visual artists whose works were used as training data) claims training data = illegal copying + all outputs = infringing derivative works (DWs)

Midjourney & DeviantArt also defendants in Andersen

OpenAI for (c) infringement & 1202 violations

Silverman & Tremblay class action lawsuits on behalf of all authors whose works were used as training data focus on ChatGPT (claims are same as Andersen’s)
Authors Guild class action lawsuit (no 1202 claims)

Meta for (c) infringement & 1202 violations

Kadrey class action on behalf of all authors whose works were used as training data focuses on LLaMA (claims are same as Andersen’s)

Alphabet for (c) infringement & privacy violations

J.L. class action lawsuit on behalf of all persons whose data was used as training data focuses on Bard

4 of 34

OTHER AI LAWSUITS

OpenAI, GitHub, & Microsoft (MS) sued for breach of open source licenses & 1202

Focus on Copilot which uses OpenAI’s Codex LLM to generate code in response to user prompts; Codex was trained on billions of lines of open source code
4 John Does are class action plaintiffs

OpenAI for privacy violations & related torts

P.M. is named plaintiff on behalf of a class of users whose data ChatGPT uses

NeoCortext for right of publicity violations

Young is named plaintiff for class action
Reface SW allows users to swap their or others’ face images for celebrities’ faces

Prisma Labs for violation of Illinois biometric data privacy law

Flora is class action plaintiff, suing Prisma over Lensa SW that collects biometric data of its users’ faces when they create avatars

5 of 34

CLASS ACTION LAWSUITS

The same conduct by company A may harm many people in the same or a very similar way
Harm may be small enough as to each individual that suing the company for that would be cost-prohibitive, but if small harms are aggregated, total harms may be substantial enough to justify a lawsuit
Some individuals may contact a lawyer & claim to represent a class of similarly situated people who were injured in the same way
Court must “certify” that these class claims are sound
Class action cases often settle, but can result in large judgments if they go to a jury trial (which is why they often settle—risk reduced)

6 of 34

OTHER GENERATIVE AI LEGAL PROCEEDINGS

Copyright Office held 4 “listening sessions” in spring 2023 to consider stakeholder perspectives on gen AI training data & output DW issues

1 on literary works, 1 on visual art, 1 on audiovisual works, 1 on musical works

Notice of Inquiry posed questions on which the Office wants comments

Comments due by Oct 30, 2023; will be posted on the Internet

Office will publish a report & maybe make recommendations to Congress

Congressional committees have held hearings on generative AI issues

Some witnesses have called for legislative changes

Collective licensing for use of works as training data
Federal right of publicity law (to address celebrity impersonations)
Amend § 1202 to impose strict liability when CMI is removed
What about requiring AI outputs to identify as such?

7 of 34

MOTIVATIONS FOR LAWSUITS

Class action lawyers see potential for big $$$ awards

They may be eligible for as much as 1/3 of the take
Cases may establish new legal precedents, burnish reputation, & attract new clients

Many authors, visual artists, songwriters, scriptwriters, & programmers fiercely object to use of their works as training data

You didn’t ask permission & you’re not compensating us for uses of our works
Only reason gen AI systems produce high quality outputs is because you built models on our works
Gen AI outputs are likely to compete in the marketplace with our works, & deprive us of income
Gen AI developers are huge corporations profiting from their uses of our works, so it’s unjust that they are making big bucks on our backs

9 of 34

BROADER MORAL PANIC ABOUT GEN AI

Future of Life Institute open letter asking for a 6-month pause in AI development:

“AI systems with human-competitive intelligence can pose profound risks to society and humanity….As stated in the widely-endorsed Asilomar AI Principles, Advanced AI could represent a profound change in the history of life on Earth, and should be planned for and managed with commensurate care and resources. Unfortunately, this level of planning and management is not happening, even though recent months have seen AI labs locked in an out-of-control race to develop and deploy ever more powerful digital minds that no one – not even their creators – can understand, predict, or reliably control.”

Ongoing national & international conversations focus on how to regulate AI
One commentary characterizes generative AI as a “Marxist nightmare” because “the work of millions [of writers is] accruing to a few capitalist owners who pay nothing for all of that labor”
Screenwriters & actors have been striking in part out of concern they will be replaced by gen AI outputs; will gen AI take away our jobs & livelihoods?

10 of 34

Original works of authorship (modest creativity requirement + eligible works)
That have been fixed in a tangible medium of expression

Rights vest in authors who can sell or license those rights in whole or part
© grants rights to control reproductions, distributions of copies, making of derivative works, & public performances & displays
Rights last for life of author + 70 years (for firms, 95 years from publication)
© protects “original expression” in works, not ideas, facts, or methods
©’s exclusive rights are limited by fair use & various other doctrines
Very generous remedies if infringement is found: actual damages & infringers’ profits or statutory damages, injunctive relief, impoundment & destruction of infringing material, & attorney fees
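The duration rule above reduces to simple arithmetic. The sketch below is a simplification that uses only the two rules stated on this slide (life of the author + 70 years; 95 years from publication for firms) and ignores other rules that can apply, such as those for older works. The function names and the example years are hypothetical.

```python
def individual_term_end(author_death_year: int) -> int:
    """Last year of protection under the 'life of author + 70 years' rule."""
    return author_death_year + 70


def firm_term_end(publication_year: int) -> int:
    """Last year of protection under the '95 years from publication' rule for firms."""
    return publication_year + 95


# Hypothetical examples: an author who died in 1990, a firm's work published in 2000.
print(individual_term_end(1990))
print(firm_term_end(2000))
```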

11 of 34

FAIR USE

Fair uses of in-(c) contents are not infringements

Example: the Supreme Court (SCT) ruled that Google made fair use of parts of the Java API when reimplementing them in the Android platform

Fair use is a defense to charges of infringement
Courts consider four factors in deciding if a use is fair:

Purpose & character of the challenged use

Favored uses: criticism, comment, news, teaching, research & scholarship
“Transformative” uses (e.g., parodies or when D’s work has different purposes than P’s)

Nature of the (c)’d work
Amount & substantiality of the taking
Effect of the challenged use on the market for or value of the work

12 of 34

INGESTING WORKS AS TRAINING DATA

Webcrawling texts, images, & other content posted on open sites on the Internet involves making copies of that content, as does curating that content for a training dataset, which may trigger (c)’s reproduction right
In Field v. Google, a court held that Google’s copying of Internet content for the purpose of indexing & caching contents was fair use
In Authors Guild v. Google, an appellate court held that G’s digitizing of millions of in-(c) books from research libraries for the purpose of indexing contents & serving up snippets in response to search queries, as well as other computational uses, was transformative fair use
In A.V. v. iParadigms, storage of student papers in plagiarism detection SW was held fair use; not exploiting the works’ expression

13 of 34

WHY SOME OBJECT TO FAIR USE CLAIMS

In Field & Authors Guild, Google was making it easier for users to find (c) owners’ works, not producing outputs to compete w/ those works

Stable Diffusion, by contrast, makes it easy for users to produce images that compete with the ingested images

(c) owners did not consent to ingestion, can’t easily opt out, & think it’s only fair that they get paid given the high value of their well-curated content, without which generative AI systems could only produce garbage

Authors Guild survey found that 90% of authors think AI developers should pay for this
Licensing markets are emerging for ingestion of works as training data

14 of 34

CONSIDERATIONS AFFECTING FAIR USE

Silverman complaint alleges that OpenAI’s models were trained on a corpus of “pirated” books

They too may be available on the open Internet, but not posted by the author
Training on Sci-Hub & Anna’s Archive may be troublesome as well

Courts may possibly treat LLMs differently than diffusion models

LLMs may be less vulnerable than diffusion models because more abstracted

Courts may possibly differentiate among data types

Because SW is functional & Codex is trained on open source, courts may be more receptive to (c) defenses
Literary works used to train LLMs contain lots of unprotectable data
Music & visual art (c) owners are the most adamant objectors to gen AI

15 of 34

MARKET HARM?

Getty says it has established a market for licensing use of its 12M images as training data

Has Stability’s use of these images as training data harmed this market?
Getty’s claim of unfairness is stronger than Andersen’s because it can issue a license to Stability; Andersen & other visual artists in the class can’t

Reddit has announced that anyone who now uses Reddit posts as training data needs a license (but Reddit doesn’t own (c) in the posts)

If gen AI developer used Reddit posts before Reddit announced this new market, does that make its use of the Reddit posts fair?
Or does the new licensing market create an entry barrier so that early developers don’t have to pay but later ones do?

16 of 34

COLLECTIVE LICENSE?

Some of gen AI’s critics would be satisfied if Congress created a collective license regime that would tax gen AI developers to create a fund that could be used to compensate right holders

Two prominent (c) scholars in EU have proposed such a regime
But some critics of gen AI oppose this idea because they simply don’t want their works to be used as training data—want to opt out

Immense practical problems with such a solution

Internet is global & everyone’s posts are used as training data, not just those of professional writers & artists
Costly to set up a collecting society to handle & dole out $
How to decide which authors/right holders get what amount of $?

ASCAP pays according to performances of music (which exploit works’ expressions)

17 of 34

COUNTERVAILING CONSIDERATIONS

Developers of generative AI systems are generally not interested in the “expression” in (c)’d works, but in the data they embody

For them, documents are bags of words & raw material for computational uses
(c) law only cares about a work’s “original expression” (i.e., the way in which authors express their ideas or design images or music)
Data, facts, & ideas embodied in (c)’d works are not within (c)’s scope

Generative AI enables creative reuses of data embodied in works

Constitutional purpose of (c) is to “promote the progress of science [i.e., knowledge] & useful arts”
Generative AI systems advance that purpose
Fair use is often said to provide “breathing space” for ongoing creativity
“Transformative” uses of existing works are often found to be fair

18 of 34

WHAT IS “EXPRESSION”?

Original melodies of music, sequences of words in poems, lines of source code, artistic compositions are “expressions” in (c) law
Detailed sequences of events in a drama, well-developed characters in novels, close paraphrasing of text, & structural similarities in architectural works can be “expression” too (“nonliteral infringement”)

Infringement is likely when a defendant’s work embodies substantially similar expression that defendant improperly appropriated from plaintiff’s work
Often ultimate issue is based on a “lay observer/listener/viewer” perspective

Would lay observer think that D took what was pleasing/attractive from P’s work?
Proxy for how substantial is the risk that defendant’s work will supplant demand for the plaintiff’s work or a licensed derivative

19 of 34

“THIN” & “THICK” COPYRIGHTS

Some works, especially artistic & fanciful works (e.g., Monet’s water lily paintings), are highly expressive

They typically enjoy a broader scope of (c) & thinner scope of fair use because of the more substantial quantum of expressive elements they embody
Substantial similarity may be measured by whether 2 works have the same overall “look & feel”

Other works, such as fact compilations & computer programs, have a lesser quantum of expressive elements & contain more unprotectable elements & so have a “thinner” scope of copyright protection

Courts will try to “filter out” unprotectable elements before deciding about infringement

22 of 34

UNPROTECTABLE ELEMENTS

Ideas, concepts, principles
Facts, data, research, knowledge, mathematical formulae
Procedures, processes, systems, & methods of operation
Elements of works for which there are, as a practical matter, only a few ways to express the idea, fact, or function (“merger”)
Elements of works that are common, standard, or constrained by external factors (“scènes à faire” elements)
The subject(s) of the work
Genre or style
Inferences about what elements are typically proximate (or not) & how particular kinds of works tend to be constructed

23 of 34

HO v TAFLOVE

Ho & Taflove were engineering professors at Northwestern
Chang first worked for Ho as a PhD student, during which Ho developed a new mathematical model of how electrons behave under certain circumstances
Chang later switched to working with Taflove; they published a paper & book chapter about this model, using equations, figures, & text from Ho’s writings

Ho’s paper about the model was rejected because Taflove-Chang’s paper preempted it

But Ho lost his (c) infringement lawsuit vs Taflove & Chang

Model was an idea, an effort to depict a scientific principle, not fiction
The equations, figures and text were the only practical ways to express the model’s idea
Under Baker v. Selden & the merger doctrine, those elements were not copyrightable expressions

24 of 34

BAKER v SELDEN

“The copyright of a work on mathematical science cannot give to the author an exclusive right to the methods of operation which he propounds, or to the diagrams which he employs to explain them, so as to prevent an engineer from using them whenever occasion requires.
The very object of publishing a book on science or the useful arts is to communicate to the world the useful knowledge which it contains. But this object would be frustrated if the knowledge could not be used without incurring the guilt of piracy of the book.
And where the art it teaches cannot be used without employing the methods and diagrams used to illustrate the book, … such methods and diagrams are to be considered as necessary incidents to the art, and given therewith to the public; not given for the purpose of publication in other works explanatory of the art, but for the purpose of practical application.”

25 of 34

GEN AI OUTPUTS CAN INFRINGE DW RIGHT

Copyright owners have exclusive rights to make derivative works (DWs)
“Derivative work” is statutorily defined as:

A work based upon one or more pre-existing works
Such as translation, musical arrangement, condensation, & 6 other examples,
Or any other form in which a work may be recast, transformed or adapted

Gen AI systems may infringe if they produce images or texts that are substantially similar to expressive elements of works on which they were trained

Silverman complaint vs OpenAI: ChatGPT output = detailed summary of my book
Training data may contain many copies of similar works (e.g., images of cartoon characters), so AI model may “memorize” them

26 of 34

Figure 15: Successful attempt to infringe on Snoopy using Midjourney and Stable Diffusion

Although none of the generative AI images is an exact copy of the copyrighted images shown above, or any others I could find, the strength of Snoopy as a copyrightable character is probably enough to make the generated images infringing.

Figure 16 is the same, except that it depicts four vintage Mickey Mouse images obtained from a Google image search, and images created in Midjourney (bottom left) and Stable Diffusion (bottom right) with the prompt “classic style mickey mouse winking.”

27 of 34

INDIRECT INFRINGEMENT

If A copies B’s work, A is a direct infringer

Is person who asks Midjourney to produce images of Snoopy’s doghouse with Xmas lights an infringer?
Is Midjourney an indirect infringer if that user infringes?

Indirect infringement if C has the right & ability to control A’s conduct & enjoys a financial benefit from A’s infringement of B’s work

Several gen AI lawsuits make this claim vs gen AI developers

SCT’s Sony Betamax & Grokster cases hold that a tech developer is not an indirect infringer if it makes & sells technology that has (or is capable of) substantial non-infringing uses

Even if tech developer knows some or even many users will use it to infringe
Do gen AI systems qualify for this safe harbor?

28 of 34

LINK BETWEEN INPUT & OUTPUT (C) ISSUES?

If generative AI outputs infringe DW right, that may affect how courts view the input issue
In Sega v. Accolade, it was fair use to reverse engineer SW to get access to interface information needed to make Accolade’s videogames interoperable with Sega’s platform

That interface was unprotectable by (c) law, not part of the Sega SW’s “original expression”
Court implied that if Accolade had reverse eng’d in order to appropriate expression from Sega’s program, its reverse eng’g might have been unfair

Possible court could find gen AI inputs infringing if outputs are
Outputs as “fruit of poisonous tree” if inputs found to infringe

29 of 34

GENERATIVE AI OUTPUTS

In general, texts & images generated in response to user prompts will not be “substantially similar” to expressive elements in work(s) used as training data

Insofar as this is true, outputs are unlikely to infringe (c)’s DW right

Being “based upon” another work is not sufficient to infringe DW rights

Andersen class action claim vs Stability admits that substantial similarity is unlikely

Trial judge plans to dismiss the DW claim vs Stability, but gave leave to amend

Complaint vs. GitHub reports that 1 study found that Copilot suggests code matching training data in 1% of outputs generated with the aid of Codex

Is that enough to make Copilot a direct or indirect infringer?

Getty complaint vs Stability gives an example of its DW claim

2 images are similar, even maybe substantially similar, but as to expressive elements?

30 of 34

GETTY PHOTO | STABLE DIFFUSION IMAGE (images not reproduced)

31 of 34

17 U.S.C. § 1202

Illegal to intentionally remove or alter CMI knowing that the removal or alteration may facilitate infringement

Does this happen in the course of training models or generating outputs?

Law was enacted in 1998 out of concern that would-be infringers could remove or alter CMI embedded in digital copies

Infringers could substitute false CMI & then offer copies to the public as though they were the owners of rights in that content
Or hackers might just “liberate” copies of the work as though in the public domain

Congress recognized it would be difficult to measure actual damages, so it created a statutory damage remedy

Range is $2,500 to $25,000 per violation
Complaint vs GitHub estimates statutory damages of $9 billion
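As a rough sanity check on these figures, the sketch below works out how many § 1202 violations a $9 billion total would imply at each end of the statutory range. It is illustrative arithmetic only: the function name is hypothetical, and it does not reproduce how the GitHub complaint actually arrived at its estimate.

```python
# Statutory damages range per § 1202 violation, as stated on the slide (USD).
MIN_PER_VIOLATION = 2_500
MAX_PER_VIOLATION = 25_000


def implied_violations(total_damages: int, per_violation: int) -> float:
    """Number of violations needed to reach total_damages at a given per-violation award."""
    if per_violation <= 0:
        raise ValueError("per_violation must be positive")
    return total_damages / per_violation


estimate = 9_000_000_000  # the $9 billion figure cited from the complaint

at_minimum = implied_violations(estimate, MIN_PER_VIOLATION)
at_maximum = implied_violations(estimate, MAX_PER_VIOLATION)

print(f"At ${MIN_PER_VIOLATION:,} each: {at_minimum:,.0f} violations")
print(f"At ${MAX_PER_VIOLATION:,} each: {at_maximum:,.0f} violations")
```

Because the per-violation floor is fixed by statute, the total scales linearly with the number of alleged violations, which is why estimates can reach the billions even at the minimum award.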

32 of 34

GETTY PHOTO | STABLE DIFFUSION IMAGE (images not reproduced)

33 of 34

CONCLUDING THOUGHTS

Gen AI not the first disruptive technology to attract (c) lawsuits

Challenges failed in cases involving VCRs, MP3 players, & RS-DVRs

Because these technologies had substantial non-infringing uses

Other challenges succeeded

Grokster & Napster, makers of p2p file sharing SW, held liable because they induced users to infringe or contributed to user infringements knowing they were doing so
SCT held that Aereo’s technology that allowed people to watch broadcast TV on mobile devices infringed public performance rights in ABC’s programs

Current (c) litigations vs gen AI developers are in very early stages

I don’t find the DW or 1202 claims persuasive
Training data ingestion issue is of greatest concern

Generative AI researchers should offer comments when Copyright Office asks for them because there’s a lot at stake

34 of 34

SOME QUESTIONS

What do you think of the pirated book claims?

What if Meta trained on Books3 corpus of 190K pirated books?
Does that taint the fair use claim?

Should genAI companies have to disclose what data their models were trained on?
Should authors/artists be able to opt out of having their works used as training data?
Authors Guild complaint argues that only reason GPT3 & 4 produce such high quality outputs is because they free-ride on our highly creative works; also make authors “unwitting accomplices in their own destruction”; authors are losing work because of genAI
If training data copies infringe, what remedy should courts order?
