1 of 34

GENERATIVE AI MEETS COPYRIGHT LAW

Pamela Samuelson, Berkeley Law School

CS195: Social Implications of Computing Technology

October 18, 2023

2 of 34

3 PRINCIPAL © QUESTIONS

  • When (if at all) does making copies of ©’d works as training data for generative AI systems infringe ©s in those works?
  • When (if at all) are AI-generated outputs (e.g., texts or images) infringing derivative works of ingested content?
  • When (if at all) does removal or alteration of copyright management information (CMI) associated with copies of works violate a copyright-related rule (§ 1202)?

    • All three questions are posed in litigations vs generative AI companies
    • All three are likely topics for a Copyright Office report later this year

Gen AI copyright

2

10/18/23

3 of 34

WHO IS GETTING SUED BY WHOM & FOR (c)?

  • Stability AI for (c) infringement & 1202 violations
    • Getty: Stable Diffusion’s model was trained on 12 million Getty images
    • Sarah Andersen (on behalf of a class of visual artists whose works were used as training data) claims training data = illegal copying + all outputs = infringing DWs
      • Midjourney & DeviantArt also defendants in Andersen
  • OpenAI for (c) infringement & 1202 violations
    • Silverman & Tremblay class action lawsuits, on behalf of all authors whose works were used as training data, focus on ChatGPT (claims are same as Andersen’s)
    • Authors Guild class action lawsuit (no 1202 claims)
  • Meta for (c) infringement & 1202 violations
    • Kadrey class action, on behalf of all authors whose works were used as training data, focuses on LLaMA (claims are same as Andersen’s)
  • Alphabet for (c) infringement & privacy violations
    • J.L. class action lawsuit, on behalf of all persons whose data was used as training data, focuses on Bard

4 of 34

OTHER AI LAWSUITS

  • OpenAI, GitHub, & MS sued for breach of open source licenses & 1202
    • Focus on Copilot which uses OpenAI’s Codex LLM to generate code in response to user prompts; Codex was trained on billions of lines of open source code
    • 4 John Does are class action plaintiffs
  • OpenAI for privacy violations & related torts
    • P.M. is named plaintiff on behalf of a class of users whose data ChatGPT uses
  • NeoCortext for right of publicity violations
    • Young is named plaintiff for class action
    • Reface SW allows users to swap their or others’ face images for celebrities’ faces
  • Prisma Labs for violation of Illinois biometric data privacy law
    • Flora is class action plaintiff, suing Prisma over Lensa SW that collects biometric data of its users’ faces when they create avatars

5 of 34

CLASS ACTION LAWSUITS

  • Many people may be harmed in the same or a very similar way by the same conduct of company A
  • Harm may be small enough as to each individual that suing the company for that would be cost-prohibitive, but if small harms are aggregated, total harms may be substantial enough to justify a lawsuit
  • Some individuals may contact a lawyer & claim to represent a class of similarly situated people who were injured in the same way
  • Court must “certify” that these class claims are sound
  • Class action cases often settle, but can result in large judgments if they go to a jury trial (which is why they often settle—risk reduced)

6 of 34

OTHER GENERATIVE AI LEGAL PROCEEDINGS

  • Copyright Office held 4 “listening sessions” in spring 2023 to consider stakeholder perspectives on gen AI training data & output DW issues
    • 1 on literary works, 1 on visual art, 1 on audiovisual works, 1 on musical works
      • Copyright owners generally VERY upset about generative AI as infringements
    • Notice of Inquiry posed questions on which the Office wants comments
      • Comments due by Oct 30, 2023; will be posted on the Internet
    • Office will publish a report & maybe make recommendations to Congress
  • Congressional committees have held hearings on generative AI issues
    • Some witnesses have called for legislative changes
      • Collective licensing for use of works as training data
      • Federal right of publicity law (to address celebrity impersonations)
      • Amending § 1202 to impose strict liability if CMI is removed
      • What about requiring AI outputs to identify as such?

7 of 34

MOTIVATIONS FOR LAWSUITS

  • Class action lawyers see potential for big $$$ awards
    • The lawyers may be eligible for as much as 1/3 of the take
    • Cases may establish new legal precedents, burnish reputation, & attract new clients
  • Many authors, visual artists, songwriters, scriptwriters, & programmers fiercely object to use of their works as training data
    • You didn’t ask permission & you’re not compensating us for uses of our works
    • Only reason gen AI systems produce high quality outputs is because you built models on our works
    • Gen AI outputs are likely to compete in the marketplace with our works, & deprive us of income
    • Gen AI developers are huge corporations profiting from their uses of our works, so it’s unjust that they are making big bucks on our backs

8 of 34

9 of 34

BROADER MORAL PANIC ABOUT GEN AI

  • Future of Life Open Letter asking for a 6 month pause in AI development:
    • “AI systems with human-competitive intelligence can pose profound risks to society and humanity….As stated in the widely-endorsed Asilomar AI Principles, Advanced AI could represent a profound change in the history of life on Earth, and should be planned for and managed with commensurate care and resources. Unfortunately, this level of planning and management is not happening, even though recent months have seen AI labs locked in an out-of-control race to develop and deploy ever more powerful digital minds that no one – not even their creators – can understand, predict, or reliably control.”
  • Ongoing national & international conversations focus on how to regulate AI
  • One commentary characterizes generative AI as a “Marxist nightmare” because “the work of millions [of writers is] accruing to a few capitalist owners who pay nothing for all of that labor”
  • Screenwriters & actors have been striking in part out of concern they will be replaced by gen AI outputs; will gen AI take away our jobs & livelihoods?

10 of 34

© IN A NUTSHELL

  • Copyright protection attaches automatically by operation of law to
    • Original works of authorship (modest creativity requirement + eligible works)
    • That have been fixed in a tangible medium of expression
  • Rights vest in authors who can sell or license those rights in whole or part
  • © grants rights to control reproductions, distributions of copies, making of derivative works, & public performances & displays
  • Rights last for life of author + 70 years (for firms, 95 years from publication)
  • © protects “original expression” in works, not ideas, facts, or methods
  • ©’s exclusive rights are limited by fair use & various other doctrines
  • Very generous remedies if infringement is found: actual damages & infringers’ profits or statutory damages, injunctive relief, impoundment & destruction of infringing material, & attorney fees
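
The duration rule on this slide can be sketched as simple arithmetic. This is a minimal illustration that assumes only the two terms stated above (life of the author + 70 years; 95 years from publication for firms) and that a term runs through the end of its final calendar year. The function names are hypothetical, and the sketch ignores older works governed by earlier statutes and other statutory details.

```python
# Terms taken from the slide; everything else here is an illustrative assumption.
AUTHOR_TERM_YEARS = 70  # life of the author + 70 years
FIRM_TERM_YEARS = 95    # 95 years from publication (firms)


def last_protected_year_individual(author_death_year: int) -> int:
    """Last calendar year of protection for a work by an individual author."""
    return author_death_year + AUTHOR_TERM_YEARS


def last_protected_year_firm(publication_year: int) -> int:
    """Last calendar year of protection for a firm-owned work, per the slide's rule."""
    return publication_year + FIRM_TERM_YEARS


if __name__ == "__main__":
    print(last_protected_year_individual(1950))  # 2020
    print(last_protected_year_firm(1928))        # 2023
```

Under these assumptions, a work would enter the public domain on January 1 of the year after the one returned.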

11 of 34

FAIR USE

  • Fair uses of in-(c) contents are not infringements
    • Example: SCT ruled that Google made fair use of parts of the Java API when reimplementing them in the Android platform
  • Fair use is a defense to charges of infringement
  • Courts consider four factors in deciding if a use is fair:
    • Purpose & character of the challenged use
      • Favored uses: criticism, comment, news, teaching, research & scholarship
      • “Transformative” uses (e.g., parodies or when D’s work has different purposes than P’s)
    • Nature of the (c)’d work
    • Amount & substantiality of the taking
    • Effect of the challenged use on the market for or value of the work

12 of 34

INGESTING WORKS AS TRAINING DATA

  • Webcrawling texts, images, & other content posted on open sites on the Internet involves making copies of that content, as does curating that content for a training dataset, which may trigger (c)’s reproduction right
  • In Field v. Google, a court held that Google’s copying of Internet content for the purpose of indexing & caching contents was fair use
  • In Authors Guild v. Google, an appellate court held that G’s digitizing of millions of in-(c) books from research libraries for the purpose of indexing contents & serving up snippets in response to search queries, as well as other computational uses, was transformative fair use
  • In A.V. v. iParadigms, storage of student papers in plagiarism detection SW was held fair use; not exploiting the works’ expression

13 of 34

WHY SOME OBJECT TO FAIR USE CLAIMS

  • In Field & Authors Guild, Google was making it easier for users to find (c) owners’ works, not producing outputs to compete w/ those works
    • Stable Diffusion, by contrast, makes it easy for users to produce images that compete with the ingested images
  • (c) owners did not consent to ingestion, can’t easily opt out, & think it’s only fair that they get paid given the high value of their well-curated content, without which generative AI systems could only produce garbage
    • Authors Guild survey found that 90% of authors think AI developers should pay for this
    • Licensing markets are emerging for ingestion of works as training data

14 of 34

CONSIDERATIONS AFFECTING FAIR USE

  • Silverman complaint alleges that OpenAI’s models were trained on a corpus of “pirated” books
    • These books too may be available on the open Internet, but were not posted by their authors
    • Training on SciHub & Anna’s Archive may be troublesome as well
  • It is possible courts will treat LLMs differently than diffusion models
    • LLMs may be less vulnerable than diffusion because more abstracted
  • It is possible courts will differentiate among data types
    • Because SW is functional & Codex is trained on open source, courts may be more receptive to (c) defenses
    • Literary works used to train LLMs contain lots of unprotectable data
    • Music & visual art (c) owners are the most adamant objectors to gen AI

15 of 34

MARKET HARM?

  • Getty says it has established a market for licensing use of its 12M images as training data
    • Has Stability’s use of these images as training data harmed this market?
    • Getty’s claim of unfairness is stronger than Andersen’s because it can issue a license to Stability; Andersen & other visual artists in the class can’t
  • Reddit has announced that anyone who now uses Reddit posts as training data needs a license (but Reddit doesn’t own (c) in the posts)
    • If gen AI developer used Reddit posts before Reddit announced this new market, does that make its use of the Reddit posts fair?
    • Or does the new licensing market create an entry barrier so that early developers don’t have to pay but later ones do?

16 of 34

COLLECTIVE LICENSE?

  • Some of gen AI’s critics would be satisfied if Congress created a collective license regime that would tax gen AI developers to create a fund that could be used to compensate right holders
    • Two prominent (c) scholars in EU have proposed such a regime
    • But some critics of gen AI oppose this idea because they simply don’t want their works to be used as training data—want to opt out
  • Immense practical problems with such a solution
    • Internet is global & everyone’s posts are used as training data, not just those of professional writers & artists
    • Costly to set up a collecting society to handle & dole out $
    • How to decide which authors/right holders get what amount of $?
      • ASCAP pays according to performances of music (which exploit works’ expressions)

17 of 34

COUNTERVAILING CONSIDERATIONS

  • Developers of generative AI systems are generally not interested in the “expression” in (c)’d works, but in the data they embody
    • For them, documents are bags of words & raw material for computational uses
    • (c) law only cares about a work’s “original expression” (i.e., the way in which authors express their ideas or design images or music)
    • Data, facts, & ideas embodied in (c)’d works are not within (c)’s scope
  • Generative AI enables creative reuses of data embodied in works
    • Constitutional purpose of (c) is to “promote the progress of science [i.e., knowledge] & useful arts”
    • Generative AI systems advance that purpose
    • Fair use is often said to provide “breathing space” for ongoing creativity
    • “Transformative” uses of existing works are often found to be fair

18 of 34

WHAT IS “EXPRESSION”?

  • Original melodies of music, sequences of words in poems, lines of source code, artistic compositions are “expressions” in (c) law
  • Detailed sequences of events in a drama, well-developed characters in novels, close paraphrasing of text, & structural similarities in architectural works can be “expression” too (“nonliteral infringement”)
    • Infringement is likely when a defendant’s work embodies substantially similar expression that defendant improperly appropriated from plaintiff’s work
    • Often ultimate issue is based on a “lay observer/listener/viewer” perspective
      • Would lay observer think that D took what was pleasing/attractive from P’s work?
      • Proxy for how substantial is the risk that defendant’s work will supplant demand for the plaintiff’s work or a licensed derivative

19 of 34

“THIN” & “THICK” COPYRIGHTS

  • Some works, especially artistic & fanciful works (e.g., Monet’s water lily paintings), are highly expressive
    • They typically enjoy a broader scope of (c) & thinner scope of fair use because of the more substantial quantum of expressive elements they embody
    • Substantial similarity may be measured by whether 2 works have the same overall “look & feel”
  • Other works, such as fact compilations & computer programs, have a lesser quantum of expressive elements & contain more unprotectable elements & so have a “thinner” scope of copyright protection
    • Courts will try to “filter out” unprotectable elements before deciding about infringement

20 of 34

21 of 34

22 of 34

UNPROTECTABLE ELEMENTS

  • Ideas, concepts, principles
  • Facts, data, research, knowledge, mathematical formulae
  • Procedures, processes, systems, & methods of operation
  • Elements of works for which there are, as a practical matter, only a few ways to express the idea, fact, or function (“merger”)
  • Elements of works that are common, standard, or constrained by external factors (“scènes à faire” elements)
  • The subject(s) of the work
  • Genre or style
  • Inferences about what elements are typically proximate (or not) & how particular kinds of works tend to be constructed

23 of 34

HO v TAFLOVE

  • Ho & Taflove were engineering professors at Northwestern
  • Chang 1st worked for Ho as a PhD student, during which time Ho developed a new mathematical model of how electrons behave under certain circumstances
  • Chang later switched to working with Taflove; they published a paper & book chapter about this model, using equations, figures, & text from Ho’s writings
    • Ho’s paper about the model was rejected because Taflove-Chang’s paper preempted it
  • But Ho lost his (c) infringement lawsuit vs Taflove & Chang
    • Model was an idea, an effort to depict a scientific principle, not fiction
    • The equations, figures and text were the only practical ways to express the model’s idea
    • Under Baker v. Selden & the merger doctrine, those elements were not copyrightable expressions

24 of 34

BAKER v SELDEN

  • “The copyright of a work on mathematical science cannot give to the author an exclusive right to the methods of operation which he propounds, or to the diagrams which he employs to explain them, so as to prevent an engineer from using them whenever occasion requires.
  • The very object of publishing a book on science or the useful arts is to communicate to the world the useful knowledge which it contains. But this object would be frustrated if the knowledge could not be used without incurring the guilt of piracy of the book.
  • And where the art it teaches cannot be used without employing the methods and diagrams used to illustrate the book, … such methods and diagrams are to be considered as necessary incidents to the art, and given therewith to the public; not given for the purpose of publication in other works explanatory of the art, but for the purpose of practical application.”

25 of 34

GEN AI OUTPUTS CAN INFRINGE DW RIGHT

  • Copyright owners have exclusive rights to make derivative works (DWs)
  • “Derivative work” is statutorily defined as:
    • A work based upon one or more pre-existing works
    • Such as translation, musical arrangement, condensation, & 6 other examples,
    • Or any other form in which a work may be recast, transformed or adapted
  • Gen AI systems may infringe if they produce images or texts that are substantially similar to expressive elements of works on which they were trained
    • Silverman complaint vs OpenAI: ChatGPT output = detailed summary of my book
    • Training data may contain many copies of similar works (e.g., images of cartoon characters), so AI model may “memorize” them

26 of 34

Figure 15: Successful attempt to infringe on Snoopy using Midjourney, and Stable Diffusion

Although none of the generative AI images is an exact copy of the copyrighted images shown above, or any others I could find, the strength of Snoopy as a copyrightable character is probably enough to make the generated images infringing.

Figure 16 is the same, except that it depicts four vintage Mickey Mouse images obtained from a Google image search and images created in Midjourney (bottom left) and Stable Diffusion (bottom right) with the prompt “classic style mickey mouse winking.”


27 of 34

INDIRECT INFRINGEMENT

  • If A copies B’s work, A is a direct infringer
    • Is person who asks Midjourney to produce images of Snoopy’s doghouse with Xmas lights an infringer?
    • Is Midjourney an indirect infringer if that user infringes?
  • Indirect infringement if C has the right & ability to control A’s conduct & enjoys a financial benefit from A’s infringement of B’s work
    • Several gen AI lawsuits make this claim vs gen AI developers
  • SCT’s Sony Betamax & Grokster cases hold that a tech developer is not an indirect infringer if it makes & sells technology that has (or is capable of) substantial non-infringing uses
    • Even if tech developer knows some or even many users will use it to infringe
    • Do gen AI systems qualify for this safe harbor?

28 of 34

LINK BETWEEN INPUT & OUTPUT (C) ISSUES?

  • If generative AI outputs infringe DW right, that may affect how courts view the input issue
  • In Sega v. Accolade, it was fair use to reverse engineer SW to get access to interface information needed to make Accolade’s videogames interoperable with Sega’s platform
    • That interface was unprotectable by (c) law, not part of the Sega SW’s “original expression”
    • Court implied that if Accolade had reverse eng’d in order to appropriate expression from Sega’s program, its reverse eng’g might have been unfair
  • It is possible a court could find gen AI inputs infringing if outputs are infringing
  • Outputs as “fruit of poisonous tree” if inputs found to infringe

29 of 34

GENERATIVE AI OUTPUTS

  • In general, texts & images generated in response to user prompts will not be “substantially similar” to expressive elements in work(s) used as training data
    • Insofar as this is true, outputs are unlikely to infringe (c)’s DW right
      • Being “based upon” another work is not sufficient to infringe DW rights
    • Andersen class action claim vs Stability admits that substantial similarity is unlikely
      • Trial judge plans to dismiss the DW claim vs Stability, but gave leave to amend
    • Complaint vs. GitHub reports that 1 study found that Copilot suggests code matching training data in 1% of outputs generated with the aid of Codex
      • Is that enough to make Copilot a direct or indirect infringer?
    • Getty complaint vs Stability gives an example of its DW claim
      • 2 images are similar, even maybe substantially similar, but as to expressive elements?

30 of 34

GETTY PHOTO STABLE DIFFUSION IMAGE

31 of 34

17 U.S.C. § 1202

  • Illegal to intentionally remove or alter CMI knowing that the removal or alteration may facilitate infringement
    • Does this happen in the course of training models or generating outputs?
  • Law was enacted in 1998 out of concern that would-be infringers could remove or alter CMI embedded in digital copies
    • Infringers could substitute false CMI & then offer copies to the public as though they were the owners of rights in that content
    • Or hackers might just “liberate” copies of the work as though in the public domain
  • Congress recognized it would be difficult to measure actual damages, so it created a statutory damage remedy
    • Range is $2,500 to $25,000 per violation
    • Complaint vs GitHub estimates statutory damages of $9 billion
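
To put the per-violation range next to the $9 billion estimate, the arithmetic below shows how many violations that total would imply at each end of the range. It is a rough sketch using only the figures on this slide. The function names are hypothetical, and it does not attempt to reproduce how the complaint itself counted violations.

```python
# Figures from the slide: statutory range per violation and the complaint's estimate.
MIN_PER_VIOLATION = 2_500
MAX_PER_VIOLATION = 25_000
CLAIMED_TOTAL = 9_000_000_000


def damages_range(violations: int) -> tuple[int, int]:
    """Minimum and maximum statutory damages for a given number of violations."""
    return violations * MIN_PER_VIOLATION, violations * MAX_PER_VIOLATION


def implied_violations(total: int, per_violation: int) -> int:
    """Number of violations a total award implies at a fixed per-violation amount."""
    return total // per_violation


if __name__ == "__main__":
    print(implied_violations(CLAIMED_TOTAL, MIN_PER_VIOLATION))  # 3600000
    print(implied_violations(CLAIMED_TOTAL, MAX_PER_VIOLATION))  # 360000
    print(damages_range(1_000))                                  # (2500000, 25000000)
```

At the statutory minimum, a $9 billion total corresponds to 3.6 million violations; at the maximum, to 360,000.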

32 of 34

GETTY PHOTO STABLE DIFFUSION IMAGE

33 of 34

CONCLUDING THOUGHTS

  • Gen AI not the first disruptive technology to attract (c) lawsuits
    • Challenges failed in cases involving VCRs, MP3 players, & RS-DVRs
      • Because these technologies had substantial non-infringing uses
    • Other challenges succeeded
      • Grokster & Napster, makers of p2p file sharing SW, held liable because they induced users to infringe or contributed to user infringements knowing they were doing so
      • SCT held that Aereo’s technology that allowed people to watch broadcast TV on mobile devices infringed public performance rights in ABC’s programs
  • Current (c) litigations vs gen AI developers are in very early stages
    • I don’t find the DW or 1202 claims persuasive
    • Training data ingestion issue is of greatest concern
  • Generative AI researchers should offer comments when Copyright Office asks for them because there’s a lot at stake

34 of 34

SOME QUESTIONS

  • What do you think of the pirated book claims?
    • What if Meta trained on Books3 corpus of 190K pirated books?
    • Does that taint the fair use claim?
  • Should genAI companies have to disclose what data their models were trained on?
  • Should authors/artists be able to opt out of having their works used as training data?
  • Authors Guild complaint argues that the only reason GPT-3 & GPT-4 produce such high quality outputs is that they free-ride on authors’ highly creative works; that they make authors “unwitting accomplices in their own destruction”; & that authors are losing work because of genAI
  • If training data copies infringe, what remedy should courts order?
