EVALUATING GENERATIVE AI LICENSING TERMS

IP Section of the California Lawyers Association

July 14, 2023

TERMINOLOGY

  • Prompt: usually user-generated content designed to elicit a response from the ML/AI model

TERMINOLOGY (CONT.)

  • Input: anything you provide as part of the prompt, but some models are also context-sensitive and will use surrounding language and even info in other files.
  • Output: what is generated by the AI – text, code, images, music, video, etc.
  • Model: numbers and equations that transform input into output. Works in conjunction with other software.
  • API-based models: you access the model via a SaaS; model is not physically provided to you.
  • Training data or datasets: data used to initially train a model. Different from a prompt and different from dynamically provided data.
  • Fine-tuning: using an existing model to train a new one with some additional training data to make it better at specific tasks or in particular domains.
  • Prompt-tuning: adding more task or context-specific info to prompts. Can be done by a human (“hard prompt”) or an AI (“soft prompt”).
  • Foundational models: models trained on a broad set of data that can be used for a wide variety of tasks with minimal fine-tuning. Ex: GPT-3, DALL-E, Stable Diffusion.

UNDERSTANDING AI/ML MODELS

"Machine Learning" (https://xkcd.com/1838/) by Randall Munroe at xkcd is licensed under the Creative Commons Attribution-NonCommercial 2.5 License (https://creativecommons.org/licenses/by-nc/2.5/).

THE COPYRIGHT CONTROVERSY AROUND AI/ML MODELS

Some countries have passed laws exempting model training from the ambit of copyright law. The US has not, meaning we are waiting on a court to determine whether model training might constitute fair use or sit outside the scope of copyright law. In the US, the following are open questions:

  • Does training the model constitute copyright infringement?
  • Is the model itself copyrightable?
  • Is output from the model copyrightable?
    • The US Copyright Office has said no, unless the output has been modified by a human and the human additions can be identified. However, the USCO discounted the thousands of prompt iterations the user ran to elicit the specific image desired. The decision may be challenged in court.
  • Is the output a derivative work of the training data?

UNIQUE RISKS IN USING GENERATIVE AI

  • Exposing confidential or proprietary company information to a 3rd party and subsequent 3rd parties
  • Exposing personally identifiable data to a 3rd party and subsequent 3rd parties
    • Heavily regulated industries like healthcare and banking remain wary of AI
  • Output that violates a 3rd party IP right
  • Output that is:
    • Incorrect (including output based on out-of-date info) – “hallucinations”
    • Harmful: racist, sexist, violent, graphic, offensive, obscene, suggestive of self-harm, etc.
    • Not suitable for minors
    • Unhelpful, unresponsive
    • A security vulnerability (specific to code-related models)

UNIQUE RISKS IN USING GENERATIVE AI (CONT.)

  • Output that cannot be copyrighted
  • DMCA Section 1202 violations related to distributing copyrighted works where the copyright management info has been altered or deleted
  • Receiving and potentially processing PII in violation of various data privacy laws
  • Vendor technology may have future limitations due to legal claims against vendors
  • Technology is so new that pricing models are likely to change quickly and often

UNIQUE RISKS IN USING GENERATIVE AI FOR CODE

  • Risk is higher for products that are physically distributed, including when customers are allowed to self-host
  • No way to distinguish code written by the AI from code written by a human
    • Implications related to copyrightability of code
    • Implications related to potential lawsuits from distributing such code
  • Suggested code is weighted toward the most commonly seen code, not the most recent, so suggestions are more likely to include unpatched vulnerabilities than a package provided by a package manager like NPM
    • Also why most suggestions, if relatively short, are likely uncopyrightable
  • Third party code embedded in company source files doesn’t benefit from the same security ecosystem as third party code that remains identifiable as such:
    • Security monitoring tools won’t necessarily identify the same vulnerability
    • Headlines related to major vulnerabilities will not trigger internal concern
    • Build systems can’t and won’t pull newer versions of the same code – more code must be maintained in-house

NEGOTIATING AI-RELATED CONTRACTS

IP OWNERSHIP

  • Make sure you retain ownership over your input, including derivatives of your input, and any input automatically detected by the AI software.
  • Make sure you get a license to use the output as desired – this will often include a right to sublicense, which you might not be used to asking for. Vendor cannot grant you unqualified ownership because the model might give the same output to more than one customer (and output might turn out to be a derivative work of the training data in some cases). Note again, the output may not be copyrightable at all.
  • When it’s important to own copyright in the output, make significant human modifications to it and document them.

IP-RELATED WARRANTIES AND INDEMNITIES

  • AI software vendors hesitate to provide any IP warranties or indemnities related to the output because they know:
    • It can be confidently inaccurate
    • They can’t control the input and therefore have no control over the output
    • Even if they could control the input, they still wouldn’t have total knowledge of all possible output since they can’t test all possible input
    • The models don’t have full knowledge of every possible domain
    • Built-in safeguards can be manipulated and skirted
    • Legally, we don’t know whether output might constitute a derivative work of the training data. This is likely to end up being a case-by-case analysis

WARRANTY AND INDEMNITY PITFALLS

  • Various vendors have issued marketing materials and public announcements proclaiming that they offer various indemnities or other protections, and many business people take this at face value. But when you dig into the actual contracts, you’ll find that the terms are either absent, lackluster, or provided alongside other addenda or exhibits with conflicting language and no clear order of precedence.
  • For the vendors, signing up as many customers as possible and making entire industries reliant on their technology is a way to minimize the risk of negative court findings or regulatory action

MARKETING MISDIRECTION OR DISARRAY?

NEGOTIATING WARRANTIES AND INDEMNITIES

  • Understand that warranties and indemnities related to “the Services” likely don’t extend to the output
  • Look carefully at IP indemnity exception language that is non-standard
    • Indemnities that fall away with even slight modifications to the output are of limited use (use “but for” language)
    • Indemnities that fall away when the output is combined with other material are of limited use (again, use “but for” language)
  • Don’t waste time negotiating warranties and indemnities if the limitation of liability remains very small

THE VALUE OF IP-RELATED WARRANTIES AND INDEMNITIES

  • The more serious the legal cases against the vendor become, the less likely they will have the money to indemnify all customers. They may pay a lot on class actions and have nothing left for indemnities or have enough for the first few indemnified customers only.
  • As Johnson & Johnson has demonstrated with its baby powder litigation, even large companies can try to put subsidiaries into bankruptcy, forcing customers who were owed an indemnity to defend themselves and then try to recoup a small portion of those expenses in bankruptcy court.
  • As with Linux, companies would likely form a defense fund and coordinate on litigation. The litigation would essentially be a tax on the industry, but at least in the tech industry, is unlikely to give any one company or group of companies any sort of advantage over its competitors.

THE VALUE OF IP-RELATED WARRANTIES AND INDEMNITIES (CONT.)

Pursuing a vendor’s customers is unusual and in some contexts, very expensive because the evidence necessary is extremely fact-dependent and highly technical:

  • Even if, say, GitHub loses many claims against it, to pursue a GitHub customer, a plaintiff would have to 1) have registered its copyright with the USCO, 2) identify snippets of its code within a binary product (extremely difficult), 3) prove that the snippet was significant enough to warrant copyright protection, AND 4) prove that the customer had access to the plaintiff’s code (access + substantial similarity to a copyrighted work = copyright infringement)
  • A copyright holder in a specific image would need to prove that the customer’s image was identical or a derivative work of their own and these sorts of judgments can go either way in court. Winning one such case doesn’t make it easier to win others
  • In contrast, patent trolling is much easier because once it’s proven that the vendor’s tech embodies a patent, everyone using the tech is also in violation of the patent
  • This is why the AI-related cases so far are all class actions – it’s likely the only profitable form of litigation here, and the one against GitHub doesn’t even allege copyright infringement

DATA-RELATED PROVISIONS

Companies that process customer data (as GDPR processors or similar) generally have an obligation to keep that data confidential and only use it for the purpose of providing the applicable service. So, in the AI context, the following is problematic if the AI software is going to communicate with customers directly or otherwise process their data:

  • No confidentiality obligations apply to the input or derivatives of the input (which might be in the output)
  • The vendor can use the input or its derivatives to further train its model (because it’s for the benefit of the vendor and not the customer)
  • Broad rights to use input and its derivatives to improve vendor’s services, especially where “Vendor” includes all affiliates

Watch out for:

  • Vendors who promise not to use input for training but don’t otherwise promise confidentiality
  • Vendors whose only data privacy assurances are in the DPA, where the DPA applies only to PII and not to all customer data. It’s becoming easier to extract PII from otherwise confidential data, and a breach provision limited to PII may not cover otherwise confidential info.

Similar concerns apply to companies processing employee data.

DATA-RELATED REALITY

  • The only way to completely eliminate PII is to eliminate it from the training data and retrain the model. That is extremely expensive and no one is doing this in response to individual data subject requests
  • Models can be directed to not output certain PII, but:
    • This is imperfect
    • Tools to do this at scale are not yet commonly available
    • It’s unclear if this would satisfy the relevant authorities
    • It’s still hard, if not impossible, for either an AI vendor or a customer of such a vendor to respond to a data subject request about what data they have about the data subject. There is no way to prompt a model to list all information it holds about a particular person (or anything, really). That also means inaccurate data cannot be corrected
  • Many European data protection authorities are investigating various AI tools as we speak
  • The EU will probably create more GDPR-style regulations apportioning liability within the AI value chain, some of which cannot be modified by contract

DATA: THE UPSHOT

  • If you act as a processor, insist on all the standard assurances you usually get from other subprocessors with regard to how data will be used, stored, secured, and deleted
    • Your contracts with your customers likely require this
    • Various data privacy laws also require this until they get officially updated to say otherwise

MITIGATING RISK WHEN YOU CAN’T CONTRACTUALLY ASSIGN LIABILITY

BENEFITS OF USING PUBLICLY AVAILABLE MODELS

Consider avoiding API-based services and using a publicly available model in-house, particularly if:

  • You process data of customers in sensitive industries
  • You are an industry leader (you are big enough to hire your own AI engineers)
  • A substantial portion of the value of your business comes from proprietary data you treat as a trade secret

Benefits include:

  • It’s free
  • No data leaves your company; no additional sub-processor to manage and disclose
  • Potential for faster improvements
  • Transparency around training data
    • Some developers are honoring requests to exclude certain works from the training data
  • Greater ability to fine-tune for your use case
  • Eliminate risk of vendor limiting features or changing pricing model
  • May be able to pick and choose compliance-related add-ons

Every large company adapted to the Internet eventually and now even Walmart has software engineers. The next evolution involves hiring people with AI expertise.

DIFFERENTIATING BETWEEN MODEL PROVIDERS

  • Use providers that do more to avoid outputting derivative works of the training data. Examples:
    • Copilot offers a filter that suppresses suggestions exactly matching public code on GitHub
    • Stable Diffusion will no longer allow users to prompt for artwork in a particular artist’s style
  • Use providers that are transparent about their training data and use less controversial data sets. Examples:
    • StarCoder is trained solely on permissively licensed code, whereas Copilot is trained on code under a wide variety of licenses
    • Adobe Firefly is trained on Adobe Stock images and openly licensed and public domain content, whereas DALL-E has likely been trained on a variety of content available online under many different licenses
  • If using API-based services, pick vendors most likely to be responsive to data protection authorities and most likely to update their data processing agreements promptly in accordance with new guidance from the authorities

ASSESSING USE CASES FOR GENERATIVE AI

Large language models, including those trained on code, can be thought of as sophisticated autocomplete tools:

  • They are incapable of planning – they do not necessarily understand sequences of events or that one event is a prerequisite for another unless fine-tuned or given an extended prompt.
  • They have no way to fact-check the accuracy of what they are saying. They are showing you text they have seen most often.
  • Today’s models do not necessarily weigh the credibility of sources in providing an answer.
  • The model’s output is tied to the training data at the time of training; output may not reflect subsequent events (like new wars, areas affected by floods, etc.).
  • They can only process a limited amount of input at once, so they can forget something previously told to them, sometimes even within the same chat session.
  • With enough experimentation, most, if not all, of them can be prompted to produce output that their developers have built safeguards to prevent.

GOOD AND BAD USE CASES

  • Some of the best use cases for generative AI include brainstorming ideas, using it to categorize input, and getting rough translations.
  • The worst use cases involve relying on it for accurate, up-to-date information or for uses contingent on the model meeting a nebulous standard (like “don’t provide inappropriate content to minors”).
  • It is likely illegal to use generative AI for automated decision-making with respect to individuals in many contexts and geographies (credit scoring, loan approval, university admissions, etc.).

CREATE INTERNAL POLICIES GUIDING USE OF AI TOOLS

  • Educate your staff on the limitations of AI models and the limited legal protections you are receiving - you need to counter a lot of marketing hype
  • Recommend turning on various filters and other protections offered by the vendors
  • Consider approving AI use on a per use-case basis
  • Recommend common-sense risk-limiting measures:
    • Don’t use images with company logos, copyrighted characters or mascots, a celebrity’s likeness, obvious similarities to copyrighted works
    • Don’t use long pieces of code unless you’re sure it represents the standard way of doing something in that language (i.e. it’s functional and not copyrightable)
    • Remove PII from output except where it is copyright management info
    • Fact-check all output
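The PII-removal step above can be partially automated with simple pattern matching. A minimal, illustrative sketch follows; the two patterns shown are assumptions for demonstration only and are nowhere near a complete PII-detection solution (which would also need to preserve copyright management info per the exception above):

```python
import re

# Illustrative patterns only -- real PII detection requires far broader
# coverage (names, addresses, IDs) than two regular expressions.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "US_PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace each matched pattern with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label}]", text)
    return text

print(redact_pii("Contact Jane at jane@example.com or 555-123-4567."))
```

A tool like this is a backstop, not a substitute for the vendor-provided filters and human fact-checking recommended above.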

THANK YOU!

Please reach out to me at kate@katedowninglaw.com if you think of any more questions!

This presentation is Copyright © Kate Downing 2023 and is subject to the Creative Commons Attribution 4.0 International License.