1 of 68

The Legislation Game:

Introduction to Legal Issues in Artificial Intelligence and Large Language Models

Paweł Kamocki

ESSAI Summer School 2024, 15-19.07.2024

2 of 68

3 of 68

CLARIN

  • Common Language Resources and Technology Infrastructure
  • ESFRI roadmap 2006, ERIC status 2012, ESFRI Landmark 2016
  • Easy and sustainable access for scholars in SSH
  • digital language data (written, spoken, video or multimodal)
  • tools to discover, analyse, combine data wherever they are located
  • single sign-on environment (you all can get an account)
  • Ecosystem for knowledge exchange
  • Some services integrated in EOSC


4 of 68

CLARIN for Open Science


  • Promotion of sharing & re-use of language data through sustainable data registries
  • Enhancement & deployment of interoperability of language data & services
    • common metadata framework
    • distributed network of FAIR certified data repositories for language data

  • Promotion of
    • comparative perspectives
    • multidisciplinary collaboration
    • transnational research
    • responsible data science
  • Support for linguistic diversity
    • data covering many languages
    • tools for many languages
    • language resources in all modalities
    • discipline- & language-agnostic

5 of 68

FAIR Principles


Findable

Accessible

Interoperable

Reusable

Key elements

    • Persistent Identifiers (PIDs)
    • Data Management Plan
    • Metadata
    • Licences
    • Repositories

6 of 68

CLARIN’s Macroscope Potential


Source: Rosnay, 1979

7 of 68

CLARIN’s countries and centres


  • A consortium of the ERIC type (European Research Infrastructure Consortium)
    • 24 members
    • 2 observers
    • 1 linked party
  • A distributed network of 70 centres
    • 21 CoreTrustSeal (CTS) certified data centres
    • Strong focus on FAIRness & interoperability
      • Federated login
      • Central metadata harvesting for easy discovery
      • Chained services
    • 25 Knowledge Centres

8 of 68

How does CLARIN work?


9 of 68

Virtual Language Observatory

https://vlo.clarin.eu

  • Facet search
  • Links to landing pages
  • Download options
  • Details on licences
  • Details on technical features
  • Overview of tools that match the data
  • Citing LRs:


10 of 68

Language Resource Switchboard

https://switchboard.clarin.eu/

Upload a text to find a matching tool for NLP tasks. It can be accessed directly from the VLO.


11 of 68


12 of 68

“Legislation game”

Jeff Koterba/Cagle Cartoons

https://www.duluthnewstribune.com/opinion/columns/national-view-voters-fear-regulation-of-ai-so-far-is-insufficient


13 of 68

Legal reasoning

All men are mortal. [Major premise]

Socrates is a man. [Minor premise]

Therefore, Socrates is mortal. [Conclusion]

Reproduction is a copyright-restricted act

Training AI models necessitates acts of reproduction

Therefore, training AI models is a copyright-restricted act


14 of 68

Course Outline

  1. Copyright issues in AI (training, models, outputs)
  2. Data protection issues (GDPR) in AI training
  3. AI Act (+ European Strategy for Data)


15 of 68

Copyright primer I

  • An intellectual property right that protects creative works against uses (“copying”) not authorised by their authors

  • Sources of copyright:
    • International
      • Berne Convention 1886
    • European
      • Directive on Copyright in Information Society (InfoSoc) 2001
      • Directive on Copyright in Digital Single Market (DSM) 2019
    • National
      • national Copyright Acts

Q1: Why copyright?

Q2: Today, is copyright more important than before? Why?


16 of 68

Copyright primer II

What is protected? [subject matter]

  • Literary (incl. software), artistic and scientific works
    • Works (expressions), not ideas
      • taste of cheese is not protected (CJEU, Levola Hengelo)
    • Incorporeal asset (independent of the physical carrier)
  • Condition: originality
    • UK (historically): “labour, skill and judgement”
    • US: “original = not copied”
    • EU: “author’s own intellectual creation”
      • author’s personality is expressed in free and creative choices
      • expression not dictated by rules and constraints
        • BUT “multiplicity of shapes” (CJEU, Brompton bicycle 2020)
  • Collections (compilations), e.g. datasets
  • Exclusions (in some countries) for “official works”

How long does copyright protection last? [term]

  • EU, US, JP…: 70 years after the death of the author (Life+70)
    • minimum: Life + 50 (e.g. China, Belarus, Iran, ...)
    • maximum: Life + 100 (Mexico)
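The Life+N computation can be made concrete. Under the EU Term Directive, terms are counted from 1 January of the year following the triggering event (the author's death), so a work enters the public domain on 1 January of death year + term + 1. A minimal sketch (the function name is ours, and real cases involve complications such as joint authorship or wartime extensions):

```python
def public_domain_year(death_year: int, term: int = 70) -> int:
    """Year in which a work enters the public domain under a Life+N rule.

    EU-style computation: the term runs from 1 January of the year
    following the author's death, so expiry falls on 1 January of
    death_year + term + 1. Illustrative sketch only.
    """
    return death_year + term + 1

# e.g. an author who died in 1953, Life+70: public domain from 1 Jan 2024
```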


17 of 68

Copyright: exclusive rights

  • Exclusive rights: permission required!
  • Moral rights [Berne Convention, Article 6 bis]
    • Attribution
    • Integrity
  • Exploitation (or: economic) rights
    • EU: rights harmonised in the InfoSoc Directive:
      • Reproduction (copying) [Article 2]
        • direct or indirect
        • temporary or permanent
        • by any means and in any form
        • in whole or in part (CJEU (Infopaq): 11 consecutive words)
      • Communication to the public (sharing) [Article 3]
        • transmission (broadcast) OR
        • making available to the public
      • Distribution
    • Not harmonised at the EU level: Adaptation
      • Derivative work: original elements from a preexisting work + new original elements
      • “without prejudice” to copyright in the preexisting work
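The Infopaq "11 consecutive words" point above can be illustrated mechanically: a reproduction "in part" may already be restricted if it reuses a run of words that is itself original. A toy n-gram check (this is our simplification — the CJEU test also requires the extract to express the author's own intellectual creation, which no word count can establish):

```python
def word_ngrams(text: str, n: int = 11) -> set:
    """All runs of n consecutive (lower-cased) words in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def shares_protected_run(original: str, candidate: str, n: int = 11) -> bool:
    """Toy check inspired by Infopaq: does candidate reproduce any run
    of n consecutive words from original? A real infringement analysis
    additionally asks whether the extract is original."""
    return bool(word_ngrams(original, n) & word_ngrams(candidate, n))
```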

Q1: Is internet scraping a copyright-relevant act?

Q2: Can AI be trained without making reproductions?

Q3: Is communication to the public relevant in AI training/use?


18 of 68

Permission (license)

  • License vs. transfer
  • Individually granted licenses vs. public licenses
    • individually granted: from person A (author) to person B (e.g. a registered user, Terms of Service), usually purpose-specific
    • public: from person A (author) to the general public (e.g. CC)
  • Proprietary licenses vs. open licenses
    • Open: free to use by anyone and for any purpose
  • Licenses can be limited as to:
    • purpose
    • granted rights
    • territorial scope
    • duration (max: term of copyright)
    • exclusivity
    • sublicenseability


19 of 68

Content licenses (CC)

  • 4 building blocks:
    • BY: attribution (in every license)
    • SA: share-alike (“viral”)
    • ND: no derivatives
    • NC: non-commercial (not open!)
  • BY, BY-SA (Wikipedia), BY-NC, BY-ND, BY-NC-SA, BY-NC-ND
  • other tools: CC0 (waiver), Public Domain Mark
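The building blocks compose mechanically: only NC and ND can rule a use out, while BY and SA impose conditions on an otherwise permitted use. A toy compatibility check (our own simplification — real licence analysis turns on contested questions such as what counts as "commercial" or as an adaptation, which is exactly what makes NC/ND licences awkward for AI training):

```python
def cc_permits(license_code: str, *, commercial: bool, derivatives: bool) -> bool:
    """Toy check of a CC licence code (e.g. "BY-NC-SA") against a use.

    BY (attribution) and SA (share-alike) are conditions, not bars,
    so in this sketch only NC and ND can make a use impermissible.
    Illustrative only; not legal advice.
    """
    elements = set(license_code.upper().split("-"))
    if commercial and "NC" in elements:
        return False  # non-commercial clause blocks commercial uses
    if derivatives and "ND" in elements:
        return False  # no-derivatives clause blocks adaptations
    return True
```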


20 of 68

Software licenses (FOSS)

    • Free and Open-Source Software (FOSS) licenses
    • all: access to the source code and the right to modify it
    • copyleft (viral):
      • strong (GPL) or
      • weak (LGPL)
    • permissive (non-viral): MIT, BSD, Apache


21 of 68

Copyright in AI training

  • AI training entails reproduction of data
  • no copyright in training data if, e.g.:
    • pure facts (e.g., data from measurements)
    • human creations that are not original (e.g.: too short, banal)
    • works expressly excluded from copyright (e.g. official works in some countries)
    • works in which copyright expired (author died +70 years ago)
    • AI-generated data
  • training data protected by copyright:
    • copyright is held by the company who trains AI
      • no copyright issues (the rights are executed by the rightholder)
      • e.g. data generated by employees
    • copyright is held by third parties
      • data are licensed, e.g. under CC licenses, terms of service
        • note: are NC and ND requirements in CC licenses compatible with AI training?
      • data are not licensed (e.g. scraped data)
        • exception needed


22 of 68

Copyright exceptions (in general)

  • three-step test for exceptions (Article 9(2) of the Berne Convention)
    • Certain special cases
    • Do not conflict with a normal exploitation of the work
    • Do not unreasonably prejudice the legitimate interests of the authors
  • EU harmonisation of exceptions
    • Article 5 of the InfoSoc Directive (2001)
      • limitative list of exceptions, e.g. temporary copy, private copy, quotation, research, for libraries, for people with disabilities…
    • 2019 DSM Directive: exceptions for Text and Data Mining (TDM)
  • US: fair use doctrine (§107 of the US Copyright Act)


23 of 68

US: the fair use doctrine

  • UK (historically): fair abridgement → fair dealing
  • 4 criteria in §107 US Copyright Act (since 1976):
    • Purpose and character of the use
      • commercial vs. non-commercial → derivative vs. transformative
    • Nature of the work
      • unpublished vs. published, fiction vs. non-fiction
    • Amount and substantiality of the part used
      • the less the better, but use of entire work still possible
    • Effect on the (potential) market value of the work
      • risk of substitution
  • can be used as a “shield”, but also as a “sword” (Lenz v. Universal 2015)
  • “technology-friendly” applications:
    • mass digitisation (Google Books 2015, Hathi Trust 2012)
      • transforming text into digital data
      • new search capabilities → new methods of scientific inquiry
    • use of APIs in software development (Google v. Oracle 2021)



25 of 68

EU: Text and Data Mining Exceptions in the DSM Directive

  • definition of TDM:

‘text and data mining’ means any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations (Article 2(2) DSM)

  • Article 4 DSM: TDM exception for the “general public”
    • condition 1: lawful access to the work
      • Recital 14: license/subscription OR free availability online
    • condition 2: use for TDM has not been expressly reserved in “appropriate manner” (e.g. with machine-readable means)
    • reproductions may be retained “for as long as necessary” for TDM purposes ( :( )
  • Article 3 DSM: TDM exception for scientific research
    • beneficiaries: research organisations, cultural heritage institutions
    • condition: lawful access to the work (cf. above)
    • reproductions may be retained “with appropriate level of security” for further research or validation of results
    • rightholders may apply technological protection measures “to protect the security and integrity of [their] networks and databases”
  • both exceptions override contracts
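The Article 4 opt-out must be expressed "in an appropriate manner", e.g. with machine-readable means. One emerging convention is the W3C community TDM Reservation Protocol (TDMRep), which signals a reservation via a `tdm-reservation` value such as an HTTP response header. A sketch of a crawler-side check, assuming that convention (a compliant crawler would also consult robots.txt and site-level policy files):

```python
def tdm_opt_out(headers: dict) -> bool:
    """Check HTTP response headers for a TDMRep-style reservation (sketch).

    Assumes the TDM Reservation Protocol convention of a
    'tdm-reservation' header whose value '1' reserves TDM rights.
    Header lookup is case-insensitive, as HTTP headers are.
    """
    normalised = {k.lower(): v.strip() for k, v in headers.items()}
    return normalised.get("tdm-reservation") == "1"
```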

Q: Are exceptions for AI training a good thing? Why and why not?


26 of 68

Copyright and AI models

Q: Considering what you already know about copyright, are AI models (≠ AI systems) protected by copyright? Should they be?

hint: idea vs. expression

nevertheless: licensing models is common practice

  • dedicated RAIL licenses

proprietary models available through APIs with Terms of Use

  • ToU (binding contracts) affect the use of the API, as well as the outputs of the model
  • e.g. OpenAI ToU: “You may not (...) Use Output to develop models that compete with OpenAI”.


27 of 68

Copyright in AI outputs:

Are AI outputs copyright-protected works?

  • Problem: no human authorship
    • requirement of the Berne Convention?
      • death, nationality, honor, reputation of the author
    • copyright term refers to the death of the author
    • BUT: “work for hire” doctrine in certain countries (US, UK)
      • copyright belongs ab initio to a legal entity
    • EU Software Directive admits that a legal entity can be considered author of software
  • No human authorship = no originality (author’s own intellectual creation)
  • no originality if the expression of the work is dictated by technical considerations (CJEU)
  • skill (e.g. in prompting) is not enough to claim copyright (CJEU)

Q: If AI-generated works were protected by copyright, who would be the rightholder? (user? provider? AI itself?)


28 of 68

Copyright in AI outputs:

Position of the US Copyright Office I

“A recent entrance to paradise”

  • AI-generated
  • US Copyright Office refused registration (2022)
  • confirmed by District Court (2023)
  • “absent any human involvement”


29 of 68

Copyright in AI outputs:

Position of the US Copyright Office II

“Théâtre d’opéra spatial”

  • Award-winning
  • Generated by Midjourney
  • User input “at least 624 prompts” + modifications in Photoshop
  • registration refused (2023)
  • “insufficient authorship”


30 of 68

Copyright in AI outputs:

Position of the US Copyright Office III

“Zarya of the Dawn” (graphic novel)

  • written by Kris Kashtanova
  • illustrated by Midjourney
  • US Copyright Office (2023):
    • book as a whole (plot, dialogues) admitted for registration
    • individual images refused protection


31 of 68

Copyright in AI outputs: grey areas

  • Role of the user
    • Machine-generated vs. machine-assisted?
      • difference: the degree of human involvement
      • nowadays almost every work is machine-assisted
    • The concept of authorship is bound to evolve with technological progress
      • example: photography
    • US Copyright Office Policy Statement (2023)
      • criterion: have the traditional elements of authorship been conceived and executed by a human, or by a machine?
      • mere prompting is not enough to justify authorship
      • copyright protection possible if AI-generated outputs were creatively arranged or modified by a human
  • Relation with the input data (adaptation/derivative work?)
    • regurgitation and “training data extraction attacks”


32 of 68

Copyright in AI outputs: the “ownership gap”

  • ChatGPT: (co-)author of hundreds of books on Amazon
    • how many books are “secretly” AI-generated?
      • copyfraud – false copyright claim in public domain content
      • presumption of authorship (Article 15(1) Berne Convention, Article 5 of the EU Enforcement Directive) for those whose name “appear on the work in the usual manner”
      • BUT OpenAI ToU: “you are prohibited from (...) representing that Output was human-generated when it was not”
  • Article 50(2) AI Act: Providers of AI systems, including general-purpose AI systems, generating synthetic audio, image, video or text content, shall ensure that the outputs of the AI system are marked in a machine-readable format and detectable as artificially generated or manipulated.
    • feasible…?
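Article 50(2) mandates machine-readable marking but does not prescribe a format (provenance standards such as C2PA are candidate implementations). A purely illustrative sketch with a JSON envelope — every field name here is hypothetical, and the ease of simply stripping such metadata is one reason to question feasibility:

```python
import json

def mark_output(content: str, model_name: str) -> str:
    """Wrap generated text in a machine-readable provenance envelope.

    The envelope format and field names are hypothetical, not a
    mandated standard; real deployments would use something like
    C2PA manifests or watermarking.
    """
    return json.dumps({
        "provenance": {"ai_generated": True, "model": model_name},
        "content": content,
    })

def is_marked_ai_generated(payload: str) -> bool:
    """Detect the marker; returns False for unmarked or plain content."""
    try:
        return bool(json.loads(payload)["provenance"]["ai_generated"])
    except (ValueError, KeyError, TypeError):
        return False
```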


33 of 68

Sui generis right in AI outputs?

  • H. Demsetz, Toward a Theory of Property Rights, 1967: technological development → new property rights
    • legal certainty in transactions
    • prevent market failure
  • European Parliament 2020: AI-outputs ‘must’ be protected under Intellectual Property Rights in order to encourage investment and improve legal certainty
  • European Commission 2023: “the issue of AI-generated works does not deserve a specific legislative intervention”
  • UK Copyright, Designs and Patents Act (CDPA) 1988
    • computer-generated works: works generated by computer in circumstances such that there is no human author of the work
    • sui generis IP right (called “copyright” in the CDPA), for 50 years after their creation
    • ‘author’: the person by whom the arrangements necessary for the creation of the work are undertaken
    • almost never used in courts, ‘unclear and contradictory’
    • reform proposal 2022 (failed)


34 of 68

Proposed reform in France (2023)

  • Proposition de loi visant à encadrer l’intelligence artificielle par le droit d’auteur, No. 1630 (12 septembre 2023)
  • Author’s permission necessary to integrate a copyright-protected work in an AI system
    • opt-in instead of opt-out?
    • contradiction of the TDM exceptions?
  • Copyright in AI-generated outputs should belong to the authors of works that enabled its generation (names marked in the output)
    • practical application…?
    • violation of EU law!
  • Collective rights management for AI-generated works
    • a designated organisation collects remuneration and redistributes it among entitled authors
    • levy collected for works used in AI systems whose authorship cannot be determined


35 of 68

Lawsuits concerning copyright in AI:

Getty Images vs. Stability AI (UK)

  • Stability AI: London-based provider of generative AI tools (incl. Stable Diffusion)

    • claim I: unlawful use of scraped data for AI training
      • note: no “commercial TDM” exception in the UK
    • claim II: infringement of copyright in those images by reproduction of substantial parts

  • hearing expected in 2025


36 of 68

Lawsuits concerning copyright in AI:

New York Times vs. OpenAI (US)

  • claim: unlawful use of NYT articles to train the GPT model
  • allegation 1: GPT can regurgitate near-verbatim copies of NYT articles
  • allegation 2: when prompted, GPT can produce longer excerpts of NYT articles than search engines do, allowing paywall circumvention (impact on the market value)
  • OpenAI statement:
    • AI training is fair use (transformative);
    • memorisation (and regurgitation) is a bug (not a feature), and a result of intentional manipulation

Meanwhile in China (February 8, 2024, Guangzhou Internet Court): a court found an AI provider guilty of copyright infringement after its text-to-image tool generated images of Ultraman (cartoon character) substantially similar to the original


37 of 68

PART II: Data Protection Issues in AI Training


38 of 68

Data protection: a primer

  • Main source: General Data Protection Regulation (GDPR)
    • not new: data protection laws in Germany, France since mid-1970s
    • repealed the Data Protection Directive 1995
    • adopted 2016, applicable since 25 May 2018
    • became an international standard
    • applies to data processing if:
      • the controller is established in the EU OR
      • the processing is related to providing goods and services to or monitoring the behaviour of individuals in the EU
  • GDPR does not apply to processing for “prevention, investigation, detection or prosecution of criminal offences or the execution of criminal penalties” (e.g. video surveillance by the police)
    • Law Enforcement Directive 2016
  • ePrivacy Directive 2002 (amended 2009)
    • unsolicited emails, cookies, traffic data…
    • ePrivacy Regulation proposed in 2017 (stuck)


39 of 68

Data protection: terminology

  • Personal data:
    • any information… (fact/opinion, true/false, any format)
    • …related to… (by content / by purpose / by result)
    • …an identified or identifiable… (possible to single out by any means reasonably likely to be used)
      • reasonably likely, taking into account costs of and the amount of time required for identification (at the time of processing)
    • …natural person (living individual, a.k.a. data subject).
  • Sensitive data (racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, biometric data, genetic data, health, sex life and sexual orientation) — processing forbidden in principle [Article 9 GDPR]
  • Processing: any operation or set of operations on personal data
  • Anonymisation (permanent, irreversible) vs. pseudonymisation (reversible)
  • Controller: person (…) [or] body which, alone or jointly with others, determines the purposes and [essential] means of [processing]
  • Processor: person (…) [or] body which processes personal data on behalf of the controller
    • Data Processing Agreement (controller — processor)
  • Data Protection Officer: liaison between the controller and the supervisory authority
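The anonymisation/pseudonymisation distinction above can be illustrated with keyed hashing: whoever holds the key can re-derive the mapping, so the data remain personal data (pseudonymised); destroying the key moves the data towards anonymisation. A minimal sketch (the truncation length and field choice are ours):

```python
import hashlib
import hmac

def pseudonymise(identifier: str, key: bytes) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256).

    This is pseudonymisation, not anonymisation: the controller
    holding the key can consistently re-link records, and under the
    GDPR the data stay personal as long as re-identification remains
    reasonably likely. Illustrative sketch only.
    """
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated token, e.g. for a dataset column
```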


40 of 68

Data protection principles (overview)

Article 5 GDPR

  • Lawfulness, fairness and transparency
  • Purpose limitation
  • Data minimisation
  • Accuracy
  • Storage limitation
  • Integrity and confidentiality (a.k.a. Security)
  • Accountability


41 of 68

Data protection principles:

Lawfulness

  • legal basis needed (list in Article 6 GDPR), e.g.:
      • consent
        • any freely given, specific, informed and unambiguous indication of the data subject’s wishes by which he or she, by a statement or by a clear affirmative action, signifies agreement to the processing of personal data relating to him or her
        • can be withdrawn at any time (not retroactively)
      • legitimate interest
        • balancing test taking into account: the nature of the data, data subject’s reasonable expectations, impact on the data subject
        • right to object
      • performance of a contract, public interest (specific provision)


42 of 68

Data protection principles:

Transparency

  • Right to information
    • data subject has to be informed about e.g.
      • identity of the controller
      • purposes of the processing
      • legitimate interests pursued (if applicable)
      • his or her rights (incl. right to file a complaint)
      • recipients (persons or bodies who will have access to the data)
      • duration of storage or criteria used to determine it
    • derogation: if disproportionate effort required OR if provision of information may impair the objectives of research (the information should then be made publicly available)
  • Right of access


43 of 68

Data protection principles:

Purpose Limitation, Data Minimisation

  • Purpose limitation
    • Data should be collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes;
    • exception (purpose extension): further processing for research purposes is not to be considered incompatible with the initial purpose
    • E.g. archiving is a different purpose from research
  • Data minimisation
    • Data should be adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed
    • no exceptions


44 of 68

Data protection principles:

Accuracy, Storage Limitation

  • Accuracy
    • Data should be adequate, accurate and, where necessary, kept up to date
    • Right of rectification
  • Storage Limitation
    • Data should be kept in a form which permits identification of data subjects for no longer than is necessary for the purposes for which the personal data are processed
    • exception: research and archiving in public interest (with safeguards and appropriate technical and organisational measures) — storage for longer periods possible (not: indefinite)


45 of 68

Data protection principles:

Security (cf. also Articles 32-34 GDPR)

  • Data should be processed in a manner that ensures appropriate security of the personal data, including protection against unauthorised or unlawful processing and against accidental loss, destruction or damage, using appropriate technical or organisational measures
    • backup copies, stress-tests
    • Data Breach Policy:
      • documentation of all data breaches
      • notification to the supervisory authority within 72 hours (if any risk for data subjects)
      • communication to the data subject (if high risk for data subjects)


46 of 68

Data protection principles:

Accountability

  • The controller shall be responsible for, and be able to demonstrate compliance with all data protection principles
    • Data protection ‘by design and by default’ [Article 25 GDPR]
    • Record of data processing activities [Article 30 GDPR]
    • Data Protection Impact Assessment (self-assessment) [Article 35 GDPR]
      • mandatory in some cases (new technologies, high risks, large quantities of data), always advisable (especially in research context)
      • documented; in consultation with the Data Protection Officer
      • assessment of necessity and proportionality
      • identification of risks (what if everything goes wrong?) and measures to address them


47 of 68

Rights of data subjects (overview)

  • Right to be provided with information (cf. Transparency principle)
  • Right of access
    • limitations possible for research purposes
  • Right to rectification (cf. Accuracy principle)
  • Right to erasure (‘Right to be forgotten’) or restriction
    • if processing violates GDPR principle(s)
  • Right to data portability
    • to receive the data he/she provided to the controller in a structured, commonly used and machine-readable format and transmit it to another controller (e.g. switching service providers)
  • Right to object (processing based on legitimate/public interest)
    • controller can still demonstrate “compelling legitimate grounds”
  • Freedom from automated individual decision-making, including profiling
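The portability right above requires handing data over in a "structured, commonly used and machine-readable format". A minimal sketch of such an export, producing JSON and CSV side by side (the record fields are illustrative, not prescribed by the GDPR):

```python
import csv
import io
import json

def export_user_data(records: list) -> dict:
    """Export a data subject's records as JSON and CSV (Article 20 sketch).

    `records` is a list of flat dicts sharing the same keys; both
    output formats count as structured, commonly used and
    machine-readable. Field names are illustrative.
    """
    json_dump = json.dumps(records, ensure_ascii=False, indent=2)
    buffer = io.StringIO()
    if records:
        writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)
    return {"json": json_dump, "csv": buffer.getvalue()}
```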


48 of 68

Freedom from automated individual decision-making (Article 22 GDPR)

  • Principle: General prohibition on fully automated (without any human intervention) individual decision-making (including profiling) that has a legal or similarly significant effect (e.g. financial situation, health, employment, access to education)
  • Exception if:
    • necessary to perform or enter into a contract OR
    • expressly authorised by law OR
    • data subject gave explicit consent
    • + safeguards, including at least:
      • the right to contest the decision and to obtain human intervention
  • Transparency requirements (incl. meaningful information about the logic involved)
  • WP29 Guidelines


49 of 68

GDPR compliance in AI training

(CNIL’s AI how-to sheets)

  1. Define the purpose
  2. Define the legal status of the stakeholders
  3. Define the Legal basis
  4. Data Protection Impact Assessment
  5. Data Protection by Design and by Default


50 of 68

  1. Defining a purpose
  • purpose should always be specified, explicit and legitimate
  • purpose limitation, transparency, data minimisation, storage limitation
  • when future applications can be defined from the development stage (purpose-specific systems): AI training and deployment can have one and the same purpose
    • e.g., monitoring of train traffic
  • future applications cannot be defined at the development stage: general-purpose AI
    • CNIL: correctly defined purpose refers to:
      • types of system developed AND
      • examples of potential applications (esp. high-risk ones)
      • functionalities excluded by design
      • conditions of use/distribution (e.g. Open Source)
    • Ex. (CNIL): an organisation wishes to develop a voice recognition model capable of identifying a speaker and his/her language in order to commercialise it for different operational uses in the production phase (e.g. tools for identifying people by voice assistants or voice translation applications on a mobile device, etc.).
  • AI developed for research purposes: lower degree of precision is acceptable


51 of 68

Defining the legal status of various stakeholders 1/2

  • Case-by-case analysis necessary
  • The controller:
    • decides why (purpose) and generally how (essential means) the system will be trained
    • is at the initiative of development AND
    • created the training dataset OR
    • entrusted this task to a service provider with sufficiently detailed documented instructions OR
    • decided to use a pre-existing dataset for training (developed by another controller) OR
    • decided to use a pre-existing model
    • Ex.: where social media data are used for AI training, the platform provider is not the controller (even though it may lay down conditions for re-use)


52 of 68

Defining the legal status of various stakeholders 2/2

  • Joint controllers
    • e.g. consortium
    • processing for a common purpose or for own purpose
      • Q: who benefits from the processing?
  • Processor
    • Develops an AI system for a client (controller) OR
    • Collects training data according to documented instructions
    • signs a Data Processing Agreement with a controller
    • BUT: if it uses the system/dataset also for its own purposes → separate processing (as controller)


53 of 68

Defining the legal basis 1/2

  • Case-by-case analysis
  • Consent?
    • validity criteria: freely given, specific, informed, unambiguous
    • if the right to withdraw cannot be guaranteed, consent is not an appropriate basis
    • granularity (per purpose)
    • unfeasible with scraped data
  • Legitimate interest?
    • is the pursued interest legitimate? (i.e., not illegal)
    • is the processing of personal data necessary? (can anonymised/synthetic data be used instead?)
    • is there a disproportionate impact on the rights and freedoms of data subjects?
      • Ex. training a system predicting one’s psychological profile on online data related to this person would be excessive
    • implement measures to minimize the impact: pseudonymisation, exclusion of sensitive data, elaborated selection criteria (data minimisation)


54 of 68

Defining the legal basis 2/2

  • Public interest?
    • only if based on a specific legal provision (normative text)
    • may be available for public research institutes
  • Necessary for the performance of a contract?
    • only available in very limited cases (e.g. subscription to a personalised email generation service)
    • Terms of Service of a social network are not an appropriate legal basis for reusing data for AI training, as such reuse is not necessary to perform the contract (CJEU, C-252/21)


55 of 68

Data Protection Impact Assessment (DPIA)

  • necessary if the developed system is likely to create a high risk to the rights and freedoms of natural persons
    • AI development often meets the EDPB criteria (e.g.: large-scale collection of personal data, crossing or combination of data sets, innovative uses or application of new technological or organisational solutions)
  • according to the CNIL, the following require a DPIA if they involve personal data processing:
    • the development of all systems identified as high-risk in the AI Act
    • the development of all foundation models
    • the development of all general-purpose AI systems

  • Risks to consider include:
    • misuse of training data, esp. in case of a data breach;
    • automated discrimination of certain users by the AI system;
    • ‘hallucinations’ concerning real persons;
    • regurgitation of personal data in case of attacks;
    • loss of control over published online data (e.g. one’s tweets)
  • Measures to be taken on the basis of a DPIA include:
    • enhanced security (e.g. encryption of training data);
    • data minimisation (e.g. by replacing some personal data with synthetic data)
    • measures to reinforce the rights of data subjects (e.g., machine unlearning)
    • auditing


56 of 68

Data Protection by Design and by Default

  • stick to the defined purpose!
  • avoid over-collection and excessive annotation
    • are they necessary for the purpose? (if not: delete; data pruning)
    • find volume “sweet spot”
  • choose the least privacy-invasive method
  • CNIL: “the use of deep learning must be justified and should therefore not be systematic”...
  • recommended training protocols to consider:
    • decentralised training (e.g. federated learning) – allows greater control over datasets without combining them (BUT: security concerns)
    • cryptography (secure multi-party computation, homomorphic encryption)
    • keep an eye on the most recent developments!
  • if possible, stay away from “sensitive data”
  • ensure representativeness of the data to avoid bias
  • define data retention periods (and stick to them)
    • automate deletion, if possible
    • some data can be archived (cf. national archiving laws)
  • document data (CNIL’s documentation model)
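"Define data retention periods (and stick to them)" and "automate deletion" can be combined: a scheduled job purges records older than the defined cut-off. A minimal sketch, with an assumed 3-year retention period and invented field names:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical example: the retention period and record schema are
# illustrative, not prescribed by the GDPR or the CNIL.
RETENTION = timedelta(days=3 * 365)  # e.g. a defined 3-year retention period

def purge_expired(records: list[dict]) -> list[dict]:
    """Keep only records still within the retention period."""
    cutoff = datetime.now(timezone.utc) - RETENTION
    return [r for r in records if r["collected_at"] >= cutoff]

data = [
    {"id": 1, "collected_at": datetime(2015, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "collected_at": datetime.now(timezone.utc)},
]
print([r["id"] for r in purge_expired(data)])  # the 2015 record is dropped
```

In practice such a job would run on a schedule against the live datastore, and records subject to national archiving laws would be moved to an archive rather than deleted.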

56

57 of 68

AI Act

  • 21 April 2021: proposed by the European Commission
  • 13 March 2024: accepted by the European Parliament
  • 21 May 2024: approved by the EU Council
  • BREAKING! Publication 12.07.2024
  • will become applicable 24 months after its entry into force (20 days after publication), BUT:
    • for prohibited AI practices: 6 months after entry into force
    • for certain high-risk AI systems: 36 months after entry into force
    • for general-purpose AI models: 12 months after entry into force
  • large territorial scope (systems developed in the EU, offered in the EU, outputs used in the EU…) – bound to become an international standard

57

58 of 68

AI Act: AI governance

  • AI Office

58

59 of 68

AI Act: classification of AI

  • by degree of risk
    • prohibited systems (Chapter II)
    • high-risk systems (Chapter III)
    • minimal risk (not regulated)

  • by purpose
    • general-purpose models (Chapter V)

  • transparency obligations (Chapter IV)

59

AI system

AI model

60 of 68

Prohibited AI systems

  • deploying subliminal, manipulative, or deceptive techniques to distort behaviour and impair informed decision-making, causing significant harm;
  • exploiting vulnerabilities related to age, disability, or socio-economic circumstances to distort behaviour, causing significant harm.
  • biometric categorisation systems inferring sensitive attributes (race, political opinions, sex life, sexual orientation…)
  • social scoring, i.e., evaluating or classifying individuals or groups based on social behaviour or personal traits, causing detrimental or unfavourable treatment of those people.
  • assessing the risk of an individual committing criminal offenses solely based on profiling or personality traits, except when used to augment human assessments based on objective, verifiable facts directly linked to criminal activity.
  • compiling facial recognition databases by untargeted scraping of facial images from the internet or CCTV footage.
  • inferring emotions in workplaces or educational institutions, except for medical or safety reasons.
  • ‘real-time’ remote biometric identification (RBI) in publicly accessible spaces for law enforcement, except when:
    • searching for missing persons or victims;
    • preventing a substantial and imminent threat to life, or a foreseeable terrorist attack; or
    • identifying suspects in serious crimes

60

61 of 68

High-risk AI systems

  • used in areas regulated by EU law listed in Annex I
    • e.g. toys, recreational watercraft, lifts, radio equipment, cableway installations, medical devices, civil aviation security, motor vehicles, aircraft…
  • corresponding to use cases listed in Annex III
    • identified areas: biometrics, critical infrastructure, education (access, evaluation, detection of prohibited behaviour during tests), access to and enjoyment of public services, law enforcement, migration, justice
    • UNLESS the system only performs a narrow procedural or preparatory task, improves the result of a human activity, or detects decision-making patterns or deviations without being meant to replace human decisions without review
      • burden of proof on the provider (documented assessment)
  • profiling is always high-risk (using personal data to automatically assess aspects of a person’s life)

61

62 of 68

High-risk AI systems: obligations of providers

62

63 of 68

Transparency obligations (Chapter IV)

  • AI systems intended to interact directly with humans:

must be designed in such a way that the natural persons concerned are informed that they are interacting with an AI system

unless it is obvious

  • AI systems, including general-purpose AI systems, generating synthetic audio, image, video or text content,

the outputs of the AI system are marked in a machine-readable format and detectable as artificially generated or manipulated.

…as far as technically feasible…

  • emotion recognition systems or biometric categorisation systems

deployers shall inform the exposed persons about the functioning of the system

  • AI systems that generate or manipulate image, audio or video content constituting a deep fake

deployers shall disclose that the content has been artificially generated or manipulated

when content forms part of an evidently artistic, creative, satirical, fictional work: disclosure in a manner that does not hamper the display or enjoyment of the work.

  • AI systems that generate or manipulate text which is published with the purpose of informing the public on matters of public interest

deployers shall disclose that the content has been artificially generated or manipulated

UNLESS content has undergone a process of human review or editorial control and where a natural or legal person holds editorial responsibility for the publication

  • transparency does not apply to crime detection and investigation
  • adoption of codes of practice encouraged (AI office)

63

64 of 68

General-purpose AI models (Chapter V)

  • ‘general-purpose AI model’ means an AI model, including where such an AI model is trained with a large amount of data using self-supervision at scale, that displays significant generality and is capable of competently performing a wide range of distinct tasks regardless of the way the model is placed on the market and that can be integrated into a variety of downstream systems or applications, except AI models that are used for research, development or prototyping activities before they are placed on the market; (Art. 3 (63))
  • Large generative AI models are a typical example for a general-purpose AI model, given that they allow for flexible generation of content, such as in the form of text, audio, images or video, that can readily accommodate a wide range of distinctive tasks. (Recital 99)

64

65 of 68

GPAI models with systemic risks

  • GPAI model with systemic risks if:
    • it has high impact capabilities (≥ the most advanced GPAIs)
      • presumed if > 10^25 floating point operations used for its training (https://ourworldindata.org/grapher/artificial-intelligence-training-computation)
        • provider can prove the contrary
      • provider has to notify the Commission
    • OR specific decision of the Commission
      • provider may request reassessment
    • list kept by the Commission
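The 10^25 FLOP presumption can be sanity-checked with the common back-of-the-envelope estimate that training a dense transformer costs roughly 6 × parameters × training tokens FLOPs. This heuristic, and the model sizes below, are illustrative assumptions, not part of the AI Act:

```python
# Rough training-compute estimate: FLOPs ≈ 6 * N_params * N_tokens
# (a common heuristic for dense transformers; figures are illustrative).
THRESHOLD = 1e25  # AI Act presumption of systemic risk

def training_flops(params: float, tokens: float) -> float:
    """Estimate cumulative training compute for a dense model."""
    return 6 * params * tokens

for name, n_params, n_tokens in [
    ("hypothetical 7B model, 2T tokens", 7e9, 2e12),
    ("hypothetical 500B model, 20T tokens", 5e11, 2e13),
]:
    flops = training_flops(n_params, n_tokens)
    print(f"{name}: {flops:.1e} FLOPs -> presumption triggered: {flops > THRESHOLD}")
```

On these assumptions the 7B model lands around 8.4e22 FLOPs, well under the threshold, while the 500B model exceeds it; the provider of the latter would have to notify the Commission (or prove the model lacks high-impact capabilities).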

65

66 of 68

GPAI models: obligations of providers

  1. technical documentation of the model
    • information on training, testing, evaluation
    • minimum: Annex XI
    • provided to AI Office and national authorities upon request
  2. information and documentation for system providers
    • enable good understanding of capabilities and limitations of the model
    • enable compliance with the AI Act
    • minimum: Annex XII
      • points a) and b) do not apply to providers of open-source models whose parameters and model descriptions are publicly available, UNLESS the model presents systemic risks
  3. copyright policy
    • esp. complying with opt-outs in the general TDM exception
  4. “sufficiently detailed” summary about the training data
    • to be made publicly available
    • template to be provided by the AI Office
  5. codes of practice, harmonised standards (presumption of compliance)
  6. providers from third countries must appoint an authorised representative in the EU

66

67 of 68

Additional obligations of providers of GPAI models with systemic risks

  • perform model evaluation (incl. adversarial testing) to identify and mitigate systemic risks
  • assess and mitigate systemic risks stemming from the development, placing on the market, and the use of the model
  • monitor and report serious incidents (incl. corrective measures) to the AI Office
  • ensure appropriate level of security for the model and physical infrastructure

  • codes of practice, harmonised standards (presumption of compliance)

67

68 of 68

68