1 of 30

Automating Cookie Consent and GDPR Violation Detection

Dino Bollinger, Karel Kubicek, Carlos Cotrini, David Basin

31st USENIX Security Symposium (August 11, 2022)

2 of 30

Cookie consent

  • Solomos et al. (2019): 90% of websites use tracking cookies
  • EU law: websites must notify users and obtain their consent
  • Websites display consent notices to comply with these regulations

3 of 30

ePrivacy Directive and General Data Protection Regulation (GDPR)

ePrivacy Directive:

  • All but strictly necessary data processing requires consent

GDPR Consent:

  • Freely-given
  • Unambiguous
  • Specific
  • Informed
  • Purpose-limited

4 of 30

Non-compliance is widespread

  • Empirical studies:
    • Non-compliance in up to 80% of websites (e.g. cookies set before consent)

(Utz 2019, Trevisan 2019, Matte 2020, Nouwens 2020, Kampanos 2021, Santos 2021, etc.)

    • Websites do not respect user choices

(Libert 2018, Trevisan 2019, Matte 2020, Nouwens 2020, etc.)

  • Usability: dark patterns successfully trick users

(Bösch 2016, Grassl 2020, Hasner 2021, Sanchez-Rola 2019, Htut Soe 2020, etc.)

5 of 30

Goal: Enforce cookie consent on client-side while browsing.

6 of 30

Our solution: CookieBlock

  • Browser extension to predict purposes for cookies using machine learning
  • Enforce cookie consent on client-side

Implementation:

    • Crawl web to gather training data (ground truth)
    • Extract features from cookies
    • Train classifier model and evaluate
    • Apply the model in the browser extension

7 of 30

Data collection: selecting data sources

  • Consent Management Platform (CMP):
  • Source for training data:

8 of 30

Data collection: web crawlers

  • CMP presence crawler:
    • Input: 6 million domains, sourced from Tranco list
    • Output: ~37.5k domains with confirmed CMP

  • Cookie consent crawler:
    • Browse websites, gather cookie declarations + observed cookies
    • Based on OpenWPM – visit subpages, move cursor, etc.
    • Successful for ~30k domains

HTTP GET

Image source: Mozilla Firefox (https://commons.wikimedia.org/wiki/File:Firefox_logo,_2017.svg)
The Firefox logo is a trademark of the Mozilla Foundation in the U.S. and other countries.

9 of 30

Data collection: results of OpenWPM crawl

10 of 30

Feature extraction from textual cookies

Ex 1: Shannon entropy

    • Higher entropy, more randomness
    • Indicator for unique identifiers
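As a minimal sketch of the entropy feature, the function below computes the Shannon entropy of a cookie value in bits per character. The function name and the sample values are illustrative assumptions; the slides do not state how the entropy is normalized or thresholded.

```python
import math
from collections import Counter


def shannon_entropy(value: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not value:
        return 0.0
    total = len(value)
    counts = Counter(value)
    # H = sum over symbols of p * log2(1 / p)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Illustrative values only (not taken from the dataset).
flag_like = "aaaaaaaa"             # one repeated symbol
id_like = "f3a9c1e07b2d4685"       # 16 distinct hex characters

print(shannon_entropy(flag_like))
print(shannon_entropy(id_like))
```

A value made of one repeated symbol has zero entropy, while a value in which every character differs reaches the maximum for its length, which is why high entropy hints at a randomly generated identifier.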

Ex 2: Content encodings

    • JSON, CSV, Base64, etc.

52 feature types in total, including:

    • Name patterns, content size, timestamps, language strings, content encoding, cookie flags, third-party status, expiry, etc.
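A content-encoding feature could be sketched roughly as follows. The labels, the check order, and the length threshold are assumptions made for illustration, not the rules used in the extension.

```python
import base64
import binascii
import json


def detect_encoding(value: str) -> str:
    """Return a coarse encoding label for a cookie value (hypothetical labels)."""
    # JSON: only objects and arrays count, so bare numbers stay "plain".
    try:
        parsed = json.loads(value)
        if isinstance(parsed, (dict, list)):
            return "json"
    except ValueError:
        pass
    # CSV: comma-separated with no empty fields.
    if "," in value and all(part for part in value.split(",")):
        return "csv"
    # Base64: length is a multiple of 4 and strict decoding succeeds.
    if len(value) >= 8 and len(value) % 4 == 0:
        try:
            base64.b64decode(value, validate=True)
            return "base64"
        except (binascii.Error, ValueError):
            pass
    return "plain"


print(detect_encoding('{"id": 1}'))
print(detect_encoding("a,b,c"))
print(detect_encoding("aGVsbG8gd29ybGQ="))
```

The Base64 check is only a heuristic: any alphanumeric string whose length is a multiple of four also decodes successfully, so a real feature extractor would likely combine it with other signals.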

11 of 30

XGBoost classifier and baseline

  • XGBoost used for training our model
  • Cookiepedia
    • Repository of cookies labeled by experts
    • Uses the same cookie purposes; covers 70% of our dataset

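[Figure: example decision tree for XGBoost with the splits "entropy > 0.8" and "Session?" (True/False branches), leaf scores -3, +3 and -1, and the result "Prediction: Advertising"; shown next to Cookiepedia.]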
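To illustrate how a boosted tree ensemble turns leaf scores into a prediction, here is a toy sketch in plain Python. The tree shapes, thresholds, leaf scores, and the two-class setup are hypothetical and only loosely echo the split names on the slide; a real XGBoost model learns many such trees per class.

```python
import math


# Hypothetical toy trees: each returns a leaf score for one class.
def tree_advertising(cookie: dict) -> float:
    if cookie["entropy"] > 0.8:
        return -1.0 if cookie["session"] else 3.0
    return -3.0


def tree_necessary(cookie: dict) -> float:
    return 2.0 if cookie["session"] else -2.0


TREES = {"advertising": [tree_advertising], "necessary": [tree_necessary]}


def predict(cookie: dict) -> tuple[str, dict[str, float]]:
    """Sum leaf scores per class, then apply softmax to get probabilities."""
    margins = {label: sum(t(cookie) for t in trees) for label, trees in TREES.items()}
    peak = max(margins.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(m - peak) for label, m in margins.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    return max(probs, key=probs.get), probs


label, probs = predict({"entropy": 0.95, "session": False})
print(label, probs)
```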

12 of 30

Classifier evaluation

  • Cookiepedia balanced accuracy: 84.7% ± 0.3%
  • XGBoost balanced accuracy: 84.2% ± 0.27%
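Balanced accuracy is the mean of the per-class recalls, so a frequent class cannot dominate the score. The helper below is a small sketch of that definition; the labels and predictions are made up for illustration.

```python
from collections import defaultdict


def balanced_accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Illustrative data: a classifier that always answers "necessary".
y_true = ["necessary"] * 8 + ["advertising"] * 2
y_pred = ["necessary"] * 10
print(balanced_accuracy(y_true, y_pred))
```

In this made-up example the plain accuracy is 80%, but the balanced accuracy is only 50% because the classifier never recognizes the minority class.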

13 of 30

CookieBlock browser extension

  • User defines consent preferences when installed
  • Classifies cookies and deletes those with rejected purpose
  • Available for Firefox, Chrome, Edge and Opera (8k users)
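The enforcement step can be sketched as a filter over classified cookies. This is a Python illustration of the logic only (the actual extension runs in the browser); the purpose names, the stub classifier, and the rule that necessary cookies are always kept are assumptions.

```python
from typing import Callable


def cookies_to_delete(
    cookies: list[dict],
    consent: dict[str, bool],
    classify: Callable[[dict], str],
) -> list[str]:
    """Return names of cookies whose predicted purpose the user rejected."""
    rejected = []
    for cookie in cookies:
        purpose = classify(cookie)
        # Assumption: strictly necessary cookies are never removed.
        if purpose != "necessary" and not consent.get(purpose, False):
            rejected.append(cookie["name"])
    return rejected


# Stub classifier standing in for the trained model (hypothetical labels).
LABELS = {"session_id": "necessary", "_ga": "analytics",
          "ad_id": "advertising", "lang": "functionality"}
cookies = [{"name": name} for name in LABELS]
consent = {"functionality": True, "analytics": False, "advertising": False}

print(cookies_to_delete(cookies, consent, lambda c: LABELS[c["name"]]))
```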

Empirical evaluation:

  • No broken functionality on 85 out of 100 pages
  • Authentication issues on 7 websites
  • 7 broken consent popups, 1 language setting problem

14 of 30

Potential violations: per type

15 of 30

Potential violations: wrong purpose and undeclared cookies

Undeclared cookies

  • Cookies that were not listed in the consent
  • 82.5% of websites
  • 40.2% of cookies in total not declared

GDPR informed consent requirement
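Detecting undeclared cookies amounts to comparing the cookies observed during a visit with those listed in the consent notice. The sketch below assumes that cookies are matched on (name, domain) pairs, which is an illustrative simplification.

```python
def undeclared_cookies(
    observed: list[tuple[str, str]],
    declared: list[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Cookies seen in the browser but missing from the declaration."""
    return sorted(set(observed) - set(declared))


# Illustrative data: (cookie name, domain).
observed = [("_ga", "example.com"), ("sid", "example.com"), ("ad_id", "tracker.net")]
declared = [("_ga", "example.com"), ("sid", "example.com")]

missing = undeclared_cookies(observed, declared)
print(missing, f"{len(missing) / len(observed):.1%} undeclared")
```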

Image sources: https://commons.wikimedia.org/wiki/File:Twemoji2_1f36a.svg; https://commons.wikimedia.org/wiki/File:Question_mark_alternate.svg
The Twemoji cookie image is licensed under the Creative Commons Attribution 4.0 International license, and has been altered from its original form. We claim no ownership of the image.

“Google Analytics” cookie with wrong purpose

  • Detected on 8.2% of all websites
  • 2.7% misclassified as necessary

Decision from Planet49 case

16 of 30

Potential violations: implicit and ignored consent

Cookies set prior to user’s consent

  • Found on 69.7% of all domains
    • Prior work: Nouwens et al. (32.5%)
    • Prior work: Matte et al. (9.9%)

Article 5(3) of the ePrivacy Directive

Cookies set despite negative consent

  • Found on 21.3% of all domains
    • Differs from Matte et al. (5.3%)

Article 5(3) of the ePrivacy Directive
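Both checks on this slide can be phrased over a timeline of cookie-setting events. The following sketch assumes each event carries a declared purpose and a timestamp relative to page load; the data model and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CookieEvent:
    name: str
    purpose: str   # declared purpose of the cookie
    set_at: float  # seconds since page load


def set_before_consent(events: list[CookieEvent], consent_at: float) -> list[str]:
    """Non-essential cookies created before the user interacted with the notice."""
    return [e.name for e in events
            if e.purpose != "necessary" and e.set_at < consent_at]


def set_despite_rejection(events: list[CookieEvent], consent_at: float,
                          rejected: set[str]) -> list[str]:
    """Cookies of a rejected purpose created after the user declined."""
    return [e.name for e in events
            if e.purpose in rejected and e.set_at >= consent_at]


events = [
    CookieEvent("sid", "necessary", 0.1),
    CookieEvent("_ga", "analytics", 0.5),
    CookieEvent("ad_id", "advertising", 4.0),
]
print(set_before_consent(events, consent_at=2.0))
print(set_despite_rejection(events, consent_at=2.0, rejected={"advertising"}))
```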

17 of 30

Potential violations: histogram

18 of 30

Conclusion

  • Consent notices are broken
  • Crawled ground truth + extracted features
  • Trained XGBoost model to predict purposes for cookies
  • CookieBlock enforces user consent preferences
  • Detected 8 potential violation types on ~95% of websites

Thank you for your attention! Questions?

More info, source, extension links: https://karelkubicek.github.io/post/cookieblock

Dino Bollinger, Karel Kubicek, Dr. Carlos Cotrini, Prof. David Basin

19 of 30

Backup Slides

20 of 30

Consent management platforms: market share & analysis

CMP                   Market share
Osano                 2.25%
Cookie Notice         1.29%
OneTrust              1.17%
OptAnon               1.08%
Cookie Law Info       0.95%
Cookiebot             0.77%
Quantcast CMP         0.68%
UK Cookie Consent     0.33%
TrustArc              0.26%
WP GDPR Comp.         0.20%
Moove GDPR Comp.      0.18%
tarteaucitron.js      0.16%
Usercentrics          0.16%
CookiePro             0.15%
Borlabs Cookie        0.12%
EU Cookie Law         0.12%
PrimeBox CookieBar    0.09%
Cookie Script         0.07%
Cookie Information    0.06%
Termly                0.05%
Cookie Info Script    0.05%
Easy GDPR             0.04%

[The slide's "Remote" and "Labels" columns have no recoverable values.]

21 of 30

Feature importance

How many times is the feature used?
High weight → feature used close to the leaf

How many cookies were influenced by the feature?
High importance → feature close to the root

22 of 30

Model precision and recall

Precision and recall per purpose, Cookiepedia vs. XGBoost (figures listed in the order they appear on the slide):

  • Strictly necessary: 88.5%, 94.5%, 81.7%, 87.3%
  • Functionality: 78.7%, 38.1%, 76.3%, 52.9%
  • Performance/analytics: 93.0%, 84.2%, 89.7%, 89.8%
  • Tracking/advertising: 79.0%, 94.9%, 89.8%, 93.6%

Cookiepedia accuracy 86.1% ± 0.1%

XGBoost accuracy 87.2% ± 0.23%
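For reference, per-class precision and recall can be computed as in the sketch below. The labels and predictions are invented for illustration and do not reproduce the figures above.

```python
def precision_recall(y_true: list[str], y_pred: list[str], label: str) -> tuple[float, float]:
    """Precision and recall for one class in a multi-class setting."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative data only.
y_true = ["ads", "ads", "ads", "func", "func"]
y_pred = ["ads", "ads", "func", "ads", "func"]
print(precision_recall(y_true, y_pred, "ads"))
print(precision_recall(y_true, y_pred, "func"))
```

Precision answers "of the cookies predicted as this purpose, how many truly have it", while recall answers "of the cookies that truly have this purpose, how many were found".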

23 of 30

CookieBlock: manual evaluation

  • 7 websites with authentication issues
    • authorstream.com
    • walmart.com
    • dafont.com
    • tpsl-india.in
    • sage.com
    • eventbrite.co.uk
    • formstack.com
  • 8 websites with functional issues
    • tandf.co.uk: can't change region
    • sherdog.com: CMP issues
    • martindale.com: CMP issues
    • taboola.com: CMP issues
    • thegatewaypundit.com: CMP issues
    • philips.com: CMP issues
    • windowsupdate.com: CMP issues
    • prweb.com: CMP issues

24 of 30

Related work – extensions

  • Rachel's GDPR Consent Manager
  • CookieEnforcer
  • Consent-O-Matic
  • I don’t care about cookies
  • uBlock Origin with Easylist cookies

25 of 30

Related work – comparable approaches

  • CCCC: Corralling Cookies into Categories with CookieMonster [Hu et al., 2021]

  • CookieEnforcer: Automated Cookie Notice Analysis and Enforcement [Khandelwal et al., 2022]
    • Requires honest implementation of consent
    • Works only for English (EN) websites
    • ML not in browser, but in crawler - limited to crawled websites with "recipes"

Accuracy: 0.867

26 of 30

Violation detection: outliers, conflicting purposes

Conflicting purposes

  • 2+ purposes for same cookie
  • Cookie is enabled by consent to at least one purpose
  • Found on 2.3% of all websites
    • 0.7% of all sites use “Necessary” and another class

Unambiguous consent requirement of GDPR
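A check for conflicting purposes can be sketched by grouping one site's declarations by cookie name and flagging names with more than one purpose. The input format of (cookie name, purpose) pairs is an assumption.

```python
from collections import defaultdict


def conflicting_purposes(declarations: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Map each cookie declared with two or more purposes to those purposes."""
    purposes: dict[str, set[str]] = defaultdict(set)
    for name, purpose in declarations:
        purposes[name].add(purpose)
    return {name: sorted(p) for name, p in purposes.items() if len(p) > 1}


# Illustrative declaration from a single site.
declarations = [
    ("_ga", "analytics"),
    ("sid", "necessary"),
    ("uid", "necessary"),
    ("uid", "advertising"),
]
print(conflicting_purposes(declarations))
```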

Image source: https://commons.wikimedia.org/wiki/File:Twemoji2_1f36a.svg
The Twemoji cookie image is licensed under the Creative Commons Attribution 4.0 International license, and has been altered from its original form. We claim no ownership of the image.

Outlier purpose from majority opinion

  • Compute the majority label for each third-party cookie, then find declarations that do not match it
  • Outliers found on 30.9% of all websites

Lower bound, indicates misbehavior
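The majority-opinion check might look like the sketch below: collect every site's declared purpose for a cookie, take the most common label, and report the declarations that deviate from it. The simple-majority threshold `min_share` is a hypothetical parameter, not a value stated on the slide.

```python
from collections import Counter, defaultdict


def purpose_outliers(
    declarations: list[tuple[str, str, str]],
    min_share: float = 0.5,
) -> list[tuple[str, str, str, str]]:
    """Return (site, cookie, declared purpose, majority purpose) for each outlier.

    declarations holds (site, cookie name, declared purpose) triples.
    """
    by_cookie: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for site, name, purpose in declarations:
        by_cookie[name].append((site, purpose))
    outliers = []
    for name, entries in by_cookie.items():
        counts = Counter(purpose for _, purpose in entries)
        majority, votes = counts.most_common(1)[0]
        if votes / len(entries) <= min_share:
            continue  # no clear majority, so nothing can be called an outlier
        outliers.extend((site, name, purpose, majority)
                        for site, purpose in entries if purpose != majority)
    return outliers


declarations = [
    ("a.com", "_ga", "analytics"),
    ("b.com", "_ga", "analytics"),
    ("c.com", "_ga", "necessary"),
    ("d.com", "x", "advertising"),
    ("e.com", "x", "analytics"),
]
print(purpose_outliers(declarations))
```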

27 of 30

Violation detection: unclassified cookies, incorrect expiry

Incorrect expiry

  • Cookie expiry is 1.5 times longer than declared
  • Also includes cases of session cookies being persistent
  • On 13.5% of all domains

Violation in Planet49 case
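The expiry check can be sketched as a comparison between the declared and the observed lifetime. Representing session cookies as `None`, and treating "1.5 times longer" as strictly more than 1.5 times the declared value, are assumptions of this sketch.

```python
from typing import Optional


def expiry_violation(
    declared_seconds: Optional[float],
    observed_seconds: Optional[float],
    factor: float = 1.5,
) -> bool:
    """True if the observed lifetime clearly exceeds the declared one.

    None stands for a session cookie (no persistent expiry).
    """
    if declared_seconds is None:
        # Declared as a session cookie but observed as persistent.
        return observed_seconds is not None
    if observed_seconds is None:
        return False
    return observed_seconds > factor * declared_seconds


print(expiry_violation(3600, 7200))     # observed lifetime is twice the declared one
print(expiry_violation(None, 86400))    # declared session cookie that persists
```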

Image source: https://commons.wikimedia.org/wiki/File:Twemoji_1f565.svg
The Twemoji clock image is licensed under the Creative Commons Attribution 4.0 International license. We claim no ownership of the image.

Unclassified cookies

  • Unclassified in the declaration on 25.4% of all websites
  • ~4% of declarations were unclassified
  • Cannot be rejected in Cookiebot notice

Informed consent requirement of GDPR

28 of 30

Violation statistics, repeated results after 1 year

May 2021 crawl (29,206 websites)

Cookiebot: 45.8%, OneTrust: 52.1%, Termly: 2.2%

July 2022 crawl (52,162 websites)

Cookiebot: 57.9%, OneTrust: 39.6%, Termly: 2.6%

29 of 30

Violation statistics, grouped by CMP

May 2021 crawl

July 2022 crawl

30 of 30

Presentation Authors:

Dino Bollinger, dino.bollinger@gmail.com
Karel Kubicek, karel.kubicek@inf.ethz.ch

Team:

Dino Bollinger, Karel Kubicek, Dr. Carlos Cotrini, Prof. David Basin

ETH Zurich

D-INFK

Institute of Information Security

https://informationsecurity.ethz.ch/