1 of 20

Multilingual Scraper of Privacy Policies and Terms of Service

David Bernhard, Luka Nenadic, Stefan Bechtold, Karel Kubicek

CSLAW 2025, March 27

2 of 20

Motivation

McDonald & Cranor, 2008

3 of 20

Background

  • Research opportunities:
    1. Privacy policies (policies) and terms of service (terms) document stated website behavior
    2. Assessing the effect of new legislation such as the Digital Services Act
    3. Computational advancements and novel methods facilitate working with large text corpora
  • Previous studies are limited in several dimensions:
    • Small sample size (e.g., 50 platform terms in Lippi et al., 2019)
    • Focus on English and the United States (e.g., reported by: Arora et al., 2022; Ciclosi et al., 2023; Mhaidli et al., 2023)
  • Individual corpora are inefficient and impede the reproducibility of research findings
  • Lack of a large-scale, long-term, and multilingual database of website policies and terms

4 of 20

Contributions

  1. We developed and deployed a multilingual (37 languages) scraper of policies and terms
  2. We have run it monthly on around 800 000 websites since January 2024 and plan to continue
  3. We release the data to foster interdisciplinary research
  4. Our paper provides the first end-to-end evaluation of such a scraper

5 of 20

Methods

1. Discovery:

  • Navigation > search engine

6 of 20

Methods

1. Discovery:

  • Navigation > search engine > common URLs (example.com/policy)

2. Classification (discovered page → {policy, other}):

  • Labeled >1000 documents from the scrape
  • Trained multilingual distilled BERT model

3. Extraction:

  • Document body extraction (Readability library)
  • PDF extraction (pdfminer.six)

4. Result:

  • Clean Markdown document (stored in DB)
  • Raw HTML document (stored on disk)

120 GB/year
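
The discovery order above (navigation, then search engine, then common URLs) can be sketched as a fallback chain. This is a minimal sketch, not the authors' implementation: the function names, the `https://` scheme, and every path other than `/policy` are assumptions made for illustration.

```python
from typing import Callable, Iterable, Optional

# Only "/policy" appears on the slide; the other paths are assumed examples.
COMMON_PATHS = ["/policy", "/privacy", "/privacy-policy"]

# A strategy takes a domain and returns a candidate document URL, or None.
Strategy = Callable[[str], Optional[str]]


def common_url_candidates(domain: str, paths: Iterable[str] = COMMON_PATHS) -> list[str]:
    """Build candidate URLs such as https://example.com/policy."""
    return [f"https://{domain}{path}" for path in paths]


def discover(domain: str, strategies: list[Strategy]) -> Optional[str]:
    """Try each discovery strategy in priority order and return the first hit."""
    for strategy in strategies:
        url = strategy(domain)
        if url is not None:
            return url
    return None


# Usage with stub strategies standing in for navigation and search.
def by_navigation(domain: str) -> Optional[str]:
    return None  # pretend no footer link was found


def by_search_engine(domain: str) -> Optional[str]:
    return None  # pretend the search returned nothing


def by_common_urls(domain: str) -> Optional[str]:
    return common_url_candidates(domain)[0]  # a real scraper would probe each URL


print(discover("example.com", [by_navigation, by_search_engine, by_common_urls]))
```

Keeping each strategy behind the same signature means the priority order is just the order of the list, so a cheaper strategy always gets the first attempt.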

7 of 20

Sample

  • Websites drawn from the Chrome User Experience Report (CrUX), which accurately represents users (Ruth et al., 2022)
    • Sample gives popular websites and the long tail the same weight
    • All EU countries, EFTA, US, GB, TR, RU
      • 35 countries ✕ 8 rank buckets ✕ up to 5000 websites ≈ a set of 500 000 websites
  • Challenge with long-term study:
    • Keep scraping the same list (gradually outdating)?
    • Generate a new list every month (no continuity)?
    • We do both → 2 ✕ 500 000 websites ~ set of 800 000 websites
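
The sampling scheme above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the input format (a mapping from CrUX rank bucket to the sites in that bucket only) and both function names are hypothetical. The union step reflects one plausible reading of why 2 ✕ 500 000 yields about 800 000 websites, namely that the fixed list and the fresh list overlap.

```python
import random

# Cap per (country, rank bucket) cell, as stated on the slide.
PER_BUCKET = 5000


def sample_country(buckets: dict[int, list[str]], rng: random.Random) -> list[str]:
    """Draw up to PER_BUCKET sites from each rank bucket of one country.

    `buckets` maps a rank magnitude (1000, 5000, ...) to the sites that fall
    into that bucket only (hypothetical input format).
    """
    sample: list[str] = []
    for rank in sorted(buckets):
        sites = buckets[rank]
        k = min(PER_BUCKET, len(sites))  # small buckets are taken whole
        sample.extend(rng.sample(sites, k))
    return sample


def monthly_targets(fixed_list: set[str], fresh_list: set[str]) -> set[str]:
    """Scrape both lists; a site that appears in both is visited once."""
    return fixed_list | fresh_list


# Usage with synthetic domains.
rng = random.Random(0)
buckets = {
    1000: [f"top{i}.example" for i in range(1000)],
    5000: [f"mid{i}.example" for i in range(4000)],
    50000: [f"tail{i}.example" for i in range(40000)],
}
print(len(sample_country(buckets, rng)))
```

Sampling a fixed number per bucket, rather than proportionally to bucket size, is what gives the long tail the same weight as the most popular sites.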

8 of 20

Evaluation

First thorough end-to-end evaluation of this type of scraper’s performance:

  1. Random sample of 100 websites each for English, German, French, Italian, and Croatian
  2. Stratified sample across different popularity buckets
  3. Annotation of result (and potential error type) for each website’s policy and terms
  4. Assessment of potential risks > anonymization prior to dataset publication

9 of 20

Results policies

Overall F1-score: 79%

  • True positive (policy present, scraper found it): 63%
  • True negative (no policy present, scraper found none): 10%
  • False positive (scraper found a policy but there is none): 15%
  • False negative (scraper found nothing but there is a policy): 9%
  • Wrong document (e.g., cookie policy): 3%
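
For readers less familiar with the metric, the sketch below shows how an F1-score is derived from such outcome categories. It rests on assumptions the slides do not confirm: the "wrong document" category is counted as both a false positive and a false negative, and the counts are illustrative rather than taken from the evaluation. It is not meant to reproduce the reported score, whose aggregation is not described here.

```python
def f1_score(tp: int, fp: int, fn: int, wrong_doc: int = 0) -> float:
    """F1 from outcome counts.

    Assumption: a wrong document counts as both a false positive (something
    was returned that is not the target) and a false negative (the target
    itself was missed).
    """
    if tp == 0:
        return 0.0
    fp_total = fp + wrong_doc
    fn_total = fn + wrong_doc
    precision = tp / (tp + fp_total)
    recall = tp / (tp + fn_total)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only, not the figures from the evaluation.
print(round(f1_score(tp=60, fp=10, fn=20, wrong_doc=10), 3))  # 0.706
```

True negatives do not enter the formula, which is why F1 can differ noticeably from plain accuracy on the same sample.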

10 of 20

Results terms

Overall F1-score: 75%

  • True positive (terms present, scraper found them)
  • True negative (no terms present, scraper found none)
  • False positive (scraper found terms but there are none)
  • False negative (scraper found nothing but there are terms)
  • Wrong document (e.g., EULA, B2B terms)

11 of 20

Applications and Future Work

  • Scraper as a living long-term tool
  • Policies and terms in our database can benefit various interdisciplinary research:
    1. Evolution of policies and terms over time (e.g., effects of novel legislation)
    2. Discrepancy between stated and actual website behavior
    3. Linguistic heterogeneity
    4. Trends for less popular websites
    5. Interaction of website policies and terms

12 of 20

Conclusion

  • Studies analyzing website policies and terms lack comparability and reproducibility
  • We address this problem by introducing a multilingual, long-term, and large-scale scraper and database of policies and terms
  • Thorough end-to-end evaluation across 5 languages
  • Our hope: unified database for future interdisciplinary research

Dataset, more info, contact at: https://karelkubicek.github.io/post/pptc

13 of 20

Backup slides

14 of 20

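| Group | Countries | rank 1k | rank 5k | rank 10k | rank 50k | rank 100k | rank 500k | rank 1M | rank 5M | Total websites |
|---|---|---|---|---|---|---|---|---|---|---|
| EN speaking | US | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 35000 |
| EN speaking | GB (Great Britain) | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 35000 |
| Core EU | FR | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 5000 | 0 | 30000 |
| Core EU | DE | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 5000 | 0 | 30000 |
| Rest of EU | AT | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| Rest of EU | BE | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| Rest of EU | BG | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| Rest of EU | HR | 1000 | 4000 | 5000 | 5000 | 5000 | 0 | 0 | 0 | 20000 |
| Rest of EU | CY | 1000 | 4000 | 5000 | 5000 | 0 | 0 | 0 | 0 | 15000 |
| Rest of EU | CZ | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| Rest of EU | DK | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| Rest of EU | EE | 1000 | 4000 | 5000 | 5000 | 5000 | 0 | 0 | 0 | 20000 |
| Rest of EU | FI | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| Rest of EU | GR | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| Rest of EU | HU | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| Rest of EU | IE | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| Rest of EU | IT | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 5000 | 0 | 30000 |
| Rest of EU | LV | 1000 | 4000 | 5000 | 5000 | 5000 | 0 | 0 | 0 | 20000 |
| Rest of EU | LT | 1000 | 4000 | 5000 | 5000 | 5000 | 0 | 0 | 0 | 20000 |
| Rest of EU | LU | 1000 | 4000 | 5000 | 5000 | 5000 | 0 | 0 | 0 | 20000 |
| Rest of EU | MT | 1000 | 4000 | 5000 | 5000 | 0 | 0 | 0 | 0 | 15000 |
| Rest of EU | NL | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 5000 | 0 | 30000 |
| Rest of EU | PL | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 5000 | 0 | 30000 |
| Rest of EU | PT | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| Rest of EU | RO | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| Rest of EU | SK | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| Rest of EU | SI | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| Rest of EU | ES | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 5000 | 0 | 30000 |
| Rest of EU | SE | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| EFTA | CH | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| EFTA | IS | 1000 | 4000 | 5000 | 5000 | 0 | 0 | 0 | 0 | 15000 |
| EFTA | NO | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| EFTA | LI | 1000 | 4000 | 5000 | 5000 | 0 | 0 | 0 | 0 | 15000 |
| Other notable | Türkiye | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 5000 | 0 | 30000 |
| Other notable | Russia | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 5000 | 0 | 30000 |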

15 of 20

ML performance

Training data:

  • 415 positive and 133 negative samples of policies
    • negative sample also contains: cookie policies, privacy outline pages
  • 273 positive and 810 negative samples of terms
    • negative sample also contains: EULAs, B2B terms

Performance:

  • 93.2% and 92.3% accuracy for policies and terms respectively

Related work often reports performance on non-real-world data

  • positive samples: large policies datasets
  • negative samples: pages from random browsing
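
Because both training sets listed above are imbalanced, the reported accuracies are easier to read next to the share of the larger class. The short calculation below is a worked illustration only. It assumes the evaluation data has the same class balance as the training counts, which the slide does not state, and `majority_baseline` is a hypothetical helper name.

```python
def majority_baseline(positive: int, negative: int) -> float:
    """Accuracy of always predicting the larger class."""
    return max(positive, negative) / (positive + negative)


# Sample counts from the training data listed above.
policies = majority_baseline(positive=415, negative=133)
terms = majority_baseline(positive=273, negative=810)

print(f"policies: {policies:.1%}, terms: {terms:.1%}")
# policies: 75.7%, terms: 74.8%
```

Under that assumption, the reported 93.2% and 92.3% sit well above what a trivial classifier would reach on either task.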

16 of 20

Evaluation

17 of 20

Results (Sankey plot)

18 of 20

Motivation

  • Standard form contracts are extremely pervasive (Slawson, 1971)
  • Privacy policies (policies) and terms of service (terms) as typical website contracts
  • Consumers do not pay much attention to them (Bakos & Marotta-Wurgler, 2014; Obar & Oeldorf-Hirsch, 2020)
  • These legal documents are long and difficult to understand (Benoliel & Becher, 2019; Samples et al., 2024)
  • Previous research is often limited in two central dimensions
    • Small sample size (e.g., 50 platform terms in Lippi et al., 2019)
    • Focus on English and the United States (e.g., reported by: Arora et al., 2022; Ciclosi et al., 2023; Mhaidli et al., 2023)

19 of 20

Motivation and Background

20 of 20

Credits

Paper authors: David Bernhard, Luka Nenadic, Stefan Bechtold, Karel Kubicek

Presenters:

Luka Nenadic

  • PhD student at ETH Zurich
  • Funded by SNSF grant No. 10002634.
  • lnenadic@ethz.ch

Karel Kubicek

Sharing without permission of the authors is not allowed.
