1 of 20

Multilingual Scraper of Privacy Policies and Terms of Service

David Bernhard, Luka Nenadic, Stefan Bechtold, Karel Kubicek

CSLAW 2025, March 27

2 of 20

Motivation

McDonald & Cranor, 2008

3 of 20

Background

Research opportunities:

Privacy policies (policies) and terms of service (terms) document stated website behavior
Assessing the effect of new legislation such as the Digital Services Act
Computational advancements and novel methods facilitate working with large text corpora

Previous studies are limited in several dimensions:

Small sample size (e.g., 50 platform terms in Lippi et al., 2019)
Focus on English and the United States (e.g., reported by: Arora et al., 2022; Ciclosi et al., 2023; Mhaidli et al., 2023)

Individual corpora are inefficient and impede the reproducibility of research findings
Lack of a large-scale, long-term, and multilingual database of website policies and terms

4 of 20

Contributions

We developed and deployed a multilingual (37 languages) scraper of policies and terms
We have run it monthly on around 800 000 websites since January 2024 and plan to continue
We release the data to foster interdisciplinary research
Our paper provides the first end-to-end evaluation of such a scraper

5 of 20

Methods

1. Discovery:

Navigation > search engine

6 of 20

Methods

1. Discovery:

Navigation > search engine > common URLs (example.com/policy)

2. Classification (discovered page → {policy, other}):

Labeled >1000 documents from the scrape
Trained multilingual distilled BERT model

3. Extraction:

Document body extraction (Readability library)
PDF extraction (pdfminer.six)

4. Result:

Clean Markdown document (stored in DB)

Raw HTML document (stored on disk)

120 GB/year
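
The discovery step can be read as a fallback cascade. The sketch below assumes that ">" means "tried before"; the strategy functions, the path list, and the `looks_like_policy` check are hypothetical stand-ins for browser navigation, a search-engine query, and the trained classifier, so the example runs without network access.

```python
from typing import Callable, Iterable, Optional

# A strategy maps a domain to candidate URLs (hypothetical interface).
Strategy = Callable[[str], Iterable[str]]

# Illustrative paths only; the slide names just example.com/policy.
COMMON_PATHS = ["policy", "privacy", "privacy-policy"]


def common_urls(domain: str) -> list[str]:
    """Last-resort strategy: guess well-known paths on the domain."""
    return [f"https://{domain}/{path}" for path in COMMON_PATHS]


def discover(
    domain: str,
    strategies: list[Strategy],
    looks_like_policy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate the classifier accepts, or None."""
    for strategy in strategies:
        for url in strategy(domain):
            if looks_like_policy(url):
                return url
    return None


# Stubs standing in for homepage navigation and a search-engine query.
def navigation(domain: str) -> list[str]:
    return []  # pretend the homepage links to no policy


def search_engine(domain: str) -> list[str]:
    return [f"https://{domain}/about", f"https://{domain}/legal/privacy"]


found = discover(
    "example.com",
    [navigation, search_engine, common_urls],
    looks_like_policy=lambda url: "privacy" in url,
)
print(found)  # https://example.com/legal/privacy
```

Keeping each strategy behind the same interface means the cheaper or more reliable source is always consulted first, and the common-URL guess is only used when the others return nothing usable.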

7 of 20

Sample

Chrome User Experience Report (CrUX) (accurately represents users; Ruth et al., 2022)

Sample representing both popular websites and the long tail with equal weight
All EU countries, EFTA, US, GB, TR, RU

35 countries ✕ up to 8 rank buckets ✕ up to 5000 websites each ~ set of 500 000 websites

Challenge with long-term study:

Keep scraping the same list (which gradually becomes outdated)?
Generate a new list every month (no continuity)?
We do both → 2 ✕ 500 000 websites ~ set of 800 000 websites
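
The combined size follows from inclusion-exclusion: if the fixed list and the current month's list each hold about 500 000 websites and their union holds about 800 000, roughly 200 000 websites appear on both. A small sketch of that arithmetic, using the rounded figures from the slide (so the result is approximate and will vary from month to month); the function name is illustrative.

```python
def overlap_size(size_a: int, size_b: int, size_union: int) -> int:
    """|A ∩ B| = |A| + |B| - |A ∪ B| (inclusion-exclusion)."""
    return size_a + size_b - size_union


fixed_list = 500_000    # list kept constant since the first scrape
monthly_list = 500_000  # list regenerated each month
combined = 800_000      # size of the union reported on the slide

shared = overlap_size(fixed_list, monthly_list, combined)
print(shared)                 # 200000
print(shared / monthly_list)  # 0.4
```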

8 of 20

Evaluation

First thorough end-to-end evaluation of this type of scraper’s performance:

Random sample of 100 websites each for English, German, French, Italian, and Croatian
Stratified sample across different popularity buckets
Annotation of result (and potential error type) for each website’s policy and terms
Assessment of potential risks > anonymization prior to dataset publication

9 of 20

Results policies

Overall F1-score: 79%

True positive (policy present, scraper found it): 63%
True negative (no policy present, scraper found none): 10%
False positive (scraper found a policy but there is none)
False negative (scraper found nothing but there is a policy)
Wrong document (e.g., cookie policy)

Other chart values, in slide order: 9%, 15%, 3%
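
For reference, the F1-score combines precision and recall into one number. The sketch below uses the standard definition with hypothetical counts; how the "wrong document" category is mapped onto false positives and false negatives is not stated on the slide, so the inputs are not taken from the chart.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical counts, NOT the values from the chart.
print(f1_score(tp=60, fp=10, fn=20))  # 0.8
```

True negatives do not enter the formula, so websites that correctly yield no document neither raise nor lower the score.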

10 of 20

Results terms

Overall F1-score: 75%

True positive (terms present, scraper found them)
True negative (no terms present, scraper found none)
False positive (scraper found terms but there are none)
False negative (scraper found nothing but there are terms)
Wrong document (e.g., EULA, B2B terms)

11 of 20

Applications and Future Work

Scraper as a living long-term tool
Policies and terms in our database can benefit various interdisciplinary research:

Evolution of policies and terms over time (e.g., effects of novel legislation)
Discrepancy between stated and actual website behavior
Linguistic heterogeneity
Trends for less popular websites
Interaction of website policies and terms

12 of 20

Conclusion

Studies analyzing website policies and terms lack comparability and reproducibility
We address this problem by introducing a multilingual, long-term, and large-scale scraper and database of policies and terms
Thorough end-to-end evaluation across 5 languages
Our hope: unified database for future interdisciplinary research

Dataset, more info, contact at: https://karelkubicek.github.io/post/pptc

13 of 20

Backup slides

14 of 20

Group	Country	rank 1k	rank 5k	rank 10k	rank 50k	rank 100k	rank 500k	rank 1M	rank 5M	Total websites
EN speaking	US	1000	4000	5000	5000	5000	5000	5000	5000	35000
EN speaking	GB (Great Britain)	1000	4000	5000	5000	5000	5000	5000	5000	35000
Core EU	FR	1000	4000	5000	5000	5000	5000	5000	0	30000
Core EU	DE	1000	4000	5000	5000	5000	5000	5000	0	30000
Rest of EU	AT	1000	4000	5000	5000	5000	5000	0	0	25000
Rest of EU	BE	1000	4000	5000	5000	5000	5000	0	0	25000
Rest of EU	BG	1000	4000	5000	5000	5000	5000	0	0	25000
Rest of EU	HR	1000	4000	5000	5000	5000	0	0	0	20000
Rest of EU	CY	1000	4000	5000	5000	0	0	0	0	15000
Rest of EU	CZ	1000	4000	5000	5000	5000	5000	0	0	25000
Rest of EU	DK	1000	4000	5000	5000	5000	5000	0	0	25000
Rest of EU	EE	1000	4000	5000	5000	5000	0	0	0	20000
Rest of EU	FI	1000	4000	5000	5000	5000	5000	0	0	25000
Rest of EU	GR	1000	4000	5000	5000	5000	5000	0	0	25000
Rest of EU	HU	1000	4000	5000	5000	5000	5000	0	0	25000
Rest of EU	IE	1000	4000	5000	5000	5000	5000	0	0	25000
Rest of EU	IT	1000	4000	5000	5000	5000	5000	5000	0	30000
Rest of EU	LV	1000	4000	5000	5000	5000	0	0	0	20000
Rest of EU	LT	1000	4000	5000	5000	5000	0	0	0	20000
Rest of EU	LU	1000	4000	5000	5000	5000	0	0	0	20000
Rest of EU	MT	1000	4000	5000	5000	0	0	0	0	15000
Rest of EU	NL	1000	4000	5000	5000	5000	5000	5000	0	30000
Rest of EU	PL	1000	4000	5000	5000	5000	5000	5000	0	30000
Rest of EU	PT	1000	4000	5000	5000	5000	5000	0	0	25000
Rest of EU	RO	1000	4000	5000	5000	5000	5000	0	0	25000
Rest of EU	SK	1000	4000	5000	5000	5000	5000	0	0	25000
Rest of EU	SI	1000	4000	5000	5000	5000	5000	0	0	25000
Rest of EU	ES	1000	4000	5000	5000	5000	5000	5000	0	30000
Rest of EU	SE	1000	4000	5000	5000	5000	5000	0	0	25000
EFTA	CH	1000	4000	5000	5000	5000	5000	0	0	25000
EFTA	IS	1000	4000	5000	5000	0	0	0	0	15000
EFTA	NO	1000	4000	5000	5000	5000	5000	0	0	25000
EFTA	LI	1000	4000	5000	5000	0	0	0	0	15000
Other notable	TR (Türkiye)	1000	4000	5000	5000	5000	5000	5000	0	30000
Other notable	RU (Russia)	1000	4000	5000	5000	5000	5000	5000	0	30000
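
The row totals in this table can be cross-checked with a few lines of Python. The sketch groups countries by how many leading rank buckets are filled (counts read off the table) and sums the quotas. The grand total of 870 000 exceeds the roughly 500 000 websites reported for the sample; one plausible explanation, stated here as an assumption, is that websites ranked in several countries are counted only once in the final set.

```python
# Bucket quotas per country, in column order (rank 1k ... rank 5M).
BUCKET_QUOTAS = [1000, 4000, 5000, 5000, 5000, 5000, 5000, 5000]

# Number of countries by how many leading buckets are filled in the table,
# e.g. US and GB fill all 8; CY, MT, IS, and LI fill only the first 4.
COUNTRIES_BY_DEPTH = {8: 2, 7: 8, 6: 16, 5: 5, 4: 4}


def country_total(depth: int) -> int:
    """Total websites for a country that fills the first `depth` buckets."""
    return sum(BUCKET_QUOTAS[:depth])


n_countries = sum(COUNTRIES_BY_DEPTH.values())
grand_total = sum(n * country_total(d) for d, n in COUNTRIES_BY_DEPTH.items())

print(n_countries)       # 35
print(country_total(8))  # 35000
print(grand_total)       # 870000
```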

15 of 20

ML performance

Training data:

415 positive and 133 negative samples of policies

negative samples also contain: cookie policies, privacy outline pages

273 positive and 810 negative samples of terms

negative samples also contain: EULAs, B2B terms

Performance:

93.2% and 92.3% accuracy for policies and terms respectively

Related work often reports performance on non-real-world data

positive samples: large policies datasets
negative samples: pages from random browsing
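
Because both training sets are imbalanced, a useful point of comparison for the reported accuracies is the majority-class baseline, that is, the accuracy of always predicting the more frequent label. The sketch below computes it from the sample counts on this slide; it assumes the evaluation data has a similar class balance, which the slide does not state.

```python
def majority_baseline(positives: int, negatives: int) -> float:
    """Accuracy obtained by always predicting the more frequent class."""
    return max(positives, negatives) / (positives + negatives)


policies = majority_baseline(positives=415, negatives=133)
terms = majority_baseline(positives=273, negatives=810)

print(f"{policies:.1%}")  # 75.7%
print(f"{terms:.1%}")     # 74.8%
```

Under that assumption, the reported 93.2% and 92.3% sit well above what label frequency alone would give.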

16 of 20

Evaluation

17 of 20

Results (Sankey plot)

18 of 20

Motivation

Standard form contracts are extremely pervasive (Slawson, 1971)
Privacy policies (policies) and terms of service (terms) as typical website contracts
Consumers do not pay much attention to them (Bakos & Marotta-Wurgler, 2014; Obar & Oeldorf-Hirsch, 2020)
These legal documents are long and difficult to understand (Benoliel & Becher, 2019; Samples et al., 2024)
Previous research is often limited in two central dimensions

Small sample size (e.g., 50 platform terms in Lippi et al., 2019)
Focus on English and the United States (e.g., reported by: Arora et al., 2022; Ciclosi et al., 2023; Mhaidli et al., 2023)

19 of 20

Motivation and Background

20 of 20

Credits

Paper authors: David Bernhard, Luka Nenadic, Stefan Bechtold, Karel Kubicek

Presenters:

Luka Nenadic

PhD student at ETH Zurich
Funded by SNSF grant No. 10002634.
lnenadic@ethz.ch

Karel Kubicek

Postdoc at INRIA, Privatics team
Funded by SNSF Postdoc.Mobility grant No. P500PT_225449
karel.kubicek@inria.fr
https://karelkubicek.github.io

Sharing without permission of the authors is not allowed.
