Multilingual Scraper of Privacy Policies and Terms of Service
David Bernhard, Luka Nenadic, Stefan Bechtold, Karel Kubicek
CSLAW 2025, March 27
Motivation
McDonald & Cranor, 2008
Background
Contributions
Methods
1. Discovery:
2. Classification (discovered page → {policy, other}):
3. Extraction:
4. Result:
120 GB/year
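The four steps above can be read as a simple per-website pipeline. The following is a minimal sketch of that flow, not the authors' implementation: the `Page` type, `run_pipeline`, and the keyword-based classifier are hypothetical stand-ins chosen only to make the control flow concrete.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Page:
    """A candidate page found on a website (hypothetical type)."""
    url: str
    html: str


def run_pipeline(
    candidates: Iterable[Page],
    is_policy: Callable[[Page], bool],
    extract_text: Callable[[Page], str],
) -> Optional[dict]:
    """Return the first candidate classified as a policy, with its text."""
    for page in candidates:            # 1. Discovery: candidate pages of a site
        if is_policy(page):            # 2. Classification: policy vs. other
            text = extract_text(page)  # 3. Extraction: text of the document
            return {"url": page.url, "text": text}  # 4. Result
    return None  # nothing on this site was classified as a policy


# Toy stand-ins (assumptions): a keyword check and pass-through extraction.
pages = [
    Page("https://example.com/about", "About us"),
    Page("https://example.com/privacy", "Privacy Policy: we process personal data"),
]
result = run_pipeline(
    pages,
    is_policy=lambda p: "privacy policy" in p.html.lower(),
    extract_text=lambda p: p.html.strip(),
)
```

Passing the classifier and extractor as callables keeps the sketch independent of any particular model or HTML parser.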
Sample
Evaluation
First thorough end-to-end evaluation of this type of scraper’s performance:
Results policies
Overall F1-score: 79%
[Chart: outcome shares. True positive 63%, true negative 10%; the remaining segments are 9%, 15%, and 3%, covering false positive, false negative, and wrong document (segment-to-label assignment not preserved).]
True positive (policy present, scraper found it)
True negative (no policy present, scraper found none)
False positive (scraper found a policy but there is none)
False negative (scraper found nothing but there is a policy)
Wrong document (e.g., cookie policy)
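For reference, an F1-score combines precision and recall computed from outcome counts like the categories listed above. The sketch below uses the standard definition with hypothetical counts. How wrong-document cases enter the score is not stated on the slide; folding them into the false positives here is purely an assumption for illustration.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0  # no correct detections, so F1 is zero by convention
    precision = tp / (tp + fp)  # share of reported documents that are correct
    recall = tp / (tp + fn)     # share of existing documents that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, not the evaluation data from the talk.
true_positive = 6
false_positive = 1
wrong_document = 1  # assumption: counted as an additional false positive
false_negative = 2

score = f1_score(
    tp=true_positive,
    fp=false_positive + wrong_document,
    fn=false_negative,
)
```

True negatives do not appear in the formula, which is why F1 is often preferred over accuracy when many websites have no document to find.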
Results terms
True negative (no terms present, scraper found none)
False positive (scraper found terms but there are none)
False negative (scraper found nothing but there are terms)
Wrong document (e.g., EULA, B2B terms)
True positive (terms present, scraper found them)
Overall F1-score: 75%
Applications and Future Work
Conclusion
Dataset, more info, contact at: https://karelkubicek.github.io/post/pptc
Backup slides
| Group | Country | rank 1k | rank 5k | rank 10k | rank 50k | rank 100k | rank 500k | rank 1M | rank 5M | Total websites |
| EN speaking | US | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 35000 |
| | GB (Great Britain) | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 35000 |
| Core EU | FR | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 5000 | 0 | 30000 |
| | DE | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 5000 | 0 | 30000 |
| Rest of EU | AT | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| | BE | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| | BG | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| | HR | 1000 | 4000 | 5000 | 5000 | 5000 | 0 | 0 | 0 | 20000 |
| | CY | 1000 | 4000 | 5000 | 5000 | 0 | 0 | 0 | 0 | 15000 |
| | CZ | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| | DK | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| | EE | 1000 | 4000 | 5000 | 5000 | 5000 | 0 | 0 | 0 | 20000 |
| | FI | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| | GR | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| | HU | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| | IE | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| | IT | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 5000 | 0 | 30000 |
| | LV | 1000 | 4000 | 5000 | 5000 | 5000 | 0 | 0 | 0 | 20000 |
| | LT | 1000 | 4000 | 5000 | 5000 | 5000 | 0 | 0 | 0 | 20000 |
| | LU | 1000 | 4000 | 5000 | 5000 | 5000 | 0 | 0 | 0 | 20000 |
| | MT | 1000 | 4000 | 5000 | 5000 | 0 | 0 | 0 | 0 | 15000 |
| | NL | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 5000 | 0 | 30000 |
| | PL | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 5000 | 0 | 30000 |
| | PT | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| | RO | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| | SK | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| | SI | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| | ES | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 5000 | 0 | 30000 |
| | SE | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| EFTA | CH | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| | IS | 1000 | 4000 | 5000 | 5000 | 0 | 0 | 0 | 0 | 15000 |
| | NO | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 0 | 0 | 25000 |
| | LI | 1000 | 4000 | 5000 | 5000 | 0 | 0 | 0 | 0 | 15000 |
| Other notable | Türkiye | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 5000 | | 30000 |
| | Russia | 1000 | 4000 | 5000 | 5000 | 5000 | 5000 | 5000 | | 30000 |
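One way to read the per-column numbers in the table is that each column is a disjoint popularity-rank range from which at most 5,000 websites are taken. This is an inference from the numbers, not something the slides state. Under that assumption, the quotas for a country that is present in every range can be reproduced as follows (`RANK_BOUNDS`, `bucket_quotas`, and `cap` are hypothetical names):

```python
# Upper rank bound of each column in the table (1k, 5k, ..., 5M).
RANK_BOUNDS = [1_000, 5_000, 10_000, 50_000, 100_000, 500_000, 1_000_000, 5_000_000]


def bucket_quotas(bounds: list[int], cap: int = 5_000) -> list[int]:
    """Websites taken per rank range, assuming disjoint ranges and a fixed cap."""
    quotas = []
    lower = 0
    for upper in bounds:
        # A range holds (upper - lower) ranks; take all of them up to the cap.
        quotas.append(min(cap, upper - lower))
        lower = upper
    return quotas


quotas = bucket_quotas(RANK_BOUNDS)
total = sum(quotas)  # matches the "Total websites" value of rows with all ranges filled
```

Rows with zeros in the higher ranges would then correspond to countries for which those ranges contributed no websites.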
ML performance
Training data:
Performance:
Related work often reports performance on non-real-world data
Evaluation
Results (Sankey plot)
Motivation
Motivation and Background
Credits
Paper authors: David Bernhard, Luka Nenadic, Stefan Bechtold, Karel Kubicek
Presenters:
Luka Nenadic
Karel Kubicek
Sharing without permission of the authors is not allowed.