1 of 35

Stop saving IP addresses (!?)

Matt Mathis

mattmathis@measurementlab

Original presentation: Mar 16 2022

2 of 35

Our Goal Today

A conversation
Balancing the tension between

Preserving or enhancing our data for the best possible analysis
while strengthening our commitments to our users' privacy

Our choices about

IP addresses
data annotations
and other metadata affect the data quality, utility and user privacy

We need to (periodically) review and reconsider our options

3 of 35

Outline

Problem statement
Review the current platform
Inventory possible approaches to provide the right level of client identification

Please add your suggestions

Inventory some risks

What might go wrong?

Inventory of research that might be impacted

Please point out other research lines that we might have missed

4 of 35

Our Privacy Statement

Current: M-Lab Collects, Archives and Publishes the IP addresses of all tests

Explicitly stated: M-Lab Privacy Policy
This was part of MLab's “Founding Architecture”

Crowd sourced Internet measurement data, indexed by IP address and date

However this policy is problematic for some communities and use cases

Internet regulators in privacy sensitive jurisdictions (primarily EU)
Device manufacturers/integrators that market to the same jurisdictions
Potential partners that need to understand Internet performance
Increasing general user awareness and concern about privacy

We need to review and reflect on our choices

5 of 35

We need to identify repeated tests from single clients

Deduping heavy hitters
Global Feb 2022 data
28M client IPs contribute 91M tests

10 clients contribute 2%

100 contribute 6%

1000 contribute 8%

3474 (0.012%) contribute 10%

10000 contribute 12%

1% contribute 30%

10% contribute 58%

We must identify repeated tests

And somehow thin them
Otherwise they bias measurements
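The concentration figures above are cumulative shares: the fraction of all tests contributed by the k busiest clients. A minimal sketch of that arithmetic follows. The function name `top_share` and the toy data are illustrative assumptions, not M-Lab code or M-Lab data.

```python
from collections import Counter

def top_share(client_ids, k):
    """Fraction of all tests contributed by the k busiest clients."""
    counts = Counter(client_ids)            # tests per client identifier
    total = sum(counts.values())
    top = sum(n for _, n in counts.most_common(k))
    return top / total

# Made-up toy data: one heavy hitter (11 tests) plus nine single-test clients.
tests = ["a"] * 11 + list("bcdefghij")
share = top_share(tests, 1)   # share of tests from the single busiest client
```

The identifier only has to be stable per client; it does not have to be the IP address itself.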

6 of 35

We need to strike a balance

(Partially) Redact IP addresses
Keep (or add) sufficient "anonymized identity"

Dedup repeated tests from heavy hitters; and
Continue to support important research that needs some sort of identifiers

But the real forcing function is the language in the Privacy Statement

It must accurately reflect our technical choices
It must be acceptable to GDPR lawyers and other consumers

Please contribute your thoughts to discuss@measurementlab.net

7 of 35

Some (weakly) related secondary problems

In some parts of the world IP addresses are very weak identifiers anyhow

Nearly all consumers are behind Carrier Grade NATs (CG-NAT)

Possibly many thousands of people sharing IP Addresses

In principle we would like to add some sort of client identifier

But this is potentially a worse privacy exposure than IP addresses
And can be forged to confuse our data

In the same areas, our /26 server footprint is prohibitive

We can't deploy at all, because we can't get enough addresses

For mobile measurement, Tower ID would be more useful

Tracking phone to tower associations leaks extreme PII, even without any other information
It would be better to track just Tower IDs

Discard IP addresses and all other device identifiers

8 of 35

The Current Platform

9 of 35

MLab overview (simplified)

[Diagram: platform overview. Labels shown:
Locate Service (incoming requests);
Fleet of Measurement Nodes (NDT + sidecar services: UUIDs, Annotator, TCP info, .pcap, traceroute, Pusher);
GCS Archive (NDT (UUIDs), Annotations, TCP info, .pcap, Traceroute);
ETL Pipeline (Parse, Join, Gardener);
BigQuery (Views);
Publish]

10 of 35

Elements of the Fleet

Network Diagnostic Test

Uses TCP to transfer data as fast as possible for 10 seconds
Instrument and log every detail

UUIDs: identifiers guaranteed to be unique for every TCP connection

Used as join keys

Annotator: Collect contemporary metadata

From Route Views: routing prefix, BGP origin AS Number
From MaxMind: Geolocation
We can extend it in the future

TCP info: state snapshots every 10 ms in the raw data, 100 ms in BQ
Full packet captures (.pcap files)
Traceroute: Scamper MDA traceroute of the path to the client
Pusher: Copies all data to Google Cloud Storage

11 of 35

Elements of the Pipeline

GCS Archive

Contains all raw data as produced by the fleet (12+ years, 3.7 PBytes!)
Published for pedantic completeness

Separate parser for each data type

NDT, UUIDs, Annotations, TCP info, .pcap, traceroute, etc

Join on UUIDs

Currently NDT and Annotations, more in the future

Views

Harmonize permissions and BQ schema

Gardener: reprocess everything periodically

Improvements to the pipeline
Newest/most important data every 2-6 weeks
Historical data from web100 fleet (2009-2019) every 6 months
But all processing rates are tunable

12 of 35

Approaches

13 of 35

General strategy

Add more information (TBD) to the annotations

For example add obfuscated (hashed) IP addresses

Can still detect multiple tests from a single client IP
Could even use multiple hashes with different properties

Annotation service was designed in anticipation of this use

Redact IP addresses everywhere

Several design choices as well

All bits or partial (e.g. keeping the high and clobbering the low bits)
Zeros, random or something else?
Which processing step?

For this presentation, (nearly) all options are on the table

What annotation to add
What address bits to remove or replace
And where to make the changes (Fleet or Pipeline)
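As one illustration of the "keep the high bits, clobber the low bits" option, here is a minimal sketch that zeros the low bits of an address. The function name `truncate_ip` and the /24 and /48 cut points are assumptions chosen for the example, not a proposed policy.

```python
import ipaddress

def truncate_ip(addr: str, keep_v4: int = 24, keep_v6: int = 48) -> str:
    """Keep the high bits of an address and replace the low bits with zeros."""
    ip = ipaddress.ip_address(addr)
    keep = keep_v4 if ip.version == 4 else keep_v6   # assumed cut points
    net = ipaddress.ip_network(f"{addr}/{keep}", strict=False)
    return str(net.network_address)

redacted_v4 = truncate_ip("203.0.113.77")            # documentation address
redacted_v6 = truncate_ip("2001:db8:1234:5678::1")   # documentation address
```

Replacing the low bits with random values instead of zeros would be a small change, but then the redacted field can no longer be used to group tests.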

14 of 35

Possible annotations

RouteViews (in production today)

Network routing prefix, AS Number, AS name, etc

MaxMind (in production today)

ISO 3166 GeoCoding: Country, region, city, postal code, etc
Fuzzed lat/long

Multiple choices for obfuscated IP addresses

Commonly: one_way_hash(IP, key)

Protocol fingerprinting

Nominally constant protocol features that differ by implementation

TCP options, Window Scale etc (generally appear in TCP info)
TLS/SSL/HTTPS negotiation (supported ciphers, parameters and compression)

We don't currently capture this information at all

Unknown privacy implications

Some fingerprints might be unexpectedly unique

15 of 35

Hashes to obfuscate IP addresses

Can still detect repeated tests from a single address
We want to include a changeable key, so we can change the hash periodically

Minimal defense against leaked or recovered keys
Otherwise the hash has to be good forever in order to preserve privacy forever

Tradeoff between

key lifetime
ability to investigate how the client population changes over time

A critical element of understanding how the Internet changes over time

Use multiple keys to generate multiple hashes

Update the keys on an alternating schedule
Can collect the union of all hashes seen for a given IP address, w/o exposing the IP address
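A minimal sketch of the two-hash idea, assuming HMAC-SHA256 as the keyed one-way hash and a 90 day key lifetime. Both choices, and the names `key_for`, `obfuscate_ip` and `annotate`, are illustrative assumptions, not the design under discussion. Key A rotates on period boundaries and key B rotates half a period later, so at least one hash stays constant across every rotation.

```python
import datetime
import hashlib
import hmac
import ipaddress

def key_for(day: datetime.date, master: bytes, period: int, offset: int) -> bytes:
    """Derive the key in force on `day`; it changes every `period` days."""
    epoch = (day.toordinal() + offset) // period
    return hmac.new(master, str(epoch).encode(), hashlib.sha256).digest()

def obfuscate_ip(addr: str, key: bytes) -> str:
    """Keyed one-way hash of the packed address (assumed: HMAC-SHA256)."""
    packed = ipaddress.ip_address(addr).packed
    return hmac.new(key, packed, hashlib.sha256).hexdigest()

def annotate(addr: str, day: datetime.date,
             master_a: bytes, master_b: bytes, period: int = 90) -> dict:
    """Two hashes whose keys rotate on an alternating schedule."""
    return {
        "hash_a": obfuscate_ip(addr, key_for(day, master_a, period, 0)),
        "hash_b": obfuscate_ip(addr, key_for(day, master_b, period, period // 2)),
    }
```

Because hash_b is unchanged on the day hash_a rotates (and vice versa), an analyst can chain the hashes seen for one address across rotations without ever seeing the address. The same property is what makes long key chains a privacy tradeoff.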

16 of 35

Possible annotations provided by the client

Servers can pass additional metadata from client to BQ via raw data and parsers

Currently used by some clients to provide OS and browser major version
Clients might provide some additional information

Warrants extremely careful auditing

For example a client UUID

Has its own set of privacy and security issues

Not really under consideration, but included for completeness

Standard fuzzed versions of high precision HTML5 location

e.g. geohash
Do the mapping on the client without leaving clues in the archived data
Census tracts (US only)
Nomenclature of Territorial Units for Statistics (NUTS - Europe only)

All require client changes that may be hard to deploy
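For reference, a geohash encoder is small enough to run on the client, so only the coarse cell ever leaves the device. The sketch below implements the standard public geohash encoding. The default precision of 5 characters (a cell of a few kilometers) is an assumption for illustration, not a recommendation.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet

def geohash(lat: float, lon: float, precision: int = 5) -> str:
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, nbits = 0, 0
    use_lon = True                      # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lon = not use_lon
        nbits += 1
        if nbits == 5:                  # every 5 bits become one base32 character
            chars.append(_BASE32[bits])
            bits, nbits = 0, 0
    return "".join(chars)
```

Shortening the string coarsens the cell, so the precision can be chosen per deployment without changing the encoding.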

17 of 35

Redacting IP addresses in the Fleet

Advantages

Very strong (future) statements about privacy

e.g. "MLab does not collect IP addresses"

We can absolutely conform to all GDPR style privacy policies

Disadvantages

Annotations must be applied before the redaction

Also on the fleet
May incur missing attributes if the annotator needs to access external data

We can NOT retroactively improve annotations

Contrary to the philosophy of the rest of the data in the pipeline
Any attributes that we overlook now can never be added retroactively

18 of 35

Redacting IP addresses in the parser

Advantages

We don't forfeit any annotation data
We can change our minds later about any and all annotations or obfuscations

Disadvantages

Weaker statements about privacy

We can remove "IP addresses are published"
We can only change to "IP addresses are securely collected and archived"

Need a separate privacy strategy for the raw data archived in GCS

Note that the sheer size of the Archive makes searching a challenge

3.7+ PetaBytes, growing by ~4 TeraBytes/day

But this will certainly be used as part of any transitional strategy

We will prototype any future fleet based anonymization in the parser

Get some real field experience with redacted data
Long term hybrid solutions are also possible

19 of 35

Acceptable Use Agreements (AUA)

Standard for nearly everybody else, but MLab does not have one: open is open

Enforcing an AUA requires managed ACLs

Several different versions are in use in the community

Only share summaries of sensitive data
Explicit prohibition of attempting to deanonymize obfuscated data

Our raw archive is most problematic

We may want an AUA on all historical raw data with non-redacted IP addresses

In principle we could support multiple versions of the BQ tables

No NDA on data with completely redacted IP addresses

This applies to the statistics pipeline as well

Suitable NDAs on data with obfuscated or clear text IP address

ACLs and AUAs do not change "collect and archive IP addresses"

All only modify "publish"
Is this enough?

20 of 35

Server side NAT (aka S-NAT, L4 routing, etc)

Load Balancers that rewrite the connection 4 tuple in front of servers

Noticed as part of piloting NDT on Google Cloud

GCP servers have private RFC 1918 addresses, which are rewritten at the cloud edge
This is standard practice for a huge class of industrial web servers
Linux kernel includes a built in full function NAT (used in OpenWrt and elsewhere)

MLab could use S-NAT to isolate public and private address spaces

Replace switches with L4 load balancers, or add S-NAT to the server root container
Might also address the /26 footprint problem

Requires replumbing the Annotation service

It needs to get the address from the public side of the NAT

Many corner cases to be considered and tested in the short term

ICMP, ICMP6, traceroute, etc (all carrying transcribed IP headers, including IP-in-IP tunnels)

Too many high risk problems to deploy anytime soon

Save for a future iteration(?)

22 of 35

Potential Leaks

IP addresses leaking because we missed something

The header checksum in .pcaps can be used to recover missing IP addresses

You only need to guess the server IP address

Headers inside of ICMP message

This is especially problematic for traceroute

What else, that we don't know about?

Cryptographic attacks on poor choices of hash
Many AUA include prohibitions on reverse engineering and publishing flaws
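To make the checksum leak concrete, here is a toy sketch under stated assumptions: only the low 16 bits of the client address were zeroed, the header checksum was left untouched, and the client is the packet destination. The header field values and function names are invented for the example. The IPv4 header checksum is a 16-bit ones' complement sum (RFC 1071), so it supplies one 16-bit equation. That is enough to recover one missing 16-bit word, while a fully zeroed 32-bit address is only constrained, not determined, by this checksum alone.

```python
import ipaddress
import struct

def ones_complement_sum(data: bytes) -> int:
    """16-bit ones' complement sum with end-around carry."""
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

def build_header(src: str, dst: str) -> bytes:
    """Minimal 20-byte IPv4 header with a valid checksum (toy field values)."""
    hdr = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 0x1234, 0, 64, 6, 0,
                      ipaddress.IPv4Address(src).packed,
                      ipaddress.IPv4Address(dst).packed)
    checksum = ~ones_complement_sum(hdr) & 0xFFFF
    return hdr[:10] + struct.pack("!H", checksum) + hdr[12:]

def redact_client_low16(hdr: bytes) -> bytes:
    """Zero the low 16 bits of the destination address, leave the checksum."""
    return hdr[:18] + b"\x00\x00" + hdr[20:]

def recover_client(redacted: bytes) -> str:
    """Solve the checksum equation for the single zeroed 16-bit word."""
    missing = ~ones_complement_sum(redacted) & 0xFFFF
    return str(ipaddress.IPv4Address(redacted[16:18] + struct.pack("!H", missing)))
```

The recovery is ambiguous only when the missing word is 0x0000 or 0xFFFF, which are the same value in ones' complement arithmetic. Recomputing or zeroing the checksum during redaction closes this particular leak.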

23 of 35

What about traceroute?

The last hop near the destination is very likely to have a unique IP address
Some researchers studying traceroute find obfuscated data useless

Can't disambiguate multiple addresses on one device or subnet without full addresses

24 of 35

IPv4 specific risks

Generic exposure to brute force attacks

The entire address space can be scanned against keys or hashes

Critically important that the AS Number and/or routing prefix are preserved

Required to map results to ISP and network "location"

All designs invite "plaintext attacks" and differential cryptanalysis
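A sketch of why the IPv4 space is exposed, assuming the attacker has obtained the key (or the hash is unkeyed) and that HMAC-SHA256 is the hash. Both are assumptions for illustration, as are the function names. If the routing prefix is published alongside the hash, as the analysis requires, the search shrinks from 2^32 candidates to the size of that prefix.

```python
import hashlib
import hmac
import ipaddress

def obfuscate(ip_int: int, key: bytes) -> bytes:
    """Keyed hash of a 32-bit IPv4 address (assumed: HMAC-SHA256)."""
    return hmac.new(key, ip_int.to_bytes(4, "big"), hashlib.sha256).digest()

def brute_force(target: bytes, key: bytes, prefix: str):
    """Try every address in `prefix` until one hashes to `target`."""
    net = ipaddress.ip_network(prefix)
    first, last = int(net.network_address), int(net.broadcast_address)
    for ip_int in range(first, last + 1):
        if obfuscate(ip_int, key) == target:
            return str(ipaddress.ip_address(ip_int))
    return None

# A /16 is only 65,536 candidates.
key = b"leaked-key"                                       # hypothetical key
victim = int(ipaddress.ip_address("198.51.100.23"))       # example address
found = brute_force(obfuscate(victim, key), key, "198.51.0.0/16")
```

Rotating keys limits how much data a single leaked key exposes, but does nothing for the data already hashed under that key.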

25 of 35

IPv6 specific risks

Exposing IPv6 addresses at all can seed scanners and other attacks
RFC 4291 and RFC 4862 (SLAAC) specified deriving the IPv6 interface identifier from the Ethernet MAC address

Stable over time, leaks manufacturer and device age to potential attackers

RFC 4941 specified changing IPv6 addresses (relatively slowly) over time

Addresses are stable longer than necessary

RFC 8981 specifies the use of temporary addresses

Clients (or applications) can choose to have very short address lifetimes

Even without address redaction it is sometimes hard to identify requests from one client
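To illustrate the MAC-derived case: in the modified EUI-64 format the interface identifier is the MAC address with ff:fe inserted in the middle and the universal/local bit flipped, so the mapping can be reversed by anyone who sees the address. The sketch below shows that reversal. The function name is an illustrative assumption, and a match on the ff:fe pattern is only a strong hint, not proof, that the address was built this way.

```python
import ipaddress

def mac_from_slaac(addr: str):
    """Recover the MAC from a modified EUI-64 interface identifier, if present."""
    iid = ipaddress.IPv6Address(addr).packed[8:]   # low 64 bits of the address
    if iid[3:5] != b"\xff\xfe":                    # EUI-64 marker bytes absent
        return None
    # Undo the universal/local bit flip and drop the inserted ff:fe.
    mac = bytes([iid[0] ^ 0x02]) + iid[1:3] + iid[5:8]
    return ":".join(f"{b:02x}" for b in mac)

leaked = mac_from_slaac("2001:db8::211:22ff:fe33:4455")   # documentation prefix
```

The first three bytes of the recovered MAC are the vendor's OUI, which is where the manufacturer and rough device age come from.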

26 of 35

Impaired Experiments

27 of 35

Understanding repeated tests from single IPs

Many NDT integrations do periodic measurements

e.g. self tuning home routers
1-4 per day are common

Some usage patterns are probably WiFi debugging or mapping

dozens of tests over a few hours with widely variable results

Some usage patterns appear quite abusive

Is MLab being used to DDoS something?

Summary statistics need to down sample or exclude repeated tests

Otherwise they bias our measurements
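One simple way to down sample is to keep a single randomly chosen test per client per day. The sketch below assumes each test is a dict with `client` and `day` fields. Those field names, the one-per-day rule and the function name `thin` are all illustrative assumptions. The `client` value can be any stable identifier, including an obfuscated one.

```python
import random
from collections import defaultdict

def thin(tests, seed=0):
    """Keep one randomly chosen test per (client, day) group."""
    groups = defaultdict(list)
    for t in tests:
        groups[(t["client"], t["day"])].append(t)
    rng = random.Random(seed)          # seeded so the sample is reproducible
    return [rng.choice(group) for group in groups.values()]

# Made-up data: a heavy hitter with 50 tests in one day, and one ordinary client.
tests = [{"client": "h1", "day": "2022-02-01", "mbps": 90 + i} for i in range(50)]
tests.append({"client": "h2", "day": "2022-02-01", "mbps": 12})
sample = thin(tests)
```

Picking at random rather than taking the first or the best test avoids adding a new bias of its own.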

28 of 35

Understanding shared IP addresses

Home NATs (Network Address Translation)

Detectable today when clients have different TCP fingerprints

Clients are from different vendors or vintages

Carrier Grade NAT

Typically an entire region shares one small block of addresses

This is the default in some parts of the world

Detectable from wide variation of TCP fingerprints, minRTT, maxBW
Same statistics from several consecutive IP addresses

Not on an address block boundary
Obscuring adjacent addresses breaks "counting" experiments

Tunnels

The tunnel exit must be a public IP address

in order for returning ACKs to be routed properly

Geo information is often confused because the user is trying to hide their true location

29 of 35

Understanding IP address stability

Most (?) consumer ISPs use dynamic address assignments

The addresses change from time to time

Daily is common in parts of Europe
I typically see new addresses about twice per year at my own house
Some people report new addresses on every modem reboot

Can only infer from address lifetime and turnover rates

Confounded by address obfuscation

30 of 35

Validating MLab itself

Find stable sets of beacons with stable test patterns
Canary new server code on a few servers
Compare data between server versions
We have semi-automatic tools to do this now

Will be able to do it continuously for all deployments some time in the future

31 of 35

Curated clients

A number of researchers work with known sets of measurement clients

Privacy is not an issue

It is super useful to be able to easily pick out your own data
Alternatives include:

Client annotations
Logging test UUIDs

32 of 35

Questions and discussions

What techniques have we overlooked?

Are there additional techniques that we should add to our toolbox?

What downsides have we overlooked?

Other experiments that we might impair?
Other risks that we should consider?

34 of 35

But what is PII?

GDPR Article 4, Definitions, paragraph 1:

‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;

Do we have a person? Can our data indirectly identify a person? I think not....

Can somebody help us find a strong answer to this question?

35 of 35

My wish

Our official statement could become:

MLab collects IP addresses but does not collect any information that might be used to directly or indirectly identify you, the user.

This statement is already true, we just don't use the words.