1 of 35


Stop saving IP addresses (!?)

Matt Mathis

mattmathis@measurementlab

Original presentation: Mar 16 2022


2 of 35

Our Goal Today

  • A conversation
  • Balancing the tension between
    • Preserving or enhancing our data for the best possible analysis
    • while strengthening our commitments to our users' privacy
  • Our choices about
    • IP addresses
    • data annotations
    • and other metadata
  • all affect data quality, utility and user privacy
  • We need to (periodically) review and reconsider our options


3 of 35

Outline

  • Problem statement
  • Review the current platform
  • Inventory possible approaches to provide the right level of client identification
    • Please add your suggestions
  • Inventory some risks
    • What might go wrong?
  • Inventory of research that might be impacted
    • Please point out other research lines that we might have missed


4 of 35

Our Privacy Statement

  • Current: M-Lab Collects, Archives and Publishes the IP addresses of all tests
    • Explicitly stated: M-Lab Privacy Policy
    • This was part of MLab's “Founding Architecture”
      • Crowd sourced Internet measurement data, indexed by IP address and date
  • However this policy is problematic for some communities and use cases
    • Internet regulators in privacy sensitive jurisdictions (primarily EU)
    • Device manufacturers/integrators that market to the same jurisdictions
    • Potential partners that need to understand Internet performance
    • Increasing general user awareness and concern about privacy
  • We need to review and reflect on our choices


5 of 35

We need to identify repeated tests from single clients

  • Deduping heavy hitters
  • Global Feb 2022 data
  • 28M client IPs contribute 91M tests
    • 10 clients contribute 2%
    • 100 contribute 6%
    • 1000 contribute 8%
    • 3474 (0.012%) contribute 10%
    • 10000 contribute 12%
    • 1% contribute 30%
    • 10% contribute 58%

  • We must identify repeated tests
    • And somehow thin them
    • Otherwise they bias measurements
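
The heavy-hitter shares above are cumulative fractions of all tests. A minimal sketch of how such a share could be computed, assuming one client identifier per test; the function name `top_share` and the toy data are hypothetical, not M-Lab code:

```python
from collections import Counter


def top_share(client_ids, top_n):
    """Fraction of all tests contributed by the top_n busiest clients."""
    counts = Counter(client_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(top_n)) / total


# Hypothetical toy data: one identifier per test.
tests = ["a"] * 6 + ["b"] * 3 + ["c"]
print(top_share(tests, 1))  # 0.6
```

The identifier can be anything that is stable per client, so the same calculation works on obfuscated identifiers as well as on raw IP addresses.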


6 of 35

We need to strike a balance

  • (Partially) Redact IP addresses
  • Keep (or add) sufficient "anonymized identity"
    • Dedup repeated tests from heavy hitters; and
    • Continue to support important research that needs some sort of identifiers

  • But the real forcing function is the language in the Privacy Statement
    • It must accurately reflect our technical choices
    • It must be acceptable to GDPR lawyers and other consumers

Please contribute your thoughts: discuss@measurementlab.net


7 of 35

Some (weakly) related secondary problems

  • In some parts of the world IP addresses are very weak identifiers anyhow
    • Nearly all consumers are behind Carrier Grade NATs (CG-NAT)
      • Possibly many thousands of people sharing IP Addresses
    • In principle we would like to add some sort of client identifier
      • But this is potentially a worse privacy exposure than IP addresses
      • And can be forged to corrupt our data

  • In the same areas, our /26 server footprint is prohibitive
    • We can't deploy at all, because we can't get enough addresses

  • For mobile measurement, Tower ID would be more useful
    • Tracking phone to tower associations leaks extreme PII, even without any other information
    • It would be better to track just Tower IDs
      • Discard IP addresses and all other device identifiers


8 of 35

The Current Platform


9 of 35

MLab overview (simplified)

[Diagram: platform components. Locate Service (incoming requests); Fleet of Measurement Nodes running NDT + sidecar services (UUIDs, Annotator, TCP info, .pcap, traceroute, Pusher); GCS Archive holding NDT (UUIDs), Annotations, TCP info, .pcap and Traceroute data; ETL Pipeline (Parse, Join, Views, Gardener); BigQuery; Publish (labeled twice).]

10 of 35

Elements of the Fleet

  • Network Diagnostic Test
    • Uses TCP to transfer data as fast as possible for 10 seconds
    • Instrument and log every detail
  • UUID: identifiers guaranteed to be unique for every TCP connection
    • Used as join keys
  • Annotator: Collect contemporary metadata
    • From Route Views: routing prefix, BGP origin AS Number
    • From MaxMind: Geolocation
    • We can extend it in the future
  • TCP info: state snapshots every 10 ms in the raw data, 100 ms in BQ
  • Full packet captures (.pcap files)
  • Traceroute: Scamper MDA traceroute of the path to the client
  • Pusher: Copies all data to Google Cloud Storage


11 of 35

Elements of the Pipeline

  • GCS Archive
    • Contains all raw data as produced by the fleet (12+ years, 3.7 PBytes!)
    • Published for pedantic completeness
  • Separate parser for each data type
    • NDT, UUIDs, Annotations, TCP info, .pcap, traceroute, etc
  • Join on UUIDs
    • Currently NDT and Annotations, more in the future
  • Views
    • Harmonize permissions and BQ schema
  • Gardener: reprocess everything periodically
    • Improvements to the pipeline
    • Newest/most important data every 2-6 weeks
    • Historical data from web100 fleet (2009-2019) every 6 months
    • But all processing rates are tunable
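
The "Join on UUIDs" step above can be sketched as a left join keyed on the per-connection UUID. This is an illustration only, not the pipeline's code: the row layout and the field names `uuid`, `mbps` and `asn` are assumptions.

```python
def join_on_uuid(ndt_rows, annotation_rows):
    """Left join: every NDT row is kept, with its annotation or None."""
    by_uuid = {a["uuid"]: a for a in annotation_rows}
    return [{**row, "annotation": by_uuid.get(row["uuid"])} for row in ndt_rows]


# Hypothetical rows: two tests, only the first has an annotation.
ndt = [{"uuid": "u1", "mbps": 94.2}, {"uuid": "u2", "mbps": 12.5}]
annotations = [{"uuid": "u1", "asn": 64496}]
joined = join_on_uuid(ndt, annotations)
print(joined[1]["annotation"])  # None
```

Because the join key is the UUID rather than the IP address, this step keeps working when the addresses themselves are redacted.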


12 of 35

Approaches


13 of 35

General strategy

  • Add more information (TBD) to the annotations
    • For example add obfuscated (hashed) IP addresses
      • Can still detect multiple tests from a single client IP
      • Could even use multiple hashes with different properties
    • Annotation service was designed in anticipation of this use
  • Redact IP addresses everywhere
    • Several design choices as well
      • All bits or partial (e.g. keeping the high and clobbering the low bits)
      • Zeros, random or something else?
      • Which processing step?
  • For this presentation, (nearly) all options are on the table
    • What annotation to add
    • What address bits to remove or replace
    • And where to make the changes (Fleet or Pipeline)
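
One of the redaction choices listed above, keeping the high bits and zeroing the low bits, can be sketched with the standard `ipaddress` module. The prefix lengths (/24 for IPv4, /48 for IPv6) and the function name `redact_low_bits` are assumptions for illustration, not a proposed policy.

```python
import ipaddress


def redact_low_bits(addr, keep_v4=24, keep_v6=48):
    """Zero every address bit below the kept prefix length."""
    ip = ipaddress.ip_address(addr)
    keep = keep_v4 if ip.version == 4 else keep_v6
    # strict=False lets ip_network() mask off the host bits for us.
    net = ipaddress.ip_network(f"{ip}/{keep}", strict=False)
    return str(net.network_address)


print(redact_low_bits("192.0.2.77"))             # 192.0.2.0
print(redact_low_bits("2001:db8:1234:5678::1"))  # 2001:db8:1234::
```

Zeroing is only one of the fill options; replacing the low bits with random values would need a different last step.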


14 of 35

Possible annotations

  • RouteViews (in production today)
    • Network routing prefix, AS Number, AS name, etc
  • MaxMind (in production today)
    • ISO 3166 GeoCoding: Country, region, city, postal code, etc
    • Fuzzed lat/long
  • Multiple choices for obfuscated IP addresses
    • Commonly: one_way_hash(IP, key)
  • Protocol fingerprinting
    • Nominally constant protocol features that differ by implementation
      • TCP options, Window Scale etc (generally appear in TCP info)
      • TLS/SSL/HTTPS negotiation (supported ciphers, parameters and compression)
        • We don't currently capture this information at all
    • Unknown privacy implications
      • Some fingerprints might be unexpectedly unique


15 of 35

Hashes to obfuscate IP addresses

  • Can still detect repeated tests from a single address
  • We want to include a changeable key, so we can change the hash periodically
    • Minimal defense against leaked or recovered keys
    • Otherwise the hash has to be good forever in order to preserve privacy forever
  • Tradeoff between
    • key lifetime
    • ability to investigate how the client population changes over time
      • A critical element of understanding how the Internet changes over time
  • Use multiple keys to generate multiple hashes
    • Update the keys on an alternating schedule
    • Can collect the union of all hashes seen for a given IP address, w/o exposing the IP address
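
A minimal sketch of the multiple-key idea above. It assumes HMAC-SHA256 as the keyed one-way hash and truncates the digest to 32 hex characters; both are illustrative choices, and the key values and names (`hash_a`, `hash_b`) are hypothetical.

```python
import hashlib
import hmac
import ipaddress


def obfuscate_ip(addr, key):
    """Keyed one-way hash of an IP address (assumed: HMAC-SHA256, truncated)."""
    packed = ipaddress.ip_address(addr).packed
    return hmac.new(key, packed, hashlib.sha256).hexdigest()[:32]


def annotate(addr, keys):
    """One obfuscated identifier per currently active key."""
    return {name: obfuscate_ip(addr, key) for name, key in keys.items()}


# Alternating schedule: only key A is rotated between the two periods.
period1 = {"hash_a": b"key-a1", "hash_b": b"key-b1"}
period2 = {"hash_a": b"key-a2", "hash_b": b"key-b1"}
p1 = annotate("192.0.2.77", period1)
p2 = annotate("192.0.2.77", period2)
print(p1["hash_b"] == p2["hash_b"])  # True
print(p1["hash_a"] == p2["hash_a"])  # False
```

Since one hash always survives each rotation, identifiers from adjacent periods can be linked through the surviving hash without ever storing the address.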


16 of 35

Possible annotations provided by the client

  • Servers can pass additional metadata from client to BQ via raw data and parsers
    • Currently used by some clients to provide OS and browser major version
    • Clients might provide some additional information
      • Warrants extremely careful auditing
  • For example a client UUID
    • Has its own set of privacy and security issues
      • Not really under consideration, but included for completeness
  • Standard fuzzed versions of high precision HTML5 location
    • e.g. geohash
    • Do the mapping on the client without leaving clues in the archived data
    • Census tracts (US only)
    • Nomenclature of Territorial Units for Statistics (NUTS - Europe only)
  • All require client changes that may be hard to deploy
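
For the geohash option above, a client could reduce a precise location to a coarse cell before sending anything. The sketch below is a plain implementation of the standard geohash encoding; the default precision of 5 characters is an assumption, not a recommendation.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash(lat, lon, precision=5):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    nbits = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        nbits += 1
        if nbits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            nbits = 0
    return "".join(chars)


print(geohash(42.6, -5.6))  # ezs42
```

Shorter strings mean larger cells, so the precision parameter is the privacy knob: dropping characters from the end coarsens the location.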


17 of 35

Redacting IP addresses in the Fleet

  • Advantages
    • Very strong (future) statements about privacy
      • e.g. "MLab does not collect IP addresses"
    • We can absolutely conform to all GDPR style privacy policies

  • Disadvantages
    • Annotations must be applied before the redaction
      • Also on the fleet
      • May incur missing attributes if the annotator needs to access external data
    • We can NOT retroactively improve annotations
      • Contrary to the philosophy of the rest of the data in the pipeline
      • Any attributes that we overlook now can never be added retroactively


18 of 35

Redacting IP addresses in the parser

  • Advantages
    • We don't forfeit any annotation data
    • We can change our minds later about any and all annotations or obfuscations
  • Disadvantages
    • Weaker statements about privacy
      • We can remove "IP addresses are published"
      • We can only change to "IP addresses are securely collected and archived"
    • Need a separate privacy strategy for the raw data archived in GCS
      • Note that the sheer size of the Archive makes searching a challenge
        • 3.7+ PetaBytes, growing by ~4 TeraBytes/day
  • But this will certainly be used as part of any transitional strategy
    • We will prototype any future fleet based anonymization in the parser
      • Get some real field experience with redacted data
      • Long term hybrid solutions are also possible


19 of 35

Acceptable Use Agreements (AUA)

  • Standard for nearly everybody else, but MLab does not have one: open is open
    • Enforcing an AUA requires managed ACLs
  • Several different versions are in use in the community
    • Only share summaries of sensitive data
    • Explicit prohibition of attempting to deanonymize obfuscated data
  • Our raw archive is most problematic
    • We may want an AUA on all historical raw data with non-redacted IP addresses
  • In principle we could support multiple versions of the BQ tables
    • No NDA on data with completely redacted IP addresses
      • This applies to the statistics pipeline as well
    • Suitable NDAs on data with obfuscated or clear text IP address
  • ACLs and AUAs do not change "collect and archive IP addresses"
    • They only modify "publish"
    • Is this enough?


20 of 35

Server side NAT (aka S-NAT, L4 routing, etc)

  • Load Balancers that rewrite the connection 4 tuple in front of servers
    • Noticed as part of piloting NDT on Google Cloud
      • GCP servers have private RFC 1918 addresses, which are rewritten at the cloud edge
      • This is standard practice for a huge class of industrial web servers
      • Linux kernel includes a built in full function NAT (used in OpenWrt and elsewhere)
  • MLab could use S-NAT to isolate public and private address spaces
    • Replace switches with L4 load balancers, or add S-NAT to the server root container
    • Might also address the /26 footprint problem
  • Requires replumbing the Annotation service
    • It needs to get the address from the public side of the NAT
  • Many corner cases to be considered and tested in the short term
    • ICMP, ICMP6, traceroute, etc (all carrying transcribed IP headers, including IP-in-IP tunnels)
  • Too many high risk problems to deploy anytime soon
    • Save for a future iteration(?)


21 of 35

Risks


22 of 35

Potential Leaks

  • IP addresses leaking because we missed something
    • The header checksum in .pcaps can be used to recover missing IP addresses
      • You only need to guess the server IP address
    • Headers inside of ICMP message
      • This is especially problematic for traceroute
    • What else, that we don't know about?
  • Cryptographic attacks on poor choices of hash
  • Many AUAs include prohibitions on reverse engineering and publishing flaws
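
The header checksum leak listed above can be illustrated with a small sketch. It assumes a naive redaction that zeroes only the low octet of the client address in a captured IPv4 header and leaves the checksum untouched; the header field values and addresses are hypothetical, and the server address is taken as known.

```python
import ipaddress
import struct


def ipv4_checksum(header):
    """RFC 1071 checksum over a header whose checksum field is zero."""
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_header(src, dst):
    """Minimal 20-byte IPv4 header (TCP, TTL 64) with a zero checksum field."""
    return struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 0x1234, 0x4000, 64, 6, 0,
                       ipaddress.IPv4Address(src).packed,
                       ipaddress.IPv4Address(dst).packed)


def recover_low_octet(captured):
    """Try all 256 low octets of the source and keep those matching the checksum."""
    want = struct.unpack("!H", captured[10:12])[0]
    hits = []
    for low in range(256):
        trial = bytearray(captured)
        trial[10:12] = b"\x00\x00"  # checksum field is zero while summing
        trial[15] = low             # last octet of the source address
        if ipv4_checksum(bytes(trial)) == want:
            hits.append(str(ipaddress.IPv4Address(bytes(trial[12:16]))))
    return hits


true_hdr = bytearray(build_header("198.51.100.77", "203.0.113.10"))
true_hdr[10:12] = struct.pack("!H", ipv4_checksum(bytes(true_hdr)))
redacted = bytearray(true_hdr)
redacted[15] = 0  # naive redaction: clobber the low octet only
print(recover_low_octet(bytes(redacted)))  # ['198.51.100.77']
```

The checksum carries 16 bits of information about the addresses, so any redaction that rewrites address bits in a capture would also need to recompute or clear the checksums.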


23 of 35

What about traceroute?

  • The last hop near the destination is very likely to have a unique IP address
  • Some researchers studying traceroute find obfuscated data useless
    • Can't disambiguate multiple addresses on one device or subnet without full addresses


24 of 35

IPv4 specific risks

  • Generic exposure to brute force attacks
    • The entire address space can be scanned against keys or hashes
  • Critically important that the AS Number and/or routing prefix are preserved
    • Required to map results to ISP and network "location"
  • All designs invite "plaintext attacks" and differential cryptanalysis
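
The brute force exposure above gets worse when the routing prefix is published alongside the hash, because the prefix shrinks the search space. A minimal sketch, assuming an unkeyed SHA-256 hash (a deliberately weak choice used only for illustration) and hypothetical function names:

```python
import hashlib
import ipaddress


def unkeyed_hash(addr):
    """Unkeyed SHA-256 of the packed address: anyone can recompute it."""
    return hashlib.sha256(ipaddress.ip_address(addr).packed).hexdigest()


def brute_force(target_hash, prefix):
    """Scan every address in the published routing prefix for a match."""
    for ip in ipaddress.ip_network(prefix):
        if unkeyed_hash(str(ip)) == target_hash:
            return str(ip)
    return None


published = unkeyed_hash("198.51.100.77")
print(brute_force(published, "198.51.100.0/24"))  # 198.51.100.77
```

A secret key removes this particular attack only for as long as the key stays secret, which is the motivation for rotating keys.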


25 of 35

IPv6 specific risks

  • Exposing IPv6 addresses at all can seed scanners and other attacks
  • RFC 4862 (SLAAC, with RFC 4291 modified EUI-64 interface identifiers) specified setting the IPv6 address from the Ethernet MAC address
    • Stable over time, leaks manufacturer and device age to potential attackers
  • RFC 4941 specified changing IPv6 addresses (relatively slowly) over time
    • Addresses are stable longer than necessary
  • RFC 8981 specifies the use of temporary addresses
    • Clients (or applications) can choose to have very short address lifetimes
  • Even without address redaction it is sometimes hard to identify requests from one client
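
The MAC-derived address risk above can be made concrete: with modified EUI-64 interface identifiers, the MAC address is recoverable from the low 64 bits of the IPv6 address. The sketch below is a heuristic (a random identifier could contain `ff:fe` by chance), and the function name and example address are hypothetical.

```python
import ipaddress


def mac_from_eui64(addr):
    """Return the MAC embedded in a modified EUI-64 interface ID, or None."""
    iid = ipaddress.IPv6Address(addr).packed[8:]
    if iid[3:5] != b"\xff\xfe":
        return None
    # Undo the flipped universal/local bit and drop the inserted ff:fe.
    mac = bytes([iid[0] ^ 0x02]) + iid[1:3] + iid[5:8]
    return ":".join(f"{b:02x}" for b in mac)


print(mac_from_eui64("2001:db8::21a:2bff:fe3c:4d5e"))  # 00:1a:2b:3c:4d:5e
print(mac_from_eui64("2001:db8::1"))                   # None
```

The first three octets of the recovered MAC are the manufacturer's OUI, which is how such addresses leak the device vendor.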


26 of 35

Impaired Experiments


27 of 35

Understanding repeated tests from single IPs

  • Many NDT integrations do periodic measurements
    • e.g. self tuning home routers
    • 1-4 per day are common
  • Some usage patterns are probably WiFi debugging or mapping
    • dozens of tests over a few hours with widely variable results
  • Some usage patterns appear quite abusive
    • Is MLab being used to DDoS something?

  • Summary statistics need to down sample or exclude repeated tests
    • Otherwise they bias our measurements
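
One possible way to down sample repeated tests is to keep a single randomly chosen test per client per day. This is a sketch under assumptions: the field names `client`, `date` and `mbps` are hypothetical, and "one per day" is only one of many possible thinning rules.

```python
import random
from collections import defaultdict


def thin(tests, seed=0):
    """Keep one randomly chosen test per (client, date) group."""
    groups = defaultdict(list)
    for t in tests:
        groups[(t["client"], t["date"])].append(t)
    rng = random.Random(seed)  # seeded so the selection is reproducible
    return [rng.choice(group) for group in groups.values()]


# Hypothetical data: client h1 tests three times on one day.
sample = [
    {"client": "h1", "date": "2022-02-01", "mbps": 90.0},
    {"client": "h1", "date": "2022-02-01", "mbps": 12.0},
    {"client": "h1", "date": "2022-02-01", "mbps": 55.0},
    {"client": "h1", "date": "2022-02-02", "mbps": 88.0},
    {"client": "h2", "date": "2022-02-01", "mbps": 20.0},
]
print(len(thin(sample)))  # 3
```

The `client` key only has to be stable within the grouping window, so an obfuscated identifier would serve here as well as an IP address.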


28 of 35

Understanding shared IP addresses

  • Home NATs (Network Address Translation)
    • Detectable today when clients have different TCP fingerprints
      • Clients are from different vendors or vintages
  • Carrier Grade NAT
    • Typically an entire region shares one small block of addresses
      • This is the default in some parts of the world
    • Detectable from wide variation of TCP fingerprints, minRTT, maxBW
    • Same statistics from several consecutive IP addresses
      • Not on an address block boundary
      • Obscuring adjacent addresses breaks "counting" experiments
  • Tunnels
    • The tunnel exit must be a public IP address
      • in order for returning ACKs to be routed properly
    • Geo information is often confused because the user is trying to hide their true location
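
The fingerprint-based detection above can be sketched as counting distinct TCP fingerprints per address. The fingerprint strings, the threshold of two, and the function name are assumptions; a real analysis would also look at minRTT and maxBW variation.

```python
from collections import defaultdict


def likely_shared(tests, min_fingerprints=2):
    """Addresses seen with at least min_fingerprints distinct TCP fingerprints."""
    seen = defaultdict(set)
    for ip, fingerprint in tests:
        seen[ip].add(fingerprint)
    return {ip for ip, fps in seen.items() if len(fps) >= min_fingerprints}


# Hypothetical (address, fingerprint) pairs.
observed = [
    ("192.0.2.1", "mss1460,ws7"),
    ("192.0.2.1", "mss1440,ws8"),
    ("192.0.2.2", "mss1460,ws7"),
    ("192.0.2.2", "mss1460,ws7"),
]
print(likely_shared(observed))  # {'192.0.2.1'}
```

Any identifier that maps one-to-one with the address, such as a keyed hash, would group the tests the same way.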


29 of 35

Understanding IP address stability

  • Most (?) consumer ISPs use dynamic address assignments
    • The addresses change from time to time
      • Daily is common in parts of Europe
      • I typically see new addresses about twice per year at my own house
      • Some people report new addresses on every modem reboot
  • Can only infer from address lifetime and turnover rates
    • Confounded by address obfuscation


30 of 35

Validating MLab itself

  • Find stable sets of beacons with stable test patterns
  • Canary new server code on a few servers
  • Compare data between server versions
  • We have semi-automatic tools to do this now
    • Will be able to do it continuously for all deployments some time in the future


31 of 35

Curated clients

  • A number of researchers work with known sets of measurement clients
    • Privacy is not an issue
  • It is super useful to be able to easily pick out your own data
  • Alternatives include:
    • Client annotations
    • Logging test UUIDs


32 of 35

Questions and discussions

  • What techniques have we overlooked?
    • Are there additional techniques that we should add to our toolbox?
  • What downsides have we overlooked?
    • Other experiments that we might impair?
    • Other risks that we should consider?


33 of 35

Discuss


34 of 35

But what is PII?

GDPR Article 4, Definitions, paragraph 1:

‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;

Do we have a person? Can our data indirectly identify a person? I think not....

Can somebody help us find a strong answer to this question?


35 of 35

My wish

Our official statement could become:

MLab collects IP addresses but does not collect any information that might be used to directly or indirectly identify you, the user.

This statement is already true, we just don't use the words.
