1 of 39

Data privacy: What could possibly go wrong?

Prof Ben Rubinstein

The University of Melbourne

You call that a data release?

2 of 39

Partners in it’s-not-a-crime

Dr Chris Culnane

Castellate Consulting & UoM

A/Prof Vanessa Teague

Thinking Cybersecurity & ANU

Anything brilliant in this talk is due to them. Clumsy presentation is due to me.

With special thanks also to Anthony Carbines MP, Peter Tonoli, and David Watts.

3 of 39

Privacy is a wicked problem needing multiple disciplines

What can go wrong when lessons of some are ignored?

Let’s start with some examples from the Australian context.

4 of 39

The 2016 Medicare data release

C. Culnane, B. I. P. Rubinstein, and V. Teague.

Health data in an open world.

CoRR, abs/1712.05627, 2017.

5 of 39

August 2016 – MBS/PBS dataset is released

Medicare/Pharmaceutical Benefits Schedules

  • “Medicare is Australia’s universal health insurance scheme… It guarantees all Australians (and some overseas visitors) access to a wide range of health and hospital services at low or no cost.”
  • “PBS provides timely, reliable and affordable access to necessary medicines for Australians… Under the PBS, the government subsidises the cost of medicine for most medical conditions.”

3 billion lines of data

  • 10% (2.9 million) of the population
  • MBS 30 years: 1984 – 2014
  • PBS 11 years: 2003 – 2014
  • Almost*: Demographics + billing records

Infographic from the Dept of Health

6 of 39

*What protections were in place?

Encrypted Provider IDs and Patient PINs

Collapsed locations into 4 geographic regions

  • patient MBS enrolment
  • provider location

Year of birth

Removal of centenarians

Service & supply date perturbed by up to 14 days

  • Constant perturbation within a patient’s records

Extremely low service volume items removed
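
Why a constant per-patient shift protects so little can be seen in a minimal sketch. It assumes the shift is one whole number of days drawn from ±14 (the exact scheme used in the release is not stated here), and the dates are made up. Gaps between a patient's events, such as the interval between two childbirths, survive the perturbation unchanged.

```python
import random
from datetime import date, timedelta

def perturb_patient_dates(dates, max_shift_days=14, rng=None):
    """Shift every date for one patient by the same random offset (assumed scheme)."""
    if rng is None:
        rng = random.Random()
    shift = timedelta(days=rng.randint(-max_shift_days, max_shift_days))
    return [d + shift for d in dates]

def gaps_in_days(dates):
    """Days between consecutive events, in chronological order."""
    ordered = sorted(dates)
    return [(b - a).days for a, b in zip(ordered, ordered[1:])]

# Hypothetical service dates for one patient (not taken from the dataset).
original = [date(2006, 3, 14), date(2011, 9, 2), date(2012, 1, 20)]
released = perturb_patient_dates(original, rng=random.Random(0))

# The absolute dates move, but the intervals an attacker can match on do not.
print(gaps_in_days(original) == gaps_in_days(released))  # prints True
```

An attacker who knows two dates for a person only needs to allow a 14-day slack on the first and can then match the second by its exact offset.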

(Encrypted) patient ID

0345952108

Gender

F

Year of birth

1963

(Encrypted) patient ID

0345952108

State

Vic-Tas

Date

7 Aug 1992

(Encrypted) supplier ID

2340981234

Item code

00023 (GP visit)

Price paid by patient

$85

Price reimbursed by Medicare

$60

Various other details

“GP” in U.S.: “Family Doc”. Or: HIV medicine, 2nd trimester labor, …

7 of 39

September 2016: Decryption reported to DHS

Decrypted Provider IDs in full

  • described pseudo-RNG is insecure in this setting
  • reversible by guesswork

Aust Privacy Commissioner: Encryption dates back…

  • 2005: DHS for 10% PBS releases to select groups
  • mid-1990s: former Health Insurance Commission

Instead: should have used RSA, AES, random IDs

We responsibly disclosed to Dept

  • providers can be recovered
  • high risk of patient reID also
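
The "random IDs" alternative can be sketched as follows, assuming a custodian-side lookup table that is never released. `RandomPseudonymiser` and the example IDs are hypothetical names for illustration, not the Department's system. Each token comes from the operating system's secure random source and has no computable relationship to the real ID, so there is nothing to reverse by guesswork.

```python
import secrets

class RandomPseudonymiser:
    """Maps each real ID to an unrelated random token via a secret lookup table."""

    def __init__(self, token_bytes=8):
        self._token_bytes = token_bytes
        self._table = {}      # real ID -> token; must never be released
        self._issued = set()  # tokens already handed out

    def pseudonym(self, real_id):
        if real_id not in self._table:
            token = secrets.token_hex(self._token_bytes)
            while token in self._issued:  # collisions are very unlikely; retry anyway
                token = secrets.token_hex(self._token_bytes)
            self._issued.add(token)
            self._table[real_id] = token
        return self._table[real_id]

p = RandomPseudonymiser()
a = p.pseudonym("provider-0001")  # hypothetical provider IDs
b = p.pseudonym("provider-0002")
print(a != b, a == p.pseudonym("provider-0001"))  # prints True True
```

This only addresses recovery of the ID itself. The billing records attached to each token can still single a person out, which is the patient re-identification risk noted above.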

8 of 39

Privacy Amendment (Re-identification Offence) Bill 2016

One day before agreed announcement day…

Attorney-General memo: Intention to amend Privacy Act 1988

  • Legislative instrument permitting retroactive prosecution

The Bill

  • Criminalises reID of Commonwealth data
  • Up to 2-year jail term
  • Reverses burden of proof
  • Retrospective legislation (see memo above)
  • Exemptions for AG or Dept-sanctioned investigations
  • Apparently* might not apply to all* academics

Life of a ReID Bill

12 Oct 2016 – Intro to the Senate

10 Nov 2016 – Ref to Senate Legal Committee

16 Dec 2016 – Consultation Period Ended

7 Feb 2017 – Senate Committee Report

. . .

6 Jun 2019 – Zombie Bill!

But even without passing, the retroactive Bill’s intended outcome was achieved: stifle disclosure of (existing) breaches

9 of 39

Overwhelmingly critical response

Law Council of Australia

  • “The reverse onus provisions should be removed …”
  • “definition of ‘de-identified’ … not always clear cut”

Australian Bankers’ Association

  • “re-identification … might accidentally occur, without the bank intending”

14/15 submissions critical

Final Senate Committee report:

“The committee notes the concerns … However…the bill provides a necessary and proportionate response”

10 of 39

“Health Minister Sussan Ley insists the data, which was loaded onto the internet, does not identify patients.”

11 of 39

But health data abounds

12 of 39

Searching for Vanessa

17,310 women share her birth year

59 also had children born 2006, 2011 in Australia

23 also based in Victoria

0 also match her children’s dates of birth, allowing for the date perturbation
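
The search above is a chain of filters, each shrinking the candidate set. A minimal sketch on five synthetic patients follows. The field names, the birth year 1970 and the region label "NSW-ACT" are assumptions for illustration, not values from the release.

```python
def funnel(patients, filters):
    """Apply filters in order; return the survivors and the count after each step."""
    counts = []
    remaining = list(patients)
    for keep in filters:
        remaining = [p for p in remaining if keep(p)]
        counts.append(len(remaining))
    return remaining, counts

# Entirely synthetic rows (hypothetical schema).
patients = [
    {"id": "A", "gender": "F", "yob": 1970, "state": "Vic-Tas", "birth_years": {2006, 2011}},
    {"id": "B", "gender": "F", "yob": 1970, "state": "NSW-ACT", "birth_years": {2006, 2011}},
    {"id": "C", "gender": "F", "yob": 1970, "state": "Vic-Tas", "birth_years": {2008}},
    {"id": "D", "gender": "M", "yob": 1970, "state": "Vic-Tas", "birth_years": set()},
    {"id": "E", "gender": "F", "yob": 1985, "state": "Vic-Tas", "birth_years": {2006, 2011}},
]

filters = [
    lambda p: p["gender"] == "F" and p["yob"] == 1970,  # women sharing a birth year
    lambda p: {2006, 2011} <= p["birth_years"],         # childbirths in both years
    lambda p: p["state"] == "Vic-Tas",                  # coarse region
]

matches, counts = funnel(patients, filters)
print(counts, [p["id"] for p in matches])  # prints [3, 2, 1] ['A']
```

Each filter uses only facts that are commonly public, which is why a candidate set can collapse to one or zero after three or four steps.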

Anyone could do this!!

Not in dataset

13 of 39

Mothers unique in MBS-PBS-10%

 

14 of 39

“It’s only a sample” or “bah humbug ‘confidence’”

DHS whole-of-population statistics on MBS billing rates

  • aggregated: 10yr age ranges, billing state, billing month
  • 27% of codes are uniquely reported
  • E.g. aortic valve replacement, former PM in Brisbane, Aug 2011 (in population, not sample)
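
One way to measure how revealing such aggregates are is to count the cells that contain a single service. The sketch below is an assumption about how a uniqueness rate could be computed, not the method behind the 27% figure. The rows are synthetic and "X0001" is a made-up item code.

```python
from collections import Counter

def unique_cell_fraction(rows):
    """Fraction of aggregate cells that contain exactly one service."""
    cells = Counter(rows)
    singletons = sum(1 for n in cells.values() if n == 1)
    return singletons / len(cells)

# Each row is one billed service: (item code, age range, billing state, billing month).
rows = [
    ("00023", "40-49", "VIC", "2011-08"),
    ("00023", "40-49", "VIC", "2011-08"),
    ("X0001", "50-59", "QLD", "2011-08"),  # a rare procedure, reported once
    ("00023", "30-39", "NSW", "2011-08"),
]

print(f"{unique_cell_fraction(rows):.0%}")  # prints 67%
```

A cell with a count of one confirms that the matching person is the only one in the whole population, which removes the "it's only a sample" uncertainty.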

15 of 39

Reidentifications

Wiki/news articles on 18 mums with 2+ births

  • 13 had no matches, including:
    • Gillian Triggs – Former Human Rights Commis.
    • Natasha Stott-Despoja – Former Senator
    • Cathy Freeman – Olympic Athlete
    • Tanya Plibersek – Current MP
  • 2 rejected due to inconsistent information
  • 3 returned a unique match

25 more queries

  • Professional footballers with injuries
  • Politicians with reported unusual surgeries
  • 4 more unique matches
  • One is an AFL team captain matching medical history, birth year, and interstate movements
  • One has likely been reported as the oldest person in a state to have received a surgery, as confirmed by whole-of-population data

Other risks – only partially assessed

Fingerprinting by billing amounts

Melbourne Pharmaceutical Datathon with postcodes

16 of 39

A release born out of an open-data-first environment

17 of 39

The 2018 Myki data release

C. Culnane, B. I. P. Rubinstein, and V. Teague.

Stop the open data bus, we want to get off.

CoRR, abs/1908.05004, 2019.

18 of 39

July 2018 – Myki dataset is released

Myki

  • Public Transport Victoria’s (PTV) contactless smart card ticketing system
  • Used across rural and metropolitan areas for travel on buses, trains and trams by touching on and off

1.8 billion lines of data (touch-on/off events)

  • Spanning mid-2015 – mid-2018
  • All events (metro train & tram) in the period
  • 15m cards in dataset (over double VIC population)
  • Released by PTV via Data Science Melbourne’s 2018 Datathon: 190 teams competing 24 July – 26 Sept on predictive models

Wikimedia user Fiveapu

19 of 39

What protections were in place?

cardId

  • Fixed 6-digit number associated with all events for a single card
  • Not Myki card number (found on physical cards)
  • Not uniformly random, has structure: 15m cards across 24m numbers; last 180k correlate with event times; strange gaps exist

Apparently no change to times or locations

Apparently no removal of low volumes

cardId

154449

Date-time

2015-08-10 12:34:56

Touch type

Touch on

Location info

Stop ID, route ID, etc.

Card type

Type 48 – Transit Police Travel Pass

!!! 74 card types: 371 Federal Police (type 46), 1232 Transit Police (type 48), 8 Federal Parliamentarians (type 50), 424 State Parliamentarians (type 51), 697k children 5-18, 179k secondary schoolers

1

173211

173338

191920

154449

356913

180637

20 of 39

Searching for Chris and Ben

We had registered our Mykis, giving access to 6 months of data down to per-second events

  • Pick 1 event* from online account to search for: 48 matches for Ben, 59 for Chris
  • Pick 2nd event*: only 1 match each

* We weren’t “clever” in picking events: Ben 8-9am, 5-6pm; Chris 7-8am, 7-8pm
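
The two-event search can be sketched as a set intersection: find every cardId with a touch at the known stop and second, then keep only the cards that match every known touch. The event schema, stop IDs and card numbers below are hypothetical, and the data is synthetic.

```python
from datetime import datetime

def cards_matching(events, stop_id, when, tolerance_s=0):
    """cardIds with a touch at stop_id within tolerance_s seconds of `when`."""
    return {
        e["card_id"] for e in events
        if e["stop_id"] == stop_id
        and abs((e["time"] - when).total_seconds()) <= tolerance_s
    }

def candidates(events, known_touches):
    """Intersect the match sets for each touch the searcher already knows about."""
    result = None
    for stop_id, when in known_touches:
        matched = cards_matching(events, stop_id, when)
        result = matched if result is None else result & matched
    return result or set()

t1 = datetime(2018, 3, 5, 8, 15, 10)   # a morning touch from the online account
t2 = datetime(2018, 3, 6, 17, 40, 2)   # an evening touch the next day
events = [
    {"card_id": 111111, "stop_id": 100, "time": t1},
    {"card_id": 222222, "stop_id": 100, "time": t1},
    {"card_id": 333333, "stop_id": 100, "time": t1},
    {"card_id": 222222, "stop_id": 200, "time": t2},
    {"card_id": 444444, "stop_id": 200, "time": t2},
]
known = [(100, t1), (200, t2)]

print(len(candidates(events, known[:1])), candidates(events, known))  # prints 3 {222222}
```

Peak-hour touches share a stop and second with many other cards, yet the intersection of just two of them is typically a single card, as in the 48 and 59 matches collapsing to 1.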

To validate further

  • First events closely matched our card activations
  • Sample of other trips matched

Wikimedia public domain

In the dataset

21 of 39

Co-traveller analysis for Ben and Chris

Definition: Two cards are co-travellers if they touch on at the same stop within 5 seconds of one another.

In the 18mo period from 2017 onwards

  • 2106 co-travellers of Chris (mostly tram-based)
    • 38 repeat co-travellers
  • 8591 co-travellers of Ben (trams and trains)
    • 363 repeat co-travellers
  • Were we mutual co-travellers? Yes, 7 times!
    • Ben most frequent of Chris
    • Chris 4th most frequent of Ben (3rd a child concession)

Conclusion: co-travellers rare, repeats rarer; concerning ease of finding family members/close partners
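
The co-traveller definition translates directly into code. The sketch below is a simple quadratic scan on synthetic events with a hypothetical schema. At 1.8 billion rows a real analysis would sort or index by stop and time, and this is not claimed to be the implementation used for the figures above.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

def co_traveller_counts(events, target_card, window_s=5):
    """For each other card, count the target's touch-ons it shared (same stop, within window_s)."""
    by_stop = defaultdict(list)
    for e in events:
        if e["touch"] == "on":
            by_stop[e["stop_id"]].append(e)

    counts = Counter()
    for e in events:
        if e["card_id"] != target_card or e["touch"] != "on":
            continue
        nearby = {  # a set, so each card counts once per touch-on of the target
            o["card_id"] for o in by_stop[e["stop_id"]]
            if o["card_id"] != target_card
            and abs((o["time"] - e["time"]).total_seconds()) <= window_s
        }
        counts.update(nearby)
    return counts

t0 = datetime(2018, 5, 1, 8, 30, 0)
t1 = datetime(2018, 5, 2, 18, 5, 0)
events = [
    {"card_id": 100001, "stop_id": "A", "time": t0, "touch": "on"},
    {"card_id": 100002, "stop_id": "A", "time": t0 + timedelta(seconds=3), "touch": "on"},
    {"card_id": 100003, "stop_id": "A", "time": t0 + timedelta(seconds=9), "touch": "on"},
    {"card_id": 100001, "stop_id": "B", "time": t1, "touch": "on"},
    {"card_id": 100002, "stop_id": "B", "time": t1 - timedelta(seconds=5), "touch": "on"},
    {"card_id": 100004, "stop_id": "B", "time": t1 + timedelta(seconds=2), "touch": "on"},
    {"card_id": 100005, "stop_id": "B", "time": t1 + timedelta(seconds=1), "touch": "off"},
]

counts = co_traveller_counts(events, 100001)
repeats = sorted(card for card, n in counts.items() if n >= 2)
print(repeats)  # prints [100002]
```

Ranking `counts` by frequency is all it takes to surface the handful of cards that travel with a known person again and again.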

22 of 39

Searching for a Friend

Chris searched for Peter Tonoli* in the data

  • Peter and Chris boarded a tram together after an evening seminar followed by a social group gathering
  • Using (a) Chris’s card reidentification and (b) calendar records of the seminar event, found correct touch on
  • 4 co-travellers with Chris’s touch on event
  • Of these, 2x concession, 1x an infrequent traveller
  • Only 1 match remained

To validate further

  • Match frequently travelled to Peter’s residential area
  • Peter recalled his card expiry, matched end of events

Conclusion: particularly concerning for domestic violence cases

* Peter consented to our conducting and reporting on this search

Wikimedia user Wrev

In the dataset

23 of 39

Searching for a Stranger

Can we search for a public figure?

  • 218 Melbourne metro stations
  • 424 State Parliamentarian Travel Passes
  • Expect outer-urban stations to be visited by few

Anthony Carbines MP (State Member for Ivanhoe)

  • Electorate office metres from Rosanna train station
  • Often tweets about train travel

We linked* the Myki dataset to this prior knowledge

  • 2x type 51 cards visited Rosanna, only 1x often
  • Linking with Twitter verifies times (18x); 3 of these also uniquely identify this card (among cards of any type)

* Mr Carbines consented to inclusion in our report

In the dataset

24 of 39

September 2018 – Responsible disclosures

Timeline of responsible disclosures

  • Sept 14: PTV notified OVIC of participant’s re-id concerns
  • Sept 20: We notified OVIC and PTV of findings
  • Oct 8: OVIC informed PTV and Dept of Premier & Cabinet of OVIC investigation; conferred with Data61 for validation

Breach of Privacy and Data Protection Act 2014 (VIC)

  • OVIC found PTV to have breached the PDP Act
  • 15 August 2019 detailed investigation report published

25 of 39

Five-Safes cracking

Wikimedia user Jon.lorquet

C. Culnane, B. I. P. Rubinstein, and D. Watts.

Not fit for purpose: A critical analysis of the ’five safes’.

CoRR, abs/2011.02142, 2020.

26 of 39

Introducing… 5 Safes

“The Five Safes framework takes a multi-dimensional approach to managing disclosure risk. Each safe refers to an independent but related aspect of disclosure risk. The framework poses specific questions to help assess and describe each risk aspect (or safe) in a qualitative way. This allows data custodians to place appropriate controls, not just on the data itself, but on the manner in which data is accessed. The framework is designed to facilitate safe data release and prevent over-regulation.”

– ABS, Five Safes Framework – Data Confidentiality Guide

safe people

safe projects

safe settings

safe data

safe outputs

27 of 39

Historical context for 4-5 Safes

Introduced 2002 by Felix Ritchie, UK Office for National Statistics

  • The ONS “Virtual Microdata Lab Security Model”
  • 2007: from 4 to 5 Safes, with “Safe Data” added to handle de-identified data contexts

Initially a data protection framework to guide risk management, it has grown into a cornerstone of data sharing and access policy and legislation

  • Public Sector (Data Sharing) Act 2016 (South Aust.)
  • Digital Economy Act 2017 (UK)
  • Data Availability and Transparency Act 2022 (Aust.)
  • Est. of the Office of the National Data Commissioner (ONDC)

28 of 39

Critical analysis (partial list)

Genesis of 5 Safes: a mindset of avoiding over-regulation

  • ‘Portfolio model’ view of 5 Safes (Lane et al. 2008)
  • Use of personal data requires consent (with exemptions), but “safe data” (e.g., de-identified data) can be shared

Emotive and appropriated language

  • Compare “Five Safes” with “Five Risks” (our term), “Five Data Sharing Principles” (ONDC), or “secure” (OECD)
  • 100% safety/security not generally possible (need utility)
  • Cf. “open data” vs. “open source” and “open government”

Simplistic guidance across safes

“The Data Sharing Principles looks suspiciously like the Five Safes.” – Justin Warren, EFA Board

29 of 39

Why stop at 4 or 5 Safes: Safer, safest?

Pages 18-19

30 of 39

Closer look at “safe data” (IANAL)

Data Availability and Transparency Code 2022

Data Availability and Transparency Act 2022

The Privacy Act 1988 defines “de-identified” data (note changes coming to Act)

31 of 39

Does usage of 5 Safes promote PETs?

Google searches conducted Wed 14 Feb 2024

32 of 39

Looking again at “safe data” – 2018 ACS Guide

33 of 39

What can we do?

34 of 39

We need (some) hammers

Golden hammer (aka. law of the instrument, law of the hammer, Maslow's hammer/gavel): A cognitive bias that involves an over-reliance on a familiar tool. Abraham Maslow wrote in 1966, "If the only tool you have is a hammer, it is tempting to treat everything as if it were a nail." (Wikipedia)

Not all hammers are golden

  • Risk management frameworks and good governance
  • Well-designed legislation; e.g. some improvements likely coming to the Privacy Act

Hammers from computer science that are sometimes missing in privacy management

  • Threat models (cybersecurity): Describe attacker goals, utility, capabilities, information
  • Definitions (theoretical CS), security properties (cyber): Aligned with threat model, falsifiable

35 of 39

Consider differential privacy

Threat model

  • Goal/utility: to recover a data subject’s data from a release
  • Capabilities: unlimited compute, observe outputs*
  • Info: instead of assuming away collusion, prior knowledge, ability to link to 3rd-party data, etc., assume the attacker knows the data, the mechanism (Kerckhoffs’s principle), etc.

Definition / security property

  • Info → neighbouring relation
  • Attacks on threat model (Dinur & Nissim) → randomization, perfection impossible
  • This + usability (e.g., composition) → DP!
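
For reference, a mechanism M is ε-differentially private if, for every pair of neighbouring datasets D and D' and every set of outputs S, Pr[M(D) ∈ S] ≤ exp(ε) · Pr[M(D') ∈ S]. A minimal sketch of one standard instance, the Laplace mechanism on a counting query, follows. It assumes neighbours differ by adding or removing one record; the function names and data are illustrative. A production implementation needs more care than this, for example around floating-point noise sampling.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) noise, drawn as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_count(records, predicate, epsilon, rng=None):
    """Release a noisy count; epsilon-DP under add/remove-one-record neighbours."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.SystemRandom()  # OS randomness rather than a seeded generator
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one record changes a count by at most 1 (sensitivity 1),
    # so the Laplace scale is sensitivity / epsilon = 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Synthetic touch-on records (hypothetical schema).
trips = [{"stop": "Rosanna"}, {"stop": "Ivanhoe"}, {"stop": "Rosanna"}]
print(dp_count(trips, lambda t: t["stop"] == "Rosanna", epsilon=0.5))
```

Under basic composition the ε values of repeated releases add up, so the privacy budget has to be tracked across every query answered from the same data.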

36 of 39

DeID as property of datasets… (compare to DP)

    • Unique in the Crowd: The privacy bounds of human mobility - Yves-Alexandre de Montjoye, César A. Hidalgo, Michel Verleysen & Vincent D. Blondel
    • Not So Unique in the Crowd: a Simple and Effective Algorithm for Anonymizing Location Data - Yi Song, Daniel Dahlmeier & Stephane Bressan
    • Unique in the shopping mall: On the reidentifiability of credit card metadata - Yves-Alexandre de Montjoye, Laura Radaelli, Vivek Kumar Singh, Alex “Sandy” Pentland
    • Big Data and Innovation, Setting the Record Straight: De-identification Does Work - Ann Cavoukian & Daniel Castro
    • No silver bullet: De-identification still doesn't work - Arvind Narayanan & Edward W. Felten
      • ReID models overly narrow by assuming only access to public data
      • ReID does not require specialist skills
      • Quantification of reID probabilities at best lower bounds/fundamentally meaningless

37 of 39

Recommendations out of Myki experience

OVIC’s outcomes for public data custodians

  • Challenges in identifying privacy risks
  • Vic public sector needs a high level of data literacy
  • Appropriate processes and expertise should support decisions to release de-ID data
  • Governance for risk management and decision-making key

Our report’s recommendations

  • Technical mechanisms: e.g., opting for DP on aggregate data over longitudinal release of unit data. PETs more generally.
  • Procedural mechanisms: openness about privacy-protection mechanisms and independent examination by experts prior to release.

Cf. 2017 DP for Opal data: Data61 proposal, our response, and a counter

38 of 39

Closing Proposal: Avoiding death by N safes

Unbounded N safes?

  • 4 safes: aimed to be “security model” for data access facility
  • 5-8 safes: fundamentally different context, no clear goal, “safe data” thrives in definitions unlinked to a security property
  • Conjecture: Room for more “safes” as no “safe” is in fact safe! Can’t achieve success without knowing what success looks like

A modest (?) proposal

  • Only endorse risk frameworks that present threat models/security properties, or require this first step
  • This will naturally prefer PETs when there’s a PET for the task

39 of 39

Thank you!

As a parting note…

The golden hammer of criminalising re-identification is back…

UK passed the Data Protection Act 2018 criminalising the knowing reidentification of “anonymised” data

Australia’s AG has (Feb 2023) made a proposal as part of the Privacy Act reforms: "Proposal 4.7: Consult on introducing a criminal offence for malicious re-identification of de-identified information where there is an intention to harm another or obtain an illegitimate benefit, with appropriate exceptions."

Govt has already agreed in their Sept 2023 response
