1 of 9

CSE 163

Privacy

Hunter Schafer

💬Before Class: Any plans for spring break?

🎵Music: Tom Misch

2 of 9

Anonymous Data Isn’t

In the mid-1990s, an insurance group in Massachusetts published anonymized records of hospital visits with identifying attributes like name, address, and Social Security number removed, but with demographic information left in.

Turns out this data release was not so anonymous!

  • Latanya Sweeney was able to link the demographic information in the hospital data with public voter rolls and found which hospital record corresponded to the governor.

Sweeney estimates that 87% of the US population can be uniquely identified by knowing just 1) date of birth, 2) sex, and 3) ZIP code.


3 of 9

k-anonymity

k-anonymity: A first definition of privacy, from Sweeney, that requires every combination of identifying attributes (“quasi-identifiers”) in the released data to match at least k people in the dataset.

  • Achieved by removing columns or fuzzing values (e.g., reporting an age range instead of an exact age); see the check sketched below
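A quick sketch of what this check looks like in pandas (the column names and data here are made up for illustration): group by the quasi-identifier columns and verify that every group has at least k rows.

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    # Every combination of quasi-identifier values must appear
    # in at least k rows for the data to be k-anonymous
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Hypothetical demographic records
records = pd.DataFrame({
    'zip': ['98105', '98105', '98105', '98112'],
    'sex': ['F', 'F', 'F', 'M'],
    'diagnosis': ['flu', 'asthma', 'flu', 'cold'],
})
print(is_k_anonymous(records, ['zip', 'sex'], k=2))  # False: ('98112', 'M') is unique
```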

Weakness: Fails under composition. Two releases that are each k-anonymous can be combined to single out individuals.


4 of 9

Differential Privacy

A stronger notion of privacy that bounds how much information anyone can learn about a single person from the results of an analysis.

Consider two worlds: one where person A participates in a study and one where they don’t. If the results of the study are (nearly) the same in both worlds, we say the study respects differential privacy.


5 of 9

Differential Privacy

Say an algorithm or analysis is 𝜀-differentially private if its results with or without any single person in the dataset are “at most 𝜀” apart.

  • Defining how close results are is a little complex, but it is a statement about probabilities (formal version below)
  • If 𝜀 = 0, require results to be exactly the same
  • If 𝜀 is small, require results to be very similar
  • If 𝜀 is large, require more deviation in results (less privacy)
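Concretely, the standard formal statement is: a randomized algorithm M is 𝜀-differentially private if, for any two datasets D and D′ that differ in one person’s data and any set of possible outputs S,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]
```

With 𝜀 = 0 the output distributions must be identical; as 𝜀 grows, the factor e^𝜀 lets them drift further apart.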

Two common methods for achieving 𝜀-differential privacy:

  • Jittering Result (Laplace Mechanism)
  • Randomized Response


6 of 9

Jittering

Take the result of an analysis and add a small amount of random noise to it.

  • Example: report the average age from the census, but add a small random number to it

Specifically, if the noise follows a Laplace distribution with scale Δ/𝜀, where Δ is the query’s sensitivity (how much one person’s data can change the result), you achieve 𝜀-differential privacy.

[Figure: Laplace densities for 𝜀 = 0.5 (red dashes), 𝜀 = 1 (blue solid), and 𝜀 = 2 (purple dots)]
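A minimal sketch of this mechanism in Python (the function name and example values are illustrative, not from the slides):

```python
import numpy as np

def laplace_mechanism(true_result, sensitivity, epsilon):
    # Noise scale is sensitivity / epsilon:
    # smaller epsilon => wider noise => more privacy
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_result + noise

# A counting query changes by at most 1 when one person is
# added or removed, so its sensitivity is 1
private_count = laplace_mechanism(true_result=1234, sensitivity=1, epsilon=0.5)
```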


7 of 9

Randomized Response

What if we don’t trust the data collector with our data?

  • Even if only differentially private statistics are published, the collector still has access to the raw data. What if they get hacked?

Change the differential privacy mechanism to run locally rather than centrally!

Differentially Private Polling Procedure:

  • Call up a person and ask them to flip a coin (without telling us the result)
  • If Heads, they tell us their honest answer to the question (“Yes”/“No”)
  • If Tails, they flip the coin again
    • If Heads, they report “Yes”
    • If Tails, they report “No”

Key idea: We can learn aggregate trends without knowing any individual’s true answer
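The procedure above is easy to simulate; here is a sketch (the population size and true rate are made up):

```python
import random

def randomized_response(true_answer):
    # true_answer: True means "Yes", False means "No"
    if random.random() < 0.5:        # first flip: Heads -> answer honestly
        return true_answer
    return random.random() < 0.5     # Tails -> flip again: Heads "Yes", Tails "No"

# Simulate polling a population where 1/3 truly answer "Yes"
responses = [randomized_response(random.random() < 1 / 3) for _ in range(100_000)]
print(sum(responses) / len(responses))  # ~5/12 ≈ 0.417 (analyzed on the next slide)
```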


8 of 9

Randomized Response Analysis

Key property: Responses match the truth ¾ of the time and contradict it ¼ of the time. Half the time people answer honestly, and the other half they give a random answer, which lines up with the truth half of the time.

To see why this works, suppose we know the true answer is “Yes” for ⅓ of people. What fraction of “Yes” responses would we expect from this procedure?

  • ⅓ of the population has the true answer “Yes”; ¾ of them will tell the truth, so honest “Yes” responses make up ⅓ × ¾ = ¼ of all responses
  • ⅔ of the population has the true answer “No”, but ¼ of the time they will randomly report “Yes”, contributing another ⅔ × ¼ = ⅙ of responses
  • Total fraction of “Yes” responses expected (on average): ¼ + ⅙ = 5/12

In general, work backwards from the observed fraction of “Yes” responses to solve for the underlying probability (see the sketch below)
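For this scheme the algebra is short: if a fraction p of the population truly answers “Yes”, we observe ¼ + p/2 “Yes” responses on average, so p = 2 × (observed − ¼). A one-line sketch:

```python
def estimate_true_yes_rate(observed_yes_rate):
    # observed = (3/4) * p + (1/4) * (1 - p) = 1/4 + p / 2
    # solving for p:  p = 2 * (observed - 1/4)
    return 2 * (observed_yes_rate - 0.25)

print(estimate_true_yes_rate(5 / 12))  # 0.333..., recovering the true 1/3
```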


9 of 9

Group Work:

Best Practices

When you first start working with this group:

  • Introduce yourself!
  • If possible, angle one of your screens so that everyone can discuss together

Tips:

  • Start by making sure everyone agrees to work on the same problem
  • Make sure everyone gets a chance to contribute!
  • Ask if everyone agrees and periodically ask each other questions!
  • Call TAs over for help if you need any!
