1 of 41

Differential Privacy: Meanings and Caveats

Ninghui Li

Department of Computer Science Purdue University

2 of 41

Defining Privacy is Hard

  • Lots of privacy notions
    • E.g., k-anonymity, l-diversity, t-closeness, and many others
  • Why is defining privacy hard?
    • Difficult to agree on the adversary's goal.
    • Difficult to agree on the adversary's power.
    • If the notion is too strong, it is not achievable.
    • If it is too weak, it does not protect enough.
    • Information is correlated.


3 of 41

What is Privacy?

It is complicated!

Some concepts from the book “Understanding Privacy” by Daniel J. Solove: 

  1. the right to be let alone
  2. limited access to the self
  3. secrecy—the concealment of certain matters from others;
  4. control over others' use of information about oneself
  5. personhood—the protection of one’s personality, individuality, and dignity;
  6. intimacy—control over, or limited access to, one’s intimate relationships or aspects of life.


4 of 41

Impossibility of “Privacy as Secrecy”

  • Dalenius [1977] proposed this privacy notion: “Access to a statistical database should not enable one to learn anything about an individual that could not be learned without access.”
    • Similar to the notion of semantic security for encryption
    • Requires a prior-to-posterior bound
    • Not possible if one wants utility.
  • Terry’s height example:
    • The adversary knows “Terry is two inches shorter than the average Lithuanian woman.”
    • If one publishes the average height of Lithuanian women, the adversary learns Terry’s height.


5 of 41

Another Example

  • Assume that the fact that smoking causes lung cancer is not yet public knowledge, and that an organization has conducted a study demonstrating this connection and now wants to publish the results.
  • A smoker, Carl, was not involved in the study, but complains that publishing the results of this study affects his privacy: others would know that he has a higher chance of getting lung cancer, and as a result he may suffer damages, e.g., his health insurance premium may increase.

  • Can Carl legitimately complain that publishing the results of the study violates his privacy?


6 of 41

Different Manifestations of the Impossibility Result

  • Dwork & Naor: “absolute disclosure prevention (while preserving utility at the same time) is impossible because of the arbitrary auxiliary information the adversary may have”.
  • Kifer and Machanavajjhala: “achieving both utility and privacy is impossible without making assumptions about the data.”
  • Li et al. (membership privacy framework): without restricting the adversary’s prior belief about the dataset distribution, achieving privacy requires publishing essentially the same information for any two datasets.


  • Dwork & Naor: On the Difficulties of Disclosure Prevention in Statistical Databases or The Case for Differential Privacy, Journal of Privacy and Confidentiality, 2008.
  • Kifer and Machanavajjhala: No Free Lunch in Data Privacy, SIGMOD 2011.
  • Li et al.: Membership privacy: a unifying framework for privacy definitions, CCS 2013.

7 of 41

Analogies with Crypto

  • Semantic security can be achieved. Why can’t we achieve a privacy notion similar to semantic security?
    • There are two kinds of recipients in encryption (the intended recipient and the adversary), but only one in the privacy setting, where the data recipient is also the potential adversary.
  • What about order/property-preserving encryption?
    • Security is defined as simulating an ideal world.
  • The “real-world vs. ideal-world” approach is also used in Secure Multiparty Computation.


8 of 41

Differential Privacy [Dwork et al. 2006]

  • A randomized mechanism A satisfies ε-differential privacy if, for every pair of neighboring datasets D and D′ and every set S of possible outputs, Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S].
  • Intuitively, the output distribution changes very little when any one record is changed.
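As an illustration (not part of the original slide), here is a minimal sketch of the Laplace mechanism, which satisfies ε-DP for a counting query of sensitivity 1:

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Return a noisy count satisfying epsilon-DP.

    A counting query has sensitivity 1: adding or removing one record
    changes the true count by at most 1, so Laplace noise with scale
    1/epsilon suffices.
    """
    true_count = sum(1 for record in data if predicate(record))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: noisy number of records with value above 10, with epsilon = 0.5.
data = [3, 12, 7, 25, 18, 9]
print(laplace_count(data, lambda x: x > 10, epsilon=0.5))
```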


9 of 41

Bounded and Unbounded DP

  • Bounded DP
    • Neighboring: replacing one item with another
    • Two neighboring datasets have the same number of items
    • Publishing the exact number of items does not violate privacy
  • Unbounded DP
    • Neighboring: adding or removing one item
    • Publishing the exact number of items violates privacy
    • ε-UDP implies (2ε)-BDP (derivation sketched below)
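A short justification of the last bullet, added here since the slide states it without proof: replacing a record r with r′ can be decomposed into removing r and then adding r′, so the unbounded-DP guarantee applies twice.

```latex
% D and D' are bounded-DP neighbors: D' is D with record r replaced by r'.
% Let D_0 = D minus r, so (D, D_0) and (D_0, D') are unbounded-DP neighbors.
\Pr[A(D) \in S] \le e^{\varepsilon}\,\Pr[A(D_0) \in S]
                \le e^{2\varepsilon}\,\Pr[A(D') \in S]
```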


10 of 41

Properties of DP

  • Post-processing Invariance
    • If A1 satisfies ε1-DP, then A2(A1(⋅)) always satisfies ε1-DP
  • Sequential Composability
    • If A1 satisfies ε1-DP and A2 satisfies ε2-DP, then releasing the outputs of both A1 and A2 satisfies (ε1+ε2)-DP (see the sketch below)
    • ε is known as the privacy budget.
  • Parallel Composability
    • If D is partitioned into two parts, applying A1 and A2 to the two parts satisfies (max(ε1,ε2))-DP
    • When the partition is based on the data domain, this holds only under unbounded DP


Need to be careful with conditions for neighboring!
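Not on the original slide: a small sketch, under the assumption of simple Laplace-noised counting queries (the noisy_count helper is hypothetical), illustrating how the budgets combine under sequential and parallel composition.

```python
import numpy as np

def noisy_count(data, predicate, epsilon):
    """Counting query with Laplace noise; satisfies epsilon-DP (sensitivity 1)."""
    true = sum(1 for x in data if predicate(x))
    return true + np.random.laplace(0.0, 1.0 / epsilon)

data = [23, 35, 41, 52, 67, 29]
eps1, eps2 = 0.5, 0.5

# Sequential composition: both queries touch the same records,
# so the total privacy budget is eps1 + eps2.
q1 = noisy_count(data, lambda age: age >= 40, eps1)
q2 = noisy_count(data, lambda age: age < 40, eps2)
sequential_budget = eps1 + eps2  # 1.0

# Parallel composition: the queries touch disjoint parts of the data,
# so the budget is max(eps1, eps2). When the partition is defined by
# the data domain, this holds only under unbounded DP.
young = [x for x in data if x < 40]
old = [x for x in data if x >= 40]
p1 = noisy_count(young, lambda age: True, eps1)
p2 = noisy_count(old, lambda age: True, eps2)
parallel_budget = max(eps1, eps2)  # 0.5
```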

11 of 41

Kasiviswanathan-Smith’s Formulation of DP’s Real-Ideal World Guarantee

  • Databases are assumed to be vectors in V^n,
    • V is the domain from which each tuple is drawn,
    • n is the length of the input dataset.
  • The adversary’s external knowledge is captured by a prior probability distribution b on V^n.
  • A mechanism is said to have α-semantic privacy if for any b, t, i, we have SD(b[⋅|t], b_i[⋅|t]) ≤ α,
    • where SD is the statistical distance (total variation distance): SD(P,Q) = max_T |P(T) - Q(T)|
    • b[⋅|t] is the posterior belief after seeing output t
    • b_i[⋅|t] is the posterior belief after seeing t in the ideal world where the i-th record is replaced.
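Not part of the slide: a tiny sketch of the statistical distance above for discrete distributions represented as dictionaries (the names here are illustrative).

```python
def statistical_distance(p, q):
    """Total variation distance between two discrete distributions.

    For distributions over a finite set, max_T |P(T) - Q(T)| equals
    half the L1 distance between the probability vectors.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

# Example: two posteriors over a binary secret.
real_posterior = {"disease": 0.6, "healthy": 0.4}
ideal_posterior = {"disease": 0.5, "healthy": 0.5}
print(statistical_distance(real_posterior, ideal_posterior))  # 0.1
```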


12 of 41

Kasiviswanathan-Smith’s Bayesian Formulation (continued)

  • Results:
    • ε-DP implies α-semantic privacy for α = e^ε - 1.
    • For 0 < ε ≤ 0.45, (ε/2)-semantic privacy implies 3ε-DP.
  • Our Critique:
    • Using a distribution over V^n to model the adversary’s external knowledge is limiting; it does not model the adversary’s knowledge about the world.
    • The relationship among the parameters is messy.


Ganta, Kasiviswanathan, and Smith: Composition Attacks and Auxiliary Information in Data Privacy, KDD 2008.

Kasiviswanathan and Smith: On the ‘Semantics’ of Differential Privacy: A Bayesian Formulation, arXiv 2013; Journal of Privacy and Confidentiality, 2014.

13 of 41

Our Formulation of DP’s Real-World Ideal-World Privacy Guarantee



14 of 41

The Genius of the Idea Behind DP

  • Defining privacy is hard because the information to hide is difficult to enumerate and information may be correlated.
  • By identifying the world without one individual’s data as the ideal world for that individual, and providing a real-world-to-ideal-world bound, DP does not need to provide a prior-to-posterior bound and does not need to deal with data correlation.
  • DP captures the notion that privacy is “control over others' use of information about oneself.”


15 of 41

DP’s Similar-Decision-Regardless-of-Prior Guarantee

  • Regardless of external knowledge, an adversary with access to the sanitized database makes similar decisions whether or not one individual’s data is included in the original database.


16 of 41

The Personal Data Principle

  • Data privacy means giving an individual control over his or her personal data. An individual's privacy is not violated if no personal data about the individual is used.
  • Privacy does not mean that no information about the individual is learned, or no harm is done to an individual; enforcing the latter is infeasible and unreasonable.


17 of 41

OECD Privacy Principles

  • 1. Collection Limitation Principle
    • There should be limits to the collection of personal data and any such data should be obtained by lawful and fair means and, where appropriate, with the knowledge or consent of the data subject.
  • 2. Data Quality Principle
    • Personal data should be relevant to the purposes for which they are to be used, and, to the extent necessary for those purposes, should be accurate, complete and kept up-to-date.


18 of 41

OECD Privacy Principles

  • 3. Purpose Specification Principle
    • The purposes for which personal data are collected should be specified not later than at the time of data collection and the subsequent use limited to the fulfilment of those purposes or such others as are not incompatible with those purposes and as are specified on each occasion of change of purpose.
  • 4. Use Limitation Principle
    • Personal data should not be disclosed, made available or otherwise used for purposes other than those specified in accordance with Principle 3 except:
    • a) with the consent of the data subject; or
    • b) by the authority of law.


19 of 41

OECD Privacy Principles

  • 5. Security Safeguards Principle
    • Personal data should be protected by reasonable security safeguards against such risks as loss or unauthorized access, destruction, use, modification or disclosure of data.
  • 6. Openness Principle
    • There should be a general policy of openness about developments, practices and policies with respect to personal data. Means should be readily available of establishing the existence and nature of personal data, and the main purposes of their use, as well as the identity and usual residence of the data controller.


20 of 41

OECD Privacy Principles

  • 7. Individual Participation Principle
    • An individual should have the right:
    • a) to request to know whether or not the data controller has data relating to him;
    • b) to request data relating to him, …
    • c) to be given reasons if a request is denied; and
    • d) to request the data to be rectified, completed or amended.
  • 8. Accountability Principle
    • A data controller should be accountable for complying with measures which give effect to the principles stated above.


21 of 41

Critique of DP

  • From [Kifer and Machanavajjhala, 2011]
  • Additional popularized claims have been made about the privacy guarantees of differential privacy. These include:
    • It makes no assumptions about how data are generated.
    • It protects an individual’s information (even) if an attacker knows about all other individuals in the data.
    • It is robust to arbitrary background knowledge.


Kifer and Machanavajjhala: No Free Lunch in Data Privacy, SIGMOD 2011.

22 of 41

An Attempt at Providing Prior-to-Posterior Bound in [Dwork et al. 2006]

  • A mechanism is said to be (k, ε)-simulatable if, for every informed adversary who already knows all but k entries of the dataset D, every output, and every predicate f, the change in the adversary's belief about f is multiplicatively bounded by e^ε.
  • Theorem: ε-DP is equivalent to (1, ε)-simulatability.
  • Does this mean ε-DP provides prior-to-posterior bound for an arbitrary adversary?
    • Wouldn’t that conflict with the impossibility results?


Dwork et al.: Calibrating Noise to Sensitivity in Private Data Analysis. TCC 2006.

23 of 41

An Example Adapted from [Kifer and Machanavajjhala, 2011]

  • Bob or one of his 9 immediate family members may have contracted a highly contagious disease, in which case the entire family would have been infected. An adversary asks the query “how many people at Bob's family address have this disease?”
  • What can be learned from an answer produced while satisfying ε-DP?
    • Answer: The adversary's belief about Bob's disease status may change by a factor close to e^{10ε}.
  • Anything wrong here?


24 of 41

In A Sense, No

  1. An adversary’s belief about Bob’s disease status may change by a factor of e^{10ε} due to data correlation. This is an example showing that DP cannot bound the prior-to-posterior belief change against arbitrary external knowledge.
  2. DP’s guarantee about real-to-ideal bound remains valid.
  3. The analysis in [Dwork et al. 2006] is potentially misleading, because it could lead one to think that DP can offer more protection than it actually does.
    • The notion of informed adversary, while appearing strong, is in fact, very limiting.
  4. Applying the Personal Data Principle (PDP), ε-DP is doing what it is supposed to do; but stay tuned.
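Added here, not on the original slide: the e^{10ε} figure follows from chaining the DP guarantee across the 10 correlated records (the standard group-privacy argument), writing D_0 for the real dataset, D_10 for the dataset with all 10 family records changed, and D_1, ..., D_9 for intermediate datasets that each change one more record.

```latex
\Pr[A(D_0) \in S]
  \le e^{\varepsilon}\,\Pr[A(D_1) \in S]
  \le \cdots
  \le e^{10\varepsilon}\,\Pr[A(D_{10}) \in S]
```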


25 of 41

Caveats of Applying DP

  • How neighboring datasets are defined
  • How many pieces of information from one user are collected in the local setting
  • What constitutes an individual’s data
  • One individual’s data or personal data under one individual’s control
  • Group privacy
  • Moral challenge
  • Choosing epsilon value
  • Learning models and applying to individuals
  • Privacy and discrimination


26 of 41

Defining Neighbors Incorrectly

  • Edge-DP for graph data is inappropriate
    • Typically one individual controls a node and its relationships.
    • “Attacks” on graph anonymization are typically in the form of node identification.
    • Even if the goal is only to protect edge information, edge-DP still fails because of correlation between edges.
  • Packet-level DP for networking data is inappropriate
  • Cell-level DP in matrix data is usually inappropriate
  • Pixel-level DP for images is meaningless
  • Single-picture-level DP, where one individual has many pictures, is likely inappropriate


27 of 41

Local Setting

  • Google’s RAPPOR system is not good enough
    • Erlingsson et al. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. CCS 2014.
    • One system may collect answers to many questions, and each question is answered with privacy budget ε, so the budget spent on one user grows with the number of questions.
  • Apple seems to be doing the same
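Added for context, not from the original slide: RAPPOR-style local DP builds on randomized response. Below is a minimal sketch of basic binary randomized response, which satisfies ε-local-DP for a single bit; it is not RAPPOR's full Bloom-filter pipeline, and the names are illustrative.

```python
import math
import random

def randomized_response(bit, epsilon):
    """Report one private bit with epsilon-local-DP.

    Report the true bit with probability e^eps / (e^eps + 1) and the
    flipped bit otherwise; the likelihood ratio of any report is then
    at most e^eps.
    """
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if random.random() < p_truth else 1 - bit

def estimate_frequency(reports, epsilon):
    """Unbiased estimate of the fraction of users whose true bit is 1."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)

# Each user perturbs locally; the aggregator debiases the noisy sum.
epsilon = 1.0
true_bits = [1] * 300 + [0] * 700
reports = [randomized_response(b, epsilon) for b in true_bits]
print(estimate_frequency(reports, epsilon))  # roughly 0.3
```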


28 of 41

What Constitutes An Individual’s Personal Data?

  • Is the genome of my parents, children, siblings, or cousins “my personal information”?

  • Example: DeCode Genetics, based in Reykjavík, says it has collected full DNA sequences on 10,000 individuals. And because people on the island are closely related, DeCode says it can now also extrapolate to accurately guess the DNA makeup of nearly all other 320,000 citizens of that country, including those who never participated in its studies.


29 of 41

Such legal and ethical questions still need to be resolved

  • Evidence suggests that such privacy concerns will be recognized.
  • In 2003, the Supreme Court of Iceland ruled that a daughter has the right to prohibit the transfer of her deceased father's health information to a Health Sector Database. The ruling was not based on her right to act as a substitute for her deceased father, but on the recognition that she might, on the basis of her own right to privacy, have an interest in preventing the transfer, since information about her father's hereditary characteristics could be inferred from such data and might also apply to herself.


https://epic.org/privacy/genetic/iceland_decision.pdf

30 of 41

Lesson

  • When dealing with genomic and health data, one cannot simply invoke the Personal Data Principle to say that correlation does not matter; one may have to quantify and deal with such correlation.


31 of 41

My Personal Data or Personal Data Under My Control?

  • Consider the following variants of the Bob example.
  • Case (a). Bob lives in a dorm building with 9 other unrelated individuals. Either all of them have the disease or none of them does. One can query how many individuals at this address have the disease.
  • Case (b). The original example: Bob and 9 family members.
  • Case (c). Bob and 9 minors for which Bob is the legal guardian.


32 of 41

Our Tentative Answer

  • Case (a). Bob and 9 other unrelated individuals.
    • DP does what it is supposed to do, based on the Personal Data Principle.
  • Case (b). The original example: Bob and 9 family members.
    • Difficult to say: on the borderline and not enough information.
  • Case (c). Bob and 9 minors
    • Using DP this way is inappropriate, because Bob controls the 9 other records as well.


33 of 41

Group Privacy as a Potential Challenge to Personal Data Principle

  • Can a group of individuals, none of whom has specifically authorized the use of their personal information, together sue on privacy grounds because aggregate information about them is leaked?
    • If so, satisfying DP is not sufficient.
    • Would size of group matter?


34 of 41

A Moral Challenge to DP

  • Question from Quora:
    • Say I steal 2 cents from every bank account in America. I am proven guilty, but everyone I stole from says they're fine with it. What happens?

  • If one makes a profit from applying DP to a dataset of many individuals, isn’t this morally the same as the above?


35 of 41

How to Choose ε

  • From the inventors of DP: “The choice of ϵ is essentially a social question. We tend to think of ϵ as, say, 0.01, 0.1, or in some cases, ln 2 or ln 3”.

  • Our position.
    • An ε between 0.1 and 1 is often acceptable
    • An ε close to 5 might be applicable in rare cases, but needs careful analysis
    • An ε above 10 means very little

  • Why?


36 of 41

Consult This Table of Change in Belief (p is the prior; table entries are the worst-case posterior)
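The table itself did not survive conversion. As a stand-in (added here, not the original table), the sketch below generates a table of the kind the title describes, using the standard worst-case bound: under ε-DP the likelihood ratio of any output is at most e^ε, so a prior p can move to a posterior of at most p·e^ε / (p·e^ε + 1 - p).

```python
import math

def max_posterior(prior, epsilon):
    """Worst-case posterior for prior p when the Bayes factor is bounded by e^eps."""
    return prior * math.exp(epsilon) / (prior * math.exp(epsilon) + (1.0 - prior))

priors = [0.01, 0.1, 0.3, 0.5]
epsilons = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]

print("p \\ eps " + "".join(f"{e:>8}" for e in epsilons))
for p in priors:
    print(f"{p:>8}" + "".join(f"{max_posterior(p, e):8.3f}" for e in epsilons))
```

For example, with p = 0.5 the bound is about 0.73 at ε = 1 but about 0.99995 at ε = 10, which is one way to read the claim that ε above 10 means very little.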


37 of 41

Applying a Model Learned with DP Arbitrarily

  • There are two steps in big-data analysis:
    • Learning a model from the data of individuals in a set A
    • Applying the model to individuals in a set B: using some (typically less sensitive) personal information about each individual, one can infer (typically more sensitive) personal information.
      • The sets A and B may overlap.
  • The notion of DP deals only with the first step.
  • Even if a model is learned while satisfying DP, applying it may still raise privacy concerns, because it uses each individual’s personal information.


38 of 41

The Target Pregnancy Prediction Example

  • Target assigns every customer a Guest ID number and stores a history of everything they've bought and any demographic information Target has collected from them or bought from other sources.
  • Looking at historical buying data for all the ladies who had signed up for Target baby registries in the past, Target's algorithm was able to identify about 25 products that, when analyzed together, allowed Target to assign each shopper a ``pregnancy prediction'' score.
  • Target could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy.


https://www.nytimes.com/2012/02/19/magazine/shopping-habits.html

39 of 41

Privacy and Discrimination

  • What if one applies a classifier (trained in a privacy-preserving way) to public information (such as gender, age, race, or nationality) and makes decisions accordingly?
  • Is there privacy concern?
  • Maybe okay for privacy, but that may be discrimination.
  • Better privacy may cause more discrimination!
    • From Wheelan’s book “Naked Economics”
    • Hiring African Americans with (and w/o) criminal background checks.


40 of 41

Caveats of Applying DP

  • How neighboring datasets are defined
  • How many pieces of information from one user are collected in the local setting
  • What constitutes an individual’s data
  • One individual’s data or personal data under one individual’s control
  • Group privacy
  • Moral challenge
  • Choosing epsilon value
  • Learning models and applying to individuals
  • Privacy and discrimination


41 of 41

When is ϵ-DP Good Enough?

  • Applying ϵ-DP in a particular setting provides a sufficient privacy guarantee when the following conditions hold:
    • (0) The group-privacy and morality challenges do not apply;
    • (1) The Personal Data Principle can be applied;
    • (2) All data one individual controls are included in the difference of two neighboring datasets;
      • With (1) and (2), even if some information about an individual is learned because of correlation, one can defend DP.
    • (3) An appropriate ϵ value is used.
