De-Anonymizing Social Networks: FAQ
Last update: May 8, 2009Social network users don't care about their privacy.
Furthermore, while the popular meaning of the term "social network" refers to online social networks or social-network services, we use the term in its technical sense, which is much broader. For example, it encompasses phone-call graphs and other networks such as this sexual network. See the survey in our paper for more examples.
A distinction must also be made between semi-public and public information. Users of online social networks take comfort in the fact that only their friends can see some or all of the information they share, no matter how loosely the term "friend" may be defined in this context. Many individuals who share information with their online friends are not comfortable sharing it with the entire world. This difference is captured by the emerging paradigm of contextual integrity, which aims to formalize individuals' privacy expectations for information transmission in various contexts.
Twitter and Flickr are public networks. Where's the privacy breach?
We are computer scientists, not attackers. Our motivation was not to breach anyone's privacy, but to demonstrate the efficacy and practicality of de-anonymization in the context of social networks. This would not have been possible with non-public networks, because we would not have been able to measure the success of our de-anonymization algorithm.
While Flickr and Twitter afforded a small, feasible case study, our algorithm can be applied to any two social networks as long as some small fraction of their edges captures real-life relationships.
Who would want to de-anonymize a social network graph and why?
We describe several scenarios in the paper.
The strongest adversary is a government-level agency interested in global surveillance. Its objective is large-scale collection of detailed information about as many individuals as possible. Another attack scenario involves abusive marketing. If an unethical company were able to de-anonymize the graph using publicly available data, it could engage in abusive marketing aimed at specific individuals. Phishing and spamming also gain from social-network de-anonymization. Using detailed information about the victim gleaned from his or her de-anonymized social-network profile, a phisher or a spammer will be able to craft a highly individualized, believable message (cf. [JJJM07]). Yet another category of attacks involves targeted de-anonymization of specific individuals by stalkers, investigators, nosy colleagues, employers, and neighbors. Their objective is to use this information to recognize the victim's node in the anonymized network and to learn sensitive information about her, including all of her social relationships in that network.
So where are the actual datasets that might get de-anonymized?
In the paper, we compiled an extensive list of scenarios that involve sharing of social-network data, often in an anonymized form. A few significant categories are:
- Advertisers on online social networks.
- Application developers on online social networks.
- Anonymous graphs published for academic research--either crawled from online social networks or collected through other means.
- Phone-call graphs outsourced for fraud detection, etc.
- A variety of special-purpose graphs, such as health networks (of doctors and patients).
Sharing of anonymized graphs does not violate privacy laws.
In the full version of the paper, we survey potentially applicable privacy laws. We are not lawyers, but the meaning of key terms such as "personally identifiable information" seems to vary from legislation to legislation. The liability arising from de-anonymization has not been tested in courts. We note that Canada's PIPEDA and the EU directive 95/46/EC do encompass deductive disclosure or inference-based disclosure.
Your algorithm cannot possibly work. Graph isomorphism is hard.
Graph isomorphism for real-world social networks is in fact linear-time for all practical purposes. Each user's network neighborhood is unique.
So your algorithm exploits the fact that everybody's network neighborhood is unique. Where's the novelty here?
Our task is far harder than graph isomorphism because a user's neighborhoods on two different networks are only somewhat similar and far from isomorphic. The overlap between the user populations of Flickr and Twitter at the time of our study was small, and among the common users, the fraction of overlapping relationships was only 14%. We had to develop a variety of heuristics to tease out the mapping between the graphs, including self-testing and self-correction.
Hasn't this already been done by Backstrom, Dwork and Kleinberg?
We describe the differences between our work and that of Backstrom et al. in our paper.
The primary difference is that our techniques allow the attacker to map two entirely different graphs, obtained from different sources, to each other. Thus the potential scale of our attacks is much bigger, making most nodes in the anonymous graph potentially vulnerable. Second, our attack model is passive, i.e., the attacker doesn't have to introduce fake users or tamper with the actual network in any way, making the attack virtually undetectable.
Your algorithm needs a few dozens of "seed" mappings to get started. Where would an attacker obtain those?
In the paper, we explain how to obtain the "seed" mappings.
If the attacker is already a user of the social network, he knows all details about his own node and its neighbors [KMNX08; SZD08]. Some networks permit manual access to profiles even if large-scale crawling is restricted (e.g., at the time of our study, Facebook allowed viewing of information about "friends" of any member by default). Some users may make their details public even in networks that keep them private by default. The attacker may even pay a handful of users for information about themselves and their friends [LKG+08], or learn it from compromised computers or stolen mobile phones. For example, the stored log of phone calls provides auxiliary information for de-anonymizing the phone-call graph. With an active attack (e.g., [BDK07]), the attacker may create fake nodes and edges with features that will be easy to recognize in the anonymized graph, such as a clique or an almost-clique.
You can inject random noise as a protection mechanism.
We present quantitative evidence that our re-identification algorithm withstands injection of noise. The amount of noise that would be required to foil our algorithm is likely render the graph useless for most practical purposes.
You can use k-anonymity as a protection mechanism.
In real graphs with high average degree, achieving k-anonymity is likely to result in destruction of the topological structure of the graph, rendering it useless for most practical purposes. Furthermore, it is not clear that k-anonymity provides meaningful privacy guarantees in this or any other context.
So, what's the solution?
We do not believe that there exists a technical solution to the problem of anonymity in social networks. Specifically, we do not believe that any graph transformation can (a) satisfy a robust definition of privacy, (b) withstand de-anonymization attacks described in our paper, and (c) preserve the utility of the graph for common data-mining and advertising purposes. Therefore, we advocate non-technical solutions.
First, the false dichotomy between personally identifiable and non-personally identifiable information should disappear from privacy policies, laws, etc. Any aspect of an individual's online personality can be used for de-anonymization, and this reality should be recognized by the relevant legislation and corporate privacy policies. Laws, regulations, and business practices need to be adjusted to reflect the fine gradations of intents and purposes underlying information sharing and disclosure.
Second, social-network operators should stop relying on anonymization as the "get out of jail" card insofar as user privacy is concerned. They should inform users when their information is disclosed to third parties, even if this information has been anonymized, and give them an opportunity to opt out. This need not jeopardize the business case of online social networks. We expect that most users will not object to their information being used for commercial purposes such as targeted advertising. Furthermore, network operators who only deal with reputable advertisers and marketing partners may even enjoy a competitive advantage in the marketplace.