1 of 40

Chapter 11 Link Prediction

Summary

  • Predicting Network Evolution
  • Unsupervised approaches
  • Supervised approaches
  • Evaluation

Reading

  • Liben-Nowell & Kleinberg paper

2 of 40

Link Prediction

Goal

Understanding how networks evolve

Problem definition

Given a snapshot of a network at time t, (accurately) predict the edges that will appear in the network during the interval (t, t+1)

Liben-Nowell, David, and Jon Kleinberg. "The link-prediction problem for social networks." Journal of the American Society for Information Science and Technology 58.7 (2007): 1019-1031.

3 of 40

Examples of Uses of Link Prediction

Monitor terrorist networks – deducing possible interactions/missing links between terrorists (without direct evidence)

Suggest interactions or collaborations that haven’t yet been exploited within an organization

Friendship prediction (e.g., Facebook)

4 of 40

Link Prediction

Link prediction is used to predict future possible links in the network (e.g., Facebook).

Or it can be used to predict missing links due to incomplete data (e.g., food webs)

Berlusconi, Giulia, Francesco Calderoni, Nicola Parolini, Marco Verani, and Carlo Piccardi. "Link Prediction in Criminal Networks: A Tool for Criminal Intelligence Analysis."

5 of 40

  1. Given a graph G = (V,E) the set of possible edges to be predicted is O(|V|^2);

  • Real networks tend to be sparse

False Positive prediction issue (i.e., we are likely to predict edges that will never appear)

Link Prediction

Task Complexity

6 of 40

Scientists who are close in the network (i.e., have common colleagues) will likely collaborate in the future.

Concretizing an Intuition…

Goal:

  • make this intuitive notion precise and understand which measures of “proximity” lead to accurate predictions

7 of 40

  1. Consider as input a graph G at time t
  2. Consider all the possible pairs of nodes (u,v)
  3. Compute a link formation score: score(u,v)
  4. Build a list of all possible edges ordered by score (from highest to lowest)
  5. Verify, following that ordering, the predictions on the graph at time t+1

score is a measure of proximity; a minimal sketch of this workflow follows
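A minimal sketch of the workflow above, assuming NetworkX and using the Jaccard coefficient as score(u,v); the function name and the top-k precision check are illustrative choices, not part of the original paper.

```python
# A minimal sketch of the unsupervised link prediction workflow,
# assuming NetworkX; any proximity score could replace Jaccard here.
import networkx as nx

def predict_links(G_t, G_t1, top_k=100):
    """Rank the non-adjacent pairs of G_t by score and verify them on G_t1."""
    candidates = nx.non_edges(G_t)                              # steps 1-2
    scored = nx.jaccard_coefficient(G_t, ebunch=candidates)     # step 3
    ranked = sorted(scored, key=lambda triple: triple[2], reverse=True)  # step 4
    hits = sum(1 for u, v, _ in ranked[:top_k] if G_t1.has_edge(u, v))   # step 5
    return hits / top_k                                         # precision at k
```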

Link Prediction

Workflow

8 of 40

Link Prediction

Unsupervised

Supervised

Define a set of proximity measures unrelated to the particular network

Extract knowledge from the network in order to improve prediction accuracy

Approaches

9 of 40

Unsupervised Link Prediction

10 of 40

Unsupervised Link Prediction

Unsupervised measures rely on different structural properties of the network

Neighborhood measures

  • Common Neighbors, Adamic Adar, Jaccard, Preferential Attachment

Path-based measures

  • Graph distance, Katz

Ranking

  • SimRank, Hitting Time, PageRank

Liben-Nowell, David, and Jon Kleinberg. "The link-prediction problem for social networks." Journal of the American Society for Information Science and Technology 58.7 (2007): 1019-1031.

11 of 40

How many friends do we have to share in order to become friends?

Common Neighbors: the more friends we share, the more likely we will become friends

Jaccard: the more similar our friend circles are, the more likely we will become friends

Unsupervised Link Prediction

Neighborhood measures

12 of 40

How many friends do we have to share in order to become friends?

Adamic Adar: the more selective our mutual friends are, the more likely we will become friends

Preferential Attachment: the more friends we have, the more likely we will become friends
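A small illustration of these four neighborhood scores, assuming NetworkX, whose generators yield (u, v, score) triples; the karate club graph and the candidate pair (0, 33) are arbitrary examples.

```python
# Four neighborhood scores for a single candidate pair (u, v), using NetworkX.
import networkx as nx

G = nx.karate_club_graph()
pair = [(0, 33)]  # a hypothetical candidate pair

cn  = len(list(nx.common_neighbors(G, 0, 33)))       # number of shared friends
jac = next(nx.jaccard_coefficient(G, pair))[2]       # |Γ(u)∩Γ(v)| / |Γ(u)∪Γ(v)|
aa  = next(nx.adamic_adar_index(G, pair))[2]         # Σ 1/log|Γ(z)| over shared friends z
pa  = next(nx.preferential_attachment(G, pair))[2]   # |Γ(u)| * |Γ(v)|
```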

Unsupervised Link Prediction

Neighborhood measures

13 of 40

How distant are we?

Graph Distance: (negated) length of the shortest path between two nodes

Katz: weighted sum over all the paths between two nodes
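A minimal sketch of the Katz score in closed form, assuming NumPy and NetworkX; the karate club graph and the damping factor are arbitrary illustrative choices.

```python
# Katz(u, v) = sum over path lengths l of beta^l * (number of paths of length l),
# which in closed form is (I - beta*A)^(-1) - I, valid for beta < 1/lambda_max.
import numpy as np
import networkx as nx

G = nx.karate_club_graph()                      # arbitrary example graph
A = nx.to_numpy_array(G)
beta = 0.9 / max(abs(np.linalg.eigvals(A)))     # keep the series convergent
katz = np.linalg.inv(np.eye(len(A)) - beta * A) - np.eye(len(A))
# katz[u, v] is the length-weighted count of paths between u and v
```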

Unsupervised Link Prediction

Path-based measures

14 of 40

How similar are we?

SimRank: two nodes are similar to the extent that their neighborhoods are similar
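A short sketch of SimRank, assuming NetworkX; the karate club graph and the pair (0, 33) are arbitrary illustrative choices.

```python
# SimRank of a pair: high when the neighbors of one node are themselves
# similar to the neighbors of the other (a recursive notion of similarity).
import networkx as nx

G = nx.karate_club_graph()
score = nx.simrank_similarity(G, source=0, target=33)
```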

Unsupervised Link Prediction

Ranking

15 of 40

Unsupervised Link Prediction

Limits

  • Different kinds of networks are described by the same general closed formula
  • An average overall performance between 6% and 12%

Measure comparison

  • No single winner
  • Almost all predictors outperform the random predictor ⇒ there is useful information in network topology

16 of 40

Supervised Link Prediction

17 of 40

Supervised Link Prediction

The process is now organized into 4 steps:

    • Split the data into train/test sets
    • Learn a model on the training set
    • Use the model for prediction
    • Compare the predictions with the test set

A natural way to do it: build a “classifier” over a set of network features.

18 of 40

Learn a classifier (e.g., a Decision Tree) over unsupervised LP scores to generalize the assumptions they make about the network growth model, as sketched below.
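A hedged sketch of this stacking idea, assuming NetworkX and scikit-learn; the helper names, the choice of the three scores, and the naive negative sampling are illustrative, not the original method.

```python
# Stack several unsupervised scores into a feature vector per node pair
# and train a decision tree on the pairs observed between t and t+1.
import random
import networkx as nx
from sklearn.tree import DecisionTreeClassifier

def pair_features(G, pairs):
    """Collect several unsupervised LP scores for each candidate pair."""
    feats = {p: [] for p in pairs}
    for gen in (nx.jaccard_coefficient(G, pairs),
                nx.adamic_adar_index(G, pairs),
                nx.preferential_attachment(G, pairs)):
        for u, v, score in gen:
            feats[(u, v)].append(score)
    return [feats[p] for p in pairs]

def train_stacked_classifier(G_t, G_t1, n_negative=1000, seed=42):
    """Positives: edges appearing between t and t+1; negatives: sampled non-edges."""
    random.seed(seed)
    positives = [(u, v) for u, v in G_t1.edges()
                 if u in G_t and v in G_t and not G_t.has_edge(u, v)]
    pos_set = set(positives) | {(v, u) for u, v in positives}
    candidates = [p for p in nx.non_edges(G_t) if p not in pos_set]
    negatives = random.sample(candidates, min(n_negative, len(candidates)))
    X = pair_features(G_t, positives + negatives)
    y = [1] * len(positives) + [0] * len(negatives)
    return DecisionTreeClassifier(max_depth=5).fit(X, y)
```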

Stacking Unsupervised Scores

19 of 40

Supervised Link Prediction

Frequent Pattern Mining

GERM

Evolution rules can be extracted from the network history and used to identify/predict recurrent patterns

  • e.g., generalization of triadic closure

Berlingerio, Michele, et al. "Mining graph evolution rules." Joint European Conference on Machine Learning and Knowledge Discovery in Databases (2009).

20 of 40

Supervised Link Prediction

Network Embedding

Idea

Graphs can be mapped into vector spaces

  • Node/edge similarity scores can be used to define metric spaces
  • Metric spaces enable a more natural application of approaches from DM/ML

NB: Different “mappings” facilitate the solution of different classes of problems
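A DeepWalk-style sketch of one such mapping, assuming NetworkX and gensim; this is only one possible embedding technique, and all parameters and graph choices are arbitrary illustrations.

```python
# Map nodes to vectors by treating truncated random walks as "sentences"
# and training a skip-gram model on them.
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(G, walks_per_node=10, walk_len=20):
    walks = []
    for _ in range(walks_per_node):
        for start in G.nodes():
            walk, node = [start], start
            for _ in range(walk_len - 1):
                node = random.choice(list(G.neighbors(node)))
                walk.append(node)
            walks.append([str(n) for n in walk])
    return walks

G = nx.karate_club_graph()
model = Word2Vec(random_walks(G), vector_size=64, window=5, min_count=1, sg=1)
# a node pair (u, v) can then be represented, e.g., by the element-wise
# product of the two node vectors and fed to any standard classifier
edge_vector = model.wv["0"] * model.wv["33"]
```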

21 of 40

Supervised Link Prediction

Limits

  • No Free Lunch
  • Model construction is often complex and, usually, more time/resource demanding than directly applying unsupervised scores.

Embedding is not The Answer, only a different way to reason on graphs...

Results: higher performance w.r.t. unsupervised approaches

22 of 40

Given a predictor p, is there a way to decide whether it is a “good” one?

First Step:

verify that p outperforms the random predictor.

Random Predictor

each edge has the same probability of appearing in the network

if the ratio between the precision of p and that of the random predictor is > 1, then p is meaningful
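A sketch of this first check, assuming plain Python; the precision of the random predictor is simply the fraction of candidate pairs that actually become edges, since every pair is equally likely to be picked.

```python
# Compare the precision of predictor p with that of the random predictor.
def precision_ratio(predicted_pairs, new_edges, n_candidate_pairs):
    p_precision = len(set(predicted_pairs) & set(new_edges)) / len(predicted_pairs)
    random_precision = len(new_edges) / n_candidate_pairs
    return p_precision / random_precision  # > 1 means p beats the random predictor
```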

Evaluation

23 of 40

Evaluation: Comparing Predictors

We need to analyze the performance ratio, the ROC curve, and/or the Precision-Recall curve.

24 of 40

Precision Vs. Recall

  • Precision: PPV = TP/(TP+FP)
  • Recall: TPR = TP/(TP+FN)

ROC (Receiver operating characteristic)

  • 1-Specificity: FPR = FP/(FP+TN)
  • Recall: TPR = TP/(TP+FN)

Note:

  • ROC and PR spaces are isomorphic (the use of ROC is more widespread)
  • Numerical comparison can be done using the AUROC (area under the ROC curve), as in the sketch below
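A sketch of the numerical comparison, assuming scikit-learn; y_true marks which candidate pairs actually appeared at t+1, y_score holds the predictor's scores, and both arrays here are hypothetical toy values.

```python
# ROC and PR summaries of a link predictor's ranked output.
from sklearn.metrics import roc_auc_score, average_precision_score, roc_curve

y_true = [1, 0, 0, 1, 0, 1]
y_score = [0.9, 0.4, 0.35, 0.8, 0.1, 0.6]

auroc = roc_auc_score(y_true, y_score)             # area under the ROC curve
auprc = average_precision_score(y_true, y_score)   # summary of the PR curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
```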

Evaluation: ROC and PR curve

25 of 40

Accuracy could be improved by extending simple models with more complex (even semantic) information:

  • Link strength
  • Geographical information
  • ...

Link Prediction needs to be revised in some scenarios:

  • Dynamic Networks
  • Multiplex networks
  • ...

Link Prediction

Something more...

26 of 40

Key Messages

Predicting new links that will arise in a network is not easy:

  1. Networks are, usually, sparse
  2. Cold Start Problem
    • What if I don’t have enough information?
      • Can I predict bridges?
  3. False Positive prediction
    • Bridges !?!
  4. Simple approaches are “too simple”
  5. Complex approaches are costly

27 of 40

Case Study:

Interaction Prediction in Dynamic Networks

28 of 40

Case Study: Interaction Prediction

Link Prediction goal:

Predict ties that are not present in the current network configuration.

Ties are persistent structures that, once appeared, cannot disappear (e.g., friendship).

Interaction Prediction goal:

Predict interactions that will occur (either for the first time or not) among nodes already observed in the network.

Interactions are volatile structures that can occur multiple times and whose value can vanish as time goes by (e.g., telephone calls).

Rossetti et al. "Interaction prediction in dynamic networks exploiting community discovery." IEEE/ACM ASONAM, 2015.

29 of 40

Case Study: Interaction Prediction

Given a set G = {G0, ..., Gt, ..., Gτ} of ordered network observations, with t ∈ T = {0, ..., τ}, the interaction prediction problem aims to predict the new interactions that will take place at time τ+1, thus composing Gτ+1.

Idea:

  • Model network evolution through temporal snapshots;
  • False Positive reduction: Community Discovery as a bound for strong ties;
  • Time-Aware approach: time series forecast of topological measures;
  • Supervised Approach: ensemble of classifiers learnt on the topological features, tested on the forecasted ones.

30 of 40

Case Study: Interaction Prediction

Step 1: For each temporal snapshot t ∈ T compute a partition Ct = {Ct,0, ..., Ct,k} of Gt using a community discovery algorithm.

Step 2: For each t ∈ T compute a set of measures F for each node pair (u,v) belonging to at least one community in Ct

Step 3: For each node pair (u,v) and feature f ∈ F build a time series Su,v and apply forecasting techniques in order to obtain its future expected value fu,v

Step 4: Use the set of expected values fu,v to predict future intra-community interactions.

31 of 40

Case Study: Interaction Prediction

Step 1: Community Discovery (CD)

Each CD algorithm proposes its own Community Definition.

  • Demon (ego-network based, overlap)
  • Louvain (modularity, crisp partition)
  • Infomap (map equation, crisp partition)

32 of 40

Case Study: Interaction Prediction

Step 2: Features

On the identified communities we compute three sets of features:

  • Pairwise Structural Features (e.g., Jaccard, CN, Adamic/Adar, ...)
  • Node Topology Features (e.g., PageRank, edge betweenness, ...)
  • Community Features (e.g., density, size, shared communities, avg. clustering, ...)

33 of 40

Case Study: Interaction Prediction

Step 3: Time Series Forecast

For each time series we apply several forecasting models in order to extract the expected future value.
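A minimal sketch of this step, assuming NumPy and a simple linear-trend forecaster; the case study applies several forecasting models, and this is only one illustrative choice with made-up numbers.

```python
# Forecast the next value of a per-pair feature time series.
import numpy as np

def forecast_next(series):
    """Fit a linear trend on the observed snapshots and extrapolate to t+1."""
    t = np.arange(len(series))
    slope, intercept = np.polyfit(t, series, deg=1)
    return slope * len(series) + intercept

# e.g., the Jaccard score of a pair (u, v) observed over four snapshots
expected_jaccard = forecast_next([0.10, 0.12, 0.15, 0.19])
```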

34 of 40

Case Study: Interaction Prediction

Step 4: Classification

Once the features are learned, we design two different experiments:

  • Balanced Scenario: the positive and negative classes are balanced through downsampling (sketched below) to provide a standard baseline

  • Unbalanced Scenario: the positive/negative class ratio of the data is maintained. Due to network sparsity we observe a strong negative prevalence (~98%)
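A minimal sketch of the downsampling used in the balanced scenario, assuming plain Python lists of labeled pairs; the function name is an illustrative choice.

```python
# Downsample the majority (negative) class to the size of the positive class.
import random

def downsample(positive_pairs, negative_pairs, seed=42):
    random.seed(seed)
    return positive_pairs + random.sample(negative_pairs, len(positive_pairs))
```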

35 of 40

Case Study: Interaction Prediction: Balanced Scenario

Very high accuracy and AUC

The contribution of CD approaches to IP is topology-sensitive

36 of 40

Case Study: Interaction Prediction: Balanced Scenario (cont’d)

Which feature set is the most predictive?

False Positive Filtering (FSF) vs. No Filtering (SF)

All Forecast with Filtering vs. No Filtering

37 of 40

Case Study: Interaction Prediction: Unbalanced Scenario

Negative class:

  • Social 95.9%
  • DBLP 98.9%

Very hard baselines

  • majority classifiers score ~.96 and ~.99 precision (always predicting “no edges”)

  • the proposed workflow is able to reach ~.96 and ~.45 precision w.r.t. the positive class

38 of 40

Case Study: Interaction Prediction: What about weak links?

High accuracy is guaranteed by focusing the prediction on intra-community interactions.

Inter-Community Interaction Prediction

Focus on predicting the presence/absence of at least one new interaction across two communities

  • no identification of the “real” endpoints
  • no identification of the multiplicity

Idea

  1. Construct a new network whose meta-nodes are the communities
  2. Apply the same workflow to this graph

39 of 40

Case Study: Interaction Prediction

Even though interaction prediction is a complex problem, it is possible to reach high accuracy through:

  • Target selection: False Positive reduction via Community Discovery; weak interactions treated as “special cases”
  • Local topology history analysis: feature forecast via Time Series analysis

Moreover, each type of dataset demands a specific CD algorithm:

  • One-to-one interactions (e.g., social ones)
  • Many-to-many interactions (e.g., co-authorship relations)

40 of 40

Chapter 11 Conclusion

Take Away Messages

  1. Link wiring patterns define how network topology evolves
  2. Predicting new links is a complex task due to the sparse topology of networks
  3. Static and dynamic formulation of the problem account for different peculiarities

What’s Next

Chapter 12: Dynamic Community Discovery

Suggested Readings

  • Liben-Nowell & Kleinberg paper