1 of 90

Shallow Representation Learning

Node Embedding

2 of 90

Unit Objectives

  • Introduction to the node embedding problem
  • Embedding in simple graphs
  • Embedding in complex graphs

3 of 90

Node Embedding

  • Node Embedding
    • Low dimensional vector representation of a node
    • Preserves some properties of the node in original graph
  • Node in two domains
    • Graph domain: nodes and their connections
    • Embedding domain: node as a continuous vector
  • Node embedding as mapping
    • Mapping a node from the graph domain to the embedding domain

4 of 90

Embedding Methods: Intuition

A good node representation should be able to reconstruct the information we want to preserve

[Figure: embedding pipeline with Mapping, Extractor, and Reconstructor components]

Objective: Reconstruction Loss

5 of 90

Node Embedding: Visual Example

Input

Output

Zachary’s Karate Club Graph

6 of 90

Graph Embedding

Simple Graphs

Node Co-occurrence

Structural Role

Node status

Community Structure

Complex Graphs

Heterogeneous Graphs

Bipartite Graphs

Multi-dimensional Graphs

Signed Graphs

Hypergraphs

Dynamic Graphs

7 of 90

Embedding Simple Graphs

Node Co-occurrence Methods

8 of 90

Notions of Co-occurrence

… if the degree distribution of a connected graph follows a power law, we observe that the frequency with which vertices appear in short random walks will also follow a power-law distribution …

Language

Co-occurrence is the basis of fundamental representations in NLP: vector representations of words and neural language models

9 of 90

Notion of Co-Occurrence

Source: School of Information, Pratt Institute

Co-Authorship Network

An author is connected to a limited set of authors

This can be used as a notion of Co-occurrence

Can we learn node representations that preserve the co-occurrence relations in the network?

10 of 90

Language Models

  •  

11 of 90

Word Frequency in Natural Language

Word frequency in natural language follows a power law

Slide from Bryan Perozzi et al.

 

 

The second most used word appears half as often as the most used word.

The third most used word appears one-third as often, and so on.
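A minimal numeric illustration of this power law (Zipf's law); the frequency values below are hypothetical:

```python
def zipf_frequency(rank, top_frequency):
    """Zipf's law: the frequency of the rank-r most used word falls off as 1/r."""
    return top_frequency / rank

# If the most used word appears 1000 times, the second appears about 500 times
# and the third about 333 times, matching the statement above.
freqs = [zipf_frequency(r, 1000) for r in (1, 2, 3)]
```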

12 of 90

Connection: Language and Graphs

Scale Free Graph

Artificial Language: Short truncated random walks as sentences

Random Walk on Graph

 

Short random walks

Vertex frequency in random walks on scale-free graphs also follows a power law.

13 of 90

Connection: Language and Graph

  • Learning Word Representation
    • Learn a dense representation of words so that the word co-occurrence information is preserved
  • Learning Node Representation
    • Learn dense representation of the nodes in the graph such that node neighborhood information (community structure) in the graph is preserved

14 of 90

Language Model in Concrete Terms

  •  

15 of 90

The Problems

  •  

 

16 of 90

The Hope

  • Solving the optimization problem builds representations that capture the shared similarities in local graph structure between vertices
  • Vertices which have similar neighborhoods will acquire similar representations (encoding co-citation similarity)

17 of 90

DeepWalk

18 of 90

DeepWalk Method

1) Input graph

2) Random Walks

3) Representation Mapping

4) Hierarchical Softmax

5) Output: Representation
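Step 2 above, truncated random walks as the "sentences" fed to Skipgram, can be sketched in a few lines; the toy graph, walk length, and function names are illustrative:

```python
import random

def random_walk(adj, start, walk_length, rng):
    """Truncated uniform random walk, DeepWalk-style: at each step, move to a
    uniformly chosen neighbor of the current node."""
    walk = [start]
    for _ in range(walk_length - 1):
        neighbors = adj[walk[-1]]
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk

# Toy graph: a triangle {0, 1, 2} with a pendant node 3 attached to node 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
rng = random.Random(0)
# Several walks per start node form the corpus fed to Skipgram.
walks = [random_walk(adj, v, 5, rng) for v in adj for _ in range(10)]
```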

19 of 90

SkipGram: Details

  •  

20 of 90

Random Walks

  • A random walk generates a node sequence by repeatedly moving to a uniformly chosen neighbor of the current node

P(c_i = x | c_(i-1) = v) = 1 / d(v) if (v, x) ∈ E, and 0 otherwise

where d(v) is the degree of node v

21 of 90

https://towardsdatascience.com/node2vec-explained-graphically-749e49b7eb6b

22 of 90

Learning the Parameters

  •  

23 of 90

Learning Objective

  •  

24 of 90

Learning Objective (Cntd..)

  •  

25 of 90

Learning Objective (Cntd …)

  •  

26 of 90

SkipGram Architecture

27 of 90

 

  • Two approximation techniques to speed-up
    • Hierarchical Softmax
    • Negative Sampling

 

 

28 of 90

Hierarchical Softmax

  •  

29 of 90

Hierarchical Softmax

  •  

30 of 90

Hierarchical Softmax

  •  
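A sketch of the key computation: the probability of a leaf (vertex) is the product of sigmoid branch probabilities along its root-to-leaf path, so evaluating one vertex costs O(log |V|) instead of O(|V|). The tree and scores below are made-up toy values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hs_probability(path_scores, directions):
    """Hierarchical softmax: path_scores[i] is the inner node's score (w_i . h)
    at step i; directions[i] is 1 for a left branch, 0 for a right branch."""
    p = 1.0
    for s, d in zip(path_scores, directions):
        p *= sigmoid(s) if d == 1 else 1.0 - sigmoid(s)
    return p

# Sanity check on a complete depth-2 tree with 4 leaves: probabilities sum to 1.
scores = {'root': 0.3, 'left': -1.2, 'right': 2.0}
leaves = [
    ([scores['root'], scores['left']], [1, 1]),
    ([scores['root'], scores['left']], [1, 0]),
    ([scores['root'], scores['right']], [0, 1]),
    ([scores['root'], scores['right']], [0, 0]),
]
total = sum(hs_probability(s, d) for s, d in leaves)
```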

 

 

31 of 90

Algorithm

32 of 90

Negative Sampling

  •  

33 of 90

Negative Sampling

  •  

34 of 90

Negative sampling

Target

Supervised Learning: Logistic Regression Model
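The logistic-regression view can be made concrete: the true (target, context) pair is a positive example and the sampled nodes are negatives, giving a per-pair log-likelihood to maximize. The embedding vectors below are toy values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pair_log_likelihood(v, c, negatives):
    """Skipgram with negative sampling as logistic regression:
    log sigma(c . v) + sum over negatives u of log sigma(-u . v)."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    ll = math.log(sigmoid(dot(c, v)))
    ll += sum(math.log(sigmoid(-dot(u, v))) for u in negatives)
    return ll

# A target whose true context is aligned with it, and whose negative points
# away, scores higher than the reverse assignment.
good = pair_log_likelihood([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
bad = pair_log_likelihood([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
```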

 

 

 

 

35 of 90

Objective Function

 

 

 

 

36 of 90

Sampling Distribution

 

Selecting Negative Samples

Negative samples are drawn from the empirical frequency distribution raised to the 3/4 power

Raising to the 3/4 power flattens the distribution, which increases the probability of choosing low-degree nodes
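A sketch of the sampling distribution; raising the empirical (degree) frequency to the 3/4 power, as in word2vec, shifts probability mass toward low-degree nodes. The degree values are illustrative:

```python
def negative_sampling_dist(degrees, power=0.75):
    """Normalize degrees raised to the 3/4 power into a sampling distribution."""
    weights = [d ** power for d in degrees]
    z = sum(weights)
    return [w / z for w in weights]

degrees = [100, 10, 1]
probs = negative_sampling_dist(degrees)
raw = [d / sum(degrees) for d in degrees]  # plain empirical frequency
# The low-degree node's probability rises relative to its raw frequency.
```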

37 of 90

Social and Structural Aspect

38 of 90

node2vec

node2vec: Scalable Feature Learning for Networks, Grover and Leskovec, SIGKDD’16

39 of 90

Node2vec Objective

  • The Goal: embed nodes so that those with similar neighborhoods are also close in the feature space
    • Nodes belonging to the same community
    • Nodes having similar neighborhood profiles
  • Unsupervised: maximum likelihood estimation
  • Requires redefining the neighborhood function to be flexible enough to cover both aspects

40 of 90

Biased Random Walk

BFS 🡪 Local microscopic view of the neighborhood

DFS 🡪 Global macroscopic view of the neighborhood

41 of 90

BFS - DFS Tradeoff with Interpolation

  •  

42 of 90

Biased (2nd Order) Random Walk

First order random walk probability:

P(c_i = x | c_(i-1) = v) = π_vx / Z if (v, x) ∈ E, and 0 otherwise

Second order biased random walk probability, with return parameter p and in-out parameter q:

α_pq(t, x) = 1/p if d_tx = 0, 1 if d_tx = 1, and 1/q if d_tx = 2, where t is the previous node, x a candidate next node, and d_tx their shortest-path distance
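The second-order bias can be sketched directly from its three cases (return to the previous node, stay at distance 1 from it, or move outward); the toy graph and the values of p and q are illustrative:

```python
def node2vec_bias(prev, nxt, adj, p, q):
    """Search bias alpha_pq(t, x) for a 2nd-order walk that just moved t -> v:
    1/p to return to t, 1 to stay at distance 1 from t, 1/q to move outward."""
    if nxt == prev:
        return 1.0 / p
    if nxt in adj[prev]:
        return 1.0
    return 1.0 / q

# Toy graph: triangle {0, 1, 2} with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
# The walk just moved 0 -> 2; score each candidate next node.
p, q = 4.0, 0.5
weights = {x: node2vec_bias(0, x, adj, p, q) for x in adj[2]}
```

With p > 1 and q < 1 this walk is biased outward (DFS-like); p < 1 and q > 1 would keep it local (BFS-like).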

 

 

 

 

43 of 90

Simulating BFS vs DFS

 

 

DFS

BFS

44 of 90

Performance Comparison

45 of 90

Performance Comparison

Multi-Label Node Classification (Macro-F1 scores)

46 of 90

Performance Comparison

(a)

(b)

(c)

(d)

Link Prediction(AUC scores)

47 of 90

struc2vec

48 of 90

Representation Context

 

Similar degree values – 5 and 4

Connected to a similar number of triangles – 3 and 2

Connected to the rest of the network by two nodes

 

49 of 90

struc2vec: Core Ideas

  • Model structural similarity independent of edge and node attributes and position in the network
    • Two nodes having similar local structure
  • Notion of similarity is hierarchical
    • At bottom, depends only on node degree
    • At top, entire network from the viewpoint of a node
  • Generate contexts using random walks on multilayer graph
    • Two nodes appearing in similar contexts will likely have similar structure

50 of 90

struc2vec: Steps

  • Step-1: Determine structural similarity between a node pair
    • Hierarchical measure for structural similarity
    • Assess structural similarity at each level of hierarchy
  • Step-2: Construct a weighted multi-layer graph
    • Each node is present in every layer
    • Each level corresponds to a level in similarity measure hierarchy
    • Edge weight between every node pair within each layer is inversely proportional to their structural distance

51 of 90

struc2vec: Steps

  • Step-3: Generate context for each node from multi-layer graph
    • Biased random walk on the multi-layer graph to generate node sequences
    • A sequence is likely to include structurally similar nodes
  • Step-4: Learning node representation
    • Learn latent representation from context (random walk) using techniques like Skipgram

52 of 90

Hierarchical Structural Similarity

  • Automorphism: Strong Structural Equivalence

  • Notion of hierarchy
    • Two nodes that have the same degree are structurally similar (Level 1)
    • If their neighbors also have the same degree, they are even more structurally similar (Level 2)

RED, GREEN 🡪 Automorphism

PURPLE, BROWN 🡪 Structurally Similar

53 of 90

Hierarchical Structural Similarity

54 of 90

Hierarchical Structural Similarity

55 of 90

Hierarchical Structural Similarity

Dynamic Time Warping (DTW)

DTW computes an element-wise matching between two sequences such that the total distance between matched elements is minimized
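A minimal DTW sketch, using the ratio-based cost that struc2vec applies to ordered degree sequences (so that degrees 1 vs 2 count as farther apart than 101 vs 102):

```python
def dtw(a, b, dist):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    INF = float('inf')
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def degree_cost(a, b):
    """Ratio-based cost (max/min - 1): 1 vs 2 is costlier than 101 vs 102."""
    return max(a, b) / min(a, b) - 1.0

# Warping aligns the repeated degrees, so these two sequences match perfectly.
d = dtw([1, 1, 2, 5], [1, 2, 2, 5], degree_cost)
```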

 

56 of 90

Hierarchical Structural Similarity

57 of 90

Constructing Multi-Layer Graph (M)

  •  

Layer 1

Layer 2
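In struc2vec's multilayer graph, the within-layer edge weight between u and v at level k decays exponentially with their structural distance f_k(u, v); a one-line sketch of that relationship:

```python
import math

def layer_weight(f_k):
    """w_k(u, v) = exp(-f_k(u, v)): structurally identical pairs (f_k = 0)
    get the maximum weight 1; distant pairs get exponentially smaller weight."""
    return math.exp(-f_k)

w_identical = layer_weight(0.0)
w_distant = layer_weight(3.0)
```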

 

58 of 90

Generate Context from M

  •  

59 of 90

Performance

DeepWalk

Node2vec

Struc2vec

Barbell Graph (10,10)

60 of 90

Mirrored Karate Network

DeepWalk

Node2vec

Struc2vec

61 of 90

Classification Performance

62 of 90

Graph Embedding

Simple Graphs

Node Co-occurrence

Structural Role

Node status

Community Structure

Complex Graphs

Heterogeneous Graphs

Bipartite Graphs

Multi-dimensional Graphs

Signed Graphs

Hypergraphs

Dynamic Graphs

63 of 90

HNE

Heterogeneous Network Embedding

64 of 90

Heterogeneous Networks

Nodes and contents of different types

Multiple types of links

 

 

 

65 of 90

Unsupervised feature learning

HNE

Supervised learning

Classification

Clustering

Link Prediction

Recommendation

66 of 90

Network Representation

  •  

67 of 90

Embedding Method

Optimize and Update Parameters

68 of 90

Embedding Method

  •  

69 of 90

Embedding Method

  •  

Dependent on adjacency matrix

Independent of adjacency matrix

70 of 90

Embedding Method

  •  

71 of 90

Non-linearity with Deep Architecture

Optimize and Update Parameters

72 of 90

Non-linearity with Deep Architecture

 

 

73 of 90

Non-linearity with Deep Architecture

  •  

74 of 90

Non-Linearity with Deep Architecture

  •  

 

75 of 90

HNE Architecture

ConvNet

ConvNet

ConvNet

ConvNet

ConvNet

ConvNet

FC Layer

FC Layer

FC Layer

FC Layer

FC Layer

FC Layer

Image-Image Module

Image-Text Module

Text-Text Module

Prediction Layer

Similarity Prediction

Pair-wise nodes from network

Non-linear Embedding

Non-linear Embedding

Non-linear Embedding

Linear Embedding

Linear Embedding

Linear Embedding

Mini batch

Weight sharing

76 of 90

metapath2vec

Heterogeneous Network Embedding

77 of 90

Heterogeneous Information Network (HIN)

78 of 90

HIN: Metapath

Author Collaboration:

Authors Sharing Venue:

79 of 90

Metapath-based Random Walk

 

Random walk is metapath conditioned
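A sketch of a metapath-conditioned walk: at each step the walker may only move to neighbors whose type matches the next symbol of the (cyclic) metapath scheme. The toy co-authorship graph and node names are illustrative:

```python
import random

def metapath_walk(adj, types, start, metapath, length, rng):
    """Random walk constrained by a metapath scheme such as 'APA'
    (author - paper - author); the scheme's last symbol equals its first,
    so the walk cycles through it."""
    walk = [start]
    step = 0
    while len(walk) < length:
        step += 1
        wanted = metapath[step % (len(metapath) - 1)]
        candidates = [u for u in adj[walk[-1]] if types[u] == wanted]
        if not candidates:
            break
        walk.append(rng.choice(candidates))
    return walk

# Toy bipartite author/paper graph.
adj = {'a1': ['p1'], 'a2': ['p1', 'p2'], 'p1': ['a1', 'a2'], 'p2': ['a2']}
types = {'a1': 'A', 'a2': 'A', 'p1': 'P', 'p2': 'P'}
walk = metapath_walk(adj, types, 'a1', 'APA', 7, random.Random(0))
```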

80 of 90

Homogeneous vs Metapath-based Random Walks

 

 

Homogeneous RW:

Metapath-based RW (OAPVPAO):

 

81 of 90

Embedding Model: Skipgram

 

There are 12 nodes in the graph

 

 

There is a problem with the definition of k in the paper; I would rather suggest restricting it to the set of nodes that co-occur with the center node within a predefined window.

82 of 90

Embedding Model: Skipgram

Objective Function

Maximize Σ_(v∈V) Σ_(c_t∈N_t(v)) log p(c_t | v; θ), where p(c_t | v; θ) is a softmax over node embeddings: exp(X_(c_t) · X_v) / Σ_(u∈V) exp(X_u · X_v)

Loss Function with Negative Sampling

log σ(X_(c_t) · X_v) + Σ_(m=1..M) E_(u^m ~ P(u)) [ log σ(−X_(u^m) · X_v) ]

83 of 90

Algorithm

As the paper lacks a definition of k, take it to be a predefined context window size

84 of 90

Metapath2vec++

  • Node type information is used in the random walk but not in the softmax
    • The softmax can be normalized with respect to the node type of the context node
  • While inferring a specific type of context node, metapath2vec uses nodes of all types as negative samples
    • Negative samples can instead be drawn from the distribution of nodes of the same type as the targeted context node
  • Metapath2vec++ outputs one multinomial distribution per node type
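The type-restricted negative sampling idea in one helper; the node names and type partition are illustrative:

```python
import random

def typed_negative_samples(nodes_by_type, context_type, k, rng):
    """metapath2vec++-style sampling: draw negatives only from nodes of the
    same type as the target context node, not from the whole vocabulary."""
    pool = nodes_by_type[context_type]
    return [rng.choice(pool) for _ in range(k)]

nodes_by_type = {'A': ['a1', 'a2', 'a3'], 'P': ['p1', 'p2'], 'V': ['v1']}
# Negatives for an author-type context node come only from the author pool.
negatives = typed_negative_samples(nodes_by_type, 'A', 5, random.Random(0))
```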

85 of 90

Embedding Model: Metapath2vec++

 

 

86 of 90

Metapath2vec++: Two Core Ideas

  • Context node type-based softmax

 

87 of 90

Metapath2vec++: Two Core Ideas

  • Objective function that considers node type-based negative sampling

Distribution over node type author only

88 of 90

Evaluation

Aminer Computer Science (CS) dataset

Database and Information Systems (DBIS) dataset

89 of 90

Evaluation: Similarity Search

90 of 90

Summary

  • Node co-occurrence preservation 🡪 Structure preservation
    • Random walk to extract co-occurrence information
    • DeepWalk based on Skipgram model
  • Real networks also exhibit structural roles worth preserving
    • node2vec: biased random walk for BFS-DFS tradeoff
    • struc2vec: purely structural role preserving method
  • Heterogeneous network embedding
    • Proximity-based model (HNE)
    • Metapath-based skipgram for Heterogeneous information network