1 of 87

Graph Foundational Model

2 of 87

Transformer

3 of 87

4 of 87

Limitations of RNN Models

  • Slow computation for longer sequences
    • Cannot be parallelized because of time-step dependencies
  • Vanishing or exploding gradient problem
  • Limited amount of information can be packed into the hidden state
  • History is forgotten after a number of time steps
    • Long-range dependencies are hard to capture

5 of 87

Attention Based Model: Transformer

6 of 87

Attention Based Model: Transformer

7 of 87

8 of 87

9 of 87

10 of 87

Advantages of Transformer

  • Minimize (or at least not increase) computational complexity per layer.
  • Minimize path length between any pair of words to facilitate learning of long-range dependencies.
  • Maximize the amount of computation that can be parallelized.

11 of 87

Graph Transformer

12 of 87

Do Transformers Really Perform Badly for Graph Representation? Chengxuan Ying et al., NeurIPS 2021

Graphormer

13 of 87

Graph Transformer: Graphormer

  • Motivation: is the Transformer architecture suitable for graph representation learning?
  • Builds directly upon the standard Transformer encoder.
  • Adds structural encodings to the model:
    • Centrality Encoding
    • Spatial Encoding
    • Edge Encoding
  • Achieves better performance than GNNs.

14 of 87

15 of 87

Graphormer : Centrality Encoding

  • Centrality: how important a node is in the graph.
  • Graphormer uses degree centrality.
  • The Centrality Encoding assigns each node two learnable embedding vectors according to its in-degree and out-degree, which are added to the node features (a minimal sketch follows).
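
A minimal PyTorch sketch of this idea (not the authors' implementation; the hidden size, maximum degree, and tensor shapes are illustrative assumptions):

import torch
import torch.nn as nn

class CentralityEncoding(nn.Module):
    def __init__(self, hidden_dim: int, max_degree: int = 512):
        super().__init__()
        # One learnable embedding table per degree type (in-degree, out-degree).
        self.in_degree_emb = nn.Embedding(max_degree, hidden_dim)
        self.out_degree_emb = nn.Embedding(max_degree, hidden_dim)

    def forward(self, x, in_degree, out_degree):
        # x: [num_nodes, hidden_dim]; in_degree/out_degree: [num_nodes] integer tensors.
        return x + self.in_degree_emb(in_degree) + self.out_degree_emb(out_degree)

enc = CentralityEncoding(hidden_dim=64)
x = torch.randn(5, 64)
h0 = enc(x, torch.tensor([1, 2, 0, 3, 1]), torch.tensor([2, 1, 1, 0, 3]))  # degree-aware Transformer input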

16 of 87

Graphormer : Spatial Encoding

  • An advantage of Transformer is its global receptive field.
    • In each Transformer layer, each token can attend to the information at any position and then process its representation.
  • Issue: self-attention itself ignores position, so the model has to explicitly specify positions or encode positional dependencies (such as locality).
  • For sequential data (Text)
    • Absolute Positional Encoding: give each position an embedding
    • Relative Positional Encoding: encode the relative distance of any two positions

17 of 87

Graphormer : Spatial Encoding

  • For graphs, nodes are not arranged as a sequence.
    • We need some notion of distance between nodes.
  • Spatial Encoding: effectively captures this structural information.
  • φ(vi, vj): a function that measures the spatial relation between vi and vj in graph G; a learnable bias indexed by φ(vi, vj) is added to the attention score (sketched below).
    • Graphormer chooses φ to be the Shortest Path Distance (SPD).
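
A minimal sketch of the spatial-encoding bias, assuming the shortest-path-distance matrix spd has been precomputed (e.g., with networkx); the maximum distance and head count are illustrative:

import torch
import torch.nn as nn

class SpatialEncoding(nn.Module):
    def __init__(self, num_heads: int, max_dist: int = 20):
        super().__init__()
        # One learnable scalar bias per (distance, attention head);
        # the last index can stand in for "unreachable" node pairs.
        self.bias = nn.Embedding(max_dist + 1, num_heads)

    def forward(self, spd):
        # spd: [num_nodes, num_nodes] integer SPD matrix.
        spd = spd.clamp(max=self.bias.num_embeddings - 1)
        return self.bias(spd).permute(2, 0, 1)  # [num_heads, N, N] bias added before softmax

spatial = SpatialEncoding(num_heads=8)
attn_bias = spatial(torch.randint(0, 5, (5, 5)))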

18 of 87

Graphormer : Edge Encoding

  • In many graph tasks, edges also have structural features.
    • In a molecular graph, atom pairs may have features describing the type of bond between them.
  • How can edge feature information be better incorporated into the Transformer's global attention?
    • Learn an edge encoding along the shortest path between each node pair.

19 of 87

Graphormer : Edge Encoding

  • SPij = (e1, e2, …, eN): the edges on the shortest path between node pair (vi, vj).
  • Compute the average of dot-products between each edge's feature and a learnable positional embedding along the path; the resulting scalar cij is added as a bias to the attention score between vi and vj (a sketch follows).
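
A minimal sketch of this edge encoding, assuming the edge features along SP(i, j) are already gathered into a tensor; the feature dimension and maximum path length are illustrative:

import torch
import torch.nn as nn

class EdgeEncoding(nn.Module):
    def __init__(self, edge_dim: int, max_path_len: int = 20):
        super().__init__()
        # One learnable weight vector per position along the shortest path.
        self.pos_weights = nn.Parameter(torch.randn(max_path_len, edge_dim))

    def forward(self, path_edge_feats):
        # path_edge_feats: [path_len, edge_dim] features of edges e1 ... eN on SP(i, j).
        n = path_edge_feats.size(0)
        # Average of dot-products between each edge feature and its positional weight.
        return (path_edge_feats * self.pos_weights[:n]).sum(dim=-1).mean()

edge_enc = EdgeEncoding(edge_dim=16)
c_ij = edge_enc(torch.randn(3, 16))  # scalar bias added to attention(i, j)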

20 of 87

Graphormer : Transformer Encoder

  • Uses the classic Transformer encoder implementation.
    • Modification: apply layer normalization (LN) before the multi-head self-attention (MHA) and feed-forward (FFN) blocks, i.e., a pre-LN layer (sketched below).
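
A minimal pre-LN encoder layer in PyTorch (illustrative dimensions; not the authors' code):

import torch
import torch.nn as nn

class PreLNEncoderLayer(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 8, ffn_dim: int = 256):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))

    def forward(self, h):
        # h: [batch, num_nodes, dim]; LayerNorm is applied before each sub-block (pre-LN).
        x = self.ln1(h)
        h = h + self.mha(x, x, x, need_weights=False)[0]
        h = h + self.ffn(self.ln2(h))
        return h

out = PreLNEncoderLayer()(torch.randn(2, 5, 64))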

21 of 87

Graphormer : Special Virtual Node

  • Instead of a READOUT/pooling function, add a special node called [VNode].
  • Connect [VNode] to every node in the graph individually.
  • The entire graph representation hG is the node feature of [VNode] (similar to the [CLS] token in BERT).

22 of 87

Graphormer : Results

23 of 87

Pre-training for Graph Neural Networks

24 of 87

Issues in Supervised Deep GNN

  • Requires a large amount of labeled data
    • Task-specific labeled data can be extremely scarce.
    • Data labeling is resource- and time-intensive.
    • GNNs overfit to small training sets.
  • Low out-of-distribution generalization ability of the trained models
    • Graph data from real-world applications often contain out-of-distribution samples.
    • GNNs extrapolate poorly.

25 of 87

GNN Pre-training

  • Effective solution: pre-training and transfer learning/fine-tuning
    • Train a GNN model on a massive corpus of pre-training graphs.
    • Use the pre-trained GNN model as the initialization.
    • Fine-tune the model parameters on the specific downstream task.

26 of 87

Challenges in GNN Pre-training

  • Requires a large number of labeled pre-training datasets from the same domain as the downstream task.
    • Requires substantial domain expertise.
    • Pre-training tasks must be carefully selected to correlate with the downstream task of interest.
    • Naive pre-training leads to negative transfer.

27 of 87

Self-Supervised GNN Pre-training

  • Contrastive SSL pre-training
    • Constructs supervision signals at the inter-graph level and learns representations by contrasting against other graphs.
  • Generative SSL pre-training
    • Focuses on reconstructing the original graph at the intra-graph level.

28 of 87

Strategies for Pre-training Graph Neural Networks

Generative SSL Pre-training

29 of 87

Key Insight

  • Develop an effective (Self-supervised) strategy for pre-training GNNs.
  • Pre-train an expressive GNN at the level of individual nodes as well as entire graphs.
  • Learn useful local and global representations simultaneously

30 of 87

Pretraining GNN

  • Pre-training has been hugely successful in computer vision and natural language processing.
  • Let’s consider pre-training GNNs!
    • Q1. How effective is pre-training GNNs?
    • Q2. What is the effective pre-training strategy?

31 of 87

How Effective is Pre-training GNNs?

  • Naive strategy: multi-task supervised pre-training.

32 of 87

How Effective is Pre-training GNNs?

  • The naive strategy gives limited performance improvement on downstream tasks (negative transfer).

33 of 87

What is an Effective Pre-training Strategy?

  • Key idea: Pre-train both node and graph embeddings.

34 of 87

What is an Effective Pre-training Strategy?

  • Key idea: Pre-train both node and graph embeddings.
    • GNN can capture domain-specific knowledge of both local and global structure.

35 of 87

Proposed Pre-training Method

36 of 87

Node-level : Attribute Masking

  • Mask node/edge attributes.
  • Use a GNN to generate node embeddings.
  • Use the embeddings to predict the masked attributes (a sketch follows).
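
A minimal sketch of the attribute-masking pretext task, assuming a GNN encoder gnn(x, edge_index) that returns per-node embeddings and that the first node attribute (e.g., atom type) is the prediction target; the mask rate and class count are illustrative:

import torch
import torch.nn as nn

class AttributeMasking(nn.Module):
    def __init__(self, gnn, hidden_dim: int, num_attr_classes: int, mask_rate: float = 0.15):
        super().__init__()
        self.gnn = gnn
        self.head = nn.Linear(hidden_dim, num_attr_classes)
        self.mask_rate = mask_rate

    def forward(self, x, edge_index):
        mask = torch.rand(x.size(0)) < self.mask_rate   # sample nodes to mask
        mask[0] = True                                  # keep the sketch well-defined: mask at least one node
        target = x[mask, 0].long()                      # attribute to reconstruct (e.g., atom type)
        x_masked = x.clone()
        x_masked[mask] = 0.0                            # replace masked attributes with a mask value
        h = self.gnn(x_masked, edge_index)              # node embeddings from the GNN
        return nn.functional.cross_entropy(self.head(h[mask]), target)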

37 of 87

Node-level : Attribute Masking

  • By solving the masked-attribute prediction task, the GNN is forced to learn domain knowledge, e.g., chemical rules.
  • It learns the regularities of node/edge attributes distributed over the graph structure.

38 of 87

Node-level : Context Prediction

  • For each graph, sample one center node v.
  • Extract neighborhood and context graphs for v.
  • (K-hop) Neighborhood graph: all nodes and edges at most K hops away from v.
  • Context graph: the subgraph between r1 and r2 hops away from v (i.e., a ring of width r2 − r1).

39 of 87

Node-level : Context Prediction

  • Goal: predict the context graph of center node v using the node embedding hv(K) generated by a K-layer GNN.
  • Use GNNs to encode the neighborhood and context graphs into embeddings.
  • Intuition: subgraphs that are surrounded by similar contexts are semantically similar [cf. word2vec, Mikolov et al., NIPS 2013].

40 of 87

Node-level : Context Prediction

  • Use negative sampling to jointly learn the main GNN and the context GNN.
  • Maximize/minimize the inner product between true/false (neighborhood, context) pairs.

  • For a positive (neighborhood, context) pair, let v = v′ and G = G′.
  • For a negative pair, randomly sample v′ from a randomly chosen graph G′ (negative sampling); a sketch of this objective follows.
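
A minimal sketch of the context-prediction objective, assuming h_center are center-node embeddings from the main GNN and c_context are context-graph embeddings from the context GNN, with matched rows forming positive pairs; the in-batch shuffle used for negatives is an illustrative choice:

import torch
import torch.nn.functional as F

def context_prediction_loss(h_center, c_context):
    batch = h_center.size(0)
    neg = c_context[torch.randperm(batch)]            # shuffled contexts act as negatives
    pos_logit = (h_center * c_context).sum(dim=-1)    # inner product of true pairs (maximize)
    neg_logit = (h_center * neg).sum(dim=-1)          # inner product of false pairs (minimize)
    logits = torch.cat([pos_logit, neg_logit])
    labels = torch.cat([torch.ones(batch), torch.zeros(batch)])
    return F.binary_cross_entropy_with_logits(logits, labels)

loss = context_prediction_loss(torch.randn(32, 64), torch.randn(32, 64))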

41 of 87

Graph-level : Supervised Attribute Prediction

  • Multi-task supervised training on many relevant labels.

42 of 87

Overall Strategy

43 of 87

Experimental Results

  • Avoids negative transfer.
  • Significantly improves downstream performance.

44 of 87

Graph Contrastive Coding

Contrastive SSL Pre-training

45 of 87

46 of 87

GNN Pre-Training Problem

  • Learn a function ƒ that maps a vertex to a low-dimensional feature vector.
  • Structural Similarity: map vertices with similar local network topologies close to each other in the embedding space.
  • Transferability/Adaptability: It is compatible with vertices and graphs unseen during pre-training.

47 of 87

Graph Contrastive Coding (GCC)

  • Hypothesis: graph structural patterns are universal and transferable across networks.

48 of 87

GCC Pre-training

  • Given a set of graphs, pre-train a universal graph neural network encoder to capture the structural patterns behind these graphs.
  • To achieve this, authors design proper self-supervised tasks and learning objectives for graph-structured data.
  • Pre-training Task: Subgraph instance discrimination.
  • Learning Objective: InfoNCE loss

49 of 87

GCC Pre-training

  • Contrastive learning (InfoNCE) loss:
    • Contrastive learning is a natural choice for capturing similarity from data.
    • It outputs instance representations that capture the similarities between instances.

Contrastive learning looks up the single key (denoted k+) that the query q matches in a dictionary of keys (a sketch of the loss follows).
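
A minimal sketch of the InfoNCE objective, assuming q and k are embeddings of two augmentations of the same batch of r-ego networks, so the matching key for each query lies on the diagonal and the other items in the batch act as the dictionary of negative keys; the temperature value is illustrative:

import torch
import torch.nn.functional as F

def info_nce(q, k, temperature: float = 0.07):
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = q @ k.t() / temperature      # [batch, batch] similarity matrix
    labels = torch.arange(q.size(0))      # the positive key k+ sits on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(32, 64), torch.randn(32, 64))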

50 of 87

GCC Pre-training

  • Crucial Questions:
    • Q1: How to define subgraph instances in graphs?
    • Q2: How to define (dis)similar instance pairs in and across graphs?
    • Q3: What are the proper graph encoders?

51 of 87

Define Subgraph Instances

  • GCC uses subgraphs as contrastive instances, extending each single vertex to its local structure.
  • Define each instance to be a vertex's r-ego network.
  • An r-ego network: the substructure around v consisting of vertices within r hops of v.

52 of 87

Define (dis)similar Instances

  • Treat two random data augmentations of the same r-ego network as a similar instance pair, and define the data augmentation as graph sampling.
  • Given the r-ego network Gv around vertex v, graph sampling for GCC follows three steps:
    • Step 1: Random walk with restart
      • Start a random walk on G from the ego vertex v.
      • The walk iteratively travels to a neighbor with probability proportional to the edge weight.
      • At each step, with positive probability the walk returns to the starting vertex v.

53 of 87

Define (dis)similar Instances

  • Given the r-ego network Gv around vertex v, graph sampling for GCC follows three steps (continued):
    • Step 2: Subgraph induction
      • The random walk with restart collects a subset of vertices surrounding v: Ŝv.
      • The subgraph Ĝv induced by Ŝv is then regarded as an augmented version of the r-ego network Gv.
    • Step 3: Anonymization
      • Anonymize the sampled subgraph by relabeling its vertices in arbitrary order (a sketch of the full procedure follows).
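
A minimal sketch of this augmentation using networkx (not the authors' implementation; the walk length and restart probability are illustrative, and the walk below ignores edge weights):

import random
import networkx as nx

def rwr_augment(G, v, walk_length: int = 64, restart_prob: float = 0.5):
    visited = {v}
    cur = v
    for _ in range(walk_length):
        neighbors = list(G.neighbors(cur))
        if not neighbors or random.random() < restart_prob:
            cur = v                           # Step 1: restart at the ego vertex
        else:
            cur = random.choice(neighbors)    # Step 1: move to a neighbor
        visited.add(cur)
    sub = G.subgraph(visited)                 # Step 2: subgraph induction on the visited set
    # Step 3: anonymization - relabel vertices in arbitrary order, discarding identities.
    return nx.convert_node_labels_to_integers(sub)

G = nx.karate_club_graph()
g1, g2 = rwr_augment(G, 0), rwr_augment(G, 0)   # a similar (positive) instance pair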

54 of 87

Define (dis)similar Instances

  • If two subgraphs are augmented from the same r-ego network, treat them as a similar instance pair.
  • If two subgraphs are augmented from different r-ego networks, treat them as a dissimilar instance pair.

55 of 87

Define Graph Encoders

  • The GCC model is not sensitive to the choice of GNN encoder.
  • In practice, the authors adopt the Graph Isomorphism Network (GIN), a state-of-the-art graph neural network, as the graph encoder.

56 of 87

GCC Pre-training

57 of 87

GCC Fine-tuning

58 of 87

GCC Fine-tuning: Full vs Freezing

59 of 87

Experiments

  • Self-supervised pre-training is performed on six graph datasets.

  • Fine-tuning Tasks
    • Node Classification
    • Graph Classification
    • Top-K Similarity Search

60 of 87

Node Classification Task

61 of 87

Graph Classification Task

62 of 87

Top-K Similarity Search

63 of 87

Prompt Tuning for Graph Neural Networks

64 of 87

Graph Pre-training and Prompt Tuning (GPPT)

65 of 87

Limitations of Pre-training/Fine-tuning

  • There is a training-objective gap between the constructed pretext task and the dedicated downstream task.
  • During pre-training (link prediction pretext task), the GNN parameters θ are optimized to generate close embeddings for connected node pairs.
  • In the downstream task, if disconnected pairs share the same class, the pre-trained model must be tuned for many epochs to adapt to the new problem (e.g., node classification).
  • Fine-tuning is time-consuming and can even result in worse downstream performance than training from scratch (negative transfer).

66 of 87

Graph Prompt Tuning

  • Effective Solution: “Pre-train, Prompt, Fine-tune”

67 of 87

Graph Prompt Tuning : Challenges

  • It is infeasible to apply a semantic prompting function to bridge the various graph machine learning tasks.
    • A textual template (e.g., "I felt so [MASK]", as used in NLP) cannot be used to reformulate node classification as edge prediction.
  • It is unclear how to design an informative prompt that better reformulates the downstream application.
    • For node classification, the prompt should take into account the neighborhood of the target node to boost classification accuracy.

68 of 87

GPPT : Key Contribution

  • First adopt masked edge prediction, the simplest and most popular pretext task, to pre-train GNNs.
  • The authors turn each standalone node into a token pair, reformulating downstream node classification to look the same as edge prediction.

69 of 87

GPPT : Graph Prompting Function

  • A graph prompting function changes the standalone node vi into a prompt vi'.
  • Considering the edge prediction or contrastive learning pretext task, the prompt vi' is described by a template of token pairs:

  • Ttask(y): task token describing the downstream application.
  • Tstr(vi): structure token representing the target node.

70 of 87

GPPT : Graph Prompting Function

  • Rethinking the node classification downstream task:
    • The Task Token Ttask(y) could be given by a node label waiting to be classified.
    • The Structure Token Tstr(vi) could be represented by the subgraph surrounding target node vi to provide more structural information.

71 of 87

GPPT : Task Token Generation

  • Task token could be understood as a class-prototype node added to the original graph.
  • Motivated by NLP, the task token should be embedded into a trainable vector.
  • The optimal embedding of task token Ttask(yc) should be at the center of node embeddings of class yc.

72 of 87

GPPT : Task Token Generation

  • In real-world networks, one inherent graph characteristic is cluster structure.
    • Nodes within the same cluster have dense connections to each other.
  • Given the edge-prediction pretext task, the pre-trained node embeddings will also be clustered in the embedding space.
  • Hence the optimal embeddings of the task tokens should vary across clusters.
  • To better conduct node classification within each cluster, the authors propose cluster-based task token generation.

73 of 87

GPPT : Task Token Generation

  • Cluster-based task token generation has three steps:
    • Adopt a scalable clustering module (e.g., METIS) to pre-process and split the nodes into M non-overlapping clusters {G1, …, GM}, where the cluster number M is a hyper-parameter.
    • For each cluster m, train an independent set of task token embeddings.
    • The task token Ttask(yc) of node vi in cluster m is embedded by the vector ec^m.

74 of 87

GPPT : Structure Token Generation

  • The structure token Tstr(vi) denotes the subgraph centered at node vi.
  • The authors leverage the first-order (one-hop) neighbors to embed the structure token into a continuous vector (a minimal sketch follows).
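
A minimal sketch of one way to build such a structure token from pre-trained node embeddings; the equal weighting of the center node and its neighborhood mean is an illustrative assumption, not necessarily the paper's exact formulation:

import torch

def structure_token(h, adj, i):
    # h: [num_nodes, dim] pre-trained node embeddings; adj: [num_nodes, num_nodes] 0/1 adjacency.
    neighbors = adj[i].nonzero(as_tuple=True)[0]
    if neighbors.numel() == 0:
        return h[i]                                       # isolated node: use its own embedding
    return 0.5 * h[i] + 0.5 * h[neighbors].mean(dim=0)    # center plus first-order neighborhood summary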

75 of 87

GPPT : Overall Learning Process

  • Prompt addition: the graph prompting function generates a series of token pairs to be classified.
  • Prompt answering: each token pair is embedded into continuous vectors.
    • The vectors are concatenated and fed to the pre-trained projection head to obtain a linking probability.
    • The target node vi is classified with label yc if that pair obtains the highest probability (see the sketch below).
  • Prompt tuning: following the pretext training objective, the GNN is fine-tuned.
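
A minimal sketch of the prompt-answering step, assuming task_tokens holds one task-token embedding per class, struct_tok is the structure token of the target node, and proj_head stands in for the pre-trained edge projection head; shapes are illustrative:

import torch

def prompt_answer(struct_tok, task_tokens, proj_head):
    # struct_tok: [dim]; task_tokens: [num_classes, dim] (e.g., the tokens of one cluster).
    pairs = torch.cat([task_tokens, struct_tok.expand_as(task_tokens)], dim=-1)
    link_prob = torch.sigmoid(proj_head(pairs)).squeeze(-1)   # "edge exists" score per class
    return link_prob.argmax()                                 # predicted label yc

proj_head = torch.nn.Linear(2 * 64, 1)                        # stand-in for the pre-trained head
pred = prompt_answer(torch.randn(64), torch.randn(7, 64), proj_head)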

76 of 87

GPPT : Architecture

77 of 87

Experimental Results: Node Classification

78 of 87

Graph Prompt Feature(GPF)

79 of 87

Prompt Tuning : Background

  • Prompt tuning a pre-trained LLM
    • Step 1: Pre-training an LLM using the Masked Language Modeling (MLM).
    • Step 2: Reformulating the downstream task by a prompt on the input sentence.

80 of 87

Prompt Tuning : Background

  • Pre-trained LLM vs Pre-trained GNNs

How to apply prompt tuning on pre-trained GNNs?

81 of 87

Graph Prompt Tuning : Prior Work

  • Prior work (GPPT) performs graph prompt tuning by reformulating the downstream task as link prediction, which is consistent with the pre-training strategy it uses.

82 of 87

Graph Prompt Tuning : Limitation

  • There is no unified pre-training task for GNNs.
  • It is therefore challenging to design general prompting functions.
  • Existing methods have limited applicability: they are only compatible with models pre-trained by link prediction.

83 of 87

Graph Prompt Tuning

  • Template design: generate the graph template, which includes learnable components in its adjacency matrix and feature matrix.
  • Prompt optimization: search for the optimal prompt parameters according to the downstream task.

84 of 87

Specialized Graph Prompt Tuning

  • Following the motivation of prompt tuning, the graph prompt design is closely related to the pre-training task involved.
  • However, there are many pre-training strategies in the graph field. Can we design a universal graph prompt tuning method that covers all of them?

85 of 87

Universal Graph Prompt Tuning

  • Graph Prompt Feature (GPF):
    • GPF incorporates additional learnable parameters into the feature space of the input graph.
    • A learnable vector is added to the node features X to generate the prompted features X*.
  • Graph Prompt Feature-Plus (GPF-plus):
    • GPF-plus uses a different prompt vector for each node in the graph (see the sketch below).
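
A minimal sketch of the two prompted-feature variants; GPF-plus is shown here as an attention-weighted mixture over a small basis of prompt vectors, an illustrative reading of "a different vector per node" rather than a faithful reproduction of the paper:

import torch
import torch.nn as nn

class GPF(nn.Module):
    # A single learnable prompt vector shared by every node.
    def __init__(self, feat_dim: int):
        super().__init__()
        self.p = nn.Parameter(torch.zeros(feat_dim))

    def forward(self, x):                              # x: [num_nodes, feat_dim]
        return x + self.p                              # prompted features X*

class GPFPlus(nn.Module):
    # A node-specific prompt built by attending over a small prompt basis.
    def __init__(self, feat_dim: int, num_basis: int = 4):
        super().__init__()
        self.basis = nn.Parameter(torch.zeros(num_basis, feat_dim))
        self.attn = nn.Linear(feat_dim, num_basis)

    def forward(self, x):
        weights = torch.softmax(self.attn(x), dim=-1)  # per-node mixture weights
        return x + weights @ self.basis                # node-specific prompted features

x_star = GPF(feat_dim=32)(torch.randn(10, 32))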

86 of 87

Experimental Evaluation

  • GPF and GPF-plus achieved better results than fine-tuning in 80% of the experiments.

87 of 87

Experimental Evaluation

  • Comparison with existing graph prompt-based methods.
    • GPF and GPF-plus achieved better results than specialized graph prompt methods by a large margin.