1 of 87

Graph Foundational Model

2 of 87

Transformer

3 of 87

4 of 87

Limitations of RNN Models

  • Slow computation for longer sequences
    • Cannot be parallelized because of time-step dependencies
  • Vanishing or exploding gradient problem
  • Limited amount of information can be packed into the hidden state
  • History is forgotten after a number of time steps
    • Long-range dependencies are hard to capture

5 of 87

Attention Based Model: Transformer

6 of 87

Attention Based Model: Transformer

7 of 87

8 of 87

9 of 87

10 of 87

Advantages of Transformer

  • Minimize (or at least not increase) computational complexity per layer.
  • Minimize path length between any pair of words to facilitate learning of long-range dependencies.
  • Maximize the amount of computation that can be parallelized.

11 of 87

Graph Transformer

12 of 87

Do Transformers Really Perform Badly for Graph Representation? Chengxuan Ying et al., NeurIPS 2021

Graphormer

13 of 87

Graph Transformer: Graphormer

  • Motivation: is the Transformer architecture suitable for graph representation learning?
  • Builds directly upon the standard Transformer encoder.
  • Adds structural encodings to the model:
    • Centrality Encoding
    • Spatial Encoding
    • Edge Encoding
  • Achieves better performance than GNNs.

14 of 87

15 of 87

Graphormer : Centrality Encoding

  • Centrality: how important a node is in the graph.
  • Graphormer uses degree centrality.
  • The Centrality Encoding assigns each node two learnable embedding vectors according to its in-degree and out-degree, which are added to the node features (a minimal sketch follows).
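
A minimal PyTorch sketch of this idea (not the authors' implementation; the hidden size, maximum degree, and tensor shapes are illustrative assumptions):

import torch
import torch.nn as nn

class CentralityEncoding(nn.Module):
    def __init__(self, hidden_dim: int, max_degree: int = 512):
        super().__init__()
        # One learnable embedding table per degree type (in-degree, out-degree).
        self.in_degree_emb = nn.Embedding(max_degree, hidden_dim)
        self.out_degree_emb = nn.Embedding(max_degree, hidden_dim)

    def forward(self, x, in_degree, out_degree):
        # x: [num_nodes, hidden_dim]; in_degree/out_degree: [num_nodes] integer tensors.
        return x + self.in_degree_emb(in_degree) + self.out_degree_emb(out_degree)

enc = CentralityEncoding(hidden_dim=64)
x = torch.randn(5, 64)
h0 = enc(x, torch.tensor([1, 2, 0, 3, 1]), torch.tensor([2, 1, 1, 0, 3]))  # degree-aware Transformer input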

16 of 87

Graphormer : Spatial Encoding

  • An advantage of Transformer is its global receptive field.
    • In each Transformer layer, each token can attend to the information at any position and then process its representation.
  • Issue: self-attention itself ignores position, so the model has to explicitly specify positions or encode positional dependencies (such as locality).
  • For sequential data (Text)
    • Absolute Positional Encoding: give each position an embedding
    • Relative Positional Encoding: encode the relative distance of any two positions

17 of 87

Graphormer : Spatial Encoding

  • For graphs, nodes are not arranged as a sequence.
    • We need some notion of distance between nodes.
  • Spatial Encoding: effectively captures this structural information.
  • φ(vi, vj): a function that measures the spatial relation between vi and vj in graph G; a learnable bias indexed by φ(vi, vj) is added to the attention score (sketched below).
    • Graphormer chooses φ to be the Shortest Path Distance (SPD).
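
A minimal sketch of the spatial-encoding bias, assuming the shortest-path-distance matrix spd has been precomputed (e.g., with networkx); the maximum distance and head count are illustrative:

import torch
import torch.nn as nn

class SpatialEncoding(nn.Module):
    def __init__(self, num_heads: int, max_dist: int = 20):
        super().__init__()
        # One learnable scalar bias per (distance, attention head);
        # the last index can stand in for "unreachable" node pairs.
        self.bias = nn.Embedding(max_dist + 1, num_heads)

    def forward(self, spd):
        # spd: [num_nodes, num_nodes] integer SPD matrix.
        spd = spd.clamp(max=self.bias.num_embeddings - 1)
        return self.bias(spd).permute(2, 0, 1)  # [num_heads, N, N] bias added before softmax

spatial = SpatialEncoding(num_heads=8)
attn_bias = spatial(torch.randint(0, 5, (5, 5)))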

18 of 87

Graphormer : Edge Encoding

  • In many graph tasks, edges also have structural features.
    • In a molecular graph, atom pairs may have features describing the type of bond between them.
  • How can edge feature information be better incorporated into the Transformer's global attention?
    • Learn an edge encoding along the shortest path between each node pair.

19 of 87

Graphormer : Edge Encoding

  • SPij = (e1, e2, …, eN): the edges on the shortest path between node pair (vi, vj).
  • Compute the average of dot-products between each edge's feature and a learnable positional embedding along the path; the resulting scalar cij is added as a bias to the attention score between vi and vj (a sketch follows).
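
A minimal sketch of this edge encoding, assuming the edge features along SP(i, j) are already gathered into a tensor; the feature dimension and maximum path length are illustrative:

import torch
import torch.nn as nn

class EdgeEncoding(nn.Module):
    def __init__(self, edge_dim: int, max_path_len: int = 20):
        super().__init__()
        # One learnable weight vector per position along the shortest path.
        self.pos_weights = nn.Parameter(torch.randn(max_path_len, edge_dim))

    def forward(self, path_edge_feats):
        # path_edge_feats: [path_len, edge_dim] features of edges e1 ... eN on SP(i, j).
        n = path_edge_feats.size(0)
        # Average of dot-products between each edge feature and its positional weight.
        return (path_edge_feats * self.pos_weights[:n]).sum(dim=-1).mean()

edge_enc = EdgeEncoding(edge_dim=16)
c_ij = edge_enc(torch.randn(3, 16))  # scalar bias added to attention(i, j)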

20 of 87

Graphormer : Transformer Encoder

  • Uses the classic Transformer encoder implementation.
    • Modification: apply layer normalization (LN) before the multi-head self-attention (MHA) and feed-forward (FFN) blocks, i.e., a pre-LN layer (sketched below).
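
A minimal pre-LN encoder layer in PyTorch (illustrative dimensions; not the authors' code):

import torch
import torch.nn as nn

class PreLNEncoderLayer(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 8, ffn_dim: int = 256):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))

    def forward(self, h):
        # h: [batch, num_nodes, dim]; LayerNorm is applied before each sub-block (pre-LN).
        x = self.ln1(h)
        h = h + self.mha(x, x, x, need_weights=False)[0]
        h = h + self.ffn(self.ln2(h))
        return h

out = PreLNEncoderLayer()(torch.randn(2, 5, 64))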

21 of 87

Graphormer : Special Virtual Node

  • Instead of a READOUT/pooling function, add a special node called [VNode].
  • Connect [VNode] to every node in the graph individually.
  • The entire graph representation hG is the node feature of [VNode] (similar to the [CLS] token in BERT).

22 of 87

Graphormer : Results

23 of 87

Pre-training for Graph Neural Networks

24 of 87

Issues in Supervised Deep GNN

  • Requires a large amount of labeled data
    • Task-specific labeled data can be extremely scarce.
    • Data labeling is resource- and time-intensive.
    • GNNs overfit to small training sets.
  • Low out-of-distribution generalization ability of the trained models
    • Graph data from real-world applications often contain out-of-distribution samples.
    • GNNs extrapolate poorly.

25 of 87

GNN Pre-training

  • Effective solution: pre-training and transfer learning/fine-tuning
    • Train a GNN model on a massive corpus of pre-training graphs.
    • Use the pre-trained GNN model as the initialization.
    • Fine-tune the model parameters on the specific downstream task.

26 of 87

Challenges in GNN Pre-training

  • Requires a large number of labeled pre-training datasets from the same domain as the downstream task.
    • Requires substantial domain expertise.
    • Pre-training tasks must be carefully selected to correlate with the downstream task of interest.
    • Naive pre-training leads to negative transfer.

27 of 87

Self-Supervised GNN Pre-training

  • Contrastive SSL pre-training
    • Constructs supervision signals at the inter-graph level and learns representations by contrasting against other graphs.
  • Generative SSL pre-training
    • Focuses on reconstructing the original graph at the intra-graph level.

28 of 87

Strategies for Pre-training Graph Neural Networks

Generative SSL Pre-training

29 of 87

Key Insight

  • Develop an effective (Self-supervised) strategy for pre-training GNNs.
  • Pre-train an expressive GNN at the level of individual nodes as well as entire graphs.
  • Learn useful local and global representations simultaneously

30 of 87

Pretraining GNN

  • Pre-training has been hugely successful in computer vision and natural language processing.
  • Let’s consider pre-training GNNs!
    • Q1. How effective is pre-training GNNs?
    • Q2. What is the effective pre-training strategy?

31 of 87

How Effective is Pre-training GNNs?

  • Naive strategy: multi-task supervised pre-training.

32 of 87

How Effective is Pre-training GNNs?

  • The naive strategy gives limited performance improvement on downstream tasks (negative transfer).

33 of 87

What is an Effective Pre-training Strategy?

  • Key idea: Pre-train both node and graph embeddings.

34 of 87

What is an Effective Pre-training Strategy?

  • Key idea: Pre-train both node and graph embeddings.
    • GNN can capture domain-specific knowledge of both local and global structure.

35 of 87

Proposed Pre-training Method

36 of 87

Node-level : Attribute Masking

  • Mask node/edge attributes.
  • Use a GNN to generate node embeddings.
  • Use the embeddings to predict the masked attributes (a sketch follows).
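
A minimal sketch of the attribute-masking pretext task, assuming a GNN encoder gnn(x, edge_index) that returns per-node embeddings and that the first node attribute (e.g., atom type) is the prediction target; the mask rate and class count are illustrative:

import torch
import torch.nn as nn

class AttributeMasking(nn.Module):
    def __init__(self, gnn, hidden_dim: int, num_attr_classes: int, mask_rate: float = 0.15):
        super().__init__()
        self.gnn = gnn
        self.head = nn.Linear(hidden_dim, num_attr_classes)
        self.mask_rate = mask_rate

    def forward(self, x, edge_index):
        mask = torch.rand(x.size(0)) < self.mask_rate   # sample nodes to mask
        mask[0] = True                                  # keep the sketch well-defined: mask at least one node
        target = x[mask, 0].long()                      # attribute to reconstruct (e.g., atom type)
        x_masked = x.clone()
        x_masked[mask] = 0.0                            # replace masked attributes with a mask value
        h = self.gnn(x_masked, edge_index)              # node embeddings from the GNN
        return nn.functional.cross_entropy(self.head(h[mask]), target)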

37 of 87

Node-level : Attribute Masking

  • By solving the masked-attribute prediction task, the GNN is forced to learn domain knowledge, e.g., chemical rules.
  • It learns the regularities of node/edge attributes distributed over the graph structure.

38 of 87

Node-level : Context Prediction

  • For each graph, sample one center node v.
  • Extract neighborhood and context graphs for v.
  • (K-hop) Neighborhood graph: all nodes and edges at most K hops away from v.
  • Context graph: the subgraph between r1 and r2 hops away from v (i.e., a ring of width r2 − r1).

39 of 87

Node-level : Context Prediction

  • Goal: predict the context graph of center node v using the node embedding hv(K) generated by a K-layer GNN.
  • Use GNNs to encode the neighborhood and context graphs into embeddings.
  • Intuition: subgraphs that are surrounded by similar contexts are semantically similar [cf. word2vec, Mikolov et al., NIPS 2013].

40 of 87

Node-level : Context Prediction

  • Use negative sampling to jointly learn the main GNN and the context GNN.
  • Maximize/minimize the inner product between true/false (neighborhood, context) pairs.

  • For a positive (neighborhood, context) pair, let v = v′ and G = G′.
  • For a negative pair, randomly sample v′ from a randomly chosen graph G′ (negative sampling); a sketch of this objective follows.
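
A minimal sketch of the context-prediction objective, assuming h_center are center-node embeddings from the main GNN and c_context are context-graph embeddings from the context GNN, with matched rows forming positive pairs; the in-batch shuffle used for negatives is an illustrative choice:

import torch
import torch.nn.functional as F

def context_prediction_loss(h_center, c_context):
    batch = h_center.size(0)
    neg = c_context[torch.randperm(batch)]            # shuffled contexts act as negatives
    pos_logit = (h_center * c_context).sum(dim=-1)    # inner product of true pairs (maximize)
    neg_logit = (h_center * neg).sum(dim=-1)          # inner product of false pairs (minimize)
    logits = torch.cat([pos_logit, neg_logit])
    labels = torch.cat([torch.ones(batch), torch.zeros(batch)])
    return F.binary_cross_entropy_with_logits(logits, labels)

loss = context_prediction_loss(torch.randn(32, 64), torch.randn(32, 64))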

41 of 87

Graph-level : Supervised Attribute Prediction

  • Multi-task supervised training on many relevant labels.

42 of 87

Overall Strategy

43 of 87

Experimental Results

  • Avoids negative transfer.
  • Significantly improves downstream performance.

44 of 87

Graph Contrastive Coding

Contrastive SSL Pre-training

45 of 87

46 of 87

GNN Pre-Training Problem

  • Learn a function ƒ that maps a vertex to a low-dimensional feature vector.
  • Structural Similarity: map vertices with similar local network topologies close to each other in the embedding space.
  • Transferability/Adaptability: It is compatible with vertices and graphs unseen during pre-training.

47 of 87

Graph Contrastive Coding (GCC)

  • Hypothesis: graph structural patterns are universal and transferable across networks.

48 of 87

GCC Pre-training

  • Given a set of graphs, pre-train a universal graph neural network encoder to capture the structural patterns behind these graphs.
  • To achieve this, authors design proper self-supervised tasks and learning objectives for graph-structured data.
  • Pre-training Task: Subgraph instance discrimination.
  • Learning Objective: InfoNCE loss

49 of 87

GCC Pre-training

  • Contrastive learning (InfoNCE) loss:
    • Contrastive learning is a natural choice for capturing similarity from data.
    • It outputs instance representations that capture the similarities between instances.

Contrastive learning looks up the single key (denoted k+) that the query q matches in a dictionary of keys (a sketch of the loss follows).
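
A minimal sketch of the InfoNCE objective, assuming q and k are embeddings of two augmentations of the same batch of r-ego networks, so the matching key for each query lies on the diagonal and the other items in the batch act as the dictionary of negative keys; the temperature value is illustrative:

import torch
import torch.nn.functional as F

def info_nce(q, k, temperature: float = 0.07):
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = q @ k.t() / temperature      # [batch, batch] similarity matrix
    labels = torch.arange(q.size(0))      # the positive key k+ sits on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(32, 64), torch.randn(32, 64))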

50 of 87

GCC Pre-training

  • Crucial Questions:
    • Q1: How to define subgraph instances in graphs?
    • Q2: How to define (dis)similar instance pairs in and across graphs?
    • Q3: What are the proper graph encoders?

51 of 87

Define Subgraph Instances

  • GCC uses subgraphs as contrastive instances, extending each single vertex to its local structure.
  • Define each instance to be a vertex's r-ego network.
  • An r-ego network: the substructure around v consisting of vertices within r hops of v.

52 of 87

Define (dis)similar Instances

  • Treat two random data augmentations of the same r-ego network as a similar instance pair, and define the data augmentation as graph sampling.
  • Given the r-ego network Gv around vertex v, graph sampling for GCC follows three steps:
    • Step 1: Random walk with restart
      • Start a random walk on G from the ego vertex v.
      • The walk iteratively travels to a neighbor with probability proportional to the edge weight.
      • At each step, with positive probability the walk returns to the starting vertex v.

53 of 87

Define (dis)similar Instances

  • Given the r-ego network Gv around vertex v, graph sampling for GCC follows three steps (continued):
    • Step 2: Subgraph induction
      • The random walk with restart collects a subset of vertices surrounding v: Ŝv.
      • The subgraph Ĝv induced by Ŝv is then regarded as an augmented version of the r-ego network Gv.
    • Step 3: Anonymization
      • Anonymize the sampled subgraph by relabeling its vertices in arbitrary order (a sketch of the full procedure follows).
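
A minimal sketch of this augmentation using networkx (not the authors' implementation; the walk length and restart probability are illustrative, and the walk below ignores edge weights):

import random
import networkx as nx

def rwr_augment(G, v, walk_length: int = 64, restart_prob: float = 0.5):
    visited = {v}
    cur = v
    for _ in range(walk_length):
        neighbors = list(G.neighbors(cur))
        if not neighbors or random.random() < restart_prob:
            cur = v                           # Step 1: restart at the ego vertex
        else:
            cur = random.choice(neighbors)    # Step 1: move to a neighbor
        visited.add(cur)
    sub = G.subgraph(visited)                 # Step 2: subgraph induction on the visited set
    # Step 3: anonymization - relabel vertices in arbitrary order, discarding identities.
    return nx.convert_node_labels_to_integers(sub)

G = nx.karate_club_graph()
g1, g2 = rwr_augment(G, 0), rwr_augment(G, 0)   # a similar (positive) instance pair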

54 of 87

Define (dis)similar Instances

  • If two subgraphs are augmented from the same r-ego network, treat them as a similar instance pair.
  • If two subgraphs are augmented from different r-ego networks, treat them as a dissimilar instance pair.

55 of 87

Define Graph Encoders

  • The GCC model is not sensitive to the choice of GNN encoder.
  • In practice, the authors adopt the Graph Isomorphism Network (GIN), a state-of-the-art graph neural network, as the graph encoder.

56 of 87

GCC Pre-training

57 of 87

GCC Fine-tuning

58 of 87

GCC Fine-tuning: Full vs Freezing

59 of 87

Experiments

  • Self-supervised pre-training is performed on six graph datasets.

  • Fine-tuning Tasks
    • Node Classification
    • Graph Classification
    • Top-K Similarity Search

60 of 87

Node Classification Task

61 of 87

Graph Classification Task

62 of 87

Top-K Similarity Search

63 of 87

Prompt Tuning for Graph Neural Networks

64 of 87

Graph Pre-training and Prompt Tuning (GPPT)

65 of 87

Limitations of Pre-training/Fine-tuning

  • There is a training-objective gap between the constructed pretext task and the dedicated downstream task.
  • During pre-training (link prediction pretext task), the GNN parameters θ are optimized to generate close embeddings for connected node pairs.
  • In the downstream task, if disconnected pairs share the same class, the pre-trained model must be tuned for many epochs to adapt to the new problem (e.g., node classification).
  • Fine-tuning is time-consuming and can even result in worse downstream performance than training from scratch (negative transfer).

66 of 87

Graph Prompt Tuning

  • Effective Solution: “Pre-train, Prompt, Fine-tune”

67 of 87

Graph Prompt Tuning : Challenges

  • It is infeasible to apply a semantic prompting function to bridge the various graph machine learning tasks.
    • A textual template (e.g., "I felt so [MASK]", as used in NLP) cannot be used to reformulate node classification as edge prediction.
  • It is unclear how to design an informative prompt that better reformulates the downstream application.
    • For node classification, the prompt should take into account the neighborhood of the target node to boost classification accuracy.

68 of 87

GPPT : Key Contribution

  • First adopt masked edge prediction, the simplest and most popular pretext task, to pre-train GNNs.
  • The authors turn each standalone node into a token pair, reformulating downstream node classification to look the same as edge prediction.

69 of 87

GPPT : Graph Prompting Function

  • A graph prompting function changes the standalone node vi into a prompt vi'.
  • Considering the edge prediction or contrastive learning pretext task, the prompt vi' is described by a template of token pairs:

  • Ttask(y): task token describing the downstream application.
  • Tstr(vi): structure token representing the target node.

70 of 87

GPPT : Graph Prompting Function

  • Rethinking the node classification downstream task:
    • The Task Token Ttask(y) could be given by a node label waiting to be classified.
    • The Structure Token Tstr(vi) could be represented by the subgraph surrounding target node vi to provide more structural information.

71 of 87

GPPT : Task Token Generation

  • Task token could be understood as a class-prototype node added to the original graph.
  • Motivated by NLP, the task token should be embedded into a trainable vector.
  • The optimal embedding of task token Ttask(yc) should be at the center of node embeddings of class yc.

72 of 87

GPPT : Task Token Generation

  • In real-world networks, one inherent graph characteristic is cluster structure.
    • Nodes within the same cluster have dense connections to each other.
  • Given the edge-prediction pretext task, the pre-trained node embeddings will also be clustered in the embedding space.
  • Hence the optimal embeddings of the task tokens should vary across clusters.
  • To better conduct node classification within each cluster, the authors propose cluster-based task token generation.

73 of 87

GPPT : Task Token Generation

  • Cluster-based task token generation has three steps:
    • Adopt a scalable clustering module (e.g., METIS) to pre-process and split the nodes into M non-overlapping clusters {G1, …, GM}, where the cluster number M is a hyper-parameter.
    • For each cluster m, train an independent set of task token embeddings.
    • The task token Ttask(yc) of node vi in cluster m is embedded by the vector ec^m.

74 of 87

GPPT : Structure Token Generation

  • The structure token Tstr(vi) denotes the subgraph centered at node vi.
  • The authors leverage the first-order (one-hop) neighbors to embed the structure token into a continuous vector (a minimal sketch follows).
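
A minimal sketch of one way to build such a structure token from pre-trained node embeddings; the equal weighting of the center node and its neighborhood mean is an illustrative assumption, not necessarily the paper's exact formulation:

import torch

def structure_token(h, adj, i):
    # h: [num_nodes, dim] pre-trained node embeddings; adj: [num_nodes, num_nodes] 0/1 adjacency.
    neighbors = adj[i].nonzero(as_tuple=True)[0]
    if neighbors.numel() == 0:
        return h[i]                                       # isolated node: use its own embedding
    return 0.5 * h[i] + 0.5 * h[neighbors].mean(dim=0)    # center plus first-order neighborhood summary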

75 of 87

GPPT : Overall Learning Process

  • Prompt addition: the graph prompting function generates a series of token pairs to be classified.
  • Prompt answering: each token pair is embedded into continuous vectors.
    • The vectors are concatenated and fed to the pre-trained projection head to obtain a linking probability.
    • The target node vi is classified with label yc if that pair obtains the highest probability (see the sketch below).
  • Prompt tuning: following the pretext training objective, the GNN is fine-tuned.
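
A minimal sketch of the prompt-answering step, assuming task_tokens holds one task-token embedding per class, struct_tok is the structure token of the target node, and proj_head stands in for the pre-trained edge projection head; shapes are illustrative:

import torch

def prompt_answer(struct_tok, task_tokens, proj_head):
    # struct_tok: [dim]; task_tokens: [num_classes, dim] (e.g., the tokens of one cluster).
    pairs = torch.cat([task_tokens, struct_tok.expand_as(task_tokens)], dim=-1)
    link_prob = torch.sigmoid(proj_head(pairs)).squeeze(-1)   # "edge exists" score per class
    return link_prob.argmax()                                 # predicted label yc

proj_head = torch.nn.Linear(2 * 64, 1)                        # stand-in for the pre-trained head
pred = prompt_answer(torch.randn(64), torch.randn(7, 64), proj_head)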

76 of 87

GPPT : Architecture

77 of 87

Experimental Results: Node Classification

78 of 87

Graph Prompt Feature(GPF)

79 of 87

Prompt Tuning : Background

  • Prompt tuning a pre-trained LLM
    • Step 1: Pre-training an LLM using the Masked Language Modeling (MLM).
    • Step 2: Reformulating the downstream task by a prompt on the input sentence.

80 of 87

Prompt Tuning : Background

  • Pre-trained LLM vs Pre-trained GNNs

How to apply prompt tuning on pre-trained GNNs?

81 of 87

Graph Prompt Tuning : Prior Work

  • Prior work (GPPT) performs graph prompt tuning by reformulating the downstream task as link prediction, which is consistent with the pre-training strategy it uses.

82 of 87

Graph Prompt Tuning : Limitation

  • There is no unified pre-training task for GNNs.
  • It is therefore challenging to design general prompting functions.
  • Existing methods have limited applicability: they are only compatible with models pre-trained by link prediction.

83 of 87

Graph Prompt Tuning

  • Template design: generate the graph template, which includes learnable components in its adjacency matrix and feature matrix.
  • Prompt optimization: search for the optimal prompt parameters according to the downstream task.

84 of 87

Specialized Graph Prompt Tuning

  • Following the motivation of prompt tuning, the graph prompt design is closely related to the pre-training task involved.
  • However, there are many pre-training strategies in the graph field. Can we design a universal graph prompt tuning method that covers all of them?

85 of 87

Universal Graph Prompt Tuning

  • Graph Prompt Feature (GPF):
    • GPF incorporates additional learnable parameters into the feature space of the input graph.
    • A learnable vector is added to the node features X to generate the prompted features X*.
  • Graph Prompt Feature-Plus (GPF-plus):
    • GPF-plus uses a different prompt vector for each node in the graph (see the sketch below).
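
A minimal sketch of the two prompted-feature variants; GPF-plus is shown here as an attention-weighted mixture over a small basis of prompt vectors, an illustrative reading of "a different vector per node" rather than a faithful reproduction of the paper:

import torch
import torch.nn as nn

class GPF(nn.Module):
    # A single learnable prompt vector shared by every node.
    def __init__(self, feat_dim: int):
        super().__init__()
        self.p = nn.Parameter(torch.zeros(feat_dim))

    def forward(self, x):                              # x: [num_nodes, feat_dim]
        return x + self.p                              # prompted features X*

class GPFPlus(nn.Module):
    # A node-specific prompt built by attending over a small prompt basis.
    def __init__(self, feat_dim: int, num_basis: int = 4):
        super().__init__()
        self.basis = nn.Parameter(torch.zeros(num_basis, feat_dim))
        self.attn = nn.Linear(feat_dim, num_basis)

    def forward(self, x):
        weights = torch.softmax(self.attn(x), dim=-1)  # per-node mixture weights
        return x + weights @ self.basis                # node-specific prompted features

x_star = GPF(feat_dim=32)(torch.randn(10, 32))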

86 of 87

Experimental Evaluation

  • GPF and GPF-plus achieved better results than fine-tuning in 80% of the experiments.

87 of 87

Experimental Evaluation

  • Comparison with existing graph prompt-based methods.
    • GPF and GPF-plus achieved better results than specialized graph prompt methods by a large margin.