CS224W - Machine Learning with Graphs: GNN (2)

A Single Layer of a GNN

A GNN Layer = Message + Aggregation

Compress a set of vectors into a single vector

Message Computation

Message Aggregation

Message Aggregation: Issue (during aggregation, the node's own information can be lost, because no message is passed from a node to itself)

Issue: Information from node vv itself could get lost

  • Computation of $\mathbf{h}_v^{(l)}$ does not directly depend on $\mathbf{h}_v^{(l-1)}$

Solution: Include $\mathbf{h}_v^{(l-1)}$ when computing $\mathbf{h}_v^{(l)}$

Classical GNN Layers

GCN (Graph Convolutional Networks)

$$\mathbf{h}_v^{(l)} = \sigma \left( \mathbf{W}^{(l)} \sum_{u \in N(v)} \frac{\mathbf{h}_u^{(l-1)}}{|N(v)|} \right)$$
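
A minimal sketch of this update in PyTorch, assuming a dense adjacency matrix and mean aggregation (the class name `SimpleGCNLayer` is illustrative, not an official implementation):

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """Mean-over-neighbors GCN-style update: h_v = sigma(W * mean_{u in N(v)} h_u)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)  # W^(l)

    def forward(self, h, adj):
        # h:   [N, in_dim]  node embeddings from the previous layer
        # adj: [N, N]       dense adjacency matrix (1.0 if edge, else 0.0)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)  # |N(v)|, avoid division by zero
        agg = adj @ h / deg                              # sum_u h_u / |N(v)|
        return torch.relu(self.linear(agg))              # sigma(W * agg)
```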

GraphSAGE

$$\mathbf{h}_v^{(l)} = \sigma \left( \mathbf{W}^{(l)} \cdot \text{CONCAT} \left( \mathbf{h}_v^{(l-1)}, \text{AGG} \left( \{ \mathbf{h}_u^{(l-1)}, \forall u \in N(v) \} \right) \right) \right)$$

GraphSAGE Neighbor Aggregation

$$\text{AGG} = \sum_{u \in N(v)} \frac{\mathbf{h}_u^{(l-1)}}{|N(v)|} \qquad \text{(mean of neighbors)}$$
$$\text{AGG} = \text{Mean}\left(\{ \text{MLP}(\mathbf{h}_u^{(l-1)}), \forall u \in N(v) \}\right) \qquad \text{(pool: transform, then aggregate)}$$
$$\text{AGG} = \text{LSTM}\left(\left[\mathbf{h}_u^{(l-1)}, \forall u \in \pi(N(v))\right]\right) \qquad \text{(LSTM over a shuffled ordering } \pi\text{)}$$

GraphSAGE: $\ell_2$ Normalization (normalize the embeddings after each layer)

$\ell_2$ normalization: $\mathbf{h}_v^{(l)} \leftarrow \mathbf{h}_v^{(l)} / \lVert \mathbf{h}_v^{(l)} \rVert_2$, so that every embedding has unit $\ell_2$ length.
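
A minimal sketch of a GraphSAGE-style layer with mean aggregation and $\ell_2$ normalization, under the same dense-adjacency assumption as above (the class name `SimpleSAGELayer` is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSAGELayer(nn.Module):
    """GraphSAGE-style update with mean aggregation and optional l2 normalization."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        # W^(l) acts on the concatenation [h_v ; AGG(neighbors)]
        self.linear = nn.Linear(2 * in_dim, out_dim, bias=False)

    def forward(self, h, adj, l2_normalize=True):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        agg = adj @ h / deg                                   # mean over neighbors
        out = torch.relu(self.linear(torch.cat([h, agg], dim=1)))
        if l2_normalize:
            out = F.normalize(out, p=2, dim=1)                # h_v <- h_v / ||h_v||_2
        return out
```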

Graph Attention Networks

$$\mathbf{h}_v^{(l)} = \sigma \left( \sum_{u \in N(v)} \alpha_{vu} \mathbf{W}^{(l)} \mathbf{h}_u^{(l-1)} \right)$$

Since each neighbor may influence the node to a different degree, these weights should not all be the same; this motivates adding an attention mechanism to learn them.

Not all of a node's neighbors are equally important

Can the weighting factors $\alpha_{vu}$ be learned?

Goal: Specify arbitrary importance to different neighbors of each node in the graph

Idea: Compute the embedding $\mathbf{h}_v^{(l)}$ of each node in the graph following an attention strategy

Attention Mechanism

Let $\alpha_{vu}$ be computed as a byproduct of an attention mechanism $a$:

$$\mathbf{h}_v^{(l)} = \sigma \left( \sum_{u \in N(v)} \alpha_{vu} \mathbf{W}^{(l)} \mathbf{h}_u^{(l-1)} \right)$$

Weighted sum using $\alpha_{AB}, \alpha_{AC}, \alpha_{AD}$:

$$\mathbf{h}_A^{(l)} = \sigma \left( \alpha_{AB}\mathbf{W}^{(l)}\mathbf{h}_B^{(l-1)} + \alpha_{AC}\mathbf{W}^{(l)}\mathbf{h}_C^{(l-1)} + \alpha_{AD}\mathbf{W}^{(l)}\mathbf{h}_D^{(l-1)} \right)$$

What form does the attention mechanism $a$ take?

Use a simple single-layer neural network as $a$, so that $a$ has trainable parameters (the weights of the Linear layer)

$$e_{AB} = a\left(\mathbf{W}^{(l)} \mathbf{h}_A^{(l-1)}, \mathbf{W}^{(l)} \mathbf{h}_B^{(l-1)}\right) = \text{Linear}\left(\text{Concat}\left(\mathbf{W}^{(l)} \mathbf{h}_A^{(l-1)}, \mathbf{W}^{(l)} \mathbf{h}_B^{(l-1)}\right)\right)$$

The attention coefficients $e_{vu}$ are then normalized into the final attention weights $\alpha_{vu}$ with a softmax over each node's neighborhood: $\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{k \in N(v)} \exp(e_{vk})}$.
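
A minimal single-head sketch of this attention computation (dense adjacency; the class name `SimpleGATLayer` is an illustrative assumption, and the LeakyReLU used in the original GAT scores is omitted to mirror the Linear-of-Concat form above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGATLayer(nn.Module):
    """Single-head GAT-style layer: scores e_vu from a linear a(.), softmax over neighbors."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared W^(l)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention mechanism a

    def forward(self, h, adj):
        z = self.W(h)                                     # [N, out_dim]
        N = z.size(0)
        # e_vu = a(Concat(W h_v, W h_u)) for every node pair (dense, for clarity only)
        pairs = torch.cat([z.unsqueeze(1).expand(N, N, -1),
                           z.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = self.a(pairs).squeeze(-1)                     # [N, N]
        e = e.masked_fill(adj == 0, float('-inf'))        # only attend to neighbors
        alpha = F.softmax(e, dim=1)                       # softmax over N(v); NaN if a node is isolated
        return torch.relu(alpha @ z)                      # sigma(sum_u alpha_vu * W h_u)
```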

Multi-head Attention

Stabilizes the learning process of the attention mechanism
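
One common formulation (following the GAT approach): each head $k$ has its own weights and attention scores, and the head outputs are then aggregated, e.g., by concatenation or mean:

$$\mathbf{h}_v^{(l)}[k] = \sigma\left(\sum_{u \in N(v)} \alpha_{vu}^{k} \, \mathbf{W}_k^{(l)} \mathbf{h}_u^{(l-1)}\right), \qquad \mathbf{h}_v^{(l)} = \text{AGG}\left(\mathbf{h}_v^{(l)}[1], \dots, \mathbf{h}_v^{(l)}[K]\right)$$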

Benefits of Attention Mechanism

GNN Layers in Practice

Many modern deep learning modules can be incorporated into a GNN layer

Batch Normalization

Setup

Input: $X \in \mathbb{R}^{N \times D}$ ($N$ node embeddings of dimension $D$)

Trainable parameters: $\gamma, \beta \in \mathbb{R}^D$

Output: $Y \in \mathbb{R}^{N \times D}$

Step 1: Compute the mean and variance over the $N$ embeddings

$$\mu_j = \frac{1}{N} \sum_{i=1}^N X_{i,j}, \qquad \sigma^2_j = \frac{1}{N} \sum_{i=1}^N (X_{i,j} - \mu_j)^2$$

Step 2: Normalize the features using the computed mean and variance

$$\hat{X}_{i,j} = \frac{X_{i,j} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}, \qquad Y_{i,j} = \gamma_j \hat{X}_{i,j} + \beta_j$$
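
A minimal sketch of these two steps (written out manually to mirror the formulas; in practice one would typically use the built-in `torch.nn.BatchNorm1d`):

```python
import torch

def batch_norm(X, gamma, beta, eps=1e-5):
    # X: [N, D] node embeddings; gamma, beta: [D] trainable parameters
    mu = X.mean(dim=0)                        # Step 1: per-dimension mean over the N embeddings
    var = X.var(dim=0, unbiased=False)        # Step 1: per-dimension variance
    X_hat = (X - mu) / torch.sqrt(var + eps)  # Step 2: normalize
    return gamma * X_hat + beta               # Step 2: scale and shift

# Built-in equivalent (training mode):
# bn = torch.nn.BatchNorm1d(D)
```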

Dropout

Activation (non-linearity)

Apply the activation to the $i$-th dimension of the embedding $\mathbf{x}$

$$\text{ReLU}(\mathbf{x}_i) = \max(\mathbf{x}_i, 0)$$
$$\sigma(\mathbf{x}_i) = \frac{1}{1 + e^{-\mathbf{x}_i}}$$
$$\text{PReLU}(\mathbf{x}_i) = \max(\mathbf{x}_i, 0) + a_i \min(\mathbf{x}_i, 0)$$

$a_i$ is a trainable parameter.
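
For reference, these activations correspond directly to built-in PyTorch modules (a small illustration; the embedding dimension 8 is arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 8)               # a batch of embeddings
relu_out = nn.ReLU()(x)             # max(x_i, 0)
sigmoid_out = torch.sigmoid(x)      # 1 / (1 + exp(-x_i))
prelu = nn.PReLU(num_parameters=8)  # one trainable a_i per dimension
prelu_out = prelu(x)                # max(x_i, 0) + a_i * min(x_i, 0)
```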

Stacking Layers of a GNN

Construct a Graph Neural Network

If a GNN has too many layers, the over-smoothing problem arises (why?)

The Over-smoothing Problem

Receptive Field of a GNN

Receptive field: the set of nodes that determine the embedding of a node of interest

  • In a $K$-layer GNN, each node has a receptive field of its $K$-hop neighborhood

Receptive field overlap for two nodes

  • The number of shared neighbors quickly grows when we increase the number of hops (number of GNN layers)

We can explain over-smoothing via the notion of the receptive field

With too many layers, each node's receptive field covers more and more of the graph, so different nodes end up aggregating from largely the same set of nodes, and their embeddings become more and more similar.

Be cautious when adding GNN layers - Unlike neural networks in other domains (e.g., CNNs for image classification), adding more GNN layers does not always help.

Expressive power of shallow GNNs: how can we make a shallow GNN more expressive?

Pre-processing layers: Important when encoding node features is necessary.

Post-processing layers: Important when reasoning/transformation over node embeddings is needed

If our problem still requires many GNN layers, we need to add skip connections in GNNs

Why skip connections? Intuition: skip connections create shortcuts that let the model mix shallow and deep representations, so information from earlier layers is preserved.

Example: GCN with Skip Connections

A standard GCN layer

$$\mathbf{h}_v^{(l)} = \sigma \left( \sum_{u \in N(v)} \mathbf{W}^{(l)} \frac{\mathbf{h}_u^{(l-1)}}{|N(v)|} \right)$$

A GCN layer with skip connection

$$\mathbf{h}_v^{(l)} = \sigma \left( \sum_{u \in N(v)} \mathbf{W}^{(l)} \frac{\mathbf{h}_u^{(l-1)}}{|N(v)|} + \mathbf{h}_v^{(l-1)} \right)$$
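
A minimal sketch of the skip-connection variant, reusing the dense-adjacency setup from the GCN sketch above (the class name `GCNLayerWithSkip` is illustrative):

```python
import torch
import torch.nn as nn

class GCNLayerWithSkip(nn.Module):
    """GCN update plus a skip connection: h_v = sigma(agg + h_v_prev)."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim, bias=False)  # W^(l); same dim so the skip adds cleanly

    def forward(self, h, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        agg = self.linear(adj @ h / deg)   # standard GCN term
        return torch.relu(agg + h)         # skip connection: add h_v^(l-1) before sigma
```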

Example: Other Options of Skip Connections

Directly skip to the last layer - The final layer directly aggregates from all the node embeddings in the previous layers.

Graph Manipulation in GNNs

General GNN Framework

Idea: Raw input graph $\neq$ computational graph

Why Manipulate Graphs?

Our assumption so far has been: Raw input graph == Computational graph

Reasons for breaking this assumption

Graph Manipulation Approaches

Feature Augmentation on Graphs

Why do we need feature augmentation?

Input graph does not have node features

  • This is common when we only have the adjacency matrix

Standard approaches:

Assign constant values to nodes

Assign unique IDs to nodes

These IDs are converted into one-hot vectors
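
A tiny sketch of both standard approaches for $N$ nodes (the value of `N` is an arbitrary example):

```python
import torch

N = 5  # number of nodes (example value)

# Approach 1: constant node feature (e.g., a single 1 for every node)
const_features = torch.ones(N, 1)

# Approach 2: unique node IDs converted into one-hot vectors
id_features = torch.eye(N)  # row v is the one-hot encoding of node v's ID
```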

Certain structures are hard for a GNN to learn

We can use the cycle count as an augmented node feature

Other commonly used augmented features: node degree, clustering coefficient, PageRank, centrality, etc.

Add Virtual Nodes / Edges

Motivation: Augment sparse graphs

  1. Add virtual edges
    • Common approach: connect 2-hop neighbors via virtual edges
    • Intuition: instead of using the adjacency matrix $A$ for GNN computation, use $A + A^2$ (a sketch follows this list)
    • Example: bipartite graphs
      • Author-to-papers (they authored)
      • 2-hop virtual edges create an author-author collaboration graph
  2. Add virtual nodes
    • The virtual node connects to all the nodes in the graph
      • Suppose that in a sparse graph, two nodes have a shortest-path distance of 10
      • After adding the virtual node, every pair of nodes has a distance of 2 (Node A - Virtual node - Node B)
    • Benefit: greatly improves message passing in sparse graphs
  3. Node neighborhood sampling: so far, all of a node's neighbors are used for message passing
    • Problem: dense/large graphs and high-degree nodes make this expensive
    • New idea: (randomly) determine a node's neighborhood for message passing (see the sampling sketch below)
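
As a small illustration of the virtual-edge idea, here is a NumPy sketch for a toy bipartite author-paper graph (the adjacency matrix and variable names are made up for this example):

```python
import numpy as np

# Toy adjacency matrix for a bipartite author-paper graph (made-up example)
# rows/cols: [author1, author2, paper1, paper2]
A = np.array([
    [0, 0, 1, 1],   # author1 wrote paper1 and paper2
    [0, 0, 0, 1],   # author2 wrote paper2
    [1, 0, 0, 0],
    [1, 1, 0, 0],
])

A2 = A @ A                            # 2-hop connections
A_aug = ((A + A2) > 0).astype(int)    # use A + A^2 (binarized) instead of A
np.fill_diagonal(A_aug, 0)            # optionally drop self-loops introduced by A^2

print(A_aug[0, 1])  # -> 1: author1 and author2 are now directly connected (co-authors of paper2)
```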

Example: Neighborhood Sampling (similar to random sampling)
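
A minimal sketch of the idea: in each layer/epoch, only a random subset of each node's neighbors is kept for message passing (the function name `sample_neighborhood` and the sample size `k` are illustrative assumptions):

```python
import torch

def sample_neighborhood(adj, k=2):
    """Keep at most k randomly chosen neighbors per node in a dense adjacency matrix."""
    sampled = torch.zeros_like(adj)
    for v in range(adj.size(0)):
        neighbors = adj[v].nonzero(as_tuple=True)[0]
        if len(neighbors) > k:
            keep = neighbors[torch.randperm(len(neighbors))[:k]]
        else:
            keep = neighbors
        sampled[v, keep] = 1
    return sampled  # use this sparsified adjacency for message passing in this layer/epoch
```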