CS224W - Machine Learning with Graphs: GNN (2)

A Single Layer of a GNN

A GNN Layer = Message + Aggregation

Compress a set of vectors into a single vector

Message Computation

Message Aggregation

Message Aggregation: Issue (during aggregation, the node's own information can be lost, because no message is passed from a node to itself)

Issue: Information from node vv itself could get lost

  • Computation of $\mathbf{h}_v^{(l)}$ does not directly depend on $\mathbf{h}_v^{(l-1)}$

Solution: Include $\mathbf{h}_v^{(l-1)}$ when computing $\mathbf{h}_v^{(l)}$

Classical GNN Layers

GCN (Graph Convolutional Networks)

$$\mathbf{h}_v^{(l)} = \sigma \left( \mathbf{W}^{(l)} \sum_{u \in N(v)} \frac{\mathbf{h}_u^{(l-1)}}{|N(v)|} \right)$$
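
A minimal sketch of this update in PyTorch, assuming a dense adjacency matrix and mean aggregation (the class name `SimpleGCNLayer` is illustrative, not an official implementation):

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """Mean-over-neighbors GCN-style update: h_v = sigma(W * mean_{u in N(v)} h_u)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)  # W^(l)

    def forward(self, h, adj):
        # h:   [N, in_dim]  node embeddings from the previous layer
        # adj: [N, N]       dense adjacency matrix (1.0 if edge, else 0.0)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)  # |N(v)|, avoid division by zero
        agg = adj @ h / deg                              # sum_u h_u / |N(v)|
        return torch.relu(self.linear(agg))              # sigma(W * agg)
```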

GraphSAGE

$$\mathbf{h}_v^{(l)} = \sigma \left( \mathbf{W}^{(l)} \cdot \text{CONCAT} \left( \mathbf{h}_v^{(l-1)}, \text{AGG} \left( \{ \mathbf{h}_u^{(l-1)}, \forall u \in N(v) \} \right) \right) \right)$$

GraphSAGE Neighbor Aggregation

$$\text{AGG} = \sum_{u \in N(v)} \frac{\mathbf{h}_u^{(l-1)}}{|N(v)|} \qquad \text{(mean of neighbors)}$$
$$\text{AGG} = \text{Mean}\left(\{ \text{MLP}(\mathbf{h}_u^{(l-1)}), \forall u \in N(v) \}\right) \qquad \text{(pool: transform, then aggregate)}$$
$$\text{AGG} = \text{LSTM}\left(\left[\mathbf{h}_u^{(l-1)}, \forall u \in \pi(N(v))\right]\right) \qquad \text{(LSTM over a shuffled ordering } \pi\text{)}$$

GraphSAGE: $\ell_2$ Normalization (normalize the embeddings after each layer)

$\ell_2$ normalization: $\mathbf{h}_v^{(l)} \leftarrow \mathbf{h}_v^{(l)} / \lVert \mathbf{h}_v^{(l)} \rVert_2$, so that every embedding has unit $\ell_2$ length.
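
A minimal sketch of a GraphSAGE-style layer with mean aggregation and $\ell_2$ normalization, under the same dense-adjacency assumption as above (the class name `SimpleSAGELayer` is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSAGELayer(nn.Module):
    """GraphSAGE-style update with mean aggregation and optional l2 normalization."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        # W^(l) acts on the concatenation [h_v ; AGG(neighbors)]
        self.linear = nn.Linear(2 * in_dim, out_dim, bias=False)

    def forward(self, h, adj, l2_normalize=True):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        agg = adj @ h / deg                                   # mean over neighbors
        out = torch.relu(self.linear(torch.cat([h, agg], dim=1)))
        if l2_normalize:
            out = F.normalize(out, p=2, dim=1)                # h_v <- h_v / ||h_v||_2
        return out
```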

Graph Attention Networks

$$\mathbf{h}_v^{(l)} = \sigma \left( \sum_{u \in N(v)} \alpha_{vu} \mathbf{W}^{(l)} \mathbf{h}_u^{(l-1)} \right)$$

Since each neighbor may influence the node to a different degree, these weights should not all be the same; this motivates adding an attention mechanism to learn them.

Not all of a node's neighbors are equally important

Can the weighting factors $\alpha_{vu}$ be learned?

Goal: Specify arbitrary importance to different neighbors of each node in the graph

Idea: Compute the embedding $\mathbf{h}_v^{(l)}$ of each node in the graph following an attention strategy

Attention Mechanism

Let $\alpha_{vu}$ be computed as a byproduct of an attention mechanism $a$:

$$\mathbf{h}_v^{(l)} = \sigma \left( \sum_{u \in N(v)} \alpha_{vu} \mathbf{W}^{(l)} \mathbf{h}_u^{(l-1)} \right)$$

Weighted sum using $\alpha_{AB}, \alpha_{AC}, \alpha_{AD}$:

$$\mathbf{h}_A^{(l)} = \sigma \left( \alpha_{AB}\mathbf{W}^{(l)}\mathbf{h}_B^{(l-1)} + \alpha_{AC}\mathbf{W}^{(l)}\mathbf{h}_C^{(l-1)} + \alpha_{AD}\mathbf{W}^{(l)}\mathbf{h}_D^{(l-1)} \right)$$

What form does the attention mechanism $a$ take?

Use a simple single-layer neural network as $a$, so that $a$ has trainable parameters (the weights of the Linear layer)

$$e_{AB} = a\left(\mathbf{W}^{(l)} \mathbf{h}_A^{(l-1)}, \mathbf{W}^{(l)} \mathbf{h}_B^{(l-1)}\right) = \text{Linear}\left(\text{Concat}\left(\mathbf{W}^{(l)} \mathbf{h}_A^{(l-1)}, \mathbf{W}^{(l)} \mathbf{h}_B^{(l-1)}\right)\right)$$

The attention coefficients $e_{vu}$ are then normalized into the final attention weights $\alpha_{vu}$ with a softmax over each node's neighborhood: $\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{k \in N(v)} \exp(e_{vk})}$.
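
A minimal single-head sketch of this attention computation (dense adjacency; the class name `SimpleGATLayer` is an illustrative assumption, and the LeakyReLU used in the original GAT scores is omitted to mirror the Linear-of-Concat form above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGATLayer(nn.Module):
    """Single-head GAT-style layer: scores e_vu from a linear a(.), softmax over neighbors."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared W^(l)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention mechanism a

    def forward(self, h, adj):
        z = self.W(h)                                     # [N, out_dim]
        N = z.size(0)
        # e_vu = a(Concat(W h_v, W h_u)) for every node pair (dense, for clarity only)
        pairs = torch.cat([z.unsqueeze(1).expand(N, N, -1),
                           z.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = self.a(pairs).squeeze(-1)                     # [N, N]
        e = e.masked_fill(adj == 0, float('-inf'))        # only attend to neighbors
        alpha = F.softmax(e, dim=1)                       # softmax over N(v); NaN if a node is isolated
        return torch.relu(alpha @ z)                      # sigma(sum_u alpha_vu * W h_u)
```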

Multi-head Attention

Stabilizes the learning process of the attention mechanism
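
One common formulation (following the GAT approach): each head $k$ has its own weights and attention scores, and the head outputs are then aggregated, e.g., by concatenation or mean:

$$\mathbf{h}_v^{(l)}[k] = \sigma\left(\sum_{u \in N(v)} \alpha_{vu}^{k} \, \mathbf{W}_k^{(l)} \mathbf{h}_u^{(l-1)}\right), \qquad \mathbf{h}_v^{(l)} = \text{AGG}\left(\mathbf{h}_v^{(l)}[1], \dots, \mathbf{h}_v^{(l)}[K]\right)$$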

Benefits of Attention Mechanism

GNN Layers in Practice

Many modern deep learning modules can be incorporated into a GNN layer

Batch Normalization

Setup

Input: $X \in \mathbb{R}^{N \times D}$ ($N$ node embeddings of dimension $D$)

Trainable parameters: $\gamma, \beta \in \mathbb{R}^D$

Output: $Y \in \mathbb{R}^{N \times D}$

Step 1: Compute the mean and variance over the $N$ embeddings

$$\mu_j = \frac{1}{N} \sum_{i=1}^N X_{i,j}, \qquad \sigma^2_j = \frac{1}{N} \sum_{i=1}^N (X_{i,j} - \mu_j)^2$$

Step 2: Normalize the features using the computed mean and variance

$$\hat{X}_{i,j} = \frac{X_{i,j} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}, \qquad Y_{i,j} = \gamma_j \hat{X}_{i,j} + \beta_j$$
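
A minimal sketch of these two steps (written out manually to mirror the formulas; in practice one would typically use the built-in `torch.nn.BatchNorm1d`):

```python
import torch

def batch_norm(X, gamma, beta, eps=1e-5):
    # X: [N, D] node embeddings; gamma, beta: [D] trainable parameters
    mu = X.mean(dim=0)                        # Step 1: per-dimension mean over the N embeddings
    var = X.var(dim=0, unbiased=False)        # Step 1: per-dimension variance
    X_hat = (X - mu) / torch.sqrt(var + eps)  # Step 2: normalize
    return gamma * X_hat + beta               # Step 2: scale and shift

# Built-in equivalent (training mode):
# bn = torch.nn.BatchNorm1d(D)
```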

Dropout

Activation (non-linearity)

Apply the activation to the $i$-th dimension of the embedding $\mathbf{x}$

$$\text{ReLU}(\mathbf{x}_i) = \max(\mathbf{x}_i, 0)$$
$$\sigma(\mathbf{x}_i) = \frac{1}{1 + e^{-\mathbf{x}_i}}$$
$$\text{PReLU}(\mathbf{x}_i) = \max(\mathbf{x}_i, 0) + a_i \min(\mathbf{x}_i, 0)$$

$a_i$ is a trainable parameter.
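
For reference, these activations correspond directly to built-in PyTorch modules (a small illustration; the embedding dimension 8 is arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 8)               # a batch of embeddings
relu_out = nn.ReLU()(x)             # max(x_i, 0)
sigmoid_out = torch.sigmoid(x)      # 1 / (1 + exp(-x_i))
prelu = nn.PReLU(num_parameters=8)  # one trainable a_i per dimension
prelu_out = prelu(x)                # max(x_i, 0) + a_i * min(x_i, 0)
```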

Stacking Layers of a GNN

Construct a Graph Neural Network

If a GNN has too many layers, the over-smoothing problem arises (why?)

The Over-smoothing Problem

Receptive Field of a GNN

Receptive field: the set of nodes that determine the embedding of a node of interest

  • In a $K$-layer GNN, each node has a receptive field of its $K$-hop neighborhood

Receptive field overlap for two nodes

  • The number of shared neighbors quickly grows when we increase the number of hops (number of GNN layers)

We can explain over-smoothing via the notion of the receptive field

With too many layers, each node's receptive field covers more and more of the graph, so different nodes end up aggregating from largely the same set of nodes, and their embeddings become more and more similar.

Be cautious when adding GNN layers - Unlike neural networks in other domains (e.g., CNNs for image classification), adding more GNN layers does not always help.

Expressive power of shallow GNNs: how can we make a shallow GNN more expressive?

Pre-processing layers: Important when encoding node features is necessary.

Post-processing layers: Important when reasoning/transformation over node embeddings is needed

If our problem still requires many GNN layers, we need to add skip connections in GNNs

Why skip connections? Intuition: skip connections create shortcuts that let the model mix shallow and deep representations, so information from earlier layers is preserved.

Example: GCN with Skip Connections

A standard GCN layer

$$\mathbf{h}_v^{(l)} = \sigma \left( \sum_{u \in N(v)} \mathbf{W}^{(l)} \frac{\mathbf{h}_u^{(l-1)}}{|N(v)|} \right)$$

A GCN layer with skip connection

$$\mathbf{h}_v^{(l)} = \sigma \left( \sum_{u \in N(v)} \mathbf{W}^{(l)} \frac{\mathbf{h}_u^{(l-1)}}{|N(v)|} + \mathbf{h}_v^{(l-1)} \right)$$
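
A minimal sketch of the skip-connection variant, reusing the dense-adjacency setup from the GCN sketch above (the class name `GCNLayerWithSkip` is illustrative):

```python
import torch
import torch.nn as nn

class GCNLayerWithSkip(nn.Module):
    """GCN update plus a skip connection: h_v = sigma(agg + h_v_prev)."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim, bias=False)  # W^(l); same dim so the skip adds cleanly

    def forward(self, h, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        agg = self.linear(adj @ h / deg)   # standard GCN term
        return torch.relu(agg + h)         # skip connection: add h_v^(l-1) before sigma
```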

Example: Other Options of Skip Connections

Directly skip to the last layer - The final layer directly aggregates from all the node embeddings in the previous layers.

Graph Manipulation in GNNs

General GNN Framework

Idea: Raw input graph $\neq$ computational graph

Why Manipulate Graphs?

Our assumption so far has been: Raw input graph == Computational graph

Reasons for breaking this assumption

Graph Manipulation Approaches

Feature Augmentation on Graphs

Why do we need feature augmentation?

Input graph does not have node features

  • This is common when we only have the adjacency matrix

Standard approaches:

Assign constant values to nodes

Assign unique IDs to nodes

These IDs are converted into one-hot vectors
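
A tiny sketch of both standard approaches for $N$ nodes (the value of `N` is an arbitrary example):

```python
import torch

N = 5  # number of nodes (example value)

# Approach 1: constant node feature (e.g., a single 1 for every node)
const_features = torch.ones(N, 1)

# Approach 2: unique node IDs converted into one-hot vectors
id_features = torch.eye(N)  # row v is the one-hot encoding of node v's ID
```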

Certain structures are hard for a GNN to learn

We can use the cycle count as an augmented node feature

Other commonly used augmented features: node degree, clustering coefficient, PageRank, centrality, etc.

Add Virtual Nodes / Edges

Motivation: Augment sparse graphs

  1. Add virtual edges
    • Common approach: connect 2-hop neighbors via virtual edges
    • Intuition: instead of using the adjacency matrix $A$ for GNN computation, use $A + A^2$ (a sketch follows this list)
    • Example: bipartite graphs
      • Author-to-papers (they authored)
      • 2-hop virtual edges create an author-author collaboration graph
  2. Add virtual nodes
    • The virtual node connects to all the nodes in the graph
      • Suppose that in a sparse graph, two nodes have a shortest-path distance of 10
      • After adding the virtual node, every pair of nodes has a distance of 2 (Node A - Virtual node - Node B)
    • Benefit: greatly improves message passing in sparse graphs
  3. Node neighborhood sampling: so far, all of a node's neighbors are used for message passing
    • Problem: dense/large graphs and high-degree nodes make this expensive
    • New idea: (randomly) determine a node's neighborhood for message passing (see the sampling sketch below)
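
As a small illustration of the virtual-edge idea, here is a NumPy sketch for a toy bipartite author-paper graph (the adjacency matrix and variable names are made up for this example):

```python
import numpy as np

# Toy adjacency matrix for a bipartite author-paper graph (made-up example)
# rows/cols: [author1, author2, paper1, paper2]
A = np.array([
    [0, 0, 1, 1],   # author1 wrote paper1 and paper2
    [0, 0, 0, 1],   # author2 wrote paper2
    [1, 0, 0, 0],
    [1, 1, 0, 0],
])

A2 = A @ A                            # 2-hop connections
A_aug = ((A + A2) > 0).astype(int)    # use A + A^2 (binarized) instead of A
np.fill_diagonal(A_aug, 0)            # optionally drop self-loops introduced by A^2

print(A_aug[0, 1])  # -> 1: author1 and author2 are now directly connected (co-authors of paper2)
```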

Example: Neighborhood Sampling (similar to random sampling)
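
A minimal sketch of the idea: in each layer/epoch, only a random subset of each node's neighbors is kept for message passing (the function name `sample_neighborhood` and the sample size `k` are illustrative assumptions):

```python
import torch

def sample_neighborhood(adj, k=2):
    """Keep at most k randomly chosen neighbors per node in a dense adjacency matrix."""
    sampled = torch.zeros_like(adj)
    for v in range(adj.size(0)):
        neighbors = adj[v].nonzero(as_tuple=True)[0]
        if len(neighbors) > k:
            keep = neighbors[torch.randperm(len(neighbors))[:k]]
        else:
            keep = neighbors
        sampled[v, keep] = 1
    return sampled  # use this sparsified adjacency for message passing in this layer/epoch
```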