CS224W - Machine Learning with Graphs - GNN 1

Basics of Deep Learning

\min_{\Theta} \mathcal{L}(y, f_{\Theta}(x))
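As a concrete toy instance of this objective, the sketch below fits f_Θ(x) = θx to made-up data with a squared-error loss and plain gradient descent; the data, learning rate, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

# Toy data for f_Theta(x) = theta * x with squared-error loss (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)   # "true" parameter is 3.0

theta = 0.0
lr = 0.1
for _ in range(200):
    # L(y, f_Theta(x)) = mean squared error; take a gradient step on theta
    grad = np.mean(2 * (theta * x - y) * x)   # dL/dtheta
    theta -= lr * grad

print(round(theta, 2))   # should be close to 3.0
```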

Deep Learning for Graphs

Setup

Graph G with node set V, adjacency matrix A, node feature matrix X, and neighborhood N(v) for each node v.

A Naive Approach

Convolutional Networks

Goal: generalize convolutions beyond simple lattices and leverage node features/attributes (e.g., text, images).

Real-World Graphs

Permutation Invariance

A graph does not have a canonical ordering of its nodes!

We can have many different order plans (orderings of the nodes).

What does it mean for the “graph representation” to be the same for two order plans?

f(A_1, X_1) = f(A_2, X_2)

Permutation Equivariance

Summary: Invariance and Equivariance

For any permutation matrix P:

  • Invariance: f(A, X) = f(PAP^T, PX)
  • Equivariance: P f(A, X) = f(PAP^T, PX)
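A quick numerical sanity check of both properties. The two functions below are hypothetical examples chosen only because their behavior is easy to verify: a graph-level readout (sum of aggregated features, invariant) and a node-level map A X (equivariant).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T          # random symmetric adjacency, no self-loops
X = rng.normal(size=(n, d))

P = np.eye(n)[rng.permutation(n)]       # random permutation matrix

f_inv = lambda A, X: (A @ X).sum(axis=0)   # hypothetical graph-level readout: invariant
f_eqv = lambda A, X: A @ X                 # hypothetical node-level map: equivariant

# Invariance: f(A, X) == f(P A P^T, P X)
print(np.allclose(f_inv(A, X), f_inv(P @ A @ P.T, P @ X)))      # True
# Equivariance: P f(A, X) == f(P A P^T, P X)
print(np.allclose(P @ f_eqv(A, X), f_eqv(P @ A @ P.T, P @ X)))  # True
```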

Graph Convolutional Networks [Kipf and Welling, ICLR 2017]


Idea: Node’s neighborhood defines a computation graph

How to propagate information across the graph to compute node features?

Key idea: Generate node embeddings based on local network neighborhoods

Deep Model: Many Layers

Model can be of arbitrary depth:

  • Nodes have embeddings at each layer
  • Layer-0 embedding of node v is its input feature x_v
  • Layer-k embedding gets information from nodes that are k hops away

Neighborhood Aggregation

Key distinctions are in how different approaches aggregate information across the layers

Basic approach: Average information from neighbors and apply a neural network.

h_v^{(0)} = \textbf{x}_v
h_v^{(k+1)} = \sigma\left(W_k \sum_{u \in N(v)} \frac{h_u^{(k)}}{|N(v)|} + B_k h_v^{(k)}\right),\ \forall k \in \{0,\cdots,K-1\}
\textbf{z}_v = h_v^{(K)}
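A minimal NumPy sketch of one such layer under the update above (with ReLU as σ); the random graph, feature sizes, and weights are placeholders, not trained values.

```python
import numpy as np

def gcn_layer(A, H, W, B):
    """One mean-aggregation layer: h_v <- relu(W * mean_{u in N(v)} h_u + B * h_v)."""
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1)   # |N(v)| for each node
    neigh_mean = (A @ H) / deg                           # average neighbor embeddings
    return np.maximum(neigh_mean @ W.T + H @ B.T, 0.0)   # sigma = ReLU

# Random toy graph and placeholder weights (illustrative only)
rng = np.random.default_rng(0)
n, d_in, d_out = 6, 4, 8
A = (rng.random((n, n)) < 0.5).astype(float)
A = np.triu(A, 1); A = A + A.T
X = rng.normal(size=(n, d_in))

W0, B0 = rng.normal(size=(d_out, d_in)), rng.normal(size=(d_out, d_in))
H1 = gcn_layer(A, X, W0, B0)   # layer-1 embeddings h_v^(1)
print(H1.shape)                 # (6, 8)
```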

GCN: Invariance and Equivariance

What are the invariance and equivariance properties of a GCN? Averaging over a node's neighbors does not depend on how the neighbors are ordered, so computing a single node's embedding is permutation invariant; stacking the embeddings of all nodes therefore gives a permutation-equivariant function of (A, X).

How do we train the GCN to generate embeddings?

Need to define a loss function on the embeddings.

Model Parameters

h_v^{(0)} = \textbf{x}_v
h_v^{(k+1)} = \sigma\left(W_k \sum_{u \in N(v)} \frac{h_u^{(k)}}{|N(v)|} + B_k h_v^{(k)}\right),\ \forall k \in \{0,\cdots,K-1\}
\textbf{z}_v = h_v^{(K)}

The trainable parameters are W_k (weights on the aggregated neighbor embeddings) and B_k (weights on the node's own embedding).

Matrix Formulation

Stacking the node embeddings into a matrix H^{(k)} = [h_1^{(k)}, \dots, h_{|V|}^{(k)}]^T, the neighbor average becomes a sparse matrix product:

\sum_{u \in N(v)} \frac{h_u^{(k)}}{|N(v)|} \Longrightarrow D^{-1} A H^{(k)}

where D is the diagonal degree matrix with D_{vv} = |N(v)|, so the full update can be written as H^{(k+1)} = \sigma\left(D^{-1} A H^{(k)} W_k^T + H^{(k)} B_k^T\right).
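A short check that the matrix form reproduces the per-node neighbor average, on a small random graph (values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 4
A = (rng.random((n, n)) < 0.5).astype(float)
A = np.triu(A, 1); A = A + A.T
H = rng.normal(size=(n, d))

D_inv = np.diag(1.0 / np.maximum(A.sum(axis=1), 1))   # D^{-1}
matrix_form = D_inv @ A @ H                            # D^{-1} A H^{(k)}

# Per-node neighbor average, directly from the definition
loop_form = np.zeros_like(H)
for v in range(n):
    nbrs = np.flatnonzero(A[v])
    if len(nbrs):
        loop_form[v] = H[nbrs].mean(axis=0)

print(np.allclose(matrix_form, loop_form))   # True
```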

Training a GNN

Unsupervised Training

\min_{\Theta}\mathcal{L} = \sum_{\textbf{z}_u,\textbf{z}_v} \mathrm{CE}(y_{u,v}, \mathrm{DEC}(\textbf{z}_u,\textbf{z}_v))

where y_{u,v} = 1 when nodes u and v are similar (e.g., connected by an edge or co-occurring on random walks), CE is the cross-entropy loss, and DEC is a decoder such as the inner product.
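One possible instantiation of this loss, assuming (as above) that y_{u,v} labels sampled node pairs and that DEC is an inner-product decoder followed by a sigmoid; the embeddings, pairs, and labels below are made up.

```python
import numpy as np

def unsup_loss(Z, pairs, labels):
    """BCE over node pairs with decoder DEC(z_u, z_v) = sigmoid(z_u . z_v)."""
    zu, zv = Z[pairs[:, 0]], Z[pairs[:, 1]]
    p = 1.0 / (1.0 + np.exp(-(zu * zv).sum(axis=1)))   # decoded similarity
    eps = 1e-9
    return -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 8))                          # embeddings z_v from the GNN (placeholder)
pairs = np.array([[0, 1], [1, 2], [3, 4], [0, 5]])   # sampled node pairs (u, v), illustrative
labels = np.array([1., 1., 0., 0.])                  # y_{u,v}: 1 if u and v are "similar"
print(unsup_loss(Z, pairs, labels))
```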

Supervised Training

Directly train the model for a supervised task (e.g. node classification)

Use cross entropy loss

\mathcal{L} = - \sum_{v \in V} \left[\textbf{y}_v \log\left(\sigma(\textbf{z}^T_v\theta)\right) + (1-\textbf{y}_v)\log\left(1-\sigma(\textbf{z}^T_v \theta)\right)\right]

where \textbf{y}_v is the node label, \theta are the classification weights, and \sigma is the sigmoid.
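A direct NumPy transcription of this loss for binary node classification; the embeddings, labels, and weight vector θ are placeholders.

```python
import numpy as np

def node_clf_loss(Z, y, theta):
    """L = -sum_v [ y_v log(sigmoid(z_v^T theta)) + (1 - y_v) log(1 - sigmoid(z_v^T theta)) ]."""
    p = 1.0 / (1.0 + np.exp(-(Z @ theta)))   # sigma(z_v^T theta) for every node
    eps = 1e-9
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 8))         # node embeddings z_v (placeholder)
y = np.array([0, 1, 1, 0, 1, 0.])   # node labels y_v (made up)
theta = rng.normal(size=8)          # classification weights (placeholder)
print(node_clf_loss(Z, y, theta))
```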

Model Design: Overview

Inductive Capability

The same aggregation parameters are shared for all nodes:

  • The number of model parameters is sublinear in |V|, so we can generalize to unseen nodes (see the sketch below)!
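Because the layer weights have shapes that depend only on the feature dimensions, the same (hypothetical) trained weights can be applied to a larger graph the model never saw:

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out = 4, 8
# Shared weights W_0, B_0: their shapes depend only on feature sizes, never on |V|
# (random values standing in for trained parameters)
W0, B0 = rng.normal(size=(d_out, d_in)), rng.normal(size=(d_out, d_in))

# A new, larger graph that was never seen during training (purely illustrative)
n_new = 50
A_new = (rng.random((n_new, n_new)) < 0.1).astype(float)
A_new = np.triu(A_new, 1); A_new = A_new + A_new.T
X_new = rng.normal(size=(n_new, d_in))

# Same update rule as before: h_v <- relu(W_0 * mean_{u in N(v)} x_u + B_0 * x_v)
deg = np.maximum(A_new.sum(axis=1, keepdims=True), 1)
H1_new = np.maximum(((A_new @ X_new) / deg) @ W0.T + X_new @ B0.T, 0.0)
print(H1_new.shape)   # (50, 8): embeddings for all 50 unseen nodes
```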

GNNs subsume CNNs

GNN vs. CNN vs. Transformer

Convolutional Neural Network

GNN vs. CNN

GNN formulation (mean aggregation over neighbors, shared W_l):

h_v^{(l+1)} = \sigma\left(W_l \sum_{u \in N(v)} \frac{h_u^{(l)}}{|N(v)|} + B_l h_v^{(l)}\right), \ \forall l \in \{0,\cdots,L-1\}

CNN formulation (a separate weight W_l^u for each neighbor u):

h_v^{(l+1)} = \sigma\left(\sum_{u \in N(v)} W_l^u h_u^{(l)} + B_l h_v^{(l)}\right), \ \forall l \in \{0,\cdots,L-1\}

Key difference: we can learn a different W_l^u for each “neighbor” u of pixel v on the image, because we can pick an order for the 9 neighbors using their relative position to the center pixel: \{(-1,-1),(-1,0),\cdots,(1,1)\}.
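A small numerical check of this view: at an interior pixel, a 3x3 cross-correlation equals a sum over the 9 ordered offsets, each with its own scalar weight W_l^u (the image, kernel, and pixel choice below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.normal(size=(5, 5))   # single-channel image (arbitrary values)
K = rng.normal(size=(3, 3))     # 3x3 kernel: one scalar weight W_l^u per offset u

r, c = 2, 2                     # an interior pixel v
offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]

# "GNN view": sum over ordered neighbors u, each with its own weight
gnn_view = sum(K[dr + 1, dc + 1] * img[r + dr, c + dc] for dr, dc in offsets)

# Plain 3x3 cross-correlation at the same pixel
conv_view = np.sum(K * img[r - 1:r + 2, c - 1:c + 2])

print(np.allclose(gnn_view, conv_view))   # True
```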

Transformer

Key component: Self-attention

A general definition of attention: given a set of value vectors and a query vector, attention is a technique to compute a weighted sum of the values, dependent on the query.

A Transformer layer can be seen as a special GNN that runs on a fully connected “word” graph!
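A tiny single-head self-attention sketch that makes the analogy concrete: every token attends to every other token, i.e. message passing on a complete “word” graph. The token features, dimensions, and projection weights are made up, and masking/multi-head structure is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 4, 8
X = rng.normal(size=(n_tokens, d))    # token ("word" node) features, made up

# Random projection weights standing in for learned parameters
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv      # queries, keys, values

scores = Q @ K.T / np.sqrt(d)         # all-pairs scores: edges of a complete graph
attn = np.exp(scores)
attn /= attn.sum(axis=1, keepdims=True)   # softmax over all tokens

out = attn @ V                        # each token: attention-weighted sum of values
print(out.shape)                       # (4, 8)
```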