CS224W - Machine Learning with Graphs - GNN 3

GNN Training Pipeline

Output of a GNN: a set of node embeddings $\{ \mathbf{h}^{(L)}_v, \ \forall v \in G \}$

Different prediction heads (idea: different task levels require different prediction heads)

Prediction Heads: Node-level

Prediction Heads: Edge-level

Options for $\text{Head}_{\text{edge}}(\mathbf{h}_u^{(L)}, \mathbf{h}_v^{(L)})$:
  1. Concatenation + Linear
    • We have seen this in graph attention
    • $\hat{\mathbf{y}}_{uv} = \text{Linear}(\text{Concat}(\mathbf{h}_u^{(L)}, \mathbf{h}_v^{(L)}))$
    • Here $\text{Linear}(\cdot)$ maps the $2d$-dimensional embedding (since we concatenated two $d$-dimensional embeddings) to a $k$-dimensional output ($k$-way prediction)
  2. Dot product
    • $\hat{y}_{uv} = (\mathbf{h}_u^{(L)})^T \mathbf{h}_v^{(L)}$
    • This approach only applies to $1$-way prediction
    • Applying to $k$-way prediction:
      • Similar to multi-head attention: $\mathbf{W}^{(1)}, \cdots, \mathbf{W}^{(k)}$ are trainable
      $$\begin{align*} \hat{y}_{uv}^{(1)} &= (\mathbf{h}_u^{(L)})^T \mathbf{W}^{(1)} \mathbf{h}_v^{(L)}\\ &\;\vdots\\ \hat{y}_{uv}^{(k)} &= (\mathbf{h}_u^{(L)})^T \mathbf{W}^{(k)} \mathbf{h}_v^{(L)}\\ \hat{\mathbf{y}}_{uv} &= \text{Concat}(\hat{y}_{uv}^{(1)}, \cdots, \hat{y}_{uv}^{(k)}) \in \mathbb{R}^k \end{align*}$$
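The two edge-level heads can be sketched in pure Python (toy dimensions, random hypothetical weights; a real model would learn $\mathbf{W}$ by gradient descent):

```python
import random

random.seed(0)
d, k = 4, 3  # embedding dimension and number of prediction classes (toy sizes)

def randvec(n):
    return [random.uniform(-1, 1) for _ in range(n)]

h_u, h_v = randvec(d), randvec(d)  # final-layer embeddings of nodes u and v

# Option 1: Concatenation + Linear.
# A linear layer maps the 2d-dim concatenation to a k-dim prediction.
W_lin = [randvec(2 * d) for _ in range(k)]  # k x 2d weight matrix (hypothetical)
concat = h_u + h_v
y_concat = [sum(w * x for w, x in zip(row, concat)) for row in W_lin]

# Option 2: Dot product extended to k-way prediction with k trainable
# matrices W^(1), ..., W^(k): y^(m) = h_u^T W^(m) h_v.
Ws = [[randvec(d) for _ in range(d)] for _ in range(k)]

def bilinear(hu, W, hv):
    return sum(hu[i] * W[i][j] * hv[j] for i in range(d) for j in range(d))

y_dot = [bilinear(h_u, W, h_v) for W in Ws]  # k scores, one per "head"
```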

Prediction Heads: Graph-level

Options for $\text{Head}_{\text{graph}}(\{ \mathbf{h}_v^{(L)} \in \mathbb{R}^d, \ \forall v \in G \})$:
  1. Global mean pooling
    $\hat{\mathbf{y}}_G = \text{Mean}(\{ \mathbf{h}_v^{(L)} \in \mathbb{R}^d, \ \forall v \in G \})$
  2. Global max pooling
    $\hat{\mathbf{y}}_G = \text{Max}(\{ \mathbf{h}_v^{(L)} \in \mathbb{R}^d, \ \forall v \in G \})$
  3. Global sum pooling
    $\hat{\mathbf{y}}_G = \text{Sum}(\{ \mathbf{h}_v^{(L)} \in \mathbb{R}^d, \ \forall v \in G \})$
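The three pooling options operate per embedding dimension across all nodes. A minimal sketch with hypothetical embeddings for a 3-node graph:

```python
# Toy final-layer node embeddings h_v^(L) for a 3-node graph (hypothetical values).
H = [[1.0, -2.0], [3.0, 0.0], [-1.0, 5.0]]

def mean_pool(H):
    # Average each embedding dimension over all nodes.
    return [sum(col) / len(H) for col in zip(*H)]

def max_pool(H):
    # Take the per-dimension maximum over all nodes.
    return [max(col) for col in zip(*H)]

def sum_pool(H):
    # Sum each dimension over all nodes.
    return [sum(col) for col in zip(*H)]

print(mean_pool(H))  # [1.0, 1.0]
print(max_pool(H))   # [3.0, 5.0]
print(sum_pool(H))   # [3.0, 3.0]
```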

Where does ground-truth come from?

Supervised vs. Unsupervised

Supervised learning on graphs

  • Labels come from external sources (e.g., predict the drug likeness of a molecular graph)

Unsupervised learning on graphs

  • Signals come from graphs themselves (link prediction: predict if two nodes are connected)

Sometimes the differences are blurry

  • We still have “supervision” in unsupervised learning (train a GNN to predict node clustering coefficient)
  • An alternative name for “unsupervised” is “self-supervised”

Supervised Labels on Graphs

Supervised labels come from the specific use cases.

  • Node labels $\mathbf{y}_v$: in a citation network, which subject area a node belongs to
  • Edge labels $\mathbf{y}_{uv}$: in a transaction network, whether an edge is fraudulent
  • Graph labels $\mathbf{y}_G$: among molecular graphs, the drug likeness of graphs

Advice: reduce your task to node / edge / graph labels, since they are easy to work with.

Unsupervised Signals on Graphs

The problem: sometimes we only have a graph, without any external labels.

The solution is “self-supervised learning”: we can find supervision signals within the graph itself.

For example:

  • Node-level $\mathbf{y}_v$: node statistics, such as clustering coefficient, PageRank, …
  • Edge-level $\mathbf{y}_{uv}$: link prediction: hide the edge between two nodes, then predict if there should be a link
  • Graph-level $\mathbf{y}_G$: graph statistics, for example, predict if two graphs are isomorphic

Advice: these tasks do not require any external labels!
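As an illustration of a self-supervised node-level signal, the clustering coefficient mentioned above can be computed directly from graph structure (a sketch on a hypothetical 4-node adjacency list):

```python
from itertools import combinations

# Toy undirected graph as adjacency sets (hypothetical).
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}

def clustering_coefficient(v):
    # Fraction of pairs of v's neighbors that are themselves connected.
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))

# These values can serve as regression targets y_v without external labels.
labels = {v: clustering_coefficient(v) for v in adj}
```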

How do we compute the final loss?

Settings for GNN Training

We have $N$ data points

Classification: labels $y^{(i)}$ with discrete values

Regression: labels $y^{(i)}$ with continuous values

Classification Loss

$$\text{CE}(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}) = -\sum_{j=1}^K \mathbf{y}_j^{(i)} \log(\hat{\mathbf{y}}_j^{(i)})$$

where:

$$\mathbf{y}^{(i)} \in \mathbb{R}^K = \text{one-hot label encoding}, \quad \hat{\mathbf{y}}^{(i)} \in \mathbb{R}^K = \text{prediction after } \text{Softmax}(\cdot)$$
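A minimal sketch of the cross-entropy computation (stdlib only; the logits are hypothetical):

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(y_onehot, y_hat):
    # CE(y, y_hat) = -sum_j y_j * log(y_hat_j); only the true class contributes.
    return -sum(yj * math.log(pj) for yj, pj in zip(y_onehot, y_hat))

y = [0, 0, 1]                 # one-hot label: class 2
p = softmax([0.5, 1.0, 3.0])  # prediction after Softmax (toy logits)
loss = cross_entropy(y, p)
```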

Regression Loss

$$\text{MSE}(y^{(i)}, \hat{y}^{(i)}) = \sum_{j=1}^K (y_j^{(i)} - \hat{y}_j^{(i)})^2$$

where

$$y^{(i)} \in \mathbb{R}^K = \text{real-valued vector of targets}, \quad \hat{y}^{(i)} \in \mathbb{R}^K = \text{real-valued vector of predictions}$$

$$\text{Loss} = \sum_{i=1}^N \text{MSE}(y^{(i)}, \hat{y}^{(i)})$$
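The total regression loss above is just a double sum, over output dimensions and then over data points (hypothetical toy targets and predictions):

```python
def mse(y, y_hat):
    # Sum of squared errors over the K output dimensions of one data point.
    return sum((a - b) ** 2 for a, b in zip(y, y_hat))

targets = [[1.0, 2.0], [0.0, -1.0]]  # y^(i), hypothetical
preds   = [[0.5, 2.5], [0.0, -2.0]]  # y_hat^(i), hypothetical

# Total loss: sum MSE over all N data points.
loss = sum(mse(y, yh) for y, yh in zip(targets, preds))
print(loss)  # 0.25 + 0.25 + 0.0 + 1.0 = 1.5
```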

How do we measure the success of a GNN?



Evaluation Metrics: Regression

Suppose we make predictions for $N$ data points. Standard metrics:

Root mean squared error (RMSE):

$$\sqrt{\sum_{i=1}^N \frac{(y^{(i)} - \hat{y}^{(i)})^2}{N}}$$

Mean absolute error (MAE):

$$\frac{\sum_{i=1}^N |y^{(i)} - \hat{y}^{(i)}|}{N}$$
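Both regression metrics are straightforward to compute (toy hypothetical values):

```python
import math

def rmse(ys, yhats):
    # Root mean squared error over N predictions.
    n = len(ys)
    return math.sqrt(sum((y - yh) ** 2 for y, yh in zip(ys, yhats)) / n)

def mae(ys, yhats):
    # Mean absolute error over N predictions.
    n = len(ys)
    return sum(abs(y - yh) for y, yh in zip(ys, yhats)) / n

ys, yhats = [1.0, 2.0, 3.0], [1.0, 2.5, 2.0]
print(mae(ys, yhats))  # (0 + 0.5 + 1) / 3 = 0.5
```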

Evaluation Metrics: Classification

We simply report the accuracy

$$\frac{1}{N} \sum_{i=1}^N \mathbf{1}[\arg\max(\hat{y}^{(i)}) = y^{(i)}]$$

Accuracy

$$\frac{TP+TN}{TP+TN+FP+FN} = \frac{TP+TN}{|\text{Dataset}|}$$

Precision (P)

$$P = \frac{TP}{TP+FP}$$

Recall (R)

$$R = \frac{TP}{TP+FN}$$

$F_1$-Score

$$F_1 = \frac{2 P \times R}{P + R}$$
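The confusion-matrix metrics above can be sketched in a few lines (toy hypothetical labels and predictions):

```python
def binary_metrics(y_true, y_pred):
    # Count the four confusion-matrix cells for binary labels in {0, 1}.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

acc, p, r, f1 = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```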

ROC Curve: Captures the tradeoff in TPR and FPR as the classification threshold is varied for a binary classifier.

$$\text{TPR} = \text{Recall} = \frac{TP}{TP+FN}$$
$$\text{FPR} = \frac{FP}{FP+TN}$$

ROC AUC: Area under the ROC Curve.

Intuition: The probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.
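This ranking intuition gives a direct way to compute ROC AUC without tracing the curve: compare every positive-negative pair (a sketch of the pairwise formulation, with hypothetical scores; ties count as half a win):

```python
def roc_auc(y_true, scores):
    # Probability that a random positive is scored above a random negative.
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if ps > ns else 0.5 if ps == ns else 0.0
               for ps in pos for ns in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and classifier scores (hypothetical).
auc = roc_auc([1, 0, 1, 0], [0.9, 0.4, 0.6, 0.7])
print(auc)  # 3 of 4 positive-negative pairs ranked correctly: 0.75
```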

How do we split our dataset into train / validation / test set?

Dataset Split: Fixed / Random Split

Why is splitting graphs special? Because every node can potentially be correlated with every other node.

Suppose we want to split an image dataset: each image is an independent data point, so a random split works.

Splitting a graph dataset is different: nodes are connected by edges, so data points are not independent!

Solution 1: Transductive setting

The input graph can be observed in all the dataset splits (training, validation and test sets)

Solution 2: Inductive setting

We break the edges between splits to get multiple graphs
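A minimal sketch of the inductive split (toy path graph, hypothetical split assignment): nodes are assigned to splits, and any edge crossing a split boundary is broken, leaving independent subgraphs.

```python
# Toy graph as an edge list and a hypothetical node-to-split assignment.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
split = {0: "train", 1: "train", 2: "train",
         3: "val", 4: "test", 5: "test"}

def subgraph(name):
    # Keep only edges whose BOTH endpoints fall in the given split;
    # edges crossing split boundaries are broken.
    nodes = {v for v, s in split.items() if s == name}
    return [(u, v) for u, v in edges if u in nodes and v in nodes]

train_edges = subgraph("train")  # [(0, 1), (1, 2)]; edge (2, 3) is broken
val_edges = subgraph("val")      # [] -- node 3 keeps no within-split edges
test_edges = subgraph("test")    # [(4, 5)]
```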

Transductive / Inductive Settings

Node Classification

Graph Classification

Link Prediction

How to set up link prediction?