CS224W: Machine Learning with Graphs - Graph Transformers

Transformers

Transformers Ingest Tokens

Transformer Blueprint

Self-attention

Multi-head self-attention

Comparing Transformers and GNNs

Are self-attention and message passing really different?

Self-attention vs. message passing

Interpreting the Self-Attention Update

Input stored row-wise: $X = \begin{bmatrix} -\ x_1\ - \\ \vdots \\ -\ x_n\ - \end{bmatrix}$, i.e., row $i$ of $X$ is the token embedding $x_i$.

$$\begin{align*}
\mathrm{Att}(X) &= \mathrm{softmax}(QK^T)\,V \\
&= \mathrm{softmax}\big((XW^Q)(XW^K)^T\big)\,XW^V
\end{align*}$$

where the softmax is applied row-wise (over $j$). For a single token (token 1 of a 5-token input) the update reads

$$z_1 = \sum_{j=1}^{5} \mathrm{softmax}_j(q_1^T k_j)\, v_j$$
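As a concrete illustration of this update, here is a minimal NumPy sketch of single-head self-attention (no scaling or masking, matching the formula above); the dimensions and random inputs are placeholders, not values from the lecture.

```python
import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention; rows of X are token embeddings x_i."""
    Q = X @ W_Q            # queries q_i (one per row)
    K = X @ W_K            # keys    k_j
    V = X @ W_V            # values  v_j
    scores = Q @ K.T       # entry (i, j) = q_i^T k_j
    A = softmax(scores)    # attention weights, each row sums to 1
    return A @ V           # z_i = sum_j softmax_j(q_i^T k_j) v_j

# toy example: 5 tokens, embedding dim 8, head dim 4 (arbitrary choices)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)   # shape (5, 4)
```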

Self-Attention as Message Passing

$$z_1 = \sum_{j=1}^{5} \mathrm{softmax}_j(q_1^T k_j)\, v_j$$
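Each term $\mathrm{softmax}_j(q_1^T k_j)\,v_j$ can be read as a weighted message sent from token $j$ to token 1, aggregated by summation. The sketch below (my own NumPy illustration, not code from the lecture) makes this reading concrete by letting an adjacency matrix decide which tokens a node may attend to: with an all-ones adjacency it reduces to ordinary self-attention, i.e., message passing on the fully connected graph, while a sparse adjacency gives attention restricted to graph neighbors.

```python
import numpy as np

def masked_self_attention(X, W_Q, W_K, W_V, adj):
    """Self-attention where node i only attends to nodes j with adj[i, j] = 1.
    adj is assumed to contain self-loops so every row has at least one neighbor;
    adj = all-ones recovers plain self-attention (fully connected graph)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T                               # q_i^T k_j for all pairs
    scores = np.where(adj > 0, scores, -np.inf)    # drop non-neighbors
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # softmax over neighbors only
    return A @ V                                   # aggregate weighted messages
```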

A New Design Landscape for Graph Transformers

Processing Graphs with Transformers

Components of a Transformer

A final key piece: token ordering

Positional Encodings

Components of a Transformer

Processing Graphs with Transformers

Nodes as Tokens

How to Add Back Adjacency Information?

Q2: How to design a good positional encoding?

Option 1: Relative Distances
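One common way to realize relative distances (used, e.g., in Graphormer; the exact variant in the lecture may differ) is to compute pairwise shortest-path distances between nodes and use them to index a learned bias that is added to the attention scores before the softmax. A small NetworkX/NumPy sketch of the distance computation:

```python
import numpy as np
import networkx as nx

def shortest_path_distances(G, max_dist=5):
    """Pairwise shortest-path distances, clipped at max_dist.
    The result can serve as a relative positional signal, e.g., to index a
    learned bias added to the attention scores. Unreachable pairs get max_dist."""
    n = G.number_of_nodes()
    dist = np.full((n, n), max_dist)
    for i, lengths in nx.all_pairs_shortest_path_length(G):
        for j, d in lengths.items():
            dist[i, j] = min(d, max_dist)
    return dist

G = nx.cycle_graph(6)
print(shortest_path_distances(G))   # e.g., dist[0, 3] == 3
```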

Option 2: Laplacian Eigenvector Positional Encodings

Positional Encoding Steps:

  1. Compute the first $k$ non-trivial eigenvectors of the graph Laplacian (smallest eigenvalues)
  2. Stack them into an $n \times k$ matrix
  3. The $i$-th row is the positional encoding for node $i$ (a code sketch follows this list)
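A minimal NumPy sketch of these three steps, using the unnormalized Laplacian $L = D - A$ and skipping the trivial constant eigenvector; whether to use the normalized Laplacian and how to handle eigenvector sign ambiguity are implementation choices the steps above leave open.

```python
import numpy as np

def laplacian_pe(A, k):
    """Laplacian eigenvector positional encodings.
    A: (n, n) adjacency matrix. Returns an (n, k) matrix whose i-th row is
    the positional encoding of node i."""
    D = np.diag(A.sum(axis=1))
    L = D - A                                # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    # step 1 + 2: take k eigenvectors with smallest non-trivial eigenvalues,
    # stacked column-wise into an (n, k) matrix
    return eigvecs[:, 1:k + 1]               # step 3: row i = PE for node i

# toy 4-node path graph
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
pe = laplacian_pe(A, k=2)    # shape (4, 2)
```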

Laplacian Eigenvector positional encodings can also be used with message-passing GNNs

Edge Features in Self-Attention

$$\mathrm{Att}(X) = \mathrm{softmax}(QK^T)\,V$$
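One standard way to inject edge features into this update (the sketch below follows the attention-bias approach used by models such as Graphormer; the lecture may present a different variant) is to project each edge feature to a scalar $b_{ij}$ and add it to the score $q_i^T k_j$ before the softmax.

```python
import numpy as np

def attention_with_edge_bias(X, E, adj, W_Q, W_K, W_V, w_E):
    """Self-attention whose scores are biased by edge features.
    X: (n, d) node features; E: (n, n, d_e) edge features; adj: (n, n) 0/1;
    w_E: (d_e,) projection that turns an edge feature into a scalar bias."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    bias = np.where(adj > 0, E @ w_E, 0.0)         # b_ij only where an edge exists
    scores = Q @ K.T + bias                        # q_i^T k_j + b_ij
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # softmax over all tokens
    return A @ V
```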

Summary: Graph Transformer Design Space