CS224W Machine Learning with Graphs: Node Embeddings

Traditional ML for Graphs

Given an input graph, extract node-, link-, and graph-level features, then learn a model (SVM, neural network, etc.) that maps features to labels.

Graph Representation Learning

Graph Representation Learning alleviates the need to do feature engineering every single time.

Goal: efficient, task-independent feature learning for machine learning with graphs (no manual feature engineering needed each time).

Why Embedding?

Task: Map nodes into an embedding space

  • Similarity of embeddings between nodes indicates their similarity in the network.
    • For example, both nodes are close to each other (connected by an edge).
  • Encode network information.
  • Potentially useful for many downstream prediction tasks.

Encoder and Decoder

Setup

$$V = \{1, 2, 3, 4\}, \qquad A = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}$$
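A quick way to write this toy graph down in code (a minimal sketch using NumPy; the variable `A` mirrors the adjacency matrix above):

```python
import numpy as np

# Nodes V = {1, 2, 3, 4}; row/column i corresponds to node i + 1.
A = np.array([
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
])
assert (A == A.T).all()  # undirected graph, so A is symmetric
```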

Embedding Nodes

Goal is to encode nodes so that similarity in the embedding space (e.g., dot product) approximates similarity in the graph.

$$\text{Goal:}\quad \mathrm{similarity}(u, v) \approx \mathbf{z}_v^\top \mathbf{z}_u$$

Learning Node Embeddings

  1. The encoder maps nodes to embeddings.
  2. Define a node similarity function (i.e., a measure of similarity in the original network).
  3. The decoder $\mathrm{DEC}$ maps from embeddings to a similarity score.
  4. Optimize the parameters of the encoder so that $\mathrm{DEC}(\mathbf{z}_v^\top \mathbf{z}_u) \approx \mathrm{similarity}(u, v)$.

Two key components: Encoder + Decoder

“Shallow” Encoding

Simplest encoding approach: Encoder is just an embedding-lookup

$$\mathrm{ENC}(v) = \mathbf{z}_v = \mathbf{Z} \cdot v$$

  • $\mathbf{Z} \in \mathbb{R}^{d \times |\mathcal{V}|}$: matrix whose columns are the node embeddings (this is what we learn/optimize).
  • $v \in \mathbb{I}^{|\mathcal{V}|}$: indicator vector, all zeroes except a one in the position indicating node $v$.

Each node is assigned a unique embedding vector
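A minimal sketch of shallow encoding as an embedding lookup (NumPy; here $\mathbf{Z}$ is randomly initialized as a placeholder, whereas in practice it is what we optimize):

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_nodes = 8, 4                      # embedding dimension, |V| (illustrative)
Z = rng.normal(size=(d, num_nodes))      # embedding matrix: column j is node j's embedding

def enc(node_index):
    """ENC(v) = Z . v, where v is the one-hot indicator vector of the node."""
    v = np.zeros(num_nodes)
    v[node_index] = 1.0
    return Z @ v                         # equivalent to the lookup Z[:, node_index]

z_u, z_v = enc(0), enc(1)
score = z_u @ z_v                        # dot-product decoder: similarity score for (u, v)
```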

Framework Summary

Random Walk Approaches for Node Embeddings

Notation

Non-linear functions (e.g., the softmax and sigmoid functions) are used to produce predicted probabilities.

Random Walk

Random-Walk Embeddings

$$\mathbf{z}_u^\top \mathbf{z}_v \approx \text{probability that } u \text{ and } v \text{ co-occur on a random walk over the graph}$$


Why Random Walks?

  1. Expressivity: a flexible stochastic definition of node similarity that incorporates both local and higher-order neighbourhood information.
  2. Efficiency: we do not need to consider all node pairs when training; we only need to consider pairs that co-occur on random walks.

Unsupervised Feature Learning

Feature Learning as Optimization

  1. Given $G = (V, E)$.
  2. Our goal is to learn a mapping $f: u \rightarrow \mathbb{R}^d$ with $f(u) = \mathbf{z}_u$.
  3. Log-likelihood objective:

$$\arg\max_{\mathbf{z}} \sum_{u \in V} \log P\bigl(N_R(u) \mid \mathbf{z}_u\bigr)$$

Random Walk Optimization

  1. Run short fixed-length random walks starting from each node $u$ in the graph, using some random walk strategy $R$.
  2. For each node $u$, collect $N_R(u)$, the multiset of nodes visited on random walks starting from $u$.
  3. Optimize embeddings according to: given node $u$, predict its neighbours $N_R(u)$.

See the log-likelihood objective above. $N_R(u)$ may contain repeated nodes, since a random walk can visit the same node multiple times (a minimal sketch of collecting $N_R(u)$ follows below).
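A minimal sketch of steps 1-2 on the toy graph from the Setup section: run $r$ short fixed-length uniform random walks from each node and collect the multiset $N_R(u)$ (NumPy; the walk count and length are illustrative):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 1],
              [1, 0, 0, 1],
              [0, 0, 0, 1],
              [1, 1, 1, 0]])
num_nodes = A.shape[0]

def random_walk(start, length):
    """Uniform random walk: at each step, move to a uniformly chosen neighbour."""
    walk = [start]
    for _ in range(length):
        neighbours = np.flatnonzero(A[walk[-1]])
        walk.append(rng.choice(neighbours))
    return walk

# N_R(u): multiset (Counter) of nodes visited on r short walks starting from u
r, walk_length = 10, 4                     # illustrative values
N_R = {u: Counter() for u in range(num_nodes)}
for u in range(num_nodes):
    for _ in range(r):
        N_R[u].update(random_walk(u, walk_length)[1:])   # exclude the start node itself
```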

$$\arg\max_{\mathbf{z}} \sum_{u \in V} \log P\bigl(N_R(u) \mid \mathbf{z}_u\bigr) \;\Longleftrightarrow\; \arg\min_{\mathbf{z}} \mathcal{L}, \qquad \mathcal{L} = \sum_{u \in V} \sum_{v \in N_R(u)} -\log P(v \mid \mathbf{z}_u)$$

Intuition: maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood.

$$P(v \mid \mathbf{z}_u) = \frac{\exp(\mathbf{z}_u^\top \mathbf{z}_v)}{\sum_{n \in V} \exp(\mathbf{z}_u^\top \mathbf{z}_n)}$$
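A minimal sketch of this softmax parameterization (NumPy; the embeddings are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_nodes = 8, 4
Z = rng.normal(size=(d, num_nodes))      # column u is z_u (placeholder values)

def p_v_given_u(v, u):
    """Softmax over dot products: P(v | z_u). Note that the denominator
    sums over every node in the graph."""
    scores = Z[:, u] @ Z                 # z_u^T z_n for all nodes n
    scores = scores - scores.max()       # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[v] / exp_scores.sum()
```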

Computing this naively is too expensive: the normalization term sums over all nodes. Can we solve this with a different approach?

Solution: Negative Sampling

$$\log\!\left(\frac{\exp(\mathbf{z}_u^\top \mathbf{z}_v)}{\sum_{n \in V} \exp(\mathbf{z}_u^\top \mathbf{z}_n)}\right) \approx \log\bigl(\sigma(\mathbf{z}_u^\top \mathbf{z}_v)\bigr) + \sum_{i=1}^{k} \log\bigl(\sigma(-\mathbf{z}_u^\top \mathbf{z}_{n_i})\bigr), \qquad n_i \sim P_V$$
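A minimal sketch of evaluating this negative-sampling approximation for a single $(u, v)$ pair (NumPy; the negatives are drawn uniformly at random here for simplicity, whereas the distribution $P_V$ is usually biased towards high-degree nodes):

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_nodes, k = 8, 100, 5              # illustrative sizes; k = number of negative samples
Z = rng.normal(size=(d, num_nodes))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_objective(u, v):
    """log sigma(z_u^T z_v) + sum_i log sigma(-z_u^T z_{n_i}), with n_i ~ P_V.
    P_V is uniform here for simplicity."""
    negatives = rng.choice(num_nodes, size=k, replace=False)
    positive_term = np.log(sigmoid(Z[:, u] @ Z[:, v]))
    negative_term = np.log(sigmoid(-(Z[:, u] @ Z[:, negatives]))).sum()
    return positive_term + negative_term
```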

Stochastic Gradient Descent

$$\mathcal{L} = \sum_{u \in V} \sum_{v \in N_R(u)} -\log P(v \mid \mathbf{z}_u)$$

How do we optimize (minimize) it?

Stochastic Gradient Descent: Instead of evaluating gradients over all examples, evaluate it for each individual training example.
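A minimal sketch of one SGD step on the per-example negative-sampling loss $-\log\sigma(\mathbf{z}_u^\top \mathbf{z}_v) - \sum_i \log\sigma(-\mathbf{z}_u^\top \mathbf{z}_{n_i})$ (NumPy; the learning rate, dimensions, and training pairs are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_nodes, k, lr = 8, 100, 5, 0.025           # illustrative sizes and learning rate
Z = rng.normal(scale=0.1, size=(d, num_nodes))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(u, v):
    """One stochastic step on -log sigma(z_u^T z_v) - sum_i log sigma(-z_u^T z_{n_i})."""
    negatives = rng.choice(num_nodes, size=k, replace=False)
    z_u, z_v, z_neg = Z[:, u].copy(), Z[:, v].copy(), Z[:, negatives].copy()

    g_pos = sigmoid(z_u @ z_v) - 1.0             # d loss / d (z_u^T z_v)
    g_neg = sigmoid(z_u @ z_neg)                 # d loss / d (z_u^T z_{n_i}), shape (k,)

    Z[:, u] -= lr * (g_pos * z_v + z_neg @ g_neg)
    Z[:, v] -= lr * (g_pos * z_u)
    Z[:, negatives] -= lr * (z_u[:, None] * g_neg[None, :])

# the training pairs (u, v) would come from the random-walk multisets N_R(u)
for u, v in [(0, 1), (1, 2), (2, 3)]:
    sgd_step(u, v)
```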

Node2vec Algorithm (Linear-time complexity + All 3 steps are individually parallelizable)

  1. Compute edge transition probabilities:
    1. For each edge $(s_1, w)$, compute the walk transition probabilities (based on $p, q$) of the edges $(w, \cdot)$ (see the sketch after this list).
  2. Simulate $r$ random walks of length $l$ starting from each node $u$.
  3. Optimize the node2vec objective using Stochastic Gradient Descent.
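A minimal sketch of step 1, the second-order (biased) transition probabilities: having just traversed edge $(s_1, w)$, the unnormalized probability of stepping from $w$ to $x$ is $1/p$ if $x = s_1$ (return), $1$ if $x$ is also a neighbour of $s_1$, and $1/q$ otherwise (NumPy; the values of $p$ and $q$ are illustrative):

```python
import numpy as np

A = np.array([[0, 1, 0, 1],
              [1, 0, 0, 1],
              [0, 0, 0, 1],
              [1, 1, 1, 0]])
p, q = 1.0, 0.5                      # return and in-out parameters (illustrative values)

def transition_probs(s1, w):
    """Normalized node2vec transition probabilities for the edges (w, x),
    given that the previous node of the walk was s1."""
    weights = {}
    for x in np.flatnonzero(A[w]):
        if x == s1:
            weights[x] = 1.0 / p     # distance 0: return to the previous node
        elif A[s1, x]:
            weights[x] = 1.0         # distance 1: neighbour of the previous node
        else:
            weights[x] = 1.0 / q     # distance 2: move further away
    total = sum(weights.values())
    return {x: w_x / total for x, w_x in weights.items()}
```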

Embedding Entire Graphs

Goal: embed a subgraph or an entire graph $G$. Graph embedding: $\mathbf{z}_G$.
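The simplest option is to sum (or average) the embeddings of the nodes in the (sub)graph; a minimal sketch (NumPy, with placeholder node embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_nodes = 8, 4
Z = rng.normal(size=(d, num_nodes))   # learned node embeddings (placeholders here)

z_G_sum = Z.sum(axis=1)               # graph embedding as the sum of node embeddings
z_G_mean = Z.mean(axis=1)             # or as their average
```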

Hierarchical Embeddings

DiffPool: We can also hierarchically cluster nodes in graphs, and sum/average the node embeddings according to these clusters.
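A minimal sketch of one level of such cluster pooling under an assumed hard cluster assignment (NumPy; the assignment vector is hypothetical, and DiffPool itself learns soft assignments rather than fixing them):

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_nodes = 8, 6
Z = rng.normal(size=(d, num_nodes))             # node embeddings (placeholders)
cluster_of = np.array([0, 0, 1, 1, 1, 2])       # hypothetical hard assignment into 3 clusters

# Average the node embeddings within each cluster to get cluster embeddings;
# these can be pooled again at the next level of the hierarchy.
num_clusters = cluster_of.max() + 1
cluster_emb = np.stack(
    [Z[:, cluster_of == c].mean(axis=1) for c in range(num_clusters)], axis=1
)
```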

Matrix Factorization and Node Embeddings

Connection to Matrix Factorization

For other material on matrix factorization, see the Applied Linear Algebra for Data Science lecture notes.
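Under the simplest similarity (two nodes are similar if they are connected by an edge) and a dot-product decoder, learning shallow embeddings amounts to approximately factorizing the adjacency matrix, $\mathbf{Z}^\top \mathbf{Z} \approx A$. A minimal sketch that fits such a factorization by gradient descent on $\|A - \mathbf{Z}^\top \mathbf{Z}\|_F^2$ (NumPy; $d$ and the learning rate are illustrative):

```python
import numpy as np

A = np.array([[0, 1, 0, 1],
              [1, 0, 0, 1],
              [0, 0, 0, 1],
              [1, 1, 1, 0]], dtype=float)

rng = np.random.default_rng(0)
d, lr = 2, 0.01                          # embedding dimension and step size (illustrative)
Z = rng.normal(scale=0.1, size=(d, A.shape[0]))

# Minimize ||A - Z^T Z||_F^2 by gradient descent; exact factorization is generally impossible.
for _ in range(2000):
    R = Z.T @ Z - A                      # residual
    Z -= lr * 4 * (Z @ R)                # gradient: 4 Z (Z^T Z - A), since R is symmetric
```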

Random Walk-based Similarity

How to Use Embeddings?