CS224W: Machine Learning with Graphs - Deep Generative Models for Graphs
Graph Generation
- But how are these graphs generated?
- We want to generate realistic graphs using graph generative models

- Applications:
- Drug discovery, material design
- Social network modeling
- History of Graph Generation
- Step 1: Properties of real-world graphs
- A successful graph generative model should fit these properties.
- Step 2: Traditional graph generative models
- Each comes with different assumptions about the graph formation process
- Step 3: Deep graph generative models
- Learn the graph formation process from the data.
- Step 1: Properties of real-world graphs
- Deep Graph Encoders

- Deep Graph Decoders (this lecture)

Machine Learning for Graph Generation
- Task 1: Realistic graph generation
- Generate graphs that are similar to a given set of graphs
- Task 2: Goal-directed graph generation
- Generate graphs that optimize given objectives / constraints
- E.g., drug molecule generation / optimization
Graph Generative Models
- Given: Graphs sampled from $p_{data}(G)$
- Goal:
- Learn the distribution $p_{model}(G)$
- Sample from $p_{model}(G)$

Generative Models Basics
Setup:
- Assume we want to learn a generative model from a set of data points (i.e., graphs) $\{x_i\}$
- $p_{data}(x)$ is the data distribution, which is never known to us, but we have sampled $x_i \sim p_{data}(x)$
- $p_{model}(x; \theta)$ is the model, parameterized by $\theta$, that we use to approximate $p_{data}(x)$.
- Goal:
- Make $p_{model}(x; \theta)$ close to $p_{data}(x)$ (Density estimation)
- Key principle: Maximum likelihood
- Fundamental approach to modeling distributions
- Find parameters $\theta^*$ such that, for the observed data points $x_i \sim p_{data}$, the likelihood $\prod_i p_{model}(x_i; \theta)$ has the highest value among all possible choices of $\theta$:
- $\theta^* = \arg\max_{\theta} \, \mathbb{E}_{x \sim p_{data}} \log p_{model}(x \mid \theta)$
- That is, find the model that is most likely to have generated the observed data $x$.
- Make sure we can sample from $p_{model}(x; \theta)$ (Sampling)
- We need to generate examples (graphs) from $p_{model}(x; \theta)$
- Goal: Sample from a complex distribution
- The most common approach:
- Sample from a simple noise distribution: $z_i \sim N(0, 1)$
- Transform the noise $z_i$ via $f(\cdot)$: $x_i = f(z_i; \theta)$
- How to design $f(\cdot)$?
- Use deep neural networks, and train them using the data we have! (See the sketch below.)
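As an illustration of this recipe, here is a minimal sketch (not from the lecture) where $f(\cdot)$ is a small PyTorch MLP with arbitrarily chosen dimensions:

```python
# Minimal sketch: sample z from a simple noise distribution and transform it
# with a small neural network f(z; theta). Dimensions are illustrative only.
import torch
import torch.nn as nn

f = nn.Sequential(            # f(.; theta): maps noise to "data" space
    nn.Linear(16, 64), nn.ReLU(),
    nn.Linear(64, 32),        # 32-dim output space, chosen arbitrarily here
)

z = torch.randn(8, 16)        # z_i ~ N(0, I): simple noise distribution
x = f(z)                      # x_i = f(z_i; theta): samples from p_model
print(x.shape)                # torch.Size([8, 32])
```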
Deep Generative Models
Auto-regressive models:
- $p_{model}(x; \theta)$ is used for both density estimation and sampling (remember our two goals)
- Other models such as Variational Autoencoders (VAEs) and Generative Adversarial Nets (GANs) use two or more models, each playing one of the roles
- Idea: Chain rule. The joint distribution is a product of conditional distributions: $p_{model}(x; \theta) = \prod_{t=1}^{n} p_{model}(x_t \mid x_1, \ldots, x_{t-1}; \theta)$
- E.g., $x$ is a vector and $x_t$ is the $t$-th dimension; $x$ is a sentence and $x_t$ is the $t$-th word.
- In our case: $x_t$ will be the $t$-th action (add node, add edge)
GraphRNN: Generating Realistic Graphs
Generating graphs via sequentially adding nodes and edges

Model Graphs as Sequences
A graph $G$ with node ordering $\pi$ can be uniquely mapped into a sequence $S^{\pi}$ of node and edge additions

The sequence has two levels ($S^{\pi}$ is a sequence of sequences):
- Node-level: add nodes, one at a time (At each step, a new node is added)

- Edge-level: add edges between existing nodes
- Each Node-level step is an edge-level sequence
- Edge-level: At each step, add a new edge

Summary: A graph + a node ordering = A sequence of sequences
Node ordering is randomly selected (we will come back to this)

We have transformed the graph generation problem into a sequence generation problem
Need to model two processes:
- Generate a state for a new node (Node-level sequence)
- Generate edges for the new node based on its state (Edge-level sequence)
Approach: Use Recurrent Neural Networks (RNNs) to model these processes!
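A minimal sketch of this mapping, assuming a toy undirected graph given as an edge set and a fixed node ordering (names and sizes are illustrative only):

```python
# Minimal sketch: turn a graph plus a node ordering into a GraphRNN-style
# "sequence of sequences". The edge set and ordering are toy assumptions.
edges = {(0, 1), (0, 2), (1, 2), (2, 3)}          # undirected toy graph
ordering = [0, 1, 2, 3]                            # a chosen node ordering pi

def to_sequence(edges, ordering):
    """For each newly added node, emit a binary vector over previous nodes:
    1 if an edge to that earlier node exists, else 0 (the edge-level sequence)."""
    seq = []
    for i, v in enumerate(ordering):
        prev = ordering[:i]
        seq.append([1 if (min(u, v), max(u, v)) in edges else 0 for u in prev])
    return seq

print(to_sequence(edges, ordering))
# [[], [1], [1, 1], [0, 0, 1]] -> node-level steps, each an edge-level sequence
```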
Background: Recurrent NNs
- RNNs are designed for sequential data
- RNN sequentially takes input sequence to update its hidden states
- The hidden states summarize all the information input to RNN
- The update is conducted via RNN cells

- $s_t$: State of the RNN after step $t$
- $x_t$: Input to the RNN at step $t$
- $y_t$: Output of the RNN at step $t$
- RNN cell: $s_t = \sigma(W \cdot x_t + U \cdot s_{t-1})$, $y_t = V \cdot s_t$; $W, U, V$: trainable parameters (see the sketch after this list)

- More expressive cells: GRU, LSTM, etc.
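A minimal sketch of the vanilla RNN cell update above, with arbitrarily chosen dimensions and untrained random weights:

```python
# Minimal sketch of the vanilla RNN cell: s_t = sigma(W x_t + U s_{t-1}), y_t = V s_t.
import torch

d_in, d_h, d_out = 4, 8, 2
W = torch.randn(d_h, d_in) * 0.1   # trainable in practice; random here
U = torch.randn(d_h, d_h) * 0.1
V = torch.randn(d_out, d_h) * 0.1

s = torch.zeros(d_h)               # s_0: initial hidden state
for t in range(5):
    x_t = torch.randn(d_in)        # input at step t
    s = torch.sigmoid(W @ x_t + U @ s)   # hidden-state update
    y_t = V @ s                    # output at step t
print(y_t)
```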
GraphRNN: Two levels of RNN
- GraphRNN has a node-level RNN and an edge-level RNN
- Relationship between the two RNNs:
- Node-level RNN generates the initial state for the edge-level RNN
- Edge-level RNN sequentially predicts whether the new node will connect to each of the previous nodes.


RNN for Sequence Generation
- How to use RNN to generate sequences?
- Let $x_{t+1} = y_t$ (use the previous output as input)
- How to initialize the input sequence?
- Use a start-of-sequence token (SOS) as the initial input
- SOS is usually a vector of all zeros or all ones
- When to stop generation?
- Use an end-of-sequence token (EOS) as an extra RNN output
- If the RNN outputs EOS = 0, it continues generation
- If the RNN outputs EOS = 1, it stops generation.

Towards Edge-Level RNN
Consider the Edge-level RNN for now.
- Our goal: Model $\prod_t p_{model}(x_t \mid x_1, \ldots, x_{t-1}; \theta)$
- Let $y_t = p_{model}(x_t \mid x_1, \ldots, x_{t-1}; \theta)$
- Then we need to sample $x_{t+1}$ from $y_t$: $x_{t+1} \sim y_t$
- Each step of the RNN outputs the probability of a single edge
- We then sample from that distribution and feed the sample to the next step (see the sketch below)

Suppose we have already trained the edge-level RNN
- $y_t$ is a scalar, following a Bernoulli distribution
- $y_t = p$ means value 1 has probability $p$ and value 0 has probability $1 - p$
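A minimal sketch of this sampling loop, assuming a GRU-based edge-level RNN (the exact GraphRNN architecture differs); each step outputs an edge probability and the Bernoulli sample is fed back as the next input:

```python
# Minimal sketch (assumed GRU architecture): each step outputs y_t = p(edge),
# then samples x_{t+1} ~ Bernoulli(y_t) and feeds it back in as the next input.
import torch
import torch.nn as nn

hidden_dim = 16
cell = nn.GRUCell(input_size=1, hidden_size=hidden_dim)
out = nn.Linear(hidden_dim, 1)

h = torch.zeros(1, hidden_dim)       # would be initialized by the node-level RNN
x = torch.ones(1, 1)                 # SOS token (a vector of ones here)
edges = []
for step in range(5):                # one prediction per previous node
    h = cell(x, h)
    y = torch.sigmoid(out(h))        # y_t: probability of this single edge
    x = torch.bernoulli(y)           # sample x_{t+1} ~ Bernoulli(y_t), feed back
    edges.append(int(x.item()))
print(edges)                         # e.g. [1, 0, 1, 1, 0]
```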

Edge-level RNN at Training Time
Training the model:
- We observe a sequence $y^*$ of edges (0/1 values)
- Principle: Teacher forcing - replace the inputs and outputs by the real observed sequence

- Loss $L$: Binary cross entropy
- Minimize: $L = -\sum_i \left[ y_i^* \log(y_i) + (1 - y_i^*) \log(1 - y_i) \right]$

- If $y_i^* = 1$, we minimize $-\log y_i$, making $y_i$ higher
- If $y_i^* = 0$, we minimize $-\log(1 - y_i)$, making $y_i$ lower.
- This way, $y_i$ fits the observed data samples $y_i^*$
- Reminder: $y_i$ is computed by the RNN, so this loss adjusts the RNN parameters accordingly via backpropagation (see the training sketch below)!
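A minimal training-step sketch under the same assumed GRU architecture, with a toy observed edge sequence $y^*$; teacher forcing feeds the observed edges as inputs, and the loss is binary cross entropy:

```python
# Minimal sketch (assumed GRU architecture, toy observed sequence y_star).
# Teacher forcing: the *observed* edges, not sampled ones, are fed as inputs.
import torch
import torch.nn as nn

hidden_dim = 16
cell = nn.GRUCell(1, hidden_dim)
out = nn.Linear(hidden_dim, 1)
opt = torch.optim.Adam(list(cell.parameters()) + list(out.parameters()), lr=1e-3)

y_star = torch.tensor([1., 0., 1., 1.])           # observed edge sequence (toy data)
inputs = torch.cat([torch.ones(1), y_star[:-1]])  # SOS, then the real sequence

h = torch.zeros(1, hidden_dim)
loss = 0.0
for x_t, y_t_star in zip(inputs, y_star):
    h = cell(x_t.view(1, 1), h)
    y_t = torch.sigmoid(out(h)).squeeze()
    # L = -[ y* log y + (1 - y*) log(1 - y) ], summed over steps
    loss = loss + nn.functional.binary_cross_entropy(y_t, y_t_star)

opt.zero_grad()
loss.backward()                                   # backpropagation adjusts RNN parameters
opt.step()
```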
Putting Things Together
Our Plan:
- Add a new node: We run the Node RNN for a step and use its output to initialize the Edge RNN.
- Add new edges for the new node: We run the Edge RNN to predict whether the new node will connect to each of the previous nodes.
- Add another new node: We use the last hidden state of the Edge RNN to run the Node RNN for another step.
- Stop graph generation: If Edge RNN outputs EOS at step 1, we know no edges are connected to the new node. We stop the graph generation.
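A minimal sketch of this plan, with assumed GRU-based node-level and edge-level RNNs (illustrative, not the exact GraphRNN implementation):

```python
# Minimal sketch of the full generation loop: the node-level RNN initializes
# the edge-level RNN, which predicts connections to each previous node.
import torch
import torch.nn as nn

H = 16
node_cell = nn.GRUCell(H, H)          # node-level RNN
edge_cell = nn.GRUCell(1, H)          # edge-level RNN
edge_out = nn.Linear(H, 1)

h_node = torch.zeros(1, H)
node_input = torch.ones(1, H)         # SOS input for the node-level RNN
adj_rows = []

for i in range(10):                   # cap on the number of nodes
    h_node = node_cell(node_input, h_node)
    h_edge = h_node                   # node-level output initializes edge-level state
    x = torch.ones(1, 1)              # SOS for the edge-level RNN
    row = []
    for _ in range(i):                # one prediction per previous node
        h_edge = edge_cell(x, h_edge)
        p = torch.sigmoid(edge_out(h_edge))
        x = torch.bernoulli(p)        # sample whether this edge exists
        row.append(int(x.item()))
    if i > 0 and not any(row):        # "EOS": new node connects to nothing -> stop
        break
    adj_rows.append(row)
    node_input = h_edge               # last edge-RNN state drives the next node step

print(adj_rows)                       # lower-triangular adjacency rows of the graph
```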
Put Things Together: Training








Put Things Together: Test

Scaling Up and Evaluating Graph Generation
Issue: Tractability
- Any node can connect to any prior node.
- Too many steps for edge generation
- Need to generate full adjacency matrix
- Complex, long-range edge dependencies

Solution: Tractability via BFS
- Breadth-First Search node ordering

- BFS node ordering:
- Since Node 4 doesn’t connect to Node 1
- We know all Node 1’s neighbors have already been traversed
- Therefore, Node 5 and the following nodes will never connect to Node 1
- We only need memory of 2 "steps" rather than $n - 1$ steps
- Breadth-First Search ordering

- Benefits:
- Reduce possible node orderings
- From $O(n!)$ to the number of distinct BFS orderings
- Reduce steps for edge generation
- Reduce the number of previous nodes to look at
- BFS reduces the number of steps for edge generation
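A minimal sketch of computing a BFS node ordering for a toy graph (the ordering is what limits how far back a new node can connect):

```python
# Minimal sketch: BFS node ordering of a toy graph given as an adjacency dict.
from collections import deque

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}  # toy graph

def bfs_order(adj, start=0):
    order, seen, q = [], {start}, deque([start])
    while q:
        v = q.popleft()
        order.append(v)
        for u in adj[v]:
            if u not in seen:
                seen.add(u)
                q.append(u)
    return order

print(bfs_order(adj))   # [0, 1, 2, 3, 4]
```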

Evaluating Generated Graphs
- Task: Compare two sets of graphs

- Goal: Define similarity metrics for graphs
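One simple statistics-based comparison (a stand-in for the metrics used in the lecture, not the exact ones): compare degree histograms of the two graph sets.

```python
# Minimal sketch: compare two sets of graphs via their degree distributions.
import numpy as np

def degree_hist(edge_lists, max_deg=10):
    counts = np.zeros(max_deg + 1)
    for edges in edge_lists:
        deg = {}
        for u, v in edges:
            deg[u] = deg.get(u, 0) + 1
            deg[v] = deg.get(v, 0) + 1
        for d in deg.values():
            counts[min(d, max_deg)] += 1
    return counts / counts.sum()

real = [[(0, 1), (1, 2), (2, 0)]]            # toy "real" graphs (a triangle)
generated = [[(0, 1), (0, 2), (0, 3)]]       # toy "generated" graphs (a star)
h_real, h_gen = degree_hist(real), degree_hist(generated)
tv = 0.5 * np.abs(h_real - h_gen).sum()      # 0 = identical degree statistics
print(tv)
```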
Application of Deep Graph Generative Models to Molecule Generation
Application: Drug Discovery

Goal-Directed Graph Generation
Generating graphs that:
- Optimize a given objective (High scores) - Idea: Reinforcement Learning
- E.g., drug-likeness
- Obey underlying rules (Valid)
- E.g., chemical validity rules
- Are learned from examples (Realistic)
- Imitating a molecule graph dataset.
- We have just covered this part.
Solution: GCPN
Graph Convolutional Policy Network (GCPN) combines graph representation learning and reinforcement learning (RL)
Key components of GCPN:
- Graph Neural Network captures graph structural information
- Reinforcement Learning guides the generation towards the desired objectives
- Supervised training imitates examples in given datasets
GCPN vs. GraphRNN
- Commonality of GCPN & GraphRNN
- Generate graphs sequentially
- Imitate a given graph dataset
- Main Differences:
- GCPN uses GNN to predict the generation action
- Pros: GNN is more expressive than RNN
- Cons: GNNs take longer to compute than RNNs
- GCPN further uses RL to direct graph generation to our goals
- RL enables goal-directed graph generation
- GCPN uses GNN to predict the generation action
- Sequential graph generation
- GraphRNN: predict action based on RNN hidden states

- GCPN: predict action based on GNN node embeddings

Overview of GCPN

- (a) Insert nodes
- (b, c) Use GNN to predict which nodes to connect
- (d) Take an action (check chemical validity)
- (e, f) Compute reward
How to set the Reward?

- Step reward: Learn to take valid action
- At each step, assign small positive reward for valid action
- Final reward: Optimize desired properties
- At the end, assign positive reward for a high score on the desired property
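A minimal sketch of this reward structure; `is_valid` and `property_score` are hypothetical placeholders (in practice a chemistry toolkit would supply validity checks and property scores, e.g. drug-likeness):

```python
# Minimal sketch of the step reward / final reward split described above.
# `is_valid` and `property_score` are hypothetical placeholder functions.
def step_reward(molecule, is_valid, small_reward=0.1):
    # Small positive reward for every action that keeps the molecule valid;
    # returning 0 (or a small penalty) for invalid actions is a design choice.
    return small_reward if is_valid(molecule) else 0.0

def final_reward(molecule, property_score, weight=1.0):
    # At the end of generation, reward a high score on the desired property.
    return weight * property_score(molecule)
```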
How to train?

- Supervised training: Train the policy by imitating the actions given by real observed graphs. Use gradient descent on the imitation loss.
- We have covered this idea in GraphRNN
- Reinforcement Learning training: Train policy to optimize rewards. Use standard policy gradient algorithm.
Training GCPN

Qualitative Results
Visualization of GCPN graphs:
- Property optimization: Generate molecules with a high specified property score

- Constrained Optimization: Edit a given molecule for a few steps to achieve higher property score



