CS224W - Machine Learning with Graphs: Fast Neural Subgraph Matching and Counting
Subgraphs and Motifs
Definitions:
- Given graph $G = (V, E)$:
- Def 1. Node-induced subgraph: Take a subset of the nodes and all edges induced by those nodes:
- $G' = (V', E')$ is a node-induced subgraph iff
- $V' \subseteq V$
- $E' = \{(u, v) \in E \mid u, v \in V'\}$
- $G'$ is the subgraph of $G$ induced by $V'$
Alternate terminology: "induced subgraph"
- Def 2. Edge-induced subgraph: Take a subset of the edges and all corresponding nodes:
- $G' = (V', E')$ is an edge-induced subgraph iff
- $E' \subseteq E$
- $V' = \{v \in V \mid (v, u) \in E' \text{ for some } u\}$
Alternate terminology: "non-induced subgraph" or just "subgraph"
The best definition depends on the domain!
- Example:
- Chemistry: Node-induced (functional groups)
- Knowledge graphs: Often edge-induced (focus is on edges representing logical relations)
The preceding definitions define subgraphs when $V' \subseteq V$ and $E' \subseteq E$, i.e. nodes and edges are taken from the original graph $G$.
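For concreteness, here is a minimal sketch (assuming NetworkX is available; the small example graph and the node/edge subsets are made up for illustration) contrasting the two definitions:

```python
# Contrast node-induced vs. edge-induced subgraphs on a toy graph.
import networkx as nx

G = nx.Graph([(1, 2), (2, 3), (3, 1), (3, 4)])

# Node-induced: keep a node subset and ALL edges among those nodes.
G_node = G.subgraph([1, 2, 3])               # edges: (1,2), (2,3), (1,3)

# Edge-induced: keep an edge subset and the nodes those edges touch.
G_edge = G.edge_subgraph([(1, 2), (3, 4)])   # nodes: 1, 2, 3, 4; edges: (1,2), (3,4)

print(sorted(G_node.edges()), sorted(G_edge.edges()))
```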
Graph Isomorphism
Graph isomorphism problem: Check whether two graphs are identical: $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are isomorphic if there exists a bijection $f: V_1 \to V_2$ such that $(u, v) \in E_1$ iff $(f(u), f(v)) \in E_2$.
We do not know whether graph isomorphism is NP-hard, nor has a polynomial-time algorithm been found for it.
- Subgraph isomorphism: $G_2$ is subgraph-isomorphic to $G_1$ if some subgraph of $G_2$ is isomorphic to $G_1$
- We also commonly say $G_1$ is a subgraph of $G_2$
- We can use either the node-induced or edge-induced definition of subgraph
- This problem is NP-hard
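A small sketch of exact (subgraph) isomorphism checks using NetworkX's VF2 matcher (assumed available); the example graphs are illustrative only. These exact checks are exponential in the worst case, which motivates the neural approximation discussed later:

```python
# Exact isomorphism and induced-subgraph isomorphism checks.
import networkx as nx
from networkx.algorithms.isomorphism import GraphMatcher

G1 = nx.cycle_graph(4)
G2 = nx.relabel_nodes(G1, {0: 'a', 1: 'b', 2: 'c', 3: 'd'})
print(nx.is_isomorphic(G1, G2))        # True: same graph up to relabeling

target = nx.complete_graph(5)
query = nx.complete_graph(3)           # a triangle
# True iff some node-induced subgraph of `target` is isomorphic to `query`.
print(GraphMatcher(target, query).subgraph_is_isomorphic())  # True
```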

- Case Example of Subgraphs
- All non-isomorphic, connected, undirected graphs of size 4

- All non-isomorphic, connected, directed graphs of size 3

Network Motifs
- Network motifs: “recurring, significant patterns of interconnections”
- How to define a network motif:
- Pattern: Small (node-induced) subgraph
- Recurring: Found many times, i.e., with high frequency. How to define frequency?
- Significant: More frequent than expected, i.e., more frequent than in a randomly generated graph. How to define a random graph?
Motifs: Induced Subgraphs

(Figure: example of a motif occurrence as a node-induced subgraph; the red region has 3 edges.)
Why do we need motifs?
- Motifs:
- Help us understand how graphs work
- Help us make predictions based on the presence or absence of motifs in a graph dataset.
- Example:
- Feed-forward loops: Found in networks of neurons, where they neutralize “biological noise”

- Parallel loops: Found in food webs

- Single-input modules: Found in gene control networks

Subgraph Frequency
- Let $G_Q$ be a small graph and $G_T$ be a target graph dataset.
- Graph-level Subgraph Frequency Definition:
Frequency of $G_Q$ in $G_T$: the number of unique subsets of nodes $V_T$ of $G_T$ for which the subgraph of $G_T$ induced by $V_T$ is isomorphic to $G_Q$.

- Let $G_Q$ be a small graph, $v$ be a node in $G_Q$ (the "anchor"), and $G_T$ be a target graph dataset.
- Node-level Subgraph Frequency Definition: The number of nodes $u$ in $G_T$ for which some subgraph of $G_T$ is isomorphic to $G_Q$ and the isomorphism maps node $u$ to $v$.
- Let $(G_Q, v)$ be called a node-anchored subgraph.
- Robust to outliers

What if the dataset contains multiple graphs, and we want to compute frequency of subgraphs in the dataset?
- Solution: Treat the dataset as a giant graph with disconnected components corresponding to individual graphs.

Defining Motif Significance
- To define significance, we need to have a null-model (i.e., point of comparison).
- Key idea: Subgraphs that occur in a real network much more often than in a random network have functional significance.

Defining Random Graphs
Erdős–Rényi (ER) random graphs:
- $G_{n,p}$: undirected graph on $n$ nodes where each edge $(u, v)$ appears i.i.d. with probability $p$
- How to generate the graph: Create $n$ nodes; for each pair of nodes $(u, v)$, flip a biased coin with bias $p$ and add the edge if it lands heads.
- The generated graph is the result of a random process. (Figure: three random graphs drawn from $G_{n,p}$.)
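A minimal sketch of this generation process (NetworkX assumed; n = 100 and p = 0.05 are arbitrary illustrative choices), alongside the built-in generator for comparison:

```python
# Generate G(n, p): flip a biased coin with bias p for every node pair.
import random
import networkx as nx

def erdos_renyi(n: int, p: float, seed: int = 0) -> nx.Graph:
    rng = random.Random(seed)
    G = nx.empty_graph(n)
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:       # biased coin flip for the pair (u, v)
                G.add_edge(u, v)
    return G

G_manual = erdos_renyi(100, 0.05)
G_builtin = nx.gnp_random_graph(100, 0.05, seed=0)
print(G_manual.number_of_edges(), G_builtin.number_of_edges())
```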

New Model: Configuration Model
- Goal: Generate a random graph with a given degree sequence $k_1, k_2, \ldots, k_N$
- Useful as a "null" model of networks:
- We can compare the real network $G^{\text{real}}$ and a "random" network $G^{\text{rand}}$ which has the same degree sequence as $G^{\text{real}}$
- Configuration model:

Alternative for Spokes: Switching
- Start from a given graph G
- Repeat the switching step $Q \cdot |E|$ times:
- Select a pair of edges $A \to B$, $C \to D$ at random
- Exchange the endpoints to give $A \to D$, $C \to B$ (only if no multiple edges or self-edges are created)
- Result: A randomly rewired graph:
- Same node degrees, randomly rewired edges
- Q is chosen large enough (e.g., Q=100) for the process to converge
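A hedged sketch of this switching null model using NetworkX's double_edge_swap, which performs exactly this endpoint exchange; the karate club graph and Q = 100 are illustrative choices taken from the notes:

```python
# Degree-preserving rewiring: same degree sequence, randomized edges.
import networkx as nx

def rewired_null_model(G: nx.Graph, Q: int = 100, seed: int = 0) -> nx.Graph:
    G_rand = G.copy()
    nswap = Q * G.number_of_edges()
    # Each swap picks two edges (A-B, C-D) and rewires them to (A-D, C-B),
    # rejecting swaps that would create self-loops or parallel edges.
    nx.double_edge_swap(G_rand, nswap=nswap, max_tries=nswap * 10, seed=seed)
    return G_rand

G = nx.karate_club_graph()
G_rand = rewired_null_model(G, Q=100)
# Degree sequence is preserved by construction.
assert sorted(d for _, d in G.degree()) == sorted(d for _, d in G_rand.degree())
```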
Motif Significance Overview
- Intuition: Motifs are overrepresented in a network when compared to random graphs:
- Step 1: Count motifs in the given graph
- Step 2: Generate random graphs with similar statistics (e.g. number of nodes, edges, degree sequence), and count motifs in the random graphs
- Step 3: Use statistical measures to evaluate how significant each motif is
- Use Z-score
Z-score for Statistical Significance
- $Z_i$ captures the statistical significance of motif $i$:
- $Z_i = \dfrac{N_i^{\text{real}} - \bar{N}_i^{\text{rand}}}{\text{std}(N_i^{\text{rand}})}$
- $N_i^{\text{real}}$ is the number of occurrences of motif $i$ in graph $G^{\text{real}}$
- $\bar{N}_i^{\text{rand}}$ is the average number of occurrences of motif $i$ over the random graph instances
- Network significance profile (SP): $SP_i = Z_i / \sqrt{\sum_j Z_j^2}$
- SP is a vector of normalized Z-scores
- The dimension depends on the number of motifs considered
- SP emphasizes the relative significance of subgraphs:
- Important for comparing networks of different sizes
- Generally, larger graphs display higher Z-scores.
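A small sketch (NumPy assumed; the motif counts are hypothetical) of computing per-motif Z-scores and the normalized significance profile from counts in the real graph and in random graphs:

```python
# Z-scores and normalized significance profile (SP) from motif counts.
import numpy as np

def significance_profile(counts_real: np.ndarray, counts_rand: np.ndarray):
    """counts_real: (num_motifs,) counts in the real graph.
       counts_rand: (num_random_graphs, num_motifs) counts per random graph."""
    mean_rand = counts_rand.mean(axis=0)
    std_rand = counts_rand.std(axis=0)
    z = (counts_real - mean_rand) / std_rand          # Z_i per motif
    sp = z / np.sqrt((z ** 2).sum())                  # normalized profile SP_i
    return z, sp

# Hypothetical counts for 3 motif types over 5 random graphs.
z, sp = significance_profile(
    np.array([120., 40., 7.]),
    np.array([[80., 45., 9.], [85., 42., 8.], [78., 44., 10.],
              [82., 46., 7.], [79., 43., 9.]]))
print(z, sp)
```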
Significance Profile
- For each subgraph:
- The Z-score metric classifies the subgraph's "significance":
- Negative values indicate under-representation
- Positive values indicate over-representation
- We create a network significance profile (SP):
- A feature vector with values for all subgraph types
- Example SP

Neural Subgraph Representations
Subgraph Matching
- Given:
- Large target graph (can be disconnected)
- Query graph (connected)
- Decide:
- Is the query graph a subgraph of the target graph?

(Figure: node colors indicate the correct mapping of the nodes.)
- Use GNN to predict subgraph isomorphism.
- Intuition: Exploit the geometric structure of the embedding space to capture the properties of subgraph isomorphism
Overview of the Approach

We are going to work with node-anchored definitions:

We are going to work with node-anchored neighborhoods:

Use a GNN to obtain representations of the anchored neighborhoods of $u$ (in the target) and $v$ (in the query).
Predict whether node $v$'s neighborhood is isomorphic to a subgraph of node $u$'s neighborhood.

Why Anchor?
- Recall the node-level frequency definition: the number of nodes $u$ in $G_T$ for which some subgraph of $G_T$ is isomorphic to $G_Q$ and the isomorphism maps $u$ to $v$.
- We can compute embeddings for $u$ and $v$ using a GNN
- Use the embeddings to decide if the neighborhood of $v$ is isomorphic to a subgraph of the neighborhood of $u$
- We not only predict whether a mapping exists, but also identify the corresponding nodes ($u$ and $v$)!
Decomposing into Neighborhoods
- For each node $u$ in $G_T$:
- Obtain a k-hop neighborhood around the anchor $u$
- Can be performed using breadth-first search (BFS)
- The depth $k$ is a hyper-parameter (e.g. 3)
- Larger depth results in a more expensive model
- The same procedure applies to $G_Q$ to obtain the query neighborhoods
- We embed the neighborhoods using a GNN
- by computing the embeddings for the anchor nodes in their respective neighborhoods (a sketch of the decomposition step follows below).
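A minimal sketch (NetworkX assumed) of the neighborhood decomposition, where k plays the role of the depth hyper-parameter; the GNN embedding step itself is not shown:

```python
# Decompose a graph into node-anchored k-hop neighborhoods via BFS.
import networkx as nx

def khop_neighborhoods(G: nx.Graph, k: int = 3):
    """Return {anchor: node-induced subgraph of all nodes within k hops}."""
    neighborhoods = {}
    for anchor in G.nodes():
        # ego_graph performs a BFS up to depth k and returns the induced subgraph.
        neighborhoods[anchor] = nx.ego_graph(G, anchor, radius=k)
    return neighborhoods

G = nx.karate_club_graph()
hoods = khop_neighborhoods(G, k=2)
print(hoods[0].number_of_nodes())
```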
Idea: Order Embedding Space
Map a graph $A$ to a point $z_A$ in a high-dimensional (e.g. 64-dim) embedding space, such that $z_A$ is non-negative in all dimensions.

Subgraph Order Embedding Space

Why Order Embedding Space?
The subgraph isomorphism relationship can be nicely encoded in an order embedding space:
- Transitivity: If $G_1$ is a subgraph of $G_2$ and $G_2$ is a subgraph of $G_3$, then $G_1$ is a subgraph of $G_3$
- Anti-symmetry: If $G_1$ is a subgraph of $G_2$, and $G_2$ is a subgraph of $G_1$, then $G_1$ is isomorphic to $G_2$.
- Closure under intersection: The trivial graph of 1 node is a subgraph of any graph.
- All these properties have counterparts in the order embedding space.
- Example:

Order Constraint
- We use a GNN to learn to embed neighborhoods and preserve the order embedding structure.
- What loss function should we use, so that the learned order embedding reflects the subgraph relationship?
- We design the loss function based on the order constraint:
- The order constraint specifies the ideal order-embedding property that reflects subgraph relationships:
- $z_q[i] \le z_t[i]$ for all dimensions $i = 1, \ldots, D$ iff $G_q$ is a subgraph of $G_t$ (where $z_q$, $z_t$ are the embeddings of the query and target neighborhoods)
- We specify the order constraint to ensure that the subgraph properties are preserved in the order embedding space.

Order Constraint: Loss Function
- GNN Embeddings are learned by minimizing a max-margin loss
- Define $E(G_q, G_t) = \sum_{i=1}^{D} \big(\max(0,\, z_q[i] - z_t[i])\big)^2$ as the "margin" between graphs $G_q$ and $G_t$

- To learn correct order embeddings, we want to learn embeddings such that
- $E(G_q, G_t) = 0$ when $G_q$ is a subgraph of $G_t$.
- $E(G_q, G_t) > 0$ when $G_q$ is not a subgraph of $G_t$.
Training Neural Subgraph Matching
- To learn such embeddings, construct training examples $(G_q, G_t)$ where half the time $G_q$ is a subgraph of $G_t$, and the other half it is not.
- Train on these examples by minimizing the following max-margin loss:
- For positive examples: Minimize $E(G_q, G_t)$ when $G_q$ is a subgraph of $G_t$
- For negative examples: Minimize $\max(0, \alpha - E(G_q, G_t))$, where $\alpha$ is the margin hyper-parameter
- Max-margin loss prevents the model from learning the degenerate strategy of moving embeddings further and further apart forever.
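A minimal PyTorch sketch of the order margin $E(G_q, G_t)$ and the max-margin loss; the margin value alpha, the batch size, and the random 64-dimensional embeddings are assumptions, and in practice z_q and z_t would come from the GNN encoder:

```python
# Order-embedding margin and max-margin loss for neural subgraph matching.
import torch

def order_margin(z_q: torch.Tensor, z_t: torch.Tensor) -> torch.Tensor:
    """E(G_q, G_t) = sum_i max(0, z_q[i] - z_t[i])^2, computed per pair.
    z_q, z_t: (batch, dim) non-negative embeddings of query / target neighborhoods."""
    return torch.clamp(z_q - z_t, min=0).pow(2).sum(dim=-1)

def max_margin_loss(z_q, z_t, is_subgraph, alpha: float = 0.1):
    """is_subgraph: (batch,) bool; True for positive (G_q subgraph of G_t) pairs."""
    e = order_margin(z_q, z_t)
    pos_loss = e                                  # push E -> 0 for positives
    neg_loss = torch.clamp(alpha - e, min=0)      # push E >= alpha for negatives
    return torch.where(is_subgraph, pos_loss, neg_loss).mean()

z_q = torch.rand(8, 64)
z_t = torch.rand(8, 64)
labels = torch.tensor([True, False] * 4)
print(max_margin_loss(z_q, z_t, labels))
```

The alpha term is what prevents the degenerate strategy mentioned above: once a negative pair's margin exceeds alpha, pushing the embeddings further apart yields no additional reward.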
Training Example Construction
- Need to generate training queries $G_Q$ and targets $G_T$ from the dataset $G$
- Get $G_T$ by choosing a random anchor $v$ and taking all nodes in $G$ within distance $K$ of $v$ to be in $G_T$.
- Positive examples: Sample an induced subgraph $G_Q$ of $G_T$. Use BFS sampling (see the sketch after this list):
- Initialize $S = \{v\}$, $V = \emptyset$
- Let $N(S)$ be all neighbors of nodes in $S$. At every step, sample 10% of the nodes in $N(S) \setminus V$ and put them in $S$; put the remaining nodes of $N(S)$ in $V$.
- After $K$ steps, take the subgraph of $G$ induced by $S$, anchored at $v$.
- Negative examples ($G_Q$ not a subgraph of $G_T$): "corrupt" $G_Q$ by adding/removing nodes/edges so it is no longer a subgraph.
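A hedged sketch of the BFS sampling of positive queries; the 10% sampling fraction, K = 3, and the example graph are assumptions made for illustration (corruption of negatives is not shown):

```python
# BFS sampling of a positive (induced, node-anchored) query graph G_Q from G_T.
import random
import networkx as nx

def sample_positive_query(G: nx.Graph, anchor, K: int = 3,
                          frac: float = 0.1, seed: int = 0) -> nx.Graph:
    rng = random.Random(seed)
    S, V = {anchor}, set()                       # S: kept nodes, V: skipped frontier nodes
    for _ in range(K):
        frontier = {n for s in S for n in G.neighbors(s)} - S - V
        if not frontier:
            break
        keep = rng.sample(sorted(frontier), max(1, int(frac * len(frontier))))
        S.update(keep)
        V.update(frontier - set(keep))
    return G.subgraph(S).copy()                  # induced subgraph anchored at `anchor`

G_T = nx.ego_graph(nx.karate_club_graph(), 0, radius=3)
G_Q = sample_positive_query(G_T, anchor=0)
print(G_Q.number_of_nodes(), G_Q.number_of_edges())
```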

Details:
- How many training examples to sample?
- At every iteration, we sample new training pairs
- Benefit: Every iteration, the model sees different subgraph examples
- Improves performance and avoids overfitting, since there is an exponential number of possible subgraphs to sample from.
- How deep is the BFS sampling?
- A hyper-parameter that trades off runtime and performance.
- Usually 3-5 hops are used, depending on the size of the dataset.
Subgraph Predictions on New Graphs
- Given: query graph $G_q$ anchored at node $q$, target graph $G_t$ anchored at node $t$.
- Goal: Output whether the query is a node-anchored subgraph of the target.
- Procedure:
- If $E(G_q, G_t) < \epsilon$, predict "True"; else predict "False"
- $\epsilon$ is a hyper-parameter.
- To check if $G_Q$ is isomorphic to a subgraph of $G_T$, repeat this procedure for all pairs of anchors $q \in G_Q$ and $t \in G_T$. Here $G_q$ is the neighborhood around node $q$.
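A minimal sketch of this decision rule; epsilon and the example embeddings are placeholders, and in practice z_q and z_t come from the trained GNN:

```python
# Predict a match when the order margin E(G_q, G_t) falls below epsilon.
import torch

def predict_subgraph(z_q: torch.Tensor, z_t: torch.Tensor,
                     epsilon: float = 1e-3) -> bool:
    """z_q, z_t: (dim,) embeddings of the query / target anchored neighborhoods."""
    e = torch.clamp(z_q - z_t, min=0).pow(2).sum()
    return bool(e < epsilon)

z_t = torch.rand(64)
z_q = torch.clamp(z_t - 0.05, min=0)     # dominated by z_t in every dimension
print(predict_subgraph(z_q, z_t))        # True: E(G_q, G_t) = 0 < epsilon
```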
Finding Frequent Subgraphs
Generally, finding the most frequent size-$k$ motifs requires solving two challenges:
- Enumerating all size-$k$ connected subgraphs
- Counting # (occurrences of each subgraph type)

- Subgraph isomorphism is NP-complete
- Feasible motif size for traditional methods is relatively small (3 to 7)
Finding frequent subgraph patterns is computationally hard
- Combinatorial explosion of number of possible patterns
- Counting subgraph frequency is NP-hard
Representation learning can tackle these challenges:
- Combinatorial explosion → Organize the search space
- Subgraph isomorphism → Prediction using GNN

Representation learning can tackle these challenges:
- Counting # (occurrences of each subgraph type)
- Solution: Use GNN to “predict” the frequency of the subgraph.
- Enumerating all size-$k$ connected subgraphs
- Solution: Don't enumerate subgraphs, but construct a size-$k$ subgraph incrementally
- Note: We are only interested in high-frequency subgraphs
Problem Setup: Frequent Motif Mining
- Target graph (dataset) $G_T$, size parameter $k$
- Desired number of results $r$
- Goal: Identify, among all possible graphs of $k$ nodes, the $r$ graphs with the highest frequency in $G_T$.
- We use the node-level definition: the number of nodes $u$ in $G_T$ for which some subgraph of $G_T$ is isomorphic to $G_Q$ and the isomorphism maps $u$ to $v$.


SPMiner: Key Idea
- Decompose the input graph $G_T$ into neighborhoods
- Embed the neighborhoods into an order embedding space
- Key benefit of order embedding: We can quickly “predict” the frequency of a given subgraph
Motif Frequency Estimation
- Given: Set of subgraphs $G_{N_i}$ ("node-anchored neighborhoods") of $G_T$ (sampled randomly)
- Key idea: Estimate the frequency of $G_Q$ by counting the number of $G_{N_i}$ such that their embeddings $z_{N_i}$ satisfy $z_Q[k] \le z_{N_i}[k]$ in all dimensions $k$
- This is a consequence of the order embedding space property
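A small sketch of this counting step in the order embedding space (PyTorch assumed; the query and neighborhood embeddings are random placeholders standing in for GNN outputs):

```python
# Estimate motif frequency: count neighborhood embeddings that dominate
# the query embedding in every dimension.
import torch

def estimate_frequency(z_query: torch.Tensor, z_neighborhoods: torch.Tensor) -> int:
    """z_query: (dim,); z_neighborhoods: (num_neighborhoods, dim)."""
    dominated = (z_query <= z_neighborhoods).all(dim=-1)   # z_Q <= z_Ni in all dims
    return int(dominated.sum())

z_Q = torch.rand(64) * 0.2
z_N = torch.rand(100, 64)
print(estimate_frequency(z_Q, z_N))
```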

SPMiner Search Procedure




Results: Small Motifs
- Ground-truth: Find most frequent 10 motifs in dataset by brute-force exact enumeration (expensive)
- Question: Can the model identify frequent motifs?
- Result: The model identifies 9 and 8 of the top 10 motifs, respectively.



