CS224W: Machine Learning with Graphs - GNNs for Recommender Systems
Recommender Systems: Task and Evaluation
Recommender systems can be naturally modeled as a bipartite graph:
- A graph with two node types: users and items.
- Edges connect users and items.
- An edge indicates a user-item interaction.
- Edges are often associated with a timestamp.

Recommendation Task
- Given: Past user-item interactions.
- Task: Predict new items each user will interact with in the future.
- Can be cast as a link prediction problem: predict new user-item interaction edges given the past edges.
- For each user-item pair, we need to get a real-valued score (defined below).

Modern Recommender System
- Problem: Cannot evaluate the score for every user-item pair.
- Solution: A 2-stage process:
- Candidate generation (cheap, fast)
- Ranking (slow, accurate)
Top-$K$ Recommendation
For each user, we recommend $K$ items.
- For the recommendation to be effective, $K$ needs to be much smaller than the total number of items (which can be up to billions).
- $K$ is typically on the order of 10-100.
The goal is to include as many positive items as possible in the top-$K$ recommended items.
- Positive items = items that the user will interact with in the future.
Evaluation metric: Recall@$K$ (defined next; a code sketch follows below).
- For each user $u$,
- Let $P_u$ be the set of positive items the user will interact with in the future.
- Let $R_u$ be the set of top-$K$ items recommended to user $u$ (items that the user has already interacted with are excluded).
- Recall@$K$ for user $u$ is $\frac{|P_u \cap R_u|}{|P_u|}$.
- A higher value indicates that more positive items are recommended in the top-$K$ for user $u$.

- The final Recall@$K$ is computed by averaging the recall values across all users.
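A minimal NumPy sketch of Recall@$K$ for a single user (the helper name, array shapes, and toy scores are illustrative, not from the lecture):

```python
import numpy as np

def recall_at_k(scores, train_items, positive_items, k=10):
    """Recall@K for one user.

    scores:         1-D array of predicted scores, one per item.
    train_items:    indices of items the user already interacted with
                    (excluded from the recommendation list).
    positive_items: held-out items the user interacts with in the future.
    """
    scores = scores.copy()
    scores[list(train_items)] = -np.inf          # never recommend seen items
    top_k = np.argpartition(-scores, k)[:k]      # indices of the K highest scores
    hits = len(set(top_k.tolist()) & set(positive_items))
    return hits / max(len(positive_items), 1)

# Toy example: 6 items, the user already interacted with item 0.
scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7, 0.2])
print(recall_at_k(scores, train_items={0}, positive_items={2, 5}, k=2))  # 0.5
```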
Recommender Systems: Embedding-Based Models
Notation:
- $U$: the set of all users
- $V$: the set of all items
- $E$: the set of observed user-item interactions
- $E = \{(u, v) \mid u \in U,\ v \in V,\ u \text{ interacted with } v\}$
Score Function
To get the top-$K$ items, we need a score function for user-item interactions:
- For $u \in U$ and $v \in V$, we need to get a real-valued scalar $\mathrm{score}(u, v)$.
- The $K$ items with the largest scores for a given user (excluding already-interacted items) are then recommended.

Embedding-Based Models
We consider embedding-based models for scoring user-item interactions.
- For each user $u \in U$, let $\mathbf{u} \in \mathbb{R}^D$ be its $D$-dimensional embedding.
- For each item $v \in V$, let $\mathbf{v} \in \mathbb{R}^D$ be its $D$-dimensional embedding.
- Let $f_\theta(\cdot, \cdot): \mathbb{R}^D \times \mathbb{R}^D \rightarrow \mathbb{R}$ be a parameterized function.
- Then, $\mathrm{score}(u, v) = f_\theta(\mathbf{u}, \mathbf{v})$ (see the code sketch below).
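A minimal PyTorch sketch of an embedding-based scorer; the inner-product choice of $f_\theta$, the class name, and the sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class EmbeddingScorer(nn.Module):
    """Minimal embedding-based scorer: score(u, v) = f_theta(u_emb, v_emb).

    Here f_theta is the inner product; it could also be a small MLP.
    """
    def __init__(self, num_users: int, num_items: int, dim: int = 64):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, dim)  # shallow user encoder
        self.item_emb = nn.Embedding(num_items, dim)  # shallow item encoder

    def score(self, users: torch.Tensor, items: torch.Tensor) -> torch.Tensor:
        return (self.user_emb(users) * self.item_emb(items)).sum(dim=-1)

    @torch.no_grad()
    def recommend(self, user: int, seen: set, k: int = 10) -> torch.Tensor:
        """Top-K items for one user, excluding already-interacted items."""
        scores = self.user_emb.weight[user] @ self.item_emb.weight.T
        scores[list(seen)] = float("-inf")
        return scores.topk(k).indices

model = EmbeddingScorer(num_users=100, num_items=500)
print(model.recommend(user=3, seen={10, 42}, k=5))
```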

Training Objective
Embedding-based models have three kinds of parameters:
- An encoder to generate user embeddings
- An encoder to generate item embeddings
- The score function $f_\theta$
Training objective: Optimize the model parameters to achieve high Recall@$K$ on seen (i.e., training) user-item interactions.
- We hope this objective also leads to high Recall@$K$ on unseen (i.e., test) interactions.
Surrogate Loss Functions
The original training objective (Recall@$K$) is not differentiable.
- Cannot apply efficient gradient-based optimization.
Two surrogate loss functions are widely used to enable efficient gradient-based optimization:
- Binary loss
- Bayesian Personalized Ranking (BPR) loss
Surrogate losses are differentiable and should align well with the original training objective.
Binary Loss
- Define positive/negative edges:
- Positive edges: the set $E$ of observed/training user-item interactions.
- Negative edges: the set $E_{\text{neg}} = \{(u, v) \mid (u, v) \notin E,\ u \in U,\ v \in V\}$ of unobserved user-item pairs.
- Define the sigmoid function $\sigma(x) = \frac{1}{1 + \exp(-x)}$.
- It maps real-valued scores into likelihood scores in the range $(0, 1)$.
- Binary loss: binary classification of positive/negative edges using $\sigma(\cdot)$ (a code sketch follows at the end of this subsection):

$$\text{Loss} = -\frac{1}{|E|} \sum_{(u,v) \in E} \log\big(\sigma(f_\theta(\mathbf{u}, \mathbf{v}))\big) \;-\; \frac{1}{|E_{\text{neg}}|} \sum_{(u,v) \in E_{\text{neg}}} \log\big(1 - \sigma(f_\theta(\mathbf{u}, \mathbf{v}))\big)$$
- Binary loss pushes the scores of positive edges higher than those of negative edges.
- This aligns with the training recall metric since positive edges need to be recalled.
- Issue: In the binary loss, the scores of ALL positive edges are pushed higher than those of ALL negative edges.
- This would unnecessarily penalize model predictions even if the training recall metric is perfect.
- Why?
- Consider the simplest case:
- Two users, two items
- Metric: Recall@1.
- Suppose the model assigns scores such that, for each user, the positive item is ranked first, yet a negative edge of one user gets a higher score than a positive edge of the other user.
- Training Recall@1 is 1.0, because $v_1$ (resp. $v_2$) is correctly recommended to $u_1$ (resp. $u_2$).
- However, the binary loss would still penalize the model prediction, because that negative edge gets a higher score than that positive edge.
- Key insight: The binary loss is non-personalized in the sense that positive/negative edges are considered across ALL users at once.
- However, the recall metric is inherently personalized (defined for each user).
- The non-personalized binary loss is overly stringent for the personalized recall metric.
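A minimal PyTorch sketch of the binary loss above, assuming scores are already computed for sampled positive and negative edges (names and toy numbers are illustrative):

```python
import torch
import torch.nn.functional as F

def binary_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Binary classification of positive vs. negative edges.

    pos_scores: f_theta(u, v) for (u, v) in E        (shape: [num_pos])
    neg_scores: f_theta(u, v) for (u, v) in E_neg    (shape: [num_neg])
    """
    pos_term = F.binary_cross_entropy_with_logits(pos_scores, torch.ones_like(pos_scores))
    neg_term = F.binary_cross_entropy_with_logits(neg_scores, torch.zeros_like(neg_scores))
    return pos_term + neg_term

# Toy scores: a negative edge scored above another user's positive edge
# already contributes to the loss, even if every user's own ranking is correct.
pos = torch.tensor([0.5, 2.0])   # positive edges of user 1 and user 2
neg = torch.tensor([1.0, -2.0])  # negative edges of user 2 and user 1
print(binary_loss(pos, neg))
```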
Desirable Surrogate Loss
- Surrogate loss function should be defined in a personalized manner.
- For each user, we want the scores of positive items to be higher than those of the negative items.
- We do not care about the score ordering across users.
- Bayesian Personalized Ranking (BPR) loss achieves this!
Loss Function: BPR Loss
- Bayesian Personalized Ranking (BPR) loss is a personalized surrogate loss that aligns better with the Recall@$K$ metric.
- For each user $u^*$, define the rooted positive/negative edges as:
- Positive edges rooted at $u^*$: $E(u^*) \equiv \{(u^*, v) \mid (u^*, v) \in E\}$
- Negative edges rooted at $u^*$: $E_{\text{neg}}(u^*) \equiv \{(u^*, v) \mid (u^*, v) \in E_{\text{neg}}\}$
- Training objective: For each user $u^*$, we want the scores of rooted positive edges to be higher than those of rooted negative edges.
- This aligns with the personalized nature of the recall metric.
- BPR loss for user $u^*$:

$$\text{Loss}(u^*) = \frac{1}{|E(u^*)| \cdot |E_{\text{neg}}(u^*)|} \sum_{(u^*, v_{\text{pos}}) \in E(u^*)} \; \sum_{(u^*, v_{\text{neg}}) \in E_{\text{neg}}(u^*)} -\log \sigma\big(f_\theta(\mathbf{u}^*, \mathbf{v}_{\text{pos}}) - f_\theta(\mathbf{u}^*, \mathbf{v}_{\text{neg}})\big)$$

- Final BPR loss: $\frac{1}{|U|} \sum_{u^* \in U} \text{Loss}(u^*)$
- Mini-batch training for the BPR loss: in each iteration, sample a mini-batch of users; for each sampled user, sample one positive item and a set of negative items, and compute the loss over these samples (see the sketch below).
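A minimal PyTorch sketch of the BPR loss for a mini-batch, assuming one sampled positive and several sampled negatives per user (shapes and toy numbers are illustrative):

```python
import torch
import torch.nn.functional as F

def bpr_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """BPR loss for a mini-batch of users.

    pos_scores: score of one sampled positive item per user   (shape: [B])
    neg_scores: scores of sampled negative items per user     (shape: [B, num_neg])
    For each user, every (positive, negative) pair contributes
    -log sigmoid(pos - neg); pairs are never compared across users.
    """
    diff = pos_scores.unsqueeze(1) - neg_scores          # [B, num_neg]
    return -F.logsigmoid(diff).mean()

# Toy mini-batch: 2 users, 3 negatives each.
pos = torch.tensor([1.2, 0.3])
neg = torch.tensor([[0.1, -0.5, 0.9],
                    [0.0,  0.4, -1.0]])
print(bpr_loss(pos, neg))
```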
Why Do Embedding Models Work?
- Underlying idea: Collaborative filtering
- Recommend items for a user by collecting preferences of many other similar users.
- Similar users tend to prefer similar items.
- Key question: How to capture similarity between users / items?
- Embedding-based models can capture similarity of users/items!
- Low-dimensional embeddings cannot simply memorize all user-item interaction data.
- Embeddings are forced to capture similarity between users/items to fit the data.
- This allows the models to make effective predictions on unseen user-item interactions.
Neural Graph Collaborative Filtering
Conventional Collaborative Filtering
- The conventional collaborative filtering model is based on shallow encoders:
- No user/item features are used.
- Use shallow encoders for users and items:
- For every $u \in U$ and $v \in V$, we prepare shallow learnable embeddings $\mathbf{u}, \mathbf{v} \in \mathbb{R}^D$.
- The score function for user $u$ and item $v$ is the inner product $f_\theta(\mathbf{u}, \mathbf{v}) = \mathbf{u}^\top \mathbf{v}$.
Limitations of Shallow Encoders
- The model itself does not explicitly capture graph structure.
- The graph structure is only implicitly captured in the training objective.
- Only the first-order graph structure (i.e., edges) is captured in the training objective.
- High-order graph structure (e.g., $K$-hop paths between two nodes) is not explicitly captured.
NGCF: Overview
- Neural Graph Collaborative Filtering (NGCF) explicitly incorporates high-order graph structure when generating user/item embeddings.
- Key idea: Use a GNN to generate graph-aware user/item embeddings.

NGCF Framework
- Given: User-item bipartite graph.
- NGCF framework:
- Prepare shallow learnable embedding for each node.
- Use multi-layer GNNs to propagate embeddings along the bipartite graph.
- High-order graph structure is captured.
- Final embeddings are explicitly graph-aware!

- Two kinds of learnable parameters are jointly learned:
- Shallow user/item embeddings
- GNN’s parameters
Initial Node Embeddings

- Set the shallow learnable embeddings as the initial node features:
- For every user $u$, set the initial embedding $\mathbf{h}_u^{(0)}$ to the user's shallow embedding.
- For every item $v$, set the initial embedding $\mathbf{h}_v^{(0)}$ to the item's shallow embedding.
Neighbor Aggregation
- Iteratively update node embeddings using neighboring embeddings:

$$\mathbf{h}_v^{(k+1)} = \text{COMBINE}\Big(\mathbf{h}_v^{(k)},\ \text{AGGR}\big(\{\mathbf{h}_u^{(k)}\}_{u \in N(v)}\big)\Big)$$

High-order graph structure is captured through iterative neighbor aggregation.
Different architecture choices are possible for AGGR and COMBINE (one possible choice is sketched below):
- AGGR($\cdot$) can be $\text{MEAN}(\cdot)$
- COMBINE($\cdot$) can be $\text{ReLU}(\text{Linear}(\text{Concat}(\cdot)))$
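A minimal PyTorch sketch of one NGCF-style propagation layer using the MEAN/Concat choices above; this is a simplified illustration, not the exact NGCF update rule:

```python
import torch
import torch.nn as nn

class SimpleNGCFLayer(nn.Module):
    """One simplified NGCF-style layer: AGGR = mean over neighbors,
    COMBINE = ReLU(Linear(concat(self, aggregated)))."""
    def __init__(self, dim: int):
        super().__init__()
        self.combine = nn.Linear(2 * dim, dim)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h:   [num_nodes, dim] current embeddings (users and items stacked)
        # adj: [num_nodes, num_nodes] binary adjacency of the bipartite graph
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        aggregated = (adj @ h) / deg                      # mean of neighbor embeddings
        return torch.relu(self.combine(torch.cat([h, aggregated], dim=-1)))

# Toy graph: 2 users (nodes 0-1) + 3 items (nodes 2-4), shallow embeddings as input.
num_nodes, dim = 5, 8
h0 = nn.Embedding(num_nodes, dim).weight                   # shallow learnable embeddings
adj = torch.zeros(num_nodes, num_nodes)
for u, i in [(0, 2), (0, 3), (1, 3), (1, 4)]:              # user node u -- item node i
    adj[u, i] = adj[i, u] = 1.0
layer = SimpleNGCFLayer(dim)
h1 = layer(h0, adj)                                        # embeddings after 1 hop
h2 = layer(h1, adj)                                        # 2 hops: high-order structure
print(h2.shape)                                            # torch.Size([5, 8])
```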
Final Embeddings and Score Function
- After $K$ rounds of neighbor aggregation, we get the final user/item embeddings $\mathbf{h}_u^{(K)}$ and $\mathbf{h}_v^{(K)}$.
- For all $u \in U$ and $v \in V$, we set $\mathbf{u} = \mathbf{h}_u^{(K)}$ and $\mathbf{v} = \mathbf{h}_v^{(K)}$.
- The score function is the inner product $f_\theta(\mathbf{u}, \mathbf{v}) = \mathbf{u}^\top \mathbf{v}$.
- Recall: NGCF jointly learns two kinds of parameters:
- Shallow user/item embeddings
- GNN's parameters
LightGCN
- Observation: Shallow learnable embeddings are already quite expressive.
- A $D$-dimensional embedding is learned for every node.
- Most of the parameter count is in the shallow embeddings.
- The GNN parameters may not be so essential for performance.
- Overview of the idea:
- Adjacency matrix for a bipartite graph
- Matrix formulation of GCN
- Simplification of GCN by removing non-linearity
- Related: SGC for scalable GNN.
Adjacency and Embedding Matrices
- Adjacency matrix of the (undirected) bipartite graph: $A = \begin{pmatrix} 0 & R \\ R^\top & 0 \end{pmatrix}$, where $R \in \{0, 1\}^{|U| \times |V|}$ is the user-item interaction matrix.
- Shallow embedding matrix $E \in \mathbb{R}^{(|U| + |V|) \times D}$, stacking all user and item embeddings.
Matrix Formulation of GCN
- Define the diffusion matrix:
- Let $D$ be the (diagonal) degree matrix of $A$.
- Define the normalized adjacency matrix as $\tilde{A} = D^{-1/2} A D^{-1/2}$.
- Let $E^{(k)}$ be the embedding matrix at the $k$-th layer, with $E^{(0)} = E$.
- Each layer of GCN's aggregation can be written in matrix form:

$$E^{(k+1)} = \text{ReLU}\big(\tilde{A} E^{(k)} W^{(k)}\big)$$

- $\tilde{A} E^{(k)}$: neighbor aggregation
- $W^{(k)}$: learnable linear transformation

Note: Unlike the original GCN, the self-connection is omitted here.
Simplifying GCN
- Simplify GCN by removing the ReLU non-linearity.
- The final node embedding matrix is then given as:

$$E^{(K)} = \tilde{A} \cdots \tilde{A}\, E\, W^{(0)} \cdots W^{(K-1)} = \tilde{A}^K E\, W, \quad \text{where } W \equiv W^{(0)} \cdots W^{(K-1)}$$

- Removing ReLU significantly simplifies GCN!
- $\tilde{A}^K E$: diffuses node embeddings along the graph.
- Algorithm: apply $E \leftarrow \tilde{A} E$ for $K$ times.
- Each multiplication diffuses the current embeddings to their one-hop neighbors.
- Note: $\tilde{A}^K$ is dense and never gets materialized. Instead, the iterative sparse multiplication above is used to compute $\tilde{A}^K E$ (see the sketch below).
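A minimal SciPy sketch of this iterative diffusion on a small random bipartite graph (sizes are illustrative); $\tilde{A}^K$ is never formed explicitly:

```python
import numpy as np
import scipy.sparse as sp

# Toy bipartite graph: 3 users, 4 items (illustrative sizes).
num_users, num_items, dim, K = 3, 4, 8, 3
R = sp.random(num_users, num_items, density=0.5, format="csr")
R.data[:] = 1.0                                     # binary interactions

# A = [[0, R], [R^T, 0]], normalized as A_tilde = D^{-1/2} A D^{-1/2}
A = sp.bmat([[None, R], [R.T, None]], format="csr")
deg = np.asarray(A.sum(axis=1)).flatten()
d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(deg, 1)))
A_tilde = d_inv_sqrt @ A @ d_inv_sqrt

# Compute A_tilde^K @ E without materializing the dense A_tilde^K:
E = np.random.randn(num_users + num_items, dim)     # shallow embedding matrix
diffused = E
for _ in range(K):                                  # K sparse multiplications
    diffused = A_tilde @ diffused
print(diffused.shape)                               # (7, 8)
```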
Multi-Scale Diffusion
- We can consider multi-scale diffusion:

$$E_{\text{final}} = \alpha_0 E^{(0)} + \alpha_1 E^{(1)} + \cdots + \alpha_K E^{(K)}, \quad \text{where } E^{(k)} = \tilde{A}^k E$$

- The above includes embeddings diffused at multiple hop scales.
- $\alpha_0 E^{(0)}$ acts as a self-connection (which is omitted in the definition of $\tilde{A}$).
- The coefficients $\alpha_0, \ldots, \alpha_K$ are hyperparameters.
- For simplicity, LightGCN uses uniform coefficients, i.e., $\alpha_k = 1/(K+1)$ for $k = 0, \ldots, K$.
LightGCN: Model Overview
Given:
- Adjacency matrix $A$ (normalized to $\tilde{A}$)
- Initial learnable embedding matrix $E$

Iteratively diffuse the embedding matrix using $\tilde{A}$: for $k = 0, \ldots, K-1$,

$$E^{(k+1)} = \tilde{A} E^{(k)}, \quad E^{(0)} = E$$

Average the embedding matrices at the different scales:

$$E_{\text{final}} = \frac{1}{K+1} \sum_{k=0}^{K} E^{(k)}$$

Score function:

$$f_\theta(\mathbf{u}, \mathbf{v}) = \mathbf{u}^\top \mathbf{v}$$

- Use the user/item vectors $\mathbf{u}, \mathbf{v}$ from $E_{\text{final}}$ to score a user-item interaction (a full sketch follows below).
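A minimal PyTorch sketch of the LightGCN pipeline above, using a dense toy adjacency and illustrative sizes (a real system would use sparse operations):

```python
import torch
import torch.nn as nn

class LightGCN(nn.Module):
    """Minimal LightGCN sketch: diffuse shallow embeddings K times with the
    normalized adjacency and average the per-layer embedding matrices."""
    def __init__(self, num_users: int, num_items: int, dim: int = 64, K: int = 3):
        super().__init__()
        self.num_users, self.K = num_users, K
        self.emb = nn.Embedding(num_users + num_items, dim)   # E^(0), shallow embeddings

    def propagate(self, A_tilde: torch.Tensor) -> torch.Tensor:
        e = self.emb.weight
        layers = [e]
        for _ in range(self.K):
            e = A_tilde @ e                                   # E^(k+1) = A_tilde E^(k)
            layers.append(e)
        return torch.stack(layers).mean(dim=0)                # uniform multi-scale average

    def score(self, A_tilde, users, items):
        e_final = self.propagate(A_tilde)
        u = e_final[users]                                    # user rows of E_final
        v = e_final[self.num_users + items]                   # item rows of E_final
        return (u * v).sum(dim=-1)                            # inner-product score

# Toy bipartite graph: 2 users, 3 items.
num_users, num_items = 2, 3
A = torch.zeros(num_users + num_items, num_users + num_items)
for u, i in [(0, 0), (0, 1), (1, 1), (1, 2)]:                 # user u interacted with item i
    A[u, num_users + i] = A[num_users + i, u] = 1.0
deg = A.sum(dim=1).clamp(min=1)
A_tilde = A / torch.sqrt(deg.unsqueeze(0) * deg.unsqueeze(1)) # D^{-1/2} A D^{-1/2}
model = LightGCN(num_users, num_items, dim=8, K=2)
print(model.score(A_tilde, torch.tensor([0, 1]), torch.tensor([2, 0])))
```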
LightGCN: Intuition
- Question: Why does the simple diffusion propagation work well?
- Answer: The diffusion directly encourages the embeddings of similar users / items to be similar.
- Similar users share many common neighbors (items) and are expected to have similar future preferences (interact with similar items).
LightGCN and GCN
- The embedding propagation of LightGCN is closely related to GCN
- Recall: GCN (neighbor aggregation part):

$$\mathbf{h}_v^{(k+1)} = \sum_{u \in N(v) \cup \{v\}} \frac{1}{\sqrt{|N(u)|}\sqrt{|N(v)|}}\, \mathbf{h}_u^{(k)}$$

- The self-loop is included in the neighborhood definition.
- LightGCN uses the same equation, except that:
- The self-loop is not included in the neighborhood definition.
- The final embedding is the average of the embeddings from all layers: $\mathbf{h}_v = \frac{1}{K+1} \sum_{k=0}^{K} \mathbf{h}_v^{(k)}$.
LightGCN and MF: Comparison
- Both LightGCN and shallow encoders learn a unique embedding for each user/item.
- The difference is that LightGCN uses the diffused user/item embeddings for scoring.
- LightGCN performs better than shallow encoders but is also more computationally expensive due to the additional diffusion step.
- The final embedding of a user/item is obtained by aggregating embeddings of its multi-hop neighboring nodes.
Motivation
Pin-to-pin (P2P) recommendation

PinSAGE: Pin Embedding
- Unifies visual, textual, and graph information.
- The largest industry deployment of Graph Convolutional Networks.
- Huge adoption across Pinterest.
- Works for fresh content: embeddings are available within a few seconds of pin creation.
Application: Pinterest
PinSage graph convolutional network:
- Goal: Generate embeddings for nodes in a large-scale Pinterest graph containing billions of objects.
- Key idea: Borrow information from nearby nodes.
- E.g., a bed-rail Pin might look like a garden fence, but gates and beds are rarely adjacent in the graph.
- Pin embeddings are essential to various tasks such as recommendation of Pins, classification, and ranking.
- They power services like "Related Pins", "Search", "Shopping", and "Ads".
Harnessing Pins and Boards

PinSAGE: Graph Neural Network

- Graph has tens of billions of nodes and edges
- Further refines embeddings by borrowing information across the Pinterest graph.
PinSAGE: Methods for Scaling Up
In addition to the GNN model, PinSAGE introduces several methods to scale the GNN to a billion-scale recommender system (e.g., Pinterest):
- Shared negative samples across users in a mini-batch
- Hard negative samples
- Curriculum learning
- Mini-batch training of GNNs on a large graph
PinSAGE Model

Training Data
- 1B+ repin pairs:
- From the Related Pins surface
- Capture semantic relatedness
- Goal: Embed such pairs to be "neighbors"
- Example positive training pairs (Q, X):

Shared Negative Samples
- Recall: In the BPR loss, for each user in a mini-batch, we sample one positive item and a set of negative items.
- Using more negative samples per user improves the recommendation performance, but it is also expensive:
- We need to generate embeddings for all the sampled negative nodes.
- Each requires running a GNN computational graph, which is expensive.
- Key idea: We can share the same set of negative samples across all users in the mini-batch (sketched below).
- This way, we only need to generate embeddings for a single shared set of negative nodes per mini-batch.
- This saves the node-embedding generation computation roughly by a factor of the mini-batch size.
- Empirically, the performance stays similar to the non-shared negative sampling scheme.
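A minimal PyTorch sketch of a BPR-style loss with one shared negative set per mini-batch (shapes and names are illustrative, not PinSAGE's actual implementation):

```python
import torch
import torch.nn.functional as F

def bpr_loss_shared_negatives(user_emb: torch.Tensor,
                              pos_item_emb: torch.Tensor,
                              shared_neg_emb: torch.Tensor) -> torch.Tensor:
    """BPR-style loss where one set of negatives is shared by the mini-batch.

    user_emb:       [B, D]  one embedding per user in the mini-batch
    pos_item_emb:   [B, D]  one positive item per user
    shared_neg_emb: [N, D]  a single set of N negatives, shared by all users
    """
    pos_scores = (user_emb * pos_item_emb).sum(dim=-1, keepdim=True)   # [B, 1]
    neg_scores = user_emb @ shared_neg_emb.T                           # [B, N]
    return -F.logsigmoid(pos_scores - neg_scores).mean()

# Only N negative embeddings are computed per mini-batch, instead of B * N.
B, N, D = 4, 16, 32
loss = bpr_loss_shared_negatives(torch.randn(B, D), torch.randn(B, D), torch.randn(N, D))
print(loss)
```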
PinSAGE: Curriculum Learning
- Idea: Use harder and harder negative samples.
- Include more and more hard negative samples in each epoch.

- Key insight: It is effective to make the negative samples gradually harder over the course of training.
- At the $n$-th epoch, we add $n - 1$ hard negative items.
- The number of hard negatives gradually increases over the course of training.
- The model gradually learns to make finer-grained predictions.
Hard Negatives 1
- Challenge: Industrial recsys needs to make extremely fine-grained predictions.
- #Total items: Up to billions
- #Items to recommend for each user: 10 to 100.
- Issue: The shared negative items are randomly sampled from all items
- Most of them are “easy negatives”, i.e., a model does not need to be fine-grained to distinguish them from positive items.
- We need a way to sample "hard negatives" to force the model to be fine-grained! (See Curriculum Learning above.)
Hard Negatives 2
- For each user node, the hard negatives are item nodes that are close (but not connected) to the user node in the graph.
- Hard negatives for user $u$ are obtained as follows (sketched in code below):
- Run random walks with restart from $u$ and obtain visit counts for the other item nodes.
- Sort items in descending order of their visit counts.
- Randomly sample items that are ranked high but not too high.
- These are item nodes that are close, but not too close (i.e., not directly connected), to the user node.
- The hard negatives for each user are used in addition to the shared negatives.
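A minimal Python sketch of this procedure on a toy bipartite graph; the walk parameters, the rank band, and the helper name are illustrative assumptions, not PinSAGE's actual values:

```python
import random
from collections import Counter

def hard_negatives(adj, user, items, num_walks=200, walk_len=5,
                   restart_p=0.5, band=(10, 50), k=5):
    """Sample hard negatives for one user via random walks with restart.

    adj:   adjacency lists of the bipartite graph (users <-> items).
    items: the set of item node ids.
    band:  keep items whose visit-count rank falls in this range, i.e. items
           that are close to the user but not its direct neighbors.
    """
    counts = Counter()
    for _ in range(num_walks):
        node = user
        for _ in range(walk_len):
            if random.random() < restart_p:
                node = user                          # restart at the query user
            node = random.choice(adj[node])          # step to a random neighbor
            counts[node] += 1
    # Items sorted by visit count, excluding items directly connected to the user.
    ranked = [n for n, _ in counts.most_common()
              if n in items and n not in adj[user]]
    lo, hi = band
    candidates = ranked[lo:hi]
    return random.sample(candidates, min(k, len(candidates)))

# Toy bipartite graph: users u*, items i*.
adj = {
    "u1": ["i1", "i2"], "u2": ["i2", "i3"], "u3": ["i3", "i4"],
    "i1": ["u1"], "i2": ["u1", "u2"], "i3": ["u2", "u3"], "i4": ["u3"],
}
items = {"i1", "i2", "i3", "i4"}
print(hard_negatives(adj, "u1", items, band=(0, 3), k=2))
```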
PinSAGE: Negative Sampling
Positive pairs are given, but there are various methods to sample the negatives:
- Distance-weighted sampling
- Sample negatives so that the query-negative distance distribution is approximately uniform (a sketch follows below).
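A minimal NumPy sketch of the idea, assuming a histogram-based estimate of the distance density (all names, shapes, and parameters are illustrative):

```python
import numpy as np

def distance_weighted_negatives(query: np.ndarray, candidates: np.ndarray,
                                num_neg: int = 5, num_bins: int = 20) -> np.ndarray:
    """Illustrative distance-weighted negative sampling.

    Candidates are sampled with probability inversely proportional to the
    empirical density of their distance to the query, so that the sampled
    query-negative distances are spread roughly uniformly instead of being
    dominated by the most common distances.
    """
    dists = np.linalg.norm(candidates - query, axis=1)
    hist, edges = np.histogram(dists, bins=num_bins, density=True)
    bin_idx = np.clip(np.digitize(dists, edges) - 1, 0, num_bins - 1)
    weights = 1.0 / np.maximum(hist[bin_idx], 1e-8)       # inverse density
    probs = weights / weights.sum()
    return np.random.choice(len(candidates), size=num_neg, replace=False, p=probs)

# Toy usage with random embeddings.
rng = np.random.default_rng(0)
query, candidates = rng.normal(size=16), rng.normal(size=(100, 16))
print(distance_weighted_negatives(query, candidates))
```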


