CS224W: Machine Learning with Graphs - Link Prediction and Causality

The 3 rungs of the ladder of causation

Traditional graph machine learning tasks
- Assume $X \not\perp\!\!\!\perp Y$ (that is, $X$ and $Y$ are statistically dependent)
- Task: Predict output $Y$ from input $X$
- Data: samples of $(X,Y)$

Background: Inverse Transform Sampling

Data generation algorithm:
- Let $U_X \sim \text{Uniform}(0,1)$ be a random uniform value in the interval $[0,1]$
- Then $X := F^{-1}_X(U_X)$, where $F_X$ is the cumulative distribution function (CDF) of $X$
- Exponential distribution example: $P(X \leq x) = F_X(x) = 1-e^{-\lambda x}$ with inverse $x = F^{-1}_X (U_X) = -\frac{1}{\lambda} \ln (1-U_X)$
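
The exponential example above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the rate `lam = 2.0`, the seed, and the sample count are arbitrary choices made for the sketch, not values from the lecture.

```python
import math
import random


def sample_exponential(lam, rng):
    """Draw one Exponential(lam) sample by inverse transform sampling."""
    u = rng.random()                   # U_X ~ Uniform[0, 1)
    return -math.log(1.0 - u) / lam    # X := F_X^{-1}(U_X)


rng = random.Random(0)   # fixed seed, chosen only for reproducibility
lam = 2.0                # hypothetical rate parameter
samples = [sample_exponential(lam, rng) for _ in range(100_000)]

# The empirical mean should be close to the theoretical mean 1 / lam.
empirical_mean = sum(samples) / len(samples)
```

Comparing `empirical_mean` with `1 / lam` is a quick sanity check that the transformed uniform draws follow the intended distribution.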

Imagine two hypothetical data generators for

The intervention $\text{do}(X=x)$ replaces $f_X$ with the constant $x$ in the data generation process
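
A small simulation may make the effect of an intervention concrete. The notes do not specify the two generators, so the pair below is purely hypothetical: in generator A, $X$ causes $Y$; in generator B, $Y$ causes $X$. The structural equations, coefficients, and the `do_x` argument are assumptions made for this sketch.

```python
import random


def generator_x_causes_y(rng, do_x=None):
    """Hypothetical generator A: X := f_X(U_X), Y := f_Y(X, U_Y)."""
    u_x, u_y = rng.random(), rng.random()
    x = u_x if do_x is None else do_x    # do(X=x): f_X becomes the constant x
    y = 2.0 * x + 0.1 * u_y
    return x, y


def generator_y_causes_x(rng, do_x=None):
    """Hypothetical generator B: Y := f_Y(U_Y), X := f_X(Y, U_X)."""
    u_x, u_y = rng.random(), rng.random()
    y = 2.0 * u_y
    x = (0.5 * y + 0.1 * u_x) if do_x is None else do_x   # do(X=x) cuts Y -> X
    return x, y


def mean_y(generator, do_x, n=50_000, seed=0):
    """Average Y under do(X=do_x), reusing the same noise draws per seed."""
    rng = random.Random(seed)
    return sum(generator(rng, do_x)[1] for _ in range(n)) / n


mean_a_low = mean_y(generator_x_causes_y, 0.0)
mean_a_high = mean_y(generator_x_causes_y, 1.0)
mean_b_low = mean_y(generator_y_causes_x, 0.0)
mean_b_high = mean_y(generator_y_causes_x, 1.0)
```

Under these assumptions, intervening on $X$ shifts the mean of $Y$ in generator A but leaves it unchanged in generator B, even though $X$ and $Y$ are statistically dependent in both when no intervention is applied.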

Now assume we know $X=x'$ and $Y=y'$. This knowledge changes the distribution of $U_X$ and $U_Y$
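
One way to see how an observation changes the noise distribution is the abduction, action, prediction recipe for counterfactuals. The sketch below assumes a hypothetical additive-noise equation $Y := 2X + U_Y$, which is not from the notes. In such an invertible model, observing $(x', y')$ collapses the distribution of $U_Y$ to the single value $y' - 2x'$.

```python
def f_y(x, u_y):
    """Hypothetical structural equation Y := f_Y(X, U_Y) = 2X + U_Y."""
    return 2.0 * x + u_y


def counterfactual_y(x_obs, y_obs, x_new):
    """Value Y would have taken had X been x_new, given the observation."""
    # Abduction: the observation (x_obs, y_obs) pins down the noise U_Y.
    u_y = y_obs - 2.0 * x_obs
    # Action and prediction: set X to x_new and reuse the inferred noise.
    return f_y(x_new, u_y)


# Observed X = 1.0, Y = 2.5; ask what Y would have been under do(X = 3.0).
y_cf = counterfactual_y(x_obs=1.0, y_obs=2.5, x_new=3.0)
```

For non-invertible models the abduction step would yield a posterior distribution over the noise rather than a single value, and the prediction step would average over it.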

The above data generation can be described by an execution graph, called the causal Directed Acyclic Graph (DAG):

Link prediction for decision-making interventions (e.g., search & recommendations) tends to be causal

$P(\text{Accept}(i,j) = \text{yes} \mid \text{do}(\text{show recommendation} = j \text{ to user} = i))$

Zillow House Offer Example

Biomedical Experiment Causal Graph

(Out-of-distribution Graph Tasks)

Consider an out-of-distribution graph classification task

Test data: Predict $Y^{te}$ given $G^{te}$, under $P(Y^{tr} \mid G^{tr}) = P(Y^{te} \mid G^{te})$ and $\text{supp}(G^{tr}) \neq \text{supp}(G^{te})$

Differences between In-distribution and Out-of-distribution tasks

Out-of-distribution tasks are a mix of associational and counterfactual tasks

Data: $(X^{tr},Y^{tr})$

Task: Predict $Y^{te}$ given $X^{te}$ under $P(Y^{tr} \mid X^{tr}) = P(Y^{te} \mid X^{te})$

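But the learning is counterfactual:
- The training data contains no examples of the test graphs $G^{te}$
- The classifier must build a correct predictor for unseen graph sizes, that is, for the label the graph would have had at size $n$ given what was observed in training:
$Y(N=n) \mid N=n^{tr}, G=g^{tr}, Y=y^{tr}$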

Why is Graph OOD Learning a Counterfactual Task?

Example:

Graph formation process (Graphon):
- Graph label $Y$ is a function of the graph model $W$ & some random noise
- Graph size $N^{tr}$ ($N^{te}$) is a function of the “environment” $E^{tr}$ ($E^{te}$) only
- Train (test) graphs are generated by $W$ and $E^{tr}$ ($E^{te}$) with the same random noise
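
The formation process above can be sketched as follows. This is a minimal illustration, not the lecture's construction: the graphon `w(a, b) = a * b`, the environment names, and the graph sizes are all hypothetical. The point is that the environment only controls the size $N$, while size-independent properties such as edge density are governed by $W$.

```python
import random


def sample_graph(w, n, rng):
    """Sample an n-node graph from graphon w.

    Each node i gets a latent position u_i ~ Uniform[0, 1); edge (i, j)
    is included independently with probability w(u_i, u_j).
    """
    u = [rng.random() for _ in range(n)]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < w(u[i], u[j])]


def density(edges, n):
    """Fraction of the n * (n - 1) / 2 possible edges that are present."""
    return len(edges) / (n * (n - 1) / 2)


def w(a, b):
    """Hypothetical graphon; the label Y would depend on w, not on N."""
    return a * b


n_by_env = {"train": 100, "test": 400}   # hypothetical environment -> graph size
rng = random.Random(0)
graphs = {env: sample_graph(w, n, rng) for env, n in n_by_env.items()}
densities = {env: density(graphs[env], n) for env, n in n_by_env.items()}
```

Under this assumed graphon the expected edge density is $E[u_i u_j] = 1/4$ in both environments, so a predictor that relies on density could transfer across sizes, whereas one that relies on raw node or edge counts would not.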