We want our main desiderata, permutation invariance and permutation equivariance, to still hold
Node Classification
Figure Credit: Jure Leskovec
Link Prediction
Figure Credit: Jure Leskovec
Link Prediction: Social Networks
Figure Credit: Michael Bronstein
Link Prediction: Social Networks
Figure Credit: Chaitanya Joshi
Graph Classification
Duvenaud et al., "Convolutional Networks on Graphs for Learning Molecular Fingerprints", NeurIPS 2015
How to construct GNN layers?
GNN Layers
We construct permutation-equivariant functions $f(\mathbf{X}, \mathbf{A})$ over graphs by shared application of a local permutation-invariant $g(\mathbf{x}_i, \mathbf{X}_{\mathcal{N}_i})$.
$f$: referred to as GNN layer
$g$: referred to as "diffusion", "propagation", "message passing"
How can we concretely define $g$:
many designs have been proposed
most proposed solutions fall into three spatial "flavors"
Messages computed as $\mathbf{m}_{ij}=\psi(\mathbf{x}_i, \mathbf{x}_j)$
Interaction Networks (Battaglia et al., NeurIPS 2016)
MPNN (Gilmer et al., ICML 2017)
GraphNets (Battaglia et al., ICML 2018)
Most generic GNN layer:
May have scalability or learnability issues
Ideal for computational chemistry, reasoning and simulation
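A minimal NumPy sketch of this generic layer, assuming $\psi$ and $\phi$ are simple linear maps with a ReLU and sum aggregation (the function and weight names are illustrative; practical implementations use MLPs and vectorised scatter operations):

```python
import numpy as np

def message_passing_layer(X, A, W_msg, W_upd):
    """One generic message-passing layer (illustrative sketch).

    X: (n, d) node features, A: (n, n) adjacency matrix.
    psi(x_i, x_j) is taken to be a linear map of [x_i, x_j];
    the aggregator is a sum over neighbours; phi is a linear
    map of [x_i, aggregated messages] followed by a ReLU.
    """
    n, d = X.shape
    H = np.zeros((n, W_upd.shape[1]))
    for i in range(n):
        neighbours = np.nonzero(A[i])[0]
        # messages m_ij = psi(x_i, x_j) for each neighbour j
        msgs = [np.concatenate([X[i], X[j]]) @ W_msg for j in neighbours]
        agg = np.sum(msgs, axis=0) if len(msgs) else np.zeros(W_msg.shape[1])
        # update h_i = phi(x_i, agg)
        H[i] = np.maximum(np.concatenate([X[i], agg]) @ W_upd, 0.0)
    return H

# toy usage on a 4-node path graph
rng = np.random.default_rng(0)
A = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]])
X = rng.normal(size=(4, 3))
W_msg = rng.normal(size=(6, 8))   # psi: concat(3+3) -> 8
W_upd = rng.normal(size=(11, 8))  # phi: concat(3+8) -> 8
print(message_passing_layer(X, A, W_msg, W_upd).shape)  # (4, 8)
```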
Node Embeddings
Learning Node Embedding
Early "successes" of deep learning on graphs was in embedding nodes into vectors $\mathbf{h}_u$ through an encoder.
At the time, was implemeted as look-up table
What makes a node representation good?
Graphs encode interesting structure.
node representations must preserve graph structure
Features of nodes $i$ and $j$ should be predictive of existence of edge $(i, j)$!
Can be implemented through an unsupervised loss.
$\mathbf{h}_i$ and $\mathbf{h}_j$ must be close iff $(i,j)\in\mathcal{E}$
can use standard cross-entropy loss
$$\sum_{(i,j)\in\mathcal{E}}\log \sigma\left(\mathbf{h}_i^T\mathbf{h}_j\right) + \sum_{(i,j)\notin\mathcal{E}}\log\left(1-\sigma(\mathbf{h}_i^T\mathbf{h}_j)\right)$$
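As an illustrative sketch, maximising the objective above is equivalent to minimising its negation; here the negative pairs are assumed to be sampled elsewhere:

```python
import numpy as np

def link_prediction_loss(H, edges, neg_edges):
    """Negative log-likelihood version of the objective above.

    H: (n, d) node embeddings; edges / neg_edges: lists of (i, j) pairs.
    Maximising the stated objective = minimising this loss.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    pos = sum(np.log(sigmoid(H[i] @ H[j])) for i, j in edges)
    neg = sum(np.log(1.0 - sigmoid(H[i] @ H[j])) for i, j in neg_edges)
    return -(pos + neg)

# toy usage: 5 random embeddings, one positive and one negative pair
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))
print(link_prediction_loss(H, edges=[(0, 1)], neg_edges=[(0, 3)]))
```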
Random-Walks on Graphs
Link prediction objective is a special case of random-walk objectives
redefine the condition from $(i,j)\in\mathcal{E}$ to $i$ and $j$ co-occur in a (short) random-walk
Common unsupervised graph representation learning prior to GNNs
DeepWalk (Perozzi et al., KDD'14)
node2vec (Grover & Leskovec, KDD'16)
LINE (Tang et al., WWW'15)
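A DeepWalk-flavoured sketch of collecting co-occurring pairs (assumed details: uniform walks and a fixed co-occurrence window; node2vec instead biases the walks, LINE works with first/second-order proximities directly):

```python
import numpy as np

def random_walk_pairs(A, walk_length=10, walks_per_node=5, window=2, seed=0):
    """DeepWalk-style positive pairs: nodes co-occurring within a window
    of a uniform random walk (a simplified sketch)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    pairs = []
    for start in range(n):
        for _ in range(walks_per_node):
            walk, v = [start], start
            for _ in range(walk_length - 1):
                nbrs = np.nonzero(A[v])[0]
                if len(nbrs) == 0:
                    break
                v = rng.choice(nbrs)
                walk.append(v)
            for t, u in enumerate(walk):       # co-occurrence within the window
                for w in walk[t + 1 : t + 1 + window]:
                    if u != w:
                        pairs.append((u, w))
    return pairs

A = np.array([[0,1,1,0],[1,0,1,0],[1,1,0,1],[0,0,1,0]])
print(len(random_walk_pairs(A)))
```

These pairs replace $(i,j)\in\mathcal{E}$ in the link-prediction objective above.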
Local Objectives vs Conv-GNNs
Random walk objectives inherently capture local similarities.
A (convolutional) GNN also summarises local information of the graph!
Neighbouring nodes tend to have highly overlapping n-step neighbourhoods
Conv-GNN enforces similar features for neighbouring nodes by design.
DeepWalk-style models learn representations similar to those of a convolutional GNN.
Corollary 1: Random-walk objectives can fail to provide useful signal to GNNs.
Corollary 2: At times, DeepWalk can be matched by an untrained Conv-GNN.
Deep Graph Infomax (Veličković et al., ICLR 2019)
Transformers vs GNNs
Transformers
Transformers are a special case of GNNs where the graph is fully connected.
The neighborhood $\mathcal{N}_i$ is the whole graph.
What does it mean to have a graph fully connected?
It becomes less useful to talk about graphs as each data point is connected to all other points. There is no particular graph structure that can be used.
It would be better to talk about sets rather than graphs in this case.
Transformers are Set Neural Networks.
They are today the best technique to analyze sets/bags of features.
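A sketch of single-head self-attention written as aggregation over the whole set, i.e. message passing where $\mathcal{N}_i$ is everything (the projection matrices here are random placeholders):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention: every element attends to every other,
    i.e. message passing where N_i is the whole set (sketch)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])          # (n, n) pairwise scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over the "neighbourhood"
    return weights @ V                              # weighted aggregation

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                         # a set of 6 feature vectors
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)       # (6, 4)
```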
Consider convolution over a sequence represented as a cyclic grid graph:
It can be computed as a matrix multiplication with a circulant matrix, i.e., $\mathbf{H}=\mathbf{CX}$
$$f(\mathbf{X})=\begin{bmatrix}b & c & & & a \\a & b & c & & \\ & \ddots & \ddots & \ddots & \\ & & a & b & c \\ c & & & a & b \end{bmatrix}\begin{bmatrix} - & \mathbf{x}_0 & - \\ - & \mathbf{x}_1 & - \\ - & \vdots & - \\ - & \mathbf{x}_{n-2} & - \\ - & \mathbf{x}_{n-1} & - \end{bmatrix}$$
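A quick NumPy check of this view, with an illustrative kernel $(a, b, c)$:

```python
import numpy as np

# Cyclic convolution with kernel (a, b, c) written as a matrix product C x.
a, b, c = 0.25, 0.5, 0.25
n = 6
col = np.zeros(n); col[0], col[1], col[-1] = b, a, c      # first column of C
C = np.column_stack([np.roll(col, j) for j in range(n)])  # circulant matrix
x = np.arange(n, dtype=float)
y = np.array([a * x[i - 1] + b * x[i] + c * x[(i + 1) % n] for i in range(n)])
print(np.allclose(C @ x, y))                              # True: C x is the cyclic convolution
```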
Circulant Matrices: Properties and Eigenvectors
Circulant matrices commute
$\mathbf{C}(\mathbf{v})\mathbf{C}(\mathbf{w})=\mathbf{C}(\mathbf{w})\mathbf{C}(\mathbf{v})$ for any vectors $\mathbf{v}$ and $\mathbf{w}$
Matrices that commute are jointly diagonalisable.
eigenvectors are shared across all such matrices
Conveniently, the eigenvectors of circulants are the discrete Fourier basis:
$$\phi_{l} = \frac{1}{\sqrt{n}}\left(1,e^{\frac{2\pi il}{n}},e^{\frac{4\pi il}{n}},\dots,e^{\frac{2\pi i(n-1)l}{n}}\right)^T \quad l=0,1,2,\dots,n-1$$
This can be easily computed by studying $\mathbf{C}([0, 1, 0, 0, 0, \dots])$, which is the shift matrix.
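An illustrative numerical check that the DFT basis above diagonalises a circulant matrix:

```python
import numpy as np

n = 6
k = np.arange(n)
# DFT basis: Phi[:, l] = (1/sqrt(n)) * exp(2*pi*i*k*l/n), matching the formula above
Phi = np.exp(2j * np.pi * np.outer(k, k) / n) / np.sqrt(n)

col = np.zeros(n); col[0], col[1], col[-1] = 0.5, 0.25, 0.25
C = np.column_stack([np.roll(col, j) for j in range(n)])    # circulant matrix

D = Phi.conj().T @ C @ Phi          # should be (numerically) diagonal
off_diag = D - np.diag(np.diag(D))
print(np.allclose(off_diag, 0))     # True: the DFT basis diagonalises the circulant
```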
DFT and Convolution
We can eigendecompose any circulant matrix as $\mathbf{C}(\mathbf{w})=\Phi\Lambda\Phi^*$
On a graph, taking $\Phi$ to be the eigenvectors of the graph Laplacian $\mathbf{L}$, changing the eigenvalues in $\Lambda$ expresses any operation that commutes with $\mathbf{L}$.
Commonly referred to as the "graph Fourier transform" (Bruna et al., ICLR'14)
To convolve with a feature matrix $\mathbf{X}$, we do the following (learn the $\hat{\lambda}_i$):
$$f(\mathbf{X})=\Phi\begin{bmatrix}\hat{\lambda}_0 & & \\ & \ddots & \\ & & \hat{\lambda}_{n-1} \end{bmatrix}\Phi^*\mathbf{X}$$
Spectral GNNs
Directly learning eigenvalues is typically inappropriate:
Not localized, doesn't generalize to other graphs, computationally expensive, etc.
Common solution: make the learned eigenvalues $\hat{\mathbf{\Lambda}}$ a function of the eigenvalues of $\mathbf{L}$
Typically by a degree-$k$ polynomial function, $p_k$
Chebyshev polynomials (Defferrard et al., NeurIPS'16)
Cayley polynomials (Levie et al., Trans. Sig. Proc.'18)
Using a polynomial in $\mathbf{L}$ we can define a conv-GNN
With coefficients defined by $c_{ij}= \left(p_k(\mathbf{L})\right)_{ij}$
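A sketch of such a polynomial filter, assuming the unnormalised Laplacian and a plain monomial basis (Chebyshev or Cayley bases are the usual practical choices; the coefficients here stand in for learned ones):

```python
import numpy as np

def polynomial_filter(X, A, coeffs):
    """Spectral-style conv-GNN layer: f(X) = p_k(L) X, where the
    polynomial coefficients would normally be learned."""
    deg = A.sum(axis=1)
    L = np.diag(deg) - A                 # (unnormalised) graph Laplacian
    out, Lk = np.zeros_like(X), np.eye(A.shape[0])
    for c in coeffs:                     # p_k(L) = sum_k c_k L^k
        out = out + c * (Lk @ X)
        Lk = Lk @ L
    return out

A = np.array([[0,1,1,0],[1,0,1,0],[1,1,0,1],[0,0,1,0]], dtype=float)
X = np.eye(4)
print(polynomial_filter(X, A, coeffs=[0.5, -0.1, 0.02]))  # degree-2 filter
```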
Transformers with Positional Encoding for Graphs
Transformers are typically defined on sets.
Order information can be provided to transformers through positional embeddings
Sines/cosines sampled depending on position
$$PE_{pos, 2i} = \sin\left(\frac{pos}{10000^{2i/d_{emb}}}\right) \quad \text{and} \quad PE_{pos, 2i+1} = \cos\left(\frac{pos}{10000^{2i/d_{emb}}}\right)$$
Very similar to the DFT eigenvectors.
Positional embeddings can be interpreted as eigenvectors of the (cyclic) grid graph's Laplacian.
Same idea can be leveraged to define Transformers on Graphs
Just feed some eigenvectors of the graph Laplacian (columns of $\Phi$)
Graph Transformer from Dwivedi & Bresson (2021)
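A sketch of Laplacian-eigenvector positional encodings (the helper name and the use of the unnormalised Laplacian are assumptions; practical graph transformers also deal with the sign ambiguity of eigenvectors):

```python
import numpy as np

def laplacian_positional_encoding(A, k=2):
    """Use the k lowest-frequency non-trivial Laplacian eigenvectors
    (columns of Phi) as node positional encodings; each eigenvector's
    sign is arbitrary, so it is often randomly flipped during training."""
    deg = A.sum(axis=1)
    L = np.diag(deg) - A
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    return eigvecs[:, 1 : k + 1]             # skip the constant eigenvector

A = np.array([[0,1,0,0,1],[1,0,1,0,0],[0,1,0,1,0],[0,0,1,0,1],[1,0,0,1,0]], dtype=float)
print(laplacian_positional_encoding(A, k=2))  # (5, 2) positional features per node
```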
Probabilistic Graphical Models
Probabilistic Modeling
Treat graphs as probabilistic graphical models (PGMs)
Nodes: Random variables
Edges: dependencies between distributions
Encode independence and conditional independence
allows for compact representation of multivariate probabilistic distributions
$$p(a,b,c,d,e) = p(a)p(b)p(c|a,b)p(d|e)p(e|c)$$
Markov Random Fields
A particular PGM of interest is the Markov Random Field (MRF)
allows decomposition of the joint distribution into a product of node and edge potentials
MRFs are modelled through inputs $\mathbf{X}$ and latents $\mathbf{H}$.
latents are akin to node representations
the input and latent of each node are related independently of the other nodes
latents are related according to the edges of the graph
Joint probability distribution:
$$p\left(\mathbf{X},\mathbf{H}\right)=\prod_{v\in\mathcal{V}}\phi\left(\mathbf{x}_v,\mathbf{h}_v\right)\prod_{(v_1,v_2)\in\mathcal{E}}\psi\left(\mathbf{h}_{v_1},\mathbf{h}_{v_2}\right)$$
$\phi(\cdot,\cdot)$ and $\psi(\cdot,\cdot)$ are real-valued potential functions
Mean-Field Inference
To embed nodes, we need to sample from the posterior, $p(\mathbf{H}|\mathbf{X})$
estimating the posterior is intractable, even if exact potential functions are known
One popular method of resolving this is mean-field variational inference
Assume posterior is approximated by a product of node-level densities
$$p(\mathbf{H}|\mathbf{X})\approx \prod_{v \in \mathcal{V}}q(\mathbf{h}_v)$$
$q(\cdot)$ is a density that is easy to compute and sample from, e.g., Gaussian
$q(\cdot)$ can be estimated by minimizing the KL divergence $KL\left(\prod_{v}q(\mathbf{h}_v)\,\|\,p(\mathbf{H}|\mathbf{X})\right)$ to the true posterior
Minimising the KL is intractable, but it admits a favourable approximate algorithm
Mean-Field Inference as GNN
We can iteratively update $q(\cdot)$ starting from an initial guess $q^{(0)}(\mathbf{h})$ (variational inference):
$$\log q^{(t+1)}(\mathbf{h}_i) = c_i + \log\phi(\mathbf{x}_i,\mathbf{h}_i) + \sum_{j\in\mathcal{N}_i}\int_{\mathbb{R}^k}q^{(t)}(\mathbf{h}_j)\log\psi(\mathbf{h}_i,\mathbf{h}_j)\,d\mathbf{h}_j$$
Aligns nicely with message-passing GNN.
$$\mathbf{h}_i = \phi\left(\mathbf{x}_i,\bigoplus_{j\in\mathcal{N}_i}\psi(\mathbf{x}_i, \mathbf{x}_j)\right)$$
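A simplified sketch of these updates assuming discrete latent states, so that the integral becomes a sum; the graph and potentials below are toy choices:

```python
import numpy as np

def mean_field(X, A, phi, psi, n_states, n_iters=20):
    """Mean-field updates for discrete latents: each node's q is refreshed
    from its neighbours' current q, mirroring message passing (sketch)."""
    n = A.shape[0]
    q = np.full((n, n_states), 1.0 / n_states)        # q^(0): uniform guess
    log_psi = np.log(psi)                             # (n_states, n_states)
    for _ in range(n_iters):
        for i in range(n):
            logits = np.log(phi(X[i]))                # log phi(x_i, h_i) per state
            for j in np.nonzero(A[i])[0]:
                logits = logits + q[j] @ log_psi.T    # E_{q_j}[log psi(h_i, h_j)]
            q[i] = np.exp(logits - logits.max())
            q[i] /= q[i].sum()                        # normalise (the c_i term)
    return q

# toy: binary latents, smoothness-favouring edge potential
A = np.array([[0,1,0],[1,0,1],[0,1,0]], dtype=float)
X = np.array([0.9, 0.2, 0.1])                         # scalar observations
phi = lambda x: np.array([1.0 - x, x])                # node potential per state
psi = np.array([[0.8, 0.2], [0.2, 0.8]])              # edge potential
print(mean_field(X, A, phi, psi, n_states=2))
```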
Graph Isomorphism Testing
How powerful are Graph Neural Networks?
GNNs are a powerful tool for processing real-world graph data
But they cannot solve every task specified on a graph accurately!
Canonical example: deciding graph isomorphism
Can a GNN distinguish between two non-isomorphic graphs i.e., $\mathbf{h}_{G_1} \neq \mathbf{h}_{G_2}$?
Permutation invariance mandates that two isomorphic graphs will always be indistinguishable, so this case is OK.
Weisfeiler-Leman Test
Simple but powerful way of distinguishing graphs: repeatedly pass random hashes of neighbourhood sums along the edges
Connection to conv-GNNs spotted very early; e.g., by GCN (Kipf & Welling, ICLR'17)
It explains why untrained GNNs work well!
Untrained $\sim$ random hash
The test can fail sometimes, e.g., it cannot distinguish a 6-cycle from two disjoint triangles
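A minimal sketch of 1-WL colour refinement (relabelling signatures via a dictionary plays the role of the random hash), illustrating the failure case above:

```python
import numpy as np

def wl_colours(A, n_iters=3):
    """1-WL colour refinement: repeatedly relabel each node by its own colour
    together with the multiset of its neighbours' colours (sketch)."""
    n = A.shape[0]
    colours = [0] * n                               # uniform initial colour
    for _ in range(n_iters):
        signatures = [
            (colours[i], tuple(sorted(colours[j] for j in np.nonzero(A[i])[0])))
            for i in range(n)
        ]
        palette = {sig: c for c, sig in enumerate(sorted(set(signatures)))}
        colours = [palette[sig] for sig in signatures]
    return sorted(colours)                          # colour histogram (sorted list)

# 6-cycle vs two disjoint triangles: 1-WL sees identical colour histograms
C6 = np.roll(np.eye(6), 1, axis=1) + np.roll(np.eye(6), -1, axis=1)
two_triangles = np.kron(np.eye(2), np.ones((3, 3)) - np.eye(3))
print(wl_colours(C6) == wl_colours(two_triangles))  # True, yet the graphs differ
```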
GNNs are no more powerful than 1-WL
Over discrete features, GNNs can only be as powerful as the 1-WL test described before
One important condition for maximal power is an injective aggregator (e.g. sum)
Graph isomorphism network (GIN; Xu et al., ICLR'19) proposes a simple, maximally-expressive GNN, following this principle
$$h_v^{(k)}=\mathrm{MLP}^{(k)}\left(\left(1+\epsilon^{(k)}\right)\cdot h_v^{(k-1)}+\sum_{u \in \mathcal{N}(v)}h_u^{(k-1)}\right)$$
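A sketch of a GIN-style layer, assuming a 2-layer MLP with ReLU and random weights for illustration:

```python
import numpy as np

def gin_layer(H, A, W1, W2, eps=0.0):
    """GIN-style update (sketch): MLP((1 + eps) * h_v + sum of neighbour h_u).
    Sum aggregation keeps the update injective on discrete feature multisets."""
    agg = (1.0 + eps) * H + A @ H            # (1+eps) h_v + sum_{u in N(v)} h_u
    return np.maximum(agg @ W1, 0.0) @ W2    # shared 2-layer MLP on every node

rng = np.random.default_rng(0)
A = np.array([[0,1,1],[1,0,0],[1,0,0]], dtype=float)
H = rng.normal(size=(3, 4))
W1, W2 = rng.normal(size=(4, 16)), rng.normal(size=(16, 4))
print(gin_layer(H, A, W1, W2).shape)         # (3, 4)
```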
Higher-Order Graph Neural Networks
We can make GNNs stronger by analysing failure cases of 1-WL.
For example, just like 1-WL, GNNs cannot detect closed triangles
Augment nodes with randomised/positional features (Murphy et al., ICML'19 and You et al., ICML'19)
Can also literally count interesting subgraphs (Bouritsas et al., 2020)
k-WL jointly labels tuples of k nodes.
Exploited by 1-2-3-GNNs (Morris et al., AAAI'19)
If features are real-valued, injectivity of sum falls apart.
No single aggregator is uniformly better than the others.
Geometric Deep Learning
Geometric Deep Learning
GNNs were described in terms of invariances and equivariances.
Many standard architectures like CNNs and Transformers can be recovered by combining:
Local and equivariant layer specified over neighbourhoods
Activation functions
(Potentially: pooling layers that coarsen the structure)
Global and invariant layer over the entire domain
Can design a more general class of geometric deep learning architectures.
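A compact sketch of this blueprint, assuming simple linear equivariant layers and a sum readout, with a numerical check of permutation invariance:

```python
import numpy as np

def equivariant_layer(H, A, W_self, W_nbr):
    """Local permutation-equivariant layer: h_i <- ReLU(W_self h_i + W_nbr sum_j h_j)."""
    return np.maximum(H @ W_self + (A @ H) @ W_nbr, 0.0)

def invariant_readout(H):
    """Global permutation-invariant layer: sum-pool over all nodes."""
    return H.sum(axis=0)

def geometric_blueprint(X, A, params):
    """Blueprint sketch: stacked equivariant layers, then an invariant readout."""
    H = X
    for W_self, W_nbr in params:
        H = equivariant_layer(H, A, W_self, W_nbr)
    return invariant_readout(H)               # graph-level representation

rng = np.random.default_rng(0)
A = np.array([[0,1,0],[1,0,1],[0,1,0]], dtype=float)
X = rng.normal(size=(3, 4))
params = [(rng.normal(size=(4, 8)), rng.normal(size=(4, 8))),
          (rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))]
perm = np.array([2, 0, 1])                    # relabel the nodes
same = np.allclose(geometric_blueprint(X, A, params),
                   geometric_blueprint(X[perm], A[perm][:, perm], params))
print(same)                                   # True: the output is permutation-invariant
```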