Graph Neural Networks - II
CSE 891: Deep Learning
Vishnu Boddeti
Wednesday October 28, 2020
Recap: Graph Structured Data
Standard CNN and RNN architectures don't work on this data.
Slide Credit: Xavier Bresson
Recap: Graph Definition
Graphs G are defined by:
Vertices V
Edges E
Adjacency matrix A
Features:
Node features: $h_i, h_j$ (user type)
Edge features: $e_{ij}$ (relation type)
Graph features: $g$ (network type)
Recap: Chebyshev Polynomials
Slide Credit: Xavier Bresson
Isotropic Filters
Template Matching
How to define template matching for graphs?
Main issue is the absence of node ordering/positioning.
Node indices are arbitrary and do not carry any consistent positional information.
How to design template matching invariant to node re-parametrization?
Simply use the same template features for all neighbors!
Message Passing Neural Networks
Use a neural network to mimic unrolled message passing on graphs.
$$m_i^{t+1} = \sum_{j \in N_i} M_t\left(h_i^t, h_j^t, e_{ij}\right), \qquad h_i^{t+1} = U_t\left(h_i^t, m_i^{t+1}\right), \qquad \hat{y} = R\left(\{h_i^T \mid i \in G\}\right)$$
Figure Credit: David Duvenaud
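A minimal dense PyTorch sketch of one message-passing round, with $M_t$ as a small MLP and $U_t$ as a GRU cell (both hypothetical choices, not the lecture's reference implementation); it assumes dense $(N \times N)$ adjacency and edge-feature tensors, which only makes sense for small graphs.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One round of message passing (dense sketch):
    m_i = sum_{j in N_i} M(h_i, h_j, e_ij),  h_i' = U(h_i, m_i)."""

    def __init__(self, d, d_e):
        super().__init__()
        self.message = nn.Sequential(nn.Linear(2 * d + d_e, d), nn.ReLU())  # M_t (illustrative MLP)
        self.update = nn.GRUCell(d, d)                                      # U_t (illustrative GRU)

    def forward(self, h, e, adj):
        # h: (N, d) node features, e: (N, N, d_e) edge features, adj: (N, N) 0/1 mask
        N = h.size(0)
        h_i = h.unsqueeze(1).expand(N, N, -1)                  # receiver features (row i)
        h_j = h.unsqueeze(0).expand(N, N, -1)                  # sender features (column j)
        msgs = self.message(torch.cat([h_i, h_j, e], dim=-1))  # M_t(h_i, h_j, e_ij)
        m = (adj.unsqueeze(-1) * msgs).sum(dim=1)              # sum over neighbors j
        return self.update(m, h)                               # h_i^{t+1} = U_t(h_i^t, m_i^{t+1})
```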
Template Matching
One feature: $h_i^{(l+1)} = \sigma\left(\sum_{j \in N_i} (w^l)^T h_j^l\right)$
$d$ features: $h_i^{(l+1)} = \sigma\left(\sum_{j \in N_i} W^l h_j^l\right)$
Vectorized representation: $h^{(l+1)} = \sigma\left(A\, h^l\, W^l\right)$
Figure Credit: Xavier Bresson
Vanilla Spatial GCN
Simplest formulation of spatial GCNs
Handle the absence of node ordering
Invariant by node re-parametrization
Deal with different neighborhood sizes
Local receptive field by design (only neighbors are considered)
Weight sharing (convolution property)
Independent of graph size
Limited to isotropic filters
$$h^{(l+1)} = \sigma\left(A\, h^l\, W^l\right)$$
$$h_i^{(l+1)} = \sigma\left(\frac{1}{d_i}\sum_{j \in N_i} A_{ij}\, W^l h_j^l\right)$$
$$h_i^{(l+1)} = f_{\mathrm{GCN}}\left(h_i^l,\ \{h_j^l : j \to i\}\right)$$
Figure Credit: Xavier Bresson
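A minimal sketch of the vanilla (isotropic) spatial GCN layer above, assuming a dense 0/1 adjacency matrix; the class name and signature are illustrative, and production code would use sparse operations instead.

```python
import torch
import torch.nn as nn

class VanillaGCNLayer(nn.Module):
    """h_i^{l+1} = ReLU( (1/d_i) * sum_{j in N_i} A_ij W h_j ), written in matrix form."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)      # shared template weights W^l

    def forward(self, h, A):
        # h: (N, d_in) node features, A: (N, N) dense 0/1 adjacency matrix
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)    # neighborhood sizes d_i
        return torch.relu((A @ self.W(h)) / deg)         # mean over neighbors, then nonlinearity
```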
Graph Convolutional Networks
Figure Credit: Michael Bronstein
GCN Challenges
Small world graphs
Bottleneck Problem
Figure Credit: Michael Bronstein
GraphSage
Vanilla GCNs (assuming $A_{ij} = 1$): $h_i^{(l+1)} = \sigma\left(\frac{1}{d_i}\sum_{j \in N_i} W^l h_j^l\right)$
GraphSage:
Differentiate template weights $W^l$ between the neighbors $h_j$ and the central node $h_i$.
$$h_i^{(l+1)} = \sigma\left(W_1^l h_i^l + \frac{1}{d_i}\sum_{j \in N_i} W_2^l h_j^l\right)$$
Isotropic GCNs
Variations:
Instead of the mean, use max pooling, an LSTM aggregator, etc. (a mean-aggregator sketch follows below).
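A mean-aggregator sketch of the GraphSage update above, again with a dense adjacency matrix for clarity; neighborhood sampling and the max/LSTM aggregator variations are left out, and the class name is illustrative.

```python
import torch
import torch.nn as nn

class GraphSageLayer(nn.Module):
    """h_i^{l+1} = ReLU( W_1 h_i + (1/d_i) * sum_{j in N_i} W_2 h_j )."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.W1 = nn.Linear(d_in, d_out, bias=False)     # weights for the central node h_i
        self.W2 = nn.Linear(d_in, d_out, bias=False)     # weights for the neighbors h_j

    def forward(self, h, A):
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)    # neighborhood sizes d_i
        neigh_mean = (A @ h) / deg                       # (1/d_i) * sum_j h_j
        return torch.relu(self.W1(h) + self.W2(neigh_mean))
```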
GraphSage
How do we scale to large graphs?
Mini-Batch Training:
Samples are not IID anymore.
How do we sample?
Figure Credit: Michael Bronstein
Graph Isomorphism Networks
Anisotropic GCNs
Reminder:
Standard ConvNets produce anisotropic filters because Euclidean grids have directional structures (up, down, left, right).
GCNs such as ChebNets, CayleyNets, vanilla GCNs, GraphSage, and GIN compute isotropic filters, as there is no notion of direction on arbitrary graphs.
How to get anisotropy back in GNNs?
Use natural edge features if available (e.g., different bond types between atoms).
We need an anisotropic mechanism that is independent of the node parametrization.
Idea: Graph attention mechanism can treat neighbors differently.
Graph Attention Networks
GAT uses the attention mechanism to introduce anisotropy in the neighborhood aggregation function.
The network employs a multi-headed architecture to increase the learning capacity, similar to the Transformer.
$$h_i^{l+1} = \mathrm{Concat}_{k=1}^{K}\left(\mathrm{ELU}\left(\sum_{j \in N_i} \alpha_{ij}^{k,l}\, W_1^{k,l} h_j^l\right)\right), \qquad \alpha_{ij}^{k,l} = \mathrm{Softmax}_{N_i}\left(\hat{\alpha}_{ij}^{k,l}\right), \qquad \hat{\alpha}_{ij}^{k,l} = \mathrm{LeakyReLU}\left(W_2^{k,l}\, \mathrm{Concat}\left(W_1^{k,l} h_i^l,\ W_1^{k,l} h_j^l\right)\right)$$
Attention mechanism in a one-hop neighborhood
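A single-head sketch of the GAT aggregation above (dense adjacency, hypothetical class name, assumes every node has at least one neighbor); a full layer runs K such heads and concatenates their outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATHead(nn.Module):
    """One attention head: alpha_ij = softmax over N_i of LeakyReLU(W_2 [W_1 h_i || W_1 h_j])."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.W1 = nn.Linear(d_in, d_out, bias=False)     # shared projection W_1
        self.a = nn.Linear(2 * d_out, 1, bias=False)     # scoring weights W_2

    def forward(self, h, A):
        z = self.W1(h)                                   # (N, d_out) projected features
        N = z.size(0)
        zi = z.unsqueeze(1).expand(N, N, -1)             # central-node features
        zj = z.unsqueeze(0).expand(N, N, -1)             # neighbor features
        scores = F.leaky_relu(self.a(torch.cat([zi, zj], dim=-1))).squeeze(-1)
        scores = scores.masked_fill(A == 0, float('-inf'))   # keep one-hop neighbors only
        alpha = torch.softmax(scores, dim=1)                  # Softmax over N_i
        return F.elu(alpha @ z)                               # ELU( sum_j alpha_ij W_1 h_j )
```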
Graph Transformers
Graph version of Transformer:
$$h_i^{l+1} = W^l\, \mathrm{Concat}_{k=1}^{K}\left(\sum_{j \in N_i} \alpha_{ij}^{k,l}\, V^{k,l} h_j^l\right), \qquad \alpha_{ij}^{k,l} = \mathrm{Softmax}_{N_i}\left(\hat{\alpha}_{ij}^{k,l}\right), \qquad \hat{\alpha}_{ij}^{k,l} = \left(Q^{l,k} h_i^l\right)^T K^{l,k} h_j^l$$
Attention mechanism in a one-hop neighborhood
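A single-head sketch of the graph self-attention above, with dot-product scores restricted to the one-hop neighborhood by masking; names and shapes are illustrative. A full layer would concatenate K such heads and apply the output projection $W^l$, and with an all-ones adjacency it reduces to standard Transformer self-attention.

```python
import torch
import torch.nn as nn

class GraphTransformerHead(nn.Module):
    """alpha_ij = softmax over N_i of (Q h_i)^T (K h_j); output_i = sum_j alpha_ij V h_j."""

    def __init__(self, d, d_k):
        super().__init__()
        self.Q = nn.Linear(d, d_k, bias=False)
        self.K = nn.Linear(d, d_k, bias=False)
        self.V = nn.Linear(d, d_k, bias=False)

    def forward(self, h, A):
        q, k, v = self.Q(h), self.K(h), self.V(h)
        scores = q @ k.t()                                    # (Q h_i)^T K h_j for all pairs
        scores = scores.masked_fill(A == 0, float('-inf'))    # restrict to one-hop neighbors
        alpha = torch.softmax(scores, dim=1)                  # Softmax over N_i
        return alpha @ v                                      # sum_j alpha_ij V h_j
```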
Transformers
The Transformer is a special case of a GCN in which the graph is fully connected.
The neighborhood N i is the whole graph.
$$h_i^{l+1} = W^l\, \mathrm{Concat}_{k=1}^{K}\left(\sum_{j \in N_i} \alpha_{ij}^{k,l}\, V^{k,l} h_j^l\right), \qquad \alpha_{ij}^{k,l} = \mathrm{Softmax}_{N_i}\left(\hat{\alpha}_{ij}^{k,l}\right), \qquad \hat{\alpha}_{ij}^{k,l} = \left(Q^{l,k} h_i^l\right)^T K^{l,k} h_j^l$$
Fully Connected Graph
$$h^{l+1} = \mathrm{Concat}_{k=1}^{K}\left(\mathrm{Softmax}\left(Q^l (K^l)^T\right) V^l\right) W^l, \qquad Q^l = h^l W_Q^l, \qquad K^l = h^l W_K^l, \qquad V^l = h^l W_V^l$$
Transformers
What does it mean to have a graph fully connected?
It becomes less meaningful to talk about a graph: every data point is connected to every other point, so there is no particular graph structure to exploit.
It would be better to talk about sets rather than graphs in this case.
Transformers are Set Neural Networks.
They are currently among the best techniques for analyzing sets/bags of features.
GNN Pipeline
Standard GNN Pipeline:
Input Layer: Linear embedding of input node/edge features.
GNN Layers: Apply favorite GNN layer L times.
Task-based layer: Graph/node/edge prediction layer.
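A hypothetical assembly of the three stages for node classification, reusing the VanillaGCNLayer sketched earlier (any favorite GNN layer can be dropped in, and the head swapped for a graph- or edge-level predictor).

```python
import torch.nn as nn

class GNNPipeline(nn.Module):
    """Input embedding -> L GNN layers -> task-based head (node classification here)."""

    def __init__(self, d_in, d_hidden, n_classes, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(d_in, d_hidden)           # input layer: linear embedding
        self.layers = nn.ModuleList(
            [VanillaGCNLayer(d_hidden, d_hidden) for _ in range(n_layers)]
        )                                                # apply favorite GNN layer L times
        self.head = nn.Linear(d_hidden, n_classes)       # task-based layer

    def forward(self, x, A):
        h = self.embed(x)
        for layer in self.layers:
            h = layer(h, A)
        return self.head(h)                              # per-node logits
```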
GNN Pipeline
Slide Credit: Xavier Bresson
GNN Pipeline
Slide Credit: Xavier Bresson
GNN Pipeline
Slide Credit: Xavier Bresson
Node Classification
Figure Credit: Jure Leskovec
Link Prediction
Figure Credit: Jure Leskovec
Social Networks
Figure Credit: Michael Bronstein
Social Networks
Figure Credit: Chaitanya Joshi
Quantum Chemistry
Slide Credit: Xavier Bresson
Computer Vision
Slide Credit: Xavier Bresson
Combinatorial Optimization
Slide Credit: Xavier Bresson
Graph Classification
Duvenaud et al., "Convolutional Networks on Graphs for Learning Molecular Fingerprints", NeurIPS 2015
Semi-Supervised Classification on Graphs
Setting:
Some nodes are labeled
All other nodes are unlabeled
Task:
Predict node label of unlabeled nodes
Evaluate loss on labeled nodes only:
$$\mathcal{L} = -\sum_{l \in \mathcal{Y}_L} \sum_{f=1}^{F} Y_{lf} \log Z_{lf}$$
$\mathcal{Y}_L$: set of labeled node indices
$Y$: label matrix
$Z$: GCN output (after softmax)
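A minimal sketch of the masked loss above (hypothetical helper name), assuming integer class labels and logits taken from the GCN before the softmax so that `cross_entropy` can apply the log-softmax internally.

```python
import torch.nn.functional as F

def semi_supervised_loss(logits, labels, labeled_idx):
    """Cross-entropy evaluated only on the labeled nodes Y_L.

    logits:      (N, F) GCN outputs before the softmax
    labels:      (N,)   integer class labels; values at unlabeled nodes are ignored
    labeled_idx: 1-D tensor of indices of the labeled nodes
    """
    return F.cross_entropy(logits[labeled_idx], labels[labeled_idx])
```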
Normalization and Depth
What kind of normalization can we do for graphs?
Node normalization
Pair normalization
Edge normalization
Does depth matter for GCNs?
Scalable Inception Graph Neural Networks
Figure Credit: Michael Bronstein
Scalable Inception Graph Neural Networks
Graph CNN Challenges
How to define compositionality on graphs (i.e. convolution, downsampling, and pooling on graphs)?
How to ensure generalizability across graphs?
How to make them numerically as fast as standard CNNs?
Scalability: millions of nodes, billions of edges
How to process dynamic graphs?
How to incorporate higher-order structures?
How powerful are graph neural networks?
Limited theoretical understanding.
Message-passing GCNs are no more powerful than the Weisfeiler-Lehman (WL) graph isomorphism test.
Getting Started
Lots of open questions and uncharted territory.
Standardized Benchmarks: Open Graph Benchmark
Software Packages: PyTorch Geometric or Deep Graph Library
Summary
Generalization of ConvNets to data on graphs
Re-design convolution operator on graphs
Linear complexity for sparse graphs
GPU implementations exist but are not yet optimized for sparse matrix-matrix multiplications.
Spatial GCNs are usually more efficient but less principled.
Spectral GCNs are more principled but usually slower; computing Laplacian eigenvectors for large-scale data is painful.