Implementing backpropagation by hand is like programming in assembly language: most of the time you don't have to do it.
You do need it sometimes, when the functionality you want is missing from your framework.
You also need it when you want to squeeze every last bit of efficiency out of the hardware.
Can we automate the process of computing gradients of arbitrary functions?
The answer is yes; the rest of this section walks through how an automatic differentiation library works.
Terminology
Automatic Differentiation (AutoDiff): a general-purpose technique for taking a program that computes a scalar value and automatically constructing a procedure for computing the derivative of that value.
Forward Mode Autodiff: derivatives are propagated forward alongside the original computation; efficient when the function has few inputs.
Reverse Mode Autodiff: derivatives are propagated backward from the output; efficient when the function has many inputs and a single scalar output, which is the typical situation when training neural networks.
Backpropagation: a special case of autodiff applied to neural networks. In machine learning, the words "backpropagation" and "autodiff" are often used interchangeably.
Autograd: the name of a particular autodiff package.
Finite Differences
We often use finite differences to check our gradient calculations.
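As a reminder of the standard definitions (nothing library-specific), the one-sided and two-sided finite difference approximations to a partial derivative are
$$
\begin{eqnarray}
\frac{\partial}{\partial x_i} f(\mathbf{x}) &\approx& \frac{f(\mathbf{x} + h\,\mathbf{e}_i) - f(\mathbf{x})}{h} \nonumber \\
\frac{\partial}{\partial x_i} f(\mathbf{x}) &\approx& \frac{f(\mathbf{x} + h\,\mathbf{e}_i) - f(\mathbf{x} - h\,\mathbf{e}_i)}{2h} \nonumber
\end{eqnarray}
$$
where $\mathbf{e}_i$ is the $i$-th standard basis vector and $h$ is a small step size. The two-sided version is more accurate, but both are too expensive and too prone to numerical error to use for training, which is why they are reserved for checking.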
An autodiff system, by contrast, converts the program that computes the loss into a sequence of primitive operations with known, simple derivatives. For example, univariate logistic least squares,
$$
\begin{eqnarray}
z &=& wx + b \nonumber \\
y &=& \frac{1}{1 + \exp(-z)} \nonumber \\
\mathcal{L} &=& \tfrac{1}{2}(y - t)^2, \nonumber
\end{eqnarray}
$$
might be broken down into the sequence of primitive operations
$$
\begin{eqnarray}
t_1 &=& wx \nonumber \\
z &=& t_1 + b \nonumber \\
t_3 &=& -z \nonumber \\
t_4 &=& \exp(t_3) \nonumber \\
t_5 &=& 1 + t_4 \nonumber \\
y &=& \frac{1}{t_5} \nonumber \\
t_6 &=& y - t \nonumber \\
t_7 &=& t_6^2 \nonumber \\
\mathcal{L} &=& \frac{t_7}{2} \nonumber
\end{eqnarray}
$$
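Reverse mode autodiff (and hence backpropagation) then walks this list of primitive operations in reverse, computing an error signal for each intermediate value. For the example above, the first few steps are
$$
\begin{eqnarray}
\mathcal{L}' &=& 1 \nonumber \\
t_7' &=& \tfrac{1}{2}\,\mathcal{L}' \nonumber \\
t_6' &=& 2\, t_6\, t_7' \nonumber \\
y' &=& t_6' \nonumber \\
t_5' &=& -\frac{y'}{t_5^2} \nonumber
\end{eqnarray}
$$
and so on back to $w'$ and $b'$.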
Autodiff Example
import autograd.numpy as np
from autograd import grad
def sigmoid(x):
    return 0.5 * (np.tanh(x) + 1)
def logistic_predictions(weights, inputs):
    # Outputs probability of a label being true according to logistic model.
    return sigmoid(np.dot(inputs, weights))
def training_loss(weights):
    # Training loss is the negative log-likelihood of the training labels.
    preds = logistic_predictions(weights, inputs)
    label_probabilities = preds * targets + (1 - preds) * (1 - targets)
    return -np.sum(np.log(label_probabilities))
# Build a toy dataset.
inputs = np.array([[0.52, 1.12, 0.77],
                   [0.88, -1.08, 0.15],
                   [0.52, 0.06, -1.30],
                   [0.74, -2.49, 1.39]])
targets = np.array([True, True, False, True])
# Build a function that returns gradients of training loss using Autograd.
training_gradient_fun = grad(training_loss)
# Optimize weights using gradient descent.
weights = np.array([0.0, 0.0, 0.0])
print("Initial loss:", training_loss(weights))
for i in range(100):
    weights -= training_gradient_fun(weights) * 0.01
print("Trained loss:", training_loss(weights))
Autograd
Autograd is a package that implements the ideas behind automatic differentiation.
Before it can compute gradients, an autodiff package must first build the computation graph.
Frameworks like TensorFlow provide a custom mini-language (an API) for building the computation graph explicitly.
Frameworks like PyTorch and Autograd instead build the computation graph implicitly, by tracing the operations executed during the forward pass.
The Node class represents a node of the computation graph (a minimal sketch is given after this list). The attributes of the class are:
value: actual value computed on a particular set of inputs
fun: the primitive operation defining the node
args: the arguments the op was called with
parents: the parents of the current node
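Putting these attributes together, a Node can be little more than a record. The following is a minimal sketch in the spirit of Autograd's implementation, not its exact code:
class Node(object):
    """A node of the traced computation graph (minimal sketch)."""
    def __init__(self, value, fun, args, parents):
        self.value = value      # actual value computed on a particular set of inputs
        self.fun = fun          # the primitive operation defining the node
        self.args = args        # the arguments the op was called with
        self.parents = parents  # the parent Nodes of the current node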
Computational Graph
Autograd wraps the basic NumPy ops so that, in addition to computing their values, calls to them also record nodes in the computation graph.
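The real wrapping machinery in Autograd uses boxed values and a trace stack; the sketch below is only an illustration of the idea, reusing the Node sketch above and a hypothetical primitive helper. Each wrapped op computes its value with plain NumPy and, if any argument is already part of the trace, records a new Node.
import numpy as onp  # the "real" NumPy underneath the wrappers

def primitive(fun):
    # Hypothetical wrapper: compute with raw NumPy, and record a graph node
    # whenever any of the arguments is a traced Node.
    def wrapped(*args):
        parents = [a for a in args if isinstance(a, Node)]
        raw_args = [a.value if isinstance(a, Node) else a for a in args]
        value = fun(*raw_args)            # the ordinary numerical result
        if parents:                        # only build a Node if an input is traced
            return Node(value, fun, raw_args, parents)
        return value
    return wrapped

# Wrapped versions of a few NumPy ops.
exp = primitive(onp.exp)
negative = primitive(onp.negative)
add = primitive(onp.add)
reciprocal = primitive(onp.reciprocal)
For example, wrapping the input as a start node, x = Node(1.5, None, (), []), and then calling reciprocal(add(1, exp(negative(x)))) returns a Node whose chain of parents encodes the logistic computation from the example below.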
Computational Graph Example
def logistic(z):
    return 1. / (1. + np.exp(-z))
# That is equivalent to:
def logistic2(z):
    return np.reciprocal(np.add(1, np.exp(np.negative(z))))
z = 1.5
y = logistic(z)
Vector-Jacobian Products
Reminder: Backpropagation equations can be expressed in terms of sums and indices.
Goal: Implement primitive operations and backpropagation in vectorized form.
The Jacobian is the matrix of partial derivatives of the outputs with respect to the inputs:
$$
\mathbf{J} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} =
\begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{pmatrix}
$$
In practice, we never explicitly construct the Jacobian; instead, we directly compute the vector-Jacobian product (VJP), which is simpler and more efficient.
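Concretely, treating error signals as column vectors (conventions differ; some sources write the same product with row vectors), the VJP for $\mathbf{y} = f(\mathbf{x})$ is
$$
\mathbf{x}' = \mathbf{J}^\top \mathbf{y}', \qquad x_j' = \sum_i \frac{\partial y_i}{\partial x_j}\, y_i'.
$$
For example, for a matrix-vector product $\mathbf{y} = \mathbf{W}\mathbf{x}$ the Jacobian is $\mathbf{W}$, so the VJP is just $\mathbf{x}' = \mathbf{W}^\top \mathbf{y}'$; for an elementwise operation such as $\mathbf{y} = \exp(\mathbf{x})$ the Jacobian is diagonal, so the VJP is an elementwise product and the full matrix never needs to be formed.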
VJP Examples
For each primitive operation, we must specify a VJP for each of its arguments. Consider $y = \exp(x)$.
The VJP is a function which takes in the output gradient (i.e. $y'$), the answer ($y$), and the arguments ($x$), and returns the input gradient ($x'$). Since $\partial y / \partial x = \exp(x) = y$, the VJP for exp just scales the output gradient by the answer; this is the ans * g rule registered below.
defvjp (defined in core.py) is a convenience routine for registering VJPs. It just adds them to a dict (a minimal sketch of this is given after the examples below).
Here are some examples from numpy/numpy_vjps.py:
defvjp(negative, lambda g, ans, x: -g)
defvjp(exp,      lambda g, ans, x: ans * g)
defvjp(log,      lambda g, ans, x: g / x)
defvjp(add,      lambda g, ans, x, y: g,
                 lambda g, ans, x, y: g)
defvjp(multiply, lambda g, ans, x, y: y * g,
                 lambda g, ans, x, y: x * g)
defvjp(subtract, lambda g, ans, x, y: g,
                 lambda g, ans, x, y: -g)
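To make "adds them to a dict" concrete, here is a minimal sketch of what defvjp might look like; the real core.py is more elaborate, so treat the names and structure here as an illustration only.
# Maps each primitive op to a dict from argument position to that argument's VJP.
primitive_vjps = {}

def defvjp(fun, *vjps):
    # Register one VJP per positional argument of `fun`, keyed by argument index.
    primitive_vjps[fun] = {argnum: vjp for argnum, vjp in enumerate(vjps)}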
Backpropagation
Backpropagation computations can be seen as message passing on the computational graph.
This procedure can be implemented directly using a topologically ordered list of nodes and modular, per-node functions.
Backpropagation as Message Passing
Consider standard backpropagation for computing $\mathbf{z}'$, where $\mathbf{r}$, $\mathbf{s}$, and $\mathbf{t}$ are the nodes computed directly from $\mathbf{z}$:
$$
\mathbf{z}' = \left(\frac{\partial \mathbf{r}}{\partial \mathbf{z}}\right)^{\top} \mathbf{r}' + \left(\frac{\partial \mathbf{s}}{\partial \mathbf{z}}\right)^{\top} \mathbf{s}' + \left(\frac{\partial \mathbf{t}}{\partial \mathbf{z}}\right)^{\top} \mathbf{t}'
$$
This is not modular: the implementation of $\mathbf{z}$ needs to know how $\mathbf{z}$ is used in the network in order to compute the partial derivatives of $\mathbf{r}$, $\mathbf{s}$, and $\mathbf{t}$ with respect to it.
Backprop as Message Passing Continued...
Node receives messages from children, aggregates them to get its own error signal, then passes messages to parents.
Each message is a VJP.
Modularity: each node needs to know how to compute its outgoing messages, i.e. the VJPs corresponding to each of its parents.
The implementation of $\mathbf{z}$ does not need to know where the messages making up $\mathbf{z}'$ came from.
Backward Pass
The backward pass is defined in core.py.
The argument $g$ is the error signal for the end node; since the end node is the loss itself, this is always $\mathcal{L}' = 1$.
def backward_pass(g, end_node):
    # Error signal flowing into each node, keyed by node.
    # (The real core.py stores a (gradient, flag) pair here; the flag is
    # bookkeeping we can ignore in this simplified version.)
    outgrads = {end_node: g}
    # Visit nodes from the output back toward the inputs.
    for node in toposort(end_node):
        outgrad = outgrads.pop(node)
        # node.vjp returns one input gradient per parent of the node.
        ingrads = node.vjp(outgrad)
        for parent, ingrad in zip(node.parents, ingrads):
            outgrads[parent] = add_outgrads(outgrads.get(parent), ingrad)
    # The last node visited is the start node, so this is the gradient
    # with respect to the input we are differentiating.
    return outgrad

def add_outgrads(prev_g, g):
    # Error signals are summed when a node is used by more than one operation.
    if prev_g is None:
        return g
    return prev_g + g
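backward_pass assumes a toposort generator that yields each node before its parents (reverse topological order). The sketch below follows the approach used in Autograd's util.py, relying only on the parents attribute from the Node class above; exact details may differ.
def toposort(end_node):
    # First pass: count how many times each node appears as a parent,
    # i.e. how many of its children are in the graph.
    child_counts = {}
    stack = [end_node]
    while stack:
        node = stack.pop()
        if node in child_counts:
            child_counts[node] += 1
        else:
            child_counts[node] = 1
            stack.extend(node.parents)
    # Second pass: emit a node only once all of its children have been emitted.
    childless_nodes = [end_node]
    while childless_nodes:
        node = childless_nodes.pop()
        yield node
        for parent in node.parents:
            if child_counts[parent] == 1:
                childless_nodes.append(parent)
            else:
                child_counts[parent] -= 1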
Backward Pass Continued...
grad is just a thin wrapper around make_vjp, which builds the computation graph and feeds it to backward_pass; a rough sketch is given below.
grad itself can be viewed as computing a VJP, if we treat the output error signal $\mathcal{L}'$ as the $1 \times 1$ matrix with entry 1.
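To make the wrapper relationship concrete, here is a rough sketch of grad built on top of make_vjp. The calling convention assumed for make_vjp (returning a VJP function together with the forward value) is simplified relative to the real Autograd code.
import autograd.numpy as np

def grad(fun):
    # Returns a function that computes the gradient of `fun` with respect to
    # its (single) argument.
    def gradfun(x):
        # Assumed interface: make_vjp runs the forward pass on x, building the
        # computation graph, and returns (vjp, value), where vjp feeds an
        # output error signal to backward_pass.
        vjp, ans = make_vjp(fun, x)
        # Seed the backward pass with dL/dL = 1.
        return vjp(np.ones_like(ans))
    return gradfun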