Feed Forward Networks: Learning
CSE 891: Deep Learning
Vishnu Boddeti
Today
- Risk Minimization
- Loss Functions
- Regularization
Recap: Simple Neural Network
- Neuron pre-activation (or input activation)
- $a(\mathbf{x}) = b + \sum_i w_ix_i = b + \mathbf{w}^T\mathbf{x}$
- Neuron (output) activation
$h(\mathbf{x}) = g(a(\mathbf{x})) = g\left(b+\sum_iw_ix_i\right)$
- $\mathbf{w}$ are the connection weights
- $b$ is the neuron bias
- $g(\cdot)$ is called the activation function
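A minimal NumPy sketch of a single neuron, assuming a tanh choice for $g$; the helper name `neuron` and the example values are illustrative, not from the slides:

```python
import numpy as np

def neuron(x, w, b, g=np.tanh):
    """Single neuron: pre-activation a(x) = b + w^T x, output h(x) = g(a(x))."""
    a = b + np.dot(w, x)   # pre-activation
    return g(a)            # (output) activation

# example: 3-dimensional input, tanh activation
x = np.array([1.0, -2.0, 0.5])
w = np.array([0.2, 0.4, -0.1])
b = 0.1
print(neuron(x, w, b))
```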
Recap: Multilayer Neural Network
- Could have $L$ hidden layers:
- layer pre-activation for $k>0$ ($\mathbf{h}^{(0)}(\mathbf{x})=\mathbf{x}$)
$\mathbf{a}^{(k)}(\mathbf{x}) = \mathbf{b}^{(k)} + \mathbf{W}^{(k)}\mathbf{h}^{(k-1)}(\mathbf{x})$
- hidden layer activation ($k$ from 1 to $L$):
$\mathbf{h}^{(k)}(\mathbf{x}) = \mathbf{g}(\mathbf{a}^{(k)}(\mathbf{x}))$
- output layer activation ($k=L+1$):
$\mathbf{h}^{(L+1)}(\mathbf{x}) = \mathbf{o}(\mathbf{a}^{(L+1)}(\mathbf{x})) = f(\mathbf{x})$
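A sketch of this forward pass, assuming tanh hidden activations and an identity output nonlinearity $\mathbf{o}(\cdot)$; the function and variable names are illustrative:

```python
import numpy as np

def forward(x, weights, biases, g=np.tanh, o=lambda a: a):
    """Forward pass: h^(0) = x, a^(k) = b^(k) + W^(k) h^(k-1), h^(k) = g(a^(k)),
    with the output nonlinearity o(.) applied at layer L+1."""
    h = x
    for k, (W, b) in enumerate(zip(weights, biases)):
        a = b + W @ h
        h = g(a) if k < len(weights) - 1 else o(a)
    return h  # f(x)

# example: 4 -> 5 -> 3 network (one hidden layer), identity output
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(5, 4)), rng.normal(size=(3, 5))]
bs = [np.zeros(5), np.zeros(3)]
print(forward(rng.normal(size=4), Ws, bs))
```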
Empirical Risk Minimization
- Framework for designing learning algorithms
- Ideally, minimize the expected risk under the data distribution $P(\mathbf{x},\mathbf{y})$:
$$\underset{\theta}{\operatorname{arg min}} \mathbb{E}_{P(x,y)}\left[l\left(f(\mathbf{x};\mathbf{\theta}),\mathbf{y}\right)\right] + \lambda\Omega(\mathbf{\theta})$$
- Since $P(\mathbf{x},\mathbf{y})$ is unknown, minimize the empirical risk over $T$ training samples instead:
$$\underset{\theta}{\operatorname{arg min}} \frac{1}{T} \sum_t l\left(f(\mathbf{x}^t;\mathbf{\theta}),\mathbf{y}^t\right) + \lambda\Omega(\mathbf{\theta})$$
- Learning is cast as an optimization problem
- Loss Function: $l(f(\mathbf{x};\mathbf{\theta}),\mathbf{y})$
- $\Omega(\mathbf{\theta})$ is a regularizer
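A toy sketch of the empirical-risk objective, assuming (purely for illustration) a linear model, squared loss, and an $L_2$ regularizer; all names are made up for this example:

```python
import numpy as np

def empirical_risk(f, loss, data, theta, regularizer, lam):
    """(1/T) * sum_t loss(f(x_t; theta), y_t) + lam * Omega(theta)."""
    T = len(data)
    fit = sum(loss(f(x, theta), y) for x, y in data) / T
    return fit + lam * regularizer(theta)

# toy usage: linear model, squared loss, L2 regularizer
f = lambda x, theta: theta @ x
loss = lambda yhat, y: (yhat - y) ** 2
omega = lambda theta: np.sum(theta ** 2)
data = [(np.array([1.0, 2.0]), 1.0), (np.array([0.5, -1.0]), 0.0)]
print(empirical_risk(f, loss, data, np.array([0.1, 0.2]), omega, lam=0.01))
```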
Log-Likelihood
- NN estimates $f(\mathbf{x})_c=p(y=c|\mathbf{x})$
- we could maximize the probabilities of $y_t$ given $\mathbf{x}_t$ in the training set
- To frame as minimization, we minimize the negative log-likelihood
$$l(f(\mathbf{x}),y) = -\log \sum_c 1_{y=c}f(\mathbf{x})_c = -\log f(\mathbf{x})_y$$
- we take the log for numerical stability and mathematical simplicity
- sometimes referred to as cross-entropy
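A minimal sketch of the negative log-likelihood computed from unnormalized scores via a softmax; the helper names are illustrative, and the scores happen to match the cross-entropy example later in the lecture:

```python
import numpy as np

def softmax(scores):
    # subtract the max for numerical stability before exponentiating
    e = np.exp(scores - scores.max())
    return e / e.sum()

def nll(scores, y):
    """Negative log-likelihood -log f(x)_y of the correct class y."""
    return -np.log(softmax(scores)[y])

print(nll(np.array([3.2, 5.1, -1.7]), 0))  # loss when class 0 is the correct class
```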
Loss Functions: Classification
- 0-1 Loss: $l(\hat{y},y) = \sum_i \mathcal{I}(\hat{y}_i \neq y_i)$ (non-differentiable, so we minimize surrogates instead)
- Surrogate Loss Functions:
- Squared Loss: $\left(y-f(\mathbf{x})\right)^2$
- Logistic Loss: $\log\left(1+e^{-yf(\mathbf{x})}\right)$
- Hinge Loss: $\left(1-yf(\mathbf{x})\right)_+$
- Squared Hinge Loss: $\left(1-yf(\mathbf{x})\right)_+^2$
- Cross-Entropy Loss: $-\sum_{i} y_i\log f(\mathbf{x})_i$
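Minimal NumPy sketches of the margin-based surrogates above, assuming binary labels $y \in \{-1,+1\}$ and a real-valued score $s = f(\mathbf{x})$; the variable names are illustrative:

```python
import numpy as np

# binary surrogate losses for a label y in {-1, +1} and a real-valued score s = f(x)
logistic = lambda s, y: np.log(1.0 + np.exp(-y * s))          # logistic loss
hinge    = lambda s, y: np.maximum(0.0, 1.0 - y * s)          # hinge loss
sq_hinge = lambda s, y: np.maximum(0.0, 1.0 - y * s) ** 2     # squared hinge loss

s, y = 0.7, -1
print(logistic(s, y), hinge(s, y), sq_hinge(s, y))
```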
Loss Functions: Regression
- Euclidean Loss: $\|\mathbf{y}-f(\mathbf{x})\|_2^2$
- Manhattan Loss: $\|\mathbf{y}-f(\mathbf{x})\|_1$
- Huber Loss:
$$
\begin{cases}
\frac{1}{2}\|\mathbf{y}-f(\mathbf{x})\|_2^2 & \quad \text{for } \|\mathbf{y}-f(\mathbf{x})\| < \delta\\
\delta\|\mathbf{y}-f(\mathbf{x})\|_1 -\frac{1}{2}\delta^2 & \quad \text{otherwise}\\
\end{cases}
$$
- KL Divergence: $\sum_i p_i\log\left(\frac{p_i}{q_i}\right)$
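A sketch of the Huber loss and KL divergence; note that the Huber implementation below uses the common elementwise form, which differs slightly from the mixed-norm form on the slide, and all names are illustrative:

```python
import numpy as np

def huber(y, fx, delta=1.0):
    """Elementwise Huber: quadratic for small residuals, linear beyond delta."""
    r = np.abs(y - fx)
    return np.where(r < delta, 0.5 * r ** 2, delta * r - 0.5 * delta ** 2).sum()

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i log(p_i / q_i) for probability vectors p, q."""
    return np.sum(p * np.log(p / q))

y, fx = np.array([1.0, 2.0, 3.0]), np.array([1.2, 4.0, 2.9])
print(huber(y, fx), kl_divergence(np.array([0.7, 0.3]), np.array([0.5, 0.5])))
```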
Loss Functions: Embeddings
- Cosine Similarity: $\frac{\mathbf{x}^T\mathbf{y}}{\|\mathbf{x}\|\|\mathbf{y}\|}$ (the corresponding distance is one minus this)
- Triplet Loss: $\left(1+d(\mathbf{x}_i,\mathbf{x}_j)-d(\mathbf{x}_i,\mathbf{x}_k)\right)_{+}$
- Mahalanobis Distance: $(\mathbf{x}-\mathbf{y})^T\mathbf{M}(\mathbf{x}-\mathbf{y})$
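A minimal sketch of the triplet loss, assuming Euclidean distances for $d(\cdot,\cdot)$ and the margin of 1 from the slide's formula; the names are illustrative:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, margin + d_ap - d_an)

a, p, n = np.array([0.0, 1.0]), np.array([0.1, 0.9]), np.array([1.0, -1.0])
print(triplet_loss(a, p, n))
```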
Cross-Entropy Loss Example
| Unnormalized Log-Prob. | Unnormalized Prob. | Prob. | Correct Prob. |
|---:|---:|---:|---:|
| 3.2 | 24.5 | 0.13 | 1.00 |
| 5.1 | 164.0 | 0.87 | 0.00 |
| -1.7 | 0.18 | 0.00 | 0.00 |
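Taking the first row as the correct class, the cross-entropy loss for this example is the negative log of the probability assigned to it:

$$l(f(\mathbf{x}),y) = -\log f(\mathbf{x})_y = -\log 0.13 \approx 2.04$$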
Regularization
- What is regularization?
- the process of constraining the parameter space
- alternatively, penalize certain values of $\mathbf{\theta}$
- Why do we need regularization?
- In Machine Learning regularization $\Leftrightarrow$ generalization
- In Deep Learning regularization $\neq$ generalization
$L_2$ Regularization
$$\Omega(\mathbf{\theta}) = \sum_k\sum_i\sum_j(W_{ij}^k)^2 = \sum_k\|\mathbf{W}^k\|_F^2$$
- Gradient: $\nabla_{\mathbf{W}^k}\Omega(\mathbf{\theta}) = 2\mathbf{W}^k$
- Applied only to the weights, not the biases (also known as weight decay)
- Can be interpreted as having a Gaussian prior over the weights
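A small sketch of the $L_2$ penalty and its gradient, applied only to the weight matrices; the helper names and example values are illustrative:

```python
import numpy as np

def l2_penalty(weights):
    """Omega(theta) = sum_k ||W^k||_F^2 (weights only, biases excluded)."""
    return sum(np.sum(W ** 2) for W in weights)

def l2_grad(W):
    """Gradient of the penalty with respect to one weight matrix: 2 W."""
    return 2.0 * W

W1, W2 = np.array([[1.0, -2.0], [0.5, 0.0]]), np.array([[3.0, 1.0]])
print(l2_penalty([W1, W2]), l2_grad(W1))
```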
$L_1$ Regularization
$$\Omega(\mathbf{\theta}) = \sum_k\sum_i\sum_j|W_{ij}^k|$$
- Gradient: $\nabla_{\mathbf{W}^k}\Omega(\mathbf{\theta}) = \text{sign}(\mathbf{W}^k)$
$$\text{sign}(W^k)_{ij} =
\begin{cases}
1 & \quad \text{for } W^k_{ij} > 0\\
-1 & \quad \text{for } W^k_{ij} < 0
\end{cases}$$
- Applied only to the weights, not the biases
- Unlike $L_2$, $L_1$ will push certain weights to be exactly zero
- Can be interpreted as having a Laplacian prior over the weights
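A matching sketch for the $L_1$ penalty and its (sub)gradient; again the names are illustrative:

```python
import numpy as np

def l1_penalty(weights):
    """Omega(theta) = sum_k sum_ij |W^k_ij| (weights only, biases excluded)."""
    return sum(np.sum(np.abs(W)) for W in weights)

def l1_subgradient(W):
    """sign(W); np.sign returns 0 where the penalty is non-differentiable (W_ij = 0)."""
    return np.sign(W)

W = np.array([[1.0, -2.0], [0.0, 0.5]])
print(l1_penalty([W]), l1_subgradient(W))
```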
Regularization Geometry
Equivalent Optimization Problems:
$$
\begin{equation}
\textbf{P1: } \underset{\mathbf{W}}{\operatorname{arg min}} \|\mathbf{y}-\mathbf{W}\mathbf{x}\|_2^2 + \lambda\|\mathbf{W}\|_2^2
\end{equation}
$$
$$
\begin{eqnarray}
\textbf{P2: } \underset{\mathbf{W}}{\operatorname{arg min}} && \|\mathbf{W}\|_2^2 \nonumber \\
\text{s.t.} && \|\mathbf{y}-\mathbf{W}\mathbf{x}\|_2^2 \leq \alpha \nonumber
\end{eqnarray}
$$
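Under convexity, the equivalence can be seen through the Lagrangian of P2, written here with a multiplier $\mu \geq 0$ introduced for illustration:

$$\mathcal{L}(\mathbf{W},\mu) = \|\mathbf{W}\|_2^2 + \mu\left(\|\mathbf{y}-\mathbf{W}\mathbf{x}\|_2^2 - \alpha\right)$$

For a fixed $\mu > 0$, minimizing $\mathcal{L}$ over $\mathbf{W}$ gives the same solution as P1 with $\lambda = 1/\mu$: the constant $-\mu\alpha$ does not affect the minimizer, and dividing by $\mu$ leaves $\|\mathbf{y}-\mathbf{W}\mathbf{x}\|_2^2 + \frac{1}{\mu}\|\mathbf{W}\|_2^2$.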
Looking Through the Bayesian Lens
- Bayes Rule:
$$
\begin{eqnarray}
p(\mathbf{\theta}|\mathbf{x},\mathbf{y}) &=& \frac{p(\mathbf{y}|\mathbf{x},\mathbf{\theta})p(\mathbf{\theta})}{p(\mathbf{y}|\mathbf{x})} \\
&=& \frac{p(\mathbf{y}|\mathbf{x},\mathbf{\theta})p(\mathbf{\theta})}{\int_{\mathbf{\theta}}p(\mathbf{y}|\mathbf{x},\mathbf{\theta})p(\mathbf{\theta})d\mathbf{\theta}}
\end{eqnarray}
$$
- Likelihood:
$$
p(\mathbf{y}|\mathbf{x},\mathbf{\theta}) = \frac{1}{Z}e^{-E(\mathbf{\theta},\mathbf{x},\mathbf{y})}
$$
- Prior:
$$p(\mathbf{\theta}) = \frac{1}{Z}e^{-E(\mathbf{\theta})}$$
Maximum-Aposteriori-Learning (MAP)
$$
\begin{eqnarray}
\hat{\mathbf{\theta}} &=& \underset{\theta}{\operatorname{arg max}} \prod_i p(\mathbf{\theta}|\mathbf{x}_i,\mathbf{y}_i) \nonumber \\
&=& \underset{\theta}{\operatorname{arg max}} p(\mathbf{\theta})\frac{\prod_i p(\mathbf{y}_i|\mathbf{x}_i,\mathbf{\theta})}{\prod_i p(\mathbf{y}_i|\mathbf{x}_i)} \nonumber \\
&=& \underset{\theta}{\operatorname{arg max}} p(\mathbf{\theta})\prod_i p(\mathbf{y}_i|\mathbf{x}_i,\mathbf{\theta}) \nonumber \\
&=& \underset{\theta}{\operatorname{arg min}} \left(E(\mathbf{\theta},\mathbf{x},\mathbf{y}) + E(\mathbf{\theta})\right) \nonumber
\end{eqnarray}
$$
- e.g., for linear regression with weight vector $\mathbf{a}$: $E(\mathbf{\theta},\mathbf{x},\mathbf{y})=\|\mathbf{y}-\mathbf{X}\mathbf{a}\|_2^2$ and $E(\mathbf{\theta})=\lambda\|\mathbf{a}\|^2_2$
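With these energies, MAP learning reduces to ridge regression ($L_2$-regularized least squares), whose minimizer has the standard closed form:

$$\hat{\mathbf{a}} = \underset{\mathbf{a}}{\operatorname{arg min}} \left(\|\mathbf{y}-\mathbf{X}\mathbf{a}\|_2^2 + \lambda\|\mathbf{a}\|_2^2\right) = \left(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I}\right)^{-1}\mathbf{X}^T\mathbf{y}$$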