Feed Forward Networks: Learning
CSE 891: Deep Learning
Vishnu Boddeti
Monday September 13, 2021
Today
- Risk Minimization
- Loss Functions
- Regularization
Recap: Simple Neural Network
- Neuron pre-activation (or input activation):
  $a(x) = b + \sum_i w_i x_i = b + w^T x$
- Neuron (output) activation:
  $h(x) = g(a(x)) = g\left(b + \sum_i w_i x_i\right)$
- $w$ are the connection weights
- $b$ is the neuron bias
- $g(\cdot)$ is called the activation function
Recap: Multilayer Neural Network
- Could have $L$ hidden layers:
  - layer pre-activation for $k > 0$ (with $h^{(0)}(x) = x$):
    $a^{(k)}(x) = b^{(k)} + W^{(k)} h^{(k-1)}(x)$
  - hidden layer activation ($k$ from 1 to $L$):
    $h^{(k)}(x) = g(a^{(k)}(x))$
  - output layer activation ($k = L + 1$):
    $h^{(L+1)}(x) = o(a^{(L+1)}(x)) = f(x)$
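The recursion above maps directly to code. A minimal NumPy sketch of the forward pass (function and variable names are mine; tanh hidden units and an identity output are placeholder choices, not fixed by the slides):

```python
import numpy as np

def forward(x, weights, biases, g=np.tanh, o=lambda a: a):
    """Forward pass through L hidden layers plus an output layer.

    weights/biases hold W^(1)..W^(L+1) and b^(1)..b^(L+1);
    g is the hidden activation, o the output activation.
    """
    h = x  # h^(0)(x) = x
    for k in range(len(weights)):
        a = biases[k] + weights[k] @ h  # a^(k)(x) = b^(k) + W^(k) h^(k-1)(x)
        h = g(a) if k < len(weights) - 1 else o(a)  # hidden layers use g, output uses o
    return h  # f(x) = h^(L+1)(x)

# Tiny example: a 3 -> 4 -> 2 network with random parameters
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [np.zeros(4), np.zeros(2)]
print(forward(rng.standard_normal(3), weights, biases))
```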
Empirical Risk Minimization
- Framework to design learning algorithms
- Ideal objective: the expected risk under the data distribution $p(x, y)$:
  $\arg\min_\theta \mathbb{E}_{p(x,y)}\left[\ell(f(x; \theta), y)\right] + \lambda \Omega(\theta)$
- Since $p(x, y)$ is unknown, we instead minimize the empirical risk over the $T$ training examples:
  $\arg\min_\theta \frac{1}{T} \sum_t \ell(f(x^{(t)}; \theta), y^{(t)}) + \lambda \Omega(\theta)$
- Learning is cast as an optimization problem
- Loss function: $\ell(f(x^{(t)}; \theta), y^{(t)})$
- $\Omega(\theta)$ is a regularizer
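A minimal sketch of this objective in NumPy, instantiated with a linear model, squared loss, and L2 regularizer (all three are assumed placeholder choices; the framework itself leaves them open):

```python
import numpy as np

def empirical_risk(theta, xs, ys, f, loss, omega, lam):
    """(1/T) * sum_t loss(f(x_t; theta), y_t) + lam * Omega(theta)."""
    data_term = np.mean([loss(f(x, theta), y) for x, y in zip(xs, ys)])
    return data_term + lam * omega(theta)

# One possible instantiation: linear model, squared loss, L2 regularizer
f = lambda x, theta: theta @ x
loss = lambda y_hat, y: (y_hat - y) ** 2
omega = lambda theta: np.sum(theta ** 2)

rng = np.random.default_rng(0)
xs, ys = rng.standard_normal((20, 3)), rng.standard_normal(20)
print(empirical_risk(np.zeros(3), xs, ys, f, loss, omega, lam=0.01))
```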
Log-Likelihood
- The network output estimates $f(x)_c = p(y = c \mid x)$
- we could maximize the probabilities of $y^{(t)}$ given $x^{(t)}$ in the training set
- To frame this as a minimization, we minimize the negative log-likelihood
  $\ell(f(x), y) = -\log \sum_c 1_{(y = c)} f(x)_c = -\log f(x)_y$
- we take the log for numerical stability and mathematical simplicity
- sometimes referred to as the cross-entropy loss
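The numerical-stability point is usually handled with the log-sum-exp trick: compute $-\log \mathrm{softmax}(z)_y$ directly from the logits $z$ instead of exponentiating first. A minimal sketch (names are mine):

```python
import numpy as np

def nll_from_logits(z, y):
    """-log softmax(z)_y, computed stably via the log-sum-exp trick."""
    z = z - z.max()                        # shifting logits leaves softmax unchanged
    return np.log(np.exp(z).sum()) - z[y]  # -log f(x)_y

z = np.array([3.2, 5.1, -1.7])   # example logits
print(nll_from_logits(z, y=1))   # -log 0.87 ≈ 0.14
```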
Loss Functions: Classification
- 0-1 Loss: $\ell(\hat{y}, y) = \sum_i I(\hat{y}_i \neq y_i)$ (not differentiable, hence the surrogates below)
- Surrogate Loss Functions:
  - Squared Loss: $(y - f(x))^2$
  - Logistic Loss: $\log(1 + e^{-y f(x)})$
  - Hinge Loss: $(1 - y f(x))_+$
  - Squared Hinge Loss: $(1 - y f(x))_+^2$
  - Cross-Entropy Loss: $-\sum_i y_i \log f(x)_i$
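A quick sketch of the margin-based surrogates for binary labels $y \in \{-1, +1\}$ and score $s = f(x)$ (NumPy, names mine):

```python
import numpy as np

# Surrogate losses for binary classification: label y in {-1, +1}, score s = f(x)
logistic = lambda s, y: np.log1p(np.exp(-y * s))        # log(1 + e^{-ys})
hinge    = lambda s, y: np.maximum(0.0, 1 - y * s)      # (1 - ys)_+
sq_hinge = lambda s, y: np.maximum(0.0, 1 - y * s)**2   # (1 - ys)_+^2
squared  = lambda s, y: (y - s) ** 2                    # (y - s)^2

s = np.linspace(-2, 2, 5)  # scores for a positive example (y = +1)
for name, l in [("logistic", logistic), ("hinge", hinge),
                ("squared hinge", sq_hinge), ("squared", squared)]:
    print(f"{name:14s}", np.round(l(s, 1), 2))
```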
Loss Functions: Regression
- Euclidean Loss: $\|y - f(x)\|_2^2$
- Manhattan Loss: $\|y - f(x)\|_1$
- Huber Loss:
  $\begin{cases} \frac{1}{2}\|y - f(x)\|_2^2 & \text{for } \|y - f(x)\| < \delta \\ \delta \|y - f(x)\|_1 - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$
- KL Divergence: $\sum_i p_i \log\left(\frac{p_i}{q_i}\right)$
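A minimal sketch of the Huber loss on elementwise residuals $r = y - f(x)$ (the slide's norm form reduces to this in one dimension; the scalar-residual version is my simplification):

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber loss on residual r = y - f(x): quadratic near zero, linear in the tails."""
    a = np.abs(r)
    return np.where(a < delta, 0.5 * r ** 2, delta * a - 0.5 * delta ** 2)

print(huber(np.array([-2.0, -0.5, 0.5, 2.0])))  # [1.5, 0.125, 0.125, 1.5]
```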
Loss Functions: Embeddings
- Cosine Distance: $\frac{x^T y}{\|x\| \|y\|}$
- Triplet Loss: $(1 + d(x_i, x_j) - d(x_i, x_k))_+$
- Mahalanobis Distance: $(x - y)^T M (x - y)$
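A sketch of the triplet loss with anchor $x_i$, positive $x_j$, and negative $x_k$; the slide fixes the margin at 1, while the choice of squared Euclidean distance for $d(\cdot,\cdot)$ is my assumption:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """(margin + d(x_i, x_j) - d(x_i, x_k))_+ ; the slide uses margin = 1."""
    d_pos = np.sum((anchor - positive) ** 2)  # d(x_i, x_j), squared Euclidean (assumed)
    d_neg = np.sum((anchor - negative) ** 2)  # d(x_i, x_k)
    return max(0.0, margin + d_pos - d_neg)

rng = np.random.default_rng(0)
x_i, x_j, x_k = rng.standard_normal((3, 8))  # anchor, positive, negative embeddings
print(triplet_loss(x_i, x_j, x_k))
```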
Cross-Entropy Loss Example
| Unnormalized Log-Prob. | Unnormalized Prob. | Prob. | Correct Prob. |
| --- | --- | --- | --- |
| 3.2 | 24.5 | 0.13 | 1.00 |
| 5.1 | 164.0 | 0.87 | 0.00 |
| -1.7 | 0.18 | 0.00 | 0.00 |
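The table's arithmetic can be reproduced directly: exponentiate the logits, normalize to probabilities, and take the negative log of the correct class (the first row), giving a loss of $-\log 0.13 \approx 2.04$:

```python
import numpy as np

logits = np.array([3.2, 5.1, -1.7])     # unnormalized log-probabilities
unnorm = np.exp(logits)                 # [24.53, 164.02, 0.18]
probs = unnorm / unnorm.sum()           # [0.13, 0.87, 0.00]
loss = -np.log(probs[0])                # correct class is the first row
print(np.round(unnorm, 2), np.round(probs, 2), round(float(loss), 2))  # loss ≈ 2.04
```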
Regularization
- What is regularization?
  - the process of constraining the parameter space
  - alternatively, penalizing certain values of $\theta$
- Why do we need regularization?
  - in classical machine learning, regularization ⇔ generalization: constraining capacity is the main tool for making models generalize
  - in deep learning, regularization ≠ generalization: over-parameterized networks can generalize well even with little explicit regularization
L2 Regularization
$\Omega(\theta) = \sum_k \sum_i \sum_j \left(W_{ij}^{(k)}\right)^2 = \sum_k \|W^{(k)}\|_F^2$
- Gradient: $\nabla_{W^{(k)}} \Omega(\theta) = 2 W^{(k)}$
- Applied only to the weights, not the biases (also known as weight decay)
- Can be interpreted as placing a Gaussian prior over the weights
L1 Regularization
$\Omega(\theta) = \sum_k \sum_i \sum_j |W_{ij}^{(k)}|$
- Gradient: $\nabla_{W^{(k)}} \Omega(\theta) = \text{sign}(W^{(k)})$, where
  $\text{sign}(W^{(k)})_{ij} = \begin{cases} 1 & \text{for } W_{ij}^{(k)} > 0 \\ -1 & \text{for } W_{ij}^{(k)} < 0 \end{cases}$
- Applied only to the weights, not the biases
- Unlike L2, L1 pushes some weights to be exactly zero, yielding sparse solutions
- Can be interpreted as placing a Laplacian prior over the weights
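Both penalties and their gradients in a few lines of NumPy (a sketch; note `np.sign` returns 0 at exactly 0, the usual subgradient convention, and the SGD helper is my illustration of how the penalty gradient enters the update):

```python
import numpy as np

def l2_grad(W):
    return 2 * W          # gradient of sum_ij W_ij^2 = ||W||_F^2

def l1_grad(W):
    return np.sign(W)     # (sub)gradient of sum_ij |W_ij|; 0 where W_ij == 0

# One SGD step with an explicit penalty gradient; biases are left unregularized
def sgd_step(W, data_grad, lr=0.1, lam=1e-3, penalty_grad=l2_grad):
    return W - lr * (data_grad + lam * penalty_grad(W))

W = np.array([[0.5, -0.2], [0.0, 1.5]])
print(l2_grad(W))
print(l1_grad(W))
```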
Regularization Geometry
Equivalent Optimization Problems:
- P1: $\arg\min_W \|y - Wx\|_2^2 + \lambda \|W\|_2^2$
- P2: $\arg\min_W \|W\|_2^2 \quad \text{s.t.} \quad \|y - Wx\|_2^2 \le \alpha$
Looking Through the Bayesian Lens
- Bayes Rule:
  $p(\theta \mid x, y) = \frac{p(y \mid x, \theta)\, p(\theta)}{p(y \mid x)} = \frac{p(y \mid x, \theta)\, p(\theta)}{\int_\theta p(y \mid x, \theta)\, p(\theta)\, d\theta}$
- Likelihood:
  $p(y \mid x, \theta) = \frac{1}{Z} e^{-E(\theta, x, y)}$
- Prior:
  $p(\theta) = \frac{1}{Z'} e^{-E(\theta)}$
Maximum-Aposteriori-Learning (MAP)
$\hat{\theta} = \arg\max_\theta \prod_i p(\theta \mid x_i, y_i)$
$= \arg\max_\theta \frac{p(\theta) \prod_i p(y_i \mid x_i, \theta)}{\prod_i p(y_i \mid x_i)}$
$= \arg\max_\theta p(\theta) \prod_i p(y_i \mid x_i, \theta)$ (the denominator does not depend on $\theta$)
$= \arg\min_\theta \left( E(\theta, x, y) + E(\theta) \right)$ (taking the negative log)
- e.g., for linear regression with parameters $a$: $E(\theta, x, y) = \|y - Xa\|_2^2$ and $E(\theta) = \lambda \|a\|_2^2$, which recovers ridge regression
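Under these energy choices the MAP estimate has the closed form $a = (X^T X + \lambda I)^{-1} X^T y$; a small NumPy check (the synthetic data and names are mine):

```python
import numpy as np

# MAP with Gaussian likelihood and Gaussian prior = ridge regression:
# minimizing ||y - X a||_2^2 + lam * ||a||_2^2 has the closed form
# a = (X^T X + lam I)^{-1} X^T y
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
a_true = np.array([1.0, -2.0, 0.5])
y = X @ a_true + 0.1 * rng.standard_normal(50)

lam = 0.1
a_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(np.round(a_map, 2))  # close to a_true for small lam
```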