Feed Forward Networks: Introduction
CSE 891: Deep Learning
Vishnu Boddeti
Today
- Artificial Neuron
- Activation Functions
- Capacity of Neural Networks
- Biological Motivation
Artificial Neuron
- Neuron pre-activation (or input activation)
- $a(x) = b + \sum_i w_i x_i = b + w^\top x$
- Neuron (output) activation
- $h(x) = g(a(x)) = g(b + \sum_i w_i x_i)$
- w are the connection weights
- b is the neuron bias
- $g(\cdot)$ is called the activation function (a small NumPy sketch of the full neuron follows)
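A minimal sketch of these definitions, assuming made-up weights, bias, and input (none of these values come from the slides):

```python
import numpy as np

def neuron(x, w, b, g):
    """Single artificial neuron: pre-activation a(x) = b + w^T x, output h(x) = g(a(x))."""
    a = b + np.dot(w, x)        # pre-activation (input activation)
    return g(a)                 # output activation

# Illustrative values only
x = np.array([1.0, -2.0, 0.5])               # input
w = np.array([0.3, 0.1, -0.4])               # connection weights
b = 0.2                                      # bias
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
print(neuron(x, w, b, sigmoid))              # roughly 0.525, squashed into (0, 1)
```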
Artificial Neuron
- range determined by $g(\cdot)$
- bias $b$ only changes the position of the ridge
Linear Activation
$g(x) = x$
- Performs no input squashing
- Quite a boring function...
Sigmoid Activation
$g(x) = \frac{1}{1 + e^{-x}}$
- Squashes the neuron's pre-activation between 0 and 1
- Always positive
- Bounded
- Strictly increasing
Tanh Activation
$g(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
- Squashes the neuron's pre-activation between -1 and 1
- Can be positive or negative
- Bounded
- Strictly increasing
Rectified Linear Unit Activation
$g(x) = \max(0, x)$
- Bounded below by 0 (always non-negative)
- Not upper bounded
- Monotonically non-decreasing (flat at zero for negative inputs)
- Tends to yield neurons with sparse activities (all four activations are sketched in code below)
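A small illustrative sketch of the four activation functions in NumPy (not course code; the test values are arbitrary):

```python
import numpy as np

def linear(a):
    return a                          # no squashing

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))   # squashes into (0, 1)

def tanh(a):
    return np.tanh(a)                 # squashes into (-1, 1)

def relu(a):
    return np.maximum(0.0, a)         # non-negative, unbounded above, encourages sparse activations

a = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for g in (linear, sigmoid, tanh, relu):
    print(g.__name__, g(a))
```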
Capacity of Neural Networks
Single Neuron
- Could do binary classification:
- with sigmoid, can interpret neuron as estimating p(y=1|x)
- also known as logistic regression classifier
- if greater than 0.5, predict class 1
- otherwise, predict class 0
- similar idea can be used with Tanh
- the decision boundary is linear (see the sketch below)
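A hedged sketch of this interpretation with made-up, not learned, parameters: the sigmoid neuron outputs an estimate of p(y = 1 | x), and thresholding it at 0.5 is equivalent to testing which side of the linear boundary $b + w^\top x = 0$ the input falls on.

```python
import numpy as np

def predict(x, w, b):
    """Binary classification with a single sigmoid neuron (logistic regression)."""
    p = 1.0 / (1.0 + np.exp(-(b + np.dot(w, x))))   # estimate of p(y = 1 | x)
    return 1 if p > 0.5 else 0                      # threshold at 0.5

# Illustrative (hand-picked) parameters
w = np.array([2.0, -1.0])
b = -0.5
print(predict(np.array([1.0, 0.0]), w, b))   # 1: positive side of the line
print(predict(np.array([0.0, 1.0]), w, b))   # 0: negative side of the line
```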
Capacity of a Single Neuron
- Can solve linearly separable problems
Capacity of a Single Neuron
- Cannot solve non-linearly separable problems....
- ...unless the input is transformed into a better representation (see the XOR sketch below)
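XOR is the classic example: no single neuron separates it in the original $(x_1, x_2)$ space, but a hand-crafted extra feature, the product $x_1 x_2$ (chosen here purely for illustration), makes it linearly separable:

```python
import numpy as np

# XOR: not linearly separable in the original two inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Better representation: append the product x1 * x2 as a third feature
phi = np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])

# In this representation a single linear neuron separates the classes, e.g.
# a(phi) = x1 + x2 - 2*x1*x2 - 0.5 is positive exactly for the XOR-positive points
a = phi @ np.array([1.0, 1.0, -2.0]) - 0.5
print((a > 0).astype(int))   # [0 1 1 0], matching y
```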
Neural Network with Hidden Layer
- Hidden layer pre-activation:
$a(x) = b^{(1)} + W^{(1)} x$
$\left(a(x)_i = b_i^{(1)} + \sum_j W_{i,j}^{(1)} x_j\right)$
- Hidden layer activation:
$h(x) = g(a(x))$
- Output layer activation (a full forward-pass sketch follows):
$f(x) = o\left(b^{(2)} + (w^{(2)})^\top h(x)\right)$
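Putting the three steps together, a one-hidden-layer forward pass might look like the sketch below; the layer sizes, tanh hidden activation $g(\cdot)$, and sigmoid output $o(\cdot)$ are illustrative assumptions, not values from the slides.

```python
import numpy as np

def forward(x, W1, b1, w2, b2):
    """Single hidden layer: a(x) = b1 + W1 x, h(x) = g(a(x)), f(x) = o(b2 + w2^T h(x))."""
    a = b1 + W1 @ x                                   # hidden pre-activation
    h = np.tanh(a)                                    # hidden activation g(.)
    return 1.0 / (1.0 + np.exp(-(b2 + w2 @ h)))       # output activation o(.) = sigmoid

rng = np.random.default_rng(0)
x = rng.normal(size=4)                                # 4 inputs
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)         # 3 hidden units
w2, b2 = rng.normal(size=3), 0.0
print(forward(x, W1, b1, w2, b2))                     # a value in (0, 1)
```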
Softmax Activation Function
- For multi-class classification:
- we need multiple outputs (1 output per class)
- we would like to estimate the conditional probability p(y=c|x)
- Softmax activation function at the output:
$o(a) = \mathrm{softmax}(a) = \left[\frac{\exp a_1}{\sum_c \exp a_c}, \ldots, \frac{\exp a_C}{\sum_c \exp a_c}\right]^\top$
- strictly positive
- sums to one
- Predicted class: one with highest estimated probability
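A minimal softmax sketch; subtracting the maximum pre-activation is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(a):
    """o(a)_c = exp(a_c) / sum_c' exp(a_c'): strictly positive, sums to one."""
    z = np.exp(a - np.max(a))    # shift by the max for numerical stability
    return z / z.sum()

a = np.array([2.0, 1.0, 0.1])    # illustrative pre-activations, one per class
p = softmax(a)
print(p, p.sum())                # class probabilities summing to 1
print(np.argmax(p))              # predicted class: highest estimated probability
```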
Multi-Layer Neural Network
- Could have L hidden layers:
- layer pre-activation for $k > 0$ (with $h^{(0)}(x) = x$):
$a^{(k)}(x) = b^{(k)} + W^{(k)} h^{(k-1)}(x)$
- hidden layer activation (k from 1 to L):
$h^{(k)}(x) = g(a^{(k)}(x))$
- output layer activation ($k = L+1$; a looped forward-pass sketch follows):
$h^{(L+1)}(x) = o(a^{(L+1)}(x)) = f(x)$
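The recursion above can be sketched as a loop over the $L$ hidden layers; the random weights and the softmax output layer below are assumptions made only to keep the example runnable:

```python
import numpy as np

def forward(x, weights, biases, g=np.tanh):
    """h^(0) = x; h^(k) = g(b^(k) + W^(k) h^(k-1)) for k = 1..L; softmax at the output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):        # L hidden layers
        h = g(b + W @ h)
    a_out = biases[-1] + weights[-1] @ h               # output pre-activation a^(L+1)
    z = np.exp(a_out - np.max(a_out))                  # stable softmax
    return z / z.sum()                                 # f(x) = o(a^(L+1)(x))

rng = np.random.default_rng(1)
sizes = [4, 5, 5, 3]                                   # 4 inputs, two hidden layers of 5, 3 classes
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(forward(rng.normal(size=4), weights, biases))    # class probabilities
```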
Capacity of Single Hidden Layer Neural Network
Universal Approximation
- Universal approximation theorem (Hornik, 1991):
- "a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units"
- The result applies for sigmoid, tanh and many other hidden layer activation functions.
- This is a good result, but it doesn’t mean there is a learning algorithm that can find the necessary parameter values.
- Many other function classes are also known to be universal approximators.

