Attention
CSE 891: Deep Learning
Vishnu Boddeti
- RNN: Works on Ordered Sequences
- Good at long sequences: after one RNN layer, $h_T$ "sees" the whole sequence
- Not parallelizable: need to compute hidden states sequentially (see the sketch below)
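To make the sequential dependence concrete, here is a minimal sketch of the recurrence, assuming PyTorch and arbitrary illustrative dimensions and weights: each $h_t$ needs $h_{t-1}$, so the time loop cannot be parallelized, but $h_T$ depends on every input after a single layer.

```python
# Minimal sketch of a vanilla RNN recurrence (illustrative dimensions and weights).
import torch

T, D, H = 8, 16, 32                      # sequence length, input dim, hidden dim (arbitrary)
x = torch.randn(T, D)
W_x = torch.randn(D, H) * 0.1
W_h = torch.randn(H, H) * 0.1
h = torch.zeros(H)

# Each h_t depends on h_{t-1}, so this loop cannot be parallelized over time steps,
# but after one layer the final h_T has "seen" every x_t in the sequence.
for t in range(T):
    h = torch.tanh(x[t] @ W_x + h @ W_h)
```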
- 1D Convolution: Works on Multidimensional Grids
- Bad at long sequences: need to stack many conv layers for outputs to "see" the whole sequence (see the sketch below)
- Highly parallel: each output can be computed in parallel
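A minimal sketch of the receptive-field issue, assuming PyTorch and arbitrary dimensions: with kernel size $k$ and stride 1, each conv layer widens the receptive field by only $k - 1$, so roughly $(T - 1)/(k - 1)$ layers must be stacked before an output "sees" a sequence of length $T$, even though each layer is fully parallel across positions.

```python
# Minimal sketch: stacking 1D convolutions until the receptive field covers the sequence.
import math
import torch
import torch.nn as nn

T, D, k = 64, 16, 3                          # sequence length, channels, kernel size (arbitrary)
x = torch.randn(1, D, T)                     # nn.Conv1d expects (batch, channels, length)

# With stride 1, L layers of kernel size k give a receptive field of 1 + L*(k - 1),
# so covering the whole sequence needs about (T - 1)/(k - 1) layers.
n_layers = math.ceil((T - 1) / (k - 1))
stack = nn.Sequential(*[
    nn.Conv1d(D, D, kernel_size=k, padding=k // 2) for _ in range(n_layers)
])

y = stack(x)                                 # within each layer, all positions are computed in parallel
print(n_layers, y.shape)                     # 32 layers, output shape (1, 16, 64)
```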
- Self-Attention: Works on Sets of Vectors
- Good at long sequences: after one self-attention layer, each output "sees" all inputs (see the sketch below)
- Highly parallel: each output can be computed in parallel
- Very memory intensive: the attention matrix grows quadratically with the number of inputs
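A minimal sketch of scaled dot-product self-attention, assuming PyTorch; the projection matrices and dimensions are illustrative. Each row of the output $Y$ is a weighted sum over all $N$ inputs after a single layer, and the $N \times N$ attention matrix is the source of the quadratic memory cost.

```python
# Minimal sketch of scaled dot-product self-attention over a set of N input vectors.
import torch
import torch.nn.functional as F

N, D = 64, 32                                # number of inputs, feature dim (arbitrary)
X = torch.randn(N, D)
W_q, W_k, W_v = (torch.randn(D, D) * 0.1 for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
A = F.softmax(Q @ K.T / D ** 0.5, dim=-1)    # N x N attention matrix: the memory bottleneck
Y = A @ V                                    # each output is a weighted sum over ALL inputs

print(A.shape)                               # torch.Size([64, 64])
```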