NLP and Transformers
CSE 891: Deep Learning
Vishnu Boddeti

| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
|---|---|---|---|
| Self-Attention | $\mathcal{O}(n^2 \cdot d)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
| Recurrent | $\mathcal{O}(n \cdot d^2)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ |
| Convolutional | $\mathcal{O}(k \cdot n \cdot d^2)$ | $\mathcal{O}(1)$ | $\mathcal{O}(\log_k(n))$ |
| Self-Attention (restricted) | $\mathcal{O}(r \cdot n \cdot d)$ | $\mathcal{O}(1)$ | $\mathcal{O}(n/r)$ |
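To make the table concrete, the sketch below plugs numbers into each row. It assumes the usual reading of the symbols, which the table itself does not define: `n` is the sequence length, `d` the representation dimension, `k` the convolution kernel width, and `r` the neighborhood size for restricted self-attention. The function names are hypothetical, and the counts are leading-order terms with constants dropped, not measured FLOPs.

```python
def conv_path_length(n, k):
    """Smallest number of stacked hops h with k**h >= n, i.e. ceil(log_k(n))."""
    hops, reach = 0, 1
    while reach < n:
        reach *= k
        hops += 1
    return hops


def per_layer_cost(n, d, k=3, r=None):
    """Leading-order cost per layer type, following the table's asymptotic entries.

    Assumed symbols: n = sequence length, d = representation dimension,
    k = kernel width, r = restricted-attention neighborhood (defaults to n).
    """
    if r is None:
        r = n
    return {
        "self_attention": {
            "ops": n * n * d, "sequential": 1, "max_path": 1,
        },
        "recurrent": {
            "ops": n * d * d, "sequential": n, "max_path": n,
        },
        "convolutional": {
            "ops": k * n * d * d, "sequential": 1,
            "max_path": conv_path_length(n, k),
        },
        "restricted_self_attention": {
            "ops": r * n * d, "sequential": 1,
            "max_path": -(-n // r),  # ceil(n / r) with integer arithmetic
        },
    }


if __name__ == "__main__":
    # Example: a short sequence (n < d), where self-attention is cheaper
    # per layer than a recurrent layer.
    for name, cost in per_layer_cost(n=128, d=512, k=3, r=16).items():
        print(f"{name:28s} {cost}")
```

Self-attention and recurrent layers cost the same when `n == d`. Self-attention is cheaper for shorter sequences and more expensive for longer ones, which is the motivation for the restricted variant in the last row.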

| Model | Layers | Width | Heads | Params | Training Data | Training Hardware (Time) |
|---|---|---|---|---|---|---|
| Transformer-Base | 12 | 512 | 8 | 65M | | 8x P100 (12 hrs) |
| Transformer-Large | 12 | 1024 | 16 | 213M | | 8x P100 (3.5 days) |
| BERT-Base | 12 | 768 | 12 | 110M | 13GB | |
| BERT-Large | 24 | 1024 | 16 | 340M | 13GB | |
| XLNet-Large | 24 | 1024 | 16 | 340M | 126GB | 512x TPU-v3 (2.5 days) |
| RoBERTa | 24 | 1024 | 16 | 355M | 160GB | 1024x V100 (1 day) |
| GPT-2 | 48 | 1600 | ? | 1.5B | 40GB | |
| Megatron-LM | 72 | 3072 | 32 | 8.3B | 174GB | 512x V100 (9 days) |
| Turing-NLG | 78 | 4256 | 28 | 17B | ? | 256x V100 |
| GPT-3 | 96 | 12288 | 96 | 175B | 694GB | ? |
| Gopher | 80 | 16384 | 128 | 280B | 10.55TB | 4096x TPUv3 (38 days) |
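The Params column can be sanity-checked from the Layers and Width columns. The sketch below uses a common rule of thumb, not something stated in the table: each Transformer block has four width-by-width attention projections (query, key, value, output) and a two-layer feed-forward network with a 4x hidden expansion, giving about 12 · width² weights per block. Embeddings, biases, and layer norms are ignored, and the function name and the `ffn_mult` parameter are hypothetical. The two Transformer rows appear to count encoder and decoder layers together, and decoder blocks add cross-attention, so the estimate fits those rows less well.

```python
def approx_block_params(layers, width, ffn_mult=4):
    """Rough weight count for a stack of Transformer blocks.

    Assumptions: 4 attention projections of size width x width, plus a
    feed-forward network with hidden size ffn_mult * width. Embeddings,
    biases, and layer-norm parameters are not counted.
    """
    attention = 4 * width * width          # Q, K, V, and output projections
    feed_forward = 2 * ffn_mult * width * width  # up- and down-projection
    return layers * (attention + feed_forward)


if __name__ == "__main__":
    # (layers, width) pairs copied from the table above.
    models = {
        "BERT-Base": (12, 768),
        "BERT-Large": (24, 1024),
        "GPT-2": (48, 1600),
        "Megatron-LM": (72, 3072),
        "GPT-3": (96, 12288),
    }
    for name, (layers, width) in models.items():
        estimate = approx_block_params(layers, width)
        print(f"{name:12s} ~{estimate / 1e9:.2f}B block parameters")
```

For GPT-3 the estimate comes to roughly 174B against the listed 175B. For the smaller models the gap is larger (about 85M against 110M for BERT-Base) because the token embeddings, which this sketch omits, make up a bigger share of the total.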