NLP and Transformers
CSE 891: Deep Learning
Vishnu Boddeti
Layer Type | Complexity Per Layer | Sequential Ops | Max Path Length |
---|---|---|---|
Self-Attention | $\mathcal{O}(n^2 \cdot d)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
Recurrent | $\mathcal{O}(n \cdot d^2)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ |
Convolutional | $\mathcal{O}(k \cdot n \cdot d^2)$ | $\mathcal{O}(1)$ | $\mathcal{O}(\log_k(n))$ |
Self-Attention (restricted) | $\mathcal{O}(r \cdot n \cdot d)$ | $\mathcal{O}(1)$ | $\mathcal{O}(n/r)$ |

Here $n$ is the sequence length, $d$ the representation dimension, $k$ the convolution kernel size, and $r$ the neighborhood size in restricted self-attention.
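To make the first row concrete, below is a minimal single-head self-attention sketch in NumPy (illustrative code, not from the lecture). The $n \times n$ score matrix $QK^{\top}$ is what produces the $\mathcal{O}(n^2 \cdot d)$ per-layer cost, and because every position attends to every other position directly, the maximum path length is $\mathcal{O}(1)$.

```python
import numpy as np

# Minimal single-head self-attention (a sketch; toy sizes, no masking or batching).
def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # projections: O(n * d^2)
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # n x n scores: the O(n^2 * d) term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                           # weighted sum of values: O(n^2 * d)

n, d = 128, 64                                   # toy sequence length and width
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)       # (128, 64)
```

Doubling $n$ quadruples the score matrix, which is why the restricted variant in the last row attends only to a neighborhood of $r$ positions, trading the $\mathcal{O}(1)$ path length for $\mathcal{O}(n/r)$.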
Model | Layers | Width | Heads | Params | Training Data | Training Compute |
---|---|---|---|---|---|---|
Transformer-Base | 12 | 512 | 8 | 65M | | 8x P100 (12 hrs) |
Transformer-Large | 12 | 1024 | 16 | 213M | | 8x P100 (3.5 days) |
BERT-Base | 12 | 768 | 12 | 110M | 13GB | |
BERT-Large | 24 | 1024 | 16 | 340M | 13GB | |
XLNet-Large | 24 | 1024 | 16 | 340M | 126GB | 512x TPU-v3 (2.5 days) |
RoBERTa | 24 | 1024 | 16 | 355M | 160GB | 1024x V100 (1 day) |
GPT-2 | 48 | 1600 | ? | 1.5B | 40GB | |
Megatron-LM | 72 | 3072 | 32 | 8.3B | 174GB | 512x V100 (9 days) |
Turing-NLG | 78 | 4256 | 28 | 17B | ? | 256x V100 |
GPT-3 | 96 | 12288 | 96 | 175B | 694GB | ? |
Gopher | 80 | 16384 | 128 | 280B | 10.55TB | 4096x TPUv3 (38 days) |
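The parameter counts can be sanity-checked with a back-of-the-envelope estimate (an approximation for illustration, not the exact formula used in each paper): each transformer block has roughly $4d^2$ parameters in the attention projections and $8d^2$ in the feed-forward network, so a model with $L$ layers of width $d$ has about $12 L d^{2}$ parameters, ignoring embeddings, biases, and layer norms.

```python
# Rough parameter estimate: ~12 * layers * width^2 per model
# (attention projections ~4*d^2 + feed-forward ~8*d^2 per block;
#  embeddings, biases, and layer norms are ignored).
def approx_params(layers: int, width: int) -> int:
    return 12 * layers * width ** 2

for name, layers, width, reported in [
    ("BERT-Large", 24, 1024, "340M"),
    ("Megatron-LM", 72, 3072, "8.3B"),
    ("Turing-NLG", 78, 4256, "17B"),
    ("GPT-3", 96, 12288, "175B"),
]:
    print(f"{name:12s} ~{approx_params(layers, width) / 1e9:6.2f}B  (reported {reported})")
```

The estimate lands close to the reported counts for the larger decoder-only models; for BERT-Large the remaining gap is mostly the token-embedding matrix, which this approximation leaves out.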