In this blog post, we’ll explore how to estimate various properties of a Transformer model, including:
- Number of parameters
- FLOPs (Floating Point Operations)
- Peak memory footprint
- Checkpoint size
We’ll use a GPT-2 style model as our example, but the principles can be applied to other Transformer architectures as well.
This post is based on Andrej Karpathy’s nanoGPT repository.
Model Configuration
Let’s start by defining our model’s configuration:
max_seq_len = 1024
vocab_size = 50257
n_layer = 12
n_head = 12
n_embd = 768
bias = False
This configuration represents a GPT-2 small model with 12 layers, 12 attention heads, and an embedding dimension of 768. The sequence length (max_seq_len) is set to 1024, and the vocabulary size is 50,257. Setting bias = False means the linear layers and LayerNorms carry no bias terms, which slightly reduces the parameter count.
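For convenience, these hyperparameters can also be bundled into a single config object. Below is a minimal sketch using this post's variable names (nanoGPT itself defines a similar GPTConfig dataclass, though its field names differ slightly):
from dataclasses import dataclass

@dataclass
class GPTConfig:
    max_seq_len: int = 1024
    vocab_size: int = 50257
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    bias: bool = False

config = GPTConfig()  # defaults correspond to GPT-2 small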
As you can see below, the GPT-2 model comes in different sizes, each with varying numbers of layers, attention heads, and embedding sizes:
| Model Type | No. of Layers | No. of Heads | Embedding Size | Parameters |
| --- | --- | --- | --- | --- |
| gpt2 | 12 | 12 | 768 | 124M |
| gpt2-medium | 24 | 16 | 1024 | 350M |
| gpt2-large | 36 | 20 | 1280 | 774M |
| gpt2-xl | 48 | 25 | 1600 | 1558M |
Scaling these hyperparameters is a common way to create a “family” of models with varying sizes and capabilities.
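As a concrete illustration, the whole family from the table above fits in a small preset dictionary; this is a sketch in the spirit of the preset table nanoGPT uses when loading pretrained GPT-2 weights (vocabulary size and sequence length stay fixed across the family):
# GPT-2 family presets; only depth, head count, and width change.
gpt2_presets = {
    'gpt2':        dict(n_layer=12, n_head=12, n_embd=768),   # ~124M params
    'gpt2-medium': dict(n_layer=24, n_head=16, n_embd=1024),  # ~350M params
    'gpt2-large':  dict(n_layer=36, n_head=20, n_embd=1280),  # ~774M params
    'gpt2-xl':     dict(n_layer=48, n_head=25, n_embd=1600),  # ~1558M params
}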
Estimating the Number of Parameters
To estimate the number of parameters in our model, we need to consider each component of the Transformer architecture. Let’s break it down:
Embeddings
- Token embeddings: vocab_size * n_embd
- Position embeddings: max_seq_len * n_embd
This assumes a learnable position embedding matrix.
Attention Blocks (for each layer)
- Layer normalization: n_embd (scale only, since bias = False)
- Key, Query, Value projections: n_embd * 3 * n_embd
- Output projection: n_embd * n_embd
This assumes that multi-head attention splits the embedding into n_head parts, each of size n_embd / n_head.
MLP Blocks (for each layer)
- Layer normalization: n_embd
- First linear layer: n_embd * 4 * n_embd
- Second linear layer: 4 * n_embd * n_embd
This assumes a feed-forward network with an expansion factor of 4 (i.e., ffw_size = 4 * n_embd).
Final Layer Norm and Output
- Final layer normalization: n_embd
- Output layer: 0 (due to weight sharing with token embeddings)
Now, let’s implement a function to calculate these parameters:
from collections import OrderedDict
def params():
    """ Estimates the number of parameters in the model"""
    out = OrderedDict()
    # Token and position embeddings
    out['embedding/position'] = n_embd * max_seq_len
    out['embedding/token'] = n_embd * vocab_size
    out['embedding'] = out['embedding/position'] + out['embedding/token']
    # Attention blocks
    out['attention/ln'] = n_embd # Layer norm
    out['attention/kqv'] = n_embd * 3*n_embd
    out['attention/proj'] = n_embd**2
    out['attention'] = out['attention/ln'] + out['attention/kqv'] + out['attention/proj']
    # MLP blocks
    ffw_size = 4*n_embd # Feed-forward size
    out['mlp/ln'] = n_embd
    out['mlp/ffw'] = n_embd * ffw_size
    out['mlp/proj'] = ffw_size * n_embd
    out['mlp'] = out['mlp/ln'] + out['mlp/ffw'] + out['mlp/proj']
    # Transformer block
    out['block'] = out['attention'] + out['mlp']
    out['transformer'] = n_layer * out['block']
    # Final layer norm and output
    out['ln_f'] = n_embd
    out['dense'] = 0 # 0 because of parameter sharing with embedding layer
    # Total
    out['total'] = out['embedding'] + out['transformer'] + out['ln_f'] + out['dense']
    return out
Let’s run this function and display the results:
p = params()
params_total = p['total']
print(f"Total parameters: {params_total:,}")
print(f"Expected parameters : {124337664:,}")
print("\nParameter breakdown:")
print(f"{'Name':20s} {'Parameters':>12s} {'Ratio (%)':>10s}")
print("-" * 44)
for k, v in p.items():
    print(f"{k:20s} {v:12,d} {v/params_total*100:10.4f}")
This will output:
Total parameters: 124,337,664
Expected parameters: 124,337,664
Parameter breakdown:
Name Parameters Ratio (%)
--------------------------------------------
embedding/position 786,432 0.6325
embedding/token 38,597,376 31.0424
embedding 39,383,808 31.6749
attention/ln 768 0.0006
attention/kqv 1,769,472 1.4231
attention/proj 589,824 0.4744
attention 2,360,064 1.8981
mlp/ln 768 0.0006
mlp/ffw 2,359,296 1.8975
mlp/proj 2,359,296 1.8975
mlp 4,719,360 3.7956
block 7,079,424 5.6937
transformer 84,953,088 68.3245
ln_f 768 0.0006
dense 0 0.0000
total 124,337,664 100.0000
As we can see, the total number of parameters is approximately 124 million, which matches the expected value for a GPT-2 small model.
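As an optional cross-check (assuming PyTorch and the Hugging Face transformers package are available), you can count the parameters of the actual pretrained GPT-2 checkpoint. That checkpoint keeps bias vectors in its linear and LayerNorm layers, so expect a count slightly above our bias = False estimate:
from transformers import GPT2LMHeadModel

hf_model = GPT2LMHeadModel.from_pretrained("gpt2")
hf_params = sum(p.numel() for p in hf_model.parameters())
print(f"Pretrained gpt2 parameters: {hf_params:,}")
# Slightly larger than 124,337,664 because the pretrained weights include biases.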
Estimating Checkpoint Size
Now that we know the number of parameters, we can estimate the size of a training checkpoint. Parameters are typically stored in 32-bit floating point (fp32), and the AdamW optimizer keeps two additional fp32 buffers per parameter (its first- and second-moment estimates), so a checkpoint stores roughly three times the raw parameter bytes.
params_bytes = params_total * 4 # 4 bytes per fp32 parameter
params_and_buffers_bytes = params_bytes + 2 * params_bytes
checkpoint_size_gb = params_and_buffers_bytes / 1e9
print(f"Estimated checkpoint size: {checkpoint_size_gb:.2f} GB")
Estimated checkpoint size: 1.49 GB
In practice, the actual checkpoint size might be slightly larger due to additional metadata and potential padding. We can compare our estimate to the actual file size:
measured_bytes = 1542470366 # Measured with 'wc -c ckpt.pt'
measured_gb = measured_bytes / 1e9
fluff_ratio = measured_bytes / params_and_buffers_bytes * 100
print(f"Measured checkpoint size with `wc -c ckpt.pt`: {measured_gb:.2f} GB")
print(f"Fluff ratio: {fluff_ratio:.2f}%")
Measured checkpoint size with `wc -c ckpt.pt`: 1.54 GB
Fluff ratio: 103.38%
This shows that the actual checkpoint size is about 1.54 GB, with a fluff ratio of 103.38%.
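The same number also gives a quick lower bound on the memory footprint during training: the fp32 parameters and AdamW buffers have to sit in GPU memory the whole time. A back-of-the-envelope check, assuming the 40 GB A100 variant (activations, gradients, and framework overhead come on top of this and usually dominate the peak):
gpu_memory_bytes = 40e9  # A100 40GB
memory_ratio = params_and_buffers_bytes / gpu_memory_bytes
print(f"Params + AdamW buffers as a fraction of A100 40GB: {memory_ratio*100:.2f}%")
# Roughly 3.7% -- most of the peak memory during training comes from activations.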
Estimating FLOPs
Next, let’s estimate the number of Floating Point Operations (FLOPs) for a single forward pass through our model. We’ll focus on the most computationally intensive operations, primarily matrix multiplications.
First, recall that for a matrix multiplication between two matrices A and B of sizes (m, n) and (n, p), respectively, the number of FLOPs is 2 * m * n * p. The factor of 2 comes from the fact that each element of the output matrix requires one multiplication and one addition.
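To make the rule concrete, here is a quick check against one matmul we will meet again below: projecting a single 1024-token sequence to keys, queries, and values multiplies a (max_seq_len, n_embd) activation matrix by an (n_embd, 3 * n_embd) weight matrix:
# Sanity check of the 2 * m * n * p rule on the kqv projection of one sequence.
m, n, p = max_seq_len, n_embd, 3 * n_embd
print(f"kqv matmul FLOPs: {2 * m * n * p:,}")  # 3,623,878,656 -- matches attention/kqv below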
Let’s break down the FLOPs for different components of the model. Note that we only count the FLOPs from matrix multiplications, since all other operations (LayerNorm, softmax, etc.) are negligible by comparison.
Attention Blocks
- Projection to key, query, values: 2 * max_seq_len * (n_embd * 3 * n_embd)
- Score calculation: 2 * max_seq_len * max_seq_len * n_embd
- Reduction of values: 2 * n_head * (max_seq_len * max_seq_len * head_size), where head_size = n_embd / n_head
- Final linear projection: 2 * max_seq_len * (n_embd * n_embd)
def flops():
    out = OrderedDict()
    head_size = n_embd // n_head
    # Attention blocks
    # 1) Projection to key, query, values
    out['attention/kqv'] = 2 * max_seq_len * (n_embd * 3*n_embd)
    # 2) Calculating attention scores
    out['attention/scores'] = 2 * max_seq_len * max_seq_len * n_embd
    # 3) Reduction of values
    out['attention/reduce'] = 2 * n_head * (max_seq_len * max_seq_len * head_size)
    # 4) Final linear projection
    out['attention/proj'] = 2 * max_seq_len * (n_embd * n_embd)
    out['attention'] = sum(out['attention/'+k] for k in ['kqv', 'scores', 'reduce', 'proj'])
    # MLP blocks
    ffw_size = 4*n_embd
    out['mlp/ffw1'] = 2 * max_seq_len * (n_embd * ffw_size)
    out['mlp/ffw2'] = 2 * max_seq_len * (ffw_size * n_embd)
    out['mlp'] = out['mlp/ffw1'] + out['mlp/ffw2']
    # Transformer block and total
    out['block'] = out['attention'] + out['mlp']
    out['transformer'] = n_layer * out['block']
    out['dense'] = 2 * max_seq_len * (n_embd * vocab_size)
    out['forward_total'] = out['transformer'] + out['dense']
    out['backward_total'] = 2 * out['forward_total'] # Estimate backward pass as 2x forward
    out['total'] = out['forward_total'] + out['backward_total']
    return out
f = flops()
flops_total = f['forward_total']
print("FLOPs breakdown:")
print(f"{'Name':20s} {'FLOPs':>14s} {'Ratio (%)':>10s}")
print("-" * 46)
for k, v in f.items():
    print(f"{k:20s} {v:14,d} {v/flops_total*100:10.4f}")
This gives us a detailed breakdown of FLOPs for different components of the model:
FLOPs breakdown:
Name FLOPs Ratio (%)
----------------------------------------------
attention/kqv 3,623,878,656 1.2426
attention/scores 1,610,612,736 0.5522
attention/reduce 1,610,612,736 0.5522
attention/proj 1,207,959,552 0.4142
attention 8,053,063,680 2.7612
mlp/ffw1 4,831,838,208 1.6567
mlp/ffw2 4,831,838,208 1.6567
mlp 9,663,676,416 3.3135
block 17,716,740,096 6.0747
transformer 212,600,881,152 72.8963
dense 79,047,426,048 27.1037
forward_total 291,648,307,200 100.0000
backward_total 583,296,614,400 200.0000
total 874,944,921,600 300.0000
Model FLOPs Utilization (MFU)
To calculate the Model FLOPs Utilization (MFU), we need to compare our model’s achieved FLOPs to the theoretical peak performance of the GPU. Let’s assume we’re using an NVIDIA A100 GPU, which has a theoretical peak performance of 312 TFLOPS for bfloat16 operations on tensor cores.
batch_size = 20 * 5 # Total batch size (including gradient accumulation)
measured_time = 0.755 # Seconds per iteration
measured_throughput = batch_size / measured_time
flops_achieved = f['total'] * measured_throughput
a100_flops_promised = 312e12 # 312 TFLOPS for A100
# The fraction of the A100's peak performance that we're achieving
mfu = flops_achieved / a100_flops_promised * 100
print(f"Model FLOPs Utilization: {mfu:.2f}%")
This gives us an MFU of about 37.14%, which indicates that there’s room for optimization in our training process.
Estimating Training Time
Finally, let’s estimate the total time needed to train our model using the 6ND approximation, which says total training compute is roughly 6 FLOPs per parameter per token:
model_size = params()['total']
tokens_num = 300e9 # 300B tokens in the dataset
a100_flops = 312e12 # 312 TFLOPS for A100
assumed_mfu = 0.3 # Assume 30% MFU (accounting for distributed training overhead)
flops_throughput = a100_flops * 8 * assumed_mfu # Assume 8 A100 GPUs
flops_needed = 6 * model_size * tokens_num # 6ND approximation
time_needed_days = (flops_needed / flops_throughput) / (3600 * 24)
print(f"Estimated training time: {time_needed_days:.2f} days")
This gives us an estimated training time of about 3.46 days, which is close to the actual training time of approximately 4 days.
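The same arithmetic can be wrapped in a small helper (the function name is just for illustration) so that different GPU counts and MFU assumptions are easy to compare:
def estimate_training_days(n_params, n_tokens, n_gpus=8, peak_flops=312e12, mfu=0.3):
    """Rough training time from the 6ND approximation and an assumed throughput."""
    flops_needed = 6 * n_params * n_tokens
    flops_throughput = n_gpus * peak_flops * mfu
    return flops_needed / flops_throughput / (3600 * 24)

print(f"{estimate_training_days(params()['total'], 300e9):.2f} days")             # ~3.46 with 8 A100s
print(f"{estimate_training_days(params()['total'], 300e9, n_gpus=32):.2f} days")  # ~0.86 with 32 A100s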
Conclusion
In this blog post, we’ve explored how to estimate various properties of a Transformer model, including the number of parameters, FLOPs, checkpoint size, and training time. These estimates are crucial for planning and optimizing large-scale language model training.
Remember that these are theoretical estimates and may vary in practice due to factors such as hardware efficiency, implementation details, and optimization techniques. Nonetheless, they provide valuable insights into the computational requirements of training large language models.