In this blog post, we’ll explore how to estimate various properties of a Transformer model, including:
- Number of parameters
- FLOPs (Floating Point Operations)
- Peak memory footprint
- Checkpoint size
We’ll use a GPT-2 style model as our example, but the principles can be applied to other Transformer architectures as well.
This post is based on Andrej Karpathy’s nanoGPT repository.
Model Configuration
Let’s start by defining our model’s configuration:
max_seq_len = 1024
vocab_size = 50257
n_layer = 12
n_head = 12
n_embd = 768
bias = False
This configuration represents a GPT-2 small model with 12 layers, 12 attention heads, and an embedding dimension of 768. The sequence length (max_seq_len) is set to 1024, and the vocabulary size is 50,257. Setting bias = False means the linear layers and LayerNorms carry no bias terms, which slightly reduces the parameter count.
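For convenience, these hyperparameters can also be bundled into a single config object. Below is a minimal sketch using this post's variable names (nanoGPT itself defines a similar GPTConfig dataclass, though its field names differ slightly):
from dataclasses import dataclass

@dataclass
class GPTConfig:
    max_seq_len: int = 1024
    vocab_size: int = 50257
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    bias: bool = False

config = GPTConfig()  # defaults correspond to GPT-2 small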
As you can see below, the GPT-2 model comes in different sizes, each with varying numbers of layers, attention heads, and embedding sizes:
| Model Type | No. of Layers | No. of Heads | Embedding Size | Parameters |
| --- | --- | --- | --- | --- |
| gpt2 | 12 | 12 | 768 | 124M |
| gpt2-medium | 24 | 16 | 1024 | 350M |
| gpt2-large | 36 | 20 | 1280 | 774M |
| gpt2-xl | 48 | 25 | 1600 | 1558M |
Scaling these hyperparameters is a common way to create a “family” of models with varying sizes and capabilities.
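As a concrete illustration, the whole family from the table above fits in a small preset dictionary; this is a sketch in the spirit of the preset table nanoGPT uses when loading pretrained GPT-2 weights (vocabulary size and sequence length stay fixed across the family):
# GPT-2 family presets; only depth, head count, and width change.
gpt2_presets = {
    'gpt2':        dict(n_layer=12, n_head=12, n_embd=768),   # ~124M params
    'gpt2-medium': dict(n_layer=24, n_head=16, n_embd=1024),  # ~350M params
    'gpt2-large':  dict(n_layer=36, n_head=20, n_embd=1280),  # ~774M params
    'gpt2-xl':     dict(n_layer=48, n_head=25, n_embd=1600),  # ~1558M params
}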
Estimating the Number of Parameters
To estimate the number of parameters in our model, we need to consider each component of the Transformer architecture. Let’s break it down:
Embeddings
- Token embeddings: vocab_size * n_embd
- Position embeddings: max_seq_len * n_embd
This assumes a learnable position embedding matrix.
Attention Blocks (for each layer)
- Layer normalization: n_embd (scale only, since bias = False)
- Key, Query, Value projections: n_embd * 3 * n_embd
- Output projection: n_embd * n_embd
This assumes that multi-head attention splits the embedding into n_head parts, each of size n_embd / n_head.
MLP Blocks (for each layer)
- Layer normalization: n_embd
- First linear layer: n_embd * 4 * n_embd
- Second linear layer: 4 * n_embd * n_embd
This assumes a feed-forward network with an expansion factor of 4 (i.e., ffw_size = 4 * n_embd).
Final Layer Norm and Output
- Final layer normalization: n_embd
- Output layer: 0 (due to weight sharing with token embeddings)
Now, let’s implement a function to calculate these parameters:
from collections import OrderedDict
def params():
    """ Estimates the number of parameters in the model"""
    out = OrderedDict()
    # Token and position embeddings
    out['embedding/position'] = n_embd * max_seq_len
    out['embedding/token'] = n_embd * vocab_size
    out['embedding'] = out['embedding/position'] + out['embedding/token']
    # Attention blocks
    out['attention/ln'] = n_embd # Layer norm
    out['attention/kqv'] = n_embd * 3*n_embd
    out['attention/proj'] = n_embd**2
    out['attention'] = out['attention/ln'] + out['attention/kqv'] + out['attention/proj']
    # MLP blocks
    ffw_size = 4*n_embd # Feed-forward size
    out['mlp/ln'] = n_embd
    out['mlp/ffw'] = n_embd * ffw_size
    out['mlp/proj'] = ffw_size * n_embd
    out['mlp'] = out['mlp/ln'] + out['mlp/ffw'] + out['mlp/proj']
    # Transformer block
    out['block'] = out['attention'] + out['mlp']
    out['transformer'] = n_layer * out['block']
    # Final layer norm and output
    out['ln_f'] = n_embd
    out['dense'] = 0 # 0 because of parameter sharing with embedding layer
    # Total
    out['total'] = out['embedding'] + out['transformer'] + out['ln_f'] + out['dense']
    return out
Let’s run this function and display the results:
p = params()
params_total = p['total']
print(f"Total parameters: {params_total:,}")
print(f"Expected parameters : {124337664:,}")
print("\nParameter breakdown:")
print(f"{'Name':20s} {'Parameters':>12s} {'Ratio (%)':>10s}")
print("-" * 44)
for k, v in p.items():
    print(f"{k:20s} {v:12,d} {v/params_total*100:10.4f}")
This will output:
Total parameters: 124,337,664
Expected parameters: 124,337,664
Parameter breakdown:
Name Parameters Ratio (%)
--------------------------------------------
embedding/position 786,432 0.6325
embedding/token 38,597,376 31.0424
embedding 39,383,808 31.6749
attention/ln 768 0.0006
attention/kqv 1,769,472 1.4231
attention/proj 589,824 0.4744
attention 2,360,064 1.8981
mlp/ln 768 0.0006
mlp/ffw 2,359,296 1.8975
mlp/proj 2,359,296 1.8975
mlp 4,719,360 3.7956
block 7,079,424 5.6937
transformer 84,953,088 68.3245
ln_f 768 0.0006
dense 0 0.0000
total 124,337,664 100.0000
As we can see, the total number of parameters is approximately 124 million, which matches the expected value for a GPT-2 small model.
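As an optional cross-check (assuming PyTorch and the Hugging Face transformers package are available), you can count the parameters of the actual pretrained GPT-2 checkpoint. That checkpoint keeps bias vectors in its linear and LayerNorm layers, so expect a count slightly above our bias = False estimate:
from transformers import GPT2LMHeadModel

hf_model = GPT2LMHeadModel.from_pretrained("gpt2")
hf_params = sum(p.numel() for p in hf_model.parameters())
print(f"Pretrained gpt2 parameters: {hf_params:,}")
# Slightly larger than 124,337,664 because the pretrained weights include biases.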
Estimating Checkpoint Size
Now that we know the number of parameters, we can estimate the size of a training checkpoint. Parameters are typically stored in 32-bit floating point (fp32), and the AdamW optimizer keeps two additional fp32 buffers per parameter (its first- and second-moment estimates), so a checkpoint stores roughly three times the raw parameter bytes.
params_bytes = params_total * 4 # 4 bytes per fp32 parameter
params_and_buffers_bytes = params_bytes + 2 * params_bytes
checkpoint_size_gb = params_and_buffers_bytes / 1e9
print(f"Estimated checkpoint size: {checkpoint_size_gb:.2f} GB")
Estimated checkpoint size: 1.49 GB
In practice, the actual checkpoint size might be slightly larger due to additional metadata and potential padding. We can compare our estimate to the actual file size:
measured_bytes = 1542470366 # Measured with 'wc -c ckpt.pt'
measured_gb = measured_bytes / 1e9
fluff_ratio = measured_bytes / params_and_buffers_bytes * 100
print(f"Measured checkpoint size with `wc -c ckpt.pt`: {measured_gb:.2f} GB")
print(f"Fluff ratio: {fluff_ratio:.2f}%")
Measured checkpoint size with `wc -c ckpt.pt`: 1.54 GB
Fluff ratio: 103.38%
This shows that the actual checkpoint size is about 1.54 GB, with a fluff ratio of 103.38%.
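The same number also gives a quick lower bound on the memory footprint during training: the fp32 parameters and AdamW buffers have to sit in GPU memory the whole time. A back-of-the-envelope check, assuming the 40 GB A100 variant (activations, gradients, and framework overhead come on top of this and usually dominate the peak):
gpu_memory_bytes = 40e9  # A100 40GB
memory_ratio = params_and_buffers_bytes / gpu_memory_bytes
print(f"Params + AdamW buffers as a fraction of A100 40GB: {memory_ratio*100:.2f}%")
# Roughly 3.7% -- most of the peak memory during training comes from activations.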
Estimating FLOPs
Next, let’s estimate the number of Floating Point Operations (FLOPs) for a single forward pass through our model. We’ll focus on the most computationally intensive operations, primarily matrix multiplications.
First, recall that for a matrix multiplication between two matrices A and B of sizes (m, n) and (n, p), respectively, the number of FLOPs is 2 * m * n * p. The factor of 2 comes from the fact that each element of the output matrix requires one multiplication and one addition.
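To make the rule concrete, here is a quick check against one matmul we will meet again below: projecting a single 1024-token sequence to keys, queries, and values multiplies a (max_seq_len, n_embd) activation matrix by an (n_embd, 3 * n_embd) weight matrix:
# Sanity check of the 2 * m * n * p rule on the kqv projection of one sequence.
m, n, p = max_seq_len, n_embd, 3 * n_embd
print(f"kqv matmul FLOPs: {2 * m * n * p:,}")  # 3,623,878,656 -- matches attention/kqv below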
Let’s break down the FLOPs for different components of the model. Note that we only count the FLOPs from matrix multiplications, since all other operations (LayerNorm, softmax, etc.) are negligible by comparison.
Attention Blocks
- Projection to key, query, values: 2 * max_seq_len * (n_embd * 3 * n_embd)
- Score calculation: 2 * max_seq_len * max_seq_len * n_embd
- Reduction of values: 2 * n_head * (max_seq_len * max_seq_len * head_size), where head_size = n_embd / n_head
- Final linear projection: 2 * max_seq_len * (n_embd * n_embd)
def flops():
    out = OrderedDict()
    head_size = n_embd // n_head
    # Attention blocks
    # 1) Projection to key, query, values
    out['attention/kqv'] = 2 * max_seq_len * (n_embd * 3*n_embd)
    # 2) Calculating attention scores
    out['attention/scores'] = 2 * max_seq_len * max_seq_len * n_embd
    # 3) Reduction of values
    out['attention/reduce'] = 2 * n_head * (max_seq_len * max_seq_len * head_size)
    # 4) Final linear projection
    out['attention/proj'] = 2 * max_seq_len * (n_embd * n_embd)
    out['attention'] = sum(out['attention/'+k] for k in ['kqv', 'scores', 'reduce', 'proj'])
    # MLP blocks
    ffw_size = 4*n_embd
    out['mlp/ffw1'] = 2 * max_seq_len * (n_embd * ffw_size)
    out['mlp/ffw2'] = 2 * max_seq_len * (ffw_size * n_embd)
    out['mlp'] = out['mlp/ffw1'] + out['mlp/ffw2']
    # Transformer block and total
    out['block'] = out['attention'] + out['mlp']
    out['transformer'] = n_layer * out['block']
    out['dense'] = 2 * max_seq_len * (n_embd * vocab_size)
    out['forward_total'] = out['transformer'] + out['dense']
    out['backward_total'] = 2 * out['forward_total'] # Estimate backward pass as 2x forward
    out['total'] = out['forward_total'] + out['backward_total']
    return out
f = flops()
flops_total = f['forward_total']
print("FLOPs breakdown:")
print(f"{'Name':20s} {'FLOPs':>14s} {'Ratio (%)':>10s}")
print("-" * 46)
for k, v in f.items():
    print(f"{k:20s} {v:14,d} {v/flops_total*100:10.4f}")
This gives us a detailed breakdown of FLOPs for different components of the model:
FLOPs breakdown:
Name FLOPs Ratio (%)
----------------------------------------------
attention/kqv 3,623,878,656 1.2426
attention/scores 1,610,612,736 0.5522
attention/reduce 1,610,612,736 0.5522
attention/proj 1,207,959,552 0.4142
attention 8,053,063,680 2.7612
mlp/ffw1 4,831,838,208 1.6567
mlp/ffw2 4,831,838,208 1.6567
mlp 9,663,676,416 3.3135
block 17,716,740,096 6.0747
transformer 212,600,881,152 72.8963
dense 79,047,426,048 27.1037
forward_total 291,648,307,200 100.0000
backward_total 583,296,614,400 200.0000
total 874,944,921,600 300.0000
Model FLOPs Utilization (MFU)
To calculate the Model FLOPs Utilization (MFU), we need to compare our model’s achieved FLOPs to the theoretical peak performance of the GPU. Let’s assume we’re using an NVIDIA A100 GPU, which has a theoretical peak performance of 312 TFLOPS for bfloat16 operations on tensor cores.
batch_size = 20 * 5 # Total batch size (including gradient accumulation)
measured_time = 0.755 # Seconds per iteration
measured_throughput = batch_size / measured_time
flops_achieved = f['total'] * measured_throughput
a100_flops_promised = 312e12 # 312 TFLOPS for A100
# The fraction of the A100's peak performance that we're achieving
mfu = flops_achieved / a100_flops_promised * 100
print(f"Model FLOPs Utilization: {mfu:.2f}%")
This gives us an MFU of about 37.14%, which indicates that there’s room for optimization in our training process.
Estimating Training Time
Finally, let’s estimate the total time needed to train our model using the 6ND approximation, which says total training compute is roughly 6 FLOPs per parameter per token:
model_size = params()['total']
tokens_num = 300e9 # 300B tokens in the dataset
a100_flops = 312e12 # 312 TFLOPS for A100
assumed_mfu = 0.3 # Assume 30% MFU (accounting for distributed training overhead)
flops_throughput = a100_flops * 8 * assumed_mfu # Assume 8 A100 GPUs
flops_needed = 6 * model_size * tokens_num # 6ND approximation
time_needed_days = (flops_needed / flops_throughput) / (3600 * 24)
print(f"Estimated training time: {time_needed_days:.2f} days")
This gives us an estimated training time of about 3.46 days, which is close to the actual training time of approximately 4 days.
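The same arithmetic can be wrapped in a small helper (the function name is just for illustration) so that different GPU counts and MFU assumptions are easy to compare:
def estimate_training_days(n_params, n_tokens, n_gpus=8, peak_flops=312e12, mfu=0.3):
    """Rough training time from the 6ND approximation and an assumed throughput."""
    flops_needed = 6 * n_params * n_tokens
    flops_throughput = n_gpus * peak_flops * mfu
    return flops_needed / flops_throughput / (3600 * 24)

print(f"{estimate_training_days(params()['total'], 300e9):.2f} days")             # ~3.46 with 8 A100s
print(f"{estimate_training_days(params()['total'], 300e9, n_gpus=32):.2f} days")  # ~0.86 with 32 A100s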
Conclusion
In this blog post, we’ve explored how to estimate various properties of a Transformer model, including the number of parameters, FLOPs, checkpoint size, and training time. These estimates are crucial for planning and optimizing large-scale language model training.
Remember that these are theoretical estimates and may vary in practice due to factors such as hardware efficiency, implementation details, and optimization techniques. Nonetheless, they provide valuable insights into the computational requirements of training large language models.