Understanding GPU Peak Performance Calculations with an Example


To calculate the peak performance of a GPU in FLOPS (Floating Point Operations Per Second), we use the per-clock throughput of a single SM (Streaming Multiprocessor), the number of SMs, and the clock rate of the GPU. Here's a step-by-step explanation of the example calculation:


Step 1: Formula

The general formula for calculating peak FLOPS is:

Peak FLOPS = (Operations per clock per SM) × 2 × (Number of SMs) × (Clock Rate)

Where:

  • Operations per clock per SM is the number of multiply-add operations that one SM can execute in one clock cycle.
  • 2 accounts for the fact that each multiply-add operation counts as two FLOPs (one multiplication and one addition).
  • Number of SMs is the total number of Streaming Multiprocessors in the GPU.
  • Clock Rate is the operating frequency of the GPU in Hz.
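
The arithmetic is simple enough to script. Below is a minimal Python sketch of the formula; the function name peak_tflops and its parameters are illustrative, not part of any vendor library:

```python
def peak_tflops(macs_per_clock_per_sm: int, num_sms: int, clock_hz: float) -> float:
    """Peak throughput in TFLOPS for one data type.

    macs_per_clock_per_sm -- multiply-add (MAC) operations one SM issues per clock
    num_sms               -- number of Streaming Multiprocessors on the GPU
    clock_hz              -- GPU clock rate in Hz
    """
    flops_per_second = macs_per_clock_per_sm * 2 * num_sms * clock_hz  # x2: one MAC = 2 FLOPs
    return flops_per_second / 1e12  # convert FLOPS to TFLOPS
```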

Step 2: Given Data

From the example:

  • GPU Model: NVIDIA A100
  • Number of SMs: 108
  • Clock Rate: 1.41 GHz = 1.41 × 10^9 Hz
  • Operations per clock per SM:
    • For TF32 (TensorFloat-32): 512 multiply-add operations per clock per SM
    • For FP16 (half precision): 1,024 multiply-add operations per clock per SM

Step 3: Calculation for TF32

Using the formula:

Peak FLOPS for TF32 = (512 operations/clock/SM) × 2 × (108 SMs) × (1.41 × 10^9 Hz)

Simplify step-by-step:

  1. 512 × 2 = 1,024 (FLOPs per clock per SM for TF32)
  2. 1,024 × 108 = 110,592 (FLOPs per clock across all SMs)
  3. 110,592 × 1.41 × 10^9 = 155,934.72 × 10^9 ≈ 155.9 TFLOPS

Thus, the peak performance for TF32 is approximately 156 TFLOPS.
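
The same number can be checked with the sketch from Step 1, plugging in the example data above:

```python
# A100 example: 512 TF32 MACs per clock per SM, 108 SMs, 1.41 GHz boost clock
tf32 = peak_tflops(macs_per_clock_per_sm=512, num_sms=108, clock_hz=1.41e9)
print(f"TF32 peak: {tf32:.1f} TFLOPS")  # -> TF32 peak: 155.9 TFLOPS
```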


Step 4: Calculation for FP16

Using the formula:

Peak FLOPS for FP16 = (1,024 operations/clock/SM) × 2 × (108 SMs) × (1.41 × 10^9 Hz)

Simplify step-by-step:

  1. 1,024 × 2 = 2,048 (FLOPs per clock per SM for FP16)
  2. 2,048 × 108 = 221,184 (FLOPs per clock across all SMs)
  3. 221,184 × 1.41 × 10^9 = 311,869.44 × 10^9 ≈ 311.9 TFLOPS

Thus, the peak performance for FP16 is approximately 312 TFLOPS.
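
Again, the result is easy to verify with the same sketch; only the per-SM MAC rate changes:

```python
# FP16 doubles the per-SM Tensor Core MAC rate relative to TF32 on the A100
fp16 = peak_tflops(macs_per_clock_per_sm=1024, num_sms=108, clock_hz=1.41e9)
print(f"FP16 peak: {fp16:.1f} TFLOPS")  # -> FP16 peak: 311.9 TFLOPS
```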


Step 5: Explaining Key Points

  1. Tensor Cores vs. CUDA Cores:
    • Tensor Cores specialize in matrix operations (e.g., small 4x4 matrix blocks), enabling much higher throughput for deep learning tasks. CUDA Cores handle other operations, such as element-wise additions.
  2. Precision Variations:
    • Tensor Cores support mixed-precision calculations, where input tensors might use FP16 while accumulations are done in FP32 to preserve precision.
    • This flexibility improves performance without sacrificing accuracy.
  3. Use Cases (see the sketch after this list):
    • TF32: Used in training large-scale AI models with a balance of precision and performance.
    • FP16: Common in inference for faster execution with acceptable precision loss.
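
To make the use cases concrete, here is a minimal PyTorch sketch of how these precision modes are typically enabled. It assumes PyTorch with a CUDA-capable GPU is available, and the tensor shapes are arbitrary illustrations:

```python
import torch

# TF32: let FP32 matrix multiplies run on Tensor Cores in TF32 (Ampere and newer)
torch.backends.cuda.matmul.allow_tf32 = True

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

# FP16 mixed precision: inputs are cast to half, accumulation stays in FP32
with torch.autocast(device_type="cuda", dtype=torch.float16):
    c = a @ b  # runs on Tensor Cores with FP16 inputs and FP32 accumulation
```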


Conclusion

By understanding this formula and its components, system administrators can evaluate GPU performance for specific workloads, optimize resource allocation, and predict infrastructure capabilities. This knowledge is critical for tasks like configuring AI training clusters or scaling GPU-intensive applications.


