Optimize PyTorch Performance for Speed and Memory Efficiency

The training/inference processes of deep learning models are involved lots of steps. The faster each experiment iteration is, the more we can optimize the whole model prediction performance given limited time and resources. I collected and organized several PyTorch tricks and tips to maximize the efficiency of memory usage and minimize the run time. To better leverage these tips, we also need to understand how and why they work.

Checklist:

Data Loading 1. Move the active data to the SSD 2. Dataloader(dataset, num_workers=4*num_GPU) 3. Dataloader(dataset, pin_memory=True)
Data Operations 4. Directly create vectors/matrices/tensors as torch.Tensor and at the device where they will run operations 5. Avoid unnecessary data transfer between CPU and GPU 6. Use torch.from_numpy(numpy_array) or torch.as_tensor(others) 7. Use tensor.to(non_blocking=True) when it’s applicable to overlap data transfers 8. Fuse the pointwise (elementwise) operations into a single kernel by PyTorch JIT
Model Architecture 9. Set the sizes of all different architecture designs as the multiples of 8 (for FP16 of mixed precision)
Training 10. Set the batch size as the multiples of 8 and maximize GPU memory usage 11. Use mixed precision for forward pass (but not backward pass) 12. Set gradients to None (e.g., model.zero_grad(set_to_none=True) ) before the optimizer updates the weights 13. Gradient accumulation: update weights for every other x batch to mimic the larger batch size
Inference/Validation 14. Turn off gradient calculation
CNN (Convolutional Neural Network) specific 15. torch.backends.cudnn.benchmark = True 16. Use channels_last memory format for 4D NCHW Tensors 17. Turn off bias for convolutional layers that are right before batch normalization
Distributed optimizations 18. Use DistributedDataParallel instead of DataParallel

Code snippet combining the tips No. 7, 11, 12, 13:

# Combining the tips No.7, 11, 12, 13: nonblocking, AMP, setting 
# gradients as None, and larger effective batch size

model.train()
# Reset the gradients to None
optimizer.zero_grad(set_to_none=True)
scaler = GradScaler()
for i, (features, target) in enumerate(dataloader):
    # these two calls are nonblocking and overlapping
    features = features.to('cuda:0', non_blocking=True)
    target = target.to('cuda:0', non_blocking=True)    

    # Forward pass with mixed precision
    with torch.cuda.amp.autocast(): # autocast as a context manager
        output = model(features)
        loss = criterion(output, target)    

    # Backward pass without mixed precision
    # It's not recommended to use mixed precision for backward pass
    # Because we need more precise loss
    scaler.scale(loss).backward()    

    # Only update weights every other 2 iterations
    # Effective batch size is doubled
    if (i+1) % 2 == 0 or (i+1) == len(dataloader):
        # scaler.step() first unscales the gradients .
        # If these gradients contain infs or NaNs, 
        # optimizer.step() is skipped.
        scaler.step(optimizer)        
        
        # If optimizer.step() was skipped,
        # scaling factor is reduced by the backoff_factor 
        # in GradScaler()
        scaler.update()        

        # Reset the gradients to None
        optimizer.zero_grad(set_to_none=True)

High-level concepts

Overall, you can optimize the time and memory usage by 3 key points. First, reduce the i/o (input/output) as much as possible so that the model pipeline is bound to the calculations (math-limited or math-bound) instead of bound to i/o (bandwidth-limited or memory-bound). This way, we can leverage GPUs and their specialization to accelerate those computations. Second, overlap the processes as much as possible to save time. Third, maximize the memory usage efficiency to save memory. Then saving memory may enable a larger batch size, which saves more time. Having more time facilitates a faster model development cycle and leads to better model performance.

1. Move active data to SSD

Some machines have different hard drives like HHD and SSD. It’s recommended to move the data, which will be used in the active projects, to the SSD (or the hard drive with better i/o) for faster speed.

2. Asynchronous data loading and augmentation

num_workers=0 would make data loading execute only after training or previous process is done. Setting num_workers >0 is expected to accelerate the process more especially for the i/o and augmentation of large data. For GPU specifically, this experiment found that num_workers = 4*num_GPU had the best performance. That being said, you can also test the best num_workers for your machine. To be noted, high num_workers would have a large memory consumption overhead , which is also expected, because more data copies are being processed in the memory at the same time.

Dataloader(dataset, num_workers=4*num_GPU)

3. Use pinned memory to reduce data transfer

Setting pin_memory=True skips the transfer from pageable memory to pinned memory (image by the author, inspired by this image)

GPU cannot access data directly from the pageable memory of the CPU. The setting, pin_memory=True can allocate the staging memory for the data on the CPU host directly and save the time of transferring data from pageable memory to staging memory (i.e., pinned memory a.k.a., page-locked memory). This setting can be combined with num_workers = 4*num_GPU.

Dataloader(dataset, pin_memory=True)

4. Directly create vectors/matrices/tensors as torch.Tensor and at the device where they will run operations

Whenever you need torch.Tensor data for PyTorch, first try to create them at the device where you will use them. Do not use native Python or NumPy to create data and then convert it to torch.Tensor. In most cases, if you are going to use them in GPU, create them in GPU directly.

# Random numbers between 0 and 1
# Same as np.random.rand([10,5])
tensor = torch.rand([10, 5], device=torch.device('cuda:0'))

# Random numbers from normal distribution with mean 0 and variance 1
# Same as np.random.randn([10,5])
tensor = torch.randn([10, 5], device=torch.device('cuda:0'))

The only syntax difference is that random number generation in NumPy needs an additional random, e.g., np.random.rand() vs torch.rand(). Many other functions have the corresponding functions in NumPy:

torch.empty(), torch.zeros(), torch.full(), torch.ones(), torch.eye(),torch.randint() , torch.rand() ,torch.randn()

5. Avoid unnecessary data transfer between CPU and GPU

As I mentioned in the high-level concepts, we’d like to reduce i/o as much as possible. Be aware of these commands below:

# BAD! AVOID THEM IF UNNECESSARY!
print(cuda_tensor)
cuda_tensor.cpu()
cuda_tensor.to_device('cpu')
cpu_tensor.cuda()
cpu_tensor.to_device('cuda')
cuda_tensor.item()
cuda_tensor.numpy()
cuda_tensor.nonzero()
cuda_tensor.tolist()

# Python control flow which depends on operation results of CUDA tensors
if (cuda_tensor != 0).all():
    run_func()

6. Use torch.from_numpy(numpy_array) and torch.as_tensor(others) instead of torch.tensor

torch.tensor() always copies the data.

If both the source device and target device are CPU, torch.from_numpy and torch.as_tensor may not create data copies. If the source data is a NumPy array, it’s faster to use torch.from_numpy(numpy_array). If the source data is a tensor with the same data type and device type, then torch.as_tensor(others) may avoid copying data if applicable. others can be Python list, tuple, or torch.tensor. If the source and target device are different, then we can use the next tip.

torch.from_numpy(numpy_array)
torch.as_tensor(others)

7. Use tensor.to(non_blocking=True) when it’s applicable to overlap data transfers and kernel execution

Overlap the data transfers to reduce the run time. (Image by the author)

Essentially, non_blocking=True allows asynchronous data transfers to reduce the execution time.

for features, target in loader:
    # these two calls are nonblocking and overlapping
    features = features.to('cuda:0', non_blocking=True)
    target = target.to('cuda:0', non_blocking=True)    

    # This is a synchronization point
    # It will wait for previous two lines
    output = model(features)

8. Fuse the pointwise (elementwise) operations into a single kernel by PyTorch JIT

Pointwise operations include common math operations and usually are memory-bound. PyTorch JIT would automatically fuse the adjacent pointwise operations into one single kernel to save multiple memory reads/writes. (Pretty amazing, isn’t it?) For example, the gelu function can be accelerated 4 times for a vector of one million by fusing 5 kernels into 1 . More examples of PyTorch JIT optimization can be found here and here.

@torch.jit.script # JIT decorator
def fused_gelu(x):
    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))

9 & 10. Set the sizes of all different architectural designs and batch sizes as the multiples of 8

To maximize the computation efficiency of GPU, it’s the best to ensure different architecture designs (including the input and output size/dimension/channel numbers of neural networks and batch size) are the multiples of 8 or even larger powers of two (e.g., 64, 128 and up to 256). It’s because the Tensor Cores of Nvidia GPUs achieve the best performance for matrix multiplication when the matrix dimensions align to the multiples of powers of two. The matrix multiplication is the most-used operation and possibly the bottleneck, so it’s the best we can make sure the tensors/matrices/vectors have the dimensions that are divisible by powers of two (e.g., 8, 64, 128, and up to 256).

These experiments have shown setting the output dimension and the batch size as the multiples of 8 (i.e., 33712, 4088, 4096) can accelerate the computation by 1.3x to 4x compared to the output dimension of 33708 and the batch sizes of 4084 and 4095, which are indivisible by 8. The acceleration magnitude depends on the process types (e.g., forward passes or the gradient calculations) and the cuBLAS versions. Especially, if you work on NLP, remember to check your output dimension, which is usually the vocabulary size.

Using the multiples of larger than 256 does not add more benefits but also does no harm. The settings depend on the cuBLAS and cuDNN versions and the GPU architecture. You can find the specific Tensor Core requirements of the matrix dimensions here. Since currently PyTorch AMP mostly uses FP16 and FP16 requires the multiples of 8, the multiples of 8 are usually recommended. If you have a more advanced GPU like A100, then you may choose the multiples of 64. If you are using AMD GPU, you may need to check AMD’s documentation.

Besides setting batch size as the multiple of 8, we also maximize the batch size until it hits the memory limit of GPU. In this way, we can spend less time finishing an epoch.

11. Use mixed precision for forward pass but not backward pass

Some of the operations do not need the precision of float64 or float32. Therefore, setting the operations for lower precision can save both memory and execution time. For various applications, Nvidia reported using mixed precision with the GPU with Tensor Cores can speed up by 3.5x to 25x .

It is worth noting that usually the larger the matrix is, the more acceleration mixed precision achieves . In larger neural networks (e.g., BERT), one experiment has shown that the mixed precision can accelerate training by 2.75x and reduce 37% of memory usage . The newer GPU devices with Volta, Turing, Ampere, or Hopper architectures (e.g., T4, V100, RTX 2060, 2070, 2080, 2080 Ti, A100, RTX 3090, RTX 3080, and RTX 3070) can benefit more from mixed precision because they have the Tensor Core architecture, which has special optimization and outperforms CUDA cores .

NVIDIA architectures with Tensor Core support different precisions (image by the author; data source)

To be noted, H100 with Hopper architecture, expected to release in the third quarter of 2022, supports FP8 (float8). PyTorch AMP may be expected to support FP8, too (current v1.11.0 has not supported FP8 yet).

In practice, you’ll need to find a sweet spot between the model accuracy performance and speed performance. I did find mixed precision may reduce the model performance before, which depends on the algorithm, the data and the problem.

It’s quite easy to leverage mixed precision in PyTorch with the automatic mixed precision (AMP) package. The default float point type in PyTorch is float32 . AMP would save memory and time by using float16 for a group of operations (e.g., matmul, linear, conv2d, etc, see full list). AMP would autocast to float32 for some operations (e.g., mse_loss, softmax, etc, see full list). Some operations (e.g., add, see full list) would operate on the widest input type. For example, if one variable is float32 and another one is float16, the addition result would be float32.

autocast automatically applies precisions to different operations. Because loss(es) and gradients are calculated at float16 precision, the gradients might "underflow” and become zeroes when they are too small. GradScaler prevents underflow by multiplying the loss(es) by a scale factor, calculating the gradients based on the scaled loss(es), and then unscaling the gradients before the optimizer updates the weights. If the scaling factor is too large or too small and results in infs or NaNs, then the scaler would update the scaling factor for the next iteration.

scaler = GradScaler()
for features, target in data:
    # Forward pass with mixed precision
    with torch.cuda.amp.autocast(): # autocast as a context manager
        output = model(features)
        loss = criterion(output, target)    

    # Backward pass without mixed precision
    # It's not recommended to use mixed precision for backward pass
    # Because we need more precise loss
    scaler.scale(loss).backward()    

    # scaler.step() first unscales the gradients .
    # If these gradients contain infs or NaNs, 
    # optimizer.step() is skipped.
    scaler.step(optimizer)     

    # If optimizer.step() was skipped,
    # scaling factor is reduced by the backoff_factor in GradScaler()
    scaler.update()

You can also use autocast as a decorator for forward pass functions.

class AutocastModel(nn.Module):
    ...
    @autocast() # autocast as a decorator
    def forward(self, input):
        x = self.model(input)
        return x

12. Set gradients to None before the optimizer updates the weights

Setting gradients to zeroes by model.zero_grad() or optimizer.zero_grad() would execute memset for all parameters and update gradients with reading and writing operations. However, setting the gradients as None would not execute memset and would update gradients with only writing operations. Therefore, setting gradients as None is faster.

# Reset gradients before each step of optimizer
for param in model.parameters():
    param.grad = None

# or (PyTorch >= 1.7)
model.zero_grad(set_to_none=True)

# or (PyTorch >= 1.7)
optimizer.zero_grad(set_to_none=True)

13. Gradient accumulation: update weights for every other x batch to mimic the larger batch size

This tip is about accumulating gradients from more data samples so that the estimation of gradients is more accurate and weights are updated more towards the local/global minimum. This is more helpful especially when the batch size is small (due to either a small GPU memory limit or a large data size per sample).

for i, (features, target) in enumerate(dataloader):
    # Forward pass
    output = model(features)
    loss = criterion(output, target)    

    # Backward pass
    loss.backward()    

    # Only update weights every other 2 iterations
    # Effective batch size is doubled
    if (i+1) % 2 == 0 or (i+1) == len(dataloader):
        # Update weights
        optimizer.step()        

        # Reset the gradients to None
        optimizer.zero_grad(set_to_none=True)

14. Turn off gradient calculation for inference/validation

Essentially, gradient calculation is not necessary for the inference and validation steps if you only calculate the outputs of the model. PyTorch uses an intermediate memory buffer for operations involved in variables of requires_grad=True. Therefore, we can avoid using additional resources by disabling gradient calculation for inference/validation if we know we don’t need any gradient-involved operations.

# torch.no_grad() as a context manager:
    with torch.no_grad():
    output = model(input)

# torch.no_grad() as a function decorator:
@torch.no_grad()
def validation(model, input):
    output = model(input)
return output

15. torch.backends.cudnn.benchmark = True

Setting torch.backends.cudnn.benchmark = True before the training loop can accelerate the computation. Because the performance of cuDNN algorithms to compute the convolution of different kernel sizes varies, the auto-tuner can run a benchmark to find the best algorithm (current algorithms are these, these, and these). It’s recommended to use turn on the setting when your input size doesn’t change often. If the input size changes often, the auto-tuner needs to benchmark too frequently, which might hurt the performance. It can speed up by 1.27x to 1.70x for forward and backward propagation .

torch.backends.cudnn.benchmark = True

16. Use channels_last memory format for 4D NCHW Tensors

4D NCHW is reorganized as NHWC format (image by the author inspired by ref)

Using the channels_last memory format saves images in a pixel-per-pixel manner as the densest format in memory. Original 4D NCHW Tensors are clustered by each channel (Red/Greed/Blue) in memory. After the conversion, x = x.to(memory_format=torch.channels_last), the data is reorganized as NHWC (channels_last format) in memory. You can see each pixel of the RGB layers is closer. This NHWC format has been reported to gain 8% to 35% speedup with AMP of FP16 .

Currently, it is still in beta, and only supports 4D NCHW Tensors and a group of models (e.g., alexnet, mnasnet family, mobilenet_v2, resnet family, shufflenet_v2, squeezenet1, vgg family, see ful list). But I can definitely see this would become a standard optimization.

N, C, H, W = 10, 3, 32, 32

x = torch.rand(N, C, H, W)

# Stride is the gap between one element to the next one 
# in a dimension.
print(x.stride()) 

# (3072, 1024, 32, 1)# Convert the tensor to NHWC in memory
x2 = x.to(memory_format=torch.channels_last)

print(x2.shape)  # (10, 3, 32, 32) as dimensions order preserved

print(x2.stride())  # (3072, 1, 96, 3), which are smaller

print((x==x2).all()) # True because the values were not changed

17. Turn off bias for convolutional layers that are right before batch normalization

This works because mathematically the bias effect would be canceled out by mean subtraction of the batch normalization. We can save the model parameters, run time, and memory.

nn.Conv2d(..., bias=False)

18. Use DistributedDataParallel instead of DataParallel

It’s always prefer DistributedDataParallel over DataParallel for multi-GPU even with only a single node because DistributedDataParallel applies multiprocessing and creates a process for each GPU to bypass Python Global Interpreter Lock (GIL) and speed up.

Summary

In this post, I made a checklist and provided code snippets for 18 PyTorch tips. Then I explained how and why they work one by one in various aspects including data loading, data operations, model architecture, training, inference, CNN-specific optimizations, and distributed computing. Once you deeply understand how they work, you may be able to find the common principles that are applicable to deep learning modeling in any deep learning framework.

Source: Medium - Jack Chih-Hsu Lin

The Tech Platform