This tutorial demonstrates how to train a large Transformer-like model across hundreds to thousands of GPUs using Tensor Parallel and Fully Sharded Data Parallel.
Prerequisites:
Tensor Parallel (TP) was originally proposed in the Megatron-LM paper, and it is an efficient model parallelism technique for training large-scale Transformer models. Sequence Parallel (SP), which we mention in this tutorial, is a variant of Tensor Parallel that shards on the sequence dimension for nn.LayerNorm or RMSNorm to further save activation memory during training. As the model becomes larger, activation memory becomes the bottleneck, so Tensor Parallel training usually applies Sequence Parallel to the LayerNorm or RMSNorm layers.
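As a minimal sketch of how Sequence Parallel might be attached to a norm layer (the ToyBlock module, its dimensions, and the single-host 8-GPU setup launched via torchrun are assumptions for illustration, not the tutorial's model):

```python
import os
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import parallelize_module, SequenceParallel

class ToyBlock(nn.Module):
    """Hypothetical block: a LayerNorm followed by a projection."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.proj(self.norm(x))

# Assumes torchrun launched one process per GPU on a single 8-GPU host.
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
tp_mesh = init_device_mesh("cuda", (8,))

block = ToyBlock().cuda()

# SequenceParallel keeps the norm's parameters replicated but runs it on inputs
# sharded along the sequence dimension, which is what saves activation memory.
block = parallelize_module(block, tp_mesh, {"norm": SequenceParallel()})
```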
Figure 1 represents the sharding in Tensor Parallel style on a Transformer model’s MLP and Self-Attention layer, where the matrix multiplications in both attention and MLP happen through sharded computations (image source).
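To make the figure concrete, here is a hedged sketch of column-wise and row-wise sharding on a stand-in two-layer MLP (the MLP class, its dimensions, and the 8-GPU mesh are assumptions for illustration, not the tutorial's model):

```python
import os
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

class MLP(nn.Module):
    """Stand-in for a Transformer feed-forward block."""
    def __init__(self, dim: int = 1024, hidden: int = 4096):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden)
        self.w2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))

# Assumes torchrun with one process per GPU on a single 8-GPU host.
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
tp_mesh = init_device_mesh("cuda", (8,))

mlp = MLP().cuda()

# Sharding initialization: w1 is split column-wise and w2 row-wise, so the
# intermediate activation stays sharded across GPUs.
mlp = parallelize_module(
    mlp,
    tp_mesh,
    {"w1": ColwiseParallel(), "w2": RowwiseParallel()},
)

# Runtime forward/backward: the parameters are now DTensors, and DTensor inserts
# the required collectives (e.g. an all-reduce on w2's output in the forward pass)
# around the sharded matrix multiplications.
out = mlp(torch.randn(4, 1024, device="cuda"))
```

With the default layouts, ColwiseParallel replicates the input and shards its output on the last dimension, which matches RowwiseParallel's expected input, so no extra communication is needed between the two linear layers.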
At a high level, PyTorch Tensor Parallel works as follows:
Sharding initialization
- Determine which ParallelStyle to apply to each layer and shard the initialized module by calling parallelize_module.
Runtime forward/backward