This tutorial demonstrates how to train a large Transformer-like model across hundreds to thousands of GPUs using Tensor Parallel and Fully Sharded Data Parallel.
Prerequisites:
Tensor Parallel (TP) was originally proposed in the Megatron-LM paper, and it is an efficient model parallelism technique for training large-scale Transformer models. Sequence Parallel (SP), which we mention in this tutorial, is a variant of Tensor Parallel that shards on the sequence dimension for nn.LayerNorm or RMSNorm to further save activation memory during training. As the model becomes larger, activation memory becomes the bottleneck, so Tensor Parallel training usually applies Sequence Parallel to the LayerNorm or RMSNorm layers.
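As a minimal sketch of how Sequence Parallel might be attached to a norm layer (the ToyBlock module, its dimensions, and the single-host 8-GPU setup launched via torchrun are assumptions for illustration, not the tutorial's model):

```python
import os
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import parallelize_module, SequenceParallel

class ToyBlock(nn.Module):
    """Hypothetical block: a LayerNorm followed by a projection."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.proj(self.norm(x))

# Assumes torchrun launched one process per GPU on a single 8-GPU host.
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
tp_mesh = init_device_mesh("cuda", (8,))

block = ToyBlock().cuda()

# SequenceParallel keeps the norm's parameters replicated but runs it on inputs
# sharded along the sequence dimension, which is what saves activation memory.
block = parallelize_module(block, tp_mesh, {"norm": SequenceParallel()})
```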
Figure 1 represents the sharding in Tensor Parallel style on a Transformer model’s MLP and Self-Attention layer, where the matrix multiplications in both attention and MLP happen through sharded computations (image source).
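To make the figure concrete, here is a hedged sketch of column-wise and row-wise sharding on a stand-in two-layer MLP (the MLP class, its dimensions, and the 8-GPU mesh are assumptions for illustration, not the tutorial's model):

```python
import os
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

class MLP(nn.Module):
    """Stand-in for a Transformer feed-forward block."""
    def __init__(self, dim: int = 1024, hidden: int = 4096):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden)
        self.w2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))

# Assumes torchrun with one process per GPU on a single 8-GPU host.
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
tp_mesh = init_device_mesh("cuda", (8,))

mlp = MLP().cuda()

# Sharding initialization: w1 is split column-wise and w2 row-wise, so the
# intermediate activation stays sharded across GPUs.
mlp = parallelize_module(
    mlp,
    tp_mesh,
    {"w1": ColwiseParallel(), "w2": RowwiseParallel()},
)

# Runtime forward/backward: the parameters are now DTensors, and DTensor inserts
# the required collectives (e.g. an all-reduce on w2's output in the forward pass)
# around the sharded matrix multiplications.
out = mlp(torch.randn(4, 1024, device="cuda"))
```

With the default layouts, ColwiseParallel replicates the input and shards its output on the last dimension, which matches RowwiseParallel's expected input, so no extra communication is needed between the two linear layers.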
At a high level, PyTorch Tensor Parallel works as follows:
Sharding initialization
- Determine which ParallelStyle to apply to each layer and shard the initialized module by calling parallelize_module.
Runtime forward/backward