12.2. Data Parallelism#
Data parallelism is one of the most common approaches to parallelizing large model training, and compared with other parallelism strategies it is simpler and more intuitive to implement. As shown in Fig. 12.3, a copy of the model is loaded onto each GPU device, and the training data is split into multiple parts, each of which is trained independently on a different GPU. This programming model is known as Single Program Multiple Data (SPMD): every device runs the same program but operates on its own portion of the data.
Fig. 12.3 Data Parallelism Diagram#
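To make the SPMD pattern concrete, below is a minimal, hedged sketch using mpi4py (an assumption; the text does not prescribe a particular library). Every process runs the identical script, and each process uses its rank to select its own shard of the data. The toy dataset, the slicing rule, and the script name are purely illustrative.

```python
# A minimal sketch of the SPMD pattern behind data parallelism.
# Assumes mpi4py is installed and the script is launched with,
# e.g., `mpirun -n 2 python spmd.py` (illustrative name).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()        # this process's id: 0, 1, ...
world_size = comm.Get_size()  # total number of processes / devices

# Every process runs this same program (Single Program) ...
full_dataset = np.arange(8)

# ... but operates only on its own slice of the data (Multiple Data).
shard = full_dataset[rank::world_size]
print(f"rank {rank} trains on shard {shard}")
```

Launched with two processes, both ranks execute identical code yet print different shards, which is exactly the SPMD behavior described above.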
Non-Parallel Training#
Section 2.2 introduces the process of training neural network models. We will first discuss the non-parallel scenario, using the MNIST handwritten digit recognition example for demonstration. As shown in Fig. 12.4, this example illustrates one forward pass and one backward pass.
Fig. 12.4 Training a neural network on a single GPU#
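The following is a hedged sketch of the single-GPU training step illustrated in Fig. 12.4, written with PyTorch and torchvision (an assumption; the chapter does not fix a framework). The network architecture, batch size, and learning rate are illustrative choices, not values from the text.

```python
# A minimal single-device sketch: one forward pass and one backward pass on MNIST.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=64, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                      nn.Linear(256, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for images, labels in loader:
    images, labels = images.to(device), labels.to(device)
    loss = loss_fn(model(images), labels)   # forward pass
    optimizer.zero_grad()
    loss.backward()                         # backward pass: compute gradients
    optimizer.step()                        # update the model weights
    break  # one forward/backward pass, as in Fig. 12.4
```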
Data Parallelism#
Data parallelism splits the dataset into multiple parts and replicates the model weights on different GPUs. As shown in Fig. 12.5, suppose there are two GPUs, each holding a copy of the model weights and its own subset of the input data. On each GPU, forward and backward propagation are carried out independently: forward propagation calculates the output values of each layer, while backward propagation computes the gradients of the model weights. These computations on different GPUs do not interfere with each other.
Fig. 12.5 Training a neural network on two GPUs#
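To show this independence concretely, the hedged sketch below keeps two replicas of the same model on the CPU (standing in for two GPUs), feeds each replica its own half of a batch, and runs forward and backward on each replica separately. The synthetic data and the names `replicas` and `shards` are illustrative assumptions, not part of any specific API.

```python
# A CPU-only sketch of the idea in Fig. 12.5: two replicas of one model,
# each computing gradients independently on its own data shard.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
base_model = nn.Linear(784, 10)
loss_fn = nn.CrossEntropyLoss()

# "Load a copy of the model onto each device" (both copies stay on CPU here).
replicas = [copy.deepcopy(base_model) for _ in range(2)]

# Split one batch of data into two shards, one per replica.
images = torch.randn(64, 784)
labels = torch.randint(0, 10, (64,))
shards = list(zip(images.chunk(2), labels.chunk(2)))

# Forward and backward run independently on each replica ...
for replica, (x, y) in zip(replicas, shards):
    loss_fn(replica(x), y).backward()

# ... so the resulting gradients generally differ between replicas.
g0 = replicas[0].weight.grad
g1 = replicas[1].weight.grad
print(torch.allclose(g0, g1))  # typically False
```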
Up to this point, i.e., during the forward and backward propagation stages, there is no communication cost. However, when it comes to updating the model weights, synchronization is necessary because the gradients obtained on each GPU differ. The gradients are averaged to obtain the final gradient. Taking \(\boldsymbol{W}\) as an example, let \(\frac{\partial L}{\partial \boldsymbol{W}}^{i}\) denote the gradient computed on GPU \(i\) and \(\frac{\partial L}{\partial \boldsymbol{W}}^{sync}\) the synchronized average over \(N\) GPUs: \(\frac{\partial L}{\partial \boldsymbol{W}}^{sync} = \frac{1}{N}\sum_{i=1}^{N}\frac{\partial L}{\partial \boldsymbol{W}}^{i}\).
To synchronize the gradients across different GPUs, you can use the AllReduce primitive provided by MPI. MPI's AllReduce collects the independently computed gradients from each GPU, averages them, and then broadcasts the averaged gradient back to each GPU. As shown in Fig. 12.6, during the gradient synchronization stage, MPI's AllReduce primitive ensures the consistency of gradients across all GPUs.
Fig. 12.6 When updating model weights, you need to use MPI's AllReduce primitive to synchronize the gradients across all GPUs.#
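Below is a minimal, hedged sketch of this gradient synchronization step using mpi4py's Allreduce (the MPI binding is an assumption; any MPI implementation with AllReduce would do). The gradient array is a random stand-in for \(\frac{\partial L}{\partial \boldsymbol{W}}^{i}\); in a real training loop it would come from backpropagation on each rank's data shard.

```python
# A minimal mpi4py sketch of gradient synchronization with AllReduce.
# Assumes a launch such as `mpirun -n 2 python sync.py` (illustrative name).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
world_size = comm.Get_size()

# Each rank holds its own locally computed gradient (a random stand-in here,
# so it differs across ranks, just like real per-GPU gradients would).
local_grad = np.random.rand(256).astype(np.float64)

# AllReduce sums the gradients across all ranks and leaves the same result
# on every rank; dividing by the number of ranks gives the averaged gradient.
synced_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, synced_grad, op=MPI.SUM)
synced_grad /= world_size

# Every rank now applies the identical synced_grad to its model weights,
# keeping all replicas consistent after the update.
```

After the call, every rank holds the same averaged gradient, so applying it locally keeps the model replicas identical across GPUs, which is exactly the consistency requirement described above.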