Efficient distributed training in deep learning

At the heart of modern deep learning systems lies a clear trend: the models, and the datasets they are trained on, keep getting larger.

To get a clear sense of this trend, let us examine recent large language models (LLMs) such as BERT and GPT-3. Let's start with BERT, or Bidirectional Encoder Representations from Transformers. BERT was introduced in 2018 by Jacob Devlin and his colleagues at Google. Large Language Models such as BERT are responsible for the current revolution in Natural Language Processing (NLP). BERT is a self-supervised learning system that learns representations from enormous text datasets without any kind of annotation. Using representations from BERT, one can solve many supervised NLP tasks, such as translation and Named Entity Recognition, with a fraction of the annotated data. In other words, BERT representations eliminate the need for large annotated datasets to solve NLP tasks.
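To make this concrete, here is a minimal sketch of extracting representations from a pretrained BERT model. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which is prescribed here; the pooled feature would then feed a small task-specific classifier.

```python
# A minimal sketch (assumption: Hugging Face "transformers" is installed) of
# extracting contextual representations from a pretrained BERT model.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentence = "BERT representations reduce the need for large annotated datasets."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings: shape (batch, sequence_length, hidden_size=768).
token_repr = outputs.last_hidden_state
# A simple sentence-level feature: mean-pool over tokens; a small supervised
# classifier trained on top of this needs far less labeled data.
sentence_repr = token_repr.mean(dim=1)
print(sentence_repr.shape)  # torch.Size([1, 768])
```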

One of the most impactful revolutions brought by LLMs like BERT was in the area of Search. Engines like Google Search, Microsoft Bing, or DuckDuckGo share a common challenge: they need to understand language. After all, if one can clearly understand what people are looking for when they type queries into a web search engine, an essential part of the problem is already solved.


This trend of scaling deep learning models is not particular to NLP applications. A similar pattern also happened in Computer Vision a few years ago.

As a result, if one is planning to use state-of-the-art deep learning systems to solve perceptual problems like Vision, NLP, or Speech, chances are one will be dealing with one of these large models, and with the distributed training skills required to train or fine-tune them for a particular task.

There are two main paradigms for distributed training of deep learning models: Data parallelism and Model parallelism. Data parallelism is by far the simpler and more popular of the two. High-performance deep learning frameworks such as PyTorch implement both approaches, but strongly recommend that users apply Data parallelism rather than Model parallelism whenever possible.

In Data parallelism, the data is divided into separate subsets, where the number of subsets equals the number of available worker nodes. We can think of a worker node as an independent process with its own memory and Python interpreter. Each worker receives its subset of the data and a copy of the original model.
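As a quick illustration, here is a minimal sketch, assuming PyTorch, of how each worker can be restricted to its own subset of the data. In a real distributed training script the number of replicas and the rank come from the initialized process group rather than being hard-coded.

```python
# A minimal sketch of per-worker data sharding with PyTorch's DistributedSampler.
# num_replicas and rank are passed explicitly to keep the example self-contained.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

world_size = 4   # number of workers
rank = 0         # id of this worker (0 .. world_size - 1)

sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

# Each of the 4 workers iterates over its own ~250-example shard per epoch.
print(len(sampler))  # 250
```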

Notice that each worker needs to have enough memory to load the entire model.

The training process works as follows. Each worker:

1. Receives its subset of the data and a replica of the model.
2. Samples a mini-batch from its subset and runs the forward pass.
3. Computes the loss for that mini-batch.
4. Runs the backward pass to compute the local gradients.
5. Synchronizes its gradients with the other workers and updates its local copy of the model.

As for step 5, from time to time workers need to synchronize gradients among themselves so that each node can update its local copy of the model based on the changes computed by its fellow workers. This synchronization is usually done once per batch, across all processes, to keep the model replicas consistent.
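To illustrate what this step amounts to, the sketch below hand-rolls the synchronization as an all-reduce that averages each parameter's gradient across workers. As discussed next, PyTorch takes care of this automatically and more efficiently, so this is purely conceptual.

```python
# Conceptual sketch of per-batch gradient synchronization.
# Assumes torch.distributed has already been initialized across `world_size` processes.
import torch.distributed as dist

def synchronize_gradients(model, world_size):
    for param in model.parameters():
        if param.grad is None:
            continue
        # Sum this parameter's gradient over all workers, then average it,
        # so every worker applies exactly the same update to its local replica.
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= world_size
```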

Note that since PyTorch's DistributedDataParallel (DDP) implements multi-process parallelism (each worker is a separate process), there is no GIL contention, because each process has its own Python interpreter. Moreover, the package efficiently takes care of synchronizing gradients, so one does not need to be concerned about that step.
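Putting these pieces together, a DDP training script might look roughly like the sketch below. The model, file name, and hyperparameters are illustrative, not taken from any particular codebase; such a script would typically be launched with torchrun, one process per GPU.

```python
# A minimal sketch of Data parallelism with PyTorch DistributedDataParallel (DDP).
# Launch with e.g.: torchrun --nproc_per_node=2 train_ddp.py  (names illustrative)
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend="nccl")      # one process per GPU
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    # Every worker holds a full replica of the model...
    model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2)).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # ...but only iterates over its own shard of the data.
    dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)                 # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                      # DDP all-reduces gradients here
            optimizer.step()
        if rank == 0:
            print(f"epoch {epoch} done, last batch loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```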

Distributed Data-Parallel is very useful for most situations when training deep learning models. However, there are some cases in which the Data parallelism paradigm does not fit well.

One of the implicit assumptions of Data parallelism is that the data is the most memory-heavy component to deal with. From this perspective, it is reasonable to think that we should split the data into smaller subsets and process them separately, as Data parallelism suggests. However, what if the model, rather than the data, is the most memory-expensive component? In this situation, the paradigm we have discussed so far does not work because, in Data parallelism, the same model is replicated on each worker. In other words, each worker needs to be able to load the entire model into memory.

That is where the other approach to distributed training, Model parallelism, proves valuable. In Model parallelism, instead of splitting the training data into subsets for each worker, the workers now have access to the entire training data. However, rather than replicating the entire model on each worker, a single model is split across many workers. Specifically, the layers of a deep learning model are divided among different workers. For example, for a model with 8 layers, under DDP each GPU holds a replica of all 8 layers, whereas under Model parallelism on, say, two GPUs, each GPU could host 4 of them.
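As a rough illustration of this layer splitting, here is a minimal sketch assuming a toy 8-layer model and a machine with two GPUs (cuda:0 and cuda:1); the split point is arbitrary and would in practice be chosen to balance memory across devices.

```python
# A minimal sketch of Model parallelism: an 8-layer toy model split across two GPUs.
# The first 4 layers live on cuda:0, the last 4 on cuda:1; activations are moved
# between devices inside forward(). Assumes at least two GPUs are available.
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self, hidden=1024):
        super().__init__()
        self.part1 = nn.Sequential(*[nn.Linear(hidden, hidden) for _ in range(4)]).to("cuda:0")
        self.part2 = nn.Sequential(*[nn.Linear(hidden, hidden) for _ in range(4)]).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Hand the intermediate activation over to the second GPU.
        x = self.part2(x.to("cuda:1"))
        return x

model = TwoGPUModel()
out = model(torch.randn(8, 1024))
# The labels/loss must live on the same device as the output (cuda:1).
print(out.device)
```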

Model parallelism is especially useful in situations where large mini-batches or high-resolution inputs are required, or where the model is simply too large to fit on a single GPU.

Deep learning models have been revolutionizing many areas of industry and science, especially those that deal with perception problems like language, computer vision, and speech. Nowadays, we can take advantage of large models, trained on large datasets using expensive parallel hardware, to solve many tasks of our day-to-day lives. However, training these models is not trivial. On the contrary, it involves quite complex parallel training strategies, capable of solving, in a matter of days, tasks that would otherwise require years of serial computing.

In general, if a model is small enough to fit on a single GPU, the best option is the Data parallelism distributed strategy. With just a few lines of code, one can split the work among different workers, each with its own memory address space and GPU. For situations where the model does not fit entirely on a single GPU, we can split its layers across different GPUs; that is Model parallelism, and it is the best strategy for training very large deep learning models.

This piece was written by Thalles Silva with the Innovation Team at Daitan. Thanks to João Caleffi and Kathleen McCabe for reviews and insights.
