Microsoft is preparing to use GPT-3 under an exclusive license from OpenAI

Unveiled at Microsoft’s Ignite 2020 virtual conference, this initiative aims to enable the cloud giant to “build and deliver advanced AI solutions to its customers,” said Kevin Scott, EVP and CTO at Microsoft.


Last year, Microsoft announced a billion-dollar investment in OpenAI, months after the research organization became a for-profit company. With this investment, Microsoft wanted to commercialize OpenAI projects, develop new HPC technologies for Azure AI, and continue research on artificial general intelligence (AGI).

Better language models

By obtaining an exclusive license for the GPT-3 language model from OpenAI, Microsoft aims to strengthen its existing language models, which are already considered among the most advanced on the market.

“Microsoft claims that its technologies match human abilities in important areas such as language. This year, Microsoft began communicating about its massive, large-scale models,” said Nick McQuire, Senior Vice President and Head of AI and Enterprise Research at CCS Insight, referring to Turing-NLG. This deep learning language model has more than 17 billion parameters.

Turing-NLG was the largest natural language generation (NLG) model available when Microsoft introduced it in February of this year. According to Microsoft at the time, Turing-NLG could improve the writing-assistant functions in the Microsoft Office suite or power advanced chatbots and digital assistants.

OpenAI’s GPT-3, launched in May, surpasses Turing-NLG with its 175 billion parameters. One hundred and seventy-five billion parameters: the number is staggering, and it easily makes GPT-3 the largest language model ever developed. Some of the tool’s code was uploaded to GitHub in June.

An article introducing GPT-3, written by more than two dozen researchers and engineers at OpenAI, shows that the model’s writing comes very close to that of a human, which raises ethical questions.

“Huge” opportunities

It is not yet known how Microsoft will use the GPT-3 model, but the tech giant could improve its own language models, which form the foundation of many products including Microsoft Search, Project Cortex, and the Office suite.

“While these massive models may be difficult to evaluate at this point, this capability should help improve its products, be it SharePoint, PowerPoint and Office, Bing or Xbox,” says Nick McQuire.

In the blog post published on September 22nd announcing the license agreement, Kevin Scott does not detail the company’s plans, but they could go beyond the scope of existing products. “The commercial and creative potential that the GPT-3 model can deliver is huge, with genuinely new capabilities, most of which we have not yet imagined. Directly supporting human creativity and ingenuity in areas such as writing and composition, describing and summarizing large blocks of data (including code), and converting one natural language into another: the possibilities are limited only by the ideas and scenarios we put on the table,” he wrote.

However, he says that OpenAI will continue to offer GPT-3 and other models through its Azure-hosted API. These will be available to researchers and contributors.

GPT-3 runs on a “custom” supercomputer developed by Microsoft and hosted on Azure. What Microsoft’s CTO did not mention is that training a model with 175 billion parameters is an obstacle even for a cloud giant like Microsoft. At first glance, it is difficult to imagine the computing power and time it would take to get decent performance out of such a juggernaut.

“This strong growth is a real problem: the more complex a model is, the longer and more expensive the training. Although it is technically possible to train models with more and more variables, it would take unaffordable amounts of time and computing power,” comments Guillaume Besson, senior data consultant at the IT and AI strategy consultancy Quantmetry. Yet the Redmond company claims to have the solution, he recalls.

The secret ingredient: DeepSpeed

To address this problem, Microsoft’s teams have developed a solution that can train neural networks loaded with up to trillions of parameters on current hardware. Its name: DeepSpeed.

DeepSpeed is released under the MIT license, subject to a Contributor License Agreement (CLA), and is a library for optimizing deep learning training in PyTorch. Written in Python and C++, it works with Nvidia’s CUDA compute interface. It is meant to enable the training of models ten times larger, ten times faster. To this end, Microsoft’s engineers and researchers worked on data parallelism to reduce the consumption of VRAM (GPU memory).
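
To give an idea of what this looks like in code, here is a minimal sketch of wrapping a PyTorch model with DeepSpeed, assuming a recent release of the library; the toy model, configuration values and dummy loss are purely illustrative and are not taken from Microsoft’s own setup.

```python
# Hypothetical sketch: wrapping a small PyTorch model in a DeepSpeed engine.
# ToyNet and the config values are illustrative only; real runs are usually
# launched across several GPUs with the `deepspeed` command-line launcher.
import torch
import torch.nn as nn
import deepspeed

class ToyNet(nn.Module):
    """Tiny stand-in for a much larger language model."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)
        )

    def forward(self, x):
        return self.net(x)

ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

model = ToyNet()
# deepspeed.initialize returns an engine that manages the optimizer,
# data-parallel communication and (optionally) ZeRO memory optimizations.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

batch = torch.randn(8, 1024, device=engine.device)
loss = engine(batch).pow(2).mean()   # dummy loss, for illustration only
engine.backward(loss)                # replaces loss.backward()
engine.step()                        # replaces optimizer.step()
```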

With data parallelism, each GPU in a system runs the same model on different batches of data. The problem, according to the researchers, is that the model states (the parameters, the gradients, the Adam optimizer states, etc.) are replicated on every GPU. To reduce this memory consumption, they developed the solution at the heart of DeepSpeed: the Zero Redundancy Optimizer (ZeRO).
ZeRO has two flagship capabilities. ZeRO-DP no longer replicates a model’s states but partitions them across GPUs, while ZeRO-R limits the memory consumption associated with the activations produced by a neural network’s layers.
As a result, a single 32GB Nvidia V100 GPU could be enough to train a deep learning model with 10 billion parameters, compared with 1.4 billion without ZeRO. It is this solution that made Turing-NLG possible.
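
In practice, these optimizations are requested through DeepSpeed’s configuration. The sketch below is a hypothetical example in the spirit of ZeRO, not a reproduction of Microsoft’s actual Turing-NLG settings; the stage numbers follow the ZeRO paper (partition optimizer states first, then gradients, then the parameters themselves).

```python
# Illustrative DeepSpeed configuration sketch (values are assumptions).
ds_zero_config = {
    "train_batch_size": 512,
    "fp16": {"enabled": True},        # half precision further reduces VRAM use
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        # Stage 1 partitions optimizer states across GPUs, stage 2 also
        # partitions gradients, stage 3 partitions the parameters as well.
        "stage": 2,
    },
    # Roughly in the spirit of ZeRO-R: trade recomputation for activation memory.
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": True,
    },
}
```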

Currently at version 0.3, the project is still young: Microsoft made it publicly available last May, at the same time announcing that its supercomputer hosts OpenAI’s NLG model. However, the associated research paper asserts that DeepSpeed’s ZeRO can “effectively” train a deep learning model with over 170 billion parameters on a cluster of 400 graphics processors. A 32GB Tesla V100 from Nvidia currently costs around $8,000, and Nvidia is pushing businesses toward its new Ampere architecture.