Optimize Virtualized Deep Learning Performance

Executive Summary

Four different image classification tests were used to demonstrate the performance benefits of running deep learning inference on the 2nd Generation Intel® Xeon® Scalable processor compared to previous Intel processors, and to show the performance benefits of running on the VMware vSphere® hypervisor compared to bare metal.

The 2nd Generation Intel® Xeon® Scalable processor’s Deep Learning Boost technology includes new Vector Neural Network Instructions (VNNI), which are especially performant with input data expressed as an 8-bit integer (int8) rather than a 32-bit floating point number (fp32). Together with the large VNNI registers, these instructions provide a marked performance improvement in image classification over the previous generation of Intel® Xeon® Scalable processors.

The latest version of vSphere, 7.0, supports VNNI instructions. The work reported in this paper demonstrates a very small virtualization overhead for single image inferencing but major performance advantages for properly configured virtualized servers compared to the same servers running as bare metal.

Introduction

Deep learning typically refers to machine learning on non-tabular data (images, voice, etc.) using neural networks. Applications include image classification (used in license plate detectors and facial recognition systems), object detection (which identifies the objects in an image and is used, for example, in autonomous vehicles), and natural language processing (used by voice assistants such as Alexa and Siri and by text applications such as chatbots).

To train a neural network (Figure 1), one starts with a set of weights w_ij, each representing the multiplicative factor on the connection between two adjacent neurons in the network. A set of pre-labeled training data x_i (for example, the pixels of an image together with its classification) is applied to the input layer, and the network is run in the forward propagation direction, producing a set of predicted classifications y_i. The predicted classifications are then compared to the actual labels, generating a loss that the back-propagation step uses to modify the weights so as to reduce that loss. The cycle of forward propagation followed by back propagation is repeated until the desired accuracy is achieved. The final set of weights, along with the model structure, is referred to as the trained model.
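
As a concrete illustration, the following minimal sketch (written in PyTorch; the two-layer model, data, and hyperparameters are placeholders, not the configuration used in this paper) shows the forward propagation / loss / back propagation cycle described above.

    import torch
    import torch.nn as nn

    # Placeholder two-layer network; real models such as ResNet50 are far larger.
    model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # x: pre-labeled training inputs; y: their labels (random stand-ins here).
    x = torch.randn(64, 784)
    y = torch.randint(0, 10, (64,))

    for epoch in range(10):             # repeat until the desired accuracy is reached
        predictions = model(x)          # forward propagation
        loss = loss_fn(predictions, y)  # compare predictions to the actual labels
        optimizer.zero_grad()
        loss.backward()                 # back propagation computes the weight gradients
        optimizer.step()                # adjust the weights to reduce the loss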

In inference, a new, unlabeled input is run through the trained model in just one forward propagation step to infer the category of an object or the meaning of a voice command. The two key metrics for inference are latency, the response time to a single image or command, and throughput, the overall amount of inferencing a system can provide.
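
The snippet below sketches how the two metrics are typically measured (a hypothetical setup: a randomly initialized torchvision ResNet50 and arbitrary batch sizes, used only to illustrate the timing, not the benchmark harness used in this paper). Timing one input gives latency; dividing a batch size by the time to process the batch gives throughput.

    import time
    import torch
    import torchvision.models as models

    model = models.resnet50().eval()            # stand-in for a trained model
    single_image = torch.randn(1, 3, 224, 224)  # one unlabeled input
    batch = torch.randn(64, 3, 224, 224)        # larger batch for throughput

    with torch.no_grad():
        start = time.time()
        model(single_image)
        latency = time.time() - start                        # seconds per image

        start = time.time()
        model(batch)
        throughput = batch.shape[0] / (time.time() - start)  # images per second

    print(f"latency: {latency * 1000:.1f} ms, throughput: {throughput:.1f} images/s")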

Popular current models can have tens of millions of weights and, in the deepest cases, well over a hundred layers; the ResNet50 used in this work has roughly 25 million weights across 50 layers. Doing all of these calculations, even for one forward propagation step, requires processors that can quickly handle large numbers of matrix multiply/add operations.
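
A back-of-the-envelope calculation makes the scale clear. The figures below use one representative 3x3 convolution shape from a ResNet-style network (an illustrative shape, not an exact layer of the model benchmarked here):

    # 3x3 convolution with 256 input and 256 output channels on a 56x56 feature map
    h, w = 56, 56
    c_in, c_out = 256, 256
    k = 3

    macs = h * w * c_in * c_out * k * k                # multiply/add operations, one layer
    print(f"{macs / 1e9:.2f} billion multiply/adds")   # about 1.85 billion

A single such layer already requires nearly two billion multiply/add operations, and a full forward pass chains dozens of layers like it.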

To that end, Intel has introduced Intel® Deep Learning Boost (Intel® DL Boost) technology, a new set of features in their 2nd Generation Intel® Xeon® Scalable processor (“Cascade Lake”). This feature set includes new Vector Neural Network Instructions (VNNI), which are especially performant with input data expressed as an 8-bit integer (int8) rather than a 32-bit floating point number (fp32). Together with the large VNNI registers, these instructions provide a marked performance improvement over the previous generation of Intel® Xeon® Scalable processors (“Skylake”).
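
The numerical idea behind int8 inference can be sketched in a few lines: fp32 values are mapped to int8 using scale factors, the int8 products are accumulated in a wider int32 result (VNNI fuses this multiply-and-accumulate sequence, which previously required three separate AVX-512 instructions, into one), and the accumulator is rescaled back to fp32. The NumPy emulation below is illustrative only; the scale factors and tensors are made up, and it ignores details such as the signed/unsigned operand handling of the actual instructions.

    import numpy as np

    def quantize(x, scale):
        """Map fp32 values to int8 using a per-tensor scale factor."""
        return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

    activations_fp32 = np.random.randn(64).astype(np.float32)
    weights_fp32 = np.random.randn(64).astype(np.float32)

    a_scale, w_scale = 0.05, 0.02                  # illustrative scale factors
    a_int8 = quantize(activations_fp32, a_scale)
    w_int8 = quantize(weights_fp32, w_scale)

    # int8 x int8 products accumulated in int32, then rescaled back to fp32
    acc_int32 = np.dot(a_int8.astype(np.int32), w_int8.astype(np.int32))
    result_fp32 = acc_int32 * a_scale * w_scale

    print(result_fp32, np.dot(activations_fp32, weights_fp32))  # close, not identical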

vSphere 7 supports the VNNI instructions, which allows VNNI-accelerated workloads to run on the vSphere hypervisor and lets users take advantage of all that VMware® virtualization offers: enhanced server utilization, scriptable deployment, and ease of management. The work reported in this paper demonstrates a very small virtualization overhead for single image inferencing but major performance advantages for properly configured virtualized servers compared to the same servers running as bare metal.
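
One practical consequence is that a Linux guest on vSphere 7 sees the VNNI capability in its CPU feature flags. A quick check such as the one below (Linux-specific; it simply reads /proc/cpuinfo inside the VM) confirms whether the virtual hardware exposes the instruction set to the deep learning frameworks running in the guest.

    # Linux-only check: look for the avx512_vnni flag among the guest's CPU features.
    with open("/proc/cpuinfo") as f:
        flags = next(line for line in f if line.startswith("flags"))

    print("VNNI available" if "avx512_vnni" in flags else "VNNI not exposed to this VM")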

In the next section, we describe in detail the Intel AI strategy and VNNI architecture. Next, we show the hardware and software configurations used in the tests, followed by descriptions of the individual tests run, and the results of those tests.

Intel’s AI Strategy

Over the last few years, we have seen exponential growth in the use of AI techniques to solve problems in various domains, such as health care, manufacturing, finance, and business operations. This growth has been fueled by the explosion of available data and the need for sophisticated means of gleaning meaningful insights from that data.

It is expected that the data generated from various “smart” devices will grow from 40 zettabytes to about 175 zettabytes by 2025 [1]. This data will come from more than 150 billion devices owned by over 6 billion people. Making meaningful use of this data deluge will require smart analytics, giving rise to new, more innovative and efficient AI techniques. Focusing only on developing AI algorithms, or only on designing efficient AI chips, will not address the complexity of data-insight analysis. We need to address it with a holistic, system-level view that encompasses hardware (AI chips, storage, memory, and fabric), mature and open software (algorithms, libraries, and tools), and the tight co-design of hardware and software.

Intel’s AI strategy offers the most diverse portfolio of highly performant and efficient compute and revolutionizes decades-old memory and storage hierarchies. It fills new gaps identified by AI workloads and enables the development of full-stack software based on open components. This simplifies AI deployments across increasingly heterogeneous hardware environments while integrating with existing frameworks and toolchains. The AI portfolio delivers a robust ecosystem of ready-made solutions for every industry.

From a compute perspective, Intel starts by extending its widely deployed CPU portfolio to include specific technologies (such as Intel DL Boost) that allow enterprises to accelerate AI applications on existing, familiar infrastructure. From there, Intel offers multiple discrete hardware accelerators for a wide range of programmability, performance, energy, and latency requirements from cloud to edge, including Intel FPGAs and Intel Movidius Vision Processing Units (VPUs). For the second generation of its Intel Xeon Scalable processors, Intel added specific instructions to accelerate neural network inference, especially when using int8 numerics. These instructions, known as VNNI, give a theoretical speedup of up to 30x over fp32 instructions; the actual speedup depends heavily on the framework, the model, and the percentage of instructions that can actually use VNNI. Intel is also investing in other purpose-built accelerators to speed up deep learning training and inference.

Addressing software is a little more complex due to the various layers of the software stack; the fast growth of tools, libraries, and frameworks in the open source ecosystem; and the optimization of the software for specific hardware to run various types of AI applications and accommodate a variety of developer types.
