New Computing Paradigm for AI: Processing-in-Memory (PIM) Architecture

As ever-larger deep neural networks are trained on the latest and fastest chip technologies, an important challenge continues to bottleneck performance, and it is not compute power. No matter how fast a processor can execute a DNN, the data still has to move, and the pipelines that carry it on and off the chip are expensive and constrained. New solutions must be developed to advance capabilities.



By Nam Sung Kim, Samsung Electronics.

Deep neural network (DNN) models are an essential tool for AI/ML developers, with a broad range of applications including computer vision, speech recognition, natural language processing, machine translation, and many other areas. However, there is a large and growing problem with implementing DNNs on existing computing systems: while modern processors and accelerators provide plenty of compute capability, and modern DRAMs provide ample memory capacity, the data pipelines connecting processors to memory are simply not adequate for the volume of data that must be moved between them when running modern DNNs.

This inadequate data pipeline is rooted in the Von Neumann architecture used in virtually every computing system since the 1940s. For decades, advances in silicon chip performance were matched by advances in chip packaging and circuit board interconnects, which together form the data pipeline connecting the separate compute and memory functions. In recent years, however, growth in the number of chip input-output points and circuit board traces has been constrained by physical limits (see Figure 1), producing a mismatch between compute capability and the data-movement capability of the pipeline, and this mismatch grows worse with each new technology generation.


Figure 1. Growth in the number of chip input-output points and circuit board traces has become constrained by physical limits.

The challenge for AI developers is summed up by Jamie Hanlon, an engineering team leader at UK processor company Graphcore, who wrote: “At each layer of the DNN, you need to save the state to external DRAM, load up the next layer of the network and then reload the data to the system. As a result, the already bandwidth and latency constrained off-chip memory interface suffers the additional burden of constantly reloading weights as well as saving and retrieving activations. This [repeated data movement through the limited off-chip interface] significantly slows down the training time and considerably increases power consumption.”
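To put rough numbers on that burden, the back-of-envelope sketch below estimates the off-chip traffic and arithmetic for a single fully connected layer. The layer size, batch sizes, and 2-byte data width are illustrative assumptions, not measurements of any particular system. At small batch sizes, nearly every byte of weight data fetched from DRAM supports only a couple of floating-point operations, which is exactly the regime Hanlon describes.

def dense_layer_traffic(batch, d_in, d_out, bytes_per_elem=2):
    """Rough FLOPs and off-chip bytes for one forward pass of a dense layer,
    assuming weights and activations are streamed from external DRAM."""
    weight_bytes = d_in * d_out * bytes_per_elem        # weights reloaded from DRAM
    act_in_bytes = batch * d_in * bytes_per_elem        # activations loaded
    act_out_bytes = batch * d_out * bytes_per_elem      # activations saved back
    flops = 2 * batch * d_in * d_out                    # multiply-accumulate count
    total_bytes = weight_bytes + act_in_bytes + act_out_bytes
    return flops, total_bytes, flops / total_bytes      # last value: FLOPs per byte

# Illustrative 4096x4096 layer: small batches are dominated by weight traffic.
for batch in (1, 8, 256):
    flops, moved, intensity = dense_layer_traffic(batch, 4096, 4096)
    print(f"batch={batch:4d}  FLOPs={flops:.2e}  bytes moved={moved:.2e}  "
          f"FLOPs/byte={intensity:.1f}")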

With DNNs and other memory-bound applications creating energy use and thermal cooling issues for data centers and mobile devices alike, a radical change is clearly needed. However, abandoning the established Von Neumann architecture (as would be required for quantum computing, neuromorphic computing, or other radical approaches) would be profoundly disruptive to millions of hardware and software engineers and the thousands of companies and organizations that depend on them for timely development.

 

Overcoming Memory Bottlenecks with a New Approach

 

There is a promising non-disruptive approach that leverages existing CMOS technology to reduce the amount of data that must be exchanged between processor and memory. Known as processing-in-memory (PIM), it integrates AI-oriented processing capabilities into high-bandwidth memory (HBM) and other types of RAM. As shown in Figure 2, this allows memory-bound workloads with low arithmetic intensity to be offloaded from the CPU, GPU, or other primary processor and completed with far less data transfer overhead. Recent tests of HBM-PIM with a Xilinx Virtex UltraScale+ (Alveo) AI accelerator and an unmodified HBM controller showed a 2.5x performance gain and over a 60 percent reduction in energy consumption.
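One way to reason about which workloads benefit is a simple roofline-style check: an operation whose arithmetic intensity (FLOPs per byte moved) falls below the ratio of the host's peak compute to its memory bandwidth is memory-bound, and is therefore a candidate for PIM offload. The sketch below illustrates the idea; the peak-compute and bandwidth figures, and the intensity values assigned to each operation, are placeholder assumptions rather than specifications of any particular accelerator or of HBM-PIM.

PEAK_FLOPS = 20e12               # assumed host peak compute, FLOP/s (placeholder)
MEM_BW = 400e9                   # assumed off-chip bandwidth, bytes/s (placeholder)
RIDGE = PEAK_FLOPS / MEM_BW      # FLOPs/byte where compute and bandwidth balance

def attainable_flops(intensity):
    """Roofline model: throughput is capped by either compute or memory bandwidth."""
    return min(PEAK_FLOPS, MEM_BW * intensity)

def is_pim_candidate(intensity):
    """Operations left of the ridge point are memory-bound, so executing them
    next to the DRAM arrays avoids most of the data movement."""
    return intensity < RIDGE

for name, intensity in [("element-wise add", 0.1),
                        ("GEMV (matrix-vector)", 1.0),
                        ("large GEMM", 100.0)]:
    print(f"{name:22s} intensity={intensity:6.1f} FLOPs/byte  "
          f"attainable={attainable_flops(intensity):.2e} FLOP/s  "
          f"offload to PIM: {is_pim_candidate(intensity)}")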


Figure 2: With board-level and package-level interconnect data rates unable to keep pace with advances in processors and memories, many AI functions have become memory-bound. Processing-in-memory (PIM) technology offers an appealing way to relieve the pressure without abandoning the traditional Von Neumann computing architecture.

Importantly, significant gains can be achieved without modifications to the TensorFlow ML source code: tests pairing HBM-PIM with an AMD GPU showed a 2.5x performance gain and a 60 percent reduction in energy consumption. The accompanying software stack ensures compatibility while keeping the host processor in control of the in-memory processing. Moreover, PIM technology can be adapted to a range of other memory organizations, including DIMM modules that can serve as drop-in replacements for standard DIMMs, just as HBM-PIM served as a drop-in replacement for standard HBM devices.
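The sketch below shows, in schematic form, how such a stack might route work transparently: the application calls an ordinary matrix-vector routine, and a runtime dispatcher decides whether the operation runs on the host or on the in-memory compute units. The function names pim_gemv and host_gemv, and the dispatch rule itself, are invented for illustration and do not reflect the API of Samsung's actual software stack.

import numpy as np

def host_gemv(w, x):
    """Run the matrix-vector product on the host processor (fallback path)."""
    return w @ x

def pim_gemv(w, x):
    """Stand-in for a kernel executed by the compute units inside the memory
    device; emulated on the host here so the sketch remains runnable."""
    return w @ x

def dispatch_gemv(w, x, intensity_threshold=10.0, bytes_per_elem=2):
    """The runtime, not the application, chooses where the op executes:
    low-arithmetic-intensity GEMVs go to PIM, the rest stay on the host."""
    flops = 2 * w.shape[0] * w.shape[1]
    bytes_moved = (w.size + x.size + w.shape[0]) * bytes_per_elem
    if flops / bytes_moved < intensity_threshold:
        return pim_gemv(w, x)
    return host_gemv(w, x)

w = np.random.rand(1024, 1024).astype(np.float32)
x = np.random.rand(1024).astype(np.float32)
y = dispatch_gemv(w, x)   # application code is unchanged; routing is transparent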

The PIM approach offers three significant advantages:

  • Compatibility with legacy architectures and programming models enables processor and system developers to take advantage of PIM easily, easing the burden of adoption and allowing end users to quickly reap the benefits of PIM’s capabilities for a wide range of applications, including mobile and edge platforms.
  • The PIM approach has been shown to greatly accelerate memory-bound AI models such as automatic speech recognition (ASR), BERT, and other Transformer-based networks. For example, HBM-PIM provided a 3.5x performance increase for DeepSpeech2, an LSTM-based model used for real-time speech recognition (see the sketch after this list).
  • By reducing the need to move data back and forth between memory and CPU, PIM can reduce energy consumption by over 60 percent, a critically important advantage in almost every type of computing environment.
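The LSTM case illustrates why such speedups are possible. The rough sketch below, with layer sizes chosen as assumptions rather than the actual DeepSpeech2 configuration, estimates the traffic for one recurrent timestep at batch size 1: the gate weight matrices must be re-streamed from DRAM on every step, yet each weight element supports only one multiply-accumulate, leaving arithmetic intensity near one FLOP per byte, which is squarely memory-bound.

def lstm_step_traffic(hidden, input_dim, bytes_per_elem=2):
    """Rough FLOPs and bytes moved for one LSTM timestep at batch size 1."""
    # Four gates, each with input-to-hidden and hidden-to-hidden weight matrices.
    weight_elems = 4 * hidden * (input_dim + hidden)
    weight_bytes = weight_elems * bytes_per_elem               # re-streamed every timestep
    vector_bytes = (input_dim + 2 * hidden) * bytes_per_elem   # x_t, h_{t-1}, h_t
    flops = 2 * weight_elems                                   # multiply-accumulate count
    total_bytes = weight_bytes + vector_bytes
    return flops, total_bytes, flops / total_bytes

flops, moved, intensity = lstm_step_traffic(hidden=2048, input_dim=2048)
print(f"per-timestep FLOPs={flops:.2e}, bytes moved={moved:.2e}, "
      f"FLOPs/byte={intensity:.2f}")   # roughly 1 FLOP per byte: memory-bound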

 

Evolving and Proliferating PIM for Wider Adoption

 

Looking ahead, PIM technology has a number of promising avenues for extension, including implementation in multiple types of DRAM: system-level simulations of LPDDR5-PIM indicate that a doubling of performance and a 60 percent reduction in energy usage are possible on tasks such as speech recognition, translation, and chatbots. As shown in Figure 3, subsequent generations are expected to provide ongoing improvements in system speed and energy efficiency.


Figure 3: PIM technology has potentially wide applications, from data center memory modules to low-power mobile platforms, with multiple data formats supported. Future generations are expected to advance both system performance and energy efficiency beyond today’s levels.

Standardized implementations will be an important part of winning wide adoption, and work has begun between leading corporations and standards-setting bodies with an eye toward establishing a robust PIM ecosystem with broad applicability. Up-front attention to form factor, timing, and other compatibility considerations means that PIM can be adapted to existing system architectures rather than requiring the deployment of new and incompatible standards. Efforts are already under way to evaluate the adoption of PIM into DRAM specifications such as HBM3 and LPDDR5.

The big takeaway for AI/ML developers working with DNNs is simple: the more memory-bound your application, and the more sensitive your physical infrastructure is to energy usage and thermal environment, the greater the applicability of PIM. This architecture begins the process of memory-logic convergence in a way that fits within the constraints of the Von Neumann architecture while addressing its most pressing problem area for important classes of applications such as AI/ML.

 

Bio: Nam Sung Kim, Ph.D., ACM & IEEE Fellow, is a Senior Vice President in the Memory Business Unit of Samsung Electronics.
