Untether AI Unveils Its Second-Generation At-Memory Compute Architecture at Hot Chips 2022

PALO ALTO, California – August 23, 2022 – Untether AI®, the leader in at-memory computation for artificial intelligence (AI) workloads, today announced at the HOT CHIPS 2022 conference its next-generation architecture for accelerating AI inference workloads called speedAI devices, with an internal codename “Boqueria.” At 30 TeraFlops per watt (TFlops/W) and 2 PetaFlops of performance, the speedAI architecture sets a new standard for energy efficiency and compute density.

Challenges of AI Inference Acceleration

AI is increasingly being deployed in a variety of markets, including financial technology, smart city and retail, natural language processing, autonomous vehicles, and scientific applications. There has been an explosion in the types of neural network architectures as well as compute demand, resulting in increased energy consumption for AI workloads. These demanding applications require increasing levels of accuracy to ensure safety and quality of results. Together, these requirements for flexibility, performance combined with energy efficiency, and accuracy necessitate a new approach to AI acceleration, which Untether AI delivers with its speedAI devices.

“The merits of at-memory compute have been proven with the first generation runAI® device, and the second generation speedAI® architecture enhances the energy efficiency, throughput, accuracy, and scalability of our offering,” said Arun Iyengar, CEO of Untether AI. “speedAI® devices offer an ability that is unmatched by any other inference offering in the marketplace.”

Energy efficiency drives performance

Because at-memory compute is significantly more energy efficient than traditional von Neumann architectures, more TFlops can be performed for a given power envelope. With the introduction of the runAI devices in 2020, Untether AI set a new energy efficiency level at 8 TOPS/W for the INT8 datatype. The speedAI® architecture dramatically improves upon that, delivering 30 TFlops/W. This energy efficiency is a product of the second-generation at-memory compute architecture, over 1,400 optimized RISC-V processors with custom instructions, energy efficient dataflow, and the adoption of a new FP8 datatype, all of which help quadruple efficiency compared to the previous generation runAI® device. The first member of the family, the speedAI240 device, provides 2 PetaFlops of FP8 performance and 1 PetaFlop of BF16 performance. This translates into industry leading performance and efficiency on neural networks like BERT-base, which speedAI240 can run at over 750 queries per second per watt (qps/W), 15x greater than the current state of the art from leading GPUs.
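As a rough illustration of how the headline figures relate, the sketch below divides the quoted peak FP8 throughput by the quoted efficiency. It assumes, purely for illustration, that both figures apply at the same operating point, which the announcement does not state. The function names are hypothetical, and the generation-over-generation ratio compares INT8 TOPS/W with FP8 TFlops/W, so it is indicative only.

```python
# Back-of-envelope arithmetic on the figures quoted in the text above.
PEAK_FP8_TFLOPS = 2000.0          # 2 PetaFlops of FP8 performance
SPEEDAI_TFLOPS_PER_W = 30.0       # 30 TFlops/W
RUNAI_TOPS_PER_W = 8.0            # 8 TOPS/W (INT8), first generation


def implied_power_watts(peak_tflops, tflops_per_watt):
    """Power implied IF peak throughput and peak efficiency coincided.

    That coincidence is an assumption made for this sketch, not a
    published specification.
    """
    return peak_tflops / tflops_per_watt


def efficiency_ratio(new_per_watt, old_per_watt):
    """Ratio of two per-watt figures (datatypes differ, so treat as indicative)."""
    return new_per_watt / old_per_watt


if __name__ == "__main__":
    watts = implied_power_watts(PEAK_FP8_TFLOPS, SPEEDAI_TFLOPS_PER_W)
    ratio = efficiency_ratio(SPEEDAI_TFLOPS_PER_W, RUNAI_TOPS_PER_W)
    print(f"Implied power envelope: {watts:.1f} W")   # 66.7 W
    print(f"Efficiency ratio vs. runAI: {ratio:.2f}x")  # 3.75x
```

The 3.75x ratio is consistent with the roughly fourfold improvement described above.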

Second-generation memory bank: Designed for flexible, efficient AI acceleration

Each memory bank of the speedAI® architecture has 512 processing elements with direct attachment to dedicated SRAM. These processing elements support INT4, FP8, INT8, and BF16 datatypes, along with zero-detect circuitry for energy conservation and support for 2:1 structured sparsity. Arranged in 8 rows of 64 processing elements, each row has its own dedicated row controller and hardwired reduce functionality to allow flexibility in programming and efficient computation of transformer network functions such as Softmax and LayerNorm. The rows are managed by two RISC-V processors with over 20 custom instructions designed for inference acceleration. The flexibility of the memory bank allows it to adapt to a variety of neural network architectures, including convolutional, transformer, and recommendation networks as well as linear algebra models.
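To show why a per-row reduce capability matters for transformer functions, the following plain-Python sketch writes Softmax and LayerNorm as element-wise steps interleaved with reductions (maximum, sum, mean) over one 64-element row. It illustrates the mathematics only and is not a description of how the speedAI hardware or its instruction set implements these functions; the function names and the epsilon value are assumptions.

```python
import math

ROWS = 8            # rows per memory bank (from the text)
PES_PER_ROW = 64    # processing elements per row (from the text)


def softmax(row):
    """Softmax over one row: two reductions plus element-wise work."""
    peak = max(row)                            # reduction 1: row maximum (numerical stability)
    exps = [math.exp(v - peak) for v in row]   # element-wise
    total = sum(exps)                          # reduction 2: row sum
    return [e / total for e in exps]           # element-wise


def layer_norm(row, eps=1e-5):
    """LayerNorm (without learned scale/shift): two reductions plus element-wise work."""
    n = len(row)
    mean = sum(row) / n                              # reduction 1: mean
    var = sum((v - mean) ** 2 for v in row) / n      # reduction 2: variance
    scale = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * scale for v in row]         # element-wise


if __name__ == "__main__":
    row = [float(i) / PES_PER_ROW for i in range(PES_PER_ROW)]
    print(ROWS * PES_PER_ROW)       # 512 processing elements per bank
    print(sum(softmax(row)))        # approximately 1.0
    print(sum(layer_norm(row)))     # approximately 0.0
```

In both functions the element-wise steps map naturally onto one value per processing element, while the reductions are the steps that need data from the whole row.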

FP8: The new datatype for accurate inference acceleration

In the search for energy efficiency, Untether AI’s research determined that two different FP8 formats offered the best mix of precision, range, and efficiency: a 4-bit mantissa version (FP8p for “precision”) and a 3-bit mantissa version (FP8r for “range”). Together they provided the best accuracy and throughput for inference across a variety of different networks. For both convolutional networks like ResNet-50 and transformer networks like BERT-Base, Untether AI’s implementation of FP8 results in less than 1/10th of 1 percent of accuracy loss compared to using BF16 data types, with a fourfold increase in throughput and energy efficiency.
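The precision side of that trade-off can be illustrated with a minimal sketch that rounds a value to a given number of explicit mantissa bits. This is a simplified model: it assumes an implicit leading one and round-to-nearest, and it ignores exponent range, bias, subnormals, and special values, none of which are specified in this announcement. The helper name `round_to_mantissa_bits` is hypothetical.

```python
import math


def round_to_mantissa_bits(x, mantissa_bits):
    """Round x to a significand with `mantissa_bits` explicit bits.

    Simplified model (assumption): implicit leading 1, round to nearest,
    unlimited exponent range. Real FP8 formats also clamp the exponent.
    """
    if x == 0.0:
        return 0.0
    frac, exp = math.frexp(x)              # x = frac * 2**exp, 0.5 <= |frac| < 1
    steps = 1 << (mantissa_bits + 1)       # significand resolution incl. implicit bit
    return math.ldexp(round(frac * steps), exp - (mantissa_bits + 1))


if __name__ == "__main__":
    value = 1.3
    for name, bits in (("4-bit mantissa", 4), ("3-bit mantissa", 3)):
        q = round_to_mantissa_bits(value, bits)
        print(f"{name}: {value} -> {q} (abs error {abs(value - q):.4f})")
    # 4-bit mantissa: 1.3 -> 1.3125 (abs error 0.0125)
    # 3-bit mantissa: 1.3 -> 1.25 (abs error 0.0500)
```

The extra mantissa bit halves the rounding step. In an 8-bit format that bit has to come from somewhere, and the names suggest it is traded against exponent range, which this sketch does not model.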

Scalability for large language models

The speedAI240 device is designed to scale to large models. The memory architecture is multi-leveled, with 238MB of SRAM dedicated to the processing elements offering 1 petabyte/s of memory bandwidth, four 1MB scratchpads, and two 64-bit wide ports of LPDDR5, providing up to 32GB of external DRAM. Host and chip-to-chip connectivity is provided by high-speed PCI-Express Gen5 interfaces.
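As a rough guide to what the multi-level memory means for model size, the sketch below classifies where a model's weights could reside on a single device, using only the capacities quoted above. It assumes one byte per FP8 parameter and two per BF16 parameter, decimal megabytes and gigabytes, and it ignores activations, scratchpads, and any compiler overhead. The function name and the example model sizes are hypothetical.

```python
SRAM_BYTES = 238e6    # 238MB of SRAM at the processing elements (from the text)
DRAM_BYTES = 32e9     # up to 32GB of external LPDDR5 (from the text)

BYTES_PER_PARAM = {"fp8": 1, "bf16": 2}   # storage size per weight (assumption: no padding)


def weight_placement(num_params, dtype):
    """Return the smallest memory tier whose capacity covers the weights alone."""
    size = num_params * BYTES_PER_PARAM[dtype]
    if size <= SRAM_BYTES:
        return "on-chip SRAM"
    if size <= DRAM_BYTES:
        return "external DRAM"
    return "multiple devices"


if __name__ == "__main__":
    # Hypothetical parameter counts, chosen only to exercise each tier.
    for params in (100e6, 1e9, 50e9):
        print(f"{params:.0e} params, FP8: {weight_placement(params, 'fp8')}")
```

Models that exceed a single device's capacity are the case addressed by chip-to-chip connectivity over PCI-Express Gen5.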

The imAIgine Software Development Kit Supports speedAI Family

The Untether AI imAIgine™ Software Development Kit (SDK) provides a path to running networks at high performance, with push-button quantization, optimization, physical allocation, and multi-chip partitioning. The imAIgine SDK also provides an extensive visualization toolkit, a cycle-accurate simulator, and an easily integrated runtime API. It is available now.

Availability

speedAI devices will be offered as standalone chips as well as in a variety of M.2 and PCI-Express form factor cards. Sampling of speedAI240 devices and cards to early access customers is expected to begin in the first half of 2023.

About Untether AI

Untether AI® provides energy-centric AI inference acceleration from the edge to the cloud, supporting any type of neural network model. With its at-memory compute architecture, Untether AI has solved the data movement bottleneck that costs energy and performance in traditional CPUs and GPUs, resulting in high-performance, low-latency neural network inference acceleration without sacrificing accuracy. Untether AI embodies its technology in runAI® and speedAI® devices, tsunAImi® acceleration cards, and its imAIgine® Software Development Kit. Founded in Toronto in 2018, Untether AI is funded by CPPIB, GM Ventures, Intel Capital, Radical Ventures, and Tracker Capital. More information can be found at www.untether.ai.

All references to Untether AI trademarks are the property of Untether AI. All other trademarks mentioned herein are the property of their respective owners.
