

## Introduction

ECE-GY 9483 / CSCI-GA 3033 Special Topics in Electrical Engineering: Efficient AI and Hardware Accelerator Design

## Life is Powered by Deep Learning

• Deep Neural Networks (DNNs) have achieved state-of-the-art performance across a variety of domains



- Use a Convolutional Neural Network (CNN) as an example
- This CNN contains four layers
  - 3 convolutional layers
  - 1 fully connected layer





























#### **DNN Execution: A Matrix View**



• Remaining layers follow this pattern.


## **Deployment of DNN: Problems**

• The majority of the computation in DNN inference consists of a series of **matrix multiplications**.

| Weight matrix dimensions (top: layer producing the output label 'rose'; bottom: first layer) |
|-------------|
| 4096, 1000  |
| 4096, 4096  |
| 25088, 4096 |
| 512, 4608   |
| 512, 4608   |
| 512, 4608   |
| 512, 4608   |
| 512, 4608   |
| 512, 2304   |
| 256, 2304   |
| 256, 2304   |
| 256, 1152   |
| 128, 1152   |
| 128, 576    |
| 64, 576     |
| 64, 27      |

VGG-16 is a CNN with about 138M weights across 16 matrices.
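As a sanity check on the scale involved, the sketch below sums the weight counts of the 16 matrices listed above (biases excluded). The list `VGG16_SHAPES` simply transcribes the table, and the function and variable names are illustrative.

```python
# Weight-matrix dimensions transcribed from the table above (16 matrices).
VGG16_SHAPES = [
    (4096, 1000), (4096, 4096), (25088, 4096),          # fully connected
    (512, 4608), (512, 4608), (512, 4608), (512, 4608), (512, 4608),
    (512, 2304), (256, 2304), (256, 2304), (256, 1152),
    (128, 1152), (128, 576), (64, 576), (64, 27),       # convolutional
]

def count_weights(shapes):
    """Total number of weights across all matrices (biases not included)."""
    return sum(rows * cols for rows, cols in shapes)

total = count_weights(VGG16_SHAPES)
fc_share = count_weights(VGG16_SHAPES[:3]) / total
print(f"{total:,} weights; {fc_share:.0%} sit in the three fully connected matrices")
```

Most of the weights sit in the three fully connected matrices, while the thirteen convolutional matrices are comparatively small.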



## **Deployment of DNN: Problems**

- DNN deployment suffers from:
  - High energy consumption
  - High processing latency
  - High storage cost
- At the same time, the DNN needs to maintain high accuracy



Bianco, Simone, et al. "Benchmark analysis of representative deep neural network architectures." *IEEE Access* 6 (2018): 64270-64277.

#### The Era of Large Models (LMs)





#### **Cost of Large Models**



• $1.4 \times 10^{12}$ FLOPs to execute GPT-2.



#### **The Cost of Large Models**



- Training GPT-4 required 25,000 A100 GPUs running for several weeks.
- Cost: renting a single high-end GPU from a cloud service such as AWS can cost \$3-\$5 per hour, and training GPT-4 is estimated to have cost \$63-100 million in cloud computing resources.



| Model Size     | FP16   | FP8    | INT4   |
|----------------|--------|--------|--------|
| 8B             | 16 GB  | 8 GB   | 4 GB   |
| 70B            | 140 GB | 70 GB  | 35 GB  |
| 405B LLaMA 3.1 | 810 GB | 405 GB | 203 GB |
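The entries in the table above follow from multiplying the parameter count by the bytes per weight. A minimal sketch, assuming 1 GB = 10^9 bytes and counting only the weights (no activations, KV cache, or optimizer state); `weight_memory_gb` is an illustrative name:

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

for name, params in [("8B", 8e9), ("70B", 70e9), ("405B", 405e9)]:
    sizes = {fmt: weight_memory_gb(params, bits)
             for fmt, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]}
    print(name, sizes)
```

For the 405B model at INT4 this gives 202.5 GB, which the table rounds to 203 GB.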

Designing more aggressively optimized and efficient AI models is of paramount importance.









# How can we reduce compute while maintaining good DNN accuracy?



#### **Research Publications on DNN Pruning and Quantization (2015-2023)**

- Efficient AI has become one of the most popular areas in the AI community.
- The recent emergence of large models has further heightened the need for efficient AI.



Figure: job postings for efficient machine learning and AI accelerator roles (for example at Meta, IBM, Bose, AMD, and Tata Consultancy Services).


#### **AI Tech Startups/Unicorns**







#### **Efficient AI: Full-stack Workflow**










### **Algorithmic Optimization**







### **Efficient DNN Algorithm: Pruning**





### **Efficient DNN Algorithm: Quantization**





#### **Knowledge Distillation**





#### **Efficient AI: Full-stack Workflow**





## **Graph Level Optimization**

#### **CAMEL** Training





(b) Dependency graph over the nodes $y_{1,l}$, $y_{2,l}$, $y_{3,l}$, $x_{1,l}$, $x_{2,l}$

#### (c) Pseudo instruction

- Compute $y_{3,l} = G(x_{2,l})$
- Overwrite $x_{2,l}$ with $y_{3,l}$
- Compute $y_{2,l} = \mathrm{Pool}(y_{3,l}) + F_1(x_{1,l})$
- Save $y_{2,l}$
- Compute $y_{1,l} = x_{1,l} + F_2(y_{2,l})$
- Overwrite $x_{1,l}$ with $y_{1,l}$

#### (d) Computation pattern of forward pass

| Layer | Time slot    | Operation           |
|-------|--------------|---------------------|
| $l$   | $T_{G,l}$    | Compute $y_{3,l}$   |
| $l$   | $T_{F2,l}$   | Compute $y_{2,l}$   |
| $l$   | $T_{F1,l}$   | Compute $y_{1,l}$   |
| $l+1$ | $T_{G,l+1}$  | Compute $y_{3,l+1}$ |
| $l+1$ | $T_{F2,l+1}$ | Compute $y_{2,l+1}$ |
| $l+1$ | $T_{F1,l+1}$ | Compute $y_{1,l+1}$ |

Time axis: $0, t_1, t_2, t_3, t_4, t_5$.



#### **Kernel Level Optimization**



#### **Efficient AI: Full-stack Workflow**





## **Hardware Support for DNN**

- GPUs deliver higher throughput than CPUs for both neural network training and inference.
  - A GPU uses the highly parallel architecture of its compute units to handle computationally intensive operations.
- However, GPUs:
  - Are still general purpose, although more specialized than CPUs.
  - Are still not fast or power-efficient enough.
  - Do not support advanced efficient DNN algorithms.





#### **NVIDIA**

| Specification    | Value                    |
|------------------|--------------------------|
| Chip size        | 814 mm<sup>2</sup>       |
| On-chip memory   | ~50 MB                   |
| Total memory     | ~96 GB HBM               |
| Cores            | 16,896 FP32 + 528 Tensor |
| Precision        | FP16/FP8/INT8            |
| Memory bandwidth | 0.003 PB/s (3 TB/s)      |





https://www.techpowerup.com/gpu-specs/h100-sxm5-96-gb.c3974

#### **NVIDIA**

| Specification    | Value             |
|------------------|-------------------|
| Chip size        | -                 |
| On-chip memory   | -                 |
| Total memory     | 192 GB HBM        |
| Cores            | -                 |
| Precision        | FP16/FP8/FP4/INT8 |
| Memory bandwidth | 8 TB/s            |



**NVIDIA Blackwell** 



https://wccftech.com/nvidia-blackwell-gpu-architecture-official-208-billion-transistors-5x-ai-performance-192-gb-hbm3e-memory/

# **Hardware Support for DNN**

- ASIC-based implementations have recently been explored to accelerate DNN inference.
  - Google's TPU, Apple's Neural Engine, the Cerebras AI chip, ...
- FPGA-based accelerators for DNN inference have also been developed recently. FPGAs offer:
  - Good programmability and flexibility
  - Short development cycles
  - A platform for benchmarking a design before implementing it as an ASIC



Tensor Processing Unit (Google)



Alveo Accelerator Card (Xilinx)


# Google

| Specification    | Value              |
|------------------|--------------------|
| Chip size        | 790 mm<sup>2</sup> |
| On-chip memory   | 112 MB             |
| Total memory     | 32 GB HBM          |
| Precision        | BF16/INT8          |
| Memory bandwidth | 1640 GB/s          |



TPU v6 (Trillium)



# **Cerebras AI Chips**

| Specification    | Value                 |
|------------------|-----------------------|
| Chip size        | 46,225 mm<sup>2</sup> |
| On-chip memory   | 44 GB                 |
| Cores            | 900,000               |
| Memory bandwidth | 21 PB/s               |



**Cerebras CS-3** 



https://cerebras.ai/applications/high-performance-computing/

# **Systolic Array**

- Kung and Leiserson, "Systolic Arrays for VLSI," 1978, and Kung, "Why Systolic Architectures?" 1982
- A 2D grid of multiplier-accumulators (MACs) for matrix multiplication
- Used by Google's TPU for deep learning (2017), among others
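To make the 2D grid of MACs concrete, here is a minimal cycle-by-cycle simulation of a systolic array computing a matrix product. It assumes an output-stationary dataflow (each processing element accumulates one output entry while operands flow right and down), chosen only because it is short to write; real designs such as the TPU may use a different dataflow. The function name `systolic_matmul` is illustrative.

```python
import numpy as np

def systolic_matmul(A, B):
    """Simulate an M x N grid of MAC units computing A @ B, one cycle at a time."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    a_reg = np.zeros((M, N))  # operand from A currently held by each PE
    b_reg = np.zeros((M, N))  # operand from B currently held by each PE
    acc = np.zeros((M, N))    # running sum held by each PE

    for t in range(K + M + N - 2):
        # Operands move one PE to the right (A) and one PE down (B).
        a_reg[:, 1:] = a_reg[:, :-1].copy()
        b_reg[1:, :] = b_reg[:-1, :].copy()
        # Feed skewed inputs at the left and top edges; row i and column j
        # are delayed by i and j cycles so matching operands meet in PE (i, j).
        for i in range(M):
            k = t - i
            a_reg[i, 0] = A[i, k] if 0 <= k < K else 0.0
        for j in range(N):
            k = t - j
            b_reg[0, j] = B[k, j] if 0 <= k < K else 0.0
        # Every PE performs one multiply-accumulate per cycle.
        acc += a_reg * b_reg
    return acc

A = np.arange(6).reshape(2, 3)
B = np.arange(12).reshape(3, 4)
print(np.array_equal(systolic_matmul(A, B), A @ B))
```

Each processing element only talks to its neighbors, which is what makes the structure attractive for VLSI: the full product finishes in K + M + N - 2 cycles with no global data movement.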





TPU (Google)

### **Bit-serial Low-precision Multiplier**






# Why We Need Codesign?








The hardware architecture needs to be considered when designing efficient DNNs.



# **Column Combining**

Figure: column combining packs a sparse weight matrix into a packed format for the sparse systolic array, giving an 8x reduction in size.



Kung, H. T., Bradley McDanel, and Sai Qian Zhang. "Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization." *Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems*. 2019.

# **Column Combining**



- Column combining can greatly increase the utilization of the systolic array
- More recently, the NVIDIA A100 GPU adopted a similar idea to support balanced structured sparsity
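The packing idea behind column combining can be sketched as follows. This is a much simplified illustration, not the procedure from the cited paper: it assumes columns are grouped consecutively with a fixed `group_size`, and that when several weights in a group share a row only the largest-magnitude one is kept. The cited work chooses groups and prunes conflicts under joint optimization with training, which is not modeled here. The names `combine_columns`, `expand`, and `index` are hypothetical.

```python
import numpy as np

def combine_columns(W, group_size):
    """Pack each group of `group_size` sparse columns into one dense column.

    Returns the packed values and, for every packed entry, the original
    column it came from (needed to select the matching input at run time).
    """
    rows, cols = W.shape
    assert cols % group_size == 0, "sketch assumes cols is a multiple of group_size"
    n_groups = cols // group_size
    packed = np.zeros((rows, n_groups), dtype=W.dtype)
    index = np.zeros((rows, n_groups), dtype=np.int64)
    for g in range(n_groups):
        block = W[:, g * group_size:(g + 1) * group_size]
        # Per row, keep the largest-magnitude weight; other weights in the
        # same row of this group (conflicts) are pruned.
        winner = np.argmax(np.abs(block), axis=1)
        packed[:, g] = block[np.arange(rows), winner]
        index[:, g] = g * group_size + winner
    return packed, index

def expand(packed, index, n_cols):
    """Rebuild the (pruned) sparse matrix from its packed form."""
    rows = packed.shape[0]
    W = np.zeros((rows, n_cols), dtype=packed.dtype)
    W[np.arange(rows)[:, None], index] = packed
    return W

W = np.array([[0, 3, 0, 0],
              [1, 0, 0, -2],
              [4, -5, 0, 0]])   # last row has a conflict in the first group
packed, index = combine_columns(W, group_size=2)
print(packed)
print(expand(packed, index, W.shape[1]))
```

The packed matrix has `group_size` times fewer columns, so a systolic array of the same size holds more of the model and spends fewer cycles multiplying by zeros.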

NYU SAI LAB


#### **FPGA Accelerator**



Kung, H. T., Bradley McDanel, and Sai Qian Zhang. "Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization." *Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems*. 2019.

### **Term Quantization**



- Low-precision quantization leads to significant quantization error.
- Both weight and input activation values are highly skewed in their distributions.

Kung, H. T., Bradley McDanel, and Sai Qian Zhang. "Term revealing: Furthering quantization at run time on quantized dnns." *arXiv preprint arXiv:2007.06389* (2020).

### **Term Quantization**



- We can control the term-level computations by setting a **group term budget**.
- For a group of values, we rank and remove the small terms based on this budget.
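A minimal sketch of the group term budget, assuming non-negative integer values and treating each set bit (power of two) in a value's binary representation as one term. Ties between equal-sized terms are broken by position in the group, which is an arbitrary choice made for this illustration; `term_quantize` is a hypothetical name.

```python
def term_quantize(group, budget):
    """Keep only the `budget` largest power-of-two terms across a group."""
    # Collect every term as (exponent, index of the value it belongs to).
    terms = []
    for idx, value in enumerate(group):
        for exp in range(value.bit_length()):
            if (value >> exp) & 1:
                terms.append((exp, idx))
    # Rank terms by size and drop the smallest ones beyond the budget.
    terms.sort(key=lambda term: term[0], reverse=True)
    kept = terms[:budget]
    # Rebuild each value from the terms that survived.
    result = [0] * len(group)
    for exp, idx in kept:
        result[idx] += 1 << exp
    return result

group = [5, 3, 8, 1]             # terms: 4+1, 2+1, 8, 1
print(term_quantize(group, 3))   # keeps the terms 8, 4 and 2
print(term_quantize(group, 6))   # budget covers all six terms
```

Because the budget is shared by the whole group, a value with large terms can keep more of them while small values may be dropped entirely, and the number of term-level operations per group is fixed.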


### **Term Quantization: Accelerator Design**



- We propose the term MAC (tMAC) for the efficient implementation of term quantization (TQ).
- A tMAC processes all term-pair multiplications across a group of weight and data values.
- Each term is represented by its exponent (2-3 bits).
- Term accumulation can be implemented using half adders.
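The arithmetic behind term-pair multiplication can be sketched in software: the product of two power-of-two terms is itself a power of two whose exponent is the sum of the two exponents, so a dot product reduces to exponent additions and accumulation. This is only a functional model under the assumption of non-negative integer operands; it does not model the half-adder accumulation or any other hardware detail, and `to_exponents` and `term_dot` are hypothetical names.

```python
def to_exponents(value):
    """Exponents of the power-of-two terms that make up a non-negative integer."""
    return [exp for exp in range(value.bit_length()) if (value >> exp) & 1]

def term_dot(weights, data):
    """Dot product computed purely from term-pair multiplications."""
    acc = 0
    for w, x in zip(weights, data):
        for ew in to_exponents(w):
            for ex in to_exponents(x):
                # 2**ew * 2**ex == 2**(ew + ex): an exponent addition
                acc += 1 << (ew + ex)
    return acc

weights = [5, 3, 8]
data = [2, 7, 1]
print(term_dot(weights, data), sum(w * x for w, x in zip(weights, data)))
```

The number of term pairs, and therefore the work per group, is bounded once the number of terms in the weights and data is bounded by a term budget.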


#### **Efficient DNN Training: Forward Pass**



X: input maps; W: weight filters; Y: output maps

• The convolutional operations during the forward propagation can be converted into matrix multiplications
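One common way to perform this conversion is im2col, sketched below with NumPy. The sketch assumes stride 1, no padding, a single input sample, and square kernels; the function names are illustrative. Each filter is flattened into a row of a weight matrix and each input patch into a column, so the convolution becomes a single matrix product.

```python
import numpy as np

def im2col(x, k):
    """Unfold a (C, H, W) input into a (C*k*k, out_h*out_w) matrix of patches."""
    C, H, W = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((C * k * k, out_h * out_w), dtype=x.dtype)
    col = 0
    for i in range(out_h):
        for j in range(out_w):
            cols[:, col] = x[:, i:i + k, j:j + k].reshape(-1)
            col += 1
    return cols

def conv_as_matmul(x, w):
    """Convolve x (C, H, W) with filters w (M, C, k, k) via one matmul."""
    M, C, k, _ = w.shape
    out_h, out_w = x.shape[1] - k + 1, x.shape[2] - k + 1
    w_mat = w.reshape(M, C * k * k)   # one flattened filter per row
    y = w_mat @ im2col(x, k)          # (M, out_h*out_w)
    return y.reshape(M, out_h, out_w)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 6, 6))     # 3 input channels
w = rng.standard_normal((4, 3, 3, 3))  # 4 filters of size 3x3
print(conv_as_matmul(x, w).shape)      # (4, 4, 4)
```

With 3x3 kernels over 3 input channels, each flattened filter has 3 * 3 * 3 = 27 entries, which is where a weight matrix dimension such as "64, 27" for a first convolutional layer with 64 filters comes from.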



#### **Efficient DNN Training: Backward Pass**



X: input maps; W: weight filters; Y: output maps; $\nabla X$: input gradient; $\nabla W$: weight gradient; $\nabla Y$: output gradient

DNN backward propagation involves two matrix multiplications
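The two multiplications can be written out for a layer in matrix form. The sketch below assumes the layout $Y = XW$ with X of shape (batch, in) and W of shape (in, out); other layouts only change where the transposes go. The function names are illustrative.

```python
import numpy as np

def layer_forward(X, W):
    """Forward pass in matrix form: Y = X W."""
    return X @ W

def layer_backward(X, W, grad_Y):
    """Backward pass: two matrix multiplications driven by the output gradient."""
    grad_X = grad_Y @ W.T   # input gradient, passed on to the previous layer
    grad_W = X.T @ grad_Y   # weight gradient, used to update W
    return grad_X, grad_W

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 2))
grad_Y = rng.standard_normal((4, 2))
grad_X, grad_W = layer_backward(X, W, grad_Y)
print(grad_X.shape, grad_W.shape)   # same shapes as X and W
```

Together with the forward product, this means a training step costs roughly three matrix multiplications per layer, compared with one for inference.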



### **Efficient DNN Training: FAST Algorithm**



- We name this approach FAST (Fast First, Accurate Second Training)
- We linearly increase the training precision across both layer depth and training iterations
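To illustrate what a linear precision increase over both layer depth and training iterations could look like, here is a purely hypothetical schedule. The bit range (`min_bits`, `max_bits`), the equal weighting of depth and time, and the function name `precision_schedule` are all assumptions made for this sketch and are not taken from the FAST work itself.

```python
def precision_schedule(layer, num_layers, iteration, num_iterations,
                       min_bits=2, max_bits=4):
    """Hypothetical schedule: precision grows linearly with depth and time."""
    depth_frac = layer / max(num_layers - 1, 1)         # 0 at first layer, 1 at last
    time_frac = iteration / max(num_iterations - 1, 1)  # 0 at start, 1 at end
    frac = 0.5 * (depth_frac + time_frac)               # assumed equal weighting
    return min_bits + round(frac * (max_bits - min_bits))

# Early layers early in training use the lowest precision ("fast first"),
# late layers late in training use the highest ("accurate second").
for it in (0, 50, 99):
    print(it, [precision_schedule(l, 10, it, 100) for l in range(10)])
```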


#### **Efficient DNN Training: FAST Algorithm**



- We use Time-to-Accuracy (TTA) as the evaluation metric to compare different approaches
- Our FAST approach achieves the lowest TTA across all the numeric formats

#### **Reversible DNN for Efficient On-chip Learning**









Zhang, Sai Qian, et al. "CAMEL: Co-Designing AI Models and eDRAMs for Efficient On-Device Learning." 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2024.


#### **Efficient Deep Self-Supervised Learning**



"BIM: Block-Wise Self-Supervised Learning with Masked Image Modeling", Yixuan Luo, Mengye Ren, Sai Qian Zhang.

#### **Softmax Acceleration in Large Models**



"Hyft: A Reconfigurable Softmax Accelerator with Hybrid Numeric Format for both Training and Inference", Tianhua Xia, Sai Qian Zhang, ISLPED'24

#### **Normalization Acceleration in Large Models**





- We adopt the principles of algorithm and hardware co-design and introduce a holistic method for accelerating normalization.
- We leverage the strong correlation observed in normalization statistics across consecutive layers, which allows the normalization computation to be bypassed by estimating the statistics.



"HAAN: A Holistic Approach for Accelerating Normalization Operations in Large Language Models", Tianfan Peng, Jiajun Wu, Tianhua Xia, Sai Qian Zhang, in DATE 2025.