DSpace at EWHA: A Memory-Centric Architecture for Energy-Efficient Vision Transformer based on High Bandwidth Memory

Title: A Memory-Centric Architecture for Energy-Efficient Vision Transformer based on High Bandwidth Memory

Authors: 함은경

Issue Date: 2024

Department/Major: Graduate School, Department of Electronic and Electrical Engineering

Keywords: Vision Transformer, Processing-Near-Memory, Energy-Efficient

Publisher: The Graduate School of Ewha Womans University

Degree: Master

Advisors: 김지훈

Abstract: Artificial Intelligence (AI) has advanced rapidly over the past decade and is widely regarded as a key driver of the Fourth Industrial Revolution. The Transformer, the architecture underlying large language models, has reshaped natural language processing, and since 2020 research on applying it to computer vision tasks in the form of the Vision Transformer (ViT) has grown steadily. Vision Transformers achieve state-of-the-art performance in image classification, but their large parameter counts cause substantial memory overhead and a heavy computational load. This thesis addresses both problems. We introduce Block-Balanced Pruning (BBP) and Compressed Block Row (CSR) techniques, both suited to hardware implementation, to reduce the number of weights and the amount of computation in Vision Transformer models. We also propose a Processing-Near-Memory (PNM) memory-centric architecture in which processing engines designed for the resulting sparse-matrix operations run in parallel at the pseudo-channel level of High-Bandwidth Memory (HBM). A column-major data mapping scheme, which accounts for the computation flow and for DRAM read/write characteristics and constraints, improves the row hit ratio by 1.57 times. Compressing the ViT-B model with BBP and CSR reduces memory usage by 82.3% at a target sparsity of 83%. In simulations with Ramulator, a cycle-accurate DRAM simulator, the proposed PNM architecture with pruning reduces the DRAM operating cycles to 1.72% of those required by the uncompressed dense Vision Transformer executed on a CPU. An implementation on Xilinx's Alveo U280 FPGA board achieves an energy efficiency of 5.21 FPS/W and 4.26 times higher FPS than the CPU.
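
The abstract names block-balanced pruning and a compressed block format without defining them, so the following is only an illustrative sketch under stated assumptions, not the thesis's implementation. It assumes that each row of a weight matrix is split into fixed-width blocks, that every block keeps the same number of largest-magnitude weights, and that the kept weights are stored as values plus in-block offsets. The function names `prune_and_pack` and `sparse_matvec` and the parameters `block` and `keep` are hypothetical. Because every block holds the same number of nonzeros, no per-row pointer array is needed, which is one reason a balanced layout may suit a fixed-function processing engine.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def prune_and_pack(weights: Matrix, block: int, keep: int) -> Tuple[List[float], List[int]]:
    """Keep the `keep` largest-magnitude weights in every `block`-wide row
    segment and store them as (value, in-block offset) pairs.
    Assumes the row length is a multiple of `block` (illustrative only)."""
    values: List[float] = []
    offsets: List[int] = []
    for row in weights:
        for start in range(0, len(row), block):
            segment = range(start, start + block)
            # Rank positions in this block by magnitude, largest first.
            top = sorted(segment, key=lambda j: abs(row[j]), reverse=True)[:keep]
            # Emit survivors in column order so offsets increase inside a block.
            for j in sorted(top):
                values.append(row[j])
                offsets.append(j - start)
    return values, offsets


def sparse_matvec(values: List[float], offsets: List[int], x: List[float],
                  rows: int, cols: int, block: int, keep: int) -> List[float]:
    """Multiply the packed matrix by vector `x`. The fixed `keep` per block
    means the loop bounds are known in advance for every row."""
    y = [0.0] * rows
    k = 0  # running index into the packed arrays
    for i in range(rows):
        for start in range(0, cols, block):
            for _ in range(keep):
                y[i] += values[k] * x[start + offsets[k]]
                k += 1
    return y


# Toy 2x8 matrix, blocks of 4, one survivor per block (75% sparsity).
weights = [
    [0.1, -0.9, 0.3, 0.05, 0.7, -0.2, 0.0, 0.4],
    [0.6, 0.2, -0.1, 0.8, -0.3, 0.5, 0.9, 0.1],
]
values, offsets = prune_and_pack(weights, block=4, keep=1)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = sparse_matvec(values, offsets, x, rows=2, cols=8, block=4, keep=1)
print(values, offsets, y)
```

In this toy case the packed form stores 4 values and 4 small offsets instead of 16 dense weights. The actual saving in any real design depends on the index width and block size chosen, which this record does not specify.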