DSpace at EWHA: Learning Spatial cues via Tunable Adapters for Monocular 3D Object Detection

Title: Learning Spatial cues via Tunable Adapters for Monocular 3D Object Detection

Authors: 김성희

Issue Date: 2024

Department/Major: Graduate School, Division of Artificial Intelligence and Software

Keywords: 3D object detection, adapter tuning

Publisher: The Graduate School, Ewha Womans University

Degree: Master

Advisors: 민동보

Abstract: Monocular 3D object detection is a crucial yet challenging task because a single image lacks the accurate spatial cues needed to predict 3D localization. Recently, depth-guided transformer architectures have achieved large improvements by introducing spatial information, benefiting from an attention mechanism that models the relationship between semantic and spatial features. However, depth estimated from a single image inevitably contains errors and is not precise enough for reliable 3D localization. To address this problem, we propose a simple yet effective training scheme that introduces spatial information through parameter-efficient tuning, an approach that has recently been widely adopted. We leverage the better spatial information acquired by a cross-modal (e.g., LiDAR and RGB image) pre-trained model by using its weights for the transformer encoders. We then freeze these modules and fine-tune only the remaining parts of the model with a tunable context-aware adapter. This enables the learning of both semantic and spatial-aware representations without the burden of additional parameters. To the best of our knowledge, this is the first work to introduce spatial cues via adapter tuning, demonstrating the potential for effective use of pre-trained weights for 3D perception from a single image. Experimental results on the KITTI 3D benchmark demonstrate the effectiveness of our approach.

Abstract (translated from Korean): Monocular 3D object detection uses a single image to detect surrounding 3D objects and to predict their positions and shapes. Because a single-image 3D detector cannot estimate accurate depth, it is important to predict and exploit effective depth-related information. Some existing methods feed depth cues estimated by a single network into a transformer decoder structure. However, because these methods rely on depth cues generated from a single image, the inaccuracies in those cues limit improvements in detection performance. This thesis proposes a new training scheme that effectively leverages pre-trained depth cues through adapter tuning. Specifically, transformer weights pre-trained on LiDAR and image data are introduced and frozen, and the remaining parts of the model are trained with a context-aware adapter. This makes it possible to learn representations that account for both semantic and spatial information without using additional parameters.
This study is the first to introduce adapter tuning in order to exploit depth information in monocular 3D object detection, and it has the potential to contribute to future tuning-based research on monocular 3D perception. The effectiveness of the method is demonstrated on the KITTI 3D benchmark.
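
The freeze-and-adapt idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: it uses a generic residual bottleneck adapter rather than the context-aware adapter the abstract mentions, it is written without a deep learning framework so that it runs on the standard library alone, and every name in it (`Linear`, `AdaptedLayer`, `count_params`, the random stand-in for pre-trained weights) is a hypothetical chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Linear:
    """Dense layer y = W x + b, with a flag saying whether training may update it."""
    weight: list  # one list of input weights per output unit
    bias: list
    trainable: bool = True

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]

    def num_params(self):
        return sum(len(row) for row in self.weight) + len(self.bias)


def make_linear(n_in, n_out, rng, zero_init=False, trainable=True):
    """Build a layer with small random weights, or all zeros if zero_init is set."""
    if zero_init:
        weight = [[0.0] * n_in for _ in range(n_out)]
    else:
        weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return Linear(weight, [0.0] * n_out, trainable)


def relu(x):
    return [max(0.0, v) for v in x]


class AdaptedLayer:
    """A frozen layer followed by a small trainable residual adapter (assumed design)."""

    def __init__(self, dim, bottleneck, rng):
        # Assumption: random values stand in for weights that would be loaded
        # from a cross-modal (LiDAR + RGB) pre-trained encoder, then frozen.
        self.frozen = make_linear(dim, dim, rng, trainable=False)
        # Adapter: project down to a narrow bottleneck, then back up.
        self.down = make_linear(dim, bottleneck, rng)
        # Zero-initialised up-projection, so the adapter starts as an identity map.
        self.up = make_linear(bottleneck, dim, rng, zero_init=True)

    def layers(self):
        return [self.frozen, self.down, self.up]

    def __call__(self, x):
        h = self.frozen(x)                    # features from the frozen weights
        delta = self.up(relu(self.down(h)))   # learned correction from the adapter
        return [a + b for a, b in zip(h, delta)]


def count_params(layers, trainable_only=False):
    """Count parameters, optionally only those a training step could update."""
    return sum(layer.num_params() for layer in layers
               if layer.trainable or not trainable_only)


layer = AdaptedLayer(dim=8, bottleneck=2, rng=random.Random(0))
x = [1.0] * 8
print(layer(x) == layer.frozen(x))                         # True: identity at init
print(count_params(layer.layers()))                        # 114 parameters in total
print(count_params(layer.layers(), trainable_only=True))   # 42 of them trainable
```

In this toy setting only the adapter's 42 parameters would receive updates, while the 72 frozen parameters keep whatever the pre-trained model supplied. The zero-initialised up-projection is a common choice in adapter tuning because the adapted model starts out reproducing the frozen model's output exactly.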