DSpace at EWHA: Improving Video Retrieval Performance Using Question-Answering Pre-training and a Memory Bank

Title: Improving Video Retrieval Performance Using Question-Answering Pre-training and a Memory Bank (질의응답 사전 학습 및 메모리 뱅크를 이용한 비디오 리트리벌 성능 개선)

Other Titles: A study on Improving Video Retrieval Performance Using Video-QA Dataset

Authors: 김현지

Issue Date: 2023

Department/Major: Graduate School, Department of Electronic and Electrical Engineering

Publisher: Ewha Womans University Graduate School

Degree: Master

Advisors: 강제원

Abstract: This paper proposes a deep learning network for content-based video retrieval that is pre-trained on video question-answering (QA) data and trained with feature vectors stored in a memory bank. With the rapid growth in media consumption, the volume and variety of video content keep increasing. Conventional video search has relied mainly on indexes and annotations attached to videos by people. This approach has two drawbacks: search quality depends on how the videos were indexed, and the labor required for indexing grows with the size of the collection. Content-based video retrieval with a deep learning network avoids this human intervention by analyzing the video content itself and retrieving the video most similar to a query sentence. Existing studies compute video-text similarity using only the feature vectors in the current batch as negative samples, so performance varies with batch size. To address this, the paper makes two proposals. First, video-text pre-training is performed on QA data; during QA learning, the network learns through ‘who’ and ‘what’ questions which parts of the video's visual content to focus on. Second, the final feature vectors produced by the pre-trained transformer are stored in a memory bank and used as negative samples when video-text similarity is computed for retrieval. Whereas existing networks use only the samples in the batch, the proposed network can also use previously computed feature vectors, and so learns a more general text-video similarity from a larger set of negative samples. Experiments comparing the proposed network with an existing network, and comparing results with and without the memory bank, demonstrate the quantitative advantage of the proposed model.
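
The memory-bank idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss function, the bank's update policy, or any hyperparameters, so everything below is an assumption made for illustration: an InfoNCE-style text-to-video loss, a first-in-first-out bank of fixed capacity, a temperature of 0.07, and the hypothetical helper names `text_to_video_loss` and `update_bank`. It is not the thesis's implementation.

```python
import math
from collections import deque


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def text_to_video_loss(text_feats, video_feats, bank, temperature=0.07):
    """Mean InfoNCE-style loss (assumed form, not taken from the thesis).

    text_feats[i] and video_feats[i] are taken to be a matching pair.
    Negatives are the other videos in the batch plus every vector in `bank`.
    """
    texts = [normalize(t) for t in text_feats]
    videos = [normalize(v) for v in video_feats]
    candidates = videos + list(bank)  # in-batch videos first, then stored ones
    total = 0.0
    for i, t in enumerate(texts):
        logits = [dot(t, c) / temperature for c in candidates]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]  # -log softmax of the positive pair
    return total / len(texts)


def update_bank(bank, video_feats):
    """Store this batch's video features; deque(maxlen=...) evicts the oldest."""
    bank.extend(normalize(v) for v in video_feats)


# Usage with made-up 3-dimensional features.
bank = deque(maxlen=3)  # hypothetical capacity, kept tiny for illustration

earlier_videos = [[1.0, 0.2, 0.0], [0.0, 1.0, 0.3]]
update_bank(bank, earlier_videos)  # features left over from an earlier batch

texts = [[0.9, 0.0, 0.1], [0.1, 0.1, 1.0]]
videos = [[1.0, 0.1, 0.0], [0.0, 0.2, 1.0]]
loss_in_batch_only = text_to_video_loss(texts, videos, bank=[])
loss_with_bank = text_to_video_loss(texts, videos, bank)
update_bank(bank, videos)  # bank is full now, so the oldest vector is dropped
```

With the bank in use, each text is contrasted against more videos than the batch alone provides, so the number of negatives no longer depends only on the batch size. The sketch leaves out the encoders, gradient computation, and any handling of stored features that become outdated as the network's weights change.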