DSpace at EWHA: Efficient Deep Learning-based Video Processing Techniques via Exploiting Temporal Correlation

Browse

My Repository

DSpace at EWHA일반대학원 전자전기공학과 Theses_Ph.D

View : 600 Download: 0

Efficient Deep Learning-based Video Processing Techniques via Exploiting Temporal Correlation

Title: Efficient Deep Learning-based Video Processing Techniques via Exploiting Temporal Correlation

Authors: 김나영

Issue Date: 2022

Department/Major: 대학원 전자전기공학과

Publisher: 이화여자대학교 대학원

Degree: Doctor

Advisors: 강제원

Abstract: 최근 유튜브, 개인 미디어, 인터넷 방송과 같은 비디오 호스팅 서비스의 규모가 급속도로 커지고 있다. 백만명이 넘는 사용자를 보유한 유튜브의 경우 매 본 단위로 오백 시간에 다르는 비디오가 업로드(upload)되고 있다. 이에 따라서 비디오를 활용한 연구가 각광받게 되었다. 더욱이 딥러닝(deep learning)이 여러 컴퓨터 비전 분야에서 상당히 뛰어난 성능을 보이고 있기 때문에, 대부분의 비디오 연구는 딥러닝 기술을 활용하고 있다. 이미지 처리에서 비디오 처리로 연구를 확장하기 위해서는 비디오의 시간적 연관성을 고려하는 것이 매우 중요하다. 따라서 우리는 딥러닝을 기반으로 비디오의 시간적 연관성에 초점을 맞춘 비디오 처리 연구를 제안하였다. 특히 우리는 이전까지의 연구가 비디오의 시간적 연관성을 고려하는 것에 실패로 인하여 문제를 가지고 있던 세가지 비디오 처리 주제를 선택하였다. 첫번째 선택한 주제는 비디오 질의응답이다. 비디오 질의 응답 분야는 비디오와 언어로부터 특징을 정확하게 추출하는 것이 중요하다. 그러나 대부분의 기존 연구에서는 부족한 비디오 표현력으로 인한 성능악화를 겪어왔다. 본 논무에서 우리는 위의 단점을 극복하기 위해 비디오 질의 응답에 적합한 비디오 특징 추출 방법을 제안한다. 자세한 내용은 2장에서 설명한다. 두 번째 선택한 주제는 미래 프레임 예측이다. 성공적인 예측과 생성을 위해서는 비디오에서의 확률적인 움직임 패턴을 고려해야한다. 기존 연구들은 시간에 따라 급속도로 악화되는 결과를 보인다. 이를 극복하기 위하여 우리는 동적 움직임 예측과 생성을 하는 비디오 예측 네트워크를 제안하였다. 우리는 시간적 변동성을 다루기 위한 동적 움직임 커널과 미래 예측의 불확실함으로 인한 흐려지는 문제를 극복하는 맵을 개발하였다. 자세한 내용은 3장에서 설명한다. 마지막으로 선택한 주제는 비디오 코덱 내에서의 비디오 성능 향상이다. 최근 새로운 비디오 표준인 VVC가 압축 성능을 향상시키고 전송 비트를 줄였다. 그럼에도 불구하고 전송해야 할 비트를 줄이기위해서는 비디오 압축 프레임의 성능과 트레이드오프(trade-off)문제가 여전하다. 따라서 인루프 필터(in-loop filter) 내에서의 비디오 향상을 딥러닝으로 해결하고자 하는 연구가 증가하고 있다. 이전 시점에서 압축된 프레임이 존재함에도 불구하고, 기존 연구들은 현재 시점 프레임만 사용하여 비디오 프레임 질을 향상시켰다. 우리는 이미 압축된 다른 시점의 프레임들로 얻은 좋은 질의 정보를 현재 프레임 향상시키는데 사용하는 참조 기반 방법을 제안한다. 자세한 내용은 4장에서 설명한다.;With the rapid development of deep learning, video processing techniques have been actively conducted in recent years. Unlike image processing, video processing needs to pay attention to temporal correlation between adjacent video frames. In this paper, we concentrate on efficient deep learning-based video processing via exploiting temporal correlation of both raw and compressed video format. To prove our proposed network, this thesis contains the research on the following three topics: video multimodality, video frame prediction and generation, and video quality improvement within video codec. Video multimodality Video Question Answering (Video QA) aims to give an answer to the question through semantic reasoning between visual and linguistic information. Recently, handling large amounts of multi-modal video and language information of a video is considered to be important in the industry. However, the current video QA models use deep features, suffered from significant computational complexity and insufficient representation capability both in training and testing. Existing features are extracted using pretrained networks after all the frames are decoded, which is not always suitable for video QA tasks. In this paper, we develop a novel deep neural network to provide video QA features obtained from coded video bit-stream to reduce the complexity. The proposed network includes several dedicated deep modules to both the video QA and the video compression system. This work is the first attempt at the video QA task. The proposed network is predominantly model agnostic. It is integrated into the state-of-the-art networks for improved performance without any computationally expensive motion-related deep models. The experimental results demonstrate that the proposed network outperforms the previous studies with lower complexity. Video frame prediction and generation Future video prediction provides valuable information that helps a computer machine understand the surrounding environment and make critical decisions in real-time. However, long-term video prediction remains a challenging problem due to the complicated spatiotemporal dynamics in a video. In this paper, we propose a dynamic motion estimation and evolution (DMEE) network model to generate unseen future video features from the observed videos in the past. Our primary contribution is to utilize trained kernels in convolutional neural network (CNN) and long short-term memory (LSTM) architectures, adapted to each time step and sample position, to efficiently manage spatiotemporal dynamics. DMEE uses the motion estimation (ME) and motion update (MU) kernels to predict the future video using an end-to-end prediction-update process. In the prediction, the ME kernel estimates the temporal changes. In the update step, the MU kernel combines the estimates with the previously generated frames as reference frames using a weighted average. The kernels are not only used for a current frame, but also are evolved to generate successive frames to enable temporally specific filtering. We perform qualitative performance analysis and quantitative performance analysis based on the peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and video classification score developed for examining the visual quality of the generated video. The experiments demonstrated that our algorithm provides better qualitative and quantitative performance superior to the current state-of-the-art algorithms. Frame image quality improvement Currently, several studies have been proposed various convolutional neural network (CNN) to reduce compression artifacts within the video codec. We point out two major problems of previous works. First, conventional deep motion estimation methods using flows [130, 131, 132] often yield inaccurate motions due to quantization noise. Second, previous studies assumed that the same quality of video frames would be used for both training and testing. Nevertheless, during deployments, the different quality substantially degraded performance. To overcome these problems, we developed a video compression artifact removal network (VCARN) to utilize not only the spatial but also temporal correlation during in-loop filtering at various QP values. We verified the results of the proposed method as compared with state-of-the-arts in VVC. Experimental results demonstrate that the proposed method achieves significantly improved performance, by overcoming the challenges in previous work.