DSpace at EWHA: Deep Learning-based Video Coding Techniques via Enhanced Spatio-temporal Prediction

Title: Deep Learning-based Video Coding Techniques via Enhanced Spatio-temporal Prediction

Authors: 이정경

Issue Date: 2024

Department/Major: Graduate School, Department of Electronic and Electrical Engineering

Publisher: The Graduate School, Ewha Womans University

Degree: Doctor

Advisors: 강제원

Abstract: Deep learning-based techniques can model the complicated spatio-temporal dynamics within a video, and they are therefore actively applied to video coding to improve coding efficiency. In this dissertation, we propose efficient video compression algorithms that use deep learning to enhance spatio-temporal prediction. The dissertation covers three topics: inter-frame video coding using a deep learning-based technique, reinforcement learning (RL)-based video encoder optimization, and a compressed 3D neural representation for multi-view video.

Inter-frame video coding using deep learning-based technique: In this work, we focus on improving the coding efficiency of inter-prediction using a Convolutional Neural Network (CNN)-based video coding technique. A CNN-based video prediction network (VPN) is proposed to support enhanced motion prediction in video compression. It generates a virtual reference frame (VRF), which is synthesized from previously coded frames. The proposed VPN uses two sub-VPN architectures in cascade to predict the current frame at the same time instance. The VRF is expected to have higher temporal correlation than a conventional reference frame, and thus it is substituted for a conventional reference frame. The proposed technique is incorporated into the HEVC inter-coding framework to improve rate-distortion (R-D) performance. In particular, the VRF is managed in an HEVC reference picture list, so that each prediction unit (PU) can choose a better prediction signal through R-D optimization without any additional side information. Furthermore, we adaptively modify the HEVC inter-prediction mechanisms of the Advanced Motion Vector Prediction and Merge modes when the current PU uses the VRF as a reference frame. In this manner, the proposed technique can exploit the PU-wise multi-hypothesis prediction techniques in HEVC.
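The PU-level choice between a conventional reference frame and the VRF can be pictured as an ordinary Lagrangian R-D comparison over the entries of a reference picture list. The following is only a minimal sketch under stated assumptions: the cost form J = D + λR with sum-of-squared-differences distortion, the function names, the sample values, the bit counts, and λ are all hypothetical illustrations and are not taken from the dissertation or from any HEVC reference software.

```python
def rd_cost(block, prediction, rate_bits, lam):
    """Lagrangian cost J = D + lam * R, with D as the sum of squared differences."""
    distortion = sum((b - p) ** 2 for b, p in zip(block, prediction))
    return distortion + lam * rate_bits


def choose_reference(block, ref_list, lam):
    """Pick the reference-list entry with the lowest R-D cost.

    ref_list is a list of (prediction_samples, rate_bits) pairs. The returned
    index plays the role of a reference index into the list, so choosing the
    VRF needs no signalling beyond that index (assumption of this sketch).
    """
    costs = [rd_cost(block, pred, rate, lam) for pred, rate in ref_list]
    best_idx = min(range(len(costs)), key=costs.__getitem__)
    return best_idx, costs


# Toy data (hypothetical): four samples of the current PU and two candidates.
block = [100, 102, 104, 106]
ref_list = [
    ([98, 99, 101, 103], 12),   # index 0: conventional reference frame
    ([100, 101, 104, 105], 10), # index 1: virtual reference frame (VRF)
]
best_idx, costs = choose_reference(block, ref_list, lam=4.0)
print(best_idx, costs)
```

In this toy case the VRF entry wins because its prediction is closer to the block and its hypothetical rate is lower; with other data the conventional reference can win equally well, which is why the decision is made per PU.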
Experimental results show that the proposed technique provides substantial coding gains in both the Random Access (RA) and Low Delay B (LD) coding configurations compared with the current video coding standard.

Reinforcement learning (RL)-based video encoder optimization: Video coding standards use a prediction structure to arrange video frames and exploit temporal correlations. It is therefore crucial to resolve the complicated temporal dependencies among frames to improve coding efficiency, because the coding of a preceding frame affects the rate-distortion (R-D) performance of the subsequent frames. Previous algorithms have attempted to address this problem using handcrafted features or analytical models, even though natural videos display varied temporal characteristics. In this work, a reinforcement learning (RL)-based decision algorithm is proposed to build the optimal hierarchical prediction structure under a random-access configuration (RA-HPS) in Versatile Video Coding (VVC). Accordingly, we formulate an adaptive GOP selection algorithm in which a binary tree represents the policy. We generate the optimal binary tree, that is, the one that minimizes the sum of the R-D costs among all plausible binary trees. A new RL policy representation is defined, and the optimal policy is obtained by sequential updates. The tree grows with a hierarchical state-action and reward sequence at each node. For efficient learning, the proposed technique uses a deep Q-network architecture to capture the temporal correlation between frames, which helps learn the policy of the tree-based RL framework effectively. Experimental results demonstrate that the proposed technique achieves a significant Bjøntegaard-Delta (BD)-rate reduction compared with state-of-the-art GOP size-selection algorithms.

Compressed 3D neural representation for multi-view video: We recognize the pressing need to compress volumetric video in a streaming, continuous format.
However, existing works tend to overlook compression in the context of volumetric video. Furthermore, multi-view video and its applications often store content at limited temporal frame rates and mainly cater to short video sequences. This dissertation introduces a novel approach: a compressed neural representation of multi-view video designed to reconstruct continuous dynamic 3D scenes. Our approach leverages factorized planes and vectors of 3D scenes at specific time intervals. We propose a neural network that implicitly learns temporal representations, establishing correlations across different time points. In contrast to prior methods that perform temporal interpolation at fixed rates and predetermined lengths, we propose a learning strategy for continuous dynamic neural rendering that allows video frames to be interpolated at any desired frame rate. Additionally, we introduce a neural rendering coding scheme that supports a Group of Volume, comprising an I-volume as the key timestamp and B-volumes, to compress the dynamic volumes effectively. To further enhance compression, we use an existing codec, VVC, to compress the decomposed feature planes and vectors. Experimental results demonstrate that the proposed method achieves strong performance in both rendering quality and temporally interpolated quality compared with a state-of-the-art neural rendering method.

With the recent rise of digital content centered on YouTube and other personal media platforms, and with advances in computer and communication technology, the services and users of digital multimedia information are increasing rapidly. The volume of data has grown further with improvements in digital image quality, the wider use of video, and the emergence of 3D stereoscopic video. Accordingly, research on video compression and the development of standardized technologies have continued in order to store and transmit this vast amount of multimedia data effectively. In particular, deep learning and artificial intelligence have recently shown outstanding results in many fields, and there have been attempts to apply deep learning to video compression as well. In line with this trend, this dissertation proposes deep learning-based compression methods that are effective at capturing the complicated spatio-temporal correlations within a video. To improve the compression efficiency of various kinds of multimedia information, methods are proposed under three topics. The first topic is inter-frame video coding using deep learning.
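This work improves the compression efficiency of inter-frame prediction using a Convolutional Neural Network (CNN)-based video coding technique. First, a CNN-based video prediction network uses previously coded frames to generate a virtual reference frame that has higher temporal correlation than a conventional reference frame. We describe how the proposed video prediction network is used to replace a conventional reference frame in HEVC (High Efficiency Video Coding), thereby increasing compression efficiency. Details are given in Chapter 2.

The second topic is reinforcement learning (RL)-based video encoder optimization. A video encoder uses a prediction structure to arrange video frames and exploit temporal correlations. Consequently, the coding of a preceding frame affects the performance of the subsequent frames, and resolving the temporal dependencies among frames is important for improving coding efficiency. For this topic, an RL-based decision algorithm is proposed to build the optimal hierarchical prediction structure under the random-access configuration of VVC (Versatile Video Coding), together with an adaptive GOP selection algorithm that uses a binary tree. Details are given in Chapter 3.

The final topic is a method for compressing the 3D space of multi-view video. Existing research on view synthesis for scene reconstruction overlooks the importance of compressing a space that changes over time. In addition, existing methods are trained in a way that yields limited temporal frame rates, mainly on short video sequences. To address this, a compressed neural representation is proposed for reconstructing continuous dynamic 3D scenes from multi-view video. A neural network is proposed that can implicitly express temporal representations by leveraging the factorized planes and vectors of 3D scenes at specific time intervals. Furthermore, for flexible compression of the 3D space of multi-view video, a Group of Volume consisting of I-volumes and B-volumes is formed, and the factorized feature planes and vectors are compressed with VVC, the latest video codec. Details are given in Chapter 4.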
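The objective behind the second topic, picking the binary tree of GOP splits whose leaves have the smallest total R-D cost, can be illustrated with a small exhaustive search. This is a sketch under stated assumptions and not the dissertation's method: it replaces the RL policy and deep Q-network with plain memoized recursion, it assumes each tree node either codes its frames as one GOP or splits them into two equal halves, and `toy_gop_cost`, `SCENE_CUT`, and all numeric values are hypothetical stand-ins for a real encoder's R-D measurements.

```python
from functools import lru_cache

SCENE_CUT = 20  # hypothetical frame index where the content changes abruptly


def toy_gop_cost(start, length):
    """Hypothetical R-D cost of coding frames [start, start + length) as one GOP."""
    overhead = 40                 # fixed cost paid once per GOP
    per_frame = 2 * length        # cost that grows with the number of frames
    # Large penalty when the scene cut falls strictly inside the GOP.
    penalty = 200 if start < SCENE_CUT < start + length else 0
    return overhead + per_frame + penalty


def best_gop_tree(num_frames, min_gop, gop_cost):
    """Search all binary trees of halving splits and return the cheapest one.

    Returns (total_cost, gop_lengths), where gop_lengths lists the leaf GOP
    sizes in display order.
    """
    @lru_cache(maxsize=None)
    def solve(start, length):
        leaf = (gop_cost(start, length), (length,))
        half = length // 2
        # Stop splitting when a half would be too small or the length is odd.
        if half < min_gop or length % 2:
            return leaf
        left = solve(start, half)
        right = solve(start + half, half)
        split = (left[0] + right[0], left[1] + right[1])
        # On a tie, keep the unsplit node (listed first).
        return min(leaf, split, key=lambda node: node[0])

    cost, gops = solve(0, num_frames)
    return cost, list(gops)


cost, gops = best_gop_tree(num_frames=32, min_gop=4, gop_cost=toy_gop_cost)
print(cost, gops)
```

With these toy costs the search keeps a long GOP where the content is stable and splits down to short GOPs around the hypothetical scene cut. An exhaustive search like this needs the true cost of every candidate GOP, which in practice means encoding each one; a learned policy such as the one described in the abstract aims to reach a good tree without that cost.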