DSpace at EWHA: PARAFAC Tensor Completion for Recommender System based on Apache Spark

Title: 아파치 스파크에서의 PARAFAC 분해 기반 텐서 재구성을 이용한 추천 시스템

Other Titles: PARAFAC Tensor Completion for Recommender System based on Apache Spark

Authors: 임어진

Issue Date: 2019

Department/Major: Graduate School, Department of Computer Science and Engineering

Publisher: The Graduate School of Ewha Womans University

Degree: Master

Advisors: 용환승

Abstract: Existing recommender systems have relied on a two-dimensional matrix whose entries are the ratings that users (customers) give to items (goods). In practice, however, a customer deciding on a purchase considers not only the product itself but also other factors, such as seasonal trends or gift-giving events like anniversaries and birthdays. A recommender system built on a two-dimensional matrix cannot reflect these additional considerations, so recent research has focused on recommender systems that take three or more kinds of input, adding further factors to users and items. A multi-dimensional array with three or more such dimensions is called a tensor, and algorithms that decompose high-dimensional tensor data are used in fields such as data mining, computer vision, and linear algebra. The main problem with tensor data is sparsity: a large share of its values are missing.
To address this, a tensor decomposition algorithm factorizes the tensor into lower-dimensional arrays called factor matrices, and the tensor is then reconstructed from those factor matrices so that the originally empty cells are filled with predicted values. The whole process is called tensor completion. This paper proposes a user-based Top-K recommender system built on normalized PARAFAC tensor completion and implemented on Apache Spark, an in-memory big data system. Before decomposition, the original tensor is normalized along each dimension, using a normalization algorithm inspired by the dropout technique, to reduce overfitting. The normalized tensor is then decomposed and reconstructed, and the completed tensor, which has no empty entries, is used to produce a Top-K recommendation list for each user. Using a real-world dataset, the paper shows that the Spark-based implementation can process a large amount of data quickly, and a comparison against unnormalized data confirms that normalizing the tensor improves recommendation performance.
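
The reconstruction and ranking steps described in the abstract can be sketched in plain Python. This is a minimal illustration, not the Spark implementation from the thesis. It assumes a three-mode tensor (user, item, context), assumes the factor matrices have already been produced by a PARAFAC decomposition, and leaves out the normalization step, whose details the abstract does not give. The function names and the tiny factor matrices are hypothetical.

```python
from typing import AbstractSet, List, Sequence

Matrix = Sequence[Sequence[float]]


def reconstruct_entry(a: Matrix, b: Matrix, c: Matrix,
                      i: int, j: int, k: int) -> float:
    """PARAFAC estimate of cell (i, j, k): sum over r of a[i][r]*b[j][r]*c[k][r]."""
    return sum(x * y * z for x, y, z in zip(a[i], b[j], c[k]))


def top_k_items(a: Matrix, b: Matrix, c: Matrix,
                user: int, context: int, k: int,
                seen: AbstractSet[int] = frozenset()) -> List[int]:
    """Rank items for one user in one context by their reconstructed score.

    Items listed in `seen` (for example, already rated ones) are skipped.
    Ties are broken by the lower item index.
    """
    scored = [
        (reconstruct_entry(a, b, c, user, item, context), item)
        for item in range(len(b))
        if item not in seen
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item for _, item in scored[:k]]


# Hypothetical rank-2 factor matrices: 2 users, 3 items, 2 contexts.
user_factors = [[1.0, 0.0], [0.0, 1.0]]
item_factors = [[5.0, 1.0], [3.0, 4.0], [1.0, 2.0]]
context_factors = [[1.0, 1.0], [0.5, 2.0]]

if __name__ == "__main__":
    # Top-2 items for user 0 in context 0, then for user 1 in context 1.
    print(top_k_items(user_factors, item_factors, context_factors, 0, 0, 2))
    print(top_k_items(user_factors, item_factors, context_factors, 1, 1, 2))
```

Every cell of the completed tensor is computed the same way, whether or not it was observed in the original data, which is what allows previously empty cells to be scored and ranked. Scoring cells on demand, as here, avoids materializing the full dense tensor. Whether the thesis does the same on Spark is not stated in the abstract.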