DSpace at EWHA: A Study on the Performance Evaluation of Machine Learning for Predicting the Number of Audiences

Title: 영화 관객 수 예측을 위한 머신러닝 기법의 성능 평가 연구

Other Titles: A Study on the Performance Evaluation of Machine Learning for Predicting the Number of Audiences

Authors: 정찬미

Issue Date: 2020

Department/Major: Graduate School, Interdisciplinary Program in Big Data Analytics

Publisher: The Graduate School of Ewha Womans University

Degree: Master

Advisors: 민대기

Abstract: A defining feature of the film industry is that demand is highly uncertain even though producing a single film requires enormous amounts of money and labor. Accurate box office forecasting is therefore a strategic tool that lets film industry stakeholders make important decisions directly tied to revenue. Box office prediction research, which has developed in response to this need, falls broadly into two categories: the prediction algorithm and the composition of independent variables.
This study focuses on machine learning, which has mostly been used as a classification model in box office prediction, and considers both classification and regression models to predict and classify the cumulative audience count in the third week after release. The classification models, Random Forest Classifier and SVM (Support Vector Machine), were evaluated with a confusion matrix. The regression models, Random Forest Regressor and k-NN Regressor, were evaluated with MAPE and MASE. For the independent variables, post-release screening data (number of screenings, number of screens, and audience count) were included by period. Each film's actors, director, distributor, importer, and production company were not encoded as dummy variables but scored as the average audience count of their respective previous films. To compensate for the ambiguity of the actor variable, search volume, which has been used repeatedly in demand forecasting, was collected before and after the release date and included. Because the variables differ widely in scale, they were standardized so that the models could learn properly. Hyperparameters were then chosen for each model with 10-fold cross-validation, which was also used to check model stability. The results show that Random Forest outperformed k-NN as a regression model and SVM as a classification model. Given that the data in this study has a relatively large number of independent variables and few records, Random Forest, an ensemble method that combines multiple models, appears to have achieved higher accuracy than the models based on a single algorithm.
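The scoring of directors, actors, and companies by the average audience of their previous films can be sketched as follows. This is a minimal illustration and not the thesis's implementation: the function name `prior_average_scores`, the record fields, the toy figures, and the choice to return `None` for an entity with no earlier film are all assumptions.

```python
from collections import defaultdict


def prior_average_scores(films, role):
    """Score each film by the mean audience of earlier films with the same entity.

    films: list of dicts with 'title', 'release' (ISO date string), 'audience',
           and the field named by `role` (e.g. 'director'). All hypothetical.
    Returns a dict mapping title -> average prior audience, or None when the
    entity has no earlier film (an assumed convention).
    """
    totals = defaultdict(int)   # running audience sum per entity
    counts = defaultdict(int)   # number of earlier films per entity
    scores = {}
    # Process films in release order so only earlier works feed each score.
    for film in sorted(films, key=lambda f: f["release"]):
        entity = film[role]
        if counts[entity]:
            scores[film["title"]] = totals[entity] / counts[entity]
        else:
            scores[film["title"]] = None
        totals[entity] += film["audience"]
        counts[entity] += 1
    return scores


# Toy data, invented for illustration only.
films = [
    {"title": "A", "release": "2017-03-01", "director": "D1", "audience": 1_000_000},
    {"title": "B", "release": "2018-06-15", "director": "D1", "audience": 3_000_000},
    {"title": "C", "release": "2019-01-10", "director": "D1", "audience": 500_000},
    {"title": "D", "release": "2019-05-02", "director": "D2", "audience": 200_000},
]
scores = prior_average_scores(films, "director")
print(scores)
```

Compared with dummy variables, a score of this kind adds one numeric column per role rather than one column per person or company, which matters when records are few relative to the number of variables.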
This study is significant in that it considers both regression and classification techniques for machine-learning-based prediction models, an area where progress in box office forecasting has been relatively slow. Predicting with a regression model while bounding the error range of the forecast with a classification model is expected to offer film industry stakeholders a guideline for management decisions during distribution and screening. However, in the classification model, certain classes fall in a range where the distribution drops off sharply, so they were not learned sufficiently. If more data can be collected in the future so that each class is learned adequately, more accurate results should be obtainable.
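One way to see whether a sparsely populated class is being learned poorly is to read per-class recall off the confusion matrix. The sketch below is a hedged illustration only: the class labels `low`, `mid`, and `high` and the toy predictions are invented, and the thesis's actual class boundaries and results are not given in this record.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def per_class_recall(matrix):
    """Recall per class: correct predictions divided by actual class size."""
    recalls = []
    for i, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[i] / total if total else None)
    return recalls


# Hypothetical audience classes and toy labels, for illustration only.
labels = ["low", "mid", "high"]
actual = ["low", "low", "low", "low", "mid", "mid", "mid", "high", "high"]
predicted = ["low", "low", "low", "mid", "mid", "mid", "low", "mid", "high"]

matrix = confusion_matrix(actual, predicted, labels)
recall = per_class_recall(matrix)
for label, row, r in zip(labels, matrix, recall):
    print(label, row, r)
```

In this toy example the smallest class has the lowest recall, which is the kind of pattern that would point to too few training records for that class.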