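DSpace at EWHA: A Study on the Relationship Between Missing-Value Imputation for Regression Data and Prediction Performance, Based on Transfer Learning and LSTM Models (전이학습 및 LSTM 모델을 기반으로 한 회귀 데이터 결측 값 대체와 예측 성과의 관계에 관한 연구)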

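Title: A Study on the Relationship Between Missing-Value Imputation for Regression Data and Prediction Performance, Based on Transfer Learning and LSTM Models (전이학습 및 LSTM 모델을 기반으로 한 회귀 데이터 결측 값 대체와 예측 성과의 관계에 관한 연구)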

Other Titles: Does better missing data imputation improve prediction accuracy of time series data?

Authors: 황희선

Issue Date: 2023

Department/Major: Graduate School, Interdisciplinary Program in Big Data Analytics

Publisher: The Graduate School, Ewha Womans University

Degree: Master

Advisors: 민대기

Abstract: Analyzing a time series while ignoring its missing values invalidates the data and can lead to biased or misleading results in machine learning applications. For time series with randomly missing data, this thesis considers an imputation method that predicts the missing values from complete, similar time series. We test two hypotheses: (a) training on more similar time series yields better imputation of missing data than training on less similar ones; (b) better-imputed data yields better performance when forecasting future values of the series. To evaluate these hypotheses, we measure the Euclidean distance between the series with missing data and each candidate series, select training data according to that distance, and train LSTM models on the selected series. The LSTM models are used both for imputation and for prediction with the imputed data. For comparison, we also evaluate the imputation performance of the simple moving average (SMA), the Hull moving average (HMA), and XGBoost. In numerical experiments, the LSTM imputation model trained on the shortest-distance data achieved the best imputation performance. However, predictive performance did not differ significantly among data imputed with different training sets or models, or between imputed and original data. Interestingly, these results suggest that using similar time series patterns is effective for improving the consistency of time series data, but that this does not always guarantee better predictive accuracy.
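
The distance-based selection step and the moving-average baseline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The function names, the toy series, the choice to compute the distance only over observed positions of the target, and the trailing-window form of the SMA are all assumptions made for illustration; the abstract does not specify these details. The LSTM training itself is omitted.

```python
import math


def euclidean_distance(target, candidate):
    """Euclidean distance over positions where the target is observed.

    Assumption: missing target values (None) are skipped; the candidate
    series is complete, as in the abstract's setting.
    """
    if len(target) != len(candidate):
        raise ValueError("series must have equal length")
    squared = [(t - c) ** 2 for t, c in zip(target, candidate) if t is not None]
    return math.sqrt(sum(squared))


def rank_candidates(target, candidates):
    """Return candidate names ordered from most to least similar."""
    distances = {
        name: euclidean_distance(target, series)
        for name, series in candidates.items()
    }
    return sorted(distances, key=distances.get)


def sma_impute(series, window=3):
    """Fill each gap with the mean of up to `window` preceding values.

    Earlier imputed values are reused; a gap with no history stays None.
    """
    filled = list(series)
    for i, value in enumerate(filled):
        if value is None:
            history = [v for v in filled[max(0, i - window):i] if v is not None]
            if history:
                filled[i] = sum(history) / len(history)
    return filled


# Hypothetical toy data: one incomplete target and two complete candidates.
target = [1.0, 2.0, None, 4.0, 5.0]
candidates = {
    "near": [1.1, 2.1, 3.0, 4.1, 5.1],
    "far": [5.0, 4.0, 3.0, 2.0, 1.0],
}

ranking = rank_candidates(target, candidates)  # most similar series first
imputed = sma_impute(target, window=2)         # baseline imputation
```

In this sketch, the first entries of `ranking` would play the role of the shortest-distance training data for an imputation model, while `sma_impute` stands in for the simplest of the comparison baselines.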