DSpace at EWHA: Empirical Study on Statistical Matching Methods

Title: Empirical Study on Statistical Matching Methods

Authors: 전예리

Issue Date: 2020

Department/Major: Graduate School, Department of Statistics

Publisher: Ewha Womans University Graduate School

Degree: Master

Advisors: 이은경

Abstract: Statistical matching (data fusion) is a method of combining multiple data sets that does not require the records to belong to the same individuals. As the need for integrated data grows, it is widely used in practice to obtain joint information from various data sources. Previous studies of statistical matching have focused mainly on classical approaches, including regression imputation, hot deck imputation, and mixed methods; the application of machine learning techniques to statistical matching has only recently begun to develop. In this paper, we propose a novel statistical matching method that combines random forest and k-nearest neighbor. The performance of the proposed method is compared with that of widely used existing methods through an experimental study using the Boston housing data and panel credit card transaction data. Compared with the existing methods, the proposed method performed particularly well in preserving the individual values of the matched variable, and it also gave good results in preserving the correlation structure with the existing variables and the distribution of the unique variable after matching.