DSpace at EWHA: Investigation of Prediction Accuracy and Explainability of Machine Learning Methods for Survival Data

Title: Investigation of Prediction Accuracy and Explainability of Machine Learning Methods for Survival Data

Authors: 강채리

Issue Date: 2023

Department/Major: Graduate School, Department of Statistics

Keywords: Survival Analysis, Cox Proportional Hazards Model, Random Survival Forests, DeepSurv, Permutation Feature Importance

Publisher: The Graduate School, Ewha Womans University

Degree: Master

Advisors: 이동환

Abstract: Survival analysis, which analyzes time-to-event data, is used in fields such as medical care, finance, and manufacturing. The Cox Proportional Hazards Model (CPH) is the standard model in fields where explainability matters as much as predictive performance, because it is easy to explain how the model works. However, CPH rests on the restrictive assumption that the hazard ratio is constant over time, and it does not perform well on nonlinear data. To alleviate these problems, machine learning and deep learning-based survival analysis models have emerged, but these are so-called black-box models whose results are difficult to interpret. Therefore, this paper compares the predictive performance of CPH with that of two machine learning methods, Random Survival Forests (RSF) and DeepSurv. The variable selection performance of these methods is also investigated using a model-agnostic interpretation methodology. Comparative studies are conducted across simulation scenarios that vary the data form, data size, and censoring rate. A real data example illustrates how the results of the survival methods differ.

Abstract (Korean version, translated): Survival analysis, which analyzes the time until an event of interest occurs, is used in fields such as medicine, finance, and manufacturing. In these fields, the Cox Proportional Hazards Model (CPH) is widely used because model interpretation is as important as predictive performance. However, CPH presupposes the restrictive assumption that the hazard ratio is constant over time, and it does not work well on nonlinear data. Machine learning and deep learning-based survival analysis models have emerged to address these problems, but they are black boxes that are difficult to interpret and have therefore not been widely applied. This paper therefore uses a model-agnostic interpretation methodology to compare the predictive performance and variable selection performance of CPH, Random Survival Forests (RSF), and DeepSurv. The comparative study is carried out in eight simulation settings that vary the data form (linear or nonlinear), the data size, and the censoring rate. The aim is to identify, in each setting, a model that offers both predictive performance and interpretability. The results are expected to help with model selection according to the characteristics of the data.
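The model-agnostic interpretation approach named in the keywords, permutation feature importance, can be illustrated with a minimal sketch. This is not the thesis code: the simulated data, the coefficients, the helper names (`concordance_index`, `permutation_importance`, `simulate`, `linear_risk`), and the use of Harrell's concordance index as the performance measure are all assumptions made for illustration. A fixed linear risk score stands in for a fitted CPH, RSF, or DeepSurv model; any function that maps a covariate row to a risk score could be substituted.

```python
import math
import random


def concordance_index(times, events, risks):
    """Harrell's C-index: among comparable pairs, the share in which the
    subject with the higher predicted risk fails earlier (ties count 0.5)."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # the earlier time in a pair must be an observed event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


def permutation_importance(risk_fn, X, times, events, n_repeats=5, seed=0):
    """Mean drop in C-index after shuffling each column in turn."""
    rng = random.Random(seed)
    baseline = concordance_index(times, events, [risk_fn(r) for r in X])
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:]
                      for row, v in zip(X, shuffled)]
            score = concordance_index(
                times, events, [risk_fn(r) for r in X_perm])
            drops.append(baseline - score)
        importances.append(sum(drops) / n_repeats)
    return baseline, importances


def simulate(n=200, seed=1):
    """Hypothetical data: exponential event times whose hazard depends on
    x0 and x1 only (x2 is pure noise), with independent random censoring."""
    rng = random.Random(seed)
    X, times, events = [], [], []
    for _ in range(n):
        row = [rng.gauss(0, 1) for _ in range(3)]
        hazard = math.exp(1.0 * row[0] + 0.5 * row[1])
        t_event = rng.expovariate(hazard)
        t_censor = rng.expovariate(0.5)
        X.append(row)
        times.append(min(t_event, t_censor))
        events.append(t_event <= t_censor)
    return X, times, events


def linear_risk(row):
    # Stand-in for a fitted model's risk prediction (assumed, not estimated).
    return 1.0 * row[0] + 0.5 * row[1]


X, times, events = simulate()
baseline, importances = permutation_importance(linear_risk, X, times, events)
print("baseline C-index:", round(baseline, 3))
print("importance per feature:", [round(v, 3) for v in importances])
```

Because the stand-in risk score ignores the third covariate, shuffling it leaves the C-index unchanged and its importance is exactly zero, which is the behavior a variable selection comparison would look for in a noise variable.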